Some high-level musings on performance in a CMS
I’ve been working for some time on performance optimization of a reasonably large EPiServer site, and it’s been quite a learning experience. This is not a post about specific fixes to specific issues; it’s about high-level design principles.
The lessons learned are applicable in many scenarios, but I’d like to share some of my high-level insights in the context of a CMS in general, and EPiServer in particular. While it’s not reasonable to expect EPiServer to adopt these principles anytime soon, you may find some guidance here on which features to use when you expect the number of pages in your site to grow heavily.
In general, I have found that a common denominator for problems is site enumeration. Site enumeration should simply never be allowed - this precludes the current FindPagesWithCriteria architecture, for example, and most of the other ways to produce deep listings. It also rules out the current mirroring, subscription and archival algorithms, where the site is enumerated to find updates. All such operations have to be done by other mechanisms, perhaps using variations on the subscriber pattern to inform listeners of updates, or by placing these operations on a queue that can be processed in time linear in the amount of actual work to do - not in the total number of pages.
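As a rough sketch of the kind of mechanism I mean - this is not how EPiServer works today, and the ChangeQueue class and its method names are made up for the example - something along these lines could be built on the standard DataFactory PublishedPage event, so that a scheduled job only does work proportional to the number of changes:

using System.Collections.Generic;
using EPiServer;
using EPiServer.Core;

public static class ChangeQueue
{
    private static readonly object _sync = new object();
    private static readonly Queue<PageReference> _pending = new Queue<PageReference>();

    // Call once at application start-up, e.g. from Global.asax.
    public static void Attach()
    {
        DataFactory.Instance.PublishedPage += (sender, e) =>
        {
            // Record the change as it happens instead of discovering it later
            // by enumerating the site.
            lock (_sync) { _pending.Enqueue(e.Page.PageLink); }
        };
    }

    // Called from a scheduled job: the cost is proportional to the number of
    // pages changed since the last run, never to the total number of pages.
    public static void ProcessQueuedWork()
    {
        while (true)
        {
            PageReference next;
            lock (_sync)
            {
                if (_pending.Count == 0) { return; }
                next = _pending.Dequeue();
            }

            PageData changed = DataFactory.Instance.GetPage(next);
            // ... mirror, archive or notify subscribers about this single page ...
        }
    }
}

The in-memory queue is of course naive (it doesn’t survive an application restart), but the shape is the point: changes are pushed to the job as they happen, so nothing ever has to walk the whole tree to find them.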
A different way to express the above: any operation that takes a measurable amount of time per page must have sub-linear time complexity in relation to the total number of pages. A truly scalable application will not implement any feature with linear or worse time complexity in the total number of pages, if the time per page is measurable. An operation can still be linear in the size of its own result set.
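To make the distinction concrete, here is a minimal sketch against the standard DataFactory API - the news container reference and the "News" page name are just assumptions for the example:

using EPiServer;
using EPiServer.Core;
using EPiServer.Filters;

public static class ListingExamples
{
    // Cheap: cost proportional to the result set - the children of one known container.
    public static PageDataCollection ChildrenOf(PageReference newsContainer)
    {
        return DataFactory.Instance.GetChildren(newsContainer);
    }

    // Expensive: a criteria search rooted at the start page may have to consider
    // every page in the site, so the cost grows with the total number of pages
    // even when the result set is tiny.
    public static PageDataCollection AllPagesNamedNews()
    {
        PropertyCriteriaCollection criteria = new PropertyCriteriaCollection();
        criteria.Add(new PropertyCriteria
        {
            Name = "PageName",
            Type = PropertyDataType.String,
            Value = "News",
            Condition = CompareCondition.Equal,
            Required = true
        });
        return DataFactory.Instance.FindPagesWithCriteria(PageReference.StartPage, criteria);
    }
}

The first call costs roughly in proportion to its result set; the second may have to consider every page in the site even if it only returns a handful.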
A good example of a nice way to implement a feature is the blog post describing the upcoming improvements to dynamic properties in CMS 6 - that algorithm has linear time complexity in relation to the number of ancestors a page has, which is quite a change from the current implementation, where cache revalidation has worse than linear time complexity in relation to the total number of pages!
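I haven’t seen the CMS 6 code, so the following is only a sketch of the general idea rather than the actual implementation, and the lookup is deliberately simplified - but it shows why the cost becomes proportional to the depth of the page rather than to the size of the site:

using EPiServer;
using EPiServer.Core;

public static class AncestorLookup
{
    // Resolve a value by walking up the ancestor chain until some page defines it.
    // The cost is proportional to the depth of the page, not to the site size.
    public static object ResolveInheritedValue(PageData page, string propertyName)
    {
        PageData current = page;
        while (current != null)
        {
            PropertyData property = current.Property[propertyName];
            if (property != null && !property.IsNull)
            {
                return property.Value; // a value is defined at this level
            }

            PageReference parentLink = current.ParentLink;
            if (parentLink == null || parentLink.ID <= 0)
            {
                return null; // reached the root without finding a value
            }
            current = DataFactory.Instance.GetPage(parentLink);
        }
        return null;
    }
}

A page five levels deep touches at most five pages, whether the site holds a thousand pages or a million.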
In EPiServer, loading a page takes significant time. Therefore, any operation with linear time complexity in relation to the total number of pages will break down quite quickly once the site gets big. If we have 100 000 pages, 100 000 × a measurable time per page == too long. It’s as simple as that.
A corollary I have found is that while caching is great and required for performance, the first-time hit must still be within acceptable limits. The site must still run without the cache - just not brilliantly. Otherwise we end up in situations where the site cannot start at all when under load. EPiServer suffers from this scenario when certain lists and other operations require almost the whole site to be enumerated, which can make site startup very difficult at times.
The caching corollary also leads to the conclusion that while you can set design limits requiring, for example, all pages to be resident in memory for optimal performance, the site must still perform well enough to actually start under maximum load before everything has been loaded into memory.
The rules apply to background jobs as well. On an EPiServer site running at the limit, a job such as mirroring or archival may well act as a denial-of-service attack, invalidating caches to the point where the site stops working, or growing memory use until a recycle is forced. So not even those jobs can use site enumeration - and when a single job run takes several hours or more, it becomes extremely troublesome in itself.
While the above conclusions may seem self-evident at first glance, please recall that EPiServer implements many features that break these rules, and many EPiServer sites face serious performance issues when the number of pages grows.
I propose that a good rule when designing a framework such as a programmable CMS is: if a feature exists, it will be used. It’s not sufficient to document limitations with texts like “Don’t use for large amounts of data”. Either a feature can be used within the design constraints of the framework, or it cannot. Don’t allow different design goals for different parts of the framework. If you can’t make it fast enough, don’t do it.
In EPiServer, there are many features that tend to break down and become unusable as the total number of pages grows - I hope EPiServer will work over time to change the architecture and remove those obstacles to scalability. It’s still amazing what can be done with the current architecture, but in my current case I’ve had to work around quite a few issues caused by various EPiServer features having linear or worse time complexity in relation to the total number of pages in the site.
Good post and good principles that many implementations could benefit from!
Compared to earlier days, EPiServer is now used on larger and larger sites, which puts different demands on the APIs.
When we design new APIs we always try to design them to work for a large site. One example of that is the new mirroring function in CMS 6, which uses a ChangeLog instead of site enumeration to figure out which changes need to be mirrored.
Regarding the old APIs, we try to refactor them to work on larger sites as well (here we obviously have to take backward compatibility into consideration). One example of that is the newly refactored dynamic property implementation.
I can agree with you on some points here. Some of the API calls make n calls to the database to populate n pages. This could have been done with one call to the database to populate n pages. But removing those APIs is a bit extreme, I think.
When we develop large sites (100 000+ pages) we need to take steps to make sure we don't touch all the pages to do stuff. That is, in my opinion, our job :)
CMS 5+ has a far more powerful caching mechanism. CMS 4 was extremely slow. So I think they are working on performance all the time. FindPagesWithCriteria is one area that should get some more love :)
Svante, how big is the site you've been working on? (number of pages) How many editors (approx) and how much content is produced regularly?
How about overloads for the DataFactory methods that don't use the cache (or use a small, "private" cache partition)? It might seem like a bad thing for performance, but the idea is this: when a site has been running for a while you reach some kind of cache equilibrium where the most-used data is in cache. If some deep-running task (like an archiving job) then starts crawling the page tree, this equilibrium is destroyed as lots of other pages are pulled into the cache. If the crawl were done without the cache, the equilibrium would be kept. The archiving job (and the like) only needs to know about a few pages at a time anyway, so they can be kept locally (or in the cache partition).
Of course this would add more complexity to the API and risk being used in the wrong situation where it would actually worsen performance instead (same thing with lazy-loading properties, another idea that briefly flew by in my brain).
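To make the "private partition" idea a bit more concrete, here is a rough sketch - the loadPageUncached delegate is hypothetical and stands in for an uncached overload that doesn't exist today:

using System;
using System.Collections.Generic;
using EPiServer.Core;

public class PrivatePageCache
{
    // loadPageUncached stands in for a hypothetical overload that reads a page
    // without touching the shared cache - no such overload exists today.
    private readonly Func<PageReference, PageData> _loadPageUncached;
    private readonly int _capacity;
    private readonly Dictionary<PageReference, PageData> _pages =
        new Dictionary<PageReference, PageData>();
    private readonly Queue<PageReference> _evictionOrder = new Queue<PageReference>();

    public PrivatePageCache(Func<PageReference, PageData> loadPageUncached, int capacity)
    {
        _loadPageUncached = loadPageUncached;
        _capacity = capacity;
    }

    public PageData GetPage(PageReference pageLink)
    {
        PageData page;
        if (_pages.TryGetValue(pageLink, out page))
        {
            return page;
        }

        page = _loadPageUncached(pageLink);

        // Keep only the last few pages the job touched; older entries fall out,
        // and the shared site cache is never disturbed by the crawl.
        _pages[pageLink] = page;
        _evictionOrder.Enqueue(pageLink);
        if (_evictionOrder.Count > _capacity)
        {
            _pages.Remove(_evictionOrder.Dequeue());
        }
        return page;
    }
}

A crawling job would route all its page loads through an instance of this class, so the handful of pages it needs at any moment stay close at hand while the shared cache is left alone.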
Steve: In pages, the site is 100,000+, growing at about 1,000/week IIRC. 400-500 editors.
Magnus: Uncached access for background site-enumeration jobs would indeed probably be a 'quick fix' for some issues. Site enumeration by scheduled jobs today works, to an extent, like a cache-poisoning denial-of-service attack against the site... As you say, the API would grow more complex, so an automatic solution is probably better. One simple way would be an 'I'm a scheduled job' flag set in the thread context, or possibly using the presence of an HttpContext to determine the behavior. Autotuning is more ambitious. Any EPiServer architects listening in with comments on what might be feasible to ease the pain of site enumeration by scheduled jobs within a reasonable time frame?
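To illustrate the two variants, a minimal sketch - the JobContext class is hypothetical and not part of the EPiServer API:

using System;
using System.Web;

public static class JobContext
{
    // Variant 1: an explicit flag, set by a scheduled job for the duration of its run.
    [ThreadStatic]
    private static bool _isScheduledJob;

    public static IDisposable EnterScheduledJob()
    {
        _isScheduledJob = true;
        return new Scope();
    }

    // Variant 2: the absence of an HttpContext is a reasonable (if not bullet-proof)
    // sign that we are not serving a page request.
    public static bool ShouldBypassSharedCache
    {
        get { return _isScheduledJob || HttpContext.Current == null; }
    }

    private class Scope : IDisposable
    {
        public void Dispose() { _isScheduledJob = false; }
    }
}

A scheduled job would wrap its body in using (JobContext.EnterScheduledJob()) { ... }, and the page loading layer could check ShouldBypassSharedCache before deciding whether to populate the shared cache.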