CMS editor server instance crashing main site
Recently I was involved in supporting a new large-scale implementation for a global client.
The implementation had gone through full testing and passed with flying colours. There were the usual cosmetic bugs and certain “known bugs” that were acceptable to the client.
We finally got the go-ahead to switch the new site on. This is where the real fun started.
Once the Load Balancer changes were made and the site was publicly available, key functionality suffered severe performance problems that ultimately crashed the site. When I say crashing the site, I mean the servers stopped responding.
The configuration of the implementation was to have the following:
- 1 CMS editor server
- Front end presentation servers with CMS access switched off.
- Akamai CDN
It seemed a standard setup and nothing new.
What we found was that cached objects for data sources from a different provider were being invalidated. After a while the front end presentation servers would stop responding and cause a failover.
An initial investigation was inconclusive and pointed at Akamai not caching correctly.
The goose chase started and a series of remedial steps were taken:
- Monitor the DB
- Look at deactivating the many languages.
Monitoring the DB showed masses of deadlocks, and the dreaded SQL query initiated by FindPagesWithCriteria was causing the DB to lock up.
However, this was not actually the issue.
Thinking it through, the problems appeared whenever the CMS editor application was brought up, at which point the Presentation servers and DB maxed out.
It therefore seemed logical that the CMS was causing the issue.
To prove the point, we stopped the CMS server. The issue went away immediately and the site responded as expected.
After looking at the implementation, we identified the cause as Remote Events, a standard built-in EPiServer feature that I had never seen cause an issue before.
What we realised was that the CMS and front end instances were configured to listen to sites that were not actually active. (There were 2 other sites that weren’t switched on.)
So what was happening was this: the CMS was sending out events to all the presentation servers, announcing itself and trying to register itself, which caused the Presentation servers to invalidate their caches. With users still trying to browse the site, every server went back to the DB to repopulate its cache, which caused the DB to max out and the sites to crash.
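To make the effect concrete, here is a minimal sketch of the pattern described above. It is a toy model, not EPiServer code: the class names, the `simulate` function and the numbers are all made up for illustration, and requests are handled one after another rather than concurrently, so it understates what a real farm under load would see.

```python
# Toy model of repeated cache invalidation across a server farm.
# All names here are hypothetical; nothing below is an EPiServer API.

class Database:
    def __init__(self):
        self.queries = 0

    def load(self, key):
        # Every cache miss on any server ends up here.
        self.queries += 1
        return f"content for {key}"


class PresentationServer:
    def __init__(self, db):
        self.db = db
        self.cache = {}

    def get_page(self, key):
        # Serve from cache when possible, otherwise hit the database.
        if key not in self.cache:
            self.cache[key] = self.db.load(key)
        return self.cache[key]

    def on_remote_event(self):
        # Stand-in for the invalidation triggered by the CMS events.
        self.cache.clear()


def simulate(servers, pages, rounds, invalidate_every):
    """Return the total number of DB queries.

    invalidate_every=0 means no invalidation events are sent.
    """
    db = Database()
    farm = [PresentationServer(db) for _ in range(servers)]
    for r in range(rounds):
        if invalidate_every and r > 0 and r % invalidate_every == 0:
            for server in farm:
                server.on_remote_event()
        # Users browse every page on every server in each round.
        for server in farm:
            for p in range(pages):
                server.get_page(f"page-{p}")
    return db.queries


if __name__ == "__main__":
    print("no invalidation:", simulate(4, 50, 10, 0))
    print("invalidated every round:", simulate(4, 50, 10, 1))
```

With 4 servers and 50 pages, the quiet farm queries the database 200 times in total, once per page per server. When the caches are cleared before every round, the same traffic produces 2000 queries, which is the kind of multiplication that showed up as a maxed-out DB.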
To fix this, do the following:
- Set the scheduler to false for all servers and all sites in episerver.config.
- If other sites are not running in IIS, make sure to comment them out in the episerver.config sites section.
- Delete the content of the automaticSiteMapping section in episerverFramework.config on all servers. It's recreated automatically on site startup.
- Delete table tblsiteConfig; it is recreated on startup.
- Make sure no scheduled jobs are active in admin mode.
- Restart frontend sites.
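The first two checks in the list above could be scripted before go-live. The sketch below is only an illustration and rests on assumptions: it assumes each site is a `site` element with a `siteId` attribute containing a `siteSettings` element, and that the scheduler flag is an attribute named `enableScheduler`. Check those names against your own episerver.config before relying on it. The sample XML is invented, and a missing attribute is deliberately flagged for manual review.

```python
import xml.etree.ElementTree as ET

# Invented sample; element and attribute names are assumptions,
# so compare them with your real episerver.config.
SAMPLE = """<episerver>
  <sites>
    <site siteId="MainSite">
      <siteSettings enableScheduler="true" />
    </site>
    <site siteId="SecondSite">
      <siteSettings enableScheduler="false" />
    </site>
  </sites>
</episerver>"""


def local_name(tag):
    # Strip an XML namespace prefix such as "{http://...}site".
    return tag.rsplit("}", 1)[-1]


def sites_with_scheduler_enabled(xml_text):
    """Return the siteId of every site whose scheduler is not set to false."""
    root = ET.fromstring(xml_text)
    flagged = []
    for element in root.iter():
        if local_name(element.tag) != "site":
            continue
        for child in element:
            if local_name(child.tag) != "siteSettings":
                continue
            # Anything other than an explicit "false" is flagged for review.
            if child.get("enableScheduler", "").lower() != "false":
                flagged.append(element.get("siteId", "?"))
    return flagged


if __name__ == "__main__":
    for site_id in sites_with_scheduler_enabled(SAMPLE):
        print("scheduler still enabled for:", site_id)
```

Sites that are commented out in the config are skipped by the XML parser, so anything this reports is a site that is still configured as active.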
The point of the above is this: when switching a site to live, make sure that only what is needed is running.