We're currently experiencing intermittent CPU spikes on one of our EPiServer sites.
Can anyone share some best practices when troubleshooting these problems?
I've tried LogParserStudio for IIS logs and DebugDiag for memory dumps, but with no luck.
Is WinDbg the way to go or some 3rd party application?
WinDbg is still the gold standard for analysing such issues. Remember you can always reach out to the developer support service for further assistance.
Nothing has beaten WinDbg yet. You can also try opening the process dump in VS (2019 actually has some nice updates: you can mix and match process architectures and open 64-bit dumps as well). Look at the threads, their running time and their CLR stacks. That would be a perfect starting position :)
Agree with Valdis, WinDbg is the greatest (if you learn how to use it...). But I would say that VS2019 is a great alternative and easier to use.
If you are able to reproduce the bad performance locally you can use the performance profiler tool in VS 2017. That has found plenty of performance issues for me.
If it occurs in production, I usually start by analyzing IIS logs to see if any particular page / service has an absurd mean execution time. I use Log Parser Studio for this. There is a pretty good query for the top 25 slowest URLs that can get you started in the right direction. Also check the scheduled jobs and their execution times in Episerver admin.
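If Log Parser Studio is not at hand, the same "slowest URLs by mean execution time" idea can be roughly sketched in a few lines of Python. This is only an illustration, not the Log Parser Studio query itself. It assumes W3C-format IIS logs whose `#Fields:` header includes `cs-uri-stem` and `time-taken` (milliseconds); the function name `slowest_urls` and the log file name in the usage comment are made up for the example.

```python
from collections import defaultdict


def slowest_urls(lines, top=25):
    """Return (url, mean_ms, hits) tuples, slowest mean time-taken first.

    Assumes W3C extended log lines with a '#Fields:' header that names
    the 'cs-uri-stem' and 'time-taken' columns.
    """
    fields = []
    totals = defaultdict(lambda: [0, 0])  # url -> [sum of ms, hit count]
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header names the columns of the rows that follow it.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue  # other directives, blank lines, rows before a header
        row = dict(zip(fields, line.split()))
        try:
            ms = int(row["time-taken"])
            url = row["cs-uri-stem"]
        except (KeyError, ValueError):
            continue  # row lacks the columns we need
        totals[url][0] += ms
        totals[url][1] += 1
    ranked = [(url, total / hits, hits) for url, (total, hits) in totals.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical usage (the file name is a placeholder):
# with open("u_ex200101.log", encoding="utf-8", errors="replace") as log:
#     for url, mean_ms, hits in slowest_urls(log):
#         print(f"{mean_ms:10.0f} ms  {hits:6d} hits  {url}")
```

A high mean with very few hits can be a one-off, so it is worth looking at the hit count next to the mean before chasing a URL.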
If all else fails...
I am obviously biased, but more often than not I find it easier to use WinDbg to find the root cause of a 100% CPU problem. Knowing a few basic commands like !threadpool, !runaway or !clrstack (!threadpool and !clrstack come from the SOS extension) will help you narrow down, if not pinpoint, where the problem is.
Thanks everyone for sharing thoughts and ideas!
Should I use an "ordinary" process dump or a crash dump for troubleshooting?
Right now we have an "ordinary" process dump from when the process has exceeded 80% CPU for 20 seconds (DebugDiag).
Every dump is valuable (most of the time). Don't forget to also bring mscordacwks.dll from the server (WinDbg needs it to match the runtime info properly).
It is valuable to set up load tests and hopefully get a controlled way to reproduce the issue. That makes it a lot easier to get a well-timed memory dump as well.
For local profiling dotTrace is still my pick. It can easily be combined with a local load test if a single request doesn't reveal anything.
Like Johan, I usually do several local profiling sessions (using dotTrace and sometimes dotMemory) to measure, improve and repeat. I usually do this for the app startup, for isolated flows or pages and for complicated logic. The idea is that if the code is improved for the load of one user, it should definitely also perform better for many users.
To perform simple local "load tests" of single URLs, I use the Bombardier HTTP benchmark tool.
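For anyone without a benchmark tool installed, the idea of a simple single-URL "load test" can be sketched with the Python standard library. To be clear, this is not Bombardier and is far less capable; it is just a minimal stand-in that fires a number of concurrent GET requests and summarises the timings. The names `fetch` and `load_test`, the default request counts and the URL in the usage comment are all assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Issue one GET request and return the elapsed time in seconds."""
    start = time.perf_counter()
    with urlopen(url, timeout=30) as response:
        response.read()
    return time.perf_counter() - start


def load_test(url, requests=100, concurrency=10, request_fn=fetch):
    """Call request_fn(url) `requests` times on `concurrency` threads.

    request_fn must return the elapsed seconds for one request; it is a
    parameter so the timing logic can be tried without a real server.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        timings = sorted(pool.map(request_fn, [url] * requests))
    return {
        "requests": len(timings),
        "mean_s": sum(timings) / len(timings),
        "p95_s": timings[int(0.95 * (len(timings) - 1))],
        "max_s": timings[-1],
    }


# Hypothetical usage against a locally running site:
# print(load_test("http://localhost:8080/", requests=200, concurrency=20))
```

Running it while a profiler is attached gives the profiler more than one request to look at, which is the point of combining the two.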
But I also look at Application Insights after doing load tests on load-balanced servers. Then I try to replicate the findings locally for yet another round of measure, improve and repeat.
For production issues, I would first take a look at Application Insights (for sites that use it), mainly to determine which pages or dependencies to investigate further.
You are all lucky when you can replicate the issue locally :)
Yeah, I always feel so lucky. And have great bug hunters on my team.
I guess the conclusion is to get more information in the form of either logs or crash dumps and focus on analyzing those. Avoid starting to look at code and guess what the problem can be / applying random fixes.
There is another psychological advantage to this as well. If you can measure the problem first, ask for time, then fix it and measure it again, it is much easier to explain to the end customer why they should pay you.
If you can state that it was feature X that was the problem and that you have now improved performance by 50% which resolved the issue you are less likely to have a bouncing invoice at the end of the day. Coding is fun but getting paid for coding is even more fun!
Thank you once again!
I decided to upload our memory dump to the EPiServer Developer support. They know more about this than I :-)
If you give the ticket number I will try to look into it if time permits :)
I did look into it and provided my thoughts to support. Apparently you received it. So let's try that way to see how it goes.
I received it, yes.
Thank you for taking your time looking into the problem! It's greatly appreciated.
Awesome. If the case permits, you could follow up here and share what's shareable: what was wrong, any gotchas and things others should keep in mind.