Best practice for troubleshooting 100% CPU on EPiServer site?

Markus Andersson

Hi!

We're currently experiencing intermittent CPU spikes on one of our EPiServer sites.

Can anyone share some best practices when troubleshooting these problems?

I've tried Log Parser Studio for IIS logs and DebugDiag for memory dumps, but with no luck.

Is WinDbg the way to go or some 3rd party application?

#203525

Apr 26, 2019 8:54

Quan Mai

WinDbg is still the gold standard for analysing such issues. Remember that you can always reach out to the developer support service for further assistance.

#203526

Apr 26, 2019 10:00

valdis

Nothing has beaten WinDbg yet. You can also try opening the process dump in VS (2019 actually has some nice updates, and you can mix and match process architectures and open 64-bit dumps as well). Look at the threads, their running time and the CLR stack. That would be a perfect starting position :)

#203529

Apr 26, 2019 11:06

Henrik Fransas

Agree with Valdis, WinDbg is the greatest (if you learn how to use it...).
But I would say that VS 2019 is a great alternative and easier to use.

#203575

Apr 29, 2019 9:25

Daniel Ovaska

If you are able to reproduce the bad performance locally you can use the performance profiler tool in VS 2017. That has found plenty of performance issues for me.

If it occurs in production, I usually start by analyzing the IIS logs to see if any particular page or service has an absurd mean execution time. I use Log Parser Studio for this. There is a pretty good query called "top 25 slowest URLs" that can get you started in the right direction. Also check the scheduled jobs and their execution times in Episerver admin.
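For anyone without Log Parser Studio at hand, the same idea can be sketched in a few lines of Python. This is only a rough illustration, not the Log Parser Studio query itself. It assumes the logs are in the W3C extended format with the `time-taken` field (milliseconds) enabled, and the function name `slowest_urls` is made up for this example.

```python
from collections import defaultdict


def slowest_urls(log_lines, top=25):
    """Return (url, mean_ms, hits) tuples, slowest mean time-taken first."""
    fields = []
    totals = defaultdict(lambda: [0, 0])  # url -> [sum of time-taken in ms, hits]
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header names the columns of the entries that follow it.
            fields = line.split()[1:]
            continue
        if not line or line.startswith("#") or not fields:
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed or truncated entries
        entry = dict(zip(fields, values))
        try:
            elapsed_ms = int(entry["time-taken"])
        except (KeyError, ValueError):
            continue  # time-taken not logged, or not a number
        bucket = totals[entry.get("cs-uri-stem", "-")]
        bucket[0] += elapsed_ms
        bucket[1] += 1
    ranked = [(url, total / hits, hits) for url, (total, hits) in totals.items()]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked[:top]
```

Feeding it an open log file (for example `slowest_urls(open(path))`, with `path` pointing at one of your own log files) gives a list you can print or sort further. A URL with few hits and one extreme request will rank high on mean time, so it is worth looking at the hit count too.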

If all else fails...

WinDbg

#203582

Apr 29, 2019 11:50

Quan Mai

I am obviously biased, but I find that, more often than not, it is easier to use WinDbg to find the root cause of a 100% CPU problem. Knowing a few basic commands like !threadpool, !runaway or !clrstack will help you narrow down, if not find, where the problem is.

#203583

Apr 29, 2019 11:58

Markus Andersson

Thanks everyone for sharing thoughts and ideas!

Should I use an "ordinary" process dump or a crash dump for troubleshooting?

Right now we have an "ordinary" process dump from when the process has exceeded 80% CPU for 20 seconds (DebugDiag).
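To make the rule concrete: a dump triggered this way is taken while the process is still busy, which is what matters for a CPU investigation. Below is a minimal sketch of the "above 80% for 20 seconds" logic. It is not how DebugDiag implements it; the function name and the sample format (timestamp in seconds, CPU percent) are assumptions made for illustration.

```python
def first_sustained_breach(samples, threshold=80.0, duration=20.0):
    """Return the timestamp at which CPU has stayed above `threshold`
    for at least `duration` seconds, or None if that never happens.

    `samples` is an iterable of (timestamp_seconds, cpu_percent) pairs
    in chronological order.
    """
    breach_start = None
    for timestamp, cpu in samples:
        if cpu > threshold:
            if breach_start is None:
                breach_start = timestamp  # a new run above the threshold begins
            if timestamp - breach_start >= duration:
                return timestamp  # this is the moment a dump would be taken
        else:
            breach_start = None  # any dip resets the window
    return None
```

A short spike that drops back under the threshold resets the window, so only sustained load triggers the dump.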

#203623

Edited, Apr 30, 2019 10:01

valdis

Every dump is valuable (most of the time). Also, don't forget to bring mscordacwks.dll from the server (WinDbg needs it to match the runtime info properly).

#203624

Apr 30, 2019 10:07

Johan Kronberg

It is valuable to set up load tests and hopefully get a controlled way to reproduce the problem. That makes it a lot easier to get a well-timed memory dump as well.

For local profiling, dotTrace is still my pick. It can easily be combined with a local load test if a single request doesn't reveal anything.

#203649

Apr 30, 2019 21:50

Stefan Holm Olsen

Like Johan, I usually do several local profiling sessions (using dotTrace and sometimes dotMemory) to measure, improve and repeat. I usually do this for app startup, for isolated flows or pages, and for complicated logic. The idea is that if the code is improved for the load of one user, it should definitely also perform better for many users.

To perform simple local "load tests" of single URLs, I use the Bombardier HTTP benchmark tool.
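If you just want to see what such a single-URL test does, here is a minimal sketch using only the Python standard library. It is not Bombardier and does not try to match its features or output. The names `fetch`, `timed_call` and `load_test` are made up for this example.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fetch(url, timeout=30):
    """Request a URL once, read the whole body and return the status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
        return response.status


def timed_call(action):
    """Run action() and return (elapsed_seconds, succeeded)."""
    started = time.perf_counter()
    try:
        action()
        succeeded = True
    except Exception:
        succeeded = False  # HTTP errors and timeouts count as failures
    return time.perf_counter() - started, succeeded


def load_test(action, total_requests=200, concurrency=10):
    """Call action() total_requests times from a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(action), range(total_requests)))
    latencies = sorted(elapsed for elapsed, _ in results)
    return {
        "requests": len(results),
        "errors": sum(1 for _, succeeded in results if not succeeded),
        "mean_s": mean(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
        "max_s": latencies[-1],
    }
```

A call would look like `load_test(lambda: fetch(url))`, where `url` is the page on your local site that you want to hammer. Keeping the request behind a plain callable makes it easy to swap in something else, such as a request with cookies or headers.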

But I also look at Application Insights after doing load tests on load-balanced servers. Then I try to replicate the findings locally for yet another round of measure, improve and repeat.

For production issues, I would first take a look at Application Insights (for sites that use it), mainly to determine which pages or dependencies to investigate further.

#203655

May 01, 2019 9:17

valdis

You are all lucky when you can replicate the issue locally :)

#203667

May 02, 2019 8:11

Stefan Holm Olsen

Yeah, I always feel so lucky. And I have great bug hunters on my team.

#203668

May 02, 2019 8:42

Daniel Ovaska

I guess the conclusion is to get more information in the form of either logs or crash dumps and focus on analyzing those.
Avoid diving into the code to guess what the problem might be, or applying random fixes.

There is a psychological advantage to this as well. If you can measure the problem first, ask for time, then fix it and measure it again, it is much easier to explain to the end customer why they should pay you.

If you can state that feature X was the problem and that you have now improved performance by 50%, which resolved the issue, you are less likely to have a bouncing invoice at the end of the day. Coding is fun, but getting paid for coding is even more fun!

#203670

May 02, 2019 9:41

Markus Andersson

Thank you once again!

I decided to upload our memory dump to EPiServer Developer Support. They know more about this than I do :-)

//Markus

#203773

May 07, 2019 10:13

Quan Mai

If you give me the ticket number, I will try to look into it if time permits :)

#203827

May 08, 2019 8:12

Quan Mai

I did look into it and provided my thoughts to support. Apparently you have received them, so let's try that and see how it goes.

#203843

May 08, 2019 16:01

Markus Andersson

I received it, yes.

Thank you for taking the time to look into the problem! It's greatly appreciated.

#203844

May 08, 2019 16:02

valdis

Awesome. If the case permits, you could follow up here and share what's sharable: what was wrong, any gotchas, and things others should keep in mind.

thx!

#204095

May 20, 2019 0:07