The ugly side of cloud services for developers ;) If you were on-prem, I would recommend attaching a debugger to the process when the job is stuck and looking at the call stacks of the process threads (basically, diagnosing where the process is hanging at that moment).
I would check the settings for the web app's application pool: how is it configured for recycling? Since the job succeeds on some days, do you see a correlation between the workload on the site and whether the job succeeds? Could it be that the job crashes under load?
If your job loads a lot of data through the Episerver APIs (like IContentLoader, ICatalogSystem, etc.), you could be a victim of this cache bug, which causes memory consumption to go too high and the app to get recycled: https://world.episerver.com/blogs/Magnus-Rahl/Dates/2017/11/two-bugs-in-aspnet-that-break-cache-memory-management/
In addition to the workaround, updating to CMS 11.1+ will get you a new feature that automatically uses a much shorter cache timeout for content loaded from inside scheduled jobs. See more in this article, which also covers restartable scheduled jobs, something you might want to look into: https://world.episerver.com/documentation/developer-guides/CMS/scheduled-jobs/
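To illustrate why a restartable job helps with a run that takes hours, here is a minimal, language-agnostic sketch in Python of checkpointed batch processing. It is not Episerver code and does not use any Episerver API; in a real solution this logic would live in your C# scheduled job, and the names `run_job`, `load_checkpoint`, `save_checkpoint` and the JSON checkpoint file are all hypothetical. The idea is only that progress is persisted after each batch, so a run that is killed by a recycle can pick up where it left off instead of starting over.

```python
import json
import os
import tempfile
from typing import Any, Callable, Sequence


def load_checkpoint(path: str) -> int:
    """Return how many items were already processed, or 0 if there is no checkpoint."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return int(json.load(f)["processed"])
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, processed: int) -> None:
    """Write the checkpoint to a temp file, then swap it in, so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"processed": processed}, f)
    os.replace(tmp_path, path)


def run_job(
    items: Sequence[Any],
    process_item: Callable[[Any], None],
    checkpoint_path: str,
    batch_size: int = 100,
) -> int:
    """Process items in batches, resuming from the last saved checkpoint.

    Returns the number of items handled in this run.
    """
    start = load_checkpoint(checkpoint_path)
    handled = 0
    for offset in range(start, len(items), batch_size):
        batch = items[offset:offset + batch_size]
        for item in batch:
            process_item(item)
            handled += 1
        # Persist progress only after the whole batch succeeded.
        save_checkpoint(checkpoint_path, offset + len(batch))
    # The job finished, so the next run should start from the beginning.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
    return handled
```

Note that items from a batch that was interrupted halfway are processed again on the next run, so the per-item work (a catalog update, in your case) has to be safe to repeat.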
Thank you for your answers. There is a correlation between high site load and jobs stopping, but that's not always the case: the job has failed to complete even when there was no traffic on the website.
Regarding the cache issue, we are indeed loading a lot of content from Episerver, but after we process each batch we manually clear the cache, and at least on my local machine the memory stays pretty much the same during the job run. Also, we are already using Episerver 11.
We improved the processing on our side, and fortunately our client was able to export zip files to their FTP, so the size of the download was reduced considerably and it no longer takes long to download. Also, Episerver Support offered us a solution: a separate web app locked to one instance. These solutions combined seem to work; at least, the jobs have run successfully for the past 3 days.
Thank you again for your answers!
We have a job that needs to run for a couple of hours, but at different points in time it gets stuck. The job gets some files from an FTP server and then uses those files to update the catalog. Sometimes it stops at the item-processing step (after the files are downloaded), sometimes while downloading files, and sometimes immediately after it is started.
When I say stuck, I mean it doesn't crash with an exception from our code and it doesn't go into the "Failed" status; it just stops. It can be started again, and there is no record in the "History" tab.
I also want to mention that on some days, the job runs successfully.
Do you guys know of any best practices for implementing long-running jobs, or do you have a clue what might cause the jobs to stop? The solution is hosted on DXC and we are using Episerver Commerce 11.2.4.