Running Find and Commerce on Azure/DXC? Read this...
Introduction
I recently ran into an issue that took a lot of time and effort to solve, and I wanted to share my findings in case someone else runs into something similar.
Background
A client of ours is currently running a sizable Episerver Commerce site hosted in Azure by Episerver DXC.
This particular site is one of the most integration-heavy sites I've had the pleasure to work with recently. Like many commerce sites, it is integrated with both a PIM and an ERP, among other back-end systems. With well over 15K products and 150K variants, it is a challenge to keep all information on the site up to date, including new and updated products, variants and even inventory.
To manage this, we have created an API that constantly receives product updates from the PIM, and scheduled batch jobs that update the inventory from the client's ERP system.
In the best interest of the end user (and for best performance), we put all this information in Episerver Find.
When products, variants or inventory are updated, Episerver Find events are triggered which add the related products or variants to the Find indexing queue (stored in the DDS in the CMS database). Any of the currently running Episerver instances will process the queue as soon as it contains items and update the documents in Find, usually within seconds.
The Issue
After an internal effort by our client to improve the quality of the metadata of the products on the site, they came to us with some strange findings.
After products/variants were imported from the PIM, they noticed that a few of the variants were missing information which they knew was in the PIM. Stranger yet, this wasn't happening all the time.
This particular subset of nested information is currently feeding an alternative navigation feature on the site, which made it very apparent when the product/variant they were looking for wasn't listed.
Related findings
After realizing this was not a data/user issue, I did some additional digging and found the following things:
- If I reimported the same product, the variants that were previously incorrect were now correct, but other variants were incorrect instead.
- After manually re-indexing the product and variants from an Admin Plug-in we built, everything looked correct on the site. The difference here is that the content is being indexed by the instance I'm currently on.
- To add even more to this mystery, there was nothing in the error logs that indicated that anything was wrong.
- This issue only occurred in the production environment, and we couldn't replicate it in any other environment.
- Page Types that were explicitly excluded in our Find client conventions still ended up in the index.
- When comparing an incorrectly indexed version of a Find document with a later, correct version of the same document, I noticed that some of our properties which had nested conventions were missing the "$$nested" part of the name in the Find index.
- Knowing how an incorrect document looked in Find, I could now query it directly to get the exact number of failed products and variants, which showed that this issue actually had affected 5-10% of the entire catalog.
Hypothesis
My initial thought, however unlikely, was that one (or more) instances were not indexing correctly or weren't initialized properly.
Since all instances in Azure are running the same codebase, this wasn't completely logical, but there were things I could do to confirm the theory, and I wanted to do them before submitting a ticket with Episerver Developer support.
Steps taken
The first thing I did was add additional loggers and logging levels to our EPiServerLog.config, to make sure everything was initialized in the intended order:
<logger name="EPiServer.Framework.Initialization">
  <level value="Debug" />
</logger>
<logger name="EPiServer.Commerce.Initialization">
  <level value="Debug" />
</logger>
After deploying this, the logs clearly showed that our Find client conventions were initialized correctly on every single instance. At this point, we even re-indexed the whole site (which takes 2-3 hours), and everything seemed to be working. I queried the Find index and saw that the number of failed documents was counting down.
However, the day after the deploy, the number had gone up again.
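Still not convinced that this wasn't the root cause, I wanted to know which instance(s) were failing, so I added an "IndexedBy" property to the product and variation models.
// On the product and variation models: records which instance indexed the document.
public string IndexedBy => EnvironmentUtility.GetEnviroment();

// Requires: using System; using System.Net;
public class EnvironmentUtility
{
    // Builds "<site name>:<host name>:<instance id>" from the Azure App Service environment.
    public static string GetEnviroment()
    {
        var environment = $"{Environment.GetEnvironmentVariable("WEBSITE_SITE_NAME")}:{Dns.GetHostName()}:{Environment.GetEnvironmentVariable("WEBSITE_INSTANCE_ID")}";
        // Only the two colons are left when every part is empty, so fall back to the machine name.
        return (environment.Length > 2 ? environment : Environment.MachineName);
    }
}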
Intriguingly enough, this did confirm my theory. I could see that all incorrect Find documents were indexed by the same Azure instance(s)!
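With an IndexedBy value on every document, finding the failing instance(s) comes down to grouping the incorrect documents by that value. The following is a minimal, self-contained sketch of that tally. It does not use the Find API: IndexedDoc, HasNestedFields and IndexAudit are hypothetical names introduced here for illustration, and the documents are assumed to have already been fetched from the index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified view of an indexed document (not a Find type).
public class IndexedDoc
{
    public string Code { get; set; }
    public string IndexedBy { get; set; }
    // False when the "$$nested" fields are missing from the document.
    public bool HasNestedFields { get; set; }
}

public static class IndexAudit
{
    // Counts incorrectly indexed documents per indexing instance.
    public static Dictionary<string, int> FailuresByInstance(IEnumerable<IndexedDoc> docs)
    {
        return docs
            .Where(d => !d.HasNestedFields)
            .GroupBy(d => d.IndexedBy)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

If the resulting dictionary only contains one or two keys while the site runs on more instances, the problem is tied to those instances rather than to the data.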
Even though the logs looked fine, I updated our initialization module containing our Find conventions so that they were executed at the very end of the initialization process, but that still didn't help. I also tried setting the nested convention using a typed implementation, instead of an interface, which also didn't help.
The Solution
This was a harder nut to crack than I initially thought, so at this point I reached out to Episerver developer support. Thanks to the work I had already put in, I got confirmation that this was a bug (which later turned out to be two) that will be fixed in future releases of Find.
The final step I took (with the support of the Epi dev team) was to change the module containing our Find conventions: instead of implementing only IInitializationModule, we let it implement IConfigurableModule and added the following:
public void ConfigureContainer(ServiceConfigurationContext context)
{
    EventedIndexingSettings.Instance.IndexingQueueBatchSize = 0;
    if (Logger.IsEnabled(Level.Information))
        Logger.Log(Level.Information, "Setting IndexingQueueBatchSize to 0 #find");
}
We then added the following lines to the Initialize() method:
var batchSize = EPiServer.Find.Configuration.GetConfiguration().IndexingQueueBatchSize;
EventedIndexingSettings.Instance.IndexingQueueBatchSize = batchSize;
if (Logger.IsEnabled(Level.Information))
    Logger.Log(Level.Information, $"Setting IndexingQueueBatchSize to {batchSize} #find");
Root Cause
As you can probably tell from the code above, the root cause was a timing issue: if there are items in the Find indexing queue when an Azure web site (re)starts, the instance starts processing the queue before our Find conventions (which are Commerce dependent) are completely initialized(!). That is why the workaround sets the batch size to 0 in ConfigureContainer() and restores the configured value in Initialize(). Once the queue has been processed without the conventions, it doesn't help to re-apply them (I built an admin plug-in to try that as well).
This is probably an edge case: you need a site that constantly updates the Find index, uses CMS, Commerce and Find together, and is hosted in a load-balanced environment. Still, please rate and/or comment if you find this article interesting or helpful!
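To make the timing concrete, here is a minimal, self-contained simulation of the startup sequence. It does not use any Episerver API: FakeIndexer, FakeQueueProcessor and Demo are hypothetical stand-ins, the "$$nested" suffix is only a marker for "conventions were applied", and the sketch assumes that a batch size of 0 means the queue worker picks up nothing, which is what the workaround relies on.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the indexing client (not a Find type).
public class FakeIndexer
{
    public bool ConventionsApplied { get; set; }
    public List<string> Indexed { get; } = new List<string>();

    public void Index(string item)
    {
        // Without conventions the nested marker is missing, as in the bad documents.
        Indexed.Add(ConventionsApplied ? item + "$$nested" : item);
    }
}

// Hypothetical stand-in for the indexing queue worker.
public class FakeQueueProcessor
{
    private readonly Queue<string> _queue;
    private readonly FakeIndexer _indexer;

    public int BatchSize { get; set; }

    public FakeQueueProcessor(IEnumerable<string> items, FakeIndexer indexer, int batchSize)
    {
        _queue = new Queue<string>(items);
        _indexer = indexer;
        BatchSize = batchSize;
    }

    // Processes up to BatchSize items; a BatchSize of 0 processes nothing.
    public int ProcessBatch()
    {
        var processed = 0;
        while (processed < BatchSize && _queue.Count > 0)
        {
            _indexer.Index(_queue.Dequeue());
            processed++;
        }
        return processed;
    }
}

public static class Demo
{
    public static List<string> Run(bool pauseDuringStartup)
    {
        const int configuredBatchSize = 50;
        var indexer = new FakeIndexer();
        var processor = new FakeQueueProcessor(new[] { "A", "B" }, indexer, configuredBatchSize);

        // Corresponds to ConfigureContainer(): optionally pause the queue.
        if (pauseDuringStartup) processor.BatchSize = 0;

        // The queue worker fires while the site is still starting up.
        processor.ProcessBatch();

        // Corresponds to Initialize(): conventions are registered, batch size restored.
        indexer.ConventionsApplied = true;
        processor.BatchSize = configuredBatchSize;
        processor.ProcessBatch();

        return indexer.Indexed;
    }
}
```

Calling Demo.Run(false) indexes both items before the conventions exist, so neither carries the marker; Demo.Run(true) defers them until after initialization, so both do.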
Thanks!
Stellan Danald
Solutions Architect @ Making Waves
P.S. Have you checked out the EPiServer Slack Community?
https://episervercommunity.net/ - Read about it at https://world.episerver.com/blogs/stellan-danald/dates/2017/6/episerver-slack-community/
An additional tip
If you want to see how many references are in your Find indexing queue, run the following SQL script in your CMS database:
SELECT COUNT(1)
FROM [dbo].[tblBigTable]
WHERE [StoreName] = 'EPiServer.Find.Cms.IndexingQueueReference';

SELECT COUNT(1)
FROM [dbo].[tblBigTableIdentity]
WHERE [StoreName] = 'EPiServer.Find.Cms.IndexingQueueReference';