Don't index large files; instead, implement indexing conventions for IContentMedia. For example, add the following to your IndexingInitialization module:
private static bool ShouldIndexFile(IContentMedia file)
{
    if (file == null)
        return false;

    try
    {
        using (var stream = file.BinaryData.OpenRead())
            return stream.Length <= 52428800; // 50 MB or less
    }
    catch (Exception e)
    {
        log.Error(String.Format("Unable to determine if the file {0} (ID: {1}) should be indexed.", file.Name, file.ContentLink.ID), e);
        return false;
    }
}
And
public void Initialize(InitializationEngine context)
{
    // override defaults if you like
    ContentIndexer.Instance.ContentBatchSize = 20; // default is 100
    ContentIndexer.Instance.MediaBatchSize = 3; // default is 5

    // remove media types that should not be indexed
    ContentIndexer.Instance.Conventions.ForInstancesOf<GenericMedia>().ShouldIndex(ShouldIndexFile); // implement any custom MediaData implementations etc.
    ContentIndexer.Instance.Conventions.ForInstancesOf<MediaData>().ShouldIndex(ShouldIndexFile); // filter the base type
    // ...
}
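For completeness, a minimal sketch of what the surrounding IndexingInitialization module could look like, assuming a standard EPiServer initialization module and EPiServer.Logging for the log field used above (the exact module dependency may differ in your solution):

[InitializableModule]
[ModuleDependency(typeof(EPiServer.Web.InitializationModule))] // adjust the dependency to whatever your solution requires
public class IndexingInitialization : IInitializableModule
{
    // logger used by ShouldIndexFile above (EPiServer.Logging assumed)
    private static readonly ILogger log = LogManager.GetLogger(typeof(IndexingInitialization));

    public void Initialize(InitializationEngine context)
    {
        // batch sizes and ShouldIndex conventions as shown above
    }

    public void Uninitialize(InitializationEngine context)
    {
        // nothing to clean up
    }
}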
You may find some additional advice in this blog post by Ben Nitti, https://world.optimizely.com/blogs/repo-journal/dates/2021/4/optimizing-your-asset-indexing-with-conventions/#:~:text=Episerver%20recommends%20not%20exceeding%20the%20by%20default%2050%20MB%20maximum%20request%20size.
Hi Eric,
Thanks, this works like a charm. However, I still think that HTTP 413 errors should not trigger a retry automatically, since a file that is over 50 MB is guaranteed to hit the same error on every retry.
There are several other settings you can use to fine-tune the indexer.
Try these to get started (adjust the values to your preference):
ContentIndexer.Instance.MaxTries = 1;
ContentIndexer.Instance.MaxWaitTime = 10;
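For context, a minimal sketch of where these would go, assuming MaxTries caps how many times a failing batch is attempted and MaxWaitTime is the wait in seconds between attempts (check both against your Find version):

public void Initialize(InitializationEngine context)
{
    // presumably: attempt each batch only once (no retries)...
    ContentIndexer.Instance.MaxTries = 1;
    // ...and wait no more than 10 seconds before any retry
    ContentIndexer.Instance.MaxWaitTime = 10;

    // batch sizes and ShouldIndex conventions as shown earlier
    // ...
}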
Hi huseyinerdinc,
It does look like we're retrying on HTTP 413 Payload Too Large even when the batch size is 1, which we really shouldn't be doing. Otherwise, the logic is to split the batch and retry.
I will be reporting this issue to the development team as a bug. I'll let you know the bug number once it's been assigned.
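To illustrate the retry behavior described above (a rough sketch of the idea only, not EPiServer Find's actual implementation; IndexWithSplitting and tryIndexBatch are hypothetical names, and System.Linq plus EPiServer.Core are assumed):

// Illustrative only: tryIndexBatch returns false when the service rejects the request with HTTP 413.
private static void IndexWithSplitting(IList<IContent> batch, Func<IList<IContent>, bool> tryIndexBatch)
{
    if (tryIndexBatch(batch))
        return;

    if (batch.Count == 1)
    {
        // A single item that is too large will fail on every retry,
        // so it should be logged and skipped rather than retried (the reported bug).
        return;
    }

    // Otherwise split the batch in half and try each half separately.
    var firstHalf = batch.Take(batch.Count / 2).ToList();
    var secondHalf = batch.Skip(batch.Count / 2).ToList();
    IndexWithSplitting(firstHalf, tryIndexBatch);
    IndexWithSplitting(secondHalf, tryIndexBatch);
}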
Hello,
I've noticed that having many large files drastically increases the duration of the indexing job, since large files (presumably those over 50 MB?) cause an exception, which in turn triggers a retry. The exception thrown is ServiceException: The remote server returned an error: (413) Payload Too Large. Request entity too large.
For this particular error, the job should not retry indexing the file two more times, since the subsequent requests are certain to throw the same error. Below is an example of such a case. The waiting period between retries adds an extra 20 seconds per file to the indexing job.