Suggestion: Do not retry indexing large attachments

Hello,

I've noticed that having too many large files drastically increases the duration of the indexing job, because large files (presumably those over 50 MB?) cause an exception, which in turn triggers a retry. The exception thrown is ServiceException: The remote server returned an error: (413) Payload Too Large. Request entity too large.

For this particular error, the job should not retry indexing the file two more times, since the subsequent requests are certain to fail with the same error. Below is an example of such a case. The waiting period between retries adds an extra 20 seconds per file to the indexing job.

2023-02-24T13:08:11.930029493Z       Indexing failed (http error), attempt 1 out of 3: EPiServer.Find.ServiceException: The remote server returned an error: (413) Payload Too Large.
2023-02-24T13:08:11.930035493Z       Request entity too large
2023-02-24T13:08:21.898188482Z       Indexing failed (http error), attempt 2 out of 3: EPiServer.Find.ServiceException: The remote server returned an error: (413) Payload Too Large.
2023-02-24T13:08:21.898205282Z       Request entity too large
2023-02-24T13:08:31.642090599Z       Indexing failed (http error), attempt 3 out of 3: EPiServer.Find.ServiceException: The remote server returned an error: (413) Payload Too Large.
2023-02-24T13:08:31.642099599Z       Request entity too large
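
To put that overhead in perspective, here is a rough back-of-the-envelope sketch. It assumes 3 attempts per file and a wait of roughly 10 seconds between attempts, as the timestamps in the log suggest. The RetryOverhead helper and the figure of 500 oversized files are hypothetical and only serve as an illustration.

```csharp
using System;

public static class RetryOverhead
{
    // Estimates the time spent purely on waiting between attempts for files
    // that can never succeed. A file that fails all `maxTries` attempts
    // waits (maxTries - 1) times. All inputs are assumptions, not measured values.
    public static TimeSpan Estimate(int oversizedFiles, int maxTries, double waitSeconds)
    {
        if (oversizedFiles < 0 || maxTries < 1 || waitSeconds < 0)
            throw new ArgumentOutOfRangeException();

        return TimeSpan.FromSeconds(oversizedFiles * (maxTries - 1) * waitSeconds);
    }
}
```

With these assumptions, RetryOverhead.Estimate(500, 3, 10) comes to 10,000 seconds, or about 2 hours 47 minutes of waiting that cannot change the outcome.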
#297143
Edited, Feb 24, 2023 13:23

Don't index large files in the first place. You can implement conventions for IContentMedia, for example as follows.

Add this to your IndexingInitialization:

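private const long MaxIndexableFileSize = 50L * 1024 * 1024; // 52,428,800 bytes (50 MB)

// Note: `log` is assumed to be a logger field declared elsewhere in this class.
private static bool ShouldIndexFile(IContentMedia file)
{
    if (file == null)
        return false;

    try
    {
        using (var stream = file.BinaryData.OpenRead())
            return stream.Length <= MaxIndexableFileSize; // index only files of at most 50 MB
    }
    catch (Exception e)
    {
        log.Error(String.Format("Unable to determine whether the file {0} (ID: {1}) should be indexed.", file.Name, file.ContentLink.ID), e);
        return false;
    }
}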

And

public void Initialize(InitializationEngine context)
{
    // Override the default batch sizes if you like
    ContentIndexer.Instance.ContentBatchSize = 20; // default is 100
    ContentIndexer.Instance.MediaBatchSize = 3; // default is 5

    // Exclude media files that should not be indexed
    ContentIndexer.Instance.Conventions.ForInstancesOf<GenericMedia>().ShouldIndex(ShouldIndexFile); // repeat for any custom MediaData implementations
    ContentIndexer.Instance.Conventions.ForInstancesOf<MediaData>().ShouldIndex(ShouldIndexFile); // filter the base type

    // ...
}

You may find some additional advice in this blog post by Ben Nitti, https://world.optimizely.com/blogs/repo-journal/dates/2021/4/optimizing-your-asset-indexing-with-conventions/#:~:text=Episerver%20recommends%20not%20exceeding%20the%20by%20default%2050%20MB%20maximum%20request%20size.

#297146
Feb 24, 2023 15:23

Hi Eric,

Thanks, this works like a charm. However, I still think that HTTP 413 errors should not trigger an automatic retry, since a file over 50 MB is guaranteed to get the same error on every subsequent attempt.

#297299
Feb 27, 2023 8:21

There are several other settings you can use to fine-tune the indexing job.

Try these to get started (adjust the values to suit your needs):

ContentIndexer.Instance.MaxTries = 1;
ContentIndexer.Instance.MaxWaitTime = 10;
#297307
Edited, Feb 27, 2023 9:32

Hi huseyinerdinc

It does look like we're retrying on HTTP 413 Payload Too Large even when the batch size is 1, which we really shouldn't be doing. For larger batches, the logic is to split the batch and retry.
I will be reporting this issue to the development team as a bug. I'll let you know the bug number once it's been assigned.
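
The intended behaviour described above can be sketched roughly as follows. This is not the actual Find implementation, which isn't shown in this thread. BatchSplitSketch, SendBatch and IndexAll are hypothetical names, and the send delegate stands in for the real HTTP call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchSplitSketch
{
    // Hypothetical stand-in for the indexing request: returns the HTTP status code.
    public delegate int SendBatch(IReadOnlyList<string> batch);

    // Sends `items`, halving a batch whenever the server answers 413.
    // A single item that still gets 413 is reported as skipped, not retried.
    public static List<string> IndexAll(IReadOnlyList<string> items, SendBatch send)
    {
        var skipped = new List<string>();
        var pending = new Stack<IReadOnlyList<string>>();
        if (items.Count > 0) pending.Push(items);

        while (pending.Count > 0)
        {
            var batch = pending.Pop();
            int status = send(batch);
            if (status != 413) continue; // success and other errors are out of scope for this sketch

            if (batch.Count == 1)
            {
                skipped.Add(batch[0]); // resending the same payload cannot succeed
                continue;
            }

            // Split the batch in two and try each half separately.
            int half = batch.Count / 2;
            pending.Push(batch.Skip(half).ToList());
            pending.Push(batch.Take(half).ToList());
        }

        return skipped;
    }
}
```

The point of the design is that a 413 on a multi-item batch may be fixed by sending less data per request, whereas a 413 on a single item will repeat on every attempt, so that item is skipped immediately without any waiting.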

#299529
Apr 04, 2023 20:45
huseyinerdinc - Apr 05, 2023 6:50
Sounds great, thanks.
dada - Apr 05, 2023 7:35
A bug has been filed with the dev team. Internal reference: FIND-11340
Looks like an easy fix but I can't give you an ETA at this time.