Suggestion: Do not retry indexing large attachments

huseyinerdinc

Hello,

I've noticed that having many large files drastically increases the duration of the indexing job, because large files (presumably those over 50 MB?) cause an exception, which in turn triggers a retry. The exception thrown is ServiceException: The remote server returned an error: (413) Payload Too Large. Request entity too large.

Given this particular error, the job should not retry indexing the file two more times, since the subsequent requests are certain to fail with the same error. Below is an example of such a case. The waiting period between retries adds an extra 20 seconds per file to the indexing job.

2023-02-24T13:08:11.930029493Z       Indexing failed (http error), attempt 1 out of 3: EPiServer.Find.ServiceException: The remote server returned an error: (413) Payload Too Large.

2023-02-24T13:08:11.930035493Z       Request entity too large

2023-02-24T13:08:21.898188482Z       Indexing failed (http error), attempt 2 out of 3: EPiServer.Find.ServiceException: The remote server returned an error: (413) Payload Too Large.

2023-02-24T13:08:21.898205282Z       Request entity too large

2023-02-24T13:08:31.642090599Z       Indexing failed (http error), attempt 3 out of 3: EPiServer.Find.ServiceException: The remote server returned an error: (413) Payload Too Large.

2023-02-24T13:08:31.642099599Z       Request entity too large

#297143

Edited, Feb 24, 2023 13:23

Eric Herlitz

Don't index large files; implement conventions for IContentMedia instead, e.g.

Add to your IndexingInitialization

private static bool ShouldIndexFile(IContentMedia file)
{
    if (file == null)
        return false;

    try
    {
        using (var stream = file.BinaryData.OpenRead())
            return stream.Length <= 52428800; // at most 50 MB (50 * 1024 * 1024 bytes)
    }
    catch (Exception e)
    {
        log.Error(String.Format("Unable to determine if the file {0} (ID: {1}) should be indexed.", file.Name, file.ContentLink.ID), e);
        return false;
    }
}

And

public void Initialize(InitializationEngine context)
{
    // override defaults if you like
    ContentIndexer.Instance.ContentBatchSize = 20; // default is 100
    ContentIndexer.Instance.MediaBatchSize = 3; // default is 5

    // remove media types that should not be indexed
    ContentIndexer.Instance.Conventions.ForInstancesOf<GenericMedia>().ShouldIndex(ShouldIndexFile); // implement any custom MediaData implementations etc
    ContentIndexer.Instance.Conventions.ForInstancesOf<MediaData>().ShouldIndex(ShouldIndexFile); // filter the base type

    // ...
}

You may find some additional advice in this blog post by Ben Nitti, https://world.optimizely.com/blogs/repo-journal/dates/2021/4/optimizing-your-asset-indexing-with-conventions/#:~:text=Episerver%20recommends%20not%20exceeding%20the%20by%20default%2050%20MB%20maximum%20request%20size.

#297146

Feb 24, 2023 15:23

huseyinerdinc

Hi Eric,

Thanks, this works like a charm. However, I still think that HTTP 413 errors should not trigger a retry automatically, since a file over 50 MB is guaranteed to get the same error on every retry.

#297299

Feb 27, 2023 8:21

Eric Herlitz

There are several other settings you can add to fine-tune the search engine.

Try these to get started (update the values to your preference)

ContentIndexer.Instance.MaxTries = 1;
ContentIndexer.Instance.MaxWaitTime = 10;

#297307

Edited, Feb 27, 2023 9:32

dada

Hi huseyinerdinc

It does look like we're retrying on HTTP 413 Payload Too Large even when the batch size is 1, which we really shouldn't be doing. For larger batches, the logic is to split the batch and retry.
I will be reporting this issue to the development team as a bug. I'll let you know the bug number once it's been assigned.
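
For readers who want to picture the behaviour described above, here is a minimal, self-contained sketch of "split the batch on 413, but never retry a single item that is too large". It is not the actual Find client code: IndexHttpException, BatchRetrySketch, IndexWithSplit and the send callback are hypothetical names made up for illustration, and a real implementation would also wait between retries of transient errors.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an HTTP failure reported by the index service.
public sealed class IndexHttpException : Exception
{
    public int StatusCode { get; }

    public IndexHttpException(int statusCode) : base("HTTP " + statusCode)
    {
        StatusCode = statusCode;
    }
}

public static class BatchRetrySketch
{
    // Sends 'batch' through 'send' and returns the items that were skipped.
    // On 413 the batch is split in half and each half is sent separately;
    // a single item that still gets 413 is skipped without any retry.
    // Other HTTP failures are retried until 'maxTries' attempts are used up.
    public static List<T> IndexWithSplit<T>(
        IReadOnlyList<T> batch,
        Action<IReadOnlyList<T>> send,
        int maxTries = 3)
    {
        var skipped = new List<T>();
        if (batch.Count == 0)
            return skipped;

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                send(batch);
                return skipped;
            }
            catch (IndexHttpException e) when (e.StatusCode == 413)
            {
                if (batch.Count == 1)
                {
                    // Resending the same payload cannot succeed, so give up now.
                    skipped.Add(batch[0]);
                    return skipped;
                }

                var half = batch.Count / 2;
                skipped.AddRange(IndexWithSplit(batch.Take(half).ToList(), send, maxTries));
                skipped.AddRange(IndexWithSplit(batch.Skip(half).ToList(), send, maxTries));
                return skipped;
            }
            catch (IndexHttpException) when (attempt < maxTries)
            {
                // Possibly transient error: fall through and try again.
                // (A real implementation would pause here before retrying.)
            }
        }
    }
}
```

With a fake send callback that throws IndexHttpException(413) whenever the combined size of a batch exceeds 50, a batch of sizes 10, 80 and 5 ends up with only the 80 item skipped, and that item is sent on its own exactly once.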

#299529

Apr 04, 2023 20:45

- Apr 05, 2023 6:50

Sounds great, thanks.

- Apr 05, 2023 7:35

A bug has been filed with the dev team. Internal reference: FIND-11340
Looks like an easy fix but I can't give you an ETA at this time.