Missing excerpts on attachment results

Dear fellow Episerver developers,

We are building a brand new website with Episerver Find as our search solution. Most of it works, but we're having difficulty getting attachments (.doc and .pdf) to index or render properly (we're not sure which).

We have two environments running the same code: test and acceptation. On test, everything seems to work fine for PDF files: we can search for text inside attachments and we get results with proper excerpts. On acceptation it is a different story: we get the results, but without the excerpts.

The test environment has the web server and the database on the same (virtual) machine whereas acceptation has them separate. I know from the documentation that iFilters need to be installed, but I'm left confused about how and where Find does the filtering: on the database or on the web server? Also, does Windows Search Service need to be installed?

Looking at the index for a PDF file on the test environment, I see a `SearchAttachment$$attachment` field (base64), but on acceptation I see `SearchAttachmentText$$string`, which holds the contents in plain text. The former gives excerpts just fine, while the latter doesn't. Word documents don't work in either environment: both have a `SearchAttachmentText$$string` field in the index, with no excerpts as a result.

According to the documentation on indexing (http://world.episerver.com/documentation/Items/Developers-Guide/EPiServer-Find/11/Integration/episerver-7-5/Indexing/), the EPiServer.Find.Cms.AttachmentFilter package should be installed (we have that) and `SearchAttachmentText$$string` should contain readable text (it does).

Does anybody know what's going wrong here?

#175804

Mar 02, 2017 14:07

Christian Lindeberg

Try this:

    SearchClient.Instance.Conventions.UnifiedSearchRegistry.ForInstanceOf<MediaData>()
        .ProjectExcerptUsing<ISearchContent>(spec =>
            doc =>
                !string.IsNullOrWhiteSpace(doc.SearchAttachmentText)
                    ? doc.SearchAttachmentText.AsCropped(spec.ExcerptLength)
                    : doc.SearchAttachment.AsCropped(spec.ExcerptLength)
        );

By your description,

Test - `SearchAttachment$$attachment` (base64)

Acceptation - `SearchAttachmentText$$string` <-- This is what you want.

I would say it is your test environment that is having problems parsing and indexing PDF contents.

#175915

Mar 06, 2017 13:47

Thank you for your response. Your answer was very close and led me to the solution. We're using a HitSpecification with the HighlightExcerpt property set to true, which means we needed a different method: the one for highlighted excerpts. Below is the solution I came up with.

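    using System;
    using System.Linq.Expressions;
    using EPiServer.Core;
    using EPiServer.Find;
    using EPiServer.Find.Framework;
    using EPiServer.Find.UnifiedSearch;
    using EPiServer.Framework;
    using EPiServer.Framework.Initialization;

    [InitializableModule]
    public class FindInitialization : IInitializableModule
    {
        public void Initialize(InitializationEngine context)
        {
            // Register a custom highlighted-excerpt projection for all media content.
            SearchClient.Instance.Conventions.UnifiedSearchRegistry.ForInstanceOf<MediaData>()
                .ProjectHighlightedExcerptUsing(HighlightExpressionGetter);
        }

        // Required by IInitializableModule; there is nothing to clean up here.
        public void Uninitialize(InitializationEngine context)
        {
        }

        private static Expression<Func<ISearchContent, string>> HighlightExpressionGetter(HitSpecification spec)
        {
            var highlightSpec = new HighlightSpec
            {
                FragmentSize = spec.ExcerptLength,
                NumberOfFragments = 1,
                PreTag = spec.PreTagForAllHighlights,
                PostTag = spec.PostTagForAllHighlights
            };

            // Prefer the plain-text field; fall back to the attachment field when it is empty.
            return doc => !string.IsNullOrWhiteSpace(doc.SearchAttachmentText)
                ? doc.SearchAttachmentText.AsHighlighted(highlightSpec)
                : doc.SearchAttachment.AsHighlighted(highlightSpec);
        }
    }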
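
For anyone wondering where this projection kicks in: it is used on the query side when the HitSpecification asks for a highlighted excerpt. Below is a rough sketch of such a query, not our actual code. The class name `AttachmentSearchExample`, the method `PrintHits`, the excerpt length of 200 and the `<em>` tags are all made up for illustration, and it assumes the Find client is already configured.

```csharp
using System;
using EPiServer.Find;
using EPiServer.Find.Framework;
using EPiServer.Find.UnifiedSearch;

// Hypothetical helper for illustration only; not part of the solution above.
public static class AttachmentSearchExample
{
    public static void PrintHits(string query)
    {
        var hitSpec = new HitSpecification
        {
            // With HighlightExcerpt enabled, the highlighted-excerpt projection is used
            // rather than the plain excerpt projection.
            HighlightExcerpt = true,
            ExcerptLength = 200,               // assumed value
            PreTagForAllHighlights = "<em>",   // assumed value
            PostTagForAllHighlights = "</em>"  // assumed value
        };

        var results = SearchClient.Instance
            .UnifiedSearchFor(query)
            .GetResult(hitSpec);

        // Each hit exposes the projected (highlighted) excerpt.
        foreach (var hit in results)
        {
            Console.WriteLine("{0}: {1}", hit.Title, hit.Excerpt);
        }
    }
}
```

If HighlightExcerpt is left false, the plain excerpt projection (ProjectExcerptUsing, as in the answer above) would be the relevant one instead.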

#176011

Mar 07, 2017 15:56