Searching for pages based on free-text content of associated MediaData

We have a client requirement to have a page type which acts as a collection for multiple documents - one primary document (a PDF), and an arbitrary number of secondary documents (PDFs, DOCX/XLSX files, images). The documents are all stored in CMS as IContent. The clients want a search control on the page which can match against text in any of these documents, and return the parent page as the result. The individual files themselves should not be returned as search results. This doesn't need to be a site-wide search, and can be specific to the collection page type.

Is there any decent way to do this? We're on EPiServer 7.19/Find 8, and can't update at this stage.

I have a working-but-not-great solution, achieved by using Content Areas to hold the documents, and enabling IndexInContentAreas on the MediaData files. If I then use:

var results = SearchClient.Instance.Search().For("Search Query").GetPagesResult();

I get pages based on the text in these content areas - I've tested this with unique words in PDFs, and it definitely works. Great! Unfortunately, I can't see any way of communicating to the user why the match was made - I've tried creating highlights as per the documentation here, but this just returns an empty string, and there is no available excerpt for this search type.

My first guess was to use UnifiedSearch and grab the Excerpt and Highlight from there, but for whatever reason, UnifiedSearch does not match against words from the PDFs in content areas in the way conventional search does.

I've seen the documentation around attachments, but this is a bit abstract and I can't find an example of how this can be used as part of a Page, and whether it can accept IContent MediaData, or if it's meant solely for indexing binary content that sits outside of EPiServer's own media tree. Would I need to intercept the publish action for my page and create/update an instance of a POCO container class which links the page to the attachments for EPiServer Find to index?

#174828

Feb 06, 2017 12:00

To highlight all attachments in a non-typed way, you can use the following gist: Highlighting all attachment fields

/Henrik

#176254

Edited, Mar 14, 2017 15:13

Hi Henrik,

Thanks for the tip, but can you provide some explanation of how to use the Find Attachments feature in these circumstances?

As per my original post, the documentation on Attachments makes it sound like this is only for files hosted outside EPiServer rather than for IContent items. We need users to be able to upload documents into the CMS media tree as they usually would, and drop an arbitrary number of these into a content area (or similar) for inclusion on the page.

Cheers,

Alex

#176256

Mar 14, 2017 15:22

As for searching an arbitrary number of attachments in a content area, perhaps of different media types, I think the solution works OK.

In the general case you can query IContentMedia as any other IContent and access the attachment data of the actual file using the .SearchAttachment()-extension.

/Henrik

#176259

Mar 14, 2017 15:40

When you say "the solution works OK", do you mean that what we've currently got (matching text in attachments, but not being able to highlight or boost) is probably as good as we're going to get here?

I don't think I follow your postscript about using SearchAttachment - I can't find any documentation for this, other than this stub, which doesn't shed much light. How could we use this to improve our solution?

Is there any method which lets us, if we have an IContent object, use EPiServer Find to do a full-text search (with highlighting) on that object, assuming it's indexed?

Alex

#176273

Mar 14, 2017 17:04

I posted a gist with an example of how to highlight. As for boosting, you need to be able to specify specific fields, and for attachments this cannot be done because the entire attachment is parsed as a single field (i.e. you can't boost titles/headers within a document).

If you have an IContent object, you can highlight it as shown in the documentation. As for IContentMedia, what I meant is that if you want to highlight the actual content of the file, you do this:

var searchResult = client.Search<MyMediaData>()
  .For("Banana")
  .Select(x => new {
    // Highlighted excerpt taken from the parsed file content
    HighlightedAttachment = x.SearchAttachment().AsHighlighted()
  })
  .GetResult();

/Henrik

#176288

Mar 15, 2017 8:07

I found a way to do this, in the end. Per's suggestion in this post, and the subsequent discussion on that thread, put me on the right track.

In the unlikely event that anyone else hits this, here's how it works:

1. Create a new GenericMedia child class, TextBasedMedia, from which the various full-text indexable media types inherit (PDF, DOCX, etc).
2. Create a boolean extension method for TextBasedMedia, IsLinkedByFooPage(this), which uses IContentRepository.GetReferencesToContent to check whether it's linked by an instance of our target page type.
3. Update the search conventions for TextBasedMedia to add a new field containing the value of IsLinkedByFooPage.
4. Add a handler to IContentEvents.PublishingContent to catch instances of FooPage and reindex all attached media on the new and old versions of the page, to ensure their IsLinkedByFooPage value is kept up to date.
5. Using unified search, create a query to search across FooPage instances and TextBasedMedia instances where IsLinkedByFooPage is true.
6. Run the search, taking all results (Take(1000)). We can't do pagination at this stage, as we need to filter out duplicates.
7. Process the results. FooPage results can be returned directly. For TextBasedMedia results, create a new result object using the title and URL of the FooPage they link to, but the Excerpt from the TextBasedMedia file itself. Consolidate results so only the top-ranking one for each FooPage appears, removing duplicates.
8. Apply the relevant Skip/Take from the original request to these processed results to allow pagination.
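
For anyone wanting to see the shape of the last two steps (consolidating to one result per FooPage, then paging), here is a minimal sketch in plain LINQ. It deliberately leaves EPiServer Find out: the tuple fields (PageUrl, PageTitle, Excerpt) and the ConsolidateAndPage helper are hypothetical names invented for illustration, and it assumes each media hit has already been mapped to the FooPage that links to it and that hits arrive in relevance order, best first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of EPiServer Find.
// rankedHits: one entry per FooPage hit or linked-media hit, best match first.
static List<(string PageUrl, string PageTitle, string Excerpt)> ConsolidateAndPage(
    IEnumerable<(string PageUrl, string PageTitle, string Excerpt)> rankedHits,
    int skip,
    int take)
{
    return rankedHits
        // One group per owning page; LINQ to Objects yields groups in the
        // order their keys first appear, so relevance order is preserved.
        .GroupBy(hit => hit.PageUrl)
        // The first element of each group is that page's top-ranking hit.
        .Select(group => group.First())
        // Paging is only applied after duplicates have been removed.
        .Skip(skip)
        .Take(take)
        .ToList();
}

// Example input: Report A matches twice (page text, then a linked PDF).
var hits = new[]
{
    ("/reports/a", "Report A", "Page summary mentioning banana"),
    ("/reports/b", "Report B", "PDF text mentioning banana"),
    ("/reports/a", "Report A", "Spreadsheet text mentioning banana"),
};

var firstPage = ConsolidateAndPage(hits, skip: 0, take: 10);
foreach (var result in firstPage)
{
    Console.WriteLine($"{result.PageTitle}: {result.Excerpt}");
}
```

Paging after consolidation is the reason the full result set has to be fetched first: a Skip/Take applied inside the search query would count duplicate hits for the same page and produce uneven pages.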

It seems to work. Not sure how it would scale, but we're only dealing with ~40 instances of FooPage. There's some scope to add caching in there if required.

Alex

#176340

Edited, Mar 16, 2017 12:06