Indexing

This topic describes how indexing works when Episerver Find is integrated with Episerver CMS 7.5 or later.

How it works

Because you reference the EPiServer.Find.Cms assembly in the Episerver CMS project, published content is automatically indexed. Content is also reindexed, or deleted from the index, when it is saved, moved or deleted. Each language version is indexed as a separate document.

Indexing module

The indexing module is an IInitializableModule that handles indexing for all DataFactory events. Whenever content is saved, published, moved, or deleted, the module sends an index request to the ContentIndexer.Instance object, which performs the indexing.

ContentIndexer.Instance

The ContentIndexer.Instance singleton, located in the EPiServer.Find.Cms namespace, adds support for indexing IContent and UnifiedFile objects. It supports reindexing the entire page tree, specific language branches, and individual content items and files. When an IContent object is indexed, all of its page files are also indexed.

Invisible mode

A core feature of the ContentIndexer is its ability to work in invisible mode when indexing objects passed by the IndexingModule. In invisible mode, indexing runs in a separate thread rather than in the DataFactory event thread, so it does not delay the save or publish action. Invisible mode is the default. To override this behavior, set ContentIndexer.Instance.Invisible to false.
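
The override can be sketched as the single assignment below. It relies only on the Invisible property named above; where you place the statement is an assumption, and application startup is one reasonable spot.

```csharp
//using EPiServer.Find.Cms;

// Turn off invisible mode: index requests are handled on the
// DataFactory event thread, so save and publish actions wait
// until indexing has finished.
ContentIndexer.Instance.Invisible = false;
```

With this setting, editors may notice slower save and publish actions, so it is probably most useful when you need content to be in the index as soon as the action returns.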

Conventions

ContentIndexer.Instance has conventions for customizing indexing. For example, you can control which pages are indexed (described below) and define dependencies between pages.

Customizing pages to be indexed

To control which content is indexed, pass a verification expression to the ShouldIndex convention. By default, all published content is indexed.

For example, if you do not want to index a page type (such as LoginPageType), pass a verification expression that evaluates to false for LoginPageType to the ShouldIndex convention. Preferably, do this during application startup, such as in the Application_Start method of the global.asax file.

C#

//using EPiServer.Find.Cms.Conventions;

ContentIndexer.Instance.Conventions
  .ForInstancesOf<LoginPageType>()
  .ShouldIndex(x => false);
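
As a minimal sketch of the startup placement suggested above, the same convention can be registered from Application_Start. The class name Global and its EPiServer.Global base class are assumptions about how the site's global.asax code-behind is set up, and LoginPageType stands for a page type defined in your own project.

```csharp
using EPiServer.Find.Cms;
using EPiServer.Find.Cms.Conventions;

// Assumed code-behind class for global.asax; adjust the name and
// base class to match your site.
public class Global : EPiServer.Global
{
    protected void Application_Start()
    {
        // Register the convention once at startup.
        // LoginPageType is a page type from your own project.
        ContentIndexer.Instance.Conventions
            .ForInstancesOf<LoginPageType>()
            .ShouldIndex(x => false);
    }
}
```

Registering conventions in one place at startup keeps them easy to find and means they are in effect before editors start saving content.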

To override the default setting, add a convention for PageData and add the appropriate verification expression.

C#

//using EPiServer.Find.Cms.Conventions;
ContentIndexer.Instance.Conventions
  .ForInstancesOf<PageData>()
  .ShouldIndex(x => true);

To exclude a property from being indexed, use the JsonIgnore attribute or add a convention for it.

C#

//using EPiServer.Find.Cms.Conventions;
ContentIndexer.Instance.Conventions
  .ForInstancesOf<PageData>()
  .ExcludeField(x => x.ACL);

C#

[JsonIgnore]
public DateInterval Interval { get; set; }

File indexing

Files that implement IContentMedia are indexed by default if they have one of the following MIME types:

"text/plain"
"application/pdf"
"application/postscript"
"application/msword"
"application/vnd.openxmlformats-officedocument.wordprocessingml.document"

Changing the name or namespaces of page types

If you change the name or namespace of a page type, a mismatch occurs between the types in the index and the new page types. This might cause errors when querying, because the API cannot resolve the correct page type from what the index reports. To solve this, reindex all pages using the scheduled plugin so that the index reflects the new page types.

Improving search relevancy of attachments

By default, search relevancy for text inside an attachment is imperfect. This is because attachments are indexed in the default language, which might not match the document's content. (CMS content, in contrast, is indexed using all enabled languages to improve search relevancy.)

Also, when you browse an attachment in Find's explore view, the attachment text is not readable, because the attachment is indexed as its base64 representation.

To improve the search relevancy of text attachments, use the IAttachmentHelper interface, which enables developers to implement their own parsing of attachments. Out of the box, Episerver provides an implementation of IAttachmentHelper that uses Microsoft IFilter functionality. For this to work, the correct IFilters need to be installed on the client.

Episerver highly recommends using this implementation, which is delivered in the EPiServer.Find.Cms.AttachmentFilter NuGet package, because it improves the quality of your search.

Using the default implementation of IAttachmentHelper

1. Install the EPiServer.Find.Cms.AttachmentFilter NuGet package.
2. Determine which attachment file types you want to support (for example, PDF and Microsoft Word). Each file type has a corresponding filter. The file types and filters are listed below.
3. Download and install the selected filters.
4. Restart.
5. Add some supported file attachments to your site.
6. Log in to your website and browse to Find > Overview > Explore.
7. Find the attachments and verify that their content is stored as readable text under SearchAttachmentText$$String.

Supported file formats

Using IFilters and Episerver Find, you can parse the file types below.

adw, ai, doc, docm, docx, dwg, eps, gif, html, htm, jpeg, jpg, mm, msg, odt, ods, odp, odi, one, otf, otp, pdf, png, ppt, pptm, pptx, ps, rar, sda, sdg, sdm, sfs, sgf, smf, std, sti, stw, svg, sxd, sxi, txt, vdx, vsd, vor, vss, vst, vsx, vtx, wma, wmv, xls, xlsb, xlsm, xlsx, xml, zip

For many file types, more than one filter is available. You can find many filters on http://www.ifiltershop.com/.

A few common file types and their filters are listed below.

PDF

Adobe has a PDF IFilter, although it does not work in all environments. See http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542.
If your environment is not supported by Adobe's IFilter, try PDF-XChange Viewer from Tracker Software: http://www.tracker-software.com/product/pdf-xchange-viewer.

Microsoft Office 2010 filter packs

Microsoft's filter pack covers the file types below. Download it from https://www.microsoft.com/en-us/download/details.aspx?id=17062.

Legacy Office Filter (97-2003; .doc, .ppt, .xls)
Metro Office Filter (2007; .docx, .pptx, .xlsx)
Zip Filter
OneNote Filter
Visio Filter
Publisher Filter
Open Document Format Filter