London Dev Meetup Rescheduled! Due to unavoidable reasons, the event has been moved to 21st May. Speakers remain the same—any changes will be communicated. Seats are limited—register here to secure your spot!
You can index attachments, that is, external files such as Word and PDF documents. For a list of supported formats, see the Apache Tika documentation.
To index attachments using the .NET API, create an instance of a class that has a property of type Attachment (found in the EPiServer.Find namespace). The Attachment class constructor has a single parameter of type Func<FileStream>. Another class, FileAttachment (also in the EPiServer.Find namespace), takes a file path as its constructor parameter.
For example, create a class named Document.
public class Document
{
    public string Name { get; set; }
    public Attachment Attachment { get; set; }
}
You can index an instance of the Document class to index a Word document along with some metadata (Name in this example).
var path = "TestData/Memoirs.docx";
var document = new Document()
{
    Name = "My memoirs",
    Attachment = new FileAttachment(path)
};
client.Index(document);
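The example above uses FileAttachment, which takes a path. The Attachment constructor described earlier takes a Func<FileStream> instead. The following is a minimal sketch of building such a delegate; AttachmentStreams and ForPath are hypothetical names introduced for illustration and are not part of EPiServer.Find.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of EPiServer.Find.
public static class AttachmentStreams
{
    // Returns a delegate that opens the file for reading each time it is invoked.
    // The file is not opened until the delegate is called.
    public static Func<FileStream> ForPath(string path)
    {
        if (string.IsNullOrWhiteSpace(path))
            throw new ArgumentException("A file path is required.", nameof(path));

        return () => File.OpenRead(path);
    }
}
```

Assuming the constructor signature described above, the delegate could then be passed as new Attachment(AttachmentStreams.ForPath(path)).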
You can search the indexed Word document. For example, if it contains "Banana," the result variable below would contain a hit.
var result = client.Search<Document>()
    .For("Banana")
    .GetResult();
A REST API issue causes an exception the first time an instance of a type with an Attachment property (document in this example) is indexed. This happens only on the first attempt; after that, indexing works as expected.
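One way to handle the first-time exception is to retry the indexing call once. The following is a rough sketch, not an official workaround; FirstIndexRetry is a hypothetical helper, and it catches every exception type because the text above does not say which one is thrown.

```csharp
using System;

// Hypothetical helper for illustration; not part of EPiServer.Find.
public static class FirstIndexRetry
{
    // Runs the indexing action. If it throws, runs it exactly one more time.
    // A failure on the second attempt propagates to the caller.
    public static void Run(Action indexAction)
    {
        if (indexAction == null)
            throw new ArgumentNullException(nameof(indexAction));

        try
        {
            indexAction();
        }
        catch (Exception)
        {
            // Assumption: the failure is the known first-time issue, so a
            // single retry is enough. Narrow the exception type once you
            // know which one your client throws.
            indexAction();
        }
    }
}
```

With the earlier example, the call would look like FirstIndexRetry.Run(() => client.Index(document));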
By default, search relevancy for text inside an attachment is imperfect. This is because attachments are indexed in the default language, which might not match the document's content. (CMS content, in contrast, is indexed using all enabled languages to improve search relevancy.)
Also, when you browse an attachment in Find's explore view, the attachment text is not readable, because the attachment is indexed as its base64 representation.
To improve the search relevancy of text attachments, use the IAttachmentHelper interface, which enables developers to implement their own parsing of attachments. Out of the box, EPiServer provides an implementation of IAttachmentHelper that uses Microsoft IFilter functionality. For this to work, the correct IFilters need to be installed on the client.
The file types below can be parsed using IFilters and EPiServer Find.
adw, ai, doc, docm, docx, dwg, eps, gif, html, htm, jpeg, jpg, mm, msg, odt, ods, odp, odi, one, otf, otp, pdf, png, ppt, pptm, pptx, ps, rar, sda, sdg, sdm, sfs, sgf, smf, std, sti, stw, svg, sxd, sxi, txt, vdx, vsd, vor, vss, vst, vsx, vtx, wma, wmv, xls, xlsb, xlsm, xlsx, xml, zip
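The extension list above can be turned into a quick pre-check before indexing. The following is a minimal sketch; ParsableExtensions is a hypothetical helper, and a match only means the extension appears in the list, not that a suitable IFilter is actually installed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of EPiServer.Find.
public static class ParsableExtensions
{
    // Extensions copied from the list above, compared case-insensitively.
    private static readonly HashSet<string> Extensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "adw", "ai", "doc", "docm", "docx", "dwg", "eps", "gif", "html", "htm",
            "jpeg", "jpg", "mm", "msg", "odt", "ods", "odp", "odi", "one", "otf",
            "otp", "pdf", "png", "ppt", "pptm", "pptx", "ps", "rar", "sda", "sdg",
            "sdm", "sfs", "sgf", "smf", "std", "sti", "stw", "svg", "sxd", "sxi",
            "txt", "vdx", "vsd", "vor", "vss", "vst", "vsx", "vtx", "wma", "wmv",
            "xls", "xlsb", "xlsm", "xlsx", "xml", "zip"
        };

    // Returns true when the file name's extension appears in the list.
    public static bool IsListed(string path)
    {
        // GetExtension returns the extension with its leading dot,
        // or an empty string when the name has no extension.
        var extension = Path.GetExtension(path);
        return !string.IsNullOrEmpty(extension)
            && Extensions.Contains(extension.TrimStart('.'));
    }
}
```

For instance, ParsableExtensions.IsListed("TestData/Memoirs.docx") returns true, while a file without an extension returns false.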
For many file types, more than one filter is available. You can find many filters on http://www.ifiltershop.com/. A few common file types and filters are listed below.
Adobe has a PDF IFilter, although it does not work in all environments. See http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542.
If your environment is not supported, try the PDF-XChange Viewer from Tracker Software at http://www.tracker-software.com/product/pdf-xchange-viewer.
Microsoft's filter pack covers the file types below. Download it from https://www.microsoft.com/en-us/download/details.aspx?id=17062.
Last updated: Sep 21, 2015