Indexing/crawling VPP with Microsoft Search Server Express




Christer Pettersson


I'm trying to use Search Server to search documents uploaded to the EPiServer VPP. Each uploaded document also carries metadata in the unified file's summary (file.Summary.Dictionary), which I would also like to have indexed.

This is the solution I came up with: I have created an EPiServer page that lists all files in the VPP folder. Each file in the list links to another EPiServer page, where I retrieve the metadata for that specific document and write it to the page head as META elements.

My question is: how do I get Search Server to crawl the documents' content, and to connect the relevant metadata with the content of each file? When a user runs a search, Search Server should have an index of each file's content with its metadata attached.

As it stands, I think Search Server will treat the current solution as two separate hits: one on the metadata page and another on the content of the VPP file. I would like to merge them. I don't think it will be sufficient to add a URL to the VPP file in the metadata (I used that approach earlier with SiteSeeker, which was able to handle it).

#68166

Mar 18, 2013 10:46

Johan Kronberg


We've created an HTML representation of the file in which the required file summary data is written as META elements. The downside is that you need to interpret the file content yourself and serve it to the crawler:

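protected void OnUnifiedFileTransmitting(UnifiedFile sender, UnifiedVirtualPathEventArgs e)
{
    // Only trusted crawlers get the HTML representation; other visitors receive the file as usual.
    if (Util.IsTrustedWebCrawler())
    {
        // Build the HTML representation of the file, with summary data as META elements.
        string fileData = SearchServer.Util.FileMetadata.GetFileMetadata(sender);
        HttpContext.Current.Response.ContentEncoding = System.Text.Encoding.UTF8;
        HttpContext.Current.Response.ContentType = "text/html";
        HttpContext.Current.Response.Write(fileData);
        // End the response here so nothing else is written after the HTML.
        HttpContext.Current.Response.End();
    }
}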
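
The snippet above does not show what GetFileMetadata returns. As a rough illustration only, a helper of that kind might assemble the HTML along the following lines. This is a minimal sketch, not the code from this thread: the class name FileHtmlSketch, the Build method and its parameters are all hypothetical, and it assumes the summary entries have already been copied into a string dictionary and the file text has already been extracted.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical helper for illustration; not part of EPiServer or Search Server.
public static class FileHtmlSketch
{
    // Builds a minimal HTML document with one META element per summary entry
    // and the already-extracted file text as the body.
    public static string Build(string title, IDictionary<string, string> summary, string bodyText)
    {
        var html = new StringBuilder();
        html.Append("<!DOCTYPE html>\n<html>\n<head>\n");
        html.Append("<meta charset=\"utf-8\" />\n");
        html.AppendFormat("<title>{0}</title>\n", WebUtility.HtmlEncode(title));

        foreach (KeyValuePair<string, string> entry in summary)
        {
            // HTML-encode both name and value so quotes or angle brackets
            // in the summary data cannot break the markup.
            html.AppendFormat(
                "<meta name=\"{0}\" content=\"{1}\" />\n",
                WebUtility.HtmlEncode(entry.Key),
                WebUtility.HtmlEncode(entry.Value));
        }

        html.Append("</head>\n<body>\n");
        html.AppendFormat("<p>{0}</p>\n", WebUtility.HtmlEncode(bodyText));
        html.Append("</body>\n</html>\n");
        return html.ToString();
    }
}
```

Because the metadata and the file text end up in one document served at the file's own URL, the crawler sees a single item rather than a separate metadata page and file.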

#68173

Edited, Mar 18, 2013 12:03

Christer Pettersson


Hi Johan, thanks for the reply.

Yes, that is one approach I had thought about. I suppose your GetFileMetadata function has some parsing logic that handles different file formats? I'm looking for some kind of .doc parser to handle the request; for now I only have to deal with .doc files.

#68184

Mar 18, 2013 14:57

Johan Kronberg


We have a "TextReader that reads from an IFilter". I didn't write that code, so I'm not sure of the specifics, but I'm fairly certain it handles both PDFs and all MS Office formats.

#68186

Mar 18, 2013 15:06

Christer Pettersson


Yes, I found it, thanks Johan. I'll keep working on this then. It feels better to know that others have done this before heading down the road ;)

#68188

Edited, Mar 18, 2013 15:09
