I think Per Magne did a demo of something similar at an EPiServer Norway partner event a while ago. I'll try to get hold of the code.
Ooh, that sounds promising. It would be great if you could find something, thanks!
When crawling, save each scraped page to an object that inherits from, or looks like, UnifiedSearchHit. That way you can easily add everything to the UnifiedSearchRegistry (example: http://joelabrahamsson.com/docs/episerver-find-alloy-search-page/findinitialization.html).
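Something along these lines, perhaps (just a sketch; CrawledItem and its properties are placeholder names, and the registration call follows the pattern in the linked example):

public class CrawledItem
{
    // Naming the properties like UnifiedSearchHit's (Title, Url, Excerpt)
    // lets unified search project them by convention.
    public string Title { get; set; }
    public string Url { get; set; }
    public string Excerpt { get; set; }
}

// In an initialization module, include the type in unified search:
SearchClient.Instance.Conventions.UnifiedSearchRegistry.Add<CrawledItem>();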
I would also think about sitting tight for a bit since next gen Find will, if I recall correctly, have crawling capabilities available.
Cool, I was thinking something along those lines. Good tip about inheriting UnifiedSearchHit, thanks.
I'm also thinking about using alchemyapi.com to extract keywords and key content to add to the index. I could then have faceted results.
Oh great, that sounds perfect if it does indeed have crawling capabilities.
EPiServer are presenting to our client next week, so I'll get a question over to them about this...
Any idea when next gen Find is coming?
Thanks for the heads up!
Hi,
I did indeed create a crawler for Find at a partner event. It is very similar to your original idea, Danny.
For this particular demo I used NCrawler for the crawling part and HtmlAgilityPack for the scraping part: http://ncrawler.codeplex.com/
If you decide to use NCrawler, here is some basic code to get you started:
First, set up the crawler and execute the crawl. A scheduled job would be perfect for this (a rough sketch of one follows the snippet below):
using System;
using System.Text.RegularExpressions;
using NCrawler;
using NCrawler.HtmlProcessor;
using NCrawler.Services;

using (Crawler c = new Crawler(new Uri("http://yoururlgoeshere.com/"), new HtmlDocumentProcessor(), new Step()))
{
    // you could set a maximum crawl count or time
    c.MaximumCrawlCount = 100;
    c.MaximumCrawlTime = new TimeSpan(0, 0, 1, 0);

    // you could exclude files and certain paths
    c.ExcludeFilter = new[]
    {
        new RegexFilter(
            new Regex(@"(\.jpg|\.css|\.js|\.gif|\.jpeg|\.png|\.ico|=atom|=rss)",
                RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase))
    };

    c.Crawl();
}
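For the scheduled job part, something like this could wrap the crawl (the job class and display name are made up; [ScheduledPlugIn] is the standard EPiServer attribute for registering a job):

using System;
using EPiServer.PlugIn;
using NCrawler;
using NCrawler.HtmlProcessor;

[ScheduledPlugIn(DisplayName = "Crawl external sites")]
public class CrawlerJob
{
    // EPiServer calls this method when the scheduled job runs
    public static string Execute()
    {
        using (var c = new Crawler(new Uri("http://yoururlgoeshere.com/"),
            new HtmlDocumentProcessor(), new Step()))
        {
            c.MaximumCrawlCount = 100;
            c.Crawl();
        }
        return "Crawl finished";
    }
}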
the "Step" class is executed for every url that is crawled, and is where we do the actual indexing:
using NCrawler;
using NCrawler.Interfaces;
using EPiServer.Find.Framework;

public class Step : IPipelineStep
{
    public void Process(Crawler crawler, PropertyBag propertyBag)
    {
        // the meta tag values collected by the HtmlDocumentProcessor
        var meta = (string[])propertyBag["Meta"].Value;
        var crawledItem = new CrawledItem()
        {
            Url = propertyBag.Step.Uri.ToString().ToLower(),
            Title = propertyBag.Title,
            Text = propertyBag.Text,
            Published = MetaDataExtracter.GetPublishedFromMetaData(meta),
            // ... and so on
        };
        SearchClient.Instance.Index(crawledItem);
    }
}
You should probably put the objects in a list and then index them in bulk instead of one by one. I've written a blog post about that earlier:
http://world.episerver.com/Blogs/Per-Magne-Skuseth/Dates/2013/5/EPiServer-Find-Bulks-please/
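Roughly like this (just a sketch; the collectedItems list and the batch size of 100 are placeholders, not from the blog post):

using System.Collections.Generic;
using System.Linq;
using EPiServer.Find.Framework;

// assume collectedItems was filled with CrawledItem objects during the crawl
var collectedItems = new List<CrawledItem>();
const int batchSize = 100;
for (int i = 0; i < collectedItems.Count; i += batchSize)
{
    // one request per batch instead of one request per object
    var batch = collectedItems.Skip(i).Take(batchSize).ToArray();
    SearchClient.Instance.Index(batch);
}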
The GetPublishedFromMetaData helper method:
public static DateTime GetPublishedFromMetaData(string[] metatagValues)
{
    // guard against pages that have no "last-modified" meta tag
    var lastModified = metatagValues.FirstOrDefault(x => x.Contains("last-modified"));
    return lastModified == null
        ? DateTime.MinValue
        : Convert.ToDateTime(lastModified.Substring(15));
}
And you could then add CrawledItem, or whatever you name it, to the unified search, as suggested by Johan.
I'm not sure when the next version of Find will be released. Maybe you'll get some answers in your meeting next week :-)
That's fantastic, thanks Per. Just what I needed, and good to know my conceptual idea is not a bad one!
This should be quite an interesting project: 10,000+ pages/PDF files from multiple sites going into one highly searchable index.
Happy to help! That does indeed sound like an interesting project.
However, I would have to agree with Johan. You might want to sit tight for a while and wait for the next version of Find.
We are starting a new project that will be built with EPiServer 7.5.
Our issue is that several sections of the site will link off to existing websites.
These will be on subdomains, or sometimes subsections of existing sites.
e.g.
www.mysite.com
section.mysite.com
www.mysite.com/section2
section.mysite.com/section3
We will have no direct control over the existing sites, apart from making sure that meta tags are completed and providing updated headers and footers.
These "other" sites would fit under one of 8 main navigation items in the new site.
We want to implement a "global" search that covers not only the new EPiServer pages, but also all the pages on those existing sites.
When we get search results back, we want the user to be able to filter down further based on the section the result came from. It could be an actual child page of the section within EPiServer, but it could also be a page within one of the sub-sites.
I am envisioning creating a spider/service that would crawl the existing sites, create a simple object to represent each page, and manually add that to the index. At that stage I could define the "section" under which the site/page appears. The query side I picture roughly as in the sketch below.
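Just to illustrate what I mean (a sketch only; SearchSection and the exact method names are my assumptions based on the unified search documentation):

using EPiServer.Find;
using EPiServer.Find.Framework;

// "query" is the user's search text; "selectedSection" is the facet value they clicked
var results = SearchClient.Instance.UnifiedSearchFor(query)
    .TermsFacetFor(x => x.SearchSection)                      // facet on the section assigned at index time
    .FilterHits(x => x.SearchSection.Match(selectedSection))  // narrow hits without changing facet counts
    .GetResult();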
I hope this makes sense, but please let me know if any further clarification is needed.