Using Google Mini as an EPiServer search solution
Google Mini can provide a low-cost alternative to Lucene-based searching in EPiServer. Although its text-based search is simple and powerful, the devil, as ever, is in the detail.
What Google Mini does
Google Mini is a stand-alone hardware unit that provides a configurable “Google in a box” on your network. It indexes site content by reading web pages and following links, deciding on page relevancy in much the same way as the real search engine.
A single Google Mini box can be used to provide search for any number of different sites. All the crawled content is kept in a central index and you can define any number of sub-sets of this index as collections, allowing for a rudimentary partitioning of your search index. Google Mini also allows you to create different “front-ends”, each of which is a collection of client settings that include different keyword matches and synonyms.
It’s very easy to get up and running. All you do is point it at a website, wait for the content to be crawled and then start firing requests at it. The Google Mini API is straightforward, relying on a RESTful call to the device that returns a set of results as XML. Putting together an API abstraction to format requests and deserialize responses is a pretty trivial development task.
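As a rough illustration of how small that abstraction can be, here is a minimal sketch in Python (chosen for brevity; an EPiServer site would do the same in .NET). The query parameters (`q`, `site`, `client`, `output`, `start`, `num`) and the result element names (`R`, `U`, `T`, `S`, `MT`) are written from memory of Google's XML search protocol, so treat them as assumptions and check them against the documentation for your device. The host, collection and front-end names are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


@dataclass
class SearchHit:
    url: str
    title: str
    snippet: str
    meta: dict  # meta-tag name -> value, as returned with the hit


def build_search_url(host, query, collection, front_end, start=0, num=10):
    """Build the URL for one page of results (parameter names are assumptions)."""
    params = {
        "q": query,
        "site": collection,      # which collection to search
        "client": front_end,     # which front-end settings to apply
        "output": "xml_no_dtd",  # ask for XML rather than rendered HTML
        "start": start,
        "num": num,
    }
    return f"http://{host}/search?{urlencode(params)}"


def parse_results(xml_text):
    """Turn the XML response into a list of SearchHit objects."""
    root = ET.fromstring(xml_text)
    hits = []
    for r in root.iter("R"):
        meta = {mt.get("N"): mt.get("V") for mt in r.findall("MT")}
        hits.append(SearchHit(
            url=r.findtext("U", default=""),
            title=r.findtext("T", default=""),
            snippet=r.findtext("S", default=""),
            meta=meta,
        ))
    return hits
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response body to `parse_results` is all that is left; keeping the parsing separate from the HTTP call makes it easy to test against saved responses.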
It’s important to bear in mind that Google Mini is still a bit of a “black box”. You can filter your search and adjust the way in which it indexes a site, but you cannot go in and directly tweak any of the indexed content. It merely reports back on what it sees, and optimising your EPiServer page output for Google Mini is where the real work lies in an implementation.
Maximising page relevance
If you want your search results to make sense, it is up to you to make sure that your page mark-up is optimised for search engines. As this is a Google-based engine, the normal SEO rules apply: your page structure needs to be search-friendly and your internal linking needs to distribute page rank appropriately.
The amount of unique content in a page is also important, as this is one of the ways in which Google decides what a page is “really about”. Many page designs use a lot of boilerplate mark-up that is pretty much the same on every page, e.g. a large header, footer and side bar. With this kind of web design the unique content on each page can be a relatively small proportion of the overall content.
One way of getting Google Mini to focus on the relevant content is to selectively render page content for the Google Mini device. You can assign a unique user agent string to the Google Mini so you can detect it when a page loads and choose which parts of the page to render.
Although this can be an effective way of optimising Google Mini indexing, do not be tempted to selectively render content for external search engines. Google regards this as a spamming technique and it will penalise your site in the search rankings if it discovers that you’re sending out different content to search engine spiders.
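The user agent check described above can be sketched as follows. This is only an illustration in Python: the user agent token and the region names are hypothetical, and in an EPiServer site the same decision would be made in the page template or master page.

```python
# Hypothetical token: use whatever unique user agent string you have
# assigned to your own Google Mini device.
MINI_USER_AGENT_TOKEN = "example-mini-crawler"

# Hypothetical names for the regions of a page template.
ALL_REGIONS = ["header", "sidebar", "main_content", "footer"]
CRAWLER_REGIONS = ["main_content"]


def is_mini_crawler(user_agent):
    """True if the request's User-Agent header contains the Mini's token."""
    return bool(user_agent) and MINI_USER_AGENT_TOKEN in user_agent.lower()


def regions_to_render(user_agent):
    """Render only the unique content for the Mini, everything for visitors."""
    if is_mini_crawler(user_agent):
        return list(CRAWLER_REGIONS)
    return list(ALL_REGIONS)
```

Because the device finds pages by following links, whatever you strip out for the crawler should not include the only links to content that you want indexed.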
Property-based search via meta-tags
Google Mini provides fantastic text-based search complete with all the richness and intelligence you’d expect from Google-based search. However, many websites require a search solution that provides more than just text-based searching. In particular, some level of filtering by specific page properties is often required.
There’s nothing to stop you using FindPagesWithCriteria() in some circumstances, but Google Mini does provide a meta-data catalogue that you can use to support property-based searching. You can publish any EPiServer property as a meta-tag and Google Mini will store these tags, allowing you to specify them as parameters for searches. The tags are also returned as part of the search result information for each page, allowing you to pick up information about a page without having to resort to the EPiServer data factory.
Ideally, you would only want to publish these tags to the Google Mini device rather than exposing details of your EPiServer implementation to normal website visitors. This is easy enough to accomplish given that you can assign a unique user agent string to the Google Mini device.
The Google Mini meta-tag catalogue does have some limitations as it will only store up to 1,000 characters and the only value data types it recognises are strings and dates. It can be used to support most simple property-based search scenarios, but if your requirements are for more sophisticated faceted search then you ought to be looking elsewhere for your search solution.
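Publishing properties as meta-tags is mostly a matter of careful formatting. The sketch below, in Python for illustration, takes a dictionary of property names and values and renders one tag per property, writing dates in a single consistent format and leaving everything else as a string. The property names are hypothetical and the `YYYY-MM-DD` date format is an assumption; check which date formats your device is configured to recognise.

```python
from datetime import date
from html import escape


def render_meta_tags(properties):
    """Render page properties as HTML meta-tags, one per line."""
    tags = []
    for name, value in properties.items():
        if value is None:
            continue  # nothing to publish for an empty property
        if isinstance(value, date):  # also covers datetime values
            text = value.strftime("%Y-%m-%d")
        else:
            text = str(value)
        # escape() also escapes quotes, so values are safe inside attributes
        tags.append(f'<meta name="{escape(name)}" content="{escape(text)}" />')
    return "\n".join(tags)
```

For example, `render_meta_tags({"Category": "News", "PublishDate": date(2010, 3, 1)})` produces two tags that could be written into the page header when the request comes from the Mini's user agent.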
Custom sorting via meta-tags
By default, Google Mini only allows you to sort search results by relevance or the date on which the content was originally indexed. This isn’t enough for more specialised searches that may need to sort results by a particular EPiServer property.
The Google Mini meta-tag catalogue can be used to sort results, but you will have to construct a full result set first and then sort it manually. Given that Google Mini returns a maximum of 100 search results with each call, this means making successive calls to the device to build up a large result set before you can sort it. Once you’ve sorted the result set you should also consider caching it somewhere rather than suffering the overhead of having to re-construct it for successive pages of results.
This is not an ideal solution – it is, in fact, nothing more than a slightly dirty-feeling work-around. That said, it does work for simple sorting scenarios that do not involve more than a few thousand results. More demanding sorting scenarios may demand a different search technology.
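The work-around described above amounts to three steps: page through the device 100 results at a time, sort the combined set on a meta-tag value, and cache the sorted set so that later pages are served from memory. A minimal Python sketch follows. It assumes a hypothetical `fetch_page(start, num)` function that calls the device and returns hits as dictionaries with a `meta` entry, and it uses a plain in-process dictionary where a real site would use its own cache with an expiry policy.

```python
PAGE_SIZE = 100  # the most the device returns per call, as noted above

_sorted_cache = {}  # (query, sort_key) -> sorted list of hits


def collect_all(fetch_page, max_results=3000):
    """Call fetch_page repeatedly until the device runs out of results."""
    results = []
    start = 0
    while start < max_results:
        page = fetch_page(start, PAGE_SIZE)
        results.extend(page)
        if len(page) < PAGE_SIZE:
            break  # a short page means there is nothing more to fetch
        start += PAGE_SIZE
    return results[:max_results]


def sorted_results(query, sort_key, fetch_page):
    """Return all hits for a query sorted by a meta-tag, building them once."""
    cache_key = (query, sort_key)
    if cache_key not in _sorted_cache:
        hits = collect_all(fetch_page)
        # Values are compared as strings; hits without the tag sort last.
        hits.sort(key=lambda h: (sort_key not in h["meta"],
                                 h["meta"].get(sort_key, "")))
        _sorted_cache[cache_key] = hits
    return _sorted_cache[cache_key]


def page_of(results, page_number, page_size=10):
    """Slice one page (1-based) out of the sorted result set."""
    first = (page_number - 1) * page_size
    return results[first:first + page_size]
```

Because the values come back as strings, numeric properties need to be zero-padded when they are published, or converted before sorting, if they are to sort in numeric order.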
Crawling meta-data on files in the VPP
Although Google Mini is very effective at crawling text-based content there is no way for it to pick up any meta-data associated with files in your VPP. It can index image and video files, but will only index the filename. If you have a large body of images and videos that have been described and categorised via file summaries then you will need a work-around to ensure that this information will be incorporated into search.
The solution that we used was to create an EPiServer page provider that reads a VPP and renders a web page for each file that contains the file summary information. These pages are only visible to Google Mini so if a user actually visits the page then they are redirected to the actual resource rather than the page provider page.
Again, this is a work-around, but an effective and flexible way of incorporating file summaries into your search index.
Duplicates on the master language branch
One problem that is specific to EPiServer and Google Mini is caused by the URL language selector on the master language branch. Google Mini will regard the following URLs as separate pages, even though the content is exactly the same.
www.example.com/mypage
www.example.com/en/mypage
To overcome this problem, Google recommend that you add a link tag specifying the canonical URL to the page’s header section. Although using a canonical URL can help, and is regarded as good practice in any event, we have found that it is not necessarily enough to guarantee that duplicate results will not appear with a Google Mini.
Ultimately, if you want to be sure that duplicate pages on the master language branch will not appear in your results then you will need to filter them out of the Google Mini index – this is easy enough to do in the Google Mini crawl and collection settings.
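For the canonical link tag mentioned above, the only real logic is deciding which of the two URLs is canonical. The Python sketch below assumes that the version without the language segment is the preferred one and that `en` is the master language; both are assumptions to adjust for your own site.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit

MASTER_LANGUAGE = "en"  # assumption: the master language branch of the site


def canonical_url(url):
    """Strip a leading master-language segment, e.g. /en/mypage -> /mypage."""
    parts = urlsplit(url)
    segments = parts.path.split("/")  # "/en/mypage" -> ["", "en", "mypage"]
    if len(segments) > 1 and segments[1].lower() == MASTER_LANGUAGE:
        del segments[1]
    path = "/".join(segments) or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def canonical_link_tag(url):
    """Build the link tag to place in the page's header section."""
    return f'<link rel="canonical" href="{escape(canonical_url(url))}" />'
```

The same `canonical_url` function can also serve as a key for spotting duplicates in a result set if any slip through the crawl settings.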
Low-cost search, but beware of the implementation time
A Google Mini box requires a relatively small investment – a single unit costs only £2,400. This is cheaper than any licensed search solution for EPiServer but it does come at the cost of extra development work.
One advantage of a Google Mini implementation is that you will almost certainly be left with a site that is very heavily optimised for search engines. The need to define appropriate page titles and descriptions, think through your internal linking and tweak output for relevancy will pay dividends with your external search.
However, if your search requirements involve complex facets, extensive property-based sorting or more than a few hundred thousand pages then you may need to invest more heavily in your search technology.
We've built faceted search with Google Mini for one of our customers: http://www.byggnadsarbetaren.se/sok/?q=test
The bad thing is that you only get metadata from the hits; there is no way to query Google Mini for metadata alone. You can also only retrieve 100 hits at a time, which is a problem for us when we're using server-side paging and the like.
Nice blog post!
I've used Google Mini on a few sites and I really like it. I would say that Siteseeker is preferable, but then you need $. I also tried Search Server Express once, and I will never try it again. :)
One bad thing with Google Mini is that when something goes wrong (e.g. a hardware failure or a network-related problem) it can be pretty hard to find what's causing the problem.
The trick with these third-party solutions is always permissions. How do you ensure that the Mini can index everything it needs to, but only shows people the results they should be able to see?
Do you early-bind the permissions (when you crawl) or late-bind them (when you display)?
We also did some faceted search with a Mini on Memorex. Example:
http://www.memorex.com/en-us/search/?q=dvd