More Like/Related
This topic explains how to build functionality to search for related objects using the MoreLike method in Episerver Find.
How it works
Using the MoreLike method, it is possible to find documents whose text content is "like" a given string. This functionality is typically used for, but not limited to, finding related documents or objects.
Examples
A simple example can look like this:
// Find blog posts whose text content is similar to the string "guitar".
var searchResult = client.Search<BlogPost>()
    .MoreLike("guitar")
    .GetResult();
After invoking the MoreLike method, we can customize the search query with a number of methods. For instance, if the index does not contain many documents with similar content, we will probably want to lower the minimum document frequency requirement. Words that occur in fewer documents than this threshold are ignored; the default threshold is five.
// Accept words that occur in as few as one document.
var searchResult = client.Search<BlogPost>()
    .MoreLike("guitar")
    .MinimumDocumentFrequency(1)
    .GetResult();
A full list of extension methods for customizing the query follows below. Before we look at those, let us look at an example of finding documents "related" to a given document. Assuming we have indexed two BlogPosts with similar content, we can search for documents similar to the first and expect to find the second, using a query such as this:
// firstBlogPost: some indexed blog post about guitars.
// secondBlogPost: another indexed blog post about guitars, expected in the result.
var searchResult = client.Search<BlogPost>()
    .MoreLike(firstBlogPost.Content)
    .MinimumDocumentFrequency(1)
    // Exclude the source post itself from the result.
    .Filter(x => !x.Id.Match(firstBlogPost.Id))
    .GetResult();
Note: Consider caching the results of these types of queries. The result is unlikely to change very often, and even when it does, a delay of a few minutes rarely matters.
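One way to cache such a query is sketched below. It assumes that the StaticallyCacheFor extension method is available in your version of Find, and it uses a hypothetical BlogPost class and a five-minute duration chosen purely for illustration. Treat it as a starting point rather than a recommended configuration.

```csharp
using System;
using EPiServer.Find;

// Hypothetical document type; the property names mirror the examples above.
public class BlogPost
{
    public string Id { get; set; }
    public string Content { get; set; }
}

public static class RelatedBlogPosts
{
    // Returns posts similar to the given post, excluding the post itself.
    // Assumption: StaticallyCacheFor is available and caches the result
    // in memory for the given time span.
    public static SearchResults<BlogPost> FindRelated(IClient client, BlogPost source)
    {
        return client.Search<BlogPost>()
            .MoreLike(source.Content)
            .MinimumDocumentFrequency(1)
            .Filter(x => !x.Id.Match(source.Id))
            .StaticallyCacheFor(TimeSpan.FromMinutes(5)) // illustrative duration
            .GetResult();
    }
}
```

A longer cache duration reduces the number of requests sent to the index, at the cost of related posts appearing with a longer delay.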
Customization methods
Because the nature of the content can differ greatly between indexes and types, it is often a good idea to experiment with the many settings available after invoking the MoreLike method. Below is a list of all methods that can be called to customize the query. See also the Elasticsearch guide.
MinimumDocumentFrequency
The minimum number of documents in which a word must occur. Words that occur in fewer documents than this are ignored. Default is 5.
MaximumDocumentFrequency
The maximum number of documents in which a word may occur. Words that occur in more documents than this are ignored. Default is unbounded.
PercentTermsToMatch
The percentage of terms to match on. Default is 30 (percent).
MinimumTermFrequency
The minimum number of times a term must occur in the source document. Terms that occur less often are ignored. Default is 2.
MinimumWordLength
The minimum word length below which words are ignored. Default is 0.
MaximumWordLength
The maximum word length above which words are ignored. Default is unbounded (0).
MaximumQueryTerms
The maximum number of query terms that are included in any generated query. Default is 25.
StopWords
A list of words that are considered “uninteresting” and are ignored.
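The methods listed above can be chained after MoreLike. The following is a minimal sketch: the BlogPost class is hypothetical, every value is illustrative rather than recommended, and the parameter types (integers, and a string collection for StopWords) are assumptions to verify against your version of Find.

```csharp
using EPiServer.Find;

// Hypothetical document type; the property names mirror the examples above.
public class BlogPost
{
    public string Id { get; set; }
    public string Content { get; set; }
}

public static class TunedMoreLikeSearch
{
    public static SearchResults<BlogPost> FindRelated(IClient client, BlogPost source)
    {
        // The customization calls are placed directly after MoreLike,
        // before Filter, in the same order as in the examples above.
        return client.Search<BlogPost>()
            .MoreLike(source.Content)
            .MinimumDocumentFrequency(1)    // keep words found in only one document
            .MaximumDocumentFrequency(1000) // ignore words found in more than 1000 documents
            .PercentTermsToMatch(20)        // require 20 percent of the terms to match
            .MinimumTermFrequency(1)        // keep terms that occur once in the source text
            .MinimumWordLength(3)           // ignore words shorter than 3 characters
            .MaximumWordLength(30)          // ignore words longer than 30 characters
            .MaximumQueryTerms(15)          // use at most 15 terms in the generated query
            .StopWords(new[] { "the", "and", "with" }) // assumed to accept a string collection
            .Filter(x => !x.Id.Match(source.Id))       // exclude the source post itself
            .GetResult();
    }
}
```

In practice you would rarely set every option at once. Starting with MinimumDocumentFrequency and MinimumTermFrequency and adjusting the others only when the results are too broad or too narrow tends to be easier to reason about.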
Last updated: Nov 16, 2015