Viktor Sahlström
Aug 22, 2016

Episerver Find stemming

One of the things that makes search challenging (and really interesting) is language handling. Natural languages often behave differently from what a programmer is used to, because their rules look more like a never-ending set of exceptions than actual rules. To analyze a string of text properly, the system must understand all of these exceptions.

A great feature of Episerver Find is stemming. Stemming reduces an inflected word to its root form (a.k.a. stem). For example, "fishing", "fished", and "fisher" all share the root "fish." Once the root is determined, it can be used to return the full set of related items, which improves the retrievability and relevancy of search results.
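To make the idea concrete, here is a minimal sketch of stemming at index and query time. It is a toy suffix stripper written for illustration, not the Snowball algorithm and not Episerver Find's implementation; the suffix list and the names `toy_stem`, `build_index`, and `search` are assumptions made up for this example.

```python
# Toy suffix-stripping stemmer, for illustration only.
# Real Snowball stemmers apply many more rules and conditions.
SUFFIXES = ("ing", "ed", "er", "s")  # assumed, minimal rule set


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict) -> dict:
    """Map each stem to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index: dict, query: str) -> list:
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(toy_stem(query), set()))


docs = {1: "fishing trip", 2: "he fished", 3: "cooking"}
index = build_index(docs)
print(search(index, "fisher"))  # documents 1 and 2 share the stem "fish"
```

Because the query and the documents go through the same stemmer, a search for "fisher" also finds documents that only contain "fishing" or "fished".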

Episerver Find uses the Snowball stemmers that ship as the default stemmers with Elasticsearch. These stemmers handle the general rules quite well but do not handle all special cases. For many languages (English, for example) this works very well. But depending on the complexity of the language and the maturity of the stemmer, it is not always enough. Swedish is a case where the default stemmer falls short: it often reduces unrelated words to the same stem, which in turn causes unexpected search hits.

As an example, consider the Swedish words “bananen” (the banana) and “banans” (the race tracks). Using normal Swedish stemming rules, they would both be stemmed down to “banan.” In this case, any search that is also stemmed down to “banan” returns hits for both words, even though only one of them is relevant to the person searching.
 
To fix this, a list of exceptions has been added to the Find stemming. We have started with Swedish and will look at additional languages going forward. The new algorithm recognizes that “bananen” and “banans” are different words even though their stem is the same. Hence, it creates unique tokens from them so the search engine can distinguish them at query time. This is a great improvement to search relevancy in many cases. One thing that remains to be solved is the case of “banan” (banana) and “banan” (the race track). In this form, the words are spelled exactly the same and cannot be distinguished without looking at the context. For these cases, search results are returned for both words.
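The effect of an exception list can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual Find algorithm: the two-suffix stemmer, the `EXCEPTIONS` table, and the token names `banan_1` and `banan_2` are all hypothetical and exist only to show how a lookup that runs before the general rules keeps colliding words apart.

```python
def toy_swedish_stem(word: str) -> str:
    """Toy stand-in for a general-purpose stemmer: strips "en" or "s"."""
    word = word.lower()
    for suffix in ("en", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical exception list: surface form -> unique token.
# The token names are invented for this sketch.
EXCEPTIONS = {
    "bananen": "banan_1",
    "banans": "banan_2",
}


def stem_with_exceptions(word: str) -> str:
    """Consult the exception list first, then fall back to the general rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return toy_swedish_stem(word)


# Without exceptions, both words collide on the same stem.
print(toy_swedish_stem("bananen"), toy_swedish_stem("banans"))
# With exceptions, each word gets its own token.
print(stem_with_exceptions("bananen"), stem_with_exceptions("banans"))
# The bare form "banan" is not in the list and stays ambiguous.
print(stem_with_exceptions("banan"))
```

Note that the sketch leaves the bare form “banan” untouched, which matches the remaining limitation described above: identical spellings cannot be separated by a word list alone.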

To keep the list updated, we would love users and partners to let us know if they find searches that return unexpected results.


Comments

David Tellander Mar 1, 2017 11:47 AM

Hi Viktor, 

I started a thread in the Find forum, before I saw this post, with some examples of words that result in a lot of false positives because of stemming. It seems most of these words get stemmed to common words that should be stop words according to this Snowball stop word list.

/David

Per Atle Holvik Apr 24, 2018 01:53 PM

Hi Viktor,

Is there an exception list for Norwegian? If yes, are you perhaps using this one? http://snowball.tartarus.org/algorithms/norwegian/stop.txt In that case, the search for "oversette" (translate) does not seem to be recognizing "over" as a stop word, at least not in Find 9.6.

/Per Atle
