Switching From Lucene to Solr

In this blog post I will talk about how easy (or hard) it is to switch from Lucene to Solr.

EPiServer Commerce R2 SP2 comes with three providers: Lucene (default), Solr and Solr 3.5.

While the Lucene provider runs on .NET, Solr is based on Java and therefore needs to be hosted on Tomcat, which is a Java web server.

Because of this, the requirements for Solr are a bit different from those for Lucene. First of all, you will need the Java Runtime (JRE) installed, or the full JDK. Once Java is installed, Tomcat is next. You can download either version 7 or 6. The important part is that you choose the 32-bit/64-bit Windows Service Installer.

Installation

Extract the shipped Solr implementation to Tomcat. You can find it in the MediachaseECF folder in the web root of your CMS site.

Open the zip file containing the Solr implementation and copy the two folders in \Tools\Search\SolrServer(350) to your Tomcat installation folder.

After that you need to configure both the CMS and Commerce Manager sites to use Solr by changing the Mediachase.Search.config file. You can read how in the documentation linked below.

You can find more thorough installation steps in the Commerce documentation.

The steps outlined in that document work for the older Solr implementation as well, with one exception: the fields in the schema config are not generated automatically, so you will have to configure the fields yourself.

After you have followed the steps you should be able to build your index and search it. You can verify that the index has been built, and see what it contains, by visiting the Solr admin pages, which by default are located at http://localhost:8080/solr/. One query you can try is "_content:a*", which translates to "search the field _content for terms that begin with a" (short query syntax guide). Note that only fields of the type text can be searched this way; a string field will not let you use wildcards. It is also good to know that Solr doesn't support wildcards at the beginning of a search term.
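If you would rather run such a query from a script than from the admin pages, a minimal Python sketch could look like the one below. It only builds the request URL. I am assuming here that Solr exposes its standard /select handler under the base URL mentioned above; the function name and the rows and wt parameters are my own choices for illustration, not something taken from the Commerce provider.

```python
from urllib.parse import urlencode

# Assumption: Solr answers on the base URL from the post and exposes the
# standard /select handler; adjust both to match your installation.
SOLR_BASE_URL = "http://localhost:8080/solr"


def build_select_url(query, rows=10, base_url=SOLR_BASE_URL):
    """Return a URL that runs `query` against Solr and asks for JSON output."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Prefix query on the _content field: terms that begin with "a".
    print(build_select_url("_content:a*"))
```

Paste the printed URL into a browser, or fetch it with urllib.request.urlopen, to see the matching documents. Note that the colon and the asterisk are percent-encoded in the URL.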

Default fields

The _content field we used above is special for two reasons. First, it is the default field, which means that a query that doesn't specify a field searches the _content field:

q=test => _content:test

Second, the default config copies a few fields into this field, making it very rich in content.

For example, the default config can look like this:

<copyField source="name" dest="_content"/>

This tells Solr to copy whatever is in the name field to the _content field, adding to whatever information already was there.
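To make the effect of a copyField rule a bit more concrete, here is a toy model in Python. It is not how Solr implements it, only a sketch of the idea: values from the source field are added to the destination field at index time. The sample entry and the function name are made up for illustration; only the name to _content rule comes from the config above.

```python
def apply_copy_fields(document, copy_fields):
    """Return a copy of `document` with each source value appended to its dest.

    This is a toy model of the copyField idea, not Solr's implementation.
    """
    result = {name: list(values) for name, values in document.items()}
    for source, dest in copy_fields:
        # Read from the original document so copies never chain.
        result.setdefault(dest, []).extend(document.get(source, []))
    return result


if __name__ == "__main__":
    # Hypothetical entry; only the name -> _content rule comes from the post.
    entry = {"name": ["Candy bar"], "_content": ["Sweet chocolate snack"]}
    indexed = apply_copy_fields(entry, [("name", "_content")])
    print(indexed["_content"])
```

The name field itself is left untouched; _content simply ends up holding both its own text and the copied name.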

It is also important because the default Solr search provider implementation does not add any field names to your search terms, so searching for "candy" on the front-end site leads to a "q=candy" query, which equates to _content:candy thanks to _content being the default field.

TIP: Tomcat logs every query, so if you want to see what your queries look like when they are sent to Solr, look in the logs directory of your Tomcat installation folder.
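If the log files get long, a small script can pull the queries out for you. The sketch below assumes that Tomcat's access log is enabled and written in the common log format, and that the queries arrive as GET or POST requests with the q parameter in the URL; check your own log files before relying on it. The sample line is made up.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the request part of a common-log-format line, e.g.
# "GET /solr/select?q=candy HTTP/1.1"
REQUEST_PATTERN = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')


def extract_queries(log_lines):
    """Yield the decoded q parameter of every logged request that has one."""
    for line in log_lines:
        match = REQUEST_PATTERN.search(line)
        if not match:
            continue
        params = parse_qs(urlsplit(match.group(1)).query)
        for value in params.get("q", []):
            yield value


if __name__ == "__main__":
    # Made-up sample line in the common log format.
    sample = [
        '127.0.0.1 - - [09/Jul/2012:10:15:00 +0200] '
        '"GET /solr/select?q=_content%3Acandy&rows=10 HTTP/1.1" 200 512',
    ]
    print(list(extract_queries(sample)))
```

To use it on a real file, pass an open file object instead of the sample list. Queries sent in a POST body will not show up, since only the URL is logged.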

As you can see, getting the _content field right is important both for what information you store and for the index size. The more index data you have, the longer the indexing will take. That leads me to the different options you have for index fields. The comment in catalog.schema.xml does a pretty good job of explaining them, but I will do a quick recap of the most important ones below.

Configure index fields

name: Mandatory. The name of the field
type: Mandatory. The name of a previously defined type from the <types> section
indexed: True if this field should be indexed (searchable or sortable)
stored: True if this field should be retrievable
multiValued: True if this field may contain multiple values per document

Let's take an example and see how it is defined:

<field name="_node" type="string" indexed="true" stored="true" multiValued="true" />

_node holds the catalog node in which the node/SKU is located. It is a string type, which means we can't use wildcards or anything else that differs from what is stored in the index (it is case sensitive!). Finding SKUs located in the node "DefaultNode" means we need to search on the whole word and match the casing.

q=_node:DefaultNode

Furthermore, the field is indexed, which means we can search and sort on it. It is also stored, which means we can get the actual value from the index itself and do not need to fetch the whole object (DTO) from the database to get the value. Finally, it is multivalued, which means it can hold several values for a single entry. This makes sense, as an entry can be in several nodes through linking.
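If you want a quick overview of what a field definition allows, you can read the attributes with a few lines of Python. This is only a sketch: the function name and the keys of the returned dict are mine, and treating a missing attribute as false is a simplification, since the real default can depend on the field type.

```python
import xml.etree.ElementTree as ET

# The _node definition from the schema, as shown above.
FIELD_XML = (
    '<field name="_node" type="string" indexed="true" '
    'stored="true" multiValued="true" />'
)


def describe_field(field_xml):
    """Return the field's name, type and boolean flags as a dict."""
    element = ET.fromstring(field_xml)

    def flag(attribute):
        # Simplification: a missing attribute is treated as false here.
        return element.get(attribute, "false").lower() == "true"

    return {
        "name": element.get("name"),
        "type": element.get("type"),
        "searchable_or_sortable": flag("indexed"),
        "retrievable": flag("stored"),
        "multiple_values": flag("multiValued"),
    }


if __name__ == "__main__":
    print(describe_field(FIELD_XML))
```

Running the same function over every field element in catalog.schema.xml gives a compact list of which fields you can search on and which values you can read straight from the index.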

When you have all the index fields defined just the way you like them, and you are comfortable with how long the indexing takes (again, more data means longer indexing), you are good to go!

Bonus

Unless you want to change the search provider that is...

Right now we offer the source for the Solr 3.5 search provider as a download on EPiServer World. Download and compile it and you have more freedom over how the final query looks when it is sent to Solr. This greater control can bring the search results closer to your and your customers' liking. After all, many customers have specific needs when it comes to search, and a prebuilt one-size-fits-all solution will get in the way of that.

One of the methods you want to take a look at is BuildQuery in SolrSearchQueryBuilder.cs, where the actual query is built.

The End

As you can see, switching from Lucene to Solr is quite easy; it's the customization part that can be hard.

Jul 09, 2012

Comments

Jul 18, 2012 06:01 PM

Thanks!
The part of "The important part is that you choose to download the 32-bit/64-bit Windows Service Installer." did the trick, I used the 64-bit Windows zip version and wasn't able to get the indexing running, but every thing started working fine after installing the Windows service version.
