How do I reset EPiServer's search index?

My SearchDataSource control is not returning pages that have been added recently.  I'm assuming that it's stopped indexing, for some reason.

I'm just a little confused about the search architecture -- EPiServer has a "CMS Indexing Service," but my review of the API and the DLLs tells me that EPiServer is using Microsoft Indexing Service (if so, I can't figure out what catalog they're using...).  Additionally, there's all sorts of Lucene integration floating around.  The whole thing is a little black-box-ish.

So, in the end, what actually maintains the search index in EPiServer?  And how do I get it to start indexing again?  Is there some way to "reset" it?

#40816
Jun 17, 2010 18:47

Try restarting the EPiServer indexing service. You should easily find it under Services. If you check the VPP folders, there should be an indexing folder.

There's a setting in web.config for the delay before a newly created page gets indexed.

But I think restarting the service should do it. I've had to reset it several times in different environments.

/Per

#40819
Jun 17, 2010 21:51

Per:

Thanks, but that hasn't really solved it.  Search works great on pages we created long ago, but the newer the page, the less likely it is to be in the index.  It's like the indexer stopped indexing at some point in time.

I'd really like some more insight on how this thing works behind the scenes.  I used Reflector to dig through the SearchDataSource control, and I see that there's a method in there that calls IndexServerSearch, which actually uses Microsoft Indexing Service to run a search on...something.  I checked and I still only have the System and Web catalogs, and I never set anything else up, so I have no idea what index it's searching.

And, on top of all this, I have no idea how Lucene fits into all this.

Deane

#40842
Jun 18, 2010 18:16

A quick overview:

* The versioned VPP (files) is using the EPiServer Indexing Service which uses Lucene for the index.

* The native VPP (files) is using Microsoft Indexing Service and is the "classic" implementation and not enabled by default. Not dependent on the EPiServer Indexing Service.

* Searching for pages is using a custom search implementation stored as keywords in the database (see tblKeyword, tblPageKeyword). It listens for events and is not dependent on any indexing service. Implemented in EPiServer.LazyIndexer.

We are looking into consolidating this for a future version.

#40855
Jun 21, 2010 11:14

So Microsoft Indexing Services is essentially deprecated by default?

To sum up --

(1) Lucene indexes binary files, and (2) a custom SQL implementation indexes pages.

#40867
Edited, Jun 21, 2010 14:53

You are correct. We still support native file systems using MS Indexing, but by default we don't use it since you don't get permanent links and versioning if you go that route.

#40871
Jun 21, 2010 15:23

Thanks, Per.  I have a ticket open with support on this, and it's gotten really odd.  In particular, since the page indexing is event-based, not service-based, it couldn't have just "stopped indexing."  Also, support has had me run SQL queries on the keywords in the database, and the results are not consistent.

I'll report back here with the solution.  I appreciate the background info -- that helped clear up some questions for me.

#40872
Jun 21, 2010 15:45

One workaround is to delete the contents of tblKeyword and tblPageKeyword, and then run the following code:

// Collect the ID of every page in the site.
ArrayList array = new ArrayList();
IList pages = new EPiServer.DataAccess.PageListDB().ListAll();
foreach (EPiServer.Core.PageReference page in pages)
{
    array.Add(page.ID);
}

// Hand the IDs to IndexPageJob to start re-indexing.
IndexPageJob job = new IndexPageJob((int[])array.ToArray(typeof(int)));
job.Execute();

This will start the re-indexing process. It is time-consuming, and you only need to start it once.

#40894
Edited, Jun 22, 2010 8:09

Mari:

Thanks for this code.  I've wrapped a Scheduled Job around it, and I'm running it now.

One question, though -- Reflector tells me that IndexPageJob just passes the IDs to LazyIndexer which queues them up.  What process actually clears the queue?  Is this done in the Web process, or is it the EPiServer CMS Indexing Service?

Deane

#40904
Jun 22, 2010 17:14

LazyIndexer has a timer that checks the queue every minute (the "lazy" part, to get better performance when a lot of pages are being published). It is done in the web process.

The IndexPageJob is used internally when the application is being shut down to make sure we don't lose unprocessed pages in the queue; that's why it looks a bit strange and just queues up pages.

You could also call LazyIndexer.IndexPage(pageID) to force an instant re-index of a page (no queues or timers involved).
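
To make the queue-plus-timer idea concrete, here is a minimal, self-contained sketch of that pattern. It is not the LazyIndexer source: the class name LazyQueueSketch, its members, and the indexPage callback are all hypothetical, and the interval is whatever the caller passes in.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical illustration of a "lazy" indexing queue; NOT EPiServer's LazyIndexer.
public sealed class LazyQueueSketch : IDisposable
{
    private readonly object _sync = new object();
    private readonly HashSet<int> _pending = new HashSet<int>();
    private readonly Action<int> _indexPage;   // assumed callback that indexes one page
    private readonly Timer _timer;

    public LazyQueueSketch(Action<int> indexPage, TimeSpan interval)
    {
        _indexPage = indexPage;
        // The timer drains the queue on a fixed interval.
        _timer = new Timer(_ => Flush(), null, interval, interval);
    }

    public int PendingCount
    {
        get { lock (_sync) { return _pending.Count; } }
    }

    // Queue a page for later indexing; repeated publishes of one page collapse.
    public void Enqueue(int pageId)
    {
        lock (_sync) { _pending.Add(pageId); }
    }

    // Index a single page immediately, bypassing the queue and the timer.
    public void IndexNow(int pageId)
    {
        lock (_sync) { _pending.Remove(pageId); }
        _indexPage(pageId);
    }

    // Drain everything that is currently queued; returns how many pages were indexed.
    public int Flush()
    {
        int[] batch;
        lock (_sync)
        {
            batch = new int[_pending.Count];
            _pending.CopyTo(batch);
            _pending.Clear();
        }
        foreach (int id in batch)
        {
            _indexPage(id);
        }
        return batch.Length;
    }

    // On a graceful shutdown, stop the timer and drain what is left.
    public void Dispose()
    {
        _timer.Dispose();
        Flush();
    }
}
```

In this sketch the pending IDs live only in memory, so anything still queued depends on Dispose being called before the process goes away.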

#40906
Jun 22, 2010 17:30

Per:

You make an interesting point there -- what happens if there are 1,000 pages in the LazyIndexer queue, and the process suddenly goes away?  I don't see any persistence layer anywhere, so it strikes me that these pages just wouldn't be indexed.

If the app is shut down gracefully, you might be able to do something, but if it's reset suddenly, I think you end up with holes in the index. (And, I don't think there's a clean way around this, either.)

Deane

#40907
Jun 22, 2010 18:25

We store the queue when the appdomain unloads. You are correct that we cannot handle the case where something forcefully kills the process.

#40914
Jun 23, 2010 9:09

Okay, I've figured out what the problem is.  I don't know why it's happening, but based on my reading, it's a bug in the indexing system.

I've dug down through this issue, and I have found that when a page (Page A, let's say) fetches its data from another page (Page B), the indexing system does not index the fetched properties.

Page B (the source page) is clearly indexed for all search terms in the page name and the searchable properties (MainBody, in this instance -- an XHTML field).

But Page A (the one fetching from Page B) is only indexed for its page name.  None of the fetched properties are indexed.  The page works fine otherwise -- when I render a property from Page A, it transparently fetches the MainBody from Page B behind the scenes.  But this doesn't seem to extend to indexing, for some reason.

I have a ticket open with support on this.  I'll update this thread when we come to some resolution on it.

#40976
Edited, Jun 25, 2010 21:45

I think I've proven this via an experiment.  Consider --

Page A is fetching its content from Page B.

  1. I found a word that should be on Page A -- "ladder" in this case.  I searched for "ladder."  Page A was not returned.
  2. I went into Page B (the page from which Page A is fetching its content), copied all the content out of the MainBody, pasted it into Page A, and broke the shortcut link.  So, now Page A is standalone -- it holds all of its own content, and doesn't fetch from Page B anymore.
  3. After a few minutes, I searched for "ladder" again.  Page A was now returned.
  4. I then deleted all the MainBody content in Page A and set the shortcut to fetch from Page B again.
  5. After a few minutes, I searched for "ladder" again.  Page A was no longer returned.

#40977
Jun 25, 2010 22:30

Reflecting through the API, I think I found the source of the bug.

To index a page, LazyIndexer calls PageTextIndexDB().LoadPageTextData(pageID).  The problem is, that method doesn't use the API to get the text of the page to index.  It makes a direct database call.  Specifically, it calls editGetPageTextData, and passes the page ID and the language branch.

Since it goes straight to the database rather than through the API, it doesn't do any fetching of the properties.  I looked through the stored proc, and while I don't claim to completely understand it, I'm pretty sure it's not doing any property fetching at all.

So, my workaround is this --

I'll create a LongString property called "Searchable Text."  I'll mark this property as searchable, but not display it in Edit Mode.  On page publishing, I'll use the API (which does fetch) to write the contents of any searchable properties to this hidden field.  This should give the indexing system a text string with the correct values in it to index.
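
Roughly, the logic I have in mind looks like the sketch below. To be clear, this is a simplified model and not EPiServer API code: SketchPage, Resolve, and SearchableTextSketch are hypothetical stand-ins, and the real version would read properties through the EPiServer API in a publish event handler. It also assumes the fetch chain has no cycles.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a CMS page; NOT EPiServer's PageData.
public sealed class SketchPage
{
    public int Id;
    public string Name;
    public SketchPage FetchFrom;   // null when the page holds its own content
    public Dictionary<string, string> OwnProperties = new Dictionary<string, string>();

    // Resolve a property the way an API-level read would: use the page's own
    // value if it has one, otherwise fall back to the page it fetches from.
    public string Resolve(string propertyName)
    {
        string value;
        if (OwnProperties.TryGetValue(propertyName, out value) && !string.IsNullOrEmpty(value))
        {
            return value;
        }
        return FetchFrom != null ? FetchFrom.Resolve(propertyName) : null;
    }
}

public static class SearchableTextSketch
{
    // Assumed name of the hidden, searchable property.
    public const string HiddenProperty = "SearchableText";

    // Concatenate the resolved values of the searchable properties.
    public static string Build(SketchPage page, IEnumerable<string> searchableProperties)
    {
        var parts = new List<string>();
        foreach (string name in searchableProperties)
        {
            string value = page.Resolve(name);
            if (!string.IsNullOrEmpty(value))
            {
                parts.Add(value);
            }
        }
        return string.Join(" ", parts);
    }

    // What the publish handler would do: store the resolved text on the page
    // itself, so an indexer that reads only the page's own data still sees it.
    public static void OnPublishing(SketchPage page, IEnumerable<string> searchableProperties)
    {
        page.OwnProperties[HiddenProperty] = Build(page, searchableProperties);
    }
}
```

The point is that the hidden property ends up holding text that physically belongs to the page, so it no longer matters whether the indexer follows the fetch link.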

If anyone sees anything wrong with this plan, let me know.

#40978
Jun 26, 2010 0:48

Are you updating the hidden Searchable Text property on Page A when publishing Page A or Page B?

#40984
Jun 28, 2010 9:33

Good point.  I was going to update it when publishing Page A, but I'll have to do the same when changing Page B as well.  So, essentially, publishing Page B will have to trigger an update of all the Page As that pull from it.
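
As a rough sketch of that propagation step (again a simplified model, not EPiServer API code): FetchLink, DependentPagesSketch, and the refresh callback are hypothetical names, and how you would actually list the pages that fetch from a given page is left open here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record of which page fetches its content from which; NOT an EPiServer type.
public sealed class FetchLink
{
    public int PageId;
    public int? FetchesFromPageId;   // null when the page holds its own content
}

public static class DependentPagesSketch
{
    // Return the IDs of the pages that fetch directly from the given source page.
    public static List<int> FindDependents(IEnumerable<FetchLink> links, int sourcePageId)
    {
        var result = new List<int>();
        foreach (FetchLink link in links)
        {
            if (link.FetchesFromPageId == sourcePageId)
            {
                result.Add(link.PageId);
            }
        }
        return result;
    }

    // On publish: refresh the published page, then every page that pulls from it.
    // 'refresh' is an assumed callback that rewrites the hidden searchable text.
    // Returns the number of dependent pages that were refreshed.
    public static int OnPublished(IEnumerable<FetchLink> links, int publishedPageId, Action<int> refresh)
    {
        refresh(publishedPageId);
        List<int> dependents = FindDependents(links, publishedPageId);
        foreach (int id in dependents)
        {
            refresh(id);
        }
        return dependents.Count;
    }
}
```

This only handles direct dependents; if fetching pages can themselves be fetched from, the lookup would need to repeat for each dependent.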

#41006
Jun 28, 2010 14:02

This has been entered as Bug #51010.

#41007
Jun 28, 2010 14:03

Any news on bug #51010? I couldn't find any info on it.

#47165
Edited, Jan 17, 2011 23:57

I haven't heard anything.  I'll ping Jens about it.

#47166
Jan 18, 2011 2:57

Sorry for posting in an old thread, but it seems this bug is still present in CMS 6 R2.

Deane: Did you get any response on your ticket?

#58476
Apr 26, 2012 12:27

Looking in TFS I can see that the bug isn't a bug. It's a feature :)

#58478
Apr 26, 2012 12:37

I don't think that I did get a response.  I think the prevailing opinion was just, "use the new full-text search that's coming out soon."

#58489
Apr 26, 2012 15:47