Updating crawler's start URL - will indexed WebContent on old URL domain be removed from index?


I have a Crawler connector where I've updated the Start URL from domain1.se to domain2.se. Do I need to clear the index myself, or will the crawl process clear the previously indexed domain1 items once a new crawl on domain2 completes?

While the crawl is running, it looks like items from domain1 are still untouched in the index while items from domain2 are being added.

I guess I will find out later today but could save some time by clearing manually right away if that's what's needed.

#202479
Mar 27, 2019 10:36

Please let us know the final result. I vote you will need to clear it manually.

#202481
Mar 27, 2019 11:03

It actually looks like the crawler's previously indexed items were removed once the new run with the same crawler (but with updated domain name) finished.

This is similar to how SiteSeeker crawling operated.

While the crawl was running, new items from the new crawl were added, but the old items were still there.

This is not how SiteSeeker crawling operated.

I think this is the place where the docs should clarify this, but they don't:
http://webhelp.episerver.com/latest/find/adding-connectors.htm 

#202696
Mar 29, 2019 8:53

The _id of a WebContent document in Find is a hash of the URL. In the standard case, where scheduled indexing runs with the same (or almost the same) settings every time, old variants of crawled documents are overwritten during indexing. As Johan noticed and anticipated, because the host changed, he did end up with duplicated content in the index, for a while.
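
To illustrate why a changed host gives duplicates rather than overwrites, here is a minimal Python sketch. It is only a model, not Find's code: the dict standing in for the index, the helper names doc_id and upsert, and the use of SHA-1 are all assumptions made for illustration (the actual hash function is not stated here).

```python
import hashlib


def doc_id(url: str) -> str:
    # Assumption: SHA-1 stands in for whatever hash Find really uses.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def upsert(index: dict, url: str, body: str) -> None:
    # Writing under an existing id replaces the old document.
    index[doc_id(url)] = {"url": url, "body": body}


index = {}
upsert(index, "https://domain1.se/about", "first crawl")
upsert(index, "https://domain1.se/about", "second crawl")  # same URL, same id: overwritten
upsert(index, "https://domain2.se/about", "second crawl")  # new host, new id: extra document

print(len(index))  # two documents: one per host
```

The same page under a new host name gets a different id, so nothing is overwritten and both copies sit in the index side by side.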

Each WebContent document has a session id, which is tied to the crawl in which it was fetched. At the end of each crawl the connector performs a delete that removes pages that were not fetched during the latest crawl. This is done so that pages removed from the crawled web site also disappear from the index after the crawl.
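
The session id cleanup can be sketched roughly like this in Python. Again, this is a simplified model under assumptions, not the connector's implementation: the dict index, the field name session_id, the SHA-1 id, and the functions fetch_into and sweep are hypothetical names chosen for the example.

```python
import hashlib
import uuid


def doc_id(url: str) -> str:
    # Assumption: SHA-1 stands in for the real URL hash.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def fetch_into(index: dict, urls: list, session_id: str) -> None:
    # Every fetched page is stamped with the id of the current crawl.
    for url in urls:
        index[doc_id(url)] = {"url": url, "session_id": session_id}


def sweep(index: dict, session_id: str) -> int:
    # End-of-crawl delete: drop everything the latest crawl did not fetch.
    stale = [key for key, doc in index.items() if doc["session_id"] != session_id]
    for key in stale:
        del index[key]
    return len(stale)


index = {}

# First crawl, against the old start URL.
old_session = uuid.uuid4().hex
fetch_into(index, ["https://domain1.se/", "https://domain1.se/about"], old_session)
sweep(index, old_session)

# Second crawl, after the start URL was changed.
new_session = uuid.uuid4().hex
fetch_into(index, ["https://domain2.se/", "https://domain2.se/about"], new_session)
during_crawl = len(index)          # old and new documents coexist here
removed = sweep(index, new_session)
after_crawl = len(index)           # only documents from the latest crawl remain
```

In this model the old domain1 documents stay in the index for the whole duration of the second crawl and only disappear in the final sweep, which matches what Johan observed.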

#202748
Mar 29, 2019 19:39