
Exclude URLs and content when using Connectors


Hi,

I need more control over which URLs should NOT be indexed. The help only states that I can add URLs, one per line, but I'm indexing a website without friendly URLs, so I need to be able to specify exclusion rules with a pattern, e.g. a regex. Is that possible?

It would also be awesome to have more control over what is actually indexed in the HTML document, for example by specifying with XPath what to include and exclude. Right now everything is indexed, including the header, footer and navigation.
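
For instance, something like this would cover our case (just to illustrate the idea, not actual product syntax):

Exclude: //header | //footer | //nav
Include: //main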

#116797
Feb 05, 2015 0:42

Also, it would be nice to be able to exclude file/MIME types. We don't want to index e.g. image/jpeg.

#116798
Feb 05, 2015 1:53

Hi,

Regarding the crawler connector: currently, the only way to exclude crawled paths is to use "disallow" rules in a robots.txt on the crawled site, targeting the user agent "EPiServer Crawler".
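
For example, a robots.txt on the crawled site could look something like this (the path is just a placeholder):

User-agent: EPiServer Crawler
Disallow: /path-to-exclude/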

The crawler connector has more expert-level filters (and settings), such as which paths should be excluded from indexing but still followed during crawling, but as far as I know those have not been exposed in the UI or the API.

Also, there is no MIME type filter. The crawler is well aware of the MIME types during filtering, but there is currently no way to configure this.

I'm very happy to hear that the crawler connector is being used, and I will gladly help you with any further questions.

/ John Johansson, Backend Developer, Hosted Services 

#116870
Feb 06, 2015 13:31

Hi,

So we can specify patterns via robots.txt, but not in the UI? And the UI only supports complete URLs that are wildcard-matched to the right, e.g. /about-us/, which will exclude all subpages too?

It seems odd that the UI doesn't support everything you can do in robots.txt, since the crawler supports it and the UI already lets you specify URLs.

One of the biggest reasons to use the crawler is when we want to search a website we don't have access to, so adding a robots.txt is not always possible. In our case it isn't. Even if we could add one, it would be a big effort to keep it up to date and fine-tune all the rules while setting up the search, since deploying a new one could take days (who knows...).

In our case, one of the websites has a news archive. The archive has about 10 languages, all the English articles are available in all languages, and most of them are actually in English. The archive also has tag and date filters and pagination, which are controlled by query parameters. As you can imagine, the pages add up really quickly, and we have no way to exclude these pages. I had to stop the crawler when it reached over 40,000 documents, and I don't know how much was left to crawl, since there is no log or output whatsoever.

Since we're indexing a couple more websites, we reached the document limit very quickly :( The websites have far fewer than 10,000 unique pages each, but since the crawler indexes all files and duplicate pages, it adds up tenfold.

I don't even know if the 1 million document index will be enough, when 100,000 should be more than enough if we only had more control.

Can you remove the document limit from the index until we have more control over the crawler? Or how can we solve this?

#116917
Edited, Feb 06, 2015 16:48

I don't think I can help you remove/modify the document count limit, sorry.

We might however be able to help you modify your connectors.

There are settings/filters in the crawler for removing unwanted URL parameters, limiting recursion depth and for globbing, all of which can potentially help you achieve what you want. I really can't tell you why these settings are not available in the UI, because I don't know.

Anyway, some of the settings are:

  • "remove_query_args": [] //< Array of parameter names. Useful for removing timestamps, session-ids, page-ids, sorting information etc.
  • "max_depth": int //< Recursion depth limit. No limit (-1) by default.
  • "max_documents": int //< Total document limit. Default to no limit but the crawler will terminate if it receives "full"-response from indexer.
  • "max_links_per_document": int //< Limits the number of followed links per page. Defaults to 5000.
  • "included_crawl_patterns": [] //< Array of globbing patterns (like robots.txt) for urls that are allowed to be followed.
  • "excluded_crawl_patterns": [] //< Array of globbing patterns (like robots.txt) for urls that are not allowed to be followed.
  • "excluded_index_patterns": [] //< Array of globbing patterns (like robots.txt) for urls that are allowed to be followed but not allowed to be indexed

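To illustrate, a combination of these settings could look roughly like this (the parameter names and patterns are only placeholders, and the exact format support uses may differ):

{
  "remove_query_args": ["tag", "date", "page"],
  "max_depth": 10,
  "excluded_crawl_patterns": ["*/some-path/*"],
  "excluded_index_patterns": ["*.jpg", "*.jpeg"]
}
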
Please contact support for help with manually setting these.

The crawler connector is quite competent, but sadly underused, and I agree that the UI needs to be improved to accommodate proper configuration for the crawler connector. Hopefully this post leads to a new story in the backlog.

/ John Johansson, Backend Developer, Hosted Services

#116931
Feb 06, 2015 19:00

Thanks for the answer John! Really appreciate it.

I will contact support with my crawl and index patterns, and hopefully that will be enough. My concern is that this will be a lengthy process, since we can't test the patterns right away and have to wait another day for the support team to update the rules.

From experience I know that it takes time to configure crawling and indexing, and having a third party involved in between will add a lot of time.

So, are these settings not even available through some hidden REST API?

#116933
Edited, Feb 06, 2015 20:14

Hi John,

We have actually had some success with the crawl and index patterns! Thanks! However, I have noticed a couple of bugs and some features that would be nice to have:

The canonical URL should be honored. A lot of duplicates could be eliminated this way for websites with fallback languages.

<link rel="canonical" href="http://example.com/en/realpage/" /> 

The UI validates the excluded_crawl_patterns as real URLs and not just as globbing patterns, so now we can't change these patterns from the UI, nor update the connectors.

Every pattern is on a separate line, but the stylesheet changes the line breaks to whitespace, which makes it almost impossible to get an overview of the patterns.

Should I report this in a developer support ticket, or can you pass this on?


#117392
Feb 19, 2015 19:08

As of EPiServer Find version 9.2.0.2446, most of these features are implemented! Yay!

#119709
Apr 01, 2015 21:53