
Exclude URLs and content when using Connectors

From existing topic at https://world.episerver.com/forum/developer-forum/EPiServer-Search/Thread-Container/2015/2/exclude-urls-and-content-when-using-connectors/:

Exclude URLs and content when using Connectors

We might, however, be able to help you modify your connectors.

There are settings/filters in the crawler for removing unwanted URL parameters, limiting recursion depth, and globbing, all of which can potentially help you with what you want to achieve. I can't tell you why these settings are not available in the UI; I simply don't know.

Anyway, some of the settings are:

  • "remove_query_args": [] //< Array of parameter names. Useful for removing timestamps, session-ids, page-ids, sorting information etc.
  • "max_depth": int //< Recursion depth limit. No limit (-1) by default.
  • "max_documents": int //< Total document limit. Defaults to no limit, but the crawler will terminate if it receives a "full" response from the indexer.
  • "max_links_per_document": int //< Limits the number of followed links per page. Defaults to 5000.
  • "included_crawl_patterns": [] //< Array of globbing patterns (like robots.txt) for urls that are allowed to be followed.
  • "excluded_crawl_patterns": [] //< Array of globbing patterns (like robots.txt) for urls that are not allowed to be followed.
  • "excluded_index_patterns": [] //< Array of globbing patterns (like robots.txt) for urls that are allowed to be followed but not allowed to be indexed. 

Please contact support for help with manually setting these.
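
To make the settings listed above a bit more concrete, here is a rough Python sketch of how such filters could behave together. Only the setting names come from this thread. The values, the helper functions (strip_query_args, matches_any, classify), the use of fnmatch for globbing and the order in which the filters are applied are all assumptions made for illustration. This is not the connector's actual implementation, and robots.txt-style matching may differ from fnmatch in the details.

```python
from fnmatch import fnmatchcase
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical values; only the key names are taken from the list above.
settings = {
    "remove_query_args": ["sessionid", "sort"],
    "max_depth": 3,  # -1 would mean "no limit"
    "included_crawl_patterns": ["/*"],
    "excluded_crawl_patterns": ["/admin/*"],
    "excluded_index_patterns": ["/tags/*"],
}


def strip_query_args(url, names):
    """Drop the named query parameters from url, keeping the others in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def matches_any(path, patterns):
    """True if path matches at least one glob pattern (fnmatch semantics, assumed)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def classify(url, depth, cfg):
    """Return 'skip', 'follow' (crawl but do not index) or 'index' for a URL."""
    if cfg["max_depth"] != -1 and depth > cfg["max_depth"]:
        return "skip"
    path = urlsplit(url).path
    if not matches_any(path, cfg["included_crawl_patterns"]):
        return "skip"
    if matches_any(path, cfg["excluded_crawl_patterns"]):
        return "skip"
    if matches_any(path, cfg["excluded_index_patterns"]):
        return "follow"
    return "index"


clean = strip_query_args(
    "https://example.com/news?id=7&sessionid=abc&sort=asc",
    settings["remove_query_args"],
)
print(clean)
print(classify("https://example.com/admin/login", 1, settings))
print(classify("https://example.com/tags/python", 1, settings))
print(classify(clean, 1, settings))
```

In this sketch, stripping the query arguments first means that URLs differing only in session or sorting parameters collapse into a single URL before the crawl and index patterns are applied.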

As of EPiServer Find version 9.2.0.2446, most of these features are implemented! Yay!

As for max_links_per_document and max_documents, where do we set these? Do we still need to contact support?

#193683
Jun 03, 2018 2:03