Vulnerability in EPiServer.Forms
I'm having some trouble searching for phone numbers in our Find index. I'm getting what I would consider a lot of false positives, and even when I get a hit on the phone number it is often not at the top of the results. As an example, if I search for "08-12345" I get hits on a lot of other documents that only contain the "08" part. It seems that the "08" and "12345" parts are searched as separate words.
I realized that the "-" character is a reserved character in Lucene, and if I want to create a search containing that character I would probably need to escape it as described in http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Escaping%20Special%20Characters
When I search using UnifiedSearchFor(), the search string is parsed using EPiServer.Find.QueryEscaping.Quote(), which seems to escape all the reserved characters already. So adding the escaping myself shouldn't be necessary. Still, the search result is less than perfect.
Am I misinterpreting the Lucene documentation, or shouldn't "08-12345" be matched as if it were one word when the "-" character is escaped?
I might be wrong here, but I think Lucene will divide the term into two separate words when indexing that string, so "08-12345" will be indexed as "08" and "12345". Even so, you would think that searching for 08\-12345 would give the phone number as the highest-ranked search hit.
What I usually do when faced with these "known pattern" queries (where I know how a pattern should be interpreted; in this case a phone number shouldn't be tokenized into 08 and 12345 but kept as 08-12345) is to simply wrap them in phrases, i.e. don't let Lucene decide where to break tokens. So in this case I would scan the query and wrap known patterns in double quotes ("):
searchResult = client.Search<BlogPost>()
    .For("some other query tokens \"08-12345\"")
    .GetResult();
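
The "scan the query" step could look something like the sketch below. This is only an illustration: PhoneQueryHelper, QuotePhoneNumbers and the regular expression are made-up names and assumptions of mine, not part of EPiServer Find, and the pattern (2-4 digits, a hyphen, 5-8 digits) is just a guess at what a phone number looks like in your content. It wraps every unquoted match in double quotes and leaves the rest of the query untouched.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of EPiServer Find.
public static class PhoneQueryHelper
{
    // Assumed phone number shape: 2-4 digits, a hyphen, 5-8 digits (e.g. 08-12345).
    // The lookbehind/lookahead skip numbers that are already quoted or that are
    // part of a longer word, number, or hyphenated token.
    private static readonly Regex PhonePattern =
        new Regex(@"(?<![""\w-])\d{2,4}-\d{5,8}(?![""\w-])", RegexOptions.Compiled);

    // Returns the query with each matching phone number wrapped in double quotes.
    public static string QuotePhoneNumbers(string query)
    {
        if (string.IsNullOrEmpty(query))
        {
            return query;
        }

        return PhonePattern.Replace(query, m => "\"" + m.Value + "\"");
    }
}
```

You would then pass the result to the search, e.g. .For(PhoneQueryHelper.QuotePhoneNumbers(userQuery)), and adjust the pattern to whatever number formats you actually store. Whether the quotes survive depends on how the query string is escaped before it reaches the index, so it is worth checking the hits with a few real numbers.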