How to boost result based on the volume/frequency of keywords in the whole document

Hi Everyone,

 

We have a scenario in which we need to boost a document's score based on the number of times the keyword occurs in the document.

For example, say we have two documents, A and B, and we search for the keyword "associate".

A’s Content- Monitor. Associate is profile. Associate Associate .

B’s Content- Monitor is screen. Associate Associate Associate today’s weather is great. Associate Associate.

As we can see, document A contains only 3 occurrences of "associate" whereas document B contains 5. So document B should be boosted and appear higher in the search results.

But when we use BoostMatching or InField with weights, we do not get the desired output: document A comes above B. Our assumption is that the score considers the proportion of keyword occurrences to the length of the whole description. Going back to the example above, document A has 6 words in total with 3 occurrences of "associate", whereas document B has 12 words with 5 occurrences.

  • A ratio - 3/6 =  0.5 score
  • B ratio - 5/12 = 0.41666 score

So although A contains fewer occurrences, it still appears at the top.

We want to know whether there is any other way to boost results based on the volume/frequency of the keyword in the whole document.

#294182
Jan 05, 2023 14:42
Hi Hitesh,

Can you include an example of the query?  

I am assuming that the content in the example above is split over multiple fields?

Andy

#294185
Jan 05, 2023 22:26
Hi Andrew,

We are searching on multiple fields.

For Example-

We have a model, JobFeedModel. It contains the properties shown in the example below.

Document A-

{
    JobId = "100",
    Title = "Software engineer",
    Description = "Designs, develops and maintains computer Software at a company",
    Location = "New Delhi"
}

Document B-

{
    JobId = "101",
    Title = "Senior Software Engineer",
    Description = " We are looking for a Senior Software Engineer to produce and implement functional Software solutions. You will work with upper management to define Software requirements and take the lead on operational and technical projects.",
    Location = "Boston"
}

 

For the two documents above, suppose someone searches for the keyword "Software". According to our requirement, document B should be boosted because it contains 4 instances of "Software", whereas document A contains 2.

Find Query-

query = _searchClient.Service.Search<JobFeedModel>()
                             .For(keyword)
                             .InField(x => x.Title, 2)
                             .InField(x => x.Description, 2)
                             .InField(x => x.Location, 2);

But this query does not seem to work as expected; we are not getting the expected results.

Our requirement is to fetch the records based on the frequency/volume of the keyword in the whole document.

Thanks!

#294546
Jan 12, 2023 6:50
Hi Hitesh,

In the example above you are applying weighting rather than boosting.  Weighting means that you can signify the importance of matches in one field over another, i.e. matches in the title are more important than the description.

Document A may come before document B because the keyword is first, and this may make it more relevant.

I would recommend using .BoostMatching(x => x.Description.MatchContainedCaseInsensitive(keyword), 10) as well (specify each field you want to boost by).  This applies the boosting you are looking for.

Another option would be to create a field that consolidates both the title and description fields and use that for searching.

From experience, you need to mess around with the boosting and weighting values to really see how they impact the ordering of the results, so you should make these configurable in the CMS so you can alter them without having to make a code change and redeploy.

Finally, make sure you are not autoboosting, or ordering the results... this will also mess up the ordering on the page.

Andy

#294547
Jan 12, 2023 8:23
Hi Andrew,

I tried using BoostMatching but it is not working as expected. What I observed is that it considers the relative frequency of the keyword in the whole document (multiple fields / consolidated field).

For Example-

Document A-

{
    JobId = "100",
    Title = "Software engineer",
    Description = "Designs, develops and maintains computer Software at a company",
    Location = "New Delhi",
    ConsolidatedData = "Software engineer Designs, develops, and maintains computer Software at a company"
}

Document B-

{
    JobId = "101",
    Title = "Senior Software Engineer",
    Description = " We are looking for a Senior Software Engineer to produce and implement functional Software solutions. You will work with upper management to define Software requirements and take the lead on operational and technical projects.",
    Location = "Boston",
    ConsolidatedData = "Senior Software Engineer  We are looking for a Senior Software Engineer to produce and implement functional Software solutions. You will work with upper management to define Software requirements and take the lead on operational and technical projects."
}

Consider the ConsolidatedData field (of type string) in documents A and B. Document A contains "Software" 2 times in a total of 11 words, so its score came out as (2/11) * 10 (boost value) = 1.818.

Document B contains "Software" 4 times, but its total number of words is 33, so its score came out as (4/33) * 10 (boost value) = 1.2121.

Because of this relative score (the frequency of the keyword divided by the total number of words), document A came above document B.

Thanks!

#294567
Jan 12, 2023 13:39
Hi Hitesh,

I am interested in the details around the 'relative frequency of keyword' calculation, as I cannot find this in the Elasticsearch documentation.  Where did you find the information about this?

https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

The score you are showing above, are you getting this value from the results?  i.e. 'searchResult.Hits.First().Score'

Can you include the full query for the new search you have created with the consolidated data? I still think there is something here that is affecting the search order.

Another thing you can try is 'https://github.com/episerver/EPiServer.Labs.Find.Toolbox'; it helps when using multi-term searches.

Andy

#294618
Jan 13, 2023 8:29
Hi Andrew,

The 'relative frequency of keyword' was my own observation, but when I searched on Google I came across this Stack Overflow question, which describes the same issue that I am facing.

https://stackoverflow.com/questions/16631026/elasticsearch-higher-scoring-if-higher-frequency-of-term

The answer to that question suggests that we can either disable the normalization or set "index": "not_analyzed", but I am not sure whether we can do that in Episerver.

Yes, you are right: I am getting the score shown above from the results using 'searchResult.Hits.First().Score'.

The query I am using now.

Find Query-

query = _searchClient.Service.Search<JobFeedModel>()
                             .For(keyword)
                             .BoostMatching(x => x.ConsolidatedData.Match(keyword), 10);

I also tried using BoostMatching without the "For" but no success there.

I also tested this by creating a new field in my mapping model, SpaceSeparated, of type List<string>, to check how BoostMatching behaves with it. But it seems that BoostMatching stops after the first occurrence of the keyword in a list, so it did not consider the other occurrences of the keyword. So that didn't work either.

Example-

"SpaceSeparated": [
        "Software",
        "Associate",
        "Consultant",
        "Software",
        "software",
        "software",
        "software",
        "Processing",
        "datainformation",
        "conducting",
        "analysis",
        "and",
        "preparing",
        "reports",
        "of",
        "findings"
    ]

Thanks!
#294755
Jan 16, 2023 8:23
Hi Hitesh,

Having read the post you referenced, I would agree with how the score is calculated. TBH, I always assumed that the frequency of a term outweighed anything else; you learn something every day.
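
As a rough illustration of why the shorter document wins, here is a worked sketch using the classic Lucene TF/IDF similarity described on the Elasticsearch scoring-theory page linked earlier in this thread. It assumes the Find index uses that similarity (not confirmed in this thread; newer Elasticsearch versions default to BM25), takes the word counts quoted above (11 and 33) at face value, and ignores the reduced precision with which Lucene stores the norm. The inverse document frequency and the boost are the same for both documents because the term and the query are the same, so only the term frequency and the field-length norm differ:

```latex
\[
\mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\mathrm{norm}(d) = \frac{1}{\sqrt{\mathrm{numTerms}(d)}}
\]
\[
A:\ \sqrt{2}\cdot\frac{1}{\sqrt{11}} = \sqrt{\tfrac{2}{11}} \approx 0.426
\qquad
B:\ \sqrt{4}\cdot\frac{1}{\sqrt{33}} = \sqrt{\tfrac{4}{33}} \approx 0.348
\]
```

Under these assumptions the product is the square root of the relative frequency, so document A still ranks first even though B has more raw occurrences.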

I think I would raise this with Optimizely support; they may have more insight into how you can make these changes.

Sorry I cannot be of more help.

Andy

#295263
Jan 24, 2023 8:31
I think you are absolutely correct about the 'relative frequency of keyword'. Unfortunately, I do not know if it's possible to achieve what you're asking without writing code to count the frequency after the search result is returned, and then sorting by that value.

#295284
Jan 24, 2023 21:16
Hi Tomas, that's a good point. 

I was trying to solve the problem within Find.  But as you said, you could count the number of occurrences of the keyword within the results and then add this total to the score you get back in the search results. 

You can then order by your new score.
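
A minimal sketch of that post-processing step, in plain C# with no Find calls. ScoredHit and FrequencyReRanker are hypothetical names invented for this illustration; ScoredHit stands in for whatever you project out of the search results (for example the ConsolidatedData text and the hit's Score). It assumes the keyword is a single plain word and counts whole-word, case-insensitive matches:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for one search hit: the searched text plus the
// score that the search engine returned for it.
public sealed class ScoredHit
{
    public ScoredHit(string id, string text, double score)
    {
        Id = id;
        Text = text;
        Score = score;
    }

    public string Id { get; }
    public string Text { get; }
    public double Score { get; }
}

public static class FrequencyReRanker
{
    // Counts whole-word, case-insensitive occurrences of keyword in text.
    // Assumes keyword is a single plain word (letters/digits only).
    public static int CountOccurrences(string text, string keyword)
    {
        if (string.IsNullOrWhiteSpace(text) || string.IsNullOrWhiteSpace(keyword))
        {
            return 0;
        }

        var pattern = @"\b" + Regex.Escape(keyword) + @"\b";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }

    // New score = engine score + raw occurrence count.
    public static double CombinedScore(ScoredHit hit, string keyword)
    {
        return hit.Score + CountOccurrences(hit.Text, keyword);
    }

    // Returns the hits ordered by the combined score, highest first.
    public static IReadOnlyList<ScoredHit> ReRank(IEnumerable<ScoredHit> hits, string keyword)
    {
        return hits
            .OrderByDescending(h => CombinedScore(h, keyword))
            .ToList();
    }
}
```

With the two ConsolidatedData values and scores from earlier in the thread, document A gets 1.818 + 2 and document B gets 1.2121 + 4, so B moves to the top. Note that this only reorders the hits you have already fetched, so with paging you would need to retrieve all candidate hits before re-ranking.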

Andy

#295320
Edited, Jan 25, 2023 8:09