How to boost result based on the volume/frequency of keywords in the whole document

Hitesh

Hi Everyone,

We have a scenario in which we need to boost a document's score based on the number of times the keyword occurs in that document.

For example, say we have two documents, A and B, and we search for the keyword "associate".

A’s Content- Monitor. Associate is profile. Associate Associate .

B’s Content- Monitor is screen. Associate Associate Associate today’s weather is great. Associate Associate.

As we can see, document A contains only 3 occurrences of "associate", whereas document B contains 5. So document B should be boosted and appear higher in the search results.

But when we use BoostMatching, or InField with weights, we do not get the desired output: document A comes above B. Our assumption is that the score considers the proportion of occurrences relative to the whole description. Going back to the example above, document A has 6 words in total, 3 of which are "associate", whereas document B has 12 words but only 5 occurrences of "associate".

A ratio - 3/6 = 0.5 score
B ratio - 5/12 = 0.41666 score

So although A contains fewer occurrences, it still appears at the top.
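To make the arithmetic above concrete, here is a minimal C# sketch that computes both the absolute count and the count-to-length ratio for the two example documents. The `TermStats` class and its simple tokenizer are illustrative assumptions, not part of Find, and the ratio is only the simplified model assumed in this post, not the engine's actual scoring formula.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of the Find API.
public static class TermStats
{
    private static readonly char[] Separators = { ' ', '.', ',', ';', ':', '!', '?' };

    // Splits text into words on spaces and basic punctuation.
    public static string[] Tokenize(string text) =>
        text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);

    // Absolute number of case-insensitive occurrences of the keyword.
    public static int Count(string text, string keyword) =>
        Tokenize(text).Count(w => string.Equals(w, keyword, StringComparison.OrdinalIgnoreCase));

    // Occurrences divided by the total number of words (the "relative" measure).
    public static double Ratio(string text, string keyword)
    {
        var words = Tokenize(text);
        return words.Length == 0 ? 0.0 : (double)Count(text, keyword) / words.Length;
    }
}
```

For A's content, `TermStats.Count(a, "associate")` gives 3 and `TermStats.Ratio(a, "associate")` gives 3/6; for B's content they give 5 and 5/12. Ranking by `Count` puts B first, while ranking by `Ratio` puts A first, which matches the ordering we are seeing.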

So we want to know whether there is any other way to boost results based on the volume/frequency of a keyword in the whole document.

#294182

Jan 05, 2023 14:42

Andrew Markham

Hi Hitesh,

Can you include an example of the query?

I am assuming that the content in the example above is split over multiple fields?

Andy

#294185

Jan 05, 2023 22:26

Hitesh

Hi Andrew,

We are searching on multiple fields.

For Example-

We have a model, JobFeedModel, which contains the properties shown in the example below.

Document A-

{

JobId = "100",

Title = "Software engineer",

Description = "Designs, develops and maintains computer Software at a company",

Location = "New Delhi"

}

Document B-

{

JobId = "101",

Title = "Senior Software Engineer",

Description = "We are looking for a Senior Software Engineer to produce and implement functional Software solutions. You will work with upper management to define Software requirements and take the lead on operational and technical projects.",

Location = "Boston"

}

So for the above two documents, suppose someone searches for the keyword "Software". According to our requirement, document B should be boosted because it contains 4 instances of "Software", whereas document A contains only 2.

Find Query-

query = _searchClient.Service.Search<JobFeedModel>()
.For(keyword)
.InField(x => x.Title, 2)
.InField(x => x.Description, 2)
.InField(x => x.Location, 2);

But this query does not seem to work as expected; we are not getting the results we want.

Our requirement is to order the records by the frequency/volume of the keyword across the whole document.

Thanks!

#294546

Jan 12, 2023 6:50

Andrew Markham

Hi Hitesh,

In the example above you are applying weighting rather than boosting. Weighting means that you can signify the importance of matches in one field over another, i.e. matches in the title are more important than the description.

Document A may come before document B because the keyword is first, and this may make it more relevant.

I would recommend also using .BoostMatching(x => x.Description.MatchContainedCaseInsensitive(keyword), 10), specifying each field you want to boost by. This applies the boosting you are looking for.

Another option would be to create a field that consolidates both the title and description fields and use that for searching.

From experience, you need to mess around with the boosting and weighting values to really see how they impact the ordering of the results, so you should make these configurable in the CMS so you can alter them without having to make a code change and redeploy.

Finally, make sure you are not autoboosting, or ordering the results... this will also mess up the ordering on the page.

Andy

#294547

Jan 12, 2023 8:23

Hitesh

Hi Andrew,

I tried using BoostMatching, but it is not working as expected. What I observed is that it considers the relative frequency of the keyword in the whole document (multiple fields / consolidated field).

For Example-

Document A-

{

JobId = "100",

Title = "Software engineer",

Description = "Designs, develops and maintains computer Software at a company",

Location = "New Delhi",

ConsolidatedData = "Software engineer Designs, develops, and maintains computer Software at a company"

}

Document B-

{

JobId = "101",

Title = "Senior Software Engineer",

Location = "Boston",

ConsolidatedData = "Senior Software Engineer We are looking for a Senior Software Engineer to produce and implement functional Software solutions. You will work with upper management to define Software requirements and take the lead on operational and technical projects."

}

Looking at the ConsolidatedData field (of type string) in documents A and B: document A contains "Software" 2 times out of 11 words in total, so the score came out as (2/11) * 10 (boost value) = 1.818.

Document B contains "Software" 4 times, but its total number of words is 33, so the score came out as (4/33) * 10 (boost value) = 1.2121.

Because of this relative score (i.e. the frequency of the keyword divided by the total number of words), document A came above document B.

Thanks!

#294567

Jan 12, 2023 13:39

Andrew Markham

Hi Hitesh,

I am interested in the details of the 'relative frequency of keyword' calculation, as I cannot find this in the Elasticsearch documentation. Where did you find the information about this?

https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

The score you are showing above, are you getting this value from the results, i.e. 'searchResult.Hits.First().Score'?

Can you include the full query for the new search you have created with the consolidated data? I still think there is something here that is affecting the search order.

Another thing you can try is 'https://github.com/episerver/EPiServer.Labs.Find.Toolbox'; it helps when using multi-term searches.

Andy

#294618

Jan 13, 2023 8:29

Hitesh

Hi Andrew,

It was my own observation about the 'relative frequency of keyword', but when I searched on Google I came across this question on Stack Overflow, which describes the same issue I am facing.

https://stackoverflow.com/questions/16631026/elasticsearch-higher-scoring-if-higher-frequency-of-term

The answer to that question suggests that we can either disable the normalization or set "index": "not_analyzed", but I am not sure whether we can do that in Episerver.
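For reference, the normalization in question can be written out. This is a sketch that assumes the Find index scores with Lucene's classic TF/IDF similarity, as described in the Elasticsearch guide linked earlier in the thread; I have not confirmed which similarity Find actually uses. Here f_{t,d} is the number of times term t occurs in the field and n_d is the number of terms in the field, using the word counts quoted above (11 for A, 33 for B).

```latex
\[
\mathrm{tf}(t, d) = \sqrt{f_{t,d}}, \qquad
\mathrm{norm}(d) = \frac{1}{\sqrt{n_d}}
\]
\[
\text{A: } \sqrt{2} \cdot \frac{1}{\sqrt{11}} \approx 0.426, \qquad
\text{B: } \sqrt{4} \cdot \frac{1}{\sqrt{33}} \approx 0.348
\]
```

The real score also includes the inverse document frequency and the boost, so these values will not match the reported scores exactly, but under this model the shorter field still wins even though it has fewer occurrences. Disabling norms would remove the 1/sqrt(n_d) factor, leaving sqrt(2) for A against sqrt(4) for B.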

Yes, you are right: I am getting the score shown above from the results, using 'searchResult.Hits.First().Score'.

The query I am using now.

Find Query-

query = _searchClient.Service.Search<JobFeedModel>()
.For(keyword)
.BoostMatching(x => x.ConsolidatedData.Match(keyword), 10);

I also tried using BoostMatching without the "For", but had no success there.

I also tried creating a new field in my mapping model, SpaceSeparated, of type List<string>, to check how BoostMatching behaves with it. But it seems that BoostMatching stops after the first occurrence of the keyword in a list, so it did not consider the other occurrences. So that didn't work either.

Example-

"SpaceSeparated": [
        "Software",
        "Associate",
        "Consultant",
        "Software",
        "software",
        "software",
        "software",
        "Processing",
        "datainformation",
        "conducting",
        "analysis",
        "and",
        "preparing",
        "reports",
        "of",
        "findings"
    ]

Thanks!

#294755

Jan 16, 2023 8:23

Andrew Markham

Hi Hitesh,

Reading the post you referenced, I would agree with how the score is calculated. TBH, I always assumed that the frequency of a term outweighed anything else; you learn something every day.

I think I would raise this with Optimizely support; they may have more insight into how you can make these changes.

Sorry I cannot be of more help.

Andy

#295263

Jan 24, 2023 8:31

Tomas Hensrud Gulla

I think you are absolutely correct about the 'relative frequency of keyword'. Unfortunately, I do not know if it's possible to achieve what you're asking without writing code to count the frequency after the search result is returned, and then sorting by that value.

#295284

Jan 24, 2023 21:16

Andrew Markham

Hi Tomas, that's a good point.

I was trying to solve the problem within Find. But as you said, you could count the number of occurrences of the keyword within the results and then add this total to the score you get back in the search results.

You can then order by your new score.
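A minimal sketch of that post-processing step in plain C#, independent of the Find API. The `ScoredHit` record and `FrequencyReRanker` class are hypothetical names for illustration; you would map each returned hit (its searched text and its 'Score') into `ScoredHit` yourself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical shape of one returned hit: the searched text plus the score reported by the search.
public sealed record ScoredHit(string Id, string Text, double Score);

public static class FrequencyReRanker
{
    // Counts whole-word, case-insensitive occurrences of the keyword in the text.
    public static int CountOccurrences(string text, string keyword)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrWhiteSpace(keyword)) return 0;
        var pattern = $@"\b{Regex.Escape(keyword)}\b";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant).Count;
    }

    // Adds the occurrence count to the original score and orders by the new score, highest first.
    public static IReadOnlyList<ScoredHit> ReRank(IEnumerable<ScoredHit> hits, string keyword) =>
        hits.Select(h => h with { Score = h.Score + CountOccurrences(h.Text, keyword) })
            .OrderByDescending(h => h.Score)
            .ToList();
}
```

With the consolidated text and scores from the earlier example (1.818 for document A, 1.2121 for document B) and the keyword "software", A's new score is 3.818 and B's is 5.2121, so B moves to the top. Note that this only reorders the hits that were actually fetched, so it interacts with paging, and the `\b` word boundaries assume the keyword starts and ends with a letter or digit.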

Andy

#295320

Edited, Jan 25, 2023 8:09

Priyanka

Hi, I am doing the same thing. Were you able to get this to work? If yes, can you add a sample query?

Thanks!

#323738

Jun 18, 2024 8:06