HTML gets included in the index (Excerpt property)

I have an XhtmlString property on a page that is indexed using Find.

[CultureSpecific]
[Display(
    Name = "Body",
    GroupName = Core.EPiServer.Ui.PageTabNames.Content,
    Order = 20)]
public virtual XhtmlString Body { get; set; }

But when I search using SearchClient.Instance.UnifiedSearchFor(query), the Excerpt property includes HTML from properties of type XhtmlString.

I would like Find to strip the HTML before adding the content to the index.

What is the best way to implement this?

#142024

Nov 26, 2015 15:48

Henrik Fransas

Here is a link that shows how it can be done:

http://world.episerver.com/documentation/Items/Developers-Guide/EPiServer-Find/11/DotNET-Client-API/Customizing-serialization/Removing-HTML-tags/

#142027

Nov 26, 2015 20:22

I have tried following the Removing HTML tags examples, but as far as I can see, they are not compatible with XhtmlString properties.

Nothing changes when I add the attribute to my property.
The conventions are simply not supported in the API for XhtmlString.

Maybe I did something wrong?

#142032

Nov 27, 2015 9:32

Per Magne Skuseth

You'll have to reindex after adding the attribute.

#142033

Nov 27, 2015 10:47

I deleted the index and published the page again, but with little luck.

Are you sure this attribute is supposed to work for XhtmlString properties as well?

#142034

Nov 27, 2015 10:51

Per Magne Skuseth

Which version of Find are you using?

#142035

Nov 27, 2015 10:57

The index is hosted by EPiServer and the assembly version is 9.5.0.2999.

#142036

Nov 27, 2015 11:00

Per Magne Skuseth

If you look at Find's explore view, does the Body property contain HTML? What about the "SearchText" field in the index?

#142037

Nov 27, 2015 11:03

The SearchText gets indexed including HTML tags, but the HTML gets stripped from the AsViewedByAnonymous property of the Body property.

"SearchText$$string": "Om os Dette er en test af hvordan \n<ul>\n<li>EPiServer</li>\n<li>har tænkt sig at håndtere HTML</li>\n<li>i indholdsfelter</li>\n</ul>\nGår det mon godt i dette tilfælde? epi.cms.contentdata:///105",

"Body": {
"IsEmpty$$bool": false,
"___types": [
"EPiServer.Core.XhtmlString",
"System.Object",
"System.Web.IHtmlString",
"System.Runtime.Serialization.ISerializable",
"EPiServer.Data.Entity.IReadOnly`1[[EPiServer.Core.XhtmlString, EPiServer, Version=8.8.1.0, Culture=neutral, PublicKeyToken=8fe83dea738b45b7]]",
"EPiServer.Data.Entity.IReadOnly"
],
"AsViewedByAnonymous$$string": "Dette er en test af hvordan EPiServer har tænkt sig at håndtere HTML i indholdsfelter Går det mon godt i dette tilfælde?",
"IsModified$$bool": false,
"$type": "EPiServer.Core.XhtmlString, EPiServer"
},

#142039

Edited, Nov 27, 2015 11:13

Per Magne Skuseth

Sounds like a bug in the SearchText extension to me. This does not happen in EPiServer Find 11.

A workaround could be to add the following to an initializable module:

SearchClient.Instance.Conventions.ForInstancesOf<IContent>().Field(x => x.SearchText()).StripHtml();

and then reindex. I have not tried this, though.

#142041

Nov 27, 2015 11:43

Using your code gives me some strange results.
When I execute a query I get TotalMatchingResult = 1, but the hits collection is empty.

But the code below seems to fix it.

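// Null-safe wrapper around the Find extension method
// EPiServer.Find.Helpers.Text.StringExtensions.StripHtml().
public static string StripHtml(string html)
{
    if (!string.IsNullOrEmpty(html))
    {
        return html.StripHtml();
    }

    return html;
}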

public void Initialize(InitializationEngine context)
{
    SearchClient.Instance.Conventions.ForInstancesOf<IContent>().Field(x => x.SearchText()).ConvertBeforeSerializing(StripHtml);
}

Now the only problem is that the SearchText includes a weird suffix, which is searchable and is sometimes included in the Excerpt.

"SearchText$$string": "Om os Dette er en test af hvordan \n \n EPiServer \n har tænkt sig at håndtere HTML \n i indholdsfelter \n \n Går det mon godt i dette tilfælde? \n test test test test epi.cms.contentdata:///105",

Edit: Now using EPiServer.Find.Helpers.Text.StringExtensions.StripHtml() instead of a custom Regex.
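
For comparison, a regex-based converter of the kind that edit replaced could look roughly like the sketch below. This is not the Find helper and not the code originally posted; the class name HtmlText and method name Strip are made up for illustration. It assumes the markup is simple enough for a tag regex, and it also collapses the leftover "\n \n" whitespace into single spaces.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of EPiServer Find.
public static class HtmlText
{
    // Matches anything that looks like a tag, e.g. <ul>, </li>, <br />.
    private static readonly Regex Tags = new Regex("<[^>]*>", RegexOptions.Compiled);

    // Matches runs of whitespace, including newlines.
    private static readonly Regex Whitespace = new Regex(@"\s+", RegexOptions.Compiled);

    public static string Strip(string html)
    {
        // Pass null and empty strings through unchanged.
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        // Replace each tag with a space so adjacent words do not merge.
        string text = Tags.Replace(html, " ");

        // Decode entities such as &amp; once the tags are gone.
        text = WebUtility.HtmlDecode(text);

        // Collapse whitespace runs and trim the ends.
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

It can be passed to ConvertBeforeSerializing in the same way as the StripHtml method above. A regex is not a real HTML parser, so comments, script blocks, or a ">" inside an attribute value may not be handled correctly.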

#142044

Edited, Nov 27, 2015 12:41

Per Magne Skuseth

I think you should try a different approach, as this is getting slightly hacky :-)

How about implementing your own SearchText property on your page? This will override the SearchText field in the index, so you will have full control over what ends up in it.
Do this by adding a new property named SearchText to your content type or base class.

For example:

public virtual string SearchText => string.Format(CultureInfo.InvariantCulture, "{0} {1} {2}", PageName, MainBody != null ? MainBody.AsViewedByAnonymous() : "", MetaTitle);

#142045

Nov 27, 2015 13:15

Per Magne Skuseth

If you are not using C# 6.0:

public virtual string SearchText { get { return String.Format(CultureInfo.InvariantCulture, "{0} {1} {2}", PageName, MainBody != null ? MainBody.AsViewedByAnonymous() : "", MetaTitle); } }

#142046

Nov 27, 2015 13:16