HTML gets included in the index (Excerpt property)

I have an XhtmlString property on a page that is indexed using Find.

[CultureSpecific]
[Display(
    Name = "Body",
    GroupName = Core.EPiServer.Ui.PageTabNames.Content,
    Order = 20)]
public virtual XhtmlString Body { get; set; }

But when I search using SearchClient.Instance.UnifiedSearchFor(query), the Excerpt property includes HTML from properties of type XhtmlString.

I would like Find to strip the HTML before adding the content to the index.

What is the best way to implement this?

#142024
Nov 26, 2015 15:48

I have tried following the "Removing HTML tags" examples, but as far as I can see, they are not compatible with XhtmlString properties.

Nothing changes when I add the attribute to my property.
The convention does not seem to be supported in the API for XhtmlString.

Maybe I did something wrong?

#142032
Nov 27, 2015 9:32

You'll have to reindex after adding the attribute.

#142033
Nov 27, 2015 10:47

I deleted the index and published the page again, but with no luck.

Are you sure this attribute is supposed to work for XhtmlString properties as well?

#142034
Nov 27, 2015 10:51

Which version of Find are you using?

#142035
Nov 27, 2015 10:57

The index is hosted by EPiServer, and the assembly version is 9.5.0.2999.

#142036
Nov 27, 2015 11:00

If you look at Find's explore view, does the Body property contain HTML? What about the "SearchText" field in the index?

#142037
Nov 27, 2015 11:03

SearchText gets indexed including the HTML tags, but the HTML is stripped from the AsViewedByAnonymous field of the Body property.

"SearchText$$string": "Om os  <p>Dette er en test af hvordan&nbsp;</p>\n<ul>\n<li>EPiServer</li>\n<li>har tænkt sig at <strong>håndtere</strong> HTML</li>\n<li>i indholdsfelter</li>\n</ul>\n<p>Går det mon godt i <strong>dette</strong> tilfælde?</p> epi.cms.contentdata:///105",

"Body": {
    "IsEmpty$$bool": false,
    "___types": [
        "EPiServer.Core.XhtmlString",
        "System.Object",
        "System.Web.IHtmlString",
        "System.Runtime.Serialization.ISerializable",
        "EPiServer.Data.Entity.IReadOnly`1[[EPiServer.Core.XhtmlString, EPiServer, Version=8.8.1.0, Culture=neutral, PublicKeyToken=8fe83dea738b45b7]]",
        "EPiServer.Data.Entity.IReadOnly"
    ],
    "AsViewedByAnonymous$$string": "Dette er en test af hvordan EPiServer har tænkt sig at håndtere HTML i indholdsfelter Går det mon godt i dette tilfælde?",
    "IsModified$$bool": false,
    "$type": "EPiServer.Core.XhtmlString, EPiServer"
},

#142039
Edited, Nov 27, 2015 11:13

Sounds like a bug in the SearchText extension to me. This does not happen in EPiServer Find 11.

A workaround could be to add the following to an initializable module:

SearchClient.Instance.Conventions.ForInstancesOf<IContent>().Field(x => x.SearchText()).StripHtml();

and then reindex. I have not tried this, though.
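
For reference, a minimal sketch of what such an initializable module might look like. The class name is hypothetical, the namespaces in the using directives are my assumption of where these types live in this Find version, and it has not been compiled or run.

```csharp
using EPiServer.Core;
using EPiServer.Find.ClientConventions; // assumed home of the field convention helpers
using EPiServer.Find.Cms;               // assumed home of the SearchText() extension
using EPiServer.Find.Framework;         // assumed home of SearchClient
using EPiServer.Framework;
using EPiServer.Framework.Initialization;

// Hypothetical module name, for illustration only.
[InitializableModule]
[ModuleDependency(typeof(EPiServer.Web.InitializationModule))]
public class SearchTextConventionModule : IInitializableModule
{
    public void Initialize(InitializationEngine context)
    {
        // The convention quoted above: strip HTML from SearchText before indexing.
        SearchClient.Instance.Conventions
            .ForInstancesOf<IContent>()
            .Field(x => x.SearchText())
            .StripHtml();
    }

    public void Uninitialize(InitializationEngine context)
    {
        // Nothing to clean up in this sketch.
    }
}
```

The convention only affects documents indexed after the module has run, which is why a reindex is needed.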

#142041
Nov 27, 2015 11:43

Using your code gives me some strange results.
When I execute a query, I get TotalMatchingResult = 1, but the hits collection is empty.

But the code below seems to fix it.

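// Null-safe wrapper around the Find string extension
// EPiServer.Find.Helpers.Text.StringExtensions.StripHtml().
public static string StripHtml(string html)
{
    if (!string.IsNullOrEmpty(html))
    {
        return html.StripHtml();
    }

    return html;
}

public void Initialize(InitializationEngine context)
{
    // Strip HTML from SearchText just before the document is serialized for the index.
    SearchClient.Instance.Conventions
        .ForInstancesOf<IContent>()
        .Field(x => x.SearchText())
        .ConvertBeforeSerializing(StripHtml);
}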

Now the only problem is that SearchText includes a strange suffix, which is searchable and is sometimes included in the Excerpt.

"SearchText$$string": "Om os Dette er en test af hvordan \n \n EPiServer \n har tænkt sig at håndtere HTML \n i indholdsfelter \n \n Går det mon godt i dette tilfælde? \n test test test test epi.cms.contentdata:///105",

Edit: Now using EPiServer.Find.Helpers.Text.StringExtensions.StripHtml() instead of custom Regex.

#142044
Edited, Nov 27, 2015 12:41

I think you should try a different approach as this is getting slightly hacky :-)

How about implementing your own SearchText property on your page? This will override the SearchText field in the index, so you have full control over what goes into it.
Do this by adding a new property named SearchText to your content type or base class.

For example:

public virtual string SearchText => string.Format(
    CultureInfo.InvariantCulture,
    "{0} {1} {2}",
    PageName,
    MainBody != null ? MainBody.AsViewedByAnonymous() : "",
    MetaTitle);
#142045
Nov 27, 2015 13:15

If you are not using C# 6.0:

public virtual string SearchText
{
    get
    {
        return String.Format(
            CultureInfo.InvariantCulture,
            "{0} {1} {2}",
            PageName,
            MainBody != null ? MainBody.AsViewedByAnonymous() : "",
            MetaTitle);
    }
}
#142046
Nov 27, 2015 13:16
This topic was created over six months ago and has been resolved. If you have a similar question, please create a new topic and refer to this one.