Magnus Stråle
Mar 3, 2011

Do we really need yet another HTML parser?

 In EPiServer Framework we now (since the Community 4 release) ship a new HTML parser. I will do some more technically oriented posts later on, but for now I just wanted to explain why we decided to invest in a new parser.

Why do we need an HTML parser at all?

Short aside - the "HTML parsing" that we do is lexical analysis and tokenization. In a Computer Science class we would not get away with calling this parsing since we don't really care about the syntax.
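
To make the distinction concrete, here is a minimal sketch of pure tokenization. It is not the EPiServer implementation; the TinyLexer class and its Tokenize method are hypothetical names made up for this illustration. The lexer only splits the input into tags and text and never checks that the tags nest correctly, which is exactly what a real parser would have to do.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of EPiServer Framework.
public static class TinyLexer
{
    // Splits HTML into "tag" and "text" tokens without validating nesting.
    public static IEnumerable<(string Kind, string Text)> Tokenize(string html)
    {
        int pos = 0;
        while (pos < html.Length)
        {
            if (html[pos] == '<')
            {
                int end = html.IndexOf('>', pos);
                if (end < 0) end = html.Length - 1;   // unterminated tag: take the rest
                yield return ("tag", html.Substring(pos, end - pos + 1));
                pos = end + 1;
            }
            else
            {
                int next = html.IndexOf('<', pos);
                if (next < 0) next = html.Length;     // trailing text up to the end
                yield return ("text", html.Substring(pos, next - pos));
                pos = next;
            }
        }
    }

    public static void Main()
    {
        // The tags are mis-nested, but the lexer emits five tokens without complaint.
        foreach (var token in Tokenize("<b><i>Hi</b></i>"))
            Console.WriteLine($"{token.Kind}: {token.Text}");
    }
}
```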

In EPiServer CMS we primarily need HTML parsing for Friendly URL (aka FURL) rewriting of outgoing HTML. It is also used to deal with the permanent link scheme used when storing CMS data to the database. As part of this process we also do the "soft link" indexing. There are a few other situations, for example allowing a subset of HTML for untrusted user input (Relate), where a solid HTML parser can help. Finally, there are all sorts of interesting scenarios that you, our partners, come up with: pulling information from other sites and extracting links, custom markup languages, etc.

What's wrong with the SGML Reader that we use today?

The SGML Reader is basically an XML reader (built around the same .NET infrastructure as the XML readers) that accepts malformed XML / HTML.

Unfortunately that codebase is very complex. There are a couple of long-standing bugs that we have been unable to fix. Another aspect is that the SGML Reader will force your HTML code into well-formed XHTML. This is usually the right thing to do, but in some cases you don't want your HTML code to be reformatted at all.

The XML reader model of returning the node and its attributes separately also makes client code much more complicated than necessary, usually forcing you to use an event-based architecture. I will give some examples of this in future posts.
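
As a rough illustration of that node/attribute split, the sketch below uses the standard System.Xml.XmlReader rather than the SGML Reader itself, on the assumption that the cursor model is the same. The XmlReaderAttributes class and the CollectHrefs helper are names invented for this example. Note how the reader has to be moved onto each attribute and then back to the element just to read a single value.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper using the standard XmlReader, not the SGML Reader.
public static class XmlReaderAttributes
{
    // Collects every href attribute value found in well-formed markup.
    public static List<string> CollectHrefs(string xml)
    {
        var hrefs = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element) continue;

                // The element node does not hand over its attributes;
                // the cursor must visit them one by one.
                while (reader.MoveToNextAttribute())
                {
                    if (reader.Name == "href") hrefs.Add(reader.Value);
                }
                reader.MoveToElement();   // return to the owning element
            }
        }
        return hrefs;
    }

    public static void Main()
    {
        var hrefs = CollectHrefs("<p><a href=\"/start\" title=\"Home\">Home</a></p>");
        Console.WriteLine(string.Join(", ", hrefs));   // prints "/start"
    }
}
```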

We are not happy with the SGML Reader, but there are other HTML parsers. Why not use one of them?

Very good question - it is so easy to fall into the "Not Invented Here" trap. Creating a good HTML parser is a major undertaking, so let's first go through the "must haves" for our parser and compare the existing HTML parser implementations against them:

  1. High performance.
    Since it is used to parse outgoing HTML it will be called very frequently. Today the FURL rewriting is responsible for 5 - 20% of the page execution time.
  2. Streaming model.
    Since it is used very frequently and with HTML responses of unknown size, it would be bad if we have to keep the entire HTML in memory at the same time. We need a streaming model to handle this.
  3. Easy to maintain and easy to use.
    This is a must for any piece of software, but I mention it here due to the issues we've had with SGML Reader.
  4. Minimal changes to HTML after roundtrip in the parser.
    If you write HTML in a specific way, then you probably do it for a reason. We should not modify it unless requested, or absolutely necessary.
  5. Pure CLR implementation.
    We do not want complicated installation procedures with COM registration or native-code libraries.
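
To show what must-have #2 means in practice, here is a minimal sketch of a streaming reader built on a C# iterator. It is an illustration under my own assumptions, not the HtmlStreamReader code; StreamingSketch and ReadFragments are hypothetical names. Only the fragment currently being assembled is buffered, and nothing is read from the underlying TextReader until the caller asks for the next fragment.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical illustration of a streaming model; not the EPiServer code.
public static class StreamingSketch
{
    // Yields tags and text runs one at a time, buffering only the current one.
    public static IEnumerable<string> ReadFragments(TextReader reader)
    {
        var buffer = new StringBuilder();
        int c;
        while ((c = reader.Read()) != -1)
        {
            char ch = (char)c;
            if (ch == '<' && buffer.Length > 0)
            {
                yield return buffer.ToString();   // whatever preceded the new tag
                buffer.Clear();
            }
            buffer.Append(ch);
            if (ch == '>' && buffer[0] == '<')
            {
                yield return buffer.ToString();   // a complete tag
                buffer.Clear();
            }
        }
        if (buffer.Length > 0) yield return buffer.ToString();
    }

    public static void Main()
    {
        var reader = new StringReader("<p>Hello</p><p>World</p>");
        // Asking for one fragment consumes only three characters of input.
        Console.WriteLine(ReadFragments(reader).First());   // prints "<p>"
        Console.WriteLine((char)reader.Peek());             // prints "H"
    }
}
```

Because the result is a plain IEnumerable<string>, the caller decides how much of the response to consume, and memory use stays proportional to the largest single fragment rather than to the whole page.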

Now let's take a look at the existing parsers:

SGML Reader
XML DOM-based parsing, although it is possible to use without actually creating an XML DOM. As previously noted, fails on #3 and #4.
HTML Agility Pack
Very nice API, but it does not support a streaming model. Everything gets read into memory before you can act on it, breaking #2.
Extremely fast but enforces a hard, compile-time limit on the size of the HTML. The API is also a bit clunky, breaking must-have #2 and #3.
Nice API, but still DOM-based, breaking #2.
A wrapper for the native-code HTML Tidy library, breaking #5.


(Please let me know if there are any interesting libraries that I have missed.)

What are the features that make the new HTML parser so special?

  • Streaming model.
    The parser simply returns a stream of HTML fragments, no DOM, no big pile of data in memory.
  • High-performance.
    Detailed benchmarks have only been run against SGML Reader, which is already fast, and HtmlStreamReader outperforms SGML Reader by 10 - 50%.
  • LINQ support.
    Since the HtmlStreamReader implements IEnumerable<HtmlFragment> it directly supports LINQ-to-objects. Alternatively you can just do a foreach over the results if you want to do classical looping.
  • Roundtripping (reading from one stream and writing the data to another stream) through HtmlStreamReader will make minimal changes to your code.
    The only thing that we will touch is whitespace in HTML elements; everything else is left intact, unless you explicitly enable things like fixing mismatched tags.
  • Support for correcting common issues with HTML.
    The parser can automatically insert missing end tags, enforce the empty content model and a few other tricks to clean up your HTML. All these fixups are optional.
  • Handling of malformed data is compatible with common browser behavior.
    If you have serious errors in your HTML code, such as leaving out the closing bracket ( "<b>Bold</b<i>Italic</i>" ) the parser will correct and return elements according to the same heuristics as most major browsers.

None of these features in itself makes the HtmlStreamReader unique, but the combination of features is perfect for our needs. I hope you will find it useful too!

Since I feel that a blog post is not complete unless it contains at least a few lines of code, here is a short sample showing off the LINQ capabilities. This code snippet will show all external references from the start page of the Swedish newspaper DN.

var html = new HtmlStreamReader(new StreamReader(WebRequest.Create("").GetResponse().GetResponseStream()));

var result = html.

    SelectMany(e => e.Attributes).
    Where(a => (a.Token == AttributeToken.Src || a.Token == AttributeToken.Href) &&
        !a.UnquotedValue.StartsWith("") &&




Mark Bagnall Mar 3, 2011 02:14 PM

A quick question - can the control over roundtripping (whether 'invalid' HTML is auto-corrected or not) be specified on a per-property basis, or is it simply on-off for the entire site?

Mar 4, 2011 05:51 AM

@Mark: I did not make it clear that for the R2 release we are only using the new HtmlStreamReader for PropertyXhtmlString (CMS) and HTML filtering (Relate). For FURL rewriting we still use SGML Reader. This will change in the next release. We decided not to do it now because CMS 6 -> CMS 6 R2 should be a no-brainer upgrade, and the change in HTML rewriting behavior (today we always XHTML-ify and move tags according to HTML semantics) was considered too breaking to make it into CMS 6 R2.

Therefore I cannot really answer your question since we haven't implemented that part yet, but we will keep it in mind.

Mads Storm Hansen Feb 3, 2012 11:48 AM

This would make a great open-source project ;-)
