Do we really need yet another HTML parser?
In EPiServer Framework we now (since the Community 4 release) ship a new HTML parser. I will do some more technically oriented posts later on, but for now I just wanted to explain why we decided to invest in a new parser.
Why do we need an HTML parser at all?
Short aside - the "HTML parsing" that we do is lexical analysis and tokenization. In a Computer Science class we would not get away with calling this parsing since we don't really care about the syntax.
In EPiServer CMS we primarily need HTML parsing for Friendly URL (aka FURL) rewriting of outgoing HTML. It is also used to deal with the permanent link scheme used when storing CMS data to the database. As part of this process we also do the "soft link" indexing. There are also a few other situations, for example allowing a subset of HTML for untrusted user input (Relate), where a solid HTML parser can help. Finally there are also all sorts of interesting scenarios that you, our partners, come up with - pulling information from other sites and extracting links, custom markup language etc.
What's wrong with the SGML Reader that we use today?
The SGML Reader is basically an XML reader (built around the same .NET infrastructure as the XML readers) that accepts malformed XML/HTML.
Unfortunately that codebase is very complex. There are a couple of long-standing bugs that we have been unable to fix. Another aspect is that the SGML Reader will force your HTML code into well-formed XHTML. This is usually the right thing to do, but in some cases you don't want your HTML code to be reformatted at all.
The XML reader model, which returns the node and its attributes through separate calls, also makes client code much more complicated than necessary, usually forcing you to use an event-based architecture. I will give some examples of this in future posts, but the sketch below gives a feel for the pattern.
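Here is a rough sketch of the XmlReader-style loop that the SGML Reader pushes you towards. The SgmlReader type and its DocType/InputStream properties are taken from the public SgmlReader project, but treat the exact member names as illustrative rather than authoritative:

// Rough sketch: consuming HTML through an XmlReader-derived SGML Reader.
// Nodes and attributes arrive through separate Read/MoveToNextAttribute calls,
// so even simple tasks tend to grow into a small state machine over node types.
using (var input = new StringReader("<p class=\"intro\">Hello <b>world</b>"))
{
    var reader = new Sgml.SgmlReader { DocType = "HTML", InputStream = input };
    while (reader.Read())
    {
        if (reader.NodeType == System.Xml.XmlNodeType.Element)
        {
            Console.WriteLine("Element: " + reader.Name);
            while (reader.MoveToNextAttribute())
            {
                Console.WriteLine("  Attribute: " + reader.Name + " = " + reader.Value);
            }
        }
    }
}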
We are not happy with the SGML Reader, but there are other HTML parsers. Why not use one of them?
Very good question - it is so easy to fall into the "Not Invented Here" trap. Creating a good HTML parser is a major undertaking, so let's first go through the "must haves" for our parser and compare them to existing HTML parser implementations:
- High performance. Since it is used to parse outgoing HTML it will be called very frequently. Today the FURL rewriting is responsible for 5-20% of the page execution time.
- Streaming model. Since it is used very frequently and with HTML responses of unknown size, we cannot afford to keep the entire HTML in memory at the same time. We need a streaming model to handle this.
- Easy to maintain and easy to use. This is a must for any piece of software, but I mention it here because of the issues we have had with the SGML Reader.
- Minimal changes to HTML after a roundtrip through the parser. If you write HTML in a specific way, you probably do it for a reason. We should not modify it unless requested, or absolutely necessary.
- Pure CLR implementation. We do not want complicated installation procedures with COM registration or native-code libraries.
Now let's take a look at the existing parsers:
- SGML Reader (http://archive.msdn.microsoft.com/SgmlReader): XML-DOM-style parsing, although it is possible to use it without actually creating an XML DOM. As previously noted, it fails on #3 and #4.
- HTML Agility Pack (http://htmlagilitypack.codeplex.com/): Very nice API, but it does not support a streaming model. Everything gets read into memory before you can act on it, breaking #2.
- Majestic-12 (http://www.majestic12.co.uk/projects/html_parser.php): Extremely fast, but enforces a hard, compile-time limit on the size of the HTML. The API is also a bit clunky, breaking must-haves #2 and #3.
- LINQ to HTML (http://www.justagile.com/linq-to-html.aspx): Nice API, but still DOM-based, breaking #2.
- TidyForNet (http://tidyfornet.sourceforge.net/): A wrapper for the native-code HTML Tidy library, breaking #5.
(Please let me know if there are any interesting libraries that I have missed.)
What are the features that make the new HTML parser so special?
- Streaming model. The parser simply returns a stream of HTML fragments: no DOM, no big pile of data in memory.
- High performance. Detailed benchmarks have only been done against SGML Reader, which is already fast, and HtmlStreamReader outperforms SGML Reader by 10-50%.
- LINQ support. Since HtmlStreamReader implements IEnumerable<HtmlFragment> it directly supports LINQ to Objects. Alternatively, you can just do a foreach over the results if you want classical looping (see the sketch after this list).
- Roundtripping (reading from a stream and writing the data to another stream) through HtmlStreamReader makes minimal changes to your HTML. The only thing we touch is whitespace in HTML elements; everything else is left intact, unless you explicitly enable things like fixing mismatched tags.
- Support for correcting common issues with HTML. The parser can automatically insert missing end tags, enforce the empty content model and perform a few other tricks to clean up your HTML. All these fixups are optional.
- Handling of malformed data is compatible with common browser behavior. If you have serious errors in your HTML code, such as leaving out a closing bracket ("<b>Bold</b<i>Italic</i>"), the parser will correct and return elements according to the same heuristics as most major browsers.
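For those who prefer the classical loop over LINQ, the streaming model boils down to something like the sketch below. It assumes that the HtmlStreamReader constructor accepts any TextReader and that ElementFragment exposes a Name property; adjust to the actual API if those assumptions are off.

// Minimal sketch of "classical looping" over the fragment stream.
// Fragments are produced one at a time, so the whole document is never held in memory.
var reader = new HtmlStreamReader(new StringReader("<div id=\"main\"><p>Hello <b>world</b></p></div>"));
foreach (HtmlFragment fragment in reader)
{
    var element = fragment as ElementFragment;
    if (element != null)
    {
        Console.WriteLine("Start tag: " + element.Name); // Name property is assumed here
    }
}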
None of these features in itself makes HtmlStreamReader unique, but the combination is perfect for our needs. I hope you will find it useful too!
Since I feel that a blog post is not complete unless it contains at least a few lines of code, here is a short sample showing off the LINQ capabilities. This code snippet lists all external references from the start page of the Swedish newspaper DN.
// Requires using directives for System.IO, System.Linq, System.Net and the EPiServer HTML parsing namespace.
var html = new HtmlStreamReader(new StreamReader(WebRequest.Create("http://www.dn.se/").GetResponse().GetResponseStream()));

var result = html.
    OfType<ElementFragment>().
    SelectMany(e => e.Attributes).
    Where(a => (a.Token == AttributeToken.Src || a.Token == AttributeToken.Href) &&
               !a.UnquotedValue.StartsWith("http://www.dn.se") &&
               a.UnquotedValue.StartsWith("http://"));
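The query is lazy, so nothing is read from the response until you iterate over it. Dumping the external references to the console is then just a matter of:

foreach (var attribute in result)
{
    Console.WriteLine(attribute.UnquotedValue);
}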
A quick question - can the control over roundtripping (whether 'invalid' HTML is auto-corrected or not) be specified on a per-property basis, or is it simply on/off for the entire site?
@Mark: I did not make it clear that for the R2 release we are only using the new HtmlStreamReader for PropertyXhtmlString (CMS) and HTML filtering (Relate). For FURL rewriting we still use the SGML Reader. This will change for the next release. The reason we decided not to do it now is that the CMS 6 -> CMS 6 R2 upgrade should be a no-brainer, and the change in HTML rewriting behavior (today we always XHTML-ify and move tags according to HTML semantics) was considered too breaking to make it into CMS 6 R2.
Therefore I cannot really answer your question since we haven't implemented that part yet, but we will keep it in mind.
This would make a great open-source project ;-)