One of the interesting security problems we see on the web today is how to deal with untrusted HTML. Lots of folks want to allow Markdown on their sites, or AsciiDoc or some other text format, and any of these can contain literal HTML that needs to be sanitized. The problem invariably comes down to writing a sanitizer that parses HTML in the way that everyone else parses HTML, so that a malicious user can’t provide HTML that escapes sanitization.

Now, theoretically, HTML5 has this all specified for us: it defines the error handling for every syntax error, so it’s possible for us to handle any case by just doing what the spec says. The reality, though, is different: not every browser (or other tool) necessarily parses according to the spec. Heck, some people still try to use regular expressions to parse HTML.

There is, however, a simple, effective, and wildly unpopular alternative: XHTML. Because XHTML is XML, there are a variety of parsers which can parse it correctly, and most importantly, every syntactical error is fatal. There is, unambiguously, one way to parse XHTML. This is remarkably good for security, because it means we can sanitize our document and know that every other user will parse the data the same way.
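To make that concrete, here is a minimal sketch of what such a check might look like using Python's standard `xml.etree.ElementTree`. The allowlists, the `sanitize` name, and the choice to reject (rather than strip) disallowed markup are all my own assumptions for illustration; a real sanitizer would need a policy tuned to the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy, chosen only for this example.
ALLOWED_TAGS = {"div", "p", "em", "strong", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://")


def sanitize(fragment: str) -> str:
    """Return the fragment re-serialized inside a <div>, or raise.

    Raises ET.ParseError if the fragment is not well-formed XML and
    ValueError if it uses an element, attribute, or URL scheme that
    the policy above does not allow.
    """
    # Wrapping in a root element lets the fragment contain several
    # top-level nodes. Any well-formedness error is fatal here.
    root = ET.fromstring("<div>" + fragment + "</div>")

    for elem in root.iter():
        if elem.tag not in ALLOWED_TAGS:
            raise ValueError(f"element not allowed: {elem.tag}")
        allowed = ALLOWED_ATTRS.get(elem.tag, set())
        for name, value in elem.attrib.items():
            if name not in allowed:
                raise ValueError(f"attribute not allowed: {name}")
            if name == "href" and not value.startswith(SAFE_URL_PREFIXES):
                raise ValueError(f"URL scheme not allowed: {value}")

    # Serialize from the parsed tree, not from the input string, so the
    # output is exactly what the parser understood.
    return ET.tostring(root, encoding="unicode")
```

Calling `sanitize("<p>hi <em>there</em></p>")` returns the fragment wrapped in a `<div>`, while an unclosed tag raises `ET.ParseError` and a `<script>` element raises `ValueError`. Rejecting outright keeps the sketch short; stripping disallowed nodes instead is possible but needs more care to preserve the surrounding text.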

Moreover, the strictness has additional benefits, because it makes some types of cross-site scripting attacks harder. If the user cannot insert a syntactically valid fragment into your XHTML page, then the best they can do is cause the page to fail to render. No more writing things like <img src=x onerror=alert(1);> without any quotation marks in sight.
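As a quick illustration, any conforming XML parser should refuse that snippet outright. The sketch below uses Python's standard `xml.etree.ElementTree`; the `is_well_formed` helper is a name I made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Report whether markup parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Unquoted attribute values and a missing end tag: rejected.
unquoted = is_well_formed("<img src=x onerror=alert(1);>")

# Quoted attributes and a self-closing tag: accepted.
quoted = is_well_formed('<img src="x" alt="" />')
```

Here `unquoted` ends up `False` and `quoted` ends up `True`. Note that well-formedness alone is not a sanitizer: a carefully quoted `onerror="..."` attribute is still well-formed, so an allowlist of elements and attributes is needed on top of the strict parse.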

Many people bemoan the fatality of syntactical errors because it makes XML a bear to write, and I agree wholeheartedly that it does. Nevertheless, many of us write C, or JavaScript, or a variety of other languages and can deal with the strictness imposed there, so I feel confident we as developers can deal with XHTML.

Additionally, XHTML lost popularity in part because browsers had to parse the entire document before rendering it, which slowed page loads, although that need not be the case. There are a variety of popular incremental parsing interfaces, such as SAX, that can process a document piece by piece as it arrives. It’s true that the browser will have to show an error if the document turns out to be syntactically invalid, which can cause a flash of partially loaded page before the error page, but this isn’t an issue if the page is syntactically valid, and it’s a small penalty to incur for the benefits.
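For a rough idea of what incremental parsing looks like, here is a small sketch using Python's standard `xml.sax` module, whose default parser accepts data in chunks via `feed()`. The `StartTagLogger` handler and `parse_in_chunks` function are hypothetical names for this example, and a real renderer would obviously do more than record tag names.

```python
import xml.sax


class StartTagLogger(xml.sax.ContentHandler):
    """Records the name of each element as its start tag is parsed."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)


def parse_in_chunks(chunks, handler):
    """Feed chunks to a SAX parser one at a time.

    Raises xml.sax.SAXParseException at the first syntax error, which
    may come only after earlier chunks have already been handled.
    """
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)  # callbacks run as the parser consumes data
    parser.close()  # signals end of input; an unclosed document fails here


good = StartTagLogger()
parse_in_chunks(["<html><bo", "dy><p>hi</p>", "</body></html>"], good)

bad = StartTagLogger()
try:
    parse_in_chunks(["<html><body><p>hi</p>", "</bod></html>"], bad)
    failed = False
except xml.sax.SAXParseException:
    failed = True
```

In the second call the handler has already recorded `html`, `body`, and `p` by the time the mismatched `</bod>` raises. That mirrors the flash of partially loaded content followed by an error described above.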

The astute reader will notice that this site (and all of my other sites) serves syntactically valid XHTML, and has for years. Since I generate all my sites from other formats, there’s no downside for me, and I basically get all the upsides with no work. For other folks, especially those designing web apps, it may take more getting used to, but the improved security is totally worth it.