Wednesday, December 06, 2006

Why XHTML still serves a purpose

The W3C is a source of irritation to many web developers.  It seems to produce heinously complex specifications that solve problems nobody has.  It's tackling issues such as the semantic web, when much "simpler" issues such as, say, the syntactic web are completely unresolved.

Evan Goer is cynical about XHTML's usefulness.  XHTML adds the ability to embed other formats such as MathML and SVG, and that "sole" advantage can be mimicked with the help of some JavaScript, allowing SVG and MathML embeds in plain HTML4.

XHTML does add a whole lot more than that, however.  How many developers nowadays actually write all of their HTML themselves?  Anyone?  Almost every site on the planet uses some form of HTML generation.  And it so happens that XHTML is much more suited to that task than HTML, for a number of reasons.

Somebody said that a prime advantage of XHTML is its integration into the XML food chain.  This is a huge advantage.  The project I'm currently working on generates all output via XSLT transformations - which handily guarantees a certain minimum of well-formedness.  When I perform screenscraping, the first step is always to run Tidy over the input - that allows me to choose any number of parsing techniques instead of regular expressions to extract the information I'm looking for.
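As a rough illustration of that last point, here is a minimal Python sketch of what "parsing instead of regular expressions" can look like once a page has already been tidied into well-formed XHTML.  The function name `extract_links` and the sample page are made up for this example, and it assumes the input uses the XHTML namespace and contains no named entities such as `&nbsp;` that would need a DTD.

```python
import xml.etree.ElementTree as ET

# Elements in well-formed XHTML live in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def extract_links(xhtml_text):
    """Return (href, link text) pairs from a well-formed XHTML document."""
    root = ET.fromstring(xhtml_text)
    links = []
    for a in root.iter(XHTML_NS + "a"):
        href = a.get("href")
        if href is not None:
            # itertext() also collects text inside nested elements like <em>.
            links.append((href, "".join(a.itertext()).strip()))
    return links

# Hypothetical, already-tidied input.
page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>See <a href="http://example.org/a">the <em>first</em> page</a>
    and <a href="http://example.org/b">the second</a>.</p>
  </body>
</html>"""

links = extract_links(page)
```

Nested markup inside the link text is handled by the parser here, which is exactly the kind of case where a regular expression tends to go wrong.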

The real killer advantage is more fundamental though, and this is an advantage that HTML is not likely to ever achieve:  security.  The threat of injection is looming large, and HTML is a target that is difficult to secure.  Earlier today another 0-day XSS exploit on MySpace was discovered.  It relies on a quirk of Firefox's HTML parser.

Invalid HTML abounds on the web, and every user agent interprets it differently.  Producing only well-formed XML significantly reduces your attack surface.  MySpace would not have been vulnerable had it actually parsed the HTML and regenerated it.  And while it's parsed anyway, why not use something like XSLT to filter out more complex issues?  Allowing structured user input without filtering it is a security impossibility, but filtering it correctly without parsing it is a technical impossibility.
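To make "parse it and regenerate it" concrete, here is a minimal sketch using Python's standard-library `html.parser`.  It is an illustration under stated assumptions, not a production sanitizer: the `ALLOWED` table, the `Sanitizer` class and the `sanitize` function are hypothetical names, and the tiny allowlist is only an example.  The idea is that nothing from the input is copied through verbatim; only recognised elements and attributes are re-emitted, and everything else comes out as escaped text or is dropped.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: element name -> attribute names that may be kept.
ALLOWED = {"b": set(), "i": set(), "p": set(), "a": {"href"}}

class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # regenerated output fragments
        self.open_tags = []  # stack of allowed elements still open

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # unknown element: drop the tag itself
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            # Only plain web links survive (drops javascript: and friends).
            if name == "href" and not value.startswith(("http://", "https://")):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray or disallowed close tag
        # Close everything opened since the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # text is always re-escaped

def sanitize(fragment):
    parser = Sanitizer()
    parser.feed(fragment)
    parser.close()
    while parser.open_tags:  # close anything the input left open
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)
```

For example, `sanitize('<p><b>x</p>')` regenerates properly nested markup, and event-handler attributes such as `onclick` never reach the output because they are not in the allowlist.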

And if you use XSLT to produce this safe version of your output, you get the bonus that, while working entirely in XHTML in your stylesheets, you can trivially choose to encode the result as HTML for compatibility reasons.
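The same idea can be sketched without XSLT.  The Python example below swaps in the standard library's ElementTree serializer as a stand-in for an XSLT output method: the content is built once as a tree, and the choice between XML-style and HTML-style output is made only at serialization time.  The `build_comment` helper and its element layout are invented for this illustration, and the XHTML namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

def build_comment(author, body):
    """Build a comment as a tree; user text is never pasted into markup."""
    div = ET.Element("div", {"class": "comment"})
    name = ET.SubElement(div, "strong")
    name.text = author
    br = ET.SubElement(div, "br")
    br.tail = body  # text following the <br/> element
    return div

tree = build_comment("Eamon", "1 < 2 & <script>alert(1)</script>")

# Same tree, two encodings: the serializer handles escaping in both cases.
as_xhtml = ET.tostring(tree, encoding="unicode", method="xml")
as_html = ET.tostring(tree, encoding="unicode", method="html")
```

In the XML form the empty element is written as `<br />`, in the HTML form as `<br>`; in both, the user-supplied `<script>` text comes out escaped as `&lt;script&gt;` rather than as markup.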

Essentially, if you're making a dynamic web site that displays user-provided input, and you're not generating HTML structurally (with a real serializer), you're asking for unnecessary trouble.  XHTML won't solve everything, but it's not useless: actually, you'd be crazy not to use it wherever you can ;-).

3 Comments:

At 06 December, 2006 17:02 , Anonymous Evan Goer said...

Hi Eamon,

First, thanks very much for writing, and for your thoughts on the matter.

On the screenscraping issue. First, let's note the interesting fact that you already have to pass the site through Tidy, whether or not the owner of the site claims that the page is "XHTML". That's because you and I both know that over 90% of the time (really! I've measured this), passing that "XHTML" page through an XML parser would blow up our parser. ;) This should be a big red flag about XHTML and any benefits about its purported XML nature right there.

Second, there are plenty of fine tools for parsing and screenscraping HTML. Python's SGMLParser, for instance, has worked great for my purposes. Anything you can Tidy and scrape as XHTML, I can Tidy and scrape as HTML -- or often, I can just scrape as HTML.

Third, screenscraping is a bad idea for anything other than the quickest-and-dirtiest kind of software. This is because the owner of the site almost certainly doesn't know or care that they are providing structured data to you. There's no contract here. If they *did* care about this they would provide a real API or at least an RSS/Atom feed.

On XHTML security. The MySpace XSS vulnerability relies on invalid HTML. Allow me to officially speak with my "Evan Goer" hat on, and not my "Employee of a competitor of MySpace" hat on: MySpace was utterly stupid to allow that sort of unsanitized user input to go through.

You are right that forcing user input to be valid XHTML significantly reduces your attack surface. But so would forcing user input to be valid HTML. Tidy and its libraries are perfectly capable of cleaning up bad HTML (or rejecting stuff that is too pathological). If you can produce a tool that generates valid XML, you can produce a tool that generates valid HTML too.

But MORE important: the real lesson, and one that is repeated over and over by security experts at my company and elsewhere, is that you cannot trust user input and you must sanitize the hell out of it. If you do want to allow any sort of structured user input, any sane user input system only allows a small, explicit subset of elements and attributes. Please, I beg you, do not rely on creating valid HTML or XHTML for sanitization. If you allow arbitrary valid (X)HTML, you will get owned. (Actually, the best thing of all is not to build your own sanitization system, but to take one that has already been well-tested in the wild.)

Since we're on the subject of parser bugs, here's a nasty remote execution vulnerability from a month ago that affected Drupal, due to a bug in its XML parser:

http://www.derkeiler.com/Mailing-Lists/securityfocus/bugtraq/2006-10/msg00320.html

If you search for "XML parser vulnerability" or "XML parser XSS" or the like, it's easy to find many more such issues...

 
At 07 December, 2006 01:51 , Anonymous Anonymous said...

Just for clarification. You said: "It's tackling issues such as the semantic web, when much "simpler" issues such as, say, the syntactic web are completely unresolved."

That is one of the common mistakes, people mixing W3C, the organization, and the individual WGs. HTML WG and Semantic Web WGs are different groups with different people working on different technologies.

Also, most active participants of WHATWG are in fact people who are part of W3C (CSS WG and WebAPI WG), so it is not a fight of W3C against the rest of the world. It might in fact be more a fight of engineers against their own company, as ironic as that might seem.

 
At 07 December, 2006 09:23 , Blogger Eamon Nerbonne said...

Sorry about the posting issue; the posts weren't eaten and are now online. In an (apparently vain) attempt to avoid supporting an install of yet another piece of software, I hoped Blogger would Just Work; since all dynamic content is hosted - and hopefully secured - by Blogger/Google, security issues should be less of my problem. However, the FTP publishing is very delayed (it seems I need to manually intervene or wait for a long, long time for it to update the site), which is very confusing if that's not what you're expecting.

So I posted my reply on Evan's blog.

 
