Sorry about the posting issue; the posts weren't e...

2006-12-07T09:23:00.000+01:00

Sorry about the posting issue; the posts weren't eaten and are now online. In an (apparently vain) attempt to avoid supporting an install of yet another piece of software, I hoped Blogger would Just Work; since all dynamic content is hosted - and hopefully secured - by Blogger/Google, security issues should be less of my problem. However, the FTP publishing is very delayed (it seems I need to intervene manually or wait a long, long time for it to update the site), which is very confusing if that's not what you're expecting.

So I posted my reply on Evan's blog.

Just for clarification. You said: "It's tackling i...

2006-12-07T01:51:00.000+01:00

Just for clarification. You said: "It's tackling issues such as the semantic web, when much "simpler" issues such as, say, the syntactic web are completely unresolved."

That is a common mistake: people conflate W3C, the organization, with its individual WGs. The HTML WG and the Semantic Web WGs are different groups, with different people working on different technologies.

Also, most active participants of WHATWG are in fact people who are part of W3C (CSS WG and WebAPI WG), so it is not a fight of W3C against the rest of the world. It might in fact be more a fight of engineers against their own companies, ironic as that might seem.

Hi Eamon, First, thanks very much for writing, an...

2006-12-06T17:02:00.000+01:00

Hi Eamon,

First, thanks very much for writing, and your thoughts on the matter.

On the screenscraping issue. First, let's note the interesting fact that you already have to pass the site through Tidy, whether or not the owner of the site claims that the page is "XHTML". That's because you and I both know that over 90% of the time (really! I've measured this), passing that "XHTML" page through an XML parser would blow up our parser. ;) That alone should be a big red flag about XHTML and any benefits of its purported XML nature.
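
To make the "blow up our parser" point concrete, here is a minimal Python sketch. The page fragment and the helper name `is_well_formed` are made up for illustration; the only library call is the standard `xml.etree.ElementTree.fromstring`, which rejects anything that is not well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a page that claims to be "XHTML" but is not
# well-formed XML: the ampersand is unescaped and <br> is never closed.
claimed_xhtml = "<html><body><p>Fish & chips<br></p></body></html>"

# The same content written as well-formed XML, for comparison.
real_xhtml = "<html><body><p>Fish &amp; chips<br/></p></body></html>"


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # A strict XML parser stops at the first well-formedness error.
        return False


print(is_well_formed(claimed_xhtml))
print(is_well_formed(real_xhtml))
```

A browser's HTML parser would render both fragments without complaint, which is why a scraper has to run Tidy first regardless of what the doctype claims.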

Second, there are plenty of fine tools for parsing and screenscraping HTML. Python's SGMLParser, for instance, has worked great for my purposes. Anything you can Tidy and scrape as XHTML, I can Tidy and scrape as HTML -- or often, I can just scrape as HTML.
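
As a rough illustration of scraping HTML directly, here is a sketch that pulls link targets out of tag soup. It uses `html.parser.HTMLParser` from the Python 3 standard library rather than `SGMLParser` (the `sgmllib` module was removed in Python 3); the `LinkCollector` class and the sample markup are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> works too.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical tag soup: unclosed <a> and <br>, mixed-case tag names.
soup = '<p>See <a href="/one">one<br><A HREF="/two">two</p>'

collector = LinkCollector()
collector.feed(soup)
collector.close()
print(collector.links)
```

No Tidy pass and no XML parser are involved; the event-based parser simply reports the start tags it sees and never requires the document to be well-formed.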

Third, screenscraping is a bad idea for anything other than the quickest-and-dirtiest kind of software. This is because the owner of the site almost certainly doesn't know or care that they are providing structured data to you. There's no contract here. If they *did* care about this they would provide a real API or at least an RSS/Atom feed.

On XHTML security. The MySpace XSS vulnerability relies on invalid HTML. Allow me to officially speak with my "Evan Goer" hat on, and not my "Employee of a competitor of MySpace" hat on: MySpace was utterly stupid to allow that sort of unsanitized user input to go through.

You are right that forcing user input to be valid XHTML significantly reduces your attack surface. But so would forcing user input to be valid HTML. Tidy and its libraries are perfectly capable of cleaning up bad HTML (or rejecting stuff that is too pathological). If you can produce a tool that generates valid XML, you can produce a tool that generates valid HTML too.

But MORE important: the real lesson, and one that is repeated over and over by security experts at my company and elsewhere, is that you cannot trust user input and you must sanitize the hell out of it. If you do want to allow any sort of structured user input, any sane user input system only allows a small, explicit subset of elements and attributes. Please, I beg you, do not rely on creating valid HTML or XHTML for sanitization. If you allow arbitrary valid (X)HTML, you will get owned. (Actually, the best thing of all is not to build your own sanitization system, but to take one that has already been well-tested in the wild.)
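
To show what "a small, explicit subset of elements and attributes" looks like in practice, here is a toy allowlist filter in Python. It is only a sketch of the idea and not something to deploy; per the advice above, a well-tested existing sanitizer is the better choice. The `ALLOWED` table, `SAFE_SCHEMES`, `AllowlistFilter`, and `sanitize` are all hypothetical names chosen for this example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: element name -> attribute names permitted on it.
ALLOWED = {"b": set(), "i": set(), "p": set(), "a": {"href"}}
# Assumption for this sketch: links may only use these URL schemes.
SAFE_SCHEMES = ("http://", "https://")


class AllowlistFilter(HTMLParser):
    """Re-emit only allowlisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # the tag is dropped; any text inside is kept as escaped text
        kept = []
        for name, value in attrs:
            if (name in ALLOWED[tag] and value
                    and value.lower().startswith(SAFE_SCHEMES)):
                kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data))


def sanitize(markup):
    """Return markup rebuilt from allowlisted pieces only."""
    f = AllowlistFilter()
    f.feed(markup)
    f.close()
    return "".join(f.out)


dirty = ('<p onclick="x()">hi <script>alert(1)</script>'
         '<a href="javascript:evil()">link</a></p>')
print(sanitize(dirty))
```

The design point is that the output is rebuilt from scratch out of known-good pieces, so the `onclick` attribute, the `script` element, and the `javascript:` URL never reach the page, even though each could appear in perfectly valid (X)HTML.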

Since we're on the subject of parser bugs, here's a nasty remote execution vulnerability from a month ago that affected Drupal, due to a bug in its XML parser:

http://www.derkeiler.com/Mailing-Lists/securityfocus/bugtraq/2006-10/msg00320.html

If you search for "XML parser vulnerability" or "XML parser XSS" or the like, it's easy to find many more such issues...

Comments on Eamon Nerbonne: Why XHTML still serves a purpose
