The How-To for html 5 parsing

You have read a lot about the html 5 specification. You have heard that there are hidden dragons and acid rain. But why not see for yourself, in practice, how html 5 parsing works? There are already some tools for experimenting with html 5.

DOM in actual browsers

The DOM (Document Object Model) is the in-memory representation that browsers use to manipulate Web content. Browsers have bugs, and much of the content on the Web is non-conforming. As a result, the same document can produce very different DOM representations in different browsers. If you are interested in seeing what a document looks like in different browsers, you can use the Live DOM Viewer. Open it in each browser you have and paste code into the window.

This helps you see how Web content is understood today by different tools.
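
To get a feeling for why non-conforming content leaves room for different trees, here is a minimal sketch. It assumes Python and its standard library `html.parser` module, which is neither a browser parser nor an implementation of the html 5 algorithm; the `EventLogger` class and the `tag_events` helper are names made up for this illustration. It only records the raw tag events found in a misnested fragment, which is the input a tree builder then has to turn into a DOM somehow.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start tags, end tags, and text in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def tag_events(markup):
    """Return the list of (kind, value) events found in the markup."""
    logger = EventLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


# Misnested markup: </b> arrives while <i> is still open.
events = tag_events("<b>bold <i>both</b> italic</i>")
for kind, value in events:
    print(kind, repr(value))
```

The end tag for `b` shows up before the end tag for `i`, so there is no single obvious tree. Each tool has to decide how to repair the nesting, and that is where the DOM differences come from.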

DOM after html 5 parsing

Now you might be interested in seeing how a document will be represented by a tool implementing the html 5 parsing rules. An important note: html 5 is a specification in development, and things might change. The following tools might be incomplete and contain bugs as well, but they will give you an idea of the DOM. This is very practical when you are developing another language which is not html 5 but might be sent as text/html (by mistake or by practical choice).

There are at least two online services:

Live html 5 parser by Philip Taylor
html5lib Based HTML5 Parser

Henri Sivonen developed a standalone application that you can use on your desktop. Here are the instructions to get it running. It worked fine on my Macintosh.

1. Check out the source: svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser
2. Download and untar GWT 1.5 RC1: http://code.google.com/webtoolkit/versions.html
3. On Linux, install libstdc++5 and a JDK (Ubuntu's OpenJDK-based package worked for me).
4. Edit the paths in HtmlParser-shell (Mac) or HtmlParser-linux (Linux) to point to the location of GWT.
5. Run HtmlParser-shell (Mac) or HtmlParser-linux (Linux).

Henri gave a list of limitations and bugs.

Using html 5 parsing in your own code

For now, there are three implementations of the html 5 parsing algorithm.

There is an attempt at implementing it in C# for .NET 2.0, but no code has been released yet.

Twintsam
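
As a rough idea of what calling such a parser from your own code can look like, here is a minimal sketch. It assumes the Python version of html5lib (the library behind one of the online services above) is installed, for instance with `pip install html5lib`, and that its `parse` function is called as in current releases. The `parse_to_tree` and `outline` helpers are hypothetical names for this illustration, not part of the library.

```python
try:
    import html5lib  # third-party; assumed to be installed separately
except ImportError:
    html5lib = None


def parse_to_tree(markup):
    """Parse markup with html5lib; return the root element, or None if html5lib is missing."""
    if html5lib is None:
        return None
    # Ask for a plain ElementTree without XHTML namespaces on the tag names.
    return html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)


def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + str(element.tag)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


# Two paragraphs without end tags, and no html, head, or body tags at all.
tree = parse_to_tree("<p>one<p>two")
if tree is None:
    print("html5lib is not installed")
else:
    print("\n".join(outline(tree)))
```

The input contains only two unclosed `p` start tags, yet the resulting tree has `html`, `head`, and `body` elements around two sibling paragraphs. This is the kind of repair the parsing rules spell out, so that every tool following them builds the same tree.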

If you know other tools implementing it, leave a comment.

Filed by Karl Dubost on July 7, 2008 2:35 AM in HTML, Technology 101, Tools

This page was last generated on $Date: 2011/12/16 03:02:57 $

W3C Blog