XML.com: XML From the Inside Out

Parsing Microformats

September 04, 2007

Microformats are a way to embed specific semantic data into the HTML that we use today. One of the first questions an XML guru might ask is "Why use HTML when XML lets you create the same semantics?" I won't go into all the reasons XML might be a better or worse choice for encoding data or why microformats have chosen to use HTML as their encoding base. This article will focus more on how to extract microformats data from the HTML, how the basic parsing rules work, and how they differ from XML.

Contact Information in HTML

One of the more popular and well-established microformats is hCard. This is a vCard representation in HTML, hence the h in hCard: HTML vCard. You can read more about hCards on the microformats wiki. A vCard contains basic information about a person or an organization. The format is used extensively in address book applications as a way to back up and exchange contact information. By Internet standards it's an old format; the specification is RFC 2426, from 1998. It predates XML, so the syntax is just plain text with a few delimiters and begin and end markers. We'll use my information for this example.

BEGIN:VCARD
FN:Brian Suda
N:Suda;Brian;;;
URL:http://suda.co.uk
END:VCARD

This vCard file has a BEGIN:VCARD and an END:VCARD that act as a container, so the parser knows where each card starts and stops. There might be multiple vCards in one file, so these markers group the data into distinct vCards. FN stands for Formatted Name, which is used as the display name. N is the structured name, which encodes the family name, given name, additional (middle) names, prefixes, and suffixes, in that order, all semicolon-separated. Finally, URL is the URL of the web site associated with this contact.
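To make the container and delimiter rules concrete, here is a minimal Python sketch of a reader for text like the example above. The function name `parse_vcard` is made up for illustration, and the sketch assumes one property per line with no parameters, line folding, or escaped characters, all of which a real vCard parser has to handle.

```python
def parse_vcard(text):
    """Parse simple vCard text into a list of dicts, one per BEGIN/END block.

    Assumes one property per line and no parameters, folding, or escaping.
    """
    cards, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only, so URL values keep their "http:".
        name, _, value = line.partition(":")
        name = name.upper()
        if name == "BEGIN" and value.upper() == "VCARD":
            current = {}
        elif name == "END" and value.upper() == "VCARD":
            if current is not None:
                cards.append(current)
            current = None
        elif current is not None:
            # N is structured: family;given;additional;prefixes;suffixes
            current[name] = value.split(";") if name == "N" else value
    return cards


SAMPLE = """BEGIN:VCARD
FN:Brian Suda
N:Suda;Brian;;;
URL:http://suda.co.uk
END:VCARD"""

cards = parse_vcard(SAMPLE)
print(cards[0]["FN"])   # Brian Suda
print(cards[0]["N"])    # ['Suda', 'Brian', '', '', '']
print(cards[0]["URL"])  # http://suda.co.uk
```

Because each BEGIN:VCARD starts a fresh dictionary, a file holding several cards comes back as several entries in the list.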

If we were to encode this in XML it would probably look something like this:

<vcard>
    <fn>Brian Suda</fn>
    <n>
        <given-name>Brian</given-name>
        <family-name>Suda</family-name>
    </n>
    <url>http://suda.co.uk</url>
</vcard>

Let's see how we can mark up the same vCard data in HTML using microformats, which make extensive use of the rel, rev, and class attributes to help encode the semantics. The class attribute is used in much the same way as elements are used in XML. So the previous XML example might be marked up in HTML as:

<div class="vcard">
    <div class="fn">Brian Suda</div>
    <div class="n">
        <div class="given-name">Brian</div>
        <div class="family-name">Suda</div>
    </div>
    <div class="url">http://suda.co.uk</div>
</div>

If that were all microformats did, then it wouldn't be very interesting. Instead, microformats make use of the semantics of existing HTML elements to explain where the encoded data can be found. In this example everything is a <div>, but it doesn't have to be. This makes extracting data from the HTML slightly more difficult for parsers, but easier for publishers. Microformats do not force publishers to change their current HTML structure or publishing behavior. At the end of the day, there will be many times more people writing HTML than writing parsers, so why not make it as easy as possible for the publishers?
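The practical consequence for a parser is that it has to look up properties by class name rather than by element name. The Python sketch below illustrates that idea under some assumptions: `find_by_class` is a hypothetical helper, the list-based markup is an invented alternative to the <div> version, and `xml.etree.ElementTree` is used only because these snippets happen to be well-formed. Real-world HTML generally needs a proper HTML parser.

```python
import xml.etree.ElementTree as ET

DIV_VERSION = """<div class="vcard">
    <div class="fn">Brian Suda</div>
    <div class="url">http://suda.co.uk</div>
</div>"""

# Same data with different elements (a hypothetical alternative markup).
LIST_VERSION = """<ul class="vcard">
    <li class="fn">Brian Suda</li>
    <li class="url">http://suda.co.uk</li>
</ul>"""


def find_by_class(root, name):
    """Return every element, of any tag, whose class attribute contains name."""
    return [el for el in root.iter() if name in el.get("class", "").split()]


for markup in (DIV_VERSION, LIST_VERSION):
    root = ET.fromstring(markup)
    fn = find_by_class(root, "fn")[0]
    print(fn.tag, "->", fn.text)
# div -> Brian Suda
# li -> Brian Suda
```

The lookup never mentions a tag name, so the publisher is free to keep whatever elements the page already uses.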

It bugs me when I look at the previous XML example and see "Brian Suda" encoded twice, once for FN and then again for N. With HTML this isn't a problem: we can combine those two XML elements using space-separated values in the class attribute. It is a little-known fact that the class, rel, and rev attributes in HTML can each take a space-separated list of values. If we combine the FN and N we get something like this:

<div class="n fn">
    <div class="given-name">Brian</div>
    <div class="family-name">Suda</div>
</div>

Now the N property still has its children and the FN has the same value as before. Remember, HTML collapses whitespace, so the FN is still "Brian Suda" even though it is now spread over two elements, with whitespace between those <div>s.
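Both points, the space-separated class list and the collapsed whitespace, can be sketched in a few lines of Python. The helper names `class_tokens` and `text_value` are illustrative, not part of any microformats library, and `xml.etree.ElementTree` is again assumed to be good enough only because the snippet is well-formed.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div class="n fn">
    <div class="given-name">Brian</div>
    <div class="family-name">Suda</div>
</div>"""


def class_tokens(element):
    """Split the class attribute on whitespace into individual names."""
    return element.get("class", "").split()


def text_value(element):
    """Join all descendant text and collapse runs of whitespace to one space."""
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(SNIPPET)
print(class_tokens(root))  # ['n', 'fn']
print(text_value(root))    # Brian Suda
```

One element therefore supplies two properties: the parser sees both n and fn in the token list, and the fn value is the collapsed text of the whole element.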

So, we have sorted out how to condense multiple properties that share the same value. The next thing that bothers me about the XML example is that the raw URL is displayed, which doesn't seem natural. In XML we are talking about data, but the HTML is being displayed to people in a browser. Conveniently, HTML has an <a> element, whose href attribute holds the URL and whose text content can hold more human-friendly text. We can further refine our HTML example to include the URL by switching the <div> to an <a> element.

<a class="n fn url" href="http://suda.co.uk">
    <span class="given-name">Brian</span>
    <span class="family-name">Suda</span>
</a>

After switching to the <a> element, we needed to change the child <div>s to <span>s because the <a> element can only contain inline elements as children. Microformats do not force publishers to use specific elements, but it is recommended that you use the most semantic element for each case. For URL data, it makes the most sense to use an <a> element. Because of this, the parsing rules change slightly (we'll discuss this in a bit).
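As a rough preview of how the element choice can affect parsing, the Python sketch below assumes the rule implied by the markup above: a url property on an <a> element takes its value from the href attribute, while other properties fall back to the element's text. The function `property_value` is hypothetical, the rule is simplified, and `xml.etree.ElementTree` only works here because the snippet is well-formed.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<a class="n fn url" href="http://suda.co.uk">
    <span class="given-name">Brian</span>
    <span class="family-name">Suda</span>
</a>"""


def text_value(element):
    """Join all descendant text and collapse runs of whitespace to one space."""
    return " ".join("".join(element.itertext()).split())


def property_value(element, prop):
    """Assumed rule: url on an <a> comes from href; otherwise use the text."""
    if prop == "url" and element.tag == "a" and element.get("href"):
        return element.get("href")
    return text_value(element)


link = ET.fromstring(SNIPPET)
print(property_value(link, "fn"))   # Brian Suda
print(property_value(link, "url"))  # http://suda.co.uk
```

The same element yields different values depending on which property is being read, which is why the parser needs to know both the class name and the element it appears on.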

The final hCard microformat might look something like the following in HTML:

<div class="vcard">
    <a class="n fn url" href="http://suda.co.uk">
        <span class="given-name">Brian</span>
        <span class="family-name">Suda</span>
    </a>
</div>

To me, this is much more intuitive, simpler, and more compact than the XML example at the start. People are already publishing blogrolls and links in this manner, and all browsers recognize and style this markup. It can also easily be passed around inside a feed.
