
Analyzing the Web
The mantra of Web development used to be, "Content is king." Don't waste time prettying things up, we were advised (and we ourselves advised others); people are too busy, and download speeds are too slow, to crud your pages up with big images, Flash presentations, and other media types. (These days, with the weblog explosion — over 13 million weblogs, their authors busily cranking out the content — perhaps content has resumed its primacy. On the other hand, all those bloggers do spend an awful lot of time tinkering with templates...)
With the veneer of commerce applied to many Web sites now, though, perhaps a more important mantra would be, "Web statistics are king." Sites are measured along a host of dimensions: hits, visits and return visits, page views, referrers, visit duration and depth, authenticated users, etc. Most professional Web-hosting providers include with their hosting plans a logging feature which captures all these details and saves them for later analysis.
Common Log Format (CLF)
Let's go back to the old days, when visits to the busiest sites might number merely in the hundreds. The National Center for Supercomputing Applications (NCSA) — the same group that developed the original Mosaic browser and the HTTPd Web server — came up with what has come to be known as the Common Log Format, or CLF, for keeping tabs on who had visited a given page, when and by what means, and so on. The format is fairly simple: a CLF file is a plain old text file, each line of which represents a request for a file or document to be delivered to a remote user. The basic format of each line is:
host ident authuser date request status bytes
where:
host: the fully-qualified domain name of the remote client if it's available, otherwise an IP address
ident: the "identity" of the remote client
authuser: the username supplied by the user if the requested resource is password-protected
date: the date/time the request was made
request: the actual HTTP request made by the client, enclosed in quotes
status: the three-digit HTTP status code returned to the client (e.g. 404 for "not found")
bytes: the number of bytes transferred, not counting any headers
If any item is unavailable, a hyphen is used as a placeholder.
Here's a sample (this is all one line, but may wrap in your display):
cache.somesite.com - - [01/Jul/2005:00:14:04 -0400]
"GET /somefile.html HTTP/1.0" 200 5968
Over time, the need for a few more fields was identified, such as the referrer and user agent (e.g., browser). But the above has remained the core of the de facto CLF standard.
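The widely deployed "combined" variant, for example, simply appends the referrer and user agent to a CLF line as two additional quoted fields. The referrer and user-agent values below are invented for illustration (again all one line, though it may wrap):

```
cache.somesite.com - - [01/Jul/2005:00:14:04 -0400]
"GET /somefile.html HTTP/1.0" 200 5968
"http://www.example.com/links.html" "Mozilla/4.0 (compatible)"
```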
Logging Web Usage with XML
There have been a couple of efforts to define an XML-based log format, using CLF as a jumping-off point.
The first, the Extensible Log Format (XLF) initiative, came along in 1998, early days by XML standards. Like much other XML activity then, its development sprang from the participation of a loose coalition of subscribers to the XML-DEV mailing list. As you can see from what appears to be the only surviving copy of the XLF specification, the application was intended not only to replace CLF, but to enhance it:
Intelligent log data combined with intelligent processing will lead to far more powerful analysis and reporting capabilities than ever before....
[An] example might be electronic commerce: a transaction page that is written in XML (say the order page from amazon.com) might have its <total.price>, <customer.name>, and <customer.address>, and other pieces of information logged, especially if that information can then [be] entered into a database automatically. XLF could play a key role in defining the model for distributed data-driven processes on the Web.
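To make the idea concrete, a CLF-style entry recast as XML might look something like the following. This is purely a hypothetical sketch; the element names are invented for illustration and are not drawn from the XLF or LOGML specifications:

```xml
<!-- Hypothetical XML log entry; element names are illustrative only -->
<entry>
  <host>cache.somesite.com</host>
  <authuser/>
  <date>2005-07-01T00:14:04-04:00</date>
  <request method="GET" uri="/somefile.html" protocol="HTTP/1.0"/>
  <status>200</status>
  <bytes>5968</bytes>
</entry>
```

The appeal is obvious: missing fields become empty elements rather than bare hyphens, the request decomposes into attributes, and any XML-aware tool can query or transform the log directly.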
A few years later, another alternative was offered: LOGML (for "Log Markup Language"). Impetus for developing LOGML came primarily from members of the Computer Science Department at Rensselaer Polytechnic Institute. The LOGML Draft Specification (more complete than the corresponding one for XLF, including both a DTD and an XML Schema) contains language hinting at the same motivations as those for XLF.
All of which sounds great, right? But you'll look in vain for widespread support among Web servers of either XLF or LOGML. Indeed, log files continue to be recorded almost exclusively as plain old CLF, sometimes with the extra fields (like user agent). There seems to be simply very little call to change what has worked well for so long.
So what's website logging doing here in an "XML Tourist" column?