
Analyzing the Web

July 27, 2005

John E. Simpson

The mantra of Web development used to be, "Content is king." Don't waste time prettying things up, we were advised (and we ourselves advised others); people are too busy, and download speeds are too slow, to crud your pages up with big images, Flash presentations, and other media types. (These days, with the weblog explosion — over 13 million weblogs, their authors busily cranking out the content — perhaps content has resumed its primacy. On the other hand, all those bloggers do spend an awful lot of time tinkering with templates...)

With the veneer of commerce applied to many Web sites now, though, perhaps a more important mantra would be, "Web statistics are king." Sites are measured along a host of dimensions: hits, visits and return visits, page views, referrers, visit duration and depth, authenticated users, etc. Most professional Web-hosting providers include with their hosting plans a logging feature which captures all these details and saves them for later analysis.

Common Log Format (CLF)

Let's go back to the old days, when visits to the busiest sites might number merely in the hundreds. The National Center for Supercomputing Applications (NCSA) — the same group that developed the original Mosaic browser and the HTTPd Web server — came up with what has come to be known as the Common Log Format, or CLF, for keeping tabs on who had visited a given page, when and by what means, and so on. The format is fairly simple: a CLF file is a plain old text file, each line of which represents a request for a file or document to be delivered to a remote user. The basic format of each line is:

host ident authuser date request status bytes

where:

  • host is the fully-qualified domain name of the remote client if it's available, otherwise an IP address
  • ident is the "identity" of the remote client, as reported by the client's identd service (rarely available in practice)
  • authuser is the username supplied by the user if the requested resource is password-protected
  • date is the date/time the request was made
  • request is the actual HTTP request made by the client, enclosed in quotes
  • status is the three-digit HTTP status code returned to the client (e.g. 404 for "not found")
  • bytes is the number of bytes transferred, not counting any headers

If any item is unavailable, a hyphen is used as a placeholder.

Here's a sample (this is all one line, but may wrap in your display):

cache.somesite.com - - [01/Jul/2005:00:14:04 -0400] "GET /somefile.html HTTP/1.0" 200 5968
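If you ever need to pull those fields apart yourself, a few lines of scripting will do the job. Here's a minimal sketch in Python; the regular expression and the names are my own, and it assumes exactly the seven-field layout described above:

import re

# A minimal CLF parser sketch (my own regex and names); it assumes the
# seven-field layout described above, with "-" standing in for missing values.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

line = ('cache.somesite.com - - [01/Jul/2005:00:14:04 -0400] '
        '"GET /somefile.html HTTP/1.0" 200 5968')

match = CLF_PATTERN.match(line)
if match:
    fields = match.groupdict()
    # Per the CLF convention, a hyphen means the value was unavailable.
    size = 0 if fields['bytes'] == '-' else int(fields['bytes'])
    print(fields['host'], fields['request'], fields['status'], size)

Multiply that by a few hundred thousand lines and you have, in essence, the first job any log analyzer performs before it can aggregate anything.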

Over time, the need for a few more fields was identified, such as the referrer and user agent (e.g., browser). But the above has remained the core of the de facto CLF standard.
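For the record, a line in that extended format (often called the "combined" log format) simply appends the referrer and user agent as two more quoted fields; the referrer URL and browser string below are invented for illustration:

cache.somesite.com - - [01/Jul/2005:00:14:04 -0400] "GET /somefile.html HTTP/1.0" 200 5968 "http://www.example.com/index.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"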

Logging Web Usage with XML

There've been a couple of efforts to define an XML-based log format, using CLF as a jumping-off point.

The first, the Extensible Log Format (XLF) initiative, came along in 1998, early days in XML terms. Like much other XML activity then, its development sprang from a loose coalition of subscribers to the XML-DEV mailing list. As you can see from what appears to be the only surviving copy of the XLF specification, the application was intended not only to replace CLF, but to enhance it:

Intelligent log data combined with intelligent processing will lead to far more powerful analysis and reporting capabilities than ever before....

[An] example might be electronic commerce: a transaction page that is written in XML (say the order page from amazon.com) might have its <total.price>, <customer.name> and <customer.address>, and other pieces of information logged, especially if that information can then be entered into a database automatically. XLF could play a key role in defining the model for distributed data-driven processes on the Web.

A few years later, another alternative was offered: LOGML (for "Log Markup Language"). Impetus for developing LOGML came primarily from members of the Computer Science Department at Rensselaer Polytechnic Institute. The LOGML Draft Specification (more complete than the corresponding one for XLF, including both a DTD and an XML Schema) contains language hinting at the same motivations as those for XLF.

All of which sounds great, right? But you'll look in vain for widespread support among Web servers for either XLF or LOGML. Indeed, log files continue to be recorded almost exclusively as plain old CLF, sometimes with the extra fields (like user agent). There seems to be very little call to change what has worked well for so long.

So what's website logging doing here in an "XML Tourist" column?

Reporting Web Usage with XML

If you think about it, XML offers few advantages over plain text as a format in which to keep usage logs. A Web developer or her customers simply don't care, for the most part, about the details of individual sessions. What they care about are aggregate statistics. And for such an application, data represented as XML can be very useful. It can be transformed to (X)HTML, XSL-FO (and thence to PDF), or any of various other presentation languages. It can be easily loaded into databases for further massaging. Its representation can be customized endlessly, repackaged and repurposed however needed. And that's what website logging is doing in this column.

Just as with raw XML files, CLF files are human readable, as long as you know what's supposed to be in each field. And as with many XML applications, log files aren't really "read" very often, legible or not. (For one thing, they can be huge, running into hundreds of thousands of records, depending on the number of sessions recorded.) Rather, they're fed into any of a host of software packages, which then convert the raw data into a form which is not just human readable, but also human meaningful.

While some of these reporting packages simply display the aggregate results of a given log file in some proprietary manner, others can save or export the data for later use, perhaps by an entirely different application. I'll take a brief look at two of these log analyzers.

eWebLog Analyzer (eWLA)

esoftsys's eWebLog Analyzer runs on the full range of Windows platforms; it comes in a 30-day free trial version, after which a single-user license is $79.

When you've loaded a log file into eWLA, the system aggregates the data on various dimensions and then displays the results as a straight text list, or in a combination of graphical and tabular formats. For instance, here's an eWLA graph showing how page hits and visits were distributed by day of the week, over a month's period, for a Web site I work on:

eWLA bar chart: visits by day of the week
Figure 1.

When you save eWLA reports as XML, the format is simple. The root element, DATA, comes with a handful of general attributes such as the date/time of the export. Within the DATA element is a General element (which totals overall site statistics, such as total hits and average access duration), followed by one element for each type of report. For instance, there's a ByDay element which records the number of hits, visits, bandwidth, and so on for each day in the reporting period. Each of these report-type elements has various occurrences of a single, empty child element, ROW. You can think of each ROW element as a row in a table, with the individual data values recorded as its attributes.

For instance, an eWLA export including the days-of-the-week data shown in the graph above looks like this:

<DATA Description="eWebLogAnalyzer Export"
Title="xml_com_demo" DateExport="7/21/2005 11:01:10 PM"
Ver="1.10">
<General>
...
</General>
...
<ByDow>
<ROW Day="Monday" Hits="4416" Visits="984"
Bandwidth="20.56 MB" Pages="2175" Errors="2"
AvgVisitLen="3:59"/>
<ROW Day="Tuesday" Hits="4036" Visits="1096"
Bandwidth="19.40 MB" Pages="2260" Errors="0"
AvgVisitLen="2:59"/>
<ROW Day="Wednesday" Hits="5045" Visits="1234"
Bandwidth="24.27 MB" Pages="2808" Errors="18"
AvgVisitLen="3:36"/>
<ROW Day="Thursday" Hits="4813" Visits="1204"
Bandwidth="21.48 MB" Pages="2445" Errors="0"
AvgVisitLen="3:25"/>
<ROW Day="Friday" Hits="3411" Visits="921"
Bandwidth="17.50 MB" Pages="1827" Errors="0"
AvgVisitLen="3:13"/>
<ROW Day="Saturday" Hits="3209" Visits="879"
Bandwidth="16.14 MB" Pages="1888" Errors="0"
AvgVisitLen="4:30"/>
<ROW Day="Sunday" Hits="3654" Visits="799"
Bandwidth="15.13 MB" Pages="2020" Errors="12"
AvgVisitLen="4:16"/>
</ByDow>
...
</DATA>

While this report-type/ROW format is almost mindlessly simple, it's also an effective platform for further manipulation.
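For instance, here's a rough sketch in Python (the export's filename is my own choice) that pulls the ByDow rows back out of a saved export and ranks the days of the week by visits:

import xml.etree.ElementTree as ET

# Sketch only: read an eWLA XML export (the filename is my own) and rank
# the days of the week by number of visits.
tree = ET.parse('ewla_export.xml')
rows = tree.getroot().findall('./ByDow/ROW')

for row in sorted(rows, key=lambda r: int(r.get('Visits')), reverse=True):
    print('%-10s %5s visits  %5s hits  %s bandwidth'
          % (row.get('Day'), row.get('Visits'), row.get('Hits'),
             row.get('Bandwidth')))

print('Total visits:', sum(int(r.get('Visits')) for r in rows))

From there it's a short hop to the XHTML tables, database loads, and other repackagings described earlier.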

WebGuru

Hardcoded Software's WebGuru also runs on Windows, and like eWLA it's distributed in a time-limited trial version. A single-user license costs $99.

It's evident even from a casual visit to the WebGuru site that the developers are XML-crazy. (Always refreshing. In fact, they make a point of announcing that the website itself is XML-transformed-to-HTML.) Right at the top of the features list it says, "XML/XSLT Reports." (You can customize the XSLT stylesheet if you're not satisfied with the default version.) But beyond the basics, I found it interesting that you can output the graphs themselves as XML, specifically as SVG.

Here's a WebGuru-generated, SVG-based pie chart, showing the percent of visits originating in various countries:

WebGuru SVG pie chart: visitors by country
Figure 2.

The SVG document behind this graph is, well, SVG:

<!DOCTYPE svg
PUBLIC "-//W3C//DTD SVG 1.0//EN"
"http://www.w3.org/TR/SVG/DTD/svg10.dtd">
<svg preserveAspectRatio="xMidYMid meet" viewBox="0 0 600 400">
<defs>
<style type="text/css">
.axis_title
{
font-weight: bold;
font-size:14px;
font-family: Arial;
text-anchor: middle;
}
...
</style>
</defs>
<text class="axis_title" x="300"
y="20">Countries' visitors percentage</text>
<svg id="graph_zone" preserveAspectRatio="xMidYMid meet"
x="0" y="50" width="600" height="340" viewBox="0 0 500 300">
<rect style="fill:#0000ff;stroke-width:1;stroke:black;"
x="10" y="15" width="10" height="10"/>
<text class="legend_text" x="25" y="25">[US] UNITED STATES
(89%)</text>
<rect style="fill:#ff0000;stroke-width:1;stroke:black;"
x="10" y="30" width="10" height="10"/>
<text class="legend_text" x="25" y="40">[EDU] UNKNOWN
(2%)</text>
...
<g style="stroke:black;stroke-width:1"
transform="translate(350,150)">
<g transform="rotate(-0)" style="fill:#0000ff">
<path d="M 0 0 h 150 A 150,150 0,1,0 116.06708021831348,95.01806612216218 z"/>
</g>
<g transform="rotate(-320.69466882067854)" style="fill:#ff0000">
<path d="M 0 0 h 150 A 150,150 0,0,0 149.1622924965493,-15.830682144932187 z"/>
</g>
...
</g>
</svg>
<rect style="fill:none;stroke-width:1;stroke:black;"
x="0" y="0" width="600" height="400"/>
</svg>

The graph's legend is constructed with a series of rect/text element pairs (for the small colored box and its label, respectively); the circular pie chart and wedges make up the rest of the SVG document (the g/path element pairs). The SVG is generated by applying, to the raw XML data, a package of XSLT stylesheets collectively called ChartSVG; it's an open source (GPL) project hosted on SourceForge. And of course, it's also cross-platform (XSLT being platform neutral). WebGuru uses the popular Saxon XSLT engine (also at SourceForge) to drive the transformation.
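If you wanted to reproduce that last step on your own, any XSLT processor would do. Here's a hedged sketch using Python's lxml rather than Saxon; the stylesheet and report file names are placeholders of my own, not files shipped with ChartSVG or WebGuru:

from lxml import etree

# Sketch only: apply a ChartSVG-style stylesheet to a usage report to
# produce an SVG chart. 'pie_chart.xsl' and 'country_report.xml' are
# placeholder names, not actual ChartSVG or WebGuru files.
transform = etree.XSLT(etree.parse('pie_chart.xsl'))
svg_result = transform(etree.parse('country_report.xml'))

with open('visitors_by_country.svg', 'wb') as out:
    out.write(etree.tostring(svg_result, xml_declaration=True,
                             encoding='UTF-8', pretty_print=True))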

Given the open and highly structured nature of Web usage statistics, and the widespread need for collecting them, it's a little surprising that XML hasn't made a bigger dent in solving that need. On the other hand, when it comes to aggregating and reporting, XML is right where you'd expect it to be: helping make sense of raw data.