October 26, 2005
I wrote the Python-XML column for three years, discussing the combination of an agile programming language with an agile data format. It's time to pull the lens back a bit to take in other such technologies. This new column, "Agile Web," will cover the intersection of dynamic programming languages and web technologies, particularly the sorts of dynamic developments on the web for which some use the moniker, "Web 2.0." The primary language focus will still be Python, with some ECMAScript. Occasionally there will be some coverage of other dynamic languages as well.
Yet Another Google Project
Earlier this year Google launched a free service, Google Sitemaps, which allows you to send them a map of your website in XML format. The Google crawler then uses this map to improve coverage of your content, according to the specific information you provide about page locations, modification frequency, and more. Google describes its service thusly:
Google Sitemaps is an easy way for you to help improve your coverage in the Google index. It's a collaborative crawling system that enables you to communicate directly with Google to keep us informed of all your web pages, and when you make changes to these pages.
The service is still in beta, and Google calls it "an experiment in web crawling," but many recent developments on the web make it an especially useful idea. For one thing, with the rise of weblogs there are so many more episodic websites, which are intended to change frequently. It makes sense to tell a web crawler that it might want to come back for an update in, say, 24 hours (and perhaps to tell it not to come back before then, to reduce load). Of course there are other ways to provide such hints, including specialized HTTP headers, but Google Sitemaps are a specialized mechanism to fit the specific semantics of web-indexing software. Also, as the web is becoming more and more dynamic, with AJAX and other such tricks, there may be no obvious web of static links for a crawler to discover on its own. You might have to provide hints as to where the goods are in your sophisticated website.
What particularly attracts this column's interest is the fact that Google also provides Python tools for creating Sitemaps to be submitted to Google. This article introduces and discusses these tools (think of it as using Python to allocate your Google juice), but first things first: let's look at Google's XML format for Sitemaps.
Not Your Dad's Robot File
The classic web method for controlling crawlers and other such automated agents ("robots") is the robots.txt file, formally called the Standard for Robot Exclusion. Its main purpose, however, is to tell crawlers where they may not go. Listing 1 is a robots.txt file that tells all crawlers to stay away from the top-level private folder of the site.
Listing 1. Simple robots.txt Example
# Please respect our privacy
User-agent: *
Disallow: /private
The first line is a comment. The second indicates that the following headers apply to all crawlers. You can also specify the name of a particular crawler with the User-agent header. The last line tells matching crawlers not to access any URLs whose path portion begins with /private. You can learn more about robots.txt in the specification.
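To see how a well-behaved crawler might interpret Listing 1, here is a minimal sketch using the robots.txt parser from Python's standard library. The helper name is_allowed, the "ExampleBot" user agent, and the example.com URLs are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# The rules from Listing 1.
ROBOTS_TXT = """\
# Please respect our privacy
User-agent: *
Disallow: /private
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and the URLs below are hypothetical.
    for url in ("http://example.com/private/notes.html",
                "http://example.com/index.html"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Because the match is on the beginning of the path, anything under /private is refused while the rest of the site stays open to the crawler.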
Google Sitemaps complement, rather than replace, robots.txt. They do not include instructions for excluding crawlers from directories. They can be as simple as a list of URLs that you do want the crawler to visit, and these URLs can optionally be annotated with the last modified date, the frequency with which the content changes, and their relative importance within the overall content of your site. The Sitemap Protocol Contents document contains the full scoop, but the example in Listing 2 should give you a good idea of the format.
Listing 2. Sample Google XML Sitemap File
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://uche.ogbuji.net/</loc>
    <lastmod>2005-10-01</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://uche.ogbuji.net/tech/publications/</loc>
    <lastmod>2005-10-03T12:00:00+00:00</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://uche.ogbuji.net/tech/4suite/amara/</loc>
  </url>
</urlset>
Each url contains at least a loc element with the location being described. Each can have an optional lastmod (last modification time in ISO 8601 format), changefreq (controlled vocabulary expressing frequency of change), and priority (priority from 0.0 to 1.0 that the crawler should give the URL relative to your overall site).
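If you would rather assemble such a file yourself than write it by hand, the structure is simple enough to build with the standard library. The following is only a sketch: build_sitemap and the dictionary layout of its input are assumptions of this example, not part of Google's tools, and it does no validation beyond requiring loc.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Sitemap format shown in Listing 2.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def _tag(name):
    """Qualify an element name with the Sitemap namespace."""
    return "{%s}%s" % (SITEMAP_NS, name)


def build_sitemap(entries):
    """Serialize entries as a Sitemap XML string.

    Each entry is a dict with a required 'loc' key and optional
    'lastmod', 'changefreq', and 'priority' keys (a hypothetical layout).
    """
    # Emit the Sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(_tag("urlset"))
    for entry in entries:
        url = ET.SubElement(urlset, _tag("url"))
        ET.SubElement(url, _tag("loc")).text = entry["loc"]  # loc is required
        for field in ("lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, _tag(field)).text = str(entry[field])
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "http://uche.ogbuji.net/", "lastmod": "2005-10-01",
         "priority": 1.0},
        {"loc": "http://uche.ogbuji.net/tech/publications/",
         "changefreq": "weekly"},
    ]))
```

The returned string carries no XML declaration; add one when writing the file if your consumer expects it.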
Google Sitemaps allows you to submit files in other formats, including plain text (one URL per line), RSS 2.0, Atom 0.3, and Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which uses an XML format based on the Dublin Core Metadata Initiative (DCMI). See this article for more on using OAI-PMH with Google Sitemaps.
A Little Help from the Tools
Google's tool for generating Sitemaps is sitemap_gen.py. It's an open source program (BSD 2.0 license) written in Python 2.2.
The Google Sitemap Generator is a Python script that creates a Sitemap for your site using the Sitemap Protocol. This script can create Sitemap XML protocol files from URL lists, web server directories, or from access logs.
I had a look at version 1.3 of sitemap_gen.py, which is hosted on SourceForge. It is designed to run from your site's web server, but it provides a lot of options, and you always have the option of simply specifying the URLs in a text file. In this case, all the tool does is take a text file of delimited records and convert it to XML without any more special processing. You can also use it to trawl through your HTTP server access logs for URLs to add to the Sitemap, to consolidate multiple existing Sitemap XML files into one large one, and more. The documentation for typical use of the tool is very clear, so I'll move on to how you would use it from other Python tools.
Once you've set up a Sitemap configuration file, it's very easy to import Google's tool and trigger a regeneration of the output Sitemap file (the output file location is also set in your configuration). Listing 3 is a simple usage example from Python code.
Listing 3. Using Google Sitemap Generator from Python
import sitemap_gen

CONFIG_FNAME = '/path/to/sitemap-config.xml'

sitemap = sitemap_gen.CreateSitemapFromFile(CONFIG_FNAME, False)
if sitemap:
    sitemap.Generate()
else:
    # Indicates an error in configuration
    pass
You could use such code to work Sitemap updates into your Python web application. As an example, if you run a weblog using PyBlosxom, as I do, you could touch up your Sitemap in your CGI handler. Doing so is a fairly expensive task, so it's not something you'd want to do every time. I have a PyBlosxom plugin, task_control.py, which allows you to run Python scripts upon CGI request, but only if a specified interval has passed since the last run. You can run Listing 3 pretty much as it is from that tool. Of course you can also use cron to run Google's sitemap_gen directly.
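The "only if a specified interval has passed" idea is easy to sketch on its own. The code below is not how task_control.py is implemented; it is a hypothetical stand-alone helper, run_if_due, that uses a stamp file's modification time to remember the last run. The stamp path and the interval are whatever you choose to pass in.

```python
import os
import time


def run_if_due(stamp_path, interval_seconds, task, now=None):
    """Call task() only if stamp_path is missing or older than the interval.

    Returns True if the task ran, False if it was skipped. The optional
    'now' argument (seconds since the epoch) exists to make testing easy.
    """
    now = time.time() if now is None else now
    try:
        last_run = os.path.getmtime(stamp_path)
    except OSError:
        last_run = None  # No stamp file yet, so the task has never run.
    if last_run is not None and now - last_run < interval_seconds:
        return False
    task()
    # Record the run: create the stamp file if needed, then set its mtime.
    with open(stamp_path, "a"):
        pass
    os.utime(stamp_path, (now, now))
    return True
```

You might pass a small function wrapping the code in Listing 3 as task, with an interval of a day, so that most requests cost nothing more than a check of one file's timestamp.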
And Yahoo! Too
For a long time, Yahoo has offered services for submitting URLs for its web crawler. This was a free service with which you could only enter one URL at a time into a Web form. For bulk submission, you had to pay for the privilege. The arrival of Google Sitemaps prompted Yahoo to provide a free service with a similar option for specifying collections of URLs for indexing. In the case of Yahoo, you create a simple text file with one URL per line, publish it on the web, and submit the URL for that file to Yahoo's submission form. Such a simple URL-per-line text file is also supported by Google Sitemaps, but Yahoo doesn't additionally support the sorts of URL metadata that Google does. You can generate a plain URL list from a Google Sitemap protocol XML file very easily with XSLT, as in Listing 4.
Listing 4. Using XSLT to Generate a Yahoo URL list from a Google Sitemaps XML File
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
  xmlns:gsm="http://www.google.com/schemas/sitemap/0.84"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Emit each URL's location followed by a line break -->
  <xsl:template match="gsm:url">
    <xsl:value-of select="gsm:loc"/><xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
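If you would rather stay in Python than reach for an XSLT processor, the same extraction can be sketched with the standard library's ElementTree. The function name sitemap_to_url_list is an assumption of this example; it simply collects each loc and writes one URL per line.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping for the Sitemap format, as in Listing 4.
NS = {"gsm": "http://www.google.com/schemas/sitemap/0.84"}


def sitemap_to_url_list(sitemap_xml):
    """Return the loc of every url in a Sitemap document, one per line."""
    root = ET.fromstring(sitemap_xml)
    lines = []
    for loc in root.findall("gsm:url/gsm:loc", NS):
        if loc.text and loc.text.strip():
            lines.append(loc.text.strip() + "\n")
    return "".join(lines)


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">'
        "<url><loc>http://uche.ogbuji.net/</loc>"
        "<lastmod>2005-10-01</lastmod></url>"
        "<url><loc>http://uche.ogbuji.net/tech/publications/</loc></url>"
        "</urlset>"
    )
    print(sitemap_to_url_list(sample), end="")
```

As with the stylesheet, the lastmod, changefreq, and priority annotations are simply dropped, since the plain URL list has no place for them.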
As with all slick new ideas on the web, you have to be careful. Sometimes the need for a specialized Sitemap says something about the usability of your site, and what it says might not be complimentary. If you are losing or confusing traditional web crawlers, there is some chance that you're losing human visitors as well. Of course, if your website is new (for example, you've moved a site to a new domain), it makes a lot of sense to make an initial submission to Google and Yahoo (and other search engines) to speed up discovery of the new site, but it may be best not to rely too much on the metadata for tuning URLs in Google's service. It's very unlikely that Google will ever subordinate its traditional Web crawler heuristics to such user-provided hints, and indeed Google is explicit about the ancillary role of Sitemaps in their overview.
[The Sitemaps] program does not replace our normal methods of crawling the web. Google still searches and indexes your sites the same way it has done in the past whether or not you use this program. A Sitemap simply gives Google additional information that we may not otherwise discover. Sites are never penalized for using this service.
Nevertheless, there is a lot of chatter on the various groups about sudden changes in Google PageRank after submitting a Sitemap, with a lot of complaints about negative effects (presumably if people saw positive effects they'd be less likely to make noise about the fact). Some claim they even suffered complete delistings from Google results after Sitemaps submission. It's hard to know how much of this to credit because the Google Groups support forum for Sitemaps is more a hangout for search engine optimizers and other marketing professionals than for technicians of the web. I would recommend the Inside Google Sitemaps weblog as the most reliable source of evolving details about the program. Certainly submitting URLs to Google's or Yahoo's Sitemaps programs provides no guarantee of ranking for your pages, just a guarantee of inclusion into the index of pages for the web crawlers.
Google recently supplemented its Sitemaps program with Google Mobile Sitemaps, which provides specializations for Mobile Web Search, Google's search engine focusing on sites for small-format mobile devices such as cell phones and PDAs.
In the following months I shall continue my exploration of the Agile Web. In Perspective on XML: What Is This "Agility"? I shared my thoughts on agile development tools and practices. My academic training (first Electrical/Electronic Engineering in Nigeria, and then a Computer Engineering program in the U.S. that was very close to its EEE roots) conditions me to have some caution regarding the quest for absolute agility. On the other hand, I'm a big fan of Python and XML, technologies that are, as I said earlier, oft-cited examples of an agile programming language and an agile data format. In this column I'll apply this tension towards the analysis of the technological forces that are shaping the next generation web.