I wrote the Python-XML column for three years, discussing the combination of an agile programming language with an agile data format. It's time to pull the lens back a bit to take in other such technologies. This new column, "Agile Web," will cover the intersection of dynamic programming languages and web technologies, particularly the sorts of dynamic developments on the web for which some use the moniker, "Web 2.0." The primary language focus will still be Python, with some ECMAScript. Occasionally there will be some coverage of other dynamic languages as well.
Yet Another Google Project
Earlier this year Google launched a free service, Google Sitemaps, which allows you to send them a map of your website in XML format. The Google crawler then uses this map to improve coverage of your content, according to the specific information you provide about page locations, modification frequency, and more. Google describes its service thusly:
Google Sitemaps is an easy way for you to help improve your coverage in the Google index. It's a collaborative crawling system that enables you to communicate directly with Google to keep us informed of all your web pages, and when you make changes to these pages.
The service is still in beta, and Google calls it "an experiment in web crawling," but many recent developments on the web make it an especially useful idea. For one thing, with the rise of weblogs there are so many more episodic websites, which are intended to change frequently. It makes sense to tell a web crawler that it might want to come back for an update in, say, 24 hours (and perhaps to tell it not to come back before then, to reduce load). Of course there are other ways to provide such hints, including specialized HTTP headers, but Google Sitemaps are a specialized mechanism to fit the specific semantics of web-indexing software. Also, as the web is becoming more and more dynamic, with AJAX and other such tricks, there may be no obvious web of static links for a crawler to discover on its own. You might have to provide hints as to where the goods are in your sophisticated website.
What particularly attracts this column's interest is the fact that Google also provides Python tools for creating Sitemaps to be submitted to Google. This article introduces and discusses these tools (think of it as using Python to allocate your Google juice), but first things first: let's look at Google's XML format for Sitemaps.
Not Your Dad's Robot File
The classic web method for controlling crawlers and other such automated agents ("robots") is the robots.txt file, formally called the Standard for Robot Exclusion. Its main purpose, however, is to tell crawlers where they may not go. Listing 1 is a robots.txt file that tells all crawlers to stay away from the top-level private folder of the site.
Listing 1. Simple robots.txt Example
# Please respect our privacy
User-agent: *
Disallow: /private
The first line is a comment. The second indicates that the following headers apply to all crawlers; you can also specify the name of a particular crawler in the User-agent header. The last line tells matching crawlers not to access any URLs whose path portion begins with /private. You can learn more about robots.txt in the specification or the FAQ.
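As a quick sanity check of such rules, Python's standard library ships a robots.txt parser, urllib.robotparser. The sketch below feeds it the rules from Listing 1 directly and asks which paths a crawler may fetch; the user-agent name ExampleBot is invented for illustration:

```python
# Checking robots.txt rules with the standard library parser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse the rules from Listing 1 in place, rather than fetching
# them over HTTP with rp.set_url() and rp.read().
rp.parse([
    "# Please respect our privacy",
    "User-agent: *",
    "Disallow: /private",
])

print(rp.can_fetch("ExampleBot", "/private/notes.html"))  # False
print(rp.can_fetch("ExampleBot", "/public/index.html"))   # True
```

Because the rules are given under User-agent: *, they apply to any crawler name you pass to can_fetch.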
Google Sitemaps complement, rather than replace, robots.txt. They do not include instructions for excluding crawlers from directories. They can be as simple as a list of URLs that you do want the crawler to visit, and these URLs can optionally be annotated with information about the last modified date, the frequency with which the content changes, and their relative importance within the overall content of your site. The Sitemap Protocol Contents document contains the full scoop, but the example in listing 2 should give you a good idea of the format.
Listing 2. Sample Google XML Sitemap File
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://uche.ogbuji.net/</loc>
    <lastmod>2005-10-01</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://uche.ogbuji.net/tech/publications/</loc>
    <lastmod>2005-10-03T12:00:00+00:00</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://uche.ogbuji.net/tech/4suite/amara/</loc>
  </url>
</urlset>
Each url element contains at least a loc element with the location being described. Each can also have an optional lastmod (last modification time in ISO 8601 format), changefreq (a controlled vocabulary expressing the frequency of change), and priority (a value from 0.0 to 1.0 indicating the priority the crawler should give the URL relative to the rest of your site).
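A sitemap this simple is easy to generate programmatically. The following sketch uses Python's standard xml.etree.ElementTree to build a urlset like the one in Listing 2; the pages list and its keys are illustrative data, not part of Google's tools:

```python
# Generating a Google Sitemap with the standard library's ElementTree.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"

# Illustrative page records; only "loc" is required per URL.
pages = [
    {"loc": "http://uche.ogbuji.net/", "lastmod": "2005-10-01",
     "priority": "1.0"},
    {"loc": "http://uche.ogbuji.net/tech/4suite/amara/"},
]

urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
for page in pages:
    url = ET.SubElement(urlset, "url")
    # Emit only the optional child elements actually supplied.
    for tag in ("loc", "lastmod", "changefreq", "priority"):
        if tag in page:
            ET.SubElement(url, tag).text = page[tag]

xml = ET.tostring(urlset, encoding="unicode")
print(xml)
```

Writing the result with an XML declaration and UTF-8 encoding, as in Listing 2, is a matter of passing encoding="UTF-8" and xml_declaration=True to ElementTree's write method instead.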
Google Sitemaps allows you to submit files in other formats, including plain text (one URL per line), RSS 2.0, Atom 0.3, and Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which uses an XML format based on the Dublin Core Metadata Initiative (DCMI). See this article for more on using OAI-PMH with Google Sitemaps.
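The plain-text variant is the simplest of all: one fully qualified URL per line, with no annotations. A minimal sketch (the filename sitemap.txt and the URL list here are just examples):

```python
# Writing a plain-text sitemap: one fully qualified URL per line.
urls = [
    "http://uche.ogbuji.net/",
    "http://uche.ogbuji.net/tech/publications/",
]

with open("sitemap.txt", "w") as f:
    f.write("\n".join(urls) + "\n")
```

The trade-off is that the plain-text format cannot carry the lastmod, changefreq, or priority hints that the XML format supports.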