
Googling for XML

February 11, 2004

Bob DuCharme

The introduction of the O'Reilly book Google Hacks tells us that the filetype: query qualifier restricts your Google search to files whose names end with a particular extension. The book's first example of this is homeschooling filetype:pdf, a query that searches for the word "homeschooling" in Adobe Acrobat files. The second example, "leading economic indicators" filetype:ppt, looks for the phrase "leading economic indicators" in Microsoft PowerPoint presentations. (Of course, Google checks the file extension and not the actual format; if an Excel spreadsheet with a "ppt" file extension is in Google's index, the second search will look for the target phrase there, and if a PowerPoint presentation with an extension of "pres" is in the index, the same search will ignore it.)

Being an XML geek, I had to run immediately to Google's homepage to try searching XML files with this trick. Simply entering filetype:xml as a Google query returns nothing, so I entered filetype:xml test to search for XML files with the word "test" in them, and Google reported 329,000 hits. (All "hits" figures listed here will have changed by the time you read this.) The query filetype:xml -test, which searches for files with an extension of "xml" that don't have the word "test" in their contents, gave me 1,080,000 hits. Adding the two figures, my rough guess puts about 1.4 million files with an extension of "xml" in Google's index.
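The arithmetic behind that estimate can be sketched in a few lines of Python. The function names here are made up for illustration, and the hit counts are simply the two figures quoted above; nothing in this sketch talks to Google.

```python
def extension_queries(ext, word="test"):
    """Return the complementary pair of queries for one file extension."""
    return ("filetype:%s %s" % (ext, word), "filetype:%s -%s" % (ext, word))


def estimate_total(hits_with, hits_without):
    """Files that contain the word plus files that lack it cover the whole set."""
    return hits_with + hits_without


print(extension_queries("xml"))
print(estimate_total(329000, 1080000))  # 1409000, or roughly 1.4 million
```

Any common word would do in place of "test"; the point is only that the two queries split the indexed files into two groups that don't overlap.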

Of course, it's a very rough guess. As you read about my further experiments in searching only XML files of particular document types, such as DocBook files and TEI files, as well as my Google searches through RSS, FOAF and other RDF files, remember that I based it all on hunches and guesswork. The technical-sounding term for this exploration into Google capabilities is "reverse engineering," but the most appropriate term is the one that gave the name to the popular O'Reilly series: hacks.

Googling for Specific Document Types

Running my test/-test pair of searches for files with an extension of "xhtml" showed about half a million in Google's index. This is useful to the many XML developers who know that these files are much more likely to be well-formed, and maybe even valid against a DTD or schema, than files with extensions of "html" or "htm".

Many XHTML files have an extension of "xml" as well, and these present a problem when searching for XML documents of other document types besides XHTML. For example, a search (filetype:xml docbook) for files with an extension of "xml" that mention DocBook, a DTD popular for technical writing and computer books since SGML days, will find XHTML files that discuss DocBook as well as actual DocBook files.

Let's look at some strategies for locating DocBook files and then return to this issue of XHTML files that discuss DocBook. Technically, DocBook has no namespace URI associated with it, but when mixing DocBook elements with elements from other namespaces, many people want to assign a namespace URI to those elements, and "http://www.oasis-open.org/docbook/" seems to be popular. As the ancestor directory for many DocBook DTD files, this URL shows up in the SYSTEM parameter of a lot of DOCTYPE declarations. A search for files with an extension of "xml" that contain this string turns up about 1,170 hits; many of them are DocBook files, and many aren't.

The context phrases that Google search results show around these hits often show tags from the DocBook DTD, making it easier to see which ones are really DocBook documents. A search of XML files for the quoted phrase "oasis dtd docbook xml" gets about 1,560 hits because Google, which ignores punctuation, often finds that phrase in a public identifier string like "-//OASIS//DTD DocBook XML V4.2//EN". Some of these files are actually HTML representations of complete DocBook files, perhaps with numbers to show them as the "source code" for some project.
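To see why a plain-word phrase can match a public identifier, here is a toy model of punctuation-blind phrase matching. It is only a guess at the general idea, not a description of how Google actually tokenizes text, and the function names are invented for this sketch.

```python
import re


def tokens(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_matches(phrase, text):
    """True if the phrase's tokens appear consecutively among the text's tokens."""
    want, have = tokens(phrase), tokens(text)
    return any(have[i:i + len(want)] == want
               for i in range(len(have) - len(want) + 1))


public_id = "-//OASIS//DTD DocBook XML V4.2//EN"
print(tokens(public_id))
print(phrase_matches("oasis dtd docbook xml", public_id))
```

Once the slashes and hyphens are thrown away, the identifier begins with exactly the four words of the quoted phrase, so the phrase search finds it.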

I tried adding the quoted string "doctype article" to that last search and found some surprising results. While Google supposedly doesn't index tag names or the contents of DOCTYPE declarations, it apparently does in certain circumstances. (Again: guesswork! Reverse engineering! Hacks!) Several results for this query show a document "title" (for HTML files, the part in the head element's title element) that begins like this: "<html> <head> </head><body><pre>". Following one of these links shows no such HTML tags. Following the corresponding link to the Google cache shows that the document was "converted" to HTML for Google's cache by mapping all less-than signs to &lt; entity references and then wrapping the whole document in the appropriate HTML tags to make it one big HTML pre element.

This is good news for two reasons: first, while a Google search for XML files of a particular document type may show you plenty of XHTML documents that discuss that document type, as opposed to actually being documents of that type, don't let a string of HTML tags in Google's result listing discourage you -- the file might be a document of the type you're interested in after all. Second, when Google does this, it apparently indexes the entire DocBook document as the contents of an HTML pre element, putting tag names and attribute values in the index as well, because it just considers them to be more pre content. When element and attribute names, attribute values, and other markup metadata are in the index, you can use them as search terms, which is why I got DocBook hits from a search for "doctype article".
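The conversion I'm guessing at is easy to imitate. The sketch below follows the description above literally: it maps less-than signs to &lt; and wraps the result in a single pre element. The function name and the tiny sample document are made up, and the real conversion may well do more than this.

```python
def wrap_as_pre(xml_text):
    """Escape less-than signs and wrap the document in one HTML pre element."""
    escaped = xml_text.replace("<", "&lt;")
    return "<html> <head> </head><body><pre>" + escaped + "</pre></body></html>"


# A made-up two-line DocBook document, for illustration only.
docbook = ('<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" "docbookx.dtd">\n'
           '<article><title>Sample</title></article>')

page = wrap_as_pre(docbook)
print(page)
```

In the resulting page, the DOCTYPE declaration and the article and title tags are no longer markup. They are ordinary text inside the pre element, which is why words like "doctype" and "article" can end up in the index as search terms.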

Another DTD that's been popular since SGML days is the one developed by the Text Encoding Initiative, a non-profit group that has worked to make it easier to encode literary and linguistic texts since 1987. I had disappointing results with a search of filetype:xml "TEI DTD" ("TEI DTD" being a phrase in its public identifier), but eventually figured out that "tei" is a more popular extension for these files than "xml". For example, a search for filetype:tei tei gave me 2,630 hits.

XHTML and TEI files aren't the only XML documents that often don't have extensions of "xml". Running my test/-test pair of searches for files with an extension of "rss" showed about 116,000 files in Google's index. Of course, they're not necessarily all well-formed XML; specialized RSS search engines do exist, but the ability to search them with Google means that you can use all the other search techniques described here and in the "Google Hacks" book to search RSS files. For example, a search of filetype:rss http://purl.org/rss/1.0 looks for files with an "rss" extension that include the namespace URL for RSS 1.0 in their content, resulting in 10,800 hits. Searching for the same URL in files with an "rdf" extension (filetype:rdf http://purl.org/rss/1.0) gave me 34,500 hits.

To search in both filetypes at once, use Google's OR operator. (Remember to enter it in upper-case.) The search filetype:rdf OR filetype:rss http://purl.org/rss/1.0 gave me 47,200 hits, and a more specific search for the term "XForms" in RSS 1.0 files with an extension of "rdf" or "rss" (filetype:rdf OR filetype:rss http://purl.org/rss/1.0 xforms) found 21 files. Remember that not all of the found documents are necessarily RSS 1.0, but odds are that most files with an extension of "rss" or "rdf" that have the string "http://purl.org/rss/1.0" in them are RSS 1.0 files.
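If you run many of these searches, a small helper for assembling the query strings may save some typing. This is a minimal sketch with an invented function name; the "q" parameter in the encoded form is an assumption about how a search URL carries the query, so check it before relying on it.

```python
from urllib.parse import urlencode


def either_filetype_query(extensions, *terms):
    """Join filetype: qualifiers with an upper-case OR, then append the terms."""
    qualifiers = " OR ".join("filetype:" + ext for ext in extensions)
    return " ".join([qualifiers] + list(terms))


query = either_filetype_query(["rdf", "rss"], "http://purl.org/rss/1.0", "xforms")
print(query)

# URL-encoded form, suitable for the query string of a search URL.
print(urlencode({"q": query}))
```

The first line printed is the same query shown above for finding "XForms" in RSS 1.0 files with either extension.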

Googling for RDF

RDF is used for more than RSS. FOAF, or Friend Of A Friend, files are an experiment in the RDF community to store personal metadata -- where people live and work, what their interests are, and who their friends are. A typical FOAF file (mine, for example) doesn't list all of a person's friends, but only those who have FOAF files themselves; the growing collection of FOAF-to-FOAF links provides sample data for various RDF experiments.

There are conventions for FOAF filenames, but no set rules, so to search for FOAF files in Google, instead of filetype: I used the inurl: qualifier. This searches for URLs that have the specified string in them. Just entering inurl:foaf as a search term gave me 37,200 hits, but that included the FOAF specs, articles about it, and associated software. Adding the FOAF namespace URL to create a search query of inurl:foaf http://xmlns.com/foaf/0.1/ gave me 1,090 hits, with a much higher percentage of hits on the first Google result page being actual FOAF RDF files. You can add search terms to this to search within those files for a specific term -- for example, to see how many of those FOAF files specify a value for the FOAF workplaceHomepage property, enter inurl:foaf http://xmlns.com/foaf/0.1/ workplacehomepage.

FOAF files and RSS 1.0 are the two most popular uses of RDF that I know of. The OWL Web Ontology Language provides infrastructure for the ontology part of the Semantic Web. How popular is this set of RDF properties? A check for filetype:rdf owl showed 456 hits; repeated checks over time will give clues about the progress of its popularity. Once the number of hits gets into four figures, Semantic Web experiments are going to get easier and easier.

What kind of experiments can we do with the RDF out there? I've started playing a bit to answer this related question: what else do people use RDF for besides FOAF and RSS 1.0? By searching for files that have an extension of "rdf" but don't mention FOAF or RSS, I hope to find out. A Google query of filetype:rdf -rss -foaf ("show me files with a filetype of 'rdf' that don't have the strings 'rss' or 'foaf' in them") gave me 150,000 hits. Of course, many turn out to be RSS or FOAF files anyway, but this particular query reduces their percentage. Using the Google API and a simple Perl script described in "Google Hacks," I can pull down URLs for some of these RDF files, and then a batch file that uses the wget utility can pull down the files themselves.
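For readers who would rather stay in one language, the download step can be sketched in Python instead of a wget batch file. This is not the script from the book, and it doesn't cover the Google API half of the job: it assumes you already have a list of URLs. The function names are invented, and the files are numbered data/1.rdf, data/2.rdf and so on only because that is the layout the loading script expects.

```python
import os
from urllib.request import urlopen


def fetch_bytes(url):
    """Default fetcher: return the raw bytes served at the URL."""
    with urlopen(url) as response:
        return response.read()


def download_all(urls, dest_dir, fetch=fetch_bytes):
    """Save each URL as dest_dir/1.rdf, dest_dir/2.rdf, ...; return the saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for number, url in enumerate(urls, start=1):
        path = os.path.join(dest_dir, "%d.rdf" % number)
        try:
            data = fetch(url)
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            print("skipped %s: %s" % (url, err))
            continue
        with open(path, "wb") as out:
            out.write(data)
        saved.append(path)
    return saved
```

Calling download_all(urls, "data") with your own list of URLs fills the data directory. A URL that can't be fetched leaves a gap in the numbering, which does no harm as long as whatever reads the files afterward tolerates missing ones.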

Loading the RDF triples from these randomly collected files into a single RDF triple store will create an interesting collection of RDF to play with. I certainly can't assume that all the files will contain good RDF, but using rdflib and Python's exception handling ability, the following short script rejects any RDF files that it can't parse, reads the rest into a single triple store, and at the end saves it all as a single RDF/XML file:

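#! /usr/bin/python
from rdflib.TripleStore import TripleStore

# One store collects the triples from every file that parses.
store = TripleStore()

# Try to read files data/1.rdf, data/2.rdf ... data/34.rdf into a
# single TripleStore, then save that as test.rdf.
for i in range(1, 35):
    filename = "data/" + str(i) + ".rdf"
    try:
        store.load(filename)
    except Exception:
        # Skip any file that can't be read or parsed and move on.
        print("bad XML: " + filename)

# Write everything that loaded out as one file.
store.save("test.rdf")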

It's only an experiment, and it's just a start, but I'm confident that as I scale it up, analysis of the results will reveal valuable information about how people use RDF. Repeating the same experiment every six or eight months is bound to be interesting as well, showing increases and decreases in the popularity of various aspects of RDF.
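One rough first pass at that analysis is to tally which namespaces turn up in the collected files. The sketch below uses only Python's standard library and counts elements per namespace URI. That is a crude proxy for "which vocabularies are in use", not real triple-level analysis, and it ignores namespaced attributes. The function name and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def namespace_counts(xml_text):
    """Count elements per namespace URI in one well-formed XML document."""
    counts = Counter()
    for element in ET.fromstring(xml_text).iter():
        tag = element.tag
        # ElementTree writes namespaced tags as "{uri}localname".
        if isinstance(tag, str) and tag.startswith("{"):
            counts[tag[1:tag.index("}")]] += 1
    return counts


sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person><foaf:name>Example Person</foaf:name></foaf:Person>
</rdf:RDF>"""

for uri, count in namespace_counts(sample).most_common():
    print(count, uri)
```

Summing these counters over every downloaded file would show at a glance how much of the collection is still FOAF or RSS and which other vocabularies appear.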

Googling for...

Whether you're interested in RDF or any other kind of XML, the presence of this freely accessible, constantly updated, massive index of XML files known as Google is quite a resource. Combining the techniques shown here with others in the book "Google Hacks" gives you a lot to play with. You've seen one of my ideas for future research to take advantage of this resource. I look forward to seeing yours.