Putting ISBNs to Work
In last month's Hacking the Library column ("Six Steps to LCC@Home") I explained how to use a subset of the Library of Congress Classification scheme to organize your personal library -- a project I called LCC@Home.
In addition to all the spiffy diagrams in that column, I said that the bulk of the work required to implement LCC@Home was the process of labeling the items -- I suspect books, mostly -- in your collection. One trick for streamlining that process is to use the LC Cataloging in Publication (CIP) data, which is very often printed on the verso of a book's title page.
Most books published by American university presses since around 1970, and most other books published since 1990, will have a CIP block; but many small press and older books won't. For a book without one, you have to look it up in some kind of Library of Congress database in order to find its LC Call Number. When I first did this project four years ago, I used the LC database on the Web. But it's annoying for anything beyond casual use, since it employs some rather picky session timeout settings.
What I really wanted back then, but didn't take the time to figure out, was a command line tool that takes an ISBN as input -- nearly every book you're likely to own either has an ISBN or can be assigned one easily enough -- and outputs a Library of Congress Call Number, which I could then affix to a book.
In this and next month's column, I'm going to design and implement just such a tool in Python, isbn2lccn. More specifically, in this column I'll look at ISBNs, including how we might use ISBNs in RDF, and consider some of the sources of bibliographic information available on the Web. Next month I'll walk through the Python code and talk about how we can turn it into a proper web service itself.
We're going to input ISBNs and output LC Call Numbers, but what is an ISBN anyway? First, it's an international standard, ISO 2108. Second, it's a structured identification string, made up of 10 digits, that is "unique" and "machine-readable" and "which marks any book unmistakably", according to the International ISBN Agency. ISBNs have some properties that geeks like us find pretty interesting. The 10 digits of an ISBN represent four fields: a group, publisher, and title identifier, plus a check digit. In 2007 U.S. publishers will begin to transition to 13-digit ISBNs. An ISBN is a book's fingerprint, often represented by a bar-code. It's another bit of the world I call a dijalog inflection point -- that is, ISBNs, whether represented as digits, bar-codes, or in RFID tags, are points at which the digital and analog worlds synch up.
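The check digit mentioned above is what lets a tool reject a mistyped ISBN before making any network calls. What follows is a minimal sketch in Python 3 of the standard ISBN-10 check (weights 10 down to 1, sum divisible by 11, with "X" standing for 10) and of the conversion to a 13-digit ISBN with the 978 prefix. The function names are hypothetical and are not part of isbn2lccn.

```python
DIGITS = "0123456789"


def clean_isbn(raw):
    """Strip hyphens and spaces, and upper-case a trailing 'x'."""
    return raw.replace("-", "").replace(" ", "").upper()


def is_valid_isbn10(raw):
    """Return True if the string is a 10-character ISBN with a consistent check digit."""
    isbn = clean_isbn(raw)
    if len(isbn) != 10:
        return False
    if any(char not in DIGITS for char in isbn[:9]):
        return False
    if isbn[9] not in DIGITS + "X":
        return False
    # Weights run 10, 9, ..., 1; 'X' in the last position stands for 10.
    total = 0
    for position, char in enumerate(isbn):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    return total % 11 == 0


def isbn10_to_isbn13(raw):
    """Convert a valid ISBN-10 to its 13-digit form using the 978 prefix."""
    isbn = clean_isbn(raw)
    if not is_valid_isbn10(isbn):
        raise ValueError("not a valid ISBN-10: %r" % raw)
    body = "978" + isbn[:9]  # the old check digit is dropped
    # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(body))
    check = (10 - total % 10) % 10
    return body + str(check)
```

With these definitions, `is_valid_isbn10("0670835102")` holds for the ISBN used in the examples later in this column, and changing any single digit makes the check fail.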
Why is that point of inflection useful? Physical (or, as I persist in misnaming, analog) items that have a unique, machine-readable identifier can often become the subject of machine-readable assertions, using RDF. There are at least two possibilities: first, use the unique identifier as the basis for coining unique URIs; second, check to see if there is a URN scheme you can use instead.
Eventually in this series I will begin to explore and to use RDF to represent assertions about the items of my dijalog collections. For example, imagine that we want to make some machine-readable assertions about The Iliad. (I'm going to skirt for now some tricky data modeling issues here; but rest assured that we'll come back to them later on in this series.)
The first thing we might do is to coin a URI using a book's ISBN:
That's a perfectly good URI: as the person who owns the domain "monkeyfist.com", I am the controlling authority (in some ways Semantic Web folks still haven't really worked out) of that URI; it's not going to clash with other URIs; and it suggests a generative naming scheme, so I can easily make new ones at will. One bad thing about this URI is that it's not likely that anyone else will want to use it, which makes it somewhat harder for different parties to know that we're making assertions about the same thing -- which is part of the mojo of the Semantic Web and RDF in the first place.
What about the second possibility? As discussed in RFC 3187 we could use a URN scheme for ISBNs, in which case our URI becomes a URN:
This URN is preferable to the URI for at least two reasons: first, I have some reasonable expectations that other people will use this identifier form; second, it's semi-structured: if I have one of these URNs, I and others know exactly how to convert it to the equivalent ISBN:
Python 2.3 (#1, Aug 5 2003, 15:19:06)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
>>> print "urn:ISBN:0670835102".split(":")[-1]
0670835102
>>>
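The same round trip can be wrapped in a pair of small helpers. This is a sketch in Python 3 rather than the 2.3 interpreter shown above, and the function names are hypothetical. It relies on the "urn:" prefix and the namespace identifier being case-insensitive, so "urn:ISBN:" and "URN:isbn:" are treated alike.

```python
URN_PREFIX = "urn:isbn:"


def isbn_to_urn(isbn):
    """Build a URN of the form used in this column from a bare ISBN."""
    return "urn:ISBN:" + isbn.replace("-", "").replace(" ", "")


def urn_to_isbn(urn):
    """Return the ISBN part of an ISBN URN, or raise ValueError."""
    if not urn.lower().startswith(URN_PREFIX):
        raise ValueError("not an ISBN URN: %r" % urn)
    # Slice rather than split, so the prefix test and the extraction agree.
    return urn[len(URN_PREFIX):]
```

For example, `urn_to_isbn("urn:ISBN:0670835102")` returns the string `"0670835102"`, and passing a URN from another namespace raises `ValueError` instead of silently returning the wrong piece.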
But do I really want to make assertions about that particular realization of The Iliad, the 1990 Robert Fagles hardcover edition? Maybe I do and maybe I don't. Since I own that book, I will eventually want to make assertions about it, including the assertion that I own a copy. But what if I want to say, first, that it is a realization of the abstract entity "an English translation of Homer's Iliad"? In that case, I'd probably use an RDF blank node, which is an RDF graph-scoped variable. For example, using the Notation 3 form of RDF, I might say:
@prefix bib: <http://monkeyfist.com/kendall/dijalog/scheme/0.1/#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] dc:title "The Iliad" ;
   dc:author "Homer" ;
   rdfs:label "Homer's Iliad" ;
   rdf:type bib:work-of-antiquity ;
   bib:realized-by <urn:ISBN:0670835102> .

<urn:ISBN:0670835102> bib:translator "Robert Fagles" ;
   rdf:type bib:book ;
   dc:date "1990"^^xsd:gYear .
That is, I might want to say that there is a thing, such that it has the title "The Iliad", was written by Homer, and is a work of antiquity. Further, I might want to say that this rather abstract entity is realized by a particular book which was translated by Robert Fagles and published in 1990. If I owned several different versions of this abstract entity, I could make assertions about all of them, too. Eventually, I might use some of the properties of OWL, the W3C's Web Ontology Language, to say that some of these things were the same.
That's where we're going, eventually, but first we need to talk about converting ISBNs into LC Call Numbers.
Before taking up some sources of bibliographic information on the Web, let me commit an act of library and information science heresy -- when implementing LCC@Home, it's safe to consider most, if not all, of your books to be interchangeable instances of a class, rather than as classes or types themselves. That's a very bad way of saying that, for real librarians, different versions of the same book are conceptually and organizationally distinct; but for pseudo-librarians like us, we can consider different variations of the same book to be the same or similar. Consider, for example, that old copy of the Iliad you have from college days, the reissued classic 1713 translation by Alexander Pope, who rendered Homer's Greek dactylic hexameter into the poetry of my mother tongue. Consider, too, a copy of the first English prose translation of the Iliad, Samuel Butler's 1898 effort. (Both of which, by the way, are available from Project Gutenberg.) Finally, consider Robert Fagles's very idiomatic verse translation of the Iliad from 1990, the one I made some RDF assertions about earlier.
In some informal sense which may be useful, these are the same book. While they are very distinct in many ways, including in ways which matter to library and information science, to scholarship, as well as to my eventual efforts to use RDF in this domain, it's likely that many versions of the first two books don't have ISBNs, while the third one certainly does. I'm suggesting that, if you run into books in your library that don't have ISBNs -- probably because they were published before the advent of ISBN in the early 1970s -- you can use the ISBN of a newer edition or translation or version of the book. That is, feed the ISBN of Fagles's version of The Iliad into our tool in order to get an LC Call Number to affix to your copy of the Pope translation. If you have a choice, it's better to use the ISBN of a recent edition of the same translation than the ISBN of a different translation of the same work. This makes librarians shudder, of course, but it's still preferable to trying to do your own original cataloging.
In what remains of this column I'm going to describe four sources of bibliographic information on the Web, ending with the sources I'll be using to build my isbn2lccn tool. I'll conclude with a simple description of the tool, the full details of which I'll take up next month.
Everyone knows that Amazon.com sells books. And probably everyone in the XML.com audience knows by now that Amazon provides an awful lot of information about those books by way of SOAP or REST web services. I won't say much else about Amazon's informational offerings today, except that they don't contain LC Call Numbers. But that's fine because they do provide information via ISBN, and eventually we may want to incorporate some of the information Amazon does provide -- sales price, book cover graphic, related books, etc. -- into the tool we're building.
Next, the OCLC's xISBN is a web service that takes as input an ISBN -- that's good, our tool takes ISBNs as input -- and returns more ISBNs. What does that mean? First, let's figure out how to call xISBN, which is a plain old REST service. Simply dereference (either programmatically or from your web browser) URIs of the form:
where the path segment "[ISBN]" is replaced by, you guessed it, a valid ISBN. For example,
is the xISBN URI for Fagles's Iliad.
What do you get back when you dereference that URI? You get, in REST speak, an XML representation of a resource that can loosely be described as "other ISBNs associated with the one you submitted, according to OCLC's WorldCat records". The XML format is the very soul of simplicity:
<?xml version="1.0" encoding="UTF-8" ?>
<idlist>
  <isbn>0670835102</isbn>
  <isbn>038505940x</isbn>
  <isbn>0872203522</isbn>
  <isbn>0872203530</isbn>
  <isbn>0226469409</isbn>
  <isbn>0674995791</isbn>
  ...
The ISBNs returned by xISBN will most typically be different versions of the same work as the one identified by the input ISBN. That's not the information we want directly, but it could be useful in our tool. Imagine you have a book with an ISBN, but that particular ISBN hasn't been given an LC Call Number, for whatever reason. In that case, it may make sense to ask xISBN for related ISBNs, for which we could then try to find LC Call Numbers.
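The fallback just described can be sketched with the standard library's ElementTree. This is only an illustration under stated assumptions: the sample document below is the excerpt shown earlier with a closing tag added so that it parses, the function names are hypothetical, and `lookup` stands in for whatever service eventually maps an ISBN to an LC Call Number. No request is made to xISBN itself.

```python
import xml.etree.ElementTree as ET

# The excerpt from this column, closed off so it is well-formed (an assumption).
SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<idlist>
  <isbn>0670835102</isbn>
  <isbn>038505940x</isbn>
  <isbn>0872203522</isbn>
  <isbn>0872203530</isbn>
  <isbn>0226469409</isbn>
  <isbn>0674995791</isbn>
</idlist>"""


def related_isbns(xml_bytes, submitted):
    """Return the ISBNs in an idlist document, minus the one we asked about."""
    root = ET.fromstring(xml_bytes)
    found = [el.text.strip() for el in root.iter("isbn") if el.text]
    return [isbn for isbn in found if isbn.upper() != submitted.upper()]


def first_call_number(isbns, lookup):
    """Try each ISBN in turn; return (isbn, call_number) for the first hit.

    `lookup` is a hypothetical callable returning a call number or None.
    """
    for isbn in isbns:
        call_number = lookup(isbn)
        if call_number:
            return isbn, call_number
    return None
```

Passing the lookup in as a callable keeps the sketch independent of any particular catalog service, and makes it easy to exercise with a plain dictionary's `get` method.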
I haven't decided whether to incorporate that bit of functionality into our isbn2lccn tool, but I'm going to consider it, especially if I hear from some librarians and information scientists about this issue. (In other words, if you care about this, either way,
Finally, if you want to read xISBN's technical details, they're available.
The traditional computerized means of accessing and interchanging library bibliographic records is the Z39.50 protocol (an ANSI/NISO standard) and MARC records. I should point out that the very first programming project I ever undertook was to write a MARC parser in Python -- a project that failed miserably.
I want to say a few things here about Z39.50 and MARC, but they really deserve a column of their own, which I intend to provide later this year. MARC is a bit ghastly, given contemporary standards for data interchange, though there is an XML representation, MARCXML. It's pretty ghastly, too. In short, I want to avoid mucking about too much with MARC records for now.
Of interest, however, is the fact that the LC Call Number is often present in a MARC record in data field 050. So I could write an ad-hoc tool to deal with 050 in MARC records. But it's probably smarter and easier to use a real MARC parser; there's one available in the PyZ3950 open source project. It's even easier, I suspect, to write a bit of code to extract the LC Call Number from XML versions of MARC records. If I were going to work directly with MARC, I'd probably go the XML route, though there may be advantages to using a full MARC parser that I don't yet understand. More about that in a future column.
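To make the XML route concrete, here is a minimal sketch that pulls field 050 out of a MARCXML record with ElementTree. The namespace is the MARCXML one; the sample record is hand-written for illustration, its call number is sample data rather than the result of a real lookup, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written sample record; the values are illustrative only.
SAMPLE_RECORD = b"""<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0670835102</subfield>
  </datafield>
  <datafield tag="050" ind1="0" ind2="0">
    <subfield code="a">PA4025.A2</subfield>
    <subfield code="b">F33 1990</subfield>
  </datafield>
</record>"""


def lc_call_number(marcxml):
    """Return the LC Call Number from data field 050, or None if absent."""
    root = ET.fromstring(marcxml)
    field = root.find(".//marc:datafield[@tag='050']", MARC_NS)
    if field is None:
        return None
    # Subfield a holds the classification number, subfield b the item number.
    parts = [
        sub.text.strip()
        for sub in field.findall("marc:subfield", MARC_NS)
        if sub.get("code") in ("a", "b") and sub.text
    ]
    return " ".join(parts) or None
```

Returning `None` when field 050 is missing matters in practice, since the field is often, but not always, present.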
For now let me say that nearly all bibliographic information still zips around the world via Z39.50 and MARC -- the ultimate source of the information provided by our tool is Z39.50/MARC.
What I really want to use is something I already understand pretty well, namely, XML and a more popular, thus well-known, vocabulary. I've decided that for our tool I'm going to use the Library of Congress's Z39.50 web gateway and its XML message formats, which are based on the XML metadata vocabularies Dublin Core and MODS. Since Dublin Core is pretty well known by XML developers, I'll probably play with the MODS messages as a way to evangelize that metadata standard a bit. It's an interesting vocabulary. If I ever add persistence to our tool, we'll have a configuration bit that users can twiddle if they prefer Dublin Core, MODS, or MARCXML.
There's also an experimental British Library gateway that does much the same thing, so our tool will likely query both services. More about it next month.
Our isbn2lccn tool will read its command line arguments, one of which will be a required ISBN. It will then make some REST web service calls to either the Library of Congress or the British Library; in some cases it may also make calls to Amazon or to xISBN. In its regular mode it will extract the LC Call Number from the XML message returned by the BL or LC services, printing it to the output channel specified by one of the invocation arguments -- probably STDOUT by default. In a future column I'll likely add a batch mode in which the tool will store some representation of these messages, probably as raw XML, on the disk. And if I can get the Python PDF libraries to play nice, it will eventually create a PDF suitable for printing directly onto sticky labels.
Finally, I must acknowledge two correspondents who really did all the hard work behind this month's column. First, Bill Oldroyd emailed to tell me about the British Library's experimental service, which provides an XML representation of bibliographic records, indexable by ISBN. These messages contain a stylesheet PI and so are human-readable with the right browser. That's a very neat trick. Second, I want to thank the proprietor -- whose name, alas, I neither know nor could locate -- of the RAWBRICK.NET weblog. It was a weblog posting there that provided me most of the details of the LC gateway.
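The command line shape described above might look something like the following skeleton. It is a sketch under loud assumptions: the `-o/--output` flag, the exit codes, and the `lookup_call_number` stub are all hypothetical, and the stub makes no web service calls, since the gateway details are left for next month.

```python
import argparse
import sys


def lookup_call_number(isbn):
    """Hypothetical stub for the LC or BL service calls; always finds nothing."""
    return None


def build_parser():
    parser = argparse.ArgumentParser(
        prog="isbn2lccn",
        description="Print the LC Call Number for an ISBN.",
    )
    parser.add_argument("isbn", help="the ISBN to look up (required)")
    parser.add_argument(
        "-o", "--output", default="-",
        help="file to write the call number to; '-' means standard output",
    )
    return parser


def main(argv=None, lookup=lookup_call_number):
    """Parse arguments, look up the call number, and write it out.

    Returns 0 on success and 1 when no call number was found.
    """
    args = build_parser().parse_args(argv)
    call_number = lookup(args.isbn)
    if call_number is None:
        print("no LC Call Number found for %s" % args.isbn, file=sys.stderr)
        return 1
    if args.output == "-":
        sys.stdout.write(call_number + "\n")
    else:
        with open(args.output, "w") as handle:
            handle.write(call_number + "\n")
    return 0
```

A script would call `sys.exit(main())`; taking `argv` and `lookup` as parameters lets the same function be driven from tests or from a later batch mode without touching the real services.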
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.