XML in News Syndication
The news industry has spent the last few years trying to figure out how to deal with the new challenges the Web presents. It was difficult enough to handle the initial challenge of adopting business models that focused on giving content away for free, but the technical issues have also proven enormous.
The requirements of simultaneously creating content for paper, web, and archive destinations are more demanding on news production systems than paper-only delivery, and place particular demands on the transmission and storage of news content--specifically, the granularity, structure, and precision of information.
Typically, human intervention is employed at many stages in the production process of print publications. This has led to a situation where many of the data formats and conventions within news organizations could not be reliably parsed by computer. As a trivial example, consider the "byline" of a story. In an ASCII news story (as might be sent down a newswire), this might be represented thus:
IMPORTANT HAPPENING IN ONLINE NEWS by Edd Dumbill
Although this example doesn't seem too hard to parse with a simple Perl script, consider the variations "by Edd Dumbill, News Reporter" and "by Edd Dumbill, Fred Smith, and Sally Jones". Is "News Reporter" a person, or "Fred Smith" a job title?
If we were encoding the news today in XML, we might expect something more like this:
<headline>IMPORTANT HAPPENING IN ONLINE NEWS</headline>
<by><person>Edd Dumbill</person></by>
Because news used to be handled by journalists who understood the context of the content, ambiguity of this kind didn't matter too much. The web, however, has increased the need for automatic handling of news content, which requires more granularity than plain text provides. It goes almost without saying that this is one of the areas in which XML excels, so it should come as no surprise to find a great deal of XML activity in the news distribution world.
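To make the contrast concrete, here is a minimal Python sketch. It assumes a hypothetical `<story>` wrapper element around the XML fragment above (the fragment has two top-level elements, so it needs a root before it will parse) and a deliberately naive regular expression for the ASCII case. Neither comes from any news standard.

```python
import re
import xml.etree.ElementTree as ET

# ASCII wire style: everything after "by" is one opaque string.
ascii_line = "IMPORTANT HAPPENING IN ONLINE NEWS by Edd Dumbill, News Reporter"


def parse_ascii_byline(line):
    """Naive split after ' by '; cannot tell people from job titles."""
    match = re.search(r"\sby\s(.+)$", line)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split(",")]


# XML style: the hypothetical <story> root wraps the article's fragment.
xml_story = """
<story>
  <headline>IMPORTANT HAPPENING IN ONLINE NEWS</headline>
  <by><person>Edd Dumbill</person></by>
</story>
"""


def parse_xml_byline(text):
    """Return only the names explicitly marked up as <person>."""
    root = ET.fromstring(text)
    return [person.text for person in root.findall("./by/person")]


ascii_names = parse_ascii_byline(ascii_line)
xml_names = parse_xml_byline(xml_story)
print(ascii_names)
print(xml_names)
```

The ASCII parser reports "News Reporter" as if it were a second author, while the XML parser returns only what the markup identifies as a person.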
The XML applications outlined in this article have all arisen in response to the problems posed by multi-destination delivery, and to the new business models in content that have emerged.
If we look at the conventional way news has been transmitted between organizations--the newswire--we can identify various components:
- Protocol: the conventions used to carry information between parties over the transmission medium
- Envelope: the conventions used to identify a segment of information
- Header: the conventions for identifying metadata about a news item
- Content: the convention used for the actual content of the item
I've invented some of this separation for the convenience of illustrating the roles played by the XML-based applications discussed in this article. In the old wire specifications, some of these components were merged together. On the Internet, transmission is no longer as simple as broadcast or point-to-point, and pulling out the components separately provides a cleaner, more reusable, approach.
Here's an outline of where today's news syndication technologies sit:
Existing multimedia formats
Note that there are some points of overlap between these classifications, but this table serves as a useful indicator of the purposes of the separate news initiatives. The names picked out in italics are those formats and protocols that together are likely to become the most ubiquitous platform over the next few years.
The rest of this article discusses each of the XML-based technologies featured in the table.
The News Industry Text Format is a well-established XML application for marking up news items. In fact, it was originally an SGML application before the advent of XML.
NITF was developed jointly between the International Press Telecommunications Council (IPTC) and the Newspaper Association of America, the two major standards organizations for the news industry. The intention was to supersede the ANPA1312/IPTC7901 binary wire formats for the delivery of news, which, as alluded to in the introduction, were geared exclusively towards print applications.
To give you a taste for what NITF does, here's a sample story marked up in NITF:
<nitf version="-//IPTC-NAA//DTD NITF-XML 1.3c//EN" change.date="31 October 1999" change.time="1900">
  <head>
    <title>XML News Formats</title>
  </head>
  <body>
    <body.head>
      <hedline>
        <hl1>XML-based Formats in the News Industry</hl1>
      </hedline>
      <byline>
        <person>Edd Dumbill</person>
        <virtloc>email@example.com</virtloc>
      </byline>
      <dateline><story.date>Friday July 14 2000</story.date></dateline>
    </body.head>
    <body.content>
      <p>
        The advent of the web was a large problem for the news industry,
        and in more ways than one. Economically speaking, print publications
        faced the challenge of giving away their content for free on the web,
        and adapting to new business models. Technically, too, the web raised
        many issues for news providers.
      </p>
    </body.content>
  </body>
</nitf>
You can see that NITF is inspired in part by HTML. NITF has been cleverly designed to be a flexible DTD, in that users can put as little or as much embedded markup as they wish into the story. To markup fans this may seem a little counterproductive, but it is an essential feature: it lowers the cost of moving to NITF by reducing the amount of re-engineering required within production systems. So, both of the following fragments are valid NITF, but they differ in how much it costs to generate the content.
The riot took place in north London last Monday.
The <event>riot</event> took place in <location>north London</location> <chron norm="20000717">last Monday</chron>.
As the NITF Implementors' Guide puts it,
[P]artial implementation can be introduced into most editorial computer systems without large-scale modifications. NITF can be tested on a specific project, such as sports agate, without involving other departments. This gives publishers a chance to see how NITF works without making a large investment.
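What the richer of the two fragments buys a consumer can be sketched in a few lines of Python. This is an illustration only: the `<p>` wrapper is added here so that the fragment parses on its own, and the code assumes the `norm` attribute holds a YYYYMMDD date, as it does in the sample.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# The marked-up fragment from above, wrapped in <p> so it is well-formed.
fragment = (
    '<p>The <event>riot</event> took place in '
    '<location>north London</location> '
    '<chron norm="20000717">last Monday</chron>.</p>'
)

para = ET.fromstring(fragment)

# Dropping the tags recovers the plain sentence.
plain_text = "".join(para.itertext())

# The inline elements expose facts a program can use directly.
event = para.findtext("event")
location = para.findtext("location")

# Assumption: norm is a YYYYMMDD date, as in the sample.
when = datetime.strptime(para.find("chron").get("norm"), "%Y%m%d").date()

print(plain_text)
print(event, location, when.isoformat())
```

A system that only needs the text can ignore the inline elements, while one that indexes by place or date can read them without guessing what "last Monday" means.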
NITF has evolved through several versions, including making the transition from SGML to XML. The most recent version is v1.3, released on October 31, 1999. The DTD is available from the NITF web site.
Further reading: Robin Cover on NITF
XMLNews is probably the most widely deployed news syndication format on the web at the moment. It was designed by David Megginson to be a subset of the NITF September 21, 1998 release. That part of the specification is known as "XMLNews-Story". Additionally, XMLNews includes "XMLNews-Meta," an RDF application for describing news content.
Because XMLNews-Story is similar to NITF (as described above), I won't explain it in detail here. It is worth noting, however, that with the most recent revision of NITF (which includes simplifications and improves ease-of-use), XMLNews-Story is no longer a compliant subset. Megginson's work with XMLNews made NITF radically more accessible and understandable for many, and has fulfilled a valuable function in enabling software support from the likes of Wavo and iSyndicate. However, it looks as though future development will happen exclusively in NITF.
XMLNews-Meta is an extensible vocabulary for describing news resources. In contrast to NITF, which is used for the content itself, XMLNews-Meta describes the content. Its main features are the ability to describe the following:
- Identification (assigning a unique ID to the resource being described)
- Header Information (such as language, title, description)
- Milestones (publication, release, receipt and expiry times)
- Provenance (the route through news providers taken by a story)
- Rights (copyright and distribution rights)
- Subject Matter (machine readable classification information)
- Linking (describing inter-story relationships, e.g., previous versions)
As XMLNews-Meta is expressed in RDF, it is inherently extensible, and organizations can extend for their own purposes by using a vocabulary in a custom namespace. XMLNews-Meta also has the distinction of being one of the few RDF applications currently in everyday use.
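The extension mechanism can be sketched as follows. Only the RDF namespace URI is real; the `meta` and `house` namespaces, and every property name in them, are hypothetical placeholders rather than the actual XMLNews-Meta vocabulary.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical namespaces; NOT the real XMLNews-Meta vocabulary.
META_NS = "http://example.org/news-meta#"
HOUSE_NS = "http://example.org/house-style#"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("meta", META_NS)
ET.register_namespace("house", HOUSE_NS)


def describe(resource_url, properties):
    """Build an rdf:RDF tree describing one news resource.

    properties maps (namespace, local name) pairs to string values.
    """
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": resource_url},
    )
    for (namespace, name), value in properties.items():
        ET.SubElement(desc, f"{{{namespace}}}{name}").text = value
    return root


tree = describe(
    "http://example.org/stories/riot.xml",
    {
        (META_NS, "title"): "Riot in north London",
        (META_NS, "language"): "en",
        (META_NS, "expiryTime"): "2000-07-24T00:00:00Z",
        # An organization-specific extension in its own namespace.
        (HOUSE_NS, "desk"): "Metro",
    },
)
rdf_xml = ET.tostring(tree, encoding="unicode")
print(rdf_xml)
```

Because the custom property sits in a separate namespace, a consumer that knows only the shared vocabulary can skip it without misreading the rest of the description.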
A younger specification than NITF, NewsML is also being developed under the auspices of the IPTC. NewsML is an envelope format for news content, designed to help solve the problem of transporting news items irrespective of their encoding.
In the same way that NITF supersedes IPTC7901, NewsML is an XML-based successor to the IPTC's "Information Interchange Model." NewsML is still very much in development, but its core features are support for the following:
- All formats and media-types: "NewsML makes no assumption about the media type, format, or encoding of news. NewsML provides a structure within which news objects, of whatever type, relate to each other. NewsML can equally represent text, video, audio, graphics, and photos."
- Collections of news items, either as journalistic packages or results of automatic collation
- Named relationships between news items: much like the linking part of XMLNews-Meta
- Multi-part structure with internal relationships: e.g., text with supporting images or video
- Tracking revision of news items over time
- Alternative representations of item parts, for instance HTML, RTF, and PDF encoding of text
- Inclusion and exclusion of news item parts
- Attached metadata
Although NewsML will allow the implementation of an envelope, incorporating existing content formats like NITF, and the inclusion of external metadata descriptions like PRISM (see below), it is also designed to be self-sufficient. That is, it will be possible to use NewsML alone for the envelope, metadata, and text parts of a story, albeit with less flexibility. Further detail on the relationship between NITF and NewsML can be found in this thread from the NewsML mailing list.
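As a rough mental model of the envelope idea, the Python sketch below represents a package whose parts each carry alternative representations, and lets a receiver pick the encoding it prefers. All class, field, and role names here are invented for illustration; they are not taken from the NewsML DTD.

```python
from dataclasses import dataclass, field


@dataclass
class Representation:
    media_type: str  # e.g. "text/html"
    location: str    # where the encoded content lives


@dataclass
class Part:
    role: str  # hypothetical role label, e.g. "main-text"
    alternatives: list = field(default_factory=list)

    def pick(self, accepted):
        """Return the first alternative matching the receiver's preferences."""
        for media_type in accepted:
            for rep in self.alternatives:
                if rep.media_type == media_type:
                    return rep
        return None


@dataclass
class NewsPackage:
    item_id: str
    revision: int
    parts: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


package = NewsPackage(
    item_id="riot-2000-07-17",
    revision=2,
    parts=[
        Part("main-text", [
            Representation("text/xml", "story.nitf.xml"),
            Representation("text/html", "story.html"),
            Representation("application/pdf", "story.pdf"),
        ]),
        Part("supporting-image", [Representation("image/jpeg", "riot.jpg")]),
    ],
    metadata={"language": "en"},
)

# A web site prefers HTML and falls back to PDF.
web_choice = package.parts[0].pick(["text/html", "application/pdf"])
# The image part has no HTML alternative, so nothing is selected.
image_choice = package.parts[1].pick(["text/html"])
print(web_choice.location, image_choice)
```

The envelope never inspects the content itself, which is what lets the same structure carry text, images, or video.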
A beta DTD for NewsML can be found at http://www.iptc.org/NewsML, and the final specification is due for release in early October.
Further reading: IPTC's NewsML web site
The Publishing Requirements for Industry Standard Metadata initiative takes a wider view than the IPTC-sponsored NITF and NewsML. Operated under the aegis of the IDEAlliance, PRISM seeks to "develop an XML metadata vocabulary for the magazine, catalogue, mainstream journal, news, and book industries."
Before the Web, news syndication was largely the domain of large organizations, such as news wires, who could afford the staffing and the infrastructure to make the business profitable. As it has many other things, the web overturns that model. Standards like PRISM and ICE address themselves to a larger audience than the traditional large news organizations.
The PRISM authoring group expects to release their metadata vocabulary this fall. Until that point, there is not much information publicly available about the vocabulary. The PRISM home page does, however, outline some encouraging goals, particularly the re-use of existing metadata standards such as RDF and the Dublin Core.
PRISM, when it is released, will serve a purpose similar to the one XMLNews-Meta serves today, but with a broader focus. The dual-document approach of XMLNews, with content in one document and its description in another, provides the pattern that PRISM itself will follow.
Further reading: PRISM web site.
The Information and Content Exchange specification is one of the most established applications of XML in this area. It defines both a vocabulary and a protocol for the transport and business rules aspects of content syndication.
ICE provides the protocol by which content syndicators can offer content to potential subscribers, and subscribers can receive it. It does this by defining XML-over-HTTP exchanges.
Most of the members of the ICE Authoring Group don't really come from the traditional news world. This is reflected in the fact that the ICE syndication model is a lot more sophisticated than the models previously used by the big players. Traditional newswires adopted a broadcast or point-to-point philosophy, where all the business rules governing subscription happened out-of-band between humans. Other syndication occurred by drop-off using ISDN, bulletin boards, or even by fax.
Perhaps ICE's most notable feature is its ability to codify some of the business aspects of syndication. These include delivery schedules, activating subscriptions, and even "surprise" content requests, to handle one-off transactions. This ability means it is practical for syndicators to maintain large client bases. Also, for content aggregators, ICE makes it easier to conduct transactions of known reliability with multiple suppliers. ICE servers and clients provide a shrink-wrap replacement for a host of Perl and shell scripting that previously conducted these operations.
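The general shape of such an exchange, a business rule expressed as XML and carried over HTTP, can be sketched in Python. The element names, attribute names, and URL below are hypothetical and are not taken from the ICE specification; the request is built but not sent, so the sketch runs offline.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_subscribe_request(subscriber_id, offer_id, start_time, end_time):
    """Serialize a hypothetical subscription request with a delivery window."""
    root = ET.Element("subscribe-request", {"subscriber": subscriber_id})
    ET.SubElement(root, "offer", {"id": offer_id})
    ET.SubElement(
        root,
        "delivery-schedule",
        {"start": start_time, "end": end_time},
    )
    # encoding="utf-8" yields bytes, which is what an HTTP body needs.
    return ET.tostring(root, encoding="utf-8")


payload = build_subscribe_request(
    subscriber_id="example-aggregator",
    offer_id="daily-headlines",
    start_time="06:00",
    end_time="09:00",
)

# Hypothetical endpoint; a real syndicator would publish its own URL.
request = urllib.request.Request(
    "http://syndicator.example.com/ice",
    data=payload,
    headers={"Content-Type": "text/xml"},
    method="POST",
)

# urllib.request.urlopen(request) would send it; omitted here on purpose.
print(request.get_method(), request.full_url)
print(payload.decode("utf-8"))
```

The point of the sketch is that the delivery schedule travels in the message itself, rather than being agreed out-of-band and then hard-coded into scripts on both sides.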
ICE now has multiple implementations in the field, and is being applied wherever a reliable publish/subscribe model is required for electronic asset exchange. Examples include parts catalogs as well as more traditional media areas.
Although the ICE specification is publicly available, the ICE Authoring Group and Network are fee-based organizations, which means that their members pay for the work of developing the standard. This does, however, have a knock-on effect on the speed with which information about ICE is disseminated--meaning the ICE AG hasn't been able to benefit from the same groundswell of open source implementation that W3C XML specs often receive. According to a message earlier this year from Sami Khoury, one of ICE's authors, an open source reference implementation of ICE will soon be available from the authoring group itself.
The technologies mentioned in this article are at varying levels of completeness and implementation. They also overlap slightly, as they have been developed by different groups with different goals. What is very encouraging, though, is that each of the initiatives is being pursued with both extensibility and compatibility in mind, and they look as though they will play nicely with each other.
For pursuing content syndication right now, the two most stable activities are NITF and ICE, both of which are good places to start looking. For an accessible start into syndication, XMLNews is still a good choice, and has commercial software support--this means that as the newer specifications such as PRISM and NewsML come on-stream, an upgrade path is likely to be provided.
What does the future hold? Hopefully before too long, the various technologies will be joined together, so you may find news marked up in NITF, described by PRISM, packaged in NewsML, and delivered via ICE.
I'd like to thank Deren Hansen of Wavo Corporation for his assistance in compiling this article.