XML.com: XML From the Inside Out


Dynamic News Stories

May 17, 2006

I like structured data. My favorite projects tend to be those that deal with, and exploit, structured information: events, restaurants, crime, and political information.

But one thing that's always bothered me is that the bread-and-butter of my chosen field, journalism, is relentlessly unstructured. The primary product of journalists -- the news story -- is just a giant blob of text.

A news story cannot be broken down into easily defined, consistent pieces. It doesn't have facts in predictable places. (Yes, a computer can split text into sentences and words, but those carry little meaning as discrete bits.) Sets of stories, even when published in the same publication and written by the same reporter, are not consistent in any machine-readable fashion. I cannot easily tell a computer, "Give me the most important bits of information in this news story." (Not to mention some stories contain more fluff than solid information!)

Sure, news stories can have metadata -- headlines, publication dates, bylines, categorization -- but the essence of a news story, its text, isn't structured in itself.

This lack of structure certainly makes sense. A news story is intended to be "consumed" by humans, not computers. Indeed, one could argue that's what makes a good news story great: it's literature -- more art than science. Well-written prose is unpredictable, and predictable prose is boring. Why should art be machine-readable?

I've pondered this question, and I think it's possible to compromise. We can indeed introduce some structured value to news stories while retaining the "looseness" of arbitrary prose.

The following ideas introduce a level of automation that solves a couple of problems and makes news stories more dynamic. Each idea is implemented via an XML tag, assuming a story is stored as XML. (I don't have any particular XML language in mind; I'm simply introducing these concepts and giving them generic XML tag names.)

<profanity level="X">

When I worked at LJWorld.com and Lawrence.com, we had an interesting dilemma: we shared stories across both sites, but Lawrence.com (a local entertainment site with an audience of mostly college students) has quite a different tolerance for profanity than the more conservative LJWorld.com (the "traditional" local newspaper site). Lawrence.com writers aren't afraid to include naughty words in their stories, whereas LJWorld.com needs to remain a "family newspaper."

We would solve this dilemma by publishing two stories in the database: one with the profanity, one without. It got the job done, but it was redundant and inefficient.

So, the solution I'm proposing is simple: introduce a <profanity> tag, which you'd wrap around any word or phrase that could be considered profane. Give it a "level" attribute that specifies the severity of the word or phrase, and tweak your news site's content-management system to either dash out the word or proudly display it, depending on the site.

This could scale to the individual-user level, too. Let each site reader specify in a site profile whether he or she would prefer profanity to be blocked.
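As a rough illustration of the <profanity> idea, here is a minimal Python sketch of what the content-management side might do. Everything in it is an assumption made for the example: the <story> and <p> wrapper tags, the numeric "level" values, the render_profanity function name, and the choice to keep the first letter and dash out the rest.

```python
import xml.etree.ElementTree as ET

# Hypothetical story markup; <story> and <p> are placeholder tag names.
STORY = (
    '<story><p>The band played a '
    '<profanity level="2">damn</profanity> good set.</p></story>'
)

def render_profanity(xml_text: str, max_level: int) -> str:
    """Flatten a story to plain text, dashing out any profanity whose
    level is higher than max_level (assumes levels are integers)."""
    root = ET.fromstring(xml_text)
    parts = []

    def walk(elem):
        text = elem.text or ""
        if elem.tag == "profanity" and int(elem.get("level", "1")) > max_level:
            # Keep the first letter and replace the rest with dashes.
            text = text[:1] + "-" * (len(text) - 1)
        parts.append(text)
        for child in elem:
            walk(child)
            # Text that follows the child's closing tag.
            parts.append(child.tail or "")

    walk(root)
    return "".join(parts)

family_version = render_profanity(STORY, max_level=0)
college_version = render_profanity(STORY, max_level=3)
```

In this sketch a site like LJWorld.com would render with a low max_level and Lawrence.com with a high one, both from the same stored story. A per-reader preference could be handled the same way, by passing the value from the reader's profile as max_level.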

<date real="YYYY-MM-DD">

News stories that deal with current events (and that's most of them!) often use date-specific words such as "today," along with weekday names that assume the reader can determine whether the writer is referring to the past or the future. For example, the word "Wednesday" in the sentence "President Bush will be in Chicago Wednesday" refers to the next Wednesday after the date of publication.

This sort of practice was OK in print newspapers, because people generally read newspapers the day they're published. But on the Web, articles live for longer than a day, and it's almost certain an article will still be read days after its publication, via search-engine traffic or archive browsing. So, frankly, vague words such as "today" and "Friday" don't cut it anymore.

The solution? A <date> tag that journalists could wrap around vague date words. Using that, publishing systems could output appropriate date text, depending on the day the article is being read -- perhaps even taking into account the reader's time zone!

Note: Credit for this idea goes to Nathan Ashby-Kuhlman.
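The <date> idea could be sketched along these lines in Python. The wrapper tags, the function names, and the specific wording rules (today, tomorrow, yesterday, keep the weekday for the coming week, otherwise spell out the date) are all assumptions for the sake of the example, not a prescribed behavior. The sketch also ignores time zones: the caller is assumed to pass the reading date already converted to the reader's local date.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical story markup; <story> and <p> are placeholder tag names.
STORY = (
    '<story><p>President Bush will be in Chicago '
    '<date real="2006-05-17">Wednesday</date>.</p></story>'
)

def date_text(original: str, real: date, read_on: date) -> str:
    """Pick wording for a date based on the day the story is read."""
    delta = (real - read_on).days
    if delta == 0:
        return "today"
    if delta == 1:
        return "tomorrow"
    if delta == -1:
        return "yesterday"
    if 2 <= delta <= 6:
        # Within the coming week the writer's weekday is unambiguous.
        return original
    # Otherwise fall back to the full date (month name follows the locale).
    return f"{real:%B} {real.day}, {real.year}"

def render_dates(xml_text: str, read_on: date) -> str:
    """Flatten a story to plain text, rewriting each <date> element."""
    root = ET.fromstring(xml_text)
    parts = []

    def walk(elem):
        text = elem.text or ""
        if elem.tag == "date" and elem.get("real"):
            real = date.fromisoformat(elem.get("real"))
            text = date_text(text, real, read_on)
        parts.append(text)
        for child in elem:
            walk(child)
            parts.append(child.tail or "")

    walk(root)
    return "".join(parts)

same_day = render_dates(STORY, date(2006, 5, 17))
archive = render_dates(STORY, date(2006, 6, 1))
```

Read on the day in question, the sentence says "today"; read from the archive weeks later, it carries the full date instead of a bare "Wednesday".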
