Dynamic News Stories

May 17, 2006

Adrian Holovaty

I like structured data. My favorite projects tend to be those that deal with, and exploit, structured information: events, restaurants, crime, and political information.

But one thing that's always bothered me is that the bread-and-butter of my chosen field, journalism, is relentlessly unstructured. The primary product of journalists -- the news story -- is just a giant blob of text.

A news story cannot be broken down into easily defined, consistent pieces. It doesn't have facts in predictable places. (Yes, a computer can split text into sentences and words, but those carry little meaning as discrete bits.) Sets of stories, even when published in the same publication and written by the same reporter, are not consistent in any machine-readable fashion. I cannot easily tell a computer, "Give me the most important bits of information in this news story." (Not to mention some stories contain more fluff than solid information!)

Sure, news stories can have metadata -- headlines, publication dates, bylines, categorization -- but the essence of a news story, its text, isn't structured in itself.

This lack of structure certainly makes sense. A news story is intended to be "consumed" by humans, not computers. Indeed, one could argue that's what makes a good news story great: it's literature -- more art than science. Well-written prose is unpredictable, and predictable prose is boring. Why should art be machine-readable?

I've pondered this question, and I think it's possible to compromise. We can indeed introduce some structured value to news stories while retaining the "looseness" of arbitrary prose.

Each idea in the following list adds a level of automation that solves a specific problem and makes news stories more dynamic. Each idea is implemented via an XML tag, assuming a story is stored as XML. (I don't have any particular XML language in mind; I'm simply introducing these concepts and giving them generic XML tag names.)

<profanity level="X">

When I worked at two local sites, we had an interesting dilemma: we shared stories across both sites, but one (a local entertainment site with an audience of mostly college students) has quite a different tolerance for profanity than the other, more conservative one (the "traditional" local newspaper site). The entertainment site's writers aren't afraid to include naughty words in their stories, whereas the newspaper site needs to remain a "family newspaper."

We would solve this dilemma by publishing two stories in the database: one with the profanity, one without. It got the job done, but it was redundant and inefficient.

So, the solution I'm proposing is simple: introduce a <profanity> tag, which you'd wrap around all words that could be considered profane. Give it a "level" attribute, in which you specify the severity of the word/phrase, and tweak your news site's content-management system to either dash out the word or proudly display it.

This could scale to the individual-user level, too. Let each site reader specify in a site profile whether he or she would prefer profanity to be blocked.
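
As a rough sketch of how a content-management system might act on the <profanity> tag, the Python below dashes out any tagged word whose level exceeds a per-site (or per-reader) threshold. The integer scale for "level" (higher means more severe), the <story> root element, and the function name are assumptions made for illustration; the sketch also handles only tags that sit directly inside the story element.

```python
import xml.etree.ElementTree as ET


def render_profanity(story_xml: str, max_level: int) -> str:
    """Return the story text, dashing out <profanity> words above max_level."""
    root = ET.fromstring(story_xml)
    parts = [root.text or ""]
    for child in root:
        word = child.text or ""
        # Assumption: "level" is an integer and higher means more severe.
        if child.tag == "profanity" and int(child.get("level", "1")) > max_level:
            # Keep the first letter and dash out the rest, e.g. "d---".
            word = word[:1] + "-" * (len(word) - 1)
        parts.append(word)
        parts.append(child.tail or "")
    return "".join(parts)


story = '<story>That was a <profanity level="2">damn</profanity> good show.</story>'

print(render_profanity(story, max_level=3))  # That was a damn good show.
print(render_profanity(story, max_level=0))  # That was a d--- good show.
```

The same function could take its threshold from a reader's profile instead of a site-wide setting, so one stored story serves both audiences.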

<date real="YYYY-MM-DD">

News stories that deal with current events (and that's most of them!) often use date-specific words such as "today," along with weekday names that assume the reader can determine whether the writer is referring to the past or the future. For example, the word "Wednesday" in the sentence "President Bush will be in Chicago Wednesday" refers to the next Wednesday after the date of publication.

This sort of practice was OK in print newspapers, because people generally read newspapers the day they're published. But on the Web, articles live for longer than a day, and it's almost certain an article will still be read days after its publication, via search-engine traffic or archive browsing. So, frankly, vague words such as "today" and "Friday" don't cut it anymore.

The solution? A <date> tag that journalists could wrap around vague date words. Using that, publishing systems could output appropriate date text, depending on the day the article was being read -- perhaps even taking into account the reader's time zone!

Note: Credit for this idea goes to Nathan Ashby-Kuhlman.
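
Here is one way a publishing system might use the <date> tag's "real" attribute, sketched in Python. It swaps the writer's vague word for "yesterday," "today," "tomorrow," or an explicit date, depending on the day the article is read. The <story> root element, the function names, and the exact output wording are assumptions; time zones are ignored.

```python
import xml.etree.ElementTree as ET
from datetime import date


def describe(real: date, reading_day: date) -> str:
    """Describe `real` relative to the day the article is being read."""
    delta = (real - reading_day).days
    relative = {-1: "yesterday", 0: "today", 1: "tomorrow"}
    if delta in relative:
        return relative[delta]
    # Too far away for a relative word, so spell the date out.
    return f"on {real.strftime('%B')} {real.day}, {real.year}"


def render_dates(story_xml: str, reading_day: date) -> str:
    """Replace the text of each <date> tag with reader-relative wording."""
    root = ET.fromstring(story_xml)
    parts = [root.text or ""]
    for child in root:
        text = child.text or ""
        if child.tag == "date":
            text = describe(date.fromisoformat(child.get("real")), reading_day)
        parts.append(text)
        parts.append(child.tail or "")
    return "".join(parts)


story = ('<story>President Bush will be in Chicago '
         '<date real="2006-05-17">Wednesday</date>.</story>')

print(render_dates(story, date(2006, 5, 16)))
# President Bush will be in Chicago tomorrow.
print(render_dates(story, date(2006, 6, 1)))
# President Bush will be in Chicago on May 17, 2006.
```

Note that the second output still says "will be" for an event in the past. Fixing the date alone does not fix verb tense, which is a limit of this approach.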

<time gmt="HH:MM:SS">

Speaking of time zones, news organizations should help readers in other time zones by clarifying time differences.

A Des Moines Register article may state that the mayor is planning a 2 p.m. press conference, but the reporter likely won't go out of her way to explain that it's 2 p.m. Central Time; the time zone is obvious to Des Moines readers. But website visitors from other parts of the world have to stop and think: "What time zone is Des Moines in? And how many hours away is that? What does 2 p.m. Des Moines time mean for me?" It's not optimized for global use.

The solution? A <time> tag that journalists could wrap around times. It would contain some normalized representation of the time: perhaps the time in GMT. Using that, publishing systems could output an explicit time zone declaration for out-of-town users, or perhaps even a time-zone converter next to each time.
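
A minimal Python sketch of the <time> idea follows. It appends the reader's local equivalent after the time the reporter wrote. Because the tag as proposed carries only a time and no date, the sketch cannot work out daylight saving on its own, so it assumes the caller supplies the reader's UTC offset and a label for it. The <story> root element and function names are also assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone


def local_time(gmt: str, offset_hours: float, label: str) -> str:
    """Convert an HH:MM:SS GMT time to a short local string such as '8 p.m. BST'."""
    utc = datetime.strptime(gmt, "%H:%M:%S").replace(tzinfo=timezone.utc)
    local = utc.astimezone(timezone(timedelta(hours=offset_hours)))
    hour = local.hour % 12 or 12  # 0 and 12 both display as 12
    suffix = "a.m." if local.hour < 12 else "p.m."
    minutes = "" if local.minute == 0 else f":{local.minute:02d}"
    return f"{hour}{minutes} {suffix} {label}"


def render_times(story_xml: str, offset_hours: float, label: str) -> str:
    """Append the reader's local time after each <time> tag's original text."""
    root = ET.fromstring(story_xml)
    parts = [root.text or ""]
    for child in root:
        text = child.text or ""
        if child.tag == "time":
            text += f" ({local_time(child.get('gmt'), offset_hours, label)})"
        parts.append(text)
        parts.append(child.tail or "")
    return "".join(parts)


# 2 p.m. Central Daylight Time is 19:00 GMT.
story = ('<story>The mayor plans a <time gmt="19:00:00">2 p.m.</time> '
         'press conference.</story>')

print(render_times(story, 1, "BST"))
# The mayor plans a 2 p.m. (8 p.m. BST) press conference.
```

The sketch does not say when the converted time falls on the next or previous day, which a real implementation would need to show.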

<expire when="YYYY-MM-DD">

Some bits of news stories are only relevant for a certain amount of time. After that, they lose all value.

For example, it's kind of trendy for newspapers and their websites to cross-promote. A print newspaper might say, "For more on this story, check our website," and a website might say, "For more on this breaking-news story, see tomorrow's newspaper." In the Web-posted article, that latter text becomes useless the instant the next day's newspaper has been published. So, it'd be nice if that little teaser disappeared on a specified day and time.

Enter the <expire> tag, which editors would wrap around sentences whose value eventually expires.

This is a tricky one, though, because journalists have conflicting goals: displaying information that is currently accurate and information that will be historically accurate. One important function of a news article is to provide a historical record, so that in ten years a researcher can return to a news story and expect that its content has remained the same. But, at the same time, an important function of a web page is to display information that is up-to-date. It's unsettling, and it just feels messy, to read a message such as "See tomorrow's newspaper for more" in an article published a month ago.
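
One possible way to serve both goals is sketched below in Python: expired text is hidden by default but stays in the stored XML, and an archival flag shows the story exactly as published. The flag, the <story> root element, and the function name are assumptions for illustration, and the sketch compares dates only, not times of day.

```python
import xml.etree.ElementTree as ET
from datetime import date


def render_expiring(story_xml: str, reading_day: date, archival: bool = False) -> str:
    """Drop <expire> text once its date has arrived, unless archival is set."""
    root = ET.fromstring(story_xml)
    parts = [root.text or ""]
    for child in root:
        expired = (child.tag == "expire"
                   and reading_day >= date.fromisoformat(child.get("when")))
        # The stored XML is never changed; only the rendered view differs.
        if archival or not expired:
            parts.append(child.text or "")
        parts.append(child.tail or "")
    return "".join(parts)


story = ("<story>The council voted 5-2."
         "<expire when='2006-05-18'> See tomorrow's newspaper for more.</expire>"
         "</story>")

print(render_expiring(story, date(2006, 5, 17)))
# The council voted 5-2. See tomorrow's newspaper for more.
print(render_expiring(story, date(2006, 6, 17)))
# The council voted 5-2.
print(render_expiring(story, date(2006, 6, 17), archival=True))
# The council voted 5-2. See tomorrow's newspaper for more.
```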

<currency date="YYYY-MM-DD" units="USD">

Similarly, a monetary amount is a specific type of information whose value changes over time, mainly because of inflation. A <currency> tag would let journalists signify that a certain monetary value is tied to a specific date, and this markup could trigger automated, on-the-fly adjustments for inflation when the news story is output on a web page, based on the date the article is viewed.
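
The adjustment could be sketched in Python roughly as follows: scale the amount by the ratio of a price index in the viewing year to the index in the year the amount was written. The index numbers below are made-up placeholders, not real inflation data, and the yearly granularity, the <story> root element, the dollar-only parsing, and the function name are all assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical price index by year. These are NOT real inflation figures.
PRICE_INDEX = {2006: 100.0, 2016: 120.0}


def render_currency(story_xml: str, view_year: int, index: dict) -> str:
    """Append an inflation-adjusted figure after each USD <currency> tag."""
    root = ET.fromstring(story_xml)
    parts = [root.text or ""]
    for child in root:
        text = child.text or ""
        if child.tag == "currency" and child.get("units") == "USD":
            amount = float(text.replace("$", "").replace(",", ""))
            base_year = date.fromisoformat(child.get("date")).year
            if view_year != base_year:
                # Scale by the ratio of the two index values.
                adjusted = amount * index[view_year] / index[base_year]
                text += f" (about ${adjusted:,.0f} in {view_year} dollars)"
        parts.append(text)
        parts.append(child.tail or "")
    return "".join(parts)


story = ('<story>The city spent '
         '<currency date="2006-05-17" units="USD">$5,000</currency> '
         'on the study.</story>')

print(render_currency(story, 2016, PRICE_INDEX))
# The city spent $5,000 (about $6,000 in 2016 dollars) on the study.
```

Keeping the original figure and adding the adjusted one in parentheses leaves the historical record intact while still giving the reader a current value.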

<city>

One particularly old-fashioned aspect of journalism style is the special-case treatment of certain cities. For example, according to the Associated Press stylebook, when a news story references "Chicago," it should not mention the state (Illinois), because the AP has deemed Chicago well-known enough that it has special status. The AP maintains a list of cities for which editors should leave off the state; any other city should be published with its state.

In addition, local newspapers have custom lists of nearby cities for which naming the state would be redundant, or even condescending! For example, at the Lawrence, Kansas newspaper, articles reference the nearby town of Tonganoxie without explicitly saying "Tonganoxie, Kan."

This sort of special-casing, which typically is verified by copy editors when they edit stories, is inefficient and doesn't scale to the worldwide readership of the Web.

So journalists should start using <city> tags, which would specify the full name and state of the cities in their stories. With that, the website's content-management system could specify some easy business logic defining which cities should be spelled out with states, and which cities could be published "stateless."
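
The "easy business logic" might look something like this Python sketch. Each site keeps a set of cities it publishes without a state, and the renderer adds the state to every other tagged city. The "name" and "state" attributes, the <story> root element, and the function name are hypothetical, and the one-city AP set is a stand-in for the real list.

```python
import xml.etree.ElementTree as ET

# Stand-in for the AP's list; only Chicago is taken from the example above.
AP_STATELESS = {"Chicago"}
# A local site's own additions.
LOCAL_STATELESS = {"Tonganoxie"}


def render_cities(story_xml: str, stateless: set) -> str:
    """Write each <city> with its state unless the city is in `stateless`."""
    root = ET.fromstring(story_xml)
    parts = [root.text or ""]
    for child in root:
        text = child.text or ""
        if child.tag == "city":
            name = child.get("name")
            text = name if name in stateless else f"{name}, {child.get('state')}"
        parts.append(text)
        parts.append(child.tail or "")
    return "".join(parts)


story = ('<story>The tour stops in '
         '<city name="Chicago" state="Ill.">Chicago</city>, then in '
         '<city name="Tonganoxie" state="Kan.">Tonganoxie</city>, '
         'this week.</story>')

# Local readers: nearby towns need no state.
print(render_cities(story, AP_STATELESS | LOCAL_STATELESS))
# The tour stops in Chicago, then in Tonganoxie, this week.

# Everyone else: only the AP's well-known cities go stateless.
print(render_cities(story, AP_STATELESS))
# The tour stops in Chicago, then in Tonganoxie, Kan., this week.
```

The copy editor's judgment moves out of each story and into one set per site, which could in principle also vary by reader location.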

More ideas

I've only scratched the surface here. A few more ideas:

Automatic conversion: How about automatic conversions for units of temperature (Fahrenheit to/from Celsius), weight (pounds to/from kilograms) and distance (miles to/from kilometers)?

Isolating people and quotes: How about marking up each quote, and associating it with the person who said it, so it would be possible to automatically retrieve all quotes by a given person, and all articles in which a given person was quoted? For more, see "Tagging quotes in a news story."

Isolating individual facts: This is a pipe dream, but how about giving each and every fact a unique ID, and doing things like <fact id="26" assumes-fact-id="27">? This would let journalists and readers create elaborate "fact trees," which could display the relationship between information. For more, see "Microformats could describe online news intelligently."
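
To make the quote idea from the list above a little more concrete, here is a short Python sketch that builds an index from speaker to quotes across a set of stories. The <quote> tag, its "speaker" attribute, the story IDs, and the people quoted are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_quotes(stories: dict) -> dict:
    """Map each speaker to a list of (story_id, quote_text) pairs."""
    index = defaultdict(list)
    for story_id, story_xml in stories.items():
        # iter() finds <quote> tags at any depth in the story.
        for quote in ET.fromstring(story_xml).iter("quote"):
            text = "".join(quote.itertext())
            index[quote.get("speaker")].append((story_id, text))
    return dict(index)


# Invented stories and speakers, for illustration only.
stories = {
    "story-1": ('<story><quote speaker="Jane Doe">We will rebuild the bridge.'
                '</quote> Doe said.</story>'),
    "story-2": ('<story>Doe added: <quote speaker="Jane Doe">It opens in June.'
                '</quote> <quote speaker="John Roe">I doubt it.</quote></story>'),
}

quotes = index_quotes(stories)
print(quotes["Jane Doe"])
# [('story-1', 'We will rebuild the bridge.'), ('story-2', 'It opens in June.')]

# All articles in which a given person was quoted:
print(sorted({story_id for story_id, _ in quotes["John Roe"]}))
# ['story-2']
```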

So much of a traditional news article fundamentally assumes the story is intended for a person in the same town, on the same day, with the same cultural background. But the Web allows anyone to read news stories worldwide, and days or weeks after the fact, so journalists should start taking advantage of automation and smart markup to make news stories more valuable sources of information.