Online Magazines with Apache Cocoon

April 16, 2003

Steve Punte

To demonstrate what I call XML-directed solutions, in this article I will discuss how to use Apache Cocoon to create an online magazine. XML-directed solutions are those where XML, rather than a programming language, is used to control the application. If you are considering entering the world of online publications or are thinking about upgrading your existing technology, consider how elegantly Apache Cocoon provides a publishing framework.

Overview

You are reading this article via the web magazine XML.com. Ever wonder what goes on behind the scenes? It can be quite simple. Web publications are content presentation services, which differ from interactive applications like a stock trading service. Publication services typically hold very little state information, while a stock trading service contains significant mutable information: customer data, equity trades, transactions, etc.

Apache Cocoon Notation

This article uses the following Apache Cocoon Graphical Notation:

Cocoon Graphical Notation

In this article I examine a very simple and elegant two-layer solution for online publishing; it presents articles stored in a local repository or directly uses feeds from other online magazines and news services. It turns out that the management of articles and of news stories is very similar, and much of this type of content is converging on the use of RSS. Thus, an appropriate architectural tactic is to divide the problem into two parts: the article repository layer and the presentation layer. Figure 1 represents a top-level perspective.

Figure 1: Top Level Design

A key architectural feature of this solution is that no application-specific Java or other procedural software is required. All necessary functionality and operation is achieved using existing off-the-shelf Apache Cocoon components, supplying them with appropriate XML configuration information. Such solutions are "XML directed architectures" and are expected to play an increasingly dominant role thanks to the software component interoperability that XML provides.

Design and Implementation

RSS

In a nutshell, RSS is an XML vocabulary for describing content such as the headlines of a news site or the latest articles of an online magazine.

RSS is one of those standards that fights like hell not to be standard. To begin with, there is no agreement on what the acronym RSS stands for. To make matters worse, the dominant versions of RSS are incompatible. Mark Pilgrim's article, "What Is RSS?", is a good place to get caught up with RSS.

The presentation layer assumes the news feed is in one of the dominant RSS formats, converting it into the RSS 1.0 format for uniformity. RSS only specifies the delivery of the content headlines, not the body of the story or article.
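
For readers unfamiliar with the format, a minimal RSS 1.0 document looks roughly like the sketch below. The namespaces and element names follow the RSS 1.0 specification; the channel title, item title, and all example.org URLs are made-up placeholders and are not taken from the demonstration software. Note that each item carries only a title, link, and short description, not the article body.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal RSS 1.0 feed; every URL and title here is a placeholder -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/rss-feed.rss">
    <title>Example Magazine</title>
    <link>http://example.org/</link>
    <description>Latest articles</description>
    <!-- Table of contents: one rdf:li per item below -->
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://example.org/articles/1/"/>
      </rdf:Seq>
    </items>
  </channel>
  <!-- Headline entry: summary only, the body lives elsewhere -->
  <item rdf:about="http://example.org/articles/1/">
    <title>First Article</title>
    <link>http://example.org/articles/1/</link>
    <description>A one-line summary of the article.</description>
  </item>
</rdf:RDF>
```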

Apache Cocoon Architecture

Solutions realized by the Apache Cocoon framework are constructed by way of "pipelines" (see Getting Started with Cocoon for an introductory tutorial). In a nutshell, each pipeline is a sequence of XML processing steps beginning with a "generator" (represented in Figure 2 below as a pentagon-shaped block), followed by any number of "transformers" (triangle-shaped blocks), and finally terminated by a "serializer" (hexagon-shaped block).

Two standard Cocoon components comprise nearly all of this particular solution. They are, first, the URI Generator which simply retrieves XML content given any URI; and, second, the XSL Transformer, which can be configured to utilize any number of XSLT engines (by default it uses Apache Xalan). Apache Cocoon offers a wide variety of standard components which can be further examined in the Apache Cocoon User Docs.
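
As a rough illustration of the generator, transformer, and serializer sequence, here is a minimal sitemap pipeline sketch. The match pattern, source document, and stylesheet path are hypothetical names chosen for this example; only the map: elements and the sitemap namespace are standard Cocoon.

```xml
<!-- Minimal sketch: generator -> transformer -> serializer -->
<!-- Pattern, source file, and stylesheet names are hypothetical -->
<map:pipeline xmlns:map="http://apache.org/cocoon/sitemap/1.0">
  <map:match pattern="hello.html">
    <!-- Generator: read an XML document and emit SAX events -->
    <map:generate src="content/hello.xml"/>
    <!-- Transformer: apply an XSLT stylesheet to the event stream -->
    <map:transform type="xslt" src="stylesheets/page.xsl"/>
    <!-- Serializer: turn the events into an HTML character stream -->
    <map:serialize type="html"/>
  </map:match>
</map:pipeline>
```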

Architecture and Design

The entire architecture consists of four Cocoon pipelines as shown in Figure 2. Only two of them (the "/home" and "/article" pipelines) are intended for the end user.

Figure 2: Internal Pipeline Design

The "/home" pipeline and its associated URL space exist to display summaries of the top available articles. The Apache Cocoon sitemap pipeline construct is shown below. The first step of the pipeline is to retrieve the appropriate RSS document. This could come from the local RSS repository or from a well-known remote source, depending upon which magazine is selected (i.e. variable {1}). Notice that this solution uses the Apache Cocoon sitemap "One of N" switch functionality (<map:select>). This construct provides a simple mechanism to post-process a particular feed source in its own way. In the case of NewsForge, we convert its RSS-0.91 format into RSS-1.0 using a standard XSLT component configured with the stylesheet document rss-91.xsl. Finally, the feed is converted to HTML and the appropriate styling and graphics are added; see Figure 3 for the results.


<!-- HOME PAGE APACHE COCOON PIPELINE FRAGMENT -->
<!-- Use local or remote RSS feed to populate home page.  -->
<map:match pattern="home/*.html">

  <!-- Use second field on URI to determine RSS Source. -->
  <!-- These values are hardcoded here and in common.xsl -->
  <map:select type="parameter">
    <map:parameter name="parameter-selector-test" value="{1}"/>

    <!--  Obtain on-line from O'Reilly Net.  -->
    <!--  Note: ampersands in the URL must be escaped as &amp; -->
    <map:when test="oreillynet">
      <map:generate src="http://www.oreillynet.com/meerkat/?_fl=rss10&amp;t=ALL&amp;c=47"/>
    </map:when>

    <!--  Obtain headlines from this local file inside application.  -->
    <map:when test="local">
      <map:generate src="http://localhost:8080/cocoon-mag/rss-feed.rss"/>
    </map:when>

    <!--  Obtain on-line from NewsForge.  -->
    <!--  Note: Format is in RSS-0.91 -->
    <map:when test="newsforge">
      <map:generate src="http://www.newsforge.com/newsforge.rss"/>
      <map:transform type="xslt" src="rss-91.xsl"/>
    </map:when>

  </map:select>

  <!-- Presentation Layer: Convert RSS-1.0 to our HTML -->
  <map:transform type="xslt2" src="home.xsl">
    <map:parameter name="global-source" value="{1}"/>
  </map:transform>

  <!-- Send off as HTML character stream -->
  <map:serialize type="html"/>

</map:match>

Figure 3: Top Level Magazine Home Page
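
To give a feel for the kind of conversion performed on the NewsForge feed, here is a minimal XSLT 1.0 sketch that maps an RSS 0.91 document onto RSS 1.0. This is my own illustrative reconstruction under the assumption that the input has the usual rss/channel/item layout; it is not the rss-91.xsl shipped with the demonstration software, which may handle more elements.

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch of an RSS 0.91 to RSS 1.0 conversion;
     not the rss-91.xsl from the demonstration software -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/">

  <xsl:output method="xml" indent="yes"/>

  <!-- RSS 0.91 root element: rebuild the feed as rdf:RDF -->
  <xsl:template match="/rss">
    <rdf:RDF>
      <channel rdf:about="{channel/link}">
        <title><xsl:value-of select="channel/title"/></title>
        <link><xsl:value-of select="channel/link"/></link>
        <description><xsl:value-of select="channel/description"/></description>
        <!-- RSS 1.0 requires a table of contents listing each item -->
        <items>
          <rdf:Seq>
            <xsl:for-each select="channel/item">
              <rdf:li rdf:resource="{link}"/>
            </xsl:for-each>
          </rdf:Seq>
        </items>
      </channel>
      <!-- In RSS 1.0 the items are siblings of the channel -->
      <xsl:apply-templates select="channel/item"/>
    </rdf:RDF>
  </xsl:template>

  <!-- Each 0.91 item becomes a 1.0 item identified by its link -->
  <xsl:template match="item">
    <item rdf:about="{link}">
      <title><xsl:value-of select="title"/></title>
      <link><xsl:value-of select="link"/></link>
      <description><xsl:value-of select="description"/></description>
    </item>
  </xsl:template>

</xsl:stylesheet>
```

The main structural difference the sketch handles is that RSS 0.91 nests items inside the channel, while RSS 1.0 lists them beside it and references them from an rdf:Seq.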

The second user pipeline is the "article pipeline" shown below. The URL intercepted and processed by this pipeline is rather lengthy and has embedded in it the actual source location (local or remote) of the article (i.e. the ** construct). The article is retrieved as HTML; then optional custom filtering (see the article.xsl file in the source code) may be applied to remove undesired portions; finally, the presentation is applied. The result of publishing an article from NewsForge in our exemplar magazine is shown in Figure 4 (note that the URL in the address bar has the NewsForge location embedded in it).


<!-- ARTICLE PAGE APACHE COCOON PIPELINE FRAGMENT -->
<!-- Retrieve an article, even from a remote feed, and wrap it
     with our magazine.  -->
<map:match pattern="article/*/**">

  <!-- Retrieve article from (possibly remote) source -->
  <map:generate type="html" src="http://{2}?">
    <map:parameter name="copy-parameters" value="true"/>
  </map:generate>

  <!-- Format into HTML -->
  <map:transform type="xslt" src="article.xsl">
    <map:parameter name="global-source" value="{1}"/>
    <map:parameter name="global-path" value="{2}"/>
  </map:transform>

  <map:serialize type="html"/>

</map:match>

Figure 4: Article imported from NewsForge
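
The sketch below suggests how a stylesheet in the role of article.xsl might receive the two sitemap parameters, wrap the fetched page, and filter out unwanted markup. It is a hypothetical illustration, not the author's actual article.xsl: the banner and footer markup, the class names, and the choice to strip script elements are all assumptions. Elements are matched by local-name() because I am assuming, without having checked the demonstration setup, that the HTML generator may deliver its output in the XHTML namespace.

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch only; see article.xsl in the demonstration
     software for the actual stylesheet -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Values supplied by the map:parameter elements in the sitemap -->
  <xsl:param name="global-source"/>
  <xsl:param name="global-path"/>

  <!-- Wrap the fetched article in the magazine's own page -->
  <xsl:template match="/">
    <html>
      <head>
        <title>
          <xsl:value-of select="(//*[local-name()='title'])[1]"/>
        </title>
      </head>
      <body>
        <!-- Banner markup and class names are assumptions -->
        <div class="banner">
          Source feed: <xsl:value-of select="$global-source"/>
        </div>
        <!-- Copy the content of the original body element -->
        <xsl:apply-templates select="//*[local-name()='body']/node()"/>
        <div class="origin">
          Original location: http://<xsl:value-of select="$global-path"/>
        </div>
      </body>
    </html>
  </xsl:template>

  <!-- Identity copy for everything inside the body -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Example of custom filtering: drop script elements -->
  <xsl:template match="*[local-name()='script']"/>

</xsl:stylesheet>
```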

The Local Sources

To achieve uniformity and simplicity, the local magazine content is made available as two web services: a local RSS feed at URL location "/rss-feed.rss" and the article feed at "/article-feed/*/body.html". Both services are trivial two-component Cocoon pipelines. See the demonstration software for additional details.
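
A two-component pipeline of this kind needs only a generator and a serializer. The fragment below is a sketch of what the two local feeds could look like. The match patterns come from the URLs above, but the source paths are assumptions: local.xml and the articles/ directory are named in the installation notes later in this article, while the body.html file name inside each article directory is my guess.

```xml
<!-- Hypothetical sketch of the two local feed pipelines -->

<!-- Local RSS feed: read the headline file and send it out as XML -->
<map:match pattern="rss-feed.rss">
  <map:generate src="local.xml"/>
  <map:serialize type="xml"/>
</map:match>

<!-- Local article feed: {1} is the article ID taken from the URL -->
<!-- The body.html file name is an assumption -->
<map:match pattern="article-feed/*/body.html">
  <map:generate type="html" src="articles/{1}/body.html"/>
  <map:serialize type="html"/>
</map:match>
```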

Distinguishing Characteristics of this Solution

Component Reuse

A repeated theme in this and previous articles is the use of the XML-directed architecture philosophy. The entire solution is achieved by way of reusable components directed by XML documents: in this case three XSL stylesheets and the sitemap file. No Java or any other type of custom procedural software was written. Granted, this is a very simple design, and a more feature-rich magazine would possibly require such procedural code. Nonetheless, the trend seems to be that more and more solutions are taking on this reuse paradigm, achieving more functionality with less effort.

Simplicity

Again, the architectural goal is simplicity. Following this philosophy, a decision was made early on to not use a relational database. Instead all content is stored in the file system. The file system is probably the most under-appreciated subsystem of the modern OS. It is capable of nearly unlimited storage, fast retrieval, and efficient and automatic caching. The key concept is that no relational queries are needed in this application. Thus the use of a relational database or even an XML database adds no value.

Performance

While I have yet to measure performance, I am confident that this solution should hold its own against any other system. First, the file system is used as the primary means of persistence. File systems are typically very efficient and have been finely optimized over many years of evolution. Second, all key components in Apache Cocoon utilize the Jakarta Avalon framework and model for component pooling and reuse. Like file systems, this approach is highly efficient and optimized. Apache Cocoon allows and supports pooling configuration for every component in the pipeline. Third, Apache Cocoon also provides content caching. Each component in the chain can ask the quick question of the previous component: "do you have anything different than last time?" If not, a final component like a serializer can make the decision to simply reuse the last generated content and forgo nearly all pipeline processing. Fourth, a performance improvement can be achieved by embedding the local RSS and article feeds into the two user pipelines. This would eliminate an unnecessary conversion of documents between text and SAX events. Ideally the framework would be smart enough to do this automatically. Last, the XSLT transformers are the only building blocks that could be troublesome. XSLT technology is still fairly new and has shown sluggishness in the past. However, tremendous efforts are underway to improve performance. (See the article "Fast XSLT" for a detailed consideration of XSLT performance issues.)

In summary, the Apache Cocoon framework has provision for all major optimization tactics and allows them to be engaged and activated with simple configuration adjustments.
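
As an example of such a configuration adjustment, component pooling is typically set through attributes on the component declaration in the sitemap. The fragment below is a sketch based on the pool-max, pool-min, and pool-grow attributes as I recall them from the Cocoon 2.0-era default sitemap; the numbers are illustrative rather than measured, and you should check the attribute names against the sitemap that ships with your Cocoon version.

```xml
<!-- Illustrative pool sizes only; tune them for the actual load -->
<map:transformers default="xslt">
  <!-- pool-max: upper bound on pooled instances
       pool-min: instances kept ready
       pool-grow: how many to add when the pool runs dry -->
  <map:transformer name="xslt"
      src="org.apache.cocoon.transformation.TraxTransformer"
      pool-max="32" pool-min="8" pool-grow="2"/>
</map:transformers>
```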

Trying it Out

Installation

A J2EE war file (cocoon-mag.war) utilizing Apache Cocoon and implementing the "Generic Online Magazine" can be downloaded here. This software has been tested against Tomcat 4.1.12 and requires no other packages. Simply place the downloaded war file into the ~tomcat/webapps directory and direct a browser to the application's URL, typically http://localhost:8080/cocoon-mag.

Adding New Articles

You can add a new article by simply installing it in the directory space ~tomcat/webapps/cocoon-mag/articles/<id>/. By convention, the article ID is a unique numeric value. The second step is to add an RSS reference to the article in the file ~tomcat/webapps/cocoon-mag/local.xml. This will cause it to appear on the top page headlines.

Conclusion

While still in their infancy, component solutions directed by XML configurations are becoming viable and production-worthy ways of building web applications. Apache Cocoon excels in the territory of content presentation solutions and is making progress at addressing more interactive situations with Apache Struts-like additions. The entire application presented in this article is contained in one Cocoon sitemap file and a handful of XSLT templates. Together, these files define behavior and can be seen as an application layer on top of a generic, technology-agnostic XML framework. In my next article for XML.com, I will present a generalization of such a framework, which I call X2EE.