January 20, 1997
The Seybold Report on Publishing Systems
Vol. 26, No. 10
Direct Route from SGML to Pages
FOR LONG-TIME READERS of The Seybold Report, Penta is a very familiar name. For many years, the company was one of the top suppliers of systems to the typesetting trade. It was a leader in providing productive, professional tools for setting high-quality type efficiently. But newer readers may not be as aware of Penta, since the company has been keeping a low profile of late. It went through a period of financial and organizational trauma in the wake of the desktop publishing revolution. Penta informs us that these problems are all in the past and that it has had revenues of $10 million and been profitable over the last four years.
Following reorganization in 1992, Penta has resurfaced, seeking new opportunities. The company, which is now called Penta Software, still offers its traditional tools for typographic professionals, but there are a lot fewer of those professionals than there once were. So Penta is focusing on the SGML market, where it can sell systems both to its traditional customers (some of whom have become SGML composition service bureaus) and to the SGML user community (a larger and steadily growing market). Penta reports that its existing customers did $3 million of SGML work last year; this year, it should grow to around $10 million.
Penta faces a challenge in trying to make a place for itself in the SGML market. Two of its traditional rivals, Xyvision and Miles 33, made SGML a priority years ago and have established a presence in the market. Xyvision, in particular, has been successful in selling to major SGML users. In addition, Frame, Interleaf, ArborText, Advent and Datalogics are all active competitors in this market.
Despite the formidable competition and the head start that other firms have, we think Penta still has an opportunity to become an important player in SGML composition. The company's new SGMLPublisher software has some significant advantages over the competition for certain types of SGML work.
SGML composition: 3 problems
The philosophy of SGML dictates that an SGML-coded file should contain only structural information, not formatting. Formatting is done later, by applying typographic "styles" and page layout rules to the SGML structural tags. This philosophy, plus the long documents that characterize many SGML applications, makes SGML composition a natural fit for automatic pagination software. In many in-house applications (for example, in composing documentation for heavy equipment or for electronic components) all the work goes into setting up the pagination parameters and tag-to-style conversions ahead of time. Actual live pagination can then occur in a completely automated fashion when editorial work is complete.
There have been some impressive success stories in SGML pagination. But getting there isn't necessarily easy, and even the successes have their limitations. It is useful to examine some of the problems, since they provide a context for understanding what Penta has done differently from other vendors.
Same tag, different meanings. The same SGML tag can have several different typographic meanings. When you are applying typography to SGML tags, context matters. Take, for example, <p>, the tag for a normal paragraph. It could represent a different face, size or style depending on whether it occurs in the preface of a book, in the book's main body, or in an appendix. This reuse of tags in different contexts is a valuable and widely used SGML feature. It does not present a problem for SGML, but it often does for composition software, which is designed to apply one set of parameters (a style) to a given tag, regardless of where it appears.
The practical result is that a conversion step is usually necessary to generate a unique tag for each different style. In the example above, each occurrence of the <p> tag might be converted into <p1>, <p2> or <p3>, depending on context. The composition software could then associate a single typographic style with each.
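The conversion step just described can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's actual converter: the sample document, the element names (book, preface, body, appendix) and the mapping table are all assumptions, and the sample uses fully closed tags so that Python's standard XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <p> appears in three different contexts.
SAMPLE = (
    "<book>"
    "<preface><p>About this book.</p></preface>"
    "<body><p>Main text.</p><p>More text.</p></body>"
    "<appendix><p>Extra tables.</p></appendix>"
    "</book>"
)

# Assumed mapping from (parent tag, tag) to a unique style tag.
CONTEXT_MAP = {
    ("preface", "p"): "p1",
    ("body", "p"): "p2",
    ("appendix", "p"): "p3",
}


def convert(elem, parent_tag=None):
    """Rename tags in place so each context gets its own style tag."""
    original_tag = elem.tag
    elem.tag = CONTEXT_MAP.get((parent_tag, original_tag), original_tag)
    for child in elem:
        # Pass the parent's original name, not its renamed one.
        convert(child, original_tag)


root = ET.fromstring(SAMPLE)
convert(root)
converted = ET.tostring(root, encoding="unicode")
print(converted)
```

Once the tags are renamed, the file no longer conforms to the original DTD, because <p1>, <p2> and <p3> are not declared in it.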
One-way conversion. Converting files for composition may be a one-way process. Some things are lost in conversion. For example, attributes that have no typographic meaning (which could be codes for things like what the security level of a section is, or what the dimensions of an image are) may simply be stripped from the file. Some structural tags, like those that delimit the beginning and end of the front matter, may also be dropped.
This loss of data has no effect on the appearance of the formatted output, of course, but it has another consequence: It makes the conversion from SGML to the pagination environment irreversible. This may or may not be a disadvantage for a given user, depending on how pagination fits into the overall workflow.
Let's take a look at the consequences of adopting one-way conversion of SGML files into page files. In most publishing environments, SGML files will get reused—either as the basis for future editions of the same document, for spinoff documents, for electronic formats (online and CD-ROM) or for all of these. If any changes are made to the file, they should obviously be made to the SGML version, which serves as the source for all others. Otherwise, duplicate changes would have to be made in multiple files, with the attendant costs and possibility of error.
Therefore, with a one-way conversion process from SGML to pages, no changes should be made to the page files. If problems are discovered in pages, they should be addressed by going back to the SGML source, fixing that, and running pages again. This is in fact the process that is widely used in SGML publishing. Pagination files are throwaway files, intermediate steps on the way to printed pages or PDFs. This workflow places a premium on highly automatic pagination and, even more, on catching errors before pages are created.
There are, however, branches of publishing for which this approach presents problems of its own. It makes pages a kind of terminal product that cannot be worked on directly. In many kinds of book and journal publishing, a lot of editorial activity revolves around the page proof, and changes are bound to occur at that stage. While the corrections generated at the page stage can be incorporated back into an unpaged file, it is generally easier to work with the file in the pagination environment. But if there is no possibility of conversion back to SGML, working in the pagination environment is not an option.
DTDs proliferating rapidly. SGML provides the flexibility for users to create their own DTDs (Document Type Definitions). That allows them to create tags that have meaning in their specific environments. A dictionary publisher can have a tag <etym> to indicate that a piece of text is an etymology. A legal publisher can have a tag <statute> to indicate the name of a particular law. This is obviously a useful feature for SGML to have, but it does mean that there is no limit on the number of DTDs that a typesetting facility—especially one in a service-bureau environment—might encounter. (This is usually not a problem in industries that have adopted an industry-wide DTD, as the aircraft and electronics industries have.)
If you must frequently deal with new DTDs, setting up conversions becomes a major preoccupation, and the efficiencies of automatic pagination may be lost to the inefficiencies of the setup process.
Rewriting with SGML as internal format
SGMLPublisher represents a major rewrite of Penta's composition software. The easiest way to deal with SGML files is to convert SGML tags into composition coding; that is what most SGML composition systems do (and it is what Penta had initially done as well). But as Penta started to face up to the first two problems mentioned above (context dependence and irreversible conversions), the company recognized that it could take a different approach and accept SGML tags as if they were Penta's own style tags.
Taking this approach required a major rewrite of much of Penta's editing and composition code, but it resulted in a product with an important characteristic: It treats native SGML documents as its own internal format. That means no conversion is required to get SGML documents into SGMLPublisher, and only a minimal filtering step (using a filter supplied with the system) is required to restore the composed SGML files to their original state.
Relating styles to tags. Penta also recognized that special tools would be needed, in service-bureau environments, to deal with lots of new DTDs. (This is the third problem mentioned above.) One possible approach is to take each tag in the DTD and create a style for it. This approach seems relatively straightforward, but Penta ultimately rejected it, because of two practical problems.
First, many tags are never used. For example, the standard journal-article DTD, in the ISO 12083 standard, defines four tags for poems (<poem>, <stanza>, <poemline> and <cline>). In scientific journals, these are virtually never used. Yet, if the journal's publisher adopts the standard DTD, the poem tags will be in it. Since changing DTDs can be a nuisance, DTDs are generally created to deal with every contingency that the DTD writer can think of. From an SGML point of view, it doesn't matter if most of the DTD isn't used in practice. From the composition point of view, however, the unused tags mean that setting up a style for every DTD tag can be a waste of time.
The second problem has already been mentioned: Tags can have several contexts, and their typography could be different in each one. For example, the ISO 12083 article DTD allows articles to be subdivided various ways (into parts, chapters, sections and up to six levels of subsections). The poetry tags (or any others) could have different typographic associations in any of these subdivision levels. In principle, working just from the DTD, you might have to set up styles for each tag in each of these contexts, multiplying the wasted time.
What this shows is that the approach of setting up all the styles that could be required by a given DTD is often impractical. You must look at the actual document to be formatted (in SGML jargon, the "document instance") to see what tags are actually used and their contexts.
Three-step process. So Penta created a series of tools to analyze and assign typography to incoming SGML files. The style sheets are automatically extended to subsequent files that use the same DTD. If users have designated existing elements as "code templates," like-named elements in new contexts will automatically acquire their code. SGMLPublisher applies a three-step process to composing and paging SGML files.
First, the document is run through the Style Sheet Generator module, which produces a list of all the tags used in the document, in their hierarchical context. At the style sheet generation phase, each element's attribute names, along with their declared and default values, are merged into the style sheet. It is at the SGML processing phase that actual attribute values are dynamically assessed.
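To make the idea of listing tags "in their hierarchical context" concrete, here is a minimal sketch in Python. It is not Penta's Style Sheet Generator: the sample article and the function name collect_contexts are hypothetical, the sample is fully closed so the standard XML parser can read it, and the sketch reads attribute names only from the document instance, whereas the product also merges declared and default values from the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document instance; <title> occurs in two contexts.
SAMPLE = (
    '<article>'
    '<fm><title>On Moths</title><author>A. Smith</author></fm>'
    '<body>'
    '<sec><title>Methods</title><p>First.</p><p align="left">Second.</p></sec>'
    '</body>'
    '</article>'
)


def collect_contexts(elem, path=(), found=None):
    """Return {context path: set of attribute names seen} for tags in use."""
    if found is None:
        found = {}
    here = path + (elem.tag,)
    # Iterating a dict yields its keys, so this adds attribute names only.
    found.setdefault("/".join(here), set()).update(elem.attrib)
    for child in elem:
        collect_contexts(child, here, found)
    return found


contexts = collect_contexts(ET.fromstring(SAMPLE))
for context in sorted(contexts):
    print(context, sorted(contexts[context]))
```

Only tags that actually occur in the instance appear in the listing, and <title> shows up once for each of its two contexts.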
Next, the Style Sheet Editor is used to associate Penta's composition coding with the SGML tags. All the features of Penta's composition language can be used, including explicit typographic commands for things like font, size, style, leading, indent and the like; calls to predefined typographic formats; the generation of literal strings; and conditional processing of various kinds. The operator can assign a single set of typographic specifications to a tag, wherever it occurs (called "broadcast" typography), or different typography can be specified for each context.
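The difference between "broadcast" and context-specific typography can be illustrated with a small lookup routine. This is a sketch under stated assumptions: the style names, the table layout and the rule of preferring the longest matching context are ours for illustration, and the article does not describe how SGMLPublisher resolves styles internally.

```python
# Hypothetical style table. Keys are context paths; a bare tag name
# stands for "broadcast" typography that applies wherever the tag occurs.
STYLES = {
    "p": "body-text",                    # broadcast
    "preface/p": "preface-text",         # context-specific override
    "title": "head-a",                   # broadcast
    "article/body/sec/title": "head-b",  # context-specific override
}


def resolve_style(context, styles=STYLES):
    """Pick the most specific style whose key matches the end of the path."""
    parts = context.split("/")
    # Try the full path first, then drop leading ancestors one at a time.
    for start in range(len(parts)):
        key = "/".join(parts[start:])
        if key in styles:
            return styles[key]
    return None  # no style has been assigned yet


print(resolve_style("book/preface/p"))
print(resolve_style("book/body/p"))
print(resolve_style("article/body/sec/title"))
print(resolve_style("article/fm/abstract"))
```

A lookup that returns None marks a tag that still needs a style.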
The final step, using the SGML processing module, applies the composition coding to the file. The file that results from this step is the same as the original SGML file, except that it contains Penta codes as well. The user can switch freely among views that display or suppress the SGML code, and display or suppress the associated typographic information. The text and SGML coding can be edited, but typographic coding inserted from the style sheet must be edited in the style sheet. This enforces the separation of SGML markup from Penta-specific markup. However, SGML processing instructions that contain Penta coding (e.g., <?penta...>) may be inserted anywhere in the document without affecting document integrity. A file can be run through the composition process (which determines spacing, line breaking and hyphenation) and then through the batch pagination process (which breaks material into columns, places illustrations and footnotes, and creates running heads and feet).
At any time, a filtering utility can be run to strip out the Penta codes. If no editing has occurred, the resulting file is simply the original SGML file. This file can be parsed to make sure the SGML is still valid.
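The round trip can be illustrated with a short Python sketch. It assumes, for illustration only, that every embedded Penta code takes the form of a processing instruction beginning with <?penta; the codes shown inside the instructions (font, size) are invented, and this is not the filter Penta supplies.

```python
import re

# Assumed form of embedded codes: SGML processing instructions that
# start with "<?penta" and end at the next ">".
PENTA_PI = re.compile(r"<\?penta\b[^>]*>")


def strip_penta_codes(text):
    """Remove every <?penta ...> processing instruction from the text."""
    return PENTA_PI.sub("", text)


original = "<sec><title>Methods</title><p>First paragraph.</p></sec>"
# Hypothetical composed file: same SGML with invented Penta codes added.
composed = (
    "<sec><title><?penta font=bold size=12>Methods</title>"
    "<p><?penta font=roman size=10>First paragraph.</p></sec>"
)

restored = strip_penta_codes(composed)
print(restored == original)
```

Because the codes sit alongside the SGML markup instead of replacing it, removing them is enough to recover the source file, which can then be handed to a parser.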
Editing: to parse or not to parse. One of SGMLPublisher's limitations is that no parser is built into its editing environment. This means that the operator can edit the SGML coding in ways that violate the DTD. The best way of preventing this is simply to ask the operator not to touch the SGML. The Penta environment is fine for making text corrections (e.g., fixing typos that are discovered in the composed file). But for changing SGML tags and attributes, a standard SGML editor (like ArborText Adept or SoftQuad Author/Editor) should be used.
Files should be parsed before being composed, and they should be parsed after the composition stage is over, just to make sure no problems have crept in.
Penta bundles the public-domain SGMLS parser with SGMLPublisher, and all instances are validated as part of the style-sheet-generation process and again when they are brought into the SGMLPublisher environment. In addition, for users of Penta's EditMaster program, EditMaster launches SGMLS and returns error messages. However, we think having a parser built into the editing software would be a big advantage. It could provide the operator with a menu of valid tags for the editing context, and it could reject attempts to enter tags and attributes that aren't valid under the DTD. With such a parsing editor, the Penta operator could fix any problem without having to switch to a different software tool.
Pricing and future plans
SGMLPublisher is an add-on to Penta's DeskTopPro system. It runs on Sun Sparcstations and has recently been ported to the Data General Aviion platform. (In addition, there is a possibility that Penta will offer a version for Pentium computers, although the Sun environment is the focus of Penta's SGML effort.)
For existing Penta customers running Sun servers, SGMLPublisher costs $17,500 per CPU for the Sparc 5, or $22,500 per CPU for the Sparc Ultra.
For a user buying a complete composition and pagination system with SGMLPublisher, the software would cost about $64,000 for a stand-alone workstation, or about $97,500 for a server. The minimum hardware cost would be about $6,000.
SGMLPublisher is available now.
Future enhancements. Penta plans to continue enhancing SGMLPublisher. An important area of focus will be making the steps involved more automatic. Penta will support automation through a hot-folder facility. Files that are moved into the folder will proceed automatically through a variety of processes. For example, a folder could be set up to check incoming files for tags that have no associated style, or to run composition and pagination on files.
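One of the checks mentioned, flagging incoming files that use tags with no associated style, might look something like the following Python sketch. Everything here is hypothetical: the function name, the .sgm file extension and the sample files are ours, and a temporary directory stands in for the hot folder. The tag scan is a simple pattern match that ignores SGML's case-insensitive names and tag minimization.

```python
import re
import tempfile
from pathlib import Path

# Matches start tags only; end tags, declarations and processing
# instructions begin with "</", "<!" and "<?" and are skipped.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)")


def unstyled_tags(folder, styled, pattern="*.sgm"):
    """Map each file in the folder to the tags it uses that have no style."""
    report = {}
    for path in sorted(Path(folder).glob(pattern)):
        used = set(START_TAG.findall(path.read_text()))
        missing = used - set(styled)
        if missing:
            report[path.name] = sorted(missing)
    return report


# Demonstration with a temporary folder standing in for the hot folder.
with tempfile.TemporaryDirectory() as hot_folder:
    Path(hot_folder, "ok.sgm").write_text("<sec><p>Fine.</p></sec>")
    Path(hot_folder, "new.sgm").write_text(
        "<sec><p>See <etym>Latin</etym>.</p></sec>"
    )
    report = unstyled_tags(hot_folder, styled={"sec", "p"})

print(report)
```

Files that come back with an empty report could pass straight on to composition and pagination, while the others wait for an operator to assign the missing styles.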
SGML support will be expanded too. SGML has an "omit endtags" option (end-tag omission, part of the standard's OMITTAG tag-minimization feature), which is sometimes used to make files more readable (by humans). Penta doesn't currently support that feature; every element must have an explicit end tag. But support for it will be added.
Another area that will be addressed is support for external files that are included by reference in the file being processed. This is not in the current product but will be added. Penta also will include support for marked sections and the ability to apply coding to other vendors' processing instructions. For example, hard line endings added in an authoring or editing system could be designated for similar treatment within SGMLPublisher.
The Allen Press experience
One important early user of SGMLPublisher is Allen Press of Kansas. Allen Press specializes in producing scientific, technical and medical journals. It currently typesets, prints and distributes more than 300 journals in a wide variety of specialties, and it is in the process of adding the production of electronic formats (online, CD and Web). In most cases, Allen Press receives paper manuscripts and has the electronic files keyboarded by outside services.
Two years ago, customers began asking Allen Press to produce SGML files of their journals. Like many typesetting services getting started with SGML, the company decided to convert typesetting files to SGML after composition and pagination had been completed. They soon came up against the limitations of this approach. There are some things that ought to be tagged that are not typographically distinct. (For example, authors' names in a multiauthor journal article may need to be separated and tagged individually.) Sometimes there are not enough indications of document structure to distinguish elements from one another. The company found that, even with the help of very sophisticated software, only 95% of the tagging could be automated, and doing the last 5% by hand was very expensive.
After six frustrating months, Allen switched to keyboarding SGML up front. Penta had developed a utility, StyleTags, that allows the import of SGML into the Penta composition environment, with context-sensitive conversion of tags to typographic style codes. This approach proved better and cheaper, and since the files were being parsed ahead of time, many coding errors were corrected before they reached the Penta system. A very sophisticated operator was still needed to set up the StyleTags conversions, however, and this created a bottleneck, which meant Allen Press had to turn down some potential SGML business.
Allen Press was delighted when SGMLPublisher was developed because it's a much easier tool for its staff to use. The nested-tag display of tags occurring in a document instance provides a good visual impression of the document structure and makes it easier to figure out what typographic styles need to be created or applied. The software also displays attributes so that these can be taken into account in formatting, if appropriate.
Ted Freeman, electronic publishing coordinator at Allen Press, tells us that SGMLPublisher is allowing others to take on more of the day-to-day SGML composition duties, which allows him to spend more time on DTD writing and other high-level tasks. He is pleased with what Penta has done, and he does not see how competing products (he mentions Xyvision and Miles 33 specifically) could be used as efficiently in a multi-DTD environment such as Allen Press's.
AutoTab tackles a tall order
A particularly intriguing part of SGMLPublisher is the AutoTab module, developed to automate tabular functions within an SGML environment. We have seen many sophisticated tabular routines that automate nicely the composition of straightforward tables in which each cell contains single-line entries. We have also seen many programs that expand a cell's depth automatically as new text is entered, creating cells with multiline text. With AutoTab, Penta tries to do more. It attempts to automate fully the division of tabular data into multicolumn formats, even when some cells contain running text of variable numbers of lines.
AutoTab is a complex program that must weigh tradeoffs as it determines how many lines to give each cell's data. We haven't yet had a chance to test it rigorously, to see how well it holds up under fire and how far it goes in automating any table, so we have asked Penta for a chance to spend some time with it before we publish a review. We urge anyone considering purchasing SGMLPublisher to do the same. As we said, it looks like a very intriguing project.
Pluses. The features of SGMLPublisher that strike us as most important are these:
- SGML is preserved in the pagination environment. In SGMLPublisher, the SGML file is always preserved. The user can edit it directly, and it can still be used by other SGML applications. This makes SGMLPublisher particularly suited to environments where a mixture of SGML applications is used, in various sequences.
- Good tools are provided for style creation. It is relatively easy to figure out what styles are required, create them and associate them with the appropriate tags. New files from an established customer can be quickly checked to see if they require any styles that have not previously been set up.
- Typographic tools are excellent. Users have full access to Penta's typographic command set, which represents the results of many years of refinement in professional typesetting environments.
- Pages can be part of the main workflow, not a dead-end branch. With many batch pagination systems, it is not practical to update the files used to produce pages. Updates are made to the SGML source files, and then pages are created again. Since SGMLPublisher leaves SGML files intact, there is no reason to treat pagination as a dead-end, final step. Files can be edited before, during and after composition, which allows great flexibility of workflow, especially in environments where corrections are made at the "page proof" stage.
Negatives: operator errors can cause problems. On the downside, SGMLPublisher does not include parsing as part of the normal editing process. The operator is not prevented from adding, modifying or deleting tags and attributes in ways that are incompatible with the DTD, and there is no warning when this has occurred. In practical terms this probably means that most users will correct only typos and formatting errors in SGMLPublisher. They will use other SGML editors with built-in parsing to make structural and tag changes, thus avoiding the risk of coding errors.
SGMLPublisher is new, and Penta will have to make up for its late entry into the market. Penta will also have to overcome the memory of the financial difficulties and reorganizations it has been through in recent years. But we think SGMLPublisher is off to a promising start. It will find some ready buyers among Penta's remaining loyal users, and the product has enough attractive features, including the ability to publish tables automatically via AutoTab, that it will also find acceptance in the broader SGML user market, especially in settings where high-quality printed typography is still very important and many DTDs are in use.