XML.com

Using GitHub for Collaborative XML Publishing

June 20, 2021

G. Ken Holman

Authoring a technical standard can distract from the development of the standard’s content. Equipping a standards committee effectively to satisfy the documentation obligation, without impacting on the technical development, benefits those involved and produces results faster.

And writing is not the only task. Assembling complex work products can be finicky, and so leveraging automation where possible produces results more consistently.

This case study shows how two OASIS technical committees collaboratively prepare documents for both OASIS and ISO submission.

The committees’ goals were to:

  • maximize the time developing technical content, which is why the members joined in the first place;
  • minimize the time spent formatting content twice to satisfy two sets of layout requirements;
  • automate the production of intricate committee deliverables; and
  • enable committee members to propose contributions to the editors in an efficient manner.

This case study illuminates the committees’ use of DocBook XML for authoring a single document to produce multiple layouts. Moreover, using XML provides options for generated content not readily available in other authoring environments.

Also illustrated is how the editing and publishing process is supported by git repositories and GitHub hosting, which collaborators use to propose contributions to the editors. Together with the online XML publishing service from Réalta, this equips members to preview their draft work in final-form PDF and HTML at any time. This frees members from the burden of supporting specialized, expensive publishing tools they may not otherwise need.

The end result for each committee is the hands-off production of complete work product deliverables including two different PDF layouts.

IMPORTANT: This essay is not intended to replace the more detailed README.md instructions for the technical committee members found in their respective git repositories. Rather than get bogged down in details, this essay is meant to introduce and overview the strategy of using git and GitHub for collaborative committee work.

Technical note

This monolithic HTML document includes embedded SVG graphic images that may not be visible on all browsers. The author has tested this file successfully on Chrome, Firefox, Opera, Edge, and Safari.

1. Introduction

The bane of standards and specifications writers is the distraction from valuable technical work triggered by the responsibility to present that work in a completed document according to an imposed style or layout guide. This responsibility is multiplied when a single document text is targeted for multiple standards development organizations, each imposing its own historically-based layout. Maintaining multiple document formats simultaneously is rife with problems of consistency and extra effort. Using XML as a single source for multiple differing layouts is critical to maintaining the text consistently across the publications.

Moreover, committees have many experts with input to a given document. Being able to incorporate input from different contributors is streamlined when using XML because it is text based. Equipping committee members to publish their intermediate work allows them to review their modifications before submitting to editors their suggested changes.

Finally, assembling complex work products can be a finicky task. Results are produced more quickly and are built more reliably where it is possible to automate the production and assembly task. Using the GitHub hosting service for git repositories offers “GitHub Actions” where scripted behaviours can be executed.

This case study follows two OASIS technical committees responsible for preparing revisions of their respective OASIS specifications in a form suitable for both OASIS and ISO submission. It illuminates the committees’ use of DocBook XML for single-source authoring. Using XML also opens options for generated content not readily available in other authoring environments.

Also illustrated is how the editing and publishing process is supported by git repositories and GitHub hosting, which collaborators use to propose modifications to the editors. This frees contributors from the burden of supporting, in their local computing environments, specialized publishing tools they may not have and processes they may not be familiar with.

The environment now available for each committee implements the hands-off production of a pair of ZIP files for uploading to the OASIS Kavi server: one for distribution to users by TC Administration and one for the archiving of source and intermediate files that are used in production but not residing on any OASIS server. This archive fulfills a committee’s obligation to make publicly-available for posterity all of the inputs involved in the production of work products.

This essay presents this model for other OASIS Technical Committees to leverage the collaboration opportunities provided by the technologies, yet still meet the committee obligations for archive. This implementation of the model incorporates OASIS’s commercial license for the http://RealtaOnline.com online-publishing service, a high-performance and purpose-built standards publishing platform, now available to support all OASIS editors of committee work products.

This model can be useful in other collaborative writing and publishing environments, illustrating the use of tools whose genesis is in software development, not in the authoring of documents. And not just for OASIS and not just for standards.

2. A requirement to support multiple SDO layouts

Standards Development Organizations (SDOs) mandate particular page/screen layout requirements for their own specifications. They don’t always look the same as work products from other SDOs (imagine!). For the authors/editors of standards for a single SDO, this isn’t usually a problem, other than the burden itself of worrying about the layout when wanting otherwise to focus on the technical work itself.

Lately, however, there have been opportunities for some SDOs to adopt the specifications of other SDOs. The International Organization for Standardization (ISO) is a frontrunner of this with their formal Publicly Available Specification (PAS) submission process. This process permits an accredited organization to submit the organization’s own internally-developed specifications as candidate ISO specifications to be affirmed as International Standards with the agreement of ISO’s national bodies. While the first PAS submission is allowed to be published using the submitter’s organizational layout, subsequent submissions are required to follow ISO’s layout as dictated by an important document titled “Directives Part 2”.

The Organization for the Advancement of Structured Information Standards (OASIS) is an accredited PAS submitter, and a number of committees have submitted, or plan to submit, OASIS Standards to become ISO standards. The OASIS Universal Business Language (UBL) technical committee submitted their UBL 2.1 OASIS Standard through the PAS process to become ISO/IEC 19845:2015. Subsequent revisions must be submitted to ISO using the Directives Part 2 layout, while at the same time satisfying OASIS’s technical publication process obligations for layout. These are two quite different layouts for one set of technical information.

The OASIS Code List Representation technical committee intends to submit their OASIS genericode specification through the PAS process. As the team progresses, it faces the same multiple-layout challenges as the UBL TC.

In both cases the committee is leveraging a long-promoted benefit of using XML syntax for source documents: multiple target publishing. For documents written following the OASIS DocBook standard, OASIS has a specification layout set of stylesheets conforming to the layout requirements dictated by committee process. With these stylesheets the source XML is in a simple text-based file format that is readily handled by many tools.

In support of their work, OASIS committee editors have access to the REST-based online publishing service offered by http://RealtaOnline.com. One of the features of this service is the transformation of DocBook XML conforming to the OASIS specification conventions into NISO-STS, the JATS-based publishing vocabulary commonly adopted for international standards.

The Réalta service pairs the OASIS specification stylesheet library for DocBook with a Directives Part 2 stylesheet library for NISO-STS. Committee editors invoke the service by supplying a single DocBook representation of the content. They get back the content in OASIS layout in both PDF and HTML and, optionally, in Directives Part 2 layout in PDF. The former two outputs satisfy the OASIS technical committee process and the latter output satisfies the PAS submission process.

3. GitHub: a committee-wide collaboration opportunity

GitHub describes itself as the largest and most advanced development platform in the world. At the time of writing, it supports over 200 million repositories, providing cloud-based storage for project data maintained using the free git source code control and change tracking software. Also provided is a computing service called GitHub Actions, with which processes can run in the cloud on the data found in the repository, triggered by the various interactions collaborators have with the git software.

For committees such as the UBL TC, the publishing burden is multiplied by the need for separate subcommittees to make contributions towards a single specification. Different clauses of the specification are the responsibility of different subcommittees. Moreover, some of the technical artefacts are governed by separate subcommittees.

Using git on GitHub supports the collaboration and input from multiple committee members towards a single specification. With the automation, each contributor is empowered to create the set of deliverables reflecting their input to the project, as a preview of what the committee editors would see with their input incorporated. Editors, in turn, can see both the inputs and the outputs of a collaborator’s submission to assess how best to respond to the contribution.

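The two GitHub-hosted git repositories for this case study are:

  • https://github.com/oasis-tcs/ubl for the UBL TC
  • https://github.com/oasis-tcs/codelist-genericode for the Code List Representation TC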

Separately tailored GitHub Actions are leveraged by each committee in the generation of published content and artefacts, and their assembly into both archival and distributable packages conforming to OASIS committee process.

All of the committee scripts and stylesheets are found in the repository in clear text for future teams to maintain and modify as required. The scripting is written using Apache Ant, a choice by the committees and not in any way an obligation on the part of GitHub. If you can run the script locally in your environment, you can run it in GitHub if all of the tools are available. The UBL environment has some tool dependencies that are able to be satisfied by GitHub.

4. The committee protocol using git on GitHub

Two committee project roles are identified, each defined as a GitHub team: editors and maintainers.

Editors are responsible for incorporating the suggestions made by the maintainers into review copies (for committee consideration) and main copies (already accepted by the committee). Two git branches are reserved and commits to these branches are restricted to editors:

  • main - this is content that has been reviewed by committee members and considered acceptable to be distributed for its intended purpose (which may be for testing or for production use, not necessarily for final use); the public is expected to look to this branch for self-consistent content reviewed and accepted by the committee.
  • review - this is content from the editors that has not been reviewed by committee members yet, and so is not considered agreed-upon for its intended purpose, but the editors have incorporated input from other sources into a package for review; when there is consensus about the content of the review branch, a snapshot of it is merged into the main branch.

A main branch package is not necessarily a final package, but simply a package merged from the review branch whose review has been completed. Editor’s note: the jury remains out whether the main branch is useful to the committee’s public audience, as the function may be satisfied by judicious use of tags and releases.

Maintainers create and maintain their suggestions in their own git branches (note that editors making their own suggestions also work in their own git branches as if they were a maintainer). Maintainers can use any XML editing tool to make their changes to the specification document. Other files and directories can change however needed by the maintainer.
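The branch rules above can be summarized in a few lines of code. This is only an illustrative sketch: the role names and the two reserved branch names come from the protocol described here, but the function itself is hypothetical and is not part of either committee's repository (in practice such restrictions are configured in GitHub, not scripted).

```python
# Hypothetical model of the committee branch protocol; not taken from the repositories.
RESERVED_BRANCHES = {"main", "review"}   # commits restricted to editors
KNOWN_ROLES = {"editor", "maintainer"}   # the two GitHub teams


def may_commit(role, branch):
    """Return True if a member with the given role may commit to the branch."""
    if role not in KNOWN_ROLES:
        return False
    if branch in RESERVED_BRANCHES:
        return role == "editor"
    # Any other branch is a personal working branch open to both roles.
    return True
```

For example, `may_commit("maintainer", "review")` is false, while an editor working on a personal branch is treated exactly like a maintainer.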

This diagram overviews the maintenance and publishing protocol maintainers and editors are expected to follow, remembering that an editor also performs maintainer tasks until they submit their own pull requests from their personal branches and then perform the role as an editor. The numbered step details are found at https://github.com/oasis-tcs/ubl/tree/review#detailed-steps for readers interested in the interactions between roles.

Figure 1. Overall protocol when using git

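[Diagram: the editor’s environment and the maintainer’s environment each hold main, review, and other branches; they exchange content with the repository on www.GitHub.com through push, pull, pull request, and merge operations. A push triggers GitHub Actions, which use www.RealtaOnline.com for publishing. Results are copied to OASIS Kavi and OASIS Docs.]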

Open-source tools can be configured as part of the GitHub environment (e.g. OpenOffice is used in the UBL environment) or uploaded with the repository to be used for publishing. Of note, the OASIS committees’ use of the REST-based interface to the commercial http://www.RealtaOnline.com satisfies the publishing task without having to upload the publishing tools as part of the repository contents.

GitHub Action results are restricted to GitHub members and do not persist more than 90 days after production. OASIS editors and maintainers needing to distribute the published results manually copy the assembled ZIP files to the Kavi server.

5. Generated content used in the deliverables

5.1. Specification text

In both the UBL and genericode specifications, some of the content of the specification document is synthesized as part of the build and publication processes. This is accomplished by declaring in the specification XML the external general entities that point to entity files containing the generated content.

XSLT is a versatile transformation language that reads in XML and can output either standalone XML or XML general entities.
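To make the entity mechanism concrete, here is a minimal sketch of a document that declares an external general entity and a parse that expands it. The file names, element content, and entity name are hypothetical, and Python's standard SAX parser stands in for whatever XML tooling the committees actually use.

```python
import os
import tempfile
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical specification source: the DOCTYPE declares an external
# general entity pointing at a separately generated file.
SPEC_XML = """<?xml version="1.0"?>
<!DOCTYPE article [
  <!ENTITY summary SYSTEM "summary-ent.xml">
]>
<article>
  <para>Authored text.</para>
  &summary;
</article>
"""

# Hypothetical generated content, as a build step might write it.
GENERATED_ENTITY = "<para>Generated summary.</para>"


class TextCollector(ContentHandler):
    """Collect character data so the expanded text can be inspected."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def expand_spec(spec_path):
    """Parse the specification, expanding external general entities."""
    parser = xml.sax.make_parser()
    # Recent Python versions do not expand external entities unless asked to.
    parser.setFeature(feature_external_ges, True)
    collector = TextCollector()
    parser.setContentHandler(collector)
    parser.parse(spec_path)
    return "".join(collector.chunks)


def demo():
    """Write both files to a temporary directory and return the expanded text."""
    with tempfile.TemporaryDirectory() as work:
        spec_path = os.path.join(work, "spec.xml")
        with open(spec_path, "w", encoding="utf-8") as handle:
            handle.write(SPEC_XML)
        with open(os.path.join(work, "summary-ent.xml"), "w", encoding="utf-8") as handle:
            handle.write(GENERATED_ENTITY)
        return expand_spec(spec_path)
```

The authored file never changes when the generated content changes; only the entity file is rewritten by the build.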

In the UBL specification the generation process is quite complex, producing a handful of included entities incorporating information from multiple sources including the previous version of the document XML, the previous and current versions of the semantic library from Google spreadsheets, and some colloquial XML documents used to specify summary information.

In the genericode specification the generation is quite straightforward and is illustrated in the diagram below. The repository entities always are empty placebos, as they are replaced in the production process prior to publishing. See https://github.com/oasis-tcs/codelist-genericode/tree/review#authoring-and-generated-content for details on the numbered steps.

Figure 2. The synthesis of conformance clause summaries from authored content

This content harvesting, manipulation, and re-insertion can be very powerful in creating useful summaries or other reference materials inside of the specification. As in this example, the harvested table is massaged for use in a conformance clause, rearranging the text based on how it is read in a different context.
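The harvest-and-reinsert idea can be sketched as follows. The committees do this with XSLT; Python is used here only as a stand-in, and the element names, clause labels, and clause text are all hypothetical placeholders rather than content from the genericode specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical authored table: one row per conformance requirement.
AUTHORED = """<section>
  <table>
    <row><entry>CONF-1</entry><entry>First requirement text.</entry></row>
    <row><entry>CONF-2</entry><entry>Second requirement text.</entry></row>
  </table>
</section>"""


def build_summary_entity(authored_xml):
    """Harvest table rows and rearrange them as a list for reuse elsewhere."""
    root = ET.fromstring(authored_xml)
    summary = ET.Element("itemizedlist")
    for row in root.iter("row"):
        entries = [(entry.text or "").strip() for entry in row.findall("entry")]
        if len(entries) < 2:
            continue  # skip rows that do not carry a label and a text
        label, text = entries[0], entries[1]
        item = ET.SubElement(summary, "listitem")
        para = ET.SubElement(item, "para")
        # The text is reordered for the new context: requirement first, label last.
        para.text = f"{text} ({label})"
    # The serialized fragment is what would be written to the entity file.
    return ET.tostring(summary, encoding="unicode")
```

Writing the returned string to the entity file referenced by the specification replaces the empty placeholder before publishing.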

5.2. Accompanying artefacts

In the case of the genericode project, all the accompanying artefacts are prepared by hand and contributors simply update the repository directories as required. The build process is less than 90 seconds.

In the case of the UBL project, most of the numerous accompanying artefacts are synthesized from three Google spreadsheets made available to the committee to collaborate on the UBL semantic model. From these spreadsheets the GitHub automation runs a number of stylesheets and other applications to check the veracity of the inputs while producing the outputs. When problems are detected, the output work products include files not intended to be distributed to users. This brings the problems to the attention of the author. A successful build process takes more than 20 minutes. GitHub conveniently sends an email notification at the completion of the build process, successful or not.

6. Previewing XML content

If all a committee member is doing is modifying the text of the specification document, a local preview environment enables the writer to see the impact of their changes to the documentation XML before checking in their branch. This gives instant feedback without needing to trigger a GitHub action. Only a single layout is supported, that being the OASIS layout based on DocBook, and so there may be some limited content reserved for the ISO publication that cannot be seen.

On the Windows platform the Internet Explorer browser renders the specification XML. On the Mac platform the Safari browser does the same. It appears other browsers cannot handle the DocBook stylesheet library and cannot be used.

The author simply drags and drops the XML file onto the browser, or opens it from the browser, to see the HTML rendering of the content. After editing the content in their XML editor and saving their work, a simple refresh in the browser renders their latest changes. This is instant and does not rely on GitHub actions to perform a formal publishing process.

This functionality is unavailable to maintainers performing their tasks from the online GitHub web interface. It is available only to those users who have cloned the repository to their local environment. Online users must commit and push their content in order to obtain a rendering. For UBL contributors the online run takes over 20 minutes, whereas a local refresh is instantaneous.

7. GitHub automation and housekeeping

Key to the success of the hands-off committee work is the automation provided by GitHub that is triggered by the push request. A GitHub Action is performed on the server after the server makes its own copy of the git repository content. Thus, any changes to the repository made on the server during the action do not impact on the repository data in git. Also, the scripting for the server is maintained as repository files just as all the other files.

As described in sections above, the artefact synthesis, the content generation, and the publishing all are executed on the GitHub server and not the committee member’s computer. Not only does this free up the computer from the grinding of producing the outputs, it precludes the need to install the transformation and publishing software on the member’s machine.

Moreover, not all committee members are working on the same operating system and so there would be members who would not be able to run the build script (in this case using bash for invocation) on their computer. This is mitigated somewhat for these two repositories in that the scripting is done using Ant, which is cross platform. A batch file invocation of the Ant script would work just as well.

Finally, there is no need to propagate to committee members information that may be of a sensitive nature. In the example of these two repositories, the user name and password REST access credentials for the commercial publishing service are hidden in GitHub secret values managed by OASIS TC Administration. These values are not exposed in the console logs or error reports of the executing process and so there is no security breach that might permit unauthorized access to the publishing service. OASIS committees not wanting to use GitHub but who need to use the REST service are trusted to protect their use of their committee’s private values.

Before GitHub Actions can be performed in the repository, they must be enabled by going to the Actions tab and engaging the facility. The hidden .github/workflows directory in the repository then holds a YAML script that can be tailored for custom invocations based on git actions (see https://github.com/oasis-tcs/codelist-genericode/tree/review/.github/workflows for the genericode example). In the case of these two repositories, the only action triggering an invocation is “push”. After establishing the computing environment and the tools needed as dependencies, the workflow invokes the bash script that runs the Apache Ant script and zips up the results.
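The final packaging step, producing one ZIP for distribution and one for archive, might look like the sketch below. The committees do this from bash and Ant; this Python version is only an illustration, and the directory layout (`distribution` and `archive` subdirectories) and file naming are assumptions, not the repositories' actual conventions.

```python
import os
import tempfile
import zipfile


def zip_tree(source_dir, zip_path):
    """Zip every file under source_dir, storing paths relative to it."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full = os.path.join(folder, name)
                archive.write(full, os.path.relpath(full, source_dir))


def package_results(results_dir, out_dir, label):
    """Create the distribution and archive ZIP files for one build."""
    packages = {}
    for kind in ("distribution", "archive"):  # assumed subdirectory names
        zip_path = os.path.join(out_dir, f"{label}-{kind}.zip")
        zip_tree(os.path.join(results_dir, kind), zip_path)
        packages[kind] = zip_path
    return packages


def demo():
    """Build two tiny placeholder trees and return the contents of each ZIP."""
    with tempfile.TemporaryDirectory() as work:
        results = os.path.join(work, "results")
        for kind, filename in (("distribution", "spec.pdf"), ("archive", "spec.xml")):
            os.makedirs(os.path.join(results, kind))
            with open(os.path.join(results, kind, filename), "w") as handle:
                handle.write("placeholder")
        out_dir = os.path.join(work, "out")
        os.makedirs(out_dir)
        contents = {}
        for kind, path in package_results(results, out_dir, "draft").items():
            with zipfile.ZipFile(path) as archive:
                contents[kind] = archive.namelist()
        return contents
```

Keeping the output directory outside the results tree avoids a ZIP file accidentally including itself.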

Importantly, how a collaborator uses git impacts on how often the automation gets run. When a collaborator is working from the command line they are able to commit multiple changes to their git repository before a single push request is used to trigger the automation. When a collaborator is working from the GitHub web interface, each and every individual commit includes an implicit push that triggers the automation. A web user modifying 10 files will trigger 10 automation builds. This may introduce delays for the collaborator wanting to see their final result, or for other users of the repository, whose builds end up queued behind all of the automation triggers.

Accordingly, a collaborator will see their final result sooner if they take the time to go to the Actions tab and cancel the nine workflow runs triggered before the final trigger that produces the desired results. This also prevents execution minutes being deducted from the GitHub monthly limits.
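The rule for deciding which runs to cancel is simple enough to state in code. The sketch below works on plain dictionaries with hypothetical field names (`id`, `branch`, `number`, `status`); it does not call GitHub, and the actual cancelling is done by hand in the Actions tab as described above.

```python
def runs_to_cancel(runs):
    """Return ids of unfinished runs superseded by a newer run on the same branch."""
    newest = {}
    for run in runs:
        branch = run["branch"]
        if branch not in newest or run["number"] > newest[branch]:
            newest[branch] = run["number"]
    return [
        run["id"]
        for run in runs
        if run["status"] != "completed" and run["number"] < newest[run["branch"]]
    ]
```

For a web user who edited 10 files on one branch, this selects the first nine runs and leaves only the last one to complete.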

Moreover, some housekeeping is critical to reduce the GitHub burden of supporting the triggered actions. Action results are automatically deleted by GitHub 90 days after having been created, but until then they occupy storage. Deleting unneeded results sooner reduces the impact on the GitHub storage limit for repositories.

In accordance with OASIS committee process requirements, committee work products need to persist indefinitely on OASIS platforms. Collaborators are obliged to view GitHub action results as the transitory constructs they are, manually preserving in Kavi the results that need to be preserved, and respectfully cancelling and deleting runs and results that no longer are needed. In the case of the UBL project, a single run of the automation takes over 20 minutes to produce 550 MB of data compressed into 160 MB total in two ZIP files. Deleting the undesired action runs and results will save time and space. Deleting intermediate results will help, as will deleting the final results once they have been posted to Kavi for distribution to the committee.
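As a rough worked example of what housekeeping saves, the arithmetic below uses the UBL figures quoted above (over 20 minutes and about 160 MB of ZIP output per run) as defaults. The function name is hypothetical, and the result is an estimate of what unneeded runs would cost if left to complete and persist, not a measurement.

```python
def housekeeping_savings(unneeded_runs, minutes_per_run=20, zipped_mb_per_run=160):
    """Estimate build minutes and storage avoided by cancelling and deleting runs."""
    return {
        "minutes": unneeded_runs * minutes_per_run,
        "storage_mb": unneeded_runs * zipped_mb_per_run,
    }
```

For the nine superseded runs of a ten-file web edit, this comes to at least 180 build minutes and roughly 1440 MB of stored results.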

8. Summary

Using git and GitHub for standards development provides a collaborative environment supporting committee members in the mundane and arcane tasks of publishing specification documents and assembling distribution deliverables. This frees up their time to focus on the heart of the standards process: the divining and development of the specification content.

Not that git and GitHub themselves aren’t a bit arcane, but an investment in learning how to use these tools will stand one in good stead in the future of collaboration and software development. Helpfully, the openly-available repositories supporting these two cases can be inspected for the procedures performed and copied when starting new projects.

This case study illustrates how these tools can be successfully deployed for collaborative documentation and deliverable development. Contributors can trigger the remote process readily to preview the impact of their changes before suggesting them to the editors in charge, without the burden of supporting specialized publishing tools they may not have in their personal computing environment. Members see their contributions in final-form PDF and HTML when the GitHub commit process invokes the online XML publishing service from Réalta.

These collaboration benefits are not restricted to this OASIS environment, as this git and GitHub combination can be realized by any writing project using open publishing processes. OASIS project editors in particular have access to a commercial publishing process that simultaneously satisfies the OASIS committee process requirements for both the OASIS page layout and the ISO Directives Part 2 page layout, but not having this doesn’t take away from the model itself.