XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.

advertisement

Proper XML Output in Python

Proper XML Output in Python

November 13, 2002

I planned to conclude my exploration of 4Suite this time, but events since last month's article led me to discuss some fundamental techniques for Python-XML processing first. First, I consider ways of producing XML output in Python, which might make you wonder what's wrong with good old print? Indeed programmers often use simple print statements in order to generate XML. But this approach is not without hazards, and it's good to be aware of them. It's even better to learn about tools that can help you avoid the hazards.

Minding the Law

The main problem with simple print is that it knows nothing about the syntactic restrictions in XML standards. As long as you can trust all sources of text to be rendered as proper XML, you can constrain the output as appropriate; but it's very easy to run into subtle problems which even experts may miss.

XML has been praised partly because, by setting down some important syntactic rules, it eases the path to interoperability across languages, tools, and platforms. When these rules are ignored, XML loses much of its advantage. Unfortunately, developers often produce XML carelessly, resulting in broken XML. The RSS community is a good example. RSS uses XML (and, in some variants, RDF) in order to standardize syntax, but many RSS feeds produce malformed XML. Since some of these feeds are popular, the result has been a spate of RSS processors that are so forgiving, they will even accept broken XML. Which is a pity.

Eric van der Vlist -- as reported in his article for XML.com, "Cataloging XML Vocabularies" -- found that a significant number of Web documents with XML namespaces are not well-formed, including XHTML documents. Even a tech-savvy outfit like Wired has had problems developing systems that reliably put out well-formed XML.

My point is that there's no reason why Python developers shouldn't be good citizens in producing well-formed XML output. We have a variety of tools and a language which allows us to express our intent very clearly. All we need is to take a suitable amount of care. Consider the snippet in Listing 1. It defines a function, write_xml_log_entry, for writing log file entries as little XML documents, using the print keyword and formatted strings.

Listing 1: Simple script to write XML log entries

import time

LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR']

def write_xml_log_entry(level, msg):
    #Note: in a real application, I would use ISO 8601 for the date
    #asctime used here for simplicity
    now = time.asctime(time.localtime())
    params = {'level': LOG_LEVELS[level], 'date': now, 'msg': msg}
    print '<entry level="%(level)s" date="%(date)s"> \
\n%(msg)s\n</entry>' % params
    return

write_xml_log_entry(0, "'Twas brillig and the slithy toves")
#sleep one second
#Space out the log messages just for fun
time.sleep(1)
write_xml_log_entry(1, "And as in uffish thought he stood,")
time.sleep(1)
write_xml_log_entry(0, "The vorpal blade went snicker snack")  

Listing 1 also includes a few lines to exercise the write_xml_log_entry function. All in all, this script is straightforward enough. The output looks like

$ python listing1.py 
<entry level="DEBUG" date="Mon Oct 21 22:11:01 2002">
'Twas brillig and the slithy toves
</entry>
<entry level="WARNING" date="Mon Oct 21 22:11:03 2002">
And as in uffish thought he stood,
</entry>
<entry level="DEBUG" date="Mon Oct 21 22:11:07 2002">
The vorpal blade went snicker snack
</entry>  

But what if someone uses this function thus:

>>> write_xml_log_entry(2, "In any triangle, each interior angle < 90 degrees")
<entry level="ERROR" date="Tue Oct 22 05:41:31 2002">
In any triangle, each interior angle < 90 degrees
</entry>  

Now the result isn't well-formed XML. The character "<" should, of course, have been escaped to "&lt;". And there's a policy issue to consider. Are messages passed into the logging function merely strings of unescaped character data, or are they structured, markup-containing XML fragments? The latter policy might be prudent if you want to allow people to mark up log entries by, say, italicizing a portion of the message:

>>> write_xml_log_entry(2, "Came no church of cut stone signed: <i>Adamo me fecit.</i>")
<entry level="ERROR" date="Tue Oct 22 05:41:31 2002">
Came no church of cut stone signed: <i>Adamo me fecit.</i>
</entry>  

I've reused the write_xml_log_entry function because msg-as-markup is the policy implied by the function as currently written. There are further policy issues to consider. In particular, to what XML vocabularies do you constrain output, if any? Allowing the user to pass markup often entails that they have the responsibility for passing in well-formed markup. The other approach, where the msg parameter is merely character data, usually entails that the write_xml_log_entry function will perform the escaping required to produce well-formed XML in the end. For this purpose I can use the escape utility function in the xml.sax.saxutils module. Listing 2 defines a function, write_xml_cdata_log_entry, which performs such escaping.

Listing 2: Simple script to write XML log entries, with the policy that messages passed in are considered character data

import time
from xml.sax import saxutils

LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR']

def write_xml_cdata_log_entry(level, msg):
    #Note: in a real application, I would use ISO 8601 for the date
    #asctime used here for simplicity
    now = time.asctime(time.localtime())
    params = {'level': LOG_LEVELS[level], 'date': now, 
              'msg': saxutils.escape(msg)}
    print '<entry level="%(level)s" date="%(date)s"> \
\n%(msg)s\n</entry>' % params
    return  

This function is now a bit safer to use for arbitrary text.

$ python -i listing2.py
>>> write_xml_cdata_log_entry(2, "In any triangle, each interior angle < 90\260")
<entry level="ERROR" date="Tue Oct 22 06:33:51 2002">
In any triangle, each interior angle &lt; 90
</entry>
>>> 

And it enforces the policy of no embedded markup in messages.

>>> write_xml_cdata_log_entry(2, "Came no church of cut stone signed: <i>Adamo me fecit.</i>")
<entry level="ERROR" date="Tue Oct 22 06:41:31 2002">
Came no church of cut stone signed: &lt;i&gt;Adamo me fecit.&lt;/i&gt;
</entry>
>>>  

Notice that escape also escapes ">" characters, which is not required by XML in character data but is often preferred by users for congruity.

Minding Your Character

Python's regular strings are byte arrays. Characters are represented as one or more bytes each, depending on the encoding, but the string does not indicate which encoding was used. If it surprises you to hear that characters might be represented using more than one byte, consider Asian writing systems, where there are far more characters available than could be squeezed into the 255 a byte can represent. For this reason, some character encodings, such as UTF-8, use more than one byte to encode a single character. Other encodings, such as UTF-16 and UTF-32, effectively organize the byte sequence into groups of two or four bytes, each of which is the basic unit of the character encoding.

Because most single-byte encodings, such as as ISO-8859-1, are identical in the ASCII byte range (0x00 to 0x7F), it's generally safe to use Python's strings if they contain only ASCII characters and you're using a single-byte encoding (other than the old IBM mainframe standard, EBCDIC, which is different from ASCII throughout its range). But in any case, especially if you put non-ASCII characters in one of these regular strings, both the sender and receiver of these bytes need to be in agreement about what encoding was used.

The problems associated with encoded strings are easy to demonstrate. Consider one of the earlier examples, with a slight change -- the word "degrees" has been replaced with the byte B0 (octal 260), which is the degree symbol in the popular ISO-8859-1 and Windows-1252 character encodings:

$ python -i listing2.py
>>> write_xml_cdata_log_entry(2, "In any triangle, each interior angle < 90\260")
<entry level="ERROR" date="Tue Oct 22 06:33:51 2002">
In any triangle, each interior angle &lt; 90
</entry>
>>>  

The \260 is an octal escape for characters in Python. It represents a byte of octal value 260 (B0 hex, 176 decimal).

Pages: 1, 2

Next Pagearrow







close