XML.com: XML From the Inside Out

Proper XML Output in Python
by Uche Ogbuji

As for the output produced by write_xml_cdata_log_entry, the characters seem properly escaped, but there may still be a problem. If this output is to stand alone as an XML document, it will not be well-formed. The problem is that there is no XML declaration, so XML processors assume the character encoding is UTF-8. But the degree symbol at the end of the string makes it illegal UTF-8; an XML parser would signal an error. This is one of the most common symptoms of bad XML I have seen: documents encoded in ISO-8859-1 or some other encoding which are not marked as such in an XML declaration.
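The failure is easy to reproduce with a stock parser. The following is a minimal sketch, assuming Python 3 (where byte strings are bytes objects and text is str) and the standard xml.etree.ElementTree module rather than the Python 2 tools used in the listings; the helper name parses is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The log text encoded as ISO-8859-1: the degree sign is the single byte 0xB0.
latin1_bytes = "<entry>each interior angle &lt; 90\u00b0</entry>".encode("iso-8859-1")

def parses(document_bytes):
    """Return True if the parser accepts the bytes as well-formed XML."""
    try:
        ET.fromstring(document_bytes)
        return True
    except ET.ParseError:
        return False

# No declaration: the parser assumes UTF-8, and a lone 0xB0 is not valid UTF-8.
undeclared = latin1_bytes
# A declaration that matches the bytes actually written.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n' + latin1_bytes

print(parses(undeclared))
print(parses(declared))
```

The undeclared document is rejected, while the one whose declaration matches its real encoding is accepted.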

Just adding an XML declaration is not necessarily a solution. If I have the function add the declaration

"<?xml version="1.0" encoding="ISO-8859-1"?>"

then the previous function invocation produces problem-free XML. But nothing prevents write_xml_cdata_log_entry from being passed a message in an encoding other than ISO-8859-1. Almost any sequence of bytes can be interpreted as ISO-8859-1, so no error would be detected. But this is merely masking a deeper, more insidious problem: the text would be completely misinterpreted. To illustrate this specious fix, Listing 3 forces an ISO-8859-1 XML declaration.

Listing 3: a variation on write_xml_cdata_log_entry which always puts out an ISO-8859-1 XML declaration

import time
from xml.sax import saxutils

LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR']

def write_xml_cdata_log_entry(level, msg):
    #Note: in a real application, I would use ISO 8601 for the date
    #asctime used here for simplicity
    now = time.asctime(time.localtime())
    params = {'level': LOG_LEVELS[level], 'date': now, 
              'msg': saxutils.escape(msg)}
    print '<?xml version="1.0" encoding="ISO-8859-1"?>'
    print '<entry level="%(level)s" date="%(date)s"> \
\n%(msg)s\n</entry>' % params
    return  

To understand the nastiness that lurks within this seeming fix, take the case where a user passes in a UTF-8-encoded string containing a Japanese message that translates to "Welcome" in English.

$ python -i listing3.py
>>> write_xml_cdata_log_entry(2, "\343\202\210\343\201\206\343\201\223\343\201\235")
<?xml version="1.0" encoding="ISO-8859-1"?>
<entry level="ERROR" date="Tue Oct 22 15:54:36 2002">
よãfl†ãfl“ãfl?
</entry>
>>>  

An XML parser would accept this with no complaint. The problem is that any processing tool looking at this XML would read the individual bytes of each UTF-8 sequence as separate ISO-8859-1 characters. This means it would see twelve characters, rather than the four which our imaginary Japanese user thought she had specified. Even worse, unless this text is displayed in a system localized for Japanese, it will come out as a mess of accented "a"s and other strange characters, rather than the dignified Japanese welcome intended by the user, illustrated in Figure 1.

Figure 1: A Japanese Welcome
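The twelve-versus-four mismatch can be checked directly. This is a small sketch, assuming Python 3 and the standard xml.etree.ElementTree parser (neither is what the listings use); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

welcome = "\u3088\u3046\u3053\u305d"      # the four-character Japanese welcome
utf8_bytes = welcome.encode("utf-8")        # twelve bytes, three per character

# Reading those bytes as ISO-8859-1 never fails: every byte maps to a character.
misread = utf8_bytes.decode("iso-8859-1")
print(len(welcome), len(utf8_bytes), len(misread))

# A parser told (wrongly) that the document is ISO-8859-1 makes the same mistake.
doc = (b'<?xml version="1.0" encoding="ISO-8859-1"?><entry>'
       + utf8_bytes + b'</entry>')
parsed_text = ET.fromstring(doc).text
print(len(parsed_text), parsed_text == welcome)
```

The parser raises no error, yet the text it hands to the application is twelve unrelated characters rather than the original four.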

Character encoding issues are a very tricky business, and you should always defer to the tools that your language and operating environment provide for such magic, if for no other reason than to pass the buck when something goes wrong. In Python's case, this means using the Unicode facilities available in Python 1.6 and 2.x (although I still highly recommend Python 2.2 or more recent for XML processing). In fact, I use and strongly encourage the following rule for XML processing in Python: In all public APIs for XML processing, character data should be passed in strictly as Python Unicode objects.

In fact, I encourage using Unicode objects for all strings in programs that process XML, but following the above rule alone will prevent a lot of problems. Listing 4 updates write_xml_cdata_log_entry to follow this rule.

Listing 4: a variation on write_xml_cdata_log_entry which strictly accepts Python Unicode objects for message text.

import time, types
from xml.sax import saxutils

LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR']

def write_xml_cdata_log_entry(level, msg):
    if not isinstance(msg, types.UnicodeType):
        raise TypeError("XML character data must be passed in as a unicode object")
    now = time.asctime(time.localtime())
    encoded_msg = saxutils.escape(msg).encode('UTF-8')
    params = {'level': LOG_LEVELS[level], 'date': now, 'msg': encoded_msg}
    print '<entry level="%(level)s" date="%(date)s"> \
\n%(msg)s\n</entry>' % params
    return  

Pay particular attention to the line

encoded_msg = saxutils.escape(msg).encode('UTF-8')

Not only does this line escape characters that are illegal in XML character data, but it also encodes the Unicode object as a UTF-8 byte string. This is needed because most output, including printing to consoles and writing to files on most operating systems, requires conversion to byte streams. This means using an 8-bit encoding for strings that were originally in Unicode (because of my suggested rule). The write_xml_cdata_log_entry function always uses UTF-8 for this output encoding, which means that it doesn't have to put out an XML declaration that specifies an encoding. I should point out that in general it's considered good practice to always use an XML declaration which specifies an encoding, but I wrote the function this way as an illustration.

This version of the write_xml_cdata_log_entry function is safe as far as character encodings are concerned. It doesn't care whether the character data came from an ISO-8859-1 string, a UTF-8 string, or any other form of string, as long as it is passed in as a Unicode object.
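For readers on Python 3, where str is already Unicode and encoded output is a bytes object, the same policy might look like the sketch below. It is an adaptation under stated assumptions, not the article's listing: the function name format_xml_log_entry and its encoding parameter are hypothetical, and it returns bytes instead of printing. It also writes a declaration that always names the encoding actually used, in line with the good practice mentioned above.

```python
import time
from xml.sax import saxutils

LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR']

def format_xml_log_entry(level, msg, encoding='UTF-8'):
    """Return one log entry as bytes, declared in the encoding actually used."""
    if not isinstance(msg, str):
        raise TypeError("XML character data must be passed in as a str object")
    now = time.asctime(time.localtime())
    doc = ('<?xml version="1.0" encoding="%s"?>\n'
           '<entry level="%s" date="%s">\n%s\n</entry>\n'
           % (encoding, LOG_LEVELS[level], now, saxutils.escape(msg)))
    # Characters the target encoding cannot represent are written as numeric
    # character references (&#NNNN;) instead of raising an error.
    return doc.encode(encoding, 'xmlcharrefreplace')

print(format_xml_log_entry(2, "In any triangle, each interior angle < 90\u00b0"))
```

Because the declaration and the bytes come from the same encoding argument, they cannot drift apart the way they do in Listing 3.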

$ python -i listing4.py
>>> write_xml_cdata_log_entry(2, "In any triangle, each interior angle < 90\260")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "listing4.py", line 8, in write_xml_cdata_log_entry
    raise TypeError("XML character data must be passed in as a unicode object")
TypeError: XML character data must be passed in as a unicode object  

This exception is as expected. We passed in a plain byte string rather than a Unicode object and the function is enforcing policy.

>>> write_xml_cdata_log_entry(2, u"In any triangle, each interior angle < 90\u00B0")
<entry level="ERROR" date="Tue Oct 22 17:58:08 2002">
In any triangle, each interior angle &lt; 90Â°
</entry>  

The Unicode object for the log message includes the character \u00B0, written in Python's notation for explicitly representing a Unicode code point. A code point is a number that uniquely identifies one of the many characters Unicode defines. Here, of course, the code point represents the degree symbol. In this case, it would also be correct to use the regular octal escape \260, but I recommend using the "\u" form of escape in Python Unicode objects. Be wary of using the position of the character you want in your local encoding as the Unicode code point. For example, on Macs predating OS X, character 176 is the infinity symbol ("\u221E"), rather than the degree symbol.
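The difference between a code point and a byte position in a local encoding can be seen with the standard codecs. A brief sketch, assuming Python 3 and its built-in iso-8859-1 and mac_roman codecs:

```python
degree = "\u00b0"

# The code point of the degree sign is 176 (octal 260, hex B0).
print(ord(degree))
print("\260" == degree)

# Byte 176 only means "degree sign" in encodings that happen to agree with
# Unicode at that position.
byte_176 = bytes([176])
print(byte_176.decode("iso-8859-1") == degree)    # ISO-8859-1 agrees
print(byte_176.decode("mac_roman") == "\u221e")   # classic Mac encoding: infinity
```

Decoding with an explicitly named codec, rather than reusing a local byte value as a code point, avoids this kind of surprise.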

The function outputs the single degree character as a two-byte UTF-8 sequence. Since my console thinks it is displaying ISO-8859-1, the bytes appear to be separate characters, but an XML processor would properly read the sequence as a single character.
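The two-byte sequence and its two possible readings can be sketched as follows, assuming Python 3:

```python
encoded = "\u00b0".encode("utf-8")
print(encoded)                          # the raw bytes, shown as escapes

# What an ISO-8859-1 console shows: one character per byte.
as_latin1 = encoded.decode("iso-8859-1")
print(len(as_latin1))

# What an XML processor reading UTF-8 recovers: a single character.
as_utf8 = encoded.decode("utf-8")
print(len(as_utf8), as_utf8 == "\u00b0")
```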

>>> #The following two lines are equivalent
>>> msg = unicode("\343\202\210\343\201\206\343\201\223\343\201\235", "UTF-8")
>>> msg = "\343\202\210\343\201\206\343\201\223\343\201\235".decode("UTF-8")
>>> write_xml_cdata_log_entry(2, msg)
<entry level="ERROR" date="Tue Oct 22 18:10:57 2002">
よãfl†ãfl“ãfl?
</entry>  

First, I create a Unicode object from the UTF-8-encoded string and then pass it to the function, which outputs it as UTF-8. This is no longer a problem because the parser will recognize the encoding as UTF-8, rather than mistaking it for ISO-8859-1, as before.

Not Quite There Yet

But this function is still not failsafe. A remaining problem is that XML only allows a limited set of characters to be present in markup. For example, the form feed character is illegal. There is nothing in our function to prevent a user from inserting a form feed character, which would result in malformed XML. There are other subtleties to consider. Users of 4Suite have handy functions that take care of most of the concerns surrounding the output of XML character data. The one of most interest in this discussion is Ft.Xml.Lib.String.TranslateCdata. Listing 5 is a version of write_xml_cdata_log_entry that uses TranslateCdata to render character data as well-formed XML.

Listing 5: a variation on write_xml_cdata_log_entry which uses Ft.Xml.Lib.String.TranslateCdata from 4Suite for safer XML output.

import time, types
from xml.sax import saxutils
from Ft.Xml.Lib.String import TranslateCdata

LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR']

def write_xml_cdata_log_entry(level, msg):
    if not isinstance(msg, types.UnicodeType):
        raise TypeError("XML character data must be passed in as a unicode object")
    #Note: in a real application, I would use ISO 8601 for the date
    #asctime used here for simplicity
    now = time.asctime(time.localtime())
    encoded_msg = TranslateCdata(msg)
    params = {'level': LOG_LEVELS[level], 'date': now, 'msg': encoded_msg}
    print '<entry level="%(level)s" date="%(date)s"> \
\n%(msg)s\n</entry>'% params
    return  

The key bit is now the line

encoded_msg = TranslateCdata(msg)

which uses the 4Suite function. This takes care of the escaping, the character encoding, trapping illegal XML characters, and more. 4Suite also provides functions that prepare character data to be output inside an XML attribute or for HTML output.
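For readers without 4Suite, the illegal-character check alone can be approximated with the standard library. The sketch below is not 4Suite's implementation and makes no claim about how TranslateCdata works; it assumes Python 3, tests text against the Char production of XML 1.0, and uses a hypothetical helper name, check_xml_chars.

```python
import re

# Complement of the XML 1.0 "Char" production: anything that is NOT a tab,
# newline, carriage return, or a character in the permitted Unicode ranges.
ILLEGAL_XML_CHARS = re.compile(
    '[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')

def check_xml_chars(text):
    """Return text unchanged, or raise ValueError on a character XML 1.0 forbids."""
    match = ILLEGAL_XML_CHARS.search(text)
    if match:
        raise ValueError("illegal XML character U+%04X at index %d"
                         % (ord(match.group()), match.start()))
    return text

check_xml_chars("each interior angle < 90\u00b0\n")   # passes
try:
    check_xml_chars("page one\x0cpage two")            # contains a form feed
except ValueError as exc:
    print(exc)
```

Rejecting such input is one policy choice; a logging function might prefer to replace the offending character, since no escaping mechanism makes a form feed legal in XML 1.0.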

But just to put another twist on the matter, even now the 4Suite developers are refining these functions for better design, and the signatures may change in future releases. Since in many cases you have a special task to fulfill, and don't want to bear all the burden of XML correctness, this reinforces the importance of relying on third-party tools.

Conclusion

    


So much for the notion that XML output is nothing more than an exercise for the Python print keyword. I haven't even plumbed all the issues involved, and I'll return to further concerns in future articles. The main point I want to get across is that generating XML is not as easy as it would at first seem, and that you should use established tools as much as possible. I have pointed out utility functions in the standard Python library and in 4Suite. Another approach is to create a DOM tree and then serialize it. Just remember to always generate XML with a great deal of care and to test all output thoroughly with reliable XML parsers. The world could certainly do with more good XML citizens.
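As an illustration of the DOM approach, here is a minimal sketch using xml.dom.minidom from the standard library, assuming Python 3. The function name build_log_entry is hypothetical, and minidom should not be assumed to reject illegal characters such as form feeds, so testing the output with a parser still applies.

```python
import time
from xml.dom import minidom

def build_log_entry(level_name, msg):
    """Build the entry as a DOM tree and serialize it to UTF-8 bytes."""
    doc = minidom.Document()
    entry = doc.createElement('entry')
    entry.setAttribute('level', level_name)
    entry.setAttribute('date', time.asctime(time.localtime()))
    entry.appendChild(doc.createTextNode(msg))
    doc.appendChild(entry)
    # The serializer handles escaping and writes a matching XML declaration.
    return doc.toxml(encoding='UTF-8')

message = "In any triangle, each interior angle < 90\u00b0"
output = build_log_entry('ERROR', message)
print(output)

# Test the result with a parser: the text should survive the round trip.
round_trip = minidom.parseString(output).documentElement.firstChild.data
print(round_trip == message)
```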

Thanks to Mike Brown, an expert on the intersection of XML and character set arcana. He reviewed this article for technical correctness and suggested important clarifications.

Python-XML Happenings

Here is a brief on significant new happenings relevant to Python-XML development, including significant software releases. Not much to report this month.

Walter Dörwald announced version 2.0 of XIST, an XML-based extensible HTML generator written in Python. The announcement also led to some discussion of the use of namespaces in XIST, leading to this clarification.

Henry Thompson appears to have responded to my teasing about the lack of distutils in XSV with a new release.
