Introducing PyXML
by Uche Ogbuji
|
Pages: 1, 2
from xml.dom.ext import StripXml, Print
#Strip the white space nodes in place and return the same node
StripXml(doc_node)
#Print the node as serialized XML to stdout
Print(doc_node)
#Write the node as serialized XML to a file
f = open("tmp.xml", "w")
Print(doc_node, stream=f)
f.close()
Listing 4 illustrates creating a tiny XML document from scratch using standard DOM API, and printing the serialized result.
Listing 4: Creating a 4DOM document bit by bit and writing the result
from xml.dom import implementation
from xml.dom import EMPTY_NAMESPACE, XML_NAMESPACE
from xml.dom.ext import Print
#Create a document type node using the doctype name "message"
#A blank system ID and blank public ID (i.e. no DTD information)
doctype = implementation.createDocumentType("message", None, None)
#Create a document node, which also creates a document element node
#For the element, use a blank namespace URI and local name "message"
doc = implementation.createDocument(EMPTY_NAMESPACE, "message", doctype)
#Get the document element
msg_elem = doc.documentElement
#Create an xml:lang attribute on the new element
msg_elem.setAttributeNS(XML_NAMESPACE, "xml:lang", "en")
#Create a text node with some data in it
new_text = doc.createTextNode("You need Python")
#Add the new text node to the document element
msg_elem.appendChild(new_text)
#Print out the result
Print(doc)
Notice the use of EMPTY_NAMESPACE or None for blank URIs and ID fields. This is a Python-XML convention. Rather than using the empty string, which is, after all, a valid URI reference, None is used (EMPTY_NAMESPACE is merely an alias for None). This has the added value of being a bit more readable. If you run this, you'll see that no declaration is given for the XML namespace. This is because of the special status of that namespace. If you created elements or attributes in other namespaces, those namespace declarations would be rendered in the output.
4DOM is a useful tool for learning about DOM in general, but it's not a practical DOM for most uses. The problem is that it is very slow and memory intensive. A lot of effort goes into handling DOM arcana; in particular, the event system is a significant drag on performance, though it has some very interesting uses. I am one of the original authors of 4DOM, and yet I rarely use it any more. There are other, much faster DOM implementations which I'll be covering in future articles. Luckily, the lessons you learn experimenting with 4DOM are mostly applicable to other Python DOMs.
Canonicalization
Two XML files can be different and yet have the same effective
content to an XML processor. This is because XML defines that some
differences are insignificant, such as the order of attributes, the
choice of single or double quotes, and some differences in the use of
character entities. Canonicalization,
popularly abbreviated as c14n, is a mechanism for serializing XML which
normalizes all these differences. As such, c14n can be used to make XML
files easier to compare. One important application of this is security.
You may want to use technology to ensure that an XML file has not been
tampered with, but you might want this check to ignore insignificant
lexical differences. The xml.dom.ext.c14n module provides
canonicalization support for Python. It provides a
Canonicalize function which takes a DOM node (most
implementations should work), and writes out canonical XML.
XPath
PyXML includes XPath and XSLT implementations written completely in Python, but as of this writing, the XSLT module is still being integrated into PyXML and does not really work. If you need XSLT processing, you will want to get Python-libxslt, 4Suite, Pyana, or one of the other packages I mentioned in the last installment. The XPath support, however, does work and operates on DOM nodes. In fact, it can be a very handy shortcut from using DOM methods for navigating. The following snippet uses an XPath expression to retrieve a list of all elements in a DOM document which are in the XSLT namespace. If you want to run it, first of all prepend code to define "doc" as a DOM document node object.
from xml.ns import XSLT
#The XSLT namespace http://www.w3.org/1999/XSL/Transform
NS = XSLT.BASE
from xml.xpath import Context, Evaluate
#Create an XPath context with the given DOM node
#With no other nodes in the context list
#(list size 1, current position 1)
#And the given prefix/namespace mapping
con = Context.Context(doc, 1, 1, processorNss={"xsl": NS})
#Evaluate the XPath expression and return the resulting
#Python list of nodes
result = Evaluate("//xsl:*", context=con)
XPath expressions can also return other data types: strings, numbers,
and boolean. The first two are returned as regular Python strings and
floats. Booleans are returned as instances of a special class,
xml.utils.boolean. This will probably change in Python
2.3, which has a built-in boolean type.
If you are making extensive use of XPath, you might still want to consider the implementation in 4Suite, which is a faster version of the one in PyXML, with more bug fixes as well.
Et cetera
I mentioned some of the other modules in PyXML in the last
installment, including the xmlproc validating parser, in module
xml.parsers.xmlproc, Quick Parsing for XML, in module
xml.utils.qp_xml and the WDDX library, in module
xml.marshal.wddx. Marshalling is the process of
representing arbitrary Python objects as XML, and there is also a
general marshaling facility in the xml.marshal.generic
module. It has a similar interface to the standard pickle module.
Listing 6 marshals a small Python dictionary to XML.
from xml.marshal.generic import Marshaller
marshal = Marshaller()
obj = {1: [2, 3], 'a': 'b'}
#Dump to a string
xml_form = marshal.dumps(obj)
The created XML looks as follows (line wrapping and indentation added for clarity):
<?xml version="1.0"?>
<marshal>
<dictionary id="i2">
<int>1</int>
<list id="i3"><int>2</int><int>3</int></list>
<string>a</string>
<string>b</string>
</dictionary>
</marshal>
Also in Python and XML | |
Should Python and XML Coexist? | |
Wrap up
You can find more code examples in the demos directory of the PyXML package. The place to discuss PyXML or to report problems is the Python XML SIG mailing list.
In the next article, I'll offer a tour of 4Suite, which provides enhanced versions of some of PyXML's facilities, as well as some other unique features.