XML.com: XML From the Inside Out


Article:
 Introducing PyRXP
Subject: PyRXP XML non-conformance
Date: 2004-02-29 07:38:04
From: Uche Ogbuji
Response to: PyRXP unicode conformance

PyRXP does not conform to the XML 1.0 standard, although it claims to. This is inimical to the very idea of standards. Standards are not designed to satisfy your specific requirements on your specific machine. They are designed with tradeoffs for everyone in mind. In XML, one of the most important tradeoffs is that Unicode is the fundamental basis of the format, even though processing Unicode is clearly more expensive than processing more limited character sets.


If you need performance greater than what XML can accommodate, you should not be using XML. Plain text parsing options are a couple of orders of magnitude more efficient than PyRXP, so why do you put up with even PyRXP's relative slowness and bloat?


Your sentence about "archaeologists" seems to indicate that you didn't even read the article. I ran into PyRXP's non-conformance while parsing a file that used an ellipsis, a character that, you'd probably have to admit, is used by far more people than archaeologists. The specific example in which I selected a character that happens to be of interest to archaeologists came when I was actually demonstrating a positive of PyRXPU, the true XML parser in the Python/RXP family, against a specific corner case.


I never indicated that expressing Linear B is likely to be a common need. But are you trying to minimize the real-world need to express Arabic, Chinese, Japanese, Korean, and even the many high Unicode characters used in European-language documents, such as smart quotes, em and en dashes, and ellipses? PyRXP cannot handle any of these.
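
To make the point concrete, here is a minimal sketch of what a conforming parser is expected to do with such characters. It uses Python's standard-library `xml.etree.ElementTree`, not PyRXP, and the sample document and the helper name `text_of` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an ellipsis (U+2026), smart quotes (U+201C/U+201D),
# an em dash (U+2014), and a Linear B syllable (U+10000, outside the BMP).
SAMPLE = "<doc>Wait\u2026 \u201cquoted\u201d \u2014 \U00010000</doc>"


def text_of(xml_bytes):
    """Parse UTF-8 encoded XML bytes and return the root element's text."""
    return ET.fromstring(xml_bytes).text


parsed = text_of(SAMPLE.encode("utf-8"))

# List the code points above ASCII that survived the round trip.
print([hex(ord(ch)) for ch in parsed if ord(ch) > 0x7F])
```

If the parser conforms, the text that comes back is identical to the text that went in, including the character outside the Basic Multilingual Plane.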


If you don't like the specifications that others make, then go invent your own (I myself have done this before and expect to do so again), but don't then try to confuse people about what you've done. There is already a standard called XML, and PyRXP does not conform to it, so no one should cause confusion by calling PyRXP an XML parser. I've done my bit to reduce the confusion by explaining the facts in detail. What you do with that information is your choice.


--Uche


  • PyRXP XML non-conformance
    2004-06-24 15:30:16 PaulMayer [Reply]

    I use PyRXP extensively, and my problem set
    does not contain ANY Unicode, so "conformance"
    is a non-issue. Memory footprint and speed, however,
    DO matter in this case, as I am processing roughly
    50,000 files with on-disk sizes of up to 450 MB for
    the largest file, which contains roughly 900,000 tags.


    • PyRXP XML non-conformance
      2004-06-29 14:47:16 Uche Ogbuji [Reply]

      As I pointed out in the article, if you are not using XML, then you don't need to deal with all the mess of angle brackets in the first place.


      I can easily process 450MB and even 900MB files 10 times faster than PyRXP can, usually by writing a dozen lines of Python. When I want to actually use XML (which is defined by the W3C spec whether you like it or not), things become vastly more complex, and thus much slower.


      If all you're using is plain text, then just use plain text. If, however, you're using XML, which *is* Unicode (it is meaningless to say "there is no Unicode in my XML data"), then use an XML tool. My main point in this article is that PyRXP is not an XML tool. Full stop.
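
A small sketch of why "no Unicode in my XML" does not hold up: a file whose bytes are all plain ASCII can still carry any Unicode character through a numeric character reference, and an XML parser has to resolve it. This uses Python's standard-library `xml.etree.ElementTree`; the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: every byte is plain ASCII, yet the document
# still contains U+2026 by way of a numeric character reference.
ASCII_ONLY = b"<note>and so on&#x2026;</note>"

# Confirm that the raw input really is 7-bit clean.
assert all(byte < 0x80 for byte in ASCII_ONLY)

note_text = ET.fromstring(ASCII_ONLY).text

# The parsed text ends with a single non-ASCII character.
print(repr(note_text))
print(hex(ord(note_text[-1])))
```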


      This is really not that subtle a point.


      Why people insist on mucking around with pointy brackets and attributes when all they need is a plain text CSV or INI-like format is beyond me, nor is it a question that interests me.


      --Uche

