Article:
 Introducing PyRXP
Subject: PyRXP unicode conformance
Date: 2004-02-29 00:19:46
From: Arno Paehler

I find that the comments on "non-conformance"
do not do justice to PyRXP.
I use PyRXP extensively. My problem set
does not contain ANY unicode and "conformance"
hence is a non-issue.
Memory footprint and speed in this case DO
however matter, as I am dealing with
processing roughly 50,000 files with on-disk
sizes of up to 450 MB for the largest file,
containing roughly 900,000 tags.
Processing time of said file with an Athlon
XP 1800+ and 1.5 GB memory was about 20 minutes
and I do not dare to imagine what time any
DO(o)Med implementation might take.
Summary: if you need conformance because you
are an archaeologist, try something else.
BUT PyRXP is an outstanding tool, disqualified
and trashed in a very unqualified way.

  • PyRXP XML non-conformance
    2004-02-29 07:38:04 Uche Ogbuji

    PyRXP does not conform to the XML 1.0 standard, although it claims to. This is inimical to the very idea of standards. Standards are not designed to satisfy your specific requirements on your specific machine. They are designed with tradeoffs for everyone in mind. In XML one of the most important tradeoffs is that Unicode is the fundamental basis for XML, even though clearly processing Unicode is more expensive than processing more limited character sets.


    If you need performance greater than what XML can accommodate, you should not be using XML. Plain text parsing options are a couple of orders of magnitude more efficient than PyRXP, so why do you put up with even PyRXP's relative slowness and bloat?


    Your sentence about "archaeologists" seems to indicate that you didn't even read the article. I ran into PyRXP's non-conformance while parsing a file that used an ellipsis, which is a character, you'd probably have to admit, used by far more people than archaeologists. The specific example in which I selected a character that happens to be of interest to archaeologists was when I was actually proving a positive of PyRXPU, the true XML parser in the Python/RXP family, against a specific corner case.


    I never indicated that expressing Linear B is likely to be a common need. But are you trying to minimize the need in the real world for expressing Arabic, Chinese, Japanese, Korean and even the many high Unicode characters used in European language documents, such as smart quotes, em and en dashes and ellipses? PyRXP cannot handle any of these.


    If you don't like the specifications that others make, then go invent your own (I myself have done this before and expect to do so again), but don't then try to confuse people about what you've done. There is already a standard called XML and PyRXP does not conform to it, so no one should cause confusion by calling PyRXP an XML parser. I've done my bit to reduce the confusion by explaining the facts in detail. What you do with that information is your choice.


    --Uche
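
For readers who want to check the character-handling point themselves, here is a minimal sketch of what a conforming parser is expected to do with the characters named in this thread. It assumes Python 3 and the standard library's xml.etree.ElementTree; it does not use PyRXP or PyRXPU, and the helper name first_text and the sample string are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def first_text(xml_bytes: bytes) -> str:
    """Parse an XML document given as bytes and return the root element's text."""
    root = ET.fromstring(xml_bytes)
    return root.text or ""


# Illustrative sample: ellipsis (U+2026), em dash (U+2014), smart quotes
# (U+201C, U+201D), and a Linear B syllable (U+10000, outside the BMP).
sample = "Wait\u2026 \u2014 \u201cquoted\u201d \U00010000"
doc = ('<?xml version="1.0" encoding="utf-8"?><p>' + sample + "</p>").encode("utf-8")

# A conforming parser should hand back exactly the characters that went in.
parsed = first_text(doc)
assert parsed == sample
```

The same round-trip check can be pointed at any other parser to see whether it preserves these characters.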


