Article:
Introducing PyRXP
Subject: PyRXP unicode conformance
Date: 2004-02-29 00:19:46
From: Arno Paehler
I find that the comments on "non-conformance"
do not do justice to PyRXP.
I use PyRXP extensively. My problem set
does not contain ANY Unicode, and "conformance"
is hence a non-issue.
Memory footprint and speed in this case DO
matter, however, as I am dealing with
processing roughly 50,000 files with on-disk
sizes of up to 450 MB for the largest file,
which contains roughly 900,000 tags.
Processing time for said file with an Athlon
XP 1800+ and 1.5 GB of memory was about 20 minutes,
and I do not dare to imagine what time any
DO(o)Med implementation might take.
Summary: if you need conformance because you
are an archaeologist, try something else.
BUT PyRXP is an outstanding tool, disqualified
and trashed in a very unqualified way.
- PyRXP XML non-conformance
2004-02-29 07:38:04 Uche Ogbuji
[Reply]
PyRXP does not conform to the XML 1.0 standard, although it claims to. This is inimical to the very idea of standards. Standards are not designed to satisfy your specific requirements on your specific machine. They are designed with tradeoffs for everyone in mind. In XML one of the most important tradeoffs is that Unicode is the fundamental basis for XML, even though processing Unicode is clearly more expensive than processing more limited character sets.
If you need performance greater than what XML can accommodate, you should not be using XML. Plain text parsing options are a couple of orders of magnitude more efficient than PyRXP, so why do you put up with even PyRXP's relative slowness and bloat?
Your sentence about "archaeologists" seems to indicate that you didn't even read the article. I ran into PyRXP's non-conformance while parsing a file that used an ellipsis, which is a character, you'd probably have to admit, used by far more people than archaeologists. The specific example in which I selected a character that happens to be of interest to archaeologists was when I was actually proving a positive of PyRXPU, the true XML parser in the Python/RXP family, against a specific corner case.
I never indicated that expressing Linear B is likely to be a common need. But are you trying to minimize the real-world need for expressing Arabic, Chinese, Japanese, Korean and even the many high Unicode characters used in European-language documents, such as smart quotes, em and en dashes and ellipses? PyRXP cannot handle any of these.
If you don't like the specifications that others make, then go invent your own (I myself have done this before and expect to do so again), but don't then try to confuse people about what you've done. There is already a standard called XML and PyRXP does not conform to it, so no one should cause confusion by calling PyRXP an XML parser. I've done my bit to reduce the confusion by explaining the facts in detail. What you do with that information is your choice.
--Uche
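To make the point about ellipses, smart quotes and dashes concrete, here is a minimal sketch. It assumes Python 3 and uses the standard library's xml.etree.ElementTree purely for illustration; it is neither PyRXP nor PyRXPU, and the sample document is made up. It parses a small UTF-8 document and lists the characters whose code points lie above U+00FF, that is, characters that cannot be stored in a single byte.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: an ellipsis (U+2026), a pair of
# smart quotes (U+201C, U+201D) and an em dash (U+2014).
doc = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<p>Wait\u2026 \u201cquoted\u201d \u2014 done</p>'
).encode("utf-8")

# Parse the UTF-8 bytes; the parser decodes them to Unicode text.
root = ET.fromstring(doc)
text = root.text

# Collect every character whose code point does not fit in one byte.
non_8bit = [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0xFF]
print(non_8bit)
```

All four of these everyday punctuation characters fall outside the single-byte range, which is why they make a convenient quick check for any tool that claims to parse XML.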
- PyRXP XML non-conformance
2004-06-24 15:30:16 PaulMayer
[Reply]
I use PyRXP extensively, and my problem set
does not contain ANY Unicode, so "conformance"
is hence a non-issue. Memory footprint and speed in this case DO matter, however, as I am dealing with processing roughly 50,000 files with on-disk
sizes of up to 450 MB for the largest file,
which contains roughly 900,000 tags.
- PyRXP XML non-conformance
2004-06-29 14:47:16 Uche Ogbuji
[Reply]
As I pointed out in the article, if you are not using XML, then you don't need to deal with all the mess of angle brackets in the first place.
I can easily process 450MB and even 900MB files 10 times faster than PyRXP can, usually by writing a dozen lines of Python. When I want to actually use XML (which is defined by the W3C spec whether you like it or not) things become vastly more complex, and thus much slower.
If all you're using is plain text, then just use plain text. If, however, you're using XML, which *is* Unicode (it is meaningless to say "there is no Unicode in my XML data"), then use an XML tool. My main point in this article is that PyRXP is not an XML tool. Full stop.
This is really not that subtle a point.
Why people insist on mucking around with pointy brackets and attributes when all they need is a plain text CSV or INI-like format is beyond me, nor is it a question that interests me.
--Uche
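As a rough illustration of the "dozen lines of Python" approach to plain-text data mentioned above, here is a minimal sketch using the standard library's csv module. The column names and records are hypothetical, and an in-memory buffer stands in for a file on disk; it makes no claim about relative speed.

```python
import csv
import io

# Hypothetical records; a real run would open a file on disk instead.
data = io.StringIO("id,name,size\n1,alpha,10\n2,beta,32\n")

total = 0
names = []
# DictReader streams one row at a time, keyed by the header line.
for row in csv.DictReader(data):
    names.append(row["name"])
    total += int(row["size"])

print(names, total)
```

Because rows are read one at a time, the same loop works on large files without loading them fully into memory.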