
Article:
 Non-Extractive Parsing for XML
Subject: How you deal with encoding
Date: 2004-05-20 02:38:26
From: Jirka Kosek

I see some problems with your proposal. The one that concerns me most is related to encoding.


An XML document can be represented in many possible encodings (UTF-8, UTF-16, ISO-8859-1, ...). If you map the document directly into memory without some sort of encoding normalization (which most of today's parsers perform), you will be forced to manually encode and decode every string that is read from or written into the document.


Do you have any idea how to deal with this problem?




  • How you deal with encoding
    2004-06-22 17:40:18 jimmy_z [Reply]

    Jirka,


    We (Ximpleware) recently released our software (in Java) under the GPL. I would like to personally invite you to visit the project web site (http://vtd-xml.sf.net). Your suggestions and feedback are also very welcome.
    Cheers,
    Jimmy

  • How you deal with encoding
    2004-05-20 10:52:57 jimmy_z [Reply]

    Thanks for posting this question.


    One way to deal with character encoding is to build "intelligence" directly into the various non-extractive string comparison functions.


    Most people are used to the UCS-2 string representation in their code, so a "non-extractive" comparison function needs to compare UTF-8 (or UTF-16) tokens against UCS-2 strings.
    In addition, it may also resolve entity references on the fly during the comparison.


    The same applies to text-to-number conversion. A non-extractive version of "parseInt" needs to convert a UTF-8 (or UTF-16) token into an integer without "extracting" it from the source document.


    Hope I answered your question.
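
The two ideas in the reply above, comparing a token in place and converting it to a number in place, could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not code from VTD-XML: the class name `NonExtractive` and both method signatures are hypothetical, a token is assumed to be described by a byte offset and a byte length into the raw document, the document is assumed to be well-formed UTF-8 that a parser has already validated, and entity references are not resolved.

```java
// Hypothetical helper; names and signatures are illustrative assumptions,
// not part of VTD-XML or any other library.
final class NonExtractive {

    // Compares the UTF-8 token at doc[offset, offset + length) with a Java
    // (UTF-16) string, decoding one code point at a time so that no String
    // is ever built from the token. Assumes the bytes are well-formed UTF-8.
    static boolean matches(byte[] doc, int offset, int length, String s) {
        int i = offset;
        int end = offset + length;
        int j = 0;
        while (i < end && j < s.length()) {
            int b = doc[i] & 0xFF;
            int cp; // code point being decoded from the document
            int n;  // number of bytes that code point occupies
            if (b < 0x80)      { cp = b;        n = 1; }
            else if (b < 0xE0) { cp = b & 0x1F; n = 2; }
            else if (b < 0xF0) { cp = b & 0x0F; n = 3; }
            else               { cp = b & 0x07; n = 4; }
            if (i + n > end) {
                return false; // sequence runs past the end of the token
            }
            for (int k = 1; k < n; k++) {
                cp = (cp << 6) | (doc[i + k] & 0x3F);
            }
            int expected = s.codePointAt(j);
            if (cp != expected) {
                return false;
            }
            i += n;
            j += Character.charCount(expected);
        }
        // Equal only if both the token and the string are fully consumed.
        return i == end && j == s.length();
    }

    // Converts the token at doc[offset, offset + length) to an int without
    // extracting it. ASCII digits are single bytes in UTF-8, so the bytes
    // can be read directly; any other byte is rejected.
    static int parseInt(byte[] doc, int offset, int length) {
        int i = offset;
        int end = offset + length;
        boolean negative = false;
        if (i < end && (doc[i] == '-' || doc[i] == '+')) {
            negative = doc[i] == '-';
            i++;
        }
        if (i == end) {
            throw new NumberFormatException("no digits in token");
        }
        long value = 0;
        for (; i < end; i++) {
            int d = doc[i] - '0';
            if (d < 0 || d > 9) {
                throw new NumberFormatException("not a digit at byte " + i);
            }
            value = value * 10 + d;
            if (value > Integer.MAX_VALUE + 1L) {
                throw new NumberFormatException("out of int range");
            }
        }
        if (!negative && value > Integer.MAX_VALUE) {
            throw new NumberFormatException("out of int range");
        }
        return (int) (negative ? -value : value);
    }
}
```

For a document held as `"<a>caf\u00e9</a><n>-42</n>"` encoded in UTF-8, the text of `a` occupies 5 bytes starting at offset 3 and the text of `n` occupies 3 bytes starting at offset 15, so a caller would write `NonExtractive.matches(doc, 3, 5, "caf\u00e9")` and `NonExtractive.parseInt(doc, 15, 3)`. A UTF-16 document would need a second decoding branch, which is omitted here.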

