|
Great article. This is a rarely explored area of XML technology.
You will also need to consider the restrictions on allowable characters in element and attribute names. Spaces and slashes are not allowed. Dashes and dots are allowed, but not as the first character of a name, and a name can't start with a digit either.
Check the XML 1.0 spec for details.
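For what it's worth, here is a rough Python sketch of that check. It is an ASCII-only simplification of the Name production in the spec, which also allows many non-ASCII letters, so treat it as an approximation. The function names `is_xml_name` and `to_xml_name` are made up for illustration, and replacing bad characters with an underscore is just one possible convention.

```python
import re

# ASCII-only approximation of the XML 1.0 Name production (an assumption:
# the real production also admits a wide range of non-ASCII characters).
_NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def is_xml_name(candidate: str) -> bool:
    """Return True if candidate is usable as an element or attribute name."""
    return _NAME_RE.fullmatch(candidate) is not None


def to_xml_name(raw: str) -> str:
    """Hypothetical helper: coerce an arbitrary string into a legal name."""
    # Replace every character that may not appear anywhere in a name.
    cleaned = re.sub(r"[^A-Za-z0-9_:.\-]", "_", raw)
    # Prefix an underscore if the result is empty or starts with a
    # character that is only legal after the first position.
    if not cleaned or not re.match(r"[A-Za-z_:]", cleaned):
        cleaned = "_" + cleaned
    return cleaned
```

With this, `is_xml_name("first-name")` is true and `is_xml_name("first name")` is false, while `to_xml_name("a/b")` gives `a_b`. Note that the colon is technically legal in a name but is best avoided unless you are dealing with namespaces.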
There was once an HTML-to-SAX parser that used heuristics to cope with messy real-world HTML. I think it was called something like 'HEX'.
|