|
Reading that back, it's not clear what I meant. We treated XHTML as a sequence of alternating units:
[markup unit...]
[text unit...]
[markup unit...]
[text unit...]
(etc)
Each text unit is a DOM "DocumentFragment" which contains at least one non-blank "Text" node at the top level, and whose first and last nodes are each either a non-blank "Text" node or an "Element" that contains a non-blank "Text" node at some depth. Markup units do not contain any non-blank "Text" nodes.
The algorithm in the previous message describes how to get this segmentation, which captures all the text from XHTML with as little markup as possible. You don't need any special handling for block/inline elements to do this.
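
To make the definition concrete, here is a rough Python sketch of one way to find the text units. It is not necessarily the algorithm from the previous message, just an illustration under my own assumptions: it uses the standard library's xml.dom.minidom, returns each text unit as a plain list of sibling nodes rather than a real DocumentFragment, and treats everything it does not return as markup. The names has_text and text_units are made up for this example.

```python
from xml.dom import minidom, Node

def has_text(node):
    """True if node is, or contains at any depth, a non-blank Text node."""
    if node.nodeType == Node.TEXT_NODE:
        return bool(node.data.strip())
    return any(has_text(child) for child in node.childNodes)

def text_units(node):
    """Yield each text unit as a list of sibling nodes (hypothetical helper).

    Anything not yielded counts as markup.
    """
    children = list(node.childNodes)
    top_level_text = any(
        c.nodeType == Node.TEXT_NODE and c.data.strip() for c in children
    )
    if top_level_text:
        # Trim leading and trailing siblings that carry no text, so the unit
        # starts and ends on a text-bearing node.
        keep = [i for i, c in enumerate(children) if has_text(c)]
        yield children[keep[0]:keep[-1] + 1]
    else:
        # No non-blank text directly here: descend into each element.
        for c in children:
            if c.nodeType == Node.ELEMENT_NODE:
                yield from text_units(c)

doc = minidom.parseString(
    "<body><div> <p>Hello <b>big</b> world</p> <hr/> <p><i>Bye</i></p> </div></body>"
)
for unit in text_units(doc.documentElement):
    print("".join(n.toxml() for n in unit))
```

For that input the two text units are "Hello <b>big</b> world" and "Bye". The second one loses both the p and the i wrappers because neither has a non-blank "Text" node of its own at the top level, and nothing in the code asks whether an element is block or inline.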
|