|
Going from XHTML to wiki text is fairly trivial with XSLT if you restrict the syntax enough, e.g. if you only look at pages generated by a specific CMS (such as another wiki).
I was recently writing some code to pull 'text-like' chunks out of XHTML for a translator to work on, which is somewhat relevant. The spans of text with minimal embedded markup were identified by doing a depth-first search of the DOM for nodes that had mixed content (i.e. they have at least one non-blank text child node).
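For what it's worth, the search could look roughly like the sketch below. This is not the code we actually wrote. The class name MixedContentFinder and its methods are made up for illustration, and stopping at the outermost mixed-content element (rather than also reporting nested ones) is my assumption. It also assumes the input parses as well-formed XML with no DOCTYPE to resolve.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not from the original project.
final class MixedContentFinder {

    /** Parses a well-formed XHTML fragment and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** True if the element has at least one direct text child that is not just whitespace. */
    static boolean hasMixedContent(Element e) {
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE && !c.getNodeValue().trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    /** Depth-first walk that collects mixed-content elements in document order. */
    static List<Element> find(Node root) {
        List<Element> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node n, List<Element> out) {
        if (n.getNodeType() == Node.ELEMENT_NODE && hasMixedContent((Element) n)) {
            out.add((Element) n);
            return; // assumption: do not look for further candidates inside a match
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }
}
```

Calling MixedContentFinder.find(MixedContentFinder.parse(...)) on a page fragment gives you the candidate elements; each one's child list is then a possible text run.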
This gives you a list of child nodes that may look text-like. The list was further narrowed by removing, from its start and end, any nodes that contained no non-blank text nodes at any depth (e.g. this omits "br" padding).
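The narrowing step is simple enough to sketch as well. Again, this is a reconstruction from memory of the idea, not our real code, and TextRunTrimmer and its method names are invented here. It only strips from the two ends of the child list and leaves empty nodes in the middle alone.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not from the original project.
final class TextRunTrimmer {

    /** Parses a well-formed XHTML fragment and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** True if n is, or contains at any depth, a text node with non-whitespace characters. */
    static boolean hasVisibleText(Node n) {
        if (n.getNodeType() == Node.TEXT_NODE) {
            return !n.getNodeValue().trim().isEmpty();
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (hasVisibleText(c)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the children of parent with text-free nodes dropped from both ends. */
    static List<Node> trim(Node parent) {
        List<Node> kids = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            kids.add(c);
        }
        int start = 0;
        int end = kids.size();
        while (start < end && !hasVisibleText(kids.get(start))) {
            start++; // skip leading padding such as <br/>
        }
        while (end > start && !hasVisibleText(kids.get(end - 1))) {
            end--; // skip trailing padding
        }
        return new ArrayList<>(kids.subList(start, end));
    }
}
```

The nodes that fall outside the returned range are the "omitted" ones, so if you want them for later processing you would keep the two skipped sub-lists rather than throwing them away.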
We processed the omitted nodes to pull out some attributes too. The alt, title and value attributes were interesting for translation; the 9 URL attributes in HTML would obviously interest a wiki extractor: action, src, codebase, usemap, cite, href, longdesc, profile and background (NB background isn't in strict XHTML, it started out as a Netscape thing).
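Pulling the attributes out is just another tree walk. A rough sketch follows; AttributeHarvester and the "name=value" string format are made up for this post, and only the two attribute lists come from what I described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not from the original project.
final class AttributeHarvester {

    /** Attributes that carry human-readable text worth translating. */
    static final List<String> TRANSLATABLE = Arrays.asList("alt", "title", "value");

    /** Attributes that carry URLs, which a wiki extractor would care about. */
    static final List<String> URL_ATTRS = Arrays.asList(
            "action", "src", "codebase", "usemap", "cite",
            "href", "longdesc", "profile", "background");

    /** Parses a well-formed XHTML fragment and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Collects "name=value" strings for the given attribute names, in document order. */
    static List<String> harvest(Node root, List<String> names) {
        List<String> out = new ArrayList<>();
        collect(root, names, out);
        return out;
    }

    private static void collect(Node n, List<String> names, List<String> out) {
        if (n.getNodeType() == Node.ELEMENT_NODE) {
            Element e = (Element) n;
            for (String name : names) {
                if (e.hasAttribute(name)) {
                    out.add(name + "=" + e.getAttribute(name));
                }
            }
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, names, out);
        }
    }
}
```

Looking attributes up by name, instead of iterating each element's attribute map, keeps the output order predictable, since DOM attribute maps make no ordering promise.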
Once you're down to this 'minimal' markup it should be even easier to get to a wiki-like representation, as you're generally only left with a and span elements from your XHTML. I wrote this in Java, but looking back at where we ended up I'm sure the same algorithm is expressible in XSLT (I'm not sure how you'd do the 'narrow' step, though).
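To give an idea of how little is left to do at that point, here is a toy renderer for such a run. It is only a sketch: the class WikiTextSketch is invented for this post, and the target syntax is my assumption (a MediaWiki-style external link, [url label]). Span and any other element are simply flattened to their text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not from the original project.
final class WikiTextSketch {

    /** Parses a well-formed XHTML fragment and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Renders a node and its descendants as wiki-like text. */
    static String toWiki(Node n) {
        StringBuilder sb = new StringBuilder();
        append(n, sb);
        return sb.toString();
    }

    private static void append(Node n, StringBuilder sb) {
        if (n.getNodeType() == Node.TEXT_NODE) {
            sb.append(n.getNodeValue());
            return;
        }
        // Only <a href="..."> gets markup; the link syntax is an assumed wiki dialect.
        boolean link = n.getNodeType() == Node.ELEMENT_NODE
                && "a".equals(n.getNodeName())
                && ((Element) n).hasAttribute("href");
        if (link) {
            sb.append('[').append(((Element) n).getAttribute("href")).append(' ');
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            append(c, sb);
        }
        if (link) {
            sb.append(']');
        }
    }
}
```

A real converter would need to escape characters that are special in the target wiki and decide what span classes mean, which this deliberately ignores.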
|