
Article:
 From Wiki to XML, through SGML
Subject: Please, more XML-Wiki articles
Date: 2004-03-04 02:00:37
From: Anthony Thompson

I think Wiki text is the perfect text for users to use in a home-grown content management system, since it's not too hard for users to learn that blank lines separate paragraphs, asterisks mean *bold*, etc.


Getting Wikitext -> XHTML, and the other way around (XHTML -> Wikitext), however, is the tough part, and something I imagine XML would be perfect for. Does anyone know of any solutions for this, other than having to go through SGML as this article indicates?




  • Please, more XML-Wiki articles
    2004-03-08 08:41:06 Brian Ewins [Reply]

    Going from XHTML to wiki text is fairly trivial with XSLT if you restrict the syntax enough, e.g. if you only look at pages generated by a specific CMS (such as another wiki).


    I was recently writing some code to pull 'text-like' chunks out of XHTML for a translator to work on, which is somewhat relevant. The spans of text with minimal embedded markup were identified by doing a depth-first search of a DOM for nodes that had mixed content (i.e. nodes with at least one non-blank text child node).
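A minimal sketch of that search, written in Python with the standard-library `xml.dom.minidom` rather than the Java mentioned later in this message. The function names are hypothetical, and stopping the descent once a mixed-content element is found is an assumption, not something stated above.

```python
from xml.dom import minidom, Node


def has_nonblank_text_child(node):
    """True if the node has at least one direct Text child that is not all whitespace."""
    return any(
        child.nodeType == Node.TEXT_NODE and child.data.strip()
        for child in node.childNodes
    )


def find_mixed_content(node, found=None):
    """Depth-first search that collects elements with mixed content.

    Assumption: once an element qualifies, its descendants are not searched,
    because they are treated as part of that element's text-like chunk.
    """
    if found is None:
        found = []
    if node.nodeType == Node.ELEMENT_NODE and has_nonblank_text_child(node):
        found.append(node)
        return found
    for child in node.childNodes:
        find_mixed_content(child, found)
    return found


doc = minidom.parseString(
    "<div><p>Hello <b>world</b></p><div><br/><img alt='x'/></div><p>Bye</p></div>"
)
mixed = find_mixed_content(doc.documentElement)
names = [element.tagName for element in mixed]
print(names)
```

Here the outer `div` and the inner `div` are skipped because neither has a non-blank text child of its own; only the two `p` elements are collected.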


    This gives you a list of child nodes that may look text-like. The list was further narrowed by removing, from its start and end, any nodes that contained no non-blank text nodes at any depth (e.g. this omits "br" padding).
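The narrowing step could look roughly like the following Python sketch. The names `contains_nonblank_text` and `narrow` are hypothetical, and returning the omitted leading and trailing nodes alongside the kept run is an assumption made so they remain available for later processing.

```python
from xml.dom import minidom, Node


def contains_nonblank_text(node):
    """True if the node is, or contains at any depth, a non-blank Text node."""
    if node.nodeType == Node.TEXT_NODE:
        return bool(node.data.strip())
    return any(contains_nonblank_text(child) for child in node.childNodes)


def narrow(children):
    """Split a child list into (leading, kept, trailing).

    Leading and trailing nodes that carry no non-blank text at any depth
    are trimmed off; everything between them is kept as the text-like run.
    """
    start, end = 0, len(children)
    while start < end and not contains_nonblank_text(children[start]):
        start += 1
    while end > start and not contains_nonblank_text(children[end - 1]):
        end -= 1
    return children[:start], children[start:end], children[end:]


doc = minidom.parseString("<p><br/> Hello <b>world</b><br/><br/></p>")
leading, kept, trailing = narrow(list(doc.documentElement.childNodes))
print([n.nodeName for n in leading])
print([n.nodeName for n in kept])
print([n.nodeName for n in trailing])
```

In this example the single leading `br` and the two trailing `br` elements are trimmed, leaving the text node and the `b` element.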


    We processed the omitted nodes to pull out some attributes too. The alt, title and value attributes were interesting for translation; the 9 URL attributes in HTML would obviously interest a wiki extractor: action, src, codebase, usemap, cite, href, longdesc, profile and background. (Note that background isn't in XHTML; it's a Netscape thing.)
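As an illustration only, here is one way the attribute extraction might be sketched in Python. The two attribute lists come from the paragraph above; the function name `collect_attributes` and the `(tag, attribute, value)` result shape are assumptions.

```python
from xml.dom import minidom, Node

# Attribute names taken from the message above.
TRANSLATABLE_ATTRS = ("alt", "title", "value")
URL_ATTRS = ("action", "src", "codebase", "usemap", "cite",
             "href", "longdesc", "profile", "background")


def collect_attributes(node, names, out=None):
    """Walk a subtree in document order, collecting (tag, attribute, value) triples."""
    if out is None:
        out = []
    if node.nodeType == Node.ELEMENT_NODE:
        for name in names:
            if node.hasAttribute(name):
                out.append((node.tagName, name, node.getAttribute(name)))
    for child in node.childNodes:
        collect_attributes(child, names, out)
    return out


doc = minidom.parseString(
    '<div><img src="a.png" alt="Logo"/>'
    '<a href="/home" title="Home"><img src="b.png"/></a></div>'
)
for_translation = collect_attributes(doc.documentElement, TRANSLATABLE_ATTRS)
for_links = collect_attributes(doc.documentElement, URL_ATTRS)
print(for_translation)
print(for_links)
```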


    Once you're down to this 'minimal' markup it should be even easier to get to a wiki-like representation, as you're generally only left with the a and span elements from your XHTML. I wrote this in Java, but looking back at where we ended up I'm sure the same algorithm is expressible in XSL (I'm not sure how you'd do the 'narrow' step, though).
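To make the last step concrete, here is a small Python sketch that flattens such minimal markup into wiki text. The target syntax is an assumption: links are written as `[url label]`, which is only one of many wiki conventions, and every other element (including span) is simply unwrapped. The function name `to_wiki` is hypothetical.

```python
from xml.dom import minidom, Node


def to_wiki(node):
    """Render a node as wiki text, assuming a '[url label]' link syntax."""
    if node.nodeType == Node.TEXT_NODE:
        return node.data
    if node.nodeType != Node.ELEMENT_NODE:
        return ""  # comments and other node types are dropped
    inner = "".join(to_wiki(child) for child in node.childNodes)
    if node.tagName == "a" and node.hasAttribute("href"):
        return "[%s %s]" % (node.getAttribute("href"), inner)
    return inner  # span and any other element: keep only the content


doc = minidom.parseString(
    '<p>See <a href="http://example.org/">the <span class="x">site</span></a> now.</p>'
)
wiki = to_wiki(doc.documentElement)
print(wiki)
```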




    • Please, more XML-Wiki articles
      2004-03-08 08:53:27 Brian Ewins [Reply]

      Reading that back, it's not clear what I meant. We treated XHTML as looking like:


      [markup unit...]
      [text unit...]
      [markup unit...]
      [text unit...]
      (etc)


      Each text unit is a DOM "DocumentFragment" which contains at least one non-blank "Text" node at the top level, and whose leading and trailing nodes are non-blank "Text", or "Element"s that contain a non-blank "Text" node at some depth. Markup units do not contain any non-blank "Text" nodes.


      The algorithm in the previous message describes how to get this segmentation, which captures all the text from XHTML with as little markup as possible, and you don't need any special handling for block/inline elements to do this.
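Putting the pieces together, the segmentation might be sketched in Python as below. This is a simplified reading of the description above, not the original Java: the names are hypothetical, units are returned as serialized strings rather than DOM "DocumentFragment" objects, adjacent markup units are not merged, and the tags of wrapper elements that merely contain text units deeper down are not reported.

```python
from xml.dom import minidom, Node


def nonblank(node):
    """True if the node is, or contains at any depth, a non-blank Text node."""
    if node.nodeType == Node.TEXT_NODE:
        return bool(node.data.strip())
    return any(nonblank(child) for child in node.childNodes)


def segment(node, units):
    """Append ('markup', xml) and ('text', xml) tuples to units in document order."""
    if node.nodeType != Node.ELEMENT_NODE:
        return  # blank text and comments outside text units are ignored
    kids = list(node.childNodes)
    mixed = any(k.nodeType == Node.TEXT_NODE and k.data.strip() for k in kids)
    if not mixed:
        if nonblank(node):
            for kid in kids:  # text lives deeper down: keep searching
                segment(kid, units)
        else:
            units.append(("markup", node.toxml()))  # no text anywhere inside
        return
    # Mixed content: trim leading/trailing children that carry no text.
    start, end = 0, len(kids)
    while not nonblank(kids[start]):
        start += 1
    while not nonblank(kids[end - 1]):
        end -= 1
    for kid in kids[:start]:
        if kid.nodeType == Node.ELEMENT_NODE:
            units.append(("markup", kid.toxml()))
    units.append(("text", "".join(k.toxml() for k in kids[start:end])))
    for kid in kids[end:]:
        if kid.nodeType == Node.ELEMENT_NODE:
            units.append(("markup", kid.toxml()))


doc = minidom.parseString(
    "<body><div><br/><p>Hello <b>world</b><br/></p></div><hr/><p>Bye</p></body>"
)
units = []
segment(doc.documentElement, units)
kinds = [kind for kind, _ in units]
texts = [xml for kind, xml in units if kind == "text"]
print(kinds)
print(texts)
```

Note that nothing in the sketch checks whether an element is block-level or inline; the split depends only on where the non-blank text nodes sit.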

