Migrating to XForms

November 1, 2006

In 2001, the W3C set out to create an XML standard for implementing user forms in XHTML by publishing the XForms 1.0 Working Draft. The purpose of XForms is to eventually replace existing HTML forms, which are limited in capability and notoriously difficult to develop in. If you are not familiar with XForms, or aren't convinced of their benefits, start off by checking out What are XForms.

In March of this year, the W3C announced the XForms 1.0 Second Edition Recommendation. In July, Mozilla announced Preview Release 0.6 of their XForms extension. It won't be long until browsers begin supporting XForms, and once this happens, they will be the prevalent and preferred method of user data collection on the internet. Until then, it's in our best interest to begin migrating our current XHTML forms to XForms so that we're ready once the new standard is mainstream.

Our goal here is to take an XHTML document containing one or more standard forms, convert the forms into XForms format while preserving all of the information, and generate a new XHTML document as a result. To achieve this, we will be using PHP's XML parser functions (xml_parser_create() and its relatives), which have been around since PHP 4 and have been used in many PHP APIs, such as Magpie (an RSS parser) and nuSOAP (a library for web services support).

Figure 1. XForms Parser

Figure 1 is an overview of how the system will work. Essentially, there are three main phases (grey). In Phase 1, we prepare the input file for parsing and split it into several segments. In Phase 2, we actually pass the data through the parser. Note that the only segment of the input file that is actually parsed is the <body> tag (green). Because XForms require elements in both the <head> and <body> HTML, the parser will also append data to the contents of the <head> tag. This appended data is labeled "A" (orange). "B" represents the portion of the input XHTML that closes the <head> tag. Each phase will be explained separately.

Phase 1: Preparing the Input

As is evident in Figure 1, it is crucial that we split the input file into many segments so that we parse only the portion of the XHTML file that we need to, and so that we append the necessary XForms elements to the <head> tag. To accomplish this, we use two PHP functions: stripos() and substr(). The first function tells us the position of a string (needle) inside a larger string (haystack). We will pass the result we get from this function to the second function: substr(). As you might guess, substr() gives us a part (substring) of a larger string -- all we have to tell it is the start position and the substring's desired length.

Now that you understand what we're doing, you're probably wondering why we're doing it. Take a look at the code below, and you should get a clearer idea:

/*A*/

$instr = file_get_contents("inputform.html");

$pos["headstart"] = stripos($instr,"<head>");

$pos["headend"] = stripos($instr,"</head>");

$pos["bodystart"] = stripos($instr,"<body>");

$pos["bodyend"] = stripos($instr,"</body>")+7;



/*B*/

$input["top"] = substr($instr,0,$pos["headstart"]);

$input["head"] = substr($instr,$pos["headstart"],$pos["headend"]-$pos["headstart"]);

$input["middle"] = substr($instr,$pos["headend"],$pos["bodystart"]-$pos["headend"]);

$input["body"] = substr($instr,$pos["bodystart"],$pos["bodyend"]-$pos["bodystart"]);

$input["bottom"] = substr($instr,$pos["bodyend"]);

A: file_get_contents() fetches the contents of the input HTML (inputform.html) and stores it in the variable $instr. The next four lines call stripos() to get the positions of the opening <head> tag, the closing </head> tag, the opening <body> tag, and the closing </body> tag (respectively). We added 7 (the length of the string "</body>") to the position of the closing </body> tag so that the stored position is that of the first character after it. To understand why we've made this exception, let's look at the second part of the code.

B: Here we call substr() and split the input into the five sections outlined in Figure 1. The first parameter passed to substr() is the input string (in this case, $instr), the second is the position of the first character of the substring that will be returned, and the third parameter is the length of the desired substring. We already have the right positions (the simple algebra used to verify this has been omitted), so we simply pass the positions we got in the previous four lines. We added "7" to the last position retrieved (i.e., the closing </body> tag) so that we include this closing tag inside the $input["body"] substring. We do this because this substring will be the one passed to the parser; we include the closing tag so that the substring runs through the parser without throwing an error.
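The omitted algebra can be checked mechanically: if the positions are right, gluing the five segments back together must reproduce the input exactly. The following is a minimal sketch of that check, not part of the original translator. The function name splitDocument() and the sample document are assumptions made for illustration, and the sketch assumes, like the code above, that the <head> and <body> tags carry no attributes.

```php
<?php
// Hypothetical helper: splits a document into the five segments used above.
// Returns null when any of the four tags cannot be found.
function splitDocument($html)
{
    $headStart = stripos($html, "<head>");
    $headEnd   = stripos($html, "</head>");
    $bodyStart = stripos($html, "<body>");
    $bodyEnd   = stripos($html, "</body>");

    // stripos() returns false when the needle is missing, so compare strictly.
    if ($headStart === false || $headEnd === false
        || $bodyStart === false || $bodyEnd === false) {
        return null;
    }

    // Move past the closing tag so that it stays inside the "body" segment.
    $bodyEnd += strlen("</body>");

    return array(
        "top"    => substr($html, 0, $headStart),
        "head"   => substr($html, $headStart, $headEnd - $headStart),
        "middle" => substr($html, $headEnd, $bodyStart - $headEnd),
        "body"   => substr($html, $bodyStart, $bodyEnd - $bodyStart),
        "bottom" => substr($html, $bodyEnd),
    );
}

// Sample input, made up for this check.
$doc = "<html><head><title>t</title></head>\n<body><p>x</p></body></html>";
$parts = splitDocument($doc);

// The segments, joined in order, should give back the original document.
$roundTripOk = (implode("", $parts) === $doc);
```

Because the segments are contiguous and non-overlapping, the round trip succeeds whenever all four tags are found. A missing tag is reported as null rather than silently producing wrong offsets.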

Because the PHP parser is designed primarily for XML input, we will need to make some minor changes to the contents of the <body> tag (stored in $input["body"]). For example, the following three form tags would each throw a PHP parser error:

<input type="text" name="t" disabled />

<input type="checkbox" name="c" value="c1" checked />

<select multiple name="s">

<option value="1">One</option>

</select>

This happens because XML does not allow attributes without values, and disabled, checked, and multiple are all written that way here. To avoid this, we will "trick" the parser by assigning empty values to these attributes so that the modified HTML looks like this:

<input type="text" name="t" disabled="" />

<input type="checkbox" name="c" value="c1" checked="" />

<select multiple="" name="s">

<option value="1">One</option>

</select>

The following code accomplishes this task:

$fixatt = array("multiple","checked","disabled");

foreach ($fixatt as $a)

    $input["body"] = str_replace(" $a "," $a=\"\" ",$input["body"]);

str_replace() is another useful PHP function. It searches for a certain string (first parameter) inside a larger string (third parameter), and replaces it with a replacement string (second parameter). The function returns the new, modified string. Note that if you plan to extend this code to larger HTML files with mixed data, you should use the preg_replace() function instead because str_replace() will not be selective enough in some cases. That is, if the text of your HTML body contains any of the words in $fixatt surrounded by spaces, they will automatically have ="" appended to them. You can be more specific with preg_replace() since it uses regular expressions, thus allowing you to limit modifications to only those within form tags.
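One way to make the replacement more selective is sketched below. It is not the article's code: the function name fixBooleanAttributes() and the list of tags it inspects are assumptions made for illustration. The idea is to find opening form-control tags first and to expand bare attributes only inside those matches. It relies on anonymous functions, so it needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper: gives bare boolean attributes an empty value,
// but only inside opening tags of common form controls.
function fixBooleanAttributes($html, $attributes = array("multiple", "checked", "disabled"))
{
    $names = implode("|", array_map("preg_quote", $attributes));

    // An opening <input>, <select>, <option>, <textarea>, or <button> tag.
    $tagPattern = '/<(?:input|select|option|textarea|button)\b[^>]*>/i';

    // One of the attribute names, preceded by whitespace and NOT followed by "=".
    $attrPattern = '/\s(' . $names . ')(?=\s|\/|>)(?!\s*=)/i';

    return preg_replace_callback(
        $tagPattern,
        function ($match) use ($attrPattern) {
            // $match[0] is the complete tag; rewrite attributes within it only.
            return preg_replace($attrPattern, ' $1=""', $match[0]);
        },
        $html
    );
}

$fixed = fixBooleanAttributes('<p>He checked it</p><input type="checkbox" name="c" checked />');
```

Text outside the matched tags, such as the word "checked" in the paragraph above, is left alone. One known limitation of this sketch: an attribute value that itself contains one of the words surrounded by spaces would still be modified.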

As we have successfully prepared the HTML for parsing, we can move on to the main phase: the parser.

Phase 2: The Parser

Initially, we will construct the parser so that it is able to read XHTML and reconstruct it as output. Thus, the output will be identical to the input. The purpose of this first step is to ensure that the parser is able to preserve the portions of the HTML that are not form elements.

Before we actually go into the parsing logic, we define the initial parser configuration as follows:

/*A*/

define("NSPACES_ON",true); //the constant name must be passed as a string

$f = (NSPACES_ON) ? "f:" : "";



/*B*/

$parser = xml_parser_create();

xml_set_element_handler($parser, "tagOpen", "tagClosed");

xml_set_character_data_handler($parser, "tagContent");

$curtags = array();

A: To allow for greater syntax flexibility, we provide a way to turn namespaces on or off. If you are unfamiliar with XML namespaces, check out XML Namespaces By Example. The current W3C proposal for XHTML requires the namespace references for XForms to be included. However, once XHTML 2.0 becomes a recommendation, they will not be required. Visit the W3C HTML Homepage for more information.

B: Here is where we set up the parser itself. The first line simply creates a parser resource. The second line is critical: it defines the functions that the parser calls when it encounters the start and end of an XHTML element (or tag). The PHP function that is used to accomplish this, xml_set_element_handler(), takes three parameters: the variable representing the parser resource, the name of the function that is called at the start of a tag, and the name of the function called when the tag (XHTML element) is closed. Next, xml_set_character_data_handler() defines the function called when the parser encounters text between tags (known as character data, and referred to in this article as CDATA). The parameters are similar: the first is the parser resource, and the second is the name of the function to call when any CDATA is encountered. The functions tagOpen(), tagClosed(), and tagContent() are known as "handlers," since they are called by an internal system rather than by programmer-written code. The internal system in this case is the PHP parser. On the last line, we initialize the $curtags array. This array (implemented as a stack) will be visible to all three handlers so that we always know what tag is being read and what other tags are open. The way $curtags works will be explained in more detail later in this article.

Parser Foundation

As an abstract example, here is some simple XML. Let's assume the parser is running with the settings that we've just defined above:

<greeting friendly="true">
Hello World!
</greeting>

The parser walks through the above XML character by character. When it reaches the end of line 1, it calls tagOpen(), passing the name of the <greeting> tag and its attributes. Once the function executes, it continues to traverse the XML until it detects more markup on line 3. At this point, it calls tagContent() and passes the text inside the <greeting> tag (including the two line breaks). After that function runs, it reads the name of the closing tag and passes it to the tagClosed() function. That's essentially how the PHP parser works.
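The walk-through can be reproduced with a few lines of PHP. The sketch below is separate from the translator and uses anonymous functions (PHP 5.3 or later) in place of the named handlers, with the variables $events and $text introduced purely for illustration. It accumulates the character data rather than assuming one call, because the parser may deliver the text of a single element in several pieces.

```php
<?php
// The sample XML from the walk-through, with its two line breaks.
$xml = "<greeting friendly=\"true\">\nHello World!\n</greeting>";

$events = array(); // element events, in the order they fire
$text   = "";      // character data, accumulated across handler calls

$parser = xml_parser_create();

xml_set_element_handler(
    $parser,
    // Start-element handler: receives the tag name and its attributes.
    function ($p, $name, $attrs) use (&$events) {
        $events[] = "open " . $name . " friendly=" . $attrs["FRIENDLY"];
    },
    // End-element handler: receives only the tag name.
    function ($p, $name) use (&$events) {
        $events[] = "close " . $name;
    }
);

xml_set_character_data_handler(
    $parser,
    // May be called more than once for the same run of text.
    function ($p, $data) use (&$text) {
        $text .= $data;
    }
);

if (!xml_parse($parser, $xml, true)) {
    die("XML error: " . xml_error_string(xml_get_error_code($parser)));
}

print_r($events);
echo trim($text), "\n";
```

Note that the names arrive in uppercase (GREETING, FRIENDLY). The parser folds names to uppercase by default, which is why the handlers in this article compare against names such as "FORM" and "INPUT".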

Now that we've gone through some PHP parser basics, we can start tackling the logic of the parser itself. As mentioned before, this initial version is only meant to pass the input file through the parser and reconstruct a file with identical data as output. We will add the form translation code once we get this first part right. Let's start with the tagOpen() function (the start element handler):

function tagOpen($parser, $name, $attrs) 

{

    /*A*/

    global $outbody, $curtags, $sctag;

    $sctag = true;



    /*B*/

    array_unshift($curtags,$name);



    /*C*/

    switch ($curtags[0]) {

        /*Cases for form tag translation go here*/

        default:

            /*D*/

            $outbody .= "<".$name;

            foreach ($attrs as $k=>$v)

                $outbody .= " $k=\"$v\"";

            $outbody .= ">";

        break;

    }

}

A: First we declare all variables that have to be seen by all handlers. $outbody contains the parsed output for the <body> tag, while the purpose of $curtags has been previously mentioned. The Boolean variable $sctag tracks whether the current tag is self-closing. For example, <br/>, <hr/>, <img/>, and <input/> are all self-closing tags. It is set to true every time a tag is opened, and stays true unless content is found inside the tag.

B: The function array_unshift(), in conjunction with array_shift(), allows us to implement $curtags as a simple stack. array_unshift() puts $name as an element at the front of the array while shifting all other elements of the array down one position. On the other hand, array_shift() does the opposite: it removes the first element of the array and overrides its position by shifting all other elements in the array up one position. Implementing a stack like this is convenient in PHP, as the top of the stack can be examined (without changes) simply by accessing $curtags[0]. Thus, the first element in the array is the most recently opened XHTML tag, the second element is the open tag that is one level up from the current one, and so on. Also, the size of $curtags tells us our current tag depth.

C: This switch statement determines what to do based on the current tag. As we add the forms translation logic, we will add more cases to the switch statement. For now, we are only concerned with the default case, which should completely preserve the original XHTML syntax.

D: The unchanged XHTML syntax is appended to $outbody here. The foreach loop traverses the associative array that contains the attribute information, and appends to $outbody as appropriate. For example, the tag <style id="1"> would result in $attrs having an element with a key of "ID" and an associated value of "1". (By default, the parser folds tag and attribute names to uppercase, which is why the handlers later in this article test for names such as "FORM" and "METHOD".)
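The stack behavior described under B can be seen in isolation. The short sketch below pushes three tag names in the order a parser would open them; the tag names and the helper variables ($current, $parent, $depth, $closed) are chosen for illustration only.

```php
<?php
$curtags = array();

// Opening tags: each new name goes to the front of the array.
array_unshift($curtags, "BODY");
array_unshift($curtags, "FORM");
array_unshift($curtags, "INPUT");

$current = $curtags[0];     // the most recently opened tag
$parent  = $curtags[1];     // the tag one level up
$depth   = count($curtags); // the current nesting depth

// Closing a tag: remove the first element and shift the rest up.
$closed = array_shift($curtags);
```

After the final call, $curtags[0] is "FORM" again, so the handlers always see the innermost tag that is still open.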

Now we'll examine the tagContent() and tagClosed() functions (the CDATA and end element handlers, respectively):

function tagContent($parser, $data) 

{

    global $outbody, $curtags, $sctag;



    switch ($curtags[0]) {

        /*Cases for form tag translation go here*/

        default:

            /*A*/

            $sctag = false;

            $outbody .= $data;

        break;

    }

}



function tagClosed($parser, $name) 

{

    global $outbody, $curtags, $sctag;



    switch ($name) {

        /*Cases for form tag translation go here*/

        default:

            /*B*/

            if ($sctag) //self-closing tag

                $outbody = substr($outbody,0,-1) . "/>";

            else

                $outbody .= "</$name>";

        break;

    }



    /*C*/

    array_shift($curtags);

}

When comparing these two handlers with the first one we discussed, we see similarities: both begin by exposing the required variables globally, and both contain a switch statement that selects cases based on the current tag name. As with tagOpen(), we will add more cases to these switch statements once we add support for XForms translation.

A: Once we reach this point, we know that we're in a standard XHTML tag that contains text. In other words, it is not a self-closing tag. Therefore, we set $sctag to false. Also, we make sure that this text is carried through to the output file by appending it to $outbody.

B: If the tag that we're currently parsing turns out to be a self-closing tag, we have to remove the ">" character that was added in tagOpen() and replace it with "/>" (the substr() call). Otherwise, we close the tag the expected way by appending "</$name>".

C: At the end of tagClosed(), we are done with the current tag, so we remove it from the top of the stack using array_shift().
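The self-closing logic is easy to try out on its own. The sketch below is a variation on the handlers above, not a copy of them, and every name in it (passThrough(), $selfClosing) is an assumption made for illustration. It differs in two ways: it also resets the flag after each closing tag, so that a parent containing only an empty child (for example <p><br/></p>) is still closed with a full end tag, and it switches off case folding so that the output can be compared directly with the input. The handlers in this article rely on the default uppercase names, so that option is for this demonstration only.

```php
<?php
// Hypothetical helper: runs a fragment through pass-through handlers and
// returns the reconstructed markup, or false if the input is not well-formed.
function passThrough($xml)
{
    $out = "";
    $selfClosing = false;

    $parser = xml_parser_create();
    // Keep names in their original case (demonstration only).
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$out, &$selfClosing) {
            $selfClosing = true; // assume empty until content shows up
            $out .= "<" . $name;
            foreach ($attrs as $k => $v) {
                $out .= " $k=\"$v\"";
            }
            $out .= ">";
        },
        function ($p, $name) use (&$out, &$selfClosing) {
            if ($selfClosing) {
                // Nothing followed the opening tag: turn ">" into "/>".
                $out = substr($out, 0, -1) . "/>";
            } else {
                $out .= "</$name>";
            }
            // The enclosing tag now has content, so it cannot be self-closing.
            $selfClosing = false;
        }
    );

    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$out, &$selfClosing) {
            $selfClosing = false;
            $out .= $data;
        }
    );

    return xml_parse($parser, $xml, true) ? $out : false;
}

echo passThrough('<p id="a">Hi<br/>there</p>'), "\n";
```

Like the handlers above, this sketch writes entities back in decoded form, so it is only suitable for input without special characters.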

Now that we've set the foundations of our parser, we can start adding in the logic necessary to translate the HTML form elements into XForm elements.

Translating to XForms

From this point on, an understanding of XForms is assumed -- if you are unfamiliar or need brushing up, I recommend "What Are XForms" (mentioned earlier).

Let's look at an input XHTML file containing a simple form:

<html>

 <head>

  <title>sample form</title>

 </head>

 <body>

  <form action="#" method="get" name="s">

   Find

   <input type="text" name="Find" />

   <input type="submit" value="Go" />

  </form>

 </body>

</html>

If we translate this form into the XForms model, it looks like this:

<html xmlns:f='http://www.w3.org/2002/xforms'>

 <head>

  <title>sample form</title>

 <f:model><f:submission action='#' method='get' id='s'/></f:model></head>

 <body>

  <div class='form'>

   Find

   <f:input ref='Find'><f:label>Find</f:label></f:input>

   <f:submit submission='s'><f:label>Go</f:label></f:submit>

  </div>

 </body>

</html>

Now that we have our input and output requirements, we can add the necessary XForms translation logic to our element handlers (the added code is in bold). Let's start with tagOpen():

function tagOpen($parser, $name, $attrs) 

{

    /*A*/

    global $outbody, $curtags, $sctag;

    global $outhead, $curformid, $f;



    $sctag = true;

    array_unshift($curtags,$name);

    switch ($curtags[0]) {

        case "FORM":

            /*B*/

            $method = ""; //stays empty when no mapping applies (error handling omitted)

            if (!isset($attrs["ENCTYPE"]))

            {

                if ($attrs["METHOD"] != "post")

                    $method = $attrs["METHOD"];

            }

            else if ($attrs["ENCTYPE"] == "application/x-www-form-urlencoded")

                $method = "urlencoded-post";

            else if ($attrs["ENCTYPE"] == "multipart/form-data")

                $method = "form-data-post";



            /*C*/

            $curformid = $attrs["NAME"];

            $outhead .= "<$f"."submission action='".$attrs["ACTION"] . 

                "' method='" . $method . 

                "' id='" . $attrs["NAME"] . "'/>";

            $outbody .= "<div class='form'>";

        break;

        case "INPUT":

            /*D*/

            $sctag = false;

            switch ($attrs["TYPE"]) {

                /*Add'l cases for form tag translation go here*/

                case "text":

                    $outbody .= "<$f"."input ref='".$attrs["NAME"] . "'><$f" .

                    "label>".$attrs["NAME"]."</$f"."label>"."</$f"."input>";

                break;

                case "submit":

                    $outbody .= "<$f"."submit submission='$curformid'><$f" .

                    "label>".$attrs["VALUE"]."</$f"."label>"."</$f"."submit>";

                break;

            }

        break;

        default:

            $outbody .= "<".$name;

            foreach ($attrs as $k=>$v)

                $outbody .= " $k=\"$v\"";

            $outbody .= ">";

        break;

    }

}

A: We had to add some more globally scoped variables to support the new logic. $outhead contains all the XForms tags that need to be added to the <head> tag (represented by the orange box labeled "A" in Figure 1). $curformid contains the unique identifier of the current form; although not strictly necessary for this example, it can be useful for scaling the parser to handle multiple forms, and for detecting errors in the HTML when the forms are improperly nested. Lastly, $f either contains "f:" or is an empty string. As discussed previously, this is included so that we can easily turn namespaces on and off without changing more than one part of the code.

B: To determine the submission behavior, HTML forms use two attributes: enctype and method. However, XForms only uses one attribute -- method -- to accomplish this. The appropriate mapping is defined here. Using a series of if/else statements, we can assign the appropriate value to $method. For the sake of simplicity, error handling is omitted; however, it's worth noting that there's an opportunity here to throw an exception if the HTML data is incomplete: e.g., enctype should be set if method="post".

C: Although HTML form elements can have an id attribute, we have chosen to use the HTML form's name attribute (instead of its id attribute) as the id of the generated XForms submission. The reason is that name is more commonly used as a unique identifier for an HTML form than id. Finally, note that all the data in the <form> tag is stored in the <head> tag of the output XHTML. For the body, we use a <div> element to replace the <form> element as a container for all child tags and the form contents. If, for example, there was style information associated with the <form> tag, we could easily redefine the CSS so that it refers to the new <div> tag instead.

D: This is where we extract all the info from a form's <input> tag. Note that HTML forms have multiple input types, so we need another switch/case control that selects a case on the value of the type attribute. Because our sample form has only two input types, we define only two cases for now.
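The mapping described under B can also be written as a small standalone function, which makes it easier to test and to extend. The sketch below is a variation, not the code above: the function name mapSubmissionMethod() is hypothetical, and it assumes the HTML defaults apply when attributes are missing (method defaults to get, and enctype defaults to application/x-www-form-urlencoded). It also compares values without regard to case, which the handler above does not do.

```php
<?php
// Hypothetical helper: maps an HTML form's method/enctype pair to the value
// of the XForms submission element's method attribute.
function mapSubmissionMethod($method, $enctype = null)
{
    $method = strtolower((string) $method);

    // Anything other than "post" is treated as the HTML default, "get".
    if ($method !== "post") {
        return "get";
    }

    // A post with multipart encoding maps to form-data-post.
    if ($enctype !== null && strtolower($enctype) === "multipart/form-data") {
        return "form-data-post";
    }

    // A post with the default (or explicit) URL encoding.
    return "urlencoded-post";
}

echo mapSubmissionMethod("post", "multipart/form-data"), "\n";
```

Inside tagOpen(), a call such as mapSubmissionMethod($attrs["METHOD"], isset($attrs["ENCTYPE"]) ? $attrs["ENCTYPE"] : null) could stand in for the if/else chain, provided the form declares a method attribute.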

As you would expect, most of the work is done by tagOpen(). Here are tagClosed() and tagContent(), with the additions in bold:

function tagClosed($parser, $name) 

{

    global $outbody, $curtags, $sctag;

    global $outhead, $curformid, $f;



    switch ($name) {

        /*A*/

        case "INPUT":

            //do nothing

        break;

        case "FORM":

            $curformid = "";

            $outbody .= "</div>";

        break;

        default:

            if ($sctag) //self-closing tag

                $outbody = substr($outbody,0,-1) . "/>";

            else

                $outbody .= "</$name>";

        break;

    }



    array_shift($curtags);

}



function tagContent($parser, $data) 

{

    global $outbody, $curtags, $sctag;

    global $outhead, $curformid, $f;



    switch ($curtags[0]) {

        /*B*/

        default:

            $sctag = false;

            $outbody .= $data;

        break;

    }

}

A: We have added the cases for both the <input> and <form> tags. It's important to add a case for the <input> tag, even though no code is executed, so that it's not treated as the default case. The reason is that we have already added the XForms closing tags in tagOpen() for the HTML input tag, so no further tags need to be added at this point. The logic for handling the closing of the <form> tag is also straightforward -- we just close the <div> tag that was opened when we handled the start of the <form> tag in tagOpen().

B: For the tags we added so far (<form> and <input>), we don't need to add any cases in the tagContent() function. However, we will need to do so when we include support for tags such as <option> (nested in a <select> tag).

You can add further HTML form support using a similar approach -- just add the cases in the switch statements. Note the nested switch statement in the tagOpen() function: this will eventually have the most cases because most form tags are <input> tags, and there will be one case for every possible value of the type attribute. Here is a useful table that you can use as a translation guide. It shows you the XForms element that each HTML form element should be mapped to.
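As an illustration of adding a case, the sketch below pulls the nested switch out into a function and adds support for password fields, which correspond to the XForms secret control. The function name translateInput() and its parameters are hypothetical, the label text simply reuses the field name as the code above does, and input types not listed fall through to an empty string.

```php
<?php
// Hypothetical helper: returns the XForms markup for one HTML <input> tag.
// $attrs uses the uppercase keys delivered by the parser; $prefix is the
// namespace prefix ("f:" or an empty string).
function translateInput(array $attrs, $formId, $prefix = "f:")
{
    $type = isset($attrs["TYPE"]) ? strtolower($attrs["TYPE"]) : "text";
    $name = isset($attrs["NAME"]) ? $attrs["NAME"] : "";

    switch ($type) {
        case "text":
            return "<{$prefix}input ref='$name'>"
                . "<{$prefix}label>$name</{$prefix}label></{$prefix}input>";

        case "password": // new case: password fields become secret controls
            return "<{$prefix}secret ref='$name'>"
                . "<{$prefix}label>$name</{$prefix}label></{$prefix}secret>";

        case "submit":
            $label = isset($attrs["VALUE"]) ? $attrs["VALUE"] : "Submit";
            return "<{$prefix}submit submission='$formId'>"
                . "<{$prefix}label>$label</{$prefix}label></{$prefix}submit>";

        default:
            return ""; // unsupported types are skipped in this sketch
    }
}

echo translateInput(array("TYPE" => "password", "NAME" => "pw"), "s"), "\n";
```

Keeping the translation in a function that returns a string, instead of appending to $outbody directly, is a design choice of this sketch; it lets each case be checked without running the whole parser.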

Now that we have some basic functionality, we can move on to Phase 3, which completes the operation of the parser.

Phase 3: Finalizing the Output

As you've hopefully noticed, we haven't actually called the function that runs the parser. This is what we do next:

/*A*/

if (!xml_parse($parser, $input["body"], true))

{

    $error = xml_error_string(xml_get_error_code($parser));

    $line = xml_get_current_line_number($parser);

    die("HTML error: " . $error . " , line " . $line);

}

xml_parser_free($parser);



/*B*/

$outhead = "<$f"."model>".$outhead."</$f"."model>";

$finaloutput = $input["top"].$input["head"].$outhead.$input["middle"].$outbody.$input["bottom"];

if (NSPACES_ON)

    $finaloutput = str_replace("<html","<html xmlns:f='http://www.w3.org/2002/xforms'",$finaloutput);



/*C*/

$outfile = "output.html";

$fh = fopen($outfile, "w");

if (!fwrite($fh, $finaloutput))

    die("Failed to write to file.");

fclose($fh); //release the file handle once the write is done

A: xml_parse() is what puts the gears in motion -- we pass the variable representing the parser and the input that we wish to parse as the first two parameters, respectively. The third parameter is set to false if we want to pass the input in smaller chunks (this is done when the input is very large and a lot of processing is required). In our case, we will be parsing the input in one go, so we set the third parameter to true. If xml_parse() returns 0, it has encountered an error and was unable to finish parsing the input. When this happens, we use xml_get_error_code() together with xml_error_string() to find out what happened, and xml_get_current_line_number() to find out where it happened. The final parser-related function, xml_parser_free(), removes the parser resource from memory. This is only done once we're finished with the parser entirely.

B: As previously mentioned, $outhead contains all the XForms elements that need to be added to the <head> tag of the output XHTML. However, before we do this, we encase all of this in a <model> tag to indicate that they are XForms tags. Now, we stick the file segments back together (as shown in Figure 1) and store the end result in $finaloutput. Before storing our result in a file, we add the namespace declaration to the <html> tag, using str_replace(). This function was explained when we used it in Phase 1.

C: Now that we have our translated form, we need to put it into a file. fopen() returns a file handle, which tells PHP that we will be doing something with a file. In this case, we will be writing to it, so we pass "w" as the second parameter (the first parameter is the name of the output file). The function that does the actual file writing is fwrite() -- we pass the file handle we obtained earlier, along with the data we wish to write. We produce an error message if the write fails.
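For large inputs, the chunked mode mentioned under A looks roughly like the sketch below. It is not part of the translator: the function name countElementsInChunks() and the chunk size are assumptions, and the input is split with str_split() only to keep the example short (a real script would more likely read chunks from a file with fread()). The third argument to xml_parse() is false for every chunk except the last.

```php
<?php
// Hypothetical helper: parses $xml in pieces and counts its elements.
function countElementsInChunks($xml, $chunkSize = 8)
{
    $count = 0;
    $parser = xml_parser_create();

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$count) {
            $count++; // one more element opened
        },
        function ($p, $name) {
            // nothing to do when an element closes
        }
    );

    $chunks = str_split($xml, $chunkSize);
    $last = count($chunks) - 1;

    foreach ($chunks as $i => $chunk) {
        // Only the final chunk is flagged as the end of the input.
        if (!xml_parse($parser, $chunk, $i === $last)) {
            $error = xml_error_string(xml_get_error_code($parser));
            $line = xml_get_current_line_number($parser);
            throw new RuntimeException("XML error: $error, line $line");
        }
    }

    return $count;
}

echo countElementsInChunks('<a><b>text</b><c/></a>'), "\n";
```

The parser keeps its state between calls, so a tag that is split across two chunks is still recognized once the rest of it arrives.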

At this point, we have consolidated the translated XHTML and written it to an output file. This marks the end of Phase 3, and the completion of the parser.

Scaling the Parser

What has been provided here is a set of rudimentary building blocks for a complete HTML to XForms translator. As explained earlier, the parser can be scaled to handle the remaining HTML form elements and translate them into XForms by adding cases to the handlers. In addition to the introductory XForms article mentioned earlier, you may find this link useful: XForms for HTML Authors. It explains in detail how to use XForms to provide all features available with HTML forms.

All the files that were discussed (including the main translator) are available below: