Migrating to XForms
by Paul Sobocinski
Phase 2: The Parser
Initially, we will construct the parser so that it is able to read XHTML and reconstruct it as output. Thus, the output will be identical to the input. The purpose of this first step is to ensure that the parser is able to preserve the portions of the HTML that are not form elements.
Before we actually go into the parsing logic, we define the initial parser configuration as follows:
/*A*/
define("NSPACES_ON", true); // the constant name must be a quoted string
$f = (NSPACES_ON) ? "f:" : "";
/*B*/
$parser = xml_parser_create();
xml_set_element_handler($parser, "tagOpen", "tagClosed");
xml_set_character_data_handler($parser, "tagContent");
$curtags = array();
A: To allow for greater syntax flexibility, we provide a way to turn namespaces on or off. If you are unfamiliar with XML namespaces, check out XML Namespaces By Example. The current W3C proposal for XHTML requires the namespace references for XForms to be included. However, once XHTML 2.0 becomes a recommendation, they will not be required. Visit the W3C HTML Homepage for more information.
B: Here is where we set up the parser itself. The first line simply creates a parser resource. The second line is critical: it defines the functions that the parser calls when it encounters the start and end of an XHTML element (or tag). The PHP function used for this, xml_set_element_handler(), takes three parameters: the variable representing the parser resource, the name of the function called at the start of a tag, and the name of the function called when the tag (XHTML element) is closed. Next, xml_set_character_data_handler() defines the function called when the parser encounters any text that is not markup (also known as character data, or CDATA). Its parameters are similar: the first is the parser resource, and the second is the name of the function to call when character data is encountered. The functions tagOpen(), tagClosed(), and tagContent() are known as "handlers," since they are called by an internal system rather than by programmer-written code. The internal system in this case is PHP's XML parser. On the last line, we initialize the $curtags array. This array (implemented as a stack) will be visible to all three handlers so that we always know what tag is being read and what other tags are open. The way $curtags works will be explained in more detail later in this article.
Parser Foundation
As an abstract example, here is some simple XML. Let's assume the parser is running with the settings that we've just defined above:
<greeting friendly="true">
Hello World!
</greeting>
The parser walks through the above XML character by character. When it reaches the end of line 1, it calls tagOpen(), passing the name and attributes of the <greeting> tag. Once the function executes, the parser continues to traverse the XML until it detects more markup (the closing tag) on line 3. At this point, it calls tagContent() and passes the text inside the <greeting> tag (including the two line breaks). After that function runs, it reads the name of the closing tag and passes it to the tagClosed() function. That's essentially how the PHP parser works.
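The call sequence described above can be observed with a small sketch. It uses anonymous functions instead of the named handlers, and the $events and $text variables are hypothetical names introduced only for this illustration. Note that PHP's XML parser has case folding enabled by default, so element names reach the handlers in uppercase, and that character data may be delivered in more than one chunk, which is why the text is accumulated rather than counted.

```php
<?php
// Record each handler call so the order of events is visible.
$events = [];
$text = "";

$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$events) {
        $events[] = "open:$name";   // start element handler
    },
    function ($p, $name) use (&$events) {
        $events[] = "close:$name";  // end element handler
    }
);
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$text) {
        $text .= $data;             // may arrive in several chunks
    }
);

$xml = "<greeting friendly=\"true\">\nHello World!\n</greeting>";
$ok = xml_parse($parser, $xml, true); // true marks the final chunk of input

print_r($events);
echo trim($text), "\n";
```

The uppercase names are the reason the translation code later in this article matches on "FORM" and "INPUT" rather than on the lowercase tag names.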
Now that we've gone through some PHP parser basics, we can start tackling the logic of the parser itself. As mentioned before, this initial version is only meant to pass the input file through the parser and reconstruct a file with identical data as output. We will add the form translation code once we get this first part right. Let's start with the tagOpen() function (the start element handler):
function tagOpen($parser, $name, $attrs)
{
/*A*/
global $outbody, $curtags, $sctag;
$sctag = true;
/*B*/
array_unshift($curtags,$name);
/*C*/
switch ($curtags[0]) {
/*Cases for form tag translation go here*/
default:
/*D*/
$outbody .= "<".$name;
foreach ($attrs as $k=>$v)
$outbody .= " $k=\"$v\"";
$outbody .= ">";
break;
}
}
A: First we define all variables that have to be seen by all handlers. $outbody contains the parsed output for the <body> tag, while the purpose of $curtags has been previously mentioned. The Boolean variable $sctag determines whether the current tag is self-closing. For example, <br/>, <hr/>, <img/>, and <input/> are all self-closing tags. This is set to true by default.
B: The function array_unshift(), in conjunction with array_shift(), allows us to implement $curtags as a simple stack. array_unshift() puts $name as an element at the front of the array while shifting all other elements of the array down one position. On the other hand, array_shift() does the opposite: it removes the first element of the array and overrides its position by shifting all other elements in the array up one position. Implementing a stack like this is convenient in PHP, as the top of the stack can be examined (without changes) simply by accessing $curtags[0]. Thus, the first element in the array is the most recently opened XHTML tag, the second element is the open tag that is one level up from the current one, and so on. Also, the size of $curtags tells us our current tag depth.
C: This switch statement determines what to do based on the current tag. As we add the forms translation logic, we will add more cases to the switch statement. For now, we are only concerned with the default case, which should completely preserve the original XHTML syntax.
D: The unchanged XHTML syntax is appended to $outbody here. The foreach loop traverses through the associative array that contains the attribute information, and appends to $outbody as appropriate. For example, the tag <style id="1"> would result in $attrs having an element with a key of "id" and an associated value of "1".
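To make the stack behavior from note B concrete, here is a minimal sketch that pushes three tag names and pops one. The tag names are arbitrary examples, and $top, $depth, and $popped are hypothetical variables used only for illustration.

```php
<?php
// $curtags used as a stack: index 0 is always the most recently opened tag.
$curtags = [];

array_unshift($curtags, "HTML"); // push
array_unshift($curtags, "BODY"); // push
array_unshift($curtags, "P");    // push

$top   = $curtags[0];     // peek at the current tag without changing the stack
$depth = count($curtags); // current nesting depth

$popped = array_shift($curtags); // pop: removes and returns the first element

echo "$top $depth $popped\n";
print_r($curtags);
```

After the pop, $curtags[0] is "BODY" again, which is exactly the state the handlers need once the <p> element has been closed.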
Now we'll examine the tagContent() and tagClosed() functions (the CDATA and end element handlers, respectively):
function tagContent($parser, $data)
{
global $outbody, $curtags, $sctag;
switch ($curtags[0]) {
/*Cases for form tag translation go here*/
default:
/*A*/
$sctag = false;
$outbody .= $data;
break;
}
}
function tagClosed($parser, $name)
{
global $outbody, $curtags, $sctag;
switch ($name) {
/*Cases for form tag translation go here*/
default:
/*B*/
if ($sctag) //self-closing tag
$outbody = substr($outbody,0,-1) . "/>";
else
$outbody .= "</$name>";
break;
}
/*C*/
array_shift($curtags);
$sctag = false; // the parent now has a child, so it must not be treated as self-closing
}
When comparing these two handlers with the first one we discussed, we see similarities: both begin by exposing the required variables globally (the global statement on the first line of each function), and both contain a switch statement that selects cases based on the current tag name. As with tagOpen(), we will add more cases to these switch statements once we add support for XForms translation.
A: Once we reach this point, we know that we're in a standard XHTML tag that contains character data. In other words, it is not a self-closing tag. Therefore, we set $sctag to false. Also, we make sure that this character data is carried through to the output file by appending it to $outbody.
B: If the tag that we're currently parsing turns out to be a self-closing tag, we have to remove the ">" character that was added in tagOpen() and replace it with "/>" (the if branch). Otherwise, we close the tag the expected way (the else branch).
C: At the end of tagClosed(), we are done with the current tag, so we remove it from the top of the stack using array_shift().
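The three handlers can be exercised together with a short, self-contained sketch. The wrapper function passThrough() is a hypothetical name, and the sketch makes two adjustments that should be read as assumptions rather than as the article's method: it resets $sctag to false after every closing tag, so that a parent whose last child is self-closing still gets a full end tag, and it re-escapes attribute values with htmlspecialchars(), because the parser hands them to the handler already decoded. Since case folding is on by default, the output is equivalent to the input but uses uppercase names.

```php
<?php
// Hypothetical wrapper: runs the pass-through logic on a string and returns the result.
function passThrough(string $xml): string
{
    $out = "";
    $sctag = false;

    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$out, &$sctag) {
            $sctag = true; // assume self-closing until content shows up
            $out .= "<" . $name;
            foreach ($attrs as $k => $v) {
                // Assumption: re-escape decoded attribute values on output.
                $out .= " $k=\"" . htmlspecialchars($v, ENT_COMPAT) . "\"";
            }
            $out .= ">";
        },
        function ($p, $name) use (&$out, &$sctag) {
            if ($sctag) {
                $out = substr($out, 0, -1) . "/>"; // turn "<X>" into "<X/>"
            } else {
                $out .= "</$name>";
            }
            $sctag = false; // the enclosing element now has content
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$out, &$sctag) {
            $sctag = false; // text means the element is not self-closing
            $out .= $data;
        }
    );

    xml_parse($parser, $xml, true);
    return $out;
}

$result = passThrough('<p class="x">Hi<br/></p>');
echo $result, "\n";
```

Text content is appended as received, so input containing entities such as &amp; in character data would need the same re-escaping treatment before the output could be considered well-formed.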
Now that we've set the foundations of our parser, we can start adding in the logic necessary to translate the HTML form elements into XForms elements.
Translating to XForms
From this point on, an understanding of XForms is assumed -- if you are unfamiliar or need brushing up, I recommend "What Are XForms" (mentioned earlier).
Let's look at an input XHTML file containing a simple form:
<html>
<head>
<title>sample form</title>
</head>
<body>
<form action="#" method="get" name="s">
Find
<input type="text" name="Find" />
<input type="submit" value="Go" />
</form>
</body>
</html>
If we translate this form into the XForms model, it looks like this:
<html xmlns:f='http://www.w3.org/2002/xforms'>
<head>
<title>sample form</title>
<f:model><f:submission action='#' method='get' id='s'/></f:model></head>
<body>
<div class='form'>
Find
<f:input ref='Find'><f:label>Find</f:label></f:input>
<f:submit submission='s'><f:label>Go</f:label></f:submit>
</div>
</body>
</html>
Now that we have our input and output requirements, we can add the necessary XForms translation logic to our element handlers (the added code is in bold). Let's start with tagOpen():
function tagOpen($parser, $name, $attrs)
{
/*A*/
global $outbody, $curtags, $sctag;
global $outhead, $curformid, $f;
$sctag = true;
array_unshift($curtags,$name);
switch ($curtags[0]) {
case "FORM":
/*B*/
$method = ""; // stays empty if the form is incomplete (e.g., post without enctype)
if (!isset($attrs["ENCTYPE"]))
{
if ($attrs["METHOD"] != "post")
$method = $attrs["METHOD"];
}
else if ($attrs["ENCTYPE"] == "application/x-www-form-urlencoded")
$method = "urlencoded-post";
else if ($attrs["ENCTYPE"] == "multipart/form-data")
$method = "form-data-post";
/*C*/
$curformid = $attrs["NAME"];
$outhead .= "<$f"."submission action='".$attrs["ACTION"] .
"' method='" . $method .
"' id='" . $attrs["NAME"] . "'/>";
$outbody .= "<div class='form'>";
break;
case "INPUT":
/*D*/
$sctag = false;
switch ($attrs["TYPE"]) {
/*Add'l cases for form tag translation go here*/
case "text":
$outbody .= "<$f"."input ref='".$attrs["NAME"] . "'><$f" .
"label>".$attrs["NAME"]."</$f"."label>"."</$f"."input>";
break;
case "submit":
$outbody .= "<$f"."submit submission='$curformid'><$f" .
"label>".$attrs["VALUE"]."</$f"."label>"."</$f"."submit>";
break;
}
break;
default:
$outbody .= "<".$name;
foreach ($attrs as $k=>$v)
$outbody .= " $k=\"$v\"";
$outbody .= ">";
break;
}
}
A: We had to add some more globally scoped variables to support the new logic. $outhead contains all the XForms tags that need to be added to the <head> tag (represented by the orange box labeled "A" in Figure 1). $curformid contains the unique identifier of the current form; although not strictly necessary for this example, it can be useful for scaling the parser to handle multiple forms, and for detecting errors in the HTML when the forms are improperly nested. Lastly, $f either contains "f:" or is an empty string. As discussed previously, this is included so that we can easily turn namespaces on and off without changing more than one part of the code.
B: To determine the submission behavior, HTML forms use two attributes: enctype and method. However, XForms only uses one attribute -- method -- to accomplish this. The appropriate mapping is defined here. Using a series of if/else statements, we can assign the appropriate value to $method. For the sake of simplicity, error handling is omitted; however, it's worth noting that there's an opportunity here to throw an exception if the HTML data is incomplete: e.g., enctype should be set if method="post".
C: Although HTML form elements can have an id attribute, we have chosen to set the id of the created XForms submission element from the HTML form's name attribute (instead of its id attribute). The reason is that name is more commonly used than id as a unique identifier for an HTML form. Finally, note that all the data in the <form> tag is stored in the <head> tag of the output XHTML. For the body, we use a <div> element to replace the <form> element as a container for all child tags and the form contents. If, for example, there was style information associated with the <form> tag, we could easily redefine the CSS so that it refers to the new <div> tag instead.
D: This is where we extract all the info from a form's <input> tag. Note that HTML forms have multiple input types, so we need another switch/case control that selects a case on the value of the type attribute. Because our sample form has only two input types, we define only two cases for now.
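The enctype/method mapping from note B can also be isolated in a small function, which makes it easier to test. The following is a sketch of one possible variation, not the listing's exact logic: the function name xformsMethod() is hypothetical, it treats a missing method as "get" and a post without enctype as "urlencoded-post" (mirroring the HTML defaults), and it ignores enctype for non-post methods. All three choices are assumptions.

```php
<?php
// Hypothetical helper: maps HTML form attributes to an XForms submission method.
// Keys are uppercase because the parser folds attribute names by default.
function xformsMethod(array $attrs): string
{
    $method  = strtolower($attrs["METHOD"] ?? "get"); // assumption: default to get
    $enctype = strtolower($attrs["ENCTYPE"] ?? "");

    if ($method !== "post") {
        return $method; // enctype only matters for post
    }
    if ($enctype === "multipart/form-data") {
        return "form-data-post";
    }
    // Assumption: post with a missing or urlencoded enctype maps to urlencoded-post.
    return "urlencoded-post";
}

$m = xformsMethod(["ACTION" => "#", "METHOD" => "get", "NAME" => "s"]);
echo $m, "\n";
```

A stricter version could throw an exception for unrecognized enctype values instead of falling back, which is the error-handling opportunity mentioned in note B.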
As you would expect, most of the work is done by tagOpen(). Here are tagClosed() and tagContent(), with the additions in bold:
function tagClosed($parser, $name)
{
global $outbody, $curtags, $sctag;
global $outhead, $curformid, $f;
switch ($name) {
/*A*/
case "INPUT":
//do nothing
break;
case "FORM":
$curformid = "";
$outbody .= "</div>";
break;
default:
if ($sctag) //self-closing tag
$outbody = substr($outbody,0,-1) . "/>";
else
$outbody .= "</$name>";
break;
}
array_shift($curtags);
$sctag = false; // the parent now has a child, so it must not be treated as self-closing
}
function tagContent($parser, $data)
{
global $outbody, $curtags, $sctag;
global $outhead, $curformid, $f;
switch ($curtags[0]) {
/*B*/
default:
$sctag = false;
$outbody .= $data;
break;
}
}
A: We have added cases for both the <input> and <form> tags. It's important to add one for the <input> tag, even though no code is executed, so that it's not treated as the default case: we have already added the XForms closing tags in tagOpen() for the HTML input tag, so no further tags need to be added at this point. The logic for handling the closing of the <form> tag is also straightforward -- we just close the <div> tag that was opened when we handled the start of the <form> tag in tagOpen().
B: For the tags we added so far (<form> and <input>), we don't need to add any cases in the tagContent() function. However, we will need to do so when we include support for tags such as <option> (nested in a <select> tag).
You can add further HTML form support using a similar approach -- just add the cases in the switch statements. Note the nested switch statement in the tagOpen() function: this will eventually have the most cases because most form tags are <input> tags, and there will be one case for every possible value of the type attribute. Here is a useful table that you can use as a translation guide. It shows you the XForms element that each HTML form element should be mapped to.
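As an illustration of adding one more case, the sketch below pulls the nested switch into a helper and adds support for password fields. The function name translateInput() and its parameters are hypothetical, and mapping type="password" to the XForms secret element is an assumption based on the XForms specification rather than something taken from the table mentioned above.

```php
<?php
// Hypothetical helper: returns the XForms markup for one HTML <input> tag.
// $f is the namespace prefix ("f:" or ""), $formId the id of the enclosing form.
function translateInput(array $attrs, string $f, string $formId): string
{
    $name = $attrs["NAME"] ?? "";

    switch ($attrs["TYPE"] ?? "text") {
        case "text":
            return "<{$f}input ref='$name'><{$f}label>$name</{$f}label></{$f}input>";
        case "password":
            // Assumption: password fields map to the XForms secret element.
            return "<{$f}secret ref='$name'><{$f}label>$name</{$f}label></{$f}secret>";
        case "submit":
            $label = $attrs["VALUE"] ?? "Submit";
            return "<{$f}submit submission='$formId'><{$f}label>$label</{$f}label></{$f}submit>";
        default:
            return ""; // unsupported types are dropped in this sketch
    }
}

echo translateInput(["TYPE" => "password", "NAME" => "pw"], "f:", "s"), "\n";
```

Returning a string instead of appending to $outbody directly keeps each case easy to check in isolation; tagOpen() would simply append the returned value.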
Now that we have some basic functionality, we can move on to Phase 3, which completes the operation of the parser.