The Impact of Site Finder on Web Services

October 28, 2003

Automated HTTP Tools

Issue: Automated processes using HTTP over TCP port 80 may exhibit problems when encountering the Site Finder page instead of a DNS Name Error response
Response: No reported occurrences
- The site includes a robots.txt file to prevent indexing
- Other types of automated tools are discouraged according to BCP 56

Matt Larson, Review of Technical Issues and VeriSign Response, p25, October 15, 2003

This quotation is from a presentation by VeriSign to ICANN, stating that their recent and temporarily suspended changes to the .com and .net DNS servers have had no reported effect on automated HTTP tools and, further, that we shouldn't be automating HTTP access anyway.

Unfortunately, the entire web service protocol stack that IBM, Microsoft, the W3C, Apache, and others have been busy working on for the past few years is effectively "automated processes using HTTP over TCP port 80". Thus the niche processes that are being so glibly inconvenienced by these changes happen to include what many people believe is the future of distributed systems.

This article shows how SOAP-based web service stacks do in fact suffer from VeriSign's changes and discusses what can be done to fix them. The simplest solution is to leave Site Finder turned off. If it comes back, regardless of what changes we make to the SOAP stacks, the process of identifying configuration defects will be made more complex.

Introduction

A few years ago, when we were bringing up an early web service, we got a support call from the customers: our XML-RPC service was "sending back bad XML". Their client stack, written for an appropriately large fee by some consultancy group, was failing with SAX parser errors. Yet everything was working perfectly on our tests, so the fault had to be somewhere on their side. During the debugging session that ensued, we managed to get hold of the XML content that was causing the trouble. It was the HTML 404 page automatically generated by IIS. This led to a highly memorable conversation.

"We have found the problem: your client program is receiving an IIS error page and failing to parse it."

"I knew it -- there is a problem on your site."

"We aren't running IIS"

You see, we were running a Java Application server fronted by Apache 1.3. The client-side configuration file was wrong and the client system was pointing at some random server, an IIS server sending back its error page. Their client software was handing this page to an XML parser, with predictable consequences.

I learned a lot from that incident. I learned that a client-side XML parser error is often caused by HTML coming down the wire. I learned that home-rolled web service protocol stacks often neglect to test for HTTP error codes. And I learned that the first thing to do with any problem that you don't see yourself is to figure out which URL you are trying to talk to.

This is a question that everyone building a SOAP, XML-RPC, or REST web service should be prepared to ask more often as a result of the new Site Finder service.

Site Finder

On September 15, 2003, VeriSign tweaked the .com and .net DNS registries so that every lookup for an unknown host resolved to a search service web site, Site Finder, rather than return the NXDOMAIN response traditionally associated with DNS lookup failures.

This led many users' web browsers to the service, which VeriSign hoped would lead to the users clicking through on the paid links in the search service, thus bringing revenue to the company. Unfortunately, Site Finder also happens to break many existing programs: all those that assume a missing hostname maps to an immediate error. These programs will get back an IP address, but when they connect for a conversation on any port other than 80, they will get back a "connection refused" error, wrapped into the language- and toolkit-specific exception, fault, or error code the client program expects. All such programs are now going to have to have their documentation rewritten so that people know that a connection refused error may mean the hostname is wrong.

An interesting question is what impact the changes will have on web services -- anything using XML over HTTP as the means of coupling computers. One assumption of VeriSign's is mostly valid: such applications do use HTTP, albeit often on a different port. The other assumption -- that whoever is making the request would be grateful to see a search page -- is clearly false.

Theory

Here is what used to happen on a SOAP request to an invalid endpoint hostname, such as http://nosuchhost.com/endpoint:

Caller does DNS lookup.
DNS returns an error.
The protocol stack returns something like java.net.UnknownHostException.
If the application is smart, it maps this to a meaningful error, such as a message that the hostname may be incorrect.
If the application is simple, it shows the framework's error and assumes the end user is smart enough to understand it.
If a person is at the end of the application, they see the error and either fix their endpoint or phone up support.
If it is unattended operation, the machine ought to retry later. Applications aren't meant to cache failed lookups, but Java is naughty: some versions do exactly that unless told not to.
If the host comes back later, all is well. If not, then the application should have a recovery policy.
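
The first few steps above can be sketched in Java. This is a minimal illustration rather than code from any particular stack: the class name EndpointLookup, its methods, and the wording of the message are all hypothetical. It resolves the host of the endpoint URL and, when the lookup fails, turns the UnknownHostException into a message that names both the host and the URL.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

// Hypothetical helper; not part of Axis, JAX-RPC, or any other SOAP stack.
final class EndpointLookup {

    /** Returns null if the endpoint's host resolves, otherwise an end-user message. */
    static String check(String endpoint) {
        String host = URI.create(endpoint).getHost();
        if (host == null) {
            return "No hostname found in endpoint URL " + endpoint;
        }
        try {
            InetAddress.getByName(host); // DNS lookup; the JVM may cache the result
            return null;
        } catch (UnknownHostException e) {
            return describeUnknownHost(endpoint, host);
        }
    }

    /** Maps the framework's error to something a person can act on. */
    static String describeUnknownHost(String endpoint, String host) {
        return "Unknown host '" + host
                + "': check the hostname in the endpoint URL " + endpoint;
    }
}
```

A check of this kind only helps while a failed lookup actually reports a failure; it says nothing once the DNS server answers for every name.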

Now let's look at how things would be expected to change with Site Finder intervening:

Caller does DNS lookup.
DNS returns the IP address of something.
Caller creates a TCP link to a port 80 on that machine, then sends its SOAP request; usually a POST, although SOAP 1.2 adds GET.
The endpoint returns 302, "moved temporarily", redirecting the caller to a URL under http://sitefinder.verisign.com.
If the client handles 302 responses, then it resends the request to Site Finder.
Site Finder returns 200, "OK", and an HTML search page.

A SOAP client would normally POST its SOAP request, expecting an XML formatted SOAP response and a 200 code on success, 500 on a fault. Only now it would get a 200 response with text/html content. What is it going to do?

Either it is going to test the MIME type and bail out when that is not XML; or, as in the example cited above, it will hand it off to the XML parser, which will then break as the content is not valid XML. Even if it were valid XHTML, as per the W3C, the parsing would quite probably fail messily when the application tried to make sense of the data.

The result of this is that the VeriSign response does not parse. The client application is going to give some kind of error, perhaps an XML parser error, and that is going to lead to a support call.

The result of the change, therefore, is that if 302 redirects are handled in the web service client, then you are going to get more support calls. What about frameworks that don't handle them? Well, they will report the redirect somehow. Again, it is a more subtle error than Unknown Host, which means support gets a call.

The 302 or search page is not only going to result in meaningless errors. Because the response is only sent after the request has been sent, a big request -- such as a POST of binary data or a SOAP with Attachments message -- will only fail after the upload. This will waste time and bandwidth. Requests made from a device that pays by the second or by the byte -- such as a cellphone -- will cost the user even more money than before.

Testing on .NET WSE2.0

What does .NET1.1 with the preview release WSE2.0 do? I chose this stack as it is the latest version of one of the leading SOAP stacks, and I had a client program that I had written with it ready to hand.

As this is the most recent SOAP implementation from Microsoft, one would expect it to have incorporated all the feedback from users of the previous implementations, and to handle errors gracefully and in a way that could be well reported. It certainly does this with the classic failure mode, but not with the VeriSign-introduced errors.

Before:

C:> DotNetClient doc.xml http://nosuchhost.com/axis/endpoint
uploading doc.xml to http://nosuchhost.com/axis/endpoint
Exception:
System.Net.WebException: The underlying connection was closed: 
 The remote name could not be resolved.
   at System.Net.HttpWebRequest.CheckFinalStatus()
   at System.Net.HttpWebRequest.EndGetRequestStream(IAsyncResult asyncResult)
   at System.Net.HttpWebRequest.GetRequestStream()
   at Microsoft.Web.Services.SoapWebRequest.GetRequestStream()
   at System.Web.Services.Protocols.SoapHttpClientProtocol.Invoke(
      String methodName, Object[] parameters)

This is what we expect: an error message that indicates the true, underlying cause of the problem.

After:

With Site Finder running, this stack bails out at the end of the first POST with a MIME type error:

C:> DotNetClient doc.xml http://nosuchhost.com/axis/endpoint
uploading doc.xml to http://nosuchhost.com/axis/endpoint
Exception:
System.InvalidOperationException: Client found response 
content type of 'text/html; charset=iso-8859-1', but expected 'text/xml'.
The request failed with the error message:
--
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>302 Found</TITLE>
</HEAD><BODY>
<H1>Found</H1>
The document has moved 
<a href="http://sitefinder.verisign.com/lpc
    ?url=nosuchhost.comPOST%20/axis/endpoint&amp;host=nosuchhost.com">
    here</A>.<P>
</BODY></HTML>
   at System.Web.Services.Protocols.SoapHttpClientProtocol
         .ReadResponse(SoapClientMessage message, WebResponse response, 
                 Stream responseStream, Boolean asyncCall)
   at System.Web.Services.Protocols.SoapHttpClientProtocol
         .Invoke(String methodName, Object[] parameters)

The stack is trying to parse the body of the 302 response, instead of looking at the response and failing on that error code.

Provided the client application presents all the data in the exception, whoever ends up fielding the support call will be able to diagnose the problem. Assuming, that is, that they know that a redirect to Site Finder appears whenever the client application tries to connect to port 80 on an unknown host. If the exception text were not displayed, only its type (System.InvalidOperationException), then there would not be enough information to diagnose a cause.

Java: Apache Axis

On the Java side, I am going to look at Apache Axis. The trace here is from the CVS_HEAD version of Axis from September 27, 2003.

Before:

A classic DNS failure results in a Java UnknownHostException being thrown and then wrapped in the generic AxisFault Exception, which adds SOAP1.1/1.2 attributes such as actor, node and detail:

AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
 faultSubcode: 
 faultString: java.net.UnknownHostException: nosuchhost.com
 faultActor: 
 faultNode: 
 faultDetail:

After:

With Site Finder operational, the error becomes more complex. The core text of the fault is the response from the server; the fault detail incorporates the text of the response.

     
AxisFault
 faultCode: {http://xml.apache.org/axis/}HTTP
 faultSubcode: 
 faultString: (302)Found
 faultActor: 
 faultNode: 
 faultDetail: 
        {}:return code:  302
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>302 Found</TITLE>
</HEAD><BODY>
<H1>Found</H1>
The document has moved 
<A  HREF="http://sitefinder.verisign.com/lpc?url=nosuchhost.comPOST%20/axis/&amp;
    host=nosuchhost.com">here</A>.<P>
</BODY></HTML> 
(302)Found
at org.apache.axis.transport.http.HTTPSender.readFromSocket(HTTPSender.java:630)
at org.apache.axis.transport.http.HTTPSender.invoke(HTTPSender.java:128)
at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:71)

So it's the same thing: the client stack reads the 302-coded redirect page and bails out. Rather than choke on the text/html content, Axis rejects the response because the code is not supported. Supporting redirects is actually something we have discussed in the past -- I don't think we want to do that any more.

As with .NET WSE2.0, the result is that a misspelled endpoint could result in different errors than before. Whereas an unknown host message was probably going to spur a couple of the end user's neurons into inferring a cause, a 302 may not. Indeed, I have some suspicions, based on a fair few of the postings on the axis-user mailing list, that many of the people writing web services do not themselves know what a connection refused error message implies, let alone a 302 response code. If the people writing web services do not understand error codes from layers further down the stack, I do not have high hopes for end users.

Other Implications

Here are some other unexpected consequences of the change that are hard to describe as positive for web services.

Anything retrieving WSDL will also have to deal with the 302 redirect. Again, Axis will probably fail with some moderately uninformative error.
XML processors need to resolve hostnames to import remote DTDs and schemas. Hopefully nobody has been using invalid domains for their URIs.
Unless you properly configure the Java runtime, Java applications, including application servers, cache successful DNS lookups forever. If a hostname resolves to the VeriSign spoof service once, it will resolve there until the application is restarted. Java 1.3 and earlier cached negative responses, in violation of the DNS standards, and were roundly vilified for the practice. The VeriSign change reverts Java to the old behavior in some instances.

What Do the Standards Say?

The most relevant specification here, RFC 2616, HTTP/1.1, says of 302 and 307 redirects:

If the [302 or 307] status code is received in response to a request other than GET or HEAD, the user agent MUST NOT automatically redirect the request unless it can be confirmed by the user, since this might change the conditions under which the request was issued.

Of course, this document assumes a user behind the User Agent, the latter being a web browser of some sort. Web service clients may or may not have an end user to hand and, with many more POSTs, PUTs, and other operations being sent to the destination, they follow a very different interaction model from traditional web browsing. It is unrealistic to ask the user what to do after every redirection response is received.

Alongside the official W3C specifications and submissions, the main body dealing with defining what is a good SOAP-based web service is the WS-I. VeriSign is a member of this organization and has contributed a lot to the security aspects of the SOAP-based web services specification suite.

The WS-I Basic Profile 1.0 says implementations may use 307 as the redirect code, not the 302 code. So the fact that both .NET and Axis ignore the 302 is correct. At the same time, neither responds very well to the message. Axis should have a user-comprehensible message; .NET WSE2.0 should look at the response code before complaining that the response was in HTML.

The final reference document is the one that VeriSign refers to in its "you should not be automating HTTP" slide. Best Current Practices #56 is a declaration of what constitutes good and bad behavior for anything attempting to layer itself over HTTP. It states that port 80 should not be used for new protocols, no error codes other than 200 and 500 should be recognized, and that new URI schemes may be appropriate to identify new protocols. These recommendations are certainly valid for protocols such as RMI-over-HTTP and DCOM-over-HTTP, which use HTTP purely as a means to break through the firewall with their distributed object protocols. Some people view SOAP as a similar abuse of HTTP, though SOAP 1.2, with its GET support, integrates better with classic HTTP use than ever before.

In the web services I have developed, we have often mixed user visible code with the SOAP services within the same Java app. It is the only way to maintain context within a single Java Web Application, without having to resort to back-end services such as a database or EJB server.

Furthermore, BCP 56 only covers layering of protocols above HTTP, not automated use of those protocols with machine-readable content. The REST paradigm is pure HTTP, making full use of the verb set and (typically) passing XML data in both directions. If some of the state of the remote objects includes HTTP documents, then the integration of REST with the rest of the Web is complete.

Finally, WebDAV is an HTTP extension that is designed to treat a web server as a read-write file repository. While it can be used for a pure remote filesystem, its core role is to give HTTP editing tools the ability to upload content to a public site. In this role, it is only useful if that public site contains human-readable content, such as HTML pages served up from the default port of the HTTP protocol. BCP 56 cannot apply to such a use case.

I must conclude that, while BCP 56 does contain valid recommendations, they do not preclude an HTTP server on port 80 supporting automated clients, be they SOAP, REST, WebDAV or something else. Saying that unusual failure modes introduced by Site Finder are our fault for ignoring BCP 56 is a specious argument.

Framework Changes

What can be done in SOAP, XML-RPC and REST frameworks?

Reject 302 redirects with meaningful errors.
Verify MIME types before parsing the contents, again reporting errors in a way that enables the problem to be resolved.
Recognize a Site Finder redirect and translate that into a no-such-host error.
Hope that everyone patches their DNS servers to ignore the wildcards.
Always include the endpoint URL in any connectivity/parser fault; anything where the destination did not reply within the schema expected.

Hard-coded recognition of Site Finder is an ugly hack that is hard to countenance. It is also very brittle. This leaves better reporting of connectivity faults to end users as the most fundamental improvement.

As of October 1, Axis saves the HTTP error code as one of the elements in the fault details (http://xml.apache.org/axis/ : HttpErrorCode). We also plan to save the target URL and the headers from HTTP responses. This will help Axis users, but does little for anyone coding a client intended to work with any implementation of the JAX-RPC specification.

One other area for improvement in web service stacks is simply to test against HTTP response codes, and make sure the error messages are meaningful. Even the mainstream toolkits, Axis and MS WSE, are clearly weak here; home-rolled implementations are likely to be as bad or even worse.
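
As a sketch of what such checks might look like, the following Java fragment looks at the status code before anything else, then at the content type, and puts the endpoint URL into every error it raises. It is an illustration under stated assumptions, not the behavior of Axis or WSE: the class ResponseCheck, its verify method, the wording of the messages, and the set of content types accepted as XML are all hypothetical.

```java
import java.io.IOException;
import java.util.Locale;

// Hypothetical helper; a real stack would call something like this before parsing.
final class ResponseCheck {

    /** Throws a descriptive IOException unless the response looks like XML worth parsing. */
    static void verify(String endpoint, int status, String contentType) throws IOException {
        // Look at the status code first: the body of a redirect is not a SOAP message.
        if (status >= 300 && status < 400) {
            throw new IOException("Endpoint " + endpoint + " answered with redirect "
                    + status + "; the URL may be wrong or the service may have moved");
        }
        // Strip parameters such as "; charset=iso-8859-1" before comparing.
        String type = contentType == null ? "" : contentType.toLowerCase(Locale.ROOT);
        int semicolon = type.indexOf(';');
        if (semicolon >= 0) {
            type = type.substring(0, semicolon);
        }
        type = type.trim();
        boolean xml = type.equals("text/xml")
                || type.equals("application/xml")
                || type.endsWith("+xml");
        if (!xml) {
            throw new IOException("Endpoint " + endpoint + " returned content type '"
                    + contentType + "' with status " + status + "; expected XML");
        }
        // Any other status with an XML body (200, or 500 carrying a SOAP fault)
        // is left for the caller to parse.
    }
}
```

Called with the values from the traces earlier in this article (status 302, content type text/html; charset=iso-8859-1), a check like this would fail on the redirect itself rather than on the HTML that accompanies it.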

What Can a Web Service Developer Do?

Provided all your callers are running well-configured applications, this DNS change will not have any visible effect. It only becomes an issue when the caller has an incorrect URL. In that situation, the change may result in misleading error messages such as "302" or "wrong MIME type", instead of the simpler "No such host".

All you can do is document this in both the end user documentation and the support documentation. There are many other error messages related to connectivity, all of which need to be incorporated into a troubleshooting guide. You need such a guide, whether or not the DNS changes stick. The impact of those changes is that the matrix which maps error messages to underlying causes needs to be updated, with some possible extra causes for messages:

Error message	Possible causes
Connection refused	The host exists, but nothing is listening for connections on that port. Site Finder: the URL is using a port other than 80, and the .com or .net address is invalid.
Unknown host	The hostname component of the URL is invalid.
404: Not Found	There is a web server there, but nothing at the exact URL. Proxy servers can also generate 404 pages for unknown hosts.
302: Moved	The content at the end of the URL has moved, and the client application does not follow the links. Site Finder: the .com or .net address is invalid, and the port is, explicitly or by default, port 80.
Other 3xx response	The content at the end of the URL has moved, and the client application does not follow the links.
Wrong content type/MIME type	The URL may be incorrect, or the server application is not returning XML. Site Finder: a 302 response is being returned as the host is unknown.
XML parser error	This can be caused when the content is not XML, but the client application assumes it is. Site Finder: this may be the body of a 302 response due to an unknown host; the client application should check return codes and the Content-Type header.
500: Internal Error	SOAP uses this as a cue that a SOAPFault has been returned, but it can also mean that the server is not working because of some internal fault.
Connection Timed out/ NoRouteToHost	The hostname can be resolved, but not reached. Either the host is missing (potentially a transient fault), or network/firewall issues are preventing access. The client may need to be configured for its proxy server.
GUI hangs/ long pauses	Client application may be timing out on lookups/connects.
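
Part of this matrix can be carried in the client itself. The sketch below is illustrative only: ConnectivityDiagnosis and hint are hypothetical names, and it covers just the rows that surface in Java as java.net exceptions. It assumes that a refused connection arrives as a ConnectException, which is the usual case but not the only thing that exception can mean.

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

// Hypothetical helper that turns low-level failures into rows of the matrix.
final class ConnectivityDiagnosis {

    /** Returns a message naming the endpoint and the most likely cause. */
    static String hint(Throwable failure, String endpoint) {
        String cause;
        if (failure instanceof UnknownHostException) {
            cause = "the hostname component of the URL is invalid";
        } else if (failure instanceof ConnectException) {
            cause = "the host exists but nothing is listening on that port; "
                    + "with Site Finder, the .com or .net hostname may be invalid "
                    + "and the port something other than 80";
        } else if (failure instanceof NoRouteToHostException
                || failure instanceof SocketTimeoutException) {
            cause = "the hostname resolves but the host cannot be reached; "
                    + "check network, firewall and proxy settings";
        } else {
            cause = "unrecognized failure (" + failure + ")";
        }
        // Always name the URL so that it can be tried by hand in a browser.
        return "Could not talk to " + endpoint + ": " + cause;
    }
}
```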

The support line's response to such messages should all be the same:

When a connectivity problem is suspected, get the URL that is at fault; ask the caller to view it in their web browser, and see if you can view it yourself.

This is where you can take advantage of the fact that web service protocols are built on top of, or just are, HTTP, and use the common underlying notion of URLs defining services. Provided those same URLs generate some human-readable content, even if that is an XML message, then the end user and support contact can both bring it up in their web browser. This action is the core technique for diagnosing connectivity problems, primarily because the HTTP infrastructure -- servers, proxies and clients -- is designed to support this diagnosis process.

As a web service provider, you can simplify the process by

Having human-readable content at every URL used in the service. Specifically, you should support GET requests, even if it is only to return a message such as "There is a SOAP endpoint here".
Using URLs that are human readable; short and communicable over the telephone is the ideal.
Having support-accessible logging to provide an escalation path should the problem turn out to be server side.
Always setting the content type to text/xml or a MIME type specific to the XML returned by the service.
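
The first item in this list might be sketched as follows. To stay free of external dependencies, the example uses the small HTTP server bundled with the JDK (com.sun.net.httpserver) rather than a servlet container; the class FriendlyEndpoint, the /endpoint path, the message text, and the 501 placeholder for non-GET requests are all assumptions made for illustration.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical endpoint that answers a browser's GET with a readable message.
final class FriendlyEndpoint {

    /** Text shown to anyone who pastes the endpoint URL into a browser. */
    static String messageFor(String path) {
        return "There is a SOAP endpoint here: " + path + "\n"
                + "POST a SOAP request to this URL to use it.\n";
    }

    static void handle(HttpExchange exchange) throws IOException {
        if ("GET".equals(exchange.getRequestMethod())) {
            byte[] body = messageFor(exchange.getRequestURI().getPath())
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        } else {
            // Placeholder: a real service would dispatch the SOAP request here.
            exchange.sendResponseHeaders(501, -1);
            exchange.close();
        }
    }

    /** Starts the server on the given port and returns it so the caller can stop it. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/endpoint", FriendlyEndpoint::handle);
        server.start();
        return server;
    }
}
```

In an application server the same idea would normally live in the GET handler of whatever servlet fronts the SOAP endpoint.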

Another useful technique is for the service to implement the ping design pattern. The service needs to support a simple ping operation which immediately returns. This operation can be used by clients to probe for the presence of the service, without any side effects or even placing much load on the server. Client applications should initiate communications with a server -- uploads, complex requests, etc -- by pinging it first. This detects failure early on, often at a lower cost.
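
On the client side, the ping-first idea might look like the sketch below. The Service interface and its ping and upload operations are hypothetical stand-ins for whatever stub a toolkit generates; the only point illustrated is the ordering, in which the expensive call is never attempted when the cheap probe fails.

```java
import java.io.IOException;

// Hypothetical client-side wrapper around a service that offers a ping operation.
final class PingFirst {

    /** Stand-in for a generated service stub. */
    interface Service {
        void ping() throws IOException;
        void upload(byte[] payload) throws IOException;
    }

    /** Probes the service before committing to a large upload. */
    static void send(Service service, byte[] payload) throws IOException {
        try {
            service.ping();
        } catch (IOException e) {
            // Fail early and cheaply, keeping the original cause for support.
            throw new IOException("Service did not answer ping; upload not attempted", e);
        }
        service.upload(payload);
    }
}
```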

What Can the Developer of a Web Service Client Application Do?

Developers of web service client applications are on the front line here. Even if they use a WSDL-based code generation process that hides underlying URLs, or discover services using UDDI, Rendezvous, or some other mechanism, their program will still encounter connectivity problems. Networks are fundamentally unreliable; laptops move around and go offline; services get switched off.

They need to handle the connectivity problems and fail in a way that allows the problem to be diagnosed and corrected.

It is good to translate framework errors/exceptions into error messages that are comprehensible by end users. XML parser errors, HTTP error codes, and complaints about MIME types are not suitable for average end users, though the support organization may need these.
The target URL that failed needs to be disclosed to the end user, so that they can test it by hand.
For any error, the response body needs to be preserved for the benefit of support.
The fault diagnosis matrix listed above needs to be adapted to the client and included in the documentation.
If the service implements a ping operation, use it to probe for service existence, preferably in a background thread or asynchronous call, so that the GUI does not block.
Clients need to be tested over slow and unreliable networks. The Axis tcpmon SOAP monitor/HTTP proxy can be used to simulate slow HTTP connections.
Always verify that the MIME type of received content is exactly that documented.
Test the client's handling of HTTP response codes, and of HTML responses when XML is expected.
Java developers should look at "Address Caching" under java.net.InetAddress. Applications need to be configured to only cache DNS lookups, successful and unsuccessful, for a short period of time.
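
For the last item, the InetAddress documentation describes two security properties, networkaddress.cache.ttl and networkaddress.cache.negative.ttl, that control how long lookups are cached. The sketch below sets them programmatically; the class name DnsCachePolicy and the values of 60 and 10 seconds are illustrative assumptions, not recommendations. The properties typically need to be set before the JVM performs its first lookup, and they can also be set in the java.security file instead.

```java
import java.security.Security;

// Hypothetical start-up helper; call it before any hostname is resolved.
final class DnsCachePolicy {

    /** Limits how long successful and failed lookups are cached, in seconds. */
    static void apply(int positiveSeconds, int negativeSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeSeconds));
    }

    public static void main(String[] args) {
        apply(60, 10); // example values only
    }
}
```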

One question is whether or not to follow 302 redirects. While this is ordinarily useful, the new DNS behavior means that it could be troublesome. Follow a 302 and you may end up at Site Finder, trying to parse HTML in an XML parser. This may be a good place to insert Site Finder recognition into the application; redirects to that site can be mapped to an unknown host error; all other redirects can be followed.
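
A minimal sketch of such recognition follows. It keys on the hostname seen in the redirects captured earlier in this article, sitefinder.verisign.com, which is exactly the hard-coded detail that makes the approach brittle: if the redirect target changes, the check silently stops working. RedirectPolicy, Action, and decide are hypothetical names, and a malformed Location header is simply allowed to raise IllegalArgumentException.

```java
import java.net.URI;

// Hypothetical policy for handling the Location header of a 302 response.
final class RedirectPolicy {

    enum Action { FOLLOW, TREAT_AS_UNKNOWN_HOST }

    /** Host taken from the traces in this article; an assumption that may go stale. */
    private static final String SITE_FINDER_HOST = "sitefinder.verisign.com";

    static Action decide(String locationHeader) {
        String host = URI.create(locationHeader).getHost();
        if (host != null && host.equalsIgnoreCase(SITE_FINDER_HOST)) {
            return Action.TREAT_AS_UNKNOWN_HOST;
        }
        return Action.FOLLOW;
    }
}
```

Whether a redirect that is not from Site Finder should be followed automatically is a separate decision, particularly for a POST.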

Conclusions

The changes that VeriSign made to the .com and .net domains will make it harder to diagnose errors in the URLs used by programs to access web services. However, there was always a chance that an incorrect URL would lead to a confusing error message from the underlying protocol stack. Protocol stacks and client-side applications can be written to handle such errors in a way that makes diagnosis easier; doing so has broader benefits than just addressing the recent DNS changes.

These changes have not helped web services, or any other distributed application protocol. If they return, we are going to have to get used to "Connection refused" and HTTP error code 302 responses as cues for nonexistent hosts. This is going to lead to more support calls, and perhaps some coding to translate these cues into end user messages. Needless to say, VeriSign is not offering to pay for these costs incurred by its actions.

On October 3, ICANN got VeriSign to "temporarily suspend" the Site Finder service, under the threat of legal action for breach of contract. With any luck, it will stay suspended, though as review boards and lawyers get involved, it will be hard to be sure.

The amount of traffic to the Site Finder site has propelled it to a top ten Internet site, so the kickback from funded links in the search terms is potentially huge. The thought of all that found money will deafen VeriSign's ears to complaints from the developer and networking community. Unfortunately that money probably used to go to AOL, MSN, Earthlink, and Microsoft. I do not see these organizations quietly giving up all this money. As well as the legal path, they have some technical options:

1. Patch their DNS servers to ignore wildcards on the .net and .com domains.
2. Patch their DNS servers to forward to ISP-specific search engines.
3. Patch their web browsers or proxies to recognize a Site Finder redirect and redirect it to their own search engines.

VeriSign could do nothing about options (1) and (2). Option (3) is most easily achieved by the web browser vendor, which means Microsoft. MS could patch IE over the Windows Update mechanism and effectively deny VeriSign 90% of their potential audience. This would destroy VeriSign's justification for the changes and encourage it to revert to an RFC-compliant implementation of DNS. More likely, VeriSign would try to change their redirect URLs to get past the patch, leading to an ongoing patch-war between the effective owners of DNS and the effective owners of the web browser. Anyone hard-coding Site Finder workarounds into their own programs would be victims of such a battle.

If there is one saving grace with web services, it is that users have the option of pasting the target URL into a browser to see what is going wrong. To make this simple, developers of web service protocol stacks and applications need to ensure that the target URL is included in all error reports, and that GET queries of all endpoints return some meaningful information.