
httplib2: HTTP Persistence and Authentication

March 29, 2006

Joe Gregorio


Last time we covered HTTP caching and how it can improve the performance of your web service. This time we'll cover some other aspects of HTTP that, if fully utilized, can also speed up your web service.

Persistent Connections

Persistent connections are critical to performance. In early versions of HTTP, connections from the client to the server were built up and torn down for every request. That's a lot of overhead on the client, on the server, and on any intermediaries. The persistent connection approach, that is, keeping the same socket connection open for multiple requests, is the default behavior in HTTP 1.1.
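
In httplib2, for instance, you get connection reuse just by holding on to a single Http object; here's a minimal sketch (it assumes example.org is reachable and keeps the connection open between requests):

  import httplib2

  h = httplib2.Http()
  # Both requests to the same host travel over one socket, as long as
  # neither side sends Connection: close in between.
  resp, content = h.request("http://example.org/", "GET")
  resp, content = h.request("http://example.org/news", "GET")   # no new TCP setup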

Now if all HTTP 1.1 connections are considered persistent, then there must be a mechanism to signal that the connection is to be closed, right? That is handled by the Connection: header.

  Connection: close

The header signals that the connection is to be closed after the current request-response is finished. Note that either the client or the server can send such a header.

If you allow persistent connections, then the next obvious optimization is pipelining: stuffing a bunch of requests down a socket without waiting for the response from the first request to be returned before sending subsequent requests. Now this only works for certain types of requests; at a minimum, those requests have to be idempotent. Now aren't you glad you made all your GETs idempotent when you designed your RESTful web service?
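
Pipelining is easiest to see at the socket level; here's a rough sketch (example.com and the paths are placeholders, and a real client still has to split the concatenated responses apart using Content-Length or chunked encoding):

  import socket

  # Two GETs written back to back before reading anything; the second one
  # asks the server to close the connection when it's done.
  requests = (
      b"GET /first HTTP/1.1\r\nHost: example.com\r\n\r\n"
      b"GET /second HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
  )

  sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  sock.connect(("example.com", 80))
  sock.sendall(requests)

  chunks = []
  while True:
      data = sock.recv(4096)
      if not data:              # server closed after the second response
          break
      chunks.append(data)
  sock.close()

  raw = b"".join(chunks)        # both responses, in the order requested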

Compression

So now we're saving time and bandwidth by using caching to avoid retrieving content if it hasn't changed, and using persistent connections to avoid the overhead of tearing down and rebuilding sockets. If you have an entity to transfer, then you can still speed things up by transferring fewer bytes over the wire--that is, by using compression.

Though RFC 2616 specifies three types of compression, the values are actually tracked in an IANA registry and could in theory be supplemented by others. But it's been nine years since HTTP 1.1 was released and the registry hasn't grown yet. Even then, of the three types of compression specified, only two, gzip and deflate, are regularly seen in the wild.

The way compression normally works is that the client announces the types of compression it can handle by using the Accept-Encoding: request header:

  Accept-Encoding: gzip;q=1.0, identity; q=0.5, *;q=0

Those are weighted parameters with resolution rules similar to the mime-types in Accept: headers. I covered parsing and interpreting those in Just Use Media Types?, which you should read if you missed it the first time.
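
If you have to do that resolution yourself, the bookkeeping is small. Here's a simplified sketch of a server picking a content-coding from that header (it skips corner cases such as the exact rules for identity and the * wildcard):

  # A simplistic sketch of choosing a content-coding from Accept-Encoding.
  def parse_accept_encoding(header):
      """Return a list of (coding, q) pairs, highest q first."""
      codings = []
      for item in header.split(","):
          parts = [p.strip() for p in item.split(";")]
          coding = parts[0]
          q = 1.0
          for param in parts[1:]:
              if param.startswith("q="):
                  q = float(param[2:])
          codings.append((coding, q))
      codings.sort(key=lambda pair: pair[1], reverse=True)
      return codings

  supported = ["gzip", "identity"]
  for coding, q in parse_accept_encoding("gzip;q=1.0, identity; q=0.5, *;q=0"):
      if q > 0 and coding in supported:
          print("Use %s" % coding)   # prints "Use gzip"
          break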

If the server supports any of the listed compression types, it can compress the response entity and announce how a response was compressed so that the client can decompress it correctly. That information is carried by the Content-Encoding: header.

  Content-Encoding: gzip
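
From the client's side the whole exchange is only a few lines; here's a bare-bones sketch using just the standard library (a real client should also handle deflate and be suspicious of servers that mislabel the encoding):

  import gzip
  import urllib.request

  # Announce that gzip is acceptable, then decompress only if the server
  # says it actually used it.
  req = urllib.request.Request(
      "http://example.org/",
      headers={"Accept-Encoding": "gzip, identity"},
  )
  with urllib.request.urlopen(req) as resp:
      body = resp.read()
      if resp.headers.get("Content-Encoding") == "gzip":
          body = gzip.decompress(body)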

In the process of implementing httplib2 I also discovered some rough spots in HTTP implementations.

Authentication

In the past, people have asked me how to protect their web services and I've told them to just use HTTP authentication, by which I meant either Basic or Digest as defined in RFC 2617.

For most authentication requirements, using Basic alone isn't really an option since it transmits your name and password unencrypted. Yes, it encodes them as base64, but that's not encryption.
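
If you want to see just how thin that protection is, here's all it takes to produce and reverse a Basic credential (the username and password are made up):

  # Basic credentials are only base64-encoded, never encrypted.
  import base64

  encoded = base64.b64encode(b"joe:secret").decode("ascii")
  print("Authorization: Basic " + encoded)    # Authorization: Basic am9lOnNlY3JldA==
  print(base64.b64decode(encoded))            # b'joe:secret'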

The other option is Digest, which tries to protect your password by not transferring it directly, but uses challenges and hashes to let the client prove to the server that it knows a shared secret.

Here's the "executive summary" of HTTP Digest authentication:

  1. The server rejects an unauthenticated request with a challenge, sent in a WWW-Authenticate: header. That challenge contains a nonce, a random string generated by the server.
  2. The client sends the same request again, but this time with an Authorization: header that contains a hash built from the supplied nonce, the username, the realm, the password, the request URI, and the HTTP method (sketched below).
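
Here's a rough sketch of that hash calculation for the plain MD5 case, ignoring the qop options discussed below; the realm, nonce, and credentials are invented values:

  # A sketch of the plain RFC 2617 response calculation (algorithm=MD5, no qop).
  from hashlib import md5

  def md5_hex(data):
      return md5(data.encode("utf-8")).hexdigest()

  username, password = "joe", "secret"
  realm = "example"                                 # from the server's challenge
  nonce = "dcd98b7102dd2f0e8b11d0f600bfb0c093"      # from the server's challenge
  method, uri = "GET", "/protected/report"

  ha1 = md5_hex("%s:%s:%s" % (username, realm, password))
  ha2 = md5_hex("%s:%s" % (method, uri))
  response = md5_hex("%s:%s:%s" % (ha1, nonce, ha2))
  # With qop=auth the last line becomes
  #   md5_hex("%s:%s:%s:%s:%s:%s" % (ha1, nonce, nc, cnonce, "auth", ha2))
  # where nc is the request count and cnonce is a client-chosen nonce.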

The problem with Digest is that it suffers from too many options, which are implemented non-uniformly, and not always correctly. For example, there is an option to include the entity body in the calculation of the hash, called auth-int. There are also two different kinds of hashing, MD5 and MD5-sess. The server can return a fresh challenge nonce with every response, or the client can include a monotonically increasing nonce-count value with each request. The server also has the option of returning a digest of its own, which is a way the server can prove to the client that it also knows the shared secret.

With all those options it doesn't seem surprising that there are interop problems. For example, Apache 2.0 does not do auth-int in Digest. While Python's urllib2 claims to do MD5-sess, Apache does not implement it correctly. Looking at the code of Python's urllib2, it also appears to support the SHA hash in addition to the standard MD5 hash; the only problem is that there's no mention of SHA as an option in RFC 2617. And, of course, no mention of Digest is complete without mentioning Internet Explorer, which doesn't calculate the digest correctly for URIs that have query parameters.

Now in case it seems like we're trapped in a twisted Monty Python sketch, there are some bright spots: on Apache 2.0.51 or later you can get IE and Digest to work by using this directive:

  BrowserMatch "MSIE" AuthDigestEnableQueryStringHack=On

OK, you know you're in trouble when a directive called AuthDigestEnableQueryStringHack is the bright spot.

Oh yeah, one last twist in implementing both Basic and Digest: you should keep track of the URIs you have already authenticated against, because if you attempt to access a URI "below" an authenticated URI, you can send authentication on the first request rather than waiting for a challenge. By "below," I mean based on the URI path. Also, be prepared for the possibility that a URI deeper in the path may require a different set of credentials or a different authentication scheme.
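
Here's a sketch of that bookkeeping if you're tracking credentials yourself rather than relying on a library; the paths and credentials are made up:

  # Remember which parts of a site have been authenticated, keyed by path
  # prefix, so credentials can be sent preemptively on later requests.
  authenticated = {}           # path prefix -> (scheme, credentials)

  def record_auth(path, scheme, credentials):
      authenticated[path.rstrip("/") + "/"] = (scheme, credentials)

  def credentials_for(path):
      """Return the auth info for the deepest matching prefix, if any."""
      best = None
      for prefix, info in authenticated.items():
          if path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
              best = (prefix, info)
      return best[1] if best else None

  record_auth("/projects", "basic", ("joe", "secret"))
  record_auth("/projects/private", "digest", ("admin", "hunter2"))

  print(credentials_for("/projects/private/report"))   # the digest credentials
  print(credentials_for("/projects/readme"))            # the basic credentials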

If you move outside of RFC 2617 you could use WSSE, but it isn't really specified for plain HTTP: it doesn't work in any known browsers; it was originally designed for WS-Security and only unofficially ported to ride in HTTP headers rather than in a SOAP envelope; and the definitive reference is an XML.com article, which, august publication though XML.com is, isn't the IETF or the W3C.

Now you might think I could use TLS (HTTPS), which is what lots of web apps and services use in conjunction with HTTP Basic. But you should realize that I, like many other people, use a shared hosting account; even if I wanted to shell out the money to buy a certificate, I wouldn't be able to set up TLS for my site, since TLS effectively demands a dedicated IP address: the server has to choose a certificate before it ever sees the Host: header, so name-based virtual hosts sharing one address can't each present their own. This is really too bad since client-side support for TLS (HTTPS) seems pretty good.

The bad news is that the current state of security with HTTP is bad. The best interoperable solution is Basic over HTTPS. The good news is that everyone agrees the situation stinks and there are multiple efforts afoot to fix the problem. Just be warned that security is not a one-size-fits-all game, and the result of all this heat and smoke may be several new authentication schemes, each targeted at a different user community.

For further reading you may want to check out this W3C note from 1999 (!), User Agent Authentication Forms. In addition the WHATWG's Web Applications 1.0 specification lists as a requirement "Better defined user authentication state handling. (Being able to 'log out' of sites reliably, for instance, or being able to integrate the HTTP authentication model into the Web page.)"

Redirects

As I implemented 3xx redirects I came across a couple of things that were new to me, some of which could provide performance boosts. Now, in general, the 3xx series of HTTP status codes are either for redirecting the client to a new location or for indicating that more work needs to be done by the client.

One of the things I learned is that 300, 301, 302, and 307 responses are all cacheable in some circumstances: 300 and 301 are cacheable by default, while 302 and 307 are cacheable only in the presence of cache control headers that allow it. That means that if your client implements caching, it may avoid one or more round trips if it is able to cache those 3xx responses.
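
Here's a small httplib2 sketch of the idea; example.org/old is a placeholder that is assumed to answer with a cacheable 301:

  import httplib2

  h = httplib2.Http(".cache")          # directory-backed cache
  resp, content = h.request("http://example.org/old", "GET")
  print(resp.status, resp.get("content-location", ""))

  # If the 301 was cacheable, the second request can head straight for the
  # new location instead of revisiting /old first.
  resp, content = h.request("http://example.org/old", "GET")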

httplib2

At the end of my last article I introduced httplib2, a Python client library that implements all the caching covered in that article. So for those of you keeping track at home, httplib2 also handles many of the things covered here: HTTPS, Keep-Alive, Basic, Digest, WSSE, and both gzip and deflate forms of compression.
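
To give a flavor of how those pieces fit together, here's roughly what using it looks like (the URI and credentials are placeholders):

  import httplib2

  h = httplib2.Http(".cache")            # responses get cached in this directory
  h.add_credentials("joe", "secret")     # used when the server demands authentication
  resp, content = h.request("https://example.org/service/", "GET")
  print(resp.status, resp.get("content-type", ""))

That's enough of libraries and specs for now; next article, we'll get back to writing code and putting all this infrastructure to work.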