Doing HTTP Caching Right: Introducing httplib2

February 1, 2006

Joe Gregorio

You need to understand HTTP caching. No, really, you do. I have mentioned repeatedly that you need to choose your HTTP methods carefully when building a web service, in part because you can get the performance benefits of caching with GET. Well, if you want to get the real advantages of GET then you need to understand caching and how you can use it effectively to improve the performance of your service.

This article will not explain how to set up caching for your particular web server, nor will it cover the different kinds of caches. If you want that kind of information I recommend Mark Nottingham's excellent tutorial on HTTP caching.

Goals

First you need to understand the goals of the HTTP caching model. One objective is to let both the client and server have a say over when to return a cached entry. As you can imagine, letting both sides influence when a cached entry is considered stale introduces some complexity.

The HTTP caching model is based on validators, which are bits of data that a client can use to check that a cached response is still valid. They are fundamental to the operation of caches since they allow a client or intermediary to query the status of a resource without having to transfer the entire response again: the server returns an entity body only if the validator indicates that the cache has a stale response.

Validators

One of the validators for HTTP is the ETag. An ETag is like a fingerprint for the bytes in the representation; if a single byte changes the ETag also changes.

Using validators requires that you have already done a GET on a resource. The cache stores the value of the ETag header if present and then uses the value of that header in later requests to that same URI.

For example, if I send a request to example.org and get back this response:

HTTP/1.1 200 OK
Date: Fri, 30 Dec 2005 17:30:56 GMT
Server: Apache
ETag: "11c415a-8206-243aea40"
Accept-Ranges: bytes
Content-Length: 33286
Vary: Accept-Encoding,User-Agent
Cache-Control: max-age=7200
Expires: Fri, 30 Dec 2005 19:30:56 GMT
Content-Type: image/png 

-- binary data --

Then the next time I do a GET I can add the validator in. Note that the value of ETag is placed in the If-None-Match: header.

GET / HTTP/1.1
Host: example.org
If-None-Match: "11c415a-8206-243aea40"

If there was no change in the representation then the server returns a 304 Not Modified.

HTTP/1.1 304 Not Modified 
Date: Fri, 30 Dec 2005 17:32:47 GMT

If there was a change, the new representation is returned with a status code of 200 and a new ETag.

HTTP/1.1 200 OK
Date: Fri, 30 Dec 2005 17:32:47 GMT
Server: Apache
ETag: "0192384-9023-1a929893"
Accept-Ranges: bytes
Content-Length: 33286
Vary: Accept-Encoding,User-Agent
Cache-Control: max-age=7200
Expires: Fri, 30 Dec 2005 19:30:56 GMT
Content-Type: image/png 

-- binary data --
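The server side of this exchange can be sketched in a few lines of Python. This is only an illustration of the decision described above, not how any particular server implements it: the function name is made up, weak validators (W/ prefixes) are ignored, and ETags that themselves contain commas are not handled.

```python
def respond_to_conditional_get(if_none_match, current_etag):
    """Return the status code a server might send for a GET.

    if_none_match is the raw If-None-Match request header value (or None);
    current_etag is the quoted ETag of the current representation.
    """
    if if_none_match is not None:
        # The header may carry several comma-separated ETags, or "*".
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or current_etag in candidates:
            return 304  # Not Modified: headers only, no entity body
    return 200  # send the full representation along with its ETag
```

With the ETags from the examples above, a request carrying "11c415a-8206-243aea40" gets a 304 while that is still the current ETag, and a 200 once the representation has changed to "0192384-9023-1a929893".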

Cache-Control

While validators are used to test if a cached entry is still valid, the Cache-Control: header is used to signal how long a representation can be cached. The most fundamental of all the cache-control directives is max-age. This directive asserts that a cached response can be at most max-age seconds old before it is considered stale. Note that max-age can appear in both request headers and response headers, which gives both the client and server a chance to assert how old a cached response they are willing to accept. If a cached response is fresh then we can return the cached response immediately; if it's stale then we need to validate the cached response before returning it.

Let's take another look at our example response from above. Note that the Cache-Control: header is set and that a max-age of 7200 means that the entry can be cached for up to two hours.

HTTP/1.1 200 OK
Date: Fri, 30 Dec 2005 17:32:47 GMT
Server: Apache
ETag: "0192384-9023-1a929893"
Accept-Ranges: bytes
Content-Length: 33286
Vary: Accept-Encoding,User-Agent
Cache-Control: max-age=7200
Expires: Fri, 30 Dec 2005 19:30:56 GMT
Content-Type: image/png
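To make the freshness check concrete, here is a rough Python sketch. It assumes the age of a response can be approximated by the time elapsed since its Date: header, which ignores the Age: header and clock skew that a real cache has to account for, and it only looks at max-age. The function name is my own.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_fresh(headers, now):
    """Return True if a cached response is still fresh at time `now`."""
    match = re.search(r"\bmax-age=(\d+)", headers.get("Cache-Control", ""))
    if match is None:
        return False  # no max-age: treat the entry as stale and validate it
    max_age = timedelta(seconds=int(match.group(1)))
    # Approximate the age by the time elapsed since the Date: header.
    age = now - parsedate_to_datetime(headers["Date"])
    return age < max_age


headers = {
    "Date": "Fri, 30 Dec 2005 17:32:47 GMT",
    "Cache-Control": "max-age=7200",
}
# Half an hour after the response was generated it is still fresh.
print(is_fresh(headers, datetime(2005, 12, 30, 18, 0, 0, tzinfo=timezone.utc)))
```

A fresh entry can be returned straight from the cache; a stale one has to be validated first.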

There are lots of directives that can be put in the Cache-Control: header, and the Cache-Control: header may appear in requests, responses, or both.

Directives Allowed in a Request

Directive                Description
no-cache                 The cached response must not be used to satisfy this request.
no-store                 Do not store this response in a cache.
max-age=delta-seconds    The client is willing to accept a cached response that is delta-seconds old without validating.
max-stale=delta-seconds  The client is willing to accept a cached response that is no more than delta-seconds stale.
min-fresh=delta-seconds  The client is willing to accept only a cached response that will still be fresh delta-seconds from now.
no-transform             The entity body must not be transformed.
only-if-cached           Return a response only if there is one in the cache. Do not validate or GET a response if no cache entry exists.

Directives Allowed in a Response

Directive                Description
public                   This can be cached by any cache.
private                  This can be cached only by a private cache.
no-cache                 The cached response must not be used on subsequent requests without first validating it.
no-store                 Do not store this response in a cache.
no-transform             The entity body must not be transformed.
must-revalidate          If the cached response is stale it must be validated before it is returned in any response. Overrides max-stale.
max-age=delta-seconds    The response can be cached for up to delta-seconds before it is considered stale.
s-maxage=delta-seconds   Just like max-age but it applies only to shared caches.
proxy-revalidate         Like must-revalidate, but only for proxies.
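Before a cache can act on any of these directives it has to pull them out of the header. A minimal parsing sketch in Python might look like the following; the function name is hypothetical, and it deliberately does not handle quoted arguments that contain commas.

```python
def parse_cache_control(value):
    """Parse a Cache-Control header value into a dict of directives."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        # Directives without an argument (e.g. no-cache) map to True.
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else True
    return directives


print(parse_cache_control("public, must-revalidate, max-age=7200"))
```

Arguments are kept as strings here, so a caller that wants the number of seconds still has to convert them with int().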

Let's look at some Cache-Control: header examples.

Cache-Control: private, max-age=3600

If sent by a server, this Cache-Control: header states that the response can be cached only by a private cache, and for at most one hour.

Cache-Control: public, must-revalidate, max-age=7200

The included response can be cached by a public cache and can be cached for two hours; after that the cache must revalidate the entry before returning it to a subsequent request.

Cache-Control: must-revalidate, max-age=0

This forces the client to revalidate every request, since a max-age=0 forces the cached entry to be instantly stale. See Mark Nottingham's Leveraging the Web: Caching for a nice example of how this can be applied.

Cache-Control: no-cache

This is pretty close to must-revalidate, max-age=0, except that a client could use a max-stale directive on a request and get a stale response. The must-revalidate will override the max-stale directive. I told you that giving both client and server some control would make things a bit complicated.

So far all of the Cache-Control: header examples we have looked at are on the response side, but they can be added to the request too.

Cache-Control: no-cache

This forces an "end-to-end reload," where the client forces the cache to reload its entry from the origin server.

Cache-Control: min-fresh=200

Here the client asserts that it wants a response that will be fresh for at least 200 seconds.
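Here is one way the request-side directives could be weighed against a cached entry, as a Python sketch. I am assuming the cache already knows the entry's age and its freshness lifetime in seconds, and that the request directives arrive as a dict mapping each name to its string argument (or True for a bare directive). The function name is made up, and response directives such as must-revalidate are ignored to keep it short.

```python
def can_use_cached(age, lifetime, request_cc):
    """Decide whether a cached response may be returned without validation.

    age and lifetime are in seconds; request_cc maps request directive
    names to their string argument, or to True for bare directives.
    """
    if "no-cache" in request_cc:
        return False  # the client insists on going back to the server
    if "max-age" in request_cc and age > int(request_cc["max-age"]):
        return False  # older than the client is willing to accept
    if "min-fresh" in request_cc:
        # The entry must stay fresh for at least min-fresh more seconds.
        return lifetime - age >= int(request_cc["min-fresh"])
    if age < lifetime:
        return True  # still fresh
    if "max-stale" in request_cc:
        limit = request_cc["max-stale"]
        # A bare max-stale accepts a response of any staleness.
        return limit is True or age - lifetime <= int(limit)
    return False


# An entry with 100 seconds of freshness left fails min-fresh=200.
print(can_use_cached(7100, 7200, {"min-fresh": "200"}))
```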

Vary

You may be wondering about situations where a cache might get confused. For example, what if a server does content negotiation, where different representations can be returned from the same URI? For cases like this HTTP supplies the Vary: header. The Vary: header informs the cache of the names of all the headers that might cause a resource's representation to change.

For example, if a server does content negotiation then the Content-Type: header would be different for the different types of responses, depending on the type of content negotiated. In that case the server can add a Vary: Accept header, which causes the cache to consider the Accept: header when caching responses from that URI.

Date: Mon, 23 Jan 2006 15:37:34 GMT
Server: Apache
Accept-Ranges: bytes
Vary: Accept-Encoding,User-Agent
Content-Encoding: gzip
Cache-Control: max-age=7200
Expires: Mon, 23 Jan 2006 17:37:34 GMT
Content-Length: 5073
Content-Type: text/html; charset=utf-8

In this example the server is stating that responses can be cached for two hours, but that responses may vary based on the Accept-Encoding and User-Agent headers.
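One simple way for a cache to honor Vary: is to fold the named request headers into the key it stores the response under. The Python sketch below shows that idea; it is an assumption about how a cache could be organized rather than a description of any real one, the function name is my own, and it does not treat "Vary: *" specially.

```python
def cache_key(uri, request_headers, vary_value):
    """Build a cache key from the URI plus the request headers named in Vary."""
    # Header names are case-insensitive, so normalise them to lower case.
    request = {name.lower(): value for name, value in request_headers.items()}
    varied = sorted(
        name.strip().lower() for name in vary_value.split(",") if name.strip()
    )
    # A missing request header is recorded as an empty string.
    return (uri,) + tuple((name, request.get(name, "")) for name in varied)


vary = "Accept-Encoding,User-Agent"
gzip_key = cache_key("http://example.org/", {"Accept-Encoding": "gzip"}, vary)
plain_key = cache_key("http://example.org/", {"Accept-Encoding": "identity"}, vary)
print(gzip_key != plain_key)  # the two requests get separate cache entries
```

Request headers that are not named in Vary: do not affect the key, so they cannot split the cache.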

Connection

When a server successfully validates a cached response, using for example the If-None-Match: header, then the server returns a status code of 304 Not Modified. So nothing much happens on a 304 Not Modified response, right? Well, not exactly. In fact, the server can send updated headers for the entity that have to be updated in the cache. The server can also send along a Connection: header that says which headers shouldn't be updated.

Some headers are by default excluded from the list of headers to update. These are called hop-by-hop headers and they are: Connection, Keep-Alive, Proxy-Authenticate, Proxy-Authorization, TE, Trailers, Transfer-Encoding, and Upgrade. All other headers are considered end-to-end headers.

HTTP/1.1 304 Not Modified
Content-Length: 647
Server: Apache
Connection: close
Date: Mon, 23 Jan 2006 16:10:52 GMT
Content-Type: text/html; charset=iso-8859-1

...

In the above example Date: is not a hop-by-hop header nor is it listed in the Connection: header, so the cache has to update the value of Date: in the cache.
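The update rule can be sketched in Python as follows. Headers are modeled as plain dicts, the function name is hypothetical, and the sketch does nothing beyond the rule just described: copy every end-to-end header from the 304 into the cached entry, skipping the hop-by-hop headers and anything named in Connection:.

```python
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailers", "transfer-encoding", "upgrade",
}


def merge_304_headers(cached, response_304):
    """Return the cached headers updated with the end-to-end headers of a 304."""
    # Header names are compared case-insensitively; values are kept as-is.
    new = {name.lower(): value for name, value in response_304.items()}
    # Headers named in Connection: are excluded from the update as well.
    listed = {
        token.strip().lower()
        for token in new.get("connection", "").split(",")
        if token.strip()
    }
    merged = {name.lower(): value for name, value in cached.items()}
    for name, value in new.items():
        if name not in HOP_BY_HOP and name not in listed:
            merged[name] = value
    return merged


cached = {"Date": "Fri, 30 Dec 2005 17:30:56 GMT", "ETag": '"11c415a-8206-243aea40"'}
not_modified = {"Date": "Mon, 23 Jan 2006 16:10:52 GMT", "Connection": "close"}
print(merge_304_headers(cached, not_modified)["date"])
```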

If Only It Were That Easy

While a little complex, the above is at least conceptually nice. Of course, one of the problems is that we have to be able to work with HTTP 1.0 servers and caches. These use a different set of headers, all time-based, to do caching, and out of necessity those headers are brought forward into HTTP 1.1.

The older cache control model from HTTP 1.0 is based solely on time. The Last-Modified cache validator is just that, the last time that the resource was modified. The cache uses the Date:, Expires:, Last-Modified:, and If-Modified-Since: headers to detect changes in a resource.

If you are developing a client you should always use both validators if present; you never know when an HTTP 1.0 cache will pop up between you and a server. This is the protocol equivalent of wearing a belt and suspenders. HTTP 1.1 was published seven years ago, so you'd think that at this late date most things would be updated, but you can't count on it.
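In client code, using both validators comes down to copying whichever ones the cached response carried into the matching request headers. A small Python sketch, with a made-up function name and a Last-Modified value invented purely for the example:

```python
def conditional_headers(cached_headers):
    """Build validation request headers from a cached response's headers."""
    cached = {name.lower(): value for name, value in cached_headers.items()}
    headers = {}
    if "etag" in cached:
        # The HTTP 1.1 validator goes into If-None-Match.
        headers["If-None-Match"] = cached["etag"]
    if "last-modified" in cached:
        # The time-based HTTP 1.0 validator goes into If-Modified-Since.
        headers["If-Modified-Since"] = cached["last-modified"]
    return headers


print(conditional_headers({
    "ETag": '"11c415a-8206-243aea40"',
    "Last-Modified": "Thu, 29 Dec 2005 09:00:00 GMT",  # hypothetical value
}))
```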

Now that you understand caching you may be wondering if the client library in your favorite language even supports caching. I know the answer for Python, and sadly that answer is currently no. It pains me that my favorite language doesn't have one of the best HTTP client implementations around. That needs to change.

Introducing httplib2

Introducing httplib2, a comprehensive Python HTTP client library that supports a local private cache that understands all the caching operations we just talked about. In addition it supports many features left out of other HTTP libraries.

HTTP and HTTPS: HTTPS support is available only if the socket module was compiled with SSL support.
Keep-Alive: Supports HTTP 1.1 Keep-Alive, keeping the socket open and performing multiple requests over the same connection if possible.
Authentication: Three types of HTTP Authentication are supported. These can be used over both HTTP and HTTPS.
Caching: The module can optionally operate with a private cache that understands the Cache-Control: header and uses both the ETag and Last-Modified cache validators.
All Methods: The module can handle any HTTP request method, not just GET and POST.
Redirects: Automatically follows 3XX redirects on GETs.
Compression: Handles both compress and gzip types of compression.
Lost Update Support: Automatically adds back ETags into PUT requests to resources we have already cached. This implements Section 3.2 of Detecting the Lost Update Problem Using Unreserved Checkout.
Unit Tested: A large and growing set of unit tests.

See the httplib2 project page for more details.

Next Time

Next time I will cover HTTP authentication, redirects, keep-alive, and compression in HTTP and how httplib2 handles them. You might also be wondering how the "big guys" handle caching. That will take a whole other article to cover.