Just Use Media Types?
In three of my four Restful Web columns, I've been describing the design of a REST web service for creating and managing web bookmarks. It's now time to get down to some coding. The major part of creating such a service is implementing the method of dispatching: how does an incoming HTTP request get routed to the right piece of code?
As a quick recap, here is the table summarizing the resources in
our bookmark service. You will notice that the second column,
Representations, now lists the media types of the representations we
will accept at each of our resources. For now we'll assume that
application/xbel+xml is a valid media type, even though it is
not, in fact, registered. IANA maintains a list of the registered
media types. If it's not on that list, it's not really a valid
type. If you want to officially register a media type, the IANA has a
web page for doing
so.
For the simple format that we are using as the representation
of [user]/config, we will use the
media type application/xml. See RFC 3023 and
Mark Pilgrim's XML.com article XML on the Web Has Failed to
learn why we don't use text/xml.
| URI | Representations | Description |
|---|---|---|
| [user]/bookmark/[id]/ | application/xbel+xml | A single bookmark for "user" |
| [user]/bookmarks/ | application/xbel+xml | The 20 most recent bookmarks for "user" |
| [user]/bookmarks/all/ | application/xbel+xml | All the bookmarks for "user" |
| [user]/bookmarks/tags/[tag] | application/xbel+xml | The 20 most recent bookmarks for "user" that were filed in the category "tag" |
| [user]/bookmarks/date/[Y]/[M]/ | application/xbel+xml | All the bookmarks for "user" that were created in a certain year [Y] or month [M] |
| [user]/config/ | application/xml | A list of all the "tags" a user has ever used |
The first confusion to get out of the way is MIME versus media. In many discussions of HTTP you will see reference to both MIME types and media types. What's the difference? MIME stands for Multipurpose Internet Mail Extensions, which are extensions to RFC 822 that allow the transporting of something besides plain ASCII text. If you are going to allow other stuff--that is, other media besides plain text--then you will need to know what type of media it is. Thus RFC 2045 gave birth to MIME Media-Types. They have spread beyond mail messages--that is, beyond MIME--and that includes HTTP. The list of types is used by both MIME and HTTP, but that doesn't mean the HTTP entities are valid RFC 2045 entities--in fact, they aren't.
So where does that leave us? MIME Media-Type is rather awkward, so it's often shortened to MIME type or media type. For our purposes here, they are the same thing.
One of the benefits of using HTTP correctly is that we can dispatch on a whole range of things. To make the discussion more concrete, let's look at an example HTTP request:
GET / HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: Mozilla/5.0 (...) Gecko/20050511 Firefox/1.0.4
Accept: text/xml, application/xml, application/xhtml+xml, text/html;q=0.9,
text/plain;q=0.8, image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
There are three items of interest here. First, the HTTP request method
is GET. Second, the URI is carried in two locations. The path and query
parameters are on the first line of the request. The remainder of
the URI, the domain name of the server, is carried in the Host
header. Third, the media type is carried in the Accept header
because this is a GET request. For requests that carry an entity
body, such as POST or PUT, the
Content-Type header in the request carries the media type of the
entity body.
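As a rough illustration of where those three items live, here is a minimal sketch that pulls them out of a raw request. The function name dispatch_keys and the shortened Accept value are assumptions made for this example, and the parsing is deliberately naive: it does not handle header values folded across several lines.

```python
# A shortened copy of the example request, with explicit CRLF line endings.
RAW_REQUEST = (
    "GET / HTTP/1.1\r\n"
    "Host: 127.0.0.1:8080\r\n"
    "Accept: text/xml, application/xml, text/html;q=0.9, */*;q=0.5\r\n"
    "\r\n"
)


def dispatch_keys(raw):
    """Return the pieces of a request that dispatching can depend on."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    # The request line holds the method and the path/query part of the URI.
    method, target, _version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "method": method,
        "path": target,
        "host": headers.get("host"),  # the rest of the URI
        "accept": headers.get("accept"),
        "content_type": headers.get("content-type"),
    }


keys = dispatch_keys(RAW_REQUEST)
print(keys["method"], keys["host"], keys["path"])
print(keys["accept"])
```

A real server would take these values from its HTTP library rather than from hand-written parsing.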
When requests come into our service, we can route them based on the URI, the method, and the media type. We'll return to dispatching on the URI and the HTTP method later. The media type is what we are concentrating on right now. It turns out that dispatching on media types isn't as simple as it sounds. It's not really that complicated--we'll be doing it by the end of this article--but it's not trivial either.
| Method | Header |
|---|---|
| GET | Accept |
| HEAD | Accept |
| PUT | Content-Type |
| POST | Content-Type |
| DELETE | n/a |
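The table can be read as a simple lookup. The sketch below assumes a hypothetical helper named media_type_header, which is not part of the module developed in this article. It only restates the table in code.

```python
# Which request header carries media type information for each method,
# following the table above. None stands for the "n/a" entry.
MEDIA_TYPE_HEADER = {
    "GET": "Accept",
    "HEAD": "Accept",
    "PUT": "Content-Type",
    "POST": "Content-Type",
    "DELETE": None,
}


def media_type_header(method):
    """Return the name of the header to inspect for `method`, or None."""
    return MEDIA_TYPE_HEADER.get(method.upper())


print(media_type_header("get"))     # Accept
print(media_type_header("POST"))    # Content-Type
print(media_type_header("DELETE"))  # None
```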
If an entity is involved in the request--that is, a POST or PUT--then
the media type is contained in the Content-Type header. If the
request is a HEAD or GET, then a list of acceptable media types for
the response is given in the Accept header. That's not quite
true of the Accept header, as I'll explain below. First, let's look
at the Content-Type header. Here is the
definition straight from the HTTP specification (RFC 2616):
Content-Type = "Content-Type" ":" media-type
media-type = type "/" subtype *( ";" parameter )
parameter = attribute "=" value
attribute = token
value = token | quoted-string
quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext = <any TEXT except <">>
quoted-pair = "\" CHAR
type = token
subtype = token
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
CTL = <any US-ASCII ctl chr (0-31) and DEL (127)>
I've gathered up all the pertinent pieces, but really the thing
we'll be using the most is the definition of media-type. That
definition states that a media type contains a type,
subtype, and parameter, which are separated by "/" and
";" characters, respectively. We can decompose a media-type
into its component parts using Python code like this:
# partition() also copes with a media type that has no parameters.
(mime_type, _, parameter) = media_type.partition(";")
(type, subtype) = mime_type.split("/")
I said the Accept header contained a list of all of
the media types that the client was able to,
well, accept. That isn't quite true. Accept is a
little more complicated, allowing the client to list multiple
media ranges. A media range is different from a media type: a
media range can use wildcards (*) for the type and subtype and can
have multiple parameters. One of the parameters that can be used is
q, which is a quality indicator. It has a value, from 0.0
to 1.0, that indicates the client's preference for that media
type. The higher the quality indicator value, the more preferred
the media type is. As an example of wildcard matching, the media type
application/xbel+xml matches the media ranges
application/xbel+xml, application/*, and */*.
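The wildcard rule can be sketched in a few lines. The function name range_matches is an assumption for this example. It compares only the type and subtype and ignores parameters such as q.

```python
def range_matches(media_type, media_range):
    """Return True if `media_type` falls within `media_range`.

    Only type and subtype are compared; any parameters are dropped.
    """
    m_type, m_subtype = media_type.split(";")[0].strip().split("/")
    r_type, r_subtype = media_range.split(";")[0].strip().split("/")
    type_ok = r_type == "*" or r_type == m_type
    subtype_ok = r_subtype == "*" or r_subtype == m_subtype
    return type_ok and subtype_ok


for media_range in ("application/xbel+xml", "application/*", "*/*", "text/*"):
    print(media_range, range_matches("application/xbel+xml", media_range))
```

The first three ranges match application/xbel+xml and text/* does not.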
Microsoft's Internet Explorer browser typically uses the
following Accept header: Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg,
application/x-shockwave-flash, */*, while Mozilla Firefox
typically uses Accept: text/xml, application/xml, application/xhtml+xml,
text/html;q=0.9, text/plain;q=0.8, image/png,*/*;q=0.5.
One thing that makes our lives a little easier is that media-type, as defined for the
Content-Type header, is also a valid media range for
an Accept header. So we only have to parse strings
defined by media-type. If we do that well, then we will be able to parse
Accept headers without much additional work.
Our first function is parse_mime_type:
def parse_mime_type(mime_type):
    parts = mime_type.split(";")
    params = dict([tuple([s.strip() for s in param.split("=")])
                   for param in parts[1:]])
    (type, subtype) = parts[0].split("/")
    return (type.strip(), subtype.strip(), params)
Let's follow the code by watching how a media range would be
dissected. If our media range is application/xhtml+xml;q=0.5,
then
parts = ["application/xhtml+xml", "q=0.5"]
params = {"q": "0.5"}
(type, subtype) = ("application", "xhtml+xml")
and the function returns the tuple ("application",
"xhtml+xml", {"q": "0.5"}).
Now remember that the difference between a MIME type and a
media range is the presence of wildcards and the q parameter.
Our parse_mime_type function doesn't actually care about
wildcards and will happily parse them. All that's left is to ensure
that the q quality parameter is set, using a default value of 1 if
none is given.
def parse_media_range(range):
    (type, subtype, params) = parse_mime_type(range)
    # Default q to 1 when it is missing, empty, or outside the range 0..1.
    # A q of 0 is kept as is: it marks the range as not acceptable.
    if 'q' not in params or not params['q'] \
            or not 0 <= float(params['q']) <= 1:
        params['q'] = '1'
    return (type, subtype, params)
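To see the q defaulting applied to a whole header, here is a minimal standalone sketch. The function name parse_accept is an assumption for this example and is not part of the module. It lists the media ranges by q value for inspection only. Ordering by q alone is not the matching rule, which also has to take specificity into account.

```python
def parse_accept(header):
    """Split an Accept header value into (media_range, q) pairs,
    highest q first. A missing, malformed, or out-of-range q becomes 1.0."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        q = 1.0
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 1.0
                if not 0.0 <= q <= 1.0:
                    q = 1.0
        ranges.append((parts[0], q))
    # sorted() is stable, so ranges with equal q keep their header order.
    return sorted(ranges, key=lambda pair: pair[1], reverse=True)


firefox = ("text/xml, application/xml, application/xhtml+xml, "
           "text/html;q=0.9, text/plain;q=0.8, image/png,*/*;q=0.5")
for media_range, q in parse_accept(firefox):
    print(media_range, q)
```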
So we can parse media ranges, and now we need to compare a target
media type against a list of media ranges. That is, if we know our
application supports image/jpeg, and we get a request that contains
an Accept header--image/gif, image/x-xbitmap,
image/jpeg, image/pjpeg, application/x-shockwave-flash, */*--will the client be able to accept a response with a MIME type
image/jpeg? And what is the quality value associated with that
type?
This is where things get a little tricky. Here are the rules for how to match a media type to a list of media ranges, which are distilled from Section 14.1 of RFC 2616:
- The most specific matching media range takes precedence: application/foo;key=value has a higher precedence than application/foo, which has a higher precedence than application/*, which in turn has a higher precedence than */*.
- Once the most specific match is found, the q parameter for that media range is applied.

Once we have this match function working, then matching up the
media types we accept is easy: just pass each one to
the match function; the one that comes out with the
highest q value is the winner and, therefore, the MIME type of the
representation we are going to return. I like to turn these kinds of
comparisons into math problems. (It's the kind of thing I do.) To find
the most specific match, we'll score a media range in the following
way:
- 100 points if the type matches
- 10 points if the subtype matches
- 1 point for each parameter that matches

Now we just score each media range, and the one with the highest
score is the best match. We return the q parameter of the best
match.
def quality_parsed(mime_type, parsed_ranges):
    """Find the best match for a given mime_type against a list of
    media_ranges that have already been parsed by
    parse_media_range(). Returns the 'q' quality parameter of the
    best match, 0 if no match was found. This function behaves the
    same as quality() except that 'parsed_ranges' must be a list of
    parsed media ranges."""
    best_fitness = -1
    best_fit_q = 0
    (target_type, target_subtype, target_params) = parse_media_range(mime_type)
    for (type, subtype, params) in parsed_ranges:
        # Count the target's non-q parameters that the range repeats exactly.
        param_matches = sum([1 for (key, value) in target_params.items()
                             if key != 'q' and key in params
                             and value == params[key]])
        if (type == target_type or type == '*') and \
                (subtype == target_subtype or subtype == '*'):
            # Score specificity: 100 for type, 10 for subtype, 1 per parameter.
            fitness = (type == target_type) and 100 or 0
            fitness += (subtype == target_subtype) and 10 or 0
            fitness += param_matches
            if fitness > best_fitness:
                best_fitness = fitness
                best_fit_q = params['q']
    return float(best_fit_q)
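To look at the scoring in isolation, here is a minimal sketch that reimplements only the fitness calculation. The function name fitness and the sample ranges are assumptions for illustration. The ranges are written as already-parsed (type, subtype, params) tuples.

```python
def fitness(target, media_range):
    """Score how specifically `media_range` matches `target`.

    Both arguments are (type, subtype, params) tuples. Returns -1 when
    the range does not match the target at all.
    """
    (t_type, t_subtype, t_params) = target
    (r_type, r_subtype, r_params) = media_range
    if r_type not in (t_type, "*") or r_subtype not in (t_subtype, "*"):
        return -1
    score = 100 if r_type == t_type else 0
    score += 10 if r_subtype == t_subtype else 0
    # One point per non-q parameter that the range repeats exactly.
    score += sum(1 for key, value in t_params.items()
                 if key != "q" and r_params.get(key) == value)
    return score


target = ("application", "foo", {"key": "value"})
ranges = [
    ("*", "*", {"q": "0.1"}),
    ("application", "*", {"q": "0.5"}),
    ("application", "foo", {"q": "0.8"}),
    ("application", "foo", {"key": "value", "q": "1"}),
    ("text", "*", {"q": "1"}),
]
for media_range in ranges:
    print(media_range[:2], fitness(target, media_range))
```

The scores come out as 0, 100, 110, 111, and -1, which reproduces the precedence order given in the rules: */* is the weakest match, application/foo;key=value is the strongest, and text/* does not match at all.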
The best_match function, which ties all of this
together, takes the list of MIME types that we support and the value
of the Accept: header and returns the best match.
def best_match(supported, header):
    """Takes a list of supported mime-types and finds the best match
    for all the media-ranges listed in header. The value of header
    must be a string that conforms to the format of the HTTP Accept:
    header. The value of 'supported' is a list of mime-types.

    >>> best_match(['application/xbel+xml', 'text/xml'],
    ...            'text/*;q=0.5,*/*; q=0.1')
    'text/xml'
    """
    parsed_header = [parse_media_range(r) for r in header.split(",")]
    # Pair each supported type with its quality so sorting ranks by quality.
    weighted_matches = [(quality_parsed(mime_type, parsed_header), mime_type)
                        for mime_type in supported]
    weighted_matches.sort()
    # Return '' when even the best candidate has a quality of 0.
    return weighted_matches[-1][0] and weighted_matches[-1][1] or ''
The full Python module, which includes comments and unit tests, is available from bitworking.org.
So now let's loop back to where we started. When we receive an HTTP
request, part of our dispatching is going to depend on the media
type. The header we need to look at depends on the type of the request
or response. Using our newly created module, we can parse both
the Content-Type and Accept headers. In the
next column we'll jump into the meat of dispatching our incoming
requests.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.