XML vs. HTTP: issues affecting safe transport of XML over HTTP
by Mike Brown <mike@skew.org>
Table of Contents
- XML vs HTTP
- Conclusion and the real solution
- Troubleshooting
- Revision History
Introduction
XML documents are intrinsically no more or less subject to corruption than HTML documents, so one would think that HTTP, intended for the safe transport of HTML documents, would be sufficient for sending XML documents. In practice, however, some widely deployed HTTP applications are either too limited in how they handle HTML and XML, or they make assumptions that are not supported by and/or are in conflict with the relevant specifications.
Software developers often use these applications in an attempt to avoid reinventing certain wheels. They tend to make the mistake of passing XML documents through web browsers or putting them into HTTP messages that are then interpreted on the receiving end by applications that unsafely treat the XML as HTML form data.
Some of the consequences of this kind of mishandling are discussed herein. A conclusion follows, offering one of probably many possible solutions for safely transporting XML documents via HTTP requests. The intended audience for this document includes software developers and information architects who are designing and implementing general purpose HTTP-based XML transmission systems in their products -- i.e., systems in which the documents are to be considered arbitrary binary data, not purely ASCII text, during the transmission.
Discussion
The first part of the discussion explains some XML document concepts that are relevant to HTTP: characters, markup, bit sequences and entities. The second part explains a little about HTTP and then shows how common ways of processing HTTP messages can result in the misinterpretation or corruption of XML documents.
Overview of an XML document
Four levels of abstraction
A document is an abstract collection of information. The information carried in a document is used by an application.
In an SGML, HTML, or XML document, the information takes the form of a logical hierarchy: a tree of abstract constructs called 'elements', 'attributes', 'character data', 'comments', and others. This hierarchy is the most abstract form of the document's data.
The hierarchy is encoded as text, a contiguous sequence of abstract characters (letters, numbers, punctuation, etc.). This character sequence is usually finite. The abstract structures are mapped to alternating sequences of characters called markup and character data. This text must follow certain rules in order to be considered XML. This text is another abstract representation of the document's data.
In order to represent a document in binary information architectures, the document's text is mapped to a contiguous sequence of bits. The mapping of a repertoire of characters to bit sequences is a character encoding scheme, often informally called a 'character set'.
Programmers should note that sometimes an intermediate encoding form is used, where abstract characters are first mapped to numeric sequences, rather than directly to specific bit patterns, allowing characters to be manipulated as numbers and avoiding CPU-dependent bit ordering issues.
In summary, XML documents, and to varying degrees SGML and HTML documents, can be thought of as having 4 main layers, in order of increasing abstraction:
- a contiguous bit sequence (binary data) that represents…
- a contiguous character sequence (textual data) that is divided into…
- segments of markup and character data, that together represent…
- a hierarchy of logical constructs: elements, attributes, character data, etc.
Programmers may have to work with a document at any of these levels of abstraction. For example, the DOM API provides access to a document's logical constructs, while an HTTP transmission must work with a document as binary data.
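To make the two lowest layers concrete, here is a minimal sketch in Java. The class and method names are hypothetical, chosen only for illustration. It maps one character sequence to bit sequences under three encoding schemes and reports how many bytes each mapping needs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingLayers {
    // Length in bytes of the bit sequence that represents the text
    // under the given character encoding scheme.
    static int encodedLength(String text, Charset scheme) {
        return text.getBytes(scheme).length;
    }

    public static void main(String[] args) {
        // 15 characters of markup and character data; \u00e9 is e-acute.
        String text = "<doc>caf\u00e9</doc>";
        System.out.println(encodedLength(text, StandardCharsets.ISO_8859_1)); // 15
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));      // 16
        System.out.println(encodedLength(text, StandardCharsets.UTF_16BE));   // 30
    }
}
```

The character sequence is identical in all three cases; only the bit sequence differs, which is why a receiver that guesses the wrong scheme recovers the wrong characters.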
However, the abstractions in an SGML, HTML or XML document are somewhat more complex, because of the concept of entities.
The notion of entities
An XML document is divided into one or more units called entities. These divisions can be expressed at any of the levels of abstraction. A document is 'physically' comprised of at least one entity.
In XML, each entity is classified as 'parsed' or 'unparsed'. A parsed entity is text that can be (not necessarily has been) parsed as XML. That is, a parsed entity is a binary representation of character sequences that follow the rules of XML character data and markup. In contrast, an unparsed entity is a blob of non-XML data, rarely used in XML.
Each parsed entity, if it exists in binary form, can use a different character encoding scheme. The encoding scheme may, but does not have to be, declared within the entity itself. There are rules for determining what the actual encoding scheme might be. It is considered an error if the actual encoding is different from what is deduced. It is also an error for the actual encoding to differ from the declared encoding. Therefore it is best to ensure that the entity does contain an accurate encoding declaration.
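As a rough illustration of checking for an encoding declaration, the sketch below pulls the declared encoding name out of the first bytes of an entity. It is not a conforming implementation of XML's encoding detection rules. It assumes the entity is in an ASCII-compatible encoding (so it will not work for UTF-16), and the class name, the 200-byte window and the regular expression are all illustrative choices.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodingDeclaration {
    // Matches an XML or text declaration at the very start of an entity and
    // captures the value of its encoding pseudo-attribute, if there is one.
    private static final Pattern DECL = Pattern.compile(
        "<\\?xml(?:\\s+version\\s*=\\s*(?:\"[^\"]*\"|'[^']*'))?"
        + "\\s+encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    static Optional<String> declaredEncoding(byte[] entity) {
        // Assumption: the declaration, if any, fits in the first 200 bytes and
        // consists of ASCII bytes, so decoding as ISO-8859-1 cannot fail.
        int n = Math.min(entity.length, 200);
        String prefix = new String(entity, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(prefix);
        return m.lookingAt() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        byte[] entity = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><doc/>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(entity).orElse("(no declaration)"));
    }
}
```

A sender could compare the returned name with the encoding it actually used to produce the bytes, and refuse to transmit an entity whose declaration disagrees.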
There are well-formedness rules that dictate where the boundaries of parsed entities can be. Parsed entities must correspond to complete sequences of characters. The characters must form either arbitrary segments of text, or units of markup-and-text corresponding to complete logical structures.
Entities can exist separate from each other ('external' to each other; like files on a disk), or they can exist within ('internal' to) another entity. Internal entities are always parsed entities, and they are always defined in a markup portion of the containing entity; there is no such thing in XML as an internal unparsed entity.
Entities offer various advantages. Different parts of a document can live in different locations. Markup within one parsed entity can refer to the contents of another, implying that the reference itself is equivalent to the referenced entity's 'replacement text'. There can be multiple references to a single entity, allowing for efficient reuse of oft-repeated sections of the document.
In an XML document, one of the entities will be the principal one that defines the document: the 'document entity'. The role of being the document entity is imposed upon an entity at the time it is parsed/processed; there is no way to know which entity is the document entity based on its contents alone, although it is possible to syntactically eliminate certain entities as candidates for being document entities.
Hereafter, the term 'XML entity', meaning a parsed XML entity, will be used instead of 'XML document', to clarify that an entire document would not necessarily be the subject of an HTTP transmission.
Summary of important issues
- An XML document may exist as (and reference) multiple entities. When parsing the document, are all the necessary entities accessible?
- Is the presence/absence of an encoding declaration in a parsed entity being accounted for? The declaration is irrelevant at the more abstract levels, but at the bit-sequence level, it is highly relevant.
- Is the parsed entity already well formed, with special characters in character data escaped with the appropriate markup?
HTTP issues
Client request, server response
Most HTTP transactions involve a client opening a connection to the server and sending a request message, and a server issuing a response message, usually closing the connection. Note that 'request' doesn't necessarily imply the retrieval of a resource.
The request message always has the following info:
- Request method (type of request): GET or POST being the most common
- Relative URI of resource requested
- HTTP version
- Virtual host name
A GET request message typically looks like the following, and is ASCII encoded, with CR+LF newlines:
GET /path/to/some/document.xml HTTP/1.1
Host: www.myhost.foo
The semantics of GET are that the server is being asked to deliver some representation of the identified resource.
A POST request is nearly the same, but after the headers is an entity (a blob of binary data) in the message body. The data might be encoded text, but it could be any kind of file. Additional headers describe the data's media type, its encoding (if it is of a text media type), and the length of the data, in bytes:
POST /path/to/posted-data-processor HTTP/1.1
Host: www.myhost.foo
Content-Type: application/x-www-form-urlencoded
Content-Length: 22

greeting=hello%20world
The semantics of POST are that the server is being asked to "apply" the entity body to the identified resource.
There are other types of HTTP requests that are not widely implemented but are potentially very useful, most notably PUT, which is like POST but carries the semantics that the server is being asked to simply store the entity and label it with the given URI as an identifier.
After processing a request, the server delivers a response, consisting of a 3-digit response code and some text indicating the code's meaning, followed by a message with headers and body. The response message headers indicate the media type and length of the body, along with other information about the server and the requested resource (this example is simplified a bit):
200 OK
Date: Sun, 09 Feb 2003 08:08:53 GMT
Server: Acme HTTP Server 0.01 alpha
Content-Length: 70
Content-Type: text/xml
Last-Modified: Thu, 08 Nov 2001 22:12:35 GMT

<?xml version="1.0" encoding="utf-8"?><greeting>hello world</greeting>
So, a client can make a request for an XML document without actually sending XML in its request, and a server can send an XML document in the body of the response, and in fact this is usually what happens when an XML document is requested from a server. There's nothing particularly unsafe about this, although one might argue that the Content-Type is usually not as accurate or specific as it should be, due to common server misconfiguration, which at this point is entrenched to support backward compatibility with clients.
See RFC 3023 for information about the correct media types for XML documents. See RFC 2616 for the full explanation of how HTTP works.
The real problems arise when the request itself (GET or POST or any other) needs to contain XML.
The best request method, if XML is in the request, is POST
The client can send to the server other metadata via name-value pairs in the HTTP headers, but arbitrary data, such as an XML document, must be supplied in one of two places:
- Following the headers, in the request message body, or
- Embedded in the request URI, URL-encoded
If the request method is GET, the implication is that the server has a resource (usually an HTML document) that the client wants to retrieve, and large amounts of arbitrary data do not need to be sent to the server in order to specify what resource is desired. Therefore, the GET method, although it doesn't explicitly disallow this, is not usually used to send arbitrary data via the message body. On this basis alone, using POST is favored over GET, for transmitting XML entities via HTTP.
There are other reasons why GET is unfavorable. It is unsafe to use request URIs that are longer than 1024 characters. And as explained below, URL-encoding can also be an unreliable operation to perform on XML entities.
POST is favorable because it is specifically for delivering arbitrary data as a 'subordinate' of the request URI. As mentioned above, POST is typically used to deliver a document to an application that can process it and generate a response for the server to deliver back to the client, but it can also be used for other purposes, such as annotation of existing documents.
When POST is used, the document is sent not in the request URI, but in the body of the request message. The body follows a blank line after the headers, just like in an email message. The headers must contain a Content-Length header indicating the number of octets (8-bit bytes) contained in the body. A Content-Type header is required. If it carries a charset parameter, that parameter must match the actual encoding of the document.
An example of an ideal, but impractical, HTTP message for XML entity transmission follows.
POST /MessageReceiver.jsp HTTP/1.0
Host: www.SomeHost.net
Content-Type: application/xml; charset=iso-8859-1
Content-Length: 68

<?xml version="1.0" encoding="iso-8859-1"?>
<doc>hello world</doc>
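A sender that controls its own output can assemble such a message directly. The following is a minimal sketch, not a full HTTP client: the class and method names are hypothetical, and the caller is assumed to pass a charset that agrees with the entity's own encoding declaration. The point it illustrates is that Content-Length is computed from the encoded bytes, not from the number of characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class XmlPostMessage {
    static byte[] build(String host, String path, String xml, Charset charset) {
        // Encode the body first: Content-Length counts octets, not characters.
        byte[] body = xml.getBytes(charset);
        String head = "POST " + path + " HTTP/1.0\r\n"
            + "Host: " + host + "\r\n"
            + "Content-Type: application/xml; charset=" + charset.name() + "\r\n"
            + "Content-Length: " + body.length + "\r\n"
            + "\r\n"; // the blank line that separates headers from body
        byte[] headBytes = head.getBytes(StandardCharsets.US_ASCII);
        byte[] message = new byte[headBytes.length + body.length];
        System.arraycopy(headBytes, 0, message, 0, headBytes.length);
        System.arraycopy(body, 0, message, headBytes.length, body.length);
        return message;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<doc>caf\u00e9</doc>\n";
        byte[] message = build("www.SomeHost.net", "/MessageReceiver.jsp", xml,
                               StandardCharsets.ISO_8859_1);
        System.out.write(message, 0, message.length);
        System.out.flush();
    }
}
```

The same entity encoded as UTF-8 would need one more octet for the accented character, so the header value has to be derived from the byte array that is actually written.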
...But not all HTTP implementations handle POST data well
On the server, MessageReceiver.jsp must be prepared to accept the complete message body as the XML entity. Java Servlet request objects have a getInputStream() method that allows access to the raw message body, but this method is relatively unknown and underutilized. And other server-side interfaces such as Perl's CGI module do not offer such an option at all.
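The sketch below shows the kind of raw read that getInputStream() makes possible. To stay self-contained it takes a plain java.io.InputStream; in a servlet, the stream would presumably come from request.getInputStream() and the length from the Content-Length header. The class and method names are hypothetical, and readNBytes(int) requires Java 11 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class RawBodyReader {
    // Reads exactly contentLength octets and returns them untouched, so that
    // an XML parser can later apply the entity's own encoding rules.
    static byte[] readBody(InputStream in, int contentLength) {
        try {
            byte[] body = in.readNBytes(contentLength);
            if (body.length != contentLength) {
                throw new IOException("body ended after " + body.length
                    + " of " + contentLength + " octets");
            }
            return body;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sent = "<doc>caf\u00e9</doc>".getBytes(StandardCharsets.UTF_16);
        byte[] received = readBody(new ByteArrayInputStream(sent), sent.length);
        System.out.println(received.length + " octets read, none decoded");
    }
}
```

Handing the byte array (or the stream itself) to the XML parser, rather than first converting it to a string, avoids any decoding under a platform-default encoding.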
It is more common to find interfaces that translate the HTTP message headers into environment variables, for example, and that interpret POST message bodies as HTML form data (name-value pairs called parameters by the implementations), even if the Content-Type indicates the data is not an HTML form data set.
Note that HTTP messages can have 'parameters' that supplement certain header values. These are not the same as the parameters included in an HTML form data set. It is confusing because interfaces to HTTP messages tend to use the word 'parameters' to refer to HTML form data values only.
The consequence of misinterpreting the message body as an HTML form data set is that, typically, an XML entity will be interpreted as being a form data parameter with name <?xml version and the value being the rest of the document following the first '=' character. There is also a good chance that the XML entity will be interpreted as being ASCII or ISO-8859-1 encoded, possibly resulting in corruption of the data, especially if the XML entity uses a UTF-16 encoding.
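The effect is easy to reproduce. The sketch below is a deliberately naive form data parser, written for illustration only. It is not the code of any particular server, and it skips the URL-decoding step that real implementations perform. Given an XML entity as the message body, it reports a single parameter whose name is the text before the first '='.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NaiveFormParser {
    // Splits a message body into name-value pairs the way form data is
    // structured: pairs separated by '&', name and value separated by '='.
    static Map<String, String> parse(String body) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(name, value);
        }
        return params;
    }

    public static void main(String[] args) {
        String body = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<doc>hello world</doc>";
        parse(body).forEach((name, value) ->
            System.out.println("name=[" + name + "] value=[" + value + "]"));
    }
}
```

An entity that happens to contain '&' (for example, in an entity reference) would be split into several such bogus parameters.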
Therefore, if an XML entity is sent in an HTTP message body and will be processed with legacy applications, then it is necessary to properly prepare the document as the value of a parameter in an HTML form data set. Unfortunately, this can be quite problematic, mainly due to character encoding and media type issues.
A risky solution: Embed XML in HTML form data
The Content-Type header of the HTTP message indicates the media type. If the media type is one that can have a single character encoding scheme, then the header may also contain a charset parameter.
An HTML form data set has one of two media types:
- application/x-www-form-urlencoded
- multipart/form-data
Neither media type can be associated with a charset parameter in the HTTP Content-Type header.
Using application/x-www-form-urlencoded is bad
The application/x-www-form-urlencoded media type is not suitable for general purpose XML entity transmission.
When using this media type, the form data set, by definition, is a sequence of ASCII characters that must be encoded as ASCII bytes in the HTTP request. The form data set must be URL-encoded:
- Parameters are separated by '&' characters
- Parameter names must be ASCII characters
- Parameter values must be ASCII characters
- Parameter names are separated from values by '=' characters
- Various other characters that have reserved purposes or that cannot exist in a URI, as long as they are in the ASCII range, are translated to '%xx' escape sequences, where xx is the hex value of the character's ASCII code.
It should be noted that URL-encoding has been defined, in some contexts, to allow non-ASCII characters to be converted to multiple '%xx' sequences using UTF-8 encoding, like '%C2%A0' for the non-breaking space character. Additionally, web browser vendors have kludged similar ways of sending non-ASCII characters, but using the encoding of the HTML document containing the form, or the interpreted encoding (users are often allowed to force the document to be decoded with a different scheme than the actual encoding scheme). These approaches are not standardized, but are widespread and show no signs of abating in the near future, at least not for HTML form data or HTTP URLs. See http://skew.org/xml/misc/URI-i18n/ where this issue is discussed further.
The following conditions must be met in order for the transmission of an XML entity as application/x-www-form-urlencoded form data to be successful:
- Non-ASCII characters appearing in the entity must be replaced with numeric character references, where character references are allowed.
- The entity must not use non-ASCII characters in any markup sections where character references are not allowed -- for example, element names, attribute names, and processing instruction targets must consist entirely of ASCII characters. XML does not offer an option for transcoding these to ASCII.
- The entity's encoding declaration must be encoding="us-ascii" or a suitable superset of ASCII, like encoding="utf-8". UTF-8 is recommended to ensure any XML parser will be able to handle it.
These conditions are too restrictive for a general purpose XML transmission system, but may be acceptable for a system where the XML entities can always be trusted to meet these requirements.
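The first condition can be sketched as follows. The helper is hypothetical and is meant to be applied only to character data or attribute values, where character references are allowed. Running it over a whole entity would also rewrite names, comments and other markup where such references are not recognized.

```java
class AsciiXml {
    // Replaces every non-ASCII character with a hexadecimal numeric
    // character reference; ASCII characters pass through unchanged.
    static String toNumericCharRefs(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("&#x%X;", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, as in the French example later in this article.
        System.out.println(toNumericCharRefs("Napol\u00e9on")); // Napol&#xE9;on
    }
}
```

Iterating over code points rather than char values keeps characters outside the Basic Multilingual Plane as a single reference instead of two surrogate halves.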
An example of a safe XML entity transmission:
POST /MessageReceiver.jsp HTTP/1.0
Host: www.SomeHost.net
Content-Type: application/x-www-form-urlencoded
Content-Length: 100

XML=%3C?xml%20version%3D%221.0%22%20encoding%3D%22utf-8%22?%3E%0A%3Cdoc%3Ehello%20world%3C/doc%3E%0A
While this approach allows the XML entity to be transmitted safely to a receiver that expects to treat a POSTed message body as form data, the process of transcoding the entity's non-ASCII characters and modifying its encoding declaration could be a significant burden on the process that prepares the entity for transmission. It is definitely not something a web browser will do on its own, at least.
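For a sender written in Java, the URL-encoding step itself might look like the sketch below. The class name and the ASCII guard are illustrative assumptions. The Charset overload of URLEncoder.encode requires Java 10 or later. Note that URLEncoder writes a space as '+' rather than '%20' and also escapes '?' and '/', so its output differs in detail from the message shown above while decoding to the same entity.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class XmlFormField {
    // Builds a one-parameter form data set from an entity that has already
    // been reduced to ASCII characters.
    static String formBody(String name, String asciiXml) {
        for (int i = 0; i < asciiXml.length(); i++) {
            if (asciiXml.charAt(i) > 0x7F) {
                throw new IllegalArgumentException("non-ASCII character at index " + i);
            }
        }
        return URLEncoder.encode(name, StandardCharsets.UTF_8) + "="
            + URLEncoder.encode(asciiXml, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<doc>hello world</doc>\n";
        String body = formBody("XML", xml);
        System.out.println(body);
        // The receiver's decoding step recovers the original entity.
        String value = body.substring(body.indexOf('=') + 1);
        System.out.println(URLDecoder.decode(value, StandardCharsets.UTF_8).equals(xml));
    }
}
```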
Using multipart/form-data is better, but still bad
The other option, when sending an XML entity via HTTP to a receiver that expects POST messages to contain form data, is to use the multipart/form-data media type. This mechanism is much better, but is still not ideal for XML entities.
With this media type, the MIME multipart format, originally developed for attaching files to email messages, is used in the message body.
MIME allows one or more entities of arbitrary data of any media type to be encapsulated in a series of ASCII-delimited sections called body parts. The multipart/form-data media type is typically used for file uploads from web forms.
The good news is that with this media type, an XML entity can be in any character encoding; it does not need to be dumbed-down to ASCII or modified in any way. Also, the media type for the entity can be declared independently of the media type of the HTTP message body.
The bad news is twofold.
One problem is that web browsers, for 'compatibility reasons', do not reliably include media type or charset information with each body part. Without this information, the receiver must trust that the XML entity's self-declared encoding is correct. This may not be a problem, but the receiver must be trusted to not munge the data by misrepresenting it as ASCII when passing it to an application (e.g. exposing it as parameter data). The MIME specification even says that in the absence of a Content-Type header for the body part, Content-Type: text/plain; charset=us-ascii is assumed.
The other problem is that multipart/form-data HTTP requests are not even handled at all by some current HTTP implementations. For example, as of 2001, the stock Apache Tomcat and BEA WebLogic HTTP servlet classes only provided access to form data that is sent application/x-www-form-urlencoded.
Nevertheless, an example of an ideal HTTP message with this media type follows. If the receiver can be trusted to handle this type of message properly, then it is safe to use in a general-purpose XML system.
POST /MessageReceiver.jsp HTTP/1.0
Host: www.SomeHost.net
Content-Type: multipart/form-data; boundary=---------------------------7d0355331b90386
Content-Length: 386

-----------------------------7d0355331b90386
Content-Type: application/xml; charset=utf-8
Content-Disposition: form-data; name="XML"

<?xml version="1.0"?>
<myDoc xml:lang="fr">C'est dans cette belle maison, affectée au commandement de la place, que Napoléon a passé ses derniers jours de liberté en France en juillet 1815.</myDoc>
-----------------------------7d0355331b90386--
Note that in the example above, UTF-8 bytes are shown (in this article) as their iso-8859-1 interpretations, just to emphasize that there are 2 bytes per character, for certain characters.
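A message body of this shape can be assembled as in the following sketch. The names are hypothetical, the boundary is supplied by the caller, and nothing checks that the boundary string does not occur inside the entity, which a real sender would have to guarantee. The XML bytes are copied verbatim between the ASCII-encoded part headers and the closing delimiter.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class MultipartXml {
    static byte[] body(String boundary, String fieldName, byte[] xmlBytes, String xmlCharset) {
        String head = "--" + boundary + "\r\n"
            + "Content-Type: application/xml; charset=" + xmlCharset + "\r\n"
            + "Content-Disposition: form-data; name=\"" + fieldName + "\"\r\n"
            + "\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        byte[] headBytes = head.getBytes(StandardCharsets.US_ASCII);
        byte[] tailBytes = tail.getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(xmlBytes, 0, xmlBytes.length); // copied verbatim, never re-encoded
        out.write(tailBytes, 0, tailBytes.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\"?>\n<myDoc xml:lang=\"fr\">Napol\u00e9on</myDoc>"
            .getBytes(StandardCharsets.UTF_8);
        byte[] body = body("---------------------------7d0355331b90386", "XML", xml, "utf-8");
        // Content-Length for the enclosing HTTP message is the octet count.
        System.out.println("Content-Length: " + body.length);
    }
}
```

Viewing the result as iso-8859-1 text reproduces the two-characters-per-accented-letter display used in the example above.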
Conclusion and the real solution
To implement a general-purpose XML entity transmission system with HTTP, one should, as explained above, use a POST transaction, with the entity in the HTTP message body. The first example was actually ideal:
POST /MessageReceiver.jsp HTTP/1.0
Host: www.SomeHost.net
Content-Type: application/xml; charset=iso-8859-1
Content-Length: 68

<?xml version="1.0" encoding="iso-8859-1"?>
<doc>hello world</doc>
The only way to reliably construct such a message is to have complete control over the sending end. This means not using HTML forms at all.
It also means having complete control over the receiving end. Since common HTTP server and servlet implementations may make unreliable assumptions about the nature of the POST data, the application receiving data through the server's APIs must be careful to work around such assumptions. Things to watch out for include assumptions that the POST data is URL-encoded HTML form data, or text in a platform-default encoding.
If this level of control over sender and receiver is guaranteed, then there is the added benefit of being able to transmit multiple entities in a single HTTP message, using the multipart/mixed media type, which works just like multipart/form-data but without the Content-Disposition: form-data; name="foo" headers on each body part. But if this control cannot be guaranteed, the best one can do is to use multipart/form-data and trust that the receiver can handle it properly.
If the goal is to not have a general-purpose XML transmission system, then the best option is to configure the sender to never attempt to send XML consisting of anything other than pure ASCII. Then, any of the methods described above will work.
This conclusion might raise the question "Doesn't having complete control over the sender and receiver defeat the main purpose of using XML?" The answer is no, because the receiver is not necessarily the application that will be using the data; it might just be a way-station on the XML entity's journey to an application that will intelligently process the file. The point of having control over both ends of the HTTP transmission is just to ensure that the entity is not corrupted before it gets to the application.
Troubleshooting
A common problem: non-ASCII characters become "garbage"
Here is a very typical situation that demonstrates some of the issues explained above:
1. The client requests a page via HTTP.
2. The server sends an HTML form, wherein the Unicode chars of the document have been serialized in the HTTP response in a particular encoding. The response may or may not indicate to the client what the encoding is, via the charset parameter in the Content-Type header. The client may or may not use the indicated encoding to know how to decode the document and present the form (the user can usually override the decoding on their end, because there is a long history of Japanese and Chinese multibyte character sets being misrepresented as iso-8859-1).
3. Due to convention, not formal standard, the client will try to use the same encoding when it submits the form data, no matter how it is sent (GET, POST, x-www-form-urlencoded or multipart/form-data ... doesn't matter). Unencodable characters in the form data might be first translated to numeric character references... again, there is no standard, so browser behavior varies. The browser most likely will *not* indicate what encoding was used in the form data submission, "for backward compatibility".
If the text "© 2002 Acmée Inc." was entered into a form field named "msg" on a UTF-8 encoded HTML page...
...then the form data in the HTTP request may very likely look like this:
msg=%C2%A9%202002%20Acm%C3%A9e%20Inc.
Note that the copyright symbol character in utf-8 is byte pair C2 A9, and is thus %C2%A9 in the URL-encoded data, while in iso-8859-1 it would have been just byte A9, %A9 in the URL-encoded data. The small e with acute accent is byte pair C3 A9 (%C3%A9 in the URL-encoded data), while in iso-8859-1 it would have been just E9 (%E9).
It is often helpful to monitor the HTTP traffic so that the raw request can be observed, before making any assumptions. Use a proxy server with extended logging options, or perhaps a packet sniffer on one of the network endpoints or anywhere in between; e.g., on a BSD box, root can use "tcpdump -s 0 -w - port 80 | hexdump -C"
4. The server or servlet/JSP engine makes an assumption about what encoding was used in the form data. Most likely, it chooses iso-8859-1 or whatever the platform default encoding is. It will provide access to the data through what it calls a "parameter". (One should note that URIs and MIME headers have "parameters" as well, but none of them are exactly the same thing.) The parameter named "msg" will contain the Unicode string obtained by decoding the URL-encoded bytes with the assumption that bytes over 7F are iso-8859-1 encoded text. The parameter value is thus exposed to the server side of the application as a string like this:
Â© 2002 AcmÃ©e Inc.
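The whole sequence can be simulated in a few lines. This is only a model of the behavior described in the steps above, with hypothetical method names standing in for the browser and the server. The Charset overloads of URLEncoder.encode and URLDecoder.decode require Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class FormGarbage {
    // What a browser showing a UTF-8 page typically sends ('+' is rewritten
    // as '%20' only to match the notation used in this article).
    static String browserEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // What a server does when it assumes the bytes are iso-8859-1 text.
    static String serverDecode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00a9 is the copyright sign, \u00e9 is e-acute.
        String typed = "\u00a9 2002 Acm\u00e9e Inc.";
        String onTheWire = browserEncode(typed);
        System.out.println("msg=" + onTheWire);
        System.out.println(serverDecode(onTheWire));
    }
}
```

Each UTF-8 byte pair is turned into two separate characters, which is where the stray capital A with circumflex or tilde comes from.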
Recommendations:
1. Always know the encoding of the HTML form that was sent to the browser. For maximum predictability and Unicode support, try to use UTF-8. Ensure that the HTML declares itself as utf-8 in a meta tag and/or in the HTTP response headers.
2. Make it a requirement for using your application that the browser be set to auto-detect encoding, not override it, so it can be assumed that the form data will come back using the same encoding as the form. If this requirement cannot be made, then one can attempt to look at the Accept-Charset and Accept-Language headers in the HTTP requests to make an intelligent guess as to what encoding the browser is likely to be using. This would be just a guess, though, and it would be difficult to know when to choose UTF-8.
3. If the form was sent in utf-8, the form data is probably coming back as utf-8. If the server seems to be decoding it as iso-8859-1, then just re-encode the parameter value as iso-8859-1, then decode those bytes back as if they were utf-8. In Java, it's as simple as this, plus the appropriate try-catch for the possible UnsupportedEncodingException:
String badString = request.getParameter("foo");
byte[] bytes = badString.getBytes("ISO-8859-1");
String goodString = new String(bytes, "UTF-8");
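A self-contained variant of the same repair is sketched below, with the servlet request replaced by a plain string so that it can be run on its own. It uses the StandardCharsets constants (Java 7 and later), which avoid the checked UnsupportedEncodingException. The class name is hypothetical. The round trip is only lossless if the server really used iso-8859-1; a server that decoded with another single-byte encoding would need that encoding in the first step.

```java
import java.nio.charset.StandardCharsets;

class ParameterRepair {
    // Undoes a wrong iso-8859-1 decoding of bytes that were really UTF-8.
    static String repair(String badString) {
        // Step 1: recover the original bytes; iso-8859-1 maps each of the
        // characters U+0000..U+00FF back to the single byte it came from.
        byte[] bytes = badString.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode those bytes with the encoding the browser used.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The garbled parameter value from the example above, written with
        // escapes so that this source file stays pure ASCII.
        String bad = "\u00c2\u00a9 2002 Acm\u00c3\u00a9e Inc.";
        System.out.println(repair(bad));
    }
}
```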
When diagnosing the problem, try to intercept the XML at every step of the way and see what it looks like. There are often many points where it is passed around without regard for encoding issues. It is not helpful to observe its state only at the endpoints of a long processing chain.
Revision History
- 28 Dec 2000 - First draft posted here as MS Word HTML. Announced on xml-dev.
- 02 Jan 2001 - Reformatted as XHTML + CSS. Added introduction.
- 02 Jan 2001 - Clarified that HTTP requests not containing XML or other binary data are safe.
- 10 Jan 2001 - Revised paragraph on implementations not handling POST bodies to include mention of getInputStream(). Will research further.
- 15 Jan 2001 - Document stylesheet is now external.
- 20 Mar 2001 - Removed extraneous sentence fragment in "risky solution" section.
- 09 Feb 2003 - Added general HTTP examples & explanations. Added troubleshooting section. Rephrased several paragraphs.
- 11 Jun 2003 - Typo fix in one of the examples.
- 28 Jun 2008 - Added disclaimer, removed To Do, and reformatted email address.
- 08 Mar 2012 - Fixed minor XHTML errors in document source, and replaced "..." with "…".
Thanks to Rich Dobbs @ Sagent Technology for pointing out getInputStream().
Thanks to Andrew Layman @ Microsoft and Michael Smith @ xml-doc.org for content suggestions.
Thanks to Manos Bastis @ profile.gr for initial XHTML version.
Please address comments about this document to the author.
This document is part of the skew.org XML & XSLT resources.