EXSLT URI function proposals

Contributors: Craig Stewart, Mike Brown

str:encode-uri()

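The proposal is to change the existing str:encode-uri function to the following, which is based on xf:escape-uri() in [XQFuncs]. It differs from that Working Draft in three ways: arguments are referred to by position rather than by name, the string may also represent an IRI, and non-ASCII characters may be escaped using an encoding other than UTF-8.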

Function Syntax

string **str:encode-uri(string, boolean, string?)**

This function applies the URI escaping rules defined in section 2 of [RFC 2396], as amended by [RFC 2732], to the string supplied as the first argument, which typically represents all or part of a URI, URI reference or IRI. The effect of the function is to replace any special character in the string by an escape sequence of the form %xx%yy..., where xxyy... is the hexadecimal representation of the octets used to represent the character: in US-ASCII for characters in the ASCII repertoire, and in another character encoding (UTF-8 by default) for non-ASCII characters.

The set of characters that are escaped depends on the setting of the boolean second argument.

If the second argument is true, all characters are escaped other than lower case letters a-z, upper case letters A-Z, digits 0-9, and the characters referred to in [RFC 2396] as "marks": specifically, "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")". The "%" character itself is escaped only if it is not followed by two hexadecimal digits (that is, 0-9, a-f, and A-F).

If the second argument is false, the behavior differs in that characters referred to in [RFC 2396] and [RFC 2732] as reserved characters are not escaped. These characters are ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | "," | "[" | "]".

[RFC 2396] does not define whether escaped URIs should use lower case or upper case for hexadecimal digits. To ensure that escaped URIs can be compared using string comparison functions, this function must always use the upper-case letters A-F.

Generally, the second argument should be set to true when escaping a string that is to form a single part of a URI, URI reference or IRI, and to false when escaping an entire URI, URI reference or IRI.

The character encoding used as the basis for determining the octets depends on the setting of the optional third argument. If the argument is given, it should be an encoding name listed in [Charsets], and may be given case-insensitively. The default encoding, if the argument is not given, is UTF-8. UTF-8 is the only encoding required to be supported by an implementation of this function. If the given encoding is not supported, then the function returns an empty string. If the encoding is supported but a character in the string cannot be represented in that encoding, then the character is escaped as if it were a question mark ("%3F").
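The escaping rules described above can be sketched in Python. This is only an illustration of the proposed behavior, not a reference implementation: the name encode_uri and its parameter names are hypothetical, and Python's codec registry is assumed as a stand-in for the encoding names listed in [Charsets].

```python
import codecs

# Characters never escaped: letters, digits and the RFC 2396 "marks".
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)
# Reserved characters per RFC 2396 as amended by RFC 2732.
RESERVED = set(";/?:@&=+$,[]")
HEX_DIGITS = set("0123456789abcdefABCDEF")


def encode_uri(text, escape_reserved, encoding="UTF-8"):
    """Hypothetical sketch of the proposed str:encode-uri() behavior."""
    try:
        # Assumption: Python's codec lookup (case-insensitive) stands in
        # for "an encoding supported by the implementation".
        codecs.lookup(encoding)
    except LookupError:
        return ""  # unsupported encoding: empty string

    keep = UNRESERVED if escape_reserved else UNRESERVED | RESERVED
    out = []
    for i, ch in enumerate(text):
        if ch in keep:
            out.append(ch)
        elif (ch == "%" and i + 2 < len(text)
              and text[i + 1] in HEX_DIGITS and text[i + 2] in HEX_DIGITS):
            # "%" already introduces an escape sequence: leave it alone.
            out.append(ch)
        else:
            if ord(ch) < 128:
                octets = ch.encode("ascii")  # ASCII repertoire: US-ASCII octet
            else:
                try:
                    octets = ch.encode(encoding)
                except UnicodeEncodeError:
                    octets = b"?"  # unencodable character becomes "%3F"
            # Always use upper-case hexadecimal digits.
            out.append("".join(f"%{b:02X}" for b in octets))
    return "".join(out)
```

With this sketch, encode_uri('100%', True) yields '100%25', while encode_uri('a%20b', True) leaves the existing escape sequence untouched.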

Examples:

str:encode-uri('http://www.example.com/my résumé.html',false())
returns 'http://www.example.com/my%20r%C3%A9sum%C3%A9.html'

str:encode-uri('http://www.example.com/my résumé.html',true())
returns 'http%3A%2F%2Fwww.example.com%2Fmy%20r%C3%A9sum%C3%A9.html'

str:encode-uri('http://www.example.com/my résumé.html',false(),'iso-8859-1')
returns 'http://www.example.com/my%20r%E9sum%E9.html' if the implementation supports iso-8859-1, or an empty string otherwise.
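As a cross-check, the octet sequences in these examples can be reproduced with urllib.parse.quote from Python's standard library. This is a rough analogue rather than an implementation of the proposal: quote() always escapes "%", whereas the proposed function leaves an existing escape sequence alone. The variable names are illustrative only.

```python
from urllib.parse import quote

# quote() never escapes letters, digits, "-", "_", "." or "~",
# so only the remaining RFC 2396 marks need to be listed as safe.
MARKS = "!*'()"
RESERVED = ";/?:@&=+$,[]"

uri = "http://www.example.com/my résumé.html"

# Second argument false: reserved characters are kept, UTF-8 octets.
whole_utf8 = quote(uri, safe=MARKS + RESERVED)

# Second argument true: reserved characters are escaped as well.
part_utf8 = quote(uri, safe=MARKS)

# Third argument 'iso-8859-1': one octet per "é" instead of two.
whole_latin1 = quote(uri, safe=MARKS + RESERVED, encoding="iso-8859-1")

print(whole_utf8)    # http://www.example.com/my%20r%C3%A9sum%C3%A9.html
print(part_utf8)     # http%3A%2F%2Fwww.example.com%2Fmy%20r%C3%A9sum%C3%A9.html
print(whole_latin1)  # http://www.example.com/my%20r%E9sum%E9.html
```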

Issues:

Can we change the function name in a backward-compatible manner? I prefer "uri-escape" over "encode-uri" or "escape-uri". "uri-escape" implies a type of action (URI-style escaping of something), while the other names imply an action on a certain kind of subject (some kind of escaping performed on a URI).
Should unsupported encodings result in an error, rather than an empty string? If so, how do we specify that?
Should unencodable characters be omitted, or perhaps represented as something other than my proposal ("%3F" for "?")?

str:decode-uri()

The proposal is to change the existing function to the following:

Function Syntax

string **str:decode-uri(string, string?)**

The str:decode-uri function returns its first argument string with URI escape sequences, as described in [RFC 2396] section 2.4.1, converted back to the characters they represent.

The optional second argument to the function supplies a character encoding name, which can be given case-insensitively, and should be listed in [Charsets]. If the named encoding is supported by the function implementation, then it is used as the basis for interpreting the octet sequences obtained when unescaping non-ASCII characters. UTF-8 is the default encoding, and is the only encoding required to be supported by this function. If the encoding is given as an empty string or is not supported, then an empty string is returned. If the encoding is supported, but an escaped octet sequence in the string cannot be decoded to a character in that encoding, then the sequence is ignored.
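The unescaping behavior described above might be sketched in Python as follows. The name decode_uri is hypothetical, Python's codec registry is assumed as a stand-in for [Charsets], and for simplicity every escaped octet is decoded through the named encoding, which is only adequate for ASCII-compatible encodings.

```python
import codecs
import re

# One or more consecutive %xx escapes, decoded together so that
# multi-octet characters are reassembled before decoding.
ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_uri(text, encoding="UTF-8"):
    """Hypothetical sketch of the proposed str:decode-uri() behavior."""
    if not encoding:
        return ""  # encoding given as an empty string
    try:
        # Assumption: Python's codec lookup stands in for
        # "an encoding supported by the implementation".
        codecs.lookup(encoding)
    except LookupError:
        return ""  # unsupported encoding

    def unescape(match):
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        # Octet sequences that cannot be decoded are ignored.
        return octets.decode(encoding, errors="ignore")

    return ESCAPE_RUN.sub(unescape, text)
```

Under these assumptions, decode_uri('r%E9sum') returns 'rsum' with the default UTF-8 encoding, because the lone octet E9 is not a valid UTF-8 sequence and is dropped.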

Examples:

str:decode-uri('http://www.example.com/my%20r%C3%A9sum%C3%A9.html') returns 'http://www.example.com/my résumé.html'

str:decode-uri('http://www.example.com/my%20r%E9sum%E9.html','iso-8859-1') returns 'http://www.example.com/my résumé.html' if the implementation supports iso-8859-1, or an empty string otherwise.

Issues:

Can we change the function name in a backward-compatible manner? I prefer "uri-unescape" over "decode-uri".
Should unsupported encodings result in an error, rather than an empty string? If so, how do we specify that?
Should undecodable octets be represented as something like "?" rather than just ignoring them?

References:

XQFuncs
World Wide Web Consortium. XQuery 1.0 and XPath 2.0 Functions and Operators W3C Working Draft, 16 August 2002. See http://www.w3.org/TR/xquery-operators/
RFC 2396
IETF. Uniform Resource Identifiers (URI): Generic Syntax. See http://www.ietf.org/rfc/rfc2396.txt
RFC 2732
IETF. Format for Literal IPv6 Addresses in URL's. See http://www.ietf.org/rfc/rfc2732.txt
Charsets
IANA. Character Sets. See http://www.iana.org/assignments/character-sets

Comments or questions? Email mike@skew.org or email the EXSLT list.
