This module contains functions for parsing and handling URIs (RFC 3986) and
form-urlencoded query strings (HTML 5.2).
Parsing and serializing non-UTF-8 form-urlencoded query strings is also supported
(HTML 5.0).
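
As a minimal sketch of the query-string support mentioned above, the module below composes a form-urlencoded query string from key-value pairs, first with the default UTF-8 encoding and then with latin1. The module name uri_query_sketch and the function demo/0 are illustrative only; compose_query/1 and compose_query/2 are the functions provided by this module.

```erlang
-module(uri_query_sketch).
-export([demo/0]).

%% Hypothetical demo function; each match fails loudly if the result differs.
demo() ->
    %% 246 is the code point of "ö", written numerically to keep the source ASCII.
    Pairs = [{"foo bar", "1"}, {"city", [246 | "rebro"]}],
    %% Default: non-ASCII characters are percent-encoded as UTF-8 bytes.
    "foo+bar=1&city=%C3%B6rebro" = uri_string:compose_query(Pairs),
    %% Non-UTF-8 serialization: the same character as a single latin1 byte.
    "foo+bar=1&city=%F6rebro" =
        uri_string:compose_query(Pairs, [{encoding, latin1}]),
    ok.
```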
A URI is an identifier consisting of a sequence of characters matching the syntax
rule named URI in RFC 3986.
The generic URI syntax consists of a hierarchical sequence of components referred
to as the scheme, authority, path, query, and fragment:
    URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
    hier-part   = "//" authority path-abempty
                / path-absolute
                / path-rootless
                / path-empty
    scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    authority   = [ userinfo "@" ] host [ ":" port ]
    userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )
    reserved    = gen-delims / sub-delims
    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="
    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
The interpretation of a URI depends only on the characters used and not on how those
characters are represented in a network protocol.
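
The component structure described by the generic syntax can be observed with parse/1, which returns a map of the components found in the input. The sketch below is illustrative: the module name uri_parse_sketch and the function components/0 are made up for the example, and it deliberately does not match on the scheme or userinfo keys.

```erlang
-module(uri_parse_sketch).
-export([components/0]).

%% Hypothetical helper: parse a sample URI and pick out some of the
%% components named in the generic syntax.
components() ->
    URI = "foo://example.com:8042/over/there?name=ferret#nose",
    Map = uri_string:parse(URI),
    %% The port is returned as an integer; the other components
    %% matched here come back as strings because the input is a list.
    #{host := Host, port := Port, path := Path,
      query := Query, fragment := Fragment} = Map,
    {Host, Port, Path, Query, Fragment}.
```

Calling uri_parse_sketch:components() should return {"example.com", 8042, "/over/there", "name=ferret", "nose"}.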
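The functions implemented by this module cover the following use cases:
- Parsing URIs into their components and returning a map: parse/1
- Recomposing a map of URI components into a URI string: recompose/1
- Changing the inbound binary and percent-encoding of URIs: transcode/2
- Transforming URIs into a normalized form: normalize/1, normalize/2
- Composing form-urlencoded query strings from a list of key-value pairs:
  compose_query/1, compose_query/2
- Dissecting form-urlencoded query strings into a list of key-value pairs:
  dissect_query/1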
There are four different encodings present during the handling of URIs:
- Inbound binary encoding in binaries
- Inbound percent-encoding in lists and binaries
- Outbound binary encoding in binaries
- Outbound percent-encoding in lists and binaries
Functions with a uri_string() argument accept lists, binaries, and
mixed lists (lists with binary elements) as input. All functions except
transcode/2 expect input as lists of Unicode code points, UTF-8 encoded binaries,
and UTF-8 percent-encoded URI parts ("%C3%B6" corresponds to the Unicode character "ö").
Unless otherwise specified, the return value has the same type and encoding as the
input. That is, binary input returns binary output and list input returns list output;
mixed input also returns list output.
Lists carry only percent-encoding. Binaries carry both a binary encoding and
percent-encoding, and both must be taken into account. transcode/2 converts
between the supported encodings: it takes a uri_string() and a list of options
specifying the inbound and outbound encodings.
RFC 3986 does not mandate any specific character encoding; the encoding is usually
defined by the protocol or the surrounding text. This library makes the same
assumption: binary encoding and percent-encoding are handled as one configuration
unit and cannot be set to different values.
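
A short sketch of transcode/2 follows. It converts latin1 percent-encoding in a list, and a UTF-32 binary whose percent-encoded part is also UTF-32, to UTF-8. The second call shows the "one configuration unit" rule: a single in_encoding option covers both the binary encoding and the percent-encoded bytes. The module name uri_transcode_sketch and the function demo/0 are illustrative only.

```erlang
-module(uri_transcode_sketch).
-export([demo/0]).

%% Hypothetical demo function; each match fails loudly if the result differs.
demo() ->
    %% List input: only the percent-encoding changes (latin1 F6 -> UTF-8 C3 B6).
    "foo%C3%B6bar" =
        uri_string:transcode("foo%F6bar",
                             [{in_encoding, latin1}, {out_encoding, utf8}]),
    %% Binary input: the binary itself and the percent-encoded code unit
    %% are both UTF-32 on the way in and UTF-8 on the way out.
    <<"foo%C3%B6bar">> =
        uri_string:transcode(<<"foo%00%00%00%F6bar"/utf32>>,
                             [{in_encoding, utf32}, {out_encoding, utf8}]),
    ok.
```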