% This LaTeX document was generated using the LaTeX backend of PlDoc,
% The SWI-Prolog documentation system


\section{library(uri): Process URIs}

\label{sec:uri}

This library provides high-performance C-based primitives for
manipulating URIs. We decided for a C-based implementation for the much
better performance on raw character manipulation. Notably, URI handling
primitives are used in time-critical parts of RDF processing. This
implementation is based on RFC-3986:

\begin{code}
http://labs.apache.org/webarch/uri/rfc/rfc3986.html
\end{code}

The URI processing in this library is rather liberal. That is, we break
URIs according to the rules, but we do not validate that the components
are valid. Also, percent-decoding for IRIs is liberal. It first tries
UTF-8; then ISO-Latin-1 and finally accepts \%-characters verbatim.

Earlier experience has shown that strict enforcement of the URI syntax
results in many errors that are accepted by many other web-document
processing tools.\vspace{0.7cm}

\begin{description}
    \predicate[det]{uri_components}{2}{+URI, -Components}
\nodescription
    \predicate[det]{uri_components}{2}{-URI, +Components}
Break a \arg{URI} into its 5 basic components according to the
RFC-3986 regular expression:

\begin{code}
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
\end{code}

\begin{arguments}
\arg{Components} & is a term \verb$uri_components(Scheme, Authority, Path, Search, Fragment)$. If a \arg{URI} is \textbf{parsed}, i.e., using mode
(+,-), components that are not found are left \textit{uninstantiated}
(variable). See \predref{uri_data}{3} for accessing this structure. \\
\end{arguments}

    \predicate[semidet]{uri_data}{3}{?Field, +Components, ?Data}
Provide access the uri_component structure. Defined field-names
are: \const{scheme}, \const{authority}, \const{path}, \const{search} and \const{fragment}

    \predicate[semidet]{uri_data}{4}{+Field, +Components, +Data, -NewComponents}
\arg{NewComponents} is the same as \arg{Components} with \arg{Field} set to \arg{Data}.

    \predicate[det]{uri_normalized}{2}{+URI, -NormalizedURI}
\arg{NormalizedURI} is the normalized form of \arg{URI}. Normalization is
syntactic and involves the following steps:

\begin{shortlist}
    \item 6.2.2.1. Case Normalization
    \item 6.2.2.2. Percent-Encoding Normalization
    \item 6.2.2.3. Path Segment Normalization
\end{shortlist}

    \predicate[det]{iri_normalized}{2}{+IRI, -NormalizedIRI}
\arg{NormalizedIRI} is the normalized form of \arg{IRI}. Normalization is
syntactic and involves the following steps:

\begin{shortlist}
    \item 6.2.2.1. Case Normalization
    \item 6.2.2.3. Path Segment Normalization
\end{shortlist}

\begin{tags}
    \tag{See also}
This is similar to \predref{uri_normalized}{2}, but does not do
normalization of \%-escapes.
\end{tags}

    \predicate[det]{uri_normalized_iri}{2}{+URI, -NormalizedIRI}
As \predref{uri_normalized}{2}, but percent-encoding is translated into IRI
Unicode characters. The translation is liberal: valid UTF-8
sequences of \%-encoded bytes are mapped to the Unicode
character. Other \%XX-sequences are mapped to the corresponding
ISO-Latin-1 character and sole \% characters are left untouched.

\begin{tags}
    \tag{See also}
\predref{uri_iri}{2}.
\end{tags}

    \predicate[semidet]{uri_is_global}{1}{+URI}
True if \arg{URI} has a scheme. The semantics is the same as the code
below, but the implementation is more efficient as it does not need
to parse the other components, nor needs to bind the scheme. The
condition to demand a scheme of more than one character is added to
avoid confusion with DOS path names.

\begin{code}
uri_is_global(URI) :-
        uri_components(URI, Components),
        uri_data(scheme, Components, Scheme),
        nonvar(Scheme),
        atom_length(Scheme, Len),
        Len > 1.
\end{code}

    \predicate[det]{uri_resolve}{3}{+URI, +Base, -GlobalURI}
Resolve a possibly local \arg{URI} relative to \arg{Base}. This implements
\url{http://labs.apache.org/webarch/uri/rfc/rfc3986.html\#relative-transform}

    \predicate[det]{uri_normalized}{3}{+URI, +Base, -NormalizedGlobalURI}
\arg{NormalizedGlobalURI} is the normalized global version of \arg{URI}.
Behaves as if defined by:

\begin{code}
uri_normalized(URI, Base, NormalizedGlobalURI) :-
        uri_resolve(URI, Base, GlobalURI),
        uri_normalized(GlobalURI, NormalizedGlobalURI).
\end{code}

    \predicate[det]{iri_normalized}{3}{+IRI, +Base, -NormalizedGlobalIRI}
\arg{NormalizedGlobalIRI} is the normalized global version of \arg{IRI}.
This is similar to \predref{uri_normalized}{3}, but does not do \%-escape
normalization.

    \predicate[det]{uri_normalized_iri}{3}{+URI, +Base, -NormalizedGlobalIRI}
\arg{NormalizedGlobalIRI} is the normalized global IRI of \arg{URI}. Behaves
as if defined by:

\begin{code}
uri_normalized(URI, Base, NormalizedGlobalIRI) :-
        uri_resolve(URI, Base, GlobalURI),
        uri_normalized_iri(GlobalURI, NormalizedGlobalIRI).
\end{code}

    \predicate[det]{uri_query_components}{2}{+String, -Query}
\nodescription
    \predicate[det]{uri_query_components}{2}{-String, +Query}
Perform encoding and decoding of an URI query string. \arg{Query} is a
list of fully decoded (Unicode) Name=Value pairs. In mode (-,+),
query elements of the forms Name(Value) and Name-Value are also
accepted to enhance interoperability with the option and pairs
libraries. E.g.

\begin{code}
?- uri_query_components(QS, [a=b, c('d+w'), n-'VU Amsterdam']).
QS = 'a=b&c=d%2Bw&n=VU%20Amsterdam'.

?- uri_query_components('a=b&c=d%2Bw&n=VU%20Amsterdam', Q).
Q = [a=b, c='d+w', n='VU Amsterdam'].
\end{code}

    \predicate[det]{uri_authority_components}{2}{+Authority, -Components}
\nodescription
    \predicate[det]{uri_authority_components}{2}{-Authority, +Components}
Break-down the authority component of a URI. The fields of the
structure \arg{Components} can be accessed using \predref{uri_authority_data}{3}.
This predicate deals with IPv6 addresses written as \verb$[ip]$,
returning the \textit{ip} as \const{host}, without the enclosing \verb$[]$. When
constructing an authority string and the host contains \verb$:$, the
host is embraced in \verb$[]$. If \verb$[]$ is not used correctly, the
behavior should be considered poorly defined. If there is no
balancing `]` or the host part does not end with `]`, these
characters are considered normal characters and part of the
(invalid) host name.

    \predicate[semidet]{uri_authority_data}{3}{+Field, ?Components, ?Data}
Provide access the uri_authority structure. Defined field-names
are: \const{user}, \const{password}, \const{host} and \const{port}

    \predicate[det]{uri_encoded}{3}{+Component, +Value, -Encoded}
\nodescription
    \predicate[det]{uri_encoded}{3}{+Component, -Value, +Encoded}
\arg{Encoded} is the URI encoding for \arg{Value}. When encoding
(\arg{Value}\Sifthen{}\arg{Encoded}), \arg{Component} specifies the URI component where the
value is used. It is one of \verb$query_value$, \const{fragment}, \const{path} or
\const{segment}. Besides alphanumerical characters, the following
characters are passed verbatim (the set is split in logical groups
according to RFC3986).

\begin{description}
    \item[query_value, fragment] 
"-._\Stilde{}" \Sbar{} "!\$'()*,;" \Sbar{} "@" \Sbar{} "/?"
    \item[path] 
"-._\Stilde{}" \Sbar{} "!\$\&'()*,;=" \Sbar{} "@" \Sbar{} "/"
    \item[segment] 
"-._\Stilde{}" \Sbar{} "!\$\&'()*,;=" \Sbar{} "@"
\end{description}

    \predicate[det]{uri_iri}{2}{+URI, -IRI}
\nodescription
    \predicate[det]{uri_iri}{2}{-URI, +IRI}
Convert between a \arg{URI}, encoded in US-ASCII and an \arg{IRI}. An \arg{IRI} is
a fully expanded Unicode string. Unicode strings are first
encoded into UTF-8, after which \%-encoding takes place.

\begin{tags}
    \tag{Errors}
\verb$syntax_error(Culprit)$ in mode (+,-) if \arg{URI} is not a
legally percent-encoded UTF-8 string.
\end{tags}

    \predicate[semidet]{uri_file_name}{2}{+URI, -FileName}
\nodescription
    \predicate[det]{uri_file_name}{2}{-URI, +FileName}
Convert between a \arg{URI} and a local file_name. This protocol is
covered by RFC 1738. Please note that file-URIs use \textit{absolute}
paths. The mode (-, +) translates a possible relative path into
an absolute one.

    \predicate[det]{uri_edit}{3}{+Actions, +URI0, -URI}
Modify a \arg{URI} according to \arg{Actions}. \arg{Actions} is either a single
action or a (nested) list of actions. Defined primitive actions
are:

\begin{description}
    \termitem{scheme}{+Scheme}
Set the \arg{Scheme} of the \arg{URI} (typically \const{http}, \const{https}, etc.)
    \termitem{user}{+User}
Add/set the user of the authority component.
    \termitem{password}{+Password}
Add/set the password of the authority component.
    \termitem{host}{+Host}
Add/set the host (or ip address) of the authority component.
    \termitem{port}{+Port}
Add/set the port of the authority component.
    \termitem{path}{+Path}
Set/extend the \const{path} component. If \arg{Path} is not absolute it
is taken relative to the path of \arg{URI0}.
    \termitem{search}{+KeyValues}
Extend the \verb$Key=Value$ pairs of the current search (query)
component. New values replace existing values. If \arg{KeyValues}
is written as =(\arg{KeyValues}) the current search component is
ignored. \arg{KeyValues} is a list, whose elements are one of
\verb$Key=Value$, \verb$Key-Value$ or `Key(Value)`.
    \termitem{fragment}{+Fragment}
Set the \arg{Fragment} of the uri.
\end{description}

Components can be \textit{removed} by using a variable as value, except
from \const{path} which can be reset using \verb$path(/)$ and query which can
be dropped using \verb$query(=([]))$.

\begin{arguments}
\arg{URI0} & is either a valid uri or a variable to start fresh. \\
\end{arguments}
\end{description}