% This LaTeX document was generated using the LaTeX backend of PlDoc, % The SWI-Prolog documentation system \section{library(uri): Process URIs} \label{sec:uri} This library provides high-performance C-based primitives for manipulating URIs. We decided for a C-based implementation for the much better performance on raw character manipulation. Notably, URI handling primitives are used in time-critical parts of RDF processing. This implementation is based on RFC-3986: \begin{code} http://labs.apache.org/webarch/uri/rfc/rfc3986.html \end{code} The URI processing in this library is rather liberal. That is, we break URIs according to the rules, but we do not validate that the components are valid. Also, percent-decoding for IRIs is liberal. It first tries UTF-8; then ISO-Latin-1 and finally accepts \%-characters verbatim. Earlier experience has shown that strict enforcement of the URI syntax results in many errors that are accepted by many other web-document processing tools.\vspace{0.7cm} \begin{description} \predicate[det]{uri_components}{2}{+URI, -Components} \nodescription \predicate[det]{uri_components}{2}{-URI, +Components} Break a \arg{URI} into its 5 basic components according to the RFC-3986 regular expression: \begin{code} ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? 12 3 4 5 6 7 8 9 \end{code} \begin{arguments} \arg{Components} & is a term \verb$uri_components(Scheme, Authority, Path, Search, Fragment)$. If a \arg{URI} is \textbf{parsed}, i.e., using mode (+,-), components that are not found are left \textit{uninstantiated} (variable). See \predref{uri_data}{3} for accessing this structure. \\ \end{arguments} \predicate[semidet]{uri_data}{3}{?Field, +Components, ?Data} Provide access the uri_component structure. Defined field-names are: \const{scheme}, \const{authority}, \const{path}, \const{search} and \const{fragment} \predicate[semidet]{uri_data}{4}{+Field, +Components, +Data, -NewComponents} \arg{NewComponents} is the same as \arg{Components} with \arg{Field} set to \arg{Data}. \predicate[det]{uri_normalized}{2}{+URI, -NormalizedURI} \arg{NormalizedURI} is the normalized form of \arg{URI}. Normalization is syntactic and involves the following steps: \begin{shortlist} \item 6.2.2.1. Case Normalization \item 6.2.2.2. Percent-Encoding Normalization \item 6.2.2.3. Path Segment Normalization \end{shortlist} \predicate[det]{iri_normalized}{2}{+IRI, -NormalizedIRI} \arg{NormalizedIRI} is the normalized form of \arg{IRI}. Normalization is syntactic and involves the following steps: \begin{shortlist} \item 6.2.2.1. Case Normalization \item 6.2.2.3. Path Segment Normalization \end{shortlist} \begin{tags} \tag{See also} This is similar to \predref{uri_normalized}{2}, but does not do normalization of \%-escapes. \end{tags} \predicate[det]{uri_normalized_iri}{2}{+URI, -NormalizedIRI} As \predref{uri_normalized}{2}, but percent-encoding is translated into IRI Unicode characters. The translation is liberal: valid UTF-8 sequences of \%-encoded bytes are mapped to the Unicode character. Other \%XX-sequences are mapped to the corresponding ISO-Latin-1 character and sole \% characters are left untouched. \begin{tags} \tag{See also} \predref{uri_iri}{2}. \end{tags} \predicate[semidet]{uri_is_global}{1}{+URI} True if \arg{URI} has a scheme. The semantics is the same as the code below, but the implementation is more efficient as it does not need to parse the other components, nor needs to bind the scheme. The condition to demand a scheme of more than one character is added to avoid confusion with DOS path names. \begin{code} uri_is_global(URI) :- uri_components(URI, Components), uri_data(scheme, Components, Scheme), nonvar(Scheme), atom_length(Scheme, Len), Len > 1. \end{code} \predicate[det]{uri_resolve}{3}{+URI, +Base, -GlobalURI} Resolve a possibly local \arg{URI} relative to \arg{Base}. This implements \url{http://labs.apache.org/webarch/uri/rfc/rfc3986.html\#relative-transform} \predicate[det]{uri_normalized}{3}{+URI, +Base, -NormalizedGlobalURI} \arg{NormalizedGlobalURI} is the normalized global version of \arg{URI}. Behaves as if defined by: \begin{code} uri_normalized(URI, Base, NormalizedGlobalURI) :- uri_resolve(URI, Base, GlobalURI), uri_normalized(GlobalURI, NormalizedGlobalURI). \end{code} \predicate[det]{iri_normalized}{3}{+IRI, +Base, -NormalizedGlobalIRI} \arg{NormalizedGlobalIRI} is the normalized global version of \arg{IRI}. This is similar to \predref{uri_normalized}{3}, but does not do \%-escape normalization. \predicate[det]{uri_normalized_iri}{3}{+URI, +Base, -NormalizedGlobalIRI} \arg{NormalizedGlobalIRI} is the normalized global IRI of \arg{URI}. Behaves as if defined by: \begin{code} uri_normalized(URI, Base, NormalizedGlobalIRI) :- uri_resolve(URI, Base, GlobalURI), uri_normalized_iri(GlobalURI, NormalizedGlobalIRI). \end{code} \predicate[det]{uri_query_components}{2}{+String, -Query} \nodescription \predicate[det]{uri_query_components}{2}{-String, +Query} Perform encoding and decoding of an URI query string. \arg{Query} is a list of fully decoded (Unicode) Name=Value pairs. In mode (-,+), query elements of the forms Name(Value) and Name-Value are also accepted to enhance interoperability with the option and pairs libraries. E.g. \begin{code} ?- uri_query_components(QS, [a=b, c('d+w'), n-'VU Amsterdam']). QS = 'a=b&c=d%2Bw&n=VU%20Amsterdam'. ?- uri_query_components('a=b&c=d%2Bw&n=VU%20Amsterdam', Q). Q = [a=b, c='d+w', n='VU Amsterdam']. \end{code} \predicate[det]{uri_authority_components}{2}{+Authority, -Components} \nodescription \predicate[det]{uri_authority_components}{2}{-Authority, +Components} Break-down the authority component of a URI. The fields of the structure \arg{Components} can be accessed using \predref{uri_authority_data}{3}. This predicate deals with IPv6 addresses written as \verb$[ip]$, returning the \textit{ip} as \const{host}, without the enclosing \verb$[]$. When constructing an authority string and the host contains \verb$:$, the host is embraced in \verb$[]$. If \verb$[]$ is not used correctly, the behavior should be considered poorly defined. If there is no balancing `]` or the host part does not end with `]`, these characters are considered normal characters and part of the (invalid) host name. \predicate[semidet]{uri_authority_data}{3}{+Field, ?Components, ?Data} Provide access the uri_authority structure. Defined field-names are: \const{user}, \const{password}, \const{host} and \const{port} \predicate[det]{uri_encoded}{3}{+Component, +Value, -Encoded} \nodescription \predicate[det]{uri_encoded}{3}{+Component, -Value, +Encoded} \arg{Encoded} is the URI encoding for \arg{Value}. When encoding (\arg{Value}\Sifthen{}\arg{Encoded}), \arg{Component} specifies the URI component where the value is used. It is one of \verb$query_value$, \const{fragment}, \const{path} or \const{segment}. Besides alphanumerical characters, the following characters are passed verbatim (the set is split in logical groups according to RFC3986). \begin{description} \item[query_value, fragment] "-._\Stilde{}" \Sbar{} "!\$'()*,;" \Sbar{} "@" \Sbar{} "/?" \item[path] "-._\Stilde{}" \Sbar{} "!\$\&'()*,;=" \Sbar{} "@" \Sbar{} "/" \item[segment] "-._\Stilde{}" \Sbar{} "!\$\&'()*,;=" \Sbar{} "@" \end{description} \predicate[det]{uri_iri}{2}{+URI, -IRI} \nodescription \predicate[det]{uri_iri}{2}{-URI, +IRI} Convert between a \arg{URI}, encoded in US-ASCII and an \arg{IRI}. An \arg{IRI} is a fully expanded Unicode string. Unicode strings are first encoded into UTF-8, after which \%-encoding takes place. \begin{tags} \tag{Errors} \verb$syntax_error(Culprit)$ in mode (+,-) if \arg{URI} is not a legally percent-encoded UTF-8 string. \end{tags} \predicate[semidet]{uri_file_name}{2}{+URI, -FileName} \nodescription \predicate[det]{uri_file_name}{2}{-URI, +FileName} Convert between a \arg{URI} and a local file_name. This protocol is covered by RFC 1738. Please note that file-URIs use \textit{absolute} paths. The mode (-, +) translates a possible relative path into an absolute one. \predicate[det]{uri_edit}{3}{+Actions, +URI0, -URI} Modify a \arg{URI} according to \arg{Actions}. \arg{Actions} is either a single action or a (nested) list of actions. Defined primitive actions are: \begin{description} \termitem{scheme}{+Scheme} Set the \arg{Scheme} of the \arg{URI} (typically \const{http}, \const{https}, etc.) \termitem{user}{+User} Add/set the user of the authority component. \termitem{password}{+Password} Add/set the password of the authority component. \termitem{host}{+Host} Add/set the host (or ip address) of the authority component. \termitem{port}{+Port} Add/set the port of the authority component. \termitem{path}{+Path} Set/extend the \const{path} component. If \arg{Path} is not absolute it is taken relative to the path of \arg{URI0}. \termitem{search}{+KeyValues} Extend the \verb$Key=Value$ pairs of the current search (query) component. New values replace existing values. If \arg{KeyValues} is written as =(\arg{KeyValues}) the current search component is ignored. \arg{KeyValues} is a list, whose elements are one of \verb$Key=Value$, \verb$Key-Value$ or `Key(Value)`. \termitem{fragment}{+Fragment} Set the \arg{Fragment} of the uri. \end{description} Components can be \textit{removed} by using a variable as value, except from \const{path} which can be reset using \verb$path(/)$ and query which can be dropped using \verb$query(=([]))$. \begin{arguments} \arg{URI0} & is either a valid uri or a variable to start fresh. \\ \end{arguments} \end{description}