% This LaTeX document was generated using the LaTeX backend of PlDoc,
% The SWI-Prolog documentation system


\section{library(xpath): Select nodes in an XML DOM}

\label{sec:xpath}

\begin{tags}
    \tag{See also}
\url{http://www.w3.org/TR/xpath}
\end{tags}

The library \file{xpath.pl} provides predicates to select nodes from an XML DOM
tree as produced by \file{library(sgml)} based on descriptions inspired by the
XPath language.

The predicate \predref{xpath}{3} selects a sub-structure of the DOM
non-deterministically based on an XPath-like specification. Not all
selectors of XPath are implemented, but the ability to mix \predref{xpath}{3} calls
with arbitrary Prolog code provides a powerful tool for extracting
information from XML parse-trees.\vspace{0.7cm}

\begin{description}
    \predicate[semidet]{xpath_chk}{3}{+DOM, +Spec, ?Content}
Semi-deterministic version of \predref{xpath}{3}.

    \predicate[nondet]{xpath}{3}{+DOM, +Spec, ?Content}
Match an element in a \arg{DOM} structure. The syntax is inspired by
XPath, using () rather than \Snil{} to select inside an element.
Paths are constructed using / and //:

\begin{description}
    \item[\const{\Sidiv}Term] 
Select any node in the \arg{DOM} matching Term.
    \item[\const{\Sdiv}Term] 
Match the root against Term.
    \item[Term] 
Select the immediate children of the root matching Term.
\end{description}
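
The three path forms can be tried on a small hand-written tree. The
sketch below assumes a hypothetical predicate \verb$demo_dom/1$ that holds
a DOM in the \verb$element(Name, Attributes, Content)$ format produced by
\file{library(sgml)}. All helper predicates are illustrative and not part
of the library.

\begin{code}
:- use_module(library(xpath)).

% Hypothetical test data, equivalent to
% <html><body><p>one</p><div><p>two</p></div></body></html>
demo_dom(element(html, [],
                 [ element(body, [],
                           [ element(p, [], [one]),
                             element(div, [], [element(p, [], [two])])
                           ])
                 ])).

% //p: enumerate the p elements at any depth on backtracking.
any_paragraph(P) :-
    demo_dom(DOM),
    xpath(DOM, //p, P).

% /html: succeeds only if the root itself is an html element.
root_is_html(Root) :-
    demo_dom(DOM),
    xpath(DOM, /html, Root).

% body/p: p elements that are immediate children of a body element
% that is itself an immediate child of the root.
top_level_paragraph(P) :-
    demo_dom(DOM),
    xpath(DOM, body/p, P).
\end{code}
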

The Terms above are of type \textit{callable}. The functor specifies
the element name. The element name '*' refers to any element.
The name \const{self} refers to the top-element itself and is often
used for processing matches of an earlier \predref{xpath}{3} query. A term
NS:Term refers to an XML name in the namespace NS. Optional
arguments specify additional constraints and functions. The
arguments are processed from left to right. Defined conditional
argument values are:

\begin{description}
    \item[index(?Index)] 
True if the element is the Index-th child of its parent,
where 1 denotes the first child. Index can be one of:

\begin{description}
    \item[\arg{Var}] 
\arg{Var} is unified with the index of the matched element.
    \item[\const{last}] 
True for the last element.
    \item[\const{last} - \arg{IntExpr}] 
True for the last-minus-nth element. For example,
\verb$last-1$ is the element directly preceding the last one.
    \item[\arg{IntExpr}] 
True for the element whose index equals \arg{IntExpr}.
\end{description}

    \item[Integer] 
The N-th element with the given name, with 1 denoting the
first element. Same as \verb$index(Integer)$.
    \item[\const{last}] 
The last element with the given name. Same as
\verb$index(last)$.
    \item[\const{last} - IntExpr] 
The IntExpr-th element before the last.
Same as \verb$index(last-IntExpr)$.
\end{description}
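
As a minimal sketch of the index conditions, assume a hypothetical
predicate \verb$list_dom/1$ that holds a list with three items. The
\const{text} function used here is one of the function argument values of
\predref{xpath}{3}. The helper names are illustrative only.

\begin{code}
:- use_module(library(xpath)).

% Hypothetical test data, equivalent to
% <ul><li>a</li><li>b</li><li>c</li></ul>
list_dom(element(ul, [],
                 [ element(li, [], [a]),
                   element(li, [], [b]),
                   element(li, [], [c])
                 ])).

% li(2, ...): the second li element.
second_item(Text) :-
    list_dom(DOM),
    xpath(DOM, li(2, text), Text).

% li(last, ...): the last li element.
last_item(Text) :-
    list_dom(DOM),
    xpath(DOM, li(last, text), Text).

% li(last-1, ...): the li element directly preceding the last one.
before_last_item(Text) :-
    list_dom(DOM),
    xpath(DOM, li(last-1, text), Text).

% li(index(I), ...): enumerate all li elements, unifying I with
% the index of each match.
item_with_index(I, Text) :-
    list_dom(DOM),
    xpath(DOM, li(index(I), text), Text).
\end{code}
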

Defined function argument values are:

\begin{description}
    \item[\const{self}] 
Evaluates to the entire element.
    \item[\const{content}] 
Evaluates to the content of the element (a list).
    \item[\const{text}] 
Evaluates to all text from the sub-tree as an atom.
    \item[\const{text(As)}] 
Evaluates to all text from the sub-tree according to
\arg{As}, which is either \const{atom} or \const{string}.
    \item[\const{normalize_space}] 
As \const{text}, but uses \predref{normalize_space}{2} to normalise
white-space in the output.
    \item[\const{number}] 
Extracts an integer or float from the value, ignoring
leading and trailing white-space.
    \item[\const{@}Attribute] 
Evaluates to the value of the given attribute. Attribute
can be a compound term. In this case the functor name
denotes the attribute and arguments perform transformations
on the attribute value. Defined transformations are:

\begin{description}
    \termitem{number}{}
Translate the value into a number using
\predref{xsd_number_string}{2} from \file{library(sgml)}.
    \termitem{integer}{}
As \const{number}, but subsequently transform the value
into an integer using the \predref{round}{1} function.
    \termitem{float}{}
As \const{number}, but subsequently transform the value
into a float using the \predref{float}{1} function.
    \termitem{atom}{}
Translate the value into a Prolog atom. Note that
an atom is normally the default, so \verb$@href$ and
\verb$@href(atom)$ are equivalent. The SGML parser
can return attributes as strings using the
\verb$attribute_value(string)$ option.
    \termitem{string}{}
Translate the value into a Prolog string.
    \termitem{lower}{}
Translate the value to lower case, preserving
the type.
    \termitem{upper}{}
Translate the value to upper case, preserving
the type.
\end{description}
\end{description}
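
The following sketch applies some of these functions to a single
element, using \verb$/self$ to address the root itself. The predicate
\verb$item_dom/1$ and its data are hypothetical. The attribute values are
written as atoms, which is what the parser returns by default.

\begin{code}
:- use_module(library(xpath)).

% Hypothetical test data, equivalent to
% <item id="42" price="9.50">Green <b>tea</b></item>
item_dom(element(item, [id='42', price='9.50'],
                 ['Green ', element(b, [], [tea])])).

% text: all text below the element, collected into one atom.
item_text(Text) :-
    item_dom(DOM),
    xpath(DOM, /self(text), Text).

% content: the content list of the element, unchanged.
item_content(Content) :-
    item_dom(DOM),
    xpath(DOM, /self(content), Content).

% @price(number): the price attribute converted to a number.
item_price(Price) :-
    item_dom(DOM),
    xpath(DOM, /self(@price(number)), Price).
\end{code}
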

In addition, the argument list can contain \textit{conditions}:

\begin{description}
    \item[Left = Right] 
Succeeds if the left-hand side unifies with the right-hand side.
If the left-hand side is a function, this is evaluated.
The right-hand side is \textit{never} evaluated, and thus the
condition \verb$content = content$ defines that the content
of the element is the atom \const{content}.
The functions \verb$lower_case$ and \verb$upper_case$ can be applied
to Right (see example below).
    \item[\const{contains(Haystack, Needle)}] 
Succeeds if Needle is a sub-string of Haystack.
    \item[XPath] 
Succeeds if XPath matches in the currently selected
sub-\arg{DOM}. For example, the following expression finds
an \const{h3} element inside a \const{div} element, where the \const{div}
element itself contains an \const{h2} child with a \const{strong}
child.

\begin{code}
//div(h2/strong)/h3
\end{code}

This is equivalent to the conjunction of XPath goals below.

\begin{code}
   ...,
   xpath(DOM, //(div), Div),
   xpath(Div, h2/strong, _),
   xpath(Div, h3, Result)
\end{code}

\end{description}
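
Conditions and functions can be combined in one argument list. The sketch
below assumes a hypothetical predicate \verb$links_dom/1$ holding two
links. A condition filters the match without changing the selected value,
so the result is the element itself unless a function follows.

\begin{code}
:- use_module(library(xpath)).

% Hypothetical test data: a div holding two a elements.
links_dom(element(div, [],
                  [ element(a, [href='https://www.swi-prolog.org/'],
                            ['SWI-Prolog']),
                    element(a, [href='https://example.org/'],
                            ['Example'])
                  ])).

% contains/2 condition: a elements whose href attribute has
% 'swi-prolog' as a sub-string. The result is the a element.
swi_link(A) :-
    links_dom(DOM),
    xpath(DOM, //a(contains(@href, 'swi-prolog')), A).

% Left = Right condition followed by a function: select the a element
% whose text is 'Example' and evaluate to its href attribute.
example_link(URL) :-
    links_dom(DOM),
    xpath(DOM, //a(text = 'Example', @href), URL).
\end{code}
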

\textbf{Examples}:

Match each table-row in \arg{DOM}:

\begin{code}
xpath(DOM, //tr, TR)
\end{code}

Match the last cell of each table-row in \arg{DOM}. This example
illustrates that a result can be the input of subsequent \predref{xpath}{3}
queries. Using multiple queries on the intermediate TR term
guarantees that all results come from the same table-row:

\begin{code}
xpath(DOM, //tr, TR),
xpath(TR,  td(last), TD)
\end{code}

Match each \const{href} attribute in an $<$a$>$ element:

\begin{code}
xpath(DOM, //a(@href), HREF)
\end{code}

Suppose we have a table containing rows where each first column
is the name of a product with a link to details and the second
is the price (a number). The following predicate matches the
name, URL and price:

\begin{code}
product(DOM, Name, URL, Price) :-
    xpath(DOM, //tr, TR),
    xpath(TR, td(1), C1),
    xpath(C1, /self(normalize_space), Name),
    xpath(C1, a(@href), URL),
    xpath(TR, td(2, number), Price).
\end{code}

Suppose we want to select books with \verb$genre="thriller"$ from a
tree containing elements \verb$<book genre=...>$:

\begin{code}
thriller(DOM, Book) :-
    xpath(DOM, //book(@genre=thriller), Book).
\end{code}

Match the elements \verb$<table align="center">$ \textit{and} \verb$<table align="CENTER">$:

\begin{code}
    //table(@align(lower) = center)
\end{code}

Get the \const{width} and \const{height} of a \const{div} element as numbers,
together with the \const{div} node itself:

\begin{code}
    xpath(DOM, //div(@width(number)=W, @height(number)=H), Div)
\end{code}

Note that \const{div} is an infix operator, so parentheses must be
used in cases like the following:

\begin{code}
    xpath(DOM, //(div), Div)
\end{code}

\end{description}