% This LaTeX document was generated using the LaTeX backend of PlDoc,
% The SWI-Prolog documentation system

\section{library(xpath): Select nodes in an XML DOM}
\label{sec:xpath}

\begin{tags}
    \tag{See also}
\url{http://www.w3.org/TR/xpath}
\end{tags}

The library \file{xpath.pl} provides predicates to select nodes from an
XML DOM tree as produced by \file{library(sgml)} based on descriptions
inspired by the XPath language.

The predicate \predref{xpath}{3} selects a sub-structure of the DOM
non-deterministically based on an XPath-like specification. Not all
selectors of XPath are implemented, but the ability to mix
\predref{xpath}{3} calls with arbitrary Prolog code provides a powerful
tool for extracting information from XML parse-trees.\vspace{0.7cm}

\begin{description}
    \predicate[semidet]{xpath_chk}{3}{+DOM, +Spec, ?Content}
Semi-deterministic version of \predref{xpath}{3}.

    \predicate[nondet]{xpath}{3}{+DOM, +Spec, ?Content}
Match an element in a \arg{DOM} structure. The syntax is inspired by
XPath, using () rather than \Snil{} to select inside an element. First
we can construct paths using / and //:

\begin{description}
    \item[\const{\Sidiv}Term]
Select any node in the \arg{DOM} matching term.
    \item[\const{\Sdiv}Term]
Match the root against Term.
    \item[Term]
Select the immediate children of the root matching Term.
\end{description}

The Terms above are of type \textit{callable}. The functor specifies the
element name. The element name '*' refers to any element. The name
\const{self} refers to the top-element itself and is often used for
processing matches of an earlier \predref{xpath}{3} query. A term NS:Term
refers to an XML name in the namespace NS. Optional arguments specify
additional constraints and functions. The arguments are processed from
left to right. Defined conditional argument values are:

\begin{description}
    \item[index(?Index)]
True if the element is the Index-th child of its parent, where 1 denotes
the first child.
Index can be one of:

\begin{description}
    \item[\arg{Var}]
\arg{Var} is unified with the index of the matched element.
    \item[\const{last}]
True for the last element.
    \item[\const{last} - \arg{IntExpr}]
True for the last-minus-nth element. For example, \verb$last-1$ is the
element directly preceding the last one.
    \item[\arg{IntExpr}]
True for the element whose index equals \arg{IntExpr}.
\end{description}

    \item[Integer]
The N-th element with the given name, with 1 denoting the first element.
Same as \verb$index(Integer)$.
    \item[\const{last}]
The last element with the given name. Same as \verb$index(last)$.
    \item[\const{last} - IntExpr]
The IntExpr-th element before the last. Same as
\verb$index(last-IntExpr)$.
\end{description}

Defined function argument values are:

\begin{description}
    \item[\const{self}]
Evaluates to the entire element.
    \item[\const{content}]
Evaluates to the content of the element (a list).
    \item[\const{text}]
Evaluates to all text from the sub-tree as an atom.
    \item[\const{text(As)}]
Evaluates to all text from the sub-tree according to \arg{As}, which is
either \const{atom} or \const{string}.
    \item[\const{normalize_space}]
As \const{text}, but uses \predref{normalize_space}{2} to normalise
white-space in the output.
    \item[\const{number}]
Extracts an integer or float from the value. Ignores leading and trailing
white-space.
    \item[\const{@}Attribute]
Evaluates to the value of the given attribute. Attribute can be a
compound term. In this case the functor name denotes the attribute and
the arguments perform transformations on the attribute value. Defined
transformations are:

\begin{description}
    \termitem{number}{}
Translate the value into a number using \predref{xsd_number_string}{2}
from \file{library(sgml)}.
    \termitem{integer}{}
As \const{number}, but subsequently transform the value into an integer
using the \predref{round}{1} function.
    \termitem{float}{}
As \const{number}, but subsequently transform the value into a float
using the \predref{float}{1} function.
    \termitem{atom}{}
Translate the value into a Prolog atom. Note that an atom is normally the
default, so \verb$@href$ and \verb$@href(atom)$ are equivalent. The SGML
parser can return attributes as strings using the
\verb$attribute_value(string)$ option.
    \termitem{string}{}
Translate the value into a Prolog string.
    \termitem{lower}{}
Translate the value to lower case, preserving the type.
    \termitem{upper}{}
Translate the value to upper case, preserving the type.
\end{description}

\end{description}

In addition, the argument list can contain \textit{conditions}:

\begin{description}
    \item[Left = Right]
Succeeds if the left-hand side unifies with the right-hand side. If the
left-hand side is a function, it is evaluated first. The right-hand side
is \textit{never} evaluated, and thus the condition
\verb$content = content$ defines that the content of the element is the
atom \const{content}. The functions \verb$lower_case$ and
\verb$upper_case$ can be applied to Right (see example below).
    \item[\const{contains(Haystack, Needle)}]
Succeeds if Needle is a sub-string of Haystack.
    \item[XPath]
Succeeds if XPath matches in the currently selected sub-\arg{DOM}. For
example, the following expression finds an \const{h3} element inside a
\const{div} element, where the \const{div} element itself contains an
\const{h2} child with a \const{strong} child.

\begin{code}
//div(h2/strong)/h3
\end{code}

This is equivalent to the conjunction of XPath goals below.

\begin{code}
   ...,
   xpath(DOM, //(div), Div),
   xpath(Div, h2/strong, _),
   xpath(Div, h3, Result)
\end{code}

\end{description}

\textbf{Examples}:

Match each table-row in \arg{DOM}:

\begin{code}
xpath(DOM, //tr, TR)
\end{code}

Match the last cell of each table-row in \arg{DOM}. This example
illustrates that a result can be the input of subsequent
\predref{xpath}{3} queries.
Using multiple queries on the intermediate TR term guarantees that all
results come from the same table-row:

\begin{code}
xpath(DOM, //tr, TR),
xpath(TR,  /td(last), TD)
\end{code}

Match each \const{href} attribute in an $<$a$>$ element:

\begin{code}
xpath(DOM, //a(@href), HREF)
\end{code}

Suppose we have a table containing rows where the first column of each
row is the name of a product with a link to details and the second
column is the price (a number). The following predicate matches the name,
URL and price:

\begin{code}
product(DOM, Name, URL, Price) :-
    xpath(DOM, //tr, TR),
    xpath(TR, td(1), C1),
    xpath(C1, /self(normalize_space), Name),
    xpath(C1, a(@href), URL),
    xpath(TR, td(2, number), Price).
\end{code}

Suppose we want to select books with genre="thriller" from a tree
containing elements \verb$<book genre=...>$:

\begin{code}
thriller(DOM, Book) :-
    xpath(DOM, //book(@genre=thriller), Book).
\end{code}

Match the elements \verb$<table align="center">$ \textit{and}
\verb$<table align="CENTER">$:

\begin{code}
//table(@align(lower) = center)
\end{code}

Get the \const{width} and \const{height} of a \const{div} element as
numbers, and the \const{div} node itself:

\begin{code}
xpath(DOM, //div(@width(number)=W, @height(number)=H), Div)
\end{code}

Note that \const{div} is an infix operator, so parentheses must be used
in cases like the following:

\begin{code}
xpath(DOM, //(div), Div)
\end{code}

\end{description}
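
The product example above can be tried end to end with the minimal sketch
below. It is an illustration rather than part of the library: the sample
markup and the predicates \verb$sample_rows/1$, \verb$sample_dom/1$ and
\verb$demo/3$ are hypothetical names introduced here, and it assumes that
parsing the text with \verb$load_structure/3$ from \file{library(sgml)}
in the XML dialect gives a DOM suitable for \predref{xpath}{3}.

\begin{code}
:- use_module(library(sgml)).
:- use_module(library(xpath)).

% Hypothetical sample document: a table with a single product row.
sample_rows('<table><tr><td><a href="/p/1">Widget</a></td><td>9.50</td></tr></table>').

% Parse the sample text into a DOM (assumption: XML dialect).
sample_dom(DOM) :-
    sample_rows(Text),
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), DOM, [dialect(xml)]),
        close(In)).

% product/4 as given in the examples of this section.
product(DOM, Name, URL, Price) :-
    xpath(DOM, //tr, TR),
    xpath(TR, td(1), C1),
    xpath(C1, /self(normalize_space), Name),
    xpath(C1, a(@href), URL),
    xpath(TR, td(2, number), Price).

% Enumerate the products of the sample document on backtracking.
demo(Name, URL, Price) :-
    sample_dom(DOM),
    product(DOM, Name, URL, Price).
\end{code}

Calling \verb$demo(Name, URL, Price)$ at the toplevel should then
enumerate one solution per table row. Because both cells are selected
from the same intermediate TR term, the name, URL and price of a solution
always belong to the same row.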