% This LaTeX document was generated using the LaTeX backend of PlDoc, % The SWI-Prolog documentation system \section{library(pcre): Perl compatible regular expression matching for SWI-Prolog} \label{sec:pcre} \begin{tags} \tag{See also} `man pcre2api` or \url{https://www.pcre.org/current/doc/html/pcre2api.html} for details of the PCRE2 syntax and options. \end{tags} This module provides an interface to the \href{http://www.pcre.org/}{PCRE2} (Perl Compatible Regular Expression) library. This Prolog interface provides an almost complete wrapper around PCRE2 (the successor to PCRE) with as much backward compatibility to PCRE as possible, because the original implementation was for PCRE (also known as PCRE1). Regular expressions are created from a pattern and options and represented as a SWI-Prolog \textit{blob}. This implies they are subject to (atom) garbage collection. Compiled regular expressions can safely be used in multiple threads. Most predicates accept both an explicitly compiled regular expression, a pattern, or a term Pattern/Flags. The semantics of the pattern can be additionally modified by options. In the latter two cases a regular expression \textit{blob} is created and stored in a cache. The cache can be cleared using \predref{re_flush}{0}. Most of the predicates in this library take both a regular expression represented as a string with optional flags, e.g., \verb$'aap'/i$ or a \textit{compiled regular} expression. If a string (+flags) alternative is used, the library maintains a cache of compiled regular expressions. See also \predref{re_flush}{0}. The library can be asked to rewrite the \predref{re_match}{2} and \predref{re_match}{3} goals to use inlined compiled regular expression objects using \begin{code} :- set_prolog_flag(re_compile, true). \end{code} This has some consequences: \begin{itemize} \item Performance is considerable better. \item Compiled regular expressions are currently incompatible with \textit{Quick Load Files} (`.qlf`, see \predref{qcompile}{1}) and \textit{Saved States} (see \predref{qsave_program}{2} and the \verb$-c$ command line option. \item Debugging may be harder. \end{itemize} \vspace{0.7cm} \begin{description} \predicate[semidet]{re_match}{2}{+Regex, +String} \nodescription \predicate[semidet]{re_match}{3}{+Regex, +String, +Options} Succeeds if \arg{String} matches \arg{Regex}. For example: \begin{code} ?- re_match("^needle"/i, "Needle in a haystack"). true. \end{code} Defined \arg{Options} are given below. For details, see the PCRE documentation. If an option is repeated, the first value is used and subsequent values are ignored. Unrecognized options are ignored. Unless otherwise specified, boolean options default to \const{false}. If \arg{Regex} is a text pattern (optionally with flags), then any of the \arg{Options} for \predref{re_compile}{3} can be used, in addition to the \arg{Options} listed below. If \arg{Regex} is the result of \predref{re_compile}{3}, then only the following execution-time \arg{Options} are recognized and any others are ignored. Some options may not exist on your system, depending on the PCRE2 version and how it was built - these unsupported options are silently ignored. \begin{itemize} \item \verb$start(From)$ Start at the given character index \item \verb$anchored(Bool)$ If \const{true}, match only at the first position \item \verb$bol(Bool)$ \arg{String} is the beginning of a line (default \const{true}) - affects behavior of circumflex metacharacter (\verb$^$). \item \verb$empty(Bool)$ An empty string is a valid match (default \const{true}) \item \verb$empty_atstart(Bool)$ An empty string at the start of the subject is a valid match (default \const{true}) \item \verb$eol(Bool)$ \arg{String} is the end of a line - affects behavior of dollar metacharacter (\verb|$|) (default \const{true}). \item \verb$newline(Mode)$ If \const{any}, recognize any Unicode newline sequence, if \const{anycrlf}, recognize CR, LF, and CRLF as newline sequences, if \const{cr}, recognize CR, if \const{lf}, recognize LF, if \const{crlf} recognize CRLF as newline. The default is determined by how PCRE was built, and can be found by \verb$re_config(newline2(NewlineDefault))$. \item \verb$newline2(Mode)$ - synonym for \verb$newline(Mode)$. \item \verb$utf_check(Bool)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} You should not need this because SWI-Prolog ensures that the UTF8 strings are valid, so the default is \const{false}. \item \verb$endanchored(Bool)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$partial_soft(Bool)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$partial_hard(Bool)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$dfa_restart(Bool)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$dfa_shortest(Bool)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \end{itemize} \begin{arguments} \arg{Regex} & is the output of \predref{re_compile}{3}, a pattern or a term Pattern/Flags, where Pattern is an atom or string. The defined flags and their related option for \predref{re_compile}{3} are below. \begin{shortlist} \item \textbf{x}: \verb$extended(true)$ \item \textbf{i}: \verb$caseless(true)$ \item \textbf{m}: \verb$multiline(true)$ \item \textbf{s}: \verb$dotall(true)$ \item \textbf{a}: \verb$capture_type(atom)$ \item \textbf{r}: \verb$capture_type(range)$ \item \textbf{t}: \verb$capture_type(term)$ \end{shortlist} If \arg{Regex} is the output of \predref{re_compile}{3}, any compile-time options in \arg{Options} or Flags are ignored and only match-time options are used. The options that are derived from flags take precedence over the options in the \arg{Options} list. In the case of conflicting flags, the first one is used (e.g., \const{ra} results in \verb$capture_type(range)$). \\ \end{arguments} \predicate[semidet]{re_matchsub}{3}{+Regex, +String, -Sub:dict} \nodescription \predicate[semidet]{re_matchsub}{4}{+Regex, +String, -Sub:dict, +Options} Match \arg{String} against \arg{Regex}. On success, \arg{Sub} is a dict containing integer keys for the numbered capture group and atom keys for the named capture groups. The entire match string has the key \verb$0$. The associated value is determined by the \verb$capture_type(Type)$ option passed to \predref{re_compile}{3}, or by flags if \arg{Regex} is of the form Pattern/Flags; and may be specified at the level of individual captures using a naming convention for the caption name. See \predref{re_compile}{3} for details. The example below exploits the typed groups to parse a date specification: \begin{code} ?- re_matchsub("(? (?(?:\\d\\d)?\\d\\d) - (?\\d\\d) - (?\\d\\d) )"/x, "2017-04-20", Sub, []). Sub = re_match{0:"2017-04-20", date:"2017-04-20", day:20, month:4, year:2017}. \end{code} \begin{arguments} \arg{Both} & compilation and execution options are processed. See \predref{re_compile}{3} and \predref{re_match}{3} for the set of options. In addition, some compilation options may passed as \verb$/Flags$ to \arg{Regex} - see \predref{re_match}{3} for the list of flags. \\ \arg{Regex} & See \predref{re_match}{2} for a description of this argument. \\ \end{arguments} \predicate[semidet]{re_foldl}{6}{:Goal, +Regex, +String, ?V0, ?V, +Options} Fold all matches of \arg{Regex} on \arg{String}. Each match is represented by a dict as specified for \predref{re_matchsub}{4}. \arg{V0} and \arg{V} are related using a sequence of invocations of \arg{Goal} as illustrated below. \begin{code} call(Goal, Dict1, V0, V1), call(Goal, Dict2, V1, V2), ... call(Goal, Dictn, Vn, V). \end{code} This predicate is used to implement \predref{re_split}{4} and \predref{re_replace}{4}. For example, we can count all matches of a \arg{Regex} on \arg{String} using this code: \begin{code} re_match_count(Regex, String, Count) :- re_foldl(increment, Regex, String, 0, Count, []). increment(_Match, V0, V1) :- V1 is V0+1. \end{code} After which we can query \begin{code} ?- re_match_count("a", "aap", X). X = 2. \end{code} Here is an example \arg{Goal} for extracting all the matches with their offsets within the string: \begin{code} range_match(Dict, StringIndex-[MatchStart-Substring|List], StringIndex-List) :- Dict.(StringIndex.index) = MatchStart-MatchLen, sub_string(StringIndex.string, MatchStart, MatchLen, _, Substring). \end{code} And can be used with this query (note the \verb$capture_type(range)$ option, which is needed by \predref{range_match}{3}, and \verb$greedy(false)$ to invert the meaning of \verb$*?$): \begin{code} ?- String = "{START} Mary {END} had a {START} little lamb {END}", re_foldl(range_match, "{START} *?(?.*) *?{END}", String, _{string:String,index:piece}-Matches, _-[], [capture_type(range),greedy(false)]). Matches = [8-"Mary", 33-"little lamb"]. \end{code} \predicate[det]{re_split}{3}{+Pattern, +String, -Splits:list} \nodescription \predicate[det]{re_split}{4}{+Pattern, +String, -Splits:list, +Options} Split \arg{String} using the regular expression \arg{Pattern}. \arg{Splits} is a list of strings holding alternating matches of \arg{Pattern} and skipped parts of the \arg{String}, starting with a skipped part. The \arg{Splits} lists ends with a string of the content of \arg{String} after the last match. If \arg{Pattern} does not appear in \arg{String}, \arg{Splits} is a list holding a copy of \arg{String}. This implies the number of elements in \arg{Splits} is \textit{always} odd. For example: \begin{code} ?- re_split("a+", "abaac", Splits, []). Splits = ["","a","b","aa","c"]. ?- re_split(":\\s*"/n, "Age: 33", Splits, []). Splits = ['Age', ': ', 33]. \end{code} \begin{arguments} \arg{Pattern} & is the pattern text, optionally follows by /Flags. Similar to \predref{re_matchsub}{4}, the final output type can be controlled by a flag \const{a} (atom), \const{s} (string, default) or \const{n} (number if possible, atom otherwise). \\ \end{arguments} \predicate[det]{re_replace}{4}{+Pattern, +With, +String, -NewString} \nodescription \predicate[det]{re_replace}{5}{+Pattern, +With, +String, -NewString, +Options} Replace matches of the regular expression \arg{Pattern} in \arg{String} with \arg{With} (possibly containing references to captured substrings). Throws an error if \arg{With} uses a name that doesn't exist in the \arg{Pattern}. \begin{arguments} \arg{Pattern} & is the pattern text, optionally followed by /Flags. Flags may include \const{g}, replacing all occurences of \arg{Pattern}. In addition, similar to \predref{re_matchsub}{4}, the final output type can be controlled by a flag \const{a} (atom) or \const{s} (string, default). The output type can also be specified by the \verb$capture_type$ option. Capture type suffixes can modify behavior; for example, the following will change an ISO 8601 format date (YYYY-MM-DD) to American style (m/d/y), and also remove leading zeros by using the \arg{_I} suffix: \begin{code} re_replace("(? (?(?:\\d\\d)?\\d\\d) - (?\\d\\d) - (?\\d\\d) )"/x, "$month-$day-$year", ISODate, AmericanDate)` \end{code} \\ \arg{With} & is the replacement text. It may reference captured substrings using \Sneg{}N or \$Name. Both N and Name may be written as \{N\} and \{Name\} to avoid ambiguities. If a substring is named, it cannot be referenced by its number. The single chracters \verb|$| and \verb$\$ can be escaped by doubling (e.g., \verb|re_replace(".","$$","abc",Replaced)| results in \verb|Replaced="$bc"|). (Because \verb$\$ is an escape character inside strings, you need to write "\bsl{}\bsl{}\BB{}" to get a single backslash.) \\ \arg{Options} & See \predref{re_match}{3} for the set of options. The options that are derived from flags take precedence over the options in the \arg{Options} list. In the case of conflicting flags, the first one is used (e.g., \const{as} results in \verb$capture_type(string)$). If a \verb$capture_type$ is meaningless (\const{range} or \const{term}), it is ignored. \\ \end{arguments} \predicate[det]{re_compile}{3}{+Pattern, -Regex, +Options} Compiles \arg{Pattern} to a \arg{Regex} \textit{blob} of type \const{regex} (see \predref{blob}{2}). Defined \arg{Options} are given below. Please consult the \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} for details. If an option is repeated, the first value is used and subsequent values are ignored. Unrecognized options are ignored. Unless otherwise specified, boolean options default to \const{false}. Some options may not exist on your system, depending on the PCRE2 version and how it was built - these unsupported options are silently ignored. The various matching predicates can take either a \arg{Regex} \textit{blob} or a string pattern; if they are given a string pattern, they call \predref{re_compile}{3} and cache the result; so, there is little reason to use \predref{re_compile}{3} directly. \begin{itemize} \item \verb$anchored(Bool)$ If \const{true}, match only at the first position \item \verb$auto_capture(Bool)$ Enable use of numbered capturing parentheses. (default \const{true}) \item \verb$bsr(Mode)$ If \const{anycrlf}, \Sneg{}R only matches CR, LF or CRLF; if \const{unicode}, \Sneg{}R matches all Unicode line endings. \item \verb$bsr2(Mode)$ - synonym for \verb$bsr(Mode)$. \item \verb$caseless(Bool)$ If \const{true}, do caseless matching. \item \verb$compat(With)$ Error - PCRE1 had \verb$compat(javascript)$ for JavaScript compatibility, but PCRE2 has removed that. \item \verb$dollar_endonly(Bool)$ If \const{true}, \$ not to match newline at end \item \verb$dotall(Bool)$ If \const{true}, . matches anything including NL \item \verb$dupnames(Bool)$ If \const{true}, allow duplicate names for subpatterns \item \verb$extended(Bool)$ If \const{true}, ignore white space and \# comments \item \verb$firstline(Bool)$ If \const{true}, force matching to be before newline \item \verb$greedy(Bool)$ If \const{true}, operators such as \verb$+$ and \verb$*$ are greedy unless followed by \verb$?$; if \const{false}, the operators are not greedy and \verb$?$ has the opposite meaning. It can also beset by a `(?U)` within the pattern - see the \href{https://www.pcre.org/current/doc/html/pcre2pattern.html\#SEC13}{PCRE2 pattern internal option setting documentation} for details and note that the PCRE2 option is \arg{UNGREEDY}, which is the inverse of this packages \const{greedy} options. (default \const{true}) \item \verb$compat(With)$ Raises an errr - PCRE1 had \verb$compat(javascript)$ for JavaScript compatibility, but PCRE2 has removed that option . Consider using the \verb$alt_bsux$ and \verb$extra_alt_bsux$ options. \item \verb$multiline(Bool)$ If \const{true}, \Shat{} and \$ match newlines within data \item \verb$newline(Mode)$ If \const{any}, recognize any Unicode newline sequence; if \const{anycrlf} (default), recognize CR, LF, and CRLF as newline sequences; if \const{cr}, recognize CR; if \const{lf}, recognize LF; \const{crlf} recognize CRLF as newline; if \const{nul}, recognize the NULL character (0x00) as newline. \item \verb$newline2(Mode)$ - synonym for \verb$newline(Mode)$. \item \verb$ucp(Bool)$ If \const{true}, use Unicode properties for \Sneg{}d, \Sneg{}w, etc. \item \verb$utf_check(Bool)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} You should not need this because SWI-Prolog ensures that the UTF8 strings are valid, \item \verb$endanchored(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$allow_empty_class(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$alt_bsux(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$auto_callout(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$match_unset_backref(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$never_ucp(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$never_utf(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$auto_possess(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} (default \const{true}) \item \verb$dotstar_anchor(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} (default \const{true}) \item \verb$start_optimize(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} (default \const{true}) \item \verb$utf(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$never_backslash_c(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$alt_circumflex(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$alt_verbnames(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$use_offset_limit(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$extended_more(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$literal(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$match_invalid_utf(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$jit_complete(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$jit_partial_soft(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$jit_partial_hard(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$jit_invalid_utf(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \item \verb$jit(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} (default \const{true}) \item \verb$copy_matched_subject(boolean)$ - see \href{https://www.pcre.org/current/doc/html/pcre2api.html}{PCRE2 API documentation} \end{itemize} In addition to the options above that directly map to PCRE flags the following options are processed: \begin{itemize} \item \verb$optimise(Bool)$ or \verb$optimize(Bool)$ Turns on the JIT compiler for additional optimization that greatly that speeds up the matching performance of many patterns. (Note that he meaning has changed slightly from the PCRE1 implementation \item PCRE2 always optimises where possible; this is an additional optimisation.) \item \verb$capture_type(+Type)$ How to return the matched part of the input and possibly captured groups in there. Possible values are: \begin{description} \termitem{string}{} Return the captured string as a string (default). \termitem{atom}{} Return the captured string as an atom. \termitem{range}{} Return the captured string as a pair \verb$Start-Length$. Note that we use \verb$Start-Length$ rather than the more conventional \verb$Start-End$ to allow for immediate use with \predref{sub_atom}{5} and \predref{sub_string}{5}. \termitem{term}{} Parse the captured string as a Prolog term. This is notably practical if you capture a number. \end{description} \end{itemize} The \verb$capture_type$ specifies the default for this pattern. The interface supports a different type for each \textit{named} group using the syntax `(?$<$name_T$>$...)`, where \arg{T} is one of \verb$S$ (string), \verb$A$ (atom), \verb$I$ (integer), \verb$F$ (float), \verb$N$ (number), \verb$T$ (term) and \verb$R$ (range). In the current implementation \verb$I$, \verb$F$ and \verb$N$ are synonyms for \verb$T$. Future versions may act different if the parsed value is not of the requested numeric type. Note that \predref{re_compile}{3} does not support the \arg{Pattern}/Flags form that is supported by \predref{re_match}{3}, \predref{re_replace}{4}, etc.; the \arg{Pattern} must be text and all compile options specified in \arg{Options}. \predicate{re_flush}{0}{} Clean pattern and replacement caches. \begin{tags} \tag{To be done} Flush automatically if the cache becomes too large. \end{tags} \predicate{re_config}{1}{?Term} Extract configuration information from the pcre library. \arg{Term} is of the form \verb$Name(Value)$. Name is derived from the \verb$PCRE_CONFIG_*$ constant after removing \verb$PCRE_CONFIG_$ and mapping the name to lower case, e.g. \const{utf8}, \verb$unicode_properties$, etc. Value is a Prolog boolean, integer, or atom. For boolean (1 or 0) values, \const{true} or \const{false} is returned. \predref{re_config}{1} will backtrack through all the possible configuration values if its argument is a variable. If an unknown option is specified, \predref{re_config}{1} fails. Non-compatible changes between PCRE1 and PCRE2 because numeric values changed: \const{bsr} and \const{newline} have been replaced by \const{bsr2} and \const{newline2}: \begin{itemize} \item \const{bsr2} - previously \const{bsr} returned 0 or 1; now returns \const{unicode} or \const{anycrlf} \item \const{newline2} - previously \const{newline} returned an integer, now returns \const{cr}, \const{lf}, \const{crlf}, \const{any}, \const{anycrlf}, \const{nul} \end{itemize} \arg{Term} values are as follows. Some values might not exist, depending on the version of PCRE2 and the options it was built with. \begin{itemize} \item bsr2 The character sequences that the \verb$\R$ escape sequence matches by default. Replaces \const{bsr} option from PCRE1, which is not compatible. \item compiled_widths An integer whose lower bits indicate which code unit widths were selected when PCRE2 was built. The 1-bit indicates 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit support, respectively. The 1 bit should always be set because the wrapper code requires 8 bit support. \item depthlimit \item heaplimit \item jit \const{true} if just-in-time compiling is available. \item jittarget A string containing the name of the architecture for which the JIT compiler is configured. e.g., 'x86 64bit (little endian + unaligned)'. \item linksize \item matchlimit \item never_backslash_c \item newline2 An atom whose value specifies the default character sequence that is recognized as meaning "newline" (\const{cr}, \const{lf}, \const{crlf}, \const{any}, \const{anycrlf}, \const{nul}). Replaces \const{newline} option from PCRE1, which is not compatible. \item parenslimit \item stackrecurse \item unicode Always \const{true} \item unicode_version The unicode version as an atom, e.g. '12.1.0'. \item utf8 - synonym for \const{unicode} \item parens_limit \item version The version information as an atom, containing the PCRE version number and release date, e.g. '10.34 2019-11-21'. For backwards compatibility with PCRE1, the following are accepted, but are deprecated: \begin{shortlist} \item \const{utf8} - synonym for \const{unicode} \item \verb$link_size$ - synonym for \const{linksize} \item \verb$match_limit$ - synonym for \const{matchlimit} \item \verb$parens_limit$ - synonym for \const{parenslimit} \item \verb$unicode_properties$ - always true \end{shortlist} The following have been removed because they don't exist in PCRE2 and don't seem to have any meaningful use in PCRE1: \begin{shortlist} \item \verb$posix_malloc_threshold$ \item \verb$match_limit_recursion$ \end{shortlist} \end{itemize} \end{description}