% This LaTeX document was generated using the LaTeX backend of PlDoc,
% The SWI-Prolog documentation system

\section{library(isub): isub: a string similarity measure}

\label{sec:isub}

\begin{tags}
    \tag{author}
Giorgos Stoilos
    \tag{See also}
\textit{A string metric for ontology alignment} by Giorgos Stoilos, 2005 - \url{http://www.image.ece.ntua.gr/papers/378.pdf}.
\end{tags}

The \file{library(isub)} implements a similarity measure between strings, comparable in purpose to the \textit{Levenshtein distance}. The method is based on the length of common substrings.\vspace{0.7cm}

\begin{description}
    \predicate[det]{isub}{4}{+Text1:text, +Text2:text, -Similarity:float, +Options:list}
\arg{Similarity} is a measure of the similarity/dissimilarity between \arg{Text1} and \arg{Text2}. For example:

\begin{code}
?- isub('E56.Language', 'languange', D, [normalize(true)]).
D = 0.4226950354609929.                       % [-1,1] range

?- isub('E56.Language', 'languange', D, [normalize(true),zero_to_one(true)]).
D = 0.7113475177304964.                       % [0,1] range

?- isub('E56.Language', 'languange', D, []).  % without normalization
D = 0.19047619047619047.                      % [-1,1] range

?- isub(aa, aa, D, []).  % does not work for short substrings
D = -0.8.

?- isub(aa, aa, D, [substring_threshold(0)]). % works with short substrings
D = 1.0.                                      % but may give unwanted values
                                              % between e.g. 'store' and 'spore'.

?- isub(joe, hoe, D, [substring_threshold(0)]).
D = 0.5315315315315314.

?- isub(joe, hoe, D, []).
D = -1.0.
\end{code}

This version of \predref{isub}{4} replaces the old one while remaining backwards compatible, and it adds several options to tweak the algorithm.

\begin{arguments}
\arg{Text1} & and \arg{Text2} are each an atom, a string, or a list of characters or character codes. \\
\arg{Similarity} & is a float in the range [-1,1], where 1.0 means \textit{most similar}. The range can be set to [0,1] with the zero\_to\_one option described below.
\\
\arg{Options} & is a list with the elements described below. Note that the options are processed at compile time using goal\_expansion, which provides much better speed. Supported options are:
\begin{description}
    \termitem{normalize}{+Boolean}
Applies string normalization as implemented by the original authors: \arg{Text1} and \arg{Text2} are mapped to lowercase and the characters ".\_ " are removed. Lowercase mapping is done with the C-library function \verb$towlower()$. In general, the required normalization is domain dependent and is better left to the caller. See e.g., \predref{unaccent_atom}{2}. The default is to skip normalization (\const{false}).
    \termitem{zero_to_one}{+Boolean}
The old isub implementation deviated from the original algorithm by returning a value in the [0,1] range. This new \predref{isub}{4} implementation defaults to the original range of [-1,1], but this option can be set to \const{true} to change the output range to [0,1].
    \termitem{substring_threshold}{+Nonneg}
The original algorithm was meant to compare terms in semantic web ontologies, and it had a hard-coded parameter so that only common substrings longer than 2 characters were considered. As a result, the similarity between, for example, 'aa' and 'aa' is -0.8, which is unexpected. This option allows the user to set any threshold, such as 0, so that the similarity between short strings is properly recognized. The default value is 2, which is what the original algorithm used.
\end{description}
\\
\end{arguments}
\end{description}
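
The two output ranges appear to be related by a simple linear rescaling. This is an inference from the example values given for \predref{isub}{4}, not something stated by the library. As a sketch, assume a score $s \in [-1,1]$ is mapped to $s' \in [0,1]$ by
\[
  s' = \frac{s + 1}{2}.
\]
Under that assumption, the normalized score for 'E56.Language' and 'languange' becomes
\[
  \frac{0.4226950354609929 + 1}{2} \approx 0.71135,
\]
which agrees with the value 0.7113475177304964 reported for the same pair when zero\_to\_one is \const{true}. With this mapping, a score of $-1$ corresponds to $0$, a score of $0$ to $0.5$, and a score of $1$ to $1$.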