# Design Notes

Initially extracted from a conversation with [@Anniepoo](https://github.com/Anniepoo) and [@nicoabie](https://github.com/nicoabie) in ##prolog on [freenode](https://freenode.net/).

The library started as a very simple and lightweight set of predicates for a common, but very limited, form of lexing. As we extend it, we aim to maintain a modest scope in order to hit a sweet spot between ease of use and powerful flexibility.

## Scope and Aims

`tokenize` does not aspire to become an industrial-strength lexer generator. We aim to serve most users' needs for a step between raw input and a structured form ready for parsing by a DCG.

If a user is parsing a language with keywords such as `class`, `module`, etc., and wants to distinguish these from variable names, `tokenize` isn't going to give them this out of the box. But it should provide an easy means of achieving this result through a subsequent lexing pass (sketched at the end of these notes).

## Some Model Users

* somebody making a computer language
  * needs to be able to distinguish keywords, variables, and literals
  * needs to be able to identify comments
* somebody making a parser for an interactive fiction game
  * needs to handle stuff like "William O. N'mutu-O'Connell went to the market"
* somebody wanting to analyze human texts
  * e.g., to do some analysis on New York Times articles, they first want to process the articles into meaningful tokens

## Design Rules

* We don't parse.
* Every token generated is callable (i.e., an atom or compound).
  * Example of a possible compound token: `space(' ')`.
  * Example of a possible atom token: `escape`.
* Tokens of the same kind are uniform (i.e., every tokenization needs to return tokens represented with the same arity).
* Users should be able to determine the kind of token by unification (illustrated in the second sketch below).
* Users should be able to clearly see and specify the precedence for tokenization.
  * E.g., given `"-12.3"`, the precedence `numbers, punctuation` should yield `[number(-12.3)]`, while `punctuation, numbers` should yield `[pnct('-'), number(12), pnct('.'), number(3)]` (see the final sketch below).
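
## Sketches

The keyword scenario from "Scope and Aims" can be handled by a short second pass over the token list. The sketch below is only illustrative: it assumes a first pass that emits `word/1` tokens, and `keyword/1` is a hypothetical table of the target language's reserved words, not part of `tokenize`.

```prolog
% Hypothetical keyword table for the language being lexed.
keyword(class).
keyword(module).

% classify(+Token, -Token1): relabel a word token as a keyword
% token when its atom is a reserved word; pass others through.
classify(word(W), keyword(W)) :- keyword(W), !.
classify(Token, Token).

%! relabel(+TokensIn, -TokensOut) is det.
%  The subsequent lexing pass: map classify/2 over the token list.
relabel(TokensIn, TokensOut) :-
    maplist(classify, TokensIn, TokensOut).
```

For example, `relabel([word(class), word(x), space(' ')], Ts)` yields `Ts = [keyword(class), word(x), space(' ')]`.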
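The by-unification rule means a user can pick out tokens of a kind by matching alone, with no type-testing of token structure. A tiny illustration, assuming token shapes like those used elsewhere in these notes:

```prolog
% Collect every number token by unifying against number(_);
% the kind of each token is determined purely by unification.
numbers_in(Tokens, Numbers) :-
    findall(N, member(number(N), Tokens), Numbers).
```

So `numbers_in([word(x), number(12), pnct('.')], Ns)` gives `Ns = [12]`.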
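Finally, a hedged sketch of what order-as-precedence could mean operationally. This is not `tokenize`'s actual option mechanism; it is a toy DCG over character codes in which the ambiguous characters `-` and `.` are claimed by whichever kind is given precedence.

```prolog
:- use_module(library(dcg/basics)).

% toks(+First, -Tokens)//: lex with First (numbers or punctuation)
% taking precedence for the ambiguous characters '-' and '.'.
% Illustrative only; not the library's actual API.
toks(_, []) --> [].
toks(numbers, [T|Ts]) -->            % numbers may absorb '-' and '.'
    ( number(N) -> { T = number(N) } ; pnct(T) ),
    toks(numbers, Ts).
toks(punctuation, [T|Ts]) -->        % '-' and '.' always lex as punctuation
    (   pnct(T)
    ->  []
    ;   digit(D), digits(Ds),
        { number_codes(N, [D|Ds]), T = number(N) }
    ),
    toks(punctuation, Ts).

% A punctuation character as a pnct/1 token.
pnct(pnct(P)) --> [C], { code_type(C, punct), char_code(P, C) }.
```

With this sketch, `phrase(toks(numbers, Ts), `-12.3`)` gives `Ts = [number(-12.3)]`, while `phrase(toks(punctuation, Ts), `-12.3`)` gives `Ts = [pnct(-), number(12), pnct('.'), number(3)]`, matching the precedence example above.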