This is wisitoken-user_guide.info, produced by makeinfo version 6.7 from
wisitoken-user_guide.texinfo.

Copyright (C) 2014-2015, 2017, 2018, 2020 Stephen Leake.

     Permission is granted to copy, distribute and/or modify this
     document under the terms of the GNU Free Documentation License,
     Version 1.3 or any later version published by the Free Software
     Foundation; with no Invariant Sections, no Front-Cover Texts and no
     Back-Cover Texts.  A copy of the license is included in the section
     entitled "GNU Free Documentation License".
INFO-DIR-SECTION Parser generators
START-INFO-DIR-ENTRY
* wisitoken-bnf-generate: (wisitoken-bnf-generate).         Ada and Elisp parser generator
END-INFO-DIR-ENTRY


File: wisitoken-user_guide.info,  Node: Top,  Next: Overview,  Up: (dir)

WisiToken User Guide
********************

Copyright (C) 2014-2015, 2017, 2018, 2020 Stephen Leake.

     Permission is granted to copy, distribute and/or modify this
     document under the terms of the GNU Free Documentation License,
     Version 1.3 or any later version published by the Free Software
     Foundation; with no Invariant Sections, no Front-Cover Texts and no
     Back-Cover Texts.  A copy of the license is included in the section
     entitled "GNU Free Documentation License".

* Menu:

* Overview::
* Common grammar problems::
* Grammar File Syntax::


File: wisitoken-user_guide.info,  Node: Overview,  Next: Common grammar problems,  Prev: Top,  Up: Top

1 Overview
**********

WisiToken is a parser and parser generator toolkit, supporting
generalized LR (both LALR and LR1) and packrat parsers; the LR parser
provides robust error recovery.  The grammar can be expressed as either
Ada source code statements, or in an EBNF file.  The parser generator
generates Ada, either plain or assuming the Emacs wisi package.

   At one point, "wisi" was short for "Wisent Indentation engine"; the
Emacs 'wisi' package implements an indentation engine that used to be
based on the Emacs wisent parser.  However, that parser has now been
replaced by the WisiToken parser, so "wisi" is just a name.

* Menu:

* Install::


File: wisitoken-user_guide.info,  Node: Install,  Up: Overview

1.1 Install
===========

WisiToken is available as source code only, although a subset is
available in the GNU ELPA package 'wisi'.

   You will also need to install a lexer generator.  WisiToken supports
re2c, and other lexers can be added.

   re2c is available from <http://re2c.org/>; it is also packaged in
Mingw64 and Debian.  WisiToken requires at least version 1.3.  WisiToken
assumes the executable 're2c' is in '$PATH'.


File: wisitoken-user_guide.info,  Node: Common grammar problems,  Next: Grammar File Syntax,  Prev: Overview,  Up: Top

2 Common grammar problems
*************************

LALR grammars are tricky.  Here we describe some common problems people
run into.

* Menu:

* Empty choice in list::


File: wisitoken-user_guide.info,  Node: Empty choice in list,  Up: Common grammar problems

2.1 Empty choice in list
========================

Many programming languages have lists in the grammar.  For example, Ada
has lists of declarations:

     package_body
       : PACKAGE name IS declaration_list BEGIN statement_list END SEMICOLON
       ;

     declaration_list
       : declaration
       | declaration_list declaration
       ;

     declaration
       : object_declaration
       | subprogram_declaration
       ;; ...
       ;

   Note that the above grammar fragment does not allow an empty
declaration_list.  But Ada does, so the question is how can we add that
to the grammar.

   There are four choices:

  1. Add an empty declaration choice to declaration_list:

          declaration_list
            : ;; empty list
            | declaration
            | declaration_list declaration
            ;
     This is now redundant; since declaration_list can be empty, the
     second choice is not needed:
          declaration_list
            : ;; empty list
            | declaration_list declaration
            ;

  2. Add an empty declaration choice to declaration:

          declaration
            : ;; empty declaration
            | object_declaration
            | subprogram_declaration
            ;; ...
            ;

  3. Add another rule with the empty production:

          package_body
            : PACKAGE name IS declarative_part BEGIN statement_list END SEMICOLON
            ;

          declarative_part
            : ;; empty
            | declaration_list
            ;

          declaration_list
            : declaration
            | declaration_list declaration
            ;

          declaration
            : object_declaration
            | subprogram_declaration
            ;; ...
            ;

  4. Add another choice in package_body that leaves out
     declaration_list:
          package_body
            : PACKAGE name IS declaration_list BEGIN statement_list END SEMICOLON
            | PACKAGE name IS BEGIN statement_list END SEMICOLON
            ;

   Choice 1 is redundant, giving parse errors at parse time.  Consider
the following statements, where "<empty>" is used to indicate an empty
declaration:

   1) package One is <empty> begin end ; 2) package One is package One
is <empty> begin end ; begin end ; 3) package One is <empty> package One
is <empty declaration> begin end ; begin end ;

   In parsing 3), the second 'package' causes a shift/reduce conflict;
shift to start the nested declaration (as in 2), reduce to the empty
declaration.  Both are correct according to the grammar.

   Choice 2 leads to a shift/reduce conflict in the production for
package_body; implementing the wisi parser as a generalized LALR parser
allows it to handle this option.

   Choice 2 is the preferred choice for Ada, since it involves the least
modifications to the original Ada grammar in the Ada reference manual.


File: wisitoken-user_guide.info,  Node: Grammar File Syntax,  Prev: Common grammar problems,  Up: Top

3 Grammar File Syntax
*********************

The grammar file syntax is based on Gnu bison syntax with some additions
and deletions (*note Bison: (bison)Top.).

   (The grammar is specified in the WisiToken grammar file
'wisitoken_grammar.wy').

   The top level file structure is a list of declarations and
nonterminals.

   Comments are started by ';;' and terminated by end of line.

* Menu:

* Declarations::
* Nonterminals::
* Conditional code::


File: wisitoken-user_guide.info,  Node: Declarations,  Next: Nonterminals,  Up: Grammar File Syntax

3.1 Declarations
================

Declarations declare terminal tokens, conflicts, and other parser
parameters.

* Menu:

* Raw Code::
* Keywords::
* Tokens::
* Error recovery::
* Other declarations::


File: wisitoken-user_guide.info,  Node: Raw Code,  Next: Keywords,  Up: Declarations

3.1.1 Raw code
--------------

%code { actions | copyright_license } [spec | body | context | pre | post]... %{ <output language code> }%

   Raw code declarations contain arbitrary code, copied verbatim into
the output.  The keywords following '%code' determine where the section
is output.


File: wisitoken-user_guide.info,  Node: Keywords,  Next: Tokens,  Prev: Raw Code,  Up: Declarations

3.1.2 Keywords
--------------

%keyword <name> <string>

   example:
%keyword SEMICOLON ";"

   "Keywords" are reserved words or symbols in the target language; the
lexers recognize them by the given string.


File: wisitoken-user_guide.info,  Node: Tokens,  Next: Error recovery,  Prev: Keywords,  Up: Declarations

3.1.3 Tokens
------------

%token < kind > name regexp

   example:
%token <symbol> IDENTIFIER %[ ... ]%
%token <punctuation> TICK "'"

   The syntax of the regular expression is determined by the lexer
generator.  The meaning of 'kind' is determined by the lexer ('re2c'
ignores this), with the following defined by the WisiToken generator.
Other token kinds have no effect; they may be used for documentation.

'<string-double>'
     %token <string-double> STRING_LITERAL %[ ... ]%
     A string of characters that have string syntax, with double quote
     delimiters.

'<string-single>'
     %token <string-single> CHARACTER_LITERAL %[ ... ]%
     A string of characters that have string syntax, with single quote
     delimiters.

'<new-line>'
     %token <new-line> [\n] %[ ... ]%
     Not used by the wisi lexer; required by the Ada lexer.  The third
     argument is the regular expression to recognize the entire comment.

'<non-reporting>'
     %token <non-reporting> WHITESPACE %[ [ \t] ]%
     A token that is recognized by the lexer, but not returned to the
     parser.

'<delimited-text>'
     %token <delimited-text> RAW_CODE "%{" "}%"
     A token that contains arbitrary text, delimited by the two strings.


File: wisitoken-user_guide.info,  Node: Error recovery,  Next: Other declarations,  Prev: Tokens,  Up: Declarations

3.1.4 Error recovery
--------------------

The parser uses an error recovery algorithm when it encounters a syntax
error; if a solution is found, the parse continues.

   Error recovery uses multiple tasks to take advantage of multiple CPU
cores.  Unfortunately, this means there is a race condition; the
solutions found can be delivered in different orders on different runs.
This matters because each solution results in a successful parse,
possibly with different actions (different indentation computed, for
example).  Which solution finally succeeds depends on which are
terminated due to identical parser stacks, which in turn depends on the
order they were delivered.

   Once the syntax errors are fixed, only ambiguities in the grammar
itself can cause a similar problem.

   Several grammar file declarations set parameters for the error
recovery.  If none of these parameters are present in the grammar file,
the generated parser does not do error recovery.

   The error recovery algorithm generates possible solutions based on
the parse state preceding the error point, by inserting, deleting, or
pushing back tokens.  Each possible solution is given a cost, and
enqueued to be checked later.  Solutions are checked in cost order
(lowest first).

'%mckenzie_check_limit <limit>'
     The number of tokens past the error point that must be parsed
     successfully for a solution to be deemed successful.  Smaller
     values give faster recovery; larger values give better solutions.
     Too large a value risks encountering another user error, making a
     solution impossible.  3 or 4 works well in practice.

'mckenzie_check_delta_limit <limit>'
     When error recovery is entered with multiple parsers active, once a
     solution has been found for one parser, the other parsers are
     allowed to check only 'mckenzie_check_delta_limit' possible
     solutions before they fail.  This prevents long recovery times.

'%mckenzie_cost_default <insert> <delete> <push back> <ignore check fail>'
     McKenzie error recovery default costs for insert, delete, push back
     single tokens, and for ignoring a semantic check failure; four
     floating point numbers.

     "Push back" means undo parsing; remove tokens from the parse stack
     and put them back into the input stream.  This moves the
     insert/delete point, allowing better solutions.  The push back
     default cost is also the undo reduce default cost.

     If not specified, costs are zero.  Costs can be negative; they all
     add linearly.

'%mckenzie_cost_delete <token> <cost>'
     McKenzie error recovery delete cost for a specific token.

'%mckenzie_cost_fast_forward <cost>'
     McKenzie error recovery cost for parsing ahead after fixing one
     error, moving to the next error location.

'%mckenzie_cost_insert <token> <cost>'
     McKenzie error recovery insert cost for a specific token.

'%mckenzie_cost_fast_forward <cost>'
     McKenzie error recovery cost for using the 'matching_begin'
     strategy.

'%mckenzie_cost_push_back <token> <cost>'
     McKenzie error recovery push back cost for a specific token.

'%mckenzie_cost_undo_reduce <token> <cost>'
     McKenzie error recovery undo reduce cost for a specific token.

'%mckenzie_enqueue_limit <integer>'
     McKenzie error recovery limit on possible solutions enqueued (to be
     checked); default max integer.

'%mckenzie_minimal_complete_cost_delta <cost>'
     McKenzie error recovery cost added to the cost of an inserted token
     when the insert is done by the minimal complete strategy; this cost
     is normally negative.


File: wisitoken-user_guide.info,  Node: Other declarations,  Prev: Error recovery,  Up: Declarations

3.1.5 Other declarations
------------------------

'%case_insensitive'
     If present, keywords are case insensitive in the lexer.

'%conflict <conflict description>'
     Declare a known conflict.

     Example conflict declaration:
     %conflict REDUCE/REDUCE in state abstract_limited_opt, abstract_limited_synchronized_opt on token NEW

     The conflict description is output by 'wisitoken-bnf-generate' when
     an undeclared conflict is detected.  If the user decides to not fix
     the conflict, the description can be copied into the grammar source
     file, so it will be ignored next time around.

     Resolving conflicts in the grammar can be difficult, but leaving
     them in can increase parse time and cause ambiguous parses.

'%elisp_face <name>'
     Declare a name for an elisp face constant.

     When generating Ada code for Emacs, the elisp faces applied by
     'wisi-face-apply' actions must be declared, so the elisp and Ada
     code aggree on what they mean.

'%elisp_indent <elisp name> <Ada name>'
     Declare elisp and Ada names for an indent variable.

     When generating Ada code for Emacs, the elisp indent variables used
     in 'wisi-indent' actions must be declared, so the elisp and Ada
     code aggree on what they mean.

'%elisp_action <elisp name> <Ada name>'
     Declare elisp and Ada names for a custom action subprogram written
     in Ada.

     The term "elisp" here is historical; the name is not actually used
     by elisp in the current implementation.

'end_names_optional_option <name>'
     When generating Ada code for Emacs, the name of the Ada variable
     determining whether end block names are optional.

     In the Ada language, block names can be repeated at the end; for
     example:

     Get_Inputs :
     loop
     ...
     end loop Get_Inputs;

     These names are optional in the Ada standard.  Making them required
     improves error recovery; the recovery algorithm can use matching
     names to isolate the error.

'generate <generate_algorithm> <output_language> [text_rep]'

     '<generate_algorithm>' is one of 'LALR | LR1 | Packrat_Gen |
     Packrat_Proc | External'

     '<output_language>' is one of 'Ada | Ada_Emacs'

     The algorithm/output_language pair declares one output source set.
     Multiple sets can be declared; they are all generated together.

     'text_rep' determines how the parse table is represented; if
     present, it is in a text file that is loaded at parser run time.
     If absent, it is in the code.  For very large parse tables, such as
     for an LR1 parser for a large language like Ada, the text
     representation may be needed, because the Ada compiler can't handle
     the very large number of statements that represent the parser table
     in the code.  The text file can take a long time to read at parser
     startup (a few seconds for the Ada language).

'%language_runtime'
     Specify an alternate name for the language runtime package; the
     default is 'Wisi.<language_name>'.

'%meta_syntax [BNF | EBNF]'
     Declares the syntax used by the grammar file.  BNF is a minor
     extension of standard Backus Naur Form; EBNF is a large extension.
     The default is BNF.

'%no_enum'
     By default, the generated Ada code includes an enumeration type
     declaring each token.  This makes the language-specific runtime
     easier to write (without this type, tokens are identified by
     integers).

     '%no_enum' causes the generated code to not include the token
     enumeration type.

'%no_language_runtime'
     When generating Ada code for Emacs, '%no_language_runtime' causes
     the generated code to not include the runtime.  Some grammars may
     need no runtime, particularly if they are small grammars intendend
     to test some generator feature.

'%partial_recursion'
     The error recovery algorithm requires computing the recursion
     present in the language grammar.  For some grammars (such as Java),
     this is too hard; '%partial_recursion' tells WisiToken to use a
     simpler approximate calculation.  This will affect the quality of
     the error recovery, but it will still be robust.

'%start'
     The start token for the grammar.

're2c_regexp <name> <value>'
     Declare a named regular expression with re2c name and syntax.  The
     name may then occur in another re2c regular expression.


File: wisitoken-user_guide.info,  Node: Nonterminals,  Next: Conditional code,  Prev: Declarations,  Up: Grammar File Syntax

3.2 Nonterminals
================

A nonterminal statement declares a nonterminal token, and the associated
production rules and actions.

   The syntax of a nonterminal statement is:

nonterminal : rhs {| rhs} ;
   A nonterminal is defined by a list of alternate right hand sides.

rhs : {rhs_item} [action [action]] ;
   Each right hand side is a list of items, followed by zero to two
actions; the first is the post-parse action, the second the in-parse
action.

   In-parse actions are exeuted during the parse, when the production is
reduced; they can add semantic checks that help during error recovery.

   Post-parse actions are executed after the parse is complete, when a
node produced by this production is visited during the tree traversal;
they typically build an abstract syntax tree.

   The actions are written in output-language code; for 'Ada_Emacs'
output, this is elisp (a hold-over from when WisiToken only output elisp
code).

   If using BNF:
rhs_item : token ;
   Where 'token' is defined by a token declaration.

   if using EBNF:
rhs_item
  : token
  | < identifier = identifier >
  | rhs_optional_item
  | rhs_multiple_item
  | '(' rhs {| rhs} ')'
  ;
   Here 'token' is either defined by a token declaration, or the token
value contained in single quotes.

   The second option is an attribute, as defined by ANTLR; these are
ignored in wisitoken.

   Parentheses are used to group items.

rhs_optional_item
  : '[' rhs {| rhs} ']'
  | '(' rhs {| rhs} ')' '?'
  | token '?'
  ;
   These options all mean the same thing; the content is present zero or
one times.

rhs_multiple_item
  : '{' rhs {| rhs} '}'
  | '{' rhs {| rhs} '}-'
  | '(' rhs {| rhs} ')+'
  | '(' rhs {| rhs} ')*'
  | token '+'
  | token '*'
  ;
   "{}", "()*", and "token*" mean the content is present zero or more
times.  "{}-", "()+", and "token+" mean the content is present one or
more times.


File: wisitoken-user_guide.info,  Node: Conditional code,  Prev: Nonterminals,  Up: Grammar File Syntax

3.3 Conditional code
====================

It is sometimes necessary to include or exclude some declarations and
portions of rules based on the choice of lexer or parser.

   Therefore WisiToken supports '%if ... %end if' in the grammar file:
%if {lexer | parser} = {<lexer> | <generate_algorithm>}
...
%end if

   The lines between '%if' and '%end if' are ignored if the current
lexer or parser is not the one specified in the '%if' condition.

   '%if ... %end if' cannot be nested.



Tag Table:
Node: Top727
Node: Overview1378
Node: Install2140
Node: Common grammar problems2637
Node: Empty choice in list2930
Node: Grammar File Syntax5912
Node: Declarations6469
Node: Raw Code6775
Node: Keywords7156
Node: Tokens7468
Node: Error recovery8803
Node: Other declarations12525
Node: Nonterminals17017
Node: Conditional code19038

End Tag Table


Local Variables:
coding: utf-8
End: