This is wisitoken-user_guide.info, produced by makeinfo version 6.7 from wisitoken-user_guide.texinfo. Copyright (C) 2014-2015, 2017, 2018, 2020 Stephen Leake. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License". INFO-DIR-SECTION Parser generators START-INFO-DIR-ENTRY * wisitoken-bnf-generate: (wisitoken-bnf-generate). Ada and Elisp parser generator END-INFO-DIR-ENTRY  File: wisitoken-user_guide.info, Node: Top, Next: Overview, Up: (dir) WisiToken User Guide ******************** Copyright (C) 2014-2015, 2017, 2018, 2020 Stephen Leake. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License". * Menu: * Overview:: * Common grammar problems:: * Grammar File Syntax::  File: wisitoken-user_guide.info, Node: Overview, Next: Common grammar problems, Prev: Top, Up: Top 1 Overview ********** WisiToken is a parser and parser generator toolkit, supporting generalized LR (both LALR and LR1) and packrat parsers; the LR parser provides robust error recovery. The grammar can be expressed as either Ada source code statements, or in an EBNF file. The parser generator generates Ada, either plain or assuming the Emacs wisi package. At one point, "wisi" was short for "Wisent Indentation engine"; the Emacs 'wisi' package implements an indentation engine that used to be based on the Emacs wisent parser. However, that parser has now been replaced by the WisiToken parser, so "wisi" is just a name. * Menu: * Install::  File: wisitoken-user_guide.info, Node: Install, Up: Overview 1.1 Install =========== WisiToken is available as source code only, although a subset is available in the GNU ELPA package 'wisi'. You will also need to install a lexer generator. WisiToken supports re2c, and other lexers can be added. re2c is available from ; it is also packaged in Mingw64 and Debian. WisiToken requires at least version 1.3. WisiToken assumes the executable 're2c' is in '$PATH'.  File: wisitoken-user_guide.info, Node: Common grammar problems, Next: Grammar File Syntax, Prev: Overview, Up: Top 2 Common grammar problems ************************* LALR grammars are tricky. Here we describe some common problems people run into. * Menu: * Empty choice in list::  File: wisitoken-user_guide.info, Node: Empty choice in list, Up: Common grammar problems 2.1 Empty choice in list ======================== Many programming languages have lists in the grammar. For example, Ada has lists of declarations: package_body : PACKAGE name IS declaration_list BEGIN statement_list END SEMICOLON ; declaration_list : declaration | declaration_list declaration ; declaration : object_declaration | subprogram_declaration ;; ... ; Note that the above grammar fragment does not allow an empty declaration_list. But Ada does, so the question is how can we add that to the grammar. There are four choices: 1. Add an empty declaration choice to declaration_list: declaration_list : ;; empty list | declaration | declaration_list declaration ; This is now redundant; since declaration_list can be empty, the second choice is not needed: declaration_list : ;; empty list | declaration_list declaration ; 2. Add an empty declaration choice to declaration: declaration : ;; empty declaration | object_declaration | subprogram_declaration ;; ... ; 3. Add another rule with the empty production: package_body : PACKAGE name IS declarative_part BEGIN statement_list END SEMICOLON ; declarative_part : ;; empty | declaration_list ; declaration_list : declaration | declaration_list declaration ; declaration : object_declaration | subprogram_declaration ;; ... ; 4. Add another choice in package_body that leaves out declaration_list: package_body : PACKAGE name IS declaration_list BEGIN statement_list END SEMICOLON | PACKAGE name IS BEGIN statement_list END SEMICOLON ; Choice 1 is redundant, giving parse errors at parse time. Consider the following statements, where "" is used to indicate an empty declaration: 1) package One is begin end ; 2) package One is package One is begin end ; begin end ; 3) package One is package One is begin end ; begin end ; In parsing 3), the second 'package' causes a shift/reduce conflict; shift to start the nested declaration (as in 2), reduce to the empty declaration. Both are correct according to the grammar. Choice 2 leads to a shift/reduce conflict in the production for package_body; implementing the wisi parser as a generalized LALR parser allows it to handle this option. Choice 2 is the preferred choice for Ada, since it involves the least modifications to the original Ada grammar in the Ada reference manual.  File: wisitoken-user_guide.info, Node: Grammar File Syntax, Prev: Common grammar problems, Up: Top 3 Grammar File Syntax ********************* The grammar file syntax is based on Gnu bison syntax with some additions and deletions (*note Bison: (bison)Top.). (The grammar is specified in the WisiToken grammar file 'wisitoken_grammar.wy'). The top level file structure is a list of declarations and nonterminals. Comments are started by ';;' and terminated by end of line. * Menu: * Declarations:: * Nonterminals:: * Conditional code::  File: wisitoken-user_guide.info, Node: Declarations, Next: Nonterminals, Up: Grammar File Syntax 3.1 Declarations ================ Declarations declare terminal tokens, conflicts, and other parser parameters. * Menu: * Raw Code:: * Keywords:: * Tokens:: * Error recovery:: * Other declarations::  File: wisitoken-user_guide.info, Node: Raw Code, Next: Keywords, Up: Declarations 3.1.1 Raw code -------------- %code { actions | copyright_license } [spec | body | context | pre | post]... %{ }% Raw code declarations contain arbitrary code, copied verbatim into the output. The keywords following '%code' determine where the section is output.  File: wisitoken-user_guide.info, Node: Keywords, Next: Tokens, Prev: Raw Code, Up: Declarations 3.1.2 Keywords -------------- %keyword example: %keyword SEMICOLON ";" "Keywords" are reserved words or symbols in the target language; the lexers recognize them by the given string.  File: wisitoken-user_guide.info, Node: Tokens, Next: Error recovery, Prev: Keywords, Up: Declarations 3.1.3 Tokens ------------ %token < kind > name regexp example: %token IDENTIFIER %[ ... ]% %token TICK "'" The syntax of the regular expression is determined by the lexer generator. The meaning of 'kind' is determined by the lexer ('re2c' ignores this), with the following defined by the WisiToken generator. Other token kinds have no effect; they may be used for documentation. '' %token STRING_LITERAL %[ ... ]% A string of characters that have string syntax, with double quote delimiters. '' %token CHARACTER_LITERAL %[ ... ]% A string of characters that have string syntax, with single quote delimiters. '' %token [\n] %[ ... ]% Not used by the wisi lexer; required by the Ada lexer. The third argument is the regular expression to recognize the entire comment. '' %token WHITESPACE %[ [ \t] ]% A token that is recognized by the lexer, but not returned to the parser. '' %token RAW_CODE "%{" "}%" A token that contains arbitrary text, delimited by the two strings.  File: wisitoken-user_guide.info, Node: Error recovery, Next: Other declarations, Prev: Tokens, Up: Declarations 3.1.4 Error recovery -------------------- The parser uses an error recovery algorithm when it encounters a syntax error; if a solution is found, the parse continues. Error recovery uses multiple tasks to take advantage of multiple CPU cores. Unfortunately, this means there is a race condition; the solutions found can be delivered in different orders on different runs. This matters because each solution results in a successful parse, possibly with different actions (different indentation computed, for example). Which solution finally succeeds depends on which are terminated due to identical parser stacks, which in turn depends on the order they were delivered. Once the syntax errors are fixed, only ambiguities in the grammar itself can cause a similar problem. Several grammar file declarations set parameters for the error recovery. If none of these parameters are present in the grammar file, the generated parser does not do error recovery. The error recovery algorithm generates possible solutions based on the parse state preceding the error point, by inserting, deleting, or pushing back tokens. Each possible solution is given a cost, and enqueued to be checked later. Solutions are checked in cost order (lowest first). '%mckenzie_check_limit ' The number of tokens past the error point that must be parsed successfully for a solution to be deemed successful. Smaller values give faster recovery; larger values give better solutions. Too large a value risks encountering another user error, making a solution impossible. 3 or 4 works well in practice. 'mckenzie_check_delta_limit ' When error recovery is entered with multiple parsers active, once a solution has been found for one parser, the other parsers are allowed to check only 'mckenzie_check_delta_limit' possible solutions before they fail. This prevents long recovery times. '%mckenzie_cost_default ' McKenzie error recovery default costs for insert, delete, push back single tokens, and for ignoring a semantic check failure; four floating point numbers. "Push back" means undo parsing; remove tokens from the parse stack and put them back into the input stream. This moves the insert/delete point, allowing better solutions. The push back default cost is also the undo reduce default cost. If not specified, costs are zero. Costs can be negative; they all add linearly. '%mckenzie_cost_delete ' McKenzie error recovery delete cost for a specific token. '%mckenzie_cost_fast_forward ' McKenzie error recovery cost for parsing ahead after fixing one error, moving to the next error location. '%mckenzie_cost_insert ' McKenzie error recovery insert cost for a specific token. '%mckenzie_cost_fast_forward ' McKenzie error recovery cost for using the 'matching_begin' strategy. '%mckenzie_cost_push_back ' McKenzie error recovery push back cost for a specific token. '%mckenzie_cost_undo_reduce ' McKenzie error recovery undo reduce cost for a specific token. '%mckenzie_enqueue_limit ' McKenzie error recovery limit on possible solutions enqueued (to be checked); default max integer. '%mckenzie_minimal_complete_cost_delta ' McKenzie error recovery cost added to the cost of an inserted token when the insert is done by the minimal complete strategy; this cost is normally negative.  File: wisitoken-user_guide.info, Node: Other declarations, Prev: Error recovery, Up: Declarations 3.1.5 Other declarations ------------------------ '%case_insensitive' If present, keywords are case insensitive in the lexer. '%conflict ' Declare a known conflict. Example conflict declaration: %conflict REDUCE/REDUCE in state abstract_limited_opt, abstract_limited_synchronized_opt on token NEW The conflict description is output by 'wisitoken-bnf-generate' when an undeclared conflict is detected. If the user decides to not fix the conflict, the description can be copied into the grammar source file, so it will be ignored next time around. Resolving conflicts in the grammar can be difficult, but leaving them in can increase parse time and cause ambiguous parses. '%elisp_face ' Declare a name for an elisp face constant. When generating Ada code for Emacs, the elisp faces applied by 'wisi-face-apply' actions must be declared, so the elisp and Ada code aggree on what they mean. '%elisp_indent ' Declare elisp and Ada names for an indent variable. When generating Ada code for Emacs, the elisp indent variables used in 'wisi-indent' actions must be declared, so the elisp and Ada code aggree on what they mean. '%elisp_action ' Declare elisp and Ada names for a custom action subprogram written in Ada. The term "elisp" here is historical; the name is not actually used by elisp in the current implementation. 'end_names_optional_option ' When generating Ada code for Emacs, the name of the Ada variable determining whether end block names are optional. In the Ada language, block names can be repeated at the end; for example: Get_Inputs : loop ... end loop Get_Inputs; These names are optional in the Ada standard. Making them required improves error recovery; the recovery algorithm can use matching names to isolate the error. 'generate [text_rep]' '' is one of 'LALR | LR1 | Packrat_Gen | Packrat_Proc | External' '' is one of 'Ada | Ada_Emacs' The algorithm/output_language pair declares one output source set. Multiple sets can be declared; they are all generated together. 'text_rep' determines how the parse table is represented; if present, it is in a text file that is loaded at parser run time. If absent, it is in the code. For very large parse tables, such as for an LR1 parser for a large language like Ada, the text representation may be needed, because the Ada compiler can't handle the very large number of statements that represent the parser table in the code. The text file can take a long time to read at parser startup (a few seconds for the Ada language). '%language_runtime' Specify an alternate name for the language runtime package; the default is 'Wisi.'. '%meta_syntax [BNF | EBNF]' Declares the syntax used by the grammar file. BNF is a minor extension of standard Backus Naur Form; EBNF is a large extension. The default is BNF. '%no_enum' By default, the generated Ada code includes an enumeration type declaring each token. This makes the language-specific runtime easier to write (without this type, tokens are identified by integers). '%no_enum' causes the generated code to not include the token enumeration type. '%no_language_runtime' When generating Ada code for Emacs, '%no_language_runtime' causes the generated code to not include the runtime. Some grammars may need no runtime, particularly if they are small grammars intendend to test some generator feature. '%partial_recursion' The error recovery algorithm requires computing the recursion present in the language grammar. For some grammars (such as Java), this is too hard; '%partial_recursion' tells WisiToken to use a simpler approximate calculation. This will affect the quality of the error recovery, but it will still be robust. '%start' The start token for the grammar. 're2c_regexp ' Declare a named regular expression with re2c name and syntax. The name may then occur in another re2c regular expression.  File: wisitoken-user_guide.info, Node: Nonterminals, Next: Conditional code, Prev: Declarations, Up: Grammar File Syntax 3.2 Nonterminals ================ A nonterminal statement declares a nonterminal token, and the associated production rules and actions. The syntax of a nonterminal statement is: nonterminal : rhs {| rhs} ; A nonterminal is defined by a list of alternate right hand sides. rhs : {rhs_item} [action [action]] ; Each right hand side is a list of items, followed by zero to two actions; the first is the post-parse action, the second the in-parse action. In-parse actions are exeuted during the parse, when the production is reduced; they can add semantic checks that help during error recovery. Post-parse actions are executed after the parse is complete, when a node produced by this production is visited during the tree traversal; they typically build an abstract syntax tree. The actions are written in output-language code; for 'Ada_Emacs' output, this is elisp (a hold-over from when WisiToken only output elisp code). If using BNF: rhs_item : token ; Where 'token' is defined by a token declaration. if using EBNF: rhs_item : token | < identifier = identifier > | rhs_optional_item | rhs_multiple_item | '(' rhs {| rhs} ')' ; Here 'token' is either defined by a token declaration, or the token value contained in single quotes. The second option is an attribute, as defined by ANTLR; these are ignored in wisitoken. Parentheses are used to group items. rhs_optional_item : '[' rhs {| rhs} ']' | '(' rhs {| rhs} ')' '?' | token '?' ; These options all mean the same thing; the content is present zero or one times. rhs_multiple_item : '{' rhs {| rhs} '}' | '{' rhs {| rhs} '}-' | '(' rhs {| rhs} ')+' | '(' rhs {| rhs} ')*' | token '+' | token '*' ; "{}", "()*", and "token*" mean the content is present zero or more times. "{}-", "()+", and "token+" mean the content is present one or more times.  File: wisitoken-user_guide.info, Node: Conditional code, Prev: Nonterminals, Up: Grammar File Syntax 3.3 Conditional code ==================== It is sometimes necessary to include or exclude some declarations and portions of rules based on the choice of lexer or parser. Therefore WisiToken supports '%if ... %end if' in the grammar file: %if {lexer | parser} = { | } ... %end if The lines between '%if' and '%end if' are ignored if the current lexer or parser is not the one specified in the '%if' condition. '%if ... %end if' cannot be nested.  Tag Table: Node: Top727 Node: Overview1378 Node: Install2140 Node: Common grammar problems2637 Node: Empty choice in list2930 Node: Grammar File Syntax5912 Node: Declarations6469 Node: Raw Code6775 Node: Keywords7156 Node: Tokens7468 Node: Error recovery8803 Node: Other declarations12525 Node: Nonterminals17017 Node: Conditional code19038  End Tag Table  Local Variables: coding: utf-8 End: