% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %
%     Extracts from the book "Natural Language Processing in Prolog"    %
%                      published by Addison Wesley                      %
%        Copyright (c) 1989, Gerald Gazdar & Christopher Mellish.       %
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % %

Well-formed substring tables & charts

In implementing a chart-parser in Prolog, it is tempting to use the                     
Prolog database itself as the repository for the chart, and store
edges by asserting them in a format such as the following:

    edge(START,FINISH,LABEL,TOFIND,FOUND)

We shall succumb to this temptation in implementing our first two
chart parsers, though later in the chapter it will become apparent
that we have bought simplicity at the cost of flexibility.

Given this approach, the first thing we need is a means of
initializing the chart by asserting inactive edges for each word in
the string.  The predicate 'start_chart' is provided with a string of
words as its third argument and a position in the chart to start from
(a number) in its first argument.  Its job is to enter the appropriate
words in the chart and instantiate the second argument to the final
position in the chart.  If the string is empty, it does nothing and
returns the position in the string:

    start_chart(V0,V0,[]).

Otherwise, it takes the first word in the string, adds an inactive
edge to the chart for each occurrence of that word in the lexicon, and
then continues with the rest of the string:

    start_chart(V0,Vn,[Word|Words]) :-
        V1 is V0+1,
        foreach(word(Category,Word),
          add_edge(V0,V1,Category,[],[Word,Category])),
        start_chart(V1,Vn,Words).

The predicate 'foreach' can be found in the file utilities.pl in the
appendix, and its definition will not be discussed here.  It
backtracks through all possible ways of satisfying its first argument
as a Prolog goal, for each one invoking its second argument as a
Prolog goal.  So the effect here is to call 'add_edge' for each word-
category combination found by the 'word' predicate.  Notice that we
include the Category itself in the FOUND argument of each edge to be
added.  By having the FOUND argument be a list containing the category
and the parse trees of the "found" phrases, we automatically construct
parse trees as we go.

The predicate 'add_edge', however, does more than simply assert the
existence of the edge.  It also needs to determine what other edges
need to be added to the chart as a consequence of the addition.
First, it needs to check if the edge already exists, and, if so, do
nothing:

    add_edge(V0,V1,Category,Categories,Parse) :-
        edge(V0,V1,Category,Categories,Parse),!.

Otherwise, it asserts the edge, and then adds any further (active)
edges that are contingent on the addition of this edge to the chart.
If an inactive Category1 edge spanning V1 to V2 is being added (and
thus the TOFIND argument is an empty list), then, for each rule whose
leftmost daughter is of Category1, it adds an empty active edge at V1,
of the kind dictated by the applicable rule.  This is the bottom-up
rule:

    add_edge(V1,V2,Category1,[],Parse) :-
        asserta(edge(V1,V2,Category1,[],Parse)),
        foreach(rule(Category2,[Category1|Categories]),
          add_edge(V1,V1,Category2,[Category1|Categories],[Category2])),

Then, as the second part of this clause, for every active edge which
currently ends at V1, and which is looking for a Category1, we add a
further edge (which may turn out to be either active or inactive)
based on it, that ends at V2.  This is the fundamental rule (for the
case when an inactive edge is added):

        foreach(edge(V0,V1,Category2,[Category1|Categories],Parses),
          add_edge(V0,V2,Category2,Categories,[Parse|Parses])).

That clause deals with everything that has to be done on the occasion
of the addition of an inactive edge.  But we still need a clause to
cater for the case of active edge addition, and this turns out to be
simpler.  We just look for every inactive edge that could allow the
active edge to be extended, and add edges extending it, as
appropriate.  This is the fundamental rule (for the case when an
active edge is added).

    add_edge(V0,V1,Category1,[Category2|Categories],Parses) :-
        asserta(edge(V0,V1,Category1,[Category2|Categories],Parses)),
        foreach(edge(V1,V2,Category2,[],Parse),
          add_edge(V0,V2,Category1,Categories,[Parse|Parses])).

That essentially completes our bottom-up chart parser, but it is
convenient to wrap the top level predicate 'start_chart' into a
further predicate that (i) will save us having to specify the initial
vertex, (ii) will display the results of the parse, and (iii) will
clean up afterwards by removing all the edges created:

    test(String) :-
        V0 is 1,
        start_chart(V0,Vn,String),
        foreach(edge(V0,Vn,Symbol,[],Parse),
          write(Parse)),
        retractall(edge(_,_,_,_,_)).

Implementing a top-down chart parser in Prolog is not significantly
harder than the bottom-up case, and the code is remarkably similar to
before.  At the top-level we have a predicate 'parse' that initializes
the chart with inactive word edges, ascertains the category of the
initial symbol in use (usually 'S') and starts active empty edges at
the beginning of the chart with this category.  As before, each
addition of an edge to the chart may cause the addition of further
edges.

    parse(V0,Vn,String) :-
           start_chart(V0,Vn,String),
           initial(Symbol),
           start_active(V0,Symbol).

The definition of 'start_chart' can simply be carried over from our
earlier bottom-up parser (though the predicates it calls cannot be),
but 'start_active(Vertex,Category)' requires definition.  This is
straightforward: for every rule that expands Category, start an empty
active edge based on the rule at Vertex.

    start_active(V0,Category) :-
            foreach(rule(Category,Categories),
             add_edge(V0,V0,Category,Categories,[Category])).

Both 'start_chart' and 'start_active' call 'add_edge' and the
definition of the latter cannot simply be taken over from our bottom-
up parser, though the first clause is the same:

    add_edge(V1,V2,Category,Categories,Parse) :-
           edge(V1,V2,Category,Categories,Parse),!.

Apart from this, as before we need to distinguish the addition of
active and inactive edges.  For an inactive edge, we need only worry
about the fundamental rule:

    add_edge(V1,V2,Category1,[],Parse) :-
             asserta(edge(V1,V2,Category1,[],Parse)),
            foreach(edge(V0,V1,Category2,[Category1|Categories],Parses),
             add_edge(V0,V2,Category2,Categories,[Parse|Parses])).

For an active edge, we need to worry about both the fundamental rule
and the top-down rule:

    add_edge(V1,V2,Category1,[Category2|Categories],Parses) :-
             asserta(edge(V1,V2,Category1,[Category2|Categories],Parses)),
            foreach(edge(V2,V3,Category2,[],Parse),
             add_edge(V1,V3,Category1,Categories,[Parse|Parses])),
            start_active(V2,Category2).

In both cases, the edge is added by assertion.  If an inactive edge is
added, then the fundamental rule is applied in respect of every active
edge that can make use of it.  Conversely, if an active edge is added,
then the fundamental rule is applied in respect of every inactive edge
that it can use, and in addition an active edge is added for every
category that appear at the start of the RHS of a rule for the next
category required.  Fortunately, this last set of additions can be
achieved using our already defined predicate 'start_active'.

That completes our top-down chart parser, but again it is convenient
to wrap the top level predicate 'parse' into a further predicate that
will (i) save us having to specify the initial vertex, (ii) display
the results of the parse, and (iii) clean up afterwards by removing
all the edges created:

    test(String) :-
        V0 is 1,
        initial(Symbol),
        parse(V0,Vn,String),
        foreach(edge(V0,Vn,Symbol,[],Parse),
          write(Parse)),
        retractall(edge(_,_,_,_,_)).

Our example parsers above had two characteristics in common:
(i) they used the Prolog database to store the chart, and
(ii) they had no explicit agenda, but simply relied on Prolog
backtracking as their control strategy.  This approach makes for short
code, but sacrifices flexibility.  In the two parsers that we describe
in this section, we will maintain and manipulate both the chart and an
explicit agenda as lists of edges that get passed as arguments from
predicate to predicate.  As we shall see, it then becomes possible to
switch between depth-first-like to breadth-first-like processing by
interchanging the order of arguments in a single predicate call.

Our basic data structure remains the edge, only one shorn of the final
FOUND component used in our earlier chart parsers.

    edge(START,FINISH,LABEL,TOFIND)

This means that, for simplicity, we are talking about a recognizer,
rather than a parser.  Charts and agendas simply consist of lists of
such edges.

As before, we begin with a bottom-up parser.  We start off with an
initialization predicate 'start_agenda(String,Vertex,Agenda)' which
produces an initial Agenda of lexical edges given String starting at
Vertex.  If the string is empty, then there is an empty agenda:

    start_agenda([],V0,[]).

Otherwise, we peel off the first word, consult the lexicon and map all
the relevant inactive edges spanning this word into an agenda,
appending this to the front of the agenda obtained by looking at the
rest of the words:

    start_agenda([Word|Words],V0,Agenda) :-
        V1 is V0+1,
        findall(edge(V0,V1,Category,[]),
          word(Category,Word),
          Agenda1),
        start_agenda(Words,V1,Agenda2),
        append(Agenda1,Agenda2,Agenda).

The predicate 'findall', which is in the utilities library, is used to
collect in its third argument a list which contains for each solution
of the second argument the appropriate instantiation of the first
argument.  It is similar to the built-in predicates 'setof' and
'bagof' that are provided with many Prolog systems, except that
'findall' produces the empty list in its third argument (rather than
failing) if there are no solutions to the second argument.  In this
case, 'findall' is used to iterate over all possible ways of finding
Word as an instance of some Category.  For each such solution, an
appropriate 'edge' term is included in the answer list (Agenda1).

The predicate 'extend_edges(Agenda, InChart, OutChart)' is used to add
the edges in Agenda to an initial chart InChart, taking into account
the fundamental rule and bottom-up rule to generate yet more edges,
producing a new chart OutChart.  It is an identity function when the
agenda is empty:

    extend_edges([],Chart,Chart).

Otherwise, it takes the first edge on the agenda and sees whether it
is already in the chart.  If so, it can move down the agenda without
more ado:

    extend_edges([Edge|Agenda1],Chart1,Chart2) :-
       member(Edge,Chart1), !,
       extend_edges(Agenda1,Chart1,Chart2).

Otherwise 'extend_edges' adds the first edge to the chart, looks for
any new edges that it provokes as a result of addition, adds them to
the agenda, and then continues with this new agenda and the augmented
chart

    extend_edges([Edge|Agenda1], Chart1, Chart3) :-
        Chart2 = [Edge|Chart1],
        new_edges(Edge,Chart2,Edges),
        add_edges(Agenda1,Edges,Agenda2),
        extend_edges(Agenda2,Chart2,Chart3).

You 'add' an edge to a list of edges either by finding that it is
already there, or by just adding it:

    add_edge(Edge,Edges,Edges) :-
        member(Edge,Edges), !.
    add_edge(Edge,Edges,[Edge|Edges]).

And you combine two lists of edges by calling 'add_edge' in the
obvious recursive manner:

    add_edges([],Edges,Edges).
    add_edges([Edge|Edges],Edges1,Edges3) :-
        add_edge(Edge,Edges1,Edges2),
        add_edges(Edges,Edges2,Edges3).

So far, so good, but the essence of chart-parsing, the fundamental
rule, has yet to make its appearance.  We find it here in the
predicate 'new_edges(Edge,Chart,Edges)' which maps Edge and Chart into
a set of new edges, Edges, that arises from the addition of Edge to
Chart.  It is also this 'new_edges' predicate which is responsible for
adding the empty active edges that get everything going.  If an
inactive Category1 edge spanning V1 to V2 is being added then, for
each rule whose leftmost daughter is of Category1, create an empty
active edge at V1, of the kind dictated by the applicable rule:

    new_edges(edge(V1,V2,Category1,[]),Chart,Edges) :-
        findall(edge(V1,V1,Category2,[Category1|Categories1]),
          rule(Category2,[Category1|Categories1]),
          Edges1),

Then, for every existing active edge which currently ends at V1, and
which is looking for a Category1, add a further edge (which may turn
out to be either active or inactive) based on it, that ends at V2:

        findall(edge(V0,V2,Category3,Categories2),
          member(edge(V0,V1,Category3,[Category1|Categories2]),Chart),
          Edges2),

Finally, combine the two resulting sets of edges:

        add_edges(Edges1,Edges2,Edges).

That clause deals with everything that has to be done on the occasion
of the addition of an inactive edge.  But we still need a clause to
cater for the case of active edge addition, and, as with our earlier
bottom-up chart parser example, this turns out to be simpler.  We just
look for every inactive edge that could allow the active edge to be
extended, and add edges extending it, as appropriate:

    new_edges(edge(V1,V2,Category1,[Category2|Categories]),Chart,Edges) :-
        findall(edge(V1,V3,Category1,Categories),
          member(edge(V2,V3,Category2,[]),Chart),
          Edges).

That really completes our new bottom-up chart parser, but it is
convenient to wrap the top level predicates 'start_agenda' and
'extend_edges' into a further predicate that will save us having to
specify the initial vertex, and will check to see that the chart does
indeed contain an edge that spans the string with an instance of the
initial symbol:

    test(String) :-                     
        start_agenda(String,0,Agenda),
        extend_edges(Agenda,[],Chart),
        initial(Symbol),
        member(edge(0,M,Symbol,[]),Chart),
        N is M + 1,
        not(member(edge(_,N,_,_),Chart)).

The parser we have been considering operates depth-first and, to the
casual reader, this might appear to simply follow from our decision to
implement it in Prolog.  But this is not the case.  In fact, it
follows from the order in which things get added to the agenda, and
this is within our control.  At present, the new edges that result
from adding an edge get put on the front of the agenda, which means
that the list is functioning as a stack.  We can redefine
'extend_edges' by commenting out the existing agenda contruction call
of 'add_edges' and replace it with a call in which the order of the
first two arguments (the existing agenda and the new items) is
reversed.  This simple move gives us a list that behaves like a queue
and hence a parser that explores the search space in an approximately
breadth-first manner.

    extend_edges([Edge|Agenda1], Chart1, Chart3) :-
        Chart2 = [Edge|Chart1],
        new_edges(Edge,Chart2,Edges),
    %   add_edges(Edges,Agenda1,Agenda2),      % depth-first processing
        add_edges(Agenda1,Edges,Agenda2),      % breadth-first processing
        extend_edges(Agenda2,Chart2,Chart3).

Switching from bottom-up to top-down requires rather more work,
however, although some of the code we have developed so far can be
carried over unchanged, namely 'start_agenda', 'add_edge', 'add_edges'
and both variants of 'extend_edges'.  The predicate 'start_agenda'
creates an initialized agenda of lexical edges, which, given that
there is no bottom-up rule, we can simply treat as the initial chart
in top-down parsing.  We also need a way to get active edges started,
however, and we do this, as in our earlier top-down parser, with a
predicate called 'start_active(Category, Vertex, Edges)'.  This is
straightforward: for every rule that expands Category, create an empty
active edge based on the rule at Vertex and put it in Edges:

    start_active(Category,Vertex,Edges) :-
        findall(edge(Vertex,Vertex,Category,Categories),
          rule(Category,Categories),
          Edges).

Our new top-down parser still has to incorporate the fundamental rule.
We find it, as with the preceding example, in the predicate
'new_edges(Edge,Chart,Edges)' which maps Edge and Chart into a set of
new edges, Edges, that arises from the addition of Edge to Chart.  If
an inactive Category1 edge spanning V1 to V2 is being added, then for
every existing active edge which currently ends at V1, and which is
looking for a Category1, add a further edge (which may turn out to be                     
either active or inactive) based on it, that ends at V2:

    new_edges(edge(V1,V2,Category1,[]),Chart,Edges) :-
        findall(edge(V0,V2,Category2,Categories),
           member(edge(V0,V1,Category2,[Category1|Categories]),Chart),
          Edges).

That clause deals with everything that has to be done on the occasion
of the addition of an inactive edge.  But we also need a clause to
cater for the case of active edge addition.  We need to do three
things: (i) create the relevant empty active edges to look for the
next category (the top-down parsing strategy), (ii) look for every
inactive edge that could allow the active edge to be extended, and
create edges extending it, as appropriate, and (iii) combine the two
resulting sets of edges:

    new_edges(edge(V1,V2,Category1,[Category2|Categories]),Chart,Edges) :-
        start_active(Category2,V2,Edges1),
        findall(edge(V1,V3,Category1,Categories),
           member(edge(V2,V3,Category2,[]),Chart),
           Edges2),
        append(Edges1,Edges2,Edges).

Our new top-down chart parser is almost complete.  It only remains to
wrap the predicates into a package that will initialize the chart,
start an empty active edge for the initial symbol, create an initial
agenda, elaborate the chart to termination, and conclude by checking
for success:

    test(String) :-
        initial(Symbol),
        start_agenda(String,0,Chart1),
        start_active(Symbol,0,Agenda),
        extend_edges(Agenda,Chart1,Chart2),
        member(edge(0,M,Symbol,[]),Chart2),
        N is M + 1,
        not(member(edge(_,N,_,_),Chart2)).

Obviously our representation of the chart and/or agenda as a simple
Prolog list of edge data structures is one of the worst possible
choices as regards the efficiency of finding an edge of a given type
in a large chart/agenda.  The approach of representing the chart as
Prolog assertions is marginally better, as many Prolog implementations
index the clauses of a predicate to take account of some of the
arguments of the predicate.  This means that not all the possible
clauses would have to be considered when a search is made for an edge
with particular characteristics.  The automatic indexing introduced by
a Prolog compiler is not necessarily tuned to the type of searches
that one actually wants to perform, however, and so rather more
careful programming would be required to build a really efficient
chart parser in Prolog.