Calibrating Unsupervised Language Learning
==========================================
* Version of May 2021

Ongoing project, continuing activity. See the
[language learning wiki](http://wiki.opencog.org/w/Language_learning)
for an alternate overview. See the diary at
`learn-lang-diary/learn-lang-diary-part-two.lyx` for a progress log.

Project Summary
---------------
In 2019 we realized that training on English corpora does not offer
sufficient control to measure the quality of the learning algorithm.
Thus, we've devised a new approach: create a random grammar, create a
corpus of sentences from that random grammar, learn a grammar from that
corpus, and validate that the learned grammar accurately matches the
input grammar.

Doing this will allow the learning algorithm to be correctly calibrated
for grammars of different sizes and complexities, and for corpora of
different sizes. We will be able to measure how accuracy scales as a
function of training time, how well different training algorithms
perform, how large a corpus is needed to get good results, and other
related questions.

Informally, the idea of calibration here is just as with any other
instrument: you measure a "known quantity", and make sure that the
instrument is reading it accurately. In this case, the "known quantity"
is a known grammar, and the instrument is the grammar-learning system.

As of May 2021, the infrastructure to do the above is more or less
complete, and mostly automated, and some early calibration runs have
been performed. Lessons learned so far are given below, right after a
quick overview of the processing stages.

Processing Overview
-------------------
See the [README-Natural](README-Natural.md) file for a description of
the "open-loop" (uncalibrated) processing system. It describes the
processing steps in detail. Getting good results requires tuning a
variety of parameters, and so calibration needs to be run first.

Calibration is done by creating an artificial grammar, creating a text
corpus from this grammar, attempting to learn a new grammar from this
text corpus, and then assessing accuracy by comparing the learned
grammar to the generating grammar.

This pipeline has been more or less fully set up, perhaps with a few
"rough edges": some cosmetic bugs, some incomplete automation scripts,
and some faulty instructions. It requires a fair bit of editing of
configuration scripts, to adjust file paths, desirable parameters, etc.
So far:

1. Build and install link-grammar. Download the latest Link Grammar
tarball from `http://www.abisource.com/downloads/link-grammar/current`,
then unpack it and compile it:
```
tar -zxf link-grammar-*.tar.gz
cd link-grammar-*
mkdir build; cd build; ../configure; make -j
sudo make install
```

2. Build and install `cogutils`, `atomspace`, and this project:
```
git clone https://github.com/opencog/cogutils
cd cogutils; mkdir build; cd build; cmake ..; make
sudo make install
```
Repeat the above, with `atomspace` in place of `cogutils`, and again
with `learn` (this project).

3. Go to the [run/0-config](run/0-config) directory and review both the
[run/0-config/0-pipeline.sh](run/0-config/0-pipeline.sh) and the
[run/0-config/1-dict-conf.scm](run/0-config/1-dict-conf.scm) files. The
first contains the directory where the generated dictionary should be
written. The second contains the configurable parameters defining a
random grammar. The shell script
[run/1-gen-dict/gen-dict.sh](run/1-gen-dict/gen-dict.sh) will generate
the dictionary. The output is a standard Link Grammar dictionary.
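The generated dictionary uses the ordinary Link Grammar dictionary
syntax: one or more words, a colon, a boolean expression of connectors,
and a terminating semicolon. A purely hypothetical fragment (the actual
vocabulary and connector names are randomly generated, and will differ)
might look like this:
```
% Illustrative fragment only -- not actual generator output.
wordA wordB: (AB+ & CD-) or EF+;
wordC: AB- or (EF- & CD+);
```
To generate an actual dictionary, run the script from its directory: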
```
$ cd /home/ubuntu/run/1-gen-dict
$ ./gen-dict.sh
```

4. Review [run/0-config/1-corpus-conf.scm](run/0-config/1-corpus-conf.scm)
for the corpus-generation settings. Then run `./gen-corpus.sh`.

5. Run the processing pipeline described in
[README-Natural](README-Natural.md).

6. Measure accuracy.

So far, steps 1-6 have been automated to varying degrees. Contact us
(me) to discuss details.

Lessons Learned
---------------
The idea of calibration is a good idea, even an excellent idea, but it
is considerably more subtle than it first appears. The following issues
and questions arose fairly quickly.

* It is possible for two seemingly different grammars to generate the
  same corpus. In this case, when learning the grammar, how do we judge
  whether it is "correct"? How can we prove that two different grammars
  are equivalent? Is there an algorithm for generating this proof? Is
  this even provable, or is it Turing-undecidable, in the same way that
  the equivalence of two different group presentations is famously
  known to be undecidable? Given that group presentations and groups
  are a special case of grammars and corpora, it would seem that
  proving the equivalence of grammars is also undecidable. Nonetheless,
  for simple grammars, one might hope that ad hoc algorithms will
  suffice.

* The current code for generating random grammars has a dozen different
  tunable parameters to control that grammar. They are "common sense"
  parameters, in that they directly control different steps in the
  generation. Yet it has rapidly become clear that most parameter
  settings result in wildly complex and highly chaotic grammars. The
  resulting corpora appear to be highly "mixed" or ergodic. If a corpus
  is ergodic, then, of course, it will be impossible to extract any
  structure from it. How can one measure the ergodicity of a corpus?
  How can one measure the complexity of a grammar? (One crude, purely
  illustrative proxy for the first question is sketched after this
  list.)

* The way in which grammars are generated was motivated by a perhaps
  naive understanding of "factorization" - see the paper on sheaves,
  where an analogy is made between Ising models, matrix factorization,
  partition functions and other related concepts. The idea is that the
  word-disjunct matrix M factorizes as M=LCR, where L and R are sparse,
  high-dimensional matrices, and C is low-dimensional and dense
  (compact); the dimensions are spelled out below the list. The idea
  was to generate the grammars so that they resemble this
  factorization; yet the generated grammars are likely to have multiple
  ambiguous factorizations. How can we tell if a grammar has multiple
  ambiguous factorizations? How can we find these? Obviously, if a
  grammar has multiple ambiguous factorizations, then the machine that
  attempts to learn that grammar is likely to come up with one of the
  equivalent factorizations. How can we characterize this?

* As noted above, most parameter settings generate complex and
  seemingly ergodic grammars. Just eyeballing these shows that they do
  not seem to resemble English or any other natural-language grammar.
  How can we determine whether a random grammar is
  "natural-language-like"? What parameter settings result in human-like
  languages? Are the "axes" of adjustable parameters even "aligned"
  with the axes of human-language complexity? How does one even judge
  this?
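The ergodicity and complexity questions above remain open. Purely as an
illustration (this is not part of the pipeline, and is not a measure of
ergodicity proper), one crude proxy is to compare the adjacent-word
mutual information of a generated corpus against a shuffled copy of
itself: a corpus whose pair statistics barely differ from its own
shuffle is "well mixed" in at least this narrow sense. The corpus path
and the whitespace tokenization below are assumptions.
```
#!/usr/bin/env python3
# Crude, illustrative proxy for how "mixed" a generated corpus is.
# Not part of the calibration pipeline; the corpus path is hypothetical.

import math
import random
from collections import Counter

def avg_pair_mi(tokens):
    """Mutual information between adjacent word positions, in bits."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    mi = 0.0
    for (left, right), count in bigrams.items():
        p_pair = count / n_bi
        p_left = unigrams[left] / n_uni
        p_right = unigrams[right] / n_uni
        mi += p_pair * math.log2(p_pair / (p_left * p_right))
    return mi

# Whitespace tokenization of the generated corpus (path is an assumption).
with open("corpus.txt") as f:
    tokens = f.read().split()

# A shuffled copy destroys all sequential structure; it serves as a baseline.
shuffled = list(tokens)
random.shuffle(shuffled)

print("pair MI, corpus : %.3f bits" % avg_pair_mi(tokens))
print("pair MI, shuffle: %.3f bits" % avg_pair_mi(shuffled))
```
Values well above the shuffle baseline indicate extractable sequential
structure; values close to the baseline suggest there is very little to
learn.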
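For reference, the factorization mentioned in the third bullet, with
the dimensions spelled out. The notation here is an assumption (it is
not taken from the code or from the sheaf paper): W is the vocabulary,
D the set of disjuncts, and k, k' are small class counts.
```
% Word-disjunct matrix M, factored through a small "grammar core" C.
% L and R are sparse (each word, and each disjunct, belongs to only a
% few classes); C is small and dense, connecting the classes.
M = L\,C\,R, \qquad
M \in \mathbb{R}^{|W|\times|D|}, \quad
L \in \mathbb{R}^{|W|\times k}, \quad
C \in \mathbb{R}^{k\times k'}, \quad
R \in \mathbb{R}^{k'\times|D|}, \qquad
k,\, k' \ll |W|,\, |D|
```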
Additional details, results, questions and head-scratching can be found
in the
[Language Learning Diary, Part Two](learn-lang-diary/learn-lang-diary-part-two.pdf).
It appears that we've stumbled into the classic trap of science: the
more we learn, the more we don't know, and the things we don't know
appear to be increasingly basic and simple. It feels like I haven't
even dented the surface of the grammar-corpus correspondence. Onward
through the fog!

That's all for now!
-------------------
THE END.