Discovering English syntax

I've started a new project. My goal is to write a program that discovers enough of the syntax rules of written English to serve as a passable lexical tokenizer. I've made some progress in my approach thus far. But I can tell that my approach requires some serious rethinking. I'll describe the experimental design here and comment on my current progress.

If you wish to see the code I'm actively experimenting with you can find it on GitHub.

English syntax

Anyone familiar with programming language will recognize that there is a process involved in translating human-readable code into the more cryptic representation used by a computer to execute that code. And that there is a very precise syntax governing what your code in that language must look like to be considered syntactically valid. In the JavaScript statement "var x = someFunction(y * 12);" you'll realize that there are well-defined roles for each part. The "var" keyword indicates that the "x" identifier is a variable to use henceforth in the code. The "=" symbol indicates that you are immediately assigning a value to "x" using the expression to the right of it. You know that the "someFunction" identifier refers to some function defined somewhere. The matched parentheses following it will be input arguments to that function. And that "y * 12" is a sub-expression that must be evaluated before its computed value gets used as the only input argument to someFunction.

Written English is very like this. You know that this blog post is broken down into sections. Each section is divided into paragraphs. Each paragraph is composed of one or more sentences. Sentences are composed of strings of words. Mostly separated by spaces and terminated with periods. And words are mostly unbroken strings of single letters.

Naturally you know that this is insufficient to capture all of the syntax rules of the text of even this post. For example, the word "you'll" is not an unbroken string of letters. You know that there are strings of space-separated words that are wrapped up in double-quotes too. You recognize that words are not generally composed of letters willy-nilly. Instead they are mostly lowercase letters. Some words have initial capitals. Some words like "someFunction" violate even these norms. And clearly my quoted JavaScript expression is not even English text. But for the most part this blog post follows the simple syntax rules I just described.

Expressing syntax rules

My goal for this project is to get software that can discover basic syntax rules for written English. The starting point is a test program that has a relatively small sample of several paragraphs of text captured from an online news article. I break this text up into a list of separated paragraphs to feed into the parser. The parser's job is to translate any paragraph it is given into a tree structure representing sentences, words, nested clauses in parentheses, and so on. Here's one of the more complex paragraphs I'm feeding it:

Findings from this groundbreaking study, conducted in China by the George Institute for Global Health, show that low-sodium salt substitutes save lives and prevent heart attacks and strokes. Low-sodium salt decreased the risk of death by 12%, the risk of stroke by 14%, and total cardiovascular events (strokes and heart attacks combined) by 13%.
(Source: CNN Health article)

As you can see it has dash-conjoined words like "low-sodium", commas, parentheses, and percentiles to complicate things.

The parser is supposed to ultimately do linear-ish parsing like a parser of JavaScript or any other programming language does. Programming language syntax is often expressed initially and formally using BNF grammar. Even programmers often struggle to make sense of BNF expressions. We are often more familiar with the regular expressions (aka "regex") features available in most programming languages now. One simple regular expression to represent a trivial paragraph structure might be:

^(\s*([A-Za-z]+)+[.?!])+$

Essentially, one or more words ( [A-Za-z]+ ) with optional spaces before each, followed by trailing punctuation, all repeated one or more times.

I was tempted to implement my parser by having syntax rules be composed of ever more sophisticated regular expressions. But for various reasons I chose to invent my own text parser that supports a regex-like grammar I invented for this project's purpose. The parser contains a set of "patterns". Every pattern has a unique numeric ID and can optionally contain a name I designate (eg "Word" or "Letter"). A pattern is either a literal string of text or an expression.

As a minimal requisite I created one pattern for each of the characters the parser can encounter. That includes all the uppercase and lowercase letters, digits, the symbols found on a standard American keyboard, and the space character. For simplicity I don't allow newline characters, tabs, or Unicode characters beyond this very basic ASCII-centric set.

I do believe that my learning algorithm could eventually discover abstract classes of characters like digits and letters. But to make my explorations easier to start with I endowed the parser with a few extra patterns:

  • Aa:  A | a
  • Bb:  B | b
  • (All other upper/lower case pairs)
  • Uppercase:  A | B | C | D | ... | Y | Z
  • Lowercase:  a | b | c | d | ... | y | z
  • Letter:  Uppercase | Lowercase

The above pattern expressions showcase how a pattern can contain alternatives. If one alternative doesn't match then maybe the next one will. Each alternative is a sequence. For example, "Ss Tt Oo Pp" would match "Stop", "stop", "STOP", and any other case-insensitive version of the word. Combining sequencing and alternation, "Ss Tt Oo Pp | Gg Oo" would match either "stop" or "go".

This expression language also supports parentheses grouping of sub-expressions. The main purpose of which is to facilitate quantifiers. Those familiar with regular expressions will recognize some of these quantifiers and probably guess the rest:

  • A+:  One or more "A"
  • A*:  Zero or more "A"
  • A?:  Optional; zero or one "A"
  • A!:  Negative; make sure the next thing is not "A" before continuing on
  • A{3}:  Exactly 3 "A"
  • A{3-5}:  At least 3 and up to 5 "A"
  • A {3+}:  Three or more "A"
Finally, this expression language supports look-behinds and look-aheads. Prefixing an element with "<" causes it to look behind. Prefixing with a ">" causes it to look ahead. This means to make sure that current element must be preceded by or followed by something. For example, "<Space Letter" means make sure that the letter must be preceded by a space. Adding "!" as seen in the negation above makes it a negative look-behind or look-ahead. So "<Letter! Letter" means make sure this letter is not preceded by a letter.

Coming back to the earlier regex for a paragraph. We'd like to end up with some patterns similar to this:

  • Word:  Letter+
  • Phrase:  (Word Space)* Word
  • Sentence:  Phrase+ '.'  ('.' is the escaped name of the period literal pattern)
  • Paragraph: Sentence+

Of course this would fail to match any but the most trivial paragraphs. We would want to have more sophisticated patterns. Maybe the Word pattern might look more like "Letter+ ('''' s | '''')?" to capture words with possessives expressed with apostrophe-S or just apostrophe suffixes. And so on.

The parsing process

A typical linear parser produces a parse tree capturing the singular acceptable interpretation of the source text. The syntax is designed to guarantee that there really is only one valid interpretation. But my experience with natural language processing tells me that it makes more sense to produce multiple interpretations of some text and leave it to higher levels of processing to evaluate which one is best. Moreover. I'm starting with a parser that must discover the syntax rules.

One naïve way to approach this problem is to start at the first character and create a branching tree of all possible interpretations as I move forward. Using the above quoted text, the first word is "Findings". My first attempt would match the "F" pattern, the "Ff" pattern, the "Uppercase" pattern, and the "Letter" pattern that are the initial givens. The next step would be to move past the end of each of these matches and consider the next bit of text. In this case all of the matches are of single characters. If we already had a "Word" pattern then it would be 8 characters long. We would start matching at the next character after that final "s". The problem with this approach is that we have already matched 4 patterns on our first character. Then for each starting point after that we are going to match 4 more, thus making our tree have 16 end nodes after just 2 characters. Proceeding forward like this without even adding any extra patterns means our tree would have 1.6x1060 tree endpoints once we reach character 100. That's not practical for most computers to work with.

My solution to this problem was to introduce what I call a "token chain" data structure that collapses the tree of all combinations of patterns down to a linear array that one one element for each character in the source text. Each array element is itself a list of all matching patterns that start at that location. To produce this token chain is a simple matter. Starting from character 1, attempt to match all known patterns starting at each character position going forward. If a pattern does match then attach a token representing that match at that position. Actually, the token chain has two analogous arrays called "heads" and "hails". Each matching token gets attached to both arrays. Heads are attached wherever the first character of the match is. And then the last character of the match indicates where to attach it to the tails array. The tails array allows the algorithm to look backward and easily see which patterns precede any given token's start. For example, when looking at a token matching the "Word" pattern (Letter+) that starts at character 50, the algorithm can then look at the tails array at position 49 to see what matches end just before this word.

The above algorithm may seem like a bad idea. After all, if I match a word like "Findings" right away, shouldn't I just move on to the next character after it and start there, skipping all the characters in between? The problem is that this initial pass of parsing does not yet know what the "best" interpretation is. So it must try all possibilities. That means all the single-character patterns too. So matching of all patterns must happen starting at every character position right up to the last. The good news is that this process is actually very fast. Even with thousands of patterns defined.

Another interesting aspect of this initial parsing pass is that the more abstract patterns benefit from earlier parsing already done. Let's say we had "Word" defined as "Letter+". The patterns are all stored in the order in which they were introduced. The "Aa" pattern must be defined after "A" and "a" are. The "Letter" pattern must be defined after "Uppercase" and "Lowercase" are. Which means by the time we get to the "Word" pattern, we've already discovered that at this location is one "Letter" match. Having said that, our "Word" pattern requires us to move forward character by character looking for more letters. We might not have looked far enough ahead yet. But as we do look for "Letter" at the next position, we are also looking to see if it matches "Uppercase" and so on all the way up the hierarchy of ever simpler patterns. And all along we are caching those matches at that head position and also caching them at their tail positions as well. This caching of all matches greatly speeds up the process. And it guarantees that all possible strings of pattern matches are covered by the time we're done matching the last character of the source text. Then we can easily hop our way from match to match in the token chain however we wish. If we hand-crafted the "Paragraph" pattern above we could hop from sentence token to sentence token with ease because they would already be matched. We wouldn't even have to do this because the Paragraph pattern would already have been matched by doing this all the way down to the single character level.

As you might imagine, I don't expect the token chain to be the final output of this parser. But for this experiment it is a sufficient one. The learning algorithm uses this token chain as its input.

The learning process

All the above is really just the test harness. The real crux of this experiment is creating an algorithm for discovering the patterns that best capture the lexical elements I expect to be able to consume in some larger program. Namely sentences, words, quoted text, and so forth. So what is the algorithm? Before I continue I'll say that I haven't discovered one yet. What you'll read below is what I have tried thus far and some observations.

I'm basing this entire experiment on a premise: that information from the natural world is not random but structured. The same goes for human languages. We "design" them to be understood by other people. We may omit many details in order to communicate quickly. But the structure is still there. If the human mind can learn to capture those patterns with relative ease by learning to speak and eventually to read then it should be possible for a machine to see those patterns and learn to recognize and expect them in any written language.

In this case the knowledge of the system is the set of all defined patterns. So originating knowledge means constructing new patterns. And then testing their effects on parsing text. One way to propose new rules is to do so completely at random. Maybe "Letter '9' (Xx '3' '$')+" is worth trying out. But of course it is not. Why not? Because it is random. And language is not random.

The crux of what I have tried is to observe actual pairs of adjacent matching patterns in the token chain. What does this mean? Let's say we have a "Word_Space" pattern defined as "Word Space" and "Word" is simply "Letter+". Our source paragraph begins "Findings from this groundbreaking". Starting at character 1 we find a token whose pattern is "Word_Space" and whose matching text is "Findings ". Immediately after that token is another "Word_Space" token whose match is "from ". This is one pair of adjacent patterns found in the text. As you might imagine, there will be lots of other adjacent matches. Like "Letter" + "Word" matching "F" + "indings". And "Uppercase" + "Lowercase" matching "F" + "i".

We will actually get a massive number of these pairs of adjacent tokens (matches) as we survey the entire token chain. Every one of these can be directly translated into a new pattern. For each pair I can evaluate whether the two patterns are the same ("Letter" + "Letter") or different ("Letter" + "Word"). Let's call the first pattern "A" and the second "B" for this purpose. If A and B are identical then we'll hypothesize that there are many of these repetitions. We'll define a new pattern like "A+". Like "Letter+", "Ee+", or "Word_Space+". If A and B are different then we'll hypothesize that this pair occurs often in natural text. We'll create an "A B" pattern. Like "Uppercase Lowercase" or "Letter Word".

One natural problem with this is that there will be at tens of thousands or more of unique pairs of patterns found in even modest paragraphs of text. We need a way to reduce the number that we consider down to a manageable number. One way I tried is to keep count of how many times I encounter a pattern. Letter + Letter appears very often for example. And then I can sort them from most common to least common and choose, say, the top 10 or top 100 to use to propose new patterns.

Okay. So now I have maybe 10 or 100 new patterns in the parser. Now what? Now I run the entire parsing process again with the newly expanded set of patterns. Why? Because I want to evaluate how useful each experimental rule is. What am I measuring? One option is to survey the token chain to see how many times each pattern matched something. I already know that I chose pairs of patterns that were found in the text, so I can be sure that all the patterns will match lots of things.

Then what? Keep iterating. Each new iteration will generate new patterns. These will generally be more and more abstract, building on earlier patterns.

Intermediate results

Overall I’m fairly happy with this approach as a starting point. As I had hoped, the algorithm immediately discovers that “Letter+” is very effective at fitting much of the available text. When I see this proposed pattern I immediately name it “Word” for my own ease of understanding. And it also fairly quickly discovers that “Word Space” also describes a lot of the text. So does “(Word Space)+”. And eventually “(Word Space+) Word”. In practice these gains come from a lot of tweaking of various numbers and biases. Under very expensive (in processing terms) conditions it eventually discovers the basic sentence as “(Word Space)+ Word ‘.’”. But whereas I expect this to be an easy win, it actually gets harder for the algorithm to make this sort of progress. Why?

What I did not talk much about earlier is how I’m deciding to winnow down the many options I can pursue. Remember how I said I can take the top 10 pairs I found for creating 10 new patterns based on how many matches there were in the token chain? This perversely rewards generalized patterns that match as few characters as possible. Like “Letter Lowercase”, “Uppercase Lowercase Letter” And so forth. They can crowd out more useful patterns like “(Word Space)+”. I also tried keeping track of total match lengths. But this means that a pattern like “Word” = “Letter+” would match “Findings”, “indings”, “ndings”, and so on and quickly rack up total match count of 8 + 7 + 6 + … + 1 for this one word. I then introduced the metric of “coverage”. In that case I’m counting how many of the total characters in the source text are matched, even if in duplicate. So Word matching “Findings”, “indings”, and so forth would still have a total coverage of 8 characters for that word. That helped a lot. I also introduced a metric I call “stretch”. That measures how many characters from the very first one in the source text is matched. For Word the stretch measure for our source paragraph would be 8. For Sentence (if we ever got there) the stretch value would be the length of the first sentence in characters.

Each of these metrics is kept with their respective patterns and accumulated during the “survey” process after pattern matching produces the token chain. There is a separate data structure for keeping track of all the unique pairs of patterns (eg Letter + Lowercase) and their counts. Originally I tried literally counting how many times the given pair could be found. This creates the same basic bias problem of totalling up the match counts for each single pattern. I experimented with summing up coverages for each pattern pair in the same way I did for single patterns.

Experimenting with different metrics I collect about individual patterns and pattern pairs changes which patterns the algorithm ends up proposing and experimenting with. This is because in each case I am limiting how many I’ll try out in each iteration. There is also a culling process after an iteration is through where I toss out patterns that did not perform well compared to the others. Again based on the metrics I keep about each pattern’s performance during parsing.

Observations

Here’s an example of just some of the patterns this algorithm comes up with after 4 iterations using some fairly modest settings. You can see I have named some of them like “Word” and “Word_Space” to help myself make some sense of the patterns. These names are reflected in later patterns too. This isn’t the full set from this particular run, but only the start of the experiments.

Id | Name | Type | Pattern
124 |  | Experimental | Letter Lowercase
125 | Word | Derived | Letter+
126 |  | Experimental | Lowercase+
127 |  | Experimental | Lowercase Letter
224 |  | Experimental | Space Word
225 |  | Experimental | Space Letter Word
226 |  | Experimental | Space Letter Lowercase+
227 |  | Experimental | Letter Space Word
228 |  | Experimental | Lowercase Space Word
229 |  | Experimental | Space Lowercase+
230 |  | Experimental | Space Lowercase Word
231 |  | Experimental | Space Lowercase Lowercase+
232 |  | Experimental | Word Space Letter
233 | Word_Space | Derived | Word Space
234 |  | Experimental | Lowercase+ Space Letter
235 |  | Experimental | Letter Word
236 |  | Experimental | Letter Lowercase+
237 |  | Experimental | Lowercase+ Space
238 |  | Experimental | Lowercase Word
239 |  | Experimental | Lowercase Lowercase+
254 |  | Experimental | (Letter Lowercase)+
314 | Word_Spaces | Derived | Word_Space+
315 |  | Experimental | Word Space Word
316 |  | Experimental | Word_Space Word
317 |  | Experimental | Word Space Letter Word
318 |  | Experimental | Word Space Letter Lowercase+
319 |  | Experimental | Word_Space Letter Word
320 |  | Experimental | Word_Space Letter Lowercase+
321 |  | Experimental | Letter Word Space Word
322 |  | Experimental | Letter Lowercase+ Space Word
323 |  | Experimental | Letter Word Space Letter Word
324 |  | Experimental | Letter Word Space Letter Lowercase+
325 |  | Experimental | Letter Lowercase+ Space Letter Word
326 |  | Experimental | Letter Lowercase+ Space Letter Lowercase+
327 |  | Experimental | Lowercase+ Space Word
328 |  | Experimental | Lowercase+ Space Letter Word
329 |  | Experimental | Lowercase+ Space Letter Lowercase+
330 |  | Experimental | Lowercase Word Space Word
331 |  | Experimental | Lowercase Lowercase+ Space Word
332 |  | Experimental | Word_Space (Letter Lowercase)+
333 |  | Experimental | Lowercase Word Space Letter Word

It’s apparent to me that, despite being generated based on observations of nonrandom data, these patterns are still fairly random. Expanding the number of candidates that I take or increasing the number of iterations mainly increases the amount of dubious patterns proposed. Tweaking how I measure utility — by match count, matching character count, coverage, or stretch — does influence what patterns are proposed. Sometimes for the better. But it’s clear that this overall approach is missing one or more things to help focus it on getting “smarter” in a way I can relate to.

I think one problem is that I’m not quite tying the utility of one pattern to the utility of the larger patterns that rely on it. If the algorithm discovers the basic sentence structure then the reward for that should be high. And the reward for the words that compose it should also be high as a consequence. This example also indicates that there should be a reward for higher order patterns that match a large percentage of the source paragraphs with a small number of instances. 5 sentences should be worth more than the 40 words, 30 spaces, and 5 periods that compose them, even though they cover the same exact characters. But the reward for those 5 matches should also be shared downward to those lower level patterns so they are favored over less productive patterns.

One clear problem is that only looking at pairs of matching A + B patterns and only constructing “A+” or “A B” patterns from each pair is very limited. This completely ignores most of my pattern expression language’s capabilities. Certainly different quantifiers like “A*”, “A?”, and so on. But more egregiously I am ignoring the power of alternatives. The words in a sentence are usually separated by spaces, but sometimes by commas, semicolons, dashes, and so on. There’s no way for my current algorithm to construct something like Word_Separator = “Space | ‘,’ Space | ‘;’ Space | Space ‘-’ Space”, for example.

Overall I still consider this a success so far. My parsing mechanism is solid. Once I got that working I did not need to change it. Most of my work then went into the token-chain surveying, pattern proposal, and pattern culling mechanisms. Going with simple pairs and constructing ever larger patterns using “A+” and “A B” construction got the algorithm fairly far in discovering the structure of written English in the source texts.

Next steps

I’m generally happy with this so far. But I’m nowhere near done. I still get the sense that my program is getting progressively dumber instead of smarter. It does not demonstrate any genuine ability to recognize that some of the patterns it discovers are very good at capturing the contents of the source texts it is fed. We get those “a-ha” moments as we learn to read. Some new pattern clicks and it is apparent to us how useful it is. I don’t believe that “a-ha” moment is a result of magic. I think we apply the new pattern to what we read and see that it allows us to read that much better going forward. And that’s what this algorithm should be able to do.

To that end I need to put more thought into what it means for me to survey the token chain that the parser produces. I don’t think there’s anything wrong with how that token chain is produced for now. It allows the algorithm to clearly see what patterns have already been matched through an exhaustive search. But I should be able to see how far the higher order patterns are able to parse through the source text before they get stuck on some unrecognized pattern.

I think I should also introduce the concept of “holes” in the parsing results. Let’s say I have some sentence that contains “approximately 30% of all adults”. And say I have a trivial notion of what a sentence is as composed of letters-only words separated by spaces followed by a period. That “30%” is going to break that attempt to match the sentence. As a human observer I know that “30%” fills the same role as any other word. I ultimately want my learning algorithm to discover this on its own. So it seems like it could be useful to skip past “30%” and see if the rest of the sentence matches the expected pattern. If it does then I might well hypothesize that “30%” is somehow a word. Then comes the harder task of breaking it down to its inner pattern. Is it literally “30%”? You and I know it’s not. It’s actually a string of digits followed by “%”. And we know that it could optionally include other characters as with “30.6%” or “1,234%”. This “holes” idea could be fruitful and does seem a lot like how we humans learn to cope with novel content in text that we read. We seem to read around the weird stuff and then come back to consider the weird stuff on its own.

I also plan to consider more than just adjacent pairs of matching patterns. I think it would be worthwhile during surveying to look for more sophisticated patterns in the token chain before it even gets to the point of proposing new patterns. For example I should look for lists of repetitions with optional separators. Like “Word Space Word Space … Space Word”. I would expect that when this survey process comes across long repetitions like this that they should be regarded as very high value candidates immediately. That exemplifies the very idea that there is structure in nature and language. I would also like to search for “containment” type patterns. As Prefix + Body + Suffix. Ideally it would be able to discover balanced parentheses, single quotes, double quotes, brackets, and braces.

One other aspect I want to explore is looking inward and not just outward. What I mean is that I would like this algorithm to attempt to discover patterns within words, for example. Like how most words in a text are all in lowercase. Yet some begin with a capital letter. Maybe how many words contain the “ch” letter pair or end in “s”. My current thinking is that the algorithm should prioritize maximizing coverage of patterns in search of the uber “Paragraph” pattern that will match all paragraphs. The problem is that it may trivialize some sub-patterns. It might consider “(“ to be a word, for example. And never discover that “(“ is usually paired with “)” to contain an inner sub-sentence inside a sentence.

Bottom line is that there’s a lot more to try out. I don’t think I will completely exhaust this entire topic. But I do think I can make a lot more progress.

I also think that what I’m doing here is crafting a generalized learning algorithm. I am applying it presently to one specific task in English. But I think that there are larger meta-patterns at work that apply to a larger set of data recognition and parsing problems. One very straightforward example is that I believe that once this algorithm works well with the current starting patterns, I should be able to throw away all the derived patterns. No Aa = “A | a”, Lowercase, or Letter patterns given as a starting point. I believe this algorithm should be able to even discover those things in the same way I am expecting it to discover digits as a unified character class. But I suspect this approach may even be useful in discovering semantic patterns in how words are put together to form larger ideas.

I also think that this algorithm can ultimately be used in a manner that it continues to learn as it encounters text. I’ve never liked the idea of turning off learning in a neural network or other machine learning algorithm and letting its knowledge remain forever fixed. I see the algorithm I’m exploring as perfectly capable of staying on and getting innovative when it comes across new patterns in text. Ideally it would also be able to call out when some text does not fit the usual patterns and possibly propose corrections. Think of it as being analogous to a spell or grammar checker. I also see human knowledge workers assisting the algorithm by giving names to patterns, offering corrections to dubious patterns, flagging some patterns as known no-nos, approving novel discoveries as worth keeping, and so on.

There’s a lot more to explore in this algorithm and approach to machine learning.

Comments

Popular posts from this blog

Virtual lexicon vs Brown corpus

Text blocking / Sentence segmentation

Stereo vision: measuring object distance using pixel offset