Jim Carnicelli's AI Blog

<h1>Coherence and ambiguities in problem solving</h1><p><i>By Jim Carnicelli, April 1, 2022</i></p><p>Natural Language Processing (NLP) is a big topic, one I come back to again and again when I have time to explore it. Work has kept me very busy. So has moving. In recent months I've returned to the topic and made some interesting progress constructing an "NLP pipeline". But as anyone who has done NLP work will tell you, English is full of ambiguities. They may tell you about the approaches they take to reduce the ambiguity and be decisive in the end. But the ambiguities persist and can't simply be guessed away.</p><h2 style="text-align: left;">The problem</h2><p>More importantly, though, the ambiguities often require us as humans to look across levels of interpretation to resolve them. I generally have not found AI researchers offering a good way of doing this.</p><p>To illustrate my problem, consider this sentence:</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;">Some guys' shoes' laces are red.</p></blockquote><p>As someone literate in English you have no problem interpreting it. But odds are good you can guess where the ambiguity lies. Looked at in isolation, each of the apostrophes can be interpreted in at least three ways: as part of a plural possessive, as the start of some text surrounded by single quotes, or as the end of some similarly single-quoted text. What leads you, the reader, to conclude it's the plural possessive "guys' "? You might argue that s-apostrophe always indicates a plural possessive. But I think you also look at where the apostrophes appear. 
Consider this alternative but nearly identical sentence:</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;">Some guys 'shoes' laces are red.</p></blockquote><p>This should bother you. Why? Grammatically it makes no sense. The thing is, a typical NLP pipeline does not look at text like you and I do. We actually look for meaning and realize it doesn't make sense. But what if you didn't look at the meaning but only at the structure? The same options are available for each of the apostrophes as above. But now your first guess is that the apostrophes are single quotes surrounding "shoes", as though the statement were sarcastically referring to something as "shoes". But as a human you would correct this in your head and maybe point out to the author the mistaken placement of the first apostrophe.</p><p>What's going on here? One way of looking at this mechanically is as a one-way pipeline of interpretation that starts with raw text as input. The first component parses out the tokens. It passes them on to a second component that finds quoted text, parentheticals, and other logical groupings. It passes the now-grouped segments of tokens on to a third component that tries to find meaning in them. But it should be apparent from the above example that in order to even find the tokens correctly you may need to find the correct meaning, which in turn relies on the correct tokens. A chicken-or-egg problem. If you accept that there is a most logical interpretation then you'll agree that "shoes' " is a whole word token in both sentence versions. But the tokenizer cannot truly conclude this correctly. Nor can even the "grouper" component.</p><p>For years I've been puzzling over how to get discrete pieces of an NLP process to collaborate to resolve ambiguities. A generalized solution would revolutionize AI for sure. I won't say I have found the answer. 
But I think I may have stumbled upon a way of structuring problem solving of this sort.</p><h2 style="text-align: left;">Conceptual framework</h2><p>Today I started thinking of this in terms that are somewhat new for me and that help. I realized that what I need is an algorithm that can entertain different interpretations of ambiguous data. In the past I've run into the problem of exploding combinatorics even with a simple tokenizer: if I create a branching tree of all the possible interpretations of a small paragraph of text, I might quickly construct a tree with millions of leaf nodes at the end. Needless to say, this gets slow and memory-intensive. And it still leaves you with the need to find the best interpretation.</p><p>I started thinking today in terms of seeking a “coherent interpretation”, or “coherence”. It occurred to me that it is not necessary to consider every possibility. It could be worthwhile to just identify possible ambiguities along the way and keep track of them, but to then seek one or a small number of most coherent interpretations and move on, keeping track of these. Only if a later stage in the pipeline concludes that there is a lack of coherence should we backtrack and revisit some of the alternatives in hopes of finding a more coherent bigger picture.</p><p>I realized that one way to make use of this is to embrace ambiguities. The most recent version of my tokenizer relies on a set of named regular expression definitions for words. I did not include the possessive case where a word ends in an apostrophe because I knew that this needed to be resolved by a later stage. By this thinking I absolutely should represent that case in the tokenization rules. But I should make sure that my tokenizer can recognize that there are at least two possible interpretations at this word boundary.</p><p>What’s more, I realized that this is an opportunity for a learning algorithm to get involved. 
When a component recognizes that there are two or more interpretations of some data, I could store this fact and start keeping a tally of the interpretations that are ultimately accepted as part of the most coherent interpretations over time. Then the most common correct interpretations can be favored as the first interpretation tried, to improve the performance and accuracy of decision making later. If the algorithm finds that the s-apostrophe case is a plural possessive noun 90% of the time, then that will be its first guess going forward.</p><p>I realized also that there is a place in this conceptual framework for learning negative rules. In proper English we expect the first word in a sentence to be capitalized. So when we come across a sentence whose first word is not capitalized then we might let the user know of this mistake. But to do this we actually need to encode the rules and flag them as erroneous. They would contribute to concluding that an interpretation is incoherent. But they might also serve as good explanations when there are no more coherent interpretations available.</p><h2 style="text-align: left;">Reentrance</h2><p>I’ve been thinking about how to approach the conceptual framework I described (poorly and summarily) above. One key to it is getting away from the impulse to linearize everything.</p><p>Consider the task of counting a pile of money. It’s easy to picture doing so from start to finish. But what would you do if you got interrupted in the middle of the task? You might write down the running total and make sure the already counted pile is well separated from the uncounted pile. Then when you return you can pick it up again where you left off. In this way this process is reentrant.</p><p>In this sense it is necessary to be able to come up with one interpretation of a piece of text or other problem and be able to come back to it later to consider alternative interpretations. This means the task must be designed to be reentrant from the start. 
And it means there must be a way of keeping track of the options we have already tried and be able to pick up where we left off and try another option. Ideally each next option we would try would be the next best and not merely a randomly possible option.</p><p>It occurs to me that a later stage in the pipeline would ideally be able to give clues as to what to look for too. Let’s say an earlier stage gave as its best interpretation that there is a sentence whose first word is not capitalized. A later phase looking up each of the words in its lexicon might conclude that “iPhone”, the first word in the sentence, is actually a proper noun that is spelled in a nonstandard way. It might then tell the earlier stage to consider this fact and find another alternative interpretation with this knowledge in mind. A more coherent interpretation should emerge.</p><h2 style="text-align: left;">Scoring</h2><p>I think each total interpretation of some piece of data should be given a numeric coherence score. I’m not exactly sure how to go about it just yet. But one option would be to use positive scores to indicate coherence. The more coherent the higher the score. I think as soon as anything breaks the coherence of an interpretation, no matter how small, the score might go negative. The more incoherent the more negative the total score.</p><p>What would contribute to the score? I’m still trying to work this out. I think that anything that is not ambiguous could contribute 0 to the score. Only the ambiguous cases might be considered. Let’s say we came across an ambiguity like ‘12” ’. Is this 12 inches? Or is the double quote here a closing quote from a larger string of text? Let’s say in surveying thousands of texts we found that 65% of the time it’s a closing quote and 35% of the time it’s a length in inches. So we might add +1 for the length option and +2 for the closing quote option. 
If, when evaluating the total text, we discover that there is no opening quote to match with this potential closer, then we score that option as negative to indicate incoherence.</p><h2 style="text-align: left;">Conclusion</h2><p>I’m still trying to work this concept out mentally before I try to write an algorithm based on it. I genuinely think I’m onto something here though. I think that there may be a generalized approach to problem-solving peeking out here. I want to believe that there is a way to write a general data structure and algorithms that, like a Christmas tree, can be adorned with specialized black-box processing components with reentrance and coherence models built into them. The larger algorithms could then enable these black boxes to collaborate without understanding each other’s details.</p><p>I plan to take a first stab at this in the coming days. Hopefully I’ll have something to report soon.</p><div><br /></div><h1>Discovering English syntax</h1><p><i>By Jim Carnicelli, October 27, 2021</i></p><p>I've started a new project. My goal is to write a program that discovers enough of the syntax rules of written English to serve as a passable lexical tokenizer. I've made some progress in my approach thus far. But I can tell that my approach requires some serious rethinking. I'll describe the experimental design here and comment on my current progress.</p><p>If you wish to see the code I'm actively experimenting with you can <a href="https://github.com/JimCarnicelli/TokenDiscovery/">find it on GitHub</a>.</p><h2 style="text-align: left;">English syntax</h2><p>Anyone familiar with programming languages will recognize that there is a process involved in translating human-readable code into the more cryptic representation used by a computer to execute that code. 
And that there is a very precise syntax governing what your code in that language must look like to be considered syntactically valid. In the JavaScript statement "var x = someFunction(y * 12);" you'll realize that there are well-defined roles for each part. The "var" keyword indicates that the "x" identifier is a variable to use henceforth in the code. The "=" symbol indicates that you are immediately assigning a value to "x" using the expression to the right of it. You know that the "someFunction" identifier refers to some function defined somewhere. The matched parentheses following it contain the input arguments to that function. And that "y * 12" is a sub-expression that must be evaluated before its computed value gets used as the only input argument to someFunction.</p><p>Written English is very much like this. You know that this blog post is broken down into sections. Each section is divided into paragraphs. Each paragraph is composed of one or more sentences. Sentences are composed of strings of words, mostly separated by spaces and terminated with periods. And words are mostly unbroken strings of letters.</p><p>Naturally you know that this is insufficient to capture all of the syntax rules of the text of even this post. For example, the word "you'll" is not an unbroken string of letters. You know that there are strings of space-separated words that are wrapped up in double-quotes too. You recognize that words are not generally composed of letters willy-nilly. Instead they are mostly lowercase letters. Some words have initial capitals. Some words like "someFunction" violate even these norms. And clearly my quoted JavaScript expression is not even English text. But for the most part this blog post follows the simple syntax rules I just described.</p><h2 style="text-align: left;">Expressing syntax rules</h2><p>My goal for this project is to get software that can discover basic syntax rules for written English. 
The starting point is a test program that has a relatively small sample of several paragraphs of text captured from an online news article. I break this text up into a list of separated paragraphs to feed into the parser. The parser's job is to translate any paragraph it is given into a tree structure representing sentences, words, nested clauses in parentheses, and so on. Here's one of the more complex paragraphs I'm feeding it:</p>
<p style="margin-left: 3em;">Findings from this groundbreaking study, conducted in China by the George Institute for Global Health, show that low-sodium salt substitutes save lives and prevent heart attacks and strokes. Low-sodium salt decreased the risk of death by 12%, the risk of stroke by 14%, and total cardiovascular events (strokes and heart attacks combined) by 13%.<br /><i>(Source: <a href="https://www.cnn.com/2021/10/10/health/frieden-salt-sodium/index.html">CNN Health article</a>)</i></p>
<p>As you can see it has dash-conjoined words like "low-sodium", commas, parentheses, and percentages to complicate things.</p><p>The parser is supposed to ultimately do linear-ish parsing like a parser of JavaScript or any other programming language does. Programming language syntax is often expressed initially and formally using <a href="https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form">BNF grammar</a>. Even programmers often struggle to make sense of BNF expressions. We are often more familiar with the <a href="https://en.wikipedia.org/wiki/Regular_expression">regular expressions</a> (aka "regex") features available in most programming languages now. One simple regular expression to represent a trivial paragraph structure might be:</p><p></p>
<p style="margin-left: 3em;"><span style="font-family: courier;">^((\s*[A-Za-z]+)+[.?!])+$</span></p>
<p></p><p>Essentially, one or more words ( <span style="font-family: courier;">[A-Za-z]+</span> ) with optional spaces before each, followed by trailing punctuation, all repeated one or more times.</p><p>I was tempted to implement my parser by having syntax rules be composed of ever more sophisticated regular expressions. But for various reasons I chose to invent my own text parser that supports a regex-like grammar I invented for this project's purpose. The parser contains a set of "patterns". Every pattern has a unique numeric ID and can optionally contain a name I designate (eg "Word" or "Letter"). A pattern is either a literal string of text or an expression.</p><p>As a minimal requisite I created one pattern for each of the characters the parser can encounter. That includes all the uppercase and lowercase letters, digits, the symbols found on a standard American keyboard, and the space character. For simplicity I don't allow newline characters, tabs, or Unicode characters beyond this very basic ASCII-centric set.</p><p>I do believe that my learning algorithm could eventually discover abstract classes of characters like digits and letters. But to make my explorations easier to start with I endowed the parser with a few extra patterns:</p><p></p><ul style="text-align: left;"><li>Aa: A | a</li><li>Bb: B | b</li><li>(All other upper/lower case pairs)</li><li>Uppercase: A | B | C | D | ... | Y | Z</li><li>Lowercase: a | b | c | d | ... | y | z</li><li>Letter: Uppercase | Lowercase</li></ul><p></p><p>The above pattern expressions showcase how a pattern can contain alternatives. If one alternative doesn't match then maybe the next one will. Each alternative is a sequence. For example, "Ss Tt Oo Pp" would match "Stop", "stop", "STOP", and any other case-insensitive version of the word. Combining sequencing and alternation, "Ss Tt Oo Pp | Gg Oo" would match either "stop" or "go".</p><p>This expression language also supports parentheses grouping of sub-expressions. 
The main purpose of this is to facilitate quantifiers. Those familiar with regular expressions will recognize some of these quantifiers and probably guess the rest:</p><p></p><ul style="text-align: left;"><li>A+: One or more "A"</li><li>A*: Zero or more "A"</li><li>A?: Optional; zero or one "A"</li><li>A!: Negative; make sure the next thing is not "A" before continuing on</li><li>A{3}: Exactly 3 "A"</li><li>A{3-5}: At least 3 and up to 5 "A"</li><li>A{3+}: Three or more "A"</li></ul><div>Finally, this expression language supports look-behinds and look-aheads. Prefixing an element with "<" causes it to look behind. Prefixing with a ">" causes it to look ahead. This means making sure that the current element is preceded by or followed by something in particular. For example, "<Space Letter" means the letter must be preceded by a space. Adding "!" as seen in the negation above makes it a negative look-behind or look-ahead. So "<Letter! Letter" means make sure this letter is not preceded by a letter.</div><p></p><p>Coming back to the earlier regex for a paragraph, we'd like to end up with some patterns similar to this:</p><p></p><ul style="text-align: left;"><li>Word: Letter+</li><li>Phrase: (Word Space)* Word</li><li>Sentence: Phrase+ '.' ('.' is the escaped name of the period literal pattern)</li><li>Paragraph: Sentence+</li></ul><p></p><p>Of course this would fail to match any but the most trivial paragraphs. We would want to have more sophisticated patterns. Maybe the Word pattern might look more like "Letter+ ('''' s | '''')?" to capture words with possessives expressed with apostrophe-S or just apostrophe suffixes. And so on.</p><h2 style="text-align: left;">The parsing process</h2><p>A typical linear parser produces a parse tree capturing the singular acceptable interpretation of the source text. The syntax is designed to guarantee that there really is only one valid interpretation. 
But my experience with natural language processing tells me that it makes more sense to produce multiple interpretations of some text and leave it to higher levels of processing to evaluate which one is best. Moreover, I'm starting with a parser that must discover the syntax rules.</p><p>One naïve way to approach this problem is to start at the first character and create a branching tree of all possible interpretations as I move forward. Using the above quoted text, the first word is "Findings". My first attempt would match the "F" pattern, the "Ff" pattern, the "Uppercase" pattern, and the "Letter" pattern that are the initial givens. The next step would be to move past the end of each of these matches and consider the next bit of text. In this case all of the matches are of single characters. If we already had a "Word" pattern then it would be 8 characters long. We would start matching at the next character after that final "s". The problem with this approach is that we have already matched 4 patterns on our first character. Then for each starting point after that we are going to match 4 more, thus making our tree have 16 end nodes after just 2 characters. Proceeding forward like this without even adding any extra patterns means our tree would have 1.6x10<sup>60</sup> tree endpoints once we reach character 100. That's not practical for most computers to work with.</p><p>My solution to this problem was to introduce what I call a "token chain" data structure that collapses the tree of all combinations of patterns down to a linear array with one element for each character in the source text. Each array element is itself a list of all matching patterns that start at that location. To produce this token chain is a simple matter. Starting from character 1, attempt to match all known patterns starting at each character position going forward. If a pattern does match then attach a token representing that match at that position. 
Actually, the token chain has two analogous arrays called "heads" and "tails". Each matching token gets attached to both arrays. Heads are attached wherever the first character of the match is. And then the last character of the match indicates where to attach it to the tails array. The tails array allows the algorithm to look backward and easily see which patterns precede any given token's start. For example, when looking at a token matching the "Word" pattern (Letter+) that starts at character 50, the algorithm can then look at the tails array at position 49 to see what matches end just before this word.</p><p>The above algorithm may seem like a bad idea. After all, if I match a word like "Findings" right away, shouldn't I just move on to the next character after it and start there, skipping all the characters in between? The problem is that this initial pass of parsing does not yet know what the "best" interpretation is. So it must try all possibilities. That means all the single-character patterns too. So matching of all patterns must happen starting at every character position right up to the last. The good news is that this process is actually very fast. Even with thousands of patterns defined.</p><p>Another interesting aspect of this initial parsing pass is that the more abstract patterns benefit from earlier parsing already done. Let's say we had "Word" defined as "Letter+". The patterns are all stored in the order in which they were introduced. The "Aa" pattern must be defined after "A" and "a" are. The "Letter" pattern must be defined after "Uppercase" and "Lowercase" are. Which means by the time we get to the "Word" pattern, we've already discovered that there is one "Letter" match at this location. Having said that, our "Word" pattern requires us to move forward character by character looking for more letters. We might not have looked far enough ahead yet. 
But as we do look for "Letter" at the next position, we are also looking to see if it matches "Uppercase" and so on all the way up the hierarchy of ever simpler patterns. And all along we are caching those matches at that head position and also caching them at their tail positions as well. This caching of all matches greatly speeds up the process. And it guarantees that all possible strings of pattern matches are covered by the time we're done matching the last character of the source text. Then we can easily hop our way from match to match in the token chain however we wish. If we hand-crafted the "Paragraph" pattern above we could hop from sentence token to sentence token with ease because they would already be matched. We wouldn't even have to do this because the Paragraph pattern would already have been matched by doing this all the way down to the single character level.</p><p>As you might imagine, I don't expect the token chain to be the final output of this parser. But for this experiment it is a sufficient one. The learning algorithm uses this token chain as its input.</p><h2 style="text-align: left;">The learning process</h2><p>All the above is really just the test harness. The real crux of this experiment is creating an algorithm for discovering the patterns that best capture the lexical elements I expect to be able to consume in some larger program. Namely sentences, words, quoted text, and so forth. So what is the algorithm? Before I continue I'll say that I haven't discovered one yet. What you'll read below is what I have tried thus far and some observations.</p><p>I'm basing this entire experiment on a premise: that information from the natural world is not random but structured. The same goes for human languages. We "design" them to be understood by other people. We may omit many details in order to communicate quickly. But the structure is still there. 
If the human mind can learn to capture those patterns with relative ease by learning to speak and eventually to read then it should be possible for a machine to see those patterns and learn to recognize and expect them in any written language.</p><p>In this case the knowledge of the system is the set of all defined patterns. So originating knowledge means constructing new patterns. And then testing their effects on parsing text. One way to propose new rules is to do so completely at random. Maybe "Letter '9' (Xx '3' '$')+" is worth trying out. But of course it is not. Why not? Because it is random. And language is not random.</p><p>The crux of what I have tried is to observe actual pairs of adjacent matching patterns in the token chain. What does this mean? Let's say we have a "Word_Space" pattern defined as "Word Space" and "Word" is simply "Letter+". Our source paragraph begins "Findings from this groundbreaking". Starting at character 1 we find a token whose pattern is "Word_Space" and whose matching text is "Findings ". Immediately after that token is another "Word_Space" token whose match is "from ". This is one pair of adjacent patterns found in the text. As you might imagine, there will be lots of other adjacent matches. Like "Letter" + "Word" matching "F" + "indings". And "Uppercase" + "Lowercase" matching "F" + "i".</p><p>We will actually get a massive number of these pairs of adjacent tokens (matches) as we survey the entire token chain. Every one of these can be directly translated into a new pattern. For each pair I can evaluate whether the two patterns are the same ("Letter" + "Letter") or different ("Letter" + "Word"). Let's call the first pattern "A" and the second "B" for this purpose. If A and B are identical then we'll hypothesize that there are many of these repetitions. We'll define a new pattern like "A+". Like "Letter+", "Ee+", or "Word_Space+". If A and B are different then we'll hypothesize that this pair occurs often in natural text. 
We'll create an "A B" pattern. Like "Uppercase Lowercase" or "Letter Word".</p><p>One natural problem with this is that there will be tens of thousands or more unique pairs of patterns found in even modest paragraphs of text. We need a way to winnow the pairs we consider down to a manageable number. One way I tried is to keep count of how many times I encounter each pair. Letter + Letter appears very often, for example. And then I can sort them from most common to least common and choose, say, the top 10 or top 100 to use to propose new patterns.</p><p>Okay. So now I have maybe 10 or 100 new patterns in the parser. Now what? Now I run the entire parsing process again with the newly expanded set of patterns. Why? Because I want to evaluate how useful each experimental rule is. What am I measuring? One option is to survey the token chain to see how many times each pattern matched something. I already know that I chose pairs of patterns that were found in the text, so I can be sure that all the patterns will match lots of things.</p><p>Then what? Keep iterating. Each new iteration will generate new patterns. These will generally be more and more abstract, building on earlier patterns.</p><h2 style="text-align: left;">Intermediate results</h2><p>Overall I’m fairly happy with this approach as a starting point. As I had hoped, the algorithm immediately discovers that “Letter+” is very effective at fitting much of the available text. When I see this proposed pattern I immediately name it “Word” for my own ease of understanding. It also fairly quickly discovers that “Word Space” describes a lot of the text. So does “(Word Space)+”. And eventually “(Word Space)+ Word”. In practice these gains come from a lot of tweaking of various numbers and biases. Under very expensive (in processing terms) conditions it eventually discovers the basic sentence as “(Word Space)+ Word ‘.’”. 
But whereas I expect this to be an easy win, it actually gets harder for the algorithm to make this sort of progress. Why?</p><p>What I did not talk much about earlier is how I’m deciding to winnow down the many options I can pursue. Remember how I said I can take the top 10 pairs I found for creating 10 new patterns based on how many matches there were in the token chain? This perversely rewards generalized patterns that match as few characters as possible, like “Letter Lowercase”, “Uppercase Lowercase Letter”, and so forth. They can crowd out more useful patterns like “(Word Space)+”. I also tried keeping track of total match lengths. But this means that a pattern like “Word” = “Letter+” would match “Findings”, “indings”, “ndings”, and so on, quickly racking up a total match length of 8 + 7 + 6 + … + 1 = 36 for this one word. I then introduced the metric of “coverage”. In that case I’m counting how many of the total characters in the source text are matched, even if in duplicate. So Word matching “Findings”, “indings”, and so forth would still have a total coverage of 8 characters for that word. That helped a lot. I also introduced a metric I call “stretch”. That measures how many characters from the very first one in the source text are matched. For Word the stretch measure for our source paragraph would be 8. For Sentence (if we ever got there) the stretch value would be the length of the first sentence in characters.</p><p>Each of these metrics is kept with its respective pattern and accumulated during the “survey” process after pattern matching produces the token chain. There is a separate data structure for keeping track of all the unique pairs of patterns (e.g. Letter + Lowercase) and their counts. Originally I tried literally counting how many times the given pair could be found. This creates the same basic bias problem of totalling up the match counts for each single pattern. 
I experimented with summing up coverages for each pattern pair in the same way I did for single patterns.</p><p>Experimenting with different metrics I collect about individual patterns and pattern pairs changes which patterns the algorithm ends up proposing and experimenting with. This is because in each case I am limiting how many I’ll try out in each iteration. There is also a culling process after an iteration is through where I toss out patterns that did not perform well compared to the others. Again based on the metrics I keep about each pattern’s performance during parsing.</p>
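<p>To make these metrics concrete, here is a minimal Python sketch (the project itself is not written in Python, and the function name and data shapes here are my own illustrative assumptions) showing how match count, total match length, coverage, and stretch could be computed for a single pattern from its list of (start, length) matches:</p>

```python
def pattern_metrics(matches):
    """Compute the four metrics discussed above for one pattern.

    `matches` is a list of (start, length) pairs (0-based) for every
    place the pattern matched in the source text. Names and shapes
    here are illustrative, not the project's actual code.
    """
    match_count = len(matches)
    total_match_length = sum(length for _, length in matches)
    # Coverage: distinct characters touched by at least one match, so
    # overlapping matches like "Findings" and "indings" are not
    # double-counted.
    covered = set()
    for start, length in matches:
        covered.update(range(start, start + length))
    coverage = len(covered)
    # Stretch: how far the contiguous run of matched characters
    # extends from the very first character of the source text.
    stretch = 0
    while stretch in covered:
        stretch += 1
    return match_count, total_match_length, coverage, stretch

# "Word" (Letter+) matching every suffix of the 8-letter "Findings":
suffix_matches = [(i, 8 - i) for i in range(8)]
print(pattern_metrics(suffix_matches))  # (8, 36, 8, 8)
```

<p>Note how the total match length (36) explodes relative to coverage (8) for overlapping matches of one 8-letter word, which is exactly the bias described above.</p>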
<h2 style="text-align: left;">Observations</h2><p>Here’s an example of just some of the patterns this algorithm comes up with after 4 iterations using some fairly modest settings. You can see I have named some of them like “Word” and “Word_Space” to help myself make some sense of the patterns. These names are reflected in later patterns too. This isn’t the full set from this particular run, but only the start of the experiments.</p>
<pre style="margin-left: 3em;">Id | Name | Type | Pattern
124 | | Experimental | Letter Lowercase
125 | Word | Derived | Letter+
126 | | Experimental | Lowercase+
127 | | Experimental | Lowercase Letter
224 | | Experimental | Space Word
225 | | Experimental | Space Letter Word
226 | | Experimental | Space Letter Lowercase+
227 | | Experimental | Letter Space Word
228 | | Experimental | Lowercase Space Word
229 | | Experimental | Space Lowercase+
230 | | Experimental | Space Lowercase Word
231 | | Experimental | Space Lowercase Lowercase+
232 | | Experimental | Word Space Letter
233 | Word_Space | Derived | Word Space
234 | | Experimental | Lowercase+ Space Letter
235 | | Experimental | Letter Word
236 | | Experimental | Letter Lowercase+
237 | | Experimental | Lowercase+ Space
238 | | Experimental | Lowercase Word
239 | | Experimental | Lowercase Lowercase+
254 | | Experimental | (Letter Lowercase)+
314 | Word_Spaces | Derived | Word_Space+
315 | | Experimental | Word Space Word
316 | | Experimental | Word_Space Word
317 | | Experimental | Word Space Letter Word
318 | | Experimental | Word Space Letter Lowercase+
319 | | Experimental | Word_Space Letter Word
320 | | Experimental | Word_Space Letter Lowercase+
321 | | Experimental | Letter Word Space Word
322 | | Experimental | Letter Lowercase+ Space Word
323 | | Experimental | Letter Word Space Letter Word
324 | | Experimental | Letter Word Space Letter Lowercase+
325 | | Experimental | Letter Lowercase+ Space Letter Word
326 | | Experimental | Letter Lowercase+ Space Letter Lowercase+
327 | | Experimental | Lowercase+ Space Word
328 | | Experimental | Lowercase+ Space Letter Word
329 | | Experimental | Lowercase+ Space Letter Lowercase+
330 | | Experimental | Lowercase Word Space Word
331 | | Experimental | Lowercase Lowercase+ Space Word
332 | | Experimental | Word_Space (Letter Lowercase)+
333 | | Experimental | Lowercase Word Space Letter Word</pre>
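<p>For reference, the heart of the proposal step that produced these experiments is very simple: each surveyed pair of adjacent matches yields either a repetition or a sequence. A sketch (the helper name is mine, not the actual code):</p>

```csharp
// Propose a new pattern from a surveyed pair of adjacent matches (A, B):
// a repeated pattern becomes "A+", anything else becomes the sequence "A B".
static string ProposeFromPair(string a, string b) =>
    a == b ? a + "+"       // (Letter, Letter) -> "Letter+"   (named Word above)
           : a + " " + b;  // (Word, Space)    -> "Word Space" (named Word_Space)
```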
<p>It’s apparent to me that, despite being generated from observations of nonrandom data, these patterns are still fairly random. Expanding the number of candidates I take or increasing the number of iterations mainly increases the number of dubious patterns proposed. Tweaking how I measure utility — by match count, matching character count, coverage, or stretch — does influence which patterns are proposed, sometimes for the better. But it’s clear that this overall approach is missing one or more things to help focus it on getting “smarter” in a way I can relate to.</p><p>I think one problem is that I’m not tying the utility of one pattern to the utility of the larger patterns that rely on it. If the algorithm discovers the basic sentence structure then the reward for that should be high. And the reward for the words that compose it should also be high as a consequence. This example also suggests that there should be a reward for higher order patterns that match a large percentage of the source paragraphs with a small number of instances. 5 sentences should be worth more than the 40 words, 30 spaces, and 5 periods that compose them, even though they cover the exact same characters. But the reward for those 5 matches should also be shared downward to the lower level patterns so they are favored over less productive ones.</p><p>One clear problem is that only looking at pairs of matching A + B patterns and only constructing “A+” or “A B” patterns from each pair is very limited. It completely ignores most of my pattern expression language’s capabilities, certainly the other quantifiers like “A*”, “A?”, and so on. But more egregiously it ignores the power of alternatives. The words in a sentence are usually separated by spaces, but sometimes by commas, semicolons, dashes, and so on. 
There’s no way for my current algorithm to construct something like Word_Separator = “Space | ‘,’ Space | ‘;’ Space | Space ‘-’ Space”, for example.</p><p>Overall I still consider this a success so far. My parsing mechanism is solid. Once I got that working I did not need to change it. Most of my work then went into the token-chain surveying, pattern proposal, and pattern culling mechanisms. Going with simple pairs and constructing ever larger “A+” and “A B” patterns got the algorithm fairly far in discovering the structure of written English in the source texts.</p>
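<p>To illustrate what supporting alternatives would require, here is a minimal sketch of a pattern representation that can express them. The type names are hypothetical, not my actual classes:</p>

```csharp
// A tiny pattern AST including the alternation ("|") capability that
// pair-based proposals can never produce. All names are illustrative.
public abstract record Pattern;
public record Literal(string Text) : Pattern;   // a quoted character like ','
public record Ref(string Name) : Pattern;       // a named pattern like Space
public record Seq(Pattern[] Parts) : Pattern;   // A B
public record Alt(Pattern[] Options) : Pattern; // A | B

public static class Example
{
    // Word_Separator = Space | ',' Space | ';' Space | Space '-' Space
    public static readonly Pattern WordSeparator = new Alt(new Pattern[] {
        new Ref("Space"),
        new Seq(new Pattern[] { new Literal(","), new Ref("Space") }),
        new Seq(new Pattern[] { new Literal(";"), new Ref("Space") }),
        new Seq(new Pattern[] { new Ref("Space"), new Literal("-"), new Ref("Space") }),
    });
}
```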
<h2 style="text-align: left;">Observations</h2><p>I’m generally happy with this so far. But I’m nowhere near done. I still get the sense that my program is getting progressively dumber instead of smarter. It does not demonstrate any genuine ability to recognize that some of the patterns it discovers are very good at capturing the contents of the source texts it is fed. We get those “a-ha” moments as we learn to read. Some new pattern clicks and it is apparent to us how useful it is. I don’t believe that “a-ha” moment is a result of magic. I think we apply the new pattern to what we read and see that it allows us to read that much better going forward. And that’s what this algorithm should be able to do.</p><p>To that end I need to put more thought into what it means to survey the token chain that the parser produces. I don’t think there’s anything wrong with how that token chain is produced for now. It allows the algorithm to clearly see what patterns have already been matched through an exhaustive search. But I should be able to see how far the higher order patterns are able to parse through the source text before they get stuck on some unrecognized pattern.</p><p>I think I should also introduce the concept of “holes” in the parsing results. Let’s say I have some sentence that contains “approximately 30% of all adults”. And say I have a trivial notion of a sentence as composed of letters-only words separated by spaces and followed by a period. That “30%” is going to break the attempt to match the sentence. As a human observer I know that “30%” fills the same role as any other word. I ultimately want my learning algorithm to discover this on its own. So it seems like it could be useful to skip past “30%” and see if the rest of the sentence matches the expected pattern. If it does then I might well hypothesize that “30%” is somehow a word. Then comes the harder task of breaking it down to its inner pattern. Is it literally “30%”? You and I know it’s not. 
It’s actually a string of digits followed by “%”. And we know that it could optionally include other characters, as with “30.6%” or “1,234%”. This “holes” idea could be fruitful and does seem a lot like how we humans learn to cope with novel content in text we read. We seem to read around the weird stuff and then come back to consider it on its own.</p><p>I also plan to consider more than just adjacent pairs of matching patterns. I think it would be worthwhile during surveying to look for more sophisticated patterns in the token chain before it even gets to the point of proposing new patterns. For example I should look for lists of repetitions with optional separators, like “Word Space Word Space … Space Word”. I would expect that when this survey process comes across long repetitions like this, they should immediately be regarded as very high value candidates. That exemplifies the very idea that there is structure in nature and language. I would also like to search for “containment” type patterns, such as Prefix + Body + Suffix. Ideally it would be able to discover balanced parentheses, single quotes, double quotes, brackets, and braces.</p><p>One other aspect I want to explore is looking inward and not just outward. What I mean is that I would like this algorithm to attempt to discover patterns within words, for example. Like how most words in a text are all lowercase, yet some begin with a capital letter. Or how many words contain the “ch” letter pair or end in “s”. My current thinking is that the algorithm should prioritize maximizing coverage of patterns in search of the uber “Paragraph” pattern that will match all paragraphs. The problem is that it may trivialize some sub-patterns. It might consider “(” to be a word, for example, and never discover that “(” is usually paired with “)” to contain an inner sub-sentence inside a sentence.</p><p>Bottom line is that there’s a lot more to try out. 
I don’t think I will completely exhaust this entire topic. But I do think I can make a lot more progress.</p><p>I also think that what I’m doing here is crafting a generalized learning algorithm. I am applying it presently to one specific task in English. But I think that there are larger meta-patterns at work that apply to a larger set of data recognition and parsing problems. One very straightforward example is that I believe that once this algorithm works well with the current starting patterns, I should be able to throw away all the derived patterns. No Aa = “A | a”, Lowercase, or Letter patterns given as a starting point. I believe this algorithm should be able to even discover those things in the same way I am expecting it to discover digits as a unified character class. But I suspect this approach may even be useful in discovering semantic patterns in how words are put together to form larger ideas.</p><p>I also think that this algorithm can ultimately be used in a manner that it continues to learn as it encounters text. I’ve never liked the idea of turning off learning in a neural network or other machine learning algorithm and letting its knowledge remain forever fixed. I see the algorithm I’m exploring as perfectly capable of staying on and getting innovative when it comes across new patterns in text. Ideally it would also be able to call out when some text does not fit the usual patterns and possibly propose corrections. Think of it as being analogous to a spell or grammar checker. 
I also see human knowledge workers assisting the algorithm by giving names to patterns, offering corrections to dubious patterns, flagging some patterns as known no-nos, approving novel discoveries as worth keeping, and so on.</p><p>There’s a lot more to explore in this algorithm and approach to machine learning.</p>Jim Carnicellihttp://www.blogger.com/profile/17916650503215873212noreply@blogger.com1tag:blogger.com,1999:blog-6262682529872030736.post-72281249666078383912021-10-09T16:26:00.030-07:002022-03-31T10:20:43.659-07:00Neural network in C# with multicore parallelization / MNIST digits demo<p>I've been working for a couple weeks on building my first fully functional artificial neural network (ANN). I'm not blazing any news trails here by doing so. I'm a software engineer. I can barely follow the mathematical explanations of how ANNs work. For the most part I have turned to the source code others have shared online for inspiration. In most cases I've struggled to understand even that, despite programming for a living.</p>
<p>Part of the challenge is that more than a few of those demos have surprised me by being nonfunctional. They did something for sure. They just didn't learn anything or perform significantly better than random chance at making correct predictions, no matter how many iterations they went through. Or they had bugs that prevented them from implementing the well-worn basic backpropagation algorithm correctly.</p>
<p>I mostly worked from C# examples when I could find them. One thing that was a genuine struggle to deal with is my sense that they all derived from one source from a decade and a half ago that itself had bugs and struck me as poorly structured to begin with. In short, I found it hard to read most of the source code samples I found because, in my opinion, they were written in cryptic ways. Along the way I wrote and rewrote from scratch. If I couldn't duplicate what was contained in one demo I might download and run it directly within my project. And usually I would find it wouldn't work for one reason or another. I was amazed that people blogged about the subject without apparently confirming that their own code worked properly.</p>
<p>For a while I was very frustrated because I was seeing a strange behavior nobody else had documented. My models would train and get very good. And then their accuracy rates would start falling off as though rolling down the other side of a hill. I spent over a week trying to figure out the cause. Ultimately I discovered an extra loop in my code that influenced training in a way that didn't demolish it completely, but which somehow compounded after a while to eventually undo all of the training. Once I fixed that I immediately started seeing my code behaving like everybody else's. Hooray!</p>
<p>I know there are lots of code samples out there already. But here is my own. My previous blog post showed a simplified version of this, short enough to paste directly on the page. In this case I'm going to instead point you to a GitHub repository with my complete program in it:</p>
<blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;"><a href="https://github.com/JimCarnicelli/BasicNeuralNetwork">https://github.com/JimCarnicelli/BasicNeuralNetwork</a></p></blockquote>
<p>I'm also going to skip trying to write an extensive explanation of how an ANN works, including the backpropagation algorithm. I think that has been very well covered on so many other websites that I would have little more of value to add. So I'll just tell you a little more about what's in my demo code.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhARIwnKpwWRodWY67gKX6sR8SxCsPsZm7Yldc8g8w7j5f-UPAADllK7vRd7hgA6zA9Mgjw2y599Pgs5ECsxfGT2OlMcY0TGMdK9sO2HbfFZtihV4q9Ys8pTSk6_gPWkTcbOvDffBQ16-Cc/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="184" data-original-width="278" height="219" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhARIwnKpwWRodWY67gKX6sR8SxCsPsZm7Yldc8g8w7j5f-UPAADllK7vRd7hgA6zA9Mgjw2y599Pgs5ECsxfGT2OlMcY0TGMdK9sO2HbfFZtihV4q9Ys8pTSk6_gPWkTcbOvDffBQ16-Cc/w330-h219/Solution.png" width="330" /></a></div><p></p><p>For starters, my demo has a solidly OOP structure. The very reusable basis is a set of NeuralNetwork, Layer, and Neuron classes. These classes are well oriented toward the basics of both training and later practical use. NeuralNetwork features .FromJson() and .ToJson() methods for serialization of the trained state of a model. The layers can be separately configured to use different learning rates and activation functions, including Softmax, TanH, Sigmoid, ReLU, and LReLU. You can have as many hidden layers as you want too. NeuralNetwork offers various ways to inject input values and get your output, including the .Classify() method, which gives you an integer representing which output neuron had the highest value and is thus the predicted class. I've added a lot of inline comments to help explain everything for both the practical programmer and the programmer looking to understand the inner workings.</p>
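<p>To give a flavor of the API, here is a hypothetical usage sketch pieced together from the method names mentioned above and in the XOR demo further down; see the repository for the exact signatures:</p>

```csharp
// Hypothetical usage of the NeuralNetwork class described above. The
// AddLayer argument pattern follows the XOR demo later in this post;
// the Sigmoid enum member name is my assumption.
var nn = new NeuralNetwork();
nn.AddLayer(784);                                              // input layer
nn.AddLayer(100, true, ActivationFunctionEnum.Sigmoid, 0.1f);  // hidden layer
nn.AddLayer(10, true, ActivationFunctionEnum.Sigmoid, 0.1f);   // output layer

nn.SetInputs(pixelValues);  // 784 floats for one digit image
nn.FeedForward();
int digit = nn.Classify();  // index of the output neuron with the highest value

string json = nn.ToJson();  // serialize the trained state for later reuse
```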
<p>I didn't want to only focus on readability. I also put a lot of thought into performance. Starting with memory. You might think that instantiating one Neuron instance for each logical neuron would be very memory wasteful but it's not. I tested this with very large test networks with thousands and even millions of neurons. As the number of neurons grows and thus the number of interconnections among them, the size of the memory footprint of the network approaches 4 bytes times the total number of input weights. That's 4 bytes per floating point number, which is the common currency for this code. So if you had a network with 1,000 hidden-layer neurons and 1,000 output-layer neurons, that accounts for 1,000,000 input weights and thus the total network will take up around 4MB of memory. Which is quite compact. One thing my code does not do during training or behaving is allocate temporary arrays or collections that then go away. That saves memory and speeds things up.</p>
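<p>That footprint estimate is easy to sanity-check with a back-of-envelope helper (not part of the library):</p>

```csharp
// Weights dominate the memory footprint at 4 bytes (one float) per
// connection between adjacent layers.
static long EstimateWeightBytes(params int[] layerSizes)
{
    long weights = 0;
    for (int i = 1; i < layerSizes.Length; i++)
        weights += (long)layerSizes[i - 1] * layerSizes[i];
    return weights * sizeof(float);
}

// The example above: 1,000 x 1,000 = 1,000,000 weights, so
// EstimateWeightBytes(1000, 1000) yields 4,000,000 bytes (~4MB).
```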
<p>The structure of my code lends itself to speedy execution too. I just configured a network for the <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST</a> demo with 784 inputs, 100 hidden-layer neurons, and 10 output neurons. The latter 2 layers use the sigmoid activation function. In about 75 seconds it churns through 100k training iterations on my laptop, which has an 8-core CPU with 16 logical processors running at around 4.27 GHz. I think this is decent performance and the result of reasonably optimized coding. But I also added a switch to enable each layer to spread the training and behaving calculations out across all the computer's processors. In my fairly small tests this about doubles the speed. With larger networks I start seeing 7x speedups. I haven't tried it on any very large networks with millions of nodes yet. Hopefully it starts approaching a 16x speedup for 16 cores and so forth.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif14hqWjEI25uS7uU41_1xOpbjeJK-Vx9TZyLmcMirY40A0z1YqdlTg91xSRal2YH9lQLaLQW-1nUqd-dv3DuYsrq1TP-54gIMtFS1KKHHXUdRwIEKMMv07ESDj7I2nSyz-R_6E82zPcPm/" style="margin-left: 1em; margin-right: 1em;"><img alt="Windows CPU utilization sample" data-original-height="908" data-original-width="781" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif14hqWjEI25uS7uU41_1xOpbjeJK-Vx9TZyLmcMirY40A0z1YqdlTg91xSRal2YH9lQLaLQW-1nUqd-dv3DuYsrq1TP-54gIMtFS1KKHHXUdRwIEKMMv07ESDj7I2nSyz-R_6E82zPcPm/w549-h640/image.png" width="549" /></a></div>
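<p>The per-layer parallelization switch works because each neuron's weighted sum is independent of its siblings'. A simplified sketch of the idea (not the actual class code), assuming a tanh layer:</p>

```csharp
using System;
using System.Threading.Tasks;

// Spread one layer's feed-forward pass across all available cores.
static void FeedForwardParallel(float[] inputs, float[][] weights,
                                float[] biases, float[] outputs)
{
    Parallel.For(0, outputs.Length, n =>
    {
        float sum = biases[n];
        for (int i = 0; i < inputs.Length; i++)
            sum += weights[n][i] * inputs[i];  // every neuron reads shared inputs
        outputs[n] = (float)Math.Tanh(sum);    // but writes only its own output
    });
}
```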
<p>My code has a shabby set of demos included. One is a classic <a href="https://towardsdatascience.com/how-neural-networks-solve-the-xor-problem-59763136bdd7">XOR gate</a> demo. This one is great for study because it is such a small network that you can visualize the whole thing fairly easily.</p>
<p>The second demo similarly involves synthetic data in the problem of learning to classify all of the ASCII characters from 32 (space) to 95 (underscore) as Whitespace, Symbol, Letter, Digit, or None. This network has 7 inputs representing the 7 bits needed for these characters, 10 hidden-layer neurons, and 5 output neurons representing the character classes. This one uses the <a href="http://laid.delanover.com/activation-functions-in-deep-learning-sigmoid-relu-lrelu-prelu-rrelu-elu-softmax/">LReLU</a> activation function.</p>
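<p>Deriving the 7 inputs from a character is straightforward. One plausible sketch (not necessarily how the demo does it):</p>

```csharp
// Turn a 7-bit ASCII character into 7 binary network inputs.
static float[] CharToInputs(char c)
{
    var inputs = new float[7];
    for (int bit = 0; bit < 7; bit++)
        inputs[bit] = (c >> bit) & 1;  // each bit becomes a 0 or 1 input
    return inputs;
}
```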
<p>The final demo uses the aforementioned <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST</a> data set. The files can be found on the <a href="http://yann.lecun.com/exdb/mnist/">MNIST project page</a>. I was frustrated by how slowly the source data files loaded each time, so I included a utility function to convert the training and test file pairs into pure .PNG files. Here's the smaller test image:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://raw.githubusercontent.com/JimCarnicelli/BasicNeuralNetwork/master/Data/Mnist%20images/Test%20images.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="800" data-original-width="800" height="640" src="https://raw.githubusercontent.com/JimCarnicelli/BasicNeuralNetwork/master/Data/Mnist%20images/Test%20images.png" width="640" /></a></div>
<p>If you zoom in very close you will see single red pixels in the upper left corner of each digit tile. Since every source pixel was white I decided to pack the digit's value into the blue channel. In the example below you can see that the "7" digit's actual value is packed into the (255, 0, 7) RGB value.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY_5fWwgGVWIIRRFQq6Sf-e3a8MDNhrnX3CQ4kCQfjD5cETjlshljPu4dmMUIoujY95mkhM-NChuGRK5yd_zuyoH-H0PKGQc-LvGx0Q2nRvGte5VHPyYYOB1FS-xEuVdSiQEHx0OyRjaDf/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="548" data-original-width="1007" height="348" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY_5fWwgGVWIIRRFQq6Sf-e3a8MDNhrnX3CQ4kCQfjD5cETjlshljPu4dmMUIoujY95mkhM-NChuGRK5yd_zuyoH-H0PKGQc-LvGx0Q2nRvGte5VHPyYYOB1FS-xEuVdSiQEHx0OyRjaDf/w640-h348/image.png" width="640" /></a></div><div><br /></div>
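<p>Reading a tile's label back out is then trivial. A sketch using System.Drawing (the real loader may differ):</p>

```csharp
using System.Drawing;

// The marker pixel in a tile's upper-left corner has red = 255, green = 0,
// and the digit's value 0-9 packed into the blue channel.
static int? ReadTileLabel(Bitmap sheet, int tileX, int tileY)
{
    Color marker = sheet.GetPixel(tileX, tileY);
    if (marker.R == 255 && marker.G == 0)
        return marker.B;  // e.g. (255, 0, 7) -> the digit 7
    return null;          // not a labeled tile corner
}
```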
<p>Doing this one-time transformation of these 4 files into 2 PNGs is great. The PNG files are smaller and load much faster. And they are easier to visualize using an ordinary image viewer or paint program.</p><p>My results seem in line with those documented on the <a href="http://yann.lecun.com/exdb/mnist/">MNIST project site</a>. I experimented with a lot of network configurations and parameter values. But for one concrete example, I configured the single hidden layer with 100 neurons using the sigmoid activation function and a learning rate of 0.1. My accuracy rate on the test set was 97.3% after 1M training iterations. That's a 2.7% error rate. The closest comparison I can see in the table of results is listed as "<span style="background-color: white;">2-layer NN, 1000 hidden units" with no preprocessing (same as mine). The error rate on that is reported as 4.5%. That was from a <a href="http://yann.lecun.com/exdb/publis/index.html#lecun-98">1998 paper by LeCun et al</a>.</span></p><p>Quick note. You're going to need to edit this line in Program.cs to point to your own project data folder immediately under the project root folder:</p>
<pre style="background-color: white; margin: 0em; overflow: auto;"><code style="color: black; font-family: Consolas, "Courier New", Courier, monospace; font-size: 10pt;"><span style="color: blue;"> static</span> <span style="color: blue;">string</span> dataDirectory = @<span style="color: #a31515;">"G:\My Drive\Ventures\MsDev\BasicNeuralNetwork\Data\";</span></code></pre>
<p>I want to emphasize that while I have worked on this code for a couple weeks and feel very good about it, I can't guarantee it is without bugs. I encourage you to comment here or reach out to me if you find any bugs, no matter how small.</p><p>I'd also welcome hearing from you about your experiences in using this code for your own projects. Cheers.</p><div><br /></div>
Jim Carnicellihttp://www.blogger.com/profile/17916650503215873212noreply@blogger.com1tag:blogger.com,1999:blog-6262682529872030736.post-10175043026506825962021-09-23T05:48:00.017-07:002021-10-06T08:50:09.669-07:00Back in the saddle / C# neural network demo<p>I felt like I was making good progress back in 2016 in my AI research. But I realized it was not going to turn into an income-yielding venture anytime soon. So I moved on to other ventures. This was during a sabbatical from "real work". I eventually gave up and got a real job again. While that job pays reasonably well, it's also fairly boring. Kira and I moved from Madison to Miami this year. For various reasons I decided to do a YouTube channel for a while to give my impressions of life here. That was fun for a while. Recently I got somewhat bored with that and put it on hold.</p><p>I was in a bit of a funk about what I should do with my free time when I'm not doing my regular work. I decided to resume my AI research. I've been dipping my toes in this week, reading back on my own blog to reconnect with my most recent project. I decided to start into the subject of artificial neural networks (ANNs) again.</p><p>I'm a little embarrassed to admit that I never wrote a traditional ANN until last night. I don't just want to play with existing algorithms. I feel compelled to write one from scratch to make sure I can genuinely understand how they work from the experience. This is actually something I've wanted to do since somewhere around 1991 and just never got around to. One key reason is that I've struggled to understand the code I have seen in the past and even more to translate the arcane mathematics into algorithms. Each time I tried in the past I found that what I wrote did not functionally learn anything. 
The documented explanations were always missing some ingredient required for me to have a complete enough understanding.</p><p>I was running into that same problem last night when I dug into the problem again. I decided to stick with C# this time so I wouldn't get sidetracked by getting back into doing C++ development again. I looked for some online articles about ANNs in C#. I found quite a few. And as per usual, most of them made it difficult for me to make sense of what was going on in their algorithms. I didn't want to just copy one and call it done. Moreover, most of them seemed to make poor use of memory. I ultimately settled on a super simplified, fairly hardcoded demo that trains to <a href="https://towardsdatascience.com/how-neural-networks-solve-the-xor-problem-59763136bdd7">mimic a logical XOR gate</a>. Simple training set. Easy to verify. And the sample code was compact and readable. Off to a great start. I pasted it into my test program and ran it as-is. I mirrored what I gleaned from its design in a brand new set of classes designed to generally optimize memory and performance for large networks. I even chose the same XOR test case.</p><p>What was driving me bonkers was that once again it was never learning this trivial function. Most of the few hours that I worked on it were devoted to staring at the two pieces of code and trying to figure out where the bugs in mine were. Most had to do with addressing the wrong array elements. And yet after all that the network was just never settling into a meaningful behavior. Along the way I discovered the truth that neither did the demo code I had copy-and-pasted. It suffered the same problem my code did. 
I honestly can't believe the author went through all that trouble to craft the code and blog about it without meaningfully proving that it works.</p><p>I eventually figured out that the <a href="https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6">activation functions</a> (AF) I was experimenting with were critical to get right. I was blindly implementing them from examples I found online and not working to ensure that they were meaningful and that I chose the correct first derivatives of them for backpropagation purposes. I eventually got a proper implementation of the hyperbolic tangent (aka "tanh") AF and its correct 1 - tanh(X)<sup>2</sup> first derivative. And finally my demo program worked. First time ever for me.</p><p>In some ways I credit the wealth of relatively new web pages out there discussing ANNs for programmers. Most of them weren't around back in 2016 when I was last exploring classifier systems. But I mainly credit my absolute determination to finally crack this. I wasn't going to accept a nonfunctional algorithm and set the code aside for later again.</p><p>I was intrigued to learn about the <a href="https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/">rectified linear unit (ReLU)</a> activation function. I implemented it as an alternative AF in my program. Sometimes it works. And sometimes its weights veer off to infinity. I was afraid that would happen because of its partial linearity. I still haven't fully wrapped my head around ReLU and the "leaky" version (LReLU) yet. But it sounds like in recent years most ANNs have shifted toward ReLUs. The calculations are way less expensive than for sigmoid-type AFs. And the fact that they are not squished into a narrow range (0 to 1 or -1 to 1) apparently makes deep learning algorithms with multiple hidden layers possible. I get why in a basic sense. 
But I need to study this a lot further before I'll feel more confident in it.</p><p>I'm just getting started with my ANN experiments. I need to construct better training scenarios and shake this out some more. If you are interested in seeing my code, here it is. It's a simple console project using .NET 5. I find that on most runs it gets fully trained and working correctly by around iteration 1,000. But sometimes it does not settle at all before it stops at 10,000. You'll know it has successfully settled when you stop seeing the "(wrong)" markers that indicate an incorrect prediction. Note that the NeuralNetwork class will let you add as many middle layers as you want. But so far I've only tested with one.</p><p>*** 10/6/2021 update ***</p><p>After a couple weeks of experiments I decided to replace this code with a trimmed down demo version of my latest. There was a critical bug that was causing the training to eventually collapse. So many of the C# code samples I studied along the way had their own bugs and/or were extremely difficult for me to follow. Most appeared to be tweaked versions of one single flawed demo shared nearly a decade ago. And all had significant limitations on what you could do with the final product. I've stripped out some L1+L2 regularization experiments I'm dabbling with, as well as the JSON serialization for loading and saving state.</p><p>I'm hoping that my demo will help others struggling with understanding basic ANNs and applying them to their own projects.</p><p>I think I've pounded out all the real bugs. If you find a bug, please do let me know!</p><p><br /></p>
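<p>For reference, the tanh activation and the first derivative that finally made training work can be written as:</p>

```csharp
using System;

// Hyperbolic tangent activation and its derivative. During backpropagation
// the derivative is evaluated at the neuron's already-activated output o,
// so 1 - tanh(x)^2 reduces to simply 1 - o*o.
static float Activate(float x) => (float)Math.Tanh(x);
static float Derivative(float output) => 1f - output * output;
```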
<p>Typical output:</p>
<pre style="width: 50em; padding: 1em; background-color: black; color: white;">
Iteration   Inputs    Output  Valid?    Accuracy
        0  1 xor 0 =   0.599            100.0% |----------|
    1,000  1 xor 0 =   0.557             42.0% |----      |
    2,000  1 xor 1 =   0.520  (wrong)    42.0% |----      |
    3,000  0 xor 1 =   0.490  (wrong)    50.0% |-----     |
    4,000  1 xor 0 =   0.551             70.0% |-------   |
    5,000  1 xor 1 =   0.592  (wrong)    62.0% |------    |
    6,000  1 xor 0 =   0.521             29.0% |---       |
    7,000  0 xor 0 =   0.368             41.0% |----      |
    8,000  1 xor 1 =   0.532  (wrong)    65.0% |-------   |
    9,000  0 xor 0 =   0.284             82.0% |--------  |
   10,000  1 xor 1 =   0.587  (wrong)    70.0% |-------   |
   11,000  1 xor 0 =   0.569             78.0% |--------  |
   12,000  1 xor 1 =   0.417            100.0% |----------|
I've had 1,000 flawless predictions recently. Continue anyway?
</pre>
<p>Code:</p>
<pre style="margin:0em; overflow:auto; background-color:#ffffff;"><code style="font-family:Consolas,"Courier New",Courier,Monospace; font-size:10pt; color:#000000;"><span style="color:#0000ff;">using</span> System;
<span style="color:#0000ff;">using</span> System.Collections.Generic;
<span style="color:#0000ff;">namespace</span> BasicNeuralNetworkDemo {
<span style="color:#0000ff;">class</span> Program {
<span style="color:#0000ff;">static</span> <span style="color:#0000ff;">void</span> Main(<span style="color:#0000ff;">string</span>[] args) {
<span style="color:#0000ff;">var</span> nn = <span style="color:#0000ff;">new</span> NeuralNetwork();
nn.AddLayer(2);
nn.AddLayer(2, <span style="color:#0000ff;">true</span>, ActivationFunctionEnum.TanH, 0.01f);
nn.AddLayer(1, <span style="color:#0000ff;">true</span>, ActivationFunctionEnum.TanH, 0.01f);
<span style="color:#0000ff;">float</span>[][] training = <span style="color:#0000ff;">new</span> <span style="color:#0000ff;">float</span>[][] {
<span style="color:#0000ff;">new</span> <span style="color:#0000ff;">float</span>[] { 0, 0, 0 },
<span style="color:#0000ff;">new</span> <span style="color:#0000ff;">float</span>[] { 0, 1, 1 },
<span style="color:#0000ff;">new</span> <span style="color:#0000ff;">float</span>[] { 1, 0, 1 },
<span style="color:#0000ff;">new</span> <span style="color:#0000ff;">float</span>[] { 1, 1, 0 },
};
Console.WriteLine(<span style="color:#a31515;">"Iteration Inputs Output Valid? Accuracy"</span>);
<span style="color:#0000ff;">int</span> maxIterations = 1000000;
<span style="color:#0000ff;">var</span> corrects = <span style="color:#0000ff;">new</span> List<<span style="color:#0000ff;">bool</span>>();
<span style="color:#0000ff;">int</span> flawlessRuns = 0;
<span style="color:#0000ff;">int</span> i = 0;
<span style="color:#0000ff;">while</span> (i < maxIterations) {
<span style="color:#0000ff;">int</span> trainingCase = NeuralNetwork.NextRandomInt(0, training.Length);
<span style="color:#0000ff;">var</span> trainingData = training[trainingCase];
nn.SetInputs(trainingData);
nn.FeedForward();
nn.TrainingOutputs[0] = trainingData[2];
<span style="color:#0000ff;">bool</span> isCorrect = (nn.OutputLayer.Neurons[0].Output < 0.5 ? 0 : 1) == nn.TrainingOutputs[0];
corrects.Add(isCorrect);
<span style="color:#0000ff;">while</span> (corrects.Count > 100) corrects.RemoveAt(0);
<span style="color:#0000ff;">float</span> percentCorrect = 0;
<span style="color:#0000ff;">foreach</span> (<span style="color:#0000ff;">var</span> correct <span style="color:#0000ff;">in</span> corrects) <span style="color:#0000ff;">if</span> (correct) percentCorrect += 1;
percentCorrect /= corrects.Count;
<span style="color:#0000ff;">if</span> (percentCorrect == 1) flawlessRuns++;
<span style="color:#0000ff;">else</span> flawlessRuns = 0;
nn.Backpropagate();
<span style="color:#0000ff;">if</span> (i % 100 == 0) {
#region Output state
Console.WriteLine(
RightJustify(i.ToString(<span style="color:#a31515;">"#,##0"</span>), 9) + <span style="color:#a31515;">" "</span> +
trainingData[0] +
<span style="color:#a31515;">" xor "</span> +
trainingData[1] + <span style="color:#a31515;">" = "</span> +
RightJustify(<span style="color:#a31515;">""</span> + nn.OutputLayer.Neurons[0].Output.ToString(<span style="color:#a31515;">"0.000"</span>), 7) + <span style="color:#a31515;">" "</span> +
(isCorrect ? <span style="color:#a31515;">" "</span> : <span style="color:#a31515;">"(wrong)"</span>) +
RightJustify((percentCorrect * 100).ToString(<span style="color:#a31515;">"0.0"</span>) + <span style="color:#a31515;">"% "</span>, 12) +
RenderPercent(percentCorrect * 100)
);
#endregion
}
<span style="color:#0000ff;">if</span> (flawlessRuns == 1000) {
Console.WriteLine(<span style="color:#a31515;">"I've had "</span> + flawlessRuns.ToString(<span style="color:#a31515;">"#,##0"</span>) + <span style="color:#a31515;">" flawless predictions recently. Continue anyway?"</span>);
Console.Beep();
Console.ReadLine();
}
i++;
}
Console.WriteLine(<span style="color:#a31515;">"Done"</span>);
Console.Beep();
Console.ReadLine();
}
<span style="color:#0000ff;">static</span> <span style="color:#0000ff;">string</span> RenderPercent(<span style="color:#0000ff;">float</span> percent) {
<span style="color:#0000ff;">float</span> value = percent / 10f;
<span style="color:#0000ff;">if</span> (value < 0.5) <span style="color:#0000ff;">return</span> <span style="color:#a31515;">"| |"</span>;
<span style="color:#0000ff;">if</span> (value < 1.5) <span style="color:#0000ff;">return</span> <span style="color:#a31515;">"|- |"</span>;
<span style="color:#0000ff;">if</span> (value < 2.5) <span style="color:#0000ff;">return</span> <span style="color:#a31515;">"|-- |"</span>;
<span style="color:#0000ff;">if</span> (value < 3.5) <span style="color:#0000ff;">return</span> <span style="color:#a31515;">"|--- |"</span>;
<span style="color:#0000ff;">if</span> (value < 4.5) <span style="color:#0000ff;">return</span> <span style="color:#a31515;">"|---- |"</span>;
<span style="color:#0000ff;">if</span> (value < 5.5) <span style="color:#0000ff;">return</span> <span style="color:#a31515;">"|----- |"</span>;
<span style="color:#0000ff;">if</span> (value < 6.5) <span style="color:#0000ff;">return</span> <span style="color:#a31515;">"|------ |"</span>;
<span style="color:#0000ff;">if</span> (value < 7.5) <span style="color:#0000ff;">return</span> <span style="color:#a31515;">"|------- |"</span>;
<span style="color:#0000ff;">if</span> (value < 8.5) <span style="color:#0000ff;">return</span> <span style="color:#a31515;">"|-------- |"</span>;
<span style="color:#0000ff;">if</span> (value < 9.5) <span style="color:#0000ff;">return</span> <span style="color:#a31515;">"|--------- |"</span>;
<span style="color:#0000ff;">return</span> <span style="color:#a31515;">"|----------|"</span>;
}
<span style="color:#0000ff;">static</span> <span style="color:#0000ff;">string</span> RightJustify(<span style="color:#0000ff;">string</span> text, <span style="color:#0000ff;">int</span> width) {
<span style="color:#0000ff;">while</span> (text.Length < width) text = <span style="color:#a31515;">" "</span> + text;
<span style="color:#0000ff;">return</span> text;
}
}
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">enum</span> ActivationFunctionEnum {
<span style="color:#008000;">/// <summary> Rectified Linear Unit </summary></span>
ReLU,
<span style="color:#008000;">/// <summary> Leaky Rectified Linear Unit </summary></span>
LReLU,
<span style="color:#008000;">/// <summary> Logistic sigmoid </summary></span>
Sigmoid,
<span style="color:#008000;">/// <summary> Hyperbolic tangent </summary></span>
TanH,
<span style="color:#008000;">/// <summary> Softmax function </summary></span>
Softmax,
}
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">class</span> NeuralNetwork {
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// The layers of neurons from input (0) to output (N)</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> Layer[] Layers { <span style="color:#0000ff;">get</span>; <span style="color:#0000ff;">private</span> <span style="color:#0000ff;">set</span>; }
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Equivalent to Layers.Length</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">int</span> LayerCount { <span style="color:#0000ff;">get</span>; <span style="color:#0000ff;">private</span> <span style="color:#0000ff;">set</span>; }
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Equivalent to InputLayer.NeuronCount</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">int</span> InputCount { <span style="color:#0000ff;">get</span>; <span style="color:#0000ff;">private</span> <span style="color:#0000ff;">set</span>; }
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Equivalent to OutputLayer.NeuronCount</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">int</span> OutputCount { <span style="color:#0000ff;">get</span>; <span style="color:#0000ff;">private</span> <span style="color:#0000ff;">set</span>; }
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Equivalent to Layers[0]</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> Layer InputLayer { <span style="color:#0000ff;">get</span>; <span style="color:#0000ff;">private</span> <span style="color:#0000ff;">set</span>; }
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Equivalent to Layers[LayerCount - 1]</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> Layer OutputLayer { <span style="color:#0000ff;">get</span>; <span style="color:#0000ff;">private</span> <span style="color:#0000ff;">set</span>; }
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Provides the desired output values for use in backpropagation training</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">float</span>[] TrainingOutputs { <span style="color:#0000ff;">get</span>; <span style="color:#0000ff;">private</span> <span style="color:#0000ff;">set</span>; }
<span style="color:#0000ff;">public</span> NeuralNetwork() { }
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Constructs and adds a new neuron layer to .Layers</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> Layer AddLayer(
<span style="color:#0000ff;">int</span> neuronCount,
<span style="color:#0000ff;">bool</span> randomize = <span style="color:#0000ff;">false</span>,
ActivationFunctionEnum activationFunction = ActivationFunctionEnum.TanH,
<span style="color:#0000ff;">float</span> learningRate = 0.01f
) {
<span style="color:#008000;">// Since we can't expand the array we'll construct a new one</span>
<span style="color:#0000ff;">var</span> newLayers = <span style="color:#0000ff;">new</span> Layer[LayerCount + 1];
<span style="color:#0000ff;">if</span> (LayerCount > 0) Array.Copy(Layers, newLayers, LayerCount);
<span style="color:#008000;">// Interconnect layers</span>
Layer previousLayer = <span style="color:#0000ff;">null</span>;
<span style="color:#0000ff;">if</span> (LayerCount > 0) previousLayer = newLayers[LayerCount - 1];
<span style="color:#008000;">// Construct the new layer</span>
<span style="color:#0000ff;">var</span> layer = <span style="color:#0000ff;">new</span> Layer(neuronCount, previousLayer);
layer.ActivationFunction = activationFunction;
layer.LearningRate = learningRate;
<span style="color:#0000ff;">if</span> (randomize) layer.Randomize();
newLayers[LayerCount] = layer;
<span style="color:#008000;">// Interconnect layers</span>
<span style="color:#0000ff;">if</span> (LayerCount > 0) previousLayer.NextLayer = layer;
<span style="color:#008000;">// Cache some helpful properties</span>
<span style="color:#0000ff;">if</span> (LayerCount == 0) {
InputLayer = layer;
InputCount = neuronCount;
}
<span style="color:#0000ff;">if</span> (LayerCount == newLayers.Length - 1) {
OutputLayer = layer;
OutputCount = neuronCount;
TrainingOutputs = <span style="color:#0000ff;">new</span> <span style="color:#0000ff;">float</span>[neuronCount];
}
<span style="color:#008000;">// Emplace the new array and move on</span>
Layers = newLayers;
LayerCount++;
<span style="color:#0000ff;">return</span> layer;
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Copy the array of input values to the input layer's .Output properties</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">void</span> SetInputs(<span style="color:#0000ff;">float</span>[] inputs) {
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> n = 0; n < InputCount; n++) {
InputLayer.Neurons[n].Output = inputs[n];
}
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Copy the output layer's .Output property values to the given array</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">void</span> GetOutputs(<span style="color:#0000ff;">float</span>[] outputs) {
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> n = 0; n < OutputCount; n++) {
outputs[n] = OutputLayer.Neurons[n].Output;
}
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Interpret the output array as a singular category (0, 1, 2, ...) or -1 (none)</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">int</span> Classify() {
<span style="color:#0000ff;">float</span> maxValue = 0;
<span style="color:#0000ff;">int</span> bestIndex = -1;
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> o = 0; o < OutputCount; o++) {
<span style="color:#0000ff;">float</span> value = OutputLayer.Neurons[o].Output;
<span style="color:#0000ff;">if</span> (value > maxValue) {
bestIndex = o;
maxValue = value;
}
}
<span style="color:#0000ff;">if</span> (maxValue == 0) <span style="color:#0000ff;">return</span> -1;
<span style="color:#0000ff;">return</span> bestIndex;
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Copy the given array's values to the .TrainingOutputs property</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">void</span> SetTrainingOutputs(<span style="color:#0000ff;">float</span>[] outputs) {
Array.Copy(outputs, TrainingOutputs, OutputCount);
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Flipside of .Classify() that sets .TrainingOutputs to all zeros and the given index to one</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">void</span> SetTrainingClassification(<span style="color:#0000ff;">int</span> value) {
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> o = 0; o < OutputCount; o++) {
<span style="color:#0000ff;">if</span> (o == value) {
TrainingOutputs[o] = 1;
} <span style="color:#0000ff;">else</span> {
TrainingOutputs[o] = 0;
}
}
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Feed .Inputs forward to populate .Outputs</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">void</span> FeedForward() {
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> l = 1; l < LayerCount; l++) {
<span style="color:#0000ff;">var</span> layer = Layers[l];
layer.FeedForward();
}
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// One iteration of backpropagation training using inputs and training outputs after .Predict() was called on the same</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">void</span> Backpropagate() {
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> l = LayerCount - 1; l > 0; l--) {
<span style="color:#0000ff;">var</span> layer = Layers[l];
layer.Backpropagate(TrainingOutputs);
}
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Returns a random float in the range from min to max (inclusive)</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">static</span> <span style="color:#0000ff;">float</span> NextRandom(<span style="color:#0000ff;">float</span> min, <span style="color:#0000ff;">float</span> max) {
<span style="color:#0000ff;">return</span> (<span style="color:#0000ff;">float</span>)random.NextDouble() * (max - min) + min;
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Returns a random int that is at least min and less than max</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">static</span> <span style="color:#0000ff;">int</span> NextRandomInt(<span style="color:#0000ff;">int</span> min, <span style="color:#0000ff;">int</span> max) {
<span style="color:#0000ff;">return</span> random.Next(min, max);
}
<span style="color:#0000ff;">private</span> <span style="color:#0000ff;">static</span> Random random = <span style="color:#0000ff;">new</span> Random();
}
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">class</span> Layer {
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// All the neurons in this layer</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> Neuron[] Neurons;
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Reference to the earlier layer that I get my input from</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> Layer PreviousLayer;
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Reference to the later layer that gets its input from me</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> Layer NextLayer;
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// A tunable parameter that trades shorter training times for greater final accuracy</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">float</span> LearningRate = 0.01f;
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// How to transform the summed-up scalar output value of each neuron during feed forward</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> ActivationFunctionEnum ActivationFunction = ActivationFunctionEnum.TanH;
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Equivalent to Neurons.Length</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">int</span> NeuronCount { <span style="color:#0000ff;">get</span>; <span style="color:#0000ff;">private</span> <span style="color:#0000ff;">set</span>; }
<span style="color:#0000ff;">public</span> Layer(<span style="color:#0000ff;">int</span> neuronCount, Layer previousLayer) {
PreviousLayer = previousLayer;
NeuronCount = neuronCount;
Neurons = <span style="color:#0000ff;">new</span> Neuron[NeuronCount];
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> n = 0; n < NeuronCount; n++) {
Neuron neuron = <span style="color:#0000ff;">new</span> Neuron(<span style="color:#0000ff;">this</span>);
Neurons[n] = neuron;
}
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Forget all prior training by randomizing all input weights and biases</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">void</span> Randomize() {
<span style="color:#008000;">// Put weights in the range of -0.5 to 0.5</span>
<span style="color:#0000ff;">const</span> <span style="color:#0000ff;">float</span> randomWeightRadius = 0.5f;
<span style="color:#0000ff;">foreach</span> (Neuron neuron <span style="color:#0000ff;">in</span> Neurons) {
neuron.Randomize(randomWeightRadius);
}
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Feed-forward algorithm for this layer</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">void</span> FeedForward() {
<span style="color:#0000ff;">foreach</span> (<span style="color:#0000ff;">var</span> neuron <span style="color:#0000ff;">in</span> Neurons) {
<span style="color:#008000;">// Sum up the previous layer's outputs multiplied by this neuron's weights for each</span>
<span style="color:#0000ff;">float</span> sigma = 0;
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> i = 0; i < PreviousLayer.NeuronCount; i++) {
sigma += PreviousLayer.Neurons[i].Output * neuron.InputWeights[i];
}
sigma += neuron.Bias; <span style="color:#008000;">// Add in each neuron's bias too</span>
<span style="color:#008000;">// Shape the output using the activation function</span>
<span style="color:#0000ff;">float</span> output = ActivationFn(sigma);
neuron.Output = output;
}
<span style="color:#008000;">// The Softmax activation function requires extra processing of aggregates</span>
<span style="color:#0000ff;">if</span> (ActivationFunction == ActivationFunctionEnum.Softmax) {
<span style="color:#008000;">// Find the max output value</span>
<span style="color:#0000ff;">float</span> max = <span style="color:#0000ff;">float</span>.NegativeInfinity;
<span style="color:#0000ff;">foreach</span> (<span style="color:#0000ff;">var</span> neuron <span style="color:#0000ff;">in</span> Neurons) {
<span style="color:#0000ff;">if</span> (neuron.Output > max) max = neuron.Output;
}
<span style="color:#008000;">// Compute the scale</span>
<span style="color:#0000ff;">float</span> scale = 0;
<span style="color:#0000ff;">foreach</span> (<span style="color:#0000ff;">var</span> neuron <span style="color:#0000ff;">in</span> Neurons) {
scale += (<span style="color:#0000ff;">float</span>)Math.Exp(neuron.Output - max);
}
<span style="color:#008000;">// Shift and scale the outputs</span>
<span style="color:#0000ff;">foreach</span> (<span style="color:#0000ff;">var</span> neuron <span style="color:#0000ff;">in</span> Neurons) {
neuron.Output = (<span style="color:#0000ff;">float</span>)Math.Exp(neuron.Output - max) / scale;
}
}
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Backpropagation algorithm</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">void</span> Backpropagate(<span style="color:#0000ff;">float</span>[] trainingOutputs) {
<span style="color:#008000;">// Compute error for each neuron</span>
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> n = 0; n < NeuronCount; n++) {
<span style="color:#0000ff;">var</span> neuron = Neurons[n];
<span style="color:#0000ff;">float</span> output = neuron.Output;
<span style="color:#0000ff;">if</span> (NextLayer == <span style="color:#0000ff;">null</span>) { <span style="color:#008000;">// Output layer</span>
<span style="color:#0000ff;">var</span> error = trainingOutputs[n] - output;
neuron.Error = error * ActivationFnDerivative(output);
} <span style="color:#0000ff;">else</span> { <span style="color:#008000;">// Hidden layer</span>
<span style="color:#0000ff;">float</span> error = 0;
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> o = 0; o < NextLayer.NeuronCount; o++) {
<span style="color:#0000ff;">var</span> nextNeuron = NextLayer.Neurons[o];
<span style="color:#0000ff;">var</span> iw = nextNeuron.InputWeights[n];
error += nextNeuron.Error * iw;
}
neuron.Error = error * ActivationFnDerivative(output);
}
}
<span style="color:#008000;">// Adjust weights of each neuron</span>
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> n = 0; n < NeuronCount; n++) {
<span style="color:#0000ff;">var</span> neuron = Neurons[n];
<span style="color:#008000;">// Update this neuron's bias</span>
<span style="color:#0000ff;">var</span> gradient = neuron.Error;
neuron.Bias += gradient * LearningRate;
<span style="color:#008000;">// Update this neuron's input weights</span>
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> i = 0; i < PreviousLayer.NeuronCount; i++) {
gradient = neuron.Error * PreviousLayer.Neurons[i].Output;
neuron.InputWeights[i] += gradient * LearningRate;
}
}
}
<span style="color:#0000ff;">private</span> <span style="color:#0000ff;">float</span> ActivationFn(<span style="color:#0000ff;">float</span> value) {
<span style="color:#0000ff;">switch</span> (ActivationFunction) {
<span style="color:#0000ff;">case</span> ActivationFunctionEnum.ReLU:
<span style="color:#0000ff;">if</span> (value < 0) <span style="color:#0000ff;">return</span> 0;
<span style="color:#0000ff;">return</span> value;
<span style="color:#0000ff;">case</span> ActivationFunctionEnum.LReLU:
<span style="color:#0000ff;">if</span> (value < 0) <span style="color:#0000ff;">return</span> value * 0.01f;
<span style="color:#0000ff;">return</span> value;
<span style="color:#0000ff;">case</span> ActivationFunctionEnum.Sigmoid:
<span style="color:#0000ff;">return</span> (<span style="color:#0000ff;">float</span>)(1 / (1 + Math.Exp(-value)));
<span style="color:#0000ff;">case</span> ActivationFunctionEnum.TanH:
<span style="color:#0000ff;">return</span> (<span style="color:#0000ff;">float</span>)Math.Tanh(value);
<span style="color:#0000ff;">case</span> ActivationFunctionEnum.Softmax:
<span style="color:#0000ff;">return</span> value; <span style="color:#008000;">// This is only the first part of summing up all the values</span>
}
<span style="color:#0000ff;">return</span> value;
}
<span style="color:#0000ff;">private</span> <span style="color:#0000ff;">float</span> ActivationFnDerivative(<span style="color:#0000ff;">float</span> value) {
<span style="color:#0000ff;">switch</span> (ActivationFunction) {
<span style="color:#0000ff;">case</span> ActivationFunctionEnum.ReLU:
<span style="color:#0000ff;">if</span> (value > 0) <span style="color:#0000ff;">return</span> 1;
<span style="color:#0000ff;">return</span> 0;
<span style="color:#0000ff;">case</span> ActivationFunctionEnum.LReLU:
<span style="color:#0000ff;">if</span> (value > 0) <span style="color:#0000ff;">return</span> 1;
<span style="color:#0000ff;">return</span> 0.01f;
<span style="color:#0000ff;">case</span> ActivationFunctionEnum.Sigmoid:
<span style="color:#0000ff;">return</span> value * (1 - value);
<span style="color:#0000ff;">case</span> ActivationFunctionEnum.TanH:
<span style="color:#0000ff;">return</span> 1 - value * value;
<span style="color:#0000ff;">case</span> ActivationFunctionEnum.Softmax:
<span style="color:#0000ff;">return</span> (1 - value) * value;
}
<span style="color:#0000ff;">return</span> 0;
}
}
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">class</span> Neuron {
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// The weight I put on each of my inputs when computing my output as my essential learned memory</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">float</span>[] InputWeights;
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// My bias is also part of my learned memory</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">float</span> Bias;
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// My feed-forward computed output</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">float</span> Output;
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// My back-propagation computed error</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">float</span> Error;
<span style="color:#0000ff;">public</span> Neuron(Layer layer) {
<span style="color:#0000ff;">if</span> (layer.PreviousLayer != <span style="color:#0000ff;">null</span>) {
InputWeights = <span style="color:#0000ff;">new</span> <span style="color:#0000ff;">float</span>[layer.PreviousLayer.NeuronCount];
}
}
<span style="color:#008000;">/// <summary></span>
<span style="color:#008000;">/// Forget all prior training by randomizing my input weights and bias</span>
<span style="color:#008000;">/// </summary></span>
<span style="color:#0000ff;">public</span> <span style="color:#0000ff;">void</span> Randomize(<span style="color:#0000ff;">float</span> radius) {
<span style="color:#0000ff;">if</span> (InputWeights != <span style="color:#0000ff;">null</span>) {
<span style="color:#0000ff;">for</span> (<span style="color:#0000ff;">int</span> i = 0; i < InputWeights.Length; i++) {
InputWeights[i] = NeuralNetwork.NextRandom(-radius, radius);
}
}
Bias = NeuralNetwork.NextRandom(-radius, radius);
}
}
}
</code></pre>
Jim Carnicellihttp://www.blogger.com/profile/17916650503215873212noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-59039322109981129322016-12-15T15:04:00.001-08:002016-12-15T15:07:24.966-08:00Virtual lexicon vs Brown corpusHaving completed my <a href="http://jvcai.blogspot.com/2016/12/text-blocker-sentence-segmentation.html">blocker</a>, I decided to take a break before tackling syntax analysis to study more facets of English. But also, I realized I should beef up the lexicon underlying my <a href="http://jvcai.blogspot.com/2016/11/morphological-parser_26.html">virtual lexicon</a> (VL). I had only collected about 1,500 words, and most of those I had simply hand-entered by way of theft from the <a href="https://www.amazon.com/Cambridge-Grammar-English-Language/dp/0521431468">CGEL's</a> chapters on morphology; mostly compound words, at that. It was enough to test and demonstrate the VL's capacity to deal half-decently with morphological parsing, but nowhere near big enough to represent the <a href="http://www.worldwidewords.org/articles/howmany.htm">at least tens of thousands of words</a> a typical high school graduate with English as their native language will know.<br />
<br />
A virtual lexicon's core premise is that being able to recognize novel word forms by recognizing the parts of the word is more valuable than having a large list of exacting word-forms. In essence, a relatively small number of lexical entries should be able to represent a much larger set of practical words found "in the wild".<br />
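To make that premise concrete, here is a minimal sketch in JavaScript (the language I use for my corpus tooling) of how a small set of stems plus suffix rules can recognize word forms that aren't listed explicitly. The tiny stem list, suffix table, and spelling rule here are illustrative stand-ins, not the real lexicon:

```javascript
// Toy morphological parser: a handful of stems plus suffix rules
// stand in for a full lexicon (all entries here are illustrative).
const stems = { care: 'v', shoe: 'n', red: 'aj' };
const suffixes = [
  { form: 'ing', from: 'v', to: 'v' },   // caring -> care -ing
  { form: 'ed',  from: 'v', to: 'v' },   // cared  -> care -ed
  { form: 's',   from: 'n', to: 'n' },   // shoes  -> shoe -s
];

function parseWord(word) {
  // Exact stem match needs no decomposition
  if (stems[word]) return { stem: word, suffix: null, lc: stems[word] };
  for (const sfx of suffixes) {
    if (!word.endsWith(sfx.form)) continue;
    let base = word.slice(0, -sfx.form.length);
    // Undo the dropped silent 'e' (car + -ing -> care), a common spelling rule
    if (!stems[base] && stems[base + 'e']) base += 'e';
    if (stems[base] && stems[base] === sfx.from) {
      return { stem: base, suffix: sfx.form, lc: sfx.to };
    }
  }
  return null; // unrecognized morphemes
}
```

With three stems and three suffix rules this already covers "caring", "cared", and "shoes" without listing any of them, which is the whole point of the virtual approach.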
<br />
<h2>
Using the Brown corpus</h2>
I decided that a good way to see just how much mileage I could get out of my virtual lexicon would be to expose it to an existing dictionary, of sorts. In particular, I chose the <a href="https://en.wikipedia.org/wiki/Brown_Corpus">Brown corpus</a>, which is full of words hand-tagged with their lexical categories (parts of speech) taken from excerpts of 500 documents contemporary to the 1960s. I had already <a href="http://www.jimcarnicelli.com/ai/brown_corpus/">converted the BC's data</a> to JavaScript/JSON files and dabbled a bit with it many months back, so I had an easy way to work with it.<br />
<br />
Most significantly, I already had a <a href="http://www.jimcarnicelli.com/ai/brown_corpus/BrownCorpus_Words.js">list of all the unique words</a> found in the BC, complete with an ordered sub-list of all the lexical categories and their frequency counts. Here's an example:
<br />
<blockquote class="tr_bq">
<pre>care:{c:162,p:[{c:87,p:'nn'},{c:75,p:'vb'}],bp:[{c:87,p:'n'},{c:75,p:'v'}]},
'care-free':{c:1,p:[{c:1,p:'jj'}],bp:[{c:1,p:'aj'}]},
cared:{c:15,p:[{c:9,p:'vbd'},{c:6,p:'vbn'}],bp:[{c:15,p:'v'}]},
careened:{c:1,p:[{c:1,p:'vbd'}],bp:[{c:1,p:'v'}]},
careening:{c:1,p:[{c:1,p:'vbg'}],bp:[{c:1,p:'v'}]},
career:{c:67,p:[{c:67,p:'nn'}],bp:[{c:67,p:'n'}]},</pre>
</blockquote>
For example, <i>care</i> appears 162 times in the BC: 87 of those times as a common noun, as in <i>health care</i>, and the other 75 as a base verb, as in <i>to care for</i>.
<br />
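Given entries in that shape, pulling out the dominant category and its share of occurrences is straightforward. A sketch, using the <i>care</i> entry copied from the list above:

```javascript
// One entry from the word list: total count 'c' plus per-tag counts
// in 'p', ordered most-frequent first ('bp' holds broader base tags).
const care = {
  c: 162,
  p:  [{ c: 87, p: 'nn' }, { c: 75, p: 'vb' }],
  bp: [{ c: 87, p: 'n' },  { c: 75, p: 'v' }],
};

// The sub-lists are sorted by count, so the dominant tag comes first
function dominantTag(entry) { return entry.p[0].p; }

// Share of occurrences carried by the dominant tag (87 / 162 here)
function dominantShare(entry) { return entry.p[0].c / entry.c; }
```

So <code>dominantTag(care)</code> yields "nn", and the dominant share is a little over half, which hints at how ambiguous even a common word can be.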
Given a word like "caring", my VL will try its best to figure out the lexical category. For this example, it would likely parse it as "care -ing" and call this a gerund/participle, same as BC, which uses the "vbg" tag to represent this.
<br />
This list contains lots of elements I don't care to push through my VL, such as proper nouns (<i>John</i>, <i>Brooklyn</i>, <i>Glazer-Fine</i>) and punctuation. After filtering, that leaves a word-list of 12,222 unique words for me to test my VL against.
<br />
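The filtering itself comes down to a couple of predicates over the word-list entries. A sketch, assuming the Brown tag set's "np" family marks proper nouns; the three-entry word list and its counts are hypothetical, not drawn from the real data:

```javascript
// Keep only ordinary words: drop proper nouns ('np'-family tags in
// the Brown tag set) and anything that is pure punctuation.
const wordList = {
  care: { c: 162, p: [{ c: 87, p: 'nn' }, { c: 75, p: 'vb' }] },
  John: { c: 10,  p: [{ c: 10, p: 'np' }] },  // illustrative counts
  ',':  { c: 5,   p: [{ c: 5,  p: ',' }] },
};

function isProperNoun(entry) {
  // Only drop words that are *always* tagged as proper nouns
  return entry.p.every(tag => tag.p.startsWith('np'));
}
function isPunctuation(word) {
  return !/[a-z]/i.test(word); // no letters at all
}

const testable = Object.keys(wordList).filter(
  w => !isPunctuation(w) && !isProperNoun(wordList[w])
);
// Only 'care' survives the filter in this toy list
```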
Here's a snippet of the typical <a href="http://jimcarnicelli.com/ai/blog/attachments/2016-12-15_WordList.txt">output from my testing</a>:<br />
<blockquote class="tr_bq">
<pre> | | preposterous | 5 | jj | J | J | 7 ms | 125 | pre-(U) post(N) -er(J→J) -ous(J) (J, N, or N)
| X | prescribe | 5 | vb | V | N | | 40 | pre-(U) scribe(N)
| | prescribed | 14 | vbn, vbd | V.pret | V.pret | | 100 | pre-(U) scribe(N) -ed(V→V) (V or J)
| | prescription | 5 | nn | N | N | | 85 | pre-(U) script(N) -ion(V|J→N)
| | presence | 76 | nn | N | N | 3 ms | 59 | present(J) -ce(J→N) (N or Phr)
| / | present | 377 | jj, rb, nn, vb… | J | V | | 0 | present(V) (V or J)
| X | present-day | 17 | jj | J | N | | 100 | present(V) -(U) day(N)
| | presentation | 33 | nn | N | N | 17 ms | 97 | present(V) -ate(V) -ion(V|J→N)
| | presentations | 6 | nns | N.plur | N.plur | 88 ms | 137 | present(V) -ate(V) -ion(V|J→N) -s(N→N) (N, V, or N)
| | presented | 82 | vbn, vbd | V.pret | V.pret | 4 ms | 40 | present(V) -ed(V→V) (V or J)
| | presenting | 10 | vbg | V.gerprt | V.gerprt | 4 ms | 40 | present(V) -ing(V|N→V) (V or N)
</pre>
</blockquote>
For example, <i>prescribe</i> gets treated as "pre- scribe". Since it sees <i>scribe</i> as a noun, it concludes that the whole word is a noun, as though we were talking about a person before they became a scribe. The BC tags this as "vb". To run the comparison, I use a <a href="http://jimcarnicelli.com/ai/blog/attachments/2016-12-15_BrownPosMap.txt">mapping</a> to translate some of the many tags the BC uses to the representation used by my VL. For example, "V.pret" means <a href="https://en.wikipedia.org/wiki/Preterite">preterite verb</a> and "N.plur" means plural noun.<br />
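A few of those mappings, as visible in the comparison snippet above, sketched as a lookup table (my reconstruction from the output, not the actual mapping file, which covers many more tags):

```cpp
#include <cassert>
#include <map>
#include <string>

// Partial Brown-tag to VL-category mapping, reconstructed from the
// comparison output above (the real mapping file has many more entries).
const std::map<std::string, std::string> brownToVl = {
    {"nn",  "N"},        // common noun
    {"nns", "N.plur"},   // plural noun
    {"jj",  "J"},        // adjective
    {"vb",  "V"},        // base verb
    {"vbd", "V.pret"},   // preterite (past-tense) verb
    {"vbn", "V.pret"},   // past participle, compared as preterite
    {"vbg", "V.gerprt"}, // gerund/present participle
};
```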
<br />
Data mapping is a tricky and often dubious affair. Sometimes there just isn't an exact mapping between two systems. Take <i>them</i>: the BC tags it "ppo" (pronoun, personal, accusative), a category that also includes words like <i>it</i>, <i>him</i>, <i>me</i>, <i>us</i>, <i>you</i>, and <i>her</i>. Some of these are plural and the rest aren't. In my VL, <i>them</i> is a plural pronoun ("N.pron.plur"), so the plural "ppo" items compare incorrectly. I could have modified my mapping to treat <i>them</i> and <i>us</i> as plural, but that's an unnecessary hack that doesn't really help my task.<br />
<br />
The first column of the output contains a match status. When blank, the two systems agreed on the LC of that word. A "?" means my VL couldn't even match the morphemes of the prospective word. That alone doesn't stop it from guessing based on a familiar suffix (e.g., <i>-ous</i> or <i>-ing</i>), but I disqualified such attempts anyway, partly so they would point me to morphemes that really needed to be added to my lexicon. If all the morphemes did match, I compare the resulting LCs. The first LC listed for the BC represents the most common occurrence (e.g., "jj" (adjective) for "present") and the others represent less common occurrences. If my VL doesn't match any of the BC's LCs, this column contains "X". If it matches only a secondary LC, "/" appears. Think "half an 'X' for half-wrong" (or half-right).<br />
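That column's logic condenses into a small routine (a sketch of my reading of the rules above, not the actual test program):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Compare the VL's guessed category against the BC's categories
// (ordered most-common-first) and return the first-column marker.
std::string matchStatus(bool morphemesMatched,
                        const std::string& vlGuess,
                        const std::vector<std::string>& bcCats) {
    if (!morphemesMatched) return "?";         // couldn't parse the word
    if (!bcCats.empty() && bcCats.front() == vlGuess)
        return " ";                            // hard match (blank)
    for (std::size_t i = 1; i < bcCats.size(); i++)
        if (bcCats[i] == vlGuess)
            return "/";                        // soft, half-right match
    return "X";                                // no match at all
}
```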
<br />
As you can see from these examples, some of the derivations are pretty good, as with "N.plur: present(V) -ate(V) -ion(V|J→N) -s(N→N)" for <i>presentations</i>. And some are pretty bad, like "J: pre-(U) post(N) -er(J→J) -ous(J)" for <i>preposterous</i>. Yes, it got the final lexical category right, thanks to the <i>-ous</i> suffix, but only by fumbling through its morphemes.<br />
<br />
<h2>
Adding lexemes</h2>
Although testing the dexterity of my VL was a key goal, a more basic one was augmenting my <a href="http://jimcarnicelli.com/ai/blog/attachments/2016-12-15_Terms.js">lexicon</a> with more words. To that end, I would filter my word comparison runs for all the "?"-status bad matches and hand-enter morphemes as necessary.<br />
<br />
Consider "monast", for example. I added this as a <a href="https://en.wikipedia.org/wiki/Bound_and_unbound_morphemes">bound morpheme</a>, which isn't definitively a prefix (<i>un-</i>, <i>ante-</i>, <i>electro-</i>) or suffix (<i>-ing</i>, <i>-ably</i>, <i>-ment</i>), but can't really stand on its own as a complete word in a sentence. Although I used my own sense of how a word was historically composed and its potential for production of other words, I also relied on online tools to help. For example, searching for <a href="http://www.morewords.com/starts-with/monast">all words that begin with "monast"</a> or for <a href="http://www.morewords.com/ends-with/ment">all words that end with "ment"</a>. Having extensive examples at the ready helped me test (and reject) many of my hypotheses. And then I could know that <i>monast~</i> could correctly form <i>monastic</i>, <i>monastery</i>, <i>monasticism</i>, and more.<br />
<br />
I went through this process for several days. While I had some shortcuts, I ultimately hand-processed every word. To my surprise, I personally recognized all but perhaps ten of the 12k+ words, ignoring certain highly technical medical terms. And with each new lexeme I'd add, I'd ask myself, "how was this word not already in here?" One of the last words I added was "yes", one of the most basic in the English language. My sense is that there must still be loads of even ordinary words not covered by my VL.<br />
<br />
I continued this process until there were no more "unmatched" words, meaning almost every word in the BC could be sliced up into morphemes that matched my underlying lexicon, even if the LCs didn't match the BC's LCs. In the end, my lexicon had 4,913 lexemes available. Of those, 4,494 lexemes were used to match 12,132 words, meaning my lexemes number only about 37% as many as the words they cover. Call that a "lexical compression rate" of 63%. On average, one of my lexemes can match about three words in the BC. For comparison, a basic word-list with no morphological parsing would display 0% compression, while 100% is the impossible asymptote that could never be reached.<br />
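The coverage arithmetic behind that claim, spelled out (nothing but division over the counts reported above):

```cpp
#include <cassert>

// 4,494 lexemes matched 12,132 words, so each lexeme covers roughly
// 12132 / 4494, or about 2.7, words on average.
double wordsPerLexeme(int words, int lexemes) {
    return static_cast<double>(words) / lexemes;
}
```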
<br />
<h2>
Part of speech tagging</h2>
In processing the full word list from the Brown corpus, I get a rate of 83% "hard" matches plus 5% more "soft" matches. A hard match is where my VL's lexical category matches the most common usage of that word in the BC; a soft match is where it matches one of the less common usages. Let's be liberal and call this an 88% match.
<br />
<br />
To anyone familiar with traditional part of speech (PoS) tagging, 88% is pathetic. A typical PoS tagger will get <a href="https://en.wikipedia.org/wiki/Part-of-speech_tagging#Other_taggers_and_methods">better than 95%</a> correct without breaking a sweat.
<br />
<br />
But my test program is definitely not a PoS tagger. A PoS tagger typically looks at the words in the neighborhood of the word being considered and uses a statistical model to decide what that word is most likely to be in that context. My test program does nothing of the sort. A better analogy would be that this is the naive first step in a <a href="https://en.wikipedia.org/wiki/Brill_tagger">Brill tagger</a>, where each word is looked up in a lexicon for its most likely LC, and only then does the tagger begin transforming those guesses based on what's in the neighborhood.
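A naive lookup step of that sort is only a few lines (the mini-lexicon here is a stand-in, not my actual data):

```cpp
#include <cassert>
#include <map>
#include <string>

// A stand-in mini-lexicon mapping each word to its single most
// likely category (illustrative entries, not my actual lexicon).
const std::map<std::string, std::string> kSampleLexicon = {
    {"care", "N"}, {"go", "V"}, {"present", "J"}};

// Naive unigram tagging: look the word up and take its most likely
// category, ignoring the surrounding words entirely.
std::string naiveTag(const std::string& word) {
    auto it = kSampleLexicon.find(word);
    return it == kSampleLexicon.end() ? "U" : it->second;  // U: unspecified
}
```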
<br />
<br />
Still, a typical PoS tagger that starts with a naive lookup will usually start out at around 93% match, so why would my VL do so badly?
<br />
<br />
One simple reason is that many of the remaining mismatches rely on poorly chosen lexemes during morphological parsing. Consider <i>youth</i>, which my VL sees as <i>you -th</i>, where <i>-th</i> is typically a suffix for ordinal numbers like <i>tenth</i> and <i>175th</i>; my lexicon is missing an entry for <i>youth</i> itself. In this particular case, <i>youth</i>, <i>youthful</i>, and <i>youths</i> all happen to match correctly anyway, but in many other cases, such gaps in my lexicon cause clearly mistaken guesses, as when <i>legitimate</i> gets interpreted as <i>leg it im- ate</i>, whereas I really need a <i>legitim~</i> bound lexeme to come up with <i>legitim~ -ate</i> and a valid interpretation as either a verb (<i>to legitimate her presidency</i>) or adjective (<i>the legitimate president</i>).
<br />
<br />
Another reason is that the derivation of a word from its morphemes often seems superficially logical, but doesn't reflect the reality of how the word is typically used. For example, <i>amazing</i> naturally follows the usual <i>amaze -ing</i> pattern and can be used as a verb (<i>amazing friends with magic</i>) or gerund (<i>"amazing" isn't bold enough to describe it</i>), but in practice, we most often use <i>amazing</i> as an adjective (<i>the soup is amazing</i>). There's no way to tell that by reference to its two morphemes. <i>-ing</i> almost always forms a gerund-participle and sometimes a noun (<i>the <b>flashing</b> for the <b>siding</b> needs repair</i>), but only rarely an adjective (<i>stunning</i>, <i>breathtaking</i>). This reality reflects a limitation of the virtual lexicon, at least as I've constructed it. Sometimes the only answer is to lexicalize (add to the lexicon) a word that would otherwise badly match, as I did with <i>amazing</i>.
<br />
<br />
Often, I simply couldn't bring myself to label a word in accordance with the most common usages in the BC. For example, I have <i>defeat</i> as a lexeme with only a verb sense, but the BC has it more often tagged as a noun (<i>suffered a defeat</i>) and less often as a verb (<i>we'll defeat them</i>). To my thinking, this reflects not a definitional disagreement, but the difference between a word's intrinsic meaning and its usage in a specific sentence.
<br />
<br />
Moreover, I am still troubled by the idea of having a lexicon contain multiple entries for a lexeme whose only apparent difference is one of lexical category. In <i><a href="http://www.lel.ed.ac.uk/~gpullum/ZAA_final_proof.pdf">Lexical Categorization in English Dictionaries and Traditional Grammars</a></i> Geoffrey Pullum points out that "many dictionaries actually do — quite wrongly — include subentries for numerous nouns that list them as adjectives."
<br />
<br />
In the process of beefing up my lexicon, I was struck by a feeling that almost every entry I added that had a noun sense could also be used as a verb or adjective as well, so I favored only adding LCs for what I thought of as the predominant LCs for the major senses. For example, for <i>appeal</i>, I added verb (<i>I plan to appeal the decision</i>) and noun (<i>the youthful appeal of this dress</i>) senses because each, to my thinking, had a distinctly different meaning. For <i>medal</i>, I only added a noun sense, despite the validity of using it as a verb, as in <i>medaled in track and field</i>, because to my thinking, both uses fundamentally refer to the same exact concept. "Medaling" just means winning a medal, the thing that is won.
<br />
<br />
That said, I've also second-guessed this line of thinking. If I imagine there's a singular noun-verb-adjective pseudo-category, then it's clear that many words would violate it. For example, it's hard to imagine using <i>medal</i> as an adjective. Yes, some words like <i>fast</i> easily lend themselves to all three interpretations (<i>They fasted their fast after eating faster than usual.</i>) But so many words just pile up to negate this nice model, such as <i>complex</i> (<i>the complex has a complex layout</i>) and <i>monkey</i> (<i>let's not monkey with this monkey</i>). That seems to suggest I should have been more liberal with my lexicon. That I should have listed all the reasonable lexical categories a lexeme could take.
<br />
<br />
In any case, I suspect my stinginess here is a significant source of the merely half-right matches. Had I taken the more liberal view, the right-plus-half-right rate probably would be up in the ninety percents.
<br />
<br />
<h2>
From word-list to document tags</h2>
As I said before, my test code is not what I would actually call a part of speech (lexical category) tagger, since it does not consider any word in the context of the words around it. All it does is guess at the proper lexical category for a given word by a morphological analysis involving dictionary-like lookups.<br />
<br />
Still, I was curious to see how it would fare against the individual documents in the BC. The unique words are not all equal, after all. One word will appear only once in the BC, while another will appear thousands of times. And that word may vary among three different LCs with each usage.
<br />
<br />
As before, I ignored proper nouns and punctuation, but also "cd" (number) words. Of the 951k words thus considered, 773k (81%) of them had matching LCs, which is close to the 83% exact-match rate I got when simply looking at the unique-word-list from the BC. I include here an <a href="http://jimcarnicelli.com/ai/blog/attachments/2016-12-15_DocMismatches.txt">output file</a> from one test run. Here's an example of what its contents look like:
<br />
<blockquote>
<pre> | on | rp | R | P: on(P)
| long | rb | R | J: long(J)
-------- Doc 201 --------
| to | to | R | P: to(P) (P or R)
| Farewell | nn | N | J: fare(N) well(J) (cap)
| that | cs | S | N.pron: that(N) (N, D, R, or S)
| search | nn | N | V: search(V)
| meaning | nn | N | V.gerprt: mean(V) -ing(V|N→V) (V or N)
| hints | vbz | V.3rdsg | N.plur: hint(N) -s(N→N) (N or V)
| Unconscious | nn | N | J: un-(J→J) conscious(J) (J, Phr, N, or N) (cap)
| form | nn | N | V: form(V) (V or N)
| other | ap | D | J: other(J) (J or V)
| human | jj | J | N: human(N)
</pre>
</blockquote>
Each of the 500 documents has a <i>Doc NNN</i> header, followed by a list of the words that did not match. Each such mismatch lists the word, the exact tag (e.g., "nn" or "cs"), my mapped version of it (e.g., "N" and "S"), and then my own interpretation of the word.<br />
<br />
The example of <i>human</i> well illustrates the difference between a word's natural lexical category and the syntactic category it takes on in a sentence. In this case, it falls within the clause <i>some form or other enters into all human activity</i>. Practically speaking, <i>human</i> is a noun and its use in <i>human activity</i> doesn't change that.<br />
<br />
<h2>
Conclusions</h2>
Overall, I'm happy with how well my virtual lexicon's morphological-parsing approach solves its specific problem: guessing the baseline lexical category of a word it doesn't already know. 81% of over 12k words were properly recognized using only a little over a third as many lexemes. That said, wacky derivations like "<i>de- co- rat -ive</i>" (instead of "<i>decor -ate -ive</i>") illustrate that it's often just a lucky guess, where the final suffix's LC saves the day.<br />
<br />
My hand-crafting yielded a lexicon of just under 5k lexemes. Given a choice between a massive word list — think hundreds of thousands or even millions of words — and a tiny lexeme set plus morphological analysis, the massive word list is clearly going to win. Still, it seems reasonable to assume that the best results would come from combining a morphological analyzer with a massive word list, because the underlying premise remains: you're inevitably going to run into novel word forms as you process new documents.<br />
<br />
<h2>
Text blocking / Sentence segmentation</h2>
I've finished a first working version of my "blocker" module. I'm coining this term to reflect its purpose: to break a paragraph being parsed into its constituent sentences and sub-sentence "blocks" of text. This is often referred to as "<a href="https://en.wikipedia.org/wiki/Text_segmentation#Sentence_segmentation">sentence segmentation</a>", but I find that term belies the fuller scope of a blocker.<br />
<br />
Wikipedia presents a good summary of the <a href="https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation">basics of sentence segmentation</a>:<br />
<blockquote>
The standard 'vanilla' approach to locate the end of a sentence:<br />
<br />
<div style="margin-bottom: 0.5em; margin-left: 1.6em; margin-right: 0px; margin-top: 0.1em;">
(a) If it's a period, it ends a sentence.</div>
<div style="margin-bottom: 0.5em; margin-left: 1.6em; margin-right: 0px;">
(b) If the preceding token is in the hand-compiled <a class="mw-redirect" href="https://en.wikipedia.org/wiki/List_of_abbreviations" style="background: none; color: #0b0080; text-decoration: none;" title="List of abbreviations">list of abbreviations</a>, then it doesn't end a sentence.</div>
<div style="margin-left: 1.6em; margin-right: 0px;">
(c) If the next token is capitalized, then it ends a sentence.</div>
<br />
This strategy gets about 95% of sentences correct. Things such as shortened names, e.g. "D. H. Lawrence" (with whitespaces between the individual words that form the full name), idiosyncratic orthographical spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like "<a href="https://en.wikipedia.org/wiki/.hack//Sign">.hack//SIGN</a>") and usage of non-standard punctuation (or non-standard usage of punctuation) in a text often fall under the remaining 5%.</blockquote>
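The three "vanilla" rules quoted above translate almost line for line into code, with a tiny stand-in for the hand-compiled abbreviation list:

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Vanilla sentence-boundary test: (a) a period ends a sentence,
// (b) unless the preceding token is a known abbreviation,
// (c) unless the next token is capitalized anyway.
bool endsSentence(const std::string& token,
                  const std::string& prevToken,
                  const std::string& nextToken) {
    static const std::set<std::string> abbreviations = {
        "Dr", "Mr", "Mrs", "etc", "vs"};  // tiny stand-in list
    if (token != ".") return false;                              // (a)
    if (!nextToken.empty() &&
        std::isupper(static_cast<unsigned char>(nextToken[0])))
        return true;                                             // (c)
    if (abbreviations.count(prevToken)) return false;            // (b)
    return true;
}
```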
But to my thinking, this really misses a lot of the picture.<br />
<br />
<h2>
What is blocking?</h2>
I struggled for a while to find a good term. "Blocking" is the best one I came up with, but admittedly, it does conjure up that other modern meaning, as with "call blocking" or "spam blocking". In this case, I mean "blocking" in the sense: breaking text up into a semantically meaningful hierarchy of blocks of text. So what do I mean by that?<br />
<br />
Let's say you're given the task of blocking by hand. You are always given what is assumed to be a single paragraph or some other unbroken block of plain text (e.g., a headline), so don't worry about things like first-line indents or bullet points that screw with text segmentation.<br />
<br />
At the very least, you're expected to break the paragraph up into separate sentences. The previous paragraph, for example, contains two of them. You start looking for periods. Then exclamation points (!) and question marks (?) marking the ends of sentences. Not a bad start. But see how sometimes they aren't, as in my putting "!" and "?" in parentheses, and now in quotes? Imagine if you were to call out the following as sentences from this paragraph:<br />
<ul>
<li>Then exclamation points (!</li>
<li>) and question marks (?</li>
<li>marking the ends of sentences.</li>
</ul>
<div>
Nonsense, right? So maybe the problem is the parentheses. Maybe, but that means we now need to keep track of parentheticals, too. So maybe our rule is: ignore stuff inside parentheses. But what if we had a sentence that included the somewhat arcane use of in-line bullet points, like: 1) point one, 2) point two? Suddenly it's not a simple case of ducking in and out of parentheticals. Still, better than nothing.</div>
<div>
<br /></div>
<div>
But parentheses bring up another important point. Sometimes stuff in parentheses is grammatical in context. Consider this (illustrative) example. You could take the parentheses away and the remaining sentence is still a grammatically correct sentence. But not (this is a clear example) in this sentence. Take its parens away and you end up with an ungrammatical sentence. Because of that potential, I claim it's better to segment out parenthetical blocks of text as being inline with other text bound for syntactic and semantic analysis. I decided to also include text in [square brackets] and {curly braces} in the class of parentheticals.</div>
<br />
So far we have sentences and parentheticals. Let's add quotations, which act similarly, but not identically, to parentheticals. For one thing, a given text is not guaranteed to make use of Unicode's support for left and right <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">‘</span>single<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">’</span> and <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">“</span>double<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">”</span> quote characters. Moreover, when it does, they may be used for alternative purposes, such as in place of apostrophes. I decided to just simplify these all down to basic ASCII 'apostrophes' and "straight quotes". But that creates ambiguity as to what is intended when any of these two characters appears in text. We'll come back to how this is handled soon. Suffice to say that we want to consider the text inside quotes to be text blocks just like with parentheticals.<br />
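That simplification step can be sketched as follows (assuming UTF-8 input; the function names are mine, and the real code's character handling may differ):

```cpp
#include <cassert>
#include <string>

// Replace every occurrence of `from` with `to` in `s`.
static void replaceAll(std::string& s, const std::string& from,
                       const std::string& to) {
    for (std::size_t pos = 0;
         (pos = s.find(from, pos)) != std::string::npos;
         pos += to.size())
        s.replace(pos, from.size(), to);
}

// Simplify curly quotes down to plain ASCII apostrophes and
// straight quotes, as described above.
std::string normalizeQuotes(std::string s) {
    replaceAll(s, "\u2018", "'");    // left single quote
    replaceAll(s, "\u2019", "'");    // right single quote / apostrophe
    replaceAll(s, "\u201C", "\"");   // left double quote
    replaceAll(s, "\u201D", "\"");   // right double quote
    return s;
}
```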
<br />
Adding to the complexity is the fact that a parenthetical or quotation block can contain more sentences or other sub-blocks. Although it's <a href="http://www.thepunctuationguide.com/quotation-marks.html">generally considered poor form</a> for an inline quotation to contain multiple statements, this isn't unusual in practice. Here's an example:<br />
<blockquote class="tr_bq">
Smith was quoted as saying, "We plan to appeal the decision. But we'll also need to study its impact."</blockquote>
There's no logical way to see this "single" sentence as grammatical without admitting it's one sentence that contains two others.<br />
<br />
To complicate things further, there are "quotes 'within' quotes". When a text is well formatted, the inner quotation will use single quotes and quotes deeper within will either use more double quotes or the somewhat awkward "'triple quotes'". However, it's not uncommon for more informal texts to contain lazily copy-and-pasted text formatted with surrounding double quotes which already contained double quotes; thus, double quotes inside double quotes.<br />
<br />
The end result of the blocking process should be a tree structure with the paragraph node at the top, its sentence nodes as children, and the sub-sentence blocks beneath them. Consider the following example, taken from <a href="https://en.wikipedia.org/wiki/Mark_Twain">Wikipedia's article on Mark Twain</a>:<br />
<blockquote class="tr_bq">
I came in with Halley's Comet in 1835. It is coming again next year, and I expect to go out with it. It will be the greatest disappointment of my life if I don't go out with Halley's Comet. The Almighty has said, no doubt: 'Now here are these two unaccountable freaks; they came in together, they must go out together'.</blockquote>
And here is a visual way of representing the block tree for it:<br />
<br />
<center>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIdg_a7X1rRL8SJJwuKBmmjRlUW9sQZ0rCXXmoVkSeSblUzvIUzCcPVaVJUFez7YQUzWt43Mzx4WJK0kloNDPueIFmkiGDtPeY3H0UZcPSHSye0UdqajNAmfC3mMC_WMCD0cAUNJMC7IkZ/s1600/2016-12-03+-+Blocker+-+01.gif" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIdg_a7X1rRL8SJJwuKBmmjRlUW9sQZ0rCXXmoVkSeSblUzvIUzCcPVaVJUFez7YQUzWt43Mzx4WJK0kloNDPueIFmkiGDtPeY3H0UZcPSHSye0UdqajNAmfC3mMC_WMCD0cAUNJMC7IkZ/s1600/2016-12-03+-+Blocker+-+01.gif" style="max-width: 500px; width: 100%;" /></a>
</center>
<br />
Each block has slots for an "opener" and "closer". These are typically filled by punctuation like periods, quotes, and parentheses. Inside each block is either a string of textual tokens or a set of sub-blocks, if the contents are heterogeneous.<br />
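One plausible way to carry that structure in code (the names here are mine, not the actual BEP classes):

```cpp
#include <cassert>
#include <string>
#include <vector>

// A node in the block tree. A block either holds a flat run of
// token strings or a list of sub-blocks.
struct Block {
    char type;                        // 'A' paragraph, 'S' sentence, ...
    std::string opener;               // e.g. "\"" or "(", may be empty
    std::string closer;               // e.g. "." or ")", may be empty
    std::vector<std::string> tokens;  // leaf contents
    std::vector<Block> children;      // sub-blocks, if heterogeneous
};

// Walk a block tree and count every token in it.
int tokenCount(const Block& b) {
    int n = static_cast<int>(b.tokens.size());
    for (const Block& child : b.children) n += tokenCount(child);
    return n;
}

// A tiny sample: a paragraph holding one three-token sentence.
Block sampleParagraph() {
    Block s{'S', "", ".", {"I", "came", "in"}, {}};
    return Block{'A', "", "", {}, {s}};
}
```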
<br />
Being a programmer, I have my own more cryptic way of displaying these structures using plain text. Here's what it looks like for me (color manually added here):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"> A(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="color: orange;">I(N.pron) came(V.pret) in(P) with(P) @Halley's(N.poss) @Comet(N) in(P) '1835' </span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | <span style="color: #6aa84f;">'.'</span> )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="color: orange;">It(N.pron) is(V.aux) coming(V.gerprt/N) again(V) next(J/R) year(N) ',' and(C) I(N.pron) expect(V) to(P/R) go(V) out(R/P/J) with(P) it(N.pron) </span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | <span style="color: #6aa84f;">'.'</span> )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="color: orange;">It(N.pron) will(V.modal/N) be(V.aux) the(D) greatest(J.superl/R.superl/D/N.prop) disappointment(N/J/U/D/V) of(P) my(N.pron.poss) life(N) if(S) I(N.pron) don't([do not]) go(V) out(R/P/J) with(P) @Halley's(N.poss) @Comet(N)</span> </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | <span style="color: #6aa84f;">'.'</span> )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="color: orange;">The(D) @Almighty(U) has(V.aux) said(V.pret) ',' no(D) doubt(V) ':' </span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> |)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Q( <span style="color: #6aa84f;">'''</span> | </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="color: orange;">Now(R) here(R) are(V.aux.plur) these(D/N.pron.plur/R) two(D) unaccountable(J/N) freaks(U) ';' they(N.pron.plur) came(V.pret) in(P) together(R) ',' they(N.pron.plur) must(V.modal/N) go(V) out(R/P/J) together(R) </span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | <span style="color: #6aa84f;">'''</span> )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | <span style="color: #6aa84f;">'.'</span> )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> |)</span><br />
<br />
In keeping with the bubble-like diagram above, the format for each block is: <i>Type( opener | text goes here | closer )</i>. Block types are indicated by prefixing a single letter before each, including:<br />
<ul>
<li>A: Paragraph</li>
<li>S: Sentence</li>
<li>Q: Quotation</li>
<li>P: Parenthetical</li>
<li>C: Custom</li>
<li>T: Text</li>
</ul>
Note that each word token indicates in parentheses one or more single-letter lexical category (part of speech) tags (e.g., "next(J/R)") plus some optional sub-categorization tags. LCs include:<br />
<ul>
<li>U: Unspecified</li>
<li>N: Noun (including pronouns)</li>
<li>V: Verb</li>
<li>J: Adjective</li>
<li>R: Adverb</li>
<li>P: Preposition</li>
<li>D: Determinative (articles, numbers, etc.)</li>
<li>S: Subordinator (conjunction)</li>
<li>C: Coordinator (conjunction)</li>
<li>L: Correlator (conjunction)</li>
<li>I: Interjection</li>
<li>Y: Symbol</li>
</ul>
I won't spell out all the subcategories here, but examples include "N.pron.plur" for plural pronouns (<i>these</i>), "V.gerprt" for verbs as gerunds or present participles (<i>running</i>, <i>coloring</i>), and "J.compar" for comparative adjectives (<i>better</i>, <i>curiouser</i>). I'm convinced that this finer-grained categorization for words will aid in later syntax analysis.<br />
<div>
<br /></div>
<div>
I also prefix each word with an at sign (@) when it is considered likely to be used as a name in this text (e.g., <i>Halley's</i> and <i>Comet</i>).<br />
<br />
<h2>
Custom blocks</h2>
I'm convinced that we sometimes use special formatting as a form of logical blocking. Consider my use of <i>italics</i> in this blog post to highlight examples of what I'm demonstrating. Sometimes I include whole sentences or sentence fragments in italics within narrative statements. Sometimes the italicized contents are grammatically part of the sentence, but in other cases they are ungrammatical, just the same as if they were protected within parentheses or double quotes.<br />
<br />
I did not want to get into the weeds of parsing richly formatted text like HTML or RTF, but I also did not want to ignore this special case. So I decided to introduce a feature where the calling application can add custom XML-looking blocking tags. They appear in the otherwise plain text as "regular text <123>custom block</123> more regular text", where "123" is any unique integer that must match on both tags. If there's a start or end tag that doesn't have an appropriate matching tag, it will be regarded as ordinary text to include in the sentence where it appears. And custom block tags must appear properly nested and not overlapping, following the same logic for nesting of tags within <a href="https://en.wikipedia.org/wiki/Well-formed_document">well-formed XML</a>.<br />
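The nesting rule can be checked with a simple stack. This is only a sketch of the well-formedness test; as described above, the real blocker demotes an unmatched tag to ordinary text rather than rejecting the paragraph:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Check that numeric custom-block tags like <1>...</1> are matched
// and properly nested, never overlapping. Anything that doesn't look
// like a custom tag is skipped as ordinary text.
bool tagsWellFormed(const std::string& text) {
    std::vector<int> stack;
    for (std::size_t i = 0; i < text.size(); i++) {
        if (text[i] != '<') continue;
        std::size_t j = i + 1;
        bool closing = j < text.size() && text[j] == '/';
        if (closing) j++;
        std::size_t start = j;
        while (j < text.size() &&
               std::isdigit(static_cast<unsigned char>(text[j]))) j++;
        if (j == start || j >= text.size() || text[j] != '>')
            continue;                          // not a custom tag
        int id = std::stoi(text.substr(start, j - start));
        if (!closing) {
            stack.push_back(id);
        } else {
            if (stack.empty() || stack.back() != id) return false;
            stack.pop_back();
        }
        i = j;
    }
    return stack.empty();
}
```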
<br />
Although block IDs (the integer value) can be reused within a paragraph, it is wise to make them unique throughout a document. One benefit is that the calling application can maintain a dictionary of known IDs and whatever properties are of significance to that application, such as the special formatting.<br />
<br />
Here's a sample sentence illustrating the concept:<br />
<blockquote class="tr_bq">
I came <span style="color: #6aa84f;"><1></span>in with <span style="color: #6aa84f;"><2></span>Halley's Comet<span style="color: #6aa84f;"></2></1></span> in <span style="color: #6aa84f;"><3></span>1835<span style="color: #6aa84f;"></3></span>.</blockquote>
And here's the debugging output:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"> A(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T(| <span style="color: orange;">I(N.pron) came(V.pret)</span> |)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> C( <span style="color: #6aa84f;"><b>'<1>'</b></span> | </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T(| <span style="color: orange;">in(P) with(P)</span> |)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> C( <span style="color: #6aa84f;"><b>'<2>'</b></span> | <span style="color: orange;">@Halley's(N.poss) @Comet(N)</span> | <span style="color: #6aa84f;"><b>'</2>'</b></span> )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | <span style="color: #6aa84f;"><b>'</1>'</b></span> )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T(| <span style="color: orange;">in(P)</span> |)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> C( <span style="color: #6aa84f;"><b>'<3>'</b></span> | <span style="color: orange;">'1835'</span> | <span style="color: #6aa84f;"><b>'</3>'</b></span> )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | '.' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> |)</span><br />
<br />
I don't want to go into great detail here on this, but want to make the point that custom block markers grant the calling application a special ability to mark portions of text explicitly as blocks that may or may not be grammatical in order to give the syntax analyzer the chance to decide.</div>
<br />
<h2>
Designing the blocker</h2>
Conceptually, my basic English parser (BEP) product has a pipeline with the following modules:<br />
<span style="font-size: large;"><br /></span>
<br />
<div style="font-size: larger; text-align: center;">
Tokenizer → Virtual Lexicon → Blocker</div>
<br />
However, the reality is that these are not strictly separate modules. The blocker is really just some extra code within the tokenizer. This is mainly because the code didn't seem complicated enough to me to warrant its own class. Plus I didn't want yet another layer needlessly slowing things down.<br />
<br />
I've been programming this all in <a href="https://en.wikipedia.org/wiki/C%2B%2B11">C++11</a> using <a href="https://en.wikipedia.org/wiki/Xcode">Xcode</a> on my <a href="https://en.wikipedia.org/wiki/IMac">iMac</a> & <a href="https://en.wikipedia.org/wiki/MacBook">MacBook</a>, but I'm reluctant to show actual code here. I'm more interested in describing the algorithms so programmers and non-programmers can understand the concepts.<br />
<br />
In brainstorming about how to approach the blocker, I was initially inspired by the familiar <a href="https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form">BNF</a> specifications typically used to describe <a href="https://en.wikipedia.org/wiki/Context-free_grammar">context-free grammars</a> like programming languages and data file formats like XML. But there was one huge problem: each paragraph can be blocked into several alternative block trees. BNF wasn't designed with alternative interpretations in mind and neither were the kinds of parsers that implement BNF. There is supposed to be only one valid output when parsing a given context-free grammar file.<br />
<br />
I want to emphasize that support for multiple interpretations is a key concept I've embedded in all three layers of my BEP so far. Natural languages are full of ambiguities. They are not resolved by arbitrarily choosing the "best" option in each layer. That's a brittle solution. And it's one thing that bothers me about almost all <a href="https://en.wikipedia.org/wiki/Natural_language_processing">NLP</a> products: they almost never seem to allow ambiguities among their processing layers, let alone in their output. By contrast, my <a href="http://jvcai.blogspot.com/2016/11/morphological-parser_26.html">virtual lexicon</a> (VL) typically outputs several likely interpretations ("senses") for each word. "Jim's" may be the possessive of "Jim" or a contraction representing "Jim is" or "Jim has". "Out" may be an adverb (<i>out in the middle of nowhere</i>), preposition (<i>out the door</i>), or adjective (<i>She is out about her gender identity</i>). The VL couldn't possibly decide which one is correct on its own, so it simply reports these options to other layers, like the syntax analyzer, that are more qualified to resolve them. This same concept applies to blocking.<br />
<br />
I was also inspired by my approach to morphological parsing, which involves recursively drilling down, character by character from right to left, searching for possible morphemes. Each time this <a href="https://en.wikipedia.org/wiki/Brute-force_search">brute-force search</a> reaches the end (first character in the word) by accounting for all the morphemes in the word, the endpoint in the parse tree being recursively constructed gets added to a list of all "tails" representing successes. Afterward, another algorithm works backward from each tail to construct a linear list of the morphemes discovered, teasing the successes out of the tree and weeding out the failures. But I realized the morphemes in a word represent a linear list instead of a tree structure, so I had trouble at first seeing how it could apply to constructing a parse tree of block trees.<br />
<br />
The problem was that I needed an algorithm that could build lots of alternative trees in the same way my morphological parser built lots of lists.<br />
<br />
I settled on a multi-pass approach. Each pass through the entire paragraph gradually transforms the tokens into the final set of block trees. Pass 1 creates blocks of contiguous tokens and mini-blocks representing potential block dividers like parentheses, quotes, and punctuation. Pass 2 transforms these lists of proto-blocks into "suggestions" for how they should be represented as trees. Pass 3 scores the merits of each interpretation and throws away all but the best N of them. Pass 4 creates cleaner interpretation chains from the source chains that still have lots of overlap, enabling transformations to each chain that won't screw up the other chains. Pass 5 proposes sentence (and sub-sentence) begin and end tags to ensure that every potential block's start and end is clearly spelled out. Pass 6 transforms these tree-flavored chains into actual tree structures and, in the process, restructures them to improve their simplicity, clarity, and interpretation. Pass 6 also looks at the paragraph's whole block tree to make minor refinements and generate usable statistics.<br />
<br />
<h2>
Pass 1: Creating block tokens</h2>
When constructing blocking, it's worth noting that you can tell where block boundaries definitely won't be. For example, a string of plain words will never have a boundary placed inside it. In fact, the only places block boundaries could possibly exist are where certain kinds of punctuation like question marks, apostrophes, and right-parenthesis characters are found. There's no sense wasting computing power looking for boundaries elsewhere.<br />
<br />
So the first step goes through all the tokens in a paragraph created by the tokenizer. The result is a list of a higher order of tokens representing blocks of other tokens that can't be broken down any smaller, from a blocking perspective.<br />
<br />
I struggle with the idea of calling them "tokens", given that I already use this term in its much more conventional sense to refer to single words, symbol characters, and so on. The term "block-token" is slightly better, but messy. But given how I actually chose to implement this algorithm, I'm just going to call these higher order tokens "blocks" for now. It should become apparent why momentarily.<br />
<br />
Rather than create a distinct block-token data structure to represent these special kinds of tokens, I decided to create one "block" structure and use blocks for both this linear breakdown, sans hierarchy, and the final hierarchic structure that results from this kind of parsing.<br />
<br />
Each block is defined primarily by a type, by whether it is an "in" or "out" (explained below), and by a list of tokens or child blocks it contains. You may recall that block types include paragraph, sentence, text, quotes, parentheses, and custom. When we come across an open or close parenthesis or square bracket token, for example, a block is created with just that one token in it. Same for apostrophes or double-quotes, which get the "quotes" type. Periods, exclamation points, and question marks get typed as "sentence". Custom block markers are easily recognized because the tokenizer already flagged them with a special "block marker" token type (<a href="http://jvcai.blogspot.com/2016/11/adaptive-tokenizer.html">recall</a> that all tokens have their own types, including word, symbol, number, etc.).<br />
<br />
Every other token gets lumped into text-type blocks. They will mostly be word-type tokens, but may also be symbols not listed above. One special exception is the <a href="http://www.fileformat.info/info/unicode/char/2026/index.htm">horizontal ellipsis character</a>, which is regarded here as potential sentence-ending punctuation like periods are.<br />
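To make this pass more concrete, here's a minimal sketch of the kind of classification involved. This is not my actual BEP code; the type and function names here are invented purely for illustration:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real BEP structures described above.
enum class BlockType { Paragraph, Sentence, Text, Quotes, Parens, Custom };

struct Block {
    BlockType type;
    bool in;                            // opens a child block (Pass 2)
    bool out;                           // closes a child block (Pass 2)
    std::vector<std::string> tokens;    // leaf content (Pass 1)
    std::vector<Block> children;        // child blocks (Pass 6)
};

// Pass 1 classification: which proto-block type does a token suggest?
BlockType classify(const std::string& tok) {
    if (tok == "(" || tok == ")" || tok == "[" || tok == "]")
        return BlockType::Parens;
    if (tok == "\"" || tok == "'")
        return BlockType::Quotes;
    if (tok == "." || tok == "!" || tok == "?" || tok == "\xE2\x80\xA6")
        return BlockType::Sentence;     // horizontal ellipsis counts too
    return BlockType::Text;             // words, numbers, other symbols
}
```

Contiguous text-type tokens then get coalesced into single text blocks, while each potential divider stands alone in its own one-token block.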
<br />
Here's a sample sentence and representation of these proto-blocks. Note the custom blocks (1 & 2), which I added to correspond to the italicized book titles.<br />
<blockquote class="tr_bq">
Among his novels are &lt;1&gt;<i>The Adventures of Tom Sawyer</i>&lt;/1&gt; (1876) and its sequel, &lt;2&gt;<i>Adventures of Huckleberry Finn</i>&lt;/2&gt; (1885), the latter often called "The Great American Novel".</blockquote>
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">C[ &lt;1&gt; ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ The Adventures of Tom Sawyer ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">C[ &lt;/1&gt; ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">P[ ( ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ 1876 ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">P[ ) ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ and its sequel , ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">C[ &lt;2&gt; ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ Adventures of Huckleberry Finn ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">C[ &lt;/2&gt; ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">P[ ( ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ 1885 ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">P[ ) ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ , ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ the latter often called ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">Q[ " ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ The Great American Novel ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">Q[ " ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">S[ . ]</span></li>
</ul>
<br />
<h2>
Pass 2: Tree discovery</h2>
The second pass takes this linear list of proto-blocks and adds "in" and "out" flags as appropriate. This is more complex than it sounds, though.<br />
<br />
First, what do I mean by "in"? A block is flagged as "in" if it represents the beginning of a child block. The simplest example is an open parenthesis character ("("). Conversely, a close parenthesis character (")") would be flagged as "out".<br />
<br />
Ending punctuation <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span> for now I'll just shorten that to "punctuation" for simplicity <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span> gets flagged as both "out" and "in". Why? Let's say you had a paragraph with two simple sentences containing nothing but words and punctuation. The end result will be two "sentence" blocks, so it makes sense to think of the punctuation after that first sentence as representing the end ("out") of one block and beginning ("in") of another block.<br />
<br />
So here is how the above sentence might be represented with all its "out" and "in" flags. "Out" is represented by "^" (as in up the hierarchic tree) and "in" by "v" (as in down deeper into the tree):<br />
<br />
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ Among his novels are ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">C[ &lt;1&gt; ]v</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ The Adventures of Tom Sawyer ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">C[ &lt;/1&gt; ]^</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">P[ ( ]v</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ 1876 ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">P[ ) ]^</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ and its sequel , ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">C[ &lt;2&gt; ]v</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ Adventures of Huckleberry Finn ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">C[ &lt;/2&gt; ]^</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">P[ ( ]v</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ 1885 ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">P[ ) ]^</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ , ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ the latter often called ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">Q[ " ]v</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">T[ The Great American Novel ]</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">Q[ " ]^</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">S[ . ]^v</span></li>
</ul>
<br />
It seems like it should be easy to produce this in one simple step, but it isn't. Quotes create a problem. While some texts contain Unicode characters that clearly represent left and right single and double quotes, not all texts do, and sometimes the marks are simply wrong because of oddball use of the word processors that generated them. My tokenizer simply standardizes them into "straight quotes" with no left or right polarity. And now we pay the price. A given quote may potentially represent the start or the end of a block.<br />
<br />
For another thing, sometimes characters do not actually represent block markers. Take the following example:<br />
<blockquote class="tr_bq">
I need you to pick up 1) apples, 2) bananas, and 3) strawberries.</blockquote>
Or this example:<br />
<blockquote class="tr_bq">
A sentence may contain quotes (") and brackets ('[' and ']').</blockquote>
There are actually many possible ways to interpret this one. You and I know that the square bracket characters are merely literal symbols in the text, not markers of an actual bracketed block nested inside the quotes and parentheses. How does an algorithm figure this out?<br />
<br />
The answer is that this pass is responsible for coming up with all the possible interpretations of each symbol and letting the next pass decide which ones are most likely. The way it does this is by building a tree of all the possibilities.<br />
<br />
Consider a (double) quote character, for example. It could represent a plain old symbol with no blocking meaning, the opening of a quotation block, or the closing of a quotation block. Thus, finding a quote adds 3 branches to the tree below the node representing the earlier block. The tree nodes contain the "in" and "out" flags and a reference to the appropriate proto-block.<br />
<br />
Let me spell this out more clearly. A recursive function named <span style="font-family: "courier new" , "courier" , monospace;">.find_blocks()</span> examines a specific block in the list of them and adds at least one node to a tree it is constructing. It then recurses, calling <span style="font-family: "courier new" , "courier" , monospace;">.find_blocks()</span> with the goal of looking at the next block and adding its own nodes to the tree under each node it added. So in the quote example, <span style="font-family: "courier new" , "courier" , monospace;">.find_blocks()</span> gets called once for each of the 3 branch nodes created representing the possible interpretations.<br />
<br />
Along the way, it's possible to weed out some clearly impossible chains of interpretation. Consider a paragraph consisting of a single sentence that contains one piece of quoted text. Interpreting the first quote as the closing of a block makes no sense because there was no corresponding open quote earlier.<br />
<br />
This may sound difficult to recognize, but it's actually quite easy. The <span style="font-family: "courier new" , "courier" , monospace;">.find_blocks()</span> function takes a "level" argument as input. On the first call, zero is passed in. When the function considers an "in" interpretation of a symbol, its subsequent call to <span style="font-family: "courier new" , "courier" , monospace;">.find_blocks()</span> will pass in level + 1. Conversely, when it considers an "out" interpretation of a symbol, it passes in level - 1. The first thing the <span style="font-family: "courier new" , "courier" , monospace;">.find_blocks()</span> function does is check "level". If it is below zero, it knows that this cannot represent a valid interpretation of the paragraph's blocking.<br />
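As a toy illustration of this pruning (this is not the real <span style="font-family: "courier new" , "courier" , monospace;">.find_blocks()</span>, which builds a tree of interpretation nodes rather than merely counting them), here's the same level logic applied to a string where every double quote can be read three ways:

```cpp
#include <cassert>
#include <string>

// Toy version of the level-pruning recursion: each '"' may be a plain
// symbol (level unchanged), an opener (+1), or a closer (-1). A chain of
// interpretations is valid only if level never dips below zero and ends
// at exactly zero. This sketch just counts the valid chains.
int count_interpretations(const std::string& toks, size_t i, int level) {
    if (level < 0) return 0;                  // closed a block never opened
    if (i == toks.size()) return level == 0;  // must be balanced at the end
    if (toks[i] != '"')                       // ordinary token: no branching
        return count_interpretations(toks, i + 1, level);
    return count_interpretations(toks, i + 1, level)        // plain symbol
         + count_interpretations(toks, i + 1, level + 1)    // "in"
         + count_interpretations(toks, i + 1, level - 1);   // "out"
}
```

For text with two quotes, only two chains survive the pruning: both quotes as plain symbols, or the first as an opener matched by the second as a closer.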
<br />
Proceeding forward, block by block, the recursion of <span style="font-family: "courier new" , "courier" , monospace;">.find_blocks()</span> eventually reaches the last proto-block of tokens in the sentence <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span> typically punctuation. One final check is done to see if level equals zero. This is also required for a valid interpretation. An example of how this might be violated would be if we interpreted the two quotes found in that single sentence as both opening quotes but never found closing quotes to balance them. Level would thus be 2 by the end, not zero.<br />
<br />
There is one very special exception to this rule. In fiction writing, it is common for a paragraph to end with quoted dialog that does not end with a closing quote. The convention is that the next paragraph should begin immediately with an open quote. Here's a trivial example:<br />
<blockquote class="tr_bq">
Chrystal said, "What were you thinking?<br />
<br />
"Nevermind. We'll figure it out," she continued.
</blockquote>
This is valid, so I didn't want to ignore this special case. My answer is that, when <span style="font-family: "courier new" , "courier" , monospace;">.find_blocks()</span> reaches the end of the sentence and level happens to be 1, it considers this possibility. To support it, before <span style="font-family: "courier new" , "courier" , monospace;">.find_blocks()</span> is ever called, the calling code counts the number of double quote character tokens in the text and passes in a boolean flag indicating whether there is an odd number of them. So if level = 1 and the paragraph contains an odd number of double quotes, it adds one more branch to the tree with a contrived block representing a closing quote, but that pseudo-block does not refer to any specific tokens the way the other blocks do. The result is that level is now effectively back down to zero at this point.<br />
<br />
I want to note that this is a bit of a hack in that it will only work for double-quote characters. It would not work if dialogue were represented with single or triple quotes (or double quotes represented by double apostrophe characters) because of the ambiguity these representations would create. That said, I don't recall ever in my life running into these awkward cases.<br />
<br />
So the last thing the <span style="font-family: "courier new" , "courier" , monospace;">.find_blocks()</span> function does, once it has weeded out impossible (level <span style="background-color: #f9f9f9; color: #333333; font-family: "open sans" , "helvetica neue" , "helvetica" , "arial" , sans-serif; font-size: 14px;">≠</span> 0) cases, is add the last node to a list of "tails". Each tail represents a successful chain of possible interpretations from the first proto-block to the last in the paragraph. Traversing backward from each tail toward the root node represents whole single chains of interpretation. If a paragraph yields 14 tails, that means there were 14 reasonable interpretations of the blocking created and waiting to be considered by the next pass.<br />
<br />
<h2>
Pass 3: Scoring the possible-tree chains</h2>
<div>
The previous pass's output is a list of tail nodes that are linked lists <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">— what I call "chains" </span><span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">— leading from the last proto-block back to the first and bringing along each of the interpretations of the source data that were found and considered within the realm of possibility.</span></div>
<div>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"><br /></span></div>
<div>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">This pass, then, scores each chain, sorts them by their scores, and keeps the N best options.</span></div>
<div>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"><br /></span></div>
<div>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">Scoring involves following each chain from its tail </span><span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">— the rightmost block in the paragraph </span><span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">— back to its head </span><span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">— a node representing the beginning of the paragraph just before the first block. The score is actually an integer penalty value, just like with the morphological parser. The best possible score would be zero and any positive value indicates some penalties that may represent defects in that interpretation. The score would never be negative.</span></div>
<div>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"><br /></span></div>
<div>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">An example of a penalty is when a double quote seen as an opener is followed by a space. Similarly, when a close quote is preceded by a space. Typically, opening quotes are preceded by space and followed by none and vice-versa for closing quotes.</span></div>
<div>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"><br /></span></div>
<div>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">Another thing penalized is when a token-block ends with an abbreviation like "Dr." (or initialism like "A.S.A.P"), but that period is also considered a sentence-block end. It's not impossible, of course. After all, many sentences end with "etc." But it does make sense to penalize this scenario and thus favor interpretations where these "soft periods" are seen not as sentence endings.</span></div>
<div>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"><br /></span></div>
<div>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">Another thing penalized is sentence-ending punctuation followed by words that are not capitalized. This, again, can happen with abbreviations in the middle of sentences. We don't want to rule it out, as the author may have made a typo or not be inclined to use capitalization. Informal text messages and instant messages often completely lack capitalization.</span></div>
<div>
<br /></div>
<div>
A block that looks like it should be a block opener, closer, or sentence-ending punctuation but is not flagged as "in" or "out" is penalized. Consider this the "lazy option penalty". After all, one acceptable interpretation output by the previous pass is that all quotes, parentheses, and punctuation are just raw symbols and not block markers. But this is a highly unlikely case.</div>
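Pulled together, the scoring might look something like the following sketch. The penalty weights and the ChainNode fields here are invented for illustration; the real code covers more cases and uses its own numbers:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical chain node carrying just enough context to score.
struct ChainNode {
    std::string text;          // the token this node covers, if any
    bool is_open = false;      // flagged "in"
    bool is_close = false;     // flagged "out"
    bool space_before = false;
    bool space_after = false;
    bool soft_period = false;  // period ending an abbreviation like "Dr."
};

// Lower is better; zero is a perfect score and it is never negative.
int score_chain(const std::vector<ChainNode>& chain) {
    int penalty = 0;
    for (const ChainNode& n : chain) {
        // an opening quote followed by a space looks wrong
        if (n.text == "\"" && n.is_open && n.space_after) penalty += 2;
        // a closing quote preceded by a space looks wrong
        if (n.text == "\"" && n.is_close && n.space_before) penalty += 2;
        // an abbreviation period doubling as a sentence end is unlikely
        if (n.soft_period && n.is_close) penalty += 1;
        // the "lazy option": likely punctuation flagged neither in nor out
        if ((n.text == "." || n.text == "\"") && !n.is_open && !n.is_close)
            penalty += 3;
    }
    return penalty;
}
```

The chains then get sorted by this penalty, with ties broken however the implementation likes.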
<br />
Once each chain is scored, they get sorted such that the lowest-penalty chain is on top and the others represent progressively less likely interpretations of the paragraph's possible blocking. Although in principle every such interpretation has the potential to be the correct one, in practice it doesn't make sense to keep all of them. With really oddball text <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span> say, containing mathematical expressions or programming code <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span> there could potentially be thousands of chains output. So the list of chains gets truncated to some practical limit, which defaults to 10 but which the calling application can override.<br />
<br />
<h2>
Pass 4: Linearizing the chains</h2>
This pass exists mainly to keep me from going insane. The chains thus far are <a href="https://en.wikipedia.org/wiki/Linked_list#Singly_linked_list">singly linked lists</a> that proceed from right to left. I had originally been fine working with this, even though my head spun at terminology like calling the node representing the block to the right the "previous" node instead of the "next" one.<br />
<br />
But moreover, I found that later passes needed to actually modify nodes in the chains. Remember that for now, these chains are actually just the parse tree viewed backward, traversing from leaf nodes toward the root. That means that two chains likely share nodes. So manipulating a node in one chain has the potential to do the same to that node in another chain, which is bad.<br />
<br />
I also found that there was value in having a <a href="https://en.wikipedia.org/wiki/Linked_list#Doubly_linked_list">doubly-linked list</a> later. I simply couldn't do that with an inverted tree.<br />
<br />
In this stage, I simply traverse each chain and produce a new chain with copies of all the nodes traversed. The old chains <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span> and the tree they came out of <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span> are discarded. Moreover, I now have a list of "heads" and discard the old list of "tails". This new list keeps me slightly more sane.<br />
<br />
I want to take a moment to point out something that came as a surprise to me. I'm programming this stuff in C++. Programmers with only passing familiarity with C++ might well regard it as harder to program in. One classic problem is that sloppy use of explicit object instantiation and cleanup <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span> pointer hell <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">— can lead to memory leaks, among other things. One implicit assumption explicit instantiation invites is that you have to do all the hard work of copying data structures yourself. Ironically, the opposite is true, especially in the newer, more standardized C++, which I did not have available in my early programming days. In fact, C++ makes it stunningly easy to copy whole, deep data structures. My NLP work creates some real doozies, too. Programmers familiar with C#, Visual Basic, JavaScript, Java, and most of the other mainline languages are used to passing objects around by reference. The idea of cloning a deep structure in one single assignment like </span><span style="background-color: white; font-size: 16px;"><span style="font-family: "courier new" , "courier" , monospace;">auto b = a;</span></span><span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"> is completely alien. But, dear God, is it a refreshing change of pace, not having to create </span><span style="background-color: white; font-size: 16px;"><span style="font-family: "courier new" , "courier" , monospace;">.clone()</span></span><span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"> methods on every custom class! C++'s comfortable use of explicit references and pointers takes care of the rest, where you don't want massively wasteful copying. 
The real kicker is how fast cloning of standard list and dictionary structures </span><span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span><span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"> the real workhorses of deep data structures </span><span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span><span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"> is.</span><br />
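Here's a tiny demonstration of that point. The Node type is contrived, but the copy semantics are exactly what I'm describing: one assignment clones the whole tree, children and all:

```cpp
#include <cassert>
#include <string>
#include <vector>

// A contrived deep structure: a tree node owning its children by value.
struct Node {
    std::string label;
    std::vector<Node> children;
};

bool deep_copy_demo() {
    Node a;
    a.label = "root";
    a.children.push_back(Node{"child", {}});
    Node b = a;                        // one assignment: full deep copy
    b.children[0].label = "changed";   // mutate the copy only
    return a.children[0].label == "child";  // the original is untouched
}
```

Where copying would be wasteful, plain references and pointers opt you back out of it, which is the best of both worlds.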
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"><br /></span>
<br />
<h2>
Pass 5: Finding sentence beginnings and endings</h2>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">Once we've got the opening and closing tags worked out for quotes, parentheticals, and custom blocks, we're left with the ambiguities of finding the beginnings and endings of sentence blocks. Again, this may sound like a trivial task, but it's not.</span><br />
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"><br /></span>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">Here's an example of the output of this pass, using a sentence taken from a <a href="http://www.cnn.com/2016/11/26/americas/fidel-castro-obit/index.html">CNN article</a>:</span><br />
<blockquote class="tr_bq">
Castro became famous enough that he could be identified by only one name. A mention of "Fidel" left little doubt who was being talked about.
<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">S<span style="color: orange;">[ ]</span>v</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T[ Castro became famous enough that he could be identified by only one name ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">S<span style="color: orange;">[ . ]</span>^</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">S<span style="color: orange;">[ ]</span>v</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T[ A mention of ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Q<span style="color: orange;">[ " ]</span>v</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T[ Fidel ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Q<span style="color: orange;">[ " ]</span>^</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T[ left little doubt who was being talked about ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">S<span style="color: orange;">[ . ]</span>^</span></blockquote>
Note that each sentence has both opener ( <span style="font-family: "courier new" , "courier" , monospace;">S[ ]v</span> ) and closer ( <span style="font-family: "courier new" , "courier" , monospace;">S[ . ]^</span> ) blocks.<br />
<br />
Once again, this relies on a recursive function, this time called <span style="font-family: "courier new" , "courier" , monospace;">.find_sentences()</span>. Inside, it loops through all the nodes in the current paragraph-representing chain of blocks. When it comes across an ending punctuation mark, it splits that block into two separate ones. Whereas the original was flagged as both "out" and "in", the new ones divide these flags between them. The new sentence opener blocks, unlike others, do not point to any actual tokens, just like the virtual closing quote described earlier.<br />
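Sketched in code, the split looks roughly like this; SentMark is my stand-in here for the real block structure, not the actual BEP class:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a sentence-punctuation block in the chain.
struct SentMark {
    std::string token;  // the punctuation it covers; empty if virtual
    bool in;            // begins a sentence
    bool out;           // ends a sentence
};

// A punctuation block flagged both "out" and "in" (like S[ . ]^v) splits
// into a closer that keeps the token (S[ . ]^) and a virtual opener that
// refers to no token at all (S[ ]v).
std::pair<SentMark, SentMark> split_sentence_mark(const SentMark& m) {
    SentMark closer{m.token, false, true};
    SentMark opener{"", true, false};
    return std::make_pair(closer, opener);
}
```

The virtual opener then marks where the next sentence begins without claiming any characters of the source text.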
<br />
This works great for the division between two sentences, but misses the beginning of the first sentence. It also misses the end of the final sentence which may be missing ending punctuation. It would be tempting to simply add end-caps to a whole sentence, but that would miss an important point that sentences may contain other sub-sentences inside quotations or parentheticals. So when the looky loop inside <span style="font-family: "courier new" , "courier" , monospace;">.find_sentences()</span> finds an "in" block representing the beginning of a child block, it executes <span style="font-family: "courier new" , "courier" , monospace;">.find_sentences()</span> recursively to process that inner block. When that finds the block closer, which is guaranteed easy at this point, it returns and this outer loop continues on where it left off.<br />
<br />
It would be great if that were all there was to it, but there are lots of gotchas. Consider the following example:<br />
<blockquote class="tr_bq">
Smith went on to explain. "We had to do something. We improvised."</blockquote>
The process described earlier would split the first period into two blocks, the latter being an "in" block representing the beginning of some other potential sentence. However, what follows is not a sentence, strictly speaking. It's a quotation containing two other sentences. <span style="font-family: "courier new" , "courier" , monospace;">.find_sentences()</span> sees this and lops off the newly minted sentence opener block and then does its work with the quotation.<br />
<br />
One complication is that it's not immediately obvious, when the loop runs across a block opener, that it contains any sentences. Many "quotes" contain single words or strings of words but no punctuation. They typically serve the same function as underlining or italicizing text: to call out a significant idea within a sentence without breaking the grammatical flow of the sentence. Practically speaking, the only way to tell if a block contains a sentence is to look for punctuation in it. When the main loop in <span style="font-family: "courier new" , "courier" , monospace;">.find_sentences()</span> comes across sentence-ending punctuation, it takes note of this. When it is done with the current block (or the outer, whole-paragraph block), it uses this information to decide whether to prepend a sentence opener virtual block to the beginning of the block. It also knows whether the block should end with a sentence closer. If punctuation was not found at the end, a virtual block with its "out" flag set is created and appended to the end of the block. If, however, no punctuation was ever found by the loop considering the current block, the block is assumed to contain no sentences.<br />
<br />
At this point, it's fair to say that our linear chains contain a very easy-to-interpret representation of a tree structure. Every child block has an opening and a closing marker, whether it refers to a real token (e.g., an apostrophe or right square bracket) or not.<br />
<br />
<h2>
Pass 6: Constructing the tree</h2>
<div>
<span style="font-family: "slabo";">Whereas all the passes leading up to this one have created a linear representation of a tree structure, this pass finally creates that tree data structure explicitly. It uses the same block objects as before, but now it uses them differently. Whereas up to now, our blocks have only contained tokens, if any, now our blocks may alternatively contain child blocks representing the breakdown of one block into sub-blocks as needed.</span></div>
<div>
<span style="font-family: "slabo";"><br /></span></div>
<div>
<span style="font-family: "slabo";">Here's the same example sentence used above, now represented in its final, blocked form:</span></div>
<blockquote class="tr_bq">
Castro became famous enough that he could be identified by only one name. A mention of "Fidel" left little doubt who was being talked about.
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">A(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="color: orange;">Castro(U) became(V.pret) famous(J/N.plur/N.prop) enough(D/R) that(N/S/P) he(N.pron) could(V.modal) be(V.aux) identified(V.pret/J) by(P) only(R/J) one(D/N.pron) name(N)</span> </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | <span style="color: #6aa84f;">'.'</span> )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T(| <span style="color: orange;">A(D) mention(N) of(P)</span> |)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Q( <span style="color: #6aa84f;">'"'</span> | <span style="color: orange;">Fidel(U)</span> | <span style="color: #6aa84f;">'"'</span> )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="color: orange;">left(J/V.pret/R) little(J) doubt(V) who(N) was(V.aux.pret) being(V.gerprt/N) talked(V.pret/J) about(P) </span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> |)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | <span style="color: #6aa84f;">'.'</span> )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">|)</span></blockquote>
<div>
<span style="font-family: "slabo";">The basic process is fairly straightforward, since the chain already contains clear indications of where all blocks are to begin and end. A paragraph block is created and then a recursive </span><span style="font-family: "courier new" , "courier" , monospace;">.construct_blocking()</span><span style="font-family: "slabo";"> function is called. It contains a loop charged with adding all the child blocks that are direct children of the current parent block, which is initially the (root) paragraph block. When the loop encounters an "out" node, it returns from its recursion. When it counters an "in" node is when things start to happen.</span><br />
<span style="font-family: "slabo";"><br /></span>
<span style="font-family: "slabo";">An "in" node often begins with a single token representing the block's opener, such as an open parentheses symbol. This node's block will be jettisoned, but its token will migrate to the new block's "opener" property. As described earlier, each block can optionally point to one "opener" and one "closer" token. In the output above, these appear on the left or right end of a block's representation, inside the "</span><span style="font-family: "courier new" , "courier" , monospace;">( <i><span style="color: #6aa84f;">opener</span></i> |</span><span style="font-family: "slabo";">" or "</span><span style="font-family: "courier new" , "courier" , monospace;">| <i><span style="color: #6aa84f;">closer</span></i> )</span><span style="font-family: "slabo";">" parts.</span><br />
<span style="font-family: "slabo";"><br /></span>
<span style="font-family: "slabo";">Once the inner call to </span><span style="font-family: "courier new" , "courier" , monospace;">.construct_blocking()</span><span style="font-family: "slabo";"> returns, the next block will inevitably be the one containing the closer. We jettison that block, nabbing the token it points to and making it the current block's closer token.</span><br />
<span style="font-family: "slabo";"><br /></span>
<span style="font-family: "slabo";">Any other block the loop comes across will be a text block. We simply add it as a child to the current block. This isn't ideal, as we'd often rather just have the tokens and not simply a text block inside a parent block, but we'll get to that momentarily. One special thing we do, however, is check whether this new text block will be following another text block and merge them together. There are a few circumstances under which this could happen, but it doesn't make sense to have them be separate, ultimately.</span><br />
<span style="font-family: "slabo";"><br /></span>
<span style="font-family: "slabo";">Now the real fun begins: restructuring. To illustrate, here's the result of tree construction without and then with restructuring:</span><br />
<blockquote class="tr_bq">
"We came here with a round-trip ticket ... because we thought the revolution was going to last days," said Rep. Ileana Ros-Lehtinen, who came to Florida as a child and went on to become the first Cuban-American elected to Congress. "And the days turned into weeks, and the weeks to months, and the months to years."<br />
<span style="font-family: "slabo";"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">A(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Q( '"' | </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="background-color: yellow;">T(|</span> </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> We(N.pron.plur) came(V.pret) here(R) with(P) a(D) round-trip(N/V) ticket(N) </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="background-color: yellow;">|)</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="background-color: yellow;">| </span><span style="background-color: lime;">'…'</span><span style="background-color: yellow;"> )</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="background-color: yellow;">S(|</span> </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="background-color: yellow;">T(|</span> </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> because(S) we(N.pron.plur) thought(N) the(D) revolution(N) was(V.aux.pret) going(V.gerprt/N) to(P/R) last(V) days(N.plur/V.3rdsg) ',' </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="background-color: yellow;">|)</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> |)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | '"' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> said(V.pret) @Rep(N) @Ileana(D) @Ros-lehtinen(J/V.pstprt) ',' who(N) came(V.pret) to(P/R) @Florida(D/N/U) as(R/S/P) a(D) child(N) and(C) went(J) on(P) to(P/R) become(V/N) the(D) first(D) @Cuban-american(J/D/V/N) elected(V.pret/J) to(P/R) @Congress(N) </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> |)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | '.' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Q( '"' | </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="background-color: yellow;">T(| </span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> And(C) the(D) days(N.plur/V.3rdsg) turned(V.pret/J) into(P) weeks(N.plur/V.3rdsg) ',' and(C) the(D) weeks(N.plur/V.3rdsg) to(P/R) months(N.plur/V.3rdsg) ',' and(C) the(D) months(N.plur/V.3rdsg) to(P/R) years(N.plur/V.3rdsg) </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <span style="background-color: yellow;">|)</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | '.' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | '"' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">|)</span><br />
<span style="font-family: "slabo";"><br /></span>
</blockquote>
I've highlighted extraneous bits. And here is the same parse, but with restructuring:<br />
<blockquote>
<span style="font-family: "slabo";"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">A(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Q( '"' | </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> We(N.pron.plur) came(V.pret) here(R) with(P) a(D) round-trip(N/V) ticket(N) <span style="background-color: lime;">'…'</span> because(S) we(N.pron.plur) thought(N) the(D) revolution(N) was(V.aux.pret) going(V.gerprt/N) to(P/R) last(V) days(N.plur/V.3rdsg) ',' </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> |)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | '"' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> T(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> said(V.pret) @Rep(N) @Ileana(D) @Ros-lehtinen(J/V.pstprt) ',' who(N) came(V.pret) to(P/R) @Florida(D/N/U) as(R/S/P) a(D) child(N) and(C) went(J) on(P) to(P/R) become(V/N) the(D) first(D) @Cuban-american(J/D/V/N) elected(V.pret/J) to(P/R) @Congress(N) </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> |)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | '.' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Q( '"' | </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> And(C) the(D) days(N.plur/V.3rdsg) turned(V.pret/J) into(P) weeks(N.plur/V.3rdsg) ',' and(C) the(D) weeks(N.plur/V.3rdsg) to(P/R) months(N.plur/V.3rdsg) ',' and(C) the(D) months(N.plur/V.3rdsg) to(P/R) years(N.plur/V.3rdsg) </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | '.' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | '"' )</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">|)</span></blockquote>
</div>
In this case, sentences whose only child is a single text block get reduced to sentences containing their tokens directly.<br />
<br />
But also, the ellipsis mark (…) gets reviewed and determined to most likely be inline instead of a sentence end. So the two sub-sentences inside the quotation get merged into one, with the ellipsis being just another token among the words and symbols in it.<br />
<br />
This is a very important point. Reviewing and restructuring the blocking for the sentences within a paragraph is much, much easier when there is an actual tree structure to work with. Here is a summary, without elaboration, of the restructuring rules processed during <span style="font-family: "courier new" , "courier" , monospace;">.construct_blocking()</span>:<br />
<br />
<ul>
<li>If this block's only child is a single text node, condense it.</li>
<li>A sentence that has no punctuation and contains only a quotation or parenthetical is not really a sentence; condense it.</li>
<li>A quotation nested directly within a quotation may be an alternative way of representing triple, quadruple, etc. quotes; condense it.</li>
<li>See if a quotation block should be merged with a following sentence fragment.</li>
<li>See if a quotation block should be merged with a preceding sentence fragment.</li>
<li>Join together two sentences separated only by ellipsis if the conditions are right.</li>
<li>Join together two sentences if the first ends in ellipsis and the second contains nothing but ending punctuation.</li>
<li>Migrate sentence punctuation out of ending quotes, but only when it follows a sentence fragment not ending in ",".</li>
</ul>
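As one concrete illustration, here is a Python sketch of just the quote-collapsing rule from the list above. The dict-based block shape is an assumption for illustration, not the author's C++ classes.

```python
# Collapse a quotation whose only child is itself a quotation, splicing the
# innermost children upward. Handles arbitrary nesting depth by recursing
# into children before examining the current block.

QUOTE_OPENERS = {"'", '"'}

def collapse_nested_quotes(block):
    """If a quote block's only child is itself a quote block, splice the
    inner block's children up into the outer one, recursively."""
    kids = block.get("children", [])
    for kid in kids:
        if isinstance(kid, dict):
            collapse_nested_quotes(kid)        # fix the deepest nesting first
    if (block.get("opener") in QUOTE_OPENERS and len(kids) == 1
            and isinstance(kids[0], dict)
            and kids[0].get("opener") in QUOTE_OPENERS):
        block["children"] = kids[0]["children"]
    return block
```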
One rule worth elaborating on is the one that collapses quotes within quotes. Here's a trivial example:<br />
<blockquote class="tr_bq">
Here's an ```example of triple quotes''' that will get collapsed.</blockquote>
Without restructuring, it looks like this:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">A(|<br /> S(|<br /> T(| Here's(N.poss/[here is]/[here has]) an(D) |)<br /> Q( ''' |<br /> Q( ''' |<br /> Q( ''' |<br /> T(| example(V/N/D) of(P) triple(V/N) quotes(N.plur/V.3rdsg) |)<br /> | ''' )<br /> | ''' )<br /> | ''' )<br /> T(| that(N/S/P) will(V.modal/N) get(V) collapsed(V.pret/J) |)<br /> | '.' )<br />|)</span></blockquote>
Three quotation blocks nested within one another. With restructuring, it looks like this:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">A(|<br /> S(|<br /> T(| Here's(N.poss/[here is]/[here has]) an(D) |)<br /> Q( ''' | example(V/N/D) of(P) triple(V/N) quotes(N.plur/V.3rdsg) | ''' )<br /> T(| that(N/S/P) will(V.modal/N) get(V) collapsed(V.pret/J) |)<br /> | '.' )<br />|)</span></blockquote>
Note that the opener and closer are not triple quotes. Keeping them as triple quotes would require changing how openers and closers are implemented so they could point to multiple tokens instead of single ones. More importantly, I realized that how the quotes are represented should not actually matter to later stages of parsing. All they care about is that some text is in quotes, regardless of whether the quotes are single, double, triple, etc.<br />
<br />
But the reason this is important is that it provides an answer to a nagging problem: how to deal with various representations of quotations. In this example, I have three grave accent (backtick) characters on the left and three apostrophes on the right, one way I've seen triple quotes represented before. It's tempting to think that it would have been easy to recognize this pattern using, say, a regular expression during the early tokenization process. However, doing so would likely run afoul of words that include apostrophes, such as <i>stores'</i> and <i>'im</i> (a truncation of <i>him</i>). Moreover, it would run afoul of scenarios in which quotes include other quotes, as in this example:<br />
<blockquote class="tr_bq">
She said, "He said, 'That's "'crazy pants'" fussy of you.'"</blockquote>
The first apostrophe-plus-quote actually is a closing triple quote, but the second one represents the close of a single-quoted block followed by the close of a double-quoted block. There's no way the tokenizer could have distinguished that without looking at the whole sentence for cues. But this block restructuring process handles this special case automatically, now that there is no more ambiguity left.<br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
</div>
<br />
<h2>
Extra goodies</h2>
As a final step, some analysis is done of the results. If, for example, an open quotation appears to carry over into the next paragraph, this paragraph will be flagged as such. This will make it easy for a later phase <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">— </span>say, a grammar checker <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">— to follow up and see whether this is a correct form, possibly by checking whether the next paragraph begins immediately with an open quote.</span><br />
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;"><br /></span>
<span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">Similarly, the code checks to see if the paragraph contains nothing but a sentence that does not end in any punctuation. It flags the paragraph as thus containing a single sentence fragment. This makes it easy for the calling application to detect headers, for example. Yes, this also often applies to bullet points, but those are often easy to detect by virtue of being marked as such in HTML or prefixed by bullet-like characters such as "-" and "*" in plain-text files.</span><br />
<br />
I do plan to extend these extra analyses to do things like count whole sentences in the paragraph, but it's not as important yet as nailing down the mechanics of blocking.<br />
<br />
<h2>
Conclusions</h2>
In summary, I created a data structure and algorithm for breaking a paragraph of English text down into smaller blocks like sentences, parentheticals, and quotations.<br />
<br />
Although I don't consider myself an expert in all the latest NLP research, I do believe I have a fairly good idea of what the state of the art is. <a href="http://www.research.lancs.ac.uk/portal/en/people/Scott-Piao/">Dr. Scott Piao</a> produced what appears to me to be a fairly <a href="http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector">good demonstrator</a> in Java back in 2008. But one thing I discovered is that it does not appear to deal well with sentences embedded in quotes or parentheses. Here is one example:<br />
<blockquote class="tr_bq">
Here is some text. "Here's a sentence. (What about a parenthetical? It could contain multiple sentences.) And another."</blockquote>
Piao's demo outputs the following representation:<br />
<br />
<table align="center" border="1" bordercolor="blue" style="color: black;"><tbody>
<tr align="CENTER" bgcolor="#FFFFCC"><td bordercolor="Darkgreen" colspan="2">Paragraph(1)</td></tr>
<tr><td>1</td><td>Here is some text.</td></tr>
<tr><td>2</td><td>"Here's a sentence.</td></tr>
<tr><td>3</td><td>(What about a parenthetical?</td></tr>
<tr><td>4</td><td>It could contain multiple sentences.)</td></tr>
<tr><td>5</td><td>And another."</td></tr>
</tbody></table>
<br />
Not bad, but it doesn't seem to care about the natural hierarchy of the sentences. By contrast, here's my blocker's output:<br />
<blockquote>
<span style="font-family: "courier new" , "courier" , monospace;">A(| </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| Here(R) is(V.aux) some(D/N/R) text(U) | '.' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Q( '"' | </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| Here's(N.poss/[here is]/[here has]) a(D) sentence(N) | '.' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> P( '(' | </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| What(N.pron) about(P) a(D) parenthetical(J) | '?' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| It(N.pron) could(V.modal) contain(U/V) multiple(U) sentences(N.plur/V.3rdsg) | '.' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | ')' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> S(| And(C) another(D) | '.' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | '"' )</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">|)</span></blockquote>
As you can see, it has no problem recognizing and representing the hierarchy.<br />
<br />
Although I may simply be missing something, I have yet to find another sentence segmentation algorithm that produces this sort of hierarchic structure.<br />
<br />
For a fuller example of my algorithm in action, I'm including a <a href="http://jimcarnicelli.com/ai/blog/attachments/2016-12-08_BlackWorkingClass.txt">sample text file</a> I just processed and the corresponding <a href="http://jimcarnicelli.com/ai/blog/attachments/2016-12-08_BlackWorkingClass_Output.txt">processing output</a>, taken from a <a href="http://money.cnn.com/2016/11/23/news/economy/black-working-class-trump/index.html">CNN story</a>.<br />
<br />
The performance of this algorithm seems to be excellent, despite its complexity. Whether I run my test sets with or without the blocker included, the time is almost always the same. Which test takes longer appears to be random, meaning blocking accounts for effectively zero percent of the total parsing time so far. Most of the pipeline's time thus far is taken up by the virtual lexicon. Typically, I'm finding it processes text at a rate of about 2,000 words per second. Note that that's words, not tokens; the token count would also include punctuation and other symbols.<br />
<br />
One task I am looking forward to applying my algorithm to is dialogue analysis. Being able to clearly find common English dialogue patterns in text <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span> without even considering the words, let alone their syntax or semantic meaning <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">— should make this quite practical.</span><br />
<br />
Although I wouldn't go so far as to say that this is entirely original, I do believe that my blocker represents a significant improvement over other sentence segmentation algorithms. It adds hierarchic representations, for one, making it possible to find correct sentence boundaries more accurately in complex cases. But it also offers the application multiple interpretations of paragraph blocking, enabling later consideration of less likely but possibly more correct interpretations.<br />
<br />
<h2>
Adaptive tokenizer</h2>
I knew that to continue advancing the refinement of my virtual lexicon, I'd need to throw it at real text "in the wild". Doing that meant revisiting my tokenizer so I could feed words into it. The first tokenizer I made for my Basic English Parser (BEP) project was very primitive, so I decided to trash it and start over.<br />
<br />
<h2>
Tokens</h2>
Programmers often find themselves needing to parse structured data, such as <a href="https://en.wikipedia.org/wiki/Comma-separated_values">comma-separated-values</a> or XML files. In parsing richly structured data, like code written in a programming language, we refer to the smallest bits of data <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span> usually text <span style="background-color: white; font-family: "slabo 27px"; font-size: 16px;">—</span> as "tokens". In this case, my tokenizer is tasked with taking as input a single paragraph of English prose in a plain-text format and producing as output a list of tokens.<br />
<br />
In my tokenizer, tokens presently come in a small variety of types: word, symbol, number, date/time, and sentence divider. Since recognizing a token requires characterizing it, tokenization is the natural place to assign a type to each token, but only to the degree that the tokenizer is competent to do so. A tokenizer can readily recognize the pattern for a typical word, for example, but it couldn't classify that word's <a href="https://en.wikipedia.org/wiki/Part_of_speech">lexical category</a> (noun, adjective, etc.). Each piece of a good parser has its own specialties.<br />
<br />
English-speaking programmers familiar with typical parsing usually rely on conveniences bundled into the ASCII data standard. With a few logical and mathematical comparisons, you can easily tell capital letters from lower-case ones, letters from digits from other symbols. Because I am opting to deal properly with <a href="https://en.wikipedia.org/wiki/Unicode">Unicode</a>, those distinctions don't come so easily. I had to start with a custom Unicode library that can deal efficiently with normalized <a href="https://en.wikipedia.org/wiki/UTF-8">UTF-8</a> representations of text, where a single visible character in a text file might be represented by a code-point packed into 1 to 4 bytes or possibly a string of them bundled into a <a href="http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">grapheme cluster</a>. My unicode::unistring class easily interfaces with std::string. Its character <a href="http://www.cplusplus.com/reference/iterator/">iterator</a>, <a href="http://www.cplusplus.com/reference/string/string/size/">.size()</a> method, and other members mimic those of the standard string class's, but deal at the grapheme cluster level instead of the byte level.<br />
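To give a feel for the problem, here is a deliberately crude Python sketch of grapheme clustering. The real rules, per Unicode's UAX #29, are considerably more involved (this sketch handles only combining marks, not, e.g., emoji sequences or Hangul jamo), and the author's unistring class is C++.

```python
import unicodedata

def graphemes(s):
    """Crude grapheme clustering: attach combining marks to the preceding
    base character. Real UAX #29 segmentation handles many more cases."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # fold combining mark into prior cluster
        else:
            clusters.append(ch)
    return clusters
```

The point is that "length" and "character" must be measured in clusters, not bytes or code points, for a parser to behave sanely on decomposed text.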
<br />
Although language parser programmers often favor designs based on <a href="https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form">BNF grammars</a>, I've opted to just hard-code a mostly linear parsing algorithm. It progresses character by character, looking for boundaries between token types. The most obvious boundaries are found when we hit white-space, including spaces, tabs, and new-line characters. For numbers and dates, I'm relying on <a href="https://en.wikipedia.org/wiki/Regular_expression">regular expressions</a> to look ahead of the current character. Everything else initially comes down to punctuation and symbol characters. If it's a symbol character and we've been looking at a word, we recognize the boundary of that word token and pack the single symbol character into its own token, ready to begin searching for the next token.<br />
<br />
There are some special exceptions, though. Consider hyphens. It's tempting to take a term like "AFL-CIO" or "singer-songwriter" and split it into separate words. Hyphenated phrases like "lighter-than-air" make a compelling case for this seductive option, which fits neatly into the model where a symbol character like the minus sign (-) is its own token and marks boundaries to its left and right. But sometimes this doesn't make sense, as with <i>non-compliant</i> and <i>pre-industrial</i>. It would be hard to make sense of the syntax of a sentence that included <i>non compliant</i> or <i>pre industrial</i>. Also, hyphenation sometimes changes the lexical category (LC) of a combination of words, as with <i>an off-the-cuff remark</i>, where <i>off-the-cuff</i> is effectively an adjective, even though none of its words is one. I conclude that the tokenizer is not qualified to decide when it's appropriate to subdivide a hyphenated word. Nor, even, is the virtual lexicon. I suspect this will fall to the syntax parser, if only because that module can evaluate whether the string of hyphenated words forms a syntactic structure more neatly than the single hyphenated word does. Finally, there are special cases of words that end in a hyphen meant to link to a later word, as with <i>in pre- and post-op photos</i>.<br />
<br />
Thus, the tokenizer allows single hyphens that appear within words or as the last character of a word. Minus signs that appear outside the context of a word, as when there are spaces around them, are considered symbols, not words. If an apparent word has two or more consecutive minus signs in it (e.g., <i>in full knowledge--at least, afterward</i>), the tokenizer breaks it up into separate words. This takes care of the common case where an <a href="https://en.wikipedia.org/wiki/Dash">em dash</a> is represented by two minus characters with no space around them.<br />
<br />
The same constraint applies to single periods and apostrophes, which are allowed to be in the middle of a word token (<i>ASP.NET</i>, <i>isn't</i>) or on the left (<i>.com</i>, <i>'ere</i>) or right (<i>etc.</i>, <i>cousins'</i>) of one.<br />
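The double-hyphen rule is easy to sketch. This Python fragment is illustrative only (the author's tokenizer is C++ and handles periods and apostrophes with the same machinery); it covers just the hyphen case from the paragraphs above.

```python
import re

# A single hyphen may live inside a word or at its end ("singer-songwriter",
# "pre-"), but a run of two or more hyphens (an em dash typed as "--") splits
# the word and becomes a symbol token of its own.

def split_word(candidate):
    """Split one whitespace-delimited chunk into word/symbol tokens."""
    tokens = []
    for part in re.split(r"(-{2,})", candidate):
        if not part:
            continue
        if re.fullmatch(r"-{2,}", part):
            tokens.append(("symbol", part))    # "--" acting as an em dash
        else:
            tokens.append(("word", part))      # single - . ' stay attached
    return tokens
```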
<br />
Generally, the tokenizer preserves the actual characters from the text, but it does do some conversions. For starters, it does not preserve whitespace, per se. Each token indicates whether or not it has whitespace before it, but not how many characters or which. The <a href="http://www.fileformat.info/info/unicode/category/Pd/list.htm">various kinds of horizontal dash characters</a> are also condensed down to the standard ASCII minus sign (-).<br />
<br />
My tokenizer also condenses various types of double quote characters (“ & ”) down to the basic ASCII double quote (") character. Same for single quotes (‘ & ’), which are reduced to the ASCII apostrophe (') character. Although I may regret this later, as their polarity (left or right) can carry clues to the boundaries of quoted blocks of text, for now they would just complicate lexical analysis. They probably would complicate syntax analysis, too, because of inconsistent use of these in modern and older text files. Moreover, I would still face the problem of properly recognizing triple quotes, quadruples, etc., typically represented by combinations of double and single quote characters.<br />
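The two normalization steps just described amount to a simple character map. A Python sketch of the idea (the exact set of characters condensed by the author's C++ tokenizer may differ):

```python
# Condense curly quotes to their ASCII equivalents and the various Unicode
# horizontal dash characters to the ASCII minus sign.

DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015"    # hyphen ... horizontal bar
QUOTE_MAP = {
    "\u201c": '"', "\u201d": '"',    # left/right double quotes
    "\u2018": "'", "\u2019": "'",    # left/right single quotes
}

def normalize(text):
    """Condense dash and quote variants to their ASCII stand-ins."""
    return "".join(
        "-" if ch in DASHES else QUOTE_MAP.get(ch, ch)
        for ch in text
    )
```

As the paragraph above notes, this throws away quote polarity, a trade-off made to keep lexical and syntax analysis simpler.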
<br />
<h2>
Sentences and paragraphs</h2>
I specifically decided not to make my tokenizer responsible for starting with a raw text file and splitting it up into sections and paragraphs. My main reason is that it should be relatively easy in many cases for an earlier-stage parser to find paragraph boundaries, but that different formats require different approaches. From a syntax standpoint, paragraphs might be considered the largest significant unit of text.<br />
<br />
Why not stop at the sentence level? A typical syntax analyzer deals only with a sentence. The problem is that it's often difficult to find clear sentence boundaries. Question marks and exclamation points are strong indicators of sentence-end punctuation, but periods are not. Consider a sentence that contains the abbreviation "etc.". My lexicon can tell me that "etc" is an abbreviation, which means it typically has a period after it. And that could be the end of it, if only we ended a sentence with an extra period after an abbreviation, but we usually don't. Moreover, we often punctuate a sentence that ends with double quotes with a period just before the final quote. I reasoned that the best my tokenizer can do is mark places where it thinks sentences might end, and some indication of how certain it is in each case.<br />
<br />
To facilitate this, the tokenizer offers a special sentence-divider token type. Sentence dividers are tokens inserted into the token stream just like words, symbols, numbers, etc.<br />
<br />
To help add certainty, the tokenizer will pass any word-like token into the virtual lexicon and attach the returned "word" data structure to the token. As <a href="http://jvcai.blogspot.com/2016/11/virtual-lexicon-enhancements.html">described previously</a>, the VL will indicate that a word has leading or trailing period and/or apostrophe characters. A word's various senses (interpretations) will indicate whether they consider the period/apostrophe characters to be integral to their words, as with <i>.com</i>, <i>etc.</i>, <i>'nother</i>, and <i>ol'</i>, or to otherwise be unrelated to the word. It checks every sense to see if any of them allow for an integral-period interpretation. The tokenizer can use this to indicate whether or not it is "certain" that a period at the end of a word represents a separate symbol. If it does come to this conclusion, the tokenizer will actually strip off the trailing period and add it as a new token, asking the VL to reparse the new word without the trailing period.<br />
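A toy version of that decision, with a stand-in for the virtual lexicon. The function names and the tiny abbreviation set here are illustrative assumptions, not the author's API.

```python
ABBREVIATIONS = {"etc", "Dr", "appt"}   # tiny stand-in for the real lexicon

def any_sense_keeps_period(word):
    """Pretend virtual-lexicon check: does any sense of this word treat a
    trailing period as integral (abbreviations, initialisms, etc.)?"""
    return word.endswith(".") and word[:-1] in ABBREVIATIONS

def tokenize_word(word):
    """If no interpretation keeps a trailing period, strip it off into its
    own symbol token and reparse the bare word."""
    if word.endswith(".") and not any_sense_keeps_period(word):
        return [("word", word[:-1]), ("symbol", ".")]
    return [("word", word)]
```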
<br />
More precisely, the tokenizer does this same thing for the beginning period/apostrophe and for the end one. It creates new tokens before or after the word token when the word contains no interpretations in which that leading or trailing period or apostrophe are integral to the word.<br />
<br />
In the special case of an abbreviation (<i>Dr.</i>, <i>etc.</i>, <i>appt.</i>) or period-inclusive initialism (<i>r.s.v.p.</i>, <i>C.I.A.</i>) word that integrates an ending period, the tokenizer still creates a sentence-divider token, but it marks it as uncertain about this conclusion.<br />
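As a rough Python sketch of the trailing-period side of this logic — the sense dictionaries, helper name, and token shapes below are my own illustrative stand-ins, not the actual implementation:

```python
# Illustrative sketch only: decide whether a trailing period is integral
# to the word (etc., r.s.v.p.) or a separate symbol ending a sentence.

def split_trailing_period(word, senses):
    """Return (word, extra_tokens) given the senses the lexicon found."""
    if not word.endswith("."):
        return word, []
    # Some sense treats the period as integral: keep it attached, but
    # still propose an *uncertain* sentence divider, since sentences
    # ending in abbreviations usually get no extra period.
    if any(s.get("integral_trailing_period") for s in senses):
        return word, [("SENTENCE_DIVIDER", "uncertain")]
    # No sense wants the period: strip it, emit it as its own symbol
    # token, and propose a *certain* sentence divider.
    return word[:-1], [(".", "symbol"), ("SENTENCE_DIVIDER", "certain")]

word, extra = split_trailing_period("etc.", [{"integral_trailing_period": True}])
# word stays "etc."; extra carries one uncertain sentence divider
```

The same pattern would apply symmetrically to a leading period, with the tokenizer asking the lexicon to reparse the stripped word.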
<br />
Punctuation sometimes appears within quoted text in sentences. Consider this example:<br />
<blockquote class="tr_bq">
<i>To those who ask "Why bother?" I say "Because it's right." I hope they listen.</i></blockquote>
My intention in this example is that the first quoted sentence not also mark the end of the outer sentence that contains it. However, the second one does mark the end. From a tokenization or lexicon standpoint, there is no way to correctly draw these conclusions. My tokenizer thus kicks this can down the road by marking the "?" and "." before each closing quote as definite sentence ends and adding a maybe-a-sentence-end marker after each closing quote. It will fall to the syntax analyzer to consider alternative hypotheses.<br />
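To make that concrete, here is a hedged Python sketch of injecting such markers into a token stream. The marker strings mirror the sentence-end notation used later in this post, but the function itself is purely illustrative:

```python
# Illustrative sketch: mark a certain sentence end at punctuation just
# before a closing quote, and a maybe-end right after the quote.

def mark_quote_ends(tokens):
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        # the quoted sentence definitely ends at its own punctuation
        if tok in ("?", "!", ".") and i + 1 < len(tokens) and tokens[i + 1] == '"':
            out.append("<<sentence end!>>")
    final = []
    for i, tok in enumerate(out):
        final.append(tok)
        # the *outer* sentence may or may not also end after the quote
        if tok == '"' and i > 0 and out[i - 1] == "<<sentence end!>>":
            final.append("<<sentence end?>>")
    return final
```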
<br />
The bottom line here is that neither the tokenizer nor the virtual lexicon are qualified to say for sure where sentences end. The best they can do is propose options.<br />
<br />
<h2>
Naming names</h2>
One of the other features I've added to my BEP is an ability to spot likely name words in text. This presents a difficult challenge. While many of the unknown words found by an English parser are simply inflectional derivatives (e.g., <i>harm</i> to <i>harms</i>, <i>harmed</i>) and assemblies of morphemes (<i>harmful</i>, <i>harmless</i>) missing from the parser's lexicon, a large fraction of the remaining unknown words are going to be names of people, places, events, organizations, products, and so on. It seems inappropriate to attempt to fill a lexicon with a list of all such named entities because, no matter how authoritative it is, there will always be more names emerging.<br />
<br />
Instead, I'm employing a tactic of looking at the characteristics of unknown words to see if they might potentially be names. The most basic characteristic is whether the word is capitalized. But this needs to be mediated by whether the word appears at the beginning of a sentence. That's the main reason the tokenizer proposes sentence boundaries. Words that appear immediately after sentence boundaries (opening quotes are factored in) are ignored in the initial survey to find potential names for the simple reason that in English, we typically capitalize the first word in a sentence. I'm looking for words that are either capitalized (<i>John</i>, <i>Woodstock</i>, <i>Florida</i>) or all-caps (<i>FBI</i>, <i>DARPA</i>, <i>NASCAR</i>) that are not already in my lexicon.<br />
<br />
One benefit of tokenizing at the paragraph level is that it allows me to deal well with headlines. Headlines that have most of their words capitalized (<i>Harry Potter and the Chamber of Secrets</i>) confound the ability to find name words. So part of the computation is that if a paragraph has a large percentage of its words either capitalized or in all-caps, it is disqualified from the search for name words. This also allows for dealing with snippets of text where almost all letters are capitalized, which similarly makes finding potential names impossible.<br />
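A minimal sketch of that disqualification test, assuming an illustrative 50% cutoff (the actual threshold in my code may differ):

```python
# Illustrative sketch: skip paragraphs where capitalization is too
# common to be a useful name signal (headlines, all-caps snippets).

def qualifies_for_name_search(words, max_cap_fraction=0.5):
    alpha = [w for w in words if w[:1].isalpha()]
    if not alpha:
        return False
    capped = sum(1 for w in alpha if w[0].isupper() or w.isupper())
    return capped / len(alpha) <= max_cap_fraction

print(qualifies_for_name_search("Rupert walked to the store".split()))               # True
print(qualifies_for_name_search("Harry Potter and the Chamber of Secrets".split()))  # False
```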
<br />
Once the first pass to identify potential name words is done, I do a second pass later to "name names". An important part of that process is that it then covers words that were previously disqualified. If the word <i>Rupert</i> appeared in the middle of a sentence in a regular paragraph during the initial sweep, it got added to a list of potential name words. In this second pass, instances of Rupert that appeared as the first word in sentences would also be tagged as name words. So would instances found in title "paragraphs". One proviso, though, is that this later call to name names specifies a threshold. Any potential name found with fewer qualifying occurrences in text than this threshold value gets ignored. Those at or above the threshold get flagged as name words. In my testing, I stuck with a threshold of 1, which effectively disables the threshold.<br />
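Here is a hedged sketch of that two-pass flow; the data shapes are illustrative and the real implementation surely tracks more detail:

```python
# Illustrative sketch of the two passes: collect capitalized words that
# are not sentence-initial, then tag every occurrence of any candidate
# meeting the threshold, including sentence-initial ones.

def find_name_candidates(sentences):
    counts = {}
    for sent in sentences:
        for i, w in enumerate(sent):
            if i == 0:
                continue  # sentence-initial words are capitalized by convention
            if w[:1].isupper():
                counts[w] = counts.get(w, 0) + 1
    return counts

def name_names(sentences, counts, threshold=1):
    names = {w for w, n in counts.items() if n >= threshold}
    return [[(w, w in names) for w in sent] for sent in sentences]

sents = [["Rupert", "saw", "Rupert"], ["He", "waved", "at", "Rupert"]]
tagged = name_names(sents, find_name_candidates(sents))
# "Rupert" gets tagged everywhere, even sentence-initially; "He" does not
```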
<br />
I should also point out that this algorithm will miss names that only appear at the beginnings of sentences in a document.<br />
<br />
The first piece of this puzzle is the virtual lexicon, which helps to characterize words. In particular, it can flag a given word as being capitalized (e.g., <i>Joshua</i>, <i>iPhone</i>, <i>MyVMS</i>) or all-caps. For such a word, it extracts a potential name. This is separate from the word's raw text in part because names often come in possessive form with <i>s'</i> or <i>'s</i> at its tail. The potential name excludes the apostrophe or apostrophe-S ending and the word indicates that its potential name is possessive. This allows a completely unknown word to be available as a potentially possessive proper noun to the syntax analyzer later. But it also allows <i>Francis</i>, <i>Francis'</i> and <i>Francis's</i> all to represent the same <i>Francis</i> name when it comes time to name names.<br />
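A tiny sketch of the possessive-folding idea, with hypothetical names of my own:

```python
# Illustrative sketch: extract a potential name from a capitalized word,
# folding possessive forms into the same base name.

def potential_name(word):
    """Return (name, is_possessive), or None if not name-like."""
    if not word[:1].isupper():
        return None
    if word.endswith("'s"):
        return word[:-2], True
    if word.endswith("'"):
        return word[:-1], True
    return word, False

for w in ("Francis", "Francis'", "Francis's"):
    print(potential_name(w))  # all three yield the same "Francis" name
```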
<br />
The VL comes up with these potential names even for derived words or words that come directly from the lexicon. This reflects the fact that common words of all LCs can be used as proper nouns. It allows the senses the VL comes up with for a given word to coexist with the potential for this specific use of the word to alternatively be a name. You don't want a syntax parser to be unable to discover syntactic structure just because a long string of words is capitalized and thus marked with a pseudonymous "proper noun" lexical category that paves over the words' actual LCs.<br />
<br />
On that note, when the name-names function is applying the used-as-a-name flag on words based on the criteria described above, it does not apply to words that are not capitalized or all-caps. So a phrase like <i>how <u>Mark</u> made his <u>mark</u> on me</i> can contain both common and name-flagged versions of the same word. As indicated before, flagging a word as used-as-a-name does not negate the senses (and LCs) returned by looking it up in the virtual lexicon.<br />
<br />
Ultimately, the syntax analyzer will have to sort out the name-oriented reality of each word based on how it's used in a given sentence, but it will already have a variety of extra characteristics at its disposal, including whether the word is capitalized, whether it is flagged as used-as-a-name, and whether a given word sense is flagged as a proper noun.<br />
<br />
<h2>
Lexicon building</h2>
One of my main goals in revisiting my tokenizer, now that I have a good virtual lexicon mechanism, is to test my VL and continue building my lexicon. In practice, this also means continuing to refine the VL and tokenizer, too. Thus far, I've only processed two documents: a <a href="http://www.cmgww.com/historic/twain/about/bio.htm">short biography of Mark Twain</a> and a CNN news story about <a href="http://money.cnn.com/2016/11/23/news/economy/black-working-class-trump/index.html">America's black working class</a>. My current process begins with letting my tokenizer have a go at a document. Among other things, it generates lists of potential names it found, unknown words, and derived words. <a href="http://jimcarnicelli.com/ai/blog/attachments/2016-11-26_1_SampleOutput.txt" target="">Here is an example of the output</a> my test program generates.<br />
<br />
I mainly work through the list of unknown words, creating entries in my lexicon as needed. If, for example, I see <i>flippancy</i>, I'll create an entry for <i>flippant</i>, knowing it will derive <i>flippancy</i> from it. I generally don't add names to the lexicon, with some exceptions. I will add common names like US states and major cities, commonly occurring people's names like <i>John</i>, and so on. I flag them as proper nouns. <a href="http://jimcarnicelli.com/ai/blog/attachments/2016-11-26_1_Lexicon.js">Here's an example of what my lexicon file</a> looks like.<br />
<br />
Once I've worked through the unknown words, I move on to the list of derived words. Here's a sample of what that looks like:<br />
<br />
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">official | 1 | N: off(P) -ic(N|V→J) -ial(N) (N or J)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">opposed | 1 | V: oppose(V) -ed(V→V) (V or J)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">organizer | 1 | N: organize(V) -er(V→N) (N or J)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">others | 1 | N: other(N) -s(N→N) (N or V)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">overalls | 1 | N: over(P) -al(N→J) -s(N→N) (N or V)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">overwhelmingly | 1 | R: over(P) whelm(V) -ing(V|N→V) -ly(N|J→R)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">pan-african | 1 | N: pan(N) -(U) Africa(N) -an(N→N) (N or D)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">partnership | 1 | N: partner(N) -ship(N→N)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">partnerships | 1 | N: partner(N) -ship(N→N) -s(N→N) (N or V)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">payments | 1 | N: pay(V) -ment(V→N) -s(N→N) (N or V)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">personal | 1 | J: person(V) -al(N→J)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">poorer | 1 | J: poor(J) -er(J→J) (J or N)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">populism | 1 | N: popul~(J) -ism(N→N)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">ports | 1 | N: port(N) -s(N→N) (N or V)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">president | 1 | J: preside(V) -ent(J)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">president-elect | 2 | N: preside(V) -ent(J) -(U) elect(V)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">pressures | 1 | N: press(V) -ure(N) -s(N→N) (N or V)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">problems | 1 | N: problem(N) -s(N→N) (N or V)</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">professor | 2 | N: pro-(U) fess(V) -or(V|N→N) (N or V)</span></li>
</ul>
<br />
Most of these look very appropriate. I'm looking for oddballs, like with <i>official</i>, which it derived as "<i>off -ic -ial</i>". All I have to do is add <i>office</i> to the lexicon and it gets it right as "<i>office -ial</i>".<br />
<br />
I also study a relatively compact representation of the parsed text. Here's an excerpt:<br />
<br />
<blockquote class="tr_bq">
<span style="font-size: x-small;"><span style="font-family: "courier new" , "courier" , monospace;"> '"' The(D) notion(N,P) of(P) the(D) white(J) working(V,N) class(N) implicitly(R) embodies(N,V) a(D) view(N) of(P) white(J) privilege(N) ',' '"' said(V) @[Spriggs](?) ',' '"' It(N) implies(V,N) that(N,S,P) things(N,V) are(V) supposed(V,J) to(P,R) be(V) different(J) for(P) them(N) ',' that(N,S,P) they(N) aren't(Phr,Phr,Phr) the(D) same(J) ',' that(N,S,P) they(N) aren't(Phr,Phr,Phr) going(V,N) to(P,R) face(N) the(D) same(J) pressures(N,V) '.' <<sentence end!>> '"' <<sentence end?>> </span>
</span><br />
<span style="font-size: x-small;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">There(N) is(V) no(D) official(N,J) definition(N) of(P) the(D) working(V,N) class(N) '.' <<sentence end!>> Some(D,N,R) define(V) it(N) by(P) education(N) level(N) or(C) job(N) sector(N) ',' others(N,V) by(P) income(V) '.' <<sentence end!>> John(N) @[Hudak](?) ',' a(D) senior(J) fellow(N) in(P) governance(N) studies(N,V) at(P) the(D) Brookings(N,V) Institution(N) calls(V,N) these(D,N,R) voters(N,V) '"' <span style="background-color: yellow;">economically</span>(R,J) marginalized(V,J) '"' because(S) they(N) often(R) fall(V,N) somewhere(R,N) between(P) the(D) poor(J) and(C) the(D) middle(N) class(N) '.' <<sentence end!>> </span>
</blockquote>
<br />
After each word is a list of the lexical categories for the senses the VL came up with for each word, listed in order of how likely it considered each sense to be the correct one. My LCs here include: (N)oun, (V)erb, ad(J)ective, adve(R)b, (P)reposition, <a href="https://en.wikipedia.org/wiki/Determiner">(D)eterminative</a>, <a href="https://en.wikipedia.org/wiki/Conjunction_(grammar)#Subordinating_conjunctions">(S)ubordinator</a>, <a href="https://en.wikipedia.org/wiki/Conjunction_(grammar)#Coordinating_conjunctions">(C)oordinator</a>, <a href="https://en.wikipedia.org/wiki/Interjection">(I)nterjection</a>, s(Y)mbol, and (U)nspecified. I consider this list a work in progress. If a word needs to be interpreted as a phrase, as with <i>aren't</i> (<i>are not</i>), you'll see "Phr" in place of an LC.<br />
<br />
Note the <i><<sentence end!>></i> and <i><<sentence end?>></i> tokens. The ones with exclamation points (!) indicate high-certainty markers and the ones with question marks (?) indicate low-certainty but potential markers.<br />
<br />
If I want to better understand how my VL treats a word, I can generate a more detailed output for that word like the following:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">- Word: <span style="background-color: yellow;">economically</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">- R: economy(N) -ic(N|V→J) -al(N→J) -ly(N|J→R) (R or J)</span><br />
<span style="font-size: x-small;"><span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">- Short list of senses:</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> - R: economy(N) -ic(N|V→J) -al(N→J) -ly(N|J→R) (141)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> - J: economy(N) -ic(N|V→J) all(D) -y(N→J) (160)</span><br />
<span style="font-size: x-small;"><span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">- All senses considered:</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> - R: economy(N) -ic(N|V→J) -al(N→J) -ly(N|J→R) (141)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> - J: economy(N) -ic(N|V→J) all(D) -y(N→J) (160)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> - J: economy(N) -ic(N|V→J) -al(N→J) -y(N→J) (160)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> - J: economy(N) -ic(N|V→J) all(R) -y(N→J) (162)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> - R: economy(N) -cy(J→N) -al(N→J) -ly(N|J→R) (171)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> - J: economy(N) -cy(J→N) -al(N→J) -y(N→J) (220)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> - R: economy(N) -ic(N|V→J) a(D) -ly(N|J→R) -ly(N|J→R) (220)</span>
</blockquote>
<br />
One assumption I'm making is that there is a natural ability for words in the major LCs to be converted to the others. The noun <i>candy</i> can easily be used as a verb, as in <i>We candied apples for the party</i>. The adjective <i>beautiful</i> can be used as a noun, as in <i>The beautiful have it easy</i>. My VL is mainly charged with giving its top pick of LC for a given word, but on request it will either give one sense for each distinct LC or give all the senses it conceives of. The latter I consider useful only for debugging, though.<br />
<br />
The special exception is with phrase expansions, which are not really LCs to be condensed down. For example, <i>Fred's</i> could be interpreted as the possessive form of <i>Fred</i> or as <i>Fred is</i> or <i>Fred has</i>. Which one is appropriate depends on the syntax, as illustrated in <i>Fred's been drinking Fred's own root beer. Fred's happy with how it turned out</i>.<br />
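A hedged sketch of such a phrase expansion; the data representation is my own illustration of the idea, not the VL's actual output format:

```python
# Illustrative sketch: expand an 's clitic into the alternative
# readings the syntax layer will later have to choose among.

def expand_clitic(word):
    if word.endswith("'s"):
        stem = word[:-2]
        return [
            (stem, "possessive"),          # Fred's root beer
            ([stem, "is"], "phrase"),      # Fred's happy
            ([stem, "has"], "phrase"),     # Fred's been drinking
        ]
    return [(word, "plain")]

print(expand_clitic("Fred's"))
```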
<br />
<h2>
Performance metrics</h2>
<div>
In my earlier performance tests of my VL, I was testing to see how fast it could recognize a single, specific word. I knew that I wouldn't be able to get a meaningful average rate, though, until I started testing real texts. Now that I have, I can say that on my <a href="https://support.apple.com/kb/SP687?locale=en_US&viewlocale=en_US">late-2013 iMac</a>, I'm typically seeing parse rates of over 2,000 words per second. To clarify, that's not counting symbols and other tokens; just words that get processed by the VL. That timing is for the total process of tokenization, VL lookups, name finding, and generating statistics for my debugging use. On my <a href="https://support.apple.com/kb/SP712?locale=en_US">early-2015 MacBook</a>, the rate is typically above 1,200 words per second.</div>
<div>
<br /></div>
<div>
Generally speaking, the more unknown words there are, the slower the VL runs as it tries its best to find approximate matches. So as I was adding new lexemes to my lexicon, this algorithm got ever faster.</div>
<div>
<br /></div>
<div>
Here is a typical output of statistics at the end of my test runs:</div>
<div>
<br /></div>
<div>
<ul>
<li>996 total words</li>
<li>457 distinct words (45.9% of total)</li>
<li>437 known words (95.6% of distinct)</li>
<li>20 unknown words (4.4% of distinct)</li>
<li>216 derived words (47.3% of distinct)</li>
<li>204 known derived words (44.6% of distinct, 46.7% of known, 94.4% of derived)</li>
<li>423 lexemes used (96.8% the number of known words)</li>
<li>374 free/bound lexemes used (85.6% the number of known words)<br /></li>
<li>Rate: 2.1 parses/second</li>
<li>Rate: 2600.0 tokens/second</li>
<li>Rate: 2096.8 words/second</li>
<li>3 iterations took 1.4 seconds</li>
<li>1427 lexemes defined</li>
</ul>
</div>
<div>
<br /></div>
<div>
For comparison, I just now extracted a <a href="http://www.cnn.com/2016/11/26/americas/fidel-castro-obit/index.html">third document</a>. Without making any changes to the lexicon, here are the stats it outputs on a first run:</div>
<div>
<br /></div>
<div>
<ul>
<li>2227 total words</li>
<li>856 distinct words (38.4% of total)</li>
<li>528 known words (61.7% of distinct)</li>
<li>328 unknown words (38.3% of distinct)</li>
<li>545 derived words (63.7% of distinct)</li>
<li>307 known derived words (35.9% of distinct, 58.1% of known, 56.3% of derived)</li>
<li>429 lexemes used (81.2% the number of known words)</li>
<li>376 free/bound lexemes used (71.2% the number of known words)<br /></li>
<li>Rate: 0.4 parses/second</li>
<li>Rate: 1059.8 tokens/second</li>
<li>Rate: 855.7 words/second</li>
<li>5 iterations took 13.0 seconds</li>
<li>1428 lexemes defined</li>
</ul>
</div>
<div>
<br /></div>
<div>
As you can see, the rate is down from over 2,000 words per second to well under 900. Most of that is consumed by attempts to figure out unknown words.</div>
<div>
<br /></div>
<div>
After generating these numbers, I added a caching mechanism to save a copy of what the VL returns for a given word token. The result is a 10 - 15% speed increase, depending on whether the document scanned is already known.</div>
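<div>
The cache itself can be as simple as a dictionary keyed on the token's text. This sketch uses a stand-in parse callback in place of the real VL call:

```python
# Illustrative sketch of the VL result cache: each distinct word is
# parsed at most once; repeats are served from the dictionary.

_cache = {}

def cached_lookup(word, parse):
    if word not in _cache:
        _cache[word] = parse(word)
    return _cache[word]

calls = []
def fake_parse(w):
    calls.append(w)  # counts how often the expensive parse actually runs
    return {"word": w, "senses": []}

cached_lookup("the", fake_parse)
cached_lookup("the", fake_parse)  # second call is a cache hit
```
</div>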
<div>
<br /></div>
<div>
I have a lot more lexicon building to do. At least this process makes it fairly fast and easy to identify the words I need to import. The effectiveness and performance are encouraging so far.</div>
Jim Carnicellihttp://www.blogger.com/profile/17916650503215873212noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-25888752635696500812016-11-18T12:17:00.001-08:002016-11-25T08:33:44.004-08:00Virtual lexicon enhancements<h2>
Virtual lexicon</h2>
In my <a href="http://jvcai.blogspot.com/2016/11/morphological-parser_26.html">previous post</a>, I used the term "morphological parser" to describe what I've been building. Now that I'm done with a first significant version of it, I prefer the term "virtual lexicon" ("VL"). It captures the spirit of this as a module that represents a much larger set of words than the actual lexicon it contains. It's the same idea as virtual memory, which enables programs to consume what they perceive as a much larger amount of RAM than is physically available in the computer. Likewise a virtual machine environment, which enables a fixed set of computer hardware to represent emulated computers with more CPUs, memory, and other resources than are physically available.<br />
<br />
<h2>
Periods</h2>
<div>
One of my stated goals in the previous post was to deal with periods and apostrophes. What I was indicating is that a basic tokenizer is faced with a nasty interpretation problem when it comes to these symbols. In standard written English, a sentence typically ends with a period and the next sentence begins after at least one space character (in typical computer formats, at least). But English also allows abbreviations to end in a period, typically followed by a space, as in <i>Alice asked Prof. Smith for help</i>. English also has <a href="http://www.dictionary.com/browse/initialism">initialisms</a> represented sometimes with periods following each letter, as in <i>r.s.v.p.</i> and <i>A.K.A.</i> Further compounding the problem is that a sentence ending in an abbreviation or period-delimited initialism usually doesn't contain a separate period. For example, "<i>I'm afraid of snakes, spiders, etc. When I see them, I run away A.S.A.P.</i>"</div>
<div>
<br /></div>
<div>
There is a further problem that could be ignored, but I decided to tackle it as well. Some special words like ".com" begin with periods, which would throw off a basic tokenizer. Further, it's possible for sloppily written text to have sentences ending in periods with no following spaces, as in "<i>I'm going to the store.Do you need anything?</i>" The following capital letter does help suggest a sentence break over the alternative interpretation that "store.Do" is a word. But there are such words, like "<a href="https://en.wikipedia.org/wiki/ASP.NET">ASP.NET</a>", and Internet domain names, like "google.com".</div>
<div>
<br /></div>
<div>
I decided to modify my tokenizer to allow word-looking tokens to include single periods, including one at the beginning and one at the end of the token (e.g., ".ASP.Net."). Doing so would give the virtual lexicon a chance to weigh in on whether such periods are part of the words or separate from them. The VL's return value now indicates if the word begins with a period and also if it ends with one. But then each of the word-senses in that word gets to indicate if the leading and/or trailing period is integral. To illustrate this, my test output shows integral periods as <b><span style="color: #6aa84f;">{.}</span></b> and separate ones as <b><span style="color: #cc0000;">(.)</span></b>. Consider the following examples:</div>
<div>
<ul>
<li>".animal.": <b><span style="color: #cc0000;">(.)</span></b> ⇒ N: animal(N) ⇒ <b><span style="color: #cc0000;">(.)</span></b></li>
<li>".com.": <b><span style="color: #6aa84f;">{.}</span></b> ⇒ N: .com(N) ⇒ <b><span style="color: #cc0000;">(.)</span></b></li>
<li>".etc.": <b><span style="color: #cc0000;">(.)</span></b> ⇒ U: etc(U) ⇒ <b><span style="color: #6aa84f;">{.}</span></b></li>
</ul>
<div>
My tokenizer also deals with the dotted initialism (<i>N.A.T.O.</i>, <i>r.s.v.p.</i>) scenario, which is also a problem for the lexicon. I decided a lexeme representing this case should only contain the dot-free spelling (<i>NATO</i>, <i>RSVP</i>) and should contain one or more senses indicating that it is an initialism. When the VL comes across this pattern, it gets rid of the periods and then begins its search. For example:</div>
<div>
<ul>
<li>"R.I.P.":</li>
<ul>
<li>V: RIP(V) ⇒ <b><span style="color: #6aa84f;">{.}</span></b></li>
<li>V: rip(V)</li>
</ul>
</ul>
</div>
<div>
Note how it offers an alternative, because my test lexicon also has the common verb sense of "rip" referring to the ripping action. Had I fed in "rip" instead of "R.I.P." or "RIP", it would have put that common sense on top and the initialism sense second. But note also how the first sense indicates that, yes, the word ends in a period, but that period is part of the word. Had it been "RIP.", it would have indicated that there was a trailing period that was clearly not part of the word.</div>
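<div>
The dot-stripping step can be sketched as a simple pattern check; the regular expression here is my illustration, not the actual code:

```python
import re

# Illustrative sketch: a fully dotted initialism (every letter followed
# by a period) is reduced to its dot-free spelling before lexicon lookup.

def normalize_initialism(token):
    if re.fullmatch(r"(?:[A-Za-z]\.)+", token):
        return token.replace(".", "")
    return token

print(normalize_initialism("R.I.P."))   # RIP
print(normalize_initialism("A.S.AP."))  # unchanged: "AP." breaks the pattern
```
</div>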
<div>
<br /></div>
<div>
I would note that my VL doesn't deal well with cases where there are periods within an initialism but where one or more such periods are missing. A word like <i>A.S.AP.</i> would fail to be properly recognized by my VL, but I consider that a good design. I'm betting this sort of case is rare and almost always a typo. If someone wanted to say <i>I RSVPed</i>, for example, they probably wouldn't include any periods. This leaves those oddball words that do include infrequent periods, like <i>Node.JS</i>, pristine for lexicon lookups.</div>
<div>
<br /></div>
<div>
I would also note that my VL does not provide meaningful support for domain names (<i>microsoft.com</i>, <i>apple.com</i>), Usenet group names (<i>alt.tv.muppets</i>, <i>sci.philosophy.tech</i>), and so forth. This is probably best handled by a tokenizer, which could easily flag a token as fitting this pattern and possibly ask the VL to see if the token is a known word, as in the <i>Node.JS</i> case. It's going to be a challenge for any syntax and semantic parser to deal with these entities, anyway.</div>
<div>
<br /></div>
<div>
This all takes responsibility away from the tokenizer for dealing with single periods in and adjacent to words. The VL doesn't definitively decide whether a given period is punctuation marking the end of a sentence, but it does provide strong evidence for later interpretation. Plus, it allows terms that do contain periods to be properly interpreted on their own.</div>
</div>
<div>
<br /></div>
<h2>
Apostrophes</h2>
<div>
Almost the same problem crops up with apostrophe characters, which may be integral to words or may indicate the special class of punctuation that includes quotes, parentheses, and <i>italicized text</i>. Some words, like <i>can't</i>, <i>bees'</i>, and <i>'nother</i> contain apostrophes that are integral to the word and not at all part of quoted text. However, a tokenizer just can't deal with this without recourse to a lexicon. So my lexicon allows terms to include integral apostrophes.</div>
<div>
<br /></div>
<div>
The tokenizer is expected to leave single apostrophes that may appear to the left or right of a word-like token, as well as within it, attached to the text. The VL then considers the various interpretations possible with the leading and trailing apostrophes. The output word indicates whether it begins with and also whether it ends with an apostrophe. Then each sense within indicates whether those leading and trailing apostrophes are part of the word or not. Same pattern as for leading and trailing periods. And in that spirit, here are some sample outputs for tokens that feature both leading and trailing apostrophes. Integral apostrophes are represented with <b><span style="color: #6aa84f;">{'}</span></b> and clearly-separate apostrophes with <b><span style="color: #cc0000;">(')</span></b>.</div>
<div>
<ul>
<li>'animal': <b><span style="color: #cc0000;">(')</span></b> ⇒ N: animal(N) ⇒ <b><span style="color: #cc0000;">(')</span></b></li>
<li>'animals': <b><span style="color: #cc0000;">(')</span></b> ⇒ N: animal(N) -s'(N→N) ⇒ <b><span style="color: #6aa84f;">{'}</span></b></li>
<li>'nother': <b><span style="color: #6aa84f;">{'}</span></b> ⇒ N: 'nother(N) ⇒ <b><span style="color: #cc0000;">(')</span></b></li>
</ul>
<div>
The <i>'animals'</i> example illustrates the potential for confusion, too. After all, it could be that <i>animals</i> is simply the plural form of <i>animal</i> in single quotes, as in <i>Your so-called 'animals' are monsters</i>. Or it could be that the plural form of "animals" has possession of something, as in <i>Your 'animals' pen' is empty.</i> There truly is no way in this parsing layer to iron that out.</div>
</div>
<div>
<br /></div>
<div>
<h2>
Kicking the can down the road</h2>
</div>
<div>
One of my guiding assumptions is that each layer of the parsing process adds some clarity, but also creates more questions that can't be answered within that layer. I'm counting on the next layer being responsible for taking what the lexicalizer, which is essentially this virtual lexicon applied to all word tokens, outputs and generating as many alternative interpretations as are necessary to deal with the ambiguities. Then it will fall to the syntax parser, which should rule out some unlikely interpretations. That layer, too, will create more unanswered questions, which it will foist on later layers dealing more with the semantics of sentences.<br />
<br />
One pattern I found is that I end up using my VL recursively because some alternative interpretations can only be handled by fully parsing a word by trying various interpretations, such as stripping off a leading period, and seeing which interpretation seems best. No doubt this same pattern will hold for the syntax parser, which probably will even occasionally call back to the VL to reinterpret alternative tokens it comes up with.</div>
Jim Carnicellihttp://www.blogger.com/profile/17916650503215873212noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-80117714336062971672016-11-15T14:20:00.011-08:002016-12-03T21:34:26.056-08:00Morphological parser / Virtual lexiconI've been able to focus the past 3 months exclusively on my AI research, a luxury I've never had before. Given that I can't afford to do this indefinitely, I've chosen to focus primarily on <a href="https://en.wikipedia.org/wiki/Natural_language_processing">NLP</a>, with an overt goal to create relevant marketable technologies.<br />
<br />
I'm presently operating with an intermediate goal of creating a "basic" English parser (BEP). In my conception, a BEP will transform a block of English <a href="https://en.wikipedia.org/wiki/Discourse_analysis">discourse</a> into an abstract representation of its constituent words, sentences, and paragraphs that is more amenable to consumption by software. Though there are many research and industry products that do this at some level, I'm convinced I can improve upon some of their aspects to a marketable degree.<br />
<br />
When you set about trying to understand natural language parsing, you quickly discover that there are many interconnected aspects that seem to make starting to program a system intractable. NLP researchers have made great strides in the past few decades largely by focusing on narrowly defined tasks, like <a href="https://en.wikipedia.org/wiki/Part-of-speech_tagging">part of speech tagging</a> and <a href="https://en.wikipedia.org/wiki/Shallow_parsing">phrase chunking</a>; especially tasks that rely on <a href="https://en.wikipedia.org/wiki/Statistical_learning_theory">statistical machine learning</a>. Still, it seems every piece requires every other piece as a prerequisite.<br />
<br />
After exploring lexical tagging (AKA part of speech tagging) for a while, especially using a custom <a href="https://en.wikipedia.org/wiki/Brill_tagger">Brill-style tagger</a> I wrote from scratch, I decided to tackle an important piece of the bigger puzzle. Namely, how to deal with unknown words.<br />
<br />
<h2>
Why a morphological parser?</h2>
I'm convinced most unknown words are simply known words reassembled in different ways. The most basic example is inflectional versions. The plural form of "chicken" is "chickens". The past ("preterite") tense form of "stop" is "stopped". The comparative form of "sloppy" is "sloppier" and the superlative form is "sloppiest". Beyond these, many words can be formed by compounding existing words. "Sunshine", "ankle-deep", and "brainwash" illustrate basic compound words. And then there are all those affixes — prefixes and suffixes — like "un-", "re-", "-ish", and "-ation" that can be added to other words to form more complex ones like "reprioritization".<br />
<br />
This is the domain of <a href="https://en.wikipedia.org/wiki/Morphology_(linguistics)">morphology</a>, the study of word-formation, among other things. I decided I should try to somewhat master this algorithmically in order to deal with the broadest array of words.<br />
<br />
The logic goes something like this. To parse, say, a paragraph of text, one must identify the sentence boundaries. There are some obvious algorithms to do so, but they run into ambiguities. A period, for example, might signal an abbreviation like "Mr." or "Dr." instead of the end of a sentence. Once you have isolated a sentence, making sense of it requires being able to characterize all of its words in some useful way. At the very least, you might want to tell what the main verb is and possibly what the subject and direct object are. Doing so usually begins with a process of identifying all the verbs, nouns, adjectives, etc. in the sentence, which turns out to be a surprisingly tricky process. You usually start with a naive guess that involves looking the word up in a <a href="https://en.wikipedia.org/wiki/Lexicon">lexicon</a>. Like a standard dictionary, a lexicon will typically at least tell your algorithm what the most likely lexical category is for a given, known word (e.g., "run" is most likely a verb). And then the order of the categories in the sentence is typically used to figure out the syntactic structure of the sentence. Any word in the sentence that isn't already in the lexicon becomes a problem for this approach.<br />
<br />
How, then, to deal with unknown words? One answer is to use well-known syntactic patterns to make a guess. In the sentence <i>I ate a very XXXX pear</i>, we can guess that <i>XXXX</i> is most likely an adjective because that's the only thing that should be allowable by English grammar rules. But we might also be able to guess by picking the unknown word apart. In <i>an XXXXer pear</i>, we can guess that <i>XXXXer</i> is probably a comparative adjective like <i>tastier</i> or <i>raunchier</i>. That said, it isn't guaranteed. Consider <i>bitter</i>, which coincidentally ends in <i>er</i> but which is not comparative (that would be <i>bitterer</i> or <i>more bitter</i>). Still, English comes with a wealth of prefixes and suffixes that can hint at the likely category for otherwise unknown words. <i>XXXXish</i> is probably an adjective, <i>XXXXing</i> is probably a <a href="http://www.chompchomp.com/terms/participle.htm">present participle</a> or <a href="http://www.chompchomp.com/terms/gerund.htm">gerund</a>. And so on.<br />
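To make the suffix heuristic concrete, here's a minimal sketch in Python (purely illustrative; the tiny suffix table and the function name are my own, not part of any actual tagger):

```python
# Guess the likely lexical category of an unknown word from its suffix.
# The hint table is a tiny illustrative sample, not an exhaustive inventory.
SUFFIX_HINTS = [
    ("ish", "adjective"),  # squareish, XXXXish
    ("ing", "verb"),       # present participle or gerund
    ("er",  "adjective"),  # comparative -- though "bitter" shows it can mislead
    ("ly",  "adverb"),     # happily, quickly
]

def guess_category(word):
    """Return a suffix-based category guess, or None if nothing matches."""
    for suffix, category in SUFFIX_HINTS:
        # Require at least two characters before the suffix to avoid
        # treating short words like "her" as suffixed forms.
        if word.endswith(suffix) and len(word) >= len(suffix) + 2:
            return category
    return None
```

As the <i>bitter</i> example shows, such a guess is only a prior to be weighed against syntactic evidence, never a verdict.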
<br />
Human languages are "rulesy". Whether we intend to or not, we embed rules in how we form utterances. That's true at the word level just as much as at the sentence level, not to mention the paragraph and discourse levels. Like many in the <a href="https://en.wikipedia.org/wiki/Computational_linguistics">computational linguistics</a> community, I had been falling under the spell of letting learning algorithms figure out the rules instead of hand-crafting them. The case for this approach has been compelling in recent decades. Start with a learning algorithm and a hand-tagged <a href="https://en.wikipedia.org/wiki/Text_corpus">corpus</a> and the algorithm will save you the trouble of understanding the rules yourself. The results for a specific task are often statistically better than hand-crafted rules, furthering the case for this approach. However, I'm beginning to question the wisdom of this seductive but naive approach, which <a href="https://en.wikipedia.org/wiki/Geoffrey_K._Pullum">Geoffrey K. Pullum</a> of the <a href="https://en.wikipedia.org/wiki/University_of_Edinburgh">University of Edinburgh</a> might label <a href="http://itre.cis.upenn.edu/~myl/languagelog/archives/000122.html">corpus fetishism</a>.<br />
<br />
Pullum represents one of the three legs of the computational linguistics stool: linguistics. At least one friend from the mathematics leg of CL has suggested I would do better to bone up on my linguistics knowledge than to worry about perfecting my mathematics knowledge. I concur with her thinking. As a programmer — this discipline being the third leg — I can attest that it's impossible to program a solution to a problem without first defining the problem. My own sense is that linguistics is the leading edge of the NLP revolution that is gradually happening. And mathematicians and programmers need to look deeper than the first level of linguistics to really understand what we're doing as we try to automate language understanding and production.<br />
<br />
Pullum's mechanistic approach to understanding and explaining English grammar is infectious. I'm studying parts of <a href="https://en.wikipedia.org/wiki/The_Cambridge_Grammar_of_the_English_Language"><i>The Cambridge Grammar of the English Language</i></a> (CGEL) lately, especially with a focus on morphology. Chapters 18 (<i>Inflectional morphology and related matters</i>) and 19 (<i>Lexical word formation</i>) deal extensively with the subject. Pullum and <a href="https://en.wikipedia.org/wiki/Rodney_Huddleston">Rodney Huddleston</a> (et al) delved far deeper than my own limited needs into the gory guts of word-formation, but I am quite enjoying the treatment.<br />
<br />
While my main motivation for adding a morphological parser (MP) to my BEP is dealing with unknown words, creating a trim lexicon is another major goal. If I have the word "happy" in my lexicon, I should not also need <i>happier</i>, <i>happily</i>, <i>unhappy</i>, <i>unhappily</i>, <i>happiness</i>, <i>unhappiness</i>, and so on in it. I want to believe that a large fraction of the words that appear in a given sentence are derivatives of simpler, common-enough ones. In the previous sentence, for example, <i>words</i>, <i>given</i>, <i>derivatives</i>, <i>simpler</i>, <i>common-enough</i>, and <i>ones</i> are derivatives. That's 26% of the sentence.<br />
<br />
I regularly coin what I like to call "bullshit words" like "rulesy" and "cuteiful" when I'm explaining ideas and as a light form of humor. While amusing to us, this is devastating to a parser relying purely on a lexicon, even when it already has millions of words in it. One powerful benefit to an MP is the ability to deal more robustly with <a href="https://en.wikipedia.org/wiki/Neologism">neologisms</a>.<br />
<br />
Putting it all together, the main goal for me of a morphological parser is to be able to recognize a much larger potential vocabulary with a much smaller lexicon held in memory. I'm going to dub this a "virtual lexicon".<br />
<br />
<h2>
Design philosophy</h2>
Loosely speaking, my morphological parser serves the function of a lexicon for a parser. In my current usage, I have a typical parse pipeline that begins with a <a href="https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)">tokenizer</a> that produces a string of words, numbers, punctuation, and other symbols. The tokens that look word-like, usually because they are made up exclusively or mainly of letters, are typically compared against the lexicon, a long list of known words. In the syntax parsing application, that lexicon usually indicates the lexical categories (<i>verb</i>, <i>noun</i>, etc.) for each matched word. Depending on the design, the lexical entry may contain more than one category.<br />
<br />
To my thinking, a simple, exact-match lookup isn't robust enough. Instead, each word-like token is fed into the MP, which has the same main goal: returning an indication of the most likely lexical categories for that word. But inside, the MP is operating on the premise that the word could be derived from other word parts. To be sure, my MP does have a lexicon and will return the category for a known word if it matches exactly. If it doesn't find an exact match, though, it tries its best to find a derivative of one or more known words.<br />
<br />
Remember: the first priority of the MP is to determine the most likely lexical category (e.g., adverb) of a word. There are thus four basic possible scenarios: the exact word already exists in the lexicon; the word is completely composed of other word parts that are all found in the lexicon; some parts of the word — usually suffixes like <i>ing</i> — are recognized to the point where a good guess is possible; or the word is wholly unrecognized.<br />
<br />
Given a single textual token like <i>strawberries</i> to look up, the MP returns a "word". That word consists of one or more interpretations, which I call "senses", in keeping with the idea that a dictionary entry for a word may contain definitions for many different senses of that word. This is also in keeping with one of my design goals: to entertain multiple interpretations of a given word, sentence, etc.<br />
<br />
Each word sense indicates the word's lexical category and includes a list of the <a href="https://en.wikipedia.org/wiki/Morpheme">morphemes</a> that make it up. Given <i>irredeemable</i>, for example, the MP returns a data structure that it shorthands as "<i>(J): ir-(J</i><span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span><i>J) redeem(V) -able(V</i><span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span><i>J)"</i>. The leading <i>(J):</i> indicates that the whole word is interpreted as an adjective. Next is <i>ir-</i>, which it sees as a prefix that typically attaches to an adjective to form an adjective. Next is <i>redeem(V)</i>, which it sees as a verb. Last is <i>-able(V</i><span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span><i>J)</i>, a suffix that usually attaches to a verb to form an adjective.<br />
<br />
The MP also incorporates a feature to treat one word as though it were multiple. In most cases, a single word can be treated as having a single, well-defined category like verb or adjective. But sometimes this isn't the case. Consider auxiliary verbs ending with the <i>n't</i> contraction, like <i>isn't</i>, <i>wouldn't</i>, or <i>haven't</i>. It is best to treat these as two-word strings like <i>is not</i>, <i>would not</i>, or <i>have not</i>, acknowledging that these are verb-adverb combinations. Most contractions likewise need expansion. Consider <i>it's</i> and <i>what's</i>, which should be interpreted as <i>it is/was</i> and <i>what is/was</i>. This also applies in <i>Joe and Dale're going fishing</i>, where <i>'re</i> must be interpreted as <i>are</i> and applying to both Joe and Dale. While most <a href="http://www.dictionary.com/browse/initialism">initialisms</a> (ASAP, IRS) and acronyms (AWOL, NAFTA) can be seen as having single categories — usually noun — others suffer the same problem. <i>IMO</i> (in my opinion) is a prepositional phrase. Although it could be seen as an adverb, it's probably better to simply expand it out into its set of words in the sentence in which it appears before syntactic parsing. Or <i>HIFW</i> (how I felt when), which surely can't be reduced to a basic category.<br />
<br />
In keeping with the belief that a word can have multiple senses and that it can sometimes be treated as multiple words, I'll point out that the "word" that is output by the MP when fed a token is a tree structure. A word object contains a list of word-sense objects. A word-sense object has a list of morphemes and, alternatively, a list of child word objects if a sense is to be interpreted as a string of words. A morpheme is mainly a pointer to a <a href="https://en.wikipedia.org/wiki/Lexeme">lexeme</a> (like an entry for a word in a dictionary) and which sense of that lexeme is meant. If a lexeme wasn't found for the morpheme in the word, those pointers are null, but the morpheme is still useful in that it contains the text of that morpheme.<br />
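As a rough illustration, the tree structure just described might look like this in Python (a hypothetical sketch of the shape of the data, not the BEP's actual classes; lexeme pointers are simplified to strings):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Morpheme:
    text: str                           # raw text, always available
    lexeme: Optional[str] = None        # link to a lexicon entry, or None if unknown
    lexeme_sense: Optional[int] = None  # which sense of that lexeme is meant

@dataclass
class Sense:
    category: str                       # lexical category of this interpretation
    morphemes: List[Morpheme] = field(default_factory=list)
    sub_words: List["Word"] = field(default_factory=list)  # multi-word expansion

@dataclass
class Word:
    token: str
    senses: List[Sense] = field(default_factory=list)

# Example: irredeemable -> (J): ir-(J->J) redeem(V) -able(V->J)
irredeemable = Word("irredeemable", [
    Sense("J", [
        Morpheme("ir-", "ir-", 0),
        Morpheme("redeem", "redeem", 0),
        Morpheme("-able", "-able", 0),
    ])
])
```

An unknown morpheme would simply carry `lexeme=None` while keeping its text, matching the behavior described above.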
<br />
I decided that it was critical to support <a href="https://en.wikipedia.org/wiki/Unicode">Unicode</a>. I created a custom, relatively trim library that lets me deal with multilingual text and "rich" characters. This overhead probably slows processing down marginally.<br />
<br />
One other key is that I decided to specialize this parser to English morphology, hard-coding some English-centric rules related to morpheme transformation and lexical categorization into it. I'm hopeful that this work can provide inspiration for extracting those rules out as data to support other languages better, but I just don't have enough knowledge of other languages to justify the added system complexity yet.<br />
<br />
<h2>
Morpheme splitting</h2>
My morphological parser breaks its job into two independent tasks: finding morphemes; and then interpreting them in conjunction with one another.<br />
<br />
Since each token is suspected to be composed of several morphemes, it's necessary to search for them. One way of doing that might be to start with the first letter. Consider <i>happenings</i>, for example, which is made up of <i>happen -ing -s</i>. One might start with <i>h</i> to see if we can find that in our lexicon. Not finding that, try <i>ha</i>. Then <i>hap</i>, and so on. Eventually, <i>happen</i> would be found. Then we could move on to find <i>ing</i> and finally <i>s</i>.<br />
<br />
Each substring of characters could go against a hashed dictionary, which is fairly efficient. However, my MP has a specialized search tree in which each node represents a single letter and contains another hashed dictionary of next letters. To find <i>happen</i>, the algorithm might start with the root node, find the child node corresponding to "h", and for that node, find the child node corresponding to "a", and so on. When the search finds a known word during this tree traversal, that node — the "n" child node of the "e" node in this example — will have a link to the appropriate lexeme entry in the lexicon.<br />
<br />
But I should clarify that my MP design counterintuitively starts at the end of a word and works its way backward. This is mainly because in English, modifications to base words (<i>happ<u>y</u></i>, <i>fla<u>p</u></i>) before adding suffixes (<i>happ<u>i</u>ly</i>, <i>fla<u>pp</u>ing</i>) usually occur in the last letter or two. More on this shortly.<br />
<br />
I refer to this supporting lookup structure as a "morpheme tree". Words are added to this tree in reverse-character order. So <i>happen</i> is to be found by starting at the root and traversing "n", "e", "p", "p", "a", and finally "h", which leaf node will contain a pointer to the "happen" lexeme, which in turn has a list of lexeme-sense nodes representing the different categories (and in the future, distinct definitions) for that word.<br />
<br />
Some words are subsets of others, as with <i>Japan</i> and <i>pan</i>. If the lexicon contains entries for both, this means parsing <i>Japanese</i> leads to ambiguity over whether <i>pan</i> is intended or merely coincidental. The morpheme tree will have a link to the <i>pan</i> lexeme at the "p" node, but also have a child node for the "a", leading into the "j" node, which then links to the <i>japan</i> lexeme entry. Thus, not all morpheme tree nodes that have links to lexemes are leaf nodes.<br />
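Here's a minimal Python sketch of such a reverse-order morpheme tree, showing how <i>pan</i> and <i>japan</i> coexist on one path (names and details are illustrative, not the MP's actual code):

```python
# A reverse-order morpheme tree: words are inserted last-character-first, so
# "pan" and "japan" share the n -> a -> p path, and nodes linking to lexemes
# need not be leaves.
class MorphemeNode:
    def __init__(self):
        self.children = {}   # earlier character -> MorphemeNode
        self.lexeme = None   # link to a lexicon entry, if a word ends here

class MorphemeTree:
    def __init__(self):
        self.root = MorphemeNode()

    def add(self, word):
        node = self.root
        for ch in reversed(word):
            node = node.children.setdefault(ch, MorphemeNode())
        node.lexeme = word   # string stands in for a real lexeme object

    def matches_ending_at(self, text, end):
        """All known morphemes ending at position `end` (exclusive) in text."""
        found = []
        node = self.root
        for i in range(end - 1, -1, -1):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.lexeme is not None:
                found.append((i, node.lexeme))  # morpheme spans text[i:end]
        return found

tree = MorphemeTree()
for w in ["pan", "japan", "ese"]:
    tree.add(w)
```

Scanning "japanese" backward from position 5 finds both <i>pan</i> and <i>japan</i>, exactly the ambiguity discussed above.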
<br />
The ambiguity introduced by overlap cannot be resolved immediately during word parsing. Moreover, other ambiguities arise that, again, cannot be resolved immediately. The morpheme-finding process is an exhaustive search that returns all possible parses of the whole token, from last to first character. In the <i>Japanese</i> example, traversing the morpheme tree yields <i>-ese</i> and then <i>pan</i>, but it doesn't find <i>ja</i>, which it conservatively interprets as an unacceptable parse. However, continuing past <i>pan</i> into <i>japan</i> bears fruit, so that parse gets interpreted as acceptable. Only those parses that return a string of morphemes that cover every last letter are returned by this part of the process.<br />
<br />
Getting the list of acceptable parses involves constructing a parse tree using a recursive algorithm. The parse starts with the final letter. Moving forward involves recursively calling a sub-parse function that also constructs the parse tree in parallel with the recursion process, including all branches considered, as in the <i>pan</i> vs <i>japan</i> case. Every time this recursion successfully reaches the beginning of the word, the final node added to this parse tree, which represents the first letter of the word, is added to a list of search "tails". Every node in that list of tails represents a distinct, completed parse of the entire token. If there were a <i>ja</i> lexeme in the test lexicon, then the two tails would correspond to both <i>ja pan -ese</i> and <i>japan -ese</i> parses, which then move onto the next stage for scoring. More on that later.<br />
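The recursive search can be sketched as follows (simplified: a plain set stands in for the morpheme tree and lexicon, and no spelling modifications are attempted):

```python
# Exhaustive segmentation working from the last character back. Every complete
# path to the beginning of the token is a "tail", i.e., one full parse.
def all_parses(token, lexicon):
    """Return every way to cover token entirely with known morphemes."""
    results = []

    def parse_back(end, chain):
        if end == 0:                      # reached the beginning: a complete tail
            results.append(list(reversed(chain)))
            return
        for start in range(end - 1, -1, -1):
            piece = token[start:end]
            if piece in lexicon:
                parse_back(start, chain + [piece])

    parse_back(len(token), [])
    return results

lexicon = {"ja", "pan", "japan", "ese"}
```

With a hypothetical <i>ja</i> lexeme present, both segmentations of "japanese" come back; with no full coverage possible, the empty list signals an unknown word.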
<br />
One way we end up with more tails is by having multiple senses of a word. Take the lexeme <i>er</i>, for example. In my test lexicon, this lexeme is viewed as a suffix added to an adjective to form the comparative of it (<i>bigger</i>, <i>heavier</i>), a suffix added to a verb to make a noun of it (<i>cobbler</i>, <i>runner</i>), or an initialism (<i>ER</i> or <i>E.R.</i>) for <i>emergency room</i>. So a parse of <i>killer</i> could yield "<i>kill(V) -er(J</i><span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span><i>J)</i>", "<i>kill(V) -er(V</i><span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span><i>N)</i>", or "<i>kill(V) er(N)</i>". Yes, "kill emergency room" is a possible interpretation.<br />
<br />
Another way we end up with more tails is by word modification, as with <i>sill<u>i</u>ness</i> and <i>pani<u>ck</u>ing</i>. These modifications are assumed by my MP to happen only to words that have suffixes appended, which is why the search begins at the end of each word. After finding the <i>ness</i> morpheme and committing in one path to the suffix sense from the lexicon (the <i>ness</i> lexeme could also have a proper noun sense, too, as the proper name <i>Ness</i>), we then look at the next few characters along the way, knowing that they represent the end of another morpheme. Based on what we find, we try whatever changes (e.g., replace "i" with "y") are allowable and then continue parsing based on those modifications, in addition to the parallel track of parsing with no such changes. A parse of <i>happiness</i> would find <i>-ness</i> but not <i>happi</i>. But it would successfully find <i>happy</i> after changing "i" to "y". The parse would essentially return <i>happy -ness</i>, as though the word were literally spelled as "happyness". Here are the specific modification rules I've implemented thus far:<br />
<br />
<ul>
<li>If last letter != "e" then append "e" (<i>shar -ing</i> <span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span> <i>share -ing</i>)</li>
<li>If last letter = "i" then change to "y" (<i>tri -es</i> <span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span> <i>try -es</i>)</li>
<li>If last letter = "v" then change to "f" (<i>leav -es</i> <span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span> <i>leaf -es</i>)</li>
<li>If last letter doubled then trim last letter (<i>stopp -ing</i> <span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span> <i>stop -ing</i>)</li>
<li>If last letters = "ck" then change to "c" (<i>panick -ed</i> <span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span> <i>panic -ed</i>)</li>
<li>If suffix = "n't" and last letter != "n" then append "n" (<i>ca -n't</i> <span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span> <i>can n't</i>)</li>
<li>If suffix = "s'" (apostrophe) then append "s" (<i>bas -s'</i> <span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span> <i>bass -s'</i>)</li>
</ul>
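A sketch of how these rules might generate candidate base spellings once a suffix has been found (my own reading of the list above, not the MP's exact code):

```python
# Candidate base spellings to try after a suffix has been stripped, following
# the modification rules listed above.
def base_candidates(stem, suffix):
    """Possible original spellings of `stem` before `suffix` was attached."""
    candidates = [stem]                          # always try the unmodified stem
    if not stem.endswith("e"):
        candidates.append(stem + "e")            # shar -ing   -> share -ing
    if stem.endswith("i"):
        candidates.append(stem[:-1] + "y")       # tri -es     -> try -es
    if stem.endswith("v"):
        candidates.append(stem[:-1] + "f")       # leav -es    -> leaf -es
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        candidates.append(stem[:-1])             # stopp -ing  -> stop -ing
    if stem.endswith("ck"):
        candidates.append(stem[:-1])             # panick -ed  -> panic -ed
    if suffix == "n't" and not stem.endswith("n"):
        candidates.append(stem + "n")            # ca -n't     -> can n't
    if suffix == "s'":
        candidates.append(stem + "s")            # bas -s'     -> bass -s'
    return candidates
```

Each candidate is then looked up normally, with the modified interpretations carrying a small penalty, as described in the scoring section.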
<br />
I suspect there may be other modifications worth adding to this list later, but these already do a lot of good work. Each of these above does turn validly spelled words into invalid ones, but the payoff in being able to correctly parse such modified words is obvious. One downside is that the modifications can potentially create new, also-valid morphemes that lead to incorrect parses, but this should be rare. One example might be <i>decking</i>, which should be interpreted as <i>deck -ing</i>, but could also be interpreted via the ck <span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span> c rule as <i>dec -ing</i>, where "dec" is the abbreviated form of "December".
<br />
Once the recursive parse is done, we're left with a list of tails that all represent the paths by which we got from the last letter of the token to the first. The final step of morpheme splitting involves constructing linear representations of the chains that start at each tail. Each node in these chains represents one morpheme. This set of chains is the input to the next stage, where we score each chain to see which ones seem the most likely.
<br />
<br />
<h2>
Unknown words</h2>
It's obviously possible that a parse of a token won't produce any tails, meaning there's no interpretation whose morphemes all match entries in the lexicon. My morphological parser doesn't just give up in this case. It alters the original token, creating one "hole" of every possible size and location in the token, and attempts to match the word in light of that hole. This involves adding a fake lexeme and a morpheme tree entry for the "?" symbol, which gets used as a stand-in for the hole (I don't allow tokens that already contain "?" symbols). Let's say we had "unXXXXing" as our token. Since the first try would not find any acceptable tails representing completed parses, our algorithm would try all possible variations that retain at least two non-hole characters, including "un?" and "?ng", but also "unX?Xing", "unXXXX?", and "un?ing", our intuitive best bet. That last one gets parsed as <i>un- [XXXX] -ing</i>, where anything inside [brackets] was the "hole": the text not found in the lexicon. This is better than no match at all, as the <i>-ing</i> suffix can be applied to a verb to form a verb or an adjective, narrowing the possibilities more than a completely unknown word like <i>XXXX</i> would.
<br />
<br />
This process does not stop as soon as one tail is found. Indeed, it generates all tails for all hole placements and leaves it to the next stage to find the best possible interpretations. This speculative process is naturally more expensive than when the parser does come up with at least one full-word interpretation.
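The hole-generation step might be sketched like this (illustrative only; the real MP parses each variant rather than just enumerating them):

```python
# The "hole" strategy for unknown tokens: substitute "?" for every possible
# substring and keep variants that retain at least two non-hole characters.
def hole_variants(token):
    variants = []
    n = len(token)
    for start in range(n):
        for end in range(start + 1, n + 1):      # hole covers token[start:end]
            variant = token[:start] + "?" + token[end:]
            if len(variant) - 1 >= 2:            # at least two real characters
                variants.append(variant)
    return variants
```

For "unXXXXing" this yields candidates like "un?", "?ng", and the intuitive best bet "un?ing", each of which is then parsed normally with "?" standing in for the hole.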
<br />
<br />
<h2>
Scoring senses</h2>
Once the first stage has produced a set of morpheme chains (word senses), the second stage scores each chain, winnows them down, and sorts them so that the first in the list is most likely to be the right sense.<br />
<br />
I've used scoring algorithms often for searches and such, but with those I built up a positive score reflecting all the good things about each item, putting the highest-scoring item at the top. This time I decided to go with a negative scoring algorithm that adds up all the downsides of a given interpretation of the word, putting the word sense with the lowest penalty value (zero being the lowest) at the top of the favored list.<br />
<br />
There are penalties for many potential defects:<br />
<ul>
<li>Modifications like changing "i" to "y" in <i>happiness</i> are penalized in favor of no-modification interpretations.</li>
<li>If the whole token didn't parse and we had to try out different-sized gaps, there is a penalty that favors the smallest gap size.</li>
<li>Senses not fitting the prefixes + bases + suffixes pattern are penalized.</li>
<li>If a lexeme has multiple senses, there's a small penalty for each subsequent sense used, thus favoring the earlier ones as being more likely.</li>
<li>If a suffix has an "attachment filter", meaning it favors attaching to words of one or more lexical categories more than others (e.g., <i>-er(V</i><span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span><i>N)</i> versus <i>-er(J</i><span style="background-color: #f9f9f9; font-family: sans-serif; font-size: 14px;">→</span><i>J)</i>), there's a penalty if the running category violates the suffix's filter.</li>
<li>Having more morphemes is penalized. Ideally, there will be only one morpheme because it exactly matches a lexeme in the lexicon.</li>
<li>Having multiple free morphemes (e.g., <i>apple</i>, <i>care</i>, <i>pretty</i>) is penalized in favor of affixes ("<i>big -er</i>" favored over "<i>big E.R.</i>").</li>
<li>Having zero free morphemes (only affixes) is heavily penalized.</li>
<li>A sense that has a suffix as its first morpheme (e.g., <i>-ing rown</i>) is penalized, as is one with a prefix as its last morpheme (e.g., <i>ado re-</i>).</li>
</ul>
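A toy version of this negative scoring might look like the following (the penalty weights are invented for illustration; the real MP's weights and full penalty list differ):

```python
# A toy negative-scoring pass over candidate senses, in the spirit of the
# penalties above. Weights are invented for illustration; lower totals win.
AFFIXES = {"un-", "re-", "ir-", "-ing", "-er", "-ness", "-ly", "-able"}

def penalty(morphemes, modified=False):
    score = 0
    if modified:
        score += 5                                # a spelling modification was used
    score += 2 * (len(morphemes) - 1)             # more morphemes, more doubt
    free = [m for m in morphemes if m not in AFFIXES]
    if len(free) > 1:
        score += 4 * (len(free) - 1)              # multiple free morphemes
    if not free:
        score += 20                               # affixes only: very unlikely
    if morphemes and morphemes[0].startswith("-"):
        score += 10                               # a suffix can't come first
    if morphemes and morphemes[-1].endswith("-"):
        score += 10                               # a prefix can't come last
    return score

def rank(senses):
    """Sort candidate senses (lists of morphemes) by ascending penalty."""
    return sorted(senses, key=penalty)
```

Ranking <i>killer</i>'s readings puts "<i>kill -er</i>" ahead of "<i>kill er</i>" (kill + emergency room), matching intuition while keeping the unlikely reading in play.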
<br />
One underlying assumption for this scoring algorithm is that all interpretations spit out by the morpheme splitting stage are worth considering. I don't want to disqualify a potentially valid interpretation just because it doesn't obey the usual conventions for word production. A good example of a heavily penalized sense that is actually correct is the word <i>ish</i>, which is sometimes used informally as a way to indicate degree. "<i>Are you tired?</i>" "<i>Ish.</i>" This thinking is especially helpful when words are formed using unusual affixation. For example, the <i>-ish</i> suffix is intended to attach to adjectives or nouns to form an adjective (<i>squareish</i>, <i>houseish</i>), but one could also attach it to a verb (<i>burnish</i>, <i>crankish</i>, <i>rompish</i>). Yes, the <i>-ish</i> lexeme sense's filter could be expanded to include verbs, but this algorithm prefers to see all the penalty mechanisms as reflecting preferences in lexical interpretation instead of absolute disqualifiers. If the best scoring sense is heavily penalized, it's still in the game until better interpretations come along. There is no penalty threshold that disqualifies a sense.<br />
<br />
Once scoring is done, the results are sorted and only the top-scoring sense for each representative category is kept. That is, only the best verb sense is kept, only the best noun sense is kept, and so on. I have some misgivings about this expedient, but I'm motivated by a desire to keep the syntax parsing and broader interpretation process limited to a modest number of possible interpretations. Having 100 possible interpretations for, say, a partly unknown word seems counterproductive.<br />
<br />
<h2>
Category guessing</h2>
At the same time each sense is scored, the lexical category for the whole word is developed. As you might guess, even this is an error-prone process. The essence of the matter involves finding the last free morpheme's lexical category and then transforming it according to the suffixes attached to it. Consider <i>meatier</i>, for example, which parses out to either "<i>(J): meat(N) -y(N→J) -er(J→J)</i>" (penalty = 20) or "<i>(N): meat(N) -y(N→J) -er(V→N)</i>" (penalty = 27). As desired, the final conclusion is that it's most likely an adjective, since <i>-y</i> typically turns a noun into an adjective (<i>meaty</i>, <i>nerdy</i>, <i>watery</i>) and adding <i>-er</i> to an adjective keeps it an adjective. The other option, where another sense of <i>-er</i> converts a verb into a noun (<i>killer</i>, <i>taker</i>, <i>slicer</i>) doesn't fit as well, but it's still an option we want to present to the consuming application for its consideration.<br />
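The category-derivation chain can be sketched as follows (the suffix table is a tiny illustrative sample; and where this sketch returns None for a filter violation, the real MP merely penalizes the sense, as discussed above):

```python
# Derive a whole-word category by applying each suffix's transformation to the
# base morpheme's category in order, as in the "meatier" example.
SUFFIX_RULES = {
    "-y":  {"N": "J"},            # noun -> adjective (meaty)
    "-er": {"J": "J", "V": "N"},  # comparative, or agent noun (killer)
}

def derive_category(base_category, suffixes):
    """Apply suffix transformations left to right; None if a filter blocks."""
    category = base_category
    for suffix in suffixes:
        rule = SUFFIX_RULES.get(suffix, {})
        if category not in rule:
            return None
        category = rule[category]
    return category
```

Starting from <i>meat</i> as a noun, "-y" then "-er" yields an adjective, the favored reading of <i>meatier</i>.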
<br />
I considered allowing the prefixes to influence transformation of the category, but this raised some ambiguities. Moreover, the prefixes I considered generally don't change the category of the words they attach to. I decided to just ignore them for now.<br />
<br />
There are plenty of compound words that my morphological parser can handle. Here's a place where <a href="https://en.wikipedia.org/wiki/The_Cambridge_Grammar_of_the_English_Language">CGEL</a> was immensely helpful for my understanding of how to deal with them. For starters, it seems most of the compounds we use contain only two free morphemes (<i>swimsuit</i>, <i>backwater</i>, <i>nosedive</i>). I decided to effectively treat compound words as though they are made up of separate words in this narrow context. My algorithm develops a category along the way and when it finds a boundary between two (potentially affixed) free morphemes, it starts over. But it keeps track of what the categories were for the two (or more) sub-words. It then uses the conjunction of <category> + <category>, what I call the "compound pattern", to decide whether to override whatever category the last word-let came up with, which is otherwise a good predictor. Thus far I've found only two compound patterns that merit overriding their default lexical categories. The first is verb+preposition (<i>breakthrough</i>, <i>look-out</i>, <i>talking-to</i>), which I change to noun. Another is adjective+verb (<i>blueprint</i>, <i>high-set</i>, <i>smalltalk</i>), which I also default to noun. But if the verb in that adjective+verb compound ends in <i>-ing</i> (<i>breathtaking</i>, <i>strange-looking</i>, <i>talking-to</i>) or <i>-ed</i> (<i>French-based</i>, <i>short-lived</i>, <i>well-behaved</i>), I convert the whole word's category to adjective.<br />
<br />
<h2>
Multi-word strings</h2>
There is a final step, once scoring and winnowing are done. We look at each sense to see if any of its morphemes demands that it must stand alone instead of being integral to a word. If so, we now break the total word up according to the morphemes' needs. If a word sense is composed of five morphemes and the one in the middle demands that it be expanded and its words stand on their own, the algorithm will create a new word from the first two morphemes in the original word, expand out the must-expand words from the middle morpheme, and then create a final word from the last two morphemes. For each of the new words, which are now really just new plain-text tokens, the entire process repeats, and this word sense becomes just a shell for the string of sub-words parsed in the same way. One example is <i>shouldn't've</i>, which breaks down to <i>should + not + have</i>.<br />
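A simplified sketch of this expansion for contractions (the table and function are illustrative stand-ins for the morpheme-driven mechanism described above):

```python
# Split trailing must-stand-alone contraction morphemes off a token, working
# right to left, and expand each into its standalone word.
EXPANSIONS = {
    "n't": ["not"],
    "'ve": ["have"],
    "'re": ["are"],
}

def expand(token):
    """Return the token as a list of standalone words."""
    words = []
    rest = token
    while True:
        for suffix, expansion in EXPANSIONS.items():
            if rest.endswith(suffix) and len(rest) > len(suffix):
                words = expansion + words    # prepend the expanded word(s)
                rest = rest[: -len(suffix)]
                break
        else:
            break                            # no contraction suffix matched
    return [rest] + words
```

Feeding it <i>shouldn't've</i> peels off <i>'ve</i> and then <i>n't</i>, yielding the three-word string described above.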
<br />
In truth, I'm not 100% sure about the need for this feature. Consider the <i>HIFW</i> (how I felt when) example. Standing on its own, it seems valuable to expand it out into a sentence like <i>HIFW I saw it</i>, but what if it had a suffix, as in <i>totally HIFWing on this</i>? "How I felt whening" doesn't make sense, while treating the whole thing as probably a verb does. This is an area I think I need to study further.<br />
<br />
<h2>
Performance tests</h2>
One way of seeing how fast this runs is to select sample words and see how many times my morphological parser can process each. I'm starting with a <a href="https://support.apple.com/kb/SP687?locale=en_US&viewlocale=en_US">late 2013 iMac</a> with a 3.4 GHz Intel Core i5 and 8GB 1600 MHz DDR3 memory, a reasonably upscale desktop computer. I wrote my code in C++ using Xcode.<br />
<br />
My test lexicon contains 876 lexemes. I'll admit that this is much too small to be representative of a well-stocked lexicon, but I also don't believe that increasing its size will have much effect on this algorithm's performance. The main reason is that the expensive part of dealing with the lexicon is looking up a candidate morpheme. Since this is done by traversing the morpheme tree in parallel with reading each character, which takes constant time per recursive step of the morpheme parse, I expect no significant change in parse time as the lexicon gets bigger. Time will tell.<br />
<br />
So let's take some sample words and see how many times it can parse the same word per second. First, consider tokens that had full matches:<br />
<ul>
<li>30,000 words/second: red: <i>(J): red(J)</i></li>
<li>17,300 w/s: adventure: <i>(N): adventure(N)</i></li>
<li>9,000 w/s: recordkeeping: <i><b><span style="color: #cc0000;">(V)</span></b>: record(N) keep(V) -ing(V|N→V)</i></li>
<li>8,500 w/s: relies: <i>(V): re-(U) lie(V) -s(V→V)</i></li>
<li>6,100 w/s: breathtaking: <i>(J): breath(J) take(V) -ing(V|N→V)</i></li>
<li>3,600 w/s: unremittingly: <i>(J): un-(J→J) remit(V) -ing(V→J) -ly(N|J→J)</i></li>
<li>1,700 w/s: antidisestablishmentarianism: <i>(N): anti-(U) dis-(U) establish(V) -ment(V→N) -arian(N→N) -ism(N→N)</i></li>
<li>181 w/s: happily-tippingreungreennesspotatoes: <i>(N): happy(J) -ly(N|J→J) -(U) tip(V) -ing(V|N→V) re-(U) un-(J→J) green(J) -ness(J→N) potato(N) -es(N→N)</i></li>
</ul>
Now let's try some words that don't have full matches. Note the interpretations. Some of them are clearly wrong, but they help illustrate how this algorithm works:
<br />
<ul>
<li>40,000 w/s: bug: <i>&lt;no match&gt;</i></li>
<li>9,000 w/s: redbug: <i>(J): red(J) <b><span style="color: #cc0000;">[bug]</span></b></i></li>
<li>2,400 w/s: mister: <i>(J): my(N) -s(N→N) <b><span style="color: #cc0000;">[t]</span></b> -er(J→J)</i></li>
<li>2,200 w/s: censorize: <i>(V): <b><span style="color: #cc0000;">[censo]</span></b> re-(U) -ize(V|N→V)</i></li>
<li>14 w/s: punk-antidisestablishmentarianism: <i>(N): <b><span style="color: #cc0000;">[punk]</span></b> -(U) anti-(U) dis-(U) establish(V) -ment(V→N) -arian(N→N) -ism(N→N)</i></li>
</ul>
I am very happy with these results. I thought it would be orders of magnitude slower. Instead, it seems this piece could hum along at 6,000 or more words per second on average on my computer, assuming most words it comes across have full matches.<br />
<br />
<h2>
Memory consumption</h2>
Regarding memory, a simple test in which I reduce the lexicon to nearly empty shows that it consumes about 628 KB of memory. With 878 items in the lexicon, it climbs to 1 MB. Here are some actual memory measurements for lexicon sizes during loading:
<br />
<ul>
<li>0: 628 KB</li>
<li>1: 636 KB</li>
<li>100: 700 KB (720 B/lexeme)</li>
<li>200: 740 KB (560 B/lexeme)</li>
<li>300: 804 KB (587 B/lexeme)</li>
<li>400: 832 KB (510 B/lexeme)</li>
<li>500: 876 KB (496 B/lexeme)</li>
<li>600: 936 KB (513 B/lexeme)</li>
<li>700: 968 KB (486 B/lexeme)</li>
<li>800: 1,020 KB (490 B/lexeme)</li>
</ul>
<br />
<center>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizK8N099-EwpzOa3Yq9sjhpOLQnQFe7144RY8sxZ2ydlE3KFsX98Ri9mkEJGpk0uAIsL4R7E_ULTXz9VMHXS8hlVemQ4bJuFPCsCJjb8iROCHijLpZB78daJ3PaCJLVD5EY2jT9g6keRWc/s1600/Screen+Shot+2016-11-15+at+11.33.30+AM.png" imageanchor="1"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizK8N099-EwpzOa3Yq9sjhpOLQnQFe7144RY8sxZ2ydlE3KFsX98Ri9mkEJGpk0uAIsL4R7E_ULTXz9VMHXS8hlVemQ4bJuFPCsCJjb8iROCHijLpZB78daJ3PaCJLVD5EY2jT9g6keRWc/s1600/Screen+Shot+2016-11-15+at+11.33.30+AM.png" style="margin-left: 1em; margin-right: 1em; max-width: 420px; width: 100%;" /></a><br />
<span style="font-size: smaller;">Memory: bytes per lexeme</span>
</center>
<br />
I'm not sure whether this means that the per-lexeme consumption flattens out at a little under 500 bytes per lexeme or if it continues downward, which I'm expecting. The morpheme tree's memory footprint should grow logarithmically. The lexicon's lexeme info should grow linearly. So let's say the average stays around 500 bytes per lexeme. That means a lexicon with one million items should consume half a gigabyte.<br />
<br />
A more modest lexicon of 100k lexemes (words) would consume 50 MB. For comparison, as I look at the currently active programs in my computer's memory, I see that Chrome is consuming 3 GB, Google Drive 655 MB, Xcode 826 MB, and so on.
<br />
<br />
<h2>
Fidelity tests</h2>
Of course, having an algorithm that's fast isn't as important as having one that works well. Were I writing a scholarly paper, I'd feel compelled to flesh out my lexicon and mine a corpus for test cases, but I haven't gotten around to that yet. I plan to do more serious testing of this sort in time, though.<br />
<br />
But I do have one useful barrage test behind me. I was keenly interested in seeing how well my MP would fare against the wide variety of compound words found in CGEL's treatment of morphology. To that end, I painstakingly typed the 678 examples I found there into a data file and hand tagged all of their lexical categories. I then created another data file containing their base words. For the example of <i>taxpayer-funded</i>, I had to isolate <i>tax</i>, <i>pay</i>, and <i>fund</i>. I then hand-tagged those words, too. Below is a snippet from the test's output:<br />
<br />
<pre> - sunshine | . | (N): sun(N) shine(N) (P:13)
- swearword | . | (N): swear(V) word(N) (P:13)
- sweetheart | . | (N): sweet(J) heart(N) (P:13)
- swimsuit | . | (N): swim(V) suit(N) (P:13)
- swordsman | . | (N): sword(N) -s(N→N) man(N) (P:123)
- syntactic-semantic | . | (J): syntactic(J) -(U) semantic(J) (P:23)
- table-talk | (N) | (V): table(N) -(U) talk(V) (P:23)
- take-away | (N) | (R): take(V) -(U) away(R) (P:23)
- take-off | . | (N): take(V) -(U) off(P) (P:23)
- talking-point | . | (N): talk(V) -ing(V|N→V) -(U) point(N) (P:133)
- talking-to | . | (N): talk(V) -ing(V|N→V) -(U) to(P) (P:133)
- tape-record | (V) | (N): tape(N) -(U) record(N) (P:23)
- tax-deductible | . | (J): tax(N) -(U) deduct(V) -ible(V→J) (P:33)
- tax-free | . | (J): tax(N) -(U) free(J) (P:23)
- taxpayer-funded | (J) | (V): tax(N) pay(V) er(N) -(U) fund(V) -ed(V→V) (P:63)
- tearoom | . | (N): tea(N) room(N) (P:13)
- theater-goer | . | (N): theater(N) -(U) go(V) -er(V→N) (P:35)
- theatre-going | (J) | (V): theatre(N) -(U) go(V) -ing(V|N→V) (P:33)
- thought-provoking | (J) | (V): thought(N) -(U) provoke(V) -ing(V|N→V) (P:53)
- threadbare | . | (J): thread(N) bare(J) (P:13)
- three-inch | (J) | (N): three(D) -(U) inch(N) (P:23)
- three-metre-wide | . | (J): three(D) -(U) metre(N) -(U) wide(J) (P:46)
- tightrope | . | (N): tight(J) rope(N) (P:13)
- timberline | . | (N): timber(N) line(N) (P:13)</pre>
<br />
The center column represents the hand-tagged value. If it is the same as the MP's prediction, the column contains a period, allowing the mistakes to jump out easily. Of the 678 compound words tested, 76.8% were correctly tagged. Note that the "(P:13)" values on the far right represent penalty calculations for each of these. I'm showing only the best scoring (least penalized) interpretation for each of the test tokens.<br />
<br />
During development, I relied a lot on hand-crafted example words. I reproduce some examples below:<br />
<br />
<pre>- antidisestablishmentarianism
- (N): anti-(U) dis-(U) establish(V) -ment(V→N) -arian(N→N) -ism(N→N) (P:50)
- Rate: 1451.38 words/s
- buttons
- (N): button(N) -s(N→N) (P:10)
- (V): button(N) -s(V→V) (P:17)
- Rate: 17152.7 words/s
- buttoning
- (V): button(N) -ing(V|N→V) (P:11)
- (N): button(N) -ing(N→N) (P:14)
- (J): button(N) -ing(V→J) (P:17)
- Rate: 11904.8 words/s
- exposition
- (N): expose(V) -ition(V→N) (P:30)
- Rate: 12547.1 words/s
- expositions
- (N): expose(V) -ition(V→N) -s(N→N) (P:40)
- (V): expose(V) -ition(V→N) -s(V→V) (P:47)
- Rate: 7189.07 words/s
- reexpose
- (V): re-(U) expose(V) (P:10)
- Rate: 27100.3 words/s
- reexposure
- (N): re-(U) expose(V) -ure(N) (P:40)
- Rate: 15432.1 words/s
- reexposed
- (V): re-(U) expose(V) -ed(V→V) (P:40)
- (J): re-(U) expose(V) -ed(V→J) (P:42)
- Rate: 11723.3 words/s
- malignant
- (N): malign(V) -ant(N) (P:12)
- (J): malign(V) ant(J) (P:13)
- Rate: 14881 words/s
- meteorites
- (N): meteor(N) -ite(N→N) -s(N→N) (P:20)
- (V): meteor(N) -ite(N→N) -s(V→V) (P:27)
- Rate: 8992.81 words/s
- mouthy
- (J): mouth(N) -y(N→J) (P:10)
- Rate: 22002.2 words/s
- stubbornly
- (J): -s(N→N) <b><span style="color: #cc0000;">[tub]</span></b> born(V) -ly(N|J→J) (P:3343)
- Rate: 1026.69 words/s
- muddling
- (V): <b><span style="color: #cc0000;">[muddl]</span></b> -ing(V|N→V) (P:5015)
- (J): <b><span style="color: #cc0000;">[muddl]</span></b> -ing(V→J) (P:5017)
- (N): <b><span style="color: #cc0000;">[muddl]</span></b> -ing(N→N) (P:5019)
- Rate: 2212.39 words/s
- rapacious
- (J): <b><span style="color: #cc0000;">[rapac]</span></b> -y(N→J) -ous(J) (P:5025)
- Rate: 1189.06 words/s
</pre>
<br />
I know I'll need to do more testing, but I'm fairly happy with the results so far.<br />
<br />
<h2>
Applications and future work</h2>
While my main goal in creating a morphological parser is to create a mechanism for building a "virtual lexicon" that supports syntax parsing by guessing at lexical categories for words, I see other potential uses, too.<br />
<br />
For starters, an MP should be able to aid the process of building a lexicon. Imagine doing so by importing documents. For each document, the lexicon-builder tool calls out words it doesn't already recognize. Take the <i>muddling</i> example from above. The best guess was that the word is a verb, which is correct. It called out "muddl" as the unknown. Moreover, one could use the <i>-ing(V|N→V)</i> lexeme sense, which indicates that it usually attaches "ing" to a verb or (secondarily) a noun to form a verb, to guess that "muddl" is most likely a verb, which is also correct. The only thing wrong is the spelling, since this involved lopping off a final "e". The user would need to review and finesse each suggested entry found this way.<br />
<br />
I also believe this could be used to enhance a typical spell checker. For starters, it could allow the spell checker to distinguish between "hard" and "soft" misspellings. That is, it could call out words that fit word-formation patterns but are not in an otherwise large lexicon as "soft" misspellings. Moreover, it could recognize when a word looks like a proper inflection but actually isn't. If the lexeme sense indicates that a base word does not follow the usual inflection rules and calls out the alternatives, the spell checker could suggest the correct one. For example, <i>badder</i> might lead to <i>worse</i> as a suggestion, as <i>badder</i> appears to be the comparative of <i>bad</i>. Similarly, <i>worser</i> could be called out as a nonstandard comparative, with <i>worse</i> being suggested. <i>Childs</i> becomes <i>children</i>. And so on. These would generally be favored over typo-assuming suggestions like <i>balder</i>, <i>worsen</i>, and <i>child</i>.<br />
<br />
One problem I intend to apply my MP to is improved <a href="https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)">tokenization</a>. Consider the sentence <i>When will you see Prof. Smith?</i> A basic tokenizer would see the period in <i>Prof.</i> and assume it marked the end of a sentence. <i>Smith?</i> could, after all, be a complete sentence, too. I think my lexicon is going to need to have common abbreviations like <i>Prof.</i>, <i>Mrs.</i>, and <i>etc.</i> to help disambiguate period usage. One option would be to include the period in a token that is otherwise word-like and ask the MP to render an opinion about whether the period is part of the word or more likely punctuation. This would extend to period-delimited formats like <i>R.I.P.</i> and <i>R.S.V.P.</i>, where it seems logical for the MP, which looks at words character-by-character, to recognize and "correct" this pattern. After all, the lexicon may have <i>RSVP</i> in it but not the redundant <i>R.S.V.P.</i> defined, so it would be helpful to recognize and transform this pattern before traversing the morpheme tree.<br />
<br />
Related to ambiguous periods are ambiguous apostrophes ('). If a word begins or ends in an apostrophe, does that signal a possessive, an <a href="https://en.wikipedia.org/wiki/Eye_dialect">eye-dialect</a> spelling (<i>'nother</i>, <i>lil'</i>, <i>readin'</i>), or single-quoted prose? The lexicon could help if it contained common eye-dialect examples. And the MP could figure out if a trailing <i>s'</i> likely represents a possessive (<i>fixing the dogs' dinners</i>).<br />
<br />
Because the MP returns multiple options, it certainly can return several interpretations of embedded periods and apostrophes. It might be best for the tokenizer, confronted with a paragraph, to conservatively attach periods and apostrophes to the otherwise word-like tokens they are nearest as a first step, then have the MP come up with naive guesses for tokens' categories, word splits, and punctuation interpretations. Only after that would a next stage come up with one or more interpretations of where the sentence boundaries are, a break with the traditional <i>tokenization → lexicalization → syntax parsing</i> flow. Then it would be up to the syntax parser to figure out which proposed sentence boundaries make the most sense, grammatically.<br />
<br />
Although my current morphological parser code is already my second version, I've only been working on this for two and a half weeks. I have no doubt this deserves quite a bit more work. But overall, I'm very happy with the initial results. My sense is that my MP works effectively and efficiently and that it will serve several parsing goals at once.Jim Carnicellihttp://www.blogger.com/profile/17916650503215873212noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-16049783822442690982016-10-04T10:32:00.002-07:002016-10-04T10:32:38.765-07:00I'm backAfter many years on hold, I've been spending the past couple months getting back into my AI research saddle. For now, I'm able to devote a lot more time to it.<br />
<br />
One of the most significant things I've noticed is that in recent years, a lot more AI-related research papers have become freely available to read online. This has dramatically accelerated my learning. A lot of smart people have been spending the past few decades advancing inductive learning techniques to apply to ever larger volumes of training data and developing various algorithmic approaches to <a href="https://en.wikipedia.org/wiki/Machine_learning">machine learning</a>, writ large.<br />
<br />I've been focused almost exclusively on the area of <a href="https://en.wikipedia.org/wiki/Natural_language_processing">natural language processing</a> (NLP). Linguists and AI researchers seem to have done each other great service in advancing what we know about how humans deal with natural languages. I'm hoping to capitalize on a lot of their good work. And hopefully contribute some novel research of my own soon.<br />
<br />
It's an exciting time to be diving back into AI research.Jim Carnicellihttp://www.blogger.com/profile/17916650503215873212noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-4212688411770866022009-10-18T08:43:00.000-07:002016-10-04T11:10:38.162-07:00Mechanical fingerI'm a bit surprised I never posted about this sooner. When I was in high school, back around 1990, I designed and built prototypes of a human-shaped finger for a robot. At the time there were already some human-shaped mechanical hands, but I was disappointed by their Erector Set bulkiness and openness. I imagined they would pinch human skin and damage all sorts of things humans are used to interacting with. I was trying to think of a way to solve that problem and came up with the idea of constructing a solid finger out of a sandwich of layers of parts. I built a prototype out of balsa wood and then another out of plastic. Below are some pictures I just took of them and the design sketch for the plastic one.<br />
<br />
<img src="http://jimcarnicelli.com/ai/blog/attachments/00000054_Finger_Extended.jpg" /><br />
<br />
<img src="http://jimcarnicelli.com/ai/blog/attachments/00000054_Finger_Flexed.jpg" /><br />
<br />
<img src="http://jimcarnicelli.com/ai/blog/attachments/00000054_Finger_Top.jpg" /><br />
<br />
<img src="http://jimcarnicelli.com/ai/blog/attachments/00000054_Finger_Bottom.jpg" /><br />
<br />
<img src="http://jimcarnicelli.com/ai/blog/attachments/00000054_Finger_Diagram.jpg" /><br />
<br />
I guess I didn't think much of them because, as the years went by, so many great innovations came about in this area. Still, I think there's merit in this approach.<br />
<br />
This design is incredibly lightweight. The balsa wood one weighs a few grams. Each weighs about as much as a solid piece of the material it was carved from. Because the hinges are as large as the fingers' diameters, they are very sturdy, so the fingers don't flex from side to side at the joints. Being solid means they resist compression fairly well.<br />
<br />
There are few moving parts: just three finger segments, three axles, and two cables. No complex pulleys to deal with or the like.<br />
<br />
A mechanical engineer might object to the friction that can come from the large hinges. I was worried about this when I built them but was surprised at how little friction there is. I may have put some powdered lubricant in the balsa one, but I put nothing in the plastic one.<br />
<br />
Probably the greatest weakness of what I built is the cables. The balsa one's cables are made from bundles of sewing thread and the plastic one's are ordinary twine. I envisioned a production version being made from plastic and having cables made from thick fishing wire. I imagine this could easily be made from aluminum or another metal and use steel cables.<br />
<br />
One other shortcoming is that the cable runs are exposed on the bottom when the finger is extended. Although they are very small slots, they still can be pinch points or places where dirt can collect and foul the machine. When I built these I imagined there would be a skin covering the mechanisms.<br />
<br />
One nice thing about this design is that it lends itself well to sculpting. You can probably see that I whittled the outside of the balsa one to give it a human finger shape. It's unusually thin, but I expected I could have added more layers to fatten it out to human proportions. Although the diagram doesn't show this, the various layers have different diameters for the hinge parts to account for the curvature of the fingers.<br />
<br />
Again, it's not much to speak of, but I thought it might be nice to post about this here. But it's also a reminder that I made this over half my life ago. So much time has passed and I really haven't done much of the research I had hoped I would in that time. It's a bit sad to think this modest creation might be the pinnacle of my creative efforts in robotics.Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com3tag:blogger.com,1999:blog-6262682529872030736.post-12273986057161608852007-11-13T00:00:00.001-08:002016-10-04T11:23:33.655-07:00Confirmation bias as a tool of perception<p>I've been trying to figure out where to go next with my study of perception. One concept I'm exploring is the idea that our expectations enhance our ability to recognize patterns.</p>
<p>I recently found a <a href="http://www.mrc-cbu.cam.ac.uk/~mattd/vocode/">brilliant illustration of this</a> from researcher <a href="http://www.mrc-cbu.cam.ac.uk/~mattd/">Matt Davis</a>, who studies how humans process language. Try out the following audio samples. Listen to the first one several times. It's a "vocoded" version of the plain English recording that follows. Can you tell what's being said?</p>
<center><p>Vocoded version.
<br/><audio controls="controls"><source src="http://jimcarnicelli.com/ai/blog/attachments/00000053_VocodedSpeech.wav" type="audio/wav"/></audio>
<br /><a href="http://jimcarnicelli.com/ai/blog/attachments/00000053_VocodedSpeech.wav">Click here to open this WAV file</a></p>
</center><p>Give up? Now listen to the plain English version once and then listen to the vocoded version again.</p>
<center><p>Clear English version.
<br/><audio controls="controls"><source src="http://jimcarnicelli.com/ai/blog/attachments/00000053_ClearSpeech.wav" type="audio/wav"/></audio>
<br /><a href="http://jimcarnicelli.com/ai/blog/attachments/00000053_ClearSpeech.wav">Click here to open this WAV file</a></p></center><p>Davis refers to this a-ha effect as "pop-out":</p>
<blockquote>Perhaps the clearest case of pop-out occurs if you listen to a vocoded sentence before and immediately after you hear the same sentence in clear form. It is likely that the vocoded sentence will sound a lot clearer when you know the identity of that sentence.</blockquote>
<p>To me, this is a wonderful example of confirmation bias. Once you have an expectation of what to look for in the data, you quickly find it.</p>
<p>How does this relate to perception? I believe that recognizing patterns in real world data involves not only the data causing simple pattern matching to occur (bottom up), but also higher level expectations prompting the lower levels to search for expected patterns (top down). To help illustrate and explain, consider how you might engineer a specific task of perception: detecting a straight line in a picture. If you're familiar with machine vision, you'll know this is an age-old problem that has been fairly well solved using some good algorithms. Still, it's not trivial. Consider the following illustration of a picture of a building and some of the steps leading up to our thought experiment:</p>
<p></p><center><img title="Sample set-up for detecting lines in an image" src="http://jimcarnicelli.com/ai/blog/attachments/00000053_Building.jpg" /></center><p></p>
<p>The first three steps we'll take are pretty conventional ones. First, we get our source image. Second, we apply a filter that looks at each pixel to see if it strongly contrasts with its neighbors. Our output is represented by a grayscale image, with black pixels representing strong contrasts in the source image. In our third step, we "threshold" our contrast image so each pixel goes either to black or white; no shades of gray.</p>
<p>Here's where our line detection begins. We'll say we start by making a list of all sets of neighboring black pixels that have, say, 10 or more pixels touching one another. Next, we filter these by seeing which have a large number of those pixels roughly fitting a line function. We end up with a bunch of small line segments. Traditionally, we could stop here, but we don't have to. We could pick any of these line segments and extend it out in either direction to see how far it can go and still find black pixels that roughly fit that line function. We might even tolerate a gap of a white pixel or two as we continue extending out. And we might try different variations of the line function that still fit but fit better as the line segment gets longer, in order to further refine the line function. But then uncertainty kicks in and we conservatively stop stretching out when we no longer see black pixels.</p>
<p>Here's where confirmation bias can help. Once we have a bunch of high-certainty line segments to work with, we now have expectations set about where lines form. So maybe we take our line segments back to the grayscale version of the contrast image. To my thinking, those gray pixels that got thresholded to white earlier still contain useful information. In fact, each grey pixel in the hypothesized line provides "evidence" that the line continues onward; that the "hypothesis" is "valid". It doesn't even matter that there may be lots of other grey -- or even black -- pixels just outside the hypothesized line. They don't add to or distract from the hypothesis. Only the "positive confirmation" of grey pixels adds weight to the hypothesis that the line extends further than we could tell by the black pixels in the thresholded version. Naturally, as the line extends out, we may get to a point where most of the pixels are white or light. Then we stop extending our line.</p>
<p>I love this example. It shows how we can start with the source data "suggesting" certain known patterns (here, lines) and that a higher level model can then set expectations about bigger patterns that are not immediately visible (longer lines) and use otherwise "weak evidence" (light grey pixels) as additional confirmation that such patterns are indeed found. To me, this is a wonderful illustration of inductive reasoning at work. The dark pixels may give strong, deductive proof of the existence of lines in the source data, but the light pixels that fit the extended line functions give weaker inductive evidence of the same.</p>
<p>I don't mean to suggest that perception is now solved. This example works because I've predefined a model of an "object"; here, a line. I could extend the example to search for ellipses, rectangles, and so on. But having to predefine these primitive object types seems to miss the point that we are quite capable of discovering these and much more sophisticated models for ourselves. There's no real learning in my example; only refinement. Still, I like that this illustrates how confirmation bias -- something of a dirty phrase in the worlds of science and politics -- probably plays a central role in the nature of perception.</p>Jim Carnicellihttp://www.blogger.com/profile/17916650503215873212noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-50543669003327275352007-11-06T00:00:00.000-08:002016-10-04T11:26:31.971-07:00What bar code scanners can tell us about perception<p>It may not be obvious, but a basic bar code scanner does something that machine vision researchers would love to see their own systems do: find objects amidst noisy backgrounds of visual information. What is an "object" to a bar code scanner? To answer that, let's start by explaining what a bar code is.</p><h2>What is a bar code?</h2><p>You've probably seen bar codes everywhere. Typically, they are represented as a series of vertical bars with a number or code underneath. There are many standards for bar codes, but we'll limit ourselves to one narrow class, typified by the following example:</p><p></p><center><img title="Example of a bar code" src="http://jimcarnicelli.com/ai/blog/attachments/00000052_BarCode.gif" /></center><p></p><p>This sort of bar code has a start code and an end code. These typically feature a very wide bar. One of its main purposes is to serve as a standard for bar widths. This is sometimes 4x the unit width for a bar. The remaining bars and gaps between them will be some multiple of that unit width (e.g., 1x, 2x, or 3x). 
Each sequence of bars and gaps relates to a unique number (or letter or other symbol) that is specified in advance by the standard for that kind of bar code.</p><p><img title="Typical handheld bar code scanner" src="http://jimcarnicelli.com/ai/blog/attachments/00000052_HandScanner.jpg" align="right" />A bar code scanner, like the handheld version pictured at right, doesn't actually care that the code is 2D, as you see it. To the scanner, the input is a stream of alternating light and dark signals, typically furnished by a laser signal bouncing off white paper or being absorbed by black ink (or reflecting / not reflecting off an aluminum can, etc.). If you're a programmer or PhotoShop guru, you could visualize this as starting with a digital snapshot of a bar code and cropping away all but a single pixel line of the image that cuts across the bar code, then applying a threshold to convert it into a black and white image devoid of color and even shades of gray.</p><p>The size of the bar code doesn't much matter, either. Within a certain, wide range, a bar code scanner will take any string of solid black as a potential start of a bar code, whether it's small or large and whether it's off to the left or the right of the center of the scanner's view.</p><p>What the scanner is doing with this stream of information is looking for the beginning and ending of a black section and using that first sample as a cue to look for the rest of the start code (or stop code; the bar code could be upside down) following it. If it finds that pattern, it continues looking for the patterns that follow, translating them into the appropriate digits, letters, or symbols, until it reaches the stop code.</p><p>Now, bar codes are often damaged. And they often appear in a noisy background of information. In fact, the inventors of bar code standards are very aware that a random pattern on a printed page could be misinterpreted as a bar code. They dealt with this by adding in several checks. 
For instance, one or more of the digits in a bar code are reserved as a "check code", the output of a mathematical function applied to the other data. The scanner applies the same function. If the output doesn't match the check code that was read in, the candidate bar code scan is rejected as corrupt. Even the digit representations themselves contain only a small subset of all possible bar/gap combinations in order to reduce the chances that an errant spot or other invalid information could be misconstrued as a valid bar code. In fact, the odds that a bar code scanner could misread a bar code like the one above are so infinitesimally small that engineers and clerks can place nearly 100% confidence in their bar codes. A bar code either does or does not scan. There's no "kinda".</p><h2>Seeing things</h2><p>Bar codes have been engineered so well that it's possible to leave a scanner turned on 24/7, scanning out over a wide area, seeing all sorts of noise continuously, and be nearly 100% guaranteed that when it thinks it sees a bar code in the environment, it is correct. Some warehouses feature stationary bar code scanners that scan large boxes as they are moved along by fork lifts, for instance.</p><p>What does this have to do with machine vision? Isn't it amazing that a bar code scanner can deal with an incredibly noisy environment and still have a nearly 100% accuracy when it finds a bar code? This is very much like how you can pick out a human face in a busy picture with nearly 100% accuracy. There are all sorts of things that may ping your face recognition capacity, but when your focus is brought to bear on them, your skill at filtering out noise and correctly identifying the real faces is incredible, just like the bar code scanner. What's more, it doesn't matter where in your visual field the face is and how near or far it is, within a reasonable range. 
Just like the scanner.</p><p>Vision researchers are still hard pressed to provide an accounting of how we perceive the world visually. Machine vision researchers have been doing all sorts of neat things for decades, but we're still barely scratching the surface, here, for lack of a comprehensive theory of perception. Yet engineers creating bar codes decades ago actually solved this problem in a narrow case.</p><p>A good bar code scanner has an elegant solution to the problems of noise, scale invariance (zoom & offset), bounds detection (via start and stop codes). They even made it so a single bar code could represent one of billions of unique messages, not just be a simple there/not-there marker.</p><h2>The bigger picture</h2><p>Of course, I don't want to suggest that bar code scanners hold the key to solving the basic problem of perception. You probably have already guessed that the secret to bar codes is that they follow well engineered standards that make it almost easy to pick bar codes out of a noisy environment. Vision researchers have likewise made many systems that are quite capable of picking out human faces, as well as a variety of special classes of clearly definable objects.</p><p>It's pretty much accepted wisdom in human brain research now that much of what we see in the world is what we are looking to find. A bar code scanner works because it knows what to look for. 
Obviously, one key difference between your perceptual faculty and a bar code scanner is that the scanner is "born" with all the knowledge it needs, while you have to learn how faces, chairs, and cars "work" for yourself.</p><p>Still, for people wondering how to approach the question of perception, bar coding is not a bad analogy to start with.</p>Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-26228729315226509662007-10-21T00:00:00.000-07:002016-10-04T11:32:53.926-07:00Perception as construction of stable interpretations<p>I've been spending a lot of time lately thinking about the nature of perception. As I've said before, I believe AI has gotten stuck at the two coastlines of intelligence: the knee-jerk-reaction of the sensory level and the castles-in-the-sky of the conceptual level. We've been missing the huge interior of the perceptual level of intelligence. It's not that programmers are ignoring the problem. They just don't have much in the way of a theoretical framework to work with, yet. People don't really know yet how humans perceive, so it's hard to say how a machine could be made to perceive in a way familiar to humans.</p><a name="Example_of_a_stable_interpretation"><h2>Example of a stable interpretation</h2></a><p>I've been focused very much on the principle of "stable interpretation" as a fundamental component of perception. To illustrate what I mean by "stable", consider the following short video clip:</p>
<center><p>
<object id="MediaPlayer" width="192" height="190" type="video/x-ms-asf">
<param name="FileName" value="http://jimcarnicelli.com/ai/blog/attachments/00000051_Stable.wmv">
<param name="autostart" value="false">
<param name="ShowControls" value="true">
<param name="ShowStatusBar" value="false">
<param name="ShowDisplay" value="false">
<embed type="application/x-mplayer2" src="http://jimcarnicelli.com/ai/blog/attachments/00000051_Stable.wmv" ShowControls="1" ShowStatusBar="0" ShowDisplay="0" autostart="0" />
</object>
<br /><a href="http://jimcarnicelli.com/ai/blog/attachments/00000051_Stable.wmv">Click here to open this WMV file</a>
</p></center>
<p>This is taken from a larger video I've used in other vision experiments. In this case, I've already applied a program that "stabilizes" a source video by tracking some central point as it moves from frame to frame and clipping out the margins. Even so, you can still see motion. The camera is tilting. The foreground is sliding from right to left. And there is a noticeable flicker of pixels because the source video is of low resolution. On the other hand, you have no trouble at all perceiving each frame as part of a continuous scene. You don't see frames, really. You just see a rocky shore and sky in apparent motion as the camera moves along. That's what perception in a machine should be like, too.</p><p>The problem is that the interpretation of a static scene in which only the camera moves does not arise directly from the source data. If you were to simply watch a single pixel in this video as the frames progress, you'd see that even it changes, literally. Also, individual rocks do move relative to the frame and to each other. Yet you easily sense that there's a rigid arrangement of rocks. How?</p><p>One way of forming a stable view is a technique I've dabbled in for a long time: patch matching. Here, I took a source video and put a smaller "patch" in it that's the size of the video frames you see here. With each passing frame, my code compares different places to move the patch frame to in hopes of finding the best matching candidate patch. As you can see, it works pretty well. But this is a very brittle algorithm. Were I to include subsequent frames, where a man runs through the scene, you would see that the patch "runs away" from the man because his motion breaks up the "sameness" from frame to frame. My interpretation is that the simple patch comparisons I use are insufficient; that this cheap trick is, at best, a small component in a larger toolset needed for constructing stable interpretations. 
A more robust system would be able to stay locked on the stable backdrop as the man runs through the scene, for instance.</p><a name="What_is_a_stable_interpretation"><h2>What is a stable interpretation?</h2></a><p>What makes an interpretation of information "stable"? The video example above is riddled with noise. One fundamental thing perception does is filter out noise. If, for example, I painted a red pixel in one frame of the video, you might notice it, but you would quickly conclude that it is random noise and ignore it. If I painted another red pixel in several more frames, you would no longer consider it noise, but some artifact with a significant cause. Seeing the same information repeated is the essence of non-randomness.</p><p>"Stability", in the context of perception, can be defined as "the coincidental repetition of information that suggests a persistent cause for that information."</p><p>My <a href="http://jvcai.blogspot.com/2007/04/abstraction-in-neuron-banks.html">Pattern Sniffer</a> program and blog entry illustrate one algorithm for learning that is based almost entirely on this definition of stability. The program is confronted with a series of patterns. Over time, individual neurons come to learn to strongly recognize the patterns. Even when I introduced random noise distorting the images, it still worked very well at learning "idealized" versions of the distorted patterns that do not reflect the noise. Shown a given image once, a "free" neuron might instantly learn it, but without repetition over time, it would quickly forget the pattern. My sense is that Pattern Sniffer's neuron bank algorithm is very reusable in many contexts of perception, but it's obviously not a complete vision system, per se.</p><a name="What_is_repetition"><h2>What is repetition?</h2></a><p>When I speak of repetition, in the context of Pattern Sniffer, it's obvious that I mean showing the neuron bank a given pattern many times. 
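</p><p>That learn-through-repetition dynamic can be caricatured in a few lines. What follows is a toy sketch, not the actual Pattern Sniffer code: a neuron's confidence in a pattern grows with each repeated exposure and decays when other patterns show up, so a pattern seen only once is soon forgotten.</p>

```python
class ToyNeuron:
    """Toy model of repetition-reinforced learning (illustrative only,
    not the actual Pattern Sniffer algorithm)."""

    def __init__(self, gain=0.3, decay=0.9):
        self.pattern = None    # the pattern this neuron currently knows
        self.strength = 0.0    # confidence, reinforced by repetition
        self.gain, self.decay = gain, decay

    def expose(self, pattern):
        if self.pattern == pattern:
            # Repetition reinforces the memory toward long-term retention.
            self.strength = min(1.0, self.strength + self.gain)
        elif self.strength < 0.1:
            # A "free" (weakly committed) neuron instantly adopts the pattern.
            self.pattern, self.strength = pattern, self.gain
        else:
            # Exposure to other patterns, without repetition, causes forgetting.
            self.strength *= self.decay
```

<p>Repetition is the only thing that separates a remembered pattern from noise here, which is the essence of the stability definition above.</p><p>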
But that's not the only form of repetition that matters to perception. Consider the following pie chart image:</p><p></p><center><img title="Sample pie chart" src="http://jimcarnicelli.com/ai/blog/attachments/00000051_PieChart.gif" /></center><p></p><p>When you look at the "Comedy" (27%) wedge, you see that it is solid orange. You instantly perceive it as a continuous thing, separable from the rest of the image. Why? Because the orange color is repeated across many pixels. Here's a more interesting example image of a wall of bricks:</p><p></p><center><img title="Brick wall" src="http://jimcarnicelli.com/ai/blog/attachments/00000051_Bricks.jpg" /></center><p></p><p>Your visual perception instantly grasps that the bricks are all the "same". Not literally, if you consider each pixel in each brick, but in a deep sense, you see them as all the same. The brick motif repeats itself in a regularized pattern.</p><p>When your two eyes are working properly, they will tend to fixate on the same thing. Your vision is thus recognizing that what your left eye sees is repeated also in your right eye, approximately.</p><p>In each of these cases, one can apply the patch comparison approach to searching for repeated patterns. This is just in the realm of vision and only considers 2D patches of source images. But the same principle can be applied to any form of input. A "patch" can be a 1D pattern in linear data, just the same. Or it could encompass a 4D space of taste components (sweet, salty, sour, bitter). The concept is the same, though. A "patch" of localized input elements (e.g., pixels) is compared to another patch in a different part of the input for repetition, whether it's repeated somewhere else in time or in another part of the input space.</p><a name="Repetition_as_structure"><h2>Repetition as structure</h2></a><p>We've seen that we can use coincidental repetitions of patterns as a way to separate "interesting" information from random noise. But we can do more with it. 
We can use pattern repetition as a way to discover structure in information.</p><p>Consider edges. Long ago, vision researchers discovered that our own visual systems can detect sharp contrasts in what we see and thus highlight them as edges. Implementing this in a computer turns out to be quite easy, as the following example illustrates:</p><p></p><center><img title="Typical contrast-based edge detection" src="http://jimcarnicelli.com/ai/blog/attachments/00000051_EdgeDetection.jpg" /></center><p></p><p>It's tempting to think it is easy, then, to trace around these sharply contrasting regions to find whole textures or even whole objects. The problem is that in most natural scenarios, it doesn't work. Edges are interrupted because of low-contrast areas, as with the left-hand player's knee. Other non-edge textures like the grass are high enough contrast to appear as edges in this sort of algorithm. True, people have made algorithms to reduce noise like this using crafty means, but the bottom line is that this approach is not sufficient for detecting edges in a general case.</p><p>The clincher comes when an edge is demarked by a very soft, low-contrast transition or even a rough edge. Consider the following example of a gravel road, with its fuzzy edge:</p><p></p><center><img title="Gravel road with contrast-based edge detection algorithm applied" src="http://jimcarnicelli.com/ai/blog/attachments/00000051_GravelRoadEdges.jpg" /></center><p></p><p>As you can see, it's hard to find a high contrast edge to the road using a typical, pixel contrast algorithm. There's higher contrast to be found in the brush beyond the road's edge, in fact. But what if one started with a patch along the edge of the road (as we perceive it) and searched for similar patches? Some of the best matches would likely be along that same edge. As such, this soft and messy edge should be much more easily found. 
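</p><p>A crude sketch of that seed-patch search, in Python (a grayscale image as a 2D list of intensities; the names and sizes are illustrative, and this is a hypothesis rather than a finished algorithm):</p>

```python
def patch_diff(img, ay, ax, by, bx, size):
    """Sum of absolute differences between two size-by-size patches
    of a grayscale image (a 2D list of intensity values)."""
    return sum(abs(img[ay + r][ax + c] - img[by + r][bx + c])
               for r in range(size) for c in range(size))

def best_matches(img, seed, size, n=3):
    """Rank every other patch position by similarity to the seed patch;
    along a soft edge, the top matches tend to lie on that same edge."""
    sy, sx = seed
    h, w = len(img), len(img[0])
    candidates = [(patch_diff(img, sy, sx, y, x, size), (y, x))
                  for y in range(h - size + 1)
                  for x in range(w - size + 1)
                  if (y, x) != seed]
    return [pos for _, pos in sorted(candidates)[:n]]
```

<p>On a test image with a soft horizontal band, the top matches for a seed patch straddling the band line up along that same band, which is the sense in which similar patches can trace out an edge that pixel contrast alone misses.</p><p>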
The following mockup illustrates this concept:</p><p></p><center><img title="Using patch comparisons to define an edge" src="http://jimcarnicelli.com/ai/blog/attachments/00000051_GravelRoadPatchEdges.jpg" /></center><p></p><p>Beyond better discovery of "fuzzy" edges like this, patch matching can be used to discover 2D "regions" within an image. The surface of the road above, or of the brush along its side, might be found more reliably this way than with the more common color flood-fill technique.</p><p>I've explored these ideas a bit in my research, but I want to make clear that I haven't yet come up with the right kinds of algorithms to make these practical tools of perception.</p><a name="Pattern_funneling"><h2>Pattern funneling</h2></a><p>One problem that plagues me with machine vision research is that mechanisms like my Pattern Sniffer's neuron banks work great for learning to recognize things only when those things are perfectly placed within their soda-straw windows on the world. With Pattern Sniffer, the patterns are always lined up properly in a tiny array of pixels. It's not like it goes searching a large image for those known patterns, like a "Where's Waldo" search. For that kind of neuron bank to work well in a more general application, it's important for some other mechanism to "funnel" interesting information to the neuron bank that gains expertise in recognizing patterns.</p><p>Take textures, for instance. One algorithm could seek out textures by simply looking for localized repetition of a patch. A patch of grass could be a candidate, and other patch matches around that patch would help confirm that the first patch considered is not just a noisy fluke.</p><p></p><center><img title="Multiple local patches confirm a texture exists" src="http://jimcarnicelli.com/ai/blog/attachments/00000051_GrassTexture.jpg" /></center><p></p><p>That patch, then, could be run through a neuron bank that knows lots of different textures. 
If it found a strong match, it would say so. If not, a neuron in the bank that isn't yet an expert in some texture would temporarily learn the pattern. Subsequent repetition would help reinforce it for ever longer terms. This is what I mean by "funneling", in this case: distilling an entire texture down to a single, representative patch that is "standardized" for use by a simpler pattern-learning algorithm.</p><a name="Assemblies_of_patterns"><h2>Assemblies of patterns</h2></a><p>In principle, it should be possible to detect patterns composed of non-random coincidences of known patterns, too. Consider the above example of an image of grass and sky, along with some other stuff. Once it is established, using pattern funneling to a learned neuron bank, that the grass and sky textures were found in the image, these facts can be used as another pattern of input to another processor. Let's say we have a neuron bank that has, as inputs, the various known textures. After processing any single image, we have an indication of whether or not a given known texture is seen in that image, as indicated in the following diagram:</p><p></p><center><img title="Inputs to a neuron bank from a lower level stage" src="http://jimcarnicelli.com/ai/blog/attachments/00000051_NeuronBankInputs.jpg" /></center><p></p><p>Shown lots of images, including several different images of grassy fields with blue skies, this neuron bank should come to recognize this repeated pattern of grass + sky as a pattern of its own. We could term this an "assembly of patterns".</p><p>In a similar way, a different neuron bank could be set up with inputs that consider a time sequence of recognized patterns. It could be musical notes, for example, with each musical note being one dimension of input, and the last, say, 10 notes being another dimension of input. 
As such, this neuron bank could learn to recognize and separate simple melodies from random notes.</p><a name="The_goal_perception"><h2>The goal: perception</h2></a><p>The goal, as stated above, is to make a machine able to perceive objects, events, and attributes, as humans do, at a level more sophisticated than the trivial sensory level many robots and AI programs deal with today. My sense is that the kinds of abstractions described above take me a little closer to that goal. But there's a lot more ground to cover.</p><p>For one thing, I really should try coding algorithms like the ones I've hypothesized about here.</p><p>One of the big limitations I can still see in this patch-centric approach to pattern recognition is the age-old problem of pattern invariance. I may make an algorithm that can recognize a pea on a white table at one scale, but as soon as I zoom in the camera a little, the pea looks much bigger and is no longer easily recognizable using a single-patch match against the previously known pea archetype. Perhaps some sort of pattern funneling could be made that deals specifically with scaling images to a standardized size and orientation before recognizing/learning algorithms get involved. Perhaps salience concepts, which seek out points of interest in a busy source image, could be used to help in pattern funneling, too.</p><p>Still, I think there's merit in vigorously pursuing this overarching notion of stable interpretations as a primary mechanism of perception.</p>Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-64157198989139328332007-10-14T00:00:00.000-07:002016-10-04T11:34:53.927-07:00Rebuttal of the Chinese Room Argument<p>While discussing the subject of Artificial Intelligence in another forum, someone brought up the old "Chinese Room" argument against the possibility of AI. 
My wife suggested I post my response to the point, as it seems a good rebuttal of the argument itself.</p><p>If you're unfamiliar with the CR argument, there's a great <a href="http://plato.stanford.edu/entries/chinese-room/" style="text-decoration: underline; ">entry in the Stanford Encyclopedia of Philosophy</a>. It summarizes as follows:</p><img src="http://jimcarnicelli.com/ai/blog/attachments/00000050_ChineseRoom.jpg" align="right" /><p></p><ul>The argument centers on a thought experiment in which someone who knows only English sits alone in a room following English instructions for manipulating strings of Chinese characters, such that to those outside the room it appears as if someone in the room understands Chinese. The argument is intended to show that while suitably programmed computers may appear to converse in natural language, they are not capable of understanding language, even in principle. Searle argues that the thought experiment underscores the fact that computers merely use syntactic rules to manipulate symbol strings, but have no understanding of meaning or semantics. Searle's argument is a direct challenge to proponents of Artificial Intelligence, and the argument also has broad implications for functionalist and computational theories of meaning and of mind. As a result, there have been many critical replies to the argument.</ul><p></p><p>To my thinking, this is a basically flawed argument from the start. What if the instructions were given in English by another, Chinese-speaking (yes, I know "Chinese" is not a language) person? Really, the human following the processing rules is just a conduit for those processing rules. He might as well be a mail courier with no inkling what's in the envelope he's delivering. It doesn't mean the person who sent the mail is not intelligent. The CR argument says absolutely nothing about the nature of the data processing rules. 
It dismisses the possibility that those rules could constitute an intelligent program without consideration.</p><p>I think the CR argument holds some sway with people because they've seen the famous <a href="http://en.wikipedia.org/wiki/ELIZA" style="text-decoration: underline; ">Eliza</a> program from 1966 and tons of other chatbots based on it. Most of them take a sentence you type and respond to it either by reformulating it (e.g., replying to "I like chocolate" with "why do you like chocolate?") using predefined rules or by looking up random responses to certain keywords (e.g., responding to a search on "chocolate" in "I like chocolate" with "Willy Wonka and the Chocolate Factory grossed $475 million in box office receipts.")</p><p>Anyone who has interacted with a chatbot like this recognizes that it's easy to be fooled, at first, by this sort of trickery. The problem with the Chinese Room argument is that it posits that this is all a computer can do, without providing any real proof. In fact, the human mind is the product of the human nervous system and, really, the whole body. But that body is a machine. It's constructed of material parts that all obey physical laws. A computer is no different in this sense. What separates a cheap computer trick like Eliza from a human mind is how their systems are structured.</p><p>I take it as obvious, these days, that it's possible to make a machine that can reason and act "intelligent" like we do, generally. And I've never seen the CR argument as having any real bearing on the possibility of intelligent machines. 
It only provides a cautionary note about the difference between faking intelligence and actually being intelligent.</p>Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-62083856720389656662007-10-07T00:00:00.000-07:002016-10-04T11:41:48.441-07:00Video stabilizer<p>I haven't had much chance to do coding for my AI research of late. My most recent experiment dealt more with patch matching in video streams. Here's a source video, taken from a hot air balloon, with a run of what I'll call a "video stabilizer" applied:</p>
<center><p><table><tbody><tr>
<td align="center">
<object id="MediaPlayer" width="320" height="280" type="video/x-ms-asf">
<param name="FileName" value="http://jimcarnicelli.com/ai/blog/attachments/00000049_Overall.wmv">
<param name="autostart" value="false">
<param name="ShowControls" value="true">
<param name="ShowStatusBar" value="false">
<param name="ShowDisplay" value="false">
<embed type="application/x-mplayer2" src="http://jimcarnicelli.com/ai/blog/attachments/00000049_Overall.wmv"
width="320" height="280" ShowControls="1" ShowStatusBar="0" ShowDisplay="0" autostart="0" />
</object>
<br />Full video with "follower" frame.
<br /><a href="http://jimcarnicelli.com/ai/blog/attachments/00000049_Overall.wmv"><span style="font-size:smaller;">Click here to open this WMV file</span></a></td>
<td> </td>
<td align="center">
<object id="MediaPlayer" width="320" height="280" type="video/x-ms-asf">
<param name="FileName" value="http://jimcarnicelli.com/ai/blog/attachments/00000049_Stabilized.wmv">
<param name="autostart" value="false">
<param name="ShowControls" value="true">
<param name="ShowStatusBar" value="false">
<param name="ShowDisplay" value="false">
<embed type="application/x-mplayer2" src="http://jimcarnicelli.com/ai/blog/attachments/00000049_Stabilized.wmv"
width="320" height="280" ShowControls="1" ShowStatusBar="0" ShowDisplay="0" autostart="0" />
</object>
<br />Contents of the follower frame.
<br /><a href="http://jimcarnicelli.com/ai/blog/attachments/00000049_Stabilized.wmv"><span style="font-size:smaller;">Click here to open this WMV file</span></a></td>
</tr></tbody></table></p></center>
<p>The colored "follower" frame in the left video does its best to lock onto the subject it first sees when it appears. As the follower moves off center, a new frame is created in the center to take over. The right video is of the contents of the colored frame. (If the two videos appear out of sync, try refreshing this page once the videos are totally loaded.)</p>
<p>This algorithm does a surprisingly good job of tracking the ambient movement in this particular video. That was the point, though. I wondered how well a visual system could learn to identify stable patterns in a video if the video was not stable in the first place. I reasoned that an algorithm like this could help a machine vision system to make the world a little more stable for second level processing of source video.</p>
<p>The algorithm for this feat is unbelievably simple. I have a code class representing a single "follower" object. A follower has a center point, relative to the source video, and a width and height. We'll call this a "patch" of the video frame. With each passing frame, it does a pixel-level comparison of what's inside the current patch against the contents of the next video frame, in search of a good match.</p>
<p>For each patch considered in the next frame, a difference calculation is performed, which is very simple. For each pixel in the two corresponding patches (current-frame and next-frame) under consideration, the differences in the red, green, and blue values are added to a running difference total. The candidate patch that has the lowest total difference is considered the best match and is thus where the follower goes in this next frame. Here's the code for comparing the current patch against a candidate patch in the next frame:</p>
<pre>
private int CompareRegions(int OffsetX, int OffsetY) {
    int X, Y, Diff;
    Color A, B;

    const int ScanSpacing = 10;

    Diff = 0;

    for (Y = CenterY - RadiusY; Y <= CenterY + RadiusY; Y += ScanSpacing) {
        for (X = CenterX - RadiusX; X <= CenterX + RadiusX; X += ScanSpacing) {
            A = GetPixel(CurrentBmp, X, Y);
            B = GetPixel(NextBmp, X + OffsetX, Y + OffsetY);
            Diff +=
                Math.Abs(A.R - B.R) +
                Math.Abs(A.G - B.G) +
                Math.Abs(A.B - B.B);
        }
    }

    return Diff;
}
</pre>
<p>Assuming the above gibberish makes any sense, you may notice "Y += ScanSpacing" and the same for X. That's an optimization. In fact, the program does include a number of performance optimizations that help make the run-time on these processes more bearable. First, a follower doesn't consider all possible patches in the next frame to decide where to move. It only considers patches within a certain radius of the current location. OffsetX, for example, may only be +/- 50 pixels, which means that if the subject matter in the video slides horizontally more than 50 pixels between frames, the algorithm won't work right. Still, this can increase frame processing rates 10-fold, with smaller search radii yielding shorter run-times.</p><p>As for "Y += ScanSpacing", that was a shot in the dark for me. I was finding that frame processing was still taking a very long time. So I figured, why not skip every Nth pixel in the patches during the patch comparison operation? I was surprised to find that even with a ScanSpacing of 10 (with a patch at least 60 pixels wide or tall), the follower didn't lose much of its ability to track the subject matter. Not surprisingly, the higher the scan spacing, the lower the fidelity, but the faster the processing. Doubling ScanSpacing means a 4-fold increase in the frame processing rate.</p>
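<p>For completeness, the outer search that calls a comparison like CompareRegions can be sketched as follows. This is a Python paraphrase of the idea, not the program's actual code; the search radius and step are the illustrative knobs discussed above:</p>

```python
def best_offset(diff_fn, radius=50, step=2):
    """Try every candidate offset within +/- radius of the follower's
    current position and return the (dx, dy) with the lowest difference.
    diff_fn(dx, dy) plays the role of CompareRegions."""
    best_diff, best_dx, best_dy = None, 0, 0
    for dy in range(-radius, radius + 1, step):
        for dx in range(-radius, radius + 1, step):
            d = diff_fn(dx, dy)
            if best_diff is None or d < best_diff:
                best_diff, best_dx, best_dy = d, dx, dy
    return best_dx, best_dy
```

<p>Shrinking the radius or coarsening the step trades tracking robustness for speed, which is exactly the trade-off the scan-spacing experiment above explores.</p>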
<p>I am inclined to think the process demonstrated in the above video is analogous to what our own eyes do. In any busy motion scene, I think your eyes engage in a very muscle-intensive process of locking in, moment by moment, on stable points of interest. In this case, the follower's fixation is chosen at random, essentially. Whatever is in the center becomes the fixation point. Still, the result is that our eyes can see the video, frame by frame, as part of a continuous, stable world. By fixating on some point while the view is in motion, whether on a television or looking out a car window, we get that more stable view.</p>
<p>Finally, one thought that kinda drives this research, but is really secondary to it, is that this could be a practical algorithm for video stabilization. In fact, I suspect the makers of video cameras are using it in their digital stabilization. It would be interesting to see someone create a freeware product or plug-in for video editing software because the value seems pretty obvious.</p>Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-44929264452296247402007-09-27T00:00:00.000-07:002016-10-04T11:42:37.906-07:00"Conscious Realism" and "Multimodal User Interface" theories<p>I recently sent an email to <a href="http://www.cogsci.uci.edu/~ddhoff/">Donald Hoffman</a>, professor at the University of California, Irvine, with kudos for his book, <i><a href="http://jvcai.blogspot.com/2005/05/review-of-visual-intelligence.html">Visual Intelligence</a></i>, which has had a profound impact on my thinking about perception. Understandably, he's very busy kicking off the new school year, so I was grateful that he sent at least a brief response and a reference to his latest published paper, titled <i><a href="http://www.cogsci.uci.edu/~ddhoff/ConsciousRealism2.pdf">Conscious Realism and the Mind-Body Problem</a></i>. Naturally, I was eager to read it.</p><p>Much of the study of how human consciousness arises stems from the assumption that consciousness is a product of physical processes; specifically, of physical processes in the brain. This paper starts from the opposite assumption: that "consciousness creates brain activity, and indeed creates all objects and properties of the physical world." When I read this in the abstract, I must have largely ignored its significance. 
Having read <i><a href="http://jvcai.blogspot.com/2005/05/review-of-visual-intelligence.html">Visual Intelligence</a></i>, I'm familiar with Hoffman's focus on how our minds construct the things we perceive, so I took this summary as a shorthand for this concept of construction of the contents of consciousness. It becomes apparent that this claim is far more literal than I had assumed.</p><p>Hoffman begins by explaining the ubiquity of a central assumption as follows. "A goal of perception is to match or approximate true properties of an objective physical environment. We call this the <i>hypothesis of faithful depiction</i> (HFD)." After giving lots of examples of this assumption and reasons why it's taken for granted, Hoffman declares his rejection of it:</p><p></p><ul>I now think HFD is false. Our perceptual systems do not try to approximate properties of an objective physical world. Moreover evolutionary considerations, properly understood, do not support HFD, but require its rejection.</ul><p></p><p>Now, I'll state here that most of Hoffman's claims in this paper appear logically valid and, on the face of it, uncontested. But I would have to say that this one probably isn't logically supported: that evolutionary considerations require the rejection of HFD. By and large, however, this paper claims that it is not necessary to assume there is an objective physical world in order to study and understand consciousness, which seems acceptable.</p><p>The term "objective physical world" deserves some explanation. It identifies the view that there is a single reality that exists without regard to observers. If there is an apple on the table before two people, the apple really is there, whether either of them perceives it. Naturally, one would imagine that if one of them can see the apple, the other one probably can (barring obstructions), because both of them have access to information (e.g., light) reflected off the apple and into both their eyes. 
They may see different sides of the apple, but the apple is definitively there.</p><p>To be sure, one should not dismiss Hoffman as a fringe nut that claims there is no reality, per se; only people and their subjective consciousnesses. He doesn't in this paper. In fact, it's clear he does appear to accept the assumption that there really is an objective reality, but that we don't have direct "access" to it. A classic example of this distinction is a detailed treatment of the table not as a solid object with straight edged surfaces, but as a collection of atoms and, mostly, empty space and, as such, rough, continuously changing surfaces. In this sense, there really isn't a table; that's just a percept (or concept) we use to refer to the collection of atoms.</p><p>To help illustrate the distinction between what one perceives and the subject matter of perception, Hoffman introduces the analogy of deleting a computer file by "dragging" a file icon and "dropping" it onto a trash can icon. This action is intuitive and designed specifically as an analogy of the actual file delete operation, but it actually bears no resemblance to what actually goes on under the surface. In fact, even the icon is not equivalent to the file; it's merely a percept specifically designed to represent it to the end user. By analogy, Hoffman refers to the table or the apple as merely "icons" we create in our minds to represent what most people would reflexively call "real objects". In fact, to the person who says, "no, the apple is just a bunch of atoms," Hoffman would in turn say, "the atoms are themselves icons we create."</p><p>Hoffman introduces the term "multimodal user interface", or "MUI", to summarize what consciousness is. In contrast to the view that perception is all about constructing a mental model of reality that closely resembles reality, Hoffman claims perception is about constructing practical models that "get the job done". 
And just as computer designers might construct icon-based interfaces to help make it easier for humans to understand and practically manage information, our own minds actually set out to construct "practical" percepts in order to help us simplify what we do. But the mental models, Hoffman claims, need not bear any resemblance to what is being modeled.</p><p>To be sure, Hoffman may say the percepts -- mental models -- a conscious entity holds bear no resemblance to their referents, but he doesn't claim that there is no correlation to them. Hoffman says that user interfaces, including our own consciousnesses, by design have the following characteristics:</p><ul><li>Friendly formatting</li><li>Concealed causality</li><li>Clued conduct</li><li>Ostensible objectivity</li></ul><p>That is, a user interface's "purpose" is to distill immensely complex behaviors down to practical "icons" of objects and behaviors that stand for that underlying complexity, but don't literally mirror it. Take the file-delete example. The icon on the desktop is a sufficient stand-in for a file, even though the file, a pattern of magnetic fields on a metal platter, bears no resemblance to the icon. It's a "friendly format", in this sense. Further, the action of dragging and dropping it onto a trash can icon to "delete it" has its own causal chain, which conceals the true, deeply complex causal chain that actually happens to effect the file delete operation. Yet the drag-n-drop operation and the trash can icon give an intuitive clue of what will happen if something is dropped onto it. 
Finally, this drag-n-drop-to-delete operation is designed to consistently do the same thing every time, thereby engendering in the user an ostensible sense that there is an objective operation going on that will always happen, even though a moment's reflection tells us that a failure in the underlying software or hardware could cause something else to happen when one drops a file icon on the trash icon.</p><p>So far, I can see that there's a practical use for this notion to people trying to understand human perception or to engender consciousness in machines. For one, the claim is that percepts do not have to bear much resemblance to their referents in the "real world". They just have to have practical utility. An icon in a user interface just needs to be useful enough for the user to be aware of a file's existence and to do some basic stuff with it. Similarly, the mental percept an antelope has of a lion in the distance only needs to be useful enough to stay alive to be useful. It doesn't need to be a highly detailed representation of the lion beyond that basic utility. It also alludes to the view that a high fidelity representation in a computer of the "real world" doesn't make the machine that has it any more aware of what is represented. For instance, just because a self-driving car has a 3D map of the terrain out in front doesn't mean it can "see" where the road is. It's still necessary to create a practical model of how the world works that uses this 3D representation as source data, like an algorithm that seeks basically level ground, defined by a threshold of variation that separates level from non-level ground. If this were the message of the paper, I would say it adds genuine value: a set of concepts and terms to use to help steer people away from fallacious assumptions about how consciousness works and to suggest paths for further study.</p><p>But this isn't where the paper ends. It's more where it starts. 
In fact, this paper is less about explaining how consciousness works than about how reality works; it's metaphysics instead of epistemology. As stated earlier, it starts with the assumption that consciousness exists and that the subject of consciousness is optional. To avoid sounding like a total subjectivist, Hoffman states that:</p><p></p><ul>If your MUI functions properly, you should take its icons <i>seriously</i>, but not <i>literally</i>. The point of the icons is to inform your behavior in your niche. Creatures that don't take their well-adapted icons seriously have a pathetic habit of going extinct.</ul><p></p><p>If Hoffman accepts the idea that there is a physical, objective reality, what is it composed of? "Conscious Realism asserts the following: <i>The objective world, i.e., the world whose existence does not depend on the perceptions of a particular observer, consists entirely of conscious agents.</i>" Honestly, I would love to say that this claim is explained, but it really isn't. Hoffman claims that humans are not the only conscious agents, but doesn't say that tables, apples, and such are conscious, per se. "According to conscious realism, when I see a table, I interact with a system, or systems, of conscious agents," which really does seem to suggest that the table is conscious, but not clearly.</p><p></p><ul>Conscious realism is not panpsychism nor entails panpsychism. Panpsychism claims that all objects, from tables and chairs to the sun and moon, are themselves conscious (Hartshorne, 1937/1968; Whitehead, 1929/1979), or that many objects, such as trees and atoms, but perhaps not tables and chairs, are conscious (Griffin, 1998). Conscious realism, together with MUI theory, claims that tables and chairs are icons in the MUIs of conscious agents, and thus that they are conscious experiences of those agents. 
It does not claim, nor entail, that tables and chairs are conscious or conscious agents.</ul><p></p><p>This is one of the problems I have with this paper, though. Although Hoffman rejects the notion of inanimate objects as conscious in a trippy, <i>Disney</i> cartoon sense, he doesn't really elaborate on what he does mean. Moreover, if a table is labeled as conscious in order to stick a placeholder for a physical object in the objective world, what value does this add over the simpler, more intuitive conception of the table as being a physical object? It almost seems as though, in order to come up with a rigorous, clean-cut, math-friendly theory of how consciousness constructs perceptions of the world, Hoffman throws the baby out with the bathwater by claiming that even though there is an objective world, it is not composed of actual objects.</p><p>I think if Hoffman were inclined to speak of "conscious realism" and "multimodal user interfaces" as tools and techniques for studying consciousness and guides to creating it, this could be a practical concept. He could say that our perceptions of reality really do reflect, if simplistically, abstractly, and practically, an actual, objective reality. By taking pains to say there isn't really one -- or that it is entirely disconnected from our ability to perceive it -- this paper seems to do something of a disservice to science:</p><p></p><ul>We want the same [approach] for all branches of science. For instance we want, where possible, to exhibit current laws of physics as projections of more general laws or dynamics of conscious agents. 
Some current laws of physics, or of other sciences, might be superseded or discarded as the science of conscious realism advances, but those that survive should be exhibited as limiting cases or projections of the more complete laws governing conscious agents and their MUIs.</ul><p></p><p>While I can see that it is possible, perhaps, to express other branches of science in the terminology of MUIs, I don't see how it would advance our understanding of their subject matter. Gravity was well understood by Newton, yet expressing it in terms of the theory of General Relativity makes it possible to do more with the subject matter than was possible in the purely Newtonian framework. What new insights will the physicist have as a result of expressing gravity in terms of multimodal user interfaces and with reference to heavenly bodies as conscious entities? If anything, it sounds more like this extra layer would only add to the confusion people have in trying to understand already complex concepts and could even potentially take away certain practical conceptual tools. So I don't see the point.</p><p>All that said, the MUI concept does seem to add value to my own way of thinking of perception. The four functions of a good user interface listed above (friendly formatting, concealed causality, clued conduct, ostensible objectivity) seem to shout out how scientists trying to engender perception in machines should frame their goals and concepts. But the rest of Hoffman's paper, which dabbles in the philosophy of what reality is, seems to have little use for AI research.</p>Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-58002051894301392072007-07-04T00:00:00.000-07:002016-10-04T11:43:40.363-07:00Plan for video patch analysis study<p>I've done a lot of thinking about this idea of making a program that can characterize the motions of all parts of a video scene. 
Not surprisingly, I've concluded it's going to be a hard problem. But unlike other cases where I've smacked up against a brick wall, I can see what seems a clear path from here to there. It's just going to take a long time and a lot of steps. Here's an overview of my plan.</p><p>First, the goal. The most basic purpose is to, as I said above, make a program that can characterize the motions of all parts of a video scene. The program should be able to fill an entire scene with "patches". Each patch will lock onto the content found in that frame and follow it throughout the video or until it can no longer be tracked. So if one patch is planted over the eye of a person walking through the scene, the patch should be able to follow that eye for at least as long as it's visible. Achieving this goal will be valuable because it will provide a sort of representation of the contents of the scene as fluidly moving but persistent objects. This seems a cornerstone of generalized visual perception, which has been entirely lacking in the history of AI research.</p><p>One key principle for all of this research will be the goal of constructing stable, generic views, elaborated by <a target="_blank" href="http://www.cogsci.uci.edu/~ddhoff/">Donald D. Hoffman</a> in <a href="http://jvcai.blogspot.com/2005/05/review-of-visual-intelligence.html">Visual Intelligence</a>. The dynamics of individual patches will be very ambiguous. Favoring stable interpretations of the world will help patches to make smarter guesses, especially when some lines of evidence strongly suggest non-stable ones.</p><p>One obvious challenge is when a patch falls on a linear edge, like the side of a house, instead of a sharp point, like a roof peak. Even more challenging will be patches that fall on homogeneous textures, like grass, where independent tracking will be very difficult. 
It seems clear that an important key to the success of any single patch tracking its subject matter will be cooperating with its neighboring patches to get clues about what its own motion should be. Patches that follow sharp corners will have a high degree of confidence in their ability to follow their target content. Patches that follow edges will be less certain and will rely on higher confidence patches nearby to help them make good guesses. Patches that follow homogeneous textures will have very low confidence and will rely almost exclusively on higher confidence patches nearby to make reasonable guesses about how to follow their target content.</p><p>The algorithms for getting patches to cooperate will be a big challenge as it is. If the patches themselves aren't any good at following even strong points of interest, working on fabrics of patches will be a waste of time. Before any significant amount of time is spent on patch fabrics, I intend to focus attention on individual patches. A patch should be able to at least follow sharp points of interest. It should also be able to follow smooth edges laterally along the edge, like a buoy bobbing on water. Even this is a difficult challenge, though. Video of 3D scenes will include objects that move toward and away from the camera, so individual patches' target contents will sometimes shrink or expand. Nearby points of interest that look similar can confuse a patch if the target content is moving a lot. Changes in lighting and shadow from overcast trees, rotation, and so on will pose a huge challenge. Some of the strongest points of interest lie on outer edges of 3D objects. As such an object moves against its background, part of the patch's pattern will naturally change. The patch needs to be able to detect its content as an object edge and learn quickly to ignore the background movements.</p><p>It's apparent that solving each of these problems will require a lot of thought, coding, and testing. 
It's also apparent that these components may well work against each other. It's going to be important for the patch to be able to arbitrate differing opinions among the components about where to go at each moment. How best to arbitrate is a mystery to me at present. It seems logical, then, to begin my study by creating and testing the various analysis components of a single patch.</p><p>Once I have some better definition of the analysis tools a patch will have at its disposal for independent behavior, I should then have a tool kit of black boxes that an arbitration (and probably learning) algorithm can work with. Once I have a patch component that can do many analyses and come up with good guesses about the dynamics of its target content, then I can move on to constructing "fabrics" of patches so the patches can rely on their neighbors for additional evidence. The individual patches, if they have a generic arbitration mechanism, can use additional information from neighbors as just more evidence to arbitrate with.</p><p>I have made a conscious choice this time not to worry about performance. If it takes a day to analyze a single frame of a video, that's fine. *shudder* Well, I probably will try to at least make my research tolerable, but the result of this will almost certainly not be practical for real-time processing of video using the equipment I have on hand. However, I believe that if I am successful at least in proving the concept I'm striving for and thus advancing research into visual perception in machines, other programmers will pick apart the algorithms and reproduce them in more efficient ways. Further, it is very clear to me that individual patches are so wonderfully self-contained that it will be possible to divvy out all the patches in a scene to as many processors as we can throw at the problem. 
This means that if one can make a patch fabric engine that processes one frame per second using a single processor, it should be fairly easy to make it process 30 frames per second with 30 processors.</p><p>I am also dispensing somewhat with the goal of mimicking human vision with this project. I do believe a lot of what I'm trying to do does go on in our visual systems. I don't have strong reason to believe, though, that we have little parts of our brains devoted to following patches wherever they will go as time passes. That doesn't seem to fit the fixed wiring of our brains very well. It may well be that we do patch following of a sort that lets the patch slide from neural patch to neural patch, which may imply some means of passing state information along those internal paths. I can hypothesize about that, but really, I don't know enough yet to say that this is literally what happens in the human visual system. I think it's enough to say that it could.</p><p>So that's my current plan of research for a while. I have to do this in such small bites that it's going to be a challenge keeping momentum. I just hope that I've broken the project up into small enough bites to make significant progress over the longer term.</p>Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-88283308406728754342007-07-01T00:00:00.000-07:002016-10-04T11:46:08.110-07:00Patch mapping in video<p>Over the weekend, I had one of them epiphany thingies. Sometime last week, I had started up a new vision project involving patch matching. In the past, <a href="http://alexandria.nu/ai/machine_vision/topics/default.asp?Page=PatchEquivalence">I've explored this idea</a> with stereo vision and discovering textures. Also, I opined a bit on <a href="http://jvcai.blogspot.com/2005/07/machine-vision-motion-based.html">motion-based segmentation</a> here a couple of years ago.</p>
<p>My goal in this new experiment was fairly modest: plant a point of interest (POI) on a video scene and see how well the program can track that POI from frame to frame. I took a snippet of a music video, captured 55 frames into separate JPEG files, and made a simple engine with a Sequence class to cache the video frames in memory and a PointOfInterest class, of which the Sequence object keeps a list, all busy following POIs. The algorithm for finding the same patch in the next frame is really simple and only involves summing up the red, green, and blue pixel value differences in candidate patches and accepting the candidate with the lowest difference total; trivial, really. When I ran the algorithm with a carefully picked POI, I was stunned at how well it worked on the first try. I experimented with various POIs and different parameters and got a good sense of its limits and potential. It really got me thinking about how far this idea can be taken, though. Following is a sample video that illustrates what I experimented with. I explain more below. You may want to stop the video here and <a href="http://jimcarnicelli.com/ai/blog/attachments/00000046_0001.wmv">open it in a separate media player</a> while you read on in the text.</p>
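<p>The matching step described above is simple enough to sketch in a few lines of Python. This is my own illustrative reconstruction, not the original code; the function names are invented, and frames are assumed to be row-major lists of [R, G, B] pixel values:</p>

```python
# Sketch of the patch-matching step: for each candidate position in a
# search window around the patch's last location, sum the absolute
# per-channel (R, G, B) pixel differences and keep the candidate with
# the lowest total. Callers must keep the window within frame bounds.

def patch_difference(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute R, G, B differences between two size x size patches."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            for c in range(3):  # R, G, B channels
                total += abs(frame_a[ay + dy][ax + dx][c]
                             - frame_b[by + dy][bx + dx][c])
    return total

def find_best_match(prev_frame, next_frame, px, py, size, radius):
    """Search a (2*radius+1)^2 window for the lowest-difference candidate."""
    best = None
    best_diff = None
    for cy in range(py - radius, py + radius + 1):
        for cx in range(px - radius, px + radius + 1):
            diff = patch_difference(prev_frame, next_frame, px, py, cx, cy, size)
            if best_diff is None or diff < best_diff:
                best_diff, best = diff, (cx, cy)
    return best, best_diff
```

<p>This exhaustive search costs on the order of the window area times the patch area per POI per frame, which is one reason the search radius has to stay small for anything approaching interactive speed.</p>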
<center><p>
<object id="MediaPlayer" type="video/x-ms-asf">
<param name="FileName" value="http://jimcarnicelli.com/ai/blog/attachments/00000046_0001.wmv">
<param name="autostart" value="false">
<param name="ShowControls" value="true">
<param name="ShowStatusBar" value="false">
<param name="ShowDisplay" value="false">
<embed type="application/x-mplayer2" src="http://jimcarnicelli.com/ai/blog/attachments/00000046_0001.wmv"
ShowControls="1" ShowStatusBar="0" ShowDisplay="0" autostart="0" />
</object>
<br /><a href="http://jimcarnicelli.com/ai/blog/attachments/00000046_0001.wmv" >Click here to open this WMV file</a></p></center>
<p>I specifically wanted to show both the bad and the good of my algorithm with the above video. After I played a lot with hand-selected POIs, I let the program pick POIs based on how "sharp" regions in the image are. I was impressed at how well my simple algorithm for that worked, too. As you can see, in the first frame, 20 POIs (green squares) are found at some fairly high-contrast parts of the image, like the runner's neck and the boulders near the horizon. As you watch the video loop, start by watching how well the POIs on the right follow along with the video. The ones that start on the runner quickly go all over the place and "die" because they can no longer find their intended targets. Note the POIs in the rocks that get obscured by the runner's arm, though. They flash red as the arm goes by, but they pick up again as the arm uncovers them. Once a POI loses its target, it gets 3 more frames to try, during which it continues forward at the same velocity as before, and then it dies if it doesn't pick it up again. Once the man's leg covers these POIs, you can see them fly off in a vain search for where their targets might have gone before they die.</p>
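<p>That lose-and-coast behavior can be sketched as follows. This is my own illustration of the rule just described (three grace frames at the prior velocity), not the original code; the class shape, field names, and the idea of an externally supplied match-quality flag are all assumptions:</p>

```python
# Sketch of a POI's "grace period": when a good match is found, the POI
# snaps to it and records the implied velocity. When the match is lost,
# it coasts at that velocity (flashing red in the video) for up to three
# frames, then dies if the target never reappears.

MAX_LOST_FRAMES = 3

class PointOfInterest:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.vx, self.vy = 0.0, 0.0   # velocity implied by the last good match
        self.lost_frames = 0
        self.alive = True

    def update(self, match_xy, match_is_good):
        if not self.alive:
            return
        if match_is_good:
            # Lock onto the match and remember the velocity it implies.
            mx, my = match_xy
            self.vx, self.vy = mx - self.x, my - self.y
            self.x, self.y = mx, my
            self.lost_frames = 0
        else:
            # Coast forward at the previous velocity until the grace runs out.
            self.lost_frames += 1
            if self.lost_frames > MAX_LOST_FRAMES:
                self.alive = False
            else:
                self.x += self.vx
                self.y += self.vy
```

<p>The coasting step is what produces the "vain search" behavior in the video: a POI whose target is covered keeps projecting forward along its old trajectory until it either reacquires the target or exhausts its grace frames.</p>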
<p>I don't want to go into all the details of this particular program because I intend to take this to the next logical level and will make code available for that. I thought it useful just to show a cute video and perhaps mark this as a starting point with much bigger potential.</p>
<p>Although I thought of a bunch of ways in which I could use this, I want to indicate one in particular. First, my general goal in AI these days is to engender what I refer to as "perceptual level intelligence". I want to make it so machines can meaningfully and generally perceive the world. In this case, I'd like to build up software that can construct a 2D-ish perception of the contents of a video stream. My view is that typical real video contains enough information to discern foreground from background and whole objects and their parts as though they were layers drawn separately and layered together, as with an old-fashioned cel-type animation. In fact, I think it's possible to do this without meaningfully recognizing the objects as people, rocks, etc.</p>
<p>I propose filling the first frame of a video with POI trackers like the ones in this video. The ones that have clearly distinguished targets would act like anchor points. Other neighbors that would be in more ambiguous areas -- like the sky or gravel in this example -- would rely more on those anchors, but would also "talk" to their neighbors to help correct themselves when errors creep in. In fact, it should be possible for POIs that become obscured by foreground objects to continue to be projected forward. In the example above, it should actually be possible, then, to take the resulting patches that are tagged as belonging to the background and actually reproduce a new video that does not include the runner! And then another video that, by subtracting out the established background, contains only the runner. This would be a good demonstration of segmenting background and foreground.</p><p>It should also be possible for these POIs to get better and better at predicting where they will go by introducing certain learning algorithms. In fact, it's possible the POI algorithm could actually start off naive and come to learn how to properly behave on its own.</p><p>The key to both this latter dramatic feat and the other earlier goals is an idea I gleaned from <a target="_blank" href="http://www.cogsci.uci.edu/~ddhoff/">Donald D. Hoffman's</a> <a href="http://jvcai.blogspot.com/2005/05/review-of-visual-intelligence.html">Visual Intelligence</a>. One idea he promotes repeatedly in this book is the importance of "stable" interpretations of visual scenes. His book dealt primarily in static images, but this idea is powerful. Here's an example of what I mean. Watch the gravel in the video above. Naturally, gravel that is lower in the video is closer to you and thus slides by faster than the gravel higher up and thus farther away. 
Ideally, POI patches following this gravel would move smoothly so that higher up levels would slide slowly and lower down would slide more quickly. (To be sure, this video would have to be normalized to correct for the camera being so jumpy.) If one patch in this "stream" of flow were to think it should suddenly jut up several pixels while its neighbors are all slowly drifting to the left, this would not seem to fit a "stable" interpretation of this one patch being part of a larger whole or of it following a smooth path at a fairly consistent pace. We assume the world rarely has sudden changes and thus prefer these smooth continuations.</p><p>In chapter 6 of <a href="http://jvcai.blogspot.com/2005/05/review-of-visual-intelligence.html">Visual Intelligence</a>, Hoffman addresses motion specifically and, while he doesn't talk about patch processing like this, does introduce a bunch of interesting rules for perception. Here are some of them that relate here:</p><p></p><ul><li>Rule 29. Create the simplest possible motions.</li><li>Rule 30. When making motion, construct as few objects as possible, and conserve them as much as possible.</li><li>Rule 31. Construct motion to be as uniform over space as possible.</li><li>Rule 32. Construct the smoothest velocity field.</li></ul><p></p><p>The idea of stable interpretations can come into play with POIs that are following boundaries of foreground objects, like the runner in this example. My POIs failed to follow in part because, while the "inside" part of the patch was associated with the man's head, for example, the "outside" would be associated with the background, which might be constantly changing as the head moves forward in space. In fact, the "outside" (background) part of such a POI should generally be "unstable", while the "inside" (foreground) stays stable. 
That assumption of instability of background as it constantly is obscured or uncovered by the foreground is a rule that should be helpful both in getting POIs to track these edges, but also in detecting these edges in the first place and thus segmenting foreground objects from background ones.</p><p>As far as patches learning how to make predictions autonomously, here's where the concept of stable interpretations really shines. The goal of the learning process should be to make a POI algorithm that forms the most stable interpretations of the world. Therefore, when comparing two possible algorithmic changes -- perhaps using a genetic algorithm -- the fitness function would be stability itself. That is, the fitness function would measure the fidelity of the matches, how well each POI sticks with its neighbors, how well it finds foreground / background interfaces (against human-defined standards, perhaps), and so on.</p><p>There's so much more that could be said on this topic, but my blogging hand needs a break.</p>Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-22626412654513824832007-06-27T00:00:00.000-07:002016-10-04T11:46:48.897-07:00Emotional and moral tagging of percepts and concepts<p>Back in April, I suffered head trauma that almost killed me and landed me in the hospital for, thankfully, only a day. My wife, the sweet prankster that she is, went to a newsstand and got me copies of <a target="_blank" href="http://www.sciammind.com/">Scientific American Mind</a> and <i>Discover Presents: The Brain, an Owner's Manual</i> (a one-off, not a periodical). The former had a picture of a woman with the upper portion of her head as a hamburger and the latter a picture of a head with its skullcap removed revealing the brain. 
So I got a good laugh and some interesting reading.</p><p></p><center><img title="Scientific American Mind" src="http://jimcarnicelli.com/ai/blog/attachments/00000045_SciamMind.jpg" /> <img title="Discover Presents: The Brain; An Owner's Manual" src="http://jimcarnicelli.com/ai/blog/attachments/00000045_DiscoverBrain.jpg" /></center><p></p><p>I'm reading an article now in <i>The Brain</i> titled "Conflict". The basic position author <a target="_blank" href="http://www.carlzimmer.com/">Carl Zimmer</a> offers is encapsulated in the subtitle: morality may be hardwired into our brains by evolution. In my opinion, there is some merit to this idea, but I don't subscribe wholeheartedly to all of what the article promotes. Zimmer argues that the parts of our brains that respond emotionally to moral dilemmas are different from the parts that respond rationally and that, in fact, the emotional responses often happen faster than the intellectual ones. He further contends that our moral judgments come out of these more primitive, instant emotional responses. I have thought this as well, but not for the reason Zimmer proffers: that moral reasoning is automatic and built in.</p><p>I'd agree that, yes, we are reacting automatically and almost instantly, emotionally and moralistically, before we start seriously analyzing a moral question. But I would argue that it's because one's "moral compass" is programmable, but largely knee-jerk. Most humans may be born with some basic moral elements, like empathy and a desire to not see or let other people suffer. But we can readily reprogram this mechanism to respond instantly to things evolution obviously didn't plan for. For example, most Americans recognize the danger smoking poses to health. So smoking around other people comes with an understanding that it's a danger to their health, and often without their consenting to the risks. That knowledge quickly becomes associated with the "second-hand smoke" concept. 
I would argue that people with this knowledge instantly respond emotionally and moralistically when the subject of second-hand smoking comes up, regardless of the content of the conversation in which it's referenced. Even before the sentence is completely uttered, the moral judgments and emotional indignation are kicking in in the listener's mind. Why is this?</p><p>The article just prior to this one by <a target="_blank" href="http://www.stevenberlinjohnson.com/">Steven Johnson</a> and titled "Fear" points out that the amygdala is activated when the brain is responding to "fear conditioning", as when a rat is trained to associate a sound tone with electric shock.</p><p></p><center><img title="The Amygdala" src="http://jimcarnicelli.com/ai/blog/attachments/00000045_Amygdala.jpg" /></center><p></p><p>Johnson cites a fascinating case of a woman who suffered a tragic loss of short-term memory. Her doctor could leave for 15 minutes and return, and the woman would not recognize him or recall any history with him. Each time they met, he would shake her hand as part of the greeting ritual. One day, he concealed a tack in his hand when he went to shake her hand. After that, while she still did not recognize the doctor in any conscious way, she no longer wished to shake his hand. In experiments with rats, researchers found that removing the part of the neocortex that remembers events did not stop the rats from continuing to respond to fear conditioning. On the other hand, removing the amygdala did seem to take away the automatic fear reaction they had learned, even if they could remember events associated with their fear conditioning.</p><p>Johnson leaves open the question of whether the amygdala is actually storing memories of events for later responses versus simply being a way of "tagging" memories stored in other parts of the brain. My opinion is that tagging makes more sense. 
Imagine some part of your cortex stores the salient facts associated with some historical event that was traumatic. If the amygdala has connections to that portion of the cortex, they could be strengthened in such a way that anything that triggers memories of that event would also activate the amygdala via that strong link. If the amygdala is really just a part of the brain that kicks off the emotional responses the body and mind undergo, this seems a really simple mechanism for connecting thoughts with emotions.</p><p>In the hypothetical example I gave earlier, there could be a strong link between the "second-hand smoke" concept and the amygdala (or some other part of the brain associated with anger). So anything that activates those neurons would also trigger an instant emotional response that would become part of the context of the conversation or event.</p><p>I would propose the inclusion of this sort of "tagging" of the contents of consciousness (or even subconsciousness) for just about any broad AI research project. Strong emotions tend to be important in mediating learning. We remember things that evoke strong emotions, after all, and more easily forget things that don't. That has implications for learning algorithms. But conversely, memories of just about any sort in an intelligent machine could come with emotional tags that help to set the machine's "emotional state", even when that low-level response seems incongruous with the larger context. For example, a statement like "we are eliminating second-hand smoke here by banning smoking in this office" might be intended to make a non-smoker happy, but the "second-hand smoke" concept, by simply being invoked, might instantly add a small anger component to the emotional soup of the listener. 
That way, when the mind recognizes that the statement is about a remedy, the value of the remedy is recognized as proportional to the anger engendered by the problem.</p><p>Although I haven't talked much about moralistic tagging, per se, I guess I'm assuming that there is a strong relationship between how we respond emotionally to things and how we view their moral content. To be sure, I'm not suggesting that one's ethical judgments always (or should always) jibe with one's knee-jerk emotional reactions to things. Still, it seems this is somewhat a default for us, and not a bad starting point for thinking about how to relate moral thinking to rational thinking in machines.</p><p>Being able to tag any particular percepts or concepts learned (or even given a priori) may sound circular, mainly because it is. Emotions beget emotions, as it were. But there are obvious bootstraps. If a robot is given "pain sensors" to, say, detect damage or potential damage, that could be a source of emotional fear and / or anger.</p><p></p><center><img title="A damaged robot" src="http://jimcarnicelli.com/ai/blog/attachments/00000045_DamagedRobot.jpg" /></center><p></p><p>These emotions, in addition to affecting short-term planning, could also be saved with the memory of a damage event and even any other perceptual input (e.g., location in the world or smells) available during that event. Later, recalling the event or detecting or thinking about any of those related percepts could trigger the very same emotions, thus affecting whatever else is the subject of consideration, including affecting its emotional tagging. In this way, the emotions associated with a bad event could propagate through many different facets of the machine's knowledge and "life". 
This may sound like random chaos -- like tracking mud into a room and having other feet track that mud into other rooms -- but I would expect there to be natural connections from state to state, provided the machine is not prone to random thinking without reason. I think putting "tracers" in such a process and seeing what thoughts become "infected" would be fascinating fodder for study.</p>Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-70316129062056427732007-06-22T00:00:00.000-07:002016-10-04T11:47:17.533-07:00A hypothetical blob-based vision systemAs often happens, I was talking with my wife earlier this evening about AI. Given that she's a non-programmer, she's an incredible sport about it and really bright in her understanding of these often arcane ideas.<br /><br />Because of some questions she was asking, I thought it worthwhile to explain the basics of classifier systems. Without going into detail here, one way of summarizing them is to imagine representing knowledge of different kinds of things in terms of comparable features. She's a "foodie", so I gave the example of classifying cookies. As an engineer, you might come up with a long list of the things that define cookies; especially ones that can be compared among lots of cookies. Like "includes eggs" or a degree of homogeneity from 0 - 100%. Then, you describe each kind of cookie in terms of all these characteristics and measures. Some cookie types will have a "not applicable" or "don't care" value for some of these characteristics. So when confronted with an object that has a particular set of characteristics, it's pretty easy to figure out which candidate object types best fit this new object and thus come up with a best guess. One could even add learning algorithms and such to deal with genuinely novel kinds of objects.<br /><br />I explained classifier systems to my wife in part to show that they are incomplete. 
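A toy Python sketch of such a classifier might look like the following. Only "includes eggs" and homogeneity come from the discussion above (homogeneity scaled 0 to 1 here); the third feature and the cookie types themselves are hypothetical, and `None` stands in for a "not applicable" / "don't care" value:

```python
# Each cookie type is described by comparable features; None = "don't care".
COOKIE_TYPES = {
    "chocolate chip": {"includes_eggs": 1.0, "homogeneity": 0.4, "includes_chips": 1.0},
    "sugar cookie":   {"includes_eggs": 1.0, "homogeneity": 0.9, "includes_chips": 0.0},
    "macaroon":       {"includes_eggs": None, "homogeneity": 0.7, "includes_chips": 0.0},
}

def score(candidate, observed):
    """Average similarity over only the features the candidate cares about."""
    diffs = [abs(expected - observed[feature])
             for feature, expected in candidate.items()
             if expected is not None]
    return 1.0 - sum(diffs) / len(diffs)

def classify(observed):
    """Best guess: the candidate type with the highest similarity score."""
    return max(COOKIE_TYPES, key=lambda name: score(COOKIE_TYPES[name], observed))

guess = classify({"includes_eggs": 1.0, "homogeneity": 0.45, "includes_chips": 1.0})
```

Note that `classify` takes the observed feature values as given, which is precisely the rub.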
Where does the list of characteristics of the cookie in question come from? It's not that it's not a useful thing, but that it lacks the thing that almost every AI system made to date lacks: a decent perceptual faculty. Such a system could have cameras, chemical analyzers, crush sensors, and all sorts of things to generate raw data, and that might give us enough characteristics to classify cookies. But what happens when the cookie is on a table full of food? How do we even find it? AI researchers have been taking the cookie off the table and putting it on the lab bench for their machines to study for decades, and it's a cheap half-solution.<br /><br />Ronda naturally asked if it would be possible to have the machine come up with the fields in the "vectors" -- I prefer to think in terms of matrices or database tables -- on its own, instead of having an engineer hand craft those fields. Clever. Of course, I've thought about that and other AI researchers have gone there before. We took the face recognition problem as a new example. I explained how engineers define key points on faces, craft algorithms to find them, and then build a vector of numbers that represent the relationships among those points as found in pictures of faces. The vector can then be used in a classifier system. OK, that's the same as before. So I imagined the engineer instead coming up with an algorithm to look for potential key points in a set of pictures of 100 people's faces. It could then see which ones appear to be repeated in many or most faces and throw away all others. The end result could be a map of key points that are comparable. Those are the fields in the table. OK. So a program can define both the comparable features of faces and then classify all the faces it has pictures of. Pretty cool.<br /><br />But then, there's that magic step, again. We had 100 people sit in a well-lit studio and had them all face forward, take off their hats and shades, and so on.
We spoon fed our program the data and it works great. Yay. But what about the real world? What about when I want to find and classify faces in photographs taken at Disneyland? That's a new problem and starts to bring up the perception question all over again.<br /><br />At some point, as we were talking over all this, I put the question: let's say your practical goal for a system is to be able to pick out certain known objects in a visual scene and keep track of them as they move around. How can you do this? I was reminded of the brilliant observations <a href="http://www.cogsci.uci.edu/~ddhoff/">Donald D. Hoffman</a> laid out in his <a href="http://jvcai.blogspot.com/2005/05/review-of-visual-intelligence.html"><i>Visual Intelligence</i></a> book, which I reviewed on 5/11/2005. Among other things, Hoffman observed that, given a simple drawing representing an outline of an object, it seems we look for "saddle points" and draw imaginary lines to connect them and end up with lots of simpler "blob" shapes. I went further to suggest that this could be a way to segment a complex shape in such a way that it can be represented by a set of ellipses. The figure below shows a simple example:<br /><br /><center><img src="http://jimcarnicelli.com/ai/blog/attachments/00000044_Segmentation.gif" /></center><br /><br />I drew a similar outline in a sandbox at a playground we were walking by and asked her to segment it using these fairly simple rules. Naturally, she got the concept easily. From there, we asked how you could get to the clean line drawings to do the segmenting. After all, vision researchers have been banging their heads against the wall trying to come up with clean segmentation algorithms like this for decades.<br /><br />I described the most common trick vision researchers have in their arsenal of searching static images for sharp contrasts and approximating lines and curves along them. Not surprisingly, these don't often yield closed loops. 
That's why I had experimented with growing "bubbles" (see my <a href="http://jvcai.blogspot.com/2005/04/bubble-vision.html">blog entry</a> and <a href="http://alexandria.nu/ai/machine_vision/bubbles/">project site</a>) to ensure that there were always closed loops, on the assumption that they would be easier to analyze later than disconnected lines. Following is an illustration:<br /><br /><center><img src="http://jimcarnicelli.com/ai/blog/attachments/00000044_Bubbles.jpg" /></center><br /><br />I found that somewhat unsatisfying because it relies very much on smooth textures, whereas life is full of more complicated textures that we naturally perceive as continuous surfaces. So we batted around a similar idea in which we could imagine "planting" small circles on the image and growing them so long as the image included within the circle is reasonably homogeneous, from a texture perspective. Scientists are still struggling to understand how it is we perceive textures and how to pick them out. I like the idea of simply averaging out pixel colors in a sample patch to compare that to other such patches and, when the colors are sufficiently similar, assume they have the same texture. Not a bad starting point. So imagine segmenting a source image into a bunch of ellipses, where each ellipse contains as large a patch of one single texture as reasonably possible. Why bother?<br />These ellipses -- we'll call them "blobs" for now -- carry usable information. We switched gears and used hand tools as our example. Let's say we want to learn to recognize hammers and wrenches and such and be able to tell one from another, even when there are variations in designs. Can we get geometric information to jibe with the very one-dimensional nature of databases and algebraic scoring functions? Yes. Our blobs have metrics. Each blob has X / Y coordinates and a surface area; we'll call it its "weight". 
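Here is a minimal sketch of what those blob metrics might look like in code. The field names and the particular relative measures (distance, bearing, weight ratio) are my own illustrative choices, not a worked-out design:

```python
import math
from dataclasses import dataclass

@dataclass
class Blob:
    x: float       # center of the ellipse in image coordinates
    y: float
    weight: float  # surface area -- how much of the image it covers

def relative_metrics(a: Blob, b: Blob) -> dict:
    """Flatten the geometric relationship between two blobs into plain
    numbers that can sit in a database row and be scored algebraically."""
    dx, dy = b.x - a.x, b.y - a.y
    return {
        "distance": math.hypot(dx, dy),
        "bearing": math.atan2(dy, dx),
        "weight_ratio": b.weight / a.weight,
    }

# A wrench might reduce to a heavy "head" blob and a lighter "handle" blob.
head = Blob(x=10.0, y=10.0, weight=400.0)
handle = Blob(x=10.0, y=40.0, weight=150.0)
m = relative_metrics(head, handle)
```

The point is just that geometry reduces to comparable one-dimensional numbers, which is what a database table or scoring function wants.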
So maybe in our early experiments, we write algorithms to learn how to describe objects' shapes in terms of blobs, like so:<br /><br /><center><img src="http://jimcarnicelli.com/ai/blog/attachments/00000044_BlobLearning.jpg" /></center><br /><br />Step 3 is interesting, in that it involves a somewhat computation-heavy analysis of the blobs to see how we can group together bunches of small blobs into "parts" so we can describe our tools in terms of parts; especially if those parts can be found on other tools. In step 4, we use some algorithm to rotate the image (and blobs and parts) so we have them in some well-defined "upright" orientation and stretch it all out so it fits some fixed-sized box, which makes it easier to compare other objects, regardless of their sizes and orientations. In step 5, we look for connections among blobs to help show how they are related. Now, all of these steps are somewhat fictional. They're easy to draw on paper and hard to code. Still, let's imagine we come up with something that basically works for each.<br /><br />Now, when we see other tools laid out on our bench, we can do the same sorts of analyses and ultimately store the abstract representations we come up with. Perhaps for each object, we store a representation of its parts. One would be picked -- perhaps the center-most -- as the "root" and all the other parts would be available via links to their information in memory. Walking through an object definition would be like following links on web pages. Each part could be described in terms of its smaller parts, and, ultimately, blobs. Information like the number, weights, and relative positions or orientations of blobs and parts to one another can be stored and later compared with those of other candidate objects.<br /><br />Now here's where things can get interesting. The next step could be to take our now-learned software out into a "real world" environment. Maybe we give it a photograph of the wrench in a busy scene. 
We segment the entire scene into blobs, as before. But this time, we do an exhaustive search of all combinations of blobs against all known objects' descriptions.<br /><br />At this point, the veteran programmer has the shakes over the computation time required for all this. Get over it and pretend other engineers work on optimizing it all later. And besides, we have an infinitely fast computer in our thought experiment; something every AI researcher could use.<br /><br />It starts seeming like we can actually do this; like we can have a system that is capable of actually perceiving hand tools in a busy scene. Maybe our next step is to feed video to the program, where a camera pans across the busy scene. This time, instead of our program looking at each individual frame as a whole new scene, we start with the assumption of object persistence. In frame 1, we found the wrench. In frame 2, we search for the wrench immediately at the same place. Once we found the wrench in frame 1, we worked back down to the source image, picked out the part of the bitmap that is strongly associated with the wrench, and tried doing a literal bitmap match in frame 2 around the area it was in frame 1. Sure enough, we find it, perhaps just a little to the right. We assume it's the same wrench. So now, we've saved a lot of computation by doing more of a "patch match" algorithm.<br /><br />Now we not only have our object isolated, but we also now have information about its movement in time and can make a prediction about where it might be in frame 3. Maybe in frame 1, we found 2 wrenches and 1 hammer. Maybe as we track each one's movement from frame to frame, we look to see if it's all consistent in such a way that suggests maybe the camera is moving or that they are all on the same table or otherwise meaningfully related to one another in their dynamics.
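The "patch match" step can be sketched as a brute-force search over a small neighborhood around the object's last known position. The sum-of-squared-differences metric and the search radius are arbitrary choices here, just one plausible way to do a literal bitmap match:

```python
def ssd(frame, patch, top, left):
    """Sum of squared differences between a patch and the same-sized
    window of the frame whose corner is at (top, left)."""
    return sum((frame[top + r][left + c] - patch[r][c]) ** 2
               for r in range(len(patch)) for c in range(len(patch[0])))

def track(frame, patch, last_top, last_left, radius=2):
    """Search a small neighborhood around the object's position in the
    previous frame for the best literal bitmap match."""
    best = None
    for top in range(max(0, last_top - radius),
                     min(len(frame) - len(patch), last_top + radius) + 1):
        for left in range(max(0, last_left - radius),
                          min(len(frame[0]) - len(patch[0]), last_left + radius) + 1):
            err = ssd(frame, patch, top, left)
            if best is None or err < best[0]:
                best = (err, top, left)
    return best[1], best[2]

# Toy example: the 2x2 patch cut from frame 1 has shifted one pixel right.
patch = [[9, 9], [9, 9]]
frame2 = [[0, 0, 0, 0],
          [0, 0, 9, 9],
          [0, 0, 9, 9],
          [0, 0, 0, 0]]
pos = track(frame2, patch, last_top=1, last_left=1)
```

The difference between the old and new positions is exactly the movement information we can feed into a prediction for frame 3.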
New objects might be discovered, as well, using "learning while performing" algorithms like I described in a recent <a href="http://jvcai.blogspot.com/2007/04/pattern-sniffer-demonstration-of-neural.html">blog entry</a>. So much potential is opened up.<br /><br />I don't mean to suggest this is exactly how a visual perception algorithm should work. I just loved the thought experiment and how it showed how engineers could genuinely craft a system that can truly perceive things. And it illustrates a lot of features I consider highly valuable, like learning, pattern invariance, geometric knowledge, hierarchic segmentation of objects into "parts", bottom-up and top-down processes to refine percepts, object permanence, and so on.<br /><br />Now, about the code. I'll have to get back to you on that.Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-35088283841859574262007-04-21T12:00:00.000-07:002016-10-04T11:48:00.326-07:00Abstraction in neuron banks[<a href="http://jimcarnicelli.com/ai/blog/audio/blog_00000043.mp3">Audio Version</a>]<br /><br />On an exhilarating walk with my wife, we discussed the subject of how to build on the lessons I learned from my Pattern Sniffer project and its "neuron bank", documented in my <a href="http://jvcai.blogspot.com/2007/04/pattern-sniffer-demonstration-of-neural.html">previous blog entry</a>. There are loads of things to do and it was not obvious how to squeeze more value out of what little I've done so far. But it finally became apparent.<br /><br />One thing that I was not happy about with Pattern Sniffer is that the world it perceives is "pure". There is just one pattern to perceive at a time. The world we perceive is rarely like this. As I walk along, I hear a bird singing, a car, and a lawn mower at the same time and am aware of each, separately. 
Clearly, there is lots of raw information overlap, yet I'm able to filter these things out and be aware of all three at once. Pattern Sniffer could see two things going on in its tiny 5 x 5 pixel visual field, but it would see them as a single pattern. This is the kind of sterile world so many AI systems live in because the experimenters don't know how to rise above this problem. Yet rising above it is a requirement if we want to be able to get machines that can exist at the "perceptual level", and not just the "sensory level" of intelligence.<br /><br />I said in my <a href="http://jvcai.blogspot.com/2007/04/pattern-sniffer-demonstration-of-neural.html">previous blog entry</a> that my neurons' dendrites had a "care" property, but that I didn't make use of it yet. My vision was that this would play an important role in being able to recognize patterns in a more abstract way, but I didn't know how, yet. I need to get to work and document my results, but first I wanted to record some of the thoughts we came up with that I can now practically explore.<br /><br />As we walked, I pointed at a car and explained that somehow, I'm able to "mask out" all the not-car parts of the scene and focus only on the car part. It's very hard to explain what that means, but I tried to relate it in terms of my neuron banks. Consider the "left bar" pattern:<br /><br /><center><img title="5x5 pixel pattern" src="http://jimcarnicelli.com/ai/blog/attachments/00000043_LeftBarSolid.gif"/><br/><font size="-1">"Left Bar" pattern.</font></center><br /><br />What if we had a neuron in a bank that could recognize this pattern? But let's say I have another neuron that's a copy of this, save for one thing: each dendrite that expects white pixels now doesn't actually care what's in the white area.
We'll represent "don't care" pixels (dendrites) with blue diagonal stripes, like so:<br /><br /><center><img title="5x5 pixel pattern" src="http://jimcarnicelli.com/ai/blog/attachments/00000043_LeftBarWithDontCare.gif"/><br/><font size="-1">"Left Bar" pattern with white pixels replaced by "don't care" pixels.</font></center><br /><br />In this case, I'm assuming the "care" property would be a numeric value, from 0 (don't care) to 1 (care very much), multiplied while calculating the strength of the match on that dendrite that ultimately contributes to the total match score for the neuron. Now let's say the neuron bank is confronted by a perfect left bar pattern. Clearly, the neuron with the "solid" left bar pattern, with all dendrites having care = 1, will get a stronger match than the neuron with the "masked" version of the left bar pattern, because the don't-care dendrites will not contribute positively to the match score. So if only one neuron gets to "win" this matching game, the neuron with the solid left bar pattern will always win.<br /><br /><center><img title="5x5 pixel patterns" src="http://jimcarnicelli.com/ai/blog/attachments/00000043_LeftBarMatch.gif"/><br/><font size="-1">An exact match trumps a masked match.</font></center><br /><br />But now let's say we showed our neuron bank an "L" shaped pattern. The "masked" left bar pattern is going to fare better than the "solid" left bar, like so:<br /><br /><center><img title="5x5 pixel patterns" src="http://jimcarnicelli.com/ai/blog/attachments/00000043_LeftBarAgainstL.gif"/><br/><font size="-1">The "don't care" pixels don't get penalized by the "lower bar" part.</font></center><br /><br />Now let's say we also had "bottom bar" neurons that match both the solid and masked versions of that. Things get interesting with the "L" pattern. Let's say we even have a neuron that has learned the solid "L" pattern. 
The following illustrates these variations:<br /><br /><center><img title="5x5 pixel patterns" src="http://jimcarnicelli.com/ai/blog/attachments/00000043_LWithSeveral.gif"/><br/><font size="-1">The "L" neuron has the best match, followed by the masked left and bottom bar.</font></center><br /><br />OK, so if we have a neuron that already has a strong match of the "L" pattern, what good are the masked left and bottom bar? Here's where having a neuron hierarchy comes in handy. If we are regularly seeing left bars, bottom bars, and L patterns, a higher level neuron bank could potentially see that the masked-pattern neurons match more things than the solid-pattern neurons do and thus find them to be more generally useful than the specific-pattern neurons. It could then reward them by encouraging them to gain confidence, even though they are not the best matches.<br /><br />One thing my current neuron banks assume is that there is only one single best match and that only that one neuron gets rewarded for matching a pattern, while all the others may in fact be penalized. Yet this doesn't seem to fit how our brains work, at some level. Remember: I said I can hear and be aware of a bird singing, a car, and a lawn mower at the same time. That's what I want my software to do, too. See, if we're regularly seeing left bars and bottom bars, it may just be that, when we see an "L" in the input, it's actually just a left bar and a bottom bar, seen together. That's another interpretation.<br /><br />Being able to explain the total input in terms of multiple perceived stimuli must be more "satisfying" to certain parts of our brains than alternative explanations that see the input as all part of a single cause that is not currently known. Being able to engender this could bring a machine a lot closer to the perceptual level of intelligence.<br /><br />So that's what I'm probably going to study next.
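Here's a rough Python sketch of the care-weighted matching described above. The per-dendrite contribution of care times match strength is from the text; normalizing by total care is my own assumption, and it's what lets a masked neuron's ignored pixels cost it nothing:

```python
def match_score(input_pixels, expectations, cares):
    """Each dendrite contributes care * (1 - |input - expectation|);
    the score is normalized by the total care so that don't-care
    (care = 0) dendrites neither help nor hurt."""
    total = sum(c * (1.0 - abs(i - e))
                for i, e, c in zip(input_pixels, expectations, cares))
    return total / sum(cares)

# 5x5 patterns flattened row-major: -1 = white, +1 = black.
left_bar = [+1 if c == 0 else -1 for r in range(5) for c in range(5)]
l_shape = [+1 if c == 0 or r == 4 else -1 for r in range(5) for c in range(5)]

solid_cares = [1.0] * 25                              # cares about every pixel
masked_cares = [1.0 if c == 0 else 0.0
                for r in range(5) for c in range(5)]  # cares only about the bar

# Against the "L", the masked left bar out-scores the solid left bar:
# its don't-care dendrites aren't penalized by the unexpected bottom bar.
solid = match_score(l_shape, left_bar, solid_cares)
masked = match_score(l_shape, left_bar, masked_cares)
```

One wrinkle with this particular normalization: on a perfect left bar the solid and masked neurons tie at 1.0, so some tiebreak -- say, preferring the neuron with the greater total care -- would be needed to make the exact-match neuron win, as described above.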
One challenge will be figuring out how to deal with allowing multiple neurons to be rewarded for doing the right thing in a given moment without encouraging neurons to learn redundant information. We'll see.Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-82983332427942281662007-04-12T12:00:00.000-07:002016-10-04T11:49:46.762-07:00Pattern Sniffer: a demonstration of neural learning[<a href="http://jimcarnicelli.com/ai/blog/audio/blog_00000042.mp3">Audio Version</a>]<br /><br /><h2>Table of contents</h2><ul> <li><a href="#00000042_Introduction">Introduction</a><br /><li><a href="#00000042_Unguided">Unguided learning</a><br /><li><a href="#00000042_Finite">Finite resources</a><br /><li><a href="#00000042_Competing">Competing to be useful</a><br /><li><a href="#00000042_Confidence">Confidence</a><br /><li><a href="#00000042_Simulation">The simulation</a><br /><li><a href="#00000042_Linear">Learning in linear time</a><br /><li><a href="#00000042_AllAtOnce">All at once learning</a><br /><li><a href="#00000042_Performing">Learning while performing</a><br /><li><a href="#00000042_Noisy">Noisy data</a><br /><li><a href="#00000042_Longevity">Longevity</a><br /><li><a href="#00000042_WorkingMemory">Working memory</a><br /><li><a href="#00000042_Invariance">Pattern invariance</a><br /><li><a href="#00000042_More">More to explore</a><br /><li><a href="#00000042_Algorithm">The nuts and bolts of the algorithm</a></ul><br /><br /><a name="00000042_Introduction"></a><h2>Introduction</h2> For over a year, I've been nursing what I believe is a somewhat novel concept in AI that superficially resembles a neural network and is inspired by my read of Jeff Hawkins' <a target="_blank" href="http://www.onintelligence.org/">On Intelligence</a>. Recently, I finally got around to writing code to explore it.
I was so surprised by how well it already works that I thought it worthwhile to write a blog entry introducing the concept and make public my source code and test program for independent review. For lack of putting any real thought into it, I just named the project / program "Pattern Sniffer".<br /><br />My regular readers will recognize my frequent disdain for traditional artificial neural networks (ANNs), not only because they do not strike me as being anything like the ones in "real" brains, but also because they seem to fail miserably at displaying anything like "intelligent" behavior. So it's with reluctance that I call this a neural network. The test program I made, however, has only one "layer" of neurons, which I call a "neuron bank". I did not wish, yet, to demonstrate a hierarchy and multi-level abstraction. My main goal was to focus specifically on a very narrow but almost completely overlooked topic in artificial intelligence: unguided learning.<br /><br /><a name="00000042_Unguided"></a><h2>Unguided learning</h2> All artificial neural networks I have ever seen or read about rely on a so-called "training phase", where they are exposed to examples of certain patterns they are supposed to be able to recognize in the future before they are ever put out into the "real world". I was disappointed when I finally read of how Numenta's Hierarchic Temporal Memories (HTMs) undergo the same sort of <a target="_blank" href="http://www.numenta.com/for-developers/education/Numenta_HTM_Learning_Algos.pdf">learning process</a> before they can begin recognizing things in the world. This flies in the face of how humans and other mammals and, indeed, all creatures on Earth that can learn work.<br /><br />Does intelligence require that an intelligent being continue to learn once it enters a productive life? I think the answer is obviously "yes".
What's more, it's tempting for us to think humans go through learning early on, as in their school years, and spend most of their lives in a basic "production" mode. Yet I would argue that every moment we are awake, we are learning things. Most of it is quickly forgotten. We use the terms "short term memory" and "working memory" to identify this, which seems to suggest we have something like computer RAM, while the real long-term memory is packed away into a hard drive.<br /><br />I'm no expert in neurobiology, so I may be missing some important information. But the idea of information being transferred in packages of data from one part of the brain to another for long term storage doesn't seem to jibe with my limited understanding of how our brains work. Why, for example, should learning a phone number long enough to dial it occur in one part of the brain while learning it for long term use, like with our own home numbers? And how would it be transferred?<br /><br />What if it's the same part of the brain learning that phone number, whether for short or long term usage? Perhaps the part of my brain that is most directly associated with remembering phone numbers has some neurons that have learned some important phone numbers and will remember them for life, while it contains other neurons that have not learned any phone numbers and are just eagerly awaiting exposure to new ones that may be learned for a few seconds, a few minutes, or a few years.<br /><br /><a name="00000042_Finite"></a><h2>Finite resources</h2> We are constantly learning. Yet we have a finite amount of brain matter. Somehow we must have some mechanism for deciding which information we are exposed to is important enough to retain long term and which is only worth retaining for a moment.<br /><br />When I studied how Numenta's HTMs learn, I was a bit disappointed to see that, while there is a finite and predetermined number of nodes in an HTM, the amount of memory required for one is variable.
This is like many other kinds of classifier systems and other learning algorithms. This does make some sense from an engineering perspective, but it does not seem to fit what I understand of how our brains work. Our neurons may change the number and arrangement of dendritic connections, but it's a far cry from keeping a long list of learned things inside. So far, it seems ANNs are one of the only classes of learning systems out there that do use a finite and predefined amount of memory in learning and functioning.<br /><br />I believe that, for some functional chunk of cortical tissue, there is a fixed number and basic arrangement of neurons and they all are doing basically the same thing, like learning, recognizing, and reciting phone numbers. It seems intuitive to believe that that chunk has its own way of deciding how to allocate its neurons to various numbers, with some being locked down, long term, and others open to learning new ones immediately for short term use. Any one of these may also eventually become locked down for the long term, too.<br /><br />I also believe it's possible, though not certain, that some neurons that have learned information for the long term may occasionally have that information decay and be freed up to learn new things.<br /><br /><a name="00000042_Competing"></a><h2>Competing to be useful</h2> When I started thinking about banks of neurons working in this way, I naturally asked the question: how does the brain decide what is important to learn and how long to retain it? It then occurred to me that there may be some kind of competition going on. What if most of the neurons in the cortex "want" more than anything to be useful? What if they are all competing to be the most useful neuron in the entire brain?<br /><br />Let's start with the assumption that all neurons in a neuron bank all have access to the same input data. And let's say each neuron wishes to be most useful by learning some important piece of information. 
You would think that the first problem to arise would be that they would all learn the exact same piece of information and thus be redundant. But what if, when one neuron learns a piece of information, the others could be steered away from learning the same information? What if every neuron was hungry to learn, but also eager to be unique among its peers in what it knows?<br /><br />But how could one neuron know what its peers know? Would that require an outside arbiter? An executive function, perhaps? Not necessarily. It's possible that each neuron, when it considers the current state of input, decides how closely that input matches its own expected pattern that it has learned and "shouts out" how strongly it considers the input to match its expectation. The other neurons in the bank could each be watching to see which neuron shouts out the loudest and assume that neuron is the most likely match. Actually, it could be enough to know the loudest shout and not which neuron did the shouting.<br /><br /><a name="00000042_Confidence"></a><h2>Confidence</h2> The idea that every neuron in a bank reports to the group how well it thinks it matches the input is powerful. It follows, then, that the neuron that shouts the loudest would pat itself on the back by becoming more "confident" in its knowledge and thus reinforce what it knows. Conversely, all the other neurons would become no more confident and perhaps even less so with each passing moment that they go unused.<br /><br />Confidence breeds stasis. In this case, that's ideal. What if some neurons in a bank were highly confident in what they know and others were very unconfident? Those that have low confidence should be busy looking for patterns to learn. In a rich environment, there will be a nearly limitless variety of new patterns that such neurons could learn. There are several ways a brain could decide that some piece of information is important. One is simple repetition.
When you want to remember someone's name, you probably repeat it in your mind several times to help reinforce it. And in school, repetition is key to learning. So it could be that individual neurons of low confidence gain confidence when they latch onto some new pattern and see it repeated. Repetition suggests non-randomness and hence a natural sort of significance.<br /><br />What if, as a neuron becomes more confident, it becomes less likely to change its expectation of what pattern it will match? What if confidence is itself a moderator of a neuron's flexibility in learning new patterns?<br /><br /><a name="00000042_Simulation"></a><h2>The simulation</h2> Armed with this hypothesis, I set out to make a program called "Pattern Sniffer" to simulate a bank of neurons operating in this way and to test its viability. My goal, to be sure, is not to replicate human neocortical tissue. I suspect our brains do some of what my hypothesis entails, but my main goal is to see if learning can happen like this. Here's a screen shot from the program:<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000001.png"/><br/><font size="-1">Screen shot from Pattern Sniffer program</font></center><br /><br />You can <a href="http://jimcarnicelli.com/ai/blog/attachments/00000042_PatternSniffer.zip">download the Pattern Sniffer</a> program and its source code. This is a VB.NET 2005 application. Once you unzip it, you will find the executable program at <tt><nobr>PatternSniffer\Ver_01\PatternSniffer\bin\Debug\PatternSniffer.exe</nobr></tt>. There is a PatternSniffer.exe.config file alongside it, which you can edit with a text editor to change certain settings, such as the number of neurons in the bank. There is a "Snapshots" subfolder, in case you wish to use the "Snapshot" button, not shown here.<br /><br />The program's user interface is very simple, as seen above.
The main feature is a set of gray boxes representing individual neurons in a single bank. The grid of various shades of gray boxes in each one represents that neuron's "dendrites". Input values in this program are from -1 to +1. In this UI, -1 is represented as white and +1 as black. Each dendrite has an "expectation" of what its input value should be for it to consider itself to match. In this example, there are 25 input values; hence 25 dendrites per neuron. The top left corner of the program features an input grid, also with 25 values. The user can click on this to alternate each pixel from black to white. You probably won't want to use that, though, as the program comes with a SourcePatterns.bmp file that has 25 5x5 gray-scale images on it, which you can edit. The following is a magnified version of SourcePatterns.bmp:<br /><br /><center><img title="SourcePatterns.bmp" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_SourcePatterns.gif"/><br/><font size="-1">SourcePatterns.bmp, magnified 10 times</font></center><br /><br />When you start the program, the neurons start out in a "naive" state. They know nothing and hence have nearly zero confidence (shown as a white box in each neuron display above). As you click the "Random Patch" button, the program picks one of the patterns in SourcePatterns.bmp, displays a representation of it in the input grid, presents it to the neuron bank for a moment of consideration, and updates the display to reflect changes in the neuron bank's state. Check the "Keep going" check box to make pushing this button happen automatically.<br /><br />To be clear, while the program displays a 2 dimensional grid of image data, the neurons have no awareness of either a grid or of it being graphical data. They only know they take a set of linear values as input. The inputs could be randomly reshuffled at the start with no impact on behavior.
The grid and the choice of image data are simply there to help us visualize what is going on inside the bank.<br /><br />You can control how many of the patterns in the source set are used by changing the "Use first" number. If you choose 3, for example, patterns 1, 2, and 3 will be used to select randomly from with each click of the "Random Patch" button. At any time, you can specifically change the "Pattern" number to select a specific pattern to work with. Clicking "Linger" causes the bank to go through a single moment of "pondering" the input, just like when the user clicks "Random Patch". With each moment of pondering, the bank becomes more "set" in what it knows. Clicking "Brainwash" brings the entire neuron bank back to its naive state.<br /><br />The "Noise" setting is a value from 0 to 100% and controls how degraded the input pattern is when presented to the neuron bank. At 100%, one pattern is nearly indistinguishable from any other.<br /><br /><br /><a name="00000042_Linear"></a><h2>Learning in linear time</h2> Let's start with a familiar and yet simplistic case of training and using our neuron bank. We begin with the naive state as follows:<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000002.png"/></center><br /><br />Pattern 1 contains all white pixels. With the first click of "Linger", the neurons in the bank all try to determine which of them best matches this pattern. In this case, neuron 14 (n14) is most similar:<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000003.png"/></center><br /><br />Because it "yells the loudest", it is rewarded by having its confidence level raised ever so slightly and by moving its dendrites' expectation levels closer to the input pattern. The lower the confidence, the more pliable the dendrites' expectations are to change. 
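<br /><br />That update rule can be written in a couple of lines. Here is an illustrative Python sketch of the idea (the actual program is VB.NET; the function name and sample values below are mine, not the program's):<br /><br />

```python
def move_toward(expectations, inputs, confidence):
    """Nudge each dendrite's expectation toward the current input.

    The higher the neuron's confidence, the smaller the nudge, so a
    confident neuron's learned pattern is nearly frozen while a naive
    neuron conforms to the input almost completely in one step.
    """
    return [e + (x - e) * (1.0 - confidence)
            for e, x in zip(expectations, inputs)]

# A naive neuron (confidence near 0) adopts the input in a single step.
naive = move_toward([0.0, 0.0], [1.0, -1.0], confidence=0.0)

# A confident neuron (0.9) moves only a tenth of the way toward the input.
confident = move_toward([0.0, 0.0], [1.0, -1.0], confidence=0.9)
```

<br />With confidence at zero the expectations become the input exactly; at 0.9 they barely budge.<br /><br />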
Since n14 has near zero confidence (-1), it conforms nearly 100% in this single step. Clicking "Linger" 7 more times, n14 continues to be the best match and so continues to increase its confidence until it has nearly full confidence (+1):<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000004.png"/></center><br /><br />Now we move to pattern 2 and repeat this. Pattern 2 is all black pixels. n23 happens to be most like this pattern, so with repetition it learns it quickly:<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000005.png"/></center><br /><br />Notice in the preceding how n14 is still expecting the white pattern and has a high level of confidence. Its expectations have shifted ever so slightly, indicated by the very faint gray boxes scattered within n14's display.<br /><br />We continue this process for the first 6 patterns, picking one and lingering on it for 8 steps each, and end up with the following state:<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000006.png"/></center><br /><br />You can quickly find the learned knowledge by looking for black confidence level boxes. At this point, you may wonder why the left, right, top, or bottom bar patterns would match neurons with randomized expectations better than, say, the solid white or solid black patterns. This has to do with the way matching occurs and is affected by a neuron's confidence level.<br /><br />When the neuron bank is asked to "ponder" the current input, it goes through two steps, with each neuron being processed in turn in one step before the next step proceeds and each neuron is again processed. Step 1 is matching. It begins with each dendrite calculating its own match strength. The match strength is calculated as MaxSignal - Abs(Input - Expectation), where MaxSignal = 1. 
Thus, the closer the scalar input value is to the value expected by that dendrite, the closer the match strength will be to the maximum possible.<br /><br />Things get interesting here. Before returning the match strength value, we alter it. If the strength is less than zero -- that is, if this dendrite finds the input value is very different -- then we "penalize" the match strength using Strength = Strength * Neuron.Confidence * 6. The final strength, whether adjusted or not, is divided by 6 to make sure the strength is never outside the min/max range of -1 to +1. So the more confident the neuron is in what it knows, the more strongly mismatched inputs will penalize the match value.<br /><br />So now, if I set "Use first" to 6 and check "Keep going", the program will continually run through these first 6 patterns that have been learned and will always match and reinforce them. So far, this is not very remarkable, as it is easy to make a program learn any number of distinct digital patterns. As we'll see, however, there's a lot more to this than a cheap parlor trick.<br /><br />What is remarkable, however, is the time it takes to learn. AI systems that include learning often suffer exponential increases in learning time as the amount of information to learn increases linearly. In this simple demonstration, it does not matter how many novel patterns the neuron bank is exposed to. It will take the same number of steps of repetition to solidify a naive neuron's knowledge. One simple estimate would be that it takes 8 steps to learn each new pattern, when they are presented in this fashion.<br /><br />There are caveats, to be sure. For one, the configuration for this demo has only 26 neurons, which means it can only learn up to 26 distinct patterns. For another, as time passes and a neuron is not "used" -- if it never matches anything -- it slowly loses confidence that it is still useful and begins to degrade until it finally is naive again. 
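<br /><br />To make the matching rule described above concrete, here is an illustrative Python sketch of the per-dendrite calculation and the neuron-level average (again, the actual program is VB.NET; the function names and sample values are mine):<br /><br />

```python
def dendrite_strength(inp, expectation, confidence):
    """Per-dendrite match strength: 1 - |input - expectation|, scaled by 1/6.

    If the dendrite is mismatched (negative strength), the penalty is
    multiplied by the neuron's confidence, so confident neurons punish
    mismatches far harder than naive ones do.
    """
    strength = (1.0 - abs(inp - expectation)) / 6.0
    if strength < 0:
        strength *= confidence * 6.0
    return strength

def neuron_match(inputs, expectations, confidence):
    """A neuron's overall match: the average of its dendrite strengths."""
    return sum(dendrite_strength(x, e, confidence)
               for x, e in zip(inputs, expectations)) / len(inputs)

# A confident neuron matching its own pattern scores the maximum, 1/6.
perfect = neuron_match([1.0, -1.0], [1.0, -1.0], confidence=0.9)

# The same neuron shown the opposite pattern is penalized heavily.
opposite = neuron_match([1.0, -1.0], [-1.0, 1.0], confidence=0.9)
```

<br />Note that for a naive neuron (confidence near zero) the mismatch penalty all but vanishes, which helps explain why a randomized neuron can outscore a confident one on an unfamiliar pattern.<br /><br />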
So there is a practical limit to how many patterns can be taught before there has to be a "refreshment" process to bolster the existing neurons' confidences.<br /><br /><br /><a name="00000042_AllAtOnce"></a><h2>All at once learning</h2> The story changes when learning is done in bulk. Let's change the experiment a little to illustrate. First, we'll brainwash our neuron bank. Then we set "Use first" to 6: the same solid black and white patterns, plus the left, right, top, and bottom bars that we saw before. Now we'll step through the process for a while (using the "Random Patch" button). Below is a series of screen shots. Note the "Steps taken" number in each step.<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000007.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000008.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000009.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000010.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000011.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000012.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000013.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000014.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000015.png"/></center><br /><br />When we started 
out, all neurons were naive, meaning they had not learned any patterns and they had no confidence in what they "knew". So as a new pattern is introduced in each moment, there's usually a "virgin" neuron that's happy to match and claim that pattern for its own. But watch the sequence of events for each neuron that does this as time moves on. Each one degrades quickly. In step 1, n21 is the first neuron to match anything, namely the solid black pattern. Yet one step later, when the input has a new pattern, n21 is already starting to decay. By step 8, with no further reinforcement yet, n21 has decayed so much that there's a good chance if the next step brings the solid black pattern back, it may not be the best match for it any more.<br /><br />However, reinforcement does build confidence. The right bar pattern has been seen 3 times in the above sequence. n5 was the first to see it and, thanks to reinforcement, it has a higher degree of confidence and so its expectation pattern is likely to persist longer without reinforcement. Still, its confidence is not at all high. Let's see what happens as time progresses and the patterns are seen more. Note the steps-taken number in each snapshot and how each learned neuron's confidence level grows with reinforcement:<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000016.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000018.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000020.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000021.png"/></center><br /><br />OK. So after 80 steps, we have most of the patterns pretty well learned, save for the solid white pattern. 
By random chance, that one was simply not seen many times during this run. Still, this is markedly worse than when we spoon-fed the patterns one at a time. With 8 steps per pattern and 6 patterns, the learning process took only 48 steps. So maybe that's an indication that this is not a very good learning algorithm. But isn't the real world like this? And when we try this experiment with all 25 patterns thrown around at random, it may take thousands of steps to solidly learn them all instead of the roughly 200 it would take if we spoon-fed them.<br /><br />But maybe this is exactly what we expect. Have you ever been in a room with someone speaking a language you don't understand? You may be exposed to hundreds of new words. If I asked you to repeat even three of them that you picked up (and did not already know), you might just shrug and tell me none of them really stuck. But if you asked one of the speakers to teach you one or two words, you might be able to retain them for the duration of the conversation and reliably repeat them. To use another analogy, consider a grade school English class. Would a teacher be more likely to expose the students to all of the vocabulary words at once and simply repeat them all every day, or instead to expose students to a small number of new vocabulary words each week? Clearly, learning a few new words a week is easier than learning the same several hundred all at once, starting from day one.<br /><br />My interpretation of what's going on is that this neural network is behaving very much like our own brains do, in this sense. The more focused its attention is on learning a small number of patterns at one time, the faster it will learn them. This may seem like a weakness of our brains, but I don't think so. I believe this is one way our own brains filter out extraneous information. We're exposed to an endless stream of changing data. Some of it we already know and expect, but a lot of it is novel. 
Repetition, especially when it occurs in close succession, is a powerful way to suggest that a novel pattern is not random and therefore potentially interesting enough to learn. In fact, the very principle of rote learning seems to be based on hijacking this repetition-based learning system in our brains.<br /><br /><br /><a name="00000042_Performing"></a><h2>Learning while performing</h2> As I mentioned in the introduction, I've long been bothered by the fact that most AI learning systems require a learning stage separate from a "performance" process. So far, we've been focused on learning with this novel sort of neural network I've made, and we'll continue to focus on that, but I want to stress that all the while that we are training this neural net, we are also watching it perform. Its only task, in this experiment, is to match patterns it sees.<br /><br />One simple way to prove this point is to train the neuron bank on however many patterns you wish and then just check the "Keep going" box and watch it perform. Then, at some point, try adding one more pattern using the "Use first" number while it continues crunching away. It will eventually learn the new pattern, all the while still performing its main task of matching patterns. There is no cue we send to the neuron bank that we are introducing a new pattern. In fact, the neuron bank doesn't know any of these numbers we see on the screen. It doesn't, for example, know that we have 25 total patterns, or that we are only using 6 of them at the moment. We don't check any box saying, "you are now supposed to be learning". It just does both constantly: learning and performing.<br /><br /><br /><a name="00000042_Noisy"></a><h2>Noisy data</h2> I said earlier that having a machine learn 6 digital image patterns is just a cheap programming parlor trick. But I said there is more to this. 
Numenta's Pictures demo app of their HTM concept is configured such that a single node adds a quantization point for each bit-level unique pattern it comes across. True, the HTM can be configured to be a little more relaxed and to consider two similar patterns to represent one and the same, but you have to program the threshold of similarity in advance of learning. So really, one is very likely to end up with a very large set of quantization points if the training data is noisy. And their own <a target="_blank" href="http://www.numenta.com/for-developers/education/Numenta_HTM_Learning_Algos.pdf">white paper</a> states, "The system achieved 66 percent recognition accuracy on the test image set," hardly impressive. Traditional ANNs seem to be a little less sensitive to noise, but they aren't perfect, either.<br /><br />The matching algorithm for this neural network is incredibly simple: just average how close each actual input value is to its expected value, adjusted by basic factors like the neuron's confidence level. But as you'll see in the following experiments, this makes it very competent at dealing with noise.<br /><br />Let's start by setting "Noise" to 50% and brainwashing. We'll take the top bar pattern (#3) as our starting point and click "Linger" a few times. 
Watch what happens in the following sequence:<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000022.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000023.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000024.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000025.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000026.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000027.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000028.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000029.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000030.png"/></center><br /><br />Notice how n21's expectations, in step 1, look exactly like the first noisy version of the top-bar that it sees? Yet in each successive step of learning, as it gets new noisy versions, its expectation shifts more towards the perfectly noise-free top-bar pattern it never actually sees. It's learning a mostly noise-free version of a pattern it never sees without that noise!<br /><br />Is this magic? Not at all. The noise is purely random, not structured. That means with each successive step, n21 is averaging out the pixel values and thus cancelling the noise. 
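<br /><br />This noise-averaging effect is easy to reproduce in isolation with the confidence-weighted update described earlier. The following Python sketch uses made-up stand-ins for the pattern, noise range, and confidence schedule, not the program's actual values:<br /><br />

```python
import random

random.seed(1)

clean = [1.0, -1.0, 1.0, -1.0, 1.0]                   # stand-in pattern
expectation = [random.uniform(-1, 1) for _ in clean]  # naive neuron
confidence = 0.0

for _ in range(300):
    # Present a noisy copy; the neuron never sees the clean pattern.
    noisy = [p + random.uniform(-0.5, 0.5) for p in clean]
    # Confidence-weighted nudge of each expectation toward the input.
    expectation = [e + (x - e) * (1.0 - confidence)
                   for e, x in zip(expectation, noisy)]
    # Repetition builds confidence, which slows further change.
    confidence = min(0.9, confidence + 0.05)

# Worst per-value deviation from the clean pattern the neuron never saw.
worst_error = max(abs(e - p) for e, p in zip(expectation, clean))
```

<br />Even though every individual presentation is corrupted, the learned expectation ends up close to the clean pattern, because the zero-mean noise cancels out over repetitions.<br /><br />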
Now, n21 is also becoming more confident, though more slowly than it did when it saw the noise free version. So with each passing moment, the pattern is changing more and more slowly. Eventually, it will become fairly solid.<br /><br />Let's continue this experiment by training the bank with the first 6 patterns:<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000031.png"/></center><br /><br />With manual spoon-fed learning of each of the 6 patterns, we get to step 90 and all 6 are pretty solidly learned. We can now switch on the "Keep going" check box to let it cycle at random through all 6 patterns indefinitely and it will continue to work just fine, with 100% accuracy (to be sure, I spot-checked; I didn't check the match accuracy at all steps), in spite of the noise and all the neurons hungrily looking for new patterns to learn. Here it is after 150 unattended steps, still solid in its knowledge:<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000032.png"/></center><br /><br />Now, we turn the noise level up to 75%. 
Watch how well it continues to work:<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000033.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000034.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000035.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000036.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000037.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000038.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000039.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000040.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000041.png"/></center><br /><br />Look back carefully at these 8 steps, because they are very telling. Remember: the neuron bank has no idea that I am still using the same 6 patterns I trained it on. Remember also that with a highly confident neuron, there is a high penalty for each poorly matched dendrite. Looking at the input patterns, I'm struck by how badly degraded they are and thus difficult for me to match, yet the neuron bank seems to perform brilliantly. Only at step 155 do we finally see a pattern so badly degraded that the bank decides it's a novel one it might want to learn. 
Of course, it's never going to be seen again, so this blip will be quickly forgotten and n8 will be free to try learning some other new pattern. In all 7 of the other steps, it matches the noisy input pattern correctly.<br /><br />This isn't the end of the story, though. Noise filtering cuts both ways. Some unique patterns will be treated as simply noisy versions of known patterns. Take another look at the source patterns:<br /><br /><center><img title="SourcePatterns.bmp" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_SourcePatterns.gif"/><br/><font size="-1">SourcePatterns.bmp, magnified 10 times</font></center><br /><br />Near the bottom, there are four "arrow" patterns. To your eye, they probably look pretty distinctly different from the side bar patterns (left, right, top, bottom) that we've been working with, but to this neural net, they are so similar that they are considered to be simply noisy versions of the bars. Or, conversely, the bars are seen as noisy versions of the arrows. Here's our neuron bank after a brainwashing and learning the first 19 patterns, just before we get to the arrows. You can see the first patterns (solid white and black) to be learned are starting to degrade:<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000042.png"/></center><br /><br />Now to introduce one of the arrows to the bank. 
See how, in just a few steps, this confident neuron's expectations change to start looking like the arrow?<br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000043.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000045.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000047.png"/></center><br /><br /><center><img title="Pattern Sniffer screen shot" src="http://jimcarnicelli.com/ai/blog/attachments/00000042_00000049.png"/></center><br /><br /><br /><br /><a name="00000042_Longevity"></a><h2>Longevity of information</h2> Now that I've illustrated some of what this particular program can do and thus some of the potential capabilities for machine learning using this concept, I think I can more easily speak about some of its weaknesses and suggest some potential ways to overcome them.<br /><br />For one thing, longevity is lacking. What one neuron learns in this particular demonstration can be unlearned within a few minutes of running without seeing that pattern again. That's obviously not desirable in a machine that may have a useful life of many years. But that doesn't mean that this is a limitation of this type of system, per se. I set out to demonstrate not only how a neural network can learn while being productive, but also how unused neurons can be freed up to learn new things without any central control over resource allocation.<br /><br />I did address this to some degree in the current algorithm, actually. As described earlier, a neuron loses confidence over time if it is unused, and therefore becomes more pliable to adjusting its expectations. However, the degree to which it loses confidence, in any given step, is determined in part by the best match value seen. 
That is, if some neuron has a very strong match of the current input pattern, then a non-matching neuron will not lose much confidence. If, however, none of the other neurons considers itself to be a strong match, that could potentially mean that there's a new pattern to learn, and so the non-matching neurons will lose confidence a little faster.<br /><br />One way that this algorithm could be improved is by consideration of how "full" a neuron bank is of knowledge. Perhaps when a bank has a lot of naive neurons, those that are highly confident of what they know should be less likely to lose confidence. Conversely, when there are few or no neurons that remain naive, there could be a higher pressure to lose confidence. Perhaps this could further be adjusted based on the rate of novelty in input patterns, but that's harder to measure.<br /><br />Perhaps there are higher level ways that memory could be evaluated for importance and, over time, exercised in order to keep it clean and strong.<br /><br /><br /><a name="00000042_WorkingMemory"></a><h2>Working memory</h2> When I started making this program, I was not really considering the problem identified earlier in this blog entry of working memory versus long term memory. But in the course of building and testing Pattern Sniffer, it dawned on me that my neural network was displaying both short and long term learning within the same system. The key difference was not structure, locality, or anything so complicated, but simply repetition.<br /><br />Yes, in the sample program, we are learning and matching simple visual patterns. But this same kind of memory could just as easily be used to learn a phone number sequence long enough to dial it. Or to remember a visual pattern long enough to match it to something else in the room. 
And, without heavy repetition, the neuron(s) that remember it will decay again into naivete, ready to learn some other pattern.<br /><br /><br /><a name="00000042_Invariance"></a><h2>Pattern invariance</h2> I think this sample program well demonstrates this kind of neural network's insensitivity to noisy data. Still, one thing it clearly is not is insensitive to subtle transformations of the patterns it knows.<br /><br />With this program, I decided I would use a small visual patch for demonstration purposes in part because I thought it would perhaps be worth replicating the ability of our own retinas to detect and report strong edges and edge-like features at different angles, especially if it could learn about edges all on its own. But I must admit this was also the same sort of cheat many AI researchers tackling vision use: forcibly constrain the source data to take advantage of easy-to-code techniques.<br /><br />To their credit, the Numenta team have come up with a crafty way of discerning that different patterns of input are representative of the same things by starting with the assumption that "spatial" patterns that appear in close time succession to one another very likely have the same "cause" and thus such closely tied spatial patterns should be treated as effectively the same, when reporting to higher levels of the brain.<br /><br />I think the kind of neural network I've engendered in Pattern Sniffer can benefit from this concept as well. Implicitly, it already embraces the notion that the same pattern, repeated in close succession, has the same cause and is thus significant enough to learn. But to be able to see that two rather different spatial patterns have a common cause could be very powerful. One way to do this would be to have a neuron bank above the first which is responsible for discovering two-step (or longer) sequences in the lower level's data. 
If, for example, the first level has 10 neurons, the second level could take 20 inputs: 10 for one moment of output and 10 more for the following moment. In keeping with Jeff Hawkins' vision of information flowing both up and down a neural hierarchy, the upper neuron bank, upon discovering such temporal patterns, could "reward" the contributing lower level neurons by pushing up their confidence levels even faster. This higher level neuron bank could even be designed to respond either to the sequence being seen or to any one of its constituents being seen, and thus serve as an "if I see A, B, or C, I'll treat them as all the same thing" kind of operation.<br /><br />One thing I had originally envisioned but never implemented is the concept of "don't care". If you look at the source code, you'll notice each dendrite has not only an "expectation", but also a "care" property. The idea was that care would be a value from 0 to 1. Multiplying the match strength by the "care" value would effectively mean that the less a dendrite cares about the input value, the less likely it would be to contribute positively or negatively to the neuron's overall match strength. I was impressed enough with the results of the algorithm without this that I never bothered exploring it further. Honestly, I don't even know quite how I would use it. I had assumed that a neuron could strongly learn some pattern's essential parts and learn to ignore nonessentials by observing that certain parts of a recurring pattern themselves don't recur. But that simply led me to wonder how a neuron bank would decide whether to allocate two or more neurons for pattern variants or to allocate a single neuron with those variants ignored. There's still room to explore this concept further, as it seems almost intuitively like something our own brains would do.<br /><br /><br /><a name="00000042_More"></a><h2>More to explore</h2> This is obviously not the end of this concept for me. 
I think one logical next area of exploration will be hierarchy. I also want to see what, if anything, even the current arrangement can learn when it is exposed to "real world" data. Even with noise added, the truth is I'm just feeding this thing carefully crafted, strong patterns that seem of dubious relation to the messy sensory world we inhabit.<br /><br />I certainly welcome others to dabble in this concept as well. You can play with this sample program yourself. The .config file gives you control over a bunch of factors, you can supply your own source-patterns graphic, and the program's user interface is fairly easy to extend for other experiments. The NeuronBank class and all of its lower level parts are very self-contained and independent of the UI, which means they can easily be applied in other ways without the need for this or even any user interface. And the core code is surprisingly lightweight (only 3 classes) and heavily commented, so it should be easy to study and even reproduce in other environments.<br /><br />So we'll see what's next.<br /><br /><br /><a name="00000042_Algorithm"></a><h2>The nuts and bolts of the algorithm</h2> I've tried to describe the concepts of the Pattern Sniffer demonstration program in plain English and with visuals, but it's worthwhile to go into more detail for people more interested in the details of how this algorithm actually works. 
I'll ignore the UI and test program and focus exclusively on the neuron bank and its constituent parts.<br /><br />Following is a list of the classes and their essential public members:<br /><br /><ul> <li>NeuronBank:</li> <ul> <li>Inputs As List(Of Single)</li> <li>Neurons As List(Of Neuron)</li> <li>New(InputCount, NeuronCount)</li> <li>Brainwash()</li> <li>Ponder()</li> </ul><br /> <li>Neuron:</li> <ul> <li>Bank As NeuronBank</li> <li>Dendrites As List(Of Dendrite)</li> <li>MatchStrength As Single</li> <li>Confidence As Single</li> <li>New(Bank, ListIndex, DendriteCount)</li> <li>Brainwash()</li> <li>PonderStep1()</li> <li>PonderStep2()</li> </ul><br /> <li>Dendrite:</li> <ul> <li>ForNeuron As Neuron</li> <li>InputIndex As Integer</li> <li>Expectation As Single</li> <li>MatchStrength As Single</li> <li>New(ForNeuron, InputIndex)</li> <li>Brainwash()</li> </ul> </ul><br /><br />Next is the algorithm for behavior. Aside from basic maintenance like the .Brainwash() methods, there really is only one operation that the neuron bank and all its parts perform. Each "moment", the input values are set and the neuron bank "ponders" the inputs. Here's a pseudo-code summary of how it works. All the methods and properties have been mashed into one chunk to make it easier to read the process in a linear fashion. 
Here's the short version:<br /><br /><pre><br /> Loop endlessly<br /> <br /> Set values in Bank.Inputs (each value is a single floating point number from -1 to 1)<br /> <br /> Sub Bank.Ponder()<br /> For Each N in Me.Neurons<br /> N.PonderStep1() (Measure the strength of my own match to the current input.)<br /> Next N<br /> For Each N in Me.Neurons<br /> N.PonderStep2() (Adjust my confidence level and dendrite expectations.)<br /> Next N<br /> End Sub<br /> <br /> For Each N In Bank.Neurons<br /> Do something with N.MatchStrength<br /> Next<br /> <br /> Continue looping<br /></pre><br /><br />And now the more detailed version, fleshing out PonderStep1() and PonderStep2():<br /><br /><pre><br /> Loop endlessly<br /> <br /> Set values in Bank.Inputs (each value is a single floating point number from -1 to 1)<br /> <br /> Sub Bank.Ponder()<br /> For Each N in Me.Neurons<br /> <br /><div style="<br /> border: solid black; <br /> border-width: 1px;<br />"> Sub N.PonderStep1()<br /> <font color="green">'Measure the strength of my own match to the current input.</font><br /> <br /> <font color="green">'Add up all the dendrite strengths.</font><br /> For Each D in Me.Dendrites<br /> Strength = Strength + D.MatchStrength<br /> <br /> Function D.MatchStrength() As Single<br /> Input = ForNeuron.Bank.Inputs(Me.InputIndex)<br /> <br /> Strength = 1 - AbsoluteValue(Input - m_Expectation)<br /> Strength = Strength / 6<br /> <br /> <font color="green">'Penalize strongly mismatched values.</font><br /> If Strength < 0 Then<br /> Strength = Strength * ForNeuron.Confidence * 6<br /> End If<br /> <br /> Return Strength<br /> End Function D.MatchStrength()<br /> <br /> Next D<br /> <br /> <font color="green">'Divide the total to get the average dendrite strength.</font><br /> Strength = Strength / DendriteCount<br /> <br /> <font color="green">'Maybe I am the new best match.</font><br /> If Strength > Bank.BestMatchValue Then<br /> Bank.BestMatchValue = Strength<br /> Bank.BestMatchIndex 
= Me.ListIndex<br /> End If<br /> <br /> Me.MatchStrength = Strength<br /> End Sub N.PonderStep1()</div><br /> Next N<br /> For Each N in Me.Neurons<br /> <br /><div style="<br /> border: solid black; <br /> border-width: 1px;<br />"> Sub N.PonderStep2()<br /> <font color="green">'Adjust my confidence level and dendrite expectations.</font><br /> <br /> If Me.ListIndex = Bank.BestMatchIndex Then <font color="green">'I have the best match</font><br /> <br /> <font color="green">'Boost my confidence a little.</font><br /> Me.Confidence = Me.Confidence + 0.8 * Me.MatchStrength<br /> If Me.Confidence > 0.9 Then Me.Confidence = 0.9 <font color="green">'Maximum possible confidence.</font><br /> <br /> For i = 0 To Me.Dendrites.Count - 1<br /> D = Me.Dendrites(i)<br /> Input = Bank.Inputs(i)<br /> <br /> <font color="green">'How far away is this dendrite's value from what's expected?</font><br /> Delta = Input - D.Expectation<br /> <br /> <font color="green">'The more confident I am, the less I want to deviate from my current expectation.</font><br /> Delta = Delta * (1 - Me.Confidence)<br /> D.Expectation = D.Expectation + Delta<br /> Next i<br /> <br /> Else <font color="green">'I don't have the best match</font><br /> <br /> <font color="green">'I should lose confidence more when no other neuron has a strong match.</font><br /> Me.Confidence = Me.Confidence * 0.001 * (1 - Bank.BestMatchValue)<br /> If Me.Confidence < 0.05 Then Me.Confidence = 0.05 <font color="green">'Minimum possible confidence.</font><br /> <br /> For i = 0 To Me.Dendrites.Count - 1<br /> D = Me.Dendrites(i)<br /> Input = Bank.Inputs(i)<br /> If Bank.BestMatchValue - Me.MatchStrength <= 0.1 Then<br /> <font color="green">'I must be pretty close to the current best match.</font><br /> <br /> <font color="green">'Get more random.</font><br /> D.Expectation = D.Expectation + RandomPlusMinus(0.05) * (1 - Me.Confidence)<br /> <br /> Else <font color="green">'I don't strongly match the current 
input.</font><br /> <br /> <font color="green">'How far away is this dendrite's value from what's expected?</font><br /> Delta = Input - D.Expectation<br /> <br /> <font color="green">'The more confident I am, the less I want to deviate from current expectation.</font><br /> Delta = Delta * (1 - Confidence)<br /> <br /> <font color="green">'Get a little closer to the current input value.</font><br /> D.Expectation = D.Expectation + RandomPlusMinus(0.00001) * Delta * 0.2<br /> End If<br /> Next i<br /> <br /> End If <font color="green">'Do I have the best match or no?</font><br /> <br /> End Sub N.PonderStep2()</div><br /> <br /> Next N<br /> End Sub<br /> <br /> For Each N In Bank.Neurons<br /> Do something with N.MatchValue<br /> Next<br /> <br /> Continue looping<br /></pre><br /><br />It might be entertaining to try to boil this down to a few lengthy mathematical formulas, but I usually find those more intimidating than helpful.Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-33558029961613072692007-04-07T12:00:00.000-07:002016-10-04T11:50:29.263-07:00A respectful critique of the Hierarchical Temporal Memory (HTM) concept[<a href="http://jimcarnicelli.com/ai/blog/audio/blog_00000041.mp3">Audio Version</a>]<br /><br />I've been away from this too long, distracted by other things in my life. I've missed it. Lately, I've been finding myself getting excited again to the point of getting distracted from those other things and back in this world.<br /><br />The most interesting development in the world of artificial intelligence of late, to my thinking, is the recent release of <a href="http://www.numenta.com/">Numenta's</a> Hierarchical Temporal Memory algorithm, the brainchild largely of Dileep George and inspired largely by Jeff Hawkins, author of <a href="http://www.onintelligence.org/">On Intelligence</a>. 
Having been so disappointed by artificial neural networks, expert systems, and various other "traditional" approaches to AI, I found the ideas presented by Hawkins refreshing and exciting, so I joined Numenta's mailing list and eagerly awaited the arrival of its promised products.<br /><br />Now that the <a href="http://www.numenta.com/for-developers/software.php">NuPIC</a> platform and related tools have been released, Numenta has also authored various white papers on how it actually works. In refreshing contrast to the mind-numbing gibberish of some proprietary systems' (e.g., PILE's) white papers and math-heavy tomes on Bayesian networks and neural networks, these documents present a clearly understandable description of what HTMs actually do and how they do it. The one I found most penetrating was coauthored by Dileep George and titled <a href="http://www.numenta.com/for-developers/education/Numenta_HTM_Learning_Algos.pdf">The HTM Learning Algorithms</a>. So far, this is the best document I have read on the subject, though admittedly, it helps to be familiar with the HTM concept at a high level.<br /><br />I am about halfway through reading this 44-page PDF. I had to stop partly because I couldn't focus on it any more: I'm distracted by my own work and, frankly, inspired by what I've found in this document. I finally "get it" -- how an HTM learns -- which I've been missing for the whole time I've been aware of HTMs. But to my surprise, some troubling questions have already formed in the process, and I want to document them before I forget. I want to pose them here to help further the discussion of the value of HTMs and perhaps promote their improvement.<br /><br />Section 4 describes how an HTM node is exposed to a continuously changing stream of data and learns to recognize "causes". In this example, however, there are very tight constraints.
The application used is called "Pictures" and involves learning to recognize pure black and white line drawings of simple symbols like letters and coffee cups. This section focuses on learning in the first layer, in which each HTM node can see a 4x4 grid of B&W pixels. The sample drawings used are all composed of very simple elements like vertical or horizontal lines, "L" joints, "T" joints, "Z" folds, and line ends. In order to make sure the HTM properly learns to recognize these constructs in many situations, this HTM is exposed to examples of each in many positions in its 4x4 visual field. This is done by showing it (and all the other HTMs in this level) "movies" of the archetype drawings moving in various directions and at different scales (zoom factors).<br /><br />Now, I know it's important to reduce a general problem to a narrower problem in order to help test, quantify, and explain a concept. So I'm willing to suspend a little skepticism. But as I read on about the nuts and bolts, this came back to bug me again. In order to learn to recognize that many variations of a pattern all represent the same pattern, HTMs rely critically on a temporal component for learning. Let's say in moment T1, the node is exposed to a picture of an "L" joint and in moment T2, it's the same "L" joint, but shifted to the right one pixel. The fact that these two distinct patterns were seen in adjacent time steps suggests they have the same "cause" and so get lumped in together. Later, when the HTM sees either of these two versions of the "L" joint, it will report them as the same thing, which is super cool.<br /><br />But here's one problem. Before an HTM can even begin noticing that the two "L" joint patterns appear one after the other, it's necessary for the HTM to undergo a "long" learning process just to recognize the distinct patterns, which here are called "quantization points".
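<br /><br />To make this first phase concrete, here is a minimal Python sketch of quantization-point collection as I understand it from the paper. The function name and the encoding of each 4x4 patch as a 16-character string of pixels are my own illustration, not Numenta's code:

```python
# Sketch of the quantization-point collection phase described above.
# Each frame is one 4x4 black-and-white patch, encoded here as a
# 16-character string of '0'/'1' pixels (my own encoding, not Numenta's).

def collect_quantization_points(frames):
    """Record every unique pixel pattern seen during the training movies."""
    points = []   # ordered list of distinct patterns ("quantization points")
    seen = set()
    for frame in frames:
        if frame not in seen:
            seen.add(frame)
            points.append(frame)
    return points

# A vertical line drifting one pixel to the right per frame:
movie = [
    "1000" "1000" "1000" "1000",
    "0100" "0100" "0100" "0100",
    "0010" "0010" "0010" "0010",
]
qp = collect_quantization_points(movie + movie)  # replaying adds nothing new
```

With clean, rectilinear drawings this list stays small; with curves, angles, and dirt it would balloon toward the 2^16 ceiling for a 4x4 binary patch, which is exactly the scaling worry here.<br /><br />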
In the learning process, the HTM is exposed to a long series of these "movies" of all the sample images moving around relative to the HTMs. In that process, all unique pixel patterns a level 1 HTM is exposed to are recorded before it moves on to learning which ones are related to one another. Every single pattern! Now, with a 4x4 black and white grid, there may be up to 2^(4x4), or 65,536, unique patterns. Since the source data fed into this program is limited to these very clean, rectilinear patterns, the actual number of unique quantization points recorded in this first phase is only 150. If there were curves, different angles, and "dirt" in the source images, the number would clearly be much higher. Honestly, this leaves a bad taste in my mouth: I can't imagine that gathering up every example in rich source data is a good prerequisite for beginning to classify things, nor a resource-responsible way to do it.<br /><br />Now, one of the points of an HTM in this Pictures application is that it can learn to recognize that all "L" joints are the same thing without any prior knowledge of that. The key ingredient in the HTM recipe is this temporal coincidence. So once all 150 distinct mini-patterns, or quantization points, have been identified by watching the source images moving around in various directions against the field of view, the next step is to construct a 150 x 150 matrix initialized with all zeros. The rows and columns both represent each of the quantization points, but one represents seeing one in T1 and the other represents seeing it in T2. So let's say quantization point Q1 represents an "L" joint and Q2 represents another "L" joint shifted one pixel to the right of Q1. As the movie progresses from T1 to T2, we find the cell in the matrix where row Q1 and column Q2 meet and we add 1 to it.
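<br /><br />That counting step is simple to sketch. Assuming the movie has already been reduced to a sequence of quantization-point indices, a rough Python illustration (my own naming, not Numenta's code) might look like this:

```python
# Sketch of the time-adjacency matrix described above. The movie is assumed
# to have already been reduced to a sequence of quantization-point indices;
# each transition from the point seen at T1 to the point seen at T2 bumps
# one cell of the matrix.

def time_adjacency_matrix(sequence, n_points):
    """Count how often point q1 at one moment is followed by q2 at the next."""
    matrix = [[0] * n_points for _ in range(n_points)]
    for q1, q2 in zip(sequence, sequence[1:]):
        matrix[q1][q2] += 1
    return matrix

# Point 0 (an "L" joint) is usually followed by point 1 (its shifted twin):
m = time_adjacency_matrix([0, 1, 2, 0, 1, 0, 1], n_points=3)
```

After many movies, a few cells (like m[0][1] here) accumulate large counts while most of the matrix stays at zero.<br /><br />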
After a lot of this process, we end up with a matrix that has very high numbers in a few cells that represent lots of coincidences of quantization points in time, like our two L joints and a large portion of the matrix still with zeros. The reason for doing this is that there must be some way to say that Q1 and Q2 are related; that's the point of an HTM, and coincidence in time seems a good way.<br /><br />An HTM has a finite number of outputs, each of which represents a "cause". The developer gets to decide the number. The more there are, in theory, the more nuanced the known causes can be. The next step of the learning process, then, is to decide what those causes are. Let's say for example there can be at most 10 "causes" that can be output. The 150 quantization points each get assigned to one of these 10 causes in a process that's a bit hard to understand. It's probably best to read section 4.2.2, "Forming temporal groups by partitioning the time-adjacency matrix", for a precise explanation. But one summary way of explaining it is that this algorithm starts at one quantization point that has the highest number of temporal connections (as represented in the 150 x 150 matrix) to others and follows along the really strong connections to other quantization points, lumping them together into one group. In theory, the connections branching out get sufficiently weak that the algorithm stops following them. Then it moves on to the next remaining quantization point that has the highest value in the matrix and continues on (ignoring all other quantization points that have already been grouped). This continues until either all quantization points with connections above a certain threshold are exhausted or we run out of groups (our maximum of 10 causes). The authors point out that this is not the only way to do grouping, but it's a pretty ingenious way to quickly allocate causes.<br /><br />This learning algorithm is truly ingenious. I love it. And yet it bothers me, too. 
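<br /><br />Before getting to what bothers me, here is roughly how I picture the grouping step in code. This is a deliberately simplified, one-hop sketch of the idea in section 4.2.2 -- my own construction, not Numenta's actual algorithm:

```python
# A rough sketch of the greedy grouping step described above: repeatedly
# seed a group with the ungrouped point having the strongest total
# adjacency, then pull in its strongly connected neighbors. This is a
# one-hop simplification of section 4.2.2, not Numenta's exact algorithm.

def group_points(matrix, max_groups, threshold=1):
    n = len(matrix)
    # Symmetrize: temporal adjacency in either direction counts as a link.
    link = [[matrix[i][j] + matrix[j][i] for j in range(n)] for i in range(n)]
    ungrouped = set(range(n))
    groups = []
    while ungrouped and len(groups) < max_groups:
        # Seed with the ungrouped point having the most temporal connections.
        seed = max(sorted(ungrouped), key=lambda i: sum(link[i]))
        group = {seed}
        for j in ungrouped - {seed}:
            if link[seed][j] >= threshold:
                group.add(j)
        groups.append(sorted(group))
        ungrouped -= group
    return groups

# Points 0 and 1 (two shifted "L" joints) co-occur in time; point 2 doesn't:
causes = group_points([[0, 5, 0], [4, 0, 0], [0, 0, 0]], max_groups=2)
```

The real algorithm follows chains of strong connections outward until they weaken, rather than stopping one hop from the seed, but the flavor is the same: each resulting group becomes one "cause" the node can output.<br /><br />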
For one thing, this specific algorithm only cares about the coincidence of patterns from one discrete moment to the next. For another, its performance seems to rely very strongly on tight constraints on the data. As the data is allowed to become less constrained -- going from perfect right angle lines to allowing curves, allowing thicker lines, allowing dirty data, rotating in 3D, allowing grey scales or colors, and so on -- the number of quantization points and time to learn must grow exponentially. "Real" data would probably quickly deluge such a system as this with quantization points.<br /><br />I'm especially bothered by the fact that each HTM requires an exhaustive learning period where it discovers all its quantization points before it moves on to start learning how they are causally related. And then this phase requires another exhaustive learning period where it discovers all the two-moment temporal relations among quantization points before it moves on to try to group the quantization points -- distinct input patterns -- into proximal causes which are then the main output of an HTM.<br /><br />Further, while I recognize the value of showing a picture of a cat in many different "orientations" using these movies as a proxy for seeing lots of actual cats, I'm bothered by the idea that the movies are required for this algorithm to learn about cats. I would think that an algorithm that learns to distinguish cats as a group should be able to see lots of single, still pictures of animals of all sorts, including a cat. Heck, if I had 10 pictures of different animals and ten neurons (or HTMs), I should be able to repeatedly show each of my 10 pictures at random with different scales and orientations and have my neurons learn to align themselves to each of the 10 animals, yet the HTMs aren't going to work this way, unless I wiggle the pictures around. 
Why this curious requirement?<br /><br />Now, in defense of HTMs, I would point out that Jeff does not see this first generation of them as the end goal, but just a first prototype that illustrates the concept. I think he would quickly agree that the learning algorithm will continue to evolve. Not only will it become more efficient and perform faster as generations of engineers learn to apply and enhance them, but they will also come to be more robust. In fairness, I don't see that the quantization process necessarily has to happen before finding temporal relations occurs. They could happen in real time. Also, the prediction part need not wait until after learning. Also, the little right-angle black and white line drawings are not a necessity. Nor are temporal patterns relying on discrete two-step time periods. None of my complaints here represents a "gotcha", I think.<br /><br />I have more to read, and I may take an opportunity to try coding this to reproduce this experiment and explore it more. We'll see. I have my own experiment that I started, inspired by my read of On Intelligence, which I have to start fleshing out, though. In the meantime, I'm likely to continue to comment on HTMs as I learn more. I still think they represent the most significant new concept in artificial intelligence in several decades.Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0tag:blogger.com,1999:blog-6262682529872030736.post-59412966905209348242005-11-10T12:00:00.000-08:002016-10-04T11:51:17.435-07:00Neuron banks and learning[<a href="http://jimcarnicelli.com/ai/blog/audio/blog_00000040.mp3">Audio Version</a>]<br /><br />I've been thinking more about perceptual-level thinking and how to implement it in software. In doing so, I've started formulating a model of how cortical neural networks might work, at least in part. I'm sure it's not an entirely new idea, but I haven't run across it in quite this form, so far. 
<br /><br />One of the key questions I ask myself is: how does human neural tissue learn? And, building on Jeff Hawkins' <a href="http://jvcai.blogspot.com/2005/04/review-of-on-intelligence.html">memory-prediction model</a>, I came up with at least one plausible answer. First, however, let me say that I use the term "neuron" here loosely. The mechanisms I ascribe to individual neurons may turn out to be more a function of groups of them working in concert. <br /><br />Let me start with the notion of a group of neurons in a "neural bank". A bank is simply a group of neurons that are all looking at the same inputs, as illustrated in the following figure: <br /><br /><center><img alt="Figure: Schematic view of neuron bank." src="http://jimcarnicelli.com/ai/blog/attachments/00000040_NeuronBank.png"/></center><br /><br />Perhaps it's a region of the input coming from the auditory nerves. Or perhaps it's looking at more refined input from several different senses. Or perhaps even a more abstract set of concepts at a still higher level. It may not be that there are large numbers of neurons that all look at the same chunk of inputs -- it may be more messy than that -- but this is a helpful idea, as we'll soon see. Further, while I'll speak of neural banks as though they all fall into a single "layer" in the sense that traditional artificial neural networks are arranged, it's more likely that this neural bank idea applies to an entire patch of 6-layered cortical tissue in one's brain. Still, I don't want to get mired in such details in this discussion. <br /><br />Each neuron in a bank is hungry to contribute to the whole process. In a naive state, they might all simply fire, but such a cacophony would probably be counterproductive. In fact, our neural banks could be hard-wired to favor having a minimal number of neurons in a bank firing at any given time -- ideally, zero or one. 
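<br /><br />In the crudest possible sketch (my own framing, not a claim about real cortex), the bank's preference amounts to a winner-take-all selection over its neurons' match strengths:

```python
# A minimal sketch (my own framing) of the bank's preference for having at
# most one neuron fire: treat it as winner-take-all over match strengths.

def bank_response(match_strengths):
    """Return (index, strength) of the single neuron allowed to fire."""
    best = max(range(len(match_strengths)), key=lambda i: match_strengths[i])
    return best, match_strengths[best]

winner, strength = bank_response([0.2, 0.9, 0.4])  # neuron 1 wins
```

The suppression I describe next is more graded than this hard cutoff; this is just the limiting "exactly one fires" case.<br /><br />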
So each neuron is eager to fire, but the bank, as a whole, doesn't want them to fire all at once. <br /><br />These two forces act in tension to balance things out. How? Imagine that each neuron in a bank is such that when it fires, its signal tends to suppress the other neurons in the bank. Suppress how? Two ways: firing and learning. When a neuron is highly sure that it is perceiving a pattern it has learned, it fires very strongly. Other neurons that may be firing because they have weak matches would be self-silenced by these louder neurons, on the assumption that the louder neurons must have more reason to be sure of the patterns they perceive. Consider the following figure, modified from above to show this feedback: <br /><br /><center><img alt="Figure: Neuron bank with feedback from neighbors." src="http://jimcarnicelli.com/ai/blog/attachments/00000040_Feedback.png"/></center><br /><br />But what about learning? What does a neuron learn and why would we want other neurons to suppress it? First, what is learned by a neuron is one or more patterns. For simplicity, let's say it's a simple, binary pattern. For each of a neuron's dendritic synapses looking at input from outside axons, we'll say it either cares or doesn't care and, if it does, it prefers either a firing or not-firing value. The following figure illustrates this, schematically: <br /><br /><center><img alt="Figure: Detail of a synapse." src="http://jimcarnicelli.com/ai/blog/attachments/00000040_Synapse.png"/></center><br /><br />Following is a logical behavior table.
It is the negation of a logical exclusive or (XOR) -- in other words, an equivalence (XNOR) test: <br /><br /><center><table cellspacing="0" cellpadding="2" border="1"><tr><td>Preferred Input</td><td>Actual Input</td><td>Matches</td></tr><tr><td>0</td><td>0</td><td>Yes</td></tr><tr><td>0</td><td>1</td><td>No</td></tr><tr><td>1</td><td>0</td><td>No</td></tr><tr><td>1</td><td>1</td><td>Yes</td></tr></table></center><br /><br />Let's describe the desired input pattern in terms of a string of zeros (not firing), ones (firing), and exes (don't care). For example, a neuron might prefer to see "x x 0 x 1 0 x 1 0 0 x 0 x x 1". When it sees this exact pattern, it fires strongly. But maybe all but one of the inputs it cares about fit. It still fires, but not as strongly. If another neuron is firing more strongly, this one shuts up. <br /><br />That's what's learned but not how it's learned. Let's consider that more directly. A neuron that fires on a regular basis is "happy" with what it knows. It's useful. It doesn't need to learn anything else, it seems. But what about a neuron that never gets a chance to fire because its pattern doesn't match much of anything? I argue that this "unhappy" neuron wants very much to be useful. It searches for novel patterns. What does this mean? There are many possible mechanisms, but let's consider just one. We'll assume all the neurons started out with random synaptic settings (0, 1, or x). Now let's say there's a certain combination of inputs for which no neuron in the bank shouts out to say "I got this one". Some of these neurons see that some of the inputs do match. These are inclined to believe that this input is probably a pattern that can be learned, so they change some of their "wrong" settings to better match the current input. The stronger the existing match is for a given unhappy neuron, the more changes that neuron is likely to make to conform to this new input.
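<br /><br />This matching-and-nudging rule can be sketched in a few lines of Python. The ternary pattern encoding ('0', '1', 'x') follows the description above; the exact scoring and the one-flip-per-step nudge are my own simplifications:

```python
# Sketch of the ternary pattern idea above: a neuron's preferences are a
# string of '0' (expect silent), '1' (expect firing), and 'x' (don't care).
# The scoring and the one-flip-per-step nudge are my own simplifications.

def match_strength(pattern, inputs):
    """Fraction of cared-about positions where the input agrees."""
    cared = [(p, i) for p, i in zip(pattern, inputs) if p != "x"]
    if not cared:
        return 0.0
    return sum(1 for p, i in cared if p == i) / len(cared)

def nudge_toward(pattern, inputs):
    """An 'unhappy' neuron flips one wrong setting to better fit the input."""
    out = list(pattern)
    for k, (p, i) in enumerate(zip(pattern, inputs)):
        if p != "x" and p != i:
            out[k] = i
            break
    return "".join(out)

perfect = match_strength("x10x1", "11011")  # all cared-about inputs agree
partial = match_strength("x10x1", "10011")  # one mismatch out of three
learned = nudge_toward("x10x1", "10011")    # flips the mismatched setting
```

A fuller version would scale how many settings get flipped by how strong the partial match already is, as described above.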
<br /><br />Now let's say this particular combination of input values (0s and 1s) continues to appear. At least one neuron will continue to grow ever more biased towards matching that pattern until eventually it starts shouting out like the other "happy" neurons do. <br /><br />This does seem to satisfy a basic definition for learning. But it does leave many questions unanswered. One is: how does a neuron decide whether or not to care about an input? I don't know for sure, but here's one plausible answer. A neuron -- whether "happy" or "unhappy" with what it knows -- can allow its synaptic settings to change over time. Consider a happy one. It continues to see its favored pattern and fires whenever it does. Seeing no other neurons contending for being the best at matching its pattern, it is free to continue learning in a new way. In particular, it looks for patterns at the individual synapse level. If one synaptic input is constantly the same value whenever this one fires, it favors setting that synapse to "do care". If, conversely, it changes with some regularity, this neuron will favor setting that one to "don't care". <br /><br />Interestingly, this leads to a new set of possible contentions and opportunities for new knowledge. One key problem in conceptualization is learning when to recognize that two concepts should be merged and when one concept should be subdivided into other narrower ones. When do you learn to recognize that two different dogs are actually part of the same group of objects called "dogs"? And why do you decide that a chimpanzee, which looks like a person, is really a wholly new kind of thing that deserves its own concept? <br /><br />Imagine that there is one neuron in a bank of them that has mastered the art of recognizing a basset hound dog. And let's say that's the only kind of dog this brain has ever seen before. It has seen many different bassets, but no other breed.
This neuron's pattern recognition is greedy, seeing all the particular facets of bassets as essential to what dogs are all about. Then, one day, this brain sees a Doberman pinscher for the first time. To this neuron, it seems very like a basset, but there are enough features to be doubtful. Still, nobody else is firing strongly, so this one might as well, considering itself to have the best guess. This neuron is strongly invested in a specific kind of dog, though. It would be worthwhile to have another neuron devoted to recognizing this other kind of dog. What's more, it would be valuable to have yet another neuron that recognizes dogs more generally. How would that come about? <br /><br />In theory, there are other neurons in this bank that are hungry to learn new patterns. One of them could see the lack of a strong response from any other neuron as an opportunity to learn either the more specific Dobie pattern or the more general dog pattern. <br /><br />One potential problem is that the neurons that detect more specific features -- bassets versus all dogs, for example -- might tend to make more general concepts like "dog" go away. There must be some incentive to keep the general concept around. One explanation could be frequency. The dog neuron might not have as many matching features to consider as the basset neuron does, but if this brain sees lots of different dogs and only occasionally bassets, the dog neuron would get exercised more frequently, even if it doesn't shout the loudest when a basset is seen. So perhaps both frequency and strength of matching are strong signals to a neuron that it has learned well. <br /><br />I have no doubt that there's much more to learning and to the neocortex more generally. Still, this seems a plausible model for how learning could happen there.Jim Carnicellihttp://www.blogger.com/profile/06452341024988711478noreply@blogger.com0