Coherence and ambiguities in problem solving
Natural Language Processing (NLP) is a big topic, one I come back to again and again when I have time to explore it. Work and a move have kept me very busy, but in recent months I've returned to the topic and made some interesting progress constructing an "NLP pipeline". As anyone who has done NLP work will tell you, English is full of ambiguities. They may also tell you about the approaches they take to reduce the ambiguity and be decisive in the end. But the ambiguities persist and can't simply be guessed away.
More importantly, the ambiguities often require us as humans to look across levels of interpretation to resolve them. I generally have not found AI researchers offering a good way of doing this.
To illustrate my problem, consider this sentence:
Some guys' shoes' laces are red.
As someone literate in English you have no problem interpreting it. But odds are good you can guess where the ambiguity lies. Looked at in isolation, each of the apostrophes can be interpreted in at least three ways: as part of a possessive plural, as the start of some text surrounded by single quotes, or as the end of such single-quoted text. What leads you, the reader, to conclude it's the plural possessive "guys' "? You might argue that s-apostrophe always indicates a plural possessive, but I think you also look at where the apostrophes appear. Consider this alternative but nearly identical sentence:
Some guys 'shoes' laces are red.
This should bother you. Why? Grammatically it makes no sense. The thing is, a typical NLP pipeline does not look at text the way you and I do. We actually look for meaning and realize it doesn't make sense. But what if you looked only at the structure and not at the meaning? The same options are available for each of the apostrophes as above, but now your first guess is that the apostrophes are definitely single quotes surrounding "shoes", as though the statement were sarcastically referring to something as "shoes". As a human you would correct this in your head and maybe point out to the author the mistaken placement of the first apostrophe.
What's going on here? One way of looking at this mechanically is seeing a one-way pipeline of interpretation that starts with raw text as input. The first component parses out the tokens. It passes them on to a second component that finds quoted text, parentheticals, and other logical groupings. It passes the now grouped segments of tokens on to a third component that tries to find meaning in them. But it should be apparent from the above example that in order to even find the tokens correctly you may need the correct meaning, which in turn relies on the correct tokens: a chicken-and-egg problem. If you accept that there is a most logical interpretation, then you'll agree that "shoes' " is a single word token in both versions of the sentence. But the tokenizer on its own cannot conclude this correctly, and neither can the "grouper" component.
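To make the problem concrete, here is a tiny Python sketch of a purely structural tokenizer. The regex rules are hypothetical, invented just for this illustration: an apostrophe glued to a word stays with the word, while a free-standing apostrophe before a word reads as an opening quote. Moving one space changes the tokens entirely, even though a human reads the same meaning:

```python
import re

# A naive, purely structural tokenizer (hypothetical rules for illustration):
# a word may carry a trailing apostrophe, or an apostrophe may open a
# quoted word; any other non-space symbol is its own token.
def naive_tokenize(text):
    return re.findall(r"\w+'?|'\w+'?|[^\w\s]", text)

print(naive_tokenize("Some guys' shoes' laces are red."))
# ['Some', "guys'", "shoes'", 'laces', 'are', 'red', '.']
print(naive_tokenize("Some guys 'shoes' laces are red."))
# ['Some', 'guys', "'shoes'", 'laces', 'are', 'red', '.']
```

Neither output is "wrong" structurally, which is exactly the point: structure alone cannot pick the coherent reading.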
For years I've been puzzling over how to get discrete pieces of an NLP process to collaborate to resolve ambiguities. A generalized solution would revolutionize AI for sure. I won't say I have found the answer. But I think I may have stumbled upon a way of structuring problem solving of this sort.
Today I started thinking of this in somewhat new terms that help. I realized that what I need is an algorithm that can entertain different interpretations of ambiguous data. In the past I've run into the problem of exploding combinatorics even with a simple tokenizer: if I create a branching tree of all the possible interpretations of a small paragraph of text, I can quickly construct a tree with millions of leaf nodes. Needless to say this gets slow and memory-intensive, and it still leaves you with the need to find the best interpretation.
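The explosion is easy to quantify. If each ambiguous point admits k interpretations and a paragraph contains n such points, the full tree has k to the power n leaves. Even modest numbers blow up fast:

```python
# Leaves in a full interpretation tree: k ** n, where k is the number of
# readings per ambiguity and n is the number of ambiguous points.
for n in (5, 10, 20):
    print(n, 3 ** n)
# 5 243
# 10 59049
# 20 3486784401
```

Twenty three-way ambiguities already yield over three billion complete interpretations, which is why enumerating them all is a dead end.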
I started thinking today in terms of seeking a “coherent interpretation”, or “coherence”. It occurred to me that it is not necessary to consider every possibility. That it could be worthwhile to just identify possible ambiguities along the way and keep track of them. But to then seek one or a small number of most coherent interpretations. And then move on, keeping track of these. Only if a later stage in the pipeline concludes that there is a lack of coherence should we backtrack and revisit some of the alternatives in hopes of finding a more coherent bigger picture.
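One way I picture this mechanically is a priority queue of alternatives ordered by coherence score. We commit to the best-scoring interpretation and only pop the next one if a later stage rejects the current choice. This is a rough sketch of that idea, not a finished design; the scores and the `is_coherent` check stand in for whatever a real pipeline would supply:

```python
import heapq

# Keep alternatives in a max-priority queue (negated scores, since heapq
# is a min-heap). Only revisit lower-ranked options when the current best
# turns out to be incoherent.
def most_coherent(alternatives, is_coherent):
    # alternatives: list of (score, interpretation), higher score = better
    heap = [(-score, interp) for score, interp in alternatives]
    heapq.heapify(heap)
    while heap:
        _, interp = heapq.heappop(heap)
        if is_coherent(interp):
            return interp
    return None  # no coherent interpretation found

alts = [(2, "closing quote"), (1, "plural possessive")]
# Suppose a later stage finds no opening quote to match a closer:
print(most_coherent(alts, lambda i: i != "closing quote"))
# plural possessive
```

The key property is that the cheaper alternatives are tracked but never expanded unless backtracking actually demands it.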
I realized that one way to make use of this is to embrace ambiguities. The most recent version of my tokenizer relies on a set of named regular expression definitions for words. I did not include the possessive case where a word ends in an apostrophe because I knew that this needed to be resolved by a later stage. By this thinking I absolutely should represent that case in the tokenization rules. But I should make sure that my tokenizer can recognize that there are at least two possible interpretations at this word boundary.
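Here is a minimal sketch of what "embracing ambiguity" could look like in a tokenizer. The rule set is hypothetical and far simpler than my real named-regex definitions: a token ending in s-apostrophe is emitted with both of its candidate readings attached, instead of being forced into one:

```python
import re

# Hypothetical rule: a word ending in s-apostrophe is ambiguous between a
# plural possessive and a word followed by a quote mark. The tokenizer
# records both readings rather than deciding.
S_APOSTROPHE = re.compile(r"[A-Za-z]+s'")

def tokenize_with_ambiguity(text):
    tokens = []
    for raw in text.split():
        if S_APOSTROPHE.fullmatch(raw):
            tokens.append({"text": raw,
                           "readings": ["plural_possessive",
                                        "word_then_quote"]})
        else:
            tokens.append({"text": raw, "readings": ["plain"]})
    return tokens

for tok in tokenize_with_ambiguity("Some guys' shoes' laces are red."):
    print(tok)
```

A later stage can then choose among the recorded readings instead of re-tokenizing from scratch.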
What’s more, I realized that this is an opportunity for a learning algorithm to get involved. When a component recognizes that there are two or more interpretations of some data, I could store this fact and start keeping a tally of the interpretations that are ultimately accepted as part of the most coherent interpretations over time. Then the most commonly correct interpretations can be favored as the first guess, improving the performance and accuracy of later decision making. If the algorithm finds that the s-apostrophe case is a plural possessive noun 90% of the time, then that will be its first guess going forward.
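The tally idea is simple enough to sketch directly. The class and method names below are illustrative, not a fixed API: record which reading of each ambiguity class survived into the final coherent interpretation, then rank readings by observed frequency so the most commonly accepted one is tried first:

```python
from collections import Counter

# Tally which reading of each ambiguity class is ultimately accepted,
# then rank readings by frequency for use as the default guess order.
class AmbiguityStats:
    def __init__(self):
        self.tallies = {}  # ambiguity class -> Counter of accepted readings

    def record(self, ambiguity, accepted_reading):
        self.tallies.setdefault(ambiguity, Counter())[accepted_reading] += 1

    def ranked_readings(self, ambiguity):
        # Most frequently accepted reading first.
        counts = self.tallies.get(ambiguity, Counter())
        return [reading for reading, _ in counts.most_common()]

stats = AmbiguityStats()
for _ in range(9):
    stats.record("s_apostrophe", "plural_possessive")
stats.record("s_apostrophe", "closing_quote")
print(stats.ranked_readings("s_apostrophe"))
# ['plural_possessive', 'closing_quote']
```

With a 90/10 split recorded, the plural possessive becomes the first interpretation attempted, exactly as described above.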
I realized also that there is a place in this conceptual framework for learning negative rules. In proper English we expect the first word in a sentence to be capitalized. So when we come across a sentence whose first word is not capitalized, we might let the user know of the mistake. But to do this we actually need to encode the rules and flag them as erroneous. They would contribute to concluding that an interpretation is incoherent, but they might also serve as good explanations when no more coherent interpretations are available.
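A negative rule can be sketched as a check that both penalizes the coherence score and produces a human-readable explanation. The function and score values here are hypothetical placeholders:

```python
# A "negative rule": a pattern we expect NOT to hold in correct text.
# Matching it penalizes coherence but also yields an explanation that can
# be surfaced to the user when no better interpretation exists.
def check_sentence_capitalization(sentence):
    first = sentence.split()[0]
    if first[0].islower():
        return (-1, f"first word '{first}' is not capitalized")
    return (0, None)

print(check_sentence_capitalization("the dog barked."))
# (-1, "first word 'the' is not capitalized")
print(check_sentence_capitalization("The dog barked."))
# (0, None)
```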
I’ve been thinking about how to approach the conceptual framework I’ve sketched, however roughly, above. One key to it is getting away from the impulse to linearize everything.
Consider the task of counting a pile of money. It’s easy to picture doing so from start to finish. But what would you do if you got interrupted in the middle of the task? You might write down the running total and make sure the already counted pile is well separated from the uncounted pile. Then when you return you can pick it up again where you left off. In this way this process is reentrant.
In this sense it is necessary to be able to come up with one interpretation of a piece of text or other problem and be able to come back to it later to consider alternative interpretations. That means the task must be designed to be reentrant from the start, and there must be a way of keeping track of the options we have already tried so we can pick up where we left off and try another. Ideally each next option we try would be the next best, not merely a randomly chosen possibility.
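Python generators give a natural sketch of this reentrancy: a generator remembers where it left off, so a later stage can pull the next alternative on demand instead of materializing the whole tree. If each ambiguity's readings are listed best-first, `itertools.product` happens to vary the later readings before the earlier ones, giving a rough (though not globally score-ordered) preference order:

```python
import itertools

# A reentrant source of interpretations: the generator's internal state
# is the "written-down running total" from the money-counting analogy.
def interpretations(ambiguities):
    # ambiguities: list of lists of readings, best-first within each list.
    yield from itertools.product(*ambiguities)

gen = interpretations([["plural_possessive", "closing_quote"],
                       ["plural_possessive", "closing_quote"]])
print(next(gen))  # best-first guess for both apostrophes
print(next(gen))  # resumed where we left off: the next alternative
```

Only the interpretations actually requested are ever constructed, so the millions of unused leaves from the combinatorial tree never exist in memory.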
It occurs to me that a later stage in the pipeline would ideally be able to give clues as to what to look for too. Let’s say an earlier stage gave as its best interpretation that there is a sentence whose first word is not capitalized. A later phase looking up each of the words in its lexicon might conclude that “iPhone”, the first word in the sentence, is actually a proper noun that is spelled in a nonstandard way. It might then tell the earlier stage to consider this fact and find another alternative interpretation with this knowledge in mind. A more coherent interpretation should emerge.
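The feedback channel might look something like this sketch, where a lexicon stage diagnoses the "error" reported by an earlier stage and hands back a hint. All names here are hypothetical, and a real system would route the hint into the earlier stage's reinterpretation step:

```python
# A later stage turning an apparent error into a hint for an earlier stage.
# The lexicon knows "iPhone" is a proper noun with nonstandard casing.
LEXICON = {"iPhone": "proper_noun"}

def diagnose_uncapitalized(first_word):
    if LEXICON.get(first_word) == "proper_noun":
        # Hint for the earlier stage: this "error" is actually coherent.
        return {"hint": "nonstandard_proper_noun", "word": first_word}
    return {"hint": "capitalization_error", "word": first_word}

print(diagnose_uncapitalized("iPhone"))
print(diagnose_uncapitalized("the"))
```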
I think each total interpretation of some piece of data should be given a numeric coherence score. I’m not exactly sure how to go about it just yet. But one option would be to use positive scores to indicate coherence. The more coherent the higher the score. I think as soon as anything breaks the coherence of an interpretation, no matter how small, the score might go negative. The more incoherent the more negative the total score.
What would contribute to the score? I’m still trying to work this out. I think that anything that is not ambiguous could contribute 0 to the score; only the ambiguous cases might be considered. Let’s say we came across an ambiguity like ‘12” ’. Is this 12 inches? Or is the double quote here a closing quote from a larger string of text? Let’s say in surveying thousands of texts we found that 65% of the time it’s a closing quote and 35% of the time it’s a length in inches. So we might add +1 for the length option and +2 for the closing-quote option. If, when evaluating the full text, we discover that there is no opening quote to match this potential closer, then we score that option negative to indicate incoherence.
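The 12-inches example can be sketched as a scoring function. The weights and the penalty value are placeholders standing in for frequency-derived statistics, and the quote-matching check is reduced to a single boolean:

```python
# Frequency-derived positive weights per reading (from the hypothetical
# 65% / 35% survey), with a hard negative score when a reading
# contradicts the wider context.
def score_option(reading, has_open_quote):
    weights = {"closing_quote": 2, "inches": 1}
    if reading == "closing_quote" and not has_open_quote:
        return -10  # incoherent: nothing for the quote to close
    return weights[reading]

print(score_option("closing_quote", has_open_quote=True))   # 2
print(score_option("closing_quote", has_open_quote=False))  # -10
print(score_option("inches", has_open_quote=False))         # 1
```

The statistically favored reading wins by default, but a single contextual contradiction flips its score negative and pushes the search toward the alternative.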
I’m still trying to work this concept out mentally before I try to write an algorithm based on it. I genuinely think I’m onto something here though. I think that there may be a generalized approach to problem-solving peeking out here. I want to believe that there is a way to write a general data structure and algorithms that, like a Christmas tree, can be adorned with specialized black-box processing components with reentrance and coherence models built into them. That the larger algorithms could enable these black boxes to collaborate without understanding each other’s details.
I plan to try to take a first stab at this in the coming days. Hopefully I’ll have something to report soon.