Virtual lexicon vs Brown corpus

Having completed my blocker, I decided to take a break before tackling syntax analysis to study more facets of English. I also realized I should beef up the lexicon underlying my virtual lexicon (VL). I had only collected about 1,500 words, most of which I had simply hand-entered by way of theft from the CGEL's chapters on morphology; mostly compound words, at that. It was enough to test and demonstrate the VL's capacity to deal half-decently with morphological parsing, but nowhere near big enough to represent the tens of thousands of words, at least, that a typical high school graduate with English as their native language will know.

A virtual lexicon's core premise is that being able to recognize novel word forms by recognizing their parts is more valuable than having a large list of exact word forms. In essence, a relatively small number of lexical entries should be able to represent a much larger set of practical words found "in the wild".

Using the Brown corpus

I decided that a good way to see just how much mileage I could get out of my virtual lexicon would be to expose it to an existing dictionary, of sorts. In particular, I chose the Brown corpus, which is full of words hand-tagged with their lexical categories (parts of speech) taken from excerpts of 500 documents contemporary to the 1960s. I had already converted the BC's data to JavaScript/JSON files and dabbled a bit with it many months back, so I had an easy way to work with it.

Most significantly, I already had a list of all the unique words found in the BC, each complete with an ordered sub-list of its lexical categories and their frequency counts. Here's an example:

care:{c:162,p:[{c:87,p:'nn'},{c:75,p:'vb'}],bp:[{c:87,p:'n'},{c:75,p:'v'}]},
'care-free':{c:1,p:[{c:1,p:'jj'}],bp:[{c:1,p:'aj'}]},
cared:{c:15,p:[{c:9,p:'vbd'},{c:6,p:'vbn'}],bp:[{c:15,p:'v'}]},
careened:{c:1,p:[{c:1,p:'vbd'}],bp:[{c:1,p:'v'}]},
careening:{c:1,p:[{c:1,p:'vbg'}],bp:[{c:1,p:'v'}]},
career:{c:67,p:[{c:67,p:'nn'}],bp:[{c:67,p:'n'}]},

For example, care appears 162 times in the BC. 87 of those times, it's as a common noun, as in health care. And 75 of those times it appears instead as a base verb, as in to care for.
Given a word like "caring", my VL will try its best to figure out the lexical category. For this example, it would likely parse it as "care -ing" and call this a gerund/participle, same as BC, which uses the "vbg" tag to represent this.
This list contains lots of elements I don't care to push through my VL, such as proper nouns (John, Brooklyn, Glazer-Fine) and punctuation. After filtering, that leaves a word-list of 12,222 unique words for me to test my VL against.
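That filtering step could be sketched roughly as follows. This is only an illustration, not the actual test code: the names sampleList, isTestable, and testableWords are made up, the counts for Brooklyn and the comma are invented, and the rules (proper-noun tags start with "np", punctuation tags contain no letters) are assumptions about how the tags appear in the word list.

```javascript
// Hypothetical slice of the word list, in the same shape as the entries above.
const sampleList = {
  care: { c: 162, p: [{ c: 87, p: 'nn' }, { c: 75, p: 'vb' }] },
  career: { c: 67, p: [{ c: 67, p: 'nn' }] },
  Brooklyn: { c: 1, p: [{ c: 1, p: 'np' }] }, // made-up count
  ',': { c: 1, p: [{ c: 1, p: ',' }] },       // made-up count
};

// Assumption: a tag is worth testing if it contains a letter (so it is not
// punctuation) and does not start with "np" (so it is not a proper noun).
function isTestable(entry) {
  return entry.p.some(({ p: tag }) => /[a-z]/.test(tag) && !tag.startsWith('np'));
}

// Keep only the words that have at least one testable tag.
function testableWords(list) {
  return Object.keys(list).filter((word) => isTestable(list[word]));
}

console.log(testableWords(sampleList)); // [ 'care', 'career' ]
```

Whether a word that is sometimes a proper noun and sometimes not should be kept is a judgment call; this sketch keeps it.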
Here's a snippet of the typical output from my testing:

 |   | preposterous    |       5 | jj              | J               | J               |     7 ms |    125 | pre-(U) post(N) -er(J→J) -ous(J)  (J, N, or N)
 | X | prescribe       |       5 | vb              | V               | N               |          |     40 | pre-(U) scribe(N)
 |   | prescribed      |      14 | vbn, vbd        | V.pret          | V.pret          |          |    100 | pre-(U) scribe(N) -ed(V→V)  (V or J)
 |   | prescription    |       5 | nn              | N               | N               |          |     85 | pre-(U) script(N) -ion(V|J→N)
 |   | presence        |      76 | nn              | N               | N               |     3 ms |     59 | present(J) -ce(J→N)  (N or Phr)
 | / | present         |     377 | jj, rb, nn, vb… | J               | V               |          |      0 | present(V)  (V or J)
 | X | present-day     |      17 | jj              | J               | N               |          |    100 | present(V) -(U) day(N)
 |   | presentation    |      33 | nn              | N               | N               |    17 ms |     97 | present(V) -ate(V) -ion(V|J→N)
 |   | presentations   |       6 | nns             | N.plur          | N.plur          |    88 ms |    137 | present(V) -ate(V) -ion(V|J→N) -s(N→N)  (N, V, or N)
 |   | presented       |      82 | vbn, vbd        | V.pret          | V.pret          |     4 ms |     40 | present(V) -ed(V→V)  (V or J)
 |   | presenting      |      10 | vbg             | V.gerprt        | V.gerprt        |     4 ms |     40 | present(V) -ing(V|N→V)  (V or N)

For example, prescribe gets treated as "pre- scribe". Since the VL sees scribe as a noun, it concludes that the whole word is a noun, as though we were talking about a person before they became a scribe. The BC tags this as "vb". To run the comparison, I use a mapping to translate some of the many tags the BC uses to the representation used by my VL. For example, "V.pret" means preterite verb and "N.plur" means plural noun.
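The tag translation can be pictured as a plain lookup table. The sketch below is an assumption about the shape of such a mapping, not the real one: it contains only pairs that are visible in the sample output in this post, and the names BC_TO_VL and mapTags are invented for illustration.

```javascript
// Partial BC-to-VL mapping, limited to pairs visible in the sample output.
const BC_TO_VL = new Map([
  ['nn', 'N'], ['nns', 'N.plur'], ['jj', 'J'], ['rb', 'R'],
  ['vb', 'V'], ['vbd', 'V.pret'], ['vbn', 'V.pret'],
  ['vbg', 'V.gerprt'], ['vbz', 'V.3rdsg'],
]);

// Translate a list of BC tags, dropping duplicates that arise when two BC
// tags collapse into one VL category. Unknown tags map to null.
function mapTags(bcTags) {
  const mapped = bcTags.map((tag) => BC_TO_VL.get(tag) ?? null);
  return [...new Set(mapped)];
}

console.log(mapTags(['vbn', 'vbd'])); // [ 'V.pret' ]
console.log(mapTags(['nns']));        // [ 'N.plur' ]
```

Note how "vbn" and "vbd" both land on "V.pret", which is why prescribed shows a single mapped category despite having two BC tags.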

Data mapping is a tricky and often dubious affair. Sometimes, there just isn't an exact mapping between two systems. In my case, consider them, which the BC tags as "ppo" (pronoun, personal, accusative), a class that also includes it, him, me, us, you, and her. Some of these are plural and the rest aren't. In my VL, them is a plural pronoun ("N.pron.plur"), so the plural "ppo" items compare incorrectly. I could have modified my mapping to treat them and us as plural, but that's an unnecessary hack that doesn't really help my task.

The first column of the output contains a match status. When blank, that means the two systems agreed on the LC of that word. A "?" indicates that my VL couldn't even match the morphemes of the prospective word. Though that doesn't stop it from making a guess based on a familiar suffix (e.g., -ous or -ing), I disqualified its attempt on that basis and also to point me to morphemes that really needed to be added to my lexicon. If all the morphemes did match, I compare the resultant LCs. The first one in BC represents the most common occurrence (e.g., "jj" (adjective) for "present") and the others represent less common occurrences. If my VL doesn't match any of BC's LCs, this column contains "X". If it matches only a secondary LC, "/" appears. Think "half an 'X' for half-wrong" (or half-right).
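The status rules just described boil down to a few comparisons. Here is a minimal sketch, assuming the BC's categories have already been mapped to VL notation and are ordered most common first; the function name matchStatus and its parameters are hypothetical.

```javascript
// allMorphemesKnown: did every morpheme of the word match the lexicon?
// vlCategory: the VL's guess, e.g. 'V'
// bcCategories: mapped BC categories, most common first, e.g. ['J', 'R', 'N', 'V']
function matchStatus(allMorphemesKnown, vlCategory, bcCategories) {
  if (!allMorphemesKnown) return '?';                   // disqualified guess
  if (bcCategories[0] === vlCategory) return ' ';       // agrees with primary LC
  if (bcCategories.includes(vlCategory)) return '/';    // agrees with a secondary LC
  return 'X';                                           // no agreement at all
}

console.log(matchStatus(true, 'J', ['J']));                // ' '  (preposterous)
console.log(matchStatus(true, 'V', ['J', 'R', 'N', 'V'])); // '/'  (present)
console.log(matchStatus(true, 'N', ['V']));                // 'X'  (prescribe)
```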

As you can see from these examples, some of the derivations are pretty good, as with "N.plur: present(V) -ate(V) -ion(V|J→N) -s(N→N)" for presentations. And some are pretty bad, like "J: pre-(U) post(N) -er(J→J) -ous(J)" for preposterous. Yes, it got the final lexical category right, thanks to the -ous suffix, but only by fumbling through its morphemes.

Adding lexemes

Although testing the dexterity of my VL was a key goal, a more basic one was augmenting my lexicon with more words. To that end, I would filter my word comparison runs for all the "?"-status bad matches and hand-enter morphemes as necessary.

Consider "monast", for example. I added this as a bound morpheme, which isn't definitively a prefix (un-, ante-, electro-) or suffix (-ing, -ably, -ment), but can't really stand on its own as a complete word in a sentence. Although I used my own sense of how a word was historically composed and its potential for production of other words, I also relied on online tools to help. For example, searching for all words that begin with "monast" or for all words that end with "ment". Having extensive examples at the ready helped me test (and reject) many of my hypotheses. And then I could know that monast~ could correctly form monastic, monastery, monasticism, and more.

I went through this process for several days. While I had some shortcuts, I ultimately hand-processed every word. To my surprise, I personally recognized all but perhaps ten of the 12k+ words, ignoring certain highly technical medical terms. And with each new lexeme I'd add, I'd ask myself, "how was this word not already in here?" One of the last words I added was "yes", one of the most basic in the English language. My sense is that there must still be loads of even ordinary words not covered by my VL.

I continued this process until there were no more "unmatched" words, meaning almost every word in the BC could be sliced up into morphemes that matched my underlying lexicon, even if the LCs didn't match the BC's LCs. In the end, my lexicon had 4,913 lexemes available. Of those, 4,494 lexemes were used to match 12,132 words. That is 37% as many lexemes as words, or a "lexical compression rate" of about 63%. On average, one of my lexemes can match a little under three words in the BC. For comparison, a basic word-list with no morphological parsing would display 0% compression. 100% is the impossible asymptote that could never be reached.
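The arithmetic behind those figures is easy to check. In the sketch below the variable names are made up; the two counts come straight from the paragraph above. The 37% figure is the ratio of lexemes used to words matched, so on a scale where a plain word list scores 0%, the corresponding reduction is one minus that ratio.

```javascript
const lexemesUsed = 4494;
const wordsMatched = 12132;

const lexemeToWordRatio = lexemesUsed / wordsMatched; // lexemes per word
const wordsPerLexeme = wordsMatched / lexemesUsed;    // words per lexeme
const reduction = 1 - lexemeToWordRatio;              // 0 for a plain word list

console.log(Math.round(lexemeToWordRatio * 100) + '%'); // 37%
console.log(wordsPerLexeme.toFixed(1));                 // 2.7
console.log(Math.round(reduction * 100) + '%');         // 63%
```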

Part of speech tagging

In processing the full word list from the Brown corpus, I get a rate of 83% "hard" matches and 5% more "soft" matches. A hard match is where my VL's lexical category matches the most common usage of that word in the BC and a soft match is where my VL's LC matches one of the less common usages in the BC. Let's be liberal and call this an 88% match.

To anyone familiar with traditional part of speech (PoS) tagging, 88% is pathetic. A typical PoS tagger will get better than 95% correct without breaking a sweat.

But my test is definitely not a PoS tagger. A PoS tagger typically looks at the words in the neighborhood of one word being considered and uses a statistical model to decide what it's most likely to be in that context. My test program does nothing of the sort. A better analogy would be that this is the naive first step in a Brill tagger, where each word is looked up in a lexicon for its most likely LC. And then the tagger begins transforming those guesses based on what's in the neighborhood.
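For contrast, the naive lookup step could be sketched against the BC word list like this. It is a rough illustration of the "most likely tag" idea only, not a Brill tagger and not the VL; the names tagCounts and mostLikelyTag are invented, and the entries are copied from the word-list example earlier in this post.

```javascript
// Entries in the same shape as the BC word list: p holds tags with counts.
const tagCounts = {
  care: { c: 162, p: [{ c: 87, p: 'nn' }, { c: 75, p: 'vb' }] },
  cared: { c: 15, p: [{ c: 9, p: 'vbd' }, { c: 6, p: 'vbn' }] },
};

// Return the tag with the highest count for a known word, or null otherwise.
// No context is consulted; a real tagger would revise this guess afterward.
function mostLikelyTag(word, list) {
  const entry = list[word];
  if (!entry || entry.p.length === 0) return null;
  return entry.p.reduce((best, t) => (t.c > best.c ? t : best)).p;
}

console.log(mostLikelyTag('care', tagCounts));   // nn
console.log(mostLikelyTag('cared', tagCounts));  // vbd
console.log(mostLikelyTag('caring', tagCounts)); // null
```

The null result for caring is exactly the gap a virtual lexicon tries to fill: a pure lookup has nothing to say about a word form it has never seen.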

Still, a typical PoS tagger that starts with a naive lookup will usually start out at around 93% match, so why would my VL do so badly?

One simple reason is that so many of the words that remain rely on poorly chosen lexemes during morphological parsing. Consider youth, which my VL sees as you -th, where -th is typically a suffix for ordinal numbers like tenth and 175th. My lexicon is missing an entry for youth. In this case, youth, youthful, and youths all correctly match, but in many other cases, such shortcomings in my lexicon cause clearly mistaken guesses, as when legitimate gets interpreted as leg it im- ate, whereas I really need a legitim~ bound lexeme to come up with legitim~ -ate and a valid interpretation as either a verb (to legitimate her presidency) or adjective (the legitimate president).
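A toy segmenter makes the problem concrete. The sketch below is not the VL's parser: it ignores spelling changes, affix types, and lexical categories, and its tie-breaker (prefer the fewest pieces) is an assumption made purely for illustration. The names segmentations, bestSegmentation, small, and larger are hypothetical.

```javascript
// Return every way to split `word` into pieces that all appear in `morphemes`.
function segmentations(word, morphemes) {
  if (word === '') return [[]];
  const results = [];
  for (let end = 1; end <= word.length; end++) {
    const head = word.slice(0, end);
    if (!morphemes.has(head)) continue;
    for (const rest of segmentations(word.slice(end), morphemes)) {
      results.push([head, ...rest]);
    }
  }
  return results;
}

// Assumed tie-breaker (not the VL's scoring): prefer the fewest pieces.
function bestSegmentation(word, morphemes) {
  const all = segmentations(word, morphemes);
  if (all.length === 0) return null;
  return all.reduce((best, s) => (s.length < best.length ? s : best));
}

const small = new Set(['leg', 'it', 'im', 'ate', 'you', 'th']);
const larger = new Set([...small, 'legitim', 'youth']);

console.log(bestSegmentation('legitimate', small));  // [ 'leg', 'it', 'im', 'ate' ]
console.log(bestSegmentation('legitimate', larger)); // [ 'legitim', 'ate' ]
console.log(bestSegmentation('youth', small));       // [ 'you', 'th' ]
console.log(bestSegmentation('youth', larger));      // [ 'youth' ]
```

With the smaller set, the only available parse is the silly one; adding the missing entries is what lets a sensible parse win.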

Another reason is that the derivation of a word from its morphemes often seems superficially logical, but doesn't reflect the reality of how the word is typically used. For example, amazing naturally follows the usual amaze -ing pattern and can be used as a verb (amazing friends with magic) or gerund ("amazing" isn't bold enough to describe it), but in practice, we most often use amazing as an adjective (the soup is amazing). There's no way to tell that by reference to its two morphemes. -ing almost always forms a gerund-participle and sometimes a noun (the flashing for the siding needs repair), but only rarely an adjective (stunning, breathtaking). This reality reflects a limitation of the virtual lexicon, at least as I've constructed it. Sometimes the only answer is to lexicalize (add to the lexicon) a word that would otherwise badly match, as I did with amazing.

Often, I simply couldn't bring myself to label a word in accordance with the most common usages in the BC. For example, I have defeat as a lexeme with only a verb sense, but the BC has it more often tagged as a noun (suffered a defeat) and less often as a verb (we'll defeat them). To my thinking, this reflects not a definitional disagreement, but the difference between a word's intrinsic meaning and its usage in a specific sentence.

Moreover, I am still troubled by the idea of having a lexicon contain multiple entries for a lexeme whose only apparent difference is one of lexical category. In Lexical Categorization in English Dictionaries and Traditional Grammars Geoffrey Pullum points out that "many dictionaries actually do — quite wrongly — include subentries for numerous nouns that list them as adjectives."

In the process of beefing up my lexicon, I was struck by a feeling that almost every entry I added that had a noun sense could also be used as a verb or adjective as well, so I favored only adding LCs for what I thought of as the predominant LCs for the major senses. For example, for appeal, I added verb (I plan to appeal the decision) and noun (the youthful appeal of this dress) senses because each, to my thinking, had a distinctly different meaning. For medal, I only added a noun sense, despite the validity of using it as a verb, as in medaled in track and field, because to my thinking, both uses fundamentally refer to the same exact concept. "Medaling" just means winning a medal, the thing that is won.

That said, I've also second-guessed this line of thinking. If I imagine there's a singular noun-verb-adjective pseudo-category, then it's clear that many words would violate it. For example, it's hard to imagine using medal as an adjective. Yes, some words like fast easily lend themselves to all three interpretations (They fasted their fast after eating faster than usual.) But so many words just pile up to negate this nice model, such as complex (the complex has a complex layout) and monkey (let's not monkey with this monkey). That seems to suggest I should have been more liberal with my lexicon. That I should have listed all the reasonable lexical categories a lexeme could take.

I suspect my stinginess here is a significant source of the low half-right match rate. Had I taken the alternative view, the right plus half-right numbers would probably be in the nineties.

From word-list to document tags

As I said before, my test code is not what I would actually call a part of speech (lexical category) tagger, since it does not consider any word in the context of the words around it. All it does is guess at the proper lexical category for a given word by a morphological analysis involving dictionary-like lookups.

Still, I was curious to see how it would fare against the individual documents in the BC. The unique words are not all equal, after all. One word will appear only once in the BC, while another will appear thousands of times. And that word may vary among three different LCs with each usage.

As before, I ignored proper nouns and punctuation, but also "cd" (number) words. Of the 951k words thus considered, 773k (81%) of them had matching LCs, which is close to the 83% exact-match rate I got when simply looking at the unique-word-list from the BC. I include here an output file from one test run. Here's an example of what its contents look like:
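The difference between the two measurements is that the document run counts every occurrence of a word, so frequent words weigh more. A minimal sketch of that token-level tally follows. The function name tokenMatchRate, the token list, and the guesses map are all made up for illustration, and the token counts are not real BC data.

```javascript
// tokens: one { word, category } record per occurrence, category already
// mapped to VL notation. guessCategory: returns the VL's guess for a word.
function tokenMatchRate(tokens, guessCategory) {
  let matched = 0;
  for (const { word, category } of tokens) {
    if (guessCategory(word) === category) matched++;
  }
  return tokens.length === 0 ? 0 : matched / tokens.length;
}

// Hypothetical guesses, echoing the sample output: human -> N, search -> V.
const guesses = new Map([['presence', 'N'], ['human', 'N'], ['search', 'V']]);

// Made-up occurrences: one word that matches three times, two that miss once.
const tokens = [
  { word: 'presence', category: 'N' },
  { word: 'presence', category: 'N' },
  { word: 'presence', category: 'N' },
  { word: 'human', category: 'J' },
  { word: 'search', category: 'N' },
];

console.log(tokenMatchRate(tokens, (w) => guesses.get(w))); // 0.6
```

Counted by unique word, only one of these three words matches; counted by occurrence, three of five do.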

 | on              | rp         | R               | P: on(P)
 | long            | rb         | R               | J: long(J)
-------- Doc 201 --------
 | to              | to         | R               | P: to(P)  (P or R)
 | Farewell        | nn         | N               | J: fare(N) well(J)  (cap)
 | that            | cs         | S               | N.pron: that(N)  (N, D, R, or S)
 | search          | nn         | N               | V: search(V)
 | meaning         | nn         | N               | V.gerprt: mean(V) -ing(V|N→V)  (V or N)
 | hints           | vbz        | V.3rdsg         | N.plur: hint(N) -s(N→N)  (N or V)
 | Unconscious     | nn         | N               | J: un-(J→J) conscious(J)  (J, Phr, N, or N)  (cap)
 | form            | nn         | N               | V: form(V)  (V or N)
 | other           | ap         | D               | J: other(J)  (J or V)
 | human           | jj         | J               | N: human(N)

Each of the 500 documents has a Doc NNN header, followed by a list of the words that did not match. Each such mismatch lists the word, the exact tag (e.g., "nn" or "cs"), my mapped version of it (e.g., "N" and "S"), and then my own interpretation of the word.

The example of human well illustrates the difference between a word's natural lexical category and the syntactic category it takes on in a sentence. In this case, it falls within the clause some form or other enters into all human activity. Practically speaking, human is a noun and its use in human activity doesn't change that.

Conclusions

Overall, I'm happy with how well my virtual lexicon's morphological parsing approach solves the specific problem of guessing the baseline lexical category for a word it doesn't already know. Some 83% of over 12k words were properly recognized by a little over a third as many lexemes. That said, wacky examples like "de- co- rat -ive" (instead of "decor -ate -ive") illustrate how it's often just a lucky guess where the last suffix's LC saves the day.

My hand-crafting yielded a lexicon of just under 5k lexemes. Given a choice between a massive word list (think hundreds of thousands or even millions of words) and a tiny lexeme set plus morphological analysis, the massive word list is clearly going to win. That said, it seems reasonable to assume that the best results would come from combining a morphological analyzer with a massive word list. The reason is that the underlying premise remains: you're inevitably going to run into novel word forms as you process new documents.

Jim Carnicelli's AI Blog