Morphological parser / Virtual lexicon

I've been able to focus the past 3 months exclusively on my AI research, a luxury I've never had before. Given that I can't afford to do this indefinitely, I've chosen to focus primarily on NLP, with an overt goal to create relevant marketable technologies.

I'm presently operating with an intermediate goal of creating a "basic" English parser (BEP). In my conception, a BEP will transform a block of English discourse into an abstract representation of its constituent words, sentences, and paragraphs that is more amenable to consumption by software. Though there are many research and industry products that do this at some level, I'm convinced I can improve upon some of their aspects to a marketable degree.

When you set about trying to understand natural language parsing, you quickly discover that there are many interconnected aspects that seem to make starting to program a system intractable. NLP researchers have made great strides in the past few decades largely by focusing on narrowly defined tasks, like part of speech tagging and phrase chunking; especially tasks that rely on statistical machine learning. Still, it seems every piece requires every other piece as a prerequisite.

After exploring lexical tagging (AKA part of speech tagging) for a while, especially using a custom Brill-style tagger I wrote from scratch, I decided to tackle an important piece of the bigger puzzle. Namely, how to deal with unknown words.

Why a morphological parser?

I'm convinced most unknown words are simply known words reassembled in different ways. The most basic example is inflectional versions. The plural form of "chicken" is "chickens". The past ("preterite") tense form of "stop" is "stopped". The comparative form of "sloppy" is "sloppier" and the superlative form is "sloppiest". Beyond these, many words can be formed by compounding existing words. "Sunshine", "ankle-deep", and "brainwash" illustrate basic compound words. And then there are all those affixes — prefixes and suffixes — like "un-", "re-", "-ish", and "-ation" that can be added to other words to form more complex ones like "reprioritization".

This is the domain of morphology, the study of word-formation, among other things. I decided I should try to somewhat master this algorithmically in order to deal with the broadest array of words.

The logic goes something like this. To parse, say, a paragraph of text, one must identify the sentence boundaries. There are some obvious algorithms to do so, but they run into ambiguities. A period, for example, might signal an abbreviation like "Mr." or "Dr." instead of the end of a sentence. Once you have isolated a sentence, making sense of it requires being able to characterize all of its words in some useful way. At the very least, you might want to tell what the main verb is and possibly what the subject and direct object are. Doing so usually begins with a process of identifying all the verbs, nouns, adjectives, etc. in the sentence, which turns out to be a surprisingly tricky process. You usually start with a naive guess that involves looking the word up in a lexicon. Like a standard dictionary, a lexicon will typically at least tell your algorithm what the most likely lexical category is for a given, known word (e.g., "run" is most likely a verb). And then the order of the categories in the sentence is typically used to figure out the syntactic structure of the sentence. Any word in the sentence that isn't already in the lexicon becomes a problem for this approach.

How, then, to deal with unknown words? One answer is to use well-known syntactic patterns to make a guess. In the sentence I ate a very XXXX pear, we can guess that XXXX is most likely an adjective because that's the only thing that should be allowable by English grammar rules. But we might also be able to guess by picking the unknown word apart. In an XXXXer pear, we can guess that XXXXer is probably a comparative adjective like tastier or raunchier. That said, it isn't guaranteed. Consider bitter, which coincidentally ends in er but which is not comparative (that would be bitterer or more bitter). Still, English comes with a wealth of prefixes and suffixes that can hint at the likely category for otherwise unknown words. XXXXish is probably an adjective, XXXXing is probably a present participle or gerund. And so on.

Human languages are "rulesy". Whether we intend to or not, we embed rules in how we form utterances. That's true at the word level just as much as at the sentence level, not to mention paragraph and discourse levels. Like many in the computational linguistics community, I had been falling under the spell of letting learning algorithms figure out the rules instead of hand-crafting them. The case for this approach has been compelling in recent decades. Start with a learning algorithm and a hand-tagged corpus and the algorithm will save you the trouble of understanding the rules yourself. The results for a specific task are often statistically better than hand-crafted rules, furthering the case for this approach. However, I'm beginning to question the wisdom of this seductive but naive approach, which Geoffrey K. Pullum of the University of Edinburgh might label corpus fetishism.

Pullum represents one of the three legs of the computational linguistics stool: linguistics. At least one friend from the mathematics leg of CL has suggested I would do better to bone up on my linguistics knowledge than to worry about perfecting my mathematics knowledge. I concur with her thinking. As a programmer — this discipline being the third leg — I can attest that it's impossible to program a solution to a problem without first defining the problem. My own sense is that linguistics is the leading edge of the NLP revolution that is gradually happening. And mathematicians and programmers need to look deeper than the first level of linguistics to really understand what we're doing as we try to automate language understanding and production.

Pullum's mechanistic approach to understanding and explaining English grammar is infectious. I'm studying parts of The Cambridge Grammar of the English Language (CGEL) lately, especially with a focus on morphology. Chapters 18 (Inflectional morphology and related matters) and 19 (Lexical word formation) deal extensively with the subject. Pullum and Rodney Huddleston (et al) delved far deeper than my own limited needs into the gory guts of word-formation, but I am quite enjoying the treatment.

While my main motivation for adding a morphological parser (MP) to my BEP is dealing with unknown words, I also have creating a trim lexicon as a major goal. If I have the word "happy" in my lexicon, I should not also need happier, happily, unhappy, unhappily, happiness, unhappiness, and so on in my lexicon. I want to believe that a large fraction of the words that appear in a given sentence are derivatives of simpler, common-enough ones. In the previous sentence, for example, words, given, derivatives, simpler, common-enough, and ones are derivatives. That's 6 of 23 words, or 26% of the sentence.

I regularly coin what I like to call "bullshit words" like "rulesy" and "cuteiful" when I'm explaining ideas and as a light form of humor. While amusing to us, this is devastating to a parser relying purely on a lexicon, even when it already has millions of words in it. One powerful benefit to an MP is the ability to deal more robustly with neologisms.

Putting it all together, the main goal for me of a morphological parser is to be able to recognize a much larger potential vocabulary with a much smaller lexicon held in memory. I'm going to dub this a "virtual lexicon".

Design philosophy

Loosely speaking, my morphological parser serves the function of a lexicon for a parser. In my current usage, I have a typical parse pipeline that begins with a tokenizer that produces a string of words, numbers, punctuation, and other symbols. The tokens that look word-like, usually because they are made up exclusively or mainly of letters, are typically compared against the lexicon, a long list of known words. In the syntax parsing application, that lexicon usually indicates the lexical categories (verb, noun, etc.) for each matched word. Depending on the design, the lexical entry may contain more than one category.

To my thinking, a simple, exact-match lookup isn't robust enough. Instead, each word-like token is fed into the MP, which has the same main goal: returning an indication of the most likely lexical categories for that word. But inside, the MP is operating on the premise that the word could be derived from other word parts. To be sure, my MP does have a lexicon and will return the category for a known word if it matches exactly. If it doesn't find an exact match, though, it tries its best to interpret the word as a derivative of one or more known ones.

Remember: the first priority of the MP is to determine the most likely lexical category (e.g., adverb) of a word. There are thus four basic possible scenarios: the exact word already exists in the lexicon; the word is completely composed of other word parts that are all found in the lexicon; some parts of the word — usually suffixes like ing — are recognized to the point where a good guess is possible; or the word is wholly unrecognized.

Given a single textual token like strawberries to look up, the MP returns a "word". That word consists of one or more interpretations, which I call "senses", in keeping with the idea that a dictionary entry for a word may contain definitions for many different senses of that word. This is also in keeping with one of my design goals: to entertain multiple interpretations of a given word, sentence, etc.

Each word sense indicates the word's lexical category and includes a list of the morphemes that make it up. Given irredeemable, for example, the MP returns a data structure that it shorthands as "(J): ir-(JJ) redeem(V) -able(VJ)". The leading (J): indicates that the whole word is interpreted as an adjective. Next is ir-, which it sees as a prefix that typically attaches to an adjective to form an adjective. Next is redeem(V), which it sees as a verb. Last is -able(VJ), a suffix that usually attaches to a verb to form an adjective.

The MP also incorporates a feature to treat one word as though it were multiple. In most cases, a single word can be treated as having a single, well-defined category like verb or adjective. But sometimes this isn't the case. Consider auxiliary verbs ending with the n't contraction, like isn't, wouldn't, or haven't. It is best to treat these as two-word strings like is not, would not, or have not, acknowledging that these are verb-adverb combinations. Most contractions likewise need expansion. Consider it's and what's, which should be interpreted as it is/was and what is/was. This also applies in Joe and Dale're going fishing, where 're must be interpreted as are and applying to both Joe and Dale. While most initialisms (ASAP, IRS) and acronyms (AWOL, NAFTA) can be seen as having single categories — usually noun — others suffer the same problem. IMO (in my opinion) is a prepositional phrase. Although it could be seen as an adverb, it's probably better to simply expand it out into its set of words in the sentence in which it appears before syntactic parsing. Or HIFW (how I felt when), which surely can't be reduced to a basic category.

In keeping with the belief that a word can have multiple senses and that it can sometimes be treated as multiple words, I'll point out that the "word" that is output by the MP when fed a token is a tree structure. A word object contains a list of word-sense objects. A word-sense object has a list of morphemes and, alternatively, a list of child word objects if a sense is to be interpreted as a string of words. A morpheme is mainly a pointer to a lexeme (like an entry for a word in a dictionary) and which sense of that lexeme is meant. If a lexeme wasn't found for the morpheme in the word, those pointers are null, but the morpheme is still useful in that it contains the text of that morpheme.
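To make that tree structure concrete, here is a minimal Python sketch (the parser itself is written in C++). The class names, field names, and the shorthand() helper are assumptions made for illustration; they are not the actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Lexeme:
    text: str
    categories: List[str]  # one entry per lexeme sense, e.g. ["VJ"]


@dataclass
class Morpheme:
    text: str                          # kept even when no lexeme matched
    lexeme: Optional[Lexeme] = None    # None when the lexicon had no match
    sense_index: Optional[int] = None  # which sense of the lexeme is meant


@dataclass
class WordSense:
    category: Optional[str]
    morphemes: List[Morpheme] = field(default_factory=list)
    sub_words: List["Word"] = field(default_factory=list)  # multi-word senses


@dataclass
class Word:
    token: str
    senses: List[WordSense] = field(default_factory=list)


def shorthand(sense: WordSense) -> str:
    """Render a sense in the "(J): ir-(JJ) redeem(V) -able(VJ)" style."""
    parts = []
    for m in sense.morphemes:
        if m.lexeme is None:
            parts.append(f"[{m.text}]")  # unmatched text shown in brackets
        else:
            parts.append(f"{m.text}({m.lexeme.categories[m.sense_index]})")
    return f"({sense.category}): " + " ".join(parts)


ir = Lexeme("ir-", ["JJ"])
redeem = Lexeme("redeem", ["V"])
able = Lexeme("-able", ["VJ"])
word = Word("irredeemable", [WordSense("J", [
    Morpheme("ir-", ir, 0), Morpheme("redeem", redeem, 0), Morpheme("-able", able, 0),
])])
print(shorthand(word.senses[0]))
```

A sense that stands for a string of words would leave morphemes empty and fill sub_words instead.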

I decided that it was critical to support Unicode. I created a custom, relatively trim library that lets me deal with multilingual text and "rich" characters. This overhead probably slows processing down marginally.

One other key is that I decided to specialize this parser to English morphology, hard-coding some English-centric rules related to morpheme transformation and lexical categorization into it. I'm hopeful that this work can provide inspiration for extracting those rules out as data to support other languages better, but I just don't have enough knowledge of other languages to justify the added system complexity yet.

Morpheme splitting

My morphological parser breaks its job into two independent tasks: finding morphemes; and then interpreting them in conjunction with one another.

Since each token is suspected to be composed of several morphemes, it's necessary to search for them. One way of doing that might be to start with the first letter. Consider happenings, for example, which is made up of happen -ing -s. One might start with h to see if we can find that in our lexicon. Not finding that, try ha. Then hap, and so on. Eventually, happen would be found. Then we could move on to find ing and finally s.

Each substring of characters could go against a hashed dictionary, which is fairly efficient. However, my MP has a specialized search tree in which each node represents a single letter and contains another hashed dictionary of next letters. To find happen, the algorithm might start with the root node, find the child node corresponding to "h", and for that node, find the child node corresponding to "a", and so on. When the search finds a known word during this tree traversal, that node — the "n" child node of the "e" node in this example — will have a link to the appropriate lexeme entry in the lexicon.

But I should clarify that my MP design counterintuitively starts at the end of a word and works its way backward. This is mainly because in English, modifications to base words (happy, flap) before adding suffixes (happily, flapping) usually occur in the last letter or two. More on this shortly.

I refer to this supporting lookup structure as a "morpheme tree". Words are added to this tree in reverse-character order. So happen is to be found by starting at the root and traversing "n", "e", "p", "p", "a", and finally "h", which leaf node will contain a pointer to the "happen" lexeme, which in turn has a list of lexeme-sense nodes representing the different categories (and in the future, distinct definitions) for that word.

Some words are subsets of others, as with Japan and pan. If the lexicon contains entries for both, this means parsing Japanese leads to ambiguity over whether pan is intended or merely coincidental. The morpheme tree will have a link to the pan lexeme at the "p" node, but also have a child node for the "a", leading into the "j" node, which then links to the japan lexeme entry. Thus, not all morpheme tree nodes that have links to lexemes are leaf nodes.
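The reversed lookup can be sketched in a few lines of Python. This is only an illustration of the idea: the class and method names are hypothetical, and a plain string stands in for the link to a lexeme entry.

```python
class MorphemeTreeNode:
    def __init__(self):
        self.children = {}  # next letter (reading backward) -> child node
        self.lexeme = None  # set when the path from the root spells a known word


class MorphemeTree:
    def __init__(self):
        self.root = MorphemeTreeNode()

    def add(self, text):
        """Insert a word in reverse-character order."""
        node = self.root
        for ch in reversed(text):
            node = node.children.setdefault(ch, MorphemeTreeNode())
        node.lexeme = text  # stand-in for a pointer to the lexeme entry

    def matches_ending_at(self, token, end):
        """Return every known word whose last letter is token[end - 1]."""
        found = []
        node = self.root
        for i in range(end - 1, -1, -1):
            node = node.children.get(token[i])
            if node is None:
                break
            if node.lexeme is not None:  # may be an interior node, not a leaf
                found.append(node.lexeme)
        return found


tree = MorphemeTree()
for entry in ("pan", "japan", "ese"):
    tree.add(entry)

print(tree.matches_ending_at("japanese", 8))  # ['ese']
print(tree.matches_ending_at("japanese", 5))  # ['pan', 'japan']
```

The second call shows the overlap: the walk passes through the node that links to pan and keeps going until it also reaches japan.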

The ambiguity introduced by overlap cannot be resolved immediately during word parsing. Moreover, other ambiguities arise that, again, cannot be resolved immediately. The morpheme-finding process is an exhaustive search that returns all possible parses of the whole token, from last to first character. In the Japanese example, traversing the morpheme tree yields -ese and then pan, but it doesn't find ja, which it conservatively interprets as an unacceptable parse. However, continuing past pan into japan bears fruit, so that parse gets interpreted as acceptable. Only those parses that return a string of morphemes that cover every last letter are returned by this part of the process.

Getting the list of acceptable parses involves constructing a parse tree using a recursive algorithm. The parse starts with the final letter. Moving forward involves recursively calling a sub-parse function that also constructs the parse tree in parallel with the recursion process, including all branches considered, as in the pan vs japan case. Every time this recursion successfully reaches the beginning of the word, the final node added to this parse tree, which represents the first letter of the word, is added to a list of search "tails". Every node in that list of tails represents a distinct, completed parse of the entire token. If there were a ja lexeme in the test lexicon, then the two tails would correspond to both ja pan -ese and japan -ese parses, which then move onto the next stage for scoring. More on that later.
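Here is a rough Python sketch of that exhaustive, end-first recursion. To keep it short it assumes a plain set of strings in place of the morpheme tree and returns finished chains directly instead of building a parse tree with tails; split_morphemes and sub_parse are hypothetical names.

```python
def split_morphemes(token, lexicon):
    """Return every way to cover the whole token with lexicon entries."""
    parses = []

    def sub_parse(end, chain):
        if end == 0:
            # Reached the first letter: this is one completed parse.
            parses.append(list(reversed(chain)))
            return
        # Try every candidate morpheme that ends at position `end`.
        for start in range(end - 1, -1, -1):
            piece = token[start:end]
            if piece in lexicon:
                sub_parse(start, chain + [piece])

    sub_parse(len(token), [])
    return parses


print(split_morphemes("japanese", {"pan", "japan", "ese"}))
# [['japan', 'ese']]
print(split_morphemes("japanese", {"ja", "pan", "japan", "ese"}))
# [['ja', 'pan', 'ese'], ['japan', 'ese']]
```

Branches that cannot reach the first letter, such as pan with no ja in the lexicon, simply never append anything.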

One way we end up with more tails is by having multiple senses of a word. Take the lexeme er, for example. In my test lexicon, this lexeme is viewed as a suffix added to an adjective to form the comparative of it (bigger, heavier), a suffix added to a verb to make a noun of it (cobbler, runner), or an initialism (ER or E.R.) for emergency room. So a parse of killer could yield "kill(V) -er(JJ)", "kill(V) -er(VN)", or "kill(V) er(N)". Yes, "kill emergency room" is a possible interpretation.

Another way we end up with more tails is by word modification, as with silliness and panicking. These modifications are assumed by my MP to happen only to words that have suffixes appended, which is why the search begins at the end of each word. After finding the ness morpheme and committing in one path to the suffix sense from the lexicon (the ness lexeme could also have a proper noun sense, as the proper name Ness), we then look at the next few characters along the way, knowing that they represent the end of another morpheme. Based on what we find, we try whatever changes (e.g., replace "i" with "y") are allowable and then continue parsing based on those modifications, in addition to the parallel track of parsing with no such changes. A parse of happiness would find -ness but not happi. But it would successfully find happy after changing "i" to "y". The parse would essentially return happy -ness, as though the word were literally spelled as "happyness". Here are the specific modification rules I've implemented thus far:

  • If last letter != "e" then append "e" (shar -ing → share -ing)
  • If last letter = "i" then change to "y" (tri -es → try -es)
  • If last letter = "v" then change to "f" (leav -es → leaf -es)
  • If last letter doubled then trim last letter (stopp -ing → stop -ing)
  • If last letters = "ck" then change to "c" (panick -ed → panic -ed)
  • If suffix = "n't" and last letter != "n" then append "n" (ca -n't → can n't)
  • If suffix = "s'" (apostrophe) then append "s" (bas -s' → bass -s')

I suspect there may be other modifications worth adding to this list later, but these already do a lot of good work. Each of these above does turn validly spelled words into invalid ones, but the payoff in being able to correctly parse such modified words is obvious. One downside is that the modifications can potentially create new, also-valid morphemes that lead to incorrect parses, but this should be rare. One example might be decking, which should be interpreted as deck -ing, but could also be interpreted via the ck → c rule as dec -ing, where "dec" is the abbreviated form of "December".

Once the recursive parse is done, we're left with a list of tails that all represent the paths by which we got from the last letter of the token to the first. The final step of morpheme splitting involves constructing linear representations of the chains that start at each tail. Each node in these chains represents one morpheme. This set of chains is the input to the next stage, where we score each chain to see which ones seem the most likely.
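Returning to the modification rules listed earlier in this section, the following Python sketch shows one way they could be applied to the text that precedes a suffix. The function name candidate_bases and the tiny lexicon are assumptions for illustration only; the real parser applies its rules while walking the morpheme tree.

```python
def candidate_bases(stem, suffix):
    """Return the unmodified stem followed by each allowable respelling."""
    candidates = [stem]
    if not stem:
        return candidates
    last = stem[-1]
    if last != "e":
        candidates.append(stem + "e")        # shar -> share
    if last == "i":
        candidates.append(stem[:-1] + "y")   # tri -> try
    if last == "v":
        candidates.append(stem[:-1] + "f")   # leav -> leaf
    if len(stem) >= 2 and stem[-2] == last:
        candidates.append(stem[:-1])         # stopp -> stop
    if stem.endswith("ck"):
        candidates.append(stem[:-1])         # panick -> panic
    if suffix == "n't" and last != "n":
        candidates.append(stem + "n")        # ca -> can
    if suffix == "s'":
        candidates.append(stem + "s")        # bas -> bass
    return candidates


# Hypothetical mini-lexicon used only to filter the candidates.
lexicon = {"share", "try", "leaf", "stop", "panic", "can", "bass", "happy", "deck", "dec"}


def known_bases(stem, suffix):
    return [c for c in candidate_bases(stem, suffix) if c in lexicon]


print(known_bases("happi", "ness"))  # ['happy']
print(known_bases("deck", "ing"))    # ['deck', 'dec']
```

The second call reproduces the decking ambiguity: both the unmodified deck and the respelled dec survive, and it is left to scoring to prefer the one without a modification.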

Unknown words

It's obviously possible that a parse of a token won't produce any tails, meaning there's no interpretation whose morphemes all match entries in the lexicon. My morphological parser doesn't just give up in this case. It alters the original token, creating a single "hole" at every possible size and location in the token, and attempts to match the word in light of that hole. This involves adding one fake lexicon entry and one morpheme tree entry for the "?" symbol that gets used as a stand-in for the hole (I don't allow for tokens already containing "?" symbols). Let's say we had "unXXXXing" as our token. Since the first try would not find any acceptable tails representing completed parses, our algorithm would try all possible variations that leave at least two non-hole characters, including "un?" and "?ng", but also "unX?Xing", "unXXXX?", and "un?ing", our intuitive best bet. This gets parsed as un- [XXXX] -ing. (Anything represented as inside [brackets] was the "hole", the text not found in the lexicon.) This is better than no match, as the -ing suffix can be applied to a verb to form a verb or an adjective, narrowing the possibilities more than a completely unknown word like XXXX would.

This process does not stop as soon as one tail is found. Indeed, it generates all tails for all hole placements and leaves it to the next stage to find the best possible interpretations. This speculative process is naturally more expensive than when the parser does come up with at least one full-word interpretation.
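As a small illustration of the hole placement, this Python sketch enumerates every single-hole variant of a token that leaves at least two non-hole characters. The function name hole_variants and its min_known parameter are hypothetical; each variant would then be parsed with "?" treated as a known morpheme.

```python
def hole_variants(token, min_known=2):
    """Return (variant, hidden_text) pairs for every hole size and location."""
    variants = []
    n = len(token)
    # The hole may hide at most n - min_known characters.
    for size in range(1, n - min_known + 1):
        for start in range(0, n - size + 1):
            variant = token[:start] + "?" + token[start + size:]
            variants.append((variant, token[start:start + size]))
    return variants


variants = dict(hole_variants("unXXXXing"))
print(len(variants))       # 42
print(variants["un?ing"])  # XXXX
print(variants["?ng"])     # unXXXXi
```

Even a nine-letter token yields 42 variants to parse, which is why this fallback costs noticeably more than a direct match.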

Scoring senses

Once the first stage has produced a set of morpheme chains (word senses), the second stage scores each chain, winnows them down, and sorts them so that the first in the list is most likely to be the right sense.

I've used scoring algorithms often for searches and such, but with them I'm usually building up a positive score reflecting all the good things about each item, putting the highest scoring item at the top. This time I decided to go with a negative scoring algorithm that adds up all the downsides of a given interpretation of the word, putting the word-sense with the lowest penalty value (zero being the lowest) at the top of the favored list.

There are penalties for many potential defects. Modifications like changing "i" to "y" in happiness are penalized in favor of no-modification interpretations.  If the whole token didn't parse and we had to try out different sized gaps, there is a penalty that favors the smallest gap size. Senses not fitting the prefixes + bases + suffixes pattern are penalized. If a lexeme has multiple senses, there's a small penalty for each subsequent sense used, thus favoring the earlier ones as being more likely. If a suffix has an "attachment filter", meaning it favors attaching to words of one or more lexical categories more than others (e.g., -er(VN) versus -er(JJ)), there's a penalty if the running category violates the suffix's filter. Having more morphemes is penalized. Having multiple free morphemes (e.g., apple, care, pretty) is penalized in favor of affixes ("big -er" favored over "big E.R."). Having zero free morphemes — having only affixes — is heavily penalized. Ideally, there will be only one morpheme because it exactly matches a lexeme in the lexicon. We penalize a sense that has a suffix as its first morpheme (e.g., -ing rown) and also penalize it if it has a prefix as its last one (e.g., ado re-).
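A handful of these penalties can be sketched in Python as follows. The weights in PENALTIES are invented purely for illustration and are not the values my parser uses; the dictionary layout for a sense is likewise an assumption, and several penalties (sense order, attachment filters) are left out for brevity.

```python
PENALTIES = {  # hypothetical weights, chosen only for illustration
    "per_extra_morpheme": 3,
    "modification": 2,
    "per_hole_char": 5,
    "extra_free_morpheme": 10,
    "no_free_morpheme": 50,
    "suffix_first": 20,
    "prefix_last": 20,
}


def penalty(sense):
    """Add up the downsides of one interpretation; zero is the best score."""
    kinds = [kind for _text, kind in sense["morphemes"]]
    free = kinds.count("free")
    total = PENALTIES["per_extra_morpheme"] * (len(kinds) - 1)
    total += PENALTIES["modification"] * sense.get("modifications", 0)
    total += PENALTIES["per_hole_char"] * sense.get("hole_size", 0)
    if free == 0:
        total += PENALTIES["no_free_morpheme"]
    else:
        total += PENALTIES["extra_free_morpheme"] * (free - 1)
    if kinds[0] == "suffix":
        total += PENALTIES["suffix_first"]
    if kinds[-1] == "prefix":
        total += PENALTIES["prefix_last"]
    return total


big_er = {"morphemes": [("big", "free"), ("-er", "suffix")], "modifications": 1}
big_ER = {"morphemes": [("big", "free"), ("er", "free")], "modifications": 1}
ish = {"morphemes": [("-ish", "suffix")]}

ranked = sorted([ish, big_ER, big_er], key=penalty)
print([penalty(s) for s in ranked])  # [5, 15, 70]
```

Note that the lone -ish sense is heavily penalized but still present in the ranked list; nothing is thrown away at this point.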

One underlying assumption for this scoring algorithm is that all interpretations spit out by the morpheme splitting stage are worth considering. I don't want to disqualify a potentially valid interpretation just because it doesn't obey the usual conventions for word production. A good example of a heavily penalized sense that is actually correct is the word ish, which is sometimes used informally as a way to indicate degree. "Are you tired?" "Ish." This thinking is especially helpful when words are formed using unusual affixation. For example, the -ish suffix is intended to attach to adjectives or nouns to form an adjective (squareish, houseish), but one could also attach it to a verb (burnish, crankish, rompish). Yes, the -ish lexeme sense's filter could be expanded to include verbs, but this algorithm prefers to see all the penalty mechanisms as reflecting preferences in lexical interpretation instead of absolute disqualifiers. If the best scoring sense is heavily penalized, it's still in the game until better interpretations come along. There is no penalty threshold that disqualifies a sense.

Once scoring is done, the results are sorted and only the top-scoring sense for each representative category is kept. That is, only the best verb sense is kept, only the best noun sense is kept, and so on. I have some misgivings about this expedient, but I'm motivated by a desire to keep the syntax parsing and broader interpretation process limited to a modest number of possible interpretations. Having 100 possible interpretations for, say, a partly unknown word seems counterproductive.
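The winnowing step amounts to keeping the lowest-penalty sense per category. A minimal Python sketch, assuming each scored sense is a (penalty, category, description) tuple; the function name is hypothetical, and the third sample sense is invented to show something being dropped.

```python
def best_per_category(scored_senses):
    """Keep only the lowest-penalty sense for each lexical category."""
    best = {}
    for sense in sorted(scored_senses, key=lambda s: s[0]):
        # The first sense seen for a category is its best one.
        best.setdefault(sense[1], sense)
    return sorted(best.values(), key=lambda s: s[0])


scored = [
    (27, "N", "meat(N) -y(N→J) -er(V→N)"),
    (20, "J", "meat(N) -y(N→J) -er(J→J)"),
    (41, "J", "a hypothetical worse adjective reading"),
]
for sense in best_per_category(scored):
    print(sense)
```

Here the penalty-41 adjective reading is discarded because a better adjective reading exists, while the noun reading survives as the best of its category.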

Category guessing

At the same time each sense is scored, the lexical category for the whole word is developed. As you might guess, even this is an error-prone process. The essence of the matter involves finding the last free morpheme's lexical category and then transforming it according to the suffixes attached to it. Consider meatier, for example, which parses out to either "(J): meat(N) -y(N→J) -er(J→J)" (penalty = 20) or "(N): meat(N) -y(N→J) -er(V→N)" (penalty = 27). As desired, the final conclusion is that it's most likely an adjective, since -y typically turns a noun into an adjective (meaty, nerdy, watery) and adding -er to an adjective keeps it an adjective. The other option, where another sense of -er converts a verb into a noun (killer, taker, slicer) doesn't fit as well, but it's still an option we want to present to the consuming application for its consideration.
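The transformation can be sketched as a fold over the suffixes. In this Python illustration each suffix sense is a (text, accepted categories, resulting category) tuple; that layout and the function name guess_category are assumptions, not the parser's actual data model.

```python
def guess_category(base_category, suffixes):
    """Start from the last free morpheme's category and apply each suffix.

    Returns the resulting category and how many suffixes were attached to a
    category outside their attachment filter.
    """
    category = base_category
    violations = 0
    for _text, accepted, result in suffixes:
        if category not in accepted:
            violations += 1  # a scorer would penalize this, not reject it
        category = result
    return category, violations


# meat(N) -y(N→J) -er(J→J)
print(guess_category("N", [("-y", {"N"}, "J"), ("-er", {"J"}, "J")]))  # ('J', 0)
# meat(N) -y(N→J) -er(V→N)
print(guess_category("N", [("-y", {"N"}, "J"), ("-er", {"V"}, "N")]))  # ('N', 1)
```

The second reading ends up a noun but with one filter violation, which is consistent with it carrying the higher penalty.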

I considered allowing the prefixes to influence transformation of the category, but this raised some ambiguities. Moreover, the prefixes I considered generally don't change the category of the words they attach to. I decided to just ignore them for now.

There are plenty of compound words that my morphological parser can handle. Here's a place where CGEL was immensely helpful for my understanding of how to deal with them. For starters, it seems most of the compounds we use contain only two free morphemes (swimsuit, backwater, nosedive). I decided to effectively treat compound words as though they are made up of separate words in this narrow context. My algorithm develops a category along the way and when it finds a boundary between two (potentially affixed) free morphemes, it starts over. But it keeps track of what the categories were for the two (or more) sub-words. It then uses the conjunction of <category> + <category>, which I call the "compound pattern", to decide whether to override whatever category the last word-let came up with, which is otherwise a good predictor. Thus far I've only found two compound patterns that merit changing their default lexical categories. The first is verb+preposition (breakthrough, look-out, talking-to), which I change to noun. The other is adjective+verb (blueprint, high-set, smalltalk), which I default to being a noun. But if the verb in that adjective+verb compound ends in -ing (breathtaking, strange-looking, talking-to) or -ed (French-based, short-lived, well-behaved), I convert the total word's category to adjective.
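Those two compound patterns fit in a few lines. This Python sketch assumes single-letter category codes (V, P, J, N) and a hypothetical function that receives the categories of the two sub-words plus the suffixes found on the second one.

```python
def compound_category(first, second, second_suffixes=()):
    """Pick a category for a two-part compound from its compound pattern."""
    if first == "V" and second == "P":
        return "N"  # verb+preposition: breakthrough, look-out
    if first == "J" and second == "V":
        if second_suffixes and second_suffixes[-1] in ("-ing", "-ed"):
            return "J"  # strange-looking, short-lived
        return "N"      # blueprint, smalltalk
    return second       # default: trust the last word-let


print(compound_category("V", "P"))             # N
print(compound_category("J", "V"))             # N
print(compound_category("J", "V", ("-ing",)))  # J
print(compound_category("N", "N"))             # N
```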

Multi-word strings

There is a final step, once scoring and winnowing are done. We look at each sense to see if any of its morphemes demands that it must stand alone instead of being integral to a word. If so, we now break the total word up according to the morphemes' needs. If a word sense is composed of five morphemes and the one in the middle demands it must be expanded and its words stand on their own, the algorithm will create a new word from the first two morphemes in the original word, expand out the must-expand words from the middle morpheme, and then create a final word from the last two morphemes. For each of the new words, which are now really just new plain-text tokens, the entire process repeats and this word sense now becomes just a shell for the string of sub-words parsed in the same way. One example is shouldn't've, which breaks down to should + not + have.
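The splitting around must-expand morphemes can be sketched like this in Python. The EXPANSIONS table and the function name expand_sense are hypothetical; in the parser, whether a morpheme must stand alone would come from its lexicon entry, and each resulting token would be fed back through the whole process.

```python
# Hypothetical table of morphemes that must expand into stand-alone words.
EXPANSIONS = {"n't": ["not"], "'ve": ["have"], "'re": ["are"]}


def expand_sense(morphemes):
    """Split a morpheme chain into plain-text tokens around must-expand morphemes."""
    tokens, pending = [], []
    for text in morphemes:
        if text in EXPANSIONS:
            if pending:
                tokens.append("".join(pending))  # close off the word so far
                pending = []
            tokens.extend(EXPANSIONS[text])
        else:
            pending.append(text)
    if pending:
        tokens.append("".join(pending))
    return tokens


print(expand_sense(["should", "n't", "'ve"]))  # ['should', 'not', 'have']
```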

In truth, I'm not 100% sure about the need for this feature. Consider the HIFW (how I felt when) example. Standing on its own, it seems valuable to expand it out into a sentence like HIFW I saw it, but what if it had a suffix, as in totally HIFWing on this? "How I felt whening" doesn't make sense, while treating the whole thing as probably a verb does. This is an area I think I need to study further.

Performance tests

One way of seeing how fast this runs is to select sample words and see how many times my morphological parser can process each. I'm starting with a late 2013 iMac with a 3.4 GHz Intel Core i5 and 8GB 1600 MHz DDR3 memory, a reasonably upscale desktop computer. I wrote my code in C++ using Xcode.

My test lexicon contains 876 lexemes. I'll admit that this is much too small to be representative of a well-stocked lexicon, but I also don't believe that increasing its size will have much effect on this algorithm's performance. The main reason is that the expensive part of dealing with the lexicon is looking up a candidate morpheme. Since this is done by traversing the morpheme tree in parallel with reading each character, which takes constant time per recursive step of the morpheme parse, I expect no significant change in parse time as the lexicon gets bigger. Time will tell.

So let's take some sample words and see how many times it can parse the same word per second. First, consider tokens that had full matches:
  • 30,000 words/second:  red:  (J): red(J)
  • 17,300 w/s:  adventure:  (N): adventure(N)
  • 9,000 w/s:  recordkeeping:  (V): record(N) keep(V) -ing(V|N→V)
  • 8,500 w/s:  relies:  (V): re-(U) lie(V) -s(V→V)
  • 6,100 w/s:  breathtaking:  (J): breath(J) take(V) -ing(V|N→V)
  • 3,600 w/s:  unremittingly:  (J): un-(J→J) remit(V) -ing(V→J) -ly(N|J→J)
  • 1,700 w/s:  antidisestablishmentarianism:  (N): anti-(U) dis-(U) establish(V) -ment(V→N) -arian(N→N) ism(N)
  • 181 w/s:  happily-tippingreungreennesspotatoes:  (N): happy(J) -ly(N|J→J) -(U) tip(V) -ing(V|N→V) re-(U) un-(J→J) green(J) -ness(J→N) potato(N) -es(N→N)

Now let's try some words that don't have full matches. Note the interpretations. Some of them are clearly wrong, but they help illustrate how this algorithm works:
  • 40,000 w/s:  bug:  <no match>
  • 9,000 w/s:  redbug:  (J): red(J) [bug]
  • 2,400 w/s:  mister:  (J): my(N) -s(N→N) [t] -er(J→J)
  • 2,200 w/s:  censorize:  (V): [censo] re-(U) -ize(V|N→V)
  • 14 w/s:  punk-antidisestablishmentarianism:  (N): [punk] -(U) anti-(U) dis-(U) establish(V) -ment(V→N) -arian(N→N) -ism(N→N)

I am very happy with these results. I thought it would be orders of magnitude slower. Instead, it seems this piece could hum along at 6,000 or more words per second on average on my computer, assuming most words it comes across have full matches.

Memory consumption

Regarding memory, a simple test in which I reduce the lexicon to nearly empty shows that it consumes about 628 KB of memory. With 878 items in the lexicon, it climbs to 1 MB. Here are some actual memory measurements for lexicon sizes during loading:
  • 0:  628 KB
  • 1:  636 KB
  • 100:  700 KB  (720 B/lexeme)
  • 200:  740 KB  (560 B/lexeme)
  • 300:  804 KB  (587 B/lexeme)
  • 400:  832 KB  (510 B/lexeme)
  • 500:  876 KB  (496 B/lexeme)
  • 600:  936 KB  (513 B/lexeme)
  • 700:  968 KB  (486 B/lexeme)
  • 800:  1,020 KB  (490 B/lexeme)

[Chart: Memory: bytes per lexeme]

I'm not sure whether this means that the per-lexeme consumption flattens out at a little under 500 bytes per lexeme or if it continues downward, which I'm expecting. The morpheme tree's memory footprint should grow logarithmically. The lexicon's lexeme info should grow linearly. So let's say the average stays around 500 bytes per lexeme. That means a lexicon with one million items should consume half a gigabyte.

A more modest lexicon of 100k lexemes (words) would consume 50 MB. For comparison, as I look at the currently active programs in my computer's memory, I see that Chrome is consuming 3 GB, Google Drive is consuming 655 MB, Xcode is consuming 826 MB, and so on.
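The arithmetic behind these figures can be checked with a short script. This sketch assumes 1 KB = 1,000 bytes and that each per-lexeme figure is the growth over the 628 KB empty-lexicon baseline divided by the lexicon size. The function names are illustrative:

```python
BASELINE_KB = 628  # measured footprint with an empty lexicon

# (lexicon size, total KB) pairs copied from the measurements above
measurements = [
    (100, 700), (200, 740), (300, 804), (400, 832),
    (500, 876), (600, 936), (700, 968), (800, 1020),
]


def bytes_per_lexeme(size, total_kb, baseline_kb=BASELINE_KB):
    """Average bytes per lexeme above the baseline (assumes 1 KB = 1,000 B)."""
    return (total_kb - baseline_kb) * 1000 / size


def projected_mb(lexemes, per_lexeme_bytes=500):
    """Projected lexicon footprint in MB, assuming a flat per-lexeme cost."""
    return lexemes * per_lexeme_bytes / 1_000_000


for size, total_kb in measurements:
    print(f"{size:>4}: {bytes_per_lexeme(size, total_kb):.0f} B/lexeme")

print(projected_mb(100_000))    # 50.0
print(projected_mb(1_000_000))  # 500.0
```

The projection ignores the fixed baseline, which is negligible at these sizes.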

Fidelity tests

Of course, having an algorithm that's fast isn't as important as having one that works well. Were I writing a scholarly paper, I'd feel compelled to flesh out my lexicon and mine a corpus for test cases, but I haven't gotten around to that yet. I plan to do more serious testing of this sort in time, though.

But I do have one useful barrage test behind me. I was keenly interested in seeing how well my MP would fare against the wide variety of compound words found in CGEL's treatment of morphology. To that end, I painstakingly typed the 678 examples I found there into a data file and hand-tagged all of their lexical categories. I then created another data file containing their base words. For the example of "taxpayer-funded", I had to isolate "tax", "pay", and "fund". I then hand-tagged those words, too. Below is a snippet from the test's output:

  - sunshine             |  .  | (N): sun(N) shine(N)  (P:13)
  - swearword            |  .  | (N): swear(V) word(N)  (P:13)
  - sweetheart           |  .  | (N): sweet(J) heart(N)  (P:13)
  - swimsuit             |  .  | (N): swim(V) suit(N)  (P:13)
  - swordsman            |  .  | (N): sword(N) -s(N→N) man(N)  (P:123)
  - syntactic-semantic   |  .  | (J): syntactic(J) -(U) semantic(J)  (P:23)
  - table-talk           | (N) | (V): table(N) -(U) talk(V)  (P:23)
  - take-away            | (N) | (R): take(V) -(U) away(R)  (P:23)
  - take-off             |  .  | (N): take(V) -(U) off(P)  (P:23)
  - talking-point        |  .  | (N): talk(V) -ing(V|N→V) -(U) point(N)  (P:133)
  - talking-to           |  .  | (N): talk(V) -ing(V|N→V) -(U) to(P)  (P:133)
  - tape-record          | (V) | (N): tape(N) -(U) record(N)  (P:23)
  - tax-deductible       |  .  | (J): tax(N) -(U) deduct(V) -ible(V→J)  (P:33)
  - tax-free             |  .  | (J): tax(N) -(U) free(J)  (P:23)
  - taxpayer-funded      | (J) | (V): tax(N) pay(V) er(N) -(U) fund(V) -ed(V→V)  (P:63)
  - tearoom              |  .  | (N): tea(N) room(N)  (P:13)
  - theater-goer         |  .  | (N): theater(N) -(U) go(V) -er(V→N)  (P:35)
  - theatre-going        | (J) | (V): theatre(N) -(U) go(V) -ing(V|N→V)  (P:33)
  - thought-provoking    | (J) | (V): thought(N) -(U) provoke(V) -ing(V|N→V)  (P:53)
  - threadbare           |  .  | (J): thread(N) bare(J)  (P:13)
  - three-inch           | (J) | (N): three(D) -(U) inch(N)  (P:23)
  - three-metre-wide     |  .  | (J): three(D) -(U) metre(N) -(U) wide(J)  (P:46)
  - tightrope            |  .  | (N): tight(J) rope(N)  (P:13)
  - timberline           |  .  | (N): timber(N) line(N)  (P:13)

The center column represents the hand-tagged value. If it is the same as the MP's prediction, the column contains a period, allowing the mistakes to jump out easily. Of the 678 compound words tested, 76.8% were correctly tagged. Note that the "(P:13)" values on the far right represent penalty calculations for each of these. I'm showing only the best scoring (least penalized) interpretation for each of the test tokens.
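For anyone wanting to score output in this format, here is a minimal sketch of how the accuracy figure could be computed from such lines. It assumes the three-column layout shown above. The function names are hypothetical and this is not the code my test harness uses:

```python
def parse_line(line):
    """Split one output line into (word, expected, predicted) categories."""
    # Split on the first two pipes only: interpretations such as
    # "-ing(V|N→V)" contain pipes of their own.
    word, expected, interpretation = (p.strip() for p in line.split("|", 2))
    word = word.lstrip("- ")
    predicted = interpretation.split(":", 1)[0]  # e.g. "(N)"
    if expected == ".":  # a period means "same as the prediction"
        expected = predicted
    return word, expected, predicted


def accuracy(lines):
    """Fraction of non-blank lines whose prediction matches the hand tag."""
    rows = [parse_line(line) for line in lines if line.strip()]
    correct = sum(1 for _, expected, predicted in rows if expected == predicted)
    return correct / len(rows)


sample = """\
  - sunshine             |  .  | (N): sun(N) shine(N)  (P:13)
  - table-talk           | (N) | (V): table(N) -(U) talk(V)  (P:23)
  - talking-point        |  .  | (N): talk(V) -ing(V|N→V) -(U) point(N)  (P:133)
  - tape-record          | (V) | (N): tape(N) -(U) record(N)  (P:23)
"""
print(f"{accuracy(sample.splitlines()):.1%}")  # 50.0%
```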

During development, I relied a lot on hand-crafted example words. I reproduce some examples below:

- antidisestablishmentarianism
  - (N): anti-(U) dis-(U) establish(V) -ment(V→N) -arian(N→N) -ism(N→N)  (P:50)
- Rate: 1451.38 words/s

- buttons
  - (N): button(N) -s(N→N)  (P:10)
  - (V): button(N) -s(V→V)  (P:17)
- Rate: 17152.7 words/s

- buttoning
  - (V): button(N) -ing(V|N→V)  (P:11)
  - (N): button(N) -ing(N→N)  (P:14)
  - (J): button(N) -ing(V→J)  (P:17)
- Rate: 11904.8 words/s

- exposition
  - (N): expose(V) -ition(V→N)  (P:30)
- Rate: 12547.1 words/s

- expositions
  - (N): expose(V) -ition(V→N) -s(N→N)  (P:40)
  - (V): expose(V) -ition(V→N) -s(V→V)  (P:47)
- Rate: 7189.07 words/s

- reexpose
  - (V): re-(U) expose(V)  (P:10)
- Rate: 27100.3 words/s

- reexposure
  - (N): re-(U) expose(V) -ure(N)  (P:40)
- Rate: 15432.1 words/s

- reexposed
  - (V): re-(U) expose(V) -ed(V→V)  (P:40)
  - (J): re-(U) expose(V) -ed(V→J)  (P:42)
- Rate: 11723.3 words/s

- malignant
  - (N): malign(V) -ant(N)  (P:12)
  - (J): malign(V) ant(J)  (P:13)
- Rate: 14881 words/s

- meteorites
  - (N): meteor(N) -ite(N→N) -s(N→N)  (P:20)
  - (V): meteor(N) -ite(N→N) -s(V→V)  (P:27)
- Rate: 8992.81 words/s

- mouthy
  - (J): mouth(N) -y(N→J)  (P:10)
- Rate: 22002.2 words/s

- stubbornly
  - (J): -s(N→N) [tub] born(V) -ly(N|J→J)  (P:3343)
- Rate: 1026.69 words/s

- muddling
  - (V): [muddl] -ing(V|N→V)  (P:5015)
  - (J): [muddl] -ing(V→J)  (P:5017)
  - (N): [muddl] -ing(N→N)  (P:5019)
- Rate: 2212.39 words/s

- rapacious
  - (J): [rapac] -y(N→J) -ous(J)  (P:5025)
- Rate: 1189.06 words/s

I know I'll need to do more testing, but I'm fairly happy with the results so far.
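The listings above order each word's interpretations from least to most penalized. A minimal sketch of that ranking step follows, using the "buttoning" penalties from the listing. The `Interpretation` class and `best` function are hypothetical stand-ins, and how the penalties themselves are calculated is not shown here:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    category: str     # lexical category of the whole word, e.g. "V"
    morphemes: tuple  # morpheme labels in order of appearance
    penalty: int      # lower is better


def best(interpretations):
    """Return the least penalized interpretation."""
    return min(interpretations, key=lambda i: i.penalty)


# Deliberately unsorted input; penalties copied from the listing above.
buttoning = [
    Interpretation("J", ("button(N)", "-ing(V→J)"), 17),
    Interpretation("V", ("button(N)", "-ing(V|N→V)"), 11),
    Interpretation("N", ("button(N)", "-ing(N→N)"), 14),
]

ranked = sorted(buttoning, key=lambda i: i.penalty)
print([i.category for i in ranked])  # ['V', 'N', 'J']
print(best(buttoning).category)      # V
```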

Applications and future work

While my main goal in creating a morphological parser is to create a mechanism for building a "virtual lexicon" that supports syntax parsing by guessing at lexical categories for words, I see other potential uses, too.

For starters, an MP should be able to aid the process of building a lexicon. Imagine doing so by importing documents. For each document, the lexicon-builder tool calls out words it doesn't already recognize. Take the "muddling" example from above. The best guess was that the word is a verb, which is correct. It called out "muddl" as the unknown. Moreover, one could use the -ing(V|N→V) lexeme sense, which indicates that "-ing" usually attaches to a verb or (secondarily) a noun to form a verb, to guess that "muddl" is most likely a verb, which is also correct. The only thing wrong is the spelling, since this involved lopping off a final "e". The user would need to review and finesse each suggested entry found this way.
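As a rough sketch of that guess, the snippet below parses a sense label in the notation used in the output above and returns the first listed source category. It assumes source categories are listed most-likely first. The regular expression and function names are illustrative, not part of my MP:

```python
import re

# Matches labels such as "-ing(V|N→V)" or "-y(N→J)".
SENSE_RE = re.compile(
    r"^(?P<form>-?\w+-?)\((?P<sources>[A-Z|]+)→(?P<result>[A-Z])\)$"
)


def parse_sense(label):
    """Split a sense label into (form, source categories, result category)."""
    match = SENSE_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized sense label: {label!r}")
    return match["form"], match["sources"].split("|"), match["result"]


def guess_stem_category(suffix_label):
    """Guess an unknown stem's category from the suffix attached to it."""
    _, sources, _ = parse_sense(suffix_label)
    return sources[0]  # assumption: sources are ordered by likelihood


print(parse_sense("-ing(V|N→V)"))          # ('-ing', ['V', 'N'], 'V')
print(guess_stem_category("-ing(V|N→V)"))  # V
```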

I also believe this could be used to enhance a typical spell checker. First, it could allow the spell checker to distinguish between "hard" and "soft" misspellings. That is, it could call out words that fit word-formation patterns but are not in an otherwise large lexicon as "soft" misspellings. Second, it could recognize when a word looks like a proper inflection of a base word but is actually not. If the lexeme sense indicated that a base word does not follow the usual inflection rules and called out the alternatives, the spell checker could suggest the correct one. For example, "badder" might lead to "worse" as a suggestion, as "badder" appears to be the comparative of "bad". Similarly, "worser" could be called out as a nonstandard comparative, with "worse" being suggested. "Childs" becomes "children". And so on. These would generally be favored over typo-assuming suggestions like "balder", "worsen", and "child".
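Here is a minimal sketch of the hard/soft distinction and the irregular-form suggestion. The `analysis` argument stands in for whatever the MP would report as its best guess, and the irregular-form table and function name are hypothetical:

```python
# Hypothetical irregular-inflection table: (base, inflection) -> standard form.
IRREGULAR = {
    ("bad", "comparative"): "worse",
    ("child", "plural"): "children",
}


def check(word, lexicon, analysis=None):
    """Classify `word` and suggest a fix where the base inflects irregularly.

    `analysis` stands in for the MP's best guess: a (base, inflection)
    pair, or None when the word fits no word-formation pattern.
    """
    if word in lexicon:
        return "ok", None
    if analysis is None:
        return "hard", None  # fits no pattern, so likely a plain typo
    return "soft", IRREGULAR.get(analysis)  # well-formed but unlisted


lexicon = {"bad", "worse", "child", "children"}
print(check("badder", lexicon, ("bad", "comparative")))  # ('soft', 'worse')
print(check("childs", lexicon, ("child", "plural")))     # ('soft', 'children')
print(check("bqd", lexicon))                             # ('hard', None)
```

A soft misspelling with no entry in the table returns no suggestion, leaving the usual typo-based suggestions as a fallback.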

One problem I intend to apply my MP to is improved tokenization. Consider the sentence "When will you see Prof. Smith?" A basic tokenizer would see the period in "Prof." and assume it marked the end of a sentence. "Smith?" could, after all, be a complete sentence, too. I think my lexicon is going to need to have common abbreviations like "Prof.", "Mrs.", and "etc." to help disambiguate period usage. One option would be to include the period in a token that is otherwise word-like and ask the MP to render an opinion about whether the period is part of the word or more likely punctuation. This would extend to period-delimited formats like "R.I.P." and "R.S.V.P.", where it seems logical for the MP, which looks at words character-by-character, to recognize and "correct" this pattern. After all, the lexicon may have "RSVP" in it but not the redundant "R.S.V.P." defined, so it would be helpful to recognize and transform this pattern before traversing the morpheme tree.
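The period-delimited case could be handled with a small pre-pass like the sketch below. It only collapses a token when the collapsed form is already in the lexicon, so an abbreviation like "e.g." is left alone unless "eg" is a known entry. The pattern and function name are assumptions for illustration:

```python
import re

# A single letter followed by a period, repeated at least twice,
# as in "R.I.P." or "R.S.V.P.".
INITIALISM_RE = re.compile(r"^(?:[A-Za-z]\.){2,}$")


def collapse_initialism(token, lexicon):
    """Return the period-free form if it is a known lexicon entry."""
    if INITIALISM_RE.match(token):
        collapsed = token.replace(".", "")
        if collapsed in lexicon:
            return collapsed
    return token


lexicon = {"RSVP", "RIP"}
print(collapse_initialism("R.S.V.P.", lexicon))  # RSVP
print(collapse_initialism("Prof.", lexicon))     # Prof.
print(collapse_initialism("e.g.", lexicon))      # e.g.
```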

Related to ambiguous periods are ambiguous apostrophes ('). If a word begins or ends in an apostrophe, does that signal a possessive, an eye-dialect spelling ("'nother", "lil'", "readin'"), or single-quoted prose? The lexicon could help if it contained common eye-dialect examples. And the MP could figure out if a trailing s' likely represents a possessive ("fixing the dogs' dinners").
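A minimal sketch of how candidate readings for an edge apostrophe might be enumerated follows. The eye-dialect set, the reading labels, and the ordering of guesses are all assumptions, and a real version would lean on the lexicon and the MP rather than these simple string checks:

```python
# Hypothetical eye-dialect entries that a lexicon might contain.
EYE_DIALECT = {"'nother", "lil'", "readin'"}


def apostrophe_readings(token, eye_dialect=EYE_DIALECT):
    """Return plausible readings of a leading or trailing apostrophe.

    Readings are listed best guess first; a token with no edge
    apostrophe yields an empty list.
    """
    readings = []
    lowered = token.lower()
    if lowered in eye_dialect:
        readings.append("eye-dialect")
    if len(lowered) > 2 and lowered.endswith("s'"):
        readings.append("possessive")
    if token.startswith("'") or token.endswith("'"):
        readings.append("quote")  # could always be single-quoted prose
    return readings


print(apostrophe_readings("dogs'"))    # ['possessive', 'quote']
print(apostrophe_readings("readin'"))  # ['eye-dialect', 'quote']
print(apostrophe_readings("dog"))      # []
```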

Because the MP returns multiple options, it certainly can return several interpretations of embedded periods and apostrophes. It might be best for the tokenizer, confronted with a paragraph, to conservatively attach periods and apostrophes to the otherwise word-like tokens they are nearest as a first step, then have the MP come up with naive guesses for tokens' categories, word splits, and punctuation interpretations. Only after that would a next stage come up with one or more interpretations of where the sentence boundaries are, a break with the traditional tokenization → lexicalization → syntax parsing flow. Then it would be up to the syntax parser to figure out which proposed sentence boundaries make the most sense, grammatically.
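The conservative first step might look something like the sketch below, which keeps periods, apostrophes, and hyphens attached to word-like tokens and leaves every other punctuation mark as its own token. The regular expression is an assumption about what "word-like" means here, not a description of an existing tokenizer:

```python
import re

# A word-like token: word characters, optionally joined by internal
# periods, apostrophes, or hyphens, with optional periods or apostrophes
# on either edge. Any other non-space character stands alone.
TOKEN_RE = re.compile(r"['.]*\w+(?:['.\-]\w+)*['.]*|\S")


def conservative_tokens(text):
    """Split text into tokens without deciding what periods mean yet."""
    return TOKEN_RE.findall(text)


print(conservative_tokens("When will you see Prof. Smith?"))
# ['When', 'will', 'you', 'see', 'Prof.', 'Smith', '?']
print(conservative_tokens("fixing the dogs' dinners."))
# ['fixing', 'the', "dogs'", 'dinners.']
```

Note that "dinners." keeps its period at this stage. Deciding that it is sentence-final punctuation rather than part of the word is deferred to the later stages.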

Although my current morphological parser code is already my second version, I've only been working on this for two and a half weeks. I have no doubt this deserves quite a bit more work. But overall, I'm very happy with the initial results. My sense is that my MP works effectively and efficiently and that it will serve several parsing goals at once.

