Saturday, November 26, 2016

Adaptive tokenizer

I knew that to continue advancing the refinement of my virtual lexicon, I'd need to throw it at real text "in the wild". To do that meant revisiting my tokenizer so I could feed words into it. The first tokenizer I made for my Basic English Parser (BEP) project is very primitive, so I decided to trash it and start over.


Programmers often find themselves needing to parse structured data, such as comma-separated values or XML files. In parsing richly structured data, like code written in a programming language, we refer to the smallest bits of data, usually text, as "tokens". In this case, my tokenizer is tasked with taking as input a single paragraph of English prose in plain-text format and producing as output a list of tokens.

In my tokenizer, tokens presently come in a small variety of types: word, symbol, number, date/time, and sentence divider. Since recognizing a token requires characterizing it, tokenization is the natural place to assign a type to each token, but only to the degree that it is competent to do so. A tokenizer can readily recognize the pattern for a typical word, for example, but cannot classify that word's lexical category (noun, adjective, etc.). Each piece of a good parser has its own specialties.
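These token types can be pictured as a simple tagged record. Here is a minimal sketch of my own; the enum and field names are illustrative, not the project's actual API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical token record mirroring the types described above.
enum class TokenType { Word, Symbol, Number, DateTime, SentenceDivider };

struct Token {
    TokenType type;                 // assigned during tokenization
    std::string text;               // the token's characters as they appeared
    bool whitespaceBefore = false;  // was there whitespace before this token?
};

using TokenList = std::vector<Token>;
```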

English-speaking programmers familiar with typical parsing usually rely on conveniences bundled into the ASCII data standard. With a few logical and mathematical comparisons, you can easily tell capital letters from lower-case ones, and letters from digits from other symbols. Because I am opting to deal properly with Unicode, those distinctions don't come so easily. I had to start with a custom Unicode library that can deal efficiently with normalized UTF-8 representations of text, where a single visible character in a text file might be represented by a code point packed into 1 to 4 bytes, or possibly a string of code points bundled into a grapheme cluster. My unicode::unistring class easily interfaces with std::string. Its character iterator, .size() method, and other members mimic those of the standard string class, but deal at the grapheme-cluster level instead of the byte level.

Although language parser programmers often favor designs based on BNF grammars, I've opted to just hard-code a mostly linear parsing algorithm. It progresses character by character, looking for boundaries between token types. The most obvious boundaries are found when we hit white-space, including spaces, tabs, and new-line characters. For numbers and dates, I'm relying on regular expressions to look ahead of the current character. Everything else initially comes down to punctuation and symbol characters. If it's a symbol character and we've been looking at a word, we recognize the boundary of that word token and pack the single symbol character into its own token, ready to begin searching for the next token.
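The core of that linear scan might look something like the following. This is an ASCII-only simplification of my own (the real code iterates grapheme clusters via the Unicode library) and it omits the regex look-ahead for numbers and dates as well as the hyphen/period/apostrophe exceptions discussed below:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Minimal sketch of a linear, character-by-character scan: split on
// whitespace, accumulate alphanumeric runs into words, and emit each
// punctuation/symbol character as its own token.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            if (!current.empty()) { tokens.push_back(current); current.clear(); }
        } else if (std::isalnum(c)) {
            current += c;  // still inside a word or number
        } else {  // symbol: close any pending word, emit the symbol alone
            if (!current.empty()) { tokens.push_back(current); current.clear(); }
            tokens.push_back(std::string(1, c));
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```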

With some special exceptions. Consider hyphens. It's tempting to take a term like "AFL-CIO" or "singer-songwriter" and split it into separate words. Hyphenated phrases like "lighter-than-air" make a compelling case for this seductive option, which fits neatly the model where a symbol character like the minus sign (-) is its own token and marks boundaries to its left and right. But sometimes this doesn't make sense, as with non-compliant and pre-industrial. It would be hard to make sense of the syntax of a sentence that included non compliant or pre industrial. Also, hyphenation of words sometimes changes the lexical category (LC) of a combination of words, as with an off-the-cuff remark, where off-the-cuff is effectively an adjective, even though none of its words is an adjective. I conclude that the tokenizer is not qualified to decide when it's appropriate to subdivide a hyphenated word. Nor, even, is the virtual lexicon. I suspect this will fall to the syntax parser, if only because that module will be better able to evaluate whether the string of hyphenated words forms a syntactic structure than whether the single hyphenated word does. Finally, there are special cases of words that end in a hyphen that are meant to link to a later word, as in pre- and post-op photos.

Thus, the tokenizer allows single hyphens that appear within words or as the last character in a word. Minus signs that appear outside the context of a word, as when there are spaces around them, are considered symbols and not words. If an apparent word has two or more consecutive minus signs in it (e.g., in full knowledge--at least, afterward), the tokenizer breaks it up into separate words. This takes care of the common case where an em dash is represented by two minus characters with no space around them.
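The double-hyphen rule can be sketched as a small helper; the function name is my own, and this is only the splitting part of the rule (single hyphens inside or at the end of a word are left attached):

```cpp
#include <cassert>
#include <string>
#include <vector>

// A single '-' stays attached to its word; a run of two or more '-'
// characters (a typed em dash) splits the token into separate words.
std::vector<std::string> splitOnDoubleHyphen(const std::string& word) {
    std::vector<std::string> parts;
    std::string current;
    size_t i = 0;
    while (i < word.size()) {
        if (word[i] == '-' && i + 1 < word.size() && word[i + 1] == '-') {
            if (!current.empty()) { parts.push_back(current); current.clear(); }
            while (i < word.size() && word[i] == '-') ++i;  // skip the dash run
        } else {
            current += word[i++];
        }
    }
    if (!current.empty()) parts.push_back(current);
    return parts;
}
```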

The same constraint applies to single periods and apostrophes, which are allowed to be in the middle of a word token (ASP.NET, isn't) or on the left (.com, 'ere) or right (etc., cousins') of one.

Generally, the tokenizer preserves the actual characters from the text, but it does do some conversions. For starters, it does not preserve whitespace, per se. Each token indicates whether or not it has whitespace before it, but not how many characters or which. The various kinds of horizontal dash characters are also condensed down to the standard ASCII minus sign (-).

My tokenizer also condenses various types of double quote characters (“ & ”) down to the basic ASCII double quote (") character. Same for single quotes (‘ & ’), which are reduced to the ASCII apostrophe (') character. Although I may regret this later, as their polarity (left or right) can carry clues to the boundaries of quoted blocks of text, for now they would just complicate lexical analysis. They probably would complicate syntax analysis, too, because of inconsistent use of these in modern and older text files. Moreover, I would still face the problem of properly recognizing triple quotes, quadruples, etc., typically represented by combinations of double and single quote characters.
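The dash and quote condensing amounts to a byte-level substitution pass over the UTF-8 text. Here is a minimal sketch, assuming the common Unicode curly-quote and dash code points (the real code works at the grapheme-cluster level and may cover more characters):

```cpp
#include <cassert>
#include <string>

// Collapse curly quotes and dash variants down to their ASCII equivalents.
// The byte sequences below are the UTF-8 encodings of the usual Unicode
// left/right quotes and en/em dashes.
std::string normalizePunctuation(std::string text) {
    struct Map { const char* from; const char* to; };
    const Map maps[] = {
        {"\xE2\x80\x9C", "\""}, {"\xE2\x80\x9D", "\""},  // curly double quotes
        {"\xE2\x80\x98", "'"},  {"\xE2\x80\x99", "'"},   // curly single quotes
        {"\xE2\x80\x93", "-"},  {"\xE2\x80\x94", "-"},   // en dash, em dash
    };
    for (const auto& m : maps) {
        const std::string from = m.from;
        size_t pos = 0;
        while ((pos = text.find(from, pos)) != std::string::npos) {
            text.replace(pos, from.size(), m.to);
            ++pos;
        }
    }
    return text;
}
```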

Sentences and paragraphs

I specifically decided not to make my tokenizer responsible for starting with a raw text file and splitting it up into sections and paragraphs. My main reason is that it should be relatively easy in many cases for an earlier-stage parser to find paragraph boundaries, but that different formats require different approaches. From a syntax standpoint, paragraphs might be considered the largest significant unit of text.

Why not stop at the sentence level? A typical syntax analyzer deals only with a sentence. The problem is that it's often difficult to find clear sentence boundaries. Question marks and exclamation points are strong indicators of sentence-end punctuation, but periods are not. Consider a sentence that contains the abbreviation "etc.". My lexicon can tell me that "etc" is an abbreviation, which means it typically has a period after it. And that could be the end of it, if only we ended a sentence with an extra period after an abbreviation, but we usually don't. Moreover, we often punctuate a sentence that ends with double quotes with a period just before the final quote. I reasoned that the best my tokenizer can do is mark places where it thinks sentences might end, and some indication of how certain it is in each case.

To facilitate this, the tokenizer offers a special sentence-divider token type. Sentence dividers are tokens inserted into the token stream just like words, symbols, numbers, etc.

To help add certainty, the tokenizer will pass any word-like token into the virtual lexicon and attach the returned "word" data structure to the token. As described previously, the VL will indicate that a word has leading or trailing period and/or apostrophe characters. A word's various senses (interpretations) will indicate whether they consider the period/apostrophe characters to be integral to their words, as with .com, etc., 'nother, and ol', or to otherwise be unrelated to the word. The tokenizer checks every sense to see if any of them allows for an integral-period interpretation. It can use this to indicate whether or not it is "certain" that a period at the end of a word represents a separate symbol. If it does come to this conclusion, the tokenizer will actually strip off the trailing period and add it as a new token, asking the VL to reparse the new word without the trailing period.

More precisely, the tokenizer does this same thing for the beginning period/apostrophe and for the end one. It creates new tokens before or after the word token when the word contains no interpretations in which that leading or trailing period or apostrophe are integral to the word.
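The strip-and-detach step described above can be sketched like this, with a callback standing in for the virtual lexicon query (the function and parameter names are hypothetical):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// If no sense of the word treats its trailing period as integral, detach the
// period into its own token. `integralPeriod` stands in for asking the VL
// whether any sense claims the period as part of the word.
std::vector<std::string> detachTrailingPeriod(
    const std::string& word,
    const std::function<bool(const std::string&)>& integralPeriod) {
    if (!word.empty() && word.back() == '.' && !integralPeriod(word))
        return {word.substr(0, word.size() - 1), "."};
    return {word};  // period is integral (or absent); leave the token alone
}
```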

In the special case of an abbreviation (Dr., etc., appt.) or period-inclusive initialism (r.s.v.p., C.I.A.) word that integrates an ending period, the tokenizer still creates a sentence-divider token, but it marks it as uncertain about this conclusion.

Punctuation sometimes appears within quoted text in sentences. Consider this example:
To those who ask "Why bother?" I say "Because it's right." I hope they listen.
My intention in this example is that the first quoted sentence not also mark the end of the outer sentence that contains it. However, the second one does mark the end. From a tokenization or lexicon standpoint, there is no way to correctly draw these conclusions. My tokenizer thus kicks this can down the road by marking the "?" and "." before each closing quote as definite sentence ends and adding a maybe-a-sentence-end marker after each closing quote. It will fall to the syntax analyzer to consider alternative hypotheses.
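This kick-the-can rule can be sketched as a pass over the token stream; the types and names below are illustrative only:

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Tok { std::string text; };
enum class Divider { None, Certain, Possible };

// Sentence-final punctuation just before a closing quote gets a definite
// sentence-end mark; a tentative mark is added after the quote itself,
// leaving the final decision to the syntax analyzer.
std::vector<Divider> markSentenceEnds(const std::vector<Tok>& toks) {
    std::vector<Divider> marks(toks.size(), Divider::None);
    for (size_t i = 0; i < toks.size(); ++i) {
        bool endPunct = toks[i].text == "." || toks[i].text == "?" ||
                        toks[i].text == "!";
        if (endPunct && i + 1 < toks.size() && toks[i + 1].text == "\"") {
            marks[i] = Divider::Certain;       // end inside the quotes
            marks[i + 1] = Divider::Possible;  // outer sentence may end too
        }
    }
    return marks;
}
```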

The bottom line here is that neither the tokenizer nor the virtual lexicon is qualified to say for sure where sentences end. The best they can do is propose options.

Naming names

One of the other features I've added to my BEP is an ability to spot likely name words in text. This presents a difficult challenge. While many of the unknown words found by an English parser are simply inflectional derivatives (e.g., harm to harms, harmed) and assemblies of morphemes (harmful, harmless) missing from that lexicon, a large fraction of the remaining unknown words are going to be names of people, places, events, organizations, products, and so on. It seems inappropriate to attempt to fill a lexicon with a list of all such named entities because, no matter how authoritative it is, there will always be more names emerging.

Instead, I'm employing a tactic of looking at the characteristics of unknown words to see if they might potentially be names. The most basic characteristic is whether the word is capitalized. But this needs to be mediated by whether the word appears at the beginning of a sentence. That's the main reason the tokenizer proposes sentence boundaries. Words that appear immediately after sentence boundaries (opening quotes are factored in) are ignored in the initial survey to find potential names for the simple reason that in English, we typically capitalize the first word in a sentence. I'm looking for words that are either capitalized (John, Woodstock, Florida) or all-caps (FBI, DARPA, NASCAR) that are not already in my lexicon.
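The capitalization tests reduce to a few character checks. This is an ASCII-only sketch of my own (the real project works through the Unicode library, and treats internally capitalized words like iPhone separately):

```cpp
#include <cassert>
#include <cctype>
#include <string>

bool isCapitalized(const std::string& w) {
    return !w.empty() && std::isupper((unsigned char)w[0]);
}

bool isAllCaps(const std::string& w) {
    if (w.empty()) return false;
    for (unsigned char c : w)
        if (std::isalpha(c) && !std::isupper(c)) return false;
    return true;
}

// A word is a name candidate if it is Capitalized or ALL-CAPS, is not in the
// lexicon, and does not sit immediately after a sentence boundary.
bool isNameCandidate(const std::string& w, bool startsSentence, bool known) {
    if (known || startsSentence) return false;
    return isCapitalized(w) || isAllCaps(w);
}
```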

One benefit of tokenizing at the paragraph level is that it allows me to deal well with headlines. Headlines that have most of their words capitalized (Harry Potter and the Chamber of Secrets) confound the ability to find name words. Part of the computation is that if a paragraph has a large percentage of its words either capitalized or in all-caps, it is disqualified from the search for name words. This also allows for dealing with snippets of text where almost all letters are capitalized, which similarly makes finding potential names impossible.
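That disqualification check might look like the following sketch. The 0.5 cutoff here is my own guess for illustration, not the blog's actual threshold:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Disqualify a paragraph from the name search when too large a fraction of
// its words are capitalized (headline- or all-caps-style text).
bool paragraphDisqualified(const std::vector<std::string>& words,
                           double cutoff = 0.5) {
    if (words.empty()) return false;
    int caps = 0;
    for (const auto& w : words)
        if (!w.empty() && std::isupper((unsigned char)w[0])) ++caps;
    return static_cast<double>(caps) / words.size() >= cutoff;
}
```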

Once the first pass to identify potential name words is done, I do a second pass later to "name names". An important part of that process is that it then covers words that were previously disqualified. If the word Rupert appeared in the middle of a sentence in a regular paragraph during the initial sweep, it got added to a list of potential name words. In this second pass, instances of Rupert that appeared as the first word in sentences would also be tagged as name words. So would instances found in title "paragraphs". One proviso, though, is that this later call to name names specifies a threshold. Any potential name found with fewer qualifying occurrences in text than this threshold value gets ignored. Those at or above the threshold get flagged as name words. In my testing, I stuck with a threshold of 1, which effectively disables the threshold.
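The threshold step of that second pass reduces to a simple count filter; this sketch assumes the qualifying-occurrence counts have already been gathered:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Promote any candidate with at least `threshold` qualifying occurrences to
// a name word; candidates below the threshold are ignored.
std::set<std::string> nameNames(const std::map<std::string, int>& counts,
                                int threshold) {
    std::set<std::string> names;
    for (const auto& [word, n] : counts)
        if (n >= threshold) names.insert(word);
    return names;
}
```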

I should also point out that this algorithm will miss names that only appear at the beginnings of sentences in a document.

The first piece of this puzzle is the virtual lexicon, which helps to characterize words. In particular, it can flag a given word as being capitalized (e.g., Joshua, iPhone, MyVMS) or all-caps. For such a word, it extracts a potential name. This is separate from the word's raw text in part because names often come in possessive form with s' or 's at its tail. The potential name excludes the apostrophe or apostrophe-S ending and the word indicates that its potential name is possessive. This allows a completely unknown word to be available as a potentially possessive proper noun to the syntax analyzer later. But it also allows Francis, Francis' and Francis's all to represent the same Francis name when it comes time to name names.
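The possessive-stripping part of potential-name extraction can be sketched as follows, so that Francis, Francis', and Francis's all yield the same name (the struct and function names are mine):

```cpp
#include <cassert>
#include <string>

struct PotentialName {
    std::string name;  // the word minus any possessive ending
    bool possessive;   // did the word carry 's or a trailing apostrophe?
};

PotentialName extractName(const std::string& w) {
    if (w.size() >= 2 && w.compare(w.size() - 2, 2, "'s") == 0)
        return {w.substr(0, w.size() - 2), true};   // Francis's -> Francis
    if (!w.empty() && w.back() == '\'')
        return {w.substr(0, w.size() - 1), true};   // Francis'  -> Francis
    return {w, false};
}
```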

The VL comes up with these potential names even for derived words or words that come directly from the lexicon. This reflects the fact that even common words of all LCs can be used as proper nouns. This allows the senses the VL comes up with for a given word to coexist with this potential for this specific use of the word to alternatively be used as a name. You don't want a syntax parser to be unable to discover syntactic structure because a long string of words are all capitalized and thus marked as having a pseudonymous "proper noun" lexical category that paves over its actual LCs.

On that note, when the name-names function is applying the used-as-a-name flag on words based on the criteria described above, it does not apply the flag to words that are not capitalized or all-caps. So a phrase like how Mark made his mark on me can contain both common and name-flagged versions of the same word. As indicated before, flagging a word as used-as-a-name does not negate the senses (and LCs) returned by looking it up in the virtual lexicon.

Ultimately, the syntax analyzer will have to sort out the name-oriented reality of each word based on how it's used in a given sentence, but it will already have a variety of extra characteristics at its disposal, including whether the word is capitalized, whether it is flagged as used-as-a-name, and whether a given word sense is flagged as a proper noun.

Lexicon building

One of my main goals in revisiting my tokenizer, now that I have a good virtual lexicon mechanism, is to test my VL and continue building my lexicon. In practice, this also means continuing to refine the VL and tokenizer, too. Thus far, I've only processed two documents: a short biography of Mark Twain and a CNN news story about America's black working class. My current process begins with letting my tokenizer have a go at a document. Among other things, it generates lists of potential names it found, unknown words, and derived words. Here is an example of the output my test program generates.

I mainly work through the list of unknown words, creating entries in my lexicon as needed. If, for example, I see flippancy, I'll create an entry for flippant, knowing it will derive flippancy from it. I generally don't add names to the lexicon, with some exceptions. I will add common names like US states and major cities, commonly occurring people's names like John, and so on. I flag them as proper nouns. Here's an example of what my lexicon file looks like.

Once I've worked through the unknown words, I move on to the list of derived words. Here's a sample of what that looks like:

  • official        | 1 | N: off(P) -ic(N|V→J) -ial(N)  (N or J)
  • opposed         | 1 | V: oppose(V) -ed(V→V)  (V or J)
  • organizer       | 1 | N: organize(V) -er(V→N)  (N or J)
  • others          | 1 | N: other(N) -s(N→N)  (N or V)
  • overalls        | 1 | N: over(P) -al(N→J) -s(N→N)  (N or V)
  • overwhelmingly  | 1 | R: over(P) whelm(V) -ing(V|N→V) -ly(N|J→R)
  • pan-african     | 1 | N: pan(N) -(U) Africa(N) -an(N→N)  (N or D)
  • partnership     | 1 | N: partner(N) -ship(N→N)
  • partnerships    | 1 | N: partner(N) -ship(N→N) -s(N→N)  (N or V)
  • payments        | 1 | N: pay(V) -ment(V→N) -s(N→N)  (N or V)
  • personal        | 1 | J: person(V) -al(N→J)
  • poorer          | 1 | J: poor(J) -er(J→J)  (J or N)
  • populism        | 1 | N: popul~(J) -ism(N→N)
  • ports           | 1 | N: port(N) -s(N→N)  (N or V)
  • president       | 1 | J: preside(V) -ent(J)
  • president-elect | 2 | N: preside(V) -ent(J) -(U) elect(V)
  • pressures       | 1 | N: press(V) -ure(N) -s(N→N)  (N or V)
  • problems        | 1 | N: problem(N) -s(N→N)  (N or V)
  • professor       | 2 | N: pro-(U) fess(V) -or(V|N→N)  (N or V)

Most of these look very appropriate. I'm looking for oddballs, like with official, which it derived as "off -ic -ial". All I have to do is add office to the lexicon and it gets it right as "office -ial".

I also study a relatively compact representation of the parsed text. Here's an excerpt:

 '"' The(D) notion(N,P) of(P) the(D) white(J) working(V,N) class(N) implicitly(R) embodies(N,V) a(D) view(N) of(P) white(J) privilege(N) ','  '"'  said(V) @[Spriggs](?) ','   '"' It(N) implies(V,N) that(N,S,P) things(N,V) are(V) supposed(V,J) to(P,R) be(V) different(J) for(P) them(N) ','  that(N,S,P) they(N) aren't(Phr,Phr,Phr) the(D) same(J) ','  that(N,S,P) they(N) aren't(Phr,Phr,Phr) going(V,N) to(P,R) face(N) the(D) same(J) pressures(N,V) '.'  <<sentence end!>>  '"'  <<sentence end?>> 

There(N) is(V) no(D) official(N,J) definition(N) of(P) the(D) working(V,N) class(N) '.'  <<sentence end!>>  Some(D,N,R) define(V) it(N) by(P) education(N) level(N) or(C) job(N) sector(N) ','  others(N,V) by(P) income(V) '.'  <<sentence end!>>  John(N) @[Hudak](?) ','  a(D) senior(J) fellow(N) in(P) governance(N) studies(N,V) at(P) the(D) Brookings(N,V) Institution(N) calls(V,N) these(D,N,R) voters(N,V)  '"' economically(R,J) marginalized(V,J) '"'  because(S) they(N) often(R) fall(V,N) somewhere(R,N) between(P) the(D) poor(J) and(C) the(D) middle(N) class(N) '.'  <<sentence end!>> 

After each word is a list of the lexical categories for the senses the VL came up with for each word, listed in order of how likely it considered each sense to be the correct one. My LCs here include: (N)oun, (V)erb, ad(J)ective, adve(R)b, (P)reposition, (D)eterminative, (S)ubordinator, (C)oordinator, (I)nterjection, s(Y)mbol, and (U)nspecified. I consider this list a work in progress. If a word needs to be interpreted as a phrase, as with aren't (are not), you'll see "Phr" in place of an LC.

Note the <<sentence end!>> and <<sentence end?>> tokens. The ones with exclamation points (!) indicate high-certainty markers and the ones with question marks (?) indicate low-certainty but potential markers.

If I want to better understand how my VL treats a word, I can generate a more detailed output for that word like the following:

- Word: economically
- R: economy(N) -ic(N|V→J) -al(N→J) -ly(N|J→R)  (R or J)

- Short list of senses:

  - R: economy(N) -ic(N|V→J) -al(N→J) -ly(N|J→R)  (141)
  - J: economy(N) -ic(N|V→J) all(D) -y(N→J)  (160)

- All senses considered:

  - R: economy(N) -ic(N|V→J) -al(N→J) -ly(N|J→R)  (141)
  - J: economy(N) -ic(N|V→J) all(D) -y(N→J)  (160)
  - J: economy(N) -ic(N|V→J) -al(N→J) -y(N→J)  (160)
  - J: economy(N) -ic(N|V→J) all(R) -y(N→J)  (162)
  - R: economy(N) -cy(J→N) -al(N→J) -ly(N|J→R)  (171)
  - J: economy(N) -cy(J→N) -al(N→J) -y(N→J)  (220)
  - R: economy(N) -ic(N|V→J) a(D) -ly(N|J→R) -ly(N|J→R)  (220)

One assumption I'm making is that there is a natural ability for words in the major LCs to be converted to the others. The noun candy can easily be used as a verb, as in We candied apples for the party. The adjective beautiful can be used as a noun, as in The beautiful have it easy. My VL is mainly charged with giving its top pick of LC for a given word, but as requested, it will either give one sense for each distinct LC or will give all senses it conceives of. The latter I consider to be only of use in debugging, though.

The special exception is with phrase expansions, which are not really LCs to be condensed down. For example, Fred's could be interpreted as the possessive form of Fred or as Fred is or Fred has. Which one is appropriate depends on the syntax, as illustrated in Fred's been drinking Fred's own root beer. Fred's happy with how it turned out.

Performance metrics

In my earlier performance tests of my VL, I was testing to see how fast it could recognize a single, specific word. I knew that I wouldn't be able to get a meaningful average rate, though, until I started testing real texts. Now that I have, I can say that on my late-2013 iMac, I'm typically seeing parse rates of over 2,000 words per second. To clarify, that's not counting symbols and other tokens; just words that get processed by the VL. That timing is for the total process of tokenization, VL lookups, name finding, and generating statistics for my debugging use. On my early-2015 MacBook, the rate is typically above 1,200 words per second.

Generally speaking, the more unknown words there are, the slower the VL runs as it tries its best to find approximate matches. So as I was adding new lexemes to my lexicon, this algorithm got ever faster.

Here is a typical output of statistics at the end of my test runs:

  • 996 total words
  • 457 distinct words (45.9% of total)
  • 437 known words (95.6% of distinct)
  • 20 unknown words (4.4% of distinct)
  • 216 derived words (47.3% of distinct)
  • 204 known derived words (44.6% of distinct, 46.7% of known, 94.4% of derived)
  • 423 lexemes used (96.8% of the number of known words)
  • 374 free/bound lexemes used (85.6% of the number of known words)
  • Rate: 2.1 parses/second
  • Rate: 2600.0 tokens/second
  • Rate: 2096.8 words/second
  • 3 iterations took 1.4 seconds
  • 1427 lexemes defined

For comparison, I just now extracted a third document. Without making any changes to the lexicon, here are the stats it outputs on a first run:

  • 2227 total words
  • 856 distinct words (38.4% of total)
  • 528 known words (61.7% of distinct)
  • 328 unknown words (38.3% of distinct)
  • 545 derived words (63.7% of distinct)
  • 307 known derived words (35.9% of distinct, 58.1% of known, 56.3% of derived)
  • 429 lexemes used (81.2% of the number of known words)
  • 376 free/bound lexemes used (71.2% of the number of known words)
  • Rate: 0.4 parses/second
  • Rate: 1059.8 tokens/second
  • Rate: 855.7 words/second
  • 5 iterations took 13.0 seconds
  • 1428 lexemes defined

As you can see, the rate is down from over 2,000 words per second to well under 900. Most of that is consumed by attempts to figure out unknown words.

After generating these numbers, I added a caching mechanism to save a copy of what the VL returns for a given word token. The result is a 10-15% speed increase, depending on whether the document scanned is known.

I have a lot more lexicon building to do. At least this process makes it fairly fast and easy to identify the words I need to import. The effectiveness and performance are encouraging so far.

Friday, November 18, 2016

Virtual lexicon enhancements

Virtual lexicon

In my previous post, I used the term "morphological parser" to describe what I've been building. Now that I'm done with a first significant version of it, I prefer the term "virtual lexicon" ("VL"). It captures the spirit of this as a module that represents a much larger set of words than the actual lexicon it contains. It's the same idea as virtual memory, which enables programs to consume what they perceive as a much larger amount of RAM than is physically available in the computer, or a virtual machine environment, which enables a fixed set of computer hardware to represent emulated computers with more CPUs, memory, and other resources than are physically available.


One of my stated goals in the previous post was to deal with periods and apostrophes. What I was indicating is that a basic tokenizer is faced with a nasty interpretation problem when it comes to these symbols. In standard written English, a sentence typically ends with a period and the next sentence begins after at least one space character (in typical computer formats, at least). But English also allows abbreviations to end in a period, typically followed by a space, as in Alice asked Prof. Smith for help. English also has initialisms represented sometimes with periods following each letter, as in r.s.v.p. and A.K.A. Further compounding the problem is that a sentence ending in an abbreviation or period-delimited initialism usually doesn't contain a separate period. For example, "I'm afraid of snakes, spiders, etc. When I see them, I run away A.S.A.P."

There is a further problem that could be ignored, but I decided to tackle it as well. Some special words like ".com" begin with periods, which would throw off a basic tokenizer. Further, it's possible for sloppily written text to have sentences ending in periods that don't have following spaces, as in "I'm going to the store.Do you need anything?" The following capital letter does help suggest a sentence break over the alternative interpretation that "store.Do" is a word. But there are such words, like "ASP.NET", as well as Internet domain names.

I decided to modify my tokenizer to allow word-looking tokens to include single periods, including one at the beginning and one at the end of the token (e.g., ".ASP.Net."). Doing so would give the virtual lexicon a chance to weigh in on whether such periods are part of the words or separate from them. The VL's return value now indicates if the word begins with a period and also if it ends with one. But then each of the word-senses in that word gets to indicate if the leading and/or trailing period is integral. To illustrate this, my test output shows integral periods as {.} and separate ones as (.). Consider the following examples:
  • ".animal.":  (.)  ⇒  N: animal(N)  ⇒  (.)
  • ".com.":  {.}  ⇒  N: .com(N)  ⇒  (.)
  • ".etc.":  (.)  ⇒  U: etc(U)  ⇒  {.}
My tokenizer also deals with the dotted initialism (N.A.T.O., r.s.v.p.) scenario, which is also a problem for the lexicon. I decided a lexeme representing this case should only contain the dot-free spelling (NATO, RSVP) and should contain one or more senses indicating that it is an initialism. When the VL comes across this pattern, it gets rid of the periods and then begins its search. For example:
  • "R.I.P.":
    • V: RIP(V)  ⇒  {.}
    • V: rip(V)
Note how it offers an alternative, because my test lexicon also has the common verb sense of "rip" referring to the ripping action. Had I fed in "rip" instead of "R.I.P." or "RIP", it would have put that common sense on top and the initialism sense second. But note also how the first sense indicates that, yes, the word ends in a period, but that period is part of the word. Had it been "RIP.", it would have indicated that there was a trailing period that was clearly not part of the word.

I would note that my VL doesn't deal well with cases where there are periods within an initialism but where one or more such periods are missing. A word like A.S.AP. would fail to be properly recognized by my VL, but I consider that a good design. I'm betting this sort of case is rare and almost always a typo. If someone wanted to say I RSVPed, for example, they probably wouldn't include any periods. This leaves those oddball words that do include infrequent periods, like Node.JS, pristine for lexicon lookups.
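The dotted-initialism normalization can be sketched as a strict letter-dot pattern check, which also shows why a partially dotted form like A.S.AP. is rejected (the function name is mine):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Strip periods from a fully dotted initialism ("R.I.P." -> "RIP") before
// the lexicon lookup. Returns false unless the token alternates strictly
// between single letters and periods, so "A.S.AP." and "RIP" both fail.
bool stripInitialismDots(const std::string& w, std::string& out) {
    std::string letters;
    for (size_t i = 0; i < w.size(); i += 2) {
        if (!std::isalpha((unsigned char)w[i])) return false;
        if (i + 1 >= w.size() || w[i + 1] != '.') return false;
        letters += w[i];
    }
    out = letters;
    return !letters.empty();
}
```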

I would also note that my VL does not provide meaningful support for domain names, Usenet group names, and so forth. This is probably best handled by a tokenizer, which could easily flag a token as fitting this pattern and possibly ask the VL to see if the token is a known word, as in the Node.JS case. It's going to be a challenge for any syntax and semantic parser to deal with these entities, anyway.

This all takes responsibility away from the tokenizer for dealing with single periods in and adjacent to words. The VL doesn't definitively decide whether a given period is punctuation marking the end of a sentence, but it does provide strong evidence for later interpretation. Plus, it allows terms that do contain periods to be properly interpreted on their own.


Almost the same problem crops up with apostrophe characters, which may be integral to words or may indicate the special class of punctuation that includes quotes, parentheses, and italicized text. Some words, like can't, bees', and 'nother, contain apostrophes that are integral to the word and not at all part of quoted text. However, a tokenizer just can't deal with this without recourse to a lexicon. So my lexicon allows terms to include integral apostrophes.

The tokenizer is expected to leave single apostrophes that may appear to the left or right of a word-like token, as well as within it, attached to the text. The VL then considers the various interpretations possible with the leading and trailing apostrophes. The output word indicates whether it begins with and also whether it ends with an apostrophe. Then each sense within indicates whether those leading and trailing apostrophes are part of the word or not. Same pattern as for leading and trailing periods. And in that spirit, here are some sample outputs for tokens that feature both leading and trailing apostrophes. Integral apostrophes are represented with {'} and clearly-separate apostrophes with (').
  • 'animal':  (')  ⇒  N: animal(N)  ⇒  (')
  • 'animals':  (')  ⇒  N: animal(N) -s'(N→N)  ⇒  {'}
  • 'nother':  {'}  ⇒  N: 'nother(N)  ⇒  (')
The 'animals' example illustrates the potential for confusion, too. After all, it could be that animals is simply the plural form of animal set in single quotes, as in Your so-called 'animals' are monsters. Or it could be that the plural form of "animals" has possession of something, as in Your 'animals' pen' is empty. There truly is no way in this parsing layer to iron that out.

Kicking the can down the road

One of my guiding assumptions is that each layer of the parsing process adds some clarity, but also creates more questions that can't be answered within that layer. I'm counting on the next layer being responsible for taking what the lexicalizer, which is essentially this virtual lexicon applied to all word tokens, outputs and generating as many alternative interpretations as are necessary to deal with the ambiguities. Then it will fall to the syntax parser, which should rule out some unlikely interpretations. That layer, too, will create more unanswered questions, which it will foist on later layers dealing more with the semantics of sentences.

One pattern I found is that I end up using my VL recursively because some alternative interpretations can only be handled by fully parsing a word by trying various interpretations, such as stripping off a leading period, and seeing which interpretation seems best. No doubt this same pattern will hold for the syntax parser, which probably will even occasionally call back to the VL to reinterpret alternative tokens it comes up with.

Tuesday, November 15, 2016

Morphological parser / Virtual lexicon

I've been able to focus the past 3 months exclusively on my AI research, a luxury I've never had before. Given that I can't afford to do this indefinitely, I've chosen to focus primarily on NLP, with an overt goal to create relevant marketable technologies.

I'm presently operating with an intermediate goal of creating a "basic" English parser (BEP). In my conception, a BEP will transform a block of English discourse into an abstract representation of its constituent words, sentences, and paragraphs that is more amenable to consumption by software. Though there are many research and industry products that do this at some level, I'm convinced I can improve upon some of their aspects to a marketable degree.

When you set about trying to understand natural language parsing, you quickly discover that there are many interconnected aspects that seem to make starting to program a system intractable. NLP researchers have made great strides in the past few decades largely by focusing on narrowly defined tasks, like part of speech tagging and phrase chunking; especially tasks that rely on statistical machine learning. Still, it seems every piece requires every other piece as a prerequisite.

After exploring lexical tagging (AKA part of speech tagging) for a while, especially using a custom Brill-style tagger I wrote from scratch, I decided to tackle an important piece of the bigger puzzle. Namely, how to deal with unknown words.

Why a morphological parser?

I'm convinced most unknown words are simply known words reassembled in different ways. The most basic example is inflectional versions. The plural form of "chicken" is "chickens". The past ("preterite") tense form of "stop" is "stopped". The comparative form of "sloppy" is "sloppier" and the superlative form is "sloppiest". Beyond these, many words can be formed by compounding existing words. "Sunshine", "ankle-deep", and "brainwash" illustrate basic compound words. And then there are all those affixes — prefixes and suffixes — like "un-", "re-", "-ish", and "-ation" that can be added to other words to form more complex ones like "reprioritization".

This is the domain of morphology, the study of word-formation, among other things. I decided I should try to somewhat master this algorithmically in order to deal with the broadest array of words.

The logic goes something like this. To parse, say, a paragraph of text, one must identify the sentence boundaries. There are some obvious algorithms to do so, but they run into ambiguities. A period, for example, might signal an abbreviation like "Mr." or "Dr." instead of the end of a sentence. Once you have isolated a sentence, making sense of it requires being able to characterize all of its words in some useful way. At the very least, you might want to tell what the main verb is and possibly what the subject and direct object are. Doing so usually begins with a process of identifying all the verbs, nouns, adjectives, etc. in the sentence, which turns out to be a surprisingly tricky process. You usually start with a naive guess that involves looking the word up in a lexicon. Like a standard dictionary, a lexicon will typically at least tell your algorithm what the most likely lexical category is for a given, known word (e.g., "run" is most likely a verb). And then the order of the categories in the sentence is typically used to figure out the syntactic structure of the sentence. Any word in the sentence that isn't already in the lexicon becomes a problem for this approach.

How, then, to deal with unknown words? One answer is to use well-known syntactic patterns to make a guess. In the sentence I ate a very XXXX pear, we can guess that XXXX is most likely an adjective because that's the only thing that should be allowable by English grammar rules. But we might also be able to guess by picking the unknown word apart. In an XXXXer pear, we can guess that XXXXer is probably a comparative adjective like tastier or raunchier. That said, it isn't guaranteed. Consider bitter, which coincidentally ends in er but which is not comparative (that would be bitterer or more bitter). Still, English comes with a wealth of prefixes and suffixes that can hint at the likely category for otherwise unknown words. XXXXish is probably an adjective, XXXXing is probably a present participle or gerund. And so on.

Human languages are "rulesy". Whether we intend to or not, we embed rules in how we form utterances. That's true at the word level just as much as at the sentence level, not to mention the paragraph and discourse levels. Like many in the computational linguistics community, I had been falling under the spell of letting learning algorithms figure out the rules instead of hand-crafting them. The case for this approach has been compelling in recent decades. Start with a learning algorithm and a hand-tagged corpus and the algorithm will save you the trouble of understanding, yourself. The results for a specific task are often statistically better than hand-crafted rules, furthering the case for this approach. However, I'm beginning to question the wisdom of this seductive but naive approach, which Geoffrey K. Pullum of the University of Edinburgh might label corpus fetishism.

Pullum represents one of the three legs of the computational linguistics stool: linguistics. At least one friend from the mathematics leg of CL has suggested I would do better to bone up on my linguistics knowledge than to worry about perfecting my mathematics knowledge. I concur with her thinking. As a programmer — this discipline being the third leg — I can attest that it's impossible to program a solution to a problem without first defining the problem. My own sense is that linguistics is the leading edge of the NLP revolution that is gradually happening. And mathematicians and programmers need to look deeper than the first level of linguistics to really understand what we're doing as we try to automate language understanding and production.

Pullum's mechanistic approach to understanding and explaining English grammar is infectious. I'm studying parts of The Cambridge Grammar of the English Language (CGEL) lately, especially with a focus on morphology. Chapters 18 (Inflectional morphology and related matters) and 19 (Lexical word formation) deal extensively with the subject. Pullum and Rodney Huddleston (et al) delved far deeper than my own limited needs into the gory guts of word-formation, but I am quite enjoying the treatment.

While my main motivation for adding a morphological parser (MP) to my BEP is dealing with unknown words, I also have creating a trim lexicon as a major goal. If I have the word "happy" in my lexicon, I should not also need happier, happily, unhappy, unhappily, happiness, unhappiness, and so on. I want to believe that a large fraction of the words that appear in a given sentence are derivatives of simpler, common-enough ones. In the previous sentence, for example, words, given, derivatives, simpler, common-enough, and ones are derivatives. That's 26% of the sentence.

I regularly coin what I like to call "bullshit words" like "rulesy" and "cuteiful" when I'm explaining ideas and as a light form of humor. While amusing to us, this is devastating to a parser relying purely on a lexicon, even when it already has millions of words in it. One powerful benefit to an MP is the ability to deal more robustly with neologisms.

Putting it all together, the main goal for me of a morphological parser is to be able to recognize a much larger potential vocabulary with a much smaller lexicon held in memory. I'm going to dub this a "virtual lexicon".

Design philosophy

Loosely speaking, my morphological parser serves the function of a lexicon for a parser. In my current usage, I have a typical parse pipeline that begins with a tokenizer that produces a string of words, numbers, punctuation, and other symbols. The tokens that look word-like, usually because they are made up exclusively or mainly of letters, are typically compared against the lexicon, a long list of known words. In the syntax parsing application, that lexicon usually indicates the lexical categories (verb, noun, etc.) for each matched word. Depending on the design, the lexical entry may contain more than one category.

To my thinking, a simple, exact-match lookup isn't robust enough. Instead, each word-like token is fed into the MP, which has the same main goal: returning an indication of the most likely lexical categories for that word. But inside, the MP is operating on the premise that the word could be derived from other word parts. To be sure, my MP does have a lexicon and will return the category for a known word if it matches exactly. If it doesn't find an exact match, though, it tries its best to find a derivative of one or more known ones.

Remember: the first priority of the MP is to determine the most likely lexical category (e.g., adverb) of a word. There are thus four basic possible scenarios: the exact word already exists in the lexicon; the word is completely composed of other word parts that are all found in the lexicon; some parts of the word — usually suffixes like ing — are recognized to the point where a good guess is possible; or the word is wholly unrecognized.

Given a single textual token like strawberries to look up, the MP returns a "word". That word consists of one or more interpretations, which I call "senses", in keeping with the idea that a dictionary entry for a word may contain definitions for many different senses of that word. This is also in keeping with one of my design goals: to entertain multiple interpretations of a given word, sentence, etc.

Each word sense indicates the word's lexical category and includes a list of the morphemes that make it up. Given irredeemable, for example, the MP returns a data structure that it shorthands as "(J): ir-(JJ) redeem(V) -able(VJ)". The leading (J): indicates that the whole word is interpreted as an adjective. Next is ir-, which it sees as a prefix that typically attaches to an adjective to form an adjective. Next is redeem(V), which it sees as a verb. Last is -able(VJ), a suffix that usually attaches to a verb to form an adjective.

The MP also incorporates a feature to treat one word as though it were multiple. In most cases, a single word can be treated as having a single, well-defined category like verb or adjective. But sometimes this isn't the case. Consider auxiliary verbs ending with the n't contraction, like isn't, wouldn't, or haven't. It is best to treat these as two-word strings like is not, would not, or have not, acknowledging that these are verb-adverb combinations. Most contractions likewise need expansion. Consider it's and what's, which should be interpreted as it is/was and what is/was. This also applies in Joe and Dale're going fishing, where 're must be interpreted as are and applying to both Joe and Dale. While most initialisms (ASAP, IRS) and acronyms (AWOL, NAFTA) can be seen as having single categories — usually noun — others suffer the same problem. IMO (in my opinion) is a prepositional phrase. Although it could be seen as an adverb, it's probably better to simply expand it out into its set of words in the sentence in which it appears before syntactic parsing. Or HIFW (how I felt when), which surely can't be reduced to a basic category.

In keeping with the belief that a word can have multiple senses and that it can sometimes be treated as multiple words, I'll point out that the "word" that is output by the MP when fed a token is a tree structure. A word object contains a list of word-sense objects. A word-sense object has a list of morphemes and, alternatively, a list of child word objects if a sense is to be interpreted as a string of words. A morpheme is mainly a pointer to a lexeme (like an entry for a word in a dictionary) and which sense of that lexeme is meant. If a lexeme wasn't found for the morpheme in the word, those pointers are null, but the morpheme is still useful in that it contains the text of that morpheme.
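To make the nesting concrete, here's a rough sketch of the shape of these structures. The names are mine, not the BEP's actual types, and I'm assuming C++17 (for a vector of a still-incomplete type):

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Lexeme;  // a dictionary entry: a spelling plus its list of senses

// A morpheme keeps its surface text even when no lexeme matched it.
struct Morpheme {
    std::string text;
    const Lexeme* lexeme = nullptr;  // null when no lexicon match was found
    int senseIndex = -1;             // which sense of that lexeme is meant
};

struct Word;  // forward declaration: senses may contain child words

struct WordSense {
    char category = '?';              // e.g. 'N', 'V', 'J' for the whole sense
    std::vector<Morpheme> morphemes;  // e.g. ir- / redeem / -able
    std::vector<Word> expansion;      // non-empty when treated as a word string
};

struct Word {
    std::string token;               // the original textual token
    std::vector<WordSense> senses;   // alternative interpretations, best first
};
```

So "irredeemable" would come back as one Word with a single adjective sense holding three morphemes, the middle one pointing at the redeem lexeme.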

I decided that it was critical to support Unicode. I created a custom, relatively trim library that lets me deal with multilingual text and "rich" characters. This overhead probably slows processing down marginally.

One other key is that I decided to specialize this parser to English morphology, hard-coding some English-centric rules related to morpheme transformation and lexical categorization into it. I'm hopeful that this work can provide inspiration for extracting those rules out as data to support other languages better, but I just don't have enough knowledge of other languages to justify the added system complexity yet.

Morpheme splitting

My morphological parser breaks its job into two independent tasks: finding morphemes, and then interpreting them in conjunction with one another.

Since each token is suspected to be composed of several morphemes, it's necessary to search for them. One way of doing that might be to start with the first letter. Consider happenings, for example, which is made up of happen -ing -s. One might start with h to see if we can find that in our lexicon. Not finding that, try ha. Then hap, and so on. Eventually, happen would be found. Then we could move on to find ing and finally s.

Each substring of characters could go against a hashed dictionary, which is fairly efficient. However, my MP has a specialized search tree in which each node represents a single letter and contains another hashed dictionary of next letters. To find happen, the algorithm might start with the root node, find the child node corresponding to "h", and for that node, find the child node corresponding to "a", and so on. When the search finds a known word during this tree traversal, that node — the "n" child node of the "e" node in this example — will have a link to the appropriate lexeme entry in the lexicon.

But I should clarify that my MP design counterintuitively starts at the end of a word and works its way backward. This is mainly because in English, modifications to base words (happy, flap) before adding suffixes (happily, flapping) usually occur in the last letter or two. More on this shortly.

I refer to this supporting lookup structure as a "morpheme tree". Words are added to this tree in reverse-character order. So happen is to be found by starting at the root and traversing "n", "e", "p", "p", "a", and finally "h", which leaf node will contain a pointer to the "happen" lexeme, which in turn has a list of lexeme-sense nodes representing the different categories (and in the future, distinct definitions) for that word.

Some words are subsets of others, as with Japan and pan. If the lexicon contains entries for both, this means parsing Japanese leads to ambiguity over whether pan is intended or merely coincidental. The morpheme tree will have a link to the pan lexeme at the "p" node, but also have a child node for the "a", leading into the "j" node, which then links to the japan lexeme entry. Thus, not all morpheme tree nodes that have links to lexemes are leaf nodes.
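In code, a reversed-insertion tree with hashed child maps might look something like this minimal version of mine (not the BEP's actual class); note that the "pan" node carries a lexeme link even though it isn't a leaf:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Morpheme tree sketch: words are inserted in reverse-character order, and a
// node may carry a lexeme link without being a leaf ("pan" inside "japan").
struct MorphNode {
    std::unordered_map<char, std::unique_ptr<MorphNode>> children;
    std::string lexeme;  // non-empty if a word terminates at this node
};

void addWord(MorphNode& root, const std::string& word) {
    MorphNode* node = &root;
    // Traverse from the last character to the first, creating nodes as needed.
    for (auto it = word.rbegin(); it != word.rend(); ++it) {
        auto& child = node->children[*it];
        if (!child) child = std::make_unique<MorphNode>();
        node = child.get();
    }
    node->lexeme = word;  // link this node to its lexicon entry
}

// True if the tree contains the word (read back-to-front during traversal).
bool contains(const MorphNode& root, const std::string& word) {
    const MorphNode* node = &root;
    for (auto it = word.rbegin(); it != word.rend(); ++it) {
        auto found = node->children.find(*it);
        if (found == node->children.end()) return false;
        node = found->second.get();
    }
    return node->lexeme == word;
}
```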

The ambiguity introduced by overlap cannot be resolved immediately during word parsing. Moreover, other ambiguities arise that, again, cannot be resolved immediately. The morpheme-finding process is an exhaustive search that returns all possible parses of the whole token, from last to first character. In the Japanese example, traversing the morpheme tree yields -ese and then pan, but it doesn't find ja, which it conservatively interprets as an unacceptable parse. However, continuing past pan into japan bears fruit, so that parse gets interpreted as acceptable. Only those parses that return a string of morphemes that cover every last letter are returned by this part of the process.

Getting the list of acceptable parses involves constructing a parse tree using a recursive algorithm. The parse starts with the final letter. Moving forward involves recursively calling a sub-parse function that also constructs the parse tree in parallel with the recursion process, including all branches considered, as in the pan vs japan case. Every time this recursion successfully reaches the beginning of the word, the final node added to this parse tree, which represents the first letter of the word, is added to a list of search "tails". Every node in that list of tails represents a distinct, completed parse of the entire token. If there were a ja lexeme in the test lexicon, then the two tails would correspond to both ja pan -ese and japan -ese parses, which then move onto the next stage for scoring. More on that later.
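Setting aside the parse-tree and tails machinery, the exhaustive end-to-start search can be sketched with plain recursion over a set of known morphemes. This simplification ignores senses and spelling modifications, and it records completed parses directly rather than via tail nodes:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using Parse = std::vector<std::string>;

// Walk backward from `end`; every time a trailing chunk of the remaining text
// matches a known morpheme, recurse on what's left. Reaching position 0
// completes one full-cover parse of the token.
void split(const std::string& token, size_t end,
           const std::set<std::string>& lexicon,
           Parse& current, std::vector<Parse>& completed) {
    if (end == 0) {  // reached the first letter: record this parse, in order
        completed.push_back(Parse(current.rbegin(), current.rend()));
        return;
    }
    for (size_t start = 0; start < end; ++start) {
        std::string candidate = token.substr(start, end - start);
        if (lexicon.count(candidate)) {
            current.push_back(candidate);
            split(token, start, lexicon, current, completed);
            current.pop_back();
        }
    }
}

std::vector<Parse> allParses(const std::string& token,
                             const std::set<std::string>& lexicon) {
    Parse current;
    std::vector<Parse> completed;
    split(token, token.size(), lexicon, current, completed);
    return completed;
}
```

With a toy lexicon containing ja, pan, japan, and ese, "japanese" comes back with both the japan ese and ja pan ese covers, exactly the two tails described above.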

One way we end up with more tails is by having multiple senses of a word. Take the lexeme er, for example. In my test lexicon, this lexeme is viewed as a suffix added to an adjective to form the comparative of it (bigger, heavier), a suffix added to a verb to make a noun of it (cobbler, runner), or an initialism (ER or E.R.) for emergency room. So a parse of killer could yield "kill(V) -er(JJ)", "kill(V) -er(VN)", or "kill(V) er(N)". Yes, "kill emergency room" is a possible interpretation.

Another way we end up with more tails is by word modification, as with silliness and panicking. These modifications are assumed by my MP to happen only to words that have suffixes appended, which is why the search begins at the end of each word. After finding the ness morpheme and committing in one path to the suffix sense from the lexicon (the ness lexeme could also have a proper noun sense, too, as the proper name Ness), we then look at the next few characters along the way, knowing that they represent the end of another morpheme. Based on what we find, we try whatever changes (e.g., replace "i" with "y") are allowable and then continue parsing based on those modifications, in addition to the parallel track of parsing with no such changes. A parse of happiness would find -ness but not happi. But it would successfully find happy after changing "i" to "y". The parse would essentially return happy -ness, as though the word were literally spelled as "happyness". Here are the specific modification rules I've implemented thus far:

  • If last letter != "e" then append "e" (shar -ing → share -ing)
  • If last letter = "i" then change to "y" (tri -es → try -es)
  • If last letter = "v" then change to "f" (leav -es → leaf -es)
  • If last letter doubled then trim last letter (stopp -ing → stop -ing)
  • If last letters = "ck" then change to "c" (panick -ed → panic -ed)
  • If suffix = "n't" and last letter != "n" then append "n" (ca -n't → can n't)
  • If suffix = "s'" (apostrophe) then append "s" (bas -s' → bass -s')

I suspect there may be other modifications worth adding to this list later, but these already do a lot of good work. Each of the rules above can turn a validly spelled word into an invalid one, but the payoff in being able to correctly parse such modified words is obvious. One downside is that the modifications can potentially create new, also-valid morphemes that lead to incorrect parses, but this should be rare. One example might be decking, which should be interpreted as deck -ing, but could also be interpreted via the ck → c rule as dec -ing, where "dec" is the abbreviated form of "December".
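As a standalone sketch, the rules can be framed as a function that, given the raw text left of a found suffix, proposes alternate spellings to try against the lexicon. My actual implementation applies these during the backward tree walk; this simplified version just generates candidates:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Given the characters remaining before a matched suffix, propose alternate
// spellings per the modification rules (e.g. "happi" -> "happy"). The
// unmodified text itself is always tried too, on a parallel track.
std::vector<std::string> spellingVariants(const std::string& stem,
                                          const std::string& suffix) {
    std::vector<std::string> out;
    if (stem.empty()) return out;
    size_t n = stem.size();
    char last = stem[n - 1];
    if (last != 'e') out.push_back(stem + "e");                    // shar -> share
    if (last == 'i') out.push_back(stem.substr(0, n - 1) + "y");   // tri -> try
    if (last == 'v') out.push_back(stem.substr(0, n - 1) + "f");   // leav -> leaf
    if (n >= 2 && stem[n - 2] == last)
        out.push_back(stem.substr(0, n - 1));                      // stopp -> stop
    if (n >= 2 && stem.compare(n - 2, 2, "ck") == 0)
        out.push_back(stem.substr(0, n - 1));                      // panick -> panic
    if (suffix == "n't" && last != 'n') out.push_back(stem + "n"); // ca -> can
    if (suffix == "s'") out.push_back(stem + "s");                 // bas -> bass
    return out;
}
```

Note that the rules fire independently, so "happi" also yields the useless candidate "happie"; the lexicon lookup is what weeds those out.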
Once the recursive parse is done, we're left with a list of tails that all represent the paths by which we got from the last letter of the token to the first. The final step of morpheme splitting involves constructing linear representations of the chains that start at each tail. Each node in these chains represents one morpheme. This set of chains is the input to the next stage, where we score each chain to see which ones seem the most likely.

Unknown words

It's obviously possible that a parse of a token won't produce any tails, meaning there's no interpretation whose morphemes all match entries in the lexicon. My morphological parser doesn't just give up in this case. It alters the original token, creating one "hole" of every possible size and location in the token, and attempts to match the word in light of that hole. This involves adding a fake lexeme and morpheme tree entry for the "?" symbol, which serves as a stand-in for the hole (I don't allow tokens that already contain "?" symbols). Let's say we had "unXXXXing" as our token. Since the first try would not find any acceptable tails representing completed parses, our algorithm would try all possible variations that retain at least two non-hole characters, including "un?" and "?ng", but also "unX?Xing", "unXXXX?", and "un?ing", our intuitive best bet. This gets parsed as un- [XXXX] -ing. (Anything represented inside [brackets] was the "hole", the text not found in the lexicon.) This is better than no match, as the -ing suffix can be applied to a verb to form a verb or an adjective, narrowing the possibilities more than a completely unknown word like XXXX would.

This process does not stop as soon as one tail is found. Indeed, it generates all tails for all hole placements and leaves it to the next stage to find the best possible interpretations. This speculative process is naturally more expensive than when the parser does come up with at least one full-word interpretation.
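The hole generator itself is simple to sketch; this is my reading of the description above, with "?" standing in for the unmatched span and at least two characters required outside the hole:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Produce every variant of `token` with one contiguous "hole" replaced by
// "?", keeping at least two characters outside the hole.
std::vector<std::string> holeVariants(const std::string& token) {
    std::vector<std::string> out;
    size_t n = token.size();
    for (size_t len = 1; len + 2 <= n; ++len)        // hole sizes
        for (size_t pos = 0; pos + len <= n; ++pos)  // hole positions
            out.push_back(token.substr(0, pos) + "?" +
                          token.substr(pos + len));
    return out;
}
```

For "unXXXXing" (nine characters) that's 42 variants, each of which goes through the full morpheme-splitting search.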

Scoring senses

Once the first stage has produced a set of morpheme chains (word senses), the second stage scores each chain, winnows them down, and sorts them so that the first in the list is most likely to be the right sense.

I've used scoring algorithms often for searches and such, but with those I was building up a positive score reflecting all the good things about each item, putting the highest-scoring item at the top. This time I decided to go with a negative scoring algorithm that adds up all the downsides of a given interpretation of the word, putting the word-sense with the lowest penalty value (zero being the lowest) at the top of the favored list.

There are penalties for many potential defects. Modifications like changing "i" to "y" in happiness are penalized in favor of no-modification interpretations.  If the whole token didn't parse and we had to try out different sized gaps, there is a penalty that favors the smallest gap size. Senses not fitting the prefixes + bases + suffixes pattern are penalized. If a lexeme has multiple senses, there's a small penalty for each subsequent sense used, thus favoring the earlier ones as being more likely. If a suffix has an "attachment filter", meaning it favors attaching to words of one or more lexical categories more than others (e.g., -er(VN) versus -er(JJ)), there's a penalty if the running category violates the suffix's filter. Having more morphemes is penalized. Having multiple free morphemes (e.g., apple, care, pretty) is penalized in favor of affixes ("big -er" favored over "big E.R."). Having zero free morphemes — having only affixes — is heavily penalized. Ideally, there will be only one morpheme because it exactly matches a lexeme in the lexicon. We penalize a sense that has a suffix as its first morpheme (e.g., -ing rown) and also penalize it if it has a prefix as its last one (e.g., ado re-).
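The additive-penalty idea can be sketched as below. The weights here are invented for illustration and are not my actual values; the point is only that every defect adds to the total and the lowest total wins:

```cpp
#include <cassert>

// Facts about one candidate word-sense, gathered during morpheme splitting.
struct SenseFacts {
    int modifications = 0;     // spelling changes like i -> y
    int holeSize = 0;          // 0 when the whole token parsed
    int extraSenseHops = 0;    // lexeme senses used beyond each first sense
    int filterViolations = 0;  // suffix attachment filters violated
    int morphemeCount = 1;
    int freeMorphemes = 1;     // non-affix morphemes like apple or care
    bool leadingSuffix = false, trailingPrefix = false;
};

// Sum the penalties; an exact single-lexeme match scores zero.
int penalty(const SenseFacts& f) {
    int p = 0;
    p += 5 * f.modifications;
    p += 10 * f.holeSize;                          // favors the smallest hole
    p += 2 * f.extraSenseHops;                     // favors earlier senses
    p += 8 * f.filterViolations;
    p += 3 * (f.morphemeCount - 1);                // favor fewer morphemes
    if (f.freeMorphemes > 1) p += 6 * (f.freeMorphemes - 1);
    if (f.freeMorphemes == 0) p += 50;             // affix-only: heavy penalty
    if (f.leadingSuffix) p += 20;                  // e.g. "-ing rown"
    if (f.trailingPrefix) p += 20;                 // e.g. "ado re-"
    return p;
}
```

Under any weights of this shape, "big -er" (one free morpheme) beats "big E.R." (two free morphemes), matching the preference described above.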

One underlying assumption for this scoring algorithm is that all interpretations spit out by the morpheme splitting stage are worth considering. I don't want to disqualify a potentially valid interpretation just because it doesn't obey the usual conventions for word production. A good example of a heavily penalized sense that is actually correct is the word ish, which is sometimes used informally as a way to indicate degree. "Are you tired?" "Ish." This thinking is especially helpful when words are formed using unusual affixation. For example, the -ish suffix is intended to attach to adjectives or nouns to form an adjective (squareish, houseish), but one could also attach it to a verb (burnish, crankish, rompish). Yes, the -ish lexeme sense's filter could be expanded to include verbs, but this algorithm prefers to see all the penalty mechanisms as reflecting preferences in lexical interpretation instead of absolute disqualifiers. If the best scoring sense is heavily penalized, it's still in the game until better interpretations come along. There is no penalty threshold that disqualifies a sense.

Once scoring is done, the results are sorted and only the top-scoring sense for each representative category is kept. That is, only the best verb sense is kept, only the best noun sense is kept, and so on. I have some misgivings about this expedient, but I'm motivated by a desire to keep the syntax parsing and broader interpretation process limited to a modest number of possible interpretations. Having 100 possible interpretations for, say, a partly unknown word seems counterproductive.

Category guessing

At the same time each sense is scored, the lexical category for the whole word is developed. As you might guess, even this is an error-prone process. The essence of the matter involves finding the last free morpheme's lexical category and then transforming it according to the suffixes attached to it. Consider meatier, for example, which parses out to either "(J): meat(N) -y(N→J) -er(J→J)" (penalty = 20) or "(N): meat(N) -y(N→J) -er(V→N)" (penalty = 27). As desired, the final conclusion is that it's most likely an adjective, since -y typically turns a noun into an adjective (meaty, nerdy, watery) and adding -er to an adjective keeps it an adjective. The other option, where another sense of -er converts a verb into a noun (killer, taker, slicer) doesn't fit as well, but it's still an option we want to present to the consuming application for its consideration.
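The category-development step amounts to running the base category through each suffix's from→to transform, penalizing mismatches along the way. A sketch, with an invented penalty weight:

```cpp
#include <cassert>
#include <vector>

struct SuffixRule { char from, to; };  // e.g. {'N','J'} for -y, {'J','J'} for -er

struct Result { char category; int penalty; };

// Start from the last free morpheme's category and apply each suffix's
// transform in order. A suffix whose 'from' doesn't match the running
// category still applies, but adds a filter-violation penalty.
Result developCategory(char base, const std::vector<SuffixRule>& suffixes) {
    Result r{base, 0};
    for (const auto& s : suffixes) {
        if (r.category != s.from) r.penalty += 7;  // invented weight
        r.category = s.to;
    }
    return r;
}
```

For meatier: meat(N) through -y(N→J) and -er(J→J) yields an unpenalized adjective, while the -er(V→N) sense yields a noun with a violation penalty, so the adjective reading wins while the noun reading stays available.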

I considered allowing the prefixes to influence transformation of the category, but this raised some ambiguities. Moreover, the prefixes I considered generally don't change the category of the words they attach to. I decided to just ignore them for now.

There are plenty of compound words that my morphological parser can handle. Here's a place where CGEL was immensely helpful for my understanding of how to deal with them. For starters, it seems most of the compounds we use contain only two free morphemes (swimsuit, backwater, nosedive). I decided to effectively treat compound words as though they are made up of separate words in this narrow context. My algorithm develops a category along the way and, when it finds a boundary between two (potentially affixed) free morphemes, it starts over. But it keeps track of what the categories were for the two (or more) sub-words. It then uses the conjunction of <category> + <category> — what I call the "compound pattern" — to decide whether to override whatever category the last word-let came up with, which is otherwise a good predictor. Thus far I've only found two compound patterns that merit overriding their default lexical categories. The first is verb+preposition (breakthrough, look-out, talking-to), which I change to noun. Another is adjective+verb (blueprint, high-set, smalltalk), which I also default to noun. But if the verb in that adjective+verb compound ends in -ing (breathtaking, strange-looking, talking-to) or -ed (French-based, short-lived, well-behaved), I convert the total word's category to adjective.
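The override logic for those two compound patterns boils down to something like this sketch (single-letter category codes and the suffix test are simplifications of mine):

```cpp
#include <cassert>
#include <string>

// Apply the compound-pattern overrides: verb+preposition and adjective+verb
// compounds default to noun, except adjective+verb where the verb part ends
// in -ing or -ed, which becomes an adjective. Otherwise the last word-let's
// category stands.
char compoundCategory(char first, char second, const std::string& secondText,
                      char defaultCat) {
    auto endsWith = [&](const std::string& tail) {
        return secondText.size() >= tail.size() &&
               secondText.compare(secondText.size() - tail.size(),
                                  tail.size(), tail) == 0;
    };
    if (first == 'V' && second == 'P') return 'N';          // breakthrough
    if (first == 'J' && second == 'V') {
        if (endsWith("ing") || endsWith("ed")) return 'J';  // breathtaking
        return 'N';                                         // blueprint
    }
    return defaultCat;
}
```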

Multi-word strings

There is a final step, once scoring and winnowing are done. We look at each sense to see if any of its morphemes demands that it must stand alone instead of being integral to a word. If so, we now break the total word up according to the morphemes' needs. If a word sense is composed of five morphemes and the one in the middle demands it must be expanded and its words stand on their own, the algorithm will create a new word from the first two morphemes in the original word, expand out the must-expand words from the middle morpheme, and then create a final word from the last two morphemes. For each of the new words, which are now really just new plain-text tokens, the entire process repeats, and this word sense becomes just a shell for the string of sub-words parsed in the same way. One example is shouldn't've, which breaks down to should + not + have.
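The expansion step itself reduces to replacing must-expand morphemes with their standalone words. A tiny sketch with a hypothetical expansion table (the real table lives in the lexicon's sense data):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Replace must-expand morphemes (contractions, etc.) with standalone words;
// everything else passes through unchanged. The table here is illustrative.
std::vector<std::string> expandMorphemes(const std::vector<std::string>& ms) {
    static const std::unordered_map<std::string, std::string> expand = {
        {"n't", "not"}, {"'ve", "have"}, {"'re", "are"},
    };
    std::vector<std::string> words;
    for (const auto& m : ms) {
        auto it = expand.find(m);
        words.push_back(it == expand.end() ? m : it->second);
    }
    return words;
}
```

Each resulting word then re-enters the full parse as a fresh token, as described above.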

In truth, I'm not 100% sure about the need for this feature. Consider the HIFW (how I felt when) example. Standing on its own, it seems valuable to expand it out into a sentence like HIFW I saw it, but what if it had a suffix, as in totally HIFWing on this? "How I felt whening" doesn't make sense, while treating the whole thing as probably a verb does. This is an area I think I need to study further.

Performance tests

One way of seeing how fast this runs is to select sample words and see how many times my morphological parser can process each. I'm starting with a late 2013 iMac with a 3.4 GHz Intel Core i5 and 8GB 1600 MHz DDR3 memory, a reasonably upscale desktop computer. I wrote my code in C++ using Xcode.

My test lexicon contains 876 lexemes. I'll admit that this is much too small to be representative of a well-stocked lexicon, but I also don't believe that increasing its size will have much effect on this algorithm's performance. The main reason is that the expensive part of dealing with the lexicon is looking up a candidate morpheme. Since this is done by traversing the morpheme tree in parallel with reading each character, which takes constant time per recursive step of the morpheme parse, I expect no significant change in parse time as the lexicon gets bigger. Time will tell.

So let's take some sample words and see how many times it can parse the same word per second. First, consider tokens that had full matches:
  • 30,000 words/second:  red:  (J): red(J)
  • 17,300 w/s:  adventure:  (N): adventure(N)
  • 9,000 w/s:  recordkeeping:  (V): record(N) keep(V) -ing(V|N→V)
  • 8,500 w/s:  relies:  (V): re-(U) lie(V) -s(V→V)
  • 6,100 w/s:  breathtaking:  (J): breath(J) take(V) -ing(V|N→V)
  • 3,600 w/s:  unremittingly:  (J): un-(J→J) remit(V) -ing(V→J) -ly(N|J→J)
  • 1,700 w/s:  antidisestablishmentarianism:  (N): anti-(U) dis-(U) establish(V) -ment(V→N) -arian(N→N) ism(N)
  • 181 w/s:  happily-tippingreungreennesspotatoes:  (N): happy(J) -ly(N|J→J) -(U) tip(V) -ing(V|N→V) re-(U) un-(J→J) green(J) -ness(J→N) potato(N) -es(N→N)
Now let's try some words that don't have full matches. Note the interpretations. Some of them are clearly wrong, but they help illustrate how this algorithm works:
  • 40,000 w/s:  bug:  <no match>
  • 9,000 w/s:  redbug:  (J): red(J) [bug]
  • 2,400 w/s:  mister:  (J): my(N) -s(N→N) [t] -er(J→J)
  • 2,200 w/s:  censorize:  (V): [censo] re-(U) -ize(V|N→V)
  • 14 w/s:  punk-antidisestablishmentarianism:  (N): [punk] -(U) anti-(U) dis-(U) establish(V) -ment(V→N) -arian(N→N) -ism(N→N)
I am very happy with these results. I thought it would be orders of magnitude slower. Instead, it seems this piece could hum along at 6,000 or more words per second on average on my computer, assuming most words it comes across have full matches.

Memory consumption

Regarding memory, a simple test in which I reduce the lexicon to nearly empty shows that it consumes about 628 KB of memory. With 878 items in the lexicon, it climbs to 1 MB. Here are some actual memory measurements for lexicon sizes during loading:
  • 0:  628 KB
  • 1:  636 KB
  • 100:  700 KB  (720 B/lexeme)
  • 200:  740 KB  (650 B/lexeme)
  • 300:  804 KB  (587 B/lexeme)
  • 400:  832 KB  (510 B/lexeme)
  • 500:  876 KB  (496 B/lexeme)
  • 600:  936 KB  (513 B/lexeme)
  • 700:  968 KB  (486 B/lexeme)
  • 800:  1,020 KB  (490 B/lexeme)

Memory: bytes per lexeme

I'm not sure whether this means that the per-lexeme consumption flattens out at a little under 500 bytes per lexeme or if it continues downward, which I'm expecting. The morpheme tree's memory footprint should grow logarithmically. The lexicon's lexeme info should grow linearly. So let's say the average stays around 500 bytes per lexeme. That means a lexicon with one million items should consume half a gigabyte.

A more modest lexicon of 100k lexemes (words) would consume 50 MB. For comparison, as I look at the currently active programs in my computer's memory, I see that Chrome is consuming 3 GB, Google Drive is consuming 655 MB, Xcode is consuming 826 MB, and so on.

Fidelity tests

Of course, having an algorithm that's fast isn't as important as having one that works well. Were I writing a scholarly paper, I'd feel compelled to flesh out my lexicon and mine a corpus for test cases, but I haven't gotten around to that yet. I plan to do more serious testing of this sort in time, though.

But I do have one useful barrage test behind me. I was keenly interested in seeing how well my MP would fare against the wide variety of compound words found in CGEL's treatment of morphology. To that end, I painstakingly typed the 678 examples I found there into a data file and hand-tagged all of their lexical categories. I then created another data file containing their base words. For the example of taxpayer-funded, I had to isolate tax, pay, and fund. I then hand-tagged those words, too. Below is a snippet from the test's output:

  - sunshine             |  .  | (N): sun(N) shine(N)  (P:13)
  - swearword            |  .  | (N): swear(V) word(N)  (P:13)
  - sweetheart           |  .  | (N): sweet(J) heart(N)  (P:13)
  - swimsuit             |  .  | (N): swim(V) suit(N)  (P:13)
  - swordsman            |  .  | (N): sword(N) -s(N→N) man(N)  (P:123)
  - syntactic-semantic   |  .  | (J): syntactic(J) -(U) semantic(J)  (P:23)
  - table-talk           | (N) | (V): table(N) -(U) talk(V)  (P:23)
  - take-away            | (N) | (R): take(V) -(U) away(R)  (P:23)
  - take-off             |  .  | (N): take(V) -(U) off(P)  (P:23)
  - talking-point        |  .  | (N): talk(V) -ing(V|N→V) -(U) point(N)  (P:133)
  - talking-to           |  .  | (N): talk(V) -ing(V|N→V) -(U) to(P)  (P:133)
  - tape-record          | (V) | (N): tape(N) -(U) record(N)  (P:23)
  - tax-deductible       |  .  | (J): tax(N) -(U) deduct(V) -ible(V→J)  (P:33)
  - tax-free             |  .  | (J): tax(N) -(U) free(J)  (P:23)
  - taxpayer-funded      | (J) | (V): tax(N) pay(V) er(N) -(U) fund(V) -ed(V→V)  (P:63)
  - tearoom              |  .  | (N): tea(N) room(N)  (P:13)
  - theater-goer         |  .  | (N): theater(N) -(U) go(V) -er(V→N)  (P:35)
  - theatre-going        | (J) | (V): theatre(N) -(U) go(V) -ing(V|N→V)  (P:33)
  - thought-provoking    | (J) | (V): thought(N) -(U) provoke(V) -ing(V|N→V)  (P:53)
  - threadbare           |  .  | (J): thread(N) bare(J)  (P:13)
  - three-inch           | (J) | (N): three(D) -(U) inch(N)  (P:23)
  - three-metre-wide     |  .  | (J): three(D) -(U) metre(N) -(U) wide(J)  (P:46)
  - tightrope            |  .  | (N): tight(J) rope(N)  (P:13)
  - timberline           |  .  | (N): timber(N) line(N)  (P:13)

The center column represents the hand-tagged value. If it is the same as the MP's prediction, the column contains a period, allowing the mistakes to jump out easily. Of the 678 compound words tested, 76.8% were correctly tagged. Note that the "(P:13)" values on the far right represent penalty calculations for each of these. I'm showing only the best scoring (least penalized) interpretation for each of the test tokens.
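Scoring a table like this is straightforward. As a rough sketch (the function is hypothetical, not part of my actual test harness), one only needs to count the periods in the center column:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: score a results table like the one above.
// The center column holds "." on a match and the hand-tagged
// category, e.g. "(N)", on a mismatch.
double percentCorrect(const std::vector<std::string>& centerColumn) {
    int correct = 0;
    for (const std::string& cell : centerColumn)
        if (cell == ".") ++correct;
    return 100.0 * correct / centerColumn.size();
}
```

With 678 rows, 521 matches yields the reported 76.8%.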

During development, I relied heavily on hand-crafted test words. Some of them are reproduced below:

- antidisestablishmentarianism
  - (N): anti-(U) dis-(U) establish(V) -ment(V→N) -arian(N→N) -ism(N→N)  (P:50)
- Rate: 1451.38 words/s

- buttons
  - (N): button(N) -s(N→N)  (P:10)
  - (V): button(N) -s(V→V)  (P:17)
- Rate: 17152.7 words/s

- buttoning
  - (V): button(N) -ing(V|N→V)  (P:11)
  - (N): button(N) -ing(N→N)  (P:14)
  - (J): button(N) -ing(V→J)  (P:17)
- Rate: 11904.8 words/s

- exposition
  - (N): expose(V) -ition(V→N)  (P:30)
- Rate: 12547.1 words/s

- expositions
  - (N): expose(V) -ition(V→N) -s(N→N)  (P:40)
  - (V): expose(V) -ition(V→N) -s(V→V)  (P:47)
- Rate: 7189.07 words/s

- reexpose
  - (V): re-(U) expose(V)  (P:10)
- Rate: 27100.3 words/s

- reexposure
  - (N): re-(U) expose(V) -ure(N)  (P:40)
- Rate: 15432.1 words/s

- reexposed
  - (V): re-(U) expose(V) -ed(V→V)  (P:40)
  - (J): re-(U) expose(V) -ed(V→J)  (P:42)
- Rate: 11723.3 words/s

- malignant
  - (N): malign(V) -ant(N)  (P:12)
  - (J): malign(V) ant(J)  (P:13)
- Rate: 14881 words/s

- meteorites
  - (N): meteor(N) -ite(N→N) -s(N→N)  (P:20)
  - (V): meteor(N) -ite(N→N) -s(V→V)  (P:27)
- Rate: 8992.81 words/s

- mouthy
  - (J): mouth(N) -y(N→J)  (P:10)
- Rate: 22002.2 words/s

- stubbornly
  - (J): -s(N→N) [tub] born(V) -ly(N|J→J)  (P:3343)
- Rate: 1026.69 words/s

- muddling
  - (V): [muddl] -ing(V|N→V)  (P:5015)
  - (J): [muddl] -ing(V→J)  (P:5017)
  - (N): [muddl] -ing(N→N)  (P:5019)
- Rate: 2212.39 words/s

- rapacious
  - (J): [rapac] -y(N→J) -ous(J)  (P:5025)
- Rate: 1189.06 words/s

I know I'll need to do more testing, but I'm fairly happy with the results so far.

Applications and future work

While my main goal in creating a morphological parser is to provide a mechanism for building a "virtual lexicon" that supports syntax parsing by guessing at lexical categories for words, I see other potential uses, too.

For starters, an MP should be able to aid the process of building a lexicon. Imagine doing so by importing documents: for each document, the lexicon-builder tool calls out words it doesn't already recognize. Take the muddling example from above. The best guess was that the word is a verb, which is correct, and it called out "muddl" as the unknown part. Moreover, one could use the -ing(V|N→V) lexeme sense, which indicates that "ing" usually attaches to a verb or (secondarily) a noun to form a verb, to guess that "muddl" is most likely a verb, which is also correct. The only thing wrong is the spelling, since the derivation involved lopping off a final "e". The user would need to review and finesse each suggested entry found this way.
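The spelling repair could work along these lines. This is a minimal sketch under my own assumptions (the function name and signature are hypothetical); it proposes candidate base spellings for a residue like "muddl" by restoring the silent "e" that English often drops before vowel-initial suffixes:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: propose base-word spellings for an unknown
// residue left over after stripping a suffix. English often drops a
// silent final "e" before vowel-initial suffixes like "-ing", so
// both the raw residue and residue + "e" are worth suggesting.
std::vector<std::string> baseCandidates(const std::string& residue,
                                        const std::string& suffix) {
    std::vector<std::string> out{residue};
    bool vowelInitialSuffix = !suffix.empty() &&
        std::string("aeiou").find(suffix[0]) != std::string::npos;
    if (vowelInitialSuffix)
        out.push_back(residue + "e");   // "muddl" -> "muddle"
    return out;
}
```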

I also believe this could be used to enhance a typical spell checker. For starters, it could allow the spell checker to distinguish between "hard" and "soft" misspellings: words that fit word-formation patterns but are not in an otherwise large lexicon could be called out as "soft" misspellings. Moreover, it could recognize when a word looks like a regular inflection of a base word but actually is not. If the lexeme sense indicates that a base word does not follow the usual inflection rules and calls out the alternatives, the spell checker could suggest the correct one. For example, badder might lead to worse as a suggestion, since badder appears to be the comparative of bad. Similarly, worser could be called out as a nonstandard comparative, with worse being suggested. Childs becomes children. And so on. These would generally be favored over typo-assuming suggestions like balder, worsen, and child.
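In its simplest form, the irregular-inflection lookup might resemble the following sketch (the table and function are illustrative stand-ins for what would really be driven by lexeme-sense data):

```cpp
#include <map>
#include <string>

// Hypothetical sketch: map "soft" misspellings that look like regular
// inflections of irregular words onto the standard forms, as a spell
// checker informed by lexeme senses might. Unknown words pass through.
std::string suggestStandardForm(const std::string& word) {
    static const std::map<std::string, std::string> irregulars = {
        {"badder", "worse"},    // looks like the comparative of "bad"
        {"worser", "worse"},    // nonstandard double comparative
        {"childs", "children"}, // looks like the plural of "child"
    };
    auto it = irregulars.find(word);
    return it != irregulars.end() ? it->second : word;
}
```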

One problem I intend to apply my MP to is improved tokenization. Consider the sentence When will you see Prof. Smith? A basic tokenizer would see the period in Prof. and assume it marked the end of a sentence; Smith? could, after all, be a complete sentence, too. I think my lexicon will need to include common abbreviations like Prof., Mrs., and etc. to help disambiguate period usage. One option would be to include the period in a token that is otherwise word-like and ask the MP to render an opinion about whether the period is part of the word or more likely punctuation. This would extend to period-delimited forms like R.I.P. and R.S.V.P., where it seems logical for the MP, which looks at words character by character, to recognize and "correct" this pattern. After all, the lexicon may define RSVP but not the redundant R.S.V.P., so it would be helpful to recognize and transform this pattern before traversing the morpheme tree.
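The transformation step could look something like this sketch (a standalone illustration with a hypothetical function name, not code from my tokenizer), which collapses a strictly letter-period-letter-period pattern and leaves everything else untouched:

```cpp
#include <cctype>
#include <string>

// Hypothetical sketch: collapse a period-delimited initialism like
// "R.S.V.P." to "RSVP" before the morpheme-tree lookup. Tokens that
// don't fit the alternating letter/period pattern (e.g. "Prof.")
// are returned unchanged.
std::string collapseInitialism(const std::string& token) {
    std::string letters;
    for (size_t i = 0; i < token.size(); i += 2) {
        if (!std::isalpha(static_cast<unsigned char>(token[i])))
            return token;
        if (i + 1 < token.size() && token[i + 1] != '.')
            return token;
        letters += token[i];
    }
    return letters.size() >= 2 ? letters : token;
}
```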

Related to ambiguous periods are ambiguous apostrophes ('). If a word begins or ends in an apostrophe, does that signal a possessive, an eye-dialect spelling ('nother, lil', readin'), or single-quoted prose? The lexicon could help if it contained common eye-dialect examples. And the MP could figure out whether a trailing s' likely represents a possessive (fixing the dogs' dinners).
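The trailing-s' case might be sketched as follows; this is purely illustrative, with the lexicon lookup abstracted behind a caller-supplied predicate rather than any real interface of mine:

```cpp
#include <functional>
#include <string>

// Hypothetical sketch: decide whether a trailing apostrophe marks a
// plural possessive ("the dogs' dinners"). The isPluralNoun predicate
// stands in for a lexicon lookup.
bool isLikelyPluralPossessive(
    const std::string& token,
    const std::function<bool(const std::string&)>& isPluralNoun) {
    if (token.size() < 3 || token.back() != '\'') return false;
    std::string stripped = token.substr(0, token.size() - 1);
    return stripped.back() == 's' && isPluralNoun(stripped);
}
```

An eye-dialect form like lil' fails the test because the letter before the apostrophe isn't an "s", so other interpretations would remain in play.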

Because the MP returns multiple options, it can certainly return several interpretations of embedded periods and apostrophes. It might be best for the tokenizer, confronted with a paragraph, to conservatively attach periods and apostrophes to the otherwise word-like tokens nearest them as a first step, then have the MP come up with naive guesses for each token's category, word splits, and punctuation interpretations. Only after that would a next stage propose one or more interpretations of where the sentence boundaries lie, a break with the traditional tokenization → lexicalization → syntax parsing flow. It would then be up to the syntax parser to figure out which proposed sentence boundaries make the most sense grammatically.

Although my current morphological parser code is already my second version, I've only been working on this for two and a half weeks. I have no doubt this deserves quite a bit more work. But overall, I'm very happy with the initial results. My sense is that my MP works effectively and efficiently and that it will serve several parsing goals at once.