Adaptive tokenizer

I knew that to continue refining my virtual lexicon, I'd need to throw it at real text "in the wild". Doing that meant revisiting my tokenizer so I could feed words into it. The first tokenizer I made for my Basic English Parser (BEP) project was very primitive, so I decided to trash it and start over.


Programmers often find themselves needing to parse structured data, such as comma-separated values or XML files. In parsing richly structured data, like code written in a programming language, we refer to the smallest bits of data, usually text, as "tokens". In this case, my tokenizer is tasked with taking as input a single paragraph of English prose in plain-text format and producing as output a list of tokens.

In my tokenizer, tokens presently come in a small variety of types: word, symbol, number, date/time, and sentence divider. Since recognizing a token requires characterizing it, tokenization is the natural place to assign a type to each token, but only to the degree that it is competent to do so. A tokenizer can readily recognize the pattern for a typical word, for example, but couldn't classify that word's lexical category (noun, adjective, etc.). Each piece of a good parser has its own specialties.

English-speaking programmers familiar with typical parsing usually rely on conveniences bundled into the ASCII data standard. With a few logical and mathematical comparisons, you can easily tell capital letters from lower-case ones, and letters from digits from other symbols. Because I am opting to deal properly with Unicode, those distinctions don't come so easily. I had to start with a custom Unicode library that can deal efficiently with normalized UTF-8 representations of text, where a single visible character in a text file might be represented by a code point packed into 1 to 4 bytes, or possibly by a string of code points bundled into a grapheme cluster. My unicode::unistring class easily interfaces with std::string. Its character iterator, .size() method, and other members mimic those of the standard string class, but deal at the grapheme-cluster level instead of the byte level.
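To illustrate why byte-level counting misleads, here is a minimal sketch of counting UTF-8 code points in a std::string. This is not the actual unistring implementation (whose internals aren't shown here), and it stops short of full grapheme clusters, which additionally require the Unicode segmentation rules; it only shows the first layer of the problem: bytes vs. code points.

```cpp
#include <cassert>
#include <string>

// Count UTF-8 code points in a std::string by counting only lead
// bytes; continuation bytes have the form 10xxxxxx. A simplified
// sketch: the real unicode::unistring works at the grapheme-cluster
// level, which requires Unicode segmentation rules on top of this.
std::size_t codePointCount(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;  // skip continuation bytes
    }
    return count;
}
```

For example, "café" in normalized UTF-8 occupies 5 bytes but is 4 code points, so std::string's .size() over-reports the visible length.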

Although language parser programmers often favor designs based on BNF grammars, I've opted to just hard-code a mostly linear parsing algorithm. It progresses character by character, looking for boundaries between token types. The most obvious boundaries are found when we hit white-space, including spaces, tabs, and new-line characters. For numbers and dates, I'm relying on regular expressions to look ahead of the current character. Everything else initially comes down to punctuation and symbol characters. If it's a symbol character and we've been looking at a word, we recognize the boundary of that word token and pack the single symbol character into its own token, ready to begin searching for the next token.
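The linear scan described above can be sketched roughly as follows. This is a simplified, ASCII-only illustration of the approach, not my actual code: it handles only the whitespace and symbol boundaries, with no regex look-ahead for numbers or dates, and the type and structure names are my own inventions for the sketch.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

enum class TokType { Word, Symbol };

struct Token {
    TokType type;
    std::string text;
    bool spaceBefore;  // was there whitespace before this token?
};

// Linear, character-by-character scan: whitespace ends the current
// word, and any other non-alphanumeric character becomes its own
// single-character Symbol token.
std::vector<Token> tokenize(const std::string& text) {
    std::vector<Token> tokens;
    std::string word;
    bool sawSpace = false, wordSpace = false;
    auto flushWord = [&]() {
        if (!word.empty()) {
            tokens.push_back({TokType::Word, word, wordSpace});
            word.clear();
        }
    };
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            flushWord();
            sawSpace = true;
        } else if (std::isalnum(c)) {
            if (word.empty()) { wordSpace = sawSpace; sawSpace = false; }
            word += static_cast<char>(c);
        } else {  // punctuation or symbol: a token by itself
            flushWord();
            tokens.push_back({TokType::Symbol, std::string(1, c), sawSpace});
            sawSpace = false;
        }
    }
    flushWord();
    return tokens;
}
```

So "Hello, world!" yields four tokens: the word "Hello", the symbol ",", the word "world" (marked as having whitespace before it), and the symbol "!".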

There are some special exceptions, though. Consider hyphens. It's tempting to take a term like "AFL-CIO" or "singer-songwriter" and split it into separate words. Hyphenated phrases like "lighter-than-air" make a compelling case for this seductive option, which fits neatly into the model where a symbol character like the minus sign (-) is its own token and marks boundaries to its left and right. But sometimes this doesn't make sense, as with non-compliant and pre-industrial. It would be hard to make sense of the syntax of a sentence that included non compliant or pre industrial. Also, hyphenation sometimes changes the lexical category (LC) of a combination of words, as with an off-the-cuff remark, where off-the-cuff is effectively an adjective, even though none of its words is an adjective. I conclude that the tokenizer is not qualified to decide when it's appropriate to subdivide a hyphenated word. Nor is the virtual lexicon, even. I suspect this will fall to the syntax parser, if only because that module will be better able to evaluate whether the string of hyphenated words forms a syntactic structure more neatly than the single hyphenated word does. Finally, there are special cases of words that end in a hyphen that are meant to link to a later word, as with in pre- and post-op photos.

Thus, the tokenizer allows single hyphens that appear within a word or as its last character. Minus signs that appear outside the context of a word, as when there are spaces around them, are considered symbols and not words. If an apparent word has two or more consecutive minus signs in it (e.g., in full knowledge--at least, afterward), the tokenizer breaks it up into separate words. This takes care of the common case where an em dash is represented by two minus characters with no space around them.
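The hyphen rules can be sketched as a small splitting helper. This is a hypothetical stand-alone function for illustration; the real tokenizer folds this logic into its main scan.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a candidate word on runs of two or more minus signs (the
// common ASCII stand-in for an em dash), while keeping single
// internal or trailing hyphens as part of the word.
std::vector<std::string> splitOnDoubleHyphen(const std::string& word) {
    std::vector<std::string> parts;
    std::string cur;
    std::size_t i = 0;
    while (i < word.size()) {
        if (word[i] == '-' && i + 1 < word.size() && word[i + 1] == '-') {
            if (!cur.empty()) { parts.push_back(cur); cur.clear(); }
            while (i < word.size() && word[i] == '-') ++i;  // skip dash run
        } else {
            cur += word[i++];
        }
    }
    if (!cur.empty()) parts.push_back(cur);
    return parts;
}
```

Under these rules, "knowledge--at" splits into two words, while "singer-songwriter" and the trailing-hyphen "pre-" survive intact.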

The same constraint applies to single periods and apostrophes, which are allowed to be in the middle of a word token (ASP.NET, isn't) or on the left (.com, 'ere) or right (etc., cousins') of one.

Generally, the tokenizer preserves the actual characters from the text, but it does do some conversions. For starters, it does not preserve whitespace, per se. Each token indicates whether or not it has whitespace before it, but not how many characters or which. The various kinds of horizontal dash characters are also condensed down to the standard ASCII minus sign (-).

My tokenizer also condenses various types of double quote characters (“ & ”) down to the basic ASCII double quote (") character. Same for single quotes (‘ & ’), which are reduced to the ASCII apostrophe (') character. Although I may regret this later, as their polarity (left or right) can carry clues to the boundaries of quoted blocks of text, for now they would just complicate lexical analysis. They probably would complicate syntax analysis, too, because of inconsistent use of these in modern and older text files. Moreover, I would still face the problem of properly recognizing triple quotes, quadruples, etc., typically represented by combinations of double and single quote characters.
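The condensing step amounts to a handful of UTF-8 byte-sequence replacements. Here is one way it might look; this is a sketch of the idea operating directly on raw UTF-8 bytes, not my actual implementation, which works through the unistring layer.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Condense typographic dashes and curly quotes down to their ASCII
// equivalents by replacing their raw UTF-8 byte sequences.
std::string normalizePunctuation(std::string text) {
    static const std::vector<std::pair<std::string, std::string>> map = {
        {"\xE2\x80\x93", "-"},   // en dash (U+2013)
        {"\xE2\x80\x94", "-"},   // em dash (U+2014)
        {"\xE2\x80\x9C", "\""},  // left double quote (U+201C)
        {"\xE2\x80\x9D", "\""},  // right double quote (U+201D)
        {"\xE2\x80\x98", "'"},   // left single quote (U+2018)
        {"\xE2\x80\x99", "'"},   // right single quote (U+2019)
    };
    for (const auto& [from, to] : map) {
        std::size_t pos = 0;
        while ((pos = text.find(from, pos)) != std::string::npos) {
            text.replace(pos, from.size(), to);
            pos += to.size();
        }
    }
    return text;
}
```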

Sentences and paragraphs

I specifically decided not to make my tokenizer responsible for starting with a raw text file and splitting it up into sections and paragraphs. My main reason is that it should be relatively easy in many cases for an earlier-stage parser to find paragraph boundaries, but that different formats require different approaches. From a syntax standpoint, paragraphs might be considered the largest significant unit of text.

Why not stop at the sentence level? A typical syntax analyzer deals only with a sentence. The problem is that it's often difficult to find clear sentence boundaries. Question marks and exclamation points are strong indicators of sentence-end punctuation, but periods are not. Consider a sentence that contains the abbreviation "etc.". My lexicon can tell me that "etc" is an abbreviation, which means it typically has a period after it. And that could be the end of it, if only we ended a sentence with an extra period after an abbreviation, but we usually don't. Moreover, we often punctuate a sentence that ends with double quotes with a period just before the final quote. I reasoned that the best my tokenizer can do is mark places where it thinks sentences might end, and some indication of how certain it is in each case.

To facilitate this, the tokenizer offers a special sentence-divider token type. Sentence dividers are tokens inserted into the token stream just like words, symbols, numbers, etc.

To help add certainty, the tokenizer will pass any word-like token into the virtual lexicon and attach the returned "word" data structure to the token. As described previously, the VL will indicate that a word has leading or trailing period and/or apostrophe characters. A word's various senses (interpretations) will indicate whether they consider the period/apostrophe characters to be integral to their words, as with .com, etc., 'nother, and ol', or otherwise unrelated to the word. The tokenizer checks every sense to see if any of them allows for an integral-period interpretation. It can use this to decide whether or not it is "certain" that a period at the end of a word represents a separate symbol. If it does come to this conclusion, the tokenizer will actually strip off the trailing period and add it as a new token, asking the VL to reparse the new word without the trailing period.

More precisely, the tokenizer does this same thing for both the beginning and ending period/apostrophe. It creates new tokens before or after the word token when the word contains no interpretations in which that leading or trailing period or apostrophe is integral to the word.
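The core check reduces to a simple predicate over the word's senses. The Sense structure below is a hypothetical stand-in for whatever the VL actually returns; it exists only to show the "strip only if no sense objects" rule.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the sense data the virtual lexicon
// attaches to a word: does this interpretation treat a trailing
// period as part of the word itself (as in "etc.")?
struct Sense {
    bool trailingPeriodIntegral;
};

// The tokenizer's rule: split the trailing period into its own token
// only when NO sense considers it integral to the word.
bool shouldStripTrailingPeriod(const std::vector<Sense>& senses) {
    for (const Sense& s : senses) {
        if (s.trailingPeriodIntegral) return false;
    }
    return true;
}
```

The symmetric check applies to a leading period or apostrophe.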

In the special case of an abbreviation (Dr., etc., appt.) or period-inclusive initialism (r.s.v.p., C.I.A.) word that integrates an ending period, the tokenizer still creates a sentence-divider token, but it marks it as uncertain about this conclusion.

Punctuation sometimes appears within quoted text in sentences. Consider this example:
To those who ask "Why bother?" I say "Because it's right." I hope they listen.
My intention in this example is that the first quoted sentence not also mark the end of the outer sentence that contains it. However, the second one does mark the end. From a tokenization or lexicon standpoint, there is no way to correctly draw these conclusions. My tokenizer thus kicks this can down the road by marking the "?" and "." before each closing quote as definite sentence ends and adding a maybe-a-sentence-end marker after each closing quote. It will fall to the syntax analyzer to consider alternative hypotheses.

The bottom line here is that neither the tokenizer nor the virtual lexicon are qualified to say for sure where sentences end. The best they can do is propose options.

Naming names

One of the other features I've added to my BEP is an ability to spot likely name words in text. This presents a difficult challenge. While many of the unknown words found by an English parser are simply inflectional derivatives (e.g., harm to harms, harmed) and assemblies of morphemes (harmful, harmless) missing from that lexicon, a large fraction of the remaining unknown words are going to be names of people, places, events, organizations, products, and so on. It seems inappropriate to attempt to fill a lexicon with a list of all such named entities because, no matter how authoritative it is, there will always be more names emerging.

Instead, I'm employing a tactic of looking at the characteristics of unknown words to see if they might potentially be names. The most basic characteristic is whether the word is capitalized. But this needs to be mediated by whether the word appears at the beginning of a sentence. That's the main reason the tokenizer proposes sentence boundaries. Words that appear immediately after sentence boundaries (opening quotes are factored in) are ignored in the initial survey to find potential names for the simple reason that in English, we typically capitalize the first word in a sentence. I'm looking for words that are either capitalized (John, Woodstock, Florida) or all-caps (FBI, DARPA, NASCAR) that are not already in my lexicon.

One benefit of tokenizing at the paragraph level is that it allows me to deal well with headlines. Headlines that have most of their words capitalized (Harry Potter and the Chamber of Secrets) confound the ability to find name words. So part of the computation is that if a paragraph has a large percentage of its words either capitalized or in all-caps, it is disqualified from the search for name words. This also handles snippets of text where almost all letters are capitalized, which similarly makes finding potential names impossible.

Once the first pass to identify potential name words is done, I do a second pass later to "name names". An important part of that process is that it then covers words that were previously disqualified. If the word Rupert appeared in the middle of a sentence in a regular paragraph during the initial sweep, it got added to a list of potential name words. In this second pass, instances of Rupert that appear as the first word of a sentence are also tagged as name words, as are instances found in title "paragraphs". One proviso, though, is that this later call to name names specifies a threshold: any potential name with fewer qualifying occurrences in the text than the threshold value gets ignored, while those at or above it get flagged as name words. In my testing, I stuck with a threshold of 1, which effectively disables the filter.
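The two-pass scheme might look something like the sketch below. The WordTok structure and function names are mine, invented for illustration, and the sketch is deliberately simplified: it only handles the capitalized-word criterion, omitting all-caps words, possessives, and the title-paragraph disqualification.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

struct WordTok {
    std::string text;
    bool startsSentence;  // immediately follows a sentence boundary
    bool known;           // found in the lexicon
};

// Pass 1: survey capitalized unknown words that do NOT start a
// sentence, counting qualifying occurrences of each candidate name.
// Pass 2: keep every candidate that met the occurrence threshold;
// all its instances (even sentence-initial ones) then get flagged.
std::set<std::string> nameNames(const std::vector<WordTok>& words,
                                int threshold) {
    std::map<std::string, int> counts;
    for (const WordTok& w : words) {
        if (!w.known && !w.startsSentence && !w.text.empty() &&
            std::isupper(static_cast<unsigned char>(w.text[0]))) {
            ++counts[w.text];
        }
    }
    std::set<std::string> names;
    for (const auto& [word, n] : counts) {
        if (n >= threshold) names.insert(word);
    }
    return names;
}
```

With a threshold of 1, a single mid-sentence Rupert is enough to flag every Rupert in the document, including those that begin sentences.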

I should also point out that this algorithm will miss names that only appear at the beginnings of sentences in a document.

The first piece of this puzzle is the virtual lexicon, which helps to characterize words. In particular, it can flag a given word as being capitalized (e.g., Joshua, iPhone, MyVMS) or all-caps. For such a word, it extracts a potential name. This is separate from the word's raw text, in part because names often come in possessive form, with s' or 's at the end. The potential name excludes the apostrophe or apostrophe-S ending, and the word indicates that its potential name is possessive. This allows a completely unknown word to be available as a potentially possessive proper noun to the syntax analyzer later. But it also allows Francis, Francis', and Francis's all to represent the same Francis name when it comes time to name names.
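The possessive-stripping part of potential-name extraction is straightforward; here is a sketch of the idea with a hypothetical function name, again simplified (the real version lives inside the VL's word analysis).

```cpp
#include <cassert>
#include <string>

// Extract the potential name from a capitalized word, stripping a
// possessive 's or bare trailing apostrophe so that Francis,
// Francis', and Francis's all yield the same name. `possessive`
// reports whether an ending was stripped.
std::string extractPotentialName(const std::string& word, bool& possessive) {
    possessive = false;
    std::size_t n = word.size();
    if (n >= 2 && word.compare(n - 2, 2, "'s") == 0) {
        possessive = true;
        return word.substr(0, n - 2);
    }
    if (n >= 1 && word.back() == '\'') {
        possessive = true;
        return word.substr(0, n - 1);
    }
    return word;
}
```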

The VL comes up with these potential names even for derived words or words that come directly from the lexicon. This reflects the fact that even common words of all LCs can be used as proper nouns. It allows the senses the VL comes up with for a given word to coexist with the potential for this specific use of the word to alternatively be a name. You don't want a syntax parser to be unable to discover syntactic structure because a long string of words is all capitalized and thus marked with a pseudonymous "proper noun" lexical category that paves over the words' actual LCs.

On that note, when the name-names function is applying the used-as-a-name flag on words based on the criteria described above, it does not apply to words that are not capitalized or all-caps. So a phrase like how Mark made his mark on me can contain both common and name-flagged versions of the same word. As indicated before, flagging a word as used-as-a-name does not negate the senses (and LCs) returned by looking it up in the virtual lexicon.

Ultimately, the syntax analyzer will have to sort out the name-oriented reality of each word based on how it's used in a given sentence, but it will already have a variety of extra characteristics at its disposal, including whether the word is capitalized, whether it is flagged as used-as-a-name, and whether a given word sense is flagged as a proper noun.

Lexicon building

One of my main goals in revisiting my tokenizer, now that I have a good virtual lexicon mechanism, is to test my VL and continue building my lexicon. In practice, this also means continuing to refine the VL and tokenizer, too. Thus far, I've only processed two documents: a short biography of Mark Twain and a CNN news story about America's black working class. My current process begins with letting my tokenizer have a go at a document. Among other things, it generates lists of potential names it found, unknown words, and derived words. Here is an example of the output my test program generates.

I mainly work through the list of unknown words, creating entries in my lexicon as needed. If, for example, I see flippancy, I'll create an entry for flippant, knowing it will derive flippancy from it. I generally don't add names to the lexicon, with some exceptions. I will add common names like US states and major cities, commonly occurring people's names like John, and so on. I flag them as proper nouns. Here's an example of what my lexicon file looks like.

Once I've worked through the unknown words, I move on to the list of derived words. Here's a sample of what that looks like:

  • official        | 1 | N: off(P) -ic(N|V→J) -ial(N)  (N or J)
  • opposed         | 1 | V: oppose(V) -ed(V→V)  (V or J)
  • organizer       | 1 | N: organize(V) -er(V→N)  (N or J)
  • others          | 1 | N: other(N) -s(N→N)  (N or V)
  • overalls        | 1 | N: over(P) -al(N→J) -s(N→N)  (N or V)
  • overwhelmingly  | 1 | R: over(P) whelm(V) -ing(V|N→V) -ly(N|J→R)
  • pan-african     | 1 | N: pan(N) -(U) Africa(N) -an(N→N)  (N or D)
  • partnership     | 1 | N: partner(N) -ship(N→N)
  • partnerships    | 1 | N: partner(N) -ship(N→N) -s(N→N)  (N or V)
  • payments        | 1 | N: pay(V) -ment(V→N) -s(N→N)  (N or V)
  • personal        | 1 | J: person(V) -al(N→J)
  • poorer          | 1 | J: poor(J) -er(J→J)  (J or N)
  • populism        | 1 | N: popul~(J) -ism(N→N)
  • ports           | 1 | N: port(N) -s(N→N)  (N or V)
  • president       | 1 | J: preside(V) -ent(J)
  • president-elect | 2 | N: preside(V) -ent(J) -(U) elect(V)
  • pressures       | 1 | N: press(V) -ure(N) -s(N→N)  (N or V)
  • problems        | 1 | N: problem(N) -s(N→N)  (N or V)
  • professor       | 2 | N: pro-(U) fess(V) -or(V|N→N)  (N or V)

Most of these look very appropriate. I'm looking for oddballs, like with official, which it derived as "off -ic -ial". All I have to do is add office to the lexicon and it gets it right as "office -ial".

I also study a relatively compact representation of the parsed text. Here's an excerpt:

 '"' The(D) notion(N,P) of(P) the(D) white(J) working(V,N) class(N) implicitly(R) embodies(N,V) a(D) view(N) of(P) white(J) privilege(N) ','  '"'  said(V) @[Spriggs](?) ','   '"' It(N) implies(V,N) that(N,S,P) things(N,V) are(V) supposed(V,J) to(P,R) be(V) different(J) for(P) them(N) ','  that(N,S,P) they(N) aren't(Phr,Phr,Phr) the(D) same(J) ','  that(N,S,P) they(N) aren't(Phr,Phr,Phr) going(V,N) to(P,R) face(N) the(D) same(J) pressures(N,V) '.'  <<sentence end!>>  '"'  <<sentence end?>> 

There(N) is(V) no(D) official(N,J) definition(N) of(P) the(D) working(V,N) class(N) '.'  <<sentence end!>>  Some(D,N,R) define(V) it(N) by(P) education(N) level(N) or(C) job(N) sector(N) ','  others(N,V) by(P) income(V) '.'  <<sentence end!>>  John(N) @[Hudak](?) ','  a(D) senior(J) fellow(N) in(P) governance(N) studies(N,V) at(P) the(D) Brookings(N,V) Institution(N) calls(V,N) these(D,N,R) voters(N,V)  '"' economically(R,J) marginalized(V,J) '"'  because(S) they(N) often(R) fall(V,N) somewhere(R,N) between(P) the(D) poor(J) and(C) the(D) middle(N) class(N) '.'  <<sentence end!>> 

After each word is a list of the lexical categories for the senses the VL came up with for each word, listed in order of how likely it considered each sense to be the correct one. My LCs here include: (N)oun, (V)erb, ad(J)ective, adve(R)b, (P)reposition, (D)eterminative, (S)ubordinator, (C)oordinator, (I)nterjection, s(Y)mbol, and (U)nspecified. I consider this list a work in progress. If a word needs to be interpreted as a phrase, as with aren't (are not), you'll see "Phr" in place of an LC.

Note the <<sentence end!>> and <<sentence end?>> tokens. The ones with exclamation points (!) indicate high-certainty markers and the ones with question marks (?) indicate low-certainty but potential markers.

If I want to better understand how my VL treats a word, I can generate a more detailed output for that word like the following:

- Word: economically
- R: economy(N) -ic(N|V→J) -al(N→J) -ly(N|J→R)  (R or J)

- Short list of senses:

  - R: economy(N) -ic(N|V→J) -al(N→J) -ly(N|J→R)  (141)
  - J: economy(N) -ic(N|V→J) all(D) -y(N→J)  (160)

- All senses considered:

  - R: economy(N) -ic(N|V→J) -al(N→J) -ly(N|J→R)  (141)
  - J: economy(N) -ic(N|V→J) all(D) -y(N→J)  (160)
  - J: economy(N) -ic(N|V→J) -al(N→J) -y(N→J)  (160)
  - J: economy(N) -ic(N|V→J) all(R) -y(N→J)  (162)
  - R: economy(N) -cy(J→N) -al(N→J) -ly(N|J→R)  (171)
  - J: economy(N) -cy(J→N) -al(N→J) -y(N→J)  (220)
  - R: economy(N) -ic(N|V→J) a(D) -ly(N|J→R) -ly(N|J→R)  (220)

One assumption I'm making is that words in the major LCs have a natural ability to be converted to the others. The noun candy can easily be used as a verb, as in We candied apples for the party. The adjective beautiful can be used as a noun, as in The beautiful have it easy. My VL is mainly charged with giving its top pick of LC for a given word, but on request it will either give one sense for each distinct LC or give all the senses it conceives of. The latter I consider useful only for debugging, though.

The special exception is with phrase expansions, which are not really LCs to be condensed down. For example, Fred's could be interpreted as the possessive form of Fred or as Fred is or Fred has. Which one is appropriate depends on the syntax, as illustrated in Fred's been drinking Fred's own root beer. Fred's happy with how it turned out.

Performance metrics

In my earlier performance tests of my VL, I was testing to see how fast it could recognize a single, specific word. I knew that I wouldn't be able to get a meaningful average rate, though, until I started testing real texts. Now that I have, I can say that on my late-2013 iMac, I'm typically seeing parse rates of over 2,000 words per second. To clarify, that's not counting symbols and other tokens; just words that get processed by the VL. That timing is for the total process of tokenization, VL lookups, name finding, and generating statistics for my debugging use. On my early-2015 MacBook, the rate is typically above 1,200 words per second.

Generally speaking, the more unknown words there are, the slower the VL runs as it tries its best to find approximate matches. So as I was adding new lexemes to my lexicon, this algorithm got ever faster.

Here is a typical output of statistics at the end of my test runs:

  • 996 total words
  • 457 distinct words (45.9% of total)
  • 437 known words (95.6% of distinct)
  • 20 unknown words (4.4% of distinct)
  • 216 derived words (47.3% of distinct)
  • 204 known derived words (44.6% of distinct, 46.7% of known, 94.4% of derived)
  • 423 lexemes used (96.8% the number of known words)
  • 374 free/bound lexemes used (85.6% the number of known words)
  • Rate: 2.1 parses/second
  • Rate: 2600.0 tokens/second
  • Rate: 2096.8 words/second
  • 3 iterations took 1.4 seconds
  • 1427 lexemes defined

For comparison, I just now extracted a third document. Without making any changes to the lexicon, here are the stats it outputs on a first run:

  • 2227 total words
  • 856 distinct words (38.4% of total)
  • 528 known words (61.7% of distinct)
  • 328 unknown words (38.3% of distinct)
  • 545 derived words (63.7% of distinct)
  • 307 known derived words (35.9% of distinct, 58.1% of known, 56.3% of derived)
  • 429 lexemes used (81.2% the number of known words)
  • 376 free/bound lexemes used (71.2% the number of known words)
  • Rate: 0.4 parses/second
  • Rate: 1059.8 tokens/second
  • Rate: 855.7 words/second
  • 5 iterations took 13.0 seconds
  • 1428 lexemes defined

As you can see, the rate is down from over 2,000 words per second to well under 900. Most of that is consumed by attempts to figure out unknown words.

After generating these numbers, I added a caching mechanism to save a copy of what the VL returns for a given word token. The result is a 10 - 15% speed increase, depending on whether the document scanned is known.
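The cache is essentially memoization of VL lookups. Here is a minimal sketch of the shape of that mechanism; the class, structure, and method names are hypothetical, and the "parse" is a placeholder standing in for the real, expensive VL analysis.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Placeholder for the full "word" structure the VL returns.
struct WordInfo {
    std::string lemma;
};

// Memoize whatever the virtual lexicon returns for a given word
// token, so repeated words in a document are parsed only once.
class CachedLexicon {
public:
    const WordInfo& lookup(const std::string& word) {
        auto it = cache_.find(word);
        if (it == cache_.end()) {
            // Cache miss: run the (expensive) full VL parse.
            it = cache_.emplace(word, parseWithVL(word)).first;
        }
        return it->second;
    }
    int parseCount() const { return parses_; }

private:
    WordInfo parseWithVL(const std::string& word) {
        ++parses_;              // count real parses for illustration
        return WordInfo{word};  // placeholder for the real analysis
    }
    std::unordered_map<std::string, WordInfo> cache_;
    int parses_ = 0;
};
```

Since common function words like "the" repeat constantly, even a simple exact-match cache like this pays off, which matches the modest but real speedup observed.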

I have a lot more lexicon building to do. At least this process makes it fairly fast and easy to identify the words I need to import. The effectiveness and performance are encouraging so far.
