
Friday, November 18, 2016

Virtual lexicon enhancements

Virtual lexicon

In my previous post, I used the term "morphological parser" to describe what I've been building. Now that I'm done with a first significant version of it, I prefer the term "virtual lexicon" ("VL"). It captures the spirit of this as a module that represents a much larger set of words than the actual lexicon it contains. It's the same idea as virtual memory, which enables programs to consume what they perceive as a much larger amount of RAM than is physically available in the computer. Likewise, a virtual machine environment enables a fixed set of computer hardware to represent emulated computers with more CPUs, memory, and other resources than are physically available.

Periods

One of my stated goals in the previous post was to deal with periods and apostrophes. What I was indicating is that a basic tokenizer is faced with a nasty interpretation problem when it comes to these symbols. In standard written English, a sentence typically ends with a period and the next sentence begins after at least one space character (in typical computer formats, at least). But English also allows abbreviations to end in a period, typically followed by a space, as in Alice asked Prof. Smith for help. English also has initialisms represented sometimes with periods following each letter, as in r.s.v.p. and A.K.A. Further compounding the problem is that a sentence ending in an abbreviation or period-delimited initialism usually doesn't contain a separate period. For example, "I'm afraid of snakes, spiders, etc. When I see them, I run away A.S.A.P."

There is a further problem that could be ignored, but I decided to tackle it as well. Some special words like ".com" begin with periods, which would throw off a basic tokenizer. Further, it's possible for sloppily written text to have sentences ending in periods that don't have following spaces, as in "I'm going to the store.Do you need anything?" The following capital letter does help suggest a sentence break over the alternative interpretation that "store.Do" is a word. But there are such words, like "ASP.NET", and Internet domain names, like "google.com".

I decided to modify my tokenizer to allow word-looking tokens to include single periods, including one at the beginning and one at the end of the token (e.g., ".ASP.Net."). Doing so would give the virtual lexicon a chance to weigh in on whether such periods are part of the words or separate from them. The VL's return value now indicates if the word begins with a period and also if it ends with one. But then each of the word-senses in that word gets to indicate if the leading and/or trailing period is integral. To illustrate this, my test output shows integral periods as {.} and separate ones as (.). Consider the following examples:
  • ".animal.":  (.)  ⇒  N: animal(N)  ⇒  (.)
  • ".com.":  {.}  ⇒  N: .com(N)  ⇒  (.)
  • ".etc.":  (.)  ⇒  U: etc(U)  ⇒  {.}
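The keep-or-strip logic behind these examples can be sketched roughly as follows. This is only a minimal illustration, not the actual VL code: the toy `LEXICON`, the `Sense` class, and the `lookup` function are all hypothetical names, and the sketch assumes the lexicon stores integral periods directly in the spelling.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: spelling -> part-of-speech tag.
# Assumption: integral periods are stored as part of the spelling.
LEXICON = {"animal": "N", ".com": "N", "etc.": "U"}


@dataclass
class Sense:
    lemma: str
    pos: str
    leading_integral: bool   # True: leading period belongs to the word, {.}
    trailing_integral: bool  # True: trailing period belongs to the word, {.}


def lookup(token):
    """Try every way of keeping or stripping the edge periods of a token."""
    has_lead = token.startswith(".")
    has_trail = len(token) > 1 and token.endswith(".")
    start = 1 if has_lead else 0
    end = len(token) - 1 if has_trail else len(token)
    core = token[start:end]

    senses = []
    for keep_lead in ((True, False) if has_lead else (False,)):
        for keep_trail in ((True, False) if has_trail else (False,)):
            candidate = ("." if keep_lead else "") + core + ("." if keep_trail else "")
            pos = LEXICON.get(candidate.lower())
            if pos is not None:
                senses.append(Sense(candidate, pos, keep_lead, keep_trail))
    return senses
```

With this toy lexicon, `lookup(".com.")` yields a single sense whose leading period is integral and whose trailing period is separate, while `lookup(".etc.")` yields the opposite arrangement.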
My tokenizer also deals with the dotted initialism (N.A.T.O., r.s.v.p.) scenario, which is also a problem for the lexicon. I decided a lexeme representing this case should only contain the dot-free spelling (NATO, RSVP) and should contain one or more senses indicating that it is an initialism. When the VL comes across this pattern, it gets rid of the periods and then begins its search. For example:
  • "R.I.P.":
    • V: RIP(V)  ⇒  {.}
    • V: rip(V)
Note how it offers an alternative, because my test lexicon also has the common verb sense of "rip" referring to the ripping action. Had I fed in "rip" instead of "R.I.P." or "RIP", it would have put that common sense on top and the initialism sense second. But note also how the first sense indicates that, yes, the word ends in a period, but that period is part of the word. Had it been "RIP.", it would have indicated that there was a trailing period that was clearly not part of the word.
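A rough sketch of the dotted-initialism handling might look like the following. The regular expression, the `undot` and `lookup_senses` helpers, and the sense tuples are assumptions made for illustration. The ranking rule (initialism first when the input is dotted or all caps) is just one simple way to reproduce the ordering described above.

```python
import re

# Two or more single letters, each followed by a period: "R.I.P.", "r.s.v.p."
DOTTED = re.compile(r"(?:[A-Za-z]\.){2,}")

# Hypothetical lexeme with dot-free spelling: (lemma, pos, is_initialism).
LEXICON = {"rip": [("rip", "V", False), ("RIP", "V", True)]}


def undot(token):
    """Return the dot-free spelling of a fully dotted initialism, else None."""
    if DOTTED.fullmatch(token):
        return token.replace(".", "")
    return None


def lookup_senses(token):
    """Look up a token, ranking initialism senses by how the token was written."""
    dotfree = undot(token)
    senses = LEXICON.get((dotfree or token).lower(), [])
    prefer_initialism = dotfree is not None or token.isupper()
    # Stable sort: senses matching the preferred reading come first.
    return sorted(senses, key=lambda s: s[2] != prefer_initialism)
```

Because `DOTTED` must match the whole token, a partially dotted form like "A.S.AP." is not rewritten and falls through to an ordinary lookup.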

I would note that my VL doesn't deal well with cases where there are periods within an initialism but where one or more such periods are missing. A word like A.S.AP. would fail to be properly recognized by my VL, but I consider that a good design. I'm betting this sort of case is rare and almost always a typo. If someone wanted to say I RSVPed, for example, they probably wouldn't include any periods. This leaves those oddball words that do include infrequent periods, like Node.JS, pristine for lexicon lookups.

I would also note that my VL does not provide meaningful support for domain names (microsoft.com, apple.com), Usenet group names (alt.tv.muppets, sci.philosophy.tech), and so forth. This is probably best handled by a tokenizer, which could easily flag a token as fitting this pattern and possibly ask the VL to see if the token is a known word, as in the Node.JS case. It's going to be a challenge for any syntax and semantic parser to deal with these entities, anyway.

This all takes responsibility away from the tokenizer for dealing with single periods in and adjacent to words. The VL doesn't definitively decide whether a given period is punctuation marking the end of a sentence, but it does provide strong evidence for later interpretation. Plus, it allows terms that do contain periods to be properly interpreted on their own.
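On the tokenizer side, leaving single periods attached to word-like tokens could be as simple as the pattern below. This is a sketch under assumptions: the `WORD` pattern and `tokenize` function are hypothetical, handle only ASCII letters and digits, and ignore apostrophes and other punctuation entirely.

```python
import re

# Word-like token: an optional leading period, alphanumeric runs joined by
# single periods, and an optional trailing period (e.g. ".ASP.Net.").
WORD = re.compile(r"\.?[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*\.?")


def tokenize(text):
    """Return word-like tokens with any single edge or inner periods attached."""
    return WORD.findall(text)
```

The tokenizer makes no decision about what the periods mean. A token such as "store.Do" or "ASP.NET." is passed along intact so the VL can weigh in.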

Apostrophes

Almost the same problem crops up with apostrophe characters, which may be integral to words or may indicate the special class of punctuation that includes quotes, parentheses, and italicized text. Some words, like can't, bees', and 'nother, contain apostrophes that are integral to the word and not at all part of quoted text. However, a tokenizer just can't deal with this without recourse to a lexicon. So my lexicon allows terms to include integral apostrophes.

The tokenizer is expected to leave single apostrophes that may appear to the left or right of a word-like token, as well as within it, attached to the text. The VL then considers the various interpretations possible with the leading and trailing apostrophes. The output word indicates whether it begins with and also whether it ends with an apostrophe. Then each sense within indicates whether those leading and trailing apostrophes are part of the word or not. Same pattern as for leading and trailing periods. And in that spirit, here are some sample outputs for tokens that feature both leading and trailing apostrophes. Integral apostrophes are represented with {'} and clearly-separate apostrophes with (').
  • 'animal':  (')  ⇒  N: animal(N)  ⇒  (')
  • 'animals':  (')  ⇒  N: animal(N) -s'(N→N)  ⇒  {'}
  • 'nother':  {'}  ⇒  N: 'nother(N)  ⇒  (')
The 'animals' example illustrates the potential for confusion, too. After all, it could be that animals is simply a plural form of animal that's in single quotes, as in Your so-called 'animals' are monsters. Or it could be that the plural "animals" has possession of something, as in Your 'animals' pen' is empty. There truly is no way in this parsing layer to iron that out.
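One way to picture why both readings survive this layer is a sketch that enumerates every keep-or-strip choice for the edge apostrophes and reports each one that resolves to a known word. Everything here is hypothetical: the toy `LEXICON`, the `interpretations` function, and the naive suffix checks standing in for real morphological parsing.

```python
# Hypothetical toy lexicon; entries may contain integral apostrophes.
LEXICON = {"animal", "'nother", "can't"}


def interpretations(token):
    """Return (analysis, leading_integral, trailing_integral) for each reading."""
    lead = token.startswith("'")
    trail = len(token) > 1 and token.endswith("'")
    core = token[1:] if lead else token
    core = core[:-1] if trail else core

    results = []
    for keep_lead in ((True, False) if lead else (False,)):
        for keep_trail in ((True, False) if trail else (False,)):
            word = ("'" if keep_lead else "") + core + ("'" if keep_trail else "")
            if word in LEXICON:
                results.append((word, keep_lead, keep_trail))
            elif keep_trail and word.endswith("s'") and word[:-2] in LEXICON:
                # Plural possessive: the trailing apostrophe is integral.
                results.append((word[:-2] + " -s'", keep_lead, keep_trail))
            elif not keep_trail and word.endswith("s") and word[:-1] in LEXICON:
                # Plain plural: any trailing apostrophe is separate punctuation.
                results.append((word[:-1] + " -s", keep_lead, keep_trail))
    return results
```

For the token `'animals'`, this returns both the plural-possessive reading (trailing apostrophe integral) and the plain quoted plural (both apostrophes separate), leaving the choice to a later layer.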

Kicking the can down the road

One of my guiding assumptions is that each layer of the parsing process adds some clarity, but also creates more questions that can't be answered within that layer. I'm counting on the next layer to take the output of the lexicalizer, which is essentially this virtual lexicon applied to all word tokens, and generate as many alternative interpretations as are necessary to deal with the ambiguities. Then it will fall to the syntax parser, which should rule out some unlikely interpretations. That layer, too, will create more unanswered questions, which it will foist on later layers dealing more with the semantics of sentences.

One pattern I found is that I end up using my VL recursively because some alternative interpretations can only be handled by fully parsing a word by trying various interpretations, such as stripping off a leading period, and seeing which interpretation seems best. No doubt this same pattern will hold for the syntax parser, which probably will even occasionally call back to the VL to reinterpret alternative tokens it comes up with.
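The recursive pattern can be sketched in a few lines. This is a simplified illustration rather than the real VL: the `parse` function and toy `LEXICON` are hypothetical, and "seeing which interpretation seems best" is reduced to collecting every interpretation that reaches a known word.

```python
# Hypothetical toy lexicon; spellings may include integral punctuation.
LEXICON = {"animal", ".com", "etc.", "'nother"}

EDGE_MARKS = ".'"  # punctuation that may be integral or separate


def parse(token):
    """Return (word, stripped_prefix, stripped_suffix) for each interpretation."""
    results = []
    if token in LEXICON:
        results.append((token, "", ""))
    if len(token) > 1 and token[0] in EDGE_MARKS:
        # Reinterpret the token with its leading mark treated as separate.
        for word, prefix, suffix in parse(token[1:]):
            results.append((word, token[0] + prefix, suffix))
    if len(token) > 1 and token[-1] in EDGE_MARKS:
        # Reinterpret the token with its trailing mark treated as separate.
        for word, prefix, suffix in parse(token[:-1]):
            results.append((word, prefix, suffix + token[-1]))
    # Different stripping orders can reach the same reading; drop duplicates.
    return list(dict.fromkeys(results))
```

Each recursive call is a full parse of a slightly smaller token, so whatever the lexicon knows about integral punctuation gets consulted at every step.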
