Search This Blog

Thursday, December 15, 2016

Virtual lexicon vs Brown corpus

Having completed my blocker, I decided to take a break before tackling syntax analysis to study more facets of English. But also, I realized I should beef up the lexicon underlying my virtual lexicon (VL). I had only collected about 1,500 words, and most of those I had simply hand-entered by way of theft from the CGEL's chapters on morphology; mostly compound words, at that. It was enough to test and demonstrate the VL's capacity to deal half-decently with morphological parsing, but nowhere near big enough to represent the at least tens of thousands of words a typical high school graduate with English as their native language will know.

A virtual lexicon's core premise is that being able to recognize novel word forms by recognizing the parts of the word is more valuable than having a large list of exacting word-forms. In essence, a relatively small number of lexical entries should be able to represent a much larger set of practical words found "in the wild".

Using the Brown corpus

I decided that a good way to see just how much mileage I could get out of my virtual lexicon by exposing it to an existing dictionary, of sorts. In particular, I chose the Brown corpus, which is full of words hand-tagged with their lexical categories (parts of speech) taken from excerpts of 500 documents contemporary to the 1960s. I had already converted the BC's data to JavaScript/JSON files and dabbled a bit with it many months back, so I had an easy way to work with it.

Most significantly, I already had a list of all the unique words found in the BC, complete with an order sub-list of all the lexical categories and their frequency counts. Here's an example:
For example, care appears 162 times in the BC. 87 of those times, it's as a common noun, as in health care. And 75 of those times it appears instead as a base verb, as in to care for.
Given a word like "caring", my VL will try its best to figure out the lexical category. For this example, it would likely parse it as "care -ing" and call this a gerund/participle, same as BC, which uses the "vbg" tag to represent this.
This list contains lots of elements I don't care to push through my VL, such as proper nouns (John, Brooklyn, Glazer-Fine) and punctuation. After filtering, that leaves a word-list of 12,222 unique words for me to test my VL against.
Here's a snippet of the typical output from my testing:
 |   | preposterous    |       5 | jj              | J               | J               |     7 ms |    125 | pre-(U) post(N) -er(J→J) -ous(J)  (J, N, or N)
 | X | prescribe       |       5 | vb              | V               | N               |          |     40 | pre-(U) scribe(N)
 |   | prescribed      |      14 | vbn, vbd        | V.pret          | V.pret          |          |    100 | pre-(U) scribe(N) -ed(V→V)  (V or J)
 |   | prescription    |       5 | nn              | N               | N               |          |     85 | pre-(U) script(N) -ion(V|J→N)
 |   | presence        |      76 | nn              | N               | N               |     3 ms |     59 | present(J) -ce(J→N)  (N or Phr)
 | / | present         |     377 | jj, rb, nn, vb… | J               | V               |          |      0 | present(V)  (V or J)
 | X | present-day     |      17 | jj              | J               | N               |          |    100 | present(V) -(U) day(N)
 |   | presentation    |      33 | nn              | N               | N               |    17 ms |     97 | present(V) -ate(V) -ion(V|J→N)
 |   | presentations   |       6 | nns             | N.plur          | N.plur          |    88 ms |    137 | present(V) -ate(V) -ion(V|J→N) -s(N→N)  (N, V, or N)
 |   | presented       |      82 | vbn, vbd        | V.pret          | V.pret          |     4 ms |     40 | present(V) -ed(V→V)  (V or J)
 |   | presenting      |      10 | vbg             | V.gerprt        | V.gerprt        |     4 ms |     40 | present(V) -ing(V|N→V)  (V or N)
For example, prescribe gets treated as "pre- scribe". Since it sees scribe as a noun, it concludes that this whole word as a noun, as though we were talking about a person before they became a scribe. The BC tags this as "vb". To run the comparison, I use a mapping to translate some of the many tags the BC uses to the representation used by my VL. For examples, "V.pret" means preterite verb and "N.plur" means plural noun.

Data mapping is a tricky and often dubious affair. Sometimes, there just isn't an exact mapping between two systems. In my case, them, which in the BC is considered a "ppo" (pronoun, personal, accusative), which includes words like it, him, me, us, you, and her. Some of these are plural and the rest aren't. In my VL, them is a plural pronoun ("N.pron.plur"), making the plural "ppo" items compare incorrectly. I could have modified my mapping to treat them and us as plural, but that's an unnecessary hack that doesn't really help my task.

The first column of the output contains a match status. When blank, that means the two systems agreed on the LC of that word. A "?" indicates that my VL couldn't even match the morphemes of the prospective word. Though that doesn't stop it from making a guess based on a familiar suffix (e.g., -ous or -ing), I disqualified its attempt on that basis and also to point me to morphemes that really needed to be added to my lexicon. If all the morphemes did match, I compare the resultant LCs. The first one in BC represents the most common occurrence (e.g., "jj" (adjective) for "present") and the others represent less common occurrences. If my VL doesn't match any of BC's LCs, this column contains "X". If it matches only a secondary LC, "/" appears. Think "half an 'X' for half-wrong" (or half-right).

As you can see from these examples, some of the derivations are pretty good, as with "N.plur: present(V) -ate(V) -ion(V|J→N) -s(N→N)" for presentations. And some are pretty bad, like "J: pre-(U) post(N) -er(J→J) -ous(J)" for preposterous. Yes, it got the final lexical category right, thanks to the -ous suffix, but only by fumbling through its morphemes.

Adding lexemes

Although testing the dexterity of my VL was a key goal, a more basic one was augmenting my lexicon with more words. To that end, I would filter my word comparison runs for all the "?"-status bad matches and hand-enter morphemes as necessary.

Consider "monast", for example. I added this as a bound morpheme, which isn't definitively a prefix (un-, ante-, electro-) or suffix (-ing, -ably, -ment), but can't really stand on its own as a complete word in a sentence. Although I used my own sense of how a word was historically composed and its potential for production of other words, I also relied on online tools to help. For example, searching for all words that begin with "monast" or for all words that end with "ment". Having extensive examples at the ready helped me test (and reject) many of my hypotheses. And then I could know that monast~ could correctly form monastic, monastery, monasticism, and more.

I went through this process for several days. While I had some shortcuts, I ultimately hand-processed every word. To my surprise, I personally recognized all but perhaps ten of the 12k+ words, ignoring certain highly technical medical terms. And with each new lexeme I'd add, I'd ask myself, "how was this word not already in here?" One of the last words I added was "yes", one of the most basic in the English language. My sense is that there must still be loads of even ordinary words not covered by my VL.

I continued this process until there were no more "unmatched" words, meaning almost every word in the BC could be sliced up into morphemes that matched my underlying lexicon, even if the LCs didn't match the BC's LCs. In the end, my lexicon had 4,913 lexemes available. Of those, 4,494 lexemes were used to match 12,132 words. That represents a "lexical compression rate" of 37%. On average, one of my lexemes can match about three words in the BC. For comparison, a basic word-list with no morphological parsing would display 0% compression. 100% is the impossible asymptote that could never be reached.

Part of speech tagging

In processing the full word list from the Brown corpus, I get a rate of 83% "hard" matches and "5%" more "soft" matches. A hard match is where my VL's lexical category matches the most common usage of that word in the BC and a soft match is where my VL's LC matches one of the less common usages in BC. Let's be liberal and call this an 88% match.

To anyone familiar with traditional part of speech (PoS) tagging, 88% is pathetic. A typical PoS tagger will get better than 95% correct without breaking a sweat.

But my test is definitely not a PoS tagger. A PoS tagger typically looks at the words in the neighborhood of one word being considered and uses a statistical model to decide what it's most likely to be in that context. My test program does nothing of the sort. A better analogy would be that this is the naive first step in a Brill tagger, where each word is looked up in a lexicon for its most likely LC. And then the tagger begins transforming those guesses based on what's in the neighborhood.

Still, a typical PoS tagger that starts with a naive lookup will usually start out at around 93% match, so why would my VL do so badly?

One simple reason is that so many of the words that remain rely on poorly chosen lexemes during morphological parsing. Consider youth, which my VL sees as you -th, where -th is typically a suffix for ordinal numbers like tenth and 175th. My lexicon is missing an entry for youth. In this case, youth, youthful, and youths all correctly match, but in many other cases, such shortcomings in my lexicon cause clearly mistaken guesses, as when legitimate gets interpreted as leg it im- ate, whereas I really need a legitim~ bound lexeme to come up with legitim~ -ate and a valid interpretation as either a verb (to legitimate her presidency) or adjective (the legitimate president).

Another reason is that the derivation of a word from its morphemes often seems superficially logical, but doesn't reflect the reality of how the word is typically used. For example, amazing naturally follows the usual amaze -ing pattern and can be used as a verb (amazing friends with magic) or gerund ("amazing" isn't bold enough to describe it), but in practice, we most often use amazing as an adjective (the soup is amazing). There's no way to tell that by reference to its two morphemes. -ing almost always forms a gerund-participle and sometimes a noun (the flashing for the siding needs repair), but rarely as an adjective (stunning, breathtaking). This reality reflects a limitation of the virtual lexicon, at least as I've constructed it. Sometimes the only answer is to lexicalize (add to the lexicon) a word that would otherwise badly match, as I did with amazing.

Often, I simply couldn't bring myself to label a word in accordance with the most common usages in the BC. For example, I have defeat as a lexeme with only a verb sense, but the BC has it more often tagged as a noun (suffered a defeat) and less often as a verb (we'll defeat them). To my thinking, this reflects not a definitional disagreement, but the difference between a word's intrinsic meaning and its usage in a specific sentence.

Moreover, I am still troubled by the idea of having a lexicon contain multiple entries for a lexeme whose only apparent difference is one of lexical category. In Lexical Categorization in English Dictionaries and Traditional Grammars Geoffrey Pullum points out that "many dictionaries actually do — quite wrongly — include subentries for numerous nouns that list them as adjectives."

In the process of beefing up my lexicon, I was struck by a feeling that almost every entry I added that had a noun sense could also be used as a verb or adjective as well, so I favored only adding LCs for what I thought of as the predominant LCs for the major senses. For example, for appeal, I added verb (I plan to appeal the decision) and noun (the youthful appeal of this dress) senses because each, to my thinking, had a distinctly different meaning. For medal, I only added a noun sense, despite the validity of using it as a verb, as in medaled in track and field, because to my thinking, both uses fundamentally refer to the same exact concept. "Medaling" just means winning a medal, the thing that is won.

That said, I've also second-guessed this line of thinking. If I imagine there's a singular noun-verb-adjective pseudo-category, then it's clear that many words would violate it. For example, it's hard to imagine using medal as an adjective. Yes, some words like fast easily lend themselves to all three interpretations (They fasted their fast after eating faster than usual.) But so many words just pile up to negate this nice model, such as complex (the complex has a complex layout) and monkey (let's not monkey with this monkey). That seems to suggest I should have been more liberal with my lexicon. That I should have listed all the reasonable lexical categories a lexeme could take.

That said, I guess my stinginess here is a significant source of the low half-right match rate. Had I taken the alternative view, the right plus half-right numbers probably would be in the ninety percents.

From word-list to document tags

As I said before, my test code is not what I would actually call a part of speech (lexical category) tagger, since it does not consider any word in the context of the words around it. All it does is guess at the proper lexical category for a given word by a morphological analysis involving dictionary-like lookups.

Still, I was curious to see how it would fare against the individual documents in the BC. The unique words are not all equal, after all. One word will appear only once in the BC, while another will appear thousands of times. And that word may vary among three different LCs with each usage.

As before, I ignored proper nouns and punctuation, but also "cd" (number) words. Of the 951k words thus considered, 773k (81%) of them had matching LCs, which is close to the 83% exact-match rate I got when simply looking at the unique-word-list from the BC. I include here an output file from one test run. Here's an example of what its contents look like:
 | on              | rp         | R               | P: on(P)
 | long            | rb         | R               | J: long(J)
-------- Doc 201 --------
 | to              | to         | R               | P: to(P)  (P or R)
 | Farewell        | nn         | N               | J: fare(N) well(J)  (cap)
 | that            | cs         | S               | N.pron: that(N)  (N, D, R, or S)
 | search          | nn         | N               | V: search(V)
 | meaning         | nn         | N               | V.gerprt: mean(V) -ing(V|N→V)  (V or N)
 | hints           | vbz        | V.3rdsg         | N.plur: hint(N) -s(N→N)  (N or V)
 | Unconscious     | nn         | N               | J: un-(J→J) conscious(J)  (J, Phr, N, or N)  (cap)
 | form            | nn         | N               | V: form(V)  (V or N)
 | other           | ap         | D               | J: other(J)  (J or V)
 | human           | jj         | J               | N: human(N)
Each of the 500 documents has a Doc NNN header, followed by a list of the words that did not match. Each such mismatch lists the word, the exact tag (e.g., "nn" or "cs"), my mapped version of it (e.g., "N" and "S"), and then my own interpretation of the word.

The example of human well illustrates the difference between a word's natural lexical category and the syntactic category it takes on in a sentence. In this case, it falls within the clause some form or other enters into all human activity. Practically speaking, human is a noun and its use in human activity doesn't change that.


Overall, I'm happy with how well the morphological parsing approach my virtual lexicon takes to solving the specific problem of guessing the baseline lexical category for a word it doesn't already know. 81% of over 12k words were properly recognized by a little over a third as many lexemes. That said, wacky examples like "de- co- rat -ive" (instead of "decor -ate -ive") illustrate how it's often just a lucky guess where the last suffix's LC saves the day.

My hand-crafting yielded a lexicon just under 5k works. Given a choice between a massive word list — think hundreds of thousands or even millions of words — and a tiny lexeme set plus morphological analysis, the massive word list is clearly going to win. That said, it seems reasonable to assume that the best results would be gotten by a combination of a morphological analyzer with a massive word list. The reason is that the underlying premise remains: that you're inevitably going to run into novel word forms as you process new documents.

Thursday, December 8, 2016

Text blocking / Sentence segmentation

I've finished a first working version of my "blocker" module. I'm coining this term to reflect its purpose: to break a paragraph being parsed up into its constituent sentences and sub-sentence "blocks" of text. This is often referred to as "sentence segmentation", but I find that term belies the fuller scope of a blocker.

Wikipedia presents a good summary of the basics of sentence segmentation:
The standard 'vanilla' approach to locate the end of a sentence:

(a) If it's a period, it ends a sentence.
(b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.
(c) If the next token is capitalized, then it ends a sentence.

This strategy gets about 95% of sentences correct. Things such as shortened names, e.g. "D. H. Lawrence" (with whitespaces between the individual words that form the full name), idiosyncratic orthographical spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN") and usage of non-standard punctuation (or non-standard usage of punctuation) in a text often fall under the remaining 5%.
But to my thinking, this really misses a lot of the picture.

What is blocking?

I struggled for a while to find a good term. "Blocking" is the best one I came up with, but admittedly, it does conjure up that other modern meaning, as with "call blocking" or "spam blocking". In this case, I mean "blocking" in the sense: breaking text up into a semantically meaningful hierarchy of blocks of text. So what do I mean by that?

Let's say you're given the task of blocking by hand. You are always given what is assumed to be a single paragraph or some other unbroken block of plain text (e.g., a headline), so don't worry about things like first-line indents or bullet points that screw with text segmentation.

At the very least, you're expected to break the paragraph up into separate sentences. The previous paragraph, for example, contains two of them. You start looking for periods. Then exclamation points (!) and question marks (?) marking the ends of sentences. Not a bad start. But see how sometimes they aren't, as in my putting "!" and "?" in parentheses, and now in quotes? Imagine if you were to call out the following as sentences from this paragraph:
  • Then exclamation points (!
  • ) and question marks (?
  • marking the ends of sentences.
Nonsense, right? So maybe the problem is the parentheses. Maybe, but that means we need to keep track of parentheticals, now, too. So maybe our rule is: ignore stuff inside parentheses. So what if we had a sentence that included the somewhat arcane use of in-line bullet points like: 1) point one, 2) point two? It's suddenly not a simple case of ducking in and out of parentheticals. Still, better than nothing.

But parentheses bring up another important point. Sometimes stuff in parentheses is grammatical in context. Consider this (illustrative) example. You could take the parentheses away and the remaining sentence is still a grammatically correct sentence. But not (this is a clear example) in this sentence. Take its parens away and you end up with an ungrammatical sentence. Because of that potential, I claim it's better to segment out parenthetical blocks of text as being inline with other text bound for syntactic and semantic analysis. I decided to also include text in [square brackets] and {curly braces} in the class of parentheticals.

So far we have sentences and parentheticals. Let's add quotations, which act similarly, but not identically, to parentheticals. For one thing, a given text is not guaranteed to make use of Unicode's support for left and right single and double quote characters. Moreover, when it does, they may be used for alternative purposes, such as in place of apostrophes. I decided to just simplify these all down to basic ASCII 'apostrophes' and "straight quotes". But that creates ambiguity as to what is intended when any of these two characters appears in text. We'll come back to how this is handled soon. Suffice to say that we want to consider the text inside quotes to be text blocks just like with parentheticals.

Adding to the complexity is the fact that a parenthetical or quotation block can contain more sentences or other sub-blocks. Although it's generally considered poor form for an inline quotation to contain multiple statements, this isn't unusual in practice. Here's an example:
Smith was quoted as saying, "We plan to appeal the decision. But we'll also need to study its impact."
There's no logical way to see this "single" sentence as grammatical without admitting it's one sentence that contains two others.

To complicate things further, there are "quotes 'within' quotes". When a text is well formatted, the inner quotation will use single quotes and quotes deeper within will either use more double quotes or the somewhat awkward "'triple quotes'". However, it's not uncommon for more informal texts to contain lazily copy-and-pasted text formatted with surrounding double quotes which already contained double quotes; thus, double quotes inside double quotes.

Our end result of blocking process should be a tree structure with the paragraph node at the top, its sentence nodes as its children, and the sub-sentence blocks. Consider the following example, taken from Wikipedia's article on Mark Twain:
I came in with Halley's Comet in 1835. It is coming again next year, and I expect to go out with it. It will be the greatest disappointment of my life if I don't go out with Halley's Comet. The Almighty has said, no doubt: 'Now here are these two unaccountable freaks; they came in together, they must go out together'.
And here is a visual way of representing the block tree for it:

Each block has slots for an "opener" and "closer". These are typically filled by punctuation like periods, quotes, and parentheses. Inside each block is either a string of textual tokens or a set of sub-blocks, if the contents are heterogeneous.

Being a programmer, I have my own more cryptic way of displaying these structures using plain text. Here's what it looks like for me (color manually added here):

          I(N.pron) came(V.pret) in(P) with(P) @Halley's(N.poss) @Comet(N) in(P) '1835' 
        | '.' )
          It(N.pron) is(V.aux) coming(V.gerprt/N) again(V) next(J/R) year(N) ',' and(C) I(N.pron) expect(V) to(P/R) go(V) out(R/P/J) with(P) it(N.pron) 
        | '.' )
          It(N.pron) will(V.modal/N) be(V.aux) the(D) greatest(J.superl/R.superl/D/N.prop) disappointment(N/J/U/D/V) of(P) my(N.pron.poss) life(N) if(S) I(N.pron) don't([do not]) go(V) out(R/P/J) with(P) @Halley's(N.poss) @Comet(N) 
        | '.' )
              The(D) @Almighty(U) has(V.aux) said(V.pret) ',' no(D) doubt(V) ':' 
            Q( '''
              Now(R) here(R) are(V.aux.plur) these(D/N.pron.plur/R) two(D) unaccountable(J/N) freaks(U) ';' they(N.pron.plur) came(V.pret) in(P) together(R) ',' they(N.pron.plur) must(V.modal/N) go(V) out(R/P/J) together(R) 
            | ''' )
        | '.' )

In keeping with the bubble-like diagram above, the format for each block is: Type( opener | text goes here | closer ). Block types are indicated by prefixing a single letter before each, including:
  • A:  Paragraph
  • S:  Sentence
  • Q:  Quotation
  • P:  Parenthetical
  • C:  Custom
  • T:  Text
Note that each word token indicates in parentheses one or more single-letter lexical categories (part of speech) tags (e.g., "next(J/R)") plus some optional sub-categorization tags. LCs include:
  • U:  Unspecified
  • N:  Noun (including pronouns)
  • V:  Verb
  • J:  Adjective
  • R:  Adverb
  • P:  Preposition
  • D:  Determinative (articles, numbers, etc.)
  • S:  Subordinator (conjunction)
  • C:  Coordinator (conjunction)
  • L:  Correlator (conjunction)
  • I:  Interjection
  • Y:  Symbol
I won't spell out all the subcategories here, but examples include "N.pron.plur" for plural pronouns (these), "V.gerprt" for verbs as gerunds or present participles (running, coloring), and "J.compar" for comparative adjectives (better, curiouser). I'm convinced that this finer-grained categorization for words will aid in later syntax analysis.

I also prefix each word with an at sign (@) when it is considered likely to be used as a name in this text (e.g., Halley's and Comet).

Custom blocks

I'm convinced that we sometimes use special formatting as a form of logical blocking. Consider my use of italics in this blog post to highlight examples of what I'm demonstrating. Sometimes I include whole sentences or sentence fragments in italics within narrative statements. Sometimes, my italics contents are grammatically part of the sentence, but in other cases they are ungrammatical, just the same as if they were protected within parentheses or double quotes.

I did not want to get into the weeds of parsing richly formatted text like HTML or RTF, but I also did not want to ignore this special case. So I decided to introduce a feature where the calling application can add custom XML-looking blocking tags. They appear in the otherwise plain text as "regular text <123>custom block</123> more regular text", where "123" is any unique integer that must match on both tags. If there's a start or end tag that doesn't have an appropriate matching tag, it will be regarded as ordinary text to include in the sentence where it appears. And custom block tags must appear properly nested and not overlapping, following the same logic for nesting of tags within well-formed XML.

Although block IDs (the integer value) can be reused within a paragraph, it is wise to make them unique throughout a document. One benefit is that the calling application can maintain a dictionary of known IDs and whatever properties are of significance to that application, such as the special formatting.

Here's a sample sentence illustrating the concept:
I came <1>in with <2>Halley's Comet</2></1> in <3>1835</3>.
And here's the debugging output:

            T(| I(N.pron) came(V.pret) |)
            C( '<1>'
                T(| in(P) with(P) |)
                C( '<2>' | @Halley's(N.poss) @Comet(N) | '</2>' )
            | '</1>' )
            T(| in(P) |)
            C( '<3>' | '1835' | '</3>' )
        | '.' )

I don't want to go into great detail here on this, but want to make the point that custom block markers grant the calling application a special ability to mark portions of text explicitly as blocks that may or may not be grammatical in order to give the syntax analyzer the chance to decide.

Designing the blocker

Conceptually, my basic english parser (BEP) product has a pipeline with the following modules:

Tokenizer → Virtual Lexicon → Blocker

However, the reality is that these are not strictly separate modules. The blocker is really just some extra code within the tokenizer. This is mainly because the code didn't seem complicated enough to me to warrant its own class. Plus I didn't want yet another layer needlessly slowing things down.

I've been programming this all in C++11 using Xcode on my iMac & MacBook, but I'm reluctant to show actual code here. I'm more interested in describing the algorithms so programmers and non-programmers can understand the concepts.

In brainstorming about how to approach the blocker, I was initially inspired by the familiar BNF specifications typically used to describe context-free grammars like programming languages and data file formats like XML. But there was one huge problem: each paragraph can be blocked into several alternative block trees. BNF wasn't designed with alternative interpretations in mind and neither were the kinds of parsers that implement BNF. There is supposed to be only one valid output when parsing a given context-free grammar file.

I want to emphasize that multiple interpretations is a key concept I've embedded in all three layers of my BEP so far. Natural languages are full of ambiguities. They are not resolved by arbitrarily choosing the "best" option in each layer. That's a brittle solution. And it's one thing that bothers me about almost all NLP products: they almost never seem to allow ambiguities among their processing layers, let alone in their output. By contrast, my virtual lexicon (VL) typically outputs several likely interpretations ("senses") for each word. "Jim's" may be the possessive of "Jim" or a contraction representing "Jim is" or "Jim has". "Out" may be an adverb (out in the middle of nowhere), preposition (out the door), or adjective (She is out about her gender identity). The VL couldn't possibly decide which one is best, so it's best to just report these options to other layers like the syntax analyzer which are more qualified to resolve them. This same concept applies to blocking.

I was also inspired by my approach to morphological parsing, which involves recursively drilling down, character by character from right to left, searching for possible morphemes. Each time this brute-force search reaches the end (first character in the word) by accounting for all the morphemes in the word, the endpoint in the parse tree being recursively constructed gets added to a list of all "tails" representing successes. Afterward, another algorithm works backward from each tail to construct a linear list of the morphemes discovered, teasing the successes out of the tree and weeding out the failures. But I realized the morphemes in a word represent a linear list instead of a tree structure, so I had trouble at first seeing how it could apply to constructing a parse tree of block trees.

The problem was that I needed an algorithm that could build lots of alternative trees in the same way my morphological parser built lots of lists.

I settled on a multi-pass approach. Each pass through the entire paragraph gradually transforms the tokens into the final set of block trees. Pass 1 creates blocks of contiguous tokens and mini-blocks representing potential block dividers like parentheses, quotes, and punctuation. Pass 2 transforms these lists of proto-blocks into "suggestions" for how they should be represented as trees. Pass 3 scores the merits of each interpretation and throws away all but the best N of them. Pass 4 creates cleaner interpretation chains from the source chains that still have lots of overlap, enabling transformations to each chain that won't screw up the other chains. Pass 5 proposes sentence (and sub-sentence) begin and end tags to ensure that every potential block's start and end is clearly spelled out. Pass 6 transforms these tree-flavored chains into actually tree structures and, in the process, restructures them to improve their simplicity, clarity, and interpretation. Pass 6 also looks at the paragraph's whole block tree to make minor refinements and generate usable statistics.

Pass 1: Creating block tokens

When constructing blocking, it's worth noting that you can tell where block boundaries definitely won't be. For example, a string of plain words will never have a boundary placed inside them. In fact, the only places block boundaries could possibly exist are where certain kinds of punctuation like question marks, apostrophes, and right parentheses characters are found. There's no sense wasting computing power looking for boundaries elsewhere.

So the first step goes through all the tokens in a paragraph created by the tokenizer. The result is a list of a higher order of tokens representing blocks of other tokens that can't be broken down any smaller, from a blocking perspective.

I struggle with the idea of calling them "tokens", given that I already use this term in its much more conventional sense to refer to single words, symbol characters, and so on. The term "block-token" is slightly better, but messy. But given how I actually chose to implement this algorithm, I'm just going to call these higher order tokens "blocks" for now. It should become apparent why momentarily.

Rather than create a distinct block-token data structure to represent these special kinds of tokens, I decided to create one "block" structure and use blocks for both this linear breakdown, sans hierarchy, and the final hierarchic structure that results from this kind of parsing.

Each block is defined primarily by a type, by whether it is an "in" or "out" (explained below), and by a list of tokens or child blocks it contains. You may recall that block types include paragraph, sentence, text, quotes, parentheses, and custom. When we come across an open or close parentheses or square bracket token, for example, a block is created with just that one token in it. Same for apostrophes or double-quotes, which get the "quotes" type. Periods, exclamation points, and question marks get typed as "sentence". Custom block markers are easily recognized because the tokenizer already flagged them with a special "block marker" token type (recall that all tokens have their own types, including word, symbol, number, etc.).

Every other token gets lumped into text-type blocks. They will mostly be word-type tokens, but may also be symbols not listed above. One special exception is the horizontal ellipsis character, which is regarded here as potential sentence-ending punctuation like periods are.

Here's a sample sentence and representation of these proto-blocks. Note the custom blocks (1 & 2), which I added to correspond to the italicized book titles.
Among his novels are <1>The Adventures of Tom Sawyer</1> (1876) and its sequel, <2>Adventures of Huckleberry Finn</2> (1885), the latter often called "The Great American Novel".
  • C[ <1> ]
  • T[ The Adventures of Tom Sawyer ]
  • C[ </1> ]
  • P[ ( ]
  • T[ 1876 ]
  • P[ ) ]
  • T[ and its sequel , ]
  • C[ <2> ]
  • T[ Adventures of Huckleberry Finn ]
  • C[ </2> ]
  • P[ ( ]
  • T[ 1885 ]
  • P[ ) ]
  • T[ , ]
  • T[ the latter often called ]
  • Q[ " ]
  • T[ The Great American Novel ]
  • Q[ " ]
  • S[ . ]

Pass 2: Tree discovery

The second pass takes this linear list of proto-blocks and adds "in" and "out" flags as appropriate. This is more complex than it sounds, though.

First, what do I mean by "in"? A block is flagged as "in" if it represents the beginning of a child block. The simplest example is an open parentheses character ("("). Conversely, a close parentheses character (")") would be flagged as "out".

Ending punctuation  for now I'll just shorten that to "punctuation" for simplicity  gets flagged as both "out" and "in". Why? Let's say you had a paragraph with two simple sentences containing nothing but words and punctuation. The end result will be two "sentence" blocks, so it makes sense to think of the punctuation after that first sentence as representing the end ("out") of one block and beginning ("in") of another block.

So here is how the above sentence might be represented with all its "out" and "in" flags. "Out" is represented by "^" (as in up the hierarchic tree) and "in" by "v" (as in down deeper into the tree):

  • T[ Among his novels are ]
  • C[ <1> ]v
  • T[ The Adventures of Tom Sawyer ]
  • C[ </1> ]^
  • P[ ( ]v
  • T[ 1876 ]
  • P[ ) ]^
  • T[ and its sequel , ]
  • C[ <2> ]v
  • T[ Adventures of Huckleberry Finn ]
  • C[ </2> ]^
  • P[ ( ]v
  • T[ 1885 ]
  • P[ ) ]^
  • T[ , ]
  • T[ the latter often called ]
  • Q[ " ]v
  • T[ The Great American Novel ]
  • Q[ " ]^
  • S[ . ]^v

It seems like it should be easy to produce this in one simple step, but it isn't. Quotes create a problem. While some texts contain Unicode characters that clearly represent left and right single and double quotes, not all texts do and sometimes they are simply wrong because of oddball use of the word processors that generate them wrong. My tokenizer simply standardizes them into "straight quotes" with no left or right polarity. And now we pay the price. A given quote may potentially represent the start or the end of a block.

For another thing, sometimes characters do not actually represent block markers. Take the following example:
I need you to pick up 1) apples, 2) bananas, and 3) strawberries.
Or this example:
A sentence may contain quotes (") and brackets ('[' and ']').
There are actually many possible ways to interpret this one. You and I know that the square bracket characters are merely literal symbol characters in the text, not markers of a parenthetical block that happens to have a quoted block inside them and quotes containing them. How does an algorithm figure this out?

The answer is that this pass is responsible for coming up with all the possible interpretations of each symbol and letting the next pass decide which ones are most likely. The way it does this is by building a tree of all the possibilities.

Consider a (double) quote character, for example. It could represent a plain old symbol with no blocking meaning, the opening of a quotation block, or the closing of a quotation block. Thus, finding a quote adds 3 branches to the tree below the node representing the earlier block. The tree nodes contain the "in" and "out" flags and a reference to the appropriate proto-block.

Let me spell this out more clearly. A recursive function named .find_blocks() examines a specific block in the list of them and adds at least one node to a tree it is constructing. It then recurses, calling .find_blocks() with the goal of looking at the next block and adding its own nodes to the tree under each node it added. So in the quote example, .find_blocks() gets called once for each of the 3 branch nodes created representing the possible interpretations.

Along the way, it's possible to weed out some clearly impossible chains of interpretation. Consider a paragraph with only one sentence and having that contain one piece of quoted text. Interpreting the first quote as a closing of a block makes no sense because there was no corresponding open quote earlier.

This may sound difficult to recognize, but it's actually quite easy. The .find_blocks() function takes a "level" argument as input. On the first call, zero is passed in. When the function considers an "in" interpretation of a symbol, its subsequent call to .find_blocks() will pass in level + 1. Conversely, when it considers an "out" interpretation of a symbol, it passes in level - 1. The first thing the .find_blocks() function does is check "level". If it is below zero, it knows that this cannot represent a valid interpretation of the paragraph's blocking.

Proceeding forward, block by block, the recursion of .find_blocks() eventually reaches the last proto-block of tokens in the sentence  typically punctuation. One final check is done to see if level equals zero. This is also required for a valid interpretation. An example of how this might be violated would be if we interpreted the two quotes found in that single sentence as both opening quotes but never found closing quotes to balance them. Level would thus be 2 by the end, not zero.

There is one very special exception to this rule. In fiction writing, it is common for a paragraph to end with quoted dialog that does not end with a closing quote. The convention is that the next paragraph should begin immediately with an open quote. Here's a trivial example:
Chrystal said, "What were you thinking?

"Nevermind. We'll figure it out," she continued.
This is valid, so I didn't want to ignore this special case. My answer is that, when .find_blocks() reaches the end of the sentence and level happens to be 1, it considers this possibility. To support it, the code before .find_blocks() was ever called counted the number of double quote character tokens in the text and passed in a boolean flag indicating if there is an odd number of them. So if level = 1 and the paragraph contains an odd number of double quotes, it adds one more branch to the tree with a contrived block representing a closing quote, but that psuedo-block does not refer to any specific tokens the way the other blocks do. The result is that level is now effectively back down to zero at this point.

I want to note that this is a bit of a hack in that it will only work for double-quote characters. It would not work if dialogue were represented with single or triple quotes (or double quotes represented by double apostrophe characters) because of the ambiguity these representations would create. That said, I don't recall ever in my life running into these awkward cases.

So the last thing the .find_blocks() function does, once it has weeded out impossible (level  0) cases, is add the last node to a list of "tails". Each tail represents a successful chain of possible interpretations from the first proto-block to the last in the paragraph. Traversing backward from each tail toward the root node represents whole single chains of interpretation. If a paragraph yields 14 tails, that means there were 14 reasonable interpretations of the blocking created and waiting to be considered by the next pass.

Pass 3: Scoring the possible-tree chains

The first pass' output is a list of tail nodes that are linked lists — what I call "chains" — leading from the last proto-block back to the first and bringing along each of the interpretations of the source data that were found and considered within the realm of possibility.

This pass, then, scores each chain, sorts them by their scores, and keeps the N best options.

Scoring involves following each chain from its tail — the rightmost block in the paragraph — back to its head — a node representing the beginning of the paragraph just before the first block. The score is actually an integer penalty value, just like with the morphological parser. The best possible score would be zero and any positive value indicates some penalties that may represent defects in that interpretation. The score would never be negative.

An example of a penalty is when a double quote seen as an opener is followed by a space. Similarly, when a close quote is preceded by a space. Typically, opening quotes are preceded by space and followed by none and vice-versa for closing quotes.

Another thing penalized is when a token-block ends with an abbreviation like "Dr." (or initialism like "A.S.A.P"), but that period is also considered a sentence-block end. It's not impossible, of course. After all, many sentences end with "etc." But it does make sense to penalize this scenario and thus favor interpretations where these "soft periods" are seen not as sentence endings.

Another thing penalized is sentence-ending punctuation followed by words that are not capitalized. This, again, can happen with abbreviations in the middle of sentences. We don't want to rule it out, as the author may have made a typo or not be inclined to use capitalization. Informal text messages and instant messages often completely lack capitalization.

A block that looks like it should be a block opener, closer, or sentence-ending punctuation but is not flagged as "in" or "out" is penalized. Consider this the "lazy option penalty". After all, one acceptable interpretation output by the previous pass is that all quotes, parentheses, and punctuation are just raw symbols and not block markers. But this is a highly unlikely case.

Once each chain is scored, they get sorted such that the lowest-penalty chain is on top and the others represent progressively less likely interpretations of the paragraph's possible blocking. Although in principle, every such interpretation has the potential to be the correct one, in practice it doesn't make sense to keep all of them. With really oddball text  say, containing mathematical expressions of programming code  there could potentially be thousands of chains output. So the list of chains gets truncated to some practical limit, which I have defaulted as 10 but which the calling application can override.

Pass 4: Linearizing the chains

This pass exists mainly to keep me from going insane. The chains thus far are singly linked lists that proceed from right to left. Although I had originally been fine working with this, despite my head spinning when dealing with terminology like the node representing the block to the right being the "previous" node instead of "next".

But moreover, I found that later passes needed to actually modify nodes in the chains. Remember that for now, these chains are actually just the parse tree viewed backward, traversing from leaf nodes toward the root. That means that two chains likely share nodes. So manipulating a node in one chain has the potential to do the same to that node in another chain, which is bad.

I also found that there was value in having a doubly-linked list later. I simply couldn't do that with an inverted tree.

In this stage, I simply traverse each chain and produce a new chain with copies of all the nodes traversed. The old chains  and the tree they came out of  are discarded. Moreover, I now have a list of "heads" and discard the old list of "tails". This new list keeps me slightly more sane.

I want to take a moment to point out something that came as a surprise to me. I'm programming this stuff in C++. Programmers with only passing familiarity with C++ might well regard it as harder to program in. One classic problem is that sloppy use of explicit object instantiation and cleanup  pointer hell — can lead to memory leaks, among other things. One implicit assumption that explicit instantiation gives some is that you have to do all the hard work of copying data structures, yourself. Ironically, the opposite is true, especially in the newer, more standardized C++, which I did not have available in my early programming days. In fact, C++ makes it stunningly easy to copy whole, deep data structures. My NLP work creates some real doozies, too. Programmers familiar with C#, Visual Basic, JavaScript, Java, and most of the other mainline languages are used to passing objects around by reference. The idea of cloning a deep-structure in one single assignment like auto b = a; is completely alien. But, dear God, is it a refreshing change of pace, not having to create .clone() methods on every custom class! C++'s comfortable use of explicit references and pointers takes care of the rest, where you don't want to have massively wasteful copying. The real kicker is how fast cloning of standard list and dictionary structures  the real workhorses of deep data structure  is.

Pass 5: Finding sentence beginnings and endings

Once we've got the opening and closing tags worked out for quotes, parentheticals, and custom blocks, we're left with the ambiguities of finding the beginnings and endings of sentence blocks. Again, this may sound like a trivial task, but it's not.

Here's an example of the output of this pass, using a sentence taken from a CNN article:
Castro became famous enough that he could be identified by only one name. A mention of "Fidel" left little doubt who was being talked about.

S[ ]v
  T[ Castro became famous enough that he could be identified by only one name ]
S[ . ]^
S[ ]v
  T[ A mention of ]
  Q[ " ]v
    T[ Fidel ]
  Q[ " ]^
  T[ left little doubt who was being talked about ]
S[ . ]^
Note that each sentence has both opener ( S[ ]v ) and closer ( S[ . ]^ ) blocks.

Once again, this relies on a recursive function, this time called .find_sentences(). Inside, it loops through all the nodes in the current paragraph-representing chain of blocks. When it comes across an ending punctuation mark, it splits that block into two separate ones. Whereas the original was flagged as "out" and "in", the new ones divide these flags between them. The new sentence opener blocks, unlike others, do not point to any actual tokens. This is the same as for the virtual closing quotes as described earlier.

This works great for the division between two sentences, but misses the beginning of the first sentence. It also misses the end of the final sentence which may be missing ending punctuation. It would be tempting to simply add end-caps to a whole sentence, but that would miss an important point that sentences may contain other sub-sentences inside quotations or parentheticals. So when the looky loop inside .find_sentences() finds an "in" block representing the beginning of a child block, it executes .find_sentences() recursively to process that inner block. When that finds the block closer, which is guaranteed easy at this point, it returns and this outer loop continues on where it left off.

It would be great if that were all there was to it, but there are lots of gotchas. Consider the following example:
Smith went on to explain. "We had to do something. We improvised."
The process described earlier would split the first period into two blocks, the latter being an "in" block representing the beginning of some other potential sentence. However, what follows is not a sentence, strictly speaking. It's a quotation containing two other sentences. .find_sentences() sees this and lops off the newly minted sentence opener block and then does its work with the quotation.

One complication is that it's not immediately obvious, when the loop runs across a block opener, that it contains any sentences. Many "quotes" contain single words or strings of words but no punctuation. They typically serve the same function as underlining or italicizing text: to call out a significant idea within a sentence without breaking the grammatical flow of the sentence. Practically speaking, the only way to tell if a block contains a sentence is to look for punctuation in it. When the main loop in .find_sentences() comes across sentence-ending punctuation, it takes note of this. When it is done with the current block (or the outer, whole-paragraph block), it uses this information to decide whether to prepend a sentence opener virtual block to the beginning of the block. It also knows whether the block should end with a sentence closer. If punctuation was not found at the end, a virtual block with its "out" flag set is created and appended to the end of the block. If, however, no punctuation was ever found by the loop considering the current block, the block is assumed to contain no sentences.

At this point, it's fair to say that our linear chains contain a very easy-to-interpret representation of a tree structure. Every child block has an opening and a closing marker, whether it refers to a real token (e.g., an apostrophe or right square bracket) or not.

Pass 6: Constructing the tree

Whereas all the passes leading up to this one have created a linear representation of a tree structure, this pass finally creates that tree data structure explicitly. It uses the same block objects as before, but now it uses them differently. Whereas up to now, our blocks have only contained tokens, if any, now our blocks may alternatively contain child blocks representing the breakdown of one block into sub-blocks as needed.

Here's the same example sentence used above, now represented in its final, blocked form:
Castro became famous enough that he could be identified by only one name. A mention of "Fidel" left little doubt who was being talked about.

      Castro(U) became(V.pret) famous(J/N.plur/N.prop) enough(D/R) that(N/S/P) he(N.pron) could(V.modal) be(V.aux) identified(V.pret/J) by(P) only(R/J) one(D/N.pron) name(N) 
    | '.' )
        T(| A(D) mention(N) of(P) |)
        Q( '"' | Fidel(U) | '"' )
          left(J/V.pret/R) little(J) doubt(V) who(N) was(V.aux.pret) being(V.gerprt/N) talked(V.pret/J) about(P) 
    | '.' )
The basic process is fairly straightforward, since the chain already contains clear indications of where all blocks are to begin and end. A paragraph block is created and then a recursive .construct_blocking() function is called. It contains a loop charged with adding all the child blocks that are direct children of the current parent block, which is initially the (root) paragraph block. When the loop encounters an "out" node, it returns from its recursion. When it counters an "in" node is when things start to happen.

An "in" node often begins with a single token representing the block's opener, such as an open parentheses symbol. This node's block will be jettisoned, but its token will migrate to the new block's "opener" property. As described earlier, each block can optionally point to one "opener" and one "closer" token. In the output above, these appear on the left or right end of a block's representation, inside the "( opener |" or "| closer )" parts.

Once the inner call to .construct_blocking() returns, the next block will inevitably be the one containing the closer. We jettison that block, nabbing the token it points to and making it the current block's closer token.

Any other block the loop comes across will be a text block. We simply add it as a child to the current block. This isn't ideal, as we'd often rather just have the tokens and not simply a text block inside a parent block, but we'll get to that momentarily. One special thing we do, however, is check whether this new text block will be following another text block and merge them together. There are a few circumstances under which this could happen, but it doesn't make sense to have them be separate, ultimately.

Now the real fun begins: restructuring. To illustrate, here's the result of tree construction without and then with restructuring:
"We came here with a round-trip ticket ... because we thought the revolution was going to last days," said Rep. Ileana Ros-Lehtinen, who came to Florida as a child and went on to become the first Cuban-American elected to Congress. "And the days turned into weeks, and the weeks to months, and the months to years."

        Q( '"' | 
                  We(N.pron.plur) came(V.pret) here(R) with(P) a(D) round-trip(N/V) ticket(N) 
            | '…' )
                  because(S) we(N.pron.plur) thought(N) the(D) revolution(N) was(V.aux.pret) going(V.gerprt/N) to(P/R) last(V) days(N.plur/V.3rdsg) ',' 
        | '"' )
          said(V.pret) @Rep(N) @Ileana(D) @Ros-lehtinen(J/V.pstprt) ',' who(N) came(V.pret) to(P/R) @Florida(D/N/U) as(R/S/P) a(D) child(N) and(C) went(J) on(P) to(P/R) become(V/N) the(D) first(D) @Cuban-american(J/D/V/N) elected(V.pret/J) to(P/R) @Congress(N) 
    | '.' )
    Q( '"' | 
              And(C) the(D) days(N.plur/V.3rdsg) turned(V.pret/J) into(P) weeks(N.plur/V.3rdsg) ',' and(C) the(D) weeks(N.plur/V.3rdsg) to(P/R) months(N.plur/V.3rdsg) ',' and(C) the(D) months(N.plur/V.3rdsg) to(P/R) years(N.plur/V.3rdsg) 
        | '.' )
    | '"' )

I've highlighted extraneous bits. And here is the same parse, but with restructuring:

        Q( '"' | 
              We(N.pron.plur) came(V.pret) here(R) with(P) a(D) round-trip(N/V) ticket(N) '…' because(S) we(N.pron.plur) thought(N) the(D) revolution(N) was(V.aux.pret) going(V.gerprt/N) to(P/R) last(V) days(N.plur/V.3rdsg) ',' 
        | '"' )
          said(V.pret) @Rep(N) @Ileana(D) @Ros-lehtinen(J/V.pstprt) ',' who(N) came(V.pret) to(P/R) @Florida(D/N/U) as(R/S/P) a(D) child(N) and(C) went(J) on(P) to(P/R) become(V/N) the(D) first(D) @Cuban-american(J/D/V/N) elected(V.pret/J) to(P/R) @Congress(N) 
    | '.' )
    Q( '"' | 
          And(C) the(D) days(N.plur/V.3rdsg) turned(V.pret/J) into(P) weeks(N.plur/V.3rdsg) ',' and(C) the(D) weeks(N.plur/V.3rdsg) to(P/R) months(N.plur/V.3rdsg) ',' and(C) the(D) months(N.plur/V.3rdsg) to(P/R) years(N.plur/V.3rdsg) 
        | '.' )
    | '"' )

In this case, sentences containing singular text blocks get reduced to just sentences with their tokens.

But also, the ellipsis mark (…) gets reviewed and determined to most likely be inline instead of a sentence end. So the two sub-sentences inside the quotation get merged into one, with the ellipsis being just another token among the words and symbols in it.

This is a very important point. Reviewing and restructuring the blocking for the sentences within paragraph is much, much easier when there is an actual tree structure to deal with. Here is a summary without elaboration of the restructuring rules processed during .construct_blocking():

  • If this block's only child is a single text node, condense it.
  • A sentence that has no punctuation and contains only a quotation or parenthetical is not really a sentence; condense it.
  • A single quotation within a quotation may be an alternative way of representing triple quotes, quadruple, etc.
  • See if a quotation block should be merged with a following sentence fragment.
  • See if a quotation block should be merged with a preceding sentence fragment.
  • Join together two sentences separated only by ellipsis if the conditions are right.
  • Join together two sentences if the first ends in ellipsis and the second contains nothing but ending punctuation.
  • Migrate sentence punctuation out of ending quotes, but only when it follows a sentence fragment not ending in ",".
One rule worth elaborating on is the one that collapses quotes within quotes. Here's a trivial example:
Here's an ```example of triple quotes''' that will get collapsed.
Without restructuring, it looks like this:
        T(| Here's(N.poss/[here is]/[here has]) an(D) |)
        Q( ''' |
            Q( ''' |
                Q( ''' |
                    T(| example(V/N/D) of(P) triple(V/N) quotes(N.plur/V.3rdsg) |)
                | ''' )
            | ''' )
        | ''' )
        T(| that(N/S/P) will(V.modal/N) get(V) collapsed(V.pret/J) |)
    | '.' )
Three quotation blocks nested within one another. With restructuring, it looks like this:
        T(| Here's(N.poss/[here is]/[here has]) an(D) |)
        Q( ''' | example(V/N/D) of(P) triple(V/N) quotes(N.plur/V.3rdsg) | ''' )
        T(| that(N/S/P) will(V.modal/N) get(V) collapsed(V.pret/J) |)
    | '.' )
Note that the opener and closer are not triple quotes. I realized that having it this way would require changing how openers and closers are implemented to be able to point to multiple tokens instead of single ones. But more importantly, I realized that how the quotes are represented should not actually be important to later stages of parsing. All they care about is that some text is in quotes, regardless of them being single, double, triple quotes, etc.

But the reason this is important is that it provides an answer to a nagging problem: how to deal with various representations of quotations. In this example, I have three left-accent characters on the left and three apostrophes on the right, one way I've seen triple quotes represented before. It's tempting to think that it would have been easy to recognize this pattern using, say, a regular expression during the early tokenization process. However, doing so would likely run afoul of words that include apostrophes, such as stores' and 'im (truncation of him). Moreover, it would run afoul of scenarios in which quotes include other quotes, as in this example:
She said, "He said, 'That's "'crazy pants'" fussy of you.'"
The first apostrophe-plus-quote actually is a closing triple quote, but the second one represents the close of a single-quoted block followed by the close of a double-quoted block. There's no way the tokenizer could have distinguished that without looking at the whole sentence for cues. But this block restructuring process handles this special case automatically, now that there is no more ambiguity left.

Extra goodies

As a final step, some analysis is done of the results. If, for example, there is an open quotation apparently leading into the next paragraph's picking up the quotation, this paragraph will be flagged as such. This will make it easy for a later phase — say, a grammar checker — to follow up and see if this is a correct form, possibly by seeing if the next paragraph begins immediately with an open quote.

Similarly, the code checks to see if the paragraph contains nothing but a sentence that does not end in any punctuation. It flags the paragraph as thus containing a single sentence fragment. This makes it easy for the calling application to detect headers, for example. Yes, this also often applies to bullet points, but those are often easy to detect by virtue of being marked as such in HTML or prefixed by bullet-like characters such as "-" and "*" in plain-text files.

I do plan to extend these extra analyses to do things like count whole sentences in the paragraph, but it's not as important yet as nailing down the mechanics of blocking.


In summary, I created a data structure and algorithm for breaking a paragraph of English text down into smaller blocks like sentences, parentheticals, and quotations.

Although I don't consider myself an expert in all the latest NLP research, I do believe I've got a fairly good idea of what the state of the art is. Dr. Scott Paio produced what appears to me to be a fairly good demonstrator in Java back in 2008. But one thing I discovered is that it does not appear to be able to deal well with sentences embedded in quotes or parentheses. Here is one example:
Here is some text. "Here's a sentence. (What about a parenthetical? It could contain multiple sentences.) And another."
Paio's demo outputs the following representation:

1Here is some text.
2"Here's a sentence.
3(What about a parenthetical?
4It could contain multiple sentences.)
5And another."

Not bad, but it doesn't seem to care about the natural hierarchy of the sentences. By contrast, here's my blocker's output:
    S(| Here(R) is(V.aux) some(D/N/R) text(U) | '.' )
    Q( '"' | 
        S(| Here's(N.poss/[here is]/[here has]) a(D) sentence(N) | '.' )
        P( '(' | 
            S(| What(N.pron) about(P) a(D) parenthetical(J) | '?' )
            S(| It(N.pron) could(V.modal) contain(U/V) multiple(U) sentences(N.plur/V.3rdsg) | '.' )
        | ')' )
        S(| And(C) another(D) | '.' )
    | '"' )
As you can see, it has no problem recognizing and representing the hierarchy.

Although I may simply be missing something, I have yet to find another sentence segmentation algorithm that produces this sort of hierarchic structure.

For a fuller example of my algorithm in action, I'm including a sample text file I just processed and the corresponding processing output, taken from a CNN story.

The performance of this algorithm seems to be excellent, despite its complexity. Whether I run my test sets with or without the blocker included, the time almost always is the same. Which test (with or without blocking) takes longer appears to be random, making it effectively zero percent of the total parsing time so far. Most of the time of this pipeline thus far is taken up by the virtual lexicon. Most of the time, I'm finding it processes texts at a rate of about 2,000 words per second. Note that that's words, not tokens, which includes punctuation and other symbols.

One task I am looking forward to applying my algorithm to is dialogue analysis. Being able to clearly find common English dialogue patterns in text  without even considering the words, let alone their syntax or semantic meaning — should make this quite practical.

Although I wouldn't go so far as to say that this is entirely original, I do believe that my blocker represents a significant improvement over other sentence segmentation algorithms. It adds hierarchic representations, for one, making it possible to more accurately find correct sentence boundaries in complex cases. But also, it offers the application multiple interpretations of paragraph blocking, enabling later consideration of less likely but possibly more correct interpretations.