Showing posts from December, 2016

Virtual lexicon vs Brown corpus

Having completed my blocker, I decided to take a break before tackling syntax analysis to study more facets of English. I also realized I should beef up the lexicon underlying my virtual lexicon (VL). I had collected only about 1,500 words, most of them hand-entered by way of theft from the CGEL's chapters on morphology, and mostly compound words at that. That was enough to test and demonstrate the VL's capacity to deal half-decently with morphological parsing, but nowhere near enough to represent the tens of thousands of words, at least, that a typical high school graduate with English as their native language will know. A virtual lexicon's core premise is that recognizing novel word forms by recognizing their parts is more valuable than having a large list of exact word forms. In essence, a relatively small number of lexical entries should be able to represent a much larger set of practical words found "in the wild".
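To make the premise concrete, here is a minimal Python sketch of recognizing a word from its parts. It is not my VL's actual code or data: the tiny `STEMS`, `PREFIXES`, and `SUFFIXES` sets and the `analyze` function are hypothetical stand-ins, and the sketch ignores spelling changes at morpheme boundaries (so it would not handle something like "happiness").

```python
# Hypothetical toy entries for illustration only; not the real lexicon.
STEMS = {"do", "read", "kind"}
PREFIXES = {"un", "re"}
SUFFIXES = {"able", "ness", "ing", "s"}


def _strip_suffixes(rest):
    """Return (stem, [suffixes]) if rest is a stem plus known suffixes, else None."""
    if rest in STEMS:
        return rest, []
    for suffix in sorted(SUFFIXES):
        if rest.endswith(suffix) and len(rest) > len(suffix):
            result = _strip_suffixes(rest[:-len(suffix)])
            if result is not None:
                stem, suffixes = result
                return stem, suffixes + [suffix]
    return None


def analyze(word):
    """Return (prefixes, stem, suffixes) if the word decomposes, else None."""
    rest = word.lower()
    # Prefer the reading with no prefix; only peel a prefix if that fails.
    result = _strip_suffixes(rest)
    if result is not None:
        return [], result[0], result[1]
    for prefix in sorted(PREFIXES):
        if rest.startswith(prefix) and len(rest) > len(prefix):
            inner = analyze(rest[len(prefix):])
            if inner is not None:
                return [prefix] + inner[0], inner[1], inner[2]
    return None


print(analyze("unreadable"))  # (['un'], 'read', ['able'])
print(analyze("redoing"))     # (['re'], 'do', ['ing'])
print(analyze("xyzzy"))       # None
```

With three stems, two prefixes, and four suffixes, the sketch already accepts forms such as "unreadable", "redoing", and "kindness" that appear nowhere in the lists, which is the trade-off the VL is betting on.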

Text blocking / Sentence segmentation

I've finished a first working version of my "blocker" module. I'm coining this term to reflect its purpose: to break up a paragraph being parsed into its constituent sentences and sub-sentence "blocks" of text. This is often referred to as "sentence segmentation", but I find that term belies the fuller scope of a blocker. Wikipedia presents a good summary of the basics of sentence segmentation: The standard 'vanilla' approach to locate the end of a sentence: (a) If it's a period, it ends a sentence. (b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence. (c) If the next token is capitalized, then it ends a sentence. This strategy gets about 95% of sentences correct. Things such as shortened names, e.g. "D. H. Lawrence" (with whitespaces between the individual words that form the full name), idiosyncratic orthographical spellings used for stylistic purposes
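
The three-rule heuristic quoted above can be sketched in a few lines of Python. This is only one reading of the rules, not my blocker's code: a token ending in a period is a candidate boundary, a known abbreviation is never a boundary, and any other candidate counts when the next token is capitalized or the text ends. The `ABBREVIATIONS` set and the `split_sentences` name are hypothetical, and tokens are naively split on whitespace.

```python
# Hypothetical, deliberately tiny abbreviation list (lowercased).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split text into sentences using the 'vanilla' period heuristic."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        # Rule (a): only a period can end a sentence.
        if not token.endswith("."):
            continue
        # Rule (b): a listed abbreviation does not end a sentence.
        if token.lower() in ABBREVIATIONS:
            continue
        # Rule (c): accept the boundary if the next token is capitalized
        # (or if there is no next token at all).
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if next_token is None or next_token[0].isupper():
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith arrived. He sat down."))
# ['Dr. Smith arrived.', 'He sat down.']
print(split_sentences("D. H. Lawrence wrote novels."))
# ['D.', 'H.', 'Lawrence wrote novels.']
```

The second call shows the kind of miss the summary warns about: initials are not in the abbreviation list and are followed by a capitalized token, so the name is wrongly cut into separate "sentences".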