Showing posts from November, 2016

Adaptive tokenizer

I knew that to continue refining my virtual lexicon, I'd need to throw it at real text "in the wild". Doing that meant revisiting my tokenizer so I could feed words into it. The first tokenizer I made for my Basic English Parser (BEP) project was very primitive, so I decided to trash it and start over.

Tokens

Programmers often find themselves needing to parse structured data, such as comma-separated values or XML files. In parsing richly structured data, like code written in a programming language, we refer to the smallest bits of data (usually text) as "tokens". In this case, my tokenizer is tasked with taking as input a single paragraph of English prose in plain-text format and producing as output a list of tokens. In my tokenizer, tokens presently come in a small variety of types: word, symbol, number, date/time, and sentence divider. Since recognizing a token requires characterizing it, tokenization is the natural place to assign…
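The input/output contract described above can be sketched in a few lines. This is a minimal illustration, not the post's actual implementation: it covers only three of the token types named (word, number, symbol) plus the sentence divider, and the pattern names and ordering are my own assumptions.

```python
import re

# Illustrative token types; order matters, since the first matching
# pattern wins (numbers are tried before symbols so "2.5" stays whole).
TOKEN_PATTERNS = [
    ("number",  re.compile(r"\d+(?:\.\d+)?")),
    ("word",    re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")),
    ("divider", re.compile(r"[.!?]")),
    ("symbol",  re.compile(r"[^\sA-Za-z0-9]")),
]

def tokenize(paragraph):
    """Take one paragraph of plain text; return (type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(paragraph):
        if paragraph[pos].isspace():
            pos += 1
            continue
        for ttype, pattern in TOKEN_PATTERNS:
            m = pattern.match(paragraph, pos)
            if m:
                tokens.append((ttype, m.group()))
                pos = m.end()
                break
        else:
            pos += 1  # skip any character no pattern recognizes
    return tokens
```

For example, `tokenize("I ate 2 apples.")` classifies "2" as a number and the final "." as a sentence divider. A real adaptive tokenizer would also need the date/time type and far subtler handling of punctuation.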

Virtual lexicon enhancements

Virtual lexicon

In my previous post, I used the term "morphological parser" to describe what I've been building. Now that I'm done with a first significant version of it, I prefer the term "virtual lexicon" (VL). It captures the spirit of this as a module that represents a much larger set of words than the actual lexicon it contains. The same idea underlies virtual memory, which lets programs consume what they perceive as far more RAM than is physically available in the computer, and virtual machine environments, which let a fixed set of computer hardware emulate machines with more CPUs, memory, and other resources than are physically present.

Periods

One of my stated goals in the previous post was to deal with periods and apostrophes. What I was indicating is that a basic tokenizer faces a nasty interpretation problem when it comes to these symbols. In standard written English…
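The interpretation problem with periods can be made concrete with a toy classifier: a "." may mark an abbreviation, end a sentence, or do both at once. The abbreviation list and the capitalization heuristic below are my own assumptions for illustration, not the virtual lexicon's actual logic.

```python
# A toy illustration of period ambiguity; not the post's real algorithm.
ABBREVIATIONS = {"mr", "mrs", "dr", "etc", "vs"}

def classify_period(prev_token, next_token):
    """Guess the role of a '.' from its neighboring tokens."""
    if prev_token.lower().rstrip(".") in ABBREVIATIONS:
        # After an abbreviation, a capitalized next token could mean
        # either a proper noun ("Dr. Smith") or a new sentence, so the
        # period is genuinely ambiguous.
        if next_token and next_token[0].isupper():
            return "abbreviation + sentence end (ambiguous)"
        return "abbreviation"
    return "sentence end"
```

So `classify_period("home", "The")` yields a clean sentence boundary, while `classify_period("Dr", "Smith")` is irreducibly ambiguous from local context alone, which is exactly why a tokenizer can't resolve periods without help from something like a lexicon.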

Morphological parser / Virtual lexicon

I've been able to focus the past three months exclusively on my AI research, a luxury I've never had before. Given that I can't afford to do this indefinitely, I've chosen to focus primarily on NLP, with an overt goal of creating relevant marketable technologies. I'm presently operating with an intermediate goal of creating a "basic" English parser (BEP). In my conception, a BEP will transform a block of English discourse into an abstract representation of its constituent words, sentences, and paragraphs that is more amenable to consumption by software. Though many research and industry products do this at some level, I'm convinced I can improve upon some of their aspects to a marketable degree. When you set about trying to understand natural language parsing, you quickly discover that there are many interconnected aspects that seem to make starting to program a system intractable. NLP researchers have made great strides in the past few…
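The abstract representation a BEP would emit can be sketched as a nested structure: discourse broken into paragraphs, sentences, and words. The class names, fields, and naive period-based splitter below are illustrative assumptions, not the author's actual design.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical containers for a BEP's output; real parsing would attach
# far richer information (part of speech, morphology, etc.) to each node.
@dataclass
class Word:
    text: str

@dataclass
class Sentence:
    words: List[Word] = field(default_factory=list)

@dataclass
class Paragraph:
    sentences: List[Sentence] = field(default_factory=list)

def parse(text: str) -> Paragraph:
    """Naive splitter standing in for a real parser."""
    para = Paragraph()
    for raw in text.replace("!", ".").replace("?", ".").split("."):
        tokens = raw.split()
        if tokens:
            para.sentences.append(Sentence([Word(t) for t in tokens]))
    return para
```

Software consuming this structure can then walk sentences and words directly instead of re-scanning raw prose, which is the "more amenable to consumption" property described above.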