Adaptive tokenizer
I knew that to continue advancing the refinement of my virtual lexicon, I'd need to throw it at real text "in the wild". To do that meant revisiting my tokenizer so I could feed words into it. The first tokenizer I made for my Basic English Parser (BEP) project is very primitive, so I decided to trash it and start over. Tokens Programmers often find themselves needing to parse structured data, such as comma-separated-values or XML files. In parsing richly structured data, like code written in a programming language, we refer to the smallest bits of data — usually text — as "tokens". In this case, my tokenizer is tasked with taking as input a single paragraph of English prose in a plain-text format and producing as output a list of tokens. In my tokenizer, tokens presently come in a small variety of types: word, symbol, number, date/time, sentence divider. Since recognizing a token requires characterizing it, tokenization is the natural place to ...