Open Source Text Processing Project: segtok

segtok: sentence segmentation and word tokenization tools

Project Website:
Github Link:

Description

A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features. The segtok package provides two modules, segtok.segmenter and segtok.tokenizer. The segmenter provides functionality for …
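
As a minimal usage sketch (assuming segtok is installed, e.g. via pip install segtok), the two modules can be combined to split text into sentences and each sentence into tokens; split_single and word_tokenizer are the rule-based entry points used here:

    # Minimal sketch combining segtok's two modules (assumes: pip install segtok).
    from segtok.segmenter import split_single
    from segtok.tokenizer import word_tokenizer

    text = "Mr. Smith went to Washington. He arrived at 3 p.m., didn't he?"

    # split_single() yields sentence strings (it assumes sentences do not span
    # line breaks); word_tokenizer() splits a sentence into a list of token
    # strings using orthographic rules.
    for sentence in split_single(text):
        print(word_tokenizer(sentence))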

Open Source Text Processing Project: Stanford Tokenizer

Stanford Tokenizer

Project Website:
Github Link: None

Description

A tokenizer divides text into a sequence of tokens, which roughly correspond to “words”. We provide a class suitable for tokenization of English, called PTBTokenizer. It was initially designed to largely …
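
PTBTokenizer itself is a Java class, so a Python sketch here can only invoke its command-line entry point; the jar name and input file below are placeholders, and it is assumed (per the tool's documentation) that the class's main() writes one token per line to standard output:

    # Hedged sketch: invoking PTBTokenizer's command-line interface from Python.
    # Assumes Java is installed and a Stanford CoreNLP jar is available locally;
    # "stanford-corenlp.jar" and "sample.txt" are placeholder names.
    import subprocess

    result = subprocess.run(
        ["java", "-cp", "stanford-corenlp.jar",
         "edu.stanford.nlp.process.PTBTokenizer", "sample.txt"],
        capture_output=True, text=True, check=True,
    )

    # Collect the tokens, one per non-empty output line.
    tokens = [line for line in result.stdout.splitlines() if line]
    print(tokens)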