Open Source Text Processing Project: segtok

segtok: sentence segmentation and word tokenization tools Project Website: http://fnl.es/segtok-a-segmentation-and-tokenization-library.html Github Link: https://github.com/fnl/segtok Description A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features. The segtok package provides two modules, segtok.segmenter and segtok.tokenizer. The segmenter provides functionality for … Continue reading

Open Source Text Processing Project: Stanford Tokenizer

Stanford Tokenizer Project Website: http://nlp.stanford.edu/software/tokenizer.shtml Github Link: None Description A tokenizer divides text into a sequence of tokens, which roughly correspond to “words”. We provide a class suitable for tokenization of English, called PTBTokenizer. It was initially designed to largely … Continue reading