Getting started with Python Word Segmentation

About Python Word Segmentation: WordSegment is an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus. It is based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from …
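To make the idea of corpus-based segmentation concrete, here is a minimal sketch of a unigram segmenter in the spirit of Norvig's chapter. It is not WordSegment's actual implementation: the `COUNTS` table is a hypothetical toy stand-in for real corpus counts, and the names `score`, `segment`, and `MAX_WORD_LEN` are chosen only for illustration.

```python
from functools import lru_cache
from math import log10

# Hypothetical toy unigram counts; a real model would load counts
# from a large corpus instead of this hand-written table.
COUNTS = {"this": 50, "is": 80, "a": 100, "test": 20, "the": 120}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 10  # assumed upper bound on candidate word length


def score(word):
    """Return the log10 probability of a word under the toy model.

    Unknown words get a small probability that shrinks with length,
    so long unknown strings are strongly penalized.
    """
    if word in COUNTS:
        return log10(COUNTS[word] / TOTAL)
    return log10(10.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def _best(text):
    """Return (log probability, words) for the best split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = _best(rest)
        candidates.append((score(first) + rest_score, (first,) + rest_words))
    return max(candidates)


def segment(text):
    """Split a string without spaces into its most probable words."""
    return list(_best(text.lower())[1])


print(segment("thisisatest"))
```

The memoized recursion tries every possible first word and keeps the split with the highest total log probability, which is why the quality of the counts matters far more than the search itself.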

Open Source Text Processing Project: Jieba

Jieba: Chinese text segmentation
Project Website: None
GitHub Link: https://github.com/fxsjy/jieba
Description: “Jieba” (Chinese for “to stutter”) is a Chinese text segmentation project, built to be the best Python Chinese word segmentation module.
Features: Supports three types of segmentation mode. Accurate Mode attempts to …

Open Source Text Processing Project: Stanford Word Segmenter

Stanford Word Segmenter
Project Website: http://nlp.stanford.edu/software/segmenter.shtml
GitHub Link: None
Description: Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require …
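As a rough illustration of what English tokenization involves (punctuation splitting and separating the possessive `'s`), here is a small regular-expression sketch. It is an assumption-laden toy, not the Stanford tool's algorithm, and the function name `tokenize` is hypothetical.

```python
import re

# Match, in order of preference:
#   1. a possessive "'s" that ends at a word boundary,
#   2. a run of word characters,
#   3. any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"'s\b|\w+|[^\w\s]")


def tokenize(text):
    """Split English text into words, possessive markers, and punctuation."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("John's book, please."))
```

A rule this simple ignores contractions, abbreviations, and hyphenation, so it should be read only as a picture of the basic steps.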