Open Source Text Processing Project: segtok

segtok: sentence segmentation and word tokenization tools

Project Website:
Github Link:


A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.

The segtok package provides two modules, segtok.segmenter and segtok.tokenizer. The segmenter provides functionality for splitting (Indo-European) text into sentences. The tokenizer provides functionality for splitting (Indo-European) sentences into words and symbols (collectively called tokens). Both modules can also be used from the command line. Although the tools may work for other Indo-European languages, they were designed with languages such as Spanish, English, and German in mind.
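The division of labor between the two modules can be illustrated with a toy, stdlib-only sketch. This is only a simplified illustration of the rule-based, orthography-driven approach; the boundary and token patterns below are invented for this example and are far cruder than segtok's actual rules:

```python
import re

# Toy sentence-boundary rule (NOT segtok's actual rules): split after
# sentence-final punctuation that is followed by whitespace and an
# uppercase letter -- a purely orthographic cue.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Toy token rule: a run of word characters, or any single
# non-word, non-space symbol (so punctuation becomes its own token).
TOKEN = re.compile(r"\w+|[^\w\s]")

def split_sentences(text):
    """Split plain text into sentences on orthographic cues."""
    return SENTENCE_BOUNDARY.split(text.strip())

def tokenize(sentence):
    """Split one sentence into word and symbol tokens."""
    return TOKEN.findall(sentence)

text = "This is one sentence. Here is another! Does it work?"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The weakness of such naive rules is easy to show: split_sentences("Dr. Smith left.") wrongly yields ["Dr.", "Smith left."], because "Dr." matches the punctuation-then-uppercase pattern. Handling abbreviations and similar cases is precisely what a fuller rule set like segtok's is for.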

To install this package, you should have a recent official version of Python 2 or 3 installed. The package has been reported to work with Python 2.7, 3.3, and 3.4 and is tested against the latest Python 2 and 3 branches. The easiest way to install it is with pip or any other package manager that works with PyPI:

pip install segtok
Important: If you are on a Linux machine and have problems installing the regex dependency of segtok, make sure you have the python-dev and/or python3-dev packages installed to get the necessary headers to compile the package.

Then try the command line tools on some plain-text files (e.g., this README) to see if segtok meets your needs:

segmenter README.rst | tokenizer
