Getting started with Giza++ for Word Alignment

About Giza++ Open Source Text Processing Project: GIZA++ Install Giza++ First get the Giza++ related code: git clone https://github.com/moses-smt/giza-pp.git The git package include and Giza++ and mkcls which used in the process. We recommended you modify the Giza++ Makefile which … Continue reading

A Beginner’s Guide to spaCy

About spaCy Open Source Text Processing Project: spaCy Install spaCy and related data model Install spaCy by pip: sudo pip install -U spacy Collecting spacy Downloading spacy-1.8.2.tar.gz (3.3MB) Downloading numpy-1.13.0-cp27-cp27mu-manylinux1_x86_64.whl (16.6MB) Collecting murmurhash=0.26 (from spacy) Downloading murmurhash-0.26.4-cp27-cp27mu-manylinux1_x86_64.whl Collecting cymem=1.30 (from … Continue reading

Getting started with Python Word Segmentation

About Python Word Segmentation Python Word Segmentation WordSegment is an Apache2 licensed module for English word segmentation, written in pure-Python, and based on a trillion-word corpus. Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from … Continue reading

Getting started with topia.termextract

About topia.termextract Open Source Text Processing Project: topia.termextract Install topia.termextract Also topia.termextract has a pip site, but cannot install it by “pip install” method, you should download the source code first: https://pypi.python.org/packages/d1/b9/452257976ebee91d07c74bc4b34cfce416f45b94af1d62902ae39bf902cf/topia.termextract-1.1.0.tar.gz Then “tar -zxvf topia.termextract-1.1.0.tar.gz” and “cd topia.termextract-1.1.0” and … Continue reading

A Beginner’s Guide to TextBlob

About TextBlob Open Source Text Processing Project: TextBlob Install TextBlob Install the latest TextBlob on Ubuntu 16.04.1 LTS: textprocessing@ubuntu:~$ sudo pip install -U textblob Collecting textblob Downloading textblob-0.12.0-py2.py3-none-any.whl (631kB) Requirement already up-to-date: nltk>=3.1 in /usr/local/lib/python2.7/dist-packages (from textblob) Requirement already up-to-date: … Continue reading

Getting started with NLTK

About NLTK Open Source Text Processing Project: NLTK Install NLTK 1. Install the latest NLTK pakage on Ubuntu 16.04.1 LTS: textprocessing@ubuntu:~$ sudo pip install -U nltk Collecting nltk Downloading nltk-3.2.2.tar.gz (1.2MB) 35% |███████████▍ | 409kB 20.8MB/s eta 0:00:0 …… 100% … Continue reading

Open Source Text Processing Project: Wapiti

Wapiti – A simple and fast discriminative sequence labelling toolkit Project Website: https://wapiti.limsi.fr/ Github Link: https://github.com/Jekub/Wapiti Description Wapiti is a very fast toolkit for segmenting and labeling sequences with discriminative models. It is based on maxent models, maximum entropy Markov … Continue reading

Open Source Text Processing Project: segtok

segtok: sentence segmentation and word tokenization tools Project Website: http://fnl.es/segtok-a-segmentation-and-tokenization-library.html Github Link: https://github.com/fnl/segtok Description A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features. The segtok package provides two modules, segtok.segmenter and segtok.tokenizer. The segmenter provides functionality for … Continue reading

Open Source Text Processing Project: nlp-with-ruby

nlp-with-ruby: Awesome NLP with Ruby Project Website: None Github Link: https://github.com/arbox/nlp-with-ruby Description This curated list comprises awesome resources, libraries, information sources about computational processing of texts in human languages with Ruby. That field is often referred to as NLP, Computational … Continue reading

Open Source Text Processing Project: textacy

textacy: higher-level NLP built on spaCy Project Website: https://textacy.readthedocs.io Github Link: https://github.com/chartbeat-labs/textacy Description textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance spaCy library. With the basics — tokenization, part-of-speech tagging, dependency … Continue reading