Getting started with Translation Memory

Translation memory is a core component of CAT (computer-assisted translation) tools. Here is a list of open source translation memory tools: 1. OmegaT: the free translation memory tool OmegaT is a free translation memory application written in Java. It is a tool intended … Continue reading
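As a rough illustration of what a translation memory does, here is a minimal sketch of a fuzzy lookup using only Python's standard library. It is not how OmegaT or any other listed tool is implemented; the `TM` dictionary, the `lookup` function and the 0.75 threshold are hypothetical choices made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory translation memory: source segment -> stored translation.
TM = {
    "Open the file.": "Ouvrez le fichier.",
    "Save the file.": "Enregistrez le fichier.",
    "Close the file.": "Fermez le fichier.",
}

def lookup(segment, memory, threshold=0.75):
    """Return (score, source, target) for the best fuzzy match, or None.

    The score is difflib's similarity ratio in [0, 1]; the threshold is an
    assumed cut-off, not a value taken from any particular CAT tool.
    """
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best

if __name__ == "__main__":
    print(lookup("Open the files.", TM))
```

An exact match scores 1.0, a near match returns the closest stored segment for the translator to adapt, and a segment with nothing similar in memory returns None.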

How to Customize Sentence Segmentation or Sentence Boundary Detection

Many NLP tools provide a sentence segmentation function, such as NLTK Sentence Segmentation, TextBlob Sentence Segmentation, Pattern Sentence Segmentation, and spaCy Sentence Segmentation. Sometimes, however, we need to customize the sentence segmentation or sentence boundary detection tool. How to do … Continue reading
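One simple way to customize sentence boundary detection, sketched here with only the standard library, is to split on end-of-sentence punctuation and skip candidates that follow a known abbreviation. This is an illustrative rule-based approach, not the method used by NLTK, TextBlob, Pattern or spaCy; the `ABBREVIATIONS` set and the `split_sentences` function are hypothetical names.

```python
import re

# Hypothetical abbreviation list; extend it for your own domain.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}

# Candidate boundary: whitespace that directly follows ".", "!" or "?".
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Split text into sentences, ignoring periods that end an abbreviation."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        chunk = text[start:match.start()]
        words = chunk.split()
        last_token = words[-1].lower() if words else ""
        if last_token in ABBREVIATIONS:
            continue  # not a real boundary, keep scanning
        sentences.append(chunk.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

if __name__ == "__main__":
    print(split_sentences("Dr. Smith arrived. He was late! Why?"))
```

The abbreviation set is the customization point: adding domain-specific entries (units, titles, product names) changes where boundaries are placed without touching the splitting logic.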

Getting started with Giza++ for Word Alignment

About Giza++ Open Source Text Processing Project: GIZA++ Install Giza++ First get the Giza++ related code: git clone The git package includes both Giza++ and mkcls, which is used in the process. We recommend that you modify the Giza++ Makefile which … Continue reading

A Beginner’s Guide to spaCy

About spaCy Open Source Text Processing Project: spaCy Install spaCy and related data model Install spaCy by pip: sudo pip install -U spacy Collecting spacy Downloading spacy-1.8.2.tar.gz (3.3MB) Downloading numpy-1.13.0-cp27-cp27mu-manylinux1_x86_64.whl (16.6MB) Collecting murmurhash=0.26 (from spacy) Downloading murmurhash-0.26.4-cp27-cp27mu-manylinux1_x86_64.whl Collecting cymem=1.30 (from … Continue reading

Getting started with Python Word Segmentation

About Python Word Segmentation Python Word Segmentation WordSegment is an Apache2 licensed module for English word segmentation, written in pure-Python, and based on a trillion-word corpus. Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from … Continue reading
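To give a feel for the idea behind unigram-based word segmentation, here is a minimal sketch in the spirit of the approach from Norvig's chapter. It does not use the WordSegment module or its data: the `COUNTS` table is a tiny hypothetical stand-in for a real corpus, and `score`, `_best` and `segment` are names invented for this example.

```python
from functools import lru_cache
from math import log10

# Hypothetical toy unigram counts standing in for a real corpus.
COUNTS = {"this": 50, "is": 80, "a": 100, "test": 20, "the": 120, "best": 15}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 10  # assumed upper bound on candidate word length

def score(word):
    """Log10 probability of a word; unknown words are penalised by length."""
    if word in COUNTS:
        return log10(COUNTS[word] / TOTAL)
    return log10(10.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def _best(text):
    """Return (log score, words) of the most probable split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        head, rest = text[:i], text[i:]
        rest_score, rest_words = _best(rest)
        candidates.append((score(head) + rest_score, (head,) + rest_words))
    return max(candidates)

def segment(text):
    """Segment a string without spaces into its most probable words."""
    return list(_best(text.lower())[1])

if __name__ == "__main__":
    print(segment("thisisatest"))
```

Memoizing `_best` keeps the search from re-scoring the same suffix repeatedly; with realistic counts in place of the toy table, the same structure handles longer inputs.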

Getting started with topia.termextract

About topia.termextract Open Source Text Processing Project: topia.termextract Install topia.termextract Although topia.termextract has a pip site, it cannot be installed with the “pip install” method; you should download the source code first: Then “tar -zxvf topia.termextract-1.1.0.tar.gz” and “cd topia.termextract-1.1.0” and … Continue reading

Getting started with WordNet

About WordNet WordNet is a lexical database for English: WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means … Continue reading

A Beginner’s Guide to TextBlob

About TextBlob Open Source Text Processing Project: TextBlob Install TextBlob Install the latest TextBlob on Ubuntu 16.04.1 LTS: textprocessing@ubuntu:~$ sudo pip install -U textblob Collecting textblob Downloading textblob-0.12.0-py2.py3-none-any.whl (631kB) Requirement already up-to-date: nltk>=3.1 in /usr/local/lib/python2.7/dist-packages (from textblob) Requirement already up-to-date: … Continue reading

Getting started with Word2Vec

1. Source by Google Project with Code: Word2Vec Blog: Learning the meaning behind words Paper: [1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013. … Continue reading