Getting started with Giza++ for Word Alignment

About Giza++

Open Source Text Processing Project: GIZA++

Install Giza++

First get the Giza++ related code:

git clone https://github.com/moses-smt/giza-pp.git

The repository includes both GIZA++ and mkcls, which are used in the alignment process.

We recommend that you modify the GIZA++ Makefile so that it outputs the actual word pairs, not just word ids:

cd giza-pp/GIZA++-v2/
vim Makefile

Modify line 9 to:


#CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE
CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE

Then “cd ..” and “make” to build GIZA++ and the mkcls related tools:

make -C GIZA++-v2
make[1]: Entering directory '/home/textprocessing/giza/giza-pp/GIZA++-v2'
mkdir optimized/
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c Parameter.cpp -o optimized/Parameter.o
Parameter.cpp: In function ‘bool writeParameters(std::ofstream&, const ParSet&, int)’:
Parameter.cpp:48:25: warning: ignoring return value of ‘char* getcwd(char*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
        getcwd(path,1024);
                         ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c myassert.cpp -o optimized/myassert.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c Perplexity.cpp -o optimized/Perplexity.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model1.cpp -o optimized/model1.o
model1.cpp: In member function ‘int model1::em_with_tricks(int, bool, Dictionary&, bool)’:
model1.cpp:72:7: warning: variable ‘pair_no’ set but not used [-Wunused-but-set-variable]
   int pair_no;
       ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model2.cpp -o optimized/model2.o
model2.cpp: In member function ‘int model2::em_with_tricks(int)’:
model2.cpp:64:7: warning: variable ‘pair_no’ set but not used [-Wunused-but-set-variable]
   int pair_no = 0;
       ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model3.cpp -o optimized/model3.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c getSentence.cpp -o optimized/getSentence.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c TTables.cpp -o optimized/TTables.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c ATables.cpp -o optimized/ATables.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c AlignTables.cpp -o optimized/AlignTables.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c main.cpp -o optimized/main.o
main.cpp: In function ‘int main(int, char**)’:
main.cpp:707:10: warning: variable ‘errors’ set but not used [-Wunused-but-set-variable]
   double errors=0.0;
          ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c NTables.cpp -o optimized/NTables.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model2to3.cpp -o optimized/model2to3.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c collCounts.cpp -o optimized/collCounts.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c alignment.cpp -o optimized/alignment.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c vocab.cpp -o optimized/vocab.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c MoveSwapMatrix.cpp -o optimized/MoveSwapMatrix.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c transpair_model3.cpp -o optimized/transpair_model3.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c transpair_model5.cpp -o optimized/transpair_model5.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c transpair_model4.cpp -o optimized/transpair_model4.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c utility.cpp -o optimized/utility.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c parse.cpp -o optimized/parse.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c reports.cpp -o optimized/reports.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model3_viterbi.cpp -o optimized/model3_viterbi.o
model3_viterbi.cpp: In member function ‘void model3::findAlignmentsNeighborhood(std::vector&, std::vector&, LogProb&, alignmodel&, int, int)’:
model3_viterbi.cpp:431:12: warning: variable ‘it_st’ set but not used [-Wunused-but-set-variable]
     time_t it_st;
            ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model3_viterbi_with_tricks.cpp -o optimized/model3_viterbi_with_tricks.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c Dictionary.cpp -o optimized/Dictionary.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model345-peg.cpp -o optimized/model345-peg.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c hmm.cpp -o optimized/hmm.o
hmm.cpp: In member function ‘int hmm::em_with_tricks(int)’:
hmm.cpp:79:7: warning: variable ‘pair_no’ set but not used [-Wunused-but-set-variable]
   int pair_no = 0;
       ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c HMMTables.cpp -o optimized/HMMTables.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c ForwardBackward.cpp -o optimized/ForwardBackward.o
g++  -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE optimized/Parameter.o optimized/myassert.o optimized/Perplexity.o optimized/model1.o optimized/model2.o optimized/model3.o optimized/getSentence.o optimized/TTables.o optimized/ATables.o optimized/AlignTables.o optimized/main.o optimized/NTables.o optimized/model2to3.o optimized/collCounts.o optimized/alignment.o optimized/vocab.o optimized/MoveSwapMatrix.o optimized/transpair_model3.o optimized/transpair_model5.o optimized/transpair_model4.o optimized/utility.o optimized/parse.o optimized/reports.o optimized/model3_viterbi.o optimized/model3_viterbi_with_tricks.o optimized/Dictionary.o optimized/model345-peg.o optimized/hmm.o optimized/HMMTables.o optimized/ForwardBackward.o  -o GIZA++
g++  -O3 -W -Wall snt2plain.cpp -o snt2plain.out
g++  -O3 -W -Wall plain2snt.cpp -o plain2snt.out
g++  -O3 -g -W -Wall snt2cooc.cpp -o snt2cooc.out
make[1]: Leaving directory '/home/textprocessing/giza/giza-pp/GIZA++-v2'
make -C mkcls-v2
make[1]: Entering directory '/home/textprocessing/giza/giza-pp/mkcls-v2'
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c GDAOptimization.cpp -o GDAOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c HCOptimization.cpp -o HCOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c Problem.cpp -o Problem.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c IterOptimization.cpp -o IterOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c ProblemTest.cpp -o ProblemTest.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c RRTOptimization.cpp -o RRTOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c MYOptimization.cpp -o MYOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c SAOptimization.cpp -o SAOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c TAOptimization.cpp -o TAOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c Optimization.cpp -o Optimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c KategProblemTest.cpp -o KategProblemTest.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c KategProblemKBC.cpp -o KategProblemKBC.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c KategProblemWBC.cpp -o KategProblemWBC.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c KategProblem.cpp -o KategProblem.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c StatVar.cpp -o StatVar.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c general.cpp -o general.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c mkcls.cpp -o mkcls.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -o mkcls GDAOptimization.o HCOptimization.o Problem.o IterOptimization.o ProblemTest.o RRTOptimization.o MYOptimization.o SAOptimization.o TAOptimization.o Optimization.o KategProblemTest.o KategProblemKBC.o KategProblemWBC.o KategProblem.o StatVar.o general.o mkcls.o 
make[1]: Leaving directory '/home/textprocessing/giza/giza-pp/mkcls-v2'

Prepare the bilingual corpus

We follow the Moses decoder baseline pipeline to prepare the bilingual sample corpus and the preprocessing scripts. First, get the corpus from WMT13:


mkdir corpus
cd corpus/
wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
tar -zxvf training-parallel-nc-v8.tgz

training/news-commentary-v8.cs-en.cs
training/news-commentary-v8.cs-en.en
training/news-commentary-v8.de-en.de
training/news-commentary-v8.de-en.en
training/news-commentary-v8.es-en.en
training/news-commentary-v8.es-en.es
training/news-commentary-v8.fr-en.en
training/news-commentary-v8.fr-en.fr
training/news-commentary-v8.ru-en.en
training/news-commentary-v8.ru-en.ru

We follow the Moses scripts to clean the data:

To prepare the data for training the translation system, we have to perform the following steps:
tokenisation: This means that spaces have to be inserted between (e.g.) words and punctuation.
truecasing: The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity.
cleaning: Long sentences and empty sentences are removed as they can cause problems with the training pipeline, and obviously mis-aligned sentences are removed.

So get mosesdecoder first:

cd ..
git clone https://github.com/moses-smt/mosesdecoder.git

Now it’s time to preprocess the bilingual pairs; we select the fr-en data as the example.

The original English data looks like this:

SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold.
Lately, with gold prices up more than 300% over the last decade, it is harder than ever.
Just last December, fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment, sensibly pointing out gold’s risks.
Wouldn’t you know it?

Tokenization:

./mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ./corpus/training/news-commentary-v8.fr-en.en > ./corpus/news-commentary-v8.fr-en.tok.en

./mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr < ./corpus/training/news-commentary-v8.fr-en.fr > ./corpus/news-commentary-v8.fr-en.tok.fr

After tokenization:

SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold .
Lately , with gold prices up more than 300 % over the last decade , it is harder than ever .
Just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .
Wouldn ’ t you know it ?

Truecase:

The truecaser first requires training, in order to extract some statistics about the text:

./mosesdecoder/scripts/recaser/train-truecaser.perl --model ./corpus/truecase-model.en --corpus ./corpus/news-commentary-v8.fr-en.tok.en

./mosesdecoder/scripts/recaser/train-truecaser.perl --model ./corpus/truecase-model.fr --corpus ./corpus/news-commentary-v8.fr-en.tok.fr

Then truecase the sample data:

./mosesdecoder/scripts/recaser/truecase.perl --model ./corpus/truecase-model.en < ./corpus/news-commentary-v8.fr-en.tok.en > ./corpus/news-commentary-v8.fr-en.true.en

./mosesdecoder/scripts/recaser/truecase.perl --model ./corpus/truecase-model.fr < ./corpus/news-commentary-v8.fr-en.tok.fr > ./corpus/news-commentary-v8.fr-en.true.fr

After truecasing:

San FRANCISCO – It has never been easy to have a rational conversation about the value of gold .
lately , with gold prices up more than 300 % over the last decade , it is harder than ever .
just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .
wouldn ’ t you know it ?

Clean the corpus, removing sentences longer than 80 words:

./mosesdecoder/scripts/training/clean-corpus-n.perl ./corpus/news-commentary-v8.fr-en.true fr en ./corpus/news-commentary-v8.fr-en.clean 1 80

clean-corpus.perl: processing ./corpus/news-commentary-v8.fr-en.true.fr & .en to ./corpus/news-commentary-v8.fr-en.clean, cutoff 1-80, ratio 9
..........(100000)....
Input sentences: 157168  Output sentences:  155362

Using Giza++ for Word Alignment

First, copy the binary executable files:

textprocessing@ubuntu:~/giza$ cp giza-pp/GIZA++-v2/plain2snt.out .
textprocessing@ubuntu:~/giza$ cp giza-pp/GIZA++-v2/snt2cooc.out .
textprocessing@ubuntu:~/giza$ cp giza-pp/GIZA++-v2/GIZA++ .
textprocessing@ubuntu:~/giza$ cp giza-pp/mkcls-v2/mkcls .

Then run:

./plain2snt.out corpus/news-commentary-v8.fr-en.clean.fr corpus/news-commentary-v8.fr-en.clean.en

which will generate vcb (vocabulary) files and snt (sentence) files, containing the list of vocabulary and aligned sentences, respectively.
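
To get a feel for what plain2snt.out produced, here is a minimal Python sketch that prints the first few vocabulary entries. It assumes the usual GIZA++ .vcb layout of one “id token frequency” entry per line; adjust the path or the parsing if your files differ.

# Peek at a GIZA++ vocabulary file (assumed layout: "id token frequency" per line).
vcb_path = "corpus/news-commentary-v8.fr-en.clean.fr.vcb"

with open(vcb_path) as f:
    for i, line in enumerate(f):
        if i >= 10:  # only show the first 10 entries
            break
        word_id, token, freq = line.split()
        print(word_id, token, freq)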

Then run mkcls, a program that automatically infers word classes from a corpus using a maximum likelihood criterion:

mkcls [-nnum] [-ptrain] [-Vfile] opt
-V output classes (Default: no file)
-n number of optimization runs (Default: 1); larger number => better results
-p filename of training corpus (Default: ‘train’)
Example:
mkcls -c80 -n10 -pin -Vout opt
(generates 80 classes for the corpus ‘in’ and writes the classes in ‘out’)
Literature:
Franz Josef Och: “Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren
der kombinatorischen Optimierung”. Studienarbeit, Universität Erlangen-Nürnberg,
Germany, 1995.

Execute:


./mkcls -pcorpus/news-commentary-v8.fr-en.clean.fr -Vcorpus/news-commentary-v8.fr-en.fr.vcb.classes
./mkcls -pcorpus/news-commentary-v8.fr-en.clean.en -Vcorpus/news-commentary-v8.fr-en.en.vcb.classes
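
If you want a quick sanity check of the class files, the following Python sketch counts how many words ended up in each class. It assumes each line of the .vcb.classes output is a token followed by its class id; adjust the parsing if your mkcls build writes a different layout.

# Count class sizes from the mkcls output (assumed layout: "token class_id" per line).
from collections import Counter

class_sizes = Counter()
with open("corpus/news-commentary-v8.fr-en.fr.vcb.classes") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            token, class_id = parts
            class_sizes[class_id] += 1

print(class_sizes.most_common(5))  # the five largest classes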

Finally run GIZA++:

./GIZA++ -S corpus/news-commentary-v8.fr-en.clean.fr.vcb -T corpus/news-commentary-v8.fr-en.clean.en.vcb -C corpus/news-commentary-v8.fr-en.clean.fr_news-commentary-v8.fr-en.clean.en.snt -o fr_en -outputpath fr_en

......
writing Final tables to Disk
Dumping the t table inverse to file: fr_en/fr_en.ti.final
Dumping the t table inverse to file: fr_en/fr_en.actual.ti.final
Writing PERPLEXITY report to: fr_en/fr_en.perp
Writing source vocabulary list to : fr_en/fr_en.trn.src.vcb
Writing source vocabulary list to : fr_en/fr_en.trn.trg.vcb
Writing source vocabulary list to : fr_en/fr_en.tst.src.vcb
Writing source vocabulary list to : fr_en/fr_en.tst.trg.vcb
writing decoder configuration file to fr_en/fr_en.Decoder.config
......

The most important file for us is the actual word alignment pairs file: fr_en.actual.ti.final

expectancy associée 0.0144092
only enchâssée 3.56377e-05
amounts construisent 0.00338397
knowledge attribuées 0.00116645
dynamic dynamiques 0.223755
harsh périrent 0.00709615
insubordination agissements 1
big caféière 0.000125214
Health Santé 0.289873
building construisent 0.00355319
dilemma dynamiques 0.00853293
learn apprendront 0.00658648
moving délocalisée 0.00180745
pretends prétendent 0.129701
aggressive dynamiques 0.00016645
center centristes 0.00357907
scope 707 0.000628053
experts intentionnés 0.00241335
principles déplaisait 0.00173075
Reagan déplaisait 0.0054606
meant attribuées 0.00240529
build construisent 0.00590704
median âge 0.121734

But it is unsorted; we can sort it first:

sort fr_en.actual.ti.final > fr_en.actual.ti.final.sort

Then view it in alphabetical order:

learn acquérir 0.00440678
learn adapter 8.79211e-06
learn amérindienne 0.000941561
learn apprécié 0.00330693
learn apprenant 0.00761903
learn apprend 0.00797
learn apprendra 0.00357164
learn apprendre 0.449114
learn apprendrons 0.00265828
learn apprendront 0.00658648
learn apprenez 0.000753722
learn apprenions 0.00077654
learn apprenne 0.00167538
learn apprennent 0.0490054
learn apprenons 0.0085642
learn apprenons-nous 0.000916356
learn apprentissage 0.00935484
learn appris 0.00427148
learn assimilation 0.00248182
learn aurons 0.00229323
learn avertis 8.16617e-06
learn bénéficier 0.00429511
learn commettre 0.0040235
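
Since every line of fr_en.actual.ti.final is simply “word word probability”, as in the samples above, a small Python sketch can pull out the highest-probability partner for each first-column word. The path assumes the fr_en output directory used in the GIZA++ command; adjust it to wherever your run wrote the file.

# Keep only the highest-probability partner for each first-column word.
best = {}
with open("fr_en/fr_en.actual.ti.final") as f:
    for line in f:
        parts = line.split()
        if len(parts) != 3:
            continue
        w1, w2, prob = parts[0], parts[1], float(parts[2])
        if prob > best.get(w1, ("", 0.0))[1]:
            best[w1] = (w2, prob)

print(best.get("learn"))  # e.g. ('apprendre', 0.449114) on the sample above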

Reference:

Using GIZA++ to Obtain Word Alignment Between Bilingual Sentences

A Beginner’s Guide to spaCy

About spaCy

Open Source Text Processing Project: spaCy

Install spaCy and related data model

Install spaCy by pip:
sudo pip install -U spacy

Collecting spacy
  Downloading spacy-1.8.2.tar.gz (3.3MB)
  Downloading numpy-1.13.0-cp27-cp27mu-manylinux1_x86_64.whl (16.6MB)
Collecting murmurhash<0.27,>=0.26 (from spacy)
  Downloading murmurhash-0.26.4-cp27-cp27mu-manylinux1_x86_64.whl
Collecting cymem<1.32,>=1.30 (from spacy)
  Downloading cymem-1.31.2-cp27-cp27mu-manylinux1_x86_64.whl (66kB)
 
Collecting ftfy<5.0.0,>=4.4.2 (from spacy)
  Downloading ftfy-4.4.3.tar.gz (50kB)
  
Collecting cytoolz<0.9,>=0.8 (from thinc<6.6.0,>=6.5.0->spacy)
  Downloading cytoolz-0.8.2.tar.gz (386kB)
  Downloading termcolor-1.1.0.tar.gz
Collecting idna<2.6,>=2.5 (from requests<3.0.0,>=2.13.0->spacy)
  Downloading idna-2.5-py2.py3-none-any.whl (55kB)

Collecting urllib3<1.22,>=1.21.1 (from requests<3.0.0,>=2.13.0->spacy)
  Downloading urllib3-1.21.1-py2.py3-none-any.whl (131kB)

Collecting chardet<3.1.0,>=3.0.2 (from requests<3.0.0,>=2.13.0->spacy)
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB)
Collecting certifi>=2017.4.17 (from requests<3.0.0,>=2.13.0->spacy)
  Downloading certifi-2017.4.17-py2.py3-none-any.whl (375kB)
 
Collecting html5lib (from ftfy<5.0.0,>=4.4.2->spacy)
  Downloading html5lib-0.999999999-py2.py3-none-any.whl (112kB)

Collecting wcwidth (from ftfy<5.0.0,>=4.4.2->spacy)
  Downloading wcwidth-0.1.7-py2.py3-none-any.whl
Collecting toolz>=0.8.0 (from cytoolz<0.9,>=0.8->thinc<6.6.0,>=6.5.0->spacy)
  Downloading toolz-0.8.2.tar.gz (45kB)
 
Collecting setuptools>=18.5 (from html5lib->ftfy<5.0.0,>=4.4.2->spacy)
 
Installing collected packages: numpy, murmurhash, cymem, preshed, wrapt, tqdm, toolz, cytoolz, plac, dill, termcolor, pathlib, thinc, ujson, idna, urllib3, chardet, certifi, requests, regex, setuptools, webencodings, html5lib, wcwidth, ftfy, spacy
  Found existing installation: numpy 1.12.0
    Uninstalling numpy-1.12.0:
      Successfully uninstalled numpy-1.12.0
  Running setup.py install for preshed ... done
  Running setup.py install for wrapt ... done
  Running setup.py install for toolz ... done
  Running setup.py install for cytoolz ... done
  Running setup.py install for dill ... done
  Running setup.py install for termcolor ... done
  Running setup.py install for pathlib ... done
  Running setup.py install for thinc ... done
  Running setup.py install for ujson ... done
  Found existing installation: requests 2.13.0
    Uninstalling requests-2.13.0:
      Successfully uninstalled requests-2.13.0
  Running setup.py install for regex ... done
  Found existing installation: setuptools 20.7.0
    Uninstalling setuptools-20.7.0:
      Successfully uninstalled setuptools-20.7.0
  Running setup.py install for ftfy ... done
  Running setup.py install for spacy ... -

done
Successfully installed certifi-2017.4.17 chardet-3.0.4 cymem-1.31.2 cytoolz-0.8.2 dill-0.2.6 ftfy-4.4.3 html5lib-0.999999999 idna-2.5 murmurhash-0.26.4 numpy-1.13.0 pathlib-1.0.1 plac-0.9.6 preshed-1.0.0 regex-2017.4.5 requests-2.18.1 setuptools-36.0.1 spacy-1.8.2 termcolor-1.1.0 thinc-6.5.2 toolz-0.8.2 tqdm-4.14.0 ujson-1.35 urllib3-1.21.1 wcwidth-0.1.7 webencodings-0.5.1 wrapt-1.10.10

Download related default English model data:
sudo python -m spacy download en

Test spacy by pytest:
python -m pytest /usr/local/lib/python2.7/dist-packages/spacy --vectors --models --slow

============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-3.1.2, py-1.4.34, pluggy-0.4.0
rootdir: /usr/local/lib/python2.7/dist-packages/spacy, inifile:
collected 2932 items 

../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_attrs.py ...
../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_cli.py ......
../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_misc.py ..
../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_orth.py .......................................................
../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_pickles.py .X
../../usr/local/lib/python2.7/dist-packages/spacy/tests/doc/test_add_entities.py .
../../usr/local/lib/python2.7/dist-packages/spacy/tests/doc/test_array.py ...
../../usr/local/lib/python2.7/dist-packages/spacy/tests/doc/test_doc_api.py ............
../../usr/local/lib/python2.7/dist-packages/spacy/tests/doc/test_noun_chunks.py .
../../usr/local/lib/python2.7/dist-packages/spacy/tests/doc/test_token_api.py ........
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_entity_id.py ...
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_matcher.py ...........
../../usr/local/lib/python2.7/dist-packages/spacy/tests/parser/test_ner.py ...
../../usr/local/lib/python2.7/dist-packages/spacy/tests/parser/test_nonproj.py .....
../../usr/local/lib/python2.7/dist-packages/spacy/tests/parser/test_noun_chunks.py .....
../../usr/local/lib/python2.7/dist-packages/spacy/tests/parser/test_parse.py ......
../../usr/local/lib/python2.7/dist-packages/spacy/tests/parser/test_parse_navigate.py ...
../../usr/local/lib/python2.7/dist-packages/spacy/tests/parser/test_sbd.py ......
../../usr/local/lib/python2.7/dist-packages/spacy/tests/parser/test_sbd_prag.py ..x....x.....xx..x......x.....xxx.xxxxx..x..x..x.xxx
../../usr/local/lib/python2.7/dist-packages/spacy/tests/parser/test_space_attachment.py ......
../../usr/local/lib/python2.7/dist-packages/spacy/tests/serialize/test_codecs.py ...
../../usr/local/lib/python2.7/dist-packages/spacy/tests/serialize/test_huffman.py .....
../../usr/local/lib/python2.7/dist-packages/spacy/tests/serialize/test_io.py ...
../../usr/local/lib/python2.7/dist-packages/spacy/tests/serialize/test_packer.py .....
../../usr/local/lib/python2.7/dist-packages/spacy/tests/serialize/test_serialization.py ..........
../../usr/local/lib/python2.7/dist-packages/spacy/tests/spans/test_merge.py ......
../../usr/local/lib/python2.7/dist-packages/spacy/tests/spans/test_span.py ........
../../usr/local/lib/python2.7/dist-packages/spacy/tests/stringstore/test_freeze_string_store.py .
../../usr/local/lib/python2.7/dist-packages/spacy/tests/stringstore/test_stringstore.py ..........
../../usr/local/lib/python2.7/dist-packages/spacy/tests/tagger/test_lemmatizer.py .....x...
../../usr/local/lib/python2.7/dist-packages/spacy/tests/tagger/test_morph_exceptions.py .
../../usr/local/lib/python2.7/dist-packages/spacy/tests/tagger/test_spaces.py ..
../../usr/local/lib/python2.7/dist-packages/spacy/tests/tagger/test_tag_names.py .
../../usr/local/lib/python2.7/dist-packages/spacy/tests/tokenizer/test_exceptions.py ............................................
../../usr/local/lib/python2.7/dist-packages/spacy/tests/tokenizer/test_tokenizer.py ............................................................................................................................................................................................
../../usr/local/lib/python2.7/dist-packages/spacy/tests/tokenizer/test_urls.py ...................................xx...................................xxx.....................................................................................................................................................................................................................................................
../../usr/local/lib/python2.7/dist-packages/spacy/tests/tokenizer/test_whitespace.py .............................................................................
../../usr/local/lib/python2.7/dist-packages/spacy/tests/vectors/test_similarity.py .....
../../usr/local/lib/python2.7/dist-packages/spacy/tests/vectors/test_vectors.py ...............
../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_add_vectors.py .
../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_lexeme.py ......
../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_vocab_api.py ....................

============ 2905 passed, 26 xfailed, 1 xpassed in 1549.45 seconds =============

How to use spaCy

textminer@ubuntu:~$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
Type "copyright", "credits" or "license" for more information.
 
IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: import spacy
 
In [2]: spacy_en = spacy.load('en')
 
In [3]: test_texts = u"""
Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof."""
 
In [4]: test_doc = spacy_en(test_texts)
 
In [6]: print(test_doc)
 
Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof.
 
In [7]: dir(test_doc)
Out[7]: 
['__bytes__',
 '__class__',
 '__delattr__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__getitem__',
 '__hash__',
 '__init__',
 '__iter__',
 '__len__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_py_tokens',
 '_realloc',
 '_vector',
 '_vector_norm',
 'count_by',
 'doc',
 'ents',
 'from_array',
 'from_bytes',
 'has_vector',
 'is_parsed',
 'is_tagged',
 'mem',
 'merge',
 'noun_chunks',
 'noun_chunks_iterator',
 'read_bytes',
 'sentiment',
 'sents',
 'similarity',
 'string',
 'tensor',
 'text',
 'text_with_ws',
 'to_array',
 'to_bytes',
 'user_data',
 'user_hooks',
 'user_span_hooks',
 'user_token_hooks',
 'vector',
 'vector_norm',
 'vocab']
 
# Word Tokenization
In [8]: for token in test_doc[:20]:
   ...:     print(token)
   ...:     
 
 
Natural
language
processing
(
NLP
)
is
a
field
of
computer
science
,
artificial
intelligence
and
computational
linguistics
concerned
 
# Sentence Tokenization or Sentence Segmentation
In [9]: for sent in test_doc.sents:
   ...:     print(sent)
   ...:     
 
Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.
Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof.
 
In [10]: for sent_num, sent in enumerate(test_doc.sents, 1):
   ....:     print(sent_num, sent)
   ....:     
(1, 
Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.)
(2, Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof.)
 
# String/ID mapping
In [11]: NLP_id = spacy_en.vocab.strings['NLP']
 
In [12]: print(NLP_id)
289622
 
 
In [13]: NLP_str = spacy_en.vocab.strings[NLP_id]
 
In [14]: print(NLP_str)
NLP
 
# POS Tagging:
In [15]: for token in test_doc[:20]:
   ....:     print(token, token.pos, token.pos_)
   ....:     
(
, 101, u'SPACE')
(Natural, 82, u'ADJ')
(language, 90, u'NOUN')
(processing, 90, u'NOUN')
((, 95, u'PUNCT')
(NLP, 94, u'PROPN')
(), 95, u'PUNCT')
(is, 98, u'VERB')
(a, 88, u'DET')
(field, 90, u'NOUN')
(of, 83, u'ADP')
(computer, 90, u'NOUN')
(science, 90, u'NOUN')
(,, 95, u'PUNCT')
(artificial, 82, u'ADJ')
(intelligence, 90, u'NOUN')
(and, 87, u'CCONJ')
(computational, 82, u'ADJ')
(linguistics, 90, u'NOUN')
(concerned, 98, u'VERB')
 
# Named-entity recognition (NER)
In [16]: for ent in test_doc.ents:
   ....:     print(ent, ent.label, ent.label_)
   ....:     
(
Natural language, 382, u'LOC')
(NLP, 380, u'ORG')
 
# Test NER Again:
In [17]: ner_test_doc = spacy_en(u"Rami Eid is studying at Stony Brook University in New York")
 
In [18]: for ent in ner_test_doc.ents:
   ....:     print(ent, ent.label, ent.label_)
   ....:     
(Rami Eid, 377, u'PERSON')
(Stony Brook University, 380, u'ORG')
 
# Noun Chunk
In [19]: for np in test_doc.noun_chunks:
   ....:     print(np)
   ....:     
 
Natural language processing
a field
computer science
the interactions
computers
human
languages
programming computers
large natural language corpora
Challenges
natural language processing
natural language understanding
formal, machine-readable logical forms
language and machine perception, dialog systems
some combination
 
# Word Lemmatization
In [20]: for token in test_doc[:20]:
   ....:     print(token, token.lemma, token.lemma_)
   ....:     
(
, 518, u'\n')
(Natural, 1854, u'natural')
(language, 1374, u'language')
(processing, 6038, u'processing')
((, 562, u'(')
(NLP, 289623, u'nlp')
(), 547, u')')
(is, 536, u'be')
(a, 506, u'a')
(field, 2378, u'field')
(of, 510, u'of')
(computer, 1433, u'computer')
(science, 1427, u'science')
(,, 450, u',')
(artificial, 5448, u'artificial')
(intelligence, 2541, u'intelligence')
(and, 512, u'and')
(computational, 37658, u'computational')
(linguistics, 398368, u'linguistic')
(concerned, 3744, u'concern')
 
# Word vector test: something seems wrong here
In [21]: word_vector_test_doc =spacy_en(u"Apples and oranges are similar. Boots and hippos aren't.")
 
In [22]: apples = word_vector_test_doc[0]
 
In [23]: oranges = word_vector_test_doc[2] 
 
In [24]: apples.similarity(oranges)
Out[24]: 0.0

spaCy models
The word similarity test above failed because, since spaCy 1.7, the default English model does not include the English GloVe vector model, which needs to be downloaded separately:

sudo python -m spacy download en_vectors_glove_md

    Downloading en_vectors_glove_md-1.0.0/en_vectors_glove_md-1.0.0.tar.gz

Collecting https://github.com/explosion/spacy-models/releases/download/en_vectors_glove_md-1.0.0/en_vectors_glove_md-1.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_vectors_glove_md-1.0.0/en_vectors_glove_md-1.0.0.tar.gz (762.3MB)
...
   100% |████████████████████████████████| 762.3MB 5.5MB/s 
Requirement already satisfied: spacy<2.0.0,>=0.101.0 in /usr/local/lib/python2.7/dist-packages (from en-vectors-glove-md==1.0.0)
Requirement already satisfied: numpy>=1.7 in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: murmurhash<0.27,>=0.26 in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: cymem<1.32,>=1.30 in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: preshed<2.0.0,>=1.0.0 in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: thinc<6.6.0,>=6.5.0 in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: plac<1.0.0,>=0.9.6 in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: six in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: pathlib in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: ujson>=1.35 in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: dill<0.3,>=0.2 in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: regex==2017.4.5 in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: ftfy<5.0.0,>=4.4.2 in /usr/local/lib/python2.7/dist-packages (from spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: wrapt in /usr/local/lib/python2.7/dist-packages (from thinc<6.6.0,>=6.5.0->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: tqdm<5.0.0,>=4.10.0 in /usr/local/lib/python2.7/dist-packages (from thinc<6.6.0,>=6.5.0->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: cytoolz<0.9,>=0.8 in /usr/local/lib/python2.7/dist-packages (from thinc<6.6.0,>=6.5.0->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: termcolor in /usr/local/lib/python2.7/dist-packages (from thinc<6.6.0,>=6.5.0->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: idna<2.6,>=2.5 in /usr/local/lib/python2.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: urllib3<1.22,>=1.21.1 in /usr/local/lib/python2.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python2.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python2.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: html5lib in /usr/local/lib/python2.7/dist-packages (from ftfy<5.0.0,>=4.4.2->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python2.7/dist-packages (from ftfy<5.0.0,>=4.4.2->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: toolz>=0.8.0 in /usr/local/lib/python2.7/dist-packages (from cytoolz<0.9,>=0.8->thinc<6.6.0,>=6.5.0->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python2.7/dist-packages (from html5lib->ftfy<5.0.0,>=4.4.2->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Requirement already satisfied: webencodings in /usr/local/lib/python2.7/dist-packages (from html5lib->ftfy<5.0.0,>=4.4.2->spacy<2.0.0,>=0.101.0->en-vectors-glove-md==1.0.0)
Installing collected packages: en-vectors-glove-md
  Running setup.py install for en-vectors-glove-md ... done
Successfully installed en-vectors-glove-md-1.0.0

    Linking successful

    /usr/local/lib/python2.7/dist-packages/en_vectors_glove_md/en_vectors_glove_md-1.0.0
    -->
    /usr/local/lib/python2.7/dist-packages/spacy/data/en_vectors_glove_md

    You can now load the model via spacy.load('en_vectors_glove_md').

Now you can load the English GloVe vector model and test word similarity with it:

In [1]: import spacy
 
In [2]: spacy_en = spacy.load('en_vectors_glove_md')
 
In [3]: word_vector_test_doc =spacy_en(u"Apples and oranges are similar. Boots and hippos aren't.")
 
In [4]: apples = word_vector_test_doc[0]
 
In [5]: oranges = word_vector_test_doc[2]
 
In [6]: apples.similarity(oranges)
Out[6]: 0.77809414836023805
 
In [7]: boots = word_vector_test_doc[6]
 
In [8]: hippos = word_vector_test_doc[8]
 
In [9]: boots.similarity(hippos)
Out[9]: 0.038474555379008429
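
With the vector model loaded, similarity is not limited to single tokens; the Doc object also exposes a similarity method (it appears in the dir() listing earlier). Here is a small sketch comparing two whole documents; the exact score depends on the model data.

import spacy

# Compare two whole documents using the GloVe vectors model.
spacy_en = spacy.load('en_vectors_glove_md')

doc1 = spacy_en(u"Apples and oranges are similar.")
doc2 = spacy_en(u"Boots and hippos aren't.")

print(doc1.similarity(doc2))  # averaged word-vector similarity between the docs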

More spaCy models are available; you can download them according to your needs.

Reference:
Getting Started with spaCy


Getting started with Python Word Segmentation

About Python Word Segmentation

Python Word Segmentation

WordSegment is an Apache2 licensed module for English word segmentation, written in pure-Python, and based on a trillion-word corpus.

Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.

Install Python Word Segmentation

Installing WordSegment is very easy, just use pip:

pip install wordsegment

How to Use Python Word Segmentation for English Text

In [1]: import wordsegment
 
In [2]: help(wordsegment)
 
In [4]: from wordsegment import segment
 
In [5]: segment("thisisatest")
Out[5]: ['this', 'is', 'a', 'test']
 
In [6]: segment("helloworld")
Out[6]: ['helloworld']
 
In [7]: segment("hiworld")
Out[7]: ['hi', 'world']
 
In [8]: segment("NewYork")
Out[8]: ['new', 'york']
 
In [9]: from wordsegment import clean
 
In [10]: clean("this's a test")
Out[10]: 'thissatest'
 
In [11]: segment("this'satest")
Out[11]: ['this', 'sa', 'test']
 
In [12]: import wordsegment as ws
 
In [13]: ws.load()
 
In [15]: ws.UNIGRAMS['the']
Out[15]: 23135851162.0
 
In [16]: ws.UNIGRAMS['gray']
Out[16]: 21424658.0
 
In [17]: ws.UNIGRAMS['grey']
Out[17]: 18276942.0
 
In [18]: dir(ws)
Out[18]: 
['ALPHABET',
 'BIGRAMS',
 'DATADIR',
 'TOTAL',
 'UNIGRAMS',
 '__author__',
 '__build__',
 '__builtins__',
 '__copyright__',
 '__doc__',
 '__file__',
 '__license__',
 '__name__',
 '__package__',
 '__title__',
 '__version__',
 'clean',
 'divide',
 'io',
 'isegment',
 'load',
 'main',
 'math',
 'op',
 'parse_file',
 'score',
 'segment',
 'sys']
 
In [19]: ws.BIGRAMS['this is']
Out[19]: 86818400.0
 
In [20]: ws.BIGRAMS['is a']
Out[20]: 476718990.0
 
In [21]: ws.BIGRAMS['a test']
Out[21]: 4417355.0
 
In [22]: ws.BIGRAMS['a patent']
Out[22]: 1117510.0
 
In [23]: ws.BIGRAMS['free patent']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-6d20cc0adefa> in <module>()
----> 1 ws.BIGRAMS['free patent']
 
KeyError: 'free patent'
 
In [24]: ws.BIGRAMS['the input']
Out[24]: 4840160.0
 
In [26]: import heapq
 
In [27]: from pprint import pprint
 
In [28]: from operator import itemgetter
 
In [29]: pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
[(u'of the', 2766332391.0),
 (u'in the', 1628795324.0),
 (u'to the', 1139248999.0),
 (u'on the', 800328815.0),
 (u'for the', 692874802.0),
 (u'and the', 629726893.0),
 (u'to be', 505148997.0),
 (u'is a', 476718990.0),
 (u'with the', 461331348.0),
 (u'from the', 428303219.0)]
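
Since the module also exposes the corpus total as ws.TOTAL, you can turn the raw counts above into rough relative frequencies. A small sketch:

# Convert raw unigram counts into rough relative frequencies.
import wordsegment as ws

ws.load()
for word in ['the', 'gray', 'grey']:
    print(word, ws.UNIGRAMS[word] / ws.TOTAL)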

Help info about Python Word Segmentation

Help on module wordsegment:

NAME
    wordsegment - English Word Segmentation in Python

FILE
    /Library/Python/2.7/site-packages/wordsegment.py

DESCRIPTION
    Word segmentation is the process of dividing a phrase without spaces back
    into its constituent parts. For example, consider a phrase like "thisisatest
".
    For humans, it's relatively easy to parse. This module makes it easy for
    machines too. Use `segment` to parse a phrase into its parts:
    
    >>> from wordsegment import segment
    >>> segment('thisisatest')
    ['this', 'is', 'a', 'test']
    
    In the code, 1024908267229 is the total number of words in the corpus. A
    subset of this corpus is found in unigrams.txt and bigrams.txt which
    should accompany this file. A copy of these files may be found at
    http://norvig.com/ngrams/ under the names count_1w.txt and count_2w.txt
    respectively.
    
    Copyright (c) 2016 by Grant Jenks
    
    Based on code from the chapter "Natural Language Corpus Data"
    from the book "Beautiful Data" (Segaran and Hammerbacher, 2009)
    http://oreilly.com/catalog/9780596157111/
    
    Original Copyright (c) 2008-2009 by Peter Norvig

FUNCTIONS
    clean(text)
        Return `text` lower-cased with non-alphanumeric characters removed.
    
    divide(text, limit=24)
        Yield `(prefix, suffix)` pairs from `text` with `len(prefix)` not
        exceeding `limit`.
    
    isegment(text)
        Return iterator of words that is the best segmenation of `text`.
    
    load()
        Load unigram and bigram counts from disk.
    main(args=())
        Command-line entry-point. Parses `args` into in-file and out-file then
        reads lines from in-file, segments the lines, and writes the result to
        out-file. Input and output default to stdin and stdout respectively.
    
    parse_file(filename)
        Read `filename` and parse tab-separated file of (word, count) pairs.
    
    score(word, prev=None)
        Score a `word` in the context of the previous word, `prev`.
    
    segment(text)
        Return a list of words that is the best segmenation of `text`.

DATA
    ALPHABET = set(['0', '1', '2', '3', '4', '5', ...])
    BIGRAMS = {u'0km to': 116103.0, u'0uplink verified': 523545.0, u'1000s...
    DATADIR = '/Library/Python/2.7/site-packages/wordsegment_data'
    TOTAL = 1024908267229.0
    UNIGRAMS = {u'a': 9081174698.0, u'aa': 30523331.0, u'aaa': 10243983.0,...
    __author__ = 'Grant Jenks'
    __build__ = 2048
    __copyright__ = 'Copyright 2016 Grant Jenks'
    __license__ = 'Apache 2.0'
    __title__ = 'wordsegment'
    __version__ = '0.8.0'

VERSION
    0.8.0

AUTHOR
    Grant Jenks


Getting started with topia.termextract

About topia.termextract

Open Source Text Processing Project: topia.termextract

Install topia.termextract

Although topia.termextract has a PyPI page, it cannot be installed by the “pip install” method; you should download the source code first:

https://pypi.python.org/packages/d1/b9/452257976ebee91d07c74bc4b34cfce416f45b94af1d62902ae39bf902cf/topia.termextract-1.1.0.tar.gz

Then “tar -zxvf topia.termextract-1.1.0.tar.gz”, “cd topia.termextract-1.1.0”, and “sudo python setup.py install”.

How to Use topia.termextract for term extraction

topia.termextract is based on an English POS tagger, so it can also be used for word tokenization and POS tagging:

IPython 3.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: from topia.termextract import tag
 
In [2]: dir(tag)
Out[2]: 
['DATA_DIRECTORY',
 'TERM_SPEC',
 'Tagger',
 '__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 'correctDefaultNounTag',
 'determineVerbAfterModal',
 'interfaces',
 'normalizePluralForms',
 'os',
 're',
 'verifyProperNounAtSentenceStart',
 'zope']
 
In [3]: tagger = tag.Tagger()
 
In [4]: tagger
Out[4]: <Tagger for english>
 
In [5]: tagger.initialize()
 
In [7]: tagger.tokenize("this's topia.termextarct word tokenize test.")
Out[7]: ['this', "'s", 'topia.termextarct', 'word', 'tokenize', 'test', '.']
 
 
In [9]: test_text = "Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus."
 
In [11]: tagger.tokenize("test_text")
Out[11]: ['test', '_text']
 
In [12]: tagger.tokenize(test_text)
Out[12]: 
['Terminology',
 'extraction',
 '(',
 'also',
 'known',
 'as',
 'term',
 'extraction',
 ',',
 'glossary',
 'extraction',
 ',',
 'term',
 'recognition',
 ',',
 'or',
 'terminology',
 'mining',
 ')',
 'is',
 'a',
 'subtask',
 'of',
 'information',
 'extraction',
 '.',
 'The',
 'goal',
 'of',
 'terminology',
 'extraction',
 'is',
 'to',
 'automatically',
 'extract',
 'relevant',
 'terms',
 'from',
 'a',
 'given',
 'corpus',
 '.']
 
In [13]: tagger("this's topia.termextarct word tokenize test.")
Out[13]: 
[['this', 'DT', 'this'],
 ["'s", 'POS', "'s"],
 ['topia.termextarct', 'NN', 'topia.termextarct'],
 ['word', 'NN', 'word'],
 ['tokenize', 'NN', 'tokenize'],
 ['test', 'NN', 'test'],
 ['.', '.', '.']]
 
In [14]: tagger("these examples are more better")
Out[14]: 
[['these', 'DT', 'these'],
 ['examples', 'NNS', 'example'],
 ['are', 'VBP', 'are'],
 ['more', 'JJR', 'more'],
 ['better', 'JJR', 'better']]
 
In [15]: tagger(test_text)
Out[15]: 
[['Terminology', 'NN', 'Terminology'],
 ['extraction', 'NN', 'extraction'],
 ['(', '(', '('],
 ['also', 'RB', 'also'],
 ['known', 'VBN', 'known'],
 ['as', 'IN', 'as'],
 ['term', 'NN', 'term'],
 ['extraction', 'NN', 'extraction'],
 [',', ',', ','],
 ['glossary', 'NN', 'glossary'],
 ['extraction', 'NN', 'extraction'],
 [',', ',', ','],
 ['term', 'NN', 'term'],
 ['recognition', 'NN', 'recognition'],
 [',', ',', ','],
 ['or', 'CC', 'or'],
 ['terminology', 'NN', 'terminology'],
 ['mining', 'NN', 'mining'],
 [')', ')', ')'],
 ['is', 'VBZ', 'is'],
 ['a', 'DT', 'a'],
 ['subtask', 'NN', 'subtask'],
 ['of', 'IN', 'of'],
 ['information', 'NN', 'information'],
 ['extraction', 'NN', 'extraction'],
 ['.', '.', '.'],
 ['The', 'DT', 'The'],
 ['goal', 'NN', 'goal'],
 ['of', 'IN', 'of'],
 ['terminology', 'NN', 'terminology'],
 ['extraction', 'NN', 'extraction'],
 ['is', 'VBZ', 'is'],
 ['to', 'TO', 'to'],
 ['automatically', 'RB', 'automatically'],
 ['extract', 'VB', 'extract'],
 ['relevant', 'JJ', 'relevant'],
 ['terms', 'NNS', 'term'],
 ['from', 'IN', 'from'],
 ['a', 'DT', 'a'],
 ['given', 'VBN', 'given'],
 ['corpus', 'NN', 'corpus'],
 ['.', '.', '.']]

Now let’s get started with the term extractor in topia.termextract:

 
In [16]: from topia.termextract import extract
 
In [17]: extractor = extract.TermExtractor()
 
In [18]: extractor
Out[18]: <TermExtractor using <Tagger for english>>
 
 
In [20]: extractor.tagger
Out[20]: <Tagger for english>
 
In [21]: test_sample = """
   ....: Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus. In the semantic web era, a growing number of communities and networked enterprises started to access and interoperate through the internet. Modeling these communities and their information needs is important for several web applications, like topic-driven web crawlers,[1] web services,[2] recommender systems,[3] etc. The development of terminology extraction is essential to the language industry. One of the first steps to model the knowledge domain of a virtual community is to collect a vocabulary of domain-relevant terms, constituting the linguistic surface manifestation of domain concepts. Several methods to automatically extract technical terms from domain-specific document warehouses have been described in the literature.Typically, approaches to automatic term extraction make use of linguistic processors (part of speech tagging, phrase chunking) to extract terminological candidates, i.e. syntactically plausible terminological noun phrases, NPs (e.g. compounds "credit card", adjective-NPs "local tourist information office", and prepositional-NPs "board of directors" - in English, the first two constructs are the most frequent[citation needed]). Terminological entries are then filtered from the candidate list using statistical and machine learning methods. Once filtered, because of their low ambiguity and high specificity, these terms are particularly useful for conceptualizing a knowledge domain or for supporting the creation of a domain ontology or a terminology base. Furthermore, terminology extraction is a very useful starting point for semantic similarity, knowledge management, human translation and machine translation, etc.
   ....: """
 
 
In [22]: extractor(test_sample)
Out[22]: 
[('web applications', 1, 2),
 ('domain concepts', 1, 2),
 ('domain-relevant terms', 1, 2),
 ('terminology base', 1, 2),
 ('web', 4, 1),
 ('knowledge', 3, 1),
 ('tourist information office ",', 1, 4),
 ('Terminology extraction', 1, 2),
 ('candidate list', 1, 2),
 ('term', 7, 1),
 ('domain ontology', 1, 2),
 ('knowledge domain', 2, 2),
 ('terminology', 5, 1),
 ('domain', 4, 1),
 ('knowledge management', 1, 2),
 ('information extraction', 1, 2),
 ('glossary extraction', 1, 2),
 ('terminological candidates', 1, 2),
 ('Several methods', 1, 2),
 ('terminology mining', 1, 2),
 ('networked enterprises', 1, 2),
 ('machine translation', 1, 2),
 ('term extraction', 2, 2),
 ('domain-specific document warehouses', 1, 3),
 ('community', 3, 1),
 ('credit card ", adjective-NPs', 1, 4),
 ('extraction', 8, 1),
 ('term recognition', 1, 2),
 ('terminological noun phrases', 1, 3),
 ('language industry', 1, 2),
 ('surface manifestation', 1, 2),
 ('information', 3, 1),
 ('topic-driven web crawlers ,[1] web services ,[2] recommender systems ,[3]',
  1,
  10),
 (']). Terminological entries', 1, 3),
 ('terminology extraction', 3, 2),
 ('web era', 1, 2),
 ('phrase chunking', 1, 2)]
 
In [23]: term_result = extractor(test_sample)
 
In [24]: term_result
Out[24]: 
[('web applications', 1, 2),
 ('domain concepts', 1, 2),
 ('domain-relevant terms', 1, 2),
 ('terminology base', 1, 2),
 ('web', 4, 1),
 ('knowledge', 3, 1),
 ('tourist information office ",', 1, 4),
 ('Terminology extraction', 1, 2),
 ('candidate list', 1, 2),
 ('term', 7, 1),
 ('domain ontology', 1, 2),
 ('knowledge domain', 2, 2),
 ('terminology', 5, 1),
 ('domain', 4, 1),
 ('knowledge management', 1, 2),
 ('information extraction', 1, 2),
 ('glossary extraction', 1, 2),
 ('terminological candidates', 1, 2),
 ('Several methods', 1, 2),
 ('terminology mining', 1, 2),
 ('networked enterprises', 1, 2),
 ('machine translation', 1, 2),
 ('term extraction', 2, 2),
 ('domain-specific document warehouses', 1, 3),
 ('community', 3, 1),
 ('credit card ", adjective-NPs', 1, 4),
 ('extraction', 8, 1),
 ('term recognition', 1, 2),
 ('terminological noun phrases', 1, 3),
 ('language industry', 1, 2),
 ('surface manifestation', 1, 2),
 ('information', 3, 1),
 ('topic-driven web crawlers ,[1] web services ,[2] recommender systems ,[3]',
  1,
  10),
 (']). Terminological entries', 1, 3),
 ('terminology extraction', 3, 2),
 ('web era', 1, 2),
 ('phrase chunking', 1, 2)]
 
In [25]: sorted_term_result = sorted(term_result, key=lambda x: x[1] * x[2], reverse=True)
 
In [26]: sorted_term_result
Out[26]: 
[('topic-driven web crawlers ,[1] web services ,[2] recommender systems ,[3]',
  1,
  10),
 ('extraction', 8, 1),
 ('term', 7, 1),
 ('terminology extraction', 3, 2),
 ('terminology', 5, 1),
 ('web', 4, 1),
 ('tourist information office ",', 1, 4),
 ('knowledge domain', 2, 2),
 ('domain', 4, 1),
 ('term extraction', 2, 2),
 ('credit card ", adjective-NPs', 1, 4),
 ('knowledge', 3, 1),
 ('domain-specific document warehouses', 1, 3),
 ('community', 3, 1),
 ('terminological noun phrases', 1, 3),
 ('information', 3, 1),
 (']). Terminological entries', 1, 3),
 ('web applications', 1, 2),
 ('domain concepts', 1, 2),
 ('domain-relevant terms', 1, 2),
 ('terminology base', 1, 2),
 ('Terminology extraction', 1, 2),
 ('candidate list', 1, 2),
 ('domain ontology', 1, 2),
 ('knowledge management', 1, 2),
 ('information extraction', 1, 2),
 ('glossary extraction', 1, 2),
 ('terminological candidates', 1, 2),
 ('Several methods', 1, 2),
 ('terminology mining', 1, 2),
 ('networked enterprises', 1, 2),
 ('machine translation', 1, 2),
 ('term recognition', 1, 2),
 ('language industry', 1, 2),
 ('surface manifestation', 1, 2),
 ('web era', 1, 2),
 ('phrase chunking', 1, 2)]
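
Because the extractor returns plain (term, occurrences, strength) tuples, you can post-filter the results however you like. For example, continuing from the session above, here is a small sketch that keeps only multi-word terms or single words occurring at least three times (the thresholds are just an illustration):

# term_result is the list of (term, occurrences, strength) tuples from the session above.
# Keep multi-word terms (strength > 1) or frequent single words (occurrences >= 3).
filtered = [(term, occurrences, strength)
            for term, occurrences, strength in term_result
            if strength > 1 or occurrences >= 3]

for term, occurrences, strength in filtered:
    print(term, occurrences, strength)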


Getting started with WordNet

About WordNet

WordNet is a lexical database for English:

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.

Install WordNet

We can download WordNet related source and data from the official website: https://wordnet.princeton.edu/wordnet/download/current-version/

The most recent Windows version of WordNet is 2.1, released in March 2005. Version 3.0 for Unix/Linux/Solaris/etc. was released in December 2006. Version 3.1 is currently available only online.

Here we will use WordNet 3.0 as the stable release version, which supports UNIX-like systems, including Linux, Mac OS X and Solaris. Before installing WordNet from source, we should download it first. Download the tar-gzipped version: WordNet-3.0.tar.gz

Install WordNet on Ubuntu 16.04:

tar -zxvf WordNet-3.0.tar.gz
cd WordNet-3.0/
./configure

After running configure for WordNet 3.0, we met a configuration problem:


checking for style of include used by make… GNU
checking dependency style of gcc… gcc3
checking for Tcl configuration… configure: WARNING: Can’t find Tcl configuration definitions

Installing tcl-dev on Ubuntu can resolve this problem:

sudo apt-get install tcl-dev

Configure wordnet again:

./configure

But we met another Tk problem:


checking for style of include used by make… GNU
checking dependency style of gcc… gcc3
checking for Tcl configuration… found /usr/lib/tclConfig.sh
checking for Tk configuration… configure: WARNING: Can’t find Tk configuration definitions

Install tk-dev on Ubuntu too:

sudo apt-get install tk-dev

Finally configure it successfully:

./configure

WordNet is now configured

Installation directory: /usr/local/WordNet-3.0

To build and install WordNet:

make
make install

To run, environment variables should be set as follows:

PATH – include ${exec_prefix}/bin
WNHOME – if not using default installation location, set to /usr/local/WordNet-3.0

See INSTALL file for details and additional environment variables
which may need to be set on your system.

Now make it:

make

But we met a compile error:

……
then mv -f “.deps/wishwn-stubs.Tpo” “.deps/wishwn-stubs.Po”; else rm -f “.deps/wishwn-stubs.Tpo”; exit 1; fi
stubs.c: In function ‘wn_findvalidsearches’:
stubs.c:43:14: error: ‘Tcl_Interp {aka struct Tcl_Interp}’ has no member named ‘result’
interp -> result =

The reason is that “Tcl 8.5 deprecated interp->result and Tcl 8.6+ removed it”, so you should modify the original WordNet code:

sudo vim src/stubs.c

and add a line “#define USE_INTERP_RESULT 1” before “#include <tcl.h>”, like this:

#ifdef _WINDOWS
#include <windows.h>
#endif
 
#define USE_INTERP_RESULT 1
 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tcl.h>
#include <tk.h>
#include <wn.h>
......

Make it again:

make

make all-recursive
make[1]: Entering directory ‘/home/textminer/wordnet/WordNet-3.0’
……
gcc -g -O2 -o wishwn wishwn-tkAppInit.o wishwn-stubs.o -L../lib -lWN -L/usr/lib/x86_64-linux-gnu -ltk8.6 -L/usr/lib/x86_64-linux-gnu -ltcl8.6 -lX11 -lXss -lXext -lXft -lfontconfig -lfreetype -lfontconfig -lpthread -ldl -lz -lpthread -lieee -lm
make[2]: Leaving directory ‘/home/textminer/wordnet/WordNet-3.0/src’
make[2]: Entering directory ‘/home/textminer/wordnet/WordNet-3.0’
make[2]: Leaving directory ‘/home/textminer/wordnet/WordNet-3.0’
make[1]: Leaving directory ‘/home/textminer/wordnet/WordNet-3.0’

Finally, run “make install”:

sudo make install

If everything is ok, you can find WordNet 3.0 in the “/usr/local/WordNet-3.0/” directory, and in the binary subdirectory “/usr/local/WordNet-3.0/bin” you can find the related binaries: wishwn, wn and wnb

Now execute wn:

./wn

We can get:

usage: wn word [-hgla] [-n#] -searchtype [-searchtype...]
       wn [-l]
 
	-h		Display help text before search output
	-g		Display gloss
	-l		Display license and copyright notice
	-a		Display lexicographer file information
	-o		Display synset offset
	-s		Display sense numbers in synsets
	-n#		Search only sense number #
 
searchtype is at least one of the following:
	-ants{n|v|a|r}		Antonyms
	-hype{n|v}		Hypernyms
	-hypo{n|v}, -tree{n|v}	Hyponyms & Hyponym Tree
	-entav			Verb Entailment
	-syns{n|v|a|r}		Synonyms (ordered by estimated frequency)
	-smemn			Member of Holonyms
	-ssubn			Substance of Holonyms
	-sprtn			Part of Holonyms
	-membn			Has Member Meronyms
	-subsn			Has Substance Meronyms
	-partn			Has Part Meronyms
	-meron			All Meronyms
	-holon			All Holonyms
	-causv			Cause to
	-pert{a|r}		Pertainyms
	-attr{n|a}		Attributes
	-deri{n|v}		Derived Forms
	-domn{n|v|a|r}		Domain
	-domt{n|v|a|r}		Domain Terms
	-faml{n|v|a|r}		Familiarity & Polysemy Count
	-framv			Verb Frames
	-coor{n|v}		Coordinate Terms (sisters)
	-simsv			Synonyms (grouped by similarity of meaning)
	-hmern			Hierarchical Meronyms
	-hholn			Hierarchical Holonyms
	-grep{n|v|a|r}		List of Compound Words
	-over			Overview of Senses

Now you can enjoy WordNet on your Ubuntu system.

Another simple way to install WordNet on Ubuntu is via apt-get:

sudo apt install wordnet

Reading package lists… Done
Building dependency tree
Reading state information… Done
The following additional packages will be installed:
fontconfig-config fonts-dejavu-core libfontconfig1 libtcl8.5 libtk8.5
libxft2 libxrender1 libxss1 wordnet-base wordnet-gui x11-common
Suggested packages:
tcl8.5 tk8.5
The following NEW packages will be installed:
fontconfig-config fonts-dejavu-core libfontconfig1 libtcl8.5 libtk8.5
libxft2 libxrender1 libxss1 wordnet wordnet-base wordnet-gui x11-common
0 upgraded, 12 newly installed, 0 to remove and 94 not upgraded.
Need to get 9,177 kB of archives.
After this operation, 39.8 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
……
Setting up libxss1:amd64 (1:1.2.2-1) …
Setting up libtcl8.5:amd64 (8.5.19-1) …
Setting up libtk8.5:amd64 (8.5.19-1ubuntu1) …
Setting up wordnet-base (1:3.0-33) …
Setting up wordnet (1:3.0-33) …
Setting up wordnet-gui (1:3.0-33) …
Processing triggers for libc-bin (2.23-0ubuntu3) …
Processing triggers for systemd (229-4ubuntu8) …
Processing triggers for ureadahead (0.100.0-19) …

Now you can type “wn” to test WordNet the same as before.

Install WordNet on Mac OS:

Installing WordNet from source on Mac OS is simpler, because the Tcl and Tk development files are there by default, but you will meet the same compile error:

……
stubs.c: In function ‘wn_findvalidsearches’:
stubs.c:43: error: ‘Tcl_Interp’ has no member named ‘result’
stubs.c:55: error: ‘Tcl_Interp’ has no member named ‘result’

The fix is the same: modify the original WordNet code:

sudo vim src/stubs.c

and add the line “#define USE_INTERP_RESULT 1” before “#include <tcl.h>”, like this:

#ifdef _WINDOWS
#include <windows.h>
#endif
 
#define USE_INTERP_RESULT 1
 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tcl.h>
#include <tk.h>
#include <wn.h>
......

Test WordNet

We test the word “book” with WordNet:

wn book

Information available for noun book
	-hypen		Hypernyms
	-hypon, -treen	Hyponyms & Hyponym Tree
	-synsn		Synonyms (ordered by estimated frequency)
	-sprtn		Part of Holonyms
	-membn		Has Member Meronyms
	-partn		Has Part Meronyms
	-meron		All Meronyms
	-holon		All Holonyms
	-derin		Derived Forms
	-domnn		Domain
	-domtn		Domain Terms
	-famln		Familiarity & Polysemy Count
	-coorn		Coordinate Terms (sisters)
	-hmern		Hierarchical Meronyms
	-hholn		Hierarchical Holonyms
	-grepn		List of Compound Words
	-over		Overview of Senses
 
Information available for verb book
	-hypev		Hypernyms
	-hypov, -treev	Hyponyms & Hyponym Tree
	-entav		Verb Entailment
	-synsv		Synonyms (ordered by estimated frequency)
	-deriv		Derived Forms
	-famlv		Familiarity & Polysemy Count
	-framv		Verb Frames
	-coorv		Coordinate Terms (sisters)
	-simsv		Synonyms (grouped by similarity of meaning)
	-grepv		List of Compound Words
	-over		Overview of Senses
 
No information available for adj book
 
No information available for adv book

Continue:
wn book -hypen

Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun book
 
11 senses of book                                                       
 
Sense 1
book
       => publication
           => work, piece of work
               => product, production
                   => creation
                       => artifact, artefact
                           => whole, unit
                               => object, physical object
                                   => physical entity
                                       => entity
 
Sense 2
book, volume
       => product, production
           => creation
               => artifact, artefact
                   => whole, unit
                       => object, physical object
                           => physical entity
                               => entity
 
Sense 3
record, record book, book
       => fact
           => information, info
               => message, content, subject matter, substance
                   => communication
                       => abstraction, abstract entity
                           => entity
 
Sense 4
script, book, playscript
       => dramatic composition, dramatic work
           => writing, written material, piece of writing
               => written communication, written language, black and white
                   => communication
                       => abstraction, abstract entity
                           => entity
 
Sense 5
ledger, leger, account book, book of account, book
       => record
           => document
               => communication
                   => abstraction, abstract entity
                       => entity
 
Sense 6
book
       => collection, aggregation, accumulation, assemblage
           => group, grouping
               => abstraction, abstract entity
                   => entity
 
Sense 7
book, rule book
       => collection, aggregation, accumulation, assemblage
           => group, grouping
               => abstraction, abstract entity
                   => entity
 
Sense 8
Koran, Quran, al-Qur'an, Book
       INSTANCE OF=> sacred text, sacred writing, religious writing, religious text
           => writing, written material, piece of writing
               => written communication, written language, black and white
                   => communication
                       => abstraction, abstract entity
                           => entity
 
Sense 9
Bible, Christian Bible, Book, Good Book, Holy Scripture, Holy Writ, Scripture, Word of God, Word
       => sacred text, sacred writing, religious writing, religious text
           => writing, written material, piece of writing
               => written communication, written language, black and white
                   => communication
                       => abstraction, abstract entity
                           => entity
 
Sense 10
book
       => section, subdivision
           => writing, written material, piece of writing
               => written communication, written language, black and white
                   => communication
                       => abstraction, abstract entity
                           => entity
           => music
               => auditory communication
                   => communication
                       => abstraction, abstract entity
                           => entity
 
Sense 11
book
       => product, production
           => creation
               => artifact, artefact
                   => whole, unit
                       => object, physical object
                           => physical entity
                               => entity

Continue:
wn book -hypon

Hyponyms of noun book
 
7 of 11 senses of book                                                  
 
Sense 1
book
       => authority
       => curiosa
       => formulary, pharmacopeia
       => trade book, trade edition
       => bestiary
       => catechism
       => pop-up book, pop-up
       => storybook
       => tome
       => booklet, brochure, folder, leaflet, pamphlet
       => textbook, text, text edition, schoolbook, school text
       => workbook
       => copybook
       => appointment book, appointment calendar
       => catalog, catalogue
       => phrase book
       => playbook
       => prayer book, prayerbook
       => reference book, reference, reference work, book of facts
       => review copy
       => songbook
       => yearbook
       HAS INSTANCE=> Das Kapital, Capital
       HAS INSTANCE=> Erewhon
       HAS INSTANCE=> Utopia
 
Sense 2
book, volume
       => album
       => coffee-table book
       => folio
       => hardback, hardcover
       => journal
       => novel
       => order book
       => paperback book, paper-back book, paperback, softback book, softback, soft-cover book, soft-cover
       => picture book
       => sketchbook, sketch block, sketch pad
       => notebook
 
Sense 3
record, record book, book
       => logbook
       => won-lost record
       => card, scorecard
 
Sense 4
script, book, playscript
       => promptbook, prompt copy
       => continuity
       => dialogue, dialog
       => libretto
       => scenario
       => screenplay
       => shooting script
 
Sense 5
ledger, leger, account book, book of account, book
       => cost ledger
       => general ledger
       => subsidiary ledger
       => daybook, journal
 
Sense 9
Bible, Christian Bible, Book, Good Book, Holy Scripture, Holy Writ, Scripture, Word of God, Word
       => family Bible
       HAS INSTANCE=> Vulgate
       HAS INSTANCE=> Douay Bible, Douay Version, Douay-Rheims Bible, Douay-Rheims Version, Rheims-Douay Bible, Rheims-Douay Version
       HAS INSTANCE=> Authorized Version, King James Version, King James Bible
       HAS INSTANCE=> Revised Version
       HAS INSTANCE=> New English Bible
       HAS INSTANCE=> American Standard Version, American Revised Version
       HAS INSTANCE=> Revised Standard Version
 
Sense 10
book
       HAS INSTANCE=> Genesis, Book of Genesis
       HAS INSTANCE=> Exodus, Book of Exodus
       HAS INSTANCE=> Leviticus, Book of Leviticus
       HAS INSTANCE=> Numbers, Book of Numbers
       HAS INSTANCE=> Deuteronomy, Book of Deuteronomy
       HAS INSTANCE=> Joshua, Josue, Book of Joshua
       HAS INSTANCE=> Judges, Book of Judges
       HAS INSTANCE=> Ruth, Book of Ruth
       HAS INSTANCE=> I Samuel, 1 Samuel
       HAS INSTANCE=> II Samuel, 2 Samuel
       HAS INSTANCE=> I Kings, 1 Kings
       HAS INSTANCE=> II Kings, 2 Kings
       HAS INSTANCE=> I Chronicles, 1 Chronicles
       HAS INSTANCE=> II Chronicles, 2 Chronicles
       HAS INSTANCE=> Ezra, Book of Ezra
       HAS INSTANCE=> Nehemiah, Book of Nehemiah
       HAS INSTANCE=> Esther, Book of Esther
       HAS INSTANCE=> Job, Book of Job
       HAS INSTANCE=> Psalms, Book of Psalms
       HAS INSTANCE=> Proverbs, Book of Proverbs
       HAS INSTANCE=> Ecclesiastes, Book of Ecclesiastes
       HAS INSTANCE=> Song of Songs, Song of Solomon, Canticle of Canticles, Canticles
       HAS INSTANCE=> Isaiah, Book of Isaiah
       HAS INSTANCE=> Jeremiah, Book of Jeremiah
       HAS INSTANCE=> Lamentations, Book of Lamentations
       HAS INSTANCE=> Ezekiel, Ezechiel, Book of Ezekiel
       HAS INSTANCE=> Daniel, Book of Daniel, Book of the Prophet Daniel
       HAS INSTANCE=> Hosea, Book of Hosea
       HAS INSTANCE=> Joel, Book of Joel
       HAS INSTANCE=> Amos, Book of Amos
       HAS INSTANCE=> Obadiah, Abdias, Book of Obadiah
       HAS INSTANCE=> Jonah, Book of Jonah
       HAS INSTANCE=> Micah, Micheas, Book of Micah
       HAS INSTANCE=> Nahum, Book of Nahum
       HAS INSTANCE=> Habakkuk, Habacuc, Book of Habakkuk
       HAS INSTANCE=> Zephaniah, Sophonias, Book of Zephaniah
       HAS INSTANCE=> Haggai, Aggeus, Book of Haggai
       HAS INSTANCE=> Zechariah, Zacharias, Book of Zachariah
       HAS INSTANCE=> Malachi, Malachias, Book of Malachi
       HAS INSTANCE=> Matthew, Gospel According to Matthew
       HAS INSTANCE=> Mark, Gospel According to Mark
       HAS INSTANCE=> Luke, Gospel of Luke, Gospel According to Luke
       HAS INSTANCE=> John, Gospel According to John
       HAS INSTANCE=> Acts of the Apostles, Acts
       => Epistle
       HAS INSTANCE=> Revelation, Revelation of Saint John the Divine, Apocalypse, Book of Revelation
       HAS INSTANCE=> Additions to Esther
       HAS INSTANCE=> Prayer of Azariah and Song of the Three Children
       HAS INSTANCE=> Susanna, Book of Susanna
       HAS INSTANCE=> Bel and the Dragon
       HAS INSTANCE=> Baruch, Book of Baruch
       HAS INSTANCE=> Letter of Jeremiah, Epistle of Jeremiah
       HAS INSTANCE=> Tobit, Book of Tobit
       HAS INSTANCE=> Judith, Book of Judith
       HAS INSTANCE=> I Esdra, 1 Esdras
       HAS INSTANCE=> II Esdras, 2 Esdras
       HAS INSTANCE=> Ben Sira, Sirach, Ecclesiasticus, Wisdom of Jesus the Son of Sirach
       HAS INSTANCE=> Wisdom of Solomon, Wisdom
       HAS INSTANCE=> I Maccabees, 1 Maccabees
       HAS INSTANCE=> II Maccabees, 2 Maccabees

Let’s test the word “dog”:

wn dog

Information available for noun dog
	-hypen		Hypernyms
	-hypon, -treen	Hyponyms & Hyponym Tree
	-synsn		Synonyms (ordered by estimated frequency)
	-smemn		Member of Holonyms
	-sprtn		Part of Holonyms
	-partn		Has Part Meronyms
	-meron		All Meronyms
	-holon		All Holonyms
	-famln		Familiarity & Polysemy Count
	-coorn		Coordinate Terms (sisters)
	-hmern		Hierarchical Meronyms
	-hholn		Hierarchical Holonyms
	-grepn		List of Compound Words
	-over		Overview of Senses
 
Information available for verb dog
	-hypev		Hypernyms
	-hypov, -treev	Hyponyms & Hyponym Tree
	-synsv		Synonyms (ordered by estimated frequency)
	-famlv		Familiarity & Polysemy Count
	-framv		Verb Frames
	-coorv		Coordinate Terms (sisters)
	-simsv		Synonyms (grouped by similarity of meaning)
	-grepv		List of Compound Words
	-over		Overview of Senses
 
No information available for adj dog
 
No information available for adv dog

Let’s find the synset for noun dog:

wn dog -synsn

Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun dog
 
7 senses of dog                                                         
 
Sense 1
dog, domestic dog, Canis familiaris
       => canine, canid
       => domestic animal, domesticated animal
 
Sense 2
frump, dog
       => unpleasant woman, disagreeable woman
 
Sense 3
dog
       => chap, fellow, feller, fella, lad, gent, blighter, cuss, bloke
 
Sense 4
cad, bounder, blackguard, dog, hound, heel
       => villain, scoundrel
 
Sense 5
frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie
       => sausage
 
Sense 6
pawl, detent, click, dog
       => catch, stop
 
Sense 7
andiron, firedog, dog, dog-iron
       => support

Let’s find the synset for verb dog:

wn dog -synsv

Synonyms/Hypernyms (Ordered by Estimated Frequency) of verb dog
 
1 sense of dog                                                          
 
Sense 1
chase, chase after, trail, tail, tag, give chase, dog, go after, track
       => pursue, follow

Just enjoy it.
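
If you prefer to query the same WordNet database from Python instead of the wn command, NLTK's WordNet corpus reader exposes synsets, definitions and hypernyms directly. A minimal sketch, assuming NLTK and its wordnet corpus data are already installed (the NLTK setup is covered later in this guide):

# Query WordNet through NLTK's corpus reader instead of the wn command.
from nltk.corpus import wordnet as wn

# All noun synsets for "dog", similar to "wn dog -synsn"
for synset in wn.synsets('dog', pos=wn.NOUN):
    print(synset.name() + ' : ' + synset.definition())

# Direct hypernyms of the first sense, similar to "wn dog -hypen"
dog = wn.synset('dog.n.01')
print(dog.hypernyms())   # [Synset('canine.n.02'), Synset('domestic_animal.n.01')]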

Posted by TextProcessing

A Beginner’s Guide to TextBlob

About TextBlob

Open Source Text Processing Project: TextBlob

Install TextBlob

Install the latest TextBlob on Ubuntu 16.04.1 LTS:

textprocessing@ubuntu:~$ sudo pip install -U textblob

Collecting textblob
Downloading textblob-0.12.0-py2.py3-none-any.whl (631kB)

Requirement already up-to-date: nltk>=3.1 in /usr/local/lib/python2.7/dist-packages (from textblob)
Requirement already up-to-date: six in /usr/local/lib/python2.7/dist-packages (from nltk>=3.1->textblob)
Installing collected packages: textblob
Successfully installed textblob-0.12.0

textprocessing@ubuntu:~$ sudo python -m textblob.download_corpora

[nltk_data] Downloading package brown to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
[nltk_data] Downloading package conll2000 to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/movie_reviews.zip.
Finished.

Test TextBlob

textprocessing@ubuntu:~$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
Type "copyright", "credits" or "license" for more information.
 
IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: from textblob import TextBlob
 
In [2]: test_text = """
Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
"""
 
In [3]: text_blob = TextBlob(test_text)
 
# Word Tokenization
In [4]: text_blob.words
Out[4]: WordList(['Text', 'mining', 'also', 'referred', 'to', 'as', 'text', 'data', 'mining', 'roughly', 'equivalent', 'to', 'text', 'analytics', 'is', 'the', 'process', 'of', 'deriving', 'high-quality', 'information', 'from', 'text', 'High-quality', 'information', 'is', 'typically', 'derived', 'through', 'the', 'devising', 'of', 'patterns', 'and', 'trends', 'through', 'means', 'such', 'as', 'statistical', 'pattern', 'learning', 'Text', 'mining', 'usually', 'involves', 'the', 'process', 'of', 'structuring', 'the', 'input', 'text', 'usually', 'parsing', 'along', 'with', 'the', 'addition', 'of', 'some', 'derived', 'linguistic', 'features', 'and', 'the', 'removal', 'of', 'others', 'and', 'subsequent', 'insertion', 'into', 'a', 'database', 'deriving', 'patterns', 'within', 'the', 'structured', 'data', 'and', 'finally', 'evaluation', 'and', 'interpretation', 'of', 'the', 'output', "'High", 'quality', 'in', 'text', 'mining', 'usually', 'refers', 'to', 'some', 'combination', 'of', 'relevance', 'novelty', 'and', 'interestingness', 'Typical', 'text', 'mining', 'tasks', 'include', 'text', 'categorization', 'text', 'clustering', 'concept/entity', 'extraction', 'production', 'of', 'granular', 'taxonomies', 'sentiment', 'analysis', 'document', 'summarization', 'and', 'entity', 'relation', 'modeling', 'i.e', 'learning', 'relations', 'between', 'named', 'entities'])
 
# Sentence Tokenization
In [5]: text_blob.sentences
Out[5]: 
[Sentence("
 Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text."),
 Sentence("High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning."),
 Sentence("Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output."),
 Sentence("'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness."),
 Sentence("Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).")]
 
In [6]: for sentence in text_blob.sentences:
   ...:     print(sentence)
   ...:     
 
Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text.
High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.
Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.
'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness.
Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
 
# Sentiment Analysis
In [7]: for sentence in text_blob.sentences:
    print(sentence.sentiment)
   ...:     
Sentiment(polarity=-0.1, subjectivity=0.4)
Sentiment(polarity=-0.08333333333333333, subjectivity=0.5)
Sentiment(polarity=-0.08, subjectivity=0.32999999999999996)
Sentiment(polarity=-0.045, subjectivity=0.39499999999999996)
Sentiment(polarity=-0.16666666666666666, subjectivity=0.5)
 
# POS Tagging
In [8]: text_blob.tags
Out[8]: 
[('Text', u'NNP'),
 ('mining', u'NN'),
 ('also', u'RB'),
 ('referred', u'VBD'),
 ('to', u'TO'),
 ('as', u'IN'),
 ('text', u'NN'),
 ('data', u'NNS'),
 ('mining', u'NN'),
 ('roughly', u'RB'),
 ('equivalent', u'JJ'),
 ('to', u'TO'),
 ('text', u'VB'),
 ('analytics', u'NNS'),
 ('is', u'VBZ'),
 ('the', u'DT'),
 ('process', u'NN'),
 ('of', u'IN'),
 ('deriving', u'VBG'),
 ('high-quality', u'JJ'),
 ('information', u'NN'),
 ('from', u'IN'),
 ('text', u'NN'),
 ('High-quality', u'NNP'),
 ('information', u'NN'),
 ('is', u'VBZ'),
 ('typically', u'RB'),
 ('derived', u'VBN'),
 ('through', u'IN'),
 ('the', u'DT'),
 ('devising', u'NN'),
 ('of', u'IN'),
 ('patterns', u'NNS'),
 ('and', u'CC'),
 ('trends', u'NNS'),
 ('through', u'IN'),
 ('means', u'NNS'),
 ('such', u'JJ'),
 ('as', u'IN'),
 ('statistical', u'JJ'),
 ('pattern', u'NN'),
 ('learning', u'VBG'),
 ('Text', u'NNP'),
 ('mining', u'NN'),
 ('usually', u'RB'),
 ('involves', u'VBZ'),
 ('the', u'DT'),
 ('process', u'NN'),
 ('of', u'IN'),
 ('structuring', u'VBG'),
 ('the', u'DT'),
 ('input', u'NN'),
 ('text', u'NN'),
 ('usually', u'RB'),
 ('parsing', u'VBG'),
 ('along', u'IN'),
 ('with', u'IN'),
 ('the', u'DT'),
 ('addition', u'NN'),
 ('of', u'IN'),
 ('some', u'DT'),
 ('derived', u'VBN'),
 ('linguistic', u'JJ'),
 ('features', u'NNS'),
 ('and', u'CC'),
 ('the', u'DT'),
 ('removal', u'NN'),
 ('of', u'IN'),
 ('others', u'NNS'),
 ('and', u'CC'),
 ('subsequent', u'JJ'),
 ('insertion', u'NN'),
 ('into', u'IN'),
 ('a', u'DT'),
 ('database', u'NN'),
 ('deriving', u'VBG'),
 ('patterns', u'NNS'),
 ('within', u'IN'),
 ('the', u'DT'),
 ('structured', u'JJ'),
 ('data', u'NNS'),
 ('and', u'CC'),
 ('finally', u'RB'),
 ('evaluation', u'NN'),
 ('and', u'CC'),
 ('interpretation', u'NN'),
 ('of', u'IN'),
 ('the', u'DT'),
 ('output', u'NN'),
 ("'High", u'JJ'),
 ('quality', u'NN'),
 ('in', u'IN'),
 ('text', u'JJ'),
 ('mining', u'NN'),
 ('usually', u'RB'),
 ('refers', u'VBZ'),
 ('to', u'TO'),
 ('some', u'DT'),
 ('combination', u'NN'),
 ('of', u'IN'),
 ('relevance', u'NN'),
 ('novelty', u'NN'),
 ('and', u'CC'),
 ('interestingness', u'NN'),
 ('Typical', u'JJ'),
 ('text', u'NN'),
 ('mining', u'NN'),
 ('tasks', u'NNS'),
 ('include', u'VBP'),
 ('text', u'JJ'),
 ('categorization', u'NN'),
 ('text', u'NN'),
 ('clustering', u'NN'),
 ('concept/entity', u'NN'),
 ('extraction', u'NN'),
 ('production', u'NN'),
 ('of', u'IN'),
 ('granular', u'JJ'),
 ('taxonomies', u'NNS'),
 ('sentiment', u'NN'),
 ('analysis', u'NN'),
 ('document', u'NN'),
 ('summarization', u'NN'),
 ('and', u'CC'),
 ('entity', u'NN'),
 ('relation', u'NN'),
 ('modeling', u'NN'),
 ('i.e.', u'FW'),
 ('learning', u'VBG'),
 ('relations', u'NNS'),
 ('between', u'IN'),
 ('named', u'VBN'),
 ('entities', u'NNS')]
 
# Noun Phrase Extraction
In [9]: text_blob.noun_phrases
Out[9]: WordList(['text', u'text data', u'text analytics', u'high-quality information', 'high-quality', u'statistical pattern learning', 'text', u'input text', u'subsequent insertion', u"'high quality", u'typical text', u'text categorization', u'concept/entity extraction', u'granular taxonomies', u'sentiment analysis', u'document summarization', u'entity relation', u'learning relations'])
 
# Sentiment Analysis
In [10]: text_blob.sentiment
Out[10]: Sentiment(polarity=-0.08393939393939392, subjectivity=0.39454545454545453)
 
# Singularize and Pluralize
In [11]: text_blob.words[-1]
Out[11]: 'entities'
 
In [12]: text_blob.words[-1].singularize()
Out[12]: 'entity'
 
In [13]: text_blob.words[1]
Out[13]: 'mining'
 
In [14]: text_blob.words[1].pluralize()
Out[14]: 'minings'
 
In [15]: text_blob.words[0]
Out[15]: 'Text'
 
In [16]: text_blob.words[0].pluralize()
Out[16]: 'Texts'
 
# Lemmatization
In [17]: from textblob import Word
 
In [18]: w = Word("are")
 
In [19]: w.lemmatize()
Out[19]: 'are'
 
In [20]: w.lemmatize('v')
Out[20]: u'be'
 
# WordNet
In [21]: from textblob.wordnet import VERB
 
In [22]: word = Word("are")
 
In [23]: word.synsets
Out[23]: 
[Synset('are.n.01'),
 Synset('be.v.01'),
 Synset('be.v.02'),
 Synset('be.v.03'),
 Synset('exist.v.01'),
 Synset('be.v.05'),
 Synset('equal.v.01'),
 Synset('constitute.v.01'),
 Synset('be.v.08'),
 Synset('embody.v.02'),
 Synset('be.v.10'),
 Synset('be.v.11'),
 Synset('be.v.12'),
 Synset('cost.v.01')]
 
In [24]: word.definitions
Out[24]: 
[u'a unit of surface area equal to 100 square meters',
 u'have the quality of being; (copula, used with an adjective or a predicate noun)',
 u'be identical to; be someone or something',
 u'occupy a certain position or area; be somewhere',
 u'have an existence, be extant',
 u'happen, occur, take place; this was during the visit to my parents\' house"',
 u'be identical or equivalent to',
 u'form or compose',
 u'work in a specific place, with a specific subject, or in a specific function',
 u'represent, as of a character on stage',
 u'spend or use time',
 u'have life, be alive',
 u'to remain unmolested, undisturbed, or uninterrupted -- used only in infinitive form',
 u'be priced at']
 
# Spelling Correction
In [25]: splling_test = TextBlob("I m ok")
 
In [26]: spelling_test = TextBlob("I m ok")
 
In [27]: print(spelling_test.correct())
I m ok
 
In [28]: splling_test = TextBlob("I havv good speling!")
 
In [29]: print(spelling_test.correct())
I m ok
 
In [30]: print(splling_test.correct())
I have good spelling!
 
# Translation
In [31]: text_blob.translate(to='zh')
Out[31]: TextBlob("文本挖掘,也称为文本数据挖掘,大致相当于文本分析,是从文本中获取高质量信息的过程。高质量的信息通常是通过统计模式学习等手段来设计模式和趋势。文本挖掘通常涉及构造输入文本的过程(通常解析,以及添加一些派生的语言特征以及删除其他内容,并随后插入数据库),导出结构化数据中的模式,最后进行评估和解释的输出。文本挖掘中的“高质量”通常指相关性,新颖性和趣味性的一些组合。典型的文本挖掘任务包括文本分类,文本聚类,概念/实体提取,粒度分类法的生成,情绪分析,文档摘要和实体关系建模(即命名实体之间的学习关系)。")
 
# Language Detection
In [36]: text_blob2 = TextBlob(u"这是中文测试")
 
In [37]: text_blob2.detect_language()
Out[37]: u'zh-CN'
 
# Parser
In [39]: text_blob.parse()
Out[39]: u"Text/NN/B-NP/O mining/NN/I-NP/O ,/,/O/O also/RB/B-VP/O referred/VBN/I-VP/O to/TO/B-PP/B-PNP as/IN/I-PP/I-PNP text/NN/B-NP/I-PNP data/NNS/I-NP/I-PNP mining/NN/I-NP/I-PNP ,/,/O/O roughly/RB/B-ADVP/O equivalent/NN/B-NP/O to/TO/B-PP/B-PNP text/NN/B-NP/I-PNP analytics/NNS/I-NP/I-PNP ,/,/O/O is/VBZ/B-VP/O the/DT/B-NP/O process/NN/I-NP/O of/IN/B-PP/B-PNP deriving/VBG/B-VP/I-PNP high-quality/JJ/B-NP/I-PNP information/NN/I-NP/I-PNP from/IN/B-PP/B-PNP text/NN/B-NP/I-PNP ././O/O\nHigh-quality/JJ/B-NP/O information/NN/I-NP/O is/VBZ/B-VP/O typically/RB/I-VP/O derived/VBN/I-VP/O through/IN/B-PP/O the/DT/O/O devising/VBG/B-VP/O of/IN/B-PP/B-PNP patterns/NNS/B-NP/I-PNP and/CC/I-NP/I-PNP trends/NNS/I-NP/I-PNP through/IN/B-PP/O means/VBZ/B-VP/O such/JJ/B-ADJP/O as/IN/B-PP/B-PNP statistical/JJ/B-NP/I-PNP pattern/NN/I-NP/I-PNP learning/VBG/B-VP/I-PNP ././O/O\nText/NN/B-NP/O mining/NN/I-NP/O usually/RB/B-VP/O involves/VBZ/I-VP/O the/DT/B-NP/O process/NN/I-NP/O of/IN/B-PP/B-PNP structuring/VBG/B-VP/I-PNP the/DT/B-NP/I-PNP input/NN/I-NP/I-PNP text/NN/I-NP/I-PNP (/(/O/O usually/RB/B-VP/O parsing/VBG/I-VP/O ,/,/O/O along/IN/B-PP/B-PNP with/IN/I-PP/I-PNP the/DT/B-NP/I-PNP addition/NN/I-NP/I-PNP of/IN/B-PP/O some/DT/O/O derived/VBN/B-VP/O linguistic/JJ/B-NP/O features/NNS/I-NP/O and/CC/O/O the/DT/B-NP/O removal/NN/I-NP/O of/IN/B-PP/B-PNP others/NNS/B-NP/I-PNP ,/,/O/O and/CC/O/O subsequent/JJ/B-NP/O insertion/NN/I-NP/O into/IN/B-PP/B-PNP a/DT/B-NP/I-PNP database/NN/I-NP/I-PNP )/)/O/O ,/,/O/O deriving/VBG/B-VP/O patterns/NNS/B-NP/O within/IN/B-PP/O the/DT/O/O structured/VBN/B-VP/O data/NNS/B-NP/O ,/,/O/O and/CC/O/O finally/RB/B-ADVP/O evaluation/NN/B-NP/O and/CC/O/O interpretation/NN/B-NP/O of/IN/B-PP/B-PNP the/DT/B-NP/I-PNP output/NN/I-NP/I-PNP ././O/O\n'/POS/O/O High/NNP/B-NP/O quality/NN/I-NP/O '/POS/O/O in/IN/B-PP/B-PNP text/NN/B-NP/I-PNP mining/NN/I-NP/I-PNP usually/RB/B-VP/O refers/VBZ/I-VP/O to/TO/B-PP/B-PNP some/DT/B-NP/I-PNP combination/NN/I-NP/I-PNP of/IN/B-PP/B-PNP relevance/NN/B-NP/I-PNP ,/,/O/O novelty/NN/B-NP/O ,/,/O/O and/CC/O/O interestingness/NN/B-NP/O ././O/O\nTypical/JJ/B-NP/O text/NN/I-NP/O mining/NN/I-NP/O tasks/NNS/I-NP/O include/VBP/B-VP/O text/NN/B-NP/O categorization/NN/I-NP/O ,/,/O/O text/NN/B-NP/O clustering/VBG/B-VP/O ,/,/O/O concept&slash;entity/NN/B-NP/O extraction/NN/I-NP/O ,/,/O/O production/NN/B-NP/O of/IN/B-PP/B-PNP granular/JJ/B-NP/I-PNP taxonomies/NNS/I-NP/I-PNP ,/,/O/O sentiment/NN/B-NP/O analysis/NN/I-NP/O ,/,/O/O document/NN/B-NP/O summarization/NN/I-NP/O ,/,/O/O and/CC/O/O entity/NN/B-NP/O relation/NN/I-NP/O modeling/NN/I-NP/O (/(/O/O i.e./FW/O/O ,/,/O/O learning/VBG/B-VP/O relations/NNS/B-NP/O between/IN/B-PP/B-PNP named/VBN/B-VP/I-PNP entities/NNS/B-NP/I-PNP )/)/O/O ././O/O"
 
# Ngrams
In [40]: text_blob.ngrams(n=1)
Out[40]: 
[WordList(['Text']),
 WordList(['mining']),
 WordList(['also']),
 WordList(['referred']),
 WordList(['to']),
 WordList(['as']),
 WordList(['text']),
 WordList(['data']),
 WordList(['mining']),
 WordList(['roughly']),
 WordList(['equivalent']),
 WordList(['to']),
 WordList(['text']),
 WordList(['analytics']),
 WordList(['is']),
 WordList(['the']),
 WordList(['process']),
 WordList(['of']),
 WordList(['deriving']),
 WordList(['high-quality']),
 WordList(['information']),
 WordList(['from']),
 WordList(['text']),
 WordList(['High-quality']),
 WordList(['information']),
 WordList(['is']),
 WordList(['typically']),
 WordList(['derived']),
 WordList(['through']),
 WordList(['the']),
 WordList(['devising']),
 WordList(['of']),
 WordList(['patterns']),
 WordList(['and']),
 WordList(['trends']),
 WordList(['through']),
 WordList(['means']),
 WordList(['such']),
 WordList(['as']),
 WordList(['statistical']),
 WordList(['pattern']),
 WordList(['learning']),
 WordList(['Text']),
 WordList(['mining']),
 WordList(['usually']),
 WordList(['involves']),
 WordList(['the']),
 WordList(['process']),
 WordList(['of']),
 WordList(['structuring']),
 WordList(['the']),
 WordList(['input']),
 WordList(['text']),
 WordList(['usually']),
 WordList(['parsing']),
 WordList(['along']),
 WordList(['with']),
 WordList(['the']),
 WordList(['addition']),
 WordList(['of']),
 WordList(['some']),
 WordList(['derived']),
 WordList(['linguistic']),
 WordList(['features']),
 WordList(['and']),
 WordList(['the']),
 WordList(['removal']),
 WordList(['of']),
 WordList(['others']),
 WordList(['and']),
 WordList(['subsequent']),
 WordList(['insertion']),
 WordList(['into']),
 WordList(['a']),
 WordList(['database']),
 WordList(['deriving']),
 WordList(['patterns']),
 WordList(['within']),
 WordList(['the']),
 WordList(['structured']),
 WordList(['data']),
 WordList(['and']),
 WordList(['finally']),
 WordList(['evaluation']),
 WordList(['and']),
 WordList(['interpretation']),
 WordList(['of']),
 WordList(['the']),
 WordList(['output']),
 WordList(["'High"]),
 WordList(['quality']),
 WordList(['in']),
 WordList(['text']),
 WordList(['mining']),
 WordList(['usually']),
 WordList(['refers']),
 WordList(['to']),
 WordList(['some']),
 WordList(['combination']),
 WordList(['of']),
 WordList(['relevance']),
 WordList(['novelty']),
 WordList(['and']),
 WordList(['interestingness']),
 WordList(['Typical']),
 WordList(['text']),
 WordList(['mining']),
 WordList(['tasks']),
 WordList(['include']),
 WordList(['text']),
 WordList(['categorization']),
 WordList(['text']),
 WordList(['clustering']),
 WordList(['concept/entity']),
 WordList(['extraction']),
 WordList(['production']),
 WordList(['of']),
 WordList(['granular']),
 WordList(['taxonomies']),
 WordList(['sentiment']),
 WordList(['analysis']),
 WordList(['document']),
 WordList(['summarization']),
 WordList(['and']),
 WordList(['entity']),
 WordList(['relation']),
 WordList(['modeling']),
 WordList(['i.e']),
 WordList(['learning']),
 WordList(['relations']),
 WordList(['between']),
 WordList(['named']),
 WordList(['entities'])]
 
In [41]: text_blob.ngrams(n=2)
Out[41]: 
[WordList(['Text', 'mining']),
 WordList(['mining', 'also']),
 WordList(['also', 'referred']),
 WordList(['referred', 'to']),
 WordList(['to', 'as']),
 WordList(['as', 'text']),
 WordList(['text', 'data']),
 WordList(['data', 'mining']),
 WordList(['mining', 'roughly']),
 WordList(['roughly', 'equivalent']),
 WordList(['equivalent', 'to']),
 WordList(['to', 'text']),
 WordList(['text', 'analytics']),
 WordList(['analytics', 'is']),
 WordList(['is', 'the']),
 WordList(['the', 'process']),
 WordList(['process', 'of']),
 WordList(['of', 'deriving']),
 WordList(['deriving', 'high-quality']),
 WordList(['high-quality', 'information']),
 WordList(['information', 'from']),
 WordList(['from', 'text']),
 WordList(['text', 'High-quality']),
 WordList(['High-quality', 'information']),
 WordList(['information', 'is']),
 WordList(['is', 'typically']),
 WordList(['typically', 'derived']),
 WordList(['derived', 'through']),
 WordList(['through', 'the']),
 WordList(['the', 'devising']),
 WordList(['devising', 'of']),
 WordList(['of', 'patterns']),
 WordList(['patterns', 'and']),
 WordList(['and', 'trends']),
 WordList(['trends', 'through']),
 WordList(['through', 'means']),
 WordList(['means', 'such']),
 WordList(['such', 'as']),
 WordList(['as', 'statistical']),
 WordList(['statistical', 'pattern']),
 WordList(['pattern', 'learning']),
 WordList(['learning', 'Text']),
 WordList(['Text', 'mining']),
 WordList(['mining', 'usually']),
 WordList(['usually', 'involves']),
 WordList(['involves', 'the']),
 WordList(['the', 'process']),
 WordList(['process', 'of']),
 WordList(['of', 'structuring']),
 WordList(['structuring', 'the']),
 WordList(['the', 'input']),
 WordList(['input', 'text']),
 WordList(['text', 'usually']),
 WordList(['usually', 'parsing']),
 WordList(['parsing', 'along']),
 WordList(['along', 'with']),
 WordList(['with', 'the']),
 WordList(['the', 'addition']),
 WordList(['addition', 'of']),
 WordList(['of', 'some']),
 WordList(['some', 'derived']),
 WordList(['derived', 'linguistic']),
 WordList(['linguistic', 'features']),
 WordList(['features', 'and']),
 WordList(['and', 'the']),
 WordList(['the', 'removal']),
 WordList(['removal', 'of']),
 WordList(['of', 'others']),
 WordList(['others', 'and']),
 WordList(['and', 'subsequent']),
 WordList(['subsequent', 'insertion']),
 WordList(['insertion', 'into']),
 WordList(['into', 'a']),
 WordList(['a', 'database']),
 WordList(['database', 'deriving']),
 WordList(['deriving', 'patterns']),
 WordList(['patterns', 'within']),
 WordList(['within', 'the']),
 WordList(['the', 'structured']),
 WordList(['structured', 'data']),
 WordList(['data', 'and']),
 WordList(['and', 'finally']),
 WordList(['finally', 'evaluation']),
 WordList(['evaluation', 'and']),
 WordList(['and', 'interpretation']),
 WordList(['interpretation', 'of']),
 WordList(['of', 'the']),
 WordList(['the', 'output']),
 WordList(['output', "'High"]),
 WordList(["'High", 'quality']),
 WordList(['quality', 'in']),
 WordList(['in', 'text']),
 WordList(['text', 'mining']),
 WordList(['mining', 'usually']),
 WordList(['usually', 'refers']),
 WordList(['refers', 'to']),
 WordList(['to', 'some']),
 WordList(['some', 'combination']),
 WordList(['combination', 'of']),
 WordList(['of', 'relevance']),
 WordList(['relevance', 'novelty']),
 WordList(['novelty', 'and']),
 WordList(['and', 'interestingness']),
 WordList(['interestingness', 'Typical']),
 WordList(['Typical', 'text']),
 WordList(['text', 'mining']),
 WordList(['mining', 'tasks']),
 WordList(['tasks', 'include']),
 WordList(['include', 'text']),
 WordList(['text', 'categorization']),
 WordList(['categorization', 'text']),
 WordList(['text', 'clustering']),
 WordList(['clustering', 'concept/entity']),
 WordList(['concept/entity', 'extraction']),
 WordList(['extraction', 'production']),
 WordList(['production', 'of']),
 WordList(['of', 'granular']),
 WordList(['granular', 'taxonomies']),
 WordList(['taxonomies', 'sentiment']),
 WordList(['sentiment', 'analysis']),
 WordList(['analysis', 'document']),
 WordList(['document', 'summarization']),
 WordList(['summarization', 'and']),
 WordList(['and', 'entity']),
 WordList(['entity', 'relation']),
 WordList(['relation', 'modeling']),
 WordList(['modeling', 'i.e']),
 WordList(['i.e', 'learning']),
 WordList(['learning', 'relations']),
 WordList(['relations', 'between']),
 WordList(['between', 'named']),
 WordList(['named', 'entities'])]
 
In [42]: text_blob.ngrams(n=4)
Out[42]: 
[WordList(['Text', 'mining', 'also', 'referred']),
 WordList(['mining', 'also', 'referred', 'to']),
 WordList(['also', 'referred', 'to', 'as']),
 WordList(['referred', 'to', 'as', 'text']),
 WordList(['to', 'as', 'text', 'data']),
 WordList(['as', 'text', 'data', 'mining']),
 WordList(['text', 'data', 'mining', 'roughly']),
 WordList(['data', 'mining', 'roughly', 'equivalent']),
 WordList(['mining', 'roughly', 'equivalent', 'to']),
 WordList(['roughly', 'equivalent', 'to', 'text']),
 WordList(['equivalent', 'to', 'text', 'analytics']),
 WordList(['to', 'text', 'analytics', 'is']),
 WordList(['text', 'analytics', 'is', 'the']),
 WordList(['analytics', 'is', 'the', 'process']),
 WordList(['is', 'the', 'process', 'of']),
 WordList(['the', 'process', 'of', 'deriving']),
 WordList(['process', 'of', 'deriving', 'high-quality']),
 WordList(['of', 'deriving', 'high-quality', 'information']),
 WordList(['deriving', 'high-quality', 'information', 'from']),
 WordList(['high-quality', 'information', 'from', 'text']),
 WordList(['information', 'from', 'text', 'High-quality']),
 WordList(['from', 'text', 'High-quality', 'information']),
 WordList(['text', 'High-quality', 'information', 'is']),
 WordList(['High-quality', 'information', 'is', 'typically']),
 WordList(['information', 'is', 'typically', 'derived']),
 WordList(['is', 'typically', 'derived', 'through']),
 WordList(['typically', 'derived', 'through', 'the']),
 WordList(['derived', 'through', 'the', 'devising']),
 WordList(['through', 'the', 'devising', 'of']),
 WordList(['the', 'devising', 'of', 'patterns']),
 WordList(['devising', 'of', 'patterns', 'and']),
 WordList(['of', 'patterns', 'and', 'trends']),
 WordList(['patterns', 'and', 'trends', 'through']),
 WordList(['and', 'trends', 'through', 'means']),
 WordList(['trends', 'through', 'means', 'such']),
 WordList(['through', 'means', 'such', 'as']),
 WordList(['means', 'such', 'as', 'statistical']),
 WordList(['such', 'as', 'statistical', 'pattern']),
 WordList(['as', 'statistical', 'pattern', 'learning']),
 WordList(['statistical', 'pattern', 'learning', 'Text']),
 WordList(['pattern', 'learning', 'Text', 'mining']),
 WordList(['learning', 'Text', 'mining', 'usually']),
 WordList(['Text', 'mining', 'usually', 'involves']),
 WordList(['mining', 'usually', 'involves', 'the']),
 WordList(['usually', 'involves', 'the', 'process']),
 WordList(['involves', 'the', 'process', 'of']),
 WordList(['the', 'process', 'of', 'structuring']),
 WordList(['process', 'of', 'structuring', 'the']),
 WordList(['of', 'structuring', 'the', 'input']),
 WordList(['structuring', 'the', 'input', 'text']),
 WordList(['the', 'input', 'text', 'usually']),
 WordList(['input', 'text', 'usually', 'parsing']),
 WordList(['text', 'usually', 'parsing', 'along']),
 WordList(['usually', 'parsing', 'along', 'with']),
 WordList(['parsing', 'along', 'with', 'the']),
 WordList(['along', 'with', 'the', 'addition']),
 WordList(['with', 'the', 'addition', 'of']),
 WordList(['the', 'addition', 'of', 'some']),
 WordList(['addition', 'of', 'some', 'derived']),
 WordList(['of', 'some', 'derived', 'linguistic']),
 WordList(['some', 'derived', 'linguistic', 'features']),
 WordList(['derived', 'linguistic', 'features', 'and']),
 WordList(['linguistic', 'features', 'and', 'the']),
 WordList(['features', 'and', 'the', 'removal']),
 WordList(['and', 'the', 'removal', 'of']),
 WordList(['the', 'removal', 'of', 'others']),
 WordList(['removal', 'of', 'others', 'and']),
 WordList(['of', 'others', 'and', 'subsequent']),
 WordList(['others', 'and', 'subsequent', 'insertion']),
 WordList(['and', 'subsequent', 'insertion', 'into']),
 WordList(['subsequent', 'insertion', 'into', 'a']),
 WordList(['insertion', 'into', 'a', 'database']),
 WordList(['into', 'a', 'database', 'deriving']),
 WordList(['a', 'database', 'deriving', 'patterns']),
 WordList(['database', 'deriving', 'patterns', 'within']),
 WordList(['deriving', 'patterns', 'within', 'the']),
 WordList(['patterns', 'within', 'the', 'structured']),
 WordList(['within', 'the', 'structured', 'data']),
 WordList(['the', 'structured', 'data', 'and']),
 WordList(['structured', 'data', 'and', 'finally']),
 WordList(['data', 'and', 'finally', 'evaluation']),
 WordList(['and', 'finally', 'evaluation', 'and']),
 WordList(['finally', 'evaluation', 'and', 'interpretation']),
 WordList(['evaluation', 'and', 'interpretation', 'of']),
 WordList(['and', 'interpretation', 'of', 'the']),
 WordList(['interpretation', 'of', 'the', 'output']),
 WordList(['of', 'the', 'output', "'High"]),
 WordList(['the', 'output', "'High", 'quality']),
 WordList(['output', "'High", 'quality', 'in']),
 WordList(["'High", 'quality', 'in', 'text']),
 WordList(['quality', 'in', 'text', 'mining']),
 WordList(['in', 'text', 'mining', 'usually']),
 WordList(['text', 'mining', 'usually', 'refers']),
 WordList(['mining', 'usually', 'refers', 'to']),
 WordList(['usually', 'refers', 'to', 'some']),
 WordList(['refers', 'to', 'some', 'combination']),
 WordList(['to', 'some', 'combination', 'of']),
 WordList(['some', 'combination', 'of', 'relevance']),
 WordList(['combination', 'of', 'relevance', 'novelty']),
 WordList(['of', 'relevance', 'novelty', 'and']),
 WordList(['relevance', 'novelty', 'and', 'interestingness']),
 WordList(['novelty', 'and', 'interestingness', 'Typical']),
 WordList(['and', 'interestingness', 'Typical', 'text']),
 WordList(['interestingness', 'Typical', 'text', 'mining']),
 WordList(['Typical', 'text', 'mining', 'tasks']),
 WordList(['text', 'mining', 'tasks', 'include']),
 WordList(['mining', 'tasks', 'include', 'text']),
 WordList(['tasks', 'include', 'text', 'categorization']),
 WordList(['include', 'text', 'categorization', 'text']),
 WordList(['text', 'categorization', 'text', 'clustering']),
 WordList(['categorization', 'text', 'clustering', 'concept/entity']),
 WordList(['text', 'clustering', 'concept/entity', 'extraction']),
 WordList(['clustering', 'concept/entity', 'extraction', 'production']),
 WordList(['concept/entity', 'extraction', 'production', 'of']),
 WordList(['extraction', 'production', 'of', 'granular']),
 WordList(['production', 'of', 'granular', 'taxonomies']),
 WordList(['of', 'granular', 'taxonomies', 'sentiment']),
 WordList(['granular', 'taxonomies', 'sentiment', 'analysis']),
 WordList(['taxonomies', 'sentiment', 'analysis', 'document']),
 WordList(['sentiment', 'analysis', 'document', 'summarization']),
 WordList(['analysis', 'document', 'summarization', 'and']),
 WordList(['document', 'summarization', 'and', 'entity']),
 WordList(['summarization', 'and', 'entity', 'relation']),
 WordList(['and', 'entity', 'relation', 'modeling']),
 WordList(['entity', 'relation', 'modeling', 'i.e']),
 WordList(['relation', 'modeling', 'i.e', 'learning']),
 WordList(['modeling', 'i.e', 'learning', 'relations']),
 WordList(['i.e', 'learning', 'relations', 'between']),
 WordList(['learning', 'relations', 'between', 'named']),
 WordList(['relations', 'between', 'named', 'entities'])]

Posted by TextProcessing

Getting started with Word2Vec

1. Source by Google

Project with Code: Word2Vec

Blog: Learning the meaning behind words

Paper:
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

Note: The new model architectures:

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

Note: The Skip-gram Model with Hierarchical Softmax and Negative Sampling

[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

Note: It seems there is no more information available.

[4] Tomas Mikolov, Quoc V. Le, Ilya Sutskever. Exploiting Similarities among Languages for Machine Translation

Note: An interesting word2vec application to SMT

[5] NIPS DeepLearning Workshop NN for Text by Tomas Mikolov et al.

2. Best explanations of the original models, optimization methods, back-propagation background, and the Word Embedding Visual Inspector

Paper: word2vec Parameter Learning Explained

Slides: Word Embedding Explained and Visualized

Youtube Video: Word Embedding Explained and Visualized – word2vec and wevi

Demo: wevi: word embedding visual inspector

3. Word2Vec Tutorials:

Word2Vec Tutorial by Chris McCormick:

a) Word2Vec Tutorial – The Skip-Gram Model
Note: It skips over the usual introductory and abstract insights about Word2Vec and gets into more of the details; a tiny sketch of skip-gram pair generation follows at the end of this section.

b) Word2Vec Tutorial Part 2 – Negative Sampling

Alex Minnaar’s Tutorials

The original article URLs are down; the following PDF versions are provided by Chris McCormick:

a) Word2Vec Tutorial Part I: The Skip-Gram Model

b) Word2Vec Tutorial Part II: The Continuous Bag-of-Words Model
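
Before working through these tutorials, it may help to see what the skip-gram model is trained on. The following is not from the tutorials; it is a minimal, framework-free sketch of how (center word, context word) training pairs are generated with a context window of 2:

# Minimal illustration of skip-gram pair generation (no training involved):
# each center word is paired with every word inside its context window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence)[:6])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox'), ('brown', 'the')]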

4. Learning by Coding

Distributed Representations of Sentences and Documents

Python Word2Vec by Gensim related articles:

a) Deep learning with word2vec and gensim, Part One

b) Word2vec in Python, Part Two: Optimizing

c) Parallelizing word2vec in Python, Part Three

d) Gensim word2vec document: models.word2vec – Deep learning with word2vec

e) Word2vec Tutorial by Radim Řehůřek

Note: A simple but very powerful tutorial for word2vec model training in gensim; a minimal training sketch follows after this list.

An Anatomy of Key Tricks in word2vec project with examples
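
As noted above, here is a minimal gensim training sketch on a toy in-memory corpus; it assumes gensim is installed (pip install gensim) and uses the older parameter name size for the vector dimensionality (newer gensim releases call it vector_size). It is only a sketch, not a substitute for the tutorials:

# Train a tiny word2vec model with gensim on a toy corpus.
from gensim.models import Word2Vec

sentences = [
    ["text", "mining", "derives", "information", "from", "text"],
    ["word2vec", "learns", "word", "vectors", "from", "text"],
    ["gensim", "implements", "word2vec", "in", "python"],
]

# size: embedding dimensionality; window: context size;
# min_count=1 keeps every word of this tiny corpus in the vocabulary.
model = Word2Vec(sentences, size=50, window=5, min_count=1, workers=2)
print(model.wv.most_similar("text", topn=3))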

5. Other Word2Vec Resources:

Word2Vec Resources by Chris McCormick

Posted by TextProcessing

Getting started with NLTK

About NLTK

Open Source Text Processing Project: NLTK

Install NLTK

1. Install the latest NLTK package on Ubuntu 16.04.1 LTS:

textprocessing@ubuntu:~$ sudo pip install -U nltk

Collecting nltk
Downloading nltk-3.2.2.tar.gz (1.2MB)
35% |███████████▍ | 409kB 20.8MB/s eta 0:00:0
……
100% |████████████████████████████████| 1.2MB 814kB/s
Collecting six (from nltk)
Downloading six-1.10.0-py2.py3-none-any.whl
Installing collected packages: six, nltk
Running setup.py install for nltk … done
Successfully installed nltk-3.2.2 six-1.10.0

2. Install Numpy (optional):

textprocessing@ubuntu:~$ sudo pip install -U numpy

Collecting numpy
Downloading numpy-1.12.0-cp27-cp27mu-manylinux1_x86_64.whl (16.5MB)
34% |███████████▏ | 5.7MB 30.8MB/s eta 0:00:0
……
100% |████████████████████████████████| 16.5MB 37kB/s
Installing collected packages: numpy
Successfully installed numpy-1.12.0

3. Test installation: run python then type import nltk

textprocessing@ubuntu:~$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]: import nltk

In [2]: nltk.__version__
Out[2]: '3.2.2'

It seems that you have installed NLTK, but if you test the simplest word tokenization, you will meet a problem:

In [3]: sentence = "this's a test"

In [4]: tokens = nltk.word_tokenize(sentence)
---------------------------------------------------------------------------
LookupError Traceback (most recent call last)
in ()
----> 1 tokens = nltk.word_tokenize(sentence)

/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.pyc in word_tokenize(text, language)
107 :param language: the model name in the Punkt corpus
108 """
--> 109 return [token for sent in sent_tokenize(text, language)
110 for token in _treebank_word_tokenize(sent)]
111

/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.pyc in sent_tokenize(text, language)
91 :param language: the model name in the Punkt corpus
92 """
---> 93 tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
94 return tokenizer.tokenize(text)
95

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
806
807 # Load the resource.
--> 808 opened_resource = _open(resource_url)
809
810 if format == 'raw':

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in _open(resource_url)
924
925 if protocol is None or protocol.lower() == 'nltk':
--> 926 return find(path_, path + ['']).open()
927 elif protocol.lower() == 'file':
928 # urllib might not use mode='rb', so handle this one ourselves:

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in find(resource_name, paths)
646 sep = '*' * 70
647 resource_not_found = '\n%s\n%s\n%s' % (sep, msg, sep)
--> 648 raise LookupError(resource_not_found)
649
650

LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/textprocessing/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
**********************************************************************

Install NLTK Data

NLTK comes with many corpora, toy grammars, trained models, etc., all packaged as nltk_data. You need to install nltk_data before you can really use NLTK.
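
The interactive downloader below fetches everything; if you only need a specific resource, for example the Punkt tokenizer models that the error above asked for, you can download it by name. A minimal sketch:

# Download only the Punkt models required by nltk.word_tokenize.
import nltk

nltk.download('punkt')

sentence = "this's a test"
print(nltk.word_tokenize(sentence))   # ['this', "'s", 'a', 'test']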

In [5]: nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
Identifier> all
Downloading collection u'all'
|
| Downloading package abc to /home/textprocessing/nltk_data…
| Unzipping corpora/abc.zip.
| Downloading package alpino to
| /home/textprocessing/nltk_data…
| Unzipping corpora/alpino.zip.
| Downloading package biocreative_ppi to
| /home/textprocessing/nltk_data…
| Unzipping corpora/biocreative_ppi.zip.
| Downloading package brown to
| /home/textprocessing/nltk_data…
| Unzipping corpora/brown.zip.
| Downloading package brown_tei to
| /home/textprocessing/nltk_data…
| Unzipping corpora/brown_tei.zip.
| Downloading package cess_cat to
| /home/textprocessing/nltk_data…
| Unzipping corpora/cess_cat.zip.
| Downloading package cess_esp to
| /home/textprocessing/nltk_data…
| Unzipping corpora/cess_esp.zip.
| Downloading package chat80 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/chat80.zip.
| Downloading package city_database to
| /home/textprocessing/nltk_data…
| Unzipping corpora/city_database.zip.
| Downloading package cmudict to
| /home/textprocessing/nltk_data…
| Unzipping corpora/cmudict.zip.
| Downloading package comparative_sentences to
| /home/textprocessing/nltk_data…
| Unzipping corpora/comparative_sentences.zip.
| Downloading package comtrans to
| /home/textprocessing/nltk_data…
| Downloading package conll2000 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/conll2000.zip.
| Downloading package conll2002 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/conll2002.zip.
| Downloading package conll2007 to
| /home/textprocessing/nltk_data…
| Downloading package crubadan to
| /home/textprocessing/nltk_data…
| Unzipping corpora/crubadan.zip.
| Downloading package dependency_treebank to
| /home/textprocessing/nltk_data…
| Unzipping corpora/dependency_treebank.zip.
| Downloading package europarl_raw to
| /home/textprocessing/nltk_data…
| Unzipping corpora/europarl_raw.zip.
| Downloading package floresta to
| /home/textprocessing/nltk_data…
| Unzipping corpora/floresta.zip.
| Downloading package framenet_v15 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/framenet_v15.zip.
| Downloading package framenet_v17 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/framenet_v17.zip.
| Downloading package gazetteers to
| /home/textprocessing/nltk_data…
| Unzipping corpora/gazetteers.zip.
| Downloading package genesis to
| /home/textprocessing/nltk_data…
| Unzipping corpora/genesis.zip.
| Downloading package gutenberg to
| /home/textprocessing/nltk_data…
| Unzipping corpora/gutenberg.zip.
| Downloading package ieer to /home/textprocessing/nltk_data…
| Unzipping corpora/ieer.zip.
| Downloading package inaugural to
| /home/textprocessing/nltk_data…
| Unzipping corpora/inaugural.zip.
| Downloading package indian to
| /home/textprocessing/nltk_data…
| Unzipping corpora/indian.zip.
| Downloading package jeita to
| /home/textprocessing/nltk_data…
| Downloading package kimmo to
| /home/textprocessing/nltk_data…
| Unzipping corpora/kimmo.zip.
| Downloading package knbc to /home/textprocessing/nltk_data…
| Downloading package lin_thesaurus to
| /home/textprocessing/nltk_data…
| Unzipping corpora/lin_thesaurus.zip.
| Downloading package mac_morpho to
| /home/textprocessing/nltk_data…
| Unzipping corpora/mac_morpho.zip.
| Downloading package machado to
| /home/textprocessing/nltk_data…
| Downloading package masc_tagged to
| /home/textprocessing/nltk_data…
| Downloading package moses_sample to
| /home/textprocessing/nltk_data…
| Unzipping models/moses_sample.zip.
| Downloading package movie_reviews to
| /home/textprocessing/nltk_data…
| Unzipping corpora/movie_reviews.zip.
| Downloading package names to
| /home/textprocessing/nltk_data…
| Unzipping corpora/names.zip.
| Downloading package nombank.1.0 to
| /home/textprocessing/nltk_data…
| Downloading package nps_chat to
| /home/textprocessing/nltk_data…
| Unzipping corpora/nps_chat.zip.
| Downloading package omw to /home/textprocessing/nltk_data…
| Unzipping corpora/omw.zip.
| Downloading package opinion_lexicon to
| /home/textprocessing/nltk_data…
| Unzipping corpora/opinion_lexicon.zip.
| Downloading package paradigms to
| /home/textprocessing/nltk_data…
| Unzipping corpora/paradigms.zip.
| Downloading package pil to /home/textprocessing/nltk_data…
| Unzipping corpora/pil.zip.
| Downloading package pl196x to
| /home/textprocessing/nltk_data…
| Unzipping corpora/pl196x.zip.
| Downloading package ppattach to
| /home/textprocessing/nltk_data…
| Unzipping corpora/ppattach.zip.
| Downloading package problem_reports to
| /home/textprocessing/nltk_data…
| Unzipping corpora/problem_reports.zip.
| Downloading package propbank to
| /home/textprocessing/nltk_data…
| Downloading package ptb to /home/textprocessing/nltk_data…
| Unzipping corpora/ptb.zip.
| Downloading package product_reviews_1 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/product_reviews_1.zip.
| Downloading package product_reviews_2 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/product_reviews_2.zip.
| Downloading package pros_cons to
| /home/textprocessing/nltk_data…
| Unzipping corpora/pros_cons.zip.
| Downloading package qc to /home/textprocessing/nltk_data…
| Unzipping corpora/qc.zip.
| Downloading package reuters to
| /home/textprocessing/nltk_data…
| Downloading package rte to /home/textprocessing/nltk_data…
| Unzipping corpora/rte.zip.
| Downloading package semcor to
| /home/textprocessing/nltk_data…
| Downloading package senseval to
| /home/textprocessing/nltk_data…
| Unzipping corpora/senseval.zip.
| Downloading package sentiwordnet to
| /home/textprocessing/nltk_data…
| Unzipping corpora/sentiwordnet.zip.
| Downloading package sentence_polarity to
| /home/textprocessing/nltk_data…
| Unzipping corpora/sentence_polarity.zip.
| Downloading package shakespeare to
| /home/textprocessing/nltk_data…
| Unzipping corpora/shakespeare.zip.
| Downloading package sinica_treebank to
| /home/textprocessing/nltk_data…
| Unzipping corpora/sinica_treebank.zip.
| Downloading package smultron to
| /home/textprocessing/nltk_data…
| Unzipping corpora/smultron.zip.
| Downloading package state_union to
| /home/textprocessing/nltk_data…
| Unzipping corpora/state_union.zip.
| Downloading package stopwords to
| /home/textprocessing/nltk_data…
| Unzipping corpora/stopwords.zip.
| Downloading package subjectivity to
| /home/textprocessing/nltk_data…
| Unzipping corpora/subjectivity.zip.
| Downloading package swadesh to
| /home/textprocessing/nltk_data…
| Unzipping corpora/swadesh.zip.
| Downloading package switchboard to
| /home/textprocessing/nltk_data…
| Unzipping corpora/switchboard.zip.
| Downloading package timit to
| /home/textprocessing/nltk_data…
| Unzipping corpora/timit.zip.
| Downloading package toolbox to
| /home/textprocessing/nltk_data…
| Unzipping corpora/toolbox.zip.
| Downloading package treebank to
| /home/textprocessing/nltk_data…
| Unzipping corpora/treebank.zip.
| Downloading package twitter_samples to
| /home/textprocessing/nltk_data…
| Unzipping corpora/twitter_samples.zip.
| Downloading package udhr to /home/textprocessing/nltk_data…
| Unzipping corpora/udhr.zip.
| Downloading package udhr2 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/udhr2.zip.
| Downloading package unicode_samples to
| /home/textprocessing/nltk_data…
| Unzipping corpora/unicode_samples.zip.
| Downloading package universal_treebanks_v20 to
| /home/textprocessing/nltk_data…
| Downloading package verbnet to
| /home/textprocessing/nltk_data…
| Unzipping corpora/verbnet.zip.
| Downloading package webtext to
| /home/textprocessing/nltk_data…
| Unzipping corpora/webtext.zip.
| Downloading package wordnet to
| /home/textprocessing/nltk_data…
| Unzipping corpora/wordnet.zip.
| Downloading package wordnet_ic to
| /home/textprocessing/nltk_data…
| Unzipping corpora/wordnet_ic.zip.
| Downloading package words to
| /home/textprocessing/nltk_data…
| Unzipping corpora/words.zip.
| Downloading package ycoe to /home/textprocessing/nltk_data…
| Unzipping corpora/ycoe.zip.
| Downloading package rslp to /home/textprocessing/nltk_data…
| Unzipping stemmers/rslp.zip.
| Downloading package hmm_treebank_pos_tagger to
| /home/textprocessing/nltk_data…
| Unzipping taggers/hmm_treebank_pos_tagger.zip.
| Downloading package maxent_treebank_pos_tagger to
| /home/textprocessing/nltk_data…
| Unzipping taggers/maxent_treebank_pos_tagger.zip.
| Downloading package universal_tagset to
| /home/textprocessing/nltk_data…
| Unzipping taggers/universal_tagset.zip.
| Downloading package maxent_ne_chunker to
| /home/textprocessing/nltk_data…
| Unzipping chunkers/maxent_ne_chunker.zip.
| Downloading package punkt to
| /home/textprocessing/nltk_data…
| Unzipping tokenizers/punkt.zip.
| Downloading package book_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/book_grammars.zip.
| Downloading package sample_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/sample_grammars.zip.
| Downloading package spanish_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/spanish_grammars.zip.
| Downloading package basque_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/basque_grammars.zip.
| Downloading package large_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/large_grammars.zip.
| Downloading package tagsets to
| /home/textprocessing/nltk_data…
| Unzipping help/tagsets.zip.
| Downloading package snowball_data to
| /home/textprocessing/nltk_data…
| Downloading package bllip_wsj_no_aux to
| /home/textprocessing/nltk_data…
| Unzipping models/bllip_wsj_no_aux.zip.
| Downloading package word2vec_sample to
| /home/textprocessing/nltk_data…
| Unzipping models/word2vec_sample.zip.
| Downloading package panlex_swadesh to
| /home/textprocessing/nltk_data…
| Downloading package mte_teip5 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/mte_teip5.zip.
| Downloading package averaged_perceptron_tagger to
| /home/textprocessing/nltk_data…
| Unzipping taggers/averaged_perceptron_tagger.zip.
| Downloading package panlex_lite to
| /home/textprocessing/nltk_data…
| Unzipping corpora/panlex_lite.zip.
| Downloading package perluniprops to
| /home/textprocessing/nltk_data…
| Unzipping misc/perluniprops.zip.
| Downloading package nonbreaking_prefixes to
| /home/textprocessing/nltk_data…
| Unzipping corpora/nonbreaking_prefixes.zip.
| Downloading package vader_lexicon to
| /home/textprocessing/nltk_data…
| Downloading package porter_test to
| /home/textprocessing/nltk_data…
| Unzipping stemmers/porter_test.zip.
| Downloading package wmt15_eval to
| /home/textprocessing/nltk_data…
| Unzipping models/wmt15_eval.zip.
| Downloading package mwa_ppdb to
| /home/textprocessing/nltk_data…
| Unzipping misc/mwa_ppdb.zip.
|
Done downloading collection all

—————————————————————————
d) Download l) List u) Update c) Config h) Help q) Quit
—————————————————————————
Downloader> q
Out[5]: True
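
Downloading the whole “all” collection takes a while. If you only need a few resources, nltk.download() also accepts package identifiers directly, so the interactive downloader can be skipped. A minimal sketch (not part of the original session); the identifiers below are taken from the download log above:

import nltk

# Fetch individual packages non-interactively instead of the whole "all" collection.
nltk.download('punkt')                        # sentence/word tokenizer models
nltk.download('averaged_perceptron_tagger')   # POS tagger used by nltk.pos_tag
nltk.download('maxent_ne_chunker')            # NE chunker used by nltk.chunk.ne_chunk
nltk.download('words')                        # word list required by the NE chunker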

Using NLTK

In [15]: sentences = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve: natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation."""

In [16]: sents = nltk.sent_tokenize(sentences)

In [17]: for sent in sents:
   ....:     print sent
   ....:
Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.
As such, NLP is related to the area of human–computer interaction.
Many challenges in NLP involve: natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation.

In [18]: tokens = nltk.word_tokenize(sentences)

In [19]: print tokens
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', '.', 'As', 'such', ',', 'NLP', 'is', 'related', 'to', 'the', 'area', 'of', 'human\xe2\x80\x93computer', 'interaction', '.', 'Many', 'challenges', 'in', 'NLP', 'involve', ':', 'natural', 'language', 'understanding', ',', 'enabling', 'computers', 'to', 'derive', 'meaning', 'from', 'human', 'or', 'natural', 'language', 'input', ';', 'and', 'others', 'involve', 'natural', 'language', 'generation', '.']

In [20]: tagged_tokens = nltk.pos_tag(tokens)

In [21]: print tagged_tokens
[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('artificial', 'JJ'), ('intelligence', 'NN'), (',', ','), ('and', 'CC'), ('computational', 'JJ'), ('linguistics', 'NNS'), ('concerned', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('interactions', 'NNS'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('(', '('), ('natural', 'JJ'), (')', ')'), ('languages', 'VBZ'), ('.', '.'), ('As', 'IN'), ('such', 'JJ'), (',', ','), ('NLP', 'NNP'), ('is', 'VBZ'), ('related', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('area', 'NN'), ('of', 'IN'), ('human\xe2\x80\x93computer', 'NN'), ('interaction', 'NN'), ('.', '.'), ('Many', 'JJ'), ('challenges', 'NNS'), ('in', 'IN'), ('NLP', 'NNP'), ('involve', 'NN'), (':', ':'), ('natural', 'JJ'), ('language', 'NN'), ('understanding', 'NN'), (',', ','), ('enabling', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('derive', 'VB'), ('meaning', 'NN'), ('from', 'IN'), ('human', 'NN'), ('or', 'CC'), ('natural', 'JJ'), ('language', 'NN'), ('input', 'NN'), (';', ':'), ('and', 'CC'), ('others', 'NNS'), ('involve', 'VBP'), ('natural', 'JJ'), ('language', 'NN'), ('generation', 'NN'), ('.', '.')]

In [22]: entities = nltk.chunk.ne_chunk(tagged_tokens)

In [23]: entities
Out[23]: Tree('S', [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), Tree('ORGANIZATION', [('NLP', 'NNP')]), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('artificial', 'JJ'), ('intelligence', 'NN'), (',', ','), ('and', 'CC'), ('computational', 'JJ'), ('linguistics', 'NNS'), ('concerned', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('interactions', 'NNS'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('(', '('), ('natural', 'JJ'), (')', ')'), ('languages', 'VBZ'), ('.', '.'), ('As', 'IN'), ('such', 'JJ'), (',', ','), Tree('ORGANIZATION', [('NLP', 'NNP')]), ('is', 'VBZ'), ('related', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('area', 'NN'), ('of', 'IN'), ('human\xe2\x80\x93computer', 'NN'), ('interaction', 'NN'), ('.', '.'), ('Many', 'JJ'), ('challenges', 'NNS'), ('in', 'IN'), Tree('ORGANIZATION', [('NLP', 'NNP')]), ('involve', 'NN'), (':', ':'), ('natural', 'JJ'), ('language', 'NN'), ('understanding', 'NN'), (',', ','), ('enabling', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('derive', 'VB'), ('meaning', 'NN'), ('from', 'IN'), ('human', 'NN'), ('or', 'CC'), ('natural', 'JJ'), ('language', 'NN'), ('input', 'NN'), (';', ':'), ('and', 'CC'), ('others', 'NNS'), ('involve', 'VBP'), ('natural', 'JJ'), ('language', 'NN'), ('generation', 'NN'), ('.', '.')])
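
The Out[23] tree above can also be walked programmatically. A minimal sketch (not part of the original session), using the standard NLTK Tree methods subtrees(), label() and leaves() to pull out the labeled named-entity chunks:

# Collect the named-entity subtrees from the chunk tree returned by ne_chunk.
for subtree in entities.subtrees():
    if subtree.label() == 'S':      # skip the root node
        continue
    entity = ' '.join(word for word, tag in subtree.leaves())
    print(subtree.label() + ': ' + entity)    # e.g. "ORGANIZATION: NLP"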

For more about NLTK, we recommend the “Dive into NLTK” series and the official book: “Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit”.

Posted by “TextProcessing”

Open Source Text Processing Project: Wapiti

Wapiti – A simple and fast discriminative sequence labelling toolkit

Project Website: https://wapiti.limsi.fr/
Github Link: https://github.com/Jekub/Wapiti

Description

Wapiti is a very fast toolkit for segmenting and labeling sequences with discriminative models. It is based on maxent models, maximum entropy Markov models, and linear-chain CRFs, and offers various optimization and regularization methods to improve both the computational complexity and the prediction performance of standard models. Wapiti has been ranked first on the sequence-tagging task of the MLcomp web site for more than a year.

Features

Handle large label and feature sets
Wapiti has been used to train models with more than one thousand labels and models with several billion features. Training time still increases with the size of these sets, but provided you have enough computing power and memory, Wapiti will handle them without problems.

L-BFGS, OWL-QN, SGD-L1, BCD, and RPROP training algorithms
Wapiti implements all the standard training algorithms. They are highly optimized and can be combined to improve both computational and generalization performance; a minimal command-line sketch follows the feature list below.

L1, L2, or Elastic-net regularization
Wapiti provides several regularization methods that reduce overfitting and enable efficient feature selection.

Powerful feature extraction system
Wapiti uses an extended version of the CRF++ patterns for extracting features, which reduces both the amount of pre-processing required and the size of data files.

Multi-threaded and vectorized implementation
To further improve performance, all optimization algorithms can take advantage of SSE instructions when available. The quasi-Newton and RPROP optimization algorithms are parallelized and scale very well on multi-processor machines.

N-best Viterbi output
Viterbi decoding can output the single best label sequence as well as the n-best ones. Decoding can be done with classical Viterbi for CRFs or through posteriors, which are slower but generally lead to better results and give normalized scores.

Compact model creation
When used with L1 or elastic-net penalties, Wapiti can remove unused features and create compact models that load faster and use less memory, speeding up labeling.

Sparse forward-backward
A dedicated sparse forward-backward procedure is used during training to take advantage of the model's sparsity and speed up computation.

Written in standard C99+POSIX
Wapiti's source code is written almost entirely in standard C99 and should work on any computer. However, the multi-threading code uses POSIX threads and the SSE code targets the x86 platform. Both are optional and can be disabled or rewritten for other platforms.

Open source (BSD Licence)
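
As a rough illustration of the workflow described in the features above, here is a minimal command-line sketch. The train/label sub-commands and the -p/-m options follow Wapiti's documented usage, but the file names and the pattern file contents are hypothetical examples, so check the project manual before relying on them:

# pattern.txt -- a few CRF++-style feature patterns (hypothetical):
#   U00:%x[-1,0]    previous token as a unigram feature
#   U01:%x[0,0]     current token as a unigram feature
#   B               bigram feature over adjacent labels

# Train a CRF on CoNLL-style training data using the patterns above.
wapiti train -p pattern.txt train.txt model.wap

# Label new data with the trained model.
wapiti label -m model.wap test.txt output.txt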

Open Source Text Processing Project: segtok

segtok: sentence segmentation and word tokenization tools

Project Website: http://fnl.es/segtok-a-segmentation-and-tokenization-library.html
Github Link: https://github.com/fnl/segtok

Description

A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.

The segtok package provides two modules, segtok.segmenter and segtok.tokenizer. The segmenter provides functionality for splitting (Indo-European) text into sentences. The tokenizer provides functionality for splitting (Indo-European) sentences into words and symbols (collectively called tokens). Both modules can also be used from the command line. While other Indo-European languages may work, segtok has only been designed with languages such as Spanish, English, and German in mind.

To install this package, you should have the latest official version of Python 2 or 3 installed. The package has been reported to work with Python 2.7, 3.3, and 3.4 and is tested against the latest Python 2 and 3 branches. The easiest way to install it is with pip or any other package manager that works with PyPI:

pip install segtok
Important: If you are on a Linux machine and have problems installing the regex dependency of segtok, make sure you have the python-dev and/or python3-dev packages installed to get the necessary headers to compile the package.

Then try the command line tools on some plain-text files (e.g., this README) to see if segtok meets your needs:

segmenter README.rst | tokenizer
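
Beyond the command line, the two modules can be used directly from Python. A minimal sketch, assuming the split_single and word_tokenizer functions exported by segtok.segmenter and segtok.tokenizer (check the project README for the exact names in your segtok version):

from segtok.segmenter import split_single
from segtok.tokenizer import word_tokenizer

text = "This is one sentence. And here is another one, isn't it?"

# Split raw text into sentences, then each sentence into tokens.
for sentence in split_single(text):
    print(word_tokenizer(sentence))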