Getting started with WordNet

About WordNet

WordNet is a lexical database for English:

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.

Install WordNet

We can download the WordNet source and data from the official website:

The most recent Windows version of WordNet is 2.1, released in March 2005. Version 3.0 for Unix/Linux/Solaris/etc. was released in December 2006. Version 3.1 is currently available only online.

Here we will use WordNet 3.0, the stable release, which supports UNIX-like systems including Linux, Mac OS X, and Solaris. Before installing WordNet from source, download the tar-gzipped version first:

Install WordNet on Ubuntu 16.04:

tar -zxvf WordNet-3.0.tar.gz
cd WordNet-3.0/
./configure

After running configure in WordNet-3.0, we met a configuration problem:


checking for style of include used by make… GNU
checking dependency style of gcc… gcc3
checking for Tcl configuration… configure: WARNING: Can’t find Tcl configuration definitions

Installing tcl-dev on Ubuntu resolves this problem:

sudo apt-get install tcl-dev

Configure WordNet again:

./configure

But we met another Tk problem:


checking for style of include used by make… GNU
checking dependency style of gcc… gcc3
checking for Tcl configuration… found /usr/lib/tclConfig.sh
checking for Tk configuration… configure: WARNING: Can’t find Tk configuration definitions

Install tk-dev on Ubuntu as well:

sudo apt-get install tk-dev

Finally, configure completes successfully:

./configure

WordNet is now configured

Installation directory: /usr/local/WordNet-3.0

To build and install WordNet:

make
make install

To run, environment variables should be set as follows:

PATH – include ${exec_prefix}/bin
WNHOME – if not using default installation location, set to /usr/local/WordNet-3.0

See INSTALL file for details and additional environment variables
which may need to be set on your system.
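
If you want to call the wn binary from a script rather than an interactive shell, these environment variables can also be set per invocation. Below is a minimal Python sketch, assuming the default install prefix /usr/local/WordNet-3.0 from the configure output above:

import os
import subprocess

# Point WNHOME at the WordNet install prefix and put its bin/ directory on PATH
env = os.environ.copy()
env["WNHOME"] = "/usr/local/WordNet-3.0"
env["PATH"] = env["WNHOME"] + "/bin:" + env["PATH"]

# Run "wn book -over" (overview of senses) and print its output
output = subprocess.check_output(["wn", "book", "-over"], env=env)
print(output.decode("utf-8"))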

Now make it:

make

But we met a compile error:

……
then mv -f “.deps/wishwn-stubs.Tpo” “.deps/wishwn-stubs.Po”; else rm -f “.deps/wishwn-stubs.Tpo”; exit 1; fi
stubs.c: In function ‘wn_findvalidsearches’:
stubs.c:43:14: error: ‘Tcl_Interp {aka struct Tcl_Interp}’ has no member named ‘result’
interp -> result =

The reason is that Tcl 8.5 deprecated interp->result and Tcl 8.6+ removed it, so you need to modify the original WordNet code:

sudo vim src/stubs.c

and add the line “#define USE_INTERP_RESULT 1” before “#include <tcl.h>”, like this:

#ifdef _WINDOWS
#include <windows.h>
#endif
 
#define USE_INTERP_RESULT 1
 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tcl.h>
#include <tk.h>
#include <wn.h>
......
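
If you prefer to apply this change from a script rather than editing the file by hand, here is a minimal Python sketch (run from the WordNet-3.0 source directory; src/stubs.c is the same file edited above):

# patch_stubs.py: insert "#define USE_INTERP_RESULT 1" before the tcl.h include
path = "src/stubs.c"
with open(path) as f:
    lines = f.readlines()

# Only patch the file once
if not any("USE_INTERP_RESULT" in line for line in lines):
    for i, line in enumerate(lines):
        if "#include <tcl.h>" in line:
            lines.insert(i, "#define USE_INTERP_RESULT 1\n")
            break
    with open(path, "w") as f:
        f.writelines(lines)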

Make it again:

make

make all-recursive
make[1]: Entering directory ‘/home/textminer/wordnet/WordNet-3.0’
……
gcc -g -O2 -o wishwn wishwn-tkAppInit.o wishwn-stubs.o -L../lib -lWN -L/usr/lib/x86_64-linux-gnu -ltk8.6 -L/usr/lib/x86_64-linux-gnu -ltcl8.6 -lX11 -lXss -lXext -lXft -lfontconfig -lfreetype -lfontconfig -lpthread -ldl -lz -lpthread -lieee -lm
make[2]: Leaving directory ‘/home/textminer/wordnet/WordNet-3.0/src’
make[2]: Entering directory ‘/home/textminer/wordnet/WordNet-3.0’
make[2]: Leaving directory ‘/home/textminer/wordnet/WordNet-3.0’
make[1]: Leaving directory ‘/home/textminer/wordnet/WordNet-3.0’

Finally, run “make install”:

sudo make install

If everything is OK, you can find WordNet 3.0 in the “/usr/local/WordNet-3.0/” directory, and in the binary subdirectory “/usr/local/WordNet-3.0/bin” you can find the related binaries: wishwn, wn, and wnb.

Now execute wn:

./wn

We can get:

usage: wn word [-hgla] [-n#] -searchtype [-searchtype...]
       wn [-l]
 
	-h		Display help text before search output
	-g		Display gloss
	-l		Display license and copyright notice
	-a		Display lexicographer file information
	-o		Display synset offset
	-s		Display sense numbers in synsets
	-n#		Search only sense number #
 
searchtype is at least one of the following:
	-ants{n|v|a|r}		Antonyms
	-hype{n|v}		Hypernyms
	-hypo{n|v}, -tree{n|v}	Hyponyms & Hyponym Tree
	-entav			Verb Entailment
	-syns{n|v|a|r}		Synonyms (ordered by estimated frequency)
	-smemn			Member of Holonyms
	-ssubn			Substance of Holonyms
	-sprtn			Part of Holonyms
	-membn			Has Member Meronyms
	-subsn			Has Substance Meronyms
	-partn			Has Part Meronyms
	-meron			All Meronyms
	-holon			All Holonyms
	-causv			Cause to
	-pert{a|r}		Pertainyms
	-attr{n|a}		Attributes
	-deri{n|v}		Derived Forms
	-domn{n|v|a|r}		Domain
	-domt{n|v|a|r}		Domain Terms
	-faml{n|v|a|r}		Familiarity & Polysemy Count
	-framv			Verb Frames
	-coor{n|v}		Coordinate Terms (sisters)
	-simsv			Synonyms (grouped by similarity of meaning)
	-hmern			Hierarchical Meronyms
	-hholn			Hierarchical Holonyms
	-grep{n|v|a|r}		List of Compound Words
	-over			Overview of Senses

Now you can enjoy WordNet on your Ubuntu system.

Another simple way to install WordNet on Ubuntu is via apt:

sudo apt install wordnet

Reading package lists… Done
Building dependency tree
Reading state information… Done
The following additional packages will be installed:
fontconfig-config fonts-dejavu-core libfontconfig1 libtcl8.5 libtk8.5
libxft2 libxrender1 libxss1 wordnet-base wordnet-gui x11-common
Suggested packages:
tcl8.5 tk8.5
The following NEW packages will be installed:
fontconfig-config fonts-dejavu-core libfontconfig1 libtcl8.5 libtk8.5
libxft2 libxrender1 libxss1 wordnet wordnet-base wordnet-gui x11-common
0 upgraded, 12 newly installed, 0 to remove and 94 not upgraded.
Need to get 9,177 kB of archives.
After this operation, 39.8 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
……
Setting up libxss1:amd64 (1:1.2.2-1) …
Setting up libtcl8.5:amd64 (8.5.19-1) …
Setting up libtk8.5:amd64 (8.5.19-1ubuntu1) …
Setting up wordnet-base (1:3.0-33) …
Setting up wordnet (1:3.0-33) …
Setting up wordnet-gui (1:3.0-33) …
Processing triggers for libc-bin (2.23-0ubuntu3) …
Processing triggers for systemd (229-4ubuntu8) …
Processing triggers for ureadahead (0.100.0-19) …

Now you can type “wn” to test WordNet, the same as before.

Install WordNet on Mac OS:

Installing WordNet from source on Mac OS is simpler, because the Tcl and Tk development files are present by default. However, you will meet the same compile error:

……
stubs.c: In function ‘wn_findvalidsearches’:
stubs.c:43: error: ‘Tcl_Interp’ has no member named ‘result’
stubs.c:55: error: ‘Tcl_Interp’ has no member named ‘result’

The fix is the same: modify the original WordNet code:

sudo vim src/stubs.c

and add the line “#define USE_INTERP_RESULT 1” before “#include <tcl.h>”, like this:

#ifdef _WINDOWS
#include <windows.h>
#endif
 
#define USE_INTERP_RESULT 1
 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tcl.h>
#include <tk.h>
#include <wn.h>
......

Test WordNet
We test the word “book” with WordNet:
wn book

Information available for noun book
	-hypen		Hypernyms
	-hypon, -treen	Hyponyms & Hyponym Tree
	-synsn		Synonyms (ordered by estimated frequency)
	-sprtn		Part of Holonyms
	-membn		Has Member Meronyms
	-partn		Has Part Meronyms
	-meron		All Meronyms
	-holon		All Holonyms
	-derin		Derived Forms
	-domnn		Domain
	-domtn		Domain Terms
	-famln		Familiarity & Polysemy Count
	-coorn		Coordinate Terms (sisters)
	-hmern		Hierarchical Meronyms
	-hholn		Hierarchical Holonyms
	-grepn		List of Compound Words
	-over		Overview of Senses
 
Information available for verb book
	-hypev		Hypernyms
	-hypov, -treev	Hyponyms & Hyponym Tree
	-entav		Verb Entailment
	-synsv		Synonyms (ordered by estimated frequency)
	-deriv		Derived Forms
	-famlv		Familiarity & Polysemy Count
	-framv		Verb Frames
	-coorv		Coordinate Terms (sisters)
	-simsv		Synonyms (grouped by similarity of meaning)
	-grepv		List of Compound Words
	-over		Overview of Senses
 
No information available for adj book
 
No information available for adv book

Continue:
wn book -hypen

Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun book
 
11 senses of book                                                       
 
Sense 1
book
       => publication
           => work, piece of work
               => product, production
                   => creation
                       => artifact, artefact
                           => whole, unit
                               => object, physical object
                                   => physical entity
                                       => entity
 
Sense 2
book, volume
       => product, production
           => creation
               => artifact, artefact
                   => whole, unit
                       => object, physical object
                           => physical entity
                               => entity
 
Sense 3
record, record book, book
       => fact
           => information, info
               => message, content, subject matter, substance
                   => communication
                       => abstraction, abstract entity
                           => entity
 
Sense 4
script, book, playscript
       => dramatic composition, dramatic work
           => writing, written material, piece of writing
               => written communication, written language, black and white
                   => communication
                       => abstraction, abstract entity
                           => entity
 
Sense 5
ledger, leger, account book, book of account, book
       => record
           => document
               => communication
                   => abstraction, abstract entity
                       => entity
 
Sense 6
book
       => collection, aggregation, accumulation, assemblage
           => group, grouping
               => abstraction, abstract entity
                   => entity
 
Sense 7
book, rule book
       => collection, aggregation, accumulation, assemblage
           => group, grouping
               => abstraction, abstract entity
                   => entity
 
Sense 8
Koran, Quran, al-Qur'an, Book
       INSTANCE OF=> sacred text, sacred writing, religious writing, religious text
           => writing, written material, piece of writing
               => written communication, written language, black and white
                   => communication
                       => abstraction, abstract entity
                           => entity
 
Sense 9
Bible, Christian Bible, Book, Good Book, Holy Scripture, Holy Writ, Scripture, Word of God, Word
       => sacred text, sacred writing, religious writing, religious text
           => writing, written material, piece of writing
               => written communication, written language, black and white
                   => communication
                       => abstraction, abstract entity
                           => entity
 
Sense 10
book
       => section, subdivision
           => writing, written material, piece of writing
               => written communication, written language, black and white
                   => communication
                       => abstraction, abstract entity
                           => entity
           => music
               => auditory communication
                   => communication
                       => abstraction, abstract entity
                           => entity
 
Sense 11
book
       => product, production
           => creation
               => artifact, artefact
                   => whole, unit
                       => object, physical object
                           => physical entity
                               => entity

Continue:
wn book -hypon

Hyponyms of noun book
 
7 of 11 senses of book                                                  
 
Sense 1
book
       => authority
       => curiosa
       => formulary, pharmacopeia
       => trade book, trade edition
       => bestiary
       => catechism
       => pop-up book, pop-up
       => storybook
       => tome
       => booklet, brochure, folder, leaflet, pamphlet
       => textbook, text, text edition, schoolbook, school text
       => workbook
       => copybook
       => appointment book, appointment calendar
       => catalog, catalogue
       => phrase book
       => playbook
       => prayer book, prayerbook
       => reference book, reference, reference work, book of facts
       => review copy
       => songbook
       => yearbook
       HAS INSTANCE=> Das Kapital, Capital
       HAS INSTANCE=> Erewhon
       HAS INSTANCE=> Utopia
 
Sense 2
book, volume
       => album
       => coffee-table book
       => folio
       => hardback, hardcover
       => journal
       => novel
       => order book
       => paperback book, paper-back book, paperback, softback book, softback, soft-cover book, soft-cover
       => picture book
       => sketchbook, sketch block, sketch pad
       => notebook
 
Sense 3
record, record book, book
       => logbook
       => won-lost record
       => card, scorecard
 
Sense 4
script, book, playscript
       => promptbook, prompt copy
       => continuity
       => dialogue, dialog
       => libretto
       => scenario
       => screenplay
       => shooting script
 
Sense 5
ledger, leger, account book, book of account, book
       => cost ledger
       => general ledger
       => subsidiary ledger
       => daybook, journal
 
Sense 9
Bible, Christian Bible, Book, Good Book, Holy Scripture, Holy Writ, Scripture, Word of God, Word
       => family Bible
       HAS INSTANCE=> Vulgate
       HAS INSTANCE=> Douay Bible, Douay Version, Douay-Rheims Bible, Douay-Rheims Version, Rheims-Douay Bible, Rheims-Douay Version
       HAS INSTANCE=> Authorized Version, King James Version, King James Bible
       HAS INSTANCE=> Revised Version
       HAS INSTANCE=> New English Bible
       HAS INSTANCE=> American Standard Version, American Revised Version
       HAS INSTANCE=> Revised Standard Version
 
Sense 10
book
       HAS INSTANCE=> Genesis, Book of Genesis
       HAS INSTANCE=> Exodus, Book of Exodus
       HAS INSTANCE=> Leviticus, Book of Leviticus
       HAS INSTANCE=> Numbers, Book of Numbers
       HAS INSTANCE=> Deuteronomy, Book of Deuteronomy
       HAS INSTANCE=> Joshua, Josue, Book of Joshua
       HAS INSTANCE=> Judges, Book of Judges
       HAS INSTANCE=> Ruth, Book of Ruth
       HAS INSTANCE=> I Samuel, 1 Samuel
       HAS INSTANCE=> II Samuel, 2 Samuel
       HAS INSTANCE=> I Kings, 1 Kings
       HAS INSTANCE=> II Kings, 2 Kings
       HAS INSTANCE=> I Chronicles, 1 Chronicles
       HAS INSTANCE=> II Chronicles, 2 Chronicles
       HAS INSTANCE=> Ezra, Book of Ezra
       HAS INSTANCE=> Nehemiah, Book of Nehemiah
       HAS INSTANCE=> Esther, Book of Esther
       HAS INSTANCE=> Job, Book of Job
       HAS INSTANCE=> Psalms, Book of Psalms
       HAS INSTANCE=> Proverbs, Book of Proverbs
       HAS INSTANCE=> Ecclesiastes, Book of Ecclesiastes
       HAS INSTANCE=> Song of Songs, Song of Solomon, Canticle of Canticles, Canticles
       HAS INSTANCE=> Isaiah, Book of Isaiah
       HAS INSTANCE=> Jeremiah, Book of Jeremiah
       HAS INSTANCE=> Lamentations, Book of Lamentations
       HAS INSTANCE=> Ezekiel, Ezechiel, Book of Ezekiel
       HAS INSTANCE=> Daniel, Book of Daniel, Book of the Prophet Daniel
       HAS INSTANCE=> Hosea, Book of Hosea
       HAS INSTANCE=> Joel, Book of Joel
       HAS INSTANCE=> Amos, Book of Amos
       HAS INSTANCE=> Obadiah, Abdias, Book of Obadiah
       HAS INSTANCE=> Jonah, Book of Jonah
       HAS INSTANCE=> Micah, Micheas, Book of Micah
       HAS INSTANCE=> Nahum, Book of Nahum
       HAS INSTANCE=> Habakkuk, Habacuc, Book of Habakkuk
       HAS INSTANCE=> Zephaniah, Sophonias, Book of Zephaniah
       HAS INSTANCE=> Haggai, Aggeus, Book of Haggai
       HAS INSTANCE=> Zechariah, Zacharias, Book of Zachariah
       HAS INSTANCE=> Malachi, Malachias, Book of Malachi
       HAS INSTANCE=> Matthew, Gospel According to Matthew
       HAS INSTANCE=> Mark, Gospel According to Mark
       HAS INSTANCE=> Luke, Gospel of Luke, Gospel According to Luke
       HAS INSTANCE=> John, Gospel According to John
       HAS INSTANCE=> Acts of the Apostles, Acts
       => Epistle
       HAS INSTANCE=> Revelation, Revelation of Saint John the Divine, Apocalypse, Book of Revelation
       HAS INSTANCE=> Additions to Esther
       HAS INSTANCE=> Prayer of Azariah and Song of the Three Children
       HAS INSTANCE=> Susanna, Book of Susanna
       HAS INSTANCE=> Bel and the Dragon
       HAS INSTANCE=> Baruch, Book of Baruch
       HAS INSTANCE=> Letter of Jeremiah, Epistle of Jeremiah
       HAS INSTANCE=> Tobit, Book of Tobit
       HAS INSTANCE=> Judith, Book of Judith
       HAS INSTANCE=> I Esdra, 1 Esdras
       HAS INSTANCE=> II Esdras, 2 Esdras
       HAS INSTANCE=> Ben Sira, Sirach, Ecclesiasticus, Wisdom of Jesus the Son of Sirach
       HAS INSTANCE=> Wisdom of Solomon, Wisdom
       HAS INSTANCE=> I Maccabees, 1 Maccabees
       HAS INSTANCE=> II Maccabees, 2 Maccabees

Let’s test the word “dog”:

wn dog

Information available for noun dog
	-hypen		Hypernyms
	-hypon, -treen	Hyponyms & Hyponym Tree
	-synsn		Synonyms (ordered by estimated frequency)
	-smemn		Member of Holonyms
	-sprtn		Part of Holonyms
	-partn		Has Part Meronyms
	-meron		All Meronyms
	-holon		All Holonyms
	-famln		Familiarity & Polysemy Count
	-coorn		Coordinate Terms (sisters)
	-hmern		Hierarchical Meronyms
	-hholn		Hierarchical Holonyms
	-grepn		List of Compound Words
	-over		Overview of Senses
 
Information available for verb dog
	-hypev		Hypernyms
	-hypov, -treev	Hyponyms & Hyponym Tree
	-synsv		Synonyms (ordered by estimated frequency)
	-famlv		Familiarity & Polysemy Count
	-framv		Verb Frames
	-coorv		Coordinate Terms (sisters)
	-simsv		Synonyms (grouped by similarity of meaning)
	-grepv		List of Compound Words
	-over		Overview of Senses
 
No information available for adj dog
 
No information available for adv dog

Let’s find the synsets for the noun dog:

wn dog -synsn

Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun dog
 
7 senses of dog                                                         
 
Sense 1
dog, domestic dog, Canis familiaris
       => canine, canid
       => domestic animal, domesticated animal
 
Sense 2
frump, dog
       => unpleasant woman, disagreeable woman
 
Sense 3
dog
       => chap, fellow, feller, fella, lad, gent, blighter, cuss, bloke
 
Sense 4
cad, bounder, blackguard, dog, hound, heel
       => villain, scoundrel
 
Sense 5
frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie
       => sausage
 
Sense 6
pawl, detent, click, dog
       => catch, stop
 
Sense 7
andiron, firedog, dog, dog-iron
       => support

Let’s find the synsets for the verb dog:

wn dog -synsv

Synonyms/Hypernyms (Ordered by Estimated Frequency) of verb dog
 
1 sense of dog                                                          
 
Sense 1
chase, chase after, trail, tail, tag, give chase, dog, go after, track
       => pursue, follow
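
The same WordNet database can also be queried from Python through NLTK (covered later in this post; it requires the wordnet corpus from nltk_data). A minimal sketch that mirrors the “wn dog -synsn” query above:

from nltk.corpus import wordnet as wn

# List every noun synset of "dog" with its lemmas and direct hypernyms
for synset in wn.synsets("dog", pos=wn.NOUN):
    print(synset.name(), synset.lemma_names())
    print("    hypernyms:", [h.name() for h in synset.hypernyms()])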

Just enjoy it.

Posted by TextProcessing

A Beginner’s Guide to TextBlob

About TextBlob

Open Source Text Processing Project: TextBlob

Install TextBlob

Install the latest TextBlob on Ubuntu 16.04.1 LTS:

textprocessing@ubuntu:~$ sudo pip install -U textblob

Collecting textblob
Downloading textblob-0.12.0-py2.py3-none-any.whl (631kB)

Requirement already up-to-date: nltk>=3.1 in /usr/local/lib/python2.7/dist-packages (from textblob)
Requirement already up-to-date: six in /usr/local/lib/python2.7/dist-packages (from nltk>=3.1->textblob)
Installing collected packages: textblob
Successfully installed textblob-0.12.0

textprocessing@ubuntu:~$ sudo python -m textblob.download_corpora

[nltk_data] Downloading package brown to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
[nltk_data] Downloading package conll2000 to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/movie_reviews.zip.
Finished.

Test TextBlob

textprocessing@ubuntu:~$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
Type "copyright", "credits" or "license" for more information.
 
IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: from textblob import TextBlob
 
In [2]: test_text = """
Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
"""
 
In [3]: text_blob = TextBlob(test_text)
 
# Word Tokenization
In [4]: text_blob.words
Out[4]: WordList(['Text', 'mining', 'also', 'referred', 'to', 'as', 'text', 'data', 'mining', 'roughly', 'equivalent', 'to', 'text', 'analytics', 'is', 'the', 'process', 'of', 'deriving', 'high-quality', 'information', 'from', 'text', 'High-quality', 'information', 'is', 'typically', 'derived', 'through', 'the', 'devising', 'of', 'patterns', 'and', 'trends', 'through', 'means', 'such', 'as', 'statistical', 'pattern', 'learning', 'Text', 'mining', 'usually', 'involves', 'the', 'process', 'of', 'structuring', 'the', 'input', 'text', 'usually', 'parsing', 'along', 'with', 'the', 'addition', 'of', 'some', 'derived', 'linguistic', 'features', 'and', 'the', 'removal', 'of', 'others', 'and', 'subsequent', 'insertion', 'into', 'a', 'database', 'deriving', 'patterns', 'within', 'the', 'structured', 'data', 'and', 'finally', 'evaluation', 'and', 'interpretation', 'of', 'the', 'output', "'High", 'quality', 'in', 'text', 'mining', 'usually', 'refers', 'to', 'some', 'combination', 'of', 'relevance', 'novelty', 'and', 'interestingness', 'Typical', 'text', 'mining', 'tasks', 'include', 'text', 'categorization', 'text', 'clustering', 'concept/entity', 'extraction', 'production', 'of', 'granular', 'taxonomies', 'sentiment', 'analysis', 'document', 'summarization', 'and', 'entity', 'relation', 'modeling', 'i.e', 'learning', 'relations', 'between', 'named', 'entities'])
 
# Sentence Tokenization
In [5]: text_blob.sentences
Out[5]: 
[Sentence("
 Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text."),
 Sentence("High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning."),
 Sentence("Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output."),
 Sentence("'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness."),
 Sentence("Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).")]
 
In [6]: for sentence in text_blob.sentences:
   ...:     print(sentence)
   ...:     
 
Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text.
High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.
Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.
'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness.
Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
 
# Sentiment Analysis
In [7]: for sentence in text_blob.sentences:
    print(sentence.sentiment)
   ...:     
Sentiment(polarity=-0.1, subjectivity=0.4)
Sentiment(polarity=-0.08333333333333333, subjectivity=0.5)
Sentiment(polarity=-0.08, subjectivity=0.32999999999999996)
Sentiment(polarity=-0.045, subjectivity=0.39499999999999996)
Sentiment(polarity=-0.16666666666666666, subjectivity=0.5)
 
# POS Tagging
In [8]: text_blob.tags
Out[8]: 
[('Text', u'NNP'),
 ('mining', u'NN'),
 ('also', u'RB'),
 ('referred', u'VBD'),
 ('to', u'TO'),
 ('as', u'IN'),
 ('text', u'NN'),
 ('data', u'NNS'),
 ('mining', u'NN'),
 ('roughly', u'RB'),
 ('equivalent', u'JJ'),
 ('to', u'TO'),
 ('text', u'VB'),
 ('analytics', u'NNS'),
 ('is', u'VBZ'),
 ('the', u'DT'),
 ('process', u'NN'),
 ('of', u'IN'),
 ('deriving', u'VBG'),
 ('high-quality', u'JJ'),
 ('information', u'NN'),
 ('from', u'IN'),
 ('text', u'NN'),
 ('High-quality', u'NNP'),
 ('information', u'NN'),
 ('is', u'VBZ'),
 ('typically', u'RB'),
 ('derived', u'VBN'),
 ('through', u'IN'),
 ('the', u'DT'),
 ('devising', u'NN'),
 ('of', u'IN'),
 ('patterns', u'NNS'),
 ('and', u'CC'),
 ('trends', u'NNS'),
 ('through', u'IN'),
 ('means', u'NNS'),
 ('such', u'JJ'),
 ('as', u'IN'),
 ('statistical', u'JJ'),
 ('pattern', u'NN'),
 ('learning', u'VBG'),
 ('Text', u'NNP'),
 ('mining', u'NN'),
 ('usually', u'RB'),
 ('involves', u'VBZ'),
 ('the', u'DT'),
 ('process', u'NN'),
 ('of', u'IN'),
 ('structuring', u'VBG'),
 ('the', u'DT'),
 ('input', u'NN'),
 ('text', u'NN'),
 ('usually', u'RB'),
 ('parsing', u'VBG'),
 ('along', u'IN'),
 ('with', u'IN'),
 ('the', u'DT'),
 ('addition', u'NN'),
 ('of', u'IN'),
 ('some', u'DT'),
 ('derived', u'VBN'),
 ('linguistic', u'JJ'),
 ('features', u'NNS'),
 ('and', u'CC'),
 ('the', u'DT'),
 ('removal', u'NN'),
 ('of', u'IN'),
 ('others', u'NNS'),
 ('and', u'CC'),
 ('subsequent', u'JJ'),
 ('insertion', u'NN'),
 ('into', u'IN'),
 ('a', u'DT'),
 ('database', u'NN'),
 ('deriving', u'VBG'),
 ('patterns', u'NNS'),
 ('within', u'IN'),
 ('the', u'DT'),
 ('structured', u'JJ'),
 ('data', u'NNS'),
 ('and', u'CC'),
 ('finally', u'RB'),
 ('evaluation', u'NN'),
 ('and', u'CC'),
 ('interpretation', u'NN'),
 ('of', u'IN'),
 ('the', u'DT'),
 ('output', u'NN'),
 ("'High", u'JJ'),
 ('quality', u'NN'),
 ('in', u'IN'),
 ('text', u'JJ'),
 ('mining', u'NN'),
 ('usually', u'RB'),
 ('refers', u'VBZ'),
 ('to', u'TO'),
 ('some', u'DT'),
 ('combination', u'NN'),
 ('of', u'IN'),
 ('relevance', u'NN'),
 ('novelty', u'NN'),
 ('and', u'CC'),
 ('interestingness', u'NN'),
 ('Typical', u'JJ'),
 ('text', u'NN'),
 ('mining', u'NN'),
 ('tasks', u'NNS'),
 ('include', u'VBP'),
 ('text', u'JJ'),
 ('categorization', u'NN'),
 ('text', u'NN'),
 ('clustering', u'NN'),
 ('concept/entity', u'NN'),
 ('extraction', u'NN'),
 ('production', u'NN'),
 ('of', u'IN'),
 ('granular', u'JJ'),
 ('taxonomies', u'NNS'),
 ('sentiment', u'NN'),
 ('analysis', u'NN'),
 ('document', u'NN'),
 ('summarization', u'NN'),
 ('and', u'CC'),
 ('entity', u'NN'),
 ('relation', u'NN'),
 ('modeling', u'NN'),
 ('i.e.', u'FW'),
 ('learning', u'VBG'),
 ('relations', u'NNS'),
 ('between', u'IN'),
 ('named', u'VBN'),
 ('entities', u'NNS')]
 
# Noun Phrase Extraction
In [9]: text_blob.noun_phrases
Out[9]: WordList(['text', u'text data', u'text analytics', u'high-quality information', 'high-quality', u'statistical pattern learning', 'text', u'input text', u'subsequent insertion', u"'high quality", u'typical text', u'text categorization', u'concept/entity extraction', u'granular taxonomies', u'sentiment analysis', u'document summarization', u'entity relation', u'learning relations'])
 
# Sentiment Analysis
In [10]: text_blob.sentiment
Out[10]: Sentiment(polarity=-0.08393939393939392, subjectivity=0.39454545454545453)
 
# Singularize and Pluralize
In [11]: text_blob.words[-1]
Out[11]: 'entities'
 
In [12]: text_blob.words[-1].singularize()
Out[12]: 'entity'
 
In [13]: text_blob.words[1]
Out[13]: 'mining'
 
In [14]: text_blob.words[1].pluralize()
Out[14]: 'minings'
 
In [15]: text_blob.words[0]
Out[15]: 'Text'
 
In [16]: text_blob.words[0].pluralize()
Out[16]: 'Texts'
 
# Lemmatization
In [17]: from textblob import Word
 
In [18]: w = Word("are")
 
In [19]: w.lemmatize()
Out[19]: 'are'
 
In [20]: w.lemmatize('v')
Out[20]: u'be'
 
# WordNet
In [21]: from textblob.wordnet import VERB
 
In [22]: word = Word("are")
 
In [23]: word.synsets
Out[23]: 
[Synset('are.n.01'),
 Synset('be.v.01'),
 Synset('be.v.02'),
 Synset('be.v.03'),
 Synset('exist.v.01'),
 Synset('be.v.05'),
 Synset('equal.v.01'),
 Synset('constitute.v.01'),
 Synset('be.v.08'),
 Synset('embody.v.02'),
 Synset('be.v.10'),
 Synset('be.v.11'),
 Synset('be.v.12'),
 Synset('cost.v.01')]
 
In [24]: word.definitions
Out[24]: 
[u'a unit of surface area equal to 100 square meters',
 u'have the quality of being; (copula, used with an adjective or a predicate noun)',
 u'be identical to; be someone or something',
 u'occupy a certain position or area; be somewhere',
 u'have an existence, be extant',
 u'happen, occur, take place; this was during the visit to my parents\' house"',
 u'be identical or equivalent to',
 u'form or compose',
 u'work in a specific place, with a specific subject, or in a specific function',
 u'represent, as of a character on stage',
 u'spend or use time',
 u'have life, be alive',
 u'to remain unmolested, undisturbed, or uninterrupted -- used only in infinitive form',
 u'be priced at']
 
# Spelling Correction
In [25]: splling_test = TextBlob("I m ok")
 
In [26]: spelling_test = TextBlob("I m ok")
 
In [27]: print(spelling_test.correct())
I m ok
 
In [28]: splling_test = TextBlob("I havv good speling!")
 
In [29]: print(spelling_test.correct())
I m ok
 
In [30]: print(splling_test.correct())
I have good spelling!
 
# Translation
In [31]: text_blob.translate(to='zh')
Out[31]: TextBlob("文本挖掘,也称为文本数据挖掘,大致相当于文本分析,是从文本中获取高质量信息的过程。高质量的信息通常是通过统计模式学习等手段来设计模式和趋势。文本挖掘通常涉及构造输入文本的过程(通常解析,以及添加一些派生的语言特征以及删除其他内容,并随后插入数据库),导出结构化数据中的模式,最后进行评估和解释的输出。文本挖掘中的“高质量”通常指相关性,新颖性和趣味性的一些组合。典型的文本挖掘任务包括文本分类,文本聚类,概念/实体提取,粒度分类法的生成,情绪分析,文档摘要和实体关系建模(即命名实体之间的学习关系)。")
 
# Language Detection
In [36]: text_blob2 = TextBlob(u"这是中文测试")
 
In [37]: text_blob2.detect_language()
Out[37]: u'zh-CN'
 
# Parser
In [39]: text_blob.parse()
Out[39]: u"Text/NN/B-NP/O mining/NN/I-NP/O ,/,/O/O also/RB/B-VP/O referred/VBN/I-VP/O to/TO/B-PP/B-PNP as/IN/I-PP/I-PNP text/NN/B-NP/I-PNP data/NNS/I-NP/I-PNP mining/NN/I-NP/I-PNP ,/,/O/O roughly/RB/B-ADVP/O equivalent/NN/B-NP/O to/TO/B-PP/B-PNP text/NN/B-NP/I-PNP analytics/NNS/I-NP/I-PNP ,/,/O/O is/VBZ/B-VP/O the/DT/B-NP/O process/NN/I-NP/O of/IN/B-PP/B-PNP deriving/VBG/B-VP/I-PNP high-quality/JJ/B-NP/I-PNP information/NN/I-NP/I-PNP from/IN/B-PP/B-PNP text/NN/B-NP/I-PNP ././O/O\nHigh-quality/JJ/B-NP/O information/NN/I-NP/O is/VBZ/B-VP/O typically/RB/I-VP/O derived/VBN/I-VP/O through/IN/B-PP/O the/DT/O/O devising/VBG/B-VP/O of/IN/B-PP/B-PNP patterns/NNS/B-NP/I-PNP and/CC/I-NP/I-PNP trends/NNS/I-NP/I-PNP through/IN/B-PP/O means/VBZ/B-VP/O such/JJ/B-ADJP/O as/IN/B-PP/B-PNP statistical/JJ/B-NP/I-PNP pattern/NN/I-NP/I-PNP learning/VBG/B-VP/I-PNP ././O/O\nText/NN/B-NP/O mining/NN/I-NP/O usually/RB/B-VP/O involves/VBZ/I-VP/O the/DT/B-NP/O process/NN/I-NP/O of/IN/B-PP/B-PNP structuring/VBG/B-VP/I-PNP the/DT/B-NP/I-PNP input/NN/I-NP/I-PNP text/NN/I-NP/I-PNP (/(/O/O usually/RB/B-VP/O parsing/VBG/I-VP/O ,/,/O/O along/IN/B-PP/B-PNP with/IN/I-PP/I-PNP the/DT/B-NP/I-PNP addition/NN/I-NP/I-PNP of/IN/B-PP/O some/DT/O/O derived/VBN/B-VP/O linguistic/JJ/B-NP/O features/NNS/I-NP/O and/CC/O/O the/DT/B-NP/O removal/NN/I-NP/O of/IN/B-PP/B-PNP others/NNS/B-NP/I-PNP ,/,/O/O and/CC/O/O subsequent/JJ/B-NP/O insertion/NN/I-NP/O into/IN/B-PP/B-PNP a/DT/B-NP/I-PNP database/NN/I-NP/I-PNP )/)/O/O ,/,/O/O deriving/VBG/B-VP/O patterns/NNS/B-NP/O within/IN/B-PP/O the/DT/O/O structured/VBN/B-VP/O data/NNS/B-NP/O ,/,/O/O and/CC/O/O finally/RB/B-ADVP/O evaluation/NN/B-NP/O and/CC/O/O interpretation/NN/B-NP/O of/IN/B-PP/B-PNP the/DT/B-NP/I-PNP output/NN/I-NP/I-PNP ././O/O\n'/POS/O/O High/NNP/B-NP/O quality/NN/I-NP/O '/POS/O/O in/IN/B-PP/B-PNP text/NN/B-NP/I-PNP mining/NN/I-NP/I-PNP usually/RB/B-VP/O refers/VBZ/I-VP/O to/TO/B-PP/B-PNP some/DT/B-NP/I-PNP combination/NN/I-NP/I-PNP of/IN/B-PP/B-PNP relevance/NN/B-NP/I-PNP ,/,/O/O novelty/NN/B-NP/O ,/,/O/O and/CC/O/O interestingness/NN/B-NP/O ././O/O\nTypical/JJ/B-NP/O text/NN/I-NP/O mining/NN/I-NP/O tasks/NNS/I-NP/O include/VBP/B-VP/O text/NN/B-NP/O categorization/NN/I-NP/O ,/,/O/O text/NN/B-NP/O clustering/VBG/B-VP/O ,/,/O/O concept&slash;entity/NN/B-NP/O extraction/NN/I-NP/O ,/,/O/O production/NN/B-NP/O of/IN/B-PP/B-PNP granular/JJ/B-NP/I-PNP taxonomies/NNS/I-NP/I-PNP ,/,/O/O sentiment/NN/B-NP/O analysis/NN/I-NP/O ,/,/O/O document/NN/B-NP/O summarization/NN/I-NP/O ,/,/O/O and/CC/O/O entity/NN/B-NP/O relation/NN/I-NP/O modeling/NN/I-NP/O (/(/O/O i.e./FW/O/O ,/,/O/O learning/VBG/B-VP/O relations/NNS/B-NP/O between/IN/B-PP/B-PNP named/VBN/B-VP/I-PNP entities/NNS/B-NP/I-PNP )/)/O/O ././O/O"
 
# Ngrams
In [40]: text_blob.ngrams(n=1)
Out[40]: 
[WordList(['Text']),
 WordList(['mining']),
 WordList(['also']),
 WordList(['referred']),
 WordList(['to']),
 WordList(['as']),
 WordList(['text']),
 WordList(['data']),
 WordList(['mining']),
 WordList(['roughly']),
 WordList(['equivalent']),
 WordList(['to']),
 WordList(['text']),
 WordList(['analytics']),
 WordList(['is']),
 WordList(['the']),
 WordList(['process']),
 WordList(['of']),
 WordList(['deriving']),
 WordList(['high-quality']),
 WordList(['information']),
 WordList(['from']),
 WordList(['text']),
 WordList(['High-quality']),
 WordList(['information']),
 WordList(['is']),
 WordList(['typically']),
 WordList(['derived']),
 WordList(['through']),
 WordList(['the']),
 WordList(['devising']),
 WordList(['of']),
 WordList(['patterns']),
 WordList(['and']),
 WordList(['trends']),
 WordList(['through']),
 WordList(['means']),
 WordList(['such']),
 WordList(['as']),
 WordList(['statistical']),
 WordList(['pattern']),
 WordList(['learning']),
 WordList(['Text']),
 WordList(['mining']),
 WordList(['usually']),
 WordList(['involves']),
 WordList(['the']),
 WordList(['process']),
 WordList(['of']),
 WordList(['structuring']),
 WordList(['the']),
 WordList(['input']),
 WordList(['text']),
 WordList(['usually']),
 WordList(['parsing']),
 WordList(['along']),
 WordList(['with']),
 WordList(['the']),
 WordList(['addition']),
 WordList(['of']),
 WordList(['some']),
 WordList(['derived']),
 WordList(['linguistic']),
 WordList(['features']),
 WordList(['and']),
 WordList(['the']),
 WordList(['removal']),
 WordList(['of']),
 WordList(['others']),
 WordList(['and']),
 WordList(['subsequent']),
 WordList(['insertion']),
 WordList(['into']),
 WordList(['a']),
 WordList(['database']),
 WordList(['deriving']),
 WordList(['patterns']),
 WordList(['within']),
 WordList(['the']),
 WordList(['structured']),
 WordList(['data']),
 WordList(['and']),
 WordList(['finally']),
 WordList(['evaluation']),
 WordList(['and']),
 WordList(['interpretation']),
 WordList(['of']),
 WordList(['the']),
 WordList(['output']),
 WordList(["'High"]),
 WordList(['quality']),
 WordList(['in']),
 WordList(['text']),
 WordList(['mining']),
 WordList(['usually']),
 WordList(['refers']),
 WordList(['to']),
 WordList(['some']),
 WordList(['combination']),
 WordList(['of']),
 WordList(['relevance']),
 WordList(['novelty']),
 WordList(['and']),
 WordList(['interestingness']),
 WordList(['Typical']),
 WordList(['text']),
 WordList(['mining']),
 WordList(['tasks']),
 WordList(['include']),
 WordList(['text']),
 WordList(['categorization']),
 WordList(['text']),
 WordList(['clustering']),
 WordList(['concept/entity']),
 WordList(['extraction']),
 WordList(['production']),
 WordList(['of']),
 WordList(['granular']),
 WordList(['taxonomies']),
 WordList(['sentiment']),
 WordList(['analysis']),
 WordList(['document']),
 WordList(['summarization']),
 WordList(['and']),
 WordList(['entity']),
 WordList(['relation']),
 WordList(['modeling']),
 WordList(['i.e']),
 WordList(['learning']),
 WordList(['relations']),
 WordList(['between']),
 WordList(['named']),
 WordList(['entities'])]
 
In [41]: text_blob.ngrams(n=2)
Out[41]: 
[WordList(['Text', 'mining']),
 WordList(['mining', 'also']),
 WordList(['also', 'referred']),
 WordList(['referred', 'to']),
 WordList(['to', 'as']),
 WordList(['as', 'text']),
 WordList(['text', 'data']),
 WordList(['data', 'mining']),
 WordList(['mining', 'roughly']),
 WordList(['roughly', 'equivalent']),
 WordList(['equivalent', 'to']),
 WordList(['to', 'text']),
 WordList(['text', 'analytics']),
 WordList(['analytics', 'is']),
 WordList(['is', 'the']),
 WordList(['the', 'process']),
 WordList(['process', 'of']),
 WordList(['of', 'deriving']),
 WordList(['deriving', 'high-quality']),
 WordList(['high-quality', 'information']),
 WordList(['information', 'from']),
 WordList(['from', 'text']),
 WordList(['text', 'High-quality']),
 WordList(['High-quality', 'information']),
 WordList(['information', 'is']),
 WordList(['is', 'typically']),
 WordList(['typically', 'derived']),
 WordList(['derived', 'through']),
 WordList(['through', 'the']),
 WordList(['the', 'devising']),
 WordList(['devising', 'of']),
 WordList(['of', 'patterns']),
 WordList(['patterns', 'and']),
 WordList(['and', 'trends']),
 WordList(['trends', 'through']),
 WordList(['through', 'means']),
 WordList(['means', 'such']),
 WordList(['such', 'as']),
 WordList(['as', 'statistical']),
 WordList(['statistical', 'pattern']),
 WordList(['pattern', 'learning']),
 WordList(['learning', 'Text']),
 WordList(['Text', 'mining']),
 WordList(['mining', 'usually']),
 WordList(['usually', 'involves']),
 WordList(['involves', 'the']),
 WordList(['the', 'process']),
 WordList(['process', 'of']),
 WordList(['of', 'structuring']),
 WordList(['structuring', 'the']),
 WordList(['the', 'input']),
 WordList(['input', 'text']),
 WordList(['text', 'usually']),
 WordList(['usually', 'parsing']),
 WordList(['parsing', 'along']),
 WordList(['along', 'with']),
 WordList(['with', 'the']),
 WordList(['the', 'addition']),
 WordList(['addition', 'of']),
 WordList(['of', 'some']),
 WordList(['some', 'derived']),
 WordList(['derived', 'linguistic']),
 WordList(['linguistic', 'features']),
 WordList(['features', 'and']),
 WordList(['and', 'the']),
 WordList(['the', 'removal']),
 WordList(['removal', 'of']),
 WordList(['of', 'others']),
 WordList(['others', 'and']),
 WordList(['and', 'subsequent']),
 WordList(['subsequent', 'insertion']),
 WordList(['insertion', 'into']),
 WordList(['into', 'a']),
 WordList(['a', 'database']),
 WordList(['database', 'deriving']),
 WordList(['deriving', 'patterns']),
 WordList(['patterns', 'within']),
 WordList(['within', 'the']),
 WordList(['the', 'structured']),
 WordList(['structured', 'data']),
 WordList(['data', 'and']),
 WordList(['and', 'finally']),
 WordList(['finally', 'evaluation']),
 WordList(['evaluation', 'and']),
 WordList(['and', 'interpretation']),
 WordList(['interpretation', 'of']),
 WordList(['of', 'the']),
 WordList(['the', 'output']),
 WordList(['output', "'High"]),
 WordList(["'High", 'quality']),
 WordList(['quality', 'in']),
 WordList(['in', 'text']),
 WordList(['text', 'mining']),
 WordList(['mining', 'usually']),
 WordList(['usually', 'refers']),
 WordList(['refers', 'to']),
 WordList(['to', 'some']),
 WordList(['some', 'combination']),
 WordList(['combination', 'of']),
 WordList(['of', 'relevance']),
 WordList(['relevance', 'novelty']),
 WordList(['novelty', 'and']),
 WordList(['and', 'interestingness']),
 WordList(['interestingness', 'Typical']),
 WordList(['Typical', 'text']),
 WordList(['text', 'mining']),
 WordList(['mining', 'tasks']),
 WordList(['tasks', 'include']),
 WordList(['include', 'text']),
 WordList(['text', 'categorization']),
 WordList(['categorization', 'text']),
 WordList(['text', 'clustering']),
 WordList(['clustering', 'concept/entity']),
 WordList(['concept/entity', 'extraction']),
 WordList(['extraction', 'production']),
 WordList(['production', 'of']),
 WordList(['of', 'granular']),
 WordList(['granular', 'taxonomies']),
 WordList(['taxonomies', 'sentiment']),
 WordList(['sentiment', 'analysis']),
 WordList(['analysis', 'document']),
 WordList(['document', 'summarization']),
 WordList(['summarization', 'and']),
 WordList(['and', 'entity']),
 WordList(['entity', 'relation']),
 WordList(['relation', 'modeling']),
 WordList(['modeling', 'i.e']),
 WordList(['i.e', 'learning']),
 WordList(['learning', 'relations']),
 WordList(['relations', 'between']),
 WordList(['between', 'named']),
 WordList(['named', 'entities'])]
 
In [42]: text_blob.ngrams(n=4)
Out[42]: 
[WordList(['Text', 'mining', 'also', 'referred']),
 WordList(['mining', 'also', 'referred', 'to']),
 WordList(['also', 'referred', 'to', 'as']),
 WordList(['referred', 'to', 'as', 'text']),
 WordList(['to', 'as', 'text', 'data']),
 WordList(['as', 'text', 'data', 'mining']),
 WordList(['text', 'data', 'mining', 'roughly']),
 WordList(['data', 'mining', 'roughly', 'equivalent']),
 WordList(['mining', 'roughly', 'equivalent', 'to']),
 WordList(['roughly', 'equivalent', 'to', 'text']),
 WordList(['equivalent', 'to', 'text', 'analytics']),
 WordList(['to', 'text', 'analytics', 'is']),
 WordList(['text', 'analytics', 'is', 'the']),
 WordList(['analytics', 'is', 'the', 'process']),
 WordList(['is', 'the', 'process', 'of']),
 WordList(['the', 'process', 'of', 'deriving']),
 WordList(['process', 'of', 'deriving', 'high-quality']),
 WordList(['of', 'deriving', 'high-quality', 'information']),
 WordList(['deriving', 'high-quality', 'information', 'from']),
 WordList(['high-quality', 'information', 'from', 'text']),
 WordList(['information', 'from', 'text', 'High-quality']),
 WordList(['from', 'text', 'High-quality', 'information']),
 WordList(['text', 'High-quality', 'information', 'is']),
 WordList(['High-quality', 'information', 'is', 'typically']),
 WordList(['information', 'is', 'typically', 'derived']),
 WordList(['is', 'typically', 'derived', 'through']),
 WordList(['typically', 'derived', 'through', 'the']),
 WordList(['derived', 'through', 'the', 'devising']),
 WordList(['through', 'the', 'devising', 'of']),
 WordList(['the', 'devising', 'of', 'patterns']),
 WordList(['devising', 'of', 'patterns', 'and']),
 WordList(['of', 'patterns', 'and', 'trends']),
 WordList(['patterns', 'and', 'trends', 'through']),
 WordList(['and', 'trends', 'through', 'means']),
 WordList(['trends', 'through', 'means', 'such']),
 WordList(['through', 'means', 'such', 'as']),
 WordList(['means', 'such', 'as', 'statistical']),
 WordList(['such', 'as', 'statistical', 'pattern']),
 WordList(['as', 'statistical', 'pattern', 'learning']),
 WordList(['statistical', 'pattern', 'learning', 'Text']),
 WordList(['pattern', 'learning', 'Text', 'mining']),
 WordList(['learning', 'Text', 'mining', 'usually']),
 WordList(['Text', 'mining', 'usually', 'involves']),
 WordList(['mining', 'usually', 'involves', 'the']),
 WordList(['usually', 'involves', 'the', 'process']),
 WordList(['involves', 'the', 'process', 'of']),
 WordList(['the', 'process', 'of', 'structuring']),
 WordList(['process', 'of', 'structuring', 'the']),
 WordList(['of', 'structuring', 'the', 'input']),
 WordList(['structuring', 'the', 'input', 'text']),
 WordList(['the', 'input', 'text', 'usually']),
 WordList(['input', 'text', 'usually', 'parsing']),
 WordList(['text', 'usually', 'parsing', 'along']),
 WordList(['usually', 'parsing', 'along', 'with']),
 WordList(['parsing', 'along', 'with', 'the']),
 WordList(['along', 'with', 'the', 'addition']),
 WordList(['with', 'the', 'addition', 'of']),
 WordList(['the', 'addition', 'of', 'some']),
 WordList(['addition', 'of', 'some', 'derived']),
 WordList(['of', 'some', 'derived', 'linguistic']),
 WordList(['some', 'derived', 'linguistic', 'features']),
 WordList(['derived', 'linguistic', 'features', 'and']),
 WordList(['linguistic', 'features', 'and', 'the']),
 WordList(['features', 'and', 'the', 'removal']),
 WordList(['and', 'the', 'removal', 'of']),
 WordList(['the', 'removal', 'of', 'others']),
 WordList(['removal', 'of', 'others', 'and']),
 WordList(['of', 'others', 'and', 'subsequent']),
 WordList(['others', 'and', 'subsequent', 'insertion']),
 WordList(['and', 'subsequent', 'insertion', 'into']),
 WordList(['subsequent', 'insertion', 'into', 'a']),
 WordList(['insertion', 'into', 'a', 'database']),
 WordList(['into', 'a', 'database', 'deriving']),
 WordList(['a', 'database', 'deriving', 'patterns']),
 WordList(['database', 'deriving', 'patterns', 'within']),
 WordList(['deriving', 'patterns', 'within', 'the']),
 WordList(['patterns', 'within', 'the', 'structured']),
 WordList(['within', 'the', 'structured', 'data']),
 WordList(['the', 'structured', 'data', 'and']),
 WordList(['structured', 'data', 'and', 'finally']),
 WordList(['data', 'and', 'finally', 'evaluation']),
 WordList(['and', 'finally', 'evaluation', 'and']),
 WordList(['finally', 'evaluation', 'and', 'interpretation']),
 WordList(['evaluation', 'and', 'interpretation', 'of']),
 WordList(['and', 'interpretation', 'of', 'the']),
 WordList(['interpretation', 'of', 'the', 'output']),
 WordList(['of', 'the', 'output', "'High"]),
 WordList(['the', 'output', "'High", 'quality']),
 WordList(['output', "'High", 'quality', 'in']),
 WordList(["'High", 'quality', 'in', 'text']),
 WordList(['quality', 'in', 'text', 'mining']),
 WordList(['in', 'text', 'mining', 'usually']),
 WordList(['text', 'mining', 'usually', 'refers']),
 WordList(['mining', 'usually', 'refers', 'to']),
 WordList(['usually', 'refers', 'to', 'some']),
 WordList(['refers', 'to', 'some', 'combination']),
 WordList(['to', 'some', 'combination', 'of']),
 WordList(['some', 'combination', 'of', 'relevance']),
 WordList(['combination', 'of', 'relevance', 'novelty']),
 WordList(['of', 'relevance', 'novelty', 'and']),
 WordList(['relevance', 'novelty', 'and', 'interestingness']),
 WordList(['novelty', 'and', 'interestingness', 'Typical']),
 WordList(['and', 'interestingness', 'Typical', 'text']),
 WordList(['interestingness', 'Typical', 'text', 'mining']),
 WordList(['Typical', 'text', 'mining', 'tasks']),
 WordList(['text', 'mining', 'tasks', 'include']),
 WordList(['mining', 'tasks', 'include', 'text']),
 WordList(['tasks', 'include', 'text', 'categorization']),
 WordList(['include', 'text', 'categorization', 'text']),
 WordList(['text', 'categorization', 'text', 'clustering']),
 WordList(['categorization', 'text', 'clustering', 'concept/entity']),
 WordList(['text', 'clustering', 'concept/entity', 'extraction']),
 WordList(['clustering', 'concept/entity', 'extraction', 'production']),
 WordList(['concept/entity', 'extraction', 'production', 'of']),
 WordList(['extraction', 'production', 'of', 'granular']),
 WordList(['production', 'of', 'granular', 'taxonomies']),
 WordList(['of', 'granular', 'taxonomies', 'sentiment']),
 WordList(['granular', 'taxonomies', 'sentiment', 'analysis']),
 WordList(['taxonomies', 'sentiment', 'analysis', 'document']),
 WordList(['sentiment', 'analysis', 'document', 'summarization']),
 WordList(['analysis', 'document', 'summarization', 'and']),
 WordList(['document', 'summarization', 'and', 'entity']),
 WordList(['summarization', 'and', 'entity', 'relation']),
 WordList(['and', 'entity', 'relation', 'modeling']),
 WordList(['entity', 'relation', 'modeling', 'i.e']),
 WordList(['relation', 'modeling', 'i.e', 'learning']),
 WordList(['modeling', 'i.e', 'learning', 'relations']),
 WordList(['i.e', 'learning', 'relations', 'between']),
 WordList(['learning', 'relations', 'between', 'named']),
 WordList(['relations', 'between', 'named', 'entities'])]

Posted by TextProcessing

Getting started with Word2Vec

1. Source by Google

Project with Code:

Blog:

Paper:
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

Note: The new model architectures:

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

Note: The Skip-gram Model with Hierarchical Softmax and Negative Sampling

[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

Note: It seems there is no more information available.

[4] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting Similarities among Languages for Machine Translation. arXiv, 2013.

Note: Interesting word2vec application to SMT (statistical machine translation)

[5] by Tomas Mikolov et al.

2. Best explanation of the original models, optimization methods, back-propagation background, and the Word Embedding Visual Inspector

Paper:

Slides:

Youtube Video: – word2vec and wevi

Demo:

3. Word2Vec Tutorials:

Word2Vec Tutorial by Chris McCormick:

a)
Note: Skip over the usual introductory and abstract insights about Word2Vec, and get into more of the details

b)

Alex Minnaar’s Tutorials

The original article URL is down; the following PDF versions are provided by Chris McCormick:

a)

b)

4. Learning by Coding

Python Word2Vec with Gensim, related articles:

a)

b)

c)

d)

e)

Note: Simple but very powerful tutorial for word2vec model training in gensim.
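
As a quick taste of what those articles cover, here is a minimal gensim sketch; the toy corpus is only for illustration, and the parameter names follow gensim 4.x (older releases use size and iter instead of vector_size and epochs):

from gensim.models import Word2Vec

# A toy corpus: one tokenized sentence per list entry
sentences = [
    ["text", "mining", "derives", "information", "from", "text"],
    ["word2vec", "learns", "vector", "representations", "of", "words"],
    ["text", "mining", "tasks", "include", "text", "categorization", "and", "clustering"],
]

# Train a small skip-gram model (sg=1); min_count=1 keeps every word of the toy corpus
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# Inspect the learned vector and nearest neighbours of a word
print(model.wv["text"][:5])
print(model.wv.most_similar("text", topn=3))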

5. Other Word2Vec Resources:

Posted by TextProcessing

Getting started with NLTK

About

Open Source Text Processing Project: NLTK

Install NLTK

1. Install the latest NLTK package on Ubuntu 16.04.1 LTS:

textprocessing@ubuntu:~$ sudo pip install -U nltk

Collecting nltk
Downloading nltk-3.2.2.tar.gz (1.2MB)
35% |███████████▍ | 409kB 20.8MB/s eta 0:00:0
……
100% |████████████████████████████████| 1.2MB 814kB/s
Collecting six (from nltk)
Downloading six-1.10.0-py2.py3-none-any.whl
Installing collected packages: six, nltk
Running setup.py install for nltk … done
Successfully installed nltk-3.2.2 six-1.10.0

2. Install Numpy (optional):

textprocessing@ubuntu:~$ sudo pip install -U numpy

Collecting numpy
Downloading numpy-1.12.0-cp27-cp27mu-manylinux1_x86_64.whl (16.5MB)
34% |███████████▏ | 5.7MB 30.8MB/s eta 0:00:0
……
100% |████████████████████████████████| 16.5MB 37kB/s
Installing collected packages: numpy
Successfully installed numpy-1.12.0

3. Test the installation: run python, then type import nltk

textprocessing@ubuntu:~$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]: import nltk

In [2]: nltk.__version__
Out[2]: '3.2.2'

It seems that NLTK is installed, but if you test the simplest word tokenization, you will meet a problem:

In [3]: sentence = "this's a test"

In [4]: tokens = nltk.word_tokenize(sentence)
---------------------------------------------------------------------------
LookupError Traceback (most recent call last)
in ()
----> 1 tokens = nltk.word_tokenize(sentence)

/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.pyc in word_tokenize(text, language)
107 :param language: the model name in the Punkt corpus
108 """
--> 109 return [token for sent in sent_tokenize(text, language)
110 for token in _treebank_word_tokenize(sent)]
111

/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.pyc in sent_tokenize(text, language)
91 :param language: the model name in the Punkt corpus
92 """
---> 93 tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
94 return tokenizer.tokenize(text)
95

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
806
807 # Load the resource.
--> 808 opened_resource = _open(resource_url)
809
810 if format == 'raw':

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in _open(resource_url)
924
925 if protocol is None or protocol.lower() == 'nltk':
--> 926 return find(path_, path + ['']).open()
927 elif protocol.lower() == 'file':
928 # urllib might not use mode='rb', so handle this one ourselves:

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in find(resource_name, paths)
646 sep = '*' * 70
647 resource_not_found = '\n%s\n%s\n%s' % (sep, msg, sep)
--> 648 raise LookupError(resource_not_found)
649
650

LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/textprocessing/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
**********************************************************************

Install NLTK Data

NLTK comes with many corpora, toy grammars, trained models, etc., all in nltk_data. You need to install nltk_data before you use NLTK.
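
If you only need specific resources (for example the punkt tokenizer models that word_tokenize requires), you can fetch them non-interactively; a minimal sketch:

import nltk

# Download just the Punkt sentence tokenizer models used by nltk.word_tokenize
nltk.download('punkt')

To download everything interactively instead, run nltk.download():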

In [5]: nltk.download()
NLTK Downloader
—————————————————————————
d) Download l) List u) Update c) Config h) Help q) Quit
—————————————————————————
Downloader> d

Download which package (l=list; x=cancel)?
Identifier> all
Downloading collection u’all’
|
| Downloading package abc to /home/textprocessing/nltk_data…
| Unzipping corpora/abc.zip.
| Downloading package alpino to
| /home/textprocessing/nltk_data…
| Unzipping corpora/alpino.zip.
| Downloading package biocreative_ppi to
| /home/textprocessing/nltk_data…
| Unzipping corpora/biocreative_ppi.zip.
| Downloading package brown to
| /home/textprocessing/nltk_data…
| Unzipping corpora/brown.zip.
| Downloading package brown_tei to
| /home/textprocessing/nltk_data…
| Unzipping corpora/brown_tei.zip.
| Downloading package cess_cat to
| /home/textprocessing/nltk_data…
| Unzipping corpora/cess_cat.zip.
| Downloading package cess_esp to
| /home/textprocessing/nltk_data…
| Unzipping corpora/cess_esp.zip.
| Downloading package chat80 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/chat80.zip.
| Downloading package city_database to
| /home/textprocessing/nltk_data…
| Unzipping corpora/city_database.zip.
| Downloading package cmudict to
| /home/textprocessing/nltk_data…
| Unzipping corpora/cmudict.zip.
| Downloading package comparative_sentences to
| /home/textprocessing/nltk_data…
| Unzipping corpora/comparative_sentences.zip.
| Downloading package comtrans to
| /home/textprocessing/nltk_data…
| Downloading package conll2000 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/conll2000.zip.
| Downloading package conll2002 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/conll2002.zip.
| Downloading package conll2007 to
| /home/textprocessing/nltk_data…
| Downloading package crubadan to
| /home/textprocessing/nltk_data…
| Unzipping corpora/crubadan.zip.
| Downloading package dependency_treebank to
| /home/textprocessing/nltk_data…
| Unzipping corpora/dependency_treebank.zip.
| Downloading package europarl_raw to
| /home/textprocessing/nltk_data…
| Unzipping corpora/europarl_raw.zip.
| Downloading package floresta to
| /home/textprocessing/nltk_data…
| Unzipping corpora/floresta.zip.
| Downloading package framenet_v15 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/framenet_v15.zip.
| Downloading package framenet_v17 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/framenet_v17.zip.
| Downloading package gazetteers to
| /home/textprocessing/nltk_data…
| Unzipping corpora/gazetteers.zip.
| Downloading package genesis to
| /home/textprocessing/nltk_data…
| Unzipping corpora/genesis.zip.
| Downloading package gutenberg to
| /home/textprocessing/nltk_data…
| Unzipping corpora/gutenberg.zip.
| Downloading package ieer to /home/textprocessing/nltk_data…
| Unzipping corpora/ieer.zip.
| Downloading package inaugural to
| /home/textprocessing/nltk_data…
| Unzipping corpora/inaugural.zip.
| Downloading package indian to
| /home/textprocessing/nltk_data…
| Unzipping corpora/indian.zip.
| Downloading package jeita to
| /home/textprocessing/nltk_data…
| Downloading package kimmo to
| /home/textprocessing/nltk_data…
| Unzipping corpora/kimmo.zip.
| Downloading package knbc to /home/textprocessing/nltk_data…
| Downloading package lin_thesaurus to
| /home/textprocessing/nltk_data…
| Unzipping corpora/lin_thesaurus.zip.
| Downloading package mac_morpho to
| /home/textprocessing/nltk_data…
| Unzipping corpora/mac_morpho.zip.
| Downloading package machado to
| /home/textprocessing/nltk_data…
| Downloading package masc_tagged to
| /home/textprocessing/nltk_data…
| Downloading package moses_sample to
| /home/textprocessing/nltk_data…
| Unzipping models/moses_sample.zip.
| Downloading package movie_reviews to
| /home/textprocessing/nltk_data…
| Unzipping corpora/movie_reviews.zip.
| Downloading package names to
| /home/textprocessing/nltk_data…
| Unzipping corpora/names.zip.
| Downloading package nombank.1.0 to
| /home/textprocessing/nltk_data…
| Downloading package nps_chat to
| /home/textprocessing/nltk_data…
| Unzipping corpora/nps_chat.zip.
| Downloading package omw to /home/textprocessing/nltk_data…
| Unzipping corpora/omw.zip.
| Downloading package opinion_lexicon to
| /home/textprocessing/nltk_data…
| Unzipping corpora/opinion_lexicon.zip.
| Downloading package paradigms to
| /home/textprocessing/nltk_data…
| Unzipping corpora/paradigms.zip.
| Downloading package pil to /home/textprocessing/nltk_data…
| Unzipping corpora/pil.zip.
| Downloading package pl196x to
| /home/textprocessing/nltk_data…
| Unzipping corpora/pl196x.zip.
| Downloading package ppattach to
| /home/textprocessing/nltk_data…
| Unzipping corpora/ppattach.zip.
| Downloading package problem_reports to
| /home/textprocessing/nltk_data…
| Unzipping corpora/problem_reports.zip.
| Downloading package propbank to
| /home/textprocessing/nltk_data…
| Downloading package ptb to /home/textprocessing/nltk_data…
| Unzipping corpora/ptb.zip.
| Downloading package product_reviews_1 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/product_reviews_1.zip.
| Downloading package product_reviews_2 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/product_reviews_2.zip.
| Downloading package pros_cons to
| /home/textprocessing/nltk_data…
| Unzipping corpora/pros_cons.zip.
| Downloading package qc to /home/textprocessing/nltk_data…
| Unzipping corpora/qc.zip.
| Downloading package reuters to
| /home/textprocessing/nltk_data…
| Downloading package rte to /home/textprocessing/nltk_data…
| Unzipping corpora/rte.zip.
| Downloading package semcor to
| /home/textprocessing/nltk_data…
| Downloading package senseval to
| /home/textprocessing/nltk_data…
| Unzipping corpora/senseval.zip.
| Downloading package sentiwordnet to
| /home/textprocessing/nltk_data…
| Unzipping corpora/sentiwordnet.zip.
| Downloading package sentence_polarity to
| /home/textprocessing/nltk_data…
| Unzipping corpora/sentence_polarity.zip.
| Downloading package shakespeare to
| /home/textprocessing/nltk_data…
| Unzipping corpora/shakespeare.zip.
| Downloading package sinica_treebank to
| /home/textprocessing/nltk_data…
| Unzipping corpora/sinica_treebank.zip.
| Downloading package smultron to
| /home/textprocessing/nltk_data…
| Unzipping corpora/smultron.zip.
| Downloading package state_union to
| /home/textprocessing/nltk_data…
| Unzipping corpora/state_union.zip.
| Downloading package stopwords to
| /home/textprocessing/nltk_data…
| Unzipping corpora/stopwords.zip.
| Downloading package subjectivity to
| /home/textprocessing/nltk_data…
| Unzipping corpora/subjectivity.zip.
| Downloading package swadesh to
| /home/textprocessing/nltk_data…
| Unzipping corpora/swadesh.zip.
| Downloading package switchboard to
| /home/textprocessing/nltk_data…
| Unzipping corpora/switchboard.zip.
| Downloading package timit to
| /home/textprocessing/nltk_data…
| Unzipping corpora/timit.zip.
| Downloading package toolbox to
| /home/textprocessing/nltk_data…
| Unzipping corpora/toolbox.zip.
| Downloading package treebank to
| /home/textprocessing/nltk_data…
| Unzipping corpora/treebank.zip.
| Downloading package twitter_samples to
| /home/textprocessing/nltk_data…
| Unzipping corpora/twitter_samples.zip.
| Downloading package udhr to /home/textprocessing/nltk_data…
| Unzipping corpora/udhr.zip.
| Downloading package udhr2 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/udhr2.zip.
| Downloading package unicode_samples to
| /home/textprocessing/nltk_data…
| Unzipping corpora/unicode_samples.zip.
| Downloading package universal_treebanks_v20 to
| /home/textprocessing/nltk_data…
| Downloading package verbnet to
| /home/textprocessing/nltk_data…
| Unzipping corpora/verbnet.zip.
| Downloading package webtext to
| /home/textprocessing/nltk_data…
| Unzipping corpora/webtext.zip.
| Downloading package wordnet to
| /home/textprocessing/nltk_data…
| Unzipping corpora/wordnet.zip.
| Downloading package wordnet_ic to
| /home/textprocessing/nltk_data…
| Unzipping corpora/wordnet_ic.zip.
| Downloading package words to
| /home/textprocessing/nltk_data…
| Unzipping corpora/words.zip.
| Downloading package ycoe to /home/textprocessing/nltk_data…
| Unzipping corpora/ycoe.zip.
| Downloading package rslp to /home/textprocessing/nltk_data…
| Unzipping stemmers/rslp.zip.
| Downloading package hmm_treebank_pos_tagger to
| /home/textprocessing/nltk_data…
| Unzipping taggers/hmm_treebank_pos_tagger.zip.
| Downloading package maxent_treebank_pos_tagger to
| /home/textprocessing/nltk_data…
| Unzipping taggers/maxent_treebank_pos_tagger.zip.
| Downloading package universal_tagset to
| /home/textprocessing/nltk_data…
| Unzipping taggers/universal_tagset.zip.
| Downloading package maxent_ne_chunker to
| /home/textprocessing/nltk_data…
| Unzipping chunkers/maxent_ne_chunker.zip.
| Downloading package punkt to
| /home/textprocessing/nltk_data…
| Unzipping tokenizers/punkt.zip.
| Downloading package book_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/book_grammars.zip.
| Downloading package sample_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/sample_grammars.zip.
| Downloading package spanish_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/spanish_grammars.zip.
| Downloading package basque_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/basque_grammars.zip.
| Downloading package large_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/large_grammars.zip.
| Downloading package tagsets to
| /home/textprocessing/nltk_data…
| Unzipping help/tagsets.zip.
| Downloading package snowball_data to
| /home/textprocessing/nltk_data…
| Downloading package bllip_wsj_no_aux to
| /home/textprocessing/nltk_data…
| Unzipping models/bllip_wsj_no_aux.zip.
| Downloading package word2vec_sample to
| /home/textprocessing/nltk_data…
| Unzipping models/word2vec_sample.zip.
| Downloading package panlex_swadesh to
| /home/textprocessing/nltk_data…
| Downloading package mte_teip5 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/mte_teip5.zip.
| Downloading package averaged_perceptron_tagger to
| /home/textprocessing/nltk_data…
| Unzipping taggers/averaged_perceptron_tagger.zip.
| Downloading package panlex_lite to
| /home/textprocessing/nltk_data…
| Unzipping corpora/panlex_lite.zip.
| Downloading package perluniprops to
| /home/textprocessing/nltk_data…
| Unzipping misc/perluniprops.zip.
| Downloading package nonbreaking_prefixes to
| /home/textprocessing/nltk_data…
| Unzipping corpora/nonbreaking_prefixes.zip.
| Downloading package vader_lexicon to
| /home/textprocessing/nltk_data…
| Downloading package porter_test to
| /home/textprocessing/nltk_data…
| Unzipping stemmers/porter_test.zip.
| Downloading package wmt15_eval to
| /home/textprocessing/nltk_data…
| Unzipping models/wmt15_eval.zip.
| Downloading package mwa_ppdb to
| /home/textprocessing/nltk_data…
| Unzipping misc/mwa_ppdb.zip.
|
Done downloading collection all

—————————————————————————
d) Download l) List u) Update c) Config h) Help q) Quit
—————————————————————————
Downloader> q
Out[5]: True
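
If you only need specific resources, nltk.download() also accepts a package identifier, so you can fetch just the packages the examples below rely on instead of the whole "all" collection. A minimal sketch (the identifiers are the package names that appear in the download log above):

# Download only the resources used in the examples below.
import nltk

for package in ['punkt',                      # Punkt sentence tokenizer models
                'averaged_perceptron_tagger', # default POS tagger model
                'maxent_ne_chunker',          # named entity chunker
                'words',                      # word list used by the chunker
                'wordnet']:                   # WordNet corpus
    nltk.download(package)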

Using NLTK

In [15]: sentences = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve: natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation."""

In [16]: sents = nltk.sent_tokenize(sentences)

In [17]: for sent in sents:
   ....:     print sent
   ....:
Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.
As such, NLP is related to the area of human–computer interaction.
Many challenges in NLP involve: natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation.

In [18]: tokens = nltk.word_tokenize(sentences)

In [19]: print tokens
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', '.', 'As', 'such', ',', 'NLP', 'is', 'related', 'to', 'the', 'area', 'of', 'human\xe2\x80\x93computer', 'interaction', '.', 'Many', 'challenges', 'in', 'NLP', 'involve', ':', 'natural', 'language', 'understanding', ',', 'enabling', 'computers', 'to', 'derive', 'meaning', 'from', 'human', 'or', 'natural', 'language', 'input', ';', 'and', 'others', 'involve', 'natural', 'language', 'generation', '.']

In [20]: tagged_tokens = nltk.pos_tag(tokens)

In [21]: print tagged_tokens
[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('artificial', 'JJ'), ('intelligence', 'NN'), (',', ','), ('and', 'CC'), ('computational', 'JJ'), ('linguistics', 'NNS'), ('concerned', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('interactions', 'NNS'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('(', '('), ('natural', 'JJ'), (')', ')'), ('languages', 'VBZ'), ('.', '.'), ('As', 'IN'), ('such', 'JJ'), (',', ','), ('NLP', 'NNP'), ('is', 'VBZ'), ('related', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('area', 'NN'), ('of', 'IN'), ('human\xe2\x80\x93computer', 'NN'), ('interaction', 'NN'), ('.', '.'), ('Many', 'JJ'), ('challenges', 'NNS'), ('in', 'IN'), ('NLP', 'NNP'), ('involve', 'NN'), (':', ':'), ('natural', 'JJ'), ('language', 'NN'), ('understanding', 'NN'), (',', ','), ('enabling', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('derive', 'VB'), ('meaning', 'NN'), ('from', 'IN'), ('human', 'NN'), ('or', 'CC'), ('natural', 'JJ'), ('language', 'NN'), ('input', 'NN'), (';', ':'), ('and', 'CC'), ('others', 'NNS'), ('involve', 'VBP'), ('natural', 'JJ'), ('language', 'NN'), ('generation', 'NN'), ('.', '.')]

In [22]: entities = nltk.chunk.ne_chunk(tagged_tokens)

In [23]: entities
Out[23]: Tree('S', [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), Tree('ORGANIZATION', [('NLP', 'NNP')]), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('artificial', 'JJ'), ('intelligence', 'NN'), (',', ','), ('and', 'CC'), ('computational', 'JJ'), ('linguistics', 'NNS'), ('concerned', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('interactions', 'NNS'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('(', '('), ('natural', 'JJ'), (')', ')'), ('languages', 'VBZ'), ('.', '.'), ('As', 'IN'), ('such', 'JJ'), (',', ','), Tree('ORGANIZATION', [('NLP', 'NNP')]), ('is', 'VBZ'), ('related', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('area', 'NN'), ('of', 'IN'), ('human\xe2\x80\x93computer', 'NN'), ('interaction', 'NN'), ('.', '.'), ('Many', 'JJ'), ('challenges', 'NNS'), ('in', 'IN'), Tree('ORGANIZATION', [('NLP', 'NNP')]), ('involve', 'NN'), (':', ':'), ('natural', 'JJ'), ('language', 'NN'), ('understanding', 'NN'), (',', ','), ('enabling', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('derive', 'VB'), ('meaning', 'NN'), ('from', 'IN'), ('human', 'NN'), ('or', 'CC'), ('natural', 'JJ'), ('language', 'NN'), ('input', 'NN'), (';', ':'), ('and', 'CC'), ('others', 'NNS'), ('involve', 'VBP'), ('natural', 'JJ'), ('language', 'NN'), ('generation', 'NN'), ('.', '.')])
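
The chunker wraps each recognised entity in a Tree node labelled with the entity type, so you can pull the entities out by walking the top level of the result. Continuing the same session, a loop like the following should print one line per entity, matching the three ORGANIZATION subtrees visible in Out[23]:

In [24]: for subtree in entities:
   ....:     if hasattr(subtree, 'label'):
   ....:         print subtree.label(), ' '.join(word for word, tag in subtree.leaves())
   ....:
ORGANIZATION NLP
ORGANIZATION NLP
ORGANIZATION NLP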

For more about NLTK, we recommend the “” series and the official book: “”

Posted by “TextProcessing”

Open Source Text Processing Project: Wapiti

Wapiti – A simple and fast discriminative sequence labelling toolkit

Project Website:
Github Link:

Description

Wapiti is a very fast toolkit for segmenting and labeling sequences with discriminative models. It is based on maxent models, maximum entropy Markov models and linear-chain CRFs, and offers various optimization and regularization methods to improve both the computational complexity and the prediction performance of standard models. Wapiti has been ranked first on the sequence tagging task on the MLcomp web site for more than a year.

Features

Handle large label and feature sets
Wapiti has been used to train models with more than one thousand labels and models with several billion features. Training time still increases with the size of these sets, but provided you have enough computing power and memory, Wapiti will handle them without problems.

L-BFGS, OWL-QN, SGD-L1, BCD, and RPROP training algorithms
Wapiti implements all the standard training algorithms. They are highly optimized and can be combined to improve both computational and generalization performance.

L1, L2, or Elastic-net regularization
Wapiti provides several regularization methods that reduce overfitting and enable efficient feature selection.

Powerful features extraction system
Wapiti uses an extended version of the CRF++ pattern syntax for extracting features, which reduces both the amount of pre-processing required and the size of data files.

Multi-threaded and vectorized implementation
To further improve performance, all optimization algorithms can take advantage of SSE instructions when available. The quasi-Newton and RPROP optimization algorithms are parallelized and scale very well on multi-processor machines.

N-best Viterbi output
Viterbi decoding can output the single best label sequence as well as the n-best ones. Decoding can be done with classical Viterbi for CRFs or through posteriors, which are slower but generally lead to better results and give normalized scores.

Compact model creation
When used with L1 or elastic-net penalties, Wapiti can remove unused features and create compact models that load faster and use less memory, speeding up labeling.

Sparse forward-backward
A dedicated sparse forward-backward procedure is used during training to take advantage of the sparsity of the model and speed up computation.

Written in standard C99+POSIX
The Wapiti source code is written almost entirely in standard C99 and should work on any computer. However, the multi-threading code uses POSIX threads and the SSE code targets the x86 platform; both are optional and can be disabled or rewritten for other platforms.

Open source (BSD Licence)

Open Source Text Processing Project: segtok

segtok: sentence segmentation and word tokenization tools

Project Website:
Github Link:

Description

A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.

The segtok package provides two modules, segtok.segmenter and segtok.tokenizer. The segmenter provides functionality for splitting (Indo-European) text into sentences. The tokenizer provides functionality for splitting (Indo-European) sentences into words and symbols (collectively called tokens). Both modules can also be used from the command-line. While other Indo-European languages could work, it has only been designed with languages such as Spanish, English, and German in mind.

To install this package, you should have the latest official version of Python 2 or 3 installed. The package has been reported to work with Python 2.7, 3.3, and 3.4 and is tested against the latest Python 2 and 3 branches. The easiest way to get it installed is using pip or any other package manager that works with PyPI:

pip install segtok
Important: If you are on a Linux machine and have problems installing the regex dependency of segtok, make sure you have the python-dev and/or python3-dev packages installed to get the necessary headers to compile the package.

Then try the command line tools on some plain-text files (e.g., this README) to see if segtok meets your needs:

segmenter README.rst | tokenizer
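
The same functionality is available from Python. A minimal sketch, assuming the split_single sentence splitter and the word_tokenizer helper documented for segtok.segmenter and segtok.tokenizer (check the README of the version you installed for the exact names):

# Split raw text into sentences, then tokenize each sentence.
from segtok.segmenter import split_single
from segtok.tokenizer import word_tokenizer

text = u"This is one sentence. Here is another one, with a U.S. abbreviation."
for sentence in split_single(text):
    print(word_tokenizer(sentence))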

Open Source Text Processing Project: nlp-with-ruby

nlp-with-ruby: Awesome NLP with Ruby

Project Website: None

Github Link:

Description

This curated list comprises awesome resources, libraries, and information sources about computational processing of texts in human languages with Ruby. That field is often referred to as NLP, Computational Linguistics, or HLT (Human Language Technology), and can be combined with Artificial Intelligence, Machine Learning, Information Retrieval and other related disciplines.

Open Source Text Processing Project: textacy

textacy: higher-level NLP built on spaCy

Project Website:

Github Link:

Description

textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance spaCy library. With the basics — tokenization, part-of-speech tagging, dependency parsing, etc. — offloaded to another library, textacy focuses on tasks facilitated by the ready availability of tokenized, POS-tagged, and parsed text.

Features
Stream text, json, csv, and spaCy binary data to and from disk
Clean and normalize raw text, before analyzing it
Explore included corpora of Congressional speeches and Supreme Court decisions, or stream documents from standard Wikipedia pages and Reddit comments datasets
Access and filter basic linguistic elements, such as words and ngrams, noun chunks and sentences
Extract named entities, acronyms and their definitions, direct quotations, key terms, and more from documents
Compare strings, sets, and documents by a variety of similarity metrics
Transform documents and corpora into vectorized and semantic network representations
Train, interpret, visualize, and save sklearn-style topic models using LSA, LDA, or NMF methods
Identify a text’s language, display key words in context (KWIC), true-case words, and navigate a parse tree
… and more!
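
As a rough idea of what working with textacy looks like, here is a hedged sketch using the make_spacy_doc constructor and the textacy.extract helpers of recent textacy releases (older releases exposed a textacy.Doc wrapper instead, so adjust to your installed version; it also assumes the spaCy en_core_web_sm model is available):

# Build a spaCy-backed document and pull out bigrams and named entities.
import textacy
import textacy.extract

text = (u"Natural language processing (NLP) is a field of computer science, "
        u"artificial intelligence, and computational linguistics.")
doc = textacy.make_spacy_doc(text, lang="en_core_web_sm")

print(list(textacy.extract.ngrams(doc, 2, filter_stops=True)))
print(list(textacy.extract.entities(doc)))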

Open Source Text Processing Project: vivekn sentiment

Sentiment analysis using machine learning techniques

Project Website:

Github Link:

Description

Sentiment analysis using machine learning techniques.

Check info.py for the training and testing code. A demo of the tool is available here.

Refer to this paper for more information about the algorithms used:

http://arxiv.org/abs/1305.6143

This tool works by examining individual words and short sequences of words (n-grams) and comparing them with a probability model. The probability model is built on a prelabeled test set of IMDb movie reviews. It can also detect negations in phrases, i.e., the phrase “not bad” will be classified as positive despite containing two words that individually carry negative sentiment.
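
To make the idea concrete, here is a toy illustration (not the project's code) of a Naive Bayes-style scorer over such features, where a negation word is merged with the token that follows it so that "not bad" becomes a single feature with its own probabilities:

# Toy negation-aware sentiment scorer; the log-probability table is made up
# purely for illustration and would normally be estimated from labelled reviews.
import math

logprob = {
    'pos': {'good': -1.0, 'bad': -3.0, 'not_bad': -1.2},
    'neg': {'good': -3.0, 'bad': -1.0, 'not_bad': -3.5},
}

def features(tokens):
    # Emit unigram features, merging "not"/"n't" with the following token.
    feats, skip = [], False
    for i, tok in enumerate(tokens):
        if skip:
            skip = False
            continue
        if tok in ('not', "n't") and i + 1 < len(tokens):
            feats.append('not_' + tokens[i + 1])
            skip = True
        else:
            feats.append(tok)
    return feats

def classify(tokens):
    # Sum per-class log-probabilities, with a small floor for unseen features.
    feats = features(tokens)
    scores = {}
    for label, table in logprob.items():
        scores[label] = sum(table.get(f, math.log(1e-6)) for f in feats)
    return max(scores, key=scores.get)

print(classify(['not', 'bad']))  # 'pos': the merged feature "not_bad" scores as positive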

Open Source Deep Learning Project: Paddle

Paddle: PArallel Distributed Deep LEarning

Project Website:

Github Link:

Description

PaddlePaddle (PArallel Distributed Deep LEarning) is an easy-to-use, efficient, flexible and scalable deep learning platform, which was originally developed by Baidu scientists and engineers to apply deep learning to many products at Baidu.

Features

Flexibility

PaddlePaddle supports a wide range of neural network architectures and optimization algorithms. It is easy to configure complex models such as a neural machine translation model with an attention mechanism or complex memory connections.

Efficiency

In order to unleash the power of heterogeneous computing resources, optimization occurs at different levels of PaddlePaddle, including computing, memory, architecture and communication. The following are some examples:

Optimized math operations through SSE/AVX intrinsics, BLAS libraries (e.g. MKL, ATLAS, cuBLAS) or customized CPU/GPU kernels.
Highly optimized recurrent networks which can handle variable-length sequences without padding.
Optimized local and distributed training for models with high-dimensional sparse data.

Scalability

With PaddlePaddle, it is easy to use many CPUs/GPUs and machines to speed up your training. PaddlePaddle can achieve high throughput and performance via optimized communication.

Connected to Products

In addition, PaddlePaddle is designed to be easily deployable. At Baidu, PaddlePaddle has been deployed in products and services with vast numbers of users, including ad click-through rate (CTR) prediction, large-scale image classification, optical character recognition (OCR), search ranking, computer virus detection, recommendation, etc. It is widely used in products at Baidu and has achieved significant impact. We hope you can also exploit the capability of PaddlePaddle to make a huge impact with your product.