A Beginner’s Guide to TextBlob

About TextBlob

TextBlob is an open-source Python library for text processing that builds on NLTK.

Install TextBlob

Install the latest TextBlob on Ubuntu 16.04.1 LTS:

textprocessing@ubuntu:~$ sudo pip install -U textblob

Collecting textblob
Downloading textblob-0.12.0-py2.py3-none-any.whl (631kB)

Requirement already up-to-date: nltk>=3.1 in /usr/local/lib/python2.7/dist-packages (from textblob)
Requirement already up-to-date: six in /usr/local/lib/python2.7/dist-packages (from nltk>=3.1->textblob)
Installing collected packages: textblob
Successfully installed textblob-0.12.0

textprocessing@ubuntu:~$ sudo python -m textblob.download_corpora

[nltk_data] Downloading package brown to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
[nltk_data] Downloading package conll2000 to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data] /home/textprocessing/nltk_data…
[nltk_data] Unzipping corpora/movie_reviews.zip.
Finished.
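Before opening an interactive session, it can help to confirm that the packages are importable from the interpreter you plan to use. The following is a minimal sketch using only the standard library and assuming Python 3.4 or later; the helper name `is_installed` is hypothetical and not part of TextBlob.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name is on the import path."""
    # find_spec returns None when no module with this name can be located.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Report on the two packages that the install step above mentions.
    for name in ("textblob", "nltk"):
        status = "found" if is_installed(name) else "missing"
        print("{}: {}".format(name, status))
```

If either package is reported as missing, the `pip` that ran the install probably belongs to a different Python than the one running the check.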

Test TextBlob

textprocessing@ubuntu:~$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
Type "copyright", "credits" or "license" for more information.
 
IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: from textblob import TextBlob
 
In [2]: test_text = """
Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
"""
 
In [3]: text_blob = TextBlob(test_text)
 
# Word Tokenization
In [4]: text_blob.words
Out[4]: WordList(['Text', 'mining', 'also', 'referred', 'to', 'as', 'text', 'data', 'mining', 'roughly', 'equivalent', 'to', 'text', 'analytics', 'is', 'the', 'process', 'of', 'deriving', 'high-quality', 'information', 'from', 'text', 'High-quality', 'information', 'is', 'typically', 'derived', 'through', 'the', 'devising', 'of', 'patterns', 'and', 'trends', 'through', 'means', 'such', 'as', 'statistical', 'pattern', 'learning', 'Text', 'mining', 'usually', 'involves', 'the', 'process', 'of', 'structuring', 'the', 'input', 'text', 'usually', 'parsing', 'along', 'with', 'the', 'addition', 'of', 'some', 'derived', 'linguistic', 'features', 'and', 'the', 'removal', 'of', 'others', 'and', 'subsequent', 'insertion', 'into', 'a', 'database', 'deriving', 'patterns', 'within', 'the', 'structured', 'data', 'and', 'finally', 'evaluation', 'and', 'interpretation', 'of', 'the', 'output', "'High", 'quality', 'in', 'text', 'mining', 'usually', 'refers', 'to', 'some', 'combination', 'of', 'relevance', 'novelty', 'and', 'interestingness', 'Typical', 'text', 'mining', 'tasks', 'include', 'text', 'categorization', 'text', 'clustering', 'concept/entity', 'extraction', 'production', 'of', 'granular', 'taxonomies', 'sentiment', 'analysis', 'document', 'summarization', 'and', 'entity', 'relation', 'modeling', 'i.e', 'learning', 'relations', 'between', 'named', 'entities'])
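Once the text is split into words, a common next step is counting how often each one occurs. TextBlob exposes a `word_counts` attribute for this; the sketch below instead uses only the standard library on a short excerpt of the list above, so it runs without TextBlob installed. The helper name `word_frequencies` is an assumption made for illustration.

```python
from collections import Counter

# A short excerpt of the WordList shown above, as a plain Python list.
words = ['Text', 'mining', 'also', 'referred', 'to', 'as', 'text', 'data',
         'mining', 'roughly', 'equivalent', 'to', 'text', 'analytics']


def word_frequencies(tokens):
    """Count tokens case-insensitively, so 'Text' and 'text' are merged."""
    return Counter(token.lower() for token in tokens)


freqs = word_frequencies(words)
print(freqs['text'])    # 'Text' once plus 'text' twice
print(freqs['mining'])
```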
 
# Sentence Tokenization
In [5]: text_blob.sentences
Out[5]: 
[Sentence("
 Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text."),
 Sentence("High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning."),
 Sentence("Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output."),
 Sentence("'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness."),
 Sentence("Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).")]
 
In [6]: for sentence in text_blob.sentences:
   ...:     print(sentence)
   ...:     
 
Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text.
High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.
Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.
'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness.
Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
 
# Sentiment Analysis (per sentence)
In [7]: for sentence in text_blob.sentences:
   ...:     print(sentence.sentiment)
   ...:     
Sentiment(polarity=-0.1, subjectivity=0.4)
Sentiment(polarity=-0.08333333333333333, subjectivity=0.5)
Sentiment(polarity=-0.08, subjectivity=0.32999999999999996)
Sentiment(polarity=-0.045, subjectivity=0.39499999999999996)
Sentiment(polarity=-0.16666666666666666, subjectivity=0.5)
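Polarity ranges from -1.0 (negative) to 1.0 (positive), and subjectivity from 0.0 (objective) to 1.0 (subjective). One way to make the raw numbers easier to read is to map polarity to a coarse label. The sketch below is only an illustration: the `Sentiment` namedtuple is a stand-in for the tuples printed above so the code runs without TextBlob, and the `label` helper with its 0.05 threshold is an arbitrary assumption, not something TextBlob defines.

```python
from collections import namedtuple

# Stand-in for the Sentiment tuples printed above.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

scores = [
    Sentiment(polarity=-0.1, subjectivity=0.4),
    Sentiment(polarity=-0.08333333333333333, subjectivity=0.5),
    Sentiment(polarity=-0.08, subjectivity=0.32999999999999996),
    Sentiment(polarity=-0.045, subjectivity=0.39499999999999996),
    Sentiment(polarity=-0.16666666666666666, subjectivity=0.5),
]


def label(polarity, threshold=0.05):
    """Map a polarity score to a coarse label (threshold is a hypothetical choice)."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


labels = [label(s.polarity) for s in scores]
print(labels)
```

With this threshold, four of the five sentences come out as mildly negative and one as neutral, which is a reminder that a lexicon-based scorer can read a purely descriptive passage as slightly negative.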
 
# POS Tagging
In [8]: text_blob.tags
Out[8]: 
[('Text', u'NNP'),
 ('mining', u'NN'),
 ('also', u'RB'),
 ('referred', u'VBD'),
 ('to', u'TO'),
 ('as', u'IN'),
 ('text', u'NN'),
 ('data', u'NNS'),
 ('mining', u'NN'),
 ('roughly', u'RB'),
 ('equivalent', u'JJ'),
 ('to', u'TO'),
 ('text', u'VB'),
 ('analytics', u'NNS'),
 ('is', u'VBZ'),
 ('the', u'DT'),
 ('process', u'NN'),
 ('of', u'IN'),
 ('deriving', u'VBG'),
 ('high-quality', u'JJ'),
 ('information', u'NN'),
 ('from', u'IN'),
 ('text', u'NN'),
 ('High-quality', u'NNP'),
 ('information', u'NN'),
 ('is', u'VBZ'),
 ('typically', u'RB'),
 ('derived', u'VBN'),
 ('through', u'IN'),
 ('the', u'DT'),
 ('devising', u'NN'),
 ('of', u'IN'),
 ('patterns', u'NNS'),
 ('and', u'CC'),
 ('trends', u'NNS'),
 ('through', u'IN'),
 ('means', u'NNS'),
 ('such', u'JJ'),
 ('as', u'IN'),
 ('statistical', u'JJ'),
 ('pattern', u'NN'),
 ('learning', u'VBG'),
 ('Text', u'NNP'),
 ('mining', u'NN'),
 ('usually', u'RB'),
 ('involves', u'VBZ'),
 ('the', u'DT'),
 ('process', u'NN'),
 ('of', u'IN'),
 ('structuring', u'VBG'),
 ('the', u'DT'),
 ('input', u'NN'),
 ('text', u'NN'),
 ('usually', u'RB'),
 ('parsing', u'VBG'),
 ('along', u'IN'),
 ('with', u'IN'),
 ('the', u'DT'),
 ('addition', u'NN'),
 ('of', u'IN'),
 ('some', u'DT'),
 ('derived', u'VBN'),
 ('linguistic', u'JJ'),
 ('features', u'NNS'),
 ('and', u'CC'),
 ('the', u'DT'),
 ('removal', u'NN'),
 ('of', u'IN'),
 ('others', u'NNS'),
 ('and', u'CC'),
 ('subsequent', u'JJ'),
 ('insertion', u'NN'),
 ('into', u'IN'),
 ('a', u'DT'),
 ('database', u'NN'),
 ('deriving', u'VBG'),
 ('patterns', u'NNS'),
 ('within', u'IN'),
 ('the', u'DT'),
 ('structured', u'JJ'),
 ('data', u'NNS'),
 ('and', u'CC'),
 ('finally', u'RB'),
 ('evaluation', u'NN'),
 ('and', u'CC'),
 ('interpretation', u'NN'),
 ('of', u'IN'),
 ('the', u'DT'),
 ('output', u'NN'),
 ("'High", u'JJ'),
 ('quality', u'NN'),
 ('in', u'IN'),
 ('text', u'JJ'),
 ('mining', u'NN'),
 ('usually', u'RB'),
 ('refers', u'VBZ'),
 ('to', u'TO'),
 ('some', u'DT'),
 ('combination', u'NN'),
 ('of', u'IN'),
 ('relevance', u'NN'),
 ('novelty', u'NN'),
 ('and', u'CC'),
 ('interestingness', u'NN'),
 ('Typical', u'JJ'),
 ('text', u'NN'),
 ('mining', u'NN'),
 ('tasks', u'NNS'),
 ('include', u'VBP'),
 ('text', u'JJ'),
 ('categorization', u'NN'),
 ('text', u'NN'),
 ('clustering', u'NN'),
 ('concept/entity', u'NN'),
 ('extraction', u'NN'),
 ('production', u'NN'),
 ('of', u'IN'),
 ('granular', u'JJ'),
 ('taxonomies', u'NNS'),
 ('sentiment', u'NN'),
 ('analysis', u'NN'),
 ('document', u'NN'),
 ('summarization', u'NN'),
 ('and', u'CC'),
 ('entity', u'NN'),
 ('relation', u'NN'),
 ('modeling', u'NN'),
 ('i.e.', u'FW'),
 ('learning', u'VBG'),
 ('relations', u'NNS'),
 ('between', u'IN'),
 ('named', u'VBN'),
 ('entities', u'NNS')]
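The tags follow the Penn Treebank convention, where noun tags start with `NN`, verb tags with `VB`, and adjective tags with `JJ`. Because `tags` is a list of `(word, tag)` pairs, filtering by word class takes one list comprehension. A minimal sketch on an excerpt of the output above; the helper name `words_with_tag_prefix` is hypothetical.

```python
# Excerpt of the (word, tag) pairs shown above, as plain tuples.
tags = [('Text', 'NNP'), ('mining', 'NN'), ('also', 'RB'), ('referred', 'VBD'),
        ('to', 'TO'), ('as', 'IN'), ('text', 'NN'), ('data', 'NNS'),
        ('mining', 'NN')]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tags, 'NN')   # NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tags, 'VB')   # VB, VBD, VBG, VBN, VBP, VBZ
print(nouns)
print(verbs)
```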
 
# Noun Phrase Extraction
In [9]: text_blob.noun_phrases
Out[9]: WordList(['text', u'text data', u'text analytics', u'high-quality information', 'high-quality', u'statistical pattern learning', 'text', u'input text', u'subsequent insertion', u"'high quality", u'typical text', u'text categorization', u'concept/entity extraction', u'granular taxonomies', u'sentiment analysis', u'document summarization', u'entity relation', u'learning relations'])
 
# Sentiment Analysis (whole text)
In [10]: text_blob.sentiment
Out[10]: Sentiment(polarity=-0.08393939393939392, subjectivity=0.39454545454545453)
 
# Singularize and Pluralize
In [11]: text_blob.words[-1]
Out[11]: 'entities'
 
In [12]: text_blob.words[-1].singularize()
Out[12]: 'entity'
 
In [13]: text_blob.words[1]
Out[13]: 'mining'
 
In [14]: text_blob.words[1].pluralize()
Out[14]: 'minings'
 
In [15]: text_blob.words[0]
Out[15]: 'Text'
 
In [16]: text_blob.words[0].pluralize()
Out[16]: 'Texts'
 
# Lemmatization
In [17]: from textblob import Word
 
In [18]: w = Word("are")
 
In [19]: w.lemmatize()
Out[19]: 'are'
 
In [20]: w.lemmatize('v')
Out[20]: u'be'
 
# WordNet
In [21]: from textblob.wordnet import VERB
 
In [22]: word = Word("are")
 
In [23]: word.synsets
Out[23]: 
[Synset('are.n.01'),
 Synset('be.v.01'),
 Synset('be.v.02'),
 Synset('be.v.03'),
 Synset('exist.v.01'),
 Synset('be.v.05'),
 Synset('equal.v.01'),
 Synset('constitute.v.01'),
 Synset('be.v.08'),
 Synset('embody.v.02'),
 Synset('be.v.10'),
 Synset('be.v.11'),
 Synset('be.v.12'),
 Synset('cost.v.01')]
 
In [24]: word.definitions
Out[24]: 
[u'a unit of surface area equal to 100 square meters',
 u'have the quality of being; (copula, used with an adjective or a predicate noun)',
 u'be identical to; be someone or something',
 u'occupy a certain position or area; be somewhere',
 u'have an existence, be extant',
 u'happen, occur, take place; this was during the visit to my parents\' house"',
 u'be identical or equivalent to',
 u'form or compose',
 u'work in a specific place, with a specific subject, or in a specific function',
 u'represent, as of a character on stage',
 u'spend or use time',
 u'have life, be alive',
 u'to remain unmolested, undisturbed, or uninterrupted -- used only in infinitive form',
 u'be priced at']
 
# Spelling Correction
In [26]: spelling_test = TextBlob("I m ok")
 
In [27]: print(spelling_test.correct())
I m ok
 
In [28]: spelling_test = TextBlob("I havv good speling!")
 
In [30]: print(spelling_test.correct())
I have good spelling!
 
# Translation
In [31]: text_blob.translate(to='zh')
Out[31]: TextBlob("文本挖掘,也称为文本数据挖掘,大致相当于文本分析,是从文本中获取高质量信息的过程。高质量的信息通常是通过统计模式学习等手段来设计模式和趋势。文本挖掘通常涉及构造输入文本的过程(通常解析,以及添加一些派生的语言特征以及删除其他内容,并随后插入数据库),导出结构化数据中的模式,最后进行评估和解释的输出。文本挖掘中的“高质量”通常指相关性,新颖性和趣味性的一些组合。典型的文本挖掘任务包括文本分类,文本聚类,概念/实体提取,粒度分类法的生成,情绪分析,文档摘要和实体关系建模(即命名实体之间的学习关系)。")
 
# Language Detection
In [36]: text_blob2 = TextBlob(u"这是中文测试")
 
In [37]: text_blob2.detect_language()
Out[37]: u'zh-CN'
 
# Parser
In [39]: text_blob.parse()
Out[39]: u"Text/NN/B-NP/O mining/NN/I-NP/O ,/,/O/O also/RB/B-VP/O referred/VBN/I-VP/O to/TO/B-PP/B-PNP as/IN/I-PP/I-PNP text/NN/B-NP/I-PNP data/NNS/I-NP/I-PNP mining/NN/I-NP/I-PNP ,/,/O/O roughly/RB/B-ADVP/O equivalent/NN/B-NP/O to/TO/B-PP/B-PNP text/NN/B-NP/I-PNP analytics/NNS/I-NP/I-PNP ,/,/O/O is/VBZ/B-VP/O the/DT/B-NP/O process/NN/I-NP/O of/IN/B-PP/B-PNP deriving/VBG/B-VP/I-PNP high-quality/JJ/B-NP/I-PNP information/NN/I-NP/I-PNP from/IN/B-PP/B-PNP text/NN/B-NP/I-PNP ././O/O\nHigh-quality/JJ/B-NP/O information/NN/I-NP/O is/VBZ/B-VP/O typically/RB/I-VP/O derived/VBN/I-VP/O through/IN/B-PP/O the/DT/O/O devising/VBG/B-VP/O of/IN/B-PP/B-PNP patterns/NNS/B-NP/I-PNP and/CC/I-NP/I-PNP trends/NNS/I-NP/I-PNP through/IN/B-PP/O means/VBZ/B-VP/O such/JJ/B-ADJP/O as/IN/B-PP/B-PNP statistical/JJ/B-NP/I-PNP pattern/NN/I-NP/I-PNP learning/VBG/B-VP/I-PNP ././O/O\nText/NN/B-NP/O mining/NN/I-NP/O usually/RB/B-VP/O involves/VBZ/I-VP/O the/DT/B-NP/O process/NN/I-NP/O of/IN/B-PP/B-PNP structuring/VBG/B-VP/I-PNP the/DT/B-NP/I-PNP input/NN/I-NP/I-PNP text/NN/I-NP/I-PNP (/(/O/O usually/RB/B-VP/O parsing/VBG/I-VP/O ,/,/O/O along/IN/B-PP/B-PNP with/IN/I-PP/I-PNP the/DT/B-NP/I-PNP addition/NN/I-NP/I-PNP of/IN/B-PP/O some/DT/O/O derived/VBN/B-VP/O linguistic/JJ/B-NP/O features/NNS/I-NP/O and/CC/O/O the/DT/B-NP/O removal/NN/I-NP/O of/IN/B-PP/B-PNP others/NNS/B-NP/I-PNP ,/,/O/O and/CC/O/O subsequent/JJ/B-NP/O insertion/NN/I-NP/O into/IN/B-PP/B-PNP a/DT/B-NP/I-PNP database/NN/I-NP/I-PNP )/)/O/O ,/,/O/O deriving/VBG/B-VP/O patterns/NNS/B-NP/O within/IN/B-PP/O the/DT/O/O structured/VBN/B-VP/O data/NNS/B-NP/O ,/,/O/O and/CC/O/O finally/RB/B-ADVP/O evaluation/NN/B-NP/O and/CC/O/O interpretation/NN/B-NP/O of/IN/B-PP/B-PNP the/DT/B-NP/I-PNP output/NN/I-NP/I-PNP ././O/O\n'/POS/O/O High/NNP/B-NP/O quality/NN/I-NP/O '/POS/O/O in/IN/B-PP/B-PNP text/NN/B-NP/I-PNP mining/NN/I-NP/I-PNP usually/RB/B-VP/O refers/VBZ/I-VP/O to/TO/B-PP/B-PNP some/DT/B-NP/I-PNP combination/NN/I-NP/I-PNP of/IN/B-PP/B-PNP 
relevance/NN/B-NP/I-PNP ,/,/O/O novelty/NN/B-NP/O ,/,/O/O and/CC/O/O interestingness/NN/B-NP/O ././O/O\nTypical/JJ/B-NP/O text/NN/I-NP/O mining/NN/I-NP/O tasks/NNS/I-NP/O include/VBP/B-VP/O text/NN/B-NP/O categorization/NN/I-NP/O ,/,/O/O text/NN/B-NP/O clustering/VBG/B-VP/O ,/,/O/O concept&slash;entity/NN/B-NP/O extraction/NN/I-NP/O ,/,/O/O production/NN/B-NP/O of/IN/B-PP/B-PNP granular/JJ/B-NP/I-PNP taxonomies/NNS/I-NP/I-PNP ,/,/O/O sentiment/NN/B-NP/O analysis/NN/I-NP/O ,/,/O/O document/NN/B-NP/O summarization/NN/I-NP/O ,/,/O/O and/CC/O/O entity/NN/B-NP/O relation/NN/I-NP/O modeling/NN/I-NP/O (/(/O/O i.e./FW/O/O ,/,/O/O learning/VBG/B-VP/O relations/NNS/B-NP/O between/IN/B-PP/B-PNP named/VBN/B-VP/I-PNP entities/NNS/B-NP/I-PNP )/)/O/O ././O/O"
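The parser returns one long string: sentences are separated by newlines, tokens by spaces, and each token carries slash-separated fields. The sketch below splits such a string into structured records. It assumes the four fields are the word, its part-of-speech tag, its chunk tag, and its prepositional-noun-phrase tag, which is how the output above reads; the `Token` tuple and `split_parse` helper are hypothetical names, not part of TextBlob.

```python
from collections import namedtuple

# Hypothetical record type for one parsed token.
Token = namedtuple("Token", ["word", "pos", "chunk", "pnp"])

# Excerpt of the parse output shown above.
parsed = ("Text/NN/B-NP/O mining/NN/I-NP/O ,/,/O/O also/RB/B-VP/O "
          "referred/VBN/I-VP/O to/TO/B-PP/B-PNP")


def split_parse(parse_string):
    """Split parser output into a list of sentences, each a list of Tokens."""
    sentences = []
    for line in parse_string.split("\n"):
        # Split from the right so the last three fields are always the tags.
        tokens = [Token(*item.rsplit("/", 3)) for item in line.split()]
        sentences.append(tokens)
    return sentences


sentences = split_parse(parsed)
for token in sentences[0]:
    print(token.word, token.pos, token.chunk, token.pnp)
```

Chunk tags use the B-/I-/O scheme: `B-NP` opens a noun phrase, `I-NP` continues it, and `O` marks a token outside any chunk.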
 
# Ngrams
In [40]: text_blob.ngrams(n=1)
Out[40]: 
[WordList(['Text']),
 WordList(['mining']),
 WordList(['also']),
 WordList(['referred']),
 WordList(['to']),
 WordList(['as']),
 WordList(['text']),
 WordList(['data']),
 WordList(['mining']),
 WordList(['roughly']),
 WordList(['equivalent']),
 WordList(['to']),
 WordList(['text']),
 WordList(['analytics']),
 WordList(['is']),
 WordList(['the']),
 WordList(['process']),
 WordList(['of']),
 WordList(['deriving']),
 WordList(['high-quality']),
 WordList(['information']),
 WordList(['from']),
 WordList(['text']),
 WordList(['High-quality']),
 WordList(['information']),
 WordList(['is']),
 WordList(['typically']),
 WordList(['derived']),
 WordList(['through']),
 WordList(['the']),
 WordList(['devising']),
 WordList(['of']),
 WordList(['patterns']),
 WordList(['and']),
 WordList(['trends']),
 WordList(['through']),
 WordList(['means']),
 WordList(['such']),
 WordList(['as']),
 WordList(['statistical']),
 WordList(['pattern']),
 WordList(['learning']),
 WordList(['Text']),
 WordList(['mining']),
 WordList(['usually']),
 WordList(['involves']),
 WordList(['the']),
 WordList(['process']),
 WordList(['of']),
 WordList(['structuring']),
 WordList(['the']),
 WordList(['input']),
 WordList(['text']),
 WordList(['usually']),
 WordList(['parsing']),
 WordList(['along']),
 WordList(['with']),
 WordList(['the']),
 WordList(['addition']),
 WordList(['of']),
 WordList(['some']),
 WordList(['derived']),
 WordList(['linguistic']),
 WordList(['features']),
 WordList(['and']),
 WordList(['the']),
 WordList(['removal']),
 WordList(['of']),
 WordList(['others']),
 WordList(['and']),
 WordList(['subsequent']),
 WordList(['insertion']),
 WordList(['into']),
 WordList(['a']),
 WordList(['database']),
 WordList(['deriving']),
 WordList(['patterns']),
 WordList(['within']),
 WordList(['the']),
 WordList(['structured']),
 WordList(['data']),
 WordList(['and']),
 WordList(['finally']),
 WordList(['evaluation']),
 WordList(['and']),
 WordList(['interpretation']),
 WordList(['of']),
 WordList(['the']),
 WordList(['output']),
 WordList(["'High"]),
 WordList(['quality']),
 WordList(['in']),
 WordList(['text']),
 WordList(['mining']),
 WordList(['usually']),
 WordList(['refers']),
 WordList(['to']),
 WordList(['some']),
 WordList(['combination']),
 WordList(['of']),
 WordList(['relevance']),
 WordList(['novelty']),
 WordList(['and']),
 WordList(['interestingness']),
 WordList(['Typical']),
 WordList(['text']),
 WordList(['mining']),
 WordList(['tasks']),
 WordList(['include']),
 WordList(['text']),
 WordList(['categorization']),
 WordList(['text']),
 WordList(['clustering']),
 WordList(['concept/entity']),
 WordList(['extraction']),
 WordList(['production']),
 WordList(['of']),
 WordList(['granular']),
 WordList(['taxonomies']),
 WordList(['sentiment']),
 WordList(['analysis']),
 WordList(['document']),
 WordList(['summarization']),
 WordList(['and']),
 WordList(['entity']),
 WordList(['relation']),
 WordList(['modeling']),
 WordList(['i.e']),
 WordList(['learning']),
 WordList(['relations']),
 WordList(['between']),
 WordList(['named']),
 WordList(['entities'])]
 
In [41]: text_blob.ngrams(n=2)
Out[41]: 
[WordList(['Text', 'mining']),
 WordList(['mining', 'also']),
 WordList(['also', 'referred']),
 WordList(['referred', 'to']),
 WordList(['to', 'as']),
 WordList(['as', 'text']),
 WordList(['text', 'data']),
 WordList(['data', 'mining']),
 WordList(['mining', 'roughly']),
 WordList(['roughly', 'equivalent']),
 WordList(['equivalent', 'to']),
 WordList(['to', 'text']),
 WordList(['text', 'analytics']),
 WordList(['analytics', 'is']),
 WordList(['is', 'the']),
 WordList(['the', 'process']),
 WordList(['process', 'of']),
 WordList(['of', 'deriving']),
 WordList(['deriving', 'high-quality']),
 WordList(['high-quality', 'information']),
 WordList(['information', 'from']),
 WordList(['from', 'text']),
 WordList(['text', 'High-quality']),
 WordList(['High-quality', 'information']),
 WordList(['information', 'is']),
 WordList(['is', 'typically']),
 WordList(['typically', 'derived']),
 WordList(['derived', 'through']),
 WordList(['through', 'the']),
 WordList(['the', 'devising']),
 WordList(['devising', 'of']),
 WordList(['of', 'patterns']),
 WordList(['patterns', 'and']),
 WordList(['and', 'trends']),
 WordList(['trends', 'through']),
 WordList(['through', 'means']),
 WordList(['means', 'such']),
 WordList(['such', 'as']),
 WordList(['as', 'statistical']),
 WordList(['statistical', 'pattern']),
 WordList(['pattern', 'learning']),
 WordList(['learning', 'Text']),
 WordList(['Text', 'mining']),
 WordList(['mining', 'usually']),
 WordList(['usually', 'involves']),
 WordList(['involves', 'the']),
 WordList(['the', 'process']),
 WordList(['process', 'of']),
 WordList(['of', 'structuring']),
 WordList(['structuring', 'the']),
 WordList(['the', 'input']),
 WordList(['input', 'text']),
 WordList(['text', 'usually']),
 WordList(['usually', 'parsing']),
 WordList(['parsing', 'along']),
 WordList(['along', 'with']),
 WordList(['with', 'the']),
 WordList(['the', 'addition']),
 WordList(['addition', 'of']),
 WordList(['of', 'some']),
 WordList(['some', 'derived']),
 WordList(['derived', 'linguistic']),
 WordList(['linguistic', 'features']),
 WordList(['features', 'and']),
 WordList(['and', 'the']),
 WordList(['the', 'removal']),
 WordList(['removal', 'of']),
 WordList(['of', 'others']),
 WordList(['others', 'and']),
 WordList(['and', 'subsequent']),
 WordList(['subsequent', 'insertion']),
 WordList(['insertion', 'into']),
 WordList(['into', 'a']),
 WordList(['a', 'database']),
 WordList(['database', 'deriving']),
 WordList(['deriving', 'patterns']),
 WordList(['patterns', 'within']),
 WordList(['within', 'the']),
 WordList(['the', 'structured']),
 WordList(['structured', 'data']),
 WordList(['data', 'and']),
 WordList(['and', 'finally']),
 WordList(['finally', 'evaluation']),
 WordList(['evaluation', 'and']),
 WordList(['and', 'interpretation']),
 WordList(['interpretation', 'of']),
 WordList(['of', 'the']),
 WordList(['the', 'output']),
 WordList(['output', "'High"]),
 WordList(["'High", 'quality']),
 WordList(['quality', 'in']),
 WordList(['in', 'text']),
 WordList(['text', 'mining']),
 WordList(['mining', 'usually']),
 WordList(['usually', 'refers']),
 WordList(['refers', 'to']),
 WordList(['to', 'some']),
 WordList(['some', 'combination']),
 WordList(['combination', 'of']),
 WordList(['of', 'relevance']),
 WordList(['relevance', 'novelty']),
 WordList(['novelty', 'and']),
 WordList(['and', 'interestingness']),
 WordList(['interestingness', 'Typical']),
 WordList(['Typical', 'text']),
 WordList(['text', 'mining']),
 WordList(['mining', 'tasks']),
 WordList(['tasks', 'include']),
 WordList(['include', 'text']),
 WordList(['text', 'categorization']),
 WordList(['categorization', 'text']),
 WordList(['text', 'clustering']),
 WordList(['clustering', 'concept/entity']),
 WordList(['concept/entity', 'extraction']),
 WordList(['extraction', 'production']),
 WordList(['production', 'of']),
 WordList(['of', 'granular']),
 WordList(['granular', 'taxonomies']),
 WordList(['taxonomies', 'sentiment']),
 WordList(['sentiment', 'analysis']),
 WordList(['analysis', 'document']),
 WordList(['document', 'summarization']),
 WordList(['summarization', 'and']),
 WordList(['and', 'entity']),
 WordList(['entity', 'relation']),
 WordList(['relation', 'modeling']),
 WordList(['modeling', 'i.e']),
 WordList(['i.e', 'learning']),
 WordList(['learning', 'relations']),
 WordList(['relations', 'between']),
 WordList(['between', 'named']),
 WordList(['named', 'entities'])]
 
In [42]: text_blob.ngrams(n=4)
Out[42]: 
[WordList(['Text', 'mining', 'also', 'referred']),
 WordList(['mining', 'also', 'referred', 'to']),
 WordList(['also', 'referred', 'to', 'as']),
 WordList(['referred', 'to', 'as', 'text']),
 WordList(['to', 'as', 'text', 'data']),
 WordList(['as', 'text', 'data', 'mining']),
 WordList(['text', 'data', 'mining', 'roughly']),
 WordList(['data', 'mining', 'roughly', 'equivalent']),
 WordList(['mining', 'roughly', 'equivalent', 'to']),
 WordList(['roughly', 'equivalent', 'to', 'text']),
 WordList(['equivalent', 'to', 'text', 'analytics']),
 WordList(['to', 'text', 'analytics', 'is']),
 WordList(['text', 'analytics', 'is', 'the']),
 WordList(['analytics', 'is', 'the', 'process']),
 WordList(['is', 'the', 'process', 'of']),
 WordList(['the', 'process', 'of', 'deriving']),
 WordList(['process', 'of', 'deriving', 'high-quality']),
 WordList(['of', 'deriving', 'high-quality', 'information']),
 WordList(['deriving', 'high-quality', 'information', 'from']),
 WordList(['high-quality', 'information', 'from', 'text']),
 WordList(['information', 'from', 'text', 'High-quality']),
 WordList(['from', 'text', 'High-quality', 'information']),
 WordList(['text', 'High-quality', 'information', 'is']),
 WordList(['High-quality', 'information', 'is', 'typically']),
 WordList(['information', 'is', 'typically', 'derived']),
 WordList(['is', 'typically', 'derived', 'through']),
 WordList(['typically', 'derived', 'through', 'the']),
 WordList(['derived', 'through', 'the', 'devising']),
 WordList(['through', 'the', 'devising', 'of']),
 WordList(['the', 'devising', 'of', 'patterns']),
 WordList(['devising', 'of', 'patterns', 'and']),
 WordList(['of', 'patterns', 'and', 'trends']),
 WordList(['patterns', 'and', 'trends', 'through']),
 WordList(['and', 'trends', 'through', 'means']),
 WordList(['trends', 'through', 'means', 'such']),
 WordList(['through', 'means', 'such', 'as']),
 WordList(['means', 'such', 'as', 'statistical']),
 WordList(['such', 'as', 'statistical', 'pattern']),
 WordList(['as', 'statistical', 'pattern', 'learning']),
 WordList(['statistical', 'pattern', 'learning', 'Text']),
 WordList(['pattern', 'learning', 'Text', 'mining']),
 WordList(['learning', 'Text', 'mining', 'usually']),
 WordList(['Text', 'mining', 'usually', 'involves']),
 WordList(['mining', 'usually', 'involves', 'the']),
 WordList(['usually', 'involves', 'the', 'process']),
 WordList(['involves', 'the', 'process', 'of']),
 WordList(['the', 'process', 'of', 'structuring']),
 WordList(['process', 'of', 'structuring', 'the']),
 WordList(['of', 'structuring', 'the', 'input']),
 WordList(['structuring', 'the', 'input', 'text']),
 WordList(['the', 'input', 'text', 'usually']),
 WordList(['input', 'text', 'usually', 'parsing']),
 WordList(['text', 'usually', 'parsing', 'along']),
 WordList(['usually', 'parsing', 'along', 'with']),
 WordList(['parsing', 'along', 'with', 'the']),
 WordList(['along', 'with', 'the', 'addition']),
 WordList(['with', 'the', 'addition', 'of']),
 WordList(['the', 'addition', 'of', 'some']),
 WordList(['addition', 'of', 'some', 'derived']),
 WordList(['of', 'some', 'derived', 'linguistic']),
 WordList(['some', 'derived', 'linguistic', 'features']),
 WordList(['derived', 'linguistic', 'features', 'and']),
 WordList(['linguistic', 'features', 'and', 'the']),
 WordList(['features', 'and', 'the', 'removal']),
 WordList(['and', 'the', 'removal', 'of']),
 WordList(['the', 'removal', 'of', 'others']),
 WordList(['removal', 'of', 'others', 'and']),
 WordList(['of', 'others', 'and', 'subsequent']),
 WordList(['others', 'and', 'subsequent', 'insertion']),
 WordList(['and', 'subsequent', 'insertion', 'into']),
 WordList(['subsequent', 'insertion', 'into', 'a']),
 WordList(['insertion', 'into', 'a', 'database']),
 WordList(['into', 'a', 'database', 'deriving']),
 WordList(['a', 'database', 'deriving', 'patterns']),
 WordList(['database', 'deriving', 'patterns', 'within']),
 WordList(['deriving', 'patterns', 'within', 'the']),
 WordList(['patterns', 'within', 'the', 'structured']),
 WordList(['within', 'the', 'structured', 'data']),
 WordList(['the', 'structured', 'data', 'and']),
 WordList(['structured', 'data', 'and', 'finally']),
 WordList(['data', 'and', 'finally', 'evaluation']),
 WordList(['and', 'finally', 'evaluation', 'and']),
 WordList(['finally', 'evaluation', 'and', 'interpretation']),
 WordList(['evaluation', 'and', 'interpretation', 'of']),
 WordList(['and', 'interpretation', 'of', 'the']),
 WordList(['interpretation', 'of', 'the', 'output']),
 WordList(['of', 'the', 'output', "'High"]),
 WordList(['the', 'output', "'High", 'quality']),
 WordList(['output', "'High", 'quality', 'in']),
 WordList(["'High", 'quality', 'in', 'text']),
 WordList(['quality', 'in', 'text', 'mining']),
 WordList(['in', 'text', 'mining', 'usually']),
 WordList(['text', 'mining', 'usually', 'refers']),
 WordList(['mining', 'usually', 'refers', 'to']),
 WordList(['usually', 'refers', 'to', 'some']),
 WordList(['refers', 'to', 'some', 'combination']),
 WordList(['to', 'some', 'combination', 'of']),
 WordList(['some', 'combination', 'of', 'relevance']),
 WordList(['combination', 'of', 'relevance', 'novelty']),
 WordList(['of', 'relevance', 'novelty', 'and']),
 WordList(['relevance', 'novelty', 'and', 'interestingness']),
 WordList(['novelty', 'and', 'interestingness', 'Typical']),
 WordList(['and', 'interestingness', 'Typical', 'text']),
 WordList(['interestingness', 'Typical', 'text', 'mining']),
 WordList(['Typical', 'text', 'mining', 'tasks']),
 WordList(['text', 'mining', 'tasks', 'include']),
 WordList(['mining', 'tasks', 'include', 'text']),
 WordList(['tasks', 'include', 'text', 'categorization']),
 WordList(['include', 'text', 'categorization', 'text']),
 WordList(['text', 'categorization', 'text', 'clustering']),
 WordList(['categorization', 'text', 'clustering', 'concept/entity']),
 WordList(['text', 'clustering', 'concept/entity', 'extraction']),
 WordList(['clustering', 'concept/entity', 'extraction', 'production']),
 WordList(['concept/entity', 'extraction', 'production', 'of']),
 WordList(['extraction', 'production', 'of', 'granular']),
 WordList(['production', 'of', 'granular', 'taxonomies']),
 WordList(['of', 'granular', 'taxonomies', 'sentiment']),
 WordList(['granular', 'taxonomies', 'sentiment', 'analysis']),
 WordList(['taxonomies', 'sentiment', 'analysis', 'document']),
 WordList(['sentiment', 'analysis', 'document', 'summarization']),
 WordList(['analysis', 'document', 'summarization', 'and']),
 WordList(['document', 'summarization', 'and', 'entity']),
 WordList(['summarization', 'and', 'entity', 'relation']),
 WordList(['and', 'entity', 'relation', 'modeling']),
 WordList(['entity', 'relation', 'modeling', 'i.e']),
 WordList(['relation', 'modeling', 'i.e', 'learning']),
 WordList(['modeling', 'i.e', 'learning', 'relations']),
 WordList(['i.e', 'learning', 'relations', 'between']),
 WordList(['learning', 'relations', 'between', 'named']),
 WordList(['relations', 'between', 'named', 'entities'])]
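The n-gram output above is what a sliding window of width n over the word list produces. A minimal pure-Python sketch of that idea follows; it illustrates the technique and is not TextBlob's own implementation. Plain lists stand in for `WordList`.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order, as lists."""
    if n <= 0:
        return []
    # A sequence of N tokens has N - n + 1 windows of width n (none if N < n).
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


words = ['Text', 'mining', 'also', 'referred']
print(ngrams(words, 2))
print(ngrams(words, 4))
```

The window count also explains the list lengths above: each increase of n by one shortens the result by one entry.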

Posted by TextProcessing

