Getting started with NLTK

About

Open Source Text Processing Project: NLTK

Install NLTK

1. Install the latest NLTK package on Ubuntu 16.04.1 LTS:

textprocessing@ubuntu:~$ sudo pip install -U nltk

Collecting nltk
Downloading nltk-3.2.2.tar.gz (1.2MB)
35% |███████████▍ | 409kB 20.8MB/s eta 0:00:0
……
100% |████████████████████████████████| 1.2MB 814kB/s
Collecting six (from nltk)
Downloading six-1.10.0-py2.py3-none-any.whl
Installing collected packages: six, nltk
Running setup.py install for nltk … done
Successfully installed nltk-3.2.2 six-1.10.0

2. Install NumPy (optional):

textprocessing@ubuntu:~$ sudo pip install -U numpy

Collecting numpy
Downloading numpy-1.12.0-cp27-cp27mu-manylinux1_x86_64.whl (16.5MB)
34% |███████████▏ | 5.7MB 30.8MB/s eta 0:00:0
……
100% |████████████████████████████████| 16.5MB 37kB/s
Installing collected packages: numpy
Successfully installed numpy-1.12.0

3. Test the installation: start Python (IPython is used here), then type import nltk:

textprocessing@ubuntu:~$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]: import nltk

In [2]: nltk.__version__
Out[2]: '3.2.2'
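The same check can also be scripted rather than typed at the prompt. The following is a minimal sketch, not part of the original session: `installed_version` is a hypothetical helper name, and it assumes the package exposes a `__version__` attribute, as NLTK does above.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Not every package defines __version__; fall back to None in that case.
    return getattr(module, "__version__", None)


if __name__ == "__main__":
    # nltk is required by this tutorial, numpy is the optional extra from step 2.
    for name in ("nltk", "numpy"):
        version = installed_version(name)
        print("%s: %s" % (name, version if version else "not importable or no version"))
```

A return value of None only says that the import failed or that no version string was found; it does not say why.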

NLTK itself is now installed, but even the simplest call, word_tokenize, fails because the tokenizer data it depends on has not been downloaded yet:

In [3]: sentence = "this's a test"

In [4]: tokens = nltk.word_tokenize(sentence)
---------------------------------------------------------------------------
LookupError Traceback (most recent call last)
in ()
----> 1 tokens = nltk.word_tokenize(sentence)

/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.pyc in word_tokenize(text, language)
107 :param language: the model name in the Punkt corpus
108 """
--> 109 return [token for sent in sent_tokenize(text, language)
110 for token in _treebank_word_tokenize(sent)]
111

/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.pyc in sent_tokenize(text, language)
91 :param language: the model name in the Punkt corpus
92 """
---> 93 tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
94 return tokenizer.tokenize(text)
95

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
806
807 # Load the resource.
--> 808 opened_resource = _open(resource_url)
809
810 if format == 'raw':

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in _open(resource_url)
924
925 if protocol is None or protocol.lower() == 'nltk':
--> 926 return find(path_, path + ['']).open()
927 elif protocol.lower() == 'file':
928 # urllib might not use mode='rb', so handle this one ourselves:

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in find(resource_name, paths)
646 sep = '*' * 70
647 resource_not_found = '\n%s\n%s\n%s' % (sep, msg, sep)
--> 648 raise LookupError(resource_not_found)
649
650

LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/textprocessing/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
**********************************************************************
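The LookupError above comes from the resource lookup, so it is possible to test for missing data before calling a tokenizer. The sketch below is one hedged way to do that: `missing_resources` is a hypothetical helper, and the lookup function is passed in as a parameter so the logic can be tried without NLTK. With NLTK installed, the function to pass would be `nltk.data.find`, the same `find` that raises the error in the traceback.

```python
def missing_resources(resource_paths, find):
    """Return the resource paths for which `find` raises LookupError."""
    missing = []
    for path in resource_paths:
        try:
            find(path)
        except LookupError:
            # Same exception type as in the traceback above.
            missing.append(path)
    return missing


# Intended use with NLTK (requires nltk to be installed):
#   import nltk
#   print(missing_resources(["tokenizers/punkt"], nltk.data.find))
```

An empty result means every listed resource was found in one of the searched directories.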

Install NLTK Data

NLTK comes with many corpora, toy grammars, trained models, and so on, all distributed as nltk_data. You need to install nltk_data before you can use most of NLTK. The session below downloads the complete "all" collection:

In [5]: nltk.download()
NLTK Downloader
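---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------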
Downloader> d

Download which package (l=list; x=cancel)?
Identifier> all
Downloading collection u'all'
|
| Downloading package abc to /home/textprocessing/nltk_data…
| Unzipping corpora/abc.zip.
| Downloading package alpino to
| /home/textprocessing/nltk_data…
| Unzipping corpora/alpino.zip.
| Downloading package biocreative_ppi to
| /home/textprocessing/nltk_data…
| Unzipping corpora/biocreative_ppi.zip.
| Downloading package brown to
| /home/textprocessing/nltk_data…
| Unzipping corpora/brown.zip.
| Downloading package brown_tei to
| /home/textprocessing/nltk_data…
| Unzipping corpora/brown_tei.zip.
| Downloading package cess_cat to
| /home/textprocessing/nltk_data…
| Unzipping corpora/cess_cat.zip.
| Downloading package cess_esp to
| /home/textprocessing/nltk_data…
| Unzipping corpora/cess_esp.zip.
| Downloading package chat80 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/chat80.zip.
| Downloading package city_database to
| /home/textprocessing/nltk_data…
| Unzipping corpora/city_database.zip.
| Downloading package cmudict to
| /home/textprocessing/nltk_data…
| Unzipping corpora/cmudict.zip.
| Downloading package comparative_sentences to
| /home/textprocessing/nltk_data…
| Unzipping corpora/comparative_sentences.zip.
| Downloading package comtrans to
| /home/textprocessing/nltk_data…
| Downloading package conll2000 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/conll2000.zip.
| Downloading package conll2002 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/conll2002.zip.
| Downloading package conll2007 to
| /home/textprocessing/nltk_data…
| Downloading package crubadan to
| /home/textprocessing/nltk_data…
| Unzipping corpora/crubadan.zip.
| Downloading package dependency_treebank to
| /home/textprocessing/nltk_data…
| Unzipping corpora/dependency_treebank.zip.
| Downloading package europarl_raw to
| /home/textprocessing/nltk_data…
| Unzipping corpora/europarl_raw.zip.
| Downloading package floresta to
| /home/textprocessing/nltk_data…
| Unzipping corpora/floresta.zip.
| Downloading package framenet_v15 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/framenet_v15.zip.
| Downloading package framenet_v17 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/framenet_v17.zip.
| Downloading package gazetteers to
| /home/textprocessing/nltk_data…
| Unzipping corpora/gazetteers.zip.
| Downloading package genesis to
| /home/textprocessing/nltk_data…
| Unzipping corpora/genesis.zip.
| Downloading package gutenberg to
| /home/textprocessing/nltk_data…
| Unzipping corpora/gutenberg.zip.
| Downloading package ieer to /home/textprocessing/nltk_data…
| Unzipping corpora/ieer.zip.
| Downloading package inaugural to
| /home/textprocessing/nltk_data…
| Unzipping corpora/inaugural.zip.
| Downloading package indian to
| /home/textprocessing/nltk_data…
| Unzipping corpora/indian.zip.
| Downloading package jeita to
| /home/textprocessing/nltk_data…
| Downloading package kimmo to
| /home/textprocessing/nltk_data…
| Unzipping corpora/kimmo.zip.
| Downloading package knbc to /home/textprocessing/nltk_data…
| Downloading package lin_thesaurus to
| /home/textprocessing/nltk_data…
| Unzipping corpora/lin_thesaurus.zip.
| Downloading package mac_morpho to
| /home/textprocessing/nltk_data…
| Unzipping corpora/mac_morpho.zip.
| Downloading package machado to
| /home/textprocessing/nltk_data…
| Downloading package masc_tagged to
| /home/textprocessing/nltk_data…
| Downloading package moses_sample to
| /home/textprocessing/nltk_data…
| Unzipping models/moses_sample.zip.
| Downloading package movie_reviews to
| /home/textprocessing/nltk_data…
| Unzipping corpora/movie_reviews.zip.
| Downloading package names to
| /home/textprocessing/nltk_data…
| Unzipping corpora/names.zip.
| Downloading package nombank.1.0 to
| /home/textprocessing/nltk_data…
| Downloading package nps_chat to
| /home/textprocessing/nltk_data…
| Unzipping corpora/nps_chat.zip.
| Downloading package omw to /home/textprocessing/nltk_data…
| Unzipping corpora/omw.zip.
| Downloading package opinion_lexicon to
| /home/textprocessing/nltk_data…
| Unzipping corpora/opinion_lexicon.zip.
| Downloading package paradigms to
| /home/textprocessing/nltk_data…
| Unzipping corpora/paradigms.zip.
| Downloading package pil to /home/textprocessing/nltk_data…
| Unzipping corpora/pil.zip.
| Downloading package pl196x to
| /home/textprocessing/nltk_data…
| Unzipping corpora/pl196x.zip.
| Downloading package ppattach to
| /home/textprocessing/nltk_data…
| Unzipping corpora/ppattach.zip.
| Downloading package problem_reports to
| /home/textprocessing/nltk_data…
| Unzipping corpora/problem_reports.zip.
| Downloading package propbank to
| /home/textprocessing/nltk_data…
| Downloading package ptb to /home/textprocessing/nltk_data…
| Unzipping corpora/ptb.zip.
| Downloading package product_reviews_1 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/product_reviews_1.zip.
| Downloading package product_reviews_2 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/product_reviews_2.zip.
| Downloading package pros_cons to
| /home/textprocessing/nltk_data…
| Unzipping corpora/pros_cons.zip.
| Downloading package qc to /home/textprocessing/nltk_data…
| Unzipping corpora/qc.zip.
| Downloading package reuters to
| /home/textprocessing/nltk_data…
| Downloading package rte to /home/textprocessing/nltk_data…
| Unzipping corpora/rte.zip.
| Downloading package semcor to
| /home/textprocessing/nltk_data…
| Downloading package senseval to
| /home/textprocessing/nltk_data…
| Unzipping corpora/senseval.zip.
| Downloading package sentiwordnet to
| /home/textprocessing/nltk_data…
| Unzipping corpora/sentiwordnet.zip.
| Downloading package sentence_polarity to
| /home/textprocessing/nltk_data…
| Unzipping corpora/sentence_polarity.zip.
| Downloading package shakespeare to
| /home/textprocessing/nltk_data…
| Unzipping corpora/shakespeare.zip.
| Downloading package sinica_treebank to
| /home/textprocessing/nltk_data…
| Unzipping corpora/sinica_treebank.zip.
| Downloading package smultron to
| /home/textprocessing/nltk_data…
| Unzipping corpora/smultron.zip.
| Downloading package state_union to
| /home/textprocessing/nltk_data…
| Unzipping corpora/state_union.zip.
| Downloading package stopwords to
| /home/textprocessing/nltk_data…
| Unzipping corpora/stopwords.zip.
| Downloading package subjectivity to
| /home/textprocessing/nltk_data…
| Unzipping corpora/subjectivity.zip.
| Downloading package swadesh to
| /home/textprocessing/nltk_data…
| Unzipping corpora/swadesh.zip.
| Downloading package switchboard to
| /home/textprocessing/nltk_data…
| Unzipping corpora/switchboard.zip.
| Downloading package timit to
| /home/textprocessing/nltk_data…
| Unzipping corpora/timit.zip.
| Downloading package toolbox to
| /home/textprocessing/nltk_data…
| Unzipping corpora/toolbox.zip.
| Downloading package treebank to
| /home/textprocessing/nltk_data…
| Unzipping corpora/treebank.zip.
| Downloading package twitter_samples to
| /home/textprocessing/nltk_data…
| Unzipping corpora/twitter_samples.zip.
| Downloading package udhr to /home/textprocessing/nltk_data…
| Unzipping corpora/udhr.zip.
| Downloading package udhr2 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/udhr2.zip.
| Downloading package unicode_samples to
| /home/textprocessing/nltk_data…
| Unzipping corpora/unicode_samples.zip.
| Downloading package universal_treebanks_v20 to
| /home/textprocessing/nltk_data…
| Downloading package verbnet to
| /home/textprocessing/nltk_data…
| Unzipping corpora/verbnet.zip.
| Downloading package webtext to
| /home/textprocessing/nltk_data…
| Unzipping corpora/webtext.zip.
| Downloading package wordnet to
| /home/textprocessing/nltk_data…
| Unzipping corpora/wordnet.zip.
| Downloading package wordnet_ic to
| /home/textprocessing/nltk_data…
| Unzipping corpora/wordnet_ic.zip.
| Downloading package words to
| /home/textprocessing/nltk_data…
| Unzipping corpora/words.zip.
| Downloading package ycoe to /home/textprocessing/nltk_data…
| Unzipping corpora/ycoe.zip.
| Downloading package rslp to /home/textprocessing/nltk_data…
| Unzipping stemmers/rslp.zip.
| Downloading package hmm_treebank_pos_tagger to
| /home/textprocessing/nltk_data…
| Unzipping taggers/hmm_treebank_pos_tagger.zip.
| Downloading package maxent_treebank_pos_tagger to
| /home/textprocessing/nltk_data…
| Unzipping taggers/maxent_treebank_pos_tagger.zip.
| Downloading package universal_tagset to
| /home/textprocessing/nltk_data…
| Unzipping taggers/universal_tagset.zip.
| Downloading package maxent_ne_chunker to
| /home/textprocessing/nltk_data…
| Unzipping chunkers/maxent_ne_chunker.zip.
| Downloading package punkt to
| /home/textprocessing/nltk_data…
| Unzipping tokenizers/punkt.zip.
| Downloading package book_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/book_grammars.zip.
| Downloading package sample_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/sample_grammars.zip.
| Downloading package spanish_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/spanish_grammars.zip.
| Downloading package basque_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/basque_grammars.zip.
| Downloading package large_grammars to
| /home/textprocessing/nltk_data…
| Unzipping grammars/large_grammars.zip.
| Downloading package tagsets to
| /home/textprocessing/nltk_data…
| Unzipping help/tagsets.zip.
| Downloading package snowball_data to
| /home/textprocessing/nltk_data…
| Downloading package bllip_wsj_no_aux to
| /home/textprocessing/nltk_data…
| Unzipping models/bllip_wsj_no_aux.zip.
| Downloading package word2vec_sample to
| /home/textprocessing/nltk_data…
| Unzipping models/word2vec_sample.zip.
| Downloading package panlex_swadesh to
| /home/textprocessing/nltk_data…
| Downloading package mte_teip5 to
| /home/textprocessing/nltk_data…
| Unzipping corpora/mte_teip5.zip.
| Downloading package averaged_perceptron_tagger to
| /home/textprocessing/nltk_data…
| Unzipping taggers/averaged_perceptron_tagger.zip.
| Downloading package panlex_lite to
| /home/textprocessing/nltk_data…
| Unzipping corpora/panlex_lite.zip.
| Downloading package perluniprops to
| /home/textprocessing/nltk_data…
| Unzipping misc/perluniprops.zip.
| Downloading package nonbreaking_prefixes to
| /home/textprocessing/nltk_data…
| Unzipping corpora/nonbreaking_prefixes.zip.
| Downloading package vader_lexicon to
| /home/textprocessing/nltk_data…
| Downloading package porter_test to
| /home/textprocessing/nltk_data…
| Unzipping stemmers/porter_test.zip.
| Downloading package wmt15_eval to
| /home/textprocessing/nltk_data…
| Unzipping models/wmt15_eval.zip.
| Downloading package mwa_ppdb to
| /home/textprocessing/nltk_data…
| Unzipping misc/mwa_ppdb.zip.
|
Done downloading collection all

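---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------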
Downloader> q
Out[5]: True
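Downloading everything takes a while and a lot of disk space. If only the tokenizer, tagger and chunker are needed, a smaller set of packages can be requested by identifier. The sketch below is an assumption-laden example rather than the method used in this post: `download_packages` and `NEEDED` are hypothetical names, the list of identifiers is taken from the log above and is only assumed to be sufficient for NLTK 3.2.2 (other releases may use different package names), and the download function is passed in so the helper can be tried offline. With NLTK installed, `nltk.download` accepts a package identifier and returns a success flag.

```python
def download_packages(package_ids, download):
    """Call `download` for each identifier and collect the success flags."""
    results = {}
    for package_id in package_ids:
        results[package_id] = bool(download(package_id))
    return results


# Identifiers that appear in the log above; assumed (not verified here) to
# cover sentence/word tokenizing, POS tagging and named-entity chunking.
NEEDED = ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]

# Intended use with NLTK (requires nltk and network access):
#   import nltk
#   print(download_packages(NEEDED, nltk.download))
```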

Using NLTK

In [15]: sentences = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve: natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation."""

In [16]: sents = nltk.sent_tokenize(sentences)

In [17]: for sent in sents:
   ....:     print sent
   ....:
Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.
As such, NLP is related to the area of human–computer interaction.
Many challenges in NLP involve: natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation.

In [18]: tokens = nltk.word_tokenize(sentences)

In [19]: print tokens
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', '.', 'As', 'such', ',', 'NLP', 'is', 'related', 'to', 'the', 'area', 'of', 'human\xe2\x80\x93computer', 'interaction', '.', 'Many', 'challenges', 'in', 'NLP', 'involve', ':', 'natural', 'language', 'understanding', ',', 'enabling', 'computers', 'to', 'derive', 'meaning', 'from', 'human', 'or', 'natural', 'language', 'input', ';', 'and', 'others', 'involve', 'natural', 'language', 'generation', '.']
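The token 'human\xe2\x80\x93computer' in the list above is not corrupted: the session runs on Python 2.7, where the input is a byte string, and \xe2\x80\x93 is the UTF-8 encoding of the en dash in "human–computer". A minimal sketch of turning such tokens into text follows; `decode_tokens` is a hypothetical helper, and UTF-8 is assumed as the encoding.

```python
def decode_tokens(tokens, encoding="utf-8"):
    """Decode byte-string tokens to text and leave text tokens unchanged."""
    return [t.decode(encoding) if isinstance(t, bytes) else t for t in tokens]


# The byte string exactly as it is printed in the token list.
raw_token = b"human\xe2\x80\x93computer"
decoded = decode_tokens([raw_token])[0]

# Sixteen bytes become fourteen characters, because the dash takes three bytes.
assert len(raw_token) == 16
assert len(decoded) == 14
assert decoded == u"human\u2013computer"
```

On Python 3 the tokenizer would be given text to begin with, so this step would normally not be needed there.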

In [20]: tagged_tokens = nltk.pos_tag(tokens)

In [21]: print tagged_tokens
[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('artificial', 'JJ'), ('intelligence', 'NN'), (',', ','), ('and', 'CC'), ('computational', 'JJ'), ('linguistics', 'NNS'), ('concerned', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('interactions', 'NNS'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('(', '('), ('natural', 'JJ'), (')', ')'), ('languages', 'VBZ'), ('.', '.'), ('As', 'IN'), ('such', 'JJ'), (',', ','), ('NLP', 'NNP'), ('is', 'VBZ'), ('related', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('area', 'NN'), ('of', 'IN'), ('human\xe2\x80\x93computer', 'NN'), ('interaction', 'NN'), ('.', '.'), ('Many', 'JJ'), ('challenges', 'NNS'), ('in', 'IN'), ('NLP', 'NNP'), ('involve', 'NN'), (':', ':'), ('natural', 'JJ'), ('language', 'NN'), ('understanding', 'NN'), (',', ','), ('enabling', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('derive', 'VB'), ('meaning', 'NN'), ('from', 'IN'), ('human', 'NN'), ('or', 'CC'), ('natural', 'JJ'), ('language', 'NN'), ('input', 'NN'), (';', ':'), ('and', 'CC'), ('others', 'NNS'), ('involve', 'VBP'), ('natural', 'JJ'), ('language', 'NN'), ('generation', 'NN'), ('.', '.')]
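The tagger returns a plain list of (token, tag) tuples, so ordinary Python tools work on it. As a small illustration, the sketch below counts how often each tag occurs; `tag_counts` is a hypothetical helper, and `sample` is just the first seven pairs copied from the output above.

```python
from collections import Counter


def tag_counts(tagged_tokens):
    """Count how many tokens carry each part-of-speech tag."""
    return Counter(tag for _token, tag in tagged_tokens)


# First seven (token, tag) pairs from the output above.
sample = [("Natural", "JJ"), ("language", "NN"), ("processing", "NN"),
          ("(", "("), ("NLP", "NNP"), (")", ")"), ("is", "VBZ")]

counts = tag_counts(sample)
assert counts["NN"] == 2   # 'language' and 'processing'
assert counts["NNP"] == 1  # 'NLP'
```

Passing the full tagged_tokens list from the session works the same way.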

In [22]: entities = nltk.chunk.ne_chunk(tagged_tokens)

In [23]: entities
Out[23]: Tree('S', [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), Tree('ORGANIZATION', [('NLP', 'NNP')]), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('artificial', 'JJ'), ('intelligence', 'NN'), (',', ','), ('and', 'CC'), ('computational', 'JJ'), ('linguistics', 'NNS'), ('concerned', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('interactions', 'NNS'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('(', '('), ('natural', 'JJ'), (')', ')'), ('languages', 'VBZ'), ('.', '.'), ('As', 'IN'), ('such', 'JJ'), (',', ','), Tree('ORGANIZATION', [('NLP', 'NNP')]), ('is', 'VBZ'), ('related', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('area', 'NN'), ('of', 'IN'), ('human\xe2\x80\x93computer', 'NN'), ('interaction', 'NN'), ('.', '.'), ('Many', 'JJ'), ('challenges', 'NNS'), ('in', 'IN'), Tree('ORGANIZATION', [('NLP', 'NNP')]), ('involve', 'NN'), (':', ':'), ('natural', 'JJ'), ('language', 'NN'), ('understanding', 'NN'), (',', ','), ('enabling', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('derive', 'VB'), ('meaning', 'NN'), ('from', 'IN'), ('human', 'NN'), ('or', 'CC'), ('natural', 'JJ'), ('language', 'NN'), ('input', 'NN'), (';', ':'), ('and', 'CC'), ('others', 'NNS'), ('involve', 'VBP'), ('natural', 'JJ'), ('language', 'NN'), ('generation', 'NN'), ('.', '.')])
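In the tree above, plain tokens stay as (token, tag) tuples while recognized entities are wrapped in labelled subtrees such as Tree('ORGANIZATION', ...). The sketch below pulls those subtrees out; `named_entities` is a hypothetical helper that relies only on the chunks being iterable and having a `label()` method, which is how NLTK 3 trees behave. It looks at direct children only, which matches the flat tree shown here.

```python
def named_entities(tree):
    """Collect (label, text) pairs for labelled subtrees directly under `tree`."""
    entities = []
    for child in tree:
        # Plain (token, tag) tuples have no label(); chunked entities do.
        if hasattr(child, "label"):
            words = [token for token, _tag in child]
            entities.append((child.label(), " ".join(words)))
    return entities


# Intended use with the session above:
#   named_entities(entities)
```

For the output above this should yield three ('ORGANIZATION', 'NLP') pairs, one per mention. Note that the chunker labels the abbreviation NLP as an organization, which shows that the pretrained model's guesses need checking.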

For more about NLTK, we recommend the “” series and the official book: “”

Posted by TextProcessing

