Getting started with NLTK | TextProcessing | A Text Processing Portal for Humans

Open Source Text Processing Project: NLTK

Install NLTK

1. Install the latest NLTK package on Ubuntu 16.04.1 LTS:

textprocessing@ubuntu:~$ sudo pip install -U nltk

Collecting nltk
Downloading nltk-3.2.2.tar.gz (1.2MB)
35% |███████████▍ | 409kB 20.8MB/s eta 0:00:0
……
100% |████████████████████████████████| 1.2MB 814kB/s
Collecting six (from nltk)
Downloading six-1.10.0-py2.py3-none-any.whl
Installing collected packages: six, nltk
Running setup.py install for nltk … done
Successfully installed nltk-3.2.2 six-1.10.0

2. Install NumPy (optional):

textprocessing@ubuntu:~$ sudo pip install -U numpy

Collecting numpy
Downloading numpy-1.12.0-cp27-cp27mu-manylinux1_x86_64.whl (16.5MB)
34% |███████████▏ | 5.7MB 30.8MB/s eta 0:00:0
……
100% |████████████████████████████████| 16.5MB 37kB/s
Installing collected packages: numpy
Successfully installed numpy-1.12.0

3. Test the installation: start ipython (or python), then type import nltk:

textprocessing@ubuntu:~$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import nltk

In [2]: nltk.__version__
Out[2]: '3.2.2'

NLTK now appears to be installed, but even the simplest call, word tokenization, fails with a LookupError because the Punkt tokenizer data is missing:

In [3]: sentence = "this's a test"

In [4]: tokens = nltk.word_tokenize(sentence)
---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
 in ()
----> 1 tokens = nltk.word_tokenize(sentence)

/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.pyc in word_tokenize(text, language)
    107     :param language: the model name in the Punkt corpus
    108     """
--> 109     return [token for sent in sent_tokenize(text, language)
    110             for token in _treebank_word_tokenize(sent)]
    111 

/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.pyc in sent_tokenize(text, language)
     91     :param language: the model name in the Punkt corpus
     92     """
---> 93     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
     94     return tokenizer.tokenize(text)
     95 

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
    806
    807     # Load the resource.
--> 808     opened_resource = _open(resource_url)
    809
    810     if format == 'raw':

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in _open(resource_url)
    924
    925     if protocol is None or protocol.lower() == 'nltk':
--> 926         return find(path_, path + ['']).open()
    927     elif protocol.lower() == 'file':
    928         # urllib might not use mode='rb', so handle this one ourselves:

/usr/local/lib/python2.7/dist-packages/nltk/data.pyc in find(resource_name, paths)
    646     sep = '*' * 70
    647     resource_not_found = '\n%s\n%s\n%s' % (sep, msg, sep)
--> 648     raise LookupError(resource_not_found)
    649
    650

LookupError:
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found. Please use
  the NLTK Downloader to obtain the resource: >>> nltk.download()
  Searched in:
    - '/home/textprocessing/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************
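The traceback ends in nltk.data.find, which searches the listed directories and raises LookupError when nothing matches. Here is a minimal sketch of checking for a resource up front. The helper name has_resource is hypothetical and not part of NLTK, and the exact resource name may differ between NLTK releases, so use the one printed in your own error message:

```python
import nltk


def has_resource(resource_path):
    """Return True if NLTK can locate the resource on its search path."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.data.find raises LookupError when the resource is missing.
        return False


# The directories NLTK searches, as listed in the error message.
print(nltk.data.path)
print(has_resource("tokenizers/punkt"))
```

If this prints False, the tokenizer data still needs to be downloaded.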

Install NLTK Data

NLTK comes with many corpora, toy grammars, trained models, and more, all distributed as nltk_data. Install the nltk_data packages you need before using the NLTK functions that depend on them:

In [5]: nltk.download()
NLTK Downloader
—————————————————————————
d) Download l) List u) Update c) Config h) Help q) Quit
—————————————————————————
Downloader> d

—————————————————————————
d) Download l) List u) Update c) Config h) Help q) Quit
—————————————————————————
Downloader> q
Out[5]: True
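The interactive menu can be skipped by passing a package identifier straight to nltk.download. The sketch below assumes the four identifiers listed cover the tokenizer, tagger, and chunker used in this post; that list and the helper name download_packages are assumptions rather than anything from the original session, and identifiers can change between NLTK releases:

```python
import nltk

# Package identifiers assumed to cover the calls used in this post;
# check the downloader's list (option "l") for the names in your release.
PACKAGES = ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]


def download_packages(package_ids, quiet=True):
    """Download each package and return {package_id: success flag}."""
    results = {}
    for package_id in package_ids:
        # nltk.download returns True on success and False otherwise.
        results[package_id] = nltk.download(package_id, quiet=quiet)
    return results
```

Calling download_packages(PACKAGES) needs network access and stores the data in an nltk_data directory, typically under your home directory.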

Using NLTK
In [15]: sentences = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve: natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation."""


In [16]: sents = nltk.sent_tokenize(sentences)
In [17]: for sent in sents:
   ....:     print sent
   ....:

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.

As such, NLP is related to the area of human–computer interaction.

Many challenges in NLP involve: natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation.
In [18]: tokens = nltk.word_tokenize(sentences)
In [19]: print tokens

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', '.', 'As', 'such', ',', 'NLP', 'is', 'related', 'to', 'the', 'area', 'of', 'human\xe2\x80\x93computer', 'interaction', '.', 'Many', 'challenges', 'in', 'NLP', 'involve', ':', 'natural', 'language', 'understanding', ',', 'enabling', 'computers', 'to', 'derive', 'meaning', 'from', 'human', 'or', 'natural', 'language', 'input', ';', 'and', 'others', 'involve', 'natural', 'language', 'generation', '.']
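The token 'human\xe2\x80\x93computer' in this output is not a tokenizer error. Under Python 2 a plain string literal holds bytes, so the en dash is displayed as its three UTF-8 bytes. This small sketch uses only the standard library to decode those bytes back to text:

```python
# -*- coding: utf-8 -*-
raw = b"human\xe2\x80\x93computer"  # the bytes shown in the token list
text = raw.decode("utf-8")          # decode UTF-8 bytes into a text string

print(len(raw))   # 16 bytes: the en dash takes three of them
print(len(text))  # 14 characters: the en dash is a single character
```

Decoding the input before tokenizing (or running under Python 3, where string literals are already text) avoids the escaped form in the output.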
In [20]: tagged_tokens = nltk.pos_tag(tokens)
In [21]: print tagged_tokens

[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('artificial', 'JJ'), ('intelligence', 'NN'), (',', ','), ('and', 'CC'), ('computational', 'JJ'), ('linguistics', 'NNS'), ('concerned', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('interactions', 'NNS'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('(', '('), ('natural', 'JJ'), (')', ')'), ('languages', 'VBZ'), ('.', '.'), ('As', 'IN'), ('such', 'JJ'), (',', ','), ('NLP', 'NNP'), ('is', 'VBZ'), ('related', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('area', 'NN'), ('of', 'IN'), ('human\xe2\x80\x93computer', 'NN'), ('interaction', 'NN'), ('.', '.'), ('Many', 'JJ'), ('challenges', 'NNS'), ('in', 'IN'), ('NLP', 'NNP'), ('involve', 'NN'), (':', ':'), ('natural', 'JJ'), ('language', 'NN'), ('understanding', 'NN'), (',', ','), ('enabling', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('derive', 'VB'), ('meaning', 'NN'), ('from', 'IN'), ('human', 'NN'), ('or', 'CC'), ('natural', 'JJ'), ('language', 'NN'), ('input', 'NN'), (';', ':'), ('and', 'CC'), ('others', 'NNS'), ('involve', 'VBP'), ('natural', 'JJ'), ('language', 'NN'), ('generation', 'NN'), ('.', '.')]
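pos_tag returns a list of (word, tag) pairs using Penn Treebank tags, so ordinary list operations can filter and count them. The tags are model predictions and can be wrong: in the output above, 'languages' is tagged VBZ, a verb form. This brief sketch uses the first few pairs from that output, copied by hand so it runs without the tagger model:

```python
from collections import Counter

# First few (word, tag) pairs copied from the pos_tag output.
tagged_tokens = [
    ('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'),
    ('(', '('), ('NLP', 'NNP'), (')', ')'),
    ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged_tokens)

# Keep only words whose tag starts with "NN" (the noun tags).
nouns = [word for word, tag in tagged_tokens if tag.startswith('NN')]

print(tag_counts['NN'])  # 3
print(nouns)             # ['language', 'processing', 'NLP', 'field']
```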
In [22]: entities = nltk.chunk.ne_chunk(tagged_tokens)

In [23]: entities
Out[23]: Tree('S', [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), Tree('ORGANIZATION', [('NLP', 'NNP')]), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('artificial', 'JJ'), ('intelligence', 'NN'), (',', ','), ('and', 'CC'), ('computational', 'JJ'), ('linguistics', 'NNS'), ('concerned', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('interactions', 'NNS'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('(', '('), ('natural', 'JJ'), (')', ')'), ('languages', 'VBZ'), ('.', '.'), ('As', 'IN'), ('such', 'JJ'), (',', ','), Tree('ORGANIZATION', [('NLP', 'NNP')]), ('is', 'VBZ'), ('related', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('area', 'NN'), ('of', 'IN'), ('human\xe2\x80\x93computer', 'NN'), ('interaction', 'NN'), ('.', '.'), ('Many', 'JJ'), ('challenges', 'NNS'), ('in', 'IN'), Tree('ORGANIZATION', [('NLP', 'NNP')]), ('involve', 'NN'), (':', ':'), ('natural', 'JJ'), ('language', 'NN'), ('understanding', 'NN'), (',', ','), ('enabling', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('derive', 'VB'), ('meaning', 'NN'), ('from', 'IN'), ('human', 'NN'), ('or', 'CC'), ('natural', 'JJ'), ('language', 'NN'), ('input', 'NN'), (';', ':'), ('and', 'CC'), ('others', 'NNS'), ('involve', 'VBP'), ('natural', 'JJ'), ('language', 'NN'), ('generation', 'NN'), ('.', '.')])
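ne_chunk returns an nltk.Tree whose children are either plain (word, tag) pairs or nested subtrees labeled with an entity type. The following sketch pulls out the labeled chunks from a hand-built fragment of the tree above. The helper name extract_entities is hypothetical, and the fragment is constructed by hand so no chunker model is required:

```python
from nltk import Tree

# A hand-built fragment mirroring the start of the ne_chunk result.
tree = Tree('S', [
    ('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'),
    ('(', '('), Tree('ORGANIZATION', [('NLP', 'NNP')]), (')', ')'),
])


def extract_entities(chunked):
    """Return (label, text) for each labeled subtree directly under the root."""
    entities = []
    for node in chunked:
        if isinstance(node, Tree):
            # Leaves of a chunk are (word, tag) pairs; join the words.
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities


print(extract_entities(tree))  # [('ORGANIZATION', 'NLP')]
```

Here the chunker labels the abbreviation NLP as an ORGANIZATION, which is a reminder that entity labels are predictions and worth checking.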

For more about NLTK, we recommend the “” series and the official book: “”

Posted by “TextProcessing”
