Getting started with Python Word Segmentation

About Python Word Segmentation

WordSegment is an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.

Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
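
The bundled data files are plain tab-separated (token, count) pairs, one per line. As a quick illustration, here is a sketch of how you could peek at the unigram counts; it assumes the 0.8.0 layout (a wordsegment_data directory exposed as DATADIR, with files named unigrams.txt and bigrams.txt), as shown in the help output at the end of this post. Newer releases may store the data differently.

import os
import wordsegment as ws

# Sketch: peek at the first few bundled unigram counts
# (assumes ws.DATADIR points at the wordsegment_data directory).
path = os.path.join(ws.DATADIR, 'unigrams.txt')
with open(path) as fp:
    for line in fp.readlines()[:3]:
        word, count = line.split('\t')
        print(word, float(count))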

Install Python Word Segmentation

Installing WordSegment is easy; just use pip:

pip install wordsegment
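
As a quick sanity check, try segmenting a phrase. Note that recent releases require an explicit load() call before segmenting; in the 0.8.0 release used below, the data is loaded for you and the extra call is harmless:

from wordsegment import load, segment

load()  # required by newer releases; harmless on 0.8.0
print(segment('thisisatest'))  # ['this', 'is', 'a', 'test']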

How to Use Python Word Segmentation for English Text

In [1]: import wordsegment
In [2]: help(wordsegment)
In [4]: from wordsegment import segment
In [5]: segment("thisisatest")
Out[5]: ['this', 'is', 'a', 'test']
In [6]: segment("helloworld")
Out[6]: ['helloworld']
In [7]: segment("hiworld")
Out[7]: ['hi', 'world']
In [8]: segment("NewYork")
Out[8]: ['new', 'york']
In [9]: from wordsegment import clean
In [10]: clean("this's a test")
Out[10]: 'thissatest'
In [11]: segment("this'satest")
Out[11]: ['this', 'sa', 'test']
In [12]: import wordsegment as ws
In [13]: ws.load()
In [15]: ws.UNIGRAMS['the']
Out[15]: 23135851162.0
In [16]: ws.UNIGRAMS['gray']
Out[16]: 21424658.0
In [17]: ws.UNIGRAMS['grey']
Out[17]: 18276942.0
In [18]: dir(ws)
In [19]: ws.BIGRAMS['this is']
Out[19]: 86818400.0
In [20]: ws.BIGRAMS['is a']
Out[20]: 476718990.0
In [21]: ws.BIGRAMS['a test']
Out[21]: 4417355.0
In [22]: ws.BIGRAMS['a patent']
Out[22]: 1117510.0
In [23]: ws.BIGRAMS['free patent']
KeyError                                  Traceback (most recent call last)
<ipython-input-23-6d20cc0adefa> in <module>()
----> 1 ws.BIGRAMS['free patent']
KeyError: 'free patent'
In [24]: ws.BIGRAMS['the input']
Out[24]: 4840160.0
In [26]: import heapq
In [27]: from pprint import pprint
In [28]: from operator import itemgetter
In [29]: pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
[(u'of the', 2766332391.0),
 (u'in the', 1628795324.0),
 (u'to the', 1139248999.0),
 (u'on the', 800328815.0),
 (u'for the', 692874802.0),
 (u'and the', 629726893.0),
 (u'to be', 505148997.0),
 (u'is a', 476718990.0),
 (u'with the', 461331348.0),
 (u'from the', 428303219.0)]
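
Since UNIGRAMS and BIGRAMS are plain dictionaries of counts and TOTAL is the corpus size, you can turn counts into probabilities yourself. A small sketch (the numbers follow from the counts shown above; use dict.get to avoid the KeyError seen at In [23]):

import wordsegment as ws

ws.load()

def unigram_probability(word):
    # count / total corpus size (TOTAL = 1024908267229)
    return ws.UNIGRAMS.get(word, 0.0) / ws.TOTAL

print(unigram_probability('the'))    # ~0.0226, from the count shown above
print(unigram_probability('gray') > unigram_probability('grey'))  # True

# BIGRAMS is a plain dict, so missing phrases raise KeyError;
# use .get() with a default instead:
print(ws.BIGRAMS.get('free patent', 0.0))  # 0.0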

Help Info about Python Word Segmentation

Help on module wordsegment:

NAME
    wordsegment - English Word Segmentation in Python

DESCRIPTION
    Word segmentation is the process of dividing a phrase without spaces back
    into its constituent parts. For example, consider a phrase like
    "thisisatest". For humans, it's relatively easy to parse. This module
    makes it easy for machines too. Use `segment` to parse a phrase into its
    parts:

    >>> from wordsegment import segment
    >>> segment('thisisatest')
    ['this', 'is', 'a', 'test']

    In the code, 1024908267229 is the total number of words in the corpus. A
    subset of this corpus is found in unigrams.txt and bigrams.txt which
    should accompany this file. A copy of these files may be found at
    http://norvig.com/ngrams/ under the names count_1w.txt and count_2w.txt.

    Copyright (c) 2016 by Grant Jenks
    Based on code from the chapter "Natural Language Corpus Data"
    from the book "Beautiful Data" (Segaran and Hammerbacher, 2009)
    Original Copyright (c) 2008-2009 by Peter Norvig

FUNCTIONS
    clean(text)
        Return `text` lower-cased with non-alphanumeric characters removed.

    divide(text, limit=24)
        Yield `(prefix, suffix)` pairs from `text` with `len(prefix)` not
        exceeding `limit`.

    isegment(text)
        Return iterator of words that is the best segmentation of `text`.

    load()
        Load unigram and bigram counts from disk.

    main(args=())
        Command-line entry-point. Parses `args` into in-file and out-file then
        reads lines from in-file, segments the lines, and writes the result to
        out-file. Input and output default to stdin and stdout respectively.

    parse_file(filename)
        Read `filename` and parse tab-separated file of (word, count) pairs.

    score(word, prev=None)
        Score a `word` in the context of the previous word, `prev`.

    segment(text)
        Return a list of words that is the best segmentation of `text`.

DATA
    ALPHABET = set(['0', '1', '2', '3', '4', '5', ...])
    BIGRAMS = {u'0km to': 116103.0, u'0uplink verified': 523545.0, u'1000s...
    DATADIR = '/Library/Python/2.7/site-packages/wordsegment_data'
    TOTAL = 1024908267229.0
    UNIGRAMS = {u'a': 9081174698.0, u'aa': 30523331.0, u'aaa': 10243983.0,...
    __author__ = 'Grant Jenks'
    __build__ = 2048
    __copyright__ = 'Copyright 2016 Grant Jenks'
    __license__ = 'Apache 2.0'
    __title__ = 'wordsegment'
    __version__ = '0.8.0'


AUTHOR
    Grant Jenks
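
Beyond segment, the module exposes a few helpers worth knowing. A short sketch of isegment and divide, based on the docstrings above (the divide output is illustrative):

import wordsegment as ws
from itertools import islice

ws.load()

# isegment yields the words lazily; segment returns the full list
print(list(ws.isegment('thisisatest')))   # ['this', 'is', 'a', 'test']

# divide yields the (prefix, suffix) splits the segmenter considers,
# with len(prefix) capped by `limit` (default 24)
print(list(islice(ws.divide('hiworld'), 3)))
# e.g. [('h', 'iworld'), ('hi', 'world'), ('hiw', 'orld')]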

Posted by TextProcessing
