Getting started with Python Word Segmentation

About Python Word Segmentation

WordSegment is an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.

It is based on code from the chapter “Natural Language Corpus Data” by Peter Norvig in the book “Beautiful Data” (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
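
For reference, both data files are plain tab-separated text with one token and its count per line (the format the module's parse_file function reads). Judging from the lookup shown later in this post, the entry for "the" in unigrams.txt looks like:

the	23135851162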

Install Python Word Segmentation

Installing WordSegment is easy; just use pip:

pip install wordsegment
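
To check that the install worked, print the module's version string (the __version__ attribute that also appears in the help output below):

python -c "import wordsegment; print(wordsegment.__version__)"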

How to Use Python Word Segmentation for English Text

In [1]: import wordsegment
 
In [2]: help(wordsegment)
 
In [4]: from wordsegment import segment
 
In [5]: segment("thisisatest")
Out[5]: ['this', 'is', 'a', 'test']
 
In [6]: segment("helloworld")
Out[6]: ['helloworld']
 
In [7]: segment("hiworld")
Out[7]: ['hi', 'world']
 
In [8]: segment("NewYork")
Out[8]: ['new', 'york']
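
Note that segment normalizes its input (lowercasing and stripping punctuation) before searching for the best split, which is why "NewYork" comes back as ['new', 'york']; "helloworld" stays whole, presumably because it occurs as a single token in the unigram data. If you want the result back as one spaced string rather than a list, a tiny helper is enough (respace is just an illustrative name, not part of the module):

from wordsegment import segment

def respace(text):
    # Join the best segmentation back into a single spaced string.
    return ' '.join(segment(text))

respace('thisisatest')  # 'this is a test'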
 
In [9]: from wordsegment import clean
 
In [10]: clean("this's a test")
Out[10]: 'thissatest'
 
In [11]: segment("this'satest")
Out[11]: ['this', 'sa', 'test']
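
Note what happened here: clean strips the space along with the apostrophe, so text that already contains spaces can come back segmented worse than it started (the 'sa' above). One workaround is to split on whitespace first and segment each token separately; segment_tokens below is a hypothetical helper, not part of the module:

from wordsegment import segment

def segment_tokens(text):
    # Segment each whitespace-separated token on its own so that
    # existing word boundaries are preserved.
    words = []
    for token in text.split():
        words.extend(segment(token))
    return words

segment_tokens('thisisatest helloworld')
# ['this', 'is', 'a', 'test', 'helloworld']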
 
In [12]: import wordsegment as ws
 
In [13]: ws.load()
 
In [15]: ws.UNIGRAMS['the']
Out[15]: 23135851162.0
 
In [16]: ws.UNIGRAMS['gray']
Out[16]: 21424658.0
 
In [17]: ws.UNIGRAMS['grey']
Out[17]: 18276942.0
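
The raw counts are easiest to interpret as relative frequencies: dividing by ws.TOTAL, the 1,024,908,267,229-word corpus size, gives a simple unigram probability estimate. A minimal sketch (unigram_prob is an illustrative helper):

import wordsegment as ws

ws.load()

def unigram_prob(word):
    # Relative frequency: corpus count divided by total corpus size.
    return ws.UNIGRAMS.get(word, 0.0) / ws.TOTAL

unigram_prob('the')   # ~0.0226, roughly one word in 44
unigram_prob('gray')  # ~2.1e-05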
 
In [18]: dir(ws)
Out[18]: 
['ALPHABET',
 'BIGRAMS',
 'DATADIR',
 'TOTAL',
 'UNIGRAMS',
 '__author__',
 '__build__',
 '__builtins__',
 '__copyright__',
 '__doc__',
 '__file__',
 '__license__',
 '__name__',
 '__package__',
 '__title__',
 '__version__',
 'clean',
 'divide',
 'io',
 'isegment',
 'load',
 'main',
 'math',
 'op',
 'parse_file',
 'score',
 'segment',
 'sys']
 
In [19]: ws.BIGRAMS['this is']
Out[19]: 86818400.0
 
In [20]: ws.BIGRAMS['is a']
Out[20]: 476718990.0
 
In [21]: ws.BIGRAMS['a test']
Out[21]: 4417355.0
 
In [22]: ws.BIGRAMS['a patent']
Out[22]: 1117510.0
 
In [23]: ws.BIGRAMS['free patent']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-6d20cc0adefa> in <module>()
----> 1 ws.BIGRAMS['free patent']
 
KeyError: 'free patent'
 
In [24]: ws.BIGRAMS['the input']
Out[24]: 4840160.0
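
The KeyError above is expected: BIGRAMS is a plain dict covering only the 250,000 most common phrases. Continuing the same session, use .get() with a default to probe for phrases that may be missing:

ws.BIGRAMS.get('free patent', 0.0)  # 0.0 instead of a KeyError
ws.BIGRAMS.get('the input', 0.0)    # 4840160.0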
 
In [26]: import heapq
 
In [27]: from pprint import pprint
 
In [28]: from operator import itemgetter
 
In [29]: pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
[(u'of the', 2766332391.0),
 (u'in the', 1628795324.0),
 (u'to the', 1139248999.0),
 (u'on the', 800328815.0),
 (u'for the', 692874802.0),
 (u'and the', 629726893.0),
 (u'to be', 505148997.0),
 (u'is a', 476718990.0),
 (u'with the', 461331348.0),
 (u'from the', 428303219.0)]
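
The same pattern works for the unigram table; as the earlier lookup suggests, 'the' (23135851162.0) tops the list:

pprint(heapq.nlargest(5, ws.UNIGRAMS.items(), itemgetter(1)))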

Help Info about Python Word Segmentation

Help on module wordsegment:

NAME
    wordsegment - English Word Segmentation in Python

FILE
    /Library/Python/2.7/site-packages/wordsegment.py

DESCRIPTION
    Word segmentation is the process of dividing a phrase without spaces back
    into its constituent parts. For example, consider a phrase like
    "thisisatest".
    For humans, it's relatively easy to parse. This module makes it easy for
    machines too. Use `segment` to parse a phrase into its parts:
    
    >>> from wordsegment import segment
    >>> segment('thisisatest')
    ['this', 'is', 'a', 'test']
    
    In the code, 1024908267229 is the total number of words in the corpus. A
    subset of this corpus is found in unigrams.txt and bigrams.txt which
    should accompany this file. A copy of these files may be found at
    http://norvig.com/ngrams/ under the names count_1w.txt and count_2w.txt
    respectively.
    
    Copyright (c) 2016 by Grant Jenks
    
    Based on code from the chapter "Natural Language Corpus Data"
    from the book "Beautiful Data" (Segaran and Hammerbacher, 2009)
    http://oreilly.com/catalog/9780596157111/
    
    Original Copyright (c) 2008-2009 by Peter Norvig

FUNCTIONS
    clean(text)
        Return `text` lower-cased with non-alphanumeric characters removed.
    
    divide(text, limit=24)
        Yield `(prefix, suffix)` pairs from `text` with `len(prefix)` not
        exceeding `limit`.
    
    isegment(text)
        Return iterator of words that is the best segmentation of `text`.
    
    load()
        Load unigram and bigram counts from disk.
    
    main(args=())
        Command-line entry-point. Parses `args` into in-file and out-file then
        reads lines from in-file, segments the lines, and writes the result to
        out-file. Input and output default to stdin and stdout respectively.
    
    parse_file(filename)
        Read `filename` and parse tab-separated file of (word, count) pairs.
    
    score(word, prev=None)
        Score a `word` in the context of the previous word, `prev`.
    
    segment(text)
        Return a list of words that is the best segmentation of `text`.

DATA
    ALPHABET = set(['0', '1', '2', '3', '4', '5', ...])
    BIGRAMS = {u'0km to': 116103.0, u'0uplink verified': 523545.0, u'1000s...
    DATADIR = '/Library/Python/2.7/site-packages/wordsegment_data'
    TOTAL = 1024908267229.0
    UNIGRAMS = {u'a': 9081174698.0, u'aa': 30523331.0, u'aaa': 10243983.0,...
    __author__ = 'Grant Jenks'
    __build__ = 2048
    __copyright__ = 'Copyright 2016 Grant Jenks'
    __license__ = 'Apache 2.0'
    __title__ = 'wordsegment'
    __version__ = '0.8.0'

VERSION
    0.8.0

AUTHOR
    Grant Jenks
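
Beyond segment, the help output points at a few lower-level entry points. isegment is the lazy counterpart that segment wraps, divide enumerates the candidate (prefix, suffix) splits that the search scores, and score rates a word on its own or in the context of the previous word. A short sketch, with expected values inferred from the docstrings above (treat them as illustrative):

import wordsegment as ws

ws.load()

# isegment yields the same words as segment, one at a time.
list(ws.isegment('thisisatest'))  # ['this', 'is', 'a', 'test']

# divide enumerates candidate splits, with prefix length capped by `limit`.
list(ws.divide('test', limit=2))  # [('t', 'est'), ('te', 'st')]

# score rates a word alone or given the previous word.
ws.score('the')          # unigram-based score
ws.score('is', 'this')   # score for 'is' following 'this'

# main() is the command-line entry point; per its docstring it reads
# lines from stdin and writes segmented lines to stdout.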

Posted by TextProcessing

