About Python Word Segmentation
WordSegment is an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.
It is based on code from the chapter “Natural Language Corpus Data” by Peter Norvig in the book “Beautiful Data” (Segaran and Hammerbacher, 2009).
Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
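Because every stored key is lowercased with punctuation stripped, lookups against the count tables must use the same normalized form. A minimal sketch (using the module-level UNIGRAMS dictionary and load() call demonstrated later in this post):

import wordsegment as ws

ws.load()                        # read the unigram/bigram counts from disk

print(ws.UNIGRAMS['the'])        # 23135851162.0
print('The' in ws.UNIGRAMS)      # False -- keys are stored lowercased
print("it's" in ws.UNIGRAMS)     # False -- punctuation was stripped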
Install Python Word Segmentation
Installing WordSegment is easy; just use pip:
pip install wordsegment
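Once installed, a quick check from the interpreter confirms the package works. A minimal sketch; note that newer releases of wordsegment require an explicit load() call before segmenting, while the 0.8.0 release shown below loads its data on import, so calling load() is harmless either way:

>>> import wordsegment
>>> wordsegment.load()   # optional on 0.8.0, required on newer releases
>>> wordsegment.segment('thisisatest')
['this', 'is', 'a', 'test']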
How to Use Python Word Segmentation for English Text
In [1]: import wordsegment

In [2]: help(wordsegment)

In [4]: from wordsegment import segment

In [5]: segment("thisisatest")
Out[5]: ['this', 'is', 'a', 'test']

In [6]: segment("helloworld")
Out[6]: ['helloworld']

In [7]: segment("hiworld")
Out[7]: ['hi', 'world']

In [8]: segment("NewYork")
Out[8]: ['new', 'york']

In [9]: from wordsegment import clean

In [10]: clean("this's a test")
Out[10]: 'thissatest'

In [11]: segment("this'satest")
Out[11]: ['this', 'sa', 'test']

In [12]: import wordsegment as ws

In [13]: ws.load()

In [15]: ws.UNIGRAMS['the']
Out[15]: 23135851162.0

In [16]: ws.UNIGRAMS['gray']
Out[16]: 21424658.0

In [17]: ws.UNIGRAMS['grey']
Out[17]: 18276942.0

In [18]: dir(ws)
Out[18]:
['ALPHABET', 'BIGRAMS', 'DATADIR', 'TOTAL', 'UNIGRAMS', '__author__',
 '__build__', '__builtins__', '__copyright__', '__doc__', '__file__',
 '__license__', '__name__', '__package__', '__title__', '__version__',
 'clean', 'divide', 'io', 'isegment', 'load', 'main', 'math', 'op',
 'parse_file', 'score', 'segment', 'sys']

In [19]: ws.BIGRAMS['this is']
Out[19]: 86818400.0

In [20]: ws.BIGRAMS['is a']
Out[20]: 476718990.0

In [21]: ws.BIGRAMS['a test']
Out[21]: 4417355.0

In [22]: ws.BIGRAMS['a patent']
Out[22]: 1117510.0

In [23]: ws.BIGRAMS['free patent']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-6d20cc0adefa> in <module>()
----> 1 ws.BIGRAMS['free patent']

KeyError: 'free patent'

In [24]: ws.BIGRAMS['the input']
Out[24]: 4840160.0

In [26]: import heapq

In [27]: from pprint import pprint

In [28]: from operator import itemgetter

In [29]: pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
[(u'of the', 2766332391.0),
 (u'in the', 1628795324.0),
 (u'to the', 1139248999.0),
 (u'on the', 800328815.0),
 (u'for the', 692874802.0),
 (u'and the', 629726893.0),
 (u'to be', 505148997.0),
 (u'is a', 476718990.0),
 (u'with the', 461331348.0),
 (u'from the', 428303219.0)]
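As the KeyError above shows, only the most common 250,000 phrases are kept, so many valid bigrams are simply absent. A small sketch of a safe lookup, using nothing beyond the BIGRAMS dictionary already shown and a plain dict.get with a zero default:

import wordsegment as ws

ws.load()

def bigram_count(phrase):
    # Return the corpus count for a two-word phrase, or 0.0 if the
    # phrase is not among the stored bigrams.
    return ws.BIGRAMS.get(phrase, 0.0)

print(bigram_count('is a'))          # 476718990.0
print(bigram_count('free patent'))   # 0.0 -- not in the 250,000-phrase subset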
Help Info about Python Word Segmentation
Help on module wordsegment:
NAME
wordsegment - English Word Segmentation in Python
FILE
/Library/Python/2.7/site-packages/wordsegment.py
DESCRIPTION
Word segmentation is the process of dividing a phrase without spaces back
into its constituent parts. For example, consider a phrase like
"thisisatest".
For humans, it's relatively easy to parse. This module makes it easy for
machines too. Use `segment` to parse a phrase into its parts:
>>> from wordsegment import segment
>>> segment('thisisatest')
['this', 'is', 'a', 'test']
In the code, 1024908267229 is the total number of words in the corpus. A
subset of this corpus is found in unigrams.txt and bigrams.txt which
should accompany this file. A copy of these files may be found at
http://norvig.com/ngrams/ under the names count_1w.txt and count_2w.txt
respectively.
Copyright (c) 2016 by Grant Jenks
Based on code from the chapter "Natural Language Corpus Data"
from the book "Beautiful Data" (Segaran and Hammerbacher, 2009)
http://oreilly.com/catalog/9780596157111/
Original Copyright (c) 2008-2009 by Peter Norvig
FUNCTIONS
clean(text)
Return `text` lower-cased with non-alphanumeric characters removed.
divide(text, limit=24)
Yield `(prefix, suffix)` pairs from `text` with `len(prefix)` not
exceeding `limit`.
isegment(text)
Return iterator of words that is the best segmentation of `text`.
load()
Load unigram and bigram counts from disk.
main(args=())
Command-line entry-point. Parses `args` into in-file and out-file then
reads lines from in-file, segments the lines, and writes the result to
out-file. Input and output default to stdin and stdout respectively.
parse_file(filename)
Read `filename` and parse tab-separated file of (word, count) pairs.
score(word, prev=None)
Score a `word` in the context of the previous word, `prev`.
segment(text)
Return a list of words that is the best segmentation of `text`.
DATA
ALPHABET = set(['0', '1', '2', '3', '4', '5', ...])
BIGRAMS = {u'0km to': 116103.0, u'0uplink verified': 523545.0, u'1000s...
DATADIR = '/Library/Python/2.7/site-packages/wordsegment_data'
TOTAL = 1024908267229.0
UNIGRAMS = {u'a': 9081174698.0, u'aa': 30523331.0, u'aaa': 10243983.0,...
__author__ = 'Grant Jenks'
__build__ = 2048
__copyright__ = 'Copyright 2016 Grant Jenks'
__license__ = 'Apache 2.0'
__title__ = 'wordsegment'
__version__ = '0.8.0'
VERSION
0.8.0
AUTHOR
Grant Jenks
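To round out the listing above, here is a minimal sketch exercising clean, isegment, and score at the module level, matching the 0.8.0 API documented in this help output (newer releases move some of these functions onto a Segmenter class):

import wordsegment as ws

ws.load()                                  # load unigram and bigram counts

cleaned = ws.clean("This, is a test!")     # 'thisisatest'
print(cleaned)

for word in ws.isegment(cleaned):          # lazy, word-by-word segmentation
    print(word)

print(ws.score('test', prev='a'))          # score of 'test' given previous word 'a'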
Posted by TextProcessing