About Python Word Segmentation
WordSegment is an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.
Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).
Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
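If you want to confirm what actually ships with the package, here is a minimal sketch that loads the bundled data and inspects its size. It uses only the module-level names (load(), UNIGRAMS, BIGRAMS, TOTAL) that appear in the session and help output later in this post; the exact counts depend on the data files included with your installed version.

import wordsegment as ws

ws.load()                  # read the bundled unigram and bigram count files
print(len(ws.UNIGRAMS))    # roughly 333,000 unigram entries
print(len(ws.BIGRAMS))     # roughly 250,000 bigram entries
print(ws.TOTAL)            # 1024908267229.0, the corpus word count used for scoring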
Install Python Word Segmentation
Installing WordSegment is very easy; just use pip:
pip install wordsegment
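To verify the installation, you can run a quick one-liner (the sample string is arbitrary; if your installed version complains about missing data, call wordsegment.load() before segmenting):

python -c "from wordsegment import segment; print(segment('thisisatest'))"
# expected output: ['this', 'is', 'a', 'test']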
How to Use Python Word Segmentation for English Text
In [1]: import wordsegment

In [2]: help(wordsegment)

In [4]: from wordsegment import segment

In [5]: segment("thisisatest")
Out[5]: ['this', 'is', 'a', 'test']

In [6]: segment("helloworld")
Out[6]: ['helloworld']

In [7]: segment("hiworld")
Out[7]: ['hi', 'world']

In [8]: segment("NewYork")
Out[8]: ['new', 'york']

In [9]: from wordsegment import clean

In [10]: clean("this's a test")
Out[10]: 'thissatest'

In [11]: segment("this'satest")
Out[11]: ['this', 'sa', 'test']

In [12]: import wordsegment as ws

In [13]: ws.load()

In [15]: ws.UNIGRAMS['the']
Out[15]: 23135851162.0

In [16]: ws.UNIGRAMS['gray']
Out[16]: 21424658.0

In [17]: ws.UNIGRAMS['grey']
Out[17]: 18276942.0

In [18]: dir(ws)
Out[18]:
['ALPHABET', 'BIGRAMS', 'DATADIR', 'TOTAL', 'UNIGRAMS', '__author__',
 '__build__', '__builtins__', '__copyright__', '__doc__', '__file__',
 '__license__', '__name__', '__package__', '__title__', '__version__',
 'clean', 'divide', 'io', 'isegment', 'load', 'main', 'math', 'op',
 'parse_file', 'score', 'segment', 'sys']

In [19]: ws.BIGRAMS['this is']
Out[19]: 86818400.0

In [20]: ws.BIGRAMS['is a']
Out[20]: 476718990.0

In [21]: ws.BIGRAMS['a test']
Out[21]: 4417355.0

In [22]: ws.BIGRAMS['a patent']
Out[22]: 1117510.0

In [23]: ws.BIGRAMS['free patent']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-6d20cc0adefa> in <module>()
----> 1 ws.BIGRAMS['free patent']

KeyError: 'free patent'

In [24]: ws.BIGRAMS['the input']
Out[24]: 4840160.0

In [26]: import heapq

In [27]: from pprint import pprint

In [28]: from operator import itemgetter

In [29]: pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
[(u'of the', 2766332391.0),
 (u'in the', 1628795324.0),
 (u'to the', 1139248999.0),
 (u'on the', 800328815.0),
 (u'for the', 692874802.0),
 (u'and the', 629726893.0),
 (u'to be', 505148997.0),
 (u'is a', 476718990.0),
 (u'with the', 461331348.0),
 (u'from the', 428303219.0)]
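The same calls work outside of IPython. The following is just a sketch of a small standalone script that reuses the functions demonstrated above (segment, clean, and the UNIGRAMS/BIGRAMS dictionaries); it asserts nothing beyond what the session already showed and guards the lookup of a bigram that is not in the data.

from __future__ import print_function  # keeps the prints working on the Python 2.7 install shown below

import heapq
from operator import itemgetter

import wordsegment as ws

ws.load()  # make sure the unigram/bigram counts are in memory

# Segment a few phrases written without spaces.
for phrase in ['thisisatest', 'hiworld', 'NewYork']:
    print(phrase, '->', ws.segment(phrase))

# clean() lowercases and strips non-alphanumeric characters.
print(ws.clean("this's a test"))

# Look up raw corpus counts, using .get() to avoid the KeyError seen above.
print(ws.UNIGRAMS.get('the'))
print(ws.BIGRAMS.get('free patent', 0.0))  # not in the data, so 0.0

# The ten most frequent bigrams in the bundled data.
for bigram, count in heapq.nlargest(10, ws.BIGRAMS.items(), key=itemgetter(1)):
    print(bigram, count)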
Help Info about Python Word Segmentation
Help on module wordsegment:

NAME
    wordsegment - English Word Segmentation in Python

FILE
    /Library/Python/2.7/site-packages/wordsegment.py

DESCRIPTION
    Word segmentation is the process of dividing a phrase without spaces back
    into its constituent parts. For example, consider a phrase like
    "thisisatest". For humans, it's relatively easy to parse. This module
    makes it easy for machines too. Use `segment` to parse a phrase into its
    parts:

    >>> from wordsegment import segment
    >>> segment('thisisatest')
    ['this', 'is', 'a', 'test']

    In the code, 1024908267229 is the total number of words in the corpus. A
    subset of this corpus is found in unigrams.txt and bigrams.txt which
    should accompany this file. A copy of these files may be found at
    http://norvig.com/ngrams/ under the names count_1w.txt and count_2w.txt
    respectively.

    Copyright (c) 2016 by Grant Jenks

    Based on code from the chapter "Natural Language Corpus Data" from the
    book "Beautiful Data" (Segaran and Hammerbacher, 2009)
    http://oreilly.com/catalog/9780596157111/

    Original Copyright (c) 2008-2009 by Peter Norvig

FUNCTIONS
    clean(text)
        Return `text` lower-cased with non-alphanumeric characters removed.

    divide(text, limit=24)
        Yield `(prefix, suffix)` pairs from `text` with `len(prefix)` not
        exceeding `limit`.

    isegment(text)
        Return iterator of words that is the best segmenation of `text`.

    load()
        Load unigram and bigram counts from disk.

    main(args=())
        Command-line entry-point. Parses `args` into in-file and out-file
        then reads lines from in-file, segments the lines, and writes the
        result to out-file. Input and output default to stdin and stdout
        respectively.

    parse_file(filename)
        Read `filename` and parse tab-separated file of (word, count) pairs.

    score(word, prev=None)
        Score a `word` in the context of the previous word, `prev`.

    segment(text)
        Return a list of words that is the best segmenation of `text`.

DATA
    ALPHABET = set(['0', '1', '2', '3', '4', '5', ...])
    BIGRAMS = {u'0km to': 116103.0, u'0uplink verified': 523545.0, u'1000s...
    DATADIR = '/Library/Python/2.7/site-packages/wordsegment_data'
    TOTAL = 1024908267229.0
    UNIGRAMS = {u'a': 9081174698.0, u'aa': 30523331.0, u'aaa': 10243983.0,...
    __author__ = 'Grant Jenks'
    __build__ = 2048
    __copyright__ = 'Copyright 2016 Grant Jenks'
    __license__ = 'Apache 2.0'
    __title__ = 'wordsegment'
    __version__ = '0.8.0'

VERSION
    0.8.0

AUTHOR
    Grant Jenks
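The help text above also lists a few lower-level helpers (isegment, divide, score). The snippet below is a small, hedged sketch of how they can be exercised; the exact numeric scores and split pairs depend on the bundled counts and the limit argument, so nothing beyond the documented signatures is assumed.

from __future__ import print_function

import wordsegment as ws

ws.load()

# isegment() is the lazy counterpart of segment(): it yields words one at a time.
print(list(ws.isegment('thisisatest')))   # ['this', 'is', 'a', 'test']

# divide() yields (prefix, suffix) candidate splits, with len(prefix) <= limit.
for prefix, suffix in ws.divide('test', limit=3):
    print(prefix, suffix)

# score() rates a word, optionally conditioned on the previous word.
print(ws.score('test'))
print(ws.score('test', prev='a'))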
Posted by TextProcessing