Open Source Text Processing Project: WhatLanguage

WhatLanguage: A language detection library for Ruby that uses bloom filters for speed.

Project Website: None

Github Link:


Text language detection: quick, memory efficient, and all in pure Ruby. Uses Bloom filters for the aforementioned speed and memory benefits.

Works with Dutch, English, Farsi, French, German, Italian, Pinyin, Swedish, Portuguese, Russian, Arabic, Finnish, Greek, Hebrew, Hungarian, Korean, Norwegian, Polish and Spanish out of the box.
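The Bloom-filter approach can be sketched in a few lines of Python. WhatLanguage itself is a Ruby gem, so this is an illustrative toy rather than its API: each language gets a Bloom filter populated with common words, and the language whose filter matches the most input words wins. The class, vocabularies, and hash scheme here are hypothetical.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over a fixed-size bit array."""
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions by salting one hash function with an index.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))

def detect(text, filters):
    """Score each language by how many words its filter (probably) contains."""
    words = text.lower().split()
    scores = {lang: sum(w in bf for w in words) for lang, bf in filters.items()}
    return max(scores, key=scores.get)

# Hypothetical tiny vocabularies; a real detector loads large word lists.
filters = {}
for lang, vocab in {"english": ["the", "and", "of", "house"],
                    "german": ["der", "und", "das", "haus"]}.items():
    bf = BloomFilter()
    for word in vocab:
        bf.add(word)
    filters[lang] = bf

print(detect("das haus und der garten", filters))  # → german
```

The memory benefit is that each language needs only a fixed-size bit array rather than a full word set, at the cost of occasional false-positive word matches.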

Important note

This library was first built in 2007 and has received a few minor updates over the years. There are now more efficient and effective algorithms for doing language detection which I am investigating for a WhatLanguage 2.0.

This library has been updated to be distributed and to work on modern Ruby implementations but other than that, has had no improvements.

Text Processing Book: Text Processing with Ruby

Text Processing with Ruby: Extract Value from the Data That Surrounds You

Text is everywhere: web pages, databases, the contents of files. For almost any programming task you perform, you need to process text. Cut even the most complex text-based tasks down to size and learn how to master regular expressions, scrape information from web pages, develop reusable utilities to process text in pipelines, and more.

Most information in the world is in text format, and programmers often find themselves needing to make sense of the data hiding within. It might be to convert it from one format to another, to find out information about the text as a whole, or to extract information from it. But how do you do this efficiently, avoiding labor-intensive, manual work?

Text Processing with Ruby takes a practical approach. You’ll learn how to get text into your Ruby programs from the file system and from user input. You’ll process delimited files such as CSVs, and write utilities that interact with other programs in text-processing pipelines. Decipher character encoding mysteries, and avoid the pain of jumbled characters and malformed output.

You’ll learn to use regular expressions to match, extract, and replace patterns in text. You’ll write a parser and learn how to process Web pages to pull out information from even the messiest of HTML.

Before long you’ll be able to tackle even the most enormous and entangled text with ease, scything through gigabytes of data and effortlessly extracting the bits that matter.

About the Author
Rob Miller is Operations Director at a London-based marketing consultancy. He spends his days merrily chewing through huge quantities of text in Ruby, turning raw data into meaningful analysis. He blogs at and tweets @robmil.

Text Processing Book: Taming Text – How to Find, Organize, and Manipulate It 1st Edition

Taming Text: How to Find, Organize, and Manipulate It


Taming Text, winner of the 2013 Jolt Awards for Productivity, is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. This book explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. The book guides you through examples illustrating each of these topics, as well as the foundations upon which they are built.

About this Book

There is so much text in our lives, we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. You’ll find them in this book.
Taming Text is a practical, example-driven guide to working with text in real applications. This book introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You’ll explore real use cases as you systematically absorb the foundations upon which they are built.

Written in a clear and concise style, this book avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language.

Written for Java developers, the book requires no prior experience with natural language processing.

Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.

Winner of 2013 Jolt Awards: The Best Books—one of five notable books every serious programmer should read.

What’s Inside

When to use text-taming techniques
Important open-source libraries like Solr and Mahout
How to build text-processing applications
About the Authors
Grant Ingersoll is an engineer, speaker, and trainer, a Lucene committer, and a cofounder of the Mahout machine-learning project. Thomas Morton is the primary developer of OpenNLP and Maximum Entropy. Drew Farris is a technology consultant, software developer, and contributor to Mahout, Lucene, and Solr.

“Takes the mystery out of very complex processes.”—From the Foreword by Liz Liddy, Dean, iSchool, Syracuse University

Table of Contents

Getting started taming text
Foundations of taming text
Fuzzy string matching
Identifying people, places, and things
Clustering text
Classification, categorization, and tagging
Building an example question answering system
Untamed text: exploring the next frontier

About the Author
Grant Ingersoll is a founder of Lucid Imagination, developing search and natural language processing tools. Prior to Lucid Imagination, he was a Senior Software Engineer at the Center for Natural Language Processing at Syracuse University. At the Center and, previously, at MNIS-TextWise, Grant worked on a number of text processing applications involving information retrieval, question answering, clustering, summarization, and categorization. Grant is a committer, as well as a speaker and trainer, on the Apache Lucene Java project and a co-founder of the Apache Mahout machine-learning project. He holds a master’s degree in computer science from Syracuse University and a bachelor’s degree in mathematics and computer science from Amherst College.

Thomas Morton writes software and performs research in the area of text processing and machine learning. He has been the primary developer and maintainer of the OpenNLP text processing project and Maximum Entropy machine learning project for the last 5 years. He received his doctorate in Computer Science from the University of Pennsylvania in 2005, and has worked in several industry positions applying text processing and machine learning to enterprise class development efforts. Currently he works as a software architect for Comcast Interactive Media in Philadelphia.

Drew Farris is a professional software developer and technology consultant whose interests focus on large scale analytics, distributed computing and machine learning. Previously, he worked at TextWise where he implemented a wide variety of text exploration, management and retrieval applications combining natural language processing, classification and visualization techniques. He has contributed to a number of open source projects including Apache Mahout, Lucene and Solr, and holds a master’s degree in Information Resource Management from Syracuse University’s iSchool and a B.F.A in Computer Graphics.

Open Source Text Processing Project: Jieba

Jieba: Chinese text segmentation

Project Website: None

Github Link:


“Jieba” (Chinese for “to stutter”) Chinese text segmentation: built to be the best Python Chinese word segmentation module.

Supports three segmentation modes:

Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
Full Mode gets all the possible words from the sentence. Fast but not accurate.
Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.

Supports Traditional Chinese
Supports customized dictionaries
MIT License
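As a toy illustration of the difference between Full Mode and Accurate Mode (not Jieba's actual algorithm, which builds a DAG over a statistical dictionary and uses an HMM for unknown words), here is a tiny dictionary-based sketch; the dictionary and sentence are made up for the example:

```python
# Toy dictionary; Jieba itself ships a large statistical dictionary.
DICT = {"中国", "科学", "学院", "科学院", "中国科学院", "计算", "计算所"}
MAX_LEN = max(len(w) for w in DICT)

def full_mode(sentence):
    """Emit every dictionary word found at any position (fast, overlapping)."""
    words = []
    for i in range(len(sentence)):
        for j in range(i + 1, min(i + MAX_LEN, len(sentence)) + 1):
            if sentence[i:j] in DICT:
                words.append(sentence[i:j])
    return words

def accurate_mode(sentence):
    """Greedy forward maximum matching: one non-overlapping segmentation."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest dictionary match first; fall back to one character.
        for j in range(min(i + MAX_LEN, len(sentence)), i, -1):
            if sentence[i:j] in DICT or j - i == 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

print(full_mode("中国科学院计算所"))
print(accurate_mode("中国科学院计算所"))  # → ['中国科学院', '计算所']
```

In Jieba itself the modes are selected with `jieba.cut(text, cut_all=True)` for Full Mode, `jieba.cut(text)` for Accurate Mode, and `jieba.cut_for_search(text)` for Search Engine Mode.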

Open Source Text Processing Project: THUTag

THUTag: A Package of Keyphrase Extraction and Social Tag Suggestion

Project Website: None

Github Link:


Part I : THUTag Contents

Part II : How To Compile THUTag

Part III : How To Run Cross-validation of THUTag

Part IV : Input File Formats of Cross-validation

Part V : Output File Formats of Cross-validation

Part VI : How To Run UI && Testing a single passage of THUTag

Part VII : Input File Formats of UI && Testing a single passage

Part VIII: Output File Formats of UI && Testing a single passage

Part IX : Literature

Part X : License

Part XI : Authors

Part XII : Appendix

Text Processing Book: Python 2.6 Text Processing Beginners Guide

Python 2.6 Text Processing: Beginners Guide

With a basic knowledge of Python you have the potential to undertake time-saving text processing. This book is a great introduction to the various techniques, and teaches through practical examples and clear explanations.

Overview

The easiest way to learn text processing with Python
Deals with the most important textual data formats you will encounter
Learn to use the most popular text processing libraries available for Python
Packed with examples to guide you through

What you will learn from this book

Know the options available for processing text in Python
Parse JSON data that is often used as a data delivery mechanism on the Internet
Organize a log-processing application via modules and packages to make it more extensible
Perform conditional matches via look-ahead and look-behind assertions by using basic regular expressions
Process XML and HTML documents in a variety of ways based on the needs of your application
Implement callback methods to perform SAX processing and walk in-memory DOM structures
Understand Unicode, character encoding, internationalization, and localization
Lay out a Mako template-based project by using techniques such as template inheritance, additional tags, and custom filters
Install and use the Mako templating system to create your own Mako templates
Process a large number of e-mail messages using the Python standard library and index them with Nucular for fast searching
Fix common exceptions that occur while dealing with different types of text encoding
Build simple PDF output using the ReportLab toolkit’s high-level PLATYPUS framework
Generate Microsoft Excel output using the xlwt module
Open and edit existing Open Document files to use them as template sources
Understand supporting functions and classes, such as the Python IO system and packaging components

Approach

This book is part of the Beginner’s Guide series. Each chapter covers the steps for various tasks to process data.
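One of the topics the blurb mentions, look-ahead and look-behind assertions, can be shown briefly. The book targets Python 2.6, but the assertion syntax is unchanged in Python 3; the sample text is invented for the example:

```python
import re

text = "price: $42, tax: $7, total: $49"

# Look-behind (?<=...): match digits only when preceded by a dollar sign,
# without including the dollar sign in the match.
amounts = re.findall(r"(?<=\$)\d+", text)
print(amounts)  # → ['42', '7', '49']

# Look-ahead (?=...): match a word only when followed by a colon,
# without consuming the colon (the field names).
fields = re.findall(r"\w+(?=:)", text)
print(fields)  # → ['price', 'tax', 'total']
```

Because the assertions are zero-width, the surrounding context is checked but never captured, which keeps the matches clean for further processing.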

About the Author
Jeff McNeil has been working in the Internet Services industry for over 10 years. He cut his teeth during the late 90’s Internet boom and has been developing software for Unix and Unix-flavored systems ever since. Jeff has been a full-time Python developer for the better half of that time and has professional experience with a collection of other languages, including C, Java, and Perl. He takes an interest in systems administration and server automation problems. Jeff recently joined Google and has had the pleasure of working with some very talented individuals.

Open Source Text Processing Project: langid

langid.py: Stand-alone language identification system

Project Website: None

Github Link:

langid.py is a standalone Language Identification (LangID) tool.

The design principles are as follows:

Pre-trained over a large number of languages (currently 97)
Not sensitive to domain-specific features (e.g. HTML/XML markup)
Single .py file with minimal dependencies
Deployable as a web service
All that is required to run langid.py is Python >= 2.7 and numpy. The main script langid/langid.py is cross-compatible with both Python 2 and Python 3, but the accompanying training tools are still Python 2-only.

langid.py is WSGI-compliant: it will use fapws3 as a web server if available, and default to wsgiref.simple_server otherwise.

langid.py comes pre-trained on 97 languages (ISO 639-1 codes given):

af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu
The training data was drawn from 5 different sources, including:

ClueWeb 09
Reuters RCV2
Debian i18n
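langid.py's actual model is a naive Bayes classifier over byte n-grams selected for cross-domain stability. A drastically simplified sketch of the same family of idea (character-trigram profiles over hypothetical two-language training samples, not langid.py's code or model) looks like this:

```python
from collections import Counter

def ngrams(text, n=3):
    """Count character trigrams, padding with spaces at the edges."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def train(samples):
    """Build a character-trigram profile per language from example text."""
    return {lang: ngrams(text) for lang, text in samples.items()}

def identify(text, profiles):
    """Score each language by summed profile frequency of the text's trigrams."""
    grams = ngrams(text)
    scores = {lang: sum(profile[g] * c for g, c in grams.items())
              for lang, profile in profiles.items()}
    return max(scores, key=scores.get)

# Hypothetical tiny training samples; langid.py trains on large corpora.
profiles = train({
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
})
print(identify("the dog and the fox", profiles))  # → en
```

Character n-grams work well for language identification because short, frequent letter sequences (like "the" or "der") are highly language-specific even in very short inputs.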

Open Source Text Processing Project: LingPipe


Project Website:

Github Link: None


LingPipe is a tool kit for processing text using computational linguistics. LingPipe is used for tasks like:

Find the names of people, organizations or locations in news
Automatically classify Twitter search results into categories
Suggest correct spellings of queries
To get a better idea of the range of possible LingPipe uses, visit our tutorials and sandbox.

LingPipe’s architecture is designed to be efficient, scalable, reusable, and robust. Highlights include:

Java API with source code and unit tests;
multi-lingual, multi-domain, multi-genre models;
training with new data for new tasks;
n-best output with statistical confidence estimates;
online training (learn-a-little, tag-a-little);
thread-safe models and decoders for concurrent-read exclusive-write (CREW) synchronization; and
character encoding-sensitive I/O.

Open Source Text Processing Project: OpenNLP

Apache OpenNLP

Project Website:

Github Link: None


The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.

It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.
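Two of these tasks, sentence segmentation and tokenization, can be sketched with naive regular expressions to show what they produce. OpenNLP itself is a Java library that uses trained maximum entropy models for these steps; this Python toy only illustrates the tasks, not OpenNLP's API:

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? plus whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "OpenNLP supports many tasks. It is written in Java!"
for sent in split_sentences(text):
    print(tokenize(sent))
```

Real models handle the cases that break these regexes (abbreviations like "Dr.", decimal points, clitics), which is why statistical approaches such as OpenNLP's are preferred in practice.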

Open Source Text Processing Project: GATE

GATE: a full-lifecycle open source solution for text processing

Project Website:

Github Link: None


GATE is…

open source software capable of solving almost any text processing problem
a mature and extensive community of developers, users, educators, students and scientists
a defined and repeatable process for creating robust and maintainable text processing workflows
in active use for all sorts of language processing tasks and applications, including: voice of the customer; cancer research; drug research; decision support; recruitment; web mining; information extraction; semantic annotation
the result of a multi-million euro R&D programme running since 1995, funded by commercial users, the EC, BBSRC, EPSRC, AHRC, JISC, etc.
used by corporations, SMEs, research labs and Universities worldwide
the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, the ISO 9001 of Text Mining
a world-class team of language processing developers

GATE has grown over the years to include a desktop client for developers, a workflow-based web application, a Java library, an architecture and a process. GATE is:

an IDE, GATE Developer: an integrated development environment for language processing components, bundled with a very widely used Information Extraction system and a comprehensive set of other plugins
a web app, GATE Teamware: a collaborative annotation environment for factory-style semantic annotation projects, built around a workflow engine and a heavily optimised backend service infrastructure
a framework, GATE Embedded: an object library optimised for inclusion in diverse applications giving access to all the services used by GATE Developer and more
an architecture: a high-level organisational picture of language processing software composition
a process for the creation of robust and maintainable services

On top of the core functions GATE includes components for diverse language processing tasks, e.g. parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. GATE Developer and Embedded are supplied with an Information Extraction system (ANNIE) which has been adapted and evaluated very widely (numerous industrial systems, research systems evaluated in MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.). ANNIE is often used to create RDF or OWL (metadata) for unstructured content (semantic annotation).