Open Source Text Processing Project: GibbsLDA++

GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation

Project Website:

Github Link: None


GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) that uses Gibbs sampling for parameter estimation and inference. It is very fast and is designed to analyze the hidden/latent topic structures of large-scale datasets, including large collections of text/Web documents. LDA was first introduced by David Blei et al. [Blei03]. There have been several implementations of this model in C (using variational methods), Java, and Matlab. We decided to release this implementation of LDA in C/C++ using Gibbs sampling to provide an alternative to the topic-model community.
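To make the idea of Gibbs sampling for LDA more concrete, here is a minimal sketch of a collapsed Gibbs sampler in Python. It is not GibbsLDA++ code and makes no claim about how that project is organized internally; the function name `lda_gibbs`, the hyperparameter defaults, and the toy corpus are all assumptions chosen for illustration. Documents are given as lists of integer word ids.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch, hypothetical API)."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens in doc d with topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w has topic k
    topic_total = [0] * num_topics                              # tokens assigned to topic k
    assign = []                                                 # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and put the token back into the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates of the topic-word (phi) and document-topic (theta) distributions.
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return phi, theta


# Toy corpus (made up): word ids 0-2 and 3-5 tend to occur in different documents.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 2, 1, 0], [5, 3, 4, 4]]
phi, theta = lda_gibbs(docs, num_topics=2, vocab_size=6)
for d, row in enumerate(theta):
    print(d, [round(p, 2) for p in row])
```

Each row of `phi` is a distribution over the vocabulary for one topic, and each row of `theta` is a distribution over topics for one document. A production implementation would add burn-in, convergence checks, and inference on unseen documents.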

GibbsLDA++ is useful for the following potential application areas:

Information retrieval and search (analyzing semantic/latent topic/concept structures of large text collections for more intelligent information search).
Document classification/clustering, document summarization, and text/web mining in general.
Content-based image clustering, object recognition, and other applications of computer vision in general.
Other potential applications in biological data.

Open Source Text Processing Project: WhatLanguage

WhatLanguage: A language detection library for Ruby that uses bloom filters for speed.

Project Website: None

Github Link:


Text language detection. Fast, memory efficient, and all in pure Ruby. Uses Bloom filters for the aforementioned speed and memory benefits.
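As a rough illustration of how Bloom filters can support language detection, the sketch below (in Python rather than Ruby, and not taken from WhatLanguage) stores a small word list per language in a Bloom filter and picks the language whose filter matches the most words. The class `BloomFilter`, the function `detect_language`, the filter size, and the tiny word lists are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item (sketch)."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True for every added item; may rarely be True for others (false positive).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def detect_language(text, filters):
    """Return the language whose filter recognizes the most words in text."""
    words = text.lower().split()
    scores = {lang: sum(word in bf for word in words) for lang, bf in filters.items()}
    return max(scores, key=scores.get)


# Toy word lists, invented for this example; a real detector needs far larger ones.
WORDS = {
    "english": ["the", "is", "and", "this", "a", "of", "cat"],
    "french": ["le", "est", "et", "ceci", "un", "de", "chat"],
}

filters = {}
for lang, words in WORDS.items():
    bf = BloomFilter()
    for word in words:
        bf.add(word)
    filters[lang] = bf

print(detect_language("this is the cat", filters))
```

The memory benefit comes from storing only bits rather than the words themselves; the trade-off is a small chance of false positives, which word-count voting tolerates well.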

Works with Dutch, English, Farsi, French, German, Italian, Pinyin, Swedish, Portuguese, Russian, Arabic, Finnish, Greek, Hebrew, Hungarian, Korean, Norwegian, Polish and Spanish out of the box.

Important note

This library was first built in 2007 and has received a few minor updates over the years. There are now more efficient and effective algorithms for doing language detection which I am investigating for a WhatLanguage 2.0.

This library has been updated to be distributed and to work on modern Ruby implementations but other than that, has had no improvements.

Text Processing Book: Text Processing with Ruby

Text Processing with Ruby: Extract Value from the Data That Surrounds You

Text is everywhere. Web pages, databases, the contents of files–for almost any programming task you perform, you need to process text. Cut even the most complex text-based tasks down to size and learn how to master regular expressions, scrape information from Web pages, develop reusable utilities to process text in pipelines, and more.

Most information in the world is in text format, and programmers often find themselves needing to make sense of the data hiding within. It might be to convert it from one format to another, or to find out information about the text as a whole, or to extract information from it. But how do you do this efficiently, avoiding labor-intensive, manual work?

Text Processing with Ruby takes a practical approach. You’ll learn how to get text into your Ruby programs from the file system and from user input. You’ll process delimited files such as CSVs, and write utilities that interact with other programs in text-processing pipelines. Decipher character encoding mysteries, and avoid the pain of jumbled characters and malformed output.

You’ll learn to use regular expressions to match, extract, and replace patterns in text. You’ll write a parser and learn how to process Web pages to pull out information from even the messiest of HTML.

Before long you’ll be able to tackle even the most enormous and entangled text with ease, scything through gigabytes of data and effortlessly extracting the bits that matter.

About the Author
Rob Miller is Operations Director at a London-based marketing consultancy. He spends his days merrily chewing through huge quantities of text in Ruby, turning raw data into meaningful analysis. He blogs at and tweets @robmil.

Text Processing Book: Taming Text – How to Find, Organize, and Manipulate It 1st Edition

Taming Text: How to Find, Organize, and Manipulate It


Taming Text, winner of the 2013 Jolt Awards for Productivity, is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. This book explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. The book guides you through examples illustrating each of these topics, as well as the foundations upon which they are built.

About this Book

There is so much text in our lives, we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. You’ll find them in this book.

Taming Text is a practical, example-driven guide to working with text in real applications. This book introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You’ll explore real use cases as you systematically absorb the foundations upon which they are built. Written in a clear and concise style, this book avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language.

Written for Java developers, the book requires no prior knowledge of GWT.

Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.

Winner of 2013 Jolt Awards: The Best Books—one of five notable books every serious programmer should read.

What’s Inside

When to use text-taming techniques
Important open-source libraries like Solr and Mahout
How to build text-processing applications

About the Authors
Grant Ingersoll is an engineer, speaker, and trainer, a Lucene committer, and a cofounder of the Mahout machine-learning project. Thomas Morton is the primary developer of OpenNLP and Maximum Entropy. Drew Farris is a technology consultant, software developer, and contributor to Mahout, Lucene, and Solr.

“Takes the mystery out of very complex processes.” From the Foreword by Liz Liddy, Dean, iSchool, Syracuse University

Table of Contents

Getting started taming text
Foundations of taming text
Fuzzy string matching
Identifying people, places, and things
Clustering text
Classification, categorization, and tagging
Building an example question answering system
Untamed text: exploring the next frontier

About the Authors
Grant Ingersoll is a founder of Lucid Imagination, developing search and natural language processing tools. Prior to Lucid Imagination, he was a Senior Software Engineer at the Center for Natural Language Processing at Syracuse University. At the Center and, previously, at MNIS-TextWise, Grant worked on a number of text processing applications involving information retrieval, question answering, clustering, summarization, and categorization. Grant is a committer, as well as a speaker and trainer, on the Apache Lucene Java project and a co-founder of the Apache Mahout machine-learning project. He holds a master’s degree in computer science from Syracuse University and a bachelor’s degree in mathematics and computer science from Amherst College.

Thomas Morton writes software and performs research in the area of text processing and machine learning. He has been the primary developer and maintainer of the OpenNLP text processing project and Maximum Entropy machine learning project for the last 5 years. He received his doctorate in Computer Science from the University of Pennsylvania in 2005, and has worked in several industry positions applying text processing and machine learning to enterprise class development efforts. Currently he works as a software architect for Comcast Interactive Media in Philadelphia.

Drew Farris is a professional software developer and technology consultant whose interests focus on large scale analytics, distributed computing and machine learning. Previously, he worked at TextWise where he implemented a wide variety of text exploration, management and retrieval applications combining natural language processing, classification and visualization techniques. He has contributed to a number of open source projects including Apache Mahout, Lucene and Solr, and holds a master’s degree in Information Resource Management from Syracuse University’s iSchool and a B.F.A in Computer Graphics.

Open Source Text Processing Project: Jieba

Jieba: Chinese text segmentation

Project Website: None

Github Link:


“Jieba” (Chinese for “to stutter”) Chinese text segmentation: built to be the best Python Chinese word segmentation module.

Supports three segmentation modes:

Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
Full Mode gets all the possible words from the sentence. Fast but not accurate.
Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
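The difference between the three modes can be shown with a toy dictionary-based segmenter. This sketch does not use Jieba and does not reproduce its algorithm: the accurate mode here is a plain greedy longest-match, and the dictionary `DICT` and the function names are made up for the example.

```python
# Tiny hypothetical dictionary; a real segmenter ships a large one with frequencies.
DICT = {"我", "来到", "北京", "清华", "华大", "大学", "清华大学"}
MAX_LEN = max(len(word) for word in DICT)


def full_mode(sentence):
    """Every dictionary word found at any position; overlaps are allowed."""
    found = []
    for i in range(len(sentence)):
        for j in range(i + 1, min(len(sentence), i + MAX_LEN) + 1):
            if sentence[i:j] in DICT:
                found.append(sentence[i:j])
    return found


def accurate_mode(sentence):
    """One non-overlapping segmentation, using greedy forward longest match."""
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + MAX_LEN), i, -1):
            # Fall back to a single character when nothing in DICT matches.
            if sentence[i:j] in DICT or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words


def search_mode(sentence):
    """Accurate mode, plus the two-character dictionary words inside long words."""
    words = []
    for word in accurate_mode(sentence):
        if len(word) > 2:
            pairs = (word[k:k + 2] for k in range(len(word) - 1))
            words.extend(pair for pair in pairs if pair in DICT)
        words.append(word)
    return words


sentence = "我来到北京清华大学"
print("Full:    ", "/".join(full_mode(sentence)))
print("Accurate:", "/".join(accurate_mode(sentence)))
print("Search:  ", "/".join(search_mode(sentence)))
```

Full mode returns overlapping candidates such as 清华, 清华大学, 华大, and 大学, which is why it is fast but not accurate. Search mode keeps the accurate segmentation and adds the shorter words inside 清华大学, which is what raises recall for a search index.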

Supports Traditional Chinese
Supports customized dictionaries
MIT License

Open Source Text Processing Project: THUTag

THUTag: A Package of Keyphrase Extraction and Social Tag Suggestion

Project Website: None

Github Link:


Part I : THUTag Contents

Part II : How To Compile THUTag

Part III : How To Run Cross-validation of THUTag

Part IV : Input File Formats of Cross-validation

Part V : Output File Formats of Cross-validation

Part VI : How To Run UI && Testing a single passage of THUTag

Part VII : Input File Formats of UI && Testing a single passage

Part VIII: Output File Formats of UI && Testing a single passage

Part IX : Literature

Part X : License

Part XI : Authors

Part XII : Appendix

Text Processing Book: Python 2.6 Text Processing Beginners Guide

Python 2.6 Text Processing: Beginners Guide

With a basic knowledge of Python you have the potential to undertake time-saving text processing. This book is a great introduction to the various techniques, and teaches through practical examples and clear explanations.

Overview

The easiest way to learn text processing with Python
Deals with the most important textual data formats you will encounter
Learn to use the most popular text processing libraries available for Python
Packed with examples to guide you through

What you will learn from this book

Know the options available for processing text in Python
Parse JSON data that is often used as a data delivery mechanism on the Internet
Organize a log-processing application via modules and packages to make it more extensible
Perform conditional matches via look-ahead and look-behind assertions by using basic regular expressions
Process XML and HTML documents in a variety of ways based on the needs of your application
Implement callback methods to perform SAX processing and walk in-memory DOM structures
Understand Unicode, character encoding, internationalization, and localization
Lay out a Mako template-based project by using techniques such as template inheritance, additional tags, and custom filters
Install and use the Mako templating system to create your own Mako templates
Process a large number of e-mail messages using the Python standard library and index them with Nucular for fast searching
Fix common exceptions that occur while dealing with different types of text encoding
Build simple PDF output using the ReportLab toolkit’s high-level PLATYPUS framework
Generate Microsoft Excel output using the xlwt module
Open and edit existing Open Document files to use them as template sources
Understand supporting functions and classes, such as the Python IO system and packaging components

Approach

This book is part of the Beginner’s Guide series. Each chapter covers the steps for various tasks to process data followed

About the Author
Jeff McNeil has been working in the Internet Services industry for over 10 years. He cut his teeth during the late 90’s Internet boom and has been developing software for Unix and Unix-flavored systems ever since. Jeff has been a full-time Python developer for the better half of that time and has professional experience with a collection of other languages, including C, Java, and Perl. He takes an interest in systems administration and server automation problems. Jeff recently joined Google and has had the pleasure of working with some very talented individuals.

Open Source Text Processing Project: langid Stand-alone language identification system

Project Website: None

Github Link:

Description

langid is a standalone Language Identification (LangID) tool.

The design principles are as follows:

Pre-trained over a large number of languages (currently 97)
Not sensitive to domain-specific features (e.g. HTML/XML markup)
Single .py file with minimal dependencies
Deployable as a web service

All that is required to run langid is Python 2.7 or later and numpy. The main script langid/ is cross-compatible with both Python 2 and Python 3, but the accompanying training tools are still Python 2-only. It is WSGI-compliant: it will use fapws3 as a web server if available, and default to wsgiref.simple_server otherwise. It comes pre-trained on 97 languages (ISO 639-1 codes given):

af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu
The training data was drawn from 5 different sources:

ClueWeb 09
Reuters RCV2
Debian i18n

Open Source Text Processing Project: LingPipe


Project Website:

Github Link: None


LingPipe is a tool kit for processing text using computational linguistics. LingPipe is used to do tasks like:

Find the names of people, organizations or locations in news
Automatically classify Twitter search results into categories
Suggest correct spellings of queries
To get a better idea of the range of possible LingPipe uses, visit our tutorials and sandbox.

LingPipe’s architecture is designed to be efficient, scalable, reusable, and robust. Highlights include:

Java API with source code and unit tests;
multi-lingual, multi-domain, multi-genre models;
training with new data for new tasks;
n-best output with statistical confidence estimates;
online training (learn-a-little, tag-a-little);
thread-safe models and decoders for concurrent-read exclusive-write (CREW) synchronization; and
character encoding-sensitive I/O.

Open Source Text Processing Project: OpenNLP

Apache OpenNLP

Project Website:

Github Link: None


The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.

It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.