Open Source Text Processing Project: Stanford Word Segmenter

Stanford Word Segmenter

Project Website:

Github Link: None


Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.

The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.

The system requires Java 1.8+ to be installed. We recommend at least 1G of memory for documents that contain long sentences. For files with shorter sentences (e.g., 20 tokens), decrease the memory requirement by changing the option java -mx1g in the run scripts.

Arabic is a root-and-template language with abundant bound morphemes. These morphemes include possessives, pronouns, and discourse connectives. Segmenting bound morphemes reduces lexical sparsity and simplifies syntactic analysis.

The Arabic segmenter model processes raw text according to the Penn Arabic Treebank 3 (ATB) standard. It is an implementation of the segmenter described in:

Will Monroe, Spence Green, and Christopher D. Manning. 2014. Word Segmentation of Informal Arabic with Domain Adaptation. In ACL.

Chinese is standardly written without spaces between words (as are some other languages). This software will split Chinese text into a sequence of words, defined according to some word segmentation standard. It is a Java implementation of the CRF-based Chinese Word Segmenter described in:

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning. 2005. A Conditional Random Field Word Segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing.
Two models with two different segmentation standards are included: Chinese Penn Treebank standard and Peking University standard.

On May 21, 2008, we released a version that makes use of lexicon features. With external lexicon features, the segmenter segments more consistently and also achieves higher F measure when we train and test on the bakeoff data. This version is close to the CRF-Lex segmenter described in:

Pi-Chuan Chang, Michel Galley and Chris Manning. 2008. Optimizing Chinese Word Segmentation for Machine Translation Performance. In WMT.
The older version (2006-05-11) without using external lexicon features will still be available for download, but we do recommend using the latest version.
Another new feature of the latest release is that the segmenter can now output k-best segmentations. An example of how to train the segmenter is now also available.

Open Source Text Processing Project: The Stanford Parser (A statistical parser)

The Stanford Parser: A statistical parser

Project Website:

Github Link: None


A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as “phrases”) and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s.

This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The original version of this parser was mainly written by Dan Klein, with support code and linguistic grammar development by Christopher Manning. Extensive additional work (internationalization and language-specific modeling, flexible input/output, grammar compaction, lattice parsing, k-best parsing, typed dependencies output, user support, etc.) has been done by Roger Levy, Christopher Manning, Teg Grenager, Galen Andrew, Marie-Catherine de Marneffe, Bill MacCartney, Anna Rafferty, Spence Green, Huihsin Tseng, Pi-Chuan Chang, Wolfgang Maier, and Jenny Finkel.

The lexicalized probabilistic parser implements a factored product model, with separate PCFG phrase structure and lexical dependency experts, whose preferences are combined by efficient exact inference, using an A* algorithm. Or the software can be used simply as an accurate unlexicalized stochastic context-free grammar parser. Either of these yields a good performance statistical parsing system. A GUI is provided for viewing the phrase structure tree output of the parser.

As well as providing an English parser, the parser can be and has been adapted to work with other languages. A Chinese parser based on the Chinese Treebank, a German parser based on the Negra corpus and Arabic parsers based on the Penn Arabic Treebank are also included. The parser has also been used for other languages, such as Italian, Bulgarian, and Portuguese.

The parser provides Universal Dependencies and Stanford Dependencies output as well as phrase structure trees. Typed dependencies are otherwise known grammatical relations. This style of output is available only for English and Chinese. For more details, please refer to the Stanford Dependencies webpage and the Universal Dependencies documentation.

Shift-reduce constituency parser
As of version 3.4 in 2014, the parser includes the code necessary to run a shift reduce parser, a much faster constituent parser with competitive accuracy. Models for this parser are linked below.

Neural-network dependency parser
In version 3.5.0 (October 2014) we released a high-performance dependency parser powered by a neural network. The parser outputs typed dependency parses for English and Chinese. The models for this parser are included in the general Stanford Parser models package.

Dependency scoring
The package includes a tool for scoring of generic dependency parses, in a class edu.stanford.nlp.trees.DependencyScoring. This tool measures scores for dependency trees, doing F1 and labeled attachment scoring. The included usage message gives a detailed description of how to use the tool.

Open Source Text Processing Project: Stanford Named Entity Recognizer (NER)

Stanford Named Entity Recognizer (NER)

Project Website:

Github Link: None


Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and we also make available on this page various other models for different languages and circumstances, including models trained on just the CoNLL 2003 English training data.

Stanford NER is also known as CRFClassifier. The software provides a general implementation of (arbitrary order) linear chain Conditional Random Field (CRF) sequence models. That is, by training your own models on labeled data, you can actually use this code to build sequence models for NER or any other task. (CRF models were pioneered by Lafferty, McCallum, and Pereira (2001); see Sutton and McCallum (2006) or Sutton and McCallum (2010) for more comprehensible introductions.)

The CRF code is by Jenny Finkel. The feature extractors are by Dan Klein, Christopher Manning, and Jenny Finkel. Much of the documentation and usability is due to Anna Rafferty. The CRF sequence models provided here do not precisely correspond to any published paper, but the correct paper to cite for the software is:

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43nd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.
The software provided here is similar to the baseline local+Viterbi model in that paper, but adds new distributional similarity based features (in the -distSim classifiers). The distributional similarity features in some models improve performance but the models require considerably more memory. The big models were trained on a mixture of CoNLL, MUC-6, MUC-7 and ACE named entity corpora, and as a result the models are fairly robust across domains.

Open Source Text Processing Project: Stanford Log-linear Part-Of-Speech Tagger

Stanford Log-linear Part-Of-Speech Tagger

Project Website:

Github Link: None


A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like ‘noun-plural’. This software is a Java implementation of the log-linear part-of-speech taggers described in these papers (if citing just one paper, cite the 2003 one):

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.
The tagger was originally written by Kristina Toutanova. Since that time, Dan Klein, Christopher Manning, William Morgan, Anna Rafferty, Michel Galley, and John Bauer have improved its speed, performance, usability, and support for other languages.

The system requires Java 1.8+ to be installed. Depending on whether you’re running 32 or 64 bit Java and the complexity of the tagger model, you’ll need somewhere between 60 and 200 MB of memory to run a trained tagger (i.e., you may need to give java an option like java -mx200m). Plenty of memory is needed to train a tagger. It again depends on the complexity of the model but at least 1GB is usually needed, often more.

Several downloads are available. The basic download contains two trained tagger models for English. The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, a French tagger model, and a German tagger model. Both versions include the same source and other required files. The tagger can be retrained on any language, given POS-annotated training text for the language.

Part-of-speech name abbreviations: The English taggers use the Penn Treebank tag set. Here are some links to documentation of the Penn Treebank English POS tag set: 1993 Computational Linguistics article in PDF, AMALGAM page, Aoife Cahill’s list. See the included README-Models.txt in the models directory for more information about the tagsets for the other languages.

Open Source Text Processing Project: Stanford CoreNLP

Stanford CoreNLP – a suite of core NLP tools

Project Website:

Github Link:


Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.

Choose Stanford CoreNLP if you need:

An integrated toolkit with a good range of grammatical analysis tools
Fast, reliable analysis of arbitrary texts
The overall highest quality text analytics
Support for a number of major (human) languages
Interfaces available for various major modern programming languages
Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, and the bootstrapped pattern learning tools. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Open Source Text Processing Project: Pattern


Project Website:

Github Link:


Pattern is a web mining module for the Python programming language.

It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and canvas visualization.

Getting Started with Pattern

Open Source Text Processing Project: MBSP

MBSP for Python

Project Website:


MBSP is a text analysis system based on the TiMBL and MBT memory based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment.

Getting Started with MBSP

Text Processing Course: Stanford Deep Learning for Natural Language Processing

Name: Deep Learning for Natural Language Processing



Natural language processing (NLP) is one of the most important technologies of the information age. Understanding complex language utterances is also a crucial part of artificial intelligence. Applications of NLP are everywhere because people communicate most everything in language: web search, advertisement, emails, customer service, language translation, radiology reports, etc. There are a large variety of underlying tasks and machine learning models powering NLP applications. Recently, deep learning approaches have obtained very high performance across many different NLP tasks. These models can often be trained with a single end-to-end model and do not require traditional, task-specific feature engineering. In this spring quarter course students will learn to implement, train, debug, visualize and invent their own neural network models. The course provides a deep excursion into cutting-edge research in deep learning applied to NLP. The final project will involve training a complex recurrent neural network and applying it to a large scale NLP problem. On the model side we will cover word vector representations, window-based neural networks, recurrent neural networks, long-short-term-memory models, recursive neural networks, convolutional neural networks as well as some very novel models involving a memory component. Through lectures and programming assignments students will learn the necessary engineering tricks for making neural networks work on practical problems.

About the Instructors

Richard Socher

“I am the Founder and CEO at MetaMind. Our vision is to improve artificial intelligence and make it easily accessible. I enjoy research in machine learning, natural language processing and computer vision. In spring 2015, I taught a class on Deep Learning for Natural Language Processing at Stanford. I got my PhD in the CS Department at Stanford, advised by Chris Manning and Andrew Ng. This Wired article talks about some of the research work that we do at MetaMind. I’m on Twitter.”

Text Processing Course: Introduction to Natural Language Processing

Name: Introduction to Natural Language Processing



This course provides an introduction to the field of Natural Language Processing, including topics like Language Models, Parsing, Semantics, Question Answering, and Sentiment Analysis.

This course provides an introduction to the field of Natural Language Processing. It includes relevant background material in Linguistics, Mathematics, Probabilities, and Computer Science. Some of the topics covered in the class are Text Similarity, Part of Speech Tagging, Parsing, Semantics, Question Answering, Sentiment Analysis, and Text Summarization.
The course includes quizzes, programming assignments in python, and a final exam.

Course Syllabus
Week One (Introduction 1/2) (1:35:31)
Week Two (Introduction 2/2) (1:36:26)
Week Three (NLP Tasks and Text Similarity) (1:42:52)
Week Four (Syntax and Parsing, Part 1) (1:48:14)
Week Five (Syntax and Parsing, Part 2) (1:50:29)
Week Six (Language Modeling and Word Sense Disambiguation) (1:40:33)
Week Seven (Part of Speech Tagging and Information Extraction) (1:33:21)
Week Eight (Question Answering) (1:16:59)
Week Nine (Text Summarization) (1:33:55)
Week Ten (Collocations and Information Retrieval) (1:29:40)
Week Eleven (Sentiment Analysis and Semantics) (1:09:38)
Week Twelve (Discourse, Machine Translation, and Generation) (1:30:57)

About the Instructors

Dragomir R. Radev, Ph.D.
Professor of Information, School of Information, Professor of Electrical Engineering and Computer Science, College of Engineering, and Professor of Linguistics, College of Literature, Science, and the Arts
University of Michigan
Among his many accomplishments, Dragomir Radev was cited for being an
international leader in computational linguistics, and contributing
significantly to automatic methods to extract content from text,
including data mining, Web graph and network analysis, and
bioinformatics. Using natural language processing, information
retrieval, and machine learning, he makes sense of the exploding
volume of digital content.

Professor Radev was the first to develop ways to generate informative
summaries from multiple online sources, addressing for the first time
this important problem. Before joining the University of Michigan
faculty in 2000, he worked at IBM’s T.J. Watson Research Center. At
IBM, he co-authored three patents, one of which was for a forerunner to
the Watson Q&A Engine.

Professor Radev is one of the co-founders of the North American
Computational Linguistics Olympiad (NACLO), which began in 2006. This
event for high school and middle school students from the U.S. and
Canada introduces the field of computational linguistics through
competitions to solve linguistic puzzles using analytic
reasoning. Each year, he has coached and traveled with the
U.S. national teams who participate in the International Linguistics
Olympiad and has returned with several gold medals. In 2011, Radev and
his fellow organizers received the Linguistics, Language and the
Public Award from the Linguistic Society of America.

Radev is Professor of Information (in the School of Information),
Professor of Electrical Engineering and Computer Science (in the
College of Engineering) and Professor of Linguistics (in the College
of Literature, Science and the Arts).

Dragomir Radev is an ACM Distinguished Scientist and the recipient of
the University of Michigan Faculty Recognition Award and Outstanding
Mentorship Award.

Text Processing Course: Text Mining and Analytics

Name: Text Mining and Analytics



Explore algorithms for mining and analyzing big text data to discover interesting patterns, extract useful knowledge, and support decision making.

This course will cover the major techniques for mining and analyzing text data to discover interesting patterns, extract useful knowledge, and support decision making, with an emphasis on statistical approaches that can be generally applied to arbitrary text data in any natural language with no or minimum human effort.

Detailed analysis of text data requires understanding of natural language text, which is known to be a difficult task for computers. However, a number of statistical approaches have been shown to work well for the “shallow” but robust analysis of text data for pattern finding and knowledge discovery. You will learn the basic concepts, principles, and major algorithms in text mining and their potential applications.

This course will be covering the following topics:

Overview of text mining and analytics
Natural language processing and text representation
Word association mining
Topic mining and analysis with statistical topic models
Text clustering and categorization
Opinion mining and sentiment analysis
Integrative analysis of text and structured data

About the Instructors

ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
ChengXiang Zhai is a Professor of Computer Science at the University of Illinois at Urbana-Champaign, where he also holds a joint appointment at the Institute for Genomic Biology, Statistics, and the Graduate School of Library and Information Science. His research interests include information retrieval, text mining, natural language processing, machine learning, and bioinformatics, and has published over 200 papers in these areas with an H-index of 58 in Google Scholar. He is an Associate Editor of ACM Transactions on Information Systems, and Information Processing and Management, and the Americas Editor of Springer’s Information Retrieval Book Series. He is a conference program co-chair of ACM CIKM 2004, NAACL HLT 2007, ACM SIGIR 2009, ECIR 2014, ICTIR 2015, and WWW 2015, and conference general co-chair for ACM CIKM 2016. He is an ACM Distinguished Scientist and a recipient of multiple best paper awards, Rose Award for Teaching Excellence at UIUC, Alfred P. Sloan Research Fellowship, IBM Faculty Award, HP Innovation Research Program Award, and the Presidential Early Career Award for Scientists and Engineers (PECASE).