Open Source Text Processing Project: Stanford Named Entity Recognizer (NER)

Stanford Named Entity Recognizer (NER)

Project Website: http://nlp.stanford.edu/software/CRF-NER.shtml

Github Link: None

Description

Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION); various other models for different languages and circumstances are also available on the project page, including models trained on just the CoNLL 2003 English training data.

Stanford NER is also known as CRFClassifier. The software provides a general implementation of (arbitrary order) linear chain Conditional Random Field (CRF) sequence models. That is, by training your own models on labeled data, you can actually use this code to build sequence models for NER or any other task. (CRF models were pioneered by Lafferty, McCallum, and Pereira (2001); see Sutton and McCallum (2006) or Sutton and McCallum (2010) for more comprehensible introductions.)

The CRF code is by Jenny Finkel. The feature extractors are by Dan Klein, Christopher Manning, and Jenny Finkel. Much of the documentation and usability is due to Anna Rafferty. The CRF sequence models provided here do not precisely correspond to any published paper, but the correct paper to cite for the software is:

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf
The software provided here is similar to the baseline local+Viterbi model in that paper, but adds new distributional similarity based features (in the -distSim classifiers). The distributional similarity features in some models improve performance but the models require considerably more memory. The big models were trained on a mixture of CoNLL, MUC-6, MUC-7 and ACE named entity corpora, and as a result the models are fairly robust across domains.
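
For a quick illustration of how a trained model is used, the sketch below calls the bundled 3-class English classifier from Python through NLTK's Stanford NER wrapper; NLTK is not part of the Stanford download, and the jar and classifier paths are assumptions about where the distribution was unpacked.

# Minimal sketch, assuming NLTK is installed and Java is on the PATH.
# The two paths below are assumptions about the local layout of the Stanford NER download.
from nltk.tag import StanfordNERTagger

ner = StanfordNERTagger(
    'stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',  # assumed model path
    'stanford-ner/stanford-ner.jar')                                    # assumed jar path

tokens = 'Barack Obama met Angela Merkel in Berlin .'.split()
print(ner.tag(tokens))
# Output is a list of (token, label) pairs such as ('Berlin', 'LOCATION') or ('met', 'O').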

Open Source Text Processing Project: Stanford Log-linear Part-Of-Speech Tagger

Stanford Log-linear Part-Of-Speech Tagger

Project Website: http://nlp.stanford.edu/software/tagger.shtml

Github Link: None

Description

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other tokens), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like ‘noun-plural’. This software is a Java implementation of the log-linear part-of-speech taggers described in these papers (if citing just one paper, cite the 2003 one):

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.
The tagger was originally written by Kristina Toutanova. Since that time, Dan Klein, Christopher Manning, William Morgan, Anna Rafferty, Michel Galley, and John Bauer have improved its speed, performance, usability, and support for other languages.

The system requires Java 1.8+ to be installed. Depending on whether you’re running 32- or 64-bit Java and the complexity of the tagger model, you’ll need somewhere between 60 and 200 MB of memory to run a trained tagger (i.e., you may need to give Java an option like java -mx200m). Training a tagger requires considerably more memory: depending on the complexity of the model, at least 1 GB is usually needed, often more.

Several downloads are available. The basic download contains two trained tagger models for English. The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, a French tagger model, and a German tagger model. Both versions include the same source and other required files. The tagger can be retrained on any language, given POS-annotated training text for the language.

Part-of-speech name abbreviations: The English taggers use the Penn Treebank tag set. Here are some links to documentation of the Penn Treebank English POS tag set: 1993 Computational Linguistics article in PDF, AMALGAM page, Aoife Cahill’s list. See the included README-Models.txt in the models directory for more information about the tagsets for the other languages.
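
As a small usage sketch, a trained English model can be called from Python through NLTK's Stanford tagger wrapper; NLTK is not part of the Stanford download, and the model and jar paths below are assumptions about where the distribution was unpacked. The tags in the output are the Penn Treebank tags mentioned above.

# Minimal sketch, assuming NLTK is installed and Java 1.8+ is available.
# The two paths below are assumptions about the local layout of the tagger download.
from nltk.tag import StanfordPOSTagger

tagger = StanfordPOSTagger(
    'stanford-postagger/models/english-bidirectional-distsim.tagger',  # assumed model path
    'stanford-postagger/stanford-postagger.jar',                       # assumed jar path
    java_options='-mx200m')  # memory setting, as discussed in the notes above

print(tagger.tag('The quick brown fox jumps over the lazy dog .'.split()))
# Output is a list of (token, tag) pairs, e.g. ('fox', 'NN'), ('jumps', 'VBZ').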

Open Source Text Processing Project: Stanford CoreNLP

Stanford CoreNLP – a suite of core NLP tools

Project Website: http://stanfordnlp.github.io/CoreNLP/

Github Link: https://github.com/stanfordnlp/CoreNLP

Description

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words and their parts of speech; recognize whether they are names of companies, people, etc.; normalize dates, times, and numeric quantities; mark up the structure of sentences in terms of phrases and word dependencies; indicate which noun phrases refer to the same entities; indicate sentiment; extract open-class relations between mentions; and more.

Choose Stanford CoreNLP if you need:

An integrated toolkit with a good range of grammatical analysis tools
Fast, reliable analysis of arbitrary texts
The overall highest quality text analytics
Support for a number of major (human) languages
Interfaces available for various major modern programming languages

Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, and the bootstrapped pattern learning tools. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.
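
As a concrete sketch of this style of usage, the example below sends a piece of plain text to a locally running CoreNLP server (started separately with java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000) and prints the POS and NER annotations; the annotator list and port are assumptions you can adjust.

# Minimal sketch, assuming a CoreNLP server is already running on localhost:9000
# and that the requests library is installed.
import json
import requests

text = 'Stanford CoreNLP was developed at Stanford University in California.'
props = {'annotators': 'tokenize,ssplit,pos,lemma,ner', 'outputFormat': 'json'}

resp = requests.post('http://localhost:9000/',
                     params={'properties': json.dumps(props)},
                     data=text.encode('utf-8'))
annotation = resp.json()

for sentence in annotation['sentences']:
    for token in sentence['tokens']:
        print(token['word'], token['pos'], token['ner'])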

Open Source Text Processing Project: Pattern

Pattern

Project Website: http://www.clips.ua.ac.be/pattern

Github Link: https://github.com/clips/pattern

Description

Pattern is a web mining module for the Python programming language.

It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and canvas visualization.
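
A small sketch of the natural language processing side of the module is shown below, using the parse and sentiment functions from pattern.en; the example sentences are arbitrary.

# Minimal sketch, assuming the pattern package is installed.
from pattern.en import parse, sentiment

# parse() returns a tagged string with part-of-speech tags and chunk labels;
# lemmata=True appends the lemma of each word.
print(parse('The quick brown fox jumped over the lazy dog.', lemmata=True))

# sentiment() returns a (polarity, subjectivity) pair,
# with polarity in [-1.0, 1.0] and subjectivity in [0.0, 1.0].
print(sentiment('Pattern makes web mining surprisingly pleasant.'))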

Reference
Getting Started with Pattern

Open Source Text Processing Project: MBSP

MBSP for Python

Project Website: http://www.clips.ua.ac.be/pages/MBSP

Description

MBSP is a text analysis system based on the TiMBL and MBT memory-based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment.
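
A minimal usage sketch is shown below; it assumes MBSP is installed and importable, and that the first call is allowed to start the underlying TiMBL/MBT servers in the background, which can take a moment.

# Minimal sketch, assuming the MBSP package and its bundled servers are installed.
import MBSP

# parse() runs the shallow-parsing pipeline (tokenization, POS tagging,
# chunking, relation finding and PP attachment) and returns a tagged string.
print(MBSP.parse('I eat pizza with a fork.'))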

Reference
Getting Started with MBSP

Text Processing Course: Stanford Deep Learning for Natural Language Processing

Name: Deep Learning for Natural Language Processing

Website: http://cs224d.stanford.edu/

Description

Natural language processing (NLP) is one of the most important technologies of the information age. Understanding complex language utterances is also a crucial part of artificial intelligence. Applications of NLP are everywhere because people communicate almost everything in language: web search, advertisement, emails, customer service, language translation, radiology reports, etc. There are a large variety of underlying tasks and machine learning models powering NLP applications. Recently, deep learning approaches have obtained very high performance across many different NLP tasks. These models can often be trained with a single end-to-end model and do not require traditional, task-specific feature engineering.

In this spring quarter course students will learn to implement, train, debug, visualize and invent their own neural network models. The course provides a deep excursion into cutting-edge research in deep learning applied to NLP. The final project will involve training a complex recurrent neural network and applying it to a large-scale NLP problem. On the model side we will cover word vector representations, window-based neural networks, recurrent neural networks, long short-term memory (LSTM) models, recursive neural networks, and convolutional neural networks, as well as some very novel models involving a memory component. Through lectures and programming assignments students will learn the necessary engineering tricks for making neural networks work on practical problems.

About the Instructors

Richard Socher

“I am the Founder and CEO at MetaMind. Our vision is to improve artificial intelligence and make it easily accessible. I enjoy research in machine learning, natural language processing and computer vision. In spring 2015, I taught a class on Deep Learning for Natural Language Processing at Stanford. I got my PhD in the CS Department at Stanford, advised by Chris Manning and Andrew Ng. This Wired article talks about some of the research work that we do at MetaMind. I’m on Twitter.”

Text Processing Course: Introduction to Natural Language Processing

Name: Introduction to Natural Language Processing

Website: https://www.coursera.org/course/nlpintro

Description

This course provides an introduction to the field of Natural Language Processing, including topics like Language Models, Parsing, Semantics, Question Answering, and Sentiment Analysis.

This course provides an introduction to the field of Natural Language Processing. It includes relevant background material in Linguistics, Mathematics, Probabilities, and Computer Science. Some of the topics covered in the class are Text Similarity, Part of Speech Tagging, Parsing, Semantics, Question Answering, Sentiment Analysis, and Text Summarization.
The course includes quizzes, programming assignments in Python, and a final exam.

Course Syllabus
Week One (Introduction 1/2) (1:35:31)
Week Two (Introduction 2/2) (1:36:26)
Week Three (NLP Tasks and Text Similarity) (1:42:52)
Week Four (Syntax and Parsing, Part 1) (1:48:14)
Week Five (Syntax and Parsing, Part 2) (1:50:29)
Week Six (Language Modeling and Word Sense Disambiguation) (1:40:33)
Week Seven (Part of Speech Tagging and Information Extraction) (1:33:21)
Week Eight (Question Answering) (1:16:59)
Week Nine (Text Summarization) (1:33:55)
Week Ten (Collocations and Information Retrieval) (1:29:40)
Week Eleven (Sentiment Analysis and Semantics) (1:09:38)
Week Twelve (Discourse, Machine Translation, and Generation) (1:30:57)

About the Instructors

Dragomir R. Radev, Ph.D.
Professor of Information, School of Information, Professor of Electrical Engineering and Computer Science, College of Engineering, and Professor of Linguistics, College of Literature, Science, and the Arts
University of Michigan
Among his many accomplishments, Dragomir Radev was cited for being an international leader in computational linguistics and for contributing significantly to automatic methods to extract content from text, including data mining, Web graph and network analysis, and bioinformatics. Using natural language processing, information retrieval, and machine learning, he makes sense of the exploding volume of digital content.

Professor Radev was the first to develop ways to generate informative summaries from multiple online sources. Before joining the University of Michigan faculty in 2000, he worked at IBM’s T.J. Watson Research Center. At IBM, he co-authored three patents, one of which was for a forerunner to the Watson Q&A Engine.

Professor Radev is one of the co-founders of the North American Computational Linguistics Olympiad (NACLO), which began in 2006. This event for high school and middle school students from the U.S. and Canada introduces the field of computational linguistics through competitions to solve linguistic puzzles using analytic reasoning. Each year, he has coached and traveled with the U.S. national teams that participate in the International Linguistics Olympiad, returning with several gold medals. In 2011, Radev and his fellow organizers received the Linguistics, Language and the Public Award from the Linguistic Society of America.

Radev is Professor of Information (in the School of Information), Professor of Electrical Engineering and Computer Science (in the College of Engineering), and Professor of Linguistics (in the College of Literature, Science, and the Arts).

Dragomir Radev is an ACM Distinguished Scientist and the recipient of the University of Michigan Faculty Recognition Award and Outstanding Mentorship Award.

Text Processing Course: Text Mining and Analytics

Name: Text Mining and Analytics

Website: https://www.coursera.org/course/textanalytics

Description

Explore algorithms for mining and analyzing big text data to discover interesting patterns, extract useful knowledge, and support decision making.

This course will cover the major techniques for mining and analyzing text data to discover interesting patterns, extract useful knowledge, and support decision making, with an emphasis on statistical approaches that can be generally applied to arbitrary text data in any natural language with no or minimal human effort.

Detailed analysis of text data requires understanding of natural language text, which is known to be a difficult task for computers. However, a number of statistical approaches have been shown to work well for the “shallow” but robust analysis of text data for pattern finding and knowledge discovery. You will learn the basic concepts, principles, and major algorithms in text mining and their potential applications.

This course will cover the following topics:

Overview of text mining and analytics
Natural language processing and text representation
Word association mining
Topic mining and analysis with statistical topic models
Text clustering and categorization
Opinion mining and sentiment analysis
Integrative analysis of text and structured data

About the Instructors

ChengXiang Zhai
Professor
Department of Computer Science
University of Illinois at Urbana-Champaign
ChengXiang Zhai is a Professor of Computer Science at the University of Illinois at Urbana-Champaign, where he also holds joint appointments with the Institute for Genomic Biology, the Department of Statistics, and the Graduate School of Library and Information Science. His research interests include information retrieval, text mining, natural language processing, machine learning, and bioinformatics; he has published over 200 papers in these areas, with an H-index of 58 on Google Scholar. He is an Associate Editor of ACM Transactions on Information Systems and of Information Processing and Management, and the Americas Editor of Springer’s Information Retrieval Book Series. He has served as conference program co-chair of ACM CIKM 2004, NAACL HLT 2007, ACM SIGIR 2009, ECIR 2014, ICTIR 2015, and WWW 2015, and as conference general co-chair of ACM CIKM 2016. He is an ACM Distinguished Scientist and a recipient of multiple best paper awards, the Rose Award for Teaching Excellence at UIUC, an Alfred P. Sloan Research Fellowship, an IBM Faculty Award, an HP Innovation Research Program Award, and the Presidential Early Career Award for Scientists and Engineers (PECASE).

Text Processing Course: Natural Language Processing by Columbia University

Name: Natural Language Processing

Website: https://www.coursera.org/course/nlangp

Description

Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.

In this course you will study mathematical and computational models of language, and the application of these models to key problems in natural language processing. The course has a focus on machine learning methods, which are widely used in modern NLP systems: we will cover formalisms such as hidden Markov models, probabilistic context-free grammars, log-linear models, and statistical models for machine translation. The curriculum closely follows a course currently taught by Professor Collins at Columbia University, and previously taught at MIT.

About the Instructors

Michael Collins
Vikram S. Pandit Professor of Computer Science
Columbia University
Michael Collins is the Vikram S. Pandit Professor of Computer Science at Columbia University. Michael received his bachelor’s and MPhil degrees from Cambridge University, and a PhD from the University of Pennsylvania. He was then a researcher at AT&T Labs (1999-2002), and an assistant/associate professor at MIT (2003-2010), before joining Columbia University in January 2011. His research areas are natural language processing and machine learning, with a focus on problems such as statistical parsing, structured prediction problems in machine learning, and applications including machine translation, dialog systems, and speech recognition. Michael is a fellow of the Association for Computational Linguistics, and has received various awards including a Sloan fellowship, an NSF CAREER award, as well as best paper awards at several conferences.

Text Processing Course: Stanford Natural Language Processing

Name: Natural Language Processing

Website: https://www.coursera.org/course/nlp

Description

This course covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering. We will also introduce the underlying theory from probability, statistics, and machine learning that is crucial for the field, and cover fundamental algorithms like n-gram language modeling, Naive Bayes and maxent classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituent parsing, and vector-space models of meaning.

We are offering this course on Natural Language Processing free and online to students worldwide, continuing Stanford’s exciting forays into large-scale online instruction. Students have access to screencast lecture videos, are given quiz questions, assignments and exams, receive regular feedback on progress, and can participate in a discussion forum. Those who successfully complete the course will receive a statement of accomplishment. The course is taught by Professors Jurafsky and Manning, and the curriculum draws from Stanford’s courses in Natural Language Processing. You will need a decent internet connection for accessing course materials, but should be able to watch the videos on your smartphone.

About the Instructors

Dan Jurafsky
Professor
Stanford University

Dan Jurafsky is Professor of Linguistics and Professor by Courtesy of Computer Science at Stanford University. Dan received his Bachelor’s degree in Linguistics in 1983 and his Ph.D. in Computer Science in 1992, both from the University of California at Berkeley, and also taught at the University of Colorado, Boulder before joining the Stanford faculty in 2004. He is the recipient of a MacArthur Fellowship and has served on a variety of editorial boards, corporate advisory boards, and program committees. Dan’s research extends broadly throughout natural language processing as well as its application to the behavioral and social sciences.

Christopher Manning
Associate Professor
Stanford University

Christopher Manning is an Associate Professor of Computer Science and Linguistics at Stanford University. Chris received a Bachelors degree and University Medal from the Australian National University and a Ph.D. from Stanford in 1994, both in Linguistics. Chris taught at Carnegie Mellon University and The University of Sydney before joining the Stanford faculty in 1999. He is a Fellow of the American Association for Artificial Intelligence and of the Association for Computational Linguistics, and is one of the most cited authors in natural language processing, for his research on a broad range of statistical natural language topics from tagging and parsing to grammar induction and text understanding.