Open Source Text Processing Project: MBSP

MBSP for Python

Project Website: http://www.clips.ua.ac.be/pages/MBSP

Description

MBSP is a text analysis system based on the TiMBL and MBT memory-based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment.
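To give a rough feel for what memory-based learning means in a tagging setting, here is a minimal sketch in plain Python. It is not MBSP, TiMBL, or MBT code and does not use their APIs: the stored examples, the feature layout (previous word, word, next word), the overlap measure, and the function names are all assumptions made for illustration. The idea it shows is only the general one: keep labeled instances in memory and tag a new token by looking up the most similar stored instance.

```python
from collections import Counter

# Toy "memory" of labeled instances (made up for this sketch).
# Each instance is ((previous word, word, next word), tag).
MEMORY = [
    (("<s>", "the", "cat"), "DT"),
    (("the", "cat", "sat"), "NN"),
    (("cat", "sat", "on"), "VBD"),
    (("sat", "on", "the"), "IN"),
    (("on", "the", "mat"), "DT"),
    (("the", "mat", "</s>"), "NN"),
]


def overlap(a, b):
    """Count the feature positions where two instances agree."""
    return sum(x == y for x, y in zip(a, b))


def classify(features, k=1):
    """Return the majority tag among the k most similar stored instances."""
    ranked = sorted(MEMORY, key=lambda item: overlap(features, item[0]), reverse=True)
    votes = Counter(tag for _, tag in ranked[:k])
    return votes.most_common(1)[0][0]


def tag(tokens):
    """Tag each token using a one-word window of left and right context."""
    padded = ["<s>"] + tokens + ["</s>"]
    return [
        (padded[i], classify(tuple(padded[i - 1:i + 2])))
        for i in range(1, len(padded) - 1)
    ]


# "dog" and "rug" never appear in the memory; they are tagged by context alone.
print(tag(["the", "dog", "sat", "on", "the", "rug"]))
```

Real memory-based taggers use far larger instance bases, richer features, and weighted similarity metrics; the sketch only shows the lookup-by-similarity idea.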

Reference
Getting Started with MBSP

Text Processing Course: Stanford Deep Learning for Natural Language Processing

Name: Deep Learning for Natural Language Processing

Website: http://cs224d.stanford.edu/

Description

Natural language processing (NLP) is one of the most important technologies of the information age. Understanding complex language utterances is also a crucial part of artificial intelligence. Applications of NLP are everywhere because people communicate most everything in language: web search, advertisement, emails, customer service, language translation, radiology reports, etc. There is a large variety of underlying tasks and machine learning models powering NLP applications. Recently, deep learning approaches have obtained very high performance across many different NLP tasks. These models can often be trained with a single end-to-end model and do not require traditional, task-specific feature engineering. In this spring quarter course, students will learn to implement, train, debug, visualize and invent their own neural network models. The course provides a deep excursion into cutting-edge research in deep learning applied to NLP. The final project will involve training a complex recurrent neural network and applying it to a large scale NLP problem. On the model side, we will cover word vector representations, window-based neural networks, recurrent neural networks, long short-term memory models, recursive neural networks, convolutional neural networks as well as some very novel models involving a memory component. Through lectures and programming assignments students will learn the necessary engineering tricks for making neural networks work on practical problems.

About the Instructors

Richard Socher

“I am the Founder and CEO at MetaMind. Our vision is to improve artificial intelligence and make it easily accessible. I enjoy research in machine learning, natural language processing and computer vision. In spring 2015, I taught a class on Deep Learning for Natural Language Processing at Stanford. I got my PhD in the CS Department at Stanford, advised by Chris Manning and Andrew Ng. This Wired article talks about some of the research work that we do at MetaMind. I’m on Twitter.”

Text Processing Course: Introduction to Natural Language Processing

Name: Introduction to Natural Language Processing

Website: https://www.coursera.org/course/nlpintro

Description

This course provides an introduction to the field of Natural Language Processing, including topics like Language Models, Parsing, Semantics, Question Answering, and Sentiment Analysis.

This course provides an introduction to the field of Natural Language Processing. It includes relevant background material in Linguistics, Mathematics, Probabilities, and Computer Science. Some of the topics covered in the class are Text Similarity, Part of Speech Tagging, Parsing, Semantics, Question Answering, Sentiment Analysis, and Text Summarization.
The course includes quizzes, programming assignments in Python, and a final exam.

Course Syllabus
Week One (Introduction 1/2) (1:35:31)
Week Two (Introduction 2/2) (1:36:26)
Week Three (NLP Tasks and Text Similarity) (1:42:52)
Week Four (Syntax and Parsing, Part 1) (1:48:14)
Week Five (Syntax and Parsing, Part 2) (1:50:29)
Week Six (Language Modeling and Word Sense Disambiguation) (1:40:33)
Week Seven (Part of Speech Tagging and Information Extraction) (1:33:21)
Week Eight (Question Answering) (1:16:59)
Week Nine (Text Summarization) (1:33:55)
Week Ten (Collocations and Information Retrieval) (1:29:40)
Week Eleven (Sentiment Analysis and Semantics) (1:09:38)
Week Twelve (Discourse, Machine Translation, and Generation) (1:30:57)

About the Instructors

Dragomir R. Radev, Ph.D.
Professor of Information, School of Information, Professor of Electrical Engineering and Computer Science, College of Engineering, and Professor of Linguistics, College of Literature, Science, and the Arts
University of Michigan
Among his many accomplishments, Dragomir Radev was cited for being an
international leader in computational linguistics, and contributing
significantly to automatic methods to extract content from text,
including data mining, Web graph and network analysis, and
bioinformatics. Using natural language processing, information
retrieval, and machine learning, he makes sense of the exploding
volume of digital content.

Professor Radev was the first to develop ways to generate informative
summaries from multiple online sources, addressing for the first time
this important problem. Before joining the University of Michigan
faculty in 2000, he worked at IBM’s T.J. Watson Research Center. At
IBM, he co-authored three patents, one of which was for a forerunner to
the Watson Q&A Engine.

Professor Radev is one of the co-founders of the North American
Computational Linguistics Olympiad (NACLO), which began in 2006. This
event for high school and middle school students from the U.S. and
Canada introduces the field of computational linguistics through
competitions to solve linguistic puzzles using analytic
reasoning. Each year, he has coached and traveled with the
U.S. national teams who participate in the International Linguistics
Olympiad and has returned with several gold medals. In 2011, Radev and
his fellow organizers received the Linguistics, Language and the
Public Award from the Linguistic Society of America.

Radev is Professor of Information (in the School of Information),
Professor of Electrical Engineering and Computer Science (in the
College of Engineering) and Professor of Linguistics (in the College
of Literature, Science and the Arts).

Dragomir Radev is an ACM Distinguished Scientist and the recipient of
the University of Michigan Faculty Recognition Award and Outstanding
Mentorship Award.

Text Processing Course: Text Mining and Analytics

Name: Text Mining and Analytics

Website: https://www.coursera.org/course/textanalytics

Description

Explore algorithms for mining and analyzing big text data to discover interesting patterns, extract useful knowledge, and support decision making.

This course will cover the major techniques for mining and analyzing text data to discover interesting patterns, extract useful knowledge, and support decision making, with an emphasis on statistical approaches that can be generally applied to arbitrary text data in any natural language with no or minimum human effort.

Detailed analysis of text data requires understanding of natural language text, which is known to be a difficult task for computers. However, a number of statistical approaches have been shown to work well for the “shallow” but robust analysis of text data for pattern finding and knowledge discovery. You will learn the basic concepts, principles, and major algorithms in text mining and their potential applications.

This course will be covering the following topics:

Overview of text mining and analytics
Natural language processing and text representation
Word association mining
Topic mining and analysis with statistical topic models
Text clustering and categorization
Opinion mining and sentiment analysis
Integrative analysis of text and structured data

About the Instructors

ChengXiang Zhai
Professor
Department of Computer Science
University of Illinois at Urbana-Champaign
ChengXiang Zhai is a Professor of Computer Science at the University of Illinois at Urbana-Champaign, where he also holds a joint appointment at the Institute for Genomic Biology, Statistics, and the Graduate School of Library and Information Science. His research interests include information retrieval, text mining, natural language processing, machine learning, and bioinformatics, and he has published over 200 papers in these areas, with an h-index of 58 on Google Scholar. He is an Associate Editor of ACM Transactions on Information Systems, and Information Processing and Management, and the Americas Editor of Springer’s Information Retrieval Book Series. He is a conference program co-chair of ACM CIKM 2004, NAACL HLT 2007, ACM SIGIR 2009, ECIR 2014, ICTIR 2015, and WWW 2015, and conference general co-chair for ACM CIKM 2016. He is an ACM Distinguished Scientist and a recipient of multiple best paper awards, Rose Award for Teaching Excellence at UIUC, Alfred P. Sloan Research Fellowship, IBM Faculty Award, HP Innovation Research Program Award, and the Presidential Early Career Award for Scientists and Engineers (PECASE).

Text Processing Course: Natural Language Processing by Columbia University

Name: Natural Language Processing

Website: https://www.coursera.org/course/nlangp

Description

Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.

In this course you will study mathematical and computational models of language, and the application of these models to key problems in natural language processing. The course has a focus on machine learning methods, which are widely used in modern NLP systems: we will cover formalisms such as hidden Markov models, probabilistic context-free grammars, log-linear models, and statistical models for machine translation. The curriculum closely follows a course currently taught by Professor Collins at Columbia University, and previously taught at MIT.

About the Instructors

Michael Collins
Vikram S. Pandit Professor of Computer Science
Columbia University
Michael Collins is the Vikram S. Pandit Professor of Computer Science at Columbia University. Michael received Bachelors and MPhil degrees from Cambridge University, and a PhD from University of Pennsylvania. He was then a researcher at AT&T Labs (1999-2002), and an assistant/associate professor at MIT (2003-2010), before joining Columbia University in January 2011. His research areas are natural language processing and machine learning, with a focus on problems such as statistical parsing, structured prediction problems in machine learning, and applications including machine translation, dialog systems, and speech recognition. Michael is a fellow of the Association for Computational Linguistics, and has received various awards including a Sloan fellowship, an NSF Career award, as well as best paper awards at several conferences.

Text Processing Course: Stanford Natural Language Processing

Name: Natural Language Processing

Website: https://www.coursera.org/course/nlp

Description

This course covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering. We will also introduce the underlying theory from probability, statistics, and machine learning that is crucial for the field, and cover fundamental algorithms like n-gram language modeling, Naive Bayes and maxent classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituent parsing, and vector-space models of meaning.

We are offering this course on Natural Language Processing free and online to students worldwide, continuing Stanford’s exciting forays into large scale online instruction. Students have access to screencast lecture videos, are given quiz questions, assignments and exams, receive regular feedback on progress, and can participate in a discussion forum. Those who successfully complete the course will receive a statement of accomplishment. Taught by Professors Jurafsky and Manning, the curriculum draws from Stanford’s courses in Natural Language Processing. You will need a decent internet connection for accessing course materials, but should be able to watch the videos on your smartphone.
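As a small illustration of the n-gram language modeling mentioned in the description, the following sketch estimates bigram probabilities with add-one (Laplace) smoothing. It is not taken from the course materials; the toy corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions chosen for the example.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Collect context counts, bigram counts, and the predictable vocabulary."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])             # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # adjacent token pairs
        vocab.update(padded[1:])                 # every token that can be predicted
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab):
    """Estimate P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


# Toy corpus of pre-tokenized sentences (made up for this sketch).
corpus = [["i", "like", "nlp"], ["i", "like", "python"]]
contexts, bigrams, vocab = train_bigram_model(corpus)

print(bigram_prob("i", "like", contexts, bigrams, vocab))    # seen bigram
print(bigram_prob("i", "python", contexts, bigrams, vocab))  # unseen bigram, still nonzero
```

Add-one smoothing is the simplest choice and tends to give too much probability mass to unseen events on real data; it is used here only because it keeps the arithmetic easy to follow.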

About the Instructors

Dan Jurafsky
Professor
Stanford University

Dan Jurafsky is Professor of Linguistics and Professor by Courtesy of Computer Science at Stanford University. Dan received his Bachelor’s degree in Linguistics in 1983 and his Ph.D. in Computer Science in 1992, both from the University of California at Berkeley, and also taught at the University of Colorado, Boulder before joining the Stanford faculty in 2004. He is the recipient of a MacArthur Fellowship and has served on a variety of editorial boards, corporate advisory boards, and program committees. Dan’s research extends broadly throughout natural language processing as well as its application to the behavioral and social sciences.

Christopher Manning
Associate Professor
Stanford University

Christopher Manning is an Associate Professor of Computer Science and Linguistics at Stanford University. Chris received a Bachelor’s degree and University Medal from the Australian National University and a Ph.D. from Stanford in 1994, both in Linguistics. Chris taught at Carnegie Mellon University and The University of Sydney before joining the Stanford faculty in 1999. He is a Fellow of the American Association for Artificial Intelligence and of the Association for Computational Linguistics, and is one of the most cited authors in natural language processing, for his research on a broad range of statistical natural language topics from tagging and parsing to grammar induction and text understanding.

Text Processing Book: Text Processing in Python 1st Edition

Text Processing in Python

Description
Text Processing in Python describes techniques for manipulation of text using the Python programming language. At the broadest level, text processing is simply taking textual information and doing something with it. This might be restructuring or reformatting it, extracting smaller bits of information from it, or performing calculations that depend on the text. Text processing is arguably what most programmers spend most of their time doing. Because Python is clear, expressive, and object-oriented it is a perfect language for doing text processing, even better than Perl. As the amount of data everywhere continues to increase, this is more and more of a challenge for programmers. This book is not a tutorial on Python. It has two other goals: helping the programmer get the job done pragmatically and efficiently; and giving the reader an understanding – both theoretically and conceptually – of why what works works and what doesn’t work doesn’t work. Mertz provides practical pointers and tips that emphasize efficient, flexible, and maintainable approaches to the text processing tasks that working programmers face daily.

About the Author
David Mertz came to writing about programming via the unlikely route of first being a humanities professor. Along the way, he was a senior software developer, and now runs his own development company, Gnosis Software (“We know stuff!”). David writes regular columns and articles for IBM developerWorks, Intel Developer Network, O’Reilly ONLamp, and other publications.

Text Processing Book: Python Text Processing with NLTK 2.0 Cookbook

Python Text Processing with NLTK 2.0 Cookbook

Description
Use Python’s NLTK suite of libraries to maximize your Natural Language Processing capabilities.

Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond
Learn how machines and crawlers interpret and process natural languages
Easily work with huge amounts of data and learn how to handle distributed processing
Part of Packt’s Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

In Detail
Natural Language Processing is used everywhere – in search engines, spell checkers, mobile phones, computer games – even your washing machine. Python’s Natural Language Toolkit (NLTK) suite of libraries has rapidly emerged as one of the most efficient tools for Natural Language Processing. You want to employ nothing less than the best techniques in Natural Language Processing – and this book is your answer. Python Text Processing with NLTK 2.0 Cookbook is your handy and illustrative guide, which will walk you through all the Natural Language Processing techniques in a step-by-step manner. It will demystify the advanced features of text analysis and text mining using the comprehensive NLTK suite.

This book cuts short the preamble and you dive right into the science of text processing with a practical hands-on approach. Get started off with learning tokenization of text. Get an overview of WordNet and how to use it. Learn the basics as well as advanced features of Stemming and Lemmatization. Discover various ways to replace words with simpler and more common (read: more searched) variants. Create your own corpora and learn to create custom corpus readers for JSON files as well as for data stored in MongoDB. Use and manipulate POS taggers. Transform and normalize parsed chunks to produce a canonical form without changing their meaning. Dig into feature extraction and text classification. Learn how to easily handle huge amounts of data without any loss in efficiency or speed.

This book will teach you all that and beyond, in a hands-on learn-by-doing manner. Make yourself an expert in using the NLTK for Natural Language Processing with this handy companion.

What you will learn from this book
Learn Text categorization and Topic identification
Learn Stemming and Lemmatization and how to go beyond the usual spell checker
Replace negations with antonyms in your text
Learn to tokenize words into lists of sentences and words, and gain an insight into WordNet
Transform and manipulate chunks and trees
Learn advanced features of corpus readers and create your own custom corpora
Tag different parts of speech by creating, training, and using a part-of-speech tagger
Improve accuracy by combining multiple part-of-speech taggers
Learn how to do partial parsing to extract small chunks of text from a part-of-speech tagged sentence
Produce an alternative canonical form without changing the meaning by normalizing parsed chunks
Learn how search engines use Natural Language Processing to process text
Make your site more discoverable by learning how to automatically replace words with more searched equivalents
Parse dates, times, and HTML
Train and manipulate different types of classifiers

Approach
The learn-by-doing approach of this book will enable you to dive right into the heart of text processing from the very first page. Each recipe is carefully designed to fulfill your appetite for Natural Language Processing. Packed with numerous illustrative examples and code samples, it will make the task of using the NLTK for Natural Language Processing easy and straightforward.

Who this book is written for
This book is for Python programmers who want to quickly get to grips with using the NLTK for Natural Language Processing. Familiarity with basic text processing concepts is required. Programmers experienced in the NLTK will also find it useful. Students of linguistics will find it invaluable.

About the Author
Jacob Perkins has been an avid user of open source software since high school, when he first built his own computer and didn’t want to pay for Windows. At one point he had 5 operating systems installed, including RedHat Linux, OpenBSD, and BeOS. While at Washington University in St. Louis, Jacob took classes in Spanish, poetry writing, and worked on an independent study project that eventually became his Master’s Project: WUGLE – a GUI for manipulating logical expressions. In his free time, he wrote the Gnome2 version of Seahorse (a GUI for encryption and key management), which has since been translated into over a dozen languages and is included in the default Gnome distribution. After getting his MS in Computer Science, Jacob tried to start a web development studio with some friends, but since no-one knew anything about web development, it didn’t work out as planned. Once he’d actually learned web development, he went off and co-founded another company called Weotta, which sparked his interest in Machine Learning and Natural Language Processing. Jacob is currently the CTO / Chief Hacker for Weotta and blogs about what he’s learned along the way at http://streamhacker.com/. He is also applying this knowledge to produce text processing APIs and demos at http://text-processing.com/. This book is a synthesis of his knowledge on processing text using Python, NLTK, and more.

Text Processing Book: Natural Language Processing with Python 1st Edition

Natural Language Processing with Python

Description
Analyzing Text with the Natural Language Toolkit

This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation. With it, you’ll learn how to write Python programs that work with large collections of unstructured text. You’ll access richly annotated datasets using a comprehensive range of linguistic data structures, and you’ll understand the main algorithms for analyzing the content and structure of written communication.

Packed with examples and exercises, Natural Language Processing with Python will help you:

Extract information from unstructured text, either to guess the topic or identify “named entities”
Analyze linguistic structure in text, including parsing and semantic analysis
Access popular linguistic databases, including WordNet and treebanks
Integrate techniques drawn from fields as diverse as linguistics and artificial intelligence

This book will help you gain practical skills in natural language processing using the Python programming language and the Natural Language Toolkit (NLTK) open source library. If you’re interested in developing web applications, analyzing multilingual news sources, or documenting endangered languages — or if you’re simply curious to have a programmer’s perspective on how human language works — you’ll find Natural Language Processing with Python both fascinating and immensely useful.

About the Author
Steven Bird is Associate Professor in the Department of Computer Science and Software Engineering at the University of Melbourne, and Senior Research Associate in the Linguistic Data Consortium at the University of Pennsylvania. He completed a PhD on computational phonology at the University of Edinburgh in 1990, supervised by Ewan Klein. He later moved to Cameroon to conduct linguistic fieldwork on the Grassfields Bantu languages under the auspices of the Summer Institute of Linguistics. More recently, he spent several years as Associate Director of the Linguistic Data Consortium where he led an R&D team to create models and tools for large databases of annotated text. At Melbourne University, he established a language technology research group and has taught at all levels of the undergraduate computer science curriculum. In 2009, Steven is President of the Association for Computational Linguistics.

Ewan Klein is Professor of Language Technology in the School of Informatics at the University of Edinburgh. He completed a PhD on formal semantics at the University of Cambridge in 1978. After some years working at the Universities of Sussex and Newcastle upon Tyne, Ewan took up a teaching position at Edinburgh. He was involved in the establishment of Edinburgh’s Language Technology Group in 1993, and has been closely associated with it ever since. From 2000-2002, he took leave from the University to act as Research Manager for the Edinburgh-based Natural Language Research Group of Edify Corporation, Santa Clara, and was responsible for spoken dialogue processing. Ewan is a past President of the European Chapter of the Association for Computational Linguistics and was a founding member and Coordinator of the European Network of Excellence in Human Language Technologies (ELSNET).

Edward Loper has recently completed a PhD on machine learning for natural language processing at the University of Pennsylvania. Edward was a student in Steven’s graduate course on computational linguistics in the fall of 2000, and went on to be a TA and share in the development of NLTK. In addition to NLTK, he has helped develop two packages for documenting and testing Python software, epydoc, and doctest.

Open Source Text Processing Project: TextBlob

TextBlob: Simplified Text Processing

Project Website: http://textblob.readthedocs.org/en/dev/

Github Link: https://github.com/sloria/textblob

Description

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

Features
Noun phrase extraction
Part-of-speech tagging
Sentiment analysis
Classification (Naive Bayes, Decision Tree)
Language translation and detection powered by Google Translate
Tokenization (splitting text into words and sentences)
Word and phrase frequencies
Parsing
n-grams
Word inflection (pluralization and singularization) and lemmatization
Spelling correction
Add new models or languages through extensions
WordNet integration
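Two of the features listed above, word frequencies and n-grams, are simple enough to sketch with the Python standard library alone. The snippet below does not use TextBlob and does not reproduce its API or tokenizer; the regular-expression tokenizer, the sample text, and the function names are assumptions made for illustration. It is meant only to show what these two operations compute.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The cat sat on the mat. The cat slept."
tokens = tokenize(text)
word_counts = Counter(tokens)  # word -> number of occurrences

print(word_counts["the"])      # how often "the" appears, ignoring case
print(ngrams(tokens, 2)[:3])   # the first three bigrams
```

A library tokenizer typically handles punctuation, contractions, and sentence boundaries more carefully than this single regular expression, so counts on real text may differ.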

Reference
Getting Started with TextBlob