Open Source Text Processing Project: Wapiti

Wapiti – A simple and fast discriminative sequence labelling toolkit

Project Website: https://wapiti.limsi.fr/
Github Link: https://github.com/Jekub/Wapiti

Description

Wapiti is a very fast toolkit for segmenting and labeling sequences with discriminative models. It is based on maxent models, maximum entropy Markov models and linear-chain CRFs, and offers various optimization and regularization methods to improve both the computational complexity and the prediction performance of standard models. Wapiti has been ranked first on the sequence tagging task on the MLcomp web site for more than a year.

Features

Handle large label and feature sets
Wapiti has been used to train models with more than one thousand labels and models with several billion features. Training time still increases with the size of these sets, but provided you have enough computing power and memory, Wapiti will handle them without problems.

L-BFGS, OWL-QN, SGD-L1, BCD, and RPROP training algorithms
Wapiti implements all the standard training algorithms. All of them are highly optimized and can be combined to improve both computational and generalization performance.
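
A minimal sketch of picking a training algorithm, assuming the wapiti binary is on your PATH and invoking it from Python via subprocess; the file names are hypothetical, and the -a/--algo flag and its values follow the Wapiti manual:

import subprocess

# Train a model with the L-BFGS optimizer; other documented values for
# -a include sgd-l1, bcd and rprop.
subprocess.run(
    ["wapiti", "train", "-a", "l-bfgs", "train.txt", "model.bin"],
    check=True,  # raise if wapiti exits with an error
)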

L1, L2, or Elastic-net regularization
Wapiti provides different regularization methods which help reduce overfitting and enable efficient feature selection.

Powerful feature extraction system
Wapiti uses an extended version of the CRF++ patterns for extracting features, which reduces both the amount of pre-processing required and the size of data files.
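
As a sketch of the pattern syntax, the snippet below writes a tiny CRF++-style pattern file from Python and passes it to training with -p/--pattern; %x[r,c] picks column c of the token r rows away from the current one, and the file names are hypothetical:

import subprocess

# Each 'u' line makes unigram (observation) features; a bare 'b' line
# adds label-bigram features.
patterns = "u00:%x[-1,0]\nu01:%x[0,0]\nu02:%x[1,0]\nb\n"
with open("patterns.txt", "w") as f:
    f.write(patterns)

subprocess.run(
    ["wapiti", "train", "-p", "patterns.txt", "train.txt", "model.bin"],
    check=True,
)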

Multi-threaded and vectorized implementation
To further improve performance, all optimization algorithms can take advantage of SSE instructions, if available. The quasi-Newton and RPROP optimization algorithms are parallelized and scale very well on multiprocessor machines.

N-best Viterbi output
Viterbi decoding can output the classical best label sequence as well as the n-best ones. Decoding can be done with the classical Viterbi algorithm for CRFs or through posteriors, which are slower but generally lead to better results and give normalized scores.
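
A minimal decoding sketch, again via subprocess; the -n/--nbest, -s/--score and --post flags follow the Wapiti manual (treat them as assumptions) and the file names are hypothetical:

import subprocess

# Output the 5 best label sequences with their scores; adding "--post"
# would switch from Viterbi to the slower posterior decoding with
# normalized scores.
subprocess.run(
    ["wapiti", "label", "-m", "model.bin", "-n", "5", "-s",
     "test.txt", "out.txt"],
    check=True,
)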

Compact model creation
When used with L1 or elastic-net penalties, Wapiti is able to remove unused features and create compact models which load faster and use less memory, speeding up labeling.

Sparse forward-backward
A specific sparse forward-backward procedure is used during training to take advantage of the sparsity of the model and speed up computation.

Written in standard C99+POSIX
Wapiti's source code is written almost entirely in standard C99 and should work on any computer. However, the multi-threading code is written using POSIX threads and the SSE code is written for the x86 platform. Both are optional and can be disabled or rewritten for other platforms.

Open source (BSD Licence)

Open Source Text Processing Project: segtok

segtok: sentence segmentation and word tokenization tools

Project Website: http://fnl.es/segtok-a-segmentation-and-tokenization-library.html
Github Link: https://github.com/fnl/segtok

Description

A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.

The segtok package provides two modules, segtok.segmenter and segtok.tokenizer. The segmenter provides functionality for splitting (Indo-European) text into sentences. The tokenizer provides functionality for splitting (Indo-European) sentences into words and symbols (collectively called tokens). Both modules can also be used from the command line. While other Indo-European languages may work, the package was designed with languages such as Spanish, English, and German in mind.
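
A minimal sketch of both modules from Python, assuming the segtok 1.x API (split_single yields sentences, word_tokenizer splits one sentence into tokens):

from segtok.segmenter import split_single
from segtok.tokenizer import word_tokenizer

text = "Mr. Smith arrived. He was not late."
for sentence in split_single(text):
    # each sentence is a plain string; tokenize it separately
    print(word_tokenizer(sentence))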

To install this package, you should have the latest official version of Python 2 or 3 installed. The package has been reported to work with Python 2.7, 3.3, and 3.4 and is tested against the latest Python 2 and 3 branches. The easiest way to get it installed is using pip or any other package manager that works with PyPI:

pip install segtok

Important: If you are on a Linux machine and have problems installing the regex dependency of segtok, make sure you have the python-dev and/or python3-dev packages installed to get the necessary headers to compile the package.

Then try the command line tools on some plain-text files (e.g., this README) to see if segtok meets your needs:

segmenter README.rst | tokenizer

Open Source Text Processing Project: nlp-with-ruby

nlp-with-ruby: Awesome NLP with Ruby

Project Website: None

Github Link: https://github.com/arbox/nlp-with-ruby

Description

This curated list comprises awesome resources, libraries, and information sources about the computational processing of texts in human languages with Ruby. This field is often referred to as NLP, Computational Linguistics, or HLT (Human Language Technology), and it intersects with Artificial Intelligence, Machine Learning, Information Retrieval, and other related disciplines.

Open Source Text Processing Project: textacy

textacy: higher-level NLP built on spaCy

Project Website: https://textacy.readthedocs.io

Github Link: https://github.com/chartbeat-labs/textacy

Description

textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance spaCy library. With the basics — tokenization, part-of-speech tagging, dependency parsing, etc. — offloaded to another library, textacy focuses on tasks facilitated by the ready availability of tokenized, POS-tagged, and parsed text.

Features
Stream text, JSON, CSV, and spaCy binary data to and from disk
Clean and normalize raw text, before analyzing it
Explore included corpora of Congressional speeches and Supreme Court decisions, or stream documents from standard Wikipedia pages and Reddit comments datasets
Access and filter basic linguistic elements, such as words and ngrams, noun chunks and sentences
Extract named entities, acronyms and their definitions, direct quotations, key terms, and more from documents
Compare strings, sets, and documents by a variety of similarity metrics
Transform documents and corpora into vectorized and semantic network representations
Train, interpret, visualize, and save sklearn-style topic models using LSA, LDA, or NMF methods
Identify a text’s language, display key words in context (KWIC), true-case words, and navigate a parse tree
… and more!
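
A minimal sketch of that workflow, with the caveat that API names vary across textacy releases; this assumes a recent version and an installed spaCy model (en_core_web_sm is an assumption):

import textacy
from textacy import extract

text = "Mr. President, I rise today to speak about trade policy."
# wrap spaCy's pipeline: tokenization, tagging and parsing happen here
doc = textacy.make_spacy_doc(text, lang="en_core_web_sm")

print(list(extract.ngrams(doc, 2, filter_stops=True)))  # filtered bigrams
print(list(extract.entities(doc)))                      # named entities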

Open Source Text Processing Project: vivekn sentiment

Sentiment analysis using machine learning techniques

Project Website: http://sentiment.vivekn.com/

Github Link: https://github.com/vivekn/sentiment

Description

Sentiment analysis using machine learning techniques.

Check info.py for the training and testing code. A demo of the tool is available at the project website linked above.

Refer to this paper for more information about the algorithms used:

http://arxiv.org/abs/1305.6143

This tool works by examining individual words and short sequences of words (n-grams) and comparing them with a probability model. The probability model is built on a pre-labeled set of IMDb movie reviews. It can also detect negations in phrases, e.g., the phrase “not bad” will be classified as positive despite containing two individually negative words.
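
The snippet below is an illustrative sketch of that general approach, not the repository's code: a Naive Bayes scorer over unigrams and bigrams in which tokens following a negation word get a not_ prefix, so “not bad” is scored as its own feature rather than as “bad”; the counts are toy numbers standing in for the IMDb-trained model:

import math
from collections import Counter

NEGATIONS = {"not", "no", "never", "n't"}

def features(tokens):
    marked, negate = [], False
    for tok in tokens:
        marked.append("not_" + tok if negate else tok)
        negate = tok in NEGATIONS
    # unigrams plus bigrams over the negation-marked tokens
    return marked + [" ".join(p) for p in zip(marked, marked[1:])]

pos = Counter({"good": 10, "not_bad": 6, "bad": 1})  # toy counts
neg = Counter({"bad": 9, "not_good": 5, "good": 2})
vocab = set(pos) | set(neg)

def log_ratio(feats):
    # log P(f|positive) - log P(f|negative) with Laplace smoothing
    vp, vn = sum(pos.values()), sum(neg.values())
    return sum(
        math.log((pos[f] + 1) / (vp + len(vocab)))
        - math.log((neg[f] + 1) / (vn + len(vocab)))
        for f in feats
    )

print(log_ratio(features(["not", "bad"])))  # > 0, classified positive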

Open Source Deep Learning Project: Paddle

Paddle: PArallel Distributed Deep LEarning

Project Website: http://www.paddlepaddle.org/

Github Link: https://github.com/baidu/Paddle

Description

PaddlePaddle (PArallel Distributed Deep LEarning) is an easy-to-use, efficient, flexible and scalable deep learning platform, originally developed by Baidu scientists and engineers to apply deep learning to many products at Baidu.

Features

Flexibility

PaddlePaddle supports a wide range of neural network architectures and optimization algorithms. It is easy to configure complex models such as a neural machine translation model with an attention mechanism or complex memory connections.

Efficiency

In order to unleash the power of heterogeneous computing resources, optimization occurs at different levels of PaddlePaddle, including computing, memory, architecture and communication. The following are some examples:

Optimized math operations through SSE/AVX intrinsics, BLAS libraries (e.g., MKL, ATLAS, cuBLAS) or customized CPU/GPU kernels.
Highly optimized recurrent networks which can handle variable-length sequences without padding.
Optimized local and distributed training for models with high-dimensional sparse data.

Scalability

With PaddlePaddle, it is easy to use many CPUs/GPUs and machines to speed up your training. PaddlePaddle can achieve high throughput and performance via optimized communication.

Connected to Products

In addition, PaddlePaddle is also designed to be easily deployable. At Baidu, PaddlePaddle has been deployed in products and services with a vast number of users, including ad click-through rate (CTR) prediction, large-scale image classification, optical character recognition (OCR), search ranking, computer virus detection, and recommendation. It is widely used in products at Baidu, where it has had a significant impact, and we hope you can also exploit its capability to make a huge impact with your own products.

Open Source Text Processing Project: Stanford Temporal Tagger

Stanford Temporal Tagger

Project Website: http://nlp.stanford.edu/software/sutime.html

Github Link: None

Description

SUTime is a library for recognizing and normalizing time expressions. That is, it will convert “next Wednesday at 3pm” to something like 2016-02-17T15:00 (depending on the assumed current reference time). SUTime is available as part of the Stanford CoreNLP pipeline and can be used to annotate documents with temporal information. It is a deterministic rule-based system designed for extensibility. The currently available rules support only English.
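
Since SUTime is a Java library, one common way to try it from other languages is through the Stanford CoreNLP HTTP server, where SUTime runs inside the ner annotator. The sketch below assumes a server already started locally on port 9000 with default settings, and that temporal tokens carry a timex field in the JSON output:

import json
import requests

props = {"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}
resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data="Let's meet next Wednesday at 3pm.".encode("utf-8"),
)
for sentence in resp.json()["sentences"]:
    for token in sentence["tokens"]:
        if "timex" in token:  # only temporal tokens carry this field
            print(token["word"], token["timex"])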

SUTime was developed using TokensRegex, a generic framework for defining patterns over text and mapping them to semantic objects. An included set of PowerPoint slides and the Javadoc for SUTime provide an overview of the package.

SUTime was written by Angel Chang. These programs also rely on classes developed by others as part of the Stanford JavaNLP project.

There is a paper describing SUTime. You’re encouraged to cite it if you use SUTime.

Angel X. Chang and Christopher D. Manning. 2012. SUTIME: A Library for Recognizing and Normalizing Time Expressions. 8th International Conference on Language Resources and Evaluation (LREC 2012).

Open Source Deep Learning Project: dlib

dlib: A toolkit for making real world machine learning and data analysis applications in C++

Project Website: http://dlib.net

Github Link: https://github.com/davisking/dlib

Description

Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real world problems. It is used in both industry and academia in a wide range of domains including robotics, embedded devices, mobile phones, and large high performance computing environments. Dlib’s open source licensing allows you to use it in any application, free of charge.

To follow or participate in the development of dlib, subscribe to dlib on GitHub. Also be sure to read the how to contribute page if you intend to submit code to the project.

Open Source Deep Learning Project: torchnet

torchnet: Torch on steroids

Project Website: None

Github Link: https://github.com/torchnet/torchnet

Description

torchnet is a framework for Torch which provides a set of abstractions aimed at encouraging code re-use and modular programming.

At the moment, torchnet provides four sets of important classes:

Dataset: handling and pre-processing data in various ways.
Engine: training/testing a machine learning algorithm.
Meter: measuring performance or any other quantity.
Log: outputting performance or any other string to file/disk in a consistent manner.

Open Source Deep Learning Project: OpenNN

OpenNN – Open Neural Networks Library

Project Website: http://www.opennn.net/

Github Link: https://github.com/Artelnics/OpenNN

Description

OpenNN is an open source class library written in the C++ programming language which implements neural networks, a main area of deep learning research. It is intended for advanced users with strong C++ and machine learning skills.

The library implements any number of layers of non-linear processing units for supervised learning. This deep architecture allows the design of neural networks with universal approximation properties.

The main advantage of OpenNN is its high performance. The library stands out in terms of execution speed and memory allocation, and it is constantly optimized and parallelized in order to maximize its efficiency.

OpenNN is a software library written in C++ for predictive analytics. It implements neural networks, the most successful deep learning method.

Some typical applications of OpenNN are function regression (modelling), pattern recognition (classification) and time series prediction (forecasting).

The documentation is composed of tutorials and examples that offer a complete overview of the library. The documentation can be found at the official OpenNN site.

The CMakeLists.txt files are build files for CMake; they are also used by the CLion IDE.

The .pro files are project files for the Qt Creator IDE, which can be downloaded from its site. Note that OpenNN does not make use of the Qt library.

OpenNN is developed by Artelnics, a company specialized in artificial intelligence.