Open Source Text Processing Project: Thot

Thot: a Toolkit for Statistical Machine Translation

Project Website:

Github Link:

Description

Thot is an open source software toolkit for statistical machine translation (SMT). Originally, Thot incorporated tools to train phrase-based models. The current version also includes a state-of-the-art phrase-based translation decoder, as well as tools to estimate all of the models involved in the translation process. In addition, Thot can use online learning to update its models incrementally, in real time, each time it is presented with an individual sentence pair.

Thot is being developed by Daniel Ortiz-Martínez. Daniel is a researcher in natural language processing at Webinterpret. Formerly, he was a member of the PRHLT research group as well as an assistant professor at the Technical University of Valencia.

News

A new version of the toolkit has been released with several improvements and new features:

Incorporation of a whole set of pre/post-processing tools
Increased portability (Thot has been successfully compiled on many different platforms, including Mac OS X, FreeBSD, OpenBSD, NetBSD, etc.)
Improved checking of runtime errors in all of the tools involved in the translation pipeline
Early detection of bugs and portability problems using built-in checks
Improvements in tools to carry out translation experiments and incorporation of new ones
Translation can now be executed in parallel using clusters or multi-processor systems by means of the thot_decoder tool
The Thot manual has been extended and revised

Features

The toolkit includes the following features:

Phrase-based statistical machine translation decoder.
Computer-aided translation (post-editing and interactive machine translation).
Incremental estimation of all of the models involved in the translation process.
Robust generation of alignments at phrase-level.
Client-server implementation of the translation functionality.
Single word alignment model estimation using the incremental EM algorithm.
Scalable and parallel model estimation algorithms using Map-Reduce.
Compiles on Unix-like and Windows (using Cygwin) systems.
Integration with the CasMaCat Workbench developed in the EU FP7 CasMaCat project.
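
To make the idea of incremental estimation in the feature list above more concrete, the following is a minimal sketch of an online, EM-style update for a toy lexical translation model in the spirit of IBM Model 1. It is not Thot's implementation or API: the class name IncrementalModel1, its methods, the smoothing scheme, and the vocab_size parameter are all hypothetical and chosen only for illustration. Each call to update() runs an E-step on one sentence pair using the current parameters and then folds the resulting expected counts into the accumulated sufficient statistics, so the model changes after every pair without retraining from scratch.

```python
from collections import defaultdict


class IncrementalModel1:
    """Toy lexical model p(tgt_word | src_word), updated one sentence pair at a time.

    Hypothetical illustration only; this is not Thot's code or API.
    """

    def __init__(self, smoothing=1e-3, vocab_size=1000):
        self.smoothing = smoothing
        self.vocab_size = vocab_size  # assumed target vocabulary size, used for smoothing
        self.counts = {}              # accumulated expected counts c(src, tgt)
        self.totals = {}              # accumulated expected counts c(src)

    def prob(self, src_word, tgt_word):
        # Additive smoothing: unseen pairs fall back to a uniform 1 / vocab_size.
        num = self.counts.get((src_word, tgt_word), 0.0) + self.smoothing
        den = self.totals.get(src_word, 0.0) + self.smoothing * self.vocab_size
        return num / den

    def update(self, src_words, tgt_words):
        src_words = ["NULL"] + list(src_words)  # NULL lets target words stay unaligned
        delta = defaultdict(float)

        # E-step: posterior alignment probabilities under the current parameters.
        for t in tgt_words:
            norm = sum(self.prob(s, t) for s in src_words)
            for s in src_words:
                delta[(s, t)] += self.prob(s, t) / norm

        # M-step: fold the new expected counts into the sufficient statistics.
        for (s, t), c in delta.items():
            self.counts[(s, t)] = self.counts.get((s, t), 0.0) + c
            self.totals[s] = self.totals.get(s, 0.0) + c


# Present sentence pairs one at a time, as an online learner would see them.
model = IncrementalModel1()
stream = [
    ("la casa", "the house"),
    ("la mesa", "the table"),
    ("una casa", "a house"),
]
for src, tgt in stream:
    model.update(src.split(), tgt.split())

print(model.prob("casa", "house"), model.prob("casa", "the"))
```

After these three updates the model already prefers "house" over "the" as a translation of "casa", because "house" co-occurred with "casa" in two of the pairs. A fuller incremental EM scheme would also allow a sentence pair to be revisited, replacing its old contribution to the counts instead of only adding new ones.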

Distribution Details

Thot is written in C, C++, Python and shell script. It is known to compile on Unix-like and Windows (using Cygwin) systems. See the “Documentation and Support” section of these instructions if you experience problems during compilation.

It is released under the GNU Lesser General Public License (LGPL).
