Getting started with Translation Memory


Translation memory is a core feature of CAT (computer-assisted translation) tools. Here is a list of open source translation memory tools and related resources:

1. OmegaT: the free translation memory tool

OmegaT is a free translation memory application written in Java. It is a tool intended for professional translators. It does not translate for you! (Software that does this is called “machine translation”, and you will have to look elsewhere for it.)

2. amaGama: a web translation memory server

amaGama is a web service written in Python implementing a large-scale translation memory on top of PostgreSQL. A translation memory is a database of previous translations which can be searched to find good matches to new strings.
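To make the idea of "finding good matches to new strings" concrete, here is a minimal sketch of a fuzzy lookup. It is not how amaGama works internally: the in-memory dictionary, the `find_matches` helper and the 0.7 threshold are all assumptions made for illustration, and the similarity score comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory translation memory: source string -> stored translation.
MEMORY = {
    "Open the file": "Ouvrir le fichier",
    "Save the file": "Enregistrer le fichier",
    "Close the window": "Fermer la fenêtre",
}


def find_matches(query, memory, threshold=0.7):
    """Return (score, source, target) tuples scoring at least `threshold`, best first."""
    results = []
    for source, target in memory.items():
        # ratio() is a similarity score between 0.0 and 1.0.
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if score >= threshold:
            results.append((score, source, target))
    return sorted(results, reverse=True)


matches = find_matches("Open the files", MEMORY)
for score, source, target in matches:
    print(f"{score:.2f}  {source!r} -> {target!r}")
```

A real server would not compare the query against every stored string; it would use a database index to narrow the candidates first. The linear scan here is only meant to show what a "fuzzy match" is.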

There are currently no releases of amaGama, but the source code is available in the https://github.com/translate/amagama repository.

A public deployment of amaGama is available at http://amagama.locamotion.org. Check the documentation to learn how to use it.

Document: amaGama documentation http://docs.translatehouse.org/projects/amagama/en/latest/
Demo: https://amagama-live.translatehouse.org/

3. Translate House: http://translatehouse.org/

Matching years of localization experience with the technology built by localizers for localizers, Translate allows you to deliver incredible community localization.

4. translation-memory-tools: https://github.com/Softcatala/translation-memory-tools

A set of tools to build, maintain and use translation memories

This is the toolset used at Softcatalà to build the translation memories for all the projects that we know exist in the Catalan language and have their translations openly available.

The toolset contains the following components with their own responsibility:

Builder (fetch and build memories)

Download and unpack the files from source repositories
Convert from the different translation formats (ts, strings, etc.) to PO
Create a translation memory for each project in PO and TMX formats
Produce a single translation memory file that contains all the projects

Web

Provides a web application and an API that allow users to download memories and search translations
Provides an index-creator that creates a Whoosh index with all the strings, which the user can then search using the web app
Provides a download-creation that creates a zip file with all the memories that the user can download

Terminology (terminology extraction)

Analyzes the PO files and creates a report with the most common terminology across the projects

Quality (feedback on how to improve translations)

Runs Pology and LanguageTool and generates HTML reports on translation quality
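As a rough illustration of the terminology step described above, the sketch below counts the most frequent words across a list of source strings. It is not the Softcatalà implementation: the `common_terms` helper, the tiny stop-word list and the sample strings are hypothetical, and the real tool reads PO files rather than a Python list.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real report would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "is", "this"}


def common_terms(source_strings, top_n=2):
    """Return the `top_n` most frequent non-stop-words as (word, count) pairs."""
    counts = Counter()
    for text in source_strings:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(top_n)


# Sample source strings standing in for the msgid entries of several PO files.
strings = ["Open the file", "Save the file", "Close the file browser", "Open a folder"]
terms = common_terms(strings)
print(terms)
```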

5. TMOP: https://github.com/hlt-mt/TMOP

Translation Memory Open-source Purifier

TMop is open-source software written in Python, designed for cleaning and maintaining a Translation Memory (i.e. a collection of (source, target) segments, called Translation Units, used to aid human translators operating in a Computer-assisted Translation framework).

The goal of TMop is to identify and remove from the TM all the “bad” TUs, in which either of the two textual elements is:

i) syntactically poor,

ii) semantically different from the other,

iii) awkward according to some formatting criteria.
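The kind of check involved can be sketched with a few simple heuristics. This is not TMop's actual filter set or API: the `is_suspicious` function, its reason strings and the `max_ratio` default are assumptions for illustration, and a length ratio is only a crude stand-in for a real semantic comparison.

```python
def is_suspicious(source, target, max_ratio=3.0):
    """Return a reason string if the translation unit looks bad, else None."""
    source, target = source.strip(), target.strip()
    if not source or not target:
        return "empty segment"
    if source == target:
        return "target identical to source"
    # Very different lengths suggest the two sides do not say the same thing.
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    if longer / shorter > max_ratio:
        return "length ratio too high"
    # Formatting check: both sides should carry the same number of %s placeholders.
    if source.count("%s") != target.count("%s"):
        return "placeholder mismatch"
    return None


units = [
    ("Open the file", "Ouvrir le fichier"),
    ("Hello", ""),
    ("OK", "OK"),
    ("Copy %s", "Copier"),
]
# Keep only the units for which no heuristic raised a flag.
clean = [(s, t) for s, t in units if is_suspicious(s, t) is None]
print(clean)
```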

6. ModernMT: https://www.modernmt.eu/

Simple. Adaptive. Neural.
State-of-the-art neural machine translation as a service that learns from your translation memories and corrections

https://github.com/ModernMT/MMT

7. Pootle

Pootle is an online tool that makes the process of translating so much simpler. It allows crowd-sourced translations, easy volunteer contribution and gives statistics about the ongoing work.

Pootle is built using the powerful API of the Translate Toolkit and the Django framework. If you want to know more about these, you can dive into their own documentation.

8. BasicCAT: https://github.com/xulihang/BasicCAT

An open source computer-aided translation tool written in Basic

9. Heartsome Translation Studio 8.0

https://github.com/heartsome/translationstudio8

Heartsome Translation Studio 8.0 is the latest version of Heartsome’s CAT software series. This version features many revolutionary improvements compared to previous versions, especially with regard to ease of use and file format support.

Heartsome Translation Studio 8.0 has incorporated feedback based on practical experience from project managers, translators and proofreaders in the localization industry, which has resulted in a wealth of improvements and innovations, including:

New User Interface
New Project Management Functions
More File Formats Support, Enhanced translation engine
Innovative machine translation (MT) pre-saving feature
New RTF proofreading support
More Database Support

10. Tools for TMX

http://wiki.apertium.org/wiki/Tools_for_TMX

As it is now quite easy to make lots of large TMXes with Bitextor and other tools, it would be good to have some tools for processing them. There are various things that it would be useful to do. For example, given a TMX file:

strip out translation units for any given two languages (tmx-extract). Bitextor and other tools generate TMX files with many possible combinations of languages, for a file of “no is en da sv”, just give me all TUs which are “en-da”.
strip out duplicate translation units (tmx-uniq).
sort the file by: line length, language, etc. (tmx-sort)
trim the file of dubious TUs (tmx-trim) — very short translations of long segments, very different punctuation, translations where the translation is exactly the same as the reference, translations which only consist of numbers, would be nice to have an option to give an MT of the target language to try and do better edit-distance, etc.
re-perform language identification (tmx-rident) of all segments given a number of options (e.g. you know the file is in either Swedish or Danish, but some entries come up as Icelandic).
re-format a TMX so that it fits the standard (tmx-clean), e.g. turns ‘&’ into &amp; etc. and optionally removes formatting.
merge TMX files (tmx-merge), merge TMX files and uniq them on the way.
split a TMX file with many different languages (tmx-split) into tmx files with each of the different language pairs, optionally while re-identifying the language of each segment before placing it in a separate file.
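Two of the ideas in this wish list, extracting one language pair and dropping duplicates, can be sketched with Python's standard library. The function names `extract_pairs` and `unique_pairs` are hypothetical (they are not the Apertium tools), and the sample TMX is deliberately minimal: a real file has a fuller header and may not fit in memory.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Minimal sample document, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello</seg></tuv>
      <tuv xml:lang="da"><seg>Hej</seg></tuv>
      <tuv xml:lang="sv"><seg>Hej</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Hello</seg></tuv>
      <tuv xml:lang="da"><seg>Hej</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thanks</seg></tuv>
      <tuv xml:lang="sv"><seg>Tack</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, lang_a, lang_b):
    """Return (lang_a segment, lang_b segment) for every TU that has both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg", default="") for tuv in tu.iter("tuv")}
        if lang_a in segs and lang_b in segs:
            pairs.append((segs[lang_a], segs[lang_b]))
    return pairs


def unique_pairs(pairs):
    """Drop exact duplicate pairs while keeping the original order."""
    return list(dict.fromkeys(pairs))


en_da = unique_pairs(extract_pairs(SAMPLE_TMX, "en", "da"))
print(en_da)
```

For the large files that Bitextor can produce, a streaming parser such as `ET.iterparse` would be a more sensible choice than loading the whole document at once.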

11. Elasticsearch at Transifex

Transifex's Translation Memory 3.0 uses Elasticsearch at its core.

The Transifex blog post on the subject highlights particular areas of interest in what the team learnt from using Elasticsearch in production.

12. paracrawl: https://github.com/paracrawl

13. Bitextor: https://github.com/bitextor/bitextor

Bitextor generates translation memories from multilingual websites

