Getting started with Sentence Alignment


Sentence alignment is a long-studied but still active problem, and it matters for machine translation because translation systems are trained on sentence-aligned parallel corpora.

Influential early methods are based on sentence length, measured in words (Brown et al.) or in characters (Gale and Church):

1) Peter F. Brown, Jennifer C. Lai and Robert L. Mercer (1991): Aligning Sentences in Parallel Corpora, Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL)
2) William A. Gale and Kenneth Ward Church (1991): A Program for Aligning Sentences in Bilingual Corpora, Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL)
3) William A. Gale and Kenneth Ward Church (1993): A program for aligning sentences in bilingual corpora, Computational Linguistics
4) Kenneth Ward Church (1993): Char_align: A Program for Aligning Parallel Texts at the Character Level, Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL)
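
To make the length-based idea concrete, here is a minimal sketch of a character-length aligner in the spirit of Gale and Church. It is a simplified variant, not the published program: the prior probabilities, the `MEAN_RATIO` and `VARIANCE` constants, and the function names `length_cost` and `align` are all assumptions chosen for illustration. It scores each candidate group of sentences by how plausible the two lengths are as translations of each other, then finds the cheapest sequence of groups with dynamic programming.

```python
import math

# Assumed prior probability of each alignment type (source count, target count).
# Illustrative values only; tune them for a real corpus.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
MEAN_RATIO = 1.0  # assumed expected target characters per source character
VARIANCE = 6.8    # assumed variance of the length difference per character


def length_cost(src_len, tgt_len):
    """Negative log-probability that spans of these lengths are translations."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    mean = (src_len + tgt_len / MEAN_RATIO) / 2.0
    delta = (tgt_len - src_len * MEAN_RATIO) / math.sqrt(mean * VARIANCE)
    # Two-sided tail probability of a standard normal variable.
    prob = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(prob, 1e-300))


def align(src, tgt):
    """Return a list of (source_indices, target_indices) groups."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            # Try extending the best path ending at (i, j) with each group type.
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = (cost[i][j] - math.log(prior)
                     + length_cost(sum(src_len[i:ni]), sum(tgt_len[j:nj])))
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the groups.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    groups.reverse()
    return groups
```

Calling `align(source_sentences, target_sentences)` on two lists of strings returns index groups such as `([0, 1], [0])`, meaning the first two source sentences were merged into one target sentence. Because only lengths are used, this kind of aligner is fast and language-independent, but it has no way to recover when lengths are misleading, which is what the lexical methods below address.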

Classic and useful methods that also draw on lexical information:

5) Robert C. Moore (2002): Fast and Accurate Sentence Alignment of Bilingual Corpora, Machine Translation: From Research to Real Users, 5th Conference of the Association for Machine Translation in the Americas, AMTA 2002 Tiburon, CA, USA, October 6-12, 2002, Proceedings
Related Perl Code: https://www.microsoft.com/en-us/download/details.aspx?id=52608

6) Singh, Anil Kumar and Husain, Samar (2005): Comparison, Selection and Use of Sentence Alignment Algorithms for New Language Pairs, Proceedings of the ACL Workshop on Building and Using Parallel Texts

7) Bleualign: https://github.com/rsennrich/Bleualign

The algorithm is described in:

Rico Sennrich, Martin Volk (2010): MT-based Sentence Alignment for OCR-generated Parallel Texts. In: Proceedings of AMTA 2010, Denver, Colorado.

Rico Sennrich; Martin Volk (2011): Iterative, MT-based sentence alignment of parallel texts. In: NODALIDA 2011, Nordic Conference of Computational Linguistics, Riga.

8) The hunalign sentence aligner

D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy (2005).
Parallel corpora for medium density languages
In Proceedings of the RANLP 2005, pages 590-596.

9) yasa — Yet Another Sentence Aligner

yasa is a program that aligns two translations of a text sentence by sentence in order to produce a bi-text.

github: https://github.com/rali-udem/yasa

10) mALIGNa: Bilingual sentence aligner

https://github.com/loomchild/maligna

11) SMT-LowRec

https://github.com/nguyenlab/SMT-LowRec
This repository accompanies the following paper:

Enhancing Statistical Machine Translation For Low-Resource Languages Using Semantic Similarity

The repository includes:

Corpora: bilingual training, tuning, and test sets for the language pairs Japanese-Vietnamese, Indonesian-Vietnamese, Malay-Vietnamese, and Filipino-Vietnamese.
Sentence alignment: the Java implementation of [Moore, 2002] for sentence alignment.
Extending word alignment by word similarity using word2vec
Pivot translation: the Java implementation of [Wu and Wang, 2007]

12) LogisticRegression-Shared-Task-Parallel-Corpus-Filtering

13) Cleaning of Parallel Texts for Machine Translation

Reference:
Sentence Alignment, MT Research Survey Wiki

