Sentence alignment is a long-standing yet still active problem, and an important preprocessing step for machine translation.
An influential family of early methods is based on sentence length, measured in words (Brown et al.) or in characters (Gale and Church):
1) Peter F. Brown, Jennifer C. Lai and Robert L. Mercer (1991): Aligning Sentences in Parallel Corpora, Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL)
2) William A. Gale and Kenneth Ward Church (1991): A Program for Aligning Sentences in Bilingual Corpora, Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL)
3) William A. Gale and Kenneth Ward Church (1993): A program for aligning sentences in bilingual corpora, Computational Linguistics
4) Kenneth Ward Church (1993): Char_align: A Program for Aligning Parallel Texts at the Character Level, Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL)
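To make the length-based idea concrete, here is a minimal sketch in the spirit of Gale and Church, not their original program. It scores each candidate group of sentences by how far the target length deviates from the source length, adds a prior penalty for the alignment type (1-1, 1-0, 0-1, 2-1, 1-2, 2-2), and picks the cheapest path by dynamic programming. The constants `PRIORS`, `C` and `S2` and the function names `match_cost` and `align` are assumptions chosen for illustration; they loosely follow values commonly quoted for this method and would need tuning for a real language pair.

```python
import math

# Assumed prior probability of each alignment type (source count, target count).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # assumed expected number of target characters per source character
S2 = 6.8   # assumed variance of that ratio


def match_cost(len_src, len_tgt):
    """Negative log probability of the length mismatch between two spans."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail of a standard normal, floored to avoid log(0).
    tail = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(tail)


def align(src, tgt):
    """Return a list of (source_indices, target_indices) pairs in text order."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]   # lengths in characters
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            # Try every alignment type that starts at position (i, j).
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_len[i:ni]), sum(tgt_len[j:nj]))
                total = cost[i][j] + step - math.log(prior)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (di, dj)
    # Walk the back-pointers from the end of both texts to the start.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        pairs.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return pairs[::-1]


if __name__ == "__main__":
    source = ["Hello world.", "How are you today?", "Fine."]
    target = ["Bonjour le monde.", "Comment allez-vous aujourd'hui ?", "Bien."]
    for s_idx, t_idx in align(source, target):
        print(s_idx, t_idx)
```

The sketch uses only sentence lengths, so it needs no dictionary or training data; that is also its main weakness, and the reason the lexical methods listed next tend to be more robust on noisy text.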
Classic and useful methods that also use lexical information:
5) Robert C. Moore (2002): Fast and Accurate Sentence Alignment of Bilingual Corpora, Machine Translation: From Research to Real Users, 5th Conference of the Association for Machine Translation in the Americas, AMTA 2002 Tiburon, CA, USA, October 6-12, 2002, Proceedings
Related Perl Code: https://www.microsoft.com/en-us/download/details.aspx?id=52608
6) Anil Kumar Singh and Samar Husain (2005): Comparison, Selection and Use of Sentence Alignment Algorithms for New Language Pairs, Proceedings of the ACL Workshop on Building and Using Parallel Texts
7) Bleualign: https://github.com/rsennrich/Bleualign
The algorithm is described in:
Rico Sennrich, Martin Volk (2010): MT-based Sentence Alignment for OCR-generated Parallel Texts. In: Proceedings of AMTA 2010, Denver, Colorado.
Rico Sennrich; Martin Volk (2011): Iterative, MT-based sentence alignment of parallel texts. In: NODALIDA 2011, Nordic Conference of Computational Linguistics, Riga.
8) hunalign. The algorithm is described in:
D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy (2005): Parallel corpora for medium density languages. In: Proceedings of RANLP 2005, pages 590-596.
9) yasa: a program that aligns two translations of a text sentence by sentence in order to produce a bitext.
10) mALIGNa: Bilingual sentence aligner
This repository is for the following paper:
Enhancing Statistical Machine Translation for Low-Resource Languages Using Semantic Similarity
The repository includes:
Bilingual corpora: training, tuning, and test sets for language pairs: Japanese-Vietnamese, Indonesian-Vietnamese, Malay-Vietnamese, Filipino-Vietnamese.
The Java implementation of [Moore, 2002] for sentence alignment.
Extending word alignment by word similarity using word2vec
The Java implementation of [Wu and Wang, 2007]