Getting started with Giza++ for Word Alignment

About Giza++

Open Source Text Processing Project: GIZA++

Install Giza++

First get the Giza++ related code:

git clone https://github.com/moses-smt/giza-pp.git

The repository includes both GIZA++ and mkcls, which is also used in the alignment process.

We recommend modifying the GIZA++ Makefile so that the output contains the actual word pairs, not just word ids:

cd giza-pp/GIZA++-v2/
vim Makefile

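Change line 9 as follows (the original line is commented out, and the new line drops the -DBINARY_SEARCH_FOR_TTABLE flag):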


#CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE
CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE

Then run "cd .." and "make" to build GIZA++, mkcls and the related tools:

make -C GIZA++-v2
make[1]: Entering directory '/home/textprocessing/giza/giza-pp/GIZA++-v2'
mkdir optimized/
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c Parameter.cpp -o optimized/Parameter.o
Parameter.cpp: In function ‘bool writeParameters(std::ofstream&, const ParSet&, int)’:
Parameter.cpp:48:25: warning: ignoring return value of ‘char* getcwd(char*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
        getcwd(path,1024);
                         ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c myassert.cpp -o optimized/myassert.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c Perplexity.cpp -o optimized/Perplexity.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model1.cpp -o optimized/model1.o
model1.cpp: In member function ‘int model1::em_with_tricks(int, bool, Dictionary&, bool)’:
model1.cpp:72:7: warning: variable ‘pair_no’ set but not used [-Wunused-but-set-variable]
   int pair_no;
       ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model2.cpp -o optimized/model2.o
model2.cpp: In member function ‘int model2::em_with_tricks(int)’:
model2.cpp:64:7: warning: variable ‘pair_no’ set but not used [-Wunused-but-set-variable]
   int pair_no = 0;
       ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model3.cpp -o optimized/model3.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c getSentence.cpp -o optimized/getSentence.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c TTables.cpp -o optimized/TTables.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c ATables.cpp -o optimized/ATables.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c AlignTables.cpp -o optimized/AlignTables.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c main.cpp -o optimized/main.o
main.cpp: In function ‘int main(int, char**)’:
main.cpp:707:10: warning: variable ‘errors’ set but not used [-Wunused-but-set-variable]
   double errors=0.0;
          ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c NTables.cpp -o optimized/NTables.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model2to3.cpp -o optimized/model2to3.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c collCounts.cpp -o optimized/collCounts.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c alignment.cpp -o optimized/alignment.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c vocab.cpp -o optimized/vocab.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c MoveSwapMatrix.cpp -o optimized/MoveSwapMatrix.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c transpair_model3.cpp -o optimized/transpair_model3.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c transpair_model5.cpp -o optimized/transpair_model5.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c transpair_model4.cpp -o optimized/transpair_model4.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c utility.cpp -o optimized/utility.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c parse.cpp -o optimized/parse.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c reports.cpp -o optimized/reports.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model3_viterbi.cpp -o optimized/model3_viterbi.o
model3_viterbi.cpp: In member function ‘void model3::findAlignmentsNeighborhood(std::vector&, std::vector&, LogProb&, alignmodel&, int, int)’:
model3_viterbi.cpp:431:12: warning: variable ‘it_st’ set but not used [-Wunused-but-set-variable]
     time_t it_st;
            ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model3_viterbi_with_tricks.cpp -o optimized/model3_viterbi_with_tricks.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c Dictionary.cpp -o optimized/Dictionary.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c model345-peg.cpp -o optimized/model345-peg.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c hmm.cpp -o optimized/hmm.o
hmm.cpp: In member function ‘int hmm::em_with_tricks(int)’:
hmm.cpp:79:7: warning: variable ‘pair_no’ set but not used [-Wunused-but-set-variable]
   int pair_no = 0;
       ^
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c HMMTables.cpp -o optimized/HMMTables.o
g++   -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE  -c ForwardBackward.cpp -o optimized/ForwardBackward.o
g++  -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE optimized/Parameter.o optimized/myassert.o optimized/Perplexity.o optimized/model1.o optimized/model2.o optimized/model3.o optimized/getSentence.o optimized/TTables.o optimized/ATables.o optimized/AlignTables.o optimized/main.o optimized/NTables.o optimized/model2to3.o optimized/collCounts.o optimized/alignment.o optimized/vocab.o optimized/MoveSwapMatrix.o optimized/transpair_model3.o optimized/transpair_model5.o optimized/transpair_model4.o optimized/utility.o optimized/parse.o optimized/reports.o optimized/model3_viterbi.o optimized/model3_viterbi_with_tricks.o optimized/Dictionary.o optimized/model345-peg.o optimized/hmm.o optimized/HMMTables.o optimized/ForwardBackward.o  -o GIZA++
g++  -O3 -W -Wall snt2plain.cpp -o snt2plain.out
g++  -O3 -W -Wall plain2snt.cpp -o plain2snt.out
g++  -O3 -g -W -Wall snt2cooc.cpp -o snt2cooc.out
make[1]: Leaving directory '/home/textprocessing/giza/giza-pp/GIZA++-v2'
make -C mkcls-v2
make[1]: Entering directory '/home/textprocessing/giza/giza-pp/mkcls-v2'
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c GDAOptimization.cpp -o GDAOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c HCOptimization.cpp -o HCOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c Problem.cpp -o Problem.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c IterOptimization.cpp -o IterOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c ProblemTest.cpp -o ProblemTest.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c RRTOptimization.cpp -o RRTOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c MYOptimization.cpp -o MYOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c SAOptimization.cpp -o SAOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c TAOptimization.cpp -o TAOptimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c Optimization.cpp -o Optimization.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c KategProblemTest.cpp -o KategProblemTest.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c KategProblemKBC.cpp -o KategProblemKBC.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c KategProblemWBC.cpp -o KategProblemWBC.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c KategProblem.cpp -o KategProblem.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c StatVar.cpp -o StatVar.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c general.cpp -o general.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c mkcls.cpp -o mkcls.o
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -o mkcls GDAOptimization.o HCOptimization.o Problem.o IterOptimization.o ProblemTest.o RRTOptimization.o MYOptimization.o SAOptimization.o TAOptimization.o Optimization.o KategProblemTest.o KategProblemKBC.o KategProblemWBC.o KategProblem.o StatVar.o general.o mkcls.o 
make[1]: Leaving directory '/home/textprocessing/giza/giza-pp/mkcls-v2'

Prepare the bilingual corpus

We follow the Moses baseline pipeline to prepare a sample bilingual corpus and the preprocessing scripts. First, get the corpus from WMT13:


mkdir corpus
cd corpus/
wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
tar -zxvf training-parallel-nc-v8.tgz

training/news-commentary-v8.cs-en.cs
training/news-commentary-v8.cs-en.en
training/news-commentary-v8.de-en.de
training/news-commentary-v8.de-en.en
training/news-commentary-v8.es-en.en
training/news-commentary-v8.es-en.es
training/news-commentary-v8.fr-en.en
training/news-commentary-v8.fr-en.fr
training/news-commentary-v8.ru-en.en
training/news-commentary-v8.ru-en.ru

We follow the moses script to clean the data:

To prepare the data for training the translation system, we have to perform the following steps:
tokenisation: This means that spaces have to be inserted between (e.g.) words and punctuation.
truecasing: The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity.
cleaning: Long sentences and empty sentences are removed as they can cause problems with the training pipeline, and obviously mis-aligned sentences are removed.

So get mosesdecoder first:

cd ..
git clone https://github.com/moses-smt/mosesdecoder.git

Now it is time to preprocess the bilingual pairs. We use the fr-en data as the example.

The original English data looks like this:

SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold.
Lately, with gold prices up more than 300% over the last decade, it is harder than ever.
Just last December, fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment, sensibly pointing out gold’s risks.
Wouldn’t you know it?

Tokenization:

./mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ./corpus/training/news-commentary-v8.fr-en.en > ./corpus/news-commentary-v8.fr-en.tok.en

./mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr < ./corpus/training/news-commentary-v8.fr-en.fr > ./corpus/news-commentary-v8.fr-en.tok.fr

After tokenization:

SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold .
Lately , with gold prices up more than 300 % over the last decade , it is harder than ever .
Just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .
Wouldn ’ t you know it ?
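
The effect of tokenisation can be illustrated with a much simplified sketch in Python. This is not the Moses tokenizer: simple_tokenize is a hypothetical helper that only pads every character that is not a letter, digit, underscore, whitespace or hyphen with spaces. It ignores the language-specific rules (abbreviations, decimal numbers, apostrophe handling) that tokenizer.perl applies.

```python
import re


def simple_tokenize(text):
    """Put spaces around punctuation; a rough stand-in for a real tokenizer."""
    # Pad every character that is not a word character, whitespace or hyphen.
    padded = re.sub(r"([^\w\s-])", r" \1 ", text)
    # Collapse the runs of whitespace created by the padding.
    return " ".join(padded.split())


if __name__ == "__main__":
    print(simple_tokenize("Wouldn’t you know it?"))
    print(simple_tokenize("Lately, with gold prices up more than 300% over the last decade, it is harder than ever."))
```

On the four sample lines above this sketch happens to give the same result as the Moses output, but it would break on inputs such as decimal numbers or abbreviations ending in a period.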

Truecase:

The truecaser first requires training, in order to extract some statistics about the text:

./mosesdecoder/scripts/recaser/train-truecaser.perl --model ./corpus/truecase-model.en --corpus ./corpus/news-commentary-v8.fr-en.tok.en

./mosesdecoder/scripts/recaser/train-truecaser.perl --model ./corpus/truecase-model.fr --corpus ./corpus/news-commentary-v8.fr-en.tok.fr

Then truecase the sample data:

./mosesdecoder/scripts/recaser/truecase.perl --model ./corpus/truecase-model.en < ./corpus/news-commentary-v8.fr-en.tok.en > ./corpus/news-commentary-v8.fr-en.true.en

./mosesdecoder/scripts/recaser/truecase.perl --model ./corpus/truecase-model.fr < ./corpus/news-commentary-v8.fr-en.tok.fr > ./corpus/news-commentary-v8.fr-en.true.fr

After truecasing:

San FRANCISCO – It has never been easy to have a rational conversation about the value of gold .
lately , with gold prices up more than 300 % over the last decade , it is harder than ever .
just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .
wouldn ’ t you know it ?
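
The idea behind the two truecasing steps (collect casing statistics, then rewrite the sentence-initial word) can be sketched as follows. This is a simplified illustration, not a port of train-truecaser.perl or truecase.perl: the function names are hypothetical, and the sketch assumes that casing statistics are taken only from words that are not at the start of a sentence.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Map each lowercased word to its most frequent surface form."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Skip the first token: its capitalisation says little about the word.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}


def truecase(sentence, model):
    """Replace the sentence-initial token by its most frequent casing, if known."""
    tokens = sentence.split()
    if tokens:
        tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)


if __name__ == "__main__":
    # Tiny made-up training corpus, for illustration only.
    corpus = [
        "it was said that lately prices rose .",
        "in December , prices rose .",
    ]
    model = train_truecaser(corpus)
    print(truecase("Lately , with gold prices up .", model))
    print(truecase("December was cold .", model))
```

Words the model has never seen in a non-initial position are left unchanged.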

Remove empty sentences and sentences longer than 80 tokens:

./mosesdecoder/scripts/training/clean-corpus-n.perl ./corpus/news-commentary-v8.fr-en.true fr en ./corpus/news-commentary-v8.fr-en.clean 1 80

clean-corpus.perl: processing ./corpus/news-commentary-v8.fr-en.true.fr & .en to ./corpus/news-commentary-v8.fr-en.clean, cutoff 1-80, ratio 9
..........(100000)....
Input sentences: 157168  Output sentences:  155362
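
The filter reported in the log above ("cutoff 1-80, ratio 9") can be sketched in Python as below. This is an approximation of the length and ratio checks only, under the assumption that lengths are counted in whitespace-separated tokens; keep_pair is a hypothetical name, and clean-corpus-n.perl may apply further checks that are not reproduced here.

```python
def keep_pair(src, tgt, min_len=1, max_len=80, max_ratio=9.0):
    """Return True if a sentence pair passes the length and ratio limits."""
    n_src = len(src.split())
    n_tgt = len(tgt.split())
    # Drop empty sentences explicitly; this also avoids dividing by zero below.
    if n_src == 0 or n_tgt == 0:
        return False
    # Both sides must lie within the allowed token range.
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # Drop pairs whose lengths differ by more than the allowed ratio.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def clean_corpus(src_lines, tgt_lines, **limits):
    """Filter two parallel lists of sentences, keeping them aligned."""
    kept = [(s, t) for s, t in zip(src_lines, tgt_lines) if keep_pair(s, t, **limits)]
    return [s for s, _ in kept], [t for _, t in kept]


if __name__ == "__main__":
    fr = ["le chat dort .", "", "oui"]
    en = ["the cat sleeps .", "empty source side", "yes " * 20]
    print(clean_corpus(fr, en))
```

Both sides are filtered together so that line i of one file still corresponds to line i of the other.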

Using Giza++ for Word Alignment

First, copy the executables:

textprocessing@ubuntu:~/giza$ cp giza-pp/GIZA++-v2/plain2snt.out .
textprocessing@ubuntu:~/giza$ cp giza-pp/GIZA++-v2/snt2cooc.out .
textprocessing@ubuntu:~/giza$ cp giza-pp/GIZA++-v2/GIZA++ .
textprocessing@ubuntu:~/giza$ cp giza-pp/mkcls-v2/mkcls .

Then run:

./plain2snt.out corpus/news-commentary-v8.fr-en.clean.fr corpus/news-commentary-v8.fr-en.clean.en

which will generate vcb (vocabulary) files and snt (sentence) files, containing the list of vocabulary and aligned sentences, respectively.
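
To inspect these files, a small reader such as the following may help. It is a sketch under two assumptions about the file layout that are not stated in this post: each vcb line holds "id word frequency", and each sentence pair in an snt file takes three lines (an occurrence count, then two lines of word ids). The sample data in the block is made up.

```python
def parse_vcb(lines):
    """Read 'id word frequency' lines into an id -> word dictionary."""
    id_to_word = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 3:  # silently skip anything that does not fit the layout
            word_id, word, _frequency = parts
            id_to_word[int(word_id)] = word
    return id_to_word


def parse_snt(lines):
    """Yield (count, first_ids, second_ids) for every three-line record."""
    rows = [line.rstrip("\n") for line in lines]
    usable = len(rows) - len(rows) % 3  # ignore a trailing partial record
    for i in range(0, usable, 3):
        count = int(rows[i])
        first_ids = [int(x) for x in rows[i + 1].split()]
        second_ids = [int(x) for x in rows[i + 2].split()]
        yield count, first_ids, second_ids


if __name__ == "__main__":
    vcb = parse_vcb(["2 le 3", "3 chat 1"])  # made-up vocabulary
    for count, first_ids, _second_ids in parse_snt(["1", "2 3", "4 5"]):
        print(count, [vcb.get(i, "?") for i in first_ids])
```

With real data, the lines would come from open(path, encoding="utf-8") on the generated .vcb and .snt files.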

Then run mkcls, a program that automatically infers word classes from a corpus using a maximum likelihood criterion:

mkcls [-nnum] [-ptrain] [-Vfile] opt
-V output classes (Default: no file)
-n number of optimization runs (Default: 1); larger number => better results
-p filename of training corpus (Default: ‘train’)
Example:
mkcls -c80 -n10 -pin -Vout opt
(generates 80 classes for the corpus ‘in’ and writes the classes in ‘out’)
Literature:
Franz Josef Och: “Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren
der kombinatorischen Optimierung” Studienarbeit, Universität Erlangen-Nürnberg,
Germany, 1995.

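Execute the following. The class files are written next to the vocabulary files as <vcb file>.classes, which is where GIZA++ looks for them:


./mkcls -pcorpus/news-commentary-v8.fr-en.clean.fr -Vcorpus/news-commentary-v8.fr-en.clean.fr.vcb.classes
./mkcls -pcorpus/news-commentary-v8.fr-en.clean.en -Vcorpus/news-commentary-v8.fr-en.clean.en.vcb.classes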

Finally run GIZA++:

./GIZA++ -S corpus/news-commentary-v8.fr-en.clean.fr.vcb -T corpus/news-commentary-v8.fr-en.clean.en.vcb -C corpus/news-commentary-v8.fr-en.clean.fr_news-commentary-v8.fr-en.clean.en.snt -o fr_en -outputpath fr_en

......
writing Final tables to Disk
Dumping the t table inverse to file: fr_en/fr_en.ti.final
Dumping the t table inverse to file: fr_en/fr_en.actual.ti.final
Writing PERPLEXITY report to: fr_en/fr_en.perp
Writing source vocabulary list to : fr_en/fr_en.trn.src.vcb
Writing source vocabulary list to : fr_en/fr_en.trn.trg.vcb
Writing source vocabulary list to : fr_en/fr_en.tst.src.vcb
Writing source vocabulary list to : fr_en/fr_en.tst.trg.vcb
writing decoder configuration file to fr_en/fr_en.Decoder.config
......
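
The file names passed to GIZA++ above all follow from the two inputs given to plain2snt.out. The sketch below makes that naming pattern explicit; giza_args is a hypothetical helper, and the pattern is inferred only from the commands shown in this post, not from the GIZA++ documentation.

```python
import posixpath


def giza_args(src_plain, tgt_plain, run_name):
    """Build the GIZA++ argument list from the two plain2snt input files."""
    # The snt file sits next to the first input and joins both base names.
    snt_name = "{}_{}.snt".format(
        posixpath.basename(src_plain), posixpath.basename(tgt_plain)
    )
    snt_path = posixpath.join(posixpath.dirname(src_plain), snt_name)
    return [
        "./GIZA++",
        "-S", src_plain + ".vcb",  # vocabulary of the first file
        "-T", tgt_plain + ".vcb",  # vocabulary of the second file
        "-C", snt_path,            # sentence pairs encoded as word ids
        "-o", run_name,            # prefix of the output files
        "-outputpath", run_name,   # directory for the output files
    ]


if __name__ == "__main__":
    args = giza_args(
        "corpus/news-commentary-v8.fr-en.clean.fr",
        "corpus/news-commentary-v8.fr-en.clean.en",
        "fr_en",
    )
    print(" ".join(args))
```

The resulting list could be handed to subprocess.run once the binaries and input files are in place.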

The most important file for us is the one with the actual word alignment pairs: fr_en.actual.ti.final

expectancy associée 0.0144092
only enchâssée 3.56377e-05
amounts construisent 0.00338397
knowledge attribuées 0.00116645
dynamic dynamiques 0.223755
harsh périrent 0.00709615
insubordination agissements 1
big caféière 0.000125214
Health Santé 0.289873
building construisent 0.00355319
dilemma dynamiques 0.00853293
learn apprendront 0.00658648
moving délocalisée 0.00180745
pretends prétendent 0.129701
aggressive dynamiques 0.00016645
center centristes 0.00357907
scope 707 0.000628053
experts intentionnés 0.00241335
principles déplaisait 0.00173075
Reagan déplaisait 0.0054606
meant attribuées 0.00240529
build construisent 0.00590704
median âge 0.121734

The file is unsorted, so we sort it first:

sort fr_en.actual.ti.final > fr_en.actual.ti.final.sort

Then view it in alphabetical order:

learn acquérir 0.00440678
learn adapter 8.79211e-06
learn amérindienne 0.000941561
learn apprécié 0.00330693
learn apprenant 0.00761903
learn apprend 0.00797
learn apprendra 0.00357164
learn apprendre 0.449114
learn apprendrons 0.00265828
learn apprendront 0.00658648
learn apprenez 0.000753722
learn apprenions 0.00077654
learn apprenne 0.00167538
learn apprennent 0.0490054
learn apprenons 0.0085642
learn apprenons-nous 0.000916356
learn apprentissage 0.00935484
learn appris 0.00427148
learn assimilation 0.00248182
learn aurons 0.00229323
learn avertis 8.16617e-06
learn bénéficier 0.00429511
learn commettre 0.0040235
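
Alphabetical order groups the candidates for one word together, but it does not show which candidate is the most likely. A small sketch that ranks the candidates for a given word by probability is shown below; top_translations is a hypothetical helper, and the same three-field line layout as in the listing above is assumed.

```python
import heapq


def top_translations(lines, word, k=3):
    """Return the k most probable (candidate, prob) pairs for `word`."""
    candidates = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == word:
            # Store the probability first so that tuples order by it.
            candidates.append((float(parts[2]), parts[1]))
    return [(cand, prob) for prob, cand in heapq.nlargest(k, candidates)]


if __name__ == "__main__":
    # A few lines copied from the sorted listing above.
    sample = [
        "learn acquérir 0.00440678",
        "learn apprendre 0.449114",
        "learn apprennent 0.0490054",
        "learn apprentissage 0.00935484",
        "learn avertis 8.16617e-06",
    ]
    for candidate, prob in top_translations(sample, "learn"):
        print(candidate, prob)
```

For the lines used here, "apprendre" comes out first with 0.449114, followed by "apprennent" and "apprentissage".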

Reference:

Using GIZA++ to Obtain Word Alignment Between Bilingual Sentences

