About Giza++
Open Source Text Processing Project: GIZA++
Install Giza++
First get the Giza++ related code:
git clone https://github.com/moses-smt/giza-pp.git
The repository includes both GIZA++ and mkcls, which are used in the process.
We recommend modifying the GIZA++ Makefile so that it outputs the actual word pairs, not just word ids:
cd giza-pp/GIZA++-v2/
vim Makefile
Change line 9, removing the -DBINARY_SEARCH_FOR_TTABLE flag (the old line is shown commented out):
#CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE
CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE
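Instead of editing line 9 by hand in vim, the same change can be made non-interactively with sed. This is a sketch against a stand-alone copy of the line (Makefile.sample is a stand-in; run the sed command on the real giza-pp/GIZA++-v2/Makefile):

```shell
# Write a copy of the original line 9 so this sketch is self-contained.
printf 'CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE\n' > Makefile.sample
# Drop the -DBINARY_SEARCH_FOR_TTABLE flag so GIZA++ emits actual words, not ids.
sed -i 's/ -DBINARY_SEARCH_FOR_TTABLE//' Makefile.sample
cat Makefile.sample
```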
Then "cd .." and run "make" to build GIZA++, mkcls, and the related tools:
make -C GIZA++-v2
make[1]: Entering directory '/home/textprocessing/giza/giza-pp/GIZA++-v2'
mkdir optimized/
g++ -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -c Parameter.cpp -o optimized/Parameter.o
Parameter.cpp:48:25: warning: ignoring return value of 'char* getcwd(char*, size_t)', declared with attribute warn_unused_result [-Wunused-result]
... (one compile line per source file; a few "variable set but not used" warnings appear and are harmless) ...
g++ -Wall -Wno-parentheses -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE optimized/Parameter.o ... optimized/ForwardBackward.o -o GIZA++
g++ -O3 -W -Wall snt2plain.cpp -o snt2plain.out
g++ -O3 -W -Wall plain2snt.cpp -o plain2snt.out
g++ -O3 -g -W -Wall snt2cooc.cpp -o snt2cooc.out
make[1]: Leaving directory '/home/textprocessing/giza/giza-pp/GIZA++-v2'
make -C mkcls-v2
make[1]: Entering directory '/home/textprocessing/giza/giza-pp/mkcls-v2'
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -c GDAOptimization.cpp -o GDAOptimization.o
... (similar compile lines for the remaining sources) ...
g++ -Wall -W -DNDEBUG -O3 -funroll-loops -o mkcls GDAOptimization.o HCOptimization.o Problem.o IterOptimization.o ProblemTest.o RRTOptimization.o MYOptimization.o SAOptimization.o TAOptimization.o Optimization.o KategProblemTest.o KategProblemKBC.o KategProblemWBC.o KategProblem.o StatVar.o general.o mkcls.o
make[1]: Leaving directory '/home/textprocessing/giza/giza-pp/mkcls-v2'
Prepare the bilingual corpus
We follow the Moses baseline tutorial to prepare the bilingual sample corpus and the preprocessing scripts. First get the corpus from WMT13:
mkdir corpus
cd corpus/
wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
tar -zxvf training-parallel-nc-v8.tgz
training/news-commentary-v8.cs-en.cs training/news-commentary-v8.cs-en.en training/news-commentary-v8.de-en.de training/news-commentary-v8.de-en.en training/news-commentary-v8.es-en.en training/news-commentary-v8.es-en.es training/news-commentary-v8.fr-en.en training/news-commentary-v8.fr-en.fr training/news-commentary-v8.ru-en.en training/news-commentary-v8.ru-en.ru
We follow the Moses scripts to clean the data:
To prepare the data for training the translation system, we have to perform the following steps:
tokenisation: This means that spaces have to be inserted between (e.g.) words and punctuation.
truecasing: The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity.
cleaning: Long sentences and empty sentences are removed as they can cause problems with the training pipeline, and obviously mis-aligned sentences are removed.
So get mosesdecoder first:
cd ..
git clone https://github.com/moses-smt/mosesdecoder.git
Now it's time to preprocess the bilingual pairs; we select the fr-en data as the example.
The original English data looks like this:
SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold.
Lately, with gold prices up more than 300% over the last decade, it is harder than ever.
Just last December, fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment, sensibly pointing out gold’s risks.
Wouldn’t you know it?
Tokenization:
./mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ./corpus/training/news-commentary-v8.fr-en.en > ./corpus/news-commentary-v8.fr-en.tok.en
./mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr < ./corpus/training/news-commentary-v8.fr-en.fr > ./corpus/news-commentary-v8.fr-en.tok.fr
After tokenization:
SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold .
Lately , with gold prices up more than 300 % over the last decade , it is harder than ever .
Just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .
Wouldn ’ t you know it ?
Truecase:
The truecaser first requires training, in order to extract some statistics about the text:
./mosesdecoder/scripts/recaser/train-truecaser.perl --model ./corpus/truecase-model.en --corpus ./corpus/news-commentary-v8.fr-en.tok.en
./mosesdecoder/scripts/recaser/train-truecaser.perl --model ./corpus/truecase-model.fr --corpus ./corpus/news-commentary-v8.fr-en.tok.fr
Then truecase the sample data:
./mosesdecoder/scripts/recaser/truecase.perl --model ./corpus/truecase-model.en < ./corpus/news-commentary-v8.fr-en.tok.en > ./corpus/news-commentary-v8.fr-en.true.en
./mosesdecoder/scripts/recaser/truecase.perl --model ./corpus/truecase-model.fr < ./corpus/news-commentary-v8.fr-en.tok.fr > ./corpus/news-commentary-v8.fr-en.true.fr
After truecase:
San FRANCISCO – It has never been easy to have a rational conversation about the value of gold .
lately , with gold prices up more than 300 % over the last decade , it is harder than ever .
just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .
wouldn ’ t you know it ?
Finally, clean the corpus, removing empty lines and sentences longer than 80 words:
./mosesdecoder/scripts/training/clean-corpus-n.perl ./corpus/news-commentary-v8.fr-en.true fr en ./corpus/news-commentary-v8.fr-en.clean 1 80
clean-corpus.perl: processing ./corpus/news-commentary-v8.fr-en.true.fr & .en to ./corpus/news-commentary-v8.fr-en.clean, cutoff 1-80, ratio 9
..........(100000)....
Input sentences: 157168 Output sentences: 155362
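The filtering that clean-corpus-n.perl performs can be approximated with a few lines of awk. This is a simplified sketch on a hand-made two-pair sample (the real script also enforces the 1:9 source/target length ratio and writes the two sides back out as separate files):

```shell
# Tiny stand-in parallel corpus: the second pair has an empty English side.
printf 'une phrase courte .\nencore une\n' > sample.fr
printf 'a short sentence .\n\n' > sample.en
# Keep only pairs where both sides have between 1 and 80 tokens.
paste sample.fr sample.en | awk -F'\t' '
  { nf = split($1, a, " "); ne = split($2, b, " ") }
  nf >= 1 && nf <= 80 && ne >= 1 && ne <= 80
' > kept.tsv
cat kept.tsv
```

Filtering both sides in lockstep matters: dropping a line from only one file would leave every later pair mis-aligned.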
Using Giza++ for Word Alignment
First, copy the compiled binaries:
textprocessing@ubuntu:~/giza$ cp giza-pp/GIZA++-v2/plain2snt.out .
textprocessing@ubuntu:~/giza$ cp giza-pp/GIZA++-v2/snt2cooc.out .
textprocessing@ubuntu:~/giza$ cp giza-pp/GIZA++-v2/GIZA++ .
textprocessing@ubuntu:~/giza$ cp giza-pp/mkcls-v2/mkcls .
Then run:
./plain2snt.out corpus/news-commentary-v8.fr-en.clean.fr corpus/news-commentary-v8.fr-en.clean.en
which will generate .vcb (vocabulary) files and .snt (sentence) files, containing the vocabulary lists and the sentence pairs encoded as word ids, respectively.
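For reference, both formats are simple (illustrated here on hand-made samples, not the tutorial's actual output): each .vcb line is "word_id word frequency", and each sentence pair in the .snt file takes three lines: the number of times the pair occurs, the source sentence as word ids, and the target sentence as word ids.

```shell
# Hand-made .vcb sample: word id, surface form, corpus frequency.
cat > sample.vcb <<'EOF'
2 la 1024
3 maison 87
EOF
# Hand-made .snt sample: one pair = count line, source-id line, target-id line.
cat > sample.snt <<'EOF'
1
2 3
2 3 4
EOF
# Recover the surface forms for the source side of the pair.
awk 'NR==FNR { w[$1] = $2; next }
     FNR % 3 == 2 { for (i = 1; i <= NF; i++) printf "%s ", w[$i]; print "" }' sample.vcb sample.snt
```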
Then run mkcls, a program that automatically infers word classes from a corpus using a maximum-likelihood criterion:
mkcls [-nnum] [-ptrain] [-Vfile] opt
-V output classes (Default: no file)
-n number of optimization runs (Default: 1); larger number => better results
-p filename of training corpus (Default: 'train')
Example:
mkcls -c80 -n10 -pin -Vout opt
(generates 80 classes for the corpus ‘in’ and writes the classes in ‘out’)
Literature:
Franz Josef Och: "Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren
der kombinatorischen Optimierung". Studienarbeit, Universität Erlangen-Nürnberg,
Germany, 1995.
Execute:
./mkcls -pcorpus/news-commentary-v8.fr-en.clean.fr -Vcorpus/news-commentary-v8.fr-en.fr.vcb.classes
./mkcls -pcorpus/news-commentary-v8.fr-en.clean.en -Vcorpus/news-commentary-v8.fr-en.en.vcb.classes
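The resulting .vcb.classes file maps each word to a class id, one "word class-id" pair per line. As a quick sanity check, the mapping can be inverted to list the members of each class; this sketch uses a hand-made sample rather than the real output:

```shell
# Hand-made sample in the "word class-id" format mkcls writes.
cat > sample.classes <<'EOF'
maison 17
voiture 17
manger 42
EOF
# Invert the word -> class mapping to list each class's members.
awk '{ members[$2] = members[$2] " " $1 }
     END { for (c in members) print c ":" members[c] }' sample.classes | sort
```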
Finally run GIZA++ (note that the directory given by -outputpath must exist before the run):
mkdir -p fr_en
./GIZA++ -S corpus/news-commentary-v8.fr-en.clean.fr.vcb -T corpus/news-commentary-v8.fr-en.clean.en.vcb -C corpus/news-commentary-v8.fr-en.clean.fr_news-commentary-v8.fr-en.clean.en.snt -o fr_en -outputpath fr_en
......
writing Final tables to Disk
Dumping the t table inverse to file: fr_en/fr_en.ti.final
Dumping the t table inverse to file: fr_en/fr_en.actual.ti.final
Writing PERPLEXITY report to: fr_en/fr_en.perp
Writing source vocabulary list to : fr_en/fr_en.trn.src.vcb
Writing source vocabulary list to : fr_en/fr_en.trn.trg.vcb
Writing source vocabulary list to : fr_en/fr_en.tst.src.vcb
Writing source vocabulary list to : fr_en/fr_en.tst.trg.vcb
writing decoder configuration file to fr_en/fr_en.Decoder.config
......
The most important file for us is the actual word alignment pairs file, fr_en.actual.ti.final, which holds one word pair and its translation probability per line:
expectancy associée 0.0144092
only enchâssée 3.56377e-05
amounts construisent 0.00338397
knowledge attribuées 0.00116645
dynamic dynamiques 0.223755
harsh périrent 0.00709615
insubordination agissements 1
big caféière 0.000125214
Health Santé 0.289873
building construisent 0.00355319
dilemma dynamiques 0.00853293
learn apprendront 0.00658648
moving délocalisée 0.00180745
pretends prétendent 0.129701
aggressive dynamiques 0.00016645
center centristes 0.00357907
scope 707 0.000628053
experts intentionnés 0.00241335
principles déplaisait 0.00173075
Reagan déplaisait 0.0054606
meant attribuées 0.00240529
build construisent 0.00590704
median âge 0.121734
The file is unsorted; we can sort it first:
sort fr_en.actual.ti.final > fr_en.actual.ti.final.sort
Then view it in alphabetical order:
learn acquérir 0.00440678
learn adapter 8.79211e-06
learn amérindienne 0.000941561
learn apprécié 0.00330693
learn apprenant 0.00761903
learn apprend 0.00797
learn apprendra 0.00357164
learn apprendre 0.449114
learn apprendrons 0.00265828
learn apprendront 0.00658648
learn apprenez 0.000753722
learn apprenions 0.00077654
learn apprenne 0.00167538
learn apprennent 0.0490054
learn apprenons 0.0085642
learn apprenons-nous 0.000916356
learn apprentissage 0.00935484
learn appris 0.00427148
learn assimilation 0.00248182
learn aurons 0.00229323
learn avertis 8.16617e-06
learn bénéficier 0.00429511
learn commettre 0.0040235
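A common follow-up is to keep only the highest-probability translation for each source word. One way to sketch that with standard tools (shown on a hand-made sample in the same three-column format; -g is GNU sort's general-numeric mode, needed for values like 8.79e-06):

```shell
# Hand-made sample in the "source target probability" format.
cat > sample.ti <<'EOF'
learn apprendre 0.449114
learn apprennent 0.0490054
learn adapter 8.79211e-06
house maison 0.8
EOF
# Sort by source word, then by probability descending,
# and keep the first (best) line per source word.
sort -k1,1 -k3,3gr sample.ti | awk '!seen[$1]++'
```

The awk idiom '!seen[$1]++' prints a line only the first time its first field appears, which after the sort is exactly the best-scoring pair.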