pialign is a package that allows you to create a phrase table and word alignments from an unaligned parallel corpus. It is unlike other unsupervised word alignment tools in that it is able to create a phrase table using a fully statistical model, no heuristics. As a result, it is able to build phrase tables for phrase-based machine translation that achieve competitive results but are only a fraction of the size of those created with heuristic methods.
pialign was developed mainly by Graham Neubig during his period as an intern at NICT.
If you would like more details about the method or want to cite pialign in your research, please reference:
Latest Version: pialign 0.2.5
Bleeding-Edge Code: @github
Past Versions:
pialign 0.2.4
pialign 0.2.3
pialign 0.2.2
pialign 0.2.1
pialign 0.2.0
pialign 0.1.0
The code of pialign is distributed according to the Common Public License v 1.0, and can be distributed freely according to this license.
On Linux, Mac OS X, or Cygwin, download the source code, and install using the following commands.
tar -xzf pialign-X.X.X.tar.gz cd pialign-X.X.X ./configure make make install pialign --help
If this prints a help message, pialign is working properly.
To align parallel text, say we have a source language file "data/f.txt" and a target language file "data/e.txt", we can run the program as follows:
$ mkdir out $ pialign data/f.txt data/e.txt out/align. &> out/log.txt &
The program will run for a while (about 1-2 hours for 10k sentences, or 10-20 hours for 100k). Note that by default, only sentences of 40 words or less will be aligned. This can be changed by using "-maxsentlen XX" where XX is the maximum sentence length, but if this is set to a large value the program may use a large amount of memory, and the speed will drop to some extent.
When the program is done running, it will output two files "out/align.1.samp" and "out/align.1.pt." align.1.samp contains the alignments (in the form of trees) created by the aligner, and align.1.pt contains a phrase table that can be used with Moses.
In order for Moses to run, you will also need a language model, which you can train using SRILM or IRSTLM (you can reference the Moses step-by-step tutorial for more details on how to train a model). In addition, to achieve high accuracies, you should probably calculate a lexicalized reordering model using the "itgstats.pl" script on the derivation file. Here, "7" is the maximum length of a phrase, and 0.5 is the "pseudo-count" to be added for smoothing.
$ script/itgstats.pl lex 7 0.5 < out/align.1.samp | LC_ALL=C sort > out/align.1.lex
Finally, you can combine the phrase table (TM), language model (LM), and reordering model (DM) together using the following command.
$ script/make-moses-config.pl /full/path/to/out/align.1.pt /full/path/to/lm.txt --dm-file /full/path/to/out/align.1.lex > out/moses.ini
There are also a number of settings controlling the LM order, TM order, etc. described in the beginning of make-moses-config.pl, so please take a look to make sure there is nothing that you need to change. If all goes well, Moses can be run using this moses.ini file. If not, feel free to contact pialign-users for help at any time.
The following options can be used with pialign.
~~~ Input/Output ~~~ Usage: pialign [OPTIONS] FFILE EFILE PREFIX FFILE is the foreign input corpus EFILE is the english input corpus PREFIX is the prefix that will be used for the output Other input: -le2f A file containing the lexicon probabilities for e2f -lf2e A file containing the lexicon probabilities for f2e (These can be used with "-base m1" or "-base m1g" but are not necessary) ~~~ Model Parameters ~~~ -model Model type (hier/len/flat, default: hier) -avgphraselen A parameter indicating the expected length of a phrase. default is small (0.01) to prevent overly long alignments -base The type of base measure to use (m1g is generally best). 'm1g'=geometric mean of model 1, 'm1'=arithmetic mean of model 1, 'uni'=simple unigrams, 'coocll'=log-linear interpolation of phrase cooccurrence probabilities in both directions (default 'm1g') -coocdisc How much to discount the cooccurrence for phrasal base measures -defstren Fixed strength of the PY process (default none) -defdisc Fixed discount of the PY process (default none) -nullprob The probability of a null alignment (default 0.01) -noremnull Do not remember nulls in the phrase table -termprior The prior probability of generating a terminal (0.33) -termstren Strength of the type distribution (default 1) -domh Do a Metropolis-Hastings rejection step ~~~ Phrase Table ~~~ -maxphraselen The maximum length of a minimal phrase (default 7) -maxsentlen The maximum length of sentences to use (default 40) -printmax The maximum length of phrases included in the phrase table (default 7) -printmin The minimal length of phrases included in the phrase table (default 1) -noword Output only phrase alignments (do not force output of word alignments) ~~~ Inference Parameters ~~~ -burnin The number of burn-in iterations (default 9) -probwidth The width of the probability beam to use (default 1e-4) -noqueue Use exhaustive search instead of queue-based parsing -lookahead The type of lookahead function to use: 'none'=no look-ahead, 'ind'=independently calculate both sides -samps The number of samples to take (default 1) -samprate Take samples every samprate turns (default 1) -worditers The number of iterations to perform with a word-based model (default 0) -noshuffle Don't shuffle the order of the sentences -batchlen The number of sentences to process in a single batch -threads The number of threads to use (must be <= -batchlen)
This can be done by running the script "script/itgstats.pl" on "align.1.samp." There are three types of word alignments that you can create, many-to-many (phrase) alignments, one-to-many (block) alignments, and one-to-one (word) alignments.
many-to-many: $ script/itgstats.pl palign < out/align.1.samp > out/align.1.pal one-to-many: $ script/itgstats.pl balign < out/align.1.samp > out/align.1.bal one-to-one: $ script/itgstats.pl align < out/align.1.samp > out/align.1.wal
If you want to visualize these alignments you can do so as follows:
$ script/visualize.pl data/e.txt data/f.txt out/align.1.pal > out/align.1.vis
These are explained in detail in the referenced paper, but briefly:
1,2: The conditional probabilities of the phrases p_t(e|f), p_t(f|e) 3: The joint probability of the phrase pair p_t(e,f) 4: The average posterior probability of a span containing e,f 5,6: Lexical weighting probabilities using model 1 word probabilities (only output if the base measure uses model 1) 7: The uniform phrase penalty
The bracketed representation in the derivation file has four different types of brackets:
[X Y]: Indicates that children X and Y were generated by a regular ITG symbol <X Y>: Indicates that children X and Y were generated by a reverse ITG symbol ((( e ||| f ))): Indicates a phrase pair e, f { X }: Indicates that X was generated as a single phrase in the model. Words inside these brackets were actually aligned together as a single phrase, but by default pialign forces alignments down to words to preserve word alignments for possible other uses.
The easiest way speed up pialign is to use multi-threading. This can be done by setting the "-batchlen" and "-threads" parameters. "-threads" should be set to the number of cores you want to use. "-batchlen" must be as large as "-threads." The larger the value is, the faster multi-threaded processing will be, but very large values might cause a small decrease in accuracy. It is likely that a value 10-40 times the number of threads would be appropriate. For example, if you want to use 4 cores, you can set "-threads 4 -batchlen 40".
Also, if you cannot use multiple cores and you want a speed up at the possible small drop in alignment accuracy, you can reduce the size of the probability beam using the -probwidth option. The default is "-probwidth 1e-4" so if you set this to "-probwidth 1e-3" instead, you will likely see a significant speed up.
Finally, using the "-viterbi" option can speed up alignment significantly (about 2.5 times?). I have not done a complete test of how this affects accuracy, but it will probably drop a little.
If you checked the source directly out from the repository, you may have to run autotools before building. First, make sure autotools is installed on your computer, then run "autoreconf -i" which will prepare the configure file for you.
If you have a question about pialign, please submit it to the pialign-users mailing list. (If you don't get a response, you can also try contacting Graham at neubig at gmail dot com).