Can I get word alignments while translating?

Raw alignments from averaging Transformer attention heads

Word alignment during translation is currently supported for Transformer-based models. Passing -report_align to translate.py outputs the inferred alignments in Pharaoh format. These alignments are computed by taking an argmax over the average of the attention heads of the second-to-last decoder layer. The resulting src-tgt alignment (Pharaoh) is appended to the translated sentence, separated by |||. Note: the second-to-last default behaviour was determined empirically. It is not the same as in the paper (they take the penultimate layer), probably because of slight differences in the architecture.
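The averaging and argmax step can be illustrated with a small pure-Python sketch. This is not the library's code: the function names and the nested-list layout `attn[head][tgt][src]` are assumptions made for illustration, and the sketch assumes one source index is picked per target position.

```python
def average_heads(attn):
    """Average attention weights over heads.

    attn[h][t][s] is the weight head h puts on source position s
    when producing target position t (hypothetical layout).
    """
    num_heads = len(attn)
    tgt_len = len(attn[0])
    src_len = len(attn[0][0])
    return [
        [sum(attn[h][t][s] for h in range(num_heads)) / num_heads
         for s in range(src_len)]
        for t in range(tgt_len)
    ]


def argmax_alignment(avg_attn):
    """Build a Pharaoh string: for each target position, the best source index."""
    pairs = []
    for t, row in enumerate(avg_attn):
        s = max(range(len(row)), key=row.__getitem__)
        pairs.append(f"{s}-{t}")
    return " ".join(pairs)


# Toy input: 2 heads, 3 target positions, 2 source positions.
attn = [
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],
    [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]],
]
alignment = argmax_alignment(average_heads(attn))
print(alignment)  # 0-0 1-1 1-2
```

In the toy input the two heads disagree on the last target position; averaging resolves it in favour of source position 1.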

Alignments use the standard "Pharaoh format", where a pair i-j indicates that the i-th word of the source sentence is aligned to the j-th word of the target sentence. Indices start at 0.
Example: {'src': 'das stimmt nicht !'; 'output': 'that is not true ! ||| 0-0 0-1 1-2 2-3 1-4 1-5 3-6'}
With the -tgt and -gold_align options, translate.py outputs alignments between the source and the gold target rather than the inferred target, which is the setting used for evaluation.
To convert subword alignments to word alignments, or to symmetrize bidirectional alignments, please refer to the lilt scripts.
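For downstream processing, a line of -report_align output can be split back into its translation and its alignment pairs. The helper below is a minimal sketch; `parse_translation_line` is a hypothetical name, not part of the toolkit, and it assumes whitespace-separated tokens.

```python
def parse_translation_line(line):
    """Split 'translation ||| i-j i-j ...' into tokens and (src, tgt) index pairs."""
    text, sep, align = line.partition("|||")
    tokens = text.split()
    pairs = []
    if sep:  # the separator is absent when no alignment was reported
        for item in align.split():
            src, tgt = item.split("-")
            pairs.append((int(src), int(tgt)))
    return tokens, pairs


tokens, pairs = parse_translation_line(
    "that is not true ! ||| 0-0 0-1 1-2 2-3 1-4 1-5 3-6"
)
print(tokens)
print(pairs)
```

Each tuple keeps the Pharaoh order, source index first and target index second.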

Supervised learning on a specific head

The quality of the output alignments can be further improved by providing reference alignments during training. This invokes multi-task learning on translation and alignment. The implementation is based on the paper "Jointly Learning to Align and Translate with Transformer Models".

The data need to be preprocessed with the reference alignments in order to learn the supervised task. The reference alignment file(s) can for instance be generated by GIZA++ or fast_align.

In order to learn the supervised task, you can set for each dataset the path of its alignment file in the YAML configuration file:

<your_config>.yaml

...

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train1.txt
        path_tgt: toy-ende/tgt-train1.txt
        # src - tgt alignments in pharaoh format
        path_align: toy-ende/src-tgt.align
        transforms: []
        weight: 1
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt
        transforms: []

...

Notes:

Most transforms are currently incompatible with the joint alignment learning pipeline, because they modify the data at the token level, which would invalidate the alignments.
There should be no blank lines in the alignment files provided.
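Before training, it can help to sanity-check an alignment file against its corpus. The sketch below is an illustration, not a toolkit utility: `check_alignments` is a hypothetical helper, and it assumes one alignment line per sentence pair, whitespace tokenization, and zero-based indices.

```python
def check_alignments(src_lines, tgt_lines, align_lines):
    """Return a list of (line_number, message) problems; empty if none found."""
    problems = []
    if not (len(src_lines) == len(tgt_lines) == len(align_lines)):
        problems.append((0, "files have different numbers of lines"))
    rows = zip(src_lines, tgt_lines, align_lines)
    for n, (src, tgt, align) in enumerate(rows, 1):
        if not align.strip():
            problems.append((n, "blank alignment line"))
            continue
        src_len, tgt_len = len(src.split()), len(tgt.split())
        for pair in align.split():
            try:
                i, j = (int(x) for x in pair.split("-"))
            except ValueError:
                problems.append((n, f"malformed pair {pair!r}"))
                continue
            # Each index must point at an existing token.
            if not (0 <= i < src_len and 0 <= j < tgt_len):
                problems.append((n, f"pair {pair} out of range"))
    return problems


src = ["das stimmt nicht !", "hallo welt"]
tgt = ["that is not true !", "hello world"]
align = ["0-0 1-3 2-2 3-4", ""]
print(check_alignments(src, tgt, align))
```

To check real files, read each one with `open(path, encoding="utf-8").read().splitlines()` and pass the three lists to the function.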

Training options to learn such alignments are:

-lambda_align: set to a value > 0.0 to enable joint alignment training; the paper suggests 0.05;
-alignment_layer: the index of the decoder layer to use for alignment;
-alignment_heads: number of alignment heads for the alignment task; should be set to 1 for the supervised task, and preferably kept at the default (or the same as num_heads) for the average task;
-full_context_alignment: do a full-context decoder pass (no future mask) when computing alignments. This slows down training (~12% in terms of tok/s) but produces better alignments.
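The role of -lambda_align can be pictured as a weight on an extra alignment term added to the translation loss. The sketch below is a simplified illustration under stated assumptions, not the toolkit's loss code: the function names are hypothetical, the alignment term is taken to be the negative log attention probability of the supervised head at the reference links, and averaging over the links is an arbitrary normalization choice.

```python
import math


def alignment_loss(attn, ref_pairs):
    """Mean negative log attention probability at the reference links.

    attn[t][s]: attention of the supervised head from target position t
    to source position s (hypothetical layout).
    ref_pairs: reference alignment as (src, tgt) index pairs.
    """
    eps = 1e-18  # avoid log(0)
    total = sum(-math.log(max(attn[t][s], eps)) for s, t in ref_pairs)
    return total / len(ref_pairs)


def joint_loss(translation_loss, align_loss, lambda_align=0.05):
    """Combine both tasks; lambda_align = 0.0 disables the alignment term."""
    return translation_loss + lambda_align * align_loss


attn = [[0.5, 0.5], [0.25, 0.75]]
ref_pairs = [(0, 0), (1, 1)]
loss = joint_loss(2.0, alignment_loss(attn, ref_pairs))
print(loss)
```

The more attention mass the supervised head places on the reference links, the smaller the alignment term becomes, so it only pulls the total loss up when the head attends elsewhere.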
