
CometKiwi


NOTE

This is NOT an exact replication of Unbabel's CometKiwi.

What is common:
- We use the same base model: xlm-roberta-xl or xlm-roberta-xxl encoders.
- We use the same dataset (1720-da.mlqe).

You can get the dataset used to train the model below here: https://huggingface.co/eole-nlp/cometkiwi-xxl-eole/

What is different:
- wmt23-cometkiwi-da-xl (or xxl) uses layerwise attention, which adds complexity without significantly better accuracy.
- We use GELU instead of Tanh in the Estimator.

Our scores for XL and XXL are in the same range, whereas they differ significantly between Unbabel/wmt23-cometkiwi-XL and XXL.

To make your life easier, run these commands from the recipe directory (here recipes/cometkiwi).

Retrieve and convert model

Set environment variables

export EOLE_MODEL_DIR=<where_to_store_models>

Download and convert the base model

eole convert HF --model_dir facebook/xlm-roberta-xxl --output $EOLE_MODEL_DIR/xlm-roberta-xxl-eole
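If you prefer the smaller XL encoder, the same command applies; the facebook/xlm-roberta-xl model id and output directory name below are the natural substitutions, not taken verbatim from the recipe:

eole convert HF --model_dir facebook/xlm-roberta-xl --output $EOLE_MODEL_DIR/xlm-roberta-xl-eole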

NOTE The original facebook model is stored in FP32, but we convert it to FP16 during conversion.

XXL has 10.7B parameters, hence saves a 21.4GB file on disk (safetensors format): 10.7B parameters × 2 bytes per FP16 weight ≈ 21.4GB. XL has 3.5B parameters, hence saves a 7.0GB file.

After conversion, in $EOLE_MODEL_DIR/xlm-roberta-xxl-eole you will get the following files:
- config.json
- model.00.safetensors
- sentencepiece.bpe.model
- vocab.json
- vocab.txt
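You can quickly check the conversion output with a listing:

ls $EOLE_MODEL_DIR/xlm-roberta-xxl-eole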

The vocab.txt file contains 250000 entries (from sentencepiece), but the model was trained with extra tokens. You need to concatenate the added_vocab.txt file to the vocab.txt file, resulting in 250880 tokens.
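A plain shell concatenation is enough; the example below assumes added_vocab.txt sits in the recipe directory and vocab.txt is the one produced by the conversion, so adjust the paths to your setup:

cat added_vocab.txt >> $EOLE_MODEL_DIR/xlm-roberta-xxl-eole/vocab.txt
wc -l $EOLE_MODEL_DIR/xlm-roberta-xxl-eole/vocab.txt   # expect 250880 lines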

Training cometkiwi

Training happens in two steps (see the cometkiwi-xxl-eole.yaml file). FYI, the trained model can be downloaded here: https://huggingface.co/eole-nlp/cometkiwi-xxl-eole/

STEP 1: We train from the converted xlm-roberta-xxl model but keep everything frozen; we just add an Estimator layer that is trained for 4000 steps. To make sure we do not lose anything, we rename the resulting subfolder to step_4000_pretrain (see the command below). In this step we do NOT use LoRA, but we use 4-bit quantization to make things easier and fit on smaller cards.
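A plain mv does the renaming; the directory names below assume the step 1 checkpoint is saved under ./cometkiwi-xxl-eole/step_4000, which may differ in your run:

mv ./cometkiwi-xxl-eole/step_4000 ./cometkiwi-xxl-eole/step_4000_pretrain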

STEP 2: We unfreeze the encoder to make its weights trainable, hence we need LoRA, since the model is big and full gradients would not fit on consumer-grade cards. We train for 40000 steps.

For the two steps above, the training command is the same, but make sure the yaml file is modified according to the instructions:

eole train --config cometkiwi-xxl-eole.yaml

After step 2, we need to merge the LoRA weights into the base model with the following command:

eole model lora --base_model "./cometkiwi-xxl-eole/step_4000_pretrain/" --lora_weights "./cometkiwi-xxl-eole/" --output "./cometkiwi-xxl-eole/merged"

Cometkiwi Inference

Combine the source / target files you want to score into the Comet format:

./combine.sh newstest2014.en newstest2014.de newstest2014.comb
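If combine.sh is not at hand, the .comb file can in principle be built by tab-joining source and target; this assumes the Comet format expected here is one source<TAB>target pair per line, which you should verify against the recipe's combine.sh:

paste newstest2014.en newstest2014.de > newstest2014.comb   # assumption: tab-separated source/target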

Score the .comb file:

eole predict --config cometkiwi-xxl-inference.yaml --src newstest2014.comb --output newstest2014.scores --with_score

For now the scores are in the third column of the output file, so you can extract them with cut -f3.
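For instance (the output filename is just an illustration):

cut -f3 newstest2014.scores > newstest2014.qe-scores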