Quickstart
How to train a model from scratch
Step 1: Prepare the data
To get started, download a toy English-German dataset for machine translation containing 10k tokenized sentences:
wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
tar xf toy-ende.tar.gz
The data consists of parallel source (src) and target (tgt) files containing one sentence per line, with tokens separated by a space:
src-train.txt
tgt-train.txt
src-val.txt
tgt-val.txt
Validation files are used to evaluate the convergence of the training. They usually contain no more than 5k sentences.
$ head -n 2 toy-ende/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament 's legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
We need to build a YAML configuration file to specify the data that will be used:
# toy_en_de.yaml
## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False
# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt
From this configuration, we can build the vocab(s) that will be needed to train the model:
eole build_vocab -config toy_en_de.yaml -n_sample 10000
Notes:
- -n_sample is advised here: it represents the number of lines sampled from each corpus to build the vocab.
- This configuration is the simplest possible, without any tokenization or other transforms. See recipes for more complex pipelines.
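If you want to check what was produced, you can peek at the generated vocab files (a quick sketch; we assume here the default layout of one token per line, possibly followed by its count):
head -n 5 toy-ende/run/example.vocab.src
head -n 5 toy-ende/run/example.vocab.tgt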
Step 2: Train the model
To train a model, we need to add the following to the YAML configuration file:
- the vocabulary path(s) that will be used: these can be the ones generated by eole build_vocab;
- training specific parameters.
# toy_en_de.yaml
# Model architecture
model:
    architecture: transformer
# Train on a single GPU
training:
    world_size: 1
    gpu_ranks: [0]
    model_path: toy-ende/run/model
    save_checkpoint_steps: 500
    train_steps: 1000
    valid_steps: 500
    # adapt dataloading defaults to very small dataset
    bucket_size: 1000
Then you can simply run:
eole train -config toy_en_de.yaml
This configuration will run a default transformer model. It will run on a single GPU (world_size 1 and gpu_ranks [0]).
Before the training process actually starts, it is possible to dump transformed samples so they can be visually inspected if needed. The number of sample lines to dump per corpus is set with the -n_sample flag.
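For instance, the following sketch assumes that eole train accepts the same -n_sample flag as eole build_vocab (check eole train --help for your version):
# dump 3 transformed sample lines per corpus for inspection before training
eole train -config toy_en_de.yaml -n_sample 3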
Step 3: Translate
eole predict -model_path toy-ende/run/model -src toy-ende/src-test.txt -output toy-ende/pred_1000.txt -gpu 0 -verbose
Now you have a model which you can use to predict on new data. We do this by running beam search. This will output predictions into toy-ende/pred_1000.txt.
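You can then take a quick look at the generated translations with standard shell tools:
head -n 2 toy-ende/pred_1000.txt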
Note:
The predictions are going to be quite terrible, as the demo dataset is small. Try running on some larger datasets!
For example, you can download millions of parallel sentences for translation or summarization.
How to generate with a pretrained LLM
Step 1: Convert a model from Hugging Face Hub
Several converters are provided, either 1) for models from the Hugging Face hub (T5, Falcon, MPT, Openllama, Redpajama, Xgen), or 2) for the legacy Llama checkpoints released by Meta.
See here for the conversion command line.
T5 (and its variant Flan-T5), Llama and Openllama use SentencePiece. The other models use BPE, for which we had to reconstruct the BPE model and vocab file.
Note: providing a Hugging Face repo id is supported in most conversion tools.
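As an illustration only, a Hugging Face conversion call typically looks like the sketch below; the converter name, flags, repo id and paths are assumptions here and may differ in your eole version, so check eole convert --help before running it:
eole convert HF --model_dir meta-llama/Llama-2-7b-hf --output /path_to/llama7B --token $HF_TOKEN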
Step 2: Prepare an inference.yaml config file
Even though it is not mandatory, the best way to run inference is to use a config file; here is an example:
transforms: [sentencepiece]
#### Subword
src_subword_model: "/path_to/llama7B/tokenizer.model"
tgt_subword_model: "/path_to/llama7B/tokenizer.model"
# Model info
model_path: "/path_to/llama7B/llama7B-eole.pt"
# Inference
seed: 42
max_length: 256
gpu: 0
batch_type: sents
batch_size: 1
compute_dtype: fp16
#random_sampling_topk: 40
#random_sampling_topp: 0.75
#random_sampling_temp: 0.1
beam_size: 1
n_best: 1
report_time: true
or similarly for a model using BPE:
transforms: [onmt_tokenize]
#### Subword
src_subword_type: bpe
src_subword_model: "/path_to/mpt7B/mpt-model.bpe"
src_onmttok_kwargs: '{"mode": "conservative"}'
tgt_subword_type: bpe
tgt_subword_model: "/path_to/mpt7B/mpt-model.bpe"
tgt_onmttok_kwargs: '{"mode": "conservative"}'
gpt2_pretok: true
# Model info
model_path: "/path_to/mpt7B/mpt-eole.pt"
# Inference
seed: 42
max_length: 1
gpu: 0
batch_type: sents
batch_size: 1
compute_dtype: fp16
#random_sampling_topk: 40
#random_sampling_topp: 0.75
#random_sampling_temp: 0.8
beam_size: 1
report_time: true
src: None
tgt: None
In this second example, we used max_length: 1 and src: None / tgt: None, which is typically the configuration used in a scoring script like MMLU, where only one generated token is expected as the answer.
WARNING
For inhomogeneous batches with many examples, the potentially high number of padding tokens inserted into the shortest examples leads to degraded results when attention layer quantization and flash attention are activated.
In practice, when batch_size is greater than 1 in the inference configuration file, delete 'linear_values', 'linear_query', 'linear_keys', 'final_linear' from quant_layers and specify self_attn_type: scaled-dot.
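The relevant part of the inference configuration might then look like the following sketch (the remaining quant_layers entries are illustrative placeholders; keep whichever non-attention layers your model actually quantizes):
# quantization kept on non-attention layers only (names below are placeholders)
quant_layers: ['w_1', 'w_2']
# fall back from flash attention to standard scaled dot-product attention
self_attn_type: scaled-dot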
You can run this script with the following command line:
eole tools run_mmlu --config myinference.yaml
Step 3: Generate text
Generating text is also easier with an inference config file (in which you can set max_length or random sampling settings):
eole predict --config /path_to_config/llama7B/llama-inference.yaml --src /path_to_source/input.txt --output /path_to_target/output.txt
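The source file contains one prompt per line; as an assumption here (consistent with the ⦅newline⦆ placeholder mentioned in the finetuning section below), a prompt spanning several lines is flattened with ⦅newline⦆. A hypothetical input.txt could look like:
What is the capital of France ?
Summarize the following text :⦅newline⦆<your text here>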
How to finetune a pretrained LLM
See Llama2 recipe for an end-to-end example.
Note:
If you want to enable the “zero-out prompt loss” mechanism to ignore the prompt when calculating the loss, you can add the insert_mask_before_placeholder transform as well as the zero_out_prompt_loss flag:
transforms: [insert_mask_before_placeholder, sentencepiece, filtertoolong]
zero_out_prompt_loss: true
The default value of the response_pattern used to locate the end of the prompt is “Response : ⦅newline⦆”, but you can choose another one to align it with your training data.
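For instance, if your training data uses a different marker, the pattern can be overridden in the config; a sketch (the pattern string below is just an example):
transforms: [insert_mask_before_placeholder, sentencepiece, filtertoolong]
zero_out_prompt_loss: true
# illustrative custom pattern matching the marker used in your own training data
response_pattern: "Answer :⦅newline⦆"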