Configuration

One of Eole's core principles is structured configuration built on Pydantic models. This centralizes the validation of the many available parameters and lets objects and scopes nest cleanly. The number of options can be overwhelming at first, but that is the price of proper structure and modularity.
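
To make the idea of nested, centrally validated configuration more concrete, here is a minimal sketch of how Pydantic models can nest and reject bad values. The class names (`EmbeddingsSketch`, `ModelSketch`, `TrainingSketch`, `RunSketch`) and the chosen constraints are hypothetical and only mirror a few fields of the example configuration; they are not Eole's actual config classes.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


# Hypothetical models for illustration only, not Eole's real classes.
class EmbeddingsSketch(BaseModel):
    word_vec_size: int = Field(512, gt=0)
    position_encoding: bool = False


class ModelSketch(BaseModel):
    architecture: str = "transformer_lm"
    layers: int = Field(6, gt=0)
    heads: int = Field(8, gt=0)
    hidden_size: int = Field(512, gt=0)
    # Nested scope: a plain dict is validated into an EmbeddingsSketch.
    embeddings: EmbeddingsSketch = Field(default_factory=EmbeddingsSketch)


class TrainingSketch(BaseModel):
    batch_size: int = Field(2048, gt=0)
    dropout: List[float] = Field(default_factory=lambda: [0.1])


class RunSketch(BaseModel):
    seed: int = 42
    model: ModelSketch = Field(default_factory=ModelSketch)
    training: TrainingSketch = Field(default_factory=TrainingSketch)


# A plain nested dict, shaped like what a YAML loader would return.
raw = {
    "seed": 42,
    "model": {
        "layers": 6,
        "heads": 8,
        "hidden_size": 512,
        "embeddings": {"word_vec_size": 512, "position_encoding": True},
    },
    "training": {"batch_size": 2048, "dropout": [0.1]},
}

config = RunSketch(**raw)
print(config.model.embeddings.word_vec_size)

# An invalid value deep in the tree is reported with its full path.
error_locations = []
try:
    RunSketch(**{**raw, "model": {"layers": 0}})
except ValidationError as err:
    error_locations = [error["loc"] for error in err.errors()]
    print(error_locations)
```

The point of the sketch is that a single top-level object validates every nested scope in one pass, so a mistake in any section is caught before the run starts.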

Here is an example configuration to train a GPT-style language model:

# General data/vocab/run related settings
seed: 42
save_data: test_save_data
src_vocab_size: 60000
tgt_vocab_size: 60000
share_vocab: true
src_vocab: my_vocab.txt
report_every: 100 # report stats every 100 steps

# datasets
data:
    # training sets can be numerous, and named anything
    corpus_1:
        path_src: my_training_set.txt
    # single validation set, always named "valid"
    valid:
        path_src: my_validation_set.txt

# default transforms, in application order
transforms: [onmt_tokenize, filtertoolong]
# transforms configuration
transforms_configs:
  onmt_tokenize:
    src_subword_type: bpe
    src_subword_model: my_subwords_model.bpe
    src_onmttok_kwargs: {"mode": "aggressive", "joiner_annotate": True,
      "preserve_placeholders": True, "case_markup": True,
      "soft_case_regions": True, "preserve_segmented_tokens": True}
  filtertoolong:
    src_seq_length: 512
    tgt_seq_length: 512

# model architecture configuration
model:
    architecture: "transformer_lm"
    layers: 6
    heads: 8
    hidden_size: 512
    transformer_ff: 2048
    embeddings:
        word_vec_size: 512
        position_encoding: true

# training routine configuration
training:
    # Train on a single GPU
    world_size: 1
    gpu_ranks: [0]
    # Batching
    batch_size: 2048
    batch_type: tokens
    # Optimizer
    model_dtype: "fp32"
    optim: "adam"
    learning_rate: 2
    warmup_steps: 8000
    decay_method: "noam"
    adam_beta2: 0.998
    # Hyperparams
    dropout_steps: [0]
    dropout: [0.1]
    attention_dropout: [0.1]
    max_grad_norm: 0
    label_smoothing: 0.1
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"
    # Where to save the checkpoints (creates a directory)
    model_path: my_model
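    # Step intervals (small values for a quick test run; valid_steps exceeds train_steps here, so no validation is expected to run)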
    save_checkpoint_steps: 10
    train_steps: 50
    valid_steps: 500
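
A `learning_rate` of 2 may look surprising next to `optim: "adam"`. The sketch below assumes that `decay_method: "noam"` follows the standard Noam schedule, with `learning_rate` acting as a multiplier and `hidden_size` as the model dimension; this is an assumption about Eole's behavior, not something confirmed by this page. The function name `noam_learning_rate` is hypothetical.

```python
def noam_learning_rate(step, learning_rate=2.0, warmup_steps=8000, hidden_size=512):
    """Learning rate at a 1-based step under the (assumed) Noam schedule.

    The rate grows linearly during warmup, peaks at `warmup_steps`,
    then decays with the inverse square root of the step.
    """
    scale = hidden_size ** -0.5
    return learning_rate * scale * min(step ** -0.5, step * warmup_steps ** -1.5)


# Defaults mirror the example configuration above.
for step in (1, 50, 4000, 8000, 16000):
    print(f"step {step:>6}: lr = {noam_learning_rate(step):.2e}")
```

Under these assumptions the effective learning rate stays well below the configured value of 2, which acts as a scaling factor rather than the rate applied to the optimizer directly.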