Training

pydantic model eole.config.training.OptimizerConfig​

Bases: Config

Everything related to optimizers. Might be split into multiple subclasses later. Note: not fully sufficient (yet) to replace full opt namespace in build_torch_optimizer. Some other parameters (hidden_size, compute_dtype, etc.) are accessed.

Show JSON schema
{
"title": "OptimizerConfig",
"description": "Everything related to optimizers.\nMight be split into multiple subclasses later.\nNote: not fully sufficient (yet) to replace full opt namespace in build_torch_optimizer.\nSome other parameters (hidden_size, compute_dtype, etc.) are accessed.",
"type": "object",
"properties": {
"optim": {
"default": "sgd",
"description": "Optimization method.",
"enum": [
"sgd",
"adagrad",
"adadelta",
"adam",
"adamw",
"sparseadam",
"adafactor",
"adamw8bit",
"pagedadamw8bit",
"pagedadamw32bit"
],
"title": "Optim",
"type": "string"
},
"adagrad_accumulator_init": {
"default": 0,
"description": "Initialize the accumulator values in adagrad. Mirrors initial_accumulator_value flag from tensorflow adagrad implementation (default 0.1 there).",
"title": "Adagrad Accumulator Init",
"type": "number"
},
"adam_beta1": {
"default": 0.9,
"description": "Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.",
"title": "Adam Beta1",
"type": "number"
},
"adam_beta2": {
"default": 0.999,
"description": "Beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). Whereas recently the paper Attention is All You Need suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.",
"title": "Adam Beta2",
"type": "number"
},
"adam_eps": {
"default": 1e-08,
"description": "Adam epsilon to forward to torch Optimizer.",
"title": "Adam Eps",
"type": "number"
},
"weight_decay": {
"default": 0.0,
"description": "Weight decay to forward to torch Optimizer.",
"title": "Weight Decay",
"type": "number"
},
"use_amp": {
"default": true,
"description": "Use torch mixed precision when compute_dtype is 16-bit.",
"title": "Use Amp",
"type": "boolean"
},
"learning_rate": {
"default": 1.0,
"description": "Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.",
"title": "Learning Rate",
"type": "number"
},
"learning_rate_decay": {
"default": 0.5,
"description": "Decay learning rate by this much if steps have gone past start_decay_steps.",
"title": "Learning Rate Decay",
"type": "number"
},
"start_decay_steps": {
"default": 50000,
"description": "Start decaying every decay_steps after this many steps.",
"title": "Start Decay Steps",
"type": "integer"
},
"decay_steps": {
"default": 10000,
"description": "Frequency for learning rate decay, in steps.",
"title": "Decay Steps",
"type": "integer"
},
"decay_method": {
"default": "none",
"description": "Custom decay method to use.",
"enum": [
"noam",
"noamwd",
"cosine",
"rsqrt",
"none"
],
"title": "Decay Method",
"type": "string"
},
"warmup_steps": {
"default": 4000,
"description": "Number of warmup steps for custom decay.",
"title": "Warmup Steps",
"type": "integer"
},
"reset_optim": {
"default": "none",
"description": "Optimization resetter when using train_from.",
"enum": [
"none",
"all",
"states",
"keep_states"
],
"title": "Reset Optim",
"type": "string"
}
},
"additionalProperties": false
}

field adagrad_accumulator_init : float = 0​

Initialize the accumulator values in adagrad. Mirrors initial_accumulator_value flag from tensorflow adagrad implementation (default 0.1 there).

field adam_beta1 : float = 0.9​

Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.

field adam_beta2 : float = 0.999​

Beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). Whereas recently the paper Attention is All You Need suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.

field adam_eps : float = 1e-08​

Adam epsilon to forward to torch Optimizer.

field decay_method : Literal['noam', 'noamwd', 'cosine', 'rsqrt', 'none'] = 'none'​

Custom decay method to use.
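
For reference, the "noam" option conventionally denotes the inverse-square-root schedule with linear warmup from "Attention Is All You Need", driven by warmup_steps and the model's hidden_size (one of the externally accessed parameters mentioned in the class note above). Below is a minimal sketch of that standard formula with illustrative defaults, not a verbatim copy of eole's implementation:

def noam_lr(step, learning_rate=2.0, hidden_size=512, warmup_steps=4000):
    # Standard "noam" schedule: scale by hidden_size**-0.5, warm up linearly
    # for warmup_steps steps, then decay proportionally to step**-0.5.
    step = max(step, 1)
    return learning_rate * hidden_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)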

field decay_steps : int = 10000​

Frequency for learning rate decay, in steps.

field learning_rate : float = 1.0​

Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.

field learning_rate_decay : float = 0.5​

Multiplicative factor applied to the learning rate at each decay, once steps have gone past start_decay_steps.
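
Based on the descriptions of learning_rate_decay, start_decay_steps and decay_steps, the default behaviour (decay_method='none') amounts to a step-wise exponential decay. The sketch below illustrates that behaviour as described here; the exact boundary handling in eole's code may differ:

def stepwise_lr(step, learning_rate=1.0, learning_rate_decay=0.5,
                start_decay_steps=50000, decay_steps=10000):
    # Keep the starting rate until start_decay_steps, then multiply by
    # learning_rate_decay once every decay_steps steps.
    if step < start_decay_steps:
        return learning_rate
    n_decays = (step - start_decay_steps) // decay_steps + 1
    return learning_rate * learning_rate_decay ** n_decays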

field optim : Literal['sgd', 'adagrad', 'adadelta', 'adam', 'adamw', 'sparseadam', 'adafactor', 'adamw8bit', 'pagedadamw8bit', 'pagedadamw32bit'] = 'sgd'​

Optimization method.

field reset_optim : Literal['none', 'all', 'states', 'keep_states'] = 'none'​

Controls which optimizer options and states are reset when using train_from.

field start_decay_steps : int = 50000​

Start decaying the learning rate after this many steps; decay then happens every decay_steps steps.

field use_amp : bool = True​

Use torch mixed precision when compute_dtype is 16-bit.
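
use_amp enables the standard torch automatic mixed precision setup. The sketch below shows the generic torch.amp training-step pattern it refers to (it assumes a CUDA device and is not a quote of eole's trainer loop):

import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler()   # loss scaling is only needed for fp16, not bf16

x = torch.randn(8, 16, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).sum()
scaler.scale(loss).backward()          # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()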

field warmup_steps : int = 4000​

Number of warmup steps for custom decay.

field weight_decay : float = 0.0​

Weight decay to forward to torch Optimizer.
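
Since all fields of OptimizerConfig carry defaults, the model can be instantiated directly to inspect or override a handful of values. A minimal sketch; the chosen values are illustrative, not recommendations:

from eole.config.training import OptimizerConfig

opt_cfg = OptimizerConfig(
    optim="adamw",        # any of the Literal choices listed above
    learning_rate=2e-4,
    adam_beta2=0.998,
    weight_decay=0.01,
    decay_method="noam",
    warmup_steps=6000,
)
print(opt_cfg)
# Unknown keys are rejected, matching "additionalProperties": false in the schema above.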

pydantic model eole.config.training.TrainingConfig​

Bases: RunningConfig, OptimizerConfig, LoRaConfig, QuantizeConfig

Show JSON schema
{
"title": "TrainingConfig",
"type": "object",
"properties": {
"quant_layers": {
"default": [],
"description": "List of layers to be compressed in 4/8bit.",
"items": {
"type": "string"
},
"title": "Quant Layers",
"type": "array"
},
"quant_type": {
"default": "",
"description": "Type of compression.",
"enum": [
"",
"bnb_8bit",
"bnb_FP4",
"bnb_NF4",
"awq_gemm",
"awq_gemv"
],
"title": "Quant Type",
"type": "string"
},
"w_bit": {
"default": 4,
"description": "W_bit quantization",
"title": "W Bit",
"type": "integer"
},
"group_size": {
"default": 128,
"description": "Group size quantization.",
"title": "Group Size",
"type": "integer"
},
"lora_layers": {
"default": [],
"description": "List of layers to be replaced by LoRa layers. E.g. ['linear_values', 'linear_query'] (\u00a74.2 in https://arxiv.org/abs/2106.09685)",
"items": {
"type": "string"
},
"title": "Lora Layers",
"type": "array"
},
"lora_embedding": {
"default": false,
"description": "Replace embeddings with LoRa Embeddings (\u00a75.1)",
"title": "Lora Embedding",
"type": "boolean"
},
"lora_rank": {
"default": 2,
"description": "r=2 successfully tested with NLLB-200 3.3B",
"title": "Lora Rank",
"type": "integer"
},
"lora_alpha": {
"default": 1,
"description": "\u00a74.1 https://arxiv.org/abs/2106.09685",
"title": "Lora Alpha",
"type": "integer"
},
"lora_dropout": {
"default": 0.0,
"description": "Rule of thumb: same value as in main model.",
"title": "Lora Dropout",
"type": "number"
},
"optim": {
"default": "sgd",
"description": "Optimization method.",
"enum": [
"sgd",
"adagrad",
"adadelta",
"adam",
"adamw",
"sparseadam",
"adafactor",
"adamw8bit",
"pagedadamw8bit",
"pagedadamw32bit"
],
"title": "Optim",
"type": "string"
},
"adagrad_accumulator_init": {
"default": 0,
"description": "Initialize the accumulator values in adagrad. Mirrors initial_accumulator_value flag from tensorflow adagrad implementation (default 0.1 there).",
"title": "Adagrad Accumulator Init",
"type": "number"
},
"adam_beta1": {
"default": 0.9,
"description": "Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.",
"title": "Adam Beta1",
"type": "number"
},
"adam_beta2": {
"default": 0.999,
"description": "Beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). Whereas recently the paper Attention is All You Need suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.",
"title": "Adam Beta2",
"type": "number"
},
"adam_eps": {
"default": 1e-08,
"description": "Adam epsilon to forward to torch Optimizer.",
"title": "Adam Eps",
"type": "number"
},
"weight_decay": {
"default": 0.0,
"description": "Weight decay to forward to torch Optimizer.",
"title": "Weight Decay",
"type": "number"
},
"use_amp": {
"default": true,
"description": "Use torch mixed precision when compute_dtype is 16-bit.",
"title": "Use Amp",
"type": "boolean"
},
"learning_rate": {
"default": 1.0,
"description": "Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.",
"title": "Learning Rate",
"type": "number"
},
"learning_rate_decay": {
"default": 0.5,
"description": "Decay learning rate by this much if steps have gone past start_decay_steps.",
"title": "Learning Rate Decay",
"type": "number"
},
"start_decay_steps": {
"default": 50000,
"description": "Start decaying every decay_steps after this many steps.",
"title": "Start Decay Steps",
"type": "integer"
},
"decay_steps": {
"default": 10000,
"description": "Frequency for learning rate decay, in steps.",
"title": "Decay Steps",
"type": "integer"
},
"decay_method": {
"default": "none",
"description": "Custom decay method to use.",
"enum": [
"noam",
"noamwd",
"cosine",
"rsqrt",
"none"
],
"title": "Decay Method",
"type": "string"
},
"warmup_steps": {
"default": 4000,
"description": "Number of warmup steps for custom decay.",
"title": "Warmup Steps",
"type": "integer"
},
"reset_optim": {
"default": "none",
"description": "Optimization resetter when using train_from.",
"enum": [
"none",
"all",
"states",
"keep_states"
],
"title": "Reset Optim",
"type": "string"
},
"gpu_ranks": {
"default": [],
"description": "List of ranks for each process.",
"items": {
"type": "integer"
},
"title": "Gpu Ranks",
"type": "array"
},
"world_size": {
"default": 1,
"description": "Total number of distributed processes.",
"title": "World Size",
"type": "integer"
},
"parallel_mode": {
"default": "data_parallel",
"description": "Distributed mode.",
"enum": [
"data_parallel",
"tensor_parallel"
],
"title": "Parallel Mode",
"type": "string"
},
"gpu_backend": {
"default": "nccl",
"description": "Type of torch distributed backend.",
"title": "Gpu Backend",
"type": "string"
},
"gpu_verbose_level": {
"default": 0,
"description": "Gives more info on each process per GPU.",
"title": "Gpu Verbose Level",
"type": "integer"
},
"master_ip": {
"default": "localhost",
"description": "IP of master for torch.distributed training.",
"title": "Master Ip",
"type": "string"
},
"master_port": {
"default": 10000,
"description": "Port of master for torch.distributed training.",
"title": "Master Port",
"type": "integer"
},
"timeout": {
"default": 60,
"description": "Timeout for one GPU to wait for the others.",
"title": "Timeout",
"type": "integer"
},
"model_path": {
"default": "model",
"description": "Path to directory containing all model components.",
"title": "Model Path",
"type": "string"
},
"self_attn_backend": {
"default": "flash",
"description": "Self-attention backend.",
"enum": [
"flash",
"pytorch"
],
"title": "Self Attn Backend",
"type": "string"
},
"compute_dtype": {
"description": "Compute dtype (precision) to use for main compute. Some parameters might have other dtypes for specific cases (e.g. torch.amp -- See eole.config.training.TrainingConfig.storage_dtype) fp32 to force slow fp16 model on gtx1080, int8 to enable pytorch native 8-bit quantization (cpu only).",
"enum": [
"fp32",
"fp16",
"int8",
"bf16"
],
"title": "Compute Dtype",
"type": "string"
},
"torch_compile": {
"default": false,
"description": "Use torch.compile with dynamic=True.",
"title": "Torch Compile",
"type": "boolean"
},
"param_init": {
"default": 0.1,
"description": "Support value for uniform distribution parameters initialization. Set to 0 not to use initialization.",
"title": "Param Init",
"type": "number"
},
"param_init_method": {
"default": "uniform",
"description": "Parameter initialization method.",
"enum": [
"xavier_uniform",
"uniform",
"normal"
],
"title": "Param Init Method",
"type": "string"
},
"freeze_encoder": {
"default": false,
"description": "Freeze parameters in encoder.",
"title": "Freeze Encoder",
"type": "boolean"
},
"freeze_decoder": {
"default": false,
"description": "Freeze parameters in decoder.",
"title": "Freeze Decoder",
"type": "boolean"
},
"pre_word_vecs_enc": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If a valid path is specified, will load pretrained word embeddings on the encoder side.",
"title": "Pre Word Vecs Enc"
},
"pre_word_vecs_dec": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If a valid path is specified, will load pretrained word embeddings on the decoder side.",
"title": "Pre Word Vecs Dec"
},
"data_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "text",
"title": "Data Type"
},
"bucket_size": {
"default": 262144,
"description": "A bucket is a buffer of bucket_size examples to pick from the various corpora. The dynamic iterator batches batch_size items from the bucket and shuffle them.",
"title": "Bucket Size",
"type": "integer"
},
"bucket_size_init": {
"default": -1,
"description": "Bucket size is initialized with this amount of examples (see bucket_size_increment).",
"title": "Bucket Size Init",
"type": "integer"
},
"bucket_size_increment": {
"default": 0,
"description": "Bucket size incremented with this amount of examples at each new bucket (up to bucket_size).",
"title": "Bucket Size Increment",
"type": "integer"
},
"prefetch_factor": {
"default": 200,
"description": "Number of mini-batches loaded in advance to avoid the GPU waiting during processing of next bucket.",
"title": "Prefetch Factor",
"type": "integer"
},
"save_format": {
"default": "pytorch",
"description": "Format to save the model weights.",
"enum": [
"pytorch",
"safetensors"
],
"title": "Save Format",
"type": "string"
},
"save_checkpoint_steps": {
"default": 5000,
"description": "Frequency of checkpoint saving (in steps).",
"title": "Save Checkpoint Steps",
"type": "integer"
},
"keep_checkpoint": {
"default": -1,
"description": "Number of checkpoints to retain. (-1 retains all)",
"title": "Keep Checkpoint",
"type": "integer"
},
"train_from": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Pretrained model/checkpoint weights to continue training from.",
"title": "Train From"
},
"num_workers": {
"default": 2,
"description": "Number of workers for pytorch.DataLoader objects.",
"title": "Num Workers",
"type": "integer"
},
"batch_size": {
"default": 64,
"description": "Maximum batch size for training.",
"title": "Batch Size",
"type": "integer"
},
"batch_size_multiple": {
"default": 1,
"description": "Batch size multiple for token batches.",
"title": "Batch Size Multiple",
"type": "integer"
},
"batch_type": {
"default": "sents",
"description": "Batch grouping for batch_size.",
"enum": [
"sents",
"tokens"
],
"title": "Batch Type",
"type": "string"
},
"normalization": {
"default": "sents",
"description": "Normalization method of the gradient.",
"enum": [
"sents",
"tokens"
],
"title": "Normalization",
"type": "string"
},
"accum_count": {
"default": [
1
],
"description": "Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for transformer.",
"items": {
"type": "integer"
},
"title": "Accum Count",
"type": "array"
},
"accum_steps": {
"default": [
0
],
"description": "Steps at which accum_count values change.",
"items": {
"type": "integer"
},
"title": "Accum Steps",
"type": "array"
},
"valid_steps": {
"default": 10000,
"description": "Frequency of validation, in steps.",
"title": "Valid Steps",
"type": "integer"
},
"valid_batch_size": {
"default": 32,
"description": "Maximum batch size for validation.",
"title": "Valid Batch Size",
"type": "integer"
},
"train_steps": {
"default": 100000,
"description": "Number of training steps.",
"title": "Train Steps",
"type": "integer"
},
"single_pass": {
"default": false,
"description": "Make a single pass over the training dataset.",
"title": "Single Pass",
"type": "boolean"
},
"early_stopping": {
"default": 0,
"description": "Number of validation steps without improving that will trigger early stop of training.",
"title": "Early Stopping",
"type": "integer"
},
"early_stopping_criteria": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Criteria to use for early stopping.",
"title": "Early Stopping Criteria"
},
"max_grad_norm": {
"default": 5,
"description": "If the norm of the gradient vector exceeds this value, renormalize it to have the norm equal to max_grad_norm.",
"title": "Max Grad Norm",
"type": "number"
},
"dropout": {
"default": [
0.3
],
"description": "Dropout probability.",
"items": {
"type": "number"
},
"title": "Dropout",
"type": "array"
},
"attention_dropout": {
"default": [
0.1
],
"description": "Attention dropout probability.",
"items": {
"type": "number"
},
"title": "Attention Dropout",
"type": "array"
},
"dropout_steps": {
"default": [
0
],
"description": "Steps at which dropout changes.",
"items": {
"type": "integer"
},
"title": "Dropout Steps",
"type": "array"
},
"label_smoothing": {
"default": 0.0,
"description": "Label smoothing value epsilon. Probability of all non-true labels will be smoothed by epsilon/(vocab_size-1). Set to 0 to turn off label smoothing. (https://arxiv.org/abs/1512.00567)",
"title": "Label Smoothing",
"type": "number"
},
"average_decay": {
"default": 0.0,
"description": "Exponential moving average decay (https://en.wikipedia.org/wiki/Moving_average). Set to other than 0 (e.g. 1e-4) to activate. Similar to Marian NMT implementation (http://www.aclweb.org/anthology/P18-4020).",
"title": "Average Decay",
"type": "number"
},
"average_every": {
"default": 1,
"description": "Step for moving average. Default is every update if average_decay is set.",
"title": "Average Every",
"type": "integer"
},
"zero_out_prompt_loss": {
"default": false,
"description": "Set the prompt loss to zero. Mostly for LLM finetuning. Will be enabled only if the `insert_mask_before_placeholder` transform is applied.",
"title": "Zero Out Prompt Loss",
"type": "boolean"
},
"use_ckpting": {
"default": [],
"description": "Use gradient checkpointing for those modules.",
"items": {
"type": "string"
},
"title": "Use Ckpting",
"type": "array"
},
"update_vocab": {
"default": false,
"description": "Update source and target existing vocabularies.",
"title": "Update Vocab",
"type": "boolean"
},
"lm_prior_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "LM model to use to train the TM.",
"title": "Lm Prior Model"
},
"lm_prior_lambda": {
"default": 0.0,
"description": "LM Prior Lambda",
"title": "Lm Prior Lambda",
"type": "number"
},
"lm_prior_tau": {
"default": 1.0,
"description": "LM Prior Tau",
"title": "Lm Prior Tau",
"type": "number"
},
"estim_loss_lambda": {
"default": [
1.0
],
"description": "Weight applied to estimator loss",
"items": {
"type": "number"
},
"title": "Estim Loss Lambda",
"type": "array"
},
"estim_loss_lambda_steps": {
"default": [
0
],
"description": "Steps at which estimator loss lambda changes",
"items": {
"type": "integer"
},
"title": "Estim Loss Lambda Steps",
"type": "array"
},
"score_threshold": {
"default": 0.68,
"description": "Threshold to filterout data",
"title": "Score Threshold",
"type": "number"
},
"log_attention_entropy": {
"default": true,
"description": "Whether to compute and log attention entropy during training.",
"title": "Log Attention Entropy",
"type": "boolean"
},
"attention_entropy_types": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Which attention types to compute entropy for. If None, computes for all available types (e.g., ['std', 'self', 'context']).",
"title": "Attention Entropy Types"
},
"attention_entropy_layers": {
"anyOf": [
{
"items": {
"type": "integer"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Which attention layer indices to include in entropy computation. If None, includes all layers.",
"title": "Attention Entropy Layers"
},
"attention_entropy_aggregation": {
"default": "mean",
"description": "How to aggregate attention entropy across different attention types/layers.",
"enum": [
"mean",
"max",
"min"
],
"title": "Attention Entropy Aggregation",
"type": "string"
}
},
"additionalProperties": false
}

field accum_count : List[int] = [1]​

Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for transformer.

  • Validated by:
    • _validate_running_config
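
As a rough illustration of how accum_count combines with batch_size, batch_type and world_size, the effective batch per optimizer update is approximately the product of the three (the numbers below are purely illustrative):

# Approximate effective batch size per parameter update (illustrative values).
batch_size = 4096      # with batch_type="tokens"
accum_count = 4        # gradients accumulated over 4 mini-batches
world_size = 2         # data_parallel processes
effective_tokens_per_update = batch_size * accum_count * world_size
print(effective_tokens_per_update)  # 32768 tokens per update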

field accum_steps : List[int] = [0]​

Steps at which accum_count values change.

  • Validated by:
    • _validate_running_config

field attention_dropout : List[float] = [0.1]​

Attention dropout probability.

  • Validated by:
    • _validate_running_config

field attention_entropy_aggregation : Literal['mean', 'max', 'min'] = 'mean'​

How to aggregate attention entropy across different attention types/layers.

  • Validated by:
    • _validate_running_config

field attention_entropy_layers : List[int] | None = None​

Which attention layer indices to include in entropy computation. If None, includes all layers.

  • Validated by:
    • _validate_running_config

field attention_entropy_types : List[str] | None = None​

Which attention types to compute entropy for. If None, computes for all available types (e.g., ['std', 'self', 'context']).

  • Validated by:
    • _validate_running_config

field average_decay : float = 0.0​

Exponential moving average decay (https://en.wikipedia.org/wiki/Moving_average). Set to other than 0 (e.g. 1e-4) to activate. Similar to Marian NMT implementation (http://www.aclweb.org/anthology/P18-4020).

  • Validated by:
    • _validate_running_config

field average_every : int = 1​

Step for moving average. Default is every update if average_decay is set.

  • Validated by:
    • _validate_running_config

field batch_size : int = 64​

Maximum batch size for training.

  • Validated by:
    • _validate_running_config

field batch_size_multiple : int = 1​

Batch size multiple for token batches.

  • Validated by:
    • _validate_running_config

field batch_type : Literal['sents', 'tokens'] = 'sents'​

Batch grouping for batch_size.

  • Validated by:
    • _validate_running_config

field bucket_size : int = 262144​

A bucket is a buffer of bucket_size examples to pick from the various corpora. The dynamic iterator batches batch_size items from the bucket and shuffles them.

  • Validated by:
    • _validate_running_config

field bucket_size_increment : int = 0​

Bucket size is incremented by this number of examples at each new bucket (up to bucket_size).

  • Validated by:
    • _validate_running_config

field bucket_size_init : int = -1​

Bucket size is initialized with this number of examples (see bucket_size_increment).

  • Validated by:
    • _validate_running_config
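
Taken together, bucket_size_init, bucket_size_increment and bucket_size describe a buffer that starts small and grows. The sketch below shows how the bucket size would evolve under these descriptions (an illustration, not the iterator's actual code):

def bucket_sizes(bucket_size=262144, bucket_size_init=-1, bucket_size_increment=0):
    # Start at bucket_size_init (or directly at bucket_size when left at -1)
    # and grow by bucket_size_increment per bucket, capped at bucket_size.
    current = bucket_size if bucket_size_init < 0 else bucket_size_init
    while True:
        yield min(current, bucket_size)
        current = min(current + bucket_size_increment, bucket_size)

gen = bucket_sizes(bucket_size=100000, bucket_size_init=10000, bucket_size_increment=30000)
print([next(gen) for _ in range(5)])  # [10000, 40000, 70000, 100000, 100000]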

field data_type : str | None = 'text'​

  • Validated by:
    • _validate_running_config

field dropout : List[float] = [0.3]​

Dropout probability.

  • Validated by:
    • _validate_running_config

field dropout_steps : List[int] = [0]​

Steps at which dropout changes.

  • Validated by:
    • _validate_running_config
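
dropout, attention_dropout and dropout_steps are parallel lists: the value at index i takes effect from the step given at index i of dropout_steps (the same pairing applies to accum_count/accum_steps and estim_loss_lambda/estim_loss_lambda_steps). A minimal lookup sketch assuming that pairing, not the trainer's actual scheduler code:

def scheduled_value(step, values, steps):
    # Return the last value whose start step is <= the current step, e.g.
    # dropout=[0.3, 0.1] with dropout_steps=[0, 20000] keeps 0.3 until
    # step 20000 and uses 0.1 afterwards.
    current = values[0]
    for value, start in zip(values, steps):
        if step >= start:
            current = value
    return current

print(scheduled_value(15000, [0.3, 0.1], [0, 20000]))  # 0.3
print(scheduled_value(25000, [0.3, 0.1], [0, 20000]))  # 0.1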

field early_stopping : int = 0​

Number of validation steps without improvement that will trigger early stopping of training.

  • Validated by:
    • _validate_running_config

field early_stopping_criteria : str | None = None​

Criteria to use for early stopping.

  • Validated by:
    • _validate_running_config

field estim_loss_lambda : List[float] = [1.0]​

Weight applied to estimator loss

  • Validated by:
    • _validate_running_config

field estim_loss_lambda_steps : List[int] = [0]​

Steps at which estimator loss lambda changes

  • Validated by:
    • _validate_running_config

field freeze_decoder : bool = False​

Freeze parameters in decoder.

  • Validated by:
    • _validate_running_config

field freeze_encoder : bool = False​

Freeze parameters in encoder.

  • Validated by:
    • _validate_running_config

field keep_checkpoint : int = -1​

Number of checkpoints to retain. (-1 retains all)

  • Validated by:
    • _validate_running_config

field label_smoothing : float = 0.0​

Label smoothing value epsilon. Probability of all non-true labels will be smoothed by epsilon/(vocab_size-1). Set to 0 to turn off label smoothing. (https://arxiv.org/abs/1512.00567)

  • Validated by:
    • _validate_running_config
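
A short worked example of the formula above, with hypothetical sizes:

# Worked example of the smoothing formula (hypothetical vocabulary size).
label_smoothing = 0.1
vocab_size = 32000
per_wrong_label = label_smoothing / (vocab_size - 1)  # ~3.1e-06 probability mass each
true_label_prob = 1.0 - label_smoothing               # 0.9 for the true label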

field lm_prior_lambda : float = 0.0​

LM Prior Lambda

  • Validated by:
    • _validate_running_config

field lm_prior_model : str | None = None​

LM model to use to train the TM.

  • Validated by:
    • _validate_running_config

field lm_prior_tau : float = 1.0​

LM Prior Tau

  • Validated by:
    • _validate_running_config

field log_attention_entropy : bool = True​

Whether to compute and log attention entropy during training.

  • Validated by:
    • _validate_running_config

field max_grad_norm : float = 5​

If the norm of the gradient vector exceeds this value, renormalize it to have the norm equal to max_grad_norm.

  • Validated by:
    • _validate_running_config
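
This renormalization is the standard gradient-norm clipping operation; below is a minimal sketch of the equivalent torch call (illustrative, not a quote of eole's trainer):

import torch

model = torch.nn.Linear(8, 8)          # stand-in module for illustration
loss = model(torch.randn(4, 8)).sum()
loss.backward()
# Rescale gradients so their global norm does not exceed max_grad_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)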

field normalization : Literal['sents', 'tokens'] = 'sents'​

Normalization method of the gradient.

  • Validated by:
    • _validate_running_config

field num_workers : int = 2​

Number of workers for the PyTorch DataLoader objects.

  • Validated by:
    • _validate_running_config

field param_init : float = 0.1​

Support value for uniform-distribution parameter initialization. Set to 0 to disable initialization.

  • Validated by:
    • _validate_running_config

field param_init_method : Literal['xavier_uniform', 'uniform', 'normal'] = 'uniform'​

Parameter initialization method.

  • Validated by:
    • _validate_running_config

field pre_word_vecs_dec : str | None = None​

If a valid path is specified, will load pretrained word embeddings on the decoder side.

  • Validated by:
    • _validate_running_config

field pre_word_vecs_enc : str | None = None​

If a valid path is specified, will load pretrained word embeddings on the encoder side.

  • Validated by:
    • _validate_running_config

field prefetch_factor : int = 200​

Number of mini-batches loaded in advance to avoid the GPU waiting while the next bucket is processed.

  • Validated by:
    • _validate_running_config

field save_checkpoint_steps : int = 5000​

Frequency of checkpoint saving (in steps).

  • Validated by:
    • _validate_running_config

field save_format : Literal['pytorch', 'safetensors'] = 'pytorch'​

Format to save the model weights.

  • Validated by:
    • _validate_running_config

field score_threshold : float = 0.68​

Threshold to filter out data.

  • Validated by:
    • _validate_running_config

field single_pass : bool = False​

Make a single pass over the training dataset.

  • Validated by:
    • _validate_running_config

field train_from : str | None = None​

Pretrained model/checkpoint weights to continue training from.

  • Validated by:
    • _validate_running_config

field train_steps : int = 100000​

Number of training steps.

  • Validated by:
    • _validate_running_config

field update_vocab : bool = False​

Update existing source and target vocabularies.

  • Validated by:
    • _validate_running_config

field use_ckpting : List[str] = []​

Use gradient checkpointing for those modules.

field valid_batch_size : int = 32​

Maximum batch size for validation.

  • Validated by:
    • _validate_running_config

field valid_steps : int = 10000​

Frequency of validation, in steps.

  • Validated by:
    • _validate_running_config

field zero_out_prompt_loss : bool = False​

Set the prompt loss to zero. Mostly for LLM finetuning. Will be enabled only if the insert_mask_before_placeholder transform is applied.

  • Validated by:
    • _validate_running_config

validator checkpointing_layers » use_ckpting

get_model_path()​

property storage_dtype : dtype​

Deduce which dtype to use for main model parameters. E.g. with mixed precision a copy is kept in float32.
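
A minimal sketch of what this implies under the usual torch.amp arrangement, assuming that 16-bit compute with use_amp keeps float32 master parameters; the exact rules live in the property's implementation:

import torch

# Illustrative mapping only; consult TrainingConfig.storage_dtype for the real rules.
def storage_dtype_sketch(compute_dtype: str, use_amp: bool) -> torch.dtype:
    if use_amp and compute_dtype in ("fp16", "bf16"):
        # Mixed precision: compute in 16-bit, keep master weights in float32.
        return torch.float32
    return {"fp32": torch.float32, "fp16": torch.float16,
            "bf16": torch.bfloat16}.get(compute_dtype, torch.float32)

print(storage_dtype_sketch("fp16", use_amp=True))  # torch.float32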