Training
pydantic model eole.config.training.OptimizerConfig
Bases: Config
Everything related to optimizers. Might be split into multiple subclasses later. Note: this config is not (yet) sufficient to replace the full opt namespace in build_torch_optimizer, which also accesses other parameters (hidden_size, compute_dtype, apex_opt_level, etc.).
JSON schema:
{
"title": "OptimizerConfig",
"description": "Everything related to optimizers.\nMight be split into multiple subclasses later.\nNote: not fully sufficient (yet) to replace full opt namespace in build_torch_optimizer.\nSome other parameters (hidden_size, compute_dtype, apex_opt_level, etc.) are accessed.",
"type": "object",
"properties": {
"optim": {
"default": "sgd",
"description": "Optimization method.",
"enum": [
"sgd",
"adagrad",
"adadelta",
"adam",
"adamw",
"sparseadam",
"adafactor",
"fusedadam",
"adamw8bit",
"pagedadamw8bit",
"pagedadamw32bit"
],
"title": "Optim",
"type": "string"
},
"adagrad_accumulator_init": {
"default": 0,
"description": "Initialize the accumulator values in adagrad. Mirrors initial_accumulator_value flag from tensorflow adagrad implementation (default 0.1 there).",
"title": "Adagrad Accumulator Init",
"type": "number"
},
"adam_beta1": {
"default": 0.9,
"description": "Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.",
"title": "Adam Beta1",
"type": "number"
},
"adam_beta2": {
"default": 0.999,
"description": "Beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). Whereas recently the paper Attention is All You Need suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.",
"title": "Adam Beta2",
"type": "number"
},
"weight_decay": {
"default": 0.0,
"description": "Weight decay to forward to torch Optimizer.",
"title": "Weight Decay",
"type": "number"
},
"learning_rate": {
"default": 1.0,
"description": "Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.",
"title": "Learning Rate",
"type": "number"
},
"learning_rate_decay": {
"default": 0.5,
"description": "Decay learning rate by this much if steps have gone past start_decay_steps.",
"title": "Learning Rate Decay",
"type": "number"
},
"start_decay_steps": {
"default": 50000,
"description": "Start decaying every decay_steps after this many steps.",
"title": "Start Decay Steps",
"type": "integer"
},
"decay_steps": {
"default": 10000,
"description": "Frequency for learning rate decay, in steps.",
"title": "Decay Steps",
"type": "integer"
},
"decay_method": {
"default": "none",
"description": "Custom decay method to use.",
"enum": [
"noam",
"noamwd",
"cosine",
"rsqrt",
"none"
],
"title": "Decay Method",
"type": "string"
},
"warmup_steps": {
"default": 4000,
"description": "Number of warmup steps for custom decay.",
"title": "Warmup Steps",
"type": "integer"
},
"reset_optim": {
"default": "none",
"description": "Optimization resetter when using train_from.",
"enum": [
"none",
"all",
"states",
"keep_states"
],
"title": "Reset Optim",
"type": "string"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
adagrad_accumulator_init (float)
adam_beta1 (float)
adam_beta2 (float)
decay_method (Literal['noam', 'noamwd', 'cosine', 'rsqrt', 'none'])
decay_steps (int)
learning_rate (float)
learning_rate_decay (float)
optim (Literal['sgd', 'adagrad', 'adadelta', 'adam', 'adamw', 'sparseadam', 'adafactor', 'fusedadam', 'adamw8bit', 'pagedadamw8bit', 'pagedadamw32bit'])
reset_optim (Literal['none', 'all', 'states', 'keep_states'])
start_decay_steps (int)
warmup_steps (int)
weight_decay (float)
field adagrad_accumulator_init : float = 0
Initial value of the accumulators in Adagrad. Mirrors the initial_accumulator_value flag of the TensorFlow Adagrad implementation (where the default is 0.1).
field adam_beta1 : float = 0.9
Beta1 parameter used by Adam. A value of 0.9 is used almost without exception in the literature and seems to give good results, so changing it from the default is discouraged without due consideration.
field adam_beta2 : float = 0.999
Beta2 parameter used by Adam. A value of 0.999 is typically recommended: it is the value suggested by the original paper describing Adam, and the one adopted in other frameworks such as TensorFlow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). The paper Attention is All You Need suggested a value of 0.98 for beta2, but that value may not work well for normal models / default baselines.
field decay_method : Literal['noam', 'noamwd', 'cosine', 'rsqrt', 'none'] = 'none'
Custom decay method to use.
field decay_steps : int = 10000
Frequency for learning rate decay, in steps.
field learning_rate : float = 1.0
Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.
field learning_rate_decay : float = 0.5
Decay learning rate by this much if steps have gone past start_decay_steps.
field optim : Literal['sgd', 'adagrad', 'adadelta', 'adam', 'adamw', 'sparseadam', 'adafactor', 'fusedadam', 'adamw8bit', 'pagedadamw8bit', 'pagedadamw32bit'] = 'sgd'
Optimization method.
field reset_optim : Literal['none', 'all', 'states', 'keep_states'] = 'none'
Optimization resetter when using train_from.
field start_decay_steps : int = 50000
Start decaying every decay_steps after this many steps.
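The interaction of learning_rate, learning_rate_decay, start_decay_steps and decay_steps can be sketched as below. This is a minimal illustration, not eole's implementation: it assumes the rate is multiplied by learning_rate_decay once when start_decay_steps is reached and again every decay_steps after that. The function name step_decay_lr is hypothetical, and the exact boundary handling in the library may differ.

```python
def step_decay_lr(step, learning_rate=1.0, learning_rate_decay=0.5,
                  start_decay_steps=50000, decay_steps=10000):
    """Hypothetical helper: learning rate in effect at `step`."""
    if step < start_decay_steps:
        # No decay has been applied yet.
        return learning_rate
    # Assumed: the first decay happens at start_decay_steps,
    # then one more every decay_steps.
    n_decays = (step - start_decay_steps) // decay_steps + 1
    return learning_rate * learning_rate_decay ** n_decays
```

With the defaults shown, the rate would stay at 1.0 up to step 49999, drop to 0.5 at step 50000 and to 0.25 at step 60000.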
field warmup_steps : int = 4000
Number of warmup steps for custom decay.
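To illustrate what warmup_steps does in a custom decay, the sketch below implements the schedule from Attention is All You Need, which a method named noam usually refers to. That eole's noam option matches this formula exactly is an assumption; noam_scale is a hypothetical name, and hidden_size is taken as an argument because it is not an OptimizerConfig field.

```python
def noam_scale(step, warmup_steps=4000, hidden_size=512):
    """Hypothetical helper: factor applied to the base learning rate."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    # Linear increase during warmup, inverse square root decay afterwards.
    return hidden_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Under these assumptions the factor grows linearly until warmup_steps, where it peaks, and then decays proportionally to the inverse square root of the step.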
field weight_decay : float = 0.0
Weight decay to forward to torch Optimizer.
pydantic model eole.config.training.TrainingConfig
Bases: RunningConfig, OptimizerConfig, LoRaConfig, QuantizeConfig
JSON schema:
{
"title": "TrainingConfig",
"type": "object",
"properties": {
"quant_layers": {
"default": [],
"description": "List of layers to be compressed in 4/8bit.",
"items": {
"type": "string"
},
"title": "Quant Layers",
"type": "array"
},
"quant_type": {
"default": "",
"description": "Type of compression.",
"enum": [
"",
"bnb_8bit",
"bnb_FP4",
"bnb_NF4",
"awq_gemm",
"awq_gemv"
],
"title": "Quant Type",
"type": "string"
},
"w_bit": {
"default": 4,
"description": "W_bit quantization",
"title": "W Bit",
"type": "integer"
},
"group_size": {
"default": 128,
"description": "Group size quantization.",
"title": "Group Size",
"type": "integer"
},
"lora_layers": {
"default": [],
"description": "List of layers to be replaced by LoRa layers. E.g. ['linear_values', 'linear_query'] (\u00a74.2 in https://arxiv.org/abs/2106.09685)",
"items": {
"type": "string"
},
"title": "Lora Layers",
"type": "array"
},
"lora_embedding": {
"default": false,
"description": "Replace embeddings with LoRa Embeddings (\u00a75.1)",
"title": "Lora Embedding",
"type": "boolean"
},
"lora_rank": {
"default": 2,
"description": "r=2 successfully tested with NLLB-200 3.3B",
"title": "Lora Rank",
"type": "integer"
},
"lora_alpha": {
"default": 1,
"description": "\u00a74.1 https://arxiv.org/abs/2106.09685",
"title": "Lora Alpha",
"type": "integer"
},
"lora_dropout": {
"default": 0.0,
"description": "Rule of thumb: same value as in main model.",
"title": "Lora Dropout",
"type": "number"
},
"optim": {
"default": "sgd",
"description": "Optimization method.",
"enum": [
"sgd",
"adagrad",
"adadelta",
"adam",
"adamw",
"sparseadam",
"adafactor",
"fusedadam",
"adamw8bit",
"pagedadamw8bit",
"pagedadamw32bit"
],
"title": "Optim",
"type": "string"
},
"adagrad_accumulator_init": {
"default": 0,
"description": "Initialize the accumulator values in adagrad. Mirrors initial_accumulator_value flag from tensorflow adagrad implementation (default 0.1 there).",
"title": "Adagrad Accumulator Init",
"type": "number"
},
"adam_beta1": {
"default": 0.9,
"description": "Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.",
"title": "Adam Beta1",
"type": "number"
},
"adam_beta2": {
"default": 0.999,
"description": "Beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). Whereas recently the paper Attention is All You Need suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.",
"title": "Adam Beta2",
"type": "number"
},
"weight_decay": {
"default": 0.0,
"description": "Weight decay to forward to torch Optimizer.",
"title": "Weight Decay",
"type": "number"
},
"learning_rate": {
"default": 1.0,
"description": "Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.",
"title": "Learning Rate",
"type": "number"
},
"learning_rate_decay": {
"default": 0.5,
"description": "Decay learning rate by this much if steps have gone past start_decay_steps.",
"title": "Learning Rate Decay",
"type": "number"
},
"start_decay_steps": {
"default": 50000,
"description": "Start decaying every decay_steps after this many steps.",
"title": "Start Decay Steps",
"type": "integer"
},
"decay_steps": {
"default": 10000,
"description": "Frequency for learning rate decay, in steps.",
"title": "Decay Steps",
"type": "integer"
},
"decay_method": {
"default": "none",
"description": "Custom decay method to use.",
"enum": [
"noam",
"noamwd",
"cosine",
"rsqrt",
"none"
],
"title": "Decay Method",
"type": "string"
},
"warmup_steps": {
"default": 4000,
"description": "Number of warmup steps for custom decay.",
"title": "Warmup Steps",
"type": "integer"
},
"reset_optim": {
"default": "none",
"description": "Optimization resetter when using train_from.",
"enum": [
"none",
"all",
"states",
"keep_states"
],
"title": "Reset Optim",
"type": "string"
},
"gpu_ranks": {
"default": [],
"description": "List of ranks for each process.",
"items": {
"type": "integer"
},
"title": "Gpu Ranks",
"type": "array"
},
"world_size": {
"default": 1,
"description": "Total number of distributed processes.",
"title": "World Size",
"type": "integer"
},
"parallel_mode": {
"default": "data_parallel",
"description": "Distributed mode.",
"enum": [
"data_parallel",
"tensor_parallel"
],
"title": "Parallel Mode",
"type": "string"
},
"gpu_backend": {
"default": "nccl",
"description": "Type of torch distributed backend.",
"title": "Gpu Backend",
"type": "string"
},
"gpu_verbose_level": {
"default": 0,
"description": "Gives more info on each process per GPU.",
"title": "Gpu Verbose Level",
"type": "integer"
},
"master_ip": {
"default": "localhost",
"description": "IP of master for torch.distributed training.",
"title": "Master Ip",
"type": "string"
},
"master_port": {
"default": 10000,
"description": "Port of master for torch.distributed training.",
"title": "Master Port",
"type": "integer"
},
"timeout": {
"default": 60,
"description": "Timeout for one GPU to wait for the others.",
"title": "Timeout",
"type": "integer"
},
"model_path": {
"default": "model",
"description": "Path to directory containing all model components.",
"title": "Model Path",
"type": "string"
},
"self_attn_backend": {
"default": "flash",
"description": "Self-attention backend.",
"enum": [
"flash",
"pytorch"
],
"title": "Self Attn Backend",
"type": "string"
},
"compute_dtype": {
"description": "Compute dtype (precision) to use for main compute. Some parameters might have other dtypes for specific cases (e.g. torch.amp -- See eole.config.training.TrainingConfig.storage_dtype) fp32 to force slow fp16 model on gtx1080, int8 to enable pytorch native 8-bit quantization (cpu only).",
"enum": [
"fp32",
"fp16",
"int8",
"bf16"
],
"title": "Compute Dtype",
"type": "string"
},
"torch_compile": {
"default": false,
"description": "Use torch.compile with dynamic=True.",
"title": "Torch Compile",
"type": "boolean"
},
"param_init": {
"default": 0.1,
"description": "Support value for uniform distribution parameters initialization. Set to 0 not to use initialization.",
"title": "Param Init",
"type": "number"
},
"param_init_method": {
"default": "uniform",
"description": "Parameter initialization method.",
"enum": [
"xavier_uniform",
"uniform",
"normal"
],
"title": "Param Init Method",
"type": "string"
},
"freeze_encoder": {
"default": false,
"description": "Freeze parameters in encoder.",
"title": "Freeze Encoder",
"type": "boolean"
},
"freeze_decoder": {
"default": false,
"description": "Freeze parameters in decoder.",
"title": "Freeze Decoder",
"type": "boolean"
},
"pre_word_vecs_enc": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If a valid path is specified, will load pretrained word embeddings on the encoder side.",
"title": "Pre Word Vecs Enc"
},
"pre_word_vecs_dec": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If a valid path is specified, will load pretrained word embeddings on the decoder side.",
"title": "Pre Word Vecs Dec"
},
"data_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "text",
"title": "Data Type"
},
"bucket_size": {
"default": 262144,
"description": "A bucket is a buffer of bucket_size examples to pick from the various corpora. The dynamic iterator batches batch_size items from the bucket and shuffle them.",
"title": "Bucket Size",
"type": "integer"
},
"bucket_size_init": {
"default": -1,
"description": "Bucket size is initialized with this amount of examples (see bucket_size_increment).",
"title": "Bucket Size Init",
"type": "integer"
},
"bucket_size_increment": {
"default": 0,
"description": "Bucket size incremented with this amount of examples at each new bucket (up to bucket_size).",
"title": "Bucket Size Increment",
"type": "integer"
},
"prefetch_factor": {
"default": 200,
"description": "Number of mini-batches loaded in advance to avoid the GPU waiting during processing of next bucket.",
"title": "Prefetch Factor",
"type": "integer"
},
"save_format": {
"default": "pytorch",
"description": "Format to save the model weights.",
"enum": [
"pytorch",
"safetensors"
],
"title": "Save Format",
"type": "string"
},
"save_checkpoint_steps": {
"default": 5000,
"description": "Frequency of checkpoint saving (in steps).",
"title": "Save Checkpoint Steps",
"type": "integer"
},
"keep_checkpoint": {
"default": -1,
"description": "Number of checkpoints to retain. (-1 retains all)",
"title": "Keep Checkpoint",
"type": "integer"
},
"train_from": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Pretrained model/checkpoint weights to continue training from.",
"title": "Train From"
},
"num_workers": {
"default": 2,
"description": "Number of workers for pytorch.DataLoader objects.",
"title": "Num Workers",
"type": "integer"
},
"batch_size": {
"default": 64,
"description": "Maximum batch size for training.",
"title": "Batch Size",
"type": "integer"
},
"batch_size_multiple": {
"default": 1,
"description": "Batch size multiple for token batches.",
"title": "Batch Size Multiple",
"type": "integer"
},
"batch_type": {
"default": "sents",
"description": "Batch grouping for batch_size.",
"enum": [
"sents",
"tokens"
],
"title": "Batch Type",
"type": "string"
},
"normalization": {
"default": "sents",
"description": "Normalization method of the gradient.",
"enum": [
"sents",
"tokens"
],
"title": "Normalization",
"type": "string"
},
"accum_count": {
"default": [
1
],
"description": "Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for transformer.",
"items": {
"type": "integer"
},
"title": "Accum Count",
"type": "array"
},
"accum_steps": {
"default": [
0
],
"description": "Steps at which accum_count values change.",
"items": {
"type": "integer"
},
"title": "Accum Steps",
"type": "array"
},
"valid_steps": {
"default": 10000,
"description": "Frequency of validation, in steps.",
"title": "Valid Steps",
"type": "integer"
},
"valid_batch_size": {
"default": 32,
"description": "Maximum batch size for validation.",
"title": "Valid Batch Size",
"type": "integer"
},
"train_steps": {
"default": 100000,
"description": "Number of training steps.",
"title": "Train Steps",
"type": "integer"
},
"single_pass": {
"default": false,
"description": "Make a single pass over the training dataset.",
"title": "Single Pass",
"type": "boolean"
},
"early_stopping": {
"default": 0,
"description": "Number of validation steps without improving that will trigger early stop of training.",
"title": "Early Stopping",
"type": "integer"
},
"early_stopping_criteria": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Criteria to use for early stopping.",
"title": "Early Stopping Criteria"
},
"max_grad_norm": {
"default": 5,
"description": "If the norm of the gradient vector exceeds this value, renormalize it to have the norm equal to max_grad_norm.",
"title": "Max Grad Norm",
"type": "number"
},
"dropout": {
"default": [
0.3
],
"description": "Dropout probability.",
"items": {
"type": "number"
},
"title": "Dropout",
"type": "array"
},
"attention_dropout": {
"default": [
0.1
],
"description": "Attention dropout probability.",
"items": {
"type": "number"
},
"title": "Attention Dropout",
"type": "array"
},
"dropout_steps": {
"default": [
0
],
"description": "Steps at which dropout changes.",
"items": {
"type": "integer"
},
"title": "Dropout Steps",
"type": "array"
},
"truncated_decoder": {
"default": 0,
"description": "Truncated bptt.",
"title": "Truncated Decoder",
"type": "integer"
},
"label_smoothing": {
"default": 0.0,
"description": "Label smoothing value epsilon. Probability of all non-true labels will be smoothed by epsilon/(vocab_size-1). Set to 0 to turn off label smoothing. (https://arxiv.org/abs/1512.00567)",
"title": "Label Smoothing",
"type": "number"
},
"average_decay": {
"default": 0.0,
"description": "Exponential moving average decay (https://en.wikipedia.org/wiki/Moving_average). Set to other than 0 (e.g. 1e-4) to activate. Similar to Marian NMT implementation (http://www.aclweb.org/anthology/P18-4020).",
"title": "Average Decay",
"type": "number"
},
"average_every": {
"default": 1,
"description": "Step for moving average. Default is every update if average_decay is set.",
"title": "Average Every",
"type": "integer"
},
"loss_scale": {
"default": 0.0,
"description": "For FP16 training, the static loss scale to use. If not set, the loss scale is dynamically computed.",
"title": "Loss Scale",
"type": "number"
},
"apex_opt_level": {
"default": "",
"description": "For FP16 training, the opt_level to use. See https://nvidia.github.io/apex/amp.html#opt-levels.",
"enum": [
"",
"O0",
"O1",
"O2",
"O3"
],
"title": "Apex Opt Level",
"type": "string"
},
"zero_out_prompt_loss": {
"default": false,
"description": "Set the prompt loss to zero. Mostly for LLM finetuning. Will be enabled only if the `insert_mask_before_placeholder` transform is applied.",
"title": "Zero Out Prompt Loss",
"type": "boolean"
},
"use_ckpting": {
"default": [],
"description": "Use gradient checkpointing for those modules.",
"items": {
"type": "string"
},
"title": "Use Ckpting",
"type": "array"
},
"update_vocab": {
"default": false,
"description": "Update source and target existing vocabularies.",
"title": "Update Vocab",
"type": "boolean"
},
"lm_prior_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "LM model to use to train the TM.",
"title": "Lm Prior Model"
},
"lm_prior_lambda": {
"default": 0.0,
"description": "LM Prior Lambda",
"title": "Lm Prior Lambda",
"type": "number"
},
"lm_prior_tau": {
"default": 1.0,
"description": "LM Prior Tau",
"title": "Lm Prior Tau",
"type": "number"
},
"estim_loss_lambda": {
"default": [
1.0
],
"description": "Weight applied to estimator loss",
"items": {
"type": "number"
},
"title": "Estim Loss Lambda",
"type": "array"
},
"estim_loss_lambda_steps": {
"default": [
0
],
"description": "Steps at which estimator loss lambda changes",
"items": {
"type": "integer"
},
"title": "Estim Loss Lambda Steps",
"type": "array"
},
"score_threshold": {
"default": 0.68,
"description": "Threshold to filterout data",
"title": "Score Threshold",
"type": "number"
},
"dummy_load": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Ignore some warnings if we are only loading the configuration prior to other operations, e.g. in `train_from` context.",
"title": "Dummy Load"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- arbitrary_types_allowed: bool = True
- Fields:
accum_count (List[int])
accum_steps (List[int])
apex_opt_level (Literal['', 'O0', 'O1', 'O2', 'O3'])
attention_dropout (List[float])
average_decay (float)
average_every (int)
batch_size (int)
batch_size_multiple (int)
batch_type (Literal['sents', 'tokens'])
bucket_size (int)
bucket_size_increment (int)
bucket_size_init (int)
data_type (str | None)
dropout (List[float])
dropout_steps (List[int])
dummy_load (bool | None)
early_stopping (int)
early_stopping_criteria (str | None)
estim_loss_lambda (List[float])
estim_loss_lambda_steps (List[int])
freeze_decoder (bool)
freeze_encoder (bool)
keep_checkpoint (int)
label_smoothing (float)
lm_prior_lambda (float)
lm_prior_model (str | None)
lm_prior_tau (float)
loss_scale (float)
max_grad_norm (float)
normalization (Literal['sents', 'tokens'])
num_workers (int)
param_init (float)
param_init_method (Literal['xavier_uniform', 'uniform', 'normal'])
pre_word_vecs_dec (str | None)
pre_word_vecs_enc (str | None)
prefetch_factor (int)
save_checkpoint_steps (int)
save_format (Literal['pytorch', 'safetensors'])
score_threshold (float)
single_pass (bool)
train_from (str | None)
train_steps (int)
truncated_decoder (int)
update_vocab (bool)
use_ckpting (List[str])
valid_batch_size (int)
valid_steps (int)
zero_out_prompt_loss (bool)
- Validators:
_validate_running_config » all fields
checkpointing_layers » use_ckpting
field accum_count : List[int] = [1]
Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for transformer.
- Validated by:
_validate_running_config
field accum_steps : List[int] = [0]
Steps at which accum_count values change.
- Validated by:
_validate_running_config
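Since accum_count and accum_steps are parallel lists, one way to read them is as a schedule: the i-th count applies from the i-th step onwards. The sketch below illustrates that reading. The helper name current_accum_count is hypothetical, and the inclusive boundary is an assumption rather than confirmed library behaviour.

```python
def current_accum_count(step, accum_count=(1,), accum_steps=(0,)):
    """Hypothetical helper: accumulation count in effect at `step`."""
    count = accum_count[0]
    for value, start in zip(accum_count, accum_steps):
        if step >= start:  # assumed inclusive boundary
            count = value
    return count
```

For example, accum_count=[2, 4, 8] with accum_steps=[0, 1000, 2000] would accumulate 2 batches per update at first, 4 from step 1000 and 8 from step 2000.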
field apex_opt_level : Literal['', 'O0', 'O1', 'O2', 'O3'] = ''
For FP16 training, the opt_level to use. See https://nvidia.github.io/apex/amp.html#opt-levels.
- Validated by:
_validate_running_config
field attention_dropout : List[float] = [0.1]
Attention dropout probability.
- Validated by:
_validate_running_config
field average_decay : float = 0.0
Exponential moving average decay (https://en.wikipedia.org/wiki/Moving_average). Set to other than 0 (e.g. 1e-4) to activate. Similar to Marian NMT implementation (http://www.aclweb.org/anthology/P18-4020).
- Validated by:
_validate_running_config
field average_every : int = 1
Step for moving average. Default is every update if average_decay is set.
- Validated by:
_validate_running_config
field batch_size : int = 64
Maximum batch size for training.
- Validated by:
_validate_running_config
field batch_size_multiple : int = 1
Batch size multiple for token batches.
- Validated by:
_validate_running_config
field batch_type : Literal['sents', 'tokens'] = 'sents'
Batch grouping for batch_size.
- Validated by:
_validate_running_config
field bucket_size : int = 262144
A bucket is a buffer of bucket_size examples to pick from the various corpora. The dynamic iterator batches batch_size items from the bucket and shuffles them.
- Validated by:
_validate_running_config
field bucket_size_increment : int = 0
Bucket size incremented with this amount of examples at each new bucket (up to bucket_size).
- Validated by:
_validate_running_config
field bucket_size_init : int = -1
Bucket size is initialized with this amount of examples (see bucket_size_increment).
- Validated by:
_validate_running_config
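The three bucket options describe a bucket that can start small and grow. A minimal sketch of that growth is given below, assuming that a non-positive bucket_size_init (such as the default -1) means starting directly at bucket_size. The function name bucket_sizes is hypothetical and the code is not taken from the library.

```python
def bucket_sizes(n_buckets, bucket_size=262144,
                 bucket_size_init=-1, bucket_size_increment=0):
    """Hypothetical helper: size of each of the first `n_buckets` buckets."""
    # Assumed: a non-positive init value disables the progressive ramp-up.
    size = min(bucket_size_init, bucket_size) if bucket_size_init > 0 else bucket_size
    sizes = []
    for _ in range(n_buckets):
        sizes.append(size)
        # Grow at each new bucket, but never beyond bucket_size.
        size = min(size + bucket_size_increment, bucket_size)
    return sizes
```

A small initial bucket lets the first batches be produced sooner, at the cost of less shuffling across corpora early in training.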
field data_type : str | None = 'text'
- Validated by:
_validate_running_config
field dropout : List[float] = [0.3]
Dropout probability.
- Validated by:
_validate_running_config
field dropout_steps : List[int] = [0]
Steps at which dropout changes.
- Validated by:
_validate_running_config
field dummy_load : bool | None = False
Ignore some warnings if we are only loading the configuration prior to other operations, e.g. in train_from context.
- Validated by:
_validate_running_config
field early_stopping : int = 0
Number of validation steps without improvement that will trigger early stopping of training.
- Validated by:
_validate_running_config
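The patience mechanism described by early_stopping can be sketched as a counter of consecutive validations without improvement. The class below is a hypothetical illustration: it assumes a lower validation score is better and that 0 (the default) disables early stopping, neither of which is confirmed here.

```python
class PatienceCounter:
    """Hypothetical helper tracking validations without improvement."""

    def __init__(self, early_stopping):
        self.early_stopping = early_stopping
        self.best = float("inf")
        self.bad_validations = 0

    def should_stop(self, valid_loss):
        """Record one validation result; return True once patience runs out."""
        if valid_loss < self.best:
            self.best = valid_loss
            self.bad_validations = 0
        else:
            self.bad_validations += 1
        # Assumed: 0 (the default) disables early stopping.
        return self.early_stopping > 0 and self.bad_validations >= self.early_stopping
```

With early_stopping=2, training would stop after the second consecutive validation that fails to beat the best score seen so far.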
field early_stopping_criteria : str | None = None
Criteria to use for early stopping.
- Validated by:
_validate_running_config
field estim_loss_lambda : List[float] = [1.0]
Weight applied to estimator loss
- Validated by:
_validate_running_config
field estim_loss_lambda_steps : List[int] = [0]
Steps at which estimator loss lambda changes
- Validated by:
_validate_running_config
field freeze_decoder : bool = False
Freeze parameters in decoder.
- Validated by:
_validate_running_config
field freeze_encoder : bool = False
Freeze parameters in encoder.
- Validated by:
_validate_running_config
field keep_checkpoint : int = -1
Number of checkpoints to retain. (-1 retains all)
- Validated by:
_validate_running_config
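Checkpoint retention as described by keep_checkpoint can be illustrated with a bounded queue. This is a sketch only: it assumes the oldest checkpoints are the ones discarded, and retained_checkpoints is a hypothetical name.

```python
from collections import deque


def retained_checkpoints(saved_steps, keep_checkpoint=-1):
    """Hypothetical helper: which saved steps are still kept."""
    if keep_checkpoint < 0:
        # -1 retains every checkpoint.
        return list(saved_steps)
    # A bounded deque drops the oldest entries first (assumed policy).
    return list(deque(saved_steps, maxlen=keep_checkpoint))
```

With save_checkpoint_steps=5000 and keep_checkpoint=2, only the two most recent checkpoints would remain under this assumption.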
field label_smoothing : float = 0.0
Label smoothing value epsilon. Probability of all non-true labels will be smoothed by epsilon/(vocab_size-1). Set to 0 to turn off label smoothing. (https://arxiv.org/abs/1512.00567)
- Validated by:
_validate_running_config
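The smoothed target distribution described above can be written out explicitly. In this sketch each non-true label receives epsilon/(vocab_size-1), as stated in the description. Giving the true label the remaining 1 - epsilon is the usual convention and is assumed here. The function name smoothed_targets is hypothetical.

```python
def smoothed_targets(true_index, vocab_size, label_smoothing=0.0):
    """Hypothetical helper: target distribution for one token."""
    # Mass given to every label other than the true one.
    off_value = label_smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    # Assumed: the true label keeps the remaining probability mass.
    dist[true_index] = 1.0 - label_smoothing
    return dist
```

With label_smoothing=0 the result is the ordinary one-hot target, which matches the statement that 0 turns smoothing off.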
field lm_prior_lambda : float = 0.0
LM Prior Lambda
- Validated by:
_validate_running_config
field lm_prior_model : str | None = None
LM model to use to train the TM.
- Validated by:
_validate_running_config
field lm_prior_tau : float = 1.0
LM Prior Tau
- Validated by:
_validate_running_config
field loss_scale : float = 0.0
For FP16 training, the static loss scale to use. If not set, the loss scale is dynamically computed.
- Validated by:
_validate_running_config
field max_grad_norm : float = 5
If the norm of the gradient vector exceeds this value, renormalize it to have the norm equal to max_grad_norm.
- Validated by:
_validate_running_config
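The renormalization described for max_grad_norm is standard gradient norm clipping. The sketch below shows it on a plain list of floats so that it runs without any dependency. The name clip_by_norm is hypothetical and this is not the library's code. For tensors, PyTorch provides torch.nn.utils.clip_grad_norm_ for the same operation.

```python
import math


def clip_by_norm(grads, max_grad_norm=5.0):
    """Hypothetical helper: rescale `grads` if their L2 norm is too large."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm:
        # Norm is within bounds: leave the gradient unchanged.
        return list(grads)
    # Rescale so that the resulting norm equals max_grad_norm.
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads]
```

The direction of the gradient is preserved and only its length is capped.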
field normalization : Literal['sents', 'tokens'] = 'sents'
Normalization method of the gradient.
- Validated by:
_validate_running_config
field num_workers : int = 2
Number of workers for PyTorch DataLoader objects.
- Validated by:
_validate_running_config
field param_init : float = 0.1
Support value for uniform-distribution parameter initialization. Set to 0 to disable initialization.
- Validated by:
_validate_running_config
field param_init_method : Literal['xavier_uniform', 'uniform', 'normal'] = 'uniform'
Parameter initialization method.
- Validated by:
_validate_running_config
field pre_word_vecs_dec : str | None = None
If a valid path is specified, will load pretrained word embeddings on the decoder side.
- Validated by:
_validate_running_config
field pre_word_vecs_enc : str | None = None
If a valid path is specified, will load pretrained word embeddings on the encoder side.
- Validated by:
_validate_running_config
field prefetch_factor : int = 200
Number of mini-batches loaded in advance to avoid the GPU waiting during processing of next bucket.
- Validated by:
_validate_running_config
field save_checkpoint_steps : int = 5000
Frequency of checkpoint saving (in steps).
- Validated by:
_validate_running_config
field save_format : Literal['pytorch', 'safetensors'] = 'pytorch'
Format to save the model weights.
- Validated by:
_validate_running_config
field score_threshold : float = 0.68
Threshold to filter out data.
- Validated by:
_validate_running_config
field single_pass : bool = False
Make a single pass over the training dataset.
- Validated by:
_validate_running_config
field train_from : str | None = None
Pretrained model/checkpoint weights to continue training from.
- Validated by:
_validate_running_config
field train_steps : int = 100000
Number of training steps.
- Validated by:
_validate_running_config
field truncated_decoder : int = 0
Truncated BPTT (backpropagation through time).
- Validated by:
_validate_running_config
field update_vocab : bool = False
Update source and target existing vocabularies.
- Validated by:
_validate_running_config
field use_ckpting : List[str] = []
Use gradient checkpointing for those modules.
- Validated by:
_validate_running_config
checkpointing_layers
field valid_batch_size : int = 32
Maximum batch size for validation.
- Validated by:
_validate_running_config
field valid_steps : int = 10000
Frequency of validation, in steps.
- Validated by:
_validate_running_config
field zero_out_prompt_loss : bool = False
Set the prompt loss to zero. Mostly for LLM finetuning. Will be enabled only if the insert_mask_before_placeholder transform is applied.
- Validated by:
_validate_running_config
validator checkpointing_layers » use_ckpting
get_model_path()
property storage_dtype : dtype
Deduce which dtype to use for main model parameters. E.g. with mixed precision a copy is kept in float32.