
Training

pydantic model eole.config.training.OptimizerConfig[source]

Bases: Config

Everything related to optimizers. Might be split into multiple subclasses later. Note: this config is not yet sufficient to replace the full opt namespace in build_torch_optimizer, which also accesses other parameters (hidden_size, compute_dtype, apex_opt_level, etc.).

JSON schema:
{
"title": "OptimizerConfig",
"description": "Everything related to optimizers.\nMight be split into multiple subclasses later.\nNote: not fully sufficient (yet) to replace full opt namespace in build_torch_optimizer.\nSome other parameters (hidden_size, compute_dtype, apex_opt_level, etc.) are accessed.",
"type": "object",
"properties": {
"optim": {
"default": "sgd",
"description": "Optimization method.",
"enum": [
"sgd",
"adagrad",
"adadelta",
"adam",
"adamw",
"sparseadam",
"adafactor",
"fusedadam",
"adamw8bit",
"pagedadamw8bit",
"pagedadamw32bit"
],
"title": "Optim",
"type": "string"
},
"adagrad_accumulator_init": {
"default": 0,
"description": "Initialize the accumulator values in adagrad. Mirrors initial_accumulator_value flag from tensorflow adagrad implementation (default 0.1 there).",
"title": "Adagrad Accumulator Init",
"type": "number"
},
"adam_beta1": {
"default": 0.9,
"description": "Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.",
"title": "Adam Beta1",
"type": "number"
},
"adam_beta2": {
"default": 0.999,
"description": "Beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). Whereas recently the paper Attention is All You Need suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.",
"title": "Adam Beta2",
"type": "number"
},
"weight_decay": {
"default": 0.0,
"description": "Weight decay to forward to torch Optimizer.",
"title": "Weight Decay",
"type": "number"
},
"learning_rate": {
"default": 1.0,
"description": "Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.",
"title": "Learning Rate",
"type": "number"
},
"learning_rate_decay": {
"default": 0.5,
"description": "Decay learning rate by this much if steps have gone past start_decay_steps.",
"title": "Learning Rate Decay",
"type": "number"
},
"start_decay_steps": {
"default": 50000,
"description": "Start decaying every decay_steps after this many steps.",
"title": "Start Decay Steps",
"type": "integer"
},
"decay_steps": {
"default": 10000,
"description": "Frequency for learning rate decay, in steps.",
"title": "Decay Steps",
"type": "integer"
},
"decay_method": {
"default": "none",
"description": "Custom decay method to use.",
"enum": [
"noam",
"noamwd",
"cosine",
"rsqrt",
"none"
],
"title": "Decay Method",
"type": "string"
},
"warmup_steps": {
"default": 4000,
"description": "Number of warmup steps for custom decay.",
"title": "Warmup Steps",
"type": "integer"
},
"reset_optim": {
"default": "none",
"description": "Optimization resetter when using train_from.",
"enum": [
"none",
"all",
"states",
"keep_states"
],
"title": "Reset Optim",
"type": "string"
}
},
"additionalProperties": false
}

field adagrad_accumulator_init : float = 0

Initial value of the Adagrad accumulators. Mirrors the initial_accumulator_value flag from the TensorFlow Adagrad implementation (default 0.1 there).

field adam_beta1 : float = 0.9

Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.

field adam_beta2 : float = 0.999

Beta2 parameter used by Adam. A value of 0.999 is typically recommended: it is the value suggested by the original paper describing Adam, and it is also the value adopted in other frameworks such as TensorFlow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). The paper Attention is All You Need suggested a value of 0.98 for beta2, but that value may not work well for normal models / default baselines.

field decay_method : Literal['noam', 'noamwd', 'cosine', 'rsqrt', 'none'] = 'none'

Custom decay method to use.

field decay_steps : int = 10000

Frequency for learning rate decay, in steps.

field learning_rate : float = 1.0

Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.

field learning_rate_decay : float = 0.5

Factor by which the learning rate is multiplied at each decay once training has passed start_decay_steps.

field optim : Literal['sgd', 'adagrad', 'adadelta', 'adam', 'adamw', 'sparseadam', 'adafactor', 'fusedadam', 'adamw8bit', 'pagedadamw8bit', 'pagedadamw32bit'] = 'sgd'

Optimization method.

field reset_optim : Literal['none', 'all', 'states', 'keep_states'] = 'none'

Optimization resetter when using train_from.

field start_decay_steps : int = 50000

Start decaying every decay_steps after this many steps.
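How learning_rate, learning_rate_decay, start_decay_steps and decay_steps interact can be sketched as below. This is one plausible reading of the field descriptions, not eole's actual scheduler: the helper name step_decay_lr is hypothetical, and it assumes the first decay is applied at start_decay_steps and a further one every decay_steps after that.

```python
def step_decay_lr(step, learning_rate=1.0, learning_rate_decay=0.5,
                  start_decay_steps=50000, decay_steps=10000):
    """Hypothetical helper: learning rate at `step` under step-wise decay."""
    if step < start_decay_steps:
        # Before the decay window the starting learning rate is used as-is.
        return learning_rate
    # Assumed convention: one decay at start_decay_steps, then one per decay_steps.
    n_decays = (step - start_decay_steps) // decay_steps + 1
    return learning_rate * learning_rate_decay ** n_decays
```

With the defaults shown, the rate would stay at 1.0 until step 50000, halve there, and halve again every 10000 steps.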

field warmup_steps : int = 4000

Number of warmup steps for custom decay.
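For orientation, the warmup schedule described in Attention is All You Need (commonly called "noam") can be sketched as below. Whether eole's 'noam' decay_method matches this formula term for term is an assumption; the function name noam_lr is hypothetical, and hidden_size is taken from the list of extra parameters the optimizer builder is said to access.

```python
def noam_lr(step, learning_rate, hidden_size, warmup_steps=4000):
    """Hypothetical helper: learning rate at `step` under the noam schedule."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first update
    # Linear warmup for warmup_steps updates, then inverse square-root decay.
    scale = hidden_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * scale
```

Under this formula the rate rises linearly during warmup, peaks at step == warmup_steps, and then decays proportionally to step ** -0.5.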

field weight_decay : float = 0.0

Weight decay to forward to torch Optimizer.

pydantic model eole.config.training.TrainingConfig[source]

Bases: RunningConfig, OptimizerConfig, LoRaConfig, QuantizeConfig

JSON schema:
{
"title": "TrainingConfig",
"type": "object",
"properties": {
"quant_layers": {
"default": [],
"description": "List of layers to be compressed in 4/8bit.",
"items": {
"type": "string"
},
"title": "Quant Layers",
"type": "array"
},
"quant_type": {
"default": "",
"description": "Type of compression.",
"enum": [
"",
"bnb_8bit",
"bnb_FP4",
"bnb_NF4",
"awq_gemm",
"awq_gemv"
],
"title": "Quant Type",
"type": "string"
},
"w_bit": {
"default": 4,
"description": "W_bit quantization",
"title": "W Bit",
"type": "integer"
},
"group_size": {
"default": 128,
"description": "Group size quantization.",
"title": "Group Size",
"type": "integer"
},
"lora_layers": {
"default": [],
"description": "List of layers to be replaced by LoRa layers. E.g. ['linear_values', 'linear_query'] (\u00a74.2 in https://arxiv.org/abs/2106.09685)",
"items": {
"type": "string"
},
"title": "Lora Layers",
"type": "array"
},
"lora_embedding": {
"default": false,
"description": "Replace embeddings with LoRa Embeddings (\u00a75.1)",
"title": "Lora Embedding",
"type": "boolean"
},
"lora_rank": {
"default": 2,
"description": "r=2 successfully tested with NLLB-200 3.3B",
"title": "Lora Rank",
"type": "integer"
},
"lora_alpha": {
"default": 1,
"description": "\u00a74.1 https://arxiv.org/abs/2106.09685",
"title": "Lora Alpha",
"type": "integer"
},
"lora_dropout": {
"default": 0.0,
"description": "Rule of thumb: same value as in main model.",
"title": "Lora Dropout",
"type": "number"
},
"optim": {
"default": "sgd",
"description": "Optimization method.",
"enum": [
"sgd",
"adagrad",
"adadelta",
"adam",
"adamw",
"sparseadam",
"adafactor",
"fusedadam",
"adamw8bit",
"pagedadamw8bit",
"pagedadamw32bit"
],
"title": "Optim",
"type": "string"
},
"adagrad_accumulator_init": {
"default": 0,
"description": "Initialize the accumulator values in adagrad. Mirrors initial_accumulator_value flag from tensorflow adagrad implementation (default 0.1 there).",
"title": "Adagrad Accumulator Init",
"type": "number"
},
"adam_beta1": {
"default": 0.9,
"description": "Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.",
"title": "Adam Beta1",
"type": "number"
},
"adam_beta2": {
"default": 0.999,
"description": "Beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). Whereas recently the paper Attention is All You Need suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.",
"title": "Adam Beta2",
"type": "number"
},
"weight_decay": {
"default": 0.0,
"description": "Weight decay to forward to torch Optimizer.",
"title": "Weight Decay",
"type": "number"
},
"learning_rate": {
"default": 1.0,
"description": "Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.",
"title": "Learning Rate",
"type": "number"
},
"learning_rate_decay": {
"default": 0.5,
"description": "Decay learning rate by this much if steps have gone past start_decay_steps.",
"title": "Learning Rate Decay",
"type": "number"
},
"start_decay_steps": {
"default": 50000,
"description": "Start decaying every decay_steps after this many steps.",
"title": "Start Decay Steps",
"type": "integer"
},
"decay_steps": {
"default": 10000,
"description": "Frequency for learning rate decay, in steps.",
"title": "Decay Steps",
"type": "integer"
},
"decay_method": {
"default": "none",
"description": "Custom decay method to use.",
"enum": [
"noam",
"noamwd",
"cosine",
"rsqrt",
"none"
],
"title": "Decay Method",
"type": "string"
},
"warmup_steps": {
"default": 4000,
"description": "Number of warmup steps for custom decay.",
"title": "Warmup Steps",
"type": "integer"
},
"reset_optim": {
"default": "none",
"description": "Optimization resetter when using train_from.",
"enum": [
"none",
"all",
"states",
"keep_states"
],
"title": "Reset Optim",
"type": "string"
},
"gpu_ranks": {
"default": [],
"description": "List of ranks for each process.",
"items": {
"type": "integer"
},
"title": "Gpu Ranks",
"type": "array"
},
"world_size": {
"default": 1,
"description": "Total number of distributed processes.",
"title": "World Size",
"type": "integer"
},
"parallel_mode": {
"default": "data_parallel",
"description": "Distributed mode.",
"enum": [
"data_parallel",
"tensor_parallel"
],
"title": "Parallel Mode",
"type": "string"
},
"gpu_backend": {
"default": "nccl",
"description": "Type of torch distributed backend.",
"title": "Gpu Backend",
"type": "string"
},
"gpu_verbose_level": {
"default": 0,
"description": "Gives more info on each process per GPU.",
"title": "Gpu Verbose Level",
"type": "integer"
},
"master_ip": {
"default": "localhost",
"description": "IP of master for torch.distributed training.",
"title": "Master Ip",
"type": "string"
},
"master_port": {
"default": 10000,
"description": "Port of master for torch.distributed training.",
"title": "Master Port",
"type": "integer"
},
"timeout": {
"default": 60,
"description": "Timeout for one GPU to wait for the others.",
"title": "Timeout",
"type": "integer"
},
"model_path": {
"default": "model",
"description": "Path to directory containing all model components.",
"title": "Model Path",
"type": "string"
},
"self_attn_backend": {
"default": "flash",
"description": "Self-attention backend.",
"enum": [
"flash",
"pytorch"
],
"title": "Self Attn Backend",
"type": "string"
},
"compute_dtype": {
"description": "Compute dtype (precision) to use for main compute. Some parameters might have other dtypes for specific cases (e.g. torch.amp -- See eole.config.training.TrainingConfig.storage_dtype) fp32 to force slow fp16 model on gtx1080, int8 to enable pytorch native 8-bit quantization (cpu only).",
"enum": [
"fp32",
"fp16",
"int8",
"bf16"
],
"title": "Compute Dtype",
"type": "string"
},
"torch_compile": {
"default": false,
"description": "Use torch.compile with dynamic=True.",
"title": "Torch Compile",
"type": "boolean"
},
"param_init": {
"default": 0.1,
"description": "Support value for uniform distribution parameters initialization. Set to 0 not to use initialization.",
"title": "Param Init",
"type": "number"
},
"param_init_method": {
"default": "uniform",
"description": "Parameter initialization method.",
"enum": [
"xavier_uniform",
"uniform",
"normal"
],
"title": "Param Init Method",
"type": "string"
},
"freeze_encoder": {
"default": false,
"description": "Freeze parameters in encoder.",
"title": "Freeze Encoder",
"type": "boolean"
},
"freeze_decoder": {
"default": false,
"description": "Freeze parameters in decoder.",
"title": "Freeze Decoder",
"type": "boolean"
},
"pre_word_vecs_enc": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If a valid path is specified, will load pretrained word embeddings on the encoder side.",
"title": "Pre Word Vecs Enc"
},
"pre_word_vecs_dec": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If a valid path is specified, will load pretrained word embeddings on the decoder side.",
"title": "Pre Word Vecs Dec"
},
"data_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "text",
"title": "Data Type"
},
"bucket_size": {
"default": 262144,
"description": "A bucket is a buffer of bucket_size examples to pick from the various corpora. The dynamic iterator batches batch_size items from the bucket and shuffle them.",
"title": "Bucket Size",
"type": "integer"
},
"bucket_size_init": {
"default": -1,
"description": "Bucket size is initialized with this amount of examples (see bucket_size_increment).",
"title": "Bucket Size Init",
"type": "integer"
},
"bucket_size_increment": {
"default": 0,
"description": "Bucket size incremented with this amount of examples at each new bucket (up to bucket_size).",
"title": "Bucket Size Increment",
"type": "integer"
},
"prefetch_factor": {
"default": 200,
"description": "Number of mini-batches loaded in advance to avoid the GPU waiting during processing of next bucket.",
"title": "Prefetch Factor",
"type": "integer"
},
"save_format": {
"default": "pytorch",
"description": "Format to save the model weights.",
"enum": [
"pytorch",
"safetensors"
],
"title": "Save Format",
"type": "string"
},
"save_checkpoint_steps": {
"default": 5000,
"description": "Frequency of checkpoint saving (in steps).",
"title": "Save Checkpoint Steps",
"type": "integer"
},
"keep_checkpoint": {
"default": -1,
"description": "Number of checkpoints to retain. (-1 retains all)",
"title": "Keep Checkpoint",
"type": "integer"
},
"train_from": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Pretrained model/checkpoint weights to continue training from.",
"title": "Train From"
},
"num_workers": {
"default": 2,
"description": "Number of workers for pytorch.DataLoader objects.",
"title": "Num Workers",
"type": "integer"
},
"batch_size": {
"default": 64,
"description": "Maximum batch size for training.",
"title": "Batch Size",
"type": "integer"
},
"batch_size_multiple": {
"default": 1,
"description": "Batch size multiple for token batches.",
"title": "Batch Size Multiple",
"type": "integer"
},
"batch_type": {
"default": "sents",
"description": "Batch grouping for batch_size.",
"enum": [
"sents",
"tokens"
],
"title": "Batch Type",
"type": "string"
},
"normalization": {
"default": "sents",
"description": "Normalization method of the gradient.",
"enum": [
"sents",
"tokens"
],
"title": "Normalization",
"type": "string"
},
"accum_count": {
"default": [
1
],
"description": "Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for transformer.",
"items": {
"type": "integer"
},
"title": "Accum Count",
"type": "array"
},
"accum_steps": {
"default": [
0
],
"description": "Steps at which accum_count values change.",
"items": {
"type": "integer"
},
"title": "Accum Steps",
"type": "array"
},
"valid_steps": {
"default": 10000,
"description": "Frequency of validation, in steps.",
"title": "Valid Steps",
"type": "integer"
},
"valid_batch_size": {
"default": 32,
"description": "Maximum batch size for validation.",
"title": "Valid Batch Size",
"type": "integer"
},
"train_steps": {
"default": 100000,
"description": "Number of training steps.",
"title": "Train Steps",
"type": "integer"
},
"single_pass": {
"default": false,
"description": "Make a single pass over the training dataset.",
"title": "Single Pass",
"type": "boolean"
},
"early_stopping": {
"default": 0,
"description": "Number of validation steps without improving that will trigger early stop of training.",
"title": "Early Stopping",
"type": "integer"
},
"early_stopping_criteria": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Criteria to use for early stopping.",
"title": "Early Stopping Criteria"
},
"max_grad_norm": {
"default": 5,
"description": "If the norm of the gradient vector exceeds this value, renormalize it to have the norm equal to max_grad_norm.",
"title": "Max Grad Norm",
"type": "number"
},
"dropout": {
"default": [
0.3
],
"description": "Dropout probability.",
"items": {
"type": "number"
},
"title": "Dropout",
"type": "array"
},
"attention_dropout": {
"default": [
0.1
],
"description": "Attention dropout probability.",
"items": {
"type": "number"
},
"title": "Attention Dropout",
"type": "array"
},
"dropout_steps": {
"default": [
0
],
"description": "Steps at which dropout changes.",
"items": {
"type": "integer"
},
"title": "Dropout Steps",
"type": "array"
},
"truncated_decoder": {
"default": 0,
"description": "Truncated bptt.",
"title": "Truncated Decoder",
"type": "integer"
},
"label_smoothing": {
"default": 0.0,
"description": "Label smoothing value epsilon. Probability of all non-true labels will be smoothed by epsilon/(vocab_size-1). Set to 0 to turn off label smoothing. (https://arxiv.org/abs/1512.00567)",
"title": "Label Smoothing",
"type": "number"
},
"average_decay": {
"default": 0.0,
"description": "Exponential moving average decay (https://en.wikipedia.org/wiki/Moving_average). Set to other than 0 (e.g. 1e-4) to activate. Similar to Marian NMT implementation (http://www.aclweb.org/anthology/P18-4020).",
"title": "Average Decay",
"type": "number"
},
"average_every": {
"default": 1,
"description": "Step for moving average. Default is every update if average_decay is set.",
"title": "Average Every",
"type": "integer"
},
"loss_scale": {
"default": 0.0,
"description": "For FP16 training, the static loss scale to use. If not set, the loss scale is dynamically computed.",
"title": "Loss Scale",
"type": "number"
},
"apex_opt_level": {
"default": "",
"description": "For FP16 training, the opt_level to use. See https://nvidia.github.io/apex/amp.html#opt-levels.",
"enum": [
"",
"O0",
"O1",
"O2",
"O3"
],
"title": "Apex Opt Level",
"type": "string"
},
"zero_out_prompt_loss": {
"default": false,
"description": "Set the prompt loss to zero. Mostly for LLM finetuning. Will be enabled only if the `insert_mask_before_placeholder` transform is applied.",
"title": "Zero Out Prompt Loss",
"type": "boolean"
},
"use_ckpting": {
"default": [],
"description": "Use gradient checkpointing for those modules.",
"items": {
"type": "string"
},
"title": "Use Ckpting",
"type": "array"
},
"update_vocab": {
"default": false,
"description": "Update source and target existing vocabularies.",
"title": "Update Vocab",
"type": "boolean"
},
"lm_prior_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "LM model to use to train the TM.",
"title": "Lm Prior Model"
},
"lm_prior_lambda": {
"default": 0.0,
"description": "LM Prior Lambda",
"title": "Lm Prior Lambda",
"type": "number"
},
"lm_prior_tau": {
"default": 1.0,
"description": "LM Prior Tau",
"title": "Lm Prior Tau",
"type": "number"
},
"estim_loss_lambda": {
"default": [
1.0
],
"description": "Weight applied to estimator loss",
"items": {
"type": "number"
},
"title": "Estim Loss Lambda",
"type": "array"
},
"estim_loss_lambda_steps": {
"default": [
0
],
"description": "Steps at which estimator loss lambda changes",
"items": {
"type": "integer"
},
"title": "Estim Loss Lambda Steps",
"type": "array"
},
"score_threshold": {
"default": 0.68,
"description": "Threshold to filterout data",
"title": "Score Threshold",
"type": "number"
},
"dummy_load": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Ignore some warnings if we are only loading the configuration prior to other operations, e.g. in `train_from` context.",
"title": "Dummy Load"
}
},
"additionalProperties": false
}

field accum_count : List[int] = [1]

Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for transformer.

  • Validated by:
    • _validate_running_config

field accum_steps : List[int] = [0]

Steps at which accum_count values change.

  • Validated by:
    • _validate_running_config
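The pairing of accum_count with accum_steps (and, by the same pattern, dropout with dropout_steps) reads as a step-indexed schedule. A minimal sketch of such a lookup follows, assuming the two lists have equal length and that the steps list is sorted in ascending order; the helper name value_at_step is hypothetical and is not part of eole.

```python
from bisect import bisect_right


def value_at_step(step, values, steps):
    """Hypothetical helper: schedule value in effect at training step `step`."""
    if len(values) != len(steps):
        raise ValueError("values and steps must have the same length")
    # Index of the last threshold that is <= step (steps assumed ascending).
    idx = bisect_right(steps, step) - 1
    return values[max(idx, 0)]


# Example: accumulate 1 gradient until step 1000, then 2, then 4 from step 2000.
accum_at_1500 = value_at_step(1500, [1, 2, 4], [0, 1000, 2000])
```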

field apex_opt_level : Literal['', 'O0', 'O1', 'O2', 'O3'] = ''

For FP16 training, the opt_level to use. See https://nvidia.github.io/apex/amp.html#opt-levels.

  • Validated by:
    • _validate_running_config

field attention_dropout : List[float] = [0.1]

Attention dropout probability.

  • Validated by:
    • _validate_running_config

field average_decay : float = 0.0

Exponential moving average decay (https://en.wikipedia.org/wiki/Moving_average). Set to other than 0 (e.g. 1e-4) to activate. Similar to Marian NMT implementation (http://www.aclweb.org/anthology/P18-4020).

  • Validated by:
    • _validate_running_config
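An exponential moving average of model parameters can be sketched as below. This is a generic illustration, not eole's implementation: the helper name ema_update is hypothetical, and it assumes average_decay is the weight given to the newest parameters, which is consistent with small values such as 1e-4 but is not confirmed by the description.

```python
def ema_update(average, params, average_decay=1e-4):
    """Hypothetical helper: one exponential-moving-average update step."""
    # Each averaged value moves a fraction `average_decay` toward the new value.
    return [(1.0 - average_decay) * a + average_decay * p
            for a, p in zip(average, params)]
```

With average_every greater than 1, such an update would presumably run only on every average_every-th step.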

field average_every : int = 1

Step for moving average. Default is every update if average_decay is set.

  • Validated by:
    • _validate_running_config

field batch_size : int = 64

Maximum batch size for training.

  • Validated by:
    • _validate_running_config

field batch_size_multiple : int = 1

Batch size multiple for token batches.

  • Validated by:
    • _validate_running_config

field batch_type : Literal['sents', 'tokens'] = 'sents'

Batch grouping for batch_size.

  • Validated by:
    • _validate_running_config

field bucket_size : int = 262144

A bucket is a buffer of bucket_size examples to pick from the various corpora. The dynamic iterator batches batch_size items from the bucket and shuffles them.

  • Validated by:
    • _validate_running_config

field bucket_size_increment : int = 0

The bucket size is incremented by this number of examples at each new bucket (up to bucket_size).

  • Validated by:
    • _validate_running_config

field bucket_size_init : int = -1

The bucket size is initialized with this number of examples (see bucket_size_increment).

  • Validated by:
    • _validate_running_config
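Taken together, bucket_size_init, bucket_size_increment and bucket_size describe a bucket that grows from one fill to the next. A minimal sketch of that progression follows; the generator name bucket_sizes is hypothetical, and treating a non-positive bucket_size_init (such as the default -1) as "start at the full bucket_size" is an assumption, not documented behavior.

```python
from itertools import islice


def bucket_sizes(bucket_size=262144, bucket_size_init=-1, bucket_size_increment=0):
    """Hypothetical generator: size of each successive bucket."""
    # Assumption: a non-positive init means "use the full bucket_size right away".
    if bucket_size_init <= 0:
        current = bucket_size
    else:
        current = min(bucket_size_init, bucket_size)
    while True:
        yield current
        # Grow by the increment, never exceeding bucket_size.
        current = min(current + bucket_size_increment, bucket_size)


# First five bucket sizes for a small illustrative configuration.
first_sizes = list(islice(bucket_sizes(1000, 200, 300), 5))
```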

field data_type : str | None = 'text'

  • Validated by:
    • _validate_running_config

field dropout : List[float] = [0.3]

Dropout probability.

  • Validated by:
    • _validate_running_config

field dropout_steps : List[int] = [0]

Steps at which dropout changes.

  • Validated by:
    • _validate_running_config

field dummy_load : bool | None = False

Ignore some warnings if we are only loading the configuration prior to other operations, e.g. in train_from context.

  • Validated by:
    • _validate_running_config

field early_stopping : int = 0

Number of validation steps without improvement that will trigger early stopping of training.

  • Validated by:
    • _validate_running_config
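The early_stopping count behaves like a patience counter. A minimal sketch under stated assumptions: the class name PatienceCounter is hypothetical, the tracked criterion is a validation loss where lower is better, and a value of 0 (the default) is assumed to disable early stopping.

```python
class PatienceCounter:
    """Hypothetical helper tracking validations without improvement."""

    def __init__(self, early_stopping):
        self.tolerance = early_stopping
        self.best = None
        self.stalled = 0

    def update(self, valid_loss):
        """Record one validation result; return True when training should stop."""
        if self.best is None or valid_loss < self.best:
            # Improvement: remember it and reset the counter.
            self.best = valid_loss
            self.stalled = 0
        else:
            self.stalled += 1
        # Assumption: a tolerance of 0 means early stopping is disabled.
        return self.tolerance > 0 and self.stalled >= self.tolerance
```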

field early_stopping_criteria : str | None = None

Criteria to use for early stopping.

  • Validated by:
    • _validate_running_config

field estim_loss_lambda : List[float] = [1.0]

Weight applied to the estimator loss.

  • Validated by:
    • _validate_running_config

field estim_loss_lambda_steps : List[int] = [0]

Steps at which the estimator loss lambda changes.

  • Validated by:
    • _validate_running_config

field freeze_decoder : bool = False

Freeze parameters in decoder.

  • Validated by:
    • _validate_running_config

field freeze_encoder : bool = False

Freeze parameters in encoder.

  • Validated by:
    • _validate_running_config

field keep_checkpoint : int = -1

Number of checkpoints to retain (-1 retains all).

  • Validated by:
    • _validate_running_config

field label_smoothing : float = 0.0

Label smoothing value epsilon. Probability of all non-true labels will be smoothed by epsilon/(vocab_size-1). Set to 0 to turn off label smoothing. (https://arxiv.org/abs/1512.00567)

  • Validated by:
    • _validate_running_config
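The smoothing rule in the description can be illustrated with a small worked example. The helper name smoothed_targets is hypothetical; assigning the remaining 1 - epsilon to the true label is the standard convention and is assumed here rather than stated in the description.

```python
def smoothed_targets(true_index, vocab_size, label_smoothing=0.1):
    """Hypothetical helper: target distribution after label smoothing."""
    # Every non-true label receives epsilon / (vocab_size - 1).
    off_value = label_smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    # Assumed: the true label keeps the remaining probability mass.
    dist[true_index] = 1.0 - label_smoothing
    return dist
```

For a vocabulary of 5 and epsilon = 0.1, the true label gets 0.9 and each of the four other labels gets 0.025, so the distribution still sums to 1.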

field lm_prior_lambda : float = 0.0

LM Prior Lambda

  • Validated by:
    • _validate_running_config

field lm_prior_model : str | None = None

Language model (LM) to use when training the translation model (TM).

  • Validated by:
    • _validate_running_config

field lm_prior_tau : float = 1.0

LM Prior Tau

  • Validated by:
    • _validate_running_config

field loss_scale : float = 0.0

For FP16 training, the static loss scale to use. If not set, the loss scale is dynamically computed.

  • Validated by:
    • _validate_running_config

field max_grad_norm : float = 5

If the norm of the gradient vector exceeds this value, renormalize it to have the norm equal to max_grad_norm.

  • Validated by:
    • _validate_running_config
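The renormalization described for max_grad_norm can be sketched in plain Python as below. This is an illustration of norm-based gradient clipping in general, not eole's code path; the helper name clip_by_global_norm is hypothetical, and the gradient is represented as a flat list of floats for simplicity.

```python
import math


def clip_by_global_norm(grads, max_grad_norm=5.0):
    """Hypothetical helper: rescale `grads` so its L2 norm is at most max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm:
        # Within the limit: leave the gradient unchanged.
        return list(grads)
    # Over the limit: scale every component by the same factor.
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads]
```

A gradient of [6.0, 8.0] has norm 10, so with max_grad_norm = 5 it would be scaled by 0.5, keeping its direction.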

field normalization : Literal['sents', 'tokens'] = 'sents'

Normalization method of the gradient.

  • Validated by:
    • _validate_running_config

field num_workers : int = 2

Number of workers for PyTorch DataLoader objects.

  • Validated by:
    • _validate_running_config

field param_init : float = 0.1

Support value of the uniform distribution used for parameter initialization. Set to 0 to skip initialization.

  • Validated by:
    • _validate_running_config

field param_init_method : Literal['xavier_uniform', 'uniform', 'normal'] = 'uniform'

Parameter initialization method.

  • Validated by:
    • _validate_running_config

field pre_word_vecs_dec : str | None = None

If a valid path is specified, will load pretrained word embeddings on the decoder side.

  • Validated by:
    • _validate_running_config

field pre_word_vecs_enc : str | None = None

If a valid path is specified, will load pretrained word embeddings on the encoder side.

  • Validated by:
    • _validate_running_config

field prefetch_factor : int = 200

Number of mini-batches loaded in advance so that the GPU does not wait while the next bucket is processed.

  • Validated by:
    • _validate_running_config

field save_checkpoint_steps : int = 5000

Frequency of checkpoint saving (in steps).

  • Validated by:
    • _validate_running_config

field save_format : Literal['pytorch', 'safetensors'] = 'pytorch'

Format to save the model weights.

  • Validated by:
    • _validate_running_config

field score_threshold : float = 0.68

Threshold to filter out data.

  • Validated by:
    • _validate_running_config

field single_pass : bool = False

Make a single pass over the training dataset.

  • Validated by:
    • _validate_running_config

field train_from : str | None = None

Pretrained model/checkpoint weights to continue training from.

  • Validated by:
    • _validate_running_config

field train_steps : int = 100000

Number of training steps.

  • Validated by:
    • _validate_running_config

field truncated_decoder : int = 0

Truncated BPTT (backpropagation through time).

  • Validated by:
    • _validate_running_config

field update_vocab : bool = False

Update source and target existing vocabularies.

  • Validated by:
    • _validate_running_config

field use_ckpting : List[str] = []

Use gradient checkpointing for the listed modules.

field valid_batch_size : int = 32

Maximum batch size for validation.

  • Validated by:
    • _validate_running_config

field valid_steps : int = 10000

Frequency of validation, in steps.

  • Validated by:
    • _validate_running_config

field zero_out_prompt_loss : bool = False

Set the prompt loss to zero. Mostly for LLM finetuning. Will be enabled only if the insert_mask_before_placeholder transform is applied.

  • Validated by:
    • _validate_running_config

validator checkpointing_layers » use_ckpting[source]

get_model_path()[source]

property storage_dtype : dtype[source]

Deduce which dtype to use for main model parameters. E.g. with mixed precision a copy is kept in float32.