Training

pydantic model eole.config.training.OptimizerConfig​

Bases: Config

Everything related to optimizers. Might be split into multiple subclasses later. Note: not fully sufficient (yet) to replace full opt namespace in build_torch_optimizer. Some other parameters (hidden_size, compute_dtype, etc.) are accessed.

Show JSON schema
{
"title": "OptimizerConfig",
"description": "Everything related to optimizers.\nMight be split into multiple subclasses later.\nNote: not fully sufficient (yet) to replace full opt namespace in build_torch_optimizer.\nSome other parameters (hidden_size, compute_dtype, etc.) are accessed.",
"type": "object",
"properties": {
"optim": {
"default": "sgd",
"description": "Optimization method.",
"enum": [
"sgd",
"adagrad",
"adadelta",
"adam",
"adamw",
"sparseadam",
"adafactor",
"adamw8bit",
"pagedadamw8bit",
"pagedadamw32bit"
],
"title": "Optim",
"type": "string"
},
"adagrad_accumulator_init": {
"default": 0,
"description": "Initialize the accumulator values in adagrad. Mirrors initial_accumulator_value flag from tensorflow adagrad implementation (default 0.1 there).",
"title": "Adagrad Accumulator Init",
"type": "number"
},
"adam_beta1": {
"default": 0.9,
"description": "Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.",
"title": "Adam Beta1",
"type": "number"
},
"adam_beta2": {
"default": 0.999,
"description": "Beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). Whereas recently the paper Attention is All You Need suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.",
"title": "Adam Beta2",
"type": "number"
},
"adam_eps": {
"default": 1e-08,
"description": "Adam epsilon to forward to torch Optimizer.",
"title": "Adam Eps",
"type": "number"
},
"weight_decay": {
"default": 0.0,
"description": "Weight decay to forward to torch Optimizer.",
"title": "Weight Decay",
"type": "number"
},
"use_amp": {
"default": true,
"description": "Use torch mixed precision when compute_dtype is 16-bit.",
"title": "Use Amp",
"type": "boolean"
},
"learning_rate": {
"default": 1.0,
"description": "Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.",
"title": "Learning Rate",
"type": "number"
},
"learning_rate_decay": {
"default": 0.5,
"description": "Decay learning rate by this much if steps have gone past start_decay_steps.",
"title": "Learning Rate Decay",
"type": "number"
},
"start_decay_steps": {
"default": 50000,
"description": "Start decaying every decay_steps after this many steps.",
"title": "Start Decay Steps",
"type": "integer"
},
"decay_steps": {
"default": 10000,
"description": "Frequency for learning rate decay, in steps.",
"title": "Decay Steps",
"type": "integer"
},
"decay_method": {
"default": "none",
"description": "Custom decay method to use.",
"enum": [
"noam",
"noamwd",
"cosine",
"rsqrt",
"none"
],
"title": "Decay Method",
"type": "string"
},
"warmup_steps": {
"default": 4000,
"description": "Number of warmup steps for custom decay.",
"title": "Warmup Steps",
"type": "integer"
},
"reset_optim": {
"default": "none",
"description": "Optimization resetter when using train_from.",
"enum": [
"none",
"all",
"states",
"keep_states"
],
"title": "Reset Optim",
"type": "string"
}
},
"additionalProperties": false
}

field adagrad_accumulator_init : float = 0​

Initialize the accumulator values in adagrad. Mirrors initial_accumulator_value flag from tensorflow adagrad implementation (default 0.1 there).

field adam_beta1 : float = 0.9​

Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.

field adam_beta2 : float = 0.999​

Beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). Whereas recently the paper Attention is All You Need suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.

field adam_eps : float = 1e-08​

Adam epsilon to forward to torch Optimizer.

field decay_method : Literal['noam', 'noamwd', 'cosine', 'rsqrt', 'none'] = 'none'​

Custom decay method to use.
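
For reference, the "noam" option conventionally denotes the inverse-square-root schedule with linear warmup from "Attention Is All You Need", driven by warmup_steps and the model's hidden_size (one of the externally accessed parameters mentioned in the class note above). Below is a minimal sketch of that standard formula with illustrative defaults, not a verbatim copy of eole's implementation:

def noam_lr(step, learning_rate=2.0, hidden_size=512, warmup_steps=4000):
    # Standard "noam" schedule: scale by hidden_size**-0.5, warm up linearly
    # for warmup_steps steps, then decay proportionally to step**-0.5.
    step = max(step, 1)
    return learning_rate * hidden_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)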

field decay_steps : int = 10000​

Frequency for learning rate decay, in steps.

field learning_rate : float = 1.0​

Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.

field learning_rate_decay : float = 0.5​

Multiplicative factor applied to the learning rate at each decay, once steps have gone past start_decay_steps.
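
Based on the descriptions of learning_rate_decay, start_decay_steps and decay_steps, the default behaviour (decay_method='none') amounts to a step-wise exponential decay. The sketch below illustrates that behaviour as described here; the exact boundary handling in eole's code may differ:

def stepwise_lr(step, learning_rate=1.0, learning_rate_decay=0.5,
                start_decay_steps=50000, decay_steps=10000):
    # Keep the starting rate until start_decay_steps, then multiply by
    # learning_rate_decay once every decay_steps steps.
    if step < start_decay_steps:
        return learning_rate
    n_decays = (step - start_decay_steps) // decay_steps + 1
    return learning_rate * learning_rate_decay ** n_decays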

field optim : Literal['sgd', 'adagrad', 'adadelta', 'adam', 'adamw', 'sparseadam', 'adafactor', 'adamw8bit', 'pagedadamw8bit', 'pagedadamw32bit'] = 'sgd'​

Optimization method.

field reset_optim : Literal['none', 'all', 'states', 'keep_states'] = 'none'​

Controls which optimizer options and states are reset when using train_from.

field start_decay_steps : int = 50000​

Start decaying the learning rate after this many steps; decay then happens every decay_steps steps.

field use_amp : bool = True​

Use torch mixed precision when compute_dtype is 16-bit.
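
use_amp enables the standard torch automatic mixed precision setup. The sketch below shows the generic torch.amp training-step pattern it refers to (it assumes a CUDA device and is not a quote of eole's trainer loop):

import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler()   # loss scaling is only needed for fp16, not bf16

x = torch.randn(8, 16, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).sum()
scaler.scale(loss).backward()          # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()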

field warmup_steps : int = 4000​

Number of warmup steps for custom decay.

field weight_decay : float = 0.0​

Weight decay to forward to torch Optimizer.
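
Since all fields of OptimizerConfig carry defaults, the model can be instantiated directly to inspect or override a handful of values. A minimal sketch; the chosen values are illustrative, not recommendations:

from eole.config.training import OptimizerConfig

opt_cfg = OptimizerConfig(
    optim="adamw",        # any of the Literal choices listed above
    learning_rate=2e-4,
    adam_beta2=0.998,
    weight_decay=0.01,
    decay_method="noam",
    warmup_steps=6000,
)
print(opt_cfg)
# Unknown keys are rejected, matching "additionalProperties": false in the schema above.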

pydantic model eole.config.training.TrainingConfig​

Bases: RunningConfig, OptimizerConfig, LoRaConfig, QuantizeConfig

Show JSON schema
{
"title": "TrainingConfig",
"type": "object",
"properties": {
"quant_layers": {
"default": [],
"description": "List of layers to be compressed in 4/8bit.",
"items": {
"type": "string"
},
"title": "Quant Layers",
"type": "array"
},
"quant_type": {
"default": "",
"description": "Type of compression.",
"enum": [
"",
"bnb_8bit",
"bnb_FP4",
"bnb_NF4",
"awq_gemm",
"awq_gemv"
],
"title": "Quant Type",
"type": "string"
},
"w_bit": {
"default": 4,
"description": "W_bit quantization",
"title": "W Bit",
"type": "integer"
},
"group_size": {
"default": 128,
"description": "Group size quantization.",
"title": "Group Size",
"type": "integer"
},
"lora_layers": {
"default": [],
"description": "List of layers to be replaced by LoRa layers. E.g. ['linear_values', 'linear_query'] (\u00a74.2 in https://arxiv.org/abs/2106.09685)",
"items": {
"type": "string"
},
"title": "Lora Layers",
"type": "array"
},
"lora_embedding": {
"default": false,
"description": "Replace embeddings with LoRa Embeddings (\u00a75.1)",
"title": "Lora Embedding",
"type": "boolean"
},
"lora_rank": {
"default": 2,
"description": "r=2 successfully tested with NLLB-200 3.3B",
"title": "Lora Rank",
"type": "integer"
},
"lora_alpha": {
"default": 1,
"description": "\u00a74.1 https://arxiv.org/abs/2106.09685",
"title": "Lora Alpha",
"type": "integer"
},
"lora_dropout": {
"default": 0.0,
"description": "Rule of thumb: same value as in main model.",
"title": "Lora Dropout",
"type": "number"
},
"optim": {
"default": "sgd",
"description": "Optimization method.",
"enum": [
"sgd",
"adagrad",
"adadelta",
"adam",
"adamw",
"sparseadam",
"adafactor",
"adamw8bit",
"pagedadamw8bit",
"pagedadamw32bit"
],
"title": "Optim",
"type": "string"
},
"adagrad_accumulator_init": {
"default": 0,
"description": "Initialize the accumulator values in adagrad. Mirrors initial_accumulator_value flag from tensorflow adagrad implementation (default 0.1 there).",
"title": "Adagrad Accumulator Init",
"type": "number"
},
"adam_beta1": {
"default": 0.9,
"description": "Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.",
"title": "Adam Beta1",
"type": "number"
},
"adam_beta2": {
"default": 0.999,
"description": "Beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). Whereas recently the paper Attention is All You Need suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.",
"title": "Adam Beta2",
"type": "number"
},
"adam_eps": {
"default": 1e-08,
"description": "Adam epsilon to forward to torch Optimizer.",
"title": "Adam Eps",
"type": "number"
},
"weight_decay": {
"default": 0.0,
"description": "Weight decay to forward to torch Optimizer.",
"title": "Weight Decay",
"type": "number"
},
"use_amp": {
"default": true,
"description": "Use torch mixed precision when compute_dtype is 16-bit.",
"title": "Use Amp",
"type": "boolean"
},
"learning_rate": {
"default": 1.0,
"description": "Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.",
"title": "Learning Rate",
"type": "number"
},
"learning_rate_decay": {
"default": 0.5,
"description": "Decay learning rate by this much if steps have gone past start_decay_steps.",
"title": "Learning Rate Decay",
"type": "number"
},
"start_decay_steps": {
"default": 50000,
"description": "Start decaying every decay_steps after this many steps.",
"title": "Start Decay Steps",
"type": "integer"
},
"decay_steps": {
"default": 10000,
"description": "Frequency for learning rate decay, in steps.",
"title": "Decay Steps",
"type": "integer"
},
"decay_method": {
"default": "none",
"description": "Custom decay method to use.",
"enum": [
"noam",
"noamwd",
"cosine",
"rsqrt",
"none"
],
"title": "Decay Method",
"type": "string"
},
"warmup_steps": {
"default": 4000,
"description": "Number of warmup steps for custom decay.",
"title": "Warmup Steps",
"type": "integer"
},
"reset_optim": {
"default": "none",
"description": "Optimization resetter when using train_from.",
"enum": [
"none",
"all",
"states",
"keep_states"
],
"title": "Reset Optim",
"type": "string"
},
"gpu_ranks": {
"default": [],
"description": "List of ranks for each process.",
"items": {
"type": "integer"
},
"title": "Gpu Ranks",
"type": "array"
},
"world_size": {
"default": 1,
"description": "Total number of distributed processes.",
"title": "World Size",
"type": "integer"
},
"parallel_mode": {
"default": "data_parallel",
"description": "Distributed mode.",
"enum": [
"data_parallel",
"tensor_parallel"
],
"title": "Parallel Mode",
"type": "string"
},
"gpu_backend": {
"default": "nccl",
"description": "Type of torch distributed backend.",
"title": "Gpu Backend",
"type": "string"
},
"gpu_verbose_level": {
"default": 0,
"description": "Gives more info on each process per GPU.",
"title": "Gpu Verbose Level",
"type": "integer"
},
"master_ip": {
"default": "localhost",
"description": "IP of master for torch.distributed training.",
"title": "Master Ip",
"type": "string"
},
"master_port": {
"default": 10000,
"description": "Port of master for torch.distributed training.",
"title": "Master Port",
"type": "integer"
},
"timeout": {
"default": 60,
"description": "Timeout for one GPU to wait for the others.",
"title": "Timeout",
"type": "integer"
},
"model_path": {
"default": "model",
"description": "Path to directory containing all model components.",
"title": "Model Path",
"type": "string"
},
"self_attn_backend": {
"default": "flash",
"description": "Self-attention backend.",
"enum": [
"flash",
"pytorch"
],
"title": "Self Attn Backend",
"type": "string"
},
"compute_dtype": {
"description": "Compute dtype (precision) to use for main compute. Some parameters might have other dtypes for specific cases (e.g. torch.amp -- See eole.config.training.TrainingConfig.storage_dtype) fp32 to force slow fp16 model on gtx1080, int8 to enable pytorch native 8-bit quantization (cpu only).",
"enum": [
"fp32",
"fp16",
"int8",
"bf16"
],
"title": "Compute Dtype",
"type": "string"
},
"torch_compile": {
"default": false,
"description": "Use torch.compile with dynamic=True.",
"title": "Torch Compile",
"type": "boolean"
},
"param_init": {
"default": 0.1,
"description": "Support value for uniform distribution parameters initialization. Set to 0 not to use initialization.",
"title": "Param Init",
"type": "number"
},
"param_init_method": {
"default": "uniform",
"description": "Parameter initialization method.",
"enum": [
"xavier_uniform",
"uniform",
"normal"
],
"title": "Param Init Method",
"type": "string"
},
"freeze_encoder": {
"default": false,
"description": "Freeze parameters in encoder.",
"title": "Freeze Encoder",
"type": "boolean"
},
"freeze_decoder": {
"default": false,
"description": "Freeze parameters in decoder.",
"title": "Freeze Decoder",
"type": "boolean"
},
"pre_word_vecs_enc": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If a valid path is specified, will load pretrained word embeddings on the encoder side.",
"title": "Pre Word Vecs Enc"
},
"pre_word_vecs_dec": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If a valid path is specified, will load pretrained word embeddings on the decoder side.",
"title": "Pre Word Vecs Dec"
},
"data_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "text",
"title": "Data Type"
},
"bucket_size": {
"default": 262144,
"description": "A bucket is a buffer of bucket_size examples to pick from the various corpora. The dynamic iterator batches batch_size items from the bucket and shuffle them.",
"title": "Bucket Size",
"type": "integer"
},
"bucket_size_init": {
"default": -1,
"description": "Bucket size is initialized with this amount of examples (see bucket_size_increment).",
"title": "Bucket Size Init",
"type": "integer"
},
"bucket_size_increment": {
"default": 0,
"description": "Bucket size incremented with this amount of examples at each new bucket (up to bucket_size).",
"title": "Bucket Size Increment",
"type": "integer"
},
"prefetch_factor": {
"default": 200,
"description": "Number of mini-batches loaded in advance to avoid the GPU waiting during processing of next bucket.",
"title": "Prefetch Factor",
"type": "integer"
},
"save_format": {
"default": "pytorch",
"description": "Format to save the model weights.",
"enum": [
"pytorch",
"safetensors"
],
"title": "Save Format",
"type": "string"
},
"save_checkpoint_steps": {
"default": 5000,
"description": "Frequency of checkpoint saving (in steps).",
"title": "Save Checkpoint Steps",
"type": "integer"
},
"keep_checkpoint": {
"default": -1,
"description": "Number of checkpoints to retain. (-1 retains all)",
"title": "Keep Checkpoint",
"type": "integer"
},
"train_from": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Pretrained model/checkpoint weights to continue training from.",
"title": "Train From"
},
"num_workers": {
"default": 2,
"description": "Number of workers for pytorch.DataLoader objects.",
"title": "Num Workers",
"type": "integer"
},
"batch_size": {
"default": 64,
"description": "Maximum batch size for training.",
"title": "Batch Size",
"type": "integer"
},
"batch_size_multiple": {
"default": 1,
"description": "Batch size multiple for token batches.",
"title": "Batch Size Multiple",
"type": "integer"
},
"batch_type": {
"default": "sents",
"description": "Batch grouping for batch_size.",
"enum": [
"sents",
"tokens"
],
"title": "Batch Type",
"type": "string"
},
"normalization": {
"default": "sents",
"description": "Normalization method of the gradient.",
"enum": [
"sents",
"tokens"
],
"title": "Normalization",
"type": "string"
},
"accum_count": {
"default": [
1
],
"description": "Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for transformer.",
"items": {
"type": "integer"
},
"title": "Accum Count",
"type": "array"
},
"accum_steps": {
"default": [
0
],
"description": "Steps at which accum_count values change.",
"items": {
"type": "integer"
},
"title": "Accum Steps",
"type": "array"
},
"valid_steps": {
"default": 10000,
"description": "Frequency of validation, in steps.",
"title": "Valid Steps",
"type": "integer"
},
"valid_batch_size": {
"default": 32,
"description": "Maximum batch size for validation.",
"title": "Valid Batch Size",
"type": "integer"
},
"train_steps": {
"default": 100000,
"description": "Number of training steps.",
"title": "Train Steps",
"type": "integer"
},
"single_pass": {
"default": false,
"description": "Make a single pass over the training dataset.",
"title": "Single Pass",
"type": "boolean"
},
"early_stopping": {
"default": 0,
"description": "Number of validation steps without improving that will trigger early stop of training.",
"title": "Early Stopping",
"type": "integer"
},
"early_stopping_criteria": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Criteria to use for early stopping.",
"title": "Early Stopping Criteria"
},
"max_grad_norm": {
"default": 5,
"description": "If the norm of the gradient vector exceeds this value, renormalize it to have the norm equal to max_grad_norm.",
"title": "Max Grad Norm",
"type": "number"
},
"dropout": {
"default": [
0.3
],
"description": "Dropout probability.",
"items": {
"type": "number"
},
"title": "Dropout",
"type": "array"
},
"attention_dropout": {
"default": [
0.1
],
"description": "Attention dropout probability.",
"items": {
"type": "number"
},
"title": "Attention Dropout",
"type": "array"
},
"dropout_steps": {
"default": [
0
],
"description": "Steps at which dropout changes.",
"items": {
"type": "integer"
},
"title": "Dropout Steps",
"type": "array"
},
"label_smoothing": {
"default": 0.0,
"description": "Label smoothing value epsilon. Probability of all non-true labels will be smoothed by epsilon/(vocab_size-1). Set to 0 to turn off label smoothing. (https://arxiv.org/abs/1512.00567)",
"title": "Label Smoothing",
"type": "number"
},
"average_decay": {
"default": 0.0,
"description": "Exponential moving average decay (https://en.wikipedia.org/wiki/Moving_average). Set to other than 0 (e.g. 1e-4) to activate. Similar to Marian NMT implementation (http://www.aclweb.org/anthology/P18-4020).",
"title": "Average Decay",
"type": "number"
},
"average_every": {
"default": 1,
"description": "Step for moving average. Default is every update if average_decay is set.",
"title": "Average Every",
"type": "integer"
},
"zero_out_prompt_loss": {
"default": false,
"description": "Set the prompt loss to zero. Mostly for LLM finetuning. Will be enabled only if the `insert_mask_before_placeholder` transform is applied.",
"title": "Zero Out Prompt Loss",
"type": "boolean"
},
"use_ckpting": {
"default": [],
"description": "Use gradient checkpointing for those modules.",
"items": {
"type": "string"
},
"title": "Use Ckpting",
"type": "array"
},
"update_vocab": {
"default": false,
"description": "Update source and target existing vocabularies.",
"title": "Update Vocab",
"type": "boolean"
},
"lm_prior_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "LM model to use to train the TM.",
"title": "Lm Prior Model"
},
"lm_prior_lambda": {
"default": 0.0,
"description": "LM Prior Lambda",
"title": "Lm Prior Lambda",
"type": "number"
},
"lm_prior_tau": {
"default": 1.0,
"description": "LM Prior Tau",
"title": "Lm Prior Tau",
"type": "number"
},
"estim_loss_lambda": {
"default": [
1.0
],
"description": "Weight applied to estimator loss",
"items": {
"type": "number"
},
"title": "Estim Loss Lambda",
"type": "array"
},
"estim_loss_lambda_steps": {
"default": [
0
],
"description": "Steps at which estimator loss lambda changes",
"items": {
"type": "integer"
},
"title": "Estim Loss Lambda Steps",
"type": "array"
},
"score_threshold": {
"default": 0.68,
"description": "Threshold to filterout data",
"title": "Score Threshold",
"type": "number"
},
"log_attention_entropy": {
"default": true,
"description": "Whether to compute and log attention entropy during training.",
"title": "Log Attention Entropy",
"type": "boolean"
},
"attention_entropy_types": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Which attention types to compute entropy for. If None, computes for all available types (e.g., ['std', 'self', 'context']).",
"title": "Attention Entropy Types"
},
"attention_entropy_layers": {
"anyOf": [
{
"items": {
"type": "integer"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Which attention layer indices to include in entropy computation. If None, includes all layers.",
"title": "Attention Entropy Layers"
},
"attention_entropy_aggregation": {
"default": "mean",
"description": "How to aggregate attention entropy across different attention types/layers.",
"enum": [
"mean",
"max",
"min"
],
"title": "Attention Entropy Aggregation",
"type": "string"
}
},
"additionalProperties": false
}

field accum_count : List[int] = [1]​

Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for transformer.

  • Validated by:
    • _validate_running_config
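
As a rough illustration of how accum_count combines with batch_size, batch_type and world_size, the effective batch per optimizer update is approximately the product of the three (the numbers below are purely illustrative):

# Approximate effective batch size per parameter update (illustrative values).
batch_size = 4096      # with batch_type="tokens"
accum_count = 4        # gradients accumulated over 4 mini-batches
world_size = 2         # data_parallel processes
effective_tokens_per_update = batch_size * accum_count * world_size
print(effective_tokens_per_update)  # 32768 tokens per update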

field accum_steps : List[int] = [0]​

Steps at which accum_count values change.

  • Validated by:
    • _validate_running_config

field attention_dropout : List[float] = [0.1]​

Attention dropout probability.

  • Validated by:
    • _validate_running_config

field attention_entropy_aggregation : Literal['mean', 'max', 'min'] = 'mean'​

How to aggregate attention entropy across different attention types/layers.

  • Validated by:
    • _validate_running_config

field attention_entropy_layers : List[int] | None = None​

Which attention layer indices to include in entropy computation. If None, includes all layers.

  • Validated by:
    • _validate_running_config

field attention_entropy_types : List[str] | None = None​

Which attention types to compute entropy for. If None, computes for all available types (e.g., ['std', 'self', 'context']).

  • Validated by:
    • _validate_running_config

field average_decay : float = 0.0​

Exponential moving average decay (https://en.wikipedia.org/wiki/Moving_average). Set to other than 0 (e.g. 1e-4) to activate. Similar to Marian NMT implementation (http://www.aclweb.org/anthology/P18-4020).

  • Validated by:
    • _validate_running_config

field average_every : int = 1​

Step for moving average. Default is every update if average_decay is set.

  • Validated by:
    • _validate_running_config

field batch_size : int = 64​

Maximum batch size for training.

  • Validated by:
    • _validate_running_config

field batch_size_multiple : int = 1​

Batch size multiple for token batches.

  • Validated by:
    • _validate_running_config

field batch_type : Literal['sents', 'tokens'] = 'sents'​

Batch grouping for batch_size.

  • Validated by:
    • _validate_running_config

field bucket_size : int = 262144​

A bucket is a buffer of bucket_size examples to pick from the various corpora. The dynamic iterator batches batch_size items from the bucket and shuffles them.

  • Validated by:
    • _validate_running_config

field bucket_size_increment : int = 0​

Bucket size is incremented by this number of examples at each new bucket (up to bucket_size).

  • Validated by:
    • _validate_running_config

field bucket_size_init : int = -1​

Bucket size is initialized with this number of examples (see bucket_size_increment).

  • Validated by:
    • _validate_running_config
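
Taken together, bucket_size_init, bucket_size_increment and bucket_size describe a buffer that starts small and grows. The sketch below shows how the bucket size would evolve under these descriptions (an illustration, not the iterator's actual code):

def bucket_sizes(bucket_size=262144, bucket_size_init=-1, bucket_size_increment=0):
    # Start at bucket_size_init (or directly at bucket_size when left at -1)
    # and grow by bucket_size_increment per bucket, capped at bucket_size.
    current = bucket_size if bucket_size_init < 0 else bucket_size_init
    while True:
        yield min(current, bucket_size)
        current = min(current + bucket_size_increment, bucket_size)

gen = bucket_sizes(bucket_size=100000, bucket_size_init=10000, bucket_size_increment=30000)
print([next(gen) for _ in range(5)])  # [10000, 40000, 70000, 100000, 100000]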

field data_type : str | None = 'text'​

  • Validated by:
    • _validate_running_config

field dropout : List[float] = [0.3]​

Dropout probability.

  • Validated by:
    • _validate_running_config

field dropout_steps : List[int] = [0]​

Steps at which dropout changes.

  • Validated by:
    • _validate_running_config
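
dropout, attention_dropout and dropout_steps are parallel lists: the value at index i takes effect from the step given at index i of dropout_steps (the same pairing applies to accum_count/accum_steps and estim_loss_lambda/estim_loss_lambda_steps). A minimal lookup sketch assuming that pairing, not the trainer's actual scheduler code:

def scheduled_value(step, values, steps):
    # Return the last value whose start step is <= the current step, e.g.
    # dropout=[0.3, 0.1] with dropout_steps=[0, 20000] keeps 0.3 until
    # step 20000 and uses 0.1 afterwards.
    current = values[0]
    for value, start in zip(values, steps):
        if step >= start:
            current = value
    return current

print(scheduled_value(15000, [0.3, 0.1], [0, 20000]))  # 0.3
print(scheduled_value(25000, [0.3, 0.1], [0, 20000]))  # 0.1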

field early_stopping : int = 0​

Number of validation steps without improvement that will trigger early stopping of training.

  • Validated by:
    • _validate_running_config

field early_stopping_criteria : str | None = None​

Criteria to use for early stopping.

  • Validated by:
    • _validate_running_config

field estim_loss_lambda : List[float] = [1.0]​

Weight applied to estimator loss

  • Validated by:
    • _validate_running_config

field estim_loss_lambda_steps : List[int] = [0]​

Steps at which estimator loss lambda changes

  • Validated by:
    • _validate_running_config

field freeze_decoder : bool = False​

Freeze parameters in decoder.

  • Validated by:
    • _validate_running_config

field freeze_encoder : bool = False​

Freeze parameters in encoder.

  • Validated by:
    • _validate_running_config

field keep_checkpoint : int = -1​

Number of checkpoints to retain. (-1 retains all)

  • Validated by:
    • _validate_running_config

field label_smoothing : float = 0.0​

Label smoothing value epsilon. Probability of all non-true labels will be smoothed by epsilon/(vocab_size-1). Set to 0 to turn off label smoothing. (https://arxiv.org/abs/1512.00567)

  • Validated by:
    • _validate_running_config
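
A short worked example of the formula above, with hypothetical sizes:

# Worked example of the smoothing formula (hypothetical vocabulary size).
label_smoothing = 0.1
vocab_size = 32000
per_wrong_label = label_smoothing / (vocab_size - 1)  # ~3.1e-06 probability mass each
true_label_prob = 1.0 - label_smoothing               # 0.9 for the true label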

field lm_prior_lambda : float = 0.0​

LM Prior Lambda

  • Validated by:
    • _validate_running_config

field lm_prior_model : str | None = None​

LM model to use to train the TM.

  • Validated by:
    • _validate_running_config

field lm_prior_tau : float = 1.0​

LM Prior Tau

  • Validated by:
    • _validate_running_config

field log_attention_entropy : bool = True​

Whether to compute and log attention entropy during training.

  • Validated by:
    • _validate_running_config

field max_grad_norm : float = 5​

If the norm of the gradient vector exceeds this value, renormalize it to have the norm equal to max_grad_norm.

  • Validated by:
    • _validate_running_config
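
This renormalization is the standard gradient-norm clipping operation; below is a minimal sketch of the equivalent torch call (illustrative, not a quote of eole's trainer):

import torch

model = torch.nn.Linear(8, 8)          # stand-in module for illustration
loss = model(torch.randn(4, 8)).sum()
loss.backward()
# Rescale gradients so their global norm does not exceed max_grad_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)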

field normalization : Literal['sents', 'tokens'] = 'sents'​

Normalization method of the gradient.

  • Validated by:
    • _validate_running_config

field num_workers : int = 2​

Number of workers for the PyTorch DataLoader objects.

  • Validated by:
    • _validate_running_config

field param_init : float = 0.1​

Support value for uniform-distribution parameter initialization. Set to 0 to disable initialization.

  • Validated by:
    • _validate_running_config

field param_init_method : Literal['xavier_uniform', 'uniform', 'normal'] = 'uniform'​

Parameter initialization method.

  • Validated by:
    • _validate_running_config

field pre_word_vecs_dec : str | None = None​

If a valid path is specified, will load pretrained word embeddings on the decoder side.

  • Validated by:
    • _validate_running_config

field pre_word_vecs_enc : str | None = None​

If a valid path is specified, will load pretrained word embeddings on the encoder side.

  • Validated by:
    • _validate_running_config

field prefetch_factor : int = 200​

Number of mini-batches loaded in advance to avoid the GPU waiting while the next bucket is processed.

  • Validated by:
    • _validate_running_config

field save_checkpoint_steps : int = 5000​

Frequency of checkpoint saving (in steps).

  • Validated by:
    • _validate_running_config

field save_format : Literal['pytorch', 'safetensors'] = 'pytorch'​

Format to save the model weights.

  • Validated by:
    • _validate_running_config

field score_threshold : float = 0.68​

Threshold to filter out data.

  • Validated by:
    • _validate_running_config

field single_pass : bool = False​

Make a single pass over the training dataset.

  • Validated by:
    • _validate_running_config

field train_from : str | None = None​

Pretrained model/checkpoint weights to continue training from.

  • Validated by:
    • _validate_running_config

field train_steps : int = 100000​

Number of training steps.

  • Validated by:
    • _validate_running_config

field update_vocab : bool = False​

Update existing source and target vocabularies.

  • Validated by:
    • _validate_running_config

field use_ckpting : List[str] = []​

Use gradient checkpointing for those modules.

field valid_batch_size : int = 32​

Maximum batch size for validation.

  • Validated by:
    • _validate_running_config

field valid_steps : int = 10000​

Frequency of validation, in steps.

  • Validated by:
    • _validate_running_config

field zero_out_prompt_loss : bool = False​

Set the prompt loss to zero. Mostly for LLM finetuning. Will be enabled only if the insert_mask_before_placeholder transform is applied.

  • Validated by:
    • _validate_running_config

validator checkpointing_layers » use_ckpting

get_model_path()​

property storage_dtype : dtype​

Deduce which dtype to use for main model parameters. E.g. with mixed precision a copy is kept in float32.
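
A minimal sketch of what this implies under the usual torch.amp arrangement, assuming that 16-bit compute with use_amp keeps float32 master parameters; the exact rules live in the property's implementation:

import torch

# Illustrative mapping only; consult TrainingConfig.storage_dtype for the real rules.
def storage_dtype_sketch(compute_dtype: str, use_amp: bool) -> torch.dtype:
    if use_amp and compute_dtype in ("fp16", "bf16"):
        # Mixed precision: compute in 16-bit, keep master weights in float32.
        return torch.float32
    return {"fp32": torch.float32, "fp16": torch.float16,
            "bf16": torch.bfloat16}.get(compute_dtype, torch.float32)

print(storage_dtype_sketch("fp16", use_amp=True))  # torch.float32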