Main Entrypoints
Common Base Config
pydantic model eole.config.common.DistributedConfig
Bases: Config
JSON schema:
{
"title": "DistributedConfig",
"type": "object",
"properties": {
"gpu_ranks": {
"default": [],
"description": "List of ranks for each process.",
"items": {
"type": "integer"
},
"title": "Gpu Ranks",
"type": "array"
},
"world_size": {
"default": 1,
"description": "Total number of distributed processes.",
"title": "World Size",
"type": "integer"
},
"parallel_mode": {
"default": "data_parallel",
"description": "Distributed mode.",
"enum": [
"data_parallel",
"tensor_parallel"
],
"title": "Parallel Mode",
"type": "string"
},
"gpu_backend": {
"default": "nccl",
"description": "Type of torch distributed backend.",
"title": "Gpu Backend",
"type": "string"
},
"gpu_verbose_level": {
"default": 0,
"description": "Gives more info on each process per GPU.",
"title": "Gpu Verbose Level",
"type": "integer"
},
"master_ip": {
"default": "localhost",
"description": "IP of master for torch.distributed training.",
"title": "Master Ip",
"type": "string"
},
"master_port": {
"default": 10000,
"description": "Port of master for torch.distributed training.",
"title": "Master Port",
"type": "integer"
},
"timeout": {
"default": 60,
"description": "Timeout for one GPU to wait for the others.",
"title": "Timeout",
"type": "integer"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field gpu_backend : str = 'nccl'
Type of torch distributed backend.
field gpu_ranks : List[int] = []
List of ranks for each process.
field gpu_verbose_level : int = 0
Gives more info on each process per GPU.
field master_ip : str = 'localhost'
IP of master for torch.distributed training.
field master_port : int = 10000
Port of master for torch.distributed training.
field parallel_mode : Literal['data_parallel', 'tensor_parallel'] = 'data_parallel'
Distributed mode.
field timeout : int = 60
Timeout for one GPU to wait for the others.
field world_size : int = 1
Total number of distributed processes.
property parallel_gpu : int
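In an eole YAML config these fields sit at the top level of the relevant section. A minimal sketch of a hypothetical two-GPU data-parallel run (all values illustrative, not recommendations):

```yaml
# Hypothetical two-GPU data-parallel setup; keys mirror DistributedConfig fields.
world_size: 2
gpu_ranks: [0, 1]
parallel_mode: data_parallel
gpu_backend: nccl          # default torch.distributed backend
master_ip: localhost
master_port: 10000
timeout: 60                # how long one GPU waits for the others
```

Setting parallel_mode: tensor_parallel with the same world_size/gpu_ranks shards the model across the listed GPUs instead of replicating it.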
pydantic model eole.config.common.LoggingConfig
Bases: Config
JSON schema:
{
"title": "LoggingConfig",
"type": "object",
"properties": {
"log_file": {
"default": "",
"description": "Output logs to a file under this path.",
"title": "Log File",
"type": "string"
},
"report_every": {
"default": 50,
"description": "Print stats at this interval (in steps).",
"title": "Report Every",
"type": "integer"
},
"valid_metrics": {
"default": [],
"description": "List of names of additional validation metrics.",
"items": {
"type": "string"
},
"title": "Valid Metrics",
"type": "array"
},
"wer_normalize": {
"default": "none",
"description": "WER normalization mode: none, lowercase, whisper_en, whisper_basic.",
"title": "Wer Normalize",
"type": "string"
},
"comet_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "COMET model name or local path to use for COMET/COMET-KIWI scoring. Defaults to Unbabel/wmt22-comet-da for COMET and Unbabel/wmt22-cometkiwi-da for COMET-KIWI when not set.",
"title": "Comet Model"
},
"comet_batch_size": {
"default": 64,
"description": "Batch size used when running COMET/COMET-KIWI scoring.",
"title": "Comet Batch Size",
"type": "integer"
},
"scoring_debug": {
"default": false,
"description": "Dump src/ref/pred of the current batch.",
"title": "Scoring Debug",
"type": "boolean"
},
"dump_preds": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Folder to dump predictions to.",
"title": "Dump Preds"
},
"tensorboard": {
"default": false,
"description": "Use tensorboard for visualization during training.",
"title": "Tensorboard",
"type": "boolean"
},
"tensorboard_log_dir": {
"default": "runs/eole",
"description": "Log directory for tensorboard (also the name of the run).",
"title": "Tensorboard Log Dir",
"type": "string"
},
"tensorboard_log_dir_dated": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Tensorboard Log Dir Dated"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field comet_batch_size : int = 64
Batch size used when running COMET/COMET-KIWI scoring.
field comet_model : str | None = None
COMET model name or local path to use for COMET/COMET-KIWI scoring. Defaults to Unbabel/wmt22-comet-da for COMET and Unbabel/wmt22-cometkiwi-da for COMET-KIWI when not set.
field dump_preds : str | None = None
Folder to dump predictions to.
field log_file : str = ''
Output logs to a file under this path.
field report_every : int = 50
Print stats at this interval (in steps).
field scoring_debug : bool = False
Dump src/ref/pred of the current batch.
field tensorboard : bool = False
Use tensorboard for visualization during training.
field tensorboard_log_dir : str = 'runs/eole'
Log directory for tensorboard (also the name of the run).
field tensorboard_log_dir_dated : str | None = None
field valid_metrics : List[str] = []
List of names of additional validation metrics.
field wer_normalize : str = 'none'
WER normalization mode: none, lowercase, whisper_en, whisper_basic.
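For reference, a logging setup in YAML might look like the following sketch (the metric name is an assumption for illustration):

```yaml
# Hypothetical logging setup; keys mirror LoggingConfig fields.
log_file: train.log
report_every: 100          # print stats every 100 steps
valid_metrics: [BLEU]      # assumed metric name, for illustration only
tensorboard: true
tensorboard_log_dir: runs/eole
```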
pydantic model eole.config.common.LoRaConfig
Bases: Config
JSON schema:
{
"title": "LoRaConfig",
"type": "object",
"properties": {
"lora_layers": {
"default": [],
"description": "List of layers to be replaced by LoRa layers. E.g. ['linear_values', 'linear_query'] (\u00a74.2 in https://arxiv.org/abs/2106.09685)",
"items": {
"type": "string"
},
"title": "Lora Layers",
"type": "array"
},
"lora_embedding": {
"default": false,
"description": "Replace embeddings with LoRa Embeddings (\u00a75.1)",
"title": "Lora Embedding",
"type": "boolean"
},
"lora_rank": {
"default": 2,
"description": "r=2 successfully tested with NLLB-200 3.3B",
"title": "Lora Rank",
"type": "integer"
},
"lora_alpha": {
"default": 1,
"description": "\u00a74.1 https://arxiv.org/abs/2106.09685",
"title": "Lora Alpha",
"type": "integer"
},
"lora_dropout": {
"default": 0.0,
"description": "Rule of thumb: same value as in main model.",
"title": "Lora Dropout",
"type": "number"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field lora_alpha : int = 1
§4.1 https://arxiv.org/abs/2106.09685
field lora_dropout : float = 0.0
Rule of thumb: same value as in main model.
field lora_embedding : bool = False
Replace embeddings with LoRa Embeddings (§5.1)
field lora_layers : List[str] = []
List of layers to be replaced by LoRa layers. E.g. [‘linear_values’, ‘linear_query’] (§4.2 in https://arxiv.org/abs/2106.09685)
field lora_rank : int = 2
r=2 successfully tested with NLLB-200 3.3B
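Tying these fields together, a LoRa fine-tuning sketch using the layer names cited for lora_layers above (values illustrative):

```yaml
# Hypothetical LoRa setup; layer names follow the lora_layers example above.
lora_layers: [linear_values, linear_query]
lora_rank: 2          # r=2 was tested with NLLB-200 3.3B
lora_alpha: 1
lora_dropout: 0.0     # rule of thumb: match the main model's dropout
lora_embedding: false
```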
pydantic model eole.config.common.QuantizeConfig
Bases: Config
JSON schema:
{
"title": "QuantizeConfig",
"type": "object",
"properties": {
"quant_layers": {
"default": [],
"description": "List of layers to be compressed in 4/8bit.",
"items": {
"type": "string"
},
"title": "Quant Layers",
"type": "array"
},
"quant_type": {
"default": "",
"description": "Type of compression.",
"enum": [
"",
"bnb_8bit",
"bnb_FP4",
"bnb_NF4",
"awq_gemm",
"awq_gemv",
"autoround",
"gguf"
],
"title": "Quant Type",
"type": "string"
},
"w_bit": {
"default": 4,
"description": "Number of bits for weight quantization.",
"title": "W Bit",
"type": "integer"
},
"group_size": {
"default": 128,
"description": "Group size quantization.",
"title": "Group Size",
"type": "integer"
},
"autoround_packing_format": {
"default": "auto_round:auto_gptq",
"description": "AutoRound packing format (from quantization_config.packing_format). Determines whether qzeros use GPTQ-style (zeros-1) packing. Use 'auto_round:auto_gptq' for GPTQ-format (default), or 'auto_round' for direct zero-point.",
"title": "Autoround Packing Format",
"type": "string"
},
"autoround_sym": {
"default": true,
"description": "AutoRound symmetric quantization flag (from quantization_config.sym). Required to select the Marlin CUDA backend, which only supports symmetric quantization.",
"title": "Autoround Sym",
"type": "boolean"
},
"quant_exclude_modules": {
"default": [],
"description": "List of parent module names whose entire subtrees must not be quantized, even if child layers appear in quant_layers. Used for AutoRound models where some parent modules (e.g. shared_experts in MoE) were kept in fp16 during quantization.",
"items": {
"type": "string"
},
"title": "Quant Exclude Modules",
"type": "array"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field autoround_packing_format : str = 'auto_round:auto_gptq'
AutoRound packing format (from quantization_config.packing_format). Determines whether qzeros use GPTQ-style (zeros-1) packing. Use ‘auto_round:auto_gptq’ for GPTQ-format (default), or ‘auto_round’ for direct zero-point.
field autoround_sym : bool = True
AutoRound symmetric quantization flag (from quantization_config.sym). Required to select the Marlin CUDA backend, which only supports symmetric quantization.
field group_size : int = 128
Group size quantization.
field quant_exclude_modules : List[str] = []
List of parent module names whose entire subtrees must not be quantized, even if child layers appear in quant_layers. Used for AutoRound models where some parent modules (e.g. shared_experts in MoE) were kept in fp16 during quantization.
field quant_layers : List[str] = []
List of layers to be compressed in 4/8bit.
field quant_type : Literal['', 'bnb_8bit', 'bnb_FP4', 'bnb_NF4', 'awq_gemm', 'awq_gemv', 'autoround', 'gguf'] = ''
Type of compression.
field w_bit : int = 4
Number of bits for weight quantization.
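As a sketch, a 4-bit bitsandbytes NF4 configuration could combine these fields as follows (the quant_layers entries are assumptions for illustration):

```yaml
# Hypothetical 4-bit NF4 quantization; quant_layers entries are illustrative.
quant_type: bnb_NF4
w_bit: 4
group_size: 128
quant_layers: [linear_values, linear_query]
```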
pydantic model eole.config.common.MiscConfig
Bases: Config
JSON schema:
{
"title": "MiscConfig",
"type": "object",
"properties": {
"seed": {
"default": -1,
"description": "Set random seed used for better reproducibility between experiments.",
"title": "Seed",
"type": "integer"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field seed : int = -1
Set random seed used for better reproducibility between experiments.
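In YAML this is a single key; for example, to fix the seed for a reproducible run (the default -1 presumably leaves seeding unfixed):

```yaml
seed: 42   # any fixed non-negative value, for reproducibility
```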
pydantic model eole.config.common.RunningConfig
Bases: DistributedConfig
Should be used as base class for Training/Inference at some point, factorizing common stuff like batch_size etc.
JSON schema:
{
"title": "RunningConfig",
"description": "Should be used as base class for Training/Inference at some point,\nfactorizing common stuff like batch_size etc.",
"type": "object",
"properties": {
"gpu_ranks": {
"default": [],
"description": "List of ranks for each process.",
"items": {
"type": "integer"
},
"title": "Gpu Ranks",
"type": "array"
},
"world_size": {
"default": 1,
"description": "Total number of distributed processes.",
"title": "World Size",
"type": "integer"
},
"parallel_mode": {
"default": "data_parallel",
"description": "Distributed mode.",
"enum": [
"data_parallel",
"tensor_parallel"
],
"title": "Parallel Mode",
"type": "string"
},
"gpu_backend": {
"default": "nccl",
"description": "Type of torch distributed backend.",
"title": "Gpu Backend",
"type": "string"
},
"gpu_verbose_level": {
"default": 0,
"description": "Gives more info on each process per GPU.",
"title": "Gpu Verbose Level",
"type": "integer"
},
"master_ip": {
"default": "localhost",
"description": "IP of master for torch.distributed training.",
"title": "Master Ip",
"type": "string"
},
"master_port": {
"default": 10000,
"description": "Port of master for torch.distributed training.",
"title": "Master Port",
"type": "integer"
},
"timeout": {
"default": 60,
"description": "Timeout for one GPU to wait for the others.",
"title": "Timeout",
"type": "integer"
},
"model_path": {
"default": "model",
"description": "Path to directory containing all model components.",
"title": "Model Path",
"type": "string"
},
"self_attn_backend": {
"default": "flash",
"description": "Self-attention backend.",
"enum": [
"flash",
"pytorch"
],
"title": "Self Attn Backend",
"type": "string"
},
"compute_dtype": {
"description": "Compute dtype (precision) to use for main compute. Some parameters might have other dtypes for specific cases (e.g. torch.amp -- see eole.config.training.TrainingConfig.storage_dtype). Use fp32 to avoid slow fp16 on cards like the GTX 1080; int8 enables PyTorch-native 8-bit quantization (CPU only).",
"enum": [
"fp32",
"fp16",
"int8",
"bf16"
],
"title": "Compute Dtype",
"type": "string"
},
"torch_compile": {
"default": false,
"description": "Use torch.compile with dynamic=True.",
"title": "Torch Compile",
"type": "boolean"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- arbitrary_types_allowed: bool = True
- Fields:
- Validators:
field compute_dtype : Literal['fp32', 'fp16', 'int8', 'bf16'] | dtype = torch.float32
Compute dtype (precision) to use for main compute. Some parameters might have other dtypes for specific cases (e.g. torch.amp – see eole.config.training.TrainingConfig.storage_dtype). Use fp32 to avoid slow fp16 on cards like the GTX 1080; int8 enables PyTorch-native 8-bit quantization (CPU only).
- Validated by: validate_compute_dtype
field model_path : str = 'model'
Path to directory containing all model components.
field self_attn_backend : Literal['flash', 'pytorch'] = 'flash'
Self-attention backend.
field torch_compile : bool = False
Use torch.compile with dynamic=True.
validator validate_compute_dtype » compute_dtype
check_self_attn_backend()
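Combining the fields above with the inherited DistributedConfig fields, a runtime sketch in YAML (values illustrative):

```yaml
# Hypothetical runtime setup; keys mirror RunningConfig fields.
model_path: model
self_attn_backend: flash   # 'flash' or 'pytorch'
compute_dtype: fp16
torch_compile: false
world_size: 1
```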
Run Config
pydantic model eole.config.run.TrainConfig
Bases: LoggingConfig, MiscConfig, DataConfig, VocabConfig
JSON schema:
{
"title": "TrainConfig",
"type": "object",
"properties": {
"src_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Path to src (or shared) vocabulary file. Format: one <word> or <word>\t<count> per line.",
"title": "Src Vocab"
},
"tgt_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to tgt vocabulary file. Format: one <word> or <word>\t<count> per line.",
"title": "Tgt Vocab"
},
"share_vocab": {
"default": false,
"description": "Share source and target vocabulary.",
"title": "Share Vocab",
"type": "boolean"
},
"decoder_start_token": {
"default": "<s>",
"description": "Default decoder start token. For most models it is <s> = BOS. Some fairseq models require </s>.",
"title": "Decoder Start Token",
"type": "string"
},
"bos_token": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "<s>",
"title": "Bos Token"
},
"eos_token": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "</s>",
"title": "Eos Token"
},
"unk_token": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "<unk>",
"title": "Unk Token"
},
"pad_token": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "<blank>",
"title": "Pad Token"
},
"both_embeddings": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to the embeddings file to use for both source and target tokens.",
"title": "Both Embeddings"
},
"src_embeddings": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to the embeddings file to use for source tokens.",
"title": "Src Embeddings"
},
"tgt_embeddings": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to the embeddings file to use for target tokens.",
"title": "Tgt Embeddings"
},
"embeddings_type": {
"anyOf": [
{
"enum": [
"GloVe",
"word2vec"
],
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Type of embeddings file.",
"title": "Embeddings Type"
},
"src_vocab_size": {
"default": 32768,
"description": "Maximum size of the source vocabulary.",
"title": "Src Vocab Size",
"type": "integer"
},
"tgt_vocab_size": {
"default": 32768,
"description": "Maximum size of the target vocabulary.",
"title": "Tgt Vocab Size",
"type": "integer"
},
"vocab_size_multiple": {
"default": 8,
"description": "Make the vocabulary size a multiple of this value. (Adds dummy tokens if needed.)",
"title": "Vocab Size Multiple",
"type": "integer"
},
"src_words_min_frequency": {
"default": 0,
"description": "Discard source words with lower frequency.",
"title": "Src Words Min Frequency",
"type": "integer"
},
"tgt_words_min_frequency": {
"default": 0,
"description": "Discard target words with lower frequency.",
"title": "Tgt Words Min Frequency",
"type": "integer"
},
"data": {
"anyOf": [
{
"additionalProperties": {
"$ref": "#/$defs/Dataset"
},
"type": "object"
},
{
"type": "null"
}
],
"description": "All datasets and their specifications. See examples/*.yaml for further details.",
"title": "Data"
},
"transforms": {
"default": [],
"description": "Default transform pipeline to apply to data. Can be specified in each corpus of data to override.",
"items": {
"type": "string"
},
"title": "Transforms",
"type": "array"
},
"transforms_configs": {
"anyOf": [
{
"$ref": "#/$defs/NestedAllTransformsConfig"
},
{
"type": "null"
}
]
},
"skip_empty_level": {
"default": "warning",
"description": "Logging level when encountering empty examples. (silent: silently ignore/skip empty examples, warning: warn when ignoring/skipping empty examples, error: raise an error and stop execution on any empty example)",
"enum": [
"silent",
"warning",
"error"
],
"title": "Skip Empty Level",
"type": "string"
},
"n_sample": {
"default": 0,
"description": "Number of transformed samples per corpus to use to build the vocabulary. Set to -1 to use the full corpora.",
"title": "N Sample",
"type": "integer"
},
"save_data": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Output base path for objects that will be saved (vocab, transforms, embeddings, ...)",
"title": "Save Data"
},
"overwrite": {
"default": false,
"description": "Overwrite existing objects if any.",
"title": "Overwrite",
"type": "boolean"
},
"seed": {
"default": -1,
"description": "Set random seed used for better reproducibility between experiments.",
"title": "Seed",
"type": "integer"
},
"log_file": {
"default": "",
"description": "Output logs to a file under this path.",
"title": "Log File",
"type": "string"
},
"report_every": {
"default": 50,
"description": "Print stats at this interval (in steps).",
"title": "Report Every",
"type": "integer"
},
"valid_metrics": {
"default": [],
"description": "List of names of additional validation metrics.",
"items": {
"type": "string"
},
"title": "Valid Metrics",
"type": "array"
},
"wer_normalize": {
"default": "none",
"description": "WER normalization mode: none, lowercase, whisper_en, whisper_basic.",
"title": "Wer Normalize",
"type": "string"
},
"comet_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "COMET model name or local path to use for COMET/COMET-KIWI scoring. Defaults to Unbabel/wmt22-comet-da for COMET and Unbabel/wmt22-cometkiwi-da for COMET-KIWI when not set.",
"title": "Comet Model"
},
"comet_batch_size": {
"default": 64,
"description": "Batch size used when running COMET/COMET-KIWI scoring.",
"title": "Comet Batch Size",
"type": "integer"
},
"scoring_debug": {
"default": false,
"description": "Dump src/ref/pred of the current batch.",
"title": "Scoring Debug",
"type": "boolean"
},
"dump_preds": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Folder to dump predictions to.",
"title": "Dump Preds"
},
"tensorboard": {
"default": false,
"description": "Use tensorboard for visualization during training.",
"title": "Tensorboard",
"type": "boolean"
},
"tensorboard_log_dir": {
"default": "runs/eole",
"description": "Log directory for tensorboard (also the name of the run).",
"title": "Tensorboard Log Dir",
"type": "string"
},
"tensorboard_log_dir_dated": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Tensorboard Log Dir Dated"
},
"verbose": {
"default": false,
"description": "Print data loading and statistics for all processes (by default, only the first process shard is logged).",
"title": "Verbose",
"type": "boolean"
},
"model": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnModelConfig",
"custom": "#/$defs/CustomModelConfig",
"rnn": "#/$defs/RnnModelConfig",
"transformer": "#/$defs/TransformerModelConfig",
"transformer_encoder": "#/$defs/TransformerEncoderModelConfig",
"transformer_lm": "#/$defs/TransformerLMModelConfig",
"vision_transformer_lm": "#/$defs/VisionTransformerLMModelConfig",
"whisper": "#/$defs/WhisperModelConfig"
},
"propertyName": "architecture"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerModelConfig"
},
{
"$ref": "#/$defs/TransformerLMModelConfig"
},
{
"$ref": "#/$defs/VisionTransformerLMModelConfig"
},
{
"$ref": "#/$defs/WhisperModelConfig"
},
{
"$ref": "#/$defs/TransformerEncoderModelConfig"
},
{
"$ref": "#/$defs/RnnModelConfig"
},
{
"$ref": "#/$defs/CnnModelConfig"
},
{
"$ref": "#/$defs/CustomModelConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"title": "Model"
},
"training": {
"anyOf": [
{
"$ref": "#/$defs/TrainingConfig"
},
{
"type": "null"
}
]
},
"inference": {
"anyOf": [
{
"$ref": "#/$defs/InferenceConfig"
},
{
"type": "null"
}
],
"default": null
}
},
"$defs": {
"ActivationFunction": {
"enum": [
"relu",
"gelu",
"gelu-tanh",
"quick_gelu",
"silu",
"gated-gelu",
"gated-gelu-tanh",
"gated-silu",
"fused-gated-gelu",
"fused-gated-gelu-tanh",
"fused-gated-silu"
],
"title": "ActivationFunction",
"type": "string"
},
"AudioEncoderConfig": {
"additionalProperties": false,
"description": "Configuration for audio encoder.",
"properties": {
"encoder_type": {
"const": "audio",
"default": "audio",
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. Defaults to heads when None; otherwise the number of KV heads (e.g. Falcon 40B).",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension, when it needs to differ from hidden_size // heads.",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor; defaults to 1/sqrt(head_dim) when None.",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": null
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two cases. Case 1: absolute number of positions to learn position embeddings on (position_encoding_type: Learned). Case 2: maximum relative positions (position_encoding_type: Relative).",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
},
"num_mel_bins": {
"default": 80,
"description": "Number of mel spectrogram bins.",
"title": "Num Mel Bins",
"type": "integer"
},
"max_source_positions": {
"default": 1500,
"description": "Maximum number of source positions (time frames after conv stem).",
"title": "Max Source Positions",
"type": "integer"
},
"sample_rate": {
"default": 16000,
"description": "Audio sample rate in Hz.",
"title": "Sample Rate",
"type": "integer"
},
"chunk_length": {
"default": 30,
"description": "Audio chunk length in seconds.",
"title": "Chunk Length",
"type": "integer"
},
"n_fft": {
"default": 400,
"description": "FFT window size for mel spectrogram.",
"title": "N Fft",
"type": "integer"
},
"hop_length": {
"default": 160,
"description": "Hop length for mel spectrogram.",
"title": "Hop Length",
"type": "integer"
},
"timestamp_resolution": {
"default": 0.02,
"description": "Time resolution per timestamp token in seconds.",
"title": "Timestamp Resolution",
"type": "number"
}
},
"title": "AudioEncoderConfig",
"type": "object"
},
"BARTNoiseConfig": {
"additionalProperties": false,
"properties": {
"permute_sent_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Permute this proportion of sentences (boundaries defined by ['.', '?', '!']) in all inputs.",
"title": "Permute Sent Ratio"
},
"rotate_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Rotate this proportion of inputs.",
"title": "Rotate Ratio"
},
"insert_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Insert this percentage of additional random tokens.",
"title": "Insert Ratio"
},
"random_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Instead of using <mask>, use random token this often.",
"title": "Random Ratio"
},
"mask_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Fraction of words/subwords that will be masked.",
"title": "Mask Ratio"
},
"mask_length": {
"anyOf": [
{
"enum": [
"subword",
"word",
"span-poisson"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "subword",
"description": "Length of masking window to apply.",
"title": "Mask Length"
},
"poisson_lambda": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 3.0,
"description": "Lambda for Poisson distribution to sample span length if `-mask_length` set to span-poisson.",
"title": "Poisson Lambda"
},
"replace_length": {
"anyOf": [
{
"maximum": 1,
"minimum": -1,
"type": "integer"
},
{
"type": "null"
}
],
"default": -1,
"description": "When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)",
"title": "Replace Length"
}
},
"title": "BARTNoiseConfig",
"type": "object"
},
"BaseTokenizerConfig": {
"additionalProperties": false,
"properties": {
"src_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for src (or shared).",
"title": "Src Subword Model"
},
"tgt_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for tgt.",
"title": "Tgt Subword Model"
},
"src_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)",
"title": "Src Subword Nbest"
},
"tgt_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)",
"title": "Tgt Subword Nbest"
},
"src_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)",
"title": "Src Subword Alpha"
},
"tgt_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)",
"title": "Tgt Subword Alpha"
},
"src_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for src subword. Format: <word>\\t<count> per line.",
"title": "Src Subword Vocab"
},
"tgt_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for tgt subword. Format: <word>\\t<count> per line.",
"title": "Tgt Subword Vocab"
},
"src_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.",
"title": "Src Vocab Threshold"
},
"tgt_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.",
"title": "Tgt Vocab Threshold"
}
},
"title": "BaseTokenizerConfig",
"type": "object"
},
"CleanConfig": {
"additionalProperties": false,
"properties": {
"src_eq_tgt": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex src==tgt",
"title": "Src Eq Tgt"
},
"same_char": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex with same char more than 4 times",
"title": "Same Char"
},
"same_word": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex with same word more than 3 times",
"title": "Same Word"
},
"scripts_ok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"Latin",
"Common"
],
"description": "list of unicodata scripts accepted",
"title": "Scripts Ok"
},
"scripts_nok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"description": "list of unicodata scripts not accepted",
"title": "Scripts Nok"
},
"src_tgt_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 2.0,
"description": "ratio between src and tgt",
"title": "Src Tgt Ratio"
},
"avg_tok_min": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 3.0,
"description": "average length of tokens min",
"title": "Avg Tok Min"
},
"avg_tok_max": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 20.0,
"description": "average length of tokens max",
"title": "Avg Tok Max"
},
"langid": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"description": "list of languages accepted",
"title": "Langid"
}
},
"title": "CleanConfig",
"type": "object"
},
"CnnDecoderConfig": {
"additionalProperties": false,
"properties": {
"decoder_type": {
"const": "cnn",
"default": "cnn",
"title": "Decoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the decoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of decoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"tgt_word_vec_size": {
"default": 512,
"description": "Word embedding size for tgt.",
"title": "Tgt Word Vec Size",
"type": "integer"
},
"coverage_attn": {
"default": false,
"description": "Train a coverage attention layer.",
"title": "Coverage Attn",
"type": "boolean"
},
"with_cross_attn": {
"default": false,
"description": "Decoder uses cross-attention with encoder outputs.",
"title": "With Cross Attn",
"type": "boolean"
},
"lambda_coverage": {
"default": 0.0,
"description": "Lambda value for coverage loss of See et al (2017)",
"title": "Lambda Coverage",
"type": "number"
},
"global_attention": {
"default": "general",
"description": "The attention type to use. (Luong=general, Bahdanau=MLP)",
"enum": [
"dot",
"general",
"mlp",
null
],
"title": "Global Attention"
},
"global_attention_function": {
"default": "softmax",
"description": "Global attention function to use.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Global Attention Function",
"type": "string"
},
"cnn_kernel_width": {
"default": 3,
"description": "Size of windows in the cnn, the kernel_size is (cnn_kernel_width, 1) in convolution layers.",
"title": "Cnn Kernel Width",
"type": "integer"
}
},
"title": "CnnDecoderConfig",
"type": "object"
},
"CnnEncoderConfig": {
"additionalProperties": false,
"properties": {
"encoder_type": {
"const": "cnn",
"default": "cnn",
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"cnn_kernel_width": {
"default": 3,
"description": "Size of windows in the cnn, the kernel_size is (cnn_kernel_width, 1) in convolution layers.",
"title": "Cnn Kernel Width",
"type": "integer"
}
},
"title": "CnnEncoderConfig",
"type": "object"
},
"CnnModelConfig": {
"additionalProperties": false,
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": -1,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "cnn",
"default": "cnn",
"title": "Architecture",
"type": "string"
},
"cnn_kernel_width": {
"default": 3,
"description": "Size of windows in the cnn, the kernel_size is (cnn_kernel_width, 1) in convolution layers.",
"title": "Cnn Kernel Width",
"type": "integer"
}
},
"title": "CnnModelConfig",
"type": "object"
},
"CustomModelConfig": {
"additionalProperties": false,
"description": "Wrap anything that does not fit a set common architecture.",
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "custom",
"default": "custom",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
}
},
"title": "CustomModelConfig",
"type": "object"
},
"Dataset": {
"additionalProperties": false,
"properties": {
"name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Name"
},
"weight": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"title": "Weight"
},
"transforms": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"title": "Transforms"
},
"path_src": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path Src"
},
"path_tgt": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path Tgt"
},
"path_sco": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path Sco"
},
"path_txt": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path Txt"
},
"path_align": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path Align"
},
"src_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Src Prefix"
},
"tgt_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Tgt Prefix"
},
"src_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Src Suffix"
},
"tgt_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Tgt Suffix"
},
"src_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Src Lang"
},
"tgt_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Tgt Lang"
},
"penn": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Penn"
},
"norm_quote_commas": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Norm Quote Commas"
},
"norm_numbers": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Norm Numbers"
},
"pre_replace_unicode_punct": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"title": "Pre Replace Unicode Punct"
},
"post_remove_control_chars": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"title": "Post Remove Control Chars"
},
"src_eq_tgt": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Src Eq Tgt"
},
"same_char": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Same Char"
},
"same_word": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Same Word"
},
"scripts_ok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"Latin",
"Common"
],
"title": "Scripts Ok"
},
"scripts_nok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"title": "Scripts Nok"
},
"src_tgt_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 2,
"title": "Src Tgt Ratio"
},
"avg_tok_min": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 3,
"title": "Avg Tok Min"
},
"avg_tok_max": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 20,
"title": "Avg Tok Max"
},
"lang_id": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"en",
"fr"
],
"title": "Lang Id"
}
},
"title": "Dataset",
"type": "object"
},
"DocifyConfig": {
"additionalProperties": false,
"properties": {
"doc_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 200,
"description": "Number of tokens per doc.",
"title": "Doc Length"
},
"max_context": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Max context segments.",
"title": "Max Context"
}
},
"title": "DocifyConfig",
"type": "object"
},
"EmbeddingsConfig": {
"additionalProperties": false,
"properties": {
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"tgt_word_vec_size": {
"default": 512,
"description": "Word embedding size for tgt.",
"title": "Tgt Word Vec Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"freeze_word_vecs_enc": {
"default": false,
"description": "Freeze word embeddings on the encoder side.",
"title": "Freeze Word Vecs Enc",
"type": "boolean"
},
"freeze_word_vecs_dec": {
"default": false,
"description": "Freeze word embeddings on the encoder side.",
"title": "Freeze Word Vecs Dec",
"type": "boolean"
},
"position_encoding": {
"default": false,
"description": "Absolute position encoding, see position_encoding_type. Necessary for non-RNN style models.",
"title": "Position Encoding",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"position_shift": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Positions IDS shift before making position embed dirty patch to cover for xlm-roberta-xl",
"title": "Position Shift"
},
"normalize": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Enable embeddings scaling. Not always necessary, but useful for some model compatibility, e.g. gemma. https://datascience.stackexchange.com/a/87909",
"title": "Normalize"
}
},
"title": "EmbeddingsConfig",
"type": "object"
},
"FilterTooLongConfig": {
"additionalProperties": false,
"properties": {
"src_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 192,
"description": "Maximum source sequence length.",
"title": "Src Seq Length"
},
"tgt_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 192,
"description": "Maximum target sequence length.",
"title": "Tgt Seq Length"
}
},
"title": "FilterTooLongConfig",
"type": "object"
},
"FilterTooShortConfig": {
"additionalProperties": false,
"properties": {
"src_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 48,
"description": "Minimum source sequence length.",
"title": "Src Seq Length"
},
"tgt_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 48,
"description": "Minimum target sequence length.",
"title": "Tgt Seq Length"
}
},
"title": "FilterTooShortConfig",
"type": "object"
},
"HuggingfaceTokenizerConfig": {
"additionalProperties": false,
"properties": {
"path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Huggingface Model"
},
"max_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"title": "Max Length"
}
},
"title": "HuggingfaceTokenizerConfig",
"type": "object"
},
"InferenceConfig": {
"additionalProperties": false,
"properties": {
"quant_layers": {
"default": [],
"description": "List of layers to be compressed in 4/8bit.",
"items": {
"type": "string"
},
"title": "Quant Layers",
"type": "array"
},
"quant_type": {
"default": "",
"description": "Type of compression.",
"enum": [
"",
"bnb_8bit",
"bnb_FP4",
"bnb_NF4",
"awq_gemm",
"awq_gemv",
"autoround",
"gguf"
],
"title": "Quant Type",
"type": "string"
},
"w_bit": {
"default": 4,
"description": "W_bit quantization",
"title": "W Bit",
"type": "integer"
},
"group_size": {
"default": 128,
"description": "Group size quantization.",
"title": "Group Size",
"type": "integer"
},
"autoround_packing_format": {
"default": "auto_round:auto_gptq",
"description": "AutoRound packing format (from quantization_config.packing_format). Determines whether qzeros use GPTQ-style (zeros-1) packing. Use 'auto_round:auto_gptq' for GPTQ-format (default), or 'auto_round' for direct zero-point.",
"title": "Autoround Packing Format",
"type": "string"
},
"autoround_sym": {
"default": true,
"description": "AutoRound symmetric quantization flag (from quantization_config.sym). Required to select the Marlin CUDA backend, which only supports symmetric quantization.",
"title": "Autoround Sym",
"type": "boolean"
},
"quant_exclude_modules": {
"default": [],
"description": "List of parent module names whose entire subtrees must not be quantized, even if child layers appear in quant_layers. Used for AutoRound models where some parent modules (e.g. shared_experts in MoE) were kept in fp16 during quantization.",
"items": {
"type": "string"
},
"title": "Quant Exclude Modules",
"type": "array"
},
"lora_layers": {
"default": [],
"description": "List of layers to be replaced by LoRa layers. E.g. ['linear_values', 'linear_query'] (\u00a74.2 in https://arxiv.org/abs/2106.09685)",
"items": {
"type": "string"
},
"title": "Lora Layers",
"type": "array"
},
"lora_embedding": {
"default": false,
"description": "Replace embeddings with LoRa Embeddings (\u00a75.1)",
"title": "Lora Embedding",
"type": "boolean"
},
"lora_rank": {
"default": 2,
"description": "r=2 successfully tested with NLLB-200 3.3B",
"title": "Lora Rank",
"type": "integer"
},
"lora_alpha": {
"default": 1,
"description": "\u00a74.1 https://arxiv.org/abs/2106.09685",
"title": "Lora Alpha",
"type": "integer"
},
"lora_dropout": {
"default": 0.0,
"description": "Rule of thumb: same value as in main model.",
"title": "Lora Dropout",
"type": "number"
},
"beam_size": {
"default": 5,
"description": "Beam size.",
"title": "Beam Size",
"type": "integer"
},
"ratio": {
"default": -0.0,
"description": "Ratio based beam stop condition.",
"title": "Ratio",
"type": "number"
},
"top_k": {
"default": 0,
"description": "Set this to -1 to do random sampling from full distribution. Set this to value k>1 to do random sampling restricted to the k most likely next tokens. Set this to 1 to use argmax.",
"title": "Top K",
"type": "integer"
},
"top_p": {
"default": 0.0,
"description": "Probability for top-p/nucleus sampling. Restrict tokens to the most likely until the cumulative probability is over p. In range [0,1]. (https://arxiv.org/abs/1904.09751)",
"maximum": 1.0,
"minimum": 0.0,
"title": "Top P",
"type": "number"
},
"temperature": {
"default": 1.0,
"description": "If doing random sampling, divide the logits by this before computing softmax during decoding.",
"title": "Temperature",
"type": "number"
},
"length_penalty": {
"default": "avg",
"description": "Length penalty to use.",
"enum": [
"avg",
"wu",
"none"
],
"title": "Length Penalty",
"type": "string"
},
"alpha": {
"default": 1.0,
"description": "Length penalty parameter (higher = longer generation)",
"title": "Alpha",
"type": "number"
},
"coverage_penalty": {
"default": "none",
"description": "Coverage penalty to use. Only available in beam search.",
"enum": [
"none",
"wu",
"summary"
],
"title": "Coverage Penalty",
"type": "string"
},
"beta": {
"default": -0.0,
"description": "Coverage penalty parameter.",
"title": "Beta",
"type": "number"
},
"stepwise_penalty": {
"default": false,
"description": "Apply coverage penalty at every decoding step. Helpful for summary penalty.",
"title": "Stepwise Penalty",
"type": "boolean"
},
"min_length": {
"default": 0,
"description": "Minimum prediction length.",
"minimum": 0,
"title": "Min Length",
"type": "integer"
},
"max_length": {
"default": 250,
"description": "Maximum prediction length.",
"title": "Max Length",
"type": "integer"
},
"max_length_ratio": {
"default": 2,
"description": "Maximum prediction length ratio. For European languages, 2 is large enough; for Asian target characters, increase to 2-3; for special languages (Burmese, Amharic), up to 10. Set to 0 to disable ratio-based length capping.",
"minimum": 0,
"title": "Max Length Ratio",
"type": "number"
},
"block_ngram_repeat": {
"default": 0,
"description": "Block repetition of ngrams during decoding.",
"title": "Block Ngram Repeat",
"type": "integer"
},
"ignore_when_blocking": {
"default": [],
"description": "Ignore these strings when blocking repeats. You want to block sentence delimiters.",
"items": {
"type": "string"
},
"title": "Ignore When Blocking",
"type": "array"
},
"replace_unk": {
"default": false,
"description": "Replace the generated UNK tokens with the source token that had the highest attention weight. If phrase_table is provided, it will look up the identified source token and output the corresponding target token. If it is not provided (or the identified source token does not exist in the table), then it will copy the source token.",
"title": "Replace Unk",
"type": "boolean"
},
"ban_unk_token": {
"default": false,
"description": "Prevent unk token generation by setting unk probability to 0.",
"title": "Ban Unk Token",
"type": "boolean"
},
"phrase_table": {
"default": "",
"description": "If phrase_table is provided (with replace_unk), it will look up the identified source token and give the corresponding target token.",
"title": "Phrase Table",
"type": "string"
},
"n_best": {
"default": 1,
"description": "Output the n_best decoded sentences.",
"title": "N Best",
"type": "integer"
},
"dump_beam": {
"default": "",
"description": "File to dump beam information to.",
"title": "Dump Beam",
"type": "string"
},
"verbose": {
"default": false,
"description": "Print scores and predictions for each input.",
"title": "Verbose",
"type": "boolean"
},
"with_score": {
"default": false,
"description": "Add a tab separated score to each output.",
"title": "With Score",
"type": "boolean"
},
"timestamps": {
"default": "none",
"description": "Audio models only. Timestamp output: 'none' = plain text, 'segment' = JSON with segment times, 'word' = per-word times via cross-attention DTW.",
"enum": [
"none",
"segment",
"word"
],
"title": "Timestamps",
"type": "string"
},
"language": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Audio models only. Language code (e.g. 'en', 'fr'). Inserts the language token into the decoder prefix.",
"title": "Language"
},
"task": {
"anyOf": [
{
"enum": [
"transcribe",
"translate"
],
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Audio models only. 'transcribe' for same-language, 'translate' for translation to English.",
"title": "Task"
},
"initial_prompt": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Audio models only. Text prompt to condition decoder output style and vocabulary. Prepended as previous context.",
"title": "Initial Prompt"
},
"condition_on_previous_text": {
"default": false,
"description": "Audio models only. Feed previous chunk's decoded text as decoder prompt for the next chunk.",
"title": "Condition On Previous Text",
"type": "boolean"
},
"fallback_temperatures": {
"default": [
0.0,
0.2,
0.4,
0.6,
0.8,
1.0
],
"description": "Audio models only. Temperature cascade for decoding fallback. First temperature uses beam search; subsequent use sampling. Set to [0.0] to disable fallback.",
"items": {
"type": "number"
},
"title": "Fallback Temperatures",
"type": "array"
},
"compression_ratio_threshold": {
"default": 2.4,
"description": "Audio models only. If gzip compression ratio of decoded text exceeds this, retry at next fallback temperature.",
"title": "Compression Ratio Threshold",
"type": "number"
},
"logprob_threshold": {
"default": -1.0,
"description": "Audio models only. If average log probability per token is below this, retry at next fallback temperature.",
"title": "Logprob Threshold",
"type": "number"
},
"no_speech_threshold": {
"default": 0.6,
"description": "Audio models only. Low avg_logprob only triggers fallback when no_speech_prob is also below this threshold.",
"maximum": 1.0,
"minimum": 0.0,
"title": "No Speech Threshold",
"type": "number"
},
"estim_only": {
"default": false,
"description": "Process the input to estimator only (no decoder).",
"title": "Estim Only",
"type": "boolean"
},
"attn_debug": {
"default": false,
"description": "Print best attn for each word.",
"title": "Attn Debug",
"type": "boolean"
},
"align_debug": {
"default": false,
"description": "Print best align for each word.",
"title": "Align Debug",
"type": "boolean"
},
"gpu_ranks": {
"default": [],
"description": "List of ranks for each process.",
"items": {
"type": "integer"
},
"title": "Gpu Ranks",
"type": "array"
},
"world_size": {
"default": 1,
"description": "Total number of distributed processes.",
"title": "World Size",
"type": "integer"
},
"parallel_mode": {
"default": "data_parallel",
"description": "Distributed mode.",
"enum": [
"data_parallel",
"tensor_parallel"
],
"title": "Parallel Mode",
"type": "string"
},
"gpu_backend": {
"default": "nccl",
"description": "Type of torch distributed backend.",
"title": "Gpu Backend",
"type": "string"
},
"gpu_verbose_level": {
"default": 0,
"description": "Gives more info on each process per GPU.",
"title": "Gpu Verbose Level",
"type": "integer"
},
"master_ip": {
"default": "localhost",
"description": "IP of master for torch.distributed training.",
"title": "Master Ip",
"type": "string"
},
"master_port": {
"default": 10000,
"description": "Port of master for torch.distributed training.",
"title": "Master Port",
"type": "integer"
},
"timeout": {
"default": 60,
"description": "Timeout for one GPU to wait for the others.",
"title": "Timeout",
"type": "integer"
},
"model_path": {
"default": "model",
"description": "Path to directory containing all model components.",
"title": "Model Path",
"type": "string"
},
"self_attn_backend": {
"default": "flash",
"description": "Self-attention backend.",
"enum": [
"flash",
"pytorch"
],
"title": "Self Attn Backend",
"type": "string"
},
"compute_dtype": {
"description": "Compute dtype (precision) to use for main compute. Some parameters might have other dtypes for specific cases (e.g. torch.amp -- see eole.config.training.TrainingConfig.storage_dtype). Use fp32 to force a slow fp16 model on a GTX 1080, or int8 to enable pytorch native 8-bit quantization (CPU only).",
"enum": [
"fp32",
"fp16",
"int8",
"bf16"
],
"title": "Compute Dtype",
"type": "string"
},
"torch_compile": {
"default": false,
"description": "Use torch.compile with dynamic=True.",
"title": "Torch Compile",
"type": "boolean"
},
"report_align": {
"default": false,
"description": "Report alignment for each translation.",
"title": "Report Align",
"type": "boolean"
},
"gold_align": {
"default": false,
"description": "Report alignment between source and gold target. Useful to test the performance of learnt alignments.",
"title": "Gold Align",
"type": "boolean"
},
"report_time": {
"default": false,
"description": "Report some translation time metrics.",
"title": "Report Time",
"type": "boolean"
},
"fuse_kvq": {
"default": false,
"description": "Fuse K, V, Q Linear layers into a single KVQ in Self Attn.",
"title": "Fuse Kvq",
"type": "boolean"
},
"fuse_gate": {
"default": false,
"description": "Fuse gate_up_proj and up_proj Linear layers into a single Linear.",
"title": "Fuse Gate",
"type": "boolean"
},
"profile": {
"default": false,
"description": "Report pytorch profiling stats.",
"title": "Profile",
"type": "boolean"
},
"batch_size": {
"default": 30,
"description": "Batch size.",
"title": "Batch Size",
"type": "integer"
},
"dynamic_shapes": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Use static or dynamic shapes for batch_size / cache length.",
"title": "Dynamic Shapes"
},
"batch_type": {
"default": "sents",
"description": "Batch grouping for batch size.",
"enum": [
"sents",
"tokens"
],
"title": "Batch Type",
"type": "string"
},
"avg_raw_probs": {
"default": false,
"description": "If set, during ensembling scores from different models will be combined by averaging their raw probabilities and then taking the log. Otherwise, the log probabilities will be averaged directly. Necessary for models whose output layers can assign zero probability.",
"title": "Avg Raw Probs",
"type": "boolean"
},
"data_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "text",
"title": "Data Type"
},
"chat_template": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Chat Template"
},
"optional_eos": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"description": "Optional EOS tokens that would stop generation, e.g. <|eot_id|> for Llama3",
"title": "Optional Eos"
}
},
"title": "InferenceConfig",
"type": "object"
},
"InlineTagsConfig": {
"additionalProperties": false,
"properties": {
"tags_dictionary_path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to a flat term dictionary.",
"title": "Tags Dictionary Path"
},
"tags_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.1,
"description": "Ratio of corpus to augment with tags.",
"title": "Tags Corpus Ratio"
},
"max_tags": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 12,
"description": "Maximum number of tags that can be added to a single sentence.",
"title": "Max Tags"
},
"paired_stag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_beg\uff60",
"description": "The format of an opening paired inline tag. Must include the character #.",
"title": "Paired Stag"
},
"paired_etag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_end\uff60",
"description": "The format of a closing paired inline tag. Must include the character #.",
"title": "Paired Etag"
},
"isolated_tag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_std\uff60",
"description": "The format of an isolated inline tag. Must include the character #.",
"title": "Isolated Tag"
},
"src_delimiter": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ffuzzy\uff60",
"description": "Any special token used for augmented src sentences. The default is the fuzzy token used in the FuzzyMatch transform.",
"title": "Src Delimiter"
}
},
"title": "InlineTagsConfig",
"type": "object"
},
"InsertMaskBeforePlaceholderConfig": {
"additionalProperties": false,
"properties": {
"response_patterns": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"Response : \uff5fnewline\uff60"
],
"description": "Response pattern to locate the end of the prompt.",
"title": "Response Patterns"
}
},
"title": "InsertMaskBeforePlaceholderConfig",
"type": "object"
},
"MeanEncoderConfig": {
"additionalProperties": false,
"properties": {
"encoder_type": {
"const": "mean",
"default": "mean",
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
}
},
"title": "MeanEncoderConfig",
"type": "object"
},
"NestedAllTransformsConfig": {
"additionalProperties": false,
"properties": {
"clean": {
"$ref": "#/$defs/CleanConfig",
"default": {
"src_eq_tgt": false,
"same_char": false,
"same_word": false,
"scripts_ok": [
"Latin",
"Common"
],
"scripts_nok": [],
"src_tgt_ratio": 2.0,
"avg_tok_min": 3.0,
"avg_tok_max": 20.0,
"langid": []
}
},
"terminology": {
"$ref": "#/$defs/TerminologyConfig",
"default": {
"termbase_path": null,
"src_spacy_language_model": null,
"tgt_spacy_language_model": null,
"term_corpus_ratio": 0.3,
"term_example_ratio": 0.2,
"src_term_stoken": "\uff5fsrc_term_start\uff60",
"tgt_term_stoken": "\uff5ftgt_term_start\uff60",
"tgt_term_etoken": "\uff5ftgt_term_end\uff60",
"term_source_delimiter": "\uff5ffuzzy\uff60"
}
},
"docify": {
"$ref": "#/$defs/DocifyConfig",
"default": {
"doc_length": 200,
"max_context": 1
}
},
"filtertooshort": {
"$ref": "#/$defs/FilterTooShortConfig",
"default": {
"src_seq_length": 48,
"tgt_seq_length": 48
}
},
"filtertoolong": {
"$ref": "#/$defs/FilterTooLongConfig",
"default": {
"src_seq_length": 192,
"tgt_seq_length": 192
}
},
"prefix": {
"$ref": "#/$defs/PrefixConfig",
"default": {
"src_prefix": "",
"tgt_prefix": ""
}
},
"suffix": {
"$ref": "#/$defs/SuffixConfig",
"default": {
"src_suffix": "",
"tgt_suffix": ""
}
},
"insert_mask_before_placeholder": {
"$ref": "#/$defs/InsertMaskBeforePlaceholderConfig",
"default": {
"response_patterns": [
"Response : \uff5fnewline\uff60"
]
}
},
"uppercase": {
"$ref": "#/$defs/UpperCaseConfig",
"default": {
"upper_corpus_ratio": 0.01
}
},
"switchout": {
"$ref": "#/$defs/SwitchOutConfig",
"default": {
"switchout_temperature": 1.0
}
},
"tokendrop": {
"$ref": "#/$defs/TokenDropConfig",
"default": {
"tokendrop_temperature": 1.0
}
},
"tokenmask": {
"$ref": "#/$defs/TokenMaskConfig",
"default": {
"tokenmask_temperature": 1.0
}
},
"bart": {
"$ref": "#/$defs/BARTNoiseConfig",
"default": {
"permute_sent_ratio": 0.0,
"rotate_ratio": 0.0,
"insert_ratio": 0.0,
"random_ratio": 0.0,
"mask_ratio": 0.0,
"mask_length": "subword",
"poisson_lambda": 3.0,
"replace_length": -1
}
},
"sentencepiece": {
"$ref": "#/$defs/BaseTokenizerConfig",
"default": {
"src_subword_model": null,
"tgt_subword_model": null,
"src_subword_nbest": 1,
"tgt_subword_nbest": 1,
"src_subword_alpha": 0.0,
"tgt_subword_alpha": 0.0,
"src_subword_vocab": "",
"tgt_subword_vocab": "",
"src_vocab_threshold": 0,
"tgt_vocab_threshold": 0
}
},
"bpe": {
"$ref": "#/$defs/BaseTokenizerConfig",
"default": {
"src_subword_model": null,
"tgt_subword_model": null,
"src_subword_nbest": 1,
"tgt_subword_nbest": 1,
"src_subword_alpha": 0.0,
"tgt_subword_alpha": 0.0,
"src_subword_vocab": "",
"tgt_subword_vocab": "",
"src_vocab_threshold": 0,
"tgt_vocab_threshold": 0
}
},
"onmt_tokenize": {
"$ref": "#/$defs/ONMTTokenizerConfig",
"default": {
"src_subword_model": null,
"tgt_subword_model": null,
"src_subword_nbest": 1,
"tgt_subword_nbest": 1,
"src_subword_alpha": 0.0,
"tgt_subword_alpha": 0.0,
"src_subword_vocab": "",
"tgt_subword_vocab": "",
"src_vocab_threshold": 0,
"tgt_vocab_threshold": 0,
"src_subword_type": "none",
"tgt_subword_type": "none",
"src_onmttok_kwargs": {
"mode": "none"
},
"tgt_onmttok_kwargs": {
"mode": "none"
},
"gpt2_pretok": false,
"mapped_tokens": null
}
},
"inlinetags": {
"$ref": "#/$defs/InlineTagsConfig",
"default": {
"tags_dictionary_path": null,
"tags_corpus_ratio": 0.1,
"max_tags": 12,
"paired_stag": "\uff5fph_#_beg\uff60",
"paired_etag": "\uff5fph_#_end\uff60",
"isolated_tag": "\uff5fph_#_std\uff60",
"src_delimiter": "\uff5ffuzzy\uff60"
}
},
"huggingface_tokenize": {
"$ref": "#/$defs/HuggingfaceTokenizerConfig",
"default": {
"path": null,
"huggingface_model": null,
"max_length": null
}
},
"normalize": {
"$ref": "#/$defs/NormalizeConfig",
"default": {
"src_lang": "",
"tgt_lang": "",
"penn": true,
"norm_quote_commas": true,
"norm_numbers": true,
"pre_replace_unicode_punct": false,
"post_remove_control_chars": false
}
}
},
"title": "NestedAllTransformsConfig",
"type": "object"
},
"NormalizeConfig": {
"additionalProperties": false,
"properties": {
"src_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Source language code",
"title": "Src Lang"
},
"tgt_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Target language code",
"title": "Tgt Lang"
},
"penn": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Penn substitution",
"title": "Penn"
},
"norm_quote_commas": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Normalize quotations and commas",
"title": "Norm Quote Commas"
},
"norm_numbers": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Normalize numbers",
"title": "Norm Numbers"
},
"pre_replace_unicode_punct": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Replace unicode punct",
"title": "Pre Replace Unicode Punct"
},
"post_remove_control_chars": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove control chars",
"title": "Post Remove Control Chars"
}
},
"title": "NormalizeConfig",
"type": "object"
},
"ONMTTokenizerConfig": {
"additionalProperties": false,
"properties": {
"src_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for src (or shared).",
"title": "Src Subword Model"
},
"tgt_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for tgt.",
"title": "Tgt Subword Model"
},
"src_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)",
"title": "Src Subword Nbest"
},
"tgt_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)",
"title": "Tgt Subword Nbest"
},
"src_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)",
"title": "Src Subword Alpha"
},
"tgt_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)",
"title": "Tgt Subword Alpha"
},
"src_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for src subword. Format: <word>\\t<count> per line.",
"title": "Src Subword Vocab"
},
"tgt_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for tgt subword. Format: <word>\\t<count> per line.",
"title": "Tgt Subword Vocab"
},
"src_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.",
"title": "Src Vocab Threshold"
},
"tgt_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.",
"title": "Tgt Vocab Threshold"
},
"src_subword_type": {
"anyOf": [
{
"enum": [
"none",
"sentencepiece",
"bpe"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "none",
"description": "Type of subword model for src (or shared) in pyonmttok.",
"title": "Src Subword Type"
},
"tgt_subword_type": {
"anyOf": [
{
"enum": [
"none",
"sentencepiece",
"bpe"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "none",
"description": "Type of subword model for tgt in pyonmttok.",
"title": "Tgt Subword Type"
},
"src_onmttok_kwargs": {
"anyOf": [
{
"additionalProperties": true,
"type": "object"
},
{
"type": "null"
}
],
"default": {
"mode": "none"
},
"description": "Other pyonmttok options for src, as a dict string, excluding the subword-related options listed earlier.",
"title": "Src Onmttok Kwargs"
},
"tgt_onmttok_kwargs": {
"anyOf": [
{
"additionalProperties": true,
"type": "object"
},
{
"type": "null"
}
],
"default": {
"mode": "none"
},
"description": "Other pyonmttok options for tgt, as a dict string, excluding the subword-related options listed earlier.",
"title": "Tgt Onmttok Kwargs"
},
"gpt2_pretok": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Preprocess sentence with byte-level mapping.",
"title": "Gpt2 Pretok"
},
"mapped_tokens": {
"anyOf": [
{
"items": {
"maxItems": 2,
"minItems": 2,
"prefixItems": [
{
"type": "string"
},
{
"type": "string"
}
],
"type": "array"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Mapped tokens for placeholders preservation",
"title": "Mapped Tokens"
}
},
"title": "ONMTTokenizerConfig",
"type": "object"
},
"PositionEncodingType": {
"enum": [
"SinusoidalInterleaved",
"SinusoidalConcat",
"Learned",
"Relative",
"Rotary",
"Alibi"
],
"title": "PositionEncodingType",
"type": "string"
},
"PrefixConfig": {
"additionalProperties": false,
"properties": {
"src_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to prepend to all source examples.",
"title": "Src Prefix"
},
"tgt_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to prepend to all target examples.",
"title": "Tgt Prefix"
}
},
"title": "PrefixConfig",
"type": "object"
},
"RnnDecoderConfig": {
"additionalProperties": false,
"properties": {
"decoder_type": {
"const": "rnn",
"default": "rnn",
"title": "Decoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the decoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of decoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"tgt_word_vec_size": {
"default": 512,
"description": "Word embedding size for tgt.",
"title": "Tgt Word Vec Size",
"type": "integer"
},
"coverage_attn": {
"default": false,
"description": "Train a coverage attention layer.",
"title": "Coverage Attn",
"type": "boolean"
},
"with_cross_attn": {
"default": false,
"description": "Decoder uses cross-attention with encoder outputs.",
"title": "With Cross Attn",
"type": "boolean"
},
"lambda_coverage": {
"default": 0.0,
"description": "Lambda value for the coverage loss of See et al. (2017).",
"title": "Lambda Coverage",
"type": "number"
},
"global_attention": {
"default": "general",
"description": "The attention type to use. (Luong=general, Bahdanau=MLP)",
"enum": [
"dot",
"general",
"mlp",
null
],
"title": "Global Attention"
},
"global_attention_function": {
"default": "softmax",
"description": "Global attention function to use.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Global Attention Function",
"type": "string"
},
"bridge": {
"default": false,
"description": "Have an additional layer between the last encoder state and the first decoder state (RNN specific).",
"title": "Bridge",
"type": "boolean"
},
"rnn_type": {
"default": "LSTM",
"description": "The gate type to use in the RNNs.",
"enum": [
"LSTM",
"GRU"
],
"title": "Rnn Type",
"type": "string"
},
"context_gate": {
"default": null,
"description": "Type of context gate to use.",
"enum": [
"source",
"target",
"both",
null
],
"title": "Context Gate"
},
"bidirectional_encoder": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"title": "Bidirectional Encoder"
}
},
"title": "RnnDecoderConfig",
"type": "object"
},
"RnnEncoderConfig": {
"additionalProperties": false,
"properties": {
"encoder_type": {
"default": "rnn",
"enum": [
"rnn",
"brnn"
],
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"bridge": {
"default": false,
"description": "Have an additional layer between the last encoder state and the first decoder state (RNN specific).",
"title": "Bridge",
"type": "boolean"
},
"rnn_type": {
"default": "LSTM",
"description": "The gate type to use in the RNNs.",
"enum": [
"LSTM",
"GRU"
],
"title": "Rnn Type",
"type": "string"
}
},
"title": "RnnEncoderConfig",
"type": "object"
},
"RnnModelConfig": {
"additionalProperties": false,
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": -1,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "rnn",
"default": "rnn",
"title": "Architecture",
"type": "string"
},
"bridge": {
"default": false,
"description": "Have an additional layer between the last encoder state and the first decoder state (RNN specific).",
"title": "Bridge",
"type": "boolean"
},
"rnn_type": {
"default": "LSTM",
"description": "The gate type to use in the RNNs.",
"enum": [
"LSTM",
"GRU"
],
"title": "Rnn Type",
"type": "string"
}
},
"title": "RnnModelConfig",
"type": "object"
},
"RotaryPositionConfig": {
"additionalProperties": false,
"description": "Configuration for rotary position embeddings used in transformer models.",
"properties": {
"rotary_interleave": {
"default": true,
"description": "Interleave the head dimensions when rotary embeddings are applied. Otherwise the head dimensions are sliced in half. (True = original Meta Llama; False = all HuggingFace models)",
"title": "Rotary Interleave",
"type": "boolean"
},
"rotary_theta": {
"default": 10000,
"description": "Rotary theta base length; 1e4 for Llama2/Mistral, 1e6 for Mixtral.",
"title": "Rotary Theta",
"type": "integer"
},
"rotary_dim": {
"default": 0,
"description": "Rotary dim when model requires it to be different to head dim.",
"title": "Rotary Dim",
"type": "integer"
},
"scaling_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Specifies the type of RoPE scaling to be applied, if any.",
"title": "Scaling Type"
},
"alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Alpha factor by which to scale RoPE theta.",
"title": "Alpha"
},
"xdrope_section": {
"anyOf": [
{
"items": {
"type": "integer"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Sections for XDRope mappings",
"title": "Xdrope Section"
},
"scaling_factor": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 8.0,
"description": "Factor by which to scale RoPE embeddings.",
"title": "Scaling Factor"
},
"low_freq_factor": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Scaling factor applied to the lower frequency components of RoPE.",
"title": "Low Freq Factor"
},
"high_freq_factor": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 4.0,
"description": "Scaling factor applied to the higher frequency components of RoPE.",
"title": "High Freq Factor"
},
"original_max_position_embeddings": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 8192,
"description": "Original maximum position embeddings for RoPE scaling.",
"title": "Original Max Position Embeddings"
},
"rotary_theta_local": {
"default": 10000,
"description": "Rotary theta base length for local rotary layers",
"title": "Rotary Theta Local",
"type": "integer"
},
"interleave_local": {
"default": 0,
"description": "Local rotary layers each 1/N layers",
"title": "Interleave Local",
"type": "integer"
},
"tmax_index": {
"default": 0,
"description": "tmax indexing, 0 for all cases except gemma 3 = 1",
"title": "Tmax Index",
"type": "integer"
}
},
"title": "RotaryPositionConfig",
"type": "object"
},
"SuffixConfig": {
"additionalProperties": false,
"properties": {
"src_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to append to all source examples.",
"title": "Src Suffix"
},
"tgt_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to append to all target examples.",
"title": "Tgt Suffix"
}
},
"title": "SuffixConfig",
"type": "object"
},
"SwitchOutConfig": {
"additionalProperties": false,
"properties": {
"switchout_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for SwitchOut. :math:`\\tau^{-1}` in :cite:`DBLP:journals/corr/abs-1808-07512`. Smaller value makes data more diverse.",
"title": "Switchout Temperature"
}
},
"title": "SwitchOutConfig",
"type": "object"
},
"TerminologyConfig": {
"additionalProperties": false,
"properties": {
"termbase_path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to a dictionary file with terms.",
"title": "Termbase Path"
},
"src_spacy_language_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the spaCy language model for the source corpus.",
"title": "Src Spacy Language Model"
},
"tgt_spacy_language_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the spaCy language model for the target corpus.",
"title": "Tgt Spacy Language Model"
},
"term_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.3,
"description": "Ratio of corpus to augment with terms.",
"title": "Term Corpus Ratio"
},
"term_example_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.2,
"description": "Maximum terms allowed in an example.",
"title": "Term Example Ratio"
},
"src_term_stoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fsrc_term_start\uff60",
"description": "The source term start token.",
"title": "Src Term Stoken"
},
"tgt_term_stoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ftgt_term_start\uff60",
"description": "The target term start token.",
"title": "Tgt Term Stoken"
},
"tgt_term_etoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ftgt_term_end\uff60",
"description": "The target term end token.",
"title": "Tgt Term Etoken"
},
"term_source_delimiter": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ffuzzy\uff60",
"description": "Any special token used for augmented source sentences. The default is the fuzzy token used in the FuzzyMatch transform.",
"title": "Term Source Delimiter"
}
},
"title": "TerminologyConfig",
"type": "object"
},
"TokenDropConfig": {
"additionalProperties": false,
"properties": {
"tokendrop_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for token deletion.",
"title": "Tokendrop Temperature"
}
},
"title": "TokenDropConfig",
"type": "object"
},
"TokenMaskConfig": {
"additionalProperties": false,
"properties": {
"tokenmask_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for token masking.",
"title": "Tokenmask Temperature"
}
},
"title": "TokenMaskConfig",
"type": "object"
},
"TrainingConfig": {
"additionalProperties": false,
"properties": {
"quant_layers": {
"default": [],
"description": "List of layers to be compressed in 4/8bit.",
"items": {
"type": "string"
},
"title": "Quant Layers",
"type": "array"
},
"quant_type": {
"default": "",
"description": "Type of compression.",
"enum": [
"",
"bnb_8bit",
"bnb_FP4",
"bnb_NF4",
"awq_gemm",
"awq_gemv",
"autoround",
"gguf"
],
"title": "Quant Type",
"type": "string"
},
"w_bit": {
"default": 4,
"description": "W_bit quantization",
"title": "W Bit",
"type": "integer"
},
"group_size": {
"default": 128,
"description": "Group size quantization.",
"title": "Group Size",
"type": "integer"
},
"autoround_packing_format": {
"default": "auto_round:auto_gptq",
"description": "AutoRound packing format (from quantization_config.packing_format). Determines whether qzeros use GPTQ-style (zeros-1) packing. Use 'auto_round:auto_gptq' for GPTQ-format (default), or 'auto_round' for direct zero-point.",
"title": "Autoround Packing Format",
"type": "string"
},
"autoround_sym": {
"default": true,
"description": "AutoRound symmetric quantization flag (from quantization_config.sym). Required to select the Marlin CUDA backend, which only supports symmetric quantization.",
"title": "Autoround Sym",
"type": "boolean"
},
"quant_exclude_modules": {
"default": [],
"description": "List of parent module names whose entire subtrees must not be quantized, even if child layers appear in quant_layers. Used for AutoRound models where some parent modules (e.g. shared_experts in MoE) were kept in fp16 during quantization.",
"items": {
"type": "string"
},
"title": "Quant Exclude Modules",
"type": "array"
},
"lora_layers": {
"default": [],
"description": "List of layers to be replaced by LoRa layers. E.g. ['linear_values', 'linear_query'] (\u00a74.2 in https://arxiv.org/abs/2106.09685)",
"items": {
"type": "string"
},
"title": "Lora Layers",
"type": "array"
},
"lora_embedding": {
"default": false,
"description": "Replace embeddings with LoRa Embeddings (\u00a75.1)",
"title": "Lora Embedding",
"type": "boolean"
},
"lora_rank": {
"default": 2,
"description": "r=2 successfully tested with NLLB-200 3.3B",
"title": "Lora Rank",
"type": "integer"
},
"lora_alpha": {
"default": 1,
"description": "\u00a74.1 https://arxiv.org/abs/2106.09685",
"title": "Lora Alpha",
"type": "integer"
},
"lora_dropout": {
"default": 0.0,
"description": "Rule of thumb: same value as in main model.",
"title": "Lora Dropout",
"type": "number"
},
"optim": {
"default": "sgd",
"description": "Optimization method.",
"enum": [
"sgd",
"adagrad",
"adadelta",
"adam",
"adamw",
"sparseadam",
"adafactor",
"adamw8bit",
"pagedadamw8bit",
"pagedadamw32bit"
],
"title": "Optim",
"type": "string"
},
"adagrad_accumulator_init": {
"default": 0,
"description": "Initialize the accumulator values in adagrad. Mirrors initial_accumulator_value flag from tensorflow adagrad implementation (default 0.1 there).",
"title": "Adagrad Accumulator Init",
"type": "number"
},
"adam_beta1": {
"default": 0.9,
"description": "Beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.",
"title": "Adam Beta1",
"type": "number"
},
"adam_beta2": {
"default": 0.999,
"description": "Beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and Keras (https://keras.io/optimizers/). Whereas recently the paper Attention is All You Need suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.",
"title": "Adam Beta2",
"type": "number"
},
"adafactor_beta2": {
"default": -0.8,
"description": "Beta2_decay parameter used by Adafactor - see Pytorch documentation. ",
"title": "Adafactor Beta2",
"type": "number"
},
"adam_eps": {
"default": 1e-08,
"description": "Adam epsilon to forward to torch Optimizer.",
"title": "Adam Eps",
"type": "number"
},
"adafactor_eps": {
"default": [
null,
0.001
],
"description": "Adafactor epsilon to forward to torch Optimizer.",
"maxItems": 2,
"minItems": 2,
"prefixItems": [
{
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
]
},
{
"type": "number"
}
],
"title": "Adafactor Eps",
"type": "array"
},
"adafactor_d": {
"default": 1.0,
"description": "clipping threshold, used to avoid larger-than-desired updates.",
"title": "Adafactor D",
"type": "number"
},
"weight_decay": {
"default": 0.0,
"description": "Weight decay to forward to torch Optimizer.",
"title": "Weight Decay",
"type": "number"
},
"use_amp": {
"default": true,
"description": "Use torch mixed precision when compute_dtype is 16-bit.",
"title": "Use Amp",
"type": "boolean"
},
"learning_rate": {
"default": 1.0,
"description": "Starting learning rate. Recommended settings: sgd=1, adagrad=0.1, adadelta=1, adam=0.001.",
"title": "Learning Rate",
"type": "number"
},
"learning_rate_decay": {
"default": 0.5,
"description": "Decay learning rate by this much if steps have gone past start_decay_steps.",
"title": "Learning Rate Decay",
"type": "number"
},
"start_decay_steps": {
"default": 50000,
"description": "Start decaying every decay_steps after this many steps.",
"title": "Start Decay Steps",
"type": "integer"
},
"decay_steps": {
"default": 10000,
"description": "Frequency for learning rate decay, in steps.",
"title": "Decay Steps",
"type": "integer"
},
"decay_method": {
"default": "none",
"description": "Custom decay method to use.",
"enum": [
"noam",
"noamwd",
"cosine",
"rsqrt",
"none"
],
"title": "Decay Method",
"type": "string"
},
"warmup_steps": {
"default": 4000,
"description": "Number of warmup steps for custom decay.",
"title": "Warmup Steps",
"type": "integer"
},
"reset_optim": {
"default": "none",
"description": "Optimization resetter when using train_from.",
"enum": [
"none",
"all",
"states",
"keep_states"
],
"title": "Reset Optim",
"type": "string"
},
"gpu_ranks": {
"default": [],
"description": "List of ranks for each process.",
"items": {
"type": "integer"
},
"title": "Gpu Ranks",
"type": "array"
},
"world_size": {
"default": 1,
"description": "Total number of distributed processes.",
"title": "World Size",
"type": "integer"
},
"parallel_mode": {
"default": "data_parallel",
"description": "Distributed mode.",
"enum": [
"data_parallel",
"tensor_parallel"
],
"title": "Parallel Mode",
"type": "string"
},
"gpu_backend": {
"default": "nccl",
"description": "Type of torch distributed backend.",
"title": "Gpu Backend",
"type": "string"
},
"gpu_verbose_level": {
"default": 0,
"description": "Gives more info on each process per GPU.",
"title": "Gpu Verbose Level",
"type": "integer"
},
"master_ip": {
"default": "localhost",
"description": "IP of master for torch.distributed training.",
"title": "Master Ip",
"type": "string"
},
"master_port": {
"default": 10000,
"description": "Port of master for torch.distributed training.",
"title": "Master Port",
"type": "integer"
},
"timeout": {
"default": 60,
"description": "Timeout for one GPU to wait for the others.",
"title": "Timeout",
"type": "integer"
},
"model_path": {
"default": "model",
"description": "Path to directory containing all model components.",
"title": "Model Path",
"type": "string"
},
"self_attn_backend": {
"default": "flash",
"description": "Self-attention backend.",
"enum": [
"flash",
"pytorch"
],
"title": "Self Attn Backend",
"type": "string"
},
"compute_dtype": {
"description": "Compute dtype (precision) to use for main compute. Some parameters might have other dtypes for specific cases (e.g. torch.amp -- See eole.config.training.TrainingConfig.storage_dtype) fp32 to force slow fp16 model on gtx1080, int8 to enable pytorch native 8-bit quantization (cpu only).",
"enum": [
"fp32",
"fp16",
"int8",
"bf16"
],
"title": "Compute Dtype",
"type": "string"
},
"torch_compile": {
"default": false,
"description": "Use torch.compile with dynamic=True.",
"title": "Torch Compile",
"type": "boolean"
},
"param_init": {
"default": 0.1,
"description": "Support value for uniform distribution parameters initialization. Set to 0 not to use initialization.",
"title": "Param Init",
"type": "number"
},
"param_init_method": {
"default": "uniform",
"description": "Parameter initialization method.",
"enum": [
"xavier_uniform",
"uniform",
"normal"
],
"title": "Param Init Method",
"type": "string"
},
"freeze_encoder": {
"default": false,
"description": "Freeze parameters in encoder.",
"title": "Freeze Encoder",
"type": "boolean"
},
"freeze_decoder": {
"default": false,
"description": "Freeze parameters in decoder.",
"title": "Freeze Decoder",
"type": "boolean"
},
"pre_word_vecs_enc": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If a valid path is specified, will load pretrained word embeddings on the encoder side.",
"title": "Pre Word Vecs Enc"
},
"pre_word_vecs_dec": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If a valid path is specified, will load pretrained word embeddings on the decoder side.",
"title": "Pre Word Vecs Dec"
},
"data_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "text",
"title": "Data Type"
},
"bucket_size": {
"default": 262144,
"description": "A bucket is a buffer of bucket_size examples to pick from the various corpora. The dynamic iterator batches batch_size items from the bucket and shuffle them.",
"title": "Bucket Size",
"type": "integer"
},
"bucket_size_init": {
"default": -1,
"description": "Bucket size is initialized with this amount of examples (see bucket_size_increment).",
"title": "Bucket Size Init",
"type": "integer"
},
"bucket_size_increment": {
"default": 0,
"description": "Bucket size incremented with this amount of examples at each new bucket (up to bucket_size).",
"title": "Bucket Size Increment",
"type": "integer"
},
"prefetch_factor": {
"default": 200,
"description": "Number of mini-batches loaded in advance to avoid the GPU waiting during processing of next bucket.",
"title": "Prefetch Factor",
"type": "integer"
},
"save_format": {
"default": "pytorch",
"description": "Format to save the model weights.",
"enum": [
"pytorch",
"safetensors"
],
"title": "Save Format",
"type": "string"
},
"save_checkpoint_steps": {
"default": 5000,
"description": "Frequency of checkpoint saving (in steps).",
"title": "Save Checkpoint Steps",
"type": "integer"
},
"keep_checkpoint": {
"default": -1,
"description": "Number of checkpoints to retain. (-1 retains all)",
"title": "Keep Checkpoint",
"type": "integer"
},
"train_from": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Pretrained model/checkpoint weights to continue training from.",
"title": "Train From"
},
"num_workers": {
"default": 2,
"description": "Number of workers for pytorch.DataLoader objects.",
"title": "Num Workers",
"type": "integer"
},
"batch_size": {
"default": 64,
"description": "Maximum batch size for training.",
"title": "Batch Size",
"type": "integer"
},
"batch_size_multiple": {
"default": 1,
"description": "Batch size multiple for token batches.",
"title": "Batch Size Multiple",
"type": "integer"
},
"batch_type": {
"default": "sents",
"description": "Batch grouping for batch_size.",
"enum": [
"sents",
"tokens"
],
"title": "Batch Type",
"type": "string"
},
"normalization": {
"default": "sents",
"description": "Normalization method of the gradient.",
"enum": [
"sents",
"tokens"
],
"title": "Normalization",
"type": "string"
},
"accum_count": {
"default": [
1
],
"description": "Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for transformer.",
"items": {
"type": "integer"
},
"title": "Accum Count",
"type": "array"
},
"accum_steps": {
"default": [
0
],
"description": "Steps at which accum_count values change.",
"items": {
"type": "integer"
},
"title": "Accum Steps",
"type": "array"
},
"valid_steps": {
"default": 10000,
"description": "Frequency of validation, in steps.",
"title": "Valid Steps",
"type": "integer"
},
"valid_batch_size": {
"default": 32,
"description": "Maximum batch size for validation.",
"title": "Valid Batch Size",
"type": "integer"
},
"train_steps": {
"default": 100000,
"description": "Number of training steps.",
"title": "Train Steps",
"type": "integer"
},
"single_pass": {
"default": false,
"description": "Make a single pass over the training dataset.",
"title": "Single Pass",
"type": "boolean"
},
"early_stopping": {
"default": 0,
"description": "Number of validation steps without improving that will trigger early stop of training.",
"title": "Early Stopping",
"type": "integer"
},
"early_stopping_criteria": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Criteria to use for early stopping.",
"title": "Early Stopping Criteria"
},
"max_grad_norm": {
"default": 5,
"description": "If the norm of the gradient vector exceeds this value, renormalize it to have the norm equal to max_grad_norm.",
"title": "Max Grad Norm",
"type": "number"
},
"dropout": {
"default": [
0.3
],
"description": "Dropout probability.",
"items": {
"type": "number"
},
"title": "Dropout",
"type": "array"
},
"attention_dropout": {
"default": [
0.1
],
"description": "Attention dropout probability.",
"items": {
"type": "number"
},
"title": "Attention Dropout",
"type": "array"
},
"dropout_steps": {
"default": [
0
],
"description": "Steps at which dropout changes.",
"items": {
"type": "integer"
},
"title": "Dropout Steps",
"type": "array"
},
"label_smoothing": {
"default": 0.0,
"description": "Label smoothing value epsilon. Probability of all non-true labels will be smoothed by epsilon/(vocab_size-1). Set to 0 to turn off label smoothing. (https://arxiv.org/abs/1512.00567)",
"title": "Label Smoothing",
"type": "number"
},
"average_decay": {
"default": 0.0,
"description": "Exponential moving average decay (https://en.wikipedia.org/wiki/Moving_average). Set to other than 0 (e.g. 1e-4) to activate. Similar to Marian NMT implementation (http://www.aclweb.org/anthology/P18-4020).",
"title": "Average Decay",
"type": "number"
},
"average_every": {
"default": 1,
"description": "Step for moving average. Default is every update if average_decay is set.",
"title": "Average Every",
"type": "integer"
},
"zero_out_prompt_loss": {
"default": false,
"description": "Set the prompt loss to zero. Mostly for LLM finetuning. Will be enabled only if the `insert_mask_before_placeholder` transform is applied.",
"title": "Zero Out Prompt Loss",
"type": "boolean"
},
"use_ckpting": {
"default": [],
"description": "Use gradient checkpointing for those modules.",
"items": {
"type": "string"
},
"title": "Use Ckpting",
"type": "array"
},
"update_vocab": {
"default": false,
"description": "Update source and target existing vocabularies.",
"title": "Update Vocab",
"type": "boolean"
},
"lm_prior_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "LM model to use to train the TM.",
"title": "Lm Prior Model"
},
"lm_prior_lambda": {
"default": 0.0,
"description": "LM Prior Lambda",
"title": "Lm Prior Lambda",
"type": "number"
},
"lm_prior_tau": {
"default": 1.0,
"description": "LM Prior Tau",
"title": "Lm Prior Tau",
"type": "number"
},
"estim_loss_lambda": {
"default": [
1.0
],
"description": "Weight applied to estimator loss",
"items": {
"type": "number"
},
"title": "Estim Loss Lambda",
"type": "array"
},
"estim_loss_lambda_steps": {
"default": [
0
],
"description": "Steps at which estimator loss lambda changes",
"items": {
"type": "integer"
},
"title": "Estim Loss Lambda Steps",
"type": "array"
},
"score_threshold": {
"default": 0.68,
"description": "Threshold to filterout data",
"title": "Score Threshold",
"type": "number"
},
"log_attention_entropy": {
"default": true,
"description": "Whether to compute and log attention entropy during training.",
"title": "Log Attention Entropy",
"type": "boolean"
},
"attention_entropy_types": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Which attention types to compute entropy for. If None, computes for all available types (e.g., ['std', 'self', 'context']).",
"title": "Attention Entropy Types"
},
"attention_entropy_layers": {
"anyOf": [
{
"items": {
"type": "integer"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Which attention layer indices to include in entropy computation. If None, includes all layers.",
"title": "Attention Entropy Layers"
},
"attention_entropy_aggregation": {
"default": "mean",
"description": "How to aggregate attention entropy across different attention types/layers.",
"enum": [
"mean",
"max",
"min"
],
"title": "Attention Entropy Aggregation",
"type": "string"
}
},
"title": "TrainingConfig",
"type": "object"
},
"TransformerDecoderConfig": {
"additionalProperties": false,
"properties": {
"decoder_type": {
"const": "transformer",
"default": "transformer",
"title": "Decoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the decoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of decoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"tgt_word_vec_size": {
"default": 512,
"description": "Word embedding size for tgt.",
"title": "Tgt Word Vec Size",
"type": "integer"
},
"coverage_attn": {
"default": false,
"description": "Train a coverage attention layer.",
"title": "Coverage Attn",
"type": "boolean"
},
"with_cross_attn": {
"default": false,
"description": "Decoder uses cross-attention with encoder outputs.",
"title": "With Cross Attn",
"type": "boolean"
},
"lambda_coverage": {
"default": 0.0,
"description": "Lambda value for coverage loss of See et al (2017)",
"title": "Lambda Coverage",
"type": "number"
},
"global_attention": {
"default": "general",
"description": "The attention type to use. (Luong=general, Bahdanau=MLP)",
"enum": [
"dot",
"general",
"mlp",
null
],
"title": "Global Attention"
},
"global_attention_function": {
"default": "softmax",
"description": "Global attention function to use.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Global Attention Function",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
},
"aan_useffn": {
"default": false,
"description": "Turn on the FFN layer in the AAN decoder.",
"title": "Aan Useffn",
"type": "boolean"
},
"alignment_layer": {
"default": -2,
"description": "Layer number which has to be supervised.",
"title": "Alignment Layer",
"type": "integer"
},
"alignment_heads": {
"default": 0,
"description": "Number of cross attention heads per layer to supervise with.",
"title": "Alignment Heads",
"type": "integer"
},
"full_context_alignment": {
"default": false,
"description": "Whether alignment is conditioned on full target context.",
"title": "Full Context Alignment",
"type": "boolean"
},
"lambda_align": {
"default": 0.0,
"description": "Lambda value for alignement loss of Garg et al, 2019 (https://arxiv.org/abs/1909.02074)",
"title": "Lambda Align",
"type": "number"
},
"LM_type": {
"default": "causal",
"description": "TransformerDecoder LM type (causal = classic, or prefix LM https://arxiv.org/pdf/2308.06912)",
"enum": [
"causal",
"prefix"
],
"title": "Lm Type",
"type": "string"
},
"layer_types": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Per-layer types for hybrid architectures (e.g. Qwen3.5). Supported values: 'full_attention', 'linear_attention'. When None, all layers use full attention.",
"title": "Layer Types"
},
"linear_conv_kernel_dim": {
"default": 4,
"description": "Convolution kernel size for linear attention layers (Qwen3.5 GatedDeltaNet).",
"title": "Linear Conv Kernel Dim",
"type": "integer"
},
"linear_key_head_dim": {
"default": 128,
"description": "Key head dimension for linear attention layers (Qwen3.5 GatedDeltaNet).",
"title": "Linear Key Head Dim",
"type": "integer"
},
"linear_value_head_dim": {
"default": 128,
"description": "Value head dimension for linear attention layers (Qwen3.5 GatedDeltaNet).",
"title": "Linear Value Head Dim",
"type": "integer"
},
"linear_num_key_heads": {
"default": 16,
"description": "Number of key heads for linear attention layers (Qwen3.5 GatedDeltaNet).",
"title": "Linear Num Key Heads",
"type": "integer"
},
"linear_num_value_heads": {
"default": 32,
"description": "Number of value heads for linear attention layers (Qwen3.5 GatedDeltaNet).",
"title": "Linear Num Value Heads",
"type": "integer"
}
},
"title": "TransformerDecoderConfig",
"type": "object"
},
"TransformerEncoderConfig": {
"additionalProperties": false,
"properties": {
"encoder_type": {
"const": "transformer",
"default": "transformer",
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
}
},
"title": "TransformerEncoderConfig",
"type": "object"
},
"TransformerEncoderModelConfig": {
"additionalProperties": false,
"description": "Facilitate setting some transformer specific params at model level.",
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder",
"type": "null"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "transformer_encoder",
"default": "transformer_encoder",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
}
},
"title": "TransformerEncoderModelConfig",
"type": "object"
},
"TransformerLMModelConfig": {
"additionalProperties": false,
"description": "Facilitate setting some transformer specific params at model level.",
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder",
"type": "null"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "transformer_lm",
"default": "transformer_lm",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
}
},
"title": "TransformerLMModelConfig",
"type": "object"
},
"TransformerModelConfig": {
"additionalProperties": false,
"description": "Facilitate setting some transformer specific params at model level.",
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "transformer",
"default": "transformer",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. If None, heads_kv = heads; otherwise the number of KV heads (e.g. Falcon 40B).",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension, when it needs to differ from hidden_size // heads.",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor; when None, defaults to 1/sqrt(head_dim).",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is applied before topk; Mixtral applies it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalizes expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two cases. Case 1: absolute number of positions to learn position embeddings on (position_encoding_type: Learned). Case 2: max relative positions, when position_encoding_type: Relative.",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
}
},
"title": "TransformerModelConfig",
"type": "object"
},
"UpperCaseConfig": {
"additionalProperties": false,
"properties": {
"upper_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.01,
"description": "Corpus ratio to apply uppercasing.",
"title": "Upper Corpus Ratio"
}
},
"title": "UpperCaseConfig",
"type": "object"
},
"VisionEncoderConfig": {
"additionalProperties": false,
"description": "Based on mistral-community/pixtral-12b, might evolve later.",
"properties": {
"encoder_type": {
"const": "vision",
"default": "vision",
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. If None, heads_kv = heads; otherwise the number of KV heads (e.g. Falcon 40B).",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension, when it needs to differ from hidden_size // heads.",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor; when None, defaults to 1/sqrt(head_dim).",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is applied before topk; Mixtral applies it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalizes expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two cases. Case 1: absolute number of positions to learn position embeddings on (position_encoding_type: Learned). Case 2: max relative positions, when position_encoding_type: Relative.",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
},
"num_channels": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 3,
"title": "Num Channels"
},
"image_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1024,
"title": "Image Size"
},
"patch_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 16,
"title": "Patch Size"
},
"image_token_id": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 10,
"title": "Image Token Id"
},
"image_token_id_list": {
"anyOf": [
{
"items": {
"type": "integer"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Includes other image_token ids.",
"title": "Image Token Id List"
},
"mm_tokens_per_image": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 256,
"title": "Mm Tokens Per Image"
},
"layernorm_pre": {
"default": true,
"title": "Layernorm Pre",
"type": "boolean"
},
"layernorm_post": {
"default": false,
"title": "Layernorm Post",
"type": "boolean"
},
"patch_conv_bias": {
"default": false,
"title": "Patch Conv Bias",
"type": "boolean"
},
"encoder_sam": {
"default": false,
"title": "Encoder Sam",
"type": "boolean"
},
"use_class_embedding": {
"default": false,
"title": "Use Class Embedding",
"type": "boolean"
},
"temporal_patch_size": {
"default": 1,
"description": "Temporal kernel size for Conv3D patch embedding. When >1 a nn.Conv3d is used (e.g. Qwen3.5 VL uses 2).",
"title": "Temporal Patch Size",
"type": "integer"
},
"num_position_embeddings": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of the absolute position embedding table (Qwen3.5 VL uses 2304 = 48\u00d748). When set together with position_encoding_type=Rotary both absolute embeddings and 2D RoPE are applied.",
"title": "Num Position Embeddings"
}
},
"title": "VisionEncoderConfig",
"type": "object"
},
"VisionTransformerLMModelConfig": {
"additionalProperties": false,
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a shared weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "vision_transformer_lm",
"default": "vision_transformer_lm",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. If None, heads_kv = heads; otherwise the number of KV heads (e.g. Falcon 40B).",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension, when it needs to differ from hidden_size // heads.",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor; when None, defaults to 1/sqrt(head_dim).",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is applied before topk; Mixtral applies it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalizes expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two cases. Case 1: absolute number of positions to learn position embeddings on (position_encoding_type: Learned). Case 2: max relative positions, when position_encoding_type: Relative.",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
},
"adapter": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "llava",
"description": "Adapter type to use in the model.",
"title": "Adapter"
}
},
"title": "VisionTransformerLMModelConfig",
"type": "object"
},
"WhisperModelConfig": {
"additionalProperties": false,
"description": "Configuration for Whisper speech-to-text models.",
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a shared weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "whisper",
"default": "whisper",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
},
"suppress_tokens": {
"anyOf": [
{
"items": {
"type": "integer"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of token IDs to suppress during audio decoding.",
"title": "Suppress Tokens"
},
"begin_suppress_tokens": {
"anyOf": [
{
"items": {
"type": "integer"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of token IDs to suppress at the first generated position.",
"title": "Begin Suppress Tokens"
},
"no_timestamps_token_id": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Token ID for the no-timestamps token in audio models.",
"title": "No Timestamps Token Id"
},
"word_timestamp_heads": {
"anyOf": [
{
"items": {
"items": {
"type": "integer"
},
"type": "array"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of [layer, head] pairs for word-level timestamp extraction (mapped from alignment_heads in HF generation_config).",
"title": "Word Timestamp Heads"
},
"median_filter_width": {
"default": 7,
"description": "Median filter width for word-level timestamp smoothing.",
"title": "Median Filter Width",
"type": "integer"
}
},
"title": "WhisperModelConfig",
"type": "object"
}
},
"additionalProperties": false,
"required": [
"src_vocab",
"data"
]
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
inference (eole.config.inference.InferenceConfig | None)
model (Annotated[eole.config.models.TransformerModelConfig | eole.config.models.TransformerLMModelConfig | eole.config.models.VisionTransformerLMModelConfig | eole.config.models.WhisperModelConfig | eole.config.models.TransformerEncoderModelConfig | eole.config.models.RnnModelConfig | eole.config.models.CnnModelConfig | eole.config.models.CustomModelConfig, FieldInfo(annotation=NoneType, required=True, discriminator='architecture')] | None)
n_sample (int)
training (eole.config.training.TrainingConfig | None)
verbose (bool)
- Validators:
_validate_train_config » all fields
default_architecture » all fields
str_to_dict » model
str_to_dict » training
field inference : InferenceConfig | None = None
- Validated by:
_maybe_set_huggingface_model, _validate_train_config, default_architecture
field model : Annotated[TransformerModelConfig | TransformerLMModelConfig | VisionTransformerLMModelConfig | WhisperModelConfig | TransformerEncoderModelConfig | RnnModelConfig | CnnModelConfig | CustomModelConfig, FieldInfo(annotation=NoneType, required=True, discriminator='architecture')] | None = None
- Validated by:
_maybe_set_huggingface_model, _validate_train_config, default_architecture, str_to_dict
field n_sample : int = 0
Number of transformed samples per corpus to use to build the vocabulary. Set to -1 to use the full corpora.
- Validated by:
_maybe_set_huggingface_model, _validate_train_config, default_architecture
field training : TrainingConfig | None [Optional]
- Validated by:
_maybe_set_huggingface_model, _validate_train_config, default_architecture, str_to_dict
field verbose : bool = False
Print data loading and statistics for all processes (by default, only the first process shard is logged).
- Validated by:
_maybe_set_huggingface_model, _validate_train_config, default_architecture
validator default_architecture » all fields
classmethod get_defaults(architecture)
validator str_to_dict » training, model
get_model_path()
model_post_init(context: Any) → None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
- self – The BaseModel instance.
- context – The context.
property data_type
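The `model` field above is a discriminated union: pydantic inspects the `architecture` key of the incoming dict and validates against the matching sub-config. A minimal stdlib sketch of that dispatch is below; the `ARCHITECTURE_MAPPING` dict mirrors the discriminator mapping in the schema, but `resolve_model_config` is an illustrative helper, not part of eole's API.

```python
# Sketch of the 'architecture' discriminator used by the `model` field.
# In eole this dispatch is handled by pydantic's discriminated unions;
# resolve_model_config() is a hypothetical stand-in for illustration.
ARCHITECTURE_MAPPING = {
    "cnn": "CnnModelConfig",
    "custom": "CustomModelConfig",
    "rnn": "RnnModelConfig",
    "transformer": "TransformerModelConfig",
    "transformer_encoder": "TransformerEncoderModelConfig",
    "transformer_lm": "TransformerLMModelConfig",
    "vision_transformer_lm": "VisionTransformerLMModelConfig",
    "whisper": "WhisperModelConfig",
}


def resolve_model_config(model_dict: dict) -> str:
    """Return the sub-config name selected by the 'architecture' key."""
    try:
        return ARCHITECTURE_MAPPING[model_dict["architecture"]]
    except KeyError as exc:
        raise ValueError(f"unknown or missing architecture: {exc}") from exc


print(resolve_model_config({"architecture": "whisper"}))  # WhisperModelConfig
```

Because the union is discriminated, an invalid `architecture` value fails fast with a clear error instead of being tried against every sub-config in turn.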
pydantic model eole.config.run.PredictConfig
Bases: InferenceConfig, LoggingConfig, MiscConfig
Show JSON schema
{
"title": "PredictConfig",
"type": "object",
"properties": {
"seed": {
"default": -1,
"description": "Set random seed used for better reproducibility between experiments.",
"title": "Seed",
"type": "integer"
},
"log_file": {
"default": "",
"description": "Output logs to a file under this path.",
"title": "Log File",
"type": "string"
},
"report_every": {
"default": 50,
"description": "Print stats at this interval (in steps).",
"title": "Report Every",
"type": "integer"
},
"valid_metrics": {
"default": [],
"description": "List of names of additional validation metrics.",
"items": {
"type": "string"
},
"title": "Valid Metrics",
"type": "array"
},
"wer_normalize": {
"default": "none",
"description": "WER normalization mode: none, lowercase, whisper_en, whisper_basic.",
"title": "Wer Normalize",
"type": "string"
},
"comet_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "COMET model name or local path to use for COMET/COMET-KIWI scoring. Defaults to Unbabel/wmt22-comet-da for COMET and Unbabel/wmt22-cometkiwi-da for COMET-KIWI when not set.",
"title": "Comet Model"
},
"comet_batch_size": {
"default": 64,
"description": "Batch size used when running COMET/COMET-KIWI scoring.",
"title": "Comet Batch Size",
"type": "integer"
},
"scoring_debug": {
"default": false,
"description": "Dump src/ref/pred of the current batch.",
"title": "Scoring Debug",
"type": "boolean"
},
"dump_preds": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Folder to dump predictions to.",
"title": "Dump Preds"
},
"tensorboard": {
"default": false,
"description": "Use tensorboard for visualization during training.",
"title": "Tensorboard",
"type": "boolean"
},
"tensorboard_log_dir": {
"default": "runs/eole",
"description": "Log directory for tensorboard (also the name of the run).",
"title": "Tensorboard Log Dir",
"type": "string"
},
"tensorboard_log_dir_dated": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Tensorboard Log Dir Dated"
},
"quant_layers": {
"default": [],
"description": "List of layers to be compressed in 4/8bit.",
"items": {
"type": "string"
},
"title": "Quant Layers",
"type": "array"
},
"quant_type": {
"default": "",
"description": "Type of compression.",
"enum": [
"",
"bnb_8bit",
"bnb_FP4",
"bnb_NF4",
"awq_gemm",
"awq_gemv",
"autoround",
"gguf"
],
"title": "Quant Type",
"type": "string"
},
"w_bit": {
"default": 4,
"description": "W_bit quantization",
"title": "W Bit",
"type": "integer"
},
"group_size": {
"default": 128,
"description": "Group size quantization.",
"title": "Group Size",
"type": "integer"
},
"autoround_packing_format": {
"default": "auto_round:auto_gptq",
"description": "AutoRound packing format (from quantization_config.packing_format). Determines whether qzeros use GPTQ-style (zeros-1) packing. Use 'auto_round:auto_gptq' for GPTQ-format (default), or 'auto_round' for direct zero-point.",
"title": "Autoround Packing Format",
"type": "string"
},
"autoround_sym": {
"default": true,
"description": "AutoRound symmetric quantization flag (from quantization_config.sym). Required to select the Marlin CUDA backend, which only supports symmetric quantization.",
"title": "Autoround Sym",
"type": "boolean"
},
"quant_exclude_modules": {
"default": [],
"description": "List of parent module names whose entire subtrees must not be quantized, even if child layers appear in quant_layers. Used for AutoRound models where some parent modules (e.g. shared_experts in MoE) were kept in fp16 during quantization.",
"items": {
"type": "string"
},
"title": "Quant Exclude Modules",
"type": "array"
},
"lora_layers": {
"default": [],
"description": "List of layers to be replaced by LoRa layers. E.g. ['linear_values', 'linear_query'] (\u00a74.2 in https://arxiv.org/abs/2106.09685)",
"items": {
"type": "string"
},
"title": "Lora Layers",
"type": "array"
},
"lora_embedding": {
"default": false,
"description": "Replace embeddings with LoRa Embeddings (\u00a75.1)",
"title": "Lora Embedding",
"type": "boolean"
},
"lora_rank": {
"default": 2,
"description": "r=2 successfully tested with NLLB-200 3.3B",
"title": "Lora Rank",
"type": "integer"
},
"lora_alpha": {
"default": 1,
"description": "\u00a74.1 https://arxiv.org/abs/2106.09685",
"title": "Lora Alpha",
"type": "integer"
},
"lora_dropout": {
"default": 0.0,
"description": "Rule of thumb: same value as in main model.",
"title": "Lora Dropout",
"type": "number"
},
"beam_size": {
"default": 5,
"description": "Beam size.",
"title": "Beam Size",
"type": "integer"
},
"ratio": {
"default": -0.0,
"description": "Ratio based beam stop condition.",
"title": "Ratio",
"type": "number"
},
"top_k": {
"default": 0,
"description": "Set this to -1 to do random sampling from full distribution. Set this to value k>1 to do random sampling restricted to the k most likely next tokens. Set this to 1 to use argmax.",
"title": "Top K",
"type": "integer"
},
"top_p": {
"default": 0.0,
"description": "Probability for top-p/nucleus sampling. Restrict tokens to the most likely until the cumulated probability is over p. In range [0,1]. (https://arxiv.org/abs/1904.09751)",
"lte": 1.0,
"minimum": 0.0,
"title": "Top P",
"type": "number"
},
"temperature": {
"default": 1.0,
"description": "If doing random sampling, divide the logits by this before computing softmax during decoding.",
"title": "Temperature",
"type": "number"
},
"length_penalty": {
"default": "avg",
"description": "Length penalty to use.",
"enum": [
"avg",
"wu",
"none"
],
"title": "Length Penalty",
"type": "string"
},
"alpha": {
"default": 1.0,
"description": "Length penalty parameter (higher = longer generation)",
"title": "Alpha",
"type": "number"
},
"coverage_penalty": {
"default": "none",
"description": "Coverage penalty to use. Only available in beam search.",
"enum": [
"none",
"wu",
"summary"
],
"title": "Coverage Penalty",
"type": "string"
},
"beta": {
"default": -0.0,
"description": "Coverage penalty parameter.",
"title": "Beta",
"type": "number"
},
"stepwise_penalty": {
"default": false,
"description": "Apply coverage penalty at every decoding step. Helpful for summary penalty.",
"title": "Stepwise Penalty",
"type": "boolean"
},
"min_length": {
"default": 0,
"description": "Minimum prediction length.",
"minimum": 0,
"title": "Min Length",
"type": "integer"
},
"max_length": {
"default": 250,
"description": "Maximum prediction length.",
"title": "Max Length",
"type": "integer"
},
"max_length_ratio": {
"default": 2,
"description": "Maximum prediction length ratio. For European languages, 2 is large enough, for target Asian charageters, need to increase to 2-3, for special languages (Burmese, Amharic) to 10. Set to 0 to disable ratio-based length capping.",
"minimum": 0,
"title": "Max Length Ratio",
"type": "number"
},
"block_ngram_repeat": {
"default": 0,
"description": "Block repetition of ngrams during decoding.",
"title": "Block Ngram Repeat",
"type": "integer"
},
"ignore_when_blocking": {
"default": [],
"description": "Ignore these strings when blocking repeats. You want to block sentence delimiters.",
"items": {
"type": "string"
},
"title": "Ignore When Blocking",
"type": "array"
},
"replace_unk": {
"default": false,
"description": "Replace the generated UNK tokens with the source token that had the highest attention weight. If phrase_table is provided, it will lok up the identified source token and give the corresponding target token. If it is not provided (or the identified source token does not exist in the table), then it will copy the source token.",
"title": "Replace Unk",
"type": "boolean"
},
"ban_unk_token": {
"default": false,
"description": "Prevent unk token generation by setting unk probability to 0.",
"title": "Ban Unk Token",
"type": "boolean"
},
"phrase_table": {
"default": "",
"description": "If phrase_table is provided (with replace_unk), it will look up the identified source token and give the corresponding target token.",
"title": "Phrase Table",
"type": "string"
},
"n_best": {
"default": 1,
"description": "Output the n_best decoded sentences.",
"title": "N Best",
"type": "integer"
},
"dump_beam": {
"default": "",
"description": "File to dump beam information to.",
"title": "Dump Beam",
"type": "string"
},
"verbose": {
"default": false,
"description": "Print scores and predictions for each input.",
"title": "Verbose",
"type": "boolean"
},
"with_score": {
"default": false,
"description": "Add a tab separated score to each output.",
"title": "With Score",
"type": "boolean"
},
"timestamps": {
"default": "none",
"description": "Audio models only. Timestamp output: 'none' = plain text, 'segment' = JSON with segment times, 'word' = per-word times via cross-attention DTW.",
"enum": [
"none",
"segment",
"word"
],
"title": "Timestamps",
"type": "string"
},
"language": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Audio models only. Language code (e.g. 'en', 'fr'). Inserts the language token into the decoder prefix.",
"title": "Language"
},
"task": {
"anyOf": [
{
"enum": [
"transcribe",
"translate"
],
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Audio models only. 'transcribe' for same-language, 'translate' for translation to English.",
"title": "Task"
},
"initial_prompt": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Audio models only. Text prompt to condition decoder output style and vocabulary. Prepended as previous context.",
"title": "Initial Prompt"
},
"condition_on_previous_text": {
"default": false,
"description": "Audio models only. Feed previous chunk's decoded text as decoder prompt for the next chunk.",
"title": "Condition On Previous Text",
"type": "boolean"
},
"fallback_temperatures": {
"default": [
0.0,
0.2,
0.4,
0.6,
0.8,
1.0
],
"description": "Audio models only. Temperature cascade for decoding fallback. First temperature uses beam search; subsequent use sampling. Set to [0.0] to disable fallback.",
"items": {
"type": "number"
},
"title": "Fallback Temperatures",
"type": "array"
},
"compression_ratio_threshold": {
"default": 2.4,
"description": "Audio models only. If gzip compression ratio of decoded text exceeds this, retry at next fallback temperature.",
"title": "Compression Ratio Threshold",
"type": "number"
},
"logprob_threshold": {
"default": -1.0,
"description": "Audio models only. If average log probability per token is below this, retry at next fallback temperature.",
"title": "Logprob Threshold",
"type": "number"
},
"no_speech_threshold": {
"default": 0.6,
"description": "Audio models only. Low avg_logprob only triggers fallback when no_speech_prob is also below this threshold.",
"maximum": 1.0,
"minimum": 0.0,
"title": "No Speech Threshold",
"type": "number"
},
"estim_only": {
"default": false,
"description": "Process the input to estimator only (no decoder).",
"title": "Estim Only",
"type": "boolean"
},
"attn_debug": {
"default": false,
"description": "Print best attn for each word.",
"title": "Attn Debug",
"type": "boolean"
},
"align_debug": {
"default": false,
"description": "Print best align for each word.",
"title": "Align Debug",
"type": "boolean"
},
"gpu_ranks": {
"default": [],
"description": "List of ranks for each process.",
"items": {
"type": "integer"
},
"title": "Gpu Ranks",
"type": "array"
},
"world_size": {
"default": 1,
"description": "Total number of distributed processes.",
"title": "World Size",
"type": "integer"
},
"parallel_mode": {
"default": "data_parallel",
"description": "Distributed mode.",
"enum": [
"data_parallel",
"tensor_parallel"
],
"title": "Parallel Mode",
"type": "string"
},
"gpu_backend": {
"default": "nccl",
"description": "Type of torch distributed backend.",
"title": "Gpu Backend",
"type": "string"
},
"gpu_verbose_level": {
"default": 0,
"description": "Gives more info on each process per GPU.",
"title": "Gpu Verbose Level",
"type": "integer"
},
"master_ip": {
"default": "localhost",
"description": "IP of master for torch.distributed training.",
"title": "Master Ip",
"type": "string"
},
"master_port": {
"default": 10000,
"description": "Port of master for torch.distributed training.",
"title": "Master Port",
"type": "integer"
},
"timeout": {
"default": 60,
"description": "Timeout for one GPU to wait for the others.",
"title": "Timeout",
"type": "integer"
},
"model_path": {
"anyOf": [
{
"type": "string"
},
{
"items": {
"type": "string"
},
"type": "array"
}
],
"description": "Path to model .pt file(s). Multiple models can be specified for ensemble decoding.",
"title": "Model Path"
},
"self_attn_backend": {
"default": "flash",
"description": "Self-attention backend.",
"enum": [
"flash",
"pytorch"
],
"title": "Self Attn Backend",
"type": "string"
},
"compute_dtype": {
"description": "Compute dtype (precision) to use for main compute. Some parameters might have other dtypes for specific cases (e.g. torch.amp -- See eole.config.training.TrainingConfig.storage_dtype) fp32 to force slow fp16 model on gtx1080, int8 to enable pytorch native 8-bit quantization (cpu only).",
"enum": [
"fp32",
"fp16",
"int8",
"bf16"
],
"title": "Compute Dtype",
"type": "string"
},
"torch_compile": {
"default": false,
"description": "Use torch.compile with dynamic=True.",
"title": "Torch Compile",
"type": "boolean"
},
"report_align": {
"default": false,
"description": "Report alignment for each translation.",
"title": "Report Align",
"type": "boolean"
},
"gold_align": {
"default": false,
"description": "Report alignment between source and gold target. Useful to test the performance of learnt alignments.",
"title": "Gold Align",
"type": "boolean"
},
"report_time": {
"default": false,
"description": "Report some translation time metrics.",
"title": "Report Time",
"type": "boolean"
},
"fuse_kvq": {
"default": false,
"description": "Fuse K, V, Q Linear layers into a single KVQ in Self Attn.",
"title": "Fuse Kvq",
"type": "boolean"
},
"fuse_gate": {
"default": false,
"description": "Fuse gate_up_proj and up_proj Linear layers into a single Linear.",
"title": "Fuse Gate",
"type": "boolean"
},
"profile": {
"default": false,
"description": "Report pytorch profiling stats.",
"title": "Profile",
"type": "boolean"
},
"batch_size": {
"default": 30,
"description": "Batch size.",
"title": "Batch Size",
"type": "integer"
},
"dynamic_shapes": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Use batch_size / Cache length static or Dynamic",
"title": "Dynamic Shapes"
},
"batch_type": {
"default": "sents",
"description": "Batch grouping for batch size.",
"enum": [
"sents",
"tokens"
],
"title": "Batch Type",
"type": "string"
},
"avg_raw_probs": {
"default": false,
"description": "If set, during ensembling scores from different models will be combined by averaging their raw probabilities and then taking the log. Otherwise, the log probabilities will be averaged directly. Necessary for models whose output layers can assign zero probability.",
"title": "Avg Raw Probs",
"type": "boolean"
},
"data_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "text",
"title": "Data Type"
},
"chat_template": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Chat Template"
},
"optional_eos": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"description": "Optional EOS tokens that would stop generation, e.g. <|eot_id|> for Llama3",
"title": "Optional Eos"
},
"transforms": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"title": "Transforms"
},
"transforms_configs": {
"anyOf": [
{
"$ref": "#/$defs/NestedAllTransformsConfig"
},
{
"type": "null"
}
]
},
"skip_empty_level": {
"default": "warning",
"description": "Logging level when encountering empty examples during inference. (silent: silently ignore/skip empty examples, warning: warn when ignoring/skipping empty examples, error: raise an error and stop execution when any empty example)",
"enum": [
"silent",
"warning",
"error"
],
"title": "Skip Empty Level",
"type": "string"
},
"share_vocab": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"title": "Share Vocab"
},
"src_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Src Subword Vocab"
},
"model": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnModelConfig",
"custom": "#/$defs/CustomModelConfig",
"rnn": "#/$defs/RnnModelConfig",
"transformer": "#/$defs/TransformerModelConfig",
"transformer_encoder": "#/$defs/TransformerEncoderModelConfig",
"transformer_lm": "#/$defs/TransformerLMModelConfig",
"vision_transformer_lm": "#/$defs/VisionTransformerLMModelConfig",
"whisper": "#/$defs/WhisperModelConfig"
},
"propertyName": "architecture"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerModelConfig"
},
{
"$ref": "#/$defs/TransformerLMModelConfig"
},
{
"$ref": "#/$defs/VisionTransformerLMModelConfig"
},
{
"$ref": "#/$defs/WhisperModelConfig"
},
{
"$ref": "#/$defs/TransformerEncoderModelConfig"
},
{
"$ref": "#/$defs/RnnModelConfig"
},
{
"$ref": "#/$defs/CnnModelConfig"
},
{
"$ref": "#/$defs/CustomModelConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"title": "Model"
},
"src": {
"description": "Source file to decode (one line per sequence).",
"title": "Src",
"type": "string"
},
"tgt": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "True target sequences, useful for scoring or prefix decoding.",
"title": "Tgt"
},
"tgt_file_prefix": {
"default": false,
"description": "Generate predictions using provided tgt as prefix.",
"title": "Tgt File Prefix",
"type": "boolean"
},
"output": {
"default": "pred.txt",
"description": "Path to output the predictions (each line will be the decoded sequence).",
"title": "Output",
"type": "string"
},
"engine": {
"default": "eole",
"description": "engine to run inference: eole or ct2",
"title": "Engine",
"type": "string"
}
},
"$defs": {
"ActivationFunction": {
"enum": [
"relu",
"gelu",
"gelu-tanh",
"quick_gelu",
"silu",
"gated-gelu",
"gated-gelu-tanh",
"gated-silu",
"fused-gated-gelu",
"fused-gated-gelu-tanh",
"fused-gated-silu"
],
"title": "ActivationFunction",
"type": "string"
},
"AudioEncoderConfig": {
"additionalProperties": false,
"description": "Configuration for audio encoder.",
"properties": {
"encoder_type": {
"const": "audio",
"default": "audio",
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": null
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
},
"num_mel_bins": {
"default": 80,
"description": "Number of mel spectrogram bins.",
"title": "Num Mel Bins",
"type": "integer"
},
"max_source_positions": {
"default": 1500,
"description": "Maximum number of source positions (time frames after conv stem).",
"title": "Max Source Positions",
"type": "integer"
},
"sample_rate": {
"default": 16000,
"description": "Audio sample rate in Hz.",
"title": "Sample Rate",
"type": "integer"
},
"chunk_length": {
"default": 30,
"description": "Audio chunk length in seconds.",
"title": "Chunk Length",
"type": "integer"
},
"n_fft": {
"default": 400,
"description": "FFT window size for mel spectrogram.",
"title": "N Fft",
"type": "integer"
},
"hop_length": {
"default": 160,
"description": "Hop length for mel spectrogram.",
"title": "Hop Length",
"type": "integer"
},
"timestamp_resolution": {
"default": 0.02,
"description": "Time resolution per timestamp token in seconds.",
"title": "Timestamp Resolution",
"type": "number"
}
},
"title": "AudioEncoderConfig",
"type": "object"
},
"BARTNoiseConfig": {
"additionalProperties": false,
"properties": {
"permute_sent_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Permute this proportion of sentences (boundaries defined by ['.', '?', '!']) in all inputs.",
"title": "Permute Sent Ratio"
},
"rotate_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Rotate this proportion of inputs.",
"title": "Rotate Ratio"
},
"insert_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Insert this percentage of additional random tokens.",
"title": "Insert Ratio"
},
"random_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Instead of using <mask>, use random token this often.",
"title": "Random Ratio"
},
"mask_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Fraction of words/subwords that will be masked.",
"title": "Mask Ratio"
},
"mask_length": {
"anyOf": [
{
"enum": [
"subword",
"word",
"span-poisson"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "subword",
"description": "Length of masking window to apply.",
"title": "Mask Length"
},
"poisson_lambda": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 3.0,
"description": "Lambda for Poisson distribution to sample span length if `-mask_length` set to span-poisson.",
"title": "Poisson Lambda"
},
"replace_length": {
"anyOf": [
{
"maximum": 1,
"minimum": -1,
"type": "integer"
},
{
"type": "null"
}
],
"default": -1,
"description": "When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)",
"title": "Replace Length"
}
},
"title": "BARTNoiseConfig",
"type": "object"
},
"BaseTokenizerConfig": {
"additionalProperties": false,
"properties": {
"src_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for src (or shared).",
"title": "Src Subword Model"
},
"tgt_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for tgt.",
"title": "Tgt Subword Model"
},
"src_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)",
"title": "Src Subword Nbest"
},
"tgt_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)",
"title": "Tgt Subword Nbest"
},
"src_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)",
"title": "Src Subword Alpha"
},
"tgt_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)",
"title": "Tgt Subword Alpha"
},
"src_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for src subword. Format: <word>\\t<count> per line.",
"title": "Src Subword Vocab"
},
"tgt_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for tgt subword. Format: <word>\\t<count> per line.",
"title": "Tgt Subword Vocab"
},
"src_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.",
"title": "Src Vocab Threshold"
},
"tgt_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.",
"title": "Tgt Vocab Threshold"
}
},
"title": "BaseTokenizerConfig",
"type": "object"
},
"CleanConfig": {
"additionalProperties": false,
"properties": {
"src_eq_tgt": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex src==tgt",
"title": "Src Eq Tgt"
},
"same_char": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex with same char more than 4 times",
"title": "Same Char"
},
"same_word": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex with same word more than 3 times",
"title": "Same Word"
},
"scripts_ok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"Latin",
"Common"
],
"description": "list of unicodata scripts accepted",
"title": "Scripts Ok"
},
"scripts_nok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"description": "list of unicodata scripts not accepted",
"title": "Scripts Nok"
},
"src_tgt_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 2.0,
"description": "ratio between src and tgt",
"title": "Src Tgt Ratio"
},
"avg_tok_min": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 3.0,
"description": "average length of tokens min",
"title": "Avg Tok Min"
},
"avg_tok_max": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 20.0,
"description": "average length of tokens max",
"title": "Avg Tok Max"
},
"langid": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"description": "list of languages accepted",
"title": "Langid"
}
},
"title": "CleanConfig",
"type": "object"
},
"CnnDecoderConfig": {
"additionalProperties": false,
"properties": {
"decoder_type": {
"const": "cnn",
"default": "cnn",
"title": "Decoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the decoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of decoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"tgt_word_vec_size": {
"default": 512,
"description": "Word embedding size for tgt.",
"title": "Tgt Word Vec Size",
"type": "integer"
},
"coverage_attn": {
"default": false,
"description": "Train a coverage attention layer.",
"title": "Coverage Attn",
"type": "boolean"
},
"with_cross_attn": {
"default": false,
"description": "Decoder uses cross-attention with encoder outputs.",
"title": "With Cross Attn",
"type": "boolean"
},
"lambda_coverage": {
"default": 0.0,
"description": "Lambda value for coverage loss of See et al (2017)",
"title": "Lambda Coverage",
"type": "number"
},
"global_attention": {
"default": "general",
"description": "The attention type to use. (Luong=general, Bahdanau=MLP)",
"enum": [
"dot",
"general",
"mlp",
null
],
"title": "Global Attention"
},
"global_attention_function": {
"default": "softmax",
"description": "Global attention function to use.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Global Attention Function",
"type": "string"
},
"cnn_kernel_width": {
"default": 3,
"description": "Size of windows in the cnn, the kernel_size is (cnn_kernel_width, 1) in convolution layers.",
"title": "Cnn Kernel Width",
"type": "integer"
}
},
"title": "CnnDecoderConfig",
"type": "object"
},
"CnnEncoderConfig": {
"additionalProperties": false,
"properties": {
"encoder_type": {
"const": "cnn",
"default": "cnn",
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"cnn_kernel_width": {
"default": 3,
"description": "Size of windows in the cnn, the kernel_size is (cnn_kernel_width, 1) in convolution layers.",
"title": "Cnn Kernel Width",
"type": "integer"
}
},
"title": "CnnEncoderConfig",
"type": "object"
},
"CnnModelConfig": {
"additionalProperties": false,
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": -1,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "cnn",
"default": "cnn",
"title": "Architecture",
"type": "string"
},
"cnn_kernel_width": {
"default": 3,
"description": "Size of windows in the cnn, the kernel_size is (cnn_kernel_width, 1) in convolution layers.",
"title": "Cnn Kernel Width",
"type": "integer"
}
},
"title": "CnnModelConfig",
"type": "object"
},
"CustomModelConfig": {
"additionalProperties": false,
"description": "Wrap anything that does not fit a set common architecture.",
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "custom",
"default": "custom",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
}
},
"title": "CustomModelConfig",
"type": "object"
},
"DocifyConfig": {
"additionalProperties": false,
"properties": {
"doc_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 200,
"description": "Number of tokens per doc.",
"title": "Doc Length"
},
"max_context": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Max context segments.",
"title": "Max Context"
}
},
"title": "DocifyConfig",
"type": "object"
},
"EmbeddingsConfig": {
"additionalProperties": false,
"properties": {
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"tgt_word_vec_size": {
"default": 512,
"description": "Word embedding size for tgt.",
"title": "Tgt Word Vec Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"freeze_word_vecs_enc": {
"default": false,
"description": "Freeze word embeddings on the encoder side.",
"title": "Freeze Word Vecs Enc",
"type": "boolean"
},
"freeze_word_vecs_dec": {
"default": false,
"description": "Freeze word embeddings on the encoder side.",
"title": "Freeze Word Vecs Dec",
"type": "boolean"
},
"position_encoding": {
"default": false,
"description": "Absolute position encoding, see position_encoding_type. Necessary for non-RNN style models.",
"title": "Position Encoding",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"position_shift": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Positions IDS shift before making position embed dirty patch to cover for xlm-roberta-xl",
"title": "Position Shift"
},
"normalize": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Enable embeddings scaling. Not always necessary, but useful for some model compatibility, e.g. gemma. https://datascience.stackexchange.com/a/87909",
"title": "Normalize"
}
},
"title": "EmbeddingsConfig",
"type": "object"
},
"FilterTooLongConfig": {
"additionalProperties": false,
"properties": {
"src_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 192,
"description": "Maximum source sequence length.",
"title": "Src Seq Length"
},
"tgt_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 192,
"description": "Maximum target sequence length.",
"title": "Tgt Seq Length"
}
},
"title": "FilterTooLongConfig",
"type": "object"
},
"FilterTooShortConfig": {
"additionalProperties": false,
"properties": {
"src_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 48,
"description": "Minimum source sequence length.",
"title": "Src Seq Length"
},
"tgt_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 48,
"description": "Minimum target sequence length.",
"title": "Tgt Seq Length"
}
},
"title": "FilterTooShortConfig",
"type": "object"
},
"HuggingfaceTokenizerConfig": {
"additionalProperties": false,
"properties": {
"path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Huggingface Model"
},
"max_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"title": "Max Length"
}
},
"title": "HuggingfaceTokenizerConfig",
"type": "object"
},
"InlineTagsConfig": {
"additionalProperties": false,
"properties": {
"tags_dictionary_path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to a flat term dictionary.",
"title": "Tags Dictionary Path"
},
"tags_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.1,
"description": "Ratio of corpus to augment with tags.",
"title": "Tags Corpus Ratio"
},
"max_tags": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 12,
"description": "Maximum number of tags that can be added to a single sentence.",
"title": "Max Tags"
},
"paired_stag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_beg\uff60",
"description": "The format of an opening paired inline tag. Must include the character #.",
"title": "Paired Stag"
},
"paired_etag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_end\uff60",
"description": "The format of a closing paired inline tag. Must include the character #.",
"title": "Paired Etag"
},
"isolated_tag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_std\uff60",
"description": "The format of an isolated inline tag. Must include the character #.",
"title": "Isolated Tag"
},
"src_delimiter": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ffuzzy\uff60",
"description": "Any special token used for augmented src sentences. The default is the fuzzy token used in the FuzzyMatch transform.",
"title": "Src Delimiter"
}
},
"title": "InlineTagsConfig",
"type": "object"
},
"InsertMaskBeforePlaceholderConfig": {
"additionalProperties": false,
"properties": {
"response_patterns": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"Response : \uff5fnewline\uff60"
],
"description": "Response pattern to locate the end of the prompt.",
"title": "Response Patterns"
}
},
"title": "InsertMaskBeforePlaceholderConfig",
"type": "object"
},
"MeanEncoderConfig": {
"additionalProperties": false,
"properties": {
"encoder_type": {
"const": "mean",
"default": "mean",
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
}
},
"title": "MeanEncoderConfig",
"type": "object"
},
"NestedAllTransformsConfig": {
"additionalProperties": false,
"properties": {
"clean": {
"$ref": "#/$defs/CleanConfig",
"default": {
"src_eq_tgt": false,
"same_char": false,
"same_word": false,
"scripts_ok": [
"Latin",
"Common"
],
"scripts_nok": [],
"src_tgt_ratio": 2.0,
"avg_tok_min": 3.0,
"avg_tok_max": 20.0,
"langid": []
}
},
"terminology": {
"$ref": "#/$defs/TerminologyConfig",
"default": {
"termbase_path": null,
"src_spacy_language_model": null,
"tgt_spacy_language_model": null,
"term_corpus_ratio": 0.3,
"term_example_ratio": 0.2,
"src_term_stoken": "\uff5fsrc_term_start\uff60",
"tgt_term_stoken": "\uff5ftgt_term_start\uff60",
"tgt_term_etoken": "\uff5ftgt_term_end\uff60",
"term_source_delimiter": "\uff5ffuzzy\uff60"
}
},
"docify": {
"$ref": "#/$defs/DocifyConfig",
"default": {
"doc_length": 200,
"max_context": 1
}
},
"filtertooshort": {
"$ref": "#/$defs/FilterTooShortConfig",
"default": {
"src_seq_length": 48,
"tgt_seq_length": 48
}
},
"filtertoolong": {
"$ref": "#/$defs/FilterTooLongConfig",
"default": {
"src_seq_length": 192,
"tgt_seq_length": 192
}
},
"prefix": {
"$ref": "#/$defs/PrefixConfig",
"default": {
"src_prefix": "",
"tgt_prefix": ""
}
},
"suffix": {
"$ref": "#/$defs/SuffixConfig",
"default": {
"src_suffix": "",
"tgt_suffix": ""
}
},
"insert_mask_before_placeholder": {
"$ref": "#/$defs/InsertMaskBeforePlaceholderConfig",
"default": {
"response_patterns": [
"Response : \uff5fnewline\uff60"
]
}
},
"uppercase": {
"$ref": "#/$defs/UpperCaseConfig",
"default": {
"upper_corpus_ratio": 0.01
}
},
"switchout": {
"$ref": "#/$defs/SwitchOutConfig",
"default": {
"switchout_temperature": 1.0
}
},
"tokendrop": {
"$ref": "#/$defs/TokenDropConfig",
"default": {
"tokendrop_temperature": 1.0
}
},
"tokenmask": {
"$ref": "#/$defs/TokenMaskConfig",
"default": {
"tokenmask_temperature": 1.0
}
},
"bart": {
"$ref": "#/$defs/BARTNoiseConfig",
"default": {
"permute_sent_ratio": 0.0,
"rotate_ratio": 0.0,
"insert_ratio": 0.0,
"random_ratio": 0.0,
"mask_ratio": 0.0,
"mask_length": "subword",
"poisson_lambda": 3.0,
"replace_length": -1
}
},
"sentencepiece": {
"$ref": "#/$defs/BaseTokenizerConfig",
"default": {
"src_subword_model": null,
"tgt_subword_model": null,
"src_subword_nbest": 1,
"tgt_subword_nbest": 1,
"src_subword_alpha": 0.0,
"tgt_subword_alpha": 0.0,
"src_subword_vocab": "",
"tgt_subword_vocab": "",
"src_vocab_threshold": 0,
"tgt_vocab_threshold": 0
}
},
"bpe": {
"$ref": "#/$defs/BaseTokenizerConfig",
"default": {
"src_subword_model": null,
"tgt_subword_model": null,
"src_subword_nbest": 1,
"tgt_subword_nbest": 1,
"src_subword_alpha": 0.0,
"tgt_subword_alpha": 0.0,
"src_subword_vocab": "",
"tgt_subword_vocab": "",
"src_vocab_threshold": 0,
"tgt_vocab_threshold": 0
}
},
"onmt_tokenize": {
"$ref": "#/$defs/ONMTTokenizerConfig",
"default": {
"src_subword_model": null,
"tgt_subword_model": null,
"src_subword_nbest": 1,
"tgt_subword_nbest": 1,
"src_subword_alpha": 0.0,
"tgt_subword_alpha": 0.0,
"src_subword_vocab": "",
"tgt_subword_vocab": "",
"src_vocab_threshold": 0,
"tgt_vocab_threshold": 0,
"src_subword_type": "none",
"tgt_subword_type": "none",
"src_onmttok_kwargs": {
"mode": "none"
},
"tgt_onmttok_kwargs": {
"mode": "none"
},
"gpt2_pretok": false,
"mapped_tokens": null
}
},
"inlinetags": {
"$ref": "#/$defs/InlineTagsConfig",
"default": {
"tags_dictionary_path": null,
"tags_corpus_ratio": 0.1,
"max_tags": 12,
"paired_stag": "\uff5fph_#_beg\uff60",
"paired_etag": "\uff5fph_#_end\uff60",
"isolated_tag": "\uff5fph_#_std\uff60",
"src_delimiter": "\uff5ffuzzy\uff60"
}
},
"huggingface_tokenize": {
"$ref": "#/$defs/HuggingfaceTokenizerConfig",
"default": {
"path": null,
"huggingface_model": null,
"max_length": null
}
},
"normalize": {
"$ref": "#/$defs/NormalizeConfig",
"default": {
"src_lang": "",
"tgt_lang": "",
"penn": true,
"norm_quote_commas": true,
"norm_numbers": true,
"pre_replace_unicode_punct": false,
"post_remove_control_chars": false
}
}
},
"title": "NestedAllTransformsConfig",
"type": "object"
},
"NormalizeConfig": {
"additionalProperties": false,
"properties": {
"src_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Source language code",
"title": "Src Lang"
},
"tgt_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Target language code",
"title": "Tgt Lang"
},
"penn": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Penn substitution",
"title": "Penn"
},
"norm_quote_commas": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Normalize quotations and commas",
"title": "Norm Quote Commas"
},
"norm_numbers": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Normalize numbers",
"title": "Norm Numbers"
},
"pre_replace_unicode_punct": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Replace unicode punct",
"title": "Pre Replace Unicode Punct"
},
"post_remove_control_chars": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove control chars",
"title": "Post Remove Control Chars"
}
},
"title": "NormalizeConfig",
"type": "object"
},
"ONMTTokenizerConfig": {
"additionalProperties": false,
"properties": {
"src_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for src (or shared).",
"title": "Src Subword Model"
},
"tgt_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for tgt.",
"title": "Tgt Subword Model"
},
"src_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)",
"title": "Src Subword Nbest"
},
"tgt_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)",
"title": "Tgt Subword Nbest"
},
"src_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)",
"title": "Src Subword Alpha"
},
"tgt_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)",
"title": "Tgt Subword Alpha"
},
"src_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for src subword. Format: <word>\\t<count> per line.",
"title": "Src Subword Vocab"
},
"tgt_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for tgt subword. Format: <word>\\t<count> per line.",
"title": "Tgt Subword Vocab"
},
"src_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.",
"title": "Src Vocab Threshold"
},
"tgt_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.",
"title": "Tgt Vocab Threshold"
},
"src_subword_type": {
"anyOf": [
{
"enum": [
"none",
"sentencepiece",
"bpe"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "none",
"description": "Type of subword model for src (or shared) in pyonmttok.",
"title": "Src Subword Type"
},
"tgt_subword_type": {
"anyOf": [
{
"enum": [
"none",
"sentencepiece",
"bpe"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "none",
"description": "Type of subword model for tgt in pyonmttok.",
"title": "Tgt Subword Type"
},
"src_onmttok_kwargs": {
"anyOf": [
{
"additionalProperties": true,
"type": "object"
},
{
"type": "null"
}
],
"default": {
"mode": "none"
},
"description": "Other pyonmttok options for src in dict string, except subword related options listed earlier.",
"title": "Src Onmttok Kwargs"
},
"tgt_onmttok_kwargs": {
"anyOf": [
{
"additionalProperties": true,
"type": "object"
},
{
"type": "null"
}
],
"default": {
"mode": "none"
},
"description": "Other pyonmttok options for tgt in dict string, except subword related options listed earlier.",
"title": "Tgt Onmttok Kwargs"
},
"gpt2_pretok": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Preprocess sentence with byte-level mapping.",
"title": "Gpt2 Pretok"
},
"mapped_tokens": {
"anyOf": [
{
"items": {
"maxItems": 2,
"minItems": 2,
"prefixItems": [
{
"type": "string"
},
{
"type": "string"
}
],
"type": "array"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Mapped tokens for placeholders preservation",
"title": "Mapped Tokens"
}
},
"title": "ONMTTokenizerConfig",
"type": "object"
},
"PositionEncodingType": {
"enum": [
"SinusoidalInterleaved",
"SinusoidalConcat",
"Learned",
"Relative",
"Rotary",
"Alibi"
],
"title": "PositionEncodingType",
"type": "string"
},
"PrefixConfig": {
"additionalProperties": false,
"properties": {
"src_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to prepend to all source examples.",
"title": "Src Prefix"
},
"tgt_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to prepend to all target examples.",
"title": "Tgt Prefix"
}
},
"title": "PrefixConfig",
"type": "object"
},
"RnnDecoderConfig": {
"additionalProperties": false,
"properties": {
"decoder_type": {
"const": "rnn",
"default": "rnn",
"title": "Decoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the decoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of decoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"tgt_word_vec_size": {
"default": 512,
"description": "Word embedding size for tgt.",
"title": "Tgt Word Vec Size",
"type": "integer"
},
"coverage_attn": {
"default": false,
"description": "Train a coverage attention layer.",
"title": "Coverage Attn",
"type": "boolean"
},
"with_cross_attn": {
"default": false,
"description": "Decoder uses cross-attention with encoder outputs.",
"title": "With Cross Attn",
"type": "boolean"
},
"lambda_coverage": {
"default": 0.0,
"description": "Lambda value for coverage loss of See et al (2017)",
"title": "Lambda Coverage",
"type": "number"
},
"global_attention": {
"default": "general",
"description": "The attention type to use. (Luong=general, Bahdanau=MLP)",
"enum": [
"dot",
"general",
"mlp",
null
],
"title": "Global Attention"
},
"global_attention_function": {
"default": "softmax",
"description": "Global attention function to use.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Global Attention Function",
"type": "string"
},
"bridge": {
"default": false,
"description": "Have an additional layer between the last encoder state and the first decoder state (RNN specific).",
"title": "Bridge",
"type": "boolean"
},
"rnn_type": {
"default": "LSTM",
"description": "The gate type to use in the RNNs.",
"enum": [
"LSTM",
"GRU"
],
"title": "Rnn Type",
"type": "string"
},
"context_gate": {
"default": null,
"description": "Type of context gate to use.",
"enum": [
"source",
"target",
"both",
null
],
"title": "Context Gate"
},
"bidirectional_encoder": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"title": "Bidirectional Encoder"
}
},
"title": "RnnDecoderConfig",
"type": "object"
},
"RnnEncoderConfig": {
"additionalProperties": false,
"properties": {
"encoder_type": {
"default": "rnn",
"enum": [
"rnn",
"brnn"
],
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"bridge": {
"default": false,
"description": "Have an additional layer between the last encoder state and the first decoder state (RNN specific).",
"title": "Bridge",
"type": "boolean"
},
"rnn_type": {
"default": "LSTM",
"description": "The gate type to use in the RNNs.",
"enum": [
"LSTM",
"GRU"
],
"title": "Rnn Type",
"type": "string"
}
},
"title": "RnnEncoderConfig",
"type": "object"
},
"RnnModelConfig": {
"additionalProperties": false,
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": -1,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "rnn",
"default": "rnn",
"title": "Architecture",
"type": "string"
},
"bridge": {
"default": false,
"description": "Have an additional layer between the last encoder state and the first decoder state (RNN specific).",
"title": "Bridge",
"type": "boolean"
},
"rnn_type": {
"default": "LSTM",
"description": "The gate type to use in the RNNs.",
"enum": [
"LSTM",
"GRU"
],
"title": "Rnn Type",
"type": "string"
}
},
"title": "RnnModelConfig",
"type": "object"
},
"RotaryPositionConfig": {
"additionalProperties": false,
"description": "Configuration for rotary position embeddings used in transformer models.",
"properties": {
"rotary_interleave": {
"default": true,
"description": "Interleave the head dimensions when rotary embeddings are applied. Otherwise the head dimensions are sliced in half. (True = original Meta Llama; False = all HuggingFace models)",
"title": "Rotary Interleave",
"type": "boolean"
},
"rotary_theta": {
"default": 10000,
"description": "Rotary theta base length, 1e4 for Llama2/Mistral, 1e6 for Mixtral.",
"title": "Rotary Theta",
"type": "integer"
},
"rotary_dim": {
"default": 0,
"description": "Rotary dim when model requires it to be different to head dim.",
"title": "Rotary Dim",
"type": "integer"
},
"scaling_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Specifies the type of RoPE scaling to be applied, if any.",
"title": "Scaling Type"
},
"alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "alpha factor by which to scale rope theta.",
"title": "Alpha"
},
"xdrope_section": {
"anyOf": [
{
"items": {
"type": "integer"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Sections for XDRope mappings",
"title": "Xdrope Section"
},
"scaling_factor": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 8.0,
"description": "Factor by which to scale RoPE embeddings.",
"title": "Scaling Factor"
},
"low_freq_factor": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Scaling factor applied to the lower frequency components of RoPE.",
"title": "Low Freq Factor"
},
"high_freq_factor": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 4.0,
"description": "Scaling factor applied to the higher frequency components of RoPE.",
"title": "High Freq Factor"
},
"original_max_position_embeddings": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 8192,
"description": "Original maximum position embeddings for RoPE scaling.",
"title": "Original Max Position Embeddings"
},
"rotary_theta_local": {
"default": 10000,
"description": "Rotary theta base length for local rotary layers",
"title": "Rotary Theta Local",
"type": "integer"
},
"interleave_local": {
"default": 0,
"description": "Local rotary layers every 1/N layers.",
"title": "Interleave Local",
"type": "integer"
},
"tmax_index": {
"default": 0,
"description": "tmax indexing; 0 for all cases except Gemma 3, which uses 1.",
"title": "Tmax Index",
"type": "integer"
}
},
"title": "RotaryPositionConfig",
"type": "object"
},
"SuffixConfig": {
"additionalProperties": false,
"properties": {
"src_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to append to all source examples.",
"title": "Src Suffix"
},
"tgt_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to append to all target examples.",
"title": "Tgt Suffix"
}
},
"title": "SuffixConfig",
"type": "object"
},
"SwitchOutConfig": {
"additionalProperties": false,
"properties": {
"switchout_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for SwitchOut. :math:`\\tau^{-1}` in :cite:`DBLP:journals/corr/abs-1808-07512`. Smaller value makes data more diverse.",
"title": "Switchout Temperature"
}
},
"title": "SwitchOutConfig",
"type": "object"
},
"TerminologyConfig": {
"additionalProperties": false,
"properties": {
"termbase_path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to a dictionary file with terms.",
"title": "Termbase Path"
},
"src_spacy_language_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the spaCy language model for the source corpus.",
"title": "Src Spacy Language Model"
},
"tgt_spacy_language_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the spaCy language model for the target corpus.",
"title": "Tgt Spacy Language Model"
},
"term_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.3,
"description": "Ratio of corpus to augment with terms.",
"title": "Term Corpus Ratio"
},
"term_example_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.2,
"description": "Maximum terms allowed in an example.",
"title": "Term Example Ratio"
},
"src_term_stoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fsrc_term_start\uff60",
"description": "The source term start token.",
"title": "Src Term Stoken"
},
"tgt_term_stoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ftgt_term_start\uff60",
"description": "The target term start token.",
"title": "Tgt Term Stoken"
},
"tgt_term_etoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ftgt_term_end\uff60",
"description": "The target term end token.",
"title": "Tgt Term Etoken"
},
"term_source_delimiter": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ffuzzy\uff60",
"description": "Any special token used for augmented source sentences. The default is the fuzzy token used in the FuzzyMatch transform.",
"title": "Term Source Delimiter"
}
},
"title": "TerminologyConfig",
"type": "object"
},
"TokenDropConfig": {
"additionalProperties": false,
"properties": {
"tokendrop_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for token deletion.",
"title": "Tokendrop Temperature"
}
},
"title": "TokenDropConfig",
"type": "object"
},
"TokenMaskConfig": {
"additionalProperties": false,
"properties": {
"tokenmask_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for token masking.",
"title": "Tokenmask Temperature"
}
},
"title": "TokenMaskConfig",
"type": "object"
},
"TransformerDecoderConfig": {
"additionalProperties": false,
"properties": {
"decoder_type": {
"const": "transformer",
"default": "transformer",
"title": "Decoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the decoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of decoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"tgt_word_vec_size": {
"default": 512,
"description": "Word embedding size for tgt.",
"title": "Tgt Word Vec Size",
"type": "integer"
},
"coverage_attn": {
"default": false,
"description": "Train a coverage attention layer.",
"title": "Coverage Attn",
"type": "boolean"
},
"with_cross_attn": {
"default": false,
"description": "Decoder uses cross-attention with encoder outputs.",
"title": "With Cross Attn",
"type": "boolean"
},
"lambda_coverage": {
"default": 0.0,
"description": "Lambda value for coverage loss of See et al (2017)",
"title": "Lambda Coverage",
"type": "number"
},
"global_attention": {
"default": "general",
"description": "The attention type to use. (Luong=general, Bahdanau=MLP)",
"enum": [
"dot",
"general",
"mlp",
null
],
"title": "Global Attention"
},
"global_attention_function": {
"default": "softmax",
"description": "Global attention function to use.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Global Attention Function",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV (e.g. Falcon 40B).",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when it needs to differ from hidden_size // heads.",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor; when None, defaults to 1/sqrt(head_dim).",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Softmax is usually applied before top-k; Mixtral applies it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalizes expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two cases. Case 1: absolute number of positions to learn position embeddings on (position_encoding_type: Learned). Case 2: maximum relative positions (position_encoding_type: Relative).",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
},
"aan_useffn": {
"default": false,
"description": "Turn on the FFN layer in the AAN decoder.",
"title": "Aan Useffn",
"type": "boolean"
},
"alignment_layer": {
"default": -2,
"description": "Layer number which has to be supervised.",
"title": "Alignment Layer",
"type": "integer"
},
"alignment_heads": {
"default": 0,
"description": "Number of cross attention heads per layer to supervise with.",
"title": "Alignment Heads",
"type": "integer"
},
"full_context_alignment": {
"default": false,
"description": "Whether alignment is conditioned on full target context.",
"title": "Full Context Alignment",
"type": "boolean"
},
"lambda_align": {
"default": 0.0,
"description": "Lambda value for alignement loss of Garg et al, 2019 (https://arxiv.org/abs/1909.02074)",
"title": "Lambda Align",
"type": "number"
},
"LM_type": {
"default": "causal",
"description": "TransformerDecoder LM type (causal = classic, or prefix LM https://arxiv.org/pdf/2308.06912)",
"enum": [
"causal",
"prefix"
],
"title": "Lm Type",
"type": "string"
},
"layer_types": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Per-layer types for hybrid architectures (e.g. Qwen3.5). Supported values: 'full_attention', 'linear_attention'. When None, all layers use full attention.",
"title": "Layer Types"
},
"linear_conv_kernel_dim": {
"default": 4,
"description": "Convolution kernel size for linear attention layers (Qwen3.5 GatedDeltaNet).",
"title": "Linear Conv Kernel Dim",
"type": "integer"
},
"linear_key_head_dim": {
"default": 128,
"description": "Key head dimension for linear attention layers (Qwen3.5 GatedDeltaNet).",
"title": "Linear Key Head Dim",
"type": "integer"
},
"linear_value_head_dim": {
"default": 128,
"description": "Value head dimension for linear attention layers (Qwen3.5 GatedDeltaNet).",
"title": "Linear Value Head Dim",
"type": "integer"
},
"linear_num_key_heads": {
"default": 16,
"description": "Number of key heads for linear attention layers (Qwen3.5 GatedDeltaNet).",
"title": "Linear Num Key Heads",
"type": "integer"
},
"linear_num_value_heads": {
"default": 32,
"description": "Number of value heads for linear attention layers (Qwen3.5 GatedDeltaNet).",
"title": "Linear Num Value Heads",
"type": "integer"
}
},
"title": "TransformerDecoderConfig",
"type": "object"
},
"TransformerEncoderConfig": {
"additionalProperties": false,
"properties": {
"encoder_type": {
"const": "transformer",
"default": "transformer",
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
}
},
"title": "TransformerEncoderConfig",
"type": "object"
},
"TransformerEncoderModelConfig": {
"additionalProperties": false,
"description": "Facilitate setting some transformer specific params at model level.",
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder",
"type": "null"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "transformer_encoder",
"default": "transformer_encoder",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
}
},
"title": "TransformerEncoderModelConfig",
"type": "object"
},
"TransformerLMModelConfig": {
"additionalProperties": false,
"description": "Facilitate setting some transformer specific params at model level.",
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder",
"type": "null"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Control the presence and size of patch merger (Mistral3)",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add estimator layer",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to use to feed the estimator",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "transformer_lm",
"default": "transformer_lm",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
}
},
"title": "TransformerLMModelConfig",
"type": "object"
},
"TransformerModelConfig": {
"additionalProperties": false,
"description": "Facilitate setting some transformer specific params at model level.",
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a share weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Need to use shared vocabulary for this option.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Controls the presence and size of the patch merger (Mistral3).",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add an estimator layer.",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to feed to the estimator.",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "transformer",
"default": "transformer",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV (e.g. Falcon 40B).",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension, when it needs to differ from hidden_size // heads.",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor; when None, defaults to 1/sqrt(head_dim).",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in the decoder layer. Note: used by the GPT-J / Falcon architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of leading layers using a dense FFN instead of MoE.",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Softmax is usually applied before top-k selection; Mixtral applies it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalizes expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two cases. Case 1: absolute number of positions to learn position embeddings on (position_encoding_type: Learned). Case 2: max relative positions, in the case of position_encoding_type: Relative.",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
}
},
"title": "TransformerModelConfig",
"type": "object"
},
"UpperCaseConfig": {
"additionalProperties": false,
"properties": {
"upper_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.01,
"description": "Corpus ratio to apply uppercasing.",
"title": "Upper Corpus Ratio"
}
},
"title": "UpperCaseConfig",
"type": "object"
},
"VisionEncoderConfig": {
"additionalProperties": false,
"description": "Based on mistral-community/pixtral-12b, might evolve later.",
"properties": {
"encoder_type": {
"const": "vision",
"default": "vision",
"title": "Encoder Type",
"type": "string"
},
"layers": {
"default": 2,
"description": "Number of layers in the encoder.",
"title": "Layers",
"type": "integer"
},
"hidden_size": {
"default": 512,
"description": "Size of encoder hidden states.",
"title": "Hidden Size",
"type": "integer"
},
"src_word_vec_size": {
"default": 512,
"description": "Word embedding size for src.",
"title": "Src Word Vec Size",
"type": "integer"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV (e.g. Falcon 40B).",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension, when it needs to differ from hidden_size // heads.",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor; when None, defaults to 1/sqrt(head_dim).",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in the decoder layer. Note: used by the GPT-J / Falcon architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of leading layers using a dense FFN instead of MoE.",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Softmax is usually applied before top-k selection; Mixtral applies it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalizes expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two cases. Case 1: absolute number of positions to learn position embeddings on (position_encoding_type: Learned). Case 2: max relative positions, in the case of position_encoding_type: Relative.",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
},
"num_channels": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 3,
"title": "Num Channels"
},
"image_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1024,
"title": "Image Size"
},
"patch_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 16,
"title": "Patch Size"
},
"image_token_id": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 10,
"title": "Image Token Id"
},
"image_token_id_list": {
"anyOf": [
{
"items": {
"type": "integer"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Includes additional image_token ids.",
"title": "Image Token Id List"
},
"mm_tokens_per_image": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 256,
"title": "Mm Tokens Per Image"
},
"layernorm_pre": {
"default": true,
"title": "Layernorm Pre",
"type": "boolean"
},
"layernorm_post": {
"default": false,
"title": "Layernorm Post",
"type": "boolean"
},
"patch_conv_bias": {
"default": false,
"title": "Patch Conv Bias",
"type": "boolean"
},
"encoder_sam": {
"default": false,
"title": "Encoder Sam",
"type": "boolean"
},
"use_class_embedding": {
"default": false,
"title": "Use Class Embedding",
"type": "boolean"
},
"temporal_patch_size": {
"default": 1,
"description": "Temporal kernel size for Conv3D patch embedding. When >1, nn.Conv3d is used (e.g. Qwen3.5 VL uses 2).",
"title": "Temporal Patch Size",
"type": "integer"
},
"num_position_embeddings": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of the absolute position embedding table (Qwen3.5 VL uses 2304 = 48\u00d748). When set together with position_encoding_type=Rotary both absolute embeddings and 2D RoPE are applied.",
"title": "Num Position Embeddings"
}
},
"title": "VisionEncoderConfig",
"type": "object"
},
"VisionTransformerLMModelConfig": {
"additionalProperties": false,
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a shared weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Requires a shared vocabulary.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Controls the presence and size of the patch merger (Mistral3).",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add an estimator layer.",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to feed to the estimator.",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "vision_transformer_lm",
"default": "vision_transformer_lm",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV (e.g. Falcon 40B).",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension, when it needs to differ from hidden_size // heads.",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor; when None, defaults to 1/sqrt(head_dim).",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in the decoder layer. Note: used by the GPT-J / Falcon architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of leading layers using a dense FFN instead of MoE.",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Softmax is usually applied before top-k selection; Mixtral applies it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalizes expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two cases. Case 1: absolute number of positions to learn position embeddings on (position_encoding_type: Learned). Case 2: max relative positions, in the case of position_encoding_type: Relative.",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
},
"adapter": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "llava",
"description": "Adapter type to use in the model.",
"title": "Adapter"
}
},
"title": "VisionTransformerLMModelConfig",
"type": "object"
},
"WhisperModelConfig": {
"additionalProperties": false,
"description": "Configuration for Whisper speech-to-text models.",
"properties": {
"embeddings": {
"$ref": "#/$defs/EmbeddingsConfig",
"description": "Contains most of the args useful to build the Embeddings module."
},
"encoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"audio": "#/$defs/AudioEncoderConfig",
"brnn": "#/$defs/RnnEncoderConfig",
"cnn": "#/$defs/CnnEncoderConfig",
"mean": "#/$defs/MeanEncoderConfig",
"rnn": "#/$defs/RnnEncoderConfig",
"transformer": "#/$defs/TransformerEncoderConfig",
"vision": "#/$defs/VisionEncoderConfig"
},
"propertyName": "encoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerEncoderConfig"
},
{
"$ref": "#/$defs/RnnEncoderConfig"
},
{
"$ref": "#/$defs/CnnEncoderConfig"
},
{
"$ref": "#/$defs/MeanEncoderConfig"
},
{
"$ref": "#/$defs/VisionEncoderConfig"
},
{
"$ref": "#/$defs/AudioEncoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of an encoder.",
"title": "Encoder"
},
"decoder": {
"anyOf": [
{
"discriminator": {
"mapping": {
"cnn": "#/$defs/CnnDecoderConfig",
"rnn": "#/$defs/RnnDecoderConfig",
"transformer": "#/$defs/TransformerDecoderConfig"
},
"propertyName": "decoder_type"
},
"oneOf": [
{
"$ref": "#/$defs/TransformerDecoderConfig"
},
{
"$ref": "#/$defs/RnnDecoderConfig"
},
{
"$ref": "#/$defs/CnnDecoderConfig"
}
]
},
{
"type": "null"
}
],
"default": null,
"description": "Major parameters of a decoder.",
"title": "Decoder"
},
"hidden_size": {
"default": -1,
"description": "Size of hidden states. Overwrites [encoder/decoder].hidden_size if set.",
"title": "Hidden Size",
"type": "integer"
},
"word_vec_size": {
"default": -1,
"description": "Word embedding size for src and tgt.",
"title": "Word Vec Size",
"type": "integer"
},
"layers": {
"default": -1,
"description": "Number of layers in both encoder and decoder (will overwrite enc_layers/dec_layers).",
"title": "Layers",
"type": "integer"
},
"transformer_ff": {
"default": 2048,
"description": "Size of hidden transformer feed-forward.",
"title": "Transformer Ff",
"type": "integer"
},
"moe_transformer_ff": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Size of hidden moe transformer feed-forward.",
"title": "Moe Transformer Ff"
},
"share_decoder_embeddings": {
"default": false,
"description": "Use a shared weight matrix for the input and output word embeddings in the decoder.",
"title": "Share Decoder Embeddings",
"type": "boolean"
},
"share_embeddings": {
"default": false,
"description": "Share the word embeddings between encoder and decoder. Requires a shared vocabulary.",
"title": "Share Embeddings",
"type": "boolean"
},
"input_feed": {
"default": 1,
"description": "Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.",
"title": "Input Feed",
"type": "integer"
},
"generator_function": {
"default": "softmax",
"description": "Which function to use for generating probabilities over the target vocabulary.",
"enum": [
"softmax",
"sparsemax"
],
"title": "Generator Function",
"type": "string"
},
"generator_bias": {
"default": true,
"description": "Control whether or not the generator Linear module has bias weights.",
"title": "Generator Bias",
"type": "boolean"
},
"adapter_bias": {
"default": false,
"description": "Control whether or not the adapter module has bias weights.",
"title": "Adapter Bias",
"type": "boolean"
},
"projector_activation_fn": {
"anyOf": [
{
"$ref": "#/$defs/ActivationFunction"
},
{
"type": "null"
}
],
"default": "relu",
"description": "The activation function to use in adapter projector layer."
},
"spatial_merge_size": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Controls the presence and size of the patch merger (Mistral3).",
"title": "Spatial Merge Size"
},
"add_estimator": {
"default": false,
"description": "Add an estimator layer.",
"title": "Add Estimator",
"type": "boolean"
},
"estimator_type": {
"default": "average",
"description": "Which hidden_states to feed to the estimator.",
"enum": [
"average",
"last_token",
"first_token"
],
"title": "Estimator Type",
"type": "string"
},
"left_pad": {
"default": false,
"description": "Enable left-padding, useful for some LLMs.",
"title": "Left Pad",
"type": "boolean"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Original huggingface model.",
"title": "Huggingface Model"
},
"eole_version": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "0.5.2",
"description": "Eole version used to convert/train/save the model.",
"title": "Eole Version"
},
"architecture": {
"const": "whisper",
"default": "whisper",
"title": "Architecture",
"type": "string"
},
"sliding_window": {
"default": 0,
"description": "Sliding window for transformer self-attention.",
"title": "Sliding Window",
"type": "integer"
},
"heads": {
"default": 8,
"description": "Number of heads for transformer self-attention.",
"title": "Heads",
"type": "integer"
},
"relative_positions_buckets": {
"default": 0,
"description": "Enable relative position bias (https://github.com/google-research/text-to-text-transfer-transformer).",
"title": "Relative Positions Buckets",
"type": "integer"
},
"mlp_activation_fn": {
"$ref": "#/$defs/ActivationFunction",
"default": "relu",
"description": "The activation function to use in MLP layer."
},
"layer_norm": {
"default": "standard",
"description": "Type of layer normalization in transformer architecture.",
"enum": [
"standard",
"standardFP32",
"rms",
"gemma-rms"
],
"title": "Layer Norm",
"type": "string"
},
"norm_eps": {
"default": 1e-05,
"description": "Layer norm epsilon.",
"title": "Norm Eps",
"type": "number"
},
"shared_layer_norm": {
"default": false,
"description": "Use a shared layer_norm in parallel residual attention. Note: must be True for Falcon 7B, False for Falcon 40B, same for GPT-J and GPT-NeoX models.",
"title": "Shared Layer Norm",
"type": "boolean"
},
"ffn_layernorm": {
"default": false,
"description": "Add pre/post_feedforward_layernorm around MLP forward. Note: introduced for gemma2 support.",
"title": "Ffn Layernorm",
"type": "boolean"
},
"add_qkvbias": {
"default": false,
"description": "Add bias to nn.Linear of Query/Key/Value in MHA. Note: this will add bias to output projection layer too by default. Can be disabled with `add_final_linear_bias`.",
"title": "Add Qkvbias",
"type": "boolean"
},
"add_key_bias": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": null,
"description": "Add bias to Key projection in MHA. Defaults to add_qkvbias when not set. Set to False for models like Whisper where K has no bias.",
"title": "Add Key Bias"
},
"query_norm": {
"default": false,
"title": "Query Norm",
"type": "boolean"
},
"key_norm": {
"default": false,
"title": "Key Norm",
"type": "boolean"
},
"qk_norm_post_rope": {
"default": false,
"title": "Qk Norm Post Rope",
"type": "boolean"
},
"add_final_linear_bias": {
"default": false,
"description": "Add bias to nn.Linear of final_linear in MHA.",
"title": "Add Final Linear Bias",
"type": "boolean"
},
"heads_kv": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number of heads for KV. heads_kv=heads if None, else number of heads for KV(e.g. Falcon 40B)",
"title": "Heads Kv"
},
"head_dim": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Head dimension when this needs to be different vs hidden_size // heads",
"title": "Head Dim"
},
"attn_scaling": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Attention scaling factor, when None uses 1/sqrt(head_dim) by default",
"title": "Attn Scaling"
},
"add_ffnbias": {
"default": false,
"description": "Add bias to nn.Linear of MLP FFN.",
"title": "Add Ffnbias",
"type": "boolean"
},
"parallel_residual": {
"default": false,
"description": "Use parallel residual in decoder layer. Note: this is used by GPT-J / Falcon Architecture.",
"title": "Parallel Residual",
"type": "boolean"
},
"num_experts": {
"default": 0,
"description": "Number of experts for MoE models.",
"title": "Num Experts",
"type": "integer"
},
"num_shared_experts": {
"default": 0,
"description": "Number of shared experts for MoE models (DeepSeekv2).",
"title": "Num Shared Experts",
"type": "integer"
},
"shared_expert_gate": {
"default": false,
"description": "Apply sigmoid-gated shared expert output (Qwen3.5 MoE style). When True, a linear gate is applied: output += sigmoid(gate(x)) * shared_expert(x).",
"title": "Shared Expert Gate",
"type": "boolean"
},
"first_k_dense_replace": {
"default": 0,
"description": "Number of layers using Dense instead of MoE",
"title": "First K Dense Replace",
"type": "integer"
},
"num_experts_per_tok": {
"default": 2,
"description": "Number of experts per token.",
"title": "Num Experts Per Tok",
"type": "integer"
},
"moe_softmax_after": {
"default": false,
"description": "Usually softmax is before topk, Mixtral does it after.",
"title": "Moe Softmax After",
"type": "boolean"
},
"moe_renormalize": {
"default": false,
"description": "Qwen renormalize expert weights after softmax.",
"title": "Moe Renormalize",
"type": "boolean"
},
"q_gating": {
"default": false,
"description": "Enable gated query in attention (Qwen3.5 style). Q projection has doubled output size; output is multiplied by sigmoid(gate).",
"title": "Q Gating",
"type": "boolean"
},
"position_encoding_type": {
"anyOf": [
{
"$ref": "#/$defs/PositionEncodingType"
},
{
"type": "null"
}
],
"default": "SinusoidalInterleaved",
"description": "Type of positional encoding."
},
"interpolate_mode": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Interpolation mode for position embeddings. If None: position_embeddings is a lookup table based on n_positions. If string (e.g., 'bilinear'): position_embedding is a learned grid using interpolation. (see Vision.py encoder)",
"title": "Interpolate Mode"
},
"n_positions": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Two casesCase 1: Absolute number of positions to learn position embeddings on (position_encoding_type: Learned)Case 2: Max Relative PositionsIn the case of position_encoding_type: Relative",
"title": "N Positions"
},
"rope_config": {
"anyOf": [
{
"$ref": "#/$defs/RotaryPositionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Rotary position config, if relevant."
},
"suppress_tokens": {
"anyOf": [
{
"items": {
"type": "integer"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of token IDs to suppress during audio decoding.",
"title": "Suppress Tokens"
},
"begin_suppress_tokens": {
"anyOf": [
{
"items": {
"type": "integer"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of token IDs to suppress at the first generated position.",
"title": "Begin Suppress Tokens"
},
"no_timestamps_token_id": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Token ID for the no-timestamps token in audio models.",
"title": "No Timestamps Token Id"
},
"word_timestamp_heads": {
"anyOf": [
{
"items": {
"items": {
"type": "integer"
},
"type": "array"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of [layer, head] pairs for word-level timestamp extraction (mapped from alignment_heads in HF generation_config).",
"title": "Word Timestamp Heads"
},
"median_filter_width": {
"default": 7,
"description": "Median filter width for word-level timestamp smoothing.",
"title": "Median Filter Width",
"type": "integer"
}
},
"title": "WhisperModelConfig",
"type": "object"
}
},
"additionalProperties": false,
"required": [
"model_path",
"src"
]
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- arbitrary_types_allowed: bool = True
- Fields:
- engine (str)
- model (Annotated[eole.config.models.TransformerModelConfig | eole.config.models.TransformerLMModelConfig | eole.config.models.VisionTransformerLMModelConfig | eole.config.models.WhisperModelConfig | eole.config.models.TransformerEncoderModelConfig | eole.config.models.RnnModelConfig | eole.config.models.CnnModelConfig | eole.config.models.CustomModelConfig, FieldInfo(annotation=NoneType, required=True, discriminator='architecture')] | None)
- model_path (str | List[str])
- output (str)
- share_vocab (bool | None)
- skip_empty_level (Literal['silent', 'warning', 'error'])
- src (str)
- src_subword_vocab (str | None)
- tgt (str | None)
- tgt_file_prefix (bool)
- transforms (List[str] | None)
- transforms_configs (eole.config.data.NestedAllTransformsConfig | None)
- Validators:
- _validate_model_path » model_path
- _validate_predict_config » all fields
field engine : str = 'eole'
Engine used to run inference: eole or ct2.
- Validated by:
_validate_predict_config, _validate_running_config
field model : Annotated[TransformerModelConfig | TransformerLMModelConfig | VisionTransformerLMModelConfig | WhisperModelConfig | TransformerEncoderModelConfig | RnnModelConfig | CnnModelConfig | CustomModelConfig, FieldInfo(annotation=NoneType, required=True, discriminator='architecture')] | None = None
- Validated by:
_validate_predict_config, _validate_running_config
field model_path : str | List[str] [Required]
Path to model .pt file(s). Multiple models can be specified for ensemble decoding.
- Validated by:
_validate_model_path, _validate_predict_config, _validate_running_config
field output : str = 'pred.txt'
Path to output the predictions (each line will be the decoded sequence).
- Validated by:
_validate_predict_config, _validate_running_config
field share_vocab : bool | None = False
- Validated by:
_validate_predict_config, _validate_running_config
field skip_empty_level : Literal['silent', 'warning', 'error'] = 'warning'
Logging level when encountering empty examples during inference. (silent: silently ignore/skip empty examples, warning: warn when ignoring/skipping empty examples, error: raise an error and stop execution when any empty example)
- Validated by:
_validate_predict_config, _validate_running_config
field src : str [Required]
Source file to decode (one line per sequence).
- Validated by:
_validate_predict_config_validate_running_config
field src_subword_vocab : str | None = None
- Validated by:
_validate_predict_config, _validate_running_config
field tgt : str | None = None
True target sequences, useful for scoring or prefix decoding.
- Validated by:
_validate_predict_config, _validate_running_config
field tgt_file_prefix : bool = False
Generate predictions using provided tgt as prefix.
- Validated by:
_validate_predict_config, _validate_running_config
field transforms : List[str] | None = []
- Validated by:
_validate_predict_config, _validate_running_config
field transforms_configs : NestedAllTransformsConfig | None [Optional]
- Validated by:
_validate_predict_config, _validate_running_config
model_post_init(context: Any) → None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
- self – The BaseModel instance.
- context – The context.
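The predict configuration above boils down to two required fields (model_path, src) plus defaults for everything else. The following is a minimal stdlib-only sketch of that surface using a dataclass; the real class is a pydantic model (with extra = forbid and cross-field validators), and the field names here simply mirror the schema, so treat this as an illustration rather than the actual eole API.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

# Sketch of the predict config surface described above.
# model_path and src are required (no defaults); the rest
# mirror the schema defaults shown in the JSON dump.
@dataclass
class PredictConfigSketch:
    model_path: Union[str, List[str]]   # path(s) to model file(s); list enables ensemble decoding
    src: str                            # source file to decode, one line per sequence
    engine: str = "eole"                # 'eole' or 'ct2'
    output: str = "pred.txt"            # where decoded sequences are written
    tgt: Optional[str] = None           # optional references for scoring / prefix decoding
    tgt_file_prefix: bool = False       # use tgt lines as decoding prefixes
    skip_empty_level: str = "warning"   # 'silent' | 'warning' | 'error'

cfg = PredictConfigSketch(model_path="model.pt", src="input.txt")
print(cfg.output)
```

In the real pydantic model, omitting model_path or src raises a validation error, and unknown keys are rejected because extra is set to forbid.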
pydantic model eole.config.run.BuildVocabConfig
Bases: DataConfig, MiscConfig, BaseVocabConfig
Show JSON schema
{
"title": "BuildVocabConfig",
"type": "object",
"properties": {
"src_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Path to src (or shared) vocabulary file. Format: one <word> or <word>\t<count> per line.",
"title": "Src Vocab"
},
"tgt_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to tgt vocabulary file. Format: one <word> or <word>\t<count> per line.",
"title": "Tgt Vocab"
},
"share_vocab": {
"default": false,
"description": "Share source and target vocabulary.",
"title": "Share Vocab",
"type": "boolean"
},
"decoder_start_token": {
"default": "<s>",
"description": "Default decoder start token. For most models it is <s> = BOS. Some fairseq models require </s>.",
"title": "Decoder Start Token",
"type": "string"
},
"bos_token": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "<s>",
"title": "Bos Token"
},
"eos_token": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "</s>",
"title": "Eos Token"
},
"unk_token": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "<unk>",
"title": "Unk Token"
},
"pad_token": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "<blank>",
"title": "Pad Token"
},
"both_embeddings": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to the embeddings file to use for both source and target tokens.",
"title": "Both Embeddings"
},
"src_embeddings": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to the embeddings file to use for source tokens.",
"title": "Src Embeddings"
},
"tgt_embeddings": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to the embeddings file to use for target tokens.",
"title": "Tgt Embeddings"
},
"embeddings_type": {
"anyOf": [
{
"enum": [
"GloVe",
"word2vec"
],
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Type of embeddings file.",
"title": "Embeddings Type"
},
"seed": {
"default": -1,
"description": "Set random seed used for better reproducibility between experiments.",
"title": "Seed",
"type": "integer"
},
"src_vocab_size": {
"default": 32768,
"description": "Maximum size of the source vocabulary.",
"title": "Src Vocab Size",
"type": "integer"
},
"tgt_vocab_size": {
"default": 32768,
"description": "Maximum size of the target vocabulary.",
"title": "Tgt Vocab Size",
"type": "integer"
},
"vocab_size_multiple": {
"default": 8,
"description": "Make the vocabulary size a multiple of this value. (Adds dummy tokens if needed.)",
"title": "Vocab Size Multiple",
"type": "integer"
},
"src_words_min_frequency": {
"default": 0,
"description": "Discard source words with lower frequency.",
"title": "Src Words Min Frequency",
"type": "integer"
},
"tgt_words_min_frequency": {
"default": 0,
"description": "Discard target words with lower frequency.",
"title": "Tgt Words Min Frequency",
"type": "integer"
},
"data": {
"anyOf": [
{
"additionalProperties": {
"$ref": "#/$defs/Dataset"
},
"type": "object"
},
{
"type": "null"
}
],
"description": "All datasets and their specifications. See examples/*.yaml for further details.",
"title": "Data"
},
"transforms": {
"default": [],
"description": "Default transform pipeline to apply to data. Can be specified in each corpus of data to override.",
"items": {
"type": "string"
},
"title": "Transforms",
"type": "array"
},
"transforms_configs": {
"anyOf": [
{
"$ref": "#/$defs/NestedAllTransformsConfig"
},
{
"type": "null"
}
]
},
"skip_empty_level": {
"default": "warning",
"description": "Logging level when encoutering empty examples. (silent: silently ignore/skip empty examples, warning: warn when ignoring/skipping empty examples, error: raise an error and stop execution when any empty example)",
"enum": [
"silent",
"warning",
"error"
],
"title": "Skip Empty Level",
"type": "string"
},
"n_sample": {
"default": 5000,
"description": "Number of transformed samples per corpus to use to build the vocabulary. Set to -1 to use the full corpora.",
"title": "N Sample",
"type": "integer"
},
"save_data": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Output base path for objects that will be saved (vocab, transforms, embeddings, ...)",
"title": "Save Data"
},
"overwrite": {
"default": false,
"description": "Overwrite existing objects if any.",
"title": "Overwrite",
"type": "boolean"
},
"dump_samples": {
"default": false,
"description": "Dump samples when building vocabulary. Warning: this may slow down the process.",
"title": "Dump Samples",
"type": "boolean"
},
"num_threads": {
"default": 1,
"description": "Number of parallel threads to build the vocabulary.",
"title": "Num Threads",
"type": "integer"
},
"learn_subwords": {
"default": false,
"description": "Learn subwords (based on defined transforms) prior to building vocabulary.",
"title": "Learn Subwords",
"type": "boolean"
},
"learn_subwords_size": {
"default": 32000,
"description": "Number of subwords operations to learn.",
"title": "Learn Subwords Size",
"type": "integer"
},
"vocab_sample_queue_size": {
"default": 20,
"description": "Size of queues used for dumping samples.",
"title": "Vocab Sample Queue Size",
"type": "integer"
}
},
"$defs": {
"BARTNoiseConfig": {
"additionalProperties": false,
"properties": {
"permute_sent_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Permute this proportion of sentences (boundaries defined by ['.', '?', '!']) in all inputs.",
"title": "Permute Sent Ratio"
},
"rotate_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Rotate this proportion of inputs.",
"title": "Rotate Ratio"
},
"insert_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Insert this percentage of additional random tokens.",
"title": "Insert Ratio"
},
"random_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Instead of using <mask>, use random token this often.",
"title": "Random Ratio"
},
"mask_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Fraction of words/subwords that will be masked.",
"title": "Mask Ratio"
},
"mask_length": {
"anyOf": [
{
"enum": [
"subword",
"word",
"span-poisson"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "subword",
"description": "Length of masking window to apply.",
"title": "Mask Length"
},
"poisson_lambda": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 3.0,
"description": "Lambda for Poisson distribution to sample span length if `-mask_length` set to span-poisson.",
"title": "Poisson Lambda"
},
"replace_length": {
"anyOf": [
{
"maximum": 1,
"minimum": -1,
"type": "integer"
},
{
"type": "null"
}
],
"default": -1,
"description": "When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)",
"title": "Replace Length"
}
},
"title": "BARTNoiseConfig",
"type": "object"
},
"BaseTokenizerConfig": {
"additionalProperties": false,
"properties": {
"src_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for src (or shared).",
"title": "Src Subword Model"
},
"tgt_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for tgt.",
"title": "Tgt Subword Model"
},
"src_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)",
"title": "Src Subword Nbest"
},
"tgt_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)",
"title": "Tgt Subword Nbest"
},
"src_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)",
"title": "Src Subword Alpha"
},
"tgt_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)",
"title": "Tgt Subword Alpha"
},
"src_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for src subword. Format: <word>\\t<count> per line.",
"title": "Src Subword Vocab"
},
"tgt_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for tgt subword. Format: <word>\\t<count> per line.",
"title": "Tgt Subword Vocab"
},
"src_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.",
"title": "Src Vocab Threshold"
},
"tgt_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.",
"title": "Tgt Vocab Threshold"
}
},
"title": "BaseTokenizerConfig",
"type": "object"
},
"CleanConfig": {
"additionalProperties": false,
"properties": {
"src_eq_tgt": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex src==tgt",
"title": "Src Eq Tgt"
},
"same_char": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex with same char more than 4 times",
"title": "Same Char"
},
"same_word": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex with same word more than 3 times",
"title": "Same Word"
},
"scripts_ok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"Latin",
"Common"
],
"description": "list of unicodata scripts accepted",
"title": "Scripts Ok"
},
"scripts_nok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"description": "list of unicodata scripts not accepted",
"title": "Scripts Nok"
},
"src_tgt_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 2.0,
"description": "ratio between src and tgt",
"title": "Src Tgt Ratio"
},
"avg_tok_min": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 3.0,
"description": "average length of tokens min",
"title": "Avg Tok Min"
},
"avg_tok_max": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 20.0,
"description": "average length of tokens max",
"title": "Avg Tok Max"
},
"langid": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"description": "list of languages accepted",
"title": "Langid"
}
},
"title": "CleanConfig",
"type": "object"
},
"Dataset": {
"additionalProperties": false,
"properties": {
"name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Name"
},
"weight": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"title": "Weight"
},
"transforms": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"title": "Transforms"
},
"path_src": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path Src"
},
"path_tgt": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path Tgt"
},
"path_sco": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path Sco"
},
"path_txt": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path Txt"
},
"path_align": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path Align"
},
"src_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Src Prefix"
},
"tgt_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Tgt Prefix"
},
"src_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Src Suffix"
},
"tgt_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Tgt Suffix"
},
"src_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Src Lang"
},
"tgt_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Tgt Lang"
},
"penn": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Penn"
},
"norm_quote_commas": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Norm Quote Commas"
},
"norm_numbers": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Norm Numbers"
},
"pre_replace_unicode_punct": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"title": "Pre Replace Unicode Punct"
},
"post_remove_control_chars": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"title": "Post Remove Control Chars"
},
"src_eq_tgt": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Src Eq Tgt"
},
"same_char": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Same Char"
},
"same_word": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"title": "Same Word"
},
"scripts_ok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"Latin",
"Common"
],
"title": "Scripts Ok"
},
"scripts_nok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"title": "Scripts Nok"
},
"src_tgt_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 2,
"title": "Src Tgt Ratio"
},
"avg_tok_min": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 3,
"title": "Avg Tok Min"
},
"avg_tok_max": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 20,
"title": "Avg Tok Max"
},
"lang_id": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"en",
"fr"
],
"title": "Lang Id"
}
},
"title": "Dataset",
"type": "object"
},
"DocifyConfig": {
"additionalProperties": false,
"properties": {
"doc_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 200,
"description": "Number of tokens per doc.",
"title": "Doc Length"
},
"max_context": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Max context segments.",
"title": "Max Context"
}
},
"title": "DocifyConfig",
"type": "object"
},
"FilterTooLongConfig": {
"additionalProperties": false,
"properties": {
"src_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 192,
"description": "Maximum source sequence length.",
"title": "Src Seq Length"
},
"tgt_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 192,
"description": "Maximum target sequence length.",
"title": "Tgt Seq Length"
}
},
"title": "FilterTooLongConfig",
"type": "object"
},
"FilterTooShortConfig": {
"additionalProperties": false,
"properties": {
"src_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 48,
"description": "Minimum source sequence length.",
"title": "Src Seq Length"
},
"tgt_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 48,
"description": "Minimum target sequence length.",
"title": "Tgt Seq Length"
}
},
"title": "FilterTooShortConfig",
"type": "object"
},
"HuggingfaceTokenizerConfig": {
"additionalProperties": false,
"properties": {
"path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Path"
},
"huggingface_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Huggingface Model"
},
"max_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"title": "Max Length"
}
},
"title": "HuggingfaceTokenizerConfig",
"type": "object"
},
"InlineTagsConfig": {
"additionalProperties": false,
"properties": {
"tags_dictionary_path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to a flat term dictionary.",
"title": "Tags Dictionary Path"
},
"tags_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.1,
"description": "Ratio of corpus to augment with tags.",
"title": "Tags Corpus Ratio"
},
"max_tags": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 12,
"description": "Maximum number of tags that can be added to a single sentence.",
"title": "Max Tags"
},
"paired_stag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_beg\uff60",
"description": "The format of an opening paired inline tag. Must include the character #.",
"title": "Paired Stag"
},
"paired_etag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_end\uff60",
"description": "The format of a closing paired inline tag. Must include the character #.",
"title": "Paired Etag"
},
"isolated_tag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_std\uff60",
"description": "The format of an isolated inline tag. Must include the character #.",
"title": "Isolated Tag"
},
"src_delimiter": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ffuzzy\uff60",
"description": "Any special token used for augmented src sentences. The default is the fuzzy token used in the FuzzyMatch transform.",
"title": "Src Delimiter"
}
},
"title": "InlineTagsConfig",
"type": "object"
},
"InsertMaskBeforePlaceholderConfig": {
"additionalProperties": false,
"properties": {
"response_patterns": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"Response : \uff5fnewline\uff60"
],
"description": "Response pattern to locate the end of the prompt.",
"title": "Response Patterns"
}
},
"title": "InsertMaskBeforePlaceholderConfig",
"type": "object"
},
"NestedAllTransformsConfig": {
"additionalProperties": false,
"properties": {
"clean": {
"$ref": "#/$defs/CleanConfig",
"default": {
"src_eq_tgt": false,
"same_char": false,
"same_word": false,
"scripts_ok": [
"Latin",
"Common"
],
"scripts_nok": [],
"src_tgt_ratio": 2.0,
"avg_tok_min": 3.0,
"avg_tok_max": 20.0,
"langid": []
}
},
"terminology": {
"$ref": "#/$defs/TerminologyConfig",
"default": {
"termbase_path": null,
"src_spacy_language_model": null,
"tgt_spacy_language_model": null,
"term_corpus_ratio": 0.3,
"term_example_ratio": 0.2,
"src_term_stoken": "\uff5fsrc_term_start\uff60",
"tgt_term_stoken": "\uff5ftgt_term_start\uff60",
"tgt_term_etoken": "\uff5ftgt_term_end\uff60",
"term_source_delimiter": "\uff5ffuzzy\uff60"
}
},
"docify": {
"$ref": "#/$defs/DocifyConfig",
"default": {
"doc_length": 200,
"max_context": 1
}
},
"filtertooshort": {
"$ref": "#/$defs/FilterTooShortConfig",
"default": {
"src_seq_length": 48,
"tgt_seq_length": 48
}
},
"filtertoolong": {
"$ref": "#/$defs/FilterTooLongConfig",
"default": {
"src_seq_length": 192,
"tgt_seq_length": 192
}
},
"prefix": {
"$ref": "#/$defs/PrefixConfig",
"default": {
"src_prefix": "",
"tgt_prefix": ""
}
},
"suffix": {
"$ref": "#/$defs/SuffixConfig",
"default": {
"src_suffix": "",
"tgt_suffix": ""
}
},
"insert_mask_before_placeholder": {
"$ref": "#/$defs/InsertMaskBeforePlaceholderConfig",
"default": {
"response_patterns": [
"Response : \uff5fnewline\uff60"
]
}
},
"uppercase": {
"$ref": "#/$defs/UpperCaseConfig",
"default": {
"upper_corpus_ratio": 0.01
}
},
"switchout": {
"$ref": "#/$defs/SwitchOutConfig",
"default": {
"switchout_temperature": 1.0
}
},
"tokendrop": {
"$ref": "#/$defs/TokenDropConfig",
"default": {
"tokendrop_temperature": 1.0
}
},
"tokenmask": {
"$ref": "#/$defs/TokenMaskConfig",
"default": {
"tokenmask_temperature": 1.0
}
},
"bart": {
"$ref": "#/$defs/BARTNoiseConfig",
"default": {
"permute_sent_ratio": 0.0,
"rotate_ratio": 0.0,
"insert_ratio": 0.0,
"random_ratio": 0.0,
"mask_ratio": 0.0,
"mask_length": "subword",
"poisson_lambda": 3.0,
"replace_length": -1
}
},
"sentencepiece": {
"$ref": "#/$defs/BaseTokenizerConfig",
"default": {
"src_subword_model": null,
"tgt_subword_model": null,
"src_subword_nbest": 1,
"tgt_subword_nbest": 1,
"src_subword_alpha": 0.0,
"tgt_subword_alpha": 0.0,
"src_subword_vocab": "",
"tgt_subword_vocab": "",
"src_vocab_threshold": 0,
"tgt_vocab_threshold": 0
}
},
"bpe": {
"$ref": "#/$defs/BaseTokenizerConfig",
"default": {
"src_subword_model": null,
"tgt_subword_model": null,
"src_subword_nbest": 1,
"tgt_subword_nbest": 1,
"src_subword_alpha": 0.0,
"tgt_subword_alpha": 0.0,
"src_subword_vocab": "",
"tgt_subword_vocab": "",
"src_vocab_threshold": 0,
"tgt_vocab_threshold": 0
}
},
"onmt_tokenize": {
"$ref": "#/$defs/ONMTTokenizerConfig",
"default": {
"src_subword_model": null,
"tgt_subword_model": null,
"src_subword_nbest": 1,
"tgt_subword_nbest": 1,
"src_subword_alpha": 0.0,
"tgt_subword_alpha": 0.0,
"src_subword_vocab": "",
"tgt_subword_vocab": "",
"src_vocab_threshold": 0,
"tgt_vocab_threshold": 0,
"src_subword_type": "none",
"tgt_subword_type": "none",
"src_onmttok_kwargs": {
"mode": "none"
},
"tgt_onmttok_kwargs": {
"mode": "none"
},
"gpt2_pretok": false,
"mapped_tokens": null
}
},
"inlinetags": {
"$ref": "#/$defs/InlineTagsConfig",
"default": {
"tags_dictionary_path": null,
"tags_corpus_ratio": 0.1,
"max_tags": 12,
"paired_stag": "\uff5fph_#_beg\uff60",
"paired_etag": "\uff5fph_#_end\uff60",
"isolated_tag": "\uff5fph_#_std\uff60",
"src_delimiter": "\uff5ffuzzy\uff60"
}
},
"huggingface_tokenize": {
"$ref": "#/$defs/HuggingfaceTokenizerConfig",
"default": {
"path": null,
"huggingface_model": null,
"max_length": null
}
},
"normalize": {
"$ref": "#/$defs/NormalizeConfig",
"default": {
"src_lang": "",
"tgt_lang": "",
"penn": true,
"norm_quote_commas": true,
"norm_numbers": true,
"pre_replace_unicode_punct": false,
"post_remove_control_chars": false
}
}
},
"title": "NestedAllTransformsConfig",
"type": "object"
},
"NormalizeConfig": {
"additionalProperties": false,
"properties": {
"src_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Source language code",
"title": "Src Lang"
},
"tgt_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Target language code",
"title": "Tgt Lang"
},
"penn": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Penn substitution",
"title": "Penn"
},
"norm_quote_commas": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Normalize quotations and commas",
"title": "Norm Quote Commas"
},
"norm_numbers": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Normalize numbers",
"title": "Norm Numbers"
},
"pre_replace_unicode_punct": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Replace unicode punct",
"title": "Pre Replace Unicode Punct"
},
"post_remove_control_chars": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove control chars",
"title": "Post Remove Control Chars"
}
},
"title": "NormalizeConfig",
"type": "object"
},
"ONMTTokenizerConfig": {
"additionalProperties": false,
"properties": {
"src_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for src (or shared).",
"title": "Src Subword Model"
},
"tgt_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for tgt.",
"title": "Tgt Subword Model"
},
"src_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)",
"title": "Src Subword Nbest"
},
"tgt_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)",
"title": "Tgt Subword Nbest"
},
"src_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)",
"title": "Src Subword Alpha"
},
"tgt_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)",
"title": "Tgt Subword Alpha"
},
"src_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for src subword. Format: <word>\\t<count> per line.",
"title": "Src Subword Vocab"
},
"tgt_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for tgt subword. Format: <word>\\t<count> per line.",
"title": "Tgt Subword Vocab"
},
"src_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.",
"title": "Src Vocab Threshold"
},
"tgt_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.",
"title": "Tgt Vocab Threshold"
},
"src_subword_type": {
"anyOf": [
{
"enum": [
"none",
"sentencepiece",
"bpe"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "none",
"description": "Type of subword model for src (or shared) in pyonmttok.",
"title": "Src Subword Type"
},
"tgt_subword_type": {
"anyOf": [
{
"enum": [
"none",
"sentencepiece",
"bpe"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "none",
"description": "Type of subword model for tgt in pyonmttok.",
"title": "Tgt Subword Type"
},
"src_onmttok_kwargs": {
"anyOf": [
{
"additionalProperties": true,
"type": "object"
},
{
"type": "null"
}
],
"default": {
"mode": "none"
},
"description": "Other pyonmttok options for src in dict string, except subword related options listed earlier.",
"title": "Src Onmttok Kwargs"
},
"tgt_onmttok_kwargs": {
"anyOf": [
{
"additionalProperties": true,
"type": "object"
},
{
"type": "null"
}
],
"default": {
"mode": "none"
},
"description": "Other pyonmttok options for tgt in dict string, except subword related options listed earlier.",
"title": "Tgt Onmttok Kwargs"
},
"gpt2_pretok": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Preprocess sentence with byte-level mapping.",
"title": "Gpt2 Pretok"
},
"mapped_tokens": {
"anyOf": [
{
"items": {
"maxItems": 2,
"minItems": 2,
"prefixItems": [
{
"type": "string"
},
{
"type": "string"
}
],
"type": "array"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Mapped tokens for placeholders preservation",
"title": "Mapped Tokens"
}
},
"title": "ONMTTokenizerConfig",
"type": "object"
},
"PrefixConfig": {
"additionalProperties": false,
"properties": {
"src_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to prepend to all source examples.",
"title": "Src Prefix"
},
"tgt_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to prepend to all target examples.",
"title": "Tgt Prefix"
}
},
"title": "PrefixConfig",
"type": "object"
},
"SuffixConfig": {
"additionalProperties": false,
"properties": {
"src_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to append to all source examples.",
"title": "Src Suffix"
},
"tgt_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to append to all target examples.",
"title": "Tgt Suffix"
}
},
"title": "SuffixConfig",
"type": "object"
},
"SwitchOutConfig": {
"additionalProperties": false,
"properties": {
"switchout_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for SwitchOut. :math:`\\tau^{-1}` in :cite:`DBLP:journals/corr/abs-1808-07512`. Smaller value makes data more diverse.",
"title": "Switchout Temperature"
}
},
"title": "SwitchOutConfig",
"type": "object"
},
"TerminologyConfig": {
"additionalProperties": false,
"properties": {
"termbase_path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to a dictionary file with terms.",
"title": "Termbase Path"
},
"src_spacy_language_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the spaCy language model for the source corpus.",
"title": "Src Spacy Language Model"
},
"tgt_spacy_language_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the spaCy language model for the target corpus.",
"title": "Tgt Spacy Language Model"
},
"term_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.3,
"description": "Ratio of corpus to augment with terms.",
"title": "Term Corpus Ratio"
},
"term_example_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.2,
"description": "Maximum terms allowed in an example.",
"title": "Term Example Ratio"
},
"src_term_stoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fsrc_term_start\uff60",
"description": "The source term start token.",
"title": "Src Term Stoken"
},
"tgt_term_stoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ftgt_term_start\uff60",
"description": "The target term start token.",
"title": "Tgt Term Stoken"
},
"tgt_term_etoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ftgt_term_end\uff60",
"description": "The target term end token.",
"title": "Tgt Term Etoken"
},
"term_source_delimiter": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ffuzzy\uff60",
"description": "Any special token used for augmented source sentences. The default is the fuzzy token used in the FuzzyMatch transform.",
"title": "Term Source Delimiter"
}
},
"title": "TerminologyConfig",
"type": "object"
},
"TokenDropConfig": {
"additionalProperties": false,
"properties": {
"tokendrop_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for token deletion.",
"title": "Tokendrop Temperature"
}
},
"title": "TokenDropConfig",
"type": "object"
},
"TokenMaskConfig": {
"additionalProperties": false,
"properties": {
"tokenmask_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for token masking.",
"title": "Tokenmask Temperature"
}
},
"title": "TokenMaskConfig",
"type": "object"
},
"UpperCaseConfig": {
"additionalProperties": false,
"properties": {
"upper_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.01,
"description": "Corpus ratio to apply uppercasing.",
"title": "Upper Corpus Ratio"
}
},
"title": "UpperCaseConfig",
"type": "object"
}
},
"required": [
"src_vocab",
"data"
]
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = ignore
- protected_namespaces: tuple = ()
- Fields:
- Validators:
_validate_build_vocab_config » all fields
field dump_samples : bool = False
Dump samples when building vocabulary. Warning: this may slow down the process.
- Validated by:
_maybe_set_huggingface_model
_validate_build_vocab_config
field learn_subwords : bool = False
Learn subwords (based on defined transforms) prior to building vocabulary.
- Validated by:
_maybe_set_huggingface_model
_validate_build_vocab_config
field learn_subwords_size : int = 32000
Number of subwords operations to learn.
- Validated by:
_maybe_set_huggingface_model
_validate_build_vocab_config
field n_sample : int = 5000
Number of transformed samples per corpus to use to build the vocabulary. Set to -1 to use the full corpora.
- Validated by:
_maybe_set_huggingface_model
_validate_build_vocab_config
field num_threads : int = 1
Number of parallel threads to build the vocabulary.
- Validated by:
_maybe_set_huggingface_model
_validate_build_vocab_config
field vocab_sample_queue_size : int = 20
Size of queues used for dumping samples.
- Validated by:
_maybe_set_huggingface_model
_validate_build_vocab_config
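Taken together, the fields above map onto a build-vocab YAML configuration. The sketch below is illustrative only: all file paths and corpus names are placeholders, and the exact key spelling and CLI invocation should be checked against the eole documentation.

```yaml
# Hypothetical build_vocab config; paths and corpus names are placeholders.
src_vocab: vocab/example.vocab.src   # required
n_sample: 5000            # transformed samples per corpus; -1 uses the full corpora
num_threads: 4            # parallel threads for building the vocabulary
learn_subwords: true      # learn subwords (per the defined transforms) first
learn_subwords_size: 32000
data:                     # required
  corpus_1:
    path_src: data/train.src
    path_tgt: data/train.tgt
    transforms: [onmt_tokenize, filtertoolong]
```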
model_post_init(context: Any) → None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
- self – The BaseModel instance.
- context – The context.
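The nested transform schemas above each ship a complete set of defaults, so a user config only needs to override the keys it changes. The snippet below is not eole code; it just illustrates, with a few defaults copied from the schema, how such per-transform overrides compose:

```python
import json

# A few transform defaults copied from the schema above.
SCHEMA_DEFAULTS = {
    "filtertoolong": {"src_seq_length": 192, "tgt_seq_length": 192},
    "uppercase": {"upper_corpus_ratio": 0.01},
    "switchout": {"switchout_temperature": 1.0},
}

def apply_defaults(user_cfg: dict, defaults: dict) -> dict:
    """Overlay user-supplied transform options on the schema defaults."""
    return {name: {**dflt, **user_cfg.get(name, {})}
            for name, dflt in defaults.items()}

# Overriding one key leaves the sibling defaults intact.
cfg = apply_defaults({"filtertoolong": {"src_seq_length": 512}}, SCHEMA_DEFAULTS)
print(json.dumps(cfg["filtertoolong"]))
```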