Transforms
pydantic model eole.transforms.tokenize.ONMTTokenizerConfig[source]
Bases: BaseTokenizerConfig
Show JSON schema
{
"title": "ONMTTokenizerConfig",
"type": "object",
"properties": {
"src_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for src (or shared).",
"title": "Src Subword Model"
},
"tgt_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for tgt.",
"title": "Tgt Subword Model"
},
"src_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)",
"title": "Src Subword Nbest"
},
"tgt_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)",
"title": "Tgt Subword Nbest"
},
"src_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)",
"title": "Src Subword Alpha"
},
"tgt_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)",
"title": "Tgt Subword Alpha"
},
"src_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for src subword. Format: <word>\\t<count> per line.",
"title": "Src Subword Vocab"
},
"tgt_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for tgt subword. Format: <word>\\t<count> per line.",
"title": "Tgt Subword Vocab"
},
"src_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.",
"title": "Src Vocab Threshold"
},
"tgt_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.",
"title": "Tgt Vocab Threshold"
},
"src_subword_type": {
"anyOf": [
{
"enum": [
"none",
"sentencepiece",
"bpe"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "none",
"description": "Type of subword model for src (or shared) in pyonmttok.",
"title": "Src Subword Type"
},
"tgt_subword_type": {
"anyOf": [
{
"enum": [
"none",
"sentencepiece",
"bpe"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "none",
"description": "Type of subword model for tgt in pyonmttok.",
"title": "Tgt Subword Type"
},
"src_onmttok_kwargs": {
"anyOf": [
{
"type": "object"
},
{
"type": "null"
}
],
"default": {
"mode": "none"
},
"description": "Other pyonmttok options for src in dict string, except subword related options listed earlier.",
"title": "Src Onmttok Kwargs"
},
"tgt_onmttok_kwargs": {
"anyOf": [
{
"type": "object"
},
{
"type": "null"
}
],
"default": {
"mode": "none"
},
"description": "Other pyonmttok options for tgt in dict string, except subword related options listed earlier.",
"title": "Tgt Onmttok Kwargs"
},
"gpt2_pretok": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Preprocess sentence with byte-level mapping.",
"title": "Gpt2 Pretok"
},
"mapped_tokens": {
"anyOf": [
{
"items": {
"maxItems": 2,
"minItems": 2,
"prefixItems": [
{
"type": "string"
},
{
"type": "string"
}
],
"type": "array"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Mapped tokens for placeholders preservation",
"title": "Mapped Tokens"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
gpt2_pretok (bool | None)
mapped_tokens (List[Tuple[str, str]] | None)
src_onmttok_kwargs (dict | None)
src_subword_type (Literal['none', 'sentencepiece', 'bpe'] | None)
tgt_onmttok_kwargs (dict | None)
tgt_subword_type (Literal['none', 'sentencepiece', 'bpe'] | None)
- Validators:
check_values » all fields
field gpt2_pretok : bool | None = False
Preprocess sentence with byte-level mapping.
- Validated by:
check_values
field mapped_tokens : List[Tuple[str, str]] | None = None
Mapped tokens for placeholder preservation.
- Validated by:
check_values
field src_onmttok_kwargs : dict | None = {'mode': 'none'}
Other pyonmttok options for src as a dict string, except the subword-related options listed earlier.
- Validated by:
check_values
field src_subword_type : Literal['none', 'sentencepiece', 'bpe'] | None = 'none'
Type of subword model for src (or shared) in pyonmttok.
- Validated by:
check_values
field tgt_onmttok_kwargs : dict | None = {'mode': 'none'}
Other pyonmttok options for tgt as a dict string, except the subword-related options listed earlier.
- Validated by:
check_values
field tgt_subword_type : Literal['none', 'sentencepiece', 'bpe'] | None = 'none'
Type of subword model for tgt in pyonmttok.
- Validated by:
check_values
validator check_values » all fields[source]
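As a hedged usage sketch: assuming the transform is registered under the name `onmt_tokenize` (the name and the model paths below are illustrative assumptions, not taken from this page), a BPE setup could be declared in an eole YAML config like:

```yaml
# Illustrative eole YAML fragment; transform name and paths are placeholders.
transforms: [onmt_tokenize]
src_subword_type: bpe
tgt_subword_type: bpe
src_subword_model: data/bpe.model   # placeholder path
tgt_subword_model: data/bpe.model   # placeholder path
src_onmttok_kwargs: {"mode": "aggressive", "joiner_annotate": true}
tgt_onmttok_kwargs: {"mode": "aggressive", "joiner_annotate": true}
```

Any pyonmttok option not covered by the subword fields above goes into `src_onmttok_kwargs`/`tgt_onmttok_kwargs`.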
pydantic model eole.transforms.tokenize.BaseTokenizerConfig[source]
Bases: TransformConfig
Show JSON schema
{
"title": "BaseTokenizerConfig",
"type": "object",
"properties": {
"src_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for src (or shared).",
"title": "Src Subword Model"
},
"tgt_subword_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path of subword model for tgt.",
"title": "Tgt Subword Model"
},
"src_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)",
"title": "Src Subword Nbest"
},
"tgt_subword_nbest": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)",
"title": "Tgt Subword Nbest"
},
"src_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)",
"title": "Src Subword Alpha"
},
"tgt_subword_alpha": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0,
"description": "Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)",
"title": "Tgt Subword Alpha"
},
"src_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for src subword. Format: <word>\\t<count> per line.",
"title": "Src Subword Vocab"
},
"tgt_subword_vocab": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Path to the vocabulary file for tgt subword. Format: <word>\\t<count> per line.",
"title": "Tgt Subword Vocab"
},
"src_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.",
"title": "Src Vocab Threshold"
},
"tgt_vocab_threshold": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.",
"title": "Tgt Vocab Threshold"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
src_subword_alpha (float | None)
src_subword_model (str | None)
src_subword_nbest (int | None)
src_subword_vocab (str | None)
src_vocab_threshold (int | None)
tgt_subword_alpha (float | None)
tgt_subword_model (str | None)
tgt_subword_nbest (int | None)
tgt_subword_vocab (str | None)
tgt_vocab_threshold (int | None)
- Validators:
check_values » all fields
field src_subword_alpha : float | None = 0
Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)
- Validated by:
check_values
field src_subword_model : str | None = None
Path of subword model for src (or shared).
- Validated by:
check_values
field src_subword_nbest : int | None = 1
Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)
- Validated by:
check_values
field src_subword_vocab : str | None = ''
Path to the vocabulary file for src subword. Format: <word>\t<count> per line.
- Validated by:
check_values
field src_vocab_threshold : int | None = 0
Only produce src subwords in src_subword_vocab with frequency >= src_vocab_threshold.
- Validated by:
check_values
field tgt_subword_alpha : float | None = 0
Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)
- Validated by:
check_values
field tgt_subword_model : str | None = None
Path of subword model for tgt.
- Validated by:
check_values
field tgt_subword_nbest : int | None = 1
Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)
- Validated by:
check_values
field tgt_subword_vocab : str | None = ''
Path to the vocabulary file for tgt subword. Format: <word>\t<count> per line.
- Validated by:
check_values
field tgt_vocab_threshold : int | None = 0
Only produce tgt subwords in tgt_subword_vocab with frequency >= tgt_vocab_threshold.
- Validated by:
check_values
validator check_values » all fields[source]
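To illustrate the `src_vocab_threshold`/`tgt_vocab_threshold` semantics, here is a minimal sketch (not eole's implementation) of filtering a `<word>\t<count>` vocabulary file so that only subwords at or above the threshold are kept:

```python
def load_subword_vocab(lines, vocab_threshold=0):
    """Keep only subwords whose count meets vocab_threshold.

    Each line has the "<word>\\t<count>" format described above.
    """
    keep = set()
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if int(count) >= vocab_threshold:
            keep.add(word)
    return keep
```

With `vocab_threshold=0` (the default) every entry is kept, which matches the default configuration being a no-op filter.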
pydantic model eole.transforms.docify.DocifyConfig[source]
Bases: TransformConfig
Show JSON schema
{
"title": "DocifyConfig",
"type": "object",
"properties": {
"doc_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 200,
"description": "Number of tokens per doc.",
"title": "Doc Length"
},
"max_context": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 1,
"description": "Max context segments.",
"title": "Max Context"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field doc_length : int | None = 200
Number of tokens per doc.
field max_context : int | None = 1
Max context segments.
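A sketch of the `doc_length` idea (assumed behavior, not eole's exact algorithm): consecutive segments are packed into "docs" of at most `doc_length` tokens.

```python
def docify(segments, doc_length=200):
    """Pack segments into docs of at most doc_length whitespace tokens."""
    docs, current, n_tokens = [], [], 0
    for seg in segments:
        toks = seg.split()
        # Flush the current doc when adding this segment would overflow it.
        if current and n_tokens + len(toks) > doc_length:
            docs.append(" ".join(current))
            current, n_tokens = [], 0
        current.extend(toks)
        n_tokens += len(toks)
    if current:
        docs.append(" ".join(current))
    return docs
```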
pydantic model eole.transforms.clean.CleanConfig[source]
Bases: TransformConfig
Show JSON schema
{
"title": "CleanConfig",
"type": "object",
"properties": {
"src_eq_tgt": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex src==tgt",
"title": "Src Eq Tgt"
},
"same_char": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex with same char more than 4 times",
"title": "Same Char"
},
"same_word": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove ex with same word more than 3 times",
"title": "Same Word"
},
"scripts_ok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"Latin",
"Common"
],
"description": "list of unicodata scripts accepted",
"title": "Scripts Ok"
},
"scripts_nok": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"description": "list of unicodata scripts not accepted",
"title": "Scripts Nok"
},
"src_tgt_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 2.0,
"description": "ratio between src and tgt",
"title": "Src Tgt Ratio"
},
"avg_tok_min": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 3.0,
"description": "average length of tokens min",
"title": "Avg Tok Min"
},
"avg_tok_max": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 20.0,
"description": "average length of tokens max",
"title": "Avg Tok Max"
},
"langid": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"description": "list of languages accepted",
"title": "Langid"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field avg_tok_max : float | None = 20.0
Maximum average token length.
field avg_tok_min : float | None = 3.0
Minimum average token length.
field langid : List[str] | None = []
List of accepted languages.
field same_char : bool | None = False
Remove examples where the same character repeats more than 4 times.
field same_word : bool | None = False
Remove examples where the same word repeats more than 3 times.
field scripts_nok : List[str] | None = []
List of rejected unicodedata scripts.
field scripts_ok : List[str] | None = ['Latin', 'Common']
List of accepted unicodedata scripts.
field src_eq_tgt : bool | None = False
Remove examples where src == tgt.
field src_tgt_ratio : float | None = 2.0
Maximum length ratio between src and tgt.
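A minimal sketch of a few of these filters (assumed semantics; eole's implementation may differ in details): drop an example when `src == tgt`, when the src/tgt length ratio is too skewed, or when the average token length falls outside `[avg_tok_min, avg_tok_max]`.

```python
def keep_example(src, tgt, src_eq_tgt=True, src_tgt_ratio=2.0,
                 avg_tok_min=3.0, avg_tok_max=20.0):
    """Return True if the (src, tgt) pair passes the clean filters."""
    src_toks, tgt_toks = src.split(), tgt.split()
    if src_eq_tgt and src == tgt:
        return False
    # Reject pairs whose token-count ratio is too skewed in either direction.
    ratio = len(src_toks) / max(len(tgt_toks), 1)
    if ratio > src_tgt_ratio or ratio < 1 / src_tgt_ratio:
        return False
    # Reject pairs whose average token length is implausible on either side.
    for toks in (src_toks, tgt_toks):
        avg = sum(map(len, toks)) / max(len(toks), 1)
        if not (avg_tok_min <= avg <= avg_tok_max):
            return False
    return True
```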
pydantic model eole.transforms.bart.BARTNoiseConfig[source]
Bases: TransformConfig
Show JSON schema
{
"title": "BARTNoiseConfig",
"type": "object",
"properties": {
"permute_sent_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Permute this proportion of sentences (boundaries defined by ['.', '?', '!']) in all inputs.",
"title": "Permute Sent Ratio"
},
"rotate_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Rotate this proportion of inputs.",
"title": "Rotate Ratio"
},
"insert_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Insert this percentage of additional random tokens.",
"title": "Insert Ratio"
},
"random_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Instead of using <mask>, use random token this often.",
"title": "Random Ratio"
},
"mask_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.0,
"description": "Fraction of words/subwords that will be masked.",
"title": "Mask Ratio"
},
"mask_length": {
"anyOf": [
{
"enum": [
"subword",
"word",
"span-poisson"
],
"type": "string"
},
{
"type": "null"
}
],
"default": "subword",
"description": "Length of masking window to apply.",
"title": "Mask Length"
},
"poisson_lambda": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 3.0,
"description": "Lambda for Poisson distribution to sample span length if `-mask_length` set to span-poisson.",
"title": "Poisson Lambda"
},
"replace_length": {
"anyOf": [
{
"maximum": 1,
"minimum": -1,
"type": "integer"
},
{
"type": "null"
}
],
"default": -1,
"description": "When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)",
"title": "Replace Length"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field insert_ratio : float | None = 0.0
Insert this percentage of additional random tokens.
field mask_length : Literal['subword', 'word', 'span-poisson'] | None = 'subword'
Length of masking window to apply.
field mask_ratio : float | None = 0.0
Fraction of words/subwords that will be masked.
field permute_sent_ratio : float | None = 0.0
Permute this proportion of sentences (boundaries defined by ['.', '?', '!']) in all inputs.
field poisson_lambda : float | None = 3.0
Lambda for Poisson distribution to sample span length if -mask_length set to span-poisson.
field random_ratio : float | None = 0.0
Instead of using <mask>, use random token this often.
field replace_length : int | None = -1
When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)
- Constraints:
- ge = -1
- le = 1
field rotate_ratio : float | None = 0.0
Rotate this proportion of inputs.
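A simplified sketch of `mask_ratio` and `replace_length` (not eole's BART implementation; span-level masking and the 1-vs-N distinction are omitted here): `replace_length=0` deletes masked tokens, otherwise each masked token becomes `<mask>`.

```python
import random

def mask_tokens(tokens, mask_ratio=0.3, replace_length=-1, rng=None):
    """Mask round(mask_ratio * len) token positions chosen at random."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    n_mask = round(len(tokens) * mask_ratio)
    masked = set(rng.sample(range(len(tokens)), n_mask))
    out = []
    for i, tok in enumerate(tokens):
        if i in masked:
            if replace_length != 0:   # 0 means: drop masked tokens entirely
                out.append("<mask>")
        else:
            out.append(tok)
    return out
```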
pydantic model eole.transforms.fuzzymatch.FuzzyMatchConfig[source]
Bases: TransformConfig
Show JSON schema
{
"title": "FuzzyMatchConfig",
"type": "object",
"properties": {
"tm_path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to a flat text TM.",
"title": "Tm Path"
},
"fuzzy_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.1,
"description": "Ratio of corpus to augment with fuzzy matches.",
"title": "Fuzzy Corpus Ratio"
},
"fuzzy_threshold": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 70,
"description": "The fuzzy matching threshold.",
"title": "Fuzzy Threshold"
},
"tm_delimiter": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\t",
"description": "The delimiter used in the flat text TM.",
"title": "Tm Delimiter"
},
"fuzzy_token": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ffuzzy\uff60",
"description": "The fuzzy token to be added with the matches.",
"title": "Fuzzy Token"
},
"fuzzymatch_min_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 70,
"description": "Max length for TM entries and examples to match.",
"title": "Fuzzymatch Min Length"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field fuzzy_corpus_ratio : float | None = 0.1
Ratio of corpus to augment with fuzzy matches.
field fuzzy_threshold : float | None = 70
The fuzzy matching threshold.
field fuzzy_token : str | None = '｟fuzzy｠'
The fuzzy token to be added with the matches.
field fuzzymatch_min_length : int | None = 70
Max length for TM entries and examples to match.
field tm_delimiter : str | None = '\t'
The delimiter used in the flat text TM.
field tm_path : str | None = None
Path to a flat text TM.
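A sketch of the fuzzy-match idea using `difflib` as a stand-in similarity metric (eole's actual matcher is an assumption here): when a TM source entry scores at or above `fuzzy_threshold` (on a 0-100 scale), its target is appended to the input after the fuzzy token.

```python
import difflib

FUZZY_TOKEN = "\uff5ffuzzy\uff60"  # the default fuzzy_token

def augment(src, tm, fuzzy_threshold=70):
    """Append the target of the first sufficiently similar TM entry."""
    for tm_src, tm_tgt in tm:
        score = 100 * difflib.SequenceMatcher(None, src, tm_src).ratio()
        if score >= fuzzy_threshold and src != tm_src:
            return src + FUZZY_TOKEN + tm_tgt
    return src
```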
pydantic model eole.transforms.inlinetags.InlineTagsConfig[source]
Bases: TransformConfig
Show JSON schema
{
"title": "InlineTagsConfig",
"type": "object",
"properties": {
"tags_dictionary_path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to a flat term dictionary.",
"title": "Tags Dictionary Path"
},
"tags_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.1,
"description": "Ratio of corpus to augment with tags.",
"title": "Tags Corpus Ratio"
},
"max_tags": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 12,
"description": "Maximum number of tags that can be added to a single sentence.",
"title": "Max Tags"
},
"paired_stag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_beg\uff60",
"description": "The format of an opening paired inline tag. Must include the character #.",
"title": "Paired Stag"
},
"paired_etag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_end\uff60",
"description": "The format of a closing paired inline tag. Must include the character #.",
"title": "Paired Etag"
},
"isolated_tag": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fph_#_std\uff60",
"description": "The format of an isolated inline tag. Must include the character #.",
"title": "Isolated Tag"
},
"src_delimiter": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ffuzzy\uff60",
"description": "Any special token used for augmented src sentences. The default is the fuzzy token used in the FuzzyMatch transform.",
"title": "Src Delimiter"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field isolated_tag : str | None = '｟ph_#_std｠'
The format of an isolated inline tag. Must include the character #.
field max_tags : int | None = 12
Maximum number of tags that can be added to a single sentence.
field paired_etag : str | None = '｟ph_#_end｠'
The format of a closing paired inline tag. Must include the character #.
field paired_stag : str | None = '｟ph_#_beg｠'
The format of an opening paired inline tag. Must include the character #.
field src_delimiter : str | None = '｟fuzzy｠'
Any special token used for augmented src sentences. The default is the fuzzy token used in the FuzzyMatch transform.
field tags_corpus_ratio : float | None = 0.1
Ratio of corpus to augment with tags.
field tags_dictionary_path : str | None = None
Path to a flat term dictionary.
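The `#` in the tag templates is the slot for the tag counter, which is why each template "must include the character #". A minimal sketch (assumed numbering behavior):

```python
def make_tag(template, index):
    """Fill the '#' slot of a tag template with the tag's index."""
    return template.replace("#", str(index))

def wrap_span(tokens, start, end, index,
              paired_stag="\uff5fph_#_beg\uff60",
              paired_etag="\uff5fph_#_end\uff60"):
    """Insert an opening/closing paired tag around tokens[start:end]."""
    return (tokens[:start] + [make_tag(paired_stag, index)]
            + tokens[start:end] + [make_tag(paired_etag, index)]
            + tokens[end:])
```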
pydantic model eole.transforms.uppercase.UpperCaseConfig[source]
Bases: TransformConfig
Show JSON schema
{
"title": "UpperCaseConfig",
"type": "object",
"properties": {
"upper_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.01,
"description": "Corpus ratio to apply uppercasing.",
"title": "Upper Corpus Ratio"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field upper_corpus_ratio : float | None = 0.01
Corpus ratio to apply uppercasing.
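A sketch of the assumed semantics: each example is uppercased with probability `upper_corpus_ratio`, so the default of 0.01 uppercases roughly 1% of the corpus.

```python
import random

def maybe_upper(src, tgt, upper_corpus_ratio=0.01, rng=None):
    """Uppercase the whole example with probability upper_corpus_ratio."""
    rng = rng or random.Random()
    if rng.random() < upper_corpus_ratio:
        return src.upper(), tgt.upper()
    return src, tgt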
pydantic model eole.transforms.sampling.TokenDropConfig[source]β
Bases: TransformConfig
Show JSON schema
{
"title": "TokenDropConfig",
"type": "object",
"properties": {
"tokendrop_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for token deletion.",
"title": "Tokendrop Temperature"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field tokendrop_temperature : float | None = 1.0β
Sampling temperature for token deletion.
pydantic model eole.transforms.sampling.TokenMaskConfig[source]β
Bases: TransformConfig
Show JSON schema
{
"title": "TokenMaskConfig",
"type": "object",
"properties": {
"tokenmask_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for token masking.",
"title": "Tokenmask Temperature"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field tokenmask_temperature : float | None = 1.0β
Sampling temperature for token masking.
pydantic model eole.transforms.sampling.SwitchOutConfig[source]β
Bases: TransformConfig
Show JSON schema
{
"title": "SwitchOutConfig",
"type": "object",
"properties": {
"switchout_temperature": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 1.0,
"description": "Sampling temperature for SwitchOut. :math:`\\tau^{-1}` in :cite:`DBLP:journals/corr/abs-1808-07512`. Smaller value makes data more diverse.",
"title": "Switchout Temperature"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field switchout_temperature : float | None = 1.0β
Sampling temperature for SwitchOut. $\tau^{-1}$ in []. Smaller value makes data more diverse.
pydantic model eole.transforms.terminology.TerminologyConfig[source]β
Bases: TransformConfig
Show JSON schema
{
"title": "TerminologyConfig",
"type": "object",
"properties": {
"termbase_path": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to a dictionary file with terms.",
"title": "Termbase Path"
},
"src_spacy_language_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the spaCy language model for the source corpus.",
"title": "Src Spacy Language Model"
},
"tgt_spacy_language_model": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the spaCy language model for the target corpus.",
"title": "Tgt Spacy Language Model"
},
"term_corpus_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.3,
"description": "Ratio of corpus to augment with terms.",
"title": "Term Corpus Ratio"
},
"term_example_ratio": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"default": 0.2,
"description": "Maximum terms allowed in an example.",
"title": "Term Example Ratio"
},
"src_term_stoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5fsrc_term_start\uff60",
"description": "The source term start token.",
"title": "Src Term Stoken"
},
"tgt_term_stoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ftgt_term_start\uff60",
"description": "The target term start token.",
"title": "Tgt Term Stoken"
},
"tgt_term_etoken": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ftgt_term_end\uff60",
"description": "The target term end token.",
"title": "Tgt Term Etoken"
},
"term_source_delimiter": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "\uff5ffuzzy\uff60",
"description": "Any special token used for augmented source sentences. The default is the fuzzy token used in the FuzzyMatch transform.",
"title": "Term Source Delimiter"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field src_spacy_language_model : str | None = Noneβ
Name of the spaCy language model for the source corpus.
field src_term_stoken : str | None = 'ο½src_term_startο½ 'β
The source term start token.
field term_corpus_ratio : float | None = 0.3β
Ratio of corpus to augment with terms.
field term_example_ratio : float | None = 0.2β
Maximum terms allowed in an example.
field term_source_delimiter : str | None = 'ο½fuzzyο½ 'β
Any special token used for augmented source sentences. The default is the fuzzy token used in the FuzzyMatch transform.
field termbase_path : str | None = Noneβ
Path to a dictionary file with terms.
field tgt_spacy_language_model : str | None = Noneβ
Name of the spaCy language model for the target corpus.
field tgt_term_etoken : str | None = 'ο½tgt_term_endο½ 'β
The target term end token.
field tgt_term_stoken : str | None = 'ο½tgt_term_startο½ 'β
The target term start token.
pydantic model eole.transforms.misc.FilterTooLongConfig[source]β
Bases: TransformConfig
Show JSON schema
{
"title": "FilterTooLongConfig",
"type": "object",
"properties": {
"src_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 192,
"description": "Maximum source sequence length.",
"title": "Src Seq Length"
},
"tgt_seq_length": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 192,
"description": "Maximum target sequence length.",
"title": "Tgt Seq Length"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field src_seq_length : int | None = 192β
Maximum source sequence length.
field tgt_seq_length : int | None = 192β
Maximum target sequence length.
pydantic model eole.transforms.misc.PrefixConfig[source]β
Bases: TransformConfig
Show JSON schema
{
"title": "PrefixConfig",
"type": "object",
"properties": {
"src_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to prepend to all source examples.",
"title": "Src Prefix"
},
"tgt_prefix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to prepend to all target examples.",
"title": "Tgt Prefix"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field src_prefix : str | None = ''β
String to prepend to all source examples.
field tgt_prefix : str | None = ''β
String to prepend to all target examples.
pydantic model eole.transforms.misc.SuffixConfig[source]β
Bases: TransformConfig
Show JSON schema
{
"title": "SuffixConfig",
"type": "object",
"properties": {
"src_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to append to all source examples.",
"title": "Src Suffix"
},
"tgt_suffix": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "String to append to all target examples.",
"title": "Tgt Suffix"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field src_suffix : str | None = ''β
String to append to all source examples.
field tgt_suffix : str | None = ''β
String to append to all target examples.
pydantic model eole.transforms.normalize.NormalizeConfig[source]β
Bases: TransformConfig
Show JSON schema
{
"title": "NormalizeConfig",
"type": "object",
"properties": {
"src_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Source language code",
"title": "Src Lang"
},
"tgt_lang": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": "",
"description": "Target language code",
"title": "Tgt Lang"
},
"penn": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Penn substitution",
"title": "Penn"
},
"norm_quote_commas": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Normalize quotations and commas",
"title": "Norm Quote Commas"
},
"norm_numbers": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Normalize numbers",
"title": "Norm Numbers"
},
"pre_replace_unicode_punct": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Replace unicode punct",
"title": "Pre Replace Unicode Punct"
},
"post_remove_control_chars": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": false,
"description": "Remove control chars",
"title": "Post Remove Control Chars"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field norm_numbers : bool | None = Trueβ
Normalize numbers
field norm_quote_commas : bool | None = Trueβ
Normalize quotations and commas
field penn : bool | None = Trueβ
Penn substitution
field post_remove_control_chars : bool | None = Falseβ
Remove control chars
field pre_replace_unicode_punct : bool | None = Falseβ
Replace unicode punct
field src_lang : str | None = ''β
Source language code
field tgt_lang : str | None = ''β
Target language code
pydantic model eole.transforms.insert_mask_before_placeholder.InsertMaskBeforePlaceholderConfig[source]β
Bases: TransformConfig
Show JSON schema
{
"title": "InsertMaskBeforePlaceholderConfig",
"type": "object",
"properties": {
"response_patterns": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [
"Response : \uff5fnewline\uff60"
],
"description": "Response pattern to locate the end of the prompt.",
"title": "Response Patterns"
}
},
"additionalProperties": false
}
- Config:
- validate_assignment: bool = True
- validate_default: bool = True
- use_enum_values: bool = True
- extra: str = forbid
- protected_namespaces: tuple = ()
- Fields:
field response_patterns : List[str] | None = ['Response : ο½newlineο½ ']β
Response pattern to locate the end of the prompt.