Dataset Weighting

Dataset weighting is built into the data configuration format introduced in OpenNMT-py 2.0. Each corpus entry in the data configuration has its own weight. When building batches, we sequentially take weight examples from each corpus, where weight is the value configured for that corpus.
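The weighted sampling described above can be sketched in a few lines of Python. This is only an illustration of the idea, not OpenNMT-py's actual implementation: the function name weighted_interleave and the in-memory lists are hypothetical, and the sketch assumes every corpus is non-empty and is cycled indefinitely.

```python
from itertools import cycle, islice


def weighted_interleave(corpora, weights):
    """Yield examples by taking weights[name] items from each corpus in turn.

    Hypothetical helper for illustration only. Each corpus is assumed to be
    non-empty and is restarted from the beginning once it is exhausted.
    """
    iterators = {name: cycle(examples) for name, examples in corpora.items()}
    while True:
        for name, iterator in iterators.items():
            for _ in range(weights[name]):
                yield next(iterator)


# Toy corpora standing in for the real training files.
corpora = {
    "corpus_1": ["a1", "a2", "a3"],
    "corpus_2": ["b1", "b2"],
}
weights = {"corpus_1": 7, "corpus_2": 3}

# One full cycle: 7 examples from corpus_1, then 3 from corpus_2.
first_cycle = list(islice(weighted_interleave(corpora, weights), 10))
print(first_cycle)
```

With weights 7 and 3, each cycle of 10 examples contains 7 from corpus_1 and 3 from corpus_2, regardless of how large each corpus is.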

Note: there is no need to worry about batch homogeneity or heterogeneity; the bucketing mechanism takes care of that. Instead of building batches one at a time, we load bucket_size examples, sort them by length, build batches, and then yield them in random order.
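The bucketing step can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the names bucketed_batches and batches_from_bucket are hypothetical, examples are plain token lists, and batch_size counts examples, whereas a real setup may size batches by tokens.

```python
import random


def batches_from_bucket(bucket, batch_size, rng):
    """Sort one bucket by length, cut it into batches, and shuffle the batches."""
    ordered = sorted(bucket, key=len)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    rng.shuffle(batches)
    return batches


def bucketed_batches(examples, bucket_size, batch_size, seed=0):
    """Yield length-homogeneous batches in random order, one bucket at a time.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    bucket = []
    for example in examples:
        bucket.append(example)
        if len(bucket) == bucket_size:
            yield from batches_from_bucket(bucket, batch_size, rng)
            bucket = []
    if bucket:  # Flush the last, possibly smaller, bucket.
        yield from batches_from_bucket(bucket, batch_size, rng)


# Toy examples: token lists of varying length.
examples = [["tok"] * n for n in [5, 1, 9, 3, 7, 2, 8, 4]]
for batch in bucketed_batches(examples, bucket_size=4, batch_size=2):
    print([len(example) for example in batch])
```

Because sorting happens within a bucket, each batch groups examples of similar length, while shuffling the batches keeps the training order from being sorted by length.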

Example

In the following example, we sequentially sample 7 examples from corpus_1, then 3 examples from corpus_2, and then repeat the cycle, so corpus_1 contributes 70% of the sampled examples:

# <your_config>.yaml

...

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train1.txt
        path_tgt: toy-ende/tgt-train1.txt
        weight: 7
    corpus_2:
        path_src: toy-ende/src-train1.txt
        path_tgt: toy-ende/tgt-train1.txt
        weight: 3
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt
...
