Do you support multi-GPU?

Yes. EOLE supports two distributed modes: data parallelism and tensor parallelism.

Configuration

First, make sure CUDA_VISIBLE_DEVICES is exported so that the GPUs you want to use are visible, e.g. export CUDA_VISIBLE_DEVICES=0,1,2,3.

To use only GPUs 1 and 3 as numbered by your OS, export CUDA_VISIBLE_DEVICES=1,3.
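Inside a process, CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES starting from 0. The sketch below illustrates that mapping in plain Python. The helper name `logical_to_physical` is hypothetical and not part of EOLE, and it assumes the variable holds a comma-separated list of integer ids.

```python
def logical_to_physical(visible_devices: str) -> dict[int, int]:
    """Map the device index a process sees to the OS-level GPU id.

    Hypothetical helper for illustration only; assumes integer ids.
    """
    physical_ids = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    return {logical: physical for logical, physical in enumerate(physical_ids)}


mapping = logical_to_physical("1,3")
print(mapping)  # {0: 1, 1: 3}
```

Under this mapping, a run restricted to OS GPUs 1 and 3 sees them as devices 0 and 1, so the rank list for such a run would presumably be [0, 1] rather than [1, 3].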

Both world_size and gpu_ranks need to be set. E.g.:

training:
    world_size: 4
    gpu_ranks: [0, 1, 2, 3]

Data Parallelism

The default mode (parallel_mode: data_parallel) replicates the model across GPUs and splits the data. This is well-suited for training.
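The arithmetic behind data parallelism can be sketched without any GPU: each replica computes a gradient on its own shard of the batch, and the averaged result matches the full-batch gradient when the shards are equal in size. This is a toy one-parameter model in plain Python, not EOLE's implementation; `grad_mse` and `data_parallel_grad` are hypothetical names, and a real framework would exchange gradients with a collective operation instead of a Python loop.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, world_size):
    """Average per-replica gradients; assumes len(xs) is divisible by world_size."""
    shard = len(xs) // world_size
    local_grads = [
        grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(world_size)  # one iteration per simulated GPU
    ]
    return sum(local_grads) / world_size


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5  # same parameter value on every replica

full_batch = grad_mse(w, xs, ys)
replicated = data_parallel_grad(w, xs, ys, world_size=4)
print(math.isclose(full_batch, replicated))  # True
```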

Tensor Parallelism

Tensor parallelism (parallel_mode: tensor_parallel) splits the model tensors across GPUs. This is useful when a single model is too large to fit on one GPU, and is supported for both training and inference.

training:
    world_size: 4
    gpu_ranks: [0, 1, 2, 3]
    parallel_mode: tensor_parallel
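To make "splitting the model tensors" concrete, here is a minimal sketch of one common scheme: a weight matrix is cut column-wise, each shard computes part of the output, and the parts are concatenated. The source does not say how EOLE shards its layers, so treat this as an assumption for illustration; `matvec` and `split_columns` are hypothetical helpers in plain Python.

```python
def matvec(x, W):
    """Compute x @ W for a vector x (length n) and a matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def split_columns(W, parts):
    """Split W column-wise into equal shards; assumes the column count is divisible by parts."""
    width = len(W[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in W] for p in range(parts)]


x = [1, 2]
W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]

shards = split_columns(W, parts=2)              # one shard per simulated GPU
partial = [matvec(x, shard) for shard in shards]  # each GPU holds half of W
gathered = [value for part in partial for value in part]

print(gathered)      # [11, 14, 17, 20]
print(matvec(x, W))  # [11, 14, 17, 20]
```

Each simulated device stores only half of the weights, which is why this mode helps when the full model does not fit on one GPU.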

Multi-node Training

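For multi-node distributed training, configure master_ip and master_port, set world_size to the total number of GPUs across all nodes, and give each node its own gpu_ranks. For example, with two nodes of two GPUs each: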

  • Node 1: world_size: 4, gpu_ranks: [0, 1]
  • Node 2: world_size: 4, gpu_ranks: [2, 3]
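The rank layout in the list above follows a simple pattern that can be sketched in Python. The function name `node_rank_layout` is hypothetical, and the sketch assumes every node has the same number of GPUs and that ranks are assigned contiguously per node.

```python
def node_rank_layout(num_nodes: int, gpus_per_node: int) -> list[dict]:
    """Return world_size and gpu_ranks per node (hypothetical helper)."""
    world_size = num_nodes * gpus_per_node  # total GPUs across all nodes
    layout = []
    for node in range(num_nodes):
        first_rank = node * gpus_per_node
        layout.append({
            "world_size": world_size,
            "gpu_ranks": list(range(first_rank, first_rank + gpus_per_node)),
        })
    return layout


layout = node_rank_layout(num_nodes=2, gpus_per_node=2)
for i, cfg in enumerate(layout, start=1):
    print(f"Node {i}: world_size: {cfg['world_size']}, gpu_ranks: {cfg['gpu_ranks']}")
# Node 1: world_size: 4, gpu_ranks: [0, 1]
# Node 2: world_size: 4, gpu_ranks: [2, 3]
```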

Setting accum_count: 2 accumulates gradients over 2 batches before each parameter update, which reduces inter-node communication.

If the nodes are connected by a regular network card (1 Gbps), we suggest using a higher accum_count to minimize inter-node communication.
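A rough way to see the effect of accum_count is to count parameter updates. The sketch below assumes that gradients are exchanged between nodes once per update rather than once per batch; that is an assumption about the communication pattern, and `sync_steps` is a hypothetical helper, not an EOLE function.

```python
import math


def sync_steps(num_batches: int, accum_count: int) -> int:
    """Number of parameter updates needed to consume num_batches batches."""
    return math.ceil(num_batches / accum_count)


for accum_count in (1, 2, 8):
    print(accum_count, sync_steps(1000, accum_count))
# 1 1000
# 2 500
# 8 125
```

Under that assumption, going from accum_count 1 to 8 cuts the number of inter-node exchanges for the same amount of data by a factor of eight, at the cost of a larger effective batch size per update.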