EOLE

Latest: Chatbot in streaming mode - 60-65 tok/sec with Qwen3.5-27B-int4 (on RTX 5090)


Open language modeling toolkit based on PyTorch (initially spun off from OpenNMT-py)

Top inference speed with torch.compile and CUDA graphs: as fast as vLLM and faster than CT2 (CTranslate2) on GPU. See results.

To reproduce on your own hardware:

git clone https://github.com/eole-nlp/eole
cd eole
pip install -e .
export EOLE_MODEL_DIR=<where_to_store_models>
export HF_TOKEN=<your_hf_token>
eole convert HF --model_dir "google/gemma-3-1b-it" --output $EOLE_MODEL_DIR/gemma-3-1b-it --token $HF_TOKEN
cd benchmarks/genai
EOLE_TORCH_COMPILE="1" EOLE_COMPILE_MODE="0" python generate-eole.py

The first run takes 60-80 seconds to compile.
Run it a second time to see the full speed.
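
The difference between the compiling run and the warmed-up run can be measured with a small wrapper. This is a minimal sketch, not part of the repository: `timed_run` and `benchmark_twice` are hypothetical helper names, and only the environment variables and script name come from the commands above.

```python
import os
import subprocess
import sys
import time


def timed_run(cmd, extra_env=None):
    """Run cmd to completion and return its wall-clock duration in seconds."""
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


def benchmark_twice():
    """Time two consecutive runs of the benchmark script (hypothetical helper)."""
    # Environment variables and script name taken from the commands above.
    compile_env = {"EOLE_TORCH_COMPILE": "1", "EOLE_COMPILE_MODE": "0"}
    cmd = [sys.executable, "generate-eole.py"]
    for label in ("first run (includes compilation)", "second run"):
        print(f"{label}: {timed_run(cmd, compile_env):.1f} s")
```

Calling `benchmark_twice()` from the benchmarks/genai directory prints one duration per run, which makes the compilation overhead of the first run visible.
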
To accomplish this we performed a full refactor of the code: Encoders, Decoders, Adapters, Model classes, Trainer, Distributed training / Inference.

We aim to maintain the research-friendly approach of the original project while including the latest architectures (LLMs) and various other techniques. Our goal is to provide a comprehensive yet compact and modular codebase for experimenting with various types of language models (encoder, decoder, seq2seq).

HF Models supported

Qwen3.5 family: including vision models and AutoRound (GPTQ quant)
Whisper: see full details and an example in the recipe
tencent/HunyuanOCR: end-to-end OCR model by Tencent. Uses more image tokens than DeepSeek-OCR but a smaller LM. Results are impressive (see recipe).
deepseek-ai/DeepSeek-OCR: for now, takes any image and rescales it to 1024x1024 before processing (Gundam mode not implemented yet). pdf_ocr to mmd replicated; check the recipes.
tencent/Hunyuan-MT-7B: SOTA NMT at WMT25, better than Towerplus-9B and EuroLLM-9B
Qwen/Qwen2/3: non-VL family, including Qwen3-30B-A3B
google/gemma-3-27b-it: all of the Gemma3 family, with text and image input
Mistral-3.1-24B-instruct: all Mistral AI models (text and image input), including Ministral 3, Mixtral, and Mathstral
meta-llama/Llama-3.X models
microsoft/Phi-2/3 models

Of course, you can also train your own architecture (decoder-only, encoder-only, or encoder-decoder model).

Latest developments

LM_scoring: updated perplexity tool to compare the perplexity of models within a family (e.g. Qwen3.5 27B, 9B, 4B, ... using the same tokenizer)
gguf conversion: included for educational purposes; shows how to convert a gguf model (quantized or not) to an Eole safetensors model while keeping (almost) the same quantization
Autoround support: uses GPTQModel to support int4 quantization
Comet scorer: you can now use both BLEU and COMET during validation to measure training improvement
torch.compile compliant: inference speed on par with vLLM
High inference speed using Flash Attention (decoding with in-place KVCache); CUDA kernels for RMSNorm, RoPE, and activations; Triton: very fast MoE, fused MLP Gate, fused KVQ Linear
prefixLM + split prompt/answer in src/tgt: optional method to feed your data
Pure-BF16 training, thanks to Kahan summation implemented here
Web-based (Google translator-like) interface featuring the latest Hunyuan-MT-7B or EuroLLM-9B-Instruct LLM
Estimator layer, which makes it possible to rescore multiple beams within the same model. Read the articles here and here
Support for Hugging Face Tokenizers for better compatibility
Replicates CometKiwi(XL/XXL) Encoder+Estimator models

Key Features

Versatile Training and Inference: Train from scratch, finetune, and infer models of various architectures including Transformer Encoder/Decoder/EncoderDecoder and RNN EncoderDecoder.
Dynamic Data Transforms: Apply on-the-fly transformations in the dataloading logic for both training and inference.
Comprehensive LLM Support: Includes converters for Llama, Mistral, Phi, Gemma ...
Advanced Quantization: Support for 8-bit and 4-bit quantization, along with LoRA adapters, with or without checkpointing, as well as mixed precision (FP16).
Efficient Finetuning: Finetune 7B and 13B models on a single RTX 24GB GPU using 4-bit quantization.
Flexible Inference: Perform inference in 4-bit or 8-bit using the same layer quantization methods as in finetuning.
Tensor Parallelism: Enable tensor parallelism for both training and inference when models exceed the memory capacity of a single GPU.
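
As a rough back-of-the-envelope check of the 4-bit finetuning claim above, the memory taken by the weights alone can be estimated from the parameter count and the bits per weight. This is only a sketch: `weight_memory_gib` is an illustrative helper, and the estimate ignores quantization scales, LoRA parameters, activations, and optimizer state, all of which add to the real footprint.

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


for name, n_params in (("7B", 7e9), ("13B", 13e9)):
    fp16 = weight_memory_gib(n_params, 16)
    int4 = weight_memory_gib(n_params, 4)
    print(f"{name}: {fp16:.1f} GiB in 16-bit, {int4:.1f} GiB in 4-bit")
```

Under these assumptions, the 16-bit weights of a 13B model already exceed 24 GiB on their own, while the 4-bit weights take roughly a quarter of that, which leaves room for adapters and activations.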

Work completed

We have made significant progress in several areas:

Configuration Management: Streamlined through pydantic models.
Command Line Entry Points: Improved using structured subparsers for better organization.
Reproducible Recipes: Provided for widely used models and tasks, ensuring consistency and reliability.
Core API Simplification: Refined around the new configuration objects for ease of use.
Revamped FastAPI-based server: see the example above with EuroLLM-9B-Instruct

Future Directions

There are still several exciting avenues to explore:

Documentation: Enhance and expand the documentation for better user guidance.
Test Coverage: Improve testing to ensure code reliability and performance.
Logging Enhancements: Implement more sophisticated logging mechanisms.
Broader Model Support: Extend support to include a wider range of open models, potentially multi-modal.

Setup

Using Docker

To facilitate setup and reproducibility, we provide Docker images via the GitHub Container Registry: EOLE Docker Images.

You can customize the workflow and build your own images based on specific needs using build.sh and Dockerfile in the docker directory of the repository.

To pull the Docker image:

docker pull ghcr.io/eole-nlp/eole:0.4.0-torch2.9.1-ubuntu22.04-cuda12.8

Example one-liner to run a container and open a bash shell within it:

docker run --rm -it --runtime=nvidia ghcr.io/eole-nlp/eole:0.4.0-torch2.9.1-ubuntu22.04-cuda12.8

Note: Ensure you have the Nvidia Container Toolkit (formerly nvidia-docker) installed to take advantage of CUDA/GPU features.

Depending on your needs, you can add various flags:

-p 5000:5000: Forward an exposed port from your container to your host.
-v /some/local/directory:/some/container/directory: Mount a local directory to a container directory.
--entrypoint some_command: Run a specific command as the container entry point (instead of the default bash shell).
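
If containers are launched from a script, the flags above can be assembled programmatically. The following is a minimal sketch: `docker_run_command` is a hypothetical helper, not something shipped with the repository, and it only combines the image tag and flags already shown in this section.

```python
import shlex

# Image tag taken from the pull command above.
IMAGE = "ghcr.io/eole-nlp/eole:0.4.0-torch2.9.1-ubuntu22.04-cuda12.8"


def docker_run_command(ports=(), volumes=(), entrypoint=None, image=IMAGE):
    """Build the argument list for `docker run` with optional extra flags."""
    cmd = ["docker", "run", "--rm", "-it", "--runtime=nvidia"]
    for host_port, container_port in ports:
        cmd += ["-p", f"{host_port}:{container_port}"]
    for host_dir, container_dir in volumes:
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    if entrypoint is not None:
        cmd += ["--entrypoint", entrypoint]
    # Options must come before the image name.
    cmd.append(image)
    return cmd


print(shlex.join(docker_run_command(
    ports=[(5000, 5000)],
    volumes=[("/some/local/directory", "/some/container/directory")],
)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with paths that contain spaces.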

Installing Locally

Requirements

Python >= 3.10
PyTorch >= 2.8, < 2.10
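
A quick way to check an environment against these requirements is sketched below. The helper names are illustrative, and the version range is read as "2.8.x or 2.9.x", which is an interpretation of the requirement line above.

```python
import sys


def parse_version(text):
    """Return the leading (major, minor) integers of a version string such as '2.9.1+cu128'."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def torch_version_ok(version):
    """True if the PyTorch version string satisfies >= 2.8 and < 2.10."""
    return (2, 8) <= parse_version(version) < (2, 10)


def python_version_ok(version_info=sys.version_info):
    """True if the interpreter version satisfies Python >= 3.10."""
    return tuple(version_info[:2]) >= (3, 10)


# Typical use, once PyTorch is installed:
#   import torch
#   print(python_version_ok(), torch_version_ok(torch.__version__))
```

Comparing integer tuples rather than strings matters here: as plain strings, "2.10" would sort before "2.8".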

Installation from Source

To install from source:

git clone https://github.com/eole-nlp/eole
cd eole
pip install -e .

Installation from PyPI

Installation from PyPI will be available soon.

Notes

If you encounter a MemoryError during installation, try using pip with the --no-cache-dir option.

(Optional) Some advanced features (e.g., pretrained models or specific transforms) require extra packages. Install them with:

pip install -r requirements.opt.txt

Manual Installation of Some Dependencies

Flash Attention

To use Flash Attention, install it manually:

pip install flash-attn --no-build-isolation

AWQ

To run inference with, or to quantize, an AWQ model, AutoAWQ is required. Install it with:

pip install autoawq

For more details, refer to AutoAWQ.

Notes on Mixed-Precision or Low-Precision Training

Until Feb 25, we used torch optimizers, with or without AMP (mixed precision), or "fusedadam", an old Apex/NVIDIA implementation that used FP16 with dynamic loss scaling and no FP32 master weights. As of 0.2, "fusedadam" is deprecated and pure-BF16 training is implemented.

As a result, config flags are now:

For FP16-AMP or BF16-AMP training (using PyTorch optimizers and the AMP implementation):

compute_dtype: fp16 or bf16
use_amp: true
optim: adam or adamw

Special note: although it may seem unnecessary, we still use the torch GradScaler in BF16-AMP. Even though the BF16 range is similar to FP32, scaling prevents underflow. We tested BF16-AMP without the GradScaler and the results were not good.

For pure-BF16 training (using torch-optimi and Kahan summation):

compute_dtype: bf16
use_amp: false
optim: adam or adamw

Pure-BF16 training is faster than AMP and reduces the memory footprint (master weights are kept in BF16 instead of FP32). However, Kahan summation is not magic: results are good, but not as good as with AMP. Use this feature mainly when the memory footprint is an issue with LLMs.
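
To illustrate why Kahan summation helps when weights are stored in BF16, the sketch below emulates BF16 rounding in plain Python and applies many small updates to a single weight. It is not the torch-optimi or Eole implementation: `to_bf16`, `naive_accumulate`, and `kahan_accumulate` are illustrative names, and the "optimizer step" is reduced to adding a constant.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Adding this bias before dropping the low 16 bits rounds to nearest even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def naive_accumulate(weight, update, steps):
    """Add `update` to a BF16 weight `steps` times, rounding after every addition."""
    w, u = to_bf16(weight), to_bf16(update)
    for _ in range(steps):
        w = to_bf16(w + u)
    return w


def kahan_accumulate(weight, update, steps):
    """Same as naive_accumulate, but carry the rounding error in a BF16 compensation term."""
    w, u = to_bf16(weight), to_bf16(update)
    comp = 0.0  # error lost in previous additions
    for _ in range(steps):
        y = to_bf16(u - comp)               # update corrected by the carried error
        t = to_bf16(w + y)                  # rounded new weight
        comp = to_bf16(to_bf16(t - w) - y)  # what the rounding added or dropped
        w = t
    return w


print(naive_accumulate(1.0, 1e-3, 1000))  # stays at 1.0: each update is below BF16 resolution
print(kahan_accumulate(1.0, 1e-3, 1000))  # ends close to 2.0
```

Near 1.0, adjacent BF16 values are 2^-7 apart, so an update of 1e-3 is rounded away every time in the naive loop. The compensation term costs one extra BF16 buffer per weight, which is still smaller than keeping FP32 master weights.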

Contributing

We love contributions! Please look at issues marked with the contributions welcome tag.

Before raising an issue, make sure you read the requirements and the Full Documentation. You can also check if a Recipe fits your use case.

Unless there is a bug, please use the Discussions tab to ask questions or propose new topics/features.
