How to use torch.compile for fast inference?

EOLE supports torch.compile for significantly accelerated inference, reaching speeds comparable to vLLM on GPU.

Enabling torch.compile

Set the EOLE_TORCH_COMPILE environment variable before running inference:

export EOLE_TORCH_COMPILE=1

You can also control the compile mode with EOLE_COMPILE_MODE:

Value	Description
`0`	Full compile (the whole decoder, with CUDA graphs)
`1`	Same as `0`, but without CUDA graphs
`2`	Per-layer decoder compilation (faster to compile), with CUDA graphs
`3`	Same as `2`, but without CUDA graphs

export EOLE_TORCH_COMPILE=1
export EOLE_COMPILE_MODE=2
eole predict --config inference.yaml --src input.txt --output output.txt

Note: In mode `0`, the first run takes 60-90 seconds to compile. Subsequent runs are significantly faster.
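
When inference is launched from a Python script rather than a shell, the same two variables can be placed in the child process environment. The following is a minimal sketch, not part of EOLE: `build_compile_env` and `run_predict` are hypothetical helpers, and the sketch assumes the variables only need to be present in the environment of the `eole predict` process. The command line is the one shown above.

```python
import os
import subprocess

# Mode values and descriptions taken from the table above.
COMPILE_MODES = {
    0: "full decoder compile, with CUDA graphs",
    1: "full decoder compile, no CUDA graphs",
    2: "per-layer decoder compile, with CUDA graphs",
    3: "per-layer decoder compile, no CUDA graphs",
}


def build_compile_env(mode, base_env=None):
    """Return a copy of the environment with torch.compile enabled (hypothetical helper)."""
    if type(mode) is not int or mode not in COMPILE_MODES:
        raise ValueError(
            f"EOLE_COMPILE_MODE must be one of {sorted(COMPILE_MODES)}, got {mode!r}"
        )
    # Copy so the caller's environment is never modified in place.
    env = dict(os.environ if base_env is None else base_env)
    env["EOLE_TORCH_COMPILE"] = "1"
    env["EOLE_COMPILE_MODE"] = str(mode)
    return env


def run_predict(mode, config="inference.yaml", src="input.txt", output="output.txt"):
    """Run the CLI command from the example above with the chosen compile mode."""
    cmd = ["eole", "predict", "--config", config, "--src", src, "--output", output]
    return subprocess.run(cmd, env=build_compile_env(mode), check=True)
```

Calling `run_predict(2)` would then be equivalent to the shell example above. Validating the mode up front gives an error in the calling script if a value outside the table is passed.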

InferenceEngine

EOLE provides an InferenceEngine Python API that manages the inference lifecycle, including:

Managing GPU-bound inference work from the main application thread
Supporting non-streaming batch inference via infer_list
Supporting streaming, token-by-token generation via infer_list_stream

Example usage (Python API)

from eole.config.run import PredictConfig
from eole.inference_engine import InferenceEnginePY

config = PredictConfig(model_path="/path/to/model", gpu=0, compute_dtype="fp16")
engine = InferenceEnginePY(config)

results = engine.infer_list(["Hello, world!"])
for result in results:
    print(result)

engine.terminate()

Streaming

# Uses the engine created above; run this before calling engine.terminate().
for token in engine.infer_list_stream(["Hello, world!"]):
    print(token, end="", flush=True)
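
To keep the full output while still printing it as it arrives, the stream can be wrapped in a small collector. This sketch assumes the iterator yields string chunks, as the `print(token, end="")` call above suggests. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for `engine.infer_list_stream([...])` so the example runs without a model.

```python
import time


def collect_stream(chunks, echo=True):
    """Consume an iterator of string chunks.

    Returns (full_text, seconds_to_first_chunk). The latency is None if the
    stream produced nothing.
    """
    start = time.perf_counter()
    first_chunk_latency = None
    parts = []
    for chunk in chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    return "".join(parts), first_chunk_latency


def fake_stream():
    """Stand-in generator for engine.infer_list_stream([...])."""
    yield from ["Hel", "lo", ", ", "world", "!"]


text, latency = collect_stream(fake_stream(), echo=False)
```

The time to the first chunk is a convenient way to observe the compilation delay on the first run compared with later runs.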

Benchmarks

See benchmarks/genai/README.md for detailed performance comparisons.
