Skip to main content

How to use torch.compile for fast inference?

EOLE supports torch.compile for significantly accelerated inference, reaching speeds comparable to vLLM on GPU.

Enabling torch.compile

Set the EOLE_TORCH_COMPILE environment variable before running inference:

export EOLE_TORCH_COMPILE=1

You can also control the compile mode with EOLE_COMPILE_MODE:

ValueDescription
0Full compile (decoder as a whole with Cuda graphs)
1Same as 0 but no cuda graphs
2Decoder Layer compilation (faster to compile) with cuda graphs
3Same as 2 but no cuda graphs
export EOLE_TORCH_COMPILE=1
export EOLE_COMPILE_MODE=2
eole predict --config inference.yaml --src input.txt --output output.txt

Note: In mode "0" the first run will take 60-90 seconds to compile. Subsequent runs will be significantly faster.

InferenceEngine

EOLE provides an InferenceEngine Python API that manages the inference lifecycle, including:

  • Managing GPU-bound inference work from the main application thread
  • Support for non-streaming batch inference via infer_list
  • Support for streaming token-by-token generation via infer_list_stream

Example usage (Python API)

from eole.config.run import PredictConfig
from eole.inference_engine import InferenceEnginePY

config = PredictConfig(model_path="/path/to/model", gpu=0, compute_dtype="fp16")
engine = InferenceEnginePY(config)

results = engine.infer_list(["Hello, world!"])
for result in results:
print(result)

engine.terminate()

Streaming

for token in engine.infer_list_stream(["Hello, world!"]):
print(token, end="", flush=True)

Benchmarks

See benchmarks/genai/README.md for detailed performance comparisons.