
Serving models with Eole

The provided example configuration (serve.example.yaml) allows you to serve Llama3-8B-Instruct.

```yaml
models_root: "." # used only for HF downloads for now, but might override $EOLE_MODEL_DIR at some point
models:
  # local model
  - id: "llama3-8b-instruct"
    path: "${EOLE_MODEL_DIR}/llama3-8b-instruct"
    preload: false
    config:
      quant_layers: ['gate_up_proj', 'down_proj', 'up_proj', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
      quant_type: "bnb_NF4"
  # HF repo id, automatically downloaded to models_root
  - id: "llama3-8b-instruct-hf"
    path: "fhdz/llama3-8b-instruct"
    preload: true
```

Note: the preload flag allows loading the corresponding model at server startup. The two options for retrieving a model are described below.

Retrieve and convert model

Set environment variables

```bash
export EOLE_MODEL_DIR=<where_to_store_models>
export HF_TOKEN=<your_hf_token>
```

Option 1 - Download and convert model

The first example, "llama3-8b-instruct", requires you to manually convert the model into your desired $EOLE_MODEL_DIR.

```bash
eole convert HF --model_dir meta-llama/Meta-Llama-3-8B-Instruct --output $EOLE_MODEL_DIR/llama3-8b-instruct --token $HF_TOKEN
```
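
Once the conversion has finished, the converted checkpoint lives in the output directory. A quick sanity check (the exact file layout depends on your eole version, so treat this as a rough illustration):

```bash
# List the converted model directory; you should see the weights and
# configuration files written by `eole convert`.
ls -lh "$EOLE_MODEL_DIR/llama3-8b-instruct"
```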

Option 2 - Retrieve an already converted model from HF

The second example, "llama3-8b-instruct-hf", points to a model that has already been converted for the sake of this example; it is automatically downloaded from the Hugging Face hub into models_root.
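
The server handles the download for you, but if you prefer to fetch the repository ahead of time you can use the standard Hugging Face tooling. This is optional and only a sketch: it assumes huggingface_hub is installed, and the local target directory under models_root is an assumption, not something the example configuration specifies.

```bash
# Optional: pre-download the already-converted model from the Hugging Face hub.
# The target directory below is an assumption; by default the server downloads
# the repo to models_root on its own.
huggingface-cli download fhdz/llama3-8b-instruct --local-dir ./fhdz/llama3-8b-instruct
```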

Run server

```bash
eole serve -c serve.example.yaml
```
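
The server should come up on port 5000 (see the next section). Assuming that port, a quick way to check that it is responding is to hit FastAPI's auto-generated OpenAPI schema:

```bash
# FastAPI exposes the OpenAPI schema by default; a JSON response means the
# server is up. Adjust the port if you changed it in your configuration.
curl -s http://localhost:5000/openapi.json | head -c 200
```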

Play with the API

FastAPI exposes a Swagger UI by default. It should be accessible in your browser at http://localhost:5000/docs.
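
From there you can exercise the endpoints directly in the browser. If you prefer the command line, the request below is only a sketch: the /infer route and the payload fields are assumptions, so check the Swagger UI for the actual routes and schemas exposed by your eole version.

```bash
# Hypothetical inference request -- the /infer route and the "model"/"inputs"
# fields are assumptions; verify the real routes and schemas in the Swagger UI.
curl -s -X POST http://localhost:5000/infer \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-8b-instruct", "inputs": ["What is the capital of France?"]}'
```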