
Whisper

NOTE: To make your life easier, run these commands from the recipe directory (here recipes/whisper).

Supported models

Any HuggingFace Whisper checkpoint can be converted:

| Model | HuggingFace ID | Languages |
|---|---|---|
| tiny | openai/whisper-tiny | Multilingual |
| tiny.en | openai/whisper-tiny.en | English only |
| base | openai/whisper-base | Multilingual |
| base.en | openai/whisper-base.en | English only |
| small | openai/whisper-small | Multilingual |
| small.en | openai/whisper-small.en | English only |
| medium | openai/whisper-medium | Multilingual |
| medium.en | openai/whisper-medium.en | English only |
| large-v3 | openai/whisper-large-v3 | Multilingual |
| large-v3-turbo | openai/whisper-large-v3-turbo | Multilingual |

Prerequisites

Audio decoding requires FFmpeg libraries to be installed on your system:

apt install ffmpeg    # Debian/Ubuntu
brew install ffmpeg   # macOS

Quick start

Set environment variables

export EOLE_MODEL_DIR=<where_to_store_models>

Download and convert model

eole convert HF --model_dir openai/whisper-base.en --output ${EOLE_MODEL_DIR}/whisper-base-en-eole

Download sample audio files (optional)

bash download_samples.sh

This downloads a few public-domain audio clips and converts them to 16 kHz mono WAV. Requires wget and ffmpeg.
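The conversion performed by download_samples.sh amounts to a standard ffmpeg resample-and-downmix. The helper below only assembles that command rather than running it; the filenames are placeholders:

```python
def ffmpeg_to_16k_mono(src: str, dst: str) -> list[str]:
    """Build the ffmpeg command that converts an audio clip to the
    16 kHz mono WAV format the samples use. Paths are illustrative."""
    return [
        "ffmpeg",
        "-i", src,       # input clip (any format ffmpeg can decode)
        "-ar", "16000",  # resample to 16 kHz
        "-ac", "1",      # downmix to mono
        dst,             # output WAV path
    ]

cmd = ffmpeg_to_16k_mono("clip.mp3", "samples/clip.wav")
```

Pass the list to `subprocess.run(cmd, check=True)` to execute it.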

Run inference

The included audio_files.txt lists the English sample files. After downloading, run:

eole predict -config whisper_predict.yaml -model_path ${EOLE_MODEL_DIR}/whisper-base-en-eole -src ./audio_files.txt -output ./transcription.txt

Each line in the output corresponds to the audio file on the same line in the input.
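Because the two files are line-aligned, pairing each transcription with its source path is straightforward; a minimal sketch, using the file names from the command above:

```python
def pair_transcripts(src_path: str, out_path: str) -> list[tuple[str, str]]:
    """Pair each audio path in the input list with the transcription
    on the same line of the output file."""
    with open(src_path) as src, open(out_path) as out:
        return [(a.strip(), t.strip()) for a, t in zip(src, out)]
```

For example, `pair_transcripts("audio_files.txt", "transcription.txt")` yields `(audio_path, transcription)` tuples.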

Output modes

Plain text (default)

eole predict -config whisper_predict.yaml -src ./audio_files.txt -output ./text.txt

Segment timestamps

Outputs JSON with start/end times for each segment:

eole predict -config whisper_predict.yaml -src ./audio_files.txt -output ./segments.json -timestamps segment

Word timestamps

Outputs JSON with per-word timing via cross-attention DTW alignment:

eole predict -config whisper_predict.yaml -src ./audio_files.txt -output ./words.json -timestamps word

Note: word-level timestamps require a model with alignment_heads in its generation_config.json (e.g. whisper-base.en, whisper-small, whisper-large-v3).

Language and task

Multilingual Whisper models support specifying the source language and task.

Language hint

Force a specific source language (useful when auto-detection is unreliable):

eole predict -config whisper_predict.yaml -src ./audio_files.txt -language fr

Translation (to English)

Translation requires a multilingual model (not .en variants). First convert one:

eole convert HF --model_dir openai/whisper-small --output ${EOLE_MODEL_DIR}/whisper-small-eole

Then run with `-task translate` on a French audio sample (downloaded by download_samples.sh):

echo "samples/fr0.wav" > ./french_audio.txt
eole predict -config whisper_predict.yaml -model_path ${EOLE_MODEL_DIR}/whisper-small-eole -src ./french_audio.txt -output ./translation.txt -language fr -task translate

Or with segment timestamps:

eole predict -config whisper_predict_translate.yaml -model_path ${EOLE_MODEL_DIR}/whisper-small-eole -src ./french_audio.txt -output ./translation.json

Prompt conditioning

Use initial_prompt to condition the decoder output style and vocabulary. The prompt text is prepended as previous context using the <|startofprev|> mechanism:

eole predict -config whisper_predict.yaml -src ./audio_files.txt -initial_prompt "This is a presidential radio address."

This can help with:

  • Spelling of proper nouns and domain-specific terms
  • Output formatting and punctuation style
  • Steering the model toward a particular topic or register
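Conceptually, the prompt tokens are placed before Whisper's usual start-of-transcript prefix, so the model treats them as already-transcribed context. A schematic sketch, with special-token names as in Whisper's vocabulary and the prompt left as words rather than token IDs for readability:

```python
def decoder_prefix(prompt_words, language="en", task="transcribe"):
    """Schematic decoder input when initial_prompt is set: the prompt
    precedes the standard start-of-transcript prefix."""
    seq = []
    if prompt_words:
        seq.append("<|startofprev|>")
        seq.extend(prompt_words)  # in reality these are token IDs
    seq += ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    return seq
```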

Evaluation (WER)

WER scorer plugin

Eole includes a WerScorer plugin for computing Word Error Rate during training evaluation. It works like the existing BLEU and TER scorers:

scoring_metrics: ["WER"]

Requires the wer extra:

pip install eole[wer]    # published install
pip install -e .[wer]    # local/dev install

LibriSpeech test-clean benchmark

To measure WER on the standard LibriSpeech test-clean dataset (~2620 utterances, ~5.4 hours):

1. Download dataset

cd recipes/whisper/eval
bash download_librispeech.sh

This downloads and extracts test-clean.tar.gz (~350MB) from OpenSLR.

2. Run evaluation

python eval_librispeech.py --model_path $EOLE_MODEL_DIR/whisper-base-en-eole

Options:

| Flag | Default | Description |
|---|---|---|
| `--model_path` | required | Path to converted eole model |
| `--data_dir` | `./LibriSpeech/test-clean` | LibriSpeech directory |
| `--gpu` | `0` | GPU device ID (-1 for CPU) |
| `--beam_size` | `5` | Beam search width |
| `--condition_on_previous_text` | `false` | Condition on previous segment |
| `--output_dir` | `./results` | Where to write results |

The script normalises text using EnglishTextNormalizer from the whisper-normalizer package (same normaliser used by OpenAI and whisper.cpp) before computing WER via jiwer.
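For reference, WER is the word-level edit distance divided by the reference word count, which is the quantity jiwer returns. A self-contained sketch, without the normalisation step the script applies first:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance over the
    number of reference words (no text normalisation here)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance DP table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

One substitution in a three-word reference gives a WER of 1/3, and errors can push WER above 1.0 on short references.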

Config reference

| Field | Type | Default | Description |
|---|---|---|---|
| `model_path` | str | required | Path to converted eole model |
| `src` | str | required | File listing audio paths (one per line) |
| `output` | str | required | Output file path |
| `beam_size` | int | `5` | Beam search width |
| `length_penalty` | str | `"avg"` | Length penalty strategy. Use `none` for Whisper models |
| `max_length` | int | `250` | Maximum output tokens |
| `batch_size` | int | `1` | Batch size (use 1 for audio) |
| `gpu_ranks` | list | `[]` | GPU device IDs |
| `seed` | int | `-1` | Random seed. Set to 0+ for deterministic fallback sampling |
| `timestamps` | str | `"none"` | Output mode: `"none"`, `"segment"`, `"word"` |
| `language` | str | null | Source language code (e.g. `"en"`, `"fr"`, `"zh"`) |
| `task` | str | null | `"transcribe"` or `"translate"` |
| `initial_prompt` | str | null | Text prompt for decoder conditioning |
| `condition_on_previous_text` | bool | `false` | Condition next segment on previous segment's text |
| `fallback_temperatures` | list | `[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]` | Temperature cascade. Beam search at t=0, sampling at t>0. `[0.0]` to disable |
| `compression_ratio_threshold` | float | `2.4` | Retry at next temperature if gzip compression ratio exceeds this |
| `logprob_threshold` | float | `-1.0` | Retry at next temperature if avg log probability is below this |
| `no_speech_threshold` | float | `0.6` | Low logprob only triggers fallback when no-speech prob is also below this |
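The interaction of the last four fields can be sketched as follows; `decode_fn` is a hypothetical stand-in for one decoding pass, and the checks mirror the thresholds above:

```python
import gzip

def compression_ratio(text: str) -> float:
    """Raw size over gzip-compressed size; highly repetitive
    (degenerate) output compresses well and yields a high ratio."""
    raw = text.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

def decode_with_fallback(decode_fn,
                         temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                         compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0,
                         no_speech_threshold=0.6):
    """Try each temperature in turn and keep the first result that
    passes the quality checks. decode_fn(t) is assumed to return
    (text, avg_logprob, no_speech_prob) for one decoding pass."""
    result = None
    for t in temperatures:
        text, avg_logprob, no_speech_prob = result = decode_fn(t)
        if compression_ratio(text) > compression_ratio_threshold:
            continue  # repetitive output: retry at a higher temperature
        if avg_logprob < logprob_threshold and no_speech_prob < no_speech_threshold:
            continue  # low confidence on actual speech: retry hotter
        break  # result accepted
    return result
```

With `fallback_temperatures: [0.0]` the loop runs once and the result is kept regardless of the checks.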