# Llama2
> **NOTE**
> To make your life easier, run these commands from the recipe directory (here `recipes/llama2`).
## Retrieve and convert model

### Set environment variables
```bash
export EOLE_MODEL_DIR=<where_to_store_models>
export HF_TOKEN=<your_hf_token>
```
### Download and convert model

```bash
eole convert HF --model_dir meta-llama/Llama-2-7b-chat-hf --output $EOLE_MODEL_DIR/llama2-7b-chat-hf --token $HF_TOKEN
```
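Before moving on, it is worth checking that the conversion actually populated the target directory. A quick sanity check (the exact file names depend on the converter version, so treat the listing as indicative):

```bash
# The directory should contain the converted weights plus config and
# vocabulary files; exact names vary with the eole version.
ls -lh "$EOLE_MODEL_DIR/llama2-7b-chat-hf"
```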
## Inference

### Write test prompt to text file
echo -e "<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
What are some nice places to visit in France? [/INST]" | sed ':a;N;$!ba;s/\n/⦅newline⦆/g' > test_prompt.txt
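The `sed` expression collapses the multi-line prompt into a single line, replacing each literal newline with the ⦅newline⦆ placeholder, since `eole predict` treats each line of the source file as one independent example. A quick way to confirm the encoding worked:

```bash
# Expect exactly one line (one example) in the prompt file.
wc -l test_prompt.txt
# Count the encoded line breaks (one per newline in the original prompt).
grep -o '⦅newline⦆' test_prompt.txt | wc -l
```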
### Run inference

#### Single GPU

```bash
eole predict -c llama-inference.yaml -src test_prompt.txt -output test_output.txt
```
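The command reads the model location and decoding settings from `llama-inference.yaml`. As a rough orientation, a minimal single-GPU config might look like the sketch below; the key names are assumptions based on common eole prediction options, and the `llama-inference.yaml` shipped with this recipe is the reference:

```yaml
# Illustrative sketch only -- consult the recipe's llama-inference.yaml.
model_path: ${EOLE_MODEL_DIR}/llama2-7b-chat-hf  # converted model from the step above
world_size: 1    # single process
gpu_ranks: [0]   # run on GPU 0
max_length: 256  # cap on the number of generated tokens
```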
#### Dual GPU (tensor parallelism)

```bash
eole predict -c llama-inference-tp-2gpu.yaml -src test_prompt.txt -output test_output.txt
```
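The tensor-parallel config should differ from the single-GPU one mainly in its parallelism settings. A sketch of the likely delta, assuming eole keeps the `world_size`/`gpu_ranks`/`parallel_mode` options of its OpenNMT-py lineage (the shipped `llama-inference-tp-2gpu.yaml` is authoritative):

```yaml
world_size: 2                     # two processes, one per GPU
gpu_ranks: [0, 1]                 # devices participating in the run
parallel_mode: "tensor_parallel"  # shard the weights across both GPUs
```

Since the prompt's newlines were encoded before inference, the prediction in `test_output.txt` may carry the same ⦅newline⦆ placeholder. With GNU sed you can map it back for reading:

```bash
# Decode the ⦅newline⦆ placeholder back into real line breaks.
sed 's/⦅newline⦆/\n/g' test_output.txt
```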