August 2024

TIL: Quantize and use Llama 3.1 with llama.cpp on a Mac

“Today I Learned” are instructions-focused posts about things I’ve learned.

I’ll show you how to download Llama 3.1 8B from HuggingFace, convert it to the GGUF format used by the popular llama.cpp tool, and quantize it with that same tool. I’ll also explain what all those words mean.

Setup §

You’ll need Xcode, Python, and pipx installed. I think that’s it.
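
If any of those are missing, something like this should cover it (a sketch assuming you use Homebrew; the Xcode Command Line Tools are typically enough, you don’t need the full Xcode app):

$ xcode-select --install
$ brew install python pipx
$ pipx ensurepath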

Then, get and build llama.cpp:

$ git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
$ make
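
The build can take a little while; if you want to use all your CPU cores, this should work on a Mac:

$ make -j$(sysctl -n hw.ncpu)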

Llama.cpp is a pure C/C++ implementation of an inference engine for Meta’s Llama family of Large Language Models.

A Large Language Model, or LLM for short, is a big pile of statistical numbers that, given a text, can give you the likely next word. But it also feels like magic to use. An inference engine is just something you can use to “run” this LLM.

Getting Llama 3.1 8B Instruct §

We get the model from HuggingFace, which is a large platform for distributing and running LLMs. First, ask for access to the model called Llama 3.1 8B Instruct, because approval can take a while.

Then, generate an access token (a write token will do, just keep it secret), and install the HuggingFace CLI and log in with the token:

$ pipx install -U huggingface_hub
$ huggingface-cli login
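
If you’d rather not paste the token interactively, I believe the CLI also accepts it directly (check huggingface-cli login --help to be sure):

$ huggingface-cli login --token "$HF_TOKEN"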

Once you have access (you’ll get an email), you can download the model:

$ mkdir -p models/meta-llama
$ huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
--exclude "original/*" \
--local-dir models/meta-llama/Meta-Llama-3.1-8B-Instruct
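
You can sanity-check the download by listing the directory; you should see a handful of .safetensors shards plus config and tokenizer files (exact file names may vary):

$ ls -lh models/meta-llama/Meta-Llama-3.1-8B-Instruct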

What’s with those numbers and weird words in the model name?

8B refers to the number of numbers (the parameters) in the big pile of numbers. :D So in this case, it’s short for 8 billion. 3.1 is the version number. Instruct means that the model has been instruction-tuned, so instead of just finishing a sentence like “the capital of Denmark is…”, it’ll follow instructions you give it, to the extent possible.

Converting to GGUF and quantizing §

Once the model is downloaded, we can use llama.cpp to convert it to GGUF and quantize it.

GGUF is just a file format, invented by Georgi Gerganov, which contains the LLM and some metadata, like the instructions for how to talk to it.

Quantizing means reducing the precision of the individual numbers in the model, which makes it smaller as well as easier and faster to run.
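
To get a feel for the sizes: 8 billion parameters at 16 bits (2 bytes) each is roughly 8 × 2 = 16 GB on disk, while quantizing to around 5 bits per number lands in the 5 to 6 GB range (exact sizes vary a bit by quantization scheme).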

We do the conversion to GGUF with a Python script included with llama.cpp:

$ python3 -m venv .env
$ source .env/bin/activate
$ python3 -m pip install -r requirements.txt
$ python3 convert_hf_to_gguf.py models/meta-llama/Meta-Llama-3.1-8B-Instruct
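
If memory serves, the conversion script also takes an --outtype flag, e.g. q8_0 to quantize to 8-bit directly during conversion (check python3 convert_hf_to_gguf.py --help):

$ python3 convert_hf_to_gguf.py --outtype q8_0 models/meta-llama/Meta-Llama-3.1-8B-Instruct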

The quantization process is similarly easy:

$ ./llama-quantize models/meta-llama/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-F16.gguf \
models/meta-llama/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
Q5_K_M

Note the Q5_K_M, which is the precision to quantize to, in this case 5-bit numbers. A lower number means lower quality; a higher number is generally (but not necessarily meaningfully) better. The _K_M part means that different parts of the model are quantized at different precisions. See the available quantization options with ./llama-quantize -h.

Using the model §

Let’s talk to a big pile of numbers.

Run llama-cli to start an interactive chat session with your newly quantized model:

$ ./llama-cli --conversation --log-disable --no-display-prompt -m models/meta-llama/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf

> Describe the facial expression of a goat wearing a party hat, in one sentence.
The goat's facial expression is a comically bemused mixture of confusion and delight,
with its eyebrows slightly raised and a hint of a tongue-out grin,
as if to say "I have no idea why I'm wearing this silly hat, but I'm kind of enjoying it."
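
If you just want a one-shot answer instead of a chat session, something like this should also work (flag names as of the llama.cpp version I built; check ./llama-cli --help):

$ ./llama-cli -m models/meta-llama/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf -p "Describe a goat wearing a party hat, in one sentence." -n 128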