We tried four quantization techniques on Llama-3 and Mistral models. Here are the quality vs. cost trade-offs we found, plus what works for production inference.
For the past several months we've been running our own LLM inference for some workloads. Quantization is the lever that determines whether self-hosted inference is cost-competitive with managed APIs. We benchmarked four quantization techniques on Llama-3-8B and Mistral-7B. This post is what we found, where each technique fits, and what to use for production.
The basics: an LLM's weights are stored as floating point numbers. By default, training uses fp32 (4 bytes per weight) or bf16 (2 bytes). For inference, you don't need that precision. Quantization reduces the bits per weight — to 8, 4, or even 2 bits — at some cost in quality.
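To make the bytes-per-weight arithmetic concrete, here is a minimal sketch that estimates the size of the weights alone at different precisions. The round figure of 8 billion parameters is an assumption for illustration, and the helper name `weight_size_gb` is hypothetical.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a round 8 billion parameters for an "8B" model.
PARAMS_8B = 8e9

if __name__ == "__main__":
    for name, bits in [("fp32", 32), ("bf16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
        print(f"{name:>5}: {weight_size_gb(PARAMS_8B, bits):.1f} GB")
```

This is a lower bound. Real 4-bit files tend to come out larger than the ideal 4 GB, presumably because quantization scales are stored alongside the weights and some tensors are kept at higher precision.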
Why this matters:
The trade-off is quality. Aggressive quantization (low bit counts) starts to hurt model output quality. The question is how much, and on which tasks.
Models: Llama-3-8B-Instruct, Mistral-7B-Instruct (both bf16 baselines).
Quantization techniques:
GGUF Q4_K_M: llama.cpp. Mixed precision (some tensors at 4-bit, sensitive ones at higher precision).
nf4: the bnb library's "normal float 4" quantization. Used by QLoRA for fine-tuning.
Benchmarks:
For Llama-3-8B:
| Quantization | Size | Mem (1 req) | Tok/s | Quality |
|---|---|---|---|---|
| bf16 baseline | 16 GB | 17 GB | 78 | 100% |
| nf4 (bnb) | 5.5 GB | 7 GB | 95 | 96% |
| AWQ-4bit | 5.4 GB | 6 GB | 165 | 97% |
| GPTQ-4bit | 5.4 GB | 6 GB | 155 | 95% |
| GGUF Q4_K_M | 5.0 GB | 5.5 GB | 110 | 96% |
Quality is normalized so the bf16 baseline scores 100%; lower is worse on the eval set.
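One way to read the table is relative to the bf16 baseline. The sketch below copies the rows above and derives throughput speedup, memory fraction, and quality drop. The function and field names are hypothetical.

```python
# Rows copied from the Llama-3-8B table: (size GB, memory GB, tokens/s, quality %).
RESULTS = {
    "bf16 baseline": (16.0, 17.0, 78, 100),
    "nf4 (bnb)": (5.5, 7.0, 95, 96),
    "AWQ-4bit": (5.4, 6.0, 165, 97),
    "GPTQ-4bit": (5.4, 6.0, 155, 95),
    "GGUF Q4_K_M": (5.0, 5.5, 110, 96),
}


def relative_to_baseline(results, baseline="bf16 baseline"):
    """Express each row as ratios against the baseline row."""
    _, base_mem, base_tps, base_quality = results[baseline]
    summary = {}
    for name, (_size, mem, tps, quality) in results.items():
        summary[name] = {
            "speedup": tps / base_tps,            # >1 means faster than bf16
            "mem_fraction": mem / base_mem,       # <1 means less memory than bf16
            "quality_drop_pts": base_quality - quality,
        }
    return summary


if __name__ == "__main__":
    for name, row in relative_to_baseline(RESULTS).items():
        print(f"{name:<14} {row['speedup']:.2f}x  "
              f"{row['mem_fraction']:.0%} mem  -{row['quality_drop_pts']} pts")
```

On these numbers AWQ-4bit runs at a little over 2x the bf16 throughput while using roughly a third of the memory.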
A few takeaways:
The eval set was diverse; quality drops weren't uniform across categories:
If your task is "follow this format and respond using the provided context" (RAG-style), 4-bit quantization is great. If your task is "write working Python code from scratch," 4-bit hurts more.
For some teams, this means: 4-bit for RAG and conversation; bf16 (or 8-bit) for code generation tasks.
For our use cases, AWQ-4bit on Llama-3-8B turned out to be the right answer for self-hosted inference. Reasons:
For one specific use case (a coding-assistant feature) we use 8-bit instead of 4-bit because the code quality drop wasn't acceptable.
For inference, we use vLLM as the serving framework:
OpenAI-compatible API (works with existing openai.ChatCompletion.create client code)
A single A10G can serve our quantized 8B model at ~150 tokens/s aggregate, supporting roughly 30-50 concurrent users with reasonable latency. Cost on AWS Spot: ~$0.30/hr. Compared to GPT-4o-mini API costs at our token volume, the break-even is around 200k tokens/hour; we're past that on most days.
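The break-even arithmetic can be sketched as follows. The blended API price is not stated in the post, so the code only back-computes the price implied by the two figures above ($0.30/hr and 200k tokens/hour). It is an illustration, not a published rate, and it ignores operational overhead. Function names are hypothetical.

```python
def break_even_tokens_per_hour(gpu_cost_per_hour: float, api_price_per_million: float) -> float:
    """Hourly token volume at which API spend equals the GPU's hourly cost."""
    return gpu_cost_per_hour / api_price_per_million * 1_000_000


def implied_api_price_per_million(gpu_cost_per_hour: float, break_even_tokens: float) -> float:
    """Blended API price (USD per million tokens) implied by a given break-even volume."""
    return gpu_cost_per_hour / break_even_tokens * 1_000_000


def hourly_capacity(tokens_per_second: float) -> float:
    """Upper bound on tokens served per hour at a sustained aggregate rate."""
    return tokens_per_second * 3600


if __name__ == "__main__":
    price = implied_api_price_per_million(0.30, 200_000)
    print(f"implied blended API price: ${price:.2f} per million tokens")
    print(f"GPU capacity at 150 tok/s: {hourly_capacity(150):,.0f} tokens/hour")
```

Plugging in your own blended input/output price and GPU rate gives the break-even volume for your workload. The capacity figure shows how much headroom a single GPU has above that volume.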
When we fine-tune (LoRA), the workflow:
For our use case (one fine-tune per language we serve), we merge and re-quantize. Each merged+quantized model is ~5GB and serves with the same speed as the base.
Things that bit us:
Quantizing a fresh download can fail silently. Some versions of GPTQ libraries fall back to fp16 without warning if the quantization doesn't converge. We always check the model size on disk after quantization to verify that the weights were actually quantized.
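A size check like this could look roughly as follows. This is a sketch, not the exact check used here: the file suffixes, the function names, and the 0.5 ratio threshold are all assumptions.

```python
from pathlib import Path

# Assumption: weight files use one of these suffixes.
WEIGHT_SUFFIXES = {".safetensors", ".bin", ".pt", ".gguf"}


def weights_size_gb(model_dir: str) -> float:
    """Total size of weight files under model_dir, in GB (10^9 bytes)."""
    total = sum(
        p.stat().st_size
        for p in Path(model_dir).rglob("*")
        if p.is_file() and p.suffix in WEIGHT_SUFFIXES
    )
    return total / 1e9


def looks_quantized(model_dir: str, baseline_gb: float, max_ratio: float = 0.5) -> bool:
    """True if the weights on disk are at most max_ratio of the baseline size.

    A 4-bit model should land near a third of the bf16 size; a silent
    fp16 fallback would land near 1.0, so 0.5 is a hypothetical cut-off.
    """
    return weights_size_gb(model_dir) <= baseline_gb * max_ratio
```

For example, `looks_quantized("out/llama3-awq", baseline_gb=16.0)` (with a hypothetical output path) would fail for a model that came out at roughly the baseline size.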
KV-cache memory dominates. Even with quantized weights, the per-request KV cache is fp16 by default. For long contexts, KV cache memory can exceed weight memory. vLLM's PagedAttention helps; we also use 8-bit KV cache for high-context workloads.
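A back-of-the-envelope calculation shows how quickly the cache grows. The model shape below (32 layers, 8 KV heads, head dimension 128) is the published Llama-3-8B configuration as best we know it; treat it as an assumption and substitute the values from your model's config. The estimate ignores any scale metadata an 8-bit cache may need.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def tokens_to_match_weights(weight_gb: float, bytes_per_token: int) -> int:
    """Number of cached tokens whose KV cache equals the weight memory."""
    return int(weight_gb * 1e9 // bytes_per_token)


# Assumed Llama-3-8B shape (grouped-query attention with 8 KV heads).
LLAMA3_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

fp16_per_token = kv_cache_bytes_per_token(**LLAMA3_8B, bytes_per_value=2)
int8_per_token = kv_cache_bytes_per_token(**LLAMA3_8B, bytes_per_value=1)

if __name__ == "__main__":
    print(f"fp16 KV cache: {fp16_per_token / 1024:.0f} KiB per token")
    print(f"int8 KV cache: {int8_per_token / 1024:.0f} KiB per token")
    print(f"tokens to match 5.4 GB of weights (fp16): "
          f"{tokens_to_match_weights(5.4, fp16_per_token):,}")
```

Under these assumptions the fp16 cache costs 128 KiB per token, so roughly 41k cached tokens, summed across all in-flight requests, take as much memory as the 5.4 GB of AWQ weights. An 8-bit cache doubles that budget.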
Tokenizer mismatches. Some quantized models on Hugging Face have subtly different tokenizers than the base model. Symptoms: garbage outputs. Always pair the quantized weights with the matching tokenizer.
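A cheap first check is to compare the tokenizer files of the base and quantized model directories byte for byte. The sketch below assumes both models are downloaded locally and that the listed file names are the relevant ones; the function names are hypothetical. A byte difference can be benign (formatting only), so treat a mismatch as a prompt to investigate, for example by encoding a few sample strings with both tokenizers.

```python
import hashlib
from pathlib import Path

# Assumption: these are the tokenizer files worth comparing; adjust per model.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json", "special_tokens_map.json")


def file_sha256(path: Path) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def tokenizer_diff(base_dir: str, quant_dir: str) -> list[str]:
    """Return tokenizer files that exist on only one side or differ in content."""
    mismatched = []
    for name in TOKENIZER_FILES:
        base_file = Path(base_dir) / name
        quant_file = Path(quant_dir) / name
        if base_file.exists() != quant_file.exists():
            mismatched.append(name)
        elif base_file.exists() and file_sha256(base_file) != file_sha256(quant_file):
            mismatched.append(name)
    return mismatched
```

An empty list means the compared files are identical; anything else names the files to look at.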
Long-context degradation. Quantization quality often gets worse as context length grows (compounding errors). For our long-context use cases (>8k tokens), we use a less aggressive quantization (8-bit) to compensate.
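The two rules described in this post (8-bit for the coding-assistant feature, 8-bit for contexts above 8k tokens, 4-bit otherwise) can be written down as a small routing function. This is an illustrative sketch: the task labels and function name are hypothetical, and how requests are actually dispatched to model deployments is not covered here.

```python
LONG_CONTEXT_TOKENS = 8_000  # threshold from the text above (>8k tokens)


def pick_precision(task: str, context_tokens: int) -> str:
    """Choose a quantization level for a request.

    task is a hypothetical label such as "rag", "chat", or "code".
    """
    if task == "code":
        return "8-bit"  # code quality dropped too much at 4-bit
    if context_tokens > LONG_CONTEXT_TOKENS:
        return "8-bit"  # long contexts degrade more under 4-bit
    return "4-bit"


if __name__ == "__main__":
    for task, ctx in [("rag", 2_000), ("code", 1_000), ("chat", 12_000)]:
        print(f"{task:<5} ctx={ctx:>6} -> {pick_precision(task, ctx)}")
```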
2-bit quantization (e.g., AQLM). Quality dropped too much for our use cases (15-25% on the eval set). Maybe useful for very large models (70B+) where 2-bit lets you fit on a single GPU. For 8B, 4-bit is the right floor.
Custom quantization training (training a model directly in low-bit). Active research area but for our use case, post-training quantization of an off-the-shelf model is good enough.
Hand-tuned per-layer quantization (sensitive layers higher precision, others lower). GGUF Q4_K_M does this in a managed way; trying to do it more aggressively was lots of work for marginal gain.
Quantization helps if you're going to self-host. If you're using a managed API (OpenAI, Anthropic), quantization is invisible — those providers do their own optimizations.
Self-hosting with quantization makes sense when:
For most teams, the answer is: stick with managed APIs. The optimization story for quantization only matters once self-hosting is the right call.
Pick AWQ-4bit if you're using vLLM. It's well-supported, fast, and quality is good.
Benchmark quality on your actual eval set, not someone else's. "MMLU drops 2%" doesn't tell you what happens on your specific task.
Monitor inference quality after deployment. Quantized models can degrade on edge-case inputs that the initial benchmark did not cover. Have an eval suite that runs regularly.
Don't over-quantize. 4-bit is the sweet spot for most production inference. 2-bit and below are research territory.
8-bit is a safe fallback for tasks where 4-bit hurts. The memory savings are smaller, but quality is preserved.
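For the monitoring recommendation above, a recurring eval run can reduce to a simple gate: normalize the quantized model's score against the bf16 baseline, as in the benchmark table, and flag the run when it falls below a floor. The 95% floor and the function names are assumptions for illustration, and how the raw scores are produced depends on your own eval set.

```python
def normalized_quality(score: float, baseline_score: float) -> float:
    """Quality as a percentage of the baseline score (baseline = 100%)."""
    return 100.0 * score / baseline_score


def check_regression(score: float, baseline_score: float,
                     min_quality_pct: float = 95.0) -> tuple[bool, float]:
    """Return (passed, quality_pct); passed is False below the floor."""
    quality = normalized_quality(score, baseline_score)
    return quality >= min_quality_pct, quality


if __name__ == "__main__":
    passed, quality = check_regression(score=97, baseline_score=100)
    print(f"quality={quality:.1f}% passed={passed}")
```

Running the gate per task category, rather than on one aggregate score, keeps a drop in one category (such as code) from being averaged away.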
Quantization is mature enough that it's a real production option. The tooling (vLLM, AutoAWQ, GPTQ libraries, llama.cpp) is solid; the quality trade-offs are well-characterized; the cost wins are real if you have the volume to amortize the operational complexity. The decision tree is mostly: do you have the volume, and are the quality trade-offs acceptable for your tasks? If yes to both, quantized self-hosted inference is likely cheaper than the API alternative.