I fine-tuned Llama 3 8B on a single 4090 over a weekend for a side project. Here's what worked, what cost more than expected, and what I'd do differently.

Fine-Tuning Llama 3 on Consumer Hardware

A friend asked me to help fine-tune a small Llama 3 8B model on a domain-specific dataset for a side project. They didn't want to rent cloud GPUs; we did the whole thing on a single RTX 4090 over a weekend. The end result worked well enough for the project. The path was bumpier than I'd hoped. This post is what I'd tell someone trying the same thing tomorrow.

The setup #

Hardware: RTX 4090 (24GB VRAM), Ryzen 9 5950X, 64GB DDR4 system RAM, NVMe SSD
Software: Ubuntu 22.04, CUDA 12.1, PyTorch 2.3, Hugging Face transformers + peft + bitsandbytes
Base model: Llama 3 8B Instruct (the smallest Llama 3 variant)
Dataset: ~3,500 instruction-response pairs in a domain-specific format (programming-related)
Goal: improve the model's behavior on the specific format and content of this niche

Choice 1: full fine-tune vs LoRA #

Full fine-tuning of Llama 3 8B in 16-bit precision needs at least 60GB of VRAM for weights, gradients, and optimizer state (8B params), and more once activations or fp32 optimizer state are counted. Not happening on a 4090.
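A back-of-envelope version of that estimate, as a sketch. The bytes-per-parameter figures are assumptions about a typical AdamW setup, not measurements from our run, and activations are ignored, so read the outputs as lower bounds.

```python
# Rough training-memory estimate for full fine-tuning; activations are ignored.
PARAMS = 8.03e9  # approximate Llama 3 8B parameter count (assumption)

def full_finetune_gb(params, weight_bytes, grad_bytes, optimizer_bytes):
    """Return GB (1e9 bytes) needed for weights, gradients and optimizer state."""
    return params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9

# Assumption: everything in 16-bit, AdamW keeping two 2-byte moments per param.
pure_16bit = full_finetune_gb(PARAMS, 2, 2, 4)

# Assumption: mixed precision with fp32 master weights and fp32 AdamW moments.
mixed_precision = full_finetune_gb(PARAMS, 2, 2, 12)

print(f"pure 16-bit:     ~{pure_16bit:.0f} GB")
print(f"mixed precision: ~{mixed_precision:.0f} GB")
```

Either way the total is several times what a 24GB card can hold.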

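LoRA (Low-Rank Adaptation) trains only a small adapter on top of the frozen base model, so gradients and optimizer state are needed only for the adapter weights. The frozen base still has to sit in VRAM, but the training overhead on top of it becomes small. We picked LoRA.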

The configuration:

python.python

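from peft import LoraConfig, get_peft_model

# base_model is the 4-bit quantized model loaded in the next section (Choice 2).
lora_config = LoraConfig(
    r=16,                # rank of the low-rank update matrices
    lora_alpha=32,       # scaling factor; the update is scaled by lora_alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    bias="none",         # leave bias terms frozen
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)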

LoRA adds well under 1% additional parameters as adapter weights (roughly 0.2% with this config). Those are what we train. Training time and VRAM requirements drop accordingly.

A common LoRA starting point for Llama 3: rank 16, alpha 32, targeting the attention projection modules. We didn't experiment much beyond these values; they worked.
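For a sense of how small the adapter is, here is a quick count for the config above. The layer dimensions are my assumptions about Llama 3 8B (hidden size 4096, 32 layers, 1024-wide key/value projections from grouped-query attention); they are not something we measured. Under those assumptions the adapter lands around 13.6M parameters, well under 1% of the base model.

```python
# Count LoRA parameters for r=16 on the four attention projections.
# Assumed Llama 3 8B dimensions (not stated in the post).
HIDDEN, KV_DIM, LAYERS, RANK = 4096, 1024, 32, 16
BASE_PARAMS = 8.03e9  # approximate total parameter count (assumption)

# (in_features, out_features) for each targeted module
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

# Each adapted layer adds A (r x in) and B (out x r): r * (in + out) params.
per_layer = sum(RANK * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
lora_params = per_layer * LAYERS
share_percent = 100 * lora_params / BASE_PARAMS

print(f"LoRA params: {lora_params:,}")
print(f"Share of base model: {share_percent:.2f}%")
```

To check the real number on your own model, peft's `model.print_trainable_parameters()` reports it directly.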

Choice 2: 4-bit quantization for the base model #

Even with LoRA, the base model itself takes ~16GB at fp16 (8B params × 2 bytes). Combined with activations and optimizer state, you're at the edge of 24GB VRAM.

bitsandbytes 4-bit quantization shrinks the base model to ~5GB. Combined with LoRA, the whole training fits comfortably in 24GB with room for batch size 4-8.

python.python

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for compute after dequantization
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

This is the QLoRA approach (4-bit base + LoRA adapter). Works well on consumer hardware.
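Where the ~16GB and ~5GB figures come from, as a rough sketch. The assumptions are mine: about 8.03B total parameters, the input embedding and output head left in 16-bit, and roughly 4.127 bits per quantized parameter for NF4 with double quantization. Real usage will differ somewhat.

```python
# Estimate base-model weight memory at fp16 versus 4-bit NF4.
TOTAL_PARAMS = 8.03e9               # approximate (assumption)
UNQUANTIZED = 2 * 128_256 * 4_096   # embedding + output head, assumed kept in 16-bit
QUANTIZED = TOTAL_PARAMS - UNQUANTIZED
NF4_BITS = 4.127                    # assumed bits per param incl. quantization constants

fp16_gb = TOTAL_PARAMS * 2 / 1e9
nf4_gb = (QUANTIZED * NF4_BITS / 8 + UNQUANTIZED * 2) / 1e9

print(f"fp16 weights: ~{fp16_gb:.1f} GB")
print(f"NF4 weights:  ~{nf4_gb:.1f} GB")
```

The estimate covers weights only; activations, adapter gradients, and optimizer state come on top.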

Step 3: dataset preparation, the part that took longest #

The dataset prep took longer than the actual training. We had ~3,500 instruction-response pairs in a CSV. They needed to be:

Reformatted into Llama 3's instruction template
Tokenized
Filtered for quality (some were too short, some had formatting bugs)
Split into train/eval
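The filtering and splitting steps can be sketched roughly like this. The field names (`instruction`, `response`), the 20-character minimum, and the 10% eval fraction are illustrative assumptions, not the exact values we used.

```python
import random

def filter_and_split(pairs, min_chars=20, eval_fraction=0.1, seed=0):
    """Drop too-short pairs, then split deterministically into (train, eval)."""
    clean = [
        p for p in pairs
        if len(p["instruction"].strip()) >= min_chars
        and len(p["response"].strip()) >= min_chars
    ]
    shuffled = clean[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Tiny made-up dataset: ten usable pairs and two that are too short.
pairs = [
    {"instruction": f"Explain what function number {i} does", "response": "It returns the parsed value. " * 2}
    for i in range(10)
] + [
    {"instruction": "hi", "response": "ok"},
    {"instruction": "Explain what this function does", "response": "   "},
]
train, eval_set = filter_and_split(pairs)
print(len(train), len(eval_set))
```

A fixed seed matters more than it looks: if the split changes between runs, eval numbers stop being comparable.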

Llama 3's instruction template is specific:

code

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_response}<|eot_id|>

Get this wrong and the model fine-tunes on the wrong tokens; it won't learn the response pattern correctly. We spent a couple hours debugging quality issues that turned out to be a missing <|eot_id|> between turns.
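A small sketch of rendering and sanity-checking that template in plain Python. The helper names are hypothetical, and the check is deliberately cheap: it would have caught our missing `<|eot_id|>`, but it is not a full validator.

```python
def format_example(system_prompt, user_message, assistant_response):
    """Render one training example in the Llama 3 instruction template."""
    def turn(role, content):
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"
    return (
        "<|begin_of_text|>"
        + turn("system", system_prompt)
        + turn("user", user_message)
        + turn("assistant", assistant_response)
    )

def check_example(text):
    """Structural check: every header must be matched by an <|eot_id|>."""
    return (
        text.startswith("<|begin_of_text|>")
        and text.endswith("<|eot_id|>")
        and text.count("<|start_header_id|>") == text.count("<|eot_id|>")
    )

good = format_example("You are a code assistant.", "What does zip() do?", "It pairs items.")
broken = good.replace("<|eot_id|><|start_header_id|>assistant", "<|start_header_id|>assistant")
print(check_example(good), check_example(broken))
```

In practice the tokenizer's `apply_chat_template` method can produce this format for you. If you build the string by hand as above, check whether your tokenizer also prepends its own begin-of-text token, since that would duplicate it; passing `add_special_tokens=False` when tokenizing is one way to avoid that.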

Step 4: training loop #

Total training time: ~14 hours for 3 epochs over 3,500 examples on the 4090. The actual training command:

bash.bash

python train.py \
  --model_name "meta-llama/Meta-Llama-3-8B-Instruct" \
  --dataset_path "data/train.jsonl" \
  --eval_dataset_path "data/eval.jsonl" \
  --output_dir "./output/llama3-domain" \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --warmup_ratio 0.03 \
  --logging_steps 10 \
  --save_steps 100 \
  --eval_steps 100

Effective batch size: 4 × 4 = 16. A learning rate of 2e-4 is a common starting point for LoRA supervised fine-tuning (SFT); we didn't experiment with it.
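The step arithmetic behind those flags, using the numbers from this post. This is a sketch: the exact step count depends on how your trainer rounds partial batches, and on how many examples are held out for eval.

```python
import math

examples, epochs = 3_500, 3
per_device_batch, grad_accum = 4, 4
warmup_ratio = 0.03
wall_clock_hours = 14

effective_batch = per_device_batch * grad_accum           # 16
steps_per_epoch = math.ceil(examples / effective_batch)   # 219
total_steps = steps_per_epoch * epochs                    # 657
warmup_steps = math.ceil(total_steps * warmup_ratio)      # 20
seconds_per_step = wall_clock_hours * 3600 / total_steps

print(f"{total_steps} optimizer steps, {warmup_steps} warmup steps")
print(f"~{seconds_per_step:.0f} s per optimizer step")
```

With `--save_steps 100` and `--eval_steps 100`, that is about six checkpoints and six eval passes over the whole run.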

Memory usage during training: ~20GB VRAM. That leaves about 4GB of headroom on the 4090: stable for typical batches, but not by much.

Power consumption: the 4090 drew 350-400W during training. Over ~14 hours that is roughly 5 to 5.5 kWh of energy for the GPU alone. At local electricity rates, ~$1 of electricity. Cheaper than renting equivalent cloud GPU time.
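The same estimate as a few lines of arithmetic. The price per kWh is an assumed placeholder, so substitute your own rate; the rest of the system (CPU, fans, PSU losses) is not counted.

```python
hours = 14
gpu_watts_low, gpu_watts_high = 350, 400
price_per_kwh = 0.18  # assumed rate in USD; substitute your own

kwh_low = gpu_watts_low * hours / 1000    # 4.9 kWh
kwh_high = gpu_watts_high * hours / 1000  # 5.6 kWh
cost_low = kwh_low * price_per_kwh
cost_high = kwh_high * price_per_kwh

print(f"Energy: {kwh_low:.1f}-{kwh_high:.1f} kWh")
print(f"Cost:   ${cost_low:.2f}-${cost_high:.2f}")
```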

What broke during training #

Training crashed twice with CUDA out-of-memory (OOM) errors, at seemingly random steps. The cause: occasional batches containing unusually long tokenized examples. We added a hard max_length=2048 truncation and the crashes stopped.
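A minimal sketch of that cap, applied to already-tokenized examples. The function name is hypothetical; it also counts how many examples were cut, which is worth knowing before you train.

```python
def cap_lengths(tokenized_examples, max_length=2048):
    """Truncate token-id lists to max_length; return (capped, number truncated)."""
    capped, n_truncated = [], 0
    for ids in tokenized_examples:
        if len(ids) > max_length:
            n_truncated += 1
        capped.append(ids[:max_length])
    return capped, n_truncated

# Made-up lengths: two normal examples and one outlier.
examples = [list(range(300)), list(range(1200)), list(range(5000))]
capped, n_truncated = cap_lengths(examples)
print(max(len(ids) for ids in capped), n_truncated)
```

Hugging Face tokenizers can do the same thing at tokenization time with `truncation=True, max_length=2048`. Either way, a truncated example loses its closing `<|eot_id|>`, so if many examples are affected it may be better to drop or shorten them upstream.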

The other thing that bit us: the eval loss reported by the trainer was misleading. It was computed on a tiny eval set (~50 examples) and didn't reflect how the model actually performed. Real evaluation came from manually running the model on test queries after training.

Evaluating the result #

We had no rigorous eval framework. For this project, evaluation was:

30 hand-curated test queries (representative of expected usage)
Manually scoring each fine-tuned response: better, same, worse than base model

Result on 30 queries: 22 better, 6 same, 2 worse. For the side project, that was good enough to ship.
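For what it's worth, here is the tally as code, plus a two-sided sign test on the non-tied queries. The sign test is an addition for illustration; we did not run it at the time, and it says nothing about rater bias, since one person scored unblinded responses.

```python
from math import comb

better, same, worse = 22, 6, 2
total = better + same + worse

win_rate = better / total
loss_rate = worse / total

# Sign test: ignore ties, ask how likely a split this lopsided would be
# if "better" and "worse" were equally probable.
n = better + worse
k = min(better, worse)
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n)

print(f"win rate {win_rate:.0%}, loss rate {loss_rate:.0%}, sign-test p = {p_value:.1e}")
```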

A more rigorous setup would use a judge LLM on a larger eval set. We didn't bother.

What we'd do differently #

Start with 1,000 examples, see if it works, then scale up. We trained on the full 3,500 from the start. If it had failed (e.g., wrong format causing the model to learn nothing), we'd have wasted 14 hours. A 4-hour pilot on 1,000 examples would have caught format issues earlier.
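Drawing the pilot subset is a few lines. A sketch with a hypothetical helper name; the fixed seed is there so the pilot is reproducible.

```python
import random

def pilot_subset(examples, n=1_000, seed=0):
    """Return a reproducible random sample of up to n examples."""
    examples = list(examples)
    if len(examples) <= n:
        return examples
    return random.Random(seed).sample(examples, n)

full_dataset = list(range(3_500))          # stand-in for the real examples
pilot = pilot_subset(full_dataset)
estimated_hours = 14 * len(pilot) / len(full_dataset)
print(len(pilot), f"~{estimated_hours:.0f} h at the same epochs and settings")
```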

Better eval set from the beginning. 30 hand-curated questions is OK; 200 would have been better. The cost is upfront effort that pays back during iteration.

Cache the tokenized dataset. We re-tokenized at the start of each training run. Saving the tokenized version would have saved ~5 minutes per run, which adds up when iterating.
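One way to do the caching with only the standard library, as a sketch. It assumes a JSONL input file and a `tokenize_fn` you supply (both hypothetical here), and keys the cache on a hash of the raw file so an edited dataset is re-tokenized automatically.

```python
import hashlib
import json
from pathlib import Path

def load_or_tokenize(raw_path, tokenize_fn, cache_dir="cache"):
    """Return tokenized examples, reusing a cache keyed on the raw file's hash."""
    raw_bytes = Path(raw_path).read_bytes()
    digest = hashlib.sha256(raw_bytes).hexdigest()[:16]
    cache_file = Path(cache_dir) / f"tokenized-{digest}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text())

    lines = raw_bytes.decode("utf-8").splitlines()
    tokenized = [tokenize_fn(json.loads(line)) for line in lines if line.strip()]
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(tokenized))
    return tokenized
```

The cache key only covers the data file. If you change the tokenizer, the template, or `max_length`, delete the cache or fold those settings into the hash. If you already use the Hugging Face `datasets` library, `save_to_disk` and `load_from_disk` cover the same need.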

What's still hard #

Knowing when to stop. Training loss kept decreasing through 3 epochs. We could have run more. We stopped because the eval (limited as it was) plateaued, and we were nervous about overfitting on a small dataset. With a bigger dataset, more epochs would probably help.
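The "stop when eval plateaus" rule can be written down as a simple patience check. This is a sketch of the idea, not what our training script did; the patience and threshold values are arbitrary.

```python
def should_stop(eval_losses, patience=3, min_delta=0.0):
    """True when the last `patience` evals failed to beat the earlier best."""
    if len(eval_losses) <= patience:
        return False  # not enough history yet
    best_before = min(eval_losses[:-patience])
    recent_best = min(eval_losses[-patience:])
    return recent_best >= best_before - min_delta

still_improving = [1.00, 0.90, 0.80, 0.70]
plateaued = [1.00, 0.80, 0.70, 0.71, 0.72, 0.73]
print(should_stop(still_improving), should_stop(plateaued))
```

If you train with the Hugging Face `Trainer`, `EarlyStoppingCallback` implements a similar rule. With an eval set as small as ours, though, the signal is noisy enough that any automatic rule deserves suspicion.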

Inference vs training memory. Fine-tuning fits in 24GB; running inference on the result (with the LoRA merged or applied at runtime) is also fine. But running both at once (e.g., loading a second copy of the model to generate test responses while training is still running) doesn't fit. We separated the phases.

Production deployment. A 4090 for inference is fine for a personal project. For anything beyond that, you'd want to either deploy to cloud GPUs (defeating the cost-saving purpose) or merge the LoRA into the base and serve via a quantized runtime. We didn't do this for the side project; for production you'd want vLLM or similar.

Cost comparison #

For comparison: renting an A100 (40GB) on a major cloud for 14 hours costs ~$25-40. The 4090 cost us roughly $1 in electricity. For a one-off, the 4090 is dramatically cheaper. For repeated experiments, the cloud's flexibility (parallel runs, larger memory if needed) probably wins.

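The cost to buy a 4090 is around $1,500. Against $25-40 per equivalent cloud run, that takes roughly 40-60 full training runs to pay back. Each model usually takes several runs to get right, so the hardware pays for itself only if you iterate a lot, or if you'd use the GPU for other things anyway.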
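The break-even arithmetic, counting individual 14-hour runs rather than finished models (one model usually takes several runs). It uses only the figures above and ignores resale value, the rest of the PC, and the fact that an A100 would likely finish faster.

```python
gpu_price = 1_500
electricity_per_run = 1
cloud_low, cloud_high = 25, 40   # rental cost per 14-hour run

# Each local run saves (cloud cost - electricity) compared with renting.
runs_if_cloud_expensive = gpu_price / (cloud_high - electricity_per_run)
runs_if_cloud_cheap = gpu_price / (cloud_low - electricity_per_run)

print(f"Break-even after ~{runs_if_cloud_expensive:.0f} to ~{runs_if_cloud_cheap:.0f} runs")
```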

What I'd tell someone starting #

Use QLoRA, not full fine-tuning. Unless you have access to multi-GPU rigs, full fine-tuning of even small LLMs is impractical on consumer hardware.

Pick a small base model. Llama 3 8B is at the upper edge of what fits on a 4090. The 1B or 3B variants of various models train much faster and are sufficient for many tasks.

Spend 70% of your time on data, 30% on training. The training process is mostly automatic once it works. Data quality, format correctness, and eval set design are where the wins are.

Monitor VRAM. nvidia-smi -l 5 in another terminal. If you're hitting 23.5GB during training, you're one weird batch away from OOM. Reduce batch size or max_length.
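If you'd rather have a script warn you than watch a terminal, here is a sketch that polls `nvidia-smi` in CSV mode and flags low headroom. The 95% threshold and the function names are my own choices.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_memory(output):
    """Parse 'used, total' lines in MiB (one per GPU) into a list of tuples."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(x.strip()) for x in line.split(","))
        rows.append((used, total))
    return rows

def near_oom(used_mib, total_mib, threshold=0.95):
    """True when memory use is at or above the given fraction of total."""
    return used_mib / total_mib >= threshold

def check_gpus():
    """Run nvidia-smi once and return a near-OOM flag per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [near_oom(used, total) for used, total in parse_memory(out)]
```

Call `check_gpus()` from a loop with a sleep, or from a training callback, and log a warning when any entry is true.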

Don't expect magic. A LoRA fine-tune on 3,500 examples won't make Llama 3 8B match GPT-4. It will make it better at the specific shape and content of your data. That's the realistic outcome.

This was a fun weekend project. The result was usable for the side project's needs. Anyone with a modern consumer GPU (RTX 4070 Super or up) can do something similar, though cards with less than 24GB of VRAM will need a smaller batch size, shorter sequences, or a smaller base model. The barriers are mostly knowing what to expect and the dataset prep, not the training itself.

About Kiril Urbonas