I fine-tuned Llama 3 8B on a single 4090 over a weekend for a side project. Here's what worked, what cost more than expected, and what I'd do differently.
A friend asked me to help fine-tune a small Llama 3 8B model on a domain-specific dataset for a side project. They didn't want to rent cloud GPUs; we did the whole thing on a single RTX 4090 over a weekend. The end result worked well enough for the project. The path was bumpier than I'd hoped. This post is what I'd tell someone trying the same thing tomorrow.
transformers + peft + bitsandbytes
Full fine-tuning of Llama 3 8B in float16 needs about 60GB of VRAM during training (gradients + optimizer state for 8B params). Not happening on a 4090.
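LoRA (Low-Rank Adaptation) trains only a small adapter on top of the frozen base model, so gradients and optimizer state are needed only for the adapter weights. The frozen base still has to sit in VRAM, but the training overhead on top of it becomes small. We picked LoRA.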
The configuration:
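from peft import LoraConfig, get_peft_model

# Rank-16 adapters on the four attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# base_model is the 4-bit quantized model loaded in the quantization snippet further down
model = get_peft_model(base_model, lora_config)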
LoRA adds a fraction of a percent of additional parameters as adapter weights (roughly 0.2% of the base model for this configuration). Those are what we train. Training time and VRAM requirements drop accordingly.
This follows a commonly recommended LoRA starting point for Llama 3: rank 16, alpha 32, targeting the attention projection modules. We didn't experiment much beyond these defaults; they worked.
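To see where the adapter's parameter count comes from, here is a back-of-the-envelope sketch. The model dimensions are assumptions taken from the published Llama 3 8B architecture (hidden size 4096, 32 layers, key/value projections of width 1024, roughly 8.03 billion base parameters), not from anything measured in this project.

```python
# Back-of-the-envelope count of LoRA adapter parameters.
# Assumed Llama 3 8B dimensions (not measured in this project):
HIDDEN = 4096        # hidden size
KV_DIM = 1024        # key/value projection width (grouped-query attention)
NUM_LAYERS = 32
BASE_PARAMS = 8.03e9  # approximate base model size
RANK = 16            # r from the LoRA config above

# (in_features, out_features) for each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_param_count(rank, projections, num_layers):
    """Each adapted projection adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (fan_in + fan_out)
                    for fan_in, fan_out in projections.values())
    return per_layer * num_layers


adapter_params = lora_param_count(RANK, PROJECTIONS, NUM_LAYERS)
print(f"adapter params: {adapter_params:,}")
print(f"share of base:  {adapter_params / BASE_PARAMS:.2%}")
```

Under these assumptions the adapter comes to about 13.6 million parameters. Adding the MLP projections to target_modules would raise that number considerably.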
Even with LoRA, the base model itself takes ~16GB at fp16 (8B params × 2 bytes). Combined with activations and optimizer state, you're at the edge of 24GB VRAM.
bitsandbytes 4-bit quantization shrinks the base model to ~5GB. Combined with LoRA, the whole training fits comfortably in 24GB with room for batch size 4-8.
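The weight-memory arithmetic can be sketched as below. This counts weights only, using the round "8B" figure from the text. The gap between the 4GB it prints for 4-bit and the ~5GB quoted above is plausibly quantization constants plus layers kept in higher precision, but that explanation is an assumption rather than something we measured.

```python
# Rough weight-only memory estimate at different precisions.
PARAMS = 8e9  # "8B", as in the text; the exact count is slightly higher


def weight_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_gb(PARAMS, 16)
nf4_gb = weight_gb(PARAMS, 4)

print(f"fp16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {nf4_gb:.1f} GB (before quantization overhead)")
```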
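import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization; matrix multiplies run in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
)

# Load the base model with its weights quantized on the fly
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)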
This is the QLoRA approach (4-bit base + LoRA adapter). Works well on consumer hardware.
The dataset prep took longer than the actual training. We had ~3,500 instruction-response pairs in a CSV. They needed to be reshaped into the format the model expects before training.
Llama 3's instruction template is specific:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{assistant_response}<|eot_id|>
Get this wrong and the model fine-tunes on the wrong tokens; it won't learn the response pattern correctly. We spent a couple of hours debugging quality issues that turned out to be a missing <|eot_id|> between turns.
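A minimal sketch of building one training string in this template follows. The function name is hypothetical, and the blank line after each header follows Meta's published prompt format rather than anything specific to our script. In practice, the tokenizer's own apply_chat_template method is the safer route, since it applies the template that ships with the model.

```python
def format_llama3_example(system_prompt, user_message, assistant_response):
    """Render one system/user/assistant exchange in the Llama 3 chat template.

    Hypothetical helper for illustration; every turn ends with <|eot_id|>.
    """
    def turn(role, content):
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    return (
        "<|begin_of_text|>"
        + turn("system", system_prompt)
        + turn("user", user_message)
        + turn("assistant", assistant_response)
    )


example = format_llama3_example(
    "You are a helpful assistant.",
    "What does LoRA stand for?",
    "Low-Rank Adaptation.",
)
print(example)
```

A cheap sanity check before training is to count the <|eot_id|> markers in each formatted example: three turns should give exactly three. If the tokenizer also prepends its own <|begin_of_text|>, drop it from the string to avoid a doubled token.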
Total training time: ~14 hours for 3 epochs over 3,500 examples on the 4090. The actual training command:
python train.py \
    --model_name "meta-llama/Meta-Llama-3-8B-Instruct" \
    --dataset_path "data/train.jsonl" \
    --eval_dataset_path "data/eval.jsonl" \
    --output_dir "./output/llama3-domain" \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.03 \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100
Effective batch size: 4 × 4 = 16 (per-device batch size × gradient accumulation steps). A learning rate of 2e-4 is a commonly recommended starting point for LoRA SFT; we didn't experiment.
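For a rough sense of what those flags imply, the sketch below estimates the number of optimizer steps and the time per step. It treats all ~3,500 examples as training data and rounds up partial batches; a real trainer may round slightly differently, so read the result as approximate.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective)
    return effective, steps_per_epoch * epochs


effective_batch, total_steps = optimizer_steps(
    num_examples=3500, per_device_batch=4, grad_accum=4, epochs=3
)
seconds_per_step = 14 * 3600 / total_steps  # ~14 hours of wall-clock time

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps}")
print(f"seconds per step:     {seconds_per_step:.0f}")
```

With save_steps and eval_steps at 100, that works out to a checkpoint and an eval pass only a handful of times over the whole run.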
Memory usage during training: ~20GB VRAM. Right at the edge of the 4090 but stable.
Power consumption: the 4090 drew 350-400W during training. Over ~14 hours that is roughly 5-5.6 kWh of energy. At local electricity rates, ~$1 of electricity. Cheaper than renting cloud GPU time for the equivalent.
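The energy arithmetic is simple enough to script for your own numbers. The price per kWh below is a placeholder assumption, not our actual tariff, and the calculation covers the GPU only rather than the whole machine.

```python
def energy_kwh(watts, hours):
    """Energy used at a constant power draw."""
    return watts * hours / 1000


PRICE_PER_KWH = 0.18  # assumed placeholder rate in dollars; substitute your own
HOURS = 14

low_kwh = energy_kwh(350, HOURS)
high_kwh = energy_kwh(400, HOURS)

print(f"energy: {low_kwh:.1f}-{high_kwh:.1f} kWh")
print(f"cost:   ${low_kwh * PRICE_PER_KWH:.2f}-${high_kwh * PRICE_PER_KWH:.2f}")
```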
Twice the training crashed at random checkpoints with CUDA OOM. The cause: occasional batches with longer-than-average tokenization. We added a hard max_length=2048 truncation and the issue stopped.
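Checking the token-length distribution before training would have surfaced the outliers without a crash. The sketch below works on plain lists of token counts and token IDs; the helper names and sample lengths are made up for illustration. With a Hugging Face tokenizer, the truncation itself is usually requested at tokenization time via truncation=True and max_length=2048.

```python
def length_report(lengths, max_length=2048):
    """Summarize token counts and how many examples exceed the limit."""
    ordered = sorted(lengths)
    return {
        "max": ordered[-1],
        "median": ordered[len(ordered) // 2],
        "over_limit": sum(1 for n in lengths if n > max_length),
    }


def truncate_ids(input_ids, max_length=2048):
    """Hard cap on sequence length; keeps the first max_length tokens."""
    return input_ids[:max_length]


# Hypothetical token counts for five examples
sample_lengths = [300, 512, 4100, 780, 2500]
report = length_report(sample_lengths)
print(report)

capped = truncate_ids(list(range(4100)))
print(len(capped))
```

One caveat: truncating from the end cuts off the closing <|eot_id|> of a long example, so dropping or shortening the few outliers by hand may be preferable to silently clipping them.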
The other thing that bit us: the eval loss reported by the trainer was misleading. It was computed on a tiny eval set (~50 examples) and didn't reflect how the model actually performed. Real evaluation came from manually running the model on test queries after training.
We had no rigorous eval framework. For this project, evaluation was a manual pass over a small set of hand-curated queries, judging each answer from the fine-tuned model as better, the same, or worse.
Result on 30 queries: 22 better, 6 same, 2 worse. For the side project, that was good enough to ship.
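Even a manual comparison is worth tallying in code so the numbers stay comparable between runs. This is a minimal sketch using the counts above; the function name is hypothetical.

```python
def summarize(judgments):
    """Turn raw judgment counts into shares of the total."""
    total = sum(judgments.values())
    return {label: count / total for label, count in judgments.items()}


results = {"better": 22, "same": 6, "worse": 2}
shares = summarize(results)
total_queries = sum(results.values())

for label, share in shares.items():
    print(f"{label:>6}: {results[label]:2d}/{total_queries} ({share:.0%})")
```

With only 30 queries, each one moves the result by more than three percentage points, so small differences between runs should not be over-read.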
A more rigorous setup would use a judge LLM on a larger eval set. We didn't bother.
Start with 1,000 examples, see if it works, then scale up. We trained on the full 3,500 from the start. If it had failed (e.g., wrong format causing the model to learn nothing), we'd have wasted 14 hours. A 4-hour pilot on 1,000 examples would have caught format issues earlier.
Better eval set from the beginning. 30 hand-curated questions is OK; 200 would have been better. The cost is upfront effort that pays back during iteration.
Cache the tokenized dataset. We re-tokenized at the start of each training run. Saving the tokenized version once and reloading it saves ~5 minutes per run, which adds up when iterating.
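One way to do this with only the standard library is sketched below. It caches the tokenized output on disk, keyed by a hash of the input file. The function name, cache layout, and the tokenize_fn argument are all assumptions for illustration. If you use the Hugging Face datasets library, its save_to_disk and load_from_disk methods cover the same need.

```python
import hashlib
import pickle
from pathlib import Path


def load_or_tokenize(data_path, tokenize_fn, cache_dir="cache"):
    """Tokenize each non-empty line of data_path once, then reuse the result.

    The cache key is a hash of the file contents, so editing the dataset
    invalidates the cache automatically.
    """
    data_path = Path(data_path)
    raw = data_path.read_bytes()
    key = hashlib.sha256(raw).hexdigest()[:16]
    cache_file = Path(cache_dir) / f"{data_path.stem}-{key}.pkl"

    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)

    lines = raw.decode("utf-8").splitlines()
    tokenized = [tokenize_fn(line) for line in lines if line.strip()]

    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(tokenized, f)
    return tokenized
```

The key covers only the data file. If you change the tokenizer, the prompt template, or max_length, clear the cache or fold those settings into the key as well.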
Knowing when to stop. Training loss kept decreasing through 3 epochs. We could have run more. We stopped because the eval (limited as it was) plateaued, and we were nervous about overfitting on a small dataset. With a bigger dataset, more epochs would probably help.
Inference vs training memory. Fine-tuning fits in 24GB; running inference on the result (with the LoRA merged or applied at runtime) is also fine. But running both at once (e.g., evaluating during training) doesn't fit. We separated phases.
Production deployment. A 4090 for inference is fine for a personal project. For anything beyond that, you'd want to either deploy to cloud GPUs (defeating the cost-saving purpose) or merge the LoRA into the base and serve via a quantized runtime. We didn't do this for the side project; for production you'd want vLLM or similar.
For comparison: renting an A100 (40GB) on a major cloud for 14 hours costs ~$25-40. The 4090 cost us roughly $1 in electricity. For a one-off, the 4090 is dramatically cheaper. For repeated experiments, the cloud's flexibility (parallel runs, larger memory if needed) probably wins.
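The cost to buy a 4090 is around $1,500. At roughly $25-40 saved per 14-hour run, that is about 40-60 runs to break even, so the hardware pays for itself only if you iterate a lot: several small models a year, each taking multiple runs to get right.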
Use QLoRA, not full fine-tuning. Unless you have access to multi-GPU rigs, full fine-tuning of even small LLMs is impractical on consumer hardware.
Pick a small base model. Llama 3 8B is at the upper edge of what fits on a 4090. The 1B or 3B variants of various models train much faster and are sufficient for many tasks.
Spend 70% of your time on data, 30% on training. The training process is mostly automatic once it works. Data quality, format correctness, and eval set design are where the wins are.
Monitor VRAM. Run nvidia-smi -l 5 in another terminal. If you're hitting 23.5GB during training, you're one weird batch away from OOM. Reduce batch size or max_length.
Don't expect magic. A LoRA fine-tune on 3,500 examples won't make Llama 3 8B match GPT-4. It will make it better at the specific shape and content of your data. That's the realistic outcome.
This was a fun weekend project. The result was usable for the side project's needs. Anyone with a modern consumer GPU (RTX 4070 Super or up) can do something similar. The barriers are mostly knowing what to expect and the dataset prep, not the training itself.