Learn how to fine-tune LLMs like Llama 2, Mistral, and GPT models for your specific use case. Includes LoRA, QLoRA, and full fine-tuning techniques.
Fine-tuning allows you to adapt pre-trained language models to your specific domain or task. This guide covers practical techniques for fine-tuning modern LLMs.
Full fine-tuning updates all model parameters. It is usually the most effective approach, but also the most resource-intensive.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Llama 2 ships without a pad token; reuse EOS so batches can be padded
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load your dataset (expects a "text" column and a "train" split)
dataset = load_dataset("your-dataset")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    logging_steps=10,
    save_steps=500,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    # Pads each batch and copies input_ids into labels for the causal LM loss
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
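Because gradients are accumulated over several forward passes, each optimizer step in the configuration above covers more than four sequences. A minimal sketch of that arithmetic follows; the helper names and the GPU counts are illustrative assumptions, not part of the Trainer API.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Number of sequences that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


def max_tokens_per_step(per_device_batch_size, gradient_accumulation_steps, max_length, num_gpus=1):
    """Upper bound on tokens per optimizer step (every sequence at max_length)."""
    batch = effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus)
    return batch * max_length


# Values from the training arguments above, on a single GPU (assumed)
print(effective_batch_size(4, 4))      # 16
print(max_tokens_per_step(4, 4, 512))  # 8192
```

If you lower `per_device_train_batch_size` to fit in memory, raising `gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged.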
LoRA (Low-Rank Adaptation) is an efficient fine-tuning method: the base model stays frozen and only a small number of added low-rank parameters are trained.
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, Trainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,  # Rank of the low-rank update matrices
    lora_alpha=32,  # Scaling factor; the update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention projections
    lora_dropout=0.05,
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # See how many parameters are trainable

# Train as normal; the remaining Trainer arguments are the same as for full fine-tuning
trainer = Trainer(model=model, ...)
trainer.train()
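To get a feel for what `print_trainable_parameters()` reports, the adapter size can be estimated by hand. The sketch below assumes the published Llama-2-7B shape (hidden size 4096, 32 layers, square attention projections) and an approximate base parameter count; the helper name is hypothetical.

```python
def lora_params_per_matrix(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


HIDDEN_SIZE = 4096       # Llama-2-7B hidden size (assumed from the published config)
NUM_LAYERS = 32          # Llama-2-7B transformer layers
TARGETS_PER_LAYER = 4    # q_proj, k_proj, v_proj, o_proj
RANK = 16                # r from the LoRA configuration above

trainable = lora_params_per_matrix(HIDDEN_SIZE, HIDDEN_SIZE, RANK) * TARGETS_PER_LAYER * NUM_LAYERS
base_params = 6.74e9     # approximate size of the frozen base model

print(f"{trainable:,} trainable parameters")                       # 16,777,216 trainable parameters
print(f"{100 * trainable / (base_params + trainable):.2f}% of all parameters")  # 0.25% of all parameters
```

Doubling `r` doubles the adapter size, so the rank is the main knob for trading adapter capacity against memory.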
QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters, which reduces memory use even further.
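import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # Dtype used for the matrix multiplications
    bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto",
)

# Prepare the quantized model for training (peft helper for k-bit models)
model = prepare_model_for_kbit_training(model)

# Apply LoRA (same settings as in the LoRA example)
lora_config = LoraConfig(...)
model = get_peft_model(model, lora_config)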
| Model Size | Full Fine-tuning | LoRA | QLoRA |
|---|---|---|---|
| 7B | 4x A100 (80GB) | 1x A100 (40GB) | 1x RTX 3090 (24GB) |
| 13B | 8x A100 (80GB) | 2x A100 (40GB) | 1x A100 (40GB) |
| 70B | 16x A100 (80GB) | 4x A100 (40GB) | 2x A100 (40GB) |
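The gaps between the columns follow from how many bytes each method keeps in GPU memory per base-model parameter. The sketch below is a rough lower bound under stated assumptions: mixed-precision training with Adam for full fine-tuning (fp16 weights and gradients, fp32 master weights, two fp32 optimizer moments), and frozen base weights for LoRA and QLoRA. It ignores activations, adapter weights, quantization constants, and framework overhead, so real requirements are higher.

```python
# Approximate bytes held per base-model parameter (assumptions, see lead-in)
BYTES_PER_PARAM = {
    "full fine-tuning": 2 + 2 + 4 + 8,  # fp16 weights + fp16 grads + fp32 master copy + Adam moments
    "LoRA": 2,                          # frozen fp16 base weights
    "QLoRA": 0.5,                       # frozen 4-bit base weights
}


def estimate_memory_gb(num_params, method):
    """Lower-bound memory estimate in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[method] / 1e9


for size in (7e9, 13e9, 70e9):
    row = ", ".join(f"{m}: {estimate_memory_gb(size, m):.1f} GB" for m in BYTES_PER_PARAM)
    print(f"{size / 1e9:.0f}B -> {row}")
```

For a 7B model this gives about 112 GB for full fine-tuning, 14 GB for LoRA, and 3.5 GB for QLoRA before activations, which is why full fine-tuning needs several GPUs while QLoRA fits on a single consumer card.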
# Instruction-following format
data = [
    {
        "instruction": "Explain quantum computing",
        "input": "",
        "output": "Quantum computing uses quantum mechanical phenomena..."
    },
    {
        "instruction": "Translate to French",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    }
]

# Convert to training format
def format_prompt(example):
    if example["input"]:
        return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
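The tokenization step in the full fine-tuning example reads a `text` field, so the formatted prompts need to end up in that shape. A minimal sketch, repeating the formatter so the block runs on its own; the `records` name and the single sample row are illustrative.

```python
def format_prompt(example):
    """Render one instruction record as a single training string."""
    if example["input"]:
        return (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"


data = [
    {
        "instruction": "Translate to French",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?",
    },
]

# One {"text": ...} record per example, matching what tokenize_function expects
records = [{"text": format_prompt(example)} for example in data]
print(records[0]["text"])
```

A list of records like this can then be wrapped with `datasets.Dataset.from_list(records)` and passed through the same `map(tokenize_function, batched=True)` call used earlier.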
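training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,  # Takes precedence over warmup_ratio, so set only one of the two
    # warmup_ratio=0.1,  # Alternative: warm up over the first 10% of training steps
)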
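To see what a cosine schedule with warmup does to the learning rate, here is a small standalone sketch of the curve: a linear ramp from zero to the peak rate, then a half-cosine decay back to zero. The function name and the total of 1,000 training steps are assumptions for illustration; in practice the Trainer builds the schedule for you.

```python
import math


def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical run length
for step in (0, 50, 100, 550, 1000):
    lr = cosine_lr_with_warmup(step, TOTAL_STEPS, warmup_steps=100, peak_lr=2e-5)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The rate reaches its peak of 2e-5 at step 100, falls to half of that midway through the decay, and ends at zero.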
training_args = TrainingArguments(
    output_dir="./results",
    gradient_checkpointing=True,  # Saves memory by recomputing activations during the backward pass
)
training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,  # or bf16=True on GPUs that support bfloat16 (Ampere and newer)
)
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./results/checkpoint-1000")
# Assumes the tokenizer was saved with the checkpoint; otherwise load it from the base model
tokenizer = AutoTokenizer.from_pretrained("./results/checkpoint-1000")

# Create pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Test
prompt = "Explain Kubernetes in simple terms:"
# do_sample=True is required for temperature to have any effect
result = generator(prompt, max_length=200, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
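# Merge LoRA weights
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
# Save the tokenizer next to the weights so the directory can be served on its own
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("./merged-model")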
# Install vLLM
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model ./merged-model \
    --port 8000
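Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the standard library and assumes the server is reachable at `http://localhost:8000` and registers the model under the path passed to `--model`; the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="./merged-model", max_tokens=200, temperature=0.7):
    """Build a POST request for the OpenAI-style /v1/completions endpoint."""
    payload = {
        "model": model,  # assumed to match the --model value used at startup
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send the prompt to the running server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, `print(complete("Explain Kubernetes in simple terms:"))` should print the model's continuation of the prompt.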
Fine-tuning LLMs has become more accessible with techniques like LoRA and QLoRA. Start with QLoRA for resource-constrained environments, and use full fine-tuning when you have the resources and need maximum performance.