Fine-tuning Large Language Models: A Practical Guide
Learn how to fine-tune LLMs like Llama 2, Mistral, and GPT models for your specific use case. Includes LoRA, QLoRA, and full fine-tuning techniques.
Fine-tuning allows you to adapt pre-trained language models to your specific domain or task. This guide covers practical techniques for fine-tuning modern LLMs.
Full fine-tuning updates all of the model's parameters. It is the most effective approach but also the most resource-intensive.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load your dataset
dataset = load_dataset("your-dataset")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    logging_steps=10,
    save_steps=500,
)

# Trainer (the collator pads batches and sets labels for causal language modeling)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique that trains only a small set of additional low-rank matrices while keeping the base model's weights frozen.
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,  # Rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # See how many parameters are trainable

# Train as normal
trainer = Trainer(model=model, ...)
trainer.train()
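After training, the adapter can be saved and reloaded on its own; it is only a small fraction of the base model's size. A minimal sketch (the ./lora-checkpoint directory name matches the merge example later in this guide):

# Save only the LoRA adapter weights and config, not the frozen base model
model.save_pretrained("./lora-checkpoint")

# Later, reload the adapter on top of the freshly loaded base model
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")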
QLoRA combines 4-bit quantization of the base model with LoRA adapters, reducing memory requirements even further.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# Prepare the quantized model for training, then apply LoRA
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(...)
model = get_peft_model(model, lora_config)
Approximate GPU requirements by model size and method:

| Model Size | Full Fine-tuning | LoRA | QLoRA |
|---|---|---|---|
| 7B | 4x A100 (80GB) | 1x A100 (40GB) | 1x RTX 3090 (24GB) |
| 13B | 8x A100 (80GB) | 2x A100 (40GB) | 1x A100 (40GB) |
| 70B | 16x A100 (80GB) | 4x A100 (40GB) | 2x A100 (40GB) |
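These figures follow from a rough rule of thumb: full fine-tuning with Adam and mixed precision needs about 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and optimizer moments), LoRA keeps the frozen base in fp16, and QLoRA stores it in 4 bits. A back-of-envelope sketch (the extra gigabyte added for the adapter and its optimizer state is an assumption, and activations and framework overhead are ignored):

def estimate_gpu_memory_gb(n_params_billion):
    """Very rough per-method weight/optimizer memory estimates in GiB."""
    n = n_params_billion * 1e9
    gib = 1024 ** 3
    return {
        # fp16 weights + fp16 grads + fp32 master weights + Adam moments ~ 16 bytes/param
        "full_fine_tuning": 16 * n / gib,
        # frozen fp16 base (2 bytes/param) + small LoRA adapter and optimizer state
        "lora": 2 * n / gib + 1,
        # 4-bit base (~0.5 bytes/param) + LoRA adapter and optimizer state
        "qlora": 0.5 * n / gib + 1,
    }

print(estimate_gpu_memory_gb(7))  # roughly 104 / 14 / 4 GiB for a 7B model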
Fine-tuning data is commonly stored as instruction/input/output records and then rendered into a single prompt string for training:

# Instruction-following format
data = [
    {
        "instruction": "Explain quantum computing",
        "input": "",
        "output": "Quantum computing uses quantum mechanical phenomena..."
    },
    {
        "instruction": "Translate to French",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    }
]

# Convert to training format
def format_prompt(example):
    if example["input"]:
        return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
A few training arguments are worth adjusting. Use a cosine learning-rate schedule with a short warmup:

training_args = TrainingArguments(
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,  # or use warmup_steps for an absolute count; if both are set, warmup_steps takes precedence
)

Enable gradient checkpointing to trade extra compute for lower memory use:

training_args = TrainingArguments(
    gradient_checkpointing=True,  # Saves memory
)

Use mixed precision to reduce memory and speed up training:

training_args = TrainingArguments(
    fp16=True,  # or bf16=True for newer GPUs
)
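Whether to prefer bf16 depends on the hardware; Ampere-class GPUs (A100, RTX 30-series) and newer support it. A small sketch for choosing the flag at runtime (illustrative, not required by the Trainer):

import torch

# Prefer bf16 where the GPU supports it, otherwise fall back to fp16
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = TrainingArguments(
    output_dir="./results",
    bf16=use_bf16,
    fp16=not use_bf16,
)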
After training, load a checkpoint and sanity-check its generations:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./results/checkpoint-1000")
tokenizer = AutoTokenizer.from_pretrained("./results/checkpoint-1000")

# Create pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Test
prompt = "Explain Kubernetes in simple terms:"
result = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
# Merge LoRA weights into the base model for standalone deployment
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
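If you plan to serve the merged model directly (as with vLLM below), the tokenizer files should sit alongside the weights so a single path contains everything. A small sketch, assuming the base model's tokenizer is the one used during training:

from transformers import AutoTokenizer

# Copy the tokenizer into the merged model directory
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.save_pretrained("./merged-model")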
# Install vLLM
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model ./merged-model \
    --port 8000
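The server exposes an OpenAI-compatible HTTP API, so any HTTP client can query it. A minimal sketch using requests (the payload follows the /v1/completions format, and the model name defaults to the path passed to --model):

import requests

# Query the OpenAI-compatible completions endpoint started above
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "./merged-model",
        "prompt": "Explain Kubernetes in simple terms:",
        "max_tokens": 200,
        "temperature": 0.7,
    },
    timeout=60,
)
print(response.json()["choices"][0]["text"])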
Fine-tuning LLMs has become more accessible with techniques like LoRA and QLoRA. Start with QLoRA for resource-constrained environments, and use full fine-tuning when you have the resources and need maximum performance.
Before promoting a fine-tuned model to production, define pre-deploy checks, rollout gates, and rollback triggers. Track p95 latency, error rate, and cost per request for at least 24 hours after deployment. If the trend regresses from baseline, revert quickly and document the decision in the runbook.
Keep the operating model simple under pressure: one owner per change, one decision channel, and clear stop conditions. Review alert quality regularly to remove noise and ensure on-call engineers can distinguish urgent failures from routine variance.
Repeatability is the goal. Convert successful interventions into standard operating procedures and version them in the repository so future responders can execute the same flow without ambiguity.
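As a concrete illustration of that tracking window, a small sketch that computes the three metrics from request logs (the record fields here are assumptions, not a specific logging format):

import math
import statistics

# One record per request observed during the post-deploy window (illustrative fields)
request_log = [
    {"latency_ms": 210, "ok": True, "cost_usd": 0.0012},
    {"latency_ms": 480, "ok": True, "cost_usd": 0.0015},
    {"latency_ms": 950, "ok": False, "cost_usd": 0.0019},
]

latencies = sorted(r["latency_ms"] for r in request_log)
p95_latency = latencies[math.ceil(0.95 * len(latencies)) - 1]  # nearest-rank percentile
error_rate = sum(not r["ok"] for r in request_log) / len(request_log)
cost_per_request = statistics.mean(r["cost_usd"] for r in request_log)

print(f"p95 latency: {p95_latency} ms, error rate: {error_rate:.1%}, cost/request: ${cost_per_request:.4f}")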