Fine-tuning Large Language Models: A Practical Guide
Learn how to fine-tune LLMs like Llama 2, Mistral, and GPT models for your specific use case. Includes LoRA, QLoRA, and full fine-tuning techniques.
Fine-tuning allows you to adapt pre-trained language models to your specific domain or task. This guide covers practical techniques for fine-tuning modern LLMs.
Full fine-tuning updates all of the model's parameters. It is the most effective approach but also the most resource-intensive.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load your dataset
dataset = load_dataset("your-dataset")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    logging_steps=10,
    save_steps=500,
)

# Trainer (the collator pads batches and sets labels for causal language modeling)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique that trains only a small set of additional low-rank matrices while keeping the base model's weights frozen.
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,  # Rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # See how many parameters are trainable

# Train as normal
trainer = Trainer(model=model, ...)
trainer.train()
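After training, the adapter can be saved and reloaded on its own; it is only a small fraction of the base model's size. A minimal sketch (the ./lora-checkpoint directory name matches the merge example later in this guide):

# Save only the LoRA adapter weights and config, not the frozen base model
model.save_pretrained("./lora-checkpoint")

# Later, reload the adapter on top of the freshly loaded base model
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")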
QLoRA combines 4-bit quantization of the base model with LoRA adapters, reducing memory requirements even further.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# Prepare the quantized model for training, then apply LoRA
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(...)
model = get_peft_model(model, lora_config)
Approximate GPU requirements by model size and method:

| Model Size | Full Fine-tuning | LoRA | QLoRA |
|---|---|---|---|
| 7B | 4x A100 (80GB) | 1x A100 (40GB) | 1x RTX 3090 (24GB) |
| 13B | 8x A100 (80GB) | 2x A100 (40GB) | 1x A100 (40GB) |
| 70B | 16x A100 (80GB) | 4x A100 (40GB) | 2x A100 (40GB) |
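These figures follow from a rough rule of thumb: full fine-tuning with Adam and mixed precision needs about 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and optimizer moments), LoRA keeps the frozen base in fp16, and QLoRA stores it in 4 bits. A back-of-envelope sketch (the extra gigabyte added for the adapter and its optimizer state is an assumption, and activations and framework overhead are ignored):

def estimate_gpu_memory_gb(n_params_billion):
    """Very rough per-method weight/optimizer memory estimates in GiB."""
    n = n_params_billion * 1e9
    gib = 1024 ** 3
    return {
        # fp16 weights + fp16 grads + fp32 master weights + Adam moments ~ 16 bytes/param
        "full_fine_tuning": 16 * n / gib,
        # frozen fp16 base (2 bytes/param) + small LoRA adapter and optimizer state
        "lora": 2 * n / gib + 1,
        # 4-bit base (~0.5 bytes/param) + LoRA adapter and optimizer state
        "qlora": 0.5 * n / gib + 1,
    }

print(estimate_gpu_memory_gb(7))  # roughly 104 / 14 / 4 GiB for a 7B model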
Fine-tuning data is commonly stored as instruction/input/output records and then rendered into a single prompt string for training:

# Instruction-following format
data = [
    {
        "instruction": "Explain quantum computing",
        "input": "",
        "output": "Quantum computing uses quantum mechanical phenomena..."
    },
    {
        "instruction": "Translate to French",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    }
]

# Convert to training format
def format_prompt(example):
    if example["input"]:
        return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
A few training arguments are worth adjusting. Use a cosine learning-rate schedule with a short warmup:

training_args = TrainingArguments(
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,  # or use warmup_steps for an absolute count; if both are set, warmup_steps takes precedence
)

Enable gradient checkpointing to trade extra compute for lower memory use:

training_args = TrainingArguments(
    gradient_checkpointing=True,  # Saves memory
)

Use mixed precision to reduce memory and speed up training:

training_args = TrainingArguments(
    fp16=True,  # or bf16=True for newer GPUs
)
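Whether to prefer bf16 depends on the hardware; Ampere-class GPUs (A100, RTX 30-series) and newer support it. A small sketch for choosing the flag at runtime (illustrative, not required by the Trainer):

import torch

# Prefer bf16 where the GPU supports it, otherwise fall back to fp16
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = TrainingArguments(
    output_dir="./results",
    bf16=use_bf16,
    fp16=not use_bf16,
)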
After training, load a checkpoint and sanity-check its generations:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./results/checkpoint-1000")
tokenizer = AutoTokenizer.from_pretrained("./results/checkpoint-1000")

# Create pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Test
prompt = "Explain Kubernetes in simple terms:"
result = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
# Merge LoRA weights into the base model for standalone deployment
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
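If you plan to serve the merged model directly (as with vLLM below), the tokenizer files should sit alongside the weights so a single path contains everything. A small sketch, assuming the base model's tokenizer is the one used during training:

from transformers import AutoTokenizer

# Copy the tokenizer into the merged model directory
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.save_pretrained("./merged-model")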
# Install vLLM
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model ./merged-model \
    --port 8000
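The server exposes an OpenAI-compatible HTTP API, so any HTTP client can query it. A minimal sketch using requests (the payload follows the /v1/completions format, and the model name defaults to the path passed to --model):

import requests

# Query the OpenAI-compatible completions endpoint started above
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "./merged-model",
        "prompt": "Explain Kubernetes in simple terms:",
        "max_tokens": 200,
        "temperature": 0.7,
    },
    timeout=60,
)
print(response.json()["choices"][0]["text"])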
Fine-tuning LLMs has become more accessible with techniques like LoRA and QLoRA. Start with QLoRA for resource-constrained environments, and use full fine-tuning when you have the resources and need maximum performance.
Before promoting a fine-tuned model to production, define pre-deploy checks, rollout gates, and rollback triggers. Track p95 latency, error rate, and cost per request for at least 24 hours after deployment. If the trend regresses from baseline, revert quickly and document the decision in the runbook.
Keep the operating model simple under pressure: one owner per change, one decision channel, and clear stop conditions. Review alert quality regularly to remove noise and ensure on-call engineers can distinguish urgent failures from routine variance.
Repeatability is the goal. Convert successful interventions into standard operating procedures and version them in the repository so future responders can execute the same flow without ambiguity.
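As a concrete illustration of that tracking window, a small sketch that computes the three metrics from request logs (the record fields here are assumptions, not a specific logging format):

import math
import statistics

# One record per request observed during the post-deploy window (illustrative fields)
request_log = [
    {"latency_ms": 210, "ok": True, "cost_usd": 0.0012},
    {"latency_ms": 480, "ok": True, "cost_usd": 0.0015},
    {"latency_ms": 950, "ok": False, "cost_usd": 0.0019},
]

latencies = sorted(r["latency_ms"] for r in request_log)
p95_latency = latencies[math.ceil(0.95 * len(latencies)) - 1]  # nearest-rank percentile
error_rate = sum(not r["ok"] for r in request_log) / len(request_log)
cost_per_request = statistics.mean(r["cost_usd"] for r in request_log)

print(f"p95 latency: {p95_latency} ms, error rate: {error_rate:.1%}, cost/request: ${cost_per_request:.4f}")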