Fine-Tune an LLM on Your Own Data: A 2026 Guide
A practical walkthrough for fine-tuning open-source LLMs with QLoRA, from dataset prep to evaluation. Real code, real costs, no fluff.

So you want to fine-tune an LLM on your own data. Good call. As of early 2026, fine-tuning a small open-weights model on a focused dataset can match (and sometimes beat) GPT-4o for narrow tasks at roughly 1/100th the inference cost.
But most tutorials skip the boring parts: dataset formatting, evaluation, and what to do when your loss curve looks like a heart monitor. This guide covers the full pipeline using QLoRA on a single consumer GPU, with the same workflow used by teams shipping production models.
By the end, you'll have a fine-tuned Llama 3.1 8B model trained on a custom JSONL dataset, quantized to 4-bit for inference, and evaluated against the base model. The whole thing runs on a single 24GB GPU (think RTX 4090 or a rented A10G).

We'll use QLoRA because it's the most efficient method that still produces production-quality results. Full fine-tuning of an 8B model needs ~80GB of VRAM. QLoRA squeezes the same job into 16GB.
Before you write a single line of code, make sure you have three things: a 24GB-class GPU (local or rented), a Hugging Face account with the Llama 3.1 license accepted (more on that below), and your data exported somewhere you can script against.
Install the core stack:
```bash
pip install torch transformers datasets peft bitsandbytes accelerate trl
```
The trl library from Hugging Face handles the training loop. peft does the LoRA magic. bitsandbytes handles 4-bit quantization. According to the PEFT docs, this combo is now the standard recipe.
This is where 80% of fine-tuning projects fail. Garbage in, garbage out, except worse because you also paid for the GPU time.
Your data should be in JSONL format with a clear instruction structure. The most common schema (and the one TRL expects by default) looks like this:
{"messages": [{"role": "user", "content": "Summarize this support ticket: ..."}, {"role": "assistant", "content": "Customer reports billing issue..."}]}
A few non-negotiable rules:
- One JSON object per line, and every line must parse. A single malformed record can derail a run (a quick validation script follows the conversion snippet below).
- Use the same messages schema for every record. Mixed formats confuse the chat template.
- Deduplicate. Near-identical pairs teach the model to memorize, not generalize.
- Hold out a validation split; the training code below expects a val.jsonl.
- Quality beats volume. 500 carefully curated pairs will outperform 50,000 scraped ones.
If your data is in CSV or plain text, convert it with a quick script:
```python
import json
import pandas as pd

# Convert a two-column CSV (prompt, response) into the messages JSONL schema.
df = pd.read_csv('your_data.csv')
with open('train.jsonl', 'w') as f:
    for _, row in df.iterrows():
        record = {
            "messages": [
                {"role": "user", "content": row['prompt']},
                {"role": "assistant", "content": row['response']}
            ]
        }
        f.write(json.dumps(record) + '\n')
```
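Before training, it's worth a quick sanity pass over the file. A minimal validator like this (assuming the single-turn user/assistant schema above; extend the role check if you have multi-turn data) catches malformed lines before they cost you GPU time:

```python
import json

# Quick sanity pass: every line must parse and follow the messages schema.
with open('train.jsonl') as f:
    for i, line in enumerate(f, 1):
        record = json.loads(line)  # raises on malformed JSON
        roles = [m["role"] for m in record["messages"]]
        assert roles == ["user", "assistant"], f"line {i}: unexpected roles {roles}"
        assert all(m["content"].strip() for m in record["messages"]), f"line {i}: empty content"
print("train.jsonl looks well-formed")
```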
Now the fun part. Load Llama 3.1 8B with 4-bit quantization so it actually fits in memory.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit NF4 quantization with bf16 compute, per the QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships no pad token; reuse EOS

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```

The nf4 (4-bit Normal Float) quantization is the standard recommendation from the original QLoRA paper. Double quantization saves another ~0.4 bits per parameter, which adds up on bigger models.
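To confirm the quantization actually took effect, print the model's in-memory size with transformers' get_memory_footprint. As a rough expectation (not an exact accounting), an 8B model in nf4 should land around 5-6 GB, versus ~16 GB in bf16:

```python
# Rough check that 4-bit loading worked. get_memory_footprint() reports
# parameter + buffer memory in bytes; expect ~5-6 GB for an 8B model in nf4.
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```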
If you get an access error, accept the model's license on its Hugging Face page first. Llama models gate-keep behind a click-through agreement.
LoRA freezes the base model and trains tiny rank-decomposition matrices instead. You end up modifying maybe 0.5% of the parameters but capture most of the benefit.
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Casts norms/embeddings for stable k-bit training and enables input gradients.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                  # adapter rank
    lora_alpha=32,         # scaling factor; alpha = 2r is a common convention
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
A few opinionated defaults baked into that config:
- r=16 is a solid middle ground for most domain tasks; lower ranks risk underfitting, higher ones mostly burn memory.
- lora_alpha=32 follows the usual alpha = 2r convention.
- Targeting all seven attention and MLP projections follows the QLoRA paper's finding that adapting all linear layers matters more than cranking the rank.
- lora_dropout=0.05 adds a little regularization, which helps on small datasets.
You should see something like "trainable params: 41M / 8B (0.51%)". That's the goal.
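That 41M figure checks out on the back of an envelope. Each adapted d_out × d_in weight gets two LoRA matrices totaling r · (d_in + d_out) parameters, and Llama 3.1 8B has 32 layers with the published dimensions below (4096 hidden, 1024-wide KV projections from grouped-query attention, 14336-wide MLP):

```python
# Back-of-envelope LoRA parameter count for Llama 3.1 8B with r=16.
# Each adapted d_out x d_in matrix adds r * (d_in + d_out) params (the A and B matrices).
r, layers = 16, 32
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (1024, 4096),   # grouped-query attention: 8 KV heads x 128 dims
    "v_proj": (1024, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (14336, 4096),
    "up_proj": (14336, 4096),
    "down_proj": (4096, 14336),
}
per_layer = sum(r * (d_in + d_out) for d_out, d_in in shapes.values())
print(f"{per_layer * layers / 1e6:.1f}M trainable params")  # -> 41.9M
```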
The TRL library makes this almost embarrassingly simple:
```python
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("json", data_files={
    "train": "train.jsonl",
    "validation": "val.jsonl"
})

training_args = SFTConfig(
    output_dir="./llama-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    max_seq_length=2048,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,
)
trainer.train()
```
While it trains, watch the eval loss. If it stops decreasing after epoch 1 and starts climbing, you're overfitting. Stop early. If it never decreases at all, your data is broken (go back to Step 1).
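If you'd rather not babysit the curve, transformers ships an EarlyStoppingCallback. One way to wire it up (add the callback before the trainer.train() call above; the three config fields in the comment are required for it to work):

```python
from transformers import EarlyStoppingCallback

# Stops training once eval loss fails to improve for 3 consecutive evals.
# Also requires, in the SFTConfig above:
#   load_best_model_at_end=True, metric_for_best_model="eval_loss", greater_is_better=False
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
```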

For 1,000 examples and 3 epochs on an A10G, expect roughly 30-45 minutes of training time. An RTX 4090 cuts that nearly in half.
First save the LoRA adapter and run a quick sanity check; we'll merge the weights into the base model right after comparing against it:
```python
# Save the LoRA adapter (adapter weights only, not the full 8B model).
model.save_pretrained("./llama-finetuned-final")
tokenizer.save_pretrained("./llama-finetuned-final")

prompt = "Your test prompt here"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# do_sample=True is required for temperature to have any effect.
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Compare a handful of outputs side-by-side with the base model. If the difference isn't obvious on your domain task, your dataset probably needs work, not your hyperparameters.
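Because the adapter sits on top of the frozen base, PEFT can toggle it off in place, so the side-by-side needs no second copy of the model. A sketch using PEFT's disable_adapter context manager (the prompts are placeholders for real examples from your domain):

```python
def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
    return tokenizer.decode(out[0], skip_special_tokens=True)

for prompt in ["Summarize this support ticket: ...", "Your test prompt here"]:
    with model.disable_adapter():     # temporarily switch off the LoRA weights
        base_out = generate(prompt)
    tuned_out = generate(prompt)      # adapter back on
    print(f"PROMPT: {prompt}\nBASE:   {base_out}\nTUNED:  {tuned_out}\n")
```

Once you're happy with the delta, fold the adapter into the base weights so inference needs no PEFT at all. Note that merge_and_unload on a 4-bit base has to dequantize to merge, which is slightly lossy; for a pristine merge, reload the base in bf16 and attach the saved adapter first.

```python
# Fold the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("./llama-finetuned-merged")
tokenizer.save_pretrained("./llama-finetuned-merged")
```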
A quick reference for the inevitable problems:
| Problem | Likely Cause | Fix |
|---|---|---|
| Loss is NaN | Learning rate too high or fp16 instability | Drop LR to 1e-4, switch to bf16 |
| Eval loss climbs after epoch 1 | Overfitting | Reduce epochs to 1-2, add dropout |
| Model outputs gibberish | Tokenizer mismatch or bad chat template | Verify tokenizer.apply_chat_template matches training format |
| OOM on 24GB GPU | Sequence length too long | Drop max_seq_length to 1024, batch size to 1 |
| Training is glacially slow | Gradient checkpointing off, no flash attention | Set gradient_checkpointing=True in SFTConfig; load the model with attn_implementation="flash_attention_2" |
And one mistake almost everyone makes the first time: forgetting to set the pad token. Llama's tokenizer doesn't ship one by default. If you skip tokenizer.pad_token = tokenizer.eos_token, training will crash in confusing ways.
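The same goes for the gibberish failure mode in the table: print what the chat template actually renders and confirm it matches how your training data was formatted. A quick check, reusing the messages schema from Step 1:

```python
messages = [{"role": "user", "content": "Summarize this support ticket: ..."}]
# Should render the same special tokens / headers your training data was packed with.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```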
Loss is necessary but not sufficient. The real question: does the model do the thing better?
Three practical evaluation methods:
1. Side-by-side spot checks: run 20-50 real prompts through both models and compare, as in the snippet above.
2. A held-out test set scored with a task metric: exact match for extraction, ROUGE for summaries, accuracy for classification (see the sketch below).
3. LLM-as-judge: have a strong model grade pairs of outputs blind, which scales the spot checks without a human in the loop.
Don't trust public benchmarks for fine-tuned models. They measure general capability you probably degraded slightly. That's the trade-off.
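For method 2, here's a minimal held-out scoring sketch against the merged model from above. The test.jsonl file is a hypothetical held-out split in the same messages schema, and the substring check is a crude stand-in for whatever metric actually fits your task:

```python
import json

# Minimal held-out scoring sketch: generate for each test prompt and check
# overlap with the reference. Replace the check with a real task metric.
hits, total = 0, 0
with open('test.jsonl') as f:   # hypothetical held-out file, same schema as train.jsonl
    for line in f:
        msgs = json.loads(line)["messages"]
        inputs = tokenizer(msgs[0]["content"], return_tensors="pt").to("cuda")
        out = merged.generate(**inputs, max_new_tokens=200)
        pred = tokenizer.decode(out[0], skip_special_tokens=True)
        hits += int(msgs[1]["content"].strip()[:50] in pred)  # crude overlap check
        total += 1
print(f"score: {hits}/{total}")
```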
For a 1,000-example dataset on Llama 3.1 8B: roughly 30-45 minutes of A10G time (or ~20 minutes on an RTX 4090), which at typical cloud rental rates works out to a dollar or two of compute per training run, plus whatever you burn on failed experiments along the way.
If your task hits 10M tokens per month, the math favors fine-tuning fast. Below 1M tokens per month, just use the API and call it a day. Once your fine-tune is ready, our LLM API on AWS guide covers production deployment, and 10 tricks to slash your AI API bill helps trim inference costs further.
Once your first fine-tune works, the obvious upgrades:
- Preference tuning (DPO, also in TRL) on top of your SFT checkpoint to sharpen style and judgment.
- A bigger base model; the same pipeline scales to 70B if you rent an A100 80GB.
- Better data: a second dataset built from your model's real failure cases is usually worth more than any hyperparameter sweep.
Fine-tuning isn't magic. It's mostly dataset hygiene, sensible defaults, and patience. But when it clicks, you get a model that does your specific job better than anything you can buy.
FAQ
How much training data do you need?
The practical minimum is around 500 high-quality examples, but 1,000-5,000 is the sweet spot for most domain tasks. Quality matters more than volume: 500 carefully curated pairs will outperform 50,000 scraped ones. If you're seeing no improvement over the base model, your problem is almost always data quality, not quantity.
Can you fine-tune GPT-4o or Claude instead of an open model?
OpenAI offers fine-tuning for GPT-4o through their platform with per-token training fees plus elevated inference costs. Anthropic does not currently offer public fine-tuning for Claude Opus 4.7 as of early 2026. For full control over weights, hyperparameters, and deployment, open-weights models like Llama 3.1 or Mistral remain the practical choice.
What's the difference between full fine-tuning, LoRA, and QLoRA?
Full fine-tuning updates every parameter and needs roughly 80GB VRAM for an 8B model. LoRA freezes the base and trains small adapter matrices, cutting memory to ~24GB. QLoRA adds 4-bit quantization on top, dropping requirements to 16GB while preserving 95-99% of LoRA's quality. For most teams, QLoRA is the default starting point.
How long does training take?
For 1,000 training examples and 3 epochs on Llama 3.1 8B, expect 30-45 minutes on an A10G or about 20 minutes on an RTX 4090. Times scale roughly linearly with dataset size and model parameters. A 70B model fine-tune on the same data takes 6-10 hours on an A100 80GB.
Will fine-tuning hurt the model's general abilities?
Yes, slightly. This is called catastrophic forgetting and it's unavoidable to some degree. LoRA reduces the effect compared to full fine-tuning because most weights stay frozen. If general capability matters, mix 10-20% general-purpose instruction data into your training set and keep the learning rate conservative (1e-4 to 2e-4).
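If you go the mixing route, the datasets library can interleave two sources at a fixed ratio. A sketch, where general_instructions.jsonl is a placeholder for whichever general-purpose instruction set you pick:

```python
from datasets import load_dataset, interleave_datasets

domain = load_dataset("json", data_files="train.jsonl", split="train")
general = load_dataset("json", data_files="general_instructions.jsonl", split="train")  # placeholder

# ~85/15 domain-to-general mix to soften catastrophic forgetting.
mixed = interleave_datasets([domain, general], probabilities=[0.85, 0.15], seed=42)
```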