fine-tuning llms on custom datasets
most llm tasks don’t need fine-tuning. prompt engineering, few-shot examples, and rag cover 90% of use cases and cost a fraction as much. fine-tuning makes sense when you need the model to reliably follow a very specific output format, speak in a domain-specific voice, or internalise knowledge that’s too large for a context window.
this is based on what i did for the darkqwen project — fine-tuning qwen3.5-0.8b for adversarial robustness research using qlora.
when to actually fine-tune
- you’ve exhausted prompt engineering and the model still doesn’t follow your output format
- you need sub-second inference on a small model and can’t afford large-model api latency
- you’re doing safety research and need controlled, reproducible behaviour
- your domain vocabulary is specialised enough that the base model consistently misses it
if none of these apply, use an api.
dataset preparation
the single biggest determinant of fine-tune quality is dataset quality, not the training hyperparameters.
for instruction fine-tuning, each example needs a clear input-output pair. the format i use:
{
  "instruction": "explain what this function does",
  "input": "def relu(x): return max(0, x)",
  "output": "relu (rectified linear unit) returns x if x is positive, otherwise 0. it's one of the most common activation functions in neural networks because it helps mitigate the vanishing gradient problem."
}
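the training snippet later in this post reads a single "text" field, so each record has to be flattened into one string first. a minimal sketch of that step follows; the template layout, the "###" markers, and the helper name to_text are assumptions for illustration, not a required format.

```python
# hypothetical prompt template: the layout and "###" markers are assumptions
PROMPT_TEMPLATE = (
    "### instruction:\n{instruction}\n\n"
    "### input:\n{input}\n\n"
    "### response:\n{output}"
)


def to_text(record: dict, eos_token: str = "") -> dict:
    """Flatten one instruction/input/output record into a single "text" field."""
    text = PROMPT_TEMPLATE.format(
        instruction=record["instruction"].strip(),
        input=record.get("input", "").strip(),
        output=record["output"].strip(),
    )
    # appending the tokenizer's eos token marks where a response ends
    return {"text": text + eos_token}


example = {
    "instruction": "explain what this function does",
    "input": "def relu(x): return max(0, x)",
    "output": "relu returns x if x is positive, otherwise 0.",
}
row = to_text(example, eos_token="</s>")  # "</s>" is a placeholder eos token
print(row["text"])
```

with a hugging face dataset, something like dataset.map(lambda r: to_text(r, tokenizer.eos_token)) should add the column, assuming your records carry these three field names.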
dataset size: quality matters more than volume. 500 high-quality examples beat 5,000 mediocre ones. i’ve seen fine-tunes go wrong because the training set had contradictory examples: the model learns to be inconsistent rather than learning the target behaviour.
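contradictory examples are easy to miss by eye. a rough sketch of one way to surface them is to group records by prompt and report any prompt that maps to more than one output. the function name and the normalisation rule are assumptions, and this only catches prompts that match exactly after normalisation, not paraphrases.

```python
from collections import defaultdict


def find_contradictions(records):
    """Return prompts that appear with more than one distinct output."""
    outputs_by_prompt = defaultdict(set)
    for rec in records:
        # normalise whitespace and case so trivial differences don't hide a clash
        key = (
            " ".join(rec["instruction"].lower().split()),
            " ".join(rec.get("input", "").lower().split()),
        )
        outputs_by_prompt[key].add(rec["output"].strip())
    return {k: sorted(v) for k, v in outputs_by_prompt.items() if len(v) > 1}


records = [
    {"instruction": "classify sentiment", "input": "great product", "output": "positive"},
    {"instruction": "Classify  sentiment", "input": "great product", "output": "negative"},
    {"instruction": "classify sentiment", "input": "awful", "output": "negative"},
]
clashes = find_contradictions(records)
for prompt, outputs in clashes.items():
    print(prompt, "->", outputs)
```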
clean your data before training:
- remove duplicates
- verify output quality manually on at least 10% of examples
- check for format inconsistencies (missing fields, trailing whitespace, inconsistent casing)
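the checklist above can be sketched in a few lines of plain python. this is one possible version, assuming every record should carry the three string fields from the format shown earlier; the helper names clean and review_sample are made up for illustration.

```python
import random

REQUIRED_FIELDS = ("instruction", "input", "output")


def clean(records):
    """Drop malformed and duplicate records and strip stray whitespace."""
    seen, cleaned = set(), []
    for rec in records:
        # skip records with a missing or non-string field
        if not all(isinstance(rec.get(f), str) for f in REQUIRED_FIELDS):
            continue
        rec = {f: rec[f].strip() for f in REQUIRED_FIELDS}
        if not rec["instruction"] or not rec["output"]:
            continue
        # case-insensitive key, so casing differences count as duplicates
        key = tuple(rec[f].lower() for f in REQUIRED_FIELDS)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def review_sample(records, fraction=0.10, seed=0):
    """Pick a reproducible random subset for manual review."""
    k = max(1, round(len(records) * fraction)) if records else 0
    return random.Random(seed).sample(records, k)


raw = [
    {"instruction": "translate to french", "input": "hello", "output": "bonjour"},
    {"instruction": "translate to french ", "input": "hello", "output": "Bonjour"},  # duplicate
    {"instruction": "translate to french", "input": "cat"},  # missing output
    {"instruction": "summarise", "input": "long text", "output": "short text"},
]
data = clean(raw)
to_review = review_sample(data)
print(len(raw), "->", len(data), "records;", len(to_review), "to review by hand")
```

the fixed seed keeps the review subset stable between runs, so you can pick up a manual review where you left off.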
qlora setup
qlora (quantised lora) lets you fine-tune a model that would otherwise not fit in vram. the base model is loaded in 4-bit precision and kept frozen; only the small lora adapter weights are trained, in higher precision (typically 16-bit).
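to get a feel for the saving, here is some back-of-envelope arithmetic for the weights alone. it is a deliberate simplification: it ignores quantisation constants, any layers left unquantised, activations, and optimiser state, so treat the numbers as a lower bound rather than a measurement.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 0.8e9  # a 0.8b-parameter model
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
print(f"16-bit: {fp16_gb:.1f} gb, 4-bit: {int4_gb:.1f} gb")
```

for a 0.8b model that is about 1.6 gb against 0.4 gb. the gap matters far more at 7b and above, where 16-bit weights alone can exceed a consumer gpu’s memory.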
with unsloth this is straightforward:
from unsloth import FastLanguageModel

# load the base model with 4-bit quantised weights
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",  # example checkpoint; swap in the model you are fine-tuning
    max_seq_length=2048,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)

# attach lora adapters to the attention and mlp projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # lora rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)
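lora rank (r): a higher rank means more trainable parameters, which gives the adapter more capacity to fit the data at the cost of more vram and a greater risk of overfitting. start with r=16. go to r=32 only if the model isn’t capturing the target behaviour.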
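the cost of a higher rank is easy to work out: lora adds two small matrices per adapted layer, one of shape r x d_in and one of shape d_out x r. the sketch below counts those parameters. the layer shapes are hypothetical placeholders, so read the real ones from your model’s config.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters lora adds to one linear layer (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)


# hypothetical layer shapes for illustration only
layers = {"q_proj": (1024, 1024), "up_proj": (1024, 3072)}

totals = {}
for r in (16, 32):
    totals[r] = sum(lora_params(d_in, d_out, r) for d_in, d_out in layers.values())
    print(f"r={r}: {totals[r]:,} trainable parameters across these two layers")

# in the standard lora formulation the adapter update is scaled by lora_alpha / r
scaling = 16 / 16
```

the count grows linearly with r, so r=32 doubles the adapter size. also note that if you raise r but leave lora_alpha at 16, the scaling factor halves, so it is common to adjust the two together.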
training
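from unsloth import is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments

# note: recent trl releases move dataset_text_field and max_seq_length into SFTConfig
# and rename the tokenizer argument to processing_class; adjust to your installed version
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # a dataset with one "text" column per example
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),  # fall back to fp16 on older gpus
        bf16=is_bfloat16_supported(),  # prefer bf16 where available, e.g. on an a100
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()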
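it helps to know how many optimiser steps these settings produce before reading the logs. the arithmetic below is an approximation, since the trainer’s own rounding may differ by a step or so, and the 500-example dataset size is just an illustrative figure.

```python
import math


def training_schedule(n_examples, per_device_batch=2, grad_accum=4, epochs=3, n_devices=1):
    """Approximate optimiser-step counts for a run."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


effective_batch, steps_per_epoch, total_steps = training_schedule(500)
print(f"effective batch {effective_batch}, {steps_per_epoch} steps/epoch, {total_steps} steps total")
```

with 500 examples that comes to roughly 189 steps, so warmup_steps=5 is a little under 3% of training and logging_steps=10 gives fewer than 20 loss readings for the whole run.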
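watch the training loss. as a rough rule of thumb, a loss that plateaus above about 1.5 often means your dataset is too small or too inconsistent, though the absolute value depends on the tokenizer and the task. a training loss that falls to near zero is a sign of overfitting: the model is memorising, not generalising.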
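training loss alone can’t distinguish learning from memorising, so it helps to hold out a small eval split and compare the two curves. the sketch below assumes you have one train loss and one held-out loss per checkpoint. the function name and both thresholds are illustrative assumptions, not tuned values.

```python
def diagnose(train_losses, eval_losses, gap_ratio=1.5, patience=2):
    """Flag likely overfitting from per-checkpoint losses."""
    if len(train_losses) != len(eval_losses) or not train_losses:
        raise ValueError("need one train and one eval loss per checkpoint")
    # index of the checkpoint with the lowest held-out loss
    best = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    # held-out loss has not improved for `patience` checkpoints
    stalled = len(eval_losses) - 1 - best >= patience
    # train loss kept falling after the best held-out checkpoint
    still_fitting = train_losses[-1] < train_losses[best]
    # held-out loss is now well above train loss
    wide_gap = eval_losses[-1] > gap_ratio * train_losses[-1]
    if stalled and still_fitting and wide_gap:
        return f"likely overfitting; best checkpoint index {best}"
    return "no clear sign of overfitting"


train = [2.1, 1.4, 0.9, 0.4, 0.1]
held_out = [2.2, 1.6, 1.3, 1.4, 1.7]
verdict = diagnose(train, held_out)
print(verdict)
```

when the check fires, the checkpoint with the lowest held-out loss is usually a better candidate to keep than the final one.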
evaluation
don’t rely solely on loss. generate samples from the fine-tuned model on held-out prompts and evaluate the outputs manually. for my project i used a set of 50 adversarial prompts as the eval set and tracked how the model’s response pattern shifted across training checkpoints.
if you’re doing instruction fine-tuning, check that the model follows instructions on prompts it hasn’t seen before. that’s the whole point.
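if the goal of the fine-tune is a strict output format, part of this check can be automated. here is a minimal sketch that measures how often outputs on held-out prompts parse as the expected structure. the json target with an "answer" key is a made-up example, and fake_generate stands in for the real model so the snippet runs without a gpu.

```python
import json


def format_pass_rate(prompts, generate, required_keys=("answer",)):
    """Fraction of held-out prompts whose output parses as json with the required keys."""
    failures = []
    for prompt in prompts:
        output = generate(prompt)
        try:
            parsed = json.loads(output)
            ok = isinstance(parsed, dict) and all(k in parsed for k in required_keys)
        except json.JSONDecodeError:
            ok = False
        if not ok:
            failures.append((prompt, output))
    rate = 1 - len(failures) / len(prompts) if prompts else 0.0
    return rate, failures


# stand-in for the fine-tuned model
def fake_generate(prompt):
    return '{"answer": "42"}' if "number" in prompt else "sure! the answer is 42"


rate, failures = format_pass_rate(["pick a number", "say hi"], fake_generate)
print(f"pass rate: {rate:.0%}")
for prompt, output in failures:
    print("failed:", prompt, "->", output)
```

tracking this rate across checkpoints gives a number to put next to the manual review, and the failures list tells you which prompts to read first.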
saving and inference
# save merged model (base + lora weights)
model.save_pretrained_merged("fine-tuned-model", tokenizer, save_method="merged_16bit")
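# inference
FastLanguageModel.for_inference(model)  # switch unsloth to its faster inference mode
inputs = tokenizer("your prompt here", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
# the decoded text contains the prompt followed by the completion
print(tokenizer.decode(outputs[0], skip_special_tokens=True))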
for deployment, export to gguf format for llama.cpp inference: it runs on cpu, so serving the model does not require a gpu.
training was done on a single a100 (40gb). unsloth’s memory optimisations meant the 0.8b model trained at roughly 3x the speed of vanilla transformers.