fine-tuning llms on custom datasets


most llm tasks don’t need fine-tuning. prompt engineering, few-shot examples, and rag cover the large majority of use cases at a fraction of the cost. fine-tuning makes sense when you need the model to reliably follow a very specific output format, speak in a domain-specific voice, or internalise knowledge that’s too large for a context window.

this is based on what i did for the darkqwen project — fine-tuning qwen3.5-0.8b for adversarial robustness research using qlora.

when to actually fine-tune

  • you’ve exhausted prompt engineering and the model still doesn’t follow your output format
  • you need sub-second inference on a small model and can’t afford large-model api latency
  • you’re doing safety research and need controlled, reproducible behaviour
  • your domain vocabulary is specialised enough that the base model consistently misses it

if none of these apply, use an api.

dataset preparation

the single biggest determinant of fine-tune quality is dataset quality, not the training hyperparameters.

for instruction fine-tuning, each example needs a clear input-output pair. the format i use:

{
  "instruction": "explain what this function does",
  "input": "def relu(x): return max(0, x)",
  "output": "relu (rectified linear unit) returns x if x is positive, otherwise 0. it's the most common activation function in neural networks because it avoids the vanishing gradient problem."
}
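
the training config further down reads a single `text` field per example, so each record has to be rendered into one string first. a minimal sketch of that step, assuming an alpaca-style template (the template wording and the `to_text` helper are illustrative, not the exact format used in the project):

```python
# hypothetical alpaca-style templates; adjust to whatever prompt format you standardise on
PROMPT_WITH_INPUT = (
    "### instruction:\n{instruction}\n\n"
    "### input:\n{input}\n\n"
    "### response:\n{output}"
)
PROMPT_NO_INPUT = (
    "### instruction:\n{instruction}\n\n"
    "### response:\n{output}"
)

def to_text(example, eos_token=""):
    """render one instruction/input/output record as a single training string."""
    has_input = bool(example.get("input", "").strip())
    template = PROMPT_WITH_INPUT if has_input else PROMPT_NO_INPUT
    # appending the eos token teaches the model where a response ends
    return {"text": template.format(**example) + eos_token}
```

with a hugging face `Dataset`, this would typically be applied as `dataset = dataset.map(lambda ex: to_text(ex, tokenizer.eos_token))`. whatever template you pick, use the same one at inference time.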

on dataset size: quality matters more than count. 500 high-quality examples beat 5,000 mediocre ones. i’ve seen fine-tunes go wrong because the training set had contradictory examples: the model learns to be inconsistent rather than learning the target behaviour.

clean your data before training:

  • remove duplicates
  • verify output quality manually on at least 10% of examples
  • check for format inconsistencies (missing fields, trailing whitespace, inconsistent casing)
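
the checklist above can be sketched in plain python. this is one possible version, assuming records are dicts in the instruction/input/output format shown earlier; the helper names and the case-insensitive duplicate key are assumptions, not a fixed recipe:

```python
import random

REQUIRED_FIELDS = ("instruction", "input", "output")

def clean_examples(examples):
    """drop malformed and duplicate records; collect conflicting ones for review."""
    seen = {}                      # (instruction, input) -> first output seen
    cleaned, conflicts = [], []
    for ex in examples:
        # skip records with missing or non-string fields
        if any(not isinstance(ex.get(f), str) for f in REQUIRED_FIELDS):
            continue
        rec = {f: ex[f].strip() for f in REQUIRED_FIELDS}   # trailing whitespace
        if not rec["instruction"] or not rec["output"]:
            continue
        key = (rec["instruction"].lower(), rec["input"])    # ignore casing differences
        if key in seen:
            if seen[key] != rec["output"]:
                conflicts.append(rec)   # same prompt, different answer
            continue                    # duplicate either way, keep the first
        seen[key] = rec["output"]
        cleaned.append(rec)
    return cleaned, conflicts

def review_sample(examples, fraction=0.1, seed=0):
    """pick a reproducible random subset for manual quality review."""
    if not examples:
        return []
    k = min(len(examples), max(1, round(len(examples) * fraction)))
    return random.Random(seed).sample(examples, k)
```

anything that lands in `conflicts` is worth reading by hand, since contradictory pairs are exactly what teaches the model to be inconsistent.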

qlora setup

qlora (quantised lora) lets you fine-tune a model that would otherwise not fit in vram. the base model is loaded in 4-bit precision and kept frozen; only the small lora adapter weights are trained, and those stay in higher precision (16- or 32-bit, depending on the setup).

with unsloth this is straightforward:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",  # example checkpoint; swap in the model you are tuning
    max_seq_length=2048,
    dtype=None,         # auto-detect
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,               # lora rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

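lora rank (r): higher rank means more trainable parameters, so more capacity to fit the target behaviour, at the cost of more vram and a higher risk of overfitting a small dataset. start with r=16. go to r=32 only if the model isn’t capturing the target behaviour.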
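
to make the rank trade-off concrete: a rank-r adapter on a linear layer with d_in inputs and d_out outputs adds two small matrices, for r * (d_in + d_out) extra trainable parameters, so the adapter size grows linearly with r. a rough sketch follows; the layer shapes are hypothetical round numbers chosen for illustration, not read from any real checkpoint (models with grouped-query attention have smaller k/v projections, for example):

```python
def lora_params(d_in, d_out, r):
    """parameters added by one rank-r adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# hypothetical model dimensions, for illustration only
hidden, intermediate, n_layers = 1024, 4096, 24
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

def total_lora_params(r):
    """trainable adapter parameters across all targeted modules and layers."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
    return per_layer * n_layers

for r in (16, 32):
    print(r, total_lora_params(r))
# 16 9043968
# 32 18087936
```

doubling the rank doubles the adapter, along with its gradients and optimiser state, which is where the extra vram goes.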

training

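import torch
from trl import SFTTrainer
from transformers import TrainingArguments

# `dataset` is a hugging face Dataset whose "text" column holds one fully
# formatted prompt + response string per example.
# note: newer trl releases expect dataset_text_field and the sequence length
# in SFTConfig, and take processing_class= instead of tokenizer=.
use_bf16 = torch.cuda.is_bf16_supported()  # true on a100-class gpus

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size = 2 * 4 = 8
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not use_bf16,               # mixed precision must match the dtype
        bf16=use_bf16,                   # the model was auto-loaded in
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()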
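
it helps to know how many optimiser steps this config actually produces, since `warmup_steps` and `logging_steps` are both counted in steps. a back-of-the-envelope sketch, assuming a single gpu; the exact rounding differs slightly between transformers versions, so treat the numbers as approximate:

```python
import math

def training_steps(n_examples, per_device_batch=2, grad_accum=4, epochs=3, n_gpus=1):
    """approximate optimiser-step counts for the config above."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

print(training_steps(500))
# (8, 63, 189)
```

so a 500-example dataset gives roughly 189 steps in total: the 5 warmup steps are a small slice of that, and `logging_steps=10` yields on the order of 18 loss readings to judge the run by.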

watch the training loss. as a rough rule of thumb (the absolute numbers depend on the tokenizer and the data), a loss that plateaus above 1.5 usually means your dataset is too small or too inconsistent. a loss that goes to near-zero is overfitting: the model is memorising, not generalising.
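
these two failure modes can be flagged automatically from the logged losses. a small sketch of the heuristic: the 1.5 ceiling comes from the text, while the near-zero cutoff, the window size, and the flatness tolerance are assumptions to tune for your own runs:

```python
def diagnose_loss(losses, high=1.5, low=0.05, window=5, flat_tol=0.05):
    """classify the tail of a training-loss curve with rough heuristics."""
    if len(losses) < window:
        return "not enough data"
    tail = losses[-window:]
    mean_tail = sum(tail) / window
    if mean_tail < low:
        return "likely overfitting"      # loss has collapsed towards zero
    spread = max(tail) - min(tail)
    if mean_tail > high and spread < flat_tol:
        return "plateaued high"          # flat, but still above the ceiling
    return "ok"
```

with the hugging face trainer, the inputs can be pulled from `[h["loss"] for h in trainer.state.log_history if "loss" in h]`. this only looks at training loss, so it complements rather than replaces checking outputs on held-out prompts.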

evaluation

don’t rely solely on loss. generate samples from the fine-tuned model on held-out prompts and evaluate the outputs manually. for my project i used a set of 50 adversarial prompts as the eval set and tracked how the model’s response pattern shifted across training checkpoints.

if you’re doing instruction fine-tuning, check that the model follows instructions on prompts it hasn’t seen before. that’s the whole point.
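
manual review scales better with a small harness that runs every held-out prompt and surfaces only the failures. a sketch under stated assumptions: `generate` is any callable mapping a prompt string to the model’s output string, and the json check is just one example of a format test, not the check used in the project:

```python
import json

def is_valid_json_with_keys(text, keys=("answer",)):
    """example format check: output parses as a json object with the given keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in keys)

def pass_rate(generate, prompts, check):
    """fraction of held-out prompts whose output satisfies `check`, plus the failures."""
    results = [(p, generate(p)) for p in prompts]
    failures = [(p, out) for p, out in results if not check(out)]
    rate = 1 - len(failures) / len(prompts)
    return rate, failures
```

running the same prompts and the same check against each saved checkpoint gives a comparable number per checkpoint, and the `failures` list is what to read by hand.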

saving and inference

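# save merged model (base + lora weights) as 16-bit weights
model.save_pretrained_merged("fine-tuned-model", tokenizer, save_method="merged_16bit")

# inference: switch unsloth into its faster generation mode
FastLanguageModel.for_inference(model)
# the prompt should follow the same template the model was trained on
inputs = tokenizer("your prompt here", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
# decode only the newly generated tokens, without special tokens
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))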

for deployment, export to gguf format for llama.cpp inference — you get fast cpu-based inference without gpu dependency.
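
a minimal sketch of the export step, assuming the model was loaded through unsloth, which attaches a `save_pretrained_gguf` method to it. the wrapper name, output directory, and the `q4_k_m` quantisation choice are illustrative; check the unsloth docs for the options your version supports:

```python
def export_gguf(model, tokenizer, out_dir="fine-tuned-gguf", quant="q4_k_m"):
    """write a quantised gguf file that llama.cpp can load."""
    # assumption: `model` came from unsloth's FastLanguageModel, so it exposes
    # save_pretrained_gguf; a plain transformers model will not have it
    model.save_pretrained_gguf(out_dir, tokenizer, quantization_method=quant)
    return out_dir
```

lower-bit quantisation gives smaller files and faster cpu inference at some cost in output quality, so it is worth re-running the held-out prompts against the exported model.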


training was done on a single a100 (40gb). unsloth’s memory optimisations meant the 0.8b model trained at roughly 3x the speed of vanilla transformers.