fine-tuning llms on custom datasets

most llm tasks don’t need fine-tuning. prompt engineering, few-shot examples, and rag cover 90% of use cases and cost a fraction. fine-tuning makes sense when you need the model to reliably follow a very specific output format, speak in a domain-specific voice, or internalise knowledge that’s too large for a context window.

this is based on what i did for the darkqwen project — fine-tuning qwen3.5-0.8b for adversarial robustness research using qlora.

when to actually fine-tune

you’ve exhausted prompt engineering and the model still doesn’t follow your output format
you need sub-second inference on a small model and can’t afford large-model api latency
you’re doing safety research and need controlled, reproducible behaviour
your domain vocabulary is specialised enough that the base model consistently misses it

if none of these apply, use an api.

dataset preparation

the single biggest determinant of fine-tune quality is dataset quality, not the training hyperparameters.

for instruction fine-tuning, each example needs a clear input-output pair. the format i use:

{
  "instruction": "explain what this function does",
  "input": "def relu(x): return max(0, x)",
  "output": "relu (rectified linear unit) returns x if x is positive, otherwise 0. it's the most common activation function in neural networks because it avoids the vanishing gradient problem."
}
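
the training code later reads a single "text" field, so each record has to be flattened into one string at some point. a minimal sketch of that step, assuming an alpaca-style template: the section markers, the `format_example` helper and the `"</s>"` end-of-sequence placeholder are all illustrative choices, not something the project necessarily used (in practice you would pass your tokenizer's own eos token).

```python
# hypothetical alpaca-style template; the section markers are an assumption
TEMPLATE = (
    "### instruction:\n{instruction}\n\n"
    "### input:\n{input}\n\n"
    "### response:\n{output}"
)


def format_example(record: dict, eos_token: str = "") -> dict:
    """return a copy of the record with a single 'text' field added."""
    text = TEMPLATE.format(
        instruction=record["instruction"].strip(),
        input=record.get("input", "").strip(),
        output=record["output"].strip(),
    )
    # appending the eos token teaches the model where a response ends
    return {**record, "text": text + eos_token}


example = {
    "instruction": "explain what this function does",
    "input": "def relu(x): return max(0, x)",
    "output": "relu returns x if x is positive, otherwise 0.",
}
formatted = format_example(example, eos_token="</s>")  # placeholder eos token
print(formatted["text"])
```

whatever template you pick, use the same one at inference time, otherwise the model sees a prompt shape it was never trained on.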

dataset size: quality beats quantity. 500 high-quality examples beat 5,000 mediocre ones. i’ve seen fine-tunes go wrong because the training set had contradictory examples: the model learns to be inconsistent rather than learning the target behaviour.

clean your data before training:

remove duplicates
verify output quality manually on at least 10% of examples
check for format inconsistencies (missing fields, trailing whitespace, inconsistent casing)
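
the checks above can be sketched in plain python. this is a rough illustration, assuming records shaped like the json example earlier; `clean_dataset`, `review_sample` and the choice to treat "input" as optional are my own assumptions, and the duplicate check only catches exact matches after whitespace stripping.

```python
import math
import random

REQUIRED_FIELDS = ("instruction", "output")  # "input" is treated as optional


def clean_dataset(records):
    """return (kept, rejected) after stripping whitespace and deduplicating."""
    kept, rejected, seen = [], [], set()
    for record in records:
        # strip leading/trailing whitespace from every string field
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # reject records with a missing or empty required field
        if any(not cleaned.get(field) for field in REQUIRED_FIELDS):
            rejected.append(record)
            continue
        # exact-duplicate check on the normalised fields
        key = (cleaned["instruction"], cleaned.get("input", ""), cleaned["output"])
        if key in seen:
            rejected.append(record)
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept, rejected


def review_sample(records, fraction=0.10, seed=0):
    """pick a reproducible random subset for manual review."""
    k = max(1, math.ceil(len(records) * fraction))
    return random.Random(seed).sample(records, min(k, len(records)))


raw = [
    {"instruction": "say hi", "input": "", "output": "hi"},
    {"instruction": "say hi ", "input": "", "output": "hi\n"},  # duplicate after strip
    {"instruction": "say bye", "input": ""},                     # missing output
]
kept, rejected = clean_dataset(raw)
to_review = review_sample(kept)
print(len(kept), len(rejected), len(to_review))
```

keeping the rejected records around, rather than silently dropping them, makes it easier to spot systematic problems in whatever produced the data.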

qlora setup

qlora (quantised lora) lets you fine-tune a model that would otherwise not fit in vram. the base model is loaded in 4-bit precision and kept frozen; only the small lora adapter weights are trained, and those are kept in higher precision.

with unsloth this is straightforward:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",  # example checkpoint; substitute your own base model
    max_seq_length=2048,
    dtype=None,         # auto-detect
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,               # lora rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

lora rank (r): a higher rank means more trainable parameters, which gives the adapter more capacity to fit the target behaviour but costs more vram. start with r=16. go to r=32 only if the model isn’t capturing the target behaviour.
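
to make the rank trade-off concrete: for one weight matrix of shape d_out x d_in, lora trains two low-rank factors, A (r x d_in) and B (d_out x r), so the added parameter count grows linearly with r. a small arithmetic sketch, where the 1024 x 1024 matrix is a made-up size for illustration and not taken from any particular model:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """trainable parameters lora adds to one d_out x d_in weight matrix.

    lora learns two low-rank factors, A (r x d_in) and B (d_out x r).
    """
    return r * (d_in + d_out)


# hypothetical square projection, 1024 x 1024 (illustrative size only)
d = 1024
full = d * d
for r in (16, 32):
    added = lora_params(d, d, r)
    print(f"r={r}: {added} adapter params, {added / full:.1%} of the frozen matrix")
```

doubling the rank doubles the adapter size for every targeted module, which is why it is worth trying the smaller rank first.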

training

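import torch
from trl import SFTTrainer
from transformers import TrainingArguments

# `dataset` is a datasets.Dataset with one formatted prompt+response string
# per row in its "text" column.
# note: this is the older trl call signature; newer trl releases expect
# dataset_text_field and the sequence length in SFTConfig instead.

# dtype=None loads the model in bf16 on gpus that support it (e.g. a100),
# so the trainer precision flags have to match.
bf16_ok = torch.cuda.is_bf16_supported()

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not bf16_ok,
        bf16=bf16_ok,
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()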
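
the batch settings above interact: the optimiser only steps once per accumulated batch, so the effective batch size is the per-device batch times the accumulation steps times the number of gpus. a quick arithmetic sketch, assuming a single gpu and a hypothetical 500-example dataset; the step counts are approximate because the trainer may round the last partial batch differently.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 3
num_gpus = 1          # single-gpu run
num_examples = 500    # hypothetical dataset size

# examples consumed per optimiser step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# approximate number of optimiser steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch, steps_per_epoch, total_steps)
```

this is useful for sanity-checking warmup_steps and logging_steps against the total: on a small dataset, a run can be over after a couple of hundred steps.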

watch the training loss. as a rough rule of thumb, a loss that plateaus above 1.5 usually means your dataset is too small or too inconsistent, though the exact number depends on the tokenizer and the data. a loss that drops to near-zero usually signals overfitting: the model is memorising, not generalising.
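
if you would rather not eyeball the log, the same rule of thumb can be turned into a crude check over the logged loss values. this is only a sketch: `diagnose_loss` is a hypothetical helper, and the window size, the near-zero cutoff of 0.05 and the flatness tolerance are assumptions i picked for illustration; only the 1.5 figure comes from the text.

```python
from statistics import mean


def diagnose_loss(losses, window=5, high=1.5, near_zero=0.05, flat_tol=0.02):
    """return a rough label for a list of logged training losses."""
    if len(losses) < 2 * window:
        return "not enough data"
    recent = mean(losses[-window:])                 # average of the last window
    previous = mean(losses[-2 * window:-window])    # average of the window before
    if recent < near_zero:
        return "near-zero: likely memorising"
    if recent > high and abs(previous - recent) < flat_tol:
        return "high plateau: check dataset size and consistency"
    return "ok"


print(diagnose_loss([2.0] * 10))
print(diagnose_loss([0.5, 0.4, 0.3, 0.2, 0.1, 0.04, 0.03, 0.02, 0.01, 0.01]))
print(diagnose_loss([2.5, 2.2, 1.9, 1.6, 1.4, 1.2, 1.1, 1.0, 0.9, 0.8]))
```

treat the labels as a prompt to go and look at samples, not as a verdict.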

evaluation

don’t rely solely on loss. generate samples from the fine-tuned model on held-out prompts and evaluate the outputs manually. for my project i used a set of 50 adversarial prompts as the eval set and tracked how the model’s response pattern shifted across training checkpoints.

if you’re doing instruction fine-tuning, check that the model follows instructions on prompts it hasn’t seen before. that’s the whole point.
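
checking unseen prompts only works if the held-out set is carved off before training and stays fixed between runs. a minimal sketch of a seeded split, assuming your examples live in a plain python list; `train_eval_split`, the 10% fraction and the placeholder prompts are illustrative, not the split used in the project.

```python
import random


def train_eval_split(records, eval_fraction=0.1, seed=42):
    """shuffle with a fixed seed and hold out a fraction for evaluation."""
    shuffled = list(records)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed keeps the split reproducible
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


prompts = [f"prompt {i}" for i in range(200)]  # placeholder data
train, held_out = train_eval_split(prompts)
print(len(train), len(held_out))
```

because the seed is fixed, every checkpoint is evaluated on the same held-out prompts, which is what makes comparisons across checkpoints meaningful.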

saving and inference

# save merged model (base + lora weights)
model.save_pretrained_merged("fine-tuned-model", tokenizer, save_method="merged_16bit")

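# inference
FastLanguageModel.for_inference(model)  # switch unsloth to its faster inference mode
# format the prompt the same way as the training examples
inputs = tokenizer("your prompt here", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
# outputs[0] holds the prompt tokens followed by the generated tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))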

for deployment, export to gguf format for llama.cpp inference — you get fast cpu-based inference without gpu dependency.

training was done on a single a100 (40gb). unsloth’s memory optimisations meant the 0.8b model trained at roughly 3x the speed of vanilla transformers.