Fine-Tuning LLMs: Teaching AI to Speak Your Language
Part 3 of 5: Customizing Pre-trained Models

Welcome back! In our last article, we built a RAG system that gives AI access to your specific information. Today, we’re taking it a step further: we’re going to actually modify how an AI model thinks and responds by fine-tuning it, using an open-source model for the demo.
RAG vs Fine-Tuning: What’s the Difference?
Think of it this way:
RAG is like giving someone a reference book. They’re still the same person, but now they have access to specific information when answering questions.
Fine-tuning is like sending someone to specialized training. You’re actually changing how they think and respond based on your specific requirements.
When Should You Fine-Tune?
Fine-tuning is powerful, but it’s not always necessary. Consider fine-tuning when:
- You need the AI to adopt a specific tone or style
- Your domain has specialized terminology or concepts
- You want consistent responses across many different scenarios
- RAG isn’t giving you the quality of responses you need
- You have sufficient training data (usually hundreds of examples)
- You need the model to work offline or in air-gapped environments
Setting Up Your Environment
We’ll use Google Colab with a T4 GPU for this tutorial, but you can adapt this for any CUDA-capable environment:
# Install required packages
!pip install transformers datasets peft bitsandbytes accelerate

import os
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel

Choosing the Right Model
For this tutorial, we’ll use TinyLlama-1.1B, but here are some excellent options for different needs:
# Ultra-light models (good for experimentation)
# model_name = "distilgpt2" # 82M parameters
# model_name = "gpt2" # 124M parameters
# model_name = "EleutherAI/gpt-neo-125M" # 125M parameters
# Small but capable models
# model_name = "microsoft/DialoGPT-medium" # 355M parameters
# model_name = "microsoft/DialoGPT-large" # 762M parameters
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0" # 1.1B parameters - choosing this one
# For production use (if you have more GPU memory)
# model_name = "microsoft/phi-2" # 2.7B parameters
# model_name = "stabilityai/stablelm-2-1_6b" # 1.6B parameters

Preparing Your Training Data
The key to successful fine-tuning is high-quality training data. Here’s how we structure it:
# Sample dataset for MyNextDeveloper (MND)
data = {
    "text": [
        "User asks: Write a 3-bullet overview of MND.\nAssistant: - Remote-first, founded 2022 in Mumbai\n- Staff augmentation + full-cycle web/API, UI/UX, AI/ML\n- Agile/TDD, POSH culture, transparent pricing",
        "User asks: Where is MND headquartered?\nAssistant: Malabar Hill, Mumbai",
        "User asks: What services does MND offer?\nAssistant: Staff augmentation, API development, web development, UI/UX design, AI/ML solutions",
        "User asks: What is MND's tech stack?\nAssistant: Angular, React, Next.js, Node.js, Django, Python, Docker, AWS",
        "User asks: What is MND's pricing model?\nAssistant: Project pricing starting at $10,000 with hourly rates between $25 and $49",
        "User asks: What is MND's mission?\nAssistant: To solve the trust gap between startups and engineers by emphasizing empathy, communication, and transparency",
        "User asks: What is MND's culture like?\nAssistant: Agile practices, test-driven development, and POSH-compliant culture",
        "User asks: When was MND founded?\nAssistant: 2022"
    ]
}

# Convert to HuggingFace dataset
dataset = Dataset.from_dict(data)

Key principles for training data:
- Use consistent formatting (notice the “User asks:” and “Assistant:” pattern)
- Keep responses factual and concise
- Cover the most important information about your domain
- Quality over quantity: 50 perfect examples beat 500 mediocre ones
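To keep the formatting airtight, it helps to generate every training string from one template function rather than typing the pattern by hand. A minimal sketch (the helper name and the sample pairs below are ours, not part of the article’s dataset):

```python
# Build training strings from (question, answer) pairs so the
# "User asks:/Assistant:" template is applied identically every time.
def format_example(question: str, answer: str) -> str:
    return f"User asks: {question}\nAssistant: {answer}"

pairs = [
    ("When was MND founded?", "2022"),
    ("Where is MND headquartered?", "Malabar Hill, Mumbai"),
]
texts = [format_example(q, a) for q, a in pairs]
print(texts[0])
```

One typo in a hand-written template (a missing space, a different colon) becomes a formatting inconsistency the model has to absorb; generating the strings removes that risk.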
Efficient Training with LoRA and Quantization
Instead of fine-tuning the entire model (which requires massive GPU memory), we’ll use two efficiency techniques:
1. Quantization (4-bit)
Quantization is a compression technique used to make large language models (LLMs) smaller and faster without heavily reducing accuracy.
- Normally, LLM weights are stored as 32-bit floating point numbers (FP32).
- Quantization reduces this precision to 16-bit (FP16/BF16), 8-bit (INT8), 4-bit (INT4) or even lower.
👉 Example:
- A model with 10 billion parameters in FP32 needs ~40 GB memory.
- If we quantize it to INT8 (8-bit) → it only needs ~10 GB.
- If we go to INT4 (4-bit) → just ~5 GB.
⚡ Why is it useful?
- Makes models run on smaller GPUs (or even CPUs).
- Speeds up inference.
- Slight trade-off in accuracy, but often negligible.
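The arithmetic above is just parameters × bytes per parameter. A quick sketch to reproduce the numbers:

```python
# Weights-only memory estimate: parameters x (bits / 8) bytes.
# Activations, KV cache, and optimizer state come on top of this.
def weight_memory_gb(num_params: float, bits: int) -> float:
    return num_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"10B params @ {bits}-bit: ~{weight_memory_gb(10e9, bits):.0f} GB")
# ~40, ~20, ~10, and ~5 GB respectively
```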
# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16
)

# Load the tokenizer (we'll need it for tokenization and training)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

2. LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning (PEFT) method.
Normally, to fine-tune a huge LLM (like LLaMA-65B or GPT-J), we would need hundreds of GBs of GPU memory. LoRA solves this by:
- Freezing the original model weights.
- Adding small trainable matrices (low-rank adapters) inside the model layers.
- During fine-tuning, only these small matrices are updated.
👉 Example:
- Full fine-tuning a 65B model could require ~1 TB GPU memory.
- With LoRA, you only train a few million parameters → needs < 20 GB GPU memory.
⚡ Why is it useful?
- Makes fine-tuning massive LLMs possible on consumer hardware (1–2 GPUs).
- Easy to “merge” or “unmerge” adapters → you can quickly switch tasks.
- Often achieves performance close to full fine-tuning.
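To see why the trainable fraction is so small, count the adapter parameters for a single linear layer: a d_in × d_out weight matrix gets two low-rank matrices, A (r × d_in) and B (d_out × r). A quick sketch:

```python
# Trainable LoRA parameters for one linear layer of shape (d_in, d_out):
# adapter A is (r x d_in), adapter B is (d_out x r).
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * d_in + d_out * r

# A 2048x2048 attention projection (TinyLlama's hidden size) with r=8:
print(lora_params(2048, 2048, 8))   # 32768 trainable parameters
print(2048 * 2048)                  # vs 4194304 frozen weights
```

Summed over the targeted projections in every layer, the adapters come to a few million parameters on a billion-parameter model, which is where the tiny trainable percentage comes from.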
# Prepare the quantized model for training
model = prepare_model_for_kbit_training(model)

# Configure LoRA based on model architecture
if "TinyLlama" in model_name or "Llama" in model_name:
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
elif "gpt" in model_name.lower() or "DialoGPT" in model_name:
    target_modules = ["c_attn", "c_proj"]
else:
    target_modules = ["q_proj", "v_proj"]

lora_config = LoraConfig(
    r=8,             # Rank of adaptation
    lora_alpha=16,   # LoRA scaling parameter
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

We’re only training about 0.2% of the parameters!
Training the Model
# Tokenize the data
def tokenize_function(examples):
    tokenized_inputs = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    # Create labels by copying input tokens
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()
    return tokenized_inputs

# Split the 8-example dataset: 6 for training, 2 for evaluation
train_dataset = dataset.select(range(6))
eval_dataset = dataset.select(range(6, 8))

# Apply tokenization
train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)
# Configure training
training_args = TrainingArguments(
    output_dir="./mnd-finetune-outputs",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    warmup_steps=10,
    learning_rate=3e-4,
    weight_decay=0.01,
    logging_steps=1,
    save_steps=30,
    eval_strategy="steps",
    eval_steps=15,
    save_total_limit=2,
    load_best_model_at_end=True,
    fp16=True,
    report_to="none"  # Disable wandb logging
)
# Create trainer and start training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

print("Starting training...")
trainer.train()

# Save the adapter
adapter_dir = "./mnd-finetune-outputs/final_adapter"
model.save_pretrained(adapter_dir)
tokenizer.save_pretrained(adapter_dir)

Testing the Fine-Tuned Model
def load_model_for_inference():
    """Load the fine-tuned model for inference"""
    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.float16
    )
    # Attach the fine-tuned adapter (we keep it attached rather than
    # merging, since the base model is loaded in 4-bit)
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    return model, tokenizer

def generate_response(model, tokenizer, prompt, max_length=256, temperature=0.7):
    """Generate a response for a given prompt"""
    formatted_prompt = f"User asks: {prompt}\nAssistant:"
    inputs = tokenizer.encode(formatted_prompt, return_tensors="pt")
    inputs = inputs.to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_length=max_length,
            temperature=temperature,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.1
        )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    assistant_response = response.split("Assistant:")[-1].strip()
    return assistant_response

# Load fine-tuned model and test
model, tokenizer = load_model_for_inference()
test_prompts = [
    "What makes MND different from other development companies?",
    "Can you tell me about MND's team?",
    "What is the company culture at MND?"
]
for prompt in test_prompts:
    response = generate_response(model, tokenizer, prompt)
    print(f"Q: {prompt}")
    print(f"A: {response}\n")

Google Colab
Here’s the Google Colab notebook for this article that you can run in one click: link
The Real Costs
Unlike commercial APIs, your costs are primarily:
- Initial Setup: GPU time for training
- Storage: Model weights (roughly 2 GB for TinyLlama in FP16, plus a few MB for the LoRA adapter)
- Inference: Your own hardware or cloud GPU instances
For a small model like TinyLlama, you can run inference on:
- Google Colab (free tier with limitations)
- An AWS GPU instance such as g4dn.xlarge (roughly $0.50/hour on-demand)
- Your own hardware (RTX 3060 or better)
Best Practices from Real Experience
1. Start Small and Scale Up: Begin with distilgpt2 or gpt2 to test your data and process, then move to larger models.
2. Monitor for Overfitting: With small datasets, models can memorize rather than learn. Watch your evaluation loss.
3. Format Consistency Is Critical: The exact format of your training data matters enormously. Be consistent with punctuation, spacing, and structure.
4. Test with Unseen Questions: Always test with questions not in your training data to ensure generalization.
5. Save Regular Checkpoints: Training can be interrupted. Save frequently and test intermediate checkpoints.
Common Pitfalls and Solutions
Problem: Model generates repetitive or nonsensical text. Solution: Lower the learning rate, increase regularization, or improve training data quality.
Problem: Model forgets general knowledge. Solution: Include some general examples in your training data or use a larger base model.
Problem: Responses are too generic. Solution: Make your training examples more specific and detailed.
Problem: GPU memory issues. Solution: Reduce the batch size, use more aggressive quantization, or enable gradient checkpointing.
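For the GPU-memory pitfall specifically, the usual knobs are all standard TrainingArguments options. A sketch of a more memory-frugal configuration (the values are illustrative, not tuned):

```python
from transformers import TrainingArguments

# Trade compute for memory: tiny per-device batches with accumulation
# keep the effective batch size, and gradient checkpointing recomputes
# activations during the backward pass instead of storing them all.
training_args = TrainingArguments(
    output_dir="./mnd-finetune-outputs",
    per_device_train_batch_size=1,   # smaller batches
    gradient_accumulation_steps=8,   # effective batch size stays 8
    gradient_checkpointing=True,     # recompute activations on backward
    fp16=True,
)
```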
Combining with RAG
Here’s the powerful part: you can combine your fine-tuned model with RAG for the best of both worlds:
- Fine-tune for domain understanding, tone, and general company knowledge
- Use RAG for current information, specific documents, or detailed technical specs
This gives you a model that “thinks” like your company but can access up-to-date information.
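At inference time the combination is mostly prompt assembly: retrieve chunks as in the previous article, then prepend them to the same template the model was fine-tuned on. A sketch (`retrieved_chunks` stands in for whatever your vector store returns, and the headcount fact is made up for illustration):

```python
# Prepend retrieved context to the fine-tuned model's prompt template.
def build_prompt(question, retrieved_chunks):
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        f"Context:\n{context}\n\n"
        f"User asks: {question}\nAssistant:"
    )

prompt = build_prompt(
    "What is MND's current headcount?",
    ["MND employs 40+ engineers."],  # hypothetical retrieved chunk
)
print(prompt)
```

The fine-tuned model supplies tone and stable company knowledge; the retrieved context supplies whatever changed since training.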
What’s Next
In our next article, we’re going completely custom — building a language model from scratch, entirely yours, running on your hardware with your data.
Fine-tuning open-source models gives you unprecedented control over your AI assistant. You can ensure it understands your domain, speaks in your voice, and never sends your data to third parties. With modern efficiency techniques like LoRA and quantization, it’s more accessible than ever.
The future of AI isn’t just about using someone else’s model — it’s about making AI truly yours.

