A university friend posted on LinkedIn recently: "How do you train AI after you integrate the API?"
I immediately thought of How to Train Your Dragon. Hiccup didn't build his own dragon. He didn't raise Toothless from an egg. The Night Fury was already there — full-grown, already capable of flight, fire, and everything else. Hiccup's challenge wasn't creating the dragon. It was learning how to speak to it, how to earn its trust, and how to point it at the right target.
Most people integrating LLMs are in the exact same spot. The dragon is already trained. Your job is figuring out how to ride it.
Here are four tiers, ranked from "whistle a command" to "raise the beast from an egg."
Tier 1: Learn to speak Dragon
The API hands you a fully trained model. GPT-4, Claude, Gemini — these are not blank slates. They have read most of the internet. They can reason, write code, translate languages, and argue about philosophy. Your integration does not start with training. It starts with a conversation.
The system prompt is your dragon whistle. It tells the model who it is, what it should care about, and how to behave.
Say you're building a support bot for an electronics store. You don't need to retrain the model on product manuals. You write:
```
You are a helpful support assistant for ElectroMart, an electronics retailer.
Your job is to answer questions about orders, returns, and product compatibility.
- Check the user's order status before offering a replacement.
- Returns are accepted within 30 days with a receipt.
- Do not make up product specs. If unsure, say you will connect them to a human.
- Be concise. Most customers are on mobile.
```
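In practice, "speaking Dragon" is just putting that text in the system slot of your API call. Here's a minimal sketch, assuming an OpenAI-style chat-completions payload; the model name and the `build_messages` helper are illustrative, not any particular SDK's API:

```python
# Minimal sketch of wiring a system prompt into a chat API call.
# The model name and client setup below are placeholders -- adapt to your provider.

SYSTEM_PROMPT = """\
You are a helpful support assistant for ElectroMart, an electronics retailer.
Your job is to answer questions about orders, returns, and product compatibility.
- Check the user's order status before offering a replacement.
- Returns are accepted within 30 days with a receipt.
- Do not make up product specs. If unsure, say you will connect them to a human.
- Be concise. Most customers are on mobile."""

def build_messages(user_message: str) -> list:
    """Assemble the messages payload: system prompt first, then the user turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]

messages = build_messages("Can I return headphones I bought 3 weeks ago?")
# With an OpenAI-style SDK, this payload would then be sent as, e.g.:
#   client.chat.completions.create(model="gpt-4o", messages=messages)
```

Note that the system prompt rides along with every request — the model has no memory between calls, so the "whistle" has to be blown each time.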
That's it. The dragon already knows how to answer questions. You're just pointing it at the right village.
For probably 80% of AI integrations — maybe 90% — this is the entire solution. I have seen startups burn weeks and thousands of dollars on pipelines when their actual problem was a three-sentence system prompt that was vague about tone.
Tier 2: Give the dragon a map
Sometimes the system prompt isn't enough because the knowledge is too large, too specific, or changes too often.
Your company has 10,000 internal documents. Your product catalog updates daily. Your legal team rewrites policy every quarter. You can't paste all of that into a prompt: the context window has a hard token limit, and even if it didn't, it would be a mess.
This is where Retrieval-Augmented Generation, or RAG, comes in.
Instead of memorizing the world, the dragon looks at a map. You store your documents in a vector database. When a user asks a question, you search for the most relevant chunks, stuff them into the prompt alongside the user's message, and let the model answer from that context.
The user asks: "What's our policy on remote work in the EU?" Behind the scenes, the prompt that actually reaches the model looks like this:

```
System: You are an HR assistant. Answer using only the provided context.

Context:
[Excerpt from Remote Work Policy v3.2, March 2026]
[Excerpt from EU Labor Compliance Guide]
[Excerpt from Employee Handbook - Europe section]

User question: What's our policy on remote work in the EU?
```
The dragon didn't need to "learn" your HR policy. It just needed to read the right paragraphs at the right time.
When to pick RAG: your knowledge base is large, changes frequently, or you need source citations. When the user asks "where did you get that answer from?" and you need to point to a specific document.
When to skip it: your domain is small and stable enough to fit in a system prompt, or your answers don't need to be grounded in source documents.
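The retrieval step itself is the only new moving part. Here's a toy sketch of it — word overlap stands in for real embedding similarity, and the documents are invented — so the control flow is visible end to end:

```python
# Toy sketch of the RAG retrieval step. A real system would use an embedding
# model and a vector database; here, crude word overlap stands in for cosine
# similarity over embeddings. Documents are illustrative.

DOCUMENTS = [
    "Remote work in the EU is permitted up to 4 days per week with manager approval.",
    "Returns are accepted within 30 days with a receipt.",
    "EU employees must comply with local labor law on working hours.",
]

def score(query: str, doc: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, k: int = 2) -> list:
    """Return the k most relevant chunks for the query."""
    return sorted(DOCUMENTS, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Stuff the retrieved chunks into the prompt alongside the question."""
    context = "\n".join(retrieve(query))
    return (
        "You are an HR assistant. Answer using only the provided context.\n\n"
        f"Context:\n{context}\n\nUser question: {query}"
    )

prompt = build_prompt("What is our remote work policy in the EU?")
```

Swap the `score` function for an embedding model and `DOCUMENTS` for a vector store, and this is the whole pattern.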
Tier 3: Teach the dragon a new trick
System prompts and RAG handle knowledge. They do not reliably change behavior.
If you need the model to consistently output in a very specific format, follow an unusual reasoning pattern, or adopt a tone that it just won't nail with prompting alone — that's when fine-tuning enters the picture.
Fine-tuning means taking a pre-trained dragon and drilling it on a specific maneuver. You collect example inputs and the exact outputs you want, and you run additional training on top of the base model. You're not teaching it facts. You're teaching it a muscle memory.
Real example: I worked on a project where we needed the model to classify customer complaints into exactly 12 categories with a specific JSON schema. Few-shot prompting got us to 92%. The remaining 8% were maddening — wrong category, malformed JSON, extra fields.
We collected about 2,000 labeled examples and fine-tuned a smaller model. Accuracy jumped to 98%. More importantly, the output format became consistent. No more random schema drift.
```json
{
  "complaint": "My package arrived damaged and the delivery was late.",
  "category": "SHIPPING_ISSUE",
  "severity": 3,
  "requires_escalation": false
}
```
When to fine-tune: you need consistent structure, a specific tone that's hard to prompt, or you're making the same kind of call so frequently that API costs for a large model are bleeding you dry. A fine-tuned small model can be cheaper and faster at inference.
When to skip it: you don't have at least a few hundred high-quality examples, your problem is just knowledge lookup, or you haven't exhausted prompt engineering and RAG first. Fine-tuning is also a maintenance burden — retrain when your data drifts, monitor for degradation.
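Most of the work in fine-tuning is preparing the training file, not running the job. As a sketch, here's what turning labeled examples into the chat-style JSONL format used by several fine-tuning APIs (OpenAI's among them) looks like — the labels and system prompt are illustrative:

```python
import json

# Sketch of preparing fine-tuning data as chat-style JSONL: one JSON object
# per line, each containing the input and the *exact* output you want the
# model to reproduce. The examples below are illustrative.

labeled = [
    ("My package arrived damaged and the delivery was late.",
     {"category": "SHIPPING_ISSUE", "severity": 3, "requires_escalation": False}),
    ("I was charged twice for the same order.",
     {"category": "BILLING_ERROR", "severity": 4, "requires_escalation": True}),
]

def to_record(complaint: str, label: dict) -> dict:
    """One training example: the complaint as the user turn, the target JSON as the assistant turn."""
    return {
        "messages": [
            {"role": "system", "content": "Classify the complaint. Reply with JSON only."},
            {"role": "user", "content": complaint},
            {"role": "assistant", "content": json.dumps(label)},
        ]
    }

# One example per line -- the JSONL file you upload for the fine-tuning job.
jsonl = "\n".join(json.dumps(to_record(c, lbl)) for c, lbl in labeled)
```

The quality of this file is the whole game: the model will learn exactly what the assistant turns show it, drift and all.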
Tier 4: Raise the dragon from an egg
This is actual training. And it is nothing like the first three tiers.
When people say "we're training an AI," this is what it actually means. Not a system prompt. Not RAG. Not fine-tuning a few thousand examples on top of someone else's checkpoint. It means collecting terabytes of text, renting clusters of thousands of GPUs, and running training for weeks or months while burning enough electricity to power a small town.
What training actually takes
Pre-training a large language model from scratch requires:
- **Data at scale.** We're talking hundreds of billions to trillions of tokens. That's most of the internet, plus books, code, scientific papers, and whatever proprietary datasets you can legally acquire. Cleaning this data is a full-time job for a team. Bad data doesn't just make the model worse — it makes it confidently wrong.
- **Compute clusters.** A model the size of GPT-3 (175 billion parameters) required on the order of hundreds of GPU-years to train. Modern frontier models use tens of thousands of high-end GPUs running in parallel for months. At cloud rates, that's millions of dollars for a single training run. And you usually need multiple runs, because the first one doesn't work.
- **Infrastructure expertise.** Distributed training across thousands of GPUs is hard. Nodes fail. Network bandwidth becomes a bottleneck. Checkpoints get corrupted. You need machine learning engineers who understand optimization, parallelism strategies, memory management, and debugging at scale. This is not a weekend project.
- **Evaluation frameworks.** Training without measurement is just burning money. You need benchmarks, human evaluators, safety testing, red-teaming, and a process for deciding whether iteration 47 is actually better than iteration 46. Companies like OpenAI and Anthropic have dedicated teams just for this.
Large models vs small models
Here's the thing most people miss: the "dragon" you get from an API is already the product of this process. GPT-4, Claude 3.5, Gemini 1.5 — these are massive models with hundreds of billions of parameters. A parameter is essentially a learned weight in the neural network, one of the numbers adjusted during training. More parameters means more capacity to memorize facts, understand nuance, and generalize across domains. But it also means more compute to run inference.
| Aspect | Large Model (GPT-4, Claude 3.5) | Small Model (Llama 3 8B, Mistral 7B) |
|---|---|---|
| Parameters | Hundreds of billions | Single-digit billions |
| Capabilities | Broad, deep reasoning | Narrower, shallower |
| Inference cost | High per token | Low per token |
| Latency | Slower | Faster |
| Can run locally? | No | Yes, on consumer GPUs |
| Fine-tuning cost | Very expensive | Cheap, accessible |
Small models are the practical workhorses. A fine-tuned 7-billion-parameter model can outperform a generalist giant on a narrow task because it's specialized. But ask it to write poetry or debug a novel algorithm and it will struggle. Large models are the generalists. They cost more but handle edge cases you didn't plan for.
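The "can run locally" row in the table above comes down to simple arithmetic: memory to hold the weights is parameter count times bytes per parameter (2 bytes in fp16). A quick sketch — these are floors for the weights alone, ignoring activations and the KV cache:

```python
# Back-of-the-envelope inference memory: parameter count x bytes per parameter.
# fp16 uses 2 bytes per parameter; 8-bit quantization uses 1. Real deployments
# also need memory for activations and the KV cache, so treat these as floors.

def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory just to hold the weights, in GB."""
    return params_billions * 1e9 * bytes_per_param / 1e9

small = weight_memory_gb(7)        # 7B model in fp16   -> 14.0 GB
large = weight_memory_gb(175)      # GPT-3-sized model  -> 350.0 GB
quantized = weight_memory_gb(7, bytes_per_param=1)  # 8-bit 7B -> 7.0 GB
```

A quantized 7B model squeezes onto a single consumer GPU; a 175B model needs a multi-GPU server before it produces its first token.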
Most companies should be using large models via API for general tasks and fine-tuned small models for high-volume, narrow workflows.
Who actually trains models?
Very few organizations need to do this. The ones that do fall into three buckets:
- **Foundation model labs.** OpenAI, Anthropic, Google DeepMind, Meta, Mistral, Cohere. These companies exist to train the dragons everyone else rides.
- **Domains with no useful priors.** Some biotech, legal, and scientific applications use vocabulary and reasoning so specialized that general-purpose models have never seen enough relevant data. Even then, it's usually more practical to start with a base model and do extensive continued pre-training rather than train from random initialization.
- **Air-gapped or extreme privacy environments.** Governments, defense contractors, or highly regulated industries that cannot send data to third-party APIs. They train or run open-weight models internally.
If you're reading a blog post to decide whether you need to train a model, you don't.
The Dragon Rider's Guide
Here's the full decision tree. Start at the top. Stop when it works.
| Your situation | What to do | Why |
|---|---|---|
| The model doesn't know my business rules | System prompt | Fast, free, reversible |
| Too much information to fit in a prompt | RAG | Keeps knowledge current, cite sources |
| The model won't follow my output format | Fine-tuning | Teaches behavior, not facts |
| Nothing else works, and I have a GPU farm | Training | Last resort for edge cases |
Most people never make it past row one. Some make it to row two. Row three is for special cases. Row four is for people with venture capital and PhDs.
The thing about AI right now is that the technology feels so magical that people assume the solution must be magical too. It must be training. It must be a pipeline. It must be machine learning.
Sometimes the answer is just: write a better paragraph at the top of your API call. The dragon already knows how to fly. You just need to tell it where to go.
Feel free to connect if you have questions or want to argue about whether fine-tuning counts as "training." (It technically does. You know what I mean.)