Fine-Tuning vs Prompt Engineering vs RAG: A Decision Tree for Your AI Feature
Most teams pick the wrong approach and waste months. Here's the decision framework: when to use prompt engineering, when to add RAG, and when fine-tuning actually makes sense.
September 27, 2025 · 11 min read
Every AI feature starts with the same question: "Should we use prompt engineering, RAG, or fine-tuning?"
Most teams pick based on what sounds cool or what they read in a blog post. Then they spend three months building the wrong thing.
Here's the decision framework that actually works, based on shipping dozens of AI development projects.
The Three Approaches: What They Actually Are
Before you can choose, you need to understand what you're choosing between.
Prompt engineering: Crafting instructions and examples that guide a pre-trained LLM to produce the output you want. No additional training, no external data retrieval—just clever prompting.
RAG (Retrieval-Augmented Generation): Retrieving relevant information from a knowledge base and including it in the prompt context before the LLM generates a response. The model doesn't change; you just give it better context.
Fine-tuning: Taking a pre-trained model and training it further on your specific dataset to modify its behavior, knowledge, or output style. You're creating a custom version of the model.
The complexity, cost, and capability of each approach are dramatically different.
Start with Prompt Engineering: The 80/20 Solution
Prompt engineering should be your default. It's the cheapest, fastest, and most flexible approach.
When prompt engineering is enough:
You need the LLM to follow a specific format. Structured outputs, JSON responses, specific tone of voice, adherence to brand guidelines. All of this can be achieved with well-crafted prompts.
Your task fits within the context window. If all the information the LLM needs can fit in the prompt (up to 128K-200K tokens depending on model), you don't need RAG.
You're solving a general reasoning problem. Translation, summarization, code generation, creative writing, question answering based on general knowledge. Pre-trained models are already good at this.
You need to iterate quickly. Changing a prompt takes seconds. Changing a RAG pipeline takes hours. Changing a fine-tuned model takes days.
Real examples that only need prompt engineering:
Writing product descriptions in a specific brand voice
Generating SQL queries from natural language
Summarizing customer feedback into categories
Converting meeting notes into action items
Translating technical documentation into user-friendly language
Cost and complexity:
Time to implement: Hours to days
Cost: API calls only ($3-$15 per 1M tokens depending on model)
Maintenance: Low—just prompt updates as needed
Common mistakes:
Giving up on prompt engineering too early. If your first prompt doesn't work, the answer isn't "we need RAG" or "we need fine-tuning." The answer is better prompting. Spend 20-40 hours iterating before declaring prompt engineering insufficient.
Not using structured prompting techniques. Few-shot examples, chain-of-thought reasoning, role prompting, and output format specifications dramatically improve results. Use them (a minimal sketch follows this list).
Ignoring prompt versioning. Your prompts will change. Version them like code and track what works.
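To make the structured-prompting point concrete, here's a minimal sketch of a few-shot prompt with an enforced JSON output format, using the OpenAI Python SDK as one example provider; the model name, categories, and examples are placeholders, not recommendations:

```python
# Minimal sketch: few-shot prompt + structured output for feedback categorization.
# Assumes the OpenAI Python SDK; model name, categories, and examples are placeholders.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a support analyst. Classify customer feedback into one of: "
    'billing, bug, feature_request, other. Respond with JSON: {"category": "...", "reason": "..."}.'
)

FEW_SHOT = [
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": '{"category": "billing", "reason": "duplicate charge"}'},
    {"role": "user", "content": "The export button does nothing on Safari."},
    {"role": "assistant", "content": '{"category": "bug", "reason": "broken export on Safari"}'},
]

def categorize_feedback(feedback: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model you've standardized on
        response_format={"type": "json_object"},  # forces valid JSON output
        messages=[{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT,
                  {"role": "user", "content": feedback}],
    )
    return response.choices[0].message.content
```

Versioning this prompt alongside your code is what makes the iteration loop cheap: change the system prompt or examples, rerun your evaluation set, and compare.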
Add RAG When You Need External Knowledge
RAG makes sense when the information the LLM needs isn't in its training data or can't fit in the prompt.
When to use RAG:
You're building on top of proprietary data. Internal documentation, customer data, product catalogs, support tickets. The LLM doesn't know this information and never will unless you provide it.
The knowledge changes frequently. If you're answering questions about a product that ships new features weekly, fine-tuning is too slow. RAG lets you update the knowledge base and have it reflected immediately.
You have more context than fits in a prompt. If you need to search across thousands of documents to find relevant information, you need retrieval. You can't paste your entire knowledge base into every prompt.
You need attribution and citations. RAG lets you return sources alongside answers. Fine-tuned models can't tell you where their knowledge came from.
Real examples that need RAG:
Customer support chatbot answering questions based on documentation
AI assistant that searches internal company wikis and Slack history
Code generation tool that references a company's specific coding standards
Legal research tool that retrieves relevant case law
E-commerce product recommendation based on catalog data
Cost and complexity:
Time to implement: 2-6 weeks for a production-ready system
Upfront cost: $10,000-$40,000 for development
Ongoing cost: Vector database hosting ($200-$1,000/month), API calls, maintenance
Maintenance: Medium—requires updating embeddings when documents change
Check our pricing for RAG implementation estimates.
The RAG stack:
Building RAG isn't just about adding a vector database. You need:
Document processing: Chunking documents into retrievable segments, handling different formats (PDF, HTML, Markdown), preserving context across chunks.
Embedding generation: Converting documents into vector embeddings, choosing the right embedding model, managing embedding costs.
Vector storage and retrieval: Choosing a vector database, tuning similarity search, and deciding how many chunks to return per query.
Context assembly: Ranking retrieved chunks, fitting them into the prompt, and handling queries where nothing relevant comes back.
Each piece adds complexity. Don't underestimate the engineering effort.
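To make the document processing and embedding steps above concrete, here's a minimal sketch of a sliding-window chunker plus batch embedding; the chunk size, overlap, and embedding model are assumptions to tune, not recommendations:

```python
# Sketch of document processing + embedding generation for a RAG pipeline.
# Chunk size, overlap, and the embedding model are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into overlapping character windows so facts aren't cut at chunk boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed each chunk; store the vectors alongside the chunk text and its source document."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # placeholder embedding model
        input=chunks,
    )
    return [item.embedding for item in response.data]
```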
Common mistakes:
Retrieving too many or too few documents. Retrieving 20 documents fills your context window with noise. Retrieving 2 documents misses relevant information. The sweet spot is usually 3-7 chunks.
Not handling retrieval failures. What happens when retrieval returns nothing relevant? Your LLM will hallucinate. You need graceful fallbacks (see the sketch after this list).
Ignoring chunk size and overlap. Too-small chunks lose context. Too-large chunks dilute relevance. Overlap between chunks ensures important information isn't split across boundaries.
Assuming semantic search is enough. Sometimes keyword search outperforms semantic search. Hybrid approaches that combine both usually win.
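Picking up the retrieval-count and fallback points above, here's a minimal sketch of bounded top-k retrieval with a relevance threshold and a graceful no-answer path; `vector_search`, k, and the threshold are hypothetical and depend on your stack:

```python
# Sketch: bounded top-k retrieval with a relevance threshold and a graceful fallback.
# `vector_search` is a hypothetical helper over whatever vector store you use;
# k and the similarity threshold are starting points to tune, not magic numbers.

FALLBACK_ANSWER = "I couldn't find anything relevant in the documentation for that question."

def retrieve_context(query: str, vector_search, k: int = 5, min_score: float = 0.75) -> list[str]:
    """Return at most k chunks that clear the relevance bar; an empty list means 'don't guess'."""
    hits = vector_search(query, top_k=k)  # expected shape: [(chunk_text, similarity_score), ...]
    return [text for text, score in hits if score >= min_score]

def answer(query: str, vector_search, generate) -> str:
    context = retrieve_context(query, vector_search)
    if not context:
        return FALLBACK_ANSWER  # don't let the model hallucinate over an empty context
    prompt = (
        "Answer using only the context below.\n\n"
        + "\n---\n".join(context)
        + f"\n\nQuestion: {query}"
    )
    return generate(prompt)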
Fine-Tuning: The Last Resort, Not the First
Fine-tuning is expensive, slow, and rarely necessary. Use it only when prompt engineering and RAG fail.
When fine-tuning makes sense:
You need to modify the model's behavior in ways prompting can't achieve. Teaching the model a new language style, adjusting its reasoning process, embedding domain-specific knowledge so deeply that it doesn't need retrieval.
You have a massive, high-quality training dataset. Fine-tuning with 100 examples won't help. You need thousands to tens of thousands of high-quality examples. If you don't have this, fine-tuning won't work.
Latency and cost matter more than flexibility. A fine-tuned model can produce outputs without retrieval, reducing latency and complexity. If you're doing millions of inferences and retrieval overhead is killing your budget, fine-tuning might pay off.
You need consistent, predictable outputs at scale. Fine-tuning can bake in patterns that are hard to achieve reliably with prompting alone. If consistency is critical and you're willing to trade flexibility, fine-tuning helps.
Real examples that justify fine-tuning:
Legal document generation with firm-specific language patterns (10,000+ examples of past documents)
Customer support classifier trained on millions of labeled tickets
Code completion model specialized for a proprietary programming framework
Medical diagnosis assistant trained on annotated clinical notes
Cost and complexity:
Time to implement: 4-12 weeks (data prep, training, evaluation, deployment)
Upfront cost: $30,000-$100,000+ for development
Training cost: $100-$5,000+ per training run depending on model and dataset size
Ongoing cost: Model hosting (if self-hosted), inference costs, retraining as data changes
Maintenance: High—requires ongoing data collection, retraining, version management
The fine-tuning process:
Fine-tuning isn't just clicking a button. You need:
Data collection and labeling: Gathering thousands of high-quality examples, labeling them correctly, ensuring diversity and coverage.
Data cleaning: Removing duplicates, fixing errors, standardizing formats. Garbage in, garbage out.
Training setup: Choosing hyperparameters, setting up training infrastructure, running experiments.
Evaluation: Testing the fine-tuned model against holdout data, comparing to baseline, identifying failure modes.
Deployment: Hosting the model (or using API if fine-tuning via OpenAI/Anthropic), setting up monitoring, managing versions.
Ongoing retraining: As your data changes, you need to retrain. This isn't a one-time effort.
Each step requires ML expertise and time. Budget accordingly.
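As one concrete illustration of the data prep and training setup steps, here's a sketch that writes chat-formatted examples to JSONL and launches a hosted fine-tuning job through the OpenAI API; the base model name, file path, and example records are placeholders, and self-hosted fine-tuning of an open-weights model would look quite different:

```python
# Sketch: preparing JSONL training data and launching a hosted fine-tuning job.
# Base model, file path, and the example records are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

examples = [
    {
        "messages": [
            {"role": "system", "content": "Draft clauses in the firm's house style."},
            {"role": "user", "content": "Draft a mutual confidentiality clause."},
            {"role": "assistant", "content": "Each party shall hold the other's Confidential Information..."},
        ]
    },
    # ...thousands more high-quality, deduplicated examples
]

with open("training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

training_file = client.files.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder base model
)
print(job.id)  # poll the job and evaluate against a holdout set before deploying
```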
Common mistakes:
Fine-tuning on too little data. 100 examples won't cut it. 1,000 examples might not either. You need enough data to shift the model's behavior meaningfully.
Fine-tuning to teach facts instead of behavior. Fine-tuning is for teaching the model how to respond, not what to know. If you're trying to teach it facts, use RAG.
Not comparing to prompt engineering baselines. Before you spend $50,000 fine-tuning, prove that prompt engineering can't achieve the same result. Most of the time, it can.
Assuming fine-tuning solves hallucination. It doesn't. A fine-tuned model will still hallucinate if you ask it questions it can't answer.
The Decision Tree: How to Choose
Stop guessing. Follow this decision tree.
Step 1: Can prompt engineering solve it?
Spend 20-40 hours iterating on prompts. Use few-shot examples, chain-of-thought reasoning, structured output formats, and role prompting.
If yes: Stop. You're done. Don't over-engineer.
If no: Move to Step 2.
Step 2: Do you need external knowledge or proprietary data?
Does the task require information that isn't in the model's training data? Internal docs, real-time data, user-specific context?
If yes: Use RAG. Proceed to Step 3.
If no: Revisit Step 1. You probably need better prompting, not a different approach.
Step 3: Is RAG solving it?
Build a RAG prototype. Test retrieval quality. Measure whether retrieved context improves LLM outputs.
If yes: Stop. RAG is your solution.
If no: Move to Step 4.
Step 4: Do you have 10,000+ high-quality training examples?
Fine-tuning requires significant training data. If you don't have it, you can't fine-tune effectively.
If no: You can't fine-tune. Go back to improving your RAG retrieval or prompts.
If yes: Move to Step 5.
Step 5: Is the ROI of fine-tuning worth the cost?
Fine-tuning will cost $30,000-$100,000+ and take 2-3 months. Will the improvement justify this?
If yes: Fine-tune.
If no: Live with RAG + prompt engineering or rethink the feature.
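If it helps to see the whole tree in one place, here it is condensed into a toy function; the boolean inputs are simplifications of the questions in Steps 1 through 5:

```python
# The decision tree above, condensed into a toy function. The boolean inputs are
# simplifications of the questions in Steps 1-5.
def choose_approach(
    prompting_works: bool,
    needs_external_knowledge: bool,
    rag_works: bool,
    has_10k_examples: bool,
    fine_tuning_roi_positive: bool,
) -> str:
    if prompting_works:
        return "prompt engineering"            # Step 1: stop here, don't over-engineer
    if not needs_external_knowledge:
        return "better prompting"              # Step 2: revisit prompts before adding machinery
    if rag_works:
        return "RAG"                           # Step 3: retrieval plus prompting is enough
    if not has_10k_examples:
        return "improve RAG and prompts"       # Step 4: not enough data to fine-tune
    if fine_tuning_roi_positive:
        return "fine-tuning"                   # Step 5: only with clear ROI
    return "RAG + prompt engineering, or rethink the feature"
```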
Combining Approaches: The Pragmatic Solution
The best AI features don't use one approach. They use all three strategically.
Example: AI-powered customer support
Prompt engineering: Defines the chatbot's tone, response structure, and escalation logic.
RAG: Retrieves relevant documentation and past tickets to ground responses in accurate information.
Fine-tuning: Not used. Prompt + RAG is sufficient and much cheaper.
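A minimal sketch of how those two layers combine in a single request; the tone rules, the `retrieve_context` callable, and the `generate` wrapper are illustrative assumptions, not a specific framework:

```python
# Sketch: prompt engineering + RAG combined in one support-bot request.
# `retrieve_context` is assumed to wrap your retrieval layer (e.g. the earlier sketch
# with the vector store bound in); "Acme", the tone rules, and `generate` are illustrative.
SUPPORT_SYSTEM_PROMPT = (
    "You are a friendly support agent for Acme. Answer in two short paragraphs, "
    "cite the doc titles you used, and if you are not confident, offer to escalate "
    "to a human instead of guessing."
)

def support_reply(question: str, retrieve_context, generate) -> str:
    chunks = retrieve_context(question)  # RAG: ground the answer in documentation
    context = "\n---\n".join(chunks) if chunks else "No relevant documentation found."
    user_prompt = f"Documentation:\n{context}\n\nCustomer question: {question}"
    return generate(system=SUPPORT_SYSTEM_PROMPT, user=user_prompt)  # prompting: tone + structure
```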
Example: Legal document generation
Prompt engineering: Structures the document format and enforces legal writing conventions.
RAG: Retrieves relevant clauses and precedents from a database of past contracts.
Fine-tuning: Applied to a smaller model to internalize firm-specific legal language patterns, reducing reliance on retrieval and improving output consistency.
Example: Code completion tool
Prompt engineering: Specifies output format and code style guidelines.
RAG: Retrieves relevant code snippets from the company's codebase to provide context.
Fine-tuning: Trained on the company's proprietary codebase to understand internal frameworks and patterns that aren't publicly documented.
Use the cheapest, simplest approach that solves each piece of the problem.
Cost Comparison: Real Numbers
Here's what each approach actually costs for a mid-sized B2B SaaS company.
Prompt engineering:
Development: Days of prompt iteration rather than weeks of engineering
Ongoing: API calls only ($3-$15 per 1M tokens, scaling with volume)
RAG:
Development: $30,000
Ongoing: Updates and maintenance (20 hours/month) = $3,000/month
API costs: $1,000-$3,000/month (retrieval + generation)
Total year 1: $30,000 + $60,000 = $90,000
Fine-tuning:
Development: 400 hours at $150/hour = $60,000
Data labeling: 200 hours at $50/hour = $10,000
Training costs: $2,000 per training run × 10 runs = $20,000
Infrastructure: Model hosting = $2,000/month
Ongoing: Retraining and maintenance (40 hours/month) = $6,000/month
API costs: $2,000-$5,000/month
Total year 1: $90,000 + $120,000 = $210,000
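For quick comparisons with your own numbers, the year-1 math is simply upfront cost plus twelve months of ongoing cost; a trivial sketch, using rough midpoints of the ranges above:

```python
# Back-of-the-envelope year-1 totals: upfront development plus 12 months of ongoing cost.
# The monthly figures below are rough midpoints of the ranges quoted above.
def year_one_cost(upfront: float, monthly: float) -> float:
    return upfront + 12 * monthly

rag = year_one_cost(upfront=30_000, monthly=5_000)            # ~ $90,000
fine_tuning = year_one_cost(upfront=90_000, monthly=10_000)   # ~ $210,000
print(rag, fine_tuning)
```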
The complexity and cost scale dramatically. Don't reach for fine-tuning unless the ROI is clear. Use our MVP calculator to compare total costs across all three approaches.
When to Pivot from One Approach to Another
Your first choice might not be your final choice. Know when to switch.
Prompt engineering → RAG:
You're hitting context window limits trying to include all necessary information in prompts. You need to search across more data than fits in 200K tokens.
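Before pivoting, it's worth confirming you're actually hitting that wall; here's a sketch that counts prompt tokens with tiktoken (the encoding name is an assumption and should match the model you actually use):

```python
# Sketch: estimate prompt size before concluding you've outgrown prompt engineering.
# The encoding name is an assumption; match it to your model.
import tiktoken

def prompt_token_count(prompt: str, encoding_name: str = "cl100k_base") -> int:
    return len(tiktoken.get_encoding(encoding_name).encode(prompt))

# If this regularly approaches your model's context limit (e.g. 128K-200K tokens),
# that's the signal to move the bulk of the data behind retrieval instead.
```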
RAG → Fine-tuning:
Retrieval latency is killing user experience (every query takes 5+ seconds). You're spending $10,000+/month on retrieval and it's growing linearly with users. You have 10,000+ examples and can justify the investment.
Fine-tuning → RAG:
Your knowledge changes too frequently to retrain. You need attribution and sources. Fine-tuning didn't improve performance enough to justify the cost.
Any approach → Prompt engineering:
You over-engineered. Strip it back and see if better prompts can achieve the same result with 10x less complexity.
The Mistake That Kills AI Projects
Teams choose their approach based on what sounds impressive, not what solves the problem.
"We fine-tuned a custom model" sounds better in a pitch deck than "we wrote good prompts." But if prompts work, fine-tuning is waste.
Pick based on results, not story. Your users don't care how you built it. They care that it works.
Next Steps
If you're building an AI feature and not sure which approach to use, start with prompt engineering. Prove it can't work before moving to RAG. Prove RAG can't work before considering fine-tuning.
If you need help evaluating which approach makes sense for your use case, or you want to avoid the expensive mistakes we've seen dozens of teams make, we've built AI features across all three approaches. We'll help you choose the right one and build it correctly.
Estimate the cost of your AI feature or talk to us about building a prototype that proves which approach works before you commit to a full build.