Cost Engineering
Most people send every task straight to the most expensive model they can find. That is like hiring a senior partner to photocopy documents. Stop burning money.
I run autonomous AI systems for a living: bots that process hundreds of requests a day across CRM pipelines, contract analysis, lead qualification, and content generation. If I sent every single task to ~anthropic/claude-opus-latest or openai/gpt-5.4, I would be bankrupt in a month. (I wrote about killing the SaaS tax here.)
Instead, I use a two-pass architecture that cuts my token costs by roughly 50% to 75%, depending on how expensive the heavy model is, while producing output that is indistinguishable from the expensive model working alone. Most people have never heard of this. Those who have are saving thousands of dollars a month.
Here is exactly how it works.
The Expensive Mistake Everyone Makes
When most people use AI, they do this:
- They have a complex task (write a report, analyze a contract, generate a marketing strategy).
- They send it directly to the most powerful model they have access to (~anthropic/claude-opus-latest, openai/gpt-5.4, Gemini Pro).
- They wait. They pay. They get a good result.
- They repeat this for every task, including the trivial ones.
The problem is that 90% of the work the expensive model does is scaffolding. It is figuring out the structure, drafting the rough content, organizing the raw information. That work does not require frontier intelligence. A lite model can do it in a fraction of the time, at a fraction of the cost.
The only part that truly benefits from the expensive model is the final 10%: the refinement, the nuance, the strategic polish, the tone calibration. That is where Opus earns its price tag. Everything before that is overkill.
The Two-Pass Architecture
The framework is simple. Two passes. Two models. One output that looks like the expensive model did everything.
Pass 1: The Draft (Lite Model)
You send the full task to a cheap, fast model. This is your workhorse. It handles:
- Initial research and information gathering
- Structure and outline creation
- First-draft content generation
- Data extraction and formatting
- Boilerplate and repetitive sections
Models I use for Pass 1:
- google/gemini-3.1-flash-lite-preview (Google) - $0.10/M tokens, 1M context, the best value drafting model right now
- openai/gpt-5.4.1 Nano (OpenAI) - $0.10/M tokens, 1M context, surprisingly capable for its price
- openai/gpt-5.4-nano (OpenAI) - $0.05/M tokens, the cheapest OpenAI option available
- qwen/qwen3.5-flash-02-23 (Alibaba) - $0.07/M tokens, 1M context, great for high-volume extraction
- mistralai/mistral-small-2603 (Mistral) - European, $0.15/M tokens, strong multilingual support
The output from Pass 1 is rough. It is 80% there. It has the right structure, the right information, the right bones. But it reads like a draft, because that is exactly what it is.
Pass 2: The Refine (Heavy Model)
You take the entire Pass 1 output, attach it to a short refinement prompt, and send it to the expensive model. The prompt is critical. You are not asking it to start from scratch. You are asking it to polish an existing piece.
// Pass 2 Refinement Prompt Template (April 2026)
You are a senior editor and strategist. Below is a first-draft document
produced by a junior model. Your job is to:
1. Improve the writing quality and flow.
2. Add strategic nuance and deeper insights.
3. Fix any factual inaccuracies or weak arguments.
4. Calibrate the tone to be executive, minimal, and authoritative.
5. Do NOT rewrite from scratch. Preserve the structure and key points.
[DRAFT FROM PASS 1 HERE]

The heavy model spends 90% of its tokens on refinement, not scaffolding. This is where the savings come from. You are paying Opus rates only for the work that actually requires Opus-level intelligence.
Pro Tip: In your Pass 2 prompt, explicitly tell the model "Do NOT rewrite from scratch." This prevents it from throwing away the draft and starting over, which would defeat the entire purpose and burn tokens for nothing.
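The two passes can be sketched in a few lines of Python. This is a minimal illustration, not my production code: `ModelCall`, `build_refine_prompt`, and `two_pass` are hypothetical names, and each model is represented by a plain function that takes a prompt and returns text, so you can plug in whichever client library you actually use.

```python
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns the model's text.
# Wrap your real API client (lite or heavy model) in a function of this shape.
ModelCall = Callable[[str], str]

# The refinement instructions from the template above.
REFINE_INSTRUCTIONS = """You are a senior editor and strategist. Below is a first-draft document
produced by a junior model. Your job is to:
1. Improve the writing quality and flow.
2. Add strategic nuance and deeper insights.
3. Fix any factual inaccuracies or weak arguments.
4. Calibrate the tone to be executive, minimal, and authoritative.
5. Do NOT rewrite from scratch. Preserve the structure and key points.
"""


def build_refine_prompt(draft: str) -> str:
    """Attach the Pass 1 draft to the short refinement instructions."""
    return f"{REFINE_INSTRUCTIONS}\n{draft.strip()}\n"


def two_pass(task: str, draft_model: ModelCall, refine_model: ModelCall) -> str:
    """Pass 1: the lite model drafts. Pass 2: the heavy model polishes that draft."""
    draft = draft_model(task)
    return refine_model(build_refine_prompt(draft))


if __name__ == "__main__":
    # Stub models so the sketch runs without any API key.
    result = two_pass(
        "Write a short market summary.",
        draft_model=lambda prompt: "Rough draft of the summary.",
        refine_model=lambda prompt: "Polished summary.",
    )
    print(result)
```

Passing the models in as functions keeps the pipeline independent of any one provider, which makes it easy to swap the Pass 1 model when a cheaper one appears.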
The Cost Math
Let me show you the actual numbers. Here is a realistic scenario: generating a 2,000-word strategic report with research, analysis, and recommendations.
| Approach | Model | Tokens Used | Cost |
| --- | --- | --- | --- |
| Naive (Single Pass) | ~anthropic/claude-opus-latest | ~12,000 | $0.18 |
| Naive (Single Pass) | openai/gpt-5.4 | ~12,000 | $0.15 |
| Two-Pass (Lite → Opus) | Flash Lite + Sonnet 4.6 | 9K (Flash Lite) + 3K (Sonnet 4.6) | $0.045 |
| Two-Pass (Nano → GPT-5.4) | Nano + GPT-5.4 | 9K (Nano) + 3K (GPT-5.4) | $0.055 |
Read that again. The two-pass approach produces the same quality output for $0.045 instead of $0.18. That is a 75% reduction. With Opus priced at $5/$25 per million input/output tokens, the savings reach 88%. On a per-request basis, the difference looks small. At scale, it is the difference between profitability and bankruptcy.
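If you want to check the arithmetic yourself, here is a small sketch that recomputes the savings from the costs in the table above. The helper names `savings_pct` and `blended_rate_per_million` are mine, and the blended rate is simply cost divided by tokens, an assumption that ignores the split between input and output pricing.

```python
def savings_pct(baseline_cost: float, new_cost: float) -> float:
    """Percentage saved by paying new_cost instead of baseline_cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return round((baseline_cost - new_cost) / baseline_cost * 100, 1)


def blended_rate_per_million(cost: float, tokens: int) -> float:
    """Implied dollars per million tokens (input and output lumped together)."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return round(cost / tokens * 1_000_000, 2)


if __name__ == "__main__":
    # Costs copied from the table: naive single pass vs. two-pass.
    print("Opus naive vs two-pass:", savings_pct(0.18, 0.045), "% saved")
    print("GPT-5.4 naive vs two-pass:", savings_pct(0.15, 0.055), "% saved")
    # Implied blended price of the naive Opus run (~12,000 tokens for $0.18).
    print("Implied Opus rate: $", blended_rate_per_million(0.18, 12_000), "per M tokens")
```

Running the same function on the GPT-5.4 rows shows a smaller saving than the Opus rows, which is why the headline percentage depends on which heavy model you are replacing.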
The Monthly Reality
| Usage Level | Naive (Opus Only) | Two-Pass | Saved |
| --- | --- | --- | --- |
| 50 requests/day | $720/mo | $165/mo | $555/mo |
| 200 requests/day | $2,880/mo | $660/mo | $2,220/mo |
| 500 requests/day | $7,200/mo | $1,650/mo | $5,550/mo |
If you are running an agency, a SaaS product, or any operation that makes hundreds of AI calls a day, this is not a nice-to-have optimization. It is an existential one.
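The monthly figures scale linearly with volume, which a short sketch makes explicit. Two things here are assumptions rather than numbers stated above: a 30-day month, and the per-request costs of $0.48 (naive) and $0.11 (two-pass), which are back-calculated from the 50 requests/day row. They are higher than the single-report cost in the earlier table, so treat them as the average cost of a full production request.

```python
DAYS_PER_MONTH = 30  # assumption: the monthly table uses a 30-day month

# Back-calculated from the 50 requests/day row: $720 / 1,500 and $165 / 1,500.
NAIVE_PER_REQUEST = 0.48
TWO_PASS_PER_REQUEST = 0.11


def monthly_cost(requests_per_day: int, cost_per_request: float,
                 days: int = DAYS_PER_MONTH) -> float:
    """Monthly spend in dollars for a fixed daily request volume."""
    return round(requests_per_day * days * cost_per_request, 2)


def monthly_row(requests_per_day: int) -> tuple[float, float, float]:
    """Return (naive, two_pass, saved) in dollars per month."""
    naive = monthly_cost(requests_per_day, NAIVE_PER_REQUEST)
    two_pass = monthly_cost(requests_per_day, TWO_PASS_PER_REQUEST)
    return naive, two_pass, round(naive - two_pass, 2)


if __name__ == "__main__":
    for volume in (50, 200, 500):
        naive, two_pass, saved = monthly_row(volume)
        print(f"{volume} requests/day: naive ${naive}, two-pass ${two_pass}, saved ${saved}")
```

Swap in your own per-request averages to project your bill before committing to the architecture.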
Example 1: Contract Analysis Pipeline
Task: Extract key clauses, identify risk factors, and generate an executive summary from a 40-page PDF contract.
Pass 1 (google/gemini-3.1-flash-lite-preview): Extracts all clauses, identifies section headers, pulls out dates, parties, financial terms, and termination conditions. Cost: $0.003.
Pass 2 (Claude Sonnet 4.6): Takes the extracted data, analyzes risk exposure, writes the executive summary with strategic recommendations, and flags unusual clauses. Cost: $0.04.
Total: $0.043 vs $0.38 if sent directly to Opus. Savings: 89%.
Example 2: Content Generation
Task: Write a 1,500-word technical blog post about a specific API integration.
Pass 1 (openai/gpt-5.4.1 Nano): Generates the outline, writes all sections with basic explanations, includes code snippets. Cost: $0.004.
Pass 2 (Claude Sonnet 4.6): Rewrites the intro and conclusion for impact, improves transitions, adds nuance to technical explanations, polishes the tone. Cost: $0.03.
Total: $0.034 vs $0.30 with Opus alone. Savings: 89%.
Example 3: Lead Qualification
Task: Analyze an inbound lead's message, check against ICP criteria, and draft a personalized response.
Pass 1 (qwen/qwen3.5-flash-02-23): Extracts company name, role, pain points, budget signals, and urgency level. Cost: $0.001.
Pass 2 (Gemini 2.5 Pro): Scores the lead, writes a personalized response that references their specific pain points, and suggests next steps. Cost: $0.015.
Total: $0.016 vs $0.12 with openai/gpt-5.4 alone. Savings: 87%.
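In extraction-style pipelines like this one, the hand-off between passes works best when Pass 1 returns structured data instead of free text. Below is a rough sketch for the lead qualification case. It assumes you prompt the lite model to reply in JSON with the five fields listed above; the `LeadFacts` class, the field names, and both helper functions are hypothetical.

```python
import json
from dataclasses import dataclass

# Fields the lite model is asked to extract in Pass 1 (names are illustrative).
REQUIRED_FIELDS = ("company", "role", "pain_points", "budget_signals", "urgency")


@dataclass
class LeadFacts:
    company: str
    role: str
    pain_points: list[str]
    budget_signals: list[str]
    urgency: str


def parse_pass1_output(raw: str) -> LeadFacts:
    """Parse the lite model's JSON reply and fail fast if a field is missing."""
    data = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"Pass 1 output is missing fields: {missing}")
    return LeadFacts(**{name: data[name] for name in REQUIRED_FIELDS})


def build_scoring_prompt(facts: LeadFacts, icp_criteria: str) -> str:
    """Build the Pass 2 prompt: score the lead and draft a personalized reply."""
    return (
        "Score this lead against the ICP criteria, then draft a personalized "
        "response that references their pain points and suggests next steps.\n\n"
        f"ICP criteria: {icp_criteria}\n"
        f"Company: {facts.company}\n"
        f"Role: {facts.role}\n"
        f"Pain points: {', '.join(facts.pain_points)}\n"
        f"Budget signals: {', '.join(facts.budget_signals)}\n"
        f"Urgency: {facts.urgency}\n"
    )
```

Failing fast on a malformed extraction means you never pay heavy-model rates to polish an incomplete draft.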
When To Skip This
The two-pass architecture is not always the right call. There are specific cases where you should go straight to the heavy model:
- Extremely short tasks: If the input and output are both under 500 tokens, the overhead of two API calls is not worth it. Just use the expensive model.
- Code generation: Writing working code from scratch requires high reasoning from the first token. A lite model will produce buggy scaffolding that costs more to fix than to generate correctly the first time.
- Adversarial or safety-critical tasks: If the output will be sent to customers, published externally, or used in legal/financial contexts, do not trust a lite model even for the draft. The cost of a hallucination is far higher than the token savings.
- Real-time chat: If the user is waiting for a response, the latency of two sequential API calls is unacceptable. Use the best model you can afford in a single pass.
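The skip rules above are easy to encode as a router that runs before any model call. This is a sketch under stated assumptions: the `Task` fields and the `choose_route` function are hypothetical, and it presumes you can estimate token counts and tag each task type up front.

```python
from dataclasses import dataclass

SHORT_TASK_TOKENS = 500  # threshold from the "extremely short tasks" rule


@dataclass
class Task:
    input_tokens: int
    expected_output_tokens: int
    is_code_generation: bool = False
    is_safety_critical: bool = False  # customer-facing, legal, or financial output
    is_real_time: bool = False        # a user is waiting on the response


def choose_route(task: Task) -> str:
    """Return "single_pass" (heavy model only) or "two_pass" (lite, then heavy)."""
    is_short = (task.input_tokens < SHORT_TASK_TOKENS
                and task.expected_output_tokens < SHORT_TASK_TOKENS)
    if is_short:
        return "single_pass"  # two API calls are not worth the overhead
    if task.is_code_generation:
        return "single_pass"  # buggy scaffolding costs more to fix than to write
    if task.is_safety_critical:
        return "single_pass"  # a hallucinated draft costs more than the savings
    if task.is_real_time:
        return "single_pass"  # two sequential calls add too much latency
    return "two_pass"


if __name__ == "__main__":
    print(choose_route(Task(input_tokens=8_000, expected_output_tokens=3_000)))
    print(choose_route(Task(input_tokens=200, expected_output_tokens=150)))
```

Keeping the rules in one function makes the exceptions auditable, so a new task type has to be classified deliberately instead of defaulting to the expensive path.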
CRITICAL: Always review the Pass 2 output. The refinement model can occasionally introduce errors if the Pass 1 draft was fundamentally wrong. The two-pass architecture assumes the draft is directionally correct, just rough. If the draft is garbage, the refine step will produce polished garbage.