Build a Multi-Agent AI System From Scratch in 2026
A practical Python tutorial for building a multi-agent AI system from scratch. No frameworks, no magic, just an orchestrator, workers, and clean message passing.

Most multi-agent tutorials hand you a bloated framework, a 40-line config file, and call it a day. That's not learning. That's copy-paste.
So this guide goes the other way. You'll build a multi-agent AI system from scratch in Python, using nothing but the Anthropic SDK and the standard library. By the end you'll have an orchestrator that breaks down a task, dispatches it to specialist agents, and merges their outputs into a final answer. The whole thing fits in under 200 lines.
And yes, it actually works.
You'll build a research assistant with three agents working together:
- A planner that breaks the user's question into independent subtasks
- Workers that research each subtask in parallel
- A synthesizer that merges the partial answers into one final response
This is the classic orchestrator-worker pattern, and it's the foundation underneath production systems like Anthropic's research agent and most agentic RAG setups. Once you understand it, every other multi-agent framework (CrewAI, AutoGen, LangGraph) becomes much easier to read.

We'll use Claude Sonnet 4.6 for the workers (fast, cheap at $3/$15 per million tokens) and Claude Opus 4.7 for the planner and synthesizer where reasoning matters more. You can swap in GPT-4o or DeepSeek with about ten lines of changes.
Before you start:
- Working knowledge of async/await (we'll run workers in parallel)

Install the SDK:
```bash
pip install anthropic
```
Set your key:
```bash
export ANTHROPIC_API_KEY="sk-ant-..."
```
That's the whole setup. No vector database, no framework, no Docker. Pretty refreshing, not gonna lie.
Every agent in your system needs the same three things: a system prompt, an input, and a structured output. So let's build a single Agent class that all three roles will reuse.
```python
import json
import asyncio

from anthropic import AsyncAnthropic

# The client reads ANTHROPIC_API_KEY from the environment automatically.
client = AsyncAnthropic()


class Agent:
    def __init__(self, name: str, model: str, system: str):
        self.name = name
        self.model = model
        self.system = system

    async def run(self, user_message: str) -> str:
        response = await client.messages.create(
            model=self.model,
            max_tokens=2048,
            system=self.system,
            messages=[{"role": "user", "content": user_message}],
        )
        return response.content[0].text
```
That's the entire abstraction. One class. One method. If you're used to LangChain's Runnable chains, this might feel suspiciously simple. But that's the point. Most of the complexity in agent frameworks exists to support features you don't need yet.
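If you want to sanity-check the class before wiring up the pipeline, a throwaway smoke test looks like this (the agent name and prompt are arbitrary; delete it once the rest of the file exists):

```python
# One agent, one call. Requires ANTHROPIC_API_KEY in the environment.
smoke = Agent(
    name="smoke",
    model="claude-sonnet-4-6",
    system="You are a helpful assistant. Answer in one sentence.",
)
print(asyncio.run(smoke.run("What does an orchestrator do in a multi-agent system?")))
```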
The planner is the brain of the operation. It receives the user's question and produces a JSON list of subtasks. We want strict structure here, so the system prompt does a lot of work.
```python
PLANNER_SYSTEM = """You are a planning agent. Given a user question, break it
into 2-4 independent research subtasks that can be answered in parallel.

Return ONLY valid JSON in this exact format:
{"subtasks": ["task 1", "task 2", "task 3"]}

Rules:
- Each subtask must be answerable on its own
- Subtasks must not depend on each other's output
- Phrase each subtask as a specific question
"""

planner = Agent(
    name="planner",
    model="claude-opus-4-7",
    system=PLANNER_SYSTEM,
)


async def plan(question: str) -> list[str]:
    raw = await planner.run(question)
    return json.loads(raw)["subtasks"]
```

A word of caution: JSON parsing from LLMs is fragile. In production you'd want either tool use (the structured output feature in the Anthropic API) or a retry loop that catches json.JSONDecodeError. For learning purposes, this is fine.
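As a stopgap, a retry wrapper is only a few lines. A minimal sketch (`plan_with_retry` is a name introduced here, not part of the pipeline above):

```python
async def plan_with_retry(question: str, attempts: int = 2) -> list[str]:
    # Re-ask the planner if it returns malformed JSON; fail loudly after that.
    last_error: Exception | None = None
    for _ in range(attempts):
        raw = await planner.run(question)
        try:
            return json.loads(raw)["subtasks"]
        except (json.JSONDecodeError, KeyError) as e:
            last_error = e
    raise ValueError(f"Planner returned invalid JSON after {attempts} attempts") from last_error
```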
If subtask 2 depends on subtask 1's output, you can't parallelize them. The whole speed advantage of multi-agent systems comes from running workers concurrently. So the planner's job is partly to enforce independence by phrasing tasks that don't reference each other.
This is the single biggest mistake beginners make. They write a planner that produces sequential steps ("first find X, then use X to find Y"), and end up with a slow agent that's just doing chain-of-thought through three API calls. Don't do that.
Workers are the simplest piece. Each one gets a subtask, does the research, returns a short factual answer. We'll use Claude Sonnet 4.6 because it's fast and the work doesn't need deep reasoning.
```python
WORKER_SYSTEM = """You are a research worker. Answer the given question
concisely and factually. If you don't know, say so. Maximum 200 words.
Cite sources only if you are certain they exist."""

worker = Agent(
    name="worker",
    model="claude-sonnet-4-6",
    system=WORKER_SYSTEM,
)


async def do_subtask(subtask: str) -> dict:
    answer = await worker.run(subtask)
    return {"subtask": subtask, "answer": answer}
```

Notice we return a dict, not a raw string. The synthesizer needs to know which subtask each answer corresponds to. This sounds obvious but I've seen plenty of broken implementations where workers just dump strings into a list and the final agent gets confused about what answered what.
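Concretely, the synthesizer ends up receiving a list shaped like this (illustrative values, not real model output):

```python
results = [
    {"subtask": "What are Claude Opus 4.7's scores on major coding benchmarks?",
     "answer": "Opus 4.7 scores roughly ..."},
    {"subtask": "How are GPT-4o and Gemini 2.5 Pro priced per million tokens?",
     "answer": "GPT-4o is priced at ..."},
]
```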
The synthesizer takes the planner's subtasks and the workers' answers, then writes the final response. It's the only agent that sees the original question alongside all the partial results.
```python
SYNTHESIZER_SYSTEM = """You are a synthesis agent. You receive a user's
original question and a set of subtask answers from research workers.
Write a single, coherent answer to the original question that integrates
the worker findings. Do not mention the multi-agent process. Do not list
the subtasks. Write as if you researched everything yourself."""

synthesizer = Agent(
    name="synthesizer",
    model="claude-opus-4-7",
    system=SYNTHESIZER_SYSTEM,
)


async def synthesize(question: str, results: list[dict]) -> str:
    payload = f"Original question: {question}\n\nWorker results:\n"
    for r in results:
        payload += f"\nQ: {r['subtask']}\nA: {r['answer']}\n"
    return await synthesizer.run(payload)
```
That instruction about "don't mention the multi-agent process" matters more than you'd think. Without it, the synthesizer often writes things like "Based on the research from my workers..." which makes the output sound like a project status report instead of an answer.
Now the fun part. The orchestrator runs the planner, fans out to workers in parallel using asyncio.gather, then calls the synthesizer.
```python
async def orchestrate(question: str) -> str:
    print("[orchestrator] Planning subtasks...")
    subtasks = await plan(question)
    print(f"[orchestrator] Got {len(subtasks)} subtasks")

    print("[orchestrator] Running workers in parallel...")
    results = await asyncio.gather(*[do_subtask(s) for s in subtasks])

    print("[orchestrator] Synthesizing final answer...")
    return await synthesize(question, results)


if __name__ == "__main__":
    question = "Compare Claude Opus 4.7, GPT-4o, and Gemini 2.5 Pro on coding benchmarks and pricing."
    final = asyncio.run(orchestrate(question))
    print("\n=== FINAL ANSWER ===\n")
    print(final)
```
Run it:
```bash
python agents.py
```
For the sample question above, you'll see roughly: the planner generates three subtasks (coding benchmarks, pricing, context windows), the three workers fire off concurrently in about 3-5 seconds total instead of 9-15 sequentially, and the synthesizer produces a clean comparison. End-to-end token spend lands around 8,000-12,000 input tokens and 1,500-2,500 output tokens, or roughly $0.06-$0.12 per query with the model mix above. That parallelism is the entire reason this pattern exists.
A few things will bite you the first time you build this.
The planner produces dependent subtasks. Already covered, but worth repeating because it kills your performance gains. If you see workers waiting on each other, your planner prompt needs more aggressive independence rules.
JSON parsing fails hard. When the planner returns malformed JSON, json.loads throws and your whole pipeline dies. Wrap it in a try/except with one retry (the plan_with_retry sketch earlier does exactly this). Better: use the tool use API for guaranteed structure.
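For reference, the tool use route looks like this. A minimal sketch against the Anthropic messages API; the `submit_plan` tool name is made up for this example:

```python
async def plan_structured(question: str) -> list[str]:
    # Forcing tool_choice makes the model respond with a tool call whose
    # input matches the schema, so there's no JSON string to parse.
    response = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=PLANNER_SYSTEM,
        tools=[{
            "name": "submit_plan",
            "description": "Submit the list of independent research subtasks.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "subtasks": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["subtasks"],
            },
        }],
        tool_choice={"type": "tool", "name": "submit_plan"},
        messages=[{"role": "user", "content": question}],
    )
    tool_use = next(b for b in response.content if b.type == "tool_use")
    return tool_use.input["subtasks"]
```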
Workers hallucinate citations. LLMs love inventing plausible-looking URLs. The system prompt above tells workers to cite "only if certain" but the real fix is giving workers actual tool access (web search, RAG retrieval). Without grounding, treat factual claims as suspect.
Costs explode. Three agents per query means three API calls minimum, often more. At Opus 4.7 prices ($5/$25 per million tokens as of early 2026), a single query with long context can run $0.10-$0.30. Use Sonnet for workers. Use Opus only where reasoning quality matters.
Async errors kill the batch. If one worker throws an exception inside asyncio.gather, that exception propagates immediately and the results of every other worker are discarded. Add return_exceptions=True to gather, then check each result before passing it to the synthesizer.
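A minimal sketch of that handling, as a drop-in for the gather call inside orchestrate (the variable names are mine):

```python
results = await asyncio.gather(
    *[do_subtask(s) for s in subtasks], return_exceptions=True
)
# Keep successful results; surface failures loudly instead of dying quietly.
ok = [r for r in results if not isinstance(r, BaseException)]
failed = [s for s, r in zip(subtasks, results) if isinstance(r, BaseException)]
if failed:
    print(f"[orchestrator] {len(failed)} worker(s) failed: {failed}")
if not ok:
    raise RuntimeError("All workers failed; nothing to synthesize")
```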
Three quick verification steps before you call it done.
Timing test: run a question with three subtasks. Total time should be roughly plan_time + max(worker_times) + synthesis_time, not the sum of every individual call. If it's the sum, your parallelism is broken (a timing sketch follows this list).
Determinism test: run the same question five times. The plans should be similar but not identical. If they're wildly different each run, your planner prompt is too vague. If they're byte-identical, you might have temperature set to 0 (which is fine, but worth knowing).
Failure injection: deliberately break one worker (raise an exception inside do_subtask). The system should either recover gracefully or fail loudly with a clear error. Silent partial failures are the worst kind of bug.
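Here's one way to wire up the timing test from the first check above (a sketch; `timed` and `orchestrate_timed` are helper names introduced here):

```python
import time

async def timed(label: str, awaitable):
    # Await anything and report how long it took.
    start = time.perf_counter()
    result = await awaitable
    print(f"[timing] {label}: {time.perf_counter() - start:.1f}s")
    return result

async def orchestrate_timed(question: str) -> str:
    subtasks = await timed("plan", plan(question))
    results = await timed(
        "workers (parallel)",
        asyncio.gather(*[do_subtask(s) for s in subtasks]),
    )
    # If "workers (parallel)" is close to your slowest single worker rather
    # than the sum of all workers, the parallelism is working.
    return await timed("synthesis", synthesize(question, results))
```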
For more rigorous evaluation, Anthropic's evaluation cookbook has examples of running structured eval suites against agent systems.
You've got the skeleton. The next layers, roughly in order of usefulness:
Tool use for workers. Give workers actual capabilities: web search via Exa, code execution, file reading. This is the biggest jump in real-world usefulness. A worker that can search the web is dramatically more useful than one relying on training data.
Memory between turns. Right now each query is independent. Add a simple conversation history dict keyed by session ID, and pass relevant context into the planner (a sketch follows this list).
Worker specialization. Instead of identical workers, build different ones with different system prompts: a "data worker" that loves numbers, a "summary worker" that writes prose, a "code worker" that returns runnable snippets. The planner then assigns subtasks based on type.
Observability. Log every agent call with input, output, latency, and token count. When something goes wrong (and it will), you need to see which agent produced the bad output. Langfuse and Helicone both work well for this without forcing you into a framework.
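For the memory layer specifically, a minimal sketch of the session-history idea (`SESSIONS` and `orchestrate_session` are names introduced here):

```python
# session_id -> list of (question, answer) turns
SESSIONS: dict[str, list[tuple[str, str]]] = {}

async def orchestrate_session(session_id: str, question: str) -> str:
    history = SESSIONS.setdefault(session_id, [])
    # Prepend the last few turns so the planner can resolve references
    # like "compare that to Gemini" against earlier answers.
    context = "\n".join(f"Earlier Q: {q}\nEarlier A: {a}" for q, a in history[-3:])
    enriched = f"{context}\n\nCurrent question: {question}" if context else question
    answer = await orchestrate(enriched)
    history.append((question, answer))
    return answer
```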
Once you've added tool use and memory, you've essentially built a smaller version of what production agent systems do. From there, frameworks like LangGraph start making sense because you understand what they're abstracting over. Build from scratch first. Adopt frameworks second. That order matters.
If you're weighing whether to stay with raw asyncio or graduate to a framework, here's the honest tradeoff in concrete terms:
| Concern | Raw asyncio (this guide) | LangGraph / CrewAI | AutoGen |
|---|---|---|---|
| Lines of code for a 3-agent pipeline | ~150 | ~80-100 + framework | ~120 + framework |
| New dependencies | 1 (anthropic SDK) | 12-25 transitive | 30+ transitive |
| Time to first run | 5 minutes | 30-60 minutes (concepts + setup) | 45-90 minutes |
| Conditional routing (if/else between agents) | Manual (write the if) | Built-in graph edges | Built-in via group chat |
| Human-in-the-loop checkpoints | Roll your own | Native interrupt API | Native via UserProxyAgent |
| Debugging when something breaks | Read your own 150 lines | Trace through framework internals | Trace through framework internals |
| Lock-in risk if you swap providers | Near zero | High (graph definitions are framework-specific) | High |
The rule of thumb? Build raw until your conditional orchestration logic exceeds ~50 lines or you need persistent state across sessions. That's roughly when the framework's abstractions start paying for themselves instead of taxing you.
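To make the table's "manual (write the if)" row concrete, conditional routing in the raw approach really is just an if statement (a sketch; `code_worker` is a hypothetical specialist agent along the lines of the worker-specialization idea above):

```python
async def route_and_run(subtask: str) -> dict:
    # One cheap classifier call decides which specialist handles the subtask.
    kind = await worker.run(
        f"Classify this task as CODE or RESEARCH. Reply with one word only: {subtask}"
    )
    chosen = code_worker if "CODE" in kind.upper() else worker
    answer = await chosen.run(subtask)
    return {"subtask": subtask, "answer": answer}
```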
FAQ
Do I need a framework to build a multi-agent system?
No. Frameworks like LangChain, CrewAI, and AutoGen add convenience but introduce learning overhead and lock you into their abstractions. For production systems handling fewer than ~10 agent types, raw SDK calls plus a small orchestrator (like the one above) are often faster to debug and cheaper to maintain. Adopt a framework once you hit features you genuinely need, such as streaming graph execution or built-in human-in-the-loop checkpoints.
How much does a multi-agent pipeline cost per query?
Expect 2-5x the cost of a single call, depending on how many workers fire. A three-agent pipeline (planner + 3 workers + synthesizer) using Claude Opus 4.7 for orchestration and Sonnet 4.6 for workers typically lands at $0.05-$0.20 per query as of early 2026. The biggest cost lever is using cheaper models for workers, since they handle the bulk of token volume.
Can I use open-source models instead of Claude?
Yes, and it works well. Swap the Anthropic client for an OpenAI-compatible client pointing at a provider like Together AI, Fireworks, or DeepSeek's API. The architecture is identical. One caveat: open models tend to follow strict JSON formatting instructions less reliably, so you'll want a JSON-mode flag or a retry loop around the planner's output parsing.
What happens if one worker fails mid-run?
With the basic `asyncio.gather` setup, one exception cancels the others. Add `return_exceptions=True` to gather, then inspect results before synthesis. A common pattern is to retry failed workers once with a longer timeout, then pass partial results to the synthesizer with a note about which subtasks are missing. The synthesizer can produce a useful answer from 2 out of 3 subtasks in most cases.
When should I switch to a framework like LangGraph?
When your agent flow stops being a simple fan-out. If you need conditional routing ("if worker A says X, run worker D instead of B"), cyclic execution (agents that call themselves), or persistent state across long-running sessions, a graph framework pays for itself. For pure orchestrator-worker patterns, raw asyncio stays simpler indefinitely.