MCP tool sprawl: why adding more servers can make agents worse
Connecting more MCP servers can give agents more capabilities — but it also expands the tool surface they have to reason over. Here's why agent performance can degrade as the tool surface grows, and what changes when routing moves into configuration.
Give an agent access to two well-described tools, and tool choice is usually obvious. Give it ten MCP servers with overlapping verbs, inconsistent schemas, and broad descriptions, and tool choice becomes part of the problem. The conventional wisdom — connect more capabilities, get a more capable agent — is partly right and partly a trap. The trap is that every tool an agent can see is a tool the agent has to pay attention to, and attention has a cost.
This is MCP tool sprawl. It isn't a bug in MCP, and it isn't a flaw in any particular server. It's an emergent property of the way agents reason over their available toolset. In practice, teams add MCP servers expecting capability gains and discover the reasoning cost only later, usually after the agent starts picking wrong tools, calling redundant tools, or burning context budget on definitions for tools it never uses.
The shape of the problem has three parts: token cost, selection quality, and description quality. Each is manageable in isolation. Together, they decide how well your agent can actually use the tools you give it.
Why more tools can mean worse tool selection
Tool selection is a search problem. Given a user request and a set of available tools, the model has to identify which tool, or sequence of tools, matches the request. With two tools, the choice is between two options. With two hundred, it is between two hundred, and the model has to weigh each one: not literally, but in the sense that all of them sit in its context as it decides.
Tool-use benchmarks increasingly treat tool selection as its own failure mode, not just a question of whether the model is "smart enough." A model can understand the task and still choose the wrong function, skip a necessary function, or pass the wrong arguments. That risk gets worse when the available tools are numerous, overlapping, or similarly described. Recent MCP-focused benchmark work also calls out long tool descriptions and parameter schemas as a practical limit on how many tools can be made available in a single run.
The failure mode isn't always obvious. An agent might call the right tool but with subtly wrong parameters because it confused two similar tool schemas. An agent might call a redundant pair of tools because both seemed plausible. An agent might skip a necessary call entirely because something else in the toolset matched more loosely. None of these read as "tool selection failure" in logs; they read as "the agent did the wrong thing."
The token cost of idle tools
Before any work begins, every visible tool definition competes for space in the model's context. Many MCP servers expose multiple tools (often separate operations for search, list, read, create, update, delete, or destination-specific actions), each with a name, a description, a parameter schema, and sometimes example invocations written into the description. Connect several servers and the agent can start every conversation with thousands of tokens of tool definitions before reading a single character of the user's prompt.
Tool definitions are one cost; tool outputs are another. Some MCP servers can also return large payloads once called. Vapi's MCP documentation explicitly warns that tool calls such as GitHub API queries can return enough data to exceed model context limits, affecting assistant performance and causing failures with some models. Definitions crowd the context up front; outputs crowd it mid-task.
The cost isn't just token budget. Tokens spent on tool definitions are tokens the model isn't spending on reasoning about the task, the conversation history, or the structured output. Long-running agent sessions compound this: the tool definitions stay resident for every turn, while the conversation grows around them.
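To make the up-front cost concrete, here is a minimal sketch that estimates how many tokens a set of tool definitions occupies. The tool definitions are hypothetical, and the four-characters-per-token ratio is a rough assumption rather than a real tokenizer, so treat the output as an order-of-magnitude figure. It also assumes definitions are sent with every request and ignores prompt caching.

```python
import json

# Hypothetical tool definitions, shaped loosely like MCP tool objects
# (name, description, inputSchema). The contents are invented for illustration.
TOOLS = [
    {
        "name": "crm_create_contact",
        "description": "Create a contact in the CRM.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "email": {"type": "string"},
                "name": {"type": "string"},
            },
            "required": ["email"],
        },
    },
    {
        "name": "messaging_post_message",
        "description": "Post a message to a channel.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "channel": {"type": "string"},
                "text": {"type": "string"},
            },
            "required": ["channel", "text"],
        },
    },
]

CHARS_PER_TOKEN = 4  # assumption: crude heuristic, not a real tokenizer


def estimate_tokens(tool: dict) -> int:
    """Estimate tokens for one tool definition from its compact JSON length."""
    serialized = json.dumps(tool, separators=(",", ":"))
    return -(-len(serialized) // CHARS_PER_TOKEN)  # ceiling division


def resident_cost(tools: list, turns: int) -> tuple:
    """Return (tokens per turn, tokens across all turns) for idle definitions."""
    per_turn = sum(estimate_tokens(t) for t in tools)
    return per_turn, per_turn * turns


per_turn, total = resident_cost(TOOLS, turns=20)
print(f"{len(TOOLS)} tools: ~{per_turn} tokens per turn, ~{total} over 20 turns")
```

Running the same estimate against the real definitions your client sends, with the tokenizer for your model, gives a more trustworthy number than this heuristic.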
How tool descriptions compete for the agent's attention
MCP standardizes where tool descriptions and schemas live, but it doesn't enforce a quality bar for how tools describe themselves. Each server author writes descriptions, examples, and parameter docs in their own voice and at their own level of detail. The result is a toolset with variable quality: some tools have terse one-line descriptions that under-sell what they do; others have multi-paragraph descriptions that aggressively claim broad relevance.
This matters because the model uses descriptions to decide. A well-described tool with a focused scope wins more selections than a poorly-described tool with the same actual capability. A loudly-described tool that claims relevance to many situations wins selections away from quieter tools that would have done a better job. There is no universal scoring layer that tells the model which tool description is more accurate, more complete, or less self-promotional. MCP's own security guidance treats descriptions of tool behavior, such as annotations, as untrusted unless they come from a trusted server. In practice, the model still has to choose based on the tool names, descriptions, schemas, and surrounding context it is given. If a tool claims it handles email, calendar, contacts, and CRM updates, that claim carries weight whether or not it's the best fit.
When multiple MCP servers are connected, this turns into a quiet competition. Servers from different authors describe themselves in maximally relevant terms. Servers covering overlapping use cases bid for the same selections. The agent's choice is influenced by whichever description wins on phrasing, not necessarily whichever tool is best suited for the task.
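One way to spot descriptions that are likely to compete is to compare them pairwise before connecting a new server. The sketch below uses Jaccard similarity over word sets, which is a crude lexical proxy chosen for illustration; it does not measure how a model actually weighs descriptions. The descriptions and the 0.5 threshold are assumptions; the tool names are borrowed from the examples used elsewhere in this article.

```python
import re
from itertools import combinations


def words(text: str) -> set:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of words two descriptions have in common (0.0 to 1.0)."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def confusable_pairs(tools: dict, threshold: float = 0.5) -> list:
    """Return (name, name, score) for description pairs at or above threshold."""
    pairs = []
    for (n1, d1), (n2, d2) in combinations(tools.items(), 2):
        score = jaccard(d1, d2)
        if score >= threshold:
            pairs.append((n1, n2, round(score, 2)))
    return pairs


# Hypothetical descriptions from three different server authors.
TOOLS = {
    "crm_a.create_contact": "Create a new contact record in the CRM",
    "crm_b.upsert_lead": "Create or update a lead record in the CRM",
    "messaging.post_message": "Post a message to a channel",
}

print(confusable_pairs(TOOLS))
# prints [('crm_a.create_contact', 'crm_b.upsert_lead', 0.55)]
```

A flagged pair is a prompt for human review, not a verdict: two tools can share vocabulary and still be clearly distinct, and two differently worded tools can still overlap in function.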
The CRUD-per-destination pattern and why it compounds
The default shape of an MCP server today is one server per destination system, each exposing fine-grained CRUD operations against that destination's API. A CRM MCP server exposes tools to create contacts, update contacts, search contacts, list deals, create deals, update deal stages, and so on. A second CRM MCP server exposes a similar set against its own data model. A messaging MCP server exposes tools for sending messages, looking up users, posting to channels.
Each individual server is reasonable. The problem is additive. An agent expected to react to product events across a stack of seven destinations is looking at seven servers, each contributing its own set of CRUD tools. Even at five to twenty operations per destination, the agent's effective toolset lands somewhere in the dozens to low hundreds — most of them variations on "create," "update," and "search" against different shapes of underlying data.
This is where sprawl compounds. The token cost grows roughly linearly with the number of destinations. The selection difficulty grows faster than linearly, because many of the tools are similar enough to be confused with one another. And the description-quality variance grows with the number of authors.
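The arithmetic behind that compounding can be sketched in a few lines. The tool count follows directly from the seven-destination example; using the number of distinct tool pairs as a stand-in for selection difficulty is an assumption made here for illustration, since it only bounds how many pairs could be confused, not how many actually are.

```python
def toolset_size(destinations: int, ops_per_destination: int) -> int:
    """Total tools when every destination exposes the same number of operations."""
    return destinations * ops_per_destination


def tool_pairs(n: int) -> int:
    """Number of distinct tool pairs, n choose 2: an upper bound on confusable pairs."""
    return n * (n - 1) // 2


for ops in (5, 20):
    n = toolset_size(7, ops)
    print(f"7 destinations x {ops} ops = {n} tools, {tool_pairs(n)} possible pairs")
# prints:
# 7 destinations x 5 ops = 35 tools, 595 possible pairs
# 7 destinations x 20 ops = 140 tools, 9730 possible pairs
```

Quadrupling the operations per destination quadruples the tool count but multiplies the pair count by more than sixteen, which is the sense in which similar tools make selection harder faster than they add cost.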
From a behavioral standpoint, the agent's job often isn't to choose between CRUD operations against seven destinations. It's to react to a product event by getting the right information to the right systems. The CRUD shape forces the agent to reconstruct that intent in terms of seven destinations' individual data models on every invocation.
What an emit-shaped tool surface looks like
An alternative pattern is to expose a small set of verbs that operate over destinations rather than against them. Instead of crm_a.create_contact, crm_b.upsert_lead, messaging.post_message, and email.add_subscriber as four distinct tools, the agent sees something closer to emit_event with a payload describing what happened. The routing decision — which destinations get which payload shapes, which fields map where, which destinations receive which event types — moves out of the agent's runtime decision-making and into configuration.
The agent's tool surface shrinks dramatically. Where the CRUD pattern might expose a hundred tools across destinations, the emit pattern might expose three or four. Context cost drops. Selection difficulty drops, because the verb space is small and the verbs are mutually distinct. Description variance drops, because the same author writes the same small set.
The trade-off is agency. With CRUD tools, the agent can decide at runtime that a particular event should update one CRM but skip another. With emit-shaped tools, that decision was made when the routing was configured — the agent emits the event, and the routing layer fans it out according to rules the agent doesn't need to reason over. Some agents need the runtime agency; many don't.
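A minimal sketch of the emit pattern, assuming a hypothetical routing table and a single emit_event function, might look like the following. The event types, destination names, and field mappings are invented for illustration, and the function only returns the deliveries it would make; a real routing layer would also handle network calls, retries, and failures.

```python
from dataclasses import dataclass

# Hypothetical routing configuration: which destinations receive which event
# types, and how event fields map to each destination's field names.
# This lives in configuration, outside the agent's context.
ROUTES = {
    "user.signed_up": [
        {"destination": "crm_a", "fields": {"email": "email", "name": "full_name"}},
        {"destination": "email", "fields": {"email": "subscriber_email"}},
    ],
    "deal.closed": [
        {"destination": "messaging", "fields": {"deal_name": "text"}},
    ],
}


@dataclass
class Delivery:
    destination: str
    payload: dict


def emit_event(event_type: str, payload: dict, routes: dict = ROUTES) -> list:
    """The one tool the agent sees: describe what happened, nothing more.

    Returns the planned deliveries instead of sending them (sketch only).
    """
    deliveries = []
    for rule in routes.get(event_type, []):
        # Rename source fields to the destination's names; skip missing fields.
        mapped = {
            dest_key: payload[src_key]
            for src_key, dest_key in rule["fields"].items()
            if src_key in payload
        }
        deliveries.append(Delivery(rule["destination"], mapped))
    return deliveries


for d in emit_event("user.signed_up", {"email": "ada@example.com", "name": "Ada"}):
    print(d.destination, d.payload)
```

The agent only has to get the event type and payload right. Adding or removing a destination is a change to ROUTES, so it never changes the tool definition the model sees.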
When CRUD-style tools are the right answer
CRUD-style tools are the right answer when the agent's job genuinely requires per-destination decision-making at runtime. An agent that answers customer support questions by querying multiple systems, choosing what to look at based on context, and stitching answers together needs fine-grained tools. An agent that operates as a power user inside a particular destination — researching a deal, refining a contact, drafting a message — needs the full vocabulary of that destination.
The common thread is that the agent's value depends on its ability to reason about individual destinations as distinct systems with distinct affordances. The cost of tool sprawl is the price of admission for that capability.
When emit-style tools are the right answer
Emit-style tools are the right answer when the agent's job is to react to events by updating systems of record. The agent doesn't need to know that one destination stores contacts and another stores leads and a third stores messages; it needs to know that a user signed up, and the right things should happen downstream. The routing decisions are knowable in advance and don't benefit from per-event reconsideration.
Most agents that fall into the "automation" rather than "exploration" category are in this bucket. The work isn't in choosing between destinations event by event; the work is in describing the event clearly enough that the routing layer can fan it out reliably.
The choice you're actually making
When teams connect a new MCP server, they tend to frame the decision as "should we give the agent this capability." The more useful framing is "is this capability worth what it costs the agent's reasoning over every other tool it already has." Capabilities aren't free additions to a toolbox; they're contributions to a search space the agent has to navigate every time it acts.
Some additions are worth it. A new destination with a genuinely distinct API surface and a clear use case adds capability that justifies the context. Many additions are not worth it on their own terms. A fifth CRM-shaped tool surface, a third notification destination, a redundant data lookup — these contribute more confusion than capability.
The teams that get the most out of agents tend to make this trade-off explicitly. They pick a tool surface shape — CRUD-heavy for exploratory work, emit-heavy for reactive work — keep it small, and treat additions as architectural decisions rather than feature requests.
A useful test is whether the agent should be deciding this at runtime at all. If the decision depends on the user's current intent, expose the tool. If the decision is a business rule you would configure once and repeat hundreds of times, don't make the agent rediscover it on every run.
Meshes gives AI agents and SaaS products a smaller outbound tool surface: emit one event, and configured workspace rules deliver it to HubSpot, Salesforce, Intercom, Mailchimp, Slack, webhooks, and more, with retries, fan-out, field mappings, and delivery history handled outside the agent's context.
Frequently asked questions
What is MCP tool sprawl?
MCP tool sprawl is the degradation in agent performance that can happen as more MCP servers are connected. It surfaces as worse tool selection, higher token costs, and inconsistent behavior across runs. The underlying cause is that every available tool occupies context and contributes to the model's search space when deciding what to call.
Why do agents get worse with more tools available?
As the available tool space grows, the model has more options to weigh, and selection quality degrades. Failures take several forms: calling the wrong tool, calling redundant tools, or skipping necessary calls. This is distinct from capability: a model can understand the task perfectly and still pick the wrong function when the toolset is large, overlapping, or inconsistently described.
How many MCP servers is too many?
There isn't a universal threshold, but the effect is observable in practice well before fifty tools. The number depends on tool description quality, semantic overlap between tools, and the model being used. A useful heuristic is that tools should be mutually distinct enough that a brief description differentiates them; when descriptions start blurring, sprawl is already a problem.
What is the difference between CRUD-style and emit-style MCP tools?
CRUD-style tools expose fine-grained operations against a destination's data model: create contact, update lead, send message. Emit-style tools expose a small set of verbs that operate over destinations through configuration: emit an event, and the routing layer fans it out. CRUD-style maximizes agent agency at the cost of context surface. Emit-style minimizes context surface at the cost of runtime agency.
Does connecting an MCP server cost tokens even when its tools aren't used?
Usually, yes. In most agent setups, available tool definitions are provided to the model before it decides what to call, whether or not those tools are eventually used. Tool gating, dynamic loading, and scoped toolsets can reduce this, but unused tools still carry a cost when they are visible to the model.
Why don't MCP servers describe their tools consistently?
MCP standardizes the tool object (name, description, schema, and annotations), but it doesn't enforce a quality bar for the prose itself. Server authors write descriptions, examples, and parameter docs in their own voice and at their own level of detail. This produces a toolset where some tools sell themselves aggressively and others underdescribe their function, and the model has no universal scoring layer to tell which self-description is most accurate.