The Model Context Protocol has made it easy to plug an LLM into almost anything: a warehouse, a ticketing system, a CRM. But somewhere along the way, a quiet assumption took hold: that if an agent can reach a system, it can use it well.
Being accessible to an agent and being optimized for agent use are very different goals. As agents take on operational workloads (querying data warehouses, triaging support tickets, chaining multi-step lookups), the design of the MCP server itself becomes a first-order performance factor. Tool abstractions, output formats, and embedded domain context directly drive token consumption, tool-call counts, latency, and ultimately whether the task gets done at all.
Most MCP servers today are system-shaped: they mirror the API of the system underneath, one endpoint per tool. We think the next generation will be task-shaped: built around what the agent is trying to accomplish. Nexla’s MCP Studio exists to make building those servers easy, so we put our servers through a rigorous benchmark, scoring each one against a fixed set of parameters, to test whether task-based design actually pays off.
It does. The rest of this post covers two things: how we measure server quality against a consistent set of parameters, and what those measurements showed when we pointed them at a real BigQuery workload.
Watch two agents answer the same question
Here is a real task from the benchmark: “Among active connectors, which one has the highest reliability grade?” Below, the same model (Claude Sonnet 4.6) answers it twice: once through the official Google BigQuery MCP, and once through a task-based Nexla MCP scoped to the connector-quality dataset.
The generic server makes the agent discover the warehouse first: list the datasets, list the tables, fetch a schema, ask the user a clarifying question, and only then query, sometimes twice. The task-based server ships that context up front, so the agent goes straight to the answer. Press Run the race and watch the tool calls, tokens, and clock (replayed at ~6× speed, using the benchmark’s average trace):
Same model. Same question. Same warehouse. The only variable is the MCP server in between, and switching to the task-based one cut tool calls 6×, finished about 3× faster, and used under a third of the tokens. That gap is exactly what our benchmark is built to measure.
What we measure
Each benchmark runs head-to-head comparisons between MCP servers on the same live systems, with the same candidate model and the same judge. No synthetic toys: every task is drawn from real operational queries our analysts and customer engineers run against production systems. The framework itself is system-agnostic: hold the model, the tasks, and the data constant, vary only the server, and whatever moves is attributable to the server’s design. The run in this post points it at live BigQuery tables.
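The controlled-comparison idea above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the `Matchup` record, `run_matchup`, and the `fake_run_task` stub are hypothetical names, and a real `run_task` would drive an agent against a live MCP server.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Matchup:
    """Everything held constant across servers in one comparison (hypothetical record)."""
    model: str              # candidate model, identical for every server
    judge: str              # judge model that scores answers after the runs
    tasks: Tuple[str, ...]  # the same task list for every server


def run_matchup(matchup, servers, run_task):
    """Run every task against every server; the server is the only thing that varies."""
    results = {}
    for server in servers:
        results[server] = [
            run_task(server, matchup.model, task) for task in matchup.tasks
        ]
    return results


def fake_run_task(server, model, task):
    """Stub standing in for a real agent run; it only records its inputs."""
    return {"server": server, "model": model, "task": task}


matchup = Matchup(model="candidate-model", judge="judge-model", tasks=("task 1", "task 2"))
results = run_matchup(matchup, ["server_a", "server_b"], fake_run_task)
```

Because the model, judge, and tasks live in one frozen record, nothing but the server argument can differ between the two result lists.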
Every run is scored on the same eight parameters:
Correctness
LLM-judge score (0 to 1) against the expected answer.
Task completion
Whether the agent finished with a usable answer at all.
Tool calls
How many MCP tool invocations it took to reach the answer.
Failed tool calls
Calls that errored out: access denied, wrong schema, or a bad query.
Time to first answer
Seconds until the agent produced its first answer.
End-to-end time
Total wall-clock time from question to final answer.
Total tokens
Input plus output tokens consumed across the whole run.
Clarification turns
Times the agent had to stop and ask before it could answer.
Each system under test gets 20 evaluation tasks per harness. The BigQuery tasks cover schema exploration, aggregation, filtering, and ranking over connector-quality tables. Many require the agent to first figure out which dataset and table even hold the answer, which is exactly where schema-navigation overhead shows up.
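As a rough sketch, the eight parameters map naturally onto one record per task run, which can then be rolled up per server. The field names, the `summarize` helper, and the choice of simple means are assumptions for illustration; the post does not specify how per-task scores are aggregated, and the sample values below are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunMetrics:
    """One task run, scored on the eight parameters (hypothetical field names)."""
    correctness: float             # LLM-judge score, 0 to 1
    completed: bool                # finished with a usable answer
    tool_calls: int                # MCP tool invocations
    failed_tool_calls: int         # calls that errored out
    time_to_first_answer_s: float  # seconds until the first answer
    end_to_end_s: float            # total wall-clock seconds
    total_tokens: int              # input plus output tokens
    clarification_turns: int       # times the agent stopped to ask


def summarize(runs):
    """Aggregate a server's runs; plain means are an assumed aggregation."""
    return {
        "mean_correctness": mean(r.correctness for r in runs),
        "completion_rate": sum(r.completed for r in runs) / len(runs),
        "mean_tool_calls": mean(r.tool_calls for r in runs),
        "mean_total_tokens": mean(r.total_tokens for r in runs),
        "tasks_with_clarification": sum(r.clarification_turns > 0 for r in runs),
    }


# Made-up values, purely to show the shape of the data.
runs = [
    RunMetrics(1.0, True, 2, 0, 4.0, 6.0, 8000, 0),
    RunMetrics(0.5, False, 4, 1, 9.0, 15.0, 20000, 1),
]
summary = summarize(runs)
```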
Every matchup runs in two environments, because people use agents in two very different ways. Click a cell to see the matchup:
Two harnesses, two failure modes
The two environments stress different things. Toggle between them:
The Claude API harness isolates raw server quality in a constrained chat setting: a single request/response loop, no shell, no tool discovery. Every excess tool call, schema-discovery hop, or clarification turn is directly attributable to the server’s design. A lightweight resolver (Haiku 4.5) gets one shot at disambiguating vague user queries before execution.
The Claude Code harness tests how a server holds up inside an autonomous agent. The agent can discover tools dynamically with ToolSearch and drop into Bash for local computation, which sounds like an advantage, until you realize every Bash fallback is the agent compensating for output the server should have shaped. This harness doesn’t just measure whether a server works; it measures whether a server is optimized for autonomous execution.
BigQuery
The matchup: Nexla BigQuery MCP (NBQ) vs the official Google BigQuery MCP (GBQ), in both harnesses. Use the tabs to explore each metric. Orange is the task-based server, gray is the official one:
In the chat harness the gap is brutal. GBQ hit 2 timeouts and needed a clarification round on 17 of its 18 completed tasks: the model kept having to ask which dataset the user meant, because nothing in the server told it. NBQ needed zero clarification rounds across all 20 tasks, because the dataset context travels with the server. The result: NBQ reached 100% accuracy against GBQ’s 90%, with roughly half the tool calls and latency, at a third of the token cost.
Tokens deserve a closer look, because they are the quiet tax on every agent system. Each square below is 2,000 tokens of context window. This is what one average BigQuery question costs in each setup:
The Claude Code numbers tell the same story from a different angle. Accuracy tied at 95%. A capable agent can brute-force its way around a clumsy server. But how it got there matters:
The scoreboard
Both BigQuery matchups, from Nexla’s perspective. Hover any cell for the underlying numbers:
Read down the matrix and one pattern jumps out: generic MCP servers force the agent to perform work the server should have abstracted away. Repeated schema discovery, clarification loops, search-then-fetch chains, Bash post-processing. These are all symptoms of the same disease: exposing system plumbing instead of delivering answers. And the red row at the bottom shows the disease isn’t exclusive to official servers. Any server that returns raw payloads with cryptic tool names will lose, no matter how domain-specific it is.
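One way to make those symptoms countable is to label each step in an agent's trace by the kind of work it represents. The sketch below is illustrative only: the tool names and the `OVERHEAD` mapping are hypothetical, and a real trace would need a mapping that matches the server under test.

```python
from collections import Counter

# Hypothetical mapping from trace step names to overhead categories.
OVERHEAD = {
    "list_datasets": "schema_discovery",
    "list_tables": "schema_discovery",
    "get_table_schema": "schema_discovery",
    "ask_user": "clarification",
    "Bash": "post_processing",
}


def tally_overhead(trace):
    """Count steps per category; anything unmapped counts as answering the question."""
    return dict(Counter(OVERHEAD.get(step, "answering") for step in trace))


# A system-shaped trace: discover, clarify, query, then reshape the output.
generic_trace = [
    "list_datasets", "list_tables", "get_table_schema",
    "ask_user", "run_query", "Bash",
]
# A task-shaped trace: one answer-oriented call.
task_trace = ["find_failing_connectors"]

generic_tally = tally_overhead(generic_trace)
task_tally = tally_overhead(task_trace)
```

In this toy trace, five of the six steps on the generic path are overhead and only one goes toward the answer.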
The bar for task-based MCPs
Put the BigQuery results together and you get a concrete spec. Toggle below between a system-shaped server and a task-shaped one, and watch what happens to the agent’s overhead:
Four principles fall out of the data, each one traceable to a specific result in this benchmark:
Embed the domain context
The agent should never have to ask which dataset, which project, which table. Ship that knowledge inside the server.
Evidence: tasks needing a clarification round, 17 of 18 (GBQ) → 0 of 20 (NBQ)
Collapse the round-trips
Build answer-oriented operations, not API mirrors. One intent should be one call, not list-datasets, get-schema, then query.
Evidence: BigQuery tool calls, 5.3 → 2.7
Name tools for the task
Tool names are documentation the model reads on every call. find_failing_connectors beats run_query + get_table_schema, every time.
Evidence: GBQ ToolSearch discovery calls, 42 → ~0
Return answers, not payloads
If the agent needs Bash to reshape your output, the server is shipping its homework downstream. Presentation-ready or it isn’t done.
Evidence: GBQ reshaped output in Bash; NBQ needed none
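To make the four principles concrete, here is a small sketch that contrasts the two shapes against an in-memory SQLite database standing in for the warehouse. Everything in it is an assumption for illustration: the `connector_quality` table, its columns and rows, the `top_reliability_connector` tool name, and the rule that letter grades sort with A as the best. It does not use an MCP SDK or BigQuery; it only shows where the context and the formatting live.

```python
import sqlite3


def make_warehouse():
    """Tiny in-memory stand-in for the warehouse (illustrative data only)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE connector_quality (name TEXT, status TEXT, reliability_grade TEXT)"
    )
    db.executemany(
        "INSERT INTO connector_quality VALUES (?, ?, ?)",
        [("alpha", "active", "B"), ("beta", "active", "A"), ("gamma", "paused", "A")],
    )
    return db


# System-shaped tools: the agent must chain these and write the SQL itself.
def list_tables(db):
    rows = db.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    return [r[0] for r in rows]


def get_table_schema(db, table):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated directly, so it must come from a trusted source.
    return [(r[1], r[2]) for r in db.execute(f"PRAGMA table_info({table})")]


def run_query(db, sql):
    return db.execute(sql).fetchall()


# Task-shaped tool: table, filter, and ranking rule are embedded; output is answer-ready.
def top_reliability_connector(db):
    """Among active connectors, return the one with the best grade (assumes A is best)."""
    row = db.execute(
        "SELECT name, reliability_grade FROM connector_quality "
        "WHERE status = 'active' ORDER BY reliability_grade ASC, name ASC LIMIT 1"
    ).fetchone()
    if row is None:
        return "No active connectors found."
    return f"{row[0]} has the highest reliability grade among active connectors ({row[1]})."


db = make_warehouse()
answer = top_reliability_connector(db)
```

With the system-shaped tools, the agent would call `list_tables`, then `get_table_schema`, then compose a query for `run_query` and interpret the raw rows. The task-shaped tool answers the same question in one call, with a name that states the intent.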
This is why we think task-based MCPs are the future. Context windows are the scarcest resource in agent systems, and every schema listing, every clarification turn, every raw payload the agent has to re-parse is rent paid to a server that didn’t do its job. A system-shaped server makes that rent structural. A task-shaped server, with domain context embedded, round-trips collapsed, and outputs answer-ready, makes the agent’s shortest path the default path. But the craftsmanship is the point: task-shaping wins only when the server actually embeds the context, collapses the round-trips, names tools clearly, and returns answers ready to use, which is exactly what the BigQuery results show a server can do.