Benchmarking Nexla MCP Server Design

The Model Context Protocol has made it easy to plug an LLM into almost anything: a warehouse, a ticketing system, a CRM. But somewhere along the way, a quiet assumption took hold: that if an agent can reach a system, it can use it well.

Being accessible to an agent and being optimized for agent use are very different goals. As agents take on operational workloads (querying data warehouses, triaging support tickets, chaining multi-step lookups), the design of the MCP server itself becomes a first-order performance factor. Tool abstractions, output formats, and embedded domain context directly drive token consumption, tool-call counts, latency, and ultimately whether the task gets done at all.

Most MCP servers today are system-shaped: they mirror the API of the system underneath, one endpoint per tool. We think the next generation will be task-shaped: built around what the agent is trying to accomplish. Nexla’s MCP Studio exists to make building those servers easy, so we put our servers through a rigorous benchmark, scoring each one against a fixed set of parameters, to test whether task-based design actually pays off.

It does. The rest of this post covers two things: how we measure server quality against a consistent set of parameters, and what those measurements showed when we pointed them at a real BigQuery workload.

Watch two agents answer the same question

Here is a real task from the benchmark: “Among active connectors, which one has the highest reliability grade?” Below, the same model (Claude Sonnet 4.6) answers it twice: once through the official Google BigQuery MCP, and once through a task-based Nexla MCP scoped to the connector-quality dataset.

The generic server makes the agent discover the warehouse first: list the datasets, list the tables, fetch a schema, ask the user a clarifying question, and only then query, sometimes twice. The task-based server ships that context up front, so the agent goes straight to the answer. Press Run the race and watch the tool calls, tokens, and clock (replayed at ~6× speed, using the benchmark’s average trace):

Interactive · agent trace replay: Google BigQuery MCP (GBQ), system-shaped, where the agent must navigate the warehouse, versus Nexla BigQuery MCP (NBQ), task-shaped, with dataset context embedded in the server. Each panel tracks tool calls, failed calls, time to first answer, tokens, and elapsed time.

Real run · task bq_011 · Claude API harness · clock shows real benchmark seconds

Real run, task bq_011: Google BigQuery MCP = 12 tool calls (7 failed), 112,904 tokens, 50.4 s end-to-end. Nexla BigQuery MCP = 2 tool calls (0 failed), 39,661 tokens, 17.1 s.

Same model. Same question. Same warehouse. The only variable is the MCP server in between, and it cut tool calls 6×, finished about 3× faster, and used roughly a third of the tokens (about 35%). That gap is exactly what our benchmark is built to measure.
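As a quick arithmetic check, the ratios for this run can be recomputed from the bq_011 figures reported above. The sketch below uses only those numbers; the variable names are ours.

```python
# Figures reported for task bq_011 (Claude API harness).
gbq = {"tool_calls": 12, "tokens": 112_904, "seconds": 50.4}
nbq = {"tool_calls": 2, "tokens": 39_661, "seconds": 17.1}

call_ratio = gbq["tool_calls"] / nbq["tool_calls"]  # how many times fewer calls NBQ made
speedup = gbq["seconds"] / nbq["seconds"]           # how many times faster NBQ finished
token_share = nbq["tokens"] / gbq["tokens"]         # NBQ tokens as a share of GBQ's

print(f"tool calls: {call_ratio:.1f}x fewer")
print(f"end-to-end: {speedup:.2f}x faster")
print(f"tokens: {token_share:.0%} of GBQ's")
```

The token share works out to roughly 35%, which is close to a third.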

What we measure

Each benchmark runs head-to-head comparisons between MCP servers on the same live systems, with the same candidate model and the same judge. No synthetic toys: every task is drawn from real operational queries our analysts and customer engineers run against production systems. The framework itself is system-agnostic: hold the model, the tasks, and the data constant, vary only the server, and whatever moves is attributable to the server’s design. The run in this post points it at live BigQuery tables.
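The controlled-comparison idea can be sketched as a simple loop: every task runs against every server in every harness, and nothing else changes. This is a minimal illustration, not our actual harness. `run_matrix`, the `run_task` callable, the stub runner, and the task IDs other than the naming pattern are all hypothetical.

```python
from itertools import product
from typing import Callable

def run_matrix(
    servers: list[str],
    harnesses: list[str],
    tasks: list[str],
    run_task: Callable[[str, str, str], dict],
) -> list[dict]:
    """Run each task once per (harness, server) pair, holding everything else fixed."""
    results = []
    for harness, task, server in product(harnesses, tasks, servers):
        record = run_task(server, harness, task)  # same model and judge inside run_task
        results.append({"server": server, "harness": harness, "task": task, **record})
    return results

# Stub runner for illustration only; a real one would drive the agent and the judge.
def stub_run_task(server: str, harness: str, task: str) -> dict:
    return {"completed": True}

rows = run_matrix(
    servers=["GBQ", "NBQ"],
    harnesses=["claude_api", "claude_code"],
    tasks=["bq_001", "bq_002"],
    run_task=stub_run_task,
)
print(len(rows))  # 2 servers x 2 harnesses x 2 tasks = 8
```

Because the server is the innermost loop variable, each task is run back to back on both servers, which keeps the paired records adjacent in the output.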

At a glance: two agent harnesses, eight tracked metrics, LLM-judged scoring, any system, any server.

Every run is scored on the same eight parameters:

Correctness

LLM-judge score (0 to 1) against the expected answer.

Task completion

Whether the agent finished with a usable answer at all.

Tool calls

How many MCP tool invocations it took to reach the answer.

Failed tool calls

Calls that errored out: access denied, wrong schema, or a bad query.

Time to first answer

Seconds until the agent produced its first answer.

End-to-end time

Total wall-clock from question to final answer.

Total tokens

Input plus output tokens consumed across the whole run.

Clarification turns

Times the agent had to stop and ask before it could answer.
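One way to picture the eight parameters is as a single record per run, plus a small summary over many runs. The sketch below is an assumption about how such a record might look, not our benchmark's actual schema. The field names are ours, and the two example runs are made up purely to exercise the code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class RunMetrics:
    correctness: float                        # LLM-judge score in [0, 1]
    completed: bool                           # finished with a usable answer
    tool_calls: int                           # MCP tool invocations
    failed_tool_calls: int                    # calls that errored out
    time_to_first_answer_s: Optional[float]   # None if no answer was produced
    end_to_end_s: float                       # wall-clock, question to final answer
    total_tokens: int                         # input plus output tokens
    clarification_turns: int                  # times the agent stopped to ask

def summarize(runs: list[RunMetrics]) -> dict:
    """Aggregate per-run metrics into per-server headline numbers."""
    if not runs:
        raise ValueError("need at least one run")
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / len(runs),
        "mean_correctness": mean(r.correctness for r in runs),
        "mean_tool_calls": mean(r.tool_calls for r in runs),
        "failed_call_share": sum(r.failed_tool_calls for r in runs) / max(1, total_calls),
        "mean_total_tokens": mean(r.total_tokens for r in runs),
        "clarification_turns": sum(r.clarification_turns for r in runs),
    }

# Synthetic runs, invented for illustration; not benchmark data.
runs = [
    RunMetrics(1.0, True, 2, 0, 5.0, 10.0, 40_000, 0),
    RunMetrics(0.0, False, 6, 3, None, 30.0, 100_000, 1),
]
summary = summarize(runs)
print(summary)
```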

Each system under test gets 20 evaluation tasks per harness. The BigQuery tasks cover schema exploration, aggregation, filtering, and ranking over connector-quality tables. Many require the agent to first figure out which dataset and table even hold the answer, which is exactly where schema-navigation overhead shows up.

Every matchup runs in two environments, because people use agents in two very different ways.

Two harnesses, two failure modes

The two environments stress different things.

The Claude API harness isolates raw server quality in a constrained chat setting: a single request/response loop, no shell, no tool discovery. Every excess tool call, schema-discovery hop, or clarification turn is directly attributable to the server’s design. A lightweight resolver (Haiku 4.5) gets one shot at disambiguating vague user queries before execution.

The Claude Code harness tests how a server holds up inside an autonomous agent. The agent can discover tools dynamically with ToolSearch and drop into Bash for local computation, which sounds like an advantage, until you realize every Bash fallback is the agent compensating for output the server should have shaped. This harness doesn’t just measure whether a server works; it measures whether a server is optimized for autonomous execution.
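To make the "Bash fallback" idea concrete, here is a minimal sketch of how a trace from an autonomous agent could be bucketed into discovery calls, Bash fallbacks, and actual MCP calls. The trace format (a list of dicts with a `tool` key) and the `mcp__` name prefix for MCP tools are assumptions for illustration, and the sample trace is made up.

```python
from collections import Counter

def overhead_profile(trace: list[dict]) -> Counter:
    """Bucket each tool call in an agent trace by the kind of work it represents."""
    counts: Counter = Counter()
    for call in trace:
        name = call["tool"]
        if name == "Bash":
            counts["bash_fallback"] += 1    # agent reshaping output locally
        elif name == "ToolSearch":
            counts["tool_discovery"] += 1   # agent hunting for the right tool
        elif name.startswith("mcp__"):      # assumed prefix for MCP tool names
            counts["mcp_call"] += 1
        else:
            counts["other"] += 1
    return counts

# Made-up trace for illustration; not taken from the benchmark.
sample_trace = [
    {"tool": "ToolSearch"},
    {"tool": "mcp__bigquery__run_query"},
    {"tool": "Bash"},
]
profile = overhead_profile(sample_trace)
print(dict(profile))
```

A server that is well shaped for autonomous execution should push the first two buckets toward zero.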

BigQuery

The matchup: Nexla BigQuery MCP (NBQ) vs the official Google BigQuery MCP (GBQ), in both harnesses.

In the chat harness the gap is brutal. GBQ hit 2 timeouts and needed a clarification round on 17 of its 18 completed tasks: the model kept having to ask which dataset the user meant, because nothing in the server told it. NBQ needed zero clarification rounds across all 20 tasks, because the dataset context travels with the server. The result: 100% accuracy for NBQ vs 90% for GBQ, with roughly half the tool calls and latency, at a third of the token cost.

Tokens deserve a closer look, because they are the quiet tax on every agent system.

[Figure: what one average BigQuery question costs in each setup, drawn as squares of 2,000 context-window tokens each.]

The Claude Code numbers tell the same story from a different angle. Accuracy tied at 95%, because a capable agent can brute-force its way around a clumsy server. But how it got there matters.

Verdict: the task-based server wins every efficiency metric in both environments (2× fewer tool calls, 1.9–3.1× fewer tokens, 1.6–1.9× faster) and wins accuracy outright in the chat harness.

The scoreboard

Both BigQuery matchups, from Nexla’s perspective.

[Interactive: Nexla win matrix. Columns: Matchup, Tool calls, Tokens, Latency. Green = Nexla advantage, red = official server advantage.]

Accuracy: NBQ wins Claude API (100% vs 90%), ties Claude Code (95%).

Read down the matrix and one pattern jumps out: generic MCP servers force the agent to perform work the server should have abstracted away. Repeated schema discovery, clarification loops, search-then-fetch chains, Bash post-processing. These are all symptoms of the same disease: exposing system plumbing instead of delivering answers. And the red row at the bottom shows the disease isn’t exclusive to official servers. Any server that returns raw payloads with cryptic tool names will lose, no matter how domain-specific it is.

The bar for task-based MCPs

Put the BigQuery results together and you get a concrete spec.

Four principles fall out of the data, each one traceable to a specific number in this benchmark:

Embed the domain context

The agent should never have to ask which dataset, which project, which table. Ship that knowledge inside the server.

Evidence: 17/18 clarification rounds → 0

Collapse the round-trips

Build answer-oriented operations, not API mirrors. One intent should be one call, not list-datasets, get-schema, then query.

Evidence: BigQuery tool calls, 5.3 → 2.7

Name tools for the task

Tool names are documentation the model reads on every call. find_failing_connectors beats run_query + get_table_schema, every time.

Evidence: GBQ ToolSearch discovery calls, 42 → ~0

Return answers, not payloads

If the agent needs Bash to reshape your output, the server is shipping its homework downstream. Presentation-ready or it isn’t done.

Evidence: GBQ reshaped output in Bash; NBQ needed none
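The four principles can be sketched together in one small tool. This is a minimal illustration under stated assumptions, not Nexla's implementation: SQLite stands in for BigQuery so the example runs anywhere, the table and column names are hypothetical, and registration with an MCP SDK is left out. Only the tool name `find_failing_connectors` comes from the text above.

```python
import sqlite3

# Embedded domain context: the agent never has to ask which table.
# Hypothetical table name; SQLite stands in for BigQuery in this sketch.
CONNECTOR_TABLE = "connector_quality"

def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory table with made-up rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {CONNECTOR_TABLE} (name TEXT, status TEXT, failed_runs INTEGER)"
    )
    conn.executemany(
        f"INSERT INTO {CONNECTOR_TABLE} VALUES (?, ?, ?)",
        [("alpha", "active", 0), ("beta", "active", 4), ("gamma", "paused", 9)],
    )
    return conn

def find_failing_connectors(conn: sqlite3.Connection, min_failed_runs: int = 1) -> str:
    """One intent, one call: returns a ready-to-read answer instead of raw rows."""
    rows = conn.execute(
        f"SELECT name, failed_runs FROM {CONNECTOR_TABLE} "
        "WHERE status = 'active' AND failed_runs >= ? "
        "ORDER BY failed_runs DESC",
        (min_failed_runs,),
    ).fetchall()
    if not rows:
        return "No active connectors are failing."
    lines = [f"- {name}: {failed} failed runs" for name, failed in rows]
    return "Failing active connectors:\n" + "\n".join(lines)

print(find_failing_connectors(make_demo_db()))
```

The system-shaped equivalent would expose list-datasets, get-schema, and run-query as separate tools and leave the agent to assemble the same SQL and reshape the rows itself.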

This is why we think task-based MCPs are the future. Context windows are the scarcest resource in agent systems, and every schema listing, every clarification turn, every raw payload the agent has to re-parse is rent paid to a server that didn’t do its job. A system-shaped server makes that rent structural. A task-shaped server, with domain context embedded, round-trips collapsed, and outputs answer-ready, makes the agent’s shortest path the default path. But the craftsmanship is the point: task-shaping wins only when the server actually embeds the context, collapses the round-trips, names tools clearly, and returns answers ready to use. The BigQuery results show that a well-built server can do exactly that.
