MCP-Studio: How We Benchmark Nexla MCP Servers

The Model Context Protocol has made it easy to plug an LLM into almost anything: a warehouse, a ticketing system, a CRM. But somewhere along the way, a quiet assumption took hold: that if an agent can reach a system, it can use it well.

Being accessible to an agent and being optimized for agent use are very different goals. As agents take on operational workloads (querying data warehouses, triaging support tickets, chaining multi-step lookups), the design of the MCP server itself becomes a first-order performance factor. Tool abstractions, output formats, and embedded domain context directly drive token consumption, tool-call counts, latency, and ultimately whether the task gets done at all.

Most MCP servers today are system-shaped: they mirror the API of the system underneath, one endpoint per tool. We think the next generation will be task-shaped: built around what the agent is trying to accomplish. Nexla’s MCP Studio exists to make building those servers easy, so we put our servers through a rigorous benchmark, scoring each one against a fixed set of parameters, to test whether task-based design actually pays off.

It does. The rest of this post is really two things: how we measure server quality against a consistent set of parameters, and what those measurements found when we pointed them at a real BigQuery workload.

Watch two agents answer the same question

Here is a real task from the benchmark: “Among active connectors, which one has the highest reliability grade?” Below, the same model (Claude Sonnet 4.6) answers it twice: once through the official Google BigQuery MCP, and once through a task-based Nexla MCP scoped to the connector-quality dataset.

The generic server makes the agent discover the warehouse first: list the datasets, list the tables, fetch a schema, ask the user a clarifying question, and only then query, sometimes twice. The task-based server ships that context up front, so the agent goes straight to the answer.

[Interactive agent trace replay, shown at ~6× speed: Google BigQuery MCP (GBQ), system-shaped, where the agent must navigate the warehouse, versus Nexla BigQuery MCP (NBQ), task-shaped, with dataset context embedded in the server. Each panel tracks tool calls, failed calls, time to first answer, tokens, and elapsed time. Real run · task bq_011 · Claude API harness · clock shows real benchmark seconds.]

Real run, task bq_011: Google BigQuery MCP = 12 tool calls (7 failed), 112,904 tokens, 50.4 s end-to-end. Nexla BigQuery MCP = 2 tool calls (0 failed), 39,661 tokens, 17.1 s.

Same model. Same question. Same warehouse. The only variable is the MCP server in between, and it cut tool calls 6×, finished about 3× faster, and used roughly a third of the tokens (about 35%). That gap is exactly what our benchmark is built to measure.
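The ratios quoted for task bq_011 can be checked directly from the measured numbers. The short sketch below only restates those four measurements per server; the dictionary keys and the `ratio` helper are our own naming, not part of the benchmark code.

```python
# Measured values for task bq_011, copied from the run summary above.
gbq = {"tool_calls": 12, "failed_calls": 7, "tokens": 112_904, "seconds": 50.4}
nbq = {"tool_calls": 2, "failed_calls": 0, "tokens": 39_661, "seconds": 17.1}


def ratio(metric: str) -> float:
    """How many times larger the GBQ value is than the NBQ value."""
    return gbq[metric] / nbq[metric]


tool_call_ratio = ratio("tool_calls")        # GBQ calls per NBQ call
speedup = ratio("seconds")                   # GBQ seconds per NBQ second
token_share = nbq["tokens"] / gbq["tokens"]  # NBQ tokens as a share of GBQ tokens

print(f"tool calls: {tool_call_ratio:.1f}x fewer")
print(f"latency:    {speedup:.2f}x faster")
print(f"tokens:     {token_share:.1%} of the GBQ total")
```

Run as written, this reports 6.0× fewer tool calls, a 2.95× speedup, and 35.1% of the tokens.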

What we measure

Each benchmark runs head-to-head comparisons between MCP servers on the same live systems, with the same candidate model and the same judge. No synthetic toys: every task is drawn from real operational queries our analysts and customer engineers run against production systems. The framework itself is system-agnostic: hold the model, the tasks, and the data constant, vary only the server, and whatever moves is attributable to the server’s design. The run in this post points it at live BigQuery tables.

2 agent harnesses · 8 tracked metrics · LLM-judged scoring · any system, any server

Every run is scored on the same eight parameters:

1. Correctness: LLM-judge score (0 to 1) against the expected answer.

2. Task completion: whether the agent finished with a usable answer at all.

3. Tool calls: how many MCP tool invocations it took to reach the answer.

4. Failed tool calls: calls that errored out (access denied, wrong schema, or a bad query).

5. Time to first answer: seconds until the agent produced its first answer.

6. End-to-end time: total wall-clock time from question to final answer.

7. Total tokens: input plus output tokens consumed across the whole run.

8. Clarification turns: times the agent had to stop and ask before it could answer.
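One way to picture the eight parameters is as a single record per run, averaged per server and harness. This is a minimal sketch, not the benchmark's actual code: the `RunMetrics` class, its field names, and the `summarize` helper are all hypothetical, and the two example runs use made-up values.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunMetrics:
    """One benchmark run, scored on the eight parameters (field names are hypothetical)."""
    correctness: float             # LLM-judge score, 0 to 1
    completed: bool                # finished with a usable answer
    tool_calls: int                # MCP tool invocations
    failed_tool_calls: int         # calls that errored out
    time_to_first_answer_s: float  # seconds until the first answer
    end_to_end_s: float            # total wall-clock seconds
    total_tokens: int              # input plus output tokens
    clarification_turns: int       # times the agent stopped to ask


def summarize(runs: list[RunMetrics]) -> dict[str, float]:
    """Average each parameter over a set of runs for one server and harness."""
    return {
        "correctness": mean(r.correctness for r in runs),
        "completion_rate": mean(1.0 if r.completed else 0.0 for r in runs),
        "tool_calls": mean(r.tool_calls for r in runs),
        "failed_tool_calls": mean(r.failed_tool_calls for r in runs),
        "time_to_first_answer_s": mean(r.time_to_first_answer_s for r in runs),
        "end_to_end_s": mean(r.end_to_end_s for r in runs),
        "total_tokens": mean(r.total_tokens for r in runs),
        "clarification_turns": mean(r.clarification_turns for r in runs),
    }


# Illustrative values only, not benchmark data.
runs = [
    RunMetrics(1.0, True, 2, 0, 8.0, 16.0, 40_000, 0),
    RunMetrics(0.5, True, 4, 1, 12.0, 24.0, 60_000, 1),
]
summary = summarize(runs)
print(summary["tool_calls"], summary["total_tokens"])
```

Keeping all eight values on one immutable record makes it harder to compare servers on a metric that was measured for one and not the other.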

Each system under test gets 20 evaluation tasks per harness. The BigQuery tasks cover schema exploration, aggregation, filtering, and ranking over connector-quality tables. Many require the agent to first figure out which dataset and table even hold the answer, which is exactly where schema-navigation overhead shows up.

Every matchup runs in two environments, because people use agents in two very different ways: the Claude API chat harness and the Claude Code agentic harness. In this post, both run against BigQuery.

Candidate model: Claude Sonnet 4.6 · Judge: Claude Opus 4.8 · Metrics: correctness, accuracy, tool calls, tokens, latency.
NBQ = Nexla BigQuery MCP · GBQ = Google BigQuery MCP

Two harnesses, two failure modes

The two environments stress different things.

The Claude API harness isolates raw server quality in a constrained chat setting: a single request/response loop, no shell, no tool discovery. Every excess tool call, schema-discovery hop, or clarification turn is directly attributable to the server’s design. A lightweight resolver (Haiku 4.5) gets one shot at disambiguating vague user queries before execution.

The Claude Code harness tests how a server holds up inside an autonomous agent. The agent can discover tools dynamically with ToolSearch and drop into Bash for local computation, which sounds like an advantage, until you realize every Bash fallback is the agent compensating for output the server should have shaped. This harness doesn’t just measure whether a server works; it measures whether a server is optimized for autonomous execution.
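To make "Bash fallbacks" and "discovery calls" concrete, here is a rough sketch of how a Claude Code trace could be bucketed. It assumes a trace is a list of steps with a `tool` name, that shell and discovery steps are named `Bash` and `ToolSearch`, and that MCP tools carry an `mcp__` prefix. The trace format, the example tool names, and both helpers are assumptions for illustration, not the benchmark's implementation.

```python
from collections import Counter


def classify(tool_name: str) -> str:
    """Bucket one tool invocation by the kind of overhead it represents."""
    if tool_name == "Bash":
        return "bash_fallback"   # agent reshaping output locally
    if tool_name == "ToolSearch":
        return "discovery"       # agent looking for the right tool
    if tool_name.startswith("mcp__"):
        return "mcp_call"        # a call to the MCP server itself
    return "other"


def overhead_counts(trace: list[dict]) -> Counter:
    """Count invocations per bucket for one run's trace."""
    return Counter(classify(step["tool"]) for step in trace)


# Hypothetical trace; the tool names are made up for illustration.
trace = [
    {"tool": "ToolSearch"},
    {"tool": "mcp__bigquery__list_datasets"},
    {"tool": "mcp__bigquery__run_query"},
    {"tool": "Bash"},
]
counts = overhead_counts(trace)
print(dict(counts))
```

Separating the buckets keeps a Bash fallback from being mistaken for useful server work when tool calls are totaled.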

BigQuery

The matchup: Nexla BigQuery MCP (NBQ) vs the official Google BigQuery MCP (GBQ), in both harnesses.

20 tasks per harness. Claude API: NBQ 20/20 correct with 0 timeouts; GBQ 18/20 with 2 timeouts. Claude Code: both 19/20.

In the chat harness the gap is brutal. GBQ hit 2 timeouts and needed a clarification round on 17 of its 18 completed tasks. The model kept having to ask which dataset the user meant, because nothing in the server told it. NBQ needed zero clarification rounds across all 20 tasks, because the dataset context travels with the server. The result: 100% vs 90% accuracy, with roughly half the tool calls and latency, at a third of the token cost.
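The percentages follow from the task counts reported for the two harnesses. A small worked check, with variable names of our own choosing:

```python
TASKS_PER_HARNESS = 20

# Claude API (chat) harness counts, as reported above.
nbq_correct, gbq_correct = 20, 18
gbq_completed, gbq_clarified = 18, 17

# Claude Code harness: both servers answered 19 of 20 correctly.
code_correct = 19

nbq_accuracy = nbq_correct / TASKS_PER_HARNESS
gbq_accuracy = gbq_correct / TASKS_PER_HARNESS
code_accuracy = code_correct / TASKS_PER_HARNESS
# Share of GBQ's completed chat tasks that needed a clarification round.
gbq_clarification_rate = gbq_clarified / gbq_completed

print(f"chat accuracy: NBQ {nbq_accuracy:.0%}, GBQ {gbq_accuracy:.0%}")
print(f"Claude Code accuracy (both): {code_accuracy:.0%}")
print(f"GBQ clarification rate: {gbq_clarification_rate:.1%}")
```

The clarification rate is computed over GBQ's 18 completed tasks, not all 20, because the two timed-out runs never reached an answer.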

Tokens deserve a closer look, because they are the quiet tax on every agent system. Schema listings, retries, and clarification turns all land in the context window, and you pay for them on every single question.

[Interactive chart · the token tax: tokens per average BigQuery task for Google BigQuery MCP and Nexla BigQuery MCP, drawn at one square = 2,000 tokens.]
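To get a feel for that scale, the sketch below converts a token count into squares at 2,000 tokens each. It uses the single-run bq_011 token counts reported earlier, not the per-task averages the chart plots, and it assumes a partly filled square is rounded up to a whole one.

```python
import math

TOKENS_PER_SQUARE = 2_000


def squares(tokens: int) -> int:
    """Number of 2,000-token squares needed to cover a token count (rounded up)."""
    return math.ceil(tokens / TOKENS_PER_SQUARE)


# Token counts from the real bq_011 run, not the benchmark averages.
gbq_squares = squares(112_904)
nbq_squares = squares(39_661)

print(f"GBQ: {gbq_squares} squares, NBQ: {nbq_squares} squares")
```

For that one task, the official server fills 57 squares and the task-based server fills 20.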

The Claude Code numbers tell the same story from a different angle. Accuracy tied at 95%. A capable agent can brute-force its way around a clumsy server. But how it got there matters:

Who does the post-processing?

Bash fallbacks · GBQ: across the 20 tasks, Claude Code dropped to the shell to reshape MCP output. The path to an answer was MCP output → raw rows + schema noise → Bash / jq → answer.

Bash fallbacks · NBQ: answer-ready outputs needed no local reshaping at all. The path was MCP output → answer.

GBQ also triggered 42 ToolSearch discovery calls in Claude Code; NBQ triggered almost none.

Verdict: the task-based server wins every efficiency metric in both environments (2× fewer tool calls, 1.9–3.1× fewer tokens, 1.6–1.9× faster) and wins accuracy outright in the chat harness.

The scoreboard

Both BigQuery matchups, from Nexla’s perspective, compared on tool calls, tokens, and latency.

[Interactive · Nexla win matrix: one row per matchup; columns Tool calls, Tokens, Latency. Green = Nexla advantage · Red = official server advantage.]

Accuracy: NBQ wins Claude API (100% vs 90%), ties Claude Code (95%).

Across the scoreboard, one pattern jumps out: generic MCP servers force the agent to perform work the server should have abstracted away. Repeated schema discovery, clarification loops, search-then-fetch chains, and Bash post-processing are all symptoms of the same disease: exposing system plumbing instead of delivering answers. The disease isn’t exclusive to official servers, either. Any server that returns raw payloads behind cryptic tool names will lose, no matter how domain-specific it is.

The bar for task-based MCPs

Put the BigQuery results together and you get a concrete spec.

[Interactive · anatomy of an MCP server: compares a system-shaped server with a task-shaped one on tool surface, what the agent gets back, and agent overhead (discovery / navigation, clarification turns, Bash post-processing, token burn).]

Four principles fall out of the data, each one traceable to a specific number in this benchmark:

1. Embed the domain context. The agent should never have to ask which dataset, which project, which table. Ship that knowledge inside the server.
Evidence: 17/18 clarification rounds → 0

2. Collapse the round-trips. Build answer-oriented operations, not API mirrors. One intent should be one call, not list-datasets, get-schema, then query.
Evidence: BigQuery tool calls, 5.3 → 2.7

3. Name tools for the task. Tool names are documentation the model reads on every call. find_failing_connectors beats run_query + get_table_schema, every time.
Evidence: GBQ ToolSearch discovery calls, 42 → ~0

4. Return answers, not payloads. If the agent needs Bash to reshape your output, the server is shipping its homework downstream. Presentation-ready or it isn’t done.
Evidence: GBQ reshaped output in Bash; NBQ needed none
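The four principles can be sketched side by side in a few lines. The example below is purely illustrative: an in-memory SQLite table stands in for the warehouse, and the table name, column names, sample rows, and all three functions are hypothetical. It is not how either benchmarked server is implemented.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Hypothetical stand-in for the warehouse: one small in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE connector_quality (name TEXT, status TEXT, reliability_score REAL)"
    )
    conn.executemany(
        "INSERT INTO connector_quality VALUES (?, ?, ?)",
        [("alpha", "active", 0.91), ("beta", "active", 0.97), ("gamma", "inactive", 0.99)],
    )
    return conn


# System-shaped surface: generic primitives. The agent must discover the
# table, work out the schema, write the SQL, and interpret raw rows itself.
def list_tables(conn: sqlite3.Connection) -> list[str]:
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    return [row[0] for row in rows]


def run_query(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    return conn.execute(sql).fetchall()


# Task-shaped surface: one intent, one call. The table and the "active"
# filter are embedded, and the return value is a ready-to-use sentence.
def most_reliable_active_connector(conn: sqlite3.Connection) -> str:
    row = conn.execute(
        "SELECT name, reliability_score FROM connector_quality "
        "WHERE status = 'active' ORDER BY reliability_score DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return "No active connectors found."
    name, score = row
    return f"{name} is the most reliable active connector (score {score:.2f})."


conn = make_demo_db()
tables = list_tables(conn)                     # step 1 of the system-shaped path
answer = most_reliable_active_connector(conn)  # the whole task-shaped path
print(answer)
```

With the generic primitives, the agent still has to chain `list_tables`, a schema lookup, and `run_query` before it can even start composing an answer; the task-shaped function does that work once, inside the server.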

This is why we think task-based MCPs are the future. Context windows are the scarcest resource in agent systems, and every schema listing, every clarification turn, every raw payload the agent has to re-parse is rent paid to a server that didn’t do its job. A system-shaped server makes that rent structural. A task-shaped server, with domain context embedded, round-trips collapsed, and outputs answer-ready, makes the agent’s shortest path the default path. But the craftsmanship is the point: task-shaping wins only when the server actually embeds the context, collapses the round-trips, names tools clearly, and returns answers ready to use. The BigQuery results show what happens when it does.

Build a task-based MCP in minutes

MCP Studio turns any data system into an intent-specific MCP server, with domain context, answer-ready outputs, and governed access included. The Nexla BigQuery MCP server in this benchmark was built with it.

More on this soon

Methodology notes. 80 task runs in this post: 1 data system (BigQuery) × 2 harnesses × 2 servers × 20 tasks. Candidate model Claude Sonnet 4.6; judge Claude Opus 4.8; clarification resolver (Claude API harness only) Haiku 4.5. Tasks drawn from real operational queries against live BigQuery tables. Latency averages include server-side and model time.

Replays in this post are reconstructed from a real benchmark run (task bq_011) and representative traces. Step labels in the replay widgets are representative; the call counts and metrics are measured.

