Beyond Automation: Orchestrating Agentic AI Systems at Scale

Insights from Qingyun Wu and Amey Desai at the 2025 Data + AI Integration Summit

Agentic AI is entering a new phase. It’s moving from exploratory demos to production-ready workflows. But behind the excitement lies a complex set of challenges: cost, reliability, hallucination, evaluation, and the infrastructure required to scale.

At the 2025 Data + AI Integration Summit, a conversation between Amey Desai (CTO at Nexla) and Qingyun Wu (Founder & CEO at AG2) unpacked how multi-agent systems and orchestration frameworks are shaping the next generation of enterprise AI.

Watch the session recording below 👇

 

From Automation to Orchestration

Automation is rule-based, deterministic, and designed for static environments. Orchestration, by contrast, is dynamic, enabling multiple agents and humans to collaborate in loosely structured workflows. This shift is essential for solving interdependent tasks that single agents can’t handle alone.

However, orchestrated systems introduce unpredictability. In dynamic multi-agent flows, outcomes can swing from surprisingly effective to completely off-track. Guardrails and constraints become essential, not optional, to maintain alignment with intended outcomes.
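
To make that concrete, here is a minimal, framework-agnostic sketch of a guardrail wrapped around a dynamic multi-agent loop. The agent interface, the banned actions, and the step cap are illustrative assumptions, not any specific framework's API.

```python
from dataclasses import dataclass, field

# Hypothetical guardrail: a hard cap on steps plus a simple action allowlist,
# checked on every turn of a dynamic multi-agent loop.
@dataclass
class Guardrail:
    max_steps: int = 10
    banned_actions: set = field(default_factory=lambda: {"delete_database", "send_payment"})

    def allows(self, step_count: int, proposed_action: str) -> bool:
        if step_count >= self.max_steps:
            return False  # runaway loop: halt the workflow
        return proposed_action not in self.banned_actions

def run_workflow(agents, task, guardrail: Guardrail):
    """Round-robin the agents until one finishes the task or a guardrail fires."""
    state, step = task, 0
    while True:
        for agent in agents:
            action, state = agent.step(state)  # hypothetical agent interface
            if not guardrail.allows(step, action):
                return {"status": "halted_by_guardrail", "state": state}
            if action == "done":
                return {"status": "completed", "state": state}
            step += 1

class EchoAgent:
    """Toy agent that finishes after handling the task once."""
    def __init__(self):
        self.seen = False
    def step(self, state):
        if self.seen:
            return "done", state
        self.seen = True
        return "echo", state

print(run_workflow([EchoAgent()], "summarise Q3 report", Guardrail()))
```

Even this toy version shows the point: the constraint lives outside the agents, so the workflow stays bounded no matter how the agents behave.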

Playing It Safe and Paying the Price

Enterprises are overwhelmingly choosing “safe” projects like RAG, which minimize risk but often limit returns. These low-risk approaches may offer easier wins, but they also tend to plateau quickly in ROI.

In contrast, companies that push into more ambitious agentic use cases (even expensive ones) often uncover much higher value. Consider Anthropic’s internal use of Claude, which reportedly writes 80% of the company’s code. Other high-impact areas include legal services, finance, and healthcare, where repetitive, document-heavy workflows are prime for intelligent automation.

For teams unsure where to begin, the advice was simple: start small, iterate quickly, but don’t be afraid to pursue higher-upside use cases once initial traction is proven.

Evaluating the Systems That Can’t Be Benchmarked (Yet)

Despite progress in orchestration, evaluation remains an open problem. There’s no standard way to measure whether multi-agent systems are succeeding, especially in real-world scenarios. To address this, Wu’s team developed:

  • AutoGenBench: A benchmark spanning multiple domains (coding, math, web) for testing agent system performance.
  • AgentEval: An agent-based meta-evaluation system, where agents review and critique each other’s results.
  • Synthetic datasets with debug traces: Allowing teams to identify exactly when and where agents fail.

While promising, these efforts also underscore how early the field still is. Robust, repeatable evaluation remains one of the biggest roadblocks to widespread deployment.
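
The agent-as-judge idea behind AgentEval can be sketched in a few lines. This is an illustrative pattern only, not AgentEval’s actual interface; `call_llm` and the solver’s `solve` method are placeholders for whatever model client and agent a team actually uses.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (OpenAI, Anthropic, local model, etc.)."""
    raise NotImplementedError("wire this to your model client")

# Illustrative agent-as-judge loop: a critic agent scores another agent's
# output against task-specific criteria and returns structured feedback.
CRITERIA = ["correctness", "completeness", "efficiency"]

def critic_review(task: str, candidate_output: str) -> dict:
    prompt = (
        "You are an evaluation agent. Score the candidate solution on each "
        "criterion from 1-5 and explain briefly.\n"
        f"Criteria: {', '.join(CRITERIA)}\n"
        f"Task: {task}\n"
        f"Candidate solution:\n{candidate_output}\n"
        'Respond as JSON: {"scores": {...}, "verdict": "pass" | "fail"}'
    )
    return json.loads(call_llm(prompt))

def evaluate_run(task: str, solver_agent) -> dict:
    """Run the solver, then have the critic agent grade the result."""
    output = solver_agent.solve(task)  # hypothetical solver interface
    review = critic_review(task, output)
    return {"task": task, "output": output, "review": review}
```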

Tackling Hallucination and Security Separately

AI hallucination and AI security are often lumped together, but they require different approaches. Fine-tuned models and stronger orchestration can reduce hallucination, especially in tool-using workflows. For instance, models can be trained to hallucinate less during code execution even if they still struggle in creative or open-ended contexts.

Security, however, demands more rigorous solutions. Enterprises cannot afford a loose tolerance for risk, particularly in agent-to-agent communication across organizational boundaries. Solving AI security likely depends less on model optimization and more on traditional information security principles, robust access control, and a deeper understanding of threat surfaces.
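
As one concrete illustration of those information security principles applied to agents, here is a hypothetical deny-by-default allowlist that an orchestrator might consult before routing messages across organizational boundaries. The policy shape and names are assumptions, not a standard.

```python
# Hypothetical policy: which (sender, receiver) pairs may exchange messages,
# and which tools each external agent is allowed to invoke.
POLICY = {
    "allowed_routes": {("partner_org/researcher", "internal/report_writer")},
    "tool_permissions": {"partner_org/researcher": {"search_docs"}},
}

def authorize(sender: str, receiver: str, tool: str | None = None) -> bool:
    """Deny by default; only explicitly allowed routes and tools pass."""
    if (sender, receiver) not in POLICY["allowed_routes"]:
        return False
    if tool is not None and tool not in POLICY["tool_permissions"].get(sender, set()):
        return False
    return True

# The orchestrator would call authorize() before delivering every cross-boundary message.
assert authorize("partner_org/researcher", "internal/report_writer", "search_docs")
assert not authorize("partner_org/researcher", "internal/report_writer", "execute_sql")
```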

The Runtime Bottleneck

Even the most sophisticated agentic designs often fail under real data volume. Most current workflows are task-oriented and lightweight, but the moment real data enters the system, performance tends to break down.

A high-performance execution runtime is essential, not just for speed and reliability, but also for compliance, monitoring, and governance. Without this layer, agentic systems remain prototypes.

Key runtime needs include:

  • Parallel agent execution
  • Persistent, resumable agents
  • Distributed coordination across servers or org boundaries

Agent frameworks must be built with these requirements in mind. The runtime isn’t just infrastructure. It’s a critical part of the product.
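
To illustrate two of those needs, parallel execution and resumable agents, here is a minimal asyncio sketch. The agent turn and the JSON-file checkpoint are deliberately simplified assumptions, not a production runtime.

```python
import asyncio
import json
from pathlib import Path

CHECKPOINT = Path("agent_state.json")

async def run_agent(name: str, task: str) -> dict:
    """Stand-in for a real agent turn (which would call a model and tools)."""
    await asyncio.sleep(0.1)  # simulate I/O-bound work
    return {"agent": name, "task": task, "result": f"{name} finished '{task}'"}

def save_checkpoint(results: list[dict]) -> None:
    CHECKPOINT.write_text(json.dumps(results))  # naive persistence

def load_checkpoint() -> list[dict]:
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else []

async def main():
    done = load_checkpoint()  # resume: skip tasks completed in a previous run
    finished_tasks = {r["task"] for r in done}
    tasks = ["extract invoices", "classify tickets", "draft summary"]
    pending = [t for t in tasks if t not in finished_tasks]

    # Parallel execution: all pending agents run concurrently.
    results = await asyncio.gather(
        *(run_agent(f"agent-{i}", t) for i, t in enumerate(pending))
    )
    save_checkpoint(done + list(results))

asyncio.run(main())
```

A real runtime adds distributed coordination, monitoring, and access control on top of this, but the core ideas, concurrency plus durable state, are the same.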

Making Agentic AI Usable: Code, Low-Code, and No-Code

Adoption depends on accessibility. While full-code systems offer flexibility, they also demand deep expertise. Low-code and no-code solutions can enable broader experimentation and faster deployment, especially when tailored to specific personas.

The tradeoff is complexity versus control. But for many use cases, bounded low-code interfaces may be all that’s needed to create and run valuable agents in production.
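
What a bounded low-code interface might expose can be as small as a declarative spec that the platform validates and executes. The fields and tool names below are hypothetical, not any particular product’s schema.

```python
# Hypothetical declarative agent spec a low-code builder might emit;
# the platform supplies the model, runtime, and monitoring behind the scenes.
agent_spec = {
    "name": "invoice_triage_agent",
    "trigger": "new_file_in:s3://finance/invoices/",  # assumed connector syntax
    "steps": [
        {"tool": "extract_fields", "fields": ["vendor", "amount", "due_date"]},
        {"tool": "route_for_approval", "threshold_amount": 10_000},
    ],
    "guardrails": {"max_runtime_seconds": 120, "require_human_approval_over": 10_000},
}

def validate(spec: dict) -> None:
    """A bounded interface only accepts a fixed set of keys and tools."""
    allowed_tools = {"extract_fields", "route_for_approval"}
    assert set(spec) <= {"name", "trigger", "steps", "guardrails"}
    assert all(step["tool"] in allowed_tools for step in spec["steps"])

validate(agent_spec)
```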

Want to hear more from Qingyun Wu?

Her full session with Amey Desai is available to watch on demand.

Nexla User Interface

Unify your Data and Services Today!

Instantly turn any data into ready-to-use products, integrate for AI and analytics, and do it all 10x faster—no coding needed.