Building AI Agents That Internalize Corrections

Episode 3
Dec 23, 2025 | 41:21

Summary

Most AI agent demos still look great but fall apart in production. At BigPanda, Alexander Page’s team solved this by building systems that internalize user corrections and improve without requiring source data fixes.

The Engineering Director of Applied AI shares with Saket how his team designs production-grade AI agents for IT operations. When a user flags that step seven of a retrieved runbook is outdated, the system internalizes that correction with appropriate weighting and handles conflicts on future retrievals, even when nobody updates the original Confluence page. He argues this capability is becoming a baseline expectation: users accept that AI systems won’t be perfect, but they increasingly expect systems to learn when shown the right answer.

Page also breaks down multi-agent architecture decisions. When you have 100 tools, giving them all to one agent degrades tool selection. His team isolates decision-making by domain, spinning up specialized sub-agents at runtime based on user intent. For evals, they focus on tool call sequences rather than final outputs, making it easier to pinpoint where agent chains break down.

Topics Discussed

  • Internalizing user corrections when source data stays outdated
  • Why correction capability is becoming a baseline user expectation
  • Evaluating agent chains by tool call sequences not outputs
  • Breaking large tool sets into domain-specific agents
  • MCP security tradeoffs and when A2A fits better
  • Runtime decisions on which sub-agents to spin up
  • Maintaining a prototype shelf for future foundation model capabilities
  • Context engineering over expanding context windows

“User corrections should take a pretty high weighting, not infinitely so.”

Alexander Page
Engineering Director of Applied AI at BigPanda
Transcript

Alexander Page
2024 was really the year of agents that looked good in YouTube videos and demos, but sort of fell apart in production.

 

Saket Saurabh
Give us a sense of what is the AI tooling for you today.

 

Alexander Page
I mean, I’ve seen firsthand and I’m using that all day, every day, really. It’s just an absolute amplifier. I think if you don’t know how to code, you’ll quickly get into a situation where you’re kind of stuck.

 

But really my point is that you’re focusing less on the actual data and the sort of content of the data and more on the actions or tool calls that the agent made.

 

You being kind of an expert of this thing really puts you in a position to design an agent probably better than most. And I think the designing and building of the agent is not as hard as it might seem from the outside. Now’s a great time to jump in if you haven’t, honestly.

 

Saket Saurabh
Hello, everyone. Thanks for listening to another episode of Data Innovators and Builders. Today, I’m speaking with Alexander Page, Engineering Director of Applied AI at BigPanda. Alex, thanks for chatting with me today.

 

Alexander Page
Yeah, thanks for having me on.

 

Saket Saurabh
Alex, would love for you to give a quick introduction about yourself and how you got into the world of AI agents.

 

Alexander Page
Oh, man, it’s tough doing that quickly, but I’ll say I have always loved technology. I think I was maybe six or seven years old, got my first Toshiba laptop with the eraser mouse. And I just kind of always knew that I wanted to do this or something in technology my whole life.

I studied computer science, started my career as a developer, then moved into sales engineering or pre-sales. So kind of running technical demos of software solutions, running POCs and POVs with prospective buyers, sort of the technical side of IT sales.

 

And I was doing that at BigPanda initially. Then ChatGPT came out and I very quickly saw a huge opportunity to apply this new wave of AI technology to what BigPanda was already doing in IT operations. Very data heavy and often very cryptic at that. This need to understand all these different signals, and humans really struggling with that.

 

I think LLMs and agents are a great fit for bringing a lot of efficiency and improving the quality at which we do these things and run operations. So that became very clear to me very quickly. I just dove right into experimenting and prototyping. Fast forward a couple of years and I’m now running a team at BigPanda to basically do just that. It’s been an interesting journey.

 

Saket Saurabh
Yeah, that’s incredible. I think when talking to you, I feel like you are one of the most advanced people in terms of using agents and building solutions off of that and actually thinking about some of the more cutting edge areas when it comes to creating an agent solution.

 

Maybe let’s start with, you know, what are some of the top of mind things for you? When we were starting the conversation, it was about production systems. And in that context, you talked about guardrails and evaluation. So you’re clearly well ahead of the folks who are still testing ideas. You have taken them to production. Tell us a little about that.

 

Alexander Page
Yeah, I think 2024 was really the year of agents that looked good in YouTube videos and demos, but sort of fell apart in production. And a majority of our research efforts are around how you go from something that clearly works, where the prototype is extremely viable, to something that accounts for all the production-setting situations it’s going to find itself in and still works reliably, still gets the right answer at least most of the time.

 

And then, really along with that, how do we build these systems in a way that they learn? That can take many forms, but the key is that they improve. If they don’t get something right, they need to be able to get it right the next time.

 

It does matter when you’re saying agents in production, like are we talking smaller companies, midsize companies, or larger enterprise companies. There are different considerations in each. I can speak primarily to the enterprise side.

 

Things like guardrails matter quite a bit. We can’t build these systems in a way where they can really cause harm in terms of actions they’re taking. The obvious mitigator for that is human in the loop when it comes to actually taking action. When you’re building agents to more generate insight or make suggestions, and they’re doing read-only operations on different systems or looking at data in a RAG context, guardrails for taking action matter less.

But even that, like data access and being able to honor the permissions that a given user has, if that user is interacting with an agent, we should probably reflect those permissions in the systems it might be getting information from. So that’s a consideration, making sure we’re not even letting the user get information they shouldn’t get.
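That permission mirroring can be sketched at the retrieval layer (a minimal illustration, not BigPanda’s implementation; the chunk schema and group names are assumptions): each indexed chunk carries the groups allowed to read it, and retrieval drops anything outside the requesting user’s groups.

```python
def filter_by_permissions(chunks, user_groups):
    """Keep only chunks whose access list overlaps the user's groups, so the
    agent can't surface information the user couldn't read at the source."""
    return [c for c in chunks if c["allowed_groups"] & user_groups]

# Hypothetical indexed chunks, each tagged with the groups that may read it.
chunks = [
    {"doc_id": "runbook-1", "allowed_groups": {"ops", "sre"}},
    {"doc_id": "hr-policy", "allowed_groups": {"hr"}},
]
visible = filter_by_permissions(chunks, user_groups={"ops"})
```

The same filter can run before ranking or after, but running it before anything reaches the context window is the safer default.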

 

Saket Saurabh
Yeah, I think to this conversation, let’s peel the layers of that, but maybe to back up. Why don’t we start with, is there a framework that you use in deciding what use cases you want to take to production or which use cases to go after?

 

Alexander Page
Yeah, there are sort of two avenues. One is we dedicate time to do really pure research on looking at the space holistically. For us, that’s IT operations. I think it’s critical that if you’re building software, you’re talking to your customers. You’ll get a quick sense of what problems you’re solving well, what problems you might be able to solve in a better way, and what problems you’re not even looking at yet.

 

If you’re in touch with that, you will naturally start to think of new ways to solve problems. That can translate into prototypes. It’s sort of like try 10 things and maybe one will stick. Or you take a more direct approach where what you prototype around is directly things that customers are asking for.

 

I think you kind of need both. You need to accommodate the immediate needs being voiced by your end users and customers, but also take a step back and rethink the game entirely. That’s sort of the classic figure out what your customers want before they realize they need it kind of thing.

 

Saket Saurabh
And do you typically make multiple prototypes depending on what’s coming in, and after you build the prototype, figure out what’s really high value? Or sometimes you realize that some use cases are extremely complex. How do you go about that? Is there something you would recommend to the listeners?

 

Alexander Page
Yeah, I think for any larger, more complex problem, break it up into parts and try to solve them individually. That’s typically a winning strategy.

 

It was the case maybe two years ago where we knew exactly what we wanted to achieve, but it might not have been possible with the current state of foundation models. Simple agents back then were not so reliable. Since then, tool calling has improved so much and there’s just so much more that’s possible today.

 

Most things are achievable now. It’s about how you approach it and how you design the system. Breaking things into chunks worked 20 years ago in programming and it still works today in designing these systems.

 

Saket Saurabh
Yeah, that’s absolutely right. I think breaking that down… and capabilities have certainly improved. One of the things that came up when we were chatting earlier was the role of data quality. Do you have any thoughts or comments on how you handle data quality issues when you’re working with LLMs?

 

Alexander Page
Yeah, so I think it’s actually two things. It’s data quality and data gaps. The system I primarily work on at BigPanda is called Biggie, a personified system that does many things, but one of those things is it understands and internalizes the history and context of a given organization. It can RAG across that data, things like Confluence pages, ITSM data, wikis, knowledge base articles, SOPs, all IT operations related, in many different forms, structured and unstructured.

 

You talk about data quality. I might have a Confluence page describing how to do something written two years ago, and there’s one step that’s not quite right anymore. An LLM or agent isn’t going to know that. So it serves up the answer to the user, and I think an important part of that is traceability. Citing sources when you’re building a RAG system, because then a user might try those steps and realize step seven is wrong.

 

Now that’s the pivotal moment of what do you do right there. Can the user easily say to your system, that was kind of good, but this step is wrong, we don’t do it that way anymore? And can you internalize that? One approach would be to try to programmatically fix the data in the source system, which you probably don’t want to do. Another approach is to highly recommend they fix it at the source, but they may or may not do that.

 

An approach we’ve explored is to internalize that and handle conflicts when they arise. User corrections should take a pretty high weighting. Then the next time a similar request comes in and I find that same out-of-date Confluence article, I can see that when Bob asked me about this, he said this part was wrong. And so I can handle the blending of those two things.
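The retrieval-time blending he describes could be sketched like this (illustrative only; the store, schema, and weight value are assumptions, not BigPanda’s design): corrections are kept keyed by source document and attached to each retrieved chunk with a high, but deliberately not absolute, weight.

```python
CORRECTION_WEIGHT = 0.9  # high, but not infinite: source docs can still win

# Corrections recorded when a user flags part of a retrieved document.
corrections = {
    "confluence/runbook-42": [
        {"note": "Step 7 is outdated; we no longer restart via cron.",
         "weight": CORRECTION_WEIGHT},
    ],
}

def attach_corrections(retrieved):
    """Blend stored user corrections into each retrieved chunk so the agent
    can weigh them against the stale source text when they conflict."""
    return [{**chunk, "corrections": corrections.get(chunk["doc_id"], [])}
            for chunk in retrieved]

blended = attach_corrections([{"doc_id": "confluence/runbook-42",
                               "text": "...step 7: restart via cron..."}])
```

The next retrieval of that runbook then carries both the original text and Bob’s correction, and the model is prompted to resolve the conflict in the correction’s favor by default.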

 

But data quality is just one part of it. There are also data gaps. What happens if we don’t have anything to retrieve? How do you handle that scenario? I think it’s very important to track these things and make them available to power users or admins on the customer side so they can figure out what to do with that. A lot of users are asking questions about this but there’s nothing in our context for it. That’s a data gap we want to do something about.

 

At the very least, you need to be able to surface that. Because if you don’t have it and you come up with magical answers sometimes and then just can’t answer other times, your system gets blamed. These systems should feel human, like, hey, I searched everywhere and I don’t see anything for this. You can manage those perceptions a bit with how you handle those scenarios, but they should be expected because they will 100 percent happen.

 

Saket Saurabh
Yeah, exactly right. And part of what you’re talking about is also having some sort of human in the loop in these cases where we are learning from that interaction.

How important is that as you’re designing a solution, figuring out that part of the interaction and taking inputs from users? Is that happening mostly in the testing phase, or is it also happening in actual production use where you have a human in the loop?

 

Alexander Page
As far as being able to correct the system, yeah, I would almost go as far as to say if there’s not something in place for that, it’s not going to go well. I don’t think the expectation is necessarily that these systems are perfect, but I think there is a growing expectation that when they mess up or don’t know or can’t figure something out, they can be taught or shown the correct answer, and then they should get it right the next time.

 

If we’re building these systems that are only as good as their data and can’t be improved and can’t morph or adjust or adapt to different needs of different customers, the success of that system is going to be negatively impacted for sure.

 

Saket Saurabh
One of the things we’ve noted in building an agent solution ourselves, a data engineering agent, is that when you talk about the agent sounding human and having a natural response, a lot of the work we put in went into the prompting side of things and also enriching the context we’re bringing into the model. Anything you might want to share on that side, like techniques that you’ve seen?

 

Alexander Page
Yeah, I think those are the two main levers. Fine tuning is another. For a lot of our needs in this space, we’ve found ways to achieve them without needing to fine tune, but it has been made a lot easier to do and shouldn’t be overlooked.

 

I think one area to really pay attention to is what you mentioned, context engineering. Even though context windows are increasing and will likely continue to increase, agents still suffer from, yes, it’s within the context window, but it’s a lot of context and it’s going to struggle. Everything from the lost in the middle type of problem to the answer just not being that great because it didn’t adhere to all my instructions because it was inundated with a mountain of retrieved context that I pulled in without any chunking or representation.

 

I’ve seen context engineering coming up a lot more recently. It’s something we realized pretty early on we needed to address. A lot of our initial indexing and retrieval was built around the 32K token context windows we had back then. So now it’s like, oh, I have a million tokens. But I think it’s important not to fall under the impression that you can fill up the whole context and it’ll work just fine. You’ll quickly realize that’s not true.

 

So what do we do with that? We can prune. Just because something is semantically similar above a certain score threshold doesn’t necessarily mean it’s useful. Doesn’t mean the whole thing is useful. You can handle this on the retrieval side or on the indexing side. Chunking, representation, different costs associated with those things, but they can dramatically improve results.
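The pruning idea can be sketched as a greedy pass (a toy version; the threshold, budget, and chunk fields are assumptions): keep only chunks that clear a similarity threshold, then fill a token budget with the highest-scoring ones instead of stuffing the whole window.

```python
def prune_context(chunks, min_score=0.75, token_budget=2000):
    """Greedily keep the highest-scoring chunks that clear the similarity
    threshold and still fit the token budget."""
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if chunk["score"] < min_score:
            break  # everything after this scores even lower
        if used + chunk["tokens"] <= token_budget:
            kept.append(chunk)
            used += chunk["tokens"]
    return kept

candidates = [
    {"id": "a", "score": 0.92, "tokens": 800},
    {"id": "b", "score": 0.81, "tokens": 1500},  # too big once "a" is in
    {"id": "c", "score": 0.79, "tokens": 900},
    {"id": "d", "score": 0.60, "tokens": 400},   # below threshold
]
pruned = prune_context(candidates)
```

Real systems do this on the indexing side too, with smarter chunking and representation, but even a cutoff like this beats shipping every hit above the raw score threshold.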

 

And we’re talking in a RAG system context here, but I think most agentic systems should have some sort of RAG component to them as the source of information not available in the pre-trained data of a given LLM, as a source of user corrections, as a source of how to approach a problem, and as a source of history for a particular customer organization.

 

Saket Saurabh
Yeah. And especially as we’re talking about context engineering and building some of these, I want to ask you about the AI tooling. Give us a sense of what is the AI tooling for you today.

 

Alexander Page
Yeah, I think there’s a lot of attention on this topic, especially recently. The way I look at it is in the context of software development. There are all these activities that we as engineers perform, from writing code itself to designing systems, designing infrastructure, building Terraform for a given system, reviewing code, writing tests.

 

For all of them, there are a number of different AI-native tools that in general can give you a level of scale. If nothing else, it’s another set of eyes. For the actual coding AI tooling like Cursor, I’ve seen firsthand and I’m using that all day, every day. It’s just an absolute amplifier. I think I’m somewhere between six and ten times more productive.

 

It took a second to figure out how to use them effectively. You sort of have to get used to the harness of each one. I know down to the point of which model to use for what in Cursor. How to prompt Sonnet 4.5 versus Codex or GPT-5. They sort of all behave differently. I know which one to use for, say, UI-heavy work versus back end, more algorithmic processing logic. You just naturally pick that up as you start using them.

 

I think if you don’t know how to code, you’ll quickly get into a situation where you’re kind of stuck. Because when it gets stuck, and it definitely does, or even worse, doesn’t get stuck but is confidently wrong, you need to be able to spot that, remediate it, redirect it, and sometimes take back the reins and fix it yourself. If you don’t have that background, that’s going to be a lot harder.

 

But I think someone who has coding experience already and maybe some professional experience writing code, not just in college, can be very effective with this tooling.

 

Saket Saurabh
Yeah, I mean, I 100 percent agree. And one of the things I feel is that to be good with these tools, the important skill is knowing what a good outcome looks like so you can keep judging, okay, this model is doing better for this use case, this model is doing better for that use case.

And this is a constant learning process as the models themselves keep changing and evolving. Knowing what a good outcome looks like is almost a necessity. If you don’t know what code should look like, it’s very hard to get to that outcome.

 

Alexander Page
Mm hmm. Yep.

 

Saket Saurabh
Thinking of the tooling, I also want to ask about how you’re building confidence around quality. One of the things I think is about consistency of outcome. Maybe share any lessons or guidance on that.

 

Alexander Page
Yeah, evals is a very broad topic. At its core, you’re trying to evaluate the output of an LLM. There are a number of things you could evaluate. Is it correct? Is it too wordy? Is it aligned with human values? Is it misleading? So there’s a lot of different ways to evaluate. First is identifying what you should actually evaluate.

 

But then you talk about agentic systems, and evaluating an agent becomes quite a bit more difficult. If you want to do that right, you’d want to not just evaluate the final output the agent might produce, but the intermediary steps. And if you do that, you’ll be much more effective at identifying the source of problems when they arise and making a more predictable system.

 

Really, if you think about what an agent is doing, let’s start with just a more stateless call. I have a prompt, push that via API to some LLM, it gives me a response. That’s one thing. When you’re talking about an agent, the simple definition is it’s an LLM-based application where the LLM itself is deciding the control flow. It decides what to do, and it does that by having tools made available to it. You could think of those as just functions. Those functions can do whatever you want. They could be retrieving context from a vector database, creating a ticket in JIRA, doing math, getting the current time, more complex things.

 

You provide those tools to the agent and when it’s invoked, it kicks off this agent loop. You have some input to it, like go look this up and then create a JIRA ticket. And it says, okay, what tools do I have? I can create a JIRA ticket, but I should probably look this thing up first. I’ll call that tool, that returns a response, then the agent’s back in decider mode. Okay, I asked that tool this, it gave me this, now I need to create the JIRA ticket. I’ll call that tool.
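A stripped-down version of that loop (a sketch: the tools are stand-ins, and a scripted function replaces the LLM’s decision step) looks like this:

```python
def lookup(query):
    """Stand-in for a retrieval tool."""
    return f"context for {query!r}"

def create_jira_ticket(summary):
    """Stand-in for a JIRA tool."""
    return f"created ticket: {summary}"

TOOLS = {"lookup": lookup, "create_jira_ticket": create_jira_ticket}

def agent_loop(decider, request):
    """Run the loop: the decider (an LLM in a real agent; a plain function
    here) sees the request plus prior tool results and either names the next
    tool to call or returns None to finish."""
    history = []
    while (step := decider(request, history)) is not None:
        tool_name, arg = step
        result = TOOLS[tool_name](arg)  # execute, then hand control back
        history.append((tool_name, result))
    return history

def scripted_decider(request, history):
    # Look things up before creating the ticket; stop after two calls.
    if not history:
        return ("lookup", request)
    if len(history) == 1:
        return ("create_jira_ticket", request)
    return None

trace = agent_loop(scripted_decider, "disk full on prod-db-3")
```

The important structural point is that control returns to the decider after every tool result, which is exactly what makes the sequence of calls observable and testable.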

 

So if you’re evaluating that, what if the agent just skipped right to the JIRA ticket? It assumed it had enough information but really didn’t. So it made one tool call instead of two. You would fail that evaluation because your expected result was that it calls two tools. You care less about what the response of the tool was unless you’re controlling the test data. But really my point is that you’re focusing less on the actual data and more on the actions or tool calls that the agent made.

 

And so that’s where we tend to focus our evaluations, because a lot of what we’re doing involves tiers of agents or hierarchies of agents and agent teams. In the entire agent loop, 30, 40, 50 tool calls. It would be quite cumbersome to evaluate the response of each tool. It’s more straightforward to enforce that for this sort of input, this is the sequence of tool calls that should occur.
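Evaluating by tool-call sequence can then be as simple as comparing the ordered tool names in a trace against an expected sequence (a sketch; real harnesses add argument checks and fuzzier matching):

```python
def eval_tool_sequence(trace, expected_tools):
    """Pass only if the agent called exactly the expected tools, in order.
    Tool outputs are ignored unless the test data is controlled."""
    return [tool for tool, _result in trace] == expected_tools

# The agent that skipped the lookup and jumped straight to the ticket fails.
good_trace = [("lookup", "..."), ("create_jira_ticket", "...")]
bad_trace = [("create_jira_ticket", "...")]
expected = ["lookup", "create_jira_ticket"]
```

With 30 to 50 calls in a loop, a sequence mismatch also tells you exactly which step in the chain went wrong, which is the point of evaluating intermediaries rather than only the final answer.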

 

Saket Saurabh
Okay, so that’s where you focus more. And again, you’re orchestrating a whole layer of agents to get the task done?

 

Alexander Page
Yeah. And in many cases, yes. The way we’ve designed our system is to make the input easy. As a user, I don’t need to know any of this. I just know what I need or want. And then internally we’ll figure out at runtime, well, what do I need to spin up? Can my wrapper agent handle this? Does it need to spin up one to ten sub agents? Do I need this team of agents over here that specializes in this thing?

 

We sort of decide what we need at any given time.

 

Saket Saurabh
Basically getting that user intent, interpreting that, then of course invoking agents around that. I think one of the things we talked about was also having multiple agents with different responsibilities. How are we defining that? Like, is it taking one small task and saying, here’s the job responsibility, almost like a person?

 

Alexander Page
Yeah, an interesting thought is, do we design our systems around the capabilities of today’s foundation models? Or do we approach it differently, knowing that one day, maybe the next generation that comes out, something will be feasible that isn’t right now? I think that’s an important lens to look at everything through. We have a whole shelf at any given time of things that aren’t possible today, but we know we want to do and we know they’re going to be possible soon.

 

If you have a big system you want to do all these different things and it ends up with a hundred different tools, if you add all those hundred tools to one agent, it’s going to struggle. It’s not going to be good at figuring out which tool to call. In general, an agent handles fewer tools better, just like an LLM handles fewer tokens better.

 

A lot of it is really just isolation of decision making. It’s pretty natural to design it that way because it’s almost like designing a team of humans. You have a couple of developers who write code, a designer who creates Figmas, a PM who decides what you do, a support person who handles issues. You don’t want to limit yourself to what humans are capable of in terms of how you design these systems, but it’s a good place to start.

 

Keeping agents domain-specific works well. If you’re building an agent to interact with a certain system, you might want a single agent just for interaction with that system. And there are different layers involved. I might have a set of tools I want to give to one or many agents, maybe my context retrieval tools and basic date-time conversions and calculator functions. That might be useful for a number of different agents. So we have a sort of shared library of tools that we give access to different agents. But in general, the more you can focus an agent in terms of the tools it has access to and the system prompt that kicks it off, the results tend to dramatically improve.
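A toy version of that isolation (the agent names, keywords, and tool names are all hypothetical) might route intents to sub-agents that each carry only their domain tools plus the shared library:

```python
SHARED_TOOLS = {"retrieve_context", "current_time", "calculator"}

DOMAIN_AGENTS = {
    "jira":       {"tools": {"create_ticket", "comment_ticket"} | SHARED_TOOLS,
                   "keywords": {"jira", "ticket"}},
    "monitoring": {"tools": {"query_metrics", "list_alerts"} | SHARED_TOOLS,
                   "keywords": {"alert", "metric", "dashboard"}},
}

def route(intent):
    """Spin up only the sub-agents whose domain matches the user's intent,
    so no single agent ever has to pick from all hundred tools."""
    words = set(intent.lower().split())
    return [name for name, spec in DOMAIN_AGENTS.items()
            if spec["keywords"] & words]

selected = route("open a jira ticket for this alert")
```

Production routers use an LLM or classifier rather than keyword overlap, but the payoff is the same: each sub-agent sees a handful of tools and a focused system prompt instead of the full catalog.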

 

Saket Saurabh
Yeah, absolutely. And I think maybe this also points to one thing that is perhaps a source of confusion out there about MCP. From the outside, sometimes people feel like it can solve all their problems. Where have you seen limitations that you’ve worked through, and what advice would you share about working with MCPs?

 

Alexander Page
Yeah, I think it is a term that’s used a lot and maybe not so well understood. It’s basically a standard for how systems can get context or get data out of other systems. Really all the protocol itself exposes is tools, resources (that is, data context), and in some cases prompt templates. I would look at it as more of a standard plug, like USB-C.

 

When you’re building an agent and it needs to have access to, say, JIRA or ServiceNow or Dynatrace or whatever, you have an easy way to consume tools from those external systems in the agent you’re building so that you don’t have to define them yourself.

 

But that doesn’t mean it’s a cure-all. It does come with challenges. Security-wise, you’re sort of trusting the MCP you’re pulling in to have handled that correctly, which you shouldn’t just assume. It also abstracts out certain logic that you then have less visibility into. And if some input to that remote tool is resulting in very different outputs, you don’t really know why. So harder to monitor, harder to guardrail, harder to troubleshoot.

 

And there’s also A2A. I think the original intent with MCP was really all around data. It’s model context protocol, so it’s about getting data out of some other system. But if you’re building an agentic system and some of your customers have built their own agents and they want them to talk to each other, there’s another protocol that’s better suited for that, which is A2A, which Google put out. Agent to agent. One of the primary differences is it’s not exposing the inner workings of each system but still letting them talk to each other effectively, which has a lot of benefit from an enterprise software perspective.

 

I kind of think of MCP as a little bit of a snake oil type of thing in how it’s being used sometimes. I’ve heard it described as wrapping your chaos in a standard and calling it governance. But when used properly, it can be a nice easy button to beef up the capabilities of an agent or agent system without having to rewrite something that some other system is probably better suited to own.

 

Saket Saurabh
And I would certainly say that to your point, where things stand today and the amount of progress, these are very solvable problems. You can consistently put guardrails around that and have it evolve as you build.

 

We can go from being in prototype mode, yes this looks cool and solves a real problem, all the way to having a production system running 24/7 solving problems that you can trust and is consistent. It has become possible to do that. But I feel like we’re at the starting point of some of that evolution.

 

All the things you mentioned today as approaches are super helpful in thinking systematically about how you’re designing your agents and putting a system together. Anything you want to add as advice, because I know a lot of people are looking at building that sort of production-grade solution?

 

Alexander Page
The only thing I would say is that the accessibility of this technology is the most incredible thing about it personally. I don’t have a background in ML, but I feel like I’ve learned a lot about it, really through doing. Lots of YouTube videos, but primarily just experimenting and banging my head against the wall and figuring it out. And also getting help from LLMs and ChatGPT and Cursor along the way.

 

I’ve seen it firsthand. If you are an expert of some thing or understand some slice of data really well, it is so accessible that it’s really not as far-fetched as it might seem to build an agent on top of that. I’ve seen several people make that leap, from PMs to support to SREs.

 

You being an expert of something really puts you in a position to design an agent probably better than most. And I think the designing and building of the agent is not as hard as it might seem from the outside. There are a lot of things that make that easier, from frameworks to just a wide range of YouTube videos. Things like LangGraph, the OpenAI Agents SDK, and OpenAI just put out a visual editor for building agents. It’s just getting easier and easier and easier. Now’s a great time to jump in if you haven’t, honestly.

 

Saket Saurabh
Yeah. Well, I think that’s fabulous career advice, actually. And I would agree with you that a few years back, if you wanted to do anything AI and ML related, it would easily take a year of building and training to get some sort of small use case to production. And here you are, the whole power is right there, accessible to you.

 

Now a lot of the problem is about stitching together the right pieces, thinking through the system design very well, and really thinking, this is the outcome I’m trying to get to, how do I do that? And you’re right, that takes a lot of banging your head against the wall. And also sometimes getting surprised by new models that come up and all of a sudden the problem is getting solved.

 

Alexander Page
Right. Right. Yep.

 

Saket Saurabh
Awesome. And I really love your title, Applied AI. I feel like probably there’ll be more of those sorts of roles for people out there who are willing to put in that effort and learn the applied AI side, because there’s tremendous value in enterprise to create in that function.

 

Alexander Page
I agree. Yeah.

 

Saket Saurabh
It’s been such a pleasure chatting with you, Alex. I’ve learned a lot and I think our audience will learn a lot as well through this conversation. Thank you so much for joining us today.

 

Alexander Page
Of course. Yeah. Thanks for having me. It was great. Thank you.
