Stephen Gatchell
I wish we all paid attention to data over the past two decades like we should have, because then we would really be ready to launch the AI strategies and not worry about a lot of the challenges that we have today.
Saket Saurabh
I mean, you guys have been in the business of scanning through and finding where sensitive data is.
Stephen Gatchell
There are some tools in the marketplace, but honestly, as they scan and build these data catalogs, they actually copy the entire dataset. Think about that at petabyte scale. One, there’s a huge cost. Two, you just doubled your risk profile.
Saket Saurabh
Can AI effectively solve enterprise problems, which are very high value that people are willing to pay for?
Stephen Gatchell
We’re starting to move away from just large language models and we’re getting to small language models. We’re moving away from a general-purpose agent into a very specific agent.
Data products is starting to become a conversation as well because of AI models becoming data products.
Saket Saurabh
Hi, everyone. Thanks for listening to another episode of Data Innovators and Builders. Today, I’m speaking with Stephen Gatchell from BigID, Vice President of Data and AI Strategy. Stephen, thank you for joining us and chatting with me today.
Stephen Gatchell
Thanks for having me. I’m looking forward to our conversation.
Saket Saurabh
Thank you. Stephen, you have been in the data space for quite some time. You’ve seen the ups and downs, the changes, and the evolution. Maybe give us a little bit of a perspective in terms of what’s top of mind for you, what’s happening in data that’s exciting.
Stephen Gatchell
Yeah. Look, I think over the years I was a practitioner of actually trying to implement data and utilizing that for things like analytics and AI strategies and so forth. A little different than AI today from a few years ago. But as I came to BigID, I’ve been here the last few years, really getting the privilege of working with our customers across the board as well as our partners. BigID really focuses across the security, the privacy, and the governance space. When you talk about that, you talk about data, you talk about AI assets as well, things like data products, etc.
What comes to top of mind quite honestly is, I wish we all paid attention to data over the past two decades like we should have, because then we would really be ready to launch the AI strategies and not worry about a lot of the challenges that we have today. I think some of the regulations that have been implemented across the world may have been a little bit easier on us from a data perspective if we were doing the right things from the beginning.
I think two things that really come to mind. Everybody wants to talk about AI in the context of modern day AI, not traditional machine learning. And second, now they’re starting to pay attention to data and governance, and we should have been doing that all along, to be honest.
Saket Saurabh
Yeah, I think everybody’s talking about AI and we will too actually. From what I’ve seen, it seems like that focus on data has increased over the past year. That conversation about enterprises adopting AI initially seemed like it would be easy, and then yes, we started to hit the challenge of data, and data is complex and messy in the enterprise.
So let’s dive right into that. High-level, how are you seeing AI and reading the impact and opportunity there?
Stephen Gatchell
First, I see a major change in the type of data people are paying attention to. Traditionally, people would look at structured data and have that as part of their business processes and critical data paths. But now with the advent of AI, unstructured data is becoming the main focus. When I talk about unstructured data, it could be as simple as a PDF or a Word document, but now we’re talking about voice files, pictures, video, all kinds of unstructured data.
The challenge there is the size and scope. We’ve heard this way back when we used to call stuff big data. The number of petabytes that people are trying to manage in order to understand what should I use in AI, what should I not use in AI, what should people have access to and what should we block from a security perspective has now changed. We see even companies that have had decent data governance programs now struggling because they only focused on the structured side and not the unstructured side.
The second major shift I see is the type of users. In the past, it mostly started with technical users that understood applications and things like firewalls and other security concepts. But now we have business users trying to adopt it. They’re laying down their credit card and signing up for ChatGPT or Gemini and trying to use it in their day-to-day lives. It’s a different profile of a user, and do they have the right knowledge and skill set to do it safely and responsibly?
Saket Saurabh
Yeah. I think finally, I feel like technology is becoming the means to the end. The end user, the end application is really taking the forefront. That’s the business user who wants to get their job done and done more effectively.
So maybe let’s double-click a little bit on some of the data challenges. You brought in the fact that it was built for structured, even where people had governance it was built for structured. Unstructured was unserved. Maybe in the context of your company BigID, give us a context of what is the state for structured data, and then let’s go into unstructured.
Stephen Gatchell
Yeah, absolutely. So look, for structured data, most companies have a handle on where their critical information is, where their personal information is from a regulatory perspective, their IP. I’m not saying everybody, but most of the time. When you talk about unstructured data, they don’t even know how to discover that information and build a traditional data catalog. That’s what I talk to customers about on a day-to-day basis.
So BigID basically connects to all different types of data sources, whether structured or unstructured. We now can connect to vector databases and AI models. Think about a persistent data catalog that has all those types of data put together. Understanding and classifying information, identifying the personal information, the PII data, the IP data of what you actually want to pay attention to and secure.
And then when you start talking about the AI realm, capabilities such as labeling to turn on a Microsoft Copilot or utilize Google, where you can actually look at that unstructured data of a Word document, a PowerPoint, an Excel spreadsheet, and be able to say, look, there’s some really important information here. And if Stephen, who sits in the product group, has access to the financial HR information, that’s probably not a good thing when they’re sitting there and typing in, show me everybody’s salaries over the past few weeks. This is a real use case, by the way.
Most of the customers I talk to, they tend to turn on some of these copilots and then they realize that they don’t have the right structure in place to identify what things need to be masked or tokenized or otherwise protected so that the right people have access to the right data at the right time. Usually that’s not in place from an end-to-end perspective.
And so when you talk about data itself, I think a major change I’ve seen over the past 12 to 18 months with the advent of generative AI is really the groups coming together from privacy, security, and governance and now saying, how do we manage all of this from an end-user perspective? How do we actually manage the data? Where do we even start? And I think that’s where tools like BigID come in and can scan the landscape at a very high level, do some quick evaluation and say, hey, there’s some really sensitive data that you want to secure over here, but these repositories over here, you don’t have to worry about because we didn’t find any PII data. And start accelerating that discovery of where to prioritize and focus.
Saket Saurabh
Okay, so first of all, let’s just scan through and find out where we have sensitive data. Now, one question I had was from a user perspective. I mean, I see products like Glean or Google Gemini in your drive or Microsoft Copilot. All these products are saying, hey, come chat with your data, ask questions. But you are saying that behind that, the data has a lot of sensitive information. It has to be carefully managed and siloed to the right person, the right use case.
So how does this all piece together? Because this use case of talk to your data has so much momentum. People want to just move with it. And here you are like, hey, hold on. So tell me about the real life situations and how that’s happening.
Stephen Gatchell
Yeah, so the horse is out of the barn, as they say. I talk to many, many customers that do turn on these types of applications and they literally a month later will shut them off. This is real life. They will allow their entire user base to use a Microsoft Copilot or a Glean, and then as they’re going through it, they realize that the sensitive data is being exposed.
And every customer is different, by the way. It all depends upon the risk tolerance of the customer and really what industry they’re in. Traditionally, finance and healthcare usually had more controls over their data because they were regulatory required to do that. Releasing patient data or financial data of your customers is really frowned upon and you don’t want to be in the news. But with retail and some others, you’ll see that their risk tolerance is a bit higher.
So real life use cases. What I work with customers on a day-to-day basis is, what is that risk tolerance? Do you actually have a governance committee that cuts across security, privacy and data that has already talked about, well, we’ve evaluated this Copilot and we want to turn it on because we believe the ROI and the efficiency and the end user experience exceed the risk tolerance that we’re going to see? And they just turn the thing on.
But in real life, if you’re not managing your data and you don’t even know where your data is, think about the number of files that get copied across a OneDrive or a Google Drive, where somebody in HR just copies a file and then it gets copied 15 times over and now that PII data is supposed to be in a secure repository but is now in 14 other locations. This is the day-to-day problem we find. Duplication of data is a huge issue right now. Just reducing the duplicate data is a use case we see accelerating in our marketplace.
If you can remove all the duplicates that have sensitive information and have your system of record, then you’re reducing your risk, whether that’s security, privacy, or data risk. And if you can understand your retention policies, most companies have one but don’t actually execute it. They don’t actually delete data that’s over seven years old that they no longer want to keep. They have 30, 40 years of data, literally, petabytes of data, stored in anything from a mainframe all the way through modern cloud technologies.
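The de-duplication step Stephen describes can be sketched with content hashing: files whose bytes hash identically are duplicates regardless of name or location, so you can keep one system of record and flag the rest. This is a minimal illustration, not BigID's actual mechanism, and the file paths and contents are invented:

```python
import hashlib
from collections import defaultdict

def content_hash(data: bytes) -> str:
    """Hash file contents so renamed copies still match."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical file contents standing in for scanned repositories.
files = {
    "hr/salaries.xlsx": b"name,salary\nalice,100",
    "shared/copy of salaries.xlsx": b"name,salary\nalice,100",
    "onedrive/salaries (1).xlsx": b"name,salary\nalice,100",
    "product/roadmap.docx": b"Q3 roadmap",
}

# Group paths by content hash; any group larger than one is duplication.
groups = defaultdict(list)
for path, data in files.items():
    groups[content_hash(data)].append(path)

# Keep the first copy as the system of record; the rest is removable risk.
duplicates = {paths[0]: paths[1:] for paths in groups.values() if len(paths) > 1}
print(duplicates)
```

At real scale the hashing would run incrementally inside the scanner, but the core idea is the same: identity of content, not of file name or location.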
So the ability to connect to these different types of data repositories and give the customer insight to say, we scanned this environment, here are your Amazon S3 buckets, Azure, Google Drives, AWS, mainframes, Salesforce, and help structure where do you even start? Where’s your most risk? And until you do that, what you want to do is point tools like Microsoft Copilot or Gemini to repositories where you understand there is no risk or there’s an acceptable risk. Start small, get those end users using the tool, figure out what data they have access to and what data they don’t, what use cases they actually want to solve.
So there might be a little bit of frustration upfront. But in the end, to me, it gets some education and literacy to the end users, it reduces your risk. Start small. And then as you start scanning environments and realizing, okay, there’s all public data in this repository, let’s just open it up. This over here, we found a bunch of stuff and we’ve got some work to do from a security perspective. And let’s not give access to the entire company. Maybe we give access to the small group that needs access to it. Have a methodical plan to roll it out.
Make sure that you have security, privacy, and data talking to each other. What is the right risk tolerance level? And then what is the payback in the ROI, making sure it’s worth that risk tolerance level? Then roll it out that way. Just don’t flick a switch. It’s not that easy.
Saket Saurabh
Yeah, yeah. I think the risk of opening up too fast and then having to close back down and go reopen, you make a very valid point that you should go through it thoughtfully, maybe business use case by business use case.
Like, hey, let’s take the marketing data and make that available. And now you can ask questions of all the product features and capabilities and so on, and then go to more sensitive data.
One thing I was curious about was that you guys have been in the business of scanning through and finding where sensitive data is. And you started with structured data and said, hey, now we support vector databases and unstructured and all that stuff. How have the techniques evolved in being able to do that? It’s a hard problem to solve, and that too at scale. You’re talking petabytes now.
Stephen Gatchell
Yes. And just to be clear, BigID always scanned both structured and unstructured. It’s the marketplace that’s starting to catch up and worrying about the unstructured. So the techniques, some of them are traditional AI. It’s machine learning combined with regular expressions, just finding patterns of data and identifying it.
And then what has matured in the marketplace are the number of tools you can utilize. So think about a classifier as you’re going through and reading data. Years ago, people just read metadata. They read the tag of information, but they didn’t actually read the data itself. So somebody labeled the column as customer and you’re like, okay, I found PII data. The metadata is captured and now we found PII data.
Well, as we both know, not all databases are created equal. The way people actually structured their databases, they didn’t always use the right column names. On unstructured data, you can’t just go by a file name. You have to actually look inside the data, whether structured or unstructured, and be able to evaluate the data itself, either at a detail level where I’m doing a full scan and really looking at each column and each row, or scanning through a document to understand what information is in there.
That’s a challenge, and at scale it’s very time consuming and can be very costly. So you develop other capabilities where maybe you just do sample scanning. You don’t have to read every single row or column. You could just sample and say, yep, I found some stuff in here, good enough. Or you have what you call side scanning now, where you actually take a snapshot of the data and scan it offline, not in production. Or you have an assessment scan where you go through and as soon as you find something, you just stop. You don’t have to go through every single file in an S3 bucket. Hey, these first five files have PII data in them, so this S3 bucket has PII data. Just stop.
There are different techniques that have come up over the years to help accelerate that time to value from identifying where the sensitive information is and helping you prioritize it.
Over the past 18 months with generative AI, there have been huge advancements. Now think about, we looked at the technical metadata in the past. You do look at the metadata, but you also scan the actual data itself and you combine it and you realize what information is in there. But now with the advent of generative AI, I could take things like a business glossary, I have definitions of business terms, and I can import that into BigID and then actually use AI to map my physical data to my business terms. Or I could put a column name that makes sense from a business perspective instead of the column name that says column underscore one.
And so now with the advent of generative AI, that has really accelerated the contextual piece of the data. And as we increase and start getting into more agentic and generative AI, a semantic layer of understanding both from a technical metadata and a business metadata has to play together. We’ve seen tremendous change over the past 18 months in accelerating that automation of putting the business contextual stuff around it.
Saket Saurabh
No, I think that’s incredible. And I think that brings to the point that generative AI, you’re both a user of it in your own product building, but you’re also an enabler of it because now you’re bringing this privacy and governance and control to generative AI. I love how you’re creating value on both sides of it.
Stephen Gatchell
Yeah, and as a vendor, every vendor has generative AI in it today and now we’re starting to have agentic AI. I think the important message there is as people start evaluating technology suites, it’s got to be configurable. Are you using an external LLM that’s going out and sharing your sensitive data externally, or can it connect to your internal knowledge-based systems and actually create custom large language models? Or is the tool configurable, can you turn things on and off depending upon your security profile?
I think what people listening to this should do as they evaluate technology tools is make sure that you understand the configurability of it, make sure that you understand what data is leaving your environment versus what data is staying inside of your environment. There are some tools in the marketplace that, as they scan and build these data catalogs, they actually copy the entire dataset. Think about that at petabyte scale. One, there’s a huge cost. Two, you just doubled your risk profile.
You’ve got to think about how are some of these tools actually implementing the application itself. Ask those security questions, take the time to understand how is that backend being executed and can I control it?
Saket Saurabh
Absolutely. I think that copy of copy of data has been a big problem in the enterprise. You kind of close the door in one place and it opens up in another. In our world, we see that as creating new data silos because you’re constantly trying to manage that, and then you don’t know which one is in use. It’s of course a cost issue too, because you can’t shut one down.
And going forward from that, I think one of the things you pointed out is, okay, you’re scanning through this, you’re finding the data, and is that becoming a data product? Like how is this all getting bounded? Because you have a large scale over there.
Stephen Gatchell
Yeah. So I think the point you made earlier around a business use case, it tends to be the bounding piece of it. You can’t just go through and scan a hundred terabytes of data. It’s going to take you a while to figure all that stuff out and even just get internal connectivity to these systems.
And it costs a lot of money. I’ll just generalize the story that most customers I talk to, they’re like, yeah, we decided to consolidate a lot of our data on a specific platform, and then we budgeted an X number of dollars and it ends up being two to three X, and after 12 months they’re like, okay, now what do we do? We have to increase our budget and we want to continue using it because there’s real value there, but we can’t really afford to go two or three X.
So data products. In the past, depending upon the definition, I actually was leading work with the EDM Council, where we actually took a cross-functional industry group and said, what is a data product? Like, think about that for a second. I’m a governance guy at heart, so my first thing is, okay, what do you mean by data product? Define it for me, and we literally sat down and defined what a data product is, what the different components and key issues of a data product are, how data relates and how it gets managed to a data product.
You brought up a great point about duplication of data and you start losing track of what the system of record is. We have companies now, especially with the accelerated innovation of AI, where they’re building data products, their AI models as an example, that sit inside a customer-facing application or internally an employee application, and they’re not even using the system of record to train the models on. So are you getting the right output?
To me, as you’re going through, the first step is understanding your data landscape in general first, then identifying the business use case and defining what constitutes a data product, meaning do you have the right governance in place, how often does that data product get upgraded, does it have a data owner, do you understand what the ROI is on that data product?
Because the other issue we see is that people just continually build models and then six months later, is that model actually doing what you defined it should be doing? If you turn that into a data product, you treat it as an asset. A couple of years ago we started talking about data as an asset, measuring ROI on data and putting it on the balance sheet, and that hasn’t accelerated as much as I would have liked. Once you put data on a balance sheet and start talking about data products, you start measuring a data product as an individual thing. What is the profitability of that data product? It costs you X, it’s generating Y, therefore you keep that data product running.
And so they’re all related together in my eyes. Data feeds into an AI model and trains it. That AI model is defined as a product, so it has an owner, it has a measurable ROI on it, it has ownership from the data being fed into that model, it understands exactly who’s using that model and what the purpose of use is for it. And I use that term, purpose of use, very distinctly because that is what some of the regulations talk about.
I think data products is starting to become a conversation as well, because of AI models becoming data products.
Saket Saurabh
Yeah, okay, that’s quite interesting. Do you also see that playing a role at the inferencing time? Like, you know, what data products are out there and which ones do you have access to, which ones are permitted and where are you retrieving from? Like, is it giving a framework to that part as well?
Stephen Gatchell
Yeah, a hundred percent. To me, look, a data product doesn’t have to be as fancy as AI. It could be a worksheet, it could be a dashboard. Those can be data products. So to me, it’s all the traditional data governance stuff that we’ve been talking about for decades and how to manage it.
I would say one of the challenges I see is people start trying to do the right things around data products and then they’ll be like, okay, it needs a data owner. And my first question to the customer is, define data owner, and do they actually know what their responsibilities are? Sometimes the answer is, oh no, we assigned all these data owners. And I’m like, well, how do they know what to do? What does that mean? What are their responsibilities? What’s their accountability to that data product?
I think we’re starting to see more acceptance of defining a data product and more acceptance of things like ownership of that data product. We still have a little ways to go to make sure that when we do this, people actually know what that means.
Saket Saurabh
So what would you, for the audience, give like a must-have definition of what a data product must have to it?
Stephen Gatchell
Yeah, so data owner, 100%. Purpose of use, 100%. Measuring the cost and the return on investment on that. What is the risk tolerance profile of that model? How often is it reviewed to ensure it remains fit for purpose? And then the last thing is, how do you manage the lifecycle of that model? So just like data, what is the model lifecycle? When do you know when to retire that product? When do you know when it needs to be retrained if it’s an AI model as a product? When do you know that you need to cut people off from that product?
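That must-have list can be captured as a simple record. The field names and the example product below are one possible encoding of the checklist, not a standard schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DataProduct:
    """One possible encoding of the must-haves: owner, purpose of use,
    cost vs. return, risk profile, review cadence, and lifecycle state."""
    name: str
    owner: str                  # accountable person with defined responsibilities
    purpose_of_use: str         # the regulatory term Stephen calls out
    monthly_cost: int
    monthly_return: int
    risk_tolerance: str         # e.g. "low", "medium", "high"
    review_interval_days: int
    last_reviewed: date
    lifecycle_state: str = "active"  # active / retrain / retired

    def roi(self) -> int:
        return self.monthly_return - self.monthly_cost

    def review_overdue(self, today: date) -> bool:
        return (today - self.last_reviewed).days > self.review_interval_days

# Hypothetical product inspired by the pricing-model example that follows.
pricing_model = DataProduct(
    name="regional-pricing-model",
    owner="sales-analytics",
    purpose_of_use="price recommendations for retail partners",
    monthly_cost=20_000,
    monthly_return=75_000,
    risk_tolerance="medium",
    review_interval_days=90,
    last_reviewed=date(2024, 1, 15),
)
print(pricing_model.roi())                             # 55000
print(pricing_model.review_overdue(date(2024, 6, 1)))  # True
```

The point of writing it down this way is that every field becomes something you can query across a portfolio: which products have no owner, which are past review, which cost more than they return.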
Because I’ll give you one very specific example from my past life. During the pandemic, I was in retail at the time. We had a model that predicted pricing across different geographical locations. That was a data product. It was a model that sales used around the globe and with our partners. Well, when the pandemic came and we had retail stores shut down, and then all of a sudden actual revenue jumped up from the internet versus our in-stores, it changed everything. It changed the pricing of the model because if you sell online, it’s cheaper than selling in a store. And so that whole model just changed.
If you treat that as a data product and evaluate it as new data comes in, you can have monitoring for outliers. Think about what I just said. We had 50% of sales coming from retail stores and 50% coming from our web store. Now all of a sudden that jumps to 95% through the web. That’s going to change your model. It’s going to change how that model is calculating prices around the globe. And that could be a good thing or it could be a bad thing.
And you treat it as a data product so that when things like that happen, when mergers and acquisitions happen, you’re ingesting somebody else’s data or there’s a divestiture of data, you’ve got somebody monitoring and responsible as you go through these major changes and impacts to how your business is changing. Tariffs globally change retail quite frequently right now. How does that change the models that are out there in the retail space or the wholesale space? These things happen every day. And if you don’t treat things as data products, you could potentially lose millions and millions of dollars.
Saket Saurabh
Yeah, yeah. You mentioned a little bit about the semantic layer as well. How would you help us relate that with data products?
Stephen Gatchell
Yeah, so look, I think a semantic layer comes into play. There are a few different concepts here. Data mesh, knowledge graphs, and I’ll kind of explain how these fit together.
I think as you go through and you start scanning data and understanding the relationship of data, correlating the data, correlation of data means what relationship one entity has to another. And an entity can be a person or a thing. So think about a person. As you’re looking at where my information is across multiple databases and unstructured data, how do I relate that data across the different data sources so that if I say I have the right to be forgotten, because there’s a regulation in California that says you have the right to be forgotten, a company can go in and actually delete that data. In the background of that correlation is actually a knowledge graph. And it shows how one dataset is related to another dataset that’s relating to another dataset. It actually builds out this visualization.
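The correlation Stephen describes, linking records about the same person across disparate sources, can be sketched as a graph keyed on a shared identifier. The sources and records here are invented for illustration; a real system would resolve entities with far fuzzier matching than an exact email key:

```python
from collections import defaultdict

# Invented records from three disparate sources.
records = [
    ("crm",     {"email": "alice@example.com", "name": "Alice B."}),
    ("billing", {"email": "alice@example.com", "account": "A-1042"}),
    ("docs",    {"email": "alice@example.com", "file": "contract.pdf"}),
    ("crm",     {"email": "bob@example.com",   "name": "Bob C."}),
]

# Correlate: every record sharing an email attaches to the same entity node.
graph = defaultdict(list)
for source, rec in records:
    graph[rec["email"]].append((source, rec))

def records_to_delete(email):
    """A right-to-be-forgotten request resolves to every linked record."""
    return list(graph.get(email, []))

print(len(records_to_delete("alice@example.com")))
```

The graph is what makes the deletion request tractable: without the correlation step, each source would have to be searched independently with no guarantee of completeness.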
Where the semantic layer comes in is how do I define that consistently across disparate data sources? So as I build this knowledge graph, if I’m defining what a customer is, is it consistent across the different data sources? Or do I have B2C customer versus B2B customer? Can I identify emails that are vendor emails versus employee emails versus customer emails? That’s the semantic layer.
Combining how you look at the business metadata to the technical metadata, understanding the knowledge graph and the relationship of the entities in the background, starts building up this semantic layer so that as you go through and you want to build an AI model, building something for the US is going to be very different than building that same model for the Middle East because you have different consumers, different regulations, and you have to understand that semantic layer and ensure you’re separating the data, understanding the regulations and how that data relates together, and then what you actually have to capture for that data product to ensure you can do it legally inside that country or location.
And then companies are starting to think about data meshes, which was starting to become popular. To me, that’s just virtualization. Maybe I’m oversimplifying it. Virtualization has been around since mainframes. But how do you actually utilize that data and relate it without moving the data? And again, that’s where the semantic layer comes in so that if you’re in Salesforce versus inside your financial system versus SAP, you can relate that data all together. That’s where knowledge graph comes in. That’s where the semantic layer comes in.
And that’s where a lot of this automation comes in. Where we used to have these poor data stewards try to do this manually, it’s impossible. The size and scope, it’s just impossible. So like I said, over the last 18 to 24 months with the advent of generative AI, knowledge graph is starting to become much more popular because people are actually trying to build a semantic layer now, which is really cool. And that’s going to help the automation of governance, the automation of using the data in the right context with the right regulatory oversight and the right monitoring of those environments.
Saket Saurabh
Yeah, yeah. And I think when you talk about the right context, I think that becomes an essential part of taking advantage of the language models as well. Can we fetch all the relevant context with governance for answering a specific question or running a specific agentic flow?
Stephen Gatchell
Sure, I love that. Can I just jump in for a second? Because you bring up a brilliant point. As we go through, most people are familiar with large language models at this point. Small language models are starting to become more popular. So as we move into agentic AI and having agents, it becomes even more important to the context of what that agent is going to do.
And what I like about the trend I see going on is we’re starting to move away from just the large language models and we’re getting to small language models. We’re moving away from a general-purpose agent into a very specific agent. And that semantic layer actually helps. So let me build a very specific CRM frontline use case agent that’s just going to solve first line support tickets as an example. I’m going to build then a financial agent that is going to worry about things like accounts receivable and accounts payable and our customers and be very specific on financial. So it could be much more accurate, lower cost, more contextualized.
And then let me build a customer service agent that actually goes to the services part. And those agents can then work together if the right semantic layer is there and the right context is there. And so you now start having what I call a swarm of agents come together to solve an end-to-end business problem.
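The swarm idea can be illustrated with a toy router that hands each request to the most specific agent. The agents below are stub functions standing in for small, narrowly scoped models, and the keyword routes stand in for the semantic layer Stephen describes; none of this reflects any particular product:

```python
# Toy specialized "agents": stand-ins for small, narrowly scoped models.
def support_agent(query):
    return f"[support] triaging ticket: {query}"

def finance_agent(query):
    return f"[finance] checking AR/AP for: {query}"

def service_agent(query):
    return f"[service] scheduling follow-up for: {query}"

# Minimal routing layer: keyword sets stand in for real semantic context.
ROUTES = {
    ("ticket", "error", "bug"): support_agent,
    ("invoice", "payment", "receivable"): finance_agent,
    ("appointment", "follow-up", "service"): service_agent,
}

def route(query):
    q = query.lower()
    for keywords, agent in ROUTES.items():
        if any(k in q for k in keywords):
            return agent(query)
    return "[router] no specialized agent matched"

print(route("Customer reports an invoice payment failure"))
```

An end-to-end flow would chain these: the router output of one agent becomes the input of the next, which is where the shared semantic layer earns its keep.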
Versus when we first started with generative AI, everything was large language models. It cost too much money for anybody to build their own large language models, which has changed now. And we’re starting to really understand the technology, the risk associated, how do we make it more accurate and less costly. A lot of really smart people in the industry are starting to focus on this type of stuff.
Saket Saurabh
That’s a very good point. And I think one of the things that you noted was the swarm of agents. And I think for those who are listening, it’s very important to understand and maybe see that it’s best when you make agents that are very tightly defined and well-constrained within the task they’re supposed to do, and you increase their ability to execute reliably with that.
And if I look at the number one question that is out there in the industry today, is there an AI boom or not? I think the real answer to the question is, can AI effectively solve enterprise problems, which are very high value that people are willing to pay for? And that’s the question we are really working towards. So let me pause there and see if you have any comments on that part.
Stephen Gatchell
I love that. That, of course, is the question. Because what are we doing all this stuff for if it’s not delivering value? So I would say there are some people in the marketplace who are saying there’s going to be an AI bust and all this other stuff. I’m a little more optimistic than that.
And I would say moving things into production is the number one problem. We see tremendous things happening in educational institutions, inside enterprise organizations, inside the government. Really cool stuff in the labs. But the lab does not equal production. And I think that’s where the next focus has to be.
If we’re actually going to do POCs, proofs of value, proofs of concept, whatever you’re going to call these things inside the lab, we have to have the right guardrails in place for those testing environments, so that as we move to production it’s not a complete reconfiguration of everything. Otherwise we have to rebuild the agent because we didn’t put in guardrails that follow specific regulations. You have NYDFS in New York that everybody’s worrying about. There are all these complex things happening.
And I see great stuff happening in the test environment, but the ROI comes when you put it in production. So I think we have to focus on that as the next step. Companies should start very, very small: take a specific use case, take it end to end, and actually implement it in production with measurable OKRs. I like OKRs, objectives and key results, because they actually have objectives and key results, versus KPIs, which are just key performance indicators, and honestly half of the ones I see don’t mean anything. They don’t tell you if you’re going in the right direction.
Have OKRs associated so that as you go through this, you have an expected ROI and then the actual ROI. And it could actually really help with funding too, because as you’re going to try and get more and more funding to pay for those cloud costs, for additional technology, for more resources, you can actually make your department self-funding at some point if you treat it as a P&L.
Get it into production, stop with just the testing, deliver something end to end, measure expected versus actual, and start building that factory. If companies do that, we will not have an AI bust. We will continue to have the AI boom. If we just keep spending more and more money on AI because we think it’s cool and we’re not seeing an ROI return, meaning our company’s profitability is not going up, then there’s a problem. 12 months ago we didn’t have generative AI in our product. 12 months later, we do. Are we selling more? Yes or no? It doesn’t have to be direct attribution to me. It’s just, are you going in the right direction? Are you actually getting back what you’re spending? If the answer is yes, there’ll be no bust.
Saket Saurabh
Yeah, I think the good thing about the pace at which AI adoption and innovation is happening is that we are moving very fast. Look, we had a bust in the dot-com era, and the time it took from people coming up with ideas like, hey, let’s have an online store, to an online store actually working was several years. So clearly the money pouring in couldn’t be sustained through that gap.
But in the case of AI, I do think that the pace at which things are happening is very fast. So there is quite a possibility that the investment curve and the outcome curve catch up to each other in a timely fashion. We’re all optimistic and hope that happens.
But one of the things that I’m seeing at least is that slowly production use cases have started to happen. It’s very important to pick the right area where you put that effort in. Clearly areas like customer support, for example, have much better outcomes than maybe managing procurement. So there are complex use cases, there are simpler use cases.
But curious if you think about, as a company getting into agentic AI and thinking about bringing in real production-grade use cases, what would your guidance be for them? We talked about structured data privacy governance, we talked about unstructured data privacy governance. How do they really get to production in a reliable way? Because all these privacy concerns will become really important when you get there.
Stephen Gatchell
100%, because if you get fined or you have a breach and you’re in the news, somebody’s going to get fired. And they should get fired. So look, I think there’s a bunch of steps, but I’ll just take you from left to right as I’m thinking about this.
First, you’ve got to have a governance committee, period, end of story. It’s got to be cross-functional with the right decision-makers or representatives of decision-makers in the room. Because you need to do assessments of what your risk profile is versus what your expected ROI is versus your expense, and you have to have a payback on that. So it all starts with assessments.
Then you figure out, okay, what is that business process? This is where the lab and production diverge. Don’t just approve the assessment and go off and play around. No, design the business process to get to production as the next step. What applications need access? What’s the data of those applications? Who are the individuals or the agents? Because now we have to worry about not just humans, but individual agents. How do we handle human in the loop? Are we going to have human over the loop, in the loop, outside the loop? Human in the loop just means, for the audience, that you have human oversight of the process somehow, some way.
You have to build that production business process. Then you execute: what data are you utilizing, what regulations do you have to follow, and what guardrails do you realize you need? A lot of people will use frameworks like NIST or DCAM or CDMC to understand whether they’re in compliance with regulations. Then you understand all the data you’re going to utilize to execute that use case. You evaluate the risk through classification, labeling, and tagging. Then you actually remediate those risks before you train the model. And then you monitor that on a continual basis.
You go from assessment all the way through monitoring. If you do that, then you can start putting things into production with the right confidence level that your risk tolerance is being met.
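The classify-remediate-train step in that flow can be reduced to a simple gate. This is a minimal sketch, not any vendor's actual policy engine; the tag names and the idea of a flat "remediated" set are assumptions for illustration.

```python
# Toy gate reflecting the classify -> remediate -> train flow described above.
# Tag names are illustrative; a real catalog would carry far richer labels.
BLOCKING_TAGS = {"pii", "phi", "credentials"}

def clear_for_training(dataset_tags: set[str], remediated: set[str]) -> bool:
    """A dataset may feed model training only once every blocking tag
    found during classification has been remediated."""
    outstanding = (dataset_tags & BLOCKING_TAGS) - remediated
    return not outstanding
```

The design point is that the gate runs before training, not after: classification and tagging produce the blocking set, remediation clears it, and monitoring would re-run the same check continuously.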
Saket Saurabh
Yeah, I think that’s a good blueprint to take forward and get started. It’s very important to demonstrate value coming out of this.
And one of the things you touched on earlier was about how prompts and so on sort of expose new attack surfaces as well. I’m curious if you can share a little bit about that, especially as people are thinking about going to production use cases.
Stephen Gatchell
Yeah, so look, I think with the adoption of generative AI, one, you have shadow AI all over the place. If you’re talking about just generative AI, you’ll have people that are using their credit card to connect with ChatGPT or DeepSeek or whatever the tool of choice is. And if they don’t have the data literacy or the AI literacy, they may be uploading sensitive information. If you don’t have the right controls in place, you’re releasing sensitive information into a public marketplace. That’s problem number one.
If you are using auto code generation, a lot of people use GitHub these days. If you’re not protecting that code generation and you have clear text passwords or you have open API keys or token keys inside of that code repository, you may be releasing that information inside of your code that goes into a product released to customers. So you have multiple attack surfaces that you didn’t have before, and you’ve got to secure those.
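The repository-scanning concern above can be sketched with a few regular expressions. This is a toy illustration, not a real secret scanner; the patterns are assumptions and production tools also use entropy analysis, provider-specific formats, and git-history scanning.

```python
import re

# Illustrative patterns only; real scanners cover far more secret formats.
SECRET_PATTERNS = {
    "clear-text password": re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
    "AWS-style access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic API token": re.compile(r"api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]", re.IGNORECASE),
}

def scan_source(text: str) -> list[tuple[int, str]]:
    """Return (line_number, finding) pairs for lines that look like
    hard-coded secrets, so they can be blocked before a release ships."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

Running a check like this in CI, before code leaves the repository, is the kind of guardrail that closes the attack surface being described.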
Customers will say, okay, I’m not using a public GPT, I’m using a private GPT. That’s great. There’s this little thing called insider risk. Same problem. You’ll have insider risk of individuals who, if they don’t know what they’re doing, or maybe intentionally, they’re going to take information that should not be released and put it inside of the model.
Think about if you have a breach. Even if you build your own internal generative AI application with a RAG model, retrieval-augmented generation, that’s great, but you have to secure your knowledge bases and your vector databases. And most companies seem to be storing their prompts and responses in some kind of environment, because they want to evaluate them after the fact, to retrain the model or retrain individuals who might not be using the prompts correctly. So you’ve got three attack surfaces now: the knowledge base, the vector database, and the prompt logs. If you have a breach and you don’t protect that data all the way through the RAG pipeline, your attack surface is increasing.
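One mitigation for the prompt-log surface is to redact sensitive values before anything is persisted. The sketch below is a minimal illustration under stated assumptions: the two PII patterns are examples only, and a real pipeline would use a proper classification service rather than two regexes.

```python
import re

# Minimal sketch: redact obvious PII before persisting prompts/responses
# for later evaluation. Patterns are illustrative, not exhaustive.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    """Replace email addresses and SSN-shaped strings with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)
    return text

def log_interaction(store: list, prompt: str, response: str) -> None:
    """Persist only the redacted form; raw text never reaches the log store."""
    store.append({"prompt": redact(prompt), "response": redact(response)})
```

The design choice is that redaction happens at write time, so even a breach of the evaluation store exposes placeholders rather than the sensitive values themselves.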
And I’m getting tired of reading about breaches. It’s literally every day that there’s a new breach somewhere. So I think part of that is education. We have done a better job in the industry of forcing people through security training; I just got my email today that I’ve got to go through my yearly security training. But we’re not doing that for AI. We’re not doing that for generative AI. We don’t do that for data governance. We’ve got to start. Every year, release the security training, great. But I want a second training: what are the new implications of AI? How do you use it safely? What are the new regulations, and what are their implications? It should be part of our training every year.
Saket Saurabh
Oh yeah, absolutely. Those trainings are still talking about email phishing and those things, and not yet getting into all the risks that can come with AI. And also being able to trust the outcomes from AI, so you can feel confident about whatever is coming back from a language model.
Stephen Gatchell
Great point. And just make sure that they understand that the outcomes of these things can actually be incorrect. Think about making a major business decision on incorrect information. I can’t believe we haven’t even talked about that yet.
Saket Saurabh
I know, when we started the conversation, Stephen, you said we could go four days talking about it, and I can’t believe how fast 45 minutes have gone by. But it’s been amazing chatting with you. There’s so much richness of insight that you have.
Maybe share with us how can people follow you, and where are you sharing these insights? And we should get together again for yet another conversation. So many topics to cover.
Stephen Gatchell
Hey, look, let’s do it. Let’s do it. Look, obviously I really enjoy talking about this stuff. Sometimes I might talk a little too much. You’re a great interviewer, by the way. I really enjoyed our conversation today.
I’m on LinkedIn. Stephen Gatchell on LinkedIn. Please feel free to reach out and connect. I post on data governance, AI strategies, and responsible AI. bigid.com has some of the white papers; I just wrote one on classification and labeling, so you’ll see that on bigid.com. Feel free to reach out, and I look forward to talking to you again. It’s been great.
Saket Saurabh
Thank you, Stephen. It’s been a pleasure talking to you. Very insightful, lots of learning. I’m taking notes myself, and I’m sure our audience is learning a lot in the process. Appreciate you joining us today. Thank you.
Stephen Gatchell
Thank you, take care.