Data governance is a popular topic, even if not every company adheres to the field's widely accepted principles. Where things get more complicated these days is AI governance, a topic on the minds of executives and boards of directors who want to embrace generative AI but don't want their company making headlines for AI misbehavior.
AI governance is still in its infancy. Despite all the advances in AI technology and investment in AI programs, there are no hard and fast rules and regulations. The European Union is leading the way in enacting AI legislation, and President Joe Biden has issued a set of rules for U.S. companies to follow via executive order. But there are significant gaps in knowledge and best practices regarding AI governance, and the playbook is still largely being written.
One technology provider looking to advance AI governance is Immuta. Founded by Matt Carroll, who previously advised the U.S. intelligence community on data and analytics issues, the College Park, Maryland-based company has spent years on data management as the key to keeping machine learning and AI models from going off track.
But as the GenAI engine ramped up in 2023, Immuta customers began asking the company for more control over how their data is consumed by large language models (LLMs) and other components of GenAI applications.
Customer concerns about GenAI were highlighted in Immuta's fourth annual State of Data Security Report. As Datanami reported in November, 88% of the survey's 700 respondents said their organization uses AI, but 50% said their organization's data security strategy is not keeping up with AI's rapid evolution. "More than half (56%) of data professionals say their biggest concern about AI is the exposure of sensitive data through AI prompts," Ali Azhar reported.
Joe Regensburger, vice president of research at Immuta, said the company is working to address the emerging data and AI governance needs of its customers. In a conversation with Datanami this month, he shared some of the research areas his team is investigating.
One of the AI governance challenges Regensburger is studying revolves around ensuring the truthfulness of content generated by GenAI.
"Right now, it's kind of an open question," he says. "We will be held accountable for how we use AI as a decision support tool. We see that in some regulations, such as the AI Act and President Biden's proposed AI Bill of Rights. Accountability for outcomes becomes very important, and it moves into the realm of governance."
LLMs have a tendency to make things up, which creates risks for those who use them. For example, Regensburger recently asked an LLM to write a summary of a topic he had studied in graduate school.
"My background is in high-energy physics," he says. "The generated text seemed perfectly reasonable, and it produced a series of citations. So I decided to look up the citations. It had been a while since graduate school. Maybe, I thought, something has happened in the field since then?"
"And the citations were completely fictitious," he continues. "Totally. They looked perfectly reasonable. They had the Physical Review Letters format. They had all the right formatting, and on a first casual inspection they seemed plausible... they looked like something you'd see on arXiv. And when I searched for the citations, they didn't exist. So that was a wake-up call for me."
Getting inside LLMs and figuring out why they make things up is likely beyond the capabilities of any one company and will require a coordinated effort from the entire industry, Regensburger said. "We're trying to understand the impact of all of this," he says. "But we are very much a data company. So as things move away from data, it becomes something that we need to grow into or partner on."
Most of Immuta's data governance technology focuses on discovering sensitive data residing in databases and enacting policies and procedures to ensure that the data, consumed primarily by advanced analytics and business intelligence (BI) tools, is properly protected. Governance policies can be complex: a piece of data in a SQL table may be permitted in one type of query but not when combined with other data.
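To make that combination risk concrete, here is a minimal sketch of a combination-aware policy check. The column names and rules are hypothetical illustrations of the general idea, not Immuta's actual policy engine or API.

```python
# Columns that are harmless alone but sensitive in combination
# (a classic quasi-identifier risk). These rules are illustrative.
RESTRICTED_COMBINATIONS = [
    {"zip_code", "birth_date", "gender"},  # re-identification risk
    {"phone_number", "ssn"},               # direct identifier pairing
]

def query_allowed(requested_columns: set[str]) -> bool:
    """Return False if the query selects a restricted combination."""
    return not any(combo <= requested_columns
                   for combo in RESTRICTED_COMBINATIONS)

print(query_allowed({"zip_code", "purchase_total"}))        # True
print(query_allowed({"zip_code", "birth_date", "gender"}))  # False
```

A real policy engine would evaluate such rules at query time, rewriting or denying SQL rather than checking column sets after the fact; the subset test above just shows the shape of the logic.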
To provide the same level of governance for the data used by GenAI, Immuta would need to implement controls on the repositories used to store that data. Most of those repositories are not structured databases but unstructured sources, such as call logs, chats, PDFs, Slack messages, emails, and other forms of communication.
Identifying sensitive data is challenging enough in structured data sources, but the task is even more difficult with unstructured sources, because the context of the information differs from source to source, Regensburger says.
"So much depends on context," he says. "A phone number isn't sensitive unless it's associated with an individual. So in structured data, you can establish as a principle that this phone number sits alongside a social security number and someone's address, and the sensitivity of the whole table is different. In unstructured data, on the other hand, it might just be an 800 number. It could simply be a company's corporate account. So these things are much more difficult."
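Regensburger's phone-number example can be sketched as a context-dependent classification rule over a table's columns. The field categories and labels below are assumptions for illustration, not Immuta's classifier.

```python
# Fields that tie a record to an identifiable person (illustrative set).
IDENTITY_FIELDS = {"ssn", "full_name", "home_address", "email"}

def classify_table(columns: list[str]) -> str:
    """Label a table's sensitivity from the context its columns provide."""
    cols = set(columns)
    if cols & IDENTITY_FIELDS:
        # Identity context present: co-located values like phone numbers
        # become personally linkable, so the whole table is sensitive.
        return "sensitive"
    # Without identity context, a phone_number column could simply be
    # an 800 support line or a corporate account; treat it as lower risk.
    return "low-risk"

print(classify_table(["phone_number", "office_location"]))      # low-risk
print(classify_table(["phone_number", "ssn", "home_address"]))  # sensitive
```

The hard part he describes is that unstructured text offers no column headers at all, so this kind of contextual signal has to be inferred from surrounding words rather than schema.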
One place where companies can potentially gain a control point is the vector databases used for prompt engineering. A vector database stores embeddings generated ahead of time by an embedding model. At runtime, a GenAI application can combine indexed, embedded data retrieved from the vector database with the user's prompt, improving the accuracy and context of the results.
“If you’re training an off-the-shelf model, you’re using unstructured data, but if you’re doing it on the prompt engineering side, it’s typically coming from a vector database,” Regensburger says. “There are a lot of possibilities and a lot of interest in how some of these same governance principles can be applied to vector databases as well.”
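As a rough illustration of what governance at the vector-database layer could look like, the sketch below filters retrieved chunks by a governance label before they reach the prompt. The embedding function, store layout, and access rule are all stand-ins for illustration, not any specific product's API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedder; a real app would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.random(8)
    return v / np.linalg.norm(v)

# Each stored chunk carries governance metadata alongside its vector.
STORE = [
    {"text": "Q3 revenue summary", "vec": embed("Q3 revenue summary"),
     "classification": "public"},
    {"text": "Customer SSN export", "vec": embed("Customer SSN export"),
     "classification": "restricted"},
]

def retrieve(query: str, user_clearance: str, k: int = 1) -> list[str]:
    """Return the top-k chunks the user is actually allowed to see."""
    allowed = [c for c in STORE
               if c["classification"] == "public"
               or user_clearance == "restricted"]
    ranked = sorted(allowed, key=lambda c: -float(c["vec"] @ embed(query)))
    return [c["text"] for c in ranked[:k]]

context = retrieve("quarterly revenue", user_clearance="public")
prompt = f"Context: {context}\n\nQuestion: summarize quarterly revenue."
print(prompt)
```

The design point is that the filter runs before similarity ranking, so restricted material never enters the prompt for an unauthorized user, rather than being redacted after generation.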
Regensburger stressed that Immuta has no plans to develop this capability at this time, but said it is an active area of research. "We're looking at how to apply some of the security principles to unstructured data," he says.
As companies develop their GenAI plans and build GenAI products, the potential data security risks are becoming more apparent. Keeping personal data private is high on many lists. Unfortunately, data governance is much easier said than done, especially when dealing with the intersection of sensitive data and probabilistic models that sometimes behave in unexplainable ways.
Related Items:
AI regulations are a moving target in the US, but Europe is also watching
Immuta report shows companies are struggling to keep up with rapid advances in AI
Keeping models on the straight and narrow