Data quality has always been critical to analytics.
If the data used to inform decisions isn’t good, neither will be the actions taken based on those decisions. But data quality has assumed even greater importance over the past year as generative AI becomes part of the decision-making process.
Large language models (LLMs) such as ChatGPT and Google Bard frequently answer questions with incorrect, misleading or even offensive responses known as AI hallucinations. To reduce the frequency of hallucinations related to queries about their business, enterprises have done one of two things.
They’ve either attempted to develop language models trained exclusively with their own data or imported public LLMs into a secure environment where they can add proprietary data to retrain those LLMs.
In both instances, data quality is crucial.
In response to the rising emphasis on data quality, data observability specialist Monte Carlo on Jan. 23 hosted Data Quality Day, a virtual event featuring panel discussions on how organizations can best guarantee that the data used to inform data products and AI models can be trusted.
“I’ve seen data quality come up as a problem over and over again,” said Chad Sanderson, co-founder and CEO of data community vendor Gable, during the streamed event.
Similarly, Barr Moses, co-founder and CEO of Monte Carlo, said data quality remains a problem for many organizations. She noted that one of the reasons she helped start Monte Carlo was that in her previous roles at customer success management platform vendor Gainsight — as senior director of business operations and technical success, and vice president of customer success operations — she and her teams often had to deal with bad data.
“As the leader of a data team, I had various challenges,” Moses said. “But maybe the main challenge was that the data was often wrong. We had one job, which was to get the data right, but it was inaccurate a lot of the time.”
Ultimately, improving data quality and ensuring that data can be trusted to inform analytics and AI applications comes down to a combination of technology and organizational processes, according to the panelists.
Importance of data quality
One of the promises of generative AI for businesses is to make data exploration and analysis available to more than just a small group of data experts within organizations.
For decades, analytics use within the enterprise has been stuck around a quarter of all employees. The main reason for the lack of expansion is that analytics platforms are complex. In particular, code is required to carry out most queries and analyses.
In recent years, many vendors have developed natural language processing (NLP) and low-code/no-code tools in an attempt to reduce the complexity of their platforms, but those tools have largely failed to markedly expand analytics use.
The NLP tools had small vocabularies and thus required users to know and use highly specific phrasing, while low-code/no-code tools enabled only cursory data exploration and analysis. In-depth analysis still required significant training on the platforms as well as data literacy.
LLMs, however, have the vocabularies of the most expansive dictionaries and are trained to understand intent. Therefore, they substantially reduce some of the barriers that have held back expanded use of analytics. But to enable business users to ask questions of their data with LLMs, the LLMs need to be trained with an enterprise’s proprietary data.
Ask an LLM such as ChatGPT to write a song, and it can do that. Ask it to summarize a book, and it can do that. Ask it to generate code to build a data pipeline, and it can even do that.
But ask it what sales figures were in Nebraska during the past five winters, and it will either be unable to come up with a response or make up an answer. Ask it to then write a report based on those sales figures, and it will fail at that too.
Because it doesn’t have the company’s proprietary data.
Organizations need to train LLMs with their own data to answer questions relevant to their business. And that data needs to be good data or else the LLMs will either be unresponsive or deliver incorrect responses that might seem plausible enough to fool a user and lead to a bad decision.
“The key to making generative AI great is by introducing proprietary enterprise data into generative AI pipelines,” Moses said. “If that is not accurate or not reliable, then all of [an organization’s] generative AI efforts will be moot.”
Those generative AI pipelines, meanwhile, need huge amounts of data pulled into models from databases, data warehouses, data lakes and data lakehouses. They need to be able to wrangle relevant structured and unstructured data and combine those disparate data types to give organizations as complete a view of their operations as possible. And they need to be able to do so quickly so that decisions can be made in real time when needed.
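To make that concrete, the simplified Python sketch below grounds an LLM answer to the earlier Nebraska question in proprietary sales figures pulled from a relational store. It is only an illustration of the general pattern: SQLite stands in for a warehouse or lakehouse, and call_llm is a placeholder for whatever model endpoint an enterprise actually uses.

```python
import sqlite3

def fetch_winter_sales(conn: sqlite3.Connection, state: str) -> list[tuple]:
    # Pull the proprietary figures a public model has never seen.
    return conn.execute(
        "SELECT season_year, SUM(amount) FROM sales "
        "WHERE state = ? AND season = 'winter' "
        "GROUP BY season_year ORDER BY season_year DESC LIMIT 5",
        (state,),
    ).fetchall()

def call_llm(prompt: str) -> str:
    # Placeholder: swap in the enterprise's actual model endpoint.
    return f"[model response to {len(prompt)} characters of grounded prompt]"

def answer_with_context(conn: sqlite3.Connection, question: str, state: str) -> str:
    rows = fetch_winter_sales(conn, state)
    context = "\n".join(f"{year}: ${total:,.0f}" for year, total in rows)
    prompt = (
        "Answer using only the figures below. If they are insufficient, say so.\n"
        f"Winter sales for {state}:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (state TEXT, season TEXT, season_year INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [("Nebraska", "winter", 2023, 1_200_000.0), ("Nebraska", "winter", 2022, 950_000.0)],
    )
    print(answer_with_context(conn, "How did winter sales trend in Nebraska?", "Nebraska"))
```

If the figures landing in that sales table are wrong, the grounded answer will be wrong too, which is why the quality checks described below sit upstream of the model.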
Monitoring potentially billions of data points for quality is far more than even a team of humans can manage. Technology, therefore, is now an integral part of ensuring data quality.
Technology
One means of addressing data quality is with technology that automatically monitors data for accuracy and alerts data stewards about anomalies.
In particular, data observability platforms are designed to give data engineers and other stewards a view of their data as it moves from ingestion through the data pipeline to the point when it can be operationalized to inform decisions.
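The platforms themselves are far more sophisticated, but the toy Python check below conveys the basic idea of watching a table as it lands in the pipeline: it compares volume, null rates and freshness against simple expectations and raises alerts for a data steward to triage. The thresholds and the sample batch are illustrative assumptions, not anything a particular vendor prescribes.

```python
from datetime import datetime, timedelta, timezone

def check_batch(rows: list[dict], expected_min_rows: int, max_null_rate: float,
                max_staleness: timedelta) -> list[str]:
    """Return human-readable alerts for one ingested batch."""
    alerts = []
    if len(rows) < expected_min_rows:
        alerts.append(f"volume: got {len(rows)} rows, expected at least {expected_min_rows}")
    null_amounts = sum(1 for r in rows if r.get("amount") is None)
    if rows and null_amounts / len(rows) > max_null_rate:
        alerts.append(f"nulls: {null_amounts}/{len(rows)} rows missing 'amount'")
    newest = max((r["loaded_at"] for r in rows), default=None)
    if newest is None or datetime.now(timezone.utc) - newest > max_staleness:
        alerts.append(f"freshness: newest record loaded at {newest}")
    return alerts

# Illustrative batch; in practice these rows would come from the warehouse or lake.
sales_batch = [
    {"amount": 125.0, "loaded_at": datetime.now(timezone.utc) - timedelta(hours=1)},
    {"amount": None, "loaded_at": datetime.now(timezone.utc) - timedelta(hours=2)},
]
for alert in check_batch(sales_batch, expected_min_rows=100, max_null_rate=0.05,
                         max_staleness=timedelta(hours=6)):
    print("ALERT:", alert)
```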
Data observability was a simple process when all data was kept on premises and stored in localized databases. The cloud, however, changed that. Most organizations now store at least some of their data in cloud-based warehouses, lakes and lakehouses. Most also store that cloud-based data in more than one repository, often managed by more than one vendor.
That makes data complicated to track and monitor. So does the explosion in data types, with text, video, audio, IoT sensors and other sources all producing data that can now be captured and stored.
In response to the growing complexity of data, observability specialists including Monte Carlo, Acceldata and Datadog emerged with platforms that automatically test and monitor data throughout its lifetime.
Moses noted that humans can address known problems and develop tests to solve those specific problems. But they can’t address what they aren’t aware of.
“That approach has poor coverage,” she said.
Data quality monitoring tools, meanwhile, have the opposite problem, Moses continued. They automate data monitoring so that all of an organization’s data can be overseen, but they lack intuition and tend to flood data stewards with push notifications.
Data observability blends testing and automated monitoring to allow for full coverage without overburdening data teams with an alert every time the platform senses the slightest anomaly.
“Data observability uses machine learning to solve for being too specific and too broad,” Moses said. “It also introduces context so users can learn things like lineage, root cause analysis and impact analysis — things that make data engineers more effective in their thinking about data quality.”
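One simplified way to picture that middle ground is a monitor that learns what normal looks like from recent history rather than relying on a hand-written rule for every known failure mode. The Python sketch below does this for daily row counts with a basic mean-and-standard-deviation band; production observability platforms use far more sophisticated models, so treat this only as a sketch of the idea.

```python
from statistics import mean, stdev

def is_anomalous(history: list[int], today: int, sensitivity: float = 3.0) -> bool:
    """Flag today's row count if it falls outside the band learned from history."""
    if len(history) < 2:
        return False  # not enough history to learn a baseline yet
    mu, sigma = mean(history), stdev(history)
    band = sensitivity * max(sigma, 1.0)  # avoid a zero-width band on flat history
    return abs(today - mu) > band

daily_row_counts = [10_120, 9_980, 10_340, 10_055, 10_210, 9_890, 10_150]
print(is_anomalous(daily_row_counts, today=10_060))  # typical day -> False
print(is_anomalous(daily_row_counts, today=1_200))   # pipeline likely broke -> True
```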
Another technology that has the potential to improve data quality is generative AI itself, according to Sanderson.
He noted that generative AI is good at inferring context from code and understanding the intent of software, which enables it to discover problems that might otherwise be overlooked, much as data observability does.
“I’ve been an infrastructure person my entire life, so I was very skeptical of generative AI as a weird AI thing that was going to come and go,” Sanderson said. “I think it’s really going to play a big role in data quality and governance over the next five to 10 years.”
One more technology that could play an important role in improving data quality is testing tools from vendors such as DBT Labs, according to Dana Neufeld, data product manager at Fundbox, a small-business loan platform vendor. She noted that such tools enable engineers to run tests on their data during the development process so that data quality issues can be addressed before pipelines and applications are put into production.
“It’s testing within DBT by developers before they release their code,” Neufeld said. “It is way easier to spot data quality issues within the code before it gets released.”
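In dbt, such tests are typically declared in YAML or written as SQL assertions against the development schema. For consistency with the other examples here, the sketch below expresses the same pre-release idea as pytest-style checks a developer could run before merging a pipeline change; the table and column names are hypothetical.

```python
# Pre-release data tests in the spirit of dbt's built-in not_null and unique checks.
import pytest

@pytest.fixture
def orders() -> list[dict]:
    # In a real project this would query the development schema the
    # pipeline just built, not return a hardcoded sample.
    return [
        {"order_id": 1, "customer_id": 42, "amount": 19.99},
        {"order_id": 2, "customer_id": 43, "amount": 5.00},
    ]

def test_order_id_is_unique(orders):
    ids = [row["order_id"] for row in orders]
    assert len(ids) == len(set(ids)), "duplicate order_id values found"

def test_amount_is_never_null_or_negative(orders):
    assert all(row["amount"] is not None and row["amount"] >= 0 for row in orders)
```

Run with pytest before release, failures surface data quality problems while they are still cheap to fix, which is the point Neufeld made about catching issues in development.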
People and processes
Beyond technology, organizational processes developed by people and carried out by people need to be part of addressing data quality, according to Moses.
Buy-in from top executives is needed for any organizational undertaking to have a chance at success. But assuming C-suite executives understand the importance of data quality, a hierarchical structure that lays out who is responsible for data — and accountable for ensuring data quality — below the top level is also important, Moses said.
Some enterprises still have centralized data teams that oversee every aspect of data operations and parse data out only upon request.
Others, however, have adopted decentralized data management strategies such as data mesh and data fabric, and have empowered business users with self-service analytics platforms. Such strategies and tools make their organizations more flexible and able to act and react faster than those with centralized data management. But they also allow more people to work with and potentially alter data.
Such organizations need definitive data policies and hierarchies to decrease the risk of lowering data quality.
“Five or 10 years ago, there were maybe one or two people responsible for data,” Moses said. “There was a long lag time to make sure data was accurate, and it was used by a small number of people. The world we live in today is vastly different from that.”
Now, a lot more people are involved, she continued. Even within data teams, there are data engineers, data analysts, machine learning engineers, data stewards, governance experts and other specialists.
“It’s important to ask who owns data quality, because when you don’t have a single owner, it’s really hard to determine accountability,” Moses said. “When the data is wrong, everyone starts finger-pointing. That’s not great for culture, which is toxic, and it also doesn’t lead to a solution. It’s actually valuable to identify an owner.”
In a sense, that data ownership is part of a communication process, which is another key element of managing data quality, according to Sanderson.
He noted that data is essentially a supply chain that includes producers, consumers, distributors and brokers, with people at each stage of the pipeline playing different roles. Communication between those people is crucial so that everyone understands how data needs to be treated to maintain its quality.
“Communication is more of a necessity of the system functioning than a nice-to-have,” Sanderson said. “There are a lot of processes that teams can start following to create better communication.”
One is knowing the steps of the supply chain so that data lineage is understood. Another is recognizing which data is most important.
Sanderson said organizations typically use only about 25% of their data, and perhaps just 5% of an organization's data is the most informative and valuable. Communicating what that 5% is, and the importance of maintaining its quality, is therefore significant.
“My recommendation has been identifying tier 1 data … where if there is some quality issue, it has a tangible financial impact,” Sanderson said. “If they can solve those issues, then it’s easy to get leadership and upstream teams taking accountability.”
Ultimately, with so much data and so many people handling data, there’s no single way to fully guarantee data quality, according to Moses.
However, applying appropriate technologies with the right organizational processes can greatly improve an enterprise’s chances of maintaining data quality and delivering trustworthy data products.
“There’s no magic wand solution,” Moses said. “I don’t think I’ve seen an organization, large or small, that you can say, ‘This is done perfect.’ But there are things that are important.”
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.