(SomYuZu/Shutterstock)
What if you could tell a computer how you wanted to explore a data set, and it would automatically perform the analysis and hand you the results? That is the idea behind DataChat, a generative AI-based data exploration and analysis tool that began as a research project at the University of Wisconsin-Madison and is now a commercial product.
Jignesh Patel, a computer science professor at Carnegie Mellon University and co-founder of DataChat, recently sat down for a virtual interview with Datanami to chat about the nature of data exploration in the age of generative AI and the new DataChat offering, which was officially launched at the Gartner Data & Analytics Summit earlier this month.
The inspiration for DataChat dates back to 2016, when Patel was a computer science professor at the University of Wisconsin-Madison and CTO of Pivotal (now part of VMware Tanzu, under parent company Broadcom). The big data explosion was in full swing, Hadoop had become the rallying point for new distributed frameworks, and data scientists were in high demand.
But while the technology was evolving rapidly, too many companies were stretched thin when it came to data analysis and exploration, and Patel felt something was missing from the equation.
“The first goal of every CTO was to hire an army of data scientists. We just couldn’t get enough data scientists,” Patel said. “What I started observing very early on was the way data scientists worked. It’s all ad hoc analysis. In contrast to the BI world, there’s no set script; you’re trying to get something out of the data along a non-linear path.”
Data scientists are always in short supply (pathdoc/Shutterstock)
Much of this data exploration work was done manually using tools such as Jupyter data science notebooks. Data scientists would explore a given data set until something interesting emerged, then extract that particular data, transform it into a more useful format, and pipe it into a machine learning algorithm for use in an application.
Patel realized that this pattern lent itself to some automation, which would also make the work approachable for non-experts.
“Literally, the way they were doing this was breaking the problem down step by step, finding code somewhere on the web, and stitching it together under the hood. That’s how it was built,” he said. “So we wrote a paper in 2017: what if a user could fill a data science notebook cell simply by expressing what they wanted in natural language?”
Of course, this was before ChatGPT, and the state of the art in natural language processing (NLP) was far from what it is today. NLP technology would improve, but in the meantime Patel and University of Wisconsin doctoral student Rogers Jeffrey Leo John did the hard work of building a compact control language that could sit between the user and the underlying SQL and Python code, which queries the data and calls the machine learning algorithms.
“That intermediate [language] was great, because we could now take any language, translate it into that intermediate language, and translate that into SQL and Python,” Patel said. “Because that’s what you need if you want to talk to a SQL database and do ETL, and if you want to build machine learning models, you really need SQL and Python, the two main languages of data science.”
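DataChat has not published the internals of this intermediate language, but the general shape Patel describes can be sketched in a few lines of Python: a constrained, structured command is produced from the user’s request, and a deterministic compiler turns that command into SQL. The step type, field names, and compiler below are hypothetical, intended only to illustrate the idea.

```python
from dataclasses import dataclass

# Hypothetical intermediate command: a constrained step that a parser or LLM
# can target instead of emitting raw SQL directly.
@dataclass
class AggregateStep:
    table: str
    group_by: str
    measure: str
    agg: str  # e.g. "avg", "sum", "count"

def compile_to_sql(step: AggregateStep) -> str:
    """Deterministically compile the intermediate step into SQL."""
    return (
        f"SELECT {step.group_by}, {step.agg.upper()}({step.measure}) AS value "
        f"FROM {step.table} GROUP BY {step.group_by}"
    )

# "Average monthly charges by churn flag" -> structured step -> SQL
step = AggregateStep(table="customers", group_by="churned",
                     measure="monthly_charges", agg="avg")
print(compile_to_sql(step))
# SELECT churned, AVG(monthly_charges) AS value FROM customers GROUP BY churned
```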
Natural language for data science
DataChat’s goal was to create a data analysis and exploration tool that can be driven with plain English instructions, reducing the need for users to know SQL or Python in order to be productive with their data. A user enters a simple command, such as “Create a customer churn visualization,” and the product automatically creates a visualization based on the data.
Jignesh Patel is a co-founder of DataChat and a professor of computer science at Carnegie Mellon University.
Patel said the aim is for DataChat to feel naturally interactive. Users sit in a spreadsheet-like interface and ask questions of their data. Not every question submitted to DataChat will get a reliable answer on the first try, but the give and take lets the product and the user move forward in a predictable way.
“You ask something and you get an answer back,” Patel said. “And when something comes back, it explains the steps as well. There’s a give and take: you ask for something, it didn’t quite understand what you meant, so you ask the question a little differently, but you’re making progress every step of the way.”
DataChat’s target audience is business users, data analysts, and data scientists. For business users and data analysts, the goal is to let them work in the realm of data science without much training. Data scientists, meanwhile, often use DataChat just to understand the contents of a new data set.
“They might just be playing around with DataChat and saying, ‘How many NULL values are there in three important columns?’” Patel says. “Instead of writing a SQL query, you just point, click, or ask a question and get an answer, which is much faster.”
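For a sense of the hand-written work such a question replaces, the equivalent check in pandas might look something like the sketch below. The file and column names are hypothetical; in DataChat the user would simply ask the question.

```python
import pandas as pd

# Hypothetical data set and columns; in DataChat the user just asks
# "How many NULL values are there in these three columns?"
df = pd.read_csv("customers.csv")
null_counts = df[["tenure", "monthly_charges", "churned"]].isna().sum()
print(null_counts)  # NULL count per column
```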
A DataChat workflow can generate three kinds of artifacts (reports, graphs, and machine learning models covering regression, classification, and time series) from nearly any data source, from an Excel workbook to a Databricks or Snowflake data warehouse. Each workflow comes with an explanation of how and why the answer was generated, which Patel said is a key feature of the product.
For a model around churn, DataChat doesn’t produce “crazy technical answers,” he said. “But it will say, okay, there are three things: the age of the person, the type of policy, whether they have insurance. And this one accounts for 60% of the influence, this one 20%, and this one 10%, and here is what doesn’t have an impact, based on the data.”
Patel said this level of transparency is extremely important in data science. “We’ve been thinking about data science solutions since day one. Science requires transparency, so that’s built into the philosophy of the product,” he said.
The changing foundations of NLP
DataChat was first registered as a company in 2017 and raised a $4 million seed round in 2020 (it has since raised another $25 million). Back in 2017, Patel and John were grinding away with the NLP technology of the day, which was neither as powerful nor as easy to use as today’s large language models (LLMs).
DataChat interface allows users to explore data using natural language (Image courtesy of DataChat)
They built language parsers, dug into semantic understanding, and “all of that was crazy,” Patel said. “But as part of that, we built the rest of the bottom of the stack,” he continued. “All the important layers were ready. They were scalable and cost-optimized, especially for cloud databases.”
A few years later, when the LLM revolution exploded onto the scene, Patel and John quickly recognized the advantages of the new approach and abandoned the top of their stack, which had been built on now-obsolete NLP techniques. They replaced it with OpenAI’s Codex. When OpenAI retired Codex about a year later, they pivoted again, this time making the LLM component interchangeable within the stack.
“Obviously, that was hell for us,” Patel said. “But as part of that, we built the LLM piece so that the next time something like that happens, we can plug and play the LLM and make it as painless as possible.”
Today the company primarily relies on OpenAI’s GPT-4, generally considered to be the most powerful and well-read LLM on the market. DataChat uses GPT-4 to generate the DataChat intermediate language. GPT-4 is told in general terms what kind of data users want to analyze, but a customer’s actual data is never touched by GPT-4, Patel said.
“We build a summary of what the structure of the schema is, so we say, ‘Here are the elements,’” Patel says. “We don’t have to give it [GPT-4] actual data values.”
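The schema-only approach Patel describes can be sketched roughly as follows. The table, columns, and summary format are hypothetical; the point is that only column names and types are ever sent to the model, never row values.

```python
import pandas as pd

def schema_summary(df: pd.DataFrame, table_name: str) -> str:
    """Describe a table by column names and dtypes only, with no row values,
    so the description can be passed to an LLM without exposing data."""
    cols = ", ".join(f"{name} ({dtype})" for name, dtype in df.dtypes.items())
    return f"Table '{table_name}' has columns: {cols}."

# Hypothetical customer table; only its structure leaves the database.
df = pd.DataFrame({"customer_id": [1, 2], "tenure": [12, 3], "churned": [0, 1]})
print(schema_summary(df, "customers"))
# Table 'customers' has columns: customer_id (int64), tenure (int64), churned (int64).
```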
DataChat uses the LLM only as a “guide,” Patel said, because LLMs are non-deterministic machines and cannot be completely trusted. “They hallucinate and do things wrong,” he said. “So they just provide us with information. We translate that query into an intermediate language… and what we produce is completely deterministic.”
Users can take a workflow generated by DataChat from one piece of data and run it on another, and it will run exactly the same, he said. “So there is no ambiguity.”
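That determinism, with the same saved workflow behaving identically on new data because no LLM sits in the execution path, can be illustrated with a toy replay loop. The step format here is again hypothetical.

```python
import pandas as pd

# A saved workflow: an ordered list of structured steps (hypothetical format).
workflow = [
    {"op": "filter", "column": "tenure", "min": 6},
    {"op": "mean", "column": "monthly_charges"},
]

def replay(steps, df: pd.DataFrame):
    """Apply the saved steps to any table with the expected columns.
    No LLM is involved at run time, so the behavior is fully deterministic."""
    for step in steps:
        if step["op"] == "filter":
            df = df[df[step["column"]] >= step["min"]]
        elif step["op"] == "mean":
            return df[step["column"]].mean()

# Two different data sets, one workflow, identical logic applied to each.
df_a = pd.DataFrame({"tenure": [2, 12, 24], "monthly_charges": [20.0, 50.0, 80.0]})
df_b = pd.DataFrame({"tenure": [8, 1], "monthly_charges": [30.0, 90.0]})
print(replay(workflow, df_a), replay(workflow, df_b))  # 65.0 30.0
```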
It has been a long road for Patel and John, but the Madison, Wis.-based company is now taking orders for DataChat. With the official launch at the Gartner show behind him, Patel is ready to see what the next chapter of his fourth startup will bring.
“When we started writing the first paper, everyone in the database world thought this was crazy,” Patel said. “But in a way, we were lucky that GenAI got to a place that made this so much easier to do. That’s the funny thing about technology: it moves around, and if you’re willing to move around with it, good things can happen.”
Related Items:
GenAI does not require a large LLM. We need better data
Are we underestimating the impact of GenAI?
Top 10 challenges to GenAI success


