Hundreds of open-source large-scale language model (LLM) builder servers and dozens of vector databases are leaking highly sensitive information onto the open web.
In their rush to integrate AI into business workflows, companies may not pay enough attention to securing these tools and the information they entrust to them, and in a new report, Legit security researcher Naphtali Deutsch demonstrates this by scanning the web for two types of threats. Potentially Vulnerable Open Source (OSS) AI ServicesA vector database to store data for AI tools and an LLM application builder, specifically the open source program Flowise. Sensitive personal and corporate dataThey are unwittingly put at risk by organisations striving to join the generative AI revolution.
“A lot of programmers see these tools on the Internet and try to implement them in their environments,” Deutsch says, but they don’t consider security.
Hundreds of unpatched Flowwise servers
Flowise is a low-code tool for building any kind of LLM application. It’s backed by Y Combinator and has tens of thousands of stars on GitHub.
The programs that developers build with Flowise, whether they’re customer support bots or tools that generate and extract data for downstream programming or other tasks, tend to access and manage large amounts of data, so it’s no surprise that the majority of Flowise servers are password protected.
But passwords alone are not enough security. Earlier this year, an Indian researcher discovered an authentication bypass vulnerability in Flowise version 1.6.2 and earlier versions that could be triggered by simply capitalizing a few characters in the program’s API endpoint. The issue is tracked as CVE-2024-31621 and received a “high” score of 7.6 on the CVSS version 3 scale.
Legit’s Deutsch exploited CVE-2024-31621 to crack 438 Flowise servers, including GitHub access tokens. OpenAI API keyyour plaintext Flowise password and API key, configurations and prompts associated with your Flowise app, etc.
“GitHub API tokens allow access to private repositories,” Deutsch emphasizes. This is just one example of the additional attacks that such data can enable. “We also found API keys to other vector databases, such as Pinecone, a very popular SaaS platform. These can be used to break into the databases and dump all the data found, which is probably private and sensitive data.”
Dozens of unsecured vector databases
In fact, vector databases store all kinds of data that AI apps need to retrieve, and any data accessible from the wider web could be attacked directly.
Using a scanning tool, Deutsch found about 30 Vector database servers running online without any authentication checks. These servers contained apparently sensitive information, such as private email conversations from an engineering services vendor, documents from a fashion company, and PII and financial information from clients of an industrial equipment company. Other databases included real estate data, product documentation and data sheets, and patient information used by a medical chatbot.
Leaky vector databases are even more dangerous than leaky LLM builders because they can be tampered with in ways that don’t alert users of the AI tools that rely on them. For example, hackers can not only steal information from a public vector database, but can also delete or corrupt that data to manipulate the results. They could also plant malware within a vector database that would be pulled in when an LLM program runs a query.
To mitigate the risks from exposed AI tools, Deutsch recommends organizations limit access to the AI services they rely on, monitor and log activity related to those services, protect any sensitive data trafficked by LLM apps, and apply software updates whenever possible.
“[These tools] “It’s new, and people don’t have a lot of knowledge about how to set it up,” he warns. “And it’s getting easier to set up. A lot of these vector databases can be set up in Docker or in an AWS Azure environment with two clicks.”Security is more cumbersome and can lag behind.