
Google surprised the world this month with a demo of Gemini 1.5, its cutting-edge generative artificial intelligence (AI) program and the successor to the original Gemini released last December. Gemini 1.5 excels at tasks such as the “needle in the haystack” challenge, in which the program must identify the frames in a video that match a text description.
But Google’s program, like most AI programs from the largest commercial organizations, comes with few technical details about how the software works. The 58-page technical report Google released for Gemini 1.5 offers only a general description of the program and its approach, without details of the architecture that makes up Gemini 1.5. And, of course, the code is not available.
Also: Introducing Gemini 1.5, Google’s latest AI model with significant upgrades over its predecessor.
In that sense, Gemini 1.5 continues the recent trend among Google, OpenAI, and other commercial companies of obscuring the technical details of their AI programs.
This kind of secrecy presents an opportunity for open source software to match some of Gemini’s capabilities while opening up access to the code.
In research published this month by Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel of the University of California, Berkeley, and described on the project’s GitHub site, the scientists adapt Meta’s open-source Llama 2 large language model to create a multimodal program. Like Gemini 1.5, it can process not only text but also video and images, though (unlike Gemini 1.5) it cannot process audio.
Also: Research shows GPT-4 has become significantly duller over time
Using the mainstream version of Llama 2, a relatively small neural network of 7 billion parameters, the authors were able to process inputs of up to 1 million “tokens,” the units of text, image, or video data fed into the program. That figure represents a dramatic increase over the 128,000 tokens handled by Gemini 1.0 and OpenAI’s GPT-4 Turbo.
Their creation, known as the Large World Model (LWM), performs tasks similar to Gemini 1.5. It can solve needle-in-a-haystack problems, such as answering the question “What color jacket was the girl on the trampoline wearing?” when fed an hour-long YouTube video.
UC Berkeley’s Large World Model can answer “needle in a haystack” questions about specific moments in a video better than Google’s Gemini 1.0 or OpenAI’s GPT-4 Turbo. University of California, Berkeley
Liu and his team have not yet shown how their results compare with Gemini 1.5. Instead, the team presents a comparison with GPT-4 and Gemini 1.0.
As shown in the image above, LWM correctly answers the needle-in-the-haystack question, while the other two programs fail.
LWM can hold a chat about what’s happening in a video clip or have an extended discussion about the contents of an image, a process the researchers call “image chat.” LWM can also generate images and videos when given a text description as a prompt (see both examples below).
Remarkably, it appears that Liu and his team may have achieved results comparable to Gemini 1.0 with less computing power. LWM was trained for 58 hours on a single slice of a TPU version 4 “pod,” which consists of 256 TPU chips with two cores each. For Gemini 1.0, the technical report contains few details about the training infrastructure, much like the 1.5 report. All we know is that Google used some number of TPU version 4 and version 5 pods for some period of time. It’s quite possible that Google used far more computing power to train Gemini than Liu and his team used to train LWM.
So how can LWM achieve results similar to Gemini 1.0 when it is based on a relatively small open-source program and runs on less computing power? The two programs are the products of different kinds of approaches to the problem of how to develop neural networks.
Both models start from a similar type of neural network, a Transformer, to which Google says it has added “innovations in training algorithms, datasets, and infrastructure.”
Also: How Google and OpenAI inspired GPT-4 to provide more timely answers
For LWM, Liu and his team trained the model in multiple successive rounds, gradually increasing the “context window,” the amount of data the program processes in each pass. The team started with a context window of 32,768 tokens, which you can think of as so many pieces of data, and worked up to 1 million tokens.
The approach, called “ring attention,” was developed by Liu and his team last year. The insight of ring attention is that training can be parallelized: the neural network works on pieces of a data sample simultaneously rather than sequentially, so more is accomplished in less time and the chips are used more efficiently.
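To make the idea concrete, here is a minimal, single-machine sketch of the blockwise arithmetic that ring attention distributes across chips. In the actual Ring Attention scheme, each block of keys and values lives on a different accelerator and the blocks are passed around a ring so communication overlaps with computation; the plain Python loop below, with made-up names and sizes, only illustrates how processing the sequence in pieces can give the same answer as processing it all at once.

```python
# Single-host sketch of the blockwise accumulation at the heart of ring attention.
# In the real scheme, each key/value block sits on a different device and the
# blocks are rotated around a ring of accelerators; here the "ring" is a loop.
import numpy as np

def blockwise_attention(q, k, v, block_size):
    """Softmax attention over k/v processed one block at a time.

    Produces the same result as full attention without ever building
    the full (seq_q, seq_k) score matrix.
    """
    seq_q, d = q.shape
    out = np.zeros((seq_q, v.shape[1]))       # running weighted sum of values
    row_max = np.full((seq_q, 1), -np.inf)    # running max of scores (stability)
    row_sum = np.zeros((seq_q, 1))            # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]   # the block this "device" holds
        v_blk = v[start:start + block_size]

        scores = q @ k_blk.T / np.sqrt(d)
        new_max = np.maximum(row_max, scores.max(axis=1, keepdims=True))

        # Rescale earlier accumulators to the new max, then fold in this block.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max)
        out = out * correction + p @ v_blk
        row_sum = row_sum * correction + p.sum(axis=1, keepdims=True)
        row_max = new_max

    return out / row_sum

# Sanity check against ordinary full attention on random data.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(16, 8)), rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
scores = q @ k.T / np.sqrt(8)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
full = (weights / weights.sum(axis=1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v, block_size=16), full)
```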
“We take a training approach […] where our model is trained on progressively longer sequence lengths, starting from 32,000 tokens and increasing in powers of two up to 1 million tokens,” write Liu and team.
“Intuitively, this saves compute by first learning shorter-range dependencies before moving on to longer sequences,” they add. “It allows training on orders of magnitude more tokens than training directly on the maximum target sequence length would.”
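As a purely illustrative sketch of that schedule (the names below are placeholders, not the LWM code), the same set of weights is carried through successive training stages while the context window doubles from 32,768 tokens to roughly 1 million:

```python
# Toy illustration of progressive context-length training: the same "model"
# (here just a dict standing in for the weights) is trained in stages, with the
# context window doubling each stage, matching the 32K-to-1M schedule described.
def train_stage(model, context_len):
    # Placeholder for one stage of training at a fixed context length.
    model["stages_completed"].append(context_len)

model = {"stages_completed": []}
context_len = 32_768
while context_len <= 1_048_576:
    train_stage(model, context_len)
    context_len *= 2

print(model["stages_completed"])
# [32768, 65536, 131072, 262144, 524288, 1048576]
```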
LWM is trained on sequences of data of increasing length. University of California, Berkeley
The data used to train LWM includes some of the most prominent publicly available datasets, among them Books3, which is at the center of a controversy over copyright infringement. The researchers also used Video Instruct-100K, a “video conversation dataset” hosted on GitHub.
Google does not disclose the training data for Gemini 1.0, saying only that “Gemini models are trained on multimodal and multilingual datasets. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data.”
Also: AI also unlocks human potential to the next level. Here’s how to do it
Google is already pushing Gemini 1.5 to handle as many as 10 million tokens of input, but Liu and team say ring attention is theoretically limited only by the number of devices available, and they believe it can scale to a virtually unlimited context.
They continue: “We believe the released model will not only provide a basis for future work on developing longer-context models, but will also encourage more challenging benchmarks that include difficult long-range tasks.”
The LWM code is posted on the research team’s GitHub site.