Development of humanoid robots has been moving at a snail’s pace for the better part of two decades, but thanks to a collaboration between Figure AI and OpenAI, it’s rapidly accelerating, and the result is one of the most astonishing videos of a real humanoid robot I’ve ever seen.
On Wednesday, robotics startup Figure AI released a video update (see below) of its Figure 01 robot running a new visual language model (VLM), which has somehow transformed the bot from an unremarkable automaton into a full-fledged sci-fi robot with abilities approaching C-3PO’s.
In the video, Figure 01 stands behind a table set with a plate, an apple, and a cup. To its left is a drying rack. A man stands in front of the robot and asks, “Figure 01, what do you see now?”
After a few seconds, Figure 01 responds in a surprisingly human-sounding voice (it has no face, just an animated light that moves in sync with its speech), describing everything on the table as well as the man standing in front of it.
“That’s cool,” I thought.
Then the man asks, “Hey, can I have something to eat?”
Figure 01 replies, “Sure,” and in one smooth motion deftly picks up the apple and hands it to the man.
“Wow,” I thought.
The man then dumps some crumpled trash out of a bin in front of Figure 01 and asks, “Can you explain why you did what you did while you pick up this trash?”
Figure 01 doesn’t miss a beat, explaining itself while dropping the paper back into the bin: “I gave you the apple because it’s the only edible thing I could offer you from the table.”
“This can’t be real,” I thought.
But it seems it is real, at least according to Figure AI.
Speech-to-speech
Figure 01 uses OpenAI’s pre-trained multimodal model, the VLM, to perform “speech-to-speech” reasoning: it understands both images and text, and it draws on the entire history of the audio conversation to craft its spoken responses, the company explained in a release. That differs from, say, OpenAI’s GPT-4, which focuses on written prompts.
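Figure AI hasn’t published any code, so the Python sketch below is only my rough picture of what such a loop could look like. Every name in it (the Conversation class and the transcribe, multimodal_reason, and speak functions) is a hypothetical placeholder, not Figure’s or OpenAI’s API; the point is simply that one model takes a camera frame plus the whole conversation and produces a spoken reply.

```python
# Minimal sketch of a speech-to-speech turn, assuming a single multimodal model.
# All names below are hypothetical placeholders, not Figure AI or OpenAI APIs.
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Full audio conversation history, kept so each reply has context."""
    turns: list[str] = field(default_factory=list)


def transcribe(audio: bytes) -> str:
    """Placeholder speech-to-text for the human's utterance."""
    return "Figure 01, what do you see now?"


def multimodal_reason(image: bytes, history: Conversation) -> tuple[str, str]:
    """Placeholder for one model call returning (spoken reply, chosen behavior)."""
    return "I see an apple, a plate, and a cup on the table.", "describe_scene"


def speak(text: str) -> None:
    """Placeholder text-to-speech out of the robot's speaker."""
    print(f"[robot voice] {text}")


def speech_to_speech_turn(audio: bytes, image: bytes, history: Conversation) -> str:
    """One turn: hear, reason over the image plus the whole conversation, answer aloud."""
    history.turns.append(transcribe(audio))
    reply, behavior = multimodal_reason(image, history)
    history.turns.append(reply)
    speak(reply)
    return behavior  # handed off to the low-level manipulation policy


if __name__ == "__main__":
    convo = Conversation()
    speech_to_speech_turn(b"<mic audio>", b"<camera frame>", convo)
```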
It also uses what the company calls “learned low-level bimanual manipulation.” The system pairs precise image calibration (down to the pixel level) with neural networks that control movement. “These networks ingest onboard images at 10hz and generate 24 degrees of freedom actions (wrist pose and knuckle angles) at 200hz,” Figure AI wrote in the release.
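To make those two update rates concrete, here is a minimal two-rate control loop in the same spirit. The class and method names (ManipulationPolicy, observe, act) are my own placeholders, and the sleep-based pacing is a simplification rather than how a real-time controller would work; it only shows perception updating 10 times per second while 24-degree-of-freedom actions go out 200 times per second.

```python
# Sketch of a two-rate loop: perception at 10 Hz, action output at 200 Hz.
# The class and methods are placeholders, not Figure AI's published code.
import time

IMAGE_HZ = 10      # onboard camera frames per second
ACTION_HZ = 200    # action commands per second
DOF = 24           # wrist poses plus finger joint angles


class ManipulationPolicy:
    """Stand-in for the learned low-level manipulation network."""

    def __init__(self) -> None:
        self.latest_features = None

    def observe(self, image_frame: bytes) -> None:
        """Update internal state from a new camera frame (slow loop, 10 Hz)."""
        self.latest_features = image_frame  # real system: neural-net encoding

    def act(self) -> list[float]:
        """Emit one 24-DOF action from the latest features (fast loop, 200 Hz)."""
        return [0.0] * DOF  # real system: neural-net output of joint targets


def control_loop(policy: ManipulationPolicy, duration_s: float = 0.1) -> None:
    ticks_per_frame = ACTION_HZ // IMAGE_HZ  # 20 action ticks per camera frame
    for tick in range(int(duration_s * ACTION_HZ)):
        if tick % ticks_per_frame == 0:
            policy.observe(b"<camera frame>")  # perception update
        action = policy.act()                  # 24-DOF command to wrists and fingers
        time.sleep(1.0 / ACTION_HZ)            # pacing only; not hard real-time


if __name__ == "__main__":
    control_loop(ManipulationPolicy())
```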
The company claims that all of the behaviors in the video are learned and not teleoperated, meaning there is no one controlling Figure 01 behind the scenes.
These claims are hard to verify without examining Figure 01 in person and asking it my own questions. This may not have been the first time Figure 01 ran through this routine; maybe it was the 100th, which could explain its speed and fluidity.
Or maybe it’s all 100% real. In that case, there’s only one word for it: amazing.