Summary: Researchers trained a multimodal AI system using only the input received by a single child from 6 months of age until around her second birthday, and found that the system could learn language from it, challenging the notion that huge amounts of data are required.
The research demonstrates that an AI model can learn words and concepts from limited portions of a child’s experience captured through head-camera recordings. The experiment highlights the potential of AI to mimic the human language-learning process and reshapes our understanding of early language and concept acquisition.
By aligning AI learning with a child’s naturalistic experience, the researchers offer new insight into the debate over how children learn language, finding that associative learning can accomplish more than previously thought and suggesting that it may play an important role in early word learning.
Key facts:
- An AI system trained on headcam footage from a single child learned a substantial number of words and concepts, even though the video captured only about 1% of the child’s waking hours.
- The study used a multimodal neural network that combines visual and linguistic data through contrastive learning, mirroring the way children associate words with their visual context.
- This study challenges traditional beliefs about language learning and shows that, just as in human children, associative learning with minimal input can lead to substantial language acquisition.
Source: New York University
AI systems such as GPT-4 can now learn and use human language, but they do so from an astronomical amount of linguistic input, far more than children receive when learning how to understand and speak a language. The best AI systems are trained on text containing trillions of words, whereas children receive only a few million words per year.
Because of this huge data gap, researchers have been skeptical that recent advances in AI can tell us much about human learning and development. An ideal test of the connection would involve training an AI model only on the input received by a single child, rather than on vast amounts of data from the web, and then seeing what the model can learn.
A research team from New York University conducted just this experiment. They trained a multimodal AI system on input gathered through the eyes and ears of a single child, using head-camera video recordings from 6 months of age until her second birthday, and examined whether the model could learn the words and concepts present in a child’s everyday experience.
Their findings, reported in the journal Science, show that the model, a neural network, can indeed learn a substantial number of words and concepts from this limited slice of one child’s experience. Even though the video captured only about 1% of the child’s waking hours, that was enough for genuine word learning.
“We show, for the first time, that a neural network trained on developmentally realistic input from a single child can learn to link words to their visual counterparts,” said Wai Keen Vong, a researcher at New York University’s Center for Data Science and first author of the paper.
“Our findings demonstrate how recent algorithmic advances, paired with one child’s naturalistic experience, can reshape our understanding of early language and concept acquisition.”
“By using AI models to study the real language-learning problem faced by children, we can address the classic debate about what ingredients children need to learn words: whether they need language-specific biases, innate knowledge, or just associative learning,” added Brenden Lake, an assistant professor in New York University’s Center for Data Science and Department of Psychology and the paper’s senior author.
“It seems that we can get more from just learning than is commonly thought.”

Vong, Lake, and their New York University colleagues Wentao Wang and Emin Orhan analyzed a child’s learning process captured in first-person video, recorded weekly via a lightweight head-mounted camera from 6 months through 25 months of age and amounting to more than 60 hours of footage.
The footage contains approximately 250,000 word instances (i.e., the number of words communicated, many of them repeated), linked to the video frames of what the child saw as those words were spoken, and spans a wide range of developmentally relevant activities, including mealtimes, book reading, and play.
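To give a sense of how such frame-utterance pairs might be organized for training, here is a minimal, hypothetical Python sketch; the field names, file layout, and helper `build_pairs` are assumptions for illustration, not the dataset’s actual format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UtteranceFramePair:
    """One training example of the kind described above: a transcribed
    utterance paired with the headcam frame of what the child saw
    as the words were spoken (field names are illustrative)."""
    transcript: str      # e.g. "do you see the ball?"
    frame_path: str      # path to the co-occurring video frame
    timestamp_s: float   # seconds into the recording session

def build_pairs(utterances: List[Tuple[float, str]],
                frame_dir: str) -> List[UtteranceFramePair]:
    """Pair each (timestamp, utterance) with the frame captured at that moment.
    Assumes frames are stored as '<frame_dir>/<second>.jpg' (purely a sketch)."""
    return [
        UtteranceFramePair(text, f"{frame_dir}/{int(t)}.jpg", t)
        for t, text in utterances
    ]
```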
The New York University researchers then trained a multimodal neural network with two separate modules: one that takes in single video frames (the vision encoder) and another that takes in the transcribed speech directed at the child (the language encoder).
The two encoders were combined and trained using an algorithm called contrastive learning, which aims to learn useful input features and their cross-modal associations. For instance, when a parent says something within the child’s view, some of the words used are likely to refer to things the child can see; linking these visual and linguistic cues is how that understanding takes hold.
“This provides clues to the model about which words should be associated with which objects,” Vong explains.
“Combining these cues is what enables contrastive learning to gradually work out which words belong with which visuals, and to capture a child’s learning of their first words.”
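To make this training setup concrete, below is a minimal sketch, in PyTorch, of a two-encoder model trained with a symmetric contrastive objective of the kind described above. The class and function names, feature dimensions, mean-pooled word embeddings, and temperature value are illustrative assumptions, not the authors’ released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoEncoderModel(nn.Module):
    """Toy stand-in for a vision encoder and a language encoder that
    project into a shared embedding space (illustrative only)."""
    def __init__(self, frame_dim=512, vocab_size=1000, embed_dim=128):
        super().__init__()
        # Vision encoder: a linear map over precomputed frame features.
        self.vision_encoder = nn.Linear(frame_dim, embed_dim)
        # Language encoder: mean-pooled word embeddings of a transcribed utterance.
        self.language_encoder = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")

    def forward(self, frame_features, utterance_tokens, offsets):
        img = F.normalize(self.vision_encoder(frame_features), dim=-1)
        txt = F.normalize(self.language_encoder(utterance_tokens, offsets), dim=-1)
        return img, txt

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric cross-entropy over frame-utterance similarities: pairs that
    co-occurred in the recordings (the diagonal) are pulled together,
    mismatched pairs in the batch are pushed apart."""
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example with random stand-in data: a batch of 4 co-occurring frame/utterance pairs.
frames = torch.randn(4, 512)                     # precomputed frame features
tokens = torch.tensor([1, 5, 9, 2, 7, 3, 8, 4])  # concatenated word ids
offsets = torch.tensor([0, 3, 5, 7])             # start of each utterance in `tokens`
model = TwoEncoderModel()
loss = contrastive_loss(*model(frames, tokens, offsets))
```

The key idea mirrors the description above: frames and utterances that actually co-occurred in the headcam recordings serve as positive pairs, while the other combinations in the batch act as negatives.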
After training the model, the researchers tested it using the same kind of assessment used to measure word learning in infants: they presented the model with a target word and an array of four candidate images and asked it to select the image that matched the target word.
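As a rough illustration, the sketch below (same hypothetical PyTorch setting as above; the helper name `four_way_choice` is an assumption) scores a target word’s embedding against four candidate image embeddings and picks the best match.

```python
import torch
import torch.nn.functional as F

def four_way_choice(word_embedding, candidate_image_embeddings):
    """Given one target-word embedding (shape [d]) and four candidate image
    embeddings (shape [4, d]), return the index of the best-matching image."""
    w = F.normalize(word_embedding, dim=-1)
    imgs = F.normalize(candidate_image_embeddings, dim=-1)
    scores = imgs @ w            # cosine similarity of each candidate to the word
    return int(scores.argmax())
```

Chance performance on this task is 25%, so accuracy well above that level indicates that word-referent mappings were genuinely learned.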
The results showed that the model learned a substantial number of the words and concepts present in the child’s everyday experience. Moreover, for some of the words it learned, the model could generalize to visual instances very different from those seen during training, reflecting the kind of generalization seen when children are tested in the laboratory.
“These findings suggest that this aspect of word learning is achievable from the kind of naturalistic data that children receive, while using relatively generic learning mechanisms such as those found in neural networks,” Lake observes.
Funding: This research was supported by the U.S. Department of Defense’s Defense Advanced Research Projects Agency (N6600119C4030) and the National Science Foundation (1922658). The child’s participation was approved by her parents, and the methodology was approved by New York University’s Institutional Review Board.
About this artificial intelligence research news
Author: James Devitt
Source: New York University
Contact: James Devitt – New York University
Image: The image is credited to Neuroscience News
Original research: Closed access.
“Grounded language acquisition through the eyes and ears of a single child” by Wai Keen Vong et al. Science
Abstract
Grounded language acquisition through the eyes and ears of a single child
Starting around 6 to 9 months of age, children begin to acquire their first words, linking spoken words to their visual counterparts. How much of this knowledge can be learned from sensory input using relatively generic learning mechanisms, and how much requires stronger inductive biases?
Using longitudinal head-mounted camera recordings from a single child aged 6 to 25 months, we trained a relatively generic neural network on 61 hours of correlated visual-linguistic data streams, learning feature-based representations and cross-modal associations.
Our model acquires many of the word-referent mappings present in the child’s everyday experience, enables zero-shot generalization to new visual referents, and aligns its visual and linguistic conceptual systems.
These results demonstrate how critical aspects of grounded word meaning are learnable through joint representation and associative learning from one child’s input.