Introducing “AI in the Real World”
In the coming months, I’ll be hosting a series of in-depth conversations with leading AI researchers where we’ll explore how state-of-the-art models are being applied in the real world. To kick off the series, I speak with Joon Sung Park, a Ph.D. student in computer science at Stanford who works at the intersection of human-computer interaction, NLP, and machine learning. He’s best known for creating LLM-powered agents that communicate, act, and think increasingly like humans. These agents use LLMs as central reasoning engines to make decisions, interact, and otherwise navigate virtual worlds.
Joon started his Ph.D. in 2020, shortly after the release of GPT-3. Along with its fellow “foundation models” (a term coined by Joon’s Stanford peers), GPT-3 exhibits a remarkable ability to generate, synthesize, and analyze natural language. While traditional ML models have narrowly defined remits, foundation models can generalize to tasks far beyond their original training. This breakthrough led Joon and his colleagues to ask: “Now that we have these models, what are we going to do with them? They have incredible capacity, but can they enable us to do something truly new and unique?”
Joon’s research seeks to answer this question by applying LLMs to a long-standing challenge in computer science: the creation of AI agents that can authentically simulate human behavior. This ambition traces back to the earliest days of AI in the 1950s, when the field’s founders dreamed of replicating human cognition in computational form. Anticipating present-day debates around AGI, their goal was to endow AI systems with human-level intelligence, including the abilities to learn, solve problems, and take action in open-ended environments.
Central to Joon’s research is the insight that LLMs, trained on massive amounts of data from the web, digitized books, and Wikipedia, possess an in-built understanding of human behavior. In his paper, “Generative Agents: Interactive Simulacra of Human Behavior,” published on arXiv in April 2023, he and his coauthors populated a virtual town, which they named “Smallville,” with 25 AI agents. These agents paired ChatGPT with a technical architecture that enabled long-term memory, reflection, and planning. Over the course of two days, they organically engaged with one another, forming relationships, sharing information, and reacting to unforeseen events.
The implications of Joon’s work extend far beyond Smallville’s sandbox. Generative agents promise to significantly enhance our understanding of—and our ability to create solutions for—our interconnected “bits and atoms” world. Through high-fidelity simulations, social scientists gain a laboratory for exploring the complexities of large-scale group dynamics. Policymakers can use these tools to model the effects of different forms of legislation, while architects can draw on them to optimize development plans and predict potential consequences. Businesses also stand to benefit immensely, as they can deploy generative agents to simulate customer interactions, test new products, and refine operational strategies without the headaches that attend human studies.
In what follows, I explore the core insights that I gathered from our conversation. To learn more from Joon, listen in on our full conversation here!
LLMs represent a new paradigm for agent design
Historically, agent design has leaned on rule-based systems, such as finite-state machines and behavior trees. These methods are well-suited for simple tasks yet struggle with the dynamic, unpredictable nature of real-world interactions. The need to hard-code every possible behavior prevents these agents from responding to unscripted scenarios. While advances in learning-based strategies, including reinforcement learning, allow for agents that can adapt to new information, their application has been largely limited to competitive settings where reward functions are clearly defined.
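To make the limitation concrete, here is a minimal sketch of a finite-state-machine agent. The states and events are hypothetical, invented only for illustration; the point is that any event the designer did not script is silently ignored.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    GREETING = auto()
    TRADING = auto()


# Hand-written transition table: (current state, event) -> next state.
# The states and events are hypothetical, chosen only for illustration.
TRANSITIONS = {
    (State.IDLE, "player_approaches"): State.GREETING,
    (State.GREETING, "player_asks_to_trade"): State.TRADING,
    (State.GREETING, "player_leaves"): State.IDLE,
    (State.TRADING, "player_leaves"): State.IDLE,
}


def step(state: State, event: str) -> State:
    """Return the next state, or stay put when the event was never scripted."""
    return TRANSITIONS.get((state, event), state)


state = State.IDLE
for event in ["player_approaches", "player_tells_a_joke", "player_asks_to_trade"]:
    state = step(state, event)
    print(event, "->", state.name)
```

The unscripted event "player_tells_a_joke" leaves the agent stuck in its greeting state: it has no way to respond unless someone adds another row to the table.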
LLMs address many of these limitations. Speaking of his research, Joon stresses: “LLMs are really what made this possible. That is really the fundamental tech stack that we needed.” Armed with a broad-based understanding of natural language, LLMs can predict the next action in a sequence based on both the context they are given and the context they infer. This eliminates the need for manual scripting, effectively automating the generation of believable agent behavior.
As Joon puts it: “What LLMs change is that they give us a single ingredient, which is given a micro-context, micro-moment… Given that micro-moment description, a language model is extremely good at predicting the next moment.” Coupled with an architecture that stores, retrieves, and synthesizes “memories” (a comprehensive log of an agent’s experiences), LLMs enable the creation of AI agents that can coherently act in and respond to ever-changing environments.
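A rough sketch of how such a memory layer might feed a “micro-moment” description to a language model is shown below. It is not the authors’ implementation. The scoring loosely follows the recency, importance, and relevance signals described in the Generative Agents paper, but the equal weighting, the decay rate, the word-overlap relevance measure, and all names and sample memories are simplifying assumptions. The LLM call itself is left out; the sketch stops at building the prompt.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    created_at: float  # simulation time in hours (assumed unit)
    importance: float  # 0..1, assumed to be assigned elsewhere


def relevance(query: str, memory: Memory) -> float:
    """Crude stand-in for relevance: fraction of query words found in the memory."""
    query_words = set(query.lower().split())
    memory_words = set(memory.text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & memory_words) / len(query_words)


def retrieve(memories, query, now, k=2, decay=0.99):
    """Return the k memories with the highest combined score."""
    def score(m):
        recency = decay ** (now - m.created_at)  # older memories fade
        return recency + m.importance + relevance(query, m)
    return sorted(memories, key=score, reverse=True)[:k]


def build_prompt(agent_name, situation, memories, now):
    """Assemble the micro-moment description that would be sent to an LLM."""
    recalled = retrieve(memories, situation, now)
    lines = [f"{agent_name} remembers:"]
    lines += [f"- {m.text}" for m in recalled]
    lines.append(f"Current situation: {situation}")
    lines.append(f"What does {agent_name} do next?")
    return "\n".join(lines)


# Hypothetical memory log for a hypothetical agent.
memories = [
    Memory("Maria is planning a party at the cafe", created_at=1.0, importance=0.8),
    Memory("ate breakfast at home", created_at=8.0, importance=0.1),
    Memory("Sam is writing a research paper", created_at=5.0, importance=0.5),
]
prompt = build_prompt("Alex", "Maria walks into the cafe", memories, now=10.0)
print(prompt)
```

Here the old but important and relevant party memory outranks the recent but trivial breakfast memory, so the prompt carries the context the model needs to predict a plausible next action.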
Tool use and simulations are the two main focuses of agent R&D
To better situate his research, Joon describes two R&D directions in the agent space: “We’re seeing a split where there’s one community who’s now deeply interested in agents using tools, and another community that is deeply interested in this idea of ‘can we simulate?’”
The first camp is building tool-based agents that automate complex, multi-step workflows. Capable of tasks like ordering pizza, buying plane tickets, and creating slide decks, these agents rely on LLMs’ advanced NLP abilities to perform functions that go well beyond basic scripting. This approach has roots in early AI agents like Clippy, the friendly paperclip that assisted Microsoft Word users based on their past interactions with the software. A more recent example is OpenAI’s GPT store, where users can share customized versions of the startup’s general-purpose chatbots.
Joon’s research falls into the second camp that’s focused on simulation-based agents. Here, the goal is to realistically replicate human behavior and social dynamics in virtual worlds. Originating in the gaming industry, simulation-based agents can be applied in any field where modeling complex human interactions proves beneficial, including business, social science, and policymaking.
Despite their differing focuses today, Joon speculates that these communities will converge in the future. In both cases, LLMs serve as the foundational architectural layer.
ChatGPT is not the “killer app” for LLMs
We’re already seeing both startups and established companies apply LLMs in incredibly valuable ways, including to personalize marketing campaigns, boost sales efficiency, and streamline software development. Yet Joon questions whether ChatGPT, for all its popularity, is truly the “killer app” for these models.
Killer apps, he argues, do more than attract a vast user base: they unlock and leverage the novel capabilities of a given technology. He points to Microsoft Excel as an example, noting how it revolutionized the manipulation of tabular data and, in doing so, transformed PCs into essential tools for both businesses and consumers. In the case of LLMs, the equivalent would be an app that’s uniquely enabled by its underlying model and ushers in a new phase of AI utility and adoption.
By this definition, ChatGPT might not make the cut. Joon describes this chatbot as “a fairly simple wrap” that fails to fully exploit LLMs’ powers to create, reason, and plan. Its blank text box places the burden on users to articulate their intent, provide all relevant context, and evaluate the model’s responses. Then comes the problem of “unknown unknowns”: the fact that users lack visibility into the LLM’s vast possibility space. The “killer app” for LLMs should help users intentionally navigate this space through thoughtful product design. Joon issues a challenge to builders: “What is going to be the application that really adds value in a much more generalizable way?”
Multimodal models promise step-change improvements in agent performance
Multimodal models, which can process images, audio, and video along with text, promise to bring a new level of depth and nuance to agent interactions. For example, a multimodal AI agent could discern a user’s emotions from non-verbal cues such as tone of voice, gestures, and facial expressions. It could also act directly on multimedia inputs from its environment without first converting them into text: an extra step that may cause valuable data to be lost in translation.
As Joon explains of Smallville: “We used our system to translate the visual world into natural language and then fed it to the agent architecture, which used our language model to process it. With models that are able to deal with multimodal inputs, we might actually be able to bypass that phase and go straight from ‘Here is the visual world or space that you’re seeing right now’ to ‘That is your memory, now act on it.’”
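The translation step Joon describes can be pictured with a small sketch like the one below. It is an illustration under assumed names and data structures, not the Smallville code: a tile-based world, a hypothetical `WorldObject` record, and a fixed perception radius. A multimodal model would, in principle, make this conversion unnecessary.

```python
from dataclasses import dataclass


@dataclass
class WorldObject:
    name: str
    state: str
    x: int
    y: int


def describe_surroundings(agent_pos, objects, radius=2):
    """Turn nearby objects into plain-English observations, nearest first."""
    ax, ay = agent_pos
    nearby = []
    for obj in objects:
        distance = abs(obj.x - ax) + abs(obj.y - ay)  # Manhattan distance in tiles
        if distance <= radius:
            nearby.append((distance, obj))
    nearby.sort(key=lambda pair: pair[0])
    return [f"The {obj.name} is {obj.state}." for _, obj in nearby]


# Hypothetical scene: only objects within two tiles of the agent are perceived.
scene = [
    WorldObject("stove", "turned on", 1, 0),
    WorldObject("bed", "unmade", 5, 5),
    WorldObject("coffee machine", "brewing", 0, 2),
]
observations = describe_surroundings((0, 0), scene)
print(observations)
```

Anything the text template does not capture, such as colors, layout, or facial expressions, is lost at this step, which is the gap multimodal inputs could close.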
Soft-edged problems are better bets for AI builders
Joon provides a strategic framework for AI builders to identify the most promising agent applications. It turns on the distinction between hard-edged and soft-edged problems.
Hard-edged problems are rule-based and tolerate no errors. Their outcomes are strictly binary: either success or failure. Tasks like flight bookings, financial transactions, and medical diagnoses fall in this category. Here, mistakes can have serious consequences. By contrast, soft-edged problems are more subjective and allow for a range of acceptable outcomes. Here, the focus is on adaptability and progressive improvement. Examples include content generation, personalized recommendations, and interactive entertainment. As Joon elaborates, “These are problems where we can increasingly hill-climb toward being better. At a certain level, the technology starts to actually become useful.”
Reflecting on failed AI agents of the past, like Clippy, Joon highlights the pitfalls of applying this technology in hard-edged situations. Without a deep understanding of users’ intentions and goals, agents can be perceived as unhelpful or, worse, actively annoying. This has led to cycles of initial excitement followed by user frustration and disillusionment among developers. Joon suggests that while LLMs may shift this dynamic, founders tackling hard-edged problems are more likely to be let down once again. “My hunch is that we will likely see a very similar pattern arise, at least for the foreseeable future,” he notes.
For entrepreneurs, soft-edged problem spaces represent a safer path. Here, the bar for shipping a product is not perfection but the added value that it delivers: a more attainable standard that allows builders to iterate on their products alongside their users. This was a core motivation for Joon and his collaborators to work on simulation. “Simulation is the prime example of a soft-edged problem space because the simulation only has to be ‘good enough’ for it to start being useful,” he explains. Other “good enough” use cases include marketing, sales, customer service, and product design. In these fields, the presence of multiple viable solutions is a feature, not a bug. This flexibility creates more opportunities for founders to differentiate their products.
Agents need to solve real user needs
The success of an LLM-powered agent turns on more than just the sophistication of its underlying model. As Joon puts it: “One thing I’m curious about, that I don’t think a lot of us are thinking about, is not the technology part, but the interaction. How are these agents going to be used? In what way? […] That’s where agents in the past have failed. There was really cool technology, but we didn’t seriously ask ourselves, is this something that people really need? And does the cost-benefit analysis of using these agents, and learning how to use them well, really make sense for a broad user base?”
Rather than lead with the technology, AI builders should prioritize understanding and solving real user needs, focusing on the proverbial “10x better” experience that agents can deliver. As Steve Jobs famously put it: start with the user experience, then work backward to the technology.
Grounding, representativeness, and scalability are all unsolved challenges
Ultimately, the aim is for AI simulations to accurately model real-world, rather than make-believe, scenarios. As Joon articulates: “I think it’s going to be far more interesting if we can make these simulations closely model our actual human communities. So it’s not just fictional but actually has grounding. That’s going to open up from our perspective an entirely new set of application spaces as well as research impact.”
To achieve this goal, researchers face several obstacles. Many third-party AI models have safety measures that limit the range of human behaviors they can express. ChatGPT, for example, refrains from arguing, reciprocating affection, and expressing personal beliefs. Joon points out that open-source models, which lack these restraints, could be a potential workaround. Yet myriad risks, from toxic content to targeted persuasion, arise when models can more easily go “off the rails.”
In addition, today’s generative models reflect a specific demographic: the “WEIRD” (Western, Educated, Industrialized, Rich, and Democratic) user. “Right now, ChatGPT really is an aggregate personality from the web. If you’re looking for minority voices in these chatbots, they’ll likely struggle,” Joon relays. Developing models that represent broader slices of humanity is crucial for creating simulations that are not only more inclusive but also more accurate.
Another challenge lies in scaling simulations. Joon and his peers’ Smallville experiment cost thousands of dollars in token credits, despite featuring only 25 agents and lasting just two days. More cost-efficient strategies could involve parallelizing agents and creating models specifically for simulations. On this front, Joon is hopeful: “In general, with advances in underlying models, we believe that agents’ performance will improve.”
Classic product principles still apply when building with LLMs
When thinking about AI’s future, Joon often looks back to the field’s founding moments. As he shares: “I get inspired by insights that had an impact and stood the test of time. […] I personally think all the great ideas are sort of timeless. Just because the current hype cycle is over doesn’t mean they’re less interesting or less meaningful.” He cites Allen Newell and Herbert Simon, two of AI’s early pioneers, as abiding influences.
This perspective is especially relevant for founders focused on generative AI. For all the advances of cutting-edge foundation models, the time-tested principles of product development—choosing problems that allow for iterative gains, baking user feedback loops into the product’s design, understanding and solving real needs, and crafting interactions that deliver tangible value to users—haven’t changed. Building with LLMs involves finding the right balance between leveraging the technology’s novel capabilities and adhering to best practices for shipping sticky products.
Stay tuned for our next conversation in this series, where we’ll dive further into the realm of AI-powered agents and their wide-ranging applications.