Multimodal generative AI allows teams to create machine learning models that support multiple data types, such as text, images, and audio. These capabilities open up new applications in content creation, customer service, and research and development.
Many generative AI services from Google, Microsoft, AWS, OpenAI, and the open source community now support at least text and images within a single model. Efforts are also underway to support other inputs, such as data from IoT devices, robotic controls, and corporate records and code.
“Multimodality in AI for business applications is best understood by first recognizing the diversity and complexity of the data types that businesses deal with every day,” said Christian Ward, executive vice president and chief data officer at digital experience platform Yext.
Multimodal generative AI can assist with financial data, customer profiles, store statistics, geographic information, search trends, and marketing insights. All of these are stored in different formats such as images, graphs, text, audio, and dialogue. Multimodal AI can automatically find connections between different datasets representing entities such as customers, equipment, and processes.
“We are used to seeing these datasets as separate and often different software packages, but multimodality also means merging and meshing this into a completely new output format,” said Ward.
Multimodal model overview
Major AI services such as OpenAI’s GPT-4 and Google’s Gemini are starting to support multimodal features. These models can understand and generate content in multiple formats, including text, images, and audio.
“The advent of capable generative multimodal models such as GPT-4 and Gemini is an important milestone in AI development,” said Samuel Hamway, research analyst at technology research firm Nucleus Research.
Hamway recommends that companies start by exploring and experimenting with consumer-facing chatbots such as ChatGPT and Gemini, formerly known as Bard. With their multimodal capabilities, these platforms offer companies significant opportunities to increase productivity in several areas. For example, ChatGPT and Gemini can automate routine customer interactions, help generate creative content, simplify complex data analysis, and interpret visual data in combination with text-based queries.
Despite recent advances, multimodal AI is generally less mature than large language models (LLMs), primarily because of the challenges of obtaining high-quality training data. Multimodal models can also have higher training and computational costs than traditional LLMs.
Vishal Gupta, partner at advisory firm Everest Group, observed that current multimodal AI models primarily focus on text and images, with some experimental models also including audio. That said, Gupta expects the market to gain momentum in the coming years because multimodal AI has broad applicability across industries and professions.
8 multimodal generative AI use cases
Here are eight real-world use cases where multimodal generative AI can deliver value to businesses today or in the near future.
1. Marketing and advertising
Marketing content creation is one of the multimodal generative AI use cases gaining the most traction, said Gupta. Multimodal models integrate audio, images, video, and text, and help marketers develop dynamic images and videos for their campaigns.
“This has huge potential to further improve the customer experience by dynamically personalizing content for users and increasing the efficiency and productivity of content teams,” said Gupta.
But Hamway cautions that companies need to balance personalization with privacy concerns. They also need to develop a data infrastructure that can effectively manage large and diverse datasets in order to glean actionable insights.
2. Labeling images and videos
Multimodal generative AI models can generate textual descriptions of a series of images, said Gupta. This functionality can be applied to captioning videos, annotating and labeling images, generating product descriptions for e-commerce, and generating medical reports.
3. Customer support interactions
Yaad Oren, managing director of SAP Labs US and global head of the SAP Innovation Center Network, believes that the most promising use case for multimodal generative AI is customer support. Multimodal generative AI can enhance customer support interactions by simultaneously analyzing text, image, and audio data, resulting in more context-aware, personalized responses and a better overall customer experience.
Chatbots can also use multimodality to understand and respond to customer questions in more nuanced ways by incorporating visual and contextual information. However, one of the key challenges is the accurate and ethical handling of different types of data, especially sensitive customer information.
4. Supply chain optimization
Multimodal generative AI can optimize supply chain processes by analyzing text and image data, providing real-time insights for inventory management, demand forecasting, and quality control. Oren said SAP Labs US is looking into image analysis for quality assurance in manufacturing processes, with the aim of identifying defects and irregularities. The company is also looking at how natural language processing models can analyze text data from various sources to predict fluctuations in demand and optimize inventory levels.
5. Improving medical care
Taylor Dolezal, head of ecosystems at the Cloud Native Computing Foundation, sees great potential in healthcare, where integrating different data types can enable more accurate diagnosis and personalized patient care. Multimodal generative AI is particularly useful for diagnostic tools, surgical robots, and remote monitoring devices.
“While these advances promise to improve patient outcomes and accelerate medical research, they also pose challenges in data integration, accuracy, and patient privacy,” Dolezal said.
6. Improving manufacturing and product design
Multimodal generative AI can improve manufacturing and design processes, Dolezal said. Models trained on design and manufacturing data, defect reports, and customer feedback can enhance the design process, improve quality control, and improve manufacturing efficiency.
AI can analyze market trends and consumer feedback in product design and implement quality control and predictive maintenance in manufacturing processes. The main challenge is integrating multiple data sources and ensuring the interpretability of AI decisions, Dolezal said.
7. Employee training
Multimodal generative AI can enhance learning and proficiency in employee training programs, Ward said. AI can create custom experiences for each role by generating content from a variety of learning materials and data. From there, employees can “teach” the learning materials back to the AI through audio or video recordings, creating an interactive feedback mechanism. When an employee explains their understanding of the material to the AI system, the system assesses that understanding and identifies learning gaps.
Ward cautioned that this approach could face challenges, particularly in introducing AI feedback to human learners. Nevertheless, it promises a more personalized and effective learning experience.
8. Multimodal question answering
Ajay Divakaran is the Technical Director of the Vision and Learning Institute at SRI International’s Vision Technology Center, a nonprofit scientific research institute. SRI International is currently exploring ways to improve question answering by combining images with text and, where possible, audio.
This is especially useful for applications that involve carrying out an ordered sequence of steps. For example, a user who asks an AI system a question about home repair can receive a combination of text instructions and generated images or videos, so that the text and visuals work together to walk the user through the process.
George Lawton is a journalist based in London. Over the past 30 years, he has written over 3,000 articles on computers, communications, knowledge management, business, health, and other areas of interest.