Artificial intelligence (AI) is catalyzing a new technological era, with multimodal AI systems redefining how we interact with technology. Multimodal AI goes beyond traditional, single-modality AI by simultaneously interpreting and integrating information from multiple sources, including text, images, audio, and video.
A study by Research and Markets underscores its transformative potential, projecting that the multimodal AI market will reach USD 27 billion by 2034, a significant leap from its USD 1.6 billion value in 2024.
This new frontier in AI technology is poised to understand complex scenarios and deliver richer, more accurate responses, enhancing real-time human-AI collaboration.
Transforming Lives with Multimodal AI
The advent of multimodal AI has transformed daily experiences, offering a more intuitive approach to technology. By integrating and analyzing information from multiple sources, multimodal AI represents a paradigm shift in how technology can influence human lives.
According to a study by GE Healthcare, 90% of all healthcare data comes from imaging technology. In healthcare, multimodal AI is being applied to non-invasive breast cancer subtype classification, enabling greater diagnostic precision and more personalized treatment. This multimodal framework integrates histopathological images and gene expression data to classify breast cancer into distinct subtypes.
In education, multimodal AI can create personalized learning environments by adapting content and teaching methods to individual students' needs.
Chatbots and virtual assistants (VAs) can interpret both voice tone and text, enabling more natural conversations in customer service. Multimodal AI also extends to natural language processing (NLP), merging audio and textual data to improve a system's understanding of context. These advancements are powered by deep learning architectures such as convolutional neural networks (CNNs) for image recognition, recurrent neural networks (RNNs) for sequential data processing, and transformer models for comprehensive text analysis.
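To make the fusion idea above concrete, the following is a minimal, illustrative sketch rather than any vendor's actual implementation: a small CNN encodes an image, a tiny transformer encodes text, and their embeddings are concatenated for classification. All layer sizes, the vocabulary size, and the four-class output are arbitrary assumptions chosen only to keep the example self-contained.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Small CNN that maps an image to a fixed-size embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # -> (batch, 32, 1, 1)
        )
        self.proj = nn.Linear(32, embed_dim)

    def forward(self, images):                 # images: (batch, 3, H, W)
        features = self.conv(images).flatten(1)
        return self.proj(features)             # (batch, embed_dim)

class TextEncoder(nn.Module):
    """Tiny transformer encoder that maps token IDs to a fixed-size embedding."""
    def __init__(self, vocab_size=10_000, embed_dim=128, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        hidden = self.encoder(self.embedding(token_ids))
        return hidden.mean(dim=1)              # mean-pool over the sequence

class LateFusionClassifier(nn.Module):
    """Concatenates image and text embeddings, then classifies."""
    def __init__(self, embed_dim=128, num_classes=4):
        super().__init__()
        self.image_encoder = ImageEncoder(embed_dim)
        self.text_encoder = TextEncoder(embed_dim=embed_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, images, token_ids):
        fused = torch.cat([self.image_encoder(images),
                           self.text_encoder(token_ids)], dim=1)
        return self.head(fused)                # class logits

# Toy usage with random tensors standing in for real images and transcripts.
model = LateFusionClassifier()
images = torch.randn(8, 3, 64, 64)
token_ids = torch.randint(0, 10_000, (8, 32))
print(model(images, token_ids).shape)          # torch.Size([8, 4])
```

This late-fusion pattern is one of the simplest ways to combine modalities; production systems typically swap in large pretrained encoders and more elaborate fusion layers.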
In an exclusive interview with Telecom Review, Prakash Siva, Senior Vice President, Head of Technology & Architecture at Radisys, added:
“Virtual assistants are redefining interactions by utilizing multimodal, human-like capabilities.”
He noted that they are being trained on “custom data sets specifically developed for contact centers, B2B applications, and telecom services. These assistants turn real-time voice streams into insights via AI speech analytics and large language models.”
Driven by advancements in deep learning techniques, including enhanced image classification and recognition capabilities, the image data segment was valued at USD 565.4 million in 2024. The overall multimodal AI segment was valued at USD 740.1 million the same year, primarily driven by increasing demand for high-quality content creation across digital platforms.
One notable example is DeepSeek's open-source multimodal AI model, Janus Pro 7B, which features advanced text-to-image generation and visual understanding. The model can handle complex queries, perform reasoning, and conduct deep analysis, and it has been reported to outperform OpenAI's DALL-E 3 on text-to-image generation benchmarks.
Challenges Associated with Multimodal AI
Despite its revolutionary potential, multimodal AI faces significant challenges. According to Research and Markets, ethical AI governance, computational efficiency, and data fusion complexity remain key hurdles.
Even as organizations work to align AI systems with societal values and accountability standards, integrating diverse data types remains difficult, requiring sophisticated algorithms to deliver accurate results. Businesses increasingly rely on AI-driven tools to optimize workflows, reduce errors, and boost productivity. However, these tools require vast amounts of high-quality data for effective training, increasing the risk of misinterpretation or bias. Low-quality data can undermine a model's reasoning ability, leading to unreliable results.
Sophisticated multimodal models rely on advanced architectures, including transformers, capsule networks, and memory networks. These architectures, however, demand extensive training data and computational resources to support real-time computation, making them costly to build and maintain.
To support the rise of multimodal AI, Richard Liu, President of ICT Marketing and Solution Sales at Huawei, emphasized that uplink bandwidth must expand tenfold to handle multimodal interactions, with connections extending beyond human-to-human communication to include human-to-machine and machine-to-machine interactions.
Moreover, privacy remains a critical concern, particularly for multimodal AI systems processing sensitive data from healthcare records, social media, wearables, and smartphones. Integrating data from multiple sources increases the risk of breaches, highlighting the growing need for robust privacy safeguards and ethical guidelines to protect sensitive data.
Advances in Multimodal AI
Recent advances in multimodal AI are poised to revolutionize industries, with healthcare at the forefront. Microsoft Health and Life Sciences launched a fine-tuning capability for MedImageInsight in Azure Machine Learning, promising state-of-the-art results for medical imaging tasks with 93% fewer parameters. The model has also been reported to reduce radiologists' workload by 42%.
Google Cloud enhanced its Vertex AI Search platform for healthcare with multimodal AI capabilities by integrating a visual question-and-answer (Q&A) feature. The feature enables search across tables, charts, and diagrams, analyzing patient records and information spread over numerous healthcare data sources and types. Vertex AI Search with visual Q&A can answer queries after evaluating forms, saving time and delivering more accurate answers.
Beyond healthcare, Google released an AI Mode feature integrated into Google Lens. Powered by multimodal capabilities, the feature handles broad topics, multi-faceted queries, and questions that would otherwise require multiple searches. Users can click or upload an image alongside a query, and AI Mode returns a comprehensive response along with links to related material. Google's AI Mode can interpret the entire image, understanding how objects relate to one another as well as their materials, colors, and shapes.
Cohere launched Embed 4, an advanced multimodal embedding model for enterprise search that supports large volumes of data, including documents combining text and visuals. Embed 4 can generate embeddings for documents up to 128K tokens (around 200 pages) in length, such as annual financial reports, product manuals, or detailed legal contracts, improving enterprise search accuracy.
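Embedding models of this kind power enterprise search by mapping queries and documents into a shared vector space and ranking documents by similarity. The sketch below illustrates only that retrieval pattern under simplifying assumptions: toy_embed is a hypothetical stand-in for a real embedding API such as Embed 4, and the corpus snippets are invented placeholders.

```python
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a real embedding model (e.g., Embed 4 via an API).
    Hashes character trigrams into a fixed-size, normalized vector so the
    example runs offline."""
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def search(query: str, documents: dict[str, str], top_k: int = 2) -> list[tuple[str, float]]:
    """Rank documents by cosine similarity between query and document embeddings."""
    query_vec = toy_embed(query)
    scores = {name: float(toy_embed(body) @ query_vec) for name, body in documents.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Invented placeholder corpus mimicking the document types named above.
corpus = {
    "annual_report": "Revenue grew 12% year over year; operating margin improved...",
    "product_manual": "To reset the device, hold the power button for ten seconds...",
    "legal_contract": "The licensee shall not redistribute the software without consent...",
}
print(search("how do I reset the device?", corpus))
```

In practice, documents are embedded with a real model (often after chunking), and the vectors are stored in a vector database rather than compared in memory.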
Similarly, Alibaba Cloud introduced Qwen2.5-Omni-7B to its Qwen AI series. This unified, end-to-end multimodal system can process text, images, audio, and video while generating real-time text and natural speech responses.
OpenAI also launched its most advanced reasoning models to date, o3 and o4-mini, which are trained to think for longer before responding and can reason over images as well as text. OpenAI's o3 and o4-mini models can use web search, Python, image analysis, file interpretation, and image generation, setting the groundwork for agentic AI systems.
Meanwhile, Meta introduced the Llama 4 series, which includes the Llama 4 Scout and Llama 4 Maverick models. These natively multimodal AI models integrate different data types, including text, image, and video, supporting text-in/text-out and image-in/text-out use cases.
SenseTime launched SenseNova V6, leveraging advanced training in multimodal long chain-of-thought (CoT) reasoning, global memory, and reinforcement learning to deliver industry-leading multimodal reasoning capabilities. In addition to having what SenseTime describes as the lowest reasoning costs in the industry, SenseNova V6 is also China's first large model to support in-depth analysis of 10-minute, mid-to-long-form videos.
Moreover, Apple introduced its first multimodal AI model, MM1, in 2024, aimed at enhancing Siri and iMessage by enabling the understanding of both images and text. The model is one outcome of Apple's reported multi-million-dollar daily investment in AI training, which began in late 2023.
Eleven Labs and Infer.so also collaborated to create a multimodal AI voice bot set to revolutionize the e-commerce and fintech industries with lifelike, contextually aware voice interactions.
Final Thoughts
As AI continues to evolve, human-AI collaboration will only deepen. By enabling technology to process information from multiple sources, multimodal AI bridges the gap between human and machine understanding.
This pivotal leap in AI technology will not only enhance AI’s existing capabilities but also pave the way for the development of more advanced AI systems.
The responsible deployment of multimodal AI will transform industries and shape the next generation of interconnected and highly intelligent solutions.