Phi-4: Behind Microsoft’s Smaller, Multimodal AI Models

As the AI market matures, a countermovement to the trend of ever-larger language models has gained momentum across the industry. While many companies have focused on developing increasingly powerful but resource-intensive models, a parallel effort to create more efficient small language models (SLMs) has emerged.
Beyond the hype, companies today are confronting the harsh realities of deploying AI in the real world: energy costs, hardware limitations and customers who expect AI to work seamlessly on their devices without draining batteries or requiring constant internet connections.
As a result, Microsoft has positioned itself at the forefront of this movement through its Phi model series, demonstrating that smaller, more carefully engineered models can match or even exceed the capabilities of much larger counterparts in some tasks.
Now, the company has introduced two new additions to its Phi family of SLMs: Phi-4-multimodal and Phi-4-mini. The models, now available through Azure AI Foundry, Hugging Face and the Nvidia API Catalog, aim to provide developers with advanced capabilities in compact formats suited for edge computing and resource-constrained environments.
Phi-4-multimodal, a 5.6-billion-parameter model, represents Microsoft’s first foray into multimodal language models, integrating speech, vision and text processing functionalities. Meanwhile, Phi-4-mini, with 3.8 billion parameters, focuses on text-based tasks with support for sequences up to 128,000 tokens.
“In direct response to customer feedback, we’ve developed Phi-4-multimodal, a 5.6B parameter model, that seamlessly integrates speech, vision and text processing into a single, unified architecture,” said Weizhu Chen, Vice President of Generative AI at Microsoft.
Microsoft Phi-4 multimodal capabilities outperform industry benchmarks
The multimodal version of Phi-4 demonstrates competitive performance against larger models in several areas. According to Microsoft’s benchmarks, it has claimed the top position on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, surpassing the previous record of 6.5% as of February 2025.
As Chen notes, the model is particularly strong in speech-related tasks. “It outperforms specialised models like WhisperV3 and SeamlessM4T-v2-Large in both automatic speech recognition (ASR) and speech translation (ST),” he says.
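For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The following is a minimal Python sketch of that definition only. The function name `word_error_rate` is illustrative, and the sketch assumes simple whitespace tokenisation with no text normalisation, so it is not the leaderboard's actual evaluation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Assumption: words are split on whitespace and compared exactly;
    real evaluations usually normalise casing and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

On this definition, a word error rate of 6.14% corresponds to roughly six word-level errors per 100 reference words.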
On vision, Microsoft reports that despite its relatively small size, Phi-4-multimodal achieves strong performance on mathematical and science reasoning tasks, matching or exceeding the capabilities of models such as Google’s Gemini 2 Flash lite preview and Anthropic’s Claude 3.5 Sonnet in document understanding, chart comprehension, optical character recognition (OCR) and visual science reasoning.
The compact size of both models makes them suitable for deployment in environments with limited computational resources, potentially enabling on-device AI processing with reduced latency.
- 5.6B - Parameters in the Phi-4-multimodal model, significantly smaller than most competing multimodal systems
- 6.14% - Word error rate achieved on the Hugging Face OpenASR leaderboard, setting a new benchmark record
- 128,000 - Maximum token sequence length supported by the Phi-4-mini model, enabling processing of extensive text
Microsoft has also positioned these models for customisation, with the smaller file sizes making fine-tuning more accessible and cost-effective. The company reports examples of significant improvements through fine-tuning, including enhancing speech translation from English to Indonesian from a base performance of 17.4 to 35.5 after three hours of computation on 16 A100 GPUs.
Windows integration positions Microsoft Phi-4 models for Copilot+ PC rollout
Microsoft plans to integrate these SLMs into its broader ecosystem, including Windows and Copilot+ PCs, as AI capabilities are further embedded into consumer products.
“Language models are powerful reasoning engines, and integrating small language models like Phi into Windows allows us to maintain efficient compute capabilities and opens the door to a future of continuous intelligence baked in across all your apps and experiences,” said Vivek Pradeep, Vice President Distinguished Engineer of Windows Applied Sciences.
“Copilot+ PCs will build upon Phi-4-multimodal's capabilities, delivering the power of Microsoft's advanced SLMs without the energy drain,” he adds. “This integration will enhance productivity, creativity, and education-focused experiences, becoming a standard part of our developer platform.”
Microsoft states that both models underwent security and safety testing by internal and external experts using strategies developed by the Microsoft AI Red Team (AIRT). The testing covered areas including cybersecurity, national security, fairness and violence through multilingual probing.
“These models are designed to handle complex tasks efficiently,” Chen adds, “making them ideal for edge case scenarios and compute-constrained environments. Given the new capabilities Phi-4-multimodal and Phi-4-mini bring, the uses of Phi are only expanding.”
Technology Magazine is a BizClik brand


