Nvidia's Blackwell Ultra tops MLPerf AI Benchmarks

Jensen Huang, Founder and CEO of Nvidia
Nvidia's Blackwell Ultra systems and NVFP4 training set record performance across seven MLPerf Training v5.1 tests, cementing its lead in AI hardware

Nvidia has demonstrated the performance of its hardware in the latest MLPerf Training v5.1 benchmarks, showcasing its Blackwell Ultra GPU architecture.

Nvidia posted the fastest result on all seven benchmarks in the suite, setting new records in the process.

This achievement could highlight the versatility of its CUDA software stack, as Nvidia was the only company to submit entries across every category in the suite.

The results show the fastest times for training large language models (LLMs), image generation models, recommender systems, computer vision and graph neural networks.

At the Nvidia GTC in Washington DC, Jensen Huang, CEO and Founder of Nvidia, spoke about Nvidia's long-term focus.

“For 30 years, we have been developing this form of computing we call accelerated computing,” Jensen explains.

“We invented the GPU, we invented the programming model called CUDA. Its moment has now arrived.”

Nvidia dominates MLPerf Training v5.1 benchmarks | Credit: Nvidia

MLPerf v5.1 as a standard for AI performance

MLPerf Training v5.1 is an industry benchmark for artificial intelligence performance created and managed by MLCommons, an open engineering consortium.

The suite provides a standardised set of system tests that measure the performance of both hardware and software across a variety of machine learning applications.

The benchmarks are updated regularly to reflect the rapid advancements in the field of AI.

The latest version, v5.1, introduced two new benchmarks: Llama 3.1 8B for LLMs and FLUX.1 for text-to-image models.

These are in addition to five existing benchmarks: Llama 3.1 405B for pre-training, Llama 2 70B LoRA for fine-tuning, RetinaNet for object detection, RGAT for graph node classification and DLRM-dcnv2 for recommender systems.

Paul Baumstarck, Co-Chair of the MLPerf Training working group, calls the field of AI a moving target

“The field of AI is a moving target constantly evolving with new scenarios and capabilities,” says Paul Baumstarck, Co-Chair of the MLPerf Training working group.

“We will continue to evolve the MLPerf Training benchmark suite to ensure that we are measuring what is important to the community both today and tomorrow.”

Record-breaking performance with Blackwell Ultra architecture

Nvidia's performance gains are attributed to two things: its new architecture, whose new Tensor Cores offer 15 petaflops of AI compute, and adapted training methods.

According to Nvidia, its Blackwell GB300 NVL72 rack-scale system delivered four times the performance on the Llama 3.1 405B benchmark and nearly five times on the Llama 2 70B benchmark compared to previous rounds.

Nvidia delivered 4x the performance on Llama 3.1 405B and nearly 5x on Llama 2 70B compared to previous MLPerf benchmarking rounds | Credit: Nvidia

Nvidia also set a new record for training the Llama 3.1 405B model, completing the task in just 10 minutes.

During his address at GTC, Jensen highlighted Nvidia's market position.

Jensen says: “If you look at the list of GPUs, you could actually benchmark it is 90% Nvidia.”

“So we are comparing ourselves to ourselves.”

This perspective points to Nvidia's dominant presence in the high-performance computing space for AI.

NVFP4 precision and its computational edge

A key factor in this performance was Nvidia’s use of its NVFP4 precision for calculations.

This method performs computations on data represented by fewer bits.

While using fewer bits can lead to a decrease in accuracy, it can also increase the speed of computation.
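To put the trade-off in concrete terms, the short Python sketch below estimates how much memory the weights of a 405-billion-parameter model would occupy at different bit widths. It is a back-of-the-envelope illustration only: the parameter count is taken from the model's name, the helper `weight_storage_gb` is hypothetical, and the figures ignore scale factors, optimiser state and activations, all of which add to the real footprint during training.

```python
def weight_storage_gb(num_params, bits_per_value):
    """Gigabytes (10**9 bytes) needed to store num_params values."""
    return num_params * bits_per_value / 8 / 1e9


# Assumption: 405 billion parameters, as the model name suggests.
PARAMS_405B = 405e9

for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label}: {weight_storage_gb(PARAMS_405B, bits):.1f} GB")
```

Halving the bit width halves the bytes that have to be stored and moved, which is where much of the potential speed-up comes from.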

Video: Jensen Huang’s Keynote Highlights at NVIDIA GTC Washington, D.C.

To mitigate the potential for lower accuracy, Nvidia has employed architectural innovations, including a high-precision scale encoding and a two-level micro-block scaling strategy.

This approach is designed to reduce the memory burden and simplify computing operations.

By extension, this lessens the system's dependence on memory bandwidth, which in turn could improve overall performance.
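The idea of two-level scaling can be illustrated with a simplified Python sketch. This is not Nvidia's implementation. It assumes a 4-bit float with one sign bit, two exponent bits and one mantissa bit (so magnitudes of 0, 0.5, 1, 1.5, 2, 3, 4 and 6), micro-blocks of 16 values, one scale per tensor and one scale per micro-block. For readability the scales are kept as ordinary Python floats rather than being packed into a compact high-precision encoding, and all function names are hypothetical.

```python
# Magnitudes assumed representable by the 4-bit float (sign handled separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = FP4_GRID[-1]
BLOCK_SIZE = 16  # assumed micro-block length


def round_to_fp4(x):
    """Round x to the nearest signed value on the 4-bit grid."""
    magnitude = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return -magnitude if x < 0 else magnitude


def quantize(values, block_size=BLOCK_SIZE):
    """Return (tensor_scale, block_scales, codes) for a flat list of floats."""
    # Level 1: a single scale for the whole tensor.
    tensor_max = max((abs(v) for v in values), default=0.0)
    tensor_scale = tensor_max / FP4_MAX if tensor_max > 0 else 1.0

    block_scales, codes = [], []
    for start in range(0, len(values), block_size):
        block = [v / tensor_scale for v in values[start:start + block_size]]
        # Level 2: one scale per micro-block, so small values keep resolution.
        block_max = max(abs(v) for v in block)
        block_scale = block_max / FP4_MAX if block_max > 0 else 1.0
        block_scales.append(block_scale)
        codes.extend(round_to_fp4(v / block_scale) for v in block)
    return tensor_scale, block_scales, codes


def dequantize(tensor_scale, block_scales, codes, block_size=BLOCK_SIZE):
    """Rebuild approximate floats from the 4-bit codes and both scales."""
    return [
        code * block_scales[i // block_size] * tensor_scale
        for i, code in enumerate(codes)
    ]
```

Because each micro-block carries its own scale, a block of very small values is not flattened to zero just because another block in the same tensor contains large ones. In this sketch the reconstruction error for any value is bounded by one sixth of the largest magnitude in its block.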

Nvidia’s clean sweep across all MLPerf v5.1 categories could reinforce its position in AI training performance.

With new precision formats and the Blackwell architecture delivering substantial gains, Nvidia has established a high bar for future benchmark rounds and competitors.
