Nvidia’s New Tools Accelerate Custom AI Model Development

Fine-tuning large language models (LLMs) is a critical process for businesses aiming to adapt AI systems for their specific operational needs, from customer service chatbots to specialised coding assistants.
A key challenge is ensuring smaller, customised language models can deliver consistently high accuracy for niche tasks, especially when domain-specific knowledge or strict formatting is required.
To address this, Nvidia has released new guidance on fine-tuning these models using Unsloth, an open-source framework optimised for Nvidia's graphics processing units (GPUs).
This accompanies the launch of Nvidia's Nemotron 3 family of open models, which are designed for AI applications that can perform actions on behalf of users.
Fine-tuning AI models with the Unsloth framework
“Fine-tuning is like giving an AI model a focused training session,” explains Annamalai Chockalingam, a product professional focused on LLM products at Nvidia.
Chockalingam adds: “With examples tied to a specific topic or workflow, the model improves its accuracy by learning new patterns and adapting to the task at hand.”
Developers have three main approaches to choose from. Parameter-efficient fine-tuning, which uses techniques like LoRA or QLoRA, updates a small fraction of the model for quicker training and requires between 100 and 1,000 prompt-sample pairs.
This method is suitable for adding domain knowledge or improving coding accuracy.
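To see why LoRA touches only a small fraction of a model, it helps to count parameters for a single layer. The sketch below is illustrative only: it assumes the standard LoRA formulation, in which a frozen weight matrix receives a trainable low-rank update built from two small matrices, and the layer size and rank are hypothetical values rather than figures from Nvidia or Unsloth.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of parameters trained when a d_out x d_in weight matrix is frozen
    and a rank-`rank` update B @ A is trained in its place."""
    full_params = d_out * d_in           # frozen base weight W
    lora_params = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return lora_params / full_params


# Hypothetical layer size and rank, chosen only for illustration.
fraction = lora_trainable_fraction(d_out=4096, d_in=4096, rank=16)
print(f"Trainable share of this layer: {fraction:.2%}")  # prints 0.78%
```

Under these assumed numbers, less than 1% of the layer's parameters are trained, which is consistent with the quicker training described above.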
Full fine-tuning adjusts all of the model’s parameters and is useful for teaching a model to follow specific formats or styles.
Nvidia suggests this approach for building AI agents and chatbots that need to provide assistance on particular topics and respond in a specific way, a process that requires over 1,000 prompt-sample pairs.
The third option is reinforcement learning, a more advanced technique that modifies model behaviour using feedback. This method is suited for creating autonomous agents or improving accuracy in specialised fields such as law or medicine.
By translating complex mathematical operations into custom GPU kernels, the Unsloth framework can boost the performance of the Hugging Face transformers library by 2.5 times on Nvidia GPUs. These optimisations are effective across Nvidia’s hardware, from GeForce RTX laptops to DGX Spark.
Nemotron 3: a hybrid architecture for agentic AI
The Nemotron 3 family of models utilises a hybrid latent Mixture-of-Experts architecture. The first model available, Nemotron 3 Nano 30B-A3B, is described by Nvidia as the most compute-efficient model in the series.
Although it contains 30 billion parameters, it activates only three billion during inference. According to Nvidia, the model can also produce up to 60% fewer reasoning tokens, which lowers inference costs. It includes a one-million-token context window, allowing it to process a larger volume of information for complex multistep tasks.
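The gap between total and active parameters comes from how Mixture-of-Experts models route each token to a subset of experts. The following is a toy sketch of generic top-k routing, not Nemotron 3's actual router: the expert count, per-expert size, scores and the `route_top_k` helper are all hypothetical. Only the final line uses the article's figures.

```python
def route_top_k(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy configuration (hypothetical values, not Nemotron 3's real layout).
num_experts = 8
params_per_expert = 1_000_000
router_scores = [0.1, 0.7, 0.05, 0.3, 0.9, 0.2, 0.4, 0.6]

chosen = route_top_k(router_scores, k=2)
toy_active_share = len(chosen) * params_per_expert / (num_experts * params_per_expert)
print(f"Experts used for this token: {chosen}")
print(f"Toy active share: {toy_active_share:.0%}")

# Figures from the article: 3 billion active out of 30 billion total.
nano_active_share = 3e9 / 30e9
print(f"Nemotron 3 Nano active share: {nano_active_share:.0%}")
```

By the article's numbers, roughly a tenth of the model's parameters take part in processing any given token.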
Nvidia has optimised Nemotron 3 Nano for software debugging, content summarisation and AI assistant workflows.
Two other models in the family, Nemotron 3 Super and Nemotron 3 Ultra, are planned for release in the first half of 2026. Nemotron 3 Super will target high-accuracy reasoning for multi-agent applications, while Nemotron 3 Ultra is designed for more complex AI applications.
Alongside the models, Nvidia has released an open collection of training datasets and reinforcement learning libraries. Developers can access Nemotron 3 Nano through platforms such as Hugging Face, Llama.cpp and LM Studio.
Local fine-tuning with DGX Spark
The DGX Spark system provides developers with powerful desktop computing capabilities. Built on Nvidia’s Grace Blackwell architecture, the system can deliver up to one petaflop of FP4 AI performance and includes 128GB of unified CPU-GPU memory.
This unified memory architecture supports the use of larger models, allowing those with over 30 billion parameters to fit within DGX Spark’s capacity, which can be a challenge on consumer GPUs.
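A back-of-envelope calculation shows why memory capacity matters here. The sketch below estimates the space needed for model weights alone at a few common precisions. It is a simplification under stated assumptions: it ignores gradients, optimiser state and activations, which add considerably to training memory, and the 24GB consumer GPU figure is a hypothetical comparison point rather than a number from Nvidia.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

UNIFIED_MEMORY_GB = 128  # DGX Spark figure from the article
CONSUMER_GPU_GB = 24     # assumption: a hypothetical high-end consumer card


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    needed = weight_memory_gb(30e9, precision)
    print(f"30B weights at {precision}: {needed:.0f} GB | "
          f"fits in 128 GB: {needed <= UNIFIED_MEMORY_GB} | "
          f"fits in 24 GB: {needed <= CONSUMER_GPU_GB}")
```

At 16-bit precision, the weights of a 30-billion-parameter model already occupy about 60GB, well beyond the assumed consumer card but within DGX Spark's unified memory.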
The system is capable of handling full fine-tuning and reinforcement learning workflows, which demand greater memory and higher throughput than parameter-efficient methods.
This allows developers to run compute-intensive tasks locally, avoiding the need for cloud instances. According to Nvidia, the system can fine-tune a Llama model with eight billion parameters at a rate of 6,832 tokens per second using parameter-efficient methods or 4,294 tokens per second with full fine-tuning.
For larger models with 70 billion parameters, DGX Spark achieves 1,461 tokens per second with parameter-efficient fine-tuning.
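These throughput figures can be turned into rough training-time estimates. The sketch below does that for a single pass over a dataset. The dataset size and average tokens per sample are assumptions chosen for illustration, and the estimate ignores data loading, evaluation and other overheads, so real runs would take longer.

```python
# Tokens per second reported by Nvidia for DGX Spark, as quoted in the article.
THROUGHPUT_TOKENS_PER_S = {
    ("8B", "parameter-efficient"): 6832,
    ("8B", "full"): 4294,
    ("70B", "parameter-efficient"): 1461,
}


def epoch_seconds(num_samples: int, tokens_per_sample: int,
                  tokens_per_second: float) -> float:
    """Time for one pass over the dataset at a steady throughput."""
    return num_samples * tokens_per_sample / tokens_per_second


# Assumptions: 1,000 prompt-sample pairs averaging 512 tokens each.
NUM_SAMPLES = 1_000
TOKENS_PER_SAMPLE = 512

estimates = {
    key: epoch_seconds(NUM_SAMPLES, TOKENS_PER_SAMPLE, rate)
    for key, rate in THROUGHPUT_TOKENS_PER_S.items()
}
for (size, method), seconds in estimates.items():
    print(f"{size} model, {method} fine-tuning: about {seconds / 60:.1f} minutes per epoch")
```

Under these assumptions, a pass over a dataset of this size takes minutes rather than hours, even for the 70-billion-parameter case.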
The system’s applications extend beyond language models. It can also run high-resolution diffusion models, which often require more memory than standard desktops provide, and Nvidia says it can generate 1,000 images in seconds.


