Everything You Need to Know About Meta AI's DINO and SAM

AI is reshaping how meaning is drawn from visual information.
Two of Meta AI’s most significant innovations in computer vision – DINO and SAM – may share the same research lineage, but they serve distinct roles.
Together, they’re revolutionising how both humans and systems perceive and act on complex visual scenes.
DINO: Self-supervised vision foundation learning
DINO, short for self-distillation with no labels, is Meta AI’s self-supervised learning (SSL) approach to computer vision, building powerful visual representations without any labelled data.
Traditionally, computer vision models rely on large datasets of human-labelled images to recognise objects or scenes.
DINO discards that constraint by enabling the model to learn directly from the intrinsic structure of images.
At the heart of DINO lies a student–teacher framework.
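A teacher network guides a student network to align its representations across different views of the same input; in the original DINO formulation the two share an architecture, and the teacher’s weights are a moving average of the student’s.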
Over time, the student internalises transferable, semantic features that generalise across varied image domains.
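The student–teacher idea can be illustrated with a minimal sketch in plain Python. It is not Meta's implementation: the function names are hypothetical, the "networks" are reduced to lists of numbers, and the temperatures and momentum are illustrative defaults in the range commonly cited for DINO. It shows the two core ingredients: a loss that pushes the student's output distribution towards a centred, sharpened teacher distribution, and a teacher whose weights slowly track the student's.

```python
import math


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def softmax(logits, temperature):
    """Temperature-scaled softmax, built on log_softmax."""
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student output distributions.

    The teacher output is centred (to discourage collapse onto one
    dimension) and sharpened with a lower temperature. All default
    values here are assumptions chosen for illustration.
    """
    centred = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centred, teacher_temp)
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Move each teacher weight a small step towards the student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]
```

In a real training loop the student and teacher logits would come from two different augmented views of the same image, only the student would receive gradients, and `ema_update` would run after each optimiser step.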
With the DINOv2 family, Meta trained on approximately 142 million label-free images, yielding robust representations capable of powering diverse tasks straight out of the box – from classification and depth estimation to semantic segmentation – often without fine-tuning.
Now, DINOv3 scales SSL to new heights, delivering Meta AI’s most capable universal vision backbones yet, enabling leading performance across numerous vision-driven applications.
SAM for flexible segmentation
While DINO builds general visual understanding, the Segment Anything Model (SAM) zeroes in on segmentation – analysing and dividing images into distinct, meaningful regions or objects.
SAM is designed to be promptable: whether given a simple click, a bounding box or a text description, it produces high-fidelity, pixel-accurate masks of the relevant regions.
What makes SAM exceptional is its generalisation.
It can accurately segment objects it’s never encountered before, making it one of the most flexible tools across computer vision domains.
Its versatility allows SAM to power interactive annotation, automated labelling pipelines and other vision-integrated systems.
Since its debut, Meta has continually refined SAM.
The model now supports accelerated segmentation and multimodal integration – fusing visual input with language or other AI modalities.
That adaptability has widened SAM’s use cases, spanning healthcare, robotics and augmented reality.
The DARPA triage challenge
Advances in computer vision, robotics, and AI are being stress-tested under extreme conditions.
The US Defense Advanced Research Projects Agency (DARPA) has launched a three-year Triage Challenge to revolutionise emergency medical assessment through autonomous systems.
The mission: use stand-off sensors on drones and robots to remotely detect vital signs and injury severity, even in low-connectivity, high-stress environments.
Each scenario replicates mass-casualty conditions – darkness, debris, heavy smoke, explosions and visual obstructions – testing how well autonomous systems can locate, classify and prioritise victims when time is critical.
Teams are scored not just on how many casualties they identify, but also on classification accuracy and how fast they notify responders before critical windows close.
Pushing the boundaries of medical AI
The University of Pennsylvania’s Penn Robotic Non-contact Triage and Observation (PRONTO) team unites Penn Medicine surgeons with computer vision and robotics experts from Penn Engineering and the GRASP Lab.
Their mission blends Meta’s DINO and SAM with advanced robotics to perform non-contact injury assessments.
During Phase 1 of the 2024 DARPA Triage Challenge, PRONTO used an aerial drone for scene scanning and a ground robot for stabilised imaging and physiological data capture.
SAM 2 handled image segmentation, accurately isolating human forms and injury sites even under degraded visual conditions.
The team’s pipeline integrated SAM, DINO and Grounding DINO – an open-vocabulary object detection model that interprets text prompts such as “wound?” or “blood?” to highlight relevant visual features.
DINO then derived high-level visual features that fed bespoke neural networks trained to detect injuries, classify severity and estimate vital signs including heart rate, respiration, and consciousness – all without touch.
Each patient’s location and physiological data were displayed on a mobile interface, giving first responders rapid, data-driven prioritisation in chaotic environments.
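The flow described above (text-prompted detection, segmentation, feature extraction, task-specific assessment, then prioritisation) can be sketched as a simple composition of stages. This is an assumed structure for illustration only, not PRONTO's actual code: the `Casualty` fields, the `run_triage` function and the four stage callables are all hypothetical stand-ins for Grounding DINO, SAM 2, DINO and the team's bespoke networks.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class Casualty:
    """Hypothetical per-patient record shown to responders."""
    location: Tuple[float, float]  # assumed scene coordinates
    severity: float                # assumed score, higher = more urgent
    heart_rate_bpm: float          # assumed vital-sign estimate


def run_triage(
    frame: Any,
    detect: Callable[[Any, Sequence[str]], List[Any]],  # stand-in for a text-prompted detector
    segment: Callable[[Any, Any], Any],                 # stand-in for a promptable segmenter
    embed: Callable[[Any, Any], Any],                   # stand-in for a feature backbone
    assess: Callable[[Any, Any], Casualty],             # stand-in for task-specific heads
    prompts: Sequence[str] = ("wound?", "blood?"),
) -> List[Casualty]:
    """Run each stage in turn and return casualties, most urgent first."""
    casualties = []
    for region in detect(frame, prompts):
        mask = segment(frame, region)    # isolate the person or injury site
        features = embed(frame, mask)    # high-level visual features
        casualties.append(assess(region, features))
    # Sort so that responders see the highest-severity casualty first.
    return sorted(casualties, key=lambda c: c.severity, reverse=True)
```

Passing the stages in as callables keeps the sketch independent of any particular model library; each one could be swapped for a real model wrapper without changing the control flow.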
A major byproduct of the DARPA programme is its unprecedented dataset.
For the first time, researchers can systematically compare triage approaches across consistent, realistic scenarios – providing a new evidence base for digital and robotic medical response.
As the competition progresses, teams will test how the latest versions of DINO and SAM can further advance triage precision in complex conditions.
“We are really interested in making this application work in the real world,” says Professor Eric Eaton, Team Lead for PRONTO.
“The people I have on my team are trauma surgeons that deal with this in the trenches every day and researchers working on state-of-the-art robotics and machine learning.
“Together, we are looking to develop technologies that could be useful in saving lives.”
Once confined to academic benchmarks, foundation models like DINO and SAM now demonstrate how adaptable AI can directly tackle urgent, high-stakes problems in medicine and defence – where every moment matters.


