How do you combine SigLIP2, DINOv3, and SAM3 into a single vision backbone without sacrificing dense or segmentation performance? NVIDIA’s C-RADIOv4 is a new agglomerative vision backbone that distills three strong teacher models, SigLIP2-g-384, DINOv3-7B, and SAM3, into a single student encoder. It extends the AM-RADIO and RADIOv2.5 line, keeping a similar computational cost while improving dense prediction quality, resolution robustness, and drop-in compatibility with SAM3.
The key idea is simple. Instead of choosing between a vision language model, a self-supervised dense model, and a segmentation model, C-RADIOv4 tries to approximate all three at once with one backbone.


Agglomerative distillation in RADIO
The RADIO family uses agglomerative distillation. A single ViT-style student is trained to match both dense feature maps and summary tokens from several heterogeneous teachers.
Earlier RADIO models combined DFN CLIP, DINOv2, and SAM. They already supported multi-resolution training but showed ‘mode switching’, where the representation changed qualitatively as the input resolution changed. Later work such as PHI-S, RADIOv2.5, and FeatSharp added better multi-resolution distillation and regularization, but the teacher set was still limited.
C-RADIOv4 upgrades the teachers:
- SigLIP2-g-384 for stronger image-text alignment
- DINOv3-7B for high quality self-supervised dense features
- SAM3 for segmentation-oriented features and compatibility with the SAM3 decoder
The student is trained so that its dense features match DINOv3 and SAM3, while its summary tokens match SigLIP2 and DINOv3. This gives one encoder that can support classification, retrieval, dense prediction, and segmentation.
Stochastic multi-resolution training
C-RADIOv4 uses stochastic multi-resolution training rather than a small fixed set of resolutions.
Training samples input sizes from two partitions:
- Low resolution: {128, 192, 224, 256, 384, 432}
- High resolution: {512, 768, 1024, 1152}
SigLIP2 operates natively at 384 pixels. Its features are upsampled by a factor of 3 using FeatSharp to align with 1152 pixel SAM3 features. SAM3 is trained with mosaic augmentation at 1152 × 1152.
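The sampling scheme and the upsampling arithmetic can be illustrated with a small sketch. This is not NVIDIA's training code: the function names and the `p_high` mixing probability are hypothetical, since the article does not say how often each partition is drawn. Only the two resolution sets and the 384 to 1152 factor come from the text.

```python
import random

# Resolution partitions quoted in the article.
LOW_RES = [128, 192, 224, 256, 384, 432]
HIGH_RES = [512, 768, 1024, 1152]


def sample_resolution(rng, p_high=0.5):
    """Pick one square input size for the next training step.

    p_high is a hypothetical mixing probability (assumption, not from the source).
    """
    partition = HIGH_RES if rng.random() < p_high else LOW_RES
    return rng.choice(partition)


def upsample_factor(native_px, target_px):
    """Integer factor needed to bring a teacher's native size up to the target size."""
    if target_px % native_px != 0:
        raise ValueError("target size must be a multiple of the native size")
    return target_px // native_px


rng = random.Random(0)  # seeded so the sketch is reproducible
sizes = [sample_resolution(rng) for _ in range(8)]
print(sizes)

# SigLIP2 at 384 px versus SAM3 at 1152 px gives the factor of 3 mentioned above.
print(upsample_factor(384, 1152))
```

Drawing a fresh size per step, rather than cycling through a short fixed list, is what exposes the student to many resolutions and helps avoid the mode switching seen in earlier RADIO models.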
This design smooths the performance curve over resolution and improves low resolution behavior. For example, on ADE20k linear probing, C-RADIOv4-H reaches around:
- 55.20 mIoU at 512 px
- 57.02 mIoU at 1024 px
- 57.72 mIoU at 1536 px
The scaling trend is close to DINOv3-7B while using roughly an order of magnitude fewer parameters.
Removing teacher noise with shift equivariant losses and MESA
Distilling from large vision models tends to copy their artifacts, not just their useful structure. SigLIP2 has border noise patterns, and ViTDet-style models can show window boundary artifacts. Direct feature regression can force the student to reproduce these patterns.
C-RADIOv4 introduces two shift equivariant mechanisms to suppress such noise:
- Shift equivariant dense loss: Each teacher and the student see independently shifted crops of an image. Before computing the squared error, features are aligned via a shift mapping and the loss uses only overlapping spatial positions. Because the student never sees the same absolute positions as the teacher, it cannot simply memorize position-fixed noise and is forced to track input-dependent structure instead.
- Shift equivariant MESA: C-RADIOv4 also uses MESA-style regularization between the online network and an EMA copy. Here again, the student and its EMA see different crops, features are aligned by a shift, and the loss is applied after layer normalization. This encourages smooth loss landscapes and robustness, while being invariant to absolute position.
In addition, training uses DAMP, which injects multiplicative noise into weights. This further improves robustness to corruptions and small distribution shifts.
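The shift equivariant dense loss can be sketched in a few lines. This toy version is only an illustration under several assumptions: each patch carries a single scalar feature instead of a vector, both crops have the same size, and crop offsets are given in patch units. The helper names (`shift_equivariant_mse`, `crop`) are hypothetical and not taken from the RADIO codebase.

```python
def shift_equivariant_mse(student, teacher, s_off, t_off):
    """Mean squared error over the region that both crops cover.

    student, teacher: 2D lists (H x W), one scalar feature per patch.
    s_off, t_off: (row, col) offset of each crop in the full image, in patches.
    """
    h, w = len(student), len(student[0])
    (sy, sx), (ty, tx) = s_off, t_off
    # Overlap expressed in full-image coordinates.
    y0, y1 = max(sy, ty), min(sy, ty) + h
    x0, x1 = max(sx, tx), min(sx, tx) + w
    if y0 >= y1 or x0 >= x1:
        raise ValueError("crops do not overlap")
    total, count = 0.0, 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            # Map the shared image position back into each crop's own grid.
            diff = student[y - sy][x - sx] - teacher[y - ty][x - tx]
            total += diff * diff
            count += 1
    return total / count


def crop(image, off, size):
    """Cut a size x size window out of a 2D list at offset (row, col)."""
    y, x = off
    return [row[x:x + size] for row in image[y:y + size]]


# Toy "content" features that depend only on image position.
image = [[float(10 * y + x) for x in range(6)] for y in range(6)]
s_off, t_off = (0, 1), (2, 0)
student = crop(image, s_off, 4)
teacher = crop(image, t_off, 4)
clean_loss = shift_equivariant_mse(student, teacher, s_off, t_off)

# Add noise tied to the teacher's own top border row (position-fixed artifact).
noisy_teacher = [row[:] for row in teacher]
noisy_teacher[0] = [v + 1.0 for v in noisy_teacher[0]]
noisy_loss = shift_equivariant_mse(student, noisy_teacher, s_off, t_off)
print(clean_loss, noisy_loss)
```

In this toy setup a student that tracks image content gets zero loss against a clean teacher. The border artifact lands on a different row of the student's grid, so a fixed per-position pattern in the student cannot cancel it across randomly shifted crops.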
Balancing teachers with an angular dispersion aware summary loss
The summary loss in earlier RADIO models used cosine distance between student and teacher embeddings. Cosine distance removes magnitude but not directional dispersion on the sphere. Some teachers, such as SigLIP2, produce embeddings concentrated in a narrow cone, while DINOv3 variants produce more spread out embeddings.
If raw cosine distance is used, teachers with wider angular dispersion contribute larger losses and dominate optimization. In practice, DINOv3 tended to overshadow SigLIP2 in the summary term.
C-RADIOv4 replaces this with an angle normalized loss. The squared angle between student and teacher embeddings is divided by the teacher’s angular dispersion. Measured dispersions show SigLIP2-g-384 around 0.694, while DINOv3-H+ and DINOv3-7B are around 2.12 and 2.19. Normalizing by these values equalizes their influence and preserves both vision language and dense semantics.
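A minimal sketch of that normalization follows, reading the description literally as squared angle divided by dispersion. The function names are hypothetical, the dispersion values are the ones quoted above, and the exact form used in the paper may differ (for example in how the dispersion itself is estimated).

```python
import math

# Angular dispersions quoted in the article.
DISPERSION = {"SigLIP2-g-384": 0.694, "DINOv3-H+": 2.12, "DINOv3-7B": 2.19}


def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp for acos safety
    return math.acos(cos)


def angle_normalized_loss(student, teacher, dispersion):
    """Squared angle between embeddings, divided by the teacher's dispersion."""
    return angle_between(student, teacher) ** 2 / dispersion


# The same 0.3 rad error measured against two different teachers.
student_vec = [1.0, 0.0]
teacher_vec = [math.cos(0.3), math.sin(0.3)]
loss_siglip = angle_normalized_loss(student_vec, teacher_vec, DISPERSION["SigLIP2-g-384"])
loss_dino = angle_normalized_loss(student_vec, teacher_vec, DISPERSION["DINOv3-7B"])
print(loss_siglip, loss_dino)
```

An identical angular error is weighted more heavily for the narrow-cone teacher (SigLIP2) than for the widely dispersed one (DINOv3), which counteracts the tendency of DINOv3 to dominate the summary term.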
Performance: classification, dense prediction, and Probe3d
On ImageNet-1k zero-shot classification, C-RADIOv4-H reaches about 83.09% top-1 accuracy. It matches or improves on RADIOv2.5-H and C-RADIOv3-H across resolutions, with the best performance near 1024 px.
On k-NN classification, C-RADIOv4-H improves over RADIOv2.5 and C-RADIOv3, and matches or surpasses DINOv3 starting around 256 px. DINOv3 peaks near 192–256 px and then degrades, while C-RADIOv4 keeps stable or improving performance at higher resolutions.
Dense and 3D-aware metrics show the intended tradeoff. On ADE20k, PASCAL VOC, NAVI, and SPair, C-RADIOv4-H and the SO400M variant outperform earlier RADIO models and are competitive with DINOv3-7B on dense benchmarks. For C-RADIOv4-H, typical scores are:
- ADE20k: 55.20 mIoU
- VOC: 87.24 mIoU
- NAVI: 63.44
- SPair: 60.57


On Probe3d, which includes Depth Normals, Surface Normals, NAVI, and SPair, C-RADIOv4-H achieves the best NAVI and SPair scores in the RADIO family. Depth and Surface metrics are close to those of C-RADIOv3-H, with small differences in either direction, rather than a uniform improvement.
Integration with SAM3 and ViTDet-mode deployment
C-RADIOv4 is designed to be a drop-in replacement for the Perception Encoder backbone in SAM3. The SAM3 decoder and memory components remain unchanged. A reference implementation is provided in a SAM3 fork. Qualitative examples show that segmentation behavior is preserved for both text prompts such as “shoe”, “helmet”, “bike”, “spectator” and box prompts, and in some reported cases C-RADIOv4-based SAM3 resolves failure cases from the original encoder.
For deployment, C-RADIOv4 exposes a ViTDet-mode configuration. Most transformer blocks use windowed attention, while a few use global attention. Supported window sizes range from 6 × 6 to 32 × 32 tokens, subject to divisibility with patch size and image resolution. On an A100, the SO400M model with window size at most 12 is faster than the SAM3 ViT-L+ encoder across a wide range of input sizes, and the Huge model with window size 8 is close in latency.
This makes C-RADIOv4 a practical backbone for high resolution dense tasks where full global attention at all layers is too expensive.
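The divisibility constraint on window sizes can be checked with a short helper. This is a sketch under assumptions: the patch size of 16 is a hypothetical default not stated in the article, the rule that the window side must evenly divide the token grid side is the simplest reading of "subject to divisibility", and `valid_window_sizes` is not part of any released API.

```python
def valid_window_sizes(image_px, patch_px=16, min_win=6, max_win=32):
    """List window sides (in tokens) that tile the token grid exactly.

    patch_px=16 is an assumed patch size; the 6..32 range is quoted in the article.
    """
    if image_px % patch_px != 0:
        return []  # the image does not split into whole patches
    tokens_per_side = image_px // patch_px
    return [w for w in range(min_win, max_win + 1) if tokens_per_side % w == 0]


# 1152 px at patch 16 gives a 72 x 72 token grid.
print(valid_window_sizes(1152))
# 1024 px at patch 16 gives a 64 x 64 token grid.
print(valid_window_sizes(1024))
```

Under these assumptions a 1152 px input admits more window choices than a 1024 px input, so the window size is best picked per deployment resolution rather than fixed once.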
Key Takeaways
- Single unified backbone: C-RADIOv4 distills SigLIP2-g-384, DINOv3-7B, and SAM3 into one ViT-style encoder that supports classification, retrieval, dense prediction, and segmentation.
- Any-resolution behavior: Stochastic multi-resolution training over {128…1152} px, together with FeatSharp upsampling for SigLIP2, stabilizes performance across resolutions and tracks DINOv3-7B scaling with far fewer parameters.
- Noise suppression via shift equivariance: Shift equivariant dense loss and shift equivariant MESA prevent the student from copying teacher border and window artifacts, focusing learning on input-dependent semantics.
- Balanced multi-teacher distillation: An angular dispersion normalized summary loss equalizes the contribution of SigLIP2 and DINOv3, preserving both text alignment and dense representation quality.
- SAM3 and ViTDet-ready deployment: C-RADIOv4 can directly replace the SAM3 Perception Encoder, offers ViTDet-mode windowed attention for faster high resolution inference, and is released under the NVIDIA Open Model License.

