Roboflow 100-VL
A Multi-Domain Object Detection Benchmark for Vision-Language Models
CVPR 2025 Workshop

Roboflow 100 Vision Language (RF100-VL) is the first benchmark to ask, “How well does your VLM understand the real world?” In pursuit of this question, RF100-VL introduces 100 open-source datasets containing object detection bounding boxes and multimodal few-shot instruction image-text pairs across novel image domains. The dataset comprises 164,149 images and 1,355,491 annotations across seven domains, including aerial, biological, and industrial imagery. Over 1,693 hours were spent labeling, reviewing, and preparing the dataset.
RF100-VL is a curated sample from Roboflow Universe, a repository of over 500,000 datasets that collectively demonstrate how computer vision is being used in production today. Current state-of-the-art models trained on web-scale data, such as Qwen2.5-VL and GroundingDINO, achieve as low as 1% AP on some categories represented in RF100-VL.
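Each dataset in RF100-VL ships with standard bounding-box annotations. As a minimal sketch, assuming a split has been exported in COCO JSON format, it can be inspected with pycocotools; the file path below is a hypothetical placeholder.

```python
# Minimal sketch: inspecting one RF100-VL dataset exported in COCO JSON format.
# The annotation path is a hypothetical placeholder.
from collections import Counter

from pycocotools.coco import COCO

coco = COCO("rf100-vl/some-dataset/train/_annotations.coco.json")  # placeholder path

# Category names double as the human-readable class prompts used to query VLMs.
categories = {c["id"]: c["name"] for c in coco.loadCats(coco.getCatIds())}
print(f"{len(coco.getImgIds())} images, {len(coco.getAnnIds())} boxes, {len(categories)} classes")

# Per-class box counts are useful for spotting long-tailed domains.
counts = Counter(categories[a["category_id"]] for a in coco.loadAnns(coco.getAnnIds()))
print(counts.most_common(10))
```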
Abstract
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution tasks (e.g., material property estimation, defect detection, and contextual action recognition) and imaging modalities (e.g., X-rays, thermal-spectrum data, and aerial images) not typically found in their pre-training. Rather than simply re-training VLMs on more data, we argue that out-of-domain generalization can be addressed in a principled way through the lens of few-shot learning, by aligning VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal datasets with diverse concepts not commonly found in VLM pre-training. Notably, state-of-the-art models like GroundingDINO and Qwen2.5-VL achieve less than 1% AP on many domains in Roboflow100-VL. Our code and dataset are available on GitHub and Roboflow.
Explore the Dataset
You can explore the RF100-VL dataset with our interactive, CLIP-based vector chart below, which illustrates the discrete clusters of data that make up the benchmark.
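If you want to build a similar view locally, the sketch below embeds images with CLIP and projects the embeddings to 2-D; the checkpoint and the t-SNE projection are assumptions for illustration, not necessarily the exact recipe behind the chart above.

```python
# Sketch: embed dataset images with CLIP and project them to 2-D for cluster inspection.
# The checkpoint and t-SNE projection are assumptions, not the exact recipe behind the chart.
from pathlib import Path

import torch
from PIL import Image
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = sorted(Path("rf100-vl-images").glob("*/*.jpg"))  # hypothetical layout
embeddings = []
with torch.no_grad():
    for path in image_paths:
        inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
        features = model.get_image_features(**inputs)
        embeddings.append(torch.nn.functional.normalize(features, dim=-1).squeeze(0))

# Project the normalized embeddings to 2-D; clusters roughly correspond to image domains.
points = TSNE(n_components=2).fit_transform(torch.stack(embeddings).numpy())
```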
Contributions
RF100-VL introduces a novel evaluation for assessing the efficacy of vision-language models (VLMs) and traditional object detectors in real-world settings. As models become increasingly capable, they need to be compared on difficult, practical, real-world scenarios. The Roboflow open-source community is increasingly representative of how computer vision is actually used in production, spanning over 500 million user-labeled and shared images. RF100-VL is the culmination of aggregating the best high-quality open-source datasets (including those cited in Nature) from a wide array of domains, then relabeling them and verifying annotation quality. It supports few-shot object detection from image prompts, annotator instructions, and human-readable class names. RF100-VL builds on RF100, a benchmark Roboflow introduced at CVPR 2023 that labs at companies like Apple, Baidu, and Microsoft use to benchmark vision model capabilities.
CVPR 2025 Workshop: Challenge of Few-Shot Object Detection from Annotator Instructions
Organized by: Anish Madan, Neehar Peri, Deva Ramanan, Shu Kong
Introduction
This challenge focuses on few-shot object detection (FSOD) with 10 examples of each class provided by a human annotator. Existing FSOD benchmarks repurpose well-established datasets like COCO by partitioning categories into base and novel classes for pre-training and fine-tuning, respectively. However, these benchmarks do not reflect how FSOD is deployed in practice.
Rather than pre-training on only a small number of base categories, we argue that it is more practical to download a foundation model (e.g., a vision-language model (VLM) pre-trained on web-scale data) and fine-tune it for specific applications. We propose a new FSOD benchmark protocol that evaluates detectors pre-trained on any external dataset (excluding the target dataset) and fine-tuned on K annotated examples for each of C target classes.
We propose our new FSOD benchmark using the challenging nuImages dataset. Specifically, participants will be allowed to pre-train their detector on any dataset (except nuScenes or nuImages), and can fine-tune on 10 examples of each of the 18 classes in nuImages.
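As a rough illustration of the K-shot protocol, the sketch below samples 10 annotated examples per class from a COCO-format training file; it is a simplification for exposition, not the official split used in the challenge.

```python
# Simplified sketch: draw K annotated examples per class from a COCO-format training file.
# Illustrative only; the challenge provides the actual 10-shot examples per class.
import random
from collections import defaultdict

from pycocotools.coco import COCO

K = 10
coco = COCO("nuimages_train_coco.json")  # hypothetical COCO-format export of nuImages train

rng = random.Random(0)
shots = defaultdict(list)  # category_id -> sampled annotation ids
for cat_id in coco.getCatIds():
    ann_ids = coco.getAnnIds(catIds=[cat_id])
    shots[cat_id] = rng.sample(ann_ids, min(K, len(ann_ids)))

# Keep only the images needed to cover the sampled annotations for fine-tuning.
keep_images = {coco.loadAnns([a])[0]["image_id"] for anns in shots.values() for a in anns}
print(f"{sum(len(v) for v in shots.values())} shots across {len(keep_images)} images")
```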
Benchmarking Protocols
Goal: Develop robust object detectors using the few annotations provided by annotator instructions. The detector should detect object instances of interest in real-world test images.
Environment for model development:
- Pretraining: Models are allowed to pre-train on any existing datasets except nuScenes and nuImages.
- Fine-Tuning: Models can fine-tune on 10 shots from each of the 18 nuImages classes.
- Evaluation: Models are evaluated on the standard nuImages validation set.
Evaluation metrics (a scoring sketch with pycocotools follows this list):
- AP: Average precision averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
- AP50 and AP75: Average precision computed at IoU thresholds of 0.5 and 0.75, respectively.
- AR (average recall): Proposal recall averaged over IoU thresholds from 0.5 to 1.0 in steps of 0.05, regardless of classification accuracy.
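These metrics follow the standard COCO protocol, so a submission can be sanity-checked locally with pycocotools; the file names below are placeholders for the nuImages validation ground truth and your predictions.

```python
# Sketch: score a predictions file with the standard COCO protocol (AP, AP50, AP75, AR).
# File names are placeholders for the validation ground truth and your submission.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("nuimages_val_gt.json")    # ground-truth annotations (placeholder name)
dt = gt.loadRes("predictions.json")  # submission in the format described under Submission Details
evaluator = COCOeval(gt, dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[.5:.95], AP50, AP75, and AR
```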
Submission Details
Output format: One JSON file of predicted bounding boxes for all test images in a COCO-compatible format, for example:
[ {"image_id": 0, "category_id": 79, "bbox": [976, 632, 64, 80], "score": 99.32915569311469, "image_width": 8192, "image_height": 6144, "scale": 1}, {"image_id": 2, "category_id": 18, "bbox": [323, 0, 1724, 237], "score": 69.3080951903575, "image_width": 8192, "image_height": 6144, "scale": 1}, ... ]
Dataset Details
nuImages is a large-scale 2D detection dataset that extends the popular nuScenes 3D detection dataset. It includes 93,000 images (with 800k foreground objects and 100k semantic segmentation masks) from nearly 500 driving logs. Scenarios are selected using an active-learning approach, ensuring that both rare and diverse examples are included. The annotated images cover rain, snow, and nighttime conditions, which are essential for autonomous driving applications.
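To browse the annotations locally, the official nuImages devkit (installed via the nuscenes-devkit package) can be used roughly as follows; the data root is a placeholder, and the table and field names follow the devkit tutorial, so double-check them against your installed version.

```python
# Sketch: browse nuImages with the official devkit (pip install nuscenes-devkit).
# The data root is a placeholder; table and field names follow the devkit tutorial
# and should be verified against the installed version.
from collections import Counter

from nuimages import NuImages

nuim = NuImages(dataroot="/data/sets/nuimages", version="v1.0-val", verbose=True, lazy=True)

print(f"{len(nuim.sample)} samples, {len(nuim.object_ann)} object annotations")

# Count annotations per category to see the class balance of the validation split.
cat_name = {c["token"]: c["name"] for c in nuim.category}
counts = Counter(cat_name[ann["category_token"]] for ann in nuim.object_ann)
print(counts.most_common())
```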
Official Baseline
We pre-train Detic on ImageNet-21K, COCO Captions, and LVIS, and fine-tune it on 10 shots of each nuImages class.
Timeline
- Submission opens: March 1st, 2025
- Submission closes: May 10th, 2025, 11:59 pm Pacific Time
- The top 3 participants on the leaderboard will be invited to give a talk at the workshop
References
- Zhou et al. "Detecting Twenty-Thousand Classes Using Image-Level Supervision." Proceedings of the European Conference on Computer Vision (ECCV). 2022.
- Caesar et al. "nuScenes: A Multimodal Dataset for Autonomous Driving." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020.
Citation
You can cite RF100-VL in your research using the following citation:
@misc{Robicheaux_Roboflow_100_VL,
  author = {Robicheaux, Peter and Popov, Matvei and Madan, Anish and Robinson, Isaac and Ramanan, Deva and Peri, Neehar},
  title = {{Roboflow 100 VL}},
  url = {https://github.com/roboflow/rf100-vl/}
}