Roboflow 100-VL

A Multi-Domain Object Detection Benchmark for Vision-Language Models

CVPR 2025 Workshop


Figure 1. We identify a set of 100 challenging datasets from Roboflow Universe that contain concepts not typically found in internet-scale pre-training.

Roboflow 100 Vision Language (RF100-VL) is the first benchmark to ask, “How well does your VLM understand the real world?” In pursuit of this question, RF100-VL introduces 100 open source datasets containing object detection bounding boxes and multimodal few-shot annotator instructions (image-text pairs) across novel image domains. The dataset comprises 164,149 images and 1,355,491 annotations across seven domains, including aerial, biological, and industrial imagery. 1,693 hours were spent labeling, reviewing, and preparing the dataset.

RF100-VL is a curated sample from Roboflow Universe, a repository of over 500,000 datasets that collectively demonstrate how computer vision is being used in production today. Current state-of-the-art models trained on web-scale data, such as Qwen2.5-VL and Grounding DINO, achieve as low as 1% AP on some categories represented in RF100-VL.
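
To get a feel for what a single RF100-VL dataset looks like on disk, here is a minimal sketch that inspects one dataset exported in COCO format using pycocotools; the directory layout and annotation file name are assumptions rather than the official download structure.

# Minimal sketch: inspect one RF100-VL dataset exported in COCO format.
# The path below is a hypothetical example, not the official layout.
from pycocotools.coco import COCO

ann_file = "rf100-vl/some-dataset/train/_annotations.coco.json"  # hypothetical path
coco = COCO(ann_file)

print(f"{len(coco.getImgIds())} images, {len(coco.getAnnIds())} annotations")
for cat in coco.loadCats(coco.getCatIds()):
    n_boxes = len(coco.getAnnIds(catIds=[cat["id"]]))
    print(f"  class '{cat['name']}': {n_boxes} boxes")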

Abstract

Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution tasks (e.g. material property estimation, defect detection, and contextual action recognition) and imaging modalities (e.g. X-rays, thermal-spectrum data, and aerial images) not typically found in their pre-training. Rather than simply re-training VLMs on more data, we argue that out-of-domain generalization can be addressed in a principled way through the lens of few-shot learning, by aligning VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce RF100-VL, a large-scale collection of 100 multi-modal datasets with diverse concepts not commonly found in VLM pre-training. Notably, state-of-the-art models like Grounding DINO and Qwen2.5-VL achieve less than 1% AP on many domains in RF100-VL. Our code and dataset are available on GitHub and Roboflow.

Explore the Dataset

You can explore the RF100-VL dataset with our interactive, CLIP-based vector chart below.

The chart illustrates the discrete clusters of data that comprise the RF100-VL dataset.
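
The interactive chart itself is not reproduced here, but the idea behind it can be sketched: embed each image with a pretrained CLIP model and project the embeddings to 2D, where images from the same domain form visible clusters. The checkpoint, the t-SNE projection, and the image paths below are illustrative choices, not the exact pipeline used to build the chart.

# Sketch: CLIP image embeddings projected to 2D for a cluster chart.
# Model choice, projection method, and paths are illustrative assumptions.
import torch
from PIL import Image
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["imgs/aerial_0.jpg", "imgs/xray_3.jpg"]  # placeholder paths; use many more in practice
images = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)        # one embedding per image
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize

# Project to 2D; perplexity must stay below the number of samples.
xy = TSNE(n_components=2, perplexity=min(30, len(images) - 1)).fit_transform(feats.numpy())
print(xy)  # 2D coordinates; clusters roughly correspond to dataset domains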

Contributions

RF100-VL introduces a novel evaluation for assessing the efficacy of vision-language models (VLMs) and traditional object detectors in real-world settings. As models become increasingly capable, they need to be evaluated on difficult, practical, real-world scenarios. The Roboflow open source community is increasingly representative of how computer vision is actually used in practice, spanning over 500M user-labeled and shared images. RF100-VL is the culmination of aggregating the best high-quality open source datasets (including those cited in Nature) from a wide array of domains, then relabeling and verifying annotation quality. It supports few-shot object detection from image prompts, annotator instructions, and human-readable class names. RF100-VL builds on RF100, a benchmark Roboflow introduced at CVPR 2023 that labs from companies like Apple, Baidu, and Microsoft use to benchmark vision model capabilities.

CVPR 2025 Workshop: Challenge of Few-Shot Object Detection from Annotator Instructions

Organized by: Anish Madan, Neehar Peri, Deva Ramanan, Shu Kong

Introduction

This challenge focuses on few-shot object detection (FSOD) with 10 examples of each class provided by a human annotator. Existing FSOD benchmarks repurpose well-established datasets like COCO by partitioning categories into base and novel classes for pre-training and fine-tuning respectively. However, these benchmarks do not reflect how FSOD is deployed in practice.

Rather than pre-training on only a small number of base categories, we argue that it is more practical to download a foundation model (e.g., a vision-language model (VLM) pre-trained on web-scale data) and fine-tune it for specific applications. We propose a new FSOD benchmark protocol that evaluates detectors pre-trained on any external dataset (not including the target dataset) and fine-tuned on K annotated examples for each of C target classes.

We instantiate our new FSOD benchmark on the challenging nuImages dataset. Specifically, participants are allowed to pre-train their detector on any dataset (except nuScenes or nuImages) and can fine-tune on 10 examples of each of the 18 classes in nuImages.
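
As a concrete illustration of this protocol, the sketch below samples K = 10 annotated examples per class from a COCO-format annotation file to build a few-shot fine-tuning split. It is a generic example with placeholder paths, not the official split-generation code for the challenge.

# Sketch: build a 10-shot fine-tuning split from a COCO-format annotation file.
# Paths are placeholders; this is not the official split-generation code.
import json
import random
from collections import defaultdict

K = 10
with open("annotations/train.json") as f:  # hypothetical path
    data = json.load(f)

anns_per_cat = defaultdict(list)
for ann in data["annotations"]:
    anns_per_cat[ann["category_id"]].append(ann)

random.seed(0)
kept_anns = []
for cat_id, anns in anns_per_cat.items():
    kept_anns.extend(random.sample(anns, min(K, len(anns))))

# Note: a stricter K-shot protocol would also account for the other boxes
# present on each sampled image; this sketch keeps only the sampled boxes.
kept_image_ids = {a["image_id"] for a in kept_anns}
few_shot = {
    "images": [im for im in data["images"] if im["id"] in kept_image_ids],
    "annotations": kept_anns,
    "categories": data["categories"],
}
with open("annotations/train_10shot.json", "w") as f:
    json.dump(few_shot, f)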

Benchmarking Protocols

Goal: Develop robust object detectors using only a few annotations provided via annotator instructions. The detector should detect object instances of interest in real-world test images.

Environment for model development:

Evaluation metrics:

Submission Details

Output format: one JSON file of predicted bounding boxes for all test images in a COCO-compatible format, for example:

[
  {"image_id": 0, "category_id": 79, "bbox": [976, 632, 64, 80], "score": 99.32915569311469, "image_width": 8192, "image_height": 6144, "scale": 1},
  {"image_id": 2, "category_id": 18, "bbox": [323, 0, 1724, 237], "score": 69.3080951903575, "image_width": 8192, "image_height": 6144, "scale": 1},
  ...
]
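
A minimal sketch of serializing detector outputs into this format follows; it assumes boxes are given as [x, y, width, height] in pixels, as in COCO, and the values shown are placeholders.

# Sketch: dump detector outputs as one COCO-compatible JSON file in the
# format shown above. The `results` entries are placeholder values.
import json

results = [  # would come from your detector
    {"image_id": 0, "category_id": 79, "box": [976, 632, 64, 80], "score": 0.99,
     "width": 8192, "height": 6144},
]

predictions = []
for r in results:
    predictions.append({
        "image_id": r["image_id"],
        "category_id": r["category_id"],
        "bbox": r["box"],            # [x, y, width, height] in pixels
        "score": r["score"],
        "image_width": r["width"],
        "image_height": r["height"],
        "scale": 1,
    })

with open("predictions.json", "w") as f:
    json.dump(predictions, f)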

Dataset Details

nuImages is a large-scale 2D detection dataset that extends the popular nuScenes 3D detection dataset. It includes 93,000 images (with 800k foreground objects and 100k semantic segmentation masks) from nearly 500 driving logs. Scenarios are selected using an active-learning approach, ensuring that both rare and diverse examples are included. The annotated images include rain, snow, and nighttime conditions, which are essential for autonomous driving applications.
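
The dataset can be browsed with the official nuImages devkit (pip install nuscenes-devkit); the minimal sketch below assumes the v1.0-mini split has already been downloaded to the path shown.

# Minimal sketch: browse nuImages with the official devkit.
# The data root path is an assumption; adjust to your local download.
from nuimages import NuImages

nuim = NuImages(dataroot="/data/sets/nuimages", version="v1.0-mini",
                verbose=True, lazy=True)

print(len(nuim.sample), "annotated samples")
print(len(nuim.object_ann), "2D object boxes")
for cat in nuim.category:
    print(cat["name"])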

Official Baseline

We pre-train Detic on ImageNet-21K, COCO Captions, and LVIS, and fine-tune it on 10 shots of each nuImages class.
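
The Detic training recipe itself is not reproduced here. As a much simpler stand-in that illustrates the fine-tuning step, the sketch below adapts a torchvision Faster R-CNN head to the 18 nuImages classes and runs one training step on a fabricated batch; it is a generic example, not the official baseline.

# Generic few-shot fine-tuning sketch with torchvision Faster R-CNN
# (NOT the Detic baseline): swap the box head for 18 classes and train.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 18 + 1  # 18 nuImages classes + background
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=1e-4)
model.train()

# One training step on a fabricated batch, just to show the API shape;
# in practice, iterate over a DataLoader built from the 10-shot split.
images = [torch.rand(3, 480, 640)]
targets = [{"boxes": torch.tensor([[100.0, 120.0, 200.0, 260.0]]),  # xyxy
            "labels": torch.tensor([3])}]
loss_dict = model(images, targets)  # returns a dict of losses in train mode
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()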

Timeline

References

  1. Zhou et al. "Detecting Twenty-Thousand Classes Using Image-Level Supervision." Proceedings of the European Conference on Computer Vision (ECCV). 2022.
  2. Caesar et al. "nuScenes: A Multi-Modal Dataset for Autonomous Driving." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020.

Citation

You can cite RF100-VL in your research using the following citation:

@misc{Robicheaux_Roboflow_100_VL,
    author = {Robicheaux, Peter and Popov, Matvei and Madan, Anish and Robinson, Isaac and Ramanan, Deva and Peri, Neehar},
    title = {{Roboflow 100 VL}},
    url = {https://github.com/roboflow/rf100-vl/}
}