RefAV: Towards Planning-Centric Scenario Mining

Carnegie Mellon University
[Teaser figure]

Abstract

Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques, which often rely on hand-crafted structured queries, are error-prone and prohibitively time-consuming. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning, derived from 1,000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges.


The RefAV Dataset

[Gallery: example scenarios mined from natural language queries]

- Pedestrian with multiple dogs
- Ego vehicle merging between two regular vehicles
- Bicycle passing between bus and construction barrier
- Passenger vehicle near stroller in road
- Ego vehicle approaching construction zone with traffic controller
- Pedestrians walking between two stopped vehicles


RefProg: Referential Program Synthesis

[RefProg overview figure]

RefProg is a dual-path method that independently generates 3D perception outputs and Python-based programs for referential grounding. Given raw LiDAR and RGB inputs, RefProg runs an offline 3D perception model to generate high-quality 3D tracks. In parallel, it prompts an LLM to generate code that identifies the referred track. Finally, the generated code is executed to filter the output of the offline 3D perception model into a final set of referred objects (shown in green), related objects (shown in blue), and other objects (shown in red).
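To make the program-synthesis path concrete, below is a minimal sketch of the kind of referential program an LLM might emit for a query like "pedestrian near a stopped vehicle". The Track structure and the atomic predicates (of_category, near) are hypothetical stand-ins for illustration, not the actual RefAV API, and the sketch assumes all tracks share a common set of timestamps.

from dataclasses import dataclass
import numpy as np

@dataclass
class Track:
    track_id: str
    category: str          # e.g., "PEDESTRIAN", "REGULAR_VEHICLE"
    positions: np.ndarray  # (T, 2) xy positions, one row per timestamp

def of_category(tracks, category):
    # Atomic predicate: select tracks of a given semantic class.
    return [t for t in tracks if t.category == category]

def near(track, others, dist_m=5.0):
    # Atomic predicate: timestamps at which `track` is within
    # `dist_m` meters of any track in `others`.
    hits = set()
    for other in others:
        d = np.linalg.norm(track.positions - other.positions, axis=1)
        hits.update(np.nonzero(d < dist_m)[0].tolist())
    return hits

def referred_objects(tracks):
    # LLM-generated program for "pedestrian near a stopped vehicle":
    # compose atomic predicates to filter the offline perception output.
    peds = of_category(tracks, "PEDESTRIAN")
    vehicles = of_category(tracks, "REGULAR_VEHICLE")
    stopped = [v for v in vehicles
               if len(v.positions) > 1
               and np.linalg.norm(np.diff(v.positions, axis=0), axis=1).max() < 0.1]
    return {p.track_id: sorted(near(p, stopped)) for p in peds if near(p, stopped)}

Executing a program like this against the offline 3D tracks yields the referred objects and the timestamps at which the described interaction occurs; the other objects named in the query (here, the stopped vehicles) would correspond to the related set.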


Qualitative Results

[Qualitative result panels. Strengths: Unambiguous Scenario, Nested Relationships. Weaknesses: Semantics, Expressivity.]


Argoverse 2 Scenario Mining Competition 2025

Eight teams submitted to our benchmark as part of the CVPR 2025 Workshop on Autonomous Driving (WAD) competition. Thank you to everyone who participated!

Rank | Participant team             | HOTA-Temporal (↑) | HOTA-Track (↑) | Timestamp Bal. Acc. (↑) | Log Bal. Acc. (↑)
1    | Zeekr_UMCV (Zeekr_UMCV_v1.2) | 53.38             | 51.05          | 76.62                   | 66.34
2    | Mi3 UCM_AV2 (Mi3_test_v1)    | 52.37             | 51.53          | 77.48                   | 65.82
3    | zxh                          | 52.09             | 50.24          | 76.12                   | 66.52
4    | LiDAR_GPT_VLM (v2)           | 51.92             | 51.91          | 76.90                   | 66.74
5    | RefProg                      | 50.15             | 51.13          | 74.03                   | 68.31
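For readers unfamiliar with the two right-hand columns: balanced accuracy is the mean of the true-positive rate and the true-negative rate, computed per timestamp or per log depending on the column. A minimal sketch of that standard definition (not the official evaluation code) follows.

import numpy as np

def balanced_accuracy(y_true, y_pred):
    # Mean of true-positive and true-negative rates. Inputs are boolean
    # arrays with one entry per timestamp (or per log) indicating whether
    # the described scenario occurs.
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tpr = (y_pred & y_true).sum() / max(y_true.sum(), 1)
    tnr = (~y_pred & ~y_true).sum() / max((~y_true).sum(), 1)
    return 0.5 * (tpr + tnr)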

Citation

If you find our paper and code repository useful, please cite us:

@article{davidson2025refav,
  title={RefAV: Towards Planning-Centric Scenario Mining},
  author={Davidson, Cainan and Ramanan, Deva and Peri, Neehar},
  journal={arXiv preprint arXiv:2505.20981},
  year={2025}
}