RefAV: Towards Planning-Centric Scenario Mining

Carnegie Mellon University
[Teaser figure]

Abstract

Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques, which often rely on hand-crafted structured queries, are error-prone and prohibitively time-consuming. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning, derived from 1,000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges.


The RefAV Dataset

[Gallery: example scenarios mined from natural language queries]

- Pedestrian with multiple dogs
- Ego vehicle merging between two regular vehicles
- Bicycle passing between bus and construction barrier
- Passenger vehicle near stroller in road
- Ego vehicle approaching construction zone with traffic controller
- Pedestrians walking between two stopped vehicles


RefProg: Referential Program Synthesis

[RefProg overview figure]

RefProg is a dual-path method that independently generates 3D perception outputs and Python-based programs for referential grounding. Given raw LiDAR and RGB inputs, RefProg runs an offline 3D perception model to generate high-quality 3D tracks. In parallel, it prompts an LLM to generate code that identifies the referred track. Finally, the generated code is executed to filter the output of the offline 3D perception model into a final set of referred objects (shown in green), related objects (shown in blue), and other objects (shown in red).
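To make the program-synthesis path concrete, below is a minimal sketch of the kind of referential program an LLM might emit for a query like "pedestrian near a stopped vehicle". The Track structure and the atomic predicates (of_category, near) are hypothetical stand-ins for illustration, not the actual RefAV API, and the sketch assumes all tracks share a common set of timestamps.

from dataclasses import dataclass
import numpy as np

@dataclass
class Track:
    track_id: str
    category: str          # e.g., "PEDESTRIAN", "REGULAR_VEHICLE"
    positions: np.ndarray  # (T, 2) xy positions, one row per timestamp

def of_category(tracks, category):
    # Atomic predicate: select tracks of a given semantic class.
    return [t for t in tracks if t.category == category]

def near(track, others, dist_m=5.0):
    # Atomic predicate: timestamps at which `track` is within
    # `dist_m` meters of any track in `others`.
    hits = set()
    for other in others:
        d = np.linalg.norm(track.positions - other.positions, axis=1)
        hits.update(np.nonzero(d < dist_m)[0].tolist())
    return hits

def referred_objects(tracks):
    # LLM-generated program for "pedestrian near a stopped vehicle":
    # compose atomic predicates to filter the offline perception output.
    peds = of_category(tracks, "PEDESTRIAN")
    vehicles = of_category(tracks, "REGULAR_VEHICLE")
    stopped = [v for v in vehicles
               if len(v.positions) > 1
               and np.linalg.norm(np.diff(v.positions, axis=0), axis=1).max() < 0.1]
    return {p.track_id: sorted(near(p, stopped)) for p in peds if near(p, stopped)}

Executing a program like this against the offline 3D tracks yields the referred objects and the timestamps at which the described interaction occurs; the other objects named in the query (here, the stopped vehicles) would correspond to the related set.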


Qualitative Results

[Qualitative result panels. Strengths: Unambiguous Scenario, Nested Relationships. Weaknesses: Semantics, Expressivity.]


Argoverse 2 Scenario Mining Competition 2025

Eight teams submitted to our benchmark as part of the CVPR 2025 Workshop on Autonomous Driving (WAD) competition. Thank you to everyone who participated!

Rank | Participant team             | HOTA-Temporal (↑) | HOTA-Track (↑) | Timestamp Bal. Acc. (↑) | Log Bal. Acc. (↑)
1    | Zeekr_UMCV (Zeekr_UMCV_v1.2) | 53.38             | 51.05          | 76.62                   | 66.34
2    | Mi3 UCM_AV2 (Mi3_test_v1)    | 52.37             | 51.53          | 77.48                   | 65.82
3    | zxh                          | 52.09             | 50.24          | 76.12                   | 66.52
4    | LiDAR_GPT_VLM (v2)           | 51.92             | 51.91          | 76.90                   | 66.74
5    | RefProg                      | 50.15             | 51.13          | 74.03                   | 68.31
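For readers unfamiliar with the two right-hand columns: balanced accuracy is the mean of the true-positive rate and the true-negative rate, computed per timestamp or per log depending on the column. A minimal sketch of that standard definition (not the official evaluation code) follows.

import numpy as np

def balanced_accuracy(y_true, y_pred):
    # Mean of true-positive and true-negative rates. Inputs are boolean
    # arrays with one entry per timestamp (or per log) indicating whether
    # the described scenario occurs.
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tpr = (y_pred & y_true).sum() / max(y_true.sum(), 1)
    tnr = (~y_pred & ~y_true).sum() / max((~y_true).sum(), 1)
    return 0.5 * (tpr + tnr)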

Citation

If you find our paper and code repository useful, please cite us:

@article{davidson2025refav,
  title={RefAV: Towards Planning-Centric Scenario Mining},
  author={Davidson, Cainan and Ramanan, Deva and Peri, Neehar},
  journal={arXiv preprint arXiv:2505.20981},
  year={2025}
}