End-to-end autonomous driving has great potential in the transportation industry. However, the lack of transparency and interpretability of the automatic decision-making process hinders its industrial adoption in practice. There have been some early attempts to use attention maps or cost volume for better model explainability, but these are difficult for ordinary passengers to understand. To bridge the gap, we propose an end-to-end transformer-based architecture, ADAPT (Action-aware Driving cAPtion Transformer), which provides user-friendly natural language narrations and reasoning for each decision-making step of autonomous vehicular control and action. ADAPT jointly trains the driving caption task and the vehicular control prediction task through a shared video representation. Experiments on the BDD-X (Berkeley DeepDrive eXplanation) dataset demonstrate state-of-the-art performance of the ADAPT framework on both automatic metrics and human evaluation. To illustrate the feasibility of the proposed framework in real-world applications, we build a novel deployable system that takes raw car videos as input and outputs the action narrations and reasoning in real time. The code, models and data are available at https://github.com/jxbbb/ADAPT.
The goal of an autonomous system is to gain precise perception of the environment, make safe real-time decisions, take reliable actions without human involvement, and provide a safe and comfortable ride experience for passengers. There are generally two paradigms for autopilot controller design: mediation-aware methods [1], [2] and end-to-end learning approaches [3]–[23]. Mediation-aware approaches rely on recognizing human-specified features such as vehicles and lane markings, which requires rigorous parameter tuning to achieve satisfactory performance. In contrast, end-to-end methods directly take raw sensor data as input to generate planned routes or control signals.
One of the key challenges in deploying such autonomous control systems to real vehicles is that the intelligent decision-making policies in autonomous cars are often too complicated for common passengers to understand, for whom the safety and controllability of such vehicles are the top priority.
Some previous work has explored the interpretation of autonomous navigation [13], [14], [24]–[30]. A cost map, for example, is employed in [13] to interpret the actions of a self-driving system by visualizing the difficulty of traversing different areas of the map. Visual attention is utilized in [24] to filter out non-salient image regions, and [31] constructs a BEV (bird's-eye view) to visualize the motion information of the vehicle. However, these interfaces can easily lead to misinterpretation if the user is unfamiliar with the system.
An ideal solution is to include natural language narrations that guide the user throughout the decision-making and action-taking process of the autonomous control module, which is comprehensible and user-friendly. Furthermore, an additional reasoning explanation for each control/action decision can help users understand the current state of the vehicle and the surrounding environment, serving as supporting evidence for the actions taken by the autonomous vehicle. For example, "[Action narration:] the car pulls over to the right side of the road, [Reasoning:] because the car is parking", as shown in Fig. 1. Explaining vehicle behaviors via natural language narrations and reasoning thus makes the whole autonomous system more transparent and easier to understand.
Fig. 1. Different ways of explaining the behavior of autonomous vehicles, including attention maps [24], cost volume [13], and natural language. While attention maps or cost volume are effective, language-based explanations are more friendly to ordinary passengers.
To this end, we propose ADAPT, the first action-aware transformer-based driving action captioning architecture that provides passengers with user-friendly natural language narrations and reasoning for the actions of autonomous driving vehicles. To eliminate the discrepancy between the captioning task and the vehicular control signal prediction task, we jointly train the two tasks with a shared video representation. This multi-task framework can be built upon various end-to-end autonomous systems by incorporating a text generation head.
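As a rough illustration of this shared-representation idea, the sketch below trains a captioning head and a control-signal head on top of one video encoder. Module names, dimensions, and the loss weighting are hypothetical placeholders for exposition, not the released ADAPT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCaptionControlModel(nn.Module):
    """Hypothetical multi-task model: one shared video encoder feeding both a
    driving-caption decoder and a control-signal regression head."""

    def __init__(self, video_encoder, caption_decoder, hidden_dim=768, num_signals=2):
        super().__init__()
        self.video_encoder = video_encoder      # e.g. a video transformer backbone
        self.caption_decoder = caption_decoder  # e.g. a BERT-style text generator
        # Regress control signals (e.g. speed and course) from the shared video tokens.
        self.control_head = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, num_signals),
        )

    def forward(self, frames, caption_tokens):
        video_feats = self.video_encoder(frames)                    # (B, T, hidden_dim)
        caption_logits = self.caption_decoder(video_feats, caption_tokens)
        control_pred = self.control_head(video_feats.mean(dim=1))   # (B, num_signals)
        return caption_logits, control_pred

def joint_loss(caption_logits, caption_labels, control_pred, control_labels, alpha=1.0):
    """Caption cross-entropy plus control-signal regression, weighted by alpha."""
    ce = F.cross_entropy(caption_logits.flatten(0, 1), caption_labels.flatten())
    mse = F.mse_loss(control_pred, control_labels)
    return ce + alpha * mse
```

In such a setup the two heads share gradients through the video encoder, which is what allows the captioning supervision and the control supervision to regularize each other.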
We demonstrate the effectiveness of the ADAPT approach on a large-scale dataset that consists of driving videos and control signals along with action narrations and reasoning. Based on ADAPT, we build a novel deployable system that takes raw vehicular navigation videos as input and generates the action narrations and reasoning explanations in real time.
Our contributions can be summarized as:
• We propose ADAPT, a new end-to-end transformer-based action narration and reasoning framework for self-driving vehicles.
• We propose a multi-task joint training framework that aligns both the driving action captioning task and the control signal prediction task.
• We develop a deployable pipeline for the application of ADAPT in both the simulator environment and the real world.
The main goal of the video captioning task is to describe the objects in a given video and their relationships in natural language. Early works [32]–[35] generate sentences with specific syntactic structures by filling recognized elements into fixed templates, which is inflexible and lacks richness. [36]–[45] exploit sequence learning approaches to generate natural sentences with flexible syntactic structures. Specifically, these methods employ a video encoder to extract frame features and a language decoder to learn visual-textual alignment for caption generation. To enrich captions with fine-grained objects and actions, [46]–[48] exploit object-level representations that capture detailed object-aware interaction features in videos. [49] further develops a novel dual-branch convolutional encoder to jointly learn the content and semantic information of videos. Moreover, [50] adapts the uni-modal transformer to video captioning and employs sparse boundary-aware pooling to reduce the redundancy in video frames. The development of scene understanding [51]–[62] also contributes substantially to the captioning task. Most recently, [63] proposes SWINBERT, an end-to-end transformer-based model that utilizes a sparse attention mask to lessen the redundant and irrelevant information in consecutive video frames.
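To make the sparse-attention idea concrete, a minimal sketch of a learnable soft mask over video tokens is given below. It follows the general spirit of the sparse attention mask described for SWINBERT [63], but the class name, masking scheme, and sparsity penalty are illustrative assumptions rather than that model's actual formulation.

```python
import torch
import torch.nn as nn

class LearnableSparseVideoMask(nn.Module):
    """Hypothetical learnable soft attention mask over video tokens.

    Each pairwise logit is squashed to (0, 1); a sparsity penalty pushes most
    entries toward zero so attention concentrates on salient video tokens."""

    def __init__(self, num_video_tokens: int):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(num_video_tokens, num_video_tokens))

    def forward(self, attention_scores: torch.Tensor) -> torch.Tensor:
        # attention_scores: (B, heads, T, T) raw scores over video tokens.
        soft_mask = torch.sigmoid(self.mask_logits)             # (T, T), values in (0, 1)
        return attention_scores + torch.log(soft_mask + 1e-6)   # down-weight masked pairs

    def sparsity_loss(self) -> torch.Tensor:
        # Added to the captioning loss to encourage a sparse mask.
        return torch.sigmoid(self.mask_logits).mean()
```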
While existing architectures achieve promising results for general video captioning, they cannot be directly applied to driving action representation: simply transferring video captioning to self-driving action description would miss key information such as the speed of the vehicle, which is essential in an autonomous system. How to effectively use such multimodal information to generate sentences remains an open question, and is the focus of our work.
Learning-based autonomous driving is an active research area [64], [65]. Learning-based driving methods such as affordances [3], [4] and reinforcement learning [5]–[7] have been employed with promising performance. Imitation methods [8]–[13] are also utilized to regress control commands from human demonstrations. For example, [14]–[16] model the future behavior of driving agents such as vehicles, cyclists or pedestrians to predict vehicular waypoints, while [17]–[23] predict vehicular control signals directly from the sensor input, which is similar to our control signal prediction sub-task; a minimal sketch of such direct regression is given below.
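The sketch below illustrates the direct control-signal regression paradigm in its simplest imitation-learning form (a single-frame CNN regressed against demonstrated commands). All names, shapes, and the network itself are illustrative assumptions, not the architecture of any cited system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectControlRegressor(nn.Module):
    """Hypothetical behavioral-cloning baseline: regress control commands
    (e.g. speed and steering course) directly from a single camera frame."""

    def __init__(self, num_signals: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_signals)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(image))   # (B, num_signals)

# Imitation objective: match the control commands of the human demonstration.
def imitation_loss(pred_commands: torch.Tensor, demo_commands: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(pred_commands, demo_commands)
```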