Zeng, W., Luo, W., Suo, S., Sadat, A., Yang, B., Casas, S., & Urtasun, R. (2019). End-To-End Interpretable Neural Motion Planner. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA. https://doi.org/10.1109/cvpr.2019.00886
This paper implements a learning-based planning approach through trajectory sampling plus cost-map prediction.
The network takes a lidar point cloud and an HD map as input, extracts features with a CNN, and makes predictions with MLP heads. The predictions fall into two parts, which can be understood as a perception task and a planning task. The perception task covers 3D detection and future motion forecasting; the planning task predicts a dense cost volume. The input feature space also carries temporal information: perception inputs from multiple past frames are fused and concatenated into a feature stack that preserves the time dimension. For the planning task, supervision comes mainly from the ground-truth (human-driven) trajectory: the region around the GT trajectory should have low cost. But this signal is very sparse, so learning purely from GT trajectories is hard; the two perception tasks are therefore added as auxiliary supervision to shape the shared backbone features, which also improves the planning result. In the paper's words: "we introduce an another perception loss that encourages the intermediate representations to produce accurate 3D detections and motion forecasting. This ensures the interpretability of the intermediate representations and enables much faster learning."
In addition, the HD map stores semantic information about the road environment: "we exploit HD maps that contain information about the semantics of the scene such as the location of lanes, their boundary type (e.g., solid, dashed) and the location of stop signs." Static road elements such as roads, intersections, lane markings and traffic lights are extracted as the static part of the cost map: each element type is rasterized into its own layer, yielding M channels, which are then combined with the T temporal slices extracted from the lidar point cloud and handed to the downstream planning.
A trajectory is costed by indexing voxel-wise costs from the cost map. The perception input is the lidar point cloud voxelized into an H, W, Z grid; to capture the dynamics of other agents over time, T past sweeps are fused by stacking along the Z axis, giving H, W, ZT. On top of that, to account for road-environment elements, each map element gets its own channel, covering road, intersections, lanes, lane boundaries, and traffic lights: "Similar to [5], we rasterize the map to form an M channels tensor, where each channel represents a different map element, including road, intersections, lanes, lane boundaries, traffic lights, etc." The input therefore ends up with shape H, W, (ZT+M).
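To make the tensor shapes concrete, here is a minimal NumPy sketch of the input stacking described above; all sizes (H, W, Z, T, M) are made-up examples, not the paper's actual resolutions.

```python
import numpy as np

# Hypothetical bird's-eye-view grid sizes (illustrative, not from the paper).
H, W, Z, T, M = 100, 100, 8, 5, 6

# T past lidar sweeps, each voxelized into an H x W x Z occupancy grid.
lidar_sweeps = [(np.random.rand(H, W, Z) > 0.95).astype(np.float32)
                for _ in range(T)]

# Stack the T sweeps along the height axis: H x W x (Z*T).
lidar_feat = np.concatenate(lidar_sweeps, axis=2)

# M rasterized HD-map layers (road, intersections, lanes, boundaries, lights...).
map_layers = np.zeros((H, W, M), dtype=np.float32)

# Final backbone input: H x W x (Z*T + M).
net_input = np.concatenate([lidar_feat, map_layers], axis=2)
print(net_input.shape)  # (100, 100, 46)
```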
The perception backbone is a CNN feeding two heads: a perception head that predicts bounding boxes and motion forecasting, and a cost-volume head that predicts the cost volume; the latter is the focus here. It is trained with a max-margin loss, with the human-driven trajectory as ground truth. The loss separates the region covered by the human trajectory from everything else: where the human drove, the cost should be low. "The intuition behind is to encourage the ground-truth trajectory to have the minimal cost, and others to have higher costs."
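A rough NumPy sketch of the costing and the max-margin idea: a trajectory's cost is the sum of the cost-volume entries it indexes, and the ground truth must beat each negative by a margin. The grid sizes, the random trajectories, and the fixed margin of 1.0 (standing in for the d and γ terms discussed below) are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, T = 100, 100, 5                       # hypothetical grid size and horizon
cost_volume = rng.random((T, H, W)).astype(np.float32)

def traj_cost(cost_volume, traj_xy):
    """Cost of a trajectory = sum of cost-volume cells it passes through,
    one (x, y) cell index per future timestep."""
    t = np.arange(len(traj_xy))
    return cost_volume[t, traj_xy[:, 1], traj_xy[:, 0]].sum()

# Hypothetical ground-truth and negative (sampled) trajectories as cell indices.
gt = np.stack([np.arange(T) + 50, np.full(T, 50)], axis=1)
negatives = [np.clip(gt + rng.integers(-5, 6, size=gt.shape), 0, H - 1)
             for _ in range(10)]

# Max-margin: the GT trajectory should be cheaper than every negative by a
# margin (here a constant 1.0 for simplicity).
c_gt = traj_cost(cost_volume, gt)
margins = [max(0.0, c_gt - traj_cost(cost_volume, n) + 1.0) for n in negatives]
loss = max(margins)   # the hardest (worst-case) negative drives the loss
print(loss)
```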
In the loss, c denotes the cost read from the cost volume, d the distance between a sampled trajectory and the ground-truth trajectory, and γ the traffic-rule violation penalty.
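The note does not reproduce the formula itself; a hedged reconstruction of a max-margin planning loss consistent with these symbols (a sketch, not a verbatim copy of the paper's equation) would be:

$$
\mathcal{L} \;=\; \sum_{t} \max_{\tau \in \mathcal{N}} \Big[\, c_t(\tau^{gt}) - c_t(\tau) + d_t(\tau) + \gamma_t(\tau) \,\Big]_{+}
$$

where \(\mathcal{N}\) is the set of sampled negative trajectories and \([\cdot]_+ = \max(0, \cdot)\). The margin grows with the negative's distance d from the GT and its rule violation γ, so trajectories that are far from the human one or illegal must receive an even higher cost before the hinge term vanishes.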
For negative samples, a large number of trajectories deviating from the human-driven one must be drawn. Besides the planning-anchor sampling logic below, the initial state is also slightly perturbed: "except there is 0.8 probability that the negative sample doesn't obey SDV's initial states, e.g. we randomly sample a velocity to replace SDV's initial velocity."
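A tiny sketch of that perturbation; the 0.8 probability is from the paper, while the velocity range `v_max` and the uniform distribution are my assumptions.

```python
import random

def perturb_initial_velocity(v0, v_max=15.0, p=0.8, rng=random):
    """With probability p, ignore the SDV's true initial velocity and draw a
    random one, so most negative samples need not obey the SDV's initial
    state. v_max and the uniform range are assumed values."""
    if rng.random() < p:
        return rng.uniform(0.0, v_max)
    return v0

random.seed(0)
samples = [perturb_initial_velocity(5.0) for _ in range(1000)]
replaced = sum(1 for v in samples if v != 5.0)
print(replaced / 1000)  # roughly 0.8
```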
planning anchors
Laterally, trajectories are sampled along Clothoid (spiral) curves.
Longitudinally, constant acceleration is assumed and the acceleration is sampled directly, which is quite coarse.
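As a reminder of what these two families look like (standard textbook forms, not copied from the paper): a Clothoid's curvature varies linearly with arc length, and the constant-acceleration profile is just uniformly accelerated motion:

$$
\kappa(s) = \kappa_0 + \kappa' s, \qquad s(t) = s_0 + v_0 t + \tfrac{1}{2} a t^2
$$

Sampling \(\kappa_0, \kappa'\) gives the lateral shape, and sampling \(a\) gives the longitudinal speed profile along it.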
The paper also notes: "Note that Clothoid curves cannot handle circle and straight line trajectories well, thus we sample them separately." Since a Clothoid cannot represent straight lines or circles, straight driving and constant-curvature turns (e.g., U-turns) would be problematic, so these are sampled separately, with the mix: "The probability of using straightline, circle and Clothoid curves are 0.5, 0.25, 0.25 respectively."
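The stated mixture is easy to sketch; the probabilities are from the paper's quote, the function itself is an illustrative stand-in for the actual sampler.

```python
import random

def sample_curve_type(rng=random):
    """Pick the lateral curve family with probabilities straight 0.5,
    circle 0.25, Clothoid 0.25 (as quoted from the paper)."""
    u = rng.random()
    if u < 0.5:
        return "straight"
    elif u < 0.75:
        return "circle"
    return "clothoid"

# Empirically the mix matches 0.5 / 0.25 / 0.25.
random.seed(0)
counts = {"straight": 0, "circle": 0, "clothoid": 0}
for _ in range(10000):
    counts[sample_curve_type()] += 1
print(counts)
```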
experiments
The experiments look at L2 distance, collision rate, and lane violation rate, with several baselines for comparison:
Imitation Learning (IL): pure imitation learning ("imitation is all you need").
Adaptive Cruise Control (ACC): not described in detail, but judging from the later experimental analysis it presumably adds a lane-violation loss.
Plan w/ Manual Cost (Manual): hand-designed cost.
The comparison results are as follows:
The takeaway: "Egomotion and IL baselines give lower L2 numbers as they optimize directly for this metric, however they are not good from planning perspective as they have difficulty reasoning about other actors and collide frequently with them."