BEVDet has been open source for a while, and we have steadily added features such as support for Megvii's BEVDepth and FP16. More deployment-related features will follow.
3. BEVDet Data Processing Pipeline
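During training and testing, the raw data passes through the Data Processing Pipeline, which applies data augmentation (image-view augmentation and BEV augmentation) and performs the necessary data preparation (reading images, obtaining the transformation matrices used by the LSS view transformer, and so on).
To keep the description general and complete, we take the data processing flow of BEVDepth4D training as an example. It consists of the subprocesses below. Image-space augmentation is done in LoadMultiViewImageFromFiles_BEVDet, while BEV-space augmentation is done in GlobalRotScaleTrans and RandomFlip3D: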
train_pipeline = [
    # load multiview images, perform image view data augmentation, and prepare
    # transformation for lss view transformer
    dict(
        type='LoadMultiViewImageFromFiles_BEVDet',
        is_train=True,
        data_config=data_config,
        sequential=True,
        aligned=True,
        trans_only=False),
    # load points clouds
    dict(
        type='LoadPointsFromFile',
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        file_client_args=file_client_args),
    # prepare 3D object detection annotations
    dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True),
    # BEV augmentations
    dict(
        type='GlobalRotScaleTrans',
        rot_range=[-0.3925, 0.3925],
        scale_ratio_range=[0.95, 1.05],
        translation_std=[0, 0, 0],
        update_img2lidar=True),
    dict(
        type='RandomFlip3D',
        sync_2d=False,
        flip_ratio_bev_horizontal=0.5,
        flip_ratio_bev_vertical=0.5,
        update_img2lidar=True),
    # Prepare depth supervision for bevdepth with the point clouds
    dict(type='PointToMultiViewDepth', grid_config=grid_config),
    dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range),
    dict(type='ObjectNameFilter', classes=class_names),
    dict(type='DefaultFormatBundle3D', class_names=class_names),
    dict(
        type='Collect3D',
        keys=['img_inputs', 'gt_bboxes_3d', 'gt_labels_3d'],
        meta_keys=('filename', 'ori_shape', 'img_shape', 'lidar2img',
                   'depth2img', 'cam2img', 'pad_shape', 'scale_factor',
                   'flip', 'pcd_horizontal_flip', 'pcd_vertical_flip',
                   'box_mode_3d', 'box_type_3d', 'img_norm_cfg', 'pcd_trans',
                   'sample_idx', 'pcd_scale_factor', 'pcd_rotation',
                   'pts_filename', 'transformation_3d_flow', 'img_info'))
]
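Each entry in such a pipeline is a config dict whose `type` names a transform class. The transforms are called one after another on a shared `results` dict. The sketch below illustrates only this pattern. The registry, `build_pipeline`, `run_pipeline`, and the two toy transforms are hypothetical names made up for illustration and are not the actual mmdet3d implementation.

```python
# Hypothetical mini-registry; the real framework's registry is more elaborate.
TRANSFORMS = {}


def register(cls):
    """Make a transform class available under its class name."""
    TRANSFORMS[cls.__name__] = cls
    return cls


@register
class ToyLoadImage:
    """Stand-in for an image-loading step (does not read any file)."""

    def __init__(self, is_train=True):
        self.is_train = is_train

    def __call__(self, results):
        results['img_inputs'] = 'images + LSS matrices (placeholder)'
        return results


@register
class ToyFlipBEV:
    """Stand-in for a BEV augmentation that records what it did."""

    def __init__(self, flip=True):
        self.flip = flip

    def __call__(self, results):
        results['pcd_horizontal_flip'] = self.flip
        return results


def build_pipeline(cfgs):
    """Instantiate every transform from its config dict."""
    steps = []
    for cfg in cfgs:
        cfg = dict(cfg)  # copy so the caller's config is not modified
        cls = TRANSFORMS[cfg.pop('type')]
        steps.append(cls(**cfg))
    return steps


def run_pipeline(steps, results):
    """Apply the transforms in order to one sample."""
    for step in steps:
        results = step(results)
        if results is None:  # a transform may discard the sample
            return None
    return results


toy_pipeline = build_pipeline([
    dict(type='ToyLoadImage', is_train=True),
    dict(type='ToyFlipBEV', flip=True),
])
sample = run_pipeline(toy_pipeline, {'img_info': {}})
```

Because later steps read what earlier steps wrote (for example the flip flags), the order of the entries in the list matters.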
Generate the transformation matrices used by the LSS view transformer.
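For the temporal BEVDet4D, the images of the adjacent frame are read as well and go through exactly the same image-space augmentation as the current frame (same operations, same magnitudes). The transformation matrices for the LSS view transformer are generated for them too. The difference is that for the current frame we record the transform from the current frame's camera coordinate system to the current frame's lidar coordinate system (currcam2currlidar), whereas for the adjacent frame we record the transform from the adjacent frame's camera coordinate system to the current frame's lidar coordinate system (adjcam2currlidar), as the comments in the code below describe: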
import numpy as np
import torch
from PIL import Image
from pyquaternion import Quaternion

def get_inputs(self, results, flip=None, scale=None):
    imgs = []
    rots = []
    trans = []
    intrins = []
    post_rots = []
    post_trans = []
    cams = self.choose_cams()
    for cam in cams:
        cam_data = results['img_info'][cam]
        filename = cam_data['data_path']
        # read the image
        img = Image.open(filename)

        # matrices used by the lss view transformer
        post_rot = torch.eye(2)     # rotation produced by image-space augmentation
        post_tran = torch.zeros(2)  # translation produced by image-space augmentation
        # camera intrinsics, used to go from image space to the camera frame
        intrin = torch.Tensor(cam_data['cam_intrinsic'])
        # rotation from the camera frame to the lidar frame
        rot = torch.Tensor(cam_data['sensor2lidar_rotation'])
        # translation from the camera frame to the lidar frame
        tran = torch.Tensor(cam_data['sensor2lidar_translation'])

        # augmentation (resize, crop, horizontal flip, rotate)
        resize, resize_dims, crop, flip, rotate = self.sample_augmentation(
            H=img.height, W=img.width, flip=flip, scale=scale)
        # image-space augmentation (resize, crop, horizontal flip, rotate);
        # post_rot and post_tran are updated along with the image
        img, post_rot2, post_tran2 = self.img_transform(
            img, post_rot, post_tran,
            resize=resize, resize_dims=resize_dims,
            crop=crop, flip=flip, rotate=rotate)

        # for convenience, make augmentation matrices 3x3
        post_tran = torch.zeros(3)
        post_rot = torch.eye(3)
        post_tran[:2] = post_tran2
        post_rot[:2, :2] = post_rot2

        imgs.append(self.normalize_img(img))

        if self.sequential:
            # read the adjacent-frame image and apply the same image-space augmentation
            filename_adjacent = results['adjacent']['cams'][cam]['data_path']
            img_adjacent = Image.open(filename_adjacent)
            img_adjacent = self.img_transform_core(
                img_adjacent, resize_dims=resize_dims,
                crop=crop, flip=flip, rotate=rotate)
            imgs.append(self.normalize_img(img_adjacent))
        intrins.append(intrin)
        rots.append(rot)
        trans.append(tran)
        post_rots.append(post_rot)
        post_trans.append(post_tran)

    if self.sequential:
        # The intrinsics and the image-space augmentation are unchanged for the
        # adjacent frame, so post_trans/post_rots/intrins reuse the current frame's.
        # For the camera-to-lidar transform we record adjacent camera -> current lidar:
        # adjcam2currlidar = adjlidar2currlidar @ adjcam2adjlidar
        #                  = adjlidar2currlidar @ currcam2currlidar
        post_trans.extend(post_trans)
        post_rots.extend(post_rots)
        intrins.extend(intrins)

        egocurr2global = np.eye(4, dtype=np.float32)
        egocurr2global[:3, :3] = Quaternion(
            results['curr']['ego2global_rotation']).rotation_matrix
        egocurr2global[:3, 3] = results['curr']['ego2global_translation']

        egoadj2global = np.eye(4, dtype=np.float32)
        egoadj2global[:3, :3] = Quaternion(
            results['adjacent']['ego2global_rotation']).rotation_matrix
        egoadj2global[:3, 3] = results['adjacent']['ego2global_translation']

        lidar2ego = np.eye(4, dtype=np.float32)
        lidar2ego[:3, :3] = Quaternion(
            results['curr']['lidar2ego_rotation']).rotation_matrix
        lidar2ego[:3, 3] = results['curr']['lidar2ego_translation']

        lidaradj2lidarcurr = np.linalg.inv(lidar2ego) @ np.linalg.inv(egocurr2global) \
            @ egoadj2global @ lidar2ego

        trans_new = []
        rots_new = []
        for tran, rot in zip(trans, rots):
            mat = np.eye(4, dtype=np.float32)
            mat[:3, :3] = rot
            mat[:3, 3] = tran
            mat = lidaradj2lidarcurr @ mat
            rots_new.append(torch.from_numpy(mat[:3, :3]))
            trans_new.append(torch.from_numpy(mat[:3, 3]))
        rots.extend(rots_new)
        trans.extend(trans_new)
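The chain of transforms behind adjcam2currlidar can be checked numerically. The sketch below uses made-up toy poses (yaw-only rotations, not nuScenes data) and a hypothetical helper `make_transform`. Like the code above, it assumes the lidar-to-ego and camera-to-lidar calibrations are the same in both frames.

```python
import numpy as np


def make_transform(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (rad) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    mat = np.eye(4)
    mat[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    mat[:3, 3] = translation
    return mat


# Toy values chosen for illustration only.
lidar2ego = make_transform(0.0, [1.0, 0.0, 1.8])
egocurr2global = make_transform(0.10, [105.0, 50.0, 0.0])
egoadj2global = make_transform(0.05, [100.0, 50.0, 0.0])
cam2lidar = make_transform(-0.3, [0.5, 0.2, -0.3])

# adjacent lidar -> adjacent ego -> global -> current ego -> current lidar
lidaradj2lidarcurr = (np.linalg.inv(lidar2ego) @ np.linalg.inv(egocurr2global)
                      @ egoadj2global @ lidar2ego)
adjcam2currlidar = lidaradj2lidarcurr @ cam2lidar

# A homogeneous point expressed in the adjacent frame's camera coordinates.
p_cam = np.array([2.0, 1.0, 10.0, 1.0])

# Apply the same chain one hop at a time as a cross-check.
step_by_step = p_cam
for mat in (cam2lidar, lidar2ego, egoadj2global,
            np.linalg.inv(egocurr2global), np.linalg.inv(lidar2ego)):
    step_by_step = mat @ step_by_step

p_currlidar = adjcam2currlidar @ p_cam
```

Reading the matrix product from right to left gives the order in which the hops are applied, which is why the loop lists them in the reverse order of the product.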
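While applying the usual 3D-space augmentations, we also update the transform from the camera coordinate system to the lidar coordinate system, so that the features produced by the LSS view transformer stay spatially consistent with the augmented targets. Taking RandomFlip3D as an example: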
import torch

def update_transform(self, input_dict):
    # cam2lidar transform before augmentation
    transform = torch.zeros((input_dict['img_inputs'][1].shape[0], 4, 4)).float()
    transform[:, :3, :3] = input_dict['img_inputs'][1]
    transform[:, :3, -1] = input_dict['img_inputs'][2]
    transform[:, -1, -1] = 1.0

    # transform introduced by the augmentation
    aug_transform = torch.eye(4).float()
    if input_dict['pcd_horizontal_flip']:
        aug_transform[1, 1] = -1
    if input_dict['pcd_vertical_flip']:
        aug_transform[0, 0] = -1
    aug_transform = aug_transform.view(1, 4, 4)

    # left-multiply to obtain the cam2lidar transform after augmentation
    new_transform = aug_transform.matmul(transform)
    input_dict['img_inputs'][1][...] = new_transform[:, :3, :3]
    input_dict['img_inputs'][2][...] = new_transform[:, :3, -1]
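To see why the left-multiplication keeps features and targets consistent, here is a small NumPy check with made-up extrinsics. It is a sketch of the idea, not part of the repository. Flipping a point after it has been mapped to the lidar frame gives the same result as mapping it with the updated cam2lidar transform.

```python
import numpy as np

# Hypothetical camera-to-lidar extrinsics (illustrative values only).
cam2lidar = np.eye(4)
cam2lidar[:3, :3] = [[0.0, 0.0, 1.0],
                     [-1.0, 0.0, 0.0],
                     [0.0, -1.0, 0.0]]
cam2lidar[:3, 3] = [1.5, 0.0, 1.6]

# Horizontal BEV flip: negate y in the lidar frame, as in update_transform.
aug = np.eye(4)
aug[1, 1] = -1.0

p_cam = np.array([0.5, -0.2, 12.0, 1.0])   # a point in the camera frame
p_lidar = cam2lidar @ p_cam                # lidar frame, before the flip
p_lidar_flipped = aug @ p_lidar            # where the flipped target lives

new_cam2lidar = aug @ cam2lidar            # the updated transform
p_via_new_transform = new_cam2lidar @ p_cam
```

One consequence worth keeping in mind: after a single flip the 3x3 block has determinant -1, so it is a reflection rather than a pure rotation.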
4. BEVDet Inference
In the BEVDet inference implementation, the core data-processing pieces are the transformations in the LSS View Transformer and the feature alignment in BEVDet4D.
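4.1 LSS View Transformer
In the LSS view transformer, frustum points are first predefined in image space on a regular pattern. Each frustum point has coordinates (x, y, d), where x and y are image-space coordinates measured in pixels and d is the depth measured in meters. With D predefined depth values, each image has D×H×W points. Note that H and W are the feature resolution rather than the image resolution, whereas x and y are defined in image space rather than feature space.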
import torch
import torch.nn as nn

def create_frustum(self):
    # make grid in image plane
    ogfH, ogfW = self.data_config['input_size']
    fH, fW = ogfH // self.downsample, ogfW // self.downsample
    ds = torch.arange(*self.grid_config['dbound'], dtype=torch.float) \
        .view(-1, 1, 1).expand(-1, fH, fW)
    D, _, _ = ds.shape
    xs = torch.linspace(0, ogfW - 1, fW, dtype=torch.float) \
        .view(1, 1, fW).expand(D, fH, fW)
    ys = torch.linspace(0, ogfH - 1, fH, dtype=torch.float) \
        .view(1, fH, 1).expand(D, fH, fW)

    # D x H x W x 3
    frustum = torch.stack((xs, ys, ds), -1)
    return nn.Parameter(frustum, requires_grad=False)
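A NumPy re-implementation with concrete numbers makes the shapes easier to see. The helper name `create_frustum_np` and the values `input_size=(256, 704)`, `downsample=16`, and `dbound=(1.0, 60.0, 1.0)` are assumptions picked for illustration. Check the config you actually use.

```python
import numpy as np


def create_frustum_np(input_size, downsample, dbound):
    """NumPy sketch of create_frustum; returns an array of shape (D, fH, fW, 3)."""
    ogfH, ogfW = input_size
    fH, fW = ogfH // downsample, ogfW // downsample
    ds = np.arange(*dbound, dtype=np.float32)             # depth bins in meters
    xs = np.linspace(0, ogfW - 1, fW, dtype=np.float32)   # pixel x in image space
    ys = np.linspace(0, ogfH - 1, fH, dtype=np.float32)   # pixel y in image space
    d_grid, y_grid, x_grid = np.meshgrid(ds, ys, xs, indexing='ij')
    return np.stack((x_grid, y_grid, d_grid), axis=-1)


frustum = create_frustum_np(input_size=(256, 704), downsample=16,
                            dbound=(1.0, 60.0, 1.0))
first_point = frustum[0, 0, 0]     # top-left pixel, nearest depth bin
last_point = frustum[-1, -1, -1]   # bottom-right pixel, farthest depth bin
```

With these values the grid has 59 depth bins over a 16×44 feature map, while x and y still span the full 704×256 input image, which matches the note above about feature resolution versus image-space coordinates.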
These points are then converted to coordinates in the lidar coordinate system using the LSS-related transforms recorded above (post_trans, post_rots, intrins, rots, trans):
import torch

def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
    """Determine the (x,y,z) locations (in the ego frame)
    of the points in the point cloud.
    Returns B x N x D x H/downsample x W/downsample x 3
    """
    B, N, _ = trans.shape

    # undo the image-space augmentation
    # B x N x D x H x W x 3
    points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
    points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3) \
        .matmul(points.unsqueeze(-1))

    # image space to lidar coordinate system
    points = torch.cat(
        (points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
         points[:, :, :, :, :, 2:3]), 5)
    combine = rots.matmul(torch.inverse(intrins))
    points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
    points += trans.view(B, N, 1, 1, 1, 3)
    return points
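The same three steps can be traced for a single frustum point. In the sketch below the intrinsics, the augmentation (resize by 0.5 followed by a crop offset), and the extrinsics are all made-up values, and `pixel_to_lidar` is a hypothetical helper. A known camera-frame point is projected and augmented, then recovered in the lidar frame.

```python
import numpy as np


def pixel_to_lidar(uvd, post_rot, post_tran, intrin, rot, tran):
    """Map one (u, v, d) point in the augmented image to the lidar frame."""
    # 1. undo the image-space augmentation
    p = np.linalg.inv(post_rot) @ (uvd - post_tran)
    # 2. (u, v, d) -> (u * d, v * d, d), ready for the inverse intrinsics
    p = np.array([p[0] * p[2], p[1] * p[2], p[2]])
    # 3. image space -> camera frame -> lidar frame
    return rot @ np.linalg.inv(intrin) @ p + tran


# Illustrative values only.
intrin = np.array([[1000.0, 0.0, 800.0],
                   [0.0, 1000.0, 450.0],
                   [0.0, 0.0, 1.0]])
post_rot = np.diag([0.5, 0.5, 1.0])         # resize by 0.5; depth untouched
post_tran = np.array([-48.0, -20.0, 0.0])   # crop offset in pixels
rot = np.array([[0.0, 0.0, 1.0],
                [-1.0, 0.0, 0.0],
                [0.0, -1.0, 0.0]])
tran = np.array([1.5, 0.0, 1.6])

# Forward direction: project a camera-frame point, then augment the pixel.
p_cam = np.array([2.0, -1.0, 20.0])
uv1 = intrin @ p_cam / p_cam[2]
uvd_original = np.array([uv1[0], uv1[1], p_cam[2]])
uvd_augmented = post_rot @ uvd_original + post_tran

# Backward direction: what get_geometry does for every frustum point.
p_lidar = pixel_to_lidar(uvd_augmented, post_rot, post_tran, intrin, rot, tran)
expected = rot @ p_cam + tran
```

Multiplying u and v by d before applying the inverse intrinsics is what turns a pixel plus a depth into a 3D point on the corresponding camera ray.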
Finally, voxel pooling uses these points to generate the features in BEV space.
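As a rough illustration of what voxel pooling computes, the NumPy sketch below sums the features of all points that fall into the same BEV cell and collapses the height axis. It is a simplification under stated assumptions. The function name `voxel_pool_sum` and the grid bounds are hypothetical, and the actual implementation is optimized differently.

```python
import numpy as np


def voxel_pool_sum(points, feats, xbound, ybound):
    """Sum point features into a BEV grid.

    points: (N, 3) lidar-frame coordinates, feats: (N, C).
    xbound / ybound: (min, max, cell size). Returns an array of shape (C, ny, nx).
    """
    x_min, x_max, dx = xbound
    y_min, y_max, dy = ybound
    nx = int(round((x_max - x_min) / dx))
    ny = int(round((y_max - y_min) / dy))

    # integer cell index of every point
    ix = np.floor((points[:, 0] - x_min) / dx).astype(np.int64)
    iy = np.floor((points[:, 1] - y_min) / dy).astype(np.int64)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

    bev = np.zeros((ny, nx, feats.shape[1]), dtype=feats.dtype)
    # unbuffered add, so points sharing a cell are accumulated correctly
    np.add.at(bev, (iy[keep], ix[keep]), feats[keep])
    return bev.transpose(2, 0, 1)


points = np.array([[0.3, 0.2, 0.0],
                   [0.4, 0.1, 0.5],    # same cell as the first point
                   [-0.9, 0.9, 0.0],
                   [5.0, 0.0, 0.0]])   # outside the grid, dropped
feats = np.array([[1.0, 0.0],
                  [2.0, 1.0],
                  [0.0, 3.0],
                  [9.0, 9.0]])
bev = voxel_pool_sum(points, feats, xbound=(-1.0, 1.0, 0.5), ybound=(-1.0, 1.0, 0.5))
```

`np.add.at` is used instead of `bev[iy, ix] += feats` because the latter would keep only one contribution when several points share a cell.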
4.2 Feature Alignment
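The goal of feature alignment is to obtain adjacent-frame features defined in the current frame's lidar coordinate system (currlidar). With the BEVDetSequential class, the adjcam2currlidar transform described above is used in the LSS view transformation to generate the adjacent frame's BEV features, so the resulting features are already defined in the currlidar coordinate system and can be concatenated directly with the current frame's features. However, this changes the input of the LSS view transformation, so the precondition for acceleration no longer holds. For acceleration we use the BEVDetSequentialES class instead: it keeps the cam2lidar transform unchanged in the view transformation and aligns the BEV features produced by the view transformer.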
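Aligning BEV features amounts to resampling the adjacent frame's feature map at the locations where the current frame's cells lie in the adjacent lidar frame. The NumPy sketch below uses nearest-neighbour lookup on a tiny grid. The function `align_bev`, the grid parameters, and the nearest-neighbour choice are assumptions for illustration. A real implementation would more likely use bilinear interpolation on the GPU.

```python
import numpy as np


def align_bev(feat_adj, lidaradj2lidarcurr, x_min, y_min, cell):
    """Warp BEV features from the adjacent lidar frame into the current one.

    feat_adj: (C, ny, nx) features defined in the adjacent lidar frame.
    Returns an array of the same shape defined in the current lidar frame.
    """
    C, ny, nx = feat_adj.shape
    ys, xs = np.meshgrid(np.arange(ny), np.arange(nx), indexing='ij')

    # centers of the output cells in the current lidar frame (homogeneous, z = 0)
    pts_curr = np.stack([x_min + (xs + 0.5) * cell,
                         y_min + (ys + 0.5) * cell,
                         np.zeros_like(xs, dtype=float),
                         np.ones_like(xs, dtype=float)], axis=-1)

    # where those centers lie in the adjacent lidar frame
    currlidar2adjlidar = np.linalg.inv(lidaradj2lidarcurr)
    pts_adj = pts_curr @ currlidar2adjlidar.T

    ix = np.floor((pts_adj[..., 0] - x_min) / cell).astype(np.int64)
    iy = np.floor((pts_adj[..., 1] - y_min) / cell).astype(np.int64)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

    out = np.zeros_like(feat_adj)   # cells without a source stay zero
    out[:, valid] = feat_adj[:, iy[valid], ix[valid]]
    return out


# Toy case: the ego vehicle moved 1 m forward (+x) between the two frames, so a
# static object appears 1 m further back in the current lidar frame.
lidaradj2lidarcurr = np.eye(4)
lidaradj2lidarcurr[0, 3] = -1.0

feat_adj = np.zeros((1, 4, 4))
feat_adj[0, 2, 2] = 1.0   # object in the cell centered at x = 0.5, y = 0.5
feat_aligned = align_bev(feat_adj, lidaradj2lidarcurr, x_min=-2.0, y_min=-2.0, cell=1.0)
```

Because the warp acts on the BEV features only, the view transformation itself sees the same cam2lidar input for every frame, which is the property the accelerated path relies on.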