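PointPillars is a model that came out of industry. Its core idea borrows the image-processing pipeline: the point cloud is divided, from a top-down (bird's-eye) view, into pillars (vertical columns), which yields image-like data. A 2D detection framework then extracts features and predicts dense boxes, which gives the model a good balance between speed and accuracy.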

PointPillars network architecture overview:

Speed versus accuracy comparison:

Note: PP stands for PointPillars, M for MV3D, A for AVOD, C for ContFuse, V for VoxelNet, F for Frustum PointNet, S for SECOND, and P+ for PIXOR++.

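This article walks through the PointPillars implementation in OpenPCDet line by line and explains the reasoning behind it.

Readers can download OpenPCDet and follow along with the code while reading.

My knowledge is limited, so this walkthrough will inevitably have shortcomings. Corrections and discussion are welcome; feel free to leave suggestions in the comments. Thank you!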

The PointPillars paper:

https://arxiv.org/pdf/1812.05784.pdf

Reference code for this walkthrough:

https://github.com/open-mmlab/OpenPCDet

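I. Overview

3D detection algorithms usually take one of the following forms:

(1) Partition the point cloud into voxels to form a regular, densely laid out voxel grid, as in VoxelNet and SECOND.

(2) Project the point cloud onto the front view and the bird's-eye view to obtain pseudo-images; common models are MV3D and AVOD.

(3) Map the point cloud directly onto the bird's-eye view, then apply a 2D detection framework for feature extraction and the RPN to perform 3D detection, as in PIXOR and the subject of this article, PointPillars.

(4) Use PointNet to extract features directly from the point cloud and generate proposals, then refine those proposals, as in PointRCNN.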

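II. Point cloud processing in PointPillars

Here the 3D point cloud is processed directly from a top-down view. Suppose the cloud holds N points, each with xyz coordinates (N×3), all expressed in the KITTI LiDAR coordinate frame (in metres; x points forward, y to the left, z up). Every point is assigned to one cell of a uniform grid on the x-y plane. Each cell, extended along z, forms a column called a pillar, as shown in the figures below.

Left camera front view

Bird's-eye view of the point cloud, with the points distributed into uniform pillars

KITTI point clouds are 4-dimensional, (x, y, z, r): xyz is the position of the point and r is its reflectance (which depends on the object's material, the laser's incidence angle, and so on). When points are placed into pillars, height does not have to be discretised as it is for voxels; a pillar can be seen as all the voxels along the z axis merged into one.

Before the points are fed to the network, the points in each pillar are decorated with 5 extra values: x_c, y_c, z_c, x_p and y_p. The subscript c denotes the offset of a point from the mean of all points in its pillar, and the subscript p denotes the x and y offset of the point from the centre of its pillar. After this step each point has 9 dimensions: x, y, z, r, x_c, y_c, z_c, x_p and y_p. (Note: the OpenPCDet implementation uses 10 dimensions; it adds z_p, the z-axis offset of the point from the z centre of its pillar.)

After these steps the raw point cloud becomes a tensor of shape (D, P, N), where D is the feature dimension of each point (9 features per point), P is the number of non-empty pillars, and N is the maximum number of points per pillar.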
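As a rough illustration of the decoration step, the sketch below decorates the points of a single pillar using plain Python lists. The function name `decorate_pillar` and its arguments are hypothetical and not part of OpenPCDet; it produces the 9 features of the paper (no z_p) and assumes the pillar contains only real, non-padded points.

```python
def decorate_pillar(points, pillar_center_xy):
    """Decorate the real (non-padded) points of one pillar.

    points: list of (x, y, z, r) tuples.
    pillar_center_xy: (cx, cy), the geometric centre of the pillar in metres.
    Returns a list of 9-value rows: x, y, z, r, xc, yc, zc, xp, yp.
    """
    n = len(points)
    # Arithmetic mean of the xyz coordinates of the points in this pillar
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    cx, cy = pillar_center_xy
    decorated = []
    for x, y, z, r in points:
        decorated.append([
            x, y, z, r,
            x - mean[0], y - mean[1], z - mean[2],  # offsets to the point mean
            x - cx, y - cy,                          # offsets to the pillar centre
        ])
    return decorated


pts = [(1.0, 2.0, 0.0, 0.5), (3.0, 4.0, 2.0, 0.1)]
rows = decorate_pillar(pts, pillar_center_xy=(2.0, 2.0))
```

Each row of `rows` has 9 values; adding a z offset to the pillar centre in the same way would give the 10 values used by OpenPCDet.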

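Notes:

1. In the implementation each pillar is 0.16 m long and wide. OpenPCDet keeps only the region in front of the vehicle for training: KITTI annotations are made with respect to camera 2, so there are no labels in the negative x direction (behind the car) and those points are cropped. Points that are too far away are cropped as well, because the cloud is too sparse there for reliable detection. The point cloud range used by OpenPCDet is therefore xyz min = [0, -39.68, -3] and xyz max = [69.12, 39.68, 1].

2. Each pillar holds at most 32 points. If a pillar contains more than 32 points, 32 of them are randomly sampled; if it contains fewer than 32, the pillar is padded with zeros.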
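The two notes above can be checked with a few lines of arithmetic. The following is a minimal sketch in pure Python; `grid_size`, `pillar_index` and `sample_or_pad` are hypothetical helper names, and the random sampling follows the description above rather than the behaviour of any particular voxel generator, which may differ.

```python
import random

POINT_CLOUD_RANGE = [0.0, -39.68, -3.0, 69.12, 39.68, 1.0]
VOXEL_SIZE = [0.16, 0.16, 4.0]
MAX_POINTS = 32


def grid_size(pc_range, voxel_size):
    """Number of pillars along x, y and z."""
    return [round((pc_range[i + 3] - pc_range[i]) / voxel_size[i]) for i in range(3)]


def pillar_index(x, y, pc_range, voxel_size):
    """Grid index (ix, iy) of the pillar that contains the point (x, y)."""
    ix = int((x - pc_range[0]) // voxel_size[0])
    iy = int((y - pc_range[1]) // voxel_size[1])
    return ix, iy


def sample_or_pad(points, max_points=MAX_POINTS, seed=0):
    """Randomly keep at most max_points points, then zero-pad up to max_points."""
    if len(points) > max_points:
        points = random.Random(seed).sample(points, max_points)
    num_valid = len(points)
    padded = list(points) + [(0.0, 0.0, 0.0, 0.0)] * (max_points - num_valid)
    return padded, num_valid


size = grid_size(POINT_CLOUD_RANGE, VOXEL_SIZE)
padded_few, valid_few = sample_or_pad([(1.0, 1.0, 1.0, 1.0)] * 3)
padded_many, valid_many = sample_or_pad([(float(i), 0.0, 0.0, 0.0) for i in range(40)])
```

The number of valid points is returned alongside the padded list because the later stages need it to tell real points from padding.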

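After this mapping we have a (D, P, N) tensor. A simplified PointNet then extracts features from it (the points are lifted to a higher dimension by an MLP, followed by a BN layer and a ReLU activation), which gives a tensor of shape (C, P, N). A max-pool over the points of each pillar keeps, per channel, the strongest response in that pillar, so the output goes from (C, P, N) to (C, P). The encoded pillars are finally put back at their original x, y positions to form the pseudo-image.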
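The shape flow of this step can be sketched with NumPy. This is a simplified illustration under stated assumptions, not the OpenPCDet layer: the BN layer is omitted, the weights are random, `pillar_feature_net` is a hypothetical name, and the tensors are laid out as (P, N, D) rather than the (D, P, N) ordering used in the text.

```python
import numpy as np


def pillar_feature_net(pillars, weight):
    """Shared linear layer + ReLU + max over points.

    pillars: (P, N, D) decorated points of P pillars.
    weight:  (D, C) weights of the shared linear layer.
    Returns: (P, C), one feature vector per pillar.
    """
    x = pillars @ weight        # (P, N, C): the same linear map for every point
    x = np.maximum(x, 0.0)      # ReLU (BatchNorm omitted in this sketch)
    return x.max(axis=1)        # max over the N points of each pillar


rng = np.random.default_rng(0)
pillars = rng.normal(size=(5, 32, 9))   # 5 pillars, 32 points, 9 features
weight = rng.normal(size=(9, 64))       # lift 9 features to 64 channels
features = pillar_feature_net(pillars, weight)
```

Because the maximum is taken per channel, the resulting vector generally mixes responses from several points of the pillar rather than copying a single point.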

Now let's look at the code for this part.

Preprocessing code: pcdet/datasets/processor/data_processor.py

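    def transform_points_to_voxels(self, data_dict=None, config=None):
        """
        Convert the point cloud into pillars with the spconv voxel generator.
        A pillar can be seen as the set of all voxels along the z axis, so it is
        enough to set the voxel height to the full z extent of the cropped
        KITTI point cloud.
        """

        # Initialise the parameters needed to convert points into pillars
        if data_dict is None:
            # Cropped KITTI point cloud range: [0, -39.68, -3, 69.12, 39.68, 1]
            # [69.12, 79.36, 4] / [0.16, 0.16, 4] = [432, 496, 1]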
            grid_size = (self.point_cloud_range[3:6] - self.point_cloud_range[0:3]) / np.array(config.VOXEL_SIZE)
            self.grid_size = np.round(grid_size).astype(np.int64)
            self.voxel_size = config.VOXEL_SIZE
            # just bind the config, we will create the VoxelGeneratorWrapper later,
            # to avoid pickling issues in multiprocess spawn
            return partial(self.transform_points_to_voxels, config=config)

        if self.voxel_generator is None:
            self.voxel_generator = VoxelGeneratorWrapper(
                # Size of each pillar: [0.16, 0.16, 4]
                vsize_xyz=config.VOXEL_SIZE,
                # Point cloud range: [0, -39.68, -3, 69.12, 39.68, 1]
                coors_range_xyz=self.point_cloud_range,
                # Number of features per point; here x, y, z, r, where r is the LiDAR reflectance
                num_point_features=self.num_point_features,
                # Maximum number of points per pillar: 32
                max_num_points_per_voxel=config.MAX_POINTS_PER_VOXEL,
                # Maximum number of pillars to keep. Many grid cells contain no points
                # (see the bird's-eye view above), so only the non-empty pillars are needed
                max_num_voxels=config.MAX_NUMBER_OF_VOXELS[self.mode],  # 16000
            )
        
        points = data_dict['points']
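        # Generate the pillars
        voxel_output = self.voxel_generator.generate(points)
        # For an N x 4 point cloud, pillar generation returns three arrays:
        # voxels: the points of each generated pillar, shape [M, 32, 4]
        # coordinates: the (z, y, x) grid index of each pillar, shape [M, 3]; z is always 0
        # num_points: the number of valid points in each pillar, shape [M,];
        #             pillars with fewer than 32 points are zero-padded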
        voxels, coordinates, num_points = voxel_output

        if not data_dict['use_lead_xyz']:
            voxels = voxels[..., 3:]  # remove xyz in voxels(N, 3)

        data_dict['voxels'] = voxels
        data_dict['voxel_coords'] = coordinates
        data_dict['voxel_num_points'] = num_points
        return data_dict

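# The code below generates the pillars with spconv

# `tv` is used by the spconv 2.x branch of generate(); it stays None if cumm is not installed
tv = None
try:
    import cumm.tensorview as tv
except Exception:
    pass


class VoxelGeneratorWrapper():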
    def __init__(self, vsize_xyz, coors_range_xyz, num_point_features, max_num_points_per_voxel, max_num_voxels):
        try:
            from spconv.utils import VoxelGeneratorV2 as VoxelGenerator
            self.spconv_ver = 1
        except:
            try:
                from spconv.utils import VoxelGenerator
                self.spconv_ver = 1
            except:
                from spconv.utils import Point2VoxelCPU3d as VoxelGenerator
                self.spconv_ver = 2

        if self.spconv_ver == 1:
            self._voxel_generator = VoxelGenerator(
                voxel_size=vsize_xyz,
                point_cloud_range=coors_range_xyz,
                max_num_points=max_num_points_per_voxel,
                max_voxels=max_num_voxels
            )
        else:
            self._voxel_generator = VoxelGenerator(
                vsize_xyz=vsize_xyz,
                coors_range_xyz=coors_range_xyz,
                num_point_features=num_point_features,
                max_num_points_per_voxel=max_num_points_per_voxel,
                max_num_voxels=max_num_voxels
            )

    def generate(self, points):
        if self.spconv_ver == 1:
            voxel_output = self._voxel_generator.generate(points)
            if isinstance(voxel_output, dict):
                voxels, coordinates, num_points = \
                    voxel_output['voxels'], voxel_output['coordinates'], voxel_output['num_points_per_voxel']
            else:
                voxels, coordinates, num_points = voxel_output
        else:
            assert tv is not None, f"Unexpected error, library: 'cumm' wasn't imported properly."
            voxel_output = self._voxel_generator.point_to_voxel(tv.from_numpy(points))
            tv_voxels, tv_coordinates, tv_num_points = voxel_output
            # make copy with numpy(), since numpy_view() will disappear as soon as the generator is deleted
            voxels = tv_voxels.numpy()
            coordinates = tv_coordinates.numpy()
            num_points = tv_num_points.numpy()
        return voxels, coordinates, num_points

After this preprocessing, the simplified PointNet extracts features from the points in each pillar.

The code is in pcdet/models/backbones_3d/vfe/pillar_vfe.py

import torch
import torch.nn as nn
import torch.nn.functional as F

from .vfe_template import VFETemplate


class PFNLayer(nn.Module):
    def __init__(self,
                 in_channels,
                 out_channels,
                 use_norm=True,
                 last_layer=False):
        super().__init__()

        self.last_vfe = last_layer
        self.use_norm = use_norm
        if not self.last_vfe:
            out_channels = out_channels // 2

        if self.use_norm:
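            # Simplified PointNet layer, as described in the paper.
            # The paper uses a 1x1 convolution for this lifting step (in theory a convolution is faster).
            # Input channels: the decorated point features, 10 per point.
            # Output channels: 64.
            self.linear = nn.Linear(in_channels, out_channels, bias=False)
            # 1D batch normalisation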
            self.norm = nn.BatchNorm1d(out_channels, eps=1e-3, momentum=0.01)
        else:
            self.linear = nn.Linear(in_channels, out_channels, bias=True)

        self.part = 50000

    def forward(self, inputs):
        if inputs.shape[0] > self.part:
            # nn.Linear performs randomly when batch size is too large
            num_parts = inputs.shape[0] // self.part
            part_linear_out = [self.linear(inputs[num_part * self.part:(num_part + 1) * self.part])
                               for num_part in range(num_parts + 1)]
            x = torch.cat(part_linear_out, dim=0)
        else:
            # x goes from (M, 32, 10) to (M, 32, 64)
            x = self.linear(inputs)
        torch.backends.cudnn.enabled = False
        # BatchNorm1d normalises over the channel dimension and expects (batch, channels, length),
        # so permute (pillars, num_points, channels) = (M, 32, 64) to (M, 64, 32) for the norm
        # and permute back to (M, 32, 64) afterwards
        x = self.norm(x.permute(0, 2, 1)).permute(0, 2, 1) if self.use_norm else x
        torch.backends.cudnn.enabled = True
        x = F.relu(x)
        # PointNet max-pooling: per channel, keep the largest response among the points of each pillar
        # x_max shape: (M, 1, 64)
        x_max = torch.max(x, dim=1, keepdim=True)[0]

        if self.last_vfe:
            # Return the pillar features produced by the simplified PointNet
            return x_max
        else:
            x_repeat = x_max.repeat(1, inputs.shape[1], 1)
            x_concatenated = torch.cat([x, x_repeat], dim=2)
            return x_concatenated


class PillarVFE(VFETemplate):
    """
    model_cfg:NAME: PillarVFE
                    WITH_DISTANCE: False
                    USE_ABSLOTE_XYZ: True
                    USE_NORM: True
                    NUM_FILTERS: [64]
    num_point_features:4
    voxel_size:[0.16 0.16 4]
    POINT_CLOUD_RANGE: [0, -39.68, -3, 69.12, 39.68, 1]
    """

    def __init__(self, model_cfg, num_point_features, voxel_size, point_cloud_range, **kwargs):
        super().__init__(model_cfg=model_cfg)

        self.use_norm = self.model_cfg.USE_NORM
        self.with_distance = self.model_cfg.WITH_DISTANCE
        self.use_absolute_xyz = self.model_cfg.USE_ABSLOTE_XYZ
        num_point_features += 6 if self.use_absolute_xyz else 3
        if self.with_distance:
            num_point_features += 1

        self.num_filters = self.model_cfg.NUM_FILTERS
        assert len(self.num_filters) > 0
        num_filters = [num_point_features] + list(self.num_filters)

        pfn_layers = []
        for i in range(len(num_filters) - 1):
            in_filters = num_filters[i]
            out_filters = num_filters[i + 1]
            pfn_layers.append(
                PFNLayer(in_filters, out_filters, self.use_norm, last_layer=(i >= len(num_filters) - 2))
            )
        # Register the PFN layers, which lift the 10-dimensional features to 64 dimensions
        self.pfn_layers = nn.ModuleList(pfn_layers)

        self.voxel_x = voxel_size[0]
        self.voxel_y = voxel_size[1]
        self.voxel_z = voxel_size[2]
        self.x_offset = self.voxel_x / 2 + point_cloud_range[0]
        self.y_offset = self.voxel_y / 2 + point_cloud_range[1]
        self.z_offset = self.voxel_z / 2 + point_cloud_range[2]

    def get_output_feature_dim(self):
        return self.num_filters[-1]

    def get_paddings_indicator(self, actual_num, max_num, axis=0):
        """
        Compute the padding indicator.
        Args:
            actual_num: actual number of points in each voxel, shape (M,)
            max_num: maximum number of points per voxel (32)
        Returns:
            paddings_indicator: marks which entries of a pillar are real points and which are zero padding
        """
        # Add a dimension: (M,) -> (M, 1)
        actual_num = torch.unsqueeze(actual_num, axis + 1)
        # [1, 1]
        max_num_shape = [1] * len(actual_num.shape)
        # [1, -1]
        max_num_shape[axis + 1] = -1
        # (1, 32)
        max_num = torch.arange(max_num, dtype=torch.int, device=actual_num.device).view(max_num_shape)
        # (M, 32)
        paddings_indicator = actual_num.int() > max_num
        return paddings_indicator

    def forward(self, batch_dict, **kwargs):
        """
        batch_dict:
            points: (N, 5) --> (batch_index, x, y, z, r); batch_index is the index of the sample within the batch
            frame_id: (4,) --> (003877, 001908, 006616, 005355) frame IDs
            gt_boxes: (4, 40, 8) --> (x, y, z, dx, dy, dz, ry, class)
            use_lead_xyz: (4,) --> (1, 1, 1, 1)
            voxels: (M, 32, 4) --> (x, y, z, r)
            voxel_coords: (M, 4) --> (batch_index, z, y, x); batch_index is the index of the sample within the batch
            voxel_num_points: (M,)
            image_shape: (4, 2) image resolution of camera 2 for each sample
            batch_size: 4
        """
        voxel_features, voxel_num_points, coords = batch_dict['voxels'], batch_dict['voxel_num_points'], batch_dict[
            'voxel_coords']
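        # Sum the xyz of all points in each pillar: (M, 32, 3) -> (M, 1, 3); keepdim=True keeps the reduced dimension
        # Dividing the sum by the number of valid points in each pillar gives the mean; points_mean shape: (M, 1, 3)
        points_mean = voxel_features[:, :, :3].sum(dim=1, keepdim=True) / voxel_num_points.type_as(voxel_features).view(
            -1, 1, 1)
        # Subtract the pillar mean from every point to get the offsets xc, yc, zc
        f_cluster = voxel_features[:, :, :3] - points_mean

        # Allocate the offsets of each point to the geometric centre of its pillar: xp, yp, zp
        f_center = torch.zeros_like(voxel_features[:, :, :3])
        #  coords holds the grid index of each pillar (grid size [432, 496, 1]); multiplying it by the
        #  pillar size converts the index to metres in the point cloud.
        #  Adding half the pillar size (plus the range minimum) gives the pillar centre.
        #  Subtracting that centre from the x, y, z of each point gives the offset to the pillar centre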
        f_center[:, :, 0] = voxel_features[:, :, 0] - (
                coords[:, 3].to(voxel_features.dtype).unsqueeze(1) * self.voxel_x + self.x_offset)
        f_center[:, :, 1] = voxel_features[:, :, 1] - (
                coords[:, 2].to(voxel_features.dtype).unsqueeze(1) * self.voxel_y + self.y_offset)
        # The z offset here is an addition; the paper has no z offset
        f_center[:, :, 2] = voxel_features[:, :, 2] - (
                coords[:, 1].to(voxel_features.dtype).unsqueeze(1) * self.voxel_z + self.z_offset)

        # With absolute coordinates, concatenate everything directly
        if self.use_absolute_xyz:
            features = [voxel_features, f_cluster, f_center]
        # Otherwise drop the first 3 (xyz) channels of voxel_features before concatenating
        else:
            features = [voxel_features[..., 3:], f_cluster, f_center]

        # Optionally add the distance of each point to the origin
        if self.with_distance:
            # torch.norm: the first 2 selects the L2 norm, the second 2 is the dimension it is taken over
            points_dist = torch.norm(voxel_features[:, :, :3], 2, 2, keepdim=True)
            features.append(points_dist)
        # Concatenate along the last dimension to get a (M, 32, 10) tensor
        features = torch.cat(features, dim=-1)
        # Maximum number of points per pillar
        voxel_count = features.shape[1]
        """
        Pillars with fewer than 32 points contain zero-padded entries.
        The computations above give these padded entries non-zero xc, yc, zc
        and xp, yp, zp values, which have to be reset to 0.
        get_paddings_indicator tells which entries of features are real points
        to keep and which are padding to zero out.
        """
        # mask has shape (M, 32)
        # and marks which entries of each pillar are real points
        mask = self.get_paddings_indicator(voxel_num_points, voxel_count, axis=0)
        # (M, 32) -> (M, 32, 1)
        mask = torch.unsqueeze(mask, -1).type_as(voxel_features)
        # Zero all features of the padded entries
        features *= mask

        for pfn in self.pfn_layers:
            features = pfn(features)
        # (M, 64): one 64-dimensional feature per pillar
        features = features.squeeze()
        batch_dict['pillar_features'] = features
        return batch_dict

Once the simplified PointNet has extracted a feature vector for each pillar, the pillars have to be put back at their original grid positions to form the pseudo-image.
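Before reading the real implementation, a toy version may help to see the indexing. The sketch below scatters two pillars onto a 3×4 canvas with NumPy; `scatter_to_bev` is a hypothetical helper that handles a single sample and takes (y, x) indices only, whereas the OpenPCDet code works on batched (batch_index, z, y, x) coordinates.

```python
import numpy as np


def scatter_to_bev(pillar_features, coords_yx, ny, nx):
    """Place pillar features on a dense canvas.

    pillar_features: (M, C) features of the non-empty pillars.
    coords_yx:       (M, 2) integer (y, x) grid index of each pillar.
    Returns a (C, ny, nx) pseudo-image that is zero wherever no pillar exists.
    """
    num_channels = pillar_features.shape[1]
    canvas = np.zeros((num_channels, ny * nx), dtype=pillar_features.dtype)
    # Row-major flat index: cells in all preceding rows plus the column
    flat_index = coords_yx[:, 0] * nx + coords_yx[:, 1]
    canvas[:, flat_index] = pillar_features.T
    return canvas.reshape(num_channels, ny, nx)


feats = np.array([[1.0, 2.0], [3.0, 4.0]])   # two pillars, C = 2
coords = np.array([[0, 1], [2, 3]])          # (y, x) of each pillar
bev = scatter_to_bev(feats, coords, ny=3, nx=4)
```

Flattening the canvas first and reshaping afterwards lets a single indexed assignment place all pillars at once.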

The code is in pcdet/models/backbones_2d/map_to_bev/pointpillar_scatter.py

import torch
import torch.nn as nn


class PointPillarScatter(nn.Module):
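    """
    Corresponds to the "stacked pillars" step of the paper: the generated pillars
    are put back into the original grid according to their coordinate indices.
    """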

    def __init__(self, model_cfg, grid_size, **kwargs):
        super().__init__()

        self.model_cfg = model_cfg
        self.num_bev_features = self.model_cfg.NUM_BEV_FEATURES  # 64
        self.nx, self.ny, self.nz = grid_size  # [432,496,1]
        assert self.nz == 1

    def forward(self, batch_dict, **kwargs):
        """
        Args:
            pillar_features: (M, 64)
            coords: (M, 4), ordered as (batch_index, z, y, x)
        Returns:
            batch_spatial_features: (batch_size, 64, 496, 432)
        """
        # Fetch the pillar features produced by the PointNet and the grid coordinates of each pillar
        # pillar_features: (M, 64)
        # coords: (M, 4)
        pillar_features, coords = batch_dict['pillar_features'], batch_dict['voxel_coords']

        # List that collects the pseudo-image of every sample
        batch_spatial_features = []
        batch_size = coords[:, 0].max().int().item() + 1

        # Process each sample of the batch independently
        for batch_idx in range(batch_size):
            # Create an empty canvas that will receive the pillar features
            # self.num_bev_features is 64
            # self.nz * self.nx * self.ny is the number of grid cells, 1 * 432 * 496
            # spatial_feature shape: (64, 214272)
            spatial_feature = torch.zeros(
                self.num_bev_features,
                self.nz * self.nx * self.ny,
                dtype=pillar_features.dtype,
                device=pillar_features.device)  # (64, 214272), 1 x 432 x 496 = 214272

            # Mask selecting the pillars that belong to this batch_idx
            batch_mask = coords[:, 0] == batch_idx
            # Pick the coordinates with the mask
            this_coords = coords[batch_mask, :]
            # this_coords stores (batch_index, z, y, x) and there is a single z layer,
            # so the flat index counts the cells that precede the current pillar
            """
            The canvas is flattened to one dimension: an image of height 496 and
            width 432 becomes a vector of 496 * 432 = 214272 cells.
            this_coords describes a cell in the plane (z can be ignored), so its
            position in the flattened vector is the number of cells in all
            preceding rows (y * nx) plus its column (x).
            """
            # Flat pseudo-image index of every non-empty pillar
            indices = this_coords[:, 1] + this_coords[:, 2] * self.nx + this_coords[:, 3]
            # Cast to long so it can be used as an index
            indices = indices.type(torch.long)
            # Pick the pillar features with the mask
            pillars = pillar_features[batch_mask, :]
            pillars = pillars.t()
            # Write the pillars at their index positions
            spatial_feature[:, indices] = pillars
            # Append the canvas to the list; each element is (64, 214272)
            batch_spatial_features.append(spatial_feature)

        # Stack all samples along dimension 0
        batch_spatial_features = torch.stack(batch_spatial_features, 0)
        # Reshape back to the grid (pseudo-image): (4, 64, 214272) --> (4, 64, 496, 432)
        batch_spatial_features = batch_spatial_features.view(batch_size, self.num_bev_features * self.nz, self.ny,
                                                             self.nx)
        # Store the result in batch_dict
        batch_dict['spatial_features'] = batch_spatial_features
        return batch_dict

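III. Feature extraction with the 2D backbone

After the mapping above, where the max-pooled feature of each pillar is put back at its grid position, we have image-like data. Only cells with a non-empty pillar carry features and everything else is 0, so the resulting (batch_size, 64, 496, 432) tensor is still very sparse.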

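The figure below shows how 2D feature extraction is applied to this tensor, extracting features at multiple scales and fusing them by concatenation.

There is not much to explain here: these are ordinary convolutions followed by a concatenation. Just keep an eye on how the dimensions change.

The upsampling layers finally produce features at 3 scales.
Each scale has shape (batch_size, 128, 248, 216).
Concatenating them along the channel dimension (dim=1) gives x with shape (batch_size, 384, 248, 216).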
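These shapes can be reproduced with plain arithmetic. The sketch below is an illustration only: `conv_out` and `backbone_shapes` are hypothetical helpers, and the stride and filter lists are assumed to match the default KITTI PointPillars configuration (strides 2, 2, 2; filters 64, 128, 256; upsample strides 1, 2, 4; 128 upsample filters each).

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def backbone_shapes(h, w, layer_strides, num_filters, upsample_strides, num_upsample_filters):
    """Return the (channels, height, width) of every block and every deblock output."""
    down, up = [], []
    for s, c, us, uc in zip(layer_strides, num_filters, upsample_strides, num_upsample_filters):
        # Only the first conv of a block is strided; the others keep the size
        h, w = conv_out(h, stride=s), conv_out(w, stride=s)
        down.append((c, h, w))
        # A transposed conv with kernel == stride multiplies the size by the stride
        up.append((uc, h * us, w * us))
    return down, up


down, up = backbone_shapes(
    496, 432,
    layer_strides=[2, 2, 2], num_filters=[64, 128, 256],
    upsample_strides=[1, 2, 4], num_upsample_filters=[128, 128, 128],
)
fused_channels = sum(c for c, _, _ in up)
```

All three upsampled maps end up at 248×216, which is what makes the channel-wise concatenation possible.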

The code is in pcdet/models/backbones_2d/base_bev_backbone.py

import numpy as np
import torch
import torch.nn as nn


class BaseBEVBackbone(nn.Module):
    def __init__(self, model_cfg, input_channels):
        super().__init__()
        self.model_cfg = model_cfg
        # Read the parameters of the downsampling blocks
        if self.model_cfg.get('LAYER_NUMS', None) is not None:
            assert len(self.model_cfg.LAYER_NUMS) == len(self.model_cfg.LAYER_STRIDES) == len(
                self.model_cfg.NUM_FILTERS)
            layer_nums = self.model_cfg.LAYER_NUMS
            layer_strides = self.model_cfg.LAYER_STRIDES
            num_filters = self.model_cfg.NUM_FILTERS
        else:
            layer_nums = layer_strides = num_filters = []
        # Read the parameters of the upsampling layers
        if self.model_cfg.get('UPSAMPLE_STRIDES', None) is not None:
            assert len(self.model_cfg.UPSAMPLE_STRIDES) == len(self.model_cfg.NUM_UPSAMPLE_FILTERS)
            num_upsample_filters = self.model_cfg.NUM_UPSAMPLE_FILTERS
            upsample_strides = self.model_cfg.UPSAMPLE_STRIDES
        else:
            upsample_strides = num_upsample_filters = []

        num_levels = len(layer_nums)  # 3
        c_in_list = [input_channels, *num_filters[:-1]]  # [64, 64, 128]: input_channels is 64, num_filters[:-1] is [64, 128]
        self.blocks = nn.ModuleList()
        self.deblocks = nn.ModuleList()
        for idx in range(num_levels):  # channels: (64, 64) --> (64, 128) --> (128, 256); the first conv of cur_layers uses stride 2
            cur_layers = [
                nn.ZeroPad2d(1),
                nn.Conv2d(
                    c_in_list[idx], num_filters[idx], kernel_size=3,
                    stride=layer_strides[idx], padding=0, bias=False
                ),
                nn.BatchNorm2d(num_filters[idx], eps=1e-3, momentum=0.01),
                nn.ReLU()
            ]
            for k in range(layer_nums[idx]):  # stack conv layers according to layer_nums
                cur_layers.extend([
                    nn.Conv2d(num_filters[idx], num_filters[idx], kernel_size=3, padding=1, bias=False),
                    nn.BatchNorm2d(num_filters[idx], eps=1e-3, momentum=0.01),
                    nn.ReLU()
                ])
            # Add this block to self.blocks
            # * unpacks the list into separate positional arguments; ** does the same for a dict as keyword arguments
            self.blocks.append(nn.Sequential(*cur_layers))
            if len(upsample_strides) > 0:  # build the upsampling layers, strides (1, 2, 4)
                stride = upsample_strides[idx]
                if stride >= 1:
                    self.deblocks.append(nn.Sequential(
                        nn.ConvTranspose2d(
                            num_filters[idx], num_upsample_filters[idx],
                            upsample_strides[idx],
                            stride=upsample_strides[idx], bias=False
                        ),
                        nn.BatchNorm2d(num_upsample_filters[idx], eps=1e-3, momentum=0.01),
                        nn.ReLU()
                    ))
                else:
                    # np.int was removed in NumPy 1.24; the builtin int is equivalent here
                    stride = np.round(1 / stride).astype(int)
                    self.deblocks.append(nn.Sequential(
                        nn.Conv2d(
                            num_filters[idx], num_upsample_filters[idx],
                            stride,
                            stride=stride, bias=False
                        ),
                        nn.BatchNorm2d(num_upsample_filters[idx], eps=1e-3, momentum=0.01),
                        nn.ReLU()
                    ))

        c_in = sum(num_upsample_filters)  # 384 = 128 * 3
        if len(upsample_strides) > num_levels:
            self.deblocks.append(nn.Sequential(
                nn.ConvTranspose2d(c_in, c_in, upsample_strides[-1], stride=upsample_strides[-1], bias=False),
                nn.BatchNorm2d(c_in, eps=1e-3, momentum=0.01),
                nn.ReLU(),
            ))

        self.num_bev_features = c_in

    def forward(self, data_dict):
        """
        Args:
            data_dict:
                spatial_features : (4, 64, 496, 432)
        Returns:
        """
        spatial_features = data_dict['spatial_features']
        ups = []
        ret_dict = {}
        x = spatial_features
        for i in range(len(self.blocks)):
            x = self.blocks[i](x)

            stride = int(spatial_features.shape[2] / x.shape[2])
            ret_dict['spatial_features_%dx' % stride] = x
            if len(self.deblocks) > 0:  # block outputs: (4, 64, 248, 216) --> (4, 128, 124, 108) --> (4, 256, 62, 54)
                ups.append(self.deblocks[i](x))
            else:
                ups.append(x)

        # If there are several upsampled outputs, concatenate them
        if len(ups) > 1:
            """
            The upsampling layers yield features at 3 scales,
            each of shape (batch_size, 128, 248, 216).
            Concatenating along the channel dimension (dim=1) gives x of shape (batch_size, 384, 248, 216)
            """
            x = torch.cat(ups, dim=1)
        elif len(ups) == 1:
            x = ups[0]

        # False for PointPillars
        if len(self.deblocks) > len(self.blocks):
            x = self.deblocks[-1](x)

        # Store the result in spatial_features_2d and return it
        data_dict['spatial_features_2d'] = x

        return data_dict

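IV. Detection head

The detection head of PointPillars is SSD-like. The OpenPCDet implementation trains a single network for three classes (car, pedestrian, cyclist) rather than using two different network configurations for cars and people as the original paper does. The head therefore has anchors for three classes. Each anchor has two orientations, 0° and 90° in the BEV, and each class has a single anchor size: car [3.9, 1.6, 1.56], pedestrian [0.8, 0.6, 1.73] and cyclist [1.76, 0.6, 1.73] (in metres).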
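A quick count shows how these settings translate into anchors and output channels. The sketch below is pure arithmetic; the dictionary layout and the function names are hypothetical, and the 248×216 feature map size is taken from the backbone output described earlier.

```python
# class name: (anchor size in metres, anchor rotations in radians)
ANCHORS = {
    "Car": ((3.9, 1.6, 1.56), (0.0, 1.57)),
    "Pedestrian": ((0.8, 0.6, 1.73), (0.0, 1.57)),
    "Cyclist": ((1.76, 0.6, 1.73), (0.0, 1.57)),
}


def anchors_per_location(anchors):
    """One size per class, so each class contributes one anchor per rotation."""
    return sum(len(rotations) for _size, rotations in anchors.values())


def head_channels(anchors, num_classes, box_code_size=7, num_dir_bins=2):
    """Output channels of the classification, box and direction convolutions."""
    n = anchors_per_location(anchors)
    return n * num_classes, n * box_code_size, n * num_dir_bins


per_location = anchors_per_location(ANCHORS)       # 2 + 2 + 2
total = 248 * 216 * per_location                   # anchors on the whole feature map
cls_ch, box_ch, dir_ch = head_channels(ANCHORS, num_classes=len(ANCHORS))
```

The three channel counts correspond to the three 1×1 convolutions of the head: class scores, box parameters and direction bins for every anchor at every location.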

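Anchors are matched to GT boxes with 2D IoU, computed directly on the feature map, that is, in the BEV. Height is not taken into account, for two reasons: 1. in KITTI all objects lie on the same plane in 3D space, and there is no case of one car on top of another; 2. the height differences between objects are small, so regressing the height with SmoothL1 already gives good results. The IoU thresholds that decide whether an anchor is a positive or a negative sample are:

Car: IoU greater than or equal to 0.6 is positive, below 0.45 is negative, and anchors in between do not contribute to the loss.

Pedestrian: IoU greater than or equal to 0.5 is positive, below 0.35 is negative, and anchors in between do not contribute to the loss.

Cyclist: IoU greater than or equal to 0.5 is positive, below 0.35 is negative, and anchors in between do not contribute to the loss.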
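The thresholding rule itself fits in a few lines. The following is a simplified sketch: `label_anchor` is a hypothetical helper that only applies the two thresholds to a precomputed IoU value, and it leaves out everything else a real target assigner does.

```python
def label_anchor(iou, matched_threshold, unmatched_threshold):
    """Label one anchor from its best IoU with the GT boxes.

    Returns 1 for a positive, 0 for a negative and -1 for an anchor that is
    ignored when the loss is computed.
    """
    if iou >= matched_threshold:
        return 1
    if iou < unmatched_threshold:
        return 0
    return -1


# Pedestrian thresholds from the text: positive >= 0.5, negative < 0.35
labels = [label_anchor(iou, 0.5, 0.35) for iou in (0.7, 0.5, 0.4, 0.1)]
```

Leaving a band of ignored anchors between the two thresholds avoids penalising anchors that overlap a GT box reasonably well but not well enough to be trained as positives.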

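Each anchor predicts 7 parameters, (x, y, z, w, l, h, θ): x, y, z give the centre of the box in the point cloud, w, l, h give its width, length and height, and θ gives its rotation angle.

Because the angle regression cannot tell apart two boxes that face in exactly opposite directions, the PointPillars head also predicts a direction for each anchor, using a softmax-based classifier over the two possible headings of the box.

The code is in pcdet/models/dense_heads/anchor_head_single.py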

import numpy as np
import torch.nn as nn

from .anchor_head_template import AnchorHeadTemplate


class AnchorHeadSingle(AnchorHeadTemplate):
    """
    Args:
        model_cfg: configuration of AnchorHeadSingle
        input_channels: 384, number of input channels
        num_class: 3
        class_names: ['Car','Pedestrian','Cyclist']
        grid_size: (432, 496, 1)
        point_cloud_range: (0, -39.68, -3, 69.12, 39.68, 1)
        predict_boxes_when_training: False
    """

    def __init__(self, model_cfg, input_channels, num_class, class_names, grid_size, point_cloud_range,
                 predict_boxes_when_training=True, **kwargs):
        super().__init__(
            model_cfg=model_cfg, num_class=num_class, class_names=class_names, grid_size=grid_size,
            point_cloud_range=point_cloud_range,
            predict_boxes_when_training=predict_boxes_when_training
        )
        # Every location has anchors for 3 classes, each with two orientations (0 and 90 degrees)
        # num_anchors_per_location: [2, 2, 2]
        self.num_anchors_per_location = sum(self.num_anchors_per_location)  # sum([2, 2, 2]) = 6
        # Conv2d(384, 18, kernel_size=(1, 1), stride=(1, 1))
        self.conv_cls = nn.Conv2d(
            input_channels, self.num_anchors_per_location * self.num_class,
            kernel_size=1
        )
        # Conv2d(384, 42, kernel_size=(1, 1), stride=(1, 1))
        self.conv_box = nn.Conv2d(
            input_channels, self.num_anchors_per_location * self.box_coder.code_size,
            kernel_size=1
        )
        # If the direction loss is enabled, add the direction conv: Conv2d(384, 12, kernel_size=(1, 1), stride=(1, 1))
        if self.model_cfg.get('USE_DIRECTION_CLASSIFIER', None) is not None:
            self.conv_dir_cls = nn.Conv2d(
                input_channels,
                self.num_anchors_per_location * self.model_cfg.NUM_DIR_BINS,
                kernel_size=1
            )
        else:
            self.conv_dir_cls = None
        self.init_weights()

    # Initialise the parameters
    def init_weights(self):
        pi = 0.01
        # Initialise the bias of the classification conv
        nn.init.constant_(self.conv_cls.bias, -np.log((1 - pi) / pi))
        # Initialise the weights of the box regression conv
        nn.init.normal_(self.conv_box.weight, mean=0, std=0.001)

    def forward(self, data_dict):
        # Fetch the features produced by the backbone
        # spatial_features_2d shape: (batch_size, 384, 248, 216)
        spatial_features_2d = data_dict['spatial_features_2d']
        # Class prediction for the 6 anchors at every location --> (batch_size, 18, 248, 216)
        cls_preds = self.conv_cls(spatial_features_2d)
        # Box prediction for the 6 anchors at every location --> (batch_size, 42, 248, 216)
        # Each anchor predicts 7 parameters: (x, y, z, w, l, h, θ)
        box_preds = self.conv_box(spatial_features_2d)
        # Move the class channels to the last dimension: [N, H, W, C] --> (batch_size, 248, 216, 18)
        cls_preds = cls_preds.permute(0, 2, 3, 1).contiguous()
        # Move the box channels to the last dimension: [N, H, W, C] --> (batch_size, 248, 216, 42)
        box_preds = box_preds.permute(0, 2, 3, 1).contiguous()
        # Store the class and box predictions in the forward dictionary
        self.forward_ret_dict['cls_preds'] = cls_preds
        self.forward_ret_dict['box_preds'] = box_preds
        # Direction classification
        if self.conv_dir_cls is not None:
            # Every anchor is classified into one of two directions --> (batch_size, 12, 248, 216)
            dir_cls_preds = self.conv_dir_cls(spatial_features_2d)
            # Move the direction channels to the last dimension: [N, H, W, C] --> (batch_size, 248, 216, 12)
            dir_cls_preds = dir_cls_preds.permute(0, 2, 3, 1).contiguous()
            # Store the direction predictions in the forward dictionary
            self.forward_ret_dict['dir_cls_preds'] = dir_cls_preds
        else:
            dir_cls_preds = None

        """
        In training mode, GT boxes are assigned to the anchors so that the loss can be computed
        """
        if self.training:
            # targets_dict = {
            #     'box_cls_labels': cls_labels, # (4, 321408)
            #     'box_reg_targets': bbox_targets, # (4, 321408, 7)
            #     'reg_weights': reg_weights # (4, 321408)
            # }
            targets_dict = self.assign_targets(
                gt_boxes=data_dict['gt_boxes']  # (4, 39, 8)
            )
            # Store the assignment result in the forward dictionary
            self.forward_ret_dict.update(targets_dict)

        # Outside training (or when requested during training), decode the boxes directly
        if not self.training or self.predict_boxes_when_training:
            # Decode the predictions into the final boxes
            batch_cls_preds, batch_box_preds = self.generate_predicted_boxes(
                batch_size=data_dict['batch_size'],
                cls_preds=cls_preds, box_preds=box_preds, dir_cls_preds=dir_cls_preds
            )
            data_dict['batch_cls_preds'] = batch_cls_preds  # (1, 321408, 3)  248 * 216 * 6 = 321408
            data_dict['batch_box_preds'] = batch_box_preds  # (1, 321408, 7)
            data_dict['cls_preds_normalized'] = False

        return data_dict

V. Loss computation

PointPillars uses the same loss as SECOND. Every GT box is described by the 7 parameters (x, y, z, w, l, h, θ).

1. The loss in theory

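1. The regression residuals of the localization task between a GT box and an anchor are defined as:

$$\Delta x = \frac{x^{gt} - x^{a}}{d^{a}}, \quad \Delta y = \frac{y^{gt} - y^{a}}{d^{a}}, \quad \Delta z = \frac{z^{gt} - z^{a}}{h^{a}}$$

$$\Delta w = \log\frac{w^{gt}}{w^{a}}, \quad \Delta l = \log\frac{l^{gt}}{l^{a}}, \quad \Delta h = \log\frac{h^{gt}}{h^{a}}, \quad \Delta\theta = \sin(\theta^{gt} - \theta^{a})$$

Here x^gt is the x coordinate of the annotated box, x^a is the x coordinate of the anchor, and d^a is the diagonal of the anchor's width and length, defined as:

$$d^{a} = \sqrt{(w^{a})^{2} + (l^{a})^{2}}$$

The total regression loss is therefore:

$$\mathcal{L}_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}(\Delta b)$$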
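The residual encoding of the PointPillars paper can be written out in a few lines. This is a minimal sketch in pure Python following the paper's formulas; `encode_box`, `smooth_l1` and `loc_loss` are hypothetical names, the `beta` of the SmoothL1 is an assumed default, and the OpenPCDet box coder and loss may organise these steps differently.

```python
import math


def encode_box(gt, anchor):
    """Residuals between a GT box and an anchor, both (x, y, z, w, l, h, theta)."""
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    d = math.hypot(wa, la)  # diagonal of the anchor's width and length
    return (
        (xg - xa) / d, (yg - ya) / d, (zg - za) / ha,
        math.log(wg / wa), math.log(lg / la), math.log(hg / ha),
        math.sin(tg - ta),
    )


def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear beyond beta (beta is an assumed default)."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def loc_loss(gt, anchor):
    """Sum of SmoothL1 over the 7 residuals."""
    return sum(smooth_l1(r) for r in encode_box(gt, anchor))


anchor = (0.0, 0.0, -1.0, 3.0, 4.0, 1.5, 0.0)
shifted = (5.0, 0.0, -1.0, 3.0, 4.0, 1.5, 0.0)   # GT moved by one diagonal in x
residuals = encode_box(shifted, anchor)
```

Dividing the position offsets by the anchor diagonal and taking the logarithm of the size ratios keeps the regression targets at a similar scale for large and small classes.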

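2. Classification task

For the object classification of each anchor, PointPillars uses focal loss, which balances positive and negative samples and focuses training on hard examples. It is defined as:

$$\mathcal{L}_{cls} = -\alpha_{a}\,(1 - p^{a})^{\gamma}\,\log p^{a}$$

Here p^a is the class probability of an anchor. The alpha and gamma parameters are set as in RetinaNet, to 0.25 and 2 respectively.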
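For a single anchor and a single class the focal loss reduces to a short function. The sketch below is a scalar illustration with the RetinaNet defaults; `focal_loss` is a hypothetical name, and treating negatives with `1 - alpha` and `1 - p` follows the usual binary formulation, which is an assumption here.

```python
import math


def focal_loss(p, is_positive, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p in (0, 1)."""
    p_t = p if is_positive else 1.0 - p              # probability of the true label
    alpha_t = alpha if is_positive else 1.0 - alpha  # class balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, is_positive=True)   # confident and correct
hard = focal_loss(0.1, is_positive=True)   # confident and wrong
```

The factor `(1 - p_t) ** gamma` shrinks the loss of well-classified anchors, so the many easy background anchors do not dominate training.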

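3. Anchor direction classification

The angle regression cannot fully distinguish two predicted boxes that face in exactly opposite directions, so the authors add a direction classification for each anchor and use a softmax to predict the direction class.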
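A small numeric check makes the ambiguity concrete: a sine-based angle residual is (numerically) zero when the prediction is off by exactly π. The sketch below uses hypothetical helpers; in particular `direction_label` is a simplified two-bin rule chosen for illustration and not the exact binning used in OpenPCDet.

```python
import math


def sin_residual(theta_pred, theta_gt):
    """Angle residual as it enters a sine-based regression loss."""
    return math.sin(theta_pred - theta_gt)


def direction_label(theta):
    """Simplified two-bin direction class: 1 if the heading is in [0, pi), else 0."""
    return int((theta % (2.0 * math.pi)) < math.pi)


theta_gt = 0.3
flipped = theta_gt + math.pi                      # same box, opposite heading
residual = sin_residual(flipped, theta_gt)        # close to 0: the regression sees no error
labels = (direction_label(theta_gt), direction_label(flipped))
```

The regression term cannot penalise the flipped box, while the two direction labels differ, which is the information the extra classifier supplies.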

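The total loss is therefore defined as:

$$\mathcal{L} = \frac{1}{N_{pos}}\left(\beta_{loc}\,\mathcal{L}_{loc} + \beta_{cls}\,\mathcal{L}_{cls} + \beta_{dir}\,\mathcal{L}_{dir}\right)$$

Here N_pos is the number of positive anchors, and the coefficients are β_loc = 2, β_cls = 1 and β_dir = 0.2.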

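2. The loss in code

The loss computation involves a fair amount of code, so the walkthrough is split into the following three parts:

1. Anchor generation

2. Matching GT boxes to anchors

3. Loss computation

1. Anchor generation

The code is in pcdet/models/dense_heads/target_assigner/anchor_generator.py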

import torch


class AnchorGenerator(object):
    def __init__(self, anchor_range, anchor_generator_config):
        super().__init__()
        self.anchor_generator_cfg = anchor_generator_config  # list: 3
        # Range in which anchors are placed: [0, -39.68, -3, 69.12, 39.68, 1]
        self.anchor_range = anchor_range
        # Anchor sizes of every class in the config
        # list: 3 --> car, pedestrian, cyclist: [[[3.9, 1.6, 1.56]], [[0.8, 0.6, 1.73]], [[1.76, 0.6, 1.73]]]
        self.anchor_sizes = [config['anchor_sizes'] for config in anchor_generator_config]
        # Anchor rotations in radians, i.e. 0 and 90 degrees
        # list: 3 --> [[0, 1.57], [0, 1.57], [0, 1.57]]
        self.anchor_rotations = [config['anchor_rotations'] for config in anchor_generator_config]
        # z position at which each anchor is initialised; the KITTI point cloud spans -3 m to 1 m in z
        # list: 3 --> [[-1.78], [-0.6], [-0.6]]
        self.anchor_heights = [config['anchor_bottom_heights'] for config in anchor_generator_config]
        # Whether anchors are placed at the centre of each grid cell.
        # For example, for cell [1, 1], centre alignment adds 0.5 to give [1.5, 1.5]
        # Defaults to False
        # list: 3 --> [False, False, False]
        self.align_center = [config.get('align_center', False) for config in anchor_generator_config]

        assert len(self.anchor_sizes) == len(self.anchor_rotations) == len(self.anchor_heights)
        self.num_of_anchor_sets = len(self.anchor_sizes)  # 3

    def generate_anchors(self, grid_sizes):
        assert len(grid_sizes) == self.num_of_anchor_sets
        # 1. Initialisation
        all_anchors = []
        num_anchors_per_location = []
        # 2. Generate the anchors class by class
        for grid_size, anchor_size, anchor_rotation, anchor_height, align_center in zip(
                grid_sizes, self.anchor_sizes, self.anchor_rotations, self.anchor_heights, self.align_center):
            # 2 = 2 x 1 x 1 --> 2 anchors per location, one for each rotation
            num_anchors_per_location.append(len(anchor_rotation) * len(anchor_size) * len(anchor_height))
            # Centre-aligned anchors (not used in this configuration, where align_center is False)
            if align_center:
                x_stride = (self.anchor_range[3] - self.anchor_range[0]) / grid_size[0]
                y_stride = (self.anchor_range[4] - self.anchor_range[1]) / grid_size[1]
                # Centre alignment: shift by half a cell
                x_offset, y_offset = x_stride / 2, y_stride / 2
            else:
                # 2.1 Compute the size of one grid cell in the point cloud,
                # used to map each anchor back to real-world coordinates
                # (69.12 - 0) / (216 - 1) = 0.3214883848678234 m
                x_stride = (self.anchor_range[3] - self.anchor_range[0]) / (grid_size[0] - 1)
                # (39.68 - (-39.68)) / (248 - 1) = 0.3212955490297634 m
                y_stride = (self.anchor_range[4] - self.anchor_range[1]) / (grid_size[1] - 1)
                # Without centre alignment, every anchor has zero offset from the cell corner
                x_offset, y_offset = 0, 0

            # 2.2 Generate the 1-D coordinates x_shifts, y_shifts and z_shifts
            # Step by x_stride from self.anchor_range[0] + x_offset up to self.anchor_range[3] + 1e-5
            # to get the x coordinates --> 216 points in [0, 69.12]
            x_shifts = torch.arange(
                self.anchor_range[0] + x_offset, self.anchor_range[3] + 1e-5, step=x_stride, dtype=torch.float32,
            ).cuda()
            # y coordinates --> 248 points in [-39.68, 39.68]
            y_shifts = torch.arange(
                self.anchor_range[1] + y_offset, self.anchor_range[4] + 1e-5, step=y_stride, dtype=torch.float32,
            ).cuda()
            # new_tensor returns a new tensor that shares the properties of the tensor it is
            # called on (same dtype and device) and holds the data passed in,
            # here anchor_height.
            # [-1.78]
            z_shifts = x_shifts.new_tensor(anchor_height)
            # num_anchor_size = 1
            # num_anchor_rotation = 2
            num_anchor_size, num_anchor_rotation = anchor_size.__len__(), anchor_rotation.__len__()  # 1, 2
            # [0, 1.57] in radians
            anchor_rotation = x_shifts.new_tensor(anchor_rotation)
            # [[3.9, 1.6, 1.56]]
            anchor_size = x_shifts.new_tensor(anchor_size)

            # 2.3 Call meshgrid to build the grid coordinates
            x_shifts, y_shifts, z_shifts = torch.meshgrid([
                x_shifts, y_shifts, z_shifts
            ])
            # meshgrid can be read as expanding each input along the other dimensions, e.g.:
            # x: (216,) --> (216, 1, 1) --> (216, 248, 1)
            # y: (248,) --> (1, 248, 1) --> (216, 248, 1)
            # z: (1,)   --> (1, 1, 1)   --> (216, 248, 1)

            # 2.4 Stack the individual dimensions into the final anchors (1, 248, 216, 1, 2, 7)
            # 2.4.1 Stack the anchor positions
            # [x, y, z, 3] --> [216, 248, 1, 3], the position of every anchor
            # The last dimension of size 3 holds the (x, y, z) coordinates of that grid point
            anchors = torch.stack((x_shifts, y_shifts, z_shifts), dim=-1)
            # 2.4.2 Combine position and size: expand and repeat both tensors to the same
            # shape (except for the last dimension), then concatenate them
            # (216, 248, 1, 3) --> (216, 248, 1, 1, 3)
            # Dimensions: x, y, z, number of anchor sizes for this class, anchor position
            anchors = anchors[:, :, :, None, :].repeat(1, 1, 1, anchor_size.shape[0], 1)
            # (1, 1, 1, 1, 3) --> (216, 248, 1, 1, 3)
            anchor_size = anchor_size.view(1, 1, 1, -1, 3).repeat([*anchors.shape[0:3], 1, 1])
            # The anchors need both position and size --> (216, 248, 1, 1, 6)
            # The last dimension holds (x, y, z, dx, dy, dz)
            anchors = torch.cat((anchors, anchor_size), dim=-1)
            # 2.4.3 Combine position, size and rotation
            # Insert a new second-to-last dimension and repeat along it once per rotation
            # (216, 248, 1, 1, 2, 6): x, y, z, number of sizes, number of rotations, position + size
            anchors = anchors[:, :, :, :, None, :].repeat(1, 1, 1, 1, num_anchor_rotation, 1)
            # (216, 248, 1, 1, 2, 1): the two rotation angles
            anchor_rotation = anchor_rotation.view(1, 1, 1, 1, -1, 1).repeat(
                [*anchors.shape[0:3], num_anchor_size, 1, 1])
            # [x, y, z, num_size, num_rot, 7] --> (216, 248, 1, 1, 2, 7)
            # The last dimension holds position + size + rotation (x, y, z, dx, dy, dz, theta)
            anchors = torch.cat((anchors, anchor_rotation), dim=-1)  # [x, y, z, num_size, num_rot, 7]

            # 2.5 Permute the anchor dimensions
            # [x, y, z, num_anchor_size, num_rot, 7] --> [z, y, x, num_anchor_size, num_rot, 7]
            # The last dimension is unchanged: [x, y, z, dx, dy, dz, rot]
            anchors = anchors.permute(2, 1, 0, 3, 4, 5).contiguous()
            # Shift each anchor's z from the bottom of the box to its center
            # Car: -1.78 + 1.56 / 2 = -1.0
            # Pedestrian, Cyclist: -0.6 + 1.73 / 2 = 0.265
            anchors[..., 2] += anchors[..., 5] / 2
            all_anchors.append(anchors)
        # all_anchors: [(1, 248, 216, 1, 2, 7), (1, 248, 216, 1, 2, 7), (1, 248, 216, 1, 2, 7)]
        # num_anchors_per_location: [2, 2, 2]
        return all_anchors, num_anchors_per_location
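
To check the shapes traced in the comments above without a GPU, here is a minimal NumPy sketch of the same single-class anchor layout. It is not OpenPCDet's code: the function name `generate_anchors_for_class` is hypothetical, it uses broadcasting instead of `meshgrid` plus `repeat`, and it builds the grid centers from integer indices rather than from `arange` with a `1e-5` end margin. The range, grid size and Car anchor values are the ones quoted in this article.

```python
import numpy as np


def generate_anchors_for_class(anchor_range, grid_size, anchor_size,
                               rotations, bottom_heights, align_center=False):
    """Illustrative re-implementation of one class's anchor grid (hypothetical helper)."""
    x_min, y_min, _, x_max, y_max, _ = anchor_range
    nx, ny = grid_size
    if align_center:
        # Cells tile the range; anchors sit half a cell in from the edge.
        x_stride, y_stride = (x_max - x_min) / nx, (y_max - y_min) / ny
        x_offset, y_offset = x_stride / 2, y_stride / 2
    else:
        # First and last anchors sit exactly on the range boundaries.
        x_stride, y_stride = (x_max - x_min) / (nx - 1), (y_max - y_min) / (ny - 1)
        x_offset, y_offset = 0.0, 0.0

    xs = x_min + x_offset + x_stride * np.arange(nx)
    ys = y_min + y_offset + y_stride * np.arange(ny)
    zs = np.asarray(bottom_heights, dtype=np.float64)
    sizes = np.asarray(anchor_size, dtype=np.float64).reshape(-1, 3)
    rots = np.asarray(rotations, dtype=np.float64)

    # Axes: [x, y, z, num_size, num_rot, 7]; last axis is (x, y, z, dx, dy, dz, rot).
    anchors = np.empty((nx, ny, len(zs), len(sizes), len(rots), 7), dtype=np.float64)
    anchors[..., 0] = xs[:, None, None, None, None]
    anchors[..., 1] = ys[None, :, None, None, None]
    anchors[..., 2] = zs[None, None, :, None, None]
    anchors[..., 3:6] = sizes[None, None, None, :, None, :]
    anchors[..., 6] = rots[None, None, None, None, :]

    # Reorder the grid axes to [z, y, x, ...]; the last axis keeps its layout.
    anchors = anchors.transpose(2, 1, 0, 3, 4, 5).copy()
    # Move z from the bottom of the box to its center.
    anchors[..., 2] += anchors[..., 5] / 2
    return anchors


car = generate_anchors_for_class(
    anchor_range=[0, -39.68, -3, 69.12, 39.68, 1],
    grid_size=(216, 248),
    anchor_size=[[3.9, 1.6, 1.56]],
    rotations=[0, 1.57],
    bottom_heights=[-1.78],
)
print(car.shape)  # (1, 248, 216, 1, 2, 7)
```

Indexing `car[0, j, i]` gives the two rotated anchors at y index `j` and x index `i`, which matches the [z, y, x] axis order produced by the permute step.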

To be continued.
