YOLOv5 In-Depth Analysis + Debug-Level Source Code Walkthrough Series (3): YOLOv5 Head Source Code Analysis

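Preface

In the previous article we walked through the source code that builds the backbone network. In this one we debug our way through the rest of model.py. If you have not read the earlier articles, I recommend starting with the first and second parts of this series, listed below:

1. YOLOv5 source code analysis, part 1: architecture design and debug setup

2. YOLOv5 source code analysis, part 2: backbone source code analysis

Today we continue with the Detect class in model.py, which corresponds to YOLOv5's detection head.

The Detect class lives in model.py, and its code is as follows: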

class Detect(nn.Module):
    stride = None  # strides computed during build
    export = False  # onnx export

    def __init__(self, nc=80, anchors=(), ch=()):  # detection layer
        super(Detect, self).__init__()
        self.nc = nc  # number of classes
        self.no = nc + 5  # number of outputs per anchor  85 for coco
        self.nl = len(anchors)  # number of detection layers 3
        self.na = len(anchors[0]) // 2  # number of anchors  3
        self.grid = [torch.zeros(1)] * self.nl  # init grid
        a = torch.tensor(anchors).float().view(self.nl, -1, 2)
        self.register_buffer('anchors', a)  # shape(nl,na,2)
        self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2))  # shape(nl,1,na,1,1,2)
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv 128=>255/256=>255/512=>255

    def forward(self, x):
        # x = x.copy()  # for profiling
        z = []  # inference output
        self.training |= self.export
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i] = self._make_grid(nx, ny).to(x[i].device)

                y = x[i].sigmoid()
                y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy
                y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                z.append(y.view(bs, -1, self.no))

        return x if self.training else (torch.cat(z, 1), x)

    @staticmethod
    def _make_grid(nx=20, ny=20):
        yv, xv = torch.meshgrid([torch.arange(ny), torch.arange(nx)])
        return torch.stack((xv, yv), 2).view((1, 1, ny, nx, 2)).float()

Let's first look at the class's __init__() function:

    def __init__(self, nc=80, anchors=(), ch=()):  # detection layer
        super(Detect, self).__init__()
        self.nc = nc  # number of classes
        self.no = nc + 5  # number of outputs per anchor  85 for coco
        self.nl = len(anchors)  # number of detection layers 3
        self.na = len(anchors[0]) // 2  # number of anchors  3
        self.grid = [torch.zeros(1)] * self.nl  # init grid
        a = torch.tensor(anchors).float().view(self.nl, -1, 2)
        self.register_buffer('anchors', a)  # shape(nl,na,2)
        self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2))  # shape(nl,1,na,1,1,2)
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv 128=>255/256=>255/512=>255

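YOLOv5's detection head still sits on an FPN-style feature pyramid and predicts at three scales, so self.m holds three output convolutions whose channels change as 128=>255, 256=>255 and 512=>255.
self.no is the number of outputs per anchor: each anchor predicts 80 classes (COCO) + 4 box values (xywh) + 1 confidence score, which gives 85. Each scale has 3 anchors, so every convolution outputs 85*3=255 channels.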
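The channel arithmetic above can be checked with a few lines of plain Python. This is only a minimal sketch, not the repository code: the `anchors` values are assumed to be the default COCO anchors from the yolov5s config, and `ch` is taken from the 128/256/512 input channels mentioned above.

```python
# Assumed default COCO anchors (three scales, three (w, h) pairs each).
anchors = (
    (10, 13, 16, 30, 33, 23),
    (30, 61, 62, 45, 59, 119),
    (116, 90, 156, 198, 373, 326),
)
ch = (128, 256, 512)  # input channels of the three head convolutions
nc = 80               # number of classes (COCO)

no = nc + 5                # outputs per anchor: 4 box values + 1 confidence + nc classes
nl = len(anchors)          # number of detection layers
na = len(anchors[0]) // 2  # anchors per layer (each anchor is a (w, h) pair)
out_channels = no * na     # output channels of every 1x1 conv in self.m

for c_in in ch:
    print(f"Conv2d({c_in}, {out_channels}, kernel_size=1)")
```

Changing `nc` (for a custom dataset, say) changes `no` and therefore the output channel count of all three convolutions.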

Next, let's look at the forward() function of the head:

    def forward(self, x):
        # x = x.copy()  # for profiling
        z = []  # inference output
        self.training |= self.export
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i] = self._make_grid(nx, ny).to(x[i].device)

                y = x[i].sigmoid()
                y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy
                y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                z.append(y.view(bs, -1, self.no))

        return x if self.training else (torch.cat(z, 1), x)

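x is a list holding the inputs to the three heads. For a 256x256 input image (the three layers have strides 8, 16 and 32), their shapes are: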

[B, 128, 32, 32]
[B, 256, 16, 16]
[B, 512, 8, 8]

The three inputs are passed in turn through the three convolutions to produce the outputs.

x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

This reshapes each x[i] as follows:

x[0]: (bs,255,32,32) => (bs,3,32,32,85)
x[1]: (bs,255,16,16) => (bs,3,16,16,85)
x[2]: (bs,255,8,8) => (bs,3,8,8,85)
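To make the view/permute step concrete, here is a small sketch in plain Python (no PyTorch) of how the 255 channels are split. The helper name `split_channel` is hypothetical; it only illustrates the index arithmetic implied by `view(bs, na, no, ny, nx)`, assuming the COCO values na=3 and no=85.

```python
na, no = 3, 85  # anchors per layer, outputs per anchor (COCO)

def split_channel(c):
    """Map a flat channel index in [0, na*no) to (anchor index, output index).

    view(bs, na, no, ny, nx) splits the channel axis row-major, so channel c
    belongs to anchor c // no and holds output c % no of that anchor.
    """
    return divmod(c, no)

# Channel 0 is x of anchor 0; channel 89 is the confidence (index 4) of anchor 1.
print(split_channel(0), split_channel(89), split_channel(254))

# Number of predictions produced by the three layers (32x32, 16x16, 8x8 grids).
total = sum(na * ny * nx for ny, nx in [(32, 32), (16, 16), (8, 8)])
print(total)
```

The permute that follows only moves the 85-value axis to the end, so each (anchor, row, column) position ends up with its 85 outputs stored contiguously.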

            if not self.training:  # inference
                if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i] = self._make_grid(nx, ny).to(x[i].device)

    def _make_grid(nx=20, ny=20):
        yv, xv = torch.meshgrid([torch.arange(ny), torch.arange(nx)])
        return torch.stack((xv, yv), 2).view((1, 1, ny, nx, 2)).float()

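_make_grid() prepares the grid of cell coordinates. All predicted offsets are expressed in grid-cell units rather than in pixels of the original image. Note that each layer has its own grid size, which equals the width and height of that layer's output.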
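As an illustration of what the grid contains, the following is a pure-Python sketch of the same layout using nested lists instead of tensors. It is not the repository code, and the function name `make_grid` is only an assumption for this example.

```python
def make_grid(nx=20, ny=20):
    """Return grid[y][x] == (x, y): the offset of each cell, in cell units."""
    return [[(float(x), float(y)) for x in range(nx)] for y in range(ny)]

grid = make_grid(nx=4, ny=2)
print(len(grid), len(grid[0]))  # rows (ny), columns (nx)
print(grid[1][3])               # cell in row 1, column 3 stores (x=3, y=1)
```

The tensor version stores the same values with shape (1, 1, ny, nx, 2), so it broadcasts over the batch and anchor dimensions when added to the xy predictions.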

   y = x[i].sigmoid()
   y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy
   y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
   z.append(y.view(bs, -1, self.no))

This is the core of the inference code, so it deserves a careful look. Compared with YOLOv3, YOLOv5 makes a few changes:

y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]

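Here the predicted x, y of the box center is clearly multiplied by 2 and shifted by -0.5, so the offset range changes from the open interval (0, 1) in YOLOv3 to (-0.5, 1.5).
I do not know the official reason for this change. On the surface, a cell can now predict a center up to half a cell beyond its own borders, which should help recall somewhat. Another benefit is that it fixes a YOLOv3 issue: because the sigmoid output is an open interval, the center could never land exactly on a cell boundary. This is my own reading; if you have other ideas, please leave a comment.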
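A worked example of the xy decoding may help. This is a scalar sketch of the tensor expression, with a hypothetical helper `decode_xy`; the cell index and stride are made-up values for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_xy(tx, ty, gx, gy, stride):
    """Decode raw outputs (tx, ty) for grid cell (gx, gy) into pixel coordinates."""
    cx = (sigmoid(tx) * 2.0 - 0.5 + gx) * stride
    cy = (sigmoid(ty) * 2.0 - 0.5 + gy) * stride
    return cx, cy

# A raw output of 0 gives sigmoid = 0.5, i.e. an offset of 0.5: the cell center.
print(decode_xy(0.0, 0.0, gx=3, gy=5, stride=8))
# Large negative / positive raw outputs approach offsets of -0.5 and 1.5.
print(decode_xy(-20.0, 20.0, gx=3, gy=5, stride=8))
```

With stride 8, the center predicted by cell (3, 5) can therefore range over roughly 20 to 36 pixels in x, instead of 24 to 32 with the YOLOv3 formula.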

y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]

This line predicts the w and h of the bounding box. First, recall how YOLOv3 decodes its predictions:

        pred_boxes[..., 0] = x.data + self.grid_x
        pred_boxes[..., 1] = y.data + self.grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * self.anchor_w
        pred_boxes[..., 3] = torch.exp(h.data) * self.anchor_h

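In YOLOv3 the width and height are an exponential of the raw prediction multiplied by the anchor size. In YOLOv5 the multiplier becomes (2*sigmoid(t_w))^2 and (2*sigmoid(t_h))^2, where t_w and t_h are the raw outputs (in the code, y has already been passed through sigmoid).
The range of the multiplier changes from (0, +inf) for exp to the bounded interval (0, 4). My understanding is that a predicted box can be at most 4 times the anchor size and can also be smaller than the anchor, so each anchor covers a bounded range of scales rather than an unbounded one. As above, this is my own reading; if you have other ideas, please leave a comment.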
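The two multipliers can be compared numerically. The sketch below uses scalar Python in place of tensors, and the function names `wh_scale_v5` and `wh_scale_v3` are hypothetical labels for the two formulas discussed above.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def wh_scale_v5(t):
    """YOLOv5-style multiplier applied to the anchor size: bounded by 0 and 4."""
    return (2.0 * sigmoid(t)) ** 2

def wh_scale_v3(t):
    """YOLOv3-style multiplier: exp(t), unbounded above."""
    return math.exp(t)

for t in (-5.0, 0.0, 5.0):
    print(f"t={t:+.1f}  v5 scale={wh_scale_v5(t):.4f}  v3 scale={wh_scale_v3(t):.4f}")
```

Both formulas return a multiplier of 1 at t=0 (the box equals the anchor), but for large t the exponential keeps growing while the YOLOv5 form saturates toward 4.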

That covers all the code in the Detect class. Now let's go back to the Model class and finish with its forward pass, which involves two functions, forward() and forward_once():

    def forward_once(self, x, profile=False):
        y, dt = [], []  # outputs
        for m in self.model:
            if m.f != -1:  # if not from previous layer
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers

            if profile:
                o = thop.profile(m, inputs=(x,), verbose=False)[0] / 1E9 * 2 if thop else 0  # FLOPS
                t = time_synchronized()
                for _ in range(10):
                    _ = m(x)
                dt.append((time_synchronized() - t) * 100)
                print('%10.1f%10.0f%10.1fms %-40s' % (o, m.np, dt[-1], m.type))

            x = m(x)  # run
            y.append(x if m.i in self.save else None)  # save output

        if profile:
            print('%.1fms total' % sum(dt))
        return x

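self.forward_once() runs every module in the model once, in order, and returns the result. With profile enabled it also records each module's FLOPs and average execution time (each module is run 10 times), which can be used to find the model's bottlenecks when trying to improve its speed or reduce its GPU memory usage.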
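The routing logic around m.f and self.save is the part that is easiest to miss, so here is a toy sketch of it. It is not the real Model class: layers are modeled as hypothetical (f, fn) pairs, plain numbers stand in for feature maps, and sum() stands in for a Concat module.

```python
def forward_once(layers, save, x):
    """layers: list of (f, fn) pairs; f is -1, an int index, or a list of indices.
    save: set of layer indices whose outputs later layers will need."""
    y = []  # cached outputs, None for layers nobody refers back to
    for i, (f, fn) in enumerate(layers):
        if f != -1:  # input does not come (only) from the previous layer
            x = y[f] if isinstance(f, int) else [x if j == -1 else y[j] for j in f]
        x = fn(x)
        y.append(x if i in save else None)
    return x

# Toy "model": layer 2 combines the previous output with the saved output of layer 0.
layers = [
    (-1, lambda v: v + 1),          # layer 0
    (-1, lambda v: v * 10),         # layer 1
    ([-1, 0], lambda vs: sum(vs)),  # layer 2
]
print(forward_once(layers, save={0}, x=1))
```

Only outputs listed in `save` are kept; everything else is stored as None, which avoids holding on to feature maps that no later layer reads.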

    def forward(self, x, augment=False, profile=False):
        if augment:
            img_size = x.shape[-2:]  # height, width
            s = [1, 0.83, 0.67]  # scales
            f = [None, 3, None]  # flips (2-ud, 3-lr)
            y = []  # outputs
            for si, fi in zip(s, f):
                xi = scale_img(x.flip(fi) if fi else x, si, gs=int(self.stride.max()))
                yi = self.forward_once(xi)[0]  # forward
                # cv2.imwrite(f'img_{si}.jpg', 255 * xi[0].cpu().numpy().transpose((1, 2, 0))[:, :, ::-1])  # save
                yi[..., :4] /= si  # de-scale
                if fi == 2:
                    yi[..., 1] = img_size[0] - 1 - yi[..., 1]  # de-flip ud
                elif fi == 3:
                    yi[..., 0] = img_size[1] - 1 - yi[..., 0]  # de-flip lr
                y.append(yi)
            return torch.cat(y, 1), None  # augmented inference, train
        else:
            return self.forward_once(x, profile)  # single-scale inference, train

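In self.forward(), augment can be understood as the switch for TTA (test-time augmentation). When enabled, the image is run at three scales (1, 0.83, 0.67), with a left-right flip on the second one, and the predictions are de-scaled, de-flipped and concatenated. It is off by default.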
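The de-scale and de-flip steps can be sketched for a single box as follows. The helper `undo_augment` is hypothetical and works on one (x, y, w, h) tuple instead of a tensor; the scale of 0.5 is chosen only to keep the numbers round and is not one of the scales used in forward().

```python
def undo_augment(box, scale, flip, img_size):
    """Map an (x, y, w, h) prediction on the augmented image back to the original.

    flip follows the convention in the source: None (no flip), 2 (up-down), 3 (left-right).
    img_size is (height, width) of the original image.
    """
    x, y, w, h = (v / scale for v in box)  # de-scale
    if flip == 2:
        y = img_size[0] - 1 - y            # de-flip up-down
    elif flip == 3:
        x = img_size[1] - 1 - x            # de-flip left-right
    return x, y, w, h

print(undo_augment((50.0, 20.0, 10.0, 6.0), scale=0.5, flip=3, img_size=(256, 416)))
```

Only the center coordinate is mirrored; the width and height of a box are unaffected by a flip and only need de-scaling.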

def scale_img(img, ratio=1.0, same_shape=False, gs=32):  # img(16,3,256,416)
    # scales img(bs,3,y,x) by ratio constrained to gs-multiple
    if ratio == 1.0:
        return img
    else:
        h, w = img.shape[2:]
        s = (int(h * ratio), int(w * ratio))  # new size
        img = F.interpolate(img, size=s, mode='bilinear', align_corners=False)  # resize
        if not same_shape:  # pad/crop img
            h, w = [math.ceil(x * ratio / gs) * gs for x in (h, w)]
        return F.pad(img, [0, w - s[1], 0, h - s[0]], value=0.447)  # value = imagenet mean

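The source of scale_img is shown above. It resizes the image with plain bilinear interpolation according to ratio, then pads it on the right and bottom with the constant 0.447 (not 0). With same_shape=False, as in the call from forward(), the padding brings the scaled image up to the next multiple of gs; only with same_shape=True is it padded back to the original size.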
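The size computation in scale_img can be reproduced without PyTorch. The sketch below uses a hypothetical helper, `scaled_and_padded_size`, that mirrors only the arithmetic, applied to the 256x416 image size from the comment in the source.

```python
import math

def scaled_and_padded_size(h, w, ratio, gs=32, same_shape=False):
    """Return ((resized_h, resized_w), (final_h, final_w)) as computed in scale_img."""
    s = (int(h * ratio), int(w * ratio))  # size after the bilinear resize
    if not same_shape:  # round the scaled size up to a multiple of gs
        h, w = (math.ceil(v * ratio / gs) * gs for v in (h, w))
    return s, (h, w)

for ratio in (0.83, 0.67):
    resized, final = scaled_and_padded_size(256, 416, ratio)
    pad_right, pad_bottom = final[1] - resized[1], final[0] - resized[0]
    print(ratio, resized, final, pad_right, pad_bottom)
```

For ratio 0.83 the image is resized to 212x345 and padded to 224x352, so the network still receives dimensions divisible by the maximum stride.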

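That completes the analysis of the forward pass and inference code of the YOLOv5 head. After debugging through it, the implementation feels quite easy to follow, and overall it does not differ much from YOLOv3.

In the next article we will focus on YOLOv5's train.py, that is, the training process. Thanks for reading!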

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/267379.html
