About

The YOLO series is an important line of work in one-stage object detection. It combines speed with accuracy and incorporates many tricks that have kept up with the times, making it a product of theory and engineering practice working together. Although Andrew Ng has said the YOLO papers are relatively hard to read, understanding and implementing this work has to be carried through in order to properly organize the detection module.

Recommended PyTorch YOLO projects

A panorama of the YOLO algorithm family

Recommended reading: a complete walkthrough of YOLO v1-v5

YOLO v1

paper;

Reference blogs: 1, 2, 3

Detection idea: convolutional layers extract features and produce a final $S \times S$ feature map, in which every cell predicts $B$ boxes. (Why predict several boxes per cell? It resembles the anchor idea, presumably to cope with scale; during training only the box closest to the ground truth is chosen for regression. In practice YOLO v1 is not very good at small objects, so this implicit setup is probably hard for the network to learn.) Each box predicts offsets and a confidence (which combines whether an object is present with the IOU, i.e. the quality of the prediction). Because of the spatial correspondence preserved by the CNN, the original image can also be divided into an $S \times S$ grid: whichever cell the center of a ground-truth box falls into, the corresponding cell on the final feature map is responsible for predicting that object. Adding the predicted class probabilities, the total number of predicted variables is $S \times S \times (B \times 5 + C)$, produced by the FC layer. For convenience, the prediction vector is laid out so that $[0 : S \times S \times C]$ is the class-probability part, $[S \times S \times C : S \times S \times (C+B)]$ is the box-confidence part, and $[S \times S \times (C+B):]$ is the bounding-box part.
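As a concrete illustration of this layout, a minimal NumPy sketch, assuming the VOC setting $S=7$, $B=2$, $C=20$ (the variable names are mine, not from a reference implementation):

import numpy as np

S, B, C = 7, 2, 20                             # grid size, boxes per cell, classes (VOC)
pred = np.random.rand(S * S * (B * 5 + C))     # flat 1470-dim prediction from the FC layer

idx1 = S * S * C                               # end of the class-probability part
idx2 = idx1 + S * S * B                        # end of the box-confidence part

class_probs = pred[:idx1].reshape(S, S, C)      # per-cell class probabilities
confidences = pred[idx1:idx2].reshape(S, S, B)  # per-box confidence scores
boxes       = pred[idx2:].reshape(S, S, B, 4)   # per-box (x, y, sqrt(w), sqrt(h))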

YOLO treats the whole detection task as a regression problem: it outputs a $7 \times 7 \times 30$ tensor (for PASCAL VOC) and uses a squared-error loss.

Relying on the spatial correspondence of the CNN, the original $224 \times 224$ image is resized to $448 \times 448$ and downsampled to a $7 \times 7$ grid. Each grid cell predicts 2 bounding boxes, each with variables $(x, y, w, h, C)$, where $C = p_{object} \cdot IoU$; if an object is present, $p_{object}=1$, otherwise 0.

All layers except the last use the leaky ReLU activation. The loss function assigns different weights to its terms:

λcoord i=0S2j=0B1ijobj [(xix^i)2+(yiy^i)2]+λcoord i=0S2j=0B1ijobj [(wiw^i)2+(hih^i)2]+i=0S2j=0B1ijobj(CiC^i)2+λnoobji=0S2j=0B1ijnoobj(CiC^i)2+i=0S21iobjc classes (pi(c)p^i(c))2\begin{array}{l} \lambda_{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{i j}^{\text {obj }}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\ \quad+\lambda_{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{i j}^{\text {obj }}\left[(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}})^{2}+(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}})^{2}\right] \\ \quad+\sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{i j}^{\mathrm{obj}}\left(C_{i}-\hat{C}_{i}\right)^{2}+\lambda_{\mathrm{noobj}} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{i j}^{\mathrm{noobj}}\left(C_{i}-\hat{C}_{i}\right)^{2}\\ \quad+\sum_{i=0}^{S^{2}} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \text { classes }}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2} \end{array}

$\lambda_{coord}=5$, $\lambda_{noobj}=0.5$. The square roots of width and height are used so that boxes of small objects are more sensitive to the same size deviation than large ones. Here $C$ is the box confidence, which measures both whether an object is present and the quality of the box; $p$ is the class probability.
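A quick numeric check of the square-root trick (my own example, not from the paper):

import math

# the same absolute error of 2 pixels on a small box and a large box
small_w, small_pred = 4.0, 6.0
large_w, large_pred = 100.0, 102.0

plain_small = (small_w - small_pred) ** 2                        # 4.0
plain_large = (large_w - large_pred) ** 2                        # 4.0   (identical penalty)

sqrt_small = (math.sqrt(small_w) - math.sqrt(small_pred)) ** 2   # ~0.20
sqrt_large = (math.sqrt(large_w) - math.sqrt(large_pred)) ** 2   # ~0.01 (much smaller)

print(plain_small, plain_large, sqrt_small, sqrt_large)

With the square root, the same 2-pixel error is penalized roughly 20 times more on the small box than on the large one.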

The v1 backbone

How the regression variables are computed (figure from blog 1)

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

Tricks: warm-up, dropout after the FC layer, dataset augmentation with random scaling and translations (up to 20%), and adjusting exposure and saturation (by a factor of up to 1.5) in HSV color space.

A general-purpose architecture: detection also works well on artwork datasets, not just natural images.

Each grid cell detects only one object, so when several objects overlap only one of them can be detected, and precision is not high. But because it is a unified architecture that detects over the whole image with a single network, it uses more context than Faster R-CNN's region proposals, so the background false-positive rate is lower.

Prediction:

Blog 3 walks through the prediction output in the source code: roughly, scores with confidence below the threshold are set to 0, NMS is applied, and the result is output; in other words, NMS is done first and the class is decided afterwards, rather than computing class-specific confidences first and then running NMS.

Per-class NMS, or NMS over all boxes together? Usually all predicted boxes are pooled and NMS is run once; the results do not seem to differ much.

Below is the prediction scheme that decides the class first and then applies NMS; code from blog 3:

def _build_detector(self):
    """Interpret the net output and get the predicted boxes"""
    # the width and height of the original image
    self.width = tf.placeholder(tf.float32, name="img_w")
    self.height = tf.placeholder(tf.float32, name="img_h")
    # get class prob, confidence, boxes from net output
    idx1 = self.S * self.S * self.C
    idx2 = idx1 + self.S * self.S * self.B
    # class prediction
    class_probs = tf.reshape(self.predicts[0, :idx1], [self.S, self.S, self.C])
    # confidence
    confs = tf.reshape(self.predicts[0, idx1:idx2], [self.S, self.S, self.B])
    # boxes -> (x, y, w, h)
    boxes = tf.reshape(self.predicts[0, idx2:], [self.S, self.S, self.B, 4])

    # convert x, y to coordinates relative to the top left point of the image;
    # the predictions of w, h are square roots, so square them and
    # multiply by the width and height of the image
    boxes = tf.stack([(boxes[:, :, :, 0] + tf.constant(self.x_offset, dtype=tf.float32)) / self.S * self.width,
                      (boxes[:, :, :, 1] + tf.constant(self.y_offset, dtype=tf.float32)) / self.S * self.height,
                      tf.square(boxes[:, :, :, 2]) * self.width,
                      tf.square(boxes[:, :, :, 3]) * self.height], axis=3)

    # class-specific confidence scores [S, S, B, C]
    scores = tf.expand_dims(confs, -1) * tf.expand_dims(class_probs, 2)

    scores = tf.reshape(scores, [-1, self.C])  # [S*S*B, C]
    boxes = tf.reshape(boxes, [-1, 4])         # [S*S*B, 4]

    # find each box's class, only select the max score
    box_classes = tf.argmax(scores, axis=1)
    box_class_scores = tf.reduce_max(scores, axis=1)

    # filter the boxes by the score threshold
    filter_mask = box_class_scores >= self.threshold
    scores = tf.boolean_mask(box_class_scores, filter_mask)
    boxes = tf.boolean_mask(boxes, filter_mask)
    box_classes = tf.boolean_mask(box_classes, filter_mask)

    # non max suppression (does not distinguish between classes)
    # ref: https://tensorflow.google.cn/api_docs/python/tf/image/non_max_suppression
    # box (x, y, w, h) -> box (x1, y1, x2, y2)
    _boxes = tf.stack([boxes[:, 0] - 0.5 * boxes[:, 2], boxes[:, 1] - 0.5 * boxes[:, 3],
                       boxes[:, 0] + 0.5 * boxes[:, 2], boxes[:, 1] + 0.5 * boxes[:, 3]], axis=1)
    nms_indices = tf.image.non_max_suppression(_boxes, scores,
                                               self.max_output_size, self.iou_threshold)
    self.scores = tf.gather(scores, nms_indices)
    self.boxes = tf.gather(boxes, nms_indices)
    self.box_classes = tf.gather(box_classes, nms_indices)

YOLO v2

paper;

Reference blogs: 1, 2, 3, 4

Compared with YOLO v1, in v2 each cell has anchor boxes whose sizes are set by clustering; each anchor predicts offsets, a confidence, and class probabilities. Prediction is fully convolutional, with the FC layers removed.

The clustering distance measures the IOU between a box and a cluster center: a larger IOU means a smaller distance. Compared with hand-set anchors, the clustered anchors have a larger average IOU with the ground truths, which makes the model easier to train.

Regression setup

The passthrough layer helps with small-object prediction: it splits the feature map to reduce the spatial size while increasing the number of channels. A sketch of the operation is given below.
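A minimal NumPy sketch of what the passthrough (space-to-depth) operation does; the stride-2 setting matches YOLOv2, though the exact channel ordering of darknet's reorg layer may differ:

import numpy as np

def reorg(x, stride=2):
    """Space-to-depth: (H, W, C) -> (H/stride, W/stride, C*stride*stride)."""
    H, W, C = x.shape
    x = x.reshape(H // stride, stride, W // stride, stride, C)
    x = x.transpose(0, 2, 1, 3, 4)   # gather the stride*stride sub-pixels of each cell
    return x.reshape(H // stride, W // stride, C * stride * stride)

# YOLOv2 passes the 26x26x512 feature map through reorg to get 13x13x2048,
# then concatenates it with the 13x13x1024 map along the channel axis.
fine = np.random.rand(26, 26, 512)
print(reorg(fine).shape)   # (13, 13, 2048)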

How are the priors (anchors) matched? Similar to v1: whichever cell the center of a ground-truth box falls into, that cell's anchor boxes are responsible for predicting it. Which specific anchor box does so is determined during training: the one with the largest IOU with the ground truth (computed ignoring position, considering only shape and size) is matched, and the others are not. The prior matched to a ground truth incurs coordinate, confidence, and classification errors, while the other boxes only incur a confidence error, similar to v1.

Reorganization and cross-layer connections; the self-built Darknet-19 network with a reorg/passthrough layer.

YOLOv2 uses a self-built network, Darknet-19.

The YOLO v2 net, from: https://ethereon.github.io/netscope/#/gist/d08a41711e48cf111e330827b1279c31

Fully convolutional, multi-scale

Anchors are added, improving recall

Anchor sizes are obtained by clustering instead of being set by hand; how the clustering can be implemented is sketched below.
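A minimal sketch of k-means with the 1 - IOU distance (my own simplified implementation, not the darknet code; boxes is an (N, 2) array of ground-truth widths and heights):

import numpy as np

def wh_iou(boxes, centers):
    """IOU between boxes and centers, both given as (w, h) and aligned at the origin."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union                      # shape (N, k)

def kmeans_anchors(boxes, k=5, iters=100):
    centers = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(boxes, centers), axis=1)   # distance = 1 - IOU
        new_centers = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

This origin-aligned IOU over (w, h) only is also the IOU used when matching ground truths to anchors during training.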

The bounding-box offsets $t_x, t_y, t_w, t_h$ are predicted: width and height relative to the anchor size, and the center relative to the top-left corner of the grid cell. Compared with v1's direct prediction of absolute box positions, these offsets are easier for the network to learn because they are small and stable.
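For reference, the decoding from offsets to box as given in the YOLOv2 paper, where $(c_x, c_y)$ is the top-left corner of the grid cell and $(p_w, p_h)$ is the anchor (prior) size:

b_{x}=\sigma(t_{x})+c_{x}, \quad b_{y}=\sigma(t_{y})+c_{y}, \quad b_{w}=p_{w} e^{t_{w}}, \quad b_{h}=p_{h} e^{t_{h}}, \quad \operatorname{Pr}(\text{object}) \cdot \mathrm{IOU} = \sigma(t_{o})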

Loss function

\operatorname{loss}_{t}=\sum_{i=0}^{W} \sum_{j=0}^{H} \sum_{k=0}^{A} \Big[ \mathbb{1}_{\text{MaxIOU}<\text{Thresh}} \, \lambda_{\text{noobj}} \left(-b_{ijk}^{o}\right)^{2} + \mathbb{1}_{t<12800} \, \lambda_{\text{prior}} \sum_{r \in (x, y, w, h)}\left(\text{prior}_{k}^{r}-b_{ijk}^{r}\right)^{2} + \mathbb{1}_{k}^{\text{truth}}\Big(\lambda_{\text{coord}} \sum_{r \in (x, y, w, h)}\left(\text{truth}^{r}-b_{ijk}^{r}\right)^{2} + \lambda_{\text{obj}}\left(\mathrm{IOU}_{\text{truth}}^{k}-b_{ijk}^{o}\right)^{2} + \lambda_{\text{class}} \sum_{c=1}^{C}\left(\text{truth}^{c}-b_{ijk}^{c}\right)^{2}\Big)\Big]

Prior matching and sample selection: the IOU is computed considering only shape and size, not coordinates (boxes are moved to the origin before computing it), and each ground truth is assigned to only one best-matching predicted box. At the start of training the predicted boxes are first made to learn to predict the anchor priors themselves (the $\mathbb{1}_{t<12800}$ term above).

Training uses multiple scales; the number of channels stays the same, only the spatial size of the feature map changes. Darknet-19 is first trained on $224 \times 224$ ImageNet images (stage one), then the input size is changed to $448 \times 448$ for a second stage of training, so that the network adapts to large inputs. After that, the last convolutional layer, the global average pooling, and the classification softmax are removed, some convolutional layers are added, and the classification backbone is converted into the detection network for training. The number of output channels is $anchor\_nums \times (4+1+num\_classes)$. Taking VOC as an example, the final output has shape $(N, W, H, 125)$; it is first reshaped to $(N, W, H, 5, 25)$, where $[:,:,:,:,0:4]$ is the box position and size $(t_x, t_y, t_w, t_h)$, $[:,:,:,:,4]$ is the box confidence, and $[:,:,:,:,5:]$ is the class prediction. Finally NMS and other post-processing are applied.

"""
Detection ops for Yolov2
codelink:https://github.com/xiaohu2015/DeepLearning_tutorials/tree/master/ObjectDetections/yolo2
"""

import tensorflow as tf
import numpy as np


def decode(detection_feat, feat_sizes=(13, 13), num_classes=80,
           anchors=None):
    """decode from the detection feature"""
    H, W = feat_sizes
    num_anchors = len(anchors)
    detection_results = tf.reshape(detection_feat, [-1, H * W, num_anchors,
                                                    num_classes + 5])

    bbox_xy = tf.nn.sigmoid(detection_results[:, :, :, 0:2])
    bbox_wh = tf.exp(detection_results[:, :, :, 2:4])
    obj_probs = tf.nn.sigmoid(detection_results[:, :, :, 4])
    class_probs = tf.nn.softmax(detection_results[:, :, :, 5:])

    anchors = tf.constant(anchors, dtype=tf.float32)

    height_ind = tf.range(H, dtype=tf.float32)
    width_ind = tf.range(W, dtype=tf.float32)
    x_offset, y_offset = tf.meshgrid(height_ind, width_ind)
    x_offset = tf.reshape(x_offset, [1, -1, 1])
    y_offset = tf.reshape(y_offset, [1, -1, 1])

    # decode
    bbox_x = (bbox_xy[:, :, :, 0] + x_offset) / W
    bbox_y = (bbox_xy[:, :, :, 1] + y_offset) / H
    bbox_w = bbox_wh[:, :, :, 0] * anchors[:, 0] / W * 0.5  # multiply by 0.5: half width/height,
    bbox_h = bbox_wh[:, :, :, 1] * anchors[:, 1] / H * 0.5  # used below to get the corners from the center

    bboxes = tf.stack([bbox_x - bbox_w, bbox_y - bbox_h,
                       bbox_x + bbox_w, bbox_y + bbox_h], axis=3)

    return bboxes, obj_probs, class_probs

Prediction: thanks to multi-scale training, mAP is high when testing with large image sizes.

Some improvements and tricks compared with v1

YOLO9000: joint training on ImageNet and COCO for detection and classification

YOLO v3

paper;

The backbone is changed to Darknet-53, which uses ResNet-style skip connections and introduces residual modules (a sketch of the residual block follows the figure below). Detecting on three feature-map scales (large, medium, small) with different anchor sizes improves small-object detection accuracy.

Darknet-53
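A rough sketch of the Darknet-53 residual block, written with tf.keras layers purely as an illustration (not the darknet source): a 1x1 convolution halves the channels, a 3x3 convolution restores them, and the input is added back; each convolution is followed by batch norm and leaky ReLU (slope 0.1).

import tensorflow as tf

def darknet_residual(x, channels):
    """Darknet-53 residual block: 1x1 conv (channels/2) -> 3x3 conv (channels) -> add input."""
    shortcut = x
    x = tf.keras.layers.Conv2D(channels // 2, 1, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.LeakyReLU(alpha=0.1)(x)
    x = tf.keras.layers.Conv2D(channels, 3, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.LeakyReLU(alpha=0.1)(x)
    return tf.keras.layers.Add()([shortcut, x])

# usage with the functional API
inp = tf.keras.Input(shape=(None, None, 64))
out = darknet_residual(inp, 64)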

The YOLOv3 detection architecture, figure from: https://zhuanlan.zhihu.com/p/35325884

Multi-scale fusion, FPN-style. k-means clustering is still used to obtain 9 anchors of different sizes and aspect ratios, which are assigned evenly to the 3 feature maps at different scales (see the sketch below).
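The 9 COCO anchors from the standard YOLOv3 configuration and the usual assignment to the three scales (the grouping below is the common convention, not taken from this document's references):

# 9 anchors (w, h) in pixels from the standard YOLOv3 COCO config, smallest to largest
anchors = [(10, 13), (16, 30), (33, 23),       # -> 52x52 feature map (stride 8, small objects)
           (30, 61), (62, 45), (59, 119),      # -> 26x26 feature map (stride 16, medium objects)
           (116, 90), (156, 198), (373, 326)]  # -> 13x13 feature map (stride 32, large objects)

scales = {52: anchors[0:3], 26: anchors[3:6], 13: anchors[6:9]}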

The loss functions of YOLOv2 and YOLOv3: related answers on Zhihu

The predicted $t_x, t_y$ lie between 0 and 1 (after the sigmoid), and BCE is used instead of MSE. Is the loss computed on the raw outputs $t$, or on the decoded, further-normalized $b$?

YOLO v4

paper;

Reference blogs: 1, 2

One ground truth can correspond to several anchors instead of being regressed by only one, which raises the proportion of positive samples.

Mosaic data augmentation and Self-Adversarial Training are quite interesting; they are derived from work on style transfer and visualizing CNNs. A simplified sketch of Mosaic follows.
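A heavily simplified NumPy sketch of the Mosaic idea (my own illustration: it ignores the bounding-box bookkeeping the real augmentation has to do and assumes each input image is at least out_size x out_size):

import numpy as np

def mosaic(imgs, out_size=608):
    """Stitch 4 HxWx3 images around a random center point into one canvas."""
    canvas = np.zeros((out_size, out_size, 3), dtype=imgs[0].dtype)
    cx = np.random.randint(out_size // 4, 3 * out_size // 4)   # random split point
    cy = np.random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(slice(0, cy), slice(0, cx)),                   # top-left
               (slice(0, cy), slice(cx, out_size)),            # top-right
               (slice(cy, out_size), slice(0, cx)),            # bottom-left
               (slice(cy, out_size), slice(cx, out_size))]     # bottom-right
    for img, (rs, cs) in zip(imgs, regions):
        h, w = rs.stop - rs.start, cs.stop - cs.start
        canvas[rs, cs] = img[:h, :w]                           # crop each image to its quadrant
    return canvas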

https://zhuanlan.zhihu.com/p/172121380

https://zhuanlan.zhihu.com/p/143747206