SSD

paper

Reference: 1

SSD is a very well written paper and a genuinely pleasant read.

Written after Faster R-CNN, so the anchor itself is not the new idea; the new idea is using feature maps at multiple scales and placing anchors of different sizes on each of them for prediction and regression, an early form of the feature pyramid.

Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales

The backbone is VGG16, with fc6 and fc7 replaced by convolutions. The maxpool5 after conv5_3, originally 2×2 with s=2, is replaced by a 3×3, s=1 pooling, which keeps the spatial size unchanged; fc6 becomes a convolutional layer using dilated (atrous) convolution to enlarge the receptive field; fc8 is removed, and convolutional layers are attached directly for feature extraction and prediction. Why treat maxpool5 and fc6 this way? In the experimental analysis the authors note that simply using the original VGG16 conv5_3 for prediction changes accuracy very little but is about 20% slower than the dilated version, so this looks mainly like an engineering trade-off for speed. In my own future work it seems worth experimenting with what dilated convolution actually contributes.
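A minimal sketch of these changes (the variable names and channel counts follow common SSD PyTorch ports, not necessarily the original Caffe implementation):

import torch.nn as nn

# pool5: 2x2 / s=2 replaced by 3x3 / s=1, so conv5_3's spatial size is preserved
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
# fc6 -> 3x3 dilated (atrous) convolution, enlarging the receptive field
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
# fc7 -> 1x1 convolution; fc8 is dropped entirely
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)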

In addition, since conv4_3 of VGG16 is used as the first prediction layer and sits relatively early in the network, the paper applies ParseNet-style L2 normalization to each pixel along the channel dimension (unlike layer norm, which for images normalizes over C, H, W and keeps the N dimension). The intent is to bring the feature magnitudes of the different layers into a comparable range and help the model converge, but the concrete justification was presumably found empirically, otherwise they would not L2-normalize only this one layer. The formula is:

y_{i}=\frac{x_{i}}{\sqrt{\sum_{i=1}^{D} x_{i}^{2}}}

After the normalization, the result is further multiplied by a specified (learnable) scale parameter.

norm = conv4_3_feats.pow(2).sum(dim=1, keepdim=True).sqrt()+1e-10  # (N, 1, 38, 38)
conv4_3_feats = conv4_3_feats / norm # (N, 512, 38, 38)
conv4_3_feats = conv4_3_feats * self.rescale_factors # (N, 512, 38, 38)

More related code examples can be found here, along with the author's explanation of why SSD normalizes conv4_3 and of the variance terms.

Convolutions are used to predict the box offsets and class confidences, so the number of convolutional filters at the end is (C+4)k, where k is the number of preset anchors and C includes the background class.
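As a rough illustration for a single feature map (the channel count and the values of k and C here are assumptions, not the paper's exact configuration):

import torch.nn as nn

k, C = 6, 21        # anchors per location, classes including background (assumed)
in_ch = 512         # channels of the feature map being predicted from (assumed)
loc_head = nn.Conv2d(in_ch, k * 4, kernel_size=3, padding=1)    # box offsets
conf_head = nn.Conv2d(in_ch, k * C, kernel_size=3, padding=1)   # class confidences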

How the ground-truth targets are assigned:

We begin by matching each ground truth box to the default box with the best jaccard overlap (as in MultiBox [7]). Unlike MultiBox, we then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5).

The IoU between each anchor box and each GT is computed; anchors with IoU above the 0.5 threshold are matched and are responsible for predicting that GT (one GT may match multiple anchor boxes, but not the other way around; if an anchor overlaps several GTs, it is matched only to the GT with which it has the highest IoU).

However, the anchor boxes are not tied to the receptive field of the feature map; their scales and centers are specified by hand. In the paper the shallowest layer has scale 0.2 and the deepest 0.9 (relative to the input image), and the aspect ratios are {1, 2, 3, 1/2, 1/3, 1'}, where 1' is set separately for each feature map and corresponds to a different box size than ratio 1. Although six aspect ratios are defined, each with a formula for its box size, not every feature map uses all of them, and the scale formula only applies to the feature maps added after the backbone; the prediction layer taken from the VGG16 head uses a separately chosen scale, as the paper explains in detail (a small sketch of the size formula follows the quote below):

Figure 2 shows the architecture details of the SSD300 model. We use conv4_3, conv7 (fc7), conv8_2, conv9_2, conv10_2, and conv11_2 to predict both location and confidences. We set default box with scale 0.1 on conv4_3. We initialize the parameters for all the newly added convolutional layers with the "xavier" method [20]. For conv4_3, conv10_2 and conv11_2, we only associate 4 default boxes at each feature map location, omitting aspect ratios of \frac{1}{3} and 3. For all other layers, we put 6 default boxes as described in Sec. 2.2. Since, as pointed out in [12], conv4_3 has a different feature scale compared to the other layers, we use the L2 normalization technique introduced in [12] to scale the feature norm at each location in the feature map to 20 and learn the scale during back propagation.
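A small sketch of the paper's default-box size formula for the layers that follow it (conv4_3 uses the fixed scale 0.1 stated in the quote); how the extra 1' scale is handled for the last layer is an assumption here, since implementations differ:

import math

def default_box_sizes(m=5, s_min=0.2, s_max=0.9, ratios=(1, 2, 3, 1/2, 1/3)):
    """(w, h) of default boxes per layer, as fractions of the input image size."""
    scales = [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]
    scales.append(1.0)  # assumed value of s_{m+1}, needed for the last layer's 1' box
    sizes = []
    for k in range(m):
        s_k = scales[k]
        layer = [(s_k * math.sqrt(r), s_k / math.sqrt(r)) for r in ratios]
        layer.append((math.sqrt(s_k * scales[k + 1]),) * 2)  # the extra 1' square box
        sizes.append(layer)
    return sizes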

To make the overall pipeline and some details clearer, I quote directly the explanation from reference blog 1:

Since each cell of a feature map is assigned the same few anchors (same sizes and aspect ratios), the corresponding anchor centers are the cell centers scaled up by the downsampling factor of that feature map, i.e., the centers of the corresponding regions in the original image.

The loss function is the same as in Faster R-CNN, and the offset parameterization is also the same: smooth L1 for regression and softmax with cross-entropy for classification, with the weight between the two set to 1 by cross-validation.
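Written out, the overall objective from the SSD paper is (N is the number of matched default boxes, and the localization weight \alpha is the term set to 1 above):

L(x, c, l, g)=\frac{1}{N}\left(L_{conf}(x, c)+\alpha L_{loc}(x, l, g)\right)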

Hard negative mining: negatives are subsampled by sorting them by confidence loss in descending order (the lower the predicted background confidence, the larger the loss) and keeping the top-k with the largest loss as training negatives, so that the positive:negative ratio stays around 1:3.
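A hedged sketch of this selection step (tensor names and shapes are assumptions; the pattern follows common SSD implementations rather than verbatim paper code):

def hard_negative_mining(conf_loss, positive_mask, neg_pos_ratio=3):
    """conf_loss: (N, num_priors) per-box classification loss; positive_mask: (N, num_priors) bool."""
    num_pos = positive_mask.sum(dim=1, keepdim=True)       # positives per image
    loss = conf_loss.clone()
    loss[positive_mask] = 0.0                              # rank only the negatives
    _, idx = loss.sort(dim=1, descending=True)             # largest loss first
    _, rank = idx.sort(dim=1)
    return rank < neg_pos_ratio * num_pos                  # keep the top-k negatives, k = 3 x positives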

On why multi-scale feature maps are used rather than training on images at multiple scales:

To handle different object scales, some methods [4,9] suggest processing the image at different sizes and combining the results afterwards. However, by utilizing feature maps from several different layers in a single network for prediction we can mimic the same effect, while also sharing parameters across all object scales.

Detection of small objects is poor, because structurally they are detected from shallow layers, whose semantic information is not rich enough.

Figure 4 shows that SSD is very sensitive to the bounding box size. In other words, it has much worse performance on smaller objects than bigger objects. This is not surprising because those small objects may not even have any information at the very top layers.

Compared with Faster R-CNN, SSD depends heavily on data augmentation (random-crop / zoom-in style operations; on this point I think YOLOv4's "mosaic" augmentation is in much the same spirit, a more general version, and this kind of augmentation seems to be standard equipment for one-stage detectors). The results in the paper show that the augmentation is worth 8-9 mAP points. The authors suspect Faster R-CNN is less dependent on it because it "use[s] a feature pooling step during classification that is relatively robust to object translation by design".

Finally, the paper summarizes the detection approaches of the time and compares SSD with the mainstream methods. The key point, in my view, is the authors' remark that SSD is essentially equivalent to the RPN in Faster R-CNN:

Our SSD is very similar to the region proposal network (RPN) in Faster R-CNN in that we also use a fixed set of (default) boxes for prediction, similar to the anchor boxes in the RPN. But instead of using these to pool features and evaluate another classifier, we simultaneously produce a score for each object category in each box. Thus, our approach avoids the complication of merging RPN with Fast R-CNN and is easier to train, faster, and straightforward to integrate in other tasks.

RefineDet

paper, code (PyTorch version)

Reference: 1

Goal: combine the high inference efficiency of one-stage detectors such as SSD with the high detection accuracy of two-stage detectors such as Faster R-CNN, by building a model that regresses anchors twice in two stages; no image-level proposals are used, everything is done on the feature maps.

Architecture / ideas:

Stage one:

(1) filter out negative anchors to reduce search space for the classifier, and (2) coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor.

Stage two:

The latter module takes the refined anchors as the input from the former to further improve the regression and predict multi-class label.

Stage one, like the RPN of Faster R-CNN, picks out the positive anchor boxes (those containing an object rather than background) and coarsely corrects them; the feature maps used in stage two are built on top of stage one with further feature extraction and fusion, to improve multi-class classification and regression.

RefineDet architecture diagram: roughly speaking, it is RPN + FPN + RPN stacked together. The paper names the three modules ARM, TCB, and ODM.

In stage one, a few anchor boxes are fixed at each grid cell of every feature map; the network predicts four offsets with respect to the originally defined anchors, together with a foreground confidence.

The ARM and ODM feature maps have the same spatial sizes (they correspond one-to-one).

In the ARM, anchors whose negative (background) confidence exceeds a threshold (0.99) are discarded, at inference time as well; the ARM passes only hard negative anchor boxes and the refined positive anchor boxes on to the ODM.

TCB structure diagram: level by level, lateral connections are made with the higher-level feature maps.

Training / inference:

Data augmentation: crop, expand, flipping, etc., similar to SSD.

Backbone: VGG16 or ResNet-101. Taking VGG16 as an example, the last two fully connected layers fc6 and fc7 are removed and two convolutional layers conv_fc6 and conv_fc7 are added; conv_fc6, which follows pool5, uses dilated convolution (dilation=6) just like SSD. After conv_fc7 two more convolutional layers are appended, the last of which downsamples with stride=2. The prediction feature maps are conv4_3, conv5_3, conv_fc7, and conv6_2, and the first two get the same L2 norm treatment as in SSD. The prediction feature maps downsample the input image by factors of 8, 16, 32, and 64; each feature map is assigned anchors of one size (4 times its downsampling factor) with three aspect ratios (0.5, 1, 2). Matching rule: first, each GT is matched to the anchor with which it has the highest IoU, and then every anchor is matched to any GT with IoU above 0.5, the same as in SSD; a sketch of this anchor layout follows the quote below.

Specifically, we first match each ground truth to the anchor box with the best overlap score, and then match the anchor boxes to any ground truth with overlap higher than 0.5.
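A small sketch of the anchor layout described above (feature-map sizes are derived from a 320 input; the exact tiling convention, e.g. the +0.5 cell-center offset, is an assumption):

def refinedet_anchors(feature_sizes=(40, 20, 10, 5), strides=(8, 16, 32, 64),
                      ratios=(0.5, 1.0, 2.0)):
    """One base size (4 x stride) with three aspect ratios per feature-map cell."""
    anchors = []  # (cx, cy, w, h) in input-image pixels
    for fsize, stride in zip(feature_sizes, strides):
        base = 4 * stride
        for gy in range(fsize):
            for gx in range(fsize):
                cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride
                for r in ratios:
                    anchors.append((cx, cy, base * r ** 0.5, base / r ** 0.5))
    return anchors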

Hard negative mining: same as in SSD. The negatives whose background classification confidence is low, i.e., whose loss is large, are selected by sorting, keeping the positive:negative ratio around 1:3 or higher.

Loss function: split into ARM and ODM parts, analogous to the RPN and Fast R-CNN parts of Faster R-CNN. Both classifications use cross-entropy and both regressions use smooth L1; the four loss terms carry no weights and contribute equally, as shown below. N denotes the number of positive anchors: regression is computed only on positives, while classification is computed on both positives and negatives, and, as mentioned above, the ARM passes hard negatives on to the ODM for further classification. The paper also notes that if N is 0 in the ARM or the ODM, the corresponding classification and regression losses are set to 0.

\mathcal{L}\left(\{p_{i}\},\{x_{i}\},\{c_{i}\},\{t_{i}\}\right)=\frac{1}{N_{\mathrm{arm}}}\left(\sum_{i} \mathcal{L}_{\mathrm{b}}\left(p_{i},\left[l_{i}^{*} \geq 1\right]\right)+\sum_{i}\left[l_{i}^{*} \geq 1\right] \mathcal{L}_{\mathrm{r}}\left(x_{i}, g_{i}^{*}\right)\right)+\frac{1}{N_{\mathrm{odm}}}\left(\sum_{i} \mathcal{L}_{\mathrm{m}}\left(c_{i}, l_{i}^{*}\right)+\sum_{i}\left[l_{i}^{*} \geq 1\right] \mathcal{L}_{\mathrm{r}}\left(t_{i}, g_{i}^{*}\right)\right)

Training: a pretrained backbone is used with batch_size=32; the newly added convolutional layers are initialized with the "Xavier" method; optimization is SGD with momentum 0.9 and weight decay 0.0005, with an initial learning rate of 0.001 that is decayed as training progresses.

Inference: the ARM first removes anchors whose negative confidence is high (compared against the threshold), regresses and refines the remaining anchors, and passes them to the ODM; the ODM classifies and refines them further, outputs the top 400 boxes, filters them with NMS (threshold 0.45), and finally keeps at most 200 boxes as the result.
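A sketch of that inference path (in practice the scoring and NMS are done per class; the variable names and the single-class simplification here are assumptions):

import torch
from torchvision.ops import nms

def refinedet_inference(arm_conf, odm_boxes, odm_scores,
                        neg_thresh=0.99, pre_nms_topk=400, nms_thresh=0.45, keep_topk=200):
    # arm_conf: (num_anchors, 2) softmaxed [negative, positive] scores from the ARM
    # odm_boxes: (num_anchors, 4) boxes decoded from the refined anchors; odm_scores: (num_anchors,)
    keep = arm_conf[:, 0] <= neg_thresh                # drop confidently-negative anchors
    boxes, scores = odm_boxes[keep], odm_scores[keep]
    scores, order = scores.sort(descending=True)
    boxes, scores = boxes[order][:pre_nms_topk], scores[:pre_nms_topk]
    kept = nms(boxes, scores, nms_thresh)[:keep_topk]  # NMS, then keep at most 200 boxes
    return boxes[kept], scores[kept]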

Results: with 320 input the model reaches about 80% mAP on PASCAL VOC at roughly 40 FPS on a Titan X; at the 512 input scale the speed drops to 24 FPS, almost half.

According to the PyTorch source referenced in blog 1, the first stage does not actually suppress anchors: all of them are sent to the ODM, and the ODM then filters anchors using the ARM confidence together with its own classification confidence.

This suppression step may be something I misunderstood; it is not present in pytorch_refinedet. In fact, all ARM anchors are passed straight to the ODM, but the ARM does complete the first-stage refinement of every anchor; the ODM then predicts the second-stage bbox regression + classification for each anchor, filters valid anchors using the first-stage objectness score together with the second-stage class scores, applies the second-stage bbox regression to finish the refinement, and finally outputs the results after NMS.

RefineDet architecture in PyTorch (with VGG16 as the backbone):

import os
import torch
import torch.nn as nn
import torch.nn.functional as F

from layers import *
from .base_models import vgg, vgg_base


def vgg(cfg, i=3, batch_norm=False):
    layers = []
    in_channels = i
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        elif v == 'C':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    pool5 = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
    conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
    conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
    layers += [pool5, conv6,
               nn.ReLU(inplace=True), conv7, nn.ReLU(inplace=True)]
    return layers


vgg_base = {
'320': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M',
512, 512, 512],
}


class RefineSSD(nn.Module):
    """Single Shot Multibox Architecture
    The network is composed of a base VGG network followed by the
    added multibox conv layers. Each multibox layer branches into
        1) conv2d for class conf scores
        2) conv2d for localization predictions
        3) associated priorbox layer to produce default bounding
           boxes specific to the layer's feature map size.
    See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    Args:
        phase: (string) Can be "test" or "train"
        base: VGG16 layers for input, size of either 300 or 500
        extras: extra layers that feed to multibox loc and conf layers
        head: "multibox head" consists of loc and conf conv layers
    """

    def __init__(self, size, num_classes, use_refine=False):
        super(RefineSSD, self).__init__()
        self.num_classes = num_classes
        # TODO: implement __call__ in PriorBox
        self.size = size
        self.use_refine = use_refine

        # SSD network
        self.base = nn.ModuleList(vgg(vgg_base['320'], 3))
        # Layer learns to scale the l2 normalized features from conv4_3
        self.L2Norm_4_3 = L2Norm(512, 10)
        self.L2Norm_5_3 = L2Norm(512, 8)
        self.last_layer_trans = nn.Sequential(nn.Conv2d(512, 256, kernel_size=3, stride=1, padding=1),
                                              nn.ReLU(inplace=True),
                                              nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
                                              nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1))
        self.extras = nn.Sequential(nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0), nn.ReLU(inplace=True),
                                    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True))

        if use_refine:
            self.arm_loc = nn.ModuleList([nn.Conv2d(512, 12, kernel_size=3, stride=1, padding=1),
                                          nn.Conv2d(512, 12, kernel_size=3, stride=1, padding=1),
                                          nn.Conv2d(1024, 12, kernel_size=3, stride=1, padding=1),
                                          nn.Conv2d(512, 12, kernel_size=3, stride=1, padding=1),
                                          ])
            self.arm_conf = nn.ModuleList([nn.Conv2d(512, 6, kernel_size=3, stride=1, padding=1),
                                           nn.Conv2d(512, 6, kernel_size=3, stride=1, padding=1),
                                           nn.Conv2d(1024, 6, kernel_size=3, stride=1, padding=1),
                                           nn.Conv2d(512, 6, kernel_size=3, stride=1, padding=1),
                                           ])
        self.odm_loc = nn.ModuleList([nn.Conv2d(256, 12, kernel_size=3, stride=1, padding=1),
                                      nn.Conv2d(256, 12, kernel_size=3, stride=1, padding=1),
                                      nn.Conv2d(256, 12, kernel_size=3, stride=1, padding=1),
                                      nn.Conv2d(256, 12, kernel_size=3, stride=1, padding=1),
                                      ])
        self.odm_conf = nn.ModuleList([nn.Conv2d(256, 3*num_classes, kernel_size=3, stride=1, padding=1),
                                       nn.Conv2d(256, 3*num_classes, kernel_size=3, stride=1, padding=1),
                                       nn.Conv2d(256, 3*num_classes, kernel_size=3, stride=1, padding=1),
                                       nn.Conv2d(256, 3*num_classes, kernel_size=3, stride=1, padding=1),
                                       ])
        self.trans_layers = nn.ModuleList([nn.Sequential(nn.Conv2d(512, 256, kernel_size=3, stride=1, padding=1),
                                                         nn.ReLU(inplace=True),
                                                         nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)),
                                           nn.Sequential(nn.Conv2d(512, 256, kernel_size=3, stride=1, padding=1),
                                                         nn.ReLU(inplace=True),
                                                         nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)),
                                           nn.Sequential(nn.Conv2d(1024, 256, kernel_size=3, stride=1, padding=1),
                                                         nn.ReLU(inplace=True),
                                                         nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)),
                                           ])
        self.up_layers = nn.ModuleList([nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2, padding=0),
                                        nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2, padding=0),
                                        nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2, padding=0), ])
        self.latent_layrs = nn.ModuleList([nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
                                           nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
                                           nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
                                           ])

        self.softmax = nn.Softmax()

    def forward(self, x, test=False):
        """Applies network layers and ops on input image(s) x.
        Args:
            x: input image or batch of images. Shape: [batch, 3, 320, 320].
        Return:
            Depending on phase:
            test:
                Variable(tensor) of output class label predictions,
                confidence score, and corresponding location predictions for
                each object detected. Shape: [batch,topk,7]
            train:
                list of concat outputs from:
                    1: confidence layers, Shape: [batch*num_priors,num_classes]
                    2: localization layers, Shape: [batch,num_priors*4]
                    3: priorbox layers, Shape: [2,num_priors*4]
        """
        arm_sources = list()
        arm_loc_list = list()
        arm_conf_list = list()
        obm_loc_list = list()
        obm_conf_list = list()
        obm_sources = list()

        # apply vgg up to conv4_3 relu
        for k in range(23):
            x = self.base[k](x)

        s = self.L2Norm_4_3(x)
        arm_sources.append(s)

        # apply vgg up to conv5_3
        for k in range(23, 30):
            x = self.base[k](x)
        s = self.L2Norm_5_3(x)
        arm_sources.append(s)

        # apply vgg up to fc7
        for k in range(30, len(self.base)):
            x = self.base[k](x)
        arm_sources.append(x)
        # conv6_2
        x = self.extras(x)
        arm_sources.append(x)
        # apply multibox head to arm branch
        if self.use_refine:
            for (x, l, c) in zip(arm_sources, self.arm_loc, self.arm_conf):
                arm_loc_list.append(l(x).permute(0, 2, 3, 1).contiguous())
                arm_conf_list.append(c(x).permute(0, 2, 3, 1).contiguous())
            arm_loc = torch.cat([o.view(o.size(0), -1) for o in arm_loc_list], 1)
            arm_conf = torch.cat([o.view(o.size(0), -1) for o in arm_conf_list], 1)
        x = self.last_layer_trans(x)
        obm_sources.append(x)

        # get transformed layers
        trans_layer_list = list()
        for (x_t, t) in zip(arm_sources, self.trans_layers):
            trans_layer_list.append(t(x_t))
        # fpn module
        trans_layer_list.reverse()
        arm_sources.reverse()
        for (t, u, l) in zip(trans_layer_list, self.up_layers, self.latent_layrs):
            x = F.relu(l(F.relu(u(x) + t, inplace=True)), inplace=True)
            obm_sources.append(x)
        obm_sources.reverse()
        for (x, l, c) in zip(obm_sources, self.odm_loc, self.odm_conf):
            obm_loc_list.append(l(x).permute(0, 2, 3, 1).contiguous())
            obm_conf_list.append(c(x).permute(0, 2, 3, 1).contiguous())
        obm_loc = torch.cat([o.view(o.size(0), -1) for o in obm_loc_list], 1)
        obm_conf = torch.cat([o.view(o.size(0), -1) for o in obm_conf_list], 1)

        # apply multibox head to source layers

        if test:
            if self.use_refine:
                output = (
                    arm_loc.view(arm_loc.size(0), -1, 4),                   # loc preds
                    self.softmax(arm_conf.view(-1, 2)),                     # conf preds
                    obm_loc.view(obm_loc.size(0), -1, 4),                   # loc preds
                    self.softmax(obm_conf.view(-1, self.num_classes)),      # conf preds
                )
            else:
                output = (
                    obm_loc.view(obm_loc.size(0), -1, 4),                   # loc preds
                    self.softmax(obm_conf.view(-1, self.num_classes)),      # conf preds
                )
        else:
            if self.use_refine:
                output = (
                    arm_loc.view(arm_loc.size(0), -1, 4),                   # loc preds
                    arm_conf.view(arm_conf.size(0), -1, 2),                 # conf preds
                    obm_loc.view(obm_loc.size(0), -1, 4),                   # loc preds
                    obm_conf.view(obm_conf.size(0), -1, self.num_classes),  # conf preds
                )
            else:
                output = (
                    obm_loc.view(obm_loc.size(0), -1, 4),                   # loc preds
                    obm_conf.view(obm_conf.size(0), -1, self.num_classes),  # conf preds
                )

        return output

    def load_weights(self, base_file):
        other, ext = os.path.splitext(base_file)
        if ext == '.pkl' or ext == '.pth':
            print('Loading weights into state dict...')
            self.load_state_dict(torch.load(base_file, map_location=lambda storage, loc: storage))
            print('Finished!')
        else:
            print('Sorry only .pth and .pkl files supported.')


def build_net(size=320, num_classes=21, use_refine=False):
    if size != 320:
        print("Error: Sorry only RefineSSD320 is supported currently!")
        return

    return RefineSSD(size, num_classes=num_classes, use_refine=use_refine)

RefineDet++

paper

video

The original authors expanded their CVPR paper; the overall architecture is essentially unchanged, and the main difference is that the second-stage classification and regression are carried out with an AlignConv operation.

The reason for this design is the misalignment between the refined anchors and the features. Recall the classic two-stage detector Faster R-CNN: anchors are first used to produce proposals, corrected with respect to the original image, and then RoI Pooling crops out the feature regions corresponding to those corrected proposals, so the RPN and the Fast R-CNN stage are not directly coupled. This guarantees that each round of regression is performed on the corresponding features and is less easily disturbed. One-stage detectors have no such region-cropping step: classification and regression are performed on the same feature map, each feature-map location may be assigned anchors of different sizes while its receptive field is fixed, so good detection quality is not guaranteed; SSD's multi-scale feature maps and FPN alleviate this to some extent. In RefineDet, the feature maps used by the second stage come from the first stage and their locations are unchanged, so prediction still happens at the original positions; but the first stage has already adjusted the anchors, and the features they correspond to may have shifted, so the feature misalignment problem is still not solved. Later, TuSimple's AlignDet reworked the second-stage convolution and improved on RefineDet's results; a related Zhihu question explains this issue with RefineDet (link here); AlignDet and RepPoints will be covered later.

In fact, feature alignment is one of the more frequently discussed problems in object detection at the moment. Following the summary in reference 1, it splits into two main parts:

1. The mismatch between classification and regression: the two tasks arguably should not use the same features, since they are different tasks with different requirements on feature granularity, yet Faster R-CNN, for example, takes the RoI-pooled features and shares them between the classification and regression heads;

2. For anchor-based methods, anchors of different sizes are preset while the receptive field at a given feature-map location is fixed, so the features are misaligned; in one-stage cascaded regression, if features at the same locations are reused, the regions corresponding to the refined anchors have already changed while the second-stage feature regions have not;

That blog post surveys papers around these two problems. I will not go into the specific details and methods here; once I have read the relevant papers and sorted out these questions, I will write a dedicated post.

RefineDet++ architecture and details of the AlignConv operation

Background of the AlignConv design

The concrete AlignConv procedure given in the paper is as follows:

To this end, we design an alignment convolution operation (AlignConv), which uses the aligned feature from the refined anchors to predict multi-class labels and regress accurate object locations, as shown in Fig. 3(c). Specifically, the newly designed AlignConv operation conducts convolution operation based on computed offsets from the refined anchors. Denoting each refined anchor with a four-tuple (x, y, h, w) that specifies the top-left corner (x, y) and height and width (h, w), the AlignConv is conducted as follows. First, after taking the refined anchors from ARM, we equally divide the regions of the refined anchors into K×K parts, where K is the kernel size of convolution operation. The center of each part is computed as: for the part at i-th row and j-th column, the center location is (x+\frac{(2j-1)w}{2K}, y+\frac{(2i-1)h}{2K}). Second, we multiply the feature values at the K×K part centers in refined anchors with the corresponding parameters of the convolution filter, see Fig. 1. In this way, we successfully extract more accurate features that are aligned to the refined anchors for object detection. In contrast to existing deformable convolution methods [65]–[67] that learn the offsets by convolution operation with extra parameters, our AlignConv conducts the convolution with the guidance from the refined anchors of the ARM, which is more suitable for RefineDet++ and produces better performance.

The gist is that the ODM's convolution is offset according to the anchor regions refined by the ARM: the sampling region of the convolution follows the refined anchor instead of the original location, similar to deformable convolution. Curiously, when the authors substituted actual deformable convolution in its place, the results got worse. With this operation the speed drops noticeably, since it breaks up the contiguous convolution, while accuracy improves by roughly 1 point.
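A minimal sketch of the sampling-point geometry implied by the quoted formula (the anchor is given as top-left corner (x, y) plus width/height; this only illustrates where AlignConv samples, not how the paper implements the convolution itself):

def alignconv_sample_centers(x, y, w, h, K=3):
    """K x K sampling centers inside a refined anchor, one per kernel element."""
    centers = []
    for i in range(1, K + 1):          # row index -> vertical offset
        for j in range(1, K + 1):      # column index -> horizontal offset
            centers.append((x + (2 * j - 1) * w / (2 * K),
                            y + (2 * i - 1) * h / (2 * K)))
    return centers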

AlignDet

RepPoints

Cascade RetinaNet, which also explains this problem

https://zhuanlan.zhihu.com/p/78026765

https://zhuanlan.zhihu.com/p/114700229