PaperStation

About

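Notes on papers I have read in computer vision, visual SLAM, robotics, machine learning, and related areas. Papers that are especially important or interesting to me get a separate post with a detailed write-up.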

2019

Visual SLAM: A Few Surveys

基于单目视觉的同时定位与地图构建方法综述 (A Survey of Monocular Simultaneous Localization and Mapping)

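Abstract: Augmented reality is a technique for seamlessly blending virtual objects or information into real scenes. It presents information more efficiently and intuitively than traditional text, images, and video, and has a very wide range of applications. Simultaneous localization and mapping (SLAM), a key enabling technology for augmented reality, can localize a device in an unknown environment while building a 3D map of that environment, which keeps the overlaid virtual objects geometrically consistent with the real scene. This paper first outlines the basic principles of visual SLAM, then introduces several representative monocular visual SLAM methods and analyzes and compares them in depth, and finally discusses recent research hotspots and trends, closing with a summary and outlook.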

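The most readable Chinese-language survey I have come across so far. Anyone with some background in visual SLAM can get through it quickly, and it is neatly and compactly organized.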

Visual simultaneous localization and mapping: a survey

Abstract : Visual SLAM (simultaneous localization and mapping) refers to the problem of using images, as the only source of external information, in order to establish the position of a robot, a vehicle, or a moving camera in an environment, and at the same time, construct a representation of the explored zone. SLAM is an essential task for the autonomy of a robot. Nowadays, the problem of SLAM is considered solved when range sensors such as lasers or sonar are used to built 2D maps of small static environments. However SLAM for dynamic, complex and large scale environments, using vision as the sole external sensor, is an active area of research. The computer vision techniques employed in visual SLAM, such as detection, description and matching of salient features, image recognition and retrieval, among others, are still susceptible of improvement. The objective of this article is to provide new researchers in the field of visual SLAM a brief and comprehensible review of the state-of-the-art.

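This is an excellent survey and very friendly to newcomers. It has almost no mathematical formulas; it simply breaks SLAM down into its problems and modules, and the vocabulary is simple, so it reads quickly. Its drawbacks are that it is not very recent and does not go very deep.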

Past, Present, and Future of Simultaneous Localization And Mapping: Towards the Robust-Perception Age

Abstract—Simultaneous Localization and Mapping (SLAM) consists in the concurrent construction of a model of the environment (the map), and the estimation of the state of the robot moving within it. The SLAM community has made astonishing progress over the last 30 years, enabling large-scale real-world applications, and witnessing a steady transition of this technology to industry. We survey the current state of SLAM. We start by presenting what is now the de-facto standard formulation for SLAM. We then review related work, covering a broad set of topics including robustness and scalability in long-term mapping, metric and semantic representations for mapping, theoretical performance guarantees, active SLAM and exploration, and other new frontiers. This paper simultaneously serves as a position paper and tutorial to those who are users of SLAM. By looking at the published research with a critical eye, we delineate open challenges and new research issues, that still deserve careful scientific investigation. The paper also contains the authors’ take on two questions that often animate discussions during robotics conferences: Do robots need SLAM? and Is SLAM solved?

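This survey is comparatively harder, but it is very well written; the authors are all well-known figures in the field. What makes it valuable is that it not only traces the history and current state of SLAM but also raises a number of open problems and states the authors' own views, laying out in detail the challenges visual SLAM faces today and possible ways to address them, even if some of the problems are obvious. The survey also provides many references, especially on perceptual aliasing in place recognition, the comparison between filtering and nonlinear optimization, and the usefulness of factor graphs, all of which are worth studying later.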

FutureMapping: The Computational Structure of Spatial AI Systems, Andrew J. Davison

We discuss and predict the evolution of Simultaneous Localisation and Mapping (SLAM) into a general geometric and semantic ‘Spatial AI’ perception capability for intelligent embodied devices. A big gap remains between the visual perception performance that devices such as augmented reality eyewear or consumer robots will require and what is possible within the constraints imposed by real products. Co-design of algorithms, processors and sensors will be needed. We explore the computational structure of current and future Spatial AI algorithms and consider this within the landscape of ongoing hardware developments.

blog

FutureMapping 2: Gaussian Belief Propagation for Spatial AI

blog

Ongoing Evolution of Visual SLAM from Geometry to Deep Learning: Challenges and Opportunities

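This survey focuses on applications of deep learning in SLAM. It first introduces several common models (CNN, RNN, and encoder-decoder), then lists datasets commonly used in SLAM, including KITTI, TUM, and NYU, and then goes section by section through deep learning for depth estimation, pose estimation, ego-motion estimation, relocalization, sensor fusion, and semantic mapping. Broadly, it approaches the topic from four angles: pose estimation, depth and scale estimation, loop closure detection and relocalization, and map building. The paper closes with some open challenges and ideas. Overall the coverage is fairly complete and the cited papers are classics. Still, it feels shallow, a bit like an answer on Zhihu. It is worth reading, but I will replace it once I find a better survey.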

A Survey of Simultaneous Localization and Mapping

Focal Loss

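The loss has the form $criterion = -\alpha(1-a)^{\gamma}y \ln a - (1-\alpha)a^{\gamma}(1-y) \ln (1-a)$, where $a$ is the predicted probability of the positive class and $y \in \{0, 1\}$ is the label. Its main effect is to penalize false negatives more heavily. In the paper the authors report that for their RetinaNet, the hyperparameters $\gamma=2, \alpha=0.25$ work best (for separating foreground from background in the classification stage of object detection). In my own binary classification use the improvement was not especially striking. Tuning the parameters takes skill; otherwise the false positive rate easily becomes very high, although this may also depend on the dataset.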

The PyTorch code is as follows:

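import torch
from torch import nn

# Custom models and losses both subclass nn.Module
class BFocalLoss(nn.Module):
    """Binary focal loss on two-class logits of shape (N, 2)."""
    def __init__(self, gamma=2, alpha=0.25):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, inputs, targets):
        # Softmax over the class dimension; p is the predicted probability of class 1
        pt = torch.softmax(inputs, dim=1)
        p = pt[:, 1]
        targets = targets.float()
        # 1e-12 keeps log() finite when p saturates at 0 or 1
        loss = -self.alpha * (1 - p) ** self.gamma * (targets * torch.log(p + 1e-12)) - \
               (1 - self.alpha) * p ** self.gamma * ((1 - targets) * torch.log(1 - p + 1e-12))
        return loss.mean()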
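To see what the modulating factor does to a single sample, here is a minimal sketch in plain Python that does not depend on the class above. The function names `binary_focal_loss` and `binary_cross_entropy` and the probabilities 0.9 and 0.1 are illustrative assumptions; `alpha=0.25` and `gamma=2` are the RetinaNet settings.

```python
import math

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one sample; p is the predicted probability of class 1, y is 0 or 1."""
    pos = -alpha * (1 - p) ** gamma * y * math.log(p + eps)
    neg = -(1 - alpha) * p ** gamma * (1 - y) * math.log(1 - p + eps)
    return pos + neg

def binary_cross_entropy(p, y, eps=1e-12):
    """Plain cross-entropy for one sample, for comparison."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

# A positive sample that is classified well (p=0.9) versus badly (p=0.1)
for p in (0.9, 0.1):
    ce = binary_cross_entropy(p, 1)
    fl = binary_focal_loss(p, 1)
    print(f"p={p}: CE={ce:.4f}  FL={fl:.6f}  FL/CE={fl / ce:.4f}")
```

For a positive sample the ratio FL/CE equals `alpha * (1 - p) ** gamma`, so the easy sample is scaled by 0.0025 and the hard one by 0.2025. The hard sample therefore keeps 81 times more of its cross-entropy weight, which is the sense in which the loss concentrates on misclassified examples.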

A blog post on Medium explains focal loss.

Deep Learning in Tumor Metastatic on Medicine Image

Two papers on deep learning for detecting breast cancer metastases:

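Medical image processing differs from natural image processing. Because of the imaging equipment, lesion features are often hard to distinguish and not very pronounced, so models pretrained on ImageNet may not help much. Medical imaging also has few images and high annotation costs, so many tricks are used, and they have to be chosen according to the specific requirements and how the data was collected. Take sputum smear slide images: the resulting dataset often has features that change continuously from one acquisition group to the next, much like consecutive video frames that differ only slightly. Labeling may then only require a small number of annotations per acquisition group, and weakly supervised training can still reach quite good classification accuracy.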

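Task-specific data augmentation: RGB-HSV conversion, color normalization
Select patches from the slide at different magnifications for multi-scale input
Rotate each original sample by 90, 180, and 270 degrees, then apply a left-right flip and rotate by 90, 180, and 270 degrees again, which expands the data to 8 times its original size. Then adjust the image tone, including contrast, brightness, and saturation.
FROC rather than ROC and AUC as the performance metric
Remove background patches to reduce computation
Random forest on features extracted from the heatmap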
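The eightfold geometric augmentation in the list above (rotations plus a left-right flip) can be sketched in plain Python on a tiny 2D patch. This is only an illustration under assumptions: the helper names `rot90`, `flip_lr`, and `eightfold` are hypothetical, the patch is a nested list rather than a real image tensor, and the color adjustments are left out.

```python
def rot90(img):
    """Rotate a 2D image (list of rows) by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*img)][::-1]

def flip_lr(img):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in img]

def eightfold(img):
    """Return the original, its 3 rotations, the flipped image, and its 3 rotations."""
    variants = []
    for base in (img, flip_lr(img)):
        current = [list(row) for row in base]
        for _ in range(4):  # 0, 90, 180, 270 degrees
            variants.append(current)
            current = rot90(current)
    return variants

patch = [[1, 2],
         [3, 4]]
augmented = eightfold(patch)
print(len(augmented))  # 8
```

For a patch without symmetries, such as the one above, all eight variants are distinct, which is where the 8x dataset size comes from.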

Mixup

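Adversarial examples; the ERM (empirical risk minimization) principle does not hold up well; data augmentation; VRM (vicinal risk minimization); generating virtual samples and labels by interpolation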

$$\widetilde{x} = \lambda x_{i} + (1-\lambda)x_{j} \\ \widetilde{y} = \lambda y_{i} + (1-\lambda)y_{j}$$
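As a small illustration of the two interpolation formulas, the sketch below mixes one pair of samples in plain Python. The function name `mixup_pair` and the toy 4-pixel "images" are assumptions made for the example; a real pipeline would do the same arithmetic on batched tensors.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Blend two samples and their one-hot labels with weight lam in [0, 1]."""
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix

# Two toy "images" flattened to 4 pixels, with one-hot labels for 2 classes
x_a, y_a = [0.0, 0.0, 1.0, 1.0], [1.0, 0.0]
x_b, y_b = [1.0, 1.0, 1.0, 0.0], [0.0, 1.0]

x_mix, y_mix = mixup_pair(x_a, y_a, x_b, y_b, lam=0.75)
print(x_mix)  # [0.25, 0.25, 1.0, 0.75]
print(y_mix)  # [0.75, 0.25]

# In training, lam is drawn from Beta(alpha, alpha) rather than fixed
lam = random.betavariate(1.0, 1.0)
```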

For the cross-entropy loss and focal loss, you can use $y_{i}, y_{j}$ directly and interpolate the loss instead (this can be derived mathematically):

$$loss = \lambda \cdot criterion(\widetilde x, y_{i}) + (1-\lambda) \cdot criterion(\widetilde x, y_{j})$$
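The reason this works is that cross-entropy is linear in the label, so the loss against the mixed label equals the mixed loss against the two original labels. Here is a minimal numeric check in plain Python; the probabilities, labels, and $\lambda$ value are made-up numbers for illustration only.

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (possibly soft) target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target))

probs = [0.7, 0.2, 0.1]      # hypothetical model output for a mixed input
y_i = [1.0, 0.0, 0.0]        # one-hot label of sample i
y_j = [0.0, 0.0, 1.0]        # one-hot label of sample j
lam = 0.3

# Loss against the interpolated (soft) label
y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
loss_soft_label = cross_entropy(probs, y_mix)

# Interpolation of the two losses against the original labels
loss_interpolated = lam * cross_entropy(probs, y_i) + (1 - lam) * cross_entropy(probs, y_j)

print(abs(loss_soft_label - loss_interpolated) < 1e-12)
```

The binary focal loss written earlier in this post is also linear in $y$, so the same argument applies to it.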

The mixup vicinal distribution can be understood as a form of data augmentation that encourages the model f to behave linearly in-between training examples. We argue that this linear behaviour reduces the amount of undesirable oscillations when predicting outside the training examples. Also, linearity is a good inductive bias from the perspective of Occam’s razor, since it is one of the simplest possible behaviors.

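The paper argues that mixup controls model complexity. Under ERM, a model that keeps training ends up memorizing the training data and generalizes poorly; mixup mitigates this by interpolating randomly paired samples to generate virtual examples. Extensive experiments show that mixup does help, with good results across domains, and it can be combined with other complexity-control methods such as dropout.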

We have shown that mixup is a form of vicinal risk minimization, which trains on virtual examples constructed as the linear interpolation of two random examples from the training set and their labels. Incorporating mixup into existing training pipelines reduces to a few lines of code, and introduces little or no computational overhead. Throughout an extensive evaluation, we have shown that mixup improves the generalization error of state-of-the-art models on ImageNet, CIFAR, speech, and tabular datasets. Furthermore, mixup helps to combat memorization of corrupt labels, sensitivity to adversarial examples, and instability in adversarial training.

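Training trick for images: apply mixup in every epoch, with $\lambda$ drawn at random from the $Beta(\alpha,\alpha)$ distribution, where $\alpha$ is a hyperparameter. The paper reports values in [0.1, 0.4] on ImageNet and 1 on CIFAR. Deeper networks and longer training also lead to better final generalization. The paper does not explain, however, why the beta distribution was chosen to generate $\lambda$. The optimizer is SGD with momentum, the learning rate is decayed at specified epochs, and dropout is not used.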

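numpy.random.beta() draws random samples from the beta distribution, whose probability density function is given below. The larger $\alpha$ is, the more the samples concentrate around 0.5, at which point mixup seems to degenerate into sample pairing.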

$$\lambda \sim Beta(\alpha, \alpha), \quad f(x ; \alpha, \beta)=\frac{1}{B(\alpha, \beta)} x^{\alpha-1}(1-x)^{\beta-1} \\ B(\alpha, \beta)=\int_{0}^{1} t^{\alpha-1}(1-t)^{\beta-1} dt$$
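A quick way to check how $\alpha$ shapes the sampled $\lambda$ is to measure how far the draws sit from 0.5 on average. The sketch below uses the standard library's `random.betavariate` instead of `numpy.random.beta` so that it has no dependencies; the helper name `mean_distance_from_half`, the sample count, and the seed are arbitrary choices for illustration.

```python
import random

def mean_distance_from_half(alpha, n=20000, seed=0):
    """Average |lambda - 0.5| over n draws of lambda ~ Beta(alpha, alpha)."""
    rng = random.Random(seed)
    return sum(abs(rng.betavariate(alpha, alpha) - 0.5) for _ in range(n)) / n

for alpha in (0.2, 1.0, 10.0):
    print(alpha, round(mean_distance_from_half(alpha), 3))
```

Small $\alpha$ pushes $\lambda$ toward 0 or 1, so most mixed samples stay close to one of the originals. $\alpha = 1$ is the uniform distribution, with an expected distance of 0.25. Large $\alpha$ concentrates $\lambda$ near 0.5, which is the near-equal averaging described above.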

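mixup is very similar in spirit to sample pairing, a paper from IBM published at about the same time. Sample pairing, however, averages two randomly chosen images while keeping the label unchanged, which amounts to injecting noise, and it needs quite a few training tricks; see the experiments in this blog post.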

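With warmup added, the learning rate decayed at specified epochs, and weight decay $=10^{-4}$ (for mixup, a small weight decay works better), a larger hyperparameter $\alpha$ gives a larger training loss but better generalization. How the training loss changes depends on the dataset, though: it may rise sharply as $\alpha$ increases or barely change. The authors therefore leave the question of the best operating point open in the discussion section, and conjecture that higher-capacity models may tolerate large values of $\alpha$ better.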

In our experiments, the following trend is consistent: with increasingly large $\alpha$ , the training error on real data increases, while the generalization gap decreases.
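The learning-rate schedule mentioned above (warmup, then drops at specified epochs) could look roughly like the sketch below. Every number in it is a hypothetical placeholder, since these notes do not record the actual base rate, warmup length, or decay epochs; only the weight decay of $10^{-4}$ comes from the text, and it would be set on the optimizer rather than in this function.

```python
def learning_rate(epoch, base_lr=0.1, warmup_epochs=5, milestones=(100, 150), decay=0.1):
    """Linear warmup followed by step decay at the given epochs (all values hypothetical)."""
    if epoch < warmup_epochs:
        # Ramp up linearly so the first epochs do not take full-size steps
        return base_lr * (epoch + 1) / warmup_epochs
    # Multiply by `decay` once for every milestone already reached
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** drops

for epoch in (0, 4, 5, 99, 100, 150):
    print(epoch, learning_rate(epoch))
```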

March 2020: two more papers on image mixing, one for supervised and one for unsupervised learning:

SuperMix: Supervising the Mixing Data Augmentation

Rethinking Image Mixture for Unsupervised Visual Representation Learning

About

2019

Visual SLAM: A Few Surveys

Focal Loss

Deep Learning in Tumor Metastatic on Medicine Image

Mixup

Thoracic Disease Identification and Localization with Limited Supervision

FCN: Fully Convolutional Networks for Semantic Segmentation

BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation

Simple Does It: Weakly Supervised Instance and Semantic Segmentation

FutureMapping

Object Detection: The R-CNN Family

2020

Deep GrabCut for Object Selection

Holistically-Nested Edge Detection

Seed, Expand and Constrain: Three Principles for Weakly-Supervised Image Segmentation

Weakly-Supervised Semantic Segmentation by Iteratively Mining Common Object Features

U-Net: Convolutional Networks for Biomedical Image Segmentation

U-Net++: A Nested U-Net Architecture for Medical Image Segmentation

CAM–Learning Deep Features for Discriminative Localization

CRFasRNN

Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

Multiscale Combinatorial Grouping for Image Segmentation and

Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation

Learning to Reweight Examples for Robust Deep Learning

Detecting Lesion Bounding Ellipses With Gaussian Proposal Networks(GPN)

Class-Balanced Loss Based on Effective Number of Samples

Tell Me Where to Look: Guided Attention Inference Network

Dice Loss

Deep Learning + Visual Odometry