About

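  • Video;
  • Slides: follow the WeChat official account “旷视研究院” (Megvii Research) and reply “深度学习实践PPT” to get them;
  • Although this course dates from 2017, its coverage is still broad and much of the work it discusses remains influential, so it is not outdated and is still worth watching. It suits viewers who already have some background; beginners had better not start with this series. In my opinion the best lectures are Lecture 5 (neural network compression), Lecture 6 (DL-based object detection), Lecture 10 (GAN), and Lecture 11 (Person Re-Identification).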

1. Introduction

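  • Within AI, computer vision belongs to perception intelligence (which also includes speech); the other part is cognitive intelligence, which includes NLP and AGI (artificial general intelligence);
  • Humans understand the world with eyes and a brain; computers perceive their surroundings visually with image sensors and computing power;
  • The emergence of the cerebral cortex: flexible, structured, computational processing;
  • The relationship between CV and AI: CV is one of AI's most important tasks / a source of diverse research work and results / a key application;
  • Current CV tasks: classification (image) / detection (region) / segmentation (pixel; instance, semantic, panoptic) / sequence (video, spatial + temporal);
  • David Marr's book Vision, which is also very important in visual SLAM: representation of visual knowledge, part representation (split an object into parts and represent them with various models, e.g. keypoint detection);
  • Part representation has limitations, since some things cannot be decomposed. This led to the second revival of neural networks, with Yann LeCun's convolutional neural networks applied to handwritten character recognition and face detection. Because the results were hard to reproduce at the time, few people understood them, datasets were small, and models such as SVM were popular, neural networks then declined;
  • learning-based representation / feature-based representation: feature engineering + classifier (handcrafted feature engineering + SVM / Random Forest), a shallow learning pipeline (a short sequence);
  • End-to-end learning: all parameters are optimized jointly; a long or very long sequence realizes a high-dimensional nonlinear mapping;
  • The multilayer perceptron (MLP), inspired by the perceptron, is trained with backpropagation (BP) gradients to approximate (up to a local optimum) any nonlinear function;
  • Neural network results from the 1990s: CNN / autoencoder / Boltzmann machine / belief nets / RNN;
  • Revival: data + computing + industry competition + a few breakthroughs;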

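The rise and fall of deep learning

  • The idea behind ResNet: learn from shallow to deep and keep gradient magnitudes large, which prevents vanishing gradients;

  • The focus has shifted from hand-designing features to designing network architectures (2012-2017 so far); different architectures need different amounts of compute, and lightweight networks are currently a hot topic;

  • Types of convolution kernel: 1x1, 3x3, depthwise 3x3, etc.; ways of connecting network layers;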

2. Math in DL and ML Basics

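  • Where deep learning sits: deep learning ⊂ representation learning ⊂ machine learning ⊂ AI;

  • Linear Algebra:

    • Vectors, matrices, sets, groups, closure; matrix multiplication expresses a transformation that maps one vector to another;

    • Square matrices, orthogonal matrices, eigenvalues, eigenvectors, real symmetric matrices, quadratic forms, positive definite matrices, positive semidefinite matrices, singular value decomposition;

  • Probability:

    • Random events, random variables, probability density functions, joint distributions, marginal distributions, conditional distributions, independent variables;

    • Bayes' rule, prior distribution, posterior distribution, expectation, variance, covariance matrix (positive semidefinite);

    • Common distributions: Bernoulli, binomial, categorical/multinomial (image classification), normal (Gaussian);

    • Information entropy (behind the cross-entropy loss in classification; the more probable an event, the less information it carries), cross entropy and KL-divergence, the Wasserstein distance in generative models;

  • Optimize:

    • minimization: gradient descent (the choice of step size is critical), stochastic gradient descent;
  • Machine learning basics: definition, hypothesis, model, evaluation; supervised & unsupervised learning (supervised: learning $p(y \mid x)$ or $p(x, y)$, i.e. discriminative vs. generative models <discriminative models are what is mostly used at present>; unsupervised: learning $p(x)$, auto-encoder, GAN); the “no free lunch theorem” (all learning algorithms are equal, but some algorithms are more equal than others); overfitting & underfitting; model capacity vs. generalization error; regularization (penalty terms, data augmentation, parameter reduction and tying);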
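The remark above that the step size of gradient descent is critical can be illustrated with a small sketch. The objective $f(x) = (x-3)^2$, the function name `gradient_descent`, and the two step sizes are assumptions chosen for illustration; they are not from the lecture.

```python
def gradient_descent(grad, x0, step, n_steps):
    """Plain gradient descent: repeatedly move against the gradient."""
    x = x0
    for _ in range(n_steps):
        x = x - step * grad(x)
    return x


def grad_f(x):
    # Gradient of the illustrative objective f(x) = (x - 3)^2.
    return 2.0 * (x - 3.0)


# Each step multiplies the error (x - 3) by (1 - 2 * step).
x_small = gradient_descent(grad_f, x0=0.0, step=0.1, n_steps=100)  # factor 0.8: converges
x_large = gradient_descent(grad_f, x0=0.0, step=1.1, n_steps=100)  # factor -1.2: diverges
```

With the small step the iterate settles near the minimizer 3; with the large step the error grows in magnitude at every iteration, so the same algorithm diverges.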

3. Neural Networks Basics & Architecture Design

  • Fundamental task in CV: classification, object detection, semantic segmentation, instance segmentation, keypoint detection, human pose estimation, VQA…

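  • Why image recognition is hard for computers: the complexity and diversity of image content, e.g. pose, illumination, blur;

  • Features are the beacon by which a computer understands an image, and a nonlinear feature extractor should be used;

  • Linear combinations of features (kernel learning, boosting): the drawback is that many templates are needed and features are reused poorly;

  • Hierarchical composition of features reuses features and is more efficient -> concepts reuse in DL; features across network layers also go from low level to high level, but such a highly nonlinear function is hard to optimize (in practice, training converges to a local optimum);

  • key ideas of DL: nonlinear system, learn it from data, feature hierarchies, end-to-end learning;

  • Activation functions, neurons, fully connected networks; training determines the network parameters (forward pass, backward pass, update);

  • For images, the design moves from locally-connected nets to convolutional nets, which share parameters;

  • The convolution operation of a convolutional layer, pooling layers, etc.;

  • Network architecture design: network topology, layer functions, hyperparameters, optimization algorithm and other empirical choices, made by hand or by AutoML;

  • Brief introductions to AlexNet (includes LRN, speeds up convergence), VGG (revealed how effective small 3x3 kernels are, although that does not make it the most efficient design), GoogLeNet, ResNet (fits the residual rather than the original function directly), Xception, ResNeXt (improves ResNet with ideas borrowed from Xception), ShuffleNet, DenseNet, SqueezeNet;

  • structure design: deeper and wider, ease of optimization, multi-path design, residual path, sparse connection;

  • Brief introductions to some layer designs: SPP, batch normalization, parametric rectifiers, bilinear CNNs (for fine-grained classification);

  • Task-specific architecture design: DeepFace (face recognition), Global Convolutional Networks (semantic segmentation), Hourglass Networks (hourglass structure with a large receptive field, used for pose estimation or keypoints);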
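To make the step from fully connected to locally-connected to convolutional layers more concrete, the sketch below counts weights for each. It assumes an output map with the same spatial size as the input and ignores biases; the function names and the 32x32x3 example are illustrative assumptions, not figures from the lecture.

```python
def fc_params(h, w, c_in, c_out):
    # Fully connected: every output value connects to every input value.
    return (h * w * c_in) * (h * w * c_out)


def locally_connected_params(h, w, c_in, c_out, k):
    # Locally connected: each output position owns its own k x k x c_in filters.
    return (h * w) * (k * k * c_in * c_out)


def conv_params(c_in, c_out, k):
    # Convolution: one set of k x k x c_in filters shared by all positions.
    return k * k * c_in * c_out


fc = fc_params(32, 32, 3, 16)
local = locally_connected_params(32, 32, 3, 16, k=3)
conv = conv_params(3, 16, k=3)
print(fc, local, conv)
```

Local connectivity removes the dependence on distant pixels, and sharing the filters across positions removes the remaining factor of `h * w`.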

4. Introduction to Computation Technologies in Deep Learning

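This lecture is fairly low-level and I did not follow all of it, so treat these notes as a rough overview.

  • symbolic computation:

    • overview of deep learning frameworks: program, compilation, runtime management, kernels, hardware;

    • computing graph, graph structure: variable, operator, edge;

    • static graphs and dynamic graphs;

    • execution and optimization;

  • dense numerical computation:

    • CPU computation (machine code, pipelining, superpipelining, superscalar, out-of-order execution / cache hierarchy / …);

    • other computation devices (NVIDIA GPU <single-instruction, multiple-thread architecture>, Google TPU, Huawei NPU in Kirin 970, Mobile CPU+GPU+DSP);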

    • computation & memory gap;

  • distributed computation:

    • system (communication,Remote Direct Memory Access);

    • optimization algorithm (synchronous SGD,asynchronous SGD);

    • communication algorithm (MPI Primitives,An AllReduce Algorithm);

5. Neural Network Approximation(low rank, sparsity, and quantization)

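This lecture focuses on neural network compression, for faster training, faster inference, smaller capacity;

convolution as matrix product: the network is compressed by approximating the weight matrix;

Low Rank (in essence, a series of matrix decompositions, transformations and approximations that reduce computation and storage):

  • Apply singular value decomposition (SVD) to the weight matrix;

  • SVD + Kronecker Product ----> KPSVD;

  • Matrix decomposition: C-HW-K ====> C-HW-R-(1x1)-K, then decompose again after a reshape; horizontal-vertical decomposition currently works best;

  • shared group convolution is a Kronecker layer;

  • CP-decomposition and depthwise convolution;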
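The saving from a low-rank factorization can be sketched without any library. The example below assumes the weight matrix is already available as a product `W = U V` of rank `r` (in practice `U` and `V` would come from a truncated SVD, which is not computed here); all names and numbers are illustrative.

```python
def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def multiply_adds(m, n, r=None):
    # Dense m x n layer costs m * n; a rank-r factorization costs r * (m + n).
    return m * n if r is None else r * (m + n)


U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]          # 4 x 2
V = [[1.0, 2.0, 0.0, -1.0, 3.0], [0.0, 1.0, 1.0, 2.0, -2.0]]   # 2 x 5
W = matmul(U, V)                                               # 4 x 5, rank 2
x = [1.0, -1.0, 2.0, 0.5, 1.0]

y_full = matvec(W, x)            # one dense product
y_low = matvec(U, matvec(V, x))  # two thin products, same result

dense_cost = multiply_adds(1024, 1024)
low_rank_cost = multiply_adds(1024, 1024, r=64)
```

For a 1024 x 1024 layer approximated at rank 64, the count drops from 1,048,576 to 131,072 multiply-adds, at the price of whatever approximation error the truncation introduces.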

Sparse Approximation:

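  • The weight distribution looks roughly Gaussian, with many weights near 0. Fine-tune the network with weight pruning: in Dr. Song Han's Deep Compression the number of zero weights is increased gradually (a mask matrix forces weights to 0), so that weights near 0 do not jitter during training; compression is most effective on FC layers;

  • Accelerating network computation: sparse matrix computation, channel pruning, sparse communication for distributed gradient descent;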
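A minimal sketch of mask-based magnitude pruning, under the assumption that weights are pruned by a fixed absolute threshold and that the mask is re-applied after every update; the threshold, learning rate and function names are hypothetical and not taken from Deep Compression itself.

```python
def magnitude_prune(weights, threshold):
    # Keep a weight only if its magnitude reaches the threshold.
    mask = [1.0 if abs(w) >= threshold else 0.0 for w in weights]
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask


def masked_update(weights, grads, mask, lr):
    # Re-applying the mask keeps pruned weights at zero during fine-tuning.
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]


weights = [0.8, -0.05, 0.02, -1.2]
pruned, mask = magnitude_prune(weights, threshold=0.1)
updated = masked_update(pruned, grads=[0.1, 0.1, 0.1, 0.1], mask=mask, lr=0.5)
```

The two small weights are zeroed and stay zero after the update, while the surviving weights keep training.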

Quantization:

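  • Which precision to compute in;

  • Quantization of parameters, of activations, and of gradients;

  • Binarization, binary network;

  • Large-capacity models lose little accuracy when trained with few bits; for small-capacity models it depends on the case;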
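Two common building blocks of low-bit networks can be sketched as follows: uniform k-bit quantization of a value in [0, 1] (in the spirit of DoReFa-Net) and binarization to a sign pattern with one scaling factor (in the spirit of XNOR-Net). This is a simplified illustration under those assumptions, not the exact procedure of either paper; function names are hypothetical.

```python
def quantize_k(x, k):
    # Map x in [0, 1] to the nearest of 2^k evenly spaced levels.
    levels = 2 ** k - 1
    return round(x * levels) / levels


def binarize(weights):
    # Approximate the weights by alpha * sign(w), with alpha the mean magnitude.
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return alpha, signs


q2_a = quantize_k(0.3, 2)   # levels are 0, 1/3, 2/3, 1
q2_b = quantize_k(0.8, 2)
alpha, signs = binarize([0.5, -1.5, 1.0, -1.0])
```

During training such quantizers are usually paired with a straight-through gradient estimate, which this sketch does not show.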

Papers recommended by the lecturer, Zhou Shuchang; XNOR-Net is required reading for the course:

Bit Neural Network
● Matthieu Courbariaux et al. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. http://arxiv.org/abs/1511.00363.
● Itay Hubara et al. Binarized Neural Networks. https://arxiv.org/abs/1602.02505v3.
● Matthieu Courbariaux et al. Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1. http://arxiv.org/pdf/1602.02830v3.pdf.
● Rastegari et al. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. http://arxiv.org/pdf/1603.05279v1.pdf.
● Zhou et al. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. https://arxiv.org/abs/1606.06160.
● Hubara et al. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. https://arxiv.org/abs/1609.07061.

6. Modern Object Detection

The field keeps cycling between anchor-free and anchor-based methods;

Representation:

  • Bounding-box: face detection, human detection, vehicle detection, text detection, general object detection;

  • Point: semantic segmentation (next lecture);

  • Keypoint: face landmark, human keypoint;

Evaluation Criteria:

  • precision (the fraction of predicted positives that are truly positive), recall (the fraction of true positives that are predicted positive), Average Precision (AP), mean AP (mAP), IoU, mmAP (COCO);
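The criteria above can be sketched for a single image. The code assumes boxes are `(x1, y1, x2, y2)` tuples, that predictions are already sorted by descending score, and that each ground-truth box may be matched at most once; the greedy matching rule and the 0.5 threshold are common conventions used here as assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, gts, thr=0.5):
    """Greedy one-to-one matching of predictions to ground truth at an IoU threshold."""
    matched, tp = set(), 0
    for p in preds:
        best_j, best_iou = -1, thr
        for j, g in enumerate(gts):
            overlap = iou(p, g)
            if j not in matched and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60), (0, 0, 9, 9)]
precision, recall = precision_recall(preds, gts)
```

Here only the first prediction counts as a true positive: the second overlaps nothing and the third hits a ground-truth box that is already matched, so precision is 1/3 and recall is 1/2. AP then summarizes precision over all recall levels as the score threshold varies.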

Perform a detection:

  • Previously: handcrafted features + image pyramid + sliding window + classifier (robust real-time detection; IJCV 2001);

  • Computation is shared through a Fully Convolutional Network;

Deep Learning for Object Detection:

  • proposal and refine;

  • one stage:

    • example: Densebox, YOLO, SSD, RetinaNet…

    • keyword: anchor, divide and conquer, loss sampling;

  • two stage:

    • example: Faster R-CNN, RFCN, FPN, Mask R-CNN;
    • keyword: speed, performance;

One Stage:

Densebox:

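  • Pipeline: image -> image pyramid -> CNN -> upsampling -> CNN -> (4+1) channels -> prediction + threshold + NMS;

  • Input: $m \times n \times 3$; output: $m/4 \times n/4 \times 5$;

  • Each pixel of the output feature map corresponds to a scored bounding box:

$t_{i}=\left\{s_{i},\ dx^{t}=x_{i}-x_{t},\ dy^{t}=y_{i}-y_{t},\ dx^{b}=x_{i}-x_{b},\ dy^{b}=y_{i}-y_{b}\right\}$

where t and b denote the top-left and bottom-right corners, respectively;

  • Problems: the L2 regression loss is a poor choice (objects at different scales are learned to different degrees); GT assignment is also problematic, since in crowded scenes several objects can shrink onto a single point of the final feature map; there are many FPs; and the choice of regression variables leads to large errors;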
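Reading the target definition backwards gives the box that a single output pixel predicts. The sketch below assumes the offsets are expressed in the same coordinate frame as the pixel position (any stride handling is left out); the function name is hypothetical.

```python
def decode_densebox(xi, yi, score, dxt, dyt, dxb, dyb):
    # dx^t = x_i - x_t  =>  x_t = x_i - dx^t, and likewise for the other offsets.
    x_t, y_t = xi - dxt, yi - dyt
    x_b, y_b = xi - dxb, yi - dyb
    return (x_t, y_t, x_b, y_b, score)


# A pixel at (10, 12) inside a box with top-left (4, 6) and bottom-right (20, 18)
# has targets dx^t = 6, dy^t = 6, dx^b = -10, dy^b = -6.
box = decode_densebox(10, 12, 0.9, 6, 6, -10, -6)
```

Boxes decoded this way for every pixel are then filtered by a score threshold and merged with NMS, as in the pipeline above.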

UnitBox:

  • Replace the L2 loss with the IoU loss $= -\ln \mathrm{IoU}$;
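One way to see why the IoU loss helps with objects of different scales is to compare both losses on a small and a large box that are shifted by the same relative amount. The boxes, the `(x1, y1, x2, y2)` format and the helper names are assumptions for illustration, and the boxes are assumed to overlap so the logarithm is defined.

```python
import math


def iou(a, b):
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def iou_loss(pred, gt):
    return -math.log(iou(pred, gt))


def l2_loss(pred, gt):
    return sum((p - g) ** 2 for p, g in zip(pred, gt))


small_gt, small_pred = (0, 0, 10, 10), (1, 1, 11, 11)
large_gt, large_pred = (0, 0, 100, 100), (10, 10, 110, 110)

iou_small, iou_large = iou_loss(small_pred, small_gt), iou_loss(large_pred, large_gt)
l2_small, l2_large = l2_loss(small_pred, small_gt), l2_loss(large_pred, large_gt)
```

Both pairs have the same IoU, so the IoU loss treats them equally, while the L2 loss on coordinates is 100 times larger for the large box and therefore dominates training.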

YOLO:

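  • A $7 \times 7$ grid; the added fc layers can cover more global context, but the input size is fixed; it runs fast but does not work well in crowded scenes;

SSD:

  • Introduces anchors with different scales and aspect ratios;

  • Regresses the offset between the GT and the anchor;

  • Different layers detect objects of different sizes: small objects come from shallow layers, large objects from deep layers (though there is no direct evidence that this is reliable);

  • loss sampling and OHEM;

  • blog

DSSD:

  • SSD detects small objects with shallow layers, but shallow layers carry little semantic information;

  • Uses upsampling and fusion to strengthen semantic information;

RON:

  • reverse connect (similar to FPN);

  • loss sampling: objectness prior (first a binary classification, then a finer one);

RetinaNet:

  • Introduces focal loss;

  • FPN structure;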
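Focal loss scales the cross-entropy of each example by $(1 - p_t)^{\gamma}$, so easy examples contribute far less than hard ones. The sketch below is for a single binary prediction; `gamma=2` follows the commonly cited RetinaNet setting, and the optional class weight `alpha` is left off by default to keep the comparison with cross-entropy direct.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=None):
    """Focal loss for one binary prediction p = P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p
    weight = (1.0 - p_t) ** gamma
    if alpha is not None:
        weight *= alpha if y == 1 else 1.0 - alpha
    return -weight * math.log(p_t)


cross_entropy_easy = focal_loss(0.9, 1, gamma=0.0)  # gamma = 0 is plain cross-entropy
focal_easy = focal_loss(0.9, 1)                     # scaled by 0.1 ** 2
focal_hard = focal_loss(0.1, 1)                     # scaled by 0.9 ** 2
```

The easy example is down-weighted by a factor of about 100 and the hard one by only about 1.2, which is how the loss copes with the flood of easy background samples in a one-stage detector.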

One Stage Detector: Summary

  • Anchor:

    • No anchor: YOLO, Densebox/UnitBox/EAST;

    • Anchor: YOLOv2, SSD, DSSD, RON, RetinaNet;

  • Divide and conquer:

    • SSD, DSSD, RON, RetinaNet;
  • loss sample:

    • all sample: densebox;
    • OHEM: SSD;
    • focal loss: RetinaNet;

Two Stage:

RCNN:

  • selective search + classification of each proposal;

Fast RCNN:

  • The selective search proposals are mapped onto the feature map and classified via RoI pooling;

Faster RCNN:

  • Uses predefined anchors to find proposals;

RFCN,Deformable Convolutional Networks,FPN,Mask RCNN…

Two Stages Detector-Summary:

  • Speed:
    • RCNN -> Fast RCNN -> Faster RCNN -> RFCN;
  • Performance:
    • Divide and conquer:
      • FPN;
    • Deformable Pool/ROIAlign;
    • Deformable Conv;
    • Multi-task learning;

Open Problem in Detection:

  • FP;
  • NMS (detection in crowd);
  • GT assignment issue;
  • Detection in video:
    • detect & track in a network;

Human Keypoint Task:

  • Single Person Skeleton:

    • CPM;

    • Hourglass;

  • Multiple-Person Skeleton:

    • top down:

      • detect->single person skeleton;

      • Depends on the detector:

        • Fail in the crowd case;

        • Fail with partial observation;

        • can detect the small-scale human;

      • More computation;

      • Better localization when the input-size of single person skeleton is large;

    • bottom up:

      • DeepCut/DeeperCut, OpenPose, Associative Embedding;
      • Fast computational speed;
      • good at localizing the human with partial observation;
      • Hard to assemble human;

7. Scene Text Detection and Recognition

Background:

  • Why text matters: it is a mark of civilization, carries high-level semantic information, and serves as a cue for visual recognition;

  • problem: scene text detection + scene text recognition;

  • challenge: more complex than OCR, e.g. background, color, font, orientation, mixed scripts;

  • application: card recognition, image-based localization, product search, autonomous driving, industrial automation, etc.;

conventional methods:

  • detection before deep learning: MSER (maximally stable extremal regions), SWT (stroke width transform), Multi-Oriented;

  • recognition: Top-down and bottom-up cues (sliding window + statistical properties), Tree-structured Model (DPM + CRF), Label embedding (a different approach altogether);

  • Unified detection and recognition: Lexicon Driven;

Deep learning methods:

Methods that still use conventional techniques as an aid:

  • end-to-end recognition: PhotoOCR, Deep Features, Reading Text;

  • detection: MSER Trees;

Methods that do not use conventional techniques:

  • detection: Holistic (treated as semantic segmentation), EAST (Megvii, CVPR 2017, multi-task learning), Deep Direct Regression (similar to EAST), SegLink (multi-scale feature maps), Synthetic Data (renders text onto images);

EAST framework

  • recognition: $R^2AM$ (recursive recurrent neural network + soft attention), Visual Attention;

  • end-to-end recognition: Deep TextSpotter;

  • summary: ideas from object detection and segmentation, end-to-end, use synthetic data;

datasets and competitions:

  • dataset: ICDAR 2013, MSRA-TD500, ICDAR 2015, IIIT 5K-Word, COCO-Text, MLT, Total-Text;

conclusion:

challenges:

  • Diversity of text: language, font, scale, orientation, arrangement, etc;
  • Complexity of background: virtually indistinguishable elements (signs, fences, bricks and grasses, etc.);
  • Interferences: noise, blur, distortion, low resolution, nonuniform illumination, partial occlusion, etc;

Trends:

  • Stronger models (accuracy, efficiency, interpretability);
  • Data synthesis;
  • Multi-oriented text;
  • Curved text;
  • Multi-language text;

References:

  • Survey:

    • Ye et al. Text Detection and Recognition in Imagery: A Survey. TPAMI, 2015.
    • Zhu et al. Scene Text Detection and Recognition: Recent Advances and Future Trends. FCS, 2015.
  • Conventional Methods:

    • Epshtein et al. Detecting Text in Natural Scenes with Stroke Width Transform. CVPR, 2010.
    • Neumann et al. A method for text localization and recognition in real-world images. ACCV, 2010.
    • Yao et al. Detecting Texts of Arbitrary Orientations in Natural Images. CVPR, 2012.
    • Wang et al. End-to-End Scene Text Recognition. ICCV, 2011.
    • Mishra et al. Scene Text Recognition using Higher Order Language Priors. BMVC, 2012.
    • Busta et al. FASText: Efficient Unconstrained Scene Text Detector. ICCV, 2015.
  • Deep Learning Methods:

    • Bissacco et al. PhotoOCR: Reading Text in Uncontrolled Conditions. ICCV, 2013.
    • Jaderberg et al. Deep Features for Text Spotting. ECCV, 2014.
    • Gupta et al. Synthetic Data for Text Localisation in Natural Images. CVPR, 2016.
    • Zhou et al. EAST: An Efficient and Accurate Scene Text Detector. CVPR, 2017.
    • Busta et al. Deep TextSpotter: An End-To-End Trainable Scene Text Localization and Recognition Framework. ICCV, 2017.
    • Ghosh et al. Visual attention models for scene text recognition. 2017. arXiv:1706.01487.
    • Cheng et al. Focusing Attention: Towards Accurate Text Recognition in Natural Images. ICCV, 2017.

Useful Resources:

8. Image Segmentation

semantic segmentation, instance segmentation, scene parsing, human parsing, stuff segmentation, ultrasound segmentation, selfie segmentation…

Evaluation metrics:

$\begin{array}{l} {\operatorname{Accuracy}(\mathbf{y}, \hat{\mathbf{y}})=\sum_{i=0}^{n} \frac{I\left[y_{i}=\hat{y}_{i}\right]}{n}} \\ {\operatorname{mean\,IoU}(\mathbf{y}, \hat{\mathbf{y}})=\frac{\sum_{c}^{m} \sum_{i}^{n} I\left[y_{i}=c, \hat{y}_{i}=c\right]}{\sum_{c}^{m} \sum_{i}^{n} I\left[y_{i}=c \text { or } \hat{y}_{i}=c\right]}} \end{array}$
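A small sketch of the two metrics on flattened label lists. `pooled_iou` follows the formula as written above (intersections and unions summed over classes before dividing); `per_class_mean_iou` averages the IoU of each class, which is the form most benchmarks report. The function names and the toy labels are assumptions for illustration.

```python
def pixel_accuracy(y, y_hat):
    return sum(1 for a, b in zip(y, y_hat) if a == b) / len(y)


def class_counts(y, y_hat, c):
    inter = sum(1 for a, b in zip(y, y_hat) if a == c and b == c)
    union = sum(1 for a, b in zip(y, y_hat) if a == c or b == c)
    return inter, union


def pooled_iou(y, y_hat, classes):
    # Sum intersections and unions over all classes, then divide once.
    counts = [class_counts(y, y_hat, c) for c in classes]
    return sum(i for i, _ in counts) / sum(u for _, u in counts)


def per_class_mean_iou(y, y_hat, classes):
    # Average the per-class IoU, skipping classes absent from both inputs.
    ious = []
    for c in classes:
        inter, union = class_counts(y, y_hat, c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


y = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 1, 2, 0]
acc = pixel_accuracy(y, y_hat)
pooled = pooled_iou(y, y_hat, classes=[0, 1, 2])
averaged = per_class_mean_iou(y, y_hat, classes=[0, 1, 2])
```

On this toy input the pooled value is 5/7 and the per-class average is 13/18; the two differ because the pooled form weights classes by their pixel counts.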

Semantic Segmentation:

  • FCN: the first semantic segmentation work;

  • Learning Deconvolution Network for Semantic Segmentation: introduces unpooling and deconvolution;

  • DeepLab: introduces dilated convolution and DenseCRF;

  • CRF as RNN;

  • Deeplab Attention;

  • PSPNet;

  • GCN (Global Convolutional Network, the lecturer's own work, which aims to cover objects of any scale);

  • Deeplab V3;

  • Deformable Convolution;

deformable deconvolution

Instance Segmentation:

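Top-down pipeline (currently mainstream; relies on a detection framework):

  • Detection first, then segmentation;

  • FCIS (the boxes are inaccurate, but the segmentation is still accurate);

  • Mask RCNN;

Bottom-up pipeline (weaker results and hard to implement, but much room for new ideas):

  • Segmentation without producing boxes;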

  • Semantic instance segmentation via metric learning;


Introduction to Megvii's framework:

  • batch size in training:

    The batch size in detection is usually much smaller than in classification, mainly because training image sizes differ, and also because a single image can yield many proposals…

    A small batch size leads to: unstable gradient, inaccurate BN statistics, extremely imbalanced data, very long training period…

  • Multi-device BatchNorm;

  • Sublinear Memory;

  • Large Learning Rate;

  • Some tricks for the COCO instance segmentation competition: precise RoI pooling, context extractor, mask generator;

  • Tricks for the keypoint competition;

9. Recurrent Neural Network

RNN Basics:

  • Turing Machine, RNN is Turing Complete, Sequence Modeling;

  • RNN Diagram, $(h_{i}, y_{i}) = F(h_{i-1}, x_{i}, W)$;

  • Categories by input/output: many-to-many, many-to-one, one-to-many, many-to-one + one-to-many;

  • many-to-many example: language model (predict the next word given the previous words, tell stories, write books in LaTeX…);

  • many-to-one example: Sentiment analysis…

  • many-to-one + one-to-many example: Neural Machine Translation (encoder + decoder)…

  • Training RNNs, exploding and vanishing gradients: singular value > 1 => explodes, singular value < 1 => vanishes… LSTM (Long short-term memory) comes to the rescue;

  • why LSTM works (input gate, forget gate, output gate, temp variable, memory cell);

  • GRU (similar to LSTM, let information flow without a separate memory cell);

  • Search for better RNN architecture;
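The exploding/vanishing gradient statement can be checked on the simplest possible case, a scalar linear recurrence $h_t = w \cdot h_{t-1}$, where the single singular value is $|w|$. This is a deliberately reduced assumption: a real RNN has a weight matrix and a nonlinearity, but the repeated-product structure of the gradient is the same.

```python
def linear_rnn_gradient(w, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return grad


exploding = linear_rnn_gradient(1.1, steps=100)   # about 1.4e4
vanishing = linear_rnn_gradient(0.9, steps=100)   # about 2.7e-5
```

A 10% deviation from 1 is enough to change the gradient by several orders of magnitude over 100 steps, which is the problem gated units such as LSTM and GRU are designed to ease.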

Simple RNN Extensions:

  • Bidirectional RNN (BDRNN): also uses information from the future of the sequence;

  • 2D-RNN: Pixel-RNN, each pixel depends on its top and left neighbor (image completion, segmentation);

  • Deep RNN (stack more of them, harder to train);

RNN with Attention:

  • attention: differentiate entities by its importance, spatial attention is related to location; temporal attention is related to causality;

  • attention over input sequence: Neural Machine Translation (NMT);

  • Image Attention: Image Captioning (input image–> Convolutional feature extraction–>RNN with attention over the image–>Word by word generation);

RNN with External Memory:

  • copy a sequence: Neural Turing Machines (NTM);

More Applications:

  • RNN without a sequence input: read house numbers from left to right, generate images of digits by learning to sequentially add color to canvas;

  • generalizing recurrence (a computation unit with shared parameter occurs at multiple places in the computation graph);

  • apply when there’s tree structure in data;

  • bottom-up aggregation of information;

  • speech recognition;

  • generating sequence;

  • question answering;

  • visual question answering;

  • combinatorial problems;

  • learning to execute;

  • compress image;

  • model architecture search;

  • meta-learning;

RNN’s Rivals:

  • WaveNet: causal dilated convolution, Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).

  • Attention is All You Need (Transformer) ;

10. Introduction to Generative Models (and GANs)

Basics:

  • Generative Models: Learning the distributions;

  • Discriminative: learn the likelihood;

  • Generative: performs Density Estimation (learns the distribution) to allow sampling;

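  • Modeling the task as regression yields an average: the output regresses to the mean of the most likely cases and looks unrealistic; a discriminative model just smooths all possibilities, giving ambiguity and a “blur” effect;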

  • application of generative models: image generation from sketch, interactive editing, image to image translation;

How to train generative models:

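  • Given a set of sample points, the model generates outputs that follow the expected distribution;

From left to right, the methods work progressively better

  • exact model: NVP (non-volume preserving), real NVP: invertible nonlinear transforms; the theoretical requirements are too strict (Restriction on the source domain: must be of the same as the target.) and results are poor (faces fare slightly better because their structure is fairly regular);

  • Variational Auto-Encoder (VAE): the encoder performs density estimation, the decoder performs sampling;

  • Generative Adversarial Networks (GAN): the generator and the discriminator learn from each other and improve together, trained alternately;

  • DCGAN: example of feature manipulation (operations such as adding glasses to a face or changing gender);

  • conditional, cross-domain generation (generative adversarial text to image synthesis);

  • GAN training problems: unstable losses (G and D should stay in dynamic balance during training), mini-batch fluctuation (generated images differ from batch to batch), mode collapse (lack of diversity in generated results);

  • improve GAN training: label smoothing, Wasserstein GAN (WGAN) (stabilized training curve, non-vanishing gradient), loss sensitive GAN (LS-GAN)… The GAN Zoo;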
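For reference, the alternating training of generator and discriminator is usually written as the following minimax game, as in the original GAN formulation covered by the Goodfellow tutorial linked further down. The symbols $p_{\text{data}}$ (data distribution) and $p_z$ (noise prior) are standard notation and do not appear in these notes.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator step increases $V$ with $G$ fixed and the generator step decreases it with $D$ fixed. The unstable losses listed above arise when one of the two players gets far ahead of the other.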

Some well-known GAN examples:

  • Zhu Junyan's CycleGAN: correspondence from unpaired data;

  • DiscoGAN: cross-domain relation;

  • GeneGAN: shorter pathway improves training (cross breeds and reproductions, generating smiles), object transfiguration (changing hairstyle), interpolation in object subspace (changing hair direction);

Math behind Generative Models:

  • formulation: sampling vs. density estimation;

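  • RBM (rarely used nowadays);

  • from density to sample: given a probability density function, sampling cannot be done efficiently;

  • from sample to density: given a black-box sampler, can the probability density (frequency) be estimated?

    Given samples, some properties of the distribution can be learned, while others cannot.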

  • the future of GANs: guaranteed stabilization (new distance), broader application (apply adversarial loss in xx/ different type of data);

  • GAN tutorial from Ian Goodfellow: https://arxiv.org/abs/1701.00160;

11. Person Re-Identification

ReID: from face to person;

  • face recognition (verification, size: $32 \times 32$, horizontal: -30~30, vertical: -20~20, little occlusion);

  • person Re-Identification (tracking across cameras, searching for a person in videos, clustering persons in photos; challenges: inaccurate detection, misalignment, illumination difference, occlusion…);

  • common in FR & ReID: deep metric learning, mutual learning, re-ranking;

  • special in ReID: feature alignment, ReID with pose estimation, ReID with human attributes;

from classification to metric learning:

  • A classification network can only recognize objects it has “seen”; unseen objects require retraining, which is unrealistic for a deployed face recognition system. To overcome this, metric learning is added: a pre-trained classification network is fine-tuned with metric learning (similar feature);

  • Some work fuses intermediate feature maps, but this increases both computation and storage and slows things down, so it is impractical;

Metric Learning:

  • Learn a function that measures how similar two objects are. Compared to classification, which works in a closed world, metric learning deals with an open world.

  • contrastive loss: $L_{\text{pairwise}}=\delta\left(I_{A}, I_{B}\right) \cdot\left\|f_{A}-f_{B}\right\|_{2}+\left(1-\delta\left(I_{A}, I_{B}\right)\right)\left(\alpha-\left\|f_{A}-f_{B}\right\|_{2}\right)_{+}$ (the last term focuses on hard samples; $\delta$ is the Kronecker delta, $\alpha$ is the margin for different identities). It pulls images with the same identity closer and pushes the others apart; $\alpha$ is used to ignore the “naive” negative pairs;

  • triplet loss: $L_{trp}=\frac{1}{N} \sum^{N}\left(\left\|f_{A}-f_{A^{\prime}}\right\|_{2}-\left\|f_{A}-f_{B}\right\|_{2}+\alpha\right)_{+}$ (The distance between A and A’ should be smaller than that between A and B. $\alpha$ is the margin between negative and positive pairs. Without $\alpha$, all distances converge to zero.);

  • improved triplet loss: $L_{imtrp}=\frac{1}{N} \sum^{N}\left(\left\|f_{A}-f_{A^{\prime}}\right\|_{2}-\left\|f_{A}-f_{B}\right\|_{2}+\alpha\right)_{+}+\frac{1}{N} \sum^{N}\left(\left\|f_{A}-f_{A^{\prime}}\right\|_{2}-\beta\right)_{+}$ ($\beta$ penalizes the distance between the features of $A$ and $A^{\prime}$); the added term only considers image pairs with the same identity;

  • quadruplet loss: $\begin{aligned} L_{quad} &=\frac{1}{N} \sum^{N}(\overbrace{\left\|f_{A}-f_{A^{\prime}}\right\|_{2}-\left\|f_{A}-f_{B}\right\|_{2}+\alpha}^{\text {relative distance}}) \\ &+\frac{1}{N} \sum^{N}(\overbrace{\left\|f_{A}-f_{A^{\prime}}\right\|_{2}-\left\|f_{C}-f_{B}\right\|_{2}+\beta}^{\text {absolute distance}}) \end{aligned}$; it combines triplet loss and pairwise loss, so that the distance between any two images of the same identity is smaller than the distance between any two images of different identities;

  • Triplet loss improves clearly over contrastive loss, whereas quadruplet loss adds little over triplet loss while increasing computation and search space, so triplet loss is the usual choice;
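The contrastive and triplet losses above can be written for a single pair or triplet as follows. Features are plain Python lists, the distance is Euclidean, and `hinge` plays the role of $(\cdot)_+$; the function names and the toy features are assumptions for illustration.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hinge(value):
    # The (.)_+ operator: negative values are clipped to zero.
    return max(0.0, value)


def contrastive_loss(f_a, f_b, same_identity, alpha):
    d = l2_distance(f_a, f_b)
    # Same identity: pull together. Different identity: push apart up to margin alpha.
    return d if same_identity else hinge(alpha - d)


def triplet_loss(f_a, f_pos, f_neg, alpha):
    return hinge(l2_distance(f_a, f_pos) - l2_distance(f_a, f_neg) + alpha)


anchor, positive, negative = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]  # distances 5 and 10
easy_triplet = triplet_loss(anchor, positive, negative, alpha=1.0)  # 5 - 10 + 1 < 0
hard_triplet = triplet_loss(anchor, positive, negative, alpha=6.0)  # 5 - 10 + 6 = 1
pos_pair = contrastive_loss(anchor, positive, True, alpha=12.0)
neg_pair = contrastive_loss(anchor, negative, False, alpha=12.0)
```

A triplet that already satisfies the margin contributes zero loss and zero gradient, which is why the choice of triplets matters so much in practice.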

Hard Sample Mining:

  • triplet hard loss: $L_{\text{trihard}}=\frac{1}{N} \sum_{A \in \text{batch}}(\overbrace{\max_{A^{\prime}}\left(\left\|f_{A}-f_{A^{\prime}}\right\|_{2}\right)}^{\text {hard positive pair}}-\overbrace{\min_{B}\left(\left\|f_{A}-f_{B}\right\|_{2}\right)}^{\text {hard negative pair}}+\alpha)$; in the distance matrix, find the least similar pair among images of the same identity (the largest distance in the diagonal block) and the most similar pair among images of different identities (the smallest distance in other places);

  • soft triplet hard loss: instead of picking the samples out one by one, use softmax to automatically assign larger weights to harder samples;

  • margin sample mining: $L_{eml}=(\overbrace{\max_{A, A^{\prime}}\left(\left\|f_{A}-f_{A^{\prime}}\right\|_{2}\right)}^{\text {hardest positive pair}}-\overbrace{\min_{C, B}\left(\left\|f_{C}-f_{B}\right\|_{2}\right)}^{\text {hardest negative pair}}+\alpha)_{+}$;
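Both mining schemes operate on the pairwise distance matrix of a batch. The sketch below takes a precomputed matrix and identity labels; `batch_hard_triplet` follows the triplet hard formula as written in these notes (no clipping at zero), and `margin_sample_mining` follows the last formula with the $(\cdot)_+$ clipping. The matrix values and function names are made up for illustration.

```python
def batch_hard_triplet(dist, labels, alpha):
    n = len(labels)
    total = 0.0
    for a in range(n):
        # Hardest positive: farthest sample with the same identity as the anchor.
        hardest_pos = max(dist[a][j] for j in range(n) if labels[j] == labels[a])
        # Hardest negative: closest sample with a different identity.
        hardest_neg = min(dist[a][j] for j in range(n) if labels[j] != labels[a])
        total += hardest_pos - hardest_neg + alpha
    return total / n


def margin_sample_mining(dist, labels, alpha):
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    hardest_pos = max(dist[i][j] for i, j in pairs if labels[i] == labels[j])
    hardest_neg = min(dist[i][j] for i, j in pairs if labels[i] != labels[j])
    return max(0.0, hardest_pos - hardest_neg + alpha)


labels = [0, 0, 1, 1]
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
trihard = batch_hard_triplet(dist, labels, alpha=3.0)
msml = margin_sample_mining(dist, labels, alpha=3.0)
```

Batch-hard mining picks one positive and one negative per anchor, whereas margin sample mining keeps only the single hardest positive pair and the single hardest negative pair of the whole batch.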

Mutual Learning:

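  • knowledge distill: knowledge distillation, where a student network learns from the teacher network's outputs;

  • mutual learning: several student networks learn from one another, using KL divergence to measure how close the networks' output probabilities are;

  • metric mutual learning: $L_{M}=\frac{1}{N^{2}} \sum_{i}^{N} \sum_{j}^{N}\left(\left[ZG\left(M_{ij}^{\theta_{1}}\right)-M_{ij}^{\theta_{2}}\right]^{2}+\left[M_{ij}^{\theta_{1}}-ZG\left(M_{ij}^{\theta_{2}}\right)\right]^{2}\right)$, where ZG stands for zero gradient: no gradient is computed through that term and nothing is back-propagated; the networks learn each other's distance matrix;

  • re-ranking: re-rank the initial ranking list to make it smooth, on Supervised Smoothed Manifold / by K-reciprocal Encoding;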

Person Re-Identification:

  • difficulties: inaccurate detection, misalignment, illumination difference, occlusion, non-rigid body deformation, similar appearance…

  • evaluation criteria: CMC (Cumulative Match Characteristic) <rank-1, rank-5, rank-10>, mAP (based on rank);

  • datasets: Market-1501, CUHK03, DukeMTMC-reID, MARS;

Feature Alignment:

motivations:

  • Person is highly structured;
  • Local similarity plays a key role to decide the identity;

methods:

  • Local Features from local regions
    • Traditional Methods (colors, texture…);
    • Deep Learning Methods;
  • Local Feature Alignment
    • Fusion by LSTM (RNN cannot fuse local features properly);
    • Alignment in PL-Net (Part Loss Network, unsupervised);
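    • Alignment in AlignedReID (from Face++; surpasses human performance; a global feature + 7 local features representing 7 parts of the body; horizontal pooling; only the corresponding edges are used; dynamic programming);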

ReID with Extra Information:

ReID with Pose Estimation:

  • Providing explicit guidance for alignment;
  • Global-Local Alignment Descriptor (GLAD);
    • Vertical alignment by pose estimation;
  • SpindleNet;
    • Fusing local features from regions proposed by pose estimation;

ReID with Human Attributes:

  • Attributes are critical in discriminating different persons;

12. Shape from X (3D reconstruction: conventional and DL)

  • Structure from Motion (SfM): the most easy-to-understand approach, triangulation gets depth;

  • triangulation: the epipolar constraint, monocular;

  • stereo, rectification, disparity (related to depth): correspondence; cannot measure at long range;

  • 3D point cloud: paper - Building Rome in a Day; SfM reconstruction from multi-view images, 3D geometry;

  • surface reconstruction: integration of oriented point;

  • SfM scanning: SLAM based, positioning; a Huawei phone launch event demonstrated reconstruction of a stationary red panda toy;

  • depth sensing: active sensors, structured light, ToF (Time of Flight);

  • short baseline stereo: phase detection autofocus;

  • shape from shading: shading as cue of 3D shape (the Lambertian law);

  • photometric stereo;

  • shape from texture, depth from focus, depth from defocus, shape from shadows, shape from specularities, object priors; paper - A point set generation network for 3D object reconstruction from single image;
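The link between disparity and depth in a rectified stereo pair is $Z = f \cdot B / d$, with focal length $f$ in pixels, baseline $B$ and disparity $d$ in pixels. The sketch below uses made-up camera numbers (700 px focal length, 10 cm baseline) to show why stereo struggles at long range.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point in a rectified stereo pair: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px


near = depth_from_disparity(700.0, 0.1, 70.0)     # large disparity, close point
far = depth_from_disparity(700.0, 0.1, 1.0)       # one pixel of disparity
far_off = depth_from_disparity(700.0, 0.1, 0.5)   # same point with half a pixel of error
```

At 70 px of disparity a half-pixel matching error barely changes the depth, but at 1 px it moves the estimate from 70 m to 140 m. A wider baseline helps, at the cost of harder correspondence.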

3D reconstruction from single image:

  • the ShapeNet dataset;

  • depth map;

  • volumetric occupancy;

  • XML file;

  • point-based representation;

A neural method to stereo matching:

  • Flownet & Dispnet (using raw left and right images as input, output disparity map);

  • stereo matching cost convolutional neural network, from Yann LeCun;

  • MRF (Markov random field) stereo methods;

  • global local stereo neural network;

  • PatchMatch Communication Layer;

13. Visual Object Tracking

Motion estimation/ Optical flow:

  • motion field: the projection of the 3D motion onto a 2D image;

  • optical flow: the pattern of apparent motion in images, $I(x, y, t)=I(x+dx, y+dy, t+dt)$, i.e. the motion of pixels between adjacent frames;

  • Motion field and optical flow are not exactly the same;

  • KLT feature tracker (find points, compute optical flow, update points): fairly mature, available in OpenCV;

  • optical flow with CNN: FlowNet / FlowNet 2.0, lack of training data (Flying Chairs / ChairsSDHom, Flying Things 3D);

  • Optical flow tends to fail in long-range tracking and in complex scenes, so it is not recommended there;
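The brightness constancy equation in the optical flow bullet leads to the usual linear constraint after a first-order Taylor expansion. The derivation below is the standard textbook one; the symbols $I_x, I_y, I_t$ (partial derivatives of the image) and $u, v$ (flow components) are notation introduced here.

```latex
\begin{aligned}
I(x+dx,\, y+dy,\, t+dt) &\approx I(x, y, t) + I_x\,dx + I_y\,dy + I_t\,dt \\
I(x+dx,\, y+dy,\, t+dt) &= I(x, y, t)
  \;\Longrightarrow\; I_x\,dx + I_y\,dy + I_t\,dt = 0 \\
\text{dividing by } dt:\quad & I_x\,u + I_y\,v + I_t = 0,
  \qquad u = \frac{dx}{dt},\; v = \frac{dy}{dt}
\end{aligned}
```

This is one equation in two unknowns per pixel, so extra assumptions are needed. Lucas-Kanade style trackers such as KLT assume the flow is constant within a small window and solve the resulting system in the least-squares sense. The expansion only holds for small displacements, which fits the remark that optical flow breaks down over long ranges.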

Single object tracking:

  • model free: nothing but a single training example is provided by the bounding box in the first frame;

  • short term and subject to causality;

  • paper list;

  • correlation filter: template matching, similar to convolution;

  • MOSSE (Minimum Output Sum of Squared Error) Filter;

  • KCF (Kernelized Correlation Filter);

  • from KCF to Discriminative CF Trackers: Martin Danelljan, 从Deep SRDCF开始利用CNN feature;

  • Continuous-Convolution Operator Tracker (C-COT): very slow (~1fps) and prone to overfitting;

  • Efficient Convolution Operators: based on C-COT (factorized convolution operator + Gaussian mixture model), ~15fps on GPU;

  • Multi-Domain Convolutional Neural Network Tracker: online tracking, bounding box regression, ~1fps;

  • GOTURN: ~100fps;

  • SiameseFC: ~60fps, a deep FCN is trained to address a more general similarity learning problem in an initial offline phase;

  • Benchmark: VOT (accuracy, robustness, EAO: expected average overlap), OTB (one pass evaluation, spatial robustness evaluation);

Multiple object tracking:

  • tracking by detection: association based on location (IoU, L1/L2 distance), motion (modeling the movement of objects, Kalman filter), appearance (feature) and so on;

  • association:

  • association as optimization: local method (Hungarian algorithm), global methods (clustering, network flow, minimum cost multi-cut problem), do optimization in a window (trade off speed against acc);

  • Benchmark: MOT, KITTI, ImageNet VID;

  • evaluation metrics: Multiple object tracking accuracy (MOTA);
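Association as optimization can be sketched as a minimum-cost assignment between existing tracks (rows) and new detections (columns). The code below finds the optimum by brute force over permutations, which is only feasible for a handful of objects; the Hungarian algorithm named above solves the same problem in polynomial time. The square cost matrix (for instance 1 - IoU) and its values are assumptions for illustration.

```python
from itertools import permutations


def best_assignment(cost):
    """Exhaustively find the track-to-detection assignment with minimum total cost."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost


def greedy_assignment(cost):
    """Repeatedly take the cheapest remaining (track, detection) pair."""
    n = len(cost)
    pairs = sorted((cost[i][j], i, j) for i in range(n) for j in range(n))
    used_rows, used_cols, result, total = set(), set(), [None] * n, 0.0
    for c, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            result[i] = j
            total += c
    return result, total


cost = [[0.10, 0.20],
        [0.15, 0.90]]
optimal, optimal_cost = best_assignment(cost)
greedy, greedy_cost = greedy_assignment(cost)
```

Greedy matching grabs the cheapest pair first and is then forced into an expensive leftover pair, while the optimal assignment accepts a slightly worse first match for a much lower total.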

Other:

  • fast moving object (FMO): an object that moves over a distance exceeding its size within the
    exposure time;

  • multiple camera tracking;

  • tracking with multiple cues: with multiple detectors, with key points, with semantic segmentation, with RGBD camera;

  • multiple object tracking with NN:

    • Milan, Anton, et al. “Online Multi-Target Tracking Using Recurrent Neural Networks.” AAAI. 2017.
    • Son, Jeany, et al. “Multi-Object Tracking with Quadruplet Convolutional Neural Networks.” CVPR. 2017.

14. Neural Network in Computer Graphics

Computer vision turns image information into abstract information such as semantics, whereas computer graphics turns abstract semantic information into images.

  • Graphics: rendering, 3D modeling, visual media retouching;

  • Neural Network for graphics: faster, better, more robust;

  • NN rendering:

    • Monte Carlo ray tracing (tracing rays to find the light sources), paper - [SIGGRAPH17] Kernel-Predicting Convolutional Networks for Denoising Monte Carlo Renderings (uses a CNN to predict de-noising kernels and thus enhances the ray tracing rendering result);

    • volume rendering;

    • NN shading (real-time rendering), paper-Deep shading: Convolutional Neural Networks for Screen-space shading (2016);

    • goal is to accelerate, all training data can be gathered virtually;

  • NN 3D modeling:

    • shape understanding:
      • 3D ShapeNets: A Deep Representation for Volumetric Shapes (2015).
      • VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition (2015).
      • DeepPano: Deep Panoramic Representation for 3-D Shape Recognition (2015).
      • FusionNet: 3D Object Classification Using Multiple Data Representations (2016).
      • OctNet: Learning Deep 3D Representations at High Resolutions (2017).
      • O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis (2017).
      • Orientation-boosted voxel nets for 3D object recognition (2017).
      • PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation (2017).
    • shape synthesis: 3D-conv, also use GAN
    • from 2D to 3D, data becomes harder to handle, design of mesh representation, high-resolution 3D problem;
  • NN visual retouching:

    • tone mapping: paper-Deep Bilateral Learning for Real-Time Image Enhancement (2017), it can handle high-resolution images relatively fast;
    • automatic enhancement: paper- Exposure: A white-box Photo Post-processing Framework (2017);

Example-NN 3D Face:

  • given a face RGB/RGBD still/sequence, reconstruct for each frame (intrinsic image or inverse rendering):

    • Inner/outer camera matrix;
    • Face 3D pose;
    • Face shape;
    • Face expression;
    • Face albedo;
    • lighting;
  • 3D face priors: shape & albedo, paper-A 3D Morphable Model learnt from 10,000 faces (2016);

  • 3D face priors: expression: paper- FaceWarehouse: a 3D Facial Expression Database for Visual Computing (2012);

  • optimization: based 3D face fitting;

  • Coarse Net, Fine Net;

  • 3D Face-without prior: paper-DenseReg: Fully Convolutional Dense Shape Regression In-the-Wild(2017);

  • render for CV:

    • Synthesizing Training Data for Object Detection in Indoor Scenes, (2017);
    • Playing for Data: Ground Truth from Computer Games (2016)
    • Learning from Synthetic Humans (2017);
  • demo: Face2Face, Real-Time high-fidelity facial performance capture, DenseReg;