About
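- Video;
- Slides: follow the WeChat official account "旷视研究院" (Megvii Research Institute) and reply "深度学习实践PPT" to get them;
- Although this course is from 2017, its coverage is still comprehensive and much of the work still plays a significant role today, so it is not outdated and remains worth watching. It is best suited to viewers with some background; beginners should not start with this series. In my opinion, Lecture 5 (neural network compression), Lecture 6 (DL-based object detection), Lecture 10 (GAN), and Lecture 11 (Person Re-Identification) are the best taught.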
1. Introduction
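- Within AI, computer vision belongs to perception intelligence (which also includes speech); the other branch is cognition, which includes NLP and AGI (artificial general intelligence);
- Humans use their eyes and brain to understand the world; computers use image sensors and compute to visually perceive their surroundings;
- The emergence of the cerebral cortex: flexible, structured, computational processing;
- Relationship between CV and AI: CV is one of AI's most important tasks / a source of varied research work and results / a key application;
- Current CV tasks: classification (image) / detection (region) / segmentation (pixel; instance, semantic, panoptic) / sequence (video, spatial + temporal);
- David Marr's book "Vision", which is also very important in visual SLAM; representation of visual knowledge; part representation (split into parts and model each part, e.g. keypoint detection);
- Part representation has limits because some things cannot be decomposed, which led to the second revival of neural networks: Yann LeCun's convolutional neural networks applied to handwritten character recognition and face detection. Because results were hard to reproduce at the time, few people understood them, datasets were small, and models such as SVM were popular, neural networks then declined;
- learning-based representation / feature-based representation: feature engineering + classifier (handcrafted feature engineering + SVM/Random Forest); a shallow learning pipeline is a short sequence;
- End-to-end learning: all parameters are optimized jointly; a long or very long sequence realizes a high-dimensional nonlinear mapping;
- The multilayer perceptron (MLP), inspired by the perceptron, is trained with backpropagation (BP) gradients to approximate any nonlinear function (reaching a local optimum);
- Neural network results of the 1990s: CNN / autoencoder / Boltzmann machine / belief nets / RNN;
- Revival: data + computing + industry competition + a few breakthroughs;
The idea of ResNet: learn from shallow to deep and keep gradient magnitudes large to prevent vanishing gradients;
The focus shifted from hand-designing features to designing network architectures (2012-2017); different architectures need different amounts of compute, and lightweight networks are now a hot topic;
Convolution kernel choices: 1x1, 3x3, depthwise 3x3, etc.; how network layers are connected;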
2. Math in DL and ML Basics
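Where deep learning sits: deep learning ⊂ representation learning ⊂ machine learning ⊂ AI;
Linear Algebra:
Vectors, matrices, sets, groups, closure; matrix multiplication expresses a transformation that maps one vector to another;
Square matrices, orthogonal matrices, eigenvalues, eigenvectors, real symmetric matrices, quadratic forms, positive definite matrices, positive semidefinite matrices, singular value decomposition;
Probability:
Random events, random variables, probability density functions, joint distribution, marginal distribution, conditional distribution, independent variables;
Bayes' rule, prior distribution, posterior distribution, expectation, variance, covariance matrix (positive semidefinite);
Common distributions: Bernoulli distribution, binomial distribution, categorical/multinomial distribution (image classification), normal (Gaussian) distribution;
Information entropy (the cross-entropy loss in classification; the more probable an event, the less information it carries), cross-entropy and KL divergence, the Wasserstein distance in generative models;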
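To make the relation between entropy, cross-entropy and KL divergence concrete, here is a minimal sketch in plain Python. The function names and the example distributions are illustrative assumptions, not taken from the lecture; it only relies on the standard identity H(p, q) = H(p) + KL(p || q).

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats; zero-probability terms contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q); q must be positive wherever p is positive."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q), which equals H(p, q) - H(p)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0, 0.0]         # one-hot label (hypothetical example)
q = [0.7, 0.2, 0.1]         # predicted class probabilities (hypothetical)
loss = cross_entropy(p, q)  # for a one-hot label this is -log(q[true class])
```

For a one-hot label H(p) is 0, so minimizing the cross-entropy loss is the same as minimizing KL(p || q).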
Optimize:
- minimization: gradient descent (the choice of step size is critical), stochastic gradient descent;
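A small sketch of why the step size matters, assuming a toy objective f(x) = (x - 3)^2 chosen only for illustration (it is not from the lecture). With a small step the iterate converges to the minimum; with a step that is too large it diverges.

```python
def gradient_descent(grad, x0, lr, steps):
    """Plain gradient descent: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def grad(x):
    # Gradient of the toy objective f(x) = (x - 3)^2
    return 2.0 * (x - 3.0)

x_good = gradient_descent(grad, x0=0.0, lr=0.1, steps=100)  # error shrinks by 0.8 per step
x_bad = gradient_descent(grad, x0=0.0, lr=1.1, steps=100)   # error grows by 1.2 per step
```

Stochastic gradient descent uses the same update but replaces the exact gradient with an estimate from a mini-batch.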
Machine learning basics: definition, hypothesis, model, evaluation, supervised & unsupervised learning (supervised: discriminative models and generative models <discriminative models are the current default>; unsupervised: autoencoder, GAN), "no free lunch theorem" (all learning algorithms are equal, but some algorithms are more equal than others), overfitting & underfitting, model capacity vs. generalization error, regularization (regularization terms, data augmentation, parameter reduction and tying);
3. Neural Networks Basics & Architecture Design
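Fundamental tasks in CV: classification, object detection, semantic segmentation, instance segmentation, keypoint detection, human pose estimation, VQA…
Why image recognition is hard for computers: the complexity and diversity of image content, e.g. pose, illumination, blur;
Features are the beacon by which a computer understands an image, and nonlinear feature extractors should be used;
Linear combinations of features (kernel learning, boosting): the drawback is that many templates are needed and features are reused poorly;
Hierarchical composition of features reuses features and is more efficient -> concepts reuse in DL; features in a network also go from low level to high level, but such highly nonlinear functions are hard to optimize (current practice settles for convergence to a local optimum);
key ideas of DL: nonlinear system, learn it from data, feature hierarchies, end-to-end learning;
Activation functions, neurons, fully connected networks; training determines the network parameters (forward, backward, update);
For images, the design moves from locally-connected nets to convolutional nets, with parameter sharing;
The convolution operation of convolutional layers, pooling layers, etc.;
Network architecture design: topology, layer functions, hyperparameters, optimization algorithm and other empirical choices; manual / AutoML;
Brief introductions to AlexNet (includes LRN, speeds up convergence), VGG (shows how effective small 3x3 kernels are, though that is not necessarily the most efficient choice), GoogLeNet, ResNet (fits the residual rather than the original function directly), Xception, ResNeXt (improves on ResNet, borrowing from Xception), ShuffleNet, DenseNet, SqueezeNet;
structure design: deeper and wider, ease of optimization, multi-path design, residual path, sparse connection;
Brief introductions to some layer designs: SPP, batch normalization, parametric rectifiers, bilinear CNNs (for fine-grained classification);
Task-specific architecture design: DeepFace (face recognition), Global Convolutional Networks (semantic segmentation), Hourglass Networks (hourglass structure, large receptive field, used for pose estimation or keypoints);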
4. Introduction to Computation Technologies in Deep Learning
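This lecture is fairly low-level and I did not fully follow it; treat these notes as an overview.
symbolic computation:
Deep learning framework overview: program, compilation, runtime management, kernels, hardware;
computing graph, graph structure: variable, operator, edge;
Static and dynamic graphs;
Execution and optimization;
dense numerical computation:
CPU computation (machine code, pipelining, superpipelining, superscalar, out-of-order execution / cache hierarchy / …);
other computation devices (NVIDIA GPU <single-instruction, multiple-thread architecture>, Google TPU, Huawei NPU in Kirin 970, Mobile CPU+GPU+DSP);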
computation & memory gap;
distributed computation:
system (communication,Remote Direct Memory Access);
optimization algorithm (synchronous SGD,asynchronous SGD);
communication algorithm (MPI Primitives,An AllReduce Algorithm);
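5. Neural Network Approximation (low rank, sparsity, and quantization)
This lecture focuses on neural network compression, for faster training, faster inference, smaller capacity;
convolution as matrix product: approximating the weight matrix achieves network compression;
Low Rank (in essence, a series of decompositions and approximations of the matrix that reduce computation and storage):
Apply singular value decomposition (SVD) to the weight matrix;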
SVD+Kronecker Product ----> KSPD;
Matrix decomposition: C-HW-K ===> C-HW-R-(1x1)-K, then re-decompose via reshape; horizontal-vertical decomposition currently works best;
shared group convolution is a Kronecker layer;
CP-decomposition and depthwise convolution;
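The saving from a low-rank factorization can be sketched with simple arithmetic. This assumes an m x n weight matrix W replaced by a product of an m x r and an r x n factor (as a truncated SVD would give); the sizes and helper names below are illustrative, not from the lecture. The rank-1 example shows that the factored product gives the same result as the full matrix when W really has that rank.

```python
def dense_cost(m, n):
    """Parameters (and multiply-adds per input vector) of a dense m x n layer."""
    return m * n

def low_rank_cost(m, n, r):
    """Cost after factoring W (m x n) into (m x r) times (r x n)."""
    return r * (m + n)

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Rank-1 matrix W = u v^T, so W x can be computed as u * (v . x)
u = [1.0, 2.0, 3.0]
v = [0.5, -1.0]
W = [[ui * vj for vj in v] for ui in u]
x = [4.0, 1.0]

full = matvec(W, x)
dot = sum(vj * xj for vj, xj in zip(v, x))
factored = [ui * dot for ui in u]

ratio = dense_cost(1024, 1024) / low_rank_cost(1024, 1024, 64)
```

With these assumed sizes (1024 x 1024, rank 64) the factored layer needs one eighth of the parameters; in practice the rank is chosen to trade accuracy against this saving.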
Sparse Approximation:
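The weight distribution looks roughly Gaussian, with many values near 0; fine-tune the network; weight pruning: Song Han's Deep Compression gradually increases the number of zero weights (a mask matrix forces weights to 0) and keeps near-zero weights from jittering during training; compression is most effective on FC layers;
Accelerating network computation: sparse matrix computation, channel pruning, sparse communication for distributed gradient descent;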
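The mask-based pruning described above can be sketched as follows. This is a simplified illustration under my own assumptions (a flat weight list, a fixed sparsity target, hypothetical function names), not the exact Deep Compression procedure: weights with the smallest magnitude are zeroed, and the mask is re-applied after every update so pruned weights stay at zero during fine-tuning.

```python
def magnitude_prune(weights, sparsity):
    """Zero the fraction `sparsity` of weights with the smallest |w|.

    Returns (pruned_weights, mask) where mask[i] is 0 for pruned entries.
    """
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:k]:
        mask[i] = 0
    return [w * m for w, m in zip(weights, mask)], mask

def masked_sgd_step(weights, grads, mask, lr):
    """SGD update that keeps pruned weights at exactly zero."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]

weights = [0.8, -0.05, 0.02, -1.2, 0.1, 0.01]   # hypothetical layer weights
pruned, mask = magnitude_prune(weights, sparsity=0.5)
updated = masked_sgd_step(pruned, [0.3] * 6, mask, lr=0.1)
```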
Quantization:
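Which precision to compute in;
Quantization of parameters, of activations, and of gradients;
Binarization, binary networks;
Large-capacity models lose little accuracy when trained with few bits; for small-capacity models it depends;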
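A minimal sketch of weight binarization, assuming the scaled-sign scheme used for binary weights in XNOR-Net (W is approximated by alpha * sign(W), with alpha the mean absolute weight). The function name and the sample weights are illustrative assumptions.

```python
def binarize(weights):
    """Approximate weights by alpha * sign(w), alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs

w = [0.4, -0.2, 0.1, -0.5]            # hypothetical filter weights
alpha, signs = binarize(w)
approx = [alpha * s for s in signs]   # what the binary layer effectively uses
```

Only the signs (1 bit each) and a single scale per filter need to be stored, which is where the memory saving comes from.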
Papers recommended by the lecturer, Shuchang Zhou; XNOR-Net is required reading for the course:
Bit Neural Network
● Matthieu Courbariaux et al. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. http://arxiv.org/abs/1511.00363.
● Itay Hubara et al. Binarized Neural Networks https://arxiv.org/abs/1602.02505v3.
● Matthieu Courbariaux et al. Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1. http://arxiv.org/pdf/1602.02830v3.pdf.
● Rastegari et al. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks http://arxiv.org/pdf/1603.05279v1.pdf.
● Zhou et al. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients https://arxiv.org/abs/1606.06160.
● Hubara et al. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations https://arxiv.org/abs/1609.07061.
6. Modern Object Detection
anchor-free and anchor-based approaches alternate in cycles;
Representation:
Bounding-box: face detection, human detection, vehicle detection, text detection, general object detection;
Point: semantic segmentation (covered in a later lecture);
Keypoint: face landmark, human keypoint;
Evaluation Criteria:
- precision (fraction of predicted positives that are truly positive), recall (fraction of true positives that are predicted positive), Average Precision (AP), mean AP (mAP), IoU, mmAP (COCO);
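The basic quantities behind these criteria can be sketched in a few lines. The box format `(x1, y1, x2, y2)` and the function names are assumptions made for this example; AP and mAP are built on top of precision and recall computed at varying score thresholds and are not shown here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))       # 1 / 7
p, r = precision_recall(tp=8, fp=2, fn=4)       # 0.8, 8 / 12
```

A detection usually counts as a true positive when its IoU with a ground-truth box exceeds a threshold such as 0.5.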
Perform a detection:
Previously: handcrafted features + image pyramid + sliding window + classifier (robust real-time detection; IJCV 2001);
Share computation via a Fully Convolutional Network;
Deep Learning for Object Detection:
proposal and refine;
one stage:
example: Densebox, YOLO, SSD, RetinaNet…
keyword: anchor, divide and conquer, loss sampling;
two stage:
- example: Faster R-CNN, RFCN, FPN, Mask R-CNN;
keyword: speed, performance;
One Stage:
Densebox:
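Pipeline: image –> image pyramid –> CNN –> upsampling –> CNN –> (4+1) channels –> prediction + threshold + NMS;
Input: , output: ;
Each pixel of the output feature map corresponds to a scored bounding box:
where t and b denote the top-left and bottom-right coordinates respectively;
- Problems: the L2 regression loss is a poor choice (objects at different scales are learned to different degrees); GT assignment is also problematic, since in crowded scenes several objects may shrink onto a single point of the final feature map; there are many FPs; the choice of regression variables leads to large errors;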
UnitBox:
- Replace the L2 loss with the IoU loss = ;
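The formula itself did not survive in these notes. As a hedged reminder rather than a transcription of the slide: the UnitBox paper defines the IoU loss as the negative log of the IoU between the predicted box and the ground-truth box. The symbols below ($\hat{b}$ for the prediction, $b$ for the ground truth) are notation assumed for this sketch.

$ L_{IoU}=-\ln \operatorname{IoU}(\hat{b}, b), \quad \operatorname{IoU}(\hat{b}, b)=\frac{|\hat{b} \cap b|}{|\hat{b} \cup b|} $

Because the IoU is a ratio of areas, the loss treats the four box coordinates jointly and does not depend on object scale, which addresses the scale problem noted for the L2 loss in Densebox.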
YOLO:
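- Divides the image into a grid; the added FC layers capture more global context, but the input size is fixed; it runs fast but does not work well in crowded scenes;
SSD:
Introduces anchors with different scales and aspect ratios;
Regresses the offset between the GT and the anchor;
Different layers detect objects of different sizes: small objects from shallow layers, large objects from deep layers (though there is no direct evidence that this is reliable);
loss sampling and OHEM;
blog;
DSSD:
SSD detects small objects from shallow layers, but shallow layers carry little semantic information;
Upsampling and fusion strengthen the semantic information;
RON:
reverse connect (similar to FPN);
loss sampling: objectness prior (binary classification first, then fine-grained classification);
RetinaNet:
Introduces focal loss;
FPN structure;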
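A minimal sketch of the focal loss for a single binary prediction, assuming the form from the RetinaNet paper, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The default values gamma = 2 and alpha = 0.25 follow that paper; the function name and the test probabilities are illustrative.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction.

    p: predicted probability of the positive class, 0 < p < 1
    y: ground-truth label, 1 (positive) or 0 (negative)
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)   # well-classified positive: strongly down-weighted
hard = focal_loss(0.1, 1)   # misclassified positive: keeps most of its loss
```

With gamma = 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, so gamma controls how strongly easy examples are suppressed.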
One Stage Detector: Summary
Anchor:
No anchor: YOLO, densebox/unitbox/east;
Anchor: YOLOv2, SSD, DSSD, RON, RetinaNet;
Divide and conquer:
- SSD, DSSD, RON, RetinaNet;
loss sample:
- all sample: densebox;
- OHEM: SSD;
- focal loss: RetinaNet;
Two Stage:
RCNN:
- selective search + classify each proposal;
Fast RCNN:
- map selective search proposals onto the feature map and classify them via RoI pooling;
Faster RCNN:
- use preset anchors to find proposals;
RFCN,Deformable Convolutional Networks,FPN,Mask RCNN…
Two Stages Detector-Summary:
- Speed:
- RCNN -> Fast RCNN -> Faster RCNN -> RFCN;
- Performance:
- Divide and conquer:
- FPN;
- Deformable Pool/ROIAlign;
- Deformable Conv;
- Multi-task learning;
Open Problem in Detection:
- FP;
- NMS (detection in crowd);
- GT assignment issue;
- Detection in video:
- detect & track in a network;
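To see why NMS is listed as a problem in crowded scenes, here is a minimal sketch of greedy NMS. The box format `(x1, y1, x2, y2)`, the threshold, and the example boxes are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # the second box overlaps the first (IoU ~ 0.68)
```

The second box is removed purely because of its overlap with a higher-scoring box. If it were a different person standing close to the first, that true detection would be lost, which is the crowd failure mode.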
Human Keypoint Task:
Single Person Skeleton:
CPM;
Hourglass;
Multiple-Person Skeleton:
top down:
detect->single person skeleton;
Depends on the detector:
Fail in the crowd case;
Fail with partial observation;
can detect the small-scale human;
More computation;
Better localization when the input-size of single person skeleton is large;
bottom up:
- Deep/Deeper cut, OpenPose, Associative Embedding;
- Fast computational speed;
- good at localizing the human with partial observation;
- Hard to assemble human;
7. Scene Text Detection and Recognition
Background:
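Importance of text: a mark of civilization, carries high-level semantic information, serves as a cue for visual recognition;
problem: scene text detection + scene text recognition;
challenge: more complex than OCR, e.g. background, color, font, orientation, mixed scripts, etc.;
application: card recognition, image-based localization, product search, autonomous driving, industrial automation, etc.;
conventional methods:
detection before deep learning: MSER (maximally stable extremal regions), SWT (stroke width transform), Multi-Oriented;
recognition: Top-down and bottom-up cues (sliding window + statistical properties), Tree-structured Model (DPM+CRF), Label embedding (a different approach);
Unified detection and recognition: Lexicon Driven;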
Deep learning methods:
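With conventional auxiliary methods:
end-to-end-recognition: PhotoOCR, Deep Features, Reading Text;
detection: MSER Trees;
Without conventional auxiliary methods:
- detection: Holistic (treated as semantic segmentation), EAST (Megvii, CVPR 2017, multi-task learning), Deep Direct Regression (similar to EAST), SegLink (multi-scale feature maps), Synthetic Data (renders text onto images);
recognition: (recursive recurrent neural network + soft attention), Visual Attention;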
end-to-end recognition:Deep TextSpotter;
summary: ideas from object detection and segmentation,end-to-end,use synthetic data;
datasets and competitions:
- dataset: ICDAR 2013, MSRA-TD500, ICDAR 2015, IIIT 5K-Word, COCO-Text, MLT, Total-Text;
conclusion:
challenges:
- Diversity of text: language, font, scale, orientation, arrangement, etc;
- Complexity of background: virtually indistinguishable elements (signs, fences, bricks and grasses, etc.);
- Interferences: noise, blur, distortion, low resolution, nonuniform illumination, partial occlusion, etc;
Trends:
- Stronger models (accuracy, efficiency, interpretability);
- Data synthesis;
- Multi-oriented text;
- Curved text;
- Multi-language text;
References:
Survey:
- Ye et al. Text Detection and Recognition in Imagery: A Survey. TPAMI, 2015.
- Zhu et al. Scene Text Detection and Recognition: Recent Advances and Future Trends. FCS, 2015.
Conventional Methods:
- Epshtein et al. Detecting Text in Natural Scenes with Stroke Width Transform. CVPR, 2010.
- Neumann et al. A method for text localization and recognition in real-world images. ACCV, 2010.
- Yao et al. Detecting Texts of Arbitrary Orientations in Natural Images. CVPR, 2012.
- Wang et al. End-to-End Scene Text Recognition. ICCV, 2011.
- Mishra et al. Scene Text Recognition using Higher Order Language Priors. BMVC, 2012.
- Busta et al. FASText: Efficient Unconstrained Scene Text Detector. ICCV, 2015.
Deep Learning Methods:
- Bissacco et al. PhotoOCR: Reading Text in Uncontrolled Conditions. ICCV, 2013.
- Jaderberg et al. Deep Features for Text Spotting. ECCV, 2014.
- Gupta et al. Synthetic Data for Text Localisation in Natural Images. CVPR, 2016.
- Zhou et al. EAST: An Efficient and Accurate Scene Text Detector. CVPR, 2017.
- Busta et al. Deep TextSpotter: An End-To-End Trainable Scene Text Localization and Recognition Framework. ICCV, 2017.
- Ghosh et al. Visual attention models for scene text recognition. 2017. arXiv:1706.01487.
- Cheng et al. Focusing Attention: Towards Accurate Text Recognition in Natural Images. ICCV, 2017.
Useful Resources:
- Laboratories and Papers: https://github.com/chongyangtao/Awesome-Scene-Text-Recognition
- Datasets and Codes: https://github.com/seungwooYoo/Curated-scene-text-recognition-analysis
- Projects and Products: https://github.com/wanghaisheng/awesome-ocr
8. Image Segmentation
semantic segmentation, instance segmentation, scene parsing, human parsing, stuff segmentation, ultrasound segmentation, selfie segmentation…
Evaluation metrics:
Semantic Segmentation:
FCN: the first semantic segmentation work;
Learning Deconvolution Network for Semantic Segmentation: introduces unpooling and deconvolution;
DeepLab: introduces dilated convolution and DenseCRF;
CRF AS RNN;
Deeplab Attention;
PSPNet;
GCN (Global Convolutional Network, the lecturer's own work, aims to cover objects of any scale);
Deeplab V3;
Deformable Convolution;
Instance Segmentation:
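Top-down pipeline (currently mainstream, relies on a detection framework):
Detection first, then segmentation
FCIS (boxes are imprecise, yet the segmentation is still accurate);
Mask RCNN;
Bottom-up pipeline (weaker results, hard to implement, much room for new ideas):
Segmentation without producing boxes;
Semantic instance segmentation via metric learning;
Introduction to Megvii's framework:
- batch size in training:
The batch size for detection is usually much smaller than for classification, mainly because training image sizes differ; in addition, a single image may have many proposals…
A small batch size leads to: unstable gradient, inaccurate BN statistics, extremely imbalanced data, very long training period…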
Multi-device BatchNorm;
Sublinear Memory;
Large Learning Rate;
Some tricks for the COCO instance segmentation competition: precise RoI pooling, context extractor, mask generator;
Tricks for the keypoint competition;
9. Recurrent Neural Network
RNN Basics:
Turing Machine, RNN is Turing Complete, Sequence Modeling;
RNN Diagram, ;
Classified by input/output: many-to-many, many-to-one, one-to-many, many-to-one+one-to-many;
many-to-many example: language model (predict the next word given previous words, tell stories, write books in LaTeX…);
many-to-one example: Sentiment analysis…
many-to-one+one-to-many example: Neural Machine Translation (encoder+decoder)…
Training RNNs, exploding and vanishing gradients: singular value > 1 => explodes, singular value < 1 => vanishes… LSTM (Long Short-Term Memory) comes to the rescue;
why LSTM works (input gate, forget gate, output gate, temp variable, memory cell);
GRU (similar to LSTM, let information flow without a separate memory cell);
Search for better RNN architecture;
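The exploding/vanishing gradient rule noted above (singular value > 1 explodes, < 1 vanishes) can be illustrated with a deliberately simplified model. This sketch assumes a scalar linear recurrence h_t = w * h_(t-1) with no nonlinearity, so the gradient of h_T with respect to h_0 is just w^T; the numbers are illustrative.

```python
def backprop_factor(singular_value, timesteps):
    """Gradient scale after backpropagating through `timesteps` steps
    of the scalar linear recurrence h_t = w * h_(t-1)."""
    return singular_value ** timesteps

explode = backprop_factor(1.1, 100)   # grows to roughly 1e4
vanish = backprop_factor(0.9, 100)    # shrinks to roughly 1e-5
stable = backprop_factor(1.0, 100)    # stays at 1
```

The LSTM memory cell is updated additively through gates, which gives the gradient a path that behaves more like the stable case.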
Simple RNN Extensions:
Bidirectional RNN (BDRNN), which also uses future context;
2D-RNN: Pixel-RNN, each pixel depends on its top and left neighbor (image completion, segmentation);
Deep RNN (stack more of them, harder to train);
RNN with Attention:
attention: differentiate entities by their importance; spatial attention is related to location; temporal attention is related to causality;
attention over input sequence: Neural Machine Translation (NMT);
Image Attention: Image Captioning (input image–> Convolutional feature extraction–>RNN with attention over the image–>Word by word generation);
RNN with External Memory:
- copy a sequence: Neural Turing Machines (NTM);
More Applications:
RNN without a sequence input: read house numbers from left to right, generate images of digits by learning to sequentially add color to canvas;
generalizing recurrence (a computation unit with shared parameter occurs at multiple places in the computation graph);
apply when there’s tree structure in data;
bottom-up aggregation of information;
speech recognition;
generating sequence;
question answering;
visual question answering;
combinatorial problems;
learning to execute;
compress image;
model architecture search;
meta-learning;
…
RNN's Rivals:
WaveNet: causal dilated convolution, Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).
Attention is All You Need (Transformer) ;
10. Introduction to Generative Models (and GANs)
Basics:
Generative Models: Learning the distributions;
Discriminative: learn the likelihood;
Generative: performs Density Estimation (learns the distribution) to allow sampling;
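With regression modeling the output is an average: it regresses to the mean of the most likely cases, which looks unrealistic; a discriminative model just smooths all possibilities, causing ambiguity and a "blur" effect;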
application of generative models: image generation from sketch, interactive editing, image to image translation;
How to train generative models:
- Given a set of sample points, the model generates outputs that follow the expected distribution;
exact model: NVP (non-volume preserving), real NVP: invertible nonlinear transforms; the theoretical requirements are too strict (Restriction on the source domain: must be of the same as the target.) and the results are poor (faces fare slightly better because their structure is fairly regular);
Variational Auto-Encoder (VAE): the encoder performs density estimation and the decoder performs sampling.
Generative Adversarial Networks (GAN): the generator and the discriminator learn from each other and improve, trained alternately;
DCGAN: example of feature manipulation (e.g. adding glasses to a face, changing gender);
conditional, cross-domain generation (generative adversarial text to image synthesis);
GAN training problems: unstable losses (G and D should stay in dynamic balance during training), mini-batch fluctuation (generated images differ from batch to batch), mode collapse (lack of diversity in generated results);
improve GAN training: label smoothing, Wasserstein GAN (WGAN) (stabilized training curve, non-vanishing gradient), loss sensitive GAN (LS-GAN)… The GAN Zoo;
Some well-known GAN examples:
Jun-Yan Zhu's CycleGAN: correspondence from unpaired data;
DiscoGAN: cross-domain relation;
GeneGAN: shorter pathway improves training (cross breeds and reproductions, generating smiles), object transfiguration (changing hairstyle), interpolation in object subspace (changing hairstyle direction);
Math behind Generative Models:
formulation: sampling vs. density estimation;
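RBM (rarely used nowadays);
from density to sample: given a probability density function, efficient sampling is not possible in general;
from sample to density: given a black-box sampler, can the probability density be estimated (via frequencies)?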
Given samples, some properties of the distribution can be learned, while others cannot.
the future of GANs: guaranteed stabilization (new distance), broader application (apply adversarial loss in xx/ different type of data);
GAN tutorial from Ian Goodfellow: https://arxiv.org/abs/1701.00160;
11. Person Re-Identification
ReID: from face to person;
face recognition (verification, size: , horizontal: -30~30, vertical: -20~20, little occlusion);
person Re-Identification (tracking across cameras, searching person in videos, clustering person in photos, challenges: inaccurate detection, misalignment, illumination difference, occlusion…);
common in FR & ReID: deep metric learning, mutual learning, re-ranking;
special in ReID: feature alignment, ReID with pose estimation, ReID with human attributes;
from classification to metric learning:
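A classification network can only recognize identities it has "seen"; unseen identities require retraining, which is impractical when deploying face recognition. To overcome this, add metric learning: take a pre-trained classification network and fine-tune it with metric learning (similar feature);
Some works fuse intermediate feature maps, but this increases computation and storage and slows things down, so it is impractical;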
Metric Learning:
Learn a function that measures how similar two objects are. Compared to classification, which works in a closed world, metric learning deals with an open world.
contrastive loss: (the last term focuses on hard samples, is Kronecker Delta, is the margin for different identities); it reduces the distance between images with the same identity and increases it otherwise, and is used to skip "naive" negative pairs;
triplet loss: (The distance of A and A' should be smaller than that of A and B. $\alpha$ is the margin between negative and positive pairs. Without $\alpha$, all distances converge to zero.);
improved triplet loss: $ L_{imtrp}=\frac{1}{N} \sum^{N}\left(\left\|f_{A}-f_{A^{\prime}}\right\|_{2}-\left\|f_{A}-f_{B}\right\|_{2}+\alpha\right)_{+}+\frac{1}{N} \sum^{N}\left(\left\|f_{A}-f_{A^{\prime}}\right\|_{2}-\beta\right)_{+} $ ($\beta$ penalizes distance between features of $A$ and $A^{\prime}$), only consider image pairs with the same identity;
quadruplet loss: , combines triplet loss and pairwise loss; the distance between any images with the same identity should be smaller than the distance between images with different identities;
Triplet loss improves clearly over contrastive loss; quadruplet loss improves little over triplet loss while increasing computation and search space, so triplet loss is the common choice;
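A minimal sketch of the triplet loss for one (A, A', B) triplet, written to match the first term of the improved triplet loss above: the hinge of d(A, A') - d(A, B) + alpha. The function names, the 2-D feature vectors, and the margin value are assumptions for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(f_anchor, f_positive, f_negative, margin):
    """max(0, d(A, A') - d(A, B) + margin) for a single triplet."""
    d_pos = l2_distance(f_anchor, f_positive)
    d_neg = l2_distance(f_anchor, f_negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # same identity, distance 1
easy_negative = [3.0, 4.0]   # different identity, distance 5
hard_negative = [1.0, 0.0]   # different identity, distance 1

easy = triplet_loss(anchor, positive, easy_negative, margin=0.3)
hard = triplet_loss(anchor, positive, hard_negative, margin=0.3)
```

The easy triplet already satisfies the margin and contributes zero loss, which is why mining hard samples matters for training.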
Hard Sample Mining:
triplet hard loss: $ L_{\text{trihard}}=\frac{1}{N} \sum_{A \in \text{batch}}\left(\overbrace{\max_{A^{\prime}}\left(\left\|f_{A}-f_{A^{\prime}}\right\|_{2}\right)}^{\text{hard positive pair}}-\overbrace{\min_{B}\left(\left\|f_{A}-f_{B}\right\|_{2}\right)}^{\text{hard negative pair}}+\alpha\right) $, i.e. in the distance matrix, find the least similar images with the same identity (the largest distance in the diagonal block) and the most similar images with different identities (the smallest distance in other places);
soft triplet hard loss: instead of picking them out one by one, use softmax to automatically assign larger weights to harder samples;
margin sample mining: ;
Mutual Learning:
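knowledge distill: knowledge distillation, a student network learns from the teacher network's outputs;
mutual learning: several student networks learn from each other, using KL divergence to measure how close their output probabilities are;
metric mutual learning: $ L_{M}=\frac{1}{N^{2}} \sum_{i}^{N} \sum_{j}^{N} \left(\left[Z G\left(M_{i j}^{\theta_{1}}\right)-M_{i j}^{\theta_{2}}\right]^{2}+\left[M_{i j}^{\theta_{1}}-Z G\left(M_{i j}^{\theta_{2}}\right)\right]^{2}\right) $, where ZG stands for zero gradient (no gradient is computed and nothing is backpropagated through that term); the networks learn each other's distance matrix;
- re-ranking: re-rank the initial ranking list to make it smooth, on Supervised Smoothed Manifold / by K-reciprocal Encoding;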
Person Re-Identification:
difficulties: inaccurate detection, misalignment, illumination difference, occlusion, non-rigid body deformation, similar appearance…
evaluation criteria: CMC (Cumulative Match Characteristic) <rank-1, rank-5, rank-10>, mAP (based on rank);
datasets: Market-1501, CUHK03, DukeMTMC-reid, MARS;
Feature Alignment:
motivations:
- Person is highly structured;
- Local similarity plays a key role to decide the identity;
methods:
- Local Features from local regions
- Traditional Methods (colors, texture…);
- Deep Learning Methods;
- Local Feature Alignment
- Fusion by LSTM (RNN cannot fuse local features properly);
- Alignment in PL-Net (Part Loss Network, unsupervised);
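- Alignment in AlignedReID (by Face++, surpasses human-level performance; global feature + 7 local features representing 7 parts of a person; horizontal pooling; only the corresponding edges are used; dynamic programming);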
ReID with Extra Information:
ReID with Pose Estimation:
- Providing explicit guidance for alignment;
- Global-Local Alignment Descriptor (GLAD);
- Vertical alignment by pose estimation;
- SpindleNet;
- Fusing local features from regions proposed by pose estimation;
ReID with Human Attributes:
- Attributes are critical in discriminating different persons;
12. Shape from X (3D reconstruction: conventional and DL)
Structure from Motion (SfM): the most easy-to-understand approach, triangulation gets depth;
triangulation: the epipolar constraint, monocular;
stereo, rectification, disparity (gives depth): correspondence; cannot measure at long range;
3D point cloud: paper "Building Rome in a Day", SfM reconstruction from multi-view images, 3D geometry;
surface reconstruction: integration of oriented point;
SfM scanning: SLAM based, positioning; a Huawei phone launch event showed the reconstruction of a stationary stuffed panda toy;
depth sensing: active sensors, structured light, ToF (Time of Flight);
short baseline stereo: phase detection autofocus;
shape from shading: shading as cue of 3D shape (the Lambertian law);
photometric stereo;
shape from texture, depth from focus, depth from defocus, shape from shadows, shape from specularities, object priors (paper: A point set generation network for 3D object reconstruction from single image);
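For the stereo item above, the standard rectified-stereo relation depth = focal length * baseline / disparity shows why stereo struggles at long range. The camera parameters below (700 px focal length, 0.12 m baseline) are hypothetical values chosen only for this sketch.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

near = depth_from_disparity(700.0, 0.12, 42.0)   # about 2 m
far_a = depth_from_disparity(700.0, 0.12, 2.0)   # about 42 m
far_b = depth_from_disparity(700.0, 0.12, 1.0)   # about 84 m
```

At 2 m a one-pixel matching error changes the depth only slightly, but between disparities of 2 px and 1 px the depth doubles from about 42 m to about 84 m, so distant points cannot be measured reliably.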
3D reconstruction from single image:
the ShapeNet dataset;
depth map;
volumetric occupancy;
XML file;
point-based representation;
A neural method to stereo matching:
Flownet & Dispnet (using raw left and right images as input, output disparity map);
stereo matching cost convolutional neural network (Yann LeCun);
MRF (Markov random field) stereo methods;
global local stereo neural network;
PatchMatch Communication Layer;
13. Visual Object Tracking
Motion estimation/ Optical flow:
motion field: the projection of the 3D motion onto a 2D image;
optical flow: the pattern of apparent motion in images, , the motion of pixels between adjacent frames;
motion field and optical flow are not exactly the same;
KLT feature tracker (find points, compute optical flow, update points), fairly mature, available in OpenCV;
optical flow with CNN: FlowNet / FlowNet 2.0, lack of training data (Flying Chairs / ChairsSDHom, Flying Things 3D);
Optical flow tends to fail for long-range tracking and in complex scenes, so it is not recommended there;
Single object tracking:
model free: nothing but a single training example is provided by the bounding box in the first frame;
short term and subject to causality;
correlation filter: template matching, similar to convolution;
MOSSE (Minimum Output Sum of Squared Error) Filter;
KCF (Kernelized Correlation Filter);
from KCF to Discriminative CF Trackers: Martin Danelljan; CNN features are used from Deep SRDCF onward;
Continuous-Convolution Operator Tracker: very slow (~1fps) and prone to overfitting;
Efficient Convolution Operators: based on C-COT (factorized convolution operator + Gaussian mixture model), ~15fps on GPU;
Multi-Domain Convolutional Neural Network Tracker: online tracking, bounding box regression, ~1fps;
GOTURN: ~100fps;
SiameseFC: ~60fps, a deep FCN is trained to address a more general similarity learning problem in an initial offline phase;
Benchmark: VOT (accuracy, robustness, EAO: expected average overlap), OTB (one pass evaluation, spatial robustness evaluation);
Multiple object tracking:
tracking by detection: association based on location (IoU, L1/L2 distance), motion (modeling the movement of objects, Kalman filter), appearance (feature) and so on;
association:
association as optimization: local method (Hungarian algorithm), global methods (clustering, network flow, minimum cost multi-cut problem), do optimization in a window (trade off speed against acc);
Benchmark: MOT, KITTI, ImageNet VID;
evaluation metrics: Multiple object tracking accuracy (MOTA);
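Association as optimization can be sketched on a tiny cost matrix. The Hungarian algorithm finds the minimum-cost one-to-one matching efficiently; the sketch below uses brute force over permutations instead (a stand-in that gives the same optimum only for very small problems). The cost values, e.g. 1 - IoU between tracks and detections, are hypothetical.

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost one-to-one matching for a square cost matrix.

    cost[i][j] is the cost of matching track i to detection j.
    Brute force, O(n!): only meant for tiny illustrative inputs.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total

# Track 0 is close to both detections; track 1 is only close to detection 0.
cost = [[0.10, 0.20],
        [0.15, 0.90]]
assignment, total = best_assignment(cost)
```

A greedy matcher would take the single cheapest pair first (track 0 with detection 0) and be forced into the 0.90 pair, while the optimal assignment matches track 0 to detection 1 and track 1 to detection 0.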
Other:
fast moving object (FMO): an object that moves over a distance exceeding its size within the exposure time;
multiple camera tracking;
tracking with multiple cues: with multiple detectors, with key points, with semantic segmentation, with RGBD camera;
multiple object tracking with NN:
- Milan, Anton, et al. "Online Multi-Target Tracking Using Recurrent Neural Networks." AAAI. 2017.
- Son, Jeany, et al. "Multi-Object Tracking with Quadruplet Convolutional Neural Networks." CVPR. 2017.
14. Neural Network in Computer Graphics
Computer vision converts image information into abstract semantic information, whereas computer graphics converts abstract semantic information into image information.
Graphics: rendering, 3D modeling, visual media retouching;
Neural Network for graphics: faster, better, more robust;
NN rendering:
Monte Carlo ray tracing (traces light rays to find the light sources), paper: [SIGGRAPH17] Kernel-Predicting Convolutional Networks for Denoising Monte Carlo Renderings (uses a CNN to predict de-noising kernels and thereby improves the ray tracing rendering result);
volume rendering;
NN shading (real-time rendering), paper-Deep shading: Convolutional Neural Networks for Screen-space shading (2016);
goal is to accelerate, all training data can be gathered virtually;
NN 3D modeling:
- shape understanding:
- 3D ShapeNets: A Deep Representation for Volumetric Shapes (2015).
- VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition (2015).
- DeepPano: Deep Panoramic Representation for 3-D Shape Recognition (2015).
- FusionNet: 3D Object Classification Using Multiple Data Representations (2016).
- OctNet: Learning Deep 3D Representations at High Resolutions (2017).
- O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis (2017).
- Orientation-boosted voxel nets for 3D object recognition (2017).
- PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation (2017).
- shape synthesis: 3D-conv, also use GAN
- from 2D to 3D, data becomes harder to handle, design of mesh representation, high-resolution 3D problem;
NN visual retouching:
- tone mapping: paper-Deep Bilateral Learning for Real-Time Image Enhancement (2017), it can handle high-resolution images relatively fast;
- automatic enhancement: paper- Exposure: A white-box Photo Post-processing Framework (2017);
Example-NN 3D Face:
given a face RGB/RGBD still/sequence, reconstruct for each frame (intrinsic image or inverse rendering):
- Inner/outer camera matrix;
- Face 3D pose;
- Face shape;
- Face expression;
- Face albedo;
- lighting;
3D face priors: shape & albedo, paper-A 3D Morphable Model learnt from 10,000 faces (2016);
3D face priors: expression: paper- FaceWarehouse: a 3D Facial Expression Database for Visual Computing (2012);
optimization-based 3D face fitting;
Coarse Net, Fine Net;
3D Face-without prior: paper-DenseReg: Fully Convolutional Dense Shape Regression In-the-Wild(2017);
render for CV:
- Synthesizing Training Data for Object Detection in Indoor Scenes, (2017);
- Playing for Data: Ground Truth from Computer Games (2016)
- Learning from Synthetic Humans (2017);
demo: Face2Face, Real-Time high-fidelity facial performance capture, DenseReg;