Awesome Fine-grained Image Analysis (FGIA)

A Summary of Fine-Grained Image Analysis

Posted by JY on August 5, 2020

Introduction

  • Compared with general/generic images, fine-grained images are harder to analyze because their categories are defined at a much finer granularity. Fine-grained image analysis is a popular direction in computer vision, covering classification, retrieval, image generation, and more.


  • The main difficulties and challenges of fine-grained image recognition are:
    • Small inter-class variance: all sub-categories belong to the same super-class (e.g. different bird species)
    • Large intra-class variance: images of the same sub-category vary with pose, scale, rotation, and other factors

Tutorials

Survey papers

Benchmark datasets

The table below lists 11 benchmark datasets. "BBox" indicates that the dataset provides object bounding boxes, "Part anno" that it provides key-part locations, "HRCHY" that it has hierarchical labels, "ATR" that it has attribute labels (e.g. wing color), and "Texts" that it provides textual descriptions of the images.

| Dataset name | Year | Meta-class | # Images | # Categories | BBox | Part anno | HRCHY | ATR | Texts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Oxford Flower | 2008 | Flowers | 8,189 | 102 | | | | | ✓ |
| CUB200 | 2011 | Birds | 11,788 | 200 | ✓ | ✓ | | ✓ | ✓ |
| Stanford Dog | 2011 | Dogs | 20,580 | 120 | ✓ | | | | |
| Stanford Car | 2013 | Cars | 16,185 | 196 | ✓ | | | | |
| FGVC Aircraft | 2013 | Aircraft | 10,000 | 100 | ✓ | | ✓ | | |
| Birdsnap | 2014 | Birds | 49,829 | 500 | ✓ | ✓ | | ✓ | |
| NABirds | 2015 | Birds | 48,562 | 555 | ✓ | ✓ | | | |
| DeepFashion | 2016 | Clothes | 800,000 | 1,050 | ✓ | ✓ | | ✓ | |
| Fru92 | 2017 | Fruits | 69,614 | 92 | | | ✓ | | |
| Veg200 | 2017 | Vegetables | 91,117 | 200 | | | ✓ | | |
| iNat2017 | 2017 | Plants & Animals | 859,000 | 5,089 | ✓ | | ✓ | | |
| RPC | 2019 | Retail products | 83,739 | 200 | ✓ | | ✓ | | |

Fine-grained image recognition

Fine-grained recognition by localization-classification subnetworks

Localization-classification subnetworks

Classical State-of-the-arts

  • Mask-CNN: Localizing Parts and Selecting Descriptors for Fine-Grained Image Recognition

    Mask-CNN consists of two modules: part localization, and feature learning for the global image and the local parts.

    • Mask-CNN uses an FCN to learn a three-class segmentation model (head, torso, and background). The ground-truth masks are the minimal bounding rectangles of the head and torso regions derived from the part annotations.


    • Once the FCN is trained, it can localize parts fairly accurately on test images, producing part masks that are combined into an object mask; these masks are used for part localization and for selecting useful features.


    • Each part is fed into a CNN stream to produce feature maps; the part masks and object mask obtained above serve as weights that are multiplied element-wise with the corresponding activations. Max pooling and average pooling are then applied separately, and the two pooled vectors are concatenated as the stream's final feature vector. Finally, the features of the three streams are concatenated again as the representation of the whole image.

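The masked pooling step above can be sketched in a few lines. This is an illustrative NumPy mock-up of the descriptor construction only (mask-weighted features, max + average pooling, concatenation across streams), not the authors' code; function names and shapes are my own assumptions.

```python
import numpy as np

def masked_descriptor(feat, mask):
    """Weight a C x H x W feature map by a binary part/object mask,
    then concatenate global max- and average-pooled vectors (2C dims).

    Illustrative sketch of Mask-CNN's descriptor selection, not the
    original implementation.
    """
    weighted = feat * mask[None, :, :]  # zero out background activations
    c = feat.shape[0]
    max_pool = weighted.reshape(c, -1).max(axis=1)
    avg_pool = weighted.reshape(c, -1).mean(axis=1)
    return np.concatenate([max_pool, avg_pool])

def mask_cnn_feature(stream_feats, stream_masks):
    """Concatenate the descriptors of all streams
    (e.g. head, torso, whole object) into one image representation."""
    return np.concatenate(
        [masked_descriptor(f, m) for f, m in zip(stream_feats, stream_masks)]
    )
```

With three streams and C channels each, the final representation has 3 x 2C dimensions, matching the "concatenate the three stream features" description above.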

  • Selective Sparse Sampling for Fine-Grained Image Recognition


    Proposes a simple yet effective framework that captures fine-grained features without losing contextual information.

    • Uses class peak responses: local maxima of the class response map form a sparse attention, and these sparse attention locations typically correspond to fine-grained object parts.

    • Defines two parallel sampling branches that re-sample the image:
      • a discriminative branch, which extracts discriminative features
      • a complementary branch, which extracts complementary features
    • The three outputs are concatenated and passed through an FC layer for the final classification.
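The "class peak responses" step above amounts to finding local maxima in a 2-D class response map. A minimal sketch, assuming a plain NumPy array and a square neighbourhood (window size and threshold are illustrative parameters, not from the paper):

```python
import numpy as np

def peak_responses(response, win=3, thresh=0.5):
    """Return (row, col, value) triples for local maxima of a 2-D class
    response map. A cell is a peak if it reaches `thresh` and equals the
    maximum of its win x win neighbourhood (clipped at the borders).

    Illustrative sketch of sparse attention via class peak responses.
    """
    h, w = response.shape
    r = win // 2
    peaks = []
    for i in range(h):
        for j in range(w):
            patch = response[max(0, i - r):i + r + 1,
                             max(0, j - r):j + r + 1]
            if response[i, j] >= thresh and response[i, j] == patch.max():
                peaks.append((i, j, float(response[i, j])))
    return peaks
```

The resulting sparse set of peak locations would then steer the re-sampling of the discriminative and complementary branches.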

Fine-grained recognition by end-to-end feature encoding

End-to-end feature encoding

Classical State-of-the-arts

Fine-grained recognition by leveraging attention mechanisms

Leveraging attention mechanisms

Classical State-of-the-arts

Fine-grained recognition by contrastive learning

Leveraging contrastive learning

Classical State-of-the-arts

Fine-grained recognition with external information

Using external information to reduce annotation cost

Fine-grained recognition with web data / auxiliary data

Web data / auxiliary data are noisy, so the model must also denoise the data during training.

Fine-grained recognition with multi-modality data

Fine-grained recognition with humans in the loop

Fine-grained image recognition with limited data

Applications of few-shot learning to fine-grained recognition

Classical State-of-the-arts

Fine-grained image retrieval

Unsupervised with pre-trained models

Supervised with metric learning

Fine-grained image generation

Generating from fine-grained image distributions

Generating from text descriptions

Future directions of FGIA

Fine-grained few shot learning

Fine-Grained hashing

Fine-grained domain adaptation

FGIA within more realistic settings

Recognition leaderboard

Test accuracy on the CUB200-2011 dataset. The table lists the current best methods, whether they use bounding-box or part supervision, whether they use external information, the backbone network, the input image resolution, and the classification accuracy.

| Method | Publication | BBox? | Part? | External information? | Base model | Image resolution | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PB R-CNN | ECCV 2014 | | | | Alex-Net | 224x224 | 73.9% |
| MaxEnt | NeurIPS 2018 | | | | GoogLeNet | TBD | 74.4% |
| PB R-CNN | ECCV 2014 | ✓ | | | Alex-Net | 224x224 | 76.4% |
| PS-CNN | CVPR 2016 | ✓ | ✓ | | CaffeNet | 454x454 | 76.6% |
| MaxEnt | NeurIPS 2018 | | | | VGG-16 | TBD | 77.0% |
| Mask-CNN | PR 2018 | | ✓ | | Alex-Net | 448x448 | 78.6% |
| PC | ECCV 2018 | | | | ResNet-50 | TBD | 80.2% |
| DeepLAC | CVPR 2015 | ✓ | ✓ | | Alex-Net | 227x227 | 80.3% |
| MaxEnt | NeurIPS 2018 | | | | ResNet-50 | TBD | 80.4% |
| Triplet-A | CVPR 2016 | ✓ | | Manual labour | GoogLeNet | TBD | 80.7% |
| Multi-grained | ICCV 2015 | | | WordNet etc. | VGG-19 | 224x224 | 81.7% |
| Krause et al. | CVPR 2015 | ✓ | | | CaffeNet | TBD | 82.0% |
| Multi-grained | ICCV 2015 | ✓ | | WordNet etc. | VGG-19 | 224x224 | 83.0% |
| TS | CVPR 2016 | | | | VGGD+VGGM | 448x448 | 84.0% |
| Bilinear CNN | ICCV 2015 | | | | VGGD+VGGM | 448x448 | 84.1% |
| STN | NeurIPS 2015 | | | | GoogLeNet+BN | 448x448 | 84.1% |
| LRBP | CVPR 2017 | | | | VGG-16 | 224x224 | 84.2% |
| PDFS | CVPR 2016 | | | | VGG-16 | TBD | 84.5% |
| Xu et al. | ICCV 2015 | ✓ | ✓ | Web data | CaffeNet | 224x224 | 84.6% |
| Cai et al. | ICCV 2017 | | | | VGG-16 | 448x448 | 85.3% |
| RA-CNN | CVPR 2017 | | | | VGG-19 | 448x448 | 85.3% |
| MaxEnt | NeurIPS 2018 | | | | Bilinear CNN | TBD | 85.3% |
| PC | ECCV 2018 | | | | Bilinear CNN | TBD | 85.6% |
| CVL | CVPR 2017 | | | Texts | VGG | TBD | 85.6% |
| Mask-CNN | PR 2018 | | ✓ | | VGG-16 | 448x448 | 85.7% |
| GP-256 | ECCV 2018 | | | | VGG-16 | 448x448 | 85.8% |
| KP | CVPR 2017 | | | | VGG-16 | 224x224 | 86.2% |
| T-CNN | IJCAI 2018 | | | | ResNet | 224x224 | 86.2% |
| MA-CNN | ICCV 2017 | | | | VGG-19 | 448x448 | 86.5% |
| MaxEnt | NeurIPS 2018 | | | | DenseNet-161 | TBD | 86.5% |
| DeepKSPD | ECCV 2018 | | | | VGG-19 | 448x448 | 86.5% |
| OSME+MAMC | ECCV 2018 | | | | ResNet-101 | 448x448 | 86.5% |
| StackDRL | IJCAI 2018 | | | | VGG-19 | 224x224 | 86.6% |
| DFL-CNN | CVPR 2018 | | | | VGG-16 | 448x448 | 86.7% |
| Bi-Modal PMA | IEEE TIP 2020 | | | | VGG-16 | 448x448 | 86.8% |
| PC | ECCV 2018 | | | | DenseNet-161 | TBD | 86.9% |
| KERL | IJCAI 2018 | | | Attributes | VGG-16 | 224x224 | 87.0% |
| HBP | ECCV 2018 | | | | VGG-16 | 448x448 | 87.1% |
| Mask-CNN | PR 2018 | | ✓ | | ResNet-50 | 448x448 | 87.3% |
| DFL-CNN | CVPR 2018 | | | | ResNet-50 | 448x448 | 87.4% |
| NTS-Net | ECCV 2018 | | | | ResNet-50 | 448x448 | 87.5% |
| HSnet | CVPR 2017 | ✓ | ✓ | | GoogLeNet+BN | TBD | 87.5% |
| Bi-Modal PMA | IEEE TIP 2020 | | | | ResNet-50 | 448x448 | 87.5% |
| CIN | AAAI 2020 | | | | ResNet-50 | 448x448 | 87.5% |
| MetaFGNet | ECCV 2018 | | | Auxiliary data | ResNet-34 | TBD | 87.6% |
| Cross-X | CVPR 2020 | | | | ResNet-50 | 448x448 | 87.7% |
| DCL | CVPR 2019 | | | | ResNet-50 | 448x448 | 87.8% |
| ACNet | CVPR 2020 | | | | VGG-16 | 448x448 | 87.8% |
| TASN | CVPR 2019 | | | | ResNet-50 | 448x448 | 87.9% |
| ACNet | CVPR 2020 | | | | ResNet-50 | 448x448 | 88.1% |
| CIN | AAAI 2020 | | | | ResNet-101 | 448x448 | 88.1% |
| DBTNet-101 | NeurIPS 2019 | | | | ResNet-101 | 448x448 | 88.1% |
| Bi-Modal PMA | IEEE TIP 2020 | | | Texts | VGG-16 | 448x448 | 88.2% |
| GCL | AAAI 2020 | | | | ResNet-50 | 448x448 | 88.3% |
| S3N | CVPR 2020 | | | | ResNet-50 | 448x448 | 88.5% |
| Sun et al. | AAAI 2020 | | | | ResNet-50 | 448x448 | 88.6% |
| FDL | AAAI 2020 | | | | ResNet-50 | 448x448 | 88.6% |
| Bi-Modal PMA | IEEE TIP 2020 | | | Texts | ResNet-50 | 448x448 | 88.7% |
| DF-GMM | CVPR 2020 | | | | ResNet-50 | 448x448 | 88.8% |
| PMG | ECCV 2020 | | | | VGG-16 | 550x550 | 88.8% |
| FDL | AAAI 2020 | | | | DenseNet-161 | 448x448 | 89.1% |
| PMG | ECCV 2020 | | | | ResNet-50 | 550x550 | 89.6% |
| API-Net | AAAI 2020 | | | | DenseNet-161 | 512x512 | 90.0% |
| Ge et al. | CVPR 2019 | | | | GoogLeNet+BN | Shorter side is 800 px | 90.4% |