YOLO Series (5): yolov4-tiny

YOLOv4-tiny is a slimmed-down, lightweight version of YOLOv4 with only about 6 million parameters, roughly one tenth of the original, which greatly increases detection speed. The whole network has 38 layers, uses three residual units, takes LeakyReLU as its activation function, performs classification and regression on two feature layers, and merges the effective feature layers with a feature pyramid network (FPN). It also adopts the CSPNet structure and splits channels in the feature-extraction network: the feature map produced by a 3x3 convolution is divided into two parts along the channel dimension and the second part is taken. On the COCO dataset it reaches 40.2% AP50 at 371 FPS, a clear advantage over other lightweight models. Its structure is shown in the figure below.
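
Below is a minimal PyTorch sketch of the channel-split idea described above: run a 3x3 convolution, keep only the second half of the output channels for further processing, then merge the result back. The layer sizes and the block layout are illustrative placeholders, not the exact yolov4-tiny configuration.

import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k=3):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1))

class CSPTinyBlock(nn.Module):
    # After a 3x3 conv, split the channels in two, process only the second
    # half, then concatenate the result back with the full feature map.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = conv_bn_leaky(channels, channels)
        self.conv2 = conv_bn_leaky(channels // 2, channels // 2)

    def forward(self, x):
        x = self.conv1(x)
        part2 = x[:, x.size(1) // 2:]        # take the second half of the channels
        y = self.conv2(part2)
        return torch.cat([x, y], dim=1)      # merge the processed part back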

YOLOv4-tiny is multi-task, end-to-end, uses an attention-style mechanism, and is multi-scale. Multi-task means classification and regression are performed simultaneously with shared parameters, which helps avoid overfitting; end-to-end means the model takes an image and directly outputs the classification and regression predictions; the attention mechanism focuses processing on the target regions, which speeds up processing; multi-scale means features obtained by downsampling and upsampling are fused with each other so that objects of various sizes can be detected. During training, Mosaic data augmentation, label smoothing, and cosine-annealing learning-rate decay can be used to improve training speed and detection accuracy.

YOLO Series (4): yolov3

yolov3 is a one-stage, anchor-based object detector.

FPN:

Most earlier object detection algorithms used only the top-level features for prediction. Low-level features carry little semantic information but localise objects accurately, while high-level features are semantically rich but localise objects only coarsely. Some algorithms do fuse multi-scale features, but they usually predict only on the fused features; the difference here is that prediction is carried out independently on each feature level.

FPN (Feature Pyramid Network) exploits both the high resolution of low-level features and the rich semantics of high-level features, and achieves good predictions by fusing features from different levels. Unlike other fusion schemes, prediction is done separately on every fused feature level (independent prediction per feature level).
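
As a hedged illustration of that top-down merge, here is a minimal PyTorch sketch assuming backbone feature maps c3/c4/c5 with the channel counts shown; only the structure (1x1 laterals, upsample-and-add, per-level prediction) matters, not the exact widths.

import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    # Minimal FPN: 1x1 lateral convs to a common width, top-down
    # upsample-and-add, then one 3x3 conv per merged level; each returned
    # level would get its own prediction head.
    def __init__(self, in_channels=(256, 512, 1024), width=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(width, width, 3, padding=1) for _ in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return [smooth(p) for smooth, p in zip(self.smooth, (p3, p4, p5))]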

Network structure analysis:

  1. YOLOv3 contains only convolutional layers; the output feature-map size is controlled by the convolution stride, so there is no hard constraint on the input image size. In the flow chart, a 256*256 input is used as an example.
  2. YOLOv3 borrows the feature-pyramid idea: small feature maps are used to detect large objects, and large feature maps to detect small objects. The output dimension of a feature map is N*N*[3*(4+1+80)], where N*N is the number of grid cells; each cell has 3 anchor boxes, and each box carries 4 box-coordinate values, 1 objectness confidence, and 80 class scores. For the 256*256 example input, the first feature map therefore has output dimension 8*8*255.
  3. YOLOv3 outputs three feature maps in total, downsampled by 32x, 16x, and 8x respectively. The input image passes through Darknet-53 (without its fully connected layers), and the feature map produced by the following Yolo block is used in two ways: one branch goes through a 3*3 convolution and a 1*1 convolution to produce feature map one; the other goes through a 1*1 convolution plus an upsampling layer and is concatenated with an intermediate Darknet-53 output to produce feature map two. Repeating the same cycle produces feature map three.
  4. Difference between concat and element-wise addition: addition comes from ResNet and adds the input feature map to the output feature map along matching dimensions, i.e. y = f(x) + x; concat comes from DenseNet and stacks feature maps along the channel dimension, e.g. concatenating an 8*8*16 feature map with another 8*8*16 feature map yields an 8*8*32 feature map (see the sketch after this list).
  5. Upsample layer: enlarges a small feature map into a larger one via interpolation, e.g. nearest-neighbour interpolation turns an 8*8 map into a 16*16 map. Upsampling does not change the number of channels.
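
The short PyTorch sketch below just demonstrates the shape bookkeeping from items 2, 4 and 5: element-wise add keeps the shape, concat stacks channels, nearest-neighbour upsampling doubles the spatial size, and the per-cell prediction width for 80 classes works out to 255. The tensor sizes are toy values.

import torch
import torch.nn.functional as F

a = torch.randn(1, 16, 8, 8)
b = torch.randn(1, 16, 8, 8)

added = a + b                                                 # ResNet-style add: (1, 16, 8, 8)
concatenated = torch.cat([a, b], dim=1)                       # DenseNet-style concat: (1, 32, 8, 8)
upsampled = F.interpolate(a, scale_factor=2, mode="nearest")  # (1, 16, 16, 16), channels unchanged

# per-cell prediction width: 3 anchors * (4 box + 1 confidence + 80 classes)
print(3 * (4 + 1 + 80))  # 255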

The YOLO network as a whole absorbs the essence of ResNet, DenseNet, and FPN; one can say it combines most of the tricks currently known to be effective in object detection.

YOLOv3 network structure diagram (VOC dataset)
The Darknet-53 model used by YOLOv3

YOLO Series (3): yolov2

yolov2 is a one-stage, anchor-based object detector.

The YOLOv2 paper is titled YOLO9000: Better, Faster, Stronger and received a CVPR 2017 Best Paper Honorable Mention. In it, the authors first propose YOLOv2 as an improvement of YOLOv1, then propose a joint detection-and-classification training method. Using this joint training on the COCO detection dataset and the ImageNet classification dataset, they train the YOLO9000 model, which can detect more than 9000 object categories. The paper therefore actually covers two models, YOLOv2 and YOLO9000; the latter is built on the former and the two share the same main architecture. YOLOv2 makes many improvements over YOLOv1, giving a significant mAP gain while remaining fast and keeping the one-stage advantage.

YOLOv2 and YOLO9000 share the same algorithmic core; the difference is how they are trained. YOLOv2 is trained on the COCO dataset and recognises 80 classes, while YOLO9000 is trained jointly on COCO + ImageNet and can recognise more than 9000 classes.

YOLOv2's improvement strategies

YOLOv1 is fast but less accurate than R-CNN-style detectors: its localisation is imprecise and its recall is low. YOLOv2 introduces a set of improvements to raise the model's localisation accuracy and recall, and thus mAP, while following one principle: keep the detection speed, which is YOLO's main advantage. The improvements are summarised in Figure 2; most of them give a noticeable mAP gain.

Batch Normalization

Batch Normalization speeds up convergence and acts as a regulariser, reducing overfitting. In YOLOv2, a Batch Normalization layer is added after every convolutional layer and dropout is no longer used. With Batch Normalization, YOLOv2's mAP improves by 2.4%.
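
A minimal sketch of such a conv + BN + activation unit is shown below. The filter counts are placeholders; dropping the conv bias when BN follows is the common convention (BN's learned shift makes it redundant), not something stated in the text above.

import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, k=3):
    # Convolution followed by BatchNorm and LeakyReLU; no dropout is used.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True))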

High Resolution Classifier:

Most detection models pre-train their main body (the CNN feature extractor) on the ImageNet classification dataset. For historical reasons, ImageNet classification models mostly take 224*224 inputs, a relatively low resolution that is unfavourable for detection. YOLOv1 therefore pre-trains a 224*224 classifier and then raises the resolution to 448*448 when fine-tuning on the detection dataset, but switching resolution abruptly makes it hard for the model to adapt quickly. YOLOv2 adds an intermediate step: the classification network is fine-tuned on ImageNet with 448*448 inputs (10 epochs), so the model has already adapted to high-resolution inputs before fine-tuning on the detection dataset. With the high-resolution classifier, YOLOv2's mAP improves by about 4%.

Convolutional With Anchor Boxes: In YOLOv1 the input image is divided into a 7*7 grid and each cell predicts 2 bounding boxes. YOLOv1 predicts the boxes directly with fully connected layers, with widths and heights expressed relative to the whole image. Because objects in images come in different scales and aspect ratios (scales and ratios), learning to fit all of these shapes during training is difficult, which is one reason YOLOv1 localises poorly. YOLOv2 instead borrows the anchor-box (prior-box) idea from the RPN in Faster R-CNN (SSD also uses priors). The RPN convolves over the CNN feature map to predict a bounding box and an objectness confidence at every position, with priors of different scales and aspect ratios placed at each position, so the RPN predicts offsets of the box relative to its prior (strictly, transform values; see the Faster R-CNN paper). Predicting offsets to priors is easier to learn, so YOLOv2 removes YOLOv1's fully connected layers and predicts boxes with convolutions and anchor boxes. One pooling layer is also removed so that the feature map used for detection has higher resolution.

For detection, YOLOv2 uses 416*416 inputs instead of 448*448. The total downsampling stride is 32, so a 416*416 input yields a 13*13 feature map, an odd size with a single centre position. Large objects tend to have their centres near the image centre, and predicting them from that single centre cell is easier, so YOLOv2 is designed so that the final feature map has an odd number of positions.

In YOLOv1, each cell predicts 2 boxes, each with 5 values (x, y, w, h, c): the first four are the box position and size, and the last is the confidence (confidence scores, combining the probability that the cell contains an object and the IOU between the predicted box and the ground truth). Each cell, however, predicts only one set of class probabilities (class predictions, conditional on objectness), shared by its 2 boxes. With anchor boxes, every anchor at every position predicts its own set of class probabilities, similar to SSD (though SSD has no separate confidence and instead treats background as a class). Using anchor boxes actually lowers YOLOv2's mAP slightly (my guess is that YOLOv2 still uses YOLOv1's training approach). But where YOLOv1 can only predict 98 bounding boxes (7*7*2), YOLOv2 with anchor boxes can predict far more (13*13*num_anchors), so the recall rises substantially, from 81% to 88%.
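
A quick check of the arithmetic in the paragraph above (values from the text; the 5-anchor count is the one chosen later in Dimension Clusters):

# 416x416 input with total stride 32 gives a 13x13 feature map, an odd size
# with a single centre cell.
input_size, total_stride = 416, 32
grid = input_size // total_stride
print(grid, grid % 2 == 1)          # 13 True

# YOLOv1 predicts 7*7*2 = 98 boxes; with anchors YOLOv2 predicts grid*grid*num_anchors.
print(7 * 7 * 2, grid * grid * 5)   # 98 845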

Dimension Clusters

In Faster R-CNN and SSD, the prior-box dimensions (width and height) are set by hand, which is somewhat subjective. If the chosen priors fit the data well, the model learns more easily and makes better predictions. YOLOv2 therefore runs k-means clustering on the bounding boxes of the training set. Since the main purpose of the priors is to give a good IOU with the ground truth, the clustering uses the IOU between a box and the cluster-centre box as the distance metric:

$$
d(\text{box}, \text{centroid}) = 1 - \mathrm{IOU}(\text{box}, \text{centroid})
$$

The figure below shows the clustering results on the VOC and COCO datasets. As the number of cluster centres grows, the mean IOU (the average IOU between each box and its cluster centre) increases, but balancing model complexity against recall, the authors finally pick 5 cluster centres as priors; their sizes relative to the image are shown on the right of the figure. For the two datasets, the widths and heights of the 5 priors are as follows (source: the cfg files in the YOLO source code):

COCO: (0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434), (7.88282, 3.52778), (9.77052, 9.16828)
VOC: (1.3221, 1.73145), (3.19275, 4.00944), (5.05587, 8.09892), (9.47112, 4.84053), (11.2364, 10.0071)

The paper does not state exactly what unit these prior sizes are in, but they are clearly not pixels; from the code, they appear to be relative to the prediction feature map (13*13). Comparing the two datasets also shows that COCO objects are relatively smaller. The authors did not run a separate ablation for this strategy, but they compared the mean IOU of clustered priors against hand-picked priors and found the clustered ones higher, so the model is easier to train.

Figure 3: bounding-box clustering results on the VOC and COCO datasets
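
Below is a minimal sketch of the clustering described above, assuming the boxes are given as (w, h) pairs already normalised to the feature-map scale. This is not the original YOLO script, just the 1 - IOU distance plugged into plain k-means.

import numpy as np

def iou_wh(box, centroids):
    # IoU between one (w, h) box and k centroid (w, h) boxes, all sharing the same centre
    inter = np.minimum(box[0], centroids[:, 0]) * np.minimum(box[1], centroids[:, 1])
    union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    # boxes: (N, 2) array of (w, h); distance d = 1 - IoU(box, centroid)
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.array([np.argmin(1.0 - iou_wh(b, centroids)) for b in boxes])
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids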

New Network: Darknet-19

YOLOv2 uses a new base model (feature extractor) called Darknet-19, with 19 convolutional layers and 5 maxpooling layers, as shown in Figure 4. Darknet-19 follows the same design principles as VGG16: mostly 3*3 convolutions, and after each 2*2 maxpooling layer the spatial size of the feature map halves while the number of channels doubles. Like NIN (Network in Network), Darknet-19 ends with global average pooling for prediction and uses 1*1 convolutions between the 3*3 ones to compress the channels, reducing computation and parameters. Every convolutional layer is again followed by a batch norm layer to speed up convergence and reduce overfitting. On the ImageNet classification dataset, Darknet-19 reaches 72.9% top-1 and 91.2% top-5 accuracy with relatively few parameters. With Darknet-19, YOLOv2's mAP barely changes, but the computation drops by about 33%.

Direct location prediction

Following YOLOv1, the network predicts the offset of the box centre relative to the top-left corner of its cell; to keep the centre inside the current cell, the offsets are passed through a sigmoid so the predicted offsets lie in (0, 1) (treating the cell size as 1).
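
Written out, these are the standard YOLOv2 decoding equations, where (c_x, c_y) is the top-left corner of the cell, (p_w, p_h) is the prior's size, and sigma is the sigmoid:

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w e^{t_w} \\
b_h &= p_h e^{t_h}
\end{aligned}
$$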

Fine-Grained Features (finer feature maps)

YOLOv2 takes 416*416 inputs; after 5 maxpooling steps the feature map is 13*13, and predictions are made by convolving on it. A 13*13 feature map is sufficient for detecting large objects, but small objects need finer feature maps (Fine-Grained Features). SSD, for example, predicts from multiple feature maps of different resolutions, with the earlier, finer maps used to predict small objects. YOLOv2 instead introduces a passthrough layer to exploit a finer feature map: the 26*26 map (the input of the last maxpooling layer), which for Darknet-19 has size 26*26*512. The passthrough layer is similar to a ResNet shortcut: it takes the earlier, higher-resolution feature map as input and connects it to the later, lower-resolution map. The earlier map has twice the spatial size of the later one, so the passthrough layer extracts each 2*2 local region and moves it into the channel dimension: a 26*26*512 feature map becomes a 13*13*2048 feature map (the spatial size is halved per side, i.e. the area drops 4x, while the channels grow 4x; an example is shown below). This can then be concatenated with the later 13*13*1024 feature map to form a 13*13*3072 map, on which the convolutional prediction is made. In the YOLO C source code the passthrough layer is called the reorg layer; in TensorFlow it can be implemented with tf.extract_image_patches or tf.space_to_depth.

Example of the passthrough layer
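
In PyTorch, nn.PixelUnshuffle performs the same 2x2-patch-to-channel re-packing, so the shape bookkeeping of the passthrough layer can be sketched as below. The exact element ordering may differ from Darknet's reorg layer; this only illustrates the 26x26x512 to 13x13x2048 transformation and the concat that follows.

import torch
import torch.nn as nn

x = torch.randn(1, 512, 26, 26)           # the fine-grained feature map
reorg = nn.PixelUnshuffle(downscale_factor=2)
print(reorg(x).shape)                      # torch.Size([1, 2048, 13, 13])

coarse = torch.randn(1, 1024, 13, 13)      # the later 13x13 feature map
merged = torch.cat([reorg(x), coarse], dim=1)
print(merged.shape)                        # torch.Size([1, 3072, 13, 13])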

Multi-Scale Training

With the Multi-Scale Training strategy, YOLOv2 adapts to images of different sizes and still predicts well. At test time, YOLOv2 can take inputs of different sizes; results on the VOC 2007 dataset are shown below. With smaller input resolutions the mAP drops slightly but the speed increases; with larger inputs the mAP is higher but the speed drops somewhat; at 544*544 the mAP reaches 78.6%. Note that only the test-time input size changes; it is the same model throughout (trained with Multi-Scale Training).
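
A minimal sketch of the training-time side of this idea follows, assuming the schedule from the paper: every 10 batches, a new input resolution is drawn from the multiples of 32 between 320 and 608. Label rescaling and letterboxing are omitted.

import random
import torch.nn.functional as F

def multi_scale_size(step, low=320, high=608, stride=32, every=10, _state={"size": 416}):
    # Every `every` batches, draw a new input resolution that is a multiple of `stride`.
    if step % every == 0:
        _state["size"] = random.randrange(low, high + 1, stride)
    return _state["size"]

def resize_batch(images, size):
    # images: (N, 3, H, W) tensor; boxes would need to be rescaled accordingly
    return F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)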

YOLO9000

YOLO9000 is built on YOLOv2 and can detect more than 9000 categories; its main contribution is a joint classification-and-detection training strategy. As is well known, annotating detection datasets is far more laborious than labelling classification datasets, so ImageNet-style classification datasets are orders of magnitude larger than detection datasets such as VOC. In YOLO, box prediction does not actually depend on the object's class label, so YOLO can be trained jointly on classification and detection data: detection data is used to learn box prediction, confidence, and classification, while classification data is used only for classification but greatly expands the set of object categories the model can detect.

The authors train jointly on COCO and ImageNet. The first problem they face is that the two label sets are not mutually exclusive; for example, "Norfolk terrier" is obviously a "dog". They therefore propose hierarchical classification: build a tree, WordTree, from the is-a relations between categories (based on WordNet).

The root node of WordTree is "physical object"; the children of each node belong to one group, over which a softmax can be applied. To obtain the predicted probability of a given class, locate its node, traverse the path to it, and multiply the probabilities of the nodes along that path.

During training, detection samples use the full YOLOv2 loss, while classification samples contribute only the classification loss. At inference time, the confidence YOLOv2 outputs is taken as Pr(physical object), and the model also outputs the box position and a tree of class probabilities. The most probable path through this tree is followed until the probability falls below a threshold, and the node reached at that point is the predicted class. A toy sketch of the path-probability computation follows.
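
The sketch below uses made-up nodes and conditional probabilities purely to show the "product along the path" rule; the real WordTree and its values come from WordNet and the trained model.

# Each node stores its conditional probability given its parent; an absolute
# class probability is the product of the conditionals along the path to the root.
parent = {"terrier": "dog", "dog": "mammal", "mammal": "physical object",
          "physical object": None}
cond_prob = {"terrier": 0.9, "dog": 0.8, "mammal": 0.95, "physical object": 1.0}

def path_probability(node):
    p = 1.0
    while node is not None:
        p *= cond_prob[node]
        node = parent[node]
    return p

print(path_probability("terrier"))  # 0.9 * 0.8 * 0.95 * 1.0 = 0.684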

With this joint training strategy, YOLO9000 can quickly detect more than 9000 object categories, with an overall mAP of 19.7%. I think this is the paper's biggest contribution: YOLOv2's individual improvements are not especially striking, but YOLO9000 is a genuinely pioneering idea.

Reference:

https://zhuanlan.zhihu.com/p/35325884

Generative adversarial network series: papers + PyTorch implementations

GitHub link:

https://github.com/eriklindernoren/PyTorch-GAN

This repository has gone stale as I unfortunately do not have the time to maintain it anymore. If you would like to continue the development of it as a collaborator send me an email at eriklindernoren@gmail.com.

PyTorch-GAN

Collection of PyTorch implementations of Generative Adversarial Network varieties presented in research papers. Model architectures will not always mirror the ones proposed in the papers, but I have chosen to focus on getting the core ideas covered instead of getting every layer configuration right. Contributions and suggestions of GANs to implement are very welcomed.

See also: Keras-GAN

Table of Contents

Installation

$ git clone https://github.com/eriklindernoren/PyTorch-GAN
$ cd PyTorch-GAN/
$ sudo pip3 install -r requirements.txt

Implementations

Auxiliary Classifier GAN

Auxiliary Classifier Generative Adversarial Network

Authors

Augustus Odena, Christopher Olah, Jonathon Shlens

Abstract

Synthesizing high resolution photorealistic images has been a long-standing challenge in machine learning. In this paper we introduce new methods for the improved training of generative adversarial networks (GANs) for image synthesis. We construct a variant of GANs employing label conditioning that results in 128×128 resolution image samples exhibiting global coherence. We expand on previous work for image quality assessment to provide two new analyses for assessing the discriminability and diversity of samples from class-conditional image synthesis models. These analyses demonstrate that high resolution samples provide class information not present in low resolution samples. Across 1000 ImageNet classes, 128×128 samples are more than twice as discriminable as artificially resized 32×32 samples. In addition, 84.7% of the classes have samples exhibiting diversity comparable to real ImageNet data.

[Paper] [Code]

Run Example

$ cd implementations/acgan/
$ python3 acgan.py

Adversarial Autoencoder

Adversarial Autoencoder

Authors

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, Brendan Frey

Abstract

In this paper, we propose the “adversarial autoencoder” (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) to perform variational inference by matching the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. As a result, the decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We performed experiments on MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.

[Paper] [Code]

Run Example

$ cd implementations/aae/
$ python3 aae.py

BEGAN

BEGAN: Boundary Equilibrium Generative Adversarial Networks

Authors

David Berthelot, Thomas Schumm, Luke Metz

Abstract

We propose a new equilibrium enforcing method paired with a loss derived from the Wasserstein distance for training auto-encoder based Generative Adversarial Networks. This method balances the generator and discriminator during training. Additionally, it provides a new approximate convergence measure, fast and stable training and high visual quality. We also derive a way of controlling the trade-off between image diversity and visual quality. We focus on the image generation task, setting a new milestone in visual quality, even at higher resolutions. This is achieved while using a relatively simple model architecture and a standard training procedure.

[Paper] [Code]

Run Example

$ cd implementations/began/
$ python3 began.py

BicycleGAN

Toward Multimodal Image-to-Image Translation

Authors

Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, Eli Shechtman

Abstract

Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a \emph{distribution} of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh edges2shoes
$ cd ../implementations/bicyclegan/
$ python3 bicyclegan.py

Various style translations by varying the latent code.​

Boundary-Seeking GAN

Boundary-Seeking Generative Adversarial Networks

Authors

R Devon Hjelm, Athul Paul Jacob, Tong Che, Adam Trischler, Kyunghyun Cho, Yoshua Bengio

Abstract

Generative adversarial networks (GANs) are a learning framework that rely on training a discriminator to estimate a measure of difference between a target and generated distributions. GANs, as normally formulated, rely on the generated samples being completely differentiable w.r.t. the generative parameters, and thus do not work for discrete data. We introduce a method for training GANs with discrete data that uses the estimated difference measure from the discriminator to compute importance weights for generated samples, thus providing a policy gradient for training the generator. The importance weights have a strong connection to the decision boundary of the discriminator, and we call our method boundary-seeking GANs (BGANs). We demonstrate the effectiveness of the proposed algorithm with discrete image and character-based natural language generation. In addition, the boundary-seeking objective extends to continuous data, which can be used to improve stability of training, and we demonstrate this on Celeba, Large-scale Scene Understanding (LSUN) bedrooms, and Imagenet without conditioning.

[Paper] [Code]

Run Example

$ cd implementations/bgan/
$ python3 bgan.py

Cluster GAN

ClusterGAN: Latent Space Clustering in Generative Adversarial Networks

Authors

Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, Sreeram Kannan

Abstract

Generative Adversarial networks (GANs) have obtained remarkable success in many unsupervised learning tasks and unarguably, clustering is an important unsupervised learning problem. While one can potentially exploit the latent-space back-projection in GANs to cluster, we demonstrate that the cluster structure is not retained in the GAN latent space. In this paper, we propose ClusterGAN as a new mechanism for clustering using GANs. By sampling latent variables from a mixture of one-hot encoded variables and continuous latent variables, coupled with an inverse network (which projects the data to the latent space) trained jointly with a clustering specific loss, we are able to achieve clustering in the latent space. Our results show a remarkable phenomenon that GANs can preserve latent space interpolation across categories, even though the discriminator is never exposed to such vectors. We compare our results with various clustering baselines and demonstrate superior performance on both synthetic and real datasets.

[Paper] [Code]

Code based on a full PyTorch [implementation].

Run Example

$ cd implementations/cluster_gan/
$ python3 clustergan.py

Conditional GAN

Conditional Generative Adversarial Nets

Authors

Mehdi Mirza, Simon Osindero

Abstract

Generative Adversarial Nets [8] were recently introduced as a novel way to train generative models. In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data, y, we wish to condition on to both the generator and discriminator. We show that this model can generate MNIST digits conditioned on class labels. We also illustrate how this model could be used to learn a multi-modal model, and provide preliminary examples of an application to image tagging in which we demonstrate how this approach can generate descriptive tags which are not part of training labels.

[Paper] [Code]

Run Example

$ cd implementations/cgan/
$ python3 cgan.py

Context-Conditional GAN

Semi-Supervised Learning with Context-Conditional Generative Adversarial Networks

Authors

Emily Denton, Sam Gross, Rob Fergus

Abstract

We introduce a simple semi-supervised learning approach for images based on in-painting using an adversarial loss. Images with random patches removed are presented to a generator whose task is to fill in the hole, based on the surrounding pixels. The in-painted images are then presented to a discriminator network that judges if they are real (unaltered training images) or not. This task acts as a regularizer for standard supervised training of the discriminator. Using our approach we are able to directly train large VGG-style networks in a semi-supervised fashion. We evaluate on STL-10 and PASCAL datasets, where our approach obtains performance comparable or superior to existing methods.

[Paper] [Code]

Run Example

$ cd implementations/ccgan/
$ python3 ccgan.py

Context Encoder

Context Encoders: Feature Learning by Inpainting

Authors

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros

Abstract

We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders — a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

[Paper] [Code]

Run Example

$ cd implementations/context_encoder/
<follow steps at the top of context_encoder.py>
$ python3 context_encoder.py

Rows: Masked | Inpainted | Original | Masked | Inpainted | Original​

Coupled GAN

Coupled Generative Adversarial Networks

Authors

Ming-Yu Liu, Oncel Tuzel

Abstract

We propose coupled generative adversarial network (CoGAN) for learning a joint distribution of multi-domain images. In contrast to the existing approaches, which require tuples of corresponding images in different domains in the training set, CoGAN can learn a joint distribution without any tuple of corresponding images. It can learn a joint distribution with just samples drawn from the marginal distributions. This is achieved by enforcing a weight-sharing constraint that limits the network capacity and favors a joint distribution solution over a product of marginal distributions one. We apply CoGAN to several joint distribution learning tasks, including learning a joint distribution of color and depth images, and learning a joint distribution of face images with different attributes. For each task it successfully learns the joint distribution without any tuple of corresponding images. We also demonstrate its applications to domain adaptation and image transformation.

[Paper] [Code]

Run Example

$ cd implementations/cogan/
$ python3 cogan.py

Generated MNIST and MNIST-M images​

CycleGAN

Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Authors

Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros

Abstract

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G:X→Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F:Y→X and introduce a cycle consistency loss to push F(G(X))≈X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_cyclegan_dataset.sh monet2photo
$ cd ../implementations/cyclegan/
$ python3 cyclegan.py --dataset_name monet2photo

Monet to photo translations.​

Deep Convolutional GAN

Deep Convolutional Generative Adversarial Network

Authors

Alec Radford, Luke Metz, Soumith Chintala

Abstract

In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks – demonstrating their applicability as general image representations.

[Paper] [Code]

Run Example

$ cd implementations/dcgan/
$ python3 dcgan.py

DiscoGAN

Learning to Discover Cross-Domain Relations with Generative Adversarial Networks

Authors

Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim

Abstract

While humans easily recognize relations between data from different domains without any supervision, learning to automatically discover them is in general very challenging and needs many ground-truth pairs that illustrate the relations. To avoid costly pairing, we address the task of discovering cross-domain relations given unpaired data. We propose a method based on generative adversarial networks that learns to discover relations between different domains (DiscoGAN). Using the discovered relations, our proposed network successfully transfers style from one domain to another while preserving key attributes such as orientation and face identity.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh edges2shoes
$ cd ../implementations/discogan/
$ python3 discogan.py --dataset_name edges2shoes

Rows from top to bottom: (1) Real image from domain A (2) Translated image from
domain A (3) Reconstructed image from domain A (4) Real image from domain B (5)
Translated image from domain B (6) Reconstructed image from domain B​

DRAGAN

On Convergence and Stability of GANs

Authors

Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira

Abstract

We propose studying GAN training dynamics as regret minimization, which is in contrast to the popular view that there is consistent minimization of a divergence between real and generated distributions. We analyze the convergence of GAN training from this new point of view to understand why mode collapse happens. We hypothesize the existence of undesirable local equilibria in this non-convex game to be responsible for mode collapse. We observe that these local equilibria often exhibit sharp gradients of the discriminator function around some real data points. We demonstrate that these degenerate local equilibria can be avoided with a gradient penalty scheme called DRAGAN. We show that DRAGAN enables faster training, achieves improved stability with fewer mode collapses, and leads to generator networks with better modeling performance across a variety of architectures and objective functions.

[Paper] [Code]

Run Example

$ cd implementations/dragan/
$ python3 dragan.py

DualGAN

DualGAN: Unsupervised Dual Learning for Image-to-Image Translation

Authors

Zili Yi, Hao Zhang, Ping Tan, Minglun Gong

Abstract

Conditional Generative Adversarial Networks (GANs) for cross-domain image-to-image translation have made much progress recently. Depending on the task complexity, thousands to millions of labeled image pairs are needed to train a conditional GAN. However, human labeling is expensive, even impractical, and large quantities of data may not always be available. Inspired by dual learning from natural language translation, we develop a novel dual-GAN mechanism, which enables image translators to be trained from two sets of unlabeled images from two domains. In our architecture, the primal GAN learns to translate images from domain U to those in domain V, while the dual GAN learns to invert the task. The closed loop made by the primal and dual tasks allows images from either domain to be translated and then reconstructed. Hence a loss function that accounts for the reconstruction error of images can be used to train the translators. Experiments on multiple image translation tasks with unlabeled data show considerable performance gain of DualGAN over a single GAN. For some tasks, DualGAN can even achieve comparable or slightly better results than conditional GAN trained on fully labeled data.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh facades
$ cd ../implementations/dualgan/
$ python3 dualgan.py --dataset_name facades

Energy-Based GAN

Energy-based Generative Adversarial Network

Authors

Junbo Zhao, Michael Mathieu, Yann LeCun

Abstract

We introduce the “Energy-based Generative Adversarial Network” model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discriminator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of EBGAN framework as using an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.

[Paper] [Code]

Run Example

$ cd implementations/ebgan/
$ python3 ebgan.py

Enhanced Super-Resolution GAN

ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks

Authors

Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, Xiaoou Tang

Abstract

The Super-Resolution Generative Adversarial Network (SRGAN) is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts. To further enhance the visual quality, we thoroughly study three key components of SRGAN – network architecture, adversarial loss and perceptual loss, and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery. Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge. The code is available at this https URL.

[Paper] [Code]

Run Example

$ cd implementations/esrgan/
<follow steps at the top of esrgan.py>
$ python3 esrgan.py

Nearest Neighbor Upsampling | ESRGAN​

GAN

Generative Adversarial Network

Authors

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio

Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

[Paper] [Code]

Run Example

$ cd implementations/gan/
$ python3 gan.py

InfoGAN

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Authors

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel

Abstract

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.

[Paper] [Code]

Run Example

$ cd implementations/infogan/
$ python3 infogan.py

Result of varying categorical latent variable by column.​

Result of varying continuous latent variable by row.​

Least Squares GAN

Least Squares Generative Adversarial Networks

Authors

Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, Stephen Paul Smolley

Abstract

Unsupervised learning with generative adversarial networks (GANs) has proven hugely successful. Regular GANs hypothesize the discriminator as a classifier with the sigmoid cross entropy loss function. However, we found that this loss function may lead to the vanishing gradients problem during the learning process. To overcome such a problem, we propose in this paper the Least Squares Generative Adversarial Networks (LSGANs) which adopt the least squares loss function for the discriminator. We show that minimizing the objective function of LSGAN yields minimizing the Pearson χ2 divergence. There are two benefits of LSGANs over regular GANs. First, LSGANs are able to generate higher quality images than regular GANs. Second, LSGANs perform more stable during the learning process. We evaluate LSGANs on five scene datasets and the experimental results show that the images generated by LSGANs are of better quality than the ones generated by regular GANs. We also conduct two comparison experiments between LSGANs and regular GANs to illustrate the stability of LSGANs.

[Paper] [Code]

Run Example

$ cd implementations/lsgan/
$ python3 lsgan.py

MUNIT

Multimodal Unsupervised Image-to-Image Translation

Authors

Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz

Abstract

Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to the state-of-the-art approaches further demonstrates the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at this https URL

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh edges2shoes
$ cd ../implementations/munit/
$ python3 munit.py --dataset_name edges2shoes

Results by varying the style code.​

Pix2Pix

Image-to-Image Translation with Conditional Adversarial Networks

Authors

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros

Abstract

We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. Indeed, since the release of the pix2pix software associated with this paper, a large number of internet users (many of them artists) have posted their own experiments with our system, further demonstrating its wide applicability and ease of adoption without the need for parameter tweaking. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh facades
$ cd ../implementations/pix2pix/
$ python3 pix2pix.py --dataset_name facades

Rows from top to bottom: (1) The condition for the generator (2) Generated image
based on the condition (3) The true corresponding image to the condition​

PixelDA

Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks

Authors

Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, Dilip Krishnan

Abstract

Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks. One appealing alternative is rendering synthetic data where ground-truth annotations are generated automatically. Unfortunately, models trained purely on rendered images often fail to generalize to real images. To address this shortcoming, prior work introduced unsupervised domain adaptation algorithms that attempt to map representations between the two domains or learn to extract features that are domain-invariant. In this work, we present a new approach that learns, in an unsupervised manner, a transformation in the pixel space from one domain to the other. Our generative adversarial network (GAN)-based method adapts source-domain images to appear as if drawn from the target domain. Our approach not only produces plausible samples, but also outperforms the state-of-the-art on a number of unsupervised domain adaptation scenarios by large margins. Finally, we demonstrate that the adaptation process generalizes to object classes unseen during training.

[Paper] [Code]

MNIST to MNIST-M Classification

Trains a classifier on images that have been translated from the source domain (MNIST) to the target domain (MNIST-M) using the annotations of the source domain images. The classification network is trained jointly with the generator network to optimize the generator for both providing a proper domain translation and also for preserving the semantics of the source domain image. The classification network trained on translated images is compared to the naive solution of training a classifier on MNIST and evaluating it on MNIST-M. The naive model manages a 55% classification accuracy on MNIST-M while the one trained during domain adaptation achieves a 95% classification accuracy.

$ cd implementations/pixelda/
$ python3 pixelda.py
Method    Accuracy
Naive     55%
PixelDA   95%

Rows from top to bottom: (1) Real images from MNIST (2) Translated images from
MNIST to MNIST-M (3) Examples of images from MNIST-M​

Relativistic GAN

The relativistic discriminator: a key element missing from standard GAN

Authors

Alexia Jolicoeur-Martineau

Abstract

In standard generative adversarial network (SGAN), the discriminator estimates the probability that the input data is real. The generator is trained to increase the probability that fake data is real. We argue that it should also simultaneously decrease the probability that real data is real because 1) this would account for a priori knowledge that half of the data in the mini-batch is fake, 2) this would be observed with divergence minimization, and 3) in optimal settings, SGAN would be equivalent to integral probability metric (IPM) GANs. We show that this property can be induced by using a relativistic discriminator which estimate the probability that the given real data is more realistic than a randomly sampled fake data. We also present a variant in which the discriminator estimate the probability that the given real data is more realistic than fake data, on average. We generalize both approaches to non-standard GAN loss functions and we refer to them respectively as Relativistic GANs (RGANs) and Relativistic average GANs (RaGANs). We show that IPM-based GANs are a subset of RGANs which use the identity function. Empirically, we observe that 1) RGANs and RaGANs are significantly more stable and generate higher quality data samples than their non-relativistic counterparts, 2) Standard RaGAN with gradient penalty generate data of better quality than WGAN-GP while only requiring a single discriminator update per generator update (reducing the time taken for reaching the state-of-the-art by 400%), and 3) RaGANs are able to generate plausible high resolutions images (256×256) from a very small sample (N=2011), while GAN and LSGAN cannot; these images are of significantly better quality than the ones generated by WGAN-GP and SGAN with spectral normalization.

[Paper] [Code]

Run Example

$ cd implementations/relativistic_gan/
$ python3 relativistic_gan.py # Relativistic Standard GAN
$ python3 relativistic_gan.py --rel_avg_gan # Relativistic Average GAN

Semi-Supervised GAN

Semi-Supervised Generative Adversarial Network

Authors

Augustus Odena

Abstract

We extend Generative Adversarial Networks (GANs) to the semi-supervised context by forcing the discriminator network to output class labels. We train a generative model G and a discriminator D on a dataset with inputs belonging to one of N classes. At training time, D is made to predict which of N+1 classes the input belongs to, where an extra class is added to correspond to the outputs of G. We show that this method can be used to create a more data-efficient classifier and that it allows for generating higher quality samples than a regular GAN.

[Paper] [Code]

Run Example

$ cd implementations/sgan/
$ python3 sgan.py

Softmax GAN

Softmax GAN

Authors

Min Lin

Abstract

Softmax GAN is a novel variant of Generative Adversarial Network (GAN). The key idea of Softmax GAN is to replace the classification loss in the original GAN with a softmax cross-entropy loss in the sample space of one single batch. In the adversarial learning of N real training samples and M generated samples, the target of discriminator training is to distribute all the probability mass to the real samples, each with probability 1/M, and distribute zero probability to generated data. In the generator training phase, the target is to assign equal probability to all data points in the batch, each with probability 1/(M+N). While the original GAN is closely related to Noise Contrastive Estimation (NCE), we show that Softmax GAN is the Importance Sampling version of GAN. We further demonstrate with experiments that this simple change stabilizes GAN training.

[Paper] [Code]

Run Example

$ cd implementations/softmax_gan/
$ python3 softmax_gan.py

StarGAN

StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

Authors

Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, Jaegul Choo

Abstract

Recent studies have shown remarkable success in image-to-image translation for two domains. However, existing approaches have limited scalability and robustness in handling more than two domains, since different models should be built independently for every pair of image domains. To address this limitation, we propose StarGAN, a novel and scalable approach that can perform image-to-image translations for multiple domains using only a single model. Such a unified model architecture of StarGAN allows simultaneous training of multiple datasets with different domains within a single network. This leads to StarGAN’s superior quality of translated images compared to existing models as well as the novel capability of flexibly translating an input image to any desired target domain. We empirically demonstrate the effectiveness of our approach on a facial attribute transfer and a facial expression synthesis tasks.

[Paper] [Code]

Run Example

$ cd implementations/stargan/
<follow steps at the top of stargan.py>
$ python3 stargan.py

Original | Black Hair | Blonde Hair | Brown Hair | Gender Flip | Aged​

Super-Resolution GAN

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

Authors

Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi

Abstract

Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.

[Paper] [Code]

Run Example

$ cd implementations/srgan/
<follow steps at the top of srgan.py>
$ python3 srgan.py

Nearest Neighbor Upsampling | SRGAN​

UNIT

Unsupervised Image-to-Image Translation Networks

Authors

Ming-Yu Liu, Thomas Breuel, Jan Kautz

Abstract

Unsupervised image-to-image translation aims at learning a joint distribution of images in different domains by using images from the marginal distributions in individual domains. Since there exists an infinite set of joint distributions that can arrive the given marginal distributions, one could infer nothing about the joint distribution from the marginal distributions without additional assumptions. To address the problem, we make a shared-latent space assumption and propose an unsupervised image-to-image translation framework based on Coupled GANs. We compare the proposed framework with competing approaches and present high quality image translation results on various challenging unsupervised image translation tasks, including street scene image translation, animal image translation, and face image translation. We also apply the proposed framework to domain adaptation and achieve state-of-the-art performance on benchmark datasets. Code and additional results are available in this https URL.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_cyclegan_dataset.sh apple2orange
$ cd ../implementations/unit/
$ python3 unit.py --dataset_name apple2orange

Wasserstein GAN

Wasserstein GAN

Authors

Martin Arjovsky, Soumith Chintala, Léon Bottou

Abstract

We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.

[Paper] [Code]

Run Example

$ cd implementations/wgan/
$ python3 wgan.py

Wasserstein GAN GP

Improved Training of Wasserstein GANs

Authors

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville

Abstract

Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only low-quality samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models over discrete data. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms.

[Paper] [Code]

Run Example

$ cd implementations/wgan_gp/
$ python3 wgan_gp.py

Wasserstein GAN DIV

Wasserstein Divergence for GANs

Authors

Jiqing Wu, Zhiwu Huang, Janine Thoma, Dinesh Acharya, Luc Van Gool

Abstract

In many domains of computer vision, generative adversarial networks (GANs) have achieved great success, among which the family of Wasserstein GANs (WGANs) is considered to be state-of-the-art due to the theoretical contributions and competitive qualitative performance. However, it is very challenging to approximate the k-Lipschitz constraint required by the Wasserstein-1 metric (W-met). In this paper, we propose a novel Wasserstein divergence (W-div), which is a relaxed version of W-met and does not require the k-Lipschitz constraint. As a concrete application, we introduce a Wasserstein divergence objective for GANs (WGAN-div), which can faithfully approximate W-div through optimization. Under various settings, including progressive growing training, we demonstrate the stability of the proposed WGAN-div owing to its theoretical and practical advantages over WGANs. Also, we study the quantitative and visual performance of WGAN-div on standard image synthesis benchmarks, showing the superior performance of WGAN-div compared to the state-of-the-art methods.

[Paper] [Code]

Run Example

$ cd implementations/wgan_div/
$ python3 wgan_div.py

YOLO Series (1)

What is YOLOv5


YOLO, an acronym for ‘You Only Look Once’, is an object detection algorithm that divides images into a grid system. Each cell in the grid is responsible for detecting objects within itself.

YOLO is one of the most famous object detection algorithms due to its speed and accuracy.

The History of YOLO


YOLOv5

Shortly after the release of YOLOv4, Glenn Jocher introduced YOLOv5 using the PyTorch framework.
The open source code is available on GitHub

Author: Glenn Jocher
Released: 18 May 2020

YOLOv4

With the original author's work on YOLO coming to a standstill, YOLOv4 was released by Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. The paper was titled YOLOv4: Optimal Speed and Accuracy of Object Detection

Author: Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao
Released: 23 April 2020

yolov4-tiny : https://arxiv.org/abs/2011.04244

yolov4-tiny code:

https://github.com/bubbliiiing/yolov4-tiny-pytorch

yolov4-tiny structure

YOLOv3

YOLOv3 improved on the YOLOv2 paper and both Joseph Redmon and Ali Farhadi, the original authors, contributed.
Together they published YOLOv3: An Incremental Improvement

The original YOLO papers are hosted here

Author: Joseph Redmon and Ali Farhadi
Released: 8 Apr 2018

YOLOv2

YOLOv2 was a joint endeavor by Joseph Redmon, the original author of YOLO, and Ali Farhadi.
Together they published YOLO9000: Better, Faster, Stronger

Author: Joseph Redmon and Ali Farhadi
Released: 25 Dec 2016

YOLOv1

YOLOv1 was released as a research paper by Joseph Redmon.
The paper was titled You Only Look Once: Unified, Real-Time Object Detection

Author: Joseph Redmon
Released: 8 Jun 2015

2021: A Year Full of Amazing AI papers – A Review

A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.

Source:

https://www.louisbouchard.ai/2021-ai-papers-review/

Table of contents

  • DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]
  • VOGUE: Try-On by StyleGAN Interpolation Optimization [2]
  • Taming Transformers for High-Resolution Image Synthesis [3]
  • Thinking Fast And Slow in AI [4]
  • Automatic detection and quantification of floating marine macro-litter in aerial images [5]
  • ShaRF: Shape-conditioned Radiance Fields from a Single View [6]
  • Generative Adversarial Transformers [7]
  • We Asked Artificial Intelligence to Create Dating Profiles. Would You Swipe Right? [8]
  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [9]
  • IMAGE GANS MEET DIFFERENTIABLE RENDERING FOR INVERSE GRAPHICS AND INTERPRETABLE 3D NEURAL RENDERING [10]
  • Deep nets: What have they ever done for vision? [11]
  • Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [12]
  • Portable, Self-Contained Neuroprosthetic Hand with Deep Learning-Based Finger Control [13]
  • Total Relighting: Learning to Relight Portraits for Background Replacement [14]
  • LASR: Learning Articulated Shape Reconstruction from a Monocular Video [15]
  • Enhancing Photorealism Enhancement [16]
  • DefakeHop: A Light-Weight High-Performance Deepfake Detector [17]
  • High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network [18]
  • Barbershop: GAN-based Image Compositing using Segmentation Masks [19]
  • TextStyleBrush: Transfer of text aesthetics from a single example [20]
  • Animating Pictures with Eulerian Motion Fields [21]
  • CVPR 2021 Best Paper Award: GIRAFFE – Controllable Image Generation [22]
  • GitHub Copilot & Codex: Evaluating Large Language Models Trained on Code [23]
  • Apple: Recognizing People in Photos Through Private On-Device Machine Learning [24]
  • Image Synthesis and Editing with Stochastic Differential Equations [25]
  • Sketch Your Own GAN [26]
  • Tesla’s Autopilot Explained [27]
  • Styleclip: Text-driven manipulation of StyleGAN imagery [28]
  • TimeLens: Event-based Video Frame Interpolation [29]
  • Diverse Generation from a Single Video Made Possible [30]
  • Skillful Precipitation Nowcasting using Deep Generative Models of Radar [31]
  • The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks [32]
  • ADOP: Approximate Differentiable One-Pixel Point Rendering [33]
  • (Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [34]
  • SwinIR: Image restoration using swin transformer [35]
  • EditGAN: High-Precision Semantic Image Editing [36]
  • CityNeRF: Building NeRF at City Scale [37]
  • ClipCap: CLIP Prefix for Image Captioning [38]
  • Paper references

A Parameterisable FPGA-tailored Architecture for YOLOv3-tiny

————– A ZYNQ implementation of YOLOv3-tiny

Abstract:

Experiments show that, when targeting a low-end FPGA device, the proposed architecture achieves a 290x latency improvement over the device's hard-core processor, with a 2.5 percentage-point drop in mAP compared with the original model (30.9% vs. 33.4%). The presented work opens the way to low-latency object detection on low-end FPGA devices.

Introduction:

Object detection is used to detect instances of objects in images and video. Its applications appear in advanced intelligent systems such as advanced driver-assistance systems (ADAS) and video surveillance. Accurate object classification and localisation are usually required, because this information forms the basis for further processing and decision making in the rest of the application pipeline. Recently, leveraging advances in machine learning, and in particular deep neural networks, researchers and practitioners have developed powerful object detection systems that deliver accurate detections in many challenging scenarios. Moreover, where lower processing latency is needed, the field has moved away from scanning an image at multiple positions with an image classifier (i.e. turning object detection into a multi-window classification problem) towards combining the different steps into a single pipeline, usually based on a deep neural network.

The novel contribution of this paper is a latency-optimised, parameterisable architecture tailored to the YOLOv3-tiny workload (its parameters can be customised to the available hardware resources), which can be tuned to the resource availability of any target FPGA device. To achieve this, a parameterisable architecture was developed and implemented with Vivado HLS. Performance and resource models are derived to guide a design space exploration (DSE) stage that identifies the design point minimising system latency while satisfying the resource constraints.
The work targets low-power, resource-limited FPGA devices and uses off-chip memory to store the network's parameters and intermediate results, enabling YOLOv3-tiny to be deployed even when resources are extremely limited.

Network structure:

YOLOv3-tiny takes 416x416 RGB images as input. Unlike YOLOv3, YOLOv3-tiny predicts bounding boxes at only two scales: the first divides the input image into a 13x13 grid, while the second operates on a 26x26 grid. The framework generates three bounding boxes per grid cell. The network outputs a 3-D tensor containing the bounding boxes, objectness confidences, and class predictions. It mainly uses five types of layers: convolution, max pooling, route, upsample, and YOLO layers. Route layers create separate streams through the network, upsampling supports the multiple detection scales, and the YOLO layers produce the output vectors.

YOLOv3 network model

The goal of this work is a latency-optimised, FPGA-tailored architecture that can be customised to the available FPGA resources to accelerate the inference stage of the YOLOv3-tiny model. The proposed architecture is written in HLS and parameterisable at compile time, targeting resource-limited, low-end FPGA devices, so the system does not impose a hard constraint of having enough on-chip memory to hold the data.

Module design:

The FPGA hardware accelerator consists of five main compute blocks: convolution, accumulation, max pooling, upsample, and YOLO blocks.

The FPGA hardware accelerator represents the proposed FPGA architecture and is organised as a three-stage pipeline in which each stage corresponds to a layer of the YOLOv3-tiny network. The accelerator is controlled by an ARM processor responsible for the overall control of the system. Data and weights are transferred between the accelerator and off-chip memory through a DMA interface. The first pipeline stage executes the convolutional layers; its output is accumulated in the second stage. Depending on which part of the network is being executed at a given time, the accumulated result is sent to the max pooling, upsample, or YOLO layer for further processing.

During inference, the ARM processor acts as the host and controls the inference process. The computation of the network is broken into smaller components, layer batches, which are scheduled by the processor and executed sequentially. Figure 4 captures the processing flow of a single layer batch. More specifically, the ARM processor first sets the parameters of each individual block in the hardware accelerator and configures the DMA module. It then starts the transfer of weights and input data to the hardware accelerator via DMA streams. The FPGA acceleration blocks process the data and transfer the output back to off-chip DDR memory. The necessary invalidation of the corresponding cache regions is performed by the processor to guarantee correct data transfers.

Experimental evaluation:

The proposed framework is evaluated on a Zedboard development kit with a Xilinx XC7Z020 SoC and 512 MB of DDR3. The programmable logic and the processing system are clocked at 100 MHz and 666.7 MHz respectively. The design space exploration stage targets full device utilisation, identifies several design points that satisfy the constraints imposed by the available resources, and predicts the latency of each point. The explored space is shown in Figure 5. The best-performing design achieves a latency of 532 ms per inference (measured on the board) and requires 185 BRAMs, 160 DSPs, 25.9k LUTs, and 46.7k flip-flops. The measured power consumption is 3.36 W.

论文:

Yu Z, Bouganis C S, Rincón F, et al. A Parameterisable FPGA-Tailored Architecture for YOLOv3-Tiny[C]//ARC. 2020: 330-344.

Improved Regularization of Convolutional Neural Networks with Cutout

–cutout 正则化

cutout,是一种数据增强的方法,主要应用于分类任务中。
  cutout的实现方法为:在图像中随机选取一个点作为中心点,覆盖一个固定大小的方形zero-mask;mask的边长是一个超参数,文中通过网格搜索确定;mask区域可以部分落在图像外。

cutout方法提出的出发点是作为一个正则化方法,防止CNN过拟合。cutout方法很简单,就是在训练时于随机位置应用一个方形的零掩码。
  作者认为这种技术鼓励网络去利用整个图片的信息,而不是依赖于小部分特定的视觉特征。

  相比于dropout,cutout更像是数据增强的一种手段,而不是添加噪声。

  在刚开始应用mask的时候,作者也尝试将mask施加在关键部位(激活值最大的区域),并得到了不错的结果(如下图所示)。但后来发现,随机移除固定大小的区域与有针对性地移除关键区域效果相当,所以之后都采用随机移除固定大小区域的策略。

 同时,作者发现zero-mask区域大小的选择比形状的选择更重要。大小在文中通过网格搜索确定,但实验都是在较小的数据集(CIFAR10/CIFAR100/SVHN)上进行的。在选择应用位置时,发现随机放置zero-mask(允许部分mask落在图像外)效果较好;作者解释为,允许部分mask落在图像外是获得良好性能的关键。
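下面给出一个最小的 numpy 实现示意(非论文官方代码;mask_size 即上文提到的超参数,例中取 16 仅作演示):

```python
import numpy as np

def cutout(image, mask_size, rng=None):
    """对 HWC 格式图像应用一次 cutout:随机选一个中心点,
    将边长为 mask_size 的方形区域置零;中心点可以落在靠近边界的位置,
    因此 mask 允许部分落在图像外(与原文一致)。"""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    half = mask_size // 2
    y1, y2 = max(0, cy - half), min(h, cy + half)
    x1, x2 = max(0, cx - half), min(w, cx + half)
    out = image.copy()
    out[y1:y2, x1:x2, ...] = 0          # 方形 zero-mask
    return out

# 例:对一张 32x32x3 的随机图像应用 16x16 的 cutout
img = np.random.rand(32, 32, 3).astype(np.float32)
aug = cutout(img, mask_size=16)
```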

Learning CNN-LSTM Architectures for Image论文阅读

圣诞快乐

Abstract:

自动描述图像的内容是连接计算机视觉和自然语言处理的人工智能基本问题。在本文中,我们提出了一个基于深度循环结构的生成模型,该模型结合了计算机视觉和机器翻译的最新进展,可用于生成描述图像的自然语句。模型的训练目标是在给定训练图像的情况下最大化目标描述语句的似然。在多个数据集上的实验表明了该模型的准确性,以及仅从图像描述中学到的语言的流畅性。我们的模型通常相当准确,这一点可以从定性和定量两方面验证。例如,在Pascal数据集上,当前最先进的BLEU-1得分(越高越好)是25,而我们的方法达到59(人类表现约为69)。我们还将Flickr30k上的BLEU-1得分从56提升到66,SBU上的得分从19提升到28。最后,在新发布的COCO数据集上,我们取得了27.7的BLEU-4,达到当前的最新水平。

Introduction:

能够用格式正确的英语句子自动描述图像内容是一项非常具有挑战性的任务,但它可能产生巨大影响,例如帮助视障人士更好地理解网络上的图像内容。此任务比研究充分的图像分类或目标识别任务要困难得多,而后者一直是计算机视觉领域的主要关注点[27]。实际上,描述不仅必须捕获图像中包含的对象,还必须表达这些对象之间的相互关系、它们的属性以及它们所参与的活动。此外,上述语义知识还必须用自然语言(例如英语)表达出来,这意味着除了视觉理解外还需要一个语言模型。

以前的大多数尝试都是将上述子问题的现有解决方案组合在一起,以便从图像生成描述[6,16]。相反,我们在这项工作中提出一个联合模型:该模型以图像 I 作为输入,经过训练以最大化产生目标单词序列 S = S_1, S_2, ... 的似然 p(S|I),其中每个单词 S_t 来自给定的字典,整个序列充分描述该图像。

我们工作的主要灵感来自机器翻译的最新进展:其任务是通过最大化 p(T|S),将以源语言书写的句子 S 翻译为目标语言的句子 T。多年以来,机器翻译同样是由一系列独立子任务完成的(分别进行单词翻译、词对齐、重排序等),但最近的工作表明,使用循环神经网络(RNN)可以用更简单的方式完成翻译[3,2,30],并且仍能达到最先进的性能。“编码器”RNN读取源语句并将其转换为丰富的固定长度向量表示,然后将其用作“解码器”RNN的初始隐藏状态,由解码器生成目标语句。

在这里,我们建议遵循这种优雅的方案,用深度卷积神经网络(CNN)代替编码器RNN。过去几年的研究已令人信服地表明,CNN可以把输入图像嵌入为一个固定长度的向量,从而得到丰富的图像表示,并且这种表示可用于各种视觉任务[28]。因此,很自然地把CNN用作图像“编码器”:先对其做图像分类任务的预训练,然后将最后一个隐藏层作为生成语句的RNN解码器的输入(见图1)。我们将此模型称为神经图像描述生成器(Neural Image Caption,NIC)。

我们的贡献如下。首先,我们提出了一个端到端的系统来解决该问题:它是一个可以用随机梯度下降训练的神经网络。其次,我们的模型结合了视觉与语言两方面最先进的子网络,它们可以在更大的语料库上进行预训练,从而利用额外数据。最后,与最先进的方法相比,它的性能显著提高。例如,在Pascal数据集上,NIC的BLEU得分为59,而当前最新水平为25,人类表现约为69;在Flickr30k上,我们把得分从56提高到66;在SBU上,从19提高到28。

Related Work:

从视觉数据生成自然语言描述的问题在计算机视觉中已被研究很久,但主要针对视频[7,32]。这催生了一类复杂系统:由视觉基元识别器与结构化形式语言(例如 And-Or 图或逻辑系统)组成,再经基于规则的系统转换为自然语言。这类系统是手工设计的,相对脆弱,并且只在交通场景、体育描述等有限领域中得到过验证。

带有自然文本的静止图像描述问题最近引起了人们的关注。借助目标、属性和位置识别方面的最新进展,我们已经可以驱动自然语言生成系统,尽管这些系统的表达能力仍然有限。Farhadi 等人 [6] 使用检测推断场景元素的三元组,并用模板将其转换为文本。类似地,Li 等人 [19] 从检测结果出发,用包含检测到的对象和关系的短语拼出最终描述。Kulkarni 等人 [16] 使用了比三元组更复杂的检测图,但同样采用基于模板的文本生成。此外还有一些基于语言解析的更强大的语言模型 [23,1,17,18,5]。上述方法已经能够“in the wild”地描述图像,但在文本生成方面,它们经过大量手工设计且较为僵硬。

大量工作研究了对给定图像的候选描述进行排序的问题[11,8,24]。这类方法基于将图像和文本共同嵌入到同一向量空间的想法:对于一张查询图像,检索在嵌入空间中与其接近的描述。与本文最接近的是用神经网络共同嵌入图像和句子[29],甚至共同嵌入图像区域(image crops)和句子片段[13]的工作,但它们并未尝试生成新的描述。通常,即使单个对象在训练数据中出现过,上述方法也无法描述此前未见过的对象组合;而且它们回避了评估所生成描述质量这一问题。

在这项工作中,我们将用于图像分类的深度卷积网络[12]与用于序列建模的循环网络[10]相结合,构建一个生成图像描述的单一网络。RNN 在这个单一的“端到端”网络中进行训练。该模型的灵感来自机器翻译中序列生成的最新成功[3,2,30],区别在于我们提供的不是输入句子,而是由卷积网络处理过的图像。与之最相关的工作是 Kiros 等人 [15],他们使用神经网络(但是前馈网络)根据图像和前一个单词预测下一个单词。Mao 等人最近的工作 [21] 使用循环神经网络完成同样的预测任务,与本文非常相似,但有若干重要区别:我们使用表达能力更强的 RNN 模型,并把视觉输入直接馈入 RNN,使 RNN 能够跟踪文本所描述的对象。正是这些看似细微的差异,使我们的系统在既有基准上取得了明显更好的结果。最后,Kiros 等人 [14] 提出利用强大的计算机视觉模型和编码文本的 LSTM 构建联合多模态嵌入空间。与我们的方法不同,他们用两条独立的路径(一条用于图像、一条用于文本)定义联合嵌入,并且即便可以生成文本,其方法也是针对排序任务高度调优的。

Model:

在本文中,我们提出一种神经且概率化的框架,用于从图像生成描述。统计机器翻译的最新进展表明,给定一个强大的序列模型,只需在给定输入句子的情况下直接最大化正确翻译的概率,就可以以“端到端”的方式(训练与推理皆然)取得最先进的结果。这些模型使用循环神经网络把变长输入编码为固定维度的向量,再用这一表示“解码”出所需的输出语句。因此,很自然地沿用同一方法:把给定的图像(而不是源语言的输入句子)“翻译”成它的描述。

Thus, we propose to directly maximize the probability of the correct description given the image by using the following formulation:
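按论文的记号,此处的公式(1)大致为(θ 为模型参数,求和遍历所有图像-句子训练对):

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta) \tag{1}$$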

where θ are the parameters of our model, I is an image, and S its correct transcription. Since S represents any sentence, its length is unbounded (无限的). Thus, it is common to apply the chain rule to model the joint probability over S_0, ..., S_N, where N is the length of this particular example, as
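对应的链式展开(即正文所说的公式(2))大致为:

$$\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1}) \tag{2}$$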

where we dropped the dependency on θ for convenience. At training time, (S, I) is a training example pair, and we optimize the sum of the log probabilities as described in (2) over the whole training set using stochastic gradient descent (further training details are given in Section 4).

It is natural to model p(S_t | I, S_0, ..., S_{t-1}) with a Recurrent Neural Network (RNN), where the variable number of words we condition upon up to t−1 is expressed by a fixed length hidden state or memory h_t. This memory is updated after seeing a new input x_t by using a non-linear function f:

$$h_{t+1} = f(h_t, x_t)$$

为了使上述RNN更具体,需要做出两个关键设计选择:f 的具体形式是什么,以及如何把图像和单词作为输入 x_t 馈入。对于 f,我们使用长短时记忆(LSTM)网络,它在翻译等序列任务上已展现出最先进的性能。下一节将概述该模型。

对于图像表示,我们使用卷积神经网络(CNN)。CNN 已被广泛应用并深入研究于各类图像任务,是目前目标识别和检测的最新技术。我们选用的 CNN 采用了新颖的批归一化方法,并在 ILSVRC 2014 分类竞赛中取得了当时的最佳表现[12]。此外,它们已被证明可以通过迁移学习推广到其他任务,例如场景分类[4]。单词则用嵌入模型表示。

LSTM-based Sentence Generator

在(3)中 f 的选择取决于它处理梯度消失和梯度爆炸的能力[10],这是设计和训练 RNN 时最常见的挑战。为了解决这一挑战,人们引入了一种称为 LSTM 的特殊循环网络[10],并成功应用于翻译[3,30]和序列生成[9]。LSTM 模型的核心是记忆单元 c,它在每个时间步编码“到该步为止观察到了哪些输入”这一知识(参见图2)。单元的行为由“门”控制:门是逐元素相乘的层,门为 1 时保留被门控的值,门为 0 时将其置零。具体来说,使用三个门分别控制是否遗忘当前单元值(遗忘门 f)、是否读取输入(输入门 i)以及是否输出新的单元值(输出门 o)。门的定义以及单元更新和输出如下:
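按论文的记号,门控、单元更新与输出大致为(各 W 为训练得到的参数矩阵,σ 为 Sigmoid,h 为双曲正切,⊙ 为按元素相乘):

$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{im} m_{t-1}) \\
f_t &= \sigma(W_{fx} x_t + W_{fm} m_{t-1}) \\
o_t &= \sigma(W_{ox} x_t + W_{om} m_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot h(W_{cx} x_t + W_{cm} m_{t-1}) \\
m_t &= o_t \odot c_t \\
p_{t+1} &= \operatorname{Softmax}(m_t)
\end{aligned}
$$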

其中 ⊙ 表示按元素(门值)相乘,各个 W 矩阵都是训练得到的参数。正是这样的乘法门使训练出鲁棒的 LSTM 成为可能,因为它们能很好地处理梯度爆炸和梯度消失[10]。非线性函数分别为 Sigmoid σ(·) 和双曲正切 h(·)。最后一个方程中的 m_t 会被送入 Softmax,产生所有单词上的概率分布 p_{t+1}。

LSTM模型经过训练,用于在看到图像以及所有先前单词(由 p(S_t | I, S_0, ..., S_{t-1}) 定义)之后预测句子中的每个单词。为此,以展开形式来看 LSTM 会更直观:为图像和每个句子单词分别创建一份 LSTM 存储器的副本,所有副本共享相同的参数,并且 t−1 时刻 LSTM 的输出 m_{t−1} 会在 t 时刻馈入 LSTM(见图3)。在展开版本中,所有循环连接都转换为前馈连接。更详细地说,如果用 I 表示输入图像,用 S = (S_0, ..., S_N) 表示描述该图像的真实句子,则展开过程为:
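按论文的记号,展开过程大致为(CNN(I) 为图像编码,W_e 为词嵌入矩阵):

$$
\begin{aligned}
x_{-1} &= \operatorname{CNN}(I) \\
x_t &= W_e\, S_t, \quad t \in \{0, \ldots, N-1\} \\
p_{t+1} &= \operatorname{LSTM}(x_t), \quad t \in \{0, \ldots, N-1\}
\end{aligned}
$$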

在这里,我们把每个单词表示为 one-hot 向量 S_t,其维数等于字典大小。注意,我们用 S_0 表示特殊的开始词,用 S_N 表示特殊的停止词,分别标记句子的开头和结尾;通过发出停止词,LSTM 表明已生成完整句子。图像和单词被映射到同一空间:图像用视觉 CNN 映射,单词用词嵌入矩阵 W_e 映射。图像 I 只在 t = −1 时输入一次,用于告知 LSTM 图像内容。我们通过实验验证,如果在每个时间步都把图像作为额外输入,网络会显式利用图像中的噪声而更容易过拟合,因此结果反而更差。

Our loss is the sum of the negative log likelihood of the correct word at each step as follows:
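按上述定义,损失函数大致为:

$$L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)$$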

The above loss is minimized w.r.t. all the parameters of the LSTM, the top layer of the image embedder CNN, and the word embeddings W_e.

使用NIC,有多种方法可以为给定图像生成句子。第一种是采样(sampling):先根据 p_1 采样第一个单词,把对应的嵌入作为输入,再根据 p_2 采样下一个单词,如此继续,直到采样到特殊的句子结束标记或达到最大长度。第二种方法是束搜索(BeamSearch):迭代地把直到时间 t 为止最好的 k 个句子作为候选,生成长度为 t+1 的句子,并只保留其中最好的 k 个。这更接近于 S = arg max_{S'} p(S'|I)。在后续实验中,我们使用束宽为 20 的束搜索;若把束宽设为 1(即贪婪搜索),BLEU 平均下降约 2 个点。
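下面用一个玩具语言模型示意束搜索的流程(并非论文或上面 GitHub 仓库的实现;step_log_probs 在真实系统中应由以图像特征为条件的 LSTM 解码器给出):

```python
# 玩具"语言模型":给定前一个词,返回下一个词的 log 概率
TOY_LM = {
    "<start>": {"a": -0.2, "dog": -2.0, "runs": -3.0, "<end>": -4.0},
    "a":       {"dog": -0.3, "runs": -2.5, "<end>": -3.0},
    "dog":     {"runs": -0.4, "<end>": -1.5},
    "runs":    {"<end>": -0.2, "a": -3.0},
}

def step_log_probs(prefix):
    """真实系统中应由 LSTM 在给定图像编码的条件下输出整个词表的分布。"""
    return TOY_LM.get(prefix[-1], {"<end>": 0.0})

def beam_search(beam_size=3, max_len=10):
    beams = [(["<start>"], 0.0)]                 # (部分句子, 累积 log 概率)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<end>":               # 已结束的句子直接保留
                finished.append((seq, score))
                continue
            for word, lp in step_log_probs(seq).items():
                candidates.append((seq + [word], score + lp))
        if not candidates:
            break
        # 只保留累积得分最高的 beam_size 个部分句子
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(b for b in beams if b[0][-1] == "<end>")
    return max(finished or beams, key=lambda c: c[1])

print(beam_search())   # 期望输出类似 (['<start>', 'a', 'dog', 'runs', '<end>'], -1.1)
```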

整体结构:

包括 Encoder、Decoder、Attention、Beam Search(束搜索)

Putting it all together

Generation Results

我们在表1和表2中报告了所有相关数据集上的主要结果。由于PASCAL没有训练集,所以我们使用在MSCOCO上训练的系统(对于这个任务来说,MSCOCO可能是规模最大、质量最高的数据集)。PASCAL和SBU上此前的最好结果并没有使用基于深度学习的图像特征,因此可以说,这些分数上的巨大进步有很大一部分仅仅来自这一改变。Flickr数据集直到最近才被使用[11,21,14],且大多是在检索框架下评估的。一个值得注意的例外是[21],他们同时做检索和生成,并在Flickr数据集上取得了迄今最好的性能。

表2中的人类评分是用其中一条人工字幕与另外四条字幕比较计算出来的。我们对五位标注者逐一这样计算,并对他们的BLEU分数取平均。由于BLEU分数是针对5条参考句而非4条计算的,这会给我们的系统带来一点优势,因此我们把使用5条参考句相对于4条的平均差值加回到人类分数上。

鉴于该领域在过去几年中取得了重大进展,我们认为报告BLEU-4更有意义,这也是机器翻译今后的标准做法。此外,我们还在表1中报告了与人工评估相关性更好的度量指标。尽管最近已有构建更好评价指标的努力[31],我们的模型相对人类评分表现良好;然而,当用人工评分员来评估我们的描述时(见4.3.6节),我们的模型表现更差,这表明还需要更多工作来得到更好的指标。在官方测试集上(其标注只能通过官方网站获取),我们的模型取得了27.2的BLEU-4。

Conclusion

我们提出了NIC,一个端到端的神经网络系统,可以自动查看图像并生成合理的描述。NIC基于一个把图像编码为紧凑表示的卷积神经网络,后接一个生成相应句子的循环神经网络。模型的训练目标是最大化给定图像时句子的似然。在多个数据集上的实验表明,NIC在定性结果(生成的句子非常合理)和定量评估上都表现稳健:既可以使用排序指标,也可以使用机器翻译中用于评估生成句子质量的BLEU指标。这些实验清楚地表明,随着图像描述可用数据集规模的增加,NIC这类方法的性能也会提高。此外,研究如何利用无监督数据(无论是纯图像还是纯文本)来改进图像描述方法也会很有意义。

github实现:https://github.com/sgrvinod/Deep-Tutorials-for-PyTorch

References
[1] A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In ACL, 2010.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[3] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[5] D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, 2013.
[6] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[7] R. Gerber and H.-H. Nagel. Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences. In ICIP. IEEE, 1996.
[8] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014.
[9] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
[11] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47, 2013.
[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
[13] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, 2014.
[14] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014.
[15] R. Kiros and R. Z. R. Salakhutdinov. Multimodal neural language models. In NIPS Deep Learning Workshop, 2013.
[16] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[17] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.
[18] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi. Treetalk: Composition and compression of trees for image descriptions. ACL, 2(10), 2014.
[19] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In Conference on Computational Natural Language Learning, 2011.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. arXiv:1405.0312, 2014.
[21] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Explain images with multimodal recurrent neural networks. arXiv:1410.1090, 2014.
[22] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.
[23] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. C. Berg, K. Yamaguchi, T. L. Berg, K. Stratos, and H. D. III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
[24] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[25] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[26] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using amazon’s mechanical turk. In NAACL HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 139–147, 2010.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
[28] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.
[29] R. Socher, A. Karpathy, Q. V. Le, C. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. In ACL, 2014.
[30] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[31] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. arXiv:1411.5726, 2015.
[32] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2t: Image parsing to text description. Proceedings of the IEEE, 98(8), 2010.
[33] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In ACL, 2014.
[34] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv:1409.2329, 2014.