SSD原理与实现

SSD属于一阶段、anchor-based 目标检测

基于anchor-based的技术包括一个阶段和两个阶段的检测。其中一阶段的检测技术包括SSD,DSSD,RetinaNet,RefineDet,YOLOV3等,二阶段技术包括Faster-RCNN,R-FCN,FPN,Cascade R-CNN,SNIP等。一般的,两个阶段的目标检测会比一个阶段的精度要高,但一个阶段的算法速度会更快。

 anchor-based类算法代表是fasterRCNN、SSD、YoloV2/V3等

目标检测近年来已经取得了很重要的进展,主流的算法主要分为两个类型(参考RefineDet):(1)two-stage方法,如R-CNN系算法,其主要思路是先通过启发式方法(selective search)或者CNN网络(RPN)产生一系列稀疏的候选框,然后对这些候选框进行分类与回归,two-stage方法的优势是准确度高;(2)one-stage方法,如Yolo和SSD,其主要思路是均匀地在图片的不同位置进行密集抽样,抽样时可以采用不同尺度和长宽比,然后利用CNN提取特征后直接进行分类与回归,整个过程只需要一步,所以其优势是速度快,但是均匀的密集采样的一个重要缺点是训练比较困难,这主要是因为正样本与负样本(背景)极其不均衡(参见Focal Loss),导致模型准确度稍低。

需要掌握的知识:

1、边界框

在⽬标检测中,我们通常使⽤边界框(bounding box)来描述对象的空间位置。边界框是矩形的,由矩形左上⻆的 x 和 y 坐标以及右下⻆的坐标决定。另⼀种常⽤的边界框表⽰⽅法是边界框中⼼的 (x, y) 轴坐标以及框的宽度和⾼度

2、锚框

⽬标检测算法通常会在输⼊图像中采样⼤量的区域,然后判断这些区域中是否包含我们感兴趣的⽬标,并调整区域边缘从而更准确地预测⽬标的真实边界框(ground-truth bounding box)。不同的模型使⽤的区域采样⽅法可能不同。这⾥我们介绍其中的⼀种⽅法:它以每个像素为中⼼⽣成多个⼤小和宽⾼⽐(aspect ratio)不同的边界框。这些边界框被称为锚框(anchor box)

3、交并⽐(IoU)

直观地说,我们可以衡量锚框和真实边界框之间的相似性。我们知道 Jaccard 系数可以衡量
两组之间的相似性。给定集合 A 和 B,他们的 Jaccard 系数是他们交集的⼤小除以他们并集的⼤小:$$J\left( A,B\right) =\dfrac{\left| A\cap B\right| }{\left| A\right| U\left| B\right| }$$ 事实上,我们可以将任何边界框的像素区域视为⼀组像素。通过这种⽅式,我们可以通过其像素集的 Jaccard索引来测量两个边界框的相似性。对于两个边界框,我们通常将他们的 Jaccard 指数称为 交并⽐ (intersectionover union,IoU),即两个边界框相交⾯积与相并⾯积之⽐

4、标注训练数据的锚框

在训练集中,我们将每个锚框视为⼀个训练样本。为了训练⽬标检测模型,我们需要每个锚框的类别(class)和偏移量(offset)标签,其中前者是与锚框相关的对象的类别,后者是真实边界框相对于锚框的偏移量。在预测期间,我们为每个图像⽣成多个锚框,预测所有锚框的类和偏移量,根据预测的偏移量调整它们的位置以获得预测的边界框,最后只输出符合特定条件的预测边界框。

5、将真实边界框分配给锚框

给定图像, 假设锚框是 \(A_{1}, A_{2}, \ldots, A_{n_{a}}\) , 真实边界框是 \(B_{1}, B_{2}, \ldots, B_{n_{b}}\) , 其中 \( n_{a} \geq n_{b}\) 。让我们定义一个矩 阵 \(\mathbf{X} \in \mathbb{R}^{n_{a} \times n_{b}}\), 其中 \(i^{\text {th }}\) 行和 \(j^{\text {th }}\) 列中的元素 \(x_{i j}\)是针框 \(A_{i}\) 和真实边界框 \(B_{j}\) 的 \(\mathrm{IoU}\) 。该算法包含以下步骤:

  1. 在矩阵 \(\mathbf{X}\) 中找到最大的元素, 并将它的行索引和列索引分别表示为 \(i_{1}\) 和 \(j_{1}\) 。然后将真实边界框 \(B_{j_{1}}\) 分 配给针框 \(A_{i_{1}}\) 。这很直观,因为 \(A_{i_{1}}\) 和 \(B_{j_{1}}\) 是所有针框和真实边界框配对中最相近的。在第一个分配完 成后,丢弃矩阵中 \(i_{1}{ }^{\text {th }}\) 行和 \(j_{1}{ }^{\text {th }}\) 列中的所有元素。
  2. 在矩阵 \(\mathbf{X}\) 中找到剩余元素中最大的元素, 并将它的行索引和列索引分别表示为 \(i_{2}\) 和 \(j_{2}\) 。我们将真实边 界框 \(B_{j_{2}}\) 分配给针框 \(A_{i_{2}}\), 并丢弃矩阵中 \(i_{2}{ }^{\text {th }}\) 行和 \(j_{2}{ }^{\text {th }}\) 列中的所有元素。
  3. 此时,矩阵 \(\mathbf{X}\) 中两行和两列中的元素已被丢弃。我们继续, 直到丢弃掉矩阵 \(\mathbf{X}\) 中 \(n_{b}\) 列中的所有元素。 此时,我们已经为这 \(n_{b}\) 个针框各自分配了一个真实边界框。
  4. 只遍历剩下的 \(n_{a}-n_{b}\) 个针框。例如,给定任何针框 \(A_{i}\), 在矩阵 \(\mathbf{X}\)的第 \(i^{\text {th }}\) 行中找到与 \(A_{i}\) 的IoU最大的 真实边界框 \(B_{j}\), 只有当此 IoU 大于预定义的阈值时, 才将 \(B_{j}\) 分配给 \(A_{i \circ}\)

6、标记类和偏移

现在我们可以为每个锚框标记分类和偏移量了。假设⼀个锚框 A 被分配了⼀个真实边界框 B。⼀⽅⾯,锚框A 的类将被标记为与 B 相同。另⼀⽅⾯,锚框 A 的偏移量将根据 B 和 A 中⼼坐标的相对位置、以及这两个框的相对⼤小进⾏标记。鉴于数据集内不同的框的位置和⼤小不同,我们可以对那些相对位置和⼤小应⽤变换,使其获得更均匀分布、易于适应的偏移量。在这⾥,我们介绍⼀种常⻅的变换。给定框 A 和 B,中⼼坐标分别为 (xa, ya) 和 (xb, yb),宽度分别为 wa 和 wb,⾼度分别为 ha 和 hb。我们可以将 A 的偏移量标记为

$$
\left(\frac{\frac{x_{b}-x_{a}}{w_{a}}-\mu_{x}}{\sigma_{x}}, \frac{\frac{y_{b}-y_{a}}{h_{a}}-\mu_{y}}{\sigma_{y}}, \frac{\log \frac{w_{b}}{w_{a}}-\mu_{w}}{\sigma_{w}}, \frac{\log \frac{h_{b}}{h_{a}}-\mu_{h}}{\sigma_{h}}\right)
$$

$$
\text { 其中常量的默认值是 } \mu_{x}=\mu_{y}=\mu_{w}=\mu_{h}=0, \sigma_{x}=\sigma_{y}=0.1 \text { 和 } \sigma_{w}=\sigma_{h}=0.2 \text { 。 }
$$

7、⽤⾮极⼤值抑制预测边界框

在预测期间,我们先为图像⽣成多个锚框,再为这些锚框⼀⼀预测类别和偏移量。⼀个“预测好的边界框”则根据其中某个带有预测偏移量的锚框而⽣成。当有许多锚框时,可能会输出许多相似的具有明显重叠的预测边界框,都围绕着同⼀⽬标。为了简化输出,我们可以使⽤ ⾮极⼤值抑制 (non-maximum suppression,NMS)合并属于同⼀⽬标的类似的预测边界框。以下是⾮极⼤值抑制的⼯作原理。对于⼀个预测边界框 B,⽬标检测模型会计算每个类的预测概率。假设最⼤的预测概率为 p ,则该概率所对应的类别 B 即为预测的类别。具体来说,我们将 p 称为预测边界框 B 的置信度。在同⼀张图像中,所有预测的⾮背景边界框都按置信度降序排序,以⽣成列表 L。然后我们通过以下步骤操作排序列表 L:

  1. 从 L 中选取置信度最高的预测边界框 \(B_{1}\) 作为基准,然后将所有与 \(B_{1}\) 的IoU 超过预定阈值\(\epsilon\) 的非基准 预测边界框从 L 中移除。这时, L 保留了置信度最高的预测边界框,去除了与其太过相似的其他预测 边界框。简而言之,那些具有 非极大值置信度的边界框被 抑制了。
  2. 从 L 中选取置信度第二高的预测边界框 \(B_{2}\) 作为又一个基准,然后将所有与 \(B_{2}\)的IoU大于 \(\epsilon\)的非基准 预测边界框从 L 中移除。
  3. 重复上述过程,直到 L 中的所有预测边界框都曾被用作基准。此时, L中任意一对预测边界框的IoU都 小于阈值 \(\epsilon\); 因此,没有一对边界框过于相似。
  4. 输出列表 L中的所有预测边界框。

SSD原理:

在了解上述概念后,开始实现SSD(Single Shot MultiBox Detector)

SSD和Yolo一样都是采用一个CNN网络来进行检测,但是却采用了多尺度的特征图,其基本架构如图3所示。下面将SSD核心设计理念总结为以下三点:

(1)采用多尺度特征图用于检测

所谓多尺度采用大小不同的特征图,CNN网络一般前面的特征图比较大,后面会逐渐采用stride=2的卷积或者pool来降低特征图大小,这正如图3所示,一个比较大的特征图和一个比较小的特征图,它们都用来做检测。这样做的好处是比较大的特征图来用来检测相对较小的目标,而小的特征图负责检测大目标。

(2)采用卷积进行检测

与Yolo最后采用全连接层不同,SSD直接采用卷积对不同的特征图来进行提取检测结果。对于形状为 [公式] 的特征图,只需要采用 [公式] 这样比较小的卷积核得到检测值。

(3)设置先验框

在Yolo中,每个单元预测多个边界框,但是其都是相对这个单元本身(正方块),但是真实目标的形状是多变的,Yolo需要在训练过程中自适应目标的形状。而SSD借鉴了Faster R-CNN中anchor的理念,每个单元设置尺度或者长宽比不同的先验框预测的边界框(bounding boxes)是以这些先验框为基准的,在一定程度上减少训练难度。一般情况下,每个单元会设置多个先验框,其尺度和长宽比存在差异,如图5所示,可以看到每个单元使用了4个不同的先验框,图片中猫和狗分别采用最适合它们形状的先验框来进行训练,后面会详细讲解训练过程中的先验框匹配原则

也就是说,在上面4中所说的 “标注训练数据的锚框” ,这里的 框在SSD中就是先验框。

网络结构

SSD采用VGG16作为基础模型,然后在VGG16的基础上新增了卷积层来获得更多的特征图以用于检测。SSD的网络结构如图5所示。上面是SSD模型,下面是Yolo模型,可以明显看到SSD利用了多尺度的特征图做检测。

得到了特征图之后,需要对特征图进行卷积得到检测结果

下图给出了一个 5*5大小的特征图的检测过程。其中Priorbox是得到先验框,前面已经介绍了生成规则。检测值包含两个部分:类别置信度和边界框位置,各采用一次3*3 卷积来进行完成。

训练过程

(1)先验框匹配
在训练过程中,首先要确定训练图片中的ground truth(真实目标)与哪个先验框来进行匹配,与之匹配的先验框所对应的边界框将负责预测它。在Yolo中,ground truth的中心落在哪个单元格,该单元格中与其IOU最大的边界框负责预测它。但是在SSD中却完全不一样,SSD的先验框与ground truth的匹配原则主要有两点。首先,对于图片中每个ground truth,找到与其IOU最大的先验框,该先验框与其匹配,这样,可以保证每个ground truth一定与某个先验框匹配。通常称与ground truth匹配的先验框为正样本(其实应该是先验框对应的预测box,不过由于是一一对应的就这样称呼了),反之,若一个先验框没有与任何ground truth进行匹配,那么该先验框只能与背景匹配,就是负样本。一个图片中ground truth是非常少的, 而先验框却很多,如果仅按第一个原则匹配,很多先验框会是负样本,正负样本极其不平衡,所以需要第二个原则。第二个原则是:对于剩余的未匹配先验框,若某个ground truth的 IOU大于某个阈值(一般是0.5),那么该先验框也与这个ground truth进行匹配。这意味着某个ground truth可能与多个先验框匹配,这是可以的。但是反过来却不可以,因为一个先验框只能匹配一个ground truth,如果多个ground truth与某个先验框 IOU大于阈值,那么先验框只与IOU最大的那个ground truth进行匹配。第二个原则一定在第一个原则之后进行,仔细考虑一下这种情况,如果某个ground truth所对应最大 IOU小于阈值,并且所匹配的先验框却与另外一个ground truth的IOU大于阈值,那么该先验框应该匹配谁,答案应该是前者,首先要确保某个ground truth一定有一个先验框与之匹配。但是,这种情况我觉得基本上是不存在的。由于先验框很多,某个ground truth的最大 IOU肯定大于阈值,所以可能只实施第二个原则既可以了,这里的TensorFlow版本就是只实施了第二个原则,但是这里的Pytorch两个原则都实施了

(2)损失函数
训练样本确定了,然后就是损失函数了。损失函数定义为位置误差(locatization loss, loc)与置信度误差(confidence loss, conf)的加权和:

$$
L(x, c, l, g)=\frac{1}{N}\left(L_{c o n f}(x, c)+\alpha L_{l o c}(x, l, g)\right)
$$
其中 N是先验框的正样本数量。这里\(x_{i j}^{p} \in{1,0}\) 为一个指示参数,当 \(x_{i j}^{p}=1\)时表示第 i 个先验框与第 j 个ground truth匹配,并且ground truth的类别为 p 。 c 为类别置信度预 测值。 l为先验框的所对应边界框的位置预测值,而 g 是ground truth的位置参数。对于位置误 差,其采用Smooth L1 loss,定义如下:

对于置信度误差,其采用softmax loss:

⽬标检测有两种类型的损失。第⼀种有关锚框类别的损失:我们可以简单地重⽤之前图像分类问题⾥⼀直使⽤的交叉熵损失函数来计算;第⼆种有关正类锚框偏移量的损失:预测偏移量是⼀个回归问题, 使⽤ L1 范数损失,即预测值和真实值之差的绝对值

3)数据扩增

采用数据扩增(Data Augmentation)可以提升SSD的性能,主要采用的技术有水平翻转(horizontal flip),随机裁剪加颜色扭曲(random crop & color distortion),随机采集块域(Randomly sample a patch)(获取小目标训练样本)

预测过程

预测过程比较简单,对于每个预测框,首先根据类别置信度确定其类别(置信度最大者)与置信度值,并过滤掉属于背景的预测框。然后根据置信度阈值(如0.5)过滤掉阈值较低的预测框。对于留下的预测框进行解码,根据先验框得到其真实的位置参数(解码后一般还需要做clip,防止预测框位置超出图片)。解码之后,一般需要根据置信度进行降序排列,然后仅保留top-k(如400)个预测框。最后就是进行NMS算法,过滤掉那些重叠度较大的预测框。最后剩余的预测框就是检测结果了。

性能评估

首先整体看一下SSD在VOC2007,VOC2012及COCO数据集上的性能,如表1所示。相比之下,SSD512的性能会更好一些。加*的表示使用了image expansion data augmentation(通过zoom out来创造小的训练样本)技巧来提升SSD在小目标上的检测效果,所以性能会有所提升。

表1 SSD在不同数据集上的性能

SSD与其它检测算法的对比结果(在VOC2007数据集)如表2所示,基本可以看到,SSD与Faster R-CNN有同样的准确度,并且与Yolo具有同样较快地检测速度。

表2 SSD与其它检测算法的对比结果(在VOC2007数据集)

文章还对SSD的各个trick做了更为细致的分析,表3为不同的trick组合对SSD的性能影响,从表中可以得出如下结论:

  • 数据扩增技术很重要,对于mAP的提升很大;
  • 使用不同长宽比的先验框可以得到更好的结果;
表3 不同的trick组合对SSD的性能影响

同样的,采用多尺度的特征图用于检测也是至关重要的,这可以从表4中看出:

表4 多尺度特征图对SSD的影响

参考: https://zhuanlan.zhihu.com/p/33544892

KL散度(相对熵)

定义:

若P和Q是定义在同一概率空间的两个概率测度,定义P和Q的相对熵为:

$$D\left( P\| Q\right) =\sum _{x}P\left( x\right) \log \dfrac{p\left( x\right) }{Q\left( x\right) }$$

性质:\( D\left( P\| Q\right) \) >=0

  在概率论或信息论中,KL散度( Kullback–Leibler divergence),又称相对熵(relative entropy),是描述两个概率分布P和Q差异的一种方法。它是非对称的,这意味着D(P||Q) ≠ D(Q||P)。特别的,在信息论中,D(P||Q)表示当用概率分布Q来拟合真实分布P时,产生的信息损耗,其中P表示真实分布,Q表示P的拟合分布。有人将KL散度称为KL距离,但事实上,KL散度并不满足距离的概念,应为:1)KL散度不是对称的;2)KL散度不满足三角不等式。

KL散度在信息论中有自己明确的物理意义,它是用来度量使用基于Q分布的编码来编码来自P分布的样本平均所需的额外的Bit个数。而其在机器学习领域的物理意义则是用来度量两个函数的相似程度或者相近程度.

基于P的编码去编码来自P的样本,其最优编码平均所需要的比特个数(即这个字符集的熵)为:

$$H\left( x\right) =\sum _{x}P\left( x\right) \log \left( \dfrac{1}{p\left( x\right) }\right)$$

用基于P的编码去编码来自Q的样本,则所需要的比特个数变为:

$$H’\left( x\right) =\sum _{x}P\left( x\right) \log \left( \dfrac{1}{Q\left( x\right) }\right)$$

于是,我们即可得出P与Q的KL散度:

$$D\left( P\| Q\right) = H\left( x\right) -H’\left( x\right)$$

生成对抗网络系列论文+pytorch实现

github链接:

https://github.com/eriklindernoren/PyTorch-GAN

This repository has gone stale as I unfortunately do not have the time to maintain it anymore. If you would like to continue the development of it as a collaborator send me an email at eriklindernoren@gmail.com.

PyTorch-GAN

Collection of PyTorch implementations of Generative Adversarial Network varieties presented in research papers. Model architectures will not always mirror the ones proposed in the papers, but I have chosen to focus on getting the core ideas covered instead of getting every layer configuration right. Contributions and suggestions of GANs to implement are very welcomed.

See also: Keras-GAN

Table of Contents

Installation

$ git clone https://github.com/eriklindernoren/PyTorch-GAN
$ cd PyTorch-GAN/
$ sudo pip3 install -r requirements.txt

Implementations

Auxiliary Classifier GAN

Auxiliary Classifier Generative Adversarial Network

Authors

Augustus Odena, Christopher Olah, Jonathon Shlens

Abstract

Synthesizing high resolution photorealistic images has been a long-standing challenge in machine learning. In this paper we introduce new methods for the improved training of generative adversarial networks (GANs) for image synthesis. We construct a variant of GANs employing label conditioning that results in 128×128 resolution image samples exhibiting global coherence. We expand on previous work for image quality assessment to provide two new analyses for assessing the discriminability and diversity of samples from class-conditional image synthesis models. These analyses demonstrate that high resolution samples provide class information not present in low resolution samples. Across 1000 ImageNet classes, 128×128 samples are more than twice as discriminable as artificially resized 32×32 samples. In addition, 84.7% of the classes have samples exhibiting diversity comparable to real ImageNet data.

[Paper] [Code]

Run Example

$ cd implementations/acgan/
$ python3 acgan.py

Adversarial Autoecoder

Adversarial Autoencoder

Authors

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, Brendan Frey

Abstract

n this paper, we propose the “adversarial autoencoder” (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) to perform variational inference by matching the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. As a result, the decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We performed experiments on MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.

[Paper] [Code]

Run Example

$ cd implementations/aae/
$ python3 aae.py

BEGAN

BEGAN: Boundary Equilibrium Generative Adversarial Networks

Authors

David Berthelot, Thomas Schumm, Luke Metz

Abstract

We propose a new equilibrium enforcing method paired with a loss derived from the Wasserstein distance for training auto-encoder based Generative Adversarial Networks. This method balances the generator and discriminator during training. Additionally, it provides a new approximate convergence measure, fast and stable training and high visual quality. We also derive a way of controlling the trade-off between image diversity and visual quality. We focus on the image generation task, setting a new milestone in visual quality, even at higher resolutions. This is achieved while using a relatively simple model architecture and a standard training procedure.

[Paper] [Code]

Run Example

$ cd implementations/began/
$ python3 began.py

BicycleGAN

Toward Multimodal Image-to-Image Translation

Authors

Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, Eli Shechtman

Abstract

Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a \emph{distribution} of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh edges2shoes
$ cd ../implementations/bicyclegan/
$ python3 bicyclegan.py

Various style translations by varying the latent code.​

Boundary-Seeking GAN

Boundary-Seeking Generative Adversarial Networks

Authors

R Devon Hjelm, Athul Paul Jacob, Tong Che, Adam Trischler, Kyunghyun Cho, Yoshua Bengio

Abstract

Generative adversarial networks (GANs) are a learning framework that rely on training a discriminator to estimate a measure of difference between a target and generated distributions. GANs, as normally formulated, rely on the generated samples being completely differentiable w.r.t. the generative parameters, and thus do not work for discrete data. We introduce a method for training GANs with discrete data that uses the estimated difference measure from the discriminator to compute importance weights for generated samples, thus providing a policy gradient for training the generator. The importance weights have a strong connection to the decision boundary of the discriminator, and we call our method boundary-seeking GANs (BGANs). We demonstrate the effectiveness of the proposed algorithm with discrete image and character-based natural language generation. In addition, the boundary-seeking objective extends to continuous data, which can be used to improve stability of training, and we demonstrate this on Celeba, Large-scale Scene Understanding (LSUN) bedrooms, and Imagenet without conditioning.

[Paper] [Code]

Run Example

$ cd implementations/bgan/
$ python3 bgan.py

Cluster GAN

ClusterGAN: Latent Space Clustering in Generative Adversarial Networks

Authors

Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, Sreeram Kannan

Abstract

Generative Adversarial networks (GANs) have obtained remarkable success in many unsupervised learning tasks and unarguably, clustering is an important unsupervised learning problem. While one can potentially exploit the latent-space back-projection in GANs to cluster, we demonstrate that the cluster structure is not retained in the GAN latent space. In this paper, we propose ClusterGAN as a new mechanism for clustering using GANs. By sampling latent variables from a mixture of one-hot encoded variables and continuous latent variables, coupled with an inverse network (which projects the data to the latent space) trained jointly with a clustering specific loss, we are able to achieve clustering in the latent space. Our results show a remarkable phenomenon that GANs can preserve latent space interpolation across categories, even though the discriminator is never exposed to such vectors. We compare our results with various clustering baselines and demonstrate superior performance on both synthetic and real datasets.

[Paper] [Code]

Code based on a full PyTorch [implementation].

Run Example

$ cd implementations/cluster_gan/
$ python3 clustergan.py

Conditional GAN

Conditional Generative Adversarial Nets

Authors

Mehdi Mirza, Simon Osindero

Abstract

Generative Adversarial Nets [8] were recently introduced as a novel way to train generative models. In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data, y, we wish to condition on to both the generator and discriminator. We show that this model can generate MNIST digits conditioned on class labels. We also illustrate how this model could be used to learn a multi-modal model, and provide preliminary examples of an application to image tagging in which we demonstrate how this approach can generate descriptive tags which are not part of training labels.

[Paper] [Code]

Run Example

$ cd implementations/cgan/
$ python3 cgan.py

Context-Conditional GAN

Semi-Supervised Learning with Context-Conditional Generative Adversarial Networks

Authors

Emily Denton, Sam Gross, Rob Fergus

Abstract

We introduce a simple semi-supervised learning approach for images based on in-painting using an adversarial loss. Images with random patches removed are presented to a generator whose task is to fill in the hole, based on the surrounding pixels. The in-painted images are then presented to a discriminator network that judges if they are real (unaltered training images) or not. This task acts as a regularizer for standard supervised training of the discriminator. Using our approach we are able to directly train large VGG-style networks in a semi-supervised fashion. We evaluate on STL-10 and PASCAL datasets, where our approach obtains performance comparable or superior to existing methods.

[Paper] [Code]

Run Example

$ cd implementations/ccgan/
$ python3 ccgan.py

Context Encoder

Context Encoders: Feature Learning by Inpainting

Authors

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros

Abstract

We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders — a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

[Paper] [Code]

Run Example

$ cd implementations/context_encoder/
<follow steps at the top of context_encoder.py>
$ python3 context_encoder.py

Rows: Masked | Inpainted | Original | Masked | Inpainted | Original​

Coupled GAN

Coupled Generative Adversarial Networks

Authors

Ming-Yu Liu, Oncel Tuzel

Abstract

We propose coupled generative adversarial network (CoGAN) for learning a joint distribution of multi-domain images. In contrast to the existing approaches, which require tuples of corresponding images in different domains in the training set, CoGAN can learn a joint distribution without any tuple of corresponding images. It can learn a joint distribution with just samples drawn from the marginal distributions. This is achieved by enforcing a weight-sharing constraint that limits the network capacity and favors a joint distribution solution over a product of marginal distributions one. We apply CoGAN to several joint distribution learning tasks, including learning a joint distribution of color and depth images, and learning a joint distribution of face images with different attributes. For each task it successfully learns the joint distribution without any tuple of corresponding images. We also demonstrate its applications to domain adaptation and image transformation.

[Paper] [Code]

Run Example

$ cd implementations/cogan/
$ python3 cogan.py

Generated MNIST and MNIST-M images​

CycleGAN

Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Authors

Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros

Abstract

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G:X→Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F:Y→X and introduce a cycle consistency loss to push F(G(X))≈X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_cyclegan_dataset.sh monet2photo
$ cd ../implementations/cyclegan/
$ python3 cyclegan.py --dataset_name monet2photo

Monet to photo translations.​

Deep Convolutional GAN

Deep Convolutional Generative Adversarial Network

Authors

Alec Radford, Luke Metz, Soumith Chintala

Abstract

In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks – demonstrating their applicability as general image representations.

[Paper] [Code]

Run Example

$ cd implementations/dcgan/
$ python3 dcgan.py

DiscoGAN

Learning to Discover Cross-Domain Relations with Generative Adversarial Networks

Authors

Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim

Abstract

While humans easily recognize relations between data from different domains without any supervision, learning to automatically discover them is in general very challenging and needs many ground-truth pairs that illustrate the relations. To avoid costly pairing, we address the task of discovering cross-domain relations given unpaired data. We propose a method based on generative adversarial networks that learns to discover relations between different domains (DiscoGAN). Using the discovered relations, our proposed network successfully transfers style from one domain to another while preserving key attributes such as orientation and face identity.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh edges2shoes
$ cd ../implementations/discogan/
$ python3 discogan.py --dataset_name edges2shoes

Rows from top to bottom: (1) Real image from domain A (2) Translated image from
domain A (3) Reconstructed image from domain A (4) Real image from domain B (5)
Translated image from domain B (6) Reconstructed image from domain B​

DRAGAN

On Convergence and Stability of GANs

Authors

Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira

Abstract

We propose studying GAN training dynamics as regret minimization, which is in contrast to the popular view that there is consistent minimization of a divergence between real and generated distributions. We analyze the convergence of GAN training from this new point of view to understand why mode collapse happens. We hypothesize the existence of undesirable local equilibria in this non-convex game to be responsible for mode collapse. We observe that these local equilibria often exhibit sharp gradients of the discriminator function around some real data points. We demonstrate that these degenerate local equilibria can be avoided with a gradient penalty scheme called DRAGAN. We show that DRAGAN enables faster training, achieves improved stability with fewer mode collapses, and leads to generator networks with better modeling performance across a variety of architectures and objective functions.

[Paper] [Code]

Run Example

$ cd implementations/dragan/
$ python3 dragan.py

DualGAN

DualGAN: Unsupervised Dual Learning for Image-to-Image Translation

Authors

Zili Yi, Hao Zhang, Ping Tan, Minglun Gong

Abstract

Conditional Generative Adversarial Networks (GANs) for cross-domain image-to-image translation have made much progress recently. Depending on the task complexity, thousands to millions of labeled image pairs are needed to train a conditional GAN. However, human labeling is expensive, even impractical, and large quantities of data may not always be available. Inspired by dual learning from natural language translation, we develop a novel dual-GAN mechanism, which enables image translators to be trained from two sets of unlabeled images from two domains. In our architecture, the primal GAN learns to translate images from domain U to those in domain V, while the dual GAN learns to invert the task. The closed loop made by the primal and dual tasks allows images from either domain to be translated and then reconstructed. Hence a loss function that accounts for the reconstruction error of images can be used to train the translators. Experiments on multiple image translation tasks with unlabeled data show considerable performance gain of DualGAN over a single GAN. For some tasks, DualGAN can even achieve comparable or slightly better results than conditional GAN trained on fully labeled data.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh facades
$ cd ../implementations/dualgan/
$ python3 dualgan.py --dataset_name facades

Energy-Based GAN

Energy-based Generative Adversarial Network

Authors

Junbo Zhao, Michael Mathieu, Yann LeCun

Abstract

We introduce the “Energy-based Generative Adversarial Network” model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discriminator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of EBGAN framework as using an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.

[Paper] [Code]

Run Example

$ cd implementations/ebgan/
$ python3 ebgan.py

Enhanced Super-Resolution GAN

ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks

Authors

Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, Xiaoou Tang

Abstract

The Super-Resolution Generative Adversarial Network (SRGAN) is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts. To further enhance the visual quality, we thoroughly study three key components of SRGAN – network architecture, adversarial loss and perceptual loss, and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery. Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge. The code is available at this https URL.

[Paper] [Code]

Run Example

$ cd implementations/esrgan/
<follow steps at the top of esrgan.py>
$ python3 esrgan.py

Nearest Neighbor Upsampling | ESRGAN​

GAN

Generative Adversarial Network

Authors

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio

Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

[Paper] [Code]

Run Example

$ cd implementations/gan/
$ python3 gan.py

InfoGAN

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Authors

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel

Abstract

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.

[Paper] [Code]

Run Example

$ cd implementations/infogan/
$ python3 infogan.py

Result of varying categorical latent variable by column.​

Result of varying continuous latent variable by row.​

Least Squares GAN

Least Squares Generative Adversarial Networks

Authors

Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, Stephen Paul Smolley

Abstract

Unsupervised learning with generative adversarial networks (GANs) has proven hugely successful. Regular GANs hypothesize the discriminator as a classifier with the sigmoid cross entropy loss function. However, we found that this loss function may lead to the vanishing gradients problem during the learning process. To overcome such a problem, we propose in this paper the Least Squares Generative Adversarial Networks (LSGANs) which adopt the least squares loss function for the discriminator. We show that minimizing the objective function of LSGAN yields minimizing the Pearson χ2 divergence. There are two benefits of LSGANs over regular GANs. First, LSGANs are able to generate higher quality images than regular GANs. Second, LSGANs perform more stable during the learning process. We evaluate LSGANs on five scene datasets and the experimental results show that the images generated by LSGANs are of better quality than the ones generated by regular GANs. We also conduct two comparison experiments between LSGANs and regular GANs to illustrate the stability of LSGANs.

[Paper] [Code]

Run Example

$ cd implementations/lsgan/
$ python3 lsgan.py

MUNIT

Multimodal Unsupervised Image-to-Image Translation

Authors

Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz

Abstract

Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to the state-of-the-art approaches further demonstrates the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at this https URL

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh edges2shoes
$ cd ../implementations/munit/
$ python3 munit.py --dataset_name edges2shoes

Results by varying the style code.​

Pix2Pix

Unpaired Image-to-Image Translation with Conditional Adversarial Networks

Authors

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros

Abstract

We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. Indeed, since the release of the pix2pix software associated with this paper, a large number of internet users (many of them artists) have posted their own experiments with our system, further demonstrating its wide applicability and ease of adoption without the need for parameter tweaking. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh facades
$ cd ../implementations/pix2pix/
$ python3 pix2pix.py --dataset_name facades

Rows from top to bottom: (1) The condition for the generator (2) Generated image
based of condition (3) The true corresponding image to the condition​

PixelDA

Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks

Authors

Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, Dilip Krishnan

Abstract

Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks. One appealing alternative is rendering synthetic data where ground-truth annotations are generated automatically. Unfortunately, models trained purely on rendered images often fail to generalize to real images. To address this shortcoming, prior work introduced unsupervised domain adaptation algorithms that attempt to map representations between the two domains or learn to extract features that are domain-invariant. In this work, we present a new approach that learns, in an unsupervised manner, a transformation in the pixel space from one domain to the other. Our generative adversarial network (GAN)-based method adapts source-domain images to appear as if drawn from the target domain. Our approach not only produces plausible samples, but also outperforms the state-of-the-art on a number of unsupervised domain adaptation scenarios by large margins. Finally, we demonstrate that the adaptation process generalizes to object classes unseen during training.

[Paper] [Code]

MNIST to MNIST-M Classification

Trains a classifier on images that have been translated from the source domain (MNIST) to the target domain (MNIST-M) using the annotations of the source domain images. The classification network is trained jointly with the generator network to optimize the generator for both providing a proper domain translation and also for preserving the semantics of the source domain image. The classification network trained on translated images is compared to the naive solution of training a classifier on MNIST and evaluating it on MNIST-M. The naive model manages a 55% classification accuracy on MNIST-M while the one trained during domain adaptation achieves a 95% classification accuracy.

$ cd implementations/pixelda/
$ python3 pixelda.py
MethodAccuracy
Naive55%
PixelDA95%

Rows from top to bottom: (1) Real images from MNIST (2) Translated images from
MNIST to MNIST-M (3) Examples of images from MNIST-M​

Relativistic GAN

The relativistic discriminator: a key element missing from standard GAN

Authors

Alexia Jolicoeur-Martineau

Abstract

In standard generative adversarial network (SGAN), the discriminator estimates the probability that the input data is real. The generator is trained to increase the probability that fake data is real. We argue that it should also simultaneously decrease the probability that real data is real because 1) this would account for a priori knowledge that half of the data in the mini-batch is fake, 2) this would be observed with divergence minimization, and 3) in optimal settings, SGAN would be equivalent to integral probability metric (IPM) GANs. We show that this property can be induced by using a relativistic discriminator which estimate the probability that the given real data is more realistic than a randomly sampled fake data. We also present a variant in which the discriminator estimate the probability that the given real data is more realistic than fake data, on average. We generalize both approaches to non-standard GAN loss functions and we refer to them respectively as Relativistic GANs (RGANs) and Relativistic average GANs (RaGANs). We show that IPM-based GANs are a subset of RGANs which use the identity function. Empirically, we observe that 1) RGANs and RaGANs are significantly more stable and generate higher quality data samples than their non-relativistic counterparts, 2) Standard RaGAN with gradient penalty generate data of better quality than WGAN-GP while only requiring a single discriminator update per generator update (reducing the time taken for reaching the state-of-the-art by 400%), and 3) RaGANs are able to generate plausible high resolutions images (256×256) from a very small sample (N=2011), while GAN and LSGAN cannot; these images are of significantly better quality than the ones generated by WGAN-GP and SGAN with spectral normalization.

[Paper] [Code]

Run Example

$ cd implementations/relativistic_gan/
$ python3 relativistic_gan.py # Relativistic Standard GAN
$ python3 relativistic_gan.py --rel_avg_gan # Relativistic Average GAN

Semi-Supervised GAN

Semi-Supervised Generative Adversarial Network

Authors

Augustus Odena

Abstract

We extend Generative Adversarial Networks (GANs) to the semi-supervised context by forcing the discriminator network to output class labels. We train a generative model G and a discriminator D on a dataset with inputs belonging to one of N classes. At training time, D is made to predict which of N+1 classes the input belongs to, where an extra class is added to correspond to the outputs of G. We show that this method can be used to create a more data-efficient classifier and that it allows for generating higher quality samples than a regular GAN.

[Paper] [Code]

Run Example

$ cd implementations/sgan/
$ python3 sgan.py

Softmax GAN

Softmax GAN

Authors

Min Lin

Abstract

Softmax GAN is a novel variant of Generative Adversarial Network (GAN). The key idea of Softmax GAN is to replace the classification loss in the original GAN with a softmax cross-entropy loss in the sample space of one single batch. In the adversarial learning of N real training samples and M generated samples, the target of discriminator training is to distribute all the probability mass to the real samples, each with probability 1M, and distribute zero probability to generated data. In the generator training phase, the target is to assign equal probability to all data points in the batch, each with probability 1M+N. While the original GAN is closely related to Noise Contrastive Estimation (NCE), we show that Softmax GAN is the Importance Sampling version of GAN. We futher demonstrate with experiments that this simple change stabilizes GAN training.

[Paper] [Code]

Run Example

$ cd implementations/softmax_gan/
$ python3 softmax_gan.py

StarGAN

StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

Authors

Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, Jaegul Choo

Abstract

Recent studies have shown remarkable success in image-to-image translation for two domains. However, existing approaches have limited scalability and robustness in handling more than two domains, since different models should be built independently for every pair of image domains. To address this limitation, we propose StarGAN, a novel and scalable approach that can perform image-to-image translations for multiple domains using only a single model. Such a unified model architecture of StarGAN allows simultaneous training of multiple datasets with different domains within a single network. This leads to StarGAN’s superior quality of translated images compared to existing models as well as the novel capability of flexibly translating an input image to any desired target domain. We empirically demonstrate the effectiveness of our approach on a facial attribute transfer and a facial expression synthesis tasks.

[Paper] [Code]

Run Example

$ cd implementations/stargan/
<follow steps at the top of stargan.py>
$ python3 stargan.py

Original | Black Hair | Blonde Hair | Brown Hair | Gender Flip | Aged​

Super-Resolution GAN

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

Authors

Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi

Abstract

Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.

[Paper] [Code]

Run Example

$ cd implementations/srgan/
<follow steps at the top of srgan.py>
$ python3 srgan.py

Nearest Neighbor Upsampling | SRGAN​

UNIT

Unsupervised Image-to-Image Translation Networks

Authors

Ming-Yu Liu, Thomas Breuel, Jan Kautz

Abstract

Unsupervised image-to-image translation aims at learning a joint distribution of images in different domains by using images from the marginal distributions in individual domains. Since there exists an infinite set of joint distributions that can arrive the given marginal distributions, one could infer nothing about the joint distribution from the marginal distributions without additional assumptions. To address the problem, we make a shared-latent space assumption and propose an unsupervised image-to-image translation framework based on Coupled GANs. We compare the proposed framework with competing approaches and present high quality image translation results on various challenging unsupervised image translation tasks, including street scene image translation, animal image translation, and face image translation. We also apply the proposed framework to domain adaptation and achieve state-of-the-art performance on benchmark datasets. Code and additional results are available in this https URL.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_cyclegan_dataset.sh apple2orange
$ cd implementations/unit/
$ python3 unit.py --dataset_name apple2orange

Wasserstein GAN

Wasserstein GAN

Authors

Martin Arjovsky, Soumith Chintala, Léon Bottou

Abstract

We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.

[Paper] [Code]

Run Example

$ cd implementations/wgan/
$ python3 wgan.py

Wasserstein GAN GP

Improved Training of Wasserstein GANs

Authors

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville

Abstract

Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only low-quality samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models over discrete data. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms.

[Paper] [Code]

Run Example

$ cd implementations/wgan_gp/
$ python3 wgan_gp.py

Wasserstein GAN DIV

Wasserstein Divergence for GANs

Authors

Jiqing Wu, Zhiwu Huang, Janine Thoma, Dinesh Acharya, Luc Van Gool

Abstract

In many domains of computer vision, generative adversarial networks (GANs) have achieved great success, among which the fam- ily of Wasserstein GANs (WGANs) is considered to be state-of-the-art due to the theoretical contributions and competitive qualitative performance. However, it is very challenging to approximate the k-Lipschitz constraint required by the Wasserstein-1 metric (W-met). In this paper, we propose a novel Wasserstein divergence (W-div), which is a relaxed version of W-met and does not require the k-Lipschitz constraint.As a concrete application, we introduce a Wasserstein divergence objective for GANs (WGAN-div), which can faithfully approximate W-div through optimization. Under various settings, including progressive growing training, we demonstrate the stability of the proposed WGAN-div owing to its theoretical and practical advantages over WGANs. Also, we study the quantitative and visual performance of WGAN-div on standard image synthesis benchmarks, showing the superior performance of WGAN-div compared to the state-of-the-art methods.

[Paper] [Code]

Run Example

$ cd implementations/wgan_div/
$ python3 wgan_div.py

YOLO系列(一)

What is YOLOv5


YOLO an acronym for ‘You only look once’, is an object detection algorithm that divides images into a grid system. Each cell in the grid is responsible for detecting objects within itself.

YOLO is one of the most famous object detection algorithms due to its speed and accuracy.

The History of YOLO


YOLOv5

Shortly after the release of YOLOv4 Glenn Jocher introduced YOLOv5 using the Pytorch framework.
The open source code is available on GitHub

Author: Glenn Jocher
Released: 18 May 2020

YOLOv4

With the original authors work on YOLO coming to a standstill, YOLOv4 was released by Alexey Bochoknovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. The paper was titled YOLOv4: Optimal Speed and Accuracy of Object Detection

Author: Alexey BochoknovskiyChien-Yao Wang, and Hong-Yuan Mark Liao
Released: 23 April 2020

yolov4-tiny : https://arxiv.org/abs/2011.04244

code yolov4-tiny

https://github.com/bubbliiiing/yolov4-tiny-pytorch

yolov4 tiny结构

YOLOv3

YOLOv3 improved on the YOLOv2 paper and both Joseph Redmon and Ali Farhadi, the original authors, contributed.
Together they published YOLOv3: An Incremental Improvement

The original YOLO papers were are hosted here

Author: Joseph Redmon and Ali Farhadi
Released: 8 Apr 2018

YOLOv2

YOLOv2 was a joint endevor by Joseph Redmon the original author of YOLO and Ali Farhadi.
Together they published YOLO9000:Better, Faster, Stronger

Author: Joseph Redmon and Ali Farhadi
Released: 25 Dec 2016

YOLOv1

YOLOv1 was released as a research paper by Joseph Redmon.
The paper was titled You Only Look Once: Unified, Real-Time Object Detection

Author: Joseph Redmon
Released: 8 Jun 2015

mAP—–目标检测

在评价一个目标检测算法的“好坏”程度的时候,往往采用的是pascal voc 2012的评价标准mAP。

官方定义:

  1. Compute a version of the measured precision/recall curve with precision monotonically decreasing, by setting the precision for recall r to the maximum precision obtained for any recall r′ ≥ r.
  2. Compute the AP as the area under this curve by numerical integration. No approximation is involved since the curve is piecewise constant.

Note that prior to 2010 the AP is computed by sampling the monotonically

decreasing curve at a fixed set of uniformly-spaced recall values 0, 0.1, 0.2, . . . , 1. By contrast, VOC2010–2012 effectively samples the curve at all unique recall values.

mAP(均值平均精度)是目标检测中的最常用的评价指标,详细理解其计算方式有助于我们评估算法的有效性,并针对评测指标对算法进行调整。mean Average Precision, 即各类别AP的平均值

TP TN FP FN:

T或者N代表的是该样本是否被分类分对,P或者N代表的是该样本被分为什么

TP(True Positives)意思我们倒着来翻译就是“被分为正样本,并且分对了”,TN(True Negatives)意思是“被分为负样本,而且分对了”,FP(False Positives)意思是“被分为正样本,但是分错了”,FN(False Negatives)意思是“被分为负样本,但是分错了”。

按下图来解释,左半矩形是正样本,右半矩形是负样本。一个2分类器,在图上画了个圆,分类器认为圆内是正样本,圆外是负样本。那么左半圆分类器认为是正样本,同时它确实是正样本,那么就是“被分为正样本,并且分对了”即TP,左半矩形扣除左半圆的部分就是分类器认为它是负样本,但是它本身却是正样本,就是“被分为负样本,但是分错了”即FN。右半圆分类器认为它是正样本,但是本身却是负样本,那么就是“被分为正样本,但是分错了”即FP。右半矩形扣除右半圆的部分就是分类器认为它是负样本,同时它本身确实是负样本,那么就是“被分为负样本,而且分对了”即TN。

Precision(精度)和Recall(召回率)的概念:

Precision(精度 )翻译成中文就是“分类器认为是正类并且确实是正类的部分占所有分类器认为是正类的比例”,衡量的是一个分类器分出来的正类的确是正类的概率。两种极端情况就是,如果精度是100%,就代表所有分类器分出来的正类确实都是正类。如果精度是0%,就代表分类器分出来的正类没一个是正类。光是精度还不能衡量分类器的好坏程度,比如50个正样本和50个负样本,我的分类器把49个正样本和50个负样本都分为负样本,剩下一个正样本分为正样本,这样我的精度也是100%,但是傻子也知道这个分类器很垃圾。

Recall(召回率) ,翻译成中文就是“分类器认为是正类并且确实是正类的部分占所有确实是正类的比例”,衡量的是一个分类能把所有的正类都找出来的能力。两种极端情况,如果召回率是100%,就代表所有的正类都被分类器分为正类。如果召回率是0%,就代表没一个正类被分为正类。

IOU(交并比)

它是模型所预测的检测框(bbox)和真实的检测框(ground truth)的交集和并集之间的比例

IOU
def Iou(rec1,rec2):
  x1,x2,y1,y2 = rec1 #分别是第一个矩形左右上下的坐标
  x3,x4,y3,y4 = rec2 #分别是第二个矩形左右上下的坐标
  area_1 = (x2-x1)*(y1-y2)
  area_2 = (x4-x3)*(y3-y4)
  sum_area = area_1 + area_2
  w1 = x2 - x1#第一个矩形的宽
  w2 = x4 - x3#第二个矩形的宽
  h1 = y1 - y2
  h2 = y3 - y4
  W = min(x1,x2,x3,x4)+w1+w2-max(x1,x2,x3,x4)#交叉部分的宽
  H = min(y1,y2,y3,y4)+h1+h2-max(y1,y2,y3,y4)#交叉部分的高
  Area = W*H#交叉的面积
  Iou = Area/(sum_area-Area)
  return Iou

举例计算mAP

有3张图如下,要求算法找出face。蓝色框代表标签label,绿色框代表算法给出的结果pre,旁边的红色小字代表置信度。设定第一张图的检出框叫pre1,第一张的标签框叫label1。第二张、第三张同理。

首先我们计算每张图的pre和label的IOU,根据IOU是否大于0.5来判断该pre是属于TP(分成正样本,分对了)还是属于FP(被分为正样本,但是分错了)。显而易见,pre1是TP,pre2是FP,pre3是TP。

根据每个pre的置信度进行从高到低排序,这里pre1、pre2、pre3置信度刚好就是从高到低。

在不同置信度阈值下获得Precision和Recall:

首先,设置阈值为0.9,无视所有小于0.9的pre。那么检测器检出的所有框pre即TP+FP=1,并且pre1是TP,那么Precision=1/1。因为所有的label=3,所以Recall=1/3。这样就得到一组P、R值。

然后,设置阈值为0.8,无视所有小于0.8的pre。那么检测器检出的所有框pre即TP+FP=2,因为pre1是TP,pre2是FP,那么Precision=1/2=0.5。因为所有的label=3,所以Recall=1/3=0.33。这样就又得到一组P、R值。

再然后,设置阈值为0.7,无视所有小于0.7的pre。那么检测器检出的所有框pre即TP+FP=3,因为pre1是TP,pre2是FP,pre3是TP,那么Precision=2/3=0.67。因为所有的label=3,所以Recall=2/3=0.67。这样就又得到一组P、R值。

根据上面3组PR值绘制PR曲线如下。然后每个“峰值点”往左画一条线段直到与上一个峰值点的垂直线相交。这样画出来的红色线段与坐标轴围起来的面积就是AP值。

计算mAP


AP衡量的是对一个类检测好坏,mAP就是对多个类的检测好坏。就是简单粗暴的把所有类的AP值取平均就好了。比如有两类,类A的AP值是0.5,类B的AP值是0.2,那么mAP=(0.5+0.2)/2=0.35

计算mAP的github地址:

https://github.com/Cartucho/mAP

# AP的计算
def _average_precision(self, rec, prec):
    """
    Params:
    ----------
    rec : numpy.array
            cumulated recall
    prec : numpy.array
            cumulated precision
    Returns:
    ----------
    ap as float
    """
    if rec is None or prec is None:
        return np.nan
    ap = 0.
    for t in np.arange(0., 1.1, 0.1):  #十一个点的召回率,对应精度最大值
        if np.sum(rec >= t) == 0:
            p = 0
        else:
            p = np.max(np.nan_to_num(prec)[rec >= t])
        ap += p / 11.  #加权平均
    return ap

正确理解 YOLO 的辨识准确率

2021: A Year Full of Amazing AI papers – A Review

A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.

comefrom:

https://www.louisbouchard.ai/2021-ai-papers-review/

Table of content

  • DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]
  • VOGUE: Try-On by StyleGAN Interpolation Optimization [2]
  • Taming Transformers for High-Resolution Image Synthesis [3]
  • Thinking Fast And Slow in AI [4]
  • Automatic detection and quantification of floating marine macro-litter in aerial images [5]
  • ShaRF: Shape-conditioned Radiance Fields from a Single View [6]
  • Generative Adversarial Transformers [7]
  • We Asked Artificial Intelligence to Create Dating Profiles. Would You Swipe Right? [8]
  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [9]
  • IMAGE GANS MEET DIFFERENTIABLE RENDERING FOR INVERSE GRAPHICS AND INTERPRETABLE 3D NEURAL RENDERING [10]
  • Deep nets: What have they ever done for vision? [11]
  • Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [12]
  • Portable, Self-Contained Neuroprosthetic Hand with Deep Learning-Based Finger Control [13]
  • Total Relighting: Learning to Relight Portraits for Background Replacement [14]
  • LASR: Learning Articulated Shape Reconstruction from a Monocular Video [15]
  • Enhancing Photorealism Enhancement [16]
  • DefakeHop: A Light-Weight High-Performance Deepfake Detector [17]
  • High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network [18]
  • Barbershop: GAN-based Image Compositing using Segmentation Masks [19]
  • TextStyleBrush: Transfer of text aesthetics from a single example [20]
  • Animating Pictures with Eulerian Motion Fields [21]
  • CVPR 2021 Best Paper Award: GIRAFFE – Controllable Image Generation [22]
  • GitHub Copilot & Codex: Evaluating Large Language Models Trained on Code [23]
  • Apple: Recognizing People in Photos Through Private On-Device Machine Learning [24]
  • Image Synthesis and Editing with Stochastic Differential Equations [25]
  • Sketch Your Own GAN [26]
  • Tesla’s Autopilot Explained [27]
  • Styleclip: Text-driven manipulation of StyleGAN imagery [28]
  • TimeLens: Event-based Video Frame Interpolation [29]
  • Diverse Generation from a Single Video Made Possible [30]
  • Skillful Precipitation Nowcasting using Deep Generative Models of Radar [31]
  • The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks [32]
  • ADOP: Approximate Differentiable One-Pixel Point Rendering [33]
  • (Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [34]
  • SwinIR: Image restoration using swin transformer [35]
  • EditGAN: High-Precision Semantic Image Editing [36]
  • CityNeRF: Building NeRF at City Scale [37]
  • ClipCap: CLIP Prefix for Image Captioning [38]
  • Paper references

Learning CNN-LSTM Architectures for Image论文阅读

圣诞快乐

Abstract:

自动描述图像的内容是连接计算机视觉和自然语言处理的人工智能的基本问题。在本文中,我们提出了一个基于深度递归体系结构的生成模型,该模型结合了计算机视觉和机器翻译的最新进展,可用于生成描述图像的自然句子。训练模型以在给定训练图像的情况下最大化目标描述语句的可能性。在多个数据集上进行的实验表明,该模型的准确性以及仅从图像描述中学习的语言的流畅性。我们的模型通常非常准确,我们可以在定性和定量上进行验证。例如,在Pascal数据集上,当前最先进的BLEU-1得分(越高越好)是25,而我们的方法得出的结果是59,与人类表现在69左右相比。我们还显示了BLEU-1 Flickr30k的得分从56提升到66,SBU的得分从19提升到28。最后,在新发布的COCO数据集上,我们的BLEU-4为27.7,这是当前的最新水平。

Introduction:

能够使用格式正确的英语句子自动描述图像的内容是一项非常具有挑战性的任务,但它可能会产生巨大的影响,例如,通过帮助视力障碍的人们更好地理解网络上的图像内容。例如,此任务比经过充分研究的图像分类或对象识别任务要困难得多,而这些任务已成为计算机视觉领域的主要关注点[27]。实际上,描述不仅必须捕获图像中包含的对象,而且还必须表达这些对象如何相互关联以及它们的属性和它们所涉及的活动。此外,必须表达上述语义知识以自然语言(例如英语)表示,这意味着除了视觉理解外还需要一种语言模型。

以前的大多数尝试都是将上述子问题的现有解决方案组合在一起,以便从图像进行描述[6,16]。相反,我们在这项工作中提出一个联合模型,该模型以图像 I 作为输入,并经过训练以最大化产生目标单词序列 S = S 1 , S 2 , . . . 的可能性 p ( S∣I ) ,其中每个单词 S t 来自给定的字典,即充分描述图像。

我们工作的主要灵感来自机器翻译的最新进展,其中的任务是通过最大化 p ( T∣S) ,将以源语言编写的句子 S 转换为目标语言的译文 T 。多年以来,机器翻译还通过一系列单独的任务来实现(分别翻译单词,对齐单词,重新排序等),但是最近的工作表明,使用递归神经网络(RNN)可以以更简单的方式完成翻译。 [3,2,30]并仍达到最先进的性能。 “编码器” RNN读取源语句并将其转换为丰富的固定长度向量表示形式,然后将其用作生成目标语句的“解码器” RNN的初始隐藏状态。

在这里,我们建议遵循这种优雅的方法,用深度卷积神经网络(CNN)代替编码器RNN。在过去的几年中,令人信服的表明,CNN可以通过将输入图像嵌入到固定长度的向量中来生成输入图像的丰富表示,从而这种表示可以用于各种视觉任务[28]。因此,自然是将CNN用作图像“编码器”,方法是先对其进行预训练以进行图像分类任务,然后将最后一个隐藏层用作生成语句的RNN解码器的输入(请参见图1)。我们将此模型称为神经图像标题或NIC。

我们的贡献如下。首先,我们提出了一个解决问题的端到端系统。它是一种神经网络,可以使用随机梯度下降训练。其次,我们的模型结合了用于视觉和语言模型的最新网络。这些可以在较大的语料库上进行预训练,因此可以利用其他数据。最后,与最先进的方法相比,它的性能显着提高。 例如,在Pascal数据集上,NIC的BLEU得分为59,与当前的最新水平25相比,而人类的性能达到69。在Flickr30k上,我们的得分从56提高到66,在SBU,从19到28。

Related Work:

从视觉数据中生成自然语言描述的问题已经在计算机视觉中进行了长期研究,但主要针对视频[7,32]。这导致了由视觉原始识别器与结构化形式语言(例如, And-Or图形或逻辑系统,它们通过基于规则的系统进一步转换为自然语言。这样的系统是手工设计的,相对较脆,并且仅在有限的领域(例如,图1)中被证明。例如,交通场景或运动描述。

带有自然文本的静止图像描述问题最近引起了人们的关注。借助对象,属性和位置识别方面的最新进展,尽管这些语言的表达能力受到限制,但我们仍可以驱动自然语言生成系统。 Farhadi等。 [6]使用检测来推断场景元素的三元组,并使用模板将其转换为文本。同样,李等。 [19]从检测开始,并使用包含检测到的对象和关系的短语拼凑出最终描述。 Kulkani等人使用了一个更复杂的检测图(三重态除外)。 [16],但具有基于模板的文本生成。也使用了基于语言解析的更强大的语言模型[23,1,17,18,5]。上面的方法已经能够“in the wild”描述图像,但是在文本生成方面,它们是经过大量手工设计和严格设计的。

大量工作解决了对给定图像[11、8、24]进行描述排名的问题。这样的方法基于在相同向量空间中共同嵌入图像和文本的想法。对于图像查询,将获取接近嵌入空间中图像的描述。最紧密地,神经网络用于共同嵌入图像和句子[29],甚至嵌入图像作物和句子[13],但并未尝试生成新颖的描述。通常,即使可能已经在训练数据中观察到了单个对象,上述方法也无法描述以前看不见的对象组成。而且,它们避免了解决评估所生成的描述的良好程度的问题。

在这项工作中,我们将用于图像分类的深层卷积网络[12]与用于序列建模的循环网络[10]相结合,以创建一个生成图像描述的单一网络。在这个单一的“端到端”网络的背景下对RNN进行了训练。该模型的灵感来自机器翻译中序列生成的最新成功[3,2,30],区别在于我们提供的不是卷积句子,而是提供了由卷积网络处理的图像。最近的著作是基洛斯等人。 [15]他们使用神经网络,但使用前馈神经网络,根据图像和前一个单词来预测下一个单词。毛等人的最新著作。 [21]使用递归神经网络进行相同的预测任务。这与当前建议非常相似,但是有许多重要的区别:我们使用功能更强大的RNN模型,并直接向RNN模型提供可视输入,这使得RNN可以跟踪那些由文字解释。由于这些看似微不足道的差异,我们的系统在已建立的基准上取得了明显更好的结果。最后,基洛斯等。 [14]提出通过使用功能强大的计算机视觉模型和对文本进行编码的LSTM来构建联合多峰嵌入空间。与我们的方法相反,它们使用两个单独的路径(一个用于图像,一个用于文本)来定义联合嵌入,并且即使它们可以生成文本,也对其方法进行了高度调整以进行排名。

Model:

在本文中,我们提出了一种神经和概率框架来从图像生成描述。统计机器翻译的最新进展表明,给定强大的序列模型,可以通过在“端到端”的方式中给定输入句子的情况下,直接最大化正确翻译的概率来获得最新的结果–既用于训练又用于推理。这些模型使用循环神经网络,该网络将可变长度输入编码为固定维向量,并使用此表示形式将其“解码”为所需的输出语句。 因此,很自然地使用相同的方法,即在给定图像(而不是源语言中的输入句子)的情况下,应用相同的原理将其“翻译”成其描述。

Thus, we propose to directly maximize the probability of the correct description given the image by using the following formulation:

where θ are the parameters of our model, I is an image, and S its correct transcription. Since S represents any sentence, its length is unbounded(无限的). Thus, it is common to apply the chain rule to model the joint probability over S 0 , . . . , S N , where N is the length of this particular example as

where we dropped the dependency on θfor convenience. At training time, ( S , I ) is a training example pair, and we optimize the sum of the log probabilities as described in (2) over the whole training set using stochastic gradient descent (further training details are given in Section 4).

It is natural to model p ( S t ∣ I , S 0 , . . . , S t − 1 )with a Recurrent Neural Network (RNN), where the variable number of words we condition upon up to t − 1 is expressed by a fixed length hidden state or memory h t This memory is updated after seeing a new input xtby using a non-linear function f :

ht+1​=f(ht​;xt​)

为了使上述RNN更具体,需要做出两个关键的设计选择:f的确切形式是什么以及如何将图像和单词作为输入xt输入。对于 f,我们使用长时记忆(LSTM)网络,该网络已显示出诸如翻译之类的序列任务的最新性能。下一节将概述此模型。

对于图像的表示,我们使用卷积神经网络(CNN)。它们已被广泛地用于图像任务并已被研究,并且目前是物体识别和检测的最新技术。我们对CNN的特定选择使用一种新颖的方法对批处理进行归一化,并在ILSVRC 2014分类竞赛中获得当前的最佳表现[12]。此外,它们已被证明可以通过转移学习推广到其他任务,例如场景分类[4]。单词用嵌入模型表示。

LSTM-based Sentence Generator

在(3)中f的选择取决于它处理消失和爆炸梯度的能力[10],这是设计和训练RNN时最常见的挑战。为了解决这一挑战,引入了一种称为LSTM的特殊形式的递归网络[10],并成功应用于翻译[3,30]和序列生成[9]。 LSTM模型的核心是存储单元 c ,它在每个时间步上编码知识,直到该步为止都观察到了哪些输入(参见图2)。单元的行为由“门”控制,“门”是相乘的层,因此如果门为1则可以保留门控层的值,如果门为0则可以保持此值为零。特别是,正在使用三个门用于控制是否忘记当前单元格值(忘记门 f),是否应读取其输入(输入门 i)以及是否输出新单元格值(输出门 o )。门的定义以及单元更新和输出如下:

其中 ⊙表示门值的乘积,而各种 W矩阵都是经过训练的参数。这样的乘法门使训练鲁棒的LSTM成为可能,因为这些门很好地处理了爆炸和消失的梯度[10]。非线性为S型σ(⋅)和双曲正切 h ( ⋅ ) 。最后一个方程 mt是输入给Softmax的方程,它将产生所有单词上的概率分布。

LSTM模型经过训练,可以在看到图像后预测句子中的每个单词以及通过 p ( S t ∣ I , S 0 , . . . , S t − 1 ) ) 预测所有先前单词。为此,以展开形式考虑LSTM是有启发性的–为图像和每个句子单词创建LSTM存储器的副本,以便所有LSTM在时间t共享相同的参数和LSTM的输出 mt-1
在时间 t 将馈送到LSTM(见图3)。在展开版本中,所有经常性连接都将转换为前馈连接。更详细地讲,如果我们用I表示输入图像,而用 S = ( S 0 , . . . , S N ) 表示描述该图像的真实句子,则展开过程为:

在这里,我们将每个单词表示为一维向量 S t ,其维数等于字典的大小。注意,我们用 S 0 表示一个特殊的开始词,用 S N 表示一个特殊的停止词,它指定句子的开头和结尾。特别是通过发出停用词,LSTM发出信号,表明已生成完整的句子。图像和单词都映射到相同的空间,使用视觉CNN映射图像,使用单词嵌入 W e映射到单词。图像 I 仅在 t = − 1 时输入一次,以通知LSTM有关图像内容。我们凭经验验证了,由于网络可以显式利用图像中的噪声并更容易过度拟合,因此在每个时间步幅上作为额外的输入来馈送图像会产生较差的结果

Our loss is the sum of the negative log likelihood of the correct word at each step as follows:

The above loss is minimized w.r.t. all the parameters of the LSTM, the top layer of the image embedder CNN and word embeddings We

使用NIC,有多种方法可以用于生成给定图像的句子。第一个是抽样,我们只是根据p1对第一个单词进行抽样,然后提供相应的嵌入作为输入,然后对p2进行抽样,这样一直进行下去,直到我们对特殊的语句结束标记或某个最大长度进行抽样。第二种方法是BeamSearch:迭代地考虑k个最好的句子,直到时间t,作为候选,生成大小为t + 1的句子,并只保留其中最好的k个。这更接近于S = arg maxS0 p(S0|I)。在接下来的实验中,我们使用了波束搜索方法,波束大小为20。使用光束大小为1(即(贪婪搜索)降低了平均2个BELU点。

结构整体结构:

包括Encoder、decoder 、Attention、 Beam Search(束搜索)

Putting it all together

Generation Results

我们在表1和表2中报告了所有相关数据集的主要结果。由于PASCAL没有训练集,所以我们使用使用MSCOCO训练的系统(对于这个任务来说,MSCOCO可能是最大和最高质量的数据集)。PASCAL和SBU的最新研究结果并没有使用基于深度学习的图像特征,因此可以说,这些分数上的一个巨大进步仅仅来自于这种改变。Flickr数据集最近才被使用[11,21,14],但大多数是在检索框架中评估的。一个值得注意的例外是[21],它们在其中进行检索和生成,并且在Flickr数据集上产生了迄今为止最好的性能。

表2中的人类评分是通过比较其中一个人类字幕和另外四个字幕计算出来的。我们为五个打分者中的每一个打分,并对他们的BLEU分数进行平均。由于这给我们的系统带来了一点优势,考虑到BLEU分数是根据5个参考句计算的,而不是4个,我们将5个参考句而不是4个参考句的平均差异加回人类分数。

鉴于该领域在过去几年中取得了重大进展,我们认为报告BLEU-4更有意义,这是机器翻译向前发展的标准。此外,我们还报告了表14中显示的与人工评估关联更好的度量标准。尽管最近在更好的评价指标[31]的努力,我们的模型与人类评分者相比表现良好。然而,当使用人工评分员来评估我们的字幕时(见4.3.6节),我们的模型表现得更差,这表明我们需要做更多的工作来获得更好的指标。在官方测试集上,我们的标签只能通过官方网站获得,我们的型号有27.2的BLEU-4。

Conclusion

我们提出了NIC,一个端到端的神经网络系统,可以自动查看图像并生成合理的描述。NIC基于卷积神经网络,将图像编码成紧凑的表示形式,然后是递归神经网络,生成相应的句子。该模型经过训练以最大化给定图像的句子的可能性。在多个数据集上的实验表明,NIC在定性结果(生成的句子非常合理)和定量评估方面具有鲁棒性,可以使用排名指标,也可以使用机器翻译中用来评估生成句子质量的BLEU指标。从这些实验中可以清楚地看到,随着用于图像描述的可用数据集的大小的增加,NIC等方法的性能也会提高。此外,观察如何使用非监督数据(单独来自图像和单独来自文本)来改进图像描述方法也很有趣。

github实现:https://github.com/sgrvinod/Deep-Tutorials-for-PyTorch

References
[1] A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In ACL, 2010.
[2] D. Bahdanau, K. Cho, and Y . Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[3] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y . Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[4] J. Donahue, Y . Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[5] D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, 2013.
[6] A. Farhadi, M. Hejrati, M. A. Sadeghi, P . Y oung, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[7] R. Gerber and H.-H. Nagel. Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences. In ICIP. IEEE, 1996.
[8] Y . Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014.
[9] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
[11] M. Hodosh, P . Y oung, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47, 2013.
[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In arXiv:1502.03167, 2015.
[13] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. NIPS, 2014.
[14] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. In arXiv:1411.2539, 2014.
[15] R. Kiros and R. Z. R. Salakhutdinov. Multimodal neural language models. In NIPS Deep Learning Workshop, 2013.
[16] G. Kulkarni, V . Premraj, S. Dhar, S. Li, Y . Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[17] P . Kuznetsova, V . Ordonez, A. C. Berg, T. L. Berg, and Y . Choi. Collective generation of natural image descriptions. In ACL, 2012. [18] P . Kuznetsova, V . Ordonez, T. Berg, and Y . Choi. Treetalk: Composition and compression of trees for image descriptions. ACL, 2(10), 2014.
[19] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y . Choi. Composing simple image descriptions using web-scale n-grams. In Conference on Computational Natural Language Learning, 2011.
[20] T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P . Perona, D. Ramanan, P . Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. arXiv:1405.0312, 2014.
[21] J. Mao, W. Xu, Y . Yang, J. Wang, and A. Y uille. Explain images with multimodal recurrent neural networks. In arXiv:1410.1090, 2014. [22] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.
[23] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. C. Berg, K. Yamaguchi, T. L. Berg, K. Stratos, and H. D. III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
[24] V . Ordonez, G. Kulkarni, and T. L. Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[25] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[26] C. Rashtchian, P . Y oung, M. Hodosh, and J. Hockenmaier. Collecting image annotations using amazon’s mechanical turk. In NAACL HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 139– 147, 2010.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
[28] P . Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y . LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[29] R. Socher, A. Karpathy, Q. V . Le, C. Manning, and A. Y . Ng. Grounded compositional semantics for finding and describing images with sentences. In ACL, 2014.
[30] I. Sutskever, O. Vinyals, and Q. V . Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[31] R. V edantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In arXiv:1411.5726, 2015.
[32] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2t: Image parsing to text description. Proceedings of the IEEE, 98(8), 2010.
[33] P . Y oung, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In ACL, 2014.
[34] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. In arXiv:1409.2329, 2014.

BLEU—机器翻译评价方法

Bleu[1]是IBM在2002提出的,用于机器翻译任务的评价,发表在ACL,引用次数10000+,原文题目是“BLEU: a Method for Automatic Evaluation of Machine Translation”。

它的总体思想就是准确率,假如给定标准译文reference,神经网络生成的句子是candidate,句子长度为n,candidate中有m个单词出现在reference,m/n就是bleu的1-gram的计算公式。

BLEU还有许多变种。根据n-gram可以划分成多种评价指标,常见的指标有BLEU-1、BLEU-2、BLEU-3、BLEU-4四种,其中n-gram指的是连续的单词个数为n。

BLEU-1衡量的是单词级别的准确性,更高阶的bleu可以衡量句子的流畅性。

计算公式:

分子释义

神经网络生成的句子是candidate,给定的标准译文是reference。

1) 第一个求和符号统计的是所有的candidate,因为计算时可能有多个句子,

2)第二个求和符号统计的是一条candidate中所有的n−gram,而 [公式] 表示某一个n−gram在reference中的个数。

所以整个分子就是在给定的candidate中有多少个n-gram词语出现在reference中。

分母释义

前两个求和符号和分子中的含义一样,Count(n-gram’)表示n−gram′在candidate中的个数,综上可知,分母是获得所有的candidate中n-gram的个数。

改进版的BLEU计算方法

改进思路:对于某个词组出现的次数,在保证不大于candidate中出现的个数的情况下,然后再reference寻找词组出现的最多次。

评价机器翻译结果通常使用BLEU(Bilingual Evaluation Understudy)。对于模型预测序列中任意的子序列,BLEU考察这个子序列是否出现在标签序列中。

具体来说,设词数为n的子序列的精度为pn​。它是预测序列与标签序列匹配词数为n的子序列的数量与预测序列中词数为n的子序列的数量之比。举个例子,假设标签序列为A、B、C、D、E、F,预测序列为ABB、C、D,那么p1=4/5,p2=3/4,p3=1/3,p4=0。设

lenlabel​和lenpred分别为标签序列和预测序列的词数,那么,BLEU的定义为

$$
\exp \left(\min \left(0,1-\frac{l e n_{\text {label }}}{l e n_{\text {pred }}}\right)\right) \prod_{n=1}^{k} p_{n}^{1 / 2^{n}}
$$
其中 \(k\) 是我们希望匹配的子序列的最大词数。可以看到当预测 序列和标签序列完全一致时,BLEU为 1 。
因为匹配较长子序列比匹配较短子序列更难,BLEU对匹配较长 子序列的精度赋予了更大权重。例如,当 \( p_{n} \) 固定在 \( 0.5 \) 时,随在 \( n \) 的增大, \( 0.5^{1 / 2} \approx 0.7,0.5^{1 / 4} \approx 0.84,0.5^{1 / 8} \approx$ $0.92,0.5^{1 / 16} \approx 0.96 \) 。另外,模型预测较短序列往往会得到 较高 \( p_{n} \) 值。因此,上式中连乘项前面的系数是为了惩罚较短的 输出而设的。举个例子,当 \(k=2 \) 时,假设标签序列为 A 、 B 、 C 、 D 、 E 、 F ,而预测序列为 A 、 B 。虽然 \( p_{1}=p_{2}=1 \) ,但惩罚系数 \( \exp (1-6 / 2) \approx 0.14\) , 因此BLEU也接近 0.14 。

python实现:

def bleu(pred_tokens, label_tokens, k):
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        for i in range(len_label - n + 1):
            label_subs[''.join(label_tokens[i: i + n])] += 1
        for i in range(len_pred - n + 1):
            if label_subs[''.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[''.join(pred_tokens[i: i + n])] -= 1
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score

接下来,定义一个辅助打印函数:

def score(input_seq, label_seq, k):
    pred_tokens = translate(encoder, decoder, input_seq, max_seq_len)
    label_tokens = label_seq.split(' ')
    print('bleu %.3f, predict: %s' % (bleu(pred_tokens, label_tokens, k),
                                      ' '.join(pred_tokens)))

一些注意事项

一般给出的reference是4句话,之所以给出多个句子,是因为单个句子可能无法和生成的句子做很好地匹配。

比如”你好”,reference是”hello”,机器给出的译文是”how are you”,机器给出的词语每一个一个词匹配上这个reference,那么BLEU值是0,这显然是有问题的。所以reference越多样化,匹配成功概率越高。