分类: 论文
MAE:Masked Autoencoders Are Scalable Vision Learners
摘自 Jack Cui
马赛克,克星,真来了!
何恺明大神的新作 论文:https://arxiv.org/abs/2111.06377
项目地址:https://github.com/facebookresearch/mae
简单讲:将图片随机遮挡,然后复原。并且遮挡的比例,非常大!超过整张图的 80% ,我们直接看效果:
第一列是遮挡图,第二列是修复结果,第三列是原图。图片太多,可能看不清,我们单看一个:
看这个遮挡的程度,表针、表盘几乎都看不见了。但是 MAE 依然能够修复出来:
这个效果真的很惊艳!甚至对于遮挡 95% 的面积的图片依然 work。
看左图,你能看出来被遮挡的是蘑菇吗??MAE 却能轻松修复出来。接下来,跟大家聊聊 MAE。
Vit
讲解 MAE 之前不得不先说下 Vit。红遍大江南北的 Vision Transformer,ViT。领域内的小伙伴,或多或少都应该听说过。它将 Transformer 应用到了 CV 上面,将整个图分为 16 * 16 的小方块,每个方块做成一个词,然后放进 Transformer 进行训练。视觉transformer 和自然语言处理 中的transformer可以进行类比,可以把一个图像块理解成一个单词。
MAE
MAE 结构设计的非常简单:
将一张图随机打 Mask,未 Mask 部分输入给 Encoder 进行编码学习,这个 Encoder 就是 Vit,然后得到每个块的特征。再将未 Mask 部分以及 Mask 部分全部输入给 Decoder 进行解码学习,最终目标是修复图片。而 Decoder 就是一个轻量化的 Transformer。它的损失函数就是普通的 MSE。所以说, MAE 的 Encoder 和 Decoder 结构不同,是非对称式的。Encoder 将输入编码为 latent representation,而 Decoder 将从 latent representation 重建原始信号。
项目提供了 Colab,如果你能登陆,那么可以直接体验:https://colab.research.google.com/github/facebookresearch/mae/blob/main/demo/mae_visualize.ipynb
如果不能登陆,可以直接本地部署,作者提供了预训练模型。
MAE 可以用来生成不存在的内容,就像 GAN 一样。
首先来看看神魔是 Transformer :
Transformer 最初主要应用于一些自然语言处理场景,比如翻译、文本分类、写小说、写歌等。随着技术的发展,Transformer 开始征战视觉领域,分类、检测等任务均不在话下,逐渐走上了多模态的道路。
Transformer 是 Google 在 2017 年提出的用于机器翻译的模型。
Transformer 的内部,在本质上是一个 Encoder-Decoder 的结构,即 编码器-解码器。
Transformer 中抛弃了传统的 CNN 和 RNN,整个网络结构完全由 Attention 机制组成,并且采用了 6 层 Encoder-Decoder 结构。
显然,Transformer 主要分为两大部分,分别是编码器和解码器。整个 Transformer 是由 6 个这样的结构组成,为了方便理解,我们只看其中一个Encoder-Decoder 结构。
以一个简单的例子进行说明:
Why do we work?,我们为什么工作?左侧红框是编码器,右侧红框是解码器,编码器负责把自然语言序列映射成为隐藏层(上图第2步),即含有自然语言序列的数学表达。解码器把隐藏层再映射为自然语言序列,从而使我们可以解决各种问题,如情感分析、机器翻译、摘要生成、语义关系抽取等。简单说下,上图每一步都做了什么:
- 输入自然语言序列到编码器: Why do we work?(为什么要工作);
- 编码器输出的隐藏层,再输入到解码器;
- 输入 <𝑠𝑡𝑎𝑟𝑡> (起始)符号到解码器;
- 解码器得到第一个字”为”;
- 将得到的第一个字”为”落下来再输入到解码器;
- 解码器得到第二个字”什”;
- 将得到的第二字再落下来,直到解码器输出 <𝑒𝑛𝑑> (终止符),即序列生成完成。
解码器和编码器的结构类似,本文以编码器部分进行讲解。即把自然语言序列映射为隐藏层的数学表达的过程,因为理解了编码器中的结构,理解解码器就非常简单了。为了方便学习,我将编码器分为 4 个部分,依次讲解。
1、位置嵌入(𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛𝑎𝑙 𝑒𝑛𝑐𝑜𝑑𝑖𝑛𝑔)
我们输入数据 X 维度为[batch size, sequence length]的数据,比如我们为什么工作。batch size 就是 batch 的大小,这里只有一句话,所以 batch size 为 1,sequence length 是句子的长度,一共 7 个字,所以输入的数据维度是 [1, 7]。我们不能直接将这句话输入到编码器中,因为 Tranformer 不认识,我们需要先进行字嵌入,即得到图中的 。简单点说,就是文字->字向量的转换,这种转换是将文字转换为计算机认识的数学表示,用到的方法就是 Word2Vec,Word2Vec 的具体细节,对于初学者暂且不用了解,这个是可以直接使用的。得到维度是 [batch size, sequence length, embedding dimension],embedding dimension 的大小由 Word2Vec 算法决定,Tranformer 采用 512 长度的字向量。所以 的维度是 [1, 7, 512]。至此,输入的我们为什么工作,可以用一个矩阵来简化表示。
我们知道,文字的先后顺序,很重要。比如吃饭没、没吃饭、没饭吃、饭吃没、饭没吃,同样三个字,顺序颠倒,所表达的含义就不同了。文字的位置信息很重要,Tranformer 没有类似 RNN 的循环结构,没有捕捉顺序序列的能力。为了保留这种位置信息交给 Tranformer 学习,我们需要用到位置嵌入。加入位置信息的方式非常多,最简单的可以是直接将绝对坐标 0,1,2 编码。Tranformer 采用的是 sin-cos 规则,使用了 sin 和 cos 函数的线性变换来提供给模型位置信息:
上式中 pos 指的是句中字的位置,取值范围是 [0, 𝑚𝑎𝑥 𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒 𝑙𝑒𝑛𝑔𝑡ℎ),i 指的是字嵌入的维度, 取值范围是 [0, 𝑒𝑚𝑏𝑒𝑑𝑑𝑖𝑛𝑔 𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛)。 就是 𝑒𝑚𝑏𝑒𝑑𝑑𝑖𝑛𝑔 𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛 的大小。上面有 sin 和 cos 一组公式,也就是对应着 𝑒𝑚𝑏𝑒𝑑𝑑𝑖𝑛𝑔 𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛 维度的一组奇数和偶数的序号的维度,从而产生不同的周期性变化。
# 导入依赖库
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
def get_positional_encoding(max_seq_len, embed_dim):
# 初始化一个positional encoding
# embed_dim: 字嵌入的维度
# max_seq_len: 最大的序列长度
positional_encoding = np.array([
[pos / np.power(10000, 2 * i / embed_dim) for i in range(embed_dim)]
if pos != 0 else np.zeros(embed_dim) for pos in range(max_seq_len)])
positional_encoding[1:, 0::2] = np.sin(positional_encoding[1:, 0::2]) # dim 2i 偶数
positional_encoding[1:, 1::2] = np.cos(positional_encoding[1:, 1::2]) # dim 2i+1 奇数
# 归一化, 用位置嵌入的每一行除以它的模长
# denominator = np.sqrt(np.sum(position_enc**2, axis=1, keepdims=True))
# position_enc = position_enc / (denominator + 1e-8)
return positional_encoding
positional_encoding = get_positional_encoding(max_seq_len=100, embed_dim=16)
plt.figure(figsize=(10,10))
sns.heatmap(positional_encoding)
plt.title("Sinusoidal Function")
plt.xlabel("hidden dimension")
plt.ylabel("sequence length")
可以看到,位置嵌入在 𝑒𝑚𝑏𝑒𝑑𝑑𝑖𝑛𝑔 𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛 (也是hidden dimension )维度上随着维度序号增大,周期变化会越来越慢,而产生一种包含位置信息的纹理。
就这样,产生独一的纹理位置信息,模型从而学到位置之间的依赖关系和自然语言的时序特性。
最后,将Xembedding 和 位置嵌入
相加,送给下一层。
2、自注意力层(𝑠𝑒𝑙𝑓 𝑎𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛 𝑚𝑒𝑐ℎ𝑎𝑛𝑖𝑠𝑚)
直接看下图笔记,讲解的非常详细。
多头的意义在于,\(QK^{T}\) 得到的矩阵就叫注意力矩阵,它可以表示每个字与其他字的相似程度。因为,向量的点积值越大,说明两个向量越接近。
我们的目的是,让每个字都含有当前这个句子中的所有字的信息,用注意力层,我们做到了。
需要注意的是,在上面 𝑠𝑒𝑙𝑓 𝑎𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛
的计算过程中,我们通常使用 𝑚𝑖𝑛𝑖 𝑏𝑎𝑡𝑐ℎ
,也就是一次计算多句话,上文举例只用了一个句子。
每个句子的长度是不一样的,需要按照最长的句子的长度统一处理。对于短的句子,进行 Padding
操作,一般我们用 0
来进行填充。
3、残差链接和层归一化
加入了残差设计和层归一化操作,目的是为了防止梯度消失,加快收敛。
1) 残差设计
我们在上一步得到了经过注意力矩阵加权之后的 𝑉
, 也就是 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄, 𝐾, 𝑉)
,我们对它进行一下转置,使其和 𝑋𝑒𝑚𝑏𝑒𝑑𝑑𝑖𝑛𝑔
的维度一致, 也就是 [𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒, 𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒 𝑙𝑒𝑛𝑔𝑡ℎ, 𝑒𝑚𝑏𝑒𝑑𝑑𝑖𝑛𝑔 𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛]
,然后把他们加起来做残差连接,直接进行元素相加,因为他们的维度一致:
$$
X_{\text {embedding }}+\text { Attention }(Q, K, V)
$$
在之后的运算里,每经过一个模块的运算,都要把运算之前的值和运算之后的值相加,从而得到残差连接,训练的时候可以使梯度直接走捷径反传到最初始层:
$$
X+\text { SubLayer }(X)
$$
2) 层归一化
作用是把神经网络中隐藏层归一为标准正态分布,也就是 𝑖.𝑖.𝑑
独立同分布, 以起到加快训练速度, 加速收敛的作用。
$$
\mu_{i}=\frac{1}{m} \sum_{i=1}^{m} x_{i j}
$$
上式中以矩阵的行 (𝑟𝑜𝑤) 为单位求均值:
$$
\sigma_{j}^{2}=\frac{1}{m} \sum_{i=1}^{m}\left(x_{i j}-\mu_{j}\right)^{2}
$$
上式中以矩阵的行 (𝑟𝑜𝑤) 为单位求方差:
$$
\operatorname{LayerNorm}(x)=\alpha \odot \frac{x_{i j}-\mu_{i}}{\sqrt{\sigma_{i}^{2}+\epsilon}}+\beta
$$
然后用每一行的每一个元素减去这行的均值,再除以这行的标准差,从而得到归 一化后的数值, \(\epsilon\) 是为了防止除 0 ;之后引入两个可训练参数 \(\alpha, \beta\) 来弥补归一化的过程中损失掉的信息,注意 \(\odot\) 表示 元素相乘而不是点积,我们一般初始化 \(\alpha\) 为全 1 ,而\(\beta\) 为全 0 。
代码层面非常简单,单头 attention
操作如下:
class ScaledDotProductAttention(nn.Module):
''' Scaled Dot-Product Attention '''
def __init__(self, temperature, attn_dropout=0.1):
super().__init__()
self.temperature = temperature
self.dropout = nn.Dropout(attn_dropout)
def forward(self, q, k, v, mask=None):
# self.temperature是论文中的d_k ** 0.5,防止梯度过大
# QxK/sqrt(dk)
attn = torch.matmul(q / self.temperature, k.transpose(2, 3))
if mask is not None:
# 屏蔽不想要的输出
attn = attn.masked_fill(mask == 0, -1e9)
# softmax+dropout
attn = self.dropout(F.softmax(attn, dim=-1))
# 概率分布xV
output = torch.matmul(attn, v)
return output, attn
Multi-Head Attention
实现在 ScaledDotProductAttention
基础上构建:
class MultiHeadAttention(nn.Module):
''' Multi-Head Attention module '''
# n_head头的个数,默认是8
# d_model编码向量长度,例如本文说的512
# d_k, d_v的值一般会设置为 n_head * d_k=d_model,
# 此时concat后正好和原始输入一样,当然不相同也可以,因为后面有fc层
# 相当于将可学习矩阵分成独立的n_head份
def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
super().__init__()
# 假设n_head=8,d_k=64
self.n_head = n_head
self.d_k = d_k
self.d_v = d_v
# d_model输入向量,n_head * d_k输出向量
# 可学习W^Q,W^K,W^V矩阵参数初始化
self.w_qs = nn.Linear(d_model, n_head * d_k, bias=False)
self.w_ks = nn.Linear(d_model, n_head * d_k, bias=False)
self.w_vs = nn.Linear(d_model, n_head * d_v, bias=False)
# 最后的输出维度变换操作
self.fc = nn.Linear(n_head * d_v, d_model, bias=False)
# 单头自注意力
self.attention = ScaledDotProductAttention(temperature=d_k ** 0.5)
self.dropout = nn.Dropout(dropout)
# 层归一化
self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
def forward(self, q, k, v, mask=None):
# 假设qkv输入是(b,100,512),100是训练每个样本最大单词个数
# 一般qkv相等,即自注意力
residual = q
# 将输入x和可学习矩阵相乘,得到(b,100,512)输出
# 其中512的含义其实是8x64,8个head,每个head的可学习矩阵为64维度
# q的输出是(b,100,8,64),kv也是一样
q = self.w_qs(q).view(sz_b, len_q, n_head, d_k)
k = self.w_ks(k).view(sz_b, len_k, n_head, d_k)
v = self.w_vs(v).view(sz_b, len_v, n_head, d_v)
# 变成(b,8,100,64),方便后面计算,也就是8个头单独计算
q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
if mask is not None:
mask = mask.unsqueeze(1) # For head axis broadcasting.
# 输出q是(b,8,100,64),维持不变,内部计算流程是:
# q*k转置,除以d_k ** 0.5,输出维度是b,8,100,100即单词和单词直接的相似性
# 对最后一个维度进行softmax操作得到b,8,100,100
# 最后乘上V,得到b,8,100,64输出
q, attn = self.attention(q, k, v, mask=mask)
# b,100,8,64-->b,100,512
q = q.transpose(1, 2).contiguous().view(sz_b, len_q, -1)
q = self.dropout(self.fc(q))
# 残差计算
q += residual
# 层归一化,在512维度计算均值和方差,进行层归一化
q = self.layer_norm(q)
return q, attn
4、前馈网络
这个层就没啥说的了,非常简单,直接看代码吧:
class PositionwiseFeedForward(nn.Module):
''' A two-feed-forward-layer module '''
def __init__(self, d_in, d_hid, dropout=0.1):
super().__init__()
# 两个fc层,对最后的512维度进行变换
self.w_1 = nn.Linear(d_in, d_hid) # position-wise
self.w_2 = nn.Linear(d_hid, d_in) # position-wise
self.layer_norm = nn.LayerNorm(d_in, eps=1e-6)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
residual = x
x = self.w_2(F.relu(self.w_1(x)))
x = self.dropout(x)
x += residual
x = self.layer_norm(x)
return x
GAN系列之—Deep Convolutional GAN(DCGAN)
DCGAN 的判别器和生成器都使用了卷积神经网络(CNN)来替代GAN 中的多层感知机,同时为了使整个网络可微,拿掉了CNN 中的池化层,另外将全连接层以全局池化层替代以减轻计算量。
去卷积(反卷积,Deconvolution)
从上图中可以看到,生成器G 将一个100 维的噪音向量扩展成64 * 64 * 3 的矩阵输出,整个过程采用的是微步卷积的方式。作者在文中将其称为fractionally-strided convolutions,并特意强调不是deconvolutions。
去卷积(链接:反卷积)又包含转置卷积和微步卷积,两者的区别在于padding 的方式不同,看看下面这张图片就可以明白了:
3. 训练方法
DCGAN 的训练方法跟GAN 是一样的,分为以下三步:
(1)for k steps:训练D 让式子【logD(x) + log(1 – D(G(Z)) (G keeps still)】的值达到最大
(2)保持D 不变,训练G 使式子【logD(G(z))】的值达到最大
(3)重复step(1)和step(2)直到G 与D 达到纳什均衡
4. 相比于GAN 的改进
DCGAN 相比于GAN 或者是普通CNN 的改进包含以下几个方面:
(1)使用卷积和去卷积代替池化层
(2)在生成器和判别器中都添加了批量归一化操作
(3)去掉了全连接层,使用全局池化层替代
(4)生成器的输出层使用Tanh 激活函数,其他层使用RELU
(5)判别器的所有层都是用LeakyReLU 激活函数
5. 漫游隐空间
通过使用插值微调噪音输入z 的方式可以导致隐空间结构发生变化从而引导生成图像发生语义上的平滑过度,比如说从有窗户到没窗户,从有电视到没电视等等。
6. 语义遮罩
通过标注窗口,并判断激活神经元是否在窗口内的方式来找出影响窗户形成的神经元,将这些神经元的权重设置为0,那么就可以导致生成的图像中没有窗户。从下图可以看到,上面一行图片都是有窗户的,下面一行通过语义遮罩的方式拿掉了窗户,但是空缺的位置依然是平滑连续的,使整幅图像的语义没有发生太大的变化。
7. 矢量算法
在向量算法中有一个很经典的例子就是【vector(“King”) – vector(“Man”) + vector(“Woman”) = vector(“Queue”)】,作者将该思想引入到图像生成当中并得到了以下实验结果:【smiling woman – neutral woman + neutral man = smiling man】
GAN系列之经典GAN(一)
reference:
https://zhuanlan.zhihu.com/p/78777020
https://zhuanlan.zhihu.com/p/28853704
GAN全称:Generative Adversarial Network 即生成对抗网络,由Ian J. Goodfellow等人于2014年10月发表在NIPS大会上的论文《Generative Adversarial Nets》中提出。此后各种花式变体Pix2Pix、CYCLEGAN、STARGAN、StyleGAN等层出不穷,在“换脸”、“换衣”、“换天地”等应用场景下生成的图像、视频以假乱真,好不热闹。前段时间PaddleGAN实现的First Order Motion表情迁移模型,能用一张照片生成一段唱歌视频。各种搞笑鬼畜视频火遍全网。用的就是一种GAN模型哦。深度学习三巨神之一的LeCun也对GAN大加赞赏,称“adversarial training is the coolest thing since sliced bread”。
对抗生成模型GAN首先是一个生成模型,和大家比较熟悉的、用于分类的判别模型不同。
判别模型的数学表示是y=f(x),也可以表示为条件概率分布p(y|x)。当输入一张训练集图片x时,判别模型输出分类标签y。模型学习的是输入图片x与输出的类别标签的映射关系。即学习的目的是在输入图片x的条件下,尽量增大模型输出分类标签y的概率。
而生成模型的数学表示是概率分布p(x)。没有约束条件的生成模型是无监督模型,将给定的简单先验分布π(z)(通常是高斯分布),映射为训练集图片的像素概率分布p(x),即输出一张服从p(x)分布的具有训练集特征的图片。模型学习的是先验分布π(z)与训练集像素概率分布p(x)的映射关系。
生成对抗网络一般由一个生成器(生成网络),和一个判别器(判别网络)组成。生成器的作用是,通过学习训练集数据的特征,在判别器的指导下,将随机噪声分布尽量拟合为训练数据的真实分布,从而生成具有训练集特征的相似数据。而判别器则负责区分输入的数据是真实的还是生成器生成的假数据,并反馈给生成器。两个网络交替训练,能力同步提高,直到生成网络生成的数据能够以假乱真,并与与判别网络的能力达到一定均衡。
GAN的本质
其实GAN模型以及所有的生成模型都一样,做的事情只有一件:拟合训练数据的分布。对图片生成任务来说就是拟合训练集图片的像素概率分布。下面我们从原理的角度演示一下GAN的训练过程:
上图中: 黑色点线为训练集数据分布曲线 蓝色点线为判别器输出的分布曲线 绿色实线为生成器输出的分布曲线 z展示的是生成器映射前的简单概率分布(一般是高斯分布)的范围和密度 x展示的是生成器映射后学到的训练集的概率分布的范围和密度 (a)判别器与生成器均未训练呈随机分布 (b)判别器经过训练,输出的分布在靠近训练集“真”数据分布的区间趋近于1(真),在靠近生成器生成的“假”数据分布的区间趋近于0(假) (c)生成器根据判别器输出的(真假)分布,更新参数,使自己的输出分布趋近于训练集“真”数据的分布。 经过(b)(c)(b)(c)…步骤的循环交替。判别器的输出分布随着生成器输出的分布与训练集分布的接近而更加平缓;生成器输出的分布则在判别器输出分布的指引下逐渐趋近于训练集“真”数据的分布。 (d)训练完成时,生成器输出的分布完美拟合了训练集数据的分布,判别器的输出由于生成器的完美拟合而无法判别生成器输出的真伪而呈一条取值约为0.5(真假之间)的直线。
GAN的组成
- 解读GAN的loss函数
GAN网络的训练优化目标就是如下公式:
公式出自Goodfellow在2014年发表的论文Generative Adversarial Nets。这里简单介绍下公式的含义和如何应用到代码中。上式中等号左边的部分: V(D,G)表示的是生成样本和真实样本的差异度,可以使用二分类(真、假两个类别)的交叉熵损失。
maxV(D, G)表示在生成器固定的情况下,通过最大化交叉熵损失V(D,G)来更新判别器D的参数。
min maxV(D, G)表示生成器要在判别器最大化真、假图片交叉熵损失V(D,G)的情况下,最小化这个交叉熵损失
首先固定G训练D :
1)训练D的目的是希望这个式子的值越大越好。真实数据希望被D分成1,生成数据希望被分成0。
第一项,如果有一个真实数据被分错,那么log(D(x))<<0,期望会变成负无穷大。
第二项,如果被分错成1的话,第二项也会是负无穷大。
很多被分错的话,就会出现很多负无穷,那样可以优化的空间还有很多。可以修正参数,使V的数值增大。
2)训练G ,它是希望V的值越小越好,让D分不开真假数据。
因为目标函数的第一项不包含G,是常数,所以可以直接忽略 不受影响。
对于G来说 它希望D在划分他的时候能够越大越好,他希望被D划分1(真实数据)。
第二个式子和第一个式子等价。在训练的时候,第二个式子训练效果比较好 常用第二个式子的形式。
证明V是可以收敛导最佳解的。
(1)global optimum 存在
(2)global optimum训练过程收敛
全局优化首先固定G优化D,D的最佳情况为:
1、证明D*G(x)是最优解
由于V是连续的所以可以写成积分的形式来表示期望:
通过假设x=G(z)可逆进行了变量替换,整理式子后得到:
然后对V(G,D)进行最大化:对D进行优化令V取最大
取极值,对V进行求导并令导数等于0.求解出来可得D的最佳解D*G(x)结果一样。
2、假设我们已经知道D*G(x)是最佳解了,这种情况下G想要得到最佳解的情况是:G产生出来的分布要和真实分布一致,即:
在这个条件下,D*G(x)=1/2。
接下来看G的最优解是什么,因为D的这时已经找到最优解了,所以只需要调整G ,令
对于D的最优解我们已经知道了,D*G(x),可以直接把它带进来 并去掉前面的Max
然后对 log里面的式子分子分母都同除以2,分母不动,两个分子在log里面除以2 相当于在log外面 -log(4) 可以直接提出来:
结果可以整理成两个KL散度-log(4)
KL散度是大于等于零的,所以C的最小值是 -log(4)
当且仅当
即
所以证明了 当G产生的数据和真实数据是一样的时候,C取得最小值也就是最佳解。
如上图所示GAN由一个判别器(Discriminator)和一个生成器(Generator)两个网络组成。
训练时先训练判别器:将训练集数据(Training Set)打上真标签(1)和生成器(Generator)生成的假图片(Fake image)打上假标签(0)一同组成batch送入判别器(Discriminator),对判别器进行训练。计算loss时使判别器对真数据(Training Set)输入的判别趋近于真(1),对生成器(Generator)生成的假图片(Fake image)的判别趋近于假(0)。此过程中只更新判别器(Discriminator)的参数,不更新生成器(Generator)的参数。
然后再训练生成器:将高斯分布的噪声z(Random noise)送入生成器(Generator),然后将生成器(Generator)生成的假图片(Fake image)打上真标签(1)送入判别器(Discriminator)。计算loss时使判别器对生成器(Generator)生成的假图片(Fake image)的判别趋近于真(1)。此过程中只更新生成器(Generator)的参数,不更新判别器(Discriminator)的参数。
判别器结构:
生成器结构:
代码实现:http://139.9.1.231/index.php/2021/12/29/gan/
YOLO系列(二):yolov1
YOLOv1属于一阶段、anchor-free 目标检测
整体来看,Yolo算法采用一个单独的CNN模型实现end-to-end的目标检测,整个系统如图5所示:首先将输入图片resize到448×448,然后送入CNN网络,最后处理网络预测结果得到检测的目标。相比R-CNN算法,其是一个统一的框架,其速度更快,而且Yolo的训练过程也是end-to-end的。
具体来说,Yolo的CNN网络将输入的图片分割成 \(S \times S\) 网格,然后每个单元格负责去检测那些 中心点落在该格子内的目标,如图6所示,可以看到狗这个目标的中心落在左下角一个单元格内, 那么该单元格负责预测这个狗。每个单元格会预测 \(B\) 个边界框 (bounding box) 以及边界框的 置信度 (confidence score) 。所谓置信度其实包含两个方面,一是这个边界框含有目标的可能性 大小,二是这个边界框的准确度。前者记为 \(\operatorname{Pr}(object)\) ,当该边界框是背景时 (即不包含目 标),此时 \(\operatorname{Pr}(object)=0\) 。而当该边界框包含目标时, \(\operatorname{Pr}(object)=1\) 。边界框的准 确度可以用预测框与实际框 (ground truth) 的IOU (intersection over union,交并比) 来表 征,记为 \(\mathrm{IOU}{\text {pred }}^{\text {truth }}\) 。因此置信度可以定义为 \(\operatorname{Pr}(object) * \mathrm{IOU}{\text {pred }}^{\text {truth }}\) 。很多人可能将Yolo 的置信度看成边界框是否含有目标的概率,但是其实它是两个因子的乘积,预测框的准确度也反映 在里面。边界框的大小与位置可以用4个值来表征: (x, y, w, h),其中 (x, y) 是边界框的中 心坐标,而 w和 h 是边界框的宽与高。还有一点要注意,中心坐标的预测值 (x, y) 是相对于 每个单元格左上角坐标点的偏移值,并且单位是相对于单元格大小的,单元格的坐标定义如图6所 示。而边界框的 \(w\) 和 \(h\) 预测值是相对于整个图片的宽与高的比例,这样理论上4个元素的大小 应该在 \([0,1]\) 范围。这样,每个边界框的预测值实际上包含 5 个元素: \((x, y, w, h, c)\) ,其中 \((x, y)\) 是边界框的中 心坐标,而 \(w\) 和 \(h\) 是边界框的宽与高。还有一点要注意,中心坐标的预测值 \((x, y)\) 是相对于 每个单元格左上角坐标点的偏移值,并且单位是相对于单元格大小的,单元格的坐标定义如图所示。而边界框的\(w\) 和 \(h\) 预测值是相对于整个图片的宽与高的比例,这样理论上4个元素的大小 应该在 \([0,1]\) 范围。这样,每个边界框的预测值实际上包含 5 个元素: \((x, y, w, h, c)\) ,其中 前 4 个表征边界框的大小与位置,而最后一个值是置信度。
还有分类问题,对于每一个单元格其还要给出预测出 C个类别概率值,其表征的是由该单元格负 责预测的边界框其目标属于各个类别的概率。但是这些概率值其实是在各个边界框置信度下的条件 概率,即 \(\operatorname{Pr}\left(\right. class _{i} \mid object )\) 。值得注意的是,不管一个单元格预测多少个边界框,其只预测 一组类别概率值,这是Yolo算法的一个缺点,在后来的改进版本中,Yolo9000是把类别概率预测 值与边界框是绑定在一起的。同时,我们可以计算出各个边界框类别置信度(class-specific confidence scores):
边界框类别置信度表征的是该边界框中目标属于各个类别的可能性大小以及边界框匹配目标的好 坏。后面会说,一般会根据类别置信度来过滤网络的预测框。
总结一下,每个单元格需要预测 \((B * 5+C)\) 个值。如果将输入图片划分为 \(S \times S\) 网格,那 么最终预测值为 \(S \times S \times(B * 5+C)\) 大小的张量。整个模型的预测值结构如下图所示。对 于PASCAL VOC数据,其共有20个类别,如果使用 \(S=7, B=2\) ,那么最终的预测结果就是 \(7 \times 7 \times 30\) 大小的张量。在下面的网络结构中我们会详细讲述每个单元格的预测值的分布位 置。
Yolo采用卷积网络来提取特征,然后使用全连接层来得到预测值。网络结构参考GooLeNet模型,包含24个卷积层和2个全连接层,如图8所示。对于卷积层,主要使用1×1卷积来做channle reduction,然后紧跟3×3卷积。对于卷积层和全连接层,采用Leaky ReLU激活函数:max(x, 0.1x) 。但是最后一层却采用线性激活函数。
损失函数计算如下:
其中第一项是边界框中心坐标的误差项, \(1_{i j}^{obj}\) 指的是第 \(i\) 个单元格存在目标,且该单元格中的第 \(j\) 个边界框负责预测该目标。第二项是边界框的高与宽的误差项。第三项是包含目标的边界框 的置信度误差项。第四项是不包含目标的边界框的置信度误差项。而最后一项是包含目标的单元格 的分类误差项, \(1_{i}^{\text {obj }}\) 指的是第 \(i\) 个单元格存在目标。这里特别说一下置信度的target值 \(C_{i}\) , 如果是不存在目标,此时由于 \(\operatorname{Pr}( object )=0\) ,那么 \(C_{i}=0\) 。如果存在目标,
\(\operatorname{Pr}( object )=1\) ,此时需要确定 \(\mathrm{IOU}{\text {pred }}^{\text {truth }}\) ,当然你希望最好的话,可以将IOU取 1 ,这样 \(C{i}=1\) ,但是在 YOLO实现中,使用了一个控制参数 rescore (默认为 1 ),当其为 1 时,IOU不 是设置为 1 ,而就是计算truth和pred之间的真实 IOU
网络预测: 基于非极大值抑制算法
这个算法不单单是针对Yolo算法的,而是所有的检测算法中都会用到。NMS算法主要解决的是一个目标被多次检测的问题,如图11中人脸检测,可以看到人脸被多次检测,但是其实我们希望最后仅仅输出其中一个最好的预测框,比如对于美女,只想要红色那个检测结果。那么可以采用NMS算法来实现这样的效果:首先从所有的检测框中找到置信度最大的那个框,然后挨个计算其与剩余框的IOU,如果其值大于一定阈值(重合度过高),那么就将该框剔除;然后对剩余的检测框重复上述过程,直到处理完所有的检测框。Yolo预测过程也需要用到NMS算法。
下面就来分析Yolo的预测过程,这里我们不考虑batch,认为只是预测一张输入图片。根据前面的分析,最终的网络输出是 7×7×30 ,但是我们可以将其分割成三个部分:类别概率部分为 [7,7,20] ,置信度部分为 [7,7,2] ,而边界框部分为 [7,7,2,4] (对于这部分不要忘记根据原始图片计算出其真实值)。然后将前两项相乘(矩阵 [7,7,20] 乘以 [7,7,2] 可以各补一个维度来完成 [7,7,1,20]×[7,7,2,1] )可以得到类别置信度值为 [7,7,2,20] ,这里总共预测了 7∗7∗2=98 个边界框。
所有的准备数据已经得到了,那么我们先说第一种策略来得到检测框的结果,我认为这是最正常与自然的处理。首先,对于每个预测框根据类别置信度选取置信度最大的那个类别作为其预测标签,经过这层处理我们得到各个预测框的预测类别及对应的置信度值,其大小都是 [7,7,2] 。一般情况下,会设置置信度阈值,就是将置信度小于该阈值的box过滤掉,所以经过这层处理,剩余的是置信度比较高的预测框。最后再对这些预测框使用NMS算法,最后留下来的就是检测结果。一个值得注意的点是NMS是对所有预测框一视同仁,还是区分每个类别,分别使用NMS。Ng在deeplearning.ai中讲应该区分每个类别分别使用NMS,但是看了很多实现,其实还是同等对待所有的框,我觉得可能是不同类别的目标出现在相同位置这种概率很低吧。
上面的预测方法应该非常简单明了,但是对于Yolo算法,其却采用了另外一个不同的处理思路(至少从C源码看是这样的),其区别就是先使用NMS,然后再确定各个box的类别。其基本过程如图12所示。对于98个boxes,首先将小于置信度阈值的值归0,然后分类别地对置信度值采用NMS,这里NMS处理结果不是剔除,而是将其置信度值归为0。最后才是确定各个box的类别,当其置信度值不为0时才做出检测结果输出。这个策略不是很直接,但是貌似Yolo源码就是这样做的。Yolo论文里面说NMS算法对Yolo的性能是影响很大的,所以可能这种策略对Yolo更好。但是我测试了普通的图片检测,两种策略结果是一样的。
YOLO系列(五)yolov4-tiny
YOLOv4-tiny结构是YOLOv4的精简版,属于轻量化模型,参数只有600万相当于原来的十分之一,这使得检测速度提升很大。整体网络结构共有38层,使用了三个残差单元,激活函数使用了LeakyReLU,目标的分类与回归改为使用两个特征层,合并有效特征层时使用了特征金字塔(FPN)网络。其同样使用了CSPnet结构,并对特征提取网络进行通道分割,将经过3×3卷积后输出的特征层通道划分为两部分,并取第二部分。在COCO数据集上得到了40.2%的AP50、371FPS,相较于其他版本的轻量化模型性能优势显著。其结构图如下图所示。
YOLOv4-tiny具有多任务、端到端、注意力机制和多尺度的特点。多任务即同时完成目标的分类与回归,实现参数共享,避免过拟合;端到端即模型接收图像数据后直接给出分类与回归的预测信息;注意力机制是重点关注目标区域特征进行详细处理,提高处理速度;多尺度的特点是将经过下采样和上采样的数据相互融合,其作用是能够分割出多种尺度大小的目标。在对模型进行训练时可以使用Mosaic数据增强、标签平滑、学习率余弦退火衰减等方法来提升模型的训练速度和检测精度。
YOLO系列(四):yolov3
yolov3属于一阶段、anchor-based 目标检测
FPN :
原来多数的object detection算法都是只采用顶层特征做预测,但我们知道低层的特征语义信息比较少,但是目标位置准确;高层的特征语义信息比较丰富,但是目标位置比较粗略。另外虽然也有些算法采用多尺度特征融合的方式,但是一般是采用融合后的特征做预测,而本文不一样的地方在于预测是在不同特征层独立进行的。
FPN(Feature Pyramid Network)算法可以同时利用低层特征高分辨率和高层特征的高语义信息,通过融合这些不同层的特征达到很好的预测效果。此外,和其他的特征融合方式不同的是本文中的预测是在每个融合后的特征层上单独进行的。(对不同特征层单独预测)
网络结构解析:
- Yolov3中,只有卷积层,通过调节卷积步长控制输出特征图的尺寸。所以对于输入图片尺寸没有特别限制。流程图中,输入图片以256*256作为样例。
- Yolov3借鉴了金字塔特征图思想,小尺寸特征图用于检测大尺寸物体,而大尺寸特征图检测小尺寸物体。特征图的输出维度为 , 为输出特征图格点数,一共3个Anchor框,每个框有4维预测框数值 ,1维预测框置信度,80维物体类别数。所以第一层特征图的输出维度为 。
- Yolov3总共输出3个特征图,第一个特征图下采样32倍,第二个特征图下采样16倍,第三个下采样8倍。输入图像经过Darknet-53(无全连接层),再经过Yoloblock生成的特征图被当作两用,第一用为经过3*3卷积层、1*1卷积之后生成特征图一,第二用为经过1*1卷积层加上采样层,与Darnet-53网络的中间层输出结果进行拼接,产生特征图二。同样的循环之后产生特征图三。
- concat操作与加和操作的区别:加和操作来源于ResNet思想,将输入的特征图,与输出特征图对应维度进行相加,即 ;而concat操作源于DenseNet网络的设计思路,将特征图按照通道维度直接进行拼接,例如8*8*16的特征图与8*8*16的特征图拼接后生成8*8*32的特征图。
- 上采样层(upsample):作用是将小尺寸特征图通过插值等方法,生成大尺寸图像。例如使用最近邻插值算法,将8*8的图像变换为16*16。上采样层不改变特征图的通道数。
Yolo的整个网络,吸取了Resnet、Densenet、FPN的精髓,可以说是融合了目标检测当前业界最有效的全部技巧。
YOLO系列(三):yolov2
yolov2属于一阶段、anchor-based 目标检测
YOLOv2的论文全名为YOLO9000: Better, Faster, Stronger,它斩获了CVPR 2017 Best Paper Honorable Mention。在这篇文章中,作者首先在YOLOv1的基础上提出了改进的YOLOv2,然后提出了一种检测与分类联合训练方法,使用这种联合训练方法在COCO检测数据集和ImageNet分类数据集上训练出了YOLO9000模型,其可以检测超过9000多类物体。所以,这篇文章其实包含两个模型:YOLOv2和YOLO9000,不过后者是在前者基础上提出的,两者模型主体结构是一致的。YOLOv2相比YOLOv1做了很多方面的改进,这也使得YOLOv2的mAP有显着的提升,并且YOLOv2的速度依然很快,保持着自己作为one-stage方法的优势.
Yolov2和Yolo9000算法内核相同,区别是训练方式不同:Yolov2用coco数据集训练后,可以识别80个种类。而Yolo9000可以使用coco数据集 + ImageNet数据集联合训练,可以识别9000多个种类。
YOLOv2的改进策略
YOLOv1虽然检测速度很快,但是在检测精度上却不如R-CNN系检测方法,YOLOv1在物体定位方面(localization)不够准确,并且召回率(recall)较低。YOLOv2共提出了几种改进策略来提升YOLO模型的定位准确度和召回率,从而提高mAP,YOLOv2在改进中遵循一个原则:保持检测速度,这也是YOLO模型的一大优势。YOLOv2的改进策略如图2所示,可以看出,大部分的改进方法都可以比较显着提升模型的mAP。
Batch Normalization
Batch Normalization可以提升模型收敛速度,而且可以起到一定正则化效果,降低模型的过拟合。在YOLOv2中,每个卷积层后面都添加了Batch Normalization层,并且不再使用droput。使用Batch Normalization后,YOLOv2的mAP提升了2.4%。
High Resolution Classifier:
目前大部分的检测模型都会在先在ImageNet分类数据集上预训练模型的主体部分(CNN特征提取器),由于历史原因,ImageNet分类模型基本采用大小为 224*224的图片作为输入,分辨率相对较低,不利于检测模型。所以YOLOv1在采用 224*224 分类模型预训练后,将分辨率增加至 448*448,并使用这个高分辨率在检测数据集上finetune。但是直接切换分辨率,检测模型可能难以快速适应高分辨率。所以YOLOv2增加了在ImageNet数据集上使用448*448输入来finetune分类网络这一中间过程(10 epochs),这可以使得模型在检测数据集上finetune之前已经适用高分辨率输入。使用高分辨率分类器后,YOLOv2的mAP提升了约4%。
Convolutional With Anchor Boxes:在YOLOv1中,输入图片最终被划分为7*7网格,每个单元格预测2个边界框。YOLOv1最后采用的是全连接层直接对边界框进行预测,其中边界框的宽与高是相对整张图片大小的,而由于各个图片中存在不同尺度和长宽比(scales and ratios)的物体,YOLOv1在训练过程中学习适应不同物体的形状是比较困难的,这也导致YOLOv1在精确定位方面表现较差。YOLOv2借鉴了Faster R-CNN中RPN网络的先验框(anchor boxes,prior boxes,SSD也采用了先验框)策略。RPN对CNN特征提取器得到的特征图(feature map)进行卷积来预测每个位置的边界框以及置信度(是否含有物体),并且各个位置设置不同尺度和比例的先验框,所以RPN预测的是边界框相对于先验框的offsets值(其实是transform值,详细见Faster R_CNN论文),采用先验框使得模型更容易学习。所以YOLOv2移除了YOLOv1中的全连接层而采用了卷积和anchor boxes来预测边界框。为了使检测所用的特征图分辨率更高,移除其中的一个pool层。在检测模型中,YOLOv2不是采用448*448图片作为输入,而是采用416*416大小。因为YOLOv2模型下采样的总步长为32,对于 416*416 大小的图片,最终得到的特征图大小为 13*13,维度是奇数,这样特征图恰好只有一个中心位置。对于一些大物体,它们中心点往往落入图片中心位置,此时使用特征图的一个中心点去预测这些物体的边界框相对容易些。所以在YOLOv2设计中要保证最终的特征图有奇数个位置。对于YOLOv1,每个cell都预测2个boxes,每个boxes包含5个值: (x,y,w,h,c),前4个值是边界框位置与大小,最后一个值是置信度(confidence scores,包含两部分:含有物体的概率以及预测框与ground truth的IOU)。但是每个cell只预测一套分类概率值(class predictions,其实是置信度下的条件概率值),供2个boxes共享。YOLOv2使用了anchor boxes之后,每个位置的各个anchor box都单独预测一套分类概率值,这和SSD比较类似(但SSD没有预测置信度,而是把background作为一个类别来处理)。使用anchor boxes之后,YOLOv2的mAP有稍微下降(这里下降的原因,我猜想是YOLOv2虽然使用了anchor boxes,但是依然采用YOLOv1的训练方法YOLOv1只能预测98个边界框( 7*7*2 ),而YOLOv2使用anchor boxes之后可以预测上千个边界框(13*13*num_anchors)。所以使用anchor boxes之后,YOLOv2的召回率大大提升,由原来的81%升至88%。
Dimension Clusters
在Faster R-CNN和SSD中,先验框的维度(长和宽)都是手动设定的,带有一定的主观性。如果选取的先验框维度比较合适,那么模型更容易学习,从而做出更好的预测。因此,YOLOv2采用k-means聚类方法对训练集中的边界框做了聚类分析。因为设置先验框的主要目的是为了使得预测框与ground truth的IOU更好,所以聚类分析时选用box与聚类中心box之间的IOU值作为距离指标:
$$
d(\text { box }, \text { centroid })=1-I O U(\text { box }, \text { centroid })
$$
下图为在VOC和COCO数据集上的聚类分析结果,随着聚类中心数目的增加,平均IOU值(各个边界框与聚类中心的IOU的平均值)是增加的,但是综合考虑模型复杂度和召回率,作者最终选取5个聚类中心作为先验框,其相对于图片的大小如右边图所示。对于两个数据集,5个先验框的width和height如下所示(来源:YOLO源码的cfg文件):
COCO: (0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434), (7.88282, 3.52778), (9.77052, 9.16828)
VOC: (1.3221, 1.73145), (3.19275, 4.00944), (5.05587, 8.09892), (9.47112, 4.84053), (11.2364, 10.0071)
但是这里先验框的大小具体指什么作者并没有说明,但肯定不是像素点,从代码实现上看,应该是相对于预测的特征图大小( )。对比两个数据集,也可以看到COCO数据集上的物体相对小点。这个策略作者并没有单独做实验,但是作者对比了采用聚类分析得到的先验框与手动设置的先验框在平均IOU上的差异,发现前者的平均IOU值更高,因此模型更容易训练学习。
New Network: Darknet-19
YOLOv2采用了一个新的基础模型(特征提取器),称为Darknet-19,包括19个卷积层和5个maxpooling层,如图4所示。Darknet-19与VGG16模型设计原则是一致的,主要采用3*3卷积,采用2*2的maxpooling层之后,特征图维度降低2倍,而同时将特征图的channles增加两倍。与NIN(Network in Network)类似,Darknet-19最终采用global avgpooling做预测,并且在3*3卷积之间使用1*1卷积来压缩特征图channles以降低模型计算量和参数。Darknet-19每个卷积层后面同样使用了batch norm层以加快收敛速度,降低模型过拟合。在ImageNet分类数据集上,Darknet-19的top-1准确度为72.9%,top-5准确度为91.2%,但是模型参数相对小一些。使用Darknet-19之后,YOLOv2的mAP值没有显着提升,但是计算量却可以减少约33%。
Direct location prediction:
沿用YOLOv1的方法,就是预测边界框中心点相对于对应cell左上角位置的相对偏移值,为了将边界框中心点约束在当前cell中,使用sigmoid函数处理偏移值,这样预测的偏移值在(0,1)范围内(每个cell的尺度看做1)。
Fine-Grained Features 更精细的特征图
YOLOv2的输入图片大小为 416*416 ,经过5次maxpooling之后得到 13*13 大小的特征图,并以此特征图采用卷积做预测。13*13大小的特征图对检测大物体是足够了,但是对于小物体还需要更精细的特征图(Fine-Grained Features)。因此SSD使用了多尺度的特征图来分别检测不同大小的物体,前面更精细的特征图可以用来预测小物体。YOLOv2提出了一种passthrough层来利用更精细的特征图。YOLOv2所利用的Fine-Grained Features是26*26大小的特征图(最后一个maxpooling层的输入),对于Darknet-19模型来说就是大小为 26*26*512 的特征图。passthrough层与ResNet网络的shortcut类似,以前面更高分辨率的特征图为输入,然后将其连接到后面的低分辨率特征图上。前面的特征图维度是后面的特征图的2倍,passthrough层抽取前面层的每个 2*2的局部区域,然后将其转化为channel维度,对于 [ 26*26*512 ] 的特征图,经passthrough层处理之后就变成了 [13*13*2048] 的新特征图(特征图大小降低4倍,而channles增加4倍,下图为一个实例),这样就可以与后面的 [13*13*1024] 特征图连接在一起形成 13*13*3072大小的特征图,然后在此特征图基础上卷积做预测。在YOLO的C源码中,passthrough层称为reorg layer。在TensorFlow中,可以使用tf.extract_image_patches或者tf.space_to_depth来实现passthrough层
Multi-Scale Training
采用Multi-Scale Training策略,YOLOv2可以适应不同大小的图片,并且预测出很好的结果。在测试时,YOLOv2可以采用不同大小的图片作为输入,在VOC 2007数据集上的效果如下图所示。可以看到采用较小分辨率时,YOLOv2的mAP值略低,但是速度更快,而采用高分辨输入时,mAP值更高,但是速度略有下降,对于 544*544,mAP高达78.6%。注意,这只是测试时输入图片大小不同,而实际上用的是同一个模型(采用Multi-Scale Training训练)
YOLO9000
YOLO9000是在YOLOv2的基础上提出的一种可以检测超过9000个类别的模型,其主要贡献点在于提出了一种分类和检测的联合训练策略。众多周知,检测数据集的标注要比分类数据集打标签繁琐的多,所以ImageNet分类数据集比VOC等检测数据集高出几个数量级。在YOLO中,边界框的预测其实并不依赖于物体的标签,所以YOLO可以实现在分类和检测数据集上的联合训练。对于检测数据集,可以用来学习预测物体的边界框、置信度以及为物体分类,而对于分类数据集可以仅用来学习分类,但是其可以大大扩充模型所能检测的物体种类。
作者选择在COCO和ImageNet数据集上进行联合训练,但是遇到的第一问题是两者的类别并不是完全互斥的,比如”Norfolk terrier”明显属于”dog”,所以作者提出了一种层级分类方法(Hierarchical classification),主要思路是根据各个类别之间的从属关系(根据WordNet)建立一种树结构WordTree
WordTree中的根节点为”physical object”,每个节点的子节点都属于同一子类,可以对它们进行softmax处理。在给出某个类别的预测概率时,需要找到其所在的位置,遍历这个path,然后计算path上各个节点的概率之积。
在训练时,如果是检测样本,按照YOLOv2的loss计算误差,而对于分类样本,只计算分类误差。在预测时,YOLOv2给出的置信度就是 ,同时会给出边界框位置以及一个树状概率图。在这个概率图中找到概率最高的路径,当达到某一个阈值时停止,就用当前节点表示预测的类别。
通过联合训练策略,YOLO9000可以快速检测出超过9000个类别的物体,总体mAP值为19,7%。我觉得这是作者在这篇论文作出的最大的贡献,因为YOLOv2的改进策略亮点并不是很突出,但是YOLO9000算是开创之举。
reference:
生成对抗网络系列论文+pytorch实现
github链接:
https://github.com/eriklindernoren/PyTorch-GAN
This repository has gone stale as I unfortunately do not have the time to maintain it anymore. If you would like to continue the development of it as a collaborator send me an email at eriklindernoren@gmail.com.
PyTorch-GAN
Collection of PyTorch implementations of Generative Adversarial Network varieties presented in research papers. Model architectures will not always mirror the ones proposed in the papers, but I have chosen to focus on getting the core ideas covered instead of getting every layer configuration right. Contributions and suggestions of GANs to implement are very welcomed.
See also: Keras-GAN
Table of Contents
- Installation
- Implementations
- Auxiliary Classifier GAN
- Adversarial Autoencoder
- BEGAN
- BicycleGAN
- Boundary-Seeking GAN
- Cluster GAN
- Conditional GAN
- Context-Conditional GAN
- Context Encoder
- Coupled GAN
- CycleGAN
- Deep Convolutional GAN
- DiscoGAN
- DRAGAN
- DualGAN
- Energy-Based GAN
- Enhanced Super-Resolution GAN
- GAN
- InfoGAN
- Least Squares GAN
- MUNIT
- Pix2Pix
- PixelDA
- Relativistic GAN
- Semi-Supervised GAN
- Softmax GAN
- StarGAN
- Super-Resolution GAN
- UNIT
- Wasserstein GAN
- Wasserstein GAN GP
- Wasserstein GAN DIV
Installation
$ git clone https://github.com/eriklindernoren/PyTorch-GAN
$ cd PyTorch-GAN/
$ sudo pip3 install -r requirements.txt
Implementations
Auxiliary Classifier GAN
Auxiliary Classifier Generative Adversarial Network
Authors
Augustus Odena, Christopher Olah, Jonathon Shlens
Abstract
Synthesizing high resolution photorealistic images has been a long-standing challenge in machine learning. In this paper we introduce new methods for the improved training of generative adversarial networks (GANs) for image synthesis. We construct a variant of GANs employing label conditioning that results in 128×128 resolution image samples exhibiting global coherence. We expand on previous work for image quality assessment to provide two new analyses for assessing the discriminability and diversity of samples from class-conditional image synthesis models. These analyses demonstrate that high resolution samples provide class information not present in low resolution samples. Across 1000 ImageNet classes, 128×128 samples are more than twice as discriminable as artificially resized 32×32 samples. In addition, 84.7% of the classes have samples exhibiting diversity comparable to real ImageNet data.
Run Example
$ cd implementations/acgan/
$ python3 acgan.py
Adversarial Autoecoder
Adversarial Autoencoder
Authors
Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, Brendan Frey
Abstract
n this paper, we propose the “adversarial autoencoder” (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) to perform variational inference by matching the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. As a result, the decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We performed experiments on MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.
Run Example
$ cd implementations/aae/
$ python3 aae.py
BEGAN
BEGAN: Boundary Equilibrium Generative Adversarial Networks
Authors
David Berthelot, Thomas Schumm, Luke Metz
Abstract
We propose a new equilibrium enforcing method paired with a loss derived from the Wasserstein distance for training auto-encoder based Generative Adversarial Networks. This method balances the generator and discriminator during training. Additionally, it provides a new approximate convergence measure, fast and stable training and high visual quality. We also derive a way of controlling the trade-off between image diversity and visual quality. We focus on the image generation task, setting a new milestone in visual quality, even at higher resolutions. This is achieved while using a relatively simple model architecture and a standard training procedure.
Run Example
$ cd implementations/began/
$ python3 began.py
BicycleGAN
Toward Multimodal Image-to-Image Translation
Authors
Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, Eli Shechtman
Abstract
Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a \emph{distribution} of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.
Run Example
$ cd data/ $ bash download_pix2pix_dataset.sh edges2shoes $ cd ../implementations/bicyclegan/ $ python3 bicyclegan.py
Various style translations by varying the latent code.
Boundary-Seeking GAN
Boundary-Seeking Generative Adversarial Networks
Authors
R Devon Hjelm, Athul Paul Jacob, Tong Che, Adam Trischler, Kyunghyun Cho, Yoshua Bengio
Abstract
Generative adversarial networks (GANs) are a learning framework that rely on training a discriminator to estimate a measure of difference between a target and generated distributions. GANs, as normally formulated, rely on the generated samples being completely differentiable w.r.t. the generative parameters, and thus do not work for discrete data. We introduce a method for training GANs with discrete data that uses the estimated difference measure from the discriminator to compute importance weights for generated samples, thus providing a policy gradient for training the generator. The importance weights have a strong connection to the decision boundary of the discriminator, and we call our method boundary-seeking GANs (BGANs). We demonstrate the effectiveness of the proposed algorithm with discrete image and character-based natural language generation. In addition, the boundary-seeking objective extends to continuous data, which can be used to improve stability of training, and we demonstrate this on Celeba, Large-scale Scene Understanding (LSUN) bedrooms, and Imagenet without conditioning.
Run Example
$ cd implementations/bgan/
$ python3 bgan.py
Cluster GAN
ClusterGAN: Latent Space Clustering in Generative Adversarial Networks
Authors
Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, Sreeram Kannan
Abstract
Generative Adversarial networks (GANs) have obtained remarkable success in many unsupervised learning tasks and unarguably, clustering is an important unsupervised learning problem. While one can potentially exploit the latent-space back-projection in GANs to cluster, we demonstrate that the cluster structure is not retained in the GAN latent space. In this paper, we propose ClusterGAN as a new mechanism for clustering using GANs. By sampling latent variables from a mixture of one-hot encoded variables and continuous latent variables, coupled with an inverse network (which projects the data to the latent space) trained jointly with a clustering specific loss, we are able to achieve clustering in the latent space. Our results show a remarkable phenomenon that GANs can preserve latent space interpolation across categories, even though the discriminator is never exposed to such vectors. We compare our results with various clustering baselines and demonstrate superior performance on both synthetic and real datasets.
Code based on a full PyTorch [implementation].
Run Example
$ cd implementations/cluster_gan/ $ python3 clustergan.py
Conditional GAN
Conditional Generative Adversarial Nets
Authors
Mehdi Mirza, Simon Osindero
Abstract
Generative Adversarial Nets [8] were recently introduced as a novel way to train generative models. In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data, y, we wish to condition on to both the generator and discriminator. We show that this model can generate MNIST digits conditioned on class labels. We also illustrate how this model could be used to learn a multi-modal model, and provide preliminary examples of an application to image tagging in which we demonstrate how this approach can generate descriptive tags which are not part of training labels.
Run Example
$ cd implementations/cgan/ $ python3 cgan.py
Context-Conditional GAN
Semi-Supervised Learning with Context-Conditional Generative Adversarial Networks
Authors
Emily Denton, Sam Gross, Rob Fergus
Abstract
We introduce a simple semi-supervised learning approach for images based on in-painting using an adversarial loss. Images with random patches removed are presented to a generator whose task is to fill in the hole, based on the surrounding pixels. The in-painted images are then presented to a discriminator network that judges if they are real (unaltered training images) or not. This task acts as a regularizer for standard supervised training of the discriminator. Using our approach we are able to directly train large VGG-style networks in a semi-supervised fashion. We evaluate on STL-10 and PASCAL datasets, where our approach obtains performance comparable or superior to existing methods.
Run Example
$ cd implementations/ccgan/
$ python3 ccgan.py
Context Encoder
Context Encoders: Feature Learning by Inpainting
Authors
Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros
Abstract
We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders — a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Run Example
$ cd implementations/context_encoder/ <follow steps at the top of context_encoder.py> $ python3 context_encoder.py
Rows: Masked | Inpainted | Original | Masked | Inpainted | Original
Coupled GAN
Coupled Generative Adversarial Networks
Authors
Ming-Yu Liu, Oncel Tuzel
Abstract
We propose coupled generative adversarial network (CoGAN) for learning a joint distribution of multi-domain images. In contrast to the existing approaches, which require tuples of corresponding images in different domains in the training set, CoGAN can learn a joint distribution without any tuple of corresponding images. It can learn a joint distribution with just samples drawn from the marginal distributions. This is achieved by enforcing a weight-sharing constraint that limits the network capacity and favors a joint distribution solution over a product of marginal distributions one. We apply CoGAN to several joint distribution learning tasks, including learning a joint distribution of color and depth images, and learning a joint distribution of face images with different attributes. For each task it successfully learns the joint distribution without any tuple of corresponding images. We also demonstrate its applications to domain adaptation and image transformation.
Run Example
$ cd implementations/cogan/ $ python3 cogan.py
Generated MNIST and MNIST-M images
CycleGAN
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
Authors
Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros
Abstract
Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G:X→Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F:Y→X and introduce a cycle consistency loss to push F(G(X))≈X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.
Run Example
$ cd data/ $ bash download_cyclegan_dataset.sh monet2photo $ cd ../implementations/cyclegan/ $ python3 cyclegan.py --dataset_name monet2photo
Monet to photo translations.
Deep Convolutional GAN
Deep Convolutional Generative Adversarial Network
Authors
Alec Radford, Luke Metz, Soumith Chintala
Abstract
In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks – demonstrating their applicability as general image representations.
Run Example
$ cd implementations/dcgan/ $ python3 dcgan.py
DiscoGAN
Learning to Discover Cross-Domain Relations with Generative Adversarial Networks
Authors
Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim
Abstract
While humans easily recognize relations between data from different domains without any supervision, learning to automatically discover them is in general very challenging and needs many ground-truth pairs that illustrate the relations. To avoid costly pairing, we address the task of discovering cross-domain relations given unpaired data. We propose a method based on generative adversarial networks that learns to discover relations between different domains (DiscoGAN). Using the discovered relations, our proposed network successfully transfers style from one domain to another while preserving key attributes such as orientation and face identity.
Run Example
$ cd data/ $ bash download_pix2pix_dataset.sh edges2shoes $ cd ../implementations/discogan/ $ python3 discogan.py --dataset_name edges2shoes
Rows from top to bottom: (1) Real image from domain A (2) Translated image from
domain A (3) Reconstructed image from domain A (4) Real image from domain B (5)
Translated image from domain B (6) Reconstructed image from domain B
DRAGAN
On Convergence and Stability of GANs
Authors
Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira
Abstract
We propose studying GAN training dynamics as regret minimization, which is in contrast to the popular view that there is consistent minimization of a divergence between real and generated distributions. We analyze the convergence of GAN training from this new point of view to understand why mode collapse happens. We hypothesize the existence of undesirable local equilibria in this non-convex game to be responsible for mode collapse. We observe that these local equilibria often exhibit sharp gradients of the discriminator function around some real data points. We demonstrate that these degenerate local equilibria can be avoided with a gradient penalty scheme called DRAGAN. We show that DRAGAN enables faster training, achieves improved stability with fewer mode collapses, and leads to generator networks with better modeling performance across a variety of architectures and objective functions.
Run Example
$ cd implementations/dragan/
$ python3 dragan.py
DualGAN
DualGAN: Unsupervised Dual Learning for Image-to-Image Translation
Authors
Zili Yi, Hao Zhang, Ping Tan, Minglun Gong
Abstract
Conditional Generative Adversarial Networks (GANs) for cross-domain image-to-image translation have made much progress recently. Depending on the task complexity, thousands to millions of labeled image pairs are needed to train a conditional GAN. However, human labeling is expensive, even impractical, and large quantities of data may not always be available. Inspired by dual learning from natural language translation, we develop a novel dual-GAN mechanism, which enables image translators to be trained from two sets of unlabeled images from two domains. In our architecture, the primal GAN learns to translate images from domain U to those in domain V, while the dual GAN learns to invert the task. The closed loop made by the primal and dual tasks allows images from either domain to be translated and then reconstructed. Hence a loss function that accounts for the reconstruction error of images can be used to train the translators. Experiments on multiple image translation tasks with unlabeled data show considerable performance gain of DualGAN over a single GAN. For some tasks, DualGAN can even achieve comparable or slightly better results than conditional GAN trained on fully labeled data.
Run Example
$ cd data/
$ bash download_pix2pix_dataset.sh facades
$ cd ../implementations/dualgan/
$ python3 dualgan.py --dataset_name facades
Energy-Based GAN
Energy-based Generative Adversarial Network
Authors
Junbo Zhao, Michael Mathieu, Yann LeCun
Abstract
We introduce the “Energy-based Generative Adversarial Network” model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discriminator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of EBGAN framework as using an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.
Run Example
$ cd implementations/ebgan/
$ python3 ebgan.py
Enhanced Super-Resolution GAN
ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks
Authors
Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, Xiaoou Tang
Abstract
The Super-Resolution Generative Adversarial Network (SRGAN) is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts. To further enhance the visual quality, we thoroughly study three key components of SRGAN – network architecture, adversarial loss and perceptual loss, and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery. Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge. The code is available at this https URL.
Run Example
$ cd implementations/esrgan/ <follow steps at the top of esrgan.py> $ python3 esrgan.py
Nearest Neighbor Upsampling | ESRGAN
GAN
Generative Adversarial Network
Authors
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
Abstract
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
Run Example
$ cd implementations/gan/ $ python3 gan.py
InfoGAN
InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
Authors
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel
Abstract
This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.
Run Example
$ cd implementations/infogan/ $ python3 infogan.py
Result of varying categorical latent variable by column.
Result of varying continuous latent variable by row.
Least Squares GAN
Least Squares Generative Adversarial Networks
Authors
Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, Stephen Paul Smolley
Abstract
Unsupervised learning with generative adversarial networks (GANs) has proven hugely successful. Regular GANs hypothesize the discriminator as a classifier with the sigmoid cross entropy loss function. However, we found that this loss function may lead to the vanishing gradients problem during the learning process. To overcome such a problem, we propose in this paper the Least Squares Generative Adversarial Networks (LSGANs) which adopt the least squares loss function for the discriminator. We show that minimizing the objective function of LSGAN yields minimizing the Pearson χ2 divergence. There are two benefits of LSGANs over regular GANs. First, LSGANs are able to generate higher quality images than regular GANs. Second, LSGANs perform more stable during the learning process. We evaluate LSGANs on five scene datasets and the experimental results show that the images generated by LSGANs are of better quality than the ones generated by regular GANs. We also conduct two comparison experiments between LSGANs and regular GANs to illustrate the stability of LSGANs.
Run Example
$ cd implementations/lsgan/
$ python3 lsgan.py
MUNIT
Multimodal Unsupervised Image-to-Image Translation
Authors
Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz
Abstract
Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to the state-of-the-art approaches further demonstrates the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at this https URL
Run Example
$ cd data/ $ bash download_pix2pix_dataset.sh edges2shoes $ cd ../implementations/munit/ $ python3 munit.py --dataset_name edges2shoes
Results by varying the style code.
Pix2Pix
Unpaired Image-to-Image Translation with Conditional Adversarial Networks
Authors
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
Abstract
We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. Indeed, since the release of the pix2pix software associated with this paper, a large number of internet users (many of them artists) have posted their own experiments with our system, further demonstrating its wide applicability and ease of adoption without the need for parameter tweaking. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.
Run Example
$ cd data/ $ bash download_pix2pix_dataset.sh facades $ cd ../implementations/pix2pix/ $ python3 pix2pix.py --dataset_name facades
Rows from top to bottom: (1) The condition for the generator (2) Generated image
based of condition (3) The true corresponding image to the condition
PixelDA
Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks
Authors
Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, Dilip Krishnan
Abstract
Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks. One appealing alternative is rendering synthetic data where ground-truth annotations are generated automatically. Unfortunately, models trained purely on rendered images often fail to generalize to real images. To address this shortcoming, prior work introduced unsupervised domain adaptation algorithms that attempt to map representations between the two domains or learn to extract features that are domain-invariant. In this work, we present a new approach that learns, in an unsupervised manner, a transformation in the pixel space from one domain to the other. Our generative adversarial network (GAN)-based method adapts source-domain images to appear as if drawn from the target domain. Our approach not only produces plausible samples, but also outperforms the state-of-the-art on a number of unsupervised domain adaptation scenarios by large margins. Finally, we demonstrate that the adaptation process generalizes to object classes unseen during training.
MNIST to MNIST-M Classification
Trains a classifier on images that have been translated from the source domain (MNIST) to the target domain (MNIST-M) using the annotations of the source domain images. The classification network is trained jointly with the generator network to optimize the generator for both providing a proper domain translation and also for preserving the semantics of the source domain image. The classification network trained on translated images is compared to the naive solution of training a classifier on MNIST and evaluating it on MNIST-M. The naive model manages a 55% classification accuracy on MNIST-M while the one trained during domain adaptation achieves a 95% classification accuracy.
$ cd implementations/pixelda/
$ python3 pixelda.py
Method | Accuracy |
---|---|
Naive | 55% |
PixelDA | 95% |
Rows from top to bottom: (1) Real images from MNIST (2) Translated images from
MNIST to MNIST-M (3) Examples of images from MNIST-M
Relativistic GAN
The relativistic discriminator: a key element missing from standard GAN
Authors
Alexia Jolicoeur-Martineau
Abstract
In standard generative adversarial network (SGAN), the discriminator estimates the probability that the input data is real. The generator is trained to increase the probability that fake data is real. We argue that it should also simultaneously decrease the probability that real data is real because 1) this would account for a priori knowledge that half of the data in the mini-batch is fake, 2) this would be observed with divergence minimization, and 3) in optimal settings, SGAN would be equivalent to integral probability metric (IPM) GANs. We show that this property can be induced by using a relativistic discriminator which estimate the probability that the given real data is more realistic than a randomly sampled fake data. We also present a variant in which the discriminator estimate the probability that the given real data is more realistic than fake data, on average. We generalize both approaches to non-standard GAN loss functions and we refer to them respectively as Relativistic GANs (RGANs) and Relativistic average GANs (RaGANs). We show that IPM-based GANs are a subset of RGANs which use the identity function. Empirically, we observe that 1) RGANs and RaGANs are significantly more stable and generate higher quality data samples than their non-relativistic counterparts, 2) Standard RaGAN with gradient penalty generate data of better quality than WGAN-GP while only requiring a single discriminator update per generator update (reducing the time taken for reaching the state-of-the-art by 400%), and 3) RaGANs are able to generate plausible high resolutions images (256×256) from a very small sample (N=2011), while GAN and LSGAN cannot; these images are of significantly better quality than the ones generated by WGAN-GP and SGAN with spectral normalization.
Run Example
$ cd implementations/relativistic_gan/
$ python3 relativistic_gan.py # Relativistic Standard GAN
$ python3 relativistic_gan.py --rel_avg_gan # Relativistic Average GAN
Semi-Supervised GAN
Semi-Supervised Generative Adversarial Network
Authors
Augustus Odena
Abstract
We extend Generative Adversarial Networks (GANs) to the semi-supervised context by forcing the discriminator network to output class labels. We train a generative model G and a discriminator D on a dataset with inputs belonging to one of N classes. At training time, D is made to predict which of N+1 classes the input belongs to, where an extra class is added to correspond to the outputs of G. We show that this method can be used to create a more data-efficient classifier and that it allows for generating higher quality samples than a regular GAN.
Run Example
$ cd implementations/sgan/
$ python3 sgan.py
Softmax GAN
Softmax GAN
Authors
Min Lin
Abstract
Softmax GAN is a novel variant of Generative Adversarial Network (GAN). The key idea of Softmax GAN is to replace the classification loss in the original GAN with a softmax cross-entropy loss in the sample space of one single batch. In the adversarial learning of N real training samples and M generated samples, the target of discriminator training is to distribute all the probability mass to the real samples, each with probability 1M, and distribute zero probability to generated data. In the generator training phase, the target is to assign equal probability to all data points in the batch, each with probability 1M+N. While the original GAN is closely related to Noise Contrastive Estimation (NCE), we show that Softmax GAN is the Importance Sampling version of GAN. We futher demonstrate with experiments that this simple change stabilizes GAN training.
Run Example
$ cd implementations/softmax_gan/
$ python3 softmax_gan.py
StarGAN
StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation
Authors
Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, Jaegul Choo
Abstract
Recent studies have shown remarkable success in image-to-image translation for two domains. However, existing approaches have limited scalability and robustness in handling more than two domains, since different models should be built independently for every pair of image domains. To address this limitation, we propose StarGAN, a novel and scalable approach that can perform image-to-image translations for multiple domains using only a single model. Such a unified model architecture of StarGAN allows simultaneous training of multiple datasets with different domains within a single network. This leads to StarGAN’s superior quality of translated images compared to existing models as well as the novel capability of flexibly translating an input image to any desired target domain. We empirically demonstrate the effectiveness of our approach on a facial attribute transfer and a facial expression synthesis tasks.
Run Example
$ cd implementations/stargan/ <follow steps at the top of stargan.py> $ python3 stargan.py
Original | Black Hair | Blonde Hair | Brown Hair | Gender Flip | Aged
Super-Resolution GAN
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
Authors
Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi
Abstract
Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.
Run Example
$ cd implementations/srgan/ <follow steps at the top of srgan.py> $ python3 srgan.py
Nearest Neighbor Upsampling | SRGAN
UNIT
Unsupervised Image-to-Image Translation Networks
Authors
Ming-Yu Liu, Thomas Breuel, Jan Kautz
Abstract
Unsupervised image-to-image translation aims at learning a joint distribution of images in different domains by using images from the marginal distributions in individual domains. Since there exists an infinite set of joint distributions that can arrive the given marginal distributions, one could infer nothing about the joint distribution from the marginal distributions without additional assumptions. To address the problem, we make a shared-latent space assumption and propose an unsupervised image-to-image translation framework based on Coupled GANs. We compare the proposed framework with competing approaches and present high quality image translation results on various challenging unsupervised image translation tasks, including street scene image translation, animal image translation, and face image translation. We also apply the proposed framework to domain adaptation and achieve state-of-the-art performance on benchmark datasets. Code and additional results are available in this https URL.
Run Example
$ cd data/
$ bash download_cyclegan_dataset.sh apple2orange
$ cd implementations/unit/
$ python3 unit.py --dataset_name apple2orange
Wasserstein GAN
Wasserstein GAN
Authors
Martin Arjovsky, Soumith Chintala, Léon Bottou
Abstract
We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.
Run Example
$ cd implementations/wgan/
$ python3 wgan.py
Wasserstein GAN GP
Improved Training of Wasserstein GANs
Authors
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville
Abstract
Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only low-quality samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models over discrete data. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms.
Run Example
$ cd implementations/wgan_gp/ $ python3 wgan_gp.py
Wasserstein GAN DIV
Wasserstein Divergence for GANs
Authors
Jiqing Wu, Zhiwu Huang, Janine Thoma, Dinesh Acharya, Luc Van Gool
Abstract
In many domains of computer vision, generative adversarial networks (GANs) have achieved great success, among which the fam- ily of Wasserstein GANs (WGANs) is considered to be state-of-the-art due to the theoretical contributions and competitive qualitative performance. However, it is very challenging to approximate the k-Lipschitz constraint required by the Wasserstein-1 metric (W-met). In this paper, we propose a novel Wasserstein divergence (W-div), which is a relaxed version of W-met and does not require the k-Lipschitz constraint.As a concrete application, we introduce a Wasserstein divergence objective for GANs (WGAN-div), which can faithfully approximate W-div through optimization. Under various settings, including progressive growing training, we demonstrate the stability of the proposed WGAN-div owing to its theoretical and practical advantages over WGANs. Also, we study the quantitative and visual performance of WGAN-div on standard image synthesis benchmarks, showing the superior performance of WGAN-div compared to the state-of-the-art methods.
Run Example
$ cd implementations/wgan_div/ $ python3 wgan_div.py