Generative Adversarial Network Paper Series + PyTorch Implementations

GitHub link:

https://github.com/eriklindernoren/PyTorch-GAN

This repository has gone stale as I unfortunately do not have the time to maintain it anymore. If you would like to continue the development of it as a collaborator send me an email at eriklindernoren@gmail.com.

PyTorch-GAN

Collection of PyTorch implementations of Generative Adversarial Network varieties presented in research papers. Model architectures will not always mirror the ones proposed in the papers, but I have chosen to focus on getting the core ideas covered instead of getting every layer configuration right. Contributions and suggestions of GANs to implement are very welcome.

See also: Keras-GAN

Table of Contents

Installation

$ git clone https://github.com/eriklindernoren/PyTorch-GAN
$ cd PyTorch-GAN/
$ sudo pip3 install -r requirements.txt

Implementations

Auxiliary Classifier GAN

Auxiliary Classifier Generative Adversarial Network

Authors

Augustus Odena, Christopher Olah, Jonathon Shlens

Abstract

Synthesizing high resolution photorealistic images has been a long-standing challenge in machine learning. In this paper we introduce new methods for the improved training of generative adversarial networks (GANs) for image synthesis. We construct a variant of GANs employing label conditioning that results in 128×128 resolution image samples exhibiting global coherence. We expand on previous work for image quality assessment to provide two new analyses for assessing the discriminability and diversity of samples from class-conditional image synthesis models. These analyses demonstrate that high resolution samples provide class information not present in low resolution samples. Across 1000 ImageNet classes, 128×128 samples are more than twice as discriminable as artificially resized 32×32 samples. In addition, 84.7% of the classes have samples exhibiting diversity comparable to real ImageNet data.

[Paper] [Code]

Run Example

$ cd implementations/acgan/
$ python3 acgan.py

Adversarial Autoencoder

Adversarial Autoencoder

Authors

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, Brendan Frey

Abstract

In this paper, we propose the “adversarial autoencoder” (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) to perform variational inference by matching the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. As a result, the decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We performed experiments on MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.

[Paper] [Code]

Run Example

$ cd implementations/aae/
$ python3 aae.py

BEGAN

BEGAN: Boundary Equilibrium Generative Adversarial Networks

Authors

David Berthelot, Thomas Schumm, Luke Metz

Abstract

We propose a new equilibrium enforcing method paired with a loss derived from the Wasserstein distance for training auto-encoder based Generative Adversarial Networks. This method balances the generator and discriminator during training. Additionally, it provides a new approximate convergence measure, fast and stable training and high visual quality. We also derive a way of controlling the trade-off between image diversity and visual quality. We focus on the image generation task, setting a new milestone in visual quality, even at higher resolutions. This is achieved while using a relatively simple model architecture and a standard training procedure.

[Paper] [Code]

Run Example

$ cd implementations/began/
$ python3 began.py

BicycleGAN

Toward Multimodal Image-to-Image Translation

Authors

Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, Eli Shechtman

Abstract

Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a \emph{distribution} of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh edges2shoes
$ cd ../implementations/bicyclegan/
$ python3 bicyclegan.py

Various style translations by varying the latent code.

Boundary-Seeking GAN

Boundary-Seeking Generative Adversarial Networks

Authors

R Devon Hjelm, Athul Paul Jacob, Tong Che, Adam Trischler, Kyunghyun Cho, Yoshua Bengio

Abstract

Generative adversarial networks (GANs) are a learning framework that rely on training a discriminator to estimate a measure of difference between a target and generated distributions. GANs, as normally formulated, rely on the generated samples being completely differentiable w.r.t. the generative parameters, and thus do not work for discrete data. We introduce a method for training GANs with discrete data that uses the estimated difference measure from the discriminator to compute importance weights for generated samples, thus providing a policy gradient for training the generator. The importance weights have a strong connection to the decision boundary of the discriminator, and we call our method boundary-seeking GANs (BGANs). We demonstrate the effectiveness of the proposed algorithm with discrete image and character-based natural language generation. In addition, the boundary-seeking objective extends to continuous data, which can be used to improve stability of training, and we demonstrate this on Celeba, Large-scale Scene Understanding (LSUN) bedrooms, and Imagenet without conditioning.

[Paper] [Code]

Run Example

$ cd implementations/bgan/
$ python3 bgan.py

Cluster GAN

ClusterGAN: Latent Space Clustering in Generative Adversarial Networks

Authors

Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, Sreeram Kannan

Abstract

Generative Adversarial networks (GANs) have obtained remarkable success in many unsupervised learning tasks and unarguably, clustering is an important unsupervised learning problem. While one can potentially exploit the latent-space back-projection in GANs to cluster, we demonstrate that the cluster structure is not retained in the GAN latent space. In this paper, we propose ClusterGAN as a new mechanism for clustering using GANs. By sampling latent variables from a mixture of one-hot encoded variables and continuous latent variables, coupled with an inverse network (which projects the data to the latent space) trained jointly with a clustering specific loss, we are able to achieve clustering in the latent space. Our results show a remarkable phenomenon that GANs can preserve latent space interpolation across categories, even though the discriminator is never exposed to such vectors. We compare our results with various clustering baselines and demonstrate superior performance on both synthetic and real datasets.

[Paper] [Code]

Code based on a full PyTorch [implementation].

Run Example

$ cd implementations/cluster_gan/
$ python3 clustergan.py

Conditional GAN

Conditional Generative Adversarial Nets

Authors

Mehdi Mirza, Simon Osindero

Abstract

Generative Adversarial Nets [8] were recently introduced as a novel way to train generative models. In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data, y, we wish to condition on to both the generator and discriminator. We show that this model can generate MNIST digits conditioned on class labels. We also illustrate how this model could be used to learn a multi-modal model, and provide preliminary examples of an application to image tagging in which we demonstrate how this approach can generate descriptive tags which are not part of training labels.

[Paper] [Code]

Run Example

$ cd implementations/cgan/
$ python3 cgan.py
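As a rough sketch of the conditioning idea described in the abstract (illustrative layer sizes and names, not necessarily the exact code in cgan.py): the class label y is embedded and concatenated with the noise vector fed to the generator, and with the image fed to the discriminator.

import torch
import torch.nn as nn

n_classes, latent_dim, img_dim = 10, 100, 28 * 28

# Generator: condition by concatenating the label embedding with the noise vector
G = nn.Sequential(nn.Linear(latent_dim + n_classes, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
# Discriminator: condition by concatenating the label embedding with the (flattened) image
D = nn.Sequential(nn.Linear(img_dim + n_classes, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

label_emb = nn.Embedding(n_classes, n_classes)

z = torch.randn(64, latent_dim)
y = torch.randint(0, n_classes, (64,))
fake = G(torch.cat([z, label_emb(y)], dim=1))        # generate images conditioned on y
score = D(torch.cat([fake, label_emb(y)], dim=1))    # judge (image, label) pairs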

Context-Conditional GAN

Semi-Supervised Learning with Context-Conditional Generative Adversarial Networks

Authors

Emily Denton, Sam Gross, Rob Fergus

Abstract

We introduce a simple semi-supervised learning approach for images based on in-painting using an adversarial loss. Images with random patches removed are presented to a generator whose task is to fill in the hole, based on the surrounding pixels. The in-painted images are then presented to a discriminator network that judges if they are real (unaltered training images) or not. This task acts as a regularizer for standard supervised training of the discriminator. Using our approach we are able to directly train large VGG-style networks in a semi-supervised fashion. We evaluate on STL-10 and PASCAL datasets, where our approach obtains performance comparable or superior to existing methods.

[Paper] [Code]

Run Example

$ cd implementations/ccgan/
$ python3 ccgan.py

Context Encoder

Context Encoders: Feature Learning by Inpainting

Authors

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros

Abstract

We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders — a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

[Paper] [Code]

Run Example

$ cd implementations/context_encoder/
<follow steps at the top of context_encoder.py>
$ python3 context_encoder.py

Rows: Masked | Inpainted | Original | Masked | Inpainted | Original

Coupled GAN

Coupled Generative Adversarial Networks

Authors

Ming-Yu Liu, Oncel Tuzel

Abstract

We propose coupled generative adversarial network (CoGAN) for learning a joint distribution of multi-domain images. In contrast to the existing approaches, which require tuples of corresponding images in different domains in the training set, CoGAN can learn a joint distribution without any tuple of corresponding images. It can learn a joint distribution with just samples drawn from the marginal distributions. This is achieved by enforcing a weight-sharing constraint that limits the network capacity and favors a joint distribution solution over a product of marginal distributions one. We apply CoGAN to several joint distribution learning tasks, including learning a joint distribution of color and depth images, and learning a joint distribution of face images with different attributes. For each task it successfully learns the joint distribution without any tuple of corresponding images. We also demonstrate its applications to domain adaptation and image transformation.

[Paper] [Code]

Run Example

$ cd implementations/cogan/
$ python3 cogan.py

Generated MNIST and MNIST-M images

CycleGAN

Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Authors

Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros

Abstract

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G:X→Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F:Y→X and introduce a cycle consistency loss to push F(G(X))≈X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_cyclegan_dataset.sh monet2photo
$ cd ../implementations/cyclegan/
$ python3 cyclegan.py --dataset_name monet2photo

Monet to photo translations.

Deep Convolutional GAN

Deep Convolutional Generative Adversarial Network

Authors

Alec Radford, Luke Metz, Soumith Chintala

Abstract

In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks – demonstrating their applicability as general image representations.

[Paper] [Code]

Run Example

$ cd implementations/dcgan/
$ python3 dcgan.py

DiscoGAN

Learning to Discover Cross-Domain Relations with Generative Adversarial Networks

Authors

Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim

Abstract

While humans easily recognize relations between data from different domains without any supervision, learning to automatically discover them is in general very challenging and needs many ground-truth pairs that illustrate the relations. To avoid costly pairing, we address the task of discovering cross-domain relations given unpaired data. We propose a method based on generative adversarial networks that learns to discover relations between different domains (DiscoGAN). Using the discovered relations, our proposed network successfully transfers style from one domain to another while preserving key attributes such as orientation and face identity.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh edges2shoes
$ cd ../implementations/discogan/
$ python3 discogan.py --dataset_name edges2shoes

Rows from top to bottom: (1) Real image from domain A (2) Translated image from
domain A (3) Reconstructed image from domain A (4) Real image from domain B (5)
Translated image from domain B (6) Reconstructed image from domain B

DRAGAN

On Convergence and Stability of GANs

Authors

Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira

Abstract

We propose studying GAN training dynamics as regret minimization, which is in contrast to the popular view that there is consistent minimization of a divergence between real and generated distributions. We analyze the convergence of GAN training from this new point of view to understand why mode collapse happens. We hypothesize the existence of undesirable local equilibria in this non-convex game to be responsible for mode collapse. We observe that these local equilibria often exhibit sharp gradients of the discriminator function around some real data points. We demonstrate that these degenerate local equilibria can be avoided with a gradient penalty scheme called DRAGAN. We show that DRAGAN enables faster training, achieves improved stability with fewer mode collapses, and leads to generator networks with better modeling performance across a variety of architectures and objective functions.

[Paper] [Code]

Run Example

$ cd implementations/dragan/
$ python3 dragan.py

DualGAN

DualGAN: Unsupervised Dual Learning for Image-to-Image Translation

Authors

Zili Yi, Hao Zhang, Ping Tan, Minglun Gong

Abstract

Conditional Generative Adversarial Networks (GANs) for cross-domain image-to-image translation have made much progress recently. Depending on the task complexity, thousands to millions of labeled image pairs are needed to train a conditional GAN. However, human labeling is expensive, even impractical, and large quantities of data may not always be available. Inspired by dual learning from natural language translation, we develop a novel dual-GAN mechanism, which enables image translators to be trained from two sets of unlabeled images from two domains. In our architecture, the primal GAN learns to translate images from domain U to those in domain V, while the dual GAN learns to invert the task. The closed loop made by the primal and dual tasks allows images from either domain to be translated and then reconstructed. Hence a loss function that accounts for the reconstruction error of images can be used to train the translators. Experiments on multiple image translation tasks with unlabeled data show considerable performance gain of DualGAN over a single GAN. For some tasks, DualGAN can even achieve comparable or slightly better results than conditional GAN trained on fully labeled data.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh facades
$ cd ../implementations/dualgan/
$ python3 dualgan.py --dataset_name facades

Energy-Based GAN

Energy-based Generative Adversarial Network

Authors

Junbo Zhao, Michael Mathieu, Yann LeCun

Abstract

We introduce the “Energy-based Generative Adversarial Network” model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discriminator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of EBGAN framework as using an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.

[Paper] [Code]

Run Example

$ cd implementations/ebgan/
$ python3 ebgan.py

Enhanced Super-Resolution GAN

ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks

Authors

Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, Xiaoou Tang

Abstract

The Super-Resolution Generative Adversarial Network (SRGAN) is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts. To further enhance the visual quality, we thoroughly study three key components of SRGAN – network architecture, adversarial loss and perceptual loss, and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery. Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge. The code is available at this https URL.

[Paper] [Code]

Run Example

$ cd implementations/esrgan/
<follow steps at the top of esrgan.py>
$ python3 esrgan.py

Nearest Neighbor Upsampling | ESRGAN

GAN

Generative Adversarial Network

Authors

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio

Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

[Paper] [Code]

Run Example

$ cd implementations/gan/
$ python3 gan.py
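For reference, a bare-bones sketch of the two-player game in PyTorch, using the common non-saturating generator loss; the MLPs and the data batch are illustrative stand-ins rather than the repository's gan.py:

import torch
import torch.nn as nn

latent_dim, img_dim = 100, 28 * 28
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
bce = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.rand(64, img_dim) * 2 - 1               # stand-in for a batch of real images
valid, fake_lbl = torch.ones(64, 1), torch.zeros(64, 1)

# Discriminator step: push D(real) toward 1 and D(generated) toward 0
z = torch.randn(64, latent_dim)
fake = G(z)
d_loss = bce(D(real), valid) + bce(D(fake.detach()), fake_lbl)
opt_D.zero_grad()
d_loss.backward()
opt_D.step()

# Generator step: fool D into outputting 1 on generated samples
g_loss = bce(D(fake), valid)
opt_G.zero_grad()
g_loss.backward()
opt_G.step()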

InfoGAN

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Authors

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel

Abstract

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.

[Paper] [Code]

Run Example

$ cd implementations/infogan/
$ python3 infogan.py

Result of varying categorical latent variable by column.

Result of varying continuous latent variable by row.

Least Squares GAN

Least Squares Generative Adversarial Networks

Authors

Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, Stephen Paul Smolley

Abstract

Unsupervised learning with generative adversarial networks (GANs) has proven hugely successful. Regular GANs hypothesize the discriminator as a classifier with the sigmoid cross entropy loss function. However, we found that this loss function may lead to the vanishing gradients problem during the learning process. To overcome such a problem, we propose in this paper the Least Squares Generative Adversarial Networks (LSGANs) which adopt the least squares loss function for the discriminator. We show that minimizing the objective function of LSGAN yields minimizing the Pearson χ2 divergence. There are two benefits of LSGANs over regular GANs. First, LSGANs are able to generate higher quality images than regular GANs. Second, LSGANs perform more stable during the learning process. We evaluate LSGANs on five scene datasets and the experimental results show that the images generated by LSGANs are of better quality than the ones generated by regular GANs. We also conduct two comparison experiments between LSGANs and regular GANs to illustrate the stability of LSGANs.

[Paper] [Code]

Run Example

$ cd implementations/lsgan/
$ python3 lsgan.py
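The core change relative to a standard GAN is replacing the sigmoid cross-entropy with a least-squares objective; a minimal hedged sketch (the discriminator, data, and label values below are illustrative, not the repository's lsgan.py):

import torch
import torch.nn as nn

img_dim = 28 * 28
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))  # no sigmoid on the output
mse = nn.MSELoss()

real = torch.rand(64, img_dim) * 2 - 1     # stand-in for real images
fake = torch.rand(64, img_dim) * 2 - 1     # stand-in for generator output G(z)
valid, fake_lbl = torch.ones(64, 1), torch.zeros(64, 1)

d_loss = 0.5 * (mse(D(real), valid) + mse(D(fake.detach()), fake_lbl))  # least-squares discriminator loss
g_loss = 0.5 * mse(D(fake), valid)                                      # least-squares generator loss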

MUNIT

Multimodal Unsupervised Image-to-Image Translation

Authors

Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz

Abstract

Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to the state-of-the-art approaches further demonstrates the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at this https URL

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh edges2shoes
$ cd ../implementations/munit/
$ python3 munit.py --dataset_name edges2shoes

Results by varying the style code.

Pix2Pix

Image-to-Image Translation with Conditional Adversarial Networks

Authors

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros

Abstract

We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. Indeed, since the release of the pix2pix software associated with this paper, a large number of internet users (many of them artists) have posted their own experiments with our system, further demonstrating its wide applicability and ease of adoption without the need for parameter tweaking. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_pix2pix_dataset.sh facades
$ cd ../implementations/pix2pix/
$ python3 pix2pix.py --dataset_name facades

Rows from top to bottom: (1) The condition for the generator (2) Generated image
based on the condition (3) The true corresponding image to the condition

PixelDA

Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks

Authors

Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, Dilip Krishnan

Abstract

Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks. One appealing alternative is rendering synthetic data where ground-truth annotations are generated automatically. Unfortunately, models trained purely on rendered images often fail to generalize to real images. To address this shortcoming, prior work introduced unsupervised domain adaptation algorithms that attempt to map representations between the two domains or learn to extract features that are domain-invariant. In this work, we present a new approach that learns, in an unsupervised manner, a transformation in the pixel space from one domain to the other. Our generative adversarial network (GAN)-based method adapts source-domain images to appear as if drawn from the target domain. Our approach not only produces plausible samples, but also outperforms the state-of-the-art on a number of unsupervised domain adaptation scenarios by large margins. Finally, we demonstrate that the adaptation process generalizes to object classes unseen during training.

[Paper] [Code]

MNIST to MNIST-M Classification

Trains a classifier on images that have been translated from the source domain (MNIST) to the target domain (MNIST-M) using the annotations of the source domain images. The classification network is trained jointly with the generator network to optimize the generator for both providing a proper domain translation and also for preserving the semantics of the source domain image. The classification network trained on translated images is compared to the naive solution of training a classifier on MNIST and evaluating it on MNIST-M. The naive model manages a 55% classification accuracy on MNIST-M while the one trained during domain adaptation achieves a 95% classification accuracy.

$ cd implementations/pixelda/
$ python3 pixelda.py
Method  | Accuracy
--------|---------
Naive   | 55%
PixelDA | 95%

Rows from top to bottom: (1) Real images from MNIST (2) Translated images from
MNIST to MNIST-M (3) Examples of images from MNIST-M

Relativistic GAN

The relativistic discriminator: a key element missing from standard GAN

Authors

Alexia Jolicoeur-Martineau

Abstract

In standard generative adversarial network (SGAN), the discriminator estimates the probability that the input data is real. The generator is trained to increase the probability that fake data is real. We argue that it should also simultaneously decrease the probability that real data is real because 1) this would account for a priori knowledge that half of the data in the mini-batch is fake, 2) this would be observed with divergence minimization, and 3) in optimal settings, SGAN would be equivalent to integral probability metric (IPM) GANs. We show that this property can be induced by using a relativistic discriminator which estimate the probability that the given real data is more realistic than a randomly sampled fake data. We also present a variant in which the discriminator estimate the probability that the given real data is more realistic than fake data, on average. We generalize both approaches to non-standard GAN loss functions and we refer to them respectively as Relativistic GANs (RGANs) and Relativistic average GANs (RaGANs). We show that IPM-based GANs are a subset of RGANs which use the identity function. Empirically, we observe that 1) RGANs and RaGANs are significantly more stable and generate higher quality data samples than their non-relativistic counterparts, 2) Standard RaGAN with gradient penalty generate data of better quality than WGAN-GP while only requiring a single discriminator update per generator update (reducing the time taken for reaching the state-of-the-art by 400%), and 3) RaGANs are able to generate plausible high resolutions images (256×256) from a very small sample (N=2011), while GAN and LSGAN cannot; these images are of significantly better quality than the ones generated by WGAN-GP and SGAN with spectral normalization.

[Paper] [Code]

Run Example

$ cd implementations/relativistic_gan/
$ python3 relativistic_gan.py # Relativistic Standard GAN
$ python3 relativistic_gan.py --rel_avg_gan # Relativistic Average GAN

Semi-Supervised GAN

Semi-Supervised Generative Adversarial Network

Authors

Augustus Odena

Abstract

We extend Generative Adversarial Networks (GANs) to the semi-supervised context by forcing the discriminator network to output class labels. We train a generative model G and a discriminator D on a dataset with inputs belonging to one of N classes. At training time, D is made to predict which of N+1 classes the input belongs to, where an extra class is added to correspond to the outputs of G. We show that this method can be used to create a more data-efficient classifier and that it allows for generating higher quality samples than a regular GAN.

[Paper] [Code]

Run Example

$ cd implementations/sgan/
$ python3 sgan.py

Softmax GAN

Softmax GAN

Authors

Min Lin

Abstract

Softmax GAN is a novel variant of Generative Adversarial Network (GAN). The key idea of Softmax GAN is to replace the classification loss in the original GAN with a softmax cross-entropy loss in the sample space of one single batch. In the adversarial learning of N real training samples and M generated samples, the target of discriminator training is to distribute all the probability mass to the real samples, each with probability 1M, and distribute zero probability to generated data. In the generator training phase, the target is to assign equal probability to all data points in the batch, each with probability 1M+N. While the original GAN is closely related to Noise Contrastive Estimation (NCE), we show that Softmax GAN is the Importance Sampling version of GAN. We futher demonstrate with experiments that this simple change stabilizes GAN training.

[Paper] [Code]

Run Example

$ cd implementations/softmax_gan/
$ python3 softmax_gan.py

StarGAN

StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

Authors

Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, Jaegul Choo

Abstract

Recent studies have shown remarkable success in image-to-image translation for two domains. However, existing approaches have limited scalability and robustness in handling more than two domains, since different models should be built independently for every pair of image domains. To address this limitation, we propose StarGAN, a novel and scalable approach that can perform image-to-image translations for multiple domains using only a single model. Such a unified model architecture of StarGAN allows simultaneous training of multiple datasets with different domains within a single network. This leads to StarGAN’s superior quality of translated images compared to existing models as well as the novel capability of flexibly translating an input image to any desired target domain. We empirically demonstrate the effectiveness of our approach on a facial attribute transfer and a facial expression synthesis tasks.

[Paper] [Code]

Run Example

$ cd implementations/stargan/
<follow steps at the top of stargan.py>
$ python3 stargan.py

Original | Black Hair | Blonde Hair | Brown Hair | Gender Flip | Aged

Super-Resolution GAN

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

Authors

Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi

Abstract

Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.

[Paper] [Code]

Run Example

$ cd implementations/srgan/
<follow steps at the top of srgan.py>
$ python3 srgan.py

Nearest Neighbor Upsampling | SRGAN

UNIT

Unsupervised Image-to-Image Translation Networks

Authors

Ming-Yu Liu, Thomas Breuel, Jan Kautz

Abstract

Unsupervised image-to-image translation aims at learning a joint distribution of images in different domains by using images from the marginal distributions in individual domains. Since there exists an infinite set of joint distributions that can arrive the given marginal distributions, one could infer nothing about the joint distribution from the marginal distributions without additional assumptions. To address the problem, we make a shared-latent space assumption and propose an unsupervised image-to-image translation framework based on Coupled GANs. We compare the proposed framework with competing approaches and present high quality image translation results on various challenging unsupervised image translation tasks, including street scene image translation, animal image translation, and face image translation. We also apply the proposed framework to domain adaptation and achieve state-of-the-art performance on benchmark datasets. Code and additional results are available in this https URL.

[Paper] [Code]

Run Example

$ cd data/
$ bash download_cyclegan_dataset.sh apple2orange
$ cd ../implementations/unit/
$ python3 unit.py --dataset_name apple2orange

Wasserstein GAN

Wasserstein GAN

Authors

Martin Arjovsky, Soumith Chintala, Léon Bottou

Abstract

We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.

[Paper] [Code]

Run Example

$ cd implementations/wgan/
$ python3 wgan.py

Wasserstein GAN GP

Improved Training of Wasserstein GANs

Authors

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville

Abstract

Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only low-quality samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models over discrete data. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms.

[Paper] [Code]

Run Example

$ cd implementations/wgan_gp/
$ python3 wgan_gp.py
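The gradient penalty itself is compact enough to sketch. Below is an illustrative version that penalizes the critic's gradient norm at points interpolated between real and generated samples, with the paper's lambda = 10; the critic and the data batches are stand-ins, not the repository's wgan_gp.py:

import torch
import torch.nn as nn

img_dim = 28 * 28
critic = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

real = torch.rand(64, img_dim)             # stand-in for a real batch
fake = torch.rand(64, img_dim)             # stand-in for generated samples

# Interpolate between real and fake samples and penalize ||grad_x critic(x_hat)||_2 deviating from 1
alpha = torch.rand(64, 1)
x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
d_hat = critic(x_hat)
grads = torch.autograd.grad(d_hat, x_hat, grad_outputs=torch.ones_like(d_hat),
                            create_graph=True)[0]
gradient_penalty = ((grads.norm(2, dim=1) - 1) ** 2).mean()

d_loss = critic(fake).mean() - critic(real).mean() + 10 * gradient_penalty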

Wasserstein GAN DIV

Wasserstein Divergence for GANs

Authors

Jiqing Wu, Zhiwu Huang, Janine Thoma, Dinesh Acharya, Luc Van Gool

Abstract

In many domains of computer vision, generative adversarial networks (GANs) have achieved great success, among which the family of Wasserstein GANs (WGANs) is considered to be state-of-the-art due to the theoretical contributions and competitive qualitative performance. However, it is very challenging to approximate the k-Lipschitz constraint required by the Wasserstein-1 metric (W-met). In this paper, we propose a novel Wasserstein divergence (W-div), which is a relaxed version of W-met and does not require the k-Lipschitz constraint. As a concrete application, we introduce a Wasserstein divergence objective for GANs (WGAN-div), which can faithfully approximate W-div through optimization. Under various settings, including progressive growing training, we demonstrate the stability of the proposed WGAN-div owing to its theoretical and practical advantages over WGANs. Also, we study the quantitative and visual performance of WGAN-div on standard image synthesis benchmarks, showing the superior performance of WGAN-div compared to the state-of-the-art methods.

[Paper] [Code]

Run Example

$ cd implementations/wgan_div/
$ python3 wgan_div.py

YOLO Series (Part 1)

What is YOLOv5


YOLO, an acronym for 'You Only Look Once', is an object detection algorithm that divides images into a grid system. Each cell in the grid is responsible for detecting objects within itself.

YOLO is one of the most famous object detection algorithms due to its speed and accuracy.

The History of YOLO


YOLOv5

Shortly after the release of YOLOv4, Glenn Jocher introduced YOLOv5 using the PyTorch framework.
The open-source code is available on GitHub.

Author: Glenn Jocher
Released: 18 May 2020

YOLOv4

With the original author's work on YOLO coming to a standstill, YOLOv4 was released by Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. The paper was titled YOLOv4: Optimal Speed and Accuracy of Object Detection.

Author: Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao
Released: 23 April 2020

yolov4-tiny : https://arxiv.org/abs/2011.04244

YOLOv4-tiny code:

https://github.com/bubbliiiing/yolov4-tiny-pytorch

YOLOv4-tiny architecture

YOLOv3

YOLOv3 improved on the YOLOv2 paper and both Joseph Redmon and Ali Farhadi, the original authors, contributed.
Together they published YOLOv3: An Incremental Improvement

The original YOLO papers are hosted here

Author: Joseph Redmon and Ali Farhadi
Released: 8 Apr 2018

YOLOv2

YOLOv2 was a joint endeavor by Joseph Redmon, the original author of YOLO, and Ali Farhadi.
Together they published YOLO9000: Better, Faster, Stronger

Author: Joseph Redmon and Ali Farhadi
Released: 25 Dec 2016

YOLOv1

YOLOv1 was released as a research paper by Joseph Redmon.
The paper was titled You Only Look Once: Unified, Real-Time Object Detection

Author: Joseph Redmon
Released: 8 Jun 2015

mAP for Object Detection

When evaluating how well an object detection algorithm performs, the PASCAL VOC 2012 evaluation metric, mAP, is usually adopted.

Official definition:

  1. Compute a version of the measured precision/recall curve with precision monotonically decreasing, by setting the precision for recall r to the maximum precision obtained for any recall r′ ≥ r.
  2. Compute the AP as the area under this curve by numerical integration. No approximation is involved since the curve is piecewise constant.

Note that prior to 2010 the AP is computed by sampling the monotonically decreasing curve at a fixed set of uniformly-spaced recall values 0, 0.1, 0.2, . . . , 1. By contrast, VOC2010–2012 effectively samples the curve at all unique recall values.

mAP (mean Average Precision) is the most commonly used evaluation metric in object detection. Understanding in detail how it is computed helps us assess how effective an algorithm is and tune the algorithm against the evaluation metric. Mean Average Precision is simply the average of the per-class AP values.
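The VOC2010+ definition quoted above translates almost directly into code; a hedged sketch in NumPy (the function name voc_ap and the padding details are my own, not taken from a specific library):

import numpy as np

def voc_ap(recall, precision):
    # Area under the monotonically decreasing precision envelope (VOC2010+ style).
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    # 1. make precision monotonically decreasing from right to left
    for i in range(mpre.size - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # 2. integrate: sum precision over the recall intervals where recall changes
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))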

TP TN FP FN:

T or F (True/False) indicates whether the sample was classified correctly, and P or N (Positive/Negative) indicates which class the sample was assigned to.

Reading the names backwards: TP (True Positives) means "classified as positive, and correctly so"; TN (True Negatives) means "classified as negative, and correctly so"; FP (False Positives) means "classified as positive, but incorrectly"; FN (False Negatives) means "classified as negative, but incorrectly".

To explain with the figure below: the left half of the rectangle contains the positive samples and the right half the negative samples. A binary classifier draws a circle on the figure and treats everything inside the circle as positive and everything outside as negative. The left half of the circle is classified as positive and really is positive, so it is TP ("classified as positive, correctly"). The part of the left half-rectangle outside the circle is classified as negative but is actually positive, so it is FN ("classified as negative, incorrectly"). The right half of the circle is classified as positive but is actually negative, so it is FP ("classified as positive, incorrectly"). The part of the right half-rectangle outside the circle is classified as negative and really is negative, so it is TN ("classified as negative, correctly").

The concepts of Precision and Recall:

Precision is the fraction of the samples the classifier labels as positive that really are positive; it measures how likely a predicted positive is to be a true positive. At the two extremes, a precision of 100% means every sample the classifier labels positive really is positive, and a precision of 0% means none of them are. Precision alone cannot tell you whether a classifier is good. For example, with 50 positive and 50 negative samples, a classifier that labels 49 positives and all 50 negatives as negative, leaving only one positive sample classified as positive, still has 100% precision, yet it is obviously a terrible classifier.

Recall is the fraction of all truly positive samples that the classifier labels as positive; it measures the classifier's ability to find all of the positives. At the two extremes, a recall of 100% means every positive sample is classified as positive, and a recall of 0% means none of them are.
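In code, both quantities fall directly out of the TP/FP/FN counts; a tiny sketch, applied to the degenerate classifier from the example above:

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0   # of everything predicted positive, how much is right
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0      # of everything actually positive, how much is found
    return precision, recall

# The degenerate classifier above: 1 TP, 0 FP, 49 FN out of 50 positives
print(precision_recall(1, 0, 49))   # (1.0, 0.02): perfect precision, terrible recall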

IoU (Intersection over Union)

It is the ratio between the intersection and the union of the predicted bounding box (bbox) and the ground-truth box.

# IoU computation
def Iou(rec1, rec2):
  x1, x2, y1, y2 = rec1  # left, right, top, bottom of the first box (y1 > y2, y increases upward)
  x3, x4, y3, y4 = rec2  # left, right, top, bottom of the second box (y3 > y4)
  area_1 = (x2 - x1) * (y1 - y2)
  area_2 = (x4 - x3) * (y3 - y4)
  sum_area = area_1 + area_2
  w1 = x2 - x1  # width of the first box
  w2 = x4 - x3  # width of the second box
  h1 = y1 - y2
  h2 = y3 - y4
  W = min(x1, x2, x3, x4) + w1 + w2 - max(x1, x2, x3, x4)  # width of the overlap
  H = min(y1, y2, y3, y4) + h1 + h2 - max(y1, y2, y3, y4)  # height of the overlap
  if W <= 0 or H <= 0:   # the boxes do not overlap
    return 0.0
  Area = W * H           # intersection area
  return Area / (sum_area - Area)
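As a quick check of the function above, a hypothetical call with two 2x2 boxes that overlap in a 1x2 strip (same left/right/top/bottom convention as the comments):

box_a = (0, 2, 2, 0)      # left=0, right=2, top=2, bottom=0
box_b = (1, 3, 2, 0)
print(Iou(box_a, box_b))  # intersection 2, union 6 -> 0.333...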

A worked example of computing mAP

Suppose there are 3 images, as below, and the algorithm has to find faces. The blue boxes are the ground-truth labels, the green boxes are the predictions (pre) produced by the algorithm, and the small red numbers next to them are confidence scores. Call the detection box in the first image pre1 and its label box label1, and likewise for the second and third images.

First we compute the IoU between the pre and label of each image and, based on whether the IoU is greater than 0.5, decide whether the pre counts as a TP (classified as positive, correctly) or an FP (classified as positive, incorrectly). Clearly, pre1 is a TP, pre2 is an FP, and pre3 is a TP.

Next, sort the predictions by confidence from high to low; here pre1, pre2, pre3 happen to already be in descending order of confidence.

Obtain Precision and Recall under different confidence thresholds:

First, set the threshold to 0.9 and ignore every pre with confidence below 0.9. The detector then outputs TP + FP = 1 box, and pre1 is a TP, so Precision = 1/1. Since there are 3 labels in total, Recall = 1/3. This gives the first (P, R) pair.

Then set the threshold to 0.8 and ignore every pre with confidence below 0.8. The detector outputs TP + FP = 2 boxes; pre1 is a TP and pre2 is an FP, so Precision = 1/2 = 0.5. With 3 labels in total, Recall = 1/3 ≈ 0.33. This gives another (P, R) pair.

Then set the threshold to 0.7 and ignore every pre with confidence below 0.7. The detector outputs TP + FP = 3 boxes; pre1 is a TP, pre2 is an FP, and pre3 is a TP, so Precision = 2/3 ≈ 0.67. With 3 labels in total, Recall = 2/3 ≈ 0.67. This gives a third (P, R) pair.

Plot the PR curve from these three (P, R) pairs, as shown below. Then, from each "peak" point, draw a horizontal segment to the left until it meets the vertical line through the previous peak. The area enclosed by these red segments and the axes is the AP value.
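Feeding the three (P, R) pairs from this example into the hypothetical voc_ap sketch given under the mAP definition above reproduces exactly that enclosed area:

import numpy as np

recall    = np.array([1/3, 1/3, 2/3])    # cumulative recall after pre1, pre2, pre3
precision = np.array([1.0, 0.5, 2/3])    # cumulative precision after pre1, pre2, pre3
print(voc_ap(recall, precision))         # ~0.556: the area under the red envelope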

Computing mAP


AP measures how well a single class is detected, while mAP measures detection quality over multiple classes: simply take the average of the AP values of all classes. For example, with two classes, where class A has an AP of 0.5 and class B an AP of 0.2, mAP = (0.5 + 0.2) / 2 = 0.35.

GitHub repository for computing mAP:

https://github.com/Cartucho/mAP

# AP computation (the 11-point interpolation used by VOC prior to 2010)
import numpy as np

def _average_precision(self, rec, prec):   # copied from a class-based metric; self is unused here
    """
    Params:
    ----------
    rec : numpy.array
            cumulated recall
    prec : numpy.array
            cumulated precision
    Returns:
    ----------
    ap as float
    """
    if rec is None or prec is None:
        return np.nan
    ap = 0.
    for t in np.arange(0., 1.1, 0.1):  # eleven recall points; take the max precision at recall >= t
        if np.sum(rec >= t) == 0:
            p = 0
        else:
            p = np.max(np.nan_to_num(prec)[rec >= t])
        ap += p / 11.  # average over the 11 points
    return ap

Correctly understanding YOLO's detection accuracy

2021: A Year Full of Amazing AI papers – A Review

A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.

Source:

https://www.louisbouchard.ai/2021-ai-papers-review/

Table of Contents

  • DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]
  • VOGUE: Try-On by StyleGAN Interpolation Optimization [2]
  • Taming Transformers for High-Resolution Image Synthesis [3]
  • Thinking Fast And Slow in AI [4]
  • Automatic detection and quantification of floating marine macro-litter in aerial images [5]
  • ShaRF: Shape-conditioned Radiance Fields from a Single View [6]
  • Generative Adversarial Transformers [7]
  • We Asked Artificial Intelligence to Create Dating Profiles. Would You Swipe Right? [8]
  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [9]
  • IMAGE GANS MEET DIFFERENTIABLE RENDERING FOR INVERSE GRAPHICS AND INTERPRETABLE 3D NEURAL RENDERING [10]
  • Deep nets: What have they ever done for vision? [11]
  • Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [12]
  • Portable, Self-Contained Neuroprosthetic Hand with Deep Learning-Based Finger Control [13]
  • Total Relighting: Learning to Relight Portraits for Background Replacement [14]
  • LASR: Learning Articulated Shape Reconstruction from a Monocular Video [15]
  • Enhancing Photorealism Enhancement [16]
  • DefakeHop: A Light-Weight High-Performance Deepfake Detector [17]
  • High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network [18]
  • Barbershop: GAN-based Image Compositing using Segmentation Masks [19]
  • TextStyleBrush: Transfer of text aesthetics from a single example [20]
  • Animating Pictures with Eulerian Motion Fields [21]
  • CVPR 2021 Best Paper Award: GIRAFFE – Controllable Image Generation [22]
  • GitHub Copilot & Codex: Evaluating Large Language Models Trained on Code [23]
  • Apple: Recognizing People in Photos Through Private On-Device Machine Learning [24]
  • Image Synthesis and Editing with Stochastic Differential Equations [25]
  • Sketch Your Own GAN [26]
  • Tesla’s Autopilot Explained [27]
  • Styleclip: Text-driven manipulation of StyleGAN imagery [28]
  • TimeLens: Event-based Video Frame Interpolation [29]
  • Diverse Generation from a Single Video Made Possible [30]
  • Skillful Precipitation Nowcasting using Deep Generative Models of Radar [31]
  • The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks [32]
  • ADOP: Approximate Differentiable One-Pixel Point Rendering [33]
  • (Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [34]
  • SwinIR: Image restoration using swin transformer [35]
  • EditGAN: High-Precision Semantic Image Editing [36]
  • CityNeRF: Building NeRF at City Scale [37]
  • ClipCap: CLIP Prefix for Image Captioning [38]
  • Paper references

Learning CNN-LSTM Architectures for Image Captioning: Paper Reading

Merry Christmas

Abstract:

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Finally, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state of the art.

Introduction:

Being able to automatically describe the content of an image using properly formed English sentences is a very challenging task, but it could have great impact, for instance by helping visually impaired people better understand the content of images on the web. This task is significantly harder than, for example, the well-studied image classification or object recognition tasks, which have been a main focus in the computer vision community [27]. Indeed, a description must capture not only the objects contained in an image, but also how these objects relate to each other, their attributes, and the activities they are involved in. Moreover, this semantic knowledge has to be expressed in a natural language such as English, which means that a language model is needed in addition to visual understanding.

Most previous attempts have stitched together existing solutions of the above sub-problems in order to go from an image to its description [6, 16]. In contrast, in this work we propose a single joint model that takes an image I as input and is trained to maximize the likelihood p(S|I) of producing a target sequence of words S = S_1, S_2, ..., where each word S_t comes from a given dictionary, that adequately describes the image.

The main inspiration for our work comes from recent advances in machine translation, where the task is to transform a sentence S written in a source language into its translation T in the target language by maximizing p(T|S). For many years machine translation was also achieved by a series of separate tasks (translating words individually, aligning words, reordering, and so on), but recent work has shown that translation can be done in a much simpler way using recurrent neural networks (RNNs) [3, 2, 30] while still reaching state-of-the-art performance. An "encoder" RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a "decoder" RNN that generates the target sentence.

Here, we propose to follow this elegant recipe, replacing the encoder RNN with a deep convolutional neural network (CNN). Over the last few years it has been convincingly shown that CNNs can produce a rich representation of an input image by embedding it into a fixed-length vector, and that this representation can be used for a variety of vision tasks [28]. It is therefore natural to use a CNN as the image "encoder", by first pre-training it for an image classification task and then using the last hidden layer as input to the RNN decoder that generates sentences (see Figure 1). We call this model the Neural Image Caption, or NIC.

Our contributions are as follows. First, we present an end-to-end system for the problem: a neural network that can be trained with stochastic gradient descent. Second, our model combines state-of-the-art networks for vision and for language modeling; these can be pre-trained on larger corpora and thus take advantage of additional data. Finally, it yields significantly better performance than state-of-the-art approaches. For instance, on the Pascal dataset NIC achieves a BLEU score of 59, compared to the current state of the art of 25, while human performance is around 69. On Flickr30k we improve from 56 to 66, and on SBU from 19 to 28.

Related Work:

The problem of generating natural language descriptions from visual data has long been studied in computer vision, but mainly for video [7, 32]. This has led to complex systems composed of visual primitive recognizers combined with a structured formal language, e.g. And-Or graphs or logic systems, which are further converted into natural language via rule-based systems. Such systems are hand-designed, relatively brittle, and have been demonstrated only in limited domains, e.g. traffic scenes or sports descriptions.

The problem of describing still images with natural text has gained interest more recently. Leveraging recent advances in the recognition of objects, their attributes, and their locations allows us to drive natural language generation systems, although these are limited in their expressivity. Farhadi et al. [6] use detections to infer a triplet of scene elements which is converted to text using templates. Similarly, Li et al. [19] start from detections and piece together a final description using phrases containing the detected objects and relations. A more complex graph of detections, beyond triplets, was used by Kulkarni et al. [16], but still with template-based text generation. More powerful language models based on language parsing have also been used [23, 1, 17, 18, 5]. The above approaches have been able to describe images "in the wild", but they are heavily hand-designed and rigid when it comes to text generation.

A large body of work has addressed the problem of ranking descriptions for a given image [11, 8, 24]. Such approaches are based on the idea of co-embedding images and text in the same vector space; for an image query, descriptions are retrieved that lie close to the image in the embedding space. Most closely related, neural networks have been used to co-embed images and sentences [29], or even image crops and sentences [13], but they do not attempt to generate novel descriptions. In general, the above approaches cannot describe previously unseen compositions of objects, even though the individual objects may have been observed in the training data. Moreover, they avoid addressing the problem of evaluating how good a generated description is.

In this work, we combine deep convolutional networks for image classification [12] with recurrent networks for sequence modeling [10] to create a single network that generates descriptions of images. The RNN is trained in the context of this single "end-to-end" network. The model is inspired by recent successes of sequence generation in machine translation [3, 2, 30], with the difference that instead of starting from a sentence processed by a convolutional encoder, we provide an image processed by a convolutional network. The closest work is by Kiros et al. [15], who use a neural network, but a feedforward one, to predict the next word given the image and the previous words. A recent work by Mao et al. [21] uses a recurrent neural network for the same prediction task. This is very similar to the present proposal, but with several important differences: we use a more powerful RNN model and feed the visual input to the RNN directly, which makes it possible for the RNN to keep track of the objects that have already been explained by the text. As a result of these seemingly insignificant differences, our system achieves substantially better results on the established benchmarks. Lastly, Kiros et al. [14] propose constructing a joint multimodal embedding space using a powerful computer vision model and an LSTM that encodes text. In contrast to our approach, they use two separate pathways (one for images, one for text) to define a joint embedding, and even though they can generate text, their approach is highly tuned for ranking.

Model:

In this paper, we propose a neural and probabilistic framework to generate descriptions from images. Recent advances in statistical machine translation have shown that, given a powerful sequence model, state-of-the-art results can be obtained by directly maximizing the probability of the correct translation given an input sentence in an "end-to-end" fashion, both for training and for inference. These models use a recurrent neural network that encodes the variable-length input into a fixed-dimensional vector and uses this representation to "decode" it into the desired output sentence. It is therefore natural to apply the same principle here: given an image (instead of an input sentence in a source language), "translate" it into its description.

Thus, we propose to directly maximize the probability of the correct description given the image by using the following formulation:

where θ are the parameters of our model, I is an image, and S its correct transcription. Since S represents any sentence, its length is unbounded. Thus, it is common to apply the chain rule to model the joint probability over S_0, ..., S_N, where N is the length of this particular example, as

where we dropped the dependency on θ for convenience. At training time, (S, I) is a training example pair, and we optimize the sum of the log probabilities as described in (2) over the whole training set using stochastic gradient descent (further training details are given in Section 4).

It is natural to model p(S_t | I, S_0, ..., S_{t-1}) with a Recurrent Neural Network (RNN), where the variable number of words we condition upon, up to t-1, is expressed by a fixed-length hidden state or memory h_t. This memory is updated after seeing a new input x_t by using a non-linear function f:

h_{t+1} = f(h_t, x_t)

To make the above RNN more concrete, two crucial design choices have to be made: what the exact form of f is and how the images and words are fed in as inputs x_t. For f we use a Long Short-Term Memory (LSTM) network, which has shown state-of-the-art performance on sequence tasks such as translation. This model is outlined in the next section.

For the representation of images, we use a Convolutional Neural Network (CNN). CNNs have been widely used and studied for image tasks and are currently the state of the art for object recognition and detection. Our particular choice of CNN uses a novel approach to batch normalization and yields the current best performance in the ILSVRC 2014 classification competition [12]. Furthermore, CNNs have been shown to generalize to other tasks such as scene classification by means of transfer learning [4]. The words are represented with an embedding model.

LSTM-based Sentence Generator

The choice of f in (3) is governed by its ability to deal with vanishing and exploding gradients [10], the most common challenge in designing and training RNNs. To address this challenge, a particular form of recurrent network, called the LSTM, was introduced [10] and applied with great success to translation [3, 30] and sequence generation [9]. The core of the LSTM model is a memory cell c that encodes, at every time step, the knowledge of which inputs have been observed up to that step (see Figure 2). The behavior of the cell is controlled by "gates": layers that are applied multiplicatively, so a value from the gated layer is kept if the gate is 1 and zeroed if the gate is 0. In particular, three gates are used to control whether to forget the current cell value (forget gate f), whether to read the input (input gate i), and whether to output the new cell value (output gate o). The definitions of the gates, the cell update, and the output are as follows:

where ⊙ denotes the product with a gate value, and the various W matrices are trained parameters. Such multiplicative gates make it possible to train the LSTM robustly, since these gates deal well with exploding and vanishing gradients [10]. The nonlinearities are the sigmoid σ(·) and the hyperbolic tangent tanh(·). The last equation, for m_t, is what is fed to the Softmax, which produces a probability distribution over all words.

The LSTM model is trained to predict each word of the sentence after it has seen the image as well as all preceding words, as defined by p(S_t | I, S_0, ..., S_{t-1}). For this purpose, it is instructive to think of the LSTM in unrolled form: a copy of the LSTM memory is created for the image and for each sentence word, such that all LSTMs share the same parameters and the output m_{t-1} of the LSTM at time t-1 is fed to the LSTM at time t (see Figure 3). In this unrolled version, all recurrent connections become feed-forward connections. In more detail, if we denote by I the input image and by S = (S_0, ..., S_N) the true sentence describing this image, the unrolling procedure reads:

Here, we represent each word as a one-hot vector S_t with dimension equal to the size of the dictionary. Note that we denote by S_0 a special start word and by S_N a special stop word, which mark the beginning and the end of the sentence. In particular, by emitting the stop word the LSTM signals that a complete sentence has been generated. Both the image and the words are mapped to the same space: the image via the vision CNN and the words via a word embedding W_e. The image I is input only once, at t = -1, to inform the LSTM about the image contents. We empirically verified that feeding the image at every time step as an extra input yields inferior results, since the network can explicitly exploit noise in the image and overfits more easily.
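As an illustration only, here is a minimal PyTorch sketch of this unrolling; the names, layer sizes, and the stand-in for the vision CNN (a precomputed feature vector) are my own assumptions, not the paper's exact architecture or the GitHub implementation linked below:

import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    # Hypothetical sketch of the NIC decoder: image feature at t = -1, then word embeddings.
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # map the CNN feature into the embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embedding W_e
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)       # m_t -> logits fed to the softmax

    def forward(self, img_feat, captions):
        # img_feat: (B, feat_dim) from a pretrained CNN; captions: (B, N) ids of S_0..S_{N-1}, start word included
        x_img = self.img_proj(img_feat).unsqueeze(1)      # image as the t = -1 input
        x_words = self.embed(captions)                    # word inputs for t = 0..N-1
        inputs = torch.cat([x_img, x_words], dim=1)
        out, _ = self.lstm(inputs)
        return self.fc(out[:, 1:])                        # drop the image step; logits predicting S_1..S_N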

Our loss is the sum of the negative log likelihood of the correct word at each step as follows:

The above loss is minimized w.r.t. all the parameters of the LSTM, the top layer of the image embedder CNN, and the word embeddings W_e.
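Building on the hypothetical CaptionDecoder sketch above, that loss can be written as a per-step cross entropy over the unrolled outputs; captions_in / captions_out are assumed tensors holding S_0..S_{N-1} and S_1..S_N respectively:

import torch
import torch.nn.functional as F

decoder = CaptionDecoder()                                # the sketch above
img_feat = torch.randn(4, 2048)                           # stand-in CNN features for 4 images
captions_in = torch.randint(0, 10000, (4, 12))            # S_0..S_{N-1}, start word included
captions_out = torch.randint(0, 10000, (4, 12))           # S_1..S_N, stop word included

logits = decoder(img_feat, captions_in)                   # (4, 12, vocab): predictions for S_1..S_N
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), captions_out.reshape(-1))
loss.backward()                                           # minimized w.r.t. the LSTM, the CNN top layer, and W_e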

There are multiple approaches that can be used to generate a sentence for a given image with NIC. The first is Sampling: we sample the first word according to p1, provide the corresponding embedding as input and sample p2, and continue like this until we sample the special end-of-sentence token or reach some maximum length. The second is BeamSearch: iteratively keep the k best sentences up to time t as candidates for generating sentences of length t + 1, and keep only the best k of those. This better approximates S = argmax_{S'} p(S'|I). In the experiments that follow we used the BeamSearch approach with a beam size of 20. Using a beam size of 1 (i.e. greedy search) degraded our results by 2 BLEU points on average.
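A greedy decoding (beam size 1) sketch against the same hypothetical decoder, stepping the LSTM one token at a time; start_id and end_id are assumed vocabulary indices for the start and stop words:

import torch

@torch.no_grad()
def greedy_caption(decoder, img_feat, start_id, end_id, max_len=20):
    # Feed the image once, then repeatedly pick the arg-max word until the stop word.
    state = None
    x = decoder.img_proj(img_feat).unsqueeze(1)           # t = -1: image step
    out, state = decoder.lstm(x, state)
    word = torch.tensor([[start_id]])
    caption = []
    for _ in range(max_len):
        x = decoder.embed(word)                           # embed the previous word
        out, state = decoder.lstm(x, state)
        word = decoder.fc(out[:, -1]).argmax(dim=-1, keepdim=True)
        if word.item() == end_id:
            break
        caption.append(word.item())
    return caption

# e.g. greedy_caption(decoder, torch.randn(1, 2048), start_id=1, end_id=2)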

Overall architecture:

It comprises an Encoder, a Decoder, Attention, and Beam Search.

Putting it all together

Generation Results

We report the main results on all the relevant datasets in Tables 1 and 2. Since PASCAL does not have a training set, we used the system trained on MSCOCO (arguably the largest and highest-quality dataset for this task). The previous state-of-the-art results for PASCAL and SBU did not use image features based on deep learning, so arguably a big improvement on those scores comes from that change alone. The Flickr datasets have been used recently [11, 21, 14], but mostly evaluated in a retrieval framework. A notable exception is [21], where both retrieval and generation are performed, and which yields the best performance on the Flickr datasets up to now.

The human scores in Table 2 were computed by comparing one of the human captions against the other four. We do this for each of the five raters and average their BLEU scores. Since this gives a slight advantage to our system, given that its BLEU score is computed against five reference sentences and not four, we add back to the human scores the average difference of having five references instead of four.

Given that the field has seen significant advances in the last years, we think it makes more sense to report BLEU-4, which is the standard in machine translation going forward. Additionally, we report in Table 1 metrics shown to correlate better with human evaluations. Despite recent efforts on better evaluation metrics [31], our model fares strongly against human raters on these automatic metrics. However, when evaluating our captions using human raters (see Section 4.3.6), our model fares much more poorly, suggesting that more work is needed towards better metrics. On the official test set, for which labels are only available through the official website, our model obtained a BLEU-4 of 27.2.

Conclusion

We have presented NIC, an end-to-end neural network system that can automatically view an image and generate a reasonable description. NIC is based on a convolutional neural network that encodes an image into a compact representation, followed by a recurrent neural network that generates the corresponding sentence. The model is trained to maximize the likelihood of the sentence given the image. Experiments on several datasets show the robustness of NIC in terms of qualitative results (the generated sentences are very reasonable) and quantitative evaluations, using either ranking metrics or BLEU, a metric used in machine translation to evaluate the quality of generated sentences. It is clear from these experiments that, as the size of the available datasets for image description grows, so will the performance of approaches like NIC. Furthermore, it will be interesting to see how unsupervised data, from images alone and from text alone, can be used to improve image description approaches.

GitHub implementation: https://github.com/sgrvinod/Deep-Tutorials-for-PyTorch

References
[1] A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In ACL, 2010.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[3] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[5] D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, 2013.
[6] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[7] R. Gerber and H.-H. Nagel. Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences. In ICIP. IEEE, 1996.
[8] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014.
[9] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
[11] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47, 2013.
[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
[13] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, 2014.
[14] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014.
[15] R. Kiros, R. Zemel, and R. Salakhutdinov. Multimodal neural language models. In NIPS Deep Learning Workshop, 2013.
[16] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[17] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.
[18] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi. TreeTalk: Composition and compression of trees for image descriptions. ACL, 2(10), 2014.
[19] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In Conference on Computational Natural Language Learning, 2011.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv:1405.0312, 2014.
[21] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Explain images with multimodal recurrent neural networks. arXiv:1410.1090, 2014.
[22] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.
[23] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. C. Berg, K. Yamaguchi, T. L. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
[24] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[25] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[26] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139–147, 2010.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
[28] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.
[29] R. Socher, A. Karpathy, Q. V. Le, C. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. In ACL, 2014.
[30] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[31] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. arXiv:1411.5726, 2015.
[32] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8), 2010.
[33] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In ACL, 2014.
[34] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv:1409.2329, 2014.

BLEU: an evaluation metric for machine translation

BLEU [25] was proposed by IBM in 2002 for evaluating machine translation. It was published at ACL, has been cited more than 10,000 times, and the original title is "BLEU: a Method for Automatic Evaluation of Machine Translation".

Its core idea is precision: given a reference translation (the reference) and a sentence generated by the neural network (the candidate) of length n, if m of the candidate's words appear in the reference, then m/n is BLEU's 1-gram precision.

BLEU has many variants. Depending on the n-gram order, it splits into several metrics; the common ones are BLEU-1, BLEU-2, BLEU-3 and BLEU-4, where n-gram means a run of n consecutive words.

BLEU-1 measures word-level accuracy, while higher-order BLEU scores measure the fluency of the sentence.

Formula:
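The n-gram precision being described here takes the following general form, reconstructed from the numerator/denominator explanation below:

$$
p_n=\frac{\sum_{C\in \text{candidates}}\ \sum_{\text{n-gram}\in C}\mathrm{Count}_{\mathrm{match}}(\text{n-gram})}{\sum_{C'\in \text{candidates}}\ \sum_{\text{n-gram}'\in C'}\mathrm{Count}(\text{n-gram}')}
$$

where Count_match counts the candidate n-grams that also appear in the reference (replaced by a clipped count in the improved version below), and Count(n-gram') is the number of n-grams in the candidate.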

The numerator

The sentence generated by the neural network is the candidate; the given reference translation is the reference.

1) The first summation runs over all candidates, since there may be several sentences at evaluation time.

2) The second summation runs over all n-grams of a single candidate, and the count term denotes how many times a given n-gram appears in the reference.

So the whole numerator counts how many n-grams of the given candidates appear in the reference.

The denominator

The first two summations mean the same as in the numerator, and Count(n-gram') is the number of times n-gram' occurs in the candidate. In other words, the denominator is the total number of n-grams across all candidates.

The improved BLEU computation

The idea of the improvement: an n-gram is credited at most as many times as it appears in the reference, without exceeding its count in the candidate, i.e., its candidate count is "clipped" by its maximum reference count, as illustrated below.
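A tiny illustration of clipping (using a classic textbook example rather than one from the text): if the candidate is "the the the the the" and the reference is "the cat is on the mat", the unclipped unigram precision would be 5/5, while clipping credits "the" only twice:

from collections import Counter

candidate = "the the the the the".split()
reference = "the cat is on the mat".split()

ref_counts = Counter(reference)
cand_counts = Counter(candidate)
clipped = {w: min(c, ref_counts[w]) for w, c in cand_counts.items()}
p1 = sum(clipped.values()) / len(candidate)  # 2 / 5 = 0.4 instead of 1.0
print(p1)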

Machine translation output is usually evaluated with BLEU (Bilingual Evaluation Understudy). For any subsequence of words in the predicted sequence, BLEU checks whether it appears in the label (reference) sequence.

Specifically, let \(p_n\) denote the precision of subsequences with n words: the ratio between the number of n-word subsequences of the prediction that also occur in the label sequence and the total number of n-word subsequences in the prediction. For example, if the label sequence is A, B, C, D, E, F and the predicted sequence is A, B, B, C, D, then \(p_1 = 4/5\), \(p_2 = 3/4\), \(p_3 = 1/3\) and \(p_4 = 0\). Let

\(len_{\text{label}}\) and \(len_{\text{pred}}\) be the number of words in the label sequence and the predicted sequence, respectively. Then BLEU is defined as

$$
\exp\left(\min\left(0,\ 1-\frac{len_{\text{label}}}{len_{\text{pred}}}\right)\right)\prod_{n=1}^{k} p_{n}^{1/2^{n}}
$$
where \(k\) is the maximum subsequence length we wish to match. When the predicted sequence is identical to the label sequence, BLEU equals 1.
Because matching longer subsequences is harder than matching shorter ones, BLEU assigns a larger weight to the precision of longer subsequences. For example, with \(p_n\) fixed at 0.5, as \(n\) grows we have \(0.5^{1/2}\approx 0.7\), \(0.5^{1/4}\approx 0.84\), \(0.5^{1/8}\approx 0.92\), \(0.5^{1/16}\approx 0.96\). In addition, predicting a shorter sequence tends to yield higher \(p_n\) values, so the coefficient in front of the product penalizes short outputs. For example, when \(k=2\), if the label sequence is A, B, C, D, E, F and the prediction is A, B, then although \(p_1=p_2=1\), the penalty factor is \(\exp(1-6/2)\approx 0.14\), so BLEU is also close to 0.14.

Python implementation:

import collections
import math

def bleu(pred_tokens, label_tokens, k):
    # Brevity penalty: penalize predictions shorter than the label sequence
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        # Count the n-grams of the label sequence (used for clipping)
        num_matches, label_subs = 0, collections.defaultdict(int)
        for i in range(len_label - n + 1):
            label_subs[''.join(label_tokens[i: i + n])] += 1
        # Credit each predicted n-gram at most as often as it occurs in the label
        for i in range(len_pred - n + 1):
            if label_subs[''.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[''.join(pred_tokens[i: i + n])] -= 1
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score
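As a quick sanity check, applying bleu to the worked example above (prediction A, B, B, C, D against label A, B, C, D, E, F with k = 2):

print(bleu(list('ABBCD'), list('ABCDEF'), 2))
# p1 = 4/5, p2 = 3/4, brevity penalty exp(1 - 6/5) ≈ 0.82, so this prints roughly 0.68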

Next, define a helper function that prints the score (translate, encoder, decoder and max_seq_len come from the surrounding seq2seq example):

def score(input_seq, label_seq, k):
    pred_tokens = translate(encoder, decoder, input_seq, max_seq_len)
    label_tokens = label_seq.split(' ')
    print('bleu %.3f, predict: %s' % (bleu(pred_tokens, label_tokens, k),
                                      ' '.join(pred_tokens)))

A few caveats

Usually four reference sentences are given. Multiple references are provided because a single sentence may not match the generated sentence well.

For example, for "你好", if the reference is "hello" and the machine outputs "how are you", none of the generated words matches the reference, so the BLEU score is 0, which is clearly problematic. The more diverse the references, the higher the chance of a successful match.

The cocktail party problem: speech separation

In Andrew Ng's introductory machine learning lectures (Stanford, on Coursera), he gives the following one-line Octave solution to the cocktail party problem, for the case where the audio sources are recorded by two spatially separated microphones:

[W,s,v]=svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
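For reference, a rough NumPy transliteration of that Octave one-liner (a sketch only, assuming x is a 2 × n array whose rows are the two microphone recordings):

import numpy as np

x = np.random.randn(2, 1000)   # stand-in for the two microphone recordings
M = (np.tile((x * x).sum(axis=0), (x.shape[0], 1)) * x) @ x.T   # the matrix inside svd(...)
W, s, Vt = np.linalg.svd(M)
print(W.shape, s)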

  • Classical approaches

Independent component analysis (ICA)

Independent Component Analysis (ICA) separates signals by using statistical independence as the criterion for separating the variables. It was first proposed by Comon in 1994. Comon showed that ICA can eliminate higher-order statistical dependencies in the observed signals by maximizing an objective called a contrast function, thereby achieving blind source separation. Blind source separation is the problem of separating or estimating the original source waveforms from a sensor or sensor array without knowing the characteristics of the transmission channels. For some models, however, there is no guarantee that the estimated or extracted signals have exactly the same waveforms as the sources, so the requirement is sometimes relaxed to extracting distorted or filtered versions of the source waveforms.

Independent component analysis was first applied to blind source separation (BSS). It originated from the "cocktail party problem": at a noisy cocktail party many people talk at the same time, possibly with background music, yet the human ear can hear the other person's words accurately and clearly. The ability to pick out the sound of interest from a mixture while ignoring other sounds is called the "cocktail party effect".
  Independent component analysis is a data-driven signal processing method developed from blind source separation and based on higher-order statistics. Using statistical principles, it applies a linear transformation to separate the data or signals into linear combinations of statistically independent, non-Gaussian sources.
  Principal component analysis (PCA) is a dimensionality-reduction method; it differs from independent component analysis in several respects (see the comparison below).

Python implementation of ICA

We use FastICA from the Python machine-learning library scikit-learn to demonstrate signal separation and compare it with PCA.

#coding:utf-8
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from sklearn.decomposition import FastICA, PCA
# Generate simulated observations
np.random.seed(0)
n_samples = 2000
time = np.linspace(0, 8, n_samples)
s1 = np.sin(2 * time)  # source 1: sine wave
s2 = np.sign(np.sin(3 * time))  # source 2: square wave
s3 = signal.sawtooth(2 * np.pi * time)  # source 3: sawtooth wave
S = np.c_[s1, s2, s3]
S += 0.2 * np.random.normal(size=S.shape)  # add noise
S /= S.std(axis=0)  # standardize

# Mix the data
A = np.array([[1, 1, 1], [0.5, 2, 1.0], [1.5, 1.0, 2.0]])  # mixing matrix
X = np.dot(S, A.T)  # observed (mixed) signals

# ICA model
ica = FastICA(n_components=3)
S_ = ica.fit_transform(X)  # reconstructed sources
A_ = ica.mixing_  # estimated mixing matrix

# PCA model
pca = PCA(n_components=3)
H = pca.fit_transform(X)  # orthogonal components recovered by PCA

# Plot the results
plt.figure()
models = [X, S, S_, H]
names = ['Observations (mixed signal)',
         'True Sources',
         'ICA recovered signals',
         'PCA recovered signals']
colors = ['red', 'steelblue', 'orange']
for ii, (model, name) in enumerate(zip(models, names), 1):
    plt.subplot(4, 1, ii)
    plt.title(name)
    for sig, color in zip(model.T, colors):
        plt.plot(sig, color=color)
plt.subplots_adjust(0, 0.1, 1.2, 1.5, 0.26, 0.46)
plt.show()

from sklearn.decomposition import FastICA
ica = FastICA(n_components=None, algorithm='parallel', whiten=True, fun='logcosh', fun_args=None, max_iter=200, w_init=None)

The above are the parameters usually passed to the FastICA model, with their default values.
  • n_components: the number of components to extract.
  • algorithm: {'parallel', 'deflation'}, which FastICA algorithm to use.
  • whiten: True/False, whether to perform whitening.
  • fun: {'logcosh', 'exp', 'cube', ...}, the functional form used to approximate negentropy; a custom function can also be supplied.
  • fun_args: arguments passed to the objective function.
  • max_iter: the maximum number of iterations during fitting.
  • w_init: the initial mixing matrix.

Two key differences between PCA and ICA:

1. PCA assumes the source signals are mutually uncorrelated, whereas ICA assumes the source signals are mutually independent.

2. PCA assumes the principal components are orthogonal to each other and that the samples follow a Gaussian distribution; ICA does not require the samples to be Gaussian.

Sparse PCA (sparse principal component analysis)

Sparse principal component analysis is an algorithm introduced to address the problem that ordinary principal components are hard to interpret. It makes the principal component coefficients (the coefficients in front of each variable when forming a principal component) sparse, i.e., it drives most coefficients to zero. In this way, the main part of each principal component stands out, and the principal components become easier to interpret.

Making PCA sparse ultimately reduces to an optimization problem, namely adding a penalty function to the original PCA problem. This penalty encodes the sparsity information. The resulting problem is NP-hard, so in order to solve it we need methods that approximate its solution; this is where the different sparse PCA algorithms differ.
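A minimal scikit-learn sketch of this idea (random data only; the alpha parameter controls the strength of the sparsity penalty):

import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.RandomState(0)
X = rng.randn(100, 10)                      # dummy data: 100 samples, 10 variables

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
X_low = spca.fit_transform(X)               # (100, 3) low-dimensional representation
print(spca.components_)                     # many loadings are exactly 0, easing interpretation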

Non-negative matrix factorization (NMF)

The basic idea of NMF can be described simply as follows: for any given non-negative matrix V, the NMF algorithm finds a non-negative matrix W and a non-negative matrix H such that \(V \approx WH\), thereby decomposing a non-negative matrix into the product of two non-negative matrices. Both factors W and H are required to be non-negative.
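A minimal scikit-learn sketch of the factorization \(V \approx WH\) (random non-negative data, made-up sizes):

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
V = np.abs(rng.randn(6, 4))          # NMF requires a non-negative input matrix

model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(V)           # (6, 2), non-negative
H = model.components_                # (2, 4), non-negative
print(np.linalg.norm(V - W @ H))     # reconstruction error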

References:

https://zhuanlan.zhihu.com/p/142143151

https://github.com/jake-g/dsp-fpga-labs/blob/c0c84ed08c02da11fb4161599cafe8ddcabeaca4/Labs/sound_split_demo/seperate.m

https://www.cxymm.net/article/weixin_30446557/115847250

https://leoncuhk.gitbooks.io/feature-engineering/content/feature-extracting04.html

ICA: independent component analysis

Mathematical derivation of ICA

The idea behind the ICA algorithm is fairly simple, but the derivation is rather involved; this section only sketches the line of reasoning.

Assume we have n mixed signals \(X\subset{R^{n}}\) and n independent signals \(S\subset{R^{n}}\), and that each mixed signal is a linear combination of the n independent signals:
$$X=\begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix},\qquad S=\begin{bmatrix} s_1\\ s_2\\ \vdots\\ s_n \end{bmatrix},\qquad X=AS \;\Rightarrow\; S=WX,\quad W=A^{-1}$$

Assume further that for each mixed signal we can obtain m samples, giving the following \(n\times m\) sample matrix:
$$D=\begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1m}\\ \vdots & & & \vdots\\ d_{n1} & d_{n2} & \cdots & d_{nm} \end{bmatrix}$$

Since the n independent signals in S are mutually independent, their joint probability density is: $$p_S(s)=\Pi_{i=1}^{n}p_{s_i}(s_i)$$

Since \(s=Wx\), we obtain: \(p_X(x)=F'_{X}(x)=\left|\frac{\partial{s}}{\partial{x}}\right|\cdot p_S(s(x))=|W|\cdot\Pi_{i=1}^{n}p_{s_i}(w_ix)\)

Given the m samples, the likelihood of all samples is: \(L=\Pi_{i=1}^{m}\left(|W|\cdot\Pi_{j=1}^{n}p_{s_j}(w_{j\cdot}d_{\cdot{i}})\right)\)

Taking the logarithm gives: \(\ln L=\Sigma_{i=1}^{m}\Sigma_{j=1}^{n}\ln p_{s_j}(w_{j\cdot}d_{\cdot{i}})+m\ln|W|\)

All that remains is to maximize \(\ln L\) with respect to W by gradient ascent, i.e., to find the parameters W under which the observed samples are most probable.
Here we assume the cumulative distribution function of each independent signal above is the sigmoid (it is not clear whether this choice is related to the g function used in fastICA below): \(F_{s_i}(s_i)=\frac{1}{1+e^{-s_i}}\)

Finally, we obtain: \(\frac{\partial{\ln L}}{\partial{W}}=ZD^T+\frac{m}{|W|}(W^*)^T\)

where: $$Z=g(K)=\begin{bmatrix} g(k_{11}) & g(k_{12}) & \cdots & g(k_{1m})\\ \vdots & & & \vdots\\ g(k_{n1}) & g(k_{n2}) & \cdots & g(k_{nm}) \end{bmatrix},\qquad g(x)=\frac{1-e^x}{1+e^x},\qquad K=WD$$ with D the \(n\times m\) sample matrix defined above.

Since the adjugate matrix satisfies $$WW^*=|W|I$$

the gradient can be rewritten as: \(\frac{\partial{\ln L}}{\partial{W}}=ZD^T+m(W^{-1})^T\)

This yields the gradient-ascent update rule: $$W=W+\alpha(ZD^T+m(W^{-1})^T)$$

This completes the basic derivation of ICA; a small numerical sketch of the update above is given below. We then look at the procedure of the fastICA algorithm (without its mathematical derivation).
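As an illustration only, the maximum-likelihood update above could be written in NumPy as follows, assuming the sigmoid prior so that \(g(x)=\frac{1-e^x}{1+e^x}=-\tanh(x/2)\) (the function name and the learning rate are made up):

import numpy as np

def ica_gradient_step(W, D, lr=0.01):
    """One gradient-ascent step of ln L for an (n x n) unmixing matrix W and (n x m) data D."""
    K = W @ D                                            # K = WD, shape (n, m)
    Z = -np.tanh(K / 2.0)                                # g(K) = (1 - e^K) / (1 + e^K), written stably
    grad = Z @ D.T + D.shape[1] * np.linalg.inv(W).T     # Z D^T + m (W^{-1})^T
    return W + lr * grad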

Steps of the fastICA algorithm

The observed signals form a mixture of the sources; numerically approximating the inverse of the mixing matrix A involves three preprocessing steps:

  • Mean removal (centering): make the signal X zero-mean.
  • Whitening: remove correlations between the signals.
  • Construction of an orthogonal system.

Several improvements on the common ICA algorithm have led to the fastICA algorithm. fastICA is in fact a fixed-point iteration scheme that searches for directions w maximizing the non-Gaussianity of \(w^Tz\) (i.e., \(Y=w^Tz\)). The concrete steps are as follows:

  1. Center the observed data (subtract the mean).
  2. Whiten the data (remove correlations) to obtain z.
  3. Choose the number n of independent sources to be estimated.
  4. Randomly choose an initial weight matrix W (non-singular).
  5. Choose a nonlinear function g.
  6. Iteratively update:
    • \(w_i \leftarrow E\{zg(w_i^Tz)\}-E\{g'(w_i^Tz)\}w_i\)
    • \(W \leftarrow (WW^T)^{-1/2}W\)
  7. Check for convergence; if converged, continue to the next step, otherwise return to step 6.
  8. Return the approximation of the inverse of the mixing matrix.

Code implementation
Based on Python 2.7, matplotlib and numpy, mainly following sklearn's FastICA implementation.

import math
import random
import matplotlib.pyplot as plt
from numpy import *

n_components = 2

def f1(x, period=4):
    # sawtooth-like source signal
    return 0.5 * (x - math.floor(x / period) * period)

def create_data():
    # data number
    n = 500
    # data time
    T = [0.1 * xi for xi in range(0, n)]
    # sources
    S = array([[sin(xi) for xi in T], [f1(xi) for xi in T]], float32)
    # mixing matrix
    A = array([[0.8, 0.2], [-0.3, -0.7]], float32)
    return T, S, dot(A, S)

def whiten(X):
    # zero mean
    X_mean = X.mean(axis=-1)
    X -= X_mean[:, newaxis]
    # whiten
    A = dot(X, X.transpose())
    D, E = linalg.eig(A)
    D2 = linalg.inv(array([[D[0], 0.0], [0.0, D[1]]], float32))
    D2[0, 0] = sqrt(D2[0, 0]); D2[1, 1] = sqrt(D2[1, 1])
    V = dot(D2, E.transpose())
    return dot(V, X), V

def _logcosh(x, fun_args=None, alpha=1):
    # g and g' for the logcosh contrast function (written in-place as in sklearn)
    gx = tanh(alpha * x, x); g_x = gx ** 2; g_x -= 1.; g_x *= -alpha
    return gx, g_x.mean(axis=-1)

def do_decorrelation(W):
    # symmetric decorrelation: W <- (W W^T)^{-1/2} W
    s, u = linalg.eigh(dot(W, W.T))
    return dot(dot(u * (1. / sqrt(s)), u.T), W)

def do_fastica(X):
    n, m = X.shape; p = float(m); g = _logcosh
    # scale the whitened data as in sklearn's implementation
    X *= sqrt(X.shape[1])
    # create the initial (non-singular) weight matrix
    W = ones((n, n), float32)
    for i in range(n):
        for j in range(i):
            W[i, j] = random.random()

    # compute W by fixed-point iteration
    maxIter = 200
    for ii in range(maxIter):
        gwtx, g_wtx = g(dot(W, X))
        W1 = do_decorrelation(dot(gwtx, X.T) / p - g_wtx[:, newaxis] * W)
        lim = max(abs(abs(diag(dot(W1, W.T))) - 1))
        W = W1
        if lim < 0.0001:
            break
    return W

def show_data(T, S):
    plt.plot(T, [S[0, i] for i in range(S.shape[1])], marker="*")
    plt.plot(T, [S[1, i] for i in range(S.shape[1])], marker="o")
    plt.show()

def main():
    T, S, D = create_data()
    Dwhiten, K = whiten(D)
    W = do_fastica(Dwhiten)
    # Sr: reconstructed sources
    Sr = dot(dot(W, K), D)
    show_data(T, D)
    show_data(T, S)
    show_data(T, Sr)

if __name__ == "__main__":
    main()

References:

http://skyhigh233.com/blog/2017/04/01/ica-math/