The document summarizes research on conditional image generation using PixelCNN decoders. It discusses how PixelCNNs sequentially predict pixel values rather than the whole image at once. Previous work used PixelRNNs, but these were slow to train. The proposed approach uses a Gated PixelCNN that removes blind spots in the receptive field by combining horizontal and vertical feature maps. It also conditions PixelCNN layers on class labels or embeddings to generate conditional images. Experimental results show the Gated PixelCNN outperforms PixelCNN and achieves performance close to PixelRNN on CIFAR-10 and ImageNet, while training faster. It can also generate portraits conditioned on embeddings of people.
This material, compiled by Geio Nishida of Osaka University, surveys normalization techniques. It starts from why normalization is needed, explains the formulas behind Batch, Weight, and Layer Normalization, and closes with a solid comparison of the three methods. The discussion of training dynamics uses the Fisher Information Matrix, which will probably only matter to readers who want to study the topic in depth.
Deep Unsupervised Learning using Nonequilibrium Thermodynamics - Ha Phuong
The document discusses a new approach to unsupervised deep learning using concepts from nonequilibrium thermodynamics. Specifically, it proposes destroying structure in data through an iterative forward diffusion process, then learning the reverse diffusion process to restore structure and act as a generative model. This approach is shown to outperform other generative models on image datasets like CIFAR-10 and is able to perform tasks like inpainting. The diffusion process is modeled using Gaussian distributions and the reverse process is learned using a deep network as an approximator.
The document proposes a single image super-resolution method that combines multi-image and example-based super-resolution by leveraging patch redundancy. It models the super-resolution problem using similar patches within an image (multi-image approach) and across image scales (example-based approach). Experimental results show the proposed method performs better than interpolation and example-based approaches at enhancing detail in low resolution images.
FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence - LEE HOSEONG
This document summarizes the FixMatch paper, which proposes a simple semi-supervised learning method that achieves state-of-the-art results. FixMatch combines pseudo-labeling and consistency regularization by generating pseudo-labels for unlabeled data using a model's prediction on a weakly augmented version and enforcing consistency on a strongly augmented version. Extensive ablation studies show that FixMatch outperforms previous methods on standard benchmarks even with limited labeled data and identifies consistency regularization and pseudo-labeling as the most important factors for its success.
1. The document discusses energy-based models (EBMs) and how they can be applied to classifiers. It introduces noise contrastive estimation and flow contrastive estimation as methods to train EBMs.
2. One paper presented trains energy-based models using flow contrastive estimation by passing data through a flow-based generator. This allows implicit modeling with EBMs.
3. Another paper argues that classifiers can be viewed as joint energy-based models over inputs and outputs, and should be treated as such. It introduces a method to train classifiers as EBMs using contrastive divergence.
PR-409: Denoising Diffusion Probabilistic Models - Hyeongmin Lee
This paper is Denoising Diffusion Probabilistic Models (DDPM), the work that first popularized the currently hot diffusion models. It elegantly resolved several practical issues of diffusion, originally proposed at ICML 2015, and kicked off the trend. We will look at the different branches of generative modeling, at diffusion, and at what DDPM changed.
Paper: https://arxiv.org/abs/2006.11239
Video: https://youtu.be/1j0W_lu55nc
The document discusses digital image upscaling techniques from traditional methods to deep learning methods. It covers classical super-resolution methods for images and videos, including interpolation-based, edge-directed, frequency-domain, and example-based methods. It also explains the challenges of super-resolution such as information loss during the digital conversion process.
Like other fields of computer vision, image retrieval has been revolutionized by deep learning in recent years. Convolutional neural networks are now the tool of choice for computing feature representations of images. Many successful architectures employ global pooling layers to aggregate feature maps into a compact image representation. Using the standard neural network training procedure based on backpropagation and gradient descent, we can learn the global pooling operation from the training data.
We review existing approaches to learned pooling and propose two new layers: A learnable, extended variant of LSE pooling and the generalized max pooling layer based on an aggregation function from classical computer vision.
Our experiments show that learned global pooling can improve performance of image retrieval networks compared to the average pooling baseline for both tasks. For writer identification, our generalized max pooling layer outperforms all other tested pooling layers. Our learnable LSE pooling performs better than global average pooling and yields the best rank-1 score in our experiments on the Market-1501 dataset.
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis - taeseon ryu
This paper presents a 3D-aware model. With StyleGAN, if you wanted to edit a single feature, you could find the latent vector corresponding to the input and modify that vector to change the feature in question (say, the mouth). Building on this idea, the GANSpace paper tried to edit even spatial information given an input. Looking at the results, rotation appears reasonably well learned, but the output is sometimes no longer recognizable as the same person. This failure, where features other than the intended one also change, is described as a lack of disentanglement, and this paper was created to give the model a more effective understanding of 3D.
Dual Learning for Machine Translation (NIPS 2016) - Toru Fujino
The paper introduces a dual learning algorithm that utilizes monolingual data to improve neural machine translation. The algorithm trains two translation models in both directions simultaneously. Experimental results show that when trained with only 10% of parallel data, the dual learning model achieves comparable results to baseline models trained on 100% of data. The dual learning mechanism also outperforms baselines when trained on full data and can help address the lack of large parallel corpora.
Introduction of “Fairness in Learning: Classic and Contextual Bandits” - Kazuto Fukuchi
1. The document discusses fairness constraints in contextual bandit problems and classic bandit problems.
2. It shows that for classic bandits, Θ(k^3) rounds are necessary and sufficient to achieve a non-trivial regret under fairness constraints.
3. For contextual bandits, it establishes a tight relationship between achieving fairness and Knows What it Knows (KWIK) learning, where KWIK learnability implies the existence of fair learning algorithms.
Introduction of the "TrailBlazer" algorithm - Katsuki Ohto
Introduction slides for the paper "Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning", presented at the NIPS 2016 reading group @PFN (2017/1/19): https://connpass.com/event/47580/
Interaction Networks for Learning about Objects, Relations and Physics - Ken Kuroki
Slides for my presentation at a reading group. I have not contributed in any way to this study, which was done by the researchers named on the first slide.
https://papers.nips.cc/paper/6418-interaction-networks-for-learning-about-objects-relations-and-physics
Safe and Efficient Off-Policy Reinforcement Learning - mooopan
This document summarizes the Retrace(λ) reinforcement learning algorithm presented by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare. Retrace(λ) is an off-policy multi-step reinforcement learning algorithm that is safe (converges for any policy), efficient (makes best use of samples when policies are close), and has lower variance than importance sampling. Empirical results on Atari 2600 games show Retrace(λ) outperforms one-step Q-learning and existing multi-step methods.
Value Iteration Networks is a machine learning method for robot path planning that can operate in new environments not seen during training. It works by predicting optimal actions through learning reward values for each state and propagating rewards to determine the sum of future rewards. The method was shown to be effective for planning in grid maps and continuous control tasks, and was even applied to navigation of Wikipedia links.
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen... - Shuhei Yoshida
The goal was unsupervised learning of disentangled representations. The approach was to use GANs and maximize the mutual information between generated images and a subset of the input codes, which yields interpretable representations without supervision and at negligible additional cost.
Fast and Provably Good Seedings for k-Means - Kimikazu Kato
The document proposes a new MCMC-based algorithm for initializing centroids in k-means clustering that does not assume a specific distribution of the input data, unlike previous work. It uses rejection sampling to emulate the distribution and select initial centroids that are widely scattered. The algorithm is proven mathematically to converge. Experimental results on synthetic and real-world datasets show it performs well with a good trade-off of accuracy and speed compared to existing techniques.
Improving Variational Inference with Inverse Autoregressive Flow - Tatsuya Shirakawa
This slide deck was created for a NIPS 2016 study meetup.
IAF and other related research is briefly explained.
paper:
Diederik P. Kingma et al., "Improving Variational Inference with Inverse Autoregressive Flow", 2016
https://papers.nips.cc/paper/6581-improving-variational-autoencoders-with-inverse-autoregressive-flow
The document summarizes the paper "Matching Networks for One Shot Learning". It discusses one-shot learning, where a classifier can learn new concepts from only one or a few examples. It introduces matching networks, a new approach that trains an end-to-end nearest neighbor classifier for one-shot learning tasks. The matching networks architecture uses an attention mechanism to compare a test example to a small support set and achieve state-of-the-art one-shot accuracy on Omniglot and other datasets. The document provides background on one-shot learning challenges and related work on siamese networks, memory augmented neural networks, and attention mechanisms.
https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
In this presentation we discuss the convolution operation, the architecture of a convolutional neural network, and different layers such as pooling. This presentation draws heavily on A. Karpathy's Stanford course CS 231n.
Build a Convolutional Neural Network (CNN) using TensorFlow in Python - Kv Sagar
1. The document discusses CNN architecture and concepts like convolution, pooling, and fully connected layers.
2. Convolutional layers apply filters to input images to generate feature maps, capturing patterns like edges. Pooling layers downsample these to reduce parameters.
3. Fully connected layers at the end integrate learned features for classification tasks like image recognition. CNNs exploit spatial structure in images unlike regular neural networks.
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
The document summarizes a research paper that uses a technique called deconvnet to visualize and understand what convolutional neural networks have learned. It introduces deconvnet as a method to approximate activations in higher layers of a convnet by using transposed convolutions and max location switches from pooling layers. The document then shows examples of visualizing filters from different layers of a trained convnet on ImageNet, revealing what patterns and parts of images the network has learned to detect at each layer.
https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and computational perspectives.
Here, we have implemented a CNN on an FPGA using a novel convolution technique that combines pipelining with parallelism, optimizing the interplay between the two.
- R-CNN was the first CNN model to achieve high performance in object detection. It used a multi-stage pipeline involving region proposals, feature extraction via CNN, and SVM classification. It was slow due to computing CNN features for each region individually.
- Fast R-CNN improved on R-CNN by introducing a ROI pooling layer to share computation and enabling end-to-end training. However, region proposals were still generated externally, slowing down detection.
- Faster R-CNN addressed this by introducing a Region Proposal Network to generate proposals, allowing the entire model to be trained end-to-end. This led to faster and more accurate detection compared to previous models.
- YOLO
SIGGRAPH 2014 Course on Computational Cameras and Displays (part 3) - Matthew O'Toole
This document summarizes Gordon Wetzstein's presentation on compressive display systems. It discusses the evolution of displays from parallax barriers in 1903 to modern tensor displays. Computational techniques like low-rank light field factorization and tensor factorization can be used to compress light fields and reduce the number of pixels needed in multi-layer displays. These compressive displays integrate optimization, sensing, and human perception to provide capabilities like high dynamic range, super resolution, and 3D display using a single display system. Wetzstein envisions compressive displays enabling new form factors and applications in mobile devices, projection systems, and head-mounted displays.
This document summarizes the DenseBox paper, which introduces a unified end-to-end fully convolutional network (FCN) framework for object detection. The key points are:
1. DenseBox directly predicts bounding boxes and class confidences through all locations and scales of an image using a single FCN, showing one-stage detectors can detect objects under different scales.
2. DenseBox is designed to detect small and occluded objects by fusing features from different convolutional layers and generating dense predictions.
3. It performs multi-task training for classification, regression, and optionally landmark localization through multiple loss functions and hard negative mining.
4. Experiments on face and car detection datasets show DenseBox achieves strong performance.
The document discusses Convolutional Neural Networks (CNNs). It explains that CNNs are a type of neural network that use convolutional operations in at least one layer. CNNs are well-suited for image classification and segmentation problems. The key layers in a CNN are convolutional layers, pooling layers, flattening layers, and fully connected layers. Convolutional layers act as feature extractors, pooling layers reduce spatial size, flattening layers transform pooled features into a vector, and fully connected layers are for classification.
Mask R-CNN is an algorithm for instance segmentation that builds upon Faster R-CNN by adding a branch for predicting masks in parallel with bounding boxes. It uses a Feature Pyramid Network to extract features at multiple scales, and RoIAlign instead of RoIPool for better alignment between masks and their corresponding regions. The architecture consists of a Region Proposal Network for generating candidate object boxes, followed by two branches - one for classification and box regression, and another for predicting masks with a fully convolutional network using per-pixel sigmoid activations and binary cross-entropy loss. Mask R-CNN achieves state-of-the-art performance on standard instance segmentation benchmarks.
Similar to Conditional Image Generation with PixelCNN Decoders
Applications of Data Science in Various Industries - IABAC
This deck covers the wide-ranging applications of data science across industries. From healthcare to finance, data science drives innovation and efficiency by transforming raw data into actionable insights. Learn how data science enhances decision-making, boosts productivity, and fosters new advances in technology and business, with real-world examples of data science applications today.
How we implemented "exactly once" semantics in our database ... - javier ramirez
Distributed systems are hard. High-performance distributed systems, even more so. Network latencies, unacknowledged messages, server restarts, hardware failures, software bugs, problematic releases, timeouts... there are plenty of reasons why it is very hard to know whether a message you sent was received and processed correctly at its destination. So, to be safe, you send the message again... and again... and cross your fingers hoping the system on the other side tolerates duplicates.
QuestDB is an open-source database designed for high performance. We wanted to be able to offer "exactly once" guarantees by deduplicating messages at ingestion time. In this talk I explain how we designed and implemented the DEDUP keyword in QuestDB, enabling deduplication and also upserts on real-time data, while adding only 8% processing overhead, even on streams with millions of inserts per second.
I will also explain our parallel, multithreaded write-ahead log (WAL) architecture. And of course, all of this comes with demos, so you can see how it works in practice.
An LLM-powered contract-compliance application that uses the advanced RAG method Self-RAG and a knowledge graph together for the first time.
It provides the highest accuracy for contract compliance recorded so far in the oil and gas industry.
[D2T2S04] Generative AI Foundation Model Training and Tuning with SageMaker - Donghwan Lee
This session presents how to pre-train or fine-tune foundation models using SageMaker Training Jobs / SageMaker JumpStart. It introduces three topics:
1. Training a foundation model from scratch
2. Pre-training a foundation model starting from an open-source model
3. Fine-tuning a model for a specific domain
Speakers:
Miron Perel, Principal ML GTM Specialist, AWS
Kristine Pearce, Principal ML BD, AWS
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second and manage petabytes of data, scaling relational database workloads beyond the limits of a single Aurora writer instance without creating custom application logic or managing multiple databases.
Conditional Image Generation with PixelCNN Decoders

1. Conditional Image Generation with PixelCNN Decoders
Presenter: Yohei Sugawara, BrainPad Inc.
NIPS 2016 reading group (@Preferred Networks), January 19, 2017
Paper authors: Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu (Google DeepMind)
2. Preview
【Background: Autoregressive Image Modeling】
- Pixel dependencies follow the raster scan order.
- Autoregressive models sequentially predict pixels rather than predicting the whole image at once (as GANs and VAEs do).
【Previous work: Pixel Recurrent Neural Networks】
- A CNN-based model and RNN-based models were proposed.
- PixelCNN (masked convolution). Pros: easier to parallelize. Cons: bounded dependency field.
- PixelRNN (Diagonal BiLSTM). Pros: full dependency field. Cons: sequential training.
【Proposed approach: Gated PixelCNN】
- Combines vertical and horizontal feature maps.
- Conditioning on a one-hot encoding of the class label, or on an embedding from a trained model.
【Improvements】
1. Removal of blind spots in the receptive field by combining the horizontal stack and the vertical stack.
2. Gated PixelCNN architecture.
3. Conditional PixelCNN.
4. PixelCNN Auto-Encoder (encoder: convolutional layers, decoder: conditional PixelCNN layers).
【Experimental Results】
- Data: CIFAR-10 dataset (32x32 color images)
- Performance (unconditional)
3. Image Generation Models
- Three image-generation approaches are dominating the field:
- Variational Auto-Encoders (VAE): an encoder q_φ(z|x) maps an image x to a latent z, and a decoder reconstructs x ~ p_θ(x|z) from a prior sample z ~ p_θ(z).
- Generative Adversarial Networks (GAN): a generator G produces fake samples and a discriminator D judges real vs. fake.
- Autoregressive Models (cf. https://openai.com/blog/generative-models/)

Pros:
- VAE: efficient inference with approximate latent variables.
- GAN: generates sharp images; no need for any Markov chain or approximate networks during sampling.
- Autoregressive models: very simple and stable training process; currently give the best log-likelihood; tractable likelihood.

Cons:
- VAE: generated samples tend to be blurry.
- GAN: difficult to optimize due to unstable training dynamics.
- Autoregressive models: relatively inefficient during sampling.
4. Autoregressive Image Modeling
- Autoregressive models train a network that models the conditional distribution of every individual pixel given the previous pixels (raster-scan-order dependencies).
⇒ Pixels are predicted sequentially rather than the whole image at once (unlike GAN, VAE).
- For a color image, the 3 channels (R, G, B) are generated by successive conditioning: blue given red and green, green given red, and red given only the pixels above and to the left in all channels.
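The raster-scan dependency structure described above is the standard chain-rule factorization of the joint image distribution over the n^2 pixels, with the per-pixel color factorization as in the paper:

```latex
p(\mathbf{x}) \;=\; \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1}),
\qquad
p(x_i \mid \mathbf{x}_{<i}) \;=\;
  p(x_{i,R} \mid \mathbf{x}_{<i})\,
  p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R})\,
  p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G})
```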
5. Previous work: Pixel Recurrent Neural Networks
"Pixel Recurrent Neural Networks" won the best paper award at ICML 2016. It proposed two types of models, PixelRNN and PixelCNN (with two types of LSTM layers, Row LSTM and Diagonal BiLSTM, for PixelRNN).
PixelRNN:
• LSTM-based models are a natural choice for dealing with the autoregressive dependencies.
• Pros: effectively handles long-range dependencies ⇒ good performance.
• Cons: each state needs to be computed sequentially ⇒ computationally expensive.
PixelCNN:
• The CNN-based model uses masked convolution to ensure the model is causal.
• Pros: convolutions are easier to parallelize ⇒ much faster to train.
• Cons: bounded receptive field ⇒ inferior performance; the blind-spot problem (due to the masked convolution) needs to be eliminated.
(Figure: a 3x3 masked convolution with weights w_11 ... w_33.)
6. Details of "Masked Convolution" and the "Blind Spot"
To generate the next pixel, the model can only condition on the previously generated pixels. To make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution need to be masked.
1D case (e.g. text, audio):
The figure shows 5x1 convolutional filters after multiplying them by a mask. The filters connecting the input layer to the first hidden layer are in this case multiplied by m = (1, 1, 0, 0, 0), to ensure the model is causal. (cf. Generating Interpretable Images with Controllable Structure, S. Reed et al., 2016)
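As an illustration (a minimal NumPy sketch, not the authors' code), the 1D masking above can be reproduced directly: the 5-tap filter is multiplied element-wise by m = (1, 1, 0, 0, 0), so the output at position i only sees inputs at positions i-2 and i-1.

```python
import numpy as np

def masked_conv1d(x, w, mask):
    """'Same'-padded 1D convolution with a masked 5-tap filter.

    With a centred length-5 filter, tap t looks at input position
    i + (t - 2); the mask m = (1, 1, 0, 0, 0) keeps only the taps that
    look at strictly-past inputs (i-2 and i-1), making the layer causal.
    """
    wm = w * mask
    xp = np.pad(x, 2)  # zero padding on both sides
    return np.array([np.dot(wm, xp[i:i + 5]) for i in range(len(x))])

rng = np.random.default_rng(0)
w = rng.normal(size=5)
mask = np.array([1.0, 1.0, 0.0, 0.0, 0.0])

x = rng.normal(size=8)
y = masked_conv1d(x, w, mask)

# Perturbing the input at position 4 only affects outputs at positions > 4.
x2 = x.copy()
x2[4] += 10.0
y2 = masked_conv1d(x2, w, mask)
```

The causality check at the end makes the point of the mask concrete: no output depends on its own or any future input.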
2D case (e.g. images):
In 2D, PixelCNNs have a blind spot in the receptive field that cannot be used to make predictions. The rightmost figure shows the growth of the masked receptive field for a 3-layer network with 3x3 convolutional filters. (Figures: a 5x5 masked filter; a 3x3 filter stacked over 3 layers.)
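For the 2D case, the masks themselves are easy to construct. The sketch below is my own single-channel NumPy illustration (the real masks additionally handle the R/G/B channel ordering) of mask 'A' and mask 'B' for a k x k filter:

```python
import numpy as np

def make_mask(k, mask_type):
    """k x k mask for a 2D masked convolution (single channel).

    Rows above the centre are fully visible, and on the centre row only
    pixels strictly to the left are visible. Mask 'A' also hides the
    centre pixel itself (first layer); mask 'B' keeps it (later layers).
    """
    m = np.zeros((k, k))
    c = k // 2
    m[:c, :] = 1.0   # all rows above the current pixel
    m[c, :c] = 1.0   # current row, strictly to the left
    if mask_type == 'B':
        m[c, c] = 1.0  # allow the connection from the pixel itself
    return m

mask_a = make_mask(3, 'A')
mask_b = make_mask(3, 'B')
```

Multiplying a filter by such a mask before every convolution (as in the 1D case) is what produces the blind spot the paper sets out to remove.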
7. Proposed approach: Gated PixelCNN
In this paper, they propose an improved version of PixelCNN. The major improvements are as follows.
1. Removal of blind spots in the receptive field by combining the horizontal stack and the vertical stack.
2. Replacement of the ReLU activations between the masked convolutions in the original PixelCNN with the gated activation unit.
3. Modeling the conditional distribution of images given a latent vector: the conditional PixelCNN.
a. conditioning on a class label
b. conditioning on an embedding from a trained model
4. Starting from a convolutional auto-encoder, replacing the deconvolutional decoder with a conditional PixelCNN: the PixelCNN Auto-Encoder.
8. First improvement: horizontal stack and vertical stack
The removal of blind spots in the receptive field is important for PixelCNN's performance, because the blind spot can cover as much as a quarter of the potential receptive field.
- The vertical stack conditions on all rows above the current row.
- The horizontal stack conditions on the current row.
- Details of the implementation techniques are described below.
9. Second improvement: gated activation and architecture
Gated activation unit:
y = tanh(W_{k,f} * x) ⊙ σ(W_{k,g} * x)
(σ: sigmoid, k: layer index, ⊙: element-wise product, *: convolution operator)
Single-layer block of a Gated PixelCNN:
- Masked convolutions are shown in green.
- Element-wise operations are shown in red.
- Convolutions with W_f and W_g are combined into a single operation, shown in blue.
(In the figure, v and v' are the vertical activation maps at the block's input and output, h and h' the horizontal activation maps, and v_int, h_int the intermediate maps.)
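As a shape-level sketch (my own illustration, with the spatial masked convolutions reduced to 1x1 channel matmuls so the example stays short), the gated activation unit can be written as:

```python
import numpy as np

def gated_activation(x, w_f, w_g):
    """y = tanh(W_f * x) ⊙ sigmoid(W_g * x), applied per feature map.

    Here '*' is reduced to a 1x1 convolution (a matrix product over the
    channel axis) to keep the sketch short; in the real model W_f and
    W_g are masked spatial convolutions.
    """
    f = np.tanh(x @ w_f)
    g = 1.0 / (1.0 + np.exp(-(x @ w_g)))  # sigmoid gate
    return f * g

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 8))   # H x W x channels
w_f = rng.normal(size=(8, 8))
w_g = rng.normal(size=(8, 8))
y = gated_activation(x, w_f, w_g)
```

The sigmoid branch acts as a per-feature gate on the tanh branch, which is the multiplicative interaction the paper credits for matching PixelRNN's modeling power.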
10. Details of the Gated PixelCNN architecture
The operations break down into four steps.
① Calculate vertical feature maps: n×n convolutions are calculated with gated activation.
Input: v (= the input image at the 1st layer). Output: v'.
Two equivalent implementations (e.g. n = 3):
- a 3x3 masked filter over the feature map with (1, 1, 1, 1) zero padding, or
- a 2x3 filter with (1, 0, 1, 1) zero padding.
Next problem: as computed, the (i, j)-th pixel of the vertical map depends on the (i, j+k)-th (future) pixels.
Solution: shift the vertical feature maps down when feeding them into the horizontal stack.
11. Details of the Gated PixelCNN architecture
② Feed the vertical maps into the horizontal stack:
1. n x n masked convolution
2. shifting-down operation (as below)
3. 1 x 1 convolution
Input: v (= the input image at the 1st layer). Output: v_int.
The shift-down operation, applied to the vertical feature maps before they are fed into the horizontal stack:
1. Add a row of zero padding on the top.
2. Crop the bottom row.
Without the shift, the maps would violate causality; with it, causality is ensured.
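The pad-top/crop-bottom operation is one line of array code; a NumPy sketch (my own illustration, operating on a single H x W map):

```python
import numpy as np

def shift_down(v):
    """Pad the top with a zero row and crop the bottom row, so that row i
    of the output only contains information from rows < i of the input.
    This is the causality fix applied before the vertical stack's feature
    maps are fed into the horizontal stack."""
    pad = np.zeros((1,) + v.shape[1:])
    return np.concatenate([pad, v], axis=0)[:-1]

v = np.arange(12.0).reshape(4, 3)  # toy 4x3 vertical feature map
s = shift_down(v)
```

After the shift, the horizontal stack at row i only ever sees vertical-stack information from rows strictly above i.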
12. Details of the Gated PixelCNN architecture
③ Calculate horizontal feature maps: 1×n convolutions are calculated with gated activation (the vertical maps are added before the activation).
Input: h and v_int (the input image at the 1st layer). Output: h_int.
Two equivalent implementations (e.g. n = 3):
- a 1x3 masked filter with (0, 0, 1, 1) zero padding, or
- a 1x2 filter with (0, 0, 1, 0) zero padding.
Next problem: Mask 'A' vs. Mask 'B'
- Mask 'A' (restricts the connection from the pixel itself) is applied only to the first convolution.
- Mask 'B' (allows the connection from the pixel itself) is applied to all subsequent convolutions.
13. Details of the Gated PixelCNN architecture
④ Calculate the residual connection in the horizontal stack: 1×1 convolutions are calculated without gated activation, and the resulting maps are added to the horizontal maps of the layer's input.
Input: h_int (and the layer input h). Output: h'.
③ (supplement) Mask 'A' can be implemented as below (e.g. n = 3): a 1x1 filter with (0, 0, 1, 0) zero padding [convolution, then crop the right].
14. Output layer and whole architecture
Output layer: a softmax over discrete pixel values ([0-255], i.e. 256-way) is used instead of a mixture density approach (the same approach as PixelRNN). Although the model has no prior information about the meaning or relations of the 256 color categories, the distributions it predicts are meaningful.
Whole architecture:
Input: (width) x (height) x (channels)
→ stacked Gated PixelCNN layers: (width) x (height) x p (#feature maps)
→ additional 1x1 conv layers
→ output
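A sketch of the 256-way output head (my own illustration; per pixel-channel the network emits 256 logits, and a discrete value is sampled from the softmax):

```python
import numpy as np

def sample_pixel(logits, rng):
    """Sample a discrete pixel value in [0, 255] from 256-way softmax logits."""
    z = logits - logits.max()            # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()      # softmax over the 256 categories
    return rng.choice(256, p=p)

rng = np.random.default_rng(0)
logits = rng.normal(size=256)            # stand-in for the network's output
v = sample_pixel(logits, rng)
```

During generation this sampling step is repeated pixel by pixel in raster-scan order, feeding each sampled value back into the network.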
15. Third improvement: conditional PixelCNN and PixelCNN Auto-Encoder
Conditional PixelCNN: they model the conditional distribution by adding terms that depend on a conditioning vector h to the activations before the nonlinearities:
y = tanh(W_{k,f} * x + V_{k,f}^T h) ⊙ σ(W_{k,g} * x + V_{k,g}^T h)
(The figure compares the original and conditional gated activation units.)
PixelCNN Auto-Encoder: starting from a convolutional auto-encoder, they replaced the deconvolutional decoder with a conditional PixelCNN (encoder: convolution layers; decoder: deconvolution layers ⇒ conditional PixelCNN layers).
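A minimal sketch of the conditional gated activation (my own illustration; as before, the spatial masked convolutions are reduced to 1x1 channel matmuls, and h is taken to be a one-hot class label):

```python
import numpy as np

def cond_gated_activation(x, h, w_f, w_g, v_f, v_g):
    """y = tanh(W_f * x + V_f^T h) ⊙ sigmoid(W_g * x + V_g^T h).

    The conditioning vector h (e.g. a one-hot class label or a portrait
    embedding) contributes a per-channel bias inside both the filter and
    the gate, shared across all spatial positions.
    """
    f = np.tanh(x @ w_f + h @ v_f)
    g = 1.0 / (1.0 + np.exp(-(x @ w_g + h @ v_g)))
    return f * g

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 8))   # H x W x channels
h = np.zeros(10)
h[3] = 1.0                        # one-hot encoding of class 3
w_f, w_g = rng.normal(size=(2, 8, 8))
v_f, v_g = rng.normal(size=(2, 10, 8))
y = cond_gated_activation(x, h, w_f, w_g, v_f, v_g)
```

Because h enters every layer as a bias rather than as an extra input image, the same trick works for class labels and for dense embeddings alike.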
16. Experimental Results (Unconditional)
Data: CIFAR-10 dataset. Score: negative log-likelihood (bits/dim).
Gated PixelCNN outperforms PixelCNN by 0.11 bits/dim, which has a very significant effect on visual quality, and comes close to the performance of PixelRNN.
Data: ImageNet dataset.
Gated PixelCNN outperforms PixelRNN, achieving similar performance in less than half the training time.
17. Experimental Results (Conditional)
Conditioning on ImageNet classes: given a one-hot encoding h_i for the i-th class, model p(x | h_i).
Conditioning on portrait embeddings (a part of the results): embeddings are taken from the top layer of a conv network trained on a large database of portraits from Flickr images. After the supervised net was trained, {x: image, h: embedding} tuples were taken and a conditional PixelCNN was trained to model p(x | h). Given a new image of a person who was not in the training set, they computed h and generated new portraits of the same person. They also experimented with reconstructions conditioned on linear interpolations between the embeddings of pairs of images.
18. Experimental Results (PixelCNN Auto-Encoder)
Data: 32x32 ImageNet patches.
(Left to right: original image, reconstruction by an auto-encoder, conditional samples from the PixelCNN auto-encoder; m: dimensionality of the bottleneck.)
19. Summary & References
Summary
Improved PixelCNN:
- Same performance as PixelRNN, but faster (easier to parallelize)
- Fixed the "blind spot" problem
- Gated activation units
Conditional generation:
- Conditioned on a class label
- Conditioned on a portrait embedding
- PixelCNN Auto-Encoders
References
[1] Aäron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016.
[2] Aäron van den Oord et al., "Pixel Recurrent Neural Networks", ICML 2016 (Best Paper Award).
[3] S. Reed, A. van den Oord et al., "Generating Interpretable Images with Controllable Structure", under review as a conference paper at ICLR 2017.
20. Appendix: Progress of research related to this paper
Applied to other domains:
➢ "WaveNet: A Generative Model for Raw Audio", A. van den Oord et al. (DeepMind): the conditional probability distribution is modelled by a stack of dilated causal convolutional layers with gated activation units.
➢ "Video Pixel Networks", Nal Kalchbrenner, A. van den Oord et al. (DeepMind): the generative video model consists of two parts: 1. resolution-preserving CNN encoders and 2. PixelCNN decoders.
➢ "Language Modeling with Gated Convolutional Networks", Yann N. Dauphin et al. (Facebook AI Research): a new language model that replaces the recurrent connections typically used in RNNs with gated temporal convolutions.
➢ "Generating Interpretable Images with Controllable Structure", S. Reed, A. van den Oord et al. (Google DeepMind), under review as a conference paper at ICLR 2017: text-to-image synthesis (generating images from captions and other structure) using the gated conditional PixelCNN model.
21. Appendix: Progress of research related to this paper (cont.)
Modifications of the Gated PixelCNN model:
➢ "PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications", Tim Salimans, Andrej Karpathy, et al. (OpenAI), under review as a conference paper at ICLR 2017: a number of modifications to the original gated PixelCNN model:
1. Use a discretized logistic mixture likelihood rather than a 256-way softmax.
2. Condition on whole pixels rather than R/G/B sub-pixels.
3. Use downsampling.
4. Introduce additional short-cut connections (as in U-Net).
5. Regularize the model using dropout.
➢ "PixelVAE: A Latent Variable Model for Natural Images", Ishaan Gulrajani, Kundan Kumar et al., under review as a conference paper at ICLR 2017: a VAE model with an autoregressive decoder based on PixelCNN.