We carried out the experiments on three data sets: MNIST, CIFAR-10 and MSCOCO. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples, which we use for training and validation respectively. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. Each image has a single channel (grayscale) and a size of 28×28 pixels.

The CIFAR-10 dataset consists of 60000 32×32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The data set is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

MSCOCO is the largest of the three data sets. It contains photos of 91 object types that would be easily recognizable by a 4-year-old, with a total of 2.5 million labeled instances in 328k images. Its creation drew upon extensive crowd-worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. For our experiments we took 25600 images for training and the same number for validation.

For each data set we tested the reconstruction abilities of stacked denoising autoencoders and variational autoencoders. All experiments used additive Gaussian noise, one of the most natural kinds of noise occurring in real-world images. We used a larger noise intensity for MNIST, whose images have a single channel, than for the other two data sets, which consist of 3-channel colour images.
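The corruption step can be sketched as follows. This is a minimal illustration, assuming images are stored as float arrays scaled to [0, 1]; the function name `add_gaussian_noise` and the two `SIGMA_*` values are hypothetical, since the text only states that MNIST used a larger intensity than the colour data sets.

```python
import numpy as np

def add_gaussian_noise(images, sigma, rng=None):
    """Return a copy of `images` corrupted with additive zero-mean Gaussian noise.

    `images` is assumed to be a float array scaled to [0, 1];
    `sigma` is the noise standard deviation (the "intensity").
    """
    rng = np.random.default_rng() if rng is None else rng
    images = np.asarray(images, dtype=np.float32)
    noise = rng.normal(loc=0.0, scale=sigma, size=images.shape)
    noisy = images + noise
    # Clip so the corrupted images stay in the valid pixel range.
    return np.clip(noisy, 0.0, 1.0).astype(np.float32)

# Hypothetical noise levels, chosen only to illustrate "larger for MNIST".
SIGMA_MNIST = 0.5
SIGMA_COLOUR = 0.1
```

A denoising autoencoder would then be trained with `add_gaussian_noise(x, sigma)` as input and the clean `x` as the reconstruction target.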

Fig. 1 visualises a 2D mapping of the latent variables. For CIFAR-10 the latent values have a larger variance than for MNIST, which is consistent with CIFAR-10 being the more complicated data set.
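The visual comparison of spread could also be expressed as a single number. The sketch below assumes the latent codes are available as an array of shape (n_samples, n_dims); the helper `total_latent_variance` is a hypothetical name, and the two arrays are synthetic placeholders rather than encoder outputs from these experiments.

```python
import numpy as np

def total_latent_variance(latent_codes):
    """Sum of per-dimension variances of latent codes, shape (n_samples, n_dims)."""
    z = np.asarray(latent_codes, dtype=np.float64)
    return float(z.var(axis=0).sum())

# Synthetic placeholders standing in for 2D latent codes (not real results).
rng = np.random.default_rng(0)
z_narrow = rng.normal(scale=1.0, size=(1000, 2))
z_wide = rng.normal(scale=3.0, size=(1000, 2))
print(total_latent_variance(z_narrow), total_latent_variance(z_wide))
```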

a) First (simpler) architecture b) Second (more complex) architecture

Fig. 0: Two architectures of SDCAE and one VAE

In Fig. 0 we show two architectures of Stacked Denoising Convolutional Autoencoders (SDCAE). The computational graph for the VAE is presented in the post "VAE for image inpainting (MS COCO dataset): Intro". The first graph is much simpler than the second, so we use it for most of the experiments because computations on it are much faster. The second architecture can still be used in future experiments.

a) MNIST (100 epochs) b) CIFAR-10 (100 epochs)

Fig. 1 Visualisation of latent variables

Fig. 2 gives an example of image reconstruction on the MNIST data set. As seen from the figure, the VAE performs better whether it is trained for 30 or 100 epochs. For the VAE there is no direct dependency between the loss and the visual quality of an image, which is why we decided to train longer, as is usually suggested for this kind of stochastic model. For this kind of data our VAE architecture is sufficient to regenerate clean images. In contrast, the SDCAE does not restore images as well, which supports the view that a purely generative model with enough capacity may outperform a deterministic model when the corruption is stochastic. Fig. 3 shows the losses for both SDCAE and VAE.
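One way to see why the VAE loss need not track visual quality is to write out the objective. The post does not spell out its loss, so the following is the standard VAE formulation, given here as an assumption; the symbols (encoder $q_\phi$, decoder $p_\theta$, prior $p(z)$) are the conventional ones and do not come from the original text.

```latex
\mathcal{L}(\theta, \phi; x) =
  -\mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  + D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
```

The total is the sum of a reconstruction term and a KL regulariser, so a decrease in the loss can come from either term and does not necessarily mean sharper reconstructions.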

a) MNIST: original images

b) MNIST: noisy images and their reconstructions by SDCAE (30 epochs)

c) MNIST: noisy images and their reconstructions by VAE (30 epochs)

d) MNIST: noisy images and their reconstructions by SDCAE (100 epochs)

e) MNIST: noisy images and their reconstructions by VAE (100 epochs)

Fig. 2 MNIST: images reconstructed from noise

a) SDCAE loss (30 epochs) b) SDCAE loss (100 epochs)

c) VAE loss (30 epochs) d) VAE loss (100 epochs)

Fig. 3 MNIST: loss functions for SDCAE and VAE

Fig. 4 shows experiments on the CIFAR-10 data set, which is more complicated than MNIST. Each image still contains a single object and there are again 10 classes, but all images are 3-channel (colour) and the objects are more complicated and variable than handwritten digits. The reconstructions produced by SDCAE and VAE are comparable, with slightly better results from SDCAE. The most probable reason is that our VAE model does not have enough capacity (it is not deep enough) to capture the more complicated structure of these images. However, we believe that with an appropriate model the VAE might reconstruct better. Fig. 5 presents the loss functions for the two architectures.

a) CIFAR-10: original images

b) CIFAR-10: noisy images and their reconstructions by SDCAE (30 epochs)

c) CIFAR-10: noisy images and their reconstructions by SDCAE (100 epochs)

d) CIFAR-10: noisy images and their reconstructions by VAE (100 epochs)

Fig. 4 CIFAR-10: images reconstructed from noise

a) SDCAE loss (30 epochs) b) VAE loss (100 epochs)

Fig. 5 CIFAR-10: loss functions for SDCAE and VAE

The examples in Fig. 6 show that for a more complicated image data set such as MSCOCO, the VAE architecture is not powerful enough to produce a reasonable reconstruction. Even after longer training (see Fig. 6), when some overfitting appears (a gap between training and validation losses), the images remain blurred with poor texture. The architecture cannot capture more complex image structures. MSCOCO contains images of different categories with different backgrounds, as when one photographs an arbitrary object in an arbitrary place, so it represents a very general case of an image. We can conclude that an appropriately chosen data set is a good test of whether a model has sufficient capacity for a given problem.
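The overfitting mentioned here, read as the difference between validation and training loss, can be tracked per epoch with a few lines. This is a minimal sketch; `generalization_gap` is a hypothetical helper, and the loss histories are made-up numbers for illustration, not measurements from these experiments.

```python
def generalization_gap(train_losses, val_losses):
    """Per-epoch difference between validation and training loss."""
    if len(train_losses) != len(val_losses):
        raise ValueError("loss histories must have the same length")
    return [v - t for t, v in zip(train_losses, val_losses)]

# Made-up loss histories for illustration only.
train = [0.90, 0.60, 0.40, 0.30]
val = [0.95, 0.70, 0.60, 0.65]
gaps = generalization_gap(train, val)
```

A gap that keeps growing while the training loss still falls is the usual sign that longer training is no longer improving the model on unseen images.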

a) MSCOCO: original images

b) MSCOCO: noisy images and their reconstructions by SDCAE

c) MSCOCO: noisy images and their reconstructions by VAE

Fig. 6 MSCOCO: images reconstructed from noise

a) SDCAE loss (30 epochs) b) VAE loss (100 epochs)

We carried out experiments on three image data sets: MNIST, CIFAR-10 and MSCOCO. The main objective was to compare the two models and verify whether the VAE and SDCAE have enough capacity to learn the variations in the data for each data set. Most important is to verify whether the VAE can capture the variations in the data and thus generate a reasonable output. VAEs are very powerful generative models, and to use them effectively it is important to know whether they have enough capacity for a given type of data.

For comparison we took SDCAE and VAE. The aim was to figure out when it is better to use SDCAEs, which are deterministic models, and when to use VAEs, which are stochastic generators. The experiments show that for a simple data set such as MNIST the stochastic generative model outperforms the deterministic one. For the more complicated CIFAR-10 data set the results are roughly comparable, with probably slightly better reconstruction from the deterministic SDCAE. For MSCOCO, the SDCAE clearly outperforms the VAE. Regardless of the results on MSCOCO, one can see how powerful generative models might be. The main difficulty with such models is that they require very careful tuning and architecture design. For MNIST the architecture fits the data, for CIFAR-10 it slightly underfits, and for MSCOCO it underfits considerably. For MNIST we used 30 and 100 epochs for both models; for CIFAR-10 we used 30 and 100 epochs for SDCAE and 100 epochs for VAE; for MSCOCO we used 100 epochs for both architectures. Even training for many more epochs did not help to generate good-quality images, especially for MSCOCO. We believe that with a more complex model it will be possible to generate high-quality images using a VAE. This also gives hope of obtaining very interesting features from the latent layers. In \cite{vincent2010stacked} the authors claim that the visually inspected image quality relates to the quality of the features generated by a latent layer.