The code provided in this repository was created during my bachelor thesis in the Computer Vision research group at the Heidelberg Collaboratory for Image Processing.
Below you can find a brief description of this work; for all details, have a look here.
In this thesis we investigate the mapping from image space to latent space using neural networks. We focus on image data sets specifically created to display three-dimensional objects with exactly labeled articulations. Using a variational autoencoder in combination with a discriminative network, the aim is to extract and investigate information about articulations from images. This enables us to explore and compare the mapping of specific articulation parameters onto the latent space. The main contribution of this work is the comparison between natural interpolations of articulations and different interpolations in the latent space. Furthermore, we investigate how a metric loss improves the model and how a discriminator helps expand the latent space around observations.
We use a Variational Autoencoder (VAE) and an adversarial discriminator as our model.
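Independent of the concrete layer configuration, the two VAE-specific ingredients are the reparameterization trick and the KL term of the loss. A minimal NumPy sketch of both (an illustration, not the thesis code; the 128-dimensional latent size is taken from the UMAP experiment described further below):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).

    Sampling this way keeps z differentiable with respect to mu and
    log_var, which is what lets the encoder be trained end-to-end.
    """
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * rng.standard_normal(mu.shape)

def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.zeros((4, 128))       # a batch of 4 codes in a 128-dim latent space
log_var = np.zeros((4, 128))  # log-variance 0 -> unit variance
z = reparameterize(mu, log_var, rng)
print(z.shape)                     # (4, 128)
print(kl_divergence(mu, log_var))  # all zeros: posterior equals the prior
```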
The architecture used for the Variational Autoencoder is shown in the following figure.
We use data sets created with tools from this repository.
This data set consists of 10,000 samples, each containing a single articulation of two cuboids, while only the parameter Phi is varied as shown below. Hence, the articulation is rotated horizontally without any other changes throughout the whole data set.
This data set consists of 500,000 samples displaying between two and four cuboids. To create this diversified data set we vary the articulation parameters Phi, Theta, and Lambda, allowing different angles between the cuboids but not different cuboid scales. To introduce even more complexity, we vary not only the articulation parameters but also include the appearance and lighting parameters in the parameter space. This allows different colors, directional lights, up to four spotlights, and up to four point lights with different settings to be chosen randomly in every image, creating the following examples.
- **updating both networks**: update the discriminator at every training step
- **inferior network**: update only the currently inferior network, comparing the discriminator output on the "false" image for the VAE loss and the discriminator loss
- **probabilistic inferior network**: the inferior-network method, but with randomness introduced into the decision
- **accuracy threshold**: calculate the accuracy of the discriminator predictions and update only below a given accuracy threshold
- **reducing learning rate**: improve the model by reducing the learning rate before the end of training
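As an illustration of the accuracy-threshold criterion, a minimal NumPy sketch (the threshold value of 0.8 and the sign convention for the logits are assumptions, not settings from the thesis):

```python
import numpy as np

def should_update_discriminator(real_logits, fake_logits, threshold=0.8):
    """Accuracy-threshold criterion for the discriminator update.

    The discriminator is only trained while its batch accuracy is below
    the threshold, so it cannot run too far ahead of the VAE.
    Convention: real -> logit > 0, fake -> logit < 0.
    """
    correct = np.sum(real_logits > 0) + np.sum(fake_logits < 0)
    accuracy = correct / (len(real_logits) + len(fake_logits))
    return accuracy < threshold

# discriminator is already confident on the whole batch: skip its update
print(should_update_discriminator(np.ones(8), -np.ones(8)))  # False
# discriminator misclassifies all fakes: keep training it
print(should_update_discriminator(np.ones(8), np.ones(8)))   # True
```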
The following table shows the FID scores on the Varied data set for networks trained with different methods of updating the discriminator during training. Lower is better.
reducing learning rate | inferior network | probabilistic inferior network | both networks | accuracy threshold |
---|---|---|---|---|
without | 50.53 | 36.39 | 73.51 | 39.23 |
with | 52.57 | 29.14 | 53.55 | 28.09 |
We use Principal Component Analysis (PCA) on the latent representations of the Phi validation data set to sample from the principal component (PC) with the highest eigenvalue.
These latent samples create the following images.
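The sampling along the top principal component can be sketched as follows (a NumPy illustration, not the thesis code; the spacing over ±2 standard deviations and the sample count are assumptions):

```python
import numpy as np

def pca_axis_samples(latents, n_samples=8, scale=2.0):
    """Sample latent codes along the PC with the largest eigenvalue.

    `latents` is an (N, D) matrix of encoded validation images. The
    returned codes lie on the line mean + t * pc1, with t spaced over
    +-scale standard deviations of the data along that direction.
    """
    mean = latents.mean(axis=0)
    centered = latents - mean
    # SVD of the centered data: rows of vt are the principal directions
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]
    std1 = s[0] / np.sqrt(len(latents) - 1)  # std. dev. along pc1
    ts = np.linspace(-scale, scale, n_samples)
    return mean + np.outer(ts * std1, pc1)   # shape (n_samples, D)

rng = np.random.default_rng(0)
latents = rng.standard_normal((1000, 128))   # stand-in for encoder outputs
samples = pca_axis_samples(latents)
print(samples.shape)  # (8, 128) -- decode these to obtain the image sequence
```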
Furthermore, we use a Uniform Manifold Approximation and Projection (UMAP) dimension reduction to visualize the 128-dimensional latent space in a two-dimensional plot. In the following figure we compare the PC latent sampling with the representations of the images from the data set.
By adding a metric triplet loss, we hoped for a better embedding of the parameter Phi in the latent space. Redoing the experiments from before results in the following figures.
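A minimal sketch of such a metric triplet loss on latent codes (NumPy; the margin of 1.0 and the Euclidean distance are assumptions, not necessarily the thesis settings):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Metric triplet loss on batches of latent codes.

    Pulls codes of similar articulations (positive) towards the anchor
    and pushes dissimilar ones (negative) at least `margin` further away.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

a = np.array([[0.0, 0.0]])
p = np.array([[0.1, 0.0]])   # nearby articulation -> small distance
n = np.array([[3.0, 0.0]])   # distant articulation
print(triplet_loss(a, p, n))  # [0.] : the margin is already satisfied
```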
We use an image sequence with a natural interpolation of the articulation parameter to create the input images in the following figure. The reconstruction from the VAE model is displayed as the output.
We use the latent representations of the first and last image to interpolate between them in the latent space, applying a Euclidean interpolation and a spherical interpolation to create the following images.
We again use a UMAP dimension reduction to compare the two interpolations.
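The two interpolation schemes can be sketched as follows (NumPy; `lerp` is the Euclidean variant, `slerp` the spherical one):

```python
import numpy as np

def lerp(z0, z1, t):
    """Euclidean (linear) interpolation between two latent codes."""
    return (1 - t) * z0 + t * z1

def slerp(z0, z1, t):
    """Spherical interpolation: follows the great circle between the
    codes, which better matches a Gaussian prior whose mass
    concentrates on a shell of radius sqrt(D)."""
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return lerp(z0, z1, t)  # (near-)parallel codes: fall back to lerp
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

z0 = np.array([1.0, 0.0])
z1 = np.array([0.0, 1.0])
print(lerp(z0, z1, 0.5))   # [0.5 0.5] -- shorter norm than the endpoints
print(slerp(z0, z1, 0.5))  # stays on the unit circle
```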
Updating the Discriminator
- accuracy threshold as the criterion for updating the discriminator
- updating the inferior network with additional randomness

PCA sampling in the latent space
- a linear PC cannot represent an articulation parameter
- the metric loss does not improve the correlation

Interpolation in the latent space
- linear interpolation is better than spherical interpolation
- the metric loss decreases the correlation