📝 @Arushi Somani May 19, 2024 1:29 PM (PDT)
Auto-encoders are one way to model images, and they are a typical default for adding an image modality to large language models. This post builds up to generating images one patch at a time, so that image generation can be combined with language model training.
Code is heavily sourced from the references, in particular https://github.com/google-deepmind/sonnet — big thanks to them! WIP Repo: ‣
The core idea behind an auto-encoder is to compress the input down to some hidden dimension and then reconstruct it in the original dimension, which hopefully squeezes out unnecessary information in the input.
The reconstruction loss for the model is MSE loss between the input and its reconstruction.
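Here is a minimal sketch of that setup in PyTorch (the post's actual code builds on Sonnet; the layer sizes and hidden dimension below are illustrative assumptions, not the exact configuration used):

```python
# Minimal fully-connected auto-encoder sketch for flattened 28x28 MNIST images.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=32):
        super().__init__()
        # Encoder: compress the input down to the hidden dimension.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, hidden_dim),
        )
        # Decoder: reconstruct back to the original input dimension.
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)       # (batch, hidden_dim) bottleneck
        return self.decoder(z)    # (batch, input_dim) reconstruction

model = AutoEncoder()
x = torch.rand(16, 784)           # stand-in for a batch of flattened MNIST images
loss = nn.functional.mse_loss(model(x), x)  # reconstruction loss is plain MSE
loss.backward()
```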
Training on MNIST, after 30 epochs, the results look like this:
Original Image
Reconstructed image
The seven looks like a seven! But sampling from the decoder on its own is near impossible: feed it an arbitrary latent vector and you get noise, because nothing constrains the structure of the latent space. This needs fixing. We would certainly like better reconstructions, and we would also like to be able to sample from the decoder without an input.
The distinction between a variational auto-encoder and a plain auto-encoder is that instead of mapping the input directly to a hidden vector, the encoder maps it to the mean and standard deviation of a Gaussian distribution, from which we sample. This sample is given to the decoder, which decodes it back to the input dimension.
What this means is that we are learning an underlying distribution to sample from, such that we can re-create the input from the sample.
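Concretely, the encoder now outputs two vectors per input, a mean and a log-variance, and the latent code fed to the decoder is drawn with the reparameterization trick so gradients can flow through the sampling step. A minimal sketch, with names and sizes assumed for illustration:

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Maps a flattened image to the mean and log-variance of a Gaussian latent."""
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        h = self.body(x)
        return self.to_mu(h), self.to_logvar(h)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps the sampling step differentiable.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

encoder = VAEEncoder()
x = torch.rand(16, 784)
mu, logvar = encoder(x)
z = reparameterize(mu, logvar)   # this sample is what the decoder sees
```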
The model comes with its own loss function: binary cross entropy serves as the reconstruction loss, and the KL divergence between the learned distribution and the standard normal acts as an additional regularizer.
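In code, the two terms look roughly like this, assuming the decoder outputs per-pixel probabilities in [0, 1] and using the closed-form KL between a diagonal Gaussian and the standard normal:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction term: binary cross entropy, treating each pixel as a Bernoulli.
    bce = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # Regularization term: KL( N(mu, sigma^2) || N(0, I) ) in closed form.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kl
```

The KL term is what pulls the encoder's distribution toward the standard normal, and that is exactly what makes sampling from the decoder possible later.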
And the same results… look something like this!
Original Image
Reconstructed image
Note how much better this reconstruction is compared to the naive auto-encoder! We can also feed random noise into the decoder and generate plausible digits!
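As a sketch of that sampling step, assuming a decoder shaped like the one above that maps a 16-dimensional latent back to 784 pixels:

```python
import torch
import torch.nn as nn

# Assumed decoder, mirroring the encoder sketch above (a trained one in practice).
decoder = nn.Sequential(
    nn.Linear(16, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)

# For a trained VAE, q(z|x) has been pulled toward N(0, I), so decoding
# standard-normal noise yields plausible digit-like images.
with torch.no_grad():
    z = torch.randn(8, 16)                 # 8 random latent codes
    samples = decoder(z).view(8, 28, 28)   # 8 generated 28x28 images
```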