Variational Autoencoders

Intuitively Explained: Variational Autoencoders

To paraphrase the brilliant physicist Richard Feynman, "If you can't explain something to a first-year student, you do not truly understand it." Though not as ambitious a goal as Feynman's, this series of posts aims to explain slightly advanced topics using the least formalism possible, usually just basic linear algebra and probability. The goal is to provide substantial intuition made as accessible as possible; it's important to note we're losing some precision this way! These posts will often be accompanied by a partner post on the same topic with code, more mathematical rigor, or both.

Part 1: Motivation and Introduction

What does it mean to make an autoencoder variational, and why could that be useful?

In the previous post, we discussed the general autoencoder framework for unsupervised learning. (This post assumes you have read that one or are otherwise familiar with the material.) To illustrate the idea, we stepped through the most basic type of autoencoder -- the undercomplete autoencoder -- and used it for a simple dimension reduction task. In this post, we'll now see how the same autoencoder framework can be leveraged towards learning a more interesting representation and in fact the creation of a generative model of our data.

More concretely, we'll be discussing what are called variational autoencoders. At a high level, the idea behind them is straightforward. With the undercomplete autoencoder, we were trying to learn a function from our data to a lower-dimensional representation (the code) that we could apply our deterministic decoder to and obtain something close to our input. The variational autoencoder extends this idea by considering our learned representation probabilistically. That is, rather than having our encoder output a single vector as our learned representation, we will have it output the parameters of a distribution of latent vectors.

VAE EncoderAn illustration of our encoder. Our input data $\boldsymbol x$ is fed into the encoder function $f$ to create the code; in this case, the code is actually a stack of two vectors, one representing the mean and one representing the variance of the parametrized distributions.

The variational autoencoder as described by Kingma et al. 1 assigns the distributions of each latent variable to be Gaussian. (In fact, our restrictions end up being even more specific, as we will see later.) Considering this case can help us visualize why a probabilistic parametrization allows for richer encoding of information than our deterministic case did.

In the deterministic case, we aimed to have our encoder map our input down to a point in our $d$-dimensional latent space; that is, a vector in $\mathbb{R}^d$.

MNIST EncodingTraditional encoding. The input data $\boldsymbol x$ is taken from the original, high-dimensional space to a single point $f(\boldsymbol{x})$ in the lower-dimensional space. This example illustrates an MNIST digit in 784 dimensions being embedded in the 2D plane.

In the probabilistic interpretation, we're instead trying to have our encoder come up with two things: a mean and a variance, both also $d$-dimensional vectors, defining the particular Gaussian that our latent variables are defined by.

MNIST Distr Encoding

Encoding for a variational autoencoder. In this case, the input data $\boldsymbol x$ is taken from the original, high-dimensional space to a distribution parametrized by $f(\boldsymbol{x}) = \boldsymbol \mu, \boldsymbol \sigma^2$ in the lower-dimensional space. $\boldsymbol \mu$ corresponds to the point at the center of the 'fuzzy ball,' while the entries of $\boldsymbol \sigma^2$ correspond to the width of the ball along the corresponding dimensions 2. This example illustrates an MNIST digit in 784 dimensions being embedded in the 2D plane.

The key introduction here is the variance. This essentially encodes our uncertainty about our point estimate. In fact, our deterministic estimation from before can be considered a particular instantiation of our new probabilistic estimation (as we have formulated it thus far!) in which the variance is fixed to be zero (in other words, where we aim to parametrize a Dirac delta function rather than a Gaussian). Our point estimate is analogous to the estimated mean of the distribution we're fitting. The deterministic approach forces us to define our encoder function as if we are absolutely sure an input should go to the point it maps to, while the probabilistic one allows us more 'wiggle room.'

Part 2: It can't be that easy, right?

2.1 Latent Representations via Sampling

By estimating a distribution rather than a point, we (unsurprisingly) introduce new issues. First of all, how is our decoder supposed to map a distribution to a reconstruction of the input? With a point, this was (conceptually) easy; we just had to jointly approximate a function from the code to the reconstruction along with the encoder. If we were to use the expected value of our distribution, we would be doing the same thing as before and our introduction of the variance would be pointless (no pun intended).

We can achieve the same ease of training by sampling from our distribution and expecting reasonable reconstructions from our samples. These samples give us particular instantiations of the latent variables parametrized by our mean and variance vectors, so this is a reasonable expectation to have.

Sampling and Decoding

Decoding for a variational autoencoder. Given a distribution parametrized by $f (\boldsymbol x)$ from the encoder, our decoder draws a sample $\boldsymbol s$ from $f (\boldsymbol x)$ and reconstructs $\boldsymbol x$ as best it can. This example illustrates a point being sampled from a distribution in 2D-space and decoded to 784-dimensional space as an MNIST image.

This also provides us with the rationale for using the variational autoencoder as a generative model; if we can learn a function that encodes distributions over our latent variables and a function that decodes samples, we can take new samples from our latent distributions and pass them through the decoder to generate novel, input-like data.

2.2 The Reparametrization Trick

This gives us vectors to make a decoder with... but now we have a new issue. Our traditional neural network training relies on the backpropagation algorithm, which in turn requires that derivatives be computable at every step along the way.

Let's think a little harder before throwing away our nice machinery. We know each of the variables we're trying to decode follows a Gaussian distribution. Much like a line can be expressed using a slope and an intercept (e.g. $y = ax+b$ from middle school algebra), arbitrarily distributed Gaussians can be expressed as a 'stretching factor' and a 'shift' (not real terms). That is, it is trivial to express any random variable (such as the entries of our code) $z = N(\mu, \sigma^2)$ as $z = \sigma \epsilon + \mu$, where $\epsilon = N(0,1)$. Here, $\sigma$ is our 'stretching factor' and $\mu$ is our shift.

In doing this, we separate out our (known) source of stochasticity. This can be considered a new external input in addition to the initial input, but crucially, one we don't care about the derivative of. Using this form for our latent variables, we can now take the derivatives we wanted for our mean and variance, as our new parametrization is deterministic in those variables.

This is called the reparametrization trick and can be applied anywhere we want to backpropagate through sampling operations for continuous (and, to be a little more pedantic, differentiable 'almost-everywhere') random variables -- the technique is useful for much more than its application in variational autoencoders! Here, though, we will be using it as a silver bullet that allows us to continue with stochastic gradient-based training methods that we would normally use when training a neural network.

2.3 Regularization

Now we have a way to train our encoder and decoder jointly to produce distributions describing our latents, just like we said we wanted earlier! So now we have to be done, right?

Not quite. Catching the last complication requires a little more critical thinking; we could now train a network as described above perfectly well given our fixes from the previous two sections, but it would be perfectly useless (as a generative model, at least).

There are a couple of things to note here. First, we haven't rewarded our network at all for using its dimensions effectively; that is, keeping the random variables for each dimension independent, which would be nice for interpretability and encouraging it to use the constraints we've given it to their fullest. In addition, we realize that if our network is smart enough, it could cheat our distribution-learning game very easily. Given sufficient parameters to manipulate in conjunction with each other, it could take every input data point and shove it to a very different part of the $d$-dimensional space. In other words, it would give them very different means and variances so small as to be near-zero. In this case, it could do a good job reconstructing our specific inputs -- just have the decoder map anything in the general vicinity of where an input went to exactly that input -- but this would be memorizing our input data. If we had a new data point that was not exactly like the ones we had seen already, our network wouldn't know what to do with it other than collapse it to something exactly like a different input it had seen, when what we actually want is some smooth interpolation to make something truly new.

Far apart points... zero variance... huh? In case this sounds sort of familiar, congrats! We've jumped through all these hoops only to end up with the same result that an unregularized, deterministic autoencoder would have given us.

But hope is not all lost. If we can find a way to discourage our network from doing exactly these things -- that is, penalize it for pushing means far apart, shrinking variances to near-zero, and make latents dependent on one another -- we could be back in business. Fortunately, probability theory gives us a simple way to do this.

Recall that we parametrized our learned distributions over latents as Gaussians. Rather than allowing our network to choose any Gaussians it might want (that could lead us to the fate described above), we want it to choose Gaussians while keeping in mind the three properties we want. This can be enforced by penalizing the loss the further from independent unit Gaussians our distributions become. Doing so would essentially let us tell the network to keep means close to zero (and thus close to each other), variances close to one (and thus not as low as zero), and to keep the latents independent. This may seem like a crazy level of restriction, but I'd like to let three things convince you this is at least somewhat reasonable. One, neural networks are successful at approximating many functions, and given a sufficiently complex transformation we can make unit Gaussians into whatever we want. Two, this is a suggestion we're giving to the network via a penalty, not a constraint we're forcibly imposing on it to make all of them exactly unit Gaussians. Three, it empirically works. (Pretty convincing if not satisfying!)

We can impose this penalty by subtracting off some measure of how different the learned distributions are from independent unit Gaussians from our original loss. Information theory gives us such a construct in what's called the (reverse) KL-divergence. Our new loss -- the difference of the original and the new KL-divergence term -- is referred to as the variational lower-bound.

Finally, we can minimize this new loss via our same stochastic gradient-based methods in order to get a powerful model for unsupervised learning of latent distributions and a great generative model at the same time!

Part 3: Sampling

As briefly touched on in the previous section, one of the key uses of variational autoencoders is that they can act as generative models, giving us novel (but still hopefully realistic) input-like data. As we tried to make our parametrized distributions close to unit Gaussians, we should expect reasonable-looking outputs if we try to sample unit Gaussian noise and decode it. This allows to generate as many as we want, as it is straightforward to sample such noise.

I trained a simple variational autoencoder with 32 code dimensions, following the description throughout this post; the code for this, with some accompanying implementation explanation, can be found here.

Following this very simple sampling procedure, we obtain outputs such as

Despite the very low dimensionality of the code compared to the input, this is very clearly a 3!

Part 3: Conclusion

In this post, we learned not only what a variational autoencoder is and what it does, but at a high level how it manages to accomplish this. Starting from the notion of a basic, deterministic autoencoder we built our way up and found workarounds for three idea-killing issues -- sampling for inputs to the decoder, the reparametrization technique for stochastic backpropagation, and regularization with the KL-divergence. And we got some idea how it all worked without invoking the terms 'variational Bayes' or even 'posterior' once!

Now that we understand the VAE conceptually, we are well set-up to begin exploring the many interesting extensions that have come of it since its original conception. Look forward to future posts pushing past this intro!


  1. See
  2. Of course, not every point in the ball is equally likely as we are parametrizing a Gaussian (those closer to the mean have higher probability); furthermore, this representation ignores the fact that the distribution returns some probability density everywhere in the space, not just inside the nice little ball we've drawn. But the visual captures the intuition -- we hope most of the points in the space have such low probability that they're essentially zero.
In [ ]:

Autoencoder Fundamentals

An Introduction to Autoencoders

This is one writeup in what I'm hoping to make a series of posts on representation learning and unsupervised methods in general. I've noticed that there are far fewer resources out there detailing these topics than there are for common supervised learning topics, and next-to-none that show them off in practice (i.e. with code) along with the underlying math. I'd like these posts to be accessible to a wider audience while still providing mathematical intuition.

Part 1: Motivation and Introduction

What is an autoencoder and why should I care?

The core concept underlying all autoencoders is deceptively simple. Let's break it down from the name instead of memorizing it as just another term. The word 'autoencoder' is made up of 'auto,' meaning self, and 'encoder,' meaning something that transforms information in one form (or code) to another. An autoencoder is a neural network that takes in data, transforms it, and tries to output the same data that it was given.

Cool. We now know what an autoencoder is. Article over, right? Well, not quite.

Our definition as it currently stands makes sense, but seems rather... useless. Why go through the trouble of training a neural network to get an imperfect approximation of the identity function when we could just return our input data? The answer is that we don't really care about the output of our network itself; in fact, when using autoencoders, we never even expect it to perfectly output the data we gave it as input. What's of importance to us is the transformation step. All autoencoders possess some transformed intermediate state of the data between the input and output, called the code. The part of the autoencoder that outputs the code is unsurprisingly referred to as the encoder, while the remainder (which tries to reconstruct the input from the code) is perhaps even more unsurprisingly dubbed the decoder.

titleAn illustration of an autoencoder. Our input data $\boldsymbol x$ is fed into the encoder function $f$ to create the code; the code is in turn fed into the decoder $\boldsymbol g$ to obtain the output, which we will compare to our input. (In particular, this is visually closest to an undercomplete autoencoder, described in the next section.

As things currently stand, however, this 'code' doesn't seem particularly useful either. One can easily imagine a neural network with enough capacity being perfectly happy to create some arbitrary approximately invertible function as the encoder and its inverse as the decoder, changing both in tandem to get the identity function as their composite. But the key to (useful) autoencoders is that we aren't going to let our network learn the identity function alone; we're instead going to bully it into learning a code with properties we'd like to have in a(n approximately invertible) transformation of our data at the expense of being able to perfectly reconstruct the input.

Thus, autoencoders are an extremely powerful framework for representation learning. Once we come up with properties we'd like to have in a representation of our data, we can try and constrain our network to learn a code with those properties.

Part 2: Undercomplete Autoencoders

Now that we have this framework in mind, let's think of properties we might like a representation to have. Oftentimes, the data we are working with is high-dimensional in nature. In order to avoid the curse of dimensionality or to extract semantically meaningful features, we might want some low-dimensional representation of our high-dimensional data -- hence the existence of a whole host of dimensionality reduction techniques in the literature, most well-known being by way of principal components analysis (PCA). (This was actually the original motivation behind the autoencoder, and can be considered the most 'vanilla' flavor of them.)

Thus, we bottleneck the number of dimensions in our network at the code to be lower than the number of dimensions in the input data. We then structure our decoder as a function from the lower number of dimensions to the original number to bring it back up.

Let's see what that looks like in code instead of words! For our first autoencoder, we'll use simple linear transformations for encoding and decoding so we can reason about what's going on intuitively, but keep in mind the main power behind autoencoders comes from their use of neural networks (i.e. incredibly versatile function approximators) for both steps.

In [1]:
# imports
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

torch.manual_seed(42); # set seed for consistent results

We first must define how many dimensions we want the code to be. This is highly dependent on the number of dimensions in the input -- there is no magic number you can always pick that will give you the results you want! (It also depends on what you plan on using the code for; but these are considerations for downstream tasks.)

In [2]:
input_dimensions = 784
code_dimensions = 32

We're going to define classes for the encoder and decoder networks. This might seem redundant with our small network now, but is good practice for making encoders and decoders with more complicated architectures than a single linear layer.

In [3]:
class Encoder(nn.Module):
    Encoder class for linear autoencoder. Takes in data of dimension `input_dimensions`, 
    conducts a linear transformation, and outputs the resulting code of dimension `code_dimensions`.
    def __init__(self, input_dimensions: int, code_dimensions: int):
        super(Encoder, self).__init__()
        self.fc1 = nn.Linear(input_dimensions, code_dimensions, bias=False)
    def forward(self, x: torch.Tensor):
        return self.fc1(x)
In [4]:
class Decoder(nn.Module):
    Decoder class for linear autoencoder. Takes in data of dimension `code_dimensions`, 
    conducts a linear transformation, and outputs the resulting output of dimension `input_dimensions`.
    def __init__(self, code_dimensions: int, input_dimensions: int):
        super(Decoder, self).__init__()
        self.fc1 = nn.Linear(code_dimensions, input_dimensions, bias=False)
    def forward(self, x: torch.Tensor):
        return self.fc1(x)

Our full autoencoder is then the encoder followed by the decoder.

In [5]:
class LinearAutoencoder(nn.Module):
    Class for linear autoencoder. Composite of Encoder and Decoder classes above.
    def __init__(self, input_dimensions: int, code_dimensions: int):
        super(LinearAutoencoder, self).__init__()
        self.encode = Encoder(input_dimensions, code_dimensions)
        self.decode = Decoder(code_dimensions, input_dimensions)
    def forward(self, x: torch.Tensor):
        x = self.encode(x) # from input_dimensions->code_dimensions...
        x = self.decode(x) # and back from code_dimensions->input_dimensions
        return x

Now we have some general setup to do before we can get training, but this is somewhat usecase-specific and not that important to understanding autoencoders conceptually.

In [6]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
In [7]:
# some hyperparameters
batch_sz = 32
test_batch_sz = 100

learning_rate = 10**(-1.5)
momentum = 0.5

n_epochs = 20
In [8]:
# set dataloader kwargs
kwargs = {'num_workers': 1, 'pin_memory':True} if torch.cuda.is_available() else {}
In [9]:
# preprocessing
img_transform = transforms.Compose([
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
In [10]:
# create dataloaders to serve up MNIST images as example data -- see MNIST example in the official documentation 
train_loader =
    datasets.MNIST('../data', train=True, download=True, transform= img_transform),
In [11]:
test_loader =
    datasets.MNIST('../data', train=False, download=True, transform=img_transform),

We'll set up our model and use stochastic gradient descent with momentum as our optimizer (simplicity is always best when complexity isn't necessary!).

In [12]:
model = LinearAutoencoder(input_dimensions, code_dimensions).to(device)
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)
In [13]:
def train(model: nn.Module, optimizer: torch.optim.Optimizer, device: torch.device, 
          train_loader:, epoch: int):
    Executes one epoch of training given a model, optimizer, device, dataloader, and epoch number.
    for batch_idx, (inp, label) in enumerate(train_loader):
        inp = inp.view(-1, input_dimensions)
        inp = 
        output = model(inp) # here we put the input image through the autoencoder
        loss = F.mse_loss(output, inp) # then compare output to input via mean-square-error
        # good to check that our values are still sane and nothing has gone horribly wrong
        if torch.isnan(loss).any() or loss.item() == np.inf or loss.item() == -np.inf:
            print('Gradient vanished/exploded')
            raise Exception('Died')
        loss.backward() # compute gradients
        optimizer.step() # conduct step of gradient descent
        optimizer.zero_grad() # zero out gradients to avoid messing with future iterations
        if batch_idx % 1000 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(inp), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
In [14]:
def test(model: nn.Module, device: torch.device, test_loader:, epoch: int):
    Computes error on test set given a model, device, dataloader, and epoch number.
    test_loss = 0
    with torch.no_grad():
        for batch_idx, (inp, label) in enumerate(test_loader):
            inp = inp.view(-1, input_dimensions)
            inp =
            output = model(inp)
            test_loss += F.mse_loss(output, inp)
    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}\n'.format(test_loss))

All that's left is to train our model and see the results!

In [15]:
for epoch in range(1, n_epochs+1):
    train(model, optimizer, device, train_loader, epoch)
    test(model, device, test_loader, epoch)
Train Epoch: 1 [0/60000 (0%)]	Loss: 1.000938
Train Epoch: 1 [32000/60000 (53%)]	Loss: 0.257497

Test set: Average loss: 0.0025

Train Epoch: 2 [0/60000 (0%)]	Loss: 0.244470
Train Epoch: 2 [32000/60000 (53%)]	Loss: 0.238129

Test set: Average loss: 0.0024

Train Epoch: 3 [0/60000 (0%)]	Loss: 0.222586
Train Epoch: 3 [32000/60000 (53%)]	Loss: 0.237073

Test set: Average loss: 0.0021

Train Epoch: 4 [0/60000 (0%)]	Loss: 0.208160
Train Epoch: 4 [32000/60000 (53%)]	Loss: 0.203348

Test set: Average loss: 0.0019

Train Epoch: 5 [0/60000 (0%)]	Loss: 0.197845
Train Epoch: 5 [32000/60000 (53%)]	Loss: 0.175188

Test set: Average loss: 0.0017

Train Epoch: 6 [0/60000 (0%)]	Loss: 0.181806
Train Epoch: 6 [32000/60000 (53%)]	Loss: 0.155630

Test set: Average loss: 0.0016

Train Epoch: 7 [0/60000 (0%)]	Loss: 0.153001
Train Epoch: 7 [32000/60000 (53%)]	Loss: 0.155533

Test set: Average loss: 0.0015

Train Epoch: 8 [0/60000 (0%)]	Loss: 0.154720
Train Epoch: 8 [32000/60000 (53%)]	Loss: 0.156239

Test set: Average loss: 0.0014

Train Epoch: 9 [0/60000 (0%)]	Loss: 0.143948
Train Epoch: 9 [32000/60000 (53%)]	Loss: 0.147039

Test set: Average loss: 0.0014

Train Epoch: 10 [0/60000 (0%)]	Loss: 0.139918
Train Epoch: 10 [32000/60000 (53%)]	Loss: 0.135999

Test set: Average loss: 0.0013

Train Epoch: 11 [0/60000 (0%)]	Loss: 0.141453
Train Epoch: 11 [32000/60000 (53%)]	Loss: 0.126618

Test set: Average loss: 0.0013

Train Epoch: 12 [0/60000 (0%)]	Loss: 0.126350
Train Epoch: 12 [32000/60000 (53%)]	Loss: 0.123183

Test set: Average loss: 0.0012

Train Epoch: 13 [0/60000 (0%)]	Loss: 0.118564
Train Epoch: 13 [32000/60000 (53%)]	Loss: 0.123543

Test set: Average loss: 0.0012

Train Epoch: 14 [0/60000 (0%)]	Loss: 0.122253
Train Epoch: 14 [32000/60000 (53%)]	Loss: 0.125982

Test set: Average loss: 0.0012

Train Epoch: 15 [0/60000 (0%)]	Loss: 0.118890
Train Epoch: 15 [32000/60000 (53%)]	Loss: 0.116815

Test set: Average loss: 0.0011

Train Epoch: 16 [0/60000 (0%)]	Loss: 0.119878
Train Epoch: 16 [32000/60000 (53%)]	Loss: 0.110570

Test set: Average loss: 0.0011

Train Epoch: 17 [0/60000 (0%)]	Loss: 0.111909
Train Epoch: 17 [32000/60000 (53%)]	Loss: 0.108714

Test set: Average loss: 0.0011

Train Epoch: 18 [0/60000 (0%)]	Loss: 0.112195
Train Epoch: 18 [32000/60000 (53%)]	Loss: 0.105414

Test set: Average loss: 0.0011

Train Epoch: 19 [0/60000 (0%)]	Loss: 0.116659
Train Epoch: 19 [32000/60000 (53%)]	Loss: 0.110422

Test set: Average loss: 0.0010

Train Epoch: 20 [0/60000 (0%)]	Loss: 0.109862
Train Epoch: 20 [32000/60000 (53%)]	Loss: 0.086314

Test set: Average loss: 0.0010

We ended up with a test MSE of 0.001. Is this good? Well, as far as the quality of the reconstruction goes, it seems pretty decent in some handwavey sense considering our values range from 0 to 1. But humans are visual creatures: best way to see how it's doing is to actually see it!

In [16]:
# function to un-normalize the output of the network to make it look like the original images
def to_img(x: torch.Tensor):
    x = 0.5 * (x + 1)
    x = x.clamp(0, 1)
    x = x.view(x.size(0), 1, 28, 28)
    return x
In [17]:
# get an example batch and push it through the network
with torch.no_grad():
    x = next(iter(test_loader))[0].to(device)
    x = x.view(-1,input_dimensions)
    y = model(x)
    y = to_img(y)
plt.imshow((model.encode.fc1.weight.t() @ model.decode.fc1.weight.t()).detach().cpu().numpy().T, cmap='Greys_r')plt.imshow(model.encode.fc1.weight.detach().cpu().numpy() @ model.decode.fc1.weight.detach().cpu().numpy(), cmap='Greys_r')

Displaying our example and its corresponding output,

In [18]:
plt.imshow(x[0].view(28,28).cpu().numpy(), cmap='Greys_r')
<matplotlib.image.AxesImage at 0x7fcf51695da0>
In [19]:
plt.imshow(y[0].view(28,28).cpu().numpy(), cmap='Greys_r')
<matplotlib.image.AxesImage at 0x7fcf517486a0>

with the (uninterpretable) 32-dimensional code as

In [20]:
array([ 1.2461367 ,  0.7578721 , -0.80333775,  0.17708406,  2.0784078 ,
       -1.2867155 ,  0.459029  , -2.3941464 ,  0.10984115,  0.59566885,
       -1.8090788 , -2.0377617 , -2.190893  ,  0.2924345 , -0.20564324,
        0.10396053, -3.552227  , -1.2485214 , -0.8760196 , -1.0671576 ,
       -3.8653336 , -0.44916114,  1.0365984 ,  1.1902565 , -0.1348641 ,
        0.2300794 ,  0.6650476 , -1.5262647 ,  3.0209737 ,  0.59753466,
        0.12191479, -3.11799   ], dtype=float32)

We can see that the output still is very clearly a 1, but is not quite the same as what we put in. This should make sense to us as we did essentially did some lossy compression by a factor of 24.5x -- definitely non-negligible! The autoencoder learned what combinations of the input dimensions it could make in 32 dimensions to best squeeze as much information in as possible to be able to reconstruct the input.

Part 3: Conclusion

In this post, we took a look at the inner workings of the simplest variety of autoencoder, the undercomplete autoencoder; that is, a neural network that maps its input to itself with an intermediate layer of fewer dimensions. But this is just the beginning! More importantly, we introduced the autoencoder as a framework for representation learning given how we apply constraints. In the future, we'll see how this can be used for things other than dimension reduction.


P.S. In this case, we only allowed its combinations to be linear by construction of our encoder and decoder. As a fun note, this should feel very familiar to the linear algebra types out there: what we have is a special case of autoencoder construction that's equivalent to principal components analysis (PCA). Given our data is mean-centered here, this just means our network is learning to project into the subspace of the eigenspace spanned by the first $32$ eigenvectors (ordered by 'importance'/magnitude of eigenvalue). An brief, handwavey outline of why this is the case: PCA can be thought of as trying finding the subspace we can project into that minimizes reconstruction error of how far our projections are from the original points; this is what a linear autoencoder is also doing.

In [ ]:

Independent Component Analysis

A Code-Centric Introduction to Independent Component Analysis (ICA)

This is the first in what I'm hoping to make a series of posts on representation learning and unsupervised methods in general. I've noticed that there are far fewer resources out there detailing these topics than there are for common supervised learning topics, and next-to-none that show them off in practice (i.e. with code) along with the underlying math. I'd like these posts to be accessible to a wider audience while still providing mathematical intuition.

Part 1: Motivation and Introduction

What is ICA and why would we want to do it?

Suppose you are at a banquet with $n$ total attendees, all simultaneously engaged in conversation. Should you stand in the middle of this crowd, you will be able to pick out individual voices to tune in and out of at will; however, any microphone positioned in the banquet hall will record an incomprehensible cacaphony, all $n$ voices jumbled together based on their distance from the device. Say you would like to be able to listen to the crowd of voices on a per-speaker basis. With only the one recording, you might1 be out of luck. If you have recordings from $n$ microphones each placed at different positions rather than one, how can we recover the individual voice signals from every attendee?

What better way to illustrate this than to listen to some recordings! (The individual source voice signals were created by Google Translate Text-to-Speech and then mixed by me.)

In [1]:
import numpy as np
from numpy import linalg
import matplotlib.pyplot as plt
from scipy import signal
from import wavfile
from typing import Tuple
import os
import glob
from IPython.display import Audio, display
In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
In [3]:
%matplotlib inline
In [4]:
# create convenience function for plotting and playing audio
def show_audio(a: Tuple[int, np.ndarray])->None: # a: (sample_rate, audio_array)
    fig, ax = plt.subplots()
    time_axis = np.linspace(start=0, stop=(len(a[1])/a[0]),num=np.round(len(a[1])))
    ax.plot(time_axis, a[1])
    ax.set_xlabel('Time (seconds)')
    display(Audio(a[1], rate=a[0]))
In [5]:
# collect all the wav files
files = glob.glob('./data/mixed_data/*.wav')
In [6]:
samp_rates = []
sound_list = []
In [7]:
# collect sampling frequencies and audio signals
for f in files:
    samp_rate, sound =
In [8]:
# store as numpy array
audio_array = np.array(sound_list)
In [9]:
# listen and visualize sound waves as sanity check
for a in zip(samp_rates, sound_list):