Motivation

From Bayes' theorem we have: $ \displaystyle p_\theta(z\mid x) = \frac{p_\theta(x\mid z)~p_\theta(z)}{p_\theta(x)} $

The importance of this probabilistic formula cannot be overstated, especially for the probabilistic models at the basis of deep learning, generative models in particular. However, when a model relies on this formulation, issues crop up regarding its analytical usability.
The main issue with this formula is that, for most high-dimensional problems, the evidence $ p_\theta(x) $ is intractable.
To give intuition as to why we cannot calculate $p_\theta(x)$, let’s look at the continuous case:
$$ \displaystyle p_\theta(x) = \int p_\theta(x\mid z)\,p_\theta(z)~dz $$ Here, if our $z$ is high dimensional, we get a high-dimensional integral that generally has no closed form and whose numerical approximation scales exponentially with the dimension of $z$; it is therefore intractable for any practical purpose.
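To make this concrete, here is a minimal numerical sketch (the model, a unit-variance Gaussian prior and likelihood, and all grid settings are assumptions chosen purely for illustration). Naive grid integration needs $n^d$ nodes, so the cost grows exponentially with the dimension of $z$:

```python
import numpy as np

# Toy model (an illustrative assumption): p(z) = N(0, I) and
# p(x|z) = N(z, I), both d-dimensional, so the exact evidence is
# p(x) = N(x; 0, 2I). Grid integration uses n points per axis,
# i.e. n**d nodes in total, which explodes as d grows.

def gaussian_pdf(u, mean):
    # density of an isotropic unit-variance Gaussian N(mean, I)
    d = np.shape(u)[-1]
    return np.exp(-0.5 * np.sum((u - mean) ** 2, axis=-1)) / (2 * np.pi) ** (d / 2)

def evidence_on_grid(x, n=50, lim=5.0):
    d = x.shape[0]
    axes = [np.linspace(-lim, lim, n)] * d
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    dz = (2 * lim / (n - 1)) ** d                    # volume of one grid cell
    return np.sum(gaussian_pdf(x, grid) * gaussian_pdf(grid, 0.0) * dz)

x = np.zeros(2)
print(evidence_on_grid(x))   # ~0.0796, matching the exact 1 / (4*pi)
# For d = 100, the same grid would need 50**100 nodes: hopelessly intractable.
```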

Variational Inference

Instead of computing the true posterior $p_\theta(z\mid x)$, we can approximate it with another distribution $q_\phi(z\mid x)$.
In other words, we want: $$p_\theta(z\mid x) \approx q_\phi(z\mid x)$$

As you might have guessed by now, there is no analytical way of finding such an approximation. Therefore, we use an iterative process that pushes our $ q_\phi(z\mid x)$ ever closer to $p_\theta(z\mid x)$. To accomplish this, we need a differentiable loss function that measures the “distance” between the two distributions.
Thankfully, there is such a measure: the KL divergence $$D_{KL}(q_\phi \parallel p_\theta) = \mathbb{E}_{z \sim q_\phi}\left[\log\frac{q_\phi(z\mid x)}{p_\theta(z\mid x)}\right] = \int q_\phi(z\mid x)\log\left[\frac{q_\phi(z\mid x)}{p_\theta(z\mid x)}\right]dz$$ However, we again have the same issue: the denominator $p_\theta(z\mid x)$ is not tractable. Let’s see if we can simplify the expression further: $$\begin{aligned} D_{KL}(q_\phi \parallel p_\theta) &= \mathbb{E}_{z \sim q_\phi}\left[\log\frac{q_\phi(z\mid x)}{p_\theta(z\mid x)}\right]\\ &= \mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] - \mathbb{E}_{z \sim q_\phi}[\log p_\theta(z\mid x)]\\ &= \mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] - \mathbb{E}_{z \sim q_\phi}\left[\log \frac{p_\theta(z, x)}{p_\theta(x)}\right]\\ &= \mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] - \mathbb{E}_{z \sim q_\phi}[\log p_\theta(z, x)]+ \mathbb{E}_{z \sim q_\phi}[\log p_\theta(x)]\\ &= \mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] - \mathbb{E}_{z \sim q_\phi}[\log p_\theta(z, x)]+ \int q_\phi(z\mid x)\log p_\theta(x)~dz\\ &= \mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] - \mathbb{E}_{z \sim q_\phi}[\log p_\theta(z, x)]+ \log p_\theta(x)\int q_\phi(z\mid x)~dz\\ \end{aligned}$$ Since for any probability density function $ \displaystyle \int q_\phi(z\mid x)~dz = 1 $, we get: $$ D_{KL}(q_\phi \parallel p_\theta) = \mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] - \mathbb{E}_{z \sim q_\phi}[\log p_\theta(z, x)]+ \log p_\theta(x)$$ Rearranging the formulation and isolating the term of interest $ \log p_\theta(x) $, we get: $$ \begin{equation}\log p_\theta(x) = D_{KL}(q_\phi \parallel p_\theta) - \mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] + \mathbb{E}_{z \sim q_\phi}[\log p_\theta(z, x)] \end{equation}$$
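Before moving on, it is worth sanity-checking Equation $(1)$ numerically. The sketch below uses a toy 1-D conjugate Gaussian model (an assumption chosen because its posterior and evidence are available in closed form) and verifies that the KL term plus the remaining two expectations reproduce $\log p_\theta(x)$ exactly:

```python
import numpy as np

# Toy conjugate model (an assumption for illustration):
#   p(z) = N(0, 1),   p(x|z) = N(z, 1)
# which gives, in closed form,
#   posterior p(z|x) = N(x/2, 1/2)   and   evidence p(x) = N(x; 0, 2).
# Variational family: q(z|x) = N(m, v).

def log_normal(u, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (u - mean) ** 2 / (2 * var)

def kl_gauss(m1, v1, m2, v2):
    # KL( N(m1, v1) || N(m2, v2) ) for 1-D Gaussians, closed form
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(x, m, v):
    # -E_q[log q] + E_q[log p(z, x)]  ==  E_q[log p(x|z)] - KL(q || p(z))
    expected_loglik = log_normal(x, m, 1.0) - 0.5 * v  # E_q[(x-z)^2] = (x-m)^2 + v
    return expected_loglik - kl_gauss(m, v, 0.0, 1.0)

x, m, v = 1.3, 0.4, 0.8                            # arbitrary data point and q
lhs = log_normal(x, 0.0, 2.0)                      # log evidence, log p(x)
rhs = kl_gauss(m, v, x / 2, 0.5) + elbo(x, m, v)   # D_KL(q || posterior) + rest
print(lhs, rhs)                                    # agree up to float rounding
```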

ELBO

Let’s split the right-hand side of Equation $(1)$ into two chunks:
$ \mathrm{ELBO} = -\mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] + \mathbb{E}_{z \sim q_\phi}[\log p_\theta(z, x)] $
and
$ D_{KL}(q_\phi \parallel p_\theta) $

Since the KL divergence here is intractable, we need a way to get rid of it. Knowing that the KL divergence is always non-negative, $D_{KL} \ge 0$, dropping it from Equation $(1)$ turns the equality into an inequality: $$ \begin{equation} \log p_\theta(x) \ge - \mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] + \mathbb{E}_{z \sim q_\phi}[\log p_\theta(z, x)] \end{equation} $$ $$ \log p_\theta(x) \ge \mathrm{ELBO} $$ The key observation is that the evidence $\log p_\theta(x)$ does not depend on the variational parameters $\phi$ at all; for a fixed $\theta$ it is a constant. This lets us decrease the KL divergence indirectly, without ever computing it.
More specifically, since Equation $(1)$ says that $\mathrm{ELBO} + D_{KL}$ always sums to this fixed constant $\log p_\theta(x)$, if we optimize $\phi$ to increase the $\mathrm{ELBO}$ term, then the KL divergence term must shrink to compensate.
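As a quick numerical check of Equation $(2)$, reusing the helper functions from the Gaussian sketch above, the ELBO stays at or below $\log p_\theta(x)$ no matter which $q$ we draw:

```python
# Reuses log_normal and elbo from the sketch above.
rng = np.random.default_rng(0)
x = 1.3
log_px = log_normal(x, 0.0, 2.0)                 # the fixed log evidence
for _ in range(1000):
    m, v = rng.normal(), rng.uniform(0.1, 3.0)   # a random variational q
    assert elbo(x, m, v) <= log_px               # the bound never breaks
print("ELBO <= log p(x) held for every q tried")
```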

Therefore, by maximizing the $\mathrm{ELBO}$, we indirectly decrease the KL divergence; hence, we use Equation $(2)$ as the quantity to maximize, while keeping in mind that it can sit strictly below the true log evidence in Equation $(1)$. This is why we call it the Evidence Lower Bound, the evidence being $p_\theta(x)$.
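Here is a minimal sketch of that maximization on the same toy Gaussian model (reusing its helpers; the learning rate and step count are arbitrary choices). Plain gradient ascent on the closed-form ELBO pushes the KL between $q_\phi$ and the true posterior toward zero:

```python
# Reuses log_normal, kl_gauss, and elbo from the sketch above.
x, lr = 1.3, 0.05
m, log_v = 0.0, 0.0               # optimize log-variance so v stays positive
for step in range(201):
    v = np.exp(log_v)
    grad_m = (x - m) - m          # dELBO/dm (likelihood pull + prior pull)
    grad_v = 1.0 / (2 * v) - 1.0  # dELBO/dv for this toy model
    m += lr * grad_m
    log_v += lr * v * grad_v      # chain rule: dELBO/dlog_v = v * dELBO/dv
    if step % 50 == 0:
        v = np.exp(log_v)
        print(f"step {step:3d}  ELBO {elbo(x, m, v):+.4f}  "
              f"KL(q || posterior) {kl_gauss(m, v, x / 2, 0.5):.6f}")
# q converges to the exact posterior N(x/2, 1/2), so the KL term vanishes.
```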

Finally, let’s simplify the $\mathrm{ELBO}$ formulation further, using $p_\theta(z, x) = p_\theta(x\mid z)\,p_\theta(z)$, and decipher each component: $$ \mathrm{ELBO} = -\mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] + \mathbb{E}_{z \sim q_\phi}[\log p_\theta(z, x)] $$ $$ \mathrm{ELBO} = -\mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] + \mathbb{E}_{z \sim q_\phi}[\log p_\theta(x\mid z)] +\mathbb{E}_{z \sim q_\phi}[\log p_\theta(z)] $$ $$ \mathrm{ELBO} = \mathbb{E}_{z \sim q_\phi}[\log p_\theta(x\mid z)] -\mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] +\mathbb{E}_{z \sim q_\phi}[\log p_\theta(z)] $$ $$ \mathrm{ELBO} = \mathbb{E}_{z \sim q_\phi}[\log p_\theta(x\mid z)] -\left[\mathbb{E}_{z \sim q_\phi}[\log q_\phi(z\mid x)] - \mathbb{E}_{z \sim q_\phi}[\log p_\theta(z)]\right] $$ $$ \begin{equation}\mathrm{ELBO} = \mathbb{E}_{z \sim q_\phi}[\log p_\theta(x\mid z)] - \mathbb{E}_{z \sim q_\phi}\left[\log\frac{q_\phi(z\mid x)}{p_\theta(z)}\right] \end{equation}$$

In this final formulation, the first term is referred to as the “Reconstruction Error”, since it is the expected log-likelihood of $x$ (typically a data point) given $z$ (typically a latent variable) drawn from $q_\phi$. The second term is the KL divergence between the approximate posterior $q_\phi(z\mid x)$ and the prior $p_\theta(z)$.
In most of the literature, including that on VAEs, Equation $(3)$ is used as the objective function, either maximized directly or negated to serve as a loss to minimize.
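As an illustration, here is a minimal sketch of Equation $(3)$ as a VAE training loss in PyTorch (assuming a Bernoulli decoder, a diagonal-Gaussian encoder $q_\phi(z\mid x)$, and a standard normal prior; the function and argument names are hypothetical, and the encoder/decoder networks are assumed given):

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, recon_logits, mu, log_var):
    # Reconstruction term of Eq. (3): E_q[log p_theta(x|z)], estimated with
    # the single z that produced recon_logits (Bernoulli decoder assumed).
    recon = -F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    # Second term of Eq. (3): KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    # available in closed form for two diagonal Gaussians.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return -(recon - kl)   # negate the ELBO so an optimizer can minimize it
```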
