This is an attempt to give a mathematical intuition for how Variational Autoencoders (VAEs) work. Paper: https://arxiv.org/abs/1312.6114
This paper uses L0 regularization to shrink the model by driving many of its weights to exactly zero. The L0 penalty only counts the non-zero weights, so it is not differentiable and cannot be optimized directly with gradient-based learning. To get around that, the authors reformulate the original loss function so that gradient-based training becomes possible again.
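To make the idea of a gradient-friendly reformulation concrete, here is a minimal sketch of one well-known relaxation, the "hard concrete" gate. Whether this matches the paper's exact formulation is an assumption. Each weight is multiplied by a random gate in [0, 1] that can land on exactly zero, and the count of non-zero weights is replaced by the smooth probability that each gate is non-zero. The names `sample_gate`, `prob_nonzero`, and `log_alpha`, and the constants `GAMMA`, `ZETA`, and `BETA`, are illustrative choices and do not come from the text above. The sketch uses plain Python and shows only the forward computation; in practice an autodiff framework would supply the gradients.

```python
import math
import random

# Assumed illustrative constants: stretch interval (GAMMA, ZETA) and temperature BETA.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Sample one gate in [0, 1]; it is a smooth function of log_alpha given the noise u."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)  # noise drawn independently of log_alpha
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA  # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))  # clamp so exact 0 and exact 1 are reachable


def prob_nonzero(log_alpha):
    """Probability that the gate is non-zero: a smooth stand-in for the L0 count."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


rng = random.Random(0)
weights = [0.5, -1.2, 2.0]
log_alphas = [-5.0, 0.0, 5.0]  # hypothetical learnable gate parameters, one per weight

gates = [sample_gate(a, rng) for a in log_alphas]
masked_weights = [w * g for w, g in zip(weights, gates)]  # weights the model would use
penalty = sum(prob_nonzero(a) for a in log_alphas)  # added to the loss, scaled by a coefficient
```

A very negative `log_alpha` makes the gate almost always exactly zero, which prunes the weight. A large positive value keeps the weight essentially untouched. The penalty term pushes the `log_alpha` values down, while the data loss pushes the useful ones back up.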