# VI Tutorial at Yandex NLP Week

This is an extended version of our ACL 2018 tutorial presented at Yandex NLP Week in Moscow.

# Schedule

Check the branch `yandex2019` for all modules.

**Day 1**

**Day 2**

**Day 3**

# Advice

For discrete latent variables:

- standardise the learning signal (`log p(x|z)`): that is, use the running average baseline in combination with scaling by a constant (a.k.a. “multiplicative baseline”, for which you should use a running estimate of the standard deviation of the learning signal)
- baselines should correlate well with the learning signal: try learning one via an MLP and use an L2 loss for fitting its parameters
- for discrete sequences try the baseline `log p(x|z^*)` where `z^* = \argmax_z Q(z|x)`. This usually works well because the reward at the argmax of the inference model oftentimes correlates with the reward on a sample (this is not a rule, just a heuristic that seems to hold often). Note that this is a valid baseline, since it does not depend on the sampled `z`, and it involves no training (but it does increase computation time and occupy more memory); check Havrylov et al 2019.
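The tips above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the tutorial's own code: `SignalStandardiser`, `argmax_baseline_signal`, the `decay` and `eps` parameters, and the toy tables `log_lik` and `q_probs` are all hypothetical names introduced here. The standardiser keeps running estimates of the mean and standard deviation of the learning signal and applies them to the next batch; the second function computes the learning signal with `log p(x|z^*)` as baseline for a single categorical latent variable.

```python
import math
import random


class SignalStandardiser:
    """Centre and scale a scalar learning signal with running statistics."""

    def __init__(self, decay=0.9, eps=1e-6):
        self.decay = decay  # weight on the old running estimates (assumed value)
        self.eps = eps      # floor for the running standard deviation
        self.mean = 0.0     # running mean: the additive baseline
        self.var = 1.0      # running variance: drives the multiplicative baseline

    def standardise(self, signals):
        """Standardise a batch of signals, then update the running statistics.

        The statistics applied here come from previous batches only, so the
        baseline does not depend on the samples it is applied to.
        """
        std = max(math.sqrt(self.var), self.eps)
        out = [(s - self.mean) / std for s in signals]
        batch_mean = sum(signals) / len(signals)
        batch_var = sum((s - batch_mean) ** 2 for s in signals) / len(signals)
        self.mean = self.decay * self.mean + (1.0 - self.decay) * batch_mean
        self.var = self.decay * self.var + (1.0 - self.decay) * batch_var
        return out


def argmax_baseline_signal(log_lik, q_probs, rng):
    """Return (z, log p(x|z) - log p(x|z*)), z ~ Q(z|x), z* = argmax_z Q(z|x).

    log_lik[k] stands in for log p(x|z=k) and q_probs[k] for Q(z=k|x);
    both are toy lookup tables, not a real model.
    """
    categories = list(range(len(q_probs)))
    z_star = max(categories, key=lambda k: q_probs[k])
    z = rng.choices(categories, weights=q_probs, k=1)[0]
    return z, log_lik[z] - log_lik[z_star]
```

In a score-function estimator, the value returned by either helper would multiply the gradient of `log Q(z|x)` with respect to the inference parameters. Updating the running statistics only after they have been applied is a deliberate choice in this sketch: it keeps the baseline independent of the current samples.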

# Advanced topics

Beyond Gaussian posterior with normalising flows:

- Variational Inference with Normalising Flows
- Improving Variational Inference with Inverse Autoregressive Flow

Beyond mean-field:

- Stochastic Backpropagation and Approximate Inference in Deep Generative Models
- Sequential Neural Models with Stochastic Layers
- A Stochastic Decoder for Neural Machine Translation

Beyond Gaussian prior:

More about posterior collapse:

- Variational Lossy Autoencoder
- Towards a Deeper Understanding of Variational Autoencoding Models and InfoVAE: Information Maximizing Variational Autoencoders
- Fixing a Broken ELBO

Beyond KL divergence:

Beyond likelihood learning:

Beyond baselines:

- MuProp: Unbiased Backpropagation for Stochastic Neural Networks
- REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models
- Backpropagation through the Void: Optimizing control variates for black-box gradient estimation

# Further reading

This is a list of papers you can use to kickstart your path to being an expert on DGMs.

Some people also asked me to list the techniques available for dealing with intermediate discrete representations (this is not an exhaustive list):

- the probabilistic way to do it is to make the discrete representation stochastic and circumvent non-differentiability via marginalisation; this, however, leads to intractabilities that need to be addressed via approximate inference and sophisticated gradient estimation: NVIL and in NLP;
- you can use pseudo-gradients (i.e. gradient-like quantities) when gradients are not defined, though note that this is typically done heuristically and leads to biased estimators:
  - straight-through estimator (STE), including Concrete and GumbelSoftmax-ST
  - SPIGOT is similar to STE, but uses a more sophisticated pseudo-gradient motivated from an NLP perspective

- you can define activations that are themselves solutions to an optimisation problem: this can be used to derive unbiased estimators (though sometimes it requires a change of objective).
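To make the contrast between the first two items in the list concrete, here is a toy comparison for a single Bernoulli latent variable `z` with logit `theta`. It is a sketch under assumptions, not code from any of the cited papers: the function `f`, its derivative `df`, and all helper names are made up for illustration. It computes the exact gradient of `E[f(z)]` by marginalisation, a plain score-function estimate (the kind of estimator NVIL builds on, here without any baseline), and one common variant of the straight-through pseudo-gradient that treats the sampling step as the identity in the backward pass.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def f(z):
    # Hypothetical downstream loss as a function of the latent value.
    return (z - 0.3) ** 2


def df(z):
    # Derivative of f, only meaningful if we pretend z is continuous.
    return 2.0 * (z - 0.3)


def exact_grad(theta):
    # d/dtheta [p f(1) + (1 - p) f(0)] with p = sigmoid(theta).
    p = sigmoid(theta)
    return p * (1.0 - p) * (f(1.0) - f(0.0))


def score_function_grad(theta, z):
    # f(z) * d/dtheta log Bernoulli(z; p); unbiased but can be noisy.
    p = sigmoid(theta)
    return f(z) * (z - p)


def straight_through_grad(theta, z):
    # Pretend dz/dp = 1 and backpropagate through p; biased in general.
    p = sigmoid(theta)
    return df(z) * p * (1.0 - p)


def monte_carlo(estimator, theta, n, rng):
    # Average an estimator over n samples z ~ Bernoulli(sigmoid(theta)).
    p = sigmoid(theta)
    total = 0.0
    for _ in range(n):
        z = 1.0 if rng.random() < p else 0.0
        total += estimator(theta, z)
    return total / n
```

Averaging `score_function_grad` over many samples approaches `exact_grad(theta)`, whereas the average of `straight_through_grad` settles on a different value for most choices of `theta` and `f`, which is the bias mentioned in the list.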

# DGMs in NLP

Non-exhaustive list (let me know if you would like me to add a paper to this list):

- Word representation: Rios et al (2018), Brazinskas et al (2018)
- Morphological analysis and inflection: Zhou et al (2017), Wolf-Sonkin et al (2018)
- Syntactic parsing: Cheng et al (2017), Corro and Titov (2019)
- Semantic parsing: Yin et al (2018), Lyu and Titov (2019)
- Relation extraction: Marcheggiani and Titov (2016)
- Document modelling: Miao et al (2016), Srivastava and Sutton (2017)
- Summarisation: Miao and Blunsom (2016)
- Question answering: Miao et al (2016)
- Alignment and attention: Deng et al (2018)
- Machine translation: Zhang et al (2016), Schulz et al (2018), Eikema and Aziz (2018)
- Vision and language: Pu et al (2016), Wang et al (2017), Calixto et al (2018)
- Dialogue modelling: Wen et al (2017), Serban et al (2016)
- Speech modelling: Fraccaro et al (2016)
- Language modelling: Bowman et al (2016), Goyal et al (2017), Xu and Durrett (2018), Ziegler and Rush (2019), Pelsmaeker and Aziz (2019)