VI Tutorial at Yandex NLP Week

This is an extended version of our ACL 2018 tutorial presented at Yandex NLP Week in Moscow.

Schedule

Check the branch yandex2019 for all modules

Day 1

Day 2

Day 3

Advice

For discrete latent variables:

standardise the learning signal (log p(x|z)): that is, subtract a running-average baseline and scale by a constant (a.k.a. “multiplicative baseline”), for which you should use a running estimate of the standard deviation of the learning signal;
baselines should correlate well with the learning signal: try learning one via an MLP, using an L2 loss to fit its parameters;
for discrete sequences, try the baseline log p(x|z^*) where z^* = \argmax_z Q(z|x). This usually works well because the reward at the argmax of the inference model often correlates with the reward on a sample (this is not a rule, just a heuristic that seems to hold often). Note that this is a valid baseline and involves no training, though it does increase computation time and memory use; see Havrylov et al. (2019).
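
The standardisation advice above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the class name `SignalStandardiser`, the exponential moving average, and the `decay` and `eps` parameters are hypothetical choices and are not taken from the tutorial code.

```python
import math


class SignalStandardiser:
    """Running-average baseline plus running-std scaling (illustrative sketch)."""

    def __init__(self, decay=0.9, eps=1e-6):
        self.decay = decay        # assumed smoothing factor for the running estimates
        self.eps = eps            # lower bound on the scale, avoids division by zero
        self.mean = 0.0           # running average baseline
        self.var = 1.0            # running variance of the learning signal
        self.initialised = False

    def __call__(self, signals):
        # Standardise with statistics from *previous* batches only, so the
        # baseline does not depend on the samples it is applied to.
        scale = max(math.sqrt(self.var), self.eps)
        standardised = [(s - self.mean) / scale for s in signals]

        # Update the running estimates with the current batch.
        n = len(signals)
        batch_mean = sum(signals) / n
        batch_var = sum((s - batch_mean) ** 2 for s in signals) / n
        if self.initialised:
            self.mean = self.decay * self.mean + (1 - self.decay) * batch_mean
            self.var = self.decay * self.var + (1 - self.decay) * batch_var
        else:
            self.mean, self.var = batch_mean, batch_var
            self.initialised = True
        return standardised
```

In a score-function estimator, each standardised value would multiply the gradient of log Q(z|x) for the corresponding sample, in place of the raw log p(x|z).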

Advanced topics

Beyond Gaussian posterior with normalising flows:

Beyond mean-field:

Beyond Gaussian prior:

Automatic Differentiation Variational Inference

More about posterior collapse:

Beyond KL divergence:

Operator Variational Inference

Beyond likelihood learning:

Deep and Hierarchical Implicit Models

Beyond baselines:

Further reading

This is a list of papers you can use to kickstart your path to being an expert on DGMs.

Some people also asked me to list the techniques available for dealing with intermediate discrete representations (this is not an exhaustive list):

the probabilistic way to do it is to make the discrete representation stochastic and circumvent non-differentiability via marginalisation; this, however, leads to intractabilities that need to be addressed via approximate inference and sophisticated gradient estimation: NVIL and in NLP;
you can use pseudo-gradients (i.e. gradient-like quantities) when gradients are not defined, though note that this is typically done heuristically and leads to biased estimators:
- straight-through estimator (STE) including Concrete and GumbelSoftmax-ST
- SPIGOT is similar to STE, but uses a more sophisticated pseudo-gradient motivated from an NLP perspective
you can define activations that are themselves solutions to an optimisation problem: this can be used to derive unbiased estimators (though sometimes it requires a change of objective):
- sparsemax
- sparsemap
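
As an illustration of an activation defined as the solution to an optimisation problem, sparsemax is the Euclidean projection of a score vector onto the probability simplex. The sketch below uses the standard sort-and-threshold procedure in plain Python; the function name and the list-based interface are assumptions made for this example, not the API of any particular library.

```python
def sparsemax(scores):
    """Project a list of scores onto the probability simplex (illustrative sketch)."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The k largest scores are in the support while 1 + k * z_(k) > sum of the top k.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
        else:
            break
    # Scores at or below the threshold tau receive exactly zero probability.
    return [max(s - tau, 0.0) for s in scores]
```

Unlike softmax, the output can contain exact zeros: `sparsemax([1.0, 0.0, -1.0])` puts all the mass on the first entry.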

DGMs in NLP

Non-exhaustive list (let me know if you would like me to add a paper to this list):

Word representation: Rios et al (2018), Brazinskas et al (2018)
Morphological analysis and inflection: Zhou et al (2017), Wolf-Sonkin et al (2018)
Syntactic parsing: Cheng et al (2017), Corro and Titov (2019)
Semantic parsing: Yin et al (2018), Lyu and Titov (2019)
Relation extraction: Marcheggiani and Titov (2016)
Document modelling: Miao et al (2016), Srivastava and Sutton (2017)
Summarisation: Miao and Blunsom (2016)
Question answering: Miao et al (2016)
Alignment and attention: Deng et al (2018)
Machine translation: Zhang et al (2016), Schulz et al (2018), Eikema and Aziz (2018)
Vision and language: Pu et al (2016), Wang et al (2017), Calixto et al (2018)
Dialogue modelling: Wen et al (2017), Serban et al (2016)
Speech modelling: Fraccaro et al (2016)
Language modelling: Bowman et al (2016), Goyal et al (2017), Xu and Durrett (2018), Ziegler and Rush (2019), Pelsmaeker and Aziz (2019)