VI Tutorial at Yandex NLP Week

This is an extended version of our ACL 2018 tutorial presented at Yandex NLP Week in Moscow.


Check the yandex2019 branch for all modules.

Day 1

Day 2

Day 3


For discrete latent variables:

  • standardise the learning signal, log p(x|z): subtract a running-average baseline and divide by a running estimate of the standard deviation of the learning signal (this scaling by a constant is also known as a “multiplicative baseline”)
  • baselines should correlate well with the learning signal: try learning one with an MLP, fitting its parameters with an L2 loss
  • for discrete sequences, try the baseline log p(x|z^*), where z^* = \argmax_z Q(z|x). This usually works well because the reward at the argmax of the inference model often correlates with the reward on a sample (this is not a rule, just a heuristic that seems to hold often). Note that this is a valid baseline and involves no training, though it does increase computation time and memory use; check Havrylov et al 2019.
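The first tip above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the tutorial's own code: the names RunningStandardiser, bernoulli_score and score_function_grads are hypothetical, the latent variable is a single Bernoulli parameterised by a logit, and the scale is clipped at 1 from below, which is one common choice rather than the only one.

```python
import math


class RunningStandardiser:
    """Running mean and variance of a scalar learning signal (hypothetical helper)."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.mean = 0.0  # running-average baseline
        self.var = 1.0   # running variance, used for the multiplicative baseline

    def standardise(self, signals):
        # Use statistics from *previous* batches only, so the baseline
        # does not depend on the samples it is applied to.
        scale = max(1.0, math.sqrt(self.var))  # assumption: never scale up the signal
        out = [(s - self.mean) / scale for s in signals]

        # Update the running statistics with the current batch.
        batch_mean = sum(signals) / len(signals)
        batch_var = sum((s - batch_mean) ** 2 for s in signals) / len(signals)
        self.mean = self.decay * self.mean + (1.0 - self.decay) * batch_mean
        self.var = self.decay * self.var + (1.0 - self.decay) * batch_var
        return out


def bernoulli_score(logit, z):
    """Gradient of log Bernoulli(z | sigmoid(logit)) with respect to the logit."""
    return z - 1.0 / (1.0 + math.exp(-logit))


def score_function_grads(logit, zs, log_likelihoods, standardiser):
    """Per-sample score function gradients using the standardised learning signal."""
    signals = standardiser.standardise(log_likelihoods)
    return [s * bernoulli_score(logit, z) for s, z in zip(signals, zs)]


# Toy usage: two samples z ~ Q(z|x) with their log p(x|z) values.
standardiser = RunningStandardiser(decay=0.5)
grads = score_function_grads(0.0, [1.0, 0.0], [-10.0, -14.0], standardiser)
```

On the first batch the running statistics are still at their initial values, so the signal passes through unchanged; from the second batch on it is centred and rescaled. A learned baseline or the log p(x|z^*) baseline would replace (or be combined with) the running mean in the same place.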

Advanced topics

Beyond Gaussian posterior with normalising flows:

Beyond mean-field:

Beyond Gaussian prior:

More about posterior collapse:

Beyond KL divergence:

Beyond likelihood learning:

Beyond baselines:

Further reading

This is a list of papers you can use to kickstart your path to being an expert on DGMs.

Some people also asked me to list the techniques available for dealing with intermediate discrete representations (this is not an exhaustive list):

  • the probabilistic way is to make the discrete representation stochastic and circumvent non-differentiability via marginalisation; this, however, leads to intractabilities that need to be addressed via approximate inference and sophisticated gradient estimation: NVIL and in NLP;
  • you can use pseudo-gradients (i.e. gradient-like quantities) where gradients are not defined, though note that this is typically done heuristically and leads to biased estimators:
    • straight-through estimator (STE) including Concrete and GumbelSoftmax-ST
    • SPIGOT is similar to STE, but uses a more sophisticated pseudo-gradient motivated from an NLP perspective
  • you can define activations that are themselves solutions to an optimisation problem: this can be used to derive unbiased estimators (though sometimes it requires a change of objective):
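To make the straight-through idea from the list above concrete, here is a minimal sketch with a hand-written backward pass. It assumes a single binary gate obtained by thresholding a sigmoid and a squared-error loss on the gate; the function names are hypothetical and chosen only for illustration. The forward pass uses the hard decision, while the backward pass pretends the thresholding step was the identity, which is what makes the result a biased pseudo-gradient rather than a true gradient.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(logit, threshold=0.5):
    """Return the gate probability and the hard (non-differentiable) decision."""
    p = sigmoid(logit)
    z = 1.0 if p > threshold else 0.0
    return p, z


def straight_through_grad(logit, target):
    """Loss and straight-through pseudo-gradient with respect to the logit."""
    p, z = forward(logit)

    # Downstream loss is computed on the hard decision z.
    loss = (z - target) ** 2
    dloss_dz = 2.0 * (z - target)

    # Straight-through step: the true dz/dp is zero almost everywhere,
    # so we substitute 1 (the derivative of the identity) instead.
    dz_dp = 1.0
    dp_dlogit = p * (1.0 - p)  # exact derivative of the sigmoid

    return loss, dloss_dz * dz_dp * dp_dlogit


# Toy usage: the gate is closed (z = 0) but the target wants it open,
# so the pseudo-gradient is negative and gradient descent raises the logit.
loss, grad = straight_through_grad(0.0, 1.0)
```

In an autodiff framework the same effect is usually obtained by adding the hard and soft values in a way that blocks gradients through their difference; the manual version here just spells out which derivative is being replaced.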


Non-exhaustive list (let me know if you would like me to add a paper to this list):