Noise contrastive estimation (NCE) replaces the expensive vocabulary-sized softmax operation at the final layer of language models with a cheaper sampling-based operation, yielding a significant speed-up during training. To motivate NCE, let us start with the basic equations for probabilistic models.
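To make the cost contrast concrete, here is a toy sketch (not from the source; all names and sizes are illustrative) showing that a full softmax must normalize over the entire vocabulary, while an NCE-style objective only scores the target word against a handful of sampled noise words:

```python
import math
import random

def full_softmax_prob(logits, target):
    # Full softmax: the normalizer sums over every vocabulary entry,
    # so each training example costs O(|V|).
    z = sum(math.exp(l) for l in logits)
    return math.exp(logits[target]) / z

def nce_scored_logits(logits, target, noise_ids):
    # NCE-style scoring: only the target plus k sampled noise words are
    # touched, O(k + 1) per example; training then treats these as a
    # binary data-vs-noise classification problem.
    return [logits[target]] + [logits[i] for i in noise_ids]

vocab_size = 50_000
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(vocab_size)]

p = full_softmax_prob(logits, target=123)  # touches all 50,000 logits
noise_ids = random.sample(range(vocab_size), 10)
scored = nce_scored_logits(logits, target=123, noise_ids=noise_ids)  # touches 11
```

The sketch omits the NCE loss itself (covered by the equations that follow); it only illustrates why the per-example work drops from the vocabulary size to the number of noise samples.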
To model the probability distribution