What is hierarchical softmax?

word2vec: Negative Sampling (in layman's terms)?

Computing the softmax (the function used to determine which words are similar to the current target word) is expensive because its denominator requires summing over all words in the vocabulary V, which is usually very large.
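Concretely, with an assumed skip-gram notation where $h$ is the representation of the current target word and $v'_w$ is the output vector of a word $w$, the softmax probability of a context word $w_t$ is roughly:

$$p(w_t \mid \text{target}) = \frac{\exp\left(h^\top v'_{w_t}\right)}{\sum_{w_i \in V} \exp\left(h^\top v'_{w_i}\right)}$$

and the sum in the denominator runs over every word in the vocabulary $V$.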

What can be done?

Various strategies have been suggested to get around this by approximating the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact but change the architecture to improve its efficiency (e.g. hierarchical softmax, sketched below). Sampling-based approaches, on the other hand, completely do away with the softmax layer and instead optimize some other loss function that approximates the softmax (they do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute, such as negative sampling).
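As a rough sketch of the hierarchical softmax idea (the notation here is assumed, following the original word2vec papers): the vocabulary is arranged as a binary tree with words at the leaves, and the probability of a word is the product of sigmoid decisions along the path from the root to that word, so each evaluation touches only about $\log_2 |V|$ nodes instead of all $|V|$ words:

$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\left( [\![ n(w, j+1) = \mathrm{ch}(n(w, j)) ]\!] \cdot {v'}_{n(w,j)}^{\top} v_{w_I} \right)$$

Here $L(w)$ is the length of the path from the root to $w$, $n(w, j)$ is the $j$-th node on that path, $\mathrm{ch}(n)$ is a fixed child of node $n$, and $[\![ x ]\!]$ is $+1$ if $x$ is true and $-1$ otherwise.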

The loss function in word2vec is something like:
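Reusing the assumed notation from above, one simplified way to write the objective over a training corpus of $T$ words is:

$$J_\theta = -\frac{1}{T} \sum_{t=1}^{T} \log \frac{\exp\left(h^\top v'_{w_t}\right)}{\sum_{w_i \in V} \exp\left(h^\top v'_{w_i}\right)}$$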

Its logarithm can be broken down into:
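$$\log \frac{\exp\left(h^\top v'_{w_t}\right)}{\sum_{w_i \in V} \exp\left(h^\top v'_{w_i}\right)} = h^\top v'_{w_t} - \log \sum_{w_i \in V} \exp\left(h^\top v'_{w_i}\right)$$

The second term is the expensive one, since it still sums over the entire vocabulary.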

With some math and the sigmoid function (see reference 3 for more details), this is converted to:
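A standard per-pair form of the resulting negative-sampling objective (the number of noise words $k$ and the noise distribution $P_n$ are part of the assumed notation) is roughly:

$$J_\theta = -\log \sigma\left(h^\top v'_{w_t}\right) - \sum_{i=1}^{k} \log \sigma\left(-h^\top v'_{\tilde w_i}\right), \qquad \tilde w_i \sim P_n(w)$$

where $\sigma$ is the sigmoid function. Only $k + 1$ dot products are needed instead of one per vocabulary word.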

As you can see, this converts the problem into a binary classification task (y = 1 for the positive class, y = 0 for the negative class). Since we need labels for the binary classification task, we treat all context words c (all words in the target word's window) as true labels (y = 1, positive samples), and k words randomly sampled from the corpus as false labels (y = 0, negative samples).
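To make this concrete, here is a minimal sketch of the negative-sampling loss in Python/NumPy (the function name, toy dimensions, and random vectors are hypothetical, not part of word2vec itself):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Skip-gram negative-sampling loss for one (target, context) pair.

    target_vec    : vector of the target word, shape (d,)
    context_vec   : output vector of the true context word (y = 1), shape (d,)
    negative_vecs : output vectors of k sampled noise words (y = 0), shape (k, d)
    """
    # Positive term: push the true context word's score towards 1.
    positive = np.log(sigmoid(context_vec @ target_vec))
    # Negative terms: push the k sampled words' scores towards 0.
    negative = np.sum(np.log(sigmoid(-negative_vecs @ target_vec)))
    # Minimise the negative of the summed log-probabilities.
    return -(positive + negative)

# Toy usage with random vectors (d = 5 dimensions, k = 3 negative samples).
rng = np.random.default_rng(0)
d, k = 5, 3
loss = negative_sampling_loss(rng.normal(size=d),
                              rng.normal(size=d),
                              rng.normal(size=(k, d)))
print(loss)
```

In the original word2vec implementation, the negative words are drawn from the unigram distribution raised to the 3/4 power, and the gradients of this loss update both the target and output vectors.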

Reference:
