If you've done any NLP-related reading in the last few months, you've almost certainly heard of BERT, GPT-2 or ELMo, which have pushed the boundaries of NLP tasks even further.
As impressive as these techniques are, there are many NLP setups where you won't be able to use them.
Maybe your problem isn't really about natural language at all, but can still be formulated in terms of tokens and sequences. Maybe you're dealing with a language that has no pre-trained models available and no resources to train one on your own.
Or maybe you're just working in a very specific domain (although, have you heard of BioBERT?).
So you're left with plain pre-trained word embeddings (not produced by the models mentioned above). This was the case in the Kaggle Quora Insincere Questions Classification competition from earlier this year, where competitors could not use BERT/GPT-2/ELMo and were instead supplied with four sets of pre-trained embeddings. This setup raised an interesting discussion about how to combine the different embeddings instead of choosing only one.
In this post, we'll go over some of the latest techniques for combining different embeddings into a single representation, referred to as a "meta-embedding".
Generally, we can split meta-embedding techniques into two categories: (1) the meta-embeddings are created in a procedure separate from the task they will be used for, or (2) the meta-embeddings are trained jointly with the actual task.
I will not cover approaches from the first category (1) in this post; most of them are pretty straightforward and include averaging along the embedding dimension, concatenation, etc. (there's a quick sketch of them right below). Yet don't mistake simple for weak: the winning solution of the Kaggle competition mentioned above used weighted averaging of embeddings.
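For concreteness, here is what these simple combinations look like in code. The embedding matrices and weights below are random stand-ins, purely to show the shapes; they are not the ones used in the competition.

import numpy as np

vocab_size, dim = 10000, 300
# Stand-ins for real pre-trained embedding matrices (e.g. GloVe, fastText, paragram)
glove = np.random.randn(vocab_size, dim).astype(np.float32)
fasttext = np.random.randn(vocab_size, dim).astype(np.float32)
paragram = np.random.randn(vocab_size, dim).astype(np.float32)

# Concatenation: the dimensionality grows with the number of sources
concat_meta = np.concatenate([glove, fasttext, paragram], axis=1)   # (vocab, 900)

# Plain average: keeps the original dimensionality
avg_meta = np.mean([glove, fasttext, paragram], axis=0)             # (vocab, 300)

# Weighted average: weights picked by hand or by validation score (illustrative values)
weights = np.array([0.5, 0.3, 0.2], dtype=np.float32)
weighted_meta = np.tensordot(weights, np.stack([glove, fasttext, paragram]), axes=1)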
If you wish to read more about these kinds of approaches, you can find information in the following two papers:
a. Frustratingly Easy Meta-Embedding — Computing Meta-Embeddings by Averaging Source Word Embeddings
b. Learning Meta-Embeddings by Using Ensembles of Embedding Sets
In the next section we cover two techniques that use the second procedure (2).
(Contextual) Dynamic Meta-Embedding
The techniques in this section come from the Facebook AI Research paper Dynamic Meta-Embeddings for Improved Sentence Representations, which presents two novel techniques: Dynamic Meta-Embeddings (DME) and Contextual Dynamic Meta-Embeddings (CDME).
Both are modules appended at the beginning of a network; they have trainable parameters, which are updated by the same gradients as the rest of the network.
The common step of each of the two techniques is a linear projection of the original embeddings. The figure below provides a visualization of the projection; the dashed lines represent the learned parameters.
Using the resulting projections, attention coefficients are calculated and used to form a weighted sum of the projections. The way DME and CDME calculate these coefficients is what distinguishes them.
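In symbols (the notation loosely follows the paper), with w_{i,j} the embedding of word i from embedding set j, and P_j, b_j the learned projection for set j:

w'_{i,j} = P_j w_{i,j} + b_j,    j = 1, ..., N

w_i^{meta} = \sum_{j=1}^{N} \alpha_{i,j} w'_{i,j}

where the attention coefficients \alpha_{i,j} come out of a softmax over the N sets.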
DME
DME uses a mechanism that depends only on the word projections themselves. Each word projection is multiplied by a learned vector a, which yields a scalar; for N different projections (corresponding to N different embedding sets) we get N scalars. These scalars are then passed through the softmax function, and the results are the attention coefficients. The coefficients are then used to create the meta-embedding, which is the weighted sum (weighted by the coefficients) of the projections.
This step is visualized below (dashed lines for learned parameters).
If you speak TensorFlow (or Keras), you might prefer seeing this in code. Below is a GitHub gist of DME (compact version); the full code can be found here.
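If you just want the gist of the idea, here is a minimal Keras sketch of DME as described above; the class and argument names are mine and illustrative, not taken from the linked gist.

import tensorflow as tf

class DMELayer(tf.keras.layers.Layer):
    """Dynamic Meta-Embeddings: project each embedding set to a common size,
    then combine with softmax attention computed from the projections alone."""

    def __init__(self, projection_dim, num_embeddings, **kwargs):
        super().__init__(**kwargs)
        self.projections = [tf.keras.layers.Dense(projection_dim)
                            for _ in range(num_embeddings)]
        # The learned vector `a`: maps each projected word vector to a scalar score
        self.attention = tf.keras.layers.Dense(1, use_bias=False)

    def call(self, embedded_inputs):
        # embedded_inputs: list of N tensors, each of shape (batch, seq_len, emb_dim_j)
        projected = [proj(emb) for proj, emb in zip(self.projections, embedded_inputs)]
        stacked = tf.stack(projected, axis=2)           # (batch, seq, N, proj_dim)
        scores = self.attention(stacked)                # (batch, seq, N, 1)
        alphas = tf.nn.softmax(scores, axis=2)          # softmax over the N embedding sets
        return tf.reduce_sum(alphas * stacked, axis=2)  # (batch, seq, proj_dim)

You would call this layer on a list of already-embedded input sequences, one tensor per embedding set, and feed its output to the rest of the model.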
CDME
CDME, adds context into the mix, using a bidirectional LSTM (BiLSTM).
I won't elaborate on LSTMs here; if you're not comfortable with the term, review the classic colah post about them.
The only difference from DME is in how the attention coefficients are calculated.
Just like in DME, the sequences are first projected, and the projected sequences are then passed through the BiLSTM. Then, instead of using the word projection itself, the concatenated hidden states of the forward and backward LSTMs (at the word's corresponding index), together with the vector a, are used to calculate the attention coefficient.
Visualization of this process:
And again, if you prefer reading code, below is a compact-version GitHub gist; the full version is here.
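As before, here is a minimal Keras sketch of CDME in the same spirit; the names and the LSTM size are illustrative, not taken from the linked gist.

import tensorflow as tf

class CDMELayer(tf.keras.layers.Layer):
    """Contextual DME: same projection and weighted sum as DME, but the
    attention scores come from a BiLSTM run over the projected sequences."""

    def __init__(self, projection_dim, num_embeddings, lstm_units=2, **kwargs):
        super().__init__(**kwargs)
        self.projections = [tf.keras.layers.Dense(projection_dim)
                            for _ in range(num_embeddings)]
        self.bilstm = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(lstm_units, return_sequences=True))
        # The learned vector `a`, applied to the concatenated forward/backward states
        self.attention = tf.keras.layers.Dense(1, use_bias=False)

    def call(self, embedded_inputs):
        # embedded_inputs: list of N tensors, each of shape (batch, seq_len, emb_dim_j)
        projected = [proj(emb) for proj, emb in zip(self.projections, embedded_inputs)]
        stacked = tf.stack(projected, axis=2)             # (batch, seq, N, proj_dim)
        # Contextual hidden states for each projected sequence (the BiLSTM is shared)
        hidden = tf.stack([self.bilstm(p) for p in projected], axis=2)  # (batch, seq, N, 2*units)
        scores = self.attention(hidden)                   # (batch, seq, N, 1)
        alphas = tf.nn.softmax(scores, axis=2)            # softmax over the N embedding sets
        return tf.reduce_sum(alphas * stacked, axis=2)    # (batch, seq, proj_dim)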
That's it! I hope you've learned something new; feel free to post your thoughts and questions. 👋