Combining Structure and Semantics for Efficient Representation Learning
The principle of composition states that the meaning of a whole comes from its parts and how they fit together. This is a remarkably efficient principle: all you need is the meaning of a small set of parts and the rules for combining them in order to understand a near-infinite set of higher-order objects. It allows humans to understand the meanings of new words, sentences, and even entire books without having to memorise them. Machine learning models aren’t usually tasked with learning compositional structure explicitly, but what happens when they are? Does this help us create more efficient learners? In this blog we introduce an architecture, the Self-Structuring AutoEncoder (Self-StrAE), and its extension Banyan, which do exactly that, and as a result learn semantic representations radically more efficiently, requiring very little data and parameter scale while remaining competitive with much bigger and more resource-intensive models.
Relevant Papers: EMNLP 2023, SemEval 2024
Model
Self-StrAE is conceptually a fairly simple model. If we look at the figure above, we can understand the forward pass pretty easily. We start with a set of tokens (words or subwords in a sentence) and then perform the following steps:
1. Embed each token to get a set of leaf representations.
2. Repeatedly merge the two most similar adjacent representations with a composition function, until a single root representation remains. The sequence of merges defines a binary tree over the sentence.
3. Decode back down that same tree with a decomposition function, producing a reconstructed representation for every node.
4. Compare the encoder and decoder representations of each node under the training objective.
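To make the merge procedure concrete, here is a minimal sketch of the bottom-up encoder in PyTorch. The cosine-similarity merge rule and the `compose` linear layer follow the description above, but the names, dimensions, and details are illustrative assumptions rather than the exact implementation from the papers.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ToySelfStrAEEncoder(nn.Module):
    """Illustrative sketch: embed tokens, then greedily merge the most
    similar adjacent pair until a single root embedding remains."""

    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # composition function: two child embeddings -> one parent embedding
        self.compose = nn.Linear(2 * dim, dim)

    def forward(self, token_ids: torch.Tensor):
        nodes = [self.embed(t) for t in token_ids]   # leaf embeddings
        merges = []                                  # records the induced tree
        while len(nodes) > 1:
            # cosine similarity of every adjacent pair of current nodes
            sims = torch.stack([
                F.cosine_similarity(nodes[i], nodes[i + 1], dim=0)
                for i in range(len(nodes) - 1)
            ])
            i = int(sims.argmax())                   # most similar adjacent pair
            parent = self.compose(torch.cat([nodes[i], nodes[i + 1]]))
            merges.append(i)                         # remember which pair merged
            nodes[i:i + 2] = [parent]                # replace the pair by its parent
        return nodes[0], merges                      # root embedding + merge order
```

The decoder mirrors this pass top-down: a decomposition function (e.g. a linear map from one parent back to two children) walks the recorded merge order in reverse, so every node in the tree ends up with both an encoder and a decoder representation.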
For the first version of Self-StrAE we used the contrastive loss as our objective function. This means that we asked the representations for each node in the encoder to be as close as possible to the reconstructed representation of the same node in the decoder, and vice versa, while remaining distinct from all other node representations.
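As a rough sketch of what such an objective can look like (not necessarily the exact loss used in the paper), we can treat the matching encoder and decoder embeddings of each node as positive pairs and all other nodes as negatives, giving a symmetric cross-entropy over cosine similarities:

```python
import torch
import torch.nn.functional as F

def node_contrastive_loss(enc_nodes: torch.Tensor,
                          dec_nodes: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """enc_nodes, dec_nodes: (N, dim) embeddings of the same N tree nodes
    from the encoder and decoder. Row i of each is a positive pair; every
    other row acts as a negative."""
    enc = F.normalize(enc_nodes, dim=-1)
    dec = F.normalize(dec_nodes, dim=-1)
    logits = enc @ dec.T / temperature        # (N, N) cosine similarities
    targets = torch.arange(enc.size(0))       # positives lie on the diagonal
    # symmetric: pull matching pairs together in both directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```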
And that’s it; fairly simple, right? However, it turns out that this simple process actually leads to some pretty interesting results. Let’s consider the learning process. Intuitively, the model starts from random embeddings, and therefore an essentially random merge order. Throughout training, tokens which are often part of the same merges will have their representations drawn together, so each representation reflects what it is likely to compose with. The model can then leverage any regularities to better perform reconstruction. This leads the representations to further reflect likely compositions and consequently increases the regularity in the structure. Ultimately, this yields representations which must, by virtue of the training procedure, reflect the compositional semantics learned by the model.
Results
| Model | SimLex | WordSim-S | WordSim-R | STS-12 | STS-16 | STS-B | SICK-R |
|---|---|---|---|---|---|---|---|
| Self-StrAE | 13.04 | 48.19 | 45.47 | 34.42 | 49.93 | 36.68 | 51.00 |
| FastText | 25.80 | 50.80 | 29.18 | 4.35 | 32.43 | 16.93 | 41.58 |
| Bi-LSTM | 12.88 | 39.46 | 32.22 | 9.22 | 33.78 | 14.06 | 40.36 |
| RoBERTa | 9.92 | 26.60 | 6.20 | 29.48 | 50.88 | 38.36 | 49.58 |
We compared Self-StrAE with a few unstructured baselines to see how it performed. We pre-trained all the models from scratch on about 10 million tokens of English Wikipedia, to assess data efficiency, and then evaluated them on a few standard semantic similarity tasks covering both the word and the sentence level. We can see that Self-StrAE not only performs really competitively, but transfers its performance seamlessly from the word level (SimLex + WordSim) to the sentence level (STS + SICK). This is because it can take advantage of composition to generalise across levels of hierarchy, a capability that our inductive bias uniquely unlocks. The other important point is that Self-StrAE is a really simple model: it has just an embedding matrix and two functions, one for composition and one for decomposition. In our case these are simple linear layers, which means Self-StrAE achieves these results using just a fraction of the parameters of the baselines (a minimal sketch of this count follows the table below). So we have a model that shows both promising performance and serious efficiency.
| Model | Self-StrAE | Bi-LSTM | RoBERTa |
|---|---|---|---|
| Params | 430 | 181,800 | 3,950,232 |
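To see why the counts are so small, here is a back-of-the-envelope sketch, assuming the composition and decomposition functions are single linear layers over concatenated child and parent embeddings (the exact parameterisation in the papers may differ). With a hypothetical 10-dimensional embedding, this works out to the 430 figure listed for Self-StrAE above, if that count excludes the embedding matrix:

```python
from torch import nn

dim = 10  # hypothetical embedding size, chosen only for illustration

compose = nn.Linear(2 * dim, dim)    # two child embeddings -> one parent
decompose = nn.Linear(dim, 2 * dim)  # one parent -> two child embeddings

n_params = sum(p.numel() for p in compose.parameters()) \
         + sum(p.numel() for p in decompose.parameters())
print(n_params)  # (2*dim*dim + dim) + (2*dim*dim + 2*dim) = 430 when dim = 10
```

Everything else the model learns is the embedding matrix itself, which is why the totals stay orders of magnitude below the baselines.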
Self-StrAE showed promise, but to make it really effective we had to make a few changes, which we outline in our new paper Banyan. At the core are three modifications:
The full details are in the paper, but the key point is that with a few simple changes we can make the model much more effective. To prove it, we ran a series of experiments against some tougher baselines. We compared Self-StrAE and our new model Banyan with an eight-layer RoBERTa transformer and GloVe vectors. We trained our models on 10 million tokens to keep things efficient, while letting the baselines train on 100 million tokens so that they aren’t limited by lack of scale. For both the RoBERTa and GloVe baselines we also include a tougher variant: for GloVe we manually remove uninformative stopwords from the sentence representation, and for RoBERTa we add extra SimCSE training to boost the performance of its representations. We also increased the embedding dimension to 256 to make things just that little bit more challenging for Banyan.
On the Sentence Level:
| Model | STS-12 | STS-13 | STS-14 | STS-15 | STS-16 | STS-B | SemRel |
|---|---|---|---|---|---|---|---|
| Self-StrAE | 31.98 | 53.88 | 37.73 | 55.23 | 55.55 | 39.53 | 50.05 |
| GloVe | 31.61 | 21.69 | 27.37 | 40.42 | 29.27 | 28.25 | 41.20 |
| + stopword rm | 39.00 | 41.61 | 39.31 | 51.06 | 45.14 | 48.40 | 42.37 |
| RoBERTa | 42.77 | 51.70 | 45.67 | 63.67 | 59.60 | 39.97 | 52.73 |
| +SimCSE | 50.63 | 62.63 | 54.17 | 68.77 | 66.67 | 53.53 | 59.27 |
| Banyan | 51.20 | 69.10 | 63.20 | 73.20 | 66.60 | 61.50 | 61.60 |
On the Word Level:
| Model | SimLex | WordSim-S | WordSim-R |
|---|---|---|---|
| Self-StrAE | 13.80 | 54.38 | 52.85 |
| GloVe | 27.47 | 62.53 | 51.00 |
| RoBERTa | 29.23 | 61.97 | 46.00 |
| Banyan | 16.57 | 63.25 | 69.00 |
On Retrieval and Classification Tasks:
| Model | Quora R@1 | Quora R@10 | ArguAna R@1 | ArguAna R@10 | SST-2 | MRPC |
|---|---|---|---|---|---|---|
| Self-StrAE | 29.59 | 44.77 | 9.96 | 21.48 | 74.67 | 80.34 |
| GloVe | 26.08 | 43.17 | 6.18 | 24.68 | 75.83 | 81.00 |
| + stopword rm | 38.78 | 62.15 | 9.89 | 33.00 | 76.50 | 81.00 |
| RoBERTa | 37.67 | 58.78 | 8.18 | 28.85 | 75.68 | 81.00 |
| +SimCSE | 45.09 | 68.74 | 10.06 | 37.36 | 75.97 | 80.83 |
| Banyan | 50.19 | 75.80 | 27.41 | 49.60 | 77.20 | 79.57 |
Across the board Banyan significantly outperforms our original Self-StrAE and is able to match, and generally exceed, even these tougher baselines. Importantly, it does so while maintaining all the efficiency benefits of the original model. This means we have a learner that is both effective and efficient, which can become a real game changer in contexts where scale is not available.
A classic example of such a situation is low-resource languages. In many such cases the data scale needed to pre-train a big model simply isn’t available, or the communities don’t have the compute resources to do so, and that means a lot of NLP tasks simply aren’t possible. In fact, it’s such an issue that a recent challenge was held to try to provide good embeddings for exactly these languages, and it lets us really put Banyan to the test. This time we don’t compare against baselines we trained from scratch, but instead against large SoTA foundation models. These include the latest LLMs, pre-trained multilingual encoders, and even specialised embedding models trained on supervised semantic datasets.
| Model | Spanish | Telugu | Marathi | Hindi | Amharic | Afrikaans |
|---|---|---|---|---|---|---|
| Llama-3.1 (8B) | 66.7 | 65.6 | 63.4 | 61.7 | 64.1 | 65.4 |
| Mistral Nemo | 66.2 | 57.0 | 52.3 | 55.8 | 53.2 | 58.3 |
| MiniLM-L12 | 58.8 | 34.8 | 39.5 | 43.8 | 9.60 | 74.1 |
| Paraphrase XLM-R | 71.7 | 58.1 | 79.6 | 52.0 | 64.6 | 76.8 |
| XLM-R | 68.9 | 46.3 | 55.7 | 52.7 | 57.3 | 56.2 |
| XLM-R (FT) | 72.8 | 68.8 | 75.1 | 57.6 | 59.6 | 72.6 |
| Banyan | 61.0 | 71.1 | 67.7 | 61.8 | 66.2 | 78.7 |
On low-resource languages Banyan really shines, often outperforming the SoTA foundation models and supervised embedding models. It does so while learning fully from scratch with very little data and no supervision. We are able to pre-train from scratch in under an hour on a single GPU, and this can be done using free compute from Google Colab. Moreover, because the model is so lightweight it can easily run inference on a laptop’s CPU, which makes it really accessible. To underscore just how efficient Banyan is compared to the baselines, we can look at the number of non-embedding parameters:
| Model | Banyan | Self-StrAE | RoBERTa (M) | MiniLM-L12 | XLM-R | Llama 3.1 | Mistral Nemo |
|---|---|---|---|---|---|---|---|
| Params | 14 | 1072 | ≈10M | ≈21M | ≈85M | ≈8B | ≈12B |
Good and cheap embedding models are useful for many applications. For example, the digital humanities need to organise corpora of ancient languages, making it easier for researchers to access the texts they need. But these corpora are small, and these languages are unlikely to be present in the pre-training corpora of larger models. Banyan provides an efficient way to produce representations for these use cases and, more generally, for low-resource languages and under-represented communities.