Transformers: The Backbone of Modern NLP Models

OMKAR HANKARE
Blog
4 MINS READ
21 October, 2024

Natural Language Processing (NLP) is a rising star in the world of tech, and much of that rise is thanks to a super powerful tool known as the Transformer.

Back in 2017, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks were the preferred architectures for processing text and other sequential data. However, they had some real drawbacks: they struggled to capture long-distance dependencies, and because they process tokens one at a time, they took ages to train.

Then the paper “Attention Is All You Need” by A. Vaswani and colleagues introduced the Transformer architecture, and it changed everything. Suddenly, we had a model that could handle long-range dependencies with ease and could be trained much faster than its predecessors.

Fig. Generalised architecture of transformers.

Transformers revolutionised text understanding by leveraging a mechanism known as “attention”. Unlike RNNs, which process a sentence token by token, Transformers can attend to different parts of a sentence at once. That parallel processing makes training faster and improves performance on long sentences. One of the essential concepts behind it is called self-attention.

Self-attention lets every word in a sentence decide how much weight to give to each of the other words when building its own representation. Because the whole sequence is handled in parallel, Transformers also make far better use of GPUs than RNNs, sidestepping the drawbacks of step-by-step sequence processing. The result is faster training and higher accuracy on language tasks than the traditional methods.

Fig. Self-attention mechanism in transformers

The diagram illustrates the mechanism of self-attention in the context of neural networks. The left side shows the Scaled Dot-Product Attention mechanism, which computes attention weights by multiplying query (Q) and key (K) matrices, scaling the result, applying an optional mask, and then using softmax to get a probability distribution. These weights are then used to compute a weighted sum of the value (V) matrix. 
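To make that left-hand side concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name, the boolean-mask convention with a large negative fill value, and the manual softmax are illustrative choices, not code from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for matrices of shape (seq_len, d_k)."""
    d_k = Q.shape[-1]
    # Dot product of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Optional mask, e.g. to hide future positions in the decoder
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    # Softmax over the key dimension gives the attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values
    return weights @ V
```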

On the right side, the diagram illustrates the multi-head attention mechanism, where multiple Scaled Dot-Product Attention operations (called "heads") are performed in parallel. It shows how multiple sets of queries, keys, and values are passed through the scaled dot-product attention mechanism in parallel. The outputs of these multiple attention heads are then concatenated and passed through a final linear transformation. 
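A sketch of that right-hand side, reusing the scaled_dot_product_attention function above. The weight matrices Wq, Wk, Wv and Wo (assumed here to be d_model x d_model NumPy arrays) and the column-slicing approach are simplifications for illustration:

```python
def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Run several attention heads in parallel, concatenate them, and project."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    # Project the input into queries, keys and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        # Each head attends over its own slice of the projections
        heads.append(scaled_dot_product_attention(Q[:, cols], K[:, cols], V[:, cols]))
    # Concatenate the head outputs and apply the final linear transformation
    return np.concatenate(heads, axis=-1) @ Wo
```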

Transformers employ a neat little trick called Positional Encoding to retain the notion of word order while processing everything in parallel. In other words, every word in the text gets an extra vector determined by its position in the sentence. But it is not just plain boring numbers: Transformers use sinusoidal waveforms with distinct frequencies, which lets the model work out how words relate to each other and preserve the meaning of the content.

In general, the even dimensions of the encoding use sine waves while the odd dimensions use cosine waves, with a different frequency for each pair of dimensions. This creates a unique "wave combination" for each word position, ensuring (see the code sketch after the list):

  • A distinct encoding for every position
  • A scheme that works for sequences of any length
  • Encodings that are more similar for nearby positions and less similar for positions far apart
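
Here is a minimal NumPy sketch of that scheme, following the sinusoidal formulas from the original paper; the function name and shapes are my own choices:

```python
def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd dimensions."""
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]           # (1, d_model)
    # Each pair of dimensions shares one frequency, decreasing geometrically
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])        # even dimensions -> sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])        # odd dimensions  -> cosine
    return pe
```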

Fig. Positional encodings in transformer architecture

This image depicts positional encoding in transformers, which adds information about token positions to the input and output embeddings. Input tokens are embedded and combined with positional encodings to form input embeddings. For outputs, tokens are shifted right, embedded, and then combined with positional encodings to create output embeddings. This process helps the model understand the sequential order of tokens.
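As a small follow-on to the positional_encoding sketch above (the shapes and random embeddings here are purely illustrative), combining the two is just an element-wise addition:

```python
# Stand-in for a learned embedding lookup; shapes are illustrative only
seq_len, d_model = 10, 64
token_embeddings = np.random.randn(seq_len, d_model)
input_embeddings = token_embeddings + positional_encoding(seq_len, d_model)
```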

The positional encodings are added element-wise to the word embeddings, making each word not only semantically richer but also positionally informed. Oh, and let's not forget multi-head attention. It is very much like having a group of specialists each view the input from their own point of view.

Every “head” creates its own set of queries, keys, and values, so it interprets the meaning of each word in the text in its own specific way. All of these perspectives are then combined to form a holistic view of the analysed sentence, rather like getting advice from a whole panel of professionals!

Thanks to this approach, NLP has advanced at an impressive pace. Models such as BERT, GPT and T5 have left us speechless with their ability to handle translation, summarization, Q&A, and even creative writing. The prospects for NLP are bright, and we can expect such models to become even more effective and to keep enhancing human-machine communication.
