How BERT revolutionized natural language processing

I just read a great article about Natural Language Processing (NLP), which is the science of getting a computer to understand written text.

You might be wondering: why don’t we have AIs yet that can read books and comment on them?

It’s because understanding written text is fiendishly difficult!

Here, I’ll show you.

The go-to solution for teaching computers to read is the Recurrent Neural Network (RNN):


This is a neural network with connections that loop back to the same node.

This setup allows the network to remember the data from the previous iteration and use it to adjust its predictions going forward.

To read a text, a computer first tokenizes each word into a number, and then feeds the number for each word in a sentence into the network, one word at a time.
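Here’s a minimal sketch of that idea, with a made-up vocabulary, toy sizes, and random (hypothetical) weights; the hidden state `h` is the memory that loops back from one word to the next:

```python
import numpy as np

# Tokenize: map each word to a number using a (hypothetical) vocabulary.
vocab = {"i": 0, "went": 1, "fishing": 2, "at": 3, "the": 4, "river": 5, "bank": 6}
tokens = [vocab[w] for w in "i went fishing at the river bank".split()]

rng = np.random.default_rng(0)
embed = rng.standard_normal((len(vocab), 8))   # one 8-dim vector per word
W_x = rng.standard_normal((8, 8)) * 0.1        # input weights
W_h = rng.standard_normal((8, 8)) * 0.1        # recurrent (loop-back) weights

h = np.zeros(8)                                # the network's "memory"
for t in tokens:
    # h carries information from all previous words into this step
    h = np.tanh(embed[t] @ W_x + h @ W_h)

print(h.shape)  # the final state summarizes the whole sentence
```

A real RNN would learn those weights from data; the point here is just the loop: each word updates the same state vector.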

This works, but only to a point.

Consider these two sentences:

“I went fishing at the river bank”
“The bank clerk looked at me suspiciously”

In the first sentence, ‘bank’ refers to the side of the river. In the second sentence, it refers to that building that holds our money.

An RNN would understand the first sentence, because it remembers ‘river’ and can use it to refine the meaning of ‘bank’.

But it will struggle with the second sentence, where the meaning of the word ‘bank’ is determined by the next word: ‘clerk’.

NLP researchers have tried to fix that by training neural networks on groups of words. A network that trains on words in relation to both the previous and next word in a sentence is called a bi-directional contextual network.
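One way to sketch that bi-directional idea, reusing the same toy RNN setup (hypothetical weights and sizes): run one pass left-to-right, a second pass right-to-left, and give each word the concatenation of both hidden states, so it sees both its previous and its next words.

```python
import numpy as np

rng = np.random.default_rng(1)
tokens = [0, 1, 2, 3, 4]                 # a tokenized 5-word sentence
embed = rng.standard_normal((5, 4))      # toy 4-dim word vectors
W_x = rng.standard_normal((4, 4)) * 0.1
W_h = rng.standard_normal((4, 4)) * 0.1

def rnn_pass(seq):
    """Run the toy RNN over seq, returning the hidden state at each word."""
    h, states = np.zeros(4), []
    for t in seq:
        h = np.tanh(embed[t] @ W_x + h @ W_h)
        states.append(h)
    return states

fwd = rnn_pass(tokens)                   # context from the previous words
bwd = rnn_pass(tokens[::-1])[::-1]       # context from the following words
bidir = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

print(bidir[0].shape)  # each word now carries context from both sides
```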

But even that doesn’t solve the problem. Consider these sentences:

“I spent ages crossing the river before I finally arrived at the bank”
“I spent ages crossing the road before I finally arrived at the bank”

Now the meaning of the word ‘bank’ is determined by another word 7 places earlier in the sentence!

And this touches on a big disadvantage of RNNs: they tend to forget information over time. By the time we reach the word ‘bank’, the neural network will have forgotten all about the river/road.

Last year, Google published a completely new neural network architecture for reading text called the Transformer.

Instead of reading a sentence word by word like an RNN does, the Transformer takes in the entire sentence at once. It tracks how all the words fit together, and uses that information to refine its understanding of the sentence as a whole.

Let’s say we have the following sentence:

“The animal didn’t cross the street because it was too tired”

Who was tired here? The animal or the street?

The Transformer can compare the word ‘it’ to all other words in the sentence and estimate correlations:

Here the network has discovered that ‘it’ most likely refers to ‘The animal’.
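Those correlation estimates are called attention weights. A minimal numpy sketch of self-attention, with toy sizes and random (hypothetical) weights: every word is projected to a query, a key, and a value; each word’s query is scored against every key, and a softmax turns the scores into weights over the whole sentence.

```python
import numpy as np

rng = np.random.default_rng(42)
n_words, d = 10, 16                       # a 10-word sentence, 16-dim vectors
x = rng.standard_normal((n_words, d))     # one embedding per word

W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d)             # compare every word to every other word
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
output = weights @ V                      # blend the values by attention weight

print(weights.shape)  # (10, 10): each word attends to all 10 words
```

Row *i* of `weights` is exactly the picture above for word *i*: how strongly it attends to every other word, including words that come after it.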

As you probably know by now, neural networks are just stacks of specialized network layers, and the Transformer is no different.

A Transformer is built up of a stack of special Encoder layers.

Here’s what a single encoder looks like. It’s made up of 4 sub-layers:

This simplified encoder is reading the two-word sentence “Thinking Machines”. The first Self-Attention layer in the stack is where the magic happens. This layer compares each word in a sentence to all other words.

The most powerful NLP network in the world right now is called BERT. It was published only a month ago, and it outperforms all other language processing software to date. It uses a stack of 24 encoder layers with some tweaks.
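A single such layer can be sketched in a few lines (toy sizes, random hypothetical weights): self-attention, an add-and-normalize step, a feed-forward network, and a second add-and-normalize step, i.e. the 4 sub-layers.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 8
x = rng.standard_normal((2, d))           # "Thinking Machines": two word vectors

def layer_norm(v):
    # normalize each word vector to zero mean and unit spread
    return (v - v.mean(-1, keepdims=True)) / (v.std(-1, keepdims=True) + 1e-5)

def self_attention(v):
    scores = v @ v.T / np.sqrt(d)         # compare each word to every word
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

W1 = rng.standard_normal((d, 32)) * 0.1   # feed-forward weights
W2 = rng.standard_normal((32, d)) * 0.1

h = layer_norm(x + self_attention(x))               # sub-layers 1 and 2
out = layer_norm(h + np.maximum(h @ W1, 0) @ W2)    # sub-layers 3 and 4

print(out.shape)  # (2, 8): one refined vector per word
```

Stack 24 of these (with learned weights, multiple attention heads, and BERT’s tweaks) and you get the shape of the real thing: the output has one refined vector per word, ready for the next layer.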

If you like, you can read more about BERT here:

Did I inspire you to start building NLP apps?