My last article revolved around the so-called *Transformer*, an innovative architecture that is particularly suited for sequence-to-sequence learning tasks such as machine translation. At its core, it abstains from using recurrent cells (e.g. LSTMs/GRUs) and solely relies on dense layers complemented by a content-based attention mechanism. This, in turn, made both the training and the prediction process much faster, as we got rid of the sequential dependencies that are responsible for the inefficient nature of RNNs.

Another interesting architecture was proposed by Facebook AI Research in their seminal paper *“Convolutional Sequence-to-Sequence Learning”* [2017, arXiv] by J. Gehring, M. Auli, D. Grangier, D. Yarats and Y. Dauphin. Instead of leveraging the aforementioned mechanisms, they embrace the use of convolutional layers [LeCun et al., 1998] while relying on a simpler variant of content-based attention. As a nice side effect, I found these parts to be much easier to implement than their *Transformer* counterparts, which makes this paper attractive for beginners to experiment with.

Fortunately, we can re-use most of the components that we have already coded. Firstly, this comprises the dataset, its conversion to *TFRecords* and the loading unit built with *TensorFlow*’s Dataset API. Secondly, as we are keen on keeping the resource requirements as low as possible, we decide to use the pre-trained *fastText* [Bojanowski et al., 2016] embeddings again. Since the paper does not specify the actual choice of the positional encodings (the vectors that are added onto the word embeddings w.r.t. their position within the sequence), we decide to use the same sinusoidal scheme that the *Transformer* already proposed. Thirdly, the interplay between encoder and decoder is exactly the same, as the decoder synthesizes one word at a time using a fixed representation of the encoded input sequence.
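For reference, the sinusoidal scheme can be sketched in a few lines of NumPy. This is my own illustration and not code from the repository; the function name and the NumPy-based formulation are assumptions, and an even embedding dimension is assumed.

```
import numpy as np

def sinusoidal_positional_encodings(max_length, d):
    """
    Returns a [max_length, d] matrix whose t-th row is added to the
    embedding of the t-th word. Assumes d is even.
    """
    positions = np.arange(max_length)[:, np.newaxis]   # [max_length, 1]
    dims = np.arange(0, d, 2)[np.newaxis, :]           # [1, d / 2]
    angles = positions / np.power(10000.0, dims / float(d))
    encodings = np.zeros((max_length, d))
    encodings[:, 0::2] = np.sin(angles)                # even dimensions
    encodings[:, 1::2] = np.cos(angles)                # odd dimensions
    return encodings
```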

As already pointed out in the last article, this should not be seen as a direct implementation of the paper. On the one hand, both the data we are using and the available computational resources differ from those in the paper, which makes it necessary to use different hyperparameters. On the other hand, we can try out interesting new concepts that the authors did not explore or that were released after the publication.

I would like to keep this article as concise as possible, which is why I will focus on the parts that differ from the *Transformer*. This makes it necessary to glance at the other article every now and then.

The code can be found on GitHub and can be easily run using Docker on Amazon EC2. An explanation on how to run the code can be found at the end of this article.

*Title photo by Christian Lambert on Unsplash*

# The Model

From a high-level point of view, the model can be dissected into two parts: the *encoder* and the *decoder*.

## Encoder

The encoder expects a sequence of embeddings $x_1, \ldots, x_T$ (one embedding for each word) that are subsequently fed through a stack of convolutional layers. Each convolution yields a new sequence of features $x_1', \ldots, x_T'$ that has exactly as many items as the input sequence (let us denote the final sequence by $\hat{x}_1, \ldots, \hat{x}_T \in \mathbb{R}^d$). By stacking the layers we can effectively increase the *receptive field*, a concept that tells us how many input elements are taken into account by a convolutional network. More specifically, a *receptive field* of $l$ means that the model can see $l$ words at once. Choosing $l$ too small makes it impossible for the model to correctly translate sentence pairs whose word order is not closely aligned, which is often the case when translating English to German (or vice versa).
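To make this concrete, here is a tiny helper (my own illustration, not part of the repository) that computes the receptive field of a stack of 1D convolutions with stride 1: every additional layer widens the window by $k - 1$ words.

```
def receptive_field(num_layers, kernel_size):
    """
    Receptive field of a stack of 1D convolutions with stride 1.
    Every layer widens the visible window by (kernel_size - 1) words.
    """
    return num_layers * (kernel_size - 1) + 1

# e.g. 5 stacked layers with kernel size 3 already see 11 consecutive words
assert receptive_field(num_layers=5, kernel_size=3) == 11
```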

## Decoder

The first stage of the decoder follows the same rationale but acts on the words of the target language (English in this article). More formally, it transforms a sequence of $T'$ word embeddings $y_1, \ldots, y_{T'}$ into a new (final) sequence $\hat{y}_1, \ldots, \hat{y}_{T'} \in \mathbb{R}^d$.

The second stage uses content-based *attention* to let the decoder glance at the *encoded* representation of the input sequence. To this end, each $\hat{x}_t$ from the encoder is compared to each $\hat{y}_{t'}$ of the decoder using a *dot* product, and a subsequent normalization of the resulting scores $w_{t, t'}$ ensures that $\sum_{t=1}^T w_{t,t'} = 1$ holds for each $t' \in \lbrace 1, \ldots, T' \rbrace$.

Assuming that $W_E \in \mathbb{R}^{T \times d}$ is the row-wise concatenation of $\hat{x}_1, \ldots, \hat{x}_T$ and $W_D \in \mathbb{R}^{T' \times d}$ analogously of $\hat{y}_1, \ldots, \hat{y}_{T'}$, the operation can be succinctly written as:
$\begin{aligned}
G &= W_E W_D^T \in \mathbb{R}^{T \times T'} \\
w_{t, t'} &= \frac{\exp(G_{t, t'})}{\sum_{k=1}^T \exp(G_{k, t'})}
\end{aligned}$
These so-called *attention scores* could then be used to compute a linearly weighted sum of the encoder’s output $\hat{x}_1, \ldots, \hat{x}_T$ for each decoded word. However, the authors decided to apply the attention to a slightly modified sequence $c_t = x_t + \hat{x}_t$ for $t \in \lbrace 1, \ldots, T \rbrace$. This idea follows the residual scheme of [He et al., 2015] and can additionally be thought of as a cheap way to obtain a new sequence (which enhances the model’s flexibility) at almost no additional cost.

The *context* vectors can be derived as follows:
$o_{t'} = \sum_{t=1}^{T} w_{t, t'} c_t \in \mathbb{R}^d$
for $t' \in \lbrace 1, \ldots, T' \rbrace$.
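The following NumPy snippet (purely illustrative, with toy dimensions of my own choosing) reproduces the formulas above: the column-wise softmax over the score matrix and the weighted sum that yields one context vector per decoded word.

```
import numpy as np

T, T_dec, d = 5, 3, 4            # encoder length, decoder length, embedding size
W_E = np.random.randn(T, d)      # stacked encoder outputs \hat{x}_1, ..., \hat{x}_T
W_D = np.random.randn(T_dec, d)  # stacked decoder outputs \hat{y}_1, ..., \hat{y}_{T'}
c = np.random.randn(T, d)        # modified encoder sequence c_t = x_t + \hat{x}_t

G = W_E @ W_D.T                                        # [T, T'] raw scores
w = np.exp(G) / np.exp(G).sum(axis=0, keepdims=True)   # column-wise softmax
o = w.T @ c                                            # [T', d] one context vector per decoded word

assert np.allclose(w.sum(axis=0), 1.0)                 # each column sums to 1
```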

**Multi-hop attention:**
Note that we transformed the *input* word embeddings $y_1, \ldots, y_{T'}$ into a new sequence of *output* vectors $o_1, \ldots, o_{T'}$ of equal dimension. This lets us repeatedly apply the *decoder* phase (keeping the encoder state fixed) by using the *output* of the last stage as the *input* of the next stage. This chain of transformations was coined “multi-hop attention” by the authors.
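In code, multi-hop attention amounts to a simple loop around the decoder layer (the *decoder_layer* function shown further below), where every hop consumes the output of the previous one while the encoder representation stays fixed. A minimal sketch with hypothetical parameter names of my own choosing:

```
def decoder_stack(y, y_length, encoder_keys, encoder_values, encoder_length,
                  num_hops, kernel_size, dropout_rate, is_training):
    """
    Applies `num_hops` decoder layers, feeding the output of each hop
    into the next one while the encoder state is kept fixed.
    """
    x = y
    for _ in range(num_hops):
        x = decoder_layer(x, y_length,
                          encoder_keys, encoder_values, encoder_length,
                          kernel_size, dropout_rate, is_training)  # [B, T', E]
    return x
```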

## Probability distribution over words

Finally, we need to project the vectors of the last decoder stage onto the size of the vocabulary. To realize this, a fully-connected layer is used that introduces two trainable parameters $W_U \in \mathbb{R}^{V \times d}$ and $b_U \in \mathbb{R}^V$: $\begin{aligned} u_{t'} &= W_U o_{t'} + b_U \in \mathbb{R}^V \\ q_{t'}^v &= \frac{\exp(u_{t'}^v)}{\sum_{k=1}^V \exp(u_{t'}^k)} \end{aligned}$ where $u_{t'}^v$ denotes the $v$-th component of $u_{t'}$ and $V$ is the vocabulary size.

This *categorical probability distribution* can now be used in conjunction with cross entropy to obtain a loss value for each sentence $\mathcal{S}$:
$\mathcal{L}(\mathcal{S}) = -\frac{1}{\lvert \mathcal{S} \rvert} \sum_{s \in \mathcal{S}} \sum_{v=1}^V y_{s}^v \log(q_{s}^v)$
where $\mathcal{S} = \lbrace (y_1, q_1), (y_2, q_2), \ldots \rbrace$ and each $y_{s}^v$ is equal to 1 if and only if the $s$-th word within the sentence corresponds to the $v$-th index in the vocabulary and 0 otherwise (the so-called *one-hot* encoding).
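In TensorFlow, the projection and the cross-entropy loss can be sketched roughly as follows. The helper below is hypothetical (not taken from the repository), it ignores padding positions for clarity, and it assumes the targets are given as integer word indices rather than explicit one-hot vectors, so the one-hot encoding and the softmax happen inside the loss op.

```
import tensorflow as tf

def sentence_loss(o, targets, V):
    """
    :param o: Decoder outputs, a tf.Tensor of shape [B, T', d]
    :param targets: Target word indices, a tf.Tensor of shape [B, T']
    :param V: Vocabulary size
    :return: Scalar loss averaged over all positions
    """
    u = tf.layers.Dense(units=V, use_bias=True)(o)  # [B, T', V], i.e. W_U o_{t'} + b_U
    # Cross entropy against the one-hot targets; the softmax is applied internally
    losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=u)  # [B, T']
    return tf.reduce_mean(losses)
```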

# Implementation

To give you a coarse overview of the model, I have left out a lot of details that are necessary to actually implement it. Let us first introduce a couple of mechanisms that are essential to get a solid understanding of the model.

# GLU activation

Its mechanism can easily be described by envisioning how a CNN layer changes the shape of some input tensor $x \in \mathbb{R}^{B \times T \times E}$. Assuming a 1D convolution, we generally get an output tensor $x' \in \mathbb{R}^{B \times T' \times C}$. By applying padding, we ensure that the number of elements within each sequence remains constant (i.e. $T = T'$). Note however that the number of *output* channels $C$ can be chosen arbitrarily. By setting $C = 2 E$ we consequently obtain an output shape of $B \times T \times 2 E$ which can be split into two halves $y, g \in \mathbb{R}^{B \times T \times E}$ along the last dimension.

Using this notation, the output of the GLU activation can be stated as $x' = \sigma(g)\ \odot\ y$. The so-called gate $g$ hence adjusts on a point-wise basis how important certain values of $y$ are (on a scale from 0 to 1, as $\sigma(z) =\frac{1}{1 + \exp(-z)} \in (0, 1)$ holds for all $z \in \mathbb{R}$).

```
import tensorflow as tf


def glu(x):
    """
    GLU activation
    :param x: A tf.Tensor of shape [B, T, 2 * E]
    :return: A tf.Tensor of shape [B, T, E]
    """
    y, g = tf.split(x, num_or_size_splits=2, axis=2)  # each [B, T, E]
    return tf.multiply(tf.nn.sigmoid(g), y)
```

# Encoder

Let us now take a look at the following illustration:

First we produce an embedding for each of the words (“Oregon”, “ist”, etc.) that is subsequently added to a corresponding positional embedding (which could be learned or fixed). Assuming that we have a filter size of $k = 3$, we need to pad the newly created sequence by $k - 1 = 2$ elements from the left side. Applying **one** CNN filter this way produces a new sequence of equal length (assuming an input shape $B \times T \times E$) with exactly **one** feature dimension (i.e. $B \times T \times 1$). To prepare the input for the *GLU* activation, we thus have to apply $2 \cdot E$ filters, which induces an output shape of $B \times T \times 2 E$.

The *GLU* activation splits this into two halves, where the second half (denoted $g$) is destined to decide how important the first half (denoted $y$) is. A *residual* connection is used which adds this result to the sum of positional and input embeddings (dashed arrow).

Since this block transforms an input tensor of shape $B \times T \times E$ into an output tensor of the same shape, we can chain several of these blocks to increase the capacity of our model.

## Implementation

An implementation of this scheme can be realized using the following code. Note that there are some components that haven’t been mentioned so far, such as *Dropout* and *Batch Normalization*, which are known to improve generalization and the learning dynamics, respectively. *Dropout* is applied in a non-standard way that *masks* out entire channels by specifying the *noise_shape*. Special attention needs to be paid to not forget to specify the *training* parameter of *Batch Normalization*. Furthermore, choosing the axis to be equal to 2 ensures that the statistics are computed over the actual feature dimension, which rules out any unwanted interference from padding.

```
def encoder_layer(x, x_length,
                  kernel_size,
                  dropout_rate, is_training):
    """
    Encoder layer
    :param x: A tf.Tensor of shape [B, T, E]
    :param x_length: A tf.Tensor of shape [B]
    :param kernel_size: Kernel size of the convolutional layer
    :param dropout_rate: How many neurons should be deactivated (between 0 and 1)
    :param is_training: Whether we are in training or prediction mode
    :return: A tf.Tensor of shape [B, T, E]
    """
    B = tf.shape(x)[0]
    T = tf.shape(x)[1]
    E = x.get_shape().as_list()[2]
    # Residual
    residual = x  # [B, T, E]
    # Pad from the left (makes the convolution causal)
    num_pad = kernel_size - 1
    x = tf.pad(x, paddings=[[0, 0],
                            [num_pad, 0],
                            [0, 0]], mode="constant")
    # Convolution
    x = tf.layers.Conv1D(filters=2 * E,
                         kernel_size=kernel_size,
                         strides=1,
                         padding="valid",
                         activation=None,
                         use_bias=True)(x)  # [B, T, 2 * E]
    # GLU activation
    x = glu(x)  # [B, T, E]
    # Mask out padding
    mask = tf.sequence_mask(lengths=x_length,
                            maxlen=T,
                            dtype=tf.float32)  # [B, T]
    mask = tf.expand_dims(mask, axis=2)  # [B, T, 1]
    x = tf.multiply(mask, x)  # [B, T, E]
    # Channel dropout
    x = tf.layers.Dropout(rate=dropout_rate, noise_shape=[B, 1, E])(x, training=is_training)  # [B, T, E]
    # Apply residual
    x = x + residual  # [B, T, E]
    # Batch normalization
    x = tf.layers.BatchNormalization(axis=2)(x, training=is_training)  # [B, T, E]
    return x
```

# Decoder

The decoder’s underpinnings can be illustrated as follows:

The first phase of the decoder is identical to the workings of the encoder: a sequence of input tensors (formed by summing the word embeddings and the positional encodings) is mapped to a more *abstract* representation of the same shape.

Let us assume that the result of the previous operation is a tensor of shape $B \times T' \times E$ and the output of the last encoder layer is of shape $B \times T \times E$. The attention then compares each vector of the encoder’s representation against each vector of the decoder’s representation, yielding $B$ matrices of shape $T \times T'$. The matrix corresponding to the first element within the batch is depicted above. Note that each column sums up to 1, which can be interpreted as how important each vector of the encoder is, conditioned on one *specific* vector of the decoder.

These scores are used to linearly weight the vectors of the encoder, which produces exactly **one** vector for each decoder vector. A residual connection lets the decoder bypass the aforementioned process.

## Attention

The attention mechanism is conceptually simpler than its *Transformer* counterpart. Instead of using multiple heads, we rely on a single one. A major complication in the *Transformer* was that attention also had to be applied in the form of a self-attention over the decoder. This made it necessary to introduce a complex masking mechanism that prevented the decoder from cheating (such as simply copying words from the future into earlier predictions).

Fortunately, this is no longer necessary as we only perform an encoder-decoder attention in which the attention is merely prohibited from assigning non-zero weights to padding symbols.

**Info:**
An implementation of the *padding_aware_softmax* function can be found in my *Transformer* article and was left out here for the sake of brevity. Its purpose is to compute attention scores in a consistent way in the presence of padding.
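In case you do not want to switch articles, the following sketch illustrates the idea; it is my own reconstruction and not necessarily identical to the version used in the repository. It pushes the logits of padded keys towards minus infinity before the softmax and zeroes out the rows that belong to padded queries.

```
import tensorflow as tf

def padding_aware_softmax(logits, query_length, key_length):
    """
    Softmax over the key dimension that assigns (almost) zero weight to padded keys.
    :param logits: A tf.Tensor of shape [B, TQ, TK]
    :param query_length: A tf.Tensor of shape [B]
    :param key_length: A tf.Tensor of shape [B]
    :return: A tf.Tensor of shape [B, TQ, TK]
    """
    TQ = tf.shape(logits)[1]
    TK = tf.shape(logits)[2]
    key_mask = tf.sequence_mask(key_length, maxlen=TK, dtype=tf.float32)      # [B, TK]
    key_mask = tf.expand_dims(key_mask, axis=1)                               # [B, 1, TK]
    query_mask = tf.sequence_mask(query_length, maxlen=TQ, dtype=tf.float32)  # [B, TQ]
    query_mask = tf.expand_dims(query_mask, axis=2)                           # [B, TQ, 1]
    # Padded keys receive a very large negative logit and hence ~0 probability
    logits = logits + (1.0 - key_mask) * -1e9
    scores = tf.nn.softmax(logits, axis=2)                                    # [B, TQ, TK]
    # Rows belonging to padded queries are irrelevant; zero them out
    return scores * query_mask
```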

```
def attention(query, key, value,
              query_length, key_length):
    """
    Single-head encoder-decoder attention
    :param query: A tf.Tensor of shape [B, TQ, E]
    :param key: A tf.Tensor of shape [B, TK, E]
    :param value: A tf.Tensor of shape [B, TK, E]
    :param query_length: A tf.Tensor of shape [B]
    :param key_length: A tf.Tensor of shape [B]
    :return: A tf.Tensor of shape [B, TQ, E]
    """
    with tf.name_scope("attention"):
        # Derive attention logits
        attention_scores = tf.matmul(query, tf.transpose(key, perm=[0, 2, 1]))  # [B, TQ, TK]
        # Normalize scores
        attention_scores = padding_aware_softmax(logits=attention_scores,
                                                 query_length=query_length,
                                                 key_length=key_length)  # [B, TQ, TK]
        # Apply scores to values
        summary = tf.matmul(attention_scores, value)  # [B, TQ, E]
        return summary
```

## Implementation

An exemplary implementation of the decoder reads as follows:

```
def decoder_layer(x, x_length,
                  encoder_keys, encoder_values, encoder_length,
                  kernel_size,
                  dropout_rate, is_training):
    """
    Decoder layer
    :param x: A tf.Tensor of shape [B, T, E]
    :param x_length: A tf.Tensor of shape [B]
    :param encoder_keys: A tf.Tensor of shape [B, T', E]
    :param encoder_values: A tf.Tensor of shape [B, T', E]
    :param encoder_length: A tf.Tensor of shape [B]
    :param kernel_size: Kernel size of the convolutional layer
    :param dropout_rate: How many neurons should be deactivated (between 0 and 1)
    :param is_training: Whether we are in training or prediction mode
    :return: A tf.Tensor of shape [B, T, E]
    """
    B = tf.shape(x)[0]
    T = tf.shape(x)[1]
    E = x.get_shape().as_list()[2]
    # Residual
    residual = x  # [B, T, E]
    # Pad from the left (makes the convolution causal)
    num_pad = kernel_size - 1
    x = tf.pad(x, paddings=[[0, 0],
                            [num_pad, 0],
                            [0, 0]], mode="constant")
    # Convolution
    x = tf.layers.Conv1D(filters=2 * E,
                         kernel_size=kernel_size,
                         strides=1,
                         padding="valid",
                         activation=None,
                         use_bias=True)(x)  # [B, T, 2 * E]
    # GLU activation
    x = glu(x)  # [B, T, E]
    # Attention
    x = attention(query=x,
                  query_length=x_length,
                  key=encoder_keys,
                  value=encoder_values,
                  key_length=encoder_length)  # [B, T, E]
    # Mask out padding
    mask = tf.sequence_mask(lengths=x_length,
                            maxlen=T,
                            dtype=tf.float32)  # [B, T]
    mask = tf.expand_dims(mask, axis=2)  # [B, T, 1]
    x = tf.multiply(mask, x)  # [B, T, E]
    # Dropout
    x = tf.layers.Dropout(rate=dropout_rate, noise_shape=[B, 1, E])(x, training=is_training)  # [B, T, E]
    # Residual connection
    x = x + residual  # [B, T, E]
    # Batch normalization
    x = tf.layers.BatchNormalization(axis=2)(x, training=is_training)  # [B, T, E]
    return x
```

The remarks already given for the encoder are also valid here.

## Deployment

The model was tested on Amazon’s Deep Learning Ubuntu Base AMI but can be run in any environment that supports Docker. I have written a couple of scripts that automatically install the dependencies.

**Warning:**
Even though I have carefully tested all the scripts, it is possible that executing them harms your system and leads to loss of data or other undesired effects. I do not take any responsibility for potential damages. Please read the scripts thoroughly before you use them!

**Warning:**
The following scripts fetch data from third-party sources. Upon execution you implicitly accept their terms of agreement and any restrictions they impose.

### Training

```
cd ConvolutionalSequenceToSequenceLearning
# Install docker, nvidia-docker, all python packages + spaCy corpora
chmod +x ec2_dl_ami_setup.sh
sh ./ec2_dl_ami_setup.sh
# Download fastText embeddings, EuroParl corpus and launch preprocessing
chmod +x download_data.sh
sh ./download_data.sh
# Start training process
chmod +x dispatch_docker.sh
sh ./dispatch_docker.sh
```

The last script generates a new *TensorFlow* checkpoint directory in the *MT_Data* folder.
You can use this to let *TensorBoard* visualize the process in real-time.

### Prediction

```
cd ConvolutionalSequenceToSequenceLearning
sudo nvidia-docker run -v `realpath MT_Data`:/data/ \
--rm convolutional_sequence_to_sequence_learning \
--mode predict \
--prediction_scheme beam \
--de_vocab /data/wiki.de_tokens.txt \
--en_vocab /data/wiki.en_tokens.txt \
--pre_trained_embedding_de /data/wiki.de_embeddings.npy \
--pre_trained_embedding_en /data/wiki.en_embeddings.npy \
--config config.yml \
--model_dir /data/TF_<<CHECKPOINT_ID>>
```

# Final words

Feel free to give me some hints on how to further improve the quality of this article.