Language Translation with Transformer Model Using TensorFlow on E2E’s Cloud GPU Server

March 19, 2024

Transformers: Unleashing the Power of Attention

The Transformer revolutionized machine translation with an architecture built on the concept of attention. Introduced in the paper ‘Attention Is All You Need’, the Transformer replaced the recurrent and convolutional layers of earlier sequence models with this mechanism. Attention allows each word to ‘attend’ to all other words in the sentence simultaneously, capturing relationships and context across the entire sequence.

Think of it like a classroom discussion: instead of each student waiting for their turn to speak, everyone can participate simultaneously, enriching the conversation with diverse perspectives. Similarly, Transformers process words in parallel, leading to faster and more efficient information flow.

This parallelization isn't just about speed; it unlocks the ability to capture long-range dependencies. Unlike RNNs, where information fades with distance, Transformers can directly connect distant words, understanding how seemingly unrelated parts contribute to overall meaning. Imagine analyzing a complex sentence with multiple clauses and references. Transformers can seamlessly navigate these connections, achieving superior translation accuracy.

Also, attention by itself makes no assumptions about the temporal or spatial order of its inputs, which makes Transformers well suited to processing sets of objects, such as the units in a game scenario. For text, where order matters, position information is supplied separately through positional encodings. By harnessing attention, Transformers have become the leading architecture for many natural language processing tasks.
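
To make the idea of every word attending to every other word concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation described in the paper. The function names and the toy input sizes (4 tokens, 8 features) are illustrative assumptions, not part of the tutorial's model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q: (seq_q, d_k), k: (seq_k, d_k), v: (seq_k, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)       # (seq_q, seq_k)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))               # toy "sentence": 4 tokens, 8 features
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape)    # (4, 8)
print(weights.shape)   # (4, 4): one weight for every pair of tokens
```

Every token's output is a weighted average over all tokens, computed in a single matrix product, which is why the whole sequence can be processed in parallel.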

E2E’s GPU Cloud: An Overview

The most effective way to grasp Neural Machine Translation is hands-on practice, and the environment you choose for that practice plays a large role in mastering complex architectures. Among the many GPU cloud providers available, the right one can improve both cost-efficiency and productivity. After researching the options, I chose E2E Cloud for its balance between cost-effectiveness and accessibility. It also provides ready-made setups for the required environments, which saves valuable time. For this hands-on session, I used the TIR-AI Platform within E2E Cloud. To set up a similar environment, follow this guide: https://www.e2enetworks.com/blog/how-to-use-jupyter-notebooks-on-e2e-networks .

Boost your training efficiency with NVIDIA NGC pipelines and E2E’s Cloud GPUs. Leveraging pre-built, optimized pipelines from NGC can significantly accelerate your Transformer model training process. E2E's Cloud GPUs are specifically designed to harness the power of NGC, providing seamless compatibility and maximizing performance. This potent combination makes E2E a top choice for developers seeking to unlock the full potential of Transformer-based machine translation while minimizing training time and resources.

Let’s Play: Crafting a Transformer Model for Seamless Portuguese-to-English Translation

Ditch the dictionary and build your own AI-powered language bridge! This tutorial teaches you how to craft a Transformer model for seamless Portuguese-to-English translation.

Install the packages needed for NMT with pip, the Python package installer (the cuDNN library is installed with apt). In a Jupyter notebook, prefix each shell command with an exclamation mark, as illustrated below:


!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2
!pip uninstall -y -q tensorflow keras tensorflow-estimator tensorflow-text
!pip install protobuf~=3.20.3
!pip install -q tensorflow_datasets
!pip install -q -U tensorflow-text tensorflow

Next, import the necessary packages by executing:


import numpy as np
import matplotlib.pyplot as plt


import tensorflow_datasets as tfds
import tensorflow as tf
import tensorflow_text

Let’s set up the data pipeline using TFDS (TensorFlow Datasets).


examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en',
                               with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

We can peek into the data to understand it using:


# Show some examples
for pt_examples, en_examples in train_examples.batch(3).take(1):
  print('> Examples in Portuguese:')
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))
  print()
  print('> Examples in English:')
  for en in en_examples.numpy():
    print(en.decode('utf-8'))

Before diving into the exciting world of machine translation with Transformer models, it's crucial to understand how we prepare the text for these powerful algorithms. This is where tokenization comes in.

Think of tokenization as a linguistic chef carefully chopping up a sentence into smaller pieces, called tokens. These tokens can be individual words, smaller pieces of words (subwords), or even individual characters, depending on the chosen method.

In our case, we're using a special type of tokenizer called a subword tokenizer. This tool is specifically designed to optimize text for language models like Transformers. Why subwords? Because they offer a sweet spot between individual words and characters:

More granular than words: Subwords can capture smaller nuances within words, like prefixes and suffixes, which are crucial for accurate translation.
Less numerous than characters: Unlike individual characters, subwords create a manageable vocabulary size, making the training process more efficient.
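
To illustrate how a subword tokenizer splits a word, here is a minimal sketch of greedy longest-match splitting in the WordPiece style that BERT tokenizers use. The toy vocabulary and the function name are hypothetical; the real tokenizers loaded below ship with their own learned vocabularies.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching pieces found in vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                       # no piece matched at all
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"translat", "##ion", "##ing", "##or", "book", "##s"}
print(wordpiece_tokenize("translation", toy_vocab))  # ['translat', '##ion']
print(wordpiece_tokenize("books", toy_vocab))        # ['book', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

A handful of pieces covers several related words here, which is how subwords keep the vocabulary small while still handling words that never appeared whole in the training data.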

To handle both Portuguese and English effectively, we've employed two separate BertTokenizer objects, each trained on its respective language. This ensures each language is treated with the appropriate understanding of its unique grammar and vocabulary.


# Download and load the pre-built subword tokenizers
model_name = 'ted_hrlr_translate_pt_en_converter'
tf.keras.utils.get_file(
    f'{model_name}.zip',
    f'https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip',
    cache_dir='.', cache_subdir='', extract=True
)
tokenizers = tf.saved_model.load(model_name)
MAX_TOKENS=128
BUFFER_SIZE = 20000
BATCH_SIZE = 256
def prepare_batch(pt, en):
    pt = tokenizers.pt.tokenize(pt)      # Output is ragged.
    pt = pt[:, :MAX_TOKENS]    # Trim to MAX_TOKENS.
    pt = pt.to_tensor()  # Convert to 0-padded dense Tensor
    en = tokenizers.en.tokenize(en)
    en = en[:, :(MAX_TOKENS+1)]
    en_inputs = en[:, :-1].to_tensor()  # Drop the [END] tokens
    en_labels = en[:, 1:].to_tensor()   # Drop the [START] tokens
    return (pt, en_inputs), en_labels
def make_batches(ds):
  return (
      ds
      .shuffle(BUFFER_SIZE)
      .batch(BATCH_SIZE)
      .map(prepare_batch, tf.data.AUTOTUNE)
      .prefetch(buffer_size=tf.data.AUTOTUNE))
train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)
for (pt, en), en_labels in train_batches.take(1):
  break
print(pt.shape)
print(en.shape)
print(en_labels.shape)
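
The shift between en_inputs and en_labels in prepare_batch is what lets the decoder learn to predict the next token at every position. The sketch below reproduces that shift on plain Python lists; the token IDs and the pad helper are hypothetical stand-ins for the tokenizer output and to_tensor().

```python
import numpy as np

START, END, PAD = 2, 3, 0   # hypothetical IDs for [START], [END] and padding

# Two tokenized English sentences of different lengths (a "ragged" batch).
sentences = [[START, 11, 12, 13, END],
             [START, 21, 22, END]]

inputs = [s[:-1] for s in sentences]   # drop [END]: what the decoder reads
labels = [s[1:] for s in sentences]    # drop [START]: what it must predict

def pad(rows, pad_id=PAD):
    # Right-pad every row to the longest row, like RaggedTensor.to_tensor().
    width = max(len(r) for r in rows)
    return np.array([r + [pad_id] * (width - len(r)) for r in rows])

en_inputs_demo = pad(inputs)
en_labels_demo = pad(labels)
print(en_inputs_demo)   # [[ 2 11 12 13] [ 2 21 22  0]]
print(en_labels_demo)   # [[11 12 13  3] [21 22  3  0]]
```

At each position the label is the token that follows the corresponding input token, so the model is trained on every prefix of the sentence at once.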

While the paper ‘Attention Is All You Need’ offers a deep dive into the theoretical underpinnings of the Transformer architecture, let's embark on a more practical journey. Forget dense equations and academic jargon - get ready to code this powerhouse architecture yourself!

The diagram below gives a high-level overview of the Transformer's structure, but the real work lies in building it. We'll break down the key components, understand how they interact, and then translate that understanding into code. By the end, you'll have your own functional Transformer model, ready to tackle translation.

Positional Embedding:


def positional_encoding(length, depth):
  depth = depth/2
  positions = np.arange(length)[:, np.newaxis]     # (seq, 1)
  depths = np.arange(depth)[np.newaxis, :]/depth   # (1, depth)
  angle_rates = 1 / (10000**depths)         # (1, depth)
  angle_rads = positions * angle_rates      # (pos, depth)
  pos_encoding = np.concatenate(
      [np.sin(angle_rads), np.cos(angle_rads)],
      axis=-1)
  return tf.cast(pos_encoding, dtype=tf.float32)
pos_encoding = positional_encoding(length=2048, depth=512)
class PositionalEmbedding(tf.keras.layers.Layer):
  def __init__(self, vocab_size, d_model):
    super().__init__()
    self.d_model = d_model
    self.embedding = tf.keras.layers.Embedding(vocab_size, d_model, mask_zero=True)
    self.pos_encoding = positional_encoding(length=2048, depth=d_model)
  def compute_mask(self, *args, **kwargs):
    return self.embedding.compute_mask(*args, **kwargs)
  def call(self, x):
    length = tf.shape(x)[1]
    x = self.embedding(x)
    # This factor sets the relative scale of the embedding and positional_encoding.
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x = x + self.pos_encoding[tf.newaxis, :length, :]
    return x
embed_pt = PositionalEmbedding(vocab_size=tokenizers.pt.get_vocab_size(), d_model=512)
embed_en = PositionalEmbedding(vocab_size=tokenizers.en.get_vocab_size(), d_model=512)
pt_emb = embed_pt(pt)
en_emb = embed_en(en)
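
To see what the positional encoding produces, the sketch below re-implements the same computation in NumPy only (positional_encoding_np is a hypothetical name for this variant) and checks two properties on a small example: position 0 maps to all zeros in the sine half and all ones in the cosine half, and every row has the same norm because each sine/cosine pair contributes exactly 1.

```python
import numpy as np

def positional_encoding_np(length, depth):
    # NumPy-only variant of the function above; returns an array, not a tensor.
    half = depth // 2
    positions = np.arange(length)[:, np.newaxis]        # (length, 1)
    depths = np.arange(half)[np.newaxis, :] / half      # (1, half)
    angle_rates = 1 / (10000 ** depths)                 # (1, half)
    angle_rads = positions * angle_rates                # (length, half)
    # Sines fill the first half of the vector, cosines the second half.
    return np.concatenate([np.sin(angle_rads), np.cos(angle_rads)], axis=-1)

pe = positional_encoding_np(length=50, depth=16)
print(pe.shape)                      # (50, 16)
print(pe[0])                         # eight zeros followed by eight ones
row_norms = np.linalg.norm(pe, axis=-1)
print(np.allclose(row_norms, np.sqrt(8)))   # True
```

Note that this implementation concatenates the sine and cosine halves rather than interleaving them as in the paper's formula; the two layouts differ only by a fixed permutation of the feature dimensions.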

Attention:


# Attention layers
class BaseAttention(tf.keras.layers.Layer):
  def __init__(self, **kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    self.layernorm = tf.keras.layers.LayerNormalization()
    self.add = tf.keras.layers.Add()
class CrossAttention(BaseAttention):
  def call(self, x, context):
    attn_output, attn_scores = self.mha(
        query=x,
        key=context,
        value=context,
        return_attention_scores=True)
    # Cache the attention scores for plotting later.
    self.last_attn_scores = attn_scores
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x
class GlobalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x
class CausalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x,
        use_causal_mask = True)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x
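
The only difference between GlobalSelfAttention and CausalSelfAttention is the causal mask, which stops a position from looking at later positions. Here is a minimal NumPy sketch of the effect, assuming random toy scores; it illustrates the masking idea and is not how Keras implements it internally.

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))    # toy attention scores

# Lower-triangular mask: position i may attend to positions 0..i only.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
masked_scores = np.where(causal_mask, scores, -np.inf)

# Softmax over the key dimension; exp(-inf) = 0 removes the masked entries.
masked_scores = masked_scores - masked_scores.max(axis=-1, keepdims=True)
causal_weights = np.exp(masked_scores)
causal_weights = causal_weights / causal_weights.sum(axis=-1, keepdims=True)

print(np.round(causal_weights, 2))   # everything above the diagonal is 0
```

The first row always puts all its weight on the first token, since that is the only position it is allowed to see.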

Feed Forward Layer:


# Feed Forward Block
class FeedForward(tf.keras.layers.Layer):
  def __init__(self, d_model, dff, dropout_rate=0.1):
    super().__init__()
    self.seq = tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),
      tf.keras.layers.Dense(d_model),
      tf.keras.layers.Dropout(dropout_rate)
    ])
    self.add = tf.keras.layers.Add()
    self.layer_norm = tf.keras.layers.LayerNormalization()
  def call(self, x):
    x = self.add([x, self.seq(x)])
    x = self.layer_norm(x)
    return x
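
The feed-forward block follows an expand, project back, add, normalize pattern. The NumPy sketch below walks through those steps with random weights; it omits dropout and the learned scale and offset of layer normalization, and the eps value and all names are assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-3):
    # Normalize each position's feature vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward_block(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)   # expand to dff with ReLU
    projected = hidden @ w2 + b2            # project back to d_model
    return layer_norm(x + projected)        # residual connection, then norm

rng = np.random.default_rng(0)
d_model_demo, dff_demo = 8, 32              # toy sizes
x = rng.normal(size=(3, d_model_demo))      # 3 positions
w1 = rng.normal(size=(d_model_demo, dff_demo))
w2 = rng.normal(size=(dff_demo, d_model_demo))
ffn_out = feed_forward_block(x, w1, np.zeros(dff_demo), w2, np.zeros(d_model_demo))
print(ffn_out.shape)   # (3, 8): same shape as the input
```

Because the output shape matches the input shape, these blocks can be stacked freely, which is what the encoder and decoder layers below rely on.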

Encoder:


# Encoder
class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self,*, d_model, num_heads, dff, dropout_rate=0.1):
    super().__init__()
    self.self_attention = GlobalSelfAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)
    self.ffn = FeedForward(d_model, dff)
  def call(self, x):
    x = self.self_attention(x)
    x = self.ffn(x)
    return x
class Encoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads,
               dff, vocab_size, dropout_rate=0.1):
    super().__init__()
    self.d_model = d_model
    self.num_layers = num_layers
    self.pos_embedding = PositionalEmbedding(
        vocab_size=vocab_size, d_model=d_model)
    self.enc_layers = [
        EncoderLayer(d_model=d_model,
                     num_heads=num_heads,
                     dff=dff,
                     dropout_rate=dropout_rate)
        for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
  def call(self, x):
    # x is token-IDs shape: (batch, seq_len)
    x = self.pos_embedding(x)  # Shape (batch_size, seq_len, d_model).
    # Add dropout.
    x = self.dropout(x)
    for i in range(self.num_layers):
      x = self.enc_layers[i](x)
    return x  # Shape (batch_size, seq_len, d_model).

Decoder:


# Decoder
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self,
               *,
               d_model,
               num_heads,
               dff,
               dropout_rate=0.1):
    super().__init__()
    self.causal_self_attention = CausalSelfAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)
    self.cross_attention = CrossAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)
    self.ffn = FeedForward(d_model, dff)
  def call(self, x, context):
    x = self.causal_self_attention(x=x)
    x = self.cross_attention(x=x, context=context)
    # Cache the last attention scores for plotting later
    self.last_attn_scores = self.cross_attention.last_attn_scores
    x = self.ffn(x)  # Shape (batch_size, seq_len, d_model).
    return x
class Decoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads, dff, vocab_size,
               dropout_rate=0.1):
    super().__init__()
    self.d_model = d_model
    self.num_layers = num_layers
    self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size,
                                             d_model=d_model)
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    self.dec_layers = [
        DecoderLayer(d_model=d_model, num_heads=num_heads,
                     dff=dff, dropout_rate=dropout_rate)
        for _ in range(num_layers)]
    self.last_attn_scores = None
  def call(self, x, context):
    # x is token-IDs shape (batch, target_seq_len)
    x = self.pos_embedding(x)  # (batch_size, target_seq_len, d_model)
    x = self.dropout(x)
    for i in range(self.num_layers):
      x  = self.dec_layers[i](x, context)
    self.last_attn_scores = self.dec_layers[-1].last_attn_scores
    # The shape of x is (batch_size, target_seq_len, d_model).
    return x

The Final Transformer Model (tying up all the pieces):


# Final Transformer Architecture
class Transformer(tf.keras.Model):
  def __init__(self, *, num_layers, d_model, num_heads, dff,
               input_vocab_size, target_vocab_size, dropout_rate=0.1):
    super().__init__()
    self.encoder = Encoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=input_vocab_size,
                           dropout_rate=dropout_rate)
    self.decoder = Decoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=target_vocab_size,
                           dropout_rate=dropout_rate)
    self.final_layer = tf.keras.layers.Dense(target_vocab_size)
  def call(self, inputs):
    # To use a Keras model with .fit you must pass all your inputs in the
    # first argument.
    context, x  = inputs
    context = self.encoder(context)  # (batch_size, context_len, d_model)
    x = self.decoder(x, context)  # (batch_size, target_len, d_model)
    # Final linear layer output.
    logits = self.final_layer(x)  # (batch_size, target_len, target_vocab_size)
    try:
      # Drop the keras mask, so it doesn't scale the losses/metrics.
      # b/250038731
      del logits._keras_mask
    except AttributeError:
      pass
    # Return the final output and the attention weights.
    return logits
num_layers = 3
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1
transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=tokenizers.pt.get_vocab_size().numpy(),
    target_vocab_size=tokenizers.en.get_vocab_size().numpy(),
    dropout_rate=dropout_rate)

Let’s train the model:


class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super().__init__()
    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)
    self.warmup_steps = warmup_steps
  def __call__(self, step):
    step = tf.cast(step, dtype=tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)
    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
learning_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)
def masked_loss(label, pred):
  mask = label != 0
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
  loss = loss_object(label, pred)
  mask = tf.cast(mask, dtype=loss.dtype)
  loss *= mask
  loss = tf.reduce_sum(loss)/tf.reduce_sum(mask)
  return loss
def masked_accuracy(label, pred):
  pred = tf.argmax(pred, axis=2)
  label = tf.cast(label, pred.dtype)
  match = label == pred
  mask = label != 0
  match = match & mask
  match = tf.cast(match, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(match)/tf.reduce_sum(mask)
transformer.compile(
    loss=masked_loss,
    optimizer=optimizer,
    metrics=[masked_accuracy])
transformer.fit(train_batches,
                epochs=200,
                validation_data=val_batches)
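
The learning-rate schedule is easier to picture with a few numbers. This plain-Python sketch evaluates the same formula as CustomSchedule, using the d_model=128 and warmup_steps=4000 values from above; transformer_lr is a hypothetical helper name.

```python
def transformer_lr(step, d_model=128, warmup_steps=4000):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = float(step)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate grows linearly during warmup, then decays as 1/sqrt(step).
for s in (1, 1000, 4000, 16000, 64000):
    print(f"step {s:>6}: lr = {transformer_lr(s):.6f}")
```

The rate peaks at step 4000, where the two branches of the minimum meet, at roughly 1.4e-3 for these settings. Quadrupling the step count after that point halves the learning rate.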

Congratulations! Your Transformer model has successfully navigated the training journey. Now, the thrill of putting it to the test arrives! Prepare to witness its translation prowess in action.

But the exploration doesn't stop there. We delve deeper to comprehend the inner workings of the model. By visualizing the attention heads, we gain insights into how it processes languages, identifies crucial connections, and ultimately generates translations.

Imagine unlocking a secret window into the model's thought process, observing how it analyzes each word, its relationship to others, and how that understanding shapes the translated output. This unveils the intricate dance of attention, allowing us to appreciate the model's brilliance and identify potential areas for further improvement.


class Translator(tf.Module):
  def __init__(self, tokenizers, transformer):
    self.tokenizers = tokenizers
    self.transformer = transformer
  def __call__(self, sentence, max_length=MAX_TOKENS):
    # The input sentence is Portuguese, hence adding the [START] and [END] tokens.
    assert isinstance(sentence, tf.Tensor)
    if len(sentence.shape) == 0:
      sentence = sentence[tf.newaxis]
    sentence = self.tokenizers.pt.tokenize(sentence).to_tensor()
    encoder_input = sentence
    # As the output language is English, initialize the output with the
    # English [START] token.
    start_end = self.tokenizers.en.tokenize([''])[0]
    start = start_end[0][tf.newaxis]
    end = start_end[1][tf.newaxis]
    # tf.TensorArray is required here (instead of a Python list), so that the
    # dynamic-loop can be traced by tf.function.
    output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
    output_array = output_array.write(0, start)
    for i in tf.range(max_length):
      output = tf.transpose(output_array.stack())
      predictions = self.transformer([encoder_input, output], training=False)
      # Select the last token from the seq_len dimension.
      predictions = predictions[:, -1:, :]  # Shape (batch_size, 1, vocab_size).
      predicted_id = tf.argmax(predictions, axis=-1)
      # Concatenate the predicted_id to the output which is given to the
      # decoder as its input.
      output_array = output_array.write(i+1, predicted_id[0])
      if predicted_id == end:
        break
    output = tf.transpose(output_array.stack())
    # The output shape is (1, tokens).
    text = self.tokenizers.en.detokenize(output)[0]  # Shape: ().
    tokens = self.tokenizers.en.lookup(output)[0]
    # tf.function prevents us from using the attention_weights that were
    # calculated on the last iteration of the loop.
    # So, recalculate them outside the loop.
    self.transformer([encoder_input, output[:,:-1]], training=False)
    attention_weights = self.transformer.decoder.last_attn_scores
    return text, tokens, attention_weights
translator = Translator(tokenizers, transformer)
def print_translation(sentence, tokens, ground_truth):
  print(f'{"Input":15s}: {sentence}')
  print(f'{"Prediction":15s}: {tokens.numpy().decode("utf-8")}')
  print(f'{"Ground truth":15s}: {ground_truth}')
sentence = 'este é o primeiro livro que eu fiz.'
ground_truth = "this is the first book i've ever done."
translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)
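
Stripped of the TensorFlow details, the Translator performs greedy decoding: start with the [START] token, repeatedly append the highest-scoring next token, and stop at [END] or at the length limit. The sketch below shows that loop with a hypothetical toy scoring function standing in for the trained Transformer.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_length):
    """Generate token IDs one at a time, always taking the best-scoring one."""
    output = [start_id]
    for _ in range(max_length):
        scores = next_token_scores(output)           # one score per vocab entry
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        output.append(next_id)
        if next_id == end_id:
            break
    return output

# Toy stand-in for the model: always prefers "previous token ID + 1".
def toy_scores(prefix, vocab_size=6):
    target = min(prefix[-1] + 1, vocab_size - 1)
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

print(greedy_decode(toy_scores, start_id=0, end_id=3, max_length=10))
# [0, 1, 2, 3]
```

Greedy decoding is simple and fast, but it commits to one token at each step; alternatives such as beam search keep several candidate sequences and are not covered here.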

Let’s visualize the attention heads for the above example:


def plot_attention_head(in_tokens, translated_tokens, attention):
  # The model didn't generate <start> in the output. Skip it.
  translated_tokens = translated_tokens[1:]
  ax = plt.gca()
  ax.matshow(attention)
  ax.set_xticks(range(len(in_tokens)))
  ax.set_yticks(range(len(translated_tokens)))
  labels = [label.decode('utf-8') for label in in_tokens.numpy()]
  ax.set_xticklabels(
      labels, rotation=90)
  labels = [label.decode('utf-8') for label in translated_tokens.numpy()]
  ax.set_yticklabels(labels)
head = 0
# Shape: (batch=1, num_heads, seq_len_q, seq_len_k).
attention_heads = tf.squeeze(attention_weights, 0)
attention = attention_heads[head]
in_tokens = tf.convert_to_tensor([sentence])
in_tokens = tokenizers.pt.tokenize(in_tokens).to_tensor()
in_tokens = tokenizers.pt.lookup(in_tokens)[0]
plot_attention_head(in_tokens, translated_tokens, attention)

While exploring, one attention head offered a glimpse into the model's inner workings – but it's merely a single piece of the puzzle. To truly grasp its intricate language processing, we need to unveil the grand tapestry of all attention heads.

Think of it like trying to understand a complex painting by examining just one brushstroke. By studying the interplay of all attention heads, we gain a holistic view of how the model analyzes relationships between words, identifies key connections, and ultimately guides the translation process.

Each head acts as a unique lens, focusing on different aspects of the input sentence. It's by combining these diverse perspectives that the model paints a full picture of the meaning and generates nuanced translations.
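
To show where the per-head attention maps come from, here is a simplified NumPy sketch that splits the feature dimension into heads and computes one weight matrix per head. It is an illustration under stated assumptions: Keras's MultiHeadAttention applies separate learned projections for each head, which this sketch skips, and the function names and toy sizes are hypothetical.

```python
import numpy as np

def split_heads(x, num_heads):
    # (seq_len, d_model) -> (num_heads, seq_len, d_model // num_heads)
    seq_len, d_model = x.shape
    depth = d_model // num_heads
    return x.reshape(seq_len, num_heads, depth).transpose(1, 0, 2)

def multi_head_weights(q, k, num_heads):
    qh, kh = split_heads(q, num_heads), split_heads(k, num_heads)
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(qh.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)               # (heads, seq_q, seq_k)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 16))    # 5 output tokens (toy)
k = rng.normal(size=(7, 16))    # 7 input tokens (toy)
head_weights = multi_head_weights(q, k, num_heads=8)
print(head_weights.shape)       # (8, 5, 7): one 5x7 attention map per head
```

Each of the 8 slices is a separate map of output tokens against input tokens, which matches the layout of the attention_weights plotted in the next snippet once the batch dimension is removed.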


def plot_attention_weights(sentence, translated_tokens, attention_heads):
  in_tokens = tf.convert_to_tensor([sentence])
  in_tokens = tokenizers.pt.tokenize(in_tokens).to_tensor()
  in_tokens = tokenizers.pt.lookup(in_tokens)[0]
  fig = plt.figure(figsize=(16, 8))
  for h, head in enumerate(attention_heads):
    ax = fig.add_subplot(2, 4, h+1)
    plot_attention_head(in_tokens, translated_tokens, head)
    ax.set_xlabel(f'Head {h+1}')
  plt.tight_layout()
  plt.show()
plot_attention_weights(sentence,
                       translated_tokens,
                       attention_weights[0])


sentence = 'Eu li sobre triceratops na enciclopédia.'
ground_truth = 'I read about triceratops in the encyclopedia.'
translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)
plot_attention_weights(sentence, translated_tokens, attention_weights[0])

Conclusion

This exploration has taken us on a thrilling journey through the world of Transformer-based machine translation. You've witnessed the power of attention, delved into the model's inner workings, and gained valuable insights into its translation prowess.

But remember, this is just the beginning. The true potential of your NMT model lies in its ability to scale and translate real-world data efficiently. This is where the combined power of E2E Cloud GPUs and NVIDIA NGC comes into play.

E2E's Cloud GPUs offer a robust and scalable platform specifically designed for AI workloads like NMT. The GPUs, coupled with the optimized pipelines and tools available through NVIDIA NGC, significantly accelerate training and inference, allowing you to handle larger datasets and achieve faster translation speeds.

Imagine translating massive volumes of text, powering real-time communication platforms, or enabling multilingual content creation – all with the efficiency and scalability provided by E2E and NGC.

So, don't let your exploration end here. Leverage the power of GPUs to push the boundaries of machine translation, unlock new possibilities, and bridge the gap between languages like never before.

Code

The GitHub code for this article can be found here: https://github.com/Lord-Axy/Atricle-Transformer_MT

Akshayraj Madhubalan

Transformers: Unleashing the Power of Attention

E2E’s GPU Cloud: An Overview

Let’s Play: Crafting a Transformer Model for Seamless Portuguese-to-English Translation

Ditch the dictionary and build your own AI-powered language bridge! This tutorial teaches you how to craft a Transformer model for seamless Portuguese-to-English translation.

To employ the needed packages for NMT, installation can be accomplished via the Python package installer, PIP. In a Jupyter notebook, utilize the magic command as illustrated below:


!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2
!pip uninstall -y -q tensorflow keras tensorflow-estimator tensorflow-text
!pip install protobuf~=3.20.3
!pip install -q tensorflow_datasets
!pip install -q -U tensorflow-text tensorflow

Next, import the necessary packages by executing:


import numpy as np
import matplotlib.pyplot as plt


import tensorflow_datasets as tfds
import tensorflow as tf
import tensorflow_text

Let’s set up the data pipeline using TFDS (Tensorflow Datasets).


examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en',
                               with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

We can peek into the data to understand it using:


some examples
for pt_examples, en_examples in train_examples.batch(3).take(1):
  print('> Examples in Portuguese:')
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))
  print()
  print('> Examples in English:')
  for en in en_examples.numpy():
    print(en.decode('utf-8'))

More granular than words: Subwords can capture smaller nuances within words, like prefixes and suffixes, which are crucial for accurate translation.
Less numerous than characters: Unlike individual characters, subwords create a manageable vocabulary size, making the training process more efficient.


some examples
model_name = 'ted_hrlr_translate_pt_en_converter'
tf.keras.utils.get_file(
    f'{model_name}.zip',
    f'https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip',
    cache_dir='.', cache_subdir='', extract=True
)
tokenizers = tf.saved_model.load(model_name)
MAX_TOKENS=128
BUFFER_SIZE = 20000
BATCH_SIZE = 256
def prepare_batch(pt, en):
    pt = tokenizers.pt.tokenize(pt)      # Output is ragged.
    pt = pt[:, :MAX_TOKENS]    # Trim to MAX_TOKENS.
    pt = pt.to_tensor()  # Convert to 0-padded dense Tensor
    en = tokenizers.en.tokenize(en)
    en = en[:, :(MAX_TOKENS+1)]
    en_inputs = en[:, :-1].to_tensor()  # Drop the [END] tokens
    en_labels = en[:, 1:].to_tensor()   # Drop the [START] tokens
    return (pt, en_inputs), en_labels
def make_batches(ds):
  return (
      ds
      .shuffle(BUFFER_SIZE)
      .batch(BATCH_SIZE)
      .map(prepare_batch, tf.data.AUTOTUNE)
      .prefetch(buffer_size=tf.data.AUTOTUNE))
train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)
for (pt, en), en_labels in train_batches.take(1):
  break
print(pt.shape)
print(en.shape)
print(en_labels.shape)

Positional Embedding:


def positional_encoding(length, depth):
  depth = depth/2
  positions = np.arange(length)[:, np.newaxis]     # (seq, 1)
  depths = np.arange(depth)[np.newaxis, :]/depth   # (1, depth)
  angle_rates = 1 / (10000**depths)         # (1, depth)
  angle_rads = positions * angle_rates      # (pos, depth)
  pos_encoding = np.concatenate(
      [np.sin(angle_rads), np.cos(angle_rads)],
      axis=-1)
  return tf.cast(pos_encoding, dtype=tf.float32)
pos_encoding = positional_encoding(length=2048, depth=512)
class PositionalEmbedding(tf.keras.layers.Layer):
  def init(self, vocab_size, d_model):
    super().init()
    self.d_model = d_model
    self.embedding = tf.keras.layers.Embedding(vocab_size, d_model, mask_zero=True)
    self.pos_encoding = positional_encoding(length=2048, depth=d_model)
  def compute_mask(self, *args, **kwargs):
    return self.embedding.compute_mask(*args, **kwargs)
  def call(self, x):
    length = tf.shape(x)[1]
    x = self.embedding(x)
    # This factor sets the relative scale of the embedding and positonal_encoding.
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x = x + self.pos_encoding[tf.newaxis, :length, :]
    return x
embed_pt = PositionalEmbedding(vocab_size=tokenizers.pt.get_vocab_size(), d_model=512)
embed_en = PositionalEmbedding(vocab_size=tokenizers.en.get_vocab_size(), d_model=512)
pt_emb = embed_pt(pt)
en_emb = embed_en(en)

Attention:


Attention layers
class BaseAttention(tf.keras.layers.Layer):
  def init(self, **kwargs):
    super().init()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    self.layernorm = tf.keras.layers.LayerNormalization()
    self.add = tf.keras.layers.Add()
class CrossAttention(BaseAttention):
  def call(self, x, context):
    attn_output, attn_scores = self.mha(
        query=x,
        key=context,
        value=context,
        return_attention_scores=True)
    # Cache the attention scores for plotting later.
    self.last_attn_scores = attn_scores
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x
class GlobalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x
class CausalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x,
        use_causal_mask = True)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x

Feed Forward Layer:


Feed Forward Block
class FeedForward(tf.keras.layers.Layer):
  def init(self, d_model, dff, dropout_rate=0.1):
    super().init()
    self.seq = tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),
      tf.keras.layers.Dense(d_model),
      tf.keras.layers.Dropout(dropout_rate)
    ])
    self.add = tf.keras.layers.Add()
    self.layer_norm = tf.keras.layers.LayerNormalization()
  def call(self, x):
    x = self.add([x, self.seq(x)])
    x = self.layer_norm(x)
    return x

Encoder:


Encoder
class EncoderLayer(tf.keras.layers.Layer):
  def init(self,*, d_model, num_heads, dff, dropout_rate=0.1):
    super().init()
    self.self_attention = GlobalSelfAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)
    self.ffn = FeedForward(d_model, dff)
  def call(self, x):
    x = self.self_attention(x)
    x = self.ffn(x)
    return x
class Encoder(tf.keras.layers.Layer):
  def init(self, *, num_layers, d_model, num_heads,
               dff, vocab_size, dropout_rate=0.1):
    super().init()
    self.d_model = d_model
    self.num_layers = num_layers
    self.pos_embedding = PositionalEmbedding(
        vocab_size=vocab_size, d_model=d_model)
    self.enc_layers = [
        EncoderLayer(d_model=d_model,
                     num_heads=num_heads,
                     dff=dff,
                     dropout_rate=dropout_rate)
        for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
  def call(self, x):
    # x is token-IDs shape: (batch, seq_len)
    x = self.pos_embedding(x)  # Shape (batch_size, seq_len, d_model).
    # Add dropout.
    x = self.dropout(x)
    for i in range(self.num_layers):
      x = self.enc_layers[i](x)
    return x  # Shape (batch_size, seq_len, d_model).

Decoder:


Decoder
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self,
               *,
               d_model,
               num_heads,
               dff,
               dropout_rate=0.1):
    super(DecoderLayer, self).__init__()
    self.causal_self_attention = CausalSelfAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)
    self.cross_attention = CrossAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)
    self.ffn = FeedForward(d_model, dff)
  def call(self, x, context):
    x = self.causal_self_attention(x=x)
    x = self.cross_attention(x=x, context=context)
    # Cache the last attention scores for plotting later
    self.last_attn_scores = self.cross_attention.last_attn_scores
    x = self.ffn(x)  # Shape (batch_size, seq_len, d_model).
    return x
class Decoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads, dff, vocab_size,
               dropout_rate=0.1):
    super(Decoder, self).__init__()
    self.d_model = d_model
    self.num_layers = num_layers
    self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size,
                                             d_model=d_model)
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    self.dec_layers = [
        DecoderLayer(d_model=d_model, num_heads=num_heads,
                     dff=dff, dropout_rate=dropout_rate)
        for _ in range(num_layers)]
    self.last_attn_scores = None
  def call(self, x, context):
    # x is token-IDs shape (batch, target_seq_len)
    x = self.pos_embedding(x)  # (batch_size, target_seq_len, d_model)
    x = self.dropout(x)
    for i in range(self.num_layers):
      x  = self.dec_layers[i](x, context)
    self.last_attn_scores = self.dec_layers[-1].last_attn_scores
    # The shape of x is (batch_size, target_seq_len, d_model).
    return x

The Final Transformer Model (tying up all the pieces):


Final Transformer Architecture
class Transformer(tf.keras.Model):
  def __init__(self, *, num_layers, d_model, num_heads, dff,
               input_vocab_size, target_vocab_size, dropout_rate=0.1):
    super().__init__()
    self.encoder = Encoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=input_vocab_size,
                           dropout_rate=dropout_rate)
    self.decoder = Decoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=target_vocab_size,
                           dropout_rate=dropout_rate)
    self.final_layer = tf.keras.layers.Dense(target_vocab_size)
  def call(self, inputs):
    # To use a Keras model with .fit you must pass all your inputs in the
    # first argument.
    context, x  = inputs
    context = self.encoder(context)  # (batch_size, context_len, d_model)
    x = self.decoder(x, context)  # (batch_size, target_len, d_model)
    # Final linear layer output.
    logits = self.final_layer(x)  # (batch_size, target_len, target_vocab_size)
    try:
      # Drop the keras mask, so it doesn't scale the losses/metrics.
      # b/250038731
      del logits._keras_mask
    except AttributeError:
      pass
    # Return the final output and the attention weights.
    return logits
num_layers = 3
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1
transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=tokenizers.pt.get_vocab_size().numpy(),
    target_vocab_size=tokenizers.en.get_vocab_size().numpy(),
    dropout_rate=dropout_rate)

Let’s train the model:


class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super().__init__()
    self.d_model = tf.cast(d_model, tf.float32)
    self.warmup_steps = warmup_steps
  def __call__(self, step):
    step = tf.cast(step, dtype=tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)
    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
learning_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)
def masked_loss(label, pred):
  mask = label != 0
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
  loss = loss_object(label, pred)
  mask = tf.cast(mask, dtype=loss.dtype)
  loss *= mask
  loss = tf.reduce_sum(loss)/tf.reduce_sum(mask)
  return loss
def masked_accuracy(label, pred):
  pred = tf.argmax(pred, axis=2)
  label = tf.cast(label, pred.dtype)
  match = label == pred
  mask = label != 0
  match = match & mask
  match = tf.cast(match, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(match)/tf.reduce_sum(mask)
transformer.compile(
    loss=masked_loss,
    optimizer=optimizer,
    metrics=[masked_accuracy])
transformer.fit(train_batches,
                epochs=200,
                validation_data=val_batches)
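To get a feel for the learning rate that `CustomSchedule` produces, the following plain-Python sketch evaluates the same formula, `d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)`, at a few steps. The function name `transformer_lr` is hypothetical, and steps are assumed to start at 1 because step 0 would divide by zero here.

```python
def transformer_lr(step, d_model=128, warmup_steps=4000):
    """Learning rate at a 1-based training step, mirroring CustomSchedule."""
    arg1 = step ** -0.5                  # inverse square-root decay
    arg2 = step * warmup_steps ** -1.5   # linear warm-up
    return d_model ** -0.5 * min(arg1, arg2)

# Print the rate during warm-up, at the peak, and during decay.
for step in (1, 1000, 4000, 16000):
    print(f"step {step:>6}: lr = {transformer_lr(step):.3e}")
```

The rate grows linearly for the first `warmup_steps` steps, peaks at step 4000, and then decays with the inverse square root of the step, so quadrupling the step count after the peak halves the rate.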

Congratulations! Your Transformer model has successfully navigated the training journey. Now, the thrill of putting it to the test arrives! Prepare to witness its translation prowess in action.


class Translator(tf.Module):
  def __init__(self, tokenizers, transformer):
    self.tokenizers = tokenizers
    self.transformer = transformer
  def __call__(self, sentence, max_length=MAX_TOKENS):
    # The input sentence is Portuguese, hence adding the [START] and [END] tokens.
    assert isinstance(sentence, tf.Tensor)
    if len(sentence.shape) == 0:
      sentence = sentence[tf.newaxis]
    sentence = self.tokenizers.pt.tokenize(sentence).to_tensor()
    encoder_input = sentence
    # As the output language is English, initialize the output with the
    # English [START] token.
    start_end = self.tokenizers.en.tokenize([''])[0]
    start = start_end[0][tf.newaxis]
    end = start_end[1][tf.newaxis]
    # tf.TensorArray is required here (instead of a Python list), so that the
    # dynamic-loop can be traced by tf.function.
    output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
    output_array = output_array.write(0, start)
    for i in tf.range(max_length):
      output = tf.transpose(output_array.stack())
      predictions = self.transformer([encoder_input, output], training=False)
      # Select the last token from the seq_len dimension.
      predictions = predictions[:, -1:, :]  # Shape (batch_size, 1, vocab_size).
      predicted_id = tf.argmax(predictions, axis=-1)
      # Concatenate the predicted_id to the output which is given to the
      # decoder as its input.
      output_array = output_array.write(i+1, predicted_id[0])
      if predicted_id == end:
        break
    output = tf.transpose(output_array.stack())
    # The output shape is (1, tokens).
    text = tokenizers.en.detokenize(output)[0]  # Shape: ().
    tokens = tokenizers.en.lookup(output)[0]
    # tf.function prevents us from using the attention_weights that were
    # calculated on the last iteration of the loop.
    # So, recalculate them outside the loop.
    self.transformer([encoder_input, output[:,:-1]], training=False)
    attention_weights = self.transformer.decoder.last_attn_scores
    return text, tokens, attention_weights
translator = Translator(tokenizers, transformer)
def print_translation(sentence, tokens, ground_truth):
  print(f'{"Input:":15s}: {sentence}')
  print(f'{"Prediction":15s}: {tokens.numpy().decode("utf-8")}')
  print(f'{"Ground truth":15s}: {ground_truth}')
sentence = 'este é o primeiro livro que eu fiz.'
ground_truth = "this is the first book i've ever done."
translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)
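The loop inside `Translator` is a greedy decoder: at every step it feeds the tokens produced so far back into the model, picks the highest-scoring next token, and stops at `[END]` or after `max_length` steps. The sketch below shows that control flow without TensorFlow. `toy_next_token_scores` is a made-up stand-in for the trained Transformer that looks words up in a three-entry dictionary, so only the loop structure carries over to the real model.

```python
START, END = "[START]", "[END]"

def toy_next_token_scores(source_tokens, output_tokens):
    """Hypothetical stand-in for the model: score every candidate next token."""
    lexicon = {"este": "this", "é": "is", "livro": "book"}
    position = len(output_tokens) - 1  # words emitted so far, not counting START
    if position < len(source_tokens):
        target = lexicon[source_tokens[position]]
    else:
        target = END
    vocab = list(lexicon.values()) + [END]
    return {tok: (1.0 if tok == target else 0.0) for tok in vocab}

def greedy_decode(source_tokens, score_fn, max_length=10):
    """Repeatedly append the single best next token until END or max_length."""
    output = [START]
    for _ in range(max_length):
        scores = score_fn(source_tokens, output)
        best = max(scores, key=scores.get)  # argmax over the vocabulary
        output.append(best)
        if best == END:
            break
    return output

result = greedy_decode(["este", "é", "livro"], toy_next_token_scores)
print(result)
```

Greedy decoding is simple and fast, but it commits to one token per step and never revisits that choice. Beam search is a common alternative when translation quality matters more than speed.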

Let’s visualize the attention heads for the above example:


def plot_attention_head(in_tokens, translated_tokens, attention):
  # The model didn't generate <start> in the output. Skip it.
  translated_tokens = translated_tokens[1:]
  ax = plt.gca()
  ax.matshow(attention)
  ax.set_xticks(range(len(in_tokens)))
  ax.set_yticks(range(len(translated_tokens)))
  labels = [label.decode('utf-8') for label in in_tokens.numpy()]
  ax.set_xticklabels(
      labels, rotation=90)
  labels = [label.decode('utf-8') for label in translated_tokens.numpy()]
  ax.set_yticklabels(labels)
head = 0
# Shape: (batch=1, num_heads, seq_len_q, seq_len_k).
attention_heads = tf.squeeze(attention_weights, 0)
attention = attention_heads[head]
in_tokens = tf.convert_to_tensor([sentence])
in_tokens = tokenizers.pt.tokenize(in_tokens).to_tensor()
in_tokens = tokenizers.pt.lookup(in_tokens)[0]
plot_attention_head(in_tokens, translated_tokens, attention)


def plot_attention_weights(sentence, translated_tokens, attention_heads):
  in_tokens = tf.convert_to_tensor([sentence])
  in_tokens = tokenizers.pt.tokenize(in_tokens).to_tensor()
  in_tokens = tokenizers.pt.lookup(in_tokens)[0]
  fig = plt.figure(figsize=(16, 8))
  for h, head in enumerate(attention_heads):
    ax = fig.add_subplot(2, 4, h+1)
    plot_attention_head(in_tokens, translated_tokens, head)
    ax.set_xlabel(f'Head {h+1}')
  plt.tight_layout()
  plt.show()
plot_attention_weights(sentence,
                       translated_tokens,
                       attention_weights[0])


sentence = 'Eu li sobre triceratops na enciclopédia.'
ground_truth = 'I read about triceratops in the encyclopedia.'
translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)
plot_attention_weights(sentence, translated_tokens, attention_weights[0])

Conclusion

Imagine translating massive volumes of text, powering real-time communication platforms, or enabling multilingual content creation – all with the efficiency and scalability provided by E2E and NGC.

So, don't let your exploration end here. Leverage the power of GPUs to push the boundaries of machine translation, unlock new possibilities, and bridge the gap between languages like never before.

Code

The GitHub code for this article can be found here: https://github.com/Lord-Axy/Atricle-Transformer_MT
