Introduction to Machine Translation

Introduction

Machine translation refers to the automatic translation of text from one language to another. It is a subfield of computational linguistics and artificial intelligence. The goal of machine translation is to produce translations that are fluent and accurate, conveying the meaning of the original text.

Most modern machine translation systems are based on neural networks, which have achieved state-of-the-art performance for many language pairs. These systems are trained on large amounts of parallel text data, which consists of pairs of sentences in two languages.

NLP and Machine Translation

Natural language processing (NLP) is the field of study that focuses on the interaction between computers and humans through natural language. Machine translation is one of its key applications.

Broadly speaking, NLP spans several major task families, including sequence-to-sequence modeling, text classification, text generation, and text summarization. Machine translation falls under sequence-to-sequence modeling, where the goal is to map an input sequence of words in one language to an output sequence of words in another language.

Major Concepts in NLP

Tokenization

Tokenization refers to the process of breaking down text into smaller units, such as words or subwords. This is an essential step in many NLP tasks, including machine translation, as it allows the model to process text at a more granular level.

Usually, tokenization involves splitting text on whitespace or punctuation, but more advanced methods, such as subword tokenization, can be used to handle out-of-vocabulary words.

Sometimes there are special words, a.k.a. special tokens, that signal extra information, such as the start of a sentence, the end of a sentence, or padding. A common way to handle this is to add special tokens to the input and output sequences, such as <sos> for start of sentence, <eos> for end of sentence, and <pad> for padding.

One of the most widely used libraries for NLP is the Hugging Face transformers library, which we use here on top of pytorch.
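
For example, a pre-trained BERT tokenizer from this library automatically wraps an encoded sentence with its own special tokens, [CLS] and [SEP]:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
ids = tokenizer.encode("hello world")          # special tokens are added automatically
print(tokenizer.convert_ids_to_tokens(ids))    # ['[CLS]', 'hello', 'world', '[SEP]']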

To create a custom tokenizer with the transformers library, we can subclass the PreTrainedTokenizer class and implement a few of its methods. A simple whitespace tokenizer is sketched below.

import json
from transformers import PreTrainedTokenizer

class CustomTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, **kwargs):
        # vocab_file: a JSON file mapping each token to an integer id
        with open(vocab_file, encoding="utf-8") as f:
            self.vocab = json.load(f)
        self.ids_to_tokens = {i: t for t, i in self.vocab.items()}
        super().__init__(**kwargs)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab)

    def _tokenize(self, text):
        # simple whitespace tokenization
        return text.split()

    def _convert_token_to_id(self, token):
        return self.vocab.get(token, self.vocab.get(self.unk_token, 0))

    def _convert_id_to_token(self, index):
        return self.ids_to_tokens.get(index, self.unk_token)

To add special tokens to a pre-trained tokenizer, we can use the add_special_tokens method; to simply expand the vocabulary, use add_tokens. A simple example of both is shown below.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
num_added_tokens = tokenizer.add_tokens(["new_token1", "my_new-token2"])
special_tokens_dict = {"cls_token": "[MY_CLS]"}
num_added_tokens = tokenizer.add_special_tokens(special_tokens_dict)
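
Note that when the tokenizer is used together with a pre-trained model, the model's embedding matrix needs to be resized after adding tokens, e.g. with model.resize_token_embeddings(len(tokenizer)).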

Embedding

More generally, an embedding converts an input into a dense vector representation by mapping it into a continuous vector space, where similar inputs end up close together.

Below is an example of training an embedding on the MNIST dataset:

import torch as th
import torch.nn as nn
from torchvision import datasets, transforms

# get the MNIST dataset
def get_mnist_data():
    # load the data
    mnist_train = datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor())
    mnist_test = datasets.MNIST('data', train=False, download=True, transform=transforms.ToTensor())

    # create the data loaders
    train_loader = th.utils.data.DataLoader(mnist_train, batch_size=64, shuffle=True)
    test_loader = th.utils.data.DataLoader(mnist_test, batch_size=64, shuffle=False)

    return train_loader, test_loader

class MNISTEmbedding(nn.Module):

    # To a 2 dimensional space
    # Use a two-stack of convolutional layers
    def __init__(self, input_size=28, channels_hidden=32, mlp_hidden = 128):
        super(MNISTEmbedding, self).__init__()
        # [batch, 1, x, y]
        # [batch, 1, x, y] -> [batch, channels_hidden, x, y]
        self.conv1 = nn.Conv2d(1, channels_hidden, kernel_size=3, stride=1, padding=1)
        # [batch, channels_hidden, x, y] -> [batch, channels_hidden, x, y]
        self.conv2 = nn.Conv2d(channels_hidden, channels_hidden, kernel_size=3, stride=1, padding=1)
        # [batch, channels_hidden, x, y] -> [batch, channels_hidden * x * y]
        self.flatten = nn.Flatten()
        # [batch, channels_hidden * x * y] -> [batch, 2]
        self.mlp = nn.Sequential(
            nn.Linear(channels_hidden * (input_size ** 2), mlp_hidden),
            nn.ReLU(),
            nn.Linear(mlp_hidden, mlp_hidden),
            nn.ReLU(),
            nn.Linear(mlp_hidden, 2)
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.flatten(x)
        x = self.mlp(x)
        return x

class MNISTModel(nn.Module):

    def __init__(self, input_size=28, channels_hidden=32, mlp_hidden=128, embedding_to_result_hidden = 32):
        super(MNISTModel, self).__init__()
        self.embedding = MNISTEmbedding(input_size, channels_hidden, mlp_hidden)
        self.lc_in = nn.Linear(2, embedding_to_result_hidden)
        self.relu = nn.ReLU()
        self.lc_out = nn.Linear(embedding_to_result_hidden, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.lc_in(x)
        x = self.relu(x)
        x = self.lc_out(x)
        return x

# train the model
# pick an available device ("mps" on Apple Silicon, falling back to CUDA or CPU)
device = "mps" if th.backends.mps.is_available() else ("cuda" if th.cuda.is_available() else "cpu")
train_loader, test_loader = get_mnist_data()
model = MNISTModel().to(device=device, dtype=th.float32)
optimizer = th.optim.Adam(model.parameters(), lr=1e-4)

epochs = 20
logging_steps = 400

from tqdm.notebook import tqdm, trange

for epoch in trange(epochs):
    for i, (x, y) in enumerate(tqdm(train_loader)):
        x = x.to(device=device, dtype=th.float32)
        y = y.to(device=device)  # class labels, kept as integers

        optimizer.zero_grad()
        y_pred = model(x)
        loss = th.nn.functional.cross_entropy(y_pred, y.long())
        loss.backward()
        optimizer.step()

        if i % logging_steps == 0:
            print(f"Epoch {epoch}, step {i}, loss {loss.item()}")

model = model.eval()
embedding = model.embedding

# Convert test data to embedding vectors
embeddings = []
labels = []
for x, y in tqdm(test_loader):
    x = x.to(device=device, dtype=th.float32)
    y = y.to(device=device, dtype=th.float32)
    with th.no_grad():
        e = embedding(x)
    # flatten the batch dimension
    # detach then extend
    embeddings.extend(e.detach().cpu().numpy().tolist())
    labels.extend(y.detach().cpu().numpy().tolist())

labels = list(map(lambda x: int(x), labels))
import plotly.express as px
import pandas as pd

# Plot the embeddings with plotly
df = pd.DataFrame(embeddings, columns=["x", "y"])
df["label"] = list(map(str, labels))
# labels are discrete, so we can use category
fig = px.scatter(df, x="x", y="y", color="label", opacity=0.7, category_orders={"label": [str(i) for i in range(10)]})
# enlarge the size of the graph
fig.update_layout(width=800, height=600)
fig.show()

The embedding can be visualized as shown below:

[Figure: MNIST embedding — test-set digits projected into the learned 2-D space, colored by label]

In NLP, embeddings are most commonly word embeddings: dense vector representations of words. These embeddings are trained on large amounts of text data and capture semantic and syntactic information about words.

There is a simpler way to create an embedding layer, using torch.nn.Embedding:

# Create an embedding layer with 1000 words and 100 dimensions
embedding = nn.Embedding(1000, 100)

The input of the embedding layer is the index of the word in the vocabulary, that is, a vector of integers. The output of the embedding layer is a dense vector representation of the word, which can be used as input to a neural network model.
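
A minimal lookup sketch (the indices below are arbitrary, just for illustration):

import torch
import torch.nn as nn

embedding = nn.Embedding(1000, 100)                 # vocabulary of 1000 words, 100-dim vectors
token_ids = torch.tensor([[1, 5, 42], [7, 0, 3]])   # [batch, seq_len] of word indices
vectors = embedding(token_ids)                      # [batch, seq_len, 100]
print(vectors.shape)                                # torch.Size([2, 3, 100])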

An embedding converts the input into a dense vector representation that can be fed to a neural network model. Conversely, taking the output of an intermediate layer of a trained network also gives us an embedding of the input, as in the MNIST example above.

Encoder-Decoder

Encoder and decoder are two components of a sequence-to-sequence model. The encoder takes an input sequence and encodes it into a fixed-length vector representation, which is then passed to the decoder to generate the output sequence.

For example, the encoder can take an input sentence in English and encode it into a fixed-length vector representation, which the decoder then uses to generate the output sentence token by token, usually in a loop that stops once the decoder emits the end-of-sentence token.

RNN

RNNs, or Recurrent Neural Networks, are a type of neural network that is designed to handle sequential data. They are particularly well-suited for tasks such as machine translation, where the input and output sequences are of variable length.

A simple RNN model in pytorch is shown below:

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

That is, the RNN keeps track of a hidden state and produces an output at each step: the new hidden state is computed from the current input and the previous hidden state, and the output is computed from that same combination. At every time step, both are produced together.
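
A minimal usage sketch of the RNN class above, stepping through a random sequence one element at a time (the sizes are arbitrary):

rnn = RNN(input_size=10, hidden_size=20, output_size=5)
hidden = rnn.initHidden()
seq = torch.randn(7, 1, 10)        # 7 time steps, batch of 1, 10 features each
for t in range(seq.size(0)):
    output, hidden = rnn(seq[t], hidden)
print(output.shape, hidden.shape)  # torch.Size([1, 5]) torch.Size([1, 20])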

A more commonly used model is the LSTM (Long Short-Term Memory), an improved version of the basic RNN. The LSTM adds a cell state, which allows it to remember information over long sequences.

There is also an RNN variant called the GRU (Gated Recurrent Unit), a simplified version of the LSTM that is also very useful in NLP tasks.

LSTM and GRU modules are built into pytorch as nn.LSTM and nn.GRU.
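
For example, a single-layer GRU can be used directly (the sizes here are arbitrary):

import torch
import torch.nn as nn

gru = nn.GRU(input_size=256, hidden_size=512)  # expects [seq_len, batch, input_size] by default
x = torch.randn(10, 4, 256)                    # sequence of length 10, batch of 4
h0 = torch.zeros(1, 4, 512)                    # [num_layers, batch, hidden_size]
out, hn = gru(x, h0)
print(out.shape, hn.shape)                     # torch.Size([10, 4, 512]) torch.Size([1, 4, 512])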

First Solution to Machine Translation

The first solution is an encoder-RNN-decoder model. The encoder takes the input sentence and encodes it into a fixed-length vector representation, which is then passed to the decoder to generate the output sentence.

A basic encoder-decoder model is implemented under the code, task one folder. It doesn't perform well (hardly at all, in fact), but it is a good starting point for understanding the basic concepts of machine translation.

Please note that this is an oversimplified version that does not really solve the task. No terminology-based method, as the contest requires, is used in this model; the terminology dictionary is only used to expand the model's vocabulary. This is different from the provided baseline model.

Dataloader

First, load the data.

import torch as th
from torch.utils.data import Dataset
from transformers import AutoTokenizer

class MTTrainDataset(Dataset):

    def __init__(self, train_path, dic_path):
        self.terms = [
            {"en": l.split("\t")[0], "zh": l.split("\t")[1]} for l in open(dic_path).read().split("\n")[:-1]
        ]
        self.data = [
            {"en": l.split("\t")[0], "zh": l.split("\t")[1]} for l in open(train_path).read().split("\n")[:-1]
        ]
        self.en_tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased", cache_dir="../../../cache")
        self.ch_tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-chinese", cache_dir="../../../cache")
        self.en_tokenizer.add_tokens([
            term["en"] for term in self.terms
        ])
        self.ch_tokenizer.add_tokens([
            term["zh"] for term in self.terms
        ])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index) -> dict:
        return {
            "en": self.en_tokenizer.encode(self.data[index]["en"]),
            "zh": self.ch_tokenizer.encode(self.data[index]["zh"]),
        }

    def get_raw(self, index):
        return self.data[index]

ds = MTTrainDataset("./data/train.txt", "./data/en-zh.dic")
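
Assuming the data files exist at those paths, a single example can be inspected like this:

# peek at one raw sentence pair and its tokenized form
print(ds.get_raw(0))      # {'en': '...', 'zh': '...'}
print(ds[0]["en"][:10])   # first few English token ids, starting with the [CLS] id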

Encoder

Create the encoder: it first embeds the input tokens, then runs an RNN (here a GRU) over them to encode the sequence.

# Encoder encodes the input sequence into a sequence of hidden states
class Encoder(nn.Module):

    def __init__(self, en_vocab_size, embed_dim=256, hidden_dim=1024, drop_out_rate=0.1):
        super(Encoder, self).__init__()
        self.hidden_dim = hidden_dim
        # [batch, len] -> [batch, len, embed_dim]
        self.embed = nn.Embedding(en_vocab_size, embed_dim)
        # [len, batch, embed_dim] -> [len, batch, hidden_dim], [n_layers == 1, batch, hidden_dim]
        self.gru = nn.GRU(embed_dim, hidden_dim)
        self.dropout = nn.Dropout(drop_out_rate)

    def init_hidden(self, batch_size):
        # [n_layers == 1, batch, hidden_dim]
        return th.zeros(1, batch_size, self.hidden_dim).to(device)

    def forward(self, x):
        # x: [batch, len]
        x = self.embed(x)
        x = self.dropout(x)
        h = self.init_hidden(x.size(0))
        # the GRU expects [len, batch, embed_dim]
        x = x.permute(1, 0, 2)
        out, h = self.gru(x, h)
        # back to [batch, len, hidden_dim]
        out = out.permute(1, 0, 2)
        return out, h

Decoder

Then the decoder. Note that the decoder only predicts the next token of the output sequence. In the forward function, x is the current input token (the most recent token of the partial translation), and h is the hidden state that carries the encoded information of the source sequence.

class Decoder(nn.Module):

    def __init__(self, zh_vocab_size, embed_dim=256, hidden_dim=1024, drop_out_rate=0.1) -> None:
        super().__init__()
        # [batch, len == 1] -> [batch, len == 1, embed_dim]
        self.embed = nn.Embedding(zh_vocab_size, embed_dim)
        # [batch, len == 1, embed_dim] -> [batch, len == 1, hidden_dim], [n_layers, batch, hidden_dim]
        self.gru = nn.GRU(embed_dim, hidden_dim)
        # [batch, hidden_dim] -> [batch, zh_vocab_size]
        self.fc = nn.Linear(hidden_dim, zh_vocab_size)
        self.dropout = nn.Dropout(drop_out_rate)

    def forward(self, x, h):
        x = self.embed(x)
        x = self.dropout(x)
        x = x.permute(1, 0, 2)
        x, h = self.gru(x, h)
        x = x.permute(1, 0, 2)
        x = self.fc(x.squeeze(1))
        return x, h

Seq2Seq Model

Then create the model, which is a combination of the encoder and the decoder.

class Seq2Seq(nn.Module):

    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg, src_tokenizer, trg_tokenizer, teacher_forcing_ratio=0.5):
        # src: [batch, src_len]
        # trg: [batch, target_len]
        batch_size = src.size(0)
        trg_len = trg.size(1)
        trg_vocab_size = self.decoder.fc.out_features
        outputs = th.ones(batch_size, trg_len, trg_vocab_size).mul(trg_tokenizer.cls_token_id).to(src.device)
        # encoder
        # enc_out: [batch, src_len, hidden_dim], enc_hidden: [n_layers, batch, hidden_dim]
        enc_out, enc_hidden = self.encoder(src)
        # decoder
        # dec_in: [batch]
        dec_in = trg[:, 0]
        dec_hidden = enc_hidden
        for t in range(1, trg_len):
            dec_out, dec_hidden = self.decoder(dec_in.unsqueeze(1), dec_hidden)
            # dec_out: [batch, zh_vocab_size]
            outputs[:, t] = dec_out.squeeze(1)
            # dec_in: [batch]
            dec_in = dec_out.argmax(-1)
            if th.rand(1) < teacher_forcing_ratio:
                dec_in = trg[:, t]
            if (dec_in == trg_tokenizer.sep_token_id).all():
                if t < trg_len - 1:
                    outputs[:, t+1] = trg_tokenizer.sep_token_id
                    outputs[:, t+2:] = trg_tokenizer.pad_token_id
                break
        return outputs

Teacher forcing means using the ground-truth token as the input at the next time step when generating the output sequence during training. This helps the model learn the correct translation sequence faster. During actual generation, the ratio should be set to zero so that the model generates the sequence on its own.

This code uses BERT tokenizers, so the beginning of a sentence is actually the [CLS] token and the end of a sentence is the [SEP] token. These special tokens have dedicated roles in the BERT model itself, but here we simply treat them as BOS and EOS.

Padding

Before training, pad the sequences so that they can be batched.

def collect_fn(batch):
    # pad the batch
    src = [th.tensor(item["en"]) for item in batch]
    trg = [th.tensor(item["zh"]) for item in batch]
    src = th.nn.utils.rnn.pad_sequence(src, batch_first=True, padding_value=ds.en_tokenizer.pad_token_id)
    trg = th.nn.utils.rnn.pad_sequence(trg, batch_first=True, padding_value=ds.ch_tokenizer.pad_token_id)
    return src, trg
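
A training DataLoader can then be built with this function passed as collate_fn (the batch size here is just an illustrative value, not necessarily the one used originally):

train_loader = th.utils.data.DataLoader(ds, batch_size=32, shuffle=True, collate_fn=collect_fn)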

Training

Train the model with the following code; remember to set ignore_index in the loss function so that the padding token is ignored.

def train(epochs, total = None, logging_steps=100):
    loss_logging = []
    criterion = nn.CrossEntropyLoss(ignore_index=ds.ch_tokenizer.pad_token_id)
    for epoch in trange(epochs):
        for i, (src, trg) in tqdm(enumerate(train_loader), total=total if total is not None else len(train_loader), leave=False):
            optim.zero_grad()
            src = src.to(device)
            trg = trg.to(device)
            out = model(src, trg, ds.en_tokenizer, ds.ch_tokenizer, teacher_forcing_ratio=0.5)
            # out is [batch, len, zh_vocab_size]
            # trg is [batch, len]
            loss = criterion(out.view(-1, len(ds.ch_tokenizer)), trg.view(-1))
            loss_logging.append(loss.item())
            loss.backward()
            optim.step()
            if i % logging_steps == 0:
                print(f"Epoch: {epoch}, Step: {i}, Loss: {loss.item()}")
            if total is not None and i >= total:
                break
    return loss_logging
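
For reference, the globals the training loop expects (model, optim, device) can be set up roughly as follows; the device choice, learning rate, and epoch count are assumptions, not necessarily the original settings:

device = "cuda" if th.cuda.is_available() else "cpu"    # assumed device selection

encoder = Encoder(len(ds.en_tokenizer)).to(device)       # vocab sizes include the added terminology tokens
decoder = Decoder(len(ds.ch_tokenizer)).to(device)
model = Seq2Seq(encoder, decoder).to(device)
optim = th.optim.Adam(model.parameters(), lr=1e-3)       # assumed learning rate

loss_history = train(epochs=5, logging_steps=100)        # illustrative epoch count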

Generating

def generate(src, trg):
    with th.no_grad():
        src = th.tensor(src).unsqueeze(0).to(device)
        trg = th.tensor(trg).unsqueeze(0).to(device)
        out = model(src, trg, ds.en_tokenizer, ds.ch_tokenizer, teacher_forcing_ratio=0)
    # out is [batch, len, zh_vocab_size]
    out = out.squeeze(0)
    out = out.argmax(-1)
    return ds.ch_tokenizer.decode(out.tolist())
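
For example, to translate one training sentence and compare it with the reference (purely as an illustration):

sample = ds[0]
print(generate(sample["en"], sample["zh"]))  # model output
print(ds.get_raw(0)["zh"])                   # reference translation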

Results

Well, the results are pretty bad, but it works. As long as there is a [SEP] in most of the generated outputs, it is a good sign that the model is learning to generate properly terminated sequences, despite its poor translation quality.