Computer Science 401
10 February 2024
St. George Campus
University of Toronto

Homework Assignment #2
Due: Friday, 8 March 2024 at 23h59m (11:59 PM)

Neural Machine Translation (MT) using Transformers

Email: csc401-2024-01-a2@cs.toronto.edu. Please prefix your subject with [CSC401 W24-A2]. All accessibility and extension requests should be directed to csc401-2024-01@cs.toronto.edu.

Lead TAs: Julia Watson and Arvid Frydenlund.
Instructors: Gerald Penn, Sean Robertson and Raeid Saqur.
Building the Transformer from Scratch
It is quite a cliché to say that the transformer architecture has revolutionized technology. It is the building block that fuelled recent headline innovations such as ChatGPT. Yet among the many people who claim to work on large language models, only a select and brilliant few truly understand their internal workings. And you, who come from that prestigious northern place called the University of Toronto, must carry on the century-old tradition of being able to create a transformer-based language model using only pen and paper.
Organization You will build the transformer model from beginning to end, and train it to do some basic but effective machine translation tasks using the Canadian Hansards data. In §1, we guide you through implementing all the building blocks of a transformer model. In §2, you will use the building blocks to put together the transformer architecture. §3 discusses greedy and beam search for generating decoded target sentences. In §4, you'll train and evaluate the model. Finally, in §5, you'll use the trained model to do some real machine translation and write up your analysis report.
Goal By the end of this assignment, you will acquire a low-level understanding of the transformer architecture and of the implementation techniques behind the entire data processing → training → evaluation pipeline of a functioning AI application.
Starter Code Unlike A1, the starter code for this assignment is distributed through MarkUs. The training data is located at /u/cs401/A2/data/Hansard on teach.cs. We use Python version 3.10.13 on teach.cs; that is, just run everything with the default python3 command. You may need to add srun commands to request computational resources; please follow the instructions in the following sections. You shouldn't need to set up a new virtual environment or install any packages on teach.cs. You can work on the assignment on your local machine, but you must make sure that your code works on teach.cs. Any test cases that fail due to incompatibility issues will not receive partial marks.
Marking Scheme Please see the A2_rubric.pdf file for a detailed breakdown of the marking scheme.
Copyright © 2024 University of Toronto. All rights reserved.
1 Transformer Building Blocks and Components [12 Marks]
Let's start with the three types of building blocks of a transformer: the layer norm, the multi-head attention and the feed-forward modules (a.k.a. the MLP weights).
LayerNorm The normalization layer computes the following. Given an input representation h, the normalization layer computes its mean µ and standard deviation σ. Then, it outputs the normalized features:

h ← γ (h − µ) / (σ + ε) + β    (1)

Using the instructions, please complete the LayerNorm.forward method.
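As a sanity check, Eq. (1) can be sketched in a few lines of PyTorch. The parameter names gamma, beta and eps below are assumptions for illustration, not necessarily the names used in the starter code.

```python
import torch

def layer_norm_forward(h, gamma, beta, eps=1e-5):
    # Normalize over the feature (last) dimension, as in Eq. (1).
    # gamma/beta stand in for the module's learned scale and shift.
    mu = h.mean(dim=-1, keepdim=True)                    # per-position mean
    sigma = h.std(dim=-1, keepdim=True, unbiased=False)  # per-position std
    return gamma * (h - mu) / (sigma + eps) + beta
```

Note that, per Eq. (1), ε is added to σ itself rather than inside a square root of the variance.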
FeedForwardLayer The feed-forward layer is a two-layer fully connected feed-forward network. As shown in the following equations, the input representation h is fed through two fully connected layers. Dropout is applied after each layer, and ReLU is the activation function:

h ← dropout(ReLU(W₁h + b₁))
h ← dropout(W₂h + b₂)    (2)

Using the instructions, please complete the FeedForwardLayer.forward method.
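Eq. (2) maps directly onto a few lines of PyTorch. Here w1 and w2 are assumed to be nn.Linear modules; the starter code's own attribute names may differ.

```python
import torch
import torch.nn as nn

def feed_forward(h, w1, w2, p_drop=0.1):
    # Sketch of Eq. (2); w1, w2 are assumed nn.Linear modules
    # (d_model -> d_ff and d_ff -> d_model respectively).
    drop = nn.Dropout(p_drop)
    h = drop(torch.relu(w1(h)))  # W1 h + b1, then ReLU, then dropout
    h = drop(w2(h))              # W2 h + b2, then dropout
    return h
```

Note that ReLU is applied only after the first layer, while dropout follows both, matching Eq. (2).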
MultiHeadAttention Finally, you need to implement the most complicated but most important component of the transformer architecture: the multi-head attention module. For the base case where there is only H = 1 head, the attended output is calculated using the regular cross-attention algorithm:

dropout(softmax(QK⊤ / √d_k)) V    (3)

Using the instructions, please complete the MultiHeadAttention.attention method. Then, you need to implement the part where the query, key and value are split into H heads and passed through the regular cross-attention algorithm you have just implemented. Next, you should combine the results. Don't forget to apply the linear combination and dropout when you output the final attended representations. Using the instructions, please complete the MultiHeadAttention.forward method.
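For reference, the single-head case of Eq. (3) can be sketched as follows. The masking convention (True entries blocked) and the signature are assumptions; follow the starter code's doc strings for the exact interface.

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None, p_drop=0.0):
    # Sketch of Eq. (3) for a single head.
    # q, k, v: (batch, seq_len, d_k); True entries in `mask` are blocked.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, q_len, k_len)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.dropout(torch.softmax(scores, dim=-1), p=p_drop)
    return weights @ v
```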
2 Putting the Architecture (Enc-Dec) Together [20 Marks]
OK, now that we have the building blocks, let's put everything together and create the full transformer model. We will start with a single transformer encoder layer and a single decoder layer. Next, we build the complete encoder and the complete decoder by stacking the layers. Finally, we connect the encoder with the decoder and complete the final transformer encoder–decoder model.
TransformerEncoderLayer You need to implement two types of encoder layers. Pre-layer normalization (Figure 1a), as its name suggests, applies layer normalization before the representation is fed into the next module. On the other hand, post-layer normalization (Figure 1b) applies layer normalization after the representation is fed through the module. Using the instructions, please complete the pre_layer_norm_forward and post_layer_norm_forward methods of the TransformerEncoderLayer class.
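The difference between the two orderings boils down to where the normalization sits relative to the residual connection. A minimal sketch, where norm and sublayer stand in for any LayerNorm module and any sublayer (attention or feed-forward):

```python
import torch

def pre_ln_block(h, norm, sublayer):
    # Pre-LN (Figure 1a): normalize, apply the sublayer, then add the residual.
    return h + sublayer(norm(h))

def post_ln_block(h, norm, sublayer):
    # Post-LN (Figure 1b): apply the sublayer, add the residual, then normalize.
    return norm(h + sublayer(h))
```

Each full encoder layer chains two such blocks: one around the attention module and one around the feed-forward module.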
[Figure 1: Two types of TransformerEncoderLayer. (a) Pre-layer normalization for encoders: input h → Layer Norm → Multi-head Attention (+ residual) → Layer Norm → Feed-Forward (+ residual) → output h. (b) Post-layer normalization for encoders: input h → Multi-head Attention (+ residual) → Layer Norm → Feed-Forward (+ residual) → Layer Norm → output h.]
TransformerEncoder You don't need to implement the encoder class; the starter code contains the implementation. Nonetheless, it will be helpful to read it, as it will be a good reference for the following tasks.
TransformerDecoderLayer Again, you need to implement both pre- and post-layer normalization. Recall from lecture that there are two multi-head attention blocks in decoders: the first is a self-attention block and the second is a cross-attention block. Using the instructions, please complete the pre_layer_norm_forward and post_layer_norm_forward methods of the TransformerDecoderLayer class.
[Figure 2: Two types of TransformerDecoderLayer. (a) Pre-layer normalization for decoders: input h → Layer Norm → Self Attention (+ residual) → Layer Norm → Cross Attention (+ residual) → Layer Norm → Feed-Forward (+ residual) → output h. (b) Post-layer normalization for decoders: input h → Self Attention (+ residual) → Layer Norm → Cross Attention (+ residual) → Layer Norm → Feed-Forward (+ residual) → Layer Norm → output h.]
TransformerDecoder Similar to TransformerEncoder, you should pass the input through all the decoder layers. Make sure to add the LayerNorm module in the correct place depending on whether the model uses pre- or post-layer normalization. Finally, don't forget to use the logit projection module on the final output.
Using the instructions, please complete the TransformerDecoder.forward method.
TransformerEncoderDecoder After making sure that the encoder and the decoder are both built properly, it's time to put them together. You need to implement the following methods:
• The method create_pad_mask is the helper function that builds the mask marking the padded positions of a sequence.
• The method create_causal_mask is the helper function that creates a causal (upper-triangular) mask.
• After finishing the two helper methods, you can implement the forward method that connects all the dots. In particular, you first create all the appropriate masks for the inputs. Then, you feed them through the encoder. And, finally, you get the final result by feeding everything through the decoder.
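The two mask helpers above can be sketched as follows. The output shapes and the convention that True means "blocked" are assumptions for illustration; check the starter code's doc strings for the exact contract.

```python
import torch

def create_pad_mask(seq, pad_idx):
    # True where the token is padding; the extra dim broadcasts over
    # the query positions of the attention scores.
    return (seq == pad_idx).unsqueeze(1)  # (batch, 1, seq_len)

def create_causal_mask(seq_len):
    # True strictly above the diagonal: position i may not attend to j > i.
    ones = torch.ones(seq_len, seq_len, dtype=torch.bool)
    return torch.triu(ones, diagonal=1)
```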
3 MT with Transformers: Greedy and Beam-Search [20 Marks]
The a2_transformer_model.py file contains all the required functions to complete, along with detailed instructions (and hints). Here we list the high-level methods/functions that you need to complete.
3.1 Greedy Decode
Let's warm up by implementing the greedy algorithm. At each decoding step, compute the (log) probability over all the possible tokens. Then, choose the output with the highest probability and repeat the process until all the sequences in the current mini-batch terminate.
Using the instructions, please complete the TransformerEncoderDecoder.greedy_decode method.
3.2 Beam Search
Now, it's time for perhaps the hardest part of the assignment: beam search. But don't worry, we have broken everything down into smaller, much simpler chunks. We will guide you step by step through the entire algorithm.
Beam search is initiated by a call to the TransformerEncoderDecoder.beam_search_decode method. Recall from the lecture that its job is to generate partial translations (or hypotheses) from the source tokens during the decoding phase. So, the beam_search_decode method gets called whenever you are trying to generate decoded translations (e.g., from TransformerRunner.translate or TransformerRunner.compute_average_bleu_over_dataset). Complete the following functions in the TransformerEncoderDecoder class:
1. initialize_beams_for_beam_search: This function initializes the beam search by taking the first decoder step and using the top-k outputs to initialize the beams.
2. expand_encoder_for_beam_search: Beam search processes batches of size batch_size * k, so we need to expand the encoder outputs so that we can process the beams in parallel. (Tip: You should call this from within the preceding function.)
3. repeat_and_reshape_for_beam_search: A relatively simple expand-and-reshape function. See how it's called from the beam_search_decode method and read the instructions in the function comments. (Tip: Review torch.Tensor.expand.)
4. pad_and_score_sequence_for_beam_search: This function pads the sequences with eos and the seq_log_probs with corresponding zeros. It then gets the score of each sequence by summing the log probabilities. See how it's called from the beam_search_decode method and read the instructions in the function comments.
5. finalize_beams_for_beam_search: Finally, as its name suggests, this function takes a list of top beams of length batch_size, where each element is a tensor of some length, and returns a padded tensor of the top beams. It produces the final return value of the beam_search_decode method; see how the return is consumed by the calling functions (e.g., translate).
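The expand-and-reshape idea behind steps 2 and 3 can be sketched in isolation. The function name here is illustrative, not the starter code's:

```python
import torch

def expand_for_beams(x, k):
    # Repeat each batch element k times so that beams can be processed
    # as one big batch of size batch_size * k (cf. Tensor.expand).
    batch, *rest = x.shape
    x = x.unsqueeze(1).expand(batch, k, *rest)  # (batch, k, ...), no copy
    return x.reshape(batch * k, *rest)          # (batch * k, ...)
```

Using expand rather than repeat avoids copying memory until the final reshape, which is why the tip points you at torch.Tensor.expand.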
4 MT with Transformers: Training and Testing [20 Marks]
4.1 Calculating BLEU scores
Modify a2_bleu_score.py to be able to calculate BLEU scores on single reference and candidate strings. We will be using the definition of the BLEU score from the lecture slides:

BLEU = BP_C × (p₁ p₂ · · · pₙ)^(1/n)
To do this, you will need to implement the functions grouper(...), n_gram_precision(...), brevity_penalty(...), and BLEU_score(...). Make sure to carefully follow the doc strings of each function. Do not re-implement functionality that is clearly performed by some other function.
Your functions will operate on sequences (e.g., lists) of tokens. These tokens could be the words themselves
(strings) or an integer ID corresponding to the words. Your code should be agnostic to the type of token
used, though you can assume that both the reference and candidate sequences will use tokens of the same
type.
Please scale the BLEU score by 100.
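As a rough illustration of the formula, here is a simplified sketch. It uses unclipped n-gram hits and assumes the brevity penalty exp(1 − r/c) when the candidate is shorter than the reference; the assignment's exact counting rules are in the doc strings and take precedence.

```python
import math
from collections import Counter

def n_gram_prec(ref, cand, n):
    # Fraction of candidate n-grams that also occur in the reference.
    # (Simplified: no clipped counts.)
    ref_grams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_grams = [tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)]
    if not cand_grams:
        return 0.0
    hits = sum(1 for g in cand_grams if ref_grams[g] > 0)
    return hits / len(cand_grams)

def bleu_score(ref, cand, n):
    # BLEU = BP * (p1 ... pn)^(1/n), scaled by 100.
    brevity = len(ref) / len(cand)
    bp = 1.0 if brevity < 1 else math.exp(1 - brevity)
    prod = 1.0
    for i in range(1, n + 1):
        prod *= n_gram_prec(ref, cand, i)
    return 100 * bp * prod ** (1 / n)
```

Because the functions operate on token sequences, they work equally well with string tokens or integer ids, as required above.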
4.2 The Training Loop
Before you start working on this part of the assignment, have a good look at the train function. It describes how the training loop works. For each epoch, we first train the model using all the data from the training set. Then, we evaluate the model's performance upon the completion of the epoch. We repeat the process until we reach the specified maximum number of epochs. For this part of the assignment, you will need to implement the following methods:
train_for_epoch The training of one epoch contains seven steps:
1. Iterate through the training dataloader to obtain the current mini-batch.
2. Send the data tensors to the appropriate device (CPU or GPU).
3. Call train_input_target_split to prepare the data for teacher-forcing training.
4. Feed the data through the transformer model and collect the logits.
5. Compute the loss value using the loss function.
6. Call loss.backward() to compute the gradients.
7. Call train_step_optimizer_and_scheduler to update the model using the optimizer and step the scheduler.
Note: You should handle gradient accumulation (using the accum_iter parameter) for full marks.
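Steps 4 through 7 with gradient accumulation can be sketched as follows. The function name and arguments here are placeholders for the runner's own objects, and the scheduler is omitted for brevity:

```python
import torch

def run_training_steps(model, loss_fn, batches, optimizer, accum_iter=2):
    # Scale the loss by accum_iter and only step the optimizer every
    # accum_iter batches, so the effective batch size is multiplied.
    for i, (x, y) in enumerate(batches, start=1):
        logits = model(x)                       # step 4: forward pass
        loss = loss_fn(logits, y) / accum_iter  # step 5: scaled loss
        loss.backward()                         # step 6: accumulate grads
        if i % accum_iter == 0:
            optimizer.step()                    # step 7: update weights
            optimizer.zero_grad()
```

Dividing the loss by accum_iter makes the accumulated gradient an average over the accumulated mini-batches rather than a sum.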
train_input_target_split This method splits the target tokens into input and target for maximum-likelihood training (teacher forcing).
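The split itself is a one-liner; a sketch (with an illustrative function name):

```python
import torch

def input_target_split(tgt):
    # Decoder input: the target without its last token; prediction target:
    # the target without its first token (the start-of-sequence symbol).
    return tgt[:, :-1], tgt[:, 1:]
```

With teacher forcing, the model sees the gold prefix at every position and is trained to predict the next token.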
train_step_optimizer_and_scheduler This method steps the optimizer, zeros out the gradients, and steps the scheduler.
compute_batch_total_bleu This function computes the total BLEU score for each n-gram level in n_gram_levels over the elements in a batch. Note: You need to clean up the sequences by removing ALL special tokens (sos_idx, eos_idx, pad_idx).
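The clean-up step amounts to a simple filter; a sketch with an illustrative helper name:

```python
def strip_special(seq, special_ids):
    # Drop every special token id (e.g. sos_idx, eos_idx, pad_idx)
    # before computing BLEU on the sequence.
    return [tok for tok in seq if tok not in special_ids]
```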
4.3 Run Model, Run!
OK, enough coding. Let’s actually do some model training and deployment. In this part of the assignment,
you will (1) train models with different settings, (2) evaluate those models on the hold-out test set, and
(3) deploy the model and do some actual translation.
4.3.1 Training the Models
Debugging and Development Similar to A1, we provide a --tiny-preset option to make your model train on only a tiny sliver of the dataset. You should use this option with a smaller toy model for debugging during your development phase. Training this toy model using the CPU on teach.cs will take approximately 10 seconds to complete. The following command is just an example; you should modify it to test different components of your code.
srun -p csc401 --pty \
python3 a2_main.py train \
model_test.pt \
--tiny-preset \
--source-lang f \
--with-transformer \
--encoder-num-hidden-layers 2 \
--word-embedding-size 20 \
--transformer-ff-size 31 \
--with-post-layer-norm \
--batch-size 5 \
--gradient-accumulation 2 \
--skip-eval 0 \
--epochs 2 \
--device cpu
The srun -p csc401 --pty part of the command requests a compute node on teach.cs. This will provide you with dedicated CPUs and reduce congestion during peak times. The --pty flag enables pseudo-terminal mode, which allows your breakpoints to work properly. However, if you want to run the assignment locally, you should remove it.
Training the Final Models with GPUs After ensuring that your code works properly, you should train your model using both the pre- and post-layer normalization settings. Each epoch of training should take approximately 2 minutes, though this time may vary depending on GPU availability.
Note: You are not required to train the model with post-layer normalization, but you must ensure that post-layer normalization is implemented correctly. The correctness of both pre- and post-layer normalization will carry equal weight during the evaluation of the code. For the analysis and training, only the pre-layer normalization setting is required.
# Train the model using pre-layer-norm
srun -p csc401 --gres gpu \
python3 -u a2_main.py train \
transformer_model.pt \
--device cuda
Then, include a printout of the training logs in analysis.pdf. You have the option to use WandB for this part of the assignment; see Appendix B for more information.
4.3.2 Evaluate the Models
Now, evaluate the models using the following commands.
# Test the model using pre-layer-norm
srun -p csc401 --gres gpu \
python3 -u a2_main.py test \
transformer_model.pt \
--device cuda
5 Analysis [8 Marks]
You are all done! Now that you've completed all the parts needed for training your transformer end-to-end and testing it, you're ready to finish this assignment with some analysis. Report the evaluation results (from the preceding section) in your LaTeX report file: analysis.pdf.
Is there a difference between BLEU-3 and BLEU-4? What do you think could be the reason behind the differences? Write your answer in analysis.pdf.
5.1 Let's Translate Some Sentences!
In the TransformerRunner.translate method, you are tasked with processing a "raw" input sentence through several stages to obtain its translation. You need to (1) tokenize the sentence, (2) convert the tokens into ordinal ids, (3) feed the ids into your model, and (4) finally, convert the output of the model into an actual sentence. You can load your trained model with the following command.
srun -p csc401 --pty python3 a2_main.py interact transformer_model.pt
An interactive Python prompt will start, and you can start translating with the translate function. You can then utilize the model by interacting with it in the prompt as shown.
Trained model from path Your Model.pt loaded as the object `model`
Python 3.10.13 (main, Sep 19 2023, 12:09:15) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> translate("Mon nom est Frank.")
’<s> frank is frank frank frank </s>’
For this part of the assignment, you will translate a few sentences from French to English using your model, a fine-tuned pre-trained transformer model (T5 MT model or Bart MT model), and a large, established model (Google Translate or ChatGPT). First, you need to translate the following eight sentences into English using your model:
Note: It's OK to run the model with the greedy decoder if you haven't completed the implementation of beam search. You won't incur any penalty for doing so. To enable greedy decoding, simply add a --greedy flag at the end of the command.
Note: French accent marks (é, ç, à, ô, ï) are removed in the Canadian Hansards dataset and, therefore, you should remove them too. Use francaise, Universite and etudiants instead of française, Université and étudiants.
• Voila des mesures qui favorisent la famille canadienne.
• Je voudrais aussi signaler aux deputes qu'ils peuvent maintenant s'avancer pour voter.
• Trudeau embauche un cabinet de consultants pour examiner la dependance excessive du gouvernement a l'egard des cabinets de consultants.
• Pierre demande s'il vous plait s'il peut s'attribuer le merite d'avoir supprime 600 emplois a la SRC.
• La France a remporte la coupe du monde 2018.
• J'avais a peine de l'eau a boire pour huit jours.
• Toronto est une ville du Canada.
• Les etudiants de l'Universite de Toronto sont excellents.
Then, translate the same sentences into English using your choice of the T5 MT model or the Bart MT model. Next, translate the same eight sentences into English using your choice of Google Translate or ChatGPT. In Section 2, "Translation Analysis," of analysis.pdf, list all the translations and answer the following questions.
1. Describe the overall quality of the three types of models. Which one is the best and which one is the worst?
2. What attributes and factors of the models do you think play a role in determining the quality?
3. Now, let’s observe the quality of your model’s translations for individual sentences. Which sentences
does your model translate better, and which does it translate worse? Can you identify a pattern?
Describe the pattern of quality across different types of sentences.
4. What about the fine-tuned pre-trained model and Google Translate/ChatGPT’s quality for individual
sentences? Does the previous pattern still persist? Why?
Submission Requirements
This assignment is submitted electronically via MarkUs. You should submit a total of five (5) required files as follows:
1. Your code: a2_transformer_model.py, a2_transformer_runner.py, and a2_bleu_score.py.
2. Your analysis: analysis.pdf. N.B. A LaTeX template for this file has been provided to you in the starter code as a2_report.zip.
3. ID.txt: Provide pertinent information as per the ID.txt template on the course's website, including the statement:
“By submitting this file, I declare that my electronic submission is my own work, and is in
accordance with the University of Toronto Code of Behaviour on Academic Matters and
the Code of Student Conduct, as well as the collaboration policies of this course.”
You do not need to hand in any files other than those specified above.
Appendices
A Canadian Hansards
The main corpus for this assignment comes from the official records (Hansards) of the 36th Canadian Parliament, including debates from both the House of Commons and the Senate.
This corpus is available at /u/cs401/A2/data/Hansard/ and has been split into Training/ and Testing/ directories. This data set consists of pairs of corresponding files (*.e is the English equivalent of the French *.f) in which every line is a sentence. Here, sentence alignment has already been performed for you. That is, the n-th sentence in one file corresponds to the n-th sentence in its corresponding file (e.g., line n in fubar.e is aligned with line n in fubar.f). Note that this data only consists of sentence pairs; many-to-one, many-to-many, and one-to-many alignments are not included.
B Visualizing and logging training with WandB
You have the option to use 'Weights and Biases' [W&B] to visualize and log model training. Go to the W&B site (https://wandb.ai/) and sign up, then create a new project space named csc401-W24-a2. We refer to your W&B username as $WB_USERNAME hereafter. Then, add the --viz-wandb $WB_USERNAME flag to your model training commands (§4.3.1).
C Suggestions
C.1 Check Piazza regularly
Updates to this assignment as well as additional assistance outside tutorials will be primarily distributed
via Piazza (
https://piazza.com/class/lnlv5iq4kgria
).
It is your responsibility to check Piazza
regularly for updates.
C.2 Run cluster code early and at irregular times
Because GPU resources are shared with your peers, your srun job may end up on a weaker machine (or even be postponed until resources are available) if too many students are training at once. To help balance resource usage over time, we recommend you finish this assignment as early as possible. You might find that your peers are more likely to run code at certain times of the day. To check how many jobs are currently queued or running on our partition, please run squeue -p csc401.
If you decide to run your models right before the assignment deadline, please be aware that we will be unable to request more resources or run your code sooner. We will not grant extensions for this reason.