# -*- coding: utf-8 -*-
"""hwk2.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1cu3jBth8OlEcurSC9orWpgRTOAdZLPKW
# CS 447 Homework 2 $-$ Word Embeddings \& Text Classification with Neural Networks
In this homework, you will first train word embeddings using the continuous-bag-of-words (CBOW) method. Then, you will build a convolutional neural network (CNN) classifier to detect the sentiment of movie reviews using the IMDb movie reviews dataset.
In addition to the PyTorch tutorial we have provided online, we highly recommend that you take a look at the following PyTorch tutorials before starting this assignment:
<ul>
<li><a href="https://pytorch.org/tutorials/beginner/pytorch_with_examples.html">https://pytorch.org/tutorials/beginner/pytorch_with_examples.html</a>
<li><a href="https://pytorch.org/tutorials/beginner/data_loading_tutorial.html">https://pytorch.org/tutorials/beginner/data_loading_tutorial.html</a>
<li><a href="https://github.com/yunjey/pytorch-tutorial">https://github.com/yunjey/pytorch-tutorial</a>
</ul>
<font color='green'><b>Hint:</b> We suggest that you work on this homework in <b>CPU</b> until you are ready to train. At that point, you should switch your runtime to <b>GPU</b>. You can do this by going to <TT>Runtime > Change Runtime Type</TT> and selecting "GPU" from the dropdown menu.
* You will find it easier to debug on CPU, and the error messages will be more understandable.
* Google monitors your GPU usage and will occasionally restrict GPU access if you use it too much. In these cases, you can either switch to a different Google account or wait for your access to be restored.</font>
We have imported all the libraries you need to do this homework. <b>You should not import any extra libraries. Furthermore, you should not write any code outside of TODO sections.</b> If you do, the autograder will fail to run your code.
# Part 1: Continuous-Bag-of-Words (CBOW) Embeddings [50 points]
In the first part of this assignment you will learn dense word embeddings based on the word2vec paradigm. In particular, you will use the continuous-bag-of-words approach, which trains a model to predict a word based on the embeddings of surrounding words. For example, in the sentence "the man walks the dog in the park", the embeddings for the words ("man", "walks", "dog", "in") will be used to predict the word "the" (if your context size is 2 on each side of the target word).
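To make the (context, target) construction concrete, here is a minimal sketch, assuming a plain list of tokens and a throwaway helper name `make_pairs` (this is an illustration only, not part of the assignment's graded code):

```python
# Illustrative sketch only (not part of the assignment's TODO code):
# enumerate (context, target) pairs for one tokenized sentence.
def make_pairs(sentence, c=2):
    pairs = []
    for i in range(c, len(sentence) - c):
        context = sentence[i - c:i] + sentence[i + 1:i + 1 + c]
        pairs.append((context, sentence[i]))
    return pairs

print(make_pairs("the man walks the dog in the park".split()))
# [(['the', 'man', 'the', 'dog'], 'walks'), (['man', 'walks', 'dog', 'in'], 'the'), ...]
```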
## Download \& Preprocess the Data
First we will download the dataset using [torchtext](https://torchtext.readthedocs.io/en/latest/index.html), which is a package that supports NLP for PyTorch.
Unfortunately, you have to install the <TT>torchdata</TT> package on the Colab machine in order to access the data. To do this, run the cell below (you may need to click the "Restart Runtime" button when it finishes). You will have to do this every time you return to work on the homework.
"""
!pip install torchdata==0.5.1
!pip install torchtext==0.14.0
### DO NOT EDIT ###
import torch
import torch.nn as nn
import torch.nn.functional as F
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if __name__=='__main__':
print('Using device:', DEVICE)
"""Now, we download the data. As with Homework 1, we will use WikiText-2, a corpus of high-quality Wikipedia articles. The dataset was originally introduced in the following paper: https://arxiv.org/pdf/1609.07843v1.pdf. A raw version of the data can easily be viewed here: https://github.com/pytorch/examples/tree/master/word_language_model/data/wikitext-
2.preprocess
After downloading the data, we preprocess the text as in Homework 1. <i>You do not need to edit this code.</i>
* <b>Sentence splitting:</b> In this homework, we are interested in modeling individual sentences, rather than longer chunks of text such
as paragraphs or documents. The WikiText dataset provides paragraphs; thus, we provide a simple method to identify individual sentences by splitting paragraphs at
punctuation tokens (".", "!", "?").
* <b>Sentence markers:</b> For both training and testing corpora, each sentence must be surrounded by a start-of-sentence (`<s>`) and end-of-sentence marker (`</s>`). These markers will allow your models to generate sentences that have realistic beginnings and endings.
* <b>Unknown words:</b> In order to deal with unknown words,
all words that do not appear in the vocabulary must be replaced with a special token for unknown words (`<UNK>`). The WikiText dataset has already done this, and you can read about the method in the paper above. When unknown words are encountered in the test corpus, they should be treated as that special token instead.
We provide you with preprocessing code here, and you should not modify it.
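For example, a hypothetical paragraph would come out of this preprocessing looking roughly as follows (the text itself is made up purely for illustration):

```python
# Hypothetical input/output of the preprocessing (illustration only):
# raw paragraph:  'He studied at <unk> University . It was his first post .'
# sentence 1:     ['<s>', 'he', 'studied', 'at', '<UNK>', 'university', '.', '</s>']
# sentence 2:     ['<s>', 'it', 'was', 'his', 'first', 'post', '.', '</s>']
```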
"""
### DO NOT EDIT ###
# Constants (feel free to use these in your code, but do not change them)
CBOW_START = "<s>" # Start-of-sentence token
CBOW_END = "</s>" # End-of-sentence-token
CBOW_UNK = "<UNK>" # Unknown word token
### DO NOT EDIT ###
import torchtext
import random
import sys
def cbow_preprocess(data, vocab=None, do_lowercase=True):
final_data = []
lowercase = "abcdefghijklmnopqrstuvwxyz"
for paragraph in data:
paragraph = [x if x != '<unk>' else CBOW_UNK for x in paragraph.split()]
if vocab is not None:
paragraph = [x if x in vocab else CBOW_UNK for x in paragraph]
if paragraph == [] or paragraph.count('=') >= 2: continue
sen = []
prev_punct, prev_quot = False, False
for word in paragraph:
if prev_quot:
if word[0] not in lowercase:
final_data.append(sen)
sen = []
prev_punct, prev_quot = False, False
if prev_punct:
if word == '"':
prev_punct, prev_quot = False, True
else:
if word[0] not in lowercase:
final_data.append(sen)
sen = []
prev_punct, prev_quot = False, False
if word in {'.', '?', '!'}: prev_punct = True
sen += [word]
if sen[-1] not in {'.', '?', '!', '"'}: continue # Prevent a lot of short sentences
final_data.append(sen)
vocab_was_none = vocab is None
if vocab is None:
vocab = {}
for i in range(len(final_data)):
# Make words lowercase for this assignment
final_data[i] = [x.lower() if do_lowercase and x != CBOW_UNK else x for x in final_data[i]]
final_data[i] = [CBOW_START] + final_data[i] + [CBOW_END]
if vocab_was_none:
for word in final_data[i]:
vocab[word] = vocab.get(word, 0) + 1
return final_data, vocab
def getDataset():
dataset = torchtext.datasets.WikiText2(root='.data', split=('train',))
train_dataset, vocab = cbow_preprocess(dataset[0])
return train_dataset, vocab
if __name__=='__main__':
sentences, vocab = getDataset()
"""Run the next cell to see 10 random sentences of the data."""
### DO NOT EDIT ###
if __name__ == '__main__':
for x in random.sample(sentences, 10):
print (x)
"""##<font color='red'>TODO:</font> Define the Dataset Class [15 points]
In the following cell, we will define the <b>dataset</b> class. The dataset contains input-output pairs for each training example we will provide to the model.
You need to implement the following functions:
* <b>` make_training_examples(self)`:</b> <b>[5 points]</b> Each training example will be a list of <em>context</em> words along with a <em>target</em> word.
The context words consist of $c$ words on either side of the target word; hence, each list of context words has size $2c$. The goal will be to have your model predict the target word from the context words. Thus, you must convert each sentence into a series of context-target pairs, as follows:
<ul>
<li>For each sentence $s=[w_1,w_2,...,w_n]$ and a context size $c$, compute the following (context, target) pairs:<br>&emsp;$([w_1,...,w_c,w_{c+2},...,w_{2c+1}]$, $w_{c+1}$)<br>&emsp;$([w_2,...,w_{c+1},w_{c+3},...,w_{2c+2}]$, $w_{c+2}$)<br>&emsp;$\vdots$<br>&emsp;$([w_{n-2c},...,w_{n-c-1},w_{n-c+1},...,w_{n}]$, $w_{n-c}$)<br>For example, suppose your sentence is "the man walks the dog in the park" and the context size is $c=2$. Your method should find the following training pairs:<br>&emsp;(["the", "man", "the", "dog"], "walks")<br>&emsp;(["man", "walks", "dog", "in"], "the")<br>&emsp;(["walks", "the", "in", "the"], "dog")<br>&emsp;(["the", "dog", "the", "park"], "in")<br>Of course, the sentences in your dataset have start- and end-of-sentence tokens as well, which you should treat as any other word.
</ul>
This function should return a list of <b>all</b> such training pairs.
* <b>` build_dictionaries(self, vocab)`:</b> <b>[4 points]</b> Creates the dictionaries `word2idx` and `idx2word`. You will represent each word in the vocabulary with a unique index, and keep track of this in these dictionaries. The input `vocab` is a list of words: you must assign indexes in the order the words appear in this list.
* <b>`get_context_vector(self, idx)`:</b> Returns a vector representing the <em>context</em> of the `idx`th training example. Specifically, if the context size
is $c$, this should be a tensor of $2c$ word indices corresponding to the context words of the `idx`th example.
<font color='green'><b>Hint:</b> You may want to pre-compute and save all context vectors (using word indices rather than the words themselves) in `__init__(...)`, and then access these in `get_context_vector(self, idx)`. This would give you a slight speedup at train time.</font>
* <b>`get_target_index(self, idx) `</b>: Return the target word index for the `idx`th training example.
* <b> ` __len__(self) `: [1 point]</b> Return the total number of training examples in the dataset as an `int`.
* <b>` __getitem__(self, idx)`:</b> <b>[5 points]</b> Return the `idx`th training example as a tuple of `(context_vector, target_word_index)`. You should use the ` get_context_vector(self, idx) ` and ` get_target_index(self, idx) ` functions here.
"""
from torch.utils import data
from collections import defaultdict
class CbowDataset(data.Dataset):
def __init__(self, sentences, vocab, context_size):
##### DO NOT EDIT #####
assert CBOW_START in vocab and CBOW_END in vocab and CBOW_UNK in vocab
self.sentences = sentences
self.context_size = context_size
self.training_examples = []
self.make_training_examples()
self.word2idx = {} # Mapping of word to index
self.idx2word = {} # Mapping of index to word
self.build_dictionaries(sorted(vocab.keys()))
self.context_vectors = []
self.create_context_vectors()
self.vocab_size = len(self.word2idx)
def create_context_vectors(self):
for context, word in self.training_examples:
context_indices = torch.Tensor([self.word2idx[word] for word in context])
word_idx = self.word2idx[word]
self.context_vectors.append((context_indices, word_idx))
def make_training_examples(self):
'''
Builds a list of context-target_word pairs that will be used as training examples for the model and stores them in
self.training_examples.
Each example is a (context, target_word) tuple, where context is a list of strings of size 2*context_size and
target_word is simply a string.
Returns nothing.
'''
##### TODO #####
# For each sentence, loop over each word in the sentence. If there are c words before and c words after the word,
# make a (context, word) pair, where context is a list made up of the c words before the word and the c words
# after the word (in the same order they appear in the sentence). Append this (context, word) pair to self.training_examples.
c = self.context_size
for sentence in self.sentences:
n = len(sentence)
for i in range(n - 2 * c):
context = sentence[i:i + c] + sentence[i + c + 1:i + 2 * c + 1]
word = sentence[i + c]
self.training_examples.append((context, word))
def build_dictionaries(self, vocab):
'''
Builds the dictionaries self.idx2word and self.word2idx. Make sure that you
assign indices
in the order the words appear in vocab (a list of words).
Returns nothing.
'''
##### TODO #####
for i, word in enumerate(vocab):
self.idx2word[i] = word
self.word2idx[word] = i
def get_context_vector(self, idx):
'''
Returns the context vector (as a torch.tensor) for the training example at index idx.
This is a tensor containing the indices of each word in the context.
'''
assert len(self.training_examples) > 0
##### TODO #####
return self.context_vectors[idx][0]
def get_target_index(self, idx):
'''
Returns the index of the target word (as type int) of the training example at index idx.
'''
##### TODO #####
return self.context_vectors[idx][1]
def __len__(self):
'''
Returns the number of training examples (as type int) in the dataset
'''
##### TODO #####
return len(self.training_examples)
def __getitem__(self, idx):
'''
Returns the context vector (as a torch.tensor) and target index (as type int) of the training example at index idx.
'''
##### TODO #####
return self.context_vectors[idx]
"""##Sanity Check: Dataset Class
The code below runs a sanity check for your `CbowDataset` class. The tests are similar to the hidden ones in Gradescope. However, note that passing the sanity check does <b>not</b> guarantee that you will pass the autograder; it is intended to help you debug.
You do <b>not</b> need to edit this cell.
"""
### DO NOT EDIT ###
def sanityCheckCbowDataset():
# Read in the sample corpus
test_sents = [['<s>', 'the', 'man', 'walks', 'the', 'dog', 'in', 'the', 'park',
'</s>'],
['<s>', 'i', 'saw', 'the', 'man', 'with', 'the', 'telescope', 'on', 'the', CBOW_UNK, '</s>']]
test_vocab = {'<s>':2, 'the':6, 'man':2, 'walks': 1, 'dog': 1, 'in': 1, 'park':
1, 'i': 1, 'saw': 1, 'with': 1, 'telescope':1, 'on': 1, CBOW_UNK: 1, '</s>': 2}
print("Sample dataset:")
for x in test_sents: print(x)
context_sizes = [1,3,5]
print('\n--- TEST: training_examples ---')
training_examples_expected = [[(['<s>', 'man'], 'the'), (['the', 'walks'], 'man'), (['man', 'the'], 'walks'), (['walks', 'dog'], 'the'), (['the', 'in'], 'dog'), (['dog', 'the'], 'in'), (['in', 'park'], 'the'), (['the', '</s>'], 'park'),
(['<s>', 'saw'], 'i'), (['i', 'the'], 'saw'), (['saw', 'man'], 'the'), (['the', 'with'], 'man'), (['man', 'the'], 'with'), (['with', 'telescope'], 'the'), (['the',
'on'], 'telescope'), (['telescope', 'the'], 'on'), (['on', '<UNK>'], 'the'), (['the', '</s>'], '<UNK>')],
[(['<s>', 'the', 'man', 'the', 'dog', 'in'], 'walks'), (['the', 'man', 'walks', 'dog', 'in', 'the'], 'the'), (['man', 'walks', 'the', 'in', 'the', 'park'], 'dog'), (['walks', 'the', 'dog', 'the', 'park', '</s>'], 'in'), (['<s>', 'i', 'saw', 'man', 'with', 'the'], 'the'), (['i', 'saw', 'the', 'with', 'the', 'telescope'], 'man'), (['saw', 'the', 'man', 'the', 'telescope', 'on'], 'with'), (['the', 'man', 'with', 'telescope', 'on', 'the'], 'the'), (['man', 'with', 'the', 'on', 'the', '<UNK>'], 'telescope'), (['with', 'the', 'telescope', 'the', '<UNK>', '</s>'], 'on')],
[(['<s>', 'i', 'saw', 'the', 'man', 'the', 'telescope', 'on', 'the', '<UNK>'], 'with'), (['i', 'saw', 'the', 'man', 'with', 'telescope', 'on', 'the', '<UNK>', '</s>'], 'the')]
]
for i in range(len(context_sizes)):
c=context_sizes[i]
test_dataset = CbowDataset(test_sents, test_vocab, c)
has_passed, message = True, ''
training_examples = test_dataset.training_examples
expected = training_examples_expected[i]
if has_passed and len(training_examples) != len(expected):
has_passed, message = False, 'len(training_examples) is incorrect. Expected: ' + str(len(expected)) + '\tGot: ' + str(len(training_examples))
if has_passed and set([(type(x), len(x)) for x in training_examples]) != {(tuple, 2)}:
has_passed, message = False, 'Each item of training_examples must be a 2-tuple; at least one of your items is not a 2-tuple.'
if has_passed and set([(type(x[0]), type(x[1])) for x in training_examples]) != {(list, str)}:
has_passed, message = False, 'Each item must contain a list of context words and a target word as a string. At least one of your items does not meet this condition.'
if has_passed and sorted(training_examples, key = lambda x: (' '.join(x[0]), x[1])) != sorted(expected, key = lambda x: (' '.join(x[0]), x[1])):
has_passed, message = False, 'training_examples is incorrect (note that the order of the examples does not matter). Expected: '+str(sorted(expected, key = lambda x: (' '.join(x[0]), x[1]))) + '\tGot: ' + str(sorted(training_examples, key = lambda x: (' '.join(x[0]), x[1])))
status = 'PASSED' if has_passed else 'FAILED'
print('\tcontext_size:', c, '\t'+status, '\t'+message)
print('\n--- TEST: idx2word and word2idx dictionaries ---')
expected_word2idx = {'</s>': 0, '<UNK>': 1, '<s>': 2, 'dog': 3, 'i': 4, 'in': 5, 'man': 6, 'on': 7, 'park': 8, 'saw': 9, 'telescope': 10, 'the': 11, 'walks': 12,
'with': 13}
expected_idx2word = {0: '</s>', 1: '<UNK>', 2: '<s>', 3: 'dog', 4: 'i', 5: 'in', 6: 'man', 7: 'on', 8: 'park', 9: 'saw', 10: 'telescope', 11: 'the', 12: 'walks', 13: 'with'}
for i in range(len(context_sizes)):
c=context_sizes[i]
test_dataset = CbowDataset(test_sents, test_vocab, c)
has_passed, message = True, ''
word2idx = test_dataset.word2idx
idx2word = test_dataset.idx2word
has_passed, message = True, ''
if has_passed and (test_dataset.vocab_size != len(test_dataset.word2idx) or
test_dataset.vocab_size != len(test_dataset.idx2word)):
has_passed, message = False, 'dataset.vocab_size (' + str(test_dataset.vocab_size) + ') must be the same length as dataset.word2idx (' + str(len(test_dataset.word2idx)) + ') and dataset.idx2word ('+str(len(test_dataset.idx2word)) +').'
if has_passed and (test_dataset.vocab_size != len(expected_word2idx)):
has_passed, message = False, 'Your vocab size is incorrect. Expected: ' + str(len(expected_word2idx)) + '\tGot: ' + str(test_dataset.vocab_size)
if has_passed and sorted(list(test_dataset.idx2word.keys())) != list(range(0, test_dataset.vocab_size)):
has_passed, message = False, 'dataset.idx2word must have keys ranging from 0 to dataset.vocab_size-1. Keys in your dataset.idx2word: ' + str(sorted(list(test_dataset.idx2word.keys())))
if has_passed and sorted(list(test_dataset.word2idx.keys())) != sorted(list(expected_word2idx.keys())):
has_passed, message = False, 'Your dataset.word2idx has incorrect keys. Expected: ' + str(sorted(list(expected_word2idx.keys()))) + '\tGot: ' + str(sorted(list(test_dataset.word2idx.keys())))
if has_passed: # Check that word2idx and idx2word are consistent
widx = sorted(list(test_dataset.word2idx.items()))
idxw = sorted(list([(v,k) for k,v in test_dataset.idx2word.items()]))
if not (len(widx) == len(idxw) and all([widx[q] == idxw[q] for q in range(len(widx))])):
has_passed, message = False, 'Your dataset.word2idx and dataset.idx2word are not consistent. dataset.idx2word: ' + str(test_dataset.idx2word) + '\tdataset.word2idx: ' + str(test_dataset.word2idx)
if has_passed and word2idx != expected_word2idx:
has_passed, message = False, 'Your dataset.word2idx is incorrect. Expected: ' + str(expected_word2idx) + '\tGot: ' + str(word2idx)
if has_passed and idx2word != expected_idx2word:
has_passed, message = False, 'Your dataset.idx2word is incorrect. Expected: ' + str(expected_idx2word) + '\tGot: ' + str(idx2word)
status = 'PASSED' if has_passed else 'FAILED'
print('\tcontext_size:', c, '\t'+status, '\t'+message)
print('\n--- TEST: len(dataset) ---')
correct_lens = [18,10,2]
for i in range(len(context_sizes)):
c=context_sizes[i]
test_dataset = CbowDataset(test_sents, test_vocab, c)
has_passed = len(test_dataset) == correct_lens[i]
status = 'PASSED' if has_passed else 'FAILED'
if has_passed: message = ''
else: message = 'len(dataset) is incorrect. Expected: ' + str(correct_lens[i]) + '\tGot: ' + str(len(test_dataset))
print('\tcontext_size:', c, '\t'+status, '\t'+message)
print('\n--- TEST: __getitem__(self, idx) ---')
for i in range(len(context_sizes)):
c=context_sizes[i]
test_dataset = CbowDataset(test_sents, test_vocab, c)
for j in range(correct_lens[i]):
returned = test_dataset.__getitem__(j)
has_passed, message = True, ''
if has_passed and len(returned) != 2:
has_passed, message = False, 'dataset.__getitem__(idx) must return 2 items. Got ' + str(len(returned)) +' items instead.'
if has_passed and (type(returned[0]) != torch.Tensor or type(returned[1]) != int):
has_passed, message = False, 'The context vector must be a torch.Tensor and the target index must be an int. Got: (' + str(type(returned[0])) + ', ' + str(type(returned[1])) + ')'
if has_passed and (returned[0].shape != torch.randint(0,100,(2*c,)).shape):
has_passed, message = False, 'Shape of first return is incorrect. Expected: ' + str(torch.randint(0,100,(2*c,)).shape) + '.\tGot: ' + str(returned[0].shape)
status = 'PASSED' if has_passed else 'FAILED'
print('\tcontext_size:', c, '\tidx:',str(j),'\t' if j<10 else '','\t',status, '\t'+message)
if __name__ == '__main__':
sanityCheckCbowDataset()
"""##<font color='red'>TODO:</font> Define the CBOW Model [20 points]
Here, you will define a simple feed-forward neural network that takes in a context vector and predicts the word that completes the context. We provide you with the `CbowModel` class, and you just need to fill in parts of the `__init__(...)` and `forward(...)` functions. Each of these functions is worth <b>10 points</b>.
We have provided you with instructions and hints in the comments. In particular, pay attention to the desired shapes; you may find it helpful to print the shape of the tensors as you code. It may also help to keep the PyTorch documentation open for the modules & functions you are using, since they describe input and output dimensions.
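As a rough guide, the shape flow through a correct forward pass looks like the sketch below (the sizes are illustrative, and the module calls are a standalone demonstration rather than the graded class):

```python
# Illustrative shape walk-through for a CBOW-style forward pass (not graded code).
import torch
import torch.nn as nn

batch_size, context_size, vocab_size, embed_size, hidden_size = 4, 3, 1000, 128, 64
inputs = torch.randint(0, vocab_size, (batch_size, 2 * context_size))   # word indices

emb = nn.Embedding(vocab_size, embed_size)(inputs)         # [4, 6, 128]
avg = emb.mean(dim=1)                                       # [4, 128]   average over context words
hid = torch.relu(nn.Linear(embed_size, hidden_size)(avg))   # [4, 64]
out = nn.Linear(hidden_size, vocab_size)(hid)               # [4, 1000]  raw logits (no softmax)
print(out.shape)
```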
"""
class CbowModel(nn.Module):
def __init__(self, vocab_size, embed_size, hidden_size, context_size):
'''
vocab_size: Size of the vocabulary
embed_size: Size of your embedding vectors
hidden_size: Size of hidden layer of neural network
context_size: The size of your context window used to generate training examples
'''
super(CbowModel, self).__init__()
self.context_size = context_size
##### TODO #####
# 1. Create an embedding layer using nn.Embedding, that will take an index in your vocabulary as input
# (referring to a word) and return a vector of size embed_size (i.e. your embedding vector).
# Note that providing a word index to nn.Embedding is the same (conceptually) as providing a one-hot
# vector to nn.Linear (however, nn.Embedding takes sparsity into account, so is more efficient)
self.embedding = nn.Embedding(vocab_size, embed_size)
# 2. Create a linear layer that projects your embedding vector to a vector of size hidden_size.
self.linear = nn.Linear(embed_size, hidden_size)
# 3. Create an output linear layer, that projects your hidden vector to a vector the size of your vocabulary.
self.output = nn.Linear(hidden_size, vocab_size)
def forward(self, inputs):
'''
inputs: Tensor of size [batch_size, 2*context_size]
Returns output: Tensor of size [batch_size, vocab_size]
'''
##### TODO #####
# 1. Feed the inputs through your embedding layer to get a tensor of size [batch_size, 2*context size, embed_size]
# 2. Average the embedding vectors of each of your context word embeddings (for each example in your batch).
# Expected size: [batch_size, embed_size]
# 3. Feed this through your linear layer and then a ReLU activation. Expected size: [batch_size, hidden_size]
# 4. Feed this through your output layer and return the result. Expected size [batch_size, vocab_size]
# Do NOT apply a softmax to the final output - this is done in the training method!
relu = nn.ReLU()
x1 = self.embedding(inputs)
x2 = torch.mean(x1, 1)
x3 = self.linear(x2)
x4 = relu(x3)
return self.output(x4)
"""## Sanity Check: CBOW Model"""
### DO NOT EDIT ###
count_parameters = lambda model: sum(p.numel() for p in model.parameters() if p.requires_grad)
def makeCbowSanityBatch(test_params):
batch_size = test_params['batch_size']
new_test_params = {k:v for k,v in test_params.items() if k != 'batch_size'}
batch = torch.randint(0, new_test_params['vocab_size'], (batch_size,new_test_params['context_size']*2))
return batch, new_test_params
def sanityCheckModel(all_test_params, NN, expected_outputs, init_or_forward, make_batch_fxn=None):
print('--- TEST: ' + ('Number of Model Parameters (tests __init__(...))' if init_or_forward=='init' else 'Output shape of forward(...)') + ' ---')
for tp_idx, (test_params, expected_output) in enumerate(zip(all_test_params, expected_outputs)):
if init_or_forward == "forward":
input, test_params = make_batch_fxn(test_params)
# Construct the student model
tps = {k:v for k, v in test_params.items()}
stu_nn = NN(**tps)
if init_or_forward == "forward":
with torch.no_grad():
stu_out = stu_nn(input)
ref_out_shape = expected_output
has_passed = torch.is_tensor(stu_out)
if not has_passed: msg = 'Output must be a torch.Tensor; received ' + str(type(stu_out))
else:
has_passed = stu_out.shape == ref_out_shape
msg = 'Your Output Shape: ' + str(stu_out.shape)
status = 'PASSED' if has_passed else 'FAILED'
message = '\t' + status + "\t Init Input: " + str({k:v for k,v in tps.items()}) + '\tForward Input Shape: ' + str(input.shape) + '\tExpected Output Shape: ' + str(ref_out_shape) + '\t' + msg
print(message)
else:
stu_num_params = count_parameters(stu_nn)
ref_num_params = expected_output
comparison_result = (stu_num_params == ref_num_params)
status = 'PASSED' if comparison_result else 'FAILED'
message = '\t' + status + "\tInput: " + str({k:v for k,v in test_params.items()}) + ('\tExpected Num. Params: ' + str(ref_num_params) + '\tYour Num. Params: '+ str(stu_num_params))
print(message)
del stu_nn
if __name__ == '__main__':
# Test init
cbow_init_inputs = [{'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 128, 'context_size': 2}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 128, 'context_size': 4}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 128, 'context_size': 2}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 128, 'context_size': 4}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64,
'context_size': 2}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 128, 'context_size': 2}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 128, 'context_size': 4}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 128, 'context_size': 2}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 128, 'context_size': 4}]
cbow_init_expected_outputs = [3082, 3082, 5834, 5834, 5450, 5450, 10250, 10250,
99112, 99112, 165224, 165224, 133160, 133160, 201320, 201320]
sanityCheckModel(cbow_init_inputs, CbowModel, cbow_init_expected_outputs, "init")
print()
# Test forward
cbow_forward_inputs = [{'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 1}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 5}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 500}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 1}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 5}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 500}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 1}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 5}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 500}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 1}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 5}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 500}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 5}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 500}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 5}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 500}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 5}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 500}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 5}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 500}]
cbow_forward_expected_outputs = [torch.Size([1, 10]), torch.Size([5, 10]), torch.Size([500, 10]), torch.Size([1, 10]), torch.Size([5, 10]), torch.Size([500, 10]), torch.Size([1, 10]), torch.Size([5, 10]), torch.Size([500, 10]), torch.Size([1, 10]), torch.Size([5, 10]), torch.Size([500, 10]), torch.Size([1, 1000]), torch.Size([5, 1000]), torch.Size([500, 1000]), torch.Size([1, 1000]), torch.Size([5, 1000]), torch.Size([500, 1000]), torch.Size([1, 1000]), torch.Size([5, 1000]), torch.Size([500, 1000]), torch.Size([1, 1000]), torch.Size([5, 1000]), torch.Size([500, 1000])]
sanityCheckModel(cbow_forward_inputs, CbowModel, cbow_forward_expected_outputs,
"forward", makeCbowSanityBatch)
"""##Train the CBOW Model [15 points]
Now, we initialize the <b>dataloader</b>. A dataloader is responsible for providing batches of data to your model. Notice how we first instantiate the dataset.
You do not need to edit this cell.
"""
### DO NOT EDIT ###
BATCH_SIZE = 1000 # You may change the batch size if you'd like
CONTEXT_SIZE = 3 # You may change the context size if you'd like
if __name__=='__main__':
cbow_dataset = CbowDataset(sentences, vocab, CONTEXT_SIZE)
cbow_dataloader = torch.utils.data.DataLoader(cbow_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, drop_last=True)
"""Now we provide you with a function that takes your model and trains it on the data.
You do not need to edit this cell. However, you may want to write code to save your
model periodically, as Colab connections are not permanent. See the tutorial here if you wish to do this: https://pytorch.org/tutorials/beginner/saving_loading_models.html.
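For reference, a minimal checkpointing sketch in the spirit of that tutorial might look like this (the file name is arbitrary, and it assumes `cbow_model` has already been instantiated):

```python
# Optional checkpointing sketch (adapt as needed; file name is arbitrary):
torch.save(cbow_model.state_dict(), 'cbow_checkpoint.pt')   # save learned weights

# ...later, after re-creating the model with the same hyperparameters:
cbow_model.load_state_dict(torch.load('cbow_checkpoint.pt', map_location=DEVICE))
```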
"""
### DO NOT EDIT ###
from tqdm.notebook import tqdm
from torch import optim
def train_cbow_model(model, num_epochs, data_loader, optimizer, criterion):
print("Training CBOW model
....
")
for epoch in range(num_epochs):
epoch_loss, n = 0, 0
for context, target in tqdm(data_loader):
optimizer.zero_grad()
log_probs = model(context.long().to(DEVICE)) # to(torch.float32)
loss = criterion(log_probs, target.to(DEVICE))
loss.backward()
optimizer.step()
n += context.shape[0]
epoch_loss += (loss*context.shape[0])
epoch_loss = epoch_loss/n
print('[TRAIN]\t Epoch: {:2d}\t Loss: {:.4f}'.format(epoch+1, epoch_loss))
print('CBOW Model Trained!\n')
"""Now you can instantiate your model. We provide you with some recommended hyperparameters; you should be able to get the desired accuracy with these, but feel free to play around with them."""
if __name__=='__main__':
cbow_model = CbowModel(vocab_size = cbow_dataset.vocab_size, # Don't change this
embed_size = 128, # Feel free to change
hidden_size = 128, # Feel free to change
context_size = CONTEXT_SIZE) # Don't change this (though you may change the value of CONTEXT_SIZE above if you wish)
# Put your model on the device (cuda or cpu)
cbow_model = cbow_model.to(DEVICE)
print('The model has {:,d} trainable parameters'.format(count_parameters(cbow_model)))
"""Next, we create the **criterion**, which is our loss function: it is a measure of how well the model matches the empirical distribution of the data. We use cross-
entropy loss (https://en.wikipedia.org/wiki/Cross_entropy).
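Concretely, for a batch of $N$ examples with logits $z_i \in \mathbb{R}^{|V|}$ and target word indices $y_i$, `nn.CrossEntropyLoss` applies a log-softmax to the logits and averages the negative log-likelihood of the targets:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(z_{i,y_i})}{\sum_{j=1}^{|V|} \exp(z_{i,j})}$$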
We also define the **optimizer**, which performs gradient descent. We use the Adam optimizer (https://arxiv.org/pdf/1412.6980.pdf), which has been shown to work well on these types of models.
"""
import torch.optim as optim
if __name__=='__main__':
LEARNING_RATE = 0.01 # Feel free to try other learning rates
# Define the loss function
criterion = nn.CrossEntropyLoss().to(DEVICE)
# Define the optimizer
optimizer = optim.Adam(cbow_model.parameters(), lr=LEARNING_RATE)
"""Finally, we can train the model. If the model is implemented correctly and you're using the GPU, this cell should take around <b>3 minutes</b> (or less). Feel
free to change the number of epochs."""
if __name__=='__main__':
N_EPOCHS = 6 # Feel free to change this
# Train model for N_EPOCHS epochs
train_cbow_model(cbow_model, N_EPOCHS, cbow_dataloader, optimizer, criterion)
"""To get full credit on the word embeddings, you must return the correct vectors when your model is instantiated with a particular random seed and called on the autograder. This is worth <b>15 points</b>.
## Visualize Word Embeddings
Now that you have a trained model, we can extract the word embeddings and visualize
them. The word embeddings are basically the weight matrix of the embedding layer that you defined, as this maps each index of your vocab to a dense vector of size `embed_size`.
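For instance, once training has finished you can inspect the embedding of a single word like this (the word `'dog'` is just an example and is assumed to be in the vocabulary):

```python
# Illustrative: look up one word's trained embedding vector.
idx = cbow_dataset.word2idx['dog']          # word -> vocabulary index
vec = cbow_model.embedding.weight[idx]      # tensor of shape [embed_size]
print(vec.shape)
```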
Since we cannot easily visualize such high-dimensional vectors, we use a process called TSNE (t-distributed stochastic neighbor embedding). This reduces the vectors to a 2-dimensional space so that we can visualize them. For more information on TSNE, see https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding. Note that this method is not deterministic, so running this cell multiple times will give you a different visualization.
The cell below will run TSNE and plot the word embeddings corresponding to the 1,000 most frequent words on a 2-dimensional plot. You are welcome to increase this threshold if you'd like to see the vectors for more words.
"""
### DO NOT EDIT ###
if __name__=='__main__':
from sklearn.manifold import TSNE
import numpy as np
import plotly.express as px
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
THRESHOLD = 1000
words = [x[0] for x in sorted(vocab.items(), key = lambda x: -x[1])[:THRESHOLD]]
idxes = [cbow_dataset.word2idx[word] for word in words]
vectors = np.array([cbow_model.embedding.weight[i].tolist() for i in idxes])
tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, verbose=False)
new_vectors = tsne_model.fit_transform(vectors)
df = pd.DataFrame(data={'x': new_vectors[:,0], 'y': new_vectors[:,1], 'word':words})
fig = px.scatter(df, x='x', y='y', text='word')
fig.update_traces(textposition='top center')
fig.update_layout(height=600, title_text='Word Embedding 2D Visualization')
fig.show()
"""At a high level, you should see words with similar meaning clustering together. You can use your mouse to zoom in and inspect the vector space closer.
You should also see mini-clusters within this plot; you will need to zoom in to examine these. Examples of mini-clusters you might see are:
* <b>Time words:</b> hours, minutes, seconds, months, weeks, years, etc.
* <b>Years:</b> 2000, 2002, 2004, etc.
* <b>Numbers:</b> 10, 15, 37, etc.
* <b>Months:</b> january, february, march, etc. <em>Question: does the word 'may', which is both a month and a modal verb, cluster with the other months? If not, can you see where it is in relation to other modal verbs ('can', 'will', 'would', 'might', etc.)?</em>
Feel free to increase the number of vectors plotted if you want to investigate further.
# Part 2: Train a Convolutional Neural Network (CNN) [50 points]
The second part of this homework concerns text classification. You will train a CNN
classifier to determine the sentiment of movie reviews.
## Download & Preprocess the Data
We will be using the IMDb movie reviews dataset, which is a corpus of movie reviews
along with a <em>positive</em> or <em>negative</em> classification. This is again provided by torchtext.
The following cell will produce `train_data` and `test_data`. It also does some basic tokenization.
* To access the list of textual tokens for the *i*th example, use `train_data[i][1]`
* To access the label for the *i*th example, use `train_data[i][0]`
"""
### DO NOT EDIT ###
import torchtext
import random
def cnn_preprocess(review):
'''
Simple preprocessing function.
'''
res = []
for x in review.split(' '):
remove_beg=True if x[0] in {'(', '"', "'"} else False
remove_end=True if x[-1] in {'.', ',', ';', ':', '?', '!', '"', "'", ')'} else False
if remove_beg and remove_end: res += [x[0], x[1:-1], x[-1]]
elif remove_beg: res += [x[0], x[1:]]
elif remove_end: res += [x[:-1], x[-1]]
else: res += [x]
return res
if __name__=='__main__':
train_data = torchtext.datasets.IMDB(root='.data', split='train')
train_data = list(train_data)
train_data = [(x[0], cnn_preprocess(x[1])) for x in train_data]
train_data, test_data = train_data[0:10000] + train_data[12500:12500+10000], train_data[10000:12500] + train_data[12500+10000:],
print('Num. Train Examples:', len(train_data))
print('Num. Test Examples:', len(test_data))
# Make pos/neg
train_data = [('neg' if x[0] == 1 else 'pos', x[1]) for x in train_data]
test_data = [('neg' if x[0] == 1 else 'pos', x[1]) for x in test_data]
print("\nSAMPLE DATA:")
for x in random.sample(train_data, 5):
print('Sample text:', x[1])
print('Sample label:', x[0], '\n')
"""## <font color='red'>TODO:</font> Define the Dataset Class [10 Points]
In the following cell, we will define the <b>dataset</b> class. The dataset contains the tokenized data for your model. You need to implement the following functions:
* <b>` build_dictionary(self)`:</b> <b>[5 points]</b> Creates the dictionaries `idx2word` and `word2idx`. You will represent each word in the dataset with a unique index, and keep track of this in these dictionaries. Use the hyperparameter `threshold` to control which words appear in the dictionary: a training word’s frequency should be `>= threshold` to be included in the dictionary.
* <b>`convert_text(self)`:</b> Converts each review in the dataset to a list of indices, given by your `word2idx` dictionary. You should store this in the
`textual_ids` variable, and the function does not return anything. If a word is not
present in the `word2idx` dictionary, you should use the `<UNK>` token for that word. Be sure to append the `<END>` token to the end of each review.
* <b>` get_text(self, idx) `:</b> Return the review at `idx` in the dataset as an
array of indices corresponding to the words in the review. If the length of the review is less than `max_len`, you should pad the review with the `<PAD>` character
up to the length of `max_len`. If the length is greater than `max_len`, then it should only return the first `max_len` words. The return type should be `torch.LongTensor`.
* <b>`get_label(self, idx) `</b>: Return the value `1` if the label for `idx` in the dataset is `positive`, and should return `0` if it is `negative`. The return type should be `torch.LongTensor`.
* <b> ` __len__(self) `:</b> Return the total number of reviews in the dataset as an `int`.
* <b>` __getitem__(self, idx)`:</b> <b>[5 points]</b> Return the (padded) text, and the label. The return type for both these items should be `torch.LongTensor`. You should use the ` get_label(self, idx) ` and ` get_text(self, idx) ` functions here.
<b>Note:</b> You should convert all words to lower case in your functions.
<font color='green'><b>Hint:</b> Make sure that you use instance variables such as `self.threshold` throughout your code, rather than the global variable `THRESHOLD` (defined later on). The variable `THRESHOLD` will not be known to the autograder, and the use of it within the class will cause an autograder error.</font>
<font color='green'><b>Hint:</b> Make sure that your dataset is deterministic $-$ that is, if it is instantiated multiple times, then the `word2idx` and `idx2word` mappings are the same. If they are not, the autograder will be unable to evaluate your CNN classifications.</font>
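To make the padding and truncation behaviour concrete, here is a small hand-worked sketch (the vocabulary and indices are made up for illustration and do not come from the real dataset):

```python
# Illustration only: how a converted review interacts with max_len.
# Suppose word2idx were {'<PAD>': 0, '<END>': 1, '<UNK>': 2, 'great': 3, 'movie': 4}
review = [3, 4, 2, 1]                               # 'great movie <UNK> <END>'

max_len = 6
padded = review + [0] * (max_len - len(review))     # -> [3, 4, 2, 1, 0, 0]

max_len = 3
truncated = review[:max_len]                        # -> [3, 4, 2]
```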
"""
CNN_PAD = '<PAD>'
CNN_END = '<END>'
CNN_UNK = '<UNK>'
from torch.utils import data
from collections import defaultdict
class TextDataset(data.Dataset):
def __init__(self, examples, split, threshold, max_len, idx2word=None, word2idx=None):
##### DO NOT EDIT #####
self.examples = examples
assert split in {'train', 'val', 'test'}
self.split = split
self.threshold = threshold
self.max_len = max_len
# Dictionaries
self.word2idx = word2idx # Mapping of word to index
self.idx2word = idx2word # Mapping of index to word
if split == 'train':
self.build_dictionary()
self.vocab_size = len(self.word2idx)
# Convert text to indices
self.textual_ids = []
self.convert_text()
def build_dictionary(self):
'''
Build the dictionaries idx2word and word2idx. This is only called when split='train', as these
dictionaries are passed in to the __init__(...) function otherwise. Be sure
to use self.threshold
to control which words are assigned indices in the dictionaries.
Returns nothing.
'''
assert self.split == 'train'
# Don't change this
self.idx2word = {0:CNN_PAD, 1:CNN_END, 2: CNN_UNK}
self.word2idx = {CNN_PAD:0, CNN_END:1, CNN_UNK: 2}
##### TODO #####
# Count the frequencies of all words in the training data (self.examples)
# Assign idx (starting from 3) to all words having word_freq >= self.threshold
# Make sure you call word.lower() on each word to convert it to lowercase
self.freq = {}
for _, example in self.examples:
for word in example:
word = word.lower()
if word in self.freq:
self.freq[word] += 1
else:
self.freq[word] = 1
i = 3
for word, freq in self.freq.items():
if freq >= self.threshold:
self.idx2word[i] = word
self.word2idx[word] = i
i += 1
def convert_text(self):
'''
Convert each review in the dataset (self.examples) to a list of indices, given by self.word2idx.
Store this in self.textual_ids; returns nothing.
'''
##### TODO #####
# Remember to replace a word with the <UNK> token if it does not exist in the word2idx dictionary.
# Remember to append the <END> token to the end of each review.
for _, example in self.examples:
review = [self.word2idx[word] if word in self.word2idx else self.word2idx[CNN_UNK] for word in example] + [self.word2idx[CNN_END]]
self.textual_ids.append(review)
def get_text(self, idx):
'''
Return the review at idx as a long tensor (torch.LongTensor) of integers corresponding to the words in the review.
You may need to pad as necessary (see above).
'''
##### TODO #####
review = self.textual_ids[idx]
n = len(review)
if n > self.max_len:
review = review[:self.max_len]
else:
review = review + [self.word2idx[CNN_PAD]] * (self.max_len - n)
return torch.LongTensor(review)
def get_label(self, idx):
'''
This function should return the value 1 if the label for idx in the dataset
is 'positive',
and 0 if it is 'negative'. The return type should be torch.LongTensor.
'''
##### TODO #####
label = 1 if self.examples[idx][0] == 'pos' else 0
return torch.squeeze(torch.LongTensor([label]))
def __len__(self):
'''
Return the number of reviews (int value) in the dataset
'''
##### TODO #####
return len(self.examples)
def __getitem__(self, idx):
'''
Return the review, and label of the review specified by idx.
'''
##### TODO #####
return self.get_text(idx), self.get_label(idx)
"""##Sanity Check: Dataset Class
The code below runs a sanity check for your `Dataset` class. The tests are similar to the hidden ones in Gradescope. However, note that passing the sanity check does <b>not</b> guarantee that you will pass the autograder; it is intended to help you debug.
"""
### DO NOT EDIT ###
def sanityCheckTextDataset():
# Read in the sample corpus
reviews = [('pos', 'Your life is good when you have money, success and health'),
('neg', 'Life is bad when you got not a lot')]
data = [(x[0], cnn_preprocess(x[1])) for x in reviews]
print("Sample dataset:")
for x in data: print(x)
thresholds = [1,2,3]
print('\n--- TEST: idx2word and word2idx dictionaries ---') # max_len does not matter for this test
correct = [[',', '<END>', '<PAD>', '<UNK>', 'a', 'and', 'bad', 'good', 'got', 'have', 'health', 'is', 'life', 'lot', 'money', 'not', 'success', 'when', 'you', 'your'], ['<END>', '<PAD>', '<UNK>', 'is', 'life', 'when', 'you'], ['<END>', '<PAD>', '<UNK>']]
for i in range(len(thresholds)):
dataset = TextDataset(data, 'train', threshold=thresholds[i], max_len=3)
has_passed, message = True, ''
if has_passed and (dataset.vocab_size != len(dataset.word2idx) or dataset.vocab_size != len(dataset.idx2word)):
has_passed, message = False, 'dataset.vocab_size (' + str(dataset.vocab_size) + ') must be the same length as dataset.word2idx (' + str(len(dataset.word2idx)) + ') and dataset.idx2word ('+str(len(dataset.idx2word)) +').'
if has_passed and (dataset.vocab_size != len(correct[i])):
has_passed, message = False, 'Your vocab size is incorrect. Expected: ' + str(len(correct[i])) + '\tGot: ' + str(dataset.vocab_size)
if has_passed and sorted(list(dataset.idx2word.keys())) != list(range(0, dataset.vocab_size)):
has_passed, message = False, 'dataset.idx2word must have keys ranging from 0 to dataset.vocab_size-1. Keys in your dataset.idx2word: ' + str(sorted(list(dataset.idx2word.keys())))
if has_passed and sorted(list(dataset.word2idx.keys())) != correct[i]:
has_passed, message = False, 'Your dataset.word2idx has incorrect keys. Expected: ' + str(correct[i]) + '\tGot: ' + str(sorted(list(dataset.word2idx.keys())))
if has_passed: # Check that word2idx and idx2word are consistent
widx = sorted(list(dataset.word2idx.items()))
idxw = sorted(list([(v,k) for k,v in dataset.idx2word.items()]))
if not (len(widx) == len(idxw) and all([widx[q] == idxw[q] for q in range(len(widx))])):
has_passed, message = False, 'Your dataset.word2idx and dataset.idx2word are not consistent. dataset.idx2word: ' + str(dataset.idx2word) + '\tdataset.word2idx: ' + str(dataset.word2idx)
status = 'PASSED' if has_passed else 'FAILED'
print('\tthreshold:', thresholds[i], '\tmax_len:', 3, '\t'+status, '\t'+message)
print('\n--- TEST: len(dataset) ---')
has_passed = len(dataset) == 2
if has_passed: print('\tPASSED')
else: print('\tlen(dataset) is incorrect. Expected: 2\tGot: ' + str(len(dataset)))
print('\n--- TEST: __getitem__(self, idx) ---')
max_lens = [3,8,15]
idxes = [0,1]
combos = [{'threshold': t, 'max_len': m, 'idx': idx} for t in thresholds for m in max_lens for idx in idxes]
correct = [(torch.tensor([3, 4, 5]), torch.tensor(1)), (torch.tensor([ 4, 5, 15]), torch.tensor(0)), (torch.tensor([ 3, 4, 5, 6, 7, 8, 9, 10]), torch.tensor(1)), (torch.tensor([ 4, 5, 15, 7, 8, 16, 17, 18]), torch.tensor(0)), (torch.tensor([ 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1, 0, 0]), torch.tensor(1)), (torch.tensor([ 4, 5, 15, 7, 8, 16, 17, 18, 19, 1, 0, 0, 0, 0, 0]), torch.tensor(0)), (torch.tensor([2, 3, 4]), torch.tensor(1)), (torch.tensor([3, 4, 2]), torch.tensor(0)), (torch.tensor([2, 3, 4, 2, 5, 6, 2, 2]), torch.tensor(1)), (torch.tensor([3, 4, 2, 5, 6, 2, 2, 2]), torch.tensor(0)), (torch.tensor([2, 3, 4, 2, 5, 6, 2, 2, 2, 2, 2, 2, 1, 0, 0]), torch.tensor(1)), (torch.tensor([3, 4, 2, 5, 6, 2, 2, 2, 2, 1, 0, 0, 0, 0, 0]), torch.tensor(0)), (torch.tensor([2, 2, 2]), torch.tensor(1)), (torch.tensor([2, 2, 2]), torch.tensor(0)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2]), torch.tensor(1)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2]), torch.tensor(0)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 0]), torch.tensor(1)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 0, 0, 0, 0]), torch.tensor(0))]
for i in range(len(combos)):
combo = combos[i]
dataset = TextDataset(data, 'train', threshold=combo['threshold'], max_len=combo['max_len'])
returned = dataset.__getitem__(combo['idx'])
has_passed, message = True, ''
if has_passed and len(returned) != 2:
has_passed, message = False, 'dataset.__getitem__(idx) must return 2 items. Got ' + str(len(returned)) +' items instead.'
if has_passed and (type(returned[0]) != torch.Tensor or type(returned[1]) != torch.Tensor):
has_passed, message = False, 'Both returns must be of type torch.Tensor. Got: (' + str(type(returned[0])) + ', ' + str(type(returned[1])) + ')'
if has_passed and (returned[0].shape != correct[i][0].shape):
has_passed, message = False, 'Shape of first return is incorrect. Expected: ' + str(correct[i][0].shape) + '.\tGot: ' + str(returned[0].shape)
if has_passed and (returned[1].shape != correct[i][1].shape):
has_passed, message = False, 'Shape of second return is incorrect. Expected: ' + str(correct[i][1].shape) + '.\tGot: ' + str(returned[1].shape) + '\n\t\tHint: torch.Size([]) means that the tensor should be dimensionless (just a number). Try squeezing your result.'
if has_passed and (returned[1] != correct[i][1]):
has_passed, message = False, 'Label (second return) is incorrect. Expected: ' + str(correct[i][1]) + '.\tGot: ' + str(returned[1])
if has_passed:
correct_padding_idxes, your_padding_idxes = torch.where(correct[i][0] == 0)[0], torch.where(returned[0] == dataset.word2idx[CNN_PAD])[0]
if not (correct_padding_idxes.shape == your_padding_idxes.shape and torch.all(correct_padding_idxes == your_padding_idxes)):
has_passed, message = False, 'Padding is not correct. Expected padding indxes: ' + str(correct_padding_idxes) + '.\tYour padding indexes: ' + str(your_padding_idxes)
status = 'PASSED' if has_passed else 'FAILED'
print('\tthreshold:', combo['threshold'], '\tmax_len:', combo['max_len'] , '\tidx:', combo['idx'], '\t'+status, '\t'+message)
if __name__ == '__main__':
sanityCheckTextDataset()
"""The following cell builds the dataset on the IMDb movie reviews and prints an
example:"""
### DO NOT EDIT ###
if __name__=='__main__':
train_dataset = TextDataset(train_data, 'train', threshold=10, max_len=150)
print('Vocab size:', train_dataset.vocab_size, '\n')
randidx = random.randint(0, len(train_dataset)-1)
text, label = train_dataset[randidx]
print('Example text:')
print(train_data[randidx][1])
print(text)
print('\nExample label:')
print(train_data[randidx][0])
print(label)
"""## <font color='red'>TODO:</font> Define the CNN Model [20 points]
Here you will define your convolutional neural network for text classification. We provide you with the CNN class, you need to fill in parts of the `__init__(...)` and `forward(...)` functions. Each of these functions is worth <b>10 points</b>.
We have provided you with instructions and hints in the comments. In particular, pay attention to the desired shapes; you may find it helpful to print the shape of the tensors as you code. It may also help to keep PyTorch documentation open for the modules & functions you are using, since they describe input and output dimensions.
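As with the CBOW model, tracing the shapes through a single convolution branch can help; the sketch below uses illustrative sizes and standalone modules rather than the graded class:

```python
# Illustrative shape walk-through for one Conv1d branch (not graded code).
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, max_len, vocab_size, embed_size, out_channels, filter_height = 4, 150, 1000, 128, 64, 3
texts = torch.randint(0, vocab_size, (batch_size, max_len))

emb = nn.Embedding(vocab_size, embed_size)(texts)                 # [4, 150, 128]
emb = emb.permute(0, 2, 1)                                        # [4, 128, 150]  (channels first for nn.Conv1d)
conv = nn.Conv1d(embed_size, out_channels, filter_height)(emb)    # [4, 64, 148]   (148 = 150 - 3 + 1 with stride 1)
pooled = F.relu(conv).max(dim=2).values                           # [4, 64]        (max over positions)
# Concatenating the pooled outputs of all branches gives [4, out_channels * num_branches].
print(pooled.shape)
```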
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
class CNN(nn.Module):
def __init__(self, vocab_size, embed_size, out_channels, filter_heights, stride, dropout, num_classes, pad_idx):
super(CNN, self).__init__()
##### TODO #####
# Create an embedding layer (https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
# to represent the words in your vocabulary. Make sure to use vocab_size, embed_size, and pad_idx here.
# Define multiple Convolution layers (nn.Conv1d) with filter (kernel) size [filter_height, embed_size] based on your
# different filter_heights.
# Input channels will be embed_size and output channels will be out_channels (these many different filters will be trained
# for each convolution layer)
# If you want, you can store a list of modules inside nn.ModuleList.
# Create a dropout layer (nn.Dropout) using dropout
# Define a linear layer (nn.Linear) that consists of num_classes units
# and takes as input the concatenated output for all cnn layers (out_channels * num_of_cnn_layers units)
self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=pad_idx)
self.convs = []
for filter_height in filter_heights:
self.convs.append(nn.Conv1d(embed_size, out_channels, filter_height, stride=stride))
self.convs = nn.ModuleList(self.convs)
self.dropout = nn.Dropout(dropout)
self.linear = nn.Linear(out_channels * len(filter_heights), num_classes)
def forward(self, texts):
"""
texts: LongTensor [batch_size, max_len]
Returns output: Tensor [batch_size, num_classes]
"""
##### TODO #####
# Pass texts through your embedding layer to convert from word ids to word embeddings
# Resulting: shape: [batch_size, max_len, embed_size]
x1 = self.embedding(texts)
# Pass these texts to each of your conv layers and compute their output as follows:
# Your cnn output will have shape [batch_size, out_channels, *] where * depends on filter_height and stride
# Apply non-linearity on it (F.relu() is a commonly used one. Feel free to try others)
# Take the max value across last dimension to have shape [batch_size, out_channels]
# Concatenate (torch.cat) outputs from all your cnns [batch_size, (out_channels*num_of_cnn_layers)]
x2 = []
for conv in self.convs:
x = conv(x1.permute(0, 2, 1))
x = F.relu(x)
x = torch.max(x, 2).values
x2.append(x)
x2 = torch.cat(x2, 1)
# Let's understand what you just did:
# Since each cnn is of different filter_height, it will look at different number of words at a time
# So, a filter_height of 3 means your cnn looks at 3 words (3-grams) at a time and tries to extract some information from it
# Each cnn will learn out_channels number of features from the words it sees at a time
# Then you applied a non-linearity and took the max value for all channels
# You are essentially trying to find important n-grams from the entire text
# Everything happens on a batch simultaneously hence you have that additional batch_size as the first dimension
# Apply dropout
x3 = self.dropout(x2)
# Pass your output through the linear layer and return its output
# Resulting shape: [batch_size, num_classes]
x4 = self.linear(x3)
# NOTE: Do NOT apply a sigmoid or softmax to the final output - this is done in the training method!
return x4
"""##Sanity Check: CNN Model
The code below runs a sanity check for your `CNN` class. The tests are similar to the hidden ones in Gradescope. However, note that passing the sanity check does <b>not</b> guarantee that you will pass the autograder; it is intended to help you debug.
"""
### DO NOT EDIT ###
def makeCnnSanityBatch(test_params):
batch_size = test_params['batch_size']
max_len = test_params['max_len']
new_test_params = {k:v for k,v in test_params.items() if k not in {'batch_size', 'max_len'}}
batch = torch.randint(0, new_test_params['vocab_size'], (batch_size,max_len))
return batch, new_test_params
if __name__ == '__main__':
# Test init
cnn_init_inputs = [{'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size':
1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 3,
'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0,
'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}]
cnn_init_expected_outputs = [22434, 22531, 22434, 22531, 23874, 23939, 23874, 23939, 41730, 42115, 41730, 42115, 47490, 47747, 47490, 47747, 44578, 44675, 44578,
44675, 47554, 47619, 47554, 47619, 82306, 82691, 82306, 82691, 94210, 94467, 94210,
94467]
sanityCheckModel(cnn_init_inputs, CNN, cnn_init_expected_outputs, "init")
print()
# Test forward
cnn_forward_inputs = [{'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride':
1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride':
1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128,
'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride':
3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride':
3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}]
cnn_forward_expected_outputs = [torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 2]), torch.Size([50, 2]),
torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 3]), torch.Size([50, 3])]
sanityCheckModel(cnn_forward_inputs, CNN, cnn_forward_expected_outputs, "forward", makeCnnSanityBatch)
"""## Train CNN Model
First, we initialize the train and test <b>dataloaders</b>. A dataloader is responsible for providing batches of data to your model. Notice that we first instantiate datasets for the train and test data, and that we use the training vocabulary for both.
You do not need to edit this cell.
"""
### DO NOT EDIT ###
if __name__=='__main__':
THRESHOLD = 5 # Don't change this
MAX_LEN = 200 # Don't change this
BATCH_SIZE = 32 # Feel free to try other batch sizes
train_dataset = TextDataset(train_data, 'train', THRESHOLD, MAX_LEN)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, drop_last=True)
test_dataset = TextDataset(test_data, 'test', THRESHOLD, MAX_LEN, train_dataset.idx2word, train_dataset.word2idx)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False, num_workers=1, drop_last=False)
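"""As an optional sanity check (not required), you can peek at a single batch from `train_loader`: assuming your `TextDataset` returns (text, label) pairs as used in the training loop below, each batch should contain a word-id tensor of shape `[BATCH_SIZE, MAX_LEN]` and a label tensor of shape `[BATCH_SIZE]`."""
### ILLUSTRATION ONLY ###
if __name__ == '__main__':
    _texts, _labels = next(iter(train_loader))
    print('Batch of texts: ', _texts.shape)    # expected: torch.Size([32, 200])
    print('Batch of labels:', _labels.shape)   # expected: torch.Size([32])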
"""Now we provide you with a function that takes your model and trains it on the data.
You do not need to edit this cell. However, you may want to write code to save your
model periodically, as Colab connections are not permanent. See the tutorial here if you wish to do this: https://pytorch.org/tutorials/beginner/saving_loading_models.html.
"""
### DO NOT EDIT ###
from tqdm.notebook import tqdm
def train_cnn_model(model, num_epochs, data_loader, optimizer, criterion):
print('Training Model...')
model.train()
for epoch in range(num_epochs):
epoch_loss = 0
epoch_acc = 0
for texts, labels in tqdm(data_loader):
texts = texts.to(DEVICE) # shape: [batch_size, MAX_LEN]
labels = labels.to(DEVICE) # shape: [batch_size]
optimizer.zero_grad()
output = model(texts)
acc = accuracy(output, labels)
loss = criterion(output, labels)
loss.backward()
optimizer.step()
epoch_loss += loss.item()
epoch_acc += acc.item()
print('[TRAIN]\t Epoch: {:2d}\t Loss: {:.4f}\t Train Accuracy: {:.2f}%'.format(epoch+1, epoch_loss/len(data_loader), 100*epoch_acc/len(data_loader)))
print('Model Trained!\n')
"""Here are some other helper functions we will need."""
### DO NOT EDIT ###
def accuracy(output, labels):
"""
Returns accuracy per batch
output: Tensor [batch_size, n_classes]
labels: LongTensor [batch_size]
"""
preds = output.argmax(dim=1) # find predicted class
correct = (preds == labels).sum().float() # convert into float for division
acc = correct / len(labels)
return acc
"""Now you can instantiate your model. We provide you with some recommended hyperparameters; you should be able to get the desired accuracy with these, but feel free to play around with them."""
### DO NOT EDIT ###
if __name__=='__main__':
cnn_model = CNN(vocab_size = train_dataset.vocab_size, # Don't change this
embed_size = 128,
out_channels = 64,
filter_heights = [2, 3, 4],
stride = 1,
dropout = 0.5,
num_classes = 2, # Don't change this
pad_idx = train_dataset.word2idx[CNN_PAD]) # Don't change this
# Put your model on the device (cuda or cpu)
cnn_model = cnn_model.to(DEVICE)
print('The model has {:,d} trainable parameters'.format(count_parameters(cnn_model)))
"""Next, we create the **criterion**, which is our loss function: it is a measure of how well the model matches the empirical distribution of the data. We use cross-
entropy loss (https://en.wikipedia.org/wiki/Cross_entropy).
We also define the **optimizer**, which performs gradient descent. We use the Adam optimizer (https://arxiv.org/pdf/1412.6980.pdf), which has been shown to work well on these types of models.
"""
### DO NOT EDIT ###
import torch.optim as optim
if __name__=='__main__':
LEARNING_RATE = 5e-4 # Feel free to try other learning rates
# Define the loss function
criterion = nn.CrossEntropyLoss().to(DEVICE)
# Define the optimizer
optimizer = optim.Adam(cnn_model.parameters(), lr=LEARNING_RATE)
"""Finally, we can train the model. If the model is implemented correctly and you're using the GPU, this cell should take around <b>4 minutes</b> (or less). Feel
free to change the number of epochs."""
### DO NOT EDIT ###
if __name__=='__main__':
N_EPOCHS = 10 # Feel free to change this
# train model for N_EPOCHS epochs
train_cnn_model(cnn_model, N_EPOCHS, train_loader, optimizer, criterion)
"""## Evaluate CNN Model [20 points]
Now that we have trained a model for text classification, it is time to evaluate it. We have provided you with a function to do this; you do not need to modify anything.
To pass the autograder for the CNN, you will need to achieve **82% accuracy** on the hidden test set on Gradescope. Note that the Gradescope test set is very similar to the one used here, so the accuracies on the two datasets should be comparable.
<font color='green'><b>Hint:</b> If you receive close to 82% accuracy in the notebook but close to 50% accuracy in the autograder, then the most likely causes are:
1. You uploaded an untrained model checkpoint. Make sure you save the model after it is trained.
2. Your `TextDataset` class is not deterministic in that the `word2idx` and `idx2word` mappings are not necessarily in the same order when the class is instantiated multiple times. This is a problem because your trained CNN will expect the words in the order seen in this notebook, but the autograder will be using a different ordering. If this is your issue, reimplement the `TextDataset` class so that it is deterministic, and then retrain and upload your model; one possible deterministic approach is sketched below.</font>
"""
### DO NOT EDIT ###
import random
def evaluate(model, data_loader, criterion, use_tqdm=False):
print('Evaluating performance on the test dataset...')
has_printed=False
model.eval()
epoch_loss = 0
epoch_acc = 0
all_predictions = []
iterator = tqdm(data_loader) if use_tqdm else data_loader
total = 0
for texts, labels in iterator:
bs = texts.shape[0]
total += bs
texts = texts.to(DEVICE)
labels = labels.to(DEVICE)
output = model(texts)
acc = accuracy(output, labels) * len(labels)
pred = output.argmax(dim=1)
all_predictions.append(pred)
loss = criterion(output, labels) * len(labels)
epoch_loss += loss.item()
epoch_acc += acc.item()
if random.random() < 0.0015 and bs == 1:
if not has_printed: print("\nSOME PREDICTIONS FROM THE MODEL:")
print("Input: "+' '.join([data_loader.dataset.idx2word[idx] for idx in texts[0].tolist() if idx not in {data_loader.dataset.word2idx[CNN_PAD], data_loader.dataset.word2idx[CNN_END]}]))
print("Prediction:", pred.item(), '\tCorrect Output:', labels.item(), '\n')
has_printed=True
full_acc = 100*epoch_acc/total
full_loss = epoch_loss/total
print('[TEST]\t Loss: {:.4f}\t Accuracy: {:.2f}%'.format(full_loss, full_acc))
predictions = torch.cat(all_predictions)
return predictions, full_acc, full_loss
### DO NOT EDIT ###
if __name__=='__main__':
evaluate(cnn_model, test_loader, criterion, use_tqdm=True) # Compute test data accuracy
"""# What to Submit
To submit the assignment, download this notebook as a <TT>.py</TT> file. You can do
this by going to <TT>File > Download > Download .py</TT>. Then (optionally) rename it to `hwk2.py`.
You will also need to save the `cnn_model` (you do not need to save anything additional for your word embeddings). You can run the cell below to do this. After you save the files to your Google Drive, you need to manually download the files to
your computer, and then submit them to the autograder.
You will submit the following files to the autograder:
1. `hwk2.py`, the download of this notebook as a `.py` file (**not** a `.ipynb` file)
1. `cnn.pt`, the saved version of your `cnn_model`
"""
### DO NOT EDIT ###
if __name__=='__main__':
from google.colab import drive
drive.mount('/content/drive')
print()
try:
cnn_model is None
cnn_exists = True
except:
cnn_exists = False
if cnn_exists:
print("Saving CNN model
....
")
torch.save(cnn_model, "drive/My Drive/cnn.pt")
print("\nDone!")