# -*- coding: utf-8 -*-
"""hwk2.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1cu3jBth8OlEcurSC9orWpgRTOAdZLPKW
# CS 447 Homework 2 $-$ Word Embeddings \& Text Classification with Neural Networks
In this homework, you will first train word embeddings using the continuous-bag-of-words (CBOW) method. Then, you will build a convolutional neural network (CNN) classifier to detect the sentiment of movie reviews using the IMDb movie reviews dataset.
In addition to the PyTorch tutorial we have provided online, we highly recommend that you take a look at the following PyTorch tutorials before starting this assignment:
<ul>
<li><a href="https://pytorch.org/tutorials/beginner/pytorch_with_examples.html">https://pytorch.org/tutorials/beginner/pytorch_with_examples.html</a>
<li><a href="https://pytorch.org/tutorials/beginner/data_loading_tutorial.html">https://pytorch.org/tutorials/beginner/data_loading_tutorial.html</a>
<li><a href="https://github.com/yunjey/pytorch-tutorial">https://github.com/yunjey/pytorch-tutorial</a>
</ul>
<font color='green'><b>Hint:</b> We suggest that you work on this homework in <b>CPU</b> until you are ready to train. At that point, you should switch your runtime to <b>GPU</b>. You can do this by going to <TT>Runtime > Change Runtime Type</TT> and selecting "GPU" from the dropdown menu.
* You will find it easier to debug on CPU, and the error messages will be more understandable.
* Google monitors your GPU usage and will occasionally restrict GPU access if you use it too much. In these cases, you can either switch to a different Google account or wait for your access to be restored.</font>
We have imported all the libraries you need to do this homework. <b>You should not import any extra libraries. Furthermore, you should not write any code outside of TODO sections.</b> If you do, the autograder will fail to run your code.
# Part 1: Continuous-Bag-of-Words (CBOW) Embeddings [50 points]
In the first part of this assignment you will learn dense word embeddings based on the word2vec paradigm. In particular, you will use the continuous-bag-of-words approach, which trains a model to predict a word based on the embeddings of surrounding words. For example, in the sentence "the man walks the dog in the park", the embeddings for the words ("man", "walks", "dog", "in") will be used to predict the word "the" (if your context size is 2 on each side of the target word).
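To make the (context, target) construction concrete, here is a minimal sketch, assuming a plain list of tokens and a throwaway helper name `make_pairs` (this is an illustration only, not part of the assignment's graded code):

```python
# Illustrative sketch only (not part of the assignment's TODO code):
# enumerate (context, target) pairs for one tokenized sentence.
def make_pairs(sentence, c=2):
    pairs = []
    for i in range(c, len(sentence) - c):
        context = sentence[i - c:i] + sentence[i + 1:i + 1 + c]
        pairs.append((context, sentence[i]))
    return pairs

print(make_pairs("the man walks the dog in the park".split()))
# [(['the', 'man', 'the', 'dog'], 'walks'), (['man', 'walks', 'dog', 'in'], 'the'), ...]
```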
## Download \& Preprocess the Data
First we will download the dataset using [torchtext](https://torchtext.readthedocs.io/en/latest/index.html), which is a package that supports NLP for PyTorch.
Unfortunately, you have to install the <TT>torchdata</TT> package on the Colab machine in order to access the data. To do this, run the cell below (you may need to click the "Restart Runtime" button when it finishes). You will have to do this every time you return to work on the homework.
"""
!pip install torchdata==0.5.1
!pip install torchtext==0.14.0
### DO NOT EDIT ###
import torch
import torch.nn as nn
import torch.nn.functional as F
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if __name__=='__main__':
print('Using device:', DEVICE)
"""Now, we download the data. As with Homework 1, we will use WikiText-2, a corpus of high-quality Wikipedia articles. The dataset was originally introduced in the following paper: https://arxiv.org/pdf/1609.07843v1.pdf. A raw version of the data can easily be viewed here: https://github.com/pytorch/examples/tree/master/word_language_model/data/wikitext-
2.preprocess
After downloading the data, we preprocess the text as in Homework 1. <i>You do not need to edit this code.</i>
* <b>Sentence splitting:</b> In this homework, we are interested in modeling individual sentences, rather than longer chunks of text such
as paragraphs or documents. The WikiText dataset provides paragraphs; thus, we provide a simple method to identify individual sentences by splitting paragraphs at
punctuation tokens (".", "!", "?").
* <b>Sentence markers:</b> For both training and testing corpora, each sentence must be surrounded by a start-of-sentence (`<s>`) and end-of-sentence marker (`</s>`). These markers will allow your models to generate sentences that have realistic beginnings and endings.
* <b>Unknown words:</b> In order to deal with unknown words,
all words that do not appear in the vocabulary must be replaced with a special token for unknown words (`<UNK>`). The WikiText dataset has already done this, and you can read about the method in the paper above. When unknown words are encountered in the test corpus, they should be treated as that special token instead.
We provide you with preprocessing code here, and you should not modify it.
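For example, a hypothetical paragraph would come out of this preprocessing looking roughly as follows (the text itself is made up purely for illustration):

```python
# Hypothetical input/output of the preprocessing (illustration only):
# raw paragraph:  'He studied at <unk> University . It was his first post .'
# sentence 1:     ['<s>', 'he', 'studied', 'at', '<UNK>', 'university', '.', '</s>']
# sentence 2:     ['<s>', 'it', 'was', 'his', 'first', 'post', '.', '</s>']
```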
"""
### DO NOT EDIT ###
# Constants (feel free to use these in your code, but do not change them)
CBOW_START = "<s>" # Start-of-sentence token
CBOW_END = "</s>" # End-of-sentence-token
CBOW_UNK = "<UNK>" # Unknown word token
### DO NOT EDIT ###
import torchtext
import random
import sys
def cbow_preprocess(data, vocab=None, do_lowercase=True):
final_data = []
lowercase = "abcdefghijklmnopqrstuvwxyz"
for paragraph in data:
paragraph = [x if x != '<unk>' else CBOW_UNK for x in paragraph.split()]
if vocab is not None:
paragraph = [x if x in vocab else CBOW_UNK for x in paragraph]
if paragraph == [] or paragraph.count('=') >= 2: continue
sen = []
prev_punct, prev_quot = False, False
for word in paragraph:
if prev_quot:
if word[0] not in lowercase:
final_data.append(sen)
sen = []
prev_punct, prev_quot = False, False
if prev_punct:
if word == '"':
prev_punct, prev_quot = False, True
else:
if word[0] not in lowercase:
final_data.append(sen)
sen = []
prev_punct, prev_quot = False, False
if word in {'.', '?', '!'}: prev_punct = True
sen += [word]
if sen[-1] not in {'.', '?', '!', '"'}: continue # Prevent a lot of short sentences
final_data.append(sen)
vocab_was_none = vocab is None
if vocab is None:
vocab = {}
for i in range(len(final_data)):
# Make words lowercase for this assignment
final_data[i] = [x.lower() if do_lowercase and x != CBOW_UNK else x for x in final_data[i]]
final_data[i] = [CBOW_START] + final_data[i] + [CBOW_END]
if vocab_was_none:
for word in final_data[i]:
vocab[word] = vocab.get(word, 0) + 1
return final_data, vocab
def getDataset():
dataset = torchtext.datasets.WikiText2(root='.data', split=('train',))
train_dataset, vocab = cbow_preprocess(dataset[0])
return train_dataset, vocab
if __name__=='__main__':
sentences, vocab = getDataset()
"""Run the next cell to see 10 random sentences of the data."""
### DO NOT EDIT ###
if __name__ == '__main__':
for x in random.sample(sentences, 10):
print (x)
"""##<font color='red'>TODO:</font> Define the Dataset Class [15 points]
In the following cell, we will define the <b>dataset</b> class. The dataset contains input-output pairs for each training example we will provide to the model.
You need to implement the following functions:
* <b>` make_training_examples(self)`:</b> <b>[5 points]</b> Each training example will be a list of <em>context</em> words along with a <em>target</em> word.
The context words consist of $c$ words on either side of the target word; hence, each list of context words has size $2c$. The goal will be to have your model predict the target word from the context words. Thus, you must convert each sentence into a series of context-target pairs, as follows:
<ul>
<li>For each sentence $s=[w_1,w_2,...,w_n]$ and a context size $c$, compute the following (context, target) pairs:<br>&emsp;$([w_1,...,w_c,w_{c+2},...,w_{2c+1}]$, $w_{c+1}$)<br>&emsp;$([w_2,...,w_{c+1},w_{c+3},...,w_{2c+2}]$, $w_{c+2}$)<br>&emsp;$\vdots$<br>&emsp;$([w_{n-2c},...,w_{n-c-1},w_{n-c+1},...,w_{n}]$, $w_{n-c}$)<br>For example, suppose your sentence is "the man walks the dog in the park" and the context size is $c=2$. Your method should find the following training pairs:<br>&emsp;(["the", "man", "the", "dog"], "walks")<br>&emsp;(["man", "walks", "dog", "in"], "the")<br>&emsp;(["walks", "the", "in", "the"], "dog")<br>&emsp;(["the", "dog", "the", "park"], "in")<br>Of course, the sentences in your dataset have start- and end-of-sentence tokens as well, which you should treat as any other word.
</ul>
This function should return a list of <b>all</b> such training pairs.
* <b>` build_dictionaries(self, vocab)`:</b> <b>[4 points]</b> Creates the dictionaries `word2idx` and `idx2word`. You will represent each word in the vocabulary with a unique index, and keep track of this in these dictionaries. The input `vocab` is a list of words: you must assign indexes in the order the words appear in this list.
* <b>`get_context_vector(self, idx)`:</b> Returns a vector representing the <em>context</em> of the `idx`th training example. Specifically, if the context size
is $c$, this should be a tensor of $2c$ word indices corresponding to the context words of the `idx`th example.
<font color='green'><b>Hint:</b> You may want to pre-compute and save all context vectors (using word indices rather than the words themselves) in `__init__(...)`, and then access these in `get_context_vector(self, idx)`. This would give you a slight speedup at train time.</font>
* <b>`get_target_index(self, idx) `</b>: Return the target word index for the `idx`th training example.
* <b> ` __len__(self) `: [1 point]</b> Return the total number of training examples in the dataset as an `int`.
* <b>` __getitem__(self, idx)`:</b> <b>[5 points]</b> Return the `idx`th training example as a tuple of `(context_vector, target_word_index)`. You should use the ` get_context_vector(self, idx) ` and ` get_target_index(self, idx) ` functions here.
"""
from torch.utils import data
from collections import defaultdict
class CbowDataset(data.Dataset):
def __init__(self, sentences, vocab, context_size):
##### DO NOT EDIT #####
assert CBOW_START in vocab and CBOW_END in vocab and CBOW_UNK in vocab
self.sentences = sentences
self.context_size = context_size
self.training_examples = []
self.make_training_examples()
self.word2idx = {} # Mapping of word to index
self.idx2word = {} # Mapping of index to word
self.build_dictionaries(sorted(vocab.keys()))
self.context_vectors = []
self.create_context_vectors()
self.vocab_size = len(self.word2idx)
def create_context_vectors(self):
for context, word in self.training_examples:
context_indices = torch.Tensor([self.word2idx[word] for word in context])
word_idx = self.word2idx[word]
self.context_vectors.append((context_indices, word_idx))
def make_training_examples(self):
'''
Builds a list of context-target_word pairs that will be used as training examples for the model and stores them in
self.training_examples.
Each example is a (context, target_word) tuple, where context is a list of strings of size 2*context_size and
target_word is simply a string.
Returns nothing.
'''
##### TODO #####
# For each sentence, loop over each word in the sentence. If there are c words before and c words after the word,
# make a (context, word) pair, where context is a list made up of the c words before the word and the c words
# after the word (in the same order they appear in the sentence). Append this (context, word) pair to self.training_examples.
c = self.context_size
for sentence in self.sentences:
n = len(sentence)
for i in range(n - 2 * c):
context = sentence[i:i + c] + sentence[i + c + 1:i + 2 * c + 1]
word = sentence[i + c]
self.training_examples.append((context, word))
def build_dictionaries(self, vocab):
'''
Builds the dictionaries self.idx2word and self.word2idx. Make sure that you
assign indices
in the order the words appear in vocab (a list of words).
Returns nothing.
'''
##### TODO #####
for i, word in enumerate(vocab):
self.idx2word[i] = word
self.word2idx[word] = i
def get_context_vector(self, idx):
'''
Returns the context vector (as a torch.tensor) for the training example at index idx.
This is a tensor containing the indices of each word in the context.
'''
assert len(self.training_examples) > 0
##### TODO #####
return self.context_vectors[idx][0]
def get_target_index(self, idx):
'''
Returns the index of the target word (as type int) of the training example at index idx.
'''
##### TODO #####
return self.context_vectors[idx][1]
def __len__(self):
'''
Returns the number of training examples (as type int) in the dataset
'''
##### TODO #####
return len(self.training_examples)
def __getitem__(self, idx):
'''
Returns the context vector (as a torch.tensor) and target index (as type int) of the training example at index idx.
'''
##### TODO #####
return self.context_vectors[idx]
"""##Sanity Check: Dataset Class
The code below runs a sanity check for your `CbowDataset` class. The tests are similar to the hidden ones in Gradescope. However, note that passing the sanity check does <b>not</b> guarantee that you will pass the autograder; it is intended to help you debug.
You do <b>not</b> need to edit this cell.
"""
### DO NOT EDIT ###
def sanityCheckCbowDataset():
# Read in the sample corpus
test_sents = [['<s>', 'the', 'man', 'walks', 'the', 'dog', 'in', 'the', 'park',
'</s>'],
['<s>', 'i', 'saw', 'the', 'man', 'with', 'the', 'telescope', 'on', 'the', CBOW_UNK, '</s>']]
test_vocab = {'<s>':2, 'the':6, 'man':2, 'walks': 1, 'dog': 1, 'in': 1, 'park':
1, 'i': 1, 'saw': 1, 'with': 1, 'telescope':1, 'on': 1, CBOW_UNK: 1, '</s>': 2}
print("Sample dataset:")
for x in test_sents: print(x)
context_sizes = [1,3,5]
print('\n--- TEST: training_examples ---')
training_examples_expected = [[(['<s>', 'man'], 'the'), (['the', 'walks'], 'man'), (['man', 'the'], 'walks'), (['walks', 'dog'], 'the'), (['the', 'in'], 'dog'), (['dog', 'the'], 'in'), (['in', 'park'], 'the'), (['the', '</s>'], 'park'),
(['<s>', 'saw'], 'i'), (['i', 'the'], 'saw'), (['saw', 'man'], 'the'), (['the', 'with'], 'man'), (['man', 'the'], 'with'), (['with', 'telescope'], 'the'), (['the',
'on'], 'telescope'), (['telescope', 'the'], 'on'), (['on', '<UNK>'], 'the'), (['the', '</s>'], '<UNK>')],
[(['<s>', 'the', 'man', 'the', 'dog', 'in'], 'walks'), (['the', 'man', 'walks', 'dog', 'in', 'the'], 'the'), (['man', 'walks', 'the', 'in', 'the', 'park'], 'dog'), (['walks', 'the', 'dog', 'the', 'park', '</s>'], 'in'), (['<s>', 'i', 'saw', 'man', 'with', 'the'], 'the'), (['i', 'saw', 'the', 'with', 'the', 'telescope'], 'man'), (['saw', 'the', 'man', 'the', 'telescope', 'on'], 'with'), (['the', 'man', 'with', 'telescope', 'on', 'the'], 'the'), (['man', 'with', 'the', 'on', 'the', '<UNK>'], 'telescope'), (['with', 'the', 'telescope', 'the', '<UNK>', '</s>'], 'on')],
[(['<s>', 'i', 'saw', 'the', 'man', 'the', 'telescope', 'on', 'the', '<UNK>'], 'with'), (['i', 'saw', 'the', 'man', 'with', 'telescope', 'on', 'the', '<UNK>', '</s>'], 'the')]
]
for i in range(len(context_sizes)):
c=context_sizes[i]
test_dataset = CbowDataset(test_sents, test_vocab, c)
has_passed, message = True, ''
training_examples = test_dataset.training_examples
expected = training_examples_expected[i]
if has_passed and len(training_examples) != len(expected):
has_passed, message = False, 'len(training_examples) is incorrect. Expected: ' + str(len(expected)) + '\tGot: ' + str(len(training_examples))
if has_passed and set([(type(x), len(x)) for x in training_examples]) != {(tuple, 2)}:
has_passed, message = False, 'Each item of training_examples must be a 2-tuple; at least one of your items is not a 2-tuple.'
if has_passed and set([(type(x[0]), type(x[1])) for x in training_examples]) != {(list, str)}:
has_passed, message = False, 'Each item must contain a list of context words and a target word as a string. At least one of your items does not meet this condition.'
if has_passed and sorted(training_examples, key = lambda x: (' '.join(x[0]), x[1])) != sorted(expected, key = lambda x: (' '.join(x[0]), x[1])):
has_passed, message = False, 'training_examples is incorrect (note that the order of the examples does not matter). Expected: '+str(sorted(expected, key = lambda x: (' '.join(x[0]), x[1]))) + '\tGot: ' + str(sorted(training_examples, key = lambda x: (' '.join(x[0]), x[1])))
status = 'PASSED' if has_passed else 'FAILED'
print('\tcontext_size:', c, '\t'+status, '\t'+message)
print('\n--- TEST: idx2word and word2idx dictionaries ---')
expected_word2idx = {'</s>': 0, '<UNK>': 1, '<s>': 2, 'dog': 3, 'i': 4, 'in': 5, 'man': 6, 'on': 7, 'park': 8, 'saw': 9, 'telescope': 10, 'the': 11, 'walks': 12,
'with': 13}
expected_idx2word = {0: '</s>', 1: '<UNK>', 2: '<s>', 3: 'dog', 4: 'i', 5: 'in', 6: 'man', 7: 'on', 8: 'park', 9: 'saw', 10: 'telescope', 11: 'the', 12: 'walks', 13: 'with'}
for i in range(len(context_sizes)):
c=context_sizes[i]
test_dataset = CbowDataset(test_sents, test_vocab, c)
has_passed, message = True, ''
word2idx = test_dataset.word2idx
idx2word = test_dataset.idx2word
has_passed, message = True, ''
if has_passed and (test_dataset.vocab_size != len(test_dataset.word2idx) or
test_dataset.vocab_size != len(test_dataset.idx2word)):
has_passed, message = False, 'dataset.vocab_size (' + str(test_dataset.vocab_size) + ') must be the same length as dataset.word2idx (' + str(len(test_dataset.word2idx)) + ') and dataset.idx2word ('+str(len(test_dataset.idx2word)) +').'
if has_passed and (test_dataset.vocab_size != len(expected_word2idx)):
has_passed, message = False, 'Your vocab size is incorrect. Expected: ' + str(len(expected_word2idx)) + '\tGot: ' + str(test_dataset.vocab_size)
if has_passed and sorted(list(test_dataset.idx2word.keys())) != list(range(0, test_dataset.vocab_size)):
has_passed, message = False, 'dataset.idx2word must have keys ranging from 0 to dataset.vocab_size-1. Keys in your dataset.idx2word: ' + str(sorted(list(test_dataset.idx2word.keys())))
if has_passed and sorted(list(test_dataset.word2idx.keys())) != sorted(list(expected_word2idx.keys())):
has_passed, message = False, 'Your dataset.word2idx has incorrect keys. Expected: ' + str(sorted(list(expected_word2idx.keys()))) + '\tGot: ' + str(sorted(list(test_dataset.word2idx.keys())))
if has_passed: # Check that word2idx and idx2word are consistent
widx = sorted(list(test_dataset.word2idx.items()))
idxw = sorted(list([(v,k) for k,v in test_dataset.idx2word.items()]))
if not (len(widx) == len(idxw) and all([widx[q] == idxw[q] for q in range(len(widx))])):
has_passed, message = False, 'Your dataset.word2idx and dataset.idx2word are not consistent. dataset.idx2word: ' + str(test_dataset.idx2word) + '\tdataset.word2idx: ' + str(test_dataset.word2idx)
if has_passed and word2idx != expected_word2idx:
has_passed, message = False, 'Your dataset.word2idx is incorrect. Expected: ' + str(expected_word2idx) + '\tGot: ' + str(word2idx)
if has_passed and idx2word != expected_idx2word:
has_passed, message = False, 'Your dataset.idx2word is incorrect. Expected: ' + str(expected_idx2word) + '\tGot: ' + str(idx2word)
status = 'PASSED' if has_passed else 'FAILED'
print('\tcontext_size:', c, '\t'+status, '\t'+message)
print('\n--- TEST: len(dataset) ---')
correct_lens = [18,10,2]
for i in range(len(context_sizes)):
c=context_sizes[i]
test_dataset = CbowDataset(test_sents, test_vocab, c)
has_passed = len(test_dataset) == correct_lens[i]
status = 'PASSED' if has_passed else 'FAILED'
if has_passed: message = ''
else: message = 'len(dataset) is incorrect. Expected: ' + str(correct_lens[i]) + '\tGot: ' + str(len(test_dataset))
print('\tcontext_size:', c, '\t'+status, '\t'+message)
print('\n--- TEST: __getitem__(self, idx) ---')
for i in range(len(context_sizes)):
c=context_sizes[i]
test_dataset = CbowDataset(test_sents, test_vocab, c)
for j in range(correct_lens[i]):
returned = test_dataset.__getitem__(j)
has_passed, message = True, ''
if has_passed and len(returned) != 2:
has_passed, message = False, 'dataset.__getitem__(idx) must return 2 items. Got ' + str(len(returned)) +' items instead.'
if has_passed and (type(returned[0]) != torch.Tensor or type(returned[1]) != int):
has_passed, message = False, 'The context vector must be a torch.Tensor and the target index must be an int. Got: (' + str(type(returned[0])) + ', ' + str(type(returned[1])) + ')'
if has_passed and (returned[0].shape != torch.randint(0,100,(2*c,)).shape):
has_passed, message = False, 'Shape of first return is incorrect. Expected: ' + str(torch.randint(0,100,(2*c,)).shape) + '.\tGot: ' + str(returned[0].shape)
status = 'PASSED' if has_passed else 'FAILED'
print('\tcontext_size:', c, '\tidx:',str(j),'\t' if j<10 else '','\t',status, '\t'+message)
if __name__ == '__main__':
sanityCheckCbowDataset()
"""##<font color='red'>TODO:</font> Define the CBOW Model [20 points]
Here, you will define a simple feed-forward neural network that takes in a context vector and predicts the word that completes the context. We provide you with the `CbowModel` class, and you just need to fill in parts of the `__init__(...)` and `forward(...)` functions. Each of these functions is worth <b>10 points</b>.
We have provided you with instructions and hints in the comments. In particular, pay attention to the desired shapes; you may find it helpful to print the shape of the tensors as you code. It may also help to keep the PyTorch documentation open for the modules & functions you are using, since they describe input and output dimensions.
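As a rough guide, the shape flow through a correct forward pass looks like the sketch below (the sizes are illustrative, and the module calls are a standalone demonstration rather than the graded class):

```python
# Illustrative shape walk-through for a CBOW-style forward pass (not graded code).
import torch
import torch.nn as nn

batch_size, context_size, vocab_size, embed_size, hidden_size = 4, 3, 1000, 128, 64
inputs = torch.randint(0, vocab_size, (batch_size, 2 * context_size))   # word indices

emb = nn.Embedding(vocab_size, embed_size)(inputs)         # [4, 6, 128]
avg = emb.mean(dim=1)                                       # [4, 128]   average over context words
hid = torch.relu(nn.Linear(embed_size, hidden_size)(avg))   # [4, 64]
out = nn.Linear(hidden_size, vocab_size)(hid)               # [4, 1000]  raw logits (no softmax)
print(out.shape)
```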
"""
class CbowModel(nn.Module):
def __init__(self, vocab_size, embed_size, hidden_size, context_size):
'''
vocab_size: Size of the vocabulary
embed_size: Size of your embedding vectors
hidden_size: Size of hidden layer of neural network
context_size: The size of your context window used to generate training examples
'''
super(CbowModel, self).__init__()
self.context_size = context_size
##### TODO #####
# 1. Create an embedding layer using nn.Embedding, that will take an index in your vocabulary as input
# (referring to a word) and return a vector of size embed_size (i.e. your embedding vector).
# Note that providing a word index to nn.Embedding is the same (conceptually) as providing a one-hot
# vector to nn.Linear (however, nn.Embedding takes sparsity into account, so is more efficient)
self.embedding = nn.Embedding(vocab_size, embed_size)
# 2. Create a linear layer that projects your embedding vector to a vector of size hidden_size.
self.linear = nn.Linear(embed_size, hidden_size)
# 3. Create an output linear layer, that projects your hidden vector to a vector the size of your vocabulary.
self.output = nn.Linear(hidden_size, vocab_size)
def forward(self, inputs):
'''
inputs: Tensor of size [batch_size, 2*context_size]
Returns output: Tensor of size [batch_size, vocab_size]
'''
##### TODO #####
# 1. Feed the inputs through your embedding layer to get a tensor of size [batch_size, 2*context size, embed_size]
# 2. Average the embedding vectors of each of your context word embeddings (for each example in your batch).
# Expected size: [batch_size, embed_size]
# 3. Feed this through your linear layer and then a ReLU activation. Expected size: [batch_size, hidden_size]
# 4. Feed this through your output layer and return the result. Expected size [batch_size, vocab_size]
# Do NOT apply a softmax to the final output - this is done in the training method!
relu = nn.ReLU()
x1 = self.embedding(inputs)
x2 = torch.mean(x1, 1)
x3 = self.linear(x2)
x4 = relu(x3)
return self.output(x4)
"""## Sanity Check: CBOW Model"""
### DO NOT EDIT ###
count_parameters = lambda model: sum(p.numel() for p in model.parameters() if p.requires_grad)
def makeCbowSanityBatch(test_params):
batch_size = test_params['batch_size']
new_test_params = {k:v for k,v in test_params.items() if k != 'batch_size'}
batch = torch.randint(0, new_test_params['vocab_size'], (batch_size,new_test_params['context_size']*2))
return batch, new_test_params
def sanityCheckModel(all_test_params, NN, expected_outputs, init_or_forward, make_batch_fxn=None):
print('--- TEST: ' + ('Number of Model Parameters (tests __init__(...))' if init_or_forward=='init' else 'Output shape of forward(...)') + ' ---')
for tp_idx, (test_params, expected_output) in enumerate(zip(all_test_params, expected_outputs)):
if init_or_forward == "forward":
input, test_params = make_batch_fxn(test_params)
# Construct the student model
tps = {k:v for k, v in test_params.items()}
stu_nn = NN(**tps)
if init_or_forward == "forward":
with torch.no_grad():
stu_out = stu_nn(input)
ref_out_shape = expected_output
has_passed = torch.is_tensor(stu_out)
if not has_passed: msg = 'Output must be a torch.Tensor; received ' + str(type(stu_out))
else:
has_passed = stu_out.shape == ref_out_shape
msg = 'Your Output Shape: ' + str(stu_out.shape)
status = 'PASSED' if has_passed else 'FAILED'
message = '\t' + status + "\t Init Input: " + str({k:v for k,v in tps.items()}) + '\tForward Input Shape: ' + str(input.shape) + '\tExpected Output Shape: ' + str(ref_out_shape) + '\t' + msg
print(message)
else:
stu_num_params = count_parameters(stu_nn)
ref_num_params = expected_output
comparison_result = (stu_num_params == ref_num_params)
status = 'PASSED' if comparison_result else 'FAILED'
message = '\t' + status + "\tInput: " + str({k:v for k,v in test_params.items()}) + ('\tExpected Num. Params: ' + str(ref_num_params) + '\tYour Num. Params: '+ str(stu_num_params))
print(message)
del stu_nn
if __name__ == '__main__':
# Test init
cbow_init_inputs = [{'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 128, 'context_size': 2}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 128, 'context_size': 4}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 128, 'context_size': 2}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 128, 'context_size': 4}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64,
'context_size': 2}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 128, 'context_size': 2}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 128, 'context_size': 4}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 128, 'context_size': 2}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 128, 'context_size': 4}]
cbow_init_expected_outputs = [3082, 3082, 5834, 5834, 5450, 5450, 10250, 10250,
99112, 99112, 165224, 165224, 133160, 133160, 201320, 201320]
sanityCheckModel(cbow_init_inputs, CbowModel, cbow_init_expected_outputs, "init")
print()
# Test forward
cbow_forward_inputs = [{'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 1}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 5}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 500}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 1}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 5}, {'vocab_size': 10, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 500}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 1}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 5}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 500}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 1}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 5}, {'vocab_size': 10, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 500}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 5}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 2, 'batch_size': 500}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 5}, {'vocab_size': 1000, 'embed_size': 32, 'hidden_size': 64, 'context_size': 4, 'batch_size': 500}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 5}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 2, 'batch_size': 500}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 5}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 64, 'context_size': 4, 'batch_size': 500}]
cbow_forward_expected_outputs = [torch.Size([1, 10]), torch.Size([5, 10]), torch.Size([500, 10]), torch.Size([1, 10]), torch.Size([5, 10]), torch.Size([500, 10]), torch.Size([1, 10]), torch.Size([5, 10]), torch.Size([500, 10]), torch.Size([1, 10]), torch.Size([5, 10]), torch.Size([500, 10]), torch.Size([1, 1000]), torch.Size([5, 1000]), torch.Size([500, 1000]), torch.Size([1, 1000]), torch.Size([5, 1000]), torch.Size([500, 1000]), torch.Size([1, 1000]), torch.Size([5, 1000]), torch.Size([500, 1000]), torch.Size([1, 1000]), torch.Size([5, 1000]), torch.Size([500, 1000])]
sanityCheckModel(cbow_forward_inputs, CbowModel, cbow_forward_expected_outputs,
"forward", makeCbowSanityBatch)
"""##Train the CBOW Model [15 points]
Now, we initialize the <b>dataloader</b>. A dataloader is responsible for providing batches of data to your model. Notice how we first instantiate the dataset.
You do not need to edit this cell.
"""
### DO NOT EDIT ###
BATCH_SIZE = 1000 # You may change the batch size if you'd like
CONTEXT_SIZE = 3 # You may change the context size if you'd like
if __name__=='__main__':
cbow_dataset = CbowDataset(sentences, vocab, CONTEXT_SIZE)
cbow_dataloader = torch.utils.data.DataLoader(cbow_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, drop_last=True)
"""Now we provide you with a function that takes your model and trains it on the data.
You do not need to edit this cell. However, you may want to write code to save your
model periodically, as Colab connections are not permanent. See the tutorial here if you wish to do this: https://pytorch.org/tutorials/beginner/saving_loading_models.html.
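For reference, a minimal checkpointing sketch in the spirit of that tutorial might look like this (the file name is arbitrary, and it assumes `cbow_model` has already been instantiated):

```python
# Optional checkpointing sketch (adapt as needed; file name is arbitrary):
torch.save(cbow_model.state_dict(), 'cbow_checkpoint.pt')   # save learned weights

# ...later, after re-creating the model with the same hyperparameters:
cbow_model.load_state_dict(torch.load('cbow_checkpoint.pt', map_location=DEVICE))
```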
"""
### DO NOT EDIT ###
from tqdm.notebook import tqdm
from torch import optim
def train_cbow_model(model, num_epochs, data_loader, optimizer, criterion):
print("Training CBOW model
....
")
for epoch in range(num_epochs):
epoch_loss, n = 0, 0
for context, target in tqdm(data_loader):
optimizer.zero_grad()
log_probs = model(context.long().to(DEVICE)) # to(torch.float32)
loss = criterion(log_probs, target.to(DEVICE))
loss.backward()
optimizer.step()
n += context.shape[0]
epoch_loss += (loss*context.shape[0])
epoch_loss = epoch_loss/n
print('[TRAIN]\t Epoch: {:2d}\t Loss: {:.4f}'.format(epoch+1, epoch_loss))
print('CBOW Model Trained!\n')
"""Now you can instantiate your model. We provide you with some recommended hyperparameters; you should be able to get the desired accuracy with these, but feel free to play around with them."""
if __name__=='__main__':
cbow_model = CbowModel(vocab_size = cbow_dataset.vocab_size, # Don't change this
embed_size = 128, # Feel free to change
hidden_size = 128, # Feel free to change
context_size = CONTEXT_SIZE) # Don't change this (though you may change the value of CONTEXT_SIZE above if you wish)
# Put your model on the device (cuda or cpu)
cbow_model = cbow_model.to(DEVICE)
print('The model has {:,d} trainable parameters'.format(count_parameters(cbow_model)))
"""Next, we create the **criterion**, which is our loss function: it is a measure of how well the model matches the empirical distribution of the data. We use cross-
entropy loss (https://en.wikipedia.org/wiki/Cross_entropy).
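Concretely, for a batch of $N$ examples with logits $z_i \in \mathbb{R}^{|V|}$ and target word indices $y_i$, `nn.CrossEntropyLoss` applies a log-softmax to the logits and averages the negative log-likelihood of the targets:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(z_{i,y_i})}{\sum_{j=1}^{|V|} \exp(z_{i,j})}$$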
We also define the **optimizer**, which performs gradient descent. We use the Adam optimizer (https://arxiv.org/pdf/1412.6980.pdf), which has been shown to work well on these types of models.
"""
import torch.optim as optim
if __name__=='__main__':
LEARNING_RATE = 0.01 # Feel free to try other learning rates
# Define the loss function
criterion = nn.CrossEntropyLoss().to(DEVICE)
# Define the optimizer
optimizer = optim.Adam(cbow_model.parameters(), lr=LEARNING_RATE)
"""Finally, we can train the model. If the model is implemented correctly and you're using the GPU, this cell should take around <b>3 minutes</b> (or less). Feel
free to change the number of epochs."""
if __name__=='__main__':
N_EPOCHS = 6 # Feel free to change this
# Train model for N_EPOCHS epochs
train_cbow_model(cbow_model, N_EPOCHS, cbow_dataloader, optimizer, criterion)
"""To get full credit on the word embeddings, you must return the correct vectors when your model is instantiated with a particular random seed and called on the autograder. This is worth <b>15 points</b>.
## Visualize Word Embeddings
Now that you have a trained model, we can extract the word embeddings and visualize
them. The word embeddings are basically the weight matrix of the embedding layer that you defined, as this maps each index of your vocab to a dense vector of size `embed_size`.
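For instance, once training has finished you can inspect the embedding of a single word like this (the word `'dog'` is just an example and is assumed to be in the vocabulary):

```python
# Illustrative: look up one word's trained embedding vector.
idx = cbow_dataset.word2idx['dog']          # word -> vocabulary index
vec = cbow_model.embedding.weight[idx]      # tensor of shape [embed_size]
print(vec.shape)
```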
Since we cannot easily visualize such high-dimensional vectors, we use a process called TSNE (t-distributed stochastic neighbor embedding). This reduces the vectors to a 2-dimensional space so that we can visualize them. For more information on TSNE, see https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding. Note that this method is not deterministic, so running this cell multiple times will give you a different visualization.
The cell below will run TSNE and plot the word embeddings corresponding to the 1,000 most frequent words on a 2-dimensional plot. You are welcome to increase this threshold if you'd like to see the vectors for more words.
"""
### DO NOT EDIT ###
if __name__=='__main__':
from sklearn.manifold import TSNE
import numpy as np
import plotly.express as px
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
THRESHOLD = 1000
words = [x[0] for x in sorted(vocab.items(), key = lambda x: -x[1])[:THRESHOLD]]
idxes = [cbow_dataset.word2idx[word] for word in words]
vectors = np.array([cbow_model.embedding.weight[i].tolist() for i in idxes])
tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, verbose=False)
new_vectors = tsne_model.fit_transform(vectors)
df = pd.DataFrame(data={'x': new_vectors[:,0], 'y': new_vectors[:,1], 'word':words})
fig = px.scatter(df, x='x', y='y', text='word')
fig.update_traces(textposition='top center')
fig.update_layout(height=600, title_text='Word Embedding 2D Visualization')
fig.show()
"""At a high level, you should see words with similar meaning clustering together. You can use your mouse to zoom in and inspect the vector space closer.
You should also see mini-clusters within this plot; you will need to zoom in to examine these. Examples of mini-clusters you might see are:
* <b>Time words:</b> hours, minutes, seconds, months, weeks, years, etc.
* <b>Years:</b> 2000, 2002, 2004, etc.
* <b>Numbers:</b> 10, 15, 37, etc.
* <b>Months:</b> january, february, march, etc. <em>Question: does the word 'may', which is both a month and a modal verb, cluster with the other months? If not, can you see where it is in relation to other modal verbs ('can', 'will', 'would', 'might', etc.)?</em>
Feel free to increase the number of vectors plotted if you want to investigate further.
# Part 2: Train a Convolutional Neural Network (CNN) [50 points]
The second part of this homework concerns text classification. You will train a CNN
classifier to determine the sentiment of movie reviews.
## Download & Preprocess the Data
We will be using the IMDb movie reviews dataset, which is a corpus of movie reviews
along with a <em>positive</em> or <em>negative</em> classification. This is again provided by torchtext.
The following cell will produce `train_data` and `test_data`. It also does some basic tokenization.
* To access the list of textual tokens for the *i*th example, use `train_data[i][1]`
* To access the label for the *i*th example, use `train_data[i][0]`
"""
### DO NOT EDIT ###
import torchtext
import random
def cnn_preprocess(review):
'''
Simple preprocessing function.
'''
res = []
for x in review.split(' '):
remove_beg=True if x[0] in {'(', '"', "'"} else False
remove_end=True if x[-1] in {'.', ',', ';', ':', '?', '!', '"', "'", ')'} else False
if remove_beg and remove_end: res += [x[0], x[1:-1], x[-1]]
elif remove_beg: res += [x[0], x[1:]]
elif remove_end: res += [x[:-1], x[-1]]
else: res += [x]
return res
if __name__=='__main__':
train_data = torchtext.datasets.IMDB(root='.data', split='train')
train_data = list(train_data)
train_data = [(x[0], cnn_preprocess(x[1])) for x in train_data]
train_data, test_data = train_data[0:10000] + train_data[12500:12500+10000], train_data[10000:12500] + train_data[12500+10000:],
print('Num. Train Examples:', len(train_data))
print('Num. Test Examples:', len(test_data))
# Make pos/neg
train_data = [('neg' if x[0] == 1 else 'pos', x[1]) for x in train_data]
test_data = [('neg' if x[0] == 1 else 'pos', x[1]) for x in test_data]
print("\nSAMPLE DATA:")
for x in random.sample(train_data, 5):
print('Sample text:', x[1])
print('Sample label:', x[0], '\n')
"""## <font color='red'>TODO:</font> Define the Dataset Class [10 Points]
In the following cell, we will define the <b>dataset</b> class. The dataset contains the tokenized data for your model. You need to implement the following functions:
* <b>` build_dictionary(self)`:</b> <b>[5 points]</b> Creates the dictionaries `idx2word` and `word2idx`. You will represent each word in the dataset with a unique index, and keep track of this in these dictionaries. Use the hyperparameter `threshold` to control which words appear in the dictionary: a training word’s frequency should be `>= threshold` to be included in the dictionary.
* <b>`convert_text(self)`:</b> Converts each review in the dataset to a list of indices, given by your `word2idx` dictionary. You should store this in the
`textual_ids` variable, and the function does not return anything. If a word is not
present in the `word2idx` dictionary, you should use the `<UNK>` token for that word. Be sure to append the `<END>` token to the end of each review.
* <b>` get_text(self, idx) `:</b> Return the review at `idx` in the dataset as an
array of indices corresponding to the words in the review. If the length of the review is less than `max_len`, you should pad the review with the `<PAD>` character
up to the length of `max_len`. If the length is greater than `max_len`, then it should only return the first `max_len` words. The return type should be `torch.LongTensor`.
* <b>`get_label(self, idx) `</b>: Return the value `1` if the label for `idx` in the dataset is `positive`, and should return `0` if it is `negative`. The return type should be `torch.LongTensor`.
* <b> ` __len__(self) `:</b> Return the total number of reviews in the dataset as an `int`.
* <b>` __getitem__(self, idx)`:</b> <b>[5 points]</b> Return the (padded) text, and the label. The return type for both these items should be `torch.LongTensor`. You should use the ` get_label(self, idx) ` and ` get_text(self, idx) ` functions here.
<b>Note:</b> You should convert all words to lower case in your functions.
<font color='green'><b>Hint:</b> Make sure that you use instance variables such as `self.threshold` throughout your code, rather than the global variable `THRESHOLD` (defined later on). The variable `THRESHOLD` will not be known to the autograder, and the use of it within the class will cause an autograder error.</font>
<font color='green'><b>Hint:</b> Make sure that your dataset is deterministic $-$ that is, if it is instantiated multiple times, then the `word2idx` and `idx2word` mappings are the same. If they are not, the autograder will be unable to evaluate your CNN classifications.</font>
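To make the padding and truncation behaviour concrete, here is a small hand-worked sketch (the vocabulary and indices are made up for illustration and do not come from the real dataset):

```python
# Illustration only: how a converted review interacts with max_len.
# Suppose word2idx were {'<PAD>': 0, '<END>': 1, '<UNK>': 2, 'great': 3, 'movie': 4}
review = [3, 4, 2, 1]                               # 'great movie <UNK> <END>'

max_len = 6
padded = review + [0] * (max_len - len(review))     # -> [3, 4, 2, 1, 0, 0]

max_len = 3
truncated = review[:max_len]                        # -> [3, 4, 2]
```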
"""
CNN_PAD = '<PAD>'
CNN_END = '<END>'
CNN_UNK = '<UNK>'
from torch.utils import data
from collections import defaultdict
class TextDataset(data.Dataset):
def __init__(self, examples, split, threshold, max_len, idx2word=None, word2idx=None):
##### DO NOT EDIT #####
self.examples = examples
assert split in {'train', 'val', 'test'}
self.split = split
self.threshold = threshold
self.max_len = max_len
# Dictionaries
self.word2idx = word2idx # Mapping of word to index
self.idx2word = idx2word # Mapping of index to word
if split == 'train':
self.build_dictionary()
self.vocab_size = len(self.word2idx)
# Convert text to indices
self.textual_ids = []
self.convert_text()
def build_dictionary(self):
'''
Build the dictionaries idx2word and word2idx. This is only called when split='train', as these
dictionaries are passed in to the __init__(...) function otherwise. Be sure
to use self.threshold
to control which words are assigned indices in the dictionaries.
Returns nothing.
'''
assert self.split == 'train'
# Don't change this
self.idx2word = {0:CNN_PAD, 1:CNN_END, 2: CNN_UNK}
self.word2idx = {CNN_PAD:0, CNN_END:1, CNN_UNK: 2}
##### TODO #####
# Count the frequencies of all words in the training data (self.examples)
# Assign idx (starting from 3) to all words having word_freq >= self.threshold
# Make sure you call word.lower() on each word to convert it to lowercase
self.freq = {}
for _, example in self.examples:
for word in example:
word = word.lower()
if word in self.freq:
self.freq[word] += 1
else:
self.freq[word] = 1
i = 3
for word, freq in self.freq.items():
if freq >= self.threshold:
self.idx2word[i] = word
self.word2idx[word] = i
i += 1
def convert_text(self):
'''
Convert each review in the dataset (self.examples) to a list of indices, given by self.word2idx.
Store this in self.textual_ids; returns nothing.
'''
##### TODO #####
# Remember to replace a word with the <UNK> token if it does not exist in the word2idx dictionary.
# Remember to append the <END> token to the end of each review.
for _, example in self.examples:
review = [self.word2idx[word] if word in self.word2idx else self.word2idx[CNN_UNK] for word in example] + [self.word2idx[CNN_END]]
self.textual_ids.append(review)
def get_text(self, idx):
'''
Return the review at idx as a long tensor (torch.LongTensor) of integers corresponding to the words in the review.
You may need to pad as necessary (see above).
'''
##### TODO #####
review = self.textual_ids[idx]
n = len(review)
if n > self.max_len:
review = review[:self.max_len]
else:
review = review + [self.word2idx[CNN_PAD]] * (self.max_len - n)
return torch.LongTensor(review)
def get_label(self, idx):
'''
This function should return the value 1 if the label for idx in the dataset
is 'positive',
and 0 if it is 'negative'. The return type should be torch.LongTensor.
'''
##### TODO #####
label = 1 if self.examples[idx][0] == 'pos' else 0
return torch.squeeze(torch.LongTensor([label]))
def __len__(self):
'''
Return the number of reviews (int value) in the dataset
'''
##### TODO #####
return len(self.examples)
def __getitem__(self, idx):
'''
Return the review, and label of the review specified by idx.
'''
##### TODO #####
return self.get_text(idx), self.get_label(idx)
"""##Sanity Check: Dataset Class
The code below runs a sanity check for your `Dataset` class. The tests are similar to the hidden ones in Gradescope. However, note that passing the sanity check does <b>not</b> guarantee that you will pass the autograder; it is intended to help you debug.
"""
### DO NOT EDIT ###
def sanityCheckTextDataset():
# Read in the sample corpus
reviews = [('pos', 'Your life is good when you have money, success and health'),
('neg', 'Life is bad when you got not a lot')]
data = [(x[0], cnn_preprocess(x[1])) for x in reviews]
print("Sample dataset:")
for x in data: print(x)
thresholds = [1,2,3]
print('\n--- TEST: idx2word and word2idx dictionaries ---') # max_len does not matter for this test
correct = [[',', '<END>', '<PAD>', '<UNK>', 'a', 'and', 'bad', 'good', 'got', 'have', 'health', 'is', 'life', 'lot', 'money', 'not', 'success', 'when', 'you', 'your'], ['<END>', '<PAD>', '<UNK>', 'is', 'life', 'when', 'you'], ['<END>', '<PAD>', '<UNK>']]
for i in range(len(thresholds)):
dataset = TextDataset(data, 'train', threshold=thresholds[i], max_len=3)
has_passed, message = True, ''
if has_passed and (dataset.vocab_size != len(dataset.word2idx) or dataset.vocab_size != len(dataset.idx2word)):
has_passed, message = False, 'dataset.vocab_size (' + str(dataset.vocab_size) + ') must be the same length as dataset.word2idx (' + str(len(dataset.word2idx)) + ') and dataset.idx2word ('+str(len(dataset.idx2word)) +').'
if has_passed and (dataset.vocab_size != len(correct[i])):
has_passed, message = False, 'Your vocab size is incorrect. Expected: ' + str(len(correct[i])) + '\tGot: ' + str(dataset.vocab_size)
if has_passed and sorted(list(dataset.idx2word.keys())) != list(range(0, dataset.vocab_size)):
has_passed, message = False, 'dataset.idx2word must have keys ranging from 0 to dataset.vocab_size-1. Keys in your dataset.idx2word: ' + str(sorted(list(dataset.idx2word.keys())))
if has_passed and sorted(list(dataset.word2idx.keys())) != correct[i]:
has_passed, message = False, 'Your dataset.word2idx has incorrect keys. Expected: ' + str(correct[i]) + '\tGot: ' + str(sorted(list(dataset.word2idx.keys())))
if has_passed: # Check that word2idx and idx2word are consistent
widx = sorted(list(dataset.word2idx.items()))
idxw = sorted(list([(v,k) for k,v in dataset.idx2word.items()]))
if not (len(widx) == len(idxw) and all([widx[q] == idxw[q] for q in range(len(widx))])):
has_passed, message = False, 'Your dataset.word2idx and dataset.idx2word are not consistent. dataset.idx2word: ' + str(dataset.idx2word) + '\tdataset.word2idx: ' + str(dataset.word2idx)
status = 'PASSED' if has_passed else 'FAILED'
print('\tthreshold:', thresholds[i], '\tmax_len:', 3, '\t'+status, '\t'+message)
print('\n--- TEST: len(dataset) ---')
has_passed = len(dataset) == 2
if has_passed: print('\tPASSED')
else: print('\tlen(dataset) is incorrect. Expected: 2\tGot: ' + str(len(dataset)))
print('\n--- TEST: __getitem__(self, idx) ---')
max_lens = [3,8,15]
idxes = [0,1]
combos = [{'threshold': t, 'max_len': m, 'idx': idx} for t in thresholds for m in max_lens for idx in idxes]
correct = [(torch.tensor([3, 4, 5]), torch.tensor(1)), (torch.tensor([ 4, 5, 15]), torch.tensor(0)), (torch.tensor([ 3, 4, 5, 6, 7, 8, 9, 10]), torch.tensor(1)), (torch.tensor([ 4, 5, 15, 7, 8, 16, 17, 18]), torch.tensor(0)), (torch.tensor([ 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1, 0, 0]), torch.tensor(1)), (torch.tensor([ 4, 5, 15, 7, 8, 16, 17, 18, 19, 1, 0, 0, 0, 0, 0]), torch.tensor(0)), (torch.tensor([2, 3, 4]), torch.tensor(1)), (torch.tensor([3, 4, 2]), torch.tensor(0)), (torch.tensor([2, 3, 4, 2, 5, 6, 2, 2]), torch.tensor(1)), (torch.tensor([3, 4, 2, 5, 6, 2, 2, 2]), torch.tensor(0)), (torch.tensor([2, 3, 4, 2, 5, 6, 2, 2, 2, 2, 2, 2, 1, 0, 0]), torch.tensor(1)), (torch.tensor([3, 4, 2, 5, 6, 2, 2, 2, 2, 1, 0, 0, 0, 0, 0]), torch.tensor(0)), (torch.tensor([2, 2, 2]), torch.tensor(1)), (torch.tensor([2, 2, 2]), torch.tensor(0)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2]), torch.tensor(1)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2]), torch.tensor(0)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 0]), torch.tensor(1)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 0, 0, 0, 0]), torch.tensor(0))]
for i in range(len(combos)):
combo = combos[i]
dataset = TextDataset(data, 'train', threshold=combo['threshold'], max_len=combo['max_len'])
returned = dataset.__getitem__(combo['idx'])
has_passed, message = True, ''
if has_passed and len(returned) != 2:
has_passed, message = False, 'dataset.__getitem__(idx) must return 2 items. Got ' + str(len(returned)) +' items instead.'
if has_passed and (type(returned[0]) != torch.Tensor or type(returned[1]) != torch.Tensor):
has_passed, message = False, 'Both returns must be of type torch.Tensor. Got: (' + str(type(returned[0])) + ', ' + str(type(returned[1])) + ')'
if has_passed and (returned[0].shape != correct[i][0].shape):
has_passed, message = False, 'Shape of first return is incorrect. Expected: ' + str(correct[i][0].shape) + '.\tGot: ' + str(returned[0].shape)
if has_passed and (returned[1].shape != correct[i][1].shape):
has_passed, message = False, 'Shape of second return is incorrect. Expected: ' + str(correct[i][1].shape) + '.\tGot: ' + str(returned[1].shape) + '\n\t\tHint: torch.Size([]) means that the tensor should be dimensionless (just a number). Try squeezing your result.'
if has_passed and (returned[1] != correct[i][1]):
has_passed, message = False, 'Label (second return) is incorrect. Expected: ' + str(correct[i][1]) + '.\tGot: ' + str(returned[1])
if has_passed:
correct_padding_idxes, your_padding_idxes = torch.where(correct[i][0] == 0)[0], torch.where(returned[0] == dataset.word2idx[CNN_PAD])[0]
if not (correct_padding_idxes.shape == your_padding_idxes.shape and torch.all(correct_padding_idxes == your_padding_idxes)):
has_passed, message = False, 'Padding is not correct. Expected padding indxes: ' + str(correct_padding_idxes) + '.\tYour padding indexes: ' + str(your_padding_idxes)
status = 'PASSED' if has_passed else 'FAILED'
print('\tthreshold:', combo['threshold'], '\tmax_len:', combo['max_len'] , '\tidx:', combo['idx'], '\t'+status, '\t'+message)
if __name__ == '__main__':
sanityCheckTextDataset()
"""The following cell builds the dataset on the IMDb movie reviews and prints an
example:"""
### DO NOT EDIT ###
if __name__=='__main__':
train_dataset = TextDataset(train_data, 'train', threshold=10, max_len=150)
print('Vocab size:', train_dataset.vocab_size, '\n')
randidx = random.randint(0, len(train_dataset)-1)
text, label = train_dataset[randidx]
print('Example text:')
print(train_data[randidx][1])
print(text)
print('\nExample label:')
print(train_data[randidx][0])
print(label)
"""## <font color='red'>TODO:</font> Define the CNN Model [20 points]
Here you will define your convolutional neural network for text classification. We provide you with the CNN class, you need to fill in parts of the `__init__(...)` and `forward(...)` functions. Each of these functions is worth <b>10 points</b>.
We have provided you with instructions and hints in the comments. In particular, pay attention to the desired shapes; you may find it helpful to print the shape of the tensors as you code. It may also help to keep PyTorch documentation open for the modules & functions you are using, since they describe input and output dimensions.
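As with the CBOW model, tracing the shapes through a single convolution branch can help; the sketch below uses illustrative sizes and standalone modules rather than the graded class:

```python
# Illustrative shape walk-through for one Conv1d branch (not graded code).
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, max_len, vocab_size, embed_size, out_channels, filter_height = 4, 150, 1000, 128, 64, 3
texts = torch.randint(0, vocab_size, (batch_size, max_len))

emb = nn.Embedding(vocab_size, embed_size)(texts)                 # [4, 150, 128]
emb = emb.permute(0, 2, 1)                                        # [4, 128, 150]  (channels first for nn.Conv1d)
conv = nn.Conv1d(embed_size, out_channels, filter_height)(emb)    # [4, 64, 148]   (148 = 150 - 3 + 1 with stride 1)
pooled = F.relu(conv).max(dim=2).values                           # [4, 64]        (max over positions)
# Concatenating the pooled outputs of all branches gives [4, out_channels * num_branches].
print(pooled.shape)
```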
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
class CNN(nn.Module):
def __init__(self, vocab_size, embed_size, out_channels, filter_heights, stride, dropout, num_classes, pad_idx):
super(CNN, self).__init__()
##### TODO #####
# Create an embedding layer (https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
# to represent the words in your vocabulary. Make sure to use vocab_size, embed_size, and pad_idx here.
# Define multiple Convolution layers (nn.Conv1d) with filter (kernel) size [filter_height, embed_size] based on your
# different filter_heights.
# Input channels will be embed_size and output channels will be out_channels (these many different filters will be trained
# for each convolution layer)
# If you want, you can store a list of modules inside nn.ModuleList.
# Create a dropout layer (nn.Dropout) using dropout
# Define a linear layer (nn.Linear) that consists of num_classes units
# and takes as input the concatenated output for all cnn layers (out_channels * num_of_cnn_layers units)
self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=pad_idx)
self.convs = []
for filter_height in filter_heights:
self.convs.append(nn.Conv1d(embed_size, out_channels, filter_height, stride=stride))
self.convs = nn.ModuleList(self.convs)
self.dropout = nn.Dropout(dropout)
self.linear = nn.Linear(out_channels * len(filter_heights), num_classes)
def forward(self, texts):
"""
texts: LongTensor [batch_size, max_len]
Returns output: Tensor [batch_size, num_classes]
"""
##### TODO #####
# Pass texts through your embedding layer to convert from word ids to word embeddings
# Resulting: shape: [batch_size, max_len, embed_size]
x1 = self.embedding(texts)
# Pass these texts to each of your conv layers and compute their output as follows:
# Your cnn output will have shape [batch_size, out_channels, *] where * depends on filter_height and stride
# Apply non-linearity on it (F.relu() is a commonly used one. Feel free to try others)
# Take the max value across last dimension to have shape [batch_size, out_channels]
# Concatenate (torch.cat) outputs from all your cnns [batch_size, (out_channels*num_of_cnn_layers)]
x2 = []
for conv in self.convs:
x = conv(x1.permute(0, 2, 1))
x = F.relu(x)
x = torch.max(x, 2).values
x2.append(x)
x2 = torch.cat(x2, 1)
# Let's understand what you just did:
# Since each cnn is of different filter_height, it will look at different number of words at a time
# So, a filter_height of 3 means your cnn looks at 3 words (3-grams) at a time and tries to extract some information from it
# Each cnn will learn out_channels number of features from the words it sees at a time
# Then you applied a non-linearity and took the max value for all channels
# You are essentially trying to find important n-grams from the entire text
# Everything happens on a batch simultaneously hence you have that additional batch_size as the first dimension
# Apply dropout
x3 = self.dropout(x2)
# Pass your output through the linear layer and return its output
# Resulting shape: [batch_size, num_classes]
x4 = self.linear(x3)
# NOTE: Do NOT apply a sigmoid or softmax to the final output - this is done in the training method!
return x4
"""##Sanity Check: CNN Model
The code below runs a sanity check for your `CNN` class. The tests are similar to the hidden ones in Gradescope. However, note that passing the sanity check does <b>not</b> guarantee that you will pass the autograder; it is intended to help you debug.
"""
### DO NOT EDIT ###
def makeCnnSanityBatch(test_params):
batch_size = test_params['batch_size']
max_len = test_params['max_len']
new_test_params = {k:v for k,v in test_params.items() if k not in {'batch_size', 'max_len'}}
batch = torch.randint(0, new_test_params['vocab_size'], (batch_size,max_len))
return batch, new_test_params
if __name__ == '__main__':
# Test init
cnn_init_inputs = [{'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size':
1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 3,
'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0,
'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 32, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 32, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}]
cnn_init_expected_outputs = [22434, 22531, 22434, 22531, 23874, 23939, 23874, 23939, 41730, 42115, 41730, 42115, 47490, 47747, 47490, 47747, 44578, 44675, 44578,
44675, 47554, 47619, 47554, 47619, 82306, 82691, 82306, 82691, 94210, 94467, 94210,
94467]
sanityCheckModel(cnn_init_inputs, CNN, cnn_init_expected_outputs, "init")
print()
# Test forward
cnn_forward_inputs = [{'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride':
1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride':
1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128,
'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride':
3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride':
3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 10, 'batch_size': 50}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 1}, {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 128, 'filter_heights': [5, 10], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0, 'max_len': 100, 'batch_size': 50}]
cnn_forward_expected_outputs = [torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 2]), torch.Size([50, 2]),
torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 2]), torch.Size([50, 2]), torch.Size([1, 3]), torch.Size([50, 3]), torch.Size([1, 3]), torch.Size([50, 3])]
sanityCheckModel(cnn_forward_inputs, CNN, cnn_forward_expected_outputs, "forward", makeCnnSanityBatch)
"""## Train CNN Model
First, we initialize the train and test <b>dataloaders</b>. A dataloader is responsible for providing batches of data to your model. Notice that we first instantiate datasets for the train and test data, and that we use the training vocabulary for both.
You do not need to edit this cell.
"""
### DO NOT EDIT ###
if __name__=='__main__':
THRESHOLD = 5 # Don't change this
MAX_LEN = 200 # Don't change this
BATCH_SIZE = 32 # Feel free to try other batch sizes
train_dataset = TextDataset(train_data, 'train', THRESHOLD, MAX_LEN)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, drop_last=True)
test_dataset = TextDataset(test_data, 'test', THRESHOLD, MAX_LEN, train_dataset.idx2word, train_dataset.word2idx)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False, num_workers=1, drop_last=False)
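"""As an optional sanity check (not required), you can peek at a single batch from `train_loader`: assuming your `TextDataset` returns (text, label) pairs as used in the training loop below, each batch should contain a word-id tensor of shape `[BATCH_SIZE, MAX_LEN]` and a label tensor of shape `[BATCH_SIZE]`."""
### ILLUSTRATION ONLY ###
if __name__ == '__main__':
    _texts, _labels = next(iter(train_loader))
    print('Batch of texts: ', _texts.shape)    # expected: torch.Size([32, 200])
    print('Batch of labels:', _labels.shape)   # expected: torch.Size([32])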
"""Now we provide you with a function that takes your model and trains it on the data.
You do not need to edit this cell. However, you may want to write code to save your
model periodically, as Colab connections are not permanent. See the tutorial here if you wish to do this: https://pytorch.org/tutorials/beginner/saving_loading_models.html.
"""
### DO NOT EDIT ###
from tqdm.notebook import tqdm
def train_cnn_model(model, num_epochs, data_loader, optimizer, criterion):
print('Training Model...')
model.train()
for epoch in range(num_epochs):
epoch_loss = 0
epoch_acc = 0
for texts, labels in tqdm(data_loader):
texts = texts.to(DEVICE) # shape: [batch_size, MAX_LEN]
labels = labels.to(DEVICE) # shape: [batch_size]
optimizer.zero_grad()
output = model(texts)
acc = accuracy(output, labels)
loss = criterion(output, labels)
loss.backward()
optimizer.step()
epoch_loss += loss.item()
epoch_acc += acc.item()
print('[TRAIN]\t Epoch: {:2d}\t Loss: {:.4f}\t Train Accuracy: {:.2f}%'.format(epoch+1, epoch_loss/len(data_loader), 100*epoch_acc/len(data_loader)))
print('Model Trained!\n')
"""Here are some other helper functions we will need."""
### DO NOT EDIT ###
def accuracy(output, labels):
"""
Returns accuracy per batch
output: Tensor [batch_size, n_classes]
labels: LongTensor [batch_size]
"""
preds = output.argmax(dim=1) # find predicted class
correct = (preds == labels).sum().float() # convert into float for division
acc = correct / len(labels)
return acc
"""Now you can instantiate your model. We provide you with some recommended hyperparameters; you should be able to get the desired accuracy with these, but feel free to play around with them."""
### DO NOT EDIT ###
if __name__=='__main__':
cnn_model = CNN(vocab_size = train_dataset.vocab_size, # Don't change this
embed_size = 128,
out_channels = 64,
filter_heights = [2, 3, 4],
stride = 1,
dropout = 0.5,
num_classes = 2, # Don't change this
pad_idx = train_dataset.word2idx[CNN_PAD]) # Don't change this
# Put your model on the device (cuda or cpu)
cnn_model = cnn_model.to(DEVICE)
print('The model has {:,d} trainable parameters'.format(count_parameters(cnn_model)))
"""Next, we create the **criterion**, which is our loss function: it is a measure of how well the model matches the empirical distribution of the data. We use cross-
entropy loss (https://en.wikipedia.org/wiki/Cross_entropy).
We also define the **optimizer**, which performs gradient descent. We use the Adam optimizer (https://arxiv.org/pdf/1412.6980.pdf), which has been shown to work well on these types of models.
"""
### DO NOT EDIT ###
import torch.optim as optim
if __name__=='__main__':
LEARNING_RATE = 5e-4 # Feel free to try other learning rates
# Define the loss function
criterion = nn.CrossEntropyLoss().to(DEVICE)
# Define the optimizer
optimizer = optim.Adam(cnn_model.parameters(), lr=LEARNING_RATE)
"""Finally, we can train the model. If the model is implemented correctly and you're using the GPU, this cell should take around <b>4 minutes</b> (or less). Feel
free to change the number of epochs."""
### DO NOT EDIT ###
if __name__=='__main__':
N_EPOCHS = 10 # Feel free to change this
# train model for N_EPOCHS epochs
train_cnn_model(cnn_model, N_EPOCHS, train_loader, optimizer, criterion)
"""## Evaluate CNN Model [20 points]
Now that we have trained a model for text classification, it is time to evaluate it. We have provided you with a function to do this; you do not need to modify anything.
To pass the autograder for the CNN, you will need to achieve **82% accuracy** on the hidden test set on Gradescope. Note that the Gradescope test set is very similar to the one used here, so the accuracies on the two datasets should be comparable.
<font color='green'><b>Hint:</b> If you receive close to 82% accuracy in the notebook but close to 50% accuracy in the autograder, then the most likely causes are:
1. You uploaded an untrained model checkpoint. Make sure you save the model after it is trained.
2. Your `TextDataset` class is not deterministic in that the `word2idx` and `idx2word` mappings are not necessarily in the same order when the class is instantiated multiple times. This is a problem because your trained CNN will expect the words in the order seen in this notebook, but the autograder will be using a different ordering. If this is your issue, reimplement the `TextDataset` class so that it is deterministic, and then retrain and upload your model; one possible deterministic approach is sketched below.</font>
"""
### DO NOT EDIT ###
import random
def evaluate(model, data_loader, criterion, use_tqdm=False):
print('Evaluating performance on the test dataset...')
has_printed=False
model.eval()
epoch_loss = 0
epoch_acc = 0
all_predictions = []
iterator = tqdm(data_loader) if use_tqdm else data_loader
total = 0
for texts, labels in iterator:
bs = texts.shape[0]
total += bs
texts = texts.to(DEVICE)
labels = labels.to(DEVICE)
output = model(texts)
acc = accuracy(output, labels) * len(labels)
pred = output.argmax(dim=1)
all_predictions.append(pred)
loss = criterion(output, labels) * len(labels)
epoch_loss += loss.item()
epoch_acc += acc.item()
if random.random() < 0.0015 and bs == 1:
if not has_printed: print("\nSOME PREDICTIONS FROM THE MODEL:")
print("Input: "+' '.join([data_loader.dataset.idx2word[idx] for idx in texts[0].tolist() if idx not in {data_loader.dataset.word2idx[CNN_PAD], data_loader.dataset.word2idx[CNN_END]}]))
print("Prediction:", pred.item(), '\tCorrect Output:', labels.item(), '\n')
has_printed=True
full_acc = 100*epoch_acc/total
full_loss = epoch_loss/total
print('[TEST]\t Loss: {:.4f}\t Accuracy: {:.2f}%'.format(full_loss, full_acc))
predictions = torch.cat(all_predictions)
return predictions, full_acc, full_loss
### DO NOT EDIT ###
if __name__=='__main__':
evaluate(cnn_model, test_loader, criterion, use_tqdm=True) # Compute test data accuracy
"""# What to Submit
To submit the assignment, download this notebook as a <TT>.py</TT> file. You can do
this by going to <TT>File > Download > Download .py</TT>. Then (optionally) rename it to `hwk2.py`.
You will also need to save the `cnn_model` (you do not need to save anything additional for your word embeddings). You can run the cell below to do this. After you save the files to your Google Drive, you need to manually download the files to
your computer, and then submit them to the autograder.
You will submit the following files to the autograder:
1. `hwk2.py`, the download of this notebook as a `.py` file (**not** a `.ipynb` file)
1. `cnn.pt`, the saved version of your `cnn_model`
"""
### DO NOT EDIT ###
if __name__=='__main__':
from google.colab import drive
drive.mount('/content/drive')
print()
try:
cnn_model is None
cnn_exists = True
except:
cnn_exists = False
if cnn_exists:
print("Saving CNN model
....
")
torch.save(cnn_model, "drive/My Drive/cnn.pt")
print("\nDone!")