NATURAL LANGUAGE PROCESSING. HOMEWORK 1.
Author: Lucía Colín Cosano. A20552447.

PROBLEM 1
• Read in these two GLUE datasets (see section "DATA" above). Also convert alphabetical characters to lower case.
• Convert each dataset into a single list of tokens by applying the function "word_tokenize()" from the NLTK nltk.tokenize package. We will use these lists to represent two distributions of English text.
• To show you have finished this step, print the first 10 tokens from each dataset.

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lulac\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!

First 10 tokens from SST dataset: ['hide', 'new', 'secretions', 'from', 'the', 'parental', 'units', 'contains', 'no', 'wit']
First 10 tokens from QNLI dataset: ['as', 'of', 'that', 'day', ',', 'the', 'new', 'constitution', 'heralding', 'the']

PROBLEM 2
• Write a Python function that creates a probability distribution from a list of tokens. This function should return a dictionary that maps a token to a probability (i.e., maps a string to a floating-point value).
• Apply your function to the lists created in Problem 1 to create SST and QNLI distributions.
• Show that both probability distributions sum to 1, allowing for some small numerical rounding error. Or, if they do not, add a comment in your notebook to explain why.

Both probability distributions sum to approximately 1.

PROBLEM 3
• Write a Python function that computes the entropy of a random variable, input as a probability distribution.
• Use this function to compute the word-level entropy of SST and QNLI, using the distributions you created in Problem 2. Show results in your notebook.

Word-level entropy of SST: 10.079162530566823
Word-level entropy of QNLI: 10.056278588664085

PROBLEM 4
• Write a Python function to compute the KL divergence between two probability distributions.
• Apply this function to the distributions you created in Problem 2 to show that KL divergence is not symmetric. [This is also question 2.12 of M&S, p. 79.]

KL Divergence (SST to QNLI): 1.9005802911305951
KL Divergence (QNLI to SST): 1.8128388732656786

PROBLEM 5
• Write a Python function that computes the per-word entropy rate of a message relative to a specific probability distribution.
• Find a recent movie review online (any website) and compute the entropy rates of this movie review using the distributions you created for both SST and QNLI datasets. Show results in your notebook.

For this problem I have used a review of "Gran Turismo":
https://www.rottentomatoes.com/m/gran_turismo_based_on_a_true_story/reviews?intcmp=rt-what-to-know_read-critics-reviews

Per-word Entropy Rate (SST distribution): 12.84171092944692
Per-word Entropy Rate (QNLI distribution): 13.616919515501094

Handling "zero probabilities":
Per-word Entropy Rate (SST distribution): 13.848587636582906
Per-word Entropy Rate (QNLI distribution): 13.962055312359407

In [1]:
import nltk
from nltk.tokenize import word_tokenize
import csv

# Download NLTK data if necessary
nltk.download('punkt')

# Define a function to read and preprocess a TSV dataset
def preprocess_tsv_dataset(file_path, column_name):
    tokens = []
    with open(file_path, 'r', encoding='utf-8') as tsvfile:
        reader = csv.DictReader(tsvfile, delimiter='\t')
        for row in reader:
            text = row[column_name].lower()     # Convert to lowercase
            tokens.extend(word_tokenize(text))  # Tokenize the text and extend the token list
    return tokens

# File paths to your GLUE datasets and the column name to use
sst_path = 'train.tsv'
qnli_path = 'dev.tsv'
column_name_sst = 'sentence'
column_name_qnli = 'sentence'

# Preprocess and tokenize the datasets
sst_tokens = preprocess_tsv_dataset(sst_path, column_name_sst)
qnli_tokens = preprocess_tsv_dataset(qnli_path, column_name_qnli)

# Print the first 10 tokens from each dataset
print("First 10 tokens from SST dataset:")
print(sst_tokens[:10])
print("\nFirst 10 tokens from QNLI dataset:")
print(qnli_tokens[:10])

In [2]:
def create_probability_distribution(tokens):
    total_tokens = len(tokens)
    token_counts = {}
    for token in tokens:
        if token in token_counts:
            token_counts[token] += 1
        else:
            token_counts[token] = 1
    probability_distribution = {}
    for token, count in token_counts.items():
        probability = count / total_tokens
        probability_distribution[token] = probability
    return probability_distribution

In [3]:
sst_distribution = create_probability_distribution(sst_tokens)
qnli_distribution = create_probability_distribution(qnli_tokens)

# Calculate the sum of probabilities for each distribution
sst_sum = sum(sst_distribution.values())
qnli_sum = sum(qnli_distribution.values())

# Check if both distributions sum to approximately 1
epsilon = 1e-6  # A small epsilon to account for rounding errors
if abs(sst_sum - 1) < epsilon and abs(qnli_sum - 1) < epsilon:
    print("Both probability distributions sum to approximately 1.")
else:
    print("The probability distributions do not sum to 1 due to numerical rounding errors.")

In [4]:
import math

def compute_entropy(probability_distribution):
    entropy = 0.0
    for probability in probability_distribution.values():
        if probability > 0:
            entropy -= probability * math.log2(probability)
    return entropy

In [5]:
sst_entropy = compute_entropy(sst_distribution)
qnli_entropy = compute_entropy(qnli_distribution)

print("Word-level entropy of SST:", sst_entropy)
print("Word-level entropy of QNLI:", qnli_entropy)
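As a quick sanity check, compute_entropy can be applied to small distributions whose entropy is known in advance: a fair coin should come out to 1 bit and a uniform distribution over four outcomes to 2 bits. A minimal sketch using the function defined in In [4] (the toy dictionaries here are illustrative and not part of the original notebook):

# Sanity check for compute_entropy on distributions with known entropy
fair_coin = {'heads': 0.5, 'tails': 0.5}
uniform_four = {'a': 0.25, 'b': 0.25, 'c': 0.25, 'd': 0.25}
print(compute_entropy(fair_coin))     # expected: 1.0 bit
print(compute_entropy(uniform_four))  # expected: 2.0 bits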
In [6]:
import math

def compute_kl_divergence(p, q, smoothing=1e-6):
    kl_divergence = 0.0
    for token, probability_p in p.items():
        probability_q = q.get(token, smoothing)  # Add smoothing for tokens missing from q, to avoid division by zero
        if probability_p > 0:
            kl_divergence += probability_p * math.log2(probability_p / probability_q)
    return kl_divergence

In [7]:
# Compute KL divergence with smoothing
kl_sst_to_qnli = compute_kl_divergence(sst_distribution, qnli_distribution)
kl_qnli_to_sst = compute_kl_divergence(qnli_distribution, sst_distribution)

print("KL Divergence (SST to QNLI):", kl_sst_to_qnli)
print("KL Divergence (QNLI to SST):", kl_qnli_to_sst)
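The asymmetry can also be seen on a small hand-built pair of distributions, independent of the GLUE data. A minimal sketch using compute_kl_divergence from In [6] (the toy values below are illustrative):

# Toy illustration that D(p || q) differs from D(q || p)
p = {'a': 0.9, 'b': 0.1}
q = {'a': 0.5, 'b': 0.5}
print(compute_kl_divergence(p, q))  # D(p || q) ≈ 0.531 bits
print(compute_kl_divergence(q, p))  # D(q || p) ≈ 0.737 bits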
In [8]:
import math

def compute_per_word_entropy_rate(message, probability_distribution):
    tokens = message.split()  # Tokenize the message by splitting on spaces
    total_entropy = 0.0
    for token in tokens:
        probability = probability_distribution.get(token, 1e-6)  # Fall back to a small probability for unseen tokens, to avoid log of zero
        if probability > 0:
            total_entropy += math.log2(probability)
    num_tokens = len(tokens)
    if num_tokens == 0:
        return 0.0  # Return 0 entropy if there are no tokens in the message
    entropy_rate = -total_entropy / num_tokens
    return entropy_rate

In [9]:
# Review text truncated in the source PDF
movie_review = "I liked the story line. It was inspirational. I liked the family dynamics. The family was very caring and loving towards each other. I liked that the ma..."

# Compute per-word entropy rates
sst_entropy_rate = compute_per_word_entropy_rate(movie_review, sst_distribution)
qnli_entropy_rate = compute_per_word_entropy_rate(movie_review, qnli_distribution)

print("Per-word Entropy Rate (SST distribution):", sst_entropy_rate)
print("Per-word Entropy Rate (QNLI distribution):", qnli_entropy_rate)

In [10]:
import math

def compute_per_word_entropy_rate(message, distribution, vocabulary_size):
    # Tokenize the message into words
    words = message.split()
    # Calculate the entropy rate with Laplace (add-one) smoothing
    entropy = 0.0
    for word in words:
        # Calculate smoothed probability using add-one smoothing
        p_word = (distribution.get(word, 0) + 1) / (len(words) + vocabulary_size)
        entropy -= math.log2(p_word)  # Using base-2 logarithm
    # Normalize the entropy by the number of words
    entropy_rate = entropy / len(words)
    return entropy_rate

In [11]:
# Review text truncated in the source PDF
movie_review = "I liked the story line. It was inspirational. I liked the family dynamics. The family was very caring and loving towards each other. I liked that the ma..."

# Define vocabulary sizes for the SST and QNLI distributions
sst_vocabulary_size = len(sst_distribution)
qnli_vocabulary_size = len(qnli_distribution)

# Compute per-word entropy rates
sst_entropy_rate = compute_per_word_entropy_rate(movie_review, sst_distribution, sst_vocabulary_size)
qnli_entropy_rate = compute_per_word_entropy_rate(movie_review, qnli_distribution, qnli_vocabulary_size)

print("Per-word Entropy Rate (SST distribution):", sst_entropy_rate)
print("Per-word Entropy Rate (QNLI distribution):", qnli_entropy_rate)
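One detail worth noting: the distributions in Problem 2 were built from lowercased word_tokenize() output, while compute_per_word_entropy_rate splits the review on whitespace without lowercasing, so capitalized or punctuation-attached words fall back to the unseen-token probability. A tokenization-consistent variant is sketched below; the helper name is my own, and it assumes the imports and distributions from the cells above.

def compute_per_word_entropy_rate_tokenized(message, distribution, unseen_probability=1e-6):
    # Tokenize the same way the distributions were built: lowercase + NLTK word_tokenize
    tokens = word_tokenize(message.lower())
    if not tokens:
        return 0.0
    total = 0.0
    for token in tokens:
        # Unseen tokens fall back to a small probability, as in In [8]
        total -= math.log2(distribution.get(token, unseen_probability))
    return total / len(tokens)

# Example: per-word entropy rate of the review under the SST distribution
print(compute_per_word_entropy_rate_tokenized(movie_review, sst_distribution))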