NATURAL LANGUAGE PROCESSING. HOMEWORK 1.
Author: Lucía Colín Cosano. A20552447.
PROBLEM 1
• Read in these two GLUE datasets (see section “DATA” above). Also convert alphabetical characters to lower case.
• Convert each dataset into a single list of tokens by applying the function “word_tokenize()” from the NLTK nltk.tokenize package. We will use these lists to represent two distributions of English text.
• To show you have finished this step, print the first 10 tokens from each dataset.
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\lulac\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
First 10 tokens from SST dataset:
['hide', 'new', 'secretions', 'from', 'the', 'parental', 'units', 'contains', 'no', 'wit']
First 10 tokens from QNLI dataset:
['as', 'of', 'that', 'day', ',', 'the', 'new', 'constitution', 'heralding', 'the']
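As a minimal illustration of the tokenization step (separate from the full loading code in the cells below), applying word_tokenize to one lowercased SST sentence reproduces the leading tokens printed above:

from nltk.tokenize import word_tokenize

# Tokenize a single lowercased sentence; punctuation would be returned as separate tokens.
print(word_tokenize("hide new secretions from the parental units".lower()))
# ['hide', 'new', 'secretions', 'from', 'the', 'parental', 'units']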
PROBLEM 2
• Write a Python function that creates a probability distribution from a list of tokens. This function should return a dictionary that maps a token to a probability (i.e., maps a string to a floating-point value).
• Apply your function to the list created in Problem 1 to create SST and QNLI distributions.
• Show that both probability distributions sum to 1, allowing for some small numerical rounding error. Or, if they do not, add a comment in your notebook to explain why.
Both probability distributions sum to approximately 1.
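For reference, the distribution built here is the relative-frequency (maximum-likelihood) estimate over the token list, $\hat{p}(w) = \mathrm{count}(w) / N$, where $N$ is the total number of tokens; summing these estimates over the vocabulary should give exactly 1 up to floating-point rounding.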
PROBLEM 3
• Write a Python function that computes the entropy of a random variable, input as a probability distribution.
• Use this function to compute the word-level entropy of SST and QNLI, using the distributions you created in Problem 2. Show results in your notebook.
Word-level entropy of SST: 10.079162530566823
Word-level entropy of QNLI: 10.056278588664085
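The word-level entropy reported above is the Shannon entropy in bits, $H(p) = -\sum_{w} p(w)\,\log_2 p(w)$, which is what the compute_entropy function later in the notebook implements.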
PROBLEM 4
• Write a Python function to compute the KL divergence between two probability distributions.
• Apply this function to the distributions you created in Problem 2 to show that KL divergence is not symmetric. [This is also question 2.12 of M&S, p79].
KL Divergence (SST to QNLI): 1.9005802911305951
KL Divergence (QNLI to SST): 1.8128388732656786
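The divergence is computed as $D(P \,\|\, Q) = \sum_{w} p(w)\,\log_2 \frac{p(w)}{q(w)}$, with tokens missing from $Q$ assigned a small smoothing probability ($10^{-6}$) in the code below. The two values above differ, so $D(P \,\|\, Q) \neq D(Q \,\|\, P)$, showing the asymmetry.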
PROBLEM 5
• Write a Python function that computes the per-word entropy rate of a message relative to a specific probability distribution.
• Find a recent movie review online (any website) and compute the entropy rates of this movie review using the distributions you created for both SST and QNLI datasets. Show results in your notebook.
For this problem I used a review of "Gran Turismo" from Rotten Tomatoes: https://www.rottentomatoes.com/m/gran_turismo_based_on_a_true_story/reviews?intcmp=rt-what-to-know_read-critics-reviews
Per-word Entropy Rate (SST distribution): 12.84171092944692
Per-word Entropy Rate (QNLI distribution): 13.616919515501094
Handling “zero probabilities” with add-one smoothing (see the final two code cells):
Per-word Entropy Rate (SST distribution): 13.848587636582906
Per-word Entropy Rate (QNLI distribution): 13.962055312359407
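The per-word entropy rate of a message $w_1, \dots, w_N$ relative to a distribution $q$ is $-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(w_i)$. The first pair of numbers assigns unseen words a fixed floor probability of $10^{-6}$; the second pair (the “zero probabilities” handling) instead uses the add-one smoothing variant shown in the last two code cells.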
In [1]:
import nltk
from nltk.tokenize import word_tokenize
import csv

# Download NLTK data if necessary
nltk.download('punkt')

# Define a function to read and preprocess a TSV dataset
def preprocess_tsv_dataset(file_path, column_name):
    tokens = []
    with open(file_path, 'r', encoding='utf-8') as tsvfile:
        reader = csv.DictReader(tsvfile, delimiter='\t')
        for row in reader:
            text = row[column_name].lower()  # Convert to lowercase
            tokens.extend(word_tokenize(text))  # Tokenize the text and extend the token list
    return tokens

# File paths to your GLUE datasets and the column name to use
sst_path = 'train.tsv'
qnli_path = 'dev.tsv'
column_name_sst = 'sentence'
column_name_qnli = 'sentence'

# Preprocess and tokenize the datasets
sst_tokens = preprocess_tsv_dataset(sst_path, column_name_sst)
qnli_tokens = preprocess_tsv_dataset(qnli_path, column_name_qnli)

# Print the first 10 tokens from each dataset
print("First 10 tokens from SST dataset:")
print(sst_tokens[:10])
print("\nFirst 10 tokens from QNLI dataset:")
print(qnli_tokens[:10])
In [2]:
def create_probability_distribution(tokens):
    total_tokens = len(tokens)
    token_counts = {}
    for token in tokens:
        if token in token_counts:
            token_counts[token] += 1
        else:
            token_counts[token] = 1
    probability_distribution = {}
    for token, count in token_counts.items():
        probability = count / total_tokens
        probability_distribution[token] = probability
    return probability_distribution
In [3]:
sst_distribution = create_probability_distribution(sst_tokens)
qnli_distribution = create_probability_distribution(qnli_tokens)

# Calculate the sum of probabilities for each distribution
sst_sum = sum(sst_distribution.values())
qnli_sum = sum(qnli_distribution.values())

# Check if both distributions sum to approximately 1
epsilon = 1e-6  # A small epsilon to account for rounding errors
if abs(sst_sum - 1) < epsilon and abs(qnli_sum - 1) < epsilon:
    print("Both probability distributions sum to approximately 1.")
else:
    print("The probability distributions do not sum to 1 due to numerical rounding errors.")
In [4]:
import math

def compute_entropy(probability_distribution):
    entropy = 0.0
    for probability in probability_distribution.values():
        if probability > 0:
            entropy -= probability * math.log2(probability)
    return entropy
In [5]:
sst_entropy = compute_entropy(sst_distribution)
qnli_entropy = compute_entropy(qnli_distribution)

print("Word-level entropy of SST:", sst_entropy)
print("Word-level entropy of QNLI:", qnli_entropy)
In [6]:
import math

def compute_kl_divergence(p, q, smoothing=1e-6):
    kl_divergence = 0.0
    for token, probability_p in p.items():
        probability_q = q.get(token, smoothing)  # Add smoothing to avoid division by zero
        if probability_p > 0:
            kl_divergence += probability_p * math.log2(probability_p / probability_q)
    return kl_divergence
In [7]:
# Compute KL divergence with smoothing
kl_sst_to_qnli = compute_kl_divergence(sst_distribution, qnli_distribution)
kl_qnli_to_sst = compute_kl_divergence(qnli_distribution, sst_distribution)

print("KL Divergence (SST to QNLI):", kl_sst_to_qnli)
print("KL Divergence (QNLI to SST):", kl_qnli_to_sst)
In [8]:
import math

def compute_per_word_entropy_rate(message, probability_distribution):
    tokens = message.split()  # Tokenize the message by splitting on spaces
    total_entropy = 0.0
    for token in tokens:
        probability = probability_distribution.get(token, 1e-6)  # Add smoothing to avoid division by zero
        if probability > 0:
            total_entropy += math.log2(probability)
    num_tokens = len(tokens)
    if num_tokens == 0:
        return 0.0  # Return 0 entropy if there are no tokens in the message
    entropy_rate = -total_entropy / num_tokens
    return entropy_rate
In [9]:
movie_review = "I liked the story line. It was inspirational. I liked the family dynamics. The family was very caring and loving towards each other. I liked that the ma..."  # full review text truncated in the source

# Compute per-word entropy rates
sst_entropy_rate = compute_per_word_entropy_rate(movie_review, sst_distribution)
qnli_entropy_rate = compute_per_word_entropy_rate(movie_review, qnli_distribution)

print("Per-word Entropy Rate (SST distribution):", sst_entropy_rate)
print("Per-word Entropy Rate (QNLI distribution):", qnli_entropy_rate)
In [10]:
import math

def compute_per_word_entropy_rate(message, distribution, vocabulary_size):
    # Tokenize the message into words
    words = message.split()

    # Calculate the entropy rate with Laplace (add-one) smoothing
    entropy = 0.0
    for word in words:
        # Calculate smoothed probability using add-one smoothing
        p_word = (distribution.get(word, 0) + 1) / (len(words) + vocabulary_size)
        entropy -= math.log2(p_word)  # Using base 2 logarithm

    # Normalize the entropy by the number of words
    entropy_rate = entropy / len(words)
    return entropy_rate
In [11]:
movie_review = "I liked the story line. It was inspirational. I liked the family dynamics. The family was very caring and loving towards each other. I liked that the ma..."  # full review text truncated in the source

# Define vocabulary sizes for SST and QNLI distributions
sst_vocabulary_size = len(sst_distribution)
qnli_vocabulary_size = len(qnli_distribution)

# Compute per-word entropy rates
sst_entropy_rate = compute_per_word_entropy_rate(movie_review, sst_distribution, sst_vocabulary_size)
qnli_entropy_rate = compute_per_word_entropy_rate(movie_review, qnli_distribution, qnli_vocabulary_size)

print("Per-word Entropy Rate (SST distribution):", sst_entropy_rate)
print("Per-word Entropy Rate (QNLI distribution):", qnli_entropy_rate)