ADVANCED DATA ANALYTICS - D213: TASK 2: SENTIMENT ANALYSIS USING NEURAL NETWORKS
LONDA DEVLIN
Student ID: 010150242
Master of Science, Data Analytics, Western Governors University
William Sewell, Ph.D.
Program Mentor: Christiana Okhipo
Phone: 239.687.6222
Email: Ldevli2@wgu.edu
Part I: Research Question
A. Purpose
1. Using Python and the imdb_labelled data set, can sentiment analysis be used to predict whether a reviewer rates a movie as positive or negative?
2. A model will be developed from the existing reviews, and its predictions will be compared against the known ratings to determine how closely they align.
3. One type of neural network is the Recurrent Neural Network (RNN). This type of network works well for natural language analysis because it has internal memory that allows it to learn from previous inputs. An RNN takes the input data and loops it through hidden layers, carrying state from one step to the next, to produce the output (“Power of Recurrent Neural Networks (RNN): Revolutionizing AI”).
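To make the idea of internal memory concrete, the short sketch below (a minimal illustration only, not part of the graded code; the array sizes and random weights are arbitrary) shows the basic recurrence a simple RNN applies at each step: the new hidden state is computed from the current input and the previous hidden state, so information from earlier words can influence later ones.

import numpy as np

# Minimal illustration of the RNN recurrence h_t = tanh(W_x @ x_t + W_h @ h_prev + b).
# Sizes are arbitrary; a trained network would learn W_x, W_h, and b.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

sequence = rng.normal(size=(5, input_dim))   # 5 time steps (e.g., 5 word vectors)
h = np.zeros(hidden_dim)                     # hidden state starts empty
for x_t in sequence:
    h = np.tanh(W_x @ x_t + W_h @ h + b)     # h carries information from earlier steps
print(h)                                     # final hidden state summarizes the sequence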
Part II: Data Preparation
B. Summarize the data cleaning process
• The first step is to import packages and the data.
In [1]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from PyPDF2 import PdfWriter, PdfFileReader

# (“How to Disable Warnings in Jupyter Notebooks? 3 Easy Ways along with the Code.”)
import warnings
warnings.filterwarnings('ignore')

# DataFrame
data = df = pd.read_csv("C:/Users/londa/imdb_labelled.csv")
df
1. Perform exploratory data analysis
• Non-standard characters, stopwords, and unnecessary whitespace introduce unwanted noise into the dataset, leaving it unclean. To remedy this, the clean_text function was used to remove special characters, digits, and stopwords, and all uppercase letters were converted to lowercase to ensure consistency in the data (Deepanshi). Cleaned text is a fundamental prerequisite for building accurate, robust, and interpretable models.
Out[1]:
                                                    Text  Rating
0      A very, very, very slow-moving, aimless movie ...       0
1      Not sure who was more lost - the flat characte...       0
2      Attempting artiness with black & white and cle...       0
3             Very little music or anything to speak of.       0
4      The best scene in the movie was when Gerardo i...       1
...                                                  ...     ...
523    I just got bored watching Jessice Lange take h...       0
524    Unfortunately, any virtue in this film's produ...       0
525                       In a word, it is embarrassing.       0
526                                   Exceptionally bad!       0
527    All in all its an insult to one's intelligence...       0

528 rows × 2 columns
In [3]:
df = pd.DataFrame(data)
#nltk.download('stopwords')
#nltk.download('punkt')

# Function to clean text
def clean_text(text):
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    # Join tokens back into a sentence
    cleaned_text = ' '.join(filtered_text)
    return cleaned_text
In [4]:
# Apply the clean_text function to the 'Text' column
df['Text'] = df['Text'].apply(clean_text)

# Display the cleaned DataFrame
df
• The vocabulary size is computed after fitting the tokenizer on the training data, which produces a size of 2,816 (see step 5).
• Word embedding generates a numerical vector for each word, giving each word a unique representation. Because computers operate on numerical data rather than text, these vectors are what the model actually works with. Effective embedding lengths typically range from 50 to 500 in a well-constructed model (Ruchini); a short illustration follows this list.
• Determining the appropriate sequence length is not a straightforward process. It is advisable to initiate with a shorter sequence length and progressively extend it until performance begins to decline (“What Is the Optimal Sequence
Length for an RNN?”). Consequently, I will begin with a sequence length of 30.
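To illustrate what an embedding of a given length looks like in practice, the minimal sketch below (illustrative only; the toy vocabulary of 10 words and the embedding length of 4 are arbitrary, whereas the model later uses the full vocabulary and a length of 100) maps a short sequence of word indices to dense vectors.

from keras.models import Sequential
from keras.layers import Embedding
import numpy as np

# Illustrative only: an Embedding layer maps each integer word index to a dense vector.
# input_dim=10 (toy vocabulary) and output_dim=4 are arbitrary choices for this demo.
demo = Sequential([Embedding(input_dim=10, output_dim=4)])
word_indices = np.array([[1, 5, 2]])   # one "sentence" of three word indices
vectors = demo.predict(word_indices)
print(vectors.shape)                   # (1, 3, 4): one length-4 vector per word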
2. Tokenization involves breaking a string into individual parts or words, marking the initiation of the Natural Language Processing (NLP) pipeline for the model to
process (Burchfiel). Tokenization for my analysis is done after splitting the data.
3. Padding serves to standardize sentence lengths. While neural networks require consistent input shapes, the variable lengths of sentences present a challenge, so pre- or post-padding becomes crucial for maintaining uniform input sizes (Caner). I have used the default behavior of pad_sequences, which is pre-padding (zeros are added at the start of each sequence), as part of the post-split code; a toy example is shown after the step list below.
• A first padded line can be seen in step 5.
4. I will be using a sigmoid activation in the final layer, as the ratings are binary (positive or negative).
5. Steps taken and splitting data into test and training sets:
Steps
1. Data Collection: the imdb_labelled data set.
2. Data Cleaning: remove irrelevant characters and special symbols, lowercase the text, and remove stopwords.
3. Text Tokenization.
4. Padding: ensure uniform sequence length by padding so all sequences are the same size.
5. Data Splitting (Prabhakaran): split the dataset into training and test sets, 80-20 (training-test).
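To make steps 3 and 4 concrete, the toy example below (illustrative only; the two sentences are made up) shows how the Tokenizer assigns an integer to each word and how pad_sequences brings every sequence to the same length, adding zeros at the front by default (pre-padding).

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Toy sentences, illustrative only
toy = ["best scene movie", "movie slow aimless boring"]

toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(toy)
print(toy_tokenizer.word_index)           # e.g. {'movie': 1, 'best': 2, 'scene': 3, ...}

toy_sequences = toy_tokenizer.texts_to_sequences(toy)
print(toy_sequences)                      # e.g. [[2, 3, 1], [1, 4, 5, 6]]

toy_padded = pad_sequences(toy_sequences) # default padding='pre': zeros go in front
print(toy_padded)                         # e.g. [[0 2 3 1]
                                          #       [1 4 5 6]]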
Out[4]:
                                                    Text  Rating
0      slowmoving aimless movie distressed drifting y...       0
1      sure lost flat characters audience nearly half...       0
2      attempting artiness black white clever camera ...       0
3                            little music anything speak       0
4      best scene movie gerardo trying find song keep...       1
...                                                  ...     ...
523       got bored watching jessice lange take clothes       0
524    unfortunately virtue films production work los...       0
525                                    word embarrassing       0
526                                    exceptionally bad       0
527            insult ones intelligence huge waste money       0

528 rows × 2 columns
In [7]:
# Split
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Rating'],
                                                     test_size=0.2, random_state=42)

# Tokenize
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

# Convert text data
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

# Pad
X_train_padded = pad_sequences(X_train_sequences)
X_test_padded = pad_sequences(X_test_sequences, maxlen=X_train_padded.shape[1])

# Convert labels
y_train = np.array(y_train)
y_test = np.array(y_test)

# Vocabulary size
vocabulary_size = len(tokenizer.word_index) + 1
print("Vocabulary size: ", vocabulary_size)
print(X_train_padded[1])
6. The test, train and df files were exported as .csv and attached to the Performance Assessment (“How to Export Pandas DataFrame to a CSV File - Data to Fish”).
In [20]:
# Specify a path
df_csv = pd.DataFrame(df)
df.to_csv(r"C:\Users\londa\OneDrive\Desktop\df.csv", index=False)

X_train_sequences_csv = pd.DataFrame(X_train_sequences)
X_train_sequences_csv.to_csv(r"C:\Users\londa\OneDrive\Desktop\X_train_sequences.csv", index=False)

X_test_sequences_csv = pd.DataFrame(X_test_sequences)
X_test_sequences_csv.to_csv(r"C:\Users\londa\OneDrive\Desktop\X_test_sequences.csv", index=False)
Part III: Network Architecture
C. Description of network
1. TensorFlow output of the model summary.
In [8]:
# Create LSTM model
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=100,
                    input_length=X_train_padded.shape[1]))
model.add(LSTM(units=100))
model.add(Dense(units=1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Display the model summary
model.summary()
2. Layers and Parameters.
The model consists of three layers: Embedding, LSTM, and Dense (Brownlee).
Embedding Layer (embedding): This layer converts integer-encoded vocabulary indices into dense vectors of a fixed size (the embedding dimension). It is often the first layer in an NLP model.
LSTM Layer (lstm): Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) layer. It is commonly used for sequence modeling tasks in NLP because of its ability to capture long-term dependencies.
Dense Layer (dense): This is the prediction output layer. In this model it is a fully connected layer with a single neuron and a sigmoid activation function, which is suitable for binary classification.
Total Number of Parameters (“LSTM: Understanding the Number of Parameters”):
The Param # column indicates the number of trainable parameters in each layer.
Embedding Layer Parameters: The number of parameters in the embedding layer depends on the vocabulary size (vocab_size) and the embedding dimension (embedding_dim). The formula is vocab_size * embedding_dim.
LSTM Layer Parameters: The number of parameters in the LSTM layer depends on the embedding dimension and the number of LSTM units. The formula is 4 * ((embedding_dim + units) * units + units).
Dense Layer Parameters: The number of parameters in the dense layer is (number of neurons in the previous layer + 1) * number of neurons in the current layer (the +1 accounts for the bias term).
Total Parameters: The Total params row shows the sum of trainable parameters in all layers.
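As a quick check of these formulas against the summary output below (simple arithmetic using this model's values: a vocabulary of 2,816, an embedding length of 100, 100 LSTM units, and 1 output neuron):

vocab_size, embedding_dim, units, outputs = 2816, 100, 100, 1

embedding_params = vocab_size * embedding_dim                   # 281,600
lstm_params = 4 * ((embedding_dim + units) * units + units)     # 80,400
dense_params = (units + 1) * outputs                            # 101
print(embedding_params + lstm_params + dense_params)            # 362,101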
Vocabulary size: 2816
[ 0 0 0 ... 168 987 988]
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param # =================================================================
embedding (Embedding) (None, 2772, 100) 281600 lstm (LSTM) (None, 100) 80400 dense (Dense) (None, 1) 101 =================================================================
Total params: 362101 (1.38 MB)
Trainable params: 362101 (1.38 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [9]:
# (“How to Use Plot_model to Convert a Model as Png?”)
import os
# Raw string so the backslashes in the Graphviz path are not read as escape characters
os.environ["PATH"] += os.pathsep + r'C:\Program Files\Graphviz\bin'

from tensorflow.keras.utils import plot_model
import matplotlib.pyplot as plt

# Plot the model
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

# Display the plot
img = plt.imread('model_plot.png')
plt.imshow(img)
plt.axis('off')  # Turn off axis labels
plt.show()
3. Justification of hyperparameters:
• The sigmoid activation was used because it is commonly used for binary classification.
• The number of nodes per layer was chosen by trial and error; 100 nodes provided the best accuracy in this model.
• The loss function used is binary_crossentropy, as the analysis is binary (0 or 1).
• The Adam optimizer was used for its popularity, efficiency, and adaptive learning rates.
• No stopping criterion was used in this evaluation due to the small size of the data (an illustrative sketch of such a criterion follows this list).
• The model was assessed using the accuracy evaluation metric, enabling the detection of potential overfitting during the evaluation process.
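For reference, had a stopping criterion been needed, one common way to add it (shown here only as an illustrative sketch, not part of the submitted model; the patience value is an arbitrary choice) is Keras's EarlyStopping callback, which halts training once the monitored validation loss stops improving:

from keras.callbacks import EarlyStopping

# Illustrative only: stop when validation loss has not improved for 2 epochs
# and restore the best weights seen so far. patience=2 is an arbitrary choice.
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
model.fit(X_train_padded, y_train, epochs=50, batch_size=32,
          validation_split=0.2, callbacks=[early_stop])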
Part IV: Model Evaluation
D. Evaluate the model training process and its relevant outcomes by doing the following:
1. The model was initially trained for 10 epochs without a stopping criterion to assess how it fits. Since accuracy was over 70%, no stopping criterion was implemented. All epochs are shown below.
In [10]:
# Train the model
model.fit(X_train_padded, y_train, epochs=10, batch_size=32)

In [11]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test_padded, y_test)
print(f'Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}')
2. The model is set up to address overfitting; to increase model accuracy, the output_dim was increased from 50 to 100.
3. Visualization of loss and accuracy:
Epoch 1/10
14/14 [==============================] - 25s 2s/step - loss: 0.6951 - accuracy: 0.5427
Epoch 2/10
14/14 [==============================] - 24s 2s/step - loss: 0.6766 - accuracy: 0.7251
Epoch 3/10
14/14 [==============================] - 27s 2s/step - loss: 0.6169 - accuracy: 0.7678
Epoch 4/10
14/14 [==============================] - 28s 2s/step - loss: 0.9279 - accuracy: 0.8199
Epoch 5/10
14/14 [==============================] - 28s 2s/step - loss: 0.4000 - accuracy: 0.9502
Epoch 6/10
14/14 [==============================] - 32s 2s/step - loss: 0.3013 - accuracy: 0.9597
Epoch 7/10
14/14 [==============================] - 32s 2s/step - loss: 0.2069 - accuracy: 0.9763
Epoch 8/10
14/14 [==============================] - 35s 2s/step - loss: 0.1409 - accuracy: 0.9834
Epoch 9/10
14/14 [==============================] - 31s 2s/step - loss: 0.1285 - accuracy: 0.9882
Epoch 10/10
14/14 [==============================] - 27s 2s/step - loss: 0.1074 - accuracy: 0.9953
Out[10]:
<keras.src.callbacks.History at 0x24da36e3c40>
4/4 [==============================] - 2s 298ms/step - loss: 0.5865 - accuracy: 0.7170
Test Loss: 0.5865, Test Accuracy: 0.7170
In [15]:
# Train the model
history = model.fit(X_train_padded, y_train, epochs=10, batch_size=32,
                    validation_split=0.2)

# Plot training and testing loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Testing Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()

# Plot training and testing accuracy
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Testing Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()
4. The training and validation accuracy of the trained network closely align, indicating effective generalization of the model (“Validation Accuracy Is Always close to Training Accuracy”).
In [16]:
# Final training and validation accuracy
final_train_accuracy = history.history['accuracy'][-1]
final_val_accuracy = history.history['val_accuracy'][-1]
print(f'Final Training Accuracy: {final_train_accuracy:.4f}')
print(f'Final Validation Accuracy: {final_val_accuracy:.4f}')
Epoch 1/10
11/11 [==============================] - 20s 2s/step - loss: 0.0715 - accuracy: 0.9911 - val_loss: 0.1767 - val_accuracy: 0.9529
Epoch 2/10
11/11 [==============================] - 20s 2s/step - loss: 0.0959 - accuracy: 0.9941 - val_loss: 0.2035 - val_accuracy: 0.9294
Epoch 3/10
11/11 [==============================] - 20s 2s/step - loss: 0.0861 - accuracy: 0.9941 - val_loss: 0.2075 - val_accuracy: 0.9294
Epoch 4/10
11/11 [==============================] - 20s 2s/step - loss: 0.0718 - accuracy: 0.9911 - val_loss: 0.2059 - val_accuracy: 0.9294
Epoch 5/10
11/11 [==============================] - 20s 2s/step - loss: 0.0620 - accuracy: 0.9911 - val_loss: 0.2033 - val_accuracy: 0.9294
Epoch 6/10
11/11 [==============================] - 20s 2s/step - loss: 0.0551 - accuracy: 0.9911 - val_loss: 0.2011 - val_accuracy: 0.9294
Epoch 7/10
11/11 [==============================] - 21s 2s/step - loss: 0.0501 - accuracy: 0.9941 - val_loss: 0.1990 - val_accuracy: 0.9294
Epoch 8/10
11/11 [==============================] - 20s 2s/step - loss: 0.0459 - accuracy: 0.9941 - val_loss: 0.1982 - val_accuracy: 0.9294
Epoch 9/10
11/11 [==============================] - 20s 2s/step - loss: 0.0424 - accuracy: 0.9941 - val_loss: 0.1975 - val_accuracy: 0.9294
Epoch 10/10
11/11 [==============================] - 21s 2s/step - loss: 0.0396 - accuracy: 0.9941 - val_loss: 0.1964 - val_accuracy: 0.9294
Final Training Accuracy: 0.9941
Final Validation Accuracy: 0.9294
Part V: Summary and Recommendations
E. The code used can be found in each of the corresponding performance assessment points.
F. The model examined 528 reviews to predict whether each conveyed a positive or negative rating. This involved partitioning the data into training and test datasets and employing an LSTM model. Tuning the model with specific input parameters aimed to address overfitting and enhance overall accuracy.
G. The recommended course of action is to treat the model as production ready and use it to score incoming reviews.
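As an illustration of how the production model could be applied to a new review (a minimal sketch, not part of the graded code; the review text is made up), the fitted tokenizer, the training padding length, and the trained model would be reused:

# Illustrative only: score a new, unseen review with the fitted tokenizer and model.
# In practice the review would first be passed through clean_text, matching the training data.
new_review = ["the plot was clever and the acting was wonderful"]
new_sequence = tokenizer.texts_to_sequences(new_review)
new_padded = pad_sequences(new_sequence, maxlen=X_train_padded.shape[1])
probability = model.predict(new_padded)[0][0]
print("Positive" if probability >= 0.5 else "Negative", probability)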
Part VI: Reporting
Sources
Brownlee, Jason. “The 5 Step Life-Cycle for Long Short-Term Memory Models in Keras.” Machine Learning Mastery, 27 Aug. 2020, machinelearningmastery.com/5-step-life-cycle-long-short-term-memory-models-
keras/#:~:text=The%20LSTM%20recurrent%20layer%20comprised,prediction%20is%20called%20Dense().
Burchfiel, Anni. “What Is NLP (Natural Language Processing) Tokenization?” Tokenex, 16 May 2022, www.tokenex.com/blog/ab-what-is-nlp-natural-language-processing-tokenization/.
Caner. “Padding for NLP.” Medium, 3 Apr. 2020, medium.com/@canerkilinc/padding-for-nlp-7dd8598c916a.
“Ch03.Rst2.” Nltk.org, 2010, www.nltk.org/book/ch03.html.
Deepanshi. “Text Preprocessing in NLP with Python Codes.” Analytics Vidhya, 25 June 2021, www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/#h-punctuation-removal. Accessed 24 Dec. 2023.
“How to Disable Warnings in Jupyter Notebooks? 3 Easy Ways along with the Code.” Noteable, 9 Jan. 2023, noteable.io/blog/disable-warnings-in-jupyter/#:~:text=Method%201%3A%20Using%20the%20warnings. Accessed 30 Dec.
2023.
“How to Export Pandas DataFrame to a CSV File - Data to Fish.” Datatofish.com, datatofish.com/export-dataframe-to-csv/.
“How to Use Plot_model to Convert a Model as Png?” Stack Overflow, stackoverflow.com/questions/72761152/how-to-use-plot-model-to-convert-a-model-as-png. Accessed 1 Jan. 2024.
“LSTM: Understanding the Number of Parameters.” Kaggle.com, www.kaggle.com/code/kmkarakaya/lstm-understanding-the-number-of-parameters.
“Power of Recurrent Neural Networks (RNN): Revolutionizing AI.” Simplilearn.com, www.simplilearn.com/tutorials/deep-learning-tutorial/rnn#: Accessed 22 Dec. 2023.
Prabhakaran, Selva. “Train Test Split - How to Split Data into Train and Test for Validating Machine Learning Models?” Machine Learning Plus, 29 Dec. 2022, www.machinelearningplus.com/machine-learning/train-test-split/.
Ruchini, Chanika. “Introduction to Word Embeddings.” Analytics Vidhya, 9 Nov. 2020, medium.com/analytics-vidhya/introduction-to-word-embeddings-c2ba135dce2f#:~:text=Every%20word%20has%20a%20unique. Accessed 23 Dec.
2023.
“Validation Accuracy Is Always close to Training Accuracy.” Data Science Stack Exchange, datascience.stackexchange.com/questions/42606/validation-accuracy-is-always-close-to-training-accuracy. Accessed 2 Jan. 2024.
“What Is the Optimal Sequence Length for an RNN?” Www.linkedin.com, www.linkedin.com/advice/3/what-optimal-sequence-length-rnn-skills-machine-learning-nu4gc#:~:text=Sequence%20length%20is%20determined%20by.
Accessed 23 Dec. 2023.