ADVANCED DATA ANALYTICS - D213
TASK 2: SENTIMENT ANALYSIS USING NEURAL NETWORKS
LONDA DEVLIN
Student ID: 010150242
Master of Science, Data Analytics, Western Governors University
William Sewell, Ph.D.
Program Mentor: Christiana Okhipo
Phone: 239.687.6222
Email: Ldevli2@wgu.edu

Part I: Research Question

A. Purpose
1. Using Python and the imdb_labelled data set, can sentiment analysis be used to predict whether a reviewer rates a movie as positive or negative?
2. A model will be developed that analyzes the reviews and then determines how closely its predictions align with the known ratings.
3. One type of neural network is the Recurrent Neural Network (RNN). This type of network works well for natural language analysis because it has internal memory that allows it to learn from previous inputs. An RNN loops the input through its hidden layers, carrying information forward from earlier steps, to produce the output (“Power of Recurrent Neural Networks (RNN): Revolutionizing AI”).

Part II: Data Preparation

B. Summarize the data cleaning process
• The first step is to import packages and the data.

In [1]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from PyPDF2 import PdfWriter, PdfFileReader

# Suppress warnings (“How to Disable Warnings in Jupyter Notebooks? 3 Easy Ways along with the Code.”)
import warnings
warnings.filterwarnings('ignore')

# Load the data into a DataFrame
data = df = pd.read_csv("C:/Users/londa/imdb_labelled.csv")
df

1. Perform exploratory data analysis
• Non-standard characters, stopwords, and unnecessary whitespace introduce unwanted noise into the dataset, rendering it unclean. To remedy this, the clean_text function was used to remove special characters, digits, and stopwords. Additionally, all uppercase letters were converted to lowercase to ensure consistency in the data (Deepanshi). Cleaned text is a fundamental prerequisite for constructing accurate, robust, and interpretable data for models.

Out[1]:
     Text                                                Rating
0    A very, very, very slow-moving, aimless movie ...   0
1    Not sure who was more lost - the flat characte...   0
2    Attempting artiness with black & white and cle...   0
3    Very little music or anything to speak of.          0
4    The best scene in the movie was when Gerardo i...   1
...  ...                                                 ...
523  I just got bored watching Jessice Lange take h...   0
524  Unfortunately, any virtue in this film's produ...   0
525  In a word, it is embarrassing.                      0
526  Exceptionally bad!                                  0
527  All in all its an insult to one's intelligence...   0

528 rows × 2 columns
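As additional exploratory context, a brief check of the class balance and of review lengths could be run at this point. The snippet below is only a sketch, not part of the original notebook; it assumes the df loaded in In [1] above.

# Sketch (assumes the df loaded above): basic exploratory checks
print(df.shape)                                       # expected (528, 2)
print(df['Rating'].value_counts())                    # count of positive (1) vs. negative (0) reviews
print(df['Text'].str.split().str.len().describe())    # word counts per review, useful when choosing a sequence length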
In [3]:
df = pd.DataFrame(data)
#nltk.download('stopwords')
#nltk.download('punkt')

# Function to clean text
def clean_text(text):
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    # Join tokens back into a sentence
    cleaned_text = ' '.join(filtered_text)
    return cleaned_text

In [4]:
# Apply the clean_text function to the 'Text' column
df['Text'] = df['Text'].apply(clean_text)

# Display the cleaned DataFrame
df

• The vocabulary size is computed when the tokenizer is fitted, which produces a size of 2,816 (see step 5).
• Word embedding generates a numerical vector for each word, assigning a unique representation to it. Because computers operate on numerical data rather than text, these vectors are what the model actually uses. Effective embedding lengths typically range from 50 to 500 in a well-constructed model (Ruchini).
• Determining the appropriate sequence length is not straightforward. It is advisable to start with a shorter sequence length and progressively extend it until performance begins to decline (“What Is the Optimal Sequence Length for an RNN?”). Consequently, I will begin with a sequence length of 30.

2. Tokenization involves breaking a string into individual parts or words, and it marks the start of the Natural Language Processing (NLP) pipeline the model uses (Burchfiel). Tokenization for my analysis is done after splitting the data.

3. Padding standardizes sentence lengths. Neural networks require inputs of a consistent shape, but sentences vary in length, so pre- or post-padding is needed to keep input sizes uniform (Caner). I used pre-padding, the Keras default, which can be seen in the code after the split.
• A first padded line can be seen in step 5.

4. I will use a sigmoid activation in the final layer because the ratings are binary (positive or negative).

5. Steps taken and splitting the data into test and training sets:
1. Data Collection: imdb_labelled data
2. Data Cleaning: remove irrelevant characters and special symbols, lowercase the text, and stem
3. Text Tokenization
4. Padding: pad the sequences so they are all the same length
5. Data Splitting (Prabhakaran): split the dataset into training and test sets, 80-20 (training-test)

Out[4]:
     Text                                                Rating
0    slowmoving aimless movie distressed drifting y...   0
1    sure lost flat characters audience nearly half...   0
2    attempting artiness black white clever camera ...   0
3    little music anything speak                         0
4    best scene movie gerardo trying find song keep...   1
...  ...                                                 ...
523  got bored watching jessice lange take clothes       0
524  unfortunately virtue films production work los...   0
525  word embarrassing                                   0
526  exceptionally bad                                   0
527  insult ones intelligence huge waste money           0

528 rows × 2 columns
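To make the effect of clean_text concrete, the short sketch below (not part of the original notebook) runs it on an invented review sentence; both the sentence and the expected output shown in the comments are purely illustrative.

# Sketch (invented example sentence): show the cleaning and tokenization steps in action
sample = "This movie was NOT good... 2 hours of my life, wasted!"
print(clean_text(sample))                   # expected roughly: "movie good hours life wasted"
print(word_tokenize(clean_text(sample)))    # the tokens the Keras Tokenizer would later index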
In [7]:
# Split
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Rating'], test_size=0.2, random_state=42)

# Tokenize
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

# Convert text data
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

# Pad
X_train_padded = pad_sequences(X_train_sequences)
X_test_padded = pad_sequences(X_test_sequences, maxlen=X_train_padded.shape[1])

# Convert labels
y_train = np.array(y_train)
y_test = np.array(y_test)

# Vocabulary size
vocabulary_size = len(tokenizer.word_index) + 1
print("Vocabulary size: ", vocabulary_size)
print(X_train_padded[1])

Vocabulary size: 2816
[  0   0   0 ... 168 987 988]

6. The test, train, and df files were exported as .csv and attached to the Performance Assessment (“How to Export Pandas DataFrame to a CSV File - Data to Fish”).

In [20]:
# Specify a path
df_csv = pd.DataFrame(df)
df.to_csv(r"C:\Users\londa\OneDrive\Desktop\df.csv", index=False)

X_train_sequences_csv = pd.DataFrame(X_train_sequences)
X_train_sequences_csv.to_csv(r"C:\Users\londa\OneDrive\Desktop\X_train_sequences.csv", index=False)

X_test_sequences_csv = pd.DataFrame(X_test_sequences)
X_test_sequences_csv.to_csv(r"C:\Users\londa\OneDrive\Desktop\X_test_sequences.csv", index=False)

Part III: Network Architecture

C. Description of network

1. TensorFlow output of the model summary.

In [8]:
# Create LSTM model
model = Sequential()
model.add(Embedding(input_dim=vocabulary_size, output_dim=100, input_length=X_train_padded.shape[1]))
model.add(LSTM(units=100))
model.add(Dense(units=1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Display the model summary
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 embedding (Embedding)       (None, 2772, 100)         281600
 lstm (LSTM)                 (None, 100)               80400
 dense (Dense)               (None, 1)                 101
=================================================================
Total params: 362101 (1.38 MB)
Trainable params: 362101 (1.38 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

2. Layers and Parameters. The model consists of three layers: Embedding, LSTM, and Dense (Brownlee).
Embedding Layer (embedding): This layer converts integer-encoded vocabulary indices into dense vectors of a fixed size (output_dim). It is often the first layer in an NLP model.
LSTM Layer (lstm): Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) layer. It is commonly used for sequence-modeling tasks in NLP because of its ability to capture long-term dependencies.
Dense Layer (dense): This is the prediction output layer. It is a fully connected layer with a single neuron and a sigmoid activation function, suitable for this binary classification task.
Total Number of Parameters (“LSTM: Understanding the Number of Parameters”): The Param # column shows the number of trainable parameters in each layer.
Embedding Layer Parameters: The count depends on the vocabulary size (vocabulary_size) and the embedding dimension (output_dim). The formula is vocabulary_size * output_dim.
LSTM Layer Parameters: The count depends on the embedding dimension and the number of LSTM units (units). The formula is 4 * ((output_dim + units) * units + units).
Dense Layer Parameters: The count depends on the number of neurons in the layer and in the previous layer. The formula is (number of neurons in the previous layer + 1) * number of neurons in the current layer (the +1 accounts for the bias term).
Total Parameters: The Total params row shows the sum of trainable parameters across all layers.
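As a check on these formulas, the Param # column above can be reproduced by hand. The sketch below is not part of the original notebook; it simply plugs in the values used in this model (vocabulary size 2,816, embedding dimension 100, 100 LSTM units, one output neuron).

# Sketch: reproduce the parameter counts from the model summary by hand
vocab_size = 2816
embedding_dim = 100
lstm_units = 100
dense_units = 1

embedding_params = vocab_size * embedding_dim                               # 281,600
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)  # 80,400
dense_params = (lstm_units + 1) * dense_units                               # 101
total_params = embedding_params + lstm_params + dense_params                # 362,101

print(embedding_params, lstm_params, dense_params, total_params)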
In [9]:
# (“How to Use Plot_model to Convert a Model as Png?”)
import os
os.environ["PATH"] += os.pathsep + r'C:\Program Files\Graphviz\bin'

from tensorflow.keras.utils import plot_model
import matplotlib.pyplot as plt

# Plot the model
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

# Display the plot
img = plt.imread('model_plot.png')
plt.imshow(img)
plt.axis('off')  # Turn off axis labels
plt.show()

3. Justification of hyperparameters:
• The sigmoid activation was used because it is commonly used for binary classification.
• The number of nodes per layer was chosen by trial and error; 100 nodes provided the best accuracy in this model.
• The loss function is binary_crossentropy because the target is binary (0 or 1).
• The Adam optimizer was used for its popularity, efficiency, and adaptive learning rates.
• No stopping criteria were used in this evaluation because of the small size of the data.
• The model was assessed using the accuracy metric, enabling the detection of potential overfitting during evaluation.

Part IV: Model Evaluation

D. Evaluate the model training process and its relevant outcomes by doing the following:

1. The model was trained without stopping criteria to identify the initial fit, using 10 epochs. Since accuracy was over 70%, no stopping criteria were implemented. All epochs are shown below.

In [10]:
# Train the model
model.fit(X_train_padded, y_train, epochs=10, batch_size=32)

Epoch 1/10
14/14 [==============================] - 25s 2s/step - loss: 0.6951 - accuracy: 0.5427
Epoch 2/10
14/14 [==============================] - 24s 2s/step - loss: 0.6766 - accuracy: 0.7251
Epoch 3/10
14/14 [==============================] - 27s 2s/step - loss: 0.6169 - accuracy: 0.7678
Epoch 4/10
14/14 [==============================] - 28s 2s/step - loss: 0.9279 - accuracy: 0.8199
Epoch 5/10
14/14 [==============================] - 28s 2s/step - loss: 0.4000 - accuracy: 0.9502
Epoch 6/10
14/14 [==============================] - 32s 2s/step - loss: 0.3013 - accuracy: 0.9597
Epoch 7/10
14/14 [==============================] - 32s 2s/step - loss: 0.2069 - accuracy: 0.9763
Epoch 8/10
14/14 [==============================] - 35s 2s/step - loss: 0.1409 - accuracy: 0.9834
Epoch 9/10
14/14 [==============================] - 31s 2s/step - loss: 0.1285 - accuracy: 0.9882
Epoch 10/10
14/14 [==============================] - 27s 2s/step - loss: 0.1074 - accuracy: 0.9953
Out[10]: <keras.src.callbacks.History at 0x24da36e3c40>

In [11]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test_padded, y_test)
print(f'Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}')

4/4 [==============================] - 2s 298ms/step - loss: 0.5865 - accuracy: 0.7170
Test Loss: 0.5865, Test Accuracy: 0.7170

2. The model is set up to address overfitting. To increase model accuracy, the output_dim was increased from 50 to 100.
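Although no stopping criteria were used in this analysis, early stopping is another common way to limit overfitting. The sketch below is only an illustration of how a Keras EarlyStopping callback could be added in a future iteration; the patience value of 2 is an assumed choice, not one taken from this analysis.

# Sketch (not used in this analysis): early stopping on validation loss
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
model.fit(X_train_padded, y_train,
          epochs=10, batch_size=32,
          validation_split=0.2,
          callbacks=[early_stop])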
3. Visualization of loss and accuracy:

In [15]:
# Train the model
history = model.fit(X_train_padded, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Plot training and validation loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()

# Plot training and validation accuracy
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()

Epoch 1/10
11/11 [==============================] - 20s 2s/step - loss: 0.0715 - accuracy: 0.9911 - val_loss: 0.1767 - val_accuracy: 0.9529
Epoch 2/10
11/11 [==============================] - 20s 2s/step - loss: 0.0959 - accuracy: 0.9941 - val_loss: 0.2035 - val_accuracy: 0.9294
Epoch 3/10
11/11 [==============================] - 20s 2s/step - loss: 0.0861 - accuracy: 0.9941 - val_loss: 0.2075 - val_accuracy: 0.9294
Epoch 4/10
11/11 [==============================] - 20s 2s/step - loss: 0.0718 - accuracy: 0.9911 - val_loss: 0.2059 - val_accuracy: 0.9294
Epoch 5/10
11/11 [==============================] - 20s 2s/step - loss: 0.0620 - accuracy: 0.9911 - val_loss: 0.2033 - val_accuracy: 0.9294
Epoch 6/10
11/11 [==============================] - 20s 2s/step - loss: 0.0551 - accuracy: 0.9911 - val_loss: 0.2011 - val_accuracy: 0.9294
Epoch 7/10
11/11 [==============================] - 21s 2s/step - loss: 0.0501 - accuracy: 0.9941 - val_loss: 0.1990 - val_accuracy: 0.9294
Epoch 8/10
11/11 [==============================] - 20s 2s/step - loss: 0.0459 - accuracy: 0.9941 - val_loss: 0.1982 - val_accuracy: 0.9294
Epoch 9/10
11/11 [==============================] - 20s 2s/step - loss: 0.0424 - accuracy: 0.9941 - val_loss: 0.1975 - val_accuracy: 0.9294
Epoch 10/10
11/11 [==============================] - 21s 2s/step - loss: 0.0396 - accuracy: 0.9941 - val_loss: 0.1964 - val_accuracy: 0.9294

4. The training and validation accuracy of the trained network closely align, indicating effective generalization of the model (“Validation Accuracy Is Always close to Training Accuracy”).

In [16]:
# Final training and validation accuracy
final_train_accuracy = history.history['accuracy'][-1]
final_val_accuracy = history.history['val_accuracy'][-1]
print(f'Final Training Accuracy: {final_train_accuracy:.4f}')
print(f'Final Validation Accuracy: {final_val_accuracy:.4f}')

Final Training Accuracy: 0.9941
Final Validation Accuracy: 0.9294
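Once trained, the model can score an unseen review by reusing the same preprocessing pipeline. The sketch below is not part of the original notebook; the review text is invented, and it assumes the clean_text function, fitted tokenizer, padded training data, and trained model defined above.

# Sketch (invented review text): score a new review with the trained model
new_review = "A touching story with wonderful acting."
seq = tokenizer.texts_to_sequences([clean_text(new_review)])
padded = pad_sequences(seq, maxlen=X_train_padded.shape[1])
prob = model.predict(padded)[0][0]                  # sigmoid output between 0 and 1
print("positive" if prob >= 0.5 else "negative", round(float(prob), 3))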
Part V: Summary and Recommendations

E. The code used can be found in each of the corresponding performance assessment points.

F. The model examined 528 reviews to predict whether each conveyed a positive or negative rating. This involved partitioning the data into training and test sets and employing an LSTM model. Tuning the model's input parameters aimed to address overfitting and improve overall accuracy.

G. The recommended course of action is to treat the model as production ready.

Part VI: Reporting Sources

Brownlee, Jason. “The 5 Step Life-Cycle for Long Short-Term Memory Models in Keras.” Machine Learning Mastery, 27 Aug. 2020, machinelearningmastery.com/5-step-life-cycle-long-short-term-memory-models-keras/.

Burchfiel, Anni. “What Is NLP (Natural Language Processing) Tokenization?” Tokenex, 16 May 2022, www.tokenex.com/blog/ab-what-is-nlp-natural-language-processing-tokenization/.

Caner. “Padding for NLP.” Medium, 3 Apr. 2020, medium.com/@canerkilinc/padding-for-nlp-7dd8598c916a.

“Ch03.Rst2.” Nltk.org, 2010, www.nltk.org/book/ch03.html.

Deepanshi. “Text Preprocessing in NLP with Python Codes.” Analytics Vidhya, 25 June 2021, www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/#h-punctuation-removal. Accessed 24 Dec. 2023.

“How to Disable Warnings in Jupyter Notebooks? 3 Easy Ways along with the Code.” Noteable, 9 Jan. 2023, noteable.io/blog/disable-warnings-in-jupyter/. Accessed 30 Dec. 2023.

“How to Export Pandas DataFrame to a CSV File - Data to Fish.” Datatofish.com, datatofish.com/export-dataframe-to-csv/.

“How to Use Plot_model to Convert a Model as Png?” Stack Overflow, stackoverflow.com/questions/72761152/how-to-use-plot-model-to-convert-a-model-as-png. Accessed 1 Jan. 2024.

“LSTM: Understanding the Number of Parameters.” Kaggle.com, www.kaggle.com/code/kmkarakaya/lstm-understanding-the-number-of-parameters.

“Power of Recurrent Neural Networks (RNN): Revolutionizing AI.” Simplilearn.com, www.simplilearn.com/tutorials/deep-learning-tutorial/rnn. Accessed 22 Dec. 2023.

Prabhakaran, Selva. “Train Test Split - How to Split Data into Train and Test for Validating Machine Learning Models?” Machine Learning Plus, 29 Dec. 2022, www.machinelearningplus.com/machine-learning/train-test-split/.

Ruchini, Chanika. “Introduction to Word Embeddings.” Analytics Vidhya, 9 Nov. 2020, medium.com/analytics-vidhya/introduction-to-word-embeddings-c2ba135dce2f. Accessed 23 Dec. 2023.

“Validation Accuracy Is Always close to Training Accuracy.” Data Science Stack Exchange, datascience.stackexchange.com/questions/42606/validation-accuracy-is-always-close-to-training-accuracy. Accessed 2 Jan. 2024.

“What Is the Optimal Sequence Length for an RNN?” Www.linkedin.com, www.linkedin.com/advice/3/what-optimal-sequence-length-rnn-skills-machine-learning-nu4gc. Accessed 23 Dec. 2023.