IE6400_Day22
IE6400 Foundations of Data Analytics Engineering
¶
Fall 2023
¶
Module 4: Text Analysis
¶
Text Analysis and Natural Language Processing (NLP)
¶
Natural Language Processing, or NLP, is a field of Artificial Intelligence (AI) that focuses
on the interaction between computers and humans through natural language. The ultimate objective of NLP is to enable computers to understand, interpret, and respond
to human languages in a valuable and meaningful way.
Key Techniques in Text Analysis:
¶
1. Tokenization
¶
Breaking down text into smaller fragments, known as tokens. Typically, tokens are words, but they can also be sentences or paragraphs.
2. Stopword Removal
¶
Eliminating common words (e.g., "and", "the", "is") that usually don't convey significant meaning in isolation.
3. Stemming and Lemmatization
¶
Both techniques aim to revert words to their base or root form. For instance, "running" might be transformed to "run". Stemming truncates prefixes and suffixes, while lemmatization considers context to convert the word to its meaningful base form.
4. Part-of-Speech Tagging
¶
Identifying the grammatical categories of words in the text, such as nouns, verbs, adjectives, etc.
5. Named Entity Recognition (NER)
¶
Classifying named entities in the text into predefined groups like person names, organizations, locations, etc.
6. Sentiment Analysis
¶
Determining the sentiment or emotion conveyed in the text, typically categorized as positive, negative, or neutral.
7. Topic Modeling
¶
Identifying prevalent topics in a text corpus. Latent Dirichlet Allocation (LDA) is a popular algorithm for this purpose.
8. Text Classification
¶
Assigning predefined categories or labels to a text based on its content.
9. Text Clustering
¶
Grouping texts that are similar in content.
10. Text Summarization
¶
Generating a concise and coherent summary of a more extensive text.
11. Word Embeddings
¶
Representing words in a dense vector space where semantically similar words are proximate. Methods like Word2Vec, GloVe, and FastText are popular for this.
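Several of the techniques above can be tried quickly in Python. The following is a minimal sketch using NLTK on an invented sentence; it assumes the punkt, stopwords, wordnet, and averaged_perceptron_tagger resources have already been downloaded with nltk.download().
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The cats are running quickly through the gardens."

tokens = word_tokenize(text.lower())                        # 1. tokenization
stop_words = set(stopwords.words('english'))
content = [t for t in tokens if t.isalpha() and t not in stop_words]  # 2. stopword removal

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in content])                   # 3a. stemming
print([lemmatizer.lemmatize(t, pos='v') for t in content])  # 3b. lemmatization (treating tokens as verbs)
print(nltk.pos_tag(tokens))                                 # 4. part-of-speech tagging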
Python Libraries for Text Analysis:
¶
Python boasts a plethora of libraries tailored for text analysis:
• NLTK (Natural Language Toolkit): A comprehensive toolkit for natural language processing.
• spaCy: Known for its speed and precision, it's a high-performance library for NLP tasks.
• TextBlob: Built atop NLTK and Pattern, offering a straightforward API for common NLP tasks.
• Gensim: Renowned for topic modeling and word embedding tasks (see the sketch below).
• Scikit-learn: While primarily a machine learning library, it provides tools for text processing and modeling.
In essence, text analysis in Python encompasses a broad spectrum of techniques to process, analyze, and extract insights from textual data. Python's rich ecosystem makes it an ideal choice for various NLP and text analytics tasks.
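Gensim is the library usually reached for when word embeddings are needed. Below is a minimal sketch of training a Word2Vec model on a tiny invented corpus; it assumes gensim 4.x, where the embedding size parameter is named vector_size. The corpus is far too small to yield meaningful vectors and only illustrates the API.
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (invented for illustration).
sentences = [
    ["red", "wine", "with", "cherry", "aromas"],
    ["white", "wine", "with", "citrus", "aromas"],
    ["crisp", "citrus", "flavors", "on", "the", "palate"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["wine"][:5])             # first 5 dimensions of the "wine" vector
print(model.wv.most_similar("aromas"))  # nearest neighbours in this toy vector space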
Word Frequency Analysis
¶
Word frequency analysis involves counting how often each word appears in a document or a set of documents. It's a fundamental technique in text analysis, often used to understand the main themes or topics in a text, or as a preprocessing step for more advanced text analysis tasks.
Why is Word Frequency Analysis Important?
¶
1. Identifying Key Themes: By examining which words appear most frequently, we can often identify the main themes or topics of a document.
2. Data Preprocessing: Word frequencies can be used to filter out common but uninformative words (stopwords) or to identify important keywords to retain for further analysis.
3. Visualization: Word frequency counts can be visualized in word clouds or bar charts, providing a quick and intuitive overview of the content of a text.
4. Feature Extraction for Machine Learning: In text classification tasks, word frequencies (or related measures, like TF-IDF) can be used to convert text into a numerical format suitable for machine learning algorithms (see the sketch below).
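As a concrete illustration of point 4, the sketch below converts a few invented documents into TF-IDF features with scikit-learn (it assumes scikit-learn 1.x, where the vocabulary accessor is get_feature_names_out).
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents invented for illustration.
docs = [
    "ripe fruit flavors with soft tannins",
    "crisp acidity and citrus fruit",
    "soft tannins and a long finish",
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)           # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())    # vocabulary learned from the corpus
print(X.toarray().round(2))                  # TF-IDF weights per document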
How to Perform Word Frequency Analysis in Python:
¶
1. Tokenization
¶
The first step is to break the text into individual words or tokens.
2. Cleaning
¶
Remove punctuation, convert all words to lowercase (to ensure that words like "The" and "the" are counted as the same word), and remove stopwords.
3. Count Frequencies
¶
Use Python's collections library or specialized libraries like NLTK or spaCy to count how
often each word appears.
4. Visualization (Optional)
¶
Visualize the most common words in a bar chart or a word cloud.
Challenges:
¶
• Stopwords: Common words like "and", "the", and "is" can appear very frequently but are often not informative on their own. They can be removed using a predefined list of stopwords.
• Stemming/Lemmatization: Different forms of the same word (e.g., "run" and "running") can be counted separately. Stemming or lemmatization can be used to reduce words to their root form.
• Context: Word frequency analysis doesn't consider the order of words, so it can miss nuances in meaning or context.
In conclusion, word frequency analysis is a powerful and straightforward technique for understanding the content of a text. While it has some limitations, it can provide valuable insights, especially when combined with other text analysis methods.
Exercise 1 Word Frequency Analysis Exercise
¶
Problem Statement:
¶
Given a public dataset containing textual data, your task is to perform a word frequency analysis to determine the most frequently occurring words in the dataset. Visualize the results and interpret the significance of the top words.
Dataset:
¶
• For this exercise, we'll use the "Wine Reviews" dataset from Kaggle. This dataset contains wine reviews with textual descriptions.
• You can download the dataset: https://www.kaggle.com/datasets/zynicide/wine-reviews/
Data Loading and Exploration:
¶
• Load the dataset using pandas.
• Explore the first few rows to understand the structure.
In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# 1. Data Loading and Exploration
df = pd.read_csv('winemag-data-130k-v2.csv')
df.head()
Out[1]:
| | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley |
(Remaining columns are truncated in the notebook display.)
Data Cleaning:
¶
• Convert all text to lowercase.
• Remove punctuation, numbers, and special characters.
• Tokenize the text.
In [2]:
#!conda install -c anaconda nltk
In [3]:
import nltk
nltk.download('punkt')
[nltk_data] Downloading package punkt to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Out[3]:
True
In [4]:
# 2. Data Cleaning
import string
from nltk.tokenize import word_tokenize
text = df['description'].str.lower().str.cat(sep=' ')
text = text.translate(str.maketrans('', '', string.punctuation))
tokens = word_tokenize(text)
# Just show 10 first tokens
tokens[:10]
Out[4]:
['aromas',
'include',
'tropical',
'fruit',
'broom',
'brimstone',
'and',
'dried',
'herb',
'the']
Eliminate Common Words
¶
Remove common words (also known as "stop words") that don't convey significant meaning in isolation. Examples include words like "and", "the", "is", etc.
In [5]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
# Just show 10 first filtered_tokens
filtered_tokens[:10]
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Out[5]:
['aromas',
'include',
'tropical',
'fruit',
'broom',
'brimstone',
'dried',
'herb',
'palate',
'isnt']
Word Frequency Analysis:
¶
• Count the frequency of each word.
• Display the top 10 most frequent words.
In [6]:
# 3. Word Frequency Analysis
from collections import Counter
word_freq = Counter(filtered_tokens)
top_words = word_freq.most_common(10)
print(top_words)
[('wine', 78035), ('flavors', 62678), ('fruit', 45016), ('aromas', 39613), ('palate', 38083), ('acidity', 34958), ('finish', 34943), ('tannins', 30854), ('drink', 29966), ('cherry', 27381)]
Visualization:
¶
• Visualize the word frequency using a bar chart.
In [7]:
# 4. Visualization
import matplotlib.pyplot as plt
words, counts = zip(*top_words)
plt.figure(figsize=(10, 6))
plt.bar(words, counts, color='purple')
plt.title('Top 10 Most Frequent Words in Wine Reviews')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
Interpretation:
¶
After running the code, you'll observe the top 10 most frequent words in the wine reviews. Words like "wine", "flavors", and "fruit" might be among the top words, indicating the primary focus of the reviews. The presence of these words suggests that
many reviews discuss the flavor profile and characteristics of the wines.
Stemming and Lemmatization
¶
Stemming and lemmatization are both techniques used in Natural Language Processing (NLP) to reduce words to their base or root form. While they aim to achieve similar goals, they operate on different principles and methods.
Stemming
¶
Stemming is the process of reducing a word to its base or root form by removing the suffixes (or in some cases prefixes). For example, the stem of the word "running" might be "run".
Algorithms for Stemming:
¶
1. Porter Stemming Algorithm (PorterStemmer):
• Developed by Martin Porter in 1980.
• It uses a set of heuristic rules to transform words.
• Example: "running" -> "run", "flies" -> "fli".
2. Lancaster Stemming Algorithm (LancasterStemmer):
• It is more aggressive than the Porter stemming algorithm.
• It has a set of iterative rules to convert words.
• Example: "running" -> "run", "flies" -> "fly".
3. Snowball Stemmer:
• Also known as the "Porter2" stemming algorithm.
• It is an improvement over the original Porter stemmer and supports multiple languages.
• Example: "running" -> "run", "flies" -> "fli".
Lemmatization
¶
Lemmatization is the process of reducing a word to its base or dictionary form. It involves looking up a word in a lexicon and returning its lemma or canonical form. For example, the lemma of the word "running" is "run", and the lemma of "better" is "good".
Algorithms for Lemmatization:
¶
1. WordNet Lemmatizer:
• Uses the WordNet lexical database.
• It returns the base or dictionary form of a word.
• Example: "running" -> "run", "better" -> "good".
2. spaCy Lemmatizer:
• Uses the spaCy NLP library.
• It is based on detailed linguistic annotations.
• Example: "running" -> "run", "better" -> "good".
3. TextBlob Lemmatizer:
• Uses the WordNet lexical database.
• It is a simple API over the NLTK library.
• Example: "running" -> "run", "better" -> "good".
In summary, while stemming might produce non-real words (like "fli" for "flies"), lemmatization always produces real words. The choice between stemming and lemmatization depends on the application and the level of precision required.
Exercise 2 Stemming Algorithms
¶
Problem Statement:
¶
Given public datasets containing textual data, your task is to apply different stemming algorithms to understand how they work and compare their results. Visualize the stemmed words from different algorithms and interpret the differences.
Datasets:
¶
• For this exercise, we'll use two datasets:
• "Wine Reviews" dataset from Kaggle for wine descriptions. https://www.kaggle.com/datasets/zynicide/wine-reviews/
• "Amazon Fine Food Reviews" dataset from Kaggle for food product reviews. https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews
Data Loading and Exploration
¶
• Load the datasets using pandas and explore the first few rows to understand their structure.
In [8]:
import pandas as pd
wine_df = pd.read_csv('winemag-data-130k-v2.csv')
food_df = pd.read_csv('Reviews.csv')
In [9]:
print("Wine Reviews:")
wine_df.head()
Wine Reviews:
Out[9]:
(First five rows of the Wine Reviews DataFrame, identical to the Out[1] display in Exercise 1.)
In [10]:
print("Amazon Fine Food Reviews:")
food_df.head()
Amazon Fine Food Reviews:
Out[10]:
| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | Helpfulne... |
|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 |
(Remaining column names and columns are truncated in the notebook display.)
Data Cleaning
¶
• Truncate the datasets to only the first 100 records for simplicity.
• Convert all text to lowercase.
• Remove punctuation, numbers, and special characters.
• Tokenize the text.
In [11]:
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# Truncate datasets to first 100 records
wine_df = wine_df.head(100)
food_df = food_df.head(100)
wine_text = wine_df['description'].str.lower().str.cat(sep=' ')
food_text = food_df['Text'].str.lower().str.cat(sep=' ')
wine_text = wine_text.translate(str.maketrans('', '', string.punctuation))
food_text = food_text.translate(str.maketrans('', '', string.punctuation))
wine_tokens = word_tokenize(wine_text)
food_tokens = word_tokenize(food_text)
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_wine_tokens = [word for word in wine_tokens if word not in stop_words]
filtered_food_tokens = [word for word in food_tokens if word not in stop_words]
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
In [12]:
print(filtered_wine_tokens[:10])
['aromas', 'include', 'tropical', 'fruit', 'broom', 'brimstone', 'dried', 'herb', 'palate', 'isnt']
In [13]:
print(filtered_food_tokens[:10])
['bought', 'several', 'vitality', 'canned', 'dog', 'food', 'products', 'found', 'good',
'quality']
Step 3: Apply Stemming Algorithms
¶
• Apply the Porter, Lancaster, and Snowball stemming algorithms to the tokens and observe the differences.
In [14]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")
wine_porter_stems = [porter.stem(word) for word in filtered_wine_tokens]
food_porter_stems = [porter.stem(word) for word in filtered_food_tokens]
wine_lancaster_stems = [lancaster.stem(word) for word in filtered_wine_tokens]
food_lancaster_stems = [lancaster.stem(word) for word in filtered_food_tokens]
wine_snowball_stems = [snowball.stem(word) for word in filtered_wine_tokens]
food_snowball_stems = [snowball.stem(word) for word in filtered_food_tokens]
Step 4: Visualization
¶
• Visualize the stemmed words from different algorithms to compare their results. For simplicity, we'll visualize the stems of the first 10 tokens from each dataset.
In [15]:
import matplotlib.pyplot as plt
def plot_stems(tokens, porter_stems, lancaster_stems, snowball_stems, title):
    plt.figure(figsize=(15, 7))
    x = range(len(tokens))
    plt.scatter(x, tokens, color='blue', label='Original Tokens')
    plt.scatter(x, porter_stems, color='red', label='Porter Stems')
    plt.scatter(x, lancaster_stems, color='green', label='Lancaster Stems')
    plt.scatter(x, snowball_stems, color='yellow', label='Snowball Stems')
    plt.title(title)
    plt.legend()
    plt.xticks(rotation=45)
    plt.show()

# Visualize stems for the first 10 tokens
plot_stems(filtered_wine_tokens[:10], wine_porter_stems[:10], wine_lancaster_stems[:10], wine_snowball_stems[:10], 'Wine Reviews Stemming Comparison')
plot_stems(filtered_food_tokens[:10], food_porter_stems[:10], food_lancaster_stems[:10], food_snowball_stems[:10], 'Amazon Fine Food Reviews Stemming Comparison')
Interpretation
¶
After running the code and visualizing the results, observe the differences in stemming
results from the three algorithms. Discuss the characteristics of each algorithm and how they affect the stemming process. For instance, the Lancaster stemmer might be more aggressive than the Porter stemmer, leading to shorter stems.
Exercise 3 Lemmatization Algorithms
¶
Problem Statement:
¶
Given a public dataset containing textual data, your task is to apply lemmatization to understand its effects on words. Visualize the original words against their lemmatized forms and interpret the differences.
Dataset:
¶
For this exercise, we'll use the "Wine Reviews" dataset from Kaggle. This dataset contains wine reviews with textual descriptions.
Step 1: Data Loading and Exploration
¶
• Load the dataset using pandas and explore the first few rows to understand its structure.
In [16]:
import pandas as pd
df = pd.read_csv('winemag-data-130k-v2.csv')
df.head()
Out[16]:
(First five rows of the Wine Reviews DataFrame, identical to the Out[1] display in Exercise 1.)
Step 2: Data Cleaning
¶
• Truncate the dataset to only the first 100 records for simplicity.
• Convert all text to lowercase.
• Remove punctuation and tokenize the text.
In [17]:
import string
from nltk.tokenize import word_tokenize
# Truncate dataset to first 100 records
df = df.head(100)
text = df['description'].str.lower().str.cat(sep=' ')
text = text.translate(str.maketrans('', '', string.punctuation))
tokens = word_tokenize(text)
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
# Just show 10 first filtered_tokens
filtered_tokens[:10]
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Out[17]:
['aromas',
'include',
'tropical',
'fruit',
'broom',
'brimstone',
'dried',
'herb',
'palate',
'isnt']
Step 3: Apply Lemmatization
¶
• Lemmatize the tokens using the WordNet Lemmatizer and observe the differences between the original tokens and their lemmatized forms.
In [18]:
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
# Just show 10 first tokens
lemmatized_tokens[:10]
[nltk_data] Downloading package wordnet to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
Out[18]:
['aroma',
'include',
'tropical',
'fruit',
'broom',
'brimstone',
'dried',
'herb',
'palate',
'isnt']
Step 4: Visualization
¶
• Visualize a sample of the original tokens against their lemmatized forms to compare and understand the effects of lemmatization.
In [19]:
import matplotlib.pyplot as plt
def plot_lemmatization(filtered_tokens, lemmatized_tokens, title):
    plt.figure(figsize=(15, 7))
    x = range(len(filtered_tokens))
    plt.scatter(x, filtered_tokens, color='blue', label='Original Tokens')
    plt.scatter(x, lemmatized_tokens, color='red', label='Lemmatized Tokens')
    plt.title(title)
    plt.legend()
    plt.xticks(rotation=45)
    plt.show()

# Visualize lemmatization for the first 10 tokens
plot_lemmatization(filtered_tokens[:10], lemmatized_tokens[:10], 'Effects of Lemmatization on Wine Reviews')
Interpretation
¶
After running the code and visualizing the results, observe the differences between the
original tokens and their lemmatized forms. Discuss how lemmatization can help in
reducing words to their base or dictionary form, which can be beneficial for various natural language processing tasks.
Zipf's Law
¶
Zipf's Law is an empirical law that describes the distribution of word frequencies in natural languages. It states that the frequency of any word in a corpus is inversely proportional to its rank in the frequency table. In other words, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
Mathematically, Zipf's Law can be represented as:
$f = \frac{c}{r}$
Where:
• $f$ is the frequency of the word.
• $r$ is the rank of the word.
• $c$ is a constant.
Key Points:
¶
1. Word Frequency Distribution:
• In a typical corpus, a few words (like "the", "and", "of") occur very frequently, while the majority of words occur rarely.
2. Log-Log Plot:
• When plotting the log of the frequency against the log of the rank, Zipf's Law produces a straight line with a slope of approximately -1.
3. Applications:
• Zipf's Law has been observed in various phenomena, not just language. It applies to the distribution of city populations, income rankings, and even the number of citations received by academic papers.
4. Variations:
• While Zipf's Law holds true for many corpora, there are variations. Some corpora may not follow the exact $f = \frac{c}{r}$ distribution, but they often exhibit a similar hyperbolic distribution.
5. Implications:
• Zipf's Law has implications for linguistics, information theory, and even the design of search engines and databases.
Example:
¶
Consider a corpus with the word "the" being the most frequent word, occurring 1,000 times. According to Zipf's Law:
• The 2nd most frequent word will occur approximately 500 times.
• The 3rd most frequent word will occur approximately 333 times.
• The 4th most frequent word will occur approximately 250 times, and so on.
In conclusion, Zipf's Law provides a fascinating insight into the patterns of word distribution in natural languages and has been a topic of interest for researchers in various fields.
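A quick arithmetic check of this example: taking the constant $c$ to be 1,000 (the frequency of the top-ranked word), the predicted frequencies follow directly from $f = \frac{c}{r}$.
# Expected Zipf frequencies when the most frequent word occurs 1,000 times.
c = 1000
for r in range(1, 6):
    print(f"rank {r}: expected frequency ~ {c / r:.0f}")
# rank 1: 1000, rank 2: 500, rank 3: 333, rank 4: 250, rank 5: 200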
Exercise 4 Zipf's Law
¶
Problem Statement:
¶
Given a public dataset containing textual data, your task is to investigate Zipf's Law. Zipf's Law states that the frequency of a word is inversely proportional to its rank in a frequency table. In other words, the most frequent word will occur approximately twice
as often as the second most frequent word, three times as often as the third most frequent word, and so on. Your goal is to visualize and verify if the dataset follows Zipf's Law.
Dataset:
¶
For this exercise, we'll use the "Wine Reviews" dataset from Kaggle. This dataset contains wine reviews with textual descriptions.
Step 1: Data Loading and Exploration
¶
• Load the dataset using pandas and explore the first few rows to understand its structure.
In [20]:
import pandas as pd
df = pd.read_csv('winemag-data-130k-v2.csv')
df.head()
Out[20]:
(First five rows of the Wine Reviews DataFrame, identical to the Out[1] display in Exercise 1.)
Step 2: Data Cleaning and Tokenization
¶
• Convert all text to lowercase.
• Remove punctuation and tokenize the text.
In [21]:
import string
from nltk.tokenize import word_tokenize
text = df['description'].str.lower().str.cat(sep=' ')
text = text.translate(str.maketrans('', '', string.punctuation))
tokens = word_tokenize(text)
Step 3: Word Frequency Analysis
¶
• Calculate the frequency of each word in the dataset and sort them in descending order.
In [22]:
from collections import Counter
word_freq = Counter(tokens)
sorted_word_freq = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
Step 4: Visualization of Zipf's Law
¶
• Plot the ranks of words against their frequencies on a log-log scale to visualize Zipf's Law.
In [23]:
import matplotlib.pyplot as plt
import numpy as np
ranks = np.arange(1, len(sorted_word_freq)+1)
frequencies = [freq for word, freq in sorted_word_freq]
plt.figure(figsize=(10, 6))
plt.loglog(ranks, frequencies, marker="o")
plt.title("Zipf's Law Visualization")
plt.xlabel("Rank of the word")
plt.ylabel("Frequency of the word")
plt.grid(True)
plt.show()
Interpretation
¶
After running the code and visualizing the results, observe the shape of the curve. If the dataset follows Zipf's Law, the plot will be roughly a straight line. Discuss the implications of this observation and how it reflects the natural distribution of word frequencies in languages.
N-grams, Unigrams, and Bigrams
¶
In the context of natural language processing and text analysis, N-grams refer to a contiguous sequence of 'N' items (typically words) from a given sample of text or speech. They are used to capture the language structure, such as word patterns and phrases, from a text corpus.
Unigrams
¶
• Definition: A unigram is a single word or item from a text. It is the simplest form of N-gram, where N=1.
• Example: The sentence "I love programming" contains the unigrams "I", "love", and "programming".
Bigrams
¶
• Definition: A bigram is a sequence of two adjacent words or items from a text. It is an N-gram where N=2.
• Example: The sentence "I love programming" contains the bigrams "I love" and "love programming".
N-grams
¶
• Definition: An N-gram is a sequence of 'N' words or items from a text. The value of N can be any positive integer, and it determines the number of words or items in the sequence.
• Example: In the sentence "I love programming", the 3-gram (or trigram) is "I love programming".
Key Points:
¶
1. Usage:
• N-grams are widely used in text processing tasks such as text prediction, spelling correction, and sentiment analysis.
2. Higher-Order N-grams:
• As the value of N increases, the N-grams capture more context but also become sparser in the text. For instance, 4-grams and 5-grams provide more context than bigrams but are less frequent in a typical corpus.
3. Limitations:
• While N-grams capture local word patterns, they do not capture long-distance dependencies between words or the overall sentence structure.
4. Smoothing Techniques:
• Due to the sparsity of higher-order N-grams in a corpus, smoothing techniques are often applied in probabilistic language models to handle unseen N-grams.
Example:
¶
Consider the sentence "I love coding in Python".
• Unigrams: "I", "love", "coding", "in", "Python"
• Bigrams: "I love", "love coding", "coding in", "in Python"
• Trigrams: "I love coding", "love coding in", "coding in Python"
In conclusion, N-grams provide a way to represent and capture the structure of text, making them a fundamental concept in many natural language processing tasks.
Exercise 5 N-grams, Unigrams, and Bigrams
¶
Problem Statement:
¶
Given a public dataset containing textual data, your task is to explore the concept of N-grams, specifically focusing on Unigrams and Bigrams. Extract and visualize the most common Unigrams and Bigrams from the dataset. Interpret the significance of these N-grams in the context of the dataset.
Dataset:
¶
For this exercise, we'll use the "Wine Reviews" dataset from Kaggle. This dataset contains wine reviews with textual descriptions.
Step 1: Data Loading and Exploration
¶
• Load the dataset using pandas and explore the first few rows to understand its structure.
In [24]:
import pandas as pd
df = pd.read_csv('winemag-data-130k-v2.csv')
df.head()
Out[24]:
(First five rows of the Wine Reviews DataFrame, identical to the Out[1] display in Exercise 1.)
Step 2: Data Cleaning and Tokenization
¶
• Convert all text to lowercase.
• Remove punctuation and tokenize the text.
• Remove common words (also known as "stop words") that don't convey significant meaning in isolation. Examples include words like "and", "the", "is", etc.
In [25]:
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
text = df['description'].str.lower().str.cat(sep=' ')
text = text.translate(str.maketrans('', '', string.punctuation))
tokens = word_tokenize(text)
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Step 3: Extracting Unigrams and Bigrams
¶
• Extract Unigrams (individual words) and Bigrams (pairs of consecutive words) from the tokenized text.
In [26]:
from nltk.util import ngrams
unigrams = list(ngrams(filtered_tokens, 1))
bigrams = list(ngrams(filtered_tokens, 2))
In [27]:
print(unigrams[:10])
[('aromas',), ('include',), ('tropical',), ('fruit',), ('broom',), ('brimstone',), ('dried',), ('herb',), ('palate',), ('isnt',)]
In [28]:
print(bigrams[:10])
[('aromas', 'include'), ('include', 'tropical'), ('tropical', 'fruit'), ('fruit', 'broom'), ('broom', 'brimstone'), ('brimstone', 'dried'), ('dried', 'herb'), ('herb', 'palate'), ('palate', 'isnt'), ('isnt', 'overly')]
Step 4: Visualization of Top Unigrams and Bigrams
¶
• Visualize the top 10 most common Unigrams and Bigrams to understand their distribution in the dataset.
In [29]:
from collections import Counter
import matplotlib.pyplot as plt
top_unigrams = Counter(unigrams).most_common(10)
top_bigrams = Counter(bigrams).most_common(10)
def plot_ngrams(ngrams_list, title):
    ngrams, counts = zip(*ngrams_list)
    ngrams = [" ".join(gram) for gram in ngrams]
    plt.figure(figsize=(10, 6))
    plt.bar(ngrams, counts, color='purple')
    plt.title(title)
    plt.xticks(rotation=45)
    plt.show()

plot_ngrams(top_unigrams, 'Top 10 Unigrams')
plot_ngrams(top_bigrams, 'Top 10 Bigrams')
Interpretation
¶
After visualizing the results, observe the most common Unigrams and Bigrams. Discuss their significance in the context of wine reviews. For instance, Bigrams like "black cherry" or "full bodied" might give insights into common wine characteristics discussed in the reviews.
Word Cloud
¶
A Word Cloud (also known as a tag cloud or text cloud) is a visual representation of text data where the importance or frequency of each word is represented by its size and/or color. In a word cloud, the more frequently a word appears in the source text, the larger and bolder it appears in the cloud.
Key Features:
¶
1. Visualization:
• Word Clouds provide a quick visual summary of a large amount of text, highlighting the most prominent terms.
2. Customization:
• The appearance, color scheme, and shape of a word cloud can often be customized to fit specific aesthetics or themes.
3. Interactivity:
• Some word cloud tools allow for interactive features, such as hovering over a word to see its frequency or clicking on a word to filter related content.
4. Applications:
• Word Clouds are commonly used in data analysis, content summaries, presentations, and website design. They are particularly popular for analyzing feedback, reviews, and open-ended survey responses.
How to Create a Word Cloud:
¶
1. Text Data:
• Start with a collection of text data, such as articles, reviews, or any other textual content.
2. Text Preprocessing:
• Clean the text by removing stop words (common words like "and", "the", "is", etc.), punctuation, and any other unwanted characters.
• Optionally, apply stemming or lemmatization to reduce words to their base form.
3. Word Frequency:
• Calculate the frequency of each word in the text.
4. Visualization:
• Use a word cloud generator or library (e.g., the WordCloud library in Python) to create the visual representation based on word frequencies (see the sketch after the conclusion below).
Example:
¶
Consider the feedback from a product review: "The product is amazing. I love the design and the functionality. The team did an amazing job."
In a word cloud representation, the words "amazing", "love", "design", "functionality", and "team" might appear larger because they convey the main sentiments and topics of the feedback.
Conclusion:
¶
Word Clouds offer an intuitive and visually appealing way to understand the main themes or sentiments in a large dataset. However, while they are useful for a quick overview, they lack the depth and context provided by more detailed textual analysis methods.
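Steps 3 and 4 of the process above can also be driven directly from a precomputed frequency dictionary rather than raw text. The sketch below uses the wordcloud library's generate_from_frequencies method on invented counts; it is only an illustration, not the approach used in the exercise that follows (which generates the cloud from raw text).
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Invented word counts for illustration.
freqs = {"wine": 120, "fruit": 80, "aromas": 60, "tannins": 40, "finish": 30}

wc = WordCloud(width=400, height=200, background_color='white')
wc.generate_from_frequencies(freqs)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()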
Exercise 6 WordCloud Visualization
¶
Problem Statement:
¶
Given a public dataset containing textual data, your task is to visualize the most frequently occurring words using a WordCloud. WordClouds provide a visual representation of text data where the size of each word indicates its frequency or importance. Your goal is to generate a WordCloud and interpret the significance of the prominent words in the context of the dataset.
Dataset:
¶
For this exercise, we'll use the "Wine Reviews" dataset from Kaggle. This dataset contains wine reviews with textual descriptions.
Step 1: Data Loading and Exploration
¶
• Load the dataset using pandas and explore the first few rows to understand its structure.
In [30]:
import pandas as pd
df = pd.read_csv('winemag-data-130k-v2.csv')
df.head()
Out[30]:
(First five rows of the Wine Reviews DataFrame, identical to the Out[1] display in Exercise 1.)
Step 2: Data Cleaning
¶
• Convert all text to lowercase.
• Remove punctuation.
In [31]:
import string
text = df['description'].str.lower().str.cat(sep=' ')
text = text.translate(str.maketrans('', '', string.punctuation))
Step 3: Generating the WordCloud
¶
• Use the WordCloud library to generate a word cloud visualization of the most frequent words in the dataset.
In [32]:
#!conda install -c conda-forge wordcloud
In [33]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('WordCloud for Wine Reviews')
plt.show()
Interpretation
¶
After generating the WordCloud, observe the most prominent words. These words are the most frequently occurring in the wine reviews. Discuss their significance in the context of wine reviews. For instance, words like "wine", "flavor", "fruit", and "aroma" might be dominant, indicating the primary focus of the reviews. The presence of these words suggests that many reviews discuss the flavor profile, aroma, and characteristics of the wines.
Fuzzy Matching in Text Analysis
¶
Fuzzy matching is a technique used in text analysis to find strings that are approximately equal to a given pattern. Unlike exact string matching, where the strings must be identical, fuzzy matching allows for minor discrepancies, such as typos, variations in spelling, or differences in word order.
Key Concepts:
¶
1. Edit Distance:
• A measure of the similarity between two strings. It represents the minimum number of operations (insertions, deletions, or substitutions) required to transform one string into another.
• Common algorithms include the Levenshtein distance and the Damerau-Levenshtein distance (a short implementation sketch appears at the end of this section).
2. Token-Based Matching:
• Involves breaking the text into tokens (words, phrases, or n-grams) and comparing the sets of tokens. The Jaccard similarity coefficient is a common metric used in token-based matching.
3. Phonetic Matching:
• Matches strings based on their phonetic similarity rather than their written form. Algorithms like Soundex and Metaphone are used to encode words into a phonetic representation.
Applications:
¶
1. Data Cleaning:
• Identifying and correcting typos or variations in data entries.
2. Record Linkage:
• Merging records from different databases that refer to the same entity but have slight variations in naming.
3. Search Engines:
• Improving search results by returning relevant items even if the search query contains typos or variations.
4. Natural Language Processing:
• Matching synonyms or semantically similar words in text analysis tasks.
Tools and Libraries:
¶
• Python:
• Libraries such as fuzzywuzzy, textdistance, and difflib provide functions for fuzzy string matching.
Example:
¶
Consider a database with the entry "Microsoft Corporation". Using fuzzy matching, we can identify the following entries as being similar:
• "Microsft Corporation" (typo)
• "Microsoft Corp" (abbreviation)
• "Microsoft Incorporated" (variation)
Limitations:
¶
• Fuzzy matching can be computationally intensive, especially for large datasets.
• Determining the appropriate threshold for a "match" can be subjective and may require domain knowledge.
Conclusion:
¶
Fuzzy matching is a powerful technique for text analysis, especially in scenarios where
data may have inconsistencies or variations. It provides flexibility in matching strings and can significantly improve the accuracy and quality of data processing tasks.
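To make the edit-distance idea concrete, here is a minimal dynamic-programming sketch of the Levenshtein distance. The levenshtein function is a hypothetical helper written for illustration; it is not the implementation used internally by fuzzywuzzy.
def levenshtein(a: str, b: str) -> int:
    # Row of distances from the empty prefix of `a` to every prefix of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Microsoft Corporation", "Microsft Corporation"))  # 1
print(levenshtein("kitten", "sitting"))                              # 3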
Exercise 7 Fuzzy Matching
¶
Problem Statement:
¶
Given a public dataset containing textual data, your task is to explore the concept of Fuzzy Matching. Fuzzy Matching is a process that finds strings that are approximately equal to a given pattern. Your goal is to identify and visualize similar strings in the dataset using Fuzzy Matching techniques and interpret the significance of these matches.
Dataset:
¶
For this exercise, we'll use the "Wine Reviews" dataset from Kaggle, focusing on the winery column. This dataset contains wine reviews with various attributes, including the winery name.
Step 1: Data Loading and Exploration
¶
• Load the dataset using pandas and explore the first few rows to understand its structure.
In [34]:
import pandas as pd
df = pd.read_csv('winemag-data-130k-v2.csv')
df['winery'].head()
Out[34]:
0 Nicosia
1 Quinta dos Avidagos
2 Rainstorm
3 St. Julian
4 Sweet Cheeks
Name: winery, dtype: object
Step 2: Fuzzy Matching on Winery Names
¶
• Fuzzy Matching is used to identify strings that are approximately similar to a given pattern. We'll use the fuzzywuzzy library in Python, which uses the Levenshtein distance to calculate the differences between sequences.
• Let's say we want to find wineries with names similar to "Hill". We'll use the process.extract function from fuzzywuzzy to get the top matches.
In [35]:
#!conda install -c conda-forge fuzzywuzzy
In [36]:
from fuzzywuzzy import process
wineries = df['winery'].unique().tolist()
top_matches = process.extract("Hill", wineries, limit=10)
print(top_matches)
[('Heron Hill', 90), ('Claiborne & Churchill', 90), ('Hayman & Hill', 90), ('Autumn Hill', 90), ('Cherry Hill', 90), ('Cavas Hill', 90), ('Jasper Hill', 90), ('Rex Hill', 90), ('Melhill', 90), ('Seven Hills', 90)]
Step 3: Visualization of Fuzzy Matching Results
¶
• Visualize the top matches and their respective scores to understand the similarity.
In [37]:
import matplotlib.pyplot as plt
names, scores = zip(*top_matches)
plt.figure(figsize=(12, 6))
plt.barh(names, scores, color='teal')
plt.xlabel('Fuzzy Matching Score')
plt.ylabel('Winery Names')
plt.title('Top Fuzzy Matches for "Hill"')
plt.gca().invert_yaxis()
plt.show()
Interpretation
¶
After visualizing the results, observe the winery names that have high similarity scores
with "Hill". The scores indicate how similar each winery name is to the given pattern. A
score of 100 means an exact match. Discuss the significance of these matches and how Fuzzy Matching can be useful in scenarios where data might have typos or slight variations in naming.
Exercise 8 Fuzzy Matching Dataframe Merge
¶
Problem Statement:
¶
You are given two dataframes with a list of people. Both dataframes contain a column for last names, but due to typos or variations in spelling, the last names might not match exactly between the two dataframes. Your task is to use Fuzzy Matching to merge these two dataframes based on the similarity of their last names.
Dataset:
¶
For this exercise, we'll simulate two dataframes with names of people. One dataframe will have original last names, and the other will have slightly altered last names.
Step 1: Generating the Data
¶
Generate two sample dataframes with names of people.
In [38]:
import pandas as pd
data1 = {
'First Name': ['John', 'Jane', 'Robert', 'Alice', 'Steve'],
'Last Name': ['Doe', 'Smith', 'Johnson', 'Williams', 'Brown']
}
data2 = {
'First Name': ['Jon', 'Janet', 'Rob', 'Alicia', 'Steven'],
'Last Name': ['Do', 'Smit', 'Johnsen', 'William', 'Browne']
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
First Name Last Name
0 John Doe
1 Jane Smith
2 Robert Johnson
3 Alice Williams
4 Steve Brown
First Name Last Name
0 Jon Do
1 Janet Smit
2 Rob Johnsen
3 Alicia William
4 Steven Browne
Step 2: Fuzzy Matching Last Names
¶
Use the fuzzywuzzy library to match the last names from the two dataframes.
In [39]:
from fuzzywuzzy import fuzz
def get_match(row, master_df, column_name, threshold=80):
    best_match = None
    highest_score = 0
    for item in master_df[column_name]:
        score = fuzz.ratio(row[column_name], item)
        if score > threshold and score > highest_score:
            highest_score = score
            best_match = item
    return best_match

df2['Matched Last Name'] = df2.apply(get_match, master_df=df1, column_name='Last Name', axis=1)
Step 3: Merging Dataframes Using Fuzzy Matched Last Names
¶
Merge the two dataframes using the last names matched through Fuzzy Matching.
In [40]:
merged_df = pd.merge(df1, df2, left_on='Last Name', right_on='Matched Last Name', suffixes=('_Original', '_Altered'))
merged_df
Out[40]:
| | First Name_Original | Last Name_Original | First Name_Altered | Last Name_Altered | Matched Last Name |
|---|---|---|---|---|---|
| 0 | Jane | Smith | Janet | Smit | Smith |
| 1 | Robert | Johnson | Rob | Johnsen | Johnson |
| 2 | Alice | Williams | Alicia | William | Williams |
| 3 | Steve | Brown | Steven | Browne | Brown |
Step 4: Visualization and Interpretation
¶
Visualize the original and altered last names side by side to understand the variations and the effectiveness of Fuzzy Matching.
In [41]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.bar(merged_df['Last Name_Original'], merged_df.index, color='blue', label='Original Last Names')
plt.bar(merged_df['Last Name_Altered'], merged_df.index, color='red', alpha=0.5, label='Altered Last Names')
plt.yticks(merged_df.index, merged_df['First Name_Original'])
plt.xlabel('Last Names')
plt.ylabel('First Names')
plt.title('Comparison of Original and Altered Last Names')
plt.legend()
plt.show()
Interpretation
¶
After merging and visualizing the results, observe the variations in the last names between the two dataframes. Discuss the significance of Fuzzy Matching in merging dataframes with non-identical values due to typos or variations. Highlight how Fuzzy Matching can be a powerful tool in scenarios where exact matches are not feasible.
Sentiment Analysis
¶
Sentiment Analysis, often referred to as opinion mining, is a subfield of Natural Language Processing (NLP) that focuses on identifying and categorizing opinions expressed in text, especially in order to determine whether the writer's attitude towards a particular topic, product, or service is positive, negative, or neutral. The significance of sentiment analysis lies in its ability to gauge public opinion, conduct nuanced market research, monitor brand and product reputation, and understand customer experiences.
How Sentiment Analysis Works
¶
Sentiment analysis typically involves the following steps:
1. Data Collection: Gathering data, usually text, from various sources like websites, online forums, social media platforms, and customer reviews.
2. Preprocessing: Cleaning and preparing the text data for analysis. This can include removing noise such as special characters and numbers, standardizing text, tokenization (breaking text into individual words or phrases), and stemming or lemmatization (reducing words to their base or root form).
3. Feature Extraction: Transforming text into a format that can be analyzed by machine learning algorithms. This often involves creating a bag-of-words model or using word embeddings that capture semantic meanings of words (see the sketch after this list).
4. Model Training: Using machine learning algorithms to train a sentiment analysis model on a labeled dataset, where the sentiment for each text snippet is known. This could involve supervised learning techniques such as logistic regression, support vector machines, or neural networks.
5. Classification: Applying the trained model to new, unseen text to classify the sentiment. The output is usually a score that indicates the polarity of sentiment, or a label such as "positive," "negative," or "neutral."
6. Interpretation: Analyzing the results to draw insights. For instance, a company might analyze customer feedback to determine overall satisfaction with a product or service.
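A minimal sketch of steps 3-5 using scikit-learn: a bag-of-words CountVectorizer for feature extraction and a logistic regression classifier for training and classification. The tiny labelled training set is invented for illustration and is far too small for a real model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented labelled examples (step 1-2 assumed already done).
train_texts = [
    "great wine, wonderful balance",
    "delicious fruit and a lovely finish",
    "flat, dull and disappointing",
    "harsh tannins, not pleasant at all",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)                 # steps 3-4: features + training
print(model.predict(["a lovely, balanced finish"]))  # step 5: classification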
Applications of Sentiment Analysis
¶
• Business Analytics: Companies use sentiment analysis to understand customer sentiment towards products or services, often through analysis of online reviews and social media conversations.
• Market Research: Sentiment analysis helps in gauging public opinion on various topics, brands, or products, which can inform marketing strategies.
• Politics: During elections, sentiment analysis can be used to assess public opinion on candidates or issues.
• Customer Service: Automatically categorizing customer support tickets based on sentiment can help businesses prioritize and respond to urgent queries.
Challenges in Sentiment Analysis
¶
• Sarcasm and Irony: Detecting sarcasm or irony in text can be difficult for algorithms, as they often require context and understanding of subtle language cues.
• Contextual Meaning: Words can have different meanings in different contexts, which can lead to misinterpretation of sentiment.
• Language Nuances: Sentiment analysis models must handle various linguistic nuances such as idioms, negations, and intensifiers.
Despite these challenges, sentiment analysis remains a powerful tool in NLP, providing
valuable insights across various domains.
Exercise 9 Sentiment Analysis
¶
Problem Statement:
¶
Given a public dataset containing textual reviews, your task is to perform sentiment analysis to determine the sentiment (positive, negative, or neutral) of each review. Visualize the distribution of sentiments and interpret the results in the context of the dataset.
Dataset:
¶
For this exercise, we'll use the "Wine Reviews" dataset from Kaggle. This dataset contains wine reviews with textual descriptions.
Step 1: Data Loading and Exploration
¶
Load the dataset using pandas and explore the first few rows to understand its structure.
In [42]:
import pandas as pd
df = pd.read_csv('winemag-data-130k-v2.csv')
df.head()
Out[42]:
(First five rows of the Wine Reviews DataFrame, identical to the Out[1] display in Exercise 1.)
Step 2: Sentiment Analysis Preparation
¶
We'll use the TextBlob library for sentiment analysis. The TextBlob library provides a simple API for diving into common natural language processing (NLP) tasks.
In [43]:
#!conda install -c conda-forge textblob
In [44]:
from textblob import TextBlob
def get_sentiment(text):
    analysis = TextBlob(text)
    if analysis.sentiment.polarity > 0:
        return 'positive'
    elif analysis.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negative'
df['sentiment'] = df['description'].apply(get_sentiment)
Step 3: Visualization of Sentiment Distribution
¶
Visualize the distribution of sentiments (positive, negative, neutral) in the dataset.
In [45]:
import matplotlib.pyplot as plt
sentiment_counts = df['sentiment'].value_counts()
plt.figure(figsize=(10, 6))
sentiment_counts.plot(kind='bar', color=['green', 'blue', 'red'])
plt.title('Sentiment Distribution in Wine Reviews')
plt.xlabel('Sentiment')
plt.ylabel('Number of Reviews')
plt.show()
Interpretation
¶
• After visualizing the sentiment distribution, observe the number of positive, negative, and neutral reviews. Discuss the significance of these sentiments in the context of wine reviews. For instance, a high number of positive reviews might indicate that most reviewers have a favorable opinion of the wines they reviewed.
• For further analysis, you can explore the relationship between sentiment and other variables in the dataset, such as wine variety, country, or price. This can provide deeper insights into how sentiments vary across different wines and regions.
Topic Modeling
¶
Topic Modeling is an unsupervised machine learning technique in Natural Language Processing (NLP) that discovers the abstract "topics" that occur in a collection of documents. It is used to uncover hidden thematic structures in large textual corpora, categorize text documents into topics, and aid in the organization of large datasets by topic. This technique is particularly useful in digital libraries, information retrieval, and various content-based recommendation systems.
How Topic Modeling Works
¶
Topic modeling involves the following steps:
1. Data Collection: Assembling a corpus of text data that needs to be analyzed.
2. Preprocessing: Cleaning the text data to remove noise, including punctuation, special characters, and numbers. It also involves tokenization, stop-word removal, stemming, and lemmatization.
3. Model Selection: Choosing a statistical model for topic modeling. The most popular models include Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and Latent Semantic Analysis (LSA).
4. Model Training: Applying the chosen model to the preprocessed text data to identify patterns and topics. The model will learn to assign topic distributions to documents and word distributions to topics.
5. Reviewing Topics: After training, each topic is represented as a collection of terms with weights indicating their relevance to the topic. Analysts review these terms to interpret and label the topics.
6. Assigning Topics: Documents are then categorized based on the topics they are most strongly associated with, according to the model. A minimal end-to-end sketch of this workflow follows the list.
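The sketch below is a minimal illustration of this workflow end to end using scikit-learn (one of the libraries listed earlier in this module); the four toy documents and the choice of two topics are invented purely for demonstration:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus invented for illustration (step 1)
docs = [
    "crisp apple and citrus aromas with fresh acidity",
    "ripe cherry and plum flavors with firm tannins",
    "fresh citrus and green apple notes with bright acidity",
    "dark plum, cherry and oak with a tannic finish",
]

# Steps 2-3: light preprocessing (lowercasing, English stop-word removal) and vectorization
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# Step 4: train the topic model
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic distributions

# Step 5: review the top terms of each topic
terms = vectorizer.get_feature_names_out()
for idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"Topic {idx}: {top_terms}")

# Step 6: assign each document to its most probable topic
print(doc_topics.argmax(axis=1))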
Applications of Topic Modeling
¶
• Content Summarization: Topic modeling can help summarize large volumes of text by identifying the main themes.
• Information Retrieval: Enhancing search engines by indexing documents based on topics.
• Understanding Trends: Analyzing social media or customer feedback to identify trending topics or issues.
• Recommender Systems: Recommending articles, news, or products to users based on their interests in certain topics.
Challenges in Topic Modeling
¶
• Interpreting Topics: Topics are clusters of words, and interpreting them to find a coherent theme can sometimes be subjective.
• Choosing the Number of Topics: Determining the optimal number of topics can be difficult and often requires domain knowledge or iterative experimentation (see the coherence-score sketch below).
• Polysemy and Synonymy: Words with multiple meanings (polysemy) and different words with similar meanings (synonymy) can affect the quality of the topics.
• Dynamic Topics: Topics can change over time, and static models may not capture these changes effectively.
Despite these challenges, topic modeling is a powerful tool in the NLP toolkit, providing
insights into the themes and concepts present in large and unstructured text data.
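One practical way to address the "Choosing the Number of Topics" challenge is to train several candidate models and compare their coherence scores. A minimal sketch with gensim is shown below; it assumes you already have tokenized documents (tokenized_docs), a gensim Dictionary (dictionary), and a bag-of-words corpus (doc_term_matrix) like the ones built in the exercise that follows:
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel

# Assumes tokenized_docs, dictionary, and doc_term_matrix are already defined
for k in [3, 5, 8, 10]:
    lda_k = LdaModel(doc_term_matrix, num_topics=k, id2word=dictionary, passes=10)
    coherence = CoherenceModel(model=lda_k, texts=tokenized_docs,
                               dictionary=dictionary, coherence='c_v').get_coherence()
    print(f"{k} topics -> c_v coherence: {coherence:.3f}")
# Higher coherence generally indicates more interpretable topics.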
Exercise 10 Topic Modeling
¶
Problem Statement:
¶
Given a public dataset containing textual data, your task is to perform topic modeling to uncover the underlying topics within the dataset. Your goal is to identify the main topics, visualize the distribution of words within these topics, and interpret the significance of each topic in the context of the dataset.
Dataset:
¶
For this exercise, we'll use the "Wine Reviews" dataset from Kaggle. This dataset contains wine reviews with textual descriptions.
Step 1: Data Loading and Exploration
¶
Load the dataset using pandas, truncate to the first 100 records, and explore the first few rows to understand its structure.
In [46]:
import pandas as pd
import warnings  # Setting the warnings to be ignored
warnings.filterwarnings('ignore')
df = pd.read_csv('winemag-data-130k-v2.csv')
df = df.head(100)
df
Out[46]:
    | Unnamed: 0 | country  | description                                        | designation                        | points | price | province          | region_1              | …
0   | 0          | Italy    | Aromas include tropical fruit, broom, brimston...  | Vulkà Bianco                       | 87     | NaN   | Sicily & Sardinia | Etna                  | …
1   | 1          | Portugal | This is ripe and fruity, a wine that is smooth...  | Avidagos                           | 87     | 15.0  | Douro             | NaN                   | …
2   | 2          | US       | Tart and snappy, the flavors of lime flesh and...  | NaN                                | 87     | 14.0  | Oregon            | Willamette Valley     | …
3   | 3          | US       | Pineapple rind, lemon pith and orange blossom ...  | Reserve Late Harvest               | 87     | 13.0  | Michigan          | Lake Michigan Shore   | …
4   | 4          | US       | Much like the regular bottling from 2012, this...  | Vintner's Reserve Wild Child Block | 87     | 65.0  | Oregon            | Willamette Valley     | …
... | ...        | ...      | ...                                                | ...                                | ...    | ...   | ...               | ...                   | …
95  | 95         | France   | This is a dense wine, packed with both tannins...  | NaN                                | 88     | 20.0  | Beaujolais        | Juliénas              | …
96  | 96         | France   | The wine comes from one of the cru estates fol...  | NaN                                | 88     | 18.0  | Beaujolais        | Régnié                | …
97  | 97         | US       | A wisp of bramble extends a savory tone from n...  | Ingle Vineyard                     | 88     | 20.0  | New York          | Finger Lakes          | …
98  | 98         | Italy    | Forest floor, menthol, espresso, cranberry and...  | Dono Riserva                       | 88     | 30.0  | Tuscany           | Morellino di Scansano | …
99  | 99         | US       | This blends 20% each of all five red-Bordeaux ...  | Intreccio Library Selection        | 88     | 75.0  | California        | Napa Valley           | …
(remaining columns truncated in the notebook preview)
100 rows × 14 columns
Step 2: Data Preprocessing
¶
For topic modeling, we need to preprocess the text data. This involves:
• Converting all text to lowercase.
• Removing punctuation and stopwords.
• Tokenizing the text and stemming.
In [47]:
import string
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
# Download stopwords
import nltk
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
def preprocess(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
    return tokens

df['processed_description'] = df['description'].apply(preprocess)
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package punkt is already up-to-date!
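Before training the model, it can help to check what the preprocessing produces for a single sentence. A quick, illustrative check (the sentence is made up; the exact stems depend on the Snowball stemmer):
# Lowercased, punctuation stripped, stopwords removed, remaining tokens stemmed
print(preprocess("Aromas include tropical fruit, broom and brimstone."))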
Step 3: Topic Modeling using LDA
¶
We'll use the Latent Dirichlet Allocation (LDA) method for topic modeling. First, we need to convert our tokenized documents into a document-term matrix.
In [48]:
#!conda install -c anaconda gensim
In [49]:
from gensim import corpora
dictionary = corpora.Dictionary(df['processed_description'])
doc_term_matrix = [dictionary.doc2bow(doc) for doc in df['processed_description']]
from gensim.models.ldamodel import LdaModel
lda_model = LdaModel(doc_term_matrix, num_topics=5, id2word=dictionary, passes=15)
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)
(0, '0.024*"flavor" + 0.019*"wine" + 0.015*"aroma" + 0.010*"offer" + 0.009*"touch"')
(1, '0.021*"aroma" + 0.019*"fruit" + 0.019*"flavor" + 0.012*"finish" + 0.012*"oak"')
(2, '0.024*"palat" + 0.023*"aroma" + 0.018*"finish" + 0.015*"fruit" + 0.015*"flavor"')
(3, '0.030*"wine" + 0.019*"flavor" + 0.015*"fruit" + 0.014*"drink" + 0.013*"soft"')
(4, '0.021*"dri" + 0.016*"fruit" + 0.013*"aroma" + 0.013*"flavor" + 0.011*"acid"')
Step 4: Visualization of Topics
¶
Visualize the topics using pyLDAvis to understand the distribution of words within each
topic and the significance of each topic.
In [50]:
#!conda install -c conda-forge pyldavis
In [51]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
import warnings  # Setting the warnings to be ignored
warnings.filterwarnings('ignore')
vis_data = gensimvis.prepare(lda_model, doc_term_matrix, dictionary)
pyLDAvis.display(vis_data)
Out[51]:
Interpretation
¶
After visualizing the topics, observe the most significant terms in each topic. Discuss the potential meaning or theme of each topic based on these terms. For instance, a topic with terms like "fruit", "citrus", and "fresh" might be related to wines with a fruity
and fresh flavor profile. Interpret the significance of each topic in the context of wine reviews.
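Beyond the pyLDAvis view, you can also check which topic dominates each individual review. A minimal sketch using the lda_model and doc_term_matrix built above:
# Topic mixture of the first review; minimum_probability=0 keeps all topics in the output
print(lda_model.get_document_topics(doc_term_matrix[0], minimum_probability=0))

# Dominant topic for every review in the sample
dominant_topics = [max(lda_model.get_document_topics(bow), key=lambda pair: pair[1])[0]
                   for bow in doc_term_matrix]
df['dominant_topic'] = dominant_topics
print(df['dominant_topic'].value_counts())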
Named Entity Recognition (NER)
¶
Named Entity Recognition (NER) is a key task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. NER is used to answer real-world
questions like "Who is mentioned in the text?", "Where are the different places discussed?", or "What specific dates are referenced?"
How NER Works
¶
NER systems typically follow these steps:
1. Tokenization: Segmenting text into words, phrases, symbols, or other meaningful elements called tokens.
2. Part-of-Speech Tagging: Assigning parts of speech to each token, such as noun, verb, adjective, etc., based on both its definition and its context.
3. Chunking: Parsing and segmenting sentences into phrases, which can be used as input for NER.
4. Entity Identification: Determining which tokens or phrases are named entities.
5. Classification: Assigning a category to each identified entity, such as PERSON, ORGANIZATION, or LOCATION.
6. Post-processing: Refining the output, potentially using additional resources like entity databases for disambiguation and validation.
Techniques Used in NER
¶
• Rule-Based Approaches: Using handcrafted linguistic rules to identify named entities based on patterns.
• Statistical Models: Leveraging models like Conditional Random Fields (CRFs), Hidden Markov Models (HMMs), or Support Vector Machines (SVMs) trained on annotated corpora.
• Deep Learning: Applying neural network architectures, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), or Transformer-based models like BERT, that can capture complex patterns and dependencies (a minimal spaCy sketch follows this list).
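For comparison with the NLTK-based exercise below, here is a minimal sketch of a pretrained neural NER pipeline using spaCy. It assumes spaCy and its small English model are installed (for example via pip install spacy and python -m spacy download en_core_web_sm); the sentence is invented for illustration:
import spacy

nlp = spacy.load("en_core_web_sm")   # small pretrained English pipeline that includes an NER component
doc = nlp("In 2012, Kendall-Jackson released a Chardonnay from the Willamette Valley in Oregon.")

for ent in doc.ents:
    print(ent.text, ent.label_)      # entity text with labels such as DATE, ORG, or GPE, depending on the model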
Applications of NER
¶
• Information Extraction: Automatically extracting structured information from unstructured text sources.
• Content Classification: Enhancing search and content discovery by tagging entities.
• Question Answering: Identifying entities in questions to retrieve or generate accurate answers.
• Sentiment Analysis: Determining the sentiment towards specific entities.
• Knowledge Graphs: Populating knowledge bases with entities and their relationships.
Challenges in NER
¶
• Ambiguity: A single entity name can refer to multiple unique entities (e.g., "Jordan" can refer to a person's name or a country).
• Variation: An entity can be referred to in multiple ways (e.g., "USA" and "United States of America").
• Context Dependence: The meaning and entity type can depend heavily on context.
• Domain Specificity: Entities in specialized fields may require tailored approaches.
Despite these challenges, NER continues to be a vital component of NLP, enabling machines to understand and process human language in a more structured and insightful way.
Exercise 11 Named Entity Recognition (NER)
¶
Problem Statement:
¶
Given a public dataset containing textual data, your task is to perform Named Entity Recognition (NER) to identify and classify named entities within the text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Dataset:
¶
For this exercise, we'll use the first 100 records from the "Wine Reviews" dataset from Kaggle, focusing on the description column.
Step 1: Data Loading and Exploration
¶
Load the dataset using pandas, truncate to the first 100 records, and explore the first few rows to understand its structure.
In [52]:
import pandas as pd
df = pd.read_csv('winemag-data-130k-v2.csv')
df = df.head(100)
df.head()
Out[52]:
    | Unnamed: 0 | country  | description                                        | designation                        | points | price | province          | region_1             | …
0   | 0          | Italy    | Aromas include tropical fruit, broom, brimston...  | Vulkà Bianco                       | 87     | NaN   | Sicily & Sardinia | Etna                 | …
1   | 1          | Portugal | This is ripe and fruity, a wine that is smooth...  | Avidagos                           | 87     | 15.0  | Douro             | NaN                  | …
2   | 2          | US       | Tart and snappy, the flavors of lime flesh and...  | NaN                                | 87     | 14.0  | Oregon            | Willamette Valley    | …
3   | 3          | US       | Pineapple rind, lemon pith and orange blossom ...  | Reserve Late Harvest               | 87     | 13.0  | Michigan          | Lake Michigan Shore  | …
4   | 4          | US       | Much like the regular bottling from 2012, this...  | Vintner's Reserve Wild Child Block | 87     | 65.0  | Oregon            | Willamette Valley    | …
(remaining columns truncated in the notebook preview)
Step 2: Named Entity Recognition using NLTK
¶
We'll use the NLTK library for NER. First, we need to tokenize the text, perform POS tagging, and then perform NER.
In [53]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
def extract_entities_nltk(text):
    words = word_tokenize(text)
    pos_tags = pos_tag(words)
    tree = ne_chunk(pos_tags)
    named_entities = []
    for subtree in tree.subtrees():
        if subtree.label() in ['GPE', 'PERSON', 'ORGANIZATION', 'DATE', 'LOCATION']:
            entity = " ".join([word for word, tag in subtree.leaves()])
            named_entities.append((entity, subtree.label()))
    return named_entities
df['named_entities_nltk'] = df['description'].apply(extract_entities_nltk)
print(df[['description', 'named_entities_nltk']].head(5))
[nltk_data] Downloading package punkt to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data] /Users/sivaritsultornsanee/nltk_data...
[nltk_data] Package words is already up-to-date!
description named_entities_nltk
0 Aromas include tropical fruit, broom, brimston... [(Aromas, GPE)]
1 This is ripe and fruity, a wine that is smooth... []
2 Tart and snappy, the flavors of lime flesh and... [(Tart, GPE)]
3 Pineapple rind, lemon pith and orange blossom ... [(Pineapple, GPE)]
4 Much like the regular bottling from 2012, this... []
Step 3: Visualization of Named Entities
¶
We can create a simple bar chart to visualize the most common named entities in our dataset.
In [54]:
import matplotlib.pyplot as plt
from collections import Counter
all_entities = [entity for sublist in df['named_entities_nltk'] for entity in sublist]
entity_counts = Counter([entity[0] for entity in all_entities])
common_entities = entity_counts.most_common(10)
entities = [item[0] for item in common_entities]
counts = [item[1] for item in common_entities]
plt.figure(figsize=(12, 6))
plt.bar(entities, counts, color='skyblue')
plt.xticks(rotation=45)
plt.title('Top 10 Named Entities in Wine Reviews')
plt.xlabel('Named Entities')
plt.ylabel('Counts')
plt.show()
Interpretation
¶
After extracting and visualizing the named entities, observe the different entities recognized in the dataset. Discuss the significance of each entity type (e.g., GPE, ORG)
and how NER can be useful in extracting structured information from unstructured text. For instance, recognizing vineyards or wine brands as ORG entities can be useful for categorizing wines based on their producers.
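To see which entity types dominate (rather than which surface strings), a small follow-up sketch using the named_entities_nltk column created in Step 2:
from collections import Counter

# Count entity labels (GPE, PERSON, ORGANIZATION, ...) across all reviews in the sample
label_counts = Counter(label for entities in df['named_entities_nltk'] for _, label in entities)
print(label_counts.most_common())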
Summary:
¶
Text analysis is a cornerstone in the field of Data Analytics Engineering, serving as a bridge between unstructured textual data and actionable insights. With the exponential growth of textual data from sources like social media, customer reviews, and digital content, the ability to process and understand this data is crucial. Data Analytics Engineering leverages text analysis to transform vast amounts of raw text into structured formats, enabling the extraction of meaningful patterns, trends, and relationships. This transformation not only enhances data quality but also enriches data repositories, making them more comprehensive and informative.
Furthermore, text analysis plays a pivotal role in enhancing decision-making processes
in Data Analytics Engineering. By employing techniques such as sentiment analysis, topic modeling, and named entity recognition, engineers can derive nuanced insights
about customer preferences, market dynamics, and emerging trends. These insights empower businesses to make data-driven decisions, optimize their strategies, and anticipate future challenges. In essence, text analysis elevates the value of textual data, making it an indispensable tool in the arsenal of Data Analytics Engineering.
Revised Date: November 18, 2023
¶
In [ ]: