DMA23 - Lab 1_ Data Preprocessing - Colaboratory

pdf

School

University of California, Berkeley *

*We aren’t endorsed by this school

Course

144

Subject

Industrial Engineering

Date

Jan 9, 2024

Type

pdf

Pages

6

Uploaded by MajorResolve6456

Report
12/7/23, 8:05 PM DMA23 - Lab 1_ Data Preprocessing - Colaboratory https://colab.research.google.com/drive/1F3CmxxXhgxuRBO2naN1ogM5FulCvqAkS#scrollTo=vKduyi9dItO4&printMode=true 1/6 DATA MINING & ANALYTICS (2023) Make sure you ±ll in any place that says YOUR CODE HERE or YOUR ANSWER HERE , as well as your name below: NAME = Data transformations are useful for preparing a dataset for answering a particular question. Part of this process involves generating features from the dataset you ±nd relevant to the question at hand. For this lab, we will be using a Yelp reviews dataset. Each row in the dataset depicts one review along with the features of the review (the reviewer, the review text, etc.). The goal of this lab is to eventually convert this reviews dataset into a reviewers dataset by creating different features describing each reviewer. The submission for this assignment should be done individually , but you are allowed to work in groups of 2. Google Colab Colab is a free online platform provided by Google that allows you to execute python code without any installations on your local machine. Without Colab (using Jupyter notebooks or the command line), you would have to install various packages and manage dependencies. In Colab, you can simply import them, or even install them (for that particular session). Colab can be accessed at the link: https://colab.research.google.com IMPORTANT: This lab has been shared with only read permissions to you. Make sure to click File -> "Save a Copy in Drive" so that you can get your own copy that WILL SAVE YOUR PROGRESS in your own Colab environment. If you download the .ipynb and want to further edit the notebook, you will need to make sure you have Jupyter installed locally so you can view the notebook properly (not as a JSON ±le). Environment Setup Run this cell to setup your environment. Lab 1 - Data Preprocessing # Importing libraries import numpy as np import pandas as pd import math import os print('Libraries Imported') #DOWNLOADING DATASET IF NOT PRESENT !wget -nc http://askoski.berkeley.edu/~zp/yelp_reviews.csv #!unzip yelp_reviews.zip print('Dataset Downloaded: yelp_reviews.csv') df=pd.read_csv('yelp_reviews.csv') print(df.head()) print('Setup Complete') Libraries Imported --2023-09-06 03:11:30-- http://askoski.berkeley.edu/~zp/yelp_reviews.csv Resolving askoski.berkeley.edu (askoski.berkeley.edu)... 169.229.192.179 Connecting to askoski.berkeley.edu (askoski.berkeley.edu)|169.229.192.179|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 376638166 (359M) [text/csv] Saving to: ‘yelp_reviews.csv’ yelp_reviews.csv 100%[===================>] 359.19M 38.2MB/s in 7.6s 2023-09-06 03:11:38 (47.6 MB/s) - ‘yelp_reviews.csv’ saved [376638166/376638166] Dataset Downloaded: yelp_reviews.csv type business_id user_id stars \ 0 review mxrXVZWc6PWk81gvOVNOUw mv7shusL4Xb6TylVYBv4CA 4 1 review mxrXVZWc6PWk81gvOVNOUw 0aN5QPhs-VwK2vusKG0waQ 5 2 review kK4AzZ0YWI-U2G-paAL7Fg 0aN5QPhs-VwK2vusKG0waQ 5
12/7/23, 8:05 PM DMA23 - Lab 1_ Data Preprocessing - Colaboratory https://colab.research.google.com/drive/1F3CmxxXhgxuRBO2naN1ogM5FulCvqAkS#scrollTo=vKduyi9dItO4&printMode=true 2/6 3 review mxrXVZWc6PWk81gvOVNOUw 1JUwyYab-uJzEx_FRd81Zg 5 4 review mxrXVZWc6PWk81gvOVNOUw 2Zd3Xy8hUVmZkNg7RyNjhg 4 text date cool_votes \ 0 Definitely try the duck dish. I rank it amon... 2011-06-13 0 1 Big Ass Burger was awesome! Great $5 mojitos. ... 2011-06-25 1 2 Unbelievable sandwiches! Good service. 2011-06-25 0 3 Awesome, awesome, awesome! My mom and sister a... 2011-07-18 1 4 I had the ribs they were great. The beer sele... 2011-07-19 1 useful_votes funny_votes 0 0 0 1 0 0 2 0 0 3 1 0 4 0 1 Setup Complete Q1: What was the highest number of reviews for any one business_id ? For this task, we will need to group the reviews dataset by business_id . This will aggregate data for each business, which is what we need for this task. This can be done using the groupby method. Some pointers of how you could go about this question are listed below: yelp_businesses = yelp_dataset.groupby('business_id').size() The .size() function counts the number of instances for each business_id , which gives us the number of reviews as each instance in this dataset is a review. The following command will sort this list, after which you can take note of the highest value: sorted_yelp_businesses = yelp_businesses.sort_values(ascending=False, inplace=False) This approach allows you to see the data structure being used in the sort. A quicker approach to getting the max would be to use the max function: max(yelp_businesses) #Make sure you return the answer value in this function def q1(df): # Group the DataFrame by 'business_id' and count the number of reviews for each business yelp_businesses = df.groupby('business_id').size() # Sort the counts in descending order sorted_yelp_businesses = yelp_businesses.sort_values(ascending=False, inplace=False) # Get the highest number of reviews highest_reviews = sorted_yelp_businesses.iloc[0] # You can also use max(sorted_yelp_businesses) return highest_reviews raise NotImplementedError() #This is a graded cell, do not edit print(q1(df)) 4128 Q2: On average, how many reviews did each business get? #Make sure you return the answer value in this function def q2(df): average_reviews_per_business = df.groupby('business_id').size().mean() return average_reviews_per_business raise NotImplementedError() #This is a graded cell, do not edit print(q2(df)) 12.63413902163123 Q3: What is the average number of reviews per reviewer?
12/7/23, 8:05 PM DMA23 - Lab 1_ Data Preprocessing - Colaboratory https://colab.research.google.com/drive/1F3CmxxXhgxuRBO2naN1ogM5FulCvqAkS#scrollTo=vKduyi9dItO4&printMode=true 3/6 #Make sure you return the answer value in this function def q3(df): average_reviews_per_reviewer = df.groupby('user_id').size().mean() return average_reviews_per_reviewer raise NotImplementedError() #This is a graded cell, do not edit print(q3(df)) 3.188511934933203 Q4: Calculate the total number of cool votes per reviewer, then average these totals across reviewers. #Make sure you return the answer value in this function def q4(df): average_cool_votesper_reviewer = df.groupby('user_id')['cool_votes'].sum().mean() return average_cool_votesper_reviewer raise NotImplementedError() #This is a graded cell, do not edit print(q4(df)) 1.2417282785380945 Q5: Calculate the total number of funny votes per reviewer, then average these totals across reviewers. #Make sure you return the answer value in this function def q5(df): average_funny_votesper_reviewer = df.groupby('user_id')['funny_votes'].sum().mean() return average_funny_votesper_reviewer raise NotImplementedError() #This is a graded cell, do not edit print(q5(df)) 1.10126486404605 Q6: Calculate the total number of useful votes each business get, then average these totals across business_ids. #Make sure you return the answer in this function def q6(df): average_useful_votesper_business = df.groupby('user_id')['useful_votes'].sum().mean() return average_useful_votesper_business raise NotImplementedError() #This is a graded cell, do not edit print(q6(df)) 2.484476138872867 Q7: On average, what percentage of a reviewer's votes are cool votes? (hint1: calculate the percentage of cool votes for each reviewer, then average this percentage across reviewers) (hint2: you should discard reviewers who have absolutely no votes - from cool, funny, or useful votes - from your calculation)
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
12/7/23, 8:05 PM DMA23 - Lab 1_ Data Preprocessing - Colaboratory https://colab.research.google.com/drive/1F3CmxxXhgxuRBO2naN1ogM5FulCvqAkS#scrollTo=vKduyi9dItO4&printMode=true 4/6 #Make sure you return the answer in this function #Remember to multiply by 100 for percentages def q7(df): grouped_votes = df.groupby('user_id')[['cool_votes', 'funny_votes', 'useful_votes']].sum() grouped_votes['total_votes'] = grouped_votes.sum(axis=1) grouped_votes = grouped_votes[grouped_votes['total_votes'] > 0] grouped_votes['cool_percentage'] = (grouped_votes['cool_votes'] / grouped_votes['total_votes']) * 100 average_cool_percentage = grouped_votes['cool_percentage'].mean() return average_cool_percentage raise NotImplementedError() #This is a graded cell, do not edit #Remember to multiply by 100 for percentages print(round(q7(df),2)) 19.27 Q8: On average, what percentage of a reviewer's votes are funny votes? (hint1: calculate the percentage of funny votes for each reviewer, then average this percentage across reviewers) (hint2: you should discard reviewers who have zero total votes from your calculation) #Make sure you return the answer in this function #Remember to multiply by 100 for percentages def q8(df): grouped_votes = df.groupby('user_id')[['cool_votes', 'funny_votes', 'useful_votes']].sum() grouped_votes['total_votes'] = grouped_votes.sum(axis=1) grouped_votes = grouped_votes[grouped_votes['total_votes'] > 0] grouped_votes['funny_percentage'] = (grouped_votes['funny_votes'] / grouped_votes['total_votes']) * 100 average_funny_percentage = grouped_votes['funny_percentage'].mean() return average_funny_percentage raise NotImplementedError() #This is a graded cell, do not edit print(round(q8(df),2)) 18.26 Q9: On average, what percentage of a reviewer's votes are useful votes? (hint1: calculate the percentage of useful votes for each reviewer, then average this percentage across reviewers) (hint2: you should discard reviewers who have zero total votes from your calculation)
12/7/23, 8:05 PM DMA23 - Lab 1_ Data Preprocessing - Colaboratory https://colab.research.google.com/drive/1F3CmxxXhgxuRBO2naN1ogM5FulCvqAkS#scrollTo=vKduyi9dItO4&printMode=true 5/6 #Make sure you return the answer in this function def q9(df): grouped_votes = df.groupby('user_id')[['cool_votes', 'funny_votes', 'useful_votes']].sum() grouped_votes['total_votes'] = grouped_votes.sum(axis=1) grouped_votes = grouped_votes[grouped_votes['total_votes'] > 0] grouped_votes['useful_percentage'] = (grouped_votes['useful_votes'] / grouped_votes['total_votes']) * 100 average_useful_percentage = grouped_votes['useful_percentage'].mean() return average_useful_percentage raise NotImplementedError() #This is a graded cell, do not edit print(round(q9(df),2)) 62.47 Q10: Find the average review text length (in non-space characters). #Make sure you return the answer in this function def q10(df): average_text_length = df['text'].apply(len).mean() return average_text_length raise NotImplementedError() #This is a graded cell, do not edit print(round(q10(df),0)) 613.0 Q11: Find the year in which each reviewer wrote the most reviews. Once you have this for each reviewer, subtract the minimum possible year (2004) from each year so that your ±nal feature values are 0, 1, 2, etc. Note: we are looking for the answer to be in the format of a Pandas Series with user_id as the index and the year (in 0, 1, 2 format as listed above) as the value. # YOUR CODE HERE df['date'] = pd.to_datetime(df['date']) df['year'] = df['date'].dt.year user_year_counts = df.groupby(['user_id', 'year']).size().reset_index(name='review_count') most_active_years = user_year_counts.sort_values(by='review_count', ascending=False).drop_duplicates(subset='user_id', keep='fir most_active_years['adjusted_year'] = most_active_years['year'] - 2004 answer = most_active_years.set_index('user_id')['adjusted_year'] #This is a graded cell, do not edit print(answer.sort_index().head()) user_id --1Y03CEKR3WDbBjYnsW7A 7 --2QZsyXGz1OhiD4-0FQLQ 10
12/7/23, 8:05 PM DMA23 - Lab 1_ Data Preprocessing - Colaboratory https://colab.research.google.com/drive/1F3CmxxXhgxuRBO2naN1ogM5FulCvqAkS#scrollTo=vKduyi9dItO4&printMode=true 6/6 --82_AVgRBsLw6Dhy8sEnA 4 --8A9o_NeGyt_3kzlXtSdg 11 --8WbseBk1NjfPiZWjQ-XQ 12 Name: adjusted_year, dtype: int64 Q12: Come up with a new feature for each review. This may be derived from existing features. Give your feature the name my_new_feature. Display head() of this new feature. # I camp up with a feature called 'postive_review' base on the number of star, if the star of the review is larger than 3, then my_new_feature = df['positive_review'] = df['stars'].apply(lambda x: 1 if x > 3 else 0) my_new_feature.head() 0 1 1 1 2 1 3 1 4 1 Name: stars, dtype: int64 #This is a graded cell, do not edit print(my_new_feature.head()) 0 1 1 1 2 1 3 1 4 1 Name: stars, dtype: int64 Prof. Zachary Pardos, 2023
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help