data-8hw4

pdf

School

University of California, Berkeley

Course

100

Subject

Computer Science

Date

Feb 20, 2024

Type

pdf

Pages

6

Uploaded by ColonelAlpaca4085

Data 100 Homework Assignment 3
Introduction to Data Science (Southwest Minnesota State University)
DATA 100
Spring 2023

Homework 3: Functions, Histograms, and Groups

Submitted to the D2L dropbox by 11:59 PM, Monday, February 27, 2023

Put your name in the cell below.

Required Reading: Visualizing Numerical Distributions; Functions and Tables

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the needed modules. Each time you start your server, you will need to execute this cell again to load the tests.

Start early so that you can ask us if you're stuck.

In [ ]:
# Don't change this cell; just run it.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

1. Burrito-ful San Diego

Tam, Margaret and Winifred are trying to use Data Science to find the best burritos in San Diego! Their friends Irene and Maya provided them with two comprehensive datasets on many burrito establishments in the San Diego area taken from (and cleaned from): https://www.kaggle.com/srcole/burritos-in-san-diego/data

The following cell reads in a table called ratings which contains names of burrito restaurants, their Yelp rating, Google rating, as well as their Overall rating. It also reads in a table called burritos_types which contains names of burrito restaurants, their menu items, and the cost of the respective menu item at the restaurant.

In [ ]:
# Just run this cell
ratings = Table.read_table("ratings.csv")
ratings.show(5)
burritos_types = Table.read_table("burritos_types.csv")
burritos_types.show(5)
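Both tables described above share a column of restaurant names, which is what makes it possible to combine them. The following is a minimal pure-Python sketch of matching rows on a shared key. It is not how the datascience library implements joins; the helper join_on is hypothetical, and the restaurant names, ratings, and prices are made up for illustration (the real values live in ratings.csv and burritos_types.csv).

```python
# Hypothetical rows for illustration only; the real data is in the CSV files.
ratings_rows = [
    {"Name": "Taco Place", "Yelp": 4.0, "Google": 4.2, "Overall": 3.8},
    {"Name": "Burrito Spot", "Yelp": 3.5, "Google": 4.0, "Overall": 4.1},
]
types_rows = [
    {"Name": "Taco Place", "Menu_Item": "California", "Cost": 6.5},
    {"Name": "Taco Place", "Menu_Item": "Carnitas", "Cost": 7.0},
    {"Name": "Burrito Spot", "Menu_Item": "California", "Cost": 8.0},
]


def join_on(key, left, right):
    """Pair every left row with every right row that has the same key value."""
    joined = []
    for left_row in left:
        for right_row in right:
            if left_row[key] == right_row[key]:
                # Merge the two dictionaries into one combined row.
                joined.append({**left_row, **right_row})
    return joined


joined_rows = join_on("Name", ratings_rows, types_rows)
for row in joined_rows:
    print(row)
```

In this sketch, a restaurant that appears in several rows of the second list shows up once per matching row in the result, carrying its ratings along each time.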
Question 1. It would be easier if we could combine the information in both tables. Assign burritos to the result of joining the two tables together.

Hint: Here is a nice Python reference for the table functions in the datascience module.

In [ ]:
burritos = ratings.join('Name', burritos_types)
burritos.show(5)

Question 2. Let's look at how the Yelp scores compare to the Google scores in the burritos table. First, assign yelp_and_google to a table only containing the columns Yelp and Google. Then, make a scatter plot with Yelp scores on the x-axis and the Google scores on the y-axis.

In [ ]:
yelp_and_google = burritos.select('Yelp', 'Google')
yelp_and_google.scatter('Yelp', 'Google')

# Don't change/edit/remove the following line.
# To help you make conclusions, we have plotted a straight line on the graph (y=x)
plt.plot(np.arange(2.5, 5, .5), np.arange(2.5, 5, .5));

Question 3. Looking at the scatter plot you just made in Question 1.2, do you notice any pattern(s) (i.e. is one of the two types of scores consistently higher than the other one)? If so, describe them briefly in the cell below.

Write your answer here, replacing this text.

Here's a refresher on how .group works! You can read how .group works in the textbook, or you can view the video below. The video resource was made by a UC Berkeley staff member - Divyesh Chotai!

In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo("HLoYTCUP0fc")

Question 4. From the burritos table, some of the restaurant locations have multiple reviews. Winifred thinks California burritos are the best type of burritos, and wants to see the average overall rating for California burritos at each location. Create a table that has two columns: the name of the restaurant and the average overall rating of California burritos at each location.

Tip: Revisit the burritos table to see how California burritos are represented.

Note: you can break up the solution into multiple lines, as long as you assign the final output table to california_burritos! For reference however, our solution only used one line.

In [ ]:
california_burritos = burritos.where('Menu_Item', 'California').select('Name', 'Overall').group("Name", np.mean)
california_burritos

Question 5. Given this new table california_burritos, Winifred can figure out the name of the restaurant
with the highest overall average rating! Assign best_restaurant to a line of code that evaluates to a string that corresponds to the name of the restaurant with the highest overall average rating.

In [ ]:
best_restaurant = california_burritos.sort('Overall mean', descending=True).column('Name').item(0)
best_restaurant

Question 6. Using the burritos table, assign menu_average to a table that has three columns that uniquely pairs the name of the restaurant, the menu item featured in the review, and the average Overall score for that menu item at that restaurant.

Hint: Use .group, and remember that you can group by multiple columns. Here's an example from the textbook.

In [ ]:
menu_average = burritos.group(['Name', 'Menu_Item'], np.mean).drop('Yelp mean', 'Google mean', 'Cost mean')
menu_average

Question 7. Tam thinks that burritos in San Diego are cheaper (and taste better) than the burritos in Berkeley. Plot a histogram that visualizes the distribution of the costs of the burritos from San Diego in the burritos table. Also use the provided bins variable when making your histogram, so that visually the histogram is more informative.

In [ ]:
my_bins = np.arange(0, 15, 1)
# Please also use the provided bins
burritos.hist('Cost', bins=my_bins)

2. Faculty Salaries

This exercise is designed to give you practice using the Table methods pivot and group. Here is a link to the Python reference page in case you need a quick refresher.

Run the cell below to view a demo on how you can use pivot on a table.

In [2]:
from IPython.display import YouTubeVideo
YouTubeVideo("4WzXo8eKLAg")
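As a refresher on the idea behind group before starting the faculty exercises, here is a minimal pure-Python sketch of grouping by two columns and averaging a third. It only approximates the concept and is not the datascience implementation; the helper group_mean and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical review rows for illustration only.
rows = [
    {"Name": "Taco Place", "Menu_Item": "California", "Overall": 4.0},
    {"Name": "Taco Place", "Menu_Item": "California", "Overall": 3.0},
    {"Name": "Taco Place", "Menu_Item": "Carnitas", "Overall": 4.5},
    {"Name": "Burrito Spot", "Menu_Item": "California", "Overall": 5.0},
]


def group_mean(rows, keys, value):
    """Average `value` over every distinct combination of the `keys` columns."""
    buckets = defaultdict(list)
    for row in rows:
        # Rows with the same key combination land in the same bucket.
        buckets[tuple(row[k] for k in keys)].append(row[value])
    return {group: mean(values) for group, values in sorted(buckets.items())}


averages = group_mean(rows, ["Name", "Menu_Item"], "Overall")
for group, average in averages.items():
    print(group, average)
```

Each distinct (Name, Menu_Item) pair ends up with exactly one entry, which is the sense in which grouping by multiple columns "uniquely pairs" the key values.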
In the next cell, we load a dataset created by the Daily Cal which contains Berkeley faculty, their departments, their positions, and their gross salaries in 2015.

In [ ]:
raw_profs = Table.read_table("faculty.csv").where("year", are.equal_to(2015)).drop("year", "title")
profs = raw_profs.relabeled("title_category", "position")
profs

We want to use this table to generate arrays with the names of each professor in each department.

Question 1. Set prof_names to a table with two columns. The first column should be called department and have the name of every department once, and the second column should be called faculty with each row in that second column containing an array of the names of all faculty members in that department.

Hint: Think about how group works: it collects values into an array and then applies a function to that array. We have defined two functions below for you, and you will need to use one of them in your call to group. The other may be useful for later assignments. The group operation on multiple columns is illustrated in section 8.3.2 of your text. The documentation for group in the datascience module is given here.

In [ ]:
# Pick one of the two functions defined below in your call to group.
def identity(array):
    '''Returns the array that is passed through'''
    return array

def first(array):
    '''Returns the first item'''
    return array.item(0)

# Make a call to group using one of the functions above when you define prof_names
# group labels the collected column 'name identity'; rename it to the required 'faculty'
prof_names = profs.select('name', 'department').group('department', collect=identity).relabeled('name identity', 'faculty')
prof_names

Understanding the code you just wrote in 2.1 is important for moving forward with the class! If you made a lucky guess, take some time to look at the code, step by step.

Question 2.
At the moment, the name column of the profs table is sorted by last name. Would the arrays you generated in the faculty column of the previous part be the same if we had sorted by first name instead before generating them? Two arrays are the same if they contain the same number of elements and the elements located at corresponding indexes in the two arrays are identical. An example of arrays that are NOT the same: array([1,2]) != array([2,1]). Explain your answer.

Write your answer here, replacing this text.

Question 3. Set department_ranges to a pivot table containing departments as the rows, and the position as the columns. For each row, the values in the columns should correspond to a salary range, where range is defined as the difference between the highest salary and the lowest salary in the department for that position. You may need to review pivot tables in section 8.3.3 of your text. The documentation for pivot is here.

Hint: First you'll need to define a new function salary_range which takes in an array of salaries and returns the range of salaries in that array.
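The idea behind a pivot with a custom function can be pictured as two steps: collect the values for each (row, column) pair, then reduce each collection to a single number. Below is a minimal pure-Python sketch of that idea. It is not the datascience pivot itself; the helpers value_range and pivot_range, the department and position names, and all salary figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical faculty rows for illustration only.
rows = [
    {"department": "Math", "position": "Professor", "gross_salary": 150000},
    {"department": "Math", "position": "Professor", "gross_salary": 120000},
    {"department": "Math", "position": "Lecturer", "gross_salary": 80000},
    {"department": "Math", "position": "Lecturer", "gross_salary": 90000},
    {"department": "History", "position": "Professor", "gross_salary": 110000},
    {"department": "History", "position": "Professor", "gross_salary": 135000},
]


def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


def pivot_range(rows, row_key, col_key, value_key):
    """Build {row label: {column label: range of values}} from flat rows."""
    # Step 1: collect the values for each (row label, column label) pair.
    cells = defaultdict(list)
    for row in rows:
        cells[(row[row_key], row[col_key])].append(row[value_key])
    # Step 2: reduce each collected list to one number.
    table = defaultdict(dict)
    for (row_label, col_label), values in cells.items():
        table[row_label][col_label] = value_range(values)
    return {row_label: dict(cols) for row_label, cols in table.items()}


ranges = pivot_range(rows, "department", "position", "gross_salary")
for department, by_position in ranges.items():
    print(department, by_position)
```

Note that this sketch simply omits combinations that never occur in the data; how a real pivot table fills such cells is a property of the library and is not modeled here.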
In [ ]:
# Define salary_range first
def salary_range(a):
    """Returns the difference between the max and min values of a."""
    return max(a) - min(a)

department_ranges = profs.pivot('position', 'department', values='gross_salary', collect=salary_range)
department_ranges

Question 4. Give a reasonable explanation as to why some of the row values are 0 in the department_ranges table from the previous question.

Write your answer here, replacing this text.

Yay! You're finished with Homework 3!

Be sure to...
Save your notebook from the Jupyter File menu (not the web browser File menu),
Download your finished notebook to your computer,
Make certain you have downloaded the correct file (the extension should be .ipynb).
Upload your finished notebook to D2L.