hw04

pdf

School

University of California, Berkeley *

*We aren’t endorsed by this school

Course

C8

Subject

Computer Science

Date

Dec 6, 2023

Type

pdf

Pages

12

Uploaded by AdmiralAtom103517

Report
hw04 November 30, 2023 [1]: # Initialize Otter import otter grader = otter . Notebook( "hw04.ipynb" ) 1 Homework 4: Functions, Tables, and Groups Please complete this notebook by filling in the cells provided. Before you begin, execute the previous cell to load the provided tests. Helpful Resource: - Python Reference : Cheat sheet of helpful array & table methods used in Data 8! Recommended Readings : Visualizing Numerical Distributions Functions and Tables Please complete this notebook by filling in the cells provided. Before you begin, execute the cell below to setup the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again. For all problems that you must write explanations and sentences for, you must provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure to not re-assign variables throughout the notebook! For example, if you use max_temperature in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously! Deadline: This assignment is due Wednesday, 9/20 at 11:00pm PT . Turn it in by Tuesday, 9/19 at 11:00pm PT for 5 extra credit points. Late work will not be accepted as per the policies page. Note: This homework has hidden tests on it. That means even though tests may say 100% passed, it doesn’t mean your final grade will be 100%. We will be running more tests for correctness once everyone turns in the homework. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively. You should start early so that you have time to get help if you’re stuck. Offce hours are held Monday through Friday in Warren Hall 101B. The offce hours schedule appears on our OH page . 1
1.1 1. Burrito-ful San Diego [2]: # Run this cell to set up the notebook, but please don't change it. # These lines import the Numpy and Datascience modules. import numpy as np from datascience import * # These lines do some fancy plotting magic. import matplotlib % matplotlib inline import matplotlib.pyplot as plt plt . style . use( 'fivethirtyeight' ) import warnings warnings . simplefilter( 'ignore' , FutureWarning ) warnings . filterwarnings( "ignore" ) Eunice, Ciara and Kanchana are trying to use Data Science to find the best burritos in San Diego! Their friends Jessica and Sonya provided them with two comprehensive datasets on many burrito establishments in the San Diego area taken from (and cleaned from): https://www.kaggle.com/srcole/burritos-in-san-diego/data The following cell reads in a table called ratings which contains names of burrito restaurants, their Yelp rating, Google rating, as well as their overall rating. The Overall rating is not an average of the Yelp and Google ratings, but rather it is the overall rating of the customers that were surveyed in the study above. It also reads in a table called burritos_types which contains names of burrito restaurants, their menu items, and the cost of the respective menu item at the restaurant. [3]: # Just run this cell ratings = Table . read_table( "ratings.csv" ) ratings . show( 5 ) burritos_types = Table . read_table( "burritos_types.csv" ) . drop( 0 ) burritos_types . show( 5 ) <IPython.core.display.HTML object> <IPython.core.display.HTML object> Question 1. It would be easier if we could combine the information in both tables. Assign burritos to the result of joining the two tables together, so that we have a table with the ratings for every corresponding menu item from every restaurant. Each menu item has the same rating as the restaurant from which it is from. (8 Points) Note: It doesn’t matter which table you put in as the argument to the table method, either order will work for the autograder tests. Hint: Which function lets us combine information from two tables? Refer to the Python Reference Sheet if you’re unsure! 2
[8]: burritos = ratings . join( "Name" ,burritos_types, "Name" ) burritos . show( 10 ) <IPython.core.display.HTML object> [9]: grader . check( "q1_1" ) [9]: q1_1 results: All test cases passed! Question 2. Let’s look at how the Yelp scores compare to the Google scores in the burritos table. First, assign yelp_google_tbl to a table only containing the columns Yelp and Google . Then, make a scatter plot with Yelp scores on the x-axis and the Google scores on the y-axis. (8 Points) [10]: yelp_google_tbl = burritos . select( "Yelp" , "Google" ) yelp_google_tbl . scatter( "Yelp" , "Google" ) # Don't change/edit/remove the following line. # To help you make conclusions, we have plotted a straight line on the graph (y=x). plt . plot(np . arange( 2.5 , 5 , .5 ), np . arange( 2.5 , 5 , .5 )); 3
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Question 3. Looking at the scatter plot you just made in Question 1.2, do you notice any pattern(s) or relationships between Yelp and Google ratings (i.e. is one of the two types of scores consistently higher than the other one)? If so, describe them briefly in the cell below. (8 Points) One trend is that as the Yelp score increases, the Google score tends to increase. Another trend is that the Google scores are consistently higher (only one score below 3.0) Here’s a refresher on how .group works! You can read how .group works in the textbook , or you can view the video below. The video resource was made by a past staff member, Divyesh Chotai! You can also use the Table Functions Visualizer to get some more hands-on experience with the .group function. [11]: # Don't worry about what this code does! It simply embeds a YouTube video into a code cell. from IPython.display import YouTubeVideo YouTubeVideo( "HLoYTCUP0fc" ) [11]: Question 4. There are so many types of California burritos in the burritos table! Kanchana wants to consider her options for burritos based on rankings. For the sake of these questions, we 4
are treating each menu item’s rating the same as its respective restaurant’s, as we do not have the rating of every single item at these restaurants. You do not need to worry about this fact, but we thought to mention it! Create a table with two columns: the first column include the names of the burritos and the second column should contain the average overall rating of that burrito across restaurants. In your calculations, you should only compare burritos that contain the word “California”. For example, there are “California” burritos, “California Breakfast” burritos, “California Surf And Turf” burritos, etc. (9 Points) Hint: “California” is case sensitive (“California” is not the same as “california”)! Note: For reference, the staff solution only used one line. However, feel free to break up the solution into multiple lines and steps; just make sure you assign the final output table to california_burritos ! [16]: california_burritos = burritos . where( "Menu_Item" ,are . containing( "California" )) . drop( "Name" , "Yelp" , "Google" , "Cost" ) . group( "Menu_Item" , np . mean) california_burritos [16]: Menu_Item | Overall mean California | 3.5242 California (Only Cheese) | 4.1 California + Guac + Sour Cream | 3.4 California - Chicken | 3.45839 California - Pork Adobada | 3.26429 California - Steak | 3.26429 California Breakfast | 2.75833 California Chicken | 3.54815 California Chipotle | 4.36667 California Everything | 4.1 … (9 rows omitted) [17]: grader . check( "q1_4" ) [17]: q1_4 results: All test cases passed! Question 5. Given this new table california_burritos , Ciara can figure out the name of the California burrito with the highest overall average rating! Assign best_california_burrito to a line of code that outputs the string that represents the name of the California burrito with the highest overall average rating. If multiple burritos satisfy this criteria, you can output any one of them. (8 Points) [19]: best_california_burrito = california_burritos . sort( "Overall mean" , descending = True ) . column( "Menu_Item" ) . item( 0 ) best_california_burrito [19]: 'California Chipotle' 5
[20]: grader . check( "q1_5" ) [20]: q1_5 results: All test cases passed! Question 6. Eunice thinks that burritos in San Diego are cheaper (and taste better) than the burritos in Berkeley. Plot a histogram that visualizes that distribution of the costs of the burritos from San Diego in the burritos table. Also use the provided cost_bins variable when making your histogram, so that the histogram is more visually informative. (8 Points) [23]: cost_bins = np . arange( 0 , 15 , 1 ) # Do not change this line # Please also use the provided bins burritos . hist( "Cost" ,bins = cost_bins) Question 7. What percentage of burritos in San Diego are less than $6? Assign burritos_less_than_six to your answer, which should be between 0 and 100 . You should only use the histogram above to answer the question. Do not use code on the table to to find the answer, just eyeball the heights and use Python to evaluate your arithmetic! (8 Points) Note : Your answer does not have to be exact, but it should be within a couple percentages of the staff answer. Hint: It might be helpful to review Section 7.2 of the textbook about calculating percentages from histograms. [29]: burritos_less_than_six = 1+4+17 6
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
[28]: grader . check( "q1_7" ) [28]: q1_7 results: All test cases passed! 1.2 2. San Francisco City Employee Salaries This exercise is designed to give you practice with using the Table methods .pivot and .group . Here is a link to the Python Reference in case you need a quick refresher. The Table Function Visualizer may also be a helpful tool. Run the cell below to view a demo on how you can use pivot on a table. (Thank you to past staff Divyesh Chotai!) [30]: # Don't worry about what this code does! It simply embeds a YouTube video into a code cell. from IPython.display import YouTubeVideo YouTubeVideo( "4WzXo8eKLAg" ) [30]: The data source we will use within this portion of the homework is publicly provided by the City of San Francisco. We have filtered it to retain just the relevant columns and restricted the data to the calendar year 2019. Run the following cell to load our data into a table called full_sf . 7
[31]: full_sf = Table . read_table( "sf2019.csv" ) full_sf . show( 5 ) <IPython.core.display.HTML object> The table has one row for each of the 44,525 San Francisco government employees in 2019. The first four columns describe the employee’s job. For example, the employee in the third row of the table had a job called “IS Business Analyst-Senior”. We will call this the employee’s position or job title . The job was in a Job Family called Information Systems (hence the IS in the job title), and was in the Adult Probation Department that is part of the Public Protection Organization Group of the government. You will mostly be working with the Job column. The next three columns contain the dollar amounts paid to the employee in the calendar year 2019 for salary, overtime, and benefits. Note that an employee’s salary does not include their overtime earnings. The last column contains the total compensation paid to the employee. It is the sum of the previous three columns: Total Compensation = Salary + Overtime + Benefits For this homework, we will be using the following columns: 1. Organization Group : A group of departments. For example, the Public Protection Org. Group includes departments such as the Police, Fire, Adult Protection, District Attorney, etc. 2. Department : The primary organizational unit used by the City and County of San Francisco. 3. Job : The specific position that a given worker fills. 4. Total Compensation : The sum of a worker’s salary, overtime, and benefits in 2019. Run the following cell to select the relevant columns and create a new table named sf . [32]: sf = full_sf . select( "Job" , "Department" , "Organization Group" , "Total Compensation" ) sf . show( 5 ) <IPython.core.display.HTML object> We want to use this table to generate arrays with the job titles of the members of each Organiza- tion Group . Question 1. Set job_titles to a table with two columns. The first column should be called Organization Group and have the name of every “Organization Group” each listed only once in this column, and the second column should be called Jobs with each row in that second column containing an array of the names of all the job titles within that “Organization Group”. Don’t worry if there are multiple of the same job titles. (9 Points) Hint 1: Think about how group works: it collects values into an array and then applies a function to that array. We have defined two functions below for you, and you will need to use one of them in your call to group . Hint 2: You might need to rename one of the columns. 8
[55]: # Pick one of the two functions defined below in your call to group. def first_item (array): '''Returns the first item''' return array . item( 0 ) def full_array (array): '''Returns the array that is passed through''' return array # Make a call to group using one of the functions above when you define job_titles job_titles = Table() . with_columns( "Organization Group" , sf . group( "Organization Group" ) . column( "Organization Group" ), "Jobs" , sf . drop( "Department" , "Total Compensation" ) . group( "Organization Group" , full_array) . column( "Job full_array" )) job_titles [55]: Organization Group | Jobs Community Health | ['Painter Supervisor 1' 'Painter' 'Painter' … 'Nursing … Culture & Recreation | ['Electrician' 'Executive Secretary 2' 'Bldgs & Grounds General Administration & Finance | ['Painter' 'Painter' 'Electrician' … 'Investigator, Ta … Human Welfare & Neighborhood Development | ['Dept Head I' 'Administrative Analyst' 'Community Devel … Public Protection | ['IS Trainer-Journey' 'IS Engineer- Assistant' 'IS Busine … Public Works, Transportation & Commerce | ['Heavy Equip Ops Asst Sprv' 'Heavy Equipment Ops Sprv' [56]: grader . check( "q2_1" ) [56]: q2_1 results: All test cases passed! Question 2. At the moment, the Job column of the sf table is not sorted (no particular order). Would the arrays you generated in the Jobs column of the previous question be the same if we had sorted alphabetically instead before generating them? Explain your answer. To receive full credit, your answer should reference how the .group method works, and how sorting the Jobs column would affect this. (8 Points) Note: Two arrays are the same if they contain the same number of elements and the elements located at corresponding indexes in the two arrays are identical. An example of arrays that are NOT the same: array([1,2]) != array([2,1]) . 9
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
No, they would not be the same arrays because the .group method works by grouping all of the same values in one column together (in this case the organization group column) and then applying the inputted function to aggregate the values in the other columns of the table. In this case, we used the function full_array which aggregates the other columns by creating an array with all of their values for a given organization group, listing them from top to bottom, and thus if we sorted the job titles alphabetically, the titles more towards the beginning of the alphabet would be listed towards the top and then maintain this order when they are put into their respective arrays. Thus, the elements of the arrays if we sorted alphabetically would be in a different order. Question 3. Set department_ranges to a table containing departments as the rows, and the orga- nization groups as the columns. The values in the rows should correspond to a total compensation range, where range is defined as the difference between the highest total compensation and the lowest total compensation in the department for that organization group . (9 Points) Hint: First you’ll need to define a new function compensation_range which takes in an array of compensations and returns the range of compensations in that array. [60]: # Define compensation_range first def compensation_range (arr): return ( max (arr) - min (arr)) department_ranges = sf . pivot( "Organization Group" , "Department" , values = "Total Compensation" , collect = compensation_range) department_ranges [60]: Department | Community Health | Culture & Recreation | General Administration & Finance | Human Welfare & Neighborhood Development | Public Protection | Public Works, Transportation & Commerce Academy Of Sciences | 0 | 199121 | 0 | 0 | 0 | 0 Administrative Services | 0 | 0 | 478784 | 0 | 0 | 0 Adult Probation | 0 | 0 | 0 | 0 | 303419 | 0 Airport Commission | 0 | 0 | 0 | 0 | 0 | 445092 Art Commission | 0 | 251823 | 0 | 0 | 0 | 0 Asian Art Museum | 0 | 298230 | 0 | 0 | 0 | 0 Assessor | 0 | 0 | 277385 | 0 | 0 | 0 Board Of Appeals | 0 | 0 | 0 | 0 | 0 | 243582 Board Of Supervisors | 0 | 0 | 293773 | 0 | 0 | 0 Building Inspection | 0 | 0 | 0 10
| 0 | 0 | 340852 … (41 rows omitted) [61]: grader . check( "q2_3" ) [61]: q2_3 results: All test cases passed! Question 4. Why might some of the row values be 0 in the department_ranges table from the previous question. (8 Points) Not all of the departments have all of these organization groups, so the table just puts 0 as the compensation range for organization groups that don’t exist in that department Question 5. Find the number of departments appearing in the sf table that have an average total compensation of greater than 125,000 dollars; assign this value to the variable num_over_125k . (9 Points) Note: The variable names provided are meant to help guide the intermediate steps and general thought process. Feel free to delete them if you’d prefer to start from scratch, but make sure your final answer is assigned to num_over_125k ! [77]: sums = sf . select( "Department" , "Total Compensation" ) . group( "Department" , np . sum) lens = sf . select( "Department" , "Job" ) . group( "Department" ) department_avg = sums . with_column( "Department Avg" , sums . column( "Total Compensation sum" ) / lens . column( "count" )) num_over_125k = department_avg . where( "Department Avg" , are . above( 125000 )) . num_rows num_over_125k [77]: 23 [78]: grader . check( "q2_5" ) [78]: q2_5 results: All test cases passed! You’re done with Homework 4! Important submission steps: 1. Run the tests and verify that they all pass. 2. Choose Save Notebook from the File menu, then run the final cell . 3. Click the link to download the zip file. 4. Go to Gradescope and submit the zip file to the corresponding assignment. The name of this assignment is “Homework 4 Autograder”. It is your responsibility to make sure your work is saved before running the last cell. 1.3 Pets of Data 8 Congrats on finishing Homework 4! Pet of the week: Gus 11
1.4 Submission Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. Please save before exporting! [79]: # Save your notebook first, then run this cell to export your submission. grader . export(pdf = False , run_tests = True ) Running your submission against local test cases… Your submission received the following results when run against available test cases: q1_1 results: All test cases passed! q1_4 results: All test cases passed! q1_5 results: All test cases passed! q1_7 results: All test cases passed! q2_1 results: All test cases passed! q2_3 results: All test cases passed! q2_5 results: All test cases passed! <IPython.core.display.HTML object> 12
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help