


University of California, Berkeley *





Computer Science


Dec 6, 2023





Uploaded by AdmiralAtom103517

hw10

November 30, 2023

[1]: # Initialize Otter
     import otter
     grader = otter.Notebook("hw10.ipynb")

1 Homework 10: Linear Regression

Helpful Resource:
  • Python Reference: Cheat sheet of helpful array & table methods used in Data 8!

Recommended Readings:
  • Correlation
  • The Regression Line
  • Method of Least Squares
  • Least Squares Regression

Please complete this notebook by filling in the cells provided. Before you begin, execute the cell below to setup the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.

For all problems that you must write explanations and sentences for, you must provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure to not re-assign variables throughout the notebook! For example, if you use max_temperature in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

Deadline: This assignment is due Wednesday, 11/8 at 11:00pm PT. Turn it in by Tuesday, 11/7 at 11:00pm PT for 5 extra credit points. Late work will not be accepted as per the policies page.

Note: This homework has hidden tests on it. That means even though tests may say 100% passed, it doesn’t mean your final grade will be 100%. We will be running more tests for correctness once everyone turns in the homework.

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

You should start early so that you have time to get help if you’re stuck. Office hours are held Monday through Friday in Warren Hall 101B. The office hours schedule appears here.
[2]: # Run this cell to set up the notebook, but please don't change it.
     import numpy as np
     from datascience import *

     # These lines do some fancy plotting magic.
     import matplotlib
     %matplotlib inline
     import matplotlib.pyplot as plt
     plt.style.use('fivethirtyeight')
     import warnings
     warnings.simplefilter('ignore', FutureWarning)
     from datetime import datetime

1.1 1. Linear Regression Setup

When performing linear regression, we need to compute several important quantities which will be used throughout our analysis. Unless otherwise specified when asked to make a prediction please assume we are predicting y from x throughout this assignment. To help with our later analysis, we will begin by writing some of these functions and understanding what they can do for us.

Question 1.1. Define a function standard_units that converts a given array to standard units. (3 points)

Hint: You may find the np.mean and np.std functions helpful.

[3]: def standard_units(data):
         return (data - np.mean(data)) / np.std(data)

[4]: grader.check("q1_1")

[4]: q1_1 results: All test cases passed!

Question 1.2. Which of the following are true about standard units? Assume we have converted an array of data into standard units using the function above. (5 points)

1. The unit of all our data when converted into standard units is the same as the unit of the original data.
2. The sum of all our data when converted into standard units is 0.
3. The standard deviation of all our data when converted into standard units is 1.
4. Adding a constant, C, to our original data has no impact on the resultant data when converted to standard units.
5. Multiplying our original data by a positive constant, C (>0), has no impact on the resultant data when converted to standard units.

Assign standard_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign standard_array to make_array(1, 3, 5).
[5]: standard_array = make_array(2, 3, 4, 5)

[6]: grader.check("q1_2")

[6]: q1_2 results: All test cases passed!

Question 1.3. Define a function correlation that computes the correlation between 2 arrays of data in original units. (3 points)

Hint: Feel free to use functions you have defined previously.

[7]: def correlation(x, y):
         return np.mean(standard_units(x) * standard_units(y))

[8]: grader.check("q1_3")

[8]: q1_3 results: All test cases passed!

Question 1.4. Which of the following are true about the correlation coefficient r? (5 points)

1. The correlation coefficient measures the strength of a linear relationship.
2. A correlation coefficient of 1.0 means an increase in one variable always means an increase in the other variable.
3. The correlation coefficient is the slope of the regression line in standard units.
4. The correlation coefficient stays the same if we invert our axes.
5. If we add a constant, C, to our original data, our correlation coefficient will increase by the same C.

Assign r_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign r_array to make_array(1, 3, 5).

[9]: r_array = make_array(1, 2, 3, 4)

[10]: grader.check("q1_4")

[10]: q1_4 results: All test cases passed!

Question 1.5. Define a function slope that computes the slope of our line of best fit (to predict y given x), given two arrays of data in original units. Assume we want to create a line of best fit in original units. (3 points)

Hint: Feel free to use functions you have defined previously.

[11]: def slope(x, y):
          r = correlation(x, y)
          return r * (np.std(y) / np.std(x))

[12]: grader.check("q1_5")

[12]: q1_5 results: All test cases passed!
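The properties selected in Questions 1.2 and 1.4 can be checked numerically. The following is a minimal sketch, assuming small made-up arrays x and y (not data from the homework) and using NumPy's own np.corrcoef and np.polyfit only as independent cross-checks of the notebook's formulas.

```python
import numpy as np

def standard_units(data):
    """Convert an array to standard units: (value - mean) / SD."""
    return (data - np.mean(data)) / np.std(data)

def correlation(x, y):
    """Correlation r: the mean of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

def slope(x, y):
    """Slope of the regression line in original units: r * SD(y) / SD(x)."""
    return correlation(x, y) * np.std(y) / np.std(x)

# Hypothetical toy data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

su = standard_units(x)
shifted = standard_units(x + 10)   # adding a constant first
scaled = standard_units(x * 3)     # multiplying by a positive constant first

r = correlation(x, y)
r_swapped = correlation(y, x)          # axes inverted
r_numpy = np.corrcoef(x, y)[0, 1]      # NumPy's Pearson correlation
slope_numpy = np.polyfit(x, y, 1)[0]   # least-squares slope from NumPy

print("sum and SD in standard units:", np.sum(su), np.std(su))
print("shift/scale leave standard units unchanged:",
      np.allclose(su, shifted), np.allclose(su, scaled))
print("r, r with axes swapped, np.corrcoef:", r, r_swapped, r_numpy)
print("slope from formula vs np.polyfit:", slope(x, y), slope_numpy)
```

The sum in standard units prints as a value within floating-point error of 0 rather than exactly 0, which is expected.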
Question 1.6. Which of the following are true about the slope of our line of best fit? Assume x refers to the value of one variable that we use to predict the value of y. (5 points)

1. In original units, the slope has the unit: unit of x / unit of y.
2. In standard units, the slope is unitless.
3. In original units, the slope is unchanged by swapping x and y.
4. In standard units, a slope of 1 means our data is perfectly linearly correlated.
5. In original units and standard units, the slope always has the same positive or negative sign.

Assign slope_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign slope_array to make_array(1, 3, 5).

[13]: slope_array = make_array(2, 4, 5)

[14]: grader.check("q1_6")

[14]: q1_6 results: All test cases passed!

Question 1.7. Define a function intercept that computes the intercept of our line of best fit (to predict y given x), given 2 arrays of data in original units. Assume we want to create a line of best fit in original units. (3 points)

Hint: Feel free to use functions you have defined previously.

[15]: def intercept(x, y):
          return np.mean(y) - (slope(x, y) * np.mean(x))

[16]: grader.check("q1_7")

[16]: q1_7 results: All test cases passed!

Question 1.8. Which of the following are true about the intercept of our line of best fit? Assume x refers to the value of one variable that we use to predict the value of y. (5 points)

1. In original units, the intercept has the same unit as the y values.
2. In original units, the intercept has the same unit as the x values.
3. In original units, the slope and intercept have the same unit.
4. In standard units, the intercept for the regression line is 0.
5. In original units and standard units, the intercept always has the same magnitude.

Assign intercept_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign intercept_array to make_array(1, 3, 5).

[17]: intercept_array = make_array(1, 4)

[18]: grader.check("q1_8")

[18]: q1_8 results: All test cases passed!
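The intercept formula says the regression line passes through the point of averages, which is also why the intercept is 0 once both variables are in standard units. Here is a small sketch of that reasoning, again assuming made-up toy arrays rather than homework data; np.polyfit appears only as an independent check.

```python
import numpy as np

def standard_units(data):
    return (data - np.mean(data)) / np.std(data)

def correlation(x, y):
    return np.mean(standard_units(x) * standard_units(y))

def slope(x, y):
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    """Intercept in original units: mean(y) - slope * mean(x)."""
    return np.mean(y) - slope(x, y) * np.mean(x)

# Hypothetical toy data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

m = slope(x, y)
b = intercept(x, y)

# The fitted line evaluated at mean(x) returns mean(y).
y_at_mean_x = m * np.mean(x) + b

# In standard units both means are 0, so the intercept is 0 and the slope is r.
x_su, y_su = standard_units(x), standard_units(y)
m_su = slope(x_su, y_su)
b_su = intercept(x_su, y_su)

print("original units: slope", m, "intercept", b)
print("line at mean(x):", y_at_mean_x, "mean(y):", np.mean(y))
print("standard units: slope", m_su, "intercept", b_su, "r", correlation(x, y))
```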
Question 1.9. Define a function predict that takes in a table and 2 column names, and returns an array of predictions. The predictions should be created using a fitted regression line. We are predicting "col2" from "col1", both in original units. (5 points)

Hint 1: Feel free to use functions you have defined previously.

Hint 2: Re-reading 15.2 might be helpful here.

Note: The public tests are quite comprehensive for this question, so passing them means that your function most likely works correctly.

[19]: def predict(tbl, col1, col2):
          x = tbl.column(col1)
          y = tbl.column(col2)
          return (slope(x, y) * x) + intercept(x, y)

[20]: grader.check("q1_9")

[20]: q1_9 results: All test cases passed!

1.2 2. FIFA Predictions

The following data was scraped from sofifa.com, a website dedicated to collecting information from FIFA video games. The dataset consists of all players in FIFA 22 and their corresponding attributes. We have truncated the dataset to a limited number of rows (100) to ease with our visualizations and analysis. Since we’re learning about linear regression, we will look specifically for a linear association between various player attributes. To help with understanding where the line of best fit generated in linear regression comes from please do not use the .fit_line argument in .scatter at any point on question 2 unless the code was provided for you. Feel free to read more about the video game on Wikipedia.

[21]: # Run this cell to load the data
      fifa = Table.read_table('fifa22.csv')

      # Select a subset of columns to analyze (there are 110 columns in the original dataset)
      fifa = fifa.select("short_name", "overall", "value_eur", "wage_eur", "age",
                         "pace", "shooting", "passing", "attacking_finishing")
      fifa.show(5)

<IPython.core.display.HTML object>

Question 2.1. Before jumping into any statistical techniques, it’s important to see what the data looks like, because data visualizations allow us to uncover patterns in our data that would have otherwise been much more difficult to see. (3 points)

Create a scatter plot with age on the x-axis (“age”), and the player’s value in Euros (“value_eur”) on the y-axis.

[22]: fifa.scatter("age", "value_eur")
Question 2.2. Does the correlation coefficient r for the data in our scatter plot in 2.1 look closest to 0, 0.75, or -0.75? (3 points)

Assign r_guess to one of 0, 0.75, or -0.75.

[23]: r_guess = -0.75

[24]: grader.check("q2_2")

[24]: q2_2 results: All test cases passed!

Question 2.3. Create a scatter plot with player age (“age”) along the x-axis and both real player value (“value_eur”) and predicted player value along the y-axis. The predictions should be created using a fitted regression line. The color of the dots for the real player values should be different from the color for the predicted player values. (8 points)

Hint 1: Feel free to use functions you have defined previously.

Hint 2: 15.2 has examples of creating such scatter plots.
[25]: predictions = predict(fifa, "age", "value_eur")
      fifa_with_predictions = fifa.with_columns("preds", predictions).select("age", "value_eur", "preds")
      fifa_with_predictions.scatter("age")

Question 2.4. Looking at the scatter plot you produced above, is linear regression a good model to use? If so, what features or characteristics make this model reasonable? If not, what features or characteristics make it unreasonable? (5 points)

Yes, because on average, the real values are not far off from the values predicted by the regression line, and there are roughly as many values with positive residuals as there are with negative residuals.

Question 2.5. In 2.3, we created a scatter plot in original units. Now, create a scatter plot with player age in standard units along the x-axis and both real and predicted player value in standard units along the y-axis. The color of the dots of the real and predicted values should be different. (8 points)

Hint: Feel free to use functions you have defined previously.

[26]: predictions_su = standard_units(fifa.column("age")) * correlation(fifa.column("age"), fifa.column("value_eur"))
      fifa_su = fifa.with_columns(
          "agesu", standard_units(fifa.column("age")),
          "valsu", standard_units(fifa.column("value_eur")),
          "predsu", predictions_su).select("agesu", "valsu", "predsu")
      fifa_su.scatter("agesu")

Question 2.6. Compare your plots in 2.3 and 2.5. What similarities do they share? What differences do they have? (5 points)

The data has exactly the same shape and pattern in both plots, and the regression line fits the data well in both plots. However, the values of the data points differ between the plots. In the plot in 2.3, the x values range from ~20 to 40 and the y values range from ~0 to 2*10^8, but in the plot in 2.5 both the x and y values only range from ~-2.5 to 2.5. The values for age and value in euros are centered around 0 in the plot in 2.5, but this is not true for the plot in 2.3.

Question 2.7. Define a function rmse that takes in two arguments: a slope and an intercept for a potential regression line. The function should return the root mean squared error between the values predicted by a regression line with the given slope and intercept and the actual outcomes. (6 points)

Assume we are still predicting “value_eur” from “age” in original units from the fifa table.

[27]: def rmse(slope, intercept):
          predictions = (fifa.column("age") * slope) + intercept
          errors = (fifa.column("value_eur") - predictions) ** 2
          return np.mean(errors) ** 0.5

[28]: grader.check("q2_7")

[28]: q2_7 results: All test cases passed!

Question 2.8. Use the rmse function you defined along with minimize to find the least-squares regression parameters predicting player value from player age. Here’s an example of using the minimize function from the textbook. (10 points)

Then set lsq_slope and lsq_intercept to be the least-squares regression line slope and intercept, respectively.

Finally, create a scatter plot like you did in 2.3 with player age (“age”) along the x-axis and both real player value (“value_eur”) and predicted player value along the y-axis. Be sure to use your least-squares regression line to compute the predicted values. The color of the dots for the real player values should be different from the color for the predicted player values.

Note: Your solution should not make any calls to the slope or intercept functions defined earlier.

Hint: Your call to minimize will return an array of argument values that minimize the return value of the function passed to minimize.

[29]: minimized_parameters = minimize(rmse)
      lsq_slope = minimized_parameters[0]
      lsq_intercept = minimized_parameters[1]

      # This just prints your slope and intercept
      print("Slope: {:g} | Intercept: {:g}".format(lsq_slope, lsq_intercept))

      fifa_with_lsq_predictions = fifa.with_columns(
          "preds", (fifa.column("age") * lsq_slope) + lsq_intercept).select("age", "value_eur", "preds")
      fifa_with_lsq_predictions.scatter("age")

Slope: -6.41462e+06 | Intercept: 2.55525e+08
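To see why minimizing RMSE recovers the line given by the correlation-based formulas, the sketch below replaces the course's minimize function with a brute-force grid search over candidate lines. This is a swapped-in technique for illustration, not the method used in the notebook, and the arrays x and y are made-up stand-ins for the "age" and "value_eur" columns.

```python
import numpy as np

# Hypothetical toy data standing in for the "age" and "value_eur" columns.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Line from the correlation-based formulas.
r = np.corrcoef(x, y)[0, 1]
formula_slope = r * np.std(y) / np.std(x)
formula_intercept = np.mean(y) - formula_slope * np.mean(x)

# Candidate lines: every (slope, intercept) pair on a grid with step 0.01.
slopes = np.linspace(-2, 2, 401)
intercepts = np.linspace(-5, 5, 1001)

# errors[i, j, k] = y[k] - (slopes[i] * x[k] + intercepts[j])
errors = y - (slopes[:, None, None] * x + intercepts[None, :, None])

mse_grid = np.mean(errors ** 2, axis=2)   # mean squared error per candidate
rmse_grid = np.sqrt(mse_grid)             # root mean squared error
sse_grid = np.sum(errors ** 2, axis=2)    # sum of squared errors

def best_line(grid):
    """Return the (slope, intercept) pair with the smallest value in grid."""
    i, j = np.unravel_index(np.argmin(grid), grid.shape)
    return slopes[i], intercepts[j]

grid_slope, grid_intercept = best_line(rmse_grid)

print("formulas:    slope {:g} | intercept {:g}".format(formula_slope, formula_intercept))
print("grid search: slope {:g} | intercept {:g}".format(grid_slope, grid_intercept))
print("same minimizer for MSE and SSE:",
      np.allclose(best_line(mse_grid), best_line(rmse_grid)),
      np.allclose(best_line(sse_grid), best_line(rmse_grid)))
```

The grid search agrees with the formulas up to the grid spacing. Taking a square root or multiplying by the number of points does not change which line has the smallest error, so the mean squared error and the sum of squared errors are minimized by the same line.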
Question 2.9. The resulting line you found in 2.8 should appear very similar to the line you found in 2.3. Why were we able to minimize RMSE to find nearly the same slope and intercept from the previous formulas? (5 points)

Hint: Re-reading 15.3 might be helpful here.

There is only one “best fit” regression line, which is the line that minimizes mean squared error. For mathematical reasons, the line you get with calculus-based numerical optimization (the approach that created the line from 2.8) and the line you get by using the formulas related to the correlation coefficient (the approach that created the line from 2.3) both yield this best fit line, and thus the line you get from either approach will be roughly the same.

Question 2.10. For which of the following error functions would we have resulted in the same slope and intercept values in 2.8 instead of using RMSE? Assume error is assigned to the actual values minus the predicted values. (5 points)

1. np.sum(error) ** 0.5
2. np.sum(error ** 2)
3. np.mean(error) ** 0.5
4. np.mean(error ** 2)

Assign error_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign error_array to make_array(1, 3, 5).

Hint: What was the purpose of RMSE? Are there any alternatives, and if so, does minimizing them yield the same results as minimizing the RMSE?

[30]: error_array = make_array(2, 4)

[31]: grader.check("q2_10")

[31]: q2_10 results: All test cases passed!

[32]: # goalies don't have shooting in our dataset so we removed them before looking at the pace stat
      no_goalies = fifa.where("shooting", are.above(0))
      no_goalies

[32]: short_name        | overall | value_eur | wage_eur | age | pace | shooting | passing | attacking_finishing
      L. Messi          | 93      | 78000000  | 320000   | 34  | 85   | 92       | 91      | 95
      R. Lewandowski    | 92      | 119500000 | 270000   | 32  | 78   | 92       | 79      | 95
      Cristiano Ronaldo | 91      | 45000000  | 270000   | 36  | 87   | 94       | 80      | 95
      Neymar Jr         | 91      | 129000000 | 270000   | 29  | 91   | 83       | 86      | 83
      K. De Bruyne      | 91      | 125500000 | 350000   | 30  | 76   | 86       | 93      | 82
      K. Mbappé         | 91      | 194000000 | 230000   | 22  | 97   | 88       | 80      | 93
      H. Kane           | 90      | 129500000 | 240000   | 27  | 70   | 91       | 83      | 94
      N. Kanté          | 90      | 100000000 | 230000   | 30  | 78   | 66       | 75      | 65
      K. Benzema        | 89      | 66000000  | 350000   | 33  | 76   | 86       | 81      | 90
      H. Son            | 89      | 104000000 | 220000   | 28  | 88   | 87       | 82      | 88
      … (75 rows omitted)

[33]: # Run this cell to generate a scatter plot for the next part.
      no_goalies.scatter('shooting', 'attacking_finishing', fit_line=True)
Question 2.11. Above is a scatter plot showing the relationship between a player’s shooting ability (“shooting”) and their scoring ability (“attacking_finishing”). There is clearly a strong positive correlation between the 2 variables, and we’d like to predict a player’s scoring ability from their shooting ability. Which of the following are true, assuming linear regression is a reasonable model? (5 points)

Hint: Re-reading 15.2 might be helpful here.

1. For a majority of players with a shooting attribute above 80 our model predicts they have a better scoring ability than shooting ability.
2. A randomly selected player’s predicted scoring ability in standard units will always be less than their shooting ability in standard units.
3. If we select a player whose shooting ability is 1.0 in standard units, their scoring ability, on average, will be less than 1.0 in standard units.
4. Goalies have attacking_finishing scores in our dataset but do not have shooting scores. We can still use our model to predict their attacking_finishing scores.

Assign scoring_array to an array of your selections, in increasing numerical order. For example, if you wanted to select options 1, 3, and 5, you would assign scoring_array to make_array(1, 3, 5).
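Options 2 and 3 both concern the regression line in standard units, where the prediction is r times the x value. The sketch below illustrates that relationship under stated assumptions: the shooting and finishing arrays are invented numbers, not rows from fifa22.csv, and predict_su is a hypothetical helper introduced only for this example.

```python
import numpy as np

def standard_units(data):
    return (data - np.mean(data)) / np.std(data)

# Hypothetical ratings, chosen only to give a strong positive correlation below 1.
shooting = np.array([60.0, 70.0, 75.0, 80.0, 90.0])
finishing = np.array([58.0, 72.0, 70.0, 85.0, 88.0])

r = np.mean(standard_units(shooting) * standard_units(finishing))

def predict_su(x_su):
    """Regression line in standard units: slope r, intercept 0."""
    return r * x_su

above = predict_su(1.0)    # player 1 SD above average in shooting
below = predict_su(-1.0)   # player 1 SD below average in shooting

print("r:", r)
print("predicted finishing at shooting = 1.0 SU:", above)
print("predicted finishing at shooting = -1.0 SU:", below)
```

With 0 < r < 1, the prediction for a player at 1.0 standard units is r, which is below 1.0. For a player below average, the prediction r * x is greater than x, so the predicted value is not always less than the shooting value in standard units.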
[34]: scoring_array = make_array(1, 3)

[35]: grader.check("q2_11")

[35]: q2_11 results: All test cases passed!

You’re done with Homework 10!

Important submission steps:
1. Run the tests and verify that they all pass.
2. Choose Save Notebook from the File menu, then run the final cell.
3. Click the link to download the zip file.
4. Go to Gradescope and submit the zip file to the corresponding assignment. The name of this assignment is “HW 10 Autograder”.

It is your responsibility to make sure your work is saved before running the last cell.

1.3 Pets of Data 8

Nico says congrats on finishing Homework 10! Only two more to go!

Pet of the week: Nico

1.4 Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. Please save before exporting!

[36]: # Save your notebook first, then run this cell to export your submission.
      grader.export(pdf=False, run_tests=True)

Running your submission against local test cases…

Your submission received the following results when run against available test cases:

    q1_1 results: All test cases passed!
    q1_2 results: All test cases passed!
    q1_3 results: All test cases passed!
    q1_4 results: All test cases passed!
    q1_5 results: All test cases passed!
    q1_6 results: All test cases passed!
    q1_7 results: All test cases passed!
    q1_8 results: All test cases passed!
    q1_9 results: All test cases passed!
    q2_2 results: All test cases passed!
    q2_7 results: All test cases passed!
    q2_10 results: All test cases passed!
    q2_11 results: All test cases passed!

<IPython.core.display.HTML object>