Tutorial_Wk9

pdf

School

The University of Sydney *

*We aren’t endorsed by this school

Course

ENVX1002

Subject

Statistics

Date

Feb 20, 2024

Type

pdf

Pages

7

Uploaded by CoachDragonfly351

Report
ENVX1002 Tutorial 9 What we will cover in this tutorial We will build upon what has been covered in the lecture, focussing on understanding relationships between data and then fitting a simple linear model to this relationship. In the lecture I mentioned there were two ways to fit a linear regression - analytical and numeric, so I will fit the model both ways, demonstrating this in excel. Then once we have a grasp of the underlying concepts, I will then show you how we fit simple linear models in R, which is much easier as the underlying equations have been done for us. The steps we will follow in this tutorial will be similar to what you will encounter in the practical. The tutorial will equip you with what you need to know, and from then in the practical you will able to lead your own learning a bit more. The data for this tutorial I have selected a single dataset for this tutorial so we can compare the outcomes from each fitting method. We will be using the Global Power Plant Database compiled by the World Resources Institute. More information about the dataset is available here: https://datasets.wri.org/dataset/540dcf46-f287-47ac- 985d-269b04bea4c6 I have left the data as-is so we can have a quick look over it before starting our analyses. Our scientific question With increased power plant capacity, is there more power being generated? Exercise 1: Data exploration a. What are the types of data within the Database? We will focus on the columns ‘capacity_mw’ and ‘generation_gwh_2019’ , which is the reported gen- eration of each power plant for 2019. b. What type of data are these two selected columns? To make the instructions more concise, I have isolated the columns of interest and pasted them into a dedi- cated sheet for each exercise. I have added x_ to capacity (now x_capacity_mw ), and y_ to generation (now y_generation_gwh_2019 ) to make things easier in our calculations. We will also only include values where generation_gwh_2019 is > 0. Some power plants recorded zero generation and even negative numbers, most likely due to the plants being shut down for maintenance. We are excluding these values as we only want instances where power was actually generated for that year. c. Using commands in Excel, what can we say about the spread and centre of the data? 1
Commands we can use To calculate the mean: AVERAGE() To calculate the median: MEDIAN() To calculate the minimum: MIN() Maximum: MAX() Standard Deviation: STDEV() Interquartile range: QUARTILE() for Q3 and Q1, then subtract Q3-Q1 for IQR Sometimes functions can be hard to remember, so if you are ever stuck, you can search for functions by clicking the Formulas tab in Excel. d. What can we say about the relationship between the two variables? To calculate this in Excel, we can use CORREL() Now we have explored the data, we can proceed to fitting a model. Exercise 2: Analytical fitting Analytical fitting is where we fit the model based on equations. We will use Excel for this rather than writing it out by hand as Excel gives us a bit of automation, rather than typing every single value of this large dataset our calculators. So we can set up our equations in excel, here are the equations we will be working with: The equation for our sample regression line is: ˆ y i = b 0 + b 1 x i + ϵ i where b 0 is the intercept, b 1 is the slope, and ϵ i is the error unaccounted for by the model. We can calculate the slope ( b 1 ) using the following equation: b 1 = QQQQQQQ ( x i ˘¯ x )( y i ˘¯ y ) QQQQQQQ ( x i ˘¯ x ) 2 = SS xy SS xx Once we have the slope, we can the calculate the intercept ( b 0 ) using the following equation: b 0 = ¯ y ˘ b 1 ¯ x From here, we can then jump into Excel and fill in the columns. For this exercise we will be working in the sheet labelled Analytical_fitting . 1. The column labelled x-meanx will be the first component of the slope equation ( x i ˘¯ x ). We can calculate the value using the first cell in x_capacity_mw , subtracting the mean value and then double click to autofill down the page. 2
For example, I have input =A2-$I$2 ; where I2 is the mean of x_capacity_mw, and I have used $ to tell Excel to hold the value when autofilling down the page. 2. The column labelled (x-mean)2 is the denominator component of the equation ( ( x i ˘¯ x ) 2 ), but as it is directly related to the column next to it, we keep them together. For example, I have input =C2ˆ2 and double-clicked the bottom of the cell to autofill down the page. 3. The column labelled y-meany , makes up the final portion of the equation ( y i ˘¯ y ). We can calculate this value using the first cell in y_generation_gwh_2019 and subtracting the mean of this column, then double-clicking to autofill. For example, I have input =B2-$I$3 ; cell I3 is where I have placed the mean of y_generation_gwh_2019, and I have used $ to tell Excel to hold the value when autofilling down the page. 4. Then we can calculate the numerator of the equation by multiplying our x-meanx and y-meany columns. I have labelled this new column product . For example, In the first cell of the product column I have input **=C2*E2** and autofilled. 5. From here we can calculate SS xy , calculating the SUM() of the product column, and then we can calculate SS xx by calculating the SUM of the (x-mean)2 column. Please store the SS xy result in cell I5, and SS xx in cell I6. 6. Once we have calculated SS xy and SS xx , we have all the components we need to calculate the slope estimate ( b 1 ). We do this by dividing SS xy / SS xx . In my example I have done this by putting the following equation into cell I8: =I5/I6 . 7. To calculate b 0 , as per the equation we subtract the slope multiplied by the mean of x, from the mean of y;. For example, I calculated b 1 in cell I8, and so the equation I have used is: =I3-(I8 * I2) . Quick questions a. Based on our results, what would the equation of our line be? b. How does this compare to the equation obtained from the plot? Exercise 3: Numerical fitting We will now have a go at numerical fitting and see how different the output is to our analytical fitting. Numerical fitting is where the computer generates a series of random guesses until the optimal residual sum of squares (RSS) is found. We can do this in Excel using the Solver add-in. To install Solver, I have created a separate resource, which you can find on the page for this tutorial, and I have included it on the practical page for easy reference. Solver is going to take into account a slope and intercept value, and guess the right combination to obtain the lowest RSS. We therefore need to provide 1) Slope ( b 1 ), 2) Intercept ( b 0 ), and 3) the RSS. We already have the first two values, but we need to calculate an RSS for Solver to start with. Let’s get to work! VERY IMPORTANT : You will notice in the data prep we will be using cell references rather than typing in discrete values (e.g. =I8, instead of 4.2707). If this is not done, the RSS will not update when Solver changes the b 1 and b 0 values. 3
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Data preparation for Solver For this exercise we will be working in the worksheet labelled Numerical_fitting . 1. In the column labelled predicted , you will use the model equation and coefficients to predict the values of y. We could use any value for b 1 and b 0 as they are only starting values for Solver to work with, but here we will use the model equation and coefficients we obtained in our analytical fitting as a starting point. Therefore, in cells I4 and I5, place your slope ( b 1 ) and intercept ( b 0 ) values we obtained earlier. You can then use these values to predict power generation by substituting the slope, intercept and x_capacity_mw values into the model equation. In cell C2, I have written the following equation: =$I 5 + ( I$4 * A2) , and have double-clicked the corner of the cell to autofill down the page. 2. In the column called residuals , you will calculate the difference between the observed values ( y_generation_gwh_2019 ) and the values you have just predicted with the model equation. 3. In the column called residuals ˆ 2 , you can then square the residuals column, i.e. for the first cell I have input =D2 ˆ 2 and autofilled. 4. In cell I2, you can calculate the sum of the residualsˆ2 column to obtain the RSS. Now you will have the 3 cells required for Solver to do some solving. Running Solver 5. In excel, click the Files tab, click Options and then Add-ins . At the bottom of that page, where it says ‘Manage’, ensure “Excel Add-ins’ is selected and click Go . A small window will pop up where you then need to tick the box next to ‘Solver’. Click Ok . 6. You can then open Solver by clicking the Data tab and to the far right you should see ‘Solver’, which you can click to open the add-in. 7. When the solver window opens, the first thing you need to input is the “Objective”. This is the cell that is going to change based on the combinations being tested by Solver. Here you will input the cell reference for wherever you have calculated the RSS; in my example, I have calculated it in cell I2. 8. As we want to minimise the RSS and therefore want Solver to stop when it reaches the smallest value; where it says “To”, you need to select the Min option. 9. Solver is going to change the slope and intercept, so in the next box “By changing the variable cells:”, we add in the cell references for where we have placed b 0 and b 1 . 10. You can then click Solve and see what happens to your b 1 and b 0 values. Quick questions a. Did Solver change anything for your data? b. Make a scatterplot of your x and y variables, add a trendline and show the equation and r 2 value. How did the r 2 value change- did it get smaller, larger, or stay the same? Update: We can check this because Solver has calculated the intercept to be zero. Therefore, to find the r 2 for the Solver solution, we can right click, Format Trendline. . . and in the settings, tick the box that says “Set intercept” and see how different the r 2 is. 4
Exercise 4: Fitting in R All of the calculations we have done in excel to fit the linear model are stored within a single function in R. For this exercise we will use the sheet labelled r_lm . 1. Similar to what you have been doing in previous weeks, set up a new project in a new directory and copy/paste your excel file into this directory. 2. Read the data in: library(readxl) ## Warning: package ’readxl’ was built under R version 4.1.3 dat = read_xlsx( "ENVX1002_Tutorial_Describing_relationships.xlsx" , sheet= "r_lm" ) You should see only the two columns we are interested in. 4. Run some quick exploratory analysis using the summary() function. summary(dat) ## x_capacity_mw y_generation_gwh_2019 ## Min. : 1.00 Min. : 0.00 ## 1st Qu.: 2.80 1st Qu.: 3.46 ## Median : 8.80 Median : 14.57 ## Mean : 127.59 Mean : 449.34 ## 3rd Qu.: 72.97 3rd Qu.: 141.05 ## Max. :6809.00 Max. :31920.37 5. Run a correlation analysis between the two variables using the cor() function. cor(dat) ## x_capacity_mw y_generation_gwh_2019 ## x_capacity_mw 1.0000000 0.8654114 ## y_generation_gwh_2019 0.8654114 1.0000000 Are these values similar to what we found in excel? Fitting a linear model in R We know there is some sort of relationship occurring, so we can investigate further by fitting a simple linear regression: # Fit the model mod = lm(y_generation_gwh_2019 ~ x_capacity_mw, data= dat) # Obtain a summary of the model summary(mod) 5
## ## Call: ## lm(formula = y_generation_gwh_2019 ~ x_capacity_mw, data = dat) ## ## Residuals: ## Min 1Q Median 3Q Max ## -12344.5 27.5 84.7 92.8 14693.5 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -95.56707 9.72579 -9.826 <2e-16 *** ## x_capacity_mw 4.27074 0.02589 164.977 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## Residual standard error: 873.9 on 9124 degrees of freedom ## Multiple R-squared: 0.7489, Adjusted R-squared: 0.7489 ## F-statistic: 2.722e+04 on 1 and 9124 DF, p-value: < 2.2e-16 You can also overlay the linear model onto a scatterplot of your points. # Create scatterplot and overlay abline plot(dat$x_capacity_mw, dat$y_generation_gwh_2019, xlab = "Capacity (MW)" , ylab = "Energy generated in 2 abline(mod, col = "red" ) 6
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
0 1000 2000 3000 4000 5000 6000 7000 0 5000 10000 15000 20000 25000 30000 Capacity (MW) Energy generated in 2019 (GWh) Quick questions a. Looking at the model summary, where can we find out the values for slope ( b 1 ) and intercept ( b 0 )? b. What would the equation of our fitted line be? c. Is the r 2 the same as what we ended up with in Excel? d. How would we conclude and answer our scientific question? That’s it for now, excellent work getting through this tutorial. Now you are ready for the prac! 7