13-Correlation-Regression

BSTA 200
3. Correlation and Regression
Prof. Joshua Emmanuel

Learning Outcomes:
  • Calculate and interpret the simple correlation between two variables
  • Calculate and interpret the simple linear regression equation for a set of data
  • Describe the nature and strength of the relationship between two interval-level variables

3.1 RELATIONSHIP BETWEEN TWO QUANTITATIVE VARIABLES

When we study the relationship between two variables, we refer to the data as bivariate. We are only interested in relationships that can be described with a straight line.

SCATTER DIAGRAM

To describe the relationship between two interval variables graphically, we often use a scatter diagram (or scatter plot). In a scatter diagram, the variable along the vertical axis (Y-axis) is the dependent variable, while the variable along the horizontal axis (X-axis) is the independent variable.

Example 3.1: Construct a scatter diagram for the following data:

x     y
1     2
5     8
4     4
2     5
7     12
4     7
3     6
8     12
9     14

[Figure: Scatter diagrams showing relationships]
3.2 CORRELATION

The correlation coefficient $r$ measures the strength of the linear relationship between paired x and y values. It is 'The degree to which the points cluster about the line of best fit' (Howell 1992 p.223).

$$-1 \le r \le 1$$

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

The sign of r denotes the direction of association, while the magnitude of r denotes the strength of association.

Value of r        Strength and direction
-1 to -0.6        strong negative
-0.6 to -0.4      moderate negative
-0.4 to 0         weak negative
0 to 0.4          weak positive
0.4 to 0.6        moderate positive
0.6 to 1          strong positive

Interchanging x and y does not affect the value of r.

[Figure: Scatter diagrams and linear correlation coefficients]

CORRELATION DOES NOT IMPLY CAUSATION

If the correlation between two variables is strong, it does not mean that one causes the other. People spend more when the weather is cold. Does cold weather increase sales in Canada?

3.3 REGRESSION

In regression analysis we use the independent variable (X) to estimate the dependent variable (Y). X is also referred to as the explanatory variable and Y is also referred to as the response variable.
  • The relationship between the variables is linear
  • Both variables must be at least interval scale
  • The least squares criterion is used to determine the equation
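As a numerical check on the correlation formula in Section 3.2, the sketch below evaluates it term by term in Python. It assumes the Example 3.1 table is read as nine (x, y) pairs; the function name `pearson_r` is illustrative and not part of the course material.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r from the sums formula in Section 3.2."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Example 3.1 data, assumed to be (x, y) pairs
x = [1, 5, 4, 2, 7, 4, 3, 8, 9]
y = [2, 8, 4, 5, 12, 7, 6, 12, 14]

r = pearson_r(x, y)
print(f"r = {r:.3f}")                        # r = 0.959, a strong positive relationship
print(f"swapped: {pearson_r(y, x):.3f}")     # same value: interchanging x and y leaves r unchanged
```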
LEAST SQUARES PRINCIPLE: The regression equation is obtained by minimizing the sum of the squares of the vertical distances between the actual y values and the predicted values of y.

[Figure: scatter diagram with regression line showing the relationship between wait time (before seeing a doctor) and satisfaction rating at a hospital; horizontal axis: wait time, vertical axis: satisfaction rating]

Regression Equation: An equation that expresses the linear relationship between two variables.

General Form of Linear Regression Equation:

$$\hat{y} = b_0 + b_1 x \quad\text{or}\quad \hat{y} = a + bx$$

where
  • $\hat{y}$ is the predicted value of the dependent variable
  • $b$ or $b_1$ is the slope
  • $a$ or $b_0$ is the y-intercept

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \qquad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$

Slope, $b$ or $b_1$: the expected change in the value of y for a unit increase in x. That is, the slope is interpreted as the change in the Y variable associated with a unit change in the X variable.

y-intercept, $a$ or $b_0$: the point where the regression line crosses the Y-axis. The y-intercept is the predicted value of Y for an X value of zero.
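The slope and intercept formulas can be evaluated directly from the five sums they use. The following is a minimal sketch, again assuming the Example 3.1 data are (x, y) pairs; `least_squares_line` is a hypothetical helper name.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x using the sums formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * sum_x / n                                 # y-intercept
    return a, b

# Example 3.1 data, assumed to be (x, y) pairs
x = [1, 5, 4, 2, 7, 4, 3, 8, 9]
y = [2, 8, 4, 5, 12, 7, 6, 12, 14]

a, b = least_squares_line(x, y)
print(f"y-hat = {a:.3f} + {b:.3f}x")   # y-hat = 0.914 + 1.437x
```

Here the slope says that y rises by about 1.44 on average for each unit increase in x, and the intercept is the predicted y at x = 0.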
Example 3.3: To determine the relationship between sentence length (years) and the number of prior convictions, a criminologist collected the following data on defendants charged with similar offences. What is the expected sentence length for a defendant with 6 prior convictions?

Defendant    Prior Convictions    Sentence Length
1            0                    1
2            1                    1
3            2                    10
4            7                    22
5            0                    1
6            8                    22
7            5                    19
8            7                    25
9            8                    32
10           8                    14
11           0                    4
12           5                    17

Excel output

             Coefficients
Intercept    2.0702
Slope        2.8070

The estimated regression equation is

$$\hat{y} = 2.070 + 2.807x$$

For a defendant with 6 prior convictions, the expected sentence length is

$$\hat{y} = 2.070 + 2.807(6) \approx 19 \text{ years}$$

[Figure: Sentence vs. Priors, scatter diagram with fitted line $\hat{y} = 2.807x + 2.070$; horizontal axis: priors, vertical axis: sentence]

Interpreting the coefficients

The y-intercept, $b_0 = 2.070$, is the point where the regression line crosses the y-axis. It is the value of y when x = 0. That is, the expected sentence length is about 2.07 years when the number of prior convictions is 0.

The slope, $b_1 = 2.807$, means that for every additional prior conviction, sentence length increases on average by 2.807 years.

Excel

The regression coefficients can be obtained in Excel as follows:
  • y-intercept, a or b0: =INTERCEPT(known_y's, known_x's)
  • Slope, b or b1: =SLOPE(known_y's, known_x's)

Note: The regression coefficients were obtained using Excel. You can also use the BA II Plus calculator. See video. You will not be required to calculate them using the formula provided.
3.3 COEFFICIENT OF DETERMINATION, $R^2$

The coefficient of determination ($R^2$) is the proportion (usually stated as a percent) of the variation in the dependent variable (Y) that is explained (or accounted for) by the variation in the independent variable (X). It is the square of the coefficient of correlation. That is, $R^2 = r^2$.

$$0 \le R^2 \le 1$$

Example 3.4: The coefficient of determination in Example 3.3 is $R^2 = 0.822$. This tells us that 82.2% of the variation in sentence length is explained by prior convictions. The remaining 17.8% (1 − 0.822) is unexplained (or due to other factors).

$$r = (\text{sign of slope}) \times \sqrt{R^2}$$

ASSESSING THE MODEL: MEASURES OF VARIATION IN REGRESSION

The difference between the actual value of the dependent variable ($y$) and the predicted value ($\hat{y}$) is called the residual or error.

$$\text{Residual} = y - \hat{y}$$

The total variation in Y is divided into two parts: the variation that can be explained by x, and the variation that is not explained by x.

Total Variation          $SST = \sum (y - \bar{y})^2$          SST = Sum of Squares Total
Explained Variation      $SSR = \sum (\hat{y} - \bar{y})^2$    SSR = Sum of Squares Regression
Unexplained Variation    $SSE = \sum (y - \hat{y})^2$          SSE = Sum of Squares Error

Total Variation = Explained Variation + Unexplained Variation, or SST = SSR + SSE

[Figure: Variations about the estimated regression line, showing for a point $(x_i, y_i)$ the deviations $y_i - \bar{y}$ (SST), $y_i - \hat{y}_i$ (SSE) and $\hat{y}_i - \bar{y}$ (SSR)]
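The decomposition SST = SSR + SSE can be checked numerically on the Example 3.3 data. The sketch below fits the line in deviation form, which is algebraically equivalent to the sums formula for the slope, and then computes each sum of squares from its definition. Variable names are illustrative.

```python
# Example 3.3 data: prior convictions (x) and sentence length in years (y)
priors = [0, 1, 2, 7, 0, 8, 5, 7, 8, 8, 0, 5]
sentence = [1, 1, 10, 22, 1, 22, 19, 25, 32, 14, 4, 17]

n = len(priors)
x_bar = sum(priors) / n
y_bar = sum(sentence) / n

# Least squares fit in deviation form (equivalent to the sums formula)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(priors, sentence))
s_xx = sum((x - x_bar) ** 2 for x in priors)
b = s_xy / s_xx
a = y_bar - b * x_bar

predicted = [a + b * x for x in priors]

sst = sum((y - y_bar) ** 2 for y in sentence)                          # total variation
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)                 # explained variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sentence, predicted))   # unexplained variation

r_squared = ssr / sst

print(f"SST = {sst:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
print(f"R^2 = {r_squared:.3f}")   # R^2 = 0.822, matching Example 3.4
```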
The coefficient of determination is the proportion of total variation (SST) that is explained by the regression (SSR). It is also computed as:

$$R^2 = \frac{SSR}{SST}, \qquad 0 \le R^2 \le 1$$

The higher the value of $R^2$, the greater the share of the variation in y that is explained by the regression equation.

THE STANDARD ERROR OF ESTIMATE, $s_e$

The standard error of estimate, $s_e$ or SE, is a measure of the standard deviation of the differences between the observed y-values and the predicted values $\hat{y}$. It is computed as:

$$s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} = \sqrt{\frac{SSE}{n - 2}} \quad\text{or}\quad s_e = \sqrt{\frac{\sum y^2 - a\sum y - b\sum xy}{n - 2}}$$

When $s_e$ is small, the data points are relatively closer to the regression line than when $s_e$ is large. When $s_e$ is zero, all the points are on the line.

REGRESSION OUTPUT FROM EXCEL

See video (https://youtu.be/B-tFvua7qV4)

The values of SSR, SSE, SST, $R^2$, $s_e$ and others can be found in the Regression output of Excel using Data Analysis. Let us use the insurance premium data shown below:

Driving Experience (years)    5    8    17   8    12   3    20   5    8    12   15   30
Auto Insurance Premium ($)    173  190  155  164  162  225  118  173  168  162  158  95

Copy and paste the data into Excel. Copy the data again in Excel, click on where you want to paste the values, and paste using the transpose option (or hold down Alt, then hit E, S, E).

1. On the Data tab, click Data Analysis.
2. Select Regression and click OK.
3. Select the Y Range. Dependent variable: Auto Insurance Premium
4. Select the X Range. Independent variable: Driving Experience
5. Check Labels.
6. Click OK.

Excel produces the following Summary Output:
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.9032
R Square              0.8158
Adjusted R Square     0.7974
Standard Error        14.5692
Observations          12

ANOVA
              df    SS          MS          F          Significance F
Regression    1     9402.304    9402.304    44.2959    5.67E-05
Residual      10    2122.613    212.2613
Total         11    11524.92

                      Coefficients    Standard Error    t Stat     P-value
Intercept             207.2771        8.0087            25.8815    0.0000
Driving Experience    -3.80647        0.5719            -6.6555    0.0001

Description       Value
R^2               0.8158
Standard Error    14.5692
SSR               9402.304
SSE               2122.613
SST               11524.92
y-intercept       207.2771
Slope             -3.80647

Exercises:

1. Identify the independent and dependent variables in the following:
   a) Number of hours of studying and final grade.
   b) Distance covered on a trip and number of hours spent travelling.
   c) Commute time and method of transportation.
   d) Age and weight.
2. Answer the following questions on the correlation coefficient, r.
   a) Rank the following correlation coefficients from weakest to strongest:
      0.4   0.8   0.6   0.0   -0.9   0.3   -0.2   -0.7
   b) Rank the following correlation coefficients from the most negative to the most positive:
      0.4   0.8   0.6   0.0   -0.9   0.3   -0.2   -0.7
   c) Given the regression line $\hat{y} = 4.72 - 2.52x$ and a coefficient of determination of 0.64, what is the coefficient of correlation?
   d) If the coefficient of correlation between x and y is 0.74, what percent of variation in y is not explained by the variation in x?

   Answers:
   a) 0   -0.2   0.3   0.4   0.6   -0.7   0.8   -0.9
   b) -0.9   -0.7   -0.2   0   0.3   0.4   0.6   0.8
   c) $R^2 = 0.64$, so $r = -\sqrt{0.64} = -0.8$. The sign is negative because the slope (-2.52) is negative.
   d) $r = 0.74$, so $R^2 = 0.74^2 = 0.5476$ (54.76% of the variation is explained). Percent of variation not explained = 1 − 0.5476 = 0.4524, or 45.24%.

3. Humber Securities is studying the relationship between stock price and dividends paid to shareholders:

   Company        A      B      C     D     E      F
   Dividend       3.1    3.3    2     8     5.5    10
   Stock Price    20     25     31    38    30     39

   The following results on the relationship between the variables are obtained from Excel:

   Slope = 0.343, y-intercept = −5.157, r = 0.804

   a) If Humber's goal was to predict the dividend paid based on stock price, which variable is the dependent variable and which is the independent variable?
   b) State the regression equation (round coefficients to 3 decimal places).
   c) What is the value of the correlation coefficient r? Comment on the strength and direction of the relationship.
   d) What percent of the variation in dividend paid is explained by stock price?
   e) Predict the dividend that should be paid for a stock priced at $42.
   f) Is it okay to use the regression equation obtained for this data to predict the dividend for a stock priced at $250? Why?

   Answers:
   a) Independent variable: Stock Price, x. Dependent variable: Dividend, y.
   b) $\hat{y} = -5.157 + 0.343x$
   c) $r = 0.804$. There is a strong positive linear relationship between stock price and dividend.
   d) $R^2 = 64.7\%$. That is, 64.7% of the variation in dividend paid is explained by stock price.
   e) $\hat{y} = -5.157 + 0.343(42) = 9.25$
   f) No, because $250 is too far outside the range of stock prices (20 to 39) used to obtain the regression equation.