Handout 3.2 Linear Regression

School

College of San Mateo *

Course

80

Subject

Mathematics

Date

Apr 3, 2024

Pages

8

Math 80 Correlation Coefficient and Best-Fit Line Handout 3.2

Is the price of shoes related to how long they last? Does a person with a larger head circumference tend to have a higher IQ score? Is there a relationship between the number of hours a student studies and the student's test score? These and many other research questions look at the relationship between two quantitative variables. In this lesson, we will examine scatterplots, linear correlation coefficients, and best-fit lines for datasets with two quantitative variables.

PART I: SCATTERPLOTS

Scatterplots are graphical displays of a dataset with two quantitative variables. The values of the explanatory variable appear on the horizontal axis (x-axis), and the values of the response variable appear on the vertical axis (y-axis). Each dot on the scatterplot has an x-coordinate and a y-coordinate and represents one person or one object, such as a car or a book. A scatterplot is used to describe the overall pattern and any striking deviations from it. Recall that a response variable measures an outcome of a study, and an explanatory variable may explain or influence changes in a response variable.

1 For the research question “Is there a relationship between the number of hours a student studies and test score?”, which is the explanatory variable? Which is the response variable?
Explanatory variable: _______________________ Response variable: ________________________

2 Suppose you find a footprint at a crime scene. Would this be useful information in predicting the suspect’s height? The answer to this question depends on whether or not there is an association between foot length and height. So let’s look at some data. The table below shows the heights (in inches) and foot lengths (in centimeters) of a sample of 20 statistics students.
A Which is the explanatory variable? Which is the response variable?
Explanatory variable: ____________ Response variable: ______________
B Use the grid below to make a scatterplot of the data (i.e., plot the data points).
Be sure to place the explanatory variable on the horizontal axis (the x-axis) and the response variable on the vertical axis (the y-axis), and label your axes. Also, select the scale on each axis so that you use as much of the grid as possible.

Foot length (cm)    Height (in)
32                  74
24                  66
29                  77
30                  67
24                  56
26                  65
27                  64
29.5                70
26                  62
26.5                67
28                  66
28                  64
26                  69
35                  73
30                  74
31                  70
29                  65
34                  72
33                  71
22                  63

C What does each dot in your scatterplot represent? How would you describe the overall pattern of the scatterplot that you just created?

Page 1 of 8
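As a rough check on a hand-drawn plot, the table in Question 2 can also be drawn as a text-mode scatterplot. The sketch below is only one way to do it: it assumes foot length is placed on the horizontal axis and height on the vertical axis, and the function name `text_scatter` and the grid size (15 rows by 40 columns) are made up for this illustration.

```python
# Question 2 data, copied from the table above.
foot_cm = [32, 24, 29, 30, 24, 26, 27, 29.5, 26, 26.5,
           28, 28, 26, 35, 30, 31, 29, 34, 33, 22]
height_in = [74, 66, 77, 67, 56, 65, 64, 70, 62, 67,
             66, 64, 69, 73, 74, 70, 65, 72, 71, 63]


def text_scatter(xs, ys, rows=15, cols=40):
    """Return a list of strings that draws (x, y) pairs as '*' on a grid.

    Hypothetical helper for illustration: the smallest x maps to the left
    column, the largest x to the right column, and larger y values are
    drawn higher up, so the full grid is used.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (cols - 1))
        row = round((y - y_min) / (y_max - y_min) * (rows - 1))
        grid[rows - 1 - row][col] = "*"  # row 0 of the grid is the top line
    return ["".join(line) for line in grid]


print("Height (in), from", min(height_in), "to", max(height_in))
for line in text_scatter(foot_cm, height_in):
    print("|" + line)
print("+" + "-" * 40)
print("Foot length (cm), from", min(foot_cm), "to", max(foot_cm))
```

Each `*` stands for one student (two students can share a cell), which is the same reading as a dot on the paper grid.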
Given a scatterplot, we would like to describe the overall pattern of the relationship between two variables using direction, form, and strength, and also discuss any striking deviations from the overall pattern.
(i) Look for direction: Is the pattern positive (increasing from left to right), negative (decreasing from left to right), or neither?
(ii) Look for overall form: Do the data in the scatterplot have a linear shape, a curvilinear shape, or no shape?
(iii) Look for strength: How scattered are the data? Is there a strong, weak, or no relationship?
(iv) Look for any striking features: Are there any outliers or influential points that deviate from the overall pattern?

3 The scatterplots below show various body measurements for 34 adults who exercise several times each week. Describe the overall pattern of each with regard to direction, form, and strength.
[Scatterplot 1] [Scatterplot 2] [Scatterplot 3]
Note: For now, we can only describe the strength as strong, weak, or no correlation. One of our goals in this handout is to find a numerical value, called the correlation coefficient, that describes the strength and direction of the relationship.
Other possible scatterplots have a non-linear (or curvilinear) pattern.
[Scatterplot 4] [Scatterplot 5] [Scatterplot 6]

PART II: LINEAR RELATIONSHIPS AND LINEAR CORRELATION COEFFICIENTS

In our class, we will only study linear relationships between two quantitative variables. For each scatterplot below, a corresponding linear correlation coefficient, r, is given.
[Example #1] [Example #2]
4 Review and discuss the questions below with a partner after reviewing the scatterplots in Examples #1 and #2.
A Is there a largest possible value for r, or can it get larger and larger without limit? What makes you think so?
B Is there a smallest possible value for r, or can it get smaller and smaller without limit? What makes you think so?
C Is there more than one scatterplot with a correlation coefficient of r = 1? If yes, can you come up with a different one?

The linear correlation coefficient, r, is a numerical measure of the strength of the linear relationship between two quantitative variables. The value of the linear correlation coefficient helps us answer the question: Is there a linear correlation between the two variables?

The linear correlation coefficient is always a value between -1 and +1. A value of +1 signifies a perfect positive linear correlation, and a value of -1 signifies a perfect negative linear correlation. Perfect linear correlation occurs when all the points fall exactly along a straight line (note that a perfect linear correlation is unusual in real-life data). When the value is close to zero, we say that there is little or no linear relationship.

The sign of the correlation coefficient gives the direction of the relationship. If the sign is positive, the overall pattern is trending upward (i.e., as the x-values increase, the y-values generally increase). Similarly, if the sign is negative, the overall pattern is trending downward (i.e., as the x-values increase, the y-values generally decrease).

The formula used to calculate the linear correlation coefficient is:

$r = \frac{\sum \left( \frac{x - \bar{x}}{S_x} \right) \left( \frac{y - \bar{y}}{S_y} \right)}{n - 1}$

where $\bar{x}$ and $\bar{y}$ are the sample means, $S_x$ and $S_y$ are the sample standard deviations of the x-values and y-values, and n is the number of data points.

Note: The calculation of the linear correlation coefficient is not the focus of this course. We will use technology to compute it.

Be careful when studying the linear relationship in a dataset:
(i) A strong correlation does not necessarily imply causation! For example, even if there is a strong relationship between the height and weight of adults, this does not mean that an increase in an adult’s weight will cause that adult’s height to increase.
(ii) Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.
(iii) Correlation applies only to quantitative variables. Don’t apply correlation to categorical data.

5 Without doing any calculations, estimate the linear correlation coefficient for Question #2 on page #1. Now use technology to find the linear correlation coefficient. How close was your guess?
Your best guess: r = __0.75_________ Using technology: r = ____0.71_________

HOMEWORK #1
It’s very important to be able to recognize how closely two variables are correlated. Please practice guessing the linear correlation coefficient for a given dataset.
Go to: http://www.rossmanchance.com/applets/guesscorrelation/GuessCorrelation.html
Instructions:
a) Click on the “New Sample” button, which will generate a scatterplot. Enter your guess for the correlation in the box called Correlation Guess and hit “Enter”. The applet will then reveal the actual value of the correlation coefficient.
b) It isn’t easy to guess the value of the correlation coefficient exactly, so if a guess is within 0.1 of the actual value, it is a pretty good guess. (Example: if you guess 0.7 and the actual value is anything between 0.6 and 0.8, you have a pretty good guess.)
c)
Click New Sample and estimate the correlation as many times as it takes for you to be comfortable with your ability to estimate the value of the correlation coefficient within 0.1.

PART III: BEST-FIT LINE (LEAST SQUARES REGRESSION LINE)

The linear correlation coefficient measures the strength of the linear relationship between two variables, but it does not tell us about the mathematical relationship between these two variables. For example, the fast food dataset in Question #7 shows that there’s a strong linear correlation between the amount of fat in a sandwich and the total calories. However, the correlation coefficient does not help us predict the total calories for a new sandwich if we knew that it has 15 grams of fat. We would need an equation to help us make this prediction.

Least squares regression analysis is used to find the equation of the best-fit line (or the regression line). Deriving the best-fit line by least squares requires calculus. Fortunately, we can use technology to help us determine the best-fit line.

6 The scatterplot below shows the dataset from Question #2 on page 1. Since the linear correlation coefficient is pretty high, we will model the data with an equation of a line.
A Use a straight edge to draw a straight line that you think "best fits" the data. (Note: your best-fit line need not pass through any of the data points, and it might be different from your neighbor’s.)
B Go to the Rossman/Chance Applet Collection and click on Least Squares Regression. Notice the dataset from Question #2 is already entered in the Sample data box. Check the Show Movable Line box to add a blue line to the scatterplot. If you place your mouse over the green squares at the ends of the line and drag, you can change the slope of the line and move it. Move the line until you believe your line “best” summarizes the relationship between height and foot length. How does this line compare to the one you drew? Write down the resulting equation for your line.
C Why do you believe that your line is “best”? Check with your neighbors: did they get the same line/equation? The question now is how we decide which line “best” summarizes the relationship. What do you think?
(i) Check the Show Residuals box to visually see the residuals for your line on the scatterplot. The applet also reports the sum of the absolute values of the residuals (SAE). SAE stands for “Sum of the Absolute Errors.” The acronym indicates that we need to make residuals positive before we add them up and that sometimes people call residuals “errors.” What is your SAE value?
(ii) It turns out that a more common criterion for determining the “best” line is to instead look at the sum of the squared residuals (SSE). This approach is similar to simply adding up the absolute residuals, but it is even more strict in not letting individual residuals get too large. Check the Show Squared Residuals box to visually represent the squared residual for each observation. Note that we can visually represent the squared residual as the area of a square where each side of the square has length equal to the residual.
(iii) Now continue to adjust your line until you think you have minimized the sum of the squared residuals.
(iv) Now check the Show Regression Line box to determine and display the equation for the line that actually does minimize (as can be shown using calculus) the sum of the squared residuals. Write out the equation for the least squares regression line.
D Use the graph of your “best-fit” line (also called the linear regression line) to predict the person’s height if the person’s foot length is 25 cm.

Mathematicians often use the slope-intercept formula, y = mx + b, to represent the equation of a line that best fits a scatterplot. Which letter in this formula represents the slope of the line, and which letter represents the y-intercept?

However, statisticians often write the slope-intercept formula for the equation of a line as y = a + bx. Is this formula essentially the same as the y = mx + b formula? Given the equation in the form y = a + bx, which letter is the slope and which is the y-intercept?

7 Is there a relationship between the total calories and the fat content of a sandwich? Given the dataset below, find the correlation coefficient and the best-fit line.
Sandwich                        Total Fat (g)    Total Calories
Hamburger                       9                260
Cheeseburger                    13               320
Quarter Pounder                 21               420
Quarter Pounder with Cheese     30               530
Big Mac                         31               560
Arch Special                    31               550
Arch Special with Bacon         34               590
Crispy Chicken                  25               500
Fish Fillet                     28               560
Grilled Chicken                 20               440
Grilled Chicken Light           5                300

Correlation coefficient: ___________ Equation of the best-fit line: ______________________________
A Determine the slope and y-intercept of the line. Write a sentence to interpret the slope and y-intercept.
B If you invent a new veggie sandwich that has 10 grams of fat, use the best-fit line to predict the total calories of your new sandwich.
C The actual total calories for your new sandwich is 250. The difference between the actual total calories and the predicted total calories is called the residual.
Residual for the new veggie sandwich = actual - predicted = ___________________

Note: The residual tells us how far off the model’s prediction is from the actual value. A negative residual means the predicted value is too big (an overestimate), and a positive residual means the model makes an underestimate. The line of best fit is the line for which the sum of the squared residuals is the smallest.

There is a relationship between the linear correlation coefficient and the slope of the best-fit line. Here’s another approach to determine the best-fit line, using the linear correlation coefficient and the sample standard deviations of the x-values and y-values. The formulas below are used by statisticians to find the slope and y-intercept of the one-and-only true best-fit line.

For the best-fit line y = a + bx, the slope is $b = r \cdot \frac{S_y}{S_x}$ and the y-intercept is $a = \bar{y} - b\bar{x}$.
(Recall $S_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ and $S_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$.)

8 The table below shows a sample of Honda Accords listed on AutoTrader.com in March 2010. We want to determine if there is a relationship between the age of the car and its list price.
Car                               #1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13
Age (in years)                    3     7     5     4     6     3     2     6     4     3     5     8     5
Listing price (in thousands $)    24.9  8.4   16.3  23.8  15.7  24.5  24.8  16.4  21.2  22.4  20.1  15.3  18.8

A Use technology to find the following: (Note: if r is negative, then the slope is negative.)
r, correlation coefficient: -0.879__________
$S_y$, SD of the y-values: 4.899_________
$S_x$, SD of the x-values: 1.750__________
B Use the steps below to find the best-fit line: ___________________
Step 1: Find the mean values: $\bar{x}$ = _4.692_______ $\bar{y}$ = __19.431__
Step 2: Find the slope: $b = r \cdot \frac{S_y}{S_x}$ = _____________
Step 3: Now let's find the y-intercept: $a = \bar{y} - b\bar{x}$ = __________
C Use technology to find the best-fit line. How does it compare to the answer in Part B?
D Interpret the slope and y-intercept in the context of this problem.

HOMEWORK #2
1) The dataset below shows the Math SAT scores of 20 high school seniors and the time they spent studying for the test.

Hrs Study    Math SAT
13           700
10           650
6            520
9            580
5            450
11           690
8            590
16           770
4            410
11           640
7            530
13           730
4            390
12           600
10           690
14           730
5            400
18           750
10           680

A Use technology to find the following:
Correlation coefficient: _____ Equation of the best-fit line: __________________
B Using the equation of the line in part A, determine the slope and y-intercept of the line. Write a sentence to interpret the slope and y-intercept.
C Use the best-fit line to estimate the Math SAT score if the student spent 15 hours studying.
D Use the best-fit line to estimate the Math SAT score if the student spent 30 hours studying. Does this answer make sense in the context of this problem? Explain briefly.
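The steps in Question 8 (find the means and sample standard deviations, compute r, then the slope and y-intercept) can be sketched in Python as a way to check work done by hand or with a calculator. This is a minimal sketch, not the technology the course prescribes: it assumes Python's standard `statistics` module is an acceptable stand-in, and the variable names are chosen here for illustration.

```python
from statistics import mean, stdev

# Question 8 data: age (years) and listing price (thousands of dollars).
age = [3, 7, 5, 4, 6, 3, 2, 6, 4, 3, 5, 8, 5]
price = [24.9, 8.4, 16.3, 23.8, 15.7, 24.5, 24.8,
         16.4, 21.2, 22.4, 20.1, 15.3, 18.8]

n = len(age)
x_bar, y_bar = mean(age), mean(price)
s_x, s_y = stdev(age), stdev(price)  # sample SDs (divide by n - 1)

# Correlation coefficient: sum of the products of the standardized
# x- and y-values, divided by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(age, price)) / (n - 1)

# Best-fit line y = a + b*x from the handout's formulas.
b = r * s_y / s_x        # slope
a = y_bar - b * x_bar    # y-intercept

print(f"x_bar = {x_bar:.3f}, y_bar = {y_bar:.3f}")
print(f"S_x = {s_x:.3f}, S_y = {s_y:.3f}, r = {r:.3f}")
print(f"best-fit line: y = {a:.3f} + ({b:.3f})x")
```

The printed means, standard deviations, and r can be compared with the values written into Parts A and B. Because r is negative here, the slope b comes out negative as well, since $S_x$ and $S_y$ are always positive.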
2) Vitruvius connected the proportions of the male figure to the proportions used in classical architecture and wrote that a man’s arm span is equal to his height. Below is a scatterplot and best-fit line (also called a least squares regression line or just a regression line) for the arm span and height measurements (in cm) for 11 men.

[Scatterplot with least squares regression line; axes: Arm Span (cm) and Height (cm)]

Obviously the graph of the least squares regression line does not predict the height of every man accurately.
a) Use the data points to find the observed heights (heights of actual men) for the indicated arm spans.
Arm span = 173 cm Observed Height = ______
Arm span = 196 cm Observed Height = ______
b) Use the graph of the least squares regression line (a.k.a. best-fit line) to predict the height of a man with each of the same arm spans.
Arm span = 173 cm Predicted Height = ______
Arm span = 196 cm Predicted Height = ______
c) Consider the man with an arm span of 173 cm. Is the predicted height from the graph of the least squares regression line (the best-fit line) an overestimate or an underestimate?
d) What is the error (also called the residual) in the predicted height for a man with an arm span of 173 cm (the error or residual is the observed height minus the predicted height)?
e) Consider the man with an arm span of 196 cm. Is the predicted height from the graph of the least squares regression line (the best-fit line) an overestimate or an underestimate?
f) What is the error (the residual) in the predicted height for a man with an arm span of 196 cm (the error or residual is the observed height minus the predicted height)?
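Residuals like the ones in parts d) and f) are always observed minus predicted. Since the arm-span values have to be read off the graph, the sketch below illustrates the same idea with the sandwich data from Question 7 instead. It is an illustration under stated assumptions: the helper names `fit_line`, `residual`, and `sse` are made up here, and the slope is computed as the sum of products of deviations divided by the sum of squared x-deviations, which is algebraically the same as $r \cdot S_y / S_x$.

```python
# Question 7 data: total fat (g) and total calories for 11 sandwiches.
fat_g = [9, 13, 21, 30, 31, 31, 34, 25, 28, 20, 5]
calories = [260, 320, 420, 530, 560, 550, 590, 500, 560, 440, 300]


def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope, equal to r * S_y / S_x
    a = y_bar - b * x_bar    # y-intercept
    return a, b


def residual(observed, predicted):
    """Residual (error) = observed value minus predicted value."""
    return observed - predicted


def sse(a, b, xs, ys):
    """Sum of the squared residuals for the line y = a + b*x."""
    return sum(residual(y, a + b * x) ** 2 for x, y in zip(xs, ys))


a, b = fit_line(fat_g, calories)

# New veggie sandwich from Question 7: 10 g of fat, 250 actual calories.
predicted = a + b * 10
veggie_residual = residual(250, predicted)
print(f"predicted = {predicted:.1f}, residual = {veggie_residual:.1f}")

# Nudging the fitted line in any direction should only increase the SSE.
best = sse(a, b, fat_g, calories)
print(best < sse(a + 5, b, fat_g, calories))
print(best < sse(a, b * 1.1, fat_g, calories))
```

A negative residual here means the line predicts more calories than the sandwich actually has, which is an overestimate. The last two lines illustrate why the least squares line is called the best-fit line: shifting or tilting it makes the sum of the squared residuals larger.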