Entity Academy Lesson 8 – Linear Regression

Sindy Saintclair – Thursday, December 24, 2021

Introduction

Linear regression is a method for investigating the relationship between two variables. In linear regression, the relationship between the variables is represented as a line, and you compute the parameters of the line (how steep it is and where it starts) as well as determine how accurately the line represents the relationship. This lesson begins with scatter plots, which are used extensively to understand the relationship between continuous variables, and then moves on to correlation.

What is regression?

- Allows you to predict y based on values of x
- Both the IV and the DV can be continuous
- The basic statistic behind modeling
- "Simple" = only one IV
- "Linear" = the data form a straight line

Code for Regression

modelName <- lm(DV ~ IV, data)
summary(modelName)

Interpreting Regression

- The omnibus p value is at the bottom of the output; it tells you whether the overall model is significant (only if < 0.05).
- Next is the adjusted R squared, which is the proportion of variability in the DV accounted for by the IV; convert it to a percentage.
- Each specific variable has its own p value; that variable is significant if < 0.05.
- A one-unit increase in a variable changes the DV by the Estimate amount.
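As an extra illustration of this template and the interpretation steps above (this example uses the built-in mtcars data and is my own sketch, not part of the lesson):

# Does car weight (IV) predict fuel economy (DV)?
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
# Reading the output, in order:
# 1. omnibus p-value at the bottom: well below 0.05, so the model is significant
# 2. Adjusted R-squared: the share of variability in mpg accounted for by wt
# 3. the row for wt: its Estimate is the change in mpg per one-unit increase in wt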
Making a scatter plot with best fit line

ggplot(data, aes(x=column, y=column)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE)

Create scatterplots

In a scatter plot, data are displayed as a collection of points. Each data point is determined by the values of two variables, one on the horizontal (left to right) axis and one on the vertical (up and down) axis. Scatter plots make the relationship between variables easy to see.

In R, creating a scatter plot is relatively simple. In this lesson, you will use ggplot2 to create scatter plots as well. If you have closed R since last using ggplot2, remember that you will need to load it at the beginning of every RStudio session:

library(ggplot2)

You could also click the check box next to ggplot2 in the Packages tab.

You will start by creating a scatter plot using the faithful data set; you will plot eruption times versus waiting times, with eruptions on the horizontal, or x, axis and waiting on the vertical, or y, axis:

d <- ggplot(faithful, aes(x = eruptions, y = waiting))
d + geom_point()

These commands produce the scatter plot. You can add a title and improve the axis labels using the ggtitle(), xlab(), and ylab() functions:
d + geom_point() +
  ggtitle("Old Faithful Eruption vs Waiting Times") +
  xlab("Eruption Time (min)") +
  ylab("Waiting Time (min)")

You see from this plot that there are two clusters of data:

- a short eruption time followed by a short wait until the next eruption
- a long eruption time followed by a long wait

After having created a scatter plot, you may want to see how well the data fit a straight line. You can do this easily in ggplot2 by adding + geom_smooth() and specifying, as an argument to geom_smooth(), that you want the method, or shape of line, to be lm. lm stands for linear model. A linear model will create a straight line on the graph. Here's how all that code fits together:

d + geom_point() + geom_smooth(method = lm)

This gives a scatter plot with a best fit line. The phrase "best fit line" means that the line wasn't just plunked on the graph any old place; it was strategically fit to all the data points to be as close to as many of the points as possible.
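Putting the pieces of this section together, one block builds the labeled plot and the fitted line in a single call chain (the same functions as above, just combined):

library(ggplot2)

d <- ggplot(faithful, aes(x = eruptions, y = waiting))
d + geom_point() +
  geom_smooth(method = lm) +   # best fit line; straight, since method = lm
  ggtitle("Old Faithful Eruption vs Waiting Times") +
  xlab("Eruption Time (min)") +
  ylab("Waiting Time (min)")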
In addition to the line that fits the data, the geom_smooth() function adds a grey area around the line. It may be a little difficult to see, because it does not extend much past the blue line; it is easiest to see at the beginning and the end of the line, where there is a little contrast with the black dots. This grey shading is called the confidence region. Roughly speaking, if the boundaries of the region are close to the line, you can be confident in the accuracy of your estimates of the parameters that define the line; if the region extends away from the line, you should be less confident in the accuracy of the line. You can think of the grey shading as something like a margin of error.

If you don't want your graph to include the grey shading, it can easily be turned off with the argument se = FALSE:

d + geom_point() + geom_smooth(method = lm, se = FALSE)

It may be easier to see the grey shading on the previous graph now that you have something with which to contrast it!
If you feel that black points and a blue regression line are too somber, you can always change the color of the line (the same color= argument works inside geom_point() for the points):

d + geom_point() + geom_smooth(method = lm, se = FALSE, color = "goldenrod2")

This gives you a more colorful plot.

Assess correlations visually and numerically

Correlation Basics

Two variables are correlated if there is a linear relationship between them; in a scatter plot, correlation is indicated if the points in the plot tend to lie close to a straight line. Both words in the phrase "linear relationship" are important.
Linear is important because, if there is no semblance of a straight line in the graph, the relationship cannot be measured with the "standard" correlation, which is called Pearson's Correlation and denoted by the symbol r. Take a look at the graphs below: none of them would count as being correlated, and each has an r value of zero, because they are not linear.

Relationship is equally important; there must be some connection between the two variables. It can't be random; it has to be a pattern that is visible in the data.

Direction of Correlations

There are technically three directions for a correlation:

- Positive
- Negative
- No correlation (uncorrelated)

Positive Correlations

A positive correlation is indicated if the line goes up from left to right. This means that the variables change together in the same direction: as one goes up, the other goes up, or as one goes down, the other goes down. A positive correlation is indicated statistically by a positive r value. The graphic below has positive correlations depicted on the left-hand side.

Negative Correlations

If the line slopes down from left to right, it indicates a negative correlation. This means that the variables change together in different directions: as one goes up, the other goes down. A negative correlation is indicated statistically by a negative r value. The graphic below shows a negative correlation on the right.
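A quick way to see both directions for yourself is to simulate data. This snippet is my own illustration, not part of the lesson; rnorm() draws random values from a normal distribution:

set.seed(42)                          # make the random draws reproducible
x <- rnorm(100)
y_pos <- x + rnorm(100, sd = 0.5)     # y tends to rise as x rises
y_neg <- -x + rnorm(100, sd = 0.5)    # y tends to fall as x rises
cor(x, y_pos)   # positive r, somewhere around +0.9
cor(x, y_neg)   # negative r, somewhere around -0.9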
Strength of Correlations

Not only can you judge a correlation by its direction, but also by its strength. A strong correlation forms a very tight grouping of dots along a line, like the graph below on the far left. A moderate correlation will have the rough shape of a line, but the dots will be a bit more spread out, like the middle graph. A weak correlation shows only a very general trend, like the one on the right. You can have a strong, moderate, or weak correlation that is either positive or negative; check out the full spectrum below.

The numbers indicate the r values for a correlation. Correlations can range from -1 to +1. The closer r is to 1, whether positive or negative, the stronger the correlation is; the closer to 0, again whether positive or negative, the weaker the correlation is.
Here are some correlation interpretation guidelines:

Correlation Strength    Correlation Coefficient (r value)
Strong                  0.7 - 1.0
Moderate                0.3 - 0.69
Weak                    0.1 - 0.29
None                    0.0 - 0.09
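If you'd like R to apply these guidelines for you, a small helper function can translate an r value into a label. This is my own sketch of the table above, not a standard R function:

correlation_strength <- function(r) {
  a <- abs(r)                 # direction doesn't matter for strength
  if (a >= 0.7) "Strong"
  else if (a >= 0.3) "Moderate"
  else if (a >= 0.1) "Weak"
  else "None"
}

correlation_strength(0.83)    # "Strong"
correlation_strength(-0.25)   # "Weak"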
Types of Correlations

There are different "flavors" of correlation for different situations. They are all interpreted the same way, but are calculated differently behind the scenes in R, to make sure that they are as accurate as possible. The three main types of correlations you will learn about are:

- Pearson's r: for 2 normally distributed, continuous variables
- Spearman's Rho: for 2 non-normally distributed continuous variables
- Kendall's Tau: for 2 categorical variables

Correlation DOES NOT EQUAL Causation

Just because two things are related does not mean that one caused the other! Sometimes there are additional variables, not in your dataset, that indirectly drive the relationship between the two variables, and sometimes two variables really are just randomly related without much rationale behind it. Take this scenario as a cautionary tale: as ice cream sales increase, so does the number of armed robberies leading to murder.

Does this mean that after eating ice cream, people are inspired to go out and commit murder? Most likely not. Does this mean that murderers celebrate their deeds with some congratulatory ice cream? Again, unlikely. Ice cream sales did not cause murders, and murders did not cause ice cream sales to increase. Instead, there is a pesky third variable at work: heat! As temperatures rise in the summer, people look to cool off with some ice cream; makes logical sense. And as temperatures rise, so do aggressive tendencies, hence the spike in murder rates.

The lesson to take to heart is that a relationship between two variables does not mean that one was responsible for the other. Reporting this incorrectly can make for some embarrassing mistakes.

Examples

You will next create a plot that indicates very little correlation. The USArrests data frame has gruesome statistics on arrests for violent crimes, in arrests per 100,000 residents, in each of the 50 US states in 1973.
One of the variables in this data frame is Murder, the murder rate for each state. Another is UrbanPop, the percentage of the state's population that lives in an urban area. You can create a scatter plot of these two variables, together with the linear regression line, using the following commands:

d <- ggplot(USArrests, aes(x = UrbanPop, y = Murder))
d + geom_point() + geom_smooth(method = lm, se = FALSE)

In the resulting scatter plot, you can see that the data do not seem to follow much of a pattern. The line is almost flat, with very little slope. You say that the Murder and UrbanPop variables are uncorrelated: they do not have a linear relationship.
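You can back up this visual impression with a number; running cor() on the same two variables gives an r value very close to zero (treat the exact figure as approximate):

cor(USArrests$UrbanPop, USArrests$Murder)   # roughly 0.07: no real correlation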
You can now make a plot that shows a negative correlation. The data frame mtcars has data from the 1974 Motor Trend magazine, covering 32 1973-74 models. One variable in this data frame is mpg, the car's mileage in miles per US gallon of fuel. Another is disp, the engine displacement in cubic inches. You can create a scatter plot of these two variables with the linear regression line using the following commands:

d <- ggplot(mtcars, aes(x = disp, y = mpg))
d + geom_point() + geom_smooth(method = lm, se = FALSE)

Because the linear regression line has a negative slope (it goes from upper left to lower right), these data values are negatively correlated: a larger value of displacement tends to be associated with a smaller value of miles per gallon.
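Again, the numeric r value agrees with the plot, and it matches the mtcars correlation matrix you will build later in this lesson:

cor(mtcars$disp, mtcars$mpg)   # about -0.85: a strong negative correlation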
Create correlation matrices

Calculating correlation

Now that you understand what a correlation is and how to interpret it, you will learn how to calculate correlations in R.

cor.test()

The simplest way to find a correlation is to use the cor.test() function. You select the two variables you want to correlate; with cor.test() you can only do two variables at a time. Here is the code:

cor.test(mtcars$hp, mtcars$cyl, method = "pearson", use = "complete.obs")

This runs cor.test() on the mtcars variables hp and cyl. You use the method= argument to specify "pearson" if you have two continuous variables that are normally distributed. If, however, you have two continuous variables that are NOT normally distributed (this is called non-parametric), you use the argument "spearman", which conducts the non-parametric correlation Spearman's Rho, pronounced "row". If you have two categorical variables that are numeric, or have been recoded to numeric, you can use the argument "kendall", which conducts Kendall's Tau, pronounced like "ow!" with a "t" on the front, as in the beginning of "tower". The use = "complete.obs" argument means that you don't have to have a complete dataset; R will use what it has, as long as it has data for the two variables you are trying to correlate.

The result from cor.test() is below:

Pearson's product-moment correlation

data:  mtcars$hp and mtcars$cyl
t = 8.2286, df = 30, p-value = 3.478e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6816016 0.9154223
sample estimates:
      cor
0.8324475

The first line tells you which analysis you ran: the Pearson's correlation. The second line tells you the data you used. Next you have information about whether this correlation is significant. The t value is not usually reported, but the p value often is; like anything else, if the p value is less than 0.05, then the correlation is significant. The last important part of this output is the number underneath cor. This is your r correlation coefficient, which you will need to interpret. This specific correlation between the horsepower of the car and the number of cylinders in the engine is strongly positive at 0.83.

Detailed Correlation Matrices

cor.test() works just fine, but what if you want to look at all the variables in your dataset at once, just to take a quick peek at what's going on? Using cor.test() over and over would take a long time, but there is a solution: correlation matrices. A correlation matrix lets you look at more than one correlation at a time, in a handy graphic.
The easiest way to create a correlation matrix with the p values included is to use the PerformanceAnalytics library and its chart.Correlation() function. So, get PerformanceAnalytics installed and running:

install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")

Then, because chart.Correlation() will look at all the data in your data frame, you need to limit the data frame to only the quantitative, continuous variables. This can easily be done with subsetting. Here you are telling R that, out of the mtcars data frame, you want to keep all the rows (if you wanted to keep only certain rows, those numbers would go before the comma) and only the first seven columns. You'll name this new truncated dataset mtcars_quant:

mtcars_quant <- mtcars[, c(1, 2, 3, 4, 5, 6, 7)]

Only the first seven columns are kept, compared to the entirety of the mtcars dataset.
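As a side note, consecutive column numbers can also be written as a range, and it's worth peeking at the result to confirm the subset worked:

mtcars_quant <- mtcars[, 1:7]   # same as c(1, 2, 3, 4, 5, 6, 7)
head(mtcars_quant)              # only mpg through qsec should remain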
Now that you have a dataset filled with only quantitative variables, it is time to make your correlation matrix! All you need to do is call the chart.Correlation() function. The first argument is your data frame name, mtcars_quant; the second is histogram=; and the third is method=, the type of correlation. method= takes the same arguments as cor.test(): "pearson", "spearman", and "kendall".

chart.Correlation(mtcars_quant, histogram = FALSE, method = "pearson")

The above code produces the correlation matrix plot.
Although there is a lot to look at here, you only need to pay attention to the right-hand side. You read the plot by the intersections of the variables on the left with the variables on the bottom, so the correlation of mpg with cyl is -0.85, and it is significant at p < 0.001, because three stars are displayed in that cell.

Visually Pleasing Correlation Matrices

The graph above conveniently prints the larger r values in a larger font size, so it is somewhat intuitive to interpret, but it definitely isn't the most pleasing to the eye. If you'd like a chart that is more of a looker, corrplot() is the way to go. However, it doesn't list significance values or specific r values, which is a downside. Another downside to corrplot() is that producing the image is a bit more complex.

Getting a Correlation Matrix Table using cor()

First, you need to turn your data into a correlation matrix, because the corrplot() function won't take your data frame as an argument. To do this, use the function cor() on the quantitative dataset mtcars_quant that you have been using:

corr_matrix <- cor(mtcars_quant)
corr_matrix

When you call your newly created matrix, this is what you will see:
            mpg        cyl       disp         hp        drat         wt        qsec
mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594  0.41868403
cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958 -0.59124207
disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799 -0.43369788
hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479 -0.70822339
drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406  0.09120476
wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000 -0.17471588
qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159  1.00000000

This correlation matrix displays the correlations between the variables along the top and the variables along the left-hand side. This is similar to the correlation matrix graphic above, except that values now fill the whole rectangle rather than just the upper right triangle. Correlation matrices are often displayed as a triangle instead of a rectangle because the information repeats in the second half: the correlation between mpg on the left and cyl on the top is the same as the correlation between cyl on the left and mpg on the top.

The defining feature to look for is the diagonal line of 1s running from the upper left corner to the lower right corner. This line appears because the correlation of any variable with itself is always 1. So mpg with mpg is 1, cyl with cyl is 1, and so on, all the way down.
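Both of those facts, the diagonal of 1s and the mirror-image halves, are easy to verify in code:

diag(corr_matrix)                        # every variable correlates 1 with itself
all.equal(corr_matrix, t(corr_matrix))   # TRUE: the matrix equals its transpose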
Installing corrplot()

Next, you will need to install and make available the library corrplot:

install.packages("corrplot")
library("corrplot")

Using corrplot()

And finally, you are ready to make a beautiful, visually pleasing correlation matrix plot. Note that the p.mat= argument expects a matrix of p-values, not the correlation matrix itself; corrplot's cor.mtest() helper computes those p-values from the original data frame:

p_values <- cor.mtest(mtcars_quant)   # p-values for every pair of variables
corrplot(corr_matrix, type = "upper", order = "hclust",
         p.mat = p_values$p, sig.level = 0.01, insig = "blank")

Remember, corrplot() is based on the correlation matrix you created above, corr_matrix, rather than your data frame. type= lets you pick whether you want to see the top or bottom of the mirror-image matrix; in this case, you have chosen "upper". If you had chosen "lower" instead, you would get the bottom half.
The rest of the arguments to corrplot() control how significance is demonstrated on the chart, by only showing the significant values. You can change the significance level using sig.level=; here it is set to 0.01 to get a more rigorously determined set of correlations, but you can use any threshold you'd like. The insig = "blank" argument then tells R to leave blank anything that isn't significant at the level you specified. This is why not every cell of the matrices above is filled in.

When interpreting either graph, it's important to note that correlations are shown with both a color and a size gradient: the smaller and lighter a cell is, the weaker the relationship between the two variables. Positive correlations are shown in blues, while negative correlations are shown in reds.

Create and analyze linear regression models

INTRODUCTION TO LINEAR REGRESSION

Regression Basics

- 2 continuous variables
- used for prediction
- may be used for causality

Ex: Do cats or dogs love their owners more?

What is Linear Regression?

Linear regression is the way you will compute the parameters of the line that best fits your data. It gives a relationship between the horizontal and vertical values in a scatter plot. In the process of computing the parameters of the line, you can also determine how well the line fits your data. Simple linear regression is when there are only 2 variables. It is possible to do linear regression with more than 2 variables; this is called multiple linear regression.
When you compute a linear regression, you are actually computing the equation of the line that best fits your data. You may remember from algebra that a line is represented by an equation of this form:

y = mx + b

where you have the following information:

- x: the variable on the horizontal axis
- y: the variable on the vertical axis
- m: the slope of the line, which indicates how steep the line is
- b: the intercept of the line, where the line crosses the y axis

If you know the slope and intercept of the line, then you can compute the y value of any point on the line from its x value. (You can also, with some algebra, compute the x value of any point on the line from its y value.) So when you say that linear regression computes the parameters of the line that best fits your data, you are saying that linear regression computes the slope and the intercept of that line.

COMPUTING LINEAR REGRESSION

You will compute a linear regression using the data in the R dataset cars. Previously, you used this dataset to create a box plot. cars includes speed measurements (in miles per hour) and stopping distances (in feet) for cars measured in the 1920s. Take a look:

head(cars)

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
Computing the linear regression in R is extremely simple. It is computed by the function lm() as follows:

lin_reg <- lm(dist ~ speed, cars)
print(lin_reg)

lin_reg is a variable name. The lm() function returns an object that stores all the information computed by the linear regression, so the assignment statement assigns this object to lin_reg. When you call the print() function on the lin_reg object, it prints the information that lm() was called with, along with the coefficients of the line that best fits the data.

You indicated which variables you wanted to fit the line with using the dist ~ speed argument to lm(). In R terminology, this argument is a formula. In this formula, dist is the variable on the vertical axis (what you called y in the equation for the line above) and speed is the variable on the horizontal axis (what you called x). You can read the tilde ~ as the word "by": you are asking R to produce a line showing stopping distance by speed. The last argument to lm() is the data frame name, cars in this case.

When lin_reg is printed, this is the output R provides:

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed
    -17.579        3.932
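As a side note, if you just want those two numbers programmatically, the standard coef() accessor pulls them out of the model object:

coef(lin_reg)
# (Intercept)       speed
#  -17.579095    3.932409   (approximately, as R prints them)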
The coefficients are the slope and intercept of the line. Intercept is the intercept, or b, for the line, and the slope is labeled by the x variable name that was used to create it: here the slope is labeled speed. So the equation of this particular linear regression line, predicting stopping distance from speed, is:

y = 3.932x - 17.579

Linear Regression Model Summary

The object stored in lin_reg has much more information in it. You can access some of this information with the following command:

summary(lin_reg)

The summary information that R provides is below:

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:

- Three asterisks mean 'less than 0.001'
- Two asterisks mean 'less than 0.01'
- One asterisk means 'less than 0.05'
- One dot (or period) means 'less than 0.1'
- No code means 'less than 1'

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12

Is the Overall Model Significant?

The first thing you will want to look for is the F-statistic and p-value, at the very bottom. These tell you whether the overall model is significant. What is meant by that? If the overall model is significant, it means that your x value (or x values, in the case of multiple regression) is a significant predictor of your y value. If the p value isn't significant at p < 0.05 at the very least, then the rest of the output really isn't worth looking at: you didn't find anything interesting to talk about or report!
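As a side note, you can also pull the omnibus test out of the summary object programmatically; summary.lm stores the F statistic and its degrees of freedom, from which the standard pf() function recovers the p value:

s <- summary(lin_reg)
s$fstatistic   # F value (89.57) with its numerator and denominator df (1, 48)
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
# about 1.49e-12, the p-value printed at the bottom of the summary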
Which Individual Predictors are Significant?

Luckily, speed is a significant predictor of stopping distance, since the p value is much smaller than 0.05, so you can go on to look at the rest of the output. You'll next want to glance at the Coefficients section. There you can find the information provided by the print() function above, under the column header Estimate: the first row is the intercept, and the second row, labeled speed, is the slope. Those are the values you would plug into your equation, just as before.

The other important parts of the Coefficients table are the t value and Pr(>|t|) columns, for everything but the Intercept. Because you are doing simple linear regression, with only one x variable, that is the only row besides Intercept listed; if you were doing multiple regression, there would be multiple x variables and thus multiple rows after Intercept. R conducts a t-test on each individual predictor of y, to see if it contributes anything to the prediction model. The Pr(>|t|) column is the p value for this t-test, and if it is significant, then you know that the x variable has an impact on the y variable. So speed is a significant predictor of stopping distance according to the t-test as well. Since there was only one x variable, you expect the F-test results to match the t-test results, in that they should both be significant. That is the case here: both F and t are significant.

How Much Variance is Explained by this Model?

Next, move down to the Multiple R-squared and Adjusted R-squared values. These both mean the same thing, but the second is adjusted for the number of variables in the model, in order to reduce the amount of Type I error that may abound; as a general rule, looking at Adjusted R-squared is the more prudent thing to do. The R-squared value is also called the coefficient of determination. It is a measure of the percentage of the variability of the data set that the line explains. In this case, because the Adjusted R-squared value is 0.6438, the line explains approximately 64% of the variability in the data. Put another way, speed is able to explain about 64% of what goes into stopping distance; the rest is covered by other variables that have not been included in the model. The larger the R-squared value, the more closely related the variables in the model are.

Using Linear Regression to Predict Values

You can use this model to predict the necessary stopping distance for a given speed. For example, suppose you want to know the stopping distance for a 1920s vehicle traveling at 21 mph. You can put 21 into the x value of the regression equation you computed above and solve for y to get the predicted stopping distance:

y = mx + b
y = 3.932 * 21 - 17.579
y = 64.99

So your model predicts that a car going 21 mph will require about 64.99 feet to stop. You can describe what you have done graphically as well: to find the predicted stopping distance at 21 mph, you go up from the x axis at a value of 21 until you reach the regression line, then go horizontally left to the y axis and read the value, which would be 64.99.

Does your linear model guarantee that the actual stopping distance for a car going 21 miles per hour will be 64.99 feet? No, because there is variability in the relationship between speed and stopping distance. All you can say is that, based on the data, you would expect a car going 21 miles per hour to require something like 65 feet to stop; it may be longer or shorter than that.

Suppose you want to find the stopping distance of a car going 45 miles per hour. You can put 45 into the x value of the regression equation and compute y:

y = mx + b
y = 3.932 * 45 - 17.579
y = 159.36

Your regression model shows that it will take 159.36 feet to stop.
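Rather than plugging numbers in by hand, you can ask R for predictions directly with the standard predict() function, passing the fitted model and a data frame of new speed values; it reproduces the hand calculations above, up to rounding:

predict(lin_reg, newdata = data.frame(speed = c(21, 45)))
# approximately 65.0 and 159.4 feet for 21 mph and 45 mph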
However, you should be very hesitant to accept this number, for the following reason: you created the model using speeds between 4 and 25 mph. There is no data to verify that your model works well for speeds above 25 mph. In this case, you are using the model to extrapolate beyond the data, and extrapolation can be fraught with peril. In the case of the distance necessary to stop a car, this linear model may not be accurate at higher speeds.

Scatter Plot with a Best Fit Line

Lastly, you can create a scatter plot of the speed versus distance data with a line that best fits the data. This is done with the method = lm argument; you are asking R to fit the linear model into the graph:

d <- ggplot(cars, aes(x = speed, y = dist))
d + geom_point() + geom_smooth(method = lm, se = FALSE)

The blue line in the result is the line described by the linear regression equation you found above.

Summary

When you have 2 continuous variables, analyses like t-tests won't work. Instead, you'll use correlation to determine the relationship between the variables, or simple linear regression to determine causality and/or predict y values. An ideal visualization for two continuous variables is a scatter plot with a best fit line; it allows you to visually assess a correlation. Correlations can be assessed by both their strength (ranging from 0 to 1) and their direction (positive or negative).
A positive correlation has both variables varying in the same direction, while a negative correlation has them varying in opposite directions. Linear regression can be used to predict values, and thus is often called predictive modeling. A line is created using the equation y = mx + b, which involves m, the slope, and b, the y intercept. Luckily, R calculates these values for you.