Stat281 Noteguides

School

South Dakota State University *

*We aren’t endorsed by this school

Course

281

Subject

Statistics

Date

Apr 3, 2024

Stat 281 Chapter 1 Noteguide

Here is an example of a large data set collected on wheat yields. There are 3384 pieces of data. I have copied only portions of the data here. If you want to see the whole list, you can find it on D2L under Content > Sample Datasets > “2014 Spring Wheat Data”.

Wheat Data Collected from South Dakota

YEAR  LOCATION  SELNO     VARIETY       REPNO  REP  YIELD  TESTWT  PROTEIN
2012  ABERDEEN  00SD4023  ADVANCE       1001   1    49.6   59.1    15.3
2012  ABERDEEN  00SD4023  ADVANCE       2027   2    49.4   59.9    15.4
2012  ABERDEEN  00SD4023  ADVANCE       3025   3    50.5   60      15.4
2012  ABERDEEN  00SD4023  ADVANCE       4032   4    59.4   60.4    14.6
2012  ABERDEEN  000ND809  BARLOW        1002   1    57.4   59.2    16
2012  ABERDEEN  000ND809  BARLOW        2001   2    51.8   60      16.3
2012  ABERDEEN  000ND809  BARLOW        3041   3    57.5   59.6    16.1
2012  ABERDEEN  000ND809  BARLOW        4021   4    59.6   60.3    15.6
2012  ABERDEEN  0BREAKER  BREAKER (WB)  1003   1    48.1   60.1    16.1
2012  ABERDEEN  0BREAKER  BREAKER (WB)  2033   2    48.2   60      15.8
2012  ABERDEEN  0BREAKER  BREAKER (WB)  3005   3    49     60.2    16.2
2012  ABERDEEN  0BREAKER  BREAKER (WB)  4034   4    53.7   60.5    15.4
2012  ABERDEEN  00SD3851  BRICK         1004   1    51.6   60.6    15.6
2012  ABERDEEN  00SD3851  BRICK         2037   2    55.6   61.3    15.6
2012  ABERDEEN  00SD3851  BRICK         3028   3    53.5   60.6    15.3
2012  ABERDEEN  00SD3851  BRICK         4035   4    55.2   61.2    15.6

Here is a summary of the Wheat data set:

     YEAR          LOCATION           VARIETY
Min.   :2012   ABERDEEN  : 564   ADVANCE:  72
1st Qu.:2012   FAULKTON  : 564   BARLOW :  72
Median :2013   MILLER    : 564   BRICK  :  72
Mean   :2013   SELBY     : 564   BRIGGS :  72
3rd Qu.:2014   SOUTHSHORE: 564   ELGIN  :  72
Max.   :2014   VOLGA     : 564   (Other):3024

    YIELD          TESTWT         PROTEIN
Min.   :21.4   Min.   :44.4   Min.   : 4.8
1st Qu.:50.4   1st Qu.:56.2   1st Qu.:14.6
Median :59.6   Median :58.2   Median :15.3
Mean   :58.6   Mean   :57.9   Mean   :15.3
3rd Qu.:67.0   3rd Qu.:59.7   3rd Qu.:16.1
Max.   :96.3   Max.   :84.5   Max.   :18.7
What observations can you make about the Wheat data?

Here is a histogram of the wheat yields from the data. What can you tell me about the yields?

Here are some boxplots of the wheat yields by location. What do you see?
Section 1.2

______________________ - is the entire collection of individuals or objects that you want to learn about.

______________________ - is part of the population that is selected for study.

What are the population and sample in our Wheat example?

In an ____________________________________________ , the person carrying out the study does not control who (or what) is in which population. It is important to obtain samples that are ________________________ of the corresponding populations.

In an ____________________________________________ , the person carrying out the study does control who (or what) is in which experimental group.

Is the Wheat example an observational study or an experiment, and why?

Section 1.3

______________________________________________ is a number that describes the entire population.

______________________________________________ is a number that describes a sample.

In our example, is the data that we collected a Population Characteristic (Parameter) or a Statistic? Explain.
________________________________________________ is a sample collected from a population in such a way that every different possible sample of size n has an equal chance of being selected. How can this be done?

Other types of sample:

What is “voluntary response sampling” and why is it a problem?

Types of bias:

What possible bias might exist in our example above?

Questions to consider when doing an observational study:
1) What is the population of interest?
2) Was the sample selected in a reasonable way?
3) Is the sample likely to be representative of the population of interest?
4) Are there any obvious sources of bias?
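One way to draw a sample in which every possible subset of size n is equally likely is to number the individuals and let a random number generator choose. The sketch below is only an illustration: the population of plot IDs 1 to 3384 and the helper name `simple_random_sample` are assumptions, not part of the course materials.

```python
import random

# Hypothetical sampling frame: IDs 1..3384 stand in for the population.
population = list(range(1, 3385))

def simple_random_sample(items, n, seed=None):
    """Return n distinct items; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(items, n)

sample = simple_random_sample(population, 10, seed=281)
print(sample)
```

Drawing without replacement (as `random.sample` does) keeps any individual from being selected twice.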
Section 1.4

In a simple comparative experiment, the value of a ___________________________________ is measured under the different experimental conditions that are sometimes called treatments.

What do you think is the response variable that we are trying to gather information about above? What are some of the treatments?

__________________________________ is a critical component of a good experiment.

If you are determining whether some treatment has an effect, you should have a group that does not receive the treatment, called a ____________________________________.

If the subjects do not know which treatment they are receiving, the experiment is _______________________________. To be double-blind, what also must be true?

Section 1.5 is about drawing conclusions

In 2009, results from a study of the relationship between spanking and a child’s IQ were released. Headlines included “Spanking lowers a child’s IQ” (LA Times) and “Do you spank? Studies indicate it could lower your kid’s IQ” (Houston Chronicle). Are those headlines reasonable?

How does data collection determine what conclusions can be drawn?

What is the difference between random selection and random assignment?

Understand the limitations of using volunteers as subjects in an experiment.
Stat 281 Ch. 2.1-2.3

Stat 281 Survey Data. I have copied only a portion of the data here. The full set will be on D2L.

Stat 281 Survey Data Spring 2015

What can we observe from the survey data? (Some rows have blank cells, so they list fewer values.)

Device  Time  Country  Region  Semester  Gender  Reason  Year School  Weight  Height  State  Age  College
DESKTOP/LAP  82  US  SD  1  M  1  1  115  65  6  19  2
DESKTOP/LAP  89  US  SD  5  M  1  3  120  68  41  21  2
DESKTOP/LAP  72  US  SD  1  M  1  1  130  71  41  18  6
SMARTPHONE  104  US  SD  1  M  1  1  130  68  23  19  6
DESKTOP/LAP  75  US  SD  2  M  1  2  135  68  41  19  1
DESKTOP/LAP  97  US  SD  1  M  1  2  135  70  47  18  1
DESKTOP/LAP  65  US  SD  8  M  1  4  138  68  41  22  3
DESKTOP/LAP  70  US  SD  3  M  1  2  140  72  23  19  6
DESKTOP/LAP  96  US  SD  3  M  1  2  148  69  41  21  7
DESKTOP/LAP  131  US  SD  3  M  1  2  150  67  15  21  1
SMARTPHONE  85  US  IA  M  1  4  160  79  23  21  4
DESKTOP/LAP  219  US  SD  1  M  3  1  160  76  23  19  1
DESKTOP/LAP  56  4  M  1  3  160  74  41  21  6
DESKTOP/LAP  55  US  SD  3  M  1  2  160  74  23  20  1
DESKTOP/LAP  92  US  SD  5  M  1  3  160  76  41  21  1
DESKTOP/LAP  52  US  SD  5  M  3  3  163  69  5  21  1
DESKTOP/LAP  111  US  SD  1  M  1  1  165  72  19  5
DESKTOP/LAP  58  US  SD  1  M  1  2  165  71  23  19  1
DESKTOP/LAP  88  US  SD  2  M  2  2  167  72  2  22  2
DESKTOP/LAP  72  US  SD  3  M  1  2  170  68  41  20  1
DESKTOP/LAP  52  US  SD  5  M  1  3  170  71  41  20  1
DESKTOP/LAP  47  US  SD  15  M  2  4  171  28  2
DESKTOP/LAP  87  US  SD  3  M  1  2  175  75  41  20  1
DESKTOP/LAP  98  US  SD  6  M  1  3  175  72  41  21  6
DESKTOP/LAP  77  US  SD  1  M  1  1  175  72  41  19  1
DESKTOP/LAP  64  US  SD  3  M  1  2  175  68  23  20  1
DESKTOP/LAP  82  US  SD  3  M  1  2  175  73  41  19  6
DESKTOP/LAP  72  US  SD  3  M  2  2  175  72  41  20  1
DESKTOP/LAP  64  US  SD  8  M  1  3  175  70  23  21  6
DESKTOP/LAP  129  US  SD  4  M  1  3  175  72  23  24  7
DESKTOP/LAP  55  US  SD  7  M  1  5  180  80  43  25  6
DESKTOP/LAP  58  US  SD  1  M  3  2  180  72  41  19  1
DESKTOP/LAP  62  US  SD  5  M  1  3  183  73  41  21  2
DESKTOP/LAP  62  US  SD  1  M  1  2  185  72  15  19  1
DESKTOP/LAP  104  US  SD  3  M  1  2  185  72  15  20  1
DESKTOP/LAP  194  US  SD  2  M  1  2  185  72  25  24  3
DESKTOP/LAP  37  US  SD  M  1  5  185  6  41  36  4
DESKTOP/LAP  62  US  SD  4  M  1  2  187  72  41  20  5
DESKTOP/LAP  87  US  SD  3  M  1  2  190  74  23  19  6
SMARTPHONE  106  US  SD  4  M  1  2  195  75  23  20  1
SMARTPHONE  54  US  SD  3  M  1  2  200  77  23  20  1
DESKTOP/LAP  66  US  SD  4  M  1  2  200  74  41  20  1
Let’s look at a few definitions, then let’s look at the data set more thoroughly.

____________________________________ is any characteristic whose value may change from one individual to another.

____________________________________ result from making observations either on a single variable or simultaneously on two or more variables.

A data set consisting of observations on a single characteristic is a __________________ data set.

Alternatives:

A univariate data set is ________________________ (or _______________________________) if the individual observations are categorical responses.

A univariate data set is ________________________ (or ______________________________) if each observation is a number.

Which variables in our Survey Data are categorical or qualitative? Which are numerical or quantitative?

A numerical variable is __________________ if the possible values correspond to isolated points on a number line.

A numerical variable is ___________________ if its possible values form an entire interval on the number line.

Which variables in our Survey Data are discrete and which are continuous?

We make graphical displays to show the data distribution. At the end of section 2.1 is a list of several of the displays that we will be looking at in the following sections.
Section 2.2:

First let’s make a quick bar chart in Excel with survey data. Let’s use Gender. What does it show? What kind of info is appropriate for a bar chart?

_______________________________ for categorical data is a table that displays the possible categories along with the associated frequencies and/or relative frequencies.

______________________________________________ for a particular category is the number of times that category appears in the data set.

Write the formula for relative frequency below.

relative frequency =

When making a “comparative bar chart”, why should you use “relative frequency” for the vertical axis instead of “frequency”?

Side-by-side
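A relative frequency is usually computed as the category's frequency divided by the total number of observations. The sketch below tabulates both for a small made-up list of responses; the `responses` values are hypothetical and are not the actual survey counts.

```python
from collections import Counter

# Hypothetical categorical responses (not the actual survey counts).
responses = ["M", "F", "F", "M", "F", "F", "M", "F"]

counts = Counter(responses)   # frequency of each category
total = sum(counts.values())  # number of observations

# Relative frequency = frequency / number of observations.
relative = {category: count / total for category, count in counts.items()}

for category in sorted(counts):
    print(category, counts[category], relative[category])
```

Because relative frequencies always sum to 1, they stay comparable between groups of different sizes, which is the point of the comparative bar chart question.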
Section 2.3. Look at the following histograms with Weight and tell me what you see. Survey Data Weight Histogram separated by Gender. Survey Data by Weight and Gender
Note: if making a histogram manually, the book recommends using $\sqrt{\text{number of observations}}$ to figure out the number of intervals. I.e., if you have 100 observations, then use $\sqrt{100} = 10$ intervals. (Sometimes, you may need to bypass this and choose the number of intervals using different criteria.)

Sketch histograms that you could call “unimodal”, “bimodal”, and “multimodal”. On the unimodal histogram, label the “upper tail” and “lower tail.”

Sketch unimodal histograms that are positively skewed and negatively skewed.

Here is some Test Score data that is skewed left, or negatively skewed. Why is the data skewed left?

Test Score Data

I will show you how to create a histogram in Excel. We’ll make a histogram and learn how to adjust the number of “bins.” Male or female heights work well.
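The square-root guideline is easy to check numerically. This is a minimal sketch; the function name `suggested_intervals` is made up for illustration, and rounding to the nearest whole number is an assumption, since the notes only give the n = 100 case.

```python
import math

def suggested_intervals(n_observations):
    """Square-root guideline: about sqrt(n) intervals, at least 1."""
    return max(1, round(math.sqrt(n_observations)))

print(suggested_intervals(100))   # the example from the notes
print(suggested_intervals(3384))  # size of the wheat data set
```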
Stat 281 Ch. 2.4-2.6

Section 2.4 Displaying Bivariate Numerical Data: Scatterplots and Time Series Plots

What is the most important graph of bivariate numerical data? What does this type of graph help us look for?

Let’s look at the Survey data and show what we can learn from a scatterplot.

Stat 281 Spring 2015 Survey Data with Weight and Height

What are the two numerical variables and what can we learn from them? Notice anything unusual?
Let’s take a look at a scatterplot of some car data from 93 cars in 1993.

Car Data with MPG and Price

What are the two numerical variables and what can we learn from them?

A ______________________________________ is a simple graph of data collected over time that can help you see interesting trends or patterns.
Let’s look at some time series graphs of accidental deaths in the US from 1973 through 1978. Accidental Death in the US from 1973 - 1978 What can we tell from these time series graphs? Section 2.5 Graphical Displays in the Media This section will talk about some fancier graphs for you to look at. We won’t be making these, but you should be able to comprehend them. They can be misleading.
Segmented Horizontal Histograms on Online Communities What can we learn from this segmented histogram?
Let’s look at some misleading graphs. What is misleading about them? Number of Shingles Sold by Year
Number of Pets Owned

What is happening in this graph? What should the graph-makers have done instead?
Ch. 3.1-3.3

Chapter 3 Numerical Methods for Describing Data Distributions

Section 3.1 Appropriate Numerical Summaries

________________________________________ - describe where the data distribution is located along the number line. A measure of center provides information about what is typical.

________________________________________ - describe how much variability there is in a data distribution. A measure of spread provides information about how much individual values tend to differ from one another.

If the shape of the data distribution is    Describe Center and Spread using…
Approximately symmetric
Skewed or has outliers

Figure 3.3 from textbook: The mean salary of the players was 4.1 million. Does this represent a typical value?
Section 3.2

Sample mean = $\bar{x}$ =

What is the difference between a Population Mean and a Sample Mean?

Sample variance = $s^2$ =

Sample standard deviation = s =

Population variance = $\sigma^2$ =

Population standard deviation = $\sigma$ =

Let’s look at the Spring 2015 Survey data. Now is a good time to learn some basics of Excel.

Weight                  Height
Mean =                  Mean =
Variance =              Variance =
Standard Deviation =    Standard Deviation =

Which one is more variable?
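As a cross-check on the Excel work, the same summaries can be computed with Python's standard `statistics` module. The sketch below uses only the first five Weight and Height values from the survey excerpt, so the results will not match the full data set; the helper name `describe` is made up for illustration.

```python
import statistics

# First five Weight and Height values from the survey excerpt.
weights = [115, 120, 130, 130, 135]
heights = [65, 68, 71, 68, 68]

def describe(values):
    """Return (mean, sample variance, sample standard deviation)."""
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # divides by n - 1
    std_dev = statistics.stdev(values)      # square root of the sample variance
    return mean, variance, std_dev

weight_summary = describe(weights)
height_summary = describe(heights)
print("Weight:", weight_summary)
print("Height:", height_summary)

# The population version divides by n instead of n - 1.
weight_pop_variance = statistics.pvariance(weights)
print("Population variance of weight:", weight_pop_variance)
```

The only difference between the sample and population formulas here is the divisor, which is why the two variances differ on the same five numbers.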
Section 3.3 Describing Center and Spread of Data Distributions That Are Skewed or Have Outliers

The ____________________________________ is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then

________________________ =
$$\begin{cases} \text{the single middle value} & \text{if } n \text{ is odd} \\ \text{the average of the two middle values} & \text{if } n \text{ is even} \end{cases}$$

Weight               Height
Mean = 163.4155      Mean = 67.89539
Median = 160         Median = 68

How do the mean and median compare with our survey data?

Let’s try it with some smaller data sets.

9, 11, 13, 18, 20, 22, 25, 28, and 32

11, 13, 18, 20, 22, 25, 28, and 32

Definitions:

___________________________________ = median of the lower half of the data set

___________________________________ = median of the upper half of the data set

If n is odd, the median of the entire data set is excluded from both halves when calculating quartiles.
The interquartile range (IQR) is defined by

Let’s calculate the lower and upper quartiles and the IQR for the smaller data sets.

9, 11, 13, 18, 20, 22, 25, 28, and 32

11, 13, 18, 20, 22, 25, 28, and 32

Let’s interpret the summary statistics for Weight and Height for our Survey data.

    Weight          Height
Min.   : 98.0   Min.   : 6.0
1st Qu.:130.0   1st Qu.:64.0
Median :160.0   Median :68.0
Mean   :163.4   Mean   :67.9
3rd Qu.:180.0   3rd Qu.:72.0
Max.   :340.0   Max.   :80.0
                NA's   :1

What does the median of 160 lbs tell us about the weight of students?

What is the interquartile range for weight and what does it tell us?

What does the median of 68 inches tell us about the height of students?

What is the interquartile range for height and what does it tell us?
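The two small data sets can be worked by machine as a check on the hand calculation. The sketch below follows the rule stated in the notes (quartiles are medians of the lower and upper halves, and the overall median is left out of both halves when n is odd) and takes the IQR as upper quartile minus lower quartile. The function name `quartiles` is made up for illustration; spreadsheet and library quartile functions may use a different convention and give slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (lower quartile, median, upper quartile) using the halves rule."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, skip the middle value so it is in neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

for data in ([9, 11, 13, 18, 20, 22, 25, 28, 32],
             [11, 13, 18, 20, 22, 25, 28, 32]):
    q1, med, q3 = quartiles(data)
    print(f"Q1={q1}, median={med}, Q3={q3}, IQR={q3 - q1}")
```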
Ch. 3.4-3.6

Section 3.4 Summarizing a Data Set: Boxplots

What are the 5 numbers that are part of the 5 number summary?
1.
2.
3.
4.
5.

Here is some car data.

Car MPG and Type

What can we learn?
Let’s use this smaller data set to practice making a boxplot. Let’s identify the 5 numbers and make a quick sketch by hand:

2, 8, 10, 12, 16, 19, 22, 25, 28, 32, 35, 37, 44, 51, 57, 60, 64, 66, 72, 83

Then we can use the survey height data to make a boxplot in Excel.

An observation can be called an outlier if it is:

Section 3.5 Measures of Relative Standing: z-scores and Percentiles

The _______________ tells you how many standard deviations the data value is from the mean.

The z-score is positive when the data value is ___________________ than the mean and negative when the data value is __________________ than the mean.

z-score =

Let x = the age of an employee at Daktronics. Given the following information, find the z-score of an employee who is 38 years old.

x-value = 38
mean = 40
standard deviation = 6

z-score =
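For the 20-value practice set, the five numbers can be checked in code before sketching the boxplot. The sketch below uses the halves rule for quartiles and, as an assumption, the common 1.5 × IQR guideline for flagging outliers; the notes leave the outlier rule blank, so treat that cutoff as illustrative rather than as the course's definition.

```python
from statistics import median

data = [2, 8, 10, 12, 16, 19, 22, 25, 28, 32,
        35, 37, 44, 51, 57, 60, 64, 66, 72, 83]

def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1

# Assumed guideline: flag values more than 1.5 * IQR beyond the quartiles.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < low_fence or v > high_fence]

print(minimum, q1, med, q3, maximum)
print("IQR:", iqr, "outliers:", outliers)
```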
Now, given that an employee has a z-score of 1.8, what is their age?

The Empirical Rule

If a data distribution is mound shaped and approximately symmetric, then
Approximately _____ of the observations are within 1 standard deviation of the mean.
Approximately _____ of the observations are within 2 standard deviations of the mean.
Approximately _____ of the observations are within 3 standard deviations of the mean.

With our wheat data, what would be a 95% interval of the yield?

$\bar{x}$ = 58.6    s = 12.3

What about a 99.7% interval of the yield?

The r th percentile is a value such that r percent of the observations in the data set fall at or _____________ that value.

If your GPA was at the 75th percentile for your class, are you near the top or the bottom of the class?

Another name for the 75th percentile is the __________________________________
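The Daktronics and wheat questions come down to the same arithmetic: a z-score is (value − mean) / standard deviation, and an interval of k standard deviations is mean ± k × standard deviation. The sketch below works those numbers; the function names are made up for illustration, and the intervals are only meaningful if the yield distribution is roughly mound shaped and symmetric, as the Empirical Rule requires.

```python
def z_score(x, mean, std_dev):
    """How many standard deviations x lies from the mean."""
    return (x - mean) / std_dev

def value_from_z(z, mean, std_dev):
    """Invert the z-score formula to recover the data value."""
    return mean + z * std_dev

def interval(mean, std_dev, k):
    """Values within k standard deviations of the mean."""
    return mean - k * std_dev, mean + k * std_dev

z_38 = z_score(38, 40, 6)            # employee who is 38 years old
age_at_z = value_from_z(1.8, 40, 6)  # employee with z-score 1.8

yield_2sd = interval(58.6, 12.3, 2)  # wheat yield, 2 standard deviations
yield_3sd = interval(58.6, 12.3, 3)  # wheat yield, 3 standard deviations

print(round(z_38, 3), round(age_at_z, 1))
print([round(v, 1) for v in yield_2sd], [round(v, 1) for v in yield_3sd])
```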
Section 3.6 Avoid These Common Mistakes

What’s the problem with these comparative boxplots?

Survey Data with Gender and State

Which of the following two histograms displays data with a larger standard deviation?

The empirical rule only applies to distributions with a ________________ shape.

Mean and standard deviation can be affected by _________________ values.

If the distribution is _____________ or has _____________ the median and _______________ are often a better choice for describing center and variability.
Ch. 4.1

Chapter 4 Describing Bivariate Numerical Data

Section 4.1 Correlation

After we have found out what the data looks like and done some initial analysis, we may want to check whether there is a linear relationship between two variables.

Car Data Scatter Plot with Horsepower and Price

Does there appear to be a linear correlation? For future reference, r = .7882.
_____________________________________________________________ , denoted by r, measures the strength and direction of a linear relationship between two numerical variables.

Properties of r

1. r is ____________________ when the linear relationship is positive and _______________________ when the linear relationship is negative.

2. The value of r is always greater than or equal to ____ and less than or equal to ______ .
   Strong relationships are considered when r is __________________________________.
   Moderate relationships are considered when r is _______________________________.
   Weak relationships are considered when r is __________________________________.

3. r = 1 occurs when …
   r = -1 occurs when …

4. r is a measure of the extent to which x and y are ______________________ related.

5. The value of r does not depend on the unit of ________________________________ for either variable.
Try to predict r for the following:

Scatterplot of hwy versus cty

Scatterplot of hwy versus displ
Calculating the value of the correlation coefficient r.

$z_x$ =

$z_y$ =

The correlation coefficient r =

Let’s try an example. How much should a healthy Shetland pony weigh? Let x be the age of the pony (in months), and let y be the average weight of the pony (in kilograms).

x    3    6    12    18    24
y   60   95   140   170   185

Compute r and round the answer to 3 decimal places.
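The pony example can be checked numerically. The sketch below assumes the z-score form of the correlation coefficient, r = Σ(z_x · z_y) / (n − 1), where each z-score uses the sample mean and sample standard deviation; that form is an assumption about which formula belongs in the blanks, and any algebraically equivalent formula gives the same value. The function name `correlation` is made up for illustration.

```python
from statistics import mean, stdev

ages = [3, 6, 12, 18, 24]           # x: age in months
weights = [60, 95, 140, 170, 185]   # y: average weight in kilograms

def correlation(xs, ys):
    """Pearson r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

r = correlation(ages, weights)
print(round(r, 3))
```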
Lemons Imported and US Highway Fatality Rate In this example r = - .985. Do you believe that the increase in lemons being imported has caused a decrease in the highway fatality rate? Just because there is a strong correlation coefficient does not mean that there is a cause and effect relationship!
Ch. 4.2

Wheat data taken from SDSU about the Yield and Protein levels, taken 2012-2014

Section 4.2 Linear Regression: Fitting a Line to Bivariate Data

When there is a linear relationship between two variables, you can use information about one variable to learn about the value of the second variable.

The letter y is used to denote the variable you would like to predict, and this variable is called the _____________________________ (dependent variable).

The other variable, denoted by x, is the ____________________________________________ (independent or explanatory variable).

The equation of a line is y = a + bx, where b = ______________ , and a = ___________________ .

The least squares regression line is the line that ___________________________ the sum of squared deviations.

The slope of the least squares regression line is b =

A calculating formula for the slope that is easier for calculations is b =

The y intercept is a =

The equation for the least squares regression line is $\hat{y}$ =
Go back to our example of Yield and Protein, which had r = -.7182. Let’s say that we want to come up with a line of best fit for that data.

Coefficients:
(Intercept)       YIELD
   17.63036    -0.03934

b =
a =
$\hat{y}$ =

If Yield = x = 59.7, then what would our estimated value for $\hat{y}$ equal?

Making predictions outside the range of x values in the data set can produce misleading predictions if the pattern does not continue. This is sometimes referred to as the danger of _______________________________________________ .

Refer back to the pony example in 4.1. Predict the weight of a 9-month-old pony. Then think of how one could extrapolate with a line.

Another example. Create a regression line relating height and weight with the survey data. Compute r as well.

Let’s try an example. An economist is studying the job market in Denver area neighborhoods. Let x represent the total number of jobs in a given neighborhood, and let y represent the number of entry-level jobs in the same neighborhood. A sample of six Denver neighborhoods gave the following information (units in hundreds of jobs).

x   16   33   50   28   50   25
y    2    3    6    5    9    3
Total Jobs vs Entry-level Jobs in Denver Neighborhoods:

Compute r. Then find the equation of the least-squares line $\hat{y}$ = a + bx. For a neighborhood with x = 40 hundred jobs, how many are predicted to be entry-level jobs? We will solve this in the 4.2 Excel example.

Line of Best Fit for Total Jobs vs Entry-Level Jobs
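Outside of Excel, the Denver line can be fitted in a few lines of code. The sketch below assumes the standard least squares formulas, slope b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept a = ȳ − b·x̄; the function names `least_squares` and `predict` are made up for illustration. The same `predict` helper is then applied to the wheat coefficients quoted in the notes (intercept 17.63036, slope −0.03934) at Yield = 59.7.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

def predict(a, b, x):
    """Predicted y for a given x on the line y-hat = a + b * x."""
    return a + b * x

# Denver neighborhoods, in hundreds of jobs.
total_jobs = [16, 33, 50, 28, 50, 25]
entry_jobs = [2, 3, 6, 5, 9, 3]

a, b = least_squares(total_jobs, entry_jobs)
entry_at_40 = predict(a, b, 40)
print(f"y-hat = {a:.3f} + {b:.3f} x; prediction at x = 40: {entry_at_40:.2f}")

# Wheat example: coefficients copied from the regression output in the notes.
protein_at_597 = predict(17.63036, -0.03934, 59.7)
print(f"Predicted protein at yield 59.7: {protein_at_597:.2f}")
```

Both predictions use x values inside the range of the observed data, so neither is an extrapolation.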
Ch. 4.3 Assessing the Fit of a Line

After we obtain the line of best fit, the next step is to determine how effective the line is at summarizing the relationship between x and y. We can consider three things:
1.
2.
3.

The _________________________________ result from substituting each x value into the equation for the regression line. This gives

$\hat{y}_1$ =
$\hat{y}_2$ =
$\hat{y}_n$ =

The ___________________________________ are the n quantities

$y_1 - \hat{y}_1$ =
$y_2 - \hat{y}_2$ =
$y_n - \hat{y}_n$ =

Each residual is the difference between an observed y value and the corresponding predicted y value.
Examples of Residual Graphs

Random pattern
Non-random: U-shaped
Non-random: Inverted U

Go back to the ponies from 4.1 one more time. I will calculate the predicted values and the residuals and make a plot. Do we see any pattern here?

An observation is an ___________________________ if it has a large residual. Outliers fall far away from the least squares line in the ___ direction.

Alternatively, a point is potentially an ______________________________________________ if it has an x value far away from the rest of the data.

The _____________________________________________________ , denoted by _____ , is the proportion of variability in y that can be attributed to an approximate linear relationship between x and y.

The value of _____ is often converted to a percentage (by multiplying by 100).

Residual sum of squares = SSResid =

Total sum of squares = SSTo =

The coefficient of determination is calculated as $r^2$ =

* Note that we can calculate $r^2$ by squaring r
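The pony data can be used to see how predicted values, residuals, and the coefficient of determination fit together. The sketch below assumes the usual definitions: SSResid is the sum of squared residuals, SSTo is the sum of squared deviations of y from its mean, and r² = 1 − SSResid / SSTo. Variable names are made up for illustration.

```python
from statistics import mean

ages = [3, 6, 12, 18, 24]           # x: age in months
weights = [60, 95, 140, 170, 185]   # y: average weight in kilograms

# Fit the least squares line y-hat = a + b * x.
x_bar, y_bar = mean(ages), mean(weights)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, weights))
s_xx = sum((x - x_bar) ** 2 for x in ages)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Predicted values and residuals (observed minus predicted).
predicted = [a + b * x for x in ages]
residuals = [y - y_hat for y, y_hat in zip(weights, predicted)]

ss_resid = sum(e ** 2 for e in residuals)            # residual sum of squares
ss_total = sum((y - y_bar) ** 2 for y in weights)    # total sum of squares
r_squared = 1 - ss_resid / ss_total

for x, e in zip(ages, residuals):
    print(x, round(e, 1))
print("r squared:", round(r_squared, 3))
```

Printing the residuals in order of age is a quick stand-in for a residual plot: the signs show whether the points fall above or below the line as x increases.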
A large value of $r^2$ indicates that a large proportion of the variability in y can be explained by the approximate linear relationship between x and y. This tells you that knowing the value of x is helpful for predicting y.

Let’s look at the famous “iris” data set. We have 150 measurements from 3 species of irises. Let’s look at the relationship of sepal lengths and widths. First we’ll look at this relationship for the whole dataset and then just for the species setosa.

What is $r^2$ in each case? How much of the variability can be explained here? What is r and what does that tell us?

Example of a test question:

Lemon Imports vs US Highway Fatality Rate

In this example, what is $r^2$? How much of the variability can be explained? What is r?
Section 4.4 Describing Linear Relationships and Making Predictions: Putting It All Together

Let’s summarize the steps in a linear regression.
1. Summarize the data graphically by constructing a ___________________________ .
2. Based on the scatterplot, decide if it looks like the relationship between x and y is approximately __________________________.
3. Find the equation of the __________________________________________________ .
4. Construct a residual plot to look for any patterns or unusual features. (You can usually skip this step for our class.)
5. Calculate $r^2$ and interpret.
6. Decide if the least squares line is useful in making predictions.
7. Use the least squares line to make predictions.

Does there appear to be a linear relationship between carat and price? (Investigate this with the “Diamonds data set” under “Sample Data Sets”.)

Find the equation for the least squares regression line.

What is r? What is $r^2$?
Interpret the slope, r, and $r^2$. Does it seem like this line will be useful to make predictions?