1.42 Solver and least squares fits 1.43 Error and statistics The objectives of this section are to: 1. Create error bars within a chart in a spreadsheet. 2. Apply statistics-related functions for a data series. 3. Calculate and distinguish between standard deviation and quartiles for a data series. 4. Complete a linear regression within a spreadsheet using functions or built-in regression tools. Error and error bars An error is commonly known as a mistake, but in a spreadsheet, an error usually quantifies a statistical difference. Ex: A residual is the difference between the data and model, which is detailed in another section, and is one example of a quantifiable error. When plotting data and models in charts, error bars provide one method to visualize quantified error. The animation below shows several types of error bars. PARTICIPATION ACTIVITY 1.43.1: Types of error bars. Start 2x speed 100 y-axis error bar Property 80 with cap Error value y-axis 60- error bar without cap x and y axis Fixed value 1.0 error Percentage 1.5 bars 40 Standard deviation 1.0 x-axis y-axis error error 20 bars bars HAH 0 2 6 8 x Captions ^ 1. A data point is plotted (blue triangle) and error bars are added as lines extending from the point, with or without a cap. 2. A property window allows formatting of error bars by entering values or selecting values in cells. Increasing the percentage makes larger error bars. 3. Error bars may be used on the x or y-axis only, or both. PARTICIPATION ACTIVITY 1.43.2: Basics of error bars. 1) Error bars demonstrate the magnitude of the error for data points within a chart. ○ True O False 2) The magnitude of error bars can only be a percentage of the magnitude of the data point. True O False 3) Error bars are only used on scatter charts. O True O False Some statistics Feedback? Feedback? When collecting data, multiple measurements are made to quantify the error of the measurement. First, important quantities related to sets of data are presented, followed by definitions and animations related to distributions. Table 1.43.1: Summarizing data terms. Definition Sample calculation using data set [2.4, 2.9, 3.3, 2.7,2.6] An average, or mean, sums the data and divides by =(2.4+2.9 +3.3 +2.7 the number of data points. A median is the data point separating the larger half from the smaller half of a data series. When a series contains an even number of data points, the average of the two data points separating the larger and smaller halves can be used. +2.6)/5 = 2.78 Sorting from smallest to largest-2.4, 2.6, 2.7, 2.9, 3.3-finds the median equals 2.7. A maximum is the largest data point of the data 3.3 series. A minimum is the smallest data point of the data 2.4 series. Feedback? Table 1.43.2: Data terms to spreadsheet functions. The cell addresses and data set are: A2-2.4, A3-2.9, A4-3.3, A5-2.7, A6-2.6. Manual spreadsheet actions Spreadsheet Term function AVERAGE(cell Average =(A2+A3+A4+A5+A6)/5 group) Spreadsheet calculation using function =AVERAGE(A2:A6) MEDIAN(cell Median group) Maximum MAX(cell group) Sort data/cells from smallest to largest, identify data/cell dividing -MEDIAN(A2:A6) larger and smaller halves of data series. Sort data/cells from largest to smallest, first cell is maximum. Sort data/cells from smallest to largest, first =MAX(A2:A6) =MIN(A2:A6) MIN(cell Minimum group) cell is minimum. Feedback? A number of methods for quantifying error exist and are covered in a statistics course. Quantifying error usually begins by identifying the type of distribution found in the data. A distribution is the mathematical representation of a data series. While many types of distributions are covered in a statistics course, a normal distribution and quartiles are covered here. A normal distribution represents data symmetrically spread around the average. A normal distribution is also known as a bell curve or Gaussian distribution. Ex: Many natural things fit a normal distribution, including humans' height or snowflakes' size. Standard deviation is one common method for quantifying error. The definition, spreadsheet function, and use in a chart are demonstrated below. Table 1.43.3: Quantifying error terms. Definition Equation Sample calculation using data set (2.4, 2.9, 3.3, 2.7, 2.6] A standard deviation (sd) is a measure of the variation within a data series and returns a value with the same units as the data sd = (-average) N-1 where N is the number of data =(((2.4-2.78)^2 +(2.9-2.78)^2+ (3.3-2.78)^2+ (2.7-2.78)^2+ points and i represents each data (2.6-2.78)^2 point in the series. )/4)^0.5 0.34 series. Feedback? When a data series does not follow a normal distribution, quartiles can represent data in a chart. A quartile divides a data series into four equal parts. The first quartile divides the smallest 25% of the data series from the rest. A second quartile divides the data series into equal parts, which is also the median. The third quartile divides the smallest 75% of the data series from the rest. Ex: Exam scores do not fit a normal distribution in many cases, so reporting first quartile, median, and third quartile scores provide a method to compare one student's exam to an entire class. A box-whisker chart is a type of bar chart where a data series is bounded with error bars for the maximum and minimum, with a box bounding the first quartile and third quartile. The box is divided by the median, and the box encompasses the middle 50% of the data within the series. Spreadsheet functions and animations expand on the definitions of standard deviation and quartiles below. Table 1.43.4: Quantifying error terms using spreadsheet functions. The data set and cell addresses are: A2-2.4, A3-2.9, A4-3.3, A5-2.7, A6-2.6. Value returned by spreadsheet Spreadsheet Term function Standard deviation 1st Quartile 3rd Quartile Spreadsheet formula using function STDEV(cell group) =STDEV(A2:A6) QUARTILE(cell group, quartile number) QUARTILE(cell group, quartile number) 0.34 =QUARTILE(A2 A6, 1) 2.6 =QUARTILE(A2:A6, 3) 2.9 PARTICIPATION ACTIVITY 1.43.3: Visualizing a normal distribution. Start 2x speed 0.4 0.3- Normal distribution Scatter chart Average = 0 sd=1 1 sd (68.2%) 0.2- 0.1 -4 -2 0 2 x Minimum Medan 1st quartile 3rd quartile, (50%) 2 sd (95.4%) formula Box-whisker chart Feedback? Captions A distribution is plotted with an average of 0 and standard 1. A normal deviation of 1. 2. One standard deviation accounts for ~68% of the distribution while two SD covers over 95%. 3. A box-whisker chart also represents a distribution with 50% of the data accounted for in the box, divided by the median. PARTICIPATION ACTIVITY 1.43.4: Visualizing standard deviation and quartiles. Start 10 Measurement (units) 2x speed OOOOOO H Raw Average ± Box- data whisker standard deviation Data representation Feedback? Raw, unsorted data 6.5 4.5 7.1 8.3 7.8 7.3 9.4 Captions A 1. Seven data points are recorded and will be plotted three ways. 2. First, the raw data are plotted as individual data points (circles). 3. Next, the average (triangle) with error bars of one standard deviation are plotted 4. Finally, a box shows 1st, 2nd, and 3rd quartiles and minimum/maximum with error bars (whiskers). PARTICIPATION ACTIVITY 1.43.5: Distributions. 1) The median and average are always the same number. O True O False 2) The second quartile is also the O average O median O standard deviation 3) A data series with 100 data points follows a normal distribution. The number of data points within one standard deviation of the average is O 68 95 99 4) The median divides the two boxes in a box-whisker chart. The percentage of data points accounted for within either box is 50 75 =zyBooks ▾ AUTIVITY Feedback? Feedback? 584450.1699390 qay? Jump to level 1 Enter a formula in Cell D3 for the average of only the data point: average in Cell D4. A 1 Data 2 64 3 73 4 68 5 89 6 55 7 63 8 37 9 36 10 14 Copy sheet Check Next B C D =AVERAG 71 X Each incorrect answer is highlighted. Expected: D3-AVERAGE(A2:A10) and D4-55.4 An average sums the data points and divides by the number of data point: using the AVERAGE spreadsheet function. >-0-0-0-0- Feedback? Linear regression A linear regression is a model that fits data with a slope and intercept from a linear model, such as y=mx+ Linear regression can be done for multip dependent (x) variables, which is covered elsewhere by following the link in the Exploring Further section below. Linear regression can be completed using spreadsheet functions, as shown in the table below, or using a regression analysis tool, as shown in the animations below. Therefore, linear regression is a another method that can be applied within a spreadsheet to model data with a linear model; trendlines and least squares fits also create optimized linear models, which are covered in other sections. Table 1.43.5: Linear regression spreadsheet functions. A data series with four data points will be used in the definitions below. A B 1 x data y data 21 20 3 3 37 45 57 57 78 Spreadsheet Spreadsheet formula using Term function function Value returned using spreadsheet above Slope SLOPE(y cell range, x cell range) INTERCEPT(y =SLOPE(B2:B5A2:A5) 9.7 Intercept cell range, x cell -INTERCEPT(B2:B5,A2:A5) 9.2 R- squared range) RSQ (y cell range, x cell range) =RSQ(B2:B5A2:A5) 0.998 Feedback? A confidence interval provides the range which a value will fall between a lower and upper bound. Ex: A 95% confidence interval predicts that 95% of values will fall within the bounds. Spreadsheets have data analysis packages to perform linear regression and provide 95% confidence intervals for all parameters. Ex: A slope of 5.3 may have a 95% confidence interval between 3.9 and 6.7. The animations below demonstrate using linear regression and data analysis packages in a spreadsheet and a chart. PARTICIPATION ACTIVITY 1.43.6: Entering values into a regression analysis tool. Start 2x speed Tx A B C D 1 x data y data 2 1 20 3 3 37 4 5 57 5 7 78 Regression analysis Input y range: B2:B5 Input x range: A2:A5 Output range: E1:H10 Other options Residuals Residual plots OK R-squared Upper bound 95% confidence interval Lower bound 95% confidence interval Parameters and many other statistical measures Captions A 1. A spreadsheet is filled with x and y data for four data points. ranges entered, analysis options selected, and OK is clicked. 3. Many statistical measures are outputted from the Regression analysis 2. Next, the regression analysis tool is located, selected, appropriate cell tool, such as R-squared and optimized parameters. Results are shown in another animation. PARTICIPATION ACTIVITY 1.43.7: Visualizing a confidence interval. Start 2x speed 100 80 60- 40- 20 y=9.7x+9.2 R=0.998 Upper 95%: y 11.1x+15.6 Lower 95%: y = 8.3x+2.8 0 2 4 6 x Captions A Linear model: y = mx + b Feedback? Regression analysis tool Optimized parameters and R-squared and upper and lower 95% confidence interval with other statistical measures available 1. Four data points are plotted and the regression analysis tool is used within a spreadsheet. 2. First, the optimized line is plotted in green and the R-squared is 0.998. Then the upper and lower 95% confidence interval lines are added. PARTICIPATION ACTIVITY 1.43.8: Linear regression. 1) A linear model is found to have a slope of 97. The 95% confidence interval is 97 ± 14, and the 98% confidence interval is 97 ± 5. O True ○ False 2) A data set includes 35 data points of pressure as a function of temperature, which is fit by a linear model: P mT+b. Two engineers find m and b differently. One engineer uses SLOPE and INTERCEPT spreadsheet functions, while another engineer uses the linear regression data analysis tool. Both engineers find the same values for m and b. ○ True O False 3) Vivian and Alexander have a data set with pressure, temperature, and volume. They can use the data analysis tool within their spreadsheet to find a 95% confidence interval for a linear model. ○ True O False Feedback? Feedback? Linearizing an equation Linear regression can also be applied to non-linear models using a process called linearization. A linearization converts a non-linear model into a linear model. While many models cannot be linearized, some guidelines can direct the linearization process. The equation should be separable, so y and x expressions can be placed on different sides of the equation. The linear model of y = mx +b with two constants still applies, but now expands to account for a function of y instead of y, and a function of x instead of x. The animations below demonstrate the linearization process in both a spreadsheet and chart. PARTICIPATION ACTIVITY 1.43.9: Linearization of an exponential model. Start 2x speed fx A 1 A= B 2 C D E From 6.9 least B= 0.35 3 x y squares In(A)= 1.9 INTERCEPT B= 0.35 SLOPE(E4:E linearized x linearized y 4 1 10 1 2.30 5 4 29 4 3.35 6 6 55 6 4.05 7 7 84 7 4.40 Exponential model: Linearized model: y = A exp(Bx) Fit using: Least squares fit ог Trendline Linearization In(y) Bx + In(A) I is wry. Least squares fit or Trendline or Functions or Linear regression data analysis to Captions A 1. Four data points can be modeled by an exponential model. 2. The exponential model can be optimized using a least squares fit or trendline, which were covered elsewhere. 3. Linearization converts the model into a linear form. In this case, linearized x is still x while linearized y is in y. 4. Finally, fits can be completed by at least four methods covered in this section and other sections. For example, functions are used in cells E1 and E2. PARTICIPATION ACTIVITY 1.43.10: Visualizing a linearized model. Start 2x speed 100 80 60- 40- 20 y=6.9 exp(0.35x) Linearization Feedback? y = 0.35'x' +1.9 Non-linear model Linearized model: y= A exp(Bx) In(y) Bx+In(A) 0 2 4 8 0 2 4 x x Captions ^ 1. Four data points are plotted and a non-linear model fits the data. 2. Linearization alters only the y-axis for this model. The data points are replotted and a linearized model fits the data. Note: Parameter A converts as In(6.9)=1.9 PARTICIPATION ACTIVITY 1.43.11: Match the linearized equations. If unable to drag and drop, refresh the page y = e(4+4) y = A. eBz A.x y = B+x Feedback? y = A. B In(y) =Bx+ In(A) In(y) B. In(x) + In(A) 1 B 1 1 = =+ y A I A In(y) = B + A Reset CHALLENGE ACTIVITY 1.43.2: More and more error and statistics calculations. 584450.1699390 qv3zay? Start Converts a non-linear model into a linear model. Choose the correct term. Check Next How was this 10 section? 91 Provide section feedback Feedback? 4 Feedback? 1.44 Spreadsheet resources