Notes2-Descriptive_SS

School: University of Wisconsin, Madison
Course: 324
Subject: Statistics
Date: Feb 20, 2024

Statistics University of Wisconsin Madison – Chelsey Green chelseygreen@wisc.edu

NOTES 2: A FIRST LOOK AT RSTUDIO AND DESCRIPTIVE STATISTICS

Graphical and numeric summaries make overall trends in data more apparent. The most appropriate graphical and numeric summaries depend on the type and amount of data you have. We only have time to look at a subset of the available summary techniques in this course, so we will focus on the most common.

Oxide Layer Thicknesses Example: Computer chips contain electronic circuits and are sealed with a thin layer of silicon dioxide. The manufacturer considered using recycled silicon wafers instead of new ones to reduce cost. Oxide thickness measurements (in angstroms, Å) from 18 test runs using new wafers are given below:

90.0, 92.2, 94.9, 92.7, 91.6, 88.2, 92.0, 98.2, 96.0, 91.1, 89.8, 91.5, 91.5, 90.6, 93.1, 88.9, 92.5, 92.4

Oxide Layer Thicknesses Example (e): Create or access the Notes2 R Markdown file. Save the file into your Stats folder. Define a vector named Thickness to store the 18 observations. Then re-save the Thickness vector sorted from smallest to largest so it is easier to look at.

SELECTING APPROPRIATE GRAPHICAL SUMMARIES:
How many variables, what type of data, and how many observations do you have?

Summarizing 1 Variable
Numeric/Quantitative Data:
  Large Data: Histograms, Box Plots
  Small Data: Stem-and-Leaf, Dot Plot
Categorical/Qualitative Data: Bar Charts, Pareto Charts, Pie Charts*, Frequency Table
* Pie charts are most useful when there are only a few categories and there is a large distinction between percentages.
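For part (e) of the Oxide Layer Thicknesses example, the vector could be set up roughly as follows. This is only a sketch: the name Thickness comes from the notes, and sort() is one of several ways to reorder a vector.

```r
# New-wafer oxide thicknesses (angstroms), in the order listed in the notes
Thickness <- c(90.0, 92.2, 94.9, 92.7, 91.6, 88.2, 92.0, 98.2, 96.0,
               91.1, 89.8, 91.5, 91.5, 90.6, 93.1, 88.9, 92.5, 92.4)

# Check that all 18 observations were entered
length(Thickness)

# Overwrite the vector with its sorted version (smallest to largest)
Thickness <- sort(Thickness)
Thickness
```

Overwriting Thickness keeps a single object in the workspace; storing the sorted copy under a new name would work equally well if the original run order matters.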
Summarizing More Than 1 Variable
2 Numeric/Quantitative values on each subject: Scatterplot
Single Numeric Data on 2 or more groups: Comparative Histograms or Box Plots
Categorical Data on 2 or more groups: Contingency Table, Mosaic Plot

GRAPHING QUANTITATIVE DATA

Dot Plot: chart with a number line and a point for each datum above the line at its value. Repeated values are often stacked.
a. Draw a number line
b. Draw a dot above the number line at the value of each datum
In R: stripchart(x, method = "stack", …)

Histogram: chart that displays the frequency, percentage, or density of measurements falling into each range of values, using rectangles with heights equal to the frequency, percentage, or density respectively.
a. Divide the range (difference between the maximum and minimum measurement) by the number of class intervals desired (usually 5-20 intervals) and round to get a convenient width for each class interval. (Equal-width bins are most commonly used.)
b. Compute the frequency or relative frequency of measurements falling into each class interval (set up a convention for values that fall on a boundary).
c. Density Histogram: Compute the density = (relative frequency)/(width of bin) of measurements falling into each class interval.
d. Divide up an x axis according to the class intervals chosen and construct rectangles with heights according to frequency, relative frequency, or density.
* For discrete data with only a few values, rectangles are often centered at the individual values.
In R: hist(x, breaks = "Sturges", freq = NULL, probability = !freq, include.lowest = TRUE, …)
* Notice that, by default, R puts values that land on breaks into the lower bin.
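The dot plot and histogram steps above can be sketched in R for the oxide thickness data. The bin width of 2 Å and the object names (bins, h, h_rel, h_den) are choices made for this illustration, not part of the notes. Base R has no dedicated relative frequency option, so the sketch rescales the counts of a histogram object before redrawing it, which is one common workaround.

```r
Thickness <- c(90.0, 92.2, 94.9, 92.7, 91.6, 88.2, 92.0, 98.2, 96.0,
               91.1, 89.8, 91.5, 91.5, 90.6, 93.1, 88.9, 92.5, 92.4)

# Dot plot: the repeated value 91.5 is stacked
stripchart(Thickness, method = "stack", pch = 16,
           xlab = "Thickness (angstroms)")

# Class intervals of width 2, chosen for illustration
bins <- seq(88, 100, by = 2)

# Frequency histogram; hist() also returns the bin counts
h <- hist(Thickness, breaks = bins, main = "Frequency",
          xlab = "Thickness (angstroms)")
h$counts   # 90.0, 92.0 and 96.0 sit on breaks and are counted in the lower bin

# Relative frequency histogram: rescale the counts, then redraw
h_rel <- h
h_rel$counts <- h$counts / sum(h$counts)
plot(h_rel, freq = TRUE, main = "Relative frequency",
     xlab = "Thickness (angstroms)", ylab = "Relative frequency")

# Density histogram: heights are relative frequency / bin width
h_den <- hist(Thickness, breaks = bins, freq = FALSE, main = "Density",
              xlab = "Thickness (angstroms)")
sum(h_den$density * diff(bins))   # rectangle areas add up to 1
```

Changing `by = 2` to a smaller or larger width is a quick way to see how the number of bins affects the histogram's appearance.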
Boxplot: graphic that displays the five number summary and outlying values in a box with extending lines.
a. Draw and label a vertical or horizontal axis that spans the range of the data
b. Draw longer lines at Q1, Median, and Q3 perpendicular to the axis
c. Connect the ends of Q1 and Q3 to create a box (this gives a visual display of the IQR)
d. Identify any point outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as an outlier and plot each outlier on the axis with a dot. (This is the default R behavior, but it can be adjusted.)
e. Draw lines from the box to the largest non-outlier and from the box to the smallest non-outlier
In R: boxplot(x, …) or boxplot(y ~ grp, …)

Oxide Layer Thicknesses Example (f): Construct a dot plot, frequency histogram, relative frequency histogram, and density histogram for the data to summarize the numeric observations. Compare the tools and explain how changing the number of classes/bins affects the histograms' appearance.

SUMMARIZING SHAPE OF QUANTITATIVE DATA

Graphing numeric data allows us to see the shape of the data.

Symmetric Data: upper and lower halves of the data have approximately the same shape.
E.g.: repeated measurements of the same thing
* Mean ≈ median with symmetric data

Right Skewed Data: right side of the histogram (larger half of the observations) extends a greater distance than the left.
E.g.: number of siblings, income
* Often mean > median with right skewed data
* Often happens when there is a natural lower bound for the values

Left Skewed Data: left side of the histogram (lower half of the observations) extends a greater distance than the right.
E.g.: age at retirement, age of a person who dies of natural causes
* Often mean < median with left skewed data
* Often happens when there is a natural upper bound for the values

Uniform: histogram where every interval has essentially the same number [proportion] of observations.
E.g.: the value that lands face up when a die is rolled 100 times

Unimodal: histogram with one major peak.

Bimodal: histogram with two major peaks.
* Often happens when two groups with different centers are both included in one data set

Oxide Thickness Example (g): Describe the shape of the Thickness sample data and discuss what that might mean in context.
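As one way to connect the boxplot outlier rule to the question of shape, the sketch below asks R which thickness values fall outside the 1.5*IQR fences. It relies on boxplot.stats(), which builds the box from Tukey's hinges; these can differ slightly from quartiles computed by other methods, although for this data set they agree with the quartiles used in these notes. The object name b is illustrative.

```r
Thickness <- sort(c(90.0, 92.2, 94.9, 92.7, 91.6, 88.2, 92.0, 98.2, 96.0,
                    91.1, 89.8, 91.5, 91.5, 90.6, 93.1, 88.9, 92.5, 92.4))

# Default coef = 1.5 gives the usual [Q1 - 1.5*IQR, Q3 + 1.5*IQR] fences
b <- boxplot.stats(Thickness)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end
b$stats

# Observations beyond the fences (plotted as individual points)
b$out

# The same information drawn as a horizontal boxplot
boxplot(Thickness, horizontal = TRUE, xlab = "Thickness (angstroms)")
```

Here both flagged values lie on the high side and none on the low side, which may suggest a longer right tail; with only 18 observations that reading should be treated cautiously.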
SELECTING APPROPRIATE NUMERIC SUMMARIES

One thing to keep in mind when choosing numeric summaries is whether you have a sample or a population of data. Review Notes 1 to remind yourself of the difference between those two concepts.

parameter: a numeric summary of a population's characteristic.
Examples: population mean μ ("mu"); population standard deviation σ ("sigma"); population proportion of success π
E.g. The average length of all walleye in a lake is a parameter (μ) when we consider the lengths of all walleye in the lake to be our [statistical] population of interest.

statistic: a numeric summary of a sample's characteristic[s].
Examples: sample mean $\bar{X}$ ("X-bar"); sample standard deviation $s_x$; sample proportion of success $\hat{p}$
E.g. The average length of the walleye caught by fishers in a lake over a weekend is a sample statistic if the population of interest is the lengths of all walleye in the lake. It would be a parameter, however, if we instead considered the population of interest to be the lengths of walleye caught in that lake over the weekend.

QUANTITATIVE DATA: Measures of the "center" value, value position, and variability of the values are often calculated as summary measures.

Measures of Center

Sample Mode: the measurement value that occurs most often.

Sample Mean (Average), "X-bar": $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sum of the sample values divided by the sample size.
[If we have a population of data, we call this mean parameter μ, "mu".]
* Because of the way it is computed, the mean represents the "balance point" of the histogram of the distribution.
* The sample mean is often reported for samples that are roughly symmetric.
In R: mean(data)

Sample Median (M): the middle value of an ordered set of data. If the size of the data set is even, we take the average of the two center points.
* The sample median is often reported for samples that are skewed.
In R: median(data)

Oxide Layer Thicknesses Example (h): Compute and compare the sample mode, mean, and median of the observed thicknesses for the 18 chips evaluated. Identify how these values correspond to what you see in the histogram.

88.2 88.9 89.8 90.0 90.6 91.1 91.5 91.5 91.6 92.0 92.2 92.4 92.5 92.7 93.1 94.9 96.0 98.2

Mode:
Mean:
Median:
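The three measures of center in part (h) can be checked in R along the following lines. Base R has no built-in function for the sample mode, so this sketch tabulates the values with table() and keeps whichever occur most often; that approach and the names value_counts and mode_values are assumptions of the example, not something the notes prescribe.

```r
Thickness <- c(90.0, 92.2, 94.9, 92.7, 91.6, 88.2, 92.0, 98.2, 96.0,
               91.1, 89.8, 91.5, 91.5, 90.6, 93.1, 88.9, 92.5, 92.4)

# Mode: tabulate each distinct value and keep the most frequent one(s)
value_counts <- table(Thickness)
mode_values <- as.numeric(names(value_counts)[value_counts == max(value_counts)])
mode_values

# Mean: sum of the values divided by the sample size
mean(Thickness)

# Median: with n = 18 this averages the 9th and 10th ordered values
median(Thickness)
```

A mean slightly above the median is in line with the mild right tail visible in the histogram.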
Measures of Value Position:

100p-th Percentile (Quantile): a value such that, if the data are ordered from smallest to largest, at least 100p% of the observations are at or below this value and at least 100(1-p)% are at or above this value.
* There are many different algorithms to compute percentiles. R has 9 in this one function! Computing percentiles by hand will not always match what R computes.
In R: quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, type = 2, ...)
a. First Quartile (Q1): the median of the lower half of a data set. The 25th percentile.
b. Second Quartile (Q2): the median of the set. The 50th percentile.
c. Third Quartile (Q3): the median of the upper half of a data set. The 75th percentile.
(When computing Q1 and Q3 by hand for a data set of odd size, we will include the median in both the first half and the second half of the sorted list.)

The Five Number Summary is a common summary that contains the minimum, Q1, median, Q3, and maximum value for a set of quantitative values.

Oxide Thickness Example (i): Compute the 1st, 2nd, and 3rd quartiles for the Thickness data:

88.2 88.9 89.8 90.0 90.6 91.1 91.5 91.5 91.6 92.0 92.2 92.4 92.5 92.7 93.1 94.9 96.0 98.2

Measures of Spread:
Range: Difference between the maximum and minimum value in a data set. E.g. 11.5 - 4.2 = 7.3
In R: range(..., na.rm = FALSE)

Interquartile Range (IQR): Difference between the first and third quartiles. Range of the middle 50% of our data. IQR = Q3 - Q1
In R: IQR(x, na.rm = FALSE, type = 2)

Standard Deviation: How far a typical datum is from the mean value
a. Compute the mean of the data set
b. Compute the differences between each datum and the mean value: "deviations"
c. Square each deviation
d. Sum up the squared deviations
e. Divide by the size of the data set if we have a population of values, or divide by 1 less than the size of the data set if we have a sample of values
(We'll explore later in the course why we follow different procedures when we have data from a sample vs. a population of data.)
f. Take the square root of the result to get back to the original data scale

SD of a population of data: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2}$
SD of a sample of data: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$
In R: sd(x, na.rm = FALSE)

Variance: The square of the standard deviation.

Oxide Layer Thicknesses Example (j): Compute the range, IQR, and standard deviation for the oxide thickness data, treating these values as a sample of observations for chips that use new silicon wafers. Interpret these values.

Range:
IQR:
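The quartiles from part (i) and the spread measures from part (j) can be checked in R roughly as follows. This is a sketch under a few assumptions: type = 2 is used because it is the quantile type written in the function calls in these notes, R's default (type = 7) is shown alongside to illustrate that algorithms can disagree, and the object names (q_type2, q_type7, rng, iqr2, s_manual, s_builtin) are illustrative.

```r
Thickness <- sort(c(90.0, 92.2, 94.9, 92.7, 91.6, 88.2, 92.0, 98.2, 96.0,
                    91.1, 89.8, 91.5, 91.5, 90.6, 93.1, 88.9, 92.5, 92.4))

# Quartiles with the type used in these notes, and with R's default type
q_type2 <- quantile(Thickness, probs = c(0.25, 0.5, 0.75), type = 2)
q_type7 <- quantile(Thickness, probs = c(0.25, 0.5, 0.75))  # default type = 7
q_type2
q_type7

# range() returns c(min, max); diff() turns that into a single number
rng <- diff(range(Thickness))

# IQR computed from the type 2 quartiles
iqr2 <- IQR(Thickness, type = 2)

# Sample SD following steps a-f above
n <- length(Thickness)
deviations <- Thickness - mean(Thickness)        # steps a and b
s_manual <- sqrt(sum(deviations^2) / (n - 1))    # steps c to f, sample version
s_builtin <- sd(Thickness)                       # sd() also divides by n - 1

c(range = rng, IQR = iqr2, sd_by_hand = s_manual, sd_builtin = s_builtin)
```

For this data set the type 2 quartiles match the by-hand "median of each half" values, while the default type interpolates between neighboring observations and gives slightly different Q1 and Q3.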
Sample SD:

Oxide Thickness Example (k): Use the quartiles and IQR computed above to construct a boxplot for the Thickness data:

88.2 88.9 89.8 90.0 90.6 91.1 91.5 91.5 91.6 92.0 92.2 92.4 92.5 92.7 93.1 94.9 96.0 98.2

Q1: 90.6   Median: 91.8   Q3: 92.7

Compare it to a boxplot constructed by R. Compare the features of the data that are apparent in the boxplot and histograms.

Oxide Thickness Example (l): Suppose the 98.2 had actually been recorded as 99.2, so the thickness values with the error are:

90.0, 92.2, 94.9, 92.7, 91.6, 88.2, 92.0, 99.2, 96.0, 91.1, 89.8, 91.5, 91.5, 90.6, 93.1, 88.9, 92.5, 92.4

How would the mean, median, range, IQR, and sample SD be affected by the error? How much would their values change? How would the graphical summaries change?

Correct Data:
88.2 88.9 89.8 90.0 90.6 91.1 91.5 91.5 91.6 92.0 92.2 92.4 92.5 92.7 93.1 94.9 96.0 98.2
Mean: 92.06667
Median: 91.8
Range: 98.2 - 88.2 = 10
IQR: 92.7 - 90.6 = 2.1
Sample SD: 2.440347
Graphical Summaries:

Incorrect Data:
88.2 88.9 89.8 90.0 90.6 91.1 91.5 91.5 91.6 92.0 92.2 92.4 92.5 92.7 93.1 94.9 96.0 99.2
Mean:
Median:
Range:
IQR:
Sample SD:
Graphical Summaries:

Code for graphical summaries:

    # Data: correct values, and the same values with 98.2 mis-recorded as 99.2
    Thickness <- c(90.0, 92.2, 94.9, 92.7, 91.6, 88.2, 92.0, 98.2, 96.0, 91.1, 89.8, 91.5, 91.5, 90.6, 93.1, 88.9, 92.5, 92.4)
    Thickness_Error <- replace(Thickness, Thickness == 98.2, 99.2)

    # Graphs of correct data
    # Two rows of graphs in one column, with smaller margins
    par(mfrow=c(2,1), mar=c(2.5,2,2,1))
    hist(Thickness, labels=TRUE, ylim=c(0,7), breaks=seq(86,100,1), main="Thicknesses (Å)")
    boxplot(Thickness, horizontal=TRUE, ylim=c(86,100))

    # Graphs of data with error
    hist(Thickness_Error, labels=TRUE, ylim=c(0,7), breaks=seq(86,100,1), main="Thicknesses with Error (Å)")
    boxplot(Thickness_Error, horizontal=TRUE, ylim=c(86,100))

    # Reset the graphics window to one graph at a time and restore the default margins
    par(mfrow=c(1,1), mar=c(5.1, 4.1, 4.1, 2.1))

COMPARING TWO SAMPLES OF DATA

Oxide Thickness Example (m): In addition to these 18 thicknesses of the silicon oxide layer from new wafers, the researchers also observed 17 thicknesses of the silicon oxide layer from recycled wafers.
Wafer            Thickness observed (Å)
New Wafer        90.0, 92.2, 94.9, 92.7, 91.6, 88.2, 92.0, 98.2, 96.0, 91.1, 89.8, 91.5, 91.5, 90.6, 93.1, 88.9, 92.5, 92.4
Recycled Wafer   91.8, 94.5, 93.9, 92.0, 89.9, 87.9, 92.8, 93.3, 92.6, 90.3, 92.8, 91.6, 92.7, 91.7, 89.3, 95.5, 93.6

Compare the measures of center and spread between the two treatment groups (New Wafer vs Recycled Wafer) just by looking at the comparison graphs (make sure to put them on the same axes!). Use numeric summaries to confirm your estimates.

ToothGrowth Example: Consider the ToothGrowth data set in R, which looks at the effect of vitamin C on tooth growth in guinea pigs. The data set has the variables len [length], the length of the odontoblast cells responsible for tooth growth; supp [supplement], the supplement type (orange juice or ascorbic acid); and dose, the dose of the supplement in milligrams/day.

Supplement VC: Ascorbic Acid
Dose   Length
0.5    4.2, 11.5, 7.3, 5.8, 6.4, 10.0, 11.2, 11.2, 5.2, 7.0
1.0    16.5, 16.5, 15.2, 17.3, 22.5, 17.3, 13.6, 14.5, 18.8, 15.5
2.0    23.6, 18.5, 33.9, 25.5, 26.4, 32.5, 26.7, 21.5, 23.3, 29.5

Supplement OJ: Orange Juice
Dose   Length
0.5    15.2, 21.5, 17.6, 9.7, 14.5, 10.0, 8.2, 9.4, 16.5, 9.7
1.0    19.7, 23.3, 23.6, 26.4, 20.0, 25.2, 25.8, 21.1, 14.5, 27.3
2.0    25.5, 26.4, 22.4, 24.5, 24.8, 30.9, 26.4, 27.3, 29.4, 23.0

In R: View(ToothGrowth)

ToothGrowth Example a. The lengths of those cells for animals that received the 2 mg/day vitamin C dose are given below. Save these observations into two vectors, VC_Dose2_len and OJ_Dose2_len.

Delivery        Length
Ascorbic Acid   23.6, 18.5, 33.9, 25.5, 26.4, 32.5, 26.7, 21.5, 23.3, 29.5
Orange Juice    25.5, 26.4, 22.4, 24.5, 24.8, 30.9, 26.4, 27.3, 29.4, 23.0

    VC_Dose2_len <- c(23.6, 18.5, 33.9, 25.5, 26.4, 32.5, 26.7, 21.5, 23.3, 29.5)
    OJ_Dose2_len <- c(25.5, 26.4, 22.4, 24.5, 24.8, 30.9, 26.4, 27.3, 29.4, 23.0)

Or, to avoid typing in values manually, you can define your vectors from the ToothGrowth data stored in R:

    Dose2 <- subset(ToothGrowth, dose == 2)
    VC_Dose2 <- subset(Dose2, supp == "VC"); VC_Dose2_len <- VC_Dose2$len
    OJ_Dose2 <- subset(Dose2, supp == "OJ"); OJ_Dose2_len <- OJ_Dose2$len

ToothGrowth Example b. Compare the center and spread of the data from the two supplement types at the 2.0 dose level numerically.

VC: mean: 26.14, median: 25.95, SD: 4.80, IQR: 5.425, range: 15.4
OJ: mean: 26.06, median: 25.95, SD: 2.66, IQR: 2.5, range: 8.5
(These IQR values match R's default quantile type, type = 7, rather than type = 2.)

While the groups appear to have very similar "centers", the spread of the VC values is quite a bit higher (SD, IQR, and range are all about twice as large in VC).

ToothGrowth Example c. Compare the center, spread, and shape of the data from the two supplement types at the 2.0 dose level graphically.
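One possible way to draw the comparison graphs for part (c), and to reproduce the numeric summaries from part (b), is sketched below. It uses the ToothGrowth data set that ships with R; the shared bin breaks, the helper name summarize_len, and the plot titles are choices made for this illustration and are not taken from the notes.

```r
# Keep only the animals that received the 2 mg/day dose
Dose2 <- subset(ToothGrowth, dose == 2)

# Numeric summaries for each supplement type (IQR uses R's default type)
summarize_len <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v),
    IQR = IQR(v), range = diff(range(v)))
}
summary_by_supp <- sapply(split(Dose2$len, Dose2$supp), summarize_len)
round(summary_by_supp, 3)

# Side-by-side boxplots share one axis automatically
boxplot(len ~ supp, data = Dose2, horizontal = TRUE,
        xlab = "Length", ylab = "Supplement")

# Histograms stacked in one column with identical breaks and axes
shared_breaks <- seq(18, 34, by = 2)
par(mfrow = c(2, 1))
hist(Dose2$len[Dose2$supp == "VC"], breaks = shared_breaks, ylim = c(0, 5),
     main = "VC, dose 2", xlab = "Length")
hist(Dose2$len[Dose2$supp == "OJ"], breaks = shared_breaks, ylim = c(0, 5),
     main = "OJ, dose 2", xlab = "Length")
par(mfrow = c(1, 1))
```

Using the same breaks and axis limits for both histograms keeps differences in spread from being hidden by differences in scale.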
The similar centers of the VC and OJ treatment lengths are apparent, as both sets of data are roughly symmetric about 26. The larger spread in the VC cell lengths is visible in the longer IQR box and whiskers in the VC boxplot, and also in more values lying further from the center (of 26) in the VC histogram.

ToothGrowth Example d. Combine the data from the two supplement types at the 2.0 dose level into a single vector of data. How do the center, spread, and shape of the combined data compare with the individual samples?

The mean of the combined data is the average of the means of the two groups, since the groups are the same size. The median of the combined group is the same as the median of the two subgroups. The range of the combined group (15.4) matches the range of the VC subgroup, which is the larger of the two subgroup ranges. The SD and IQR of the combined data are between those of the subgroups (SD: 3.77, IQR: 4.3), which is consistent with the two subgroups having similar centers.

SUMMARIZING QUALITATIVE/CATEGORICAL DATA:

NUMERIC SUMMARIES: Counts and proportions of responses within category levels are often reported as summary measures.
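Counts and proportions of this kind can be tabulated in R with table() and prop.table(). As a small illustration, the sketch below turns the new-wafer thicknesses into a two-level categorical variable, "thickness below 90 Å"; the object names below_90, counts, and props are assumptions of the example.

```r
Thickness <- c(90.0, 92.2, 94.9, 92.7, 91.6, 88.2, 92.0, 98.2, 96.0,
               91.1, 89.8, 91.5, 91.5, 90.6, 93.1, 88.9, 92.5, 92.4)

# Logical indicator: TRUE when the thickness is strictly below 90
below_90 <- Thickness < 90

# Counts in each category level
counts <- table(below_90)
counts

# Proportions in each category level
props <- prop.table(counts)
props
```

Note that the observation equal to 90.0 is counted as FALSE because the comparison is strict.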
GRAPHICAL SUMMARIES:

Pie Charts: chart used to display the percentage of the total number of measurements falling into each of the category levels of a variable by partitioning a circle. This chart is only applicable if each subject/observation falls into only one category level.
a. Compute the total number of outcomes in each category
b. Compute the percent of total outcomes for each category
c. Construct slices of a circle with area proportional to the category percentages
In R: pie(x, labels = names(x), …)

Bar Charts: chart used to display the frequency of responses or percentage of a total number of measurements falling into each category of a variable, using rectangles with heights equal to the frequency or percentage.
a. Compute the total number of outcomes in each category
b. Compute the percent of total outcomes for each category
c. Construct rectangles above the categorical values with heights equal to the category frequency or percentage and with equal widths (space is often left between rectangles)
In R: barplot(height, width = 1, space = NULL, names.arg = NULL, ...)

Pareto Charts: a bar chart with levels in decreasing order and an overlaid line graph displaying the cumulative percentage.
a. Construct a bar chart in decreasing frequency order (left vertical axis: frequency)
b. Compute the cumulative percentage in decreasing frequency order
c. Construct the line by plotting the cumulative percentage vertically (right axis) at the right boundary of each bar
In R: pareto.chart(data, plot=TRUE, …) {qcc package}

Oxide Thickness Example (n): Suppose we are interested in the categorical variable "Oxide Thickness below 90 Å". It would have 2 levels: True [/Yes/1] and False [/No/0]. Summarize the observed sample of thicknesses from new wafers according to that variable.
88.2 88.9 89.8 90.0 90.6 91.1 91.5 91.5 91.6 92.0 92.2 92.4 92.5 92.7 93.1 94.9 96.0 98.2

Government Spending Example (Ott and Longnecker pg 125): The US government spent more than $3.6 trillion in the 2014 fiscal year. The following table provides broad categories that show the expenditures of the federal government for domestic and defense programs.

Federal Program             2014 Expenditures (billions of dollars)
National Defense            $612
Social Security             $852
Medicare & Medicaid         $821
National Debt Interest      $253
Major Social-Aid Programs   $562
Other                       $532
Total                       $3635

Government Spending Example (a). Identify what graphical summaries would be most useful for comparing the expenditures across federal programs.

Each dollar fits into one of the category levels, so we are summarizing categorical/qualitative data. Thus, we can consider making a pie chart or a bar chart. We have 6 category outcomes/levels, which is borderline for making a useful pie chart. (Notice that we are given a frequency table above.)

Government Spending Example (b). Construct the graphical summaries identified in (a) using R.

    # Pie chart first
    Dollars <- c(612, 852, 821, 253, 562, 532)
    pie(x=Dollars, labels=c("Defense", "SS", "Medical", "Debt", "Social-Aid", "Other"), main="2014 Gov Expenditures")

    # Bar plot (frequency)
    barplot(height=Dollars, names.arg=c("Defense", "SS", "Medical", "Debt", "Social-Aid", "Other"), main="2014 Gov Expenditures", ylim=c(0,1000))

    # Bar plot (relative frequency)
    barplot(height=Dollars/sum(Dollars), names.arg=c("Defense", "SS", "Medical", "Debt", "Social-Aid", "Other"), main="2014 Gov Expenditures", ylim=c(0,0.3))

    # Pareto chart
    #Dollars <- c(612, 852, 821, 253, 562, 532)
    names(Dollars) <- c("Defense", "SS", "Medical", "Debt", "Social-Aid", "Other")
    # Load the pareto.chart() function from the qcc package
    #install.packages("qcc")
    library(qcc)
    pareto.chart(Dollars, main="Expenditures")

Government Spending Example (c). Which graph makes it easier to compare the relative frequency of the category levels? Why?

It is easier for me to compare the relative heights of the bars in the bar plot than the relative areas of the pie chart pieces. I can compare the heights to the left axis, while the pie chart areas all need to be compared to one another. The Pareto chart further organizes the levels by decreasing relative frequency and displays the cumulative percentage.
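If the qcc package is not available, the Pareto chart steps described earlier can be approximated in base R. The sketch below is not the pareto.chart() implementation; it is a hand-built version under the assumptions that bars use barplot()'s default width of 1 (so each bar's right edge is its midpoint plus 0.5) and that the cumulative percentage is rescaled onto the dollar axis. The names sorted, cum_pct, total, and mids are illustrative.

```r
Dollars <- c(Defense = 612, SS = 852, Medical = 821, Debt = 253,
             "Social-Aid" = 562, Other = 532)

# Step a: order the categories by decreasing amount
sorted <- sort(Dollars, decreasing = TRUE)

# Step b: cumulative percentage in that order
cum_pct <- 100 * cumsum(sorted) / sum(sorted)
total <- sum(sorted)

# Leave room on the right for a percentage axis
old_par <- par(mar = c(5.1, 4.1, 4.1, 4.1))

# Bars on the left (dollar) axis; barplot() returns the bar midpoints
mids <- barplot(sorted, ylim = c(0, total), ylab = "Billions of dollars",
                main = "2014 Gov Expenditures (Pareto sketch)")

# Step c: cumulative line at the right edge of each bar, rescaled to dollars
lines(mids + 0.5, cum_pct / 100 * total, type = "b", pch = 16)
axis(side = 4, at = seq(0, total, length.out = 5),
     labels = paste0(seq(0, 100, by = 25), "%"))

# Restore the previous margins
par(old_par)
```

Setting the left axis to run from 0 to the total lets both axes share one scale, so 100% on the right lines up with the total on the left.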