2B03 Assignment 1
Descriptive Statistics (Chapters 1 & 2)

Matthew Musulin 400329990
Due Thursday September 23 2021

Instructions: You are to use R Markdown for generating your assignment output file. You begin with the R Markdown script downloaded from A2L, and need to pay attention to the information provided via introductory material posted to A2L on working with R and R Markdown. Having downloaded all necessary files, placed them in the same folder/directory, and added your answers to the R Markdown script, you then are to generate your output file using “Knit to PDF” and, when complete, upload both your R Markdown file and your PDF file to the appropriate folder on A2L.

1. Define the following terms in a sentence (or short paragraph) and state a formula if appropriate (this question is worth 5 marks).

i. Categorical Data

Categorical data are data whose values fall into categories or groups rather than being measured on a numerical scale.

ii. Frequency Distribution

A frequency distribution is a table or function that displays the number of observations falling within each given interval (class).

iii. Sturges' Rule

Sturges' rule is a rule for determining the desirable number of classes $K$, where $n$ is the number of observations and $K$ is rounded to the nearest integer:

$$K = 1 + 3.3 \log_{10} n$$

iv. Cross Tabulations

Cross tabulations are tabular summaries for two variables.

v. Sample Median

The sample median is the middle value after putting all observations in ascending order. For an even number of observations $n$ it is the average of the two middle ordered values:

$$\frac{X_{(n/2)} + X_{(n/2+1)}}{2}$$

For odd $n$ it is the single middle value $X_{((n+1)/2)}$.

2. Consider the following dataset on the final grade received in a particular course (grade) and attendance (attend, number of times present when work was handed back during the semester out of a maximum of six times). Note that R has the ability to read data files directly from a URL, so here (unlike the odesi data that you manually retrieve) you do not have to manually download the data provided you are connected to the internet (this question is worth 8 marks).

    course <- read.table("https://socialsciences.mcmaster.ca/racinej/files/attend.RData")
    attach(course)

i. Create a scatterplot of the data with attend on the horizontal axis and grade on the vertical axis via the command plot(attend,grade)

[Figure: scatterplot of grade (vertical axis) against attend (horizontal axis)]

Do you see any pattern present in the data? If so describe it in your own words.

In the scatterplot, students who attended more classes generally had a higher final grade; conversely, students who attended fewer classes generally had a lower final grade.

ii. Construct the average grades for persons attending 0 times, and then repeat for those attending 1 time, 2 times, and so on through 6 times using something like

    mean(grade[attend == 0])
    ## [1] 43.5
    mean(grade[attend == 1])
    ## [1] 56.83333
    mean(grade[attend == 2])
    ## [1] 51.14286
    mean(grade[attend == 3])
    ## [1] 66
    mean(grade[attend == 4])
    ## [1] 74.88889
    mean(grade[attend == 5])
    ## [1] 76
    mean(grade[attend == 6])
    ## [1] 81.375

Do you see any pattern present in the means?

Yes, the pattern in the means matches the pattern in the scatterplot: as attendance increases, the average grade generally increases, from 43.5 for students who were never present to 81.375 for those present all six times. The one exception is the dip from 56.83 (present once) to 51.14 (present twice).

3. This question requires you to download data obtained from Statistics Canada (this question is worth 8 marks).

If you are working on campus go to www.odesi.ca (off-campus users must first sign into the McMaster library via libaccess at library.mcmaster.ca/libaccess, search for odesi via the library search facilities, then select odesi from these search results). Next, select the “Find data” field in odesi and search for “Labour Force Survey June, 2021”, then scroll down and select the Labour Force Survey, June 2021 [Canada]. Next click on the “Explore & Download” icon, then click on the download icon (i.e., the square diskette icon along the upper right of the browser pane), then click on “Select Data Format”, and then scroll down and select “Comma Separated Value file” (csv) which, after a brief pause, will download the data to your hard drive (you may have to extract the file from a zip archive depending on which operating system you are using).

Finally, make sure that you place this csv file in the same directory/folder as your R code file (this file ought to have the name LFS-71M0001-E-2021-June_F1.csv, and in RStudio select the menu item Session -> Set Working Directory -> To Source File Location). There will be another file with (almost) the same name but with the extension .pdf; this is the pdf documentation that describes the variables in this data set. Note that it would be prudent to retain this file as we will use it in future assignments.

Next, open RStudio and make sure this csv file and your R Markdown script are in the same directory (in RStudio open the Files tab, in the lower right pane by default, and refresh the file listing if necessary).
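Because read.csv() is given a bare file name in the next step, it only works when the csv file sits in the current working directory. A minimal sketch of a check that could be run first is shown here; `check_data_file` is a hypothetical helper that is not part of the assignment, and the demonstration uses a temporary file rather than the real survey data.

```r
# Hypothetical helper (not part of the assignment): stop with a clear message
# if the expected file is not reachable from the current working directory.
check_data_file <- function(file_name) {
  if (!file.exists(file_name)) {
    stop("File not found from ", getwd(), ": ", file_name)
  }
  TRUE
}

# Demonstration with a temporary stand-in file (made-up contents).
demo_file <- tempfile(fileext = ".csv")
writeLines(c("HRLYEARN", "22.00"), demo_file)
found <- check_data_file(demo_file)

# A missing file produces an error, captured here as a message string.
missing_msg <- tryCatch(
  check_data_file("no-such-file.csv"),
  error = function(e) conditionMessage(e)
)
```

In the assignment itself the call would be check_data_file("LFS-71M0001-E-2021-June_F1.csv"), run after setting the working directory to the source file location.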
Then read the file as follows:

    lfp <- read.csv("LFS-71M0001-E-2021-June_F1.csv")

This data set contains some interesting variables on the labour force status of a random subset of Canadians. We will focus on the variable HRLYEARN (hourly earnings) described on page 22 of the pdf file LFS-71M0001-E-2021-June.pdf. We will also consider other variables so that we can condition our analysis on these variables by restricting attention to subsets of the data, e.g., for full-time workers only (FTPTMAIN==1) reporting positive earnings. We also look at the highest educational attainment for people in the survey and consider both high school graduates (EDUC==2) and those holding a bachelor's degree (EDUC==5). To construct these subsets we can use the R command subset as follows (the ampersand & is the logical operator "and"; see ?subset for details on the subset command):

    hs <- subset(lfp, FTPTMAIN == 1 & EDUC == 2 & HRLYEARN > 0)$HRLYEARN
    ba <- subset(lfp, FTPTMAIN == 1 & EDUC == 5 & HRLYEARN > 0)$HRLYEARN

These commands simply tell R to take a subset of the data frame lfp for full-time workers who report positive earnings and have either a high school diploma or a university bachelor's degree, then retain only the variable HRLYEARN and store it in the variables named hs (hourly earnings for high school graduates) and ba (hourly earnings for university graduates). The following questions ask you to compute various descriptive statistics and graphical summaries of these two variables. Note that nothing will be printed out by running the two lines above; they simply create subsets of the data for subsequent use.

i. Report the five number summary for each subset (hint: fivenum(hs) etc.). Indicate what each number tells us (hint: see help by typing ?fivenum in the console pane).

In each of the following vectors the numbers are, in order, the minimum, lower hinge, median, upper hinge, and maximum of the input data.

    fivenum(hs)
    ## [1]   3.21  17.31  22.00  29.86 108.13
    fivenum(ba)
    ## [1]   3.30  24.53  34.62  45.99 107.39

ii. What can you say about relative wages of high school and university graduates?

University graduates are generally paid more than high school graduates: the median hourly wage is 34.62 for university graduates versus 22.00 for high school graduates, and both hinges are also higher (24.53 versus 17.31 and 45.99 versus 29.86).

iii. Using Sturges' rule, how many classes would you construct for the hs and ba wage data (hint: length() gives you the length of the vector, log10() may also be useful, so something like round(1+3.3*log10(length(hs))) might do the trick for the hs data at least)?

    round(1 + 3.3 * log10(length(hs)))
    ## [1] 14
    round(1 + 3.3 * log10(length(ba)))
    ## [1] 14

iv. Plot histograms for the hs and ba data on separate graphs (hint: hist()).

    hist(hs)

[Figure: "Histogram of hs"; horizontal axis hs, vertical axis Frequency]

    hist(ba)

[Figure: "Histogram of ba"; horizontal axis ba, vertical axis Frequency]

v. Does the number of classes correspond to Sturges' rule?

No, neither the histogram of hs nor the histogram of ba corresponds to Sturges' rule, which gives 14 classes for both hs and ba.

vi. Plot density curves for the hs and ba data on the same graph and add a legend (hint: first use something like plot(density(...),col="blue",lty=1), where you need to fill in the (...) parts with the name of your data object, e.g., hs etc., then lines(density(...),col="red",lty=2), then see the help page by typing ?legend in the console pane). Note that you can add a legend using something like

    plot(density(hs), col = "blue", lty = 1)
    lines(density(ba), col = "red", lty = 1)
    legend("topright", c("High School", "University"), lty = c(1, 1), col = c("blue", "red"), bty = "n")

[Figure: density curves, titled "density.default(x = hs)", N = 6740, Bandwidth = 1.446; vertical axis Density; legend: High School, University]

vii. What do these density curves tell us about the distribution of hourly wages for high school versus university graduates?

The density curves show that high school graduates' hourly wages are most densely concentrated between roughly 15 and 30 dollars an hour, while university graduates' wages are spread mainly between 20 and 50 dollars an hour. The university curve sits further to the right, which indicates that university graduates are more likely than high school graduates to hold higher-paying jobs.

4. Consider the following data on annual profits (in millions of dollars) for all firms in the textbook publishing industry in Canada (ignore the ## [1] and ## [11] that appear at the beginning of each line; this is simply the way R displays a vector of numbers):

    ##  [1] 21.000  6.300 12.700 14.600 12.100  5.080  0.145 14.100  5.840  9.030
    ## [11]  3.170  4.880

To set these values in a vector in R, if desired, you can use the command profits <- c(...) where ... are the values above separated by commas, e.g., profits <- c(3.67, 6.57, etc.)

i. How many observations are there (i.e., what is n, the sample size)?

The number of observations: n = 12

ii. What is the minimum, maximum, and range?

    min(profits)
    ## [1] 0.145
    max(profits)
    ## [1] 21
    range(profits)
    ## [1]  0.145 21.000
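Note that range(profits) returns the minimum and maximum rather than a single number. As a hedged sketch (not the assignment's prescribed method), the range as one value, the Sturges class count, and equal-width class counts can be computed from the twelve profit values as follows. The names `span`, `width`, `breaks`, and `freq_table` are illustrative choices, and cut() with right = FALSE is one of several ways to count observations per class.

```r
# The twelve profit values listed in the question.
profits <- c(21.000, 6.300, 12.700, 14.600, 12.100, 5.080,
             0.145, 14.100, 5.840, 9.030, 3.170, 4.880)

n     <- length(profits)
span  <- diff(range(profits))           # maximum minus minimum: 20.855
k     <- round(1 + 3.3 * log10(n))      # Sturges' rule as stated in the assignment: 5
width <- span / k                       # equal class width: 4.171

# Class boundaries from the sample minimum to the sample maximum.
breaks <- min(profits) + width * (0:k)
breaks[k + 1] <- max(profits)           # pin the last boundary to avoid rounding drift

# Left-closed classes [a, b); include.lowest = TRUE closes the last class on the right.
classes <- cut(profits, breaks = breaks, right = FALSE,
               include.lowest = TRUE, dig.lab = 5)
abs_freq <- as.vector(table(classes))

freq_table <- data.frame(
  class   = levels(classes),
  abs     = abs_freq,
  rel     = round(abs_freq / n, 4),
  cum_abs = cumsum(abs_freq),
  cum_rel = round(cumsum(abs_freq) / n, 4)
)
print(freq_table)
```

Pinning the last boundary to max(profits) is a defensive choice: repeated floating-point addition could otherwise leave the final boundary a hair below 21 and drop the largest observation from the last class.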

iii. How many classes would you create if you used Sturges' rule?

    n = 12
    k = round(1 + 3.3 * log10(12))
    k
    ## [1] 5

iv. What are the class widths and class boundaries based on your answers to the previous two questions, using Sturges' rule, the sample minimum as the first lower class boundary, and the sample maximum as the last upper class boundary?

    Width = (max(profits) - min(profits)) / k
    Width
    ## [1] 4.171

Starting from the sample minimum and adding the width repeatedly gives the class boundaries 0.145, 4.316, 8.487, 12.658, 16.829, and 21.

v. Complete the table below showing the absolute frequency, relative frequency, cumulative frequency, and cumulative relative frequency for the above data. For this question you will need to do some manual data entry in the table skeleton provided below after you have figured out what the counts are based on your answers to the previous set of questions. In particular, you are to use Sturges' rule (above) to obtain the desired number of classes, and use the range of the data (above) when constructing your class boundaries (note that you need to have a blank line between each new row that you add to the table, and the last class must be closed at the right; this question is worth 8 marks).

    Class             Absolute    Relative    Cumulative Absolute   Cumulative Relative
                      Frequency   Frequency   Frequency             Frequency

    [0.145,4.316)     2           0.1667      2                     0.1667

    [4.316,8.487)     4           0.3333      6                     0.5

    [8.487,12.658)    2           0.1667      8                     0.6667

    [12.658,16.829)   3           0.25        11                    0.9167

    [16.829,21]       1           0.0833      12                    1

5. Since we use the summation operator ($\sum_{i=1}^{n}$) often in class, let's make sure we understand how to calculate objects that can be expressed succinctly using this operator.

i. Care must be exercised when expanding certain sums and quantities. Let the sample size be $n = 3$, and let $X_1 = 1$, $X_2 = -1$, and $X_3 = 3$. Demonstrate in R that it is generally not true that $\sum_{i=1}^{n} X_i^2 = \left(\sum_{i=1}^{n} X_i\right)^2$ (this question is worth 2 marks).

    data <- c(1, -1, 3) # Create a vector to hold data.
    data
    ## [1]  1 -1  3
    sum = 0.0 # Calculate the sum of Xi's.
    sum2 = 0.0
    for (i in 1:3) {
      sum = sum + data[i]
    }
    cat(sprintf("Sum of Xi's is %.2f\n", sum))
    ## Sum of Xi's is 3.00
    # Calculate the sum of Xi's squared.
    for (i in 1:3) {
      sum2 = sum2 + data[i] * data[i]
    }
    cat(sprintf("Sum of Xi's squared is %.2f\n", sum2))
    ## Sum of Xi's squared is 11.00
    sq_sumx = sum * sum # Calculate the square of the sum of Xi's.
    cat(sprintf("The square of the sum of Xi's %.2f\n", sq_sumx))
    ## The square of the sum of Xi's 9.00
    cat("From the math above you can see that the sum of Xi's squared is not equal to the square of
    ## From the math above you can see that the sum of Xi's squared is not equal to the square of t

ii. Using the same data as in the previous question, compute the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$, then compute the sample standard deviation $\hat{\sigma} = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)}$ in two ways: longhand (you can use R and use longhand notation, e.g., X[1], X[2], and X[3] or 1, -1, and 3, whichever you prefer), then using R functions such as mean() and sd() (this question is worth 2 marks).

    mean = sum / 3 # Calculate the mean longhand.
    cat(sprintf("Mean: %.2f\n", mean))
    ## Mean: 1.00
    sd = 0.0
    var = 0.0
    for (i in 1:3) {
      var = var + ((data[i] - mean)^2) / 2 # Divide each squared deviation by n - 1 = 2.
    }
    sd = sqrt(var) # Calculate the standard deviation longhand.
    cat(sprintf("Standard Deviation: %.2f\n", sd))
    ## Standard Deviation: 2.00
    mean2 = mean(data) # Calculate the mean using R function.
    cat(sprintf("Mean using mean() %.2f\n", mean2))
    ## Mean using mean() 1.00
    sd2 = sd(data) # Calculate the standard deviation using R function.
    cat(sprintf("Standard Deviation using sd() %.2f\n", sd2))
    ## Standard Deviation using sd() 2.00

iii. Express $\sum_{i=1}^{n} K$, where $K$ is a constant (i.e., a number that does not change, hence has no subscript $i$), in terms of $n$ and $K$ only (hint: a constant does not have a subscript as it does not change with $i$, but it is being added/summed, so type out a string of $n$ constants etc.).
Then for $K = 3$ and $n = 5$ determine $\sum_{i=1}^{n} K$ using your result purely using $n$ and $K$ (i.e., without a summation sign; this question is worth 2 bonus marks, and you do not use R, rather use your powerful sense of logic and type out your answer with an explanation).
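
The longhand loops in question 5 can be cross-checked with vectorized calls. The following is a sketch rather than the assignment's required approach: it uses the same three values as parts i and ii, avoids reusing the names sum, mean, and sd for variables, and adds a numerical check of the constant sum in part iii for $K = 3$ and $n = 5$. All variable names are illustrative.

```r
# Same data as question 5, parts i and ii.
x <- c(1, -1, 3)
n <- length(x)

sum_of_squares <- sum(x^2)    # 1 + 1 + 9 = 11
square_of_sum  <- sum(x)^2    # (1 - 1 + 3)^2 = 9

# Sample mean and sample standard deviation written out from their formulas.
x_bar <- sum(x) / n
s     <- sqrt(sum((x - x_bar)^2) / (n - 1))

# Part iii: summing a constant K a total of n_k times is K added to itself
# n_k times, which equals n_k * K.
K   <- 3
n_k <- 5
sum_of_constant <- sum(rep(K, n_k))

cat(sprintf("Sum of squares: %.2f, square of sum: %.2f\n",
            sum_of_squares, square_of_sum))
cat(sprintf("Mean: %.2f, standard deviation: %.2f\n", x_bar, s))
cat(sprintf("Sum of %d copies of %d: %d\n", n_k, K, sum_of_constant))
```

Using distinct names such as x_bar and s keeps the built-in functions mean() and sd() unshadowed, which makes the comparison with their results easier to read.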