2B03 Assignment 1
Descriptive Statistics (Chapters 1 & 2)
Matthew Musulin 400329990
Due Thursday September 23 2021
Instructions:
You are to use R Markdown for generating your assignment output file. You begin
with the R Markdown script downloaded from A2L, and need to pay attention to information
provided via introductory material posted to A2L on working with R and R Markdown. Having
downloaded all necessary files, placed them in the same folder/directory, and added your answers
to the R Markdown script, you then are to generate your output file using “Knit to PDF” and,
when complete, upload both your R Markdown file and your PDF file to the appropriate folder on
A2L.
1. Define the following terms in a sentence (or short paragraph) and state a formula if appropriate (this question is worth 5 marks).
i. Categorical Data
Categorical Data represents types of data that can be divided into categories or groups.
ii. Frequency Distribution
A Frequency Distribution is a table or function that displays the number of observations falling within each given interval (class).
iii. Sturges’ Rule
Sturges’ Rule is a rule for determining the desirable number of classes K for a sample of size n, where K is rounded to the nearest integer:
$K = 1 + 3.3 \log_{10} n$
iv. Cross Tabulations
Cross Tabulations are tabular summaries for two variables.
v. Sample Median
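The Sample Median is the middle value after putting all n observations in ascending order. When n is odd it is the single middle observation, $X_{((n+1)/2)}$; when n is even it is the average of the two middle observations:
$\frac{X_{(n/2)} + X_{(n/2+1)}}{2}$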
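The median definition can be checked against R's built-in median(). The following is a minimal sketch, assuming a hypothetical helper named sample_median and made-up input values; it is not part of the assignment's required code.

```r
# Sample median computed directly from the order statistics.
sample_median <- function(x) {
  x_sorted <- sort(x)          # X_(1) <= X_(2) <= ... <= X_(n)
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    x_sorted[(n + 1) / 2]      # odd n: the single middle value
  } else {
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2  # even n: average the two middle values
  }
}

odd_example  <- sample_median(c(1, -1, 3))     # made-up values, n = 3
even_example <- sample_median(c(4, 1, 3, 2))   # made-up values, n = 4
```

Both results should agree with median() applied to the same vectors.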
2. Consider the following dataset on the final grade received in a particular course (grade) and attendance (attend, number of times present when work was handed back during the semester out of a maximum of six times). Note that R has the ability to read datafiles directly from a URL, so here (unlike the odesi data that you manually retrieve) you do not have to manually download the data providing you are connected to the internet (this question is worth 8 marks).
course <- read.table("https://socialsciences.mcmaster.ca/racinej/files/attend.RData")
attach(course)
i. Create a scatterplot of the data with attend on the horizontal axis and grade on the vertical axis via the command
plot(attend,grade)
[Figure: scatterplot of grade (vertical axis, roughly 40 to 100) against attend (horizontal axis, 0 to 6)]
Do you see any pattern present in the data? If so describe it in your own words.
In the scatterplot, students who attended more often generally had a higher final grade, while students who attended less often generally had a lower final grade. In other words, attendance and grade are positively associated.
ii. Construct the average grades for persons attending 0 times, and then repeat for those attending 1 time, 2 times, and so on through 6 times using something like
mean(grade[attend==0])
## [1] 43.5
mean(grade[attend==1])
## [1] 56.83333
mean(grade[attend==2])
## [1] 51.14286
mean(grade[attend==3])
## [1] 66
mean(grade[attend==4])
## [1] 74.88889
mean(grade[attend==5])
## [1] 76
mean(grade[attend==6])
## [1] 81.375
Do you see any pattern present in the means?
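Yes, the pattern in the means mirrors the pattern in the scatterplot. As attendance increases, the average grade generally increases, from 43.5 at zero attendances to 81.375 at six. The one exception is a dip from 56.83 (one attendance) to 51.14 (two attendances).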
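The seven separate mean() calls can also be collapsed into one grouped computation. A minimal sketch using tapply() on a small made-up data set (the values below are hypothetical and are not the course data):

```r
# Hypothetical toy data, NOT the course data set.
attend <- c(0, 0, 1, 1, 2, 2, 2)
grade  <- c(40, 50, 55, 65, 60, 70, 80)

# One mean per distinct value of attend.
group_means <- tapply(grade, attend, mean)
group_means

# Same value as the longhand form for a single group.
mean_attend_2 <- mean(grade[attend == 2])
```

With the real data already attached, tapply(grade, attend, mean) would be expected to reproduce the seven means reported above in a single named vector.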
3.
This question requires you to download data obtained from Statistics Canada. If you are working on
campus go to www.odesi.ca (off campus users must first sign into the McMaster library via libaccess at
library.mcmaster.ca/libaccess, search for odesi via the library search facilities then select odesi from
these search results). Next, select the “Find data” field in odesi and search for “Labour Force Survey
June, 2021”, then scroll down and select the Labour Force Survey, June 2021 [Canada]. Next click
on the “Explore & Download” icon, then click on the download icon (i.e., the diskette icon, square,
along the upper right of the browser pane) and then click on “Select Data Format” then scroll down
and select “Comma Separated Value file” (csv) which, after a brief pause, will download the data to
your hard drive (you may have to extract the file from a zip archive depending on which operating
system you are using). Finally, make sure that you place this csv file in the same directory/folder as
your R code file (this file ought to have the name LFS-71M0001-E-2021-June_F1.csv, and in RStudio
select the menu item Session -> Set Working Directory -> To Source File Location). There will be
another file with (almost) the same name but with the extension .pdf that is the pdf documentation
that describes the variables in this data set. Note that it would be prudent to retain this file as we will
use it in future assignments (this question is worth 8 marks).
Next, open RStudio, make sure this csv file and your R Markdown script are in the same directory (in
RStudio open the Files tab (lower right pane by default) and refresh the file listing if necessary). Then
read the file as follows:
lfp <- read.csv("LFS-71M0001-E-2021-June_F1.csv")
This data set contains some interesting variables on the labour force status of a random subset of Canadians. We will focus on the variable HRLYEARN (hourly earnings) described on page 22 of the pdf file LFS-71M0001-E-2021-June.pdf. We will also consider other variables so that we can condition our analysis on these variables by restricting attention to subsets of the data, e.g., for full-time workers only (FTPTMAIN==1) reporting positive earnings. We also look at the highest educational attainment for people in the survey and consider both high school graduates (EDUC==2) and those holding a bachelors degree (EDUC==5). To construct these subsets we can use the R command subset as follows (the ampersand is the logical operator and - see ?subset for details on the subset command):
hs <- subset(lfp, FTPTMAIN==1 & EDUC==2 & HRLYEARN > 0)$HRLYEARN
ba <- subset(lfp, FTPTMAIN==1 & EDUC==5 & HRLYEARN > 0)$HRLYEARN
These commands simply tell R to take a subset of the data frame lfp for full-time workers having either a high school diploma or university bachelors degree for those reporting positive earnings, and then retain only the variable HRLYEARN and store these in the variables named hs (hourly earnings for high-school graduates) or ba (hourly earnings for university graduates). The following questions ask you to compute various descriptive statistics and other graphical summaries of these two variables. Note that nothing will be printed out by running the two lines above - they simply create subsets of the data for subsequent use.
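To see what the subset(...)$HRLYEARN pattern returns without downloading the survey file, here is a minimal sketch on a made-up five-row data frame. The data frame toy and its values are hypothetical; only the column names are taken from the survey.

```r
# Hypothetical miniature data frame using the same column names as the survey file.
toy <- data.frame(
  FTPTMAIN = c(1, 1, 2, 1, 1),
  EDUC     = c(2, 5, 2, 2, 5),
  HRLYEARN = c(20, 35, 18, 0, 40)
)

# Keep full-time (FTPTMAIN == 1) high school graduates (EDUC == 2) with positive earnings,
# then extract only the HRLYEARN column as a plain numeric vector.
toy_hs <- subset(toy, FTPTMAIN == 1 & EDUC == 2 & HRLYEARN > 0)$HRLYEARN
toy_ba <- subset(toy, FTPTMAIN == 1 & EDUC == 5 & HRLYEARN > 0)$HRLYEARN
```

Rows failing any one of the three conditions are dropped, so the part-time row and the zero-earnings row do not appear in toy_hs.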
i. Report the five number summary for each subset (hint: fivenum(hs) etc.). Indicate what each number tells us (hint: see help by typing ?fivenum in the console pane).
In the following vector the numbers indicate (minimum, lower-hinge, median, upper-hinge, maximum) for the inputted data
fivenum(hs)
## [1]   3.21  17.31  22.00  29.86 108.13
fivenum(ba)
## [1]   3.30  24.53  34.62  45.99 107.39
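To make the ordering of the five numbers concrete, here is a minimal sketch on a made-up vector where each value is easy to verify by eye. The vector x and the labels attached to the result are assumptions for illustration; fivenum() itself returns an unnamed vector.

```r
x <- 1:9                    # hypothetical values, chosen so the answer is easy to check by eye
five <- fivenum(x)          # minimum, lower hinge, median, upper hinge, maximum
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
five
```

The hinges are the medians of the lower and upper halves of the sorted data, so they are close to, but not always identical to, the first and third quartiles.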
ii. What can you say about relative wages of high school and university graduates?
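University graduates are generally paid more than high school graduates. The median hourly wage is 34.62 for university graduates (ba) versus 22.00 for high school graduates (hs), and the lower and upper hinges are also higher (24.53 and 45.99 versus 17.31 and 29.86). The minimum and maximum values are similar for the two groups.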
iii. Using Sturges’ rule, how many classes would you construct for the hs and ba wage data (hint - length() gives you the length of the vector, log10() may also be useful, so something like round(1+3.3*log10(length(hs))) might do the trick for the hs data at least)?
round(1+3.3*log10(length(hs)))
## [1] 14
round(1+3.3*log10(length(ba)))
## [1] 14
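As a check on the hs result, the arithmetic can be written out by hand, taking n = 6740 from the N reported in the density plot output for hs later in this document (intermediate values are rounded):

```latex
K = 1 + 3.3 \log_{10}(6740) \approx 1 + 3.3 \times 3.829 \approx 13.63
\quad\Longrightarrow\quad K = 14 \text{ after rounding to the nearest integer}
```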
iv. Plot histograms for the hs and ba data on separate graphs (hint: hist()).
hist(hs)
[Figure: Histogram of hs; horizontal axis hs (0 to 100), vertical axis Frequency (0 to 2500)]
hist(ba)
[Figure: Histogram of ba; horizontal axis ba (0 to 100), vertical axis Frequency (0 to 2000)]
v. Do the number of classes correspond to Sturges’ rule?
No. Neither the histogram of hs nor the histogram of ba uses the 14 classes that Sturges’ rule gives for both hs and ba.
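One likely reason for the mismatch, offered as background rather than as part of the submitted answer: by default hist() takes its suggested class count from nclass.Sturges(), which uses ceiling(log2(n) + 1), and then moves the breakpoints to round ("pretty") values, so the plotted number of classes need not equal 1 + 3.3 log10(n). A minimal sketch of forcing exactly K classes by passing explicit breakpoints, using simulated wages (the data and the names x, k, breaks, and h are assumptions for illustration):

```r
set.seed(1)
x <- rlnorm(500, meanlog = 3, sdlog = 0.5)   # simulated stand-in for hs or ba

k <- round(1 + 3.3 * log10(length(x)))       # Sturges' rule as used in this assignment

# k classes need k + 1 equally spaced breakpoints from the minimum to the maximum.
breaks <- seq(min(x), max(x), length.out = k + 1)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(x, breaks = breaks, plot = FALSE)
length(h$counts)                             # number of classes actually used
```

Replacing x with hs or ba and dropping plot = FALSE would draw a histogram whose class count matches the rule, at the cost of less tidy axis breakpoints.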
vi. Plot density curves for the hs and ba data on the same graph and add a legend (hint: first use something like plot(density(...),col="blue",lty=1) (you need to fill in (...) parts with the name of your data object, e.g., hs etc.) then lines(density(...),col="red",lty=2), then see the help page by typing ?legend in the console pane. Note that you can add a legend using something like
plot(density(hs), col="blue", lty=1)
lines(density(ba), col="red", lty=1)
legend("topright", c("High School","University"), lty=c(1,1), col=c("blue","red"), bty="n")
[Figure: density curves, titled density.default(x = hs); horizontal axis labelled N = 6740 Bandwidth = 1.446 (0 to 100), vertical axis Density (0.00 to 0.04); legend: High School (blue), University (red)]
vii. What do these density curves tell us about the distribution of hourly wages for high school versus university graduates?
The density curves show that High School graduate wages are most concentrated between roughly 15 and 30 dollars an hour, while University graduate wages are mainly spread over roughly 20 to 50 dollars an hour. The University curve sits further to the right and is more spread out, so University graduates tend to earn higher hourly wages, with more variation among them, than High School graduates.
4. Consider the following data on annual profits (in millions of dollars) for all firms in the textbook publishing industry in Canada (ignore the ## [1] and ## [11] that appear at the beginning of each line; this is simply the way R displays a vector of numbers):
##  [1] 21.000  6.300 12.700 14.600 12.100  5.080  0.145 14.100  5.840  9.030
## [11]  3.170  4.880
To set these values in a vector in R, if desired, you can use the command profits <- c(...) where ... are the values above separated by commas, e.g., profits <- c(3.67, 6.57, etc.)
i. How many observations are there (i.e., what is n, the sample size?)
The number of observations: n = 12
ii. What is the minimum, maximum, and range?
min(profits)
## [1] 0.145
max(profits)
## [1] 21
range(profits)
## [1]  0.145 21.000
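Note that range() returns the minimum and maximum as a pair. If the range is wanted as a single number (maximum minus minimum), diff() can be applied to that pair. A minimal sketch, entering the twelve profit values listed above by hand (the names n and profit_range are assumptions):

```r
profits <- c(21.000, 6.300, 12.700, 14.600, 12.100, 5.080,
             0.145, 14.100, 5.840, 9.030, 3.170, 4.880)

n <- length(profits)                 # sample size
profit_range <- diff(range(profits)) # maximum minus minimum
```

Here the difference is 21.000 - 0.145 = 20.855, the quantity divided by the number of classes when class widths are computed.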
iii. How many classes would you create if you used Sturges’ rule?
n = 12
k = round(1+3.3*log10(12))
k
## [1] 5
iv. What are the class widths and class boundaries based on your answers to the previous two questions, using Sturges’ rule, the sample minimum as the first lower class boundary, and the sample maximum as the last upper class boundary?
Width = (max(profits)-min(profits))/k
Width
## [1] 4.171
v. Complete the table below showing the absolute frequency, relative frequency, cumulative frequency,
and cumulative relative frequency for the above data. For this question you will need to do some
manual data entry in the table skeleton provided below after you have figured out what the counts
are based on your answers to the previous set of questions. In particular, you are to use Sturges’
rule (above) to obtain the desired number of classes, and use the range of the data (above) when
constructing your class boundaries (note that you need to have a blank line between each new row
that you add to the table, and the last class must be closed at the right - this question is worth 8
marks).
| Class | Absolute Frequency | Relative Frequency | Cumulative Absolute Frequency | Cumulative Relative Frequency |
|---|---|---|---|---|
| [0.145,4.316) | 2 | 0.1667 | 2 | 0.1667 |
| [4.316,8.487) | 4 | 0.3333 | 6 | 0.5 |
| [8.487,12.658) | 2 | 0.1667 | 8 | 0.6667 |
| [12.658,16.829) | 3 | 0.25 | 11 | 0.9167 |
| [16.829,21] | 1 | 0.0833 | 12 | 1 |
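The counts in the table can be cross-checked in R rather than tallied by hand. The following is a minimal sketch, not the method the assignment asks for: it uses cut() with the same five equal-width classes, and the names breaks, classes, abs_freq, and freq_table are assumptions.

```r
profits <- c(21.000, 6.300, 12.700, 14.600, 12.100, 5.080,
             0.145, 14.100, 5.840, 9.030, 3.170, 4.880)
k <- round(1 + 3.3 * log10(length(profits)))              # 5 classes

# k + 1 equally spaced boundaries from the minimum to the maximum.
breaks <- seq(min(profits), max(profits), length.out = k + 1)

# right = FALSE gives [a, b) classes; include.lowest = TRUE closes the last class at the maximum.
classes <- cut(profits, breaks = breaks, right = FALSE, include.lowest = TRUE)

abs_freq <- as.vector(table(classes))
freq_table <- data.frame(
  class        = levels(classes),
  absolute     = abs_freq,
  relative     = abs_freq / sum(abs_freq),
  cum_absolute = cumsum(abs_freq),
  cum_relative = cumsum(abs_freq) / sum(abs_freq)
)
freq_table
```

The class labels printed by cut() are rounded for display, so they may not match the boundaries typed in the table character for character, but the counts should agree.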
5. Since we use the summation operator ($\sum_{i=1}^{n}$) often in class, let’s make sure we understand how to calculate objects that can be expressed succinctly using this operator.
i. Care must be exercised when expanding certain sums and quantities. Let the sample size be $n = 3$, and let $X_1 = 1$, $X_2 = -1$, and $X_3 = 3$. Demonstrate in R that it is generally not true that $\sum_{i=1}^{n} X_i^2 = (\sum_{i=1}^{n} X_i)^2$ (this question is worth 2 marks).
data <- c(1, -1, 3) # Create a vector to hold data.
data
## [1]  1 -1  3
sum = 0.0 # Calculate the sum of Xi's.
sum2 = 0.0
for (i in 1:3)
{
sum = sum + data[i]
}
cat(sprintf("Sum of Xi's is %.2f\n", sum))
## Sum of Xi's is 3.00
for (i in 1:3) # Calculate the sum of Xi's squared.
{
sum2 = sum2 + data[i]*data[i]
}
cat(sprintf("Sum of Xi's squared is %.2f\n", sum2))
## Sum of Xi's squared is 11.00
sq_sumx = sum*sum # Calculate the square of the sum of Xi's.
cat(sprintf("The square of the sum of Xi's %.2f\n", sq_sumx))
## The square of the sum of Xi's 9.00
cat("From the math above you can see that the sum of Xi's squared is not equal to the square of
## From the math above you can see that the sum of Xi's squared is not equal to the square of t
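The same comparison can be made without loops by using R's vectorised arithmetic. This is a minimal alternative sketch, not the submitted solution, and the variable names are assumptions. It also shows where the gap comes from: squaring a sum produces cross terms that the sum of squares lacks.

```r
x <- c(1, -1, 3)

sum_of_squares <- sum(x^2)    # 1 + 1 + 9
square_of_sum  <- sum(x)^2    # (1 - 1 + 3)^2

# Cross terms 2 * sum over i < j of x_i * x_j account for the difference.
cross_terms <- 2 * (x[1] * x[2] + x[1] * x[3] + x[2] * x[3])
```

The two quantities would coincide only when the cross terms happen to sum to zero, which is not the case for these data.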
ii. Using the same data as in the previous question, compute the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$ then compute the sample standard deviation $\hat{\sigma} = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)}$ in two ways: longhand (you can use R and use longhand notation, e.g., X[1], X[2], and X[3] or 1, -1, and 3, whichever you prefer), then using R functions such as mean() and sd() (this question is worth 2 marks).
mean = sum/3 # Calculate the mean longhand.
cat(sprintf("Mean: %.2f\n", mean))
## Mean: 1.00
sd = 0.0
var = 0.0
for (i in 1:3)
{
var = var + ((data[i] - mean)^2)/2
}
sd = sqrt(var) # Calculate the standard deviation longhand.
cat(sprintf("Standard Deviation: %.2f\n", sd))
## Standard Deviation: 2.00
mean2 = mean(data) # Calculate the mean using R function.
cat(sprintf("Mean using mean() %.2f\n", mean2))
## Mean using mean() 1.00
sd2 = sd(data) # Calculate the standard deviation using R function.
cat(sprintf("Standard Deviation using sd() %.2f\n", sd2))
## Standard Deviation using sd() 2.00
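Written out by hand for the three observations 1, -1, and 3, the loop above amounts to the following arithmetic, which agrees with the printed values of 1.00 and 2.00:

```latex
\bar{X} = \frac{1 + (-1) + 3}{3} = 1,
\qquad
\hat{\sigma} = \sqrt{\frac{(1-1)^2 + (-1-1)^2 + (3-1)^2}{3-1}}
             = \sqrt{\frac{0 + 4 + 4}{2}} = \sqrt{4} = 2
```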
iii. Express $\sum_{i=1}^{n} K$, where $K$ is a constant (i.e., a number that does not change hence has no subscript $i$), in terms of $n$ and $K$ only (Hint - a constant does not have a subscript as it does not change with $i$, but it is being added/summed, so type out a string of $n$ constants etc.). Then for $K = 3$ and $n = 5$ determine $\sum_{i=1}^{n} K$ using your result purely using $n$ and $K$ (i.e., without a summation sign - this question is worth 2 bonus marks, and you do not use R, rather use your powerful sense of logic and type out your answer with an explanation).
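A worked sketch of the reasoning the hint points to: summing a constant $n$ times is the same as multiplying it by $n$, since every one of the $n$ terms equals $K$.

```latex
\sum_{i=1}^{n} K = \underbrace{K + K + \cdots + K}_{n \text{ terms}} = nK,
\qquad
\text{so for } K = 3,\ n = 5:\quad \sum_{i=1}^{5} 3 = 3 + 3 + 3 + 3 + 3 = 5 \times 3 = 15
```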