Homework 1 - MATH 4322 Fall 2023
Dr. Cathy Poliak
Instructions
1. Due date: August 31, 2023, 11:59 PM
2. Answer the questions fully for full credit.
3. Scan or type your answers and submit only one file. (If you submit several files, only the most recently uploaded one will be graded.)
4. Preferably save your file as PDF before uploading.
5. Submit in Canvas under Homework 1.
6. These questions are from An Introduction to Statistical Learning, second edition, by James et al., chapter 2.
7. The information in the gray boxes is R code that you can use to answer the questions.
Problem 1
Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide $n$ and $p$.
a) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
Regression problem; prediction; n = 52 (one observation per week of 2012); p = 3 (the % changes in the US, British, and German markets).
b) An online store is determining whether or not a customer will purchase additional items. This online store collected data from 1500 customers and looked at cost of initial purchase, if there was a special offer, type of item purchased, number of times the customer logged into their account, and if they purchased additional items.
Classification problem; prediction; n = 1500; p = 4 (cost of initial purchase, special offer, type of item, and number of logins; whether additional items were purchased is the response, not a predictor).
Problem 2
This is an exercise about bias, variance, and MSE.
Suppose we have $n$ independent Bernoulli trials with true success probability $p$. Consider two estimators of $p$: $\hat{p}_1 = \hat{p}$, where $\hat{p}$ is the sample proportion of successes, and $\hat{p}_2 = 1/2$, a fixed constant.
a) Find the expected value and bias of each estimator.
b) Find the variance of each estimator.
c) Find the MSE of each estimator and compare them by plotting against the true $p$. Use $n = 4$. Comment on the comparison.
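The comparison in part c) can be sketched in R as follows. This is a minimal sketch, not the original solution's code: it relies on the standard results that the sample proportion is unbiased with variance p(1 - p)/n, while the constant 1/2 has zero variance and bias 1/2 - p. The function names mse_p1 and mse_p2 and the variable crossover are made up for this illustration.

```r
# MSE of the sample proportion p1 = X/n: unbiased, so MSE = variance = p(1 - p)/n
mse_p1 <- function(p, n) p * (1 - p) / n

# MSE of the constant p2 = 1/2: zero variance, so MSE = bias^2 = (1/2 - p)^2
mse_p2 <- function(p) (0.5 - p)^2

n <- 4
p <- seq(0, 1, by = 0.01)

# Red curve: MSE of p1; blue curve: MSE of p2
plot(p, mse_p1(p, n), type = "l", col = "red",
     ylim = c(0, 0.25), xlab = "true p", ylab = "MSE")
lines(p, mse_p2(p), col = "blue")
legend("top", legend = c("MSE_p1", "MSE_p2"),
       col = c("red", "blue"), lty = 1)

# For n = 4 the two curves cross where 5p^2 - 5p + 1 = 0
crossover <- (5 + c(-1, 1) * sqrt(5)) / 10
crossover
```

Setting the two MSE expressions equal for n = 4 gives the quadratic in the last comment, so the constant estimator only wins for p between the two crossover values.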
Plot legend: red = MSE_p1, blue = MSE_p2.
For p near 0 or 1, p1 has the smaller MSE. When p is near 1/2 (between about 0.28 and 0.72 for n = 4), p2 has the smaller MSE.
Problem 3
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
A parametric approach assumes a functional form for f and reduces the problem to estimating a small set of parameters; a non-parametric approach makes no explicit assumption about the form of f. Advantages of the parametric approach: it leads to a simpler model structure, and it needs fewer observations for effective modeling than non-parametric methods do. Disadvantage: if the assumed form is far from the true f, the model may fail to capture the underlying function, resulting in potential errors.
Problem 4
This exercise involves the Auto data set in the ISLR package. Make sure that the missing values have been removed from the data.
(a) Which of the predictors are quantitative, and which are qualitative?
Qualitative: name and origin
Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, and year
(b) What is the range of each quantitative predictor? You can answer this using the summary() function.
summary(Auto$mpg) : 9.00 to 46.60
summary(Auto$acceleration) : 8.00 to 24.80
summary(Auto$cylinders) : 3.000 to 8.000
summary(Auto$year) : 70.00 to 82.00
summary(Auto$displacement) : 68.0 to 455.0
summary(Auto$horsepower) : 46.0 to 230.0
summary(Auto$weight) : 1613 to 5140
(c) What is the mean and standard deviation of each quantitative predictor?
(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
> auto.new = Auto[-c(10:85),]
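One way to obtain the range, mean, and standard deviation asked for in parts (b) through (d) is sketched below. The helper quant_summary and the toy data frame are hypothetical names and made-up values used only for illustration; the guarded section assumes the ISLR package is installed and is skipped otherwise.

```r
# Range, mean, and standard deviation for chosen numeric columns of a data frame
quant_summary <- function(df, cols) {
  t(sapply(df[, cols, drop = FALSE], function(x) {
    c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
  }))
}

# Made-up toy values, for illustration only (not the real Auto data)
toy <- data.frame(mpg = c(18, 15, 36), weight = c(3504, 3693, 2130))
quant_summary(toy, c("mpg", "weight"))

# With the real data (assumes the ISLR package is installed)
if (requireNamespace("ISLR", quietly = TRUE)) {
  Auto <- na.omit(ISLR::Auto)
  quant <- c("mpg", "cylinders", "displacement", "horsepower",
             "weight", "acceleration", "year")
  print(quant_summary(Auto, quant))               # parts (b) and (c)
  print(quant_summary(Auto[-(10:85), ], quant))   # part (d): rows 10 to 85 removed
}
```

Each row of the result is one variable, so the full-data and subset summaries can be compared line by line.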
(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
There is a noticeable correlation: automobiles with more cylinders tend to have greater displacement, weight, and horsepower. This trend is also associated with lower acceleration values and lower miles per gallon (mpg). The relationship between mpg and displacement, weight, and horsepower shows a certain level of predictability.
(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables.
Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify
your answer.
Yes. Several variables show positive or negative correlations with mpg. For instance, year appears to be positively associated with mpg: as the model year advances, mpg generally improves. Horsepower, on the other hand, appears to be negatively correlated with mpg: an increase in horsepower often corresponds to a decrease in mpg.
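The visual impressions in part (f) can be backed by sample correlations. The sketch below is an illustration under assumptions: cor_with_response is a hypothetical helper, the toy values are made up, and the guarded call assumes the ISLR package is installed. Correlation only measures linear association, so it complements the plots rather than replacing them.

```r
# Correlation of every other numeric column with a chosen response column
cor_with_response <- function(df, response) {
  num <- df[sapply(df, is.numeric)]
  others <- setdiff(names(num), response)
  r <- sapply(num[others], function(x) cor(x, num[[response]]))
  r[order(abs(r), decreasing = TRUE)]   # strongest association first
}

# Made-up toy values, for illustration only
toy <- data.frame(mpg = c(30, 25, 20, 15),
                  horsepower = c(70, 90, 120, 160),
                  year = c(82, 78, 74, 70))
cor_with_response(toy, "mpg")

# With the real data (assumes the ISLR package is installed)
if (requireNamespace("ISLR", quietly = TRUE)) {
  print(cor_with_response(na.omit(ISLR::Auto), "mpg"))
}
```

Non-numeric columns such as name are dropped automatically; origin is stored as a number in the Auto data, so its correlation should be read with care.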
Problem 5
This exercise relates to the College data set, which can be found in the file College.csv attached to this homework set in Blackboard. It contains a number of variables for 777 different universities and colleges in the US. The variables are
• Private: Public/private indicator
• Apps: Number of applications received
• Accept: Number of applicants accepted
• Enroll: Number of new students enrolled
• Top10perc: New students from top 10% of high school class
• Top25perc: New students from top 25% of high school class
• F.Undergrad: Number of full-time undergraduates
• P.Undergrad: Number of part-time undergraduates
• Outstate: Out-of-state tuition
• Room.Board: Room and board costs
• Books: Estimated book costs
• Personal: Estimated personal spending
• PhD: Percent of faculty with Ph.D.’s
• Terminal: Percent of faculty with terminal degree
• S.F.Ratio: Student/faculty ratio
• perc.alumni: Percent of alumni who donate
• Expend: Instructional expenditure per student
• Grad.Rate: Graduation rate
Before reading the data into R, it can be viewed in Excel or a text editor.
a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data. You can also import this data set into RStudio by using the Import Dataset → From Text drop-down list in the Environment window.
b) Look at the data using the View() function. You should notice that the first column is just the name of each university. We will not use this column as a variable but it may be handy to have these names for later. Try the following commands in R:
rownames(college) <- college[, 1]
college <- college[, -1]
View(college)
If you are getting an error, make sure your data frame is named with a lowercase “c”. Give a brief description of what you see in the data frame.
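The row-name steps in part b) can be tried without the real file. The sketch below reads a tiny made-up CSV from a string; csv_text, its three rows, and college2 are hypothetical and only mimic the layout of College.csv (names in the first column, variables after it).

```r
# A tiny CSV in the same layout as College.csv: names in column 1, then variables.
# The rows are made up for illustration.
csv_text <- "Name,Private,Apps,Top10perc
College A,Yes,1660,23
College B,No,2186,16
College C,Yes,1428,60"

college <- read.csv(text = csv_text)

rownames(college) <- college[, 1]  # use the names as row labels
college <- college[, -1]           # then drop the name column

str(college)

# One-step alternative: let read.csv() take the row names from column 1
college2 <- read.csv(text = csv_text, row.names = 1)
```

After these steps the names label the rows but are no longer a variable, so they do not appear in summaries or plots of the columns.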
c) Use the summary() function to produce a numerical summary of the variables in the data set. Are there any variables that do not show a numerical summary?
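Yes: Private. It is a Yes/No indicator, so summary() cannot report quartiles or a mean for it (with the read.csv() defaults in R 4.0 and later it is read as character and shows only length, class, and mode). All the other variables show a numerical summary.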
Type in the following in R:
college$Private <- as.factor(college$Private)
d) Use the pairs() function to produce a scatterplot matrix of the first five columns (variables) of the dataset. Describe any relationships you see in these plots.
pairs(college[,1:5])
There are positive correlations between Apps and Accept, Apps and Enroll, and Accept and Enroll.
e) Use the plot() function to produce a plot of Outstate versus Private. What type of plot was produced? Give a description of the relationship. Hint: Outstate is on the y-axis.
plot(college$Outstate ~ college$Private, xlab = "Private", ylab = "Outstate")
It produced side-by-side boxplots. Private schools tend to charge higher out-of-state tuition than public schools.
f) Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%. Type in the following in R:
Elite <- rep("No", nrow(college))  # this gives a column of No's for the same number of rows
Elite[college$Top10perc > 50] <- "Yes"  # changes to Yes if top 10% is greater than 50
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)  # adds Elite as a column
Use the summary() function to see how many elite universities there are.
There are 78 elite universities.
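The binning in part f) can also be written in one step. The sketch below is an equivalent formulation, not the homework's prescribed code, and it runs on made-up Top10perc values rather than the real College data.

```r
# Made-up Top10perc values, for illustration only
college <- data.frame(Top10perc = c(23, 50, 51, 89, 12))

# One-step equivalent of the rep() / index / as.factor() sequence
college$Elite <- factor(ifelse(college$Top10perc > 50, "Yes", "No"),
                        levels = c("No", "Yes"))

table(college$Elite)     # counts per group
summary(college$Elite)   # the same counts, via summary()
```

The comparison is strict, so a school with exactly 50% of its students from the top 10% stays in the "No" group, matching the original commands.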
Problem 6
This exercise involves the Boston housing data set.
(a) To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library. You may have to install the ISLR2 library, then call for this library.
library(ISLR2)
Now the data set is contained in the object Boston.
Boston
Read about the data set:
?Boston
How many rows are in this data set? How many columns? What do the rows and columns represent?
506 rows and 13 columns. Each row is one observation, a suburb (census tract) of Boston, so the sample size is 506. Each column is one variable of the data set: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, lstat, and medv.
(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
pairs(Boston)
age (x) vs nox (y): as the proportion of owner-occupied units built before 1940 rises, the nitrogen oxides concentration (pollution) tends to rise as well.
medv (x) vs lstat (y): tracts with a higher percentage of lower-status population tend to have lower median home values.
(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Yes. crim has a negative association with medv and dis, and a positive association with indus, nox, rad, and tax. For example, in the plot of crim vs dis, crime rates appear to be higher in tracts close to the employment centers.
(d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
The crime rate ranges from 0.00632 to 88.97620; 8 of the census tracts of Boston appear to have particularly high crime rates.
The tax rate ranges from 187.0 to 711.0; 137 of the census tracts of Boston appear to have particularly high tax rates.
The pupil-teacher ratio ranges from 12.60 to 22.00; 183 of the census tracts of Boston appear to have particularly high pupil-teacher ratios.
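Counts such as those in part (d) depend on where "particularly high" is taken to begin, and the answer above does not state its cutoffs. The sketch below shows one way to report a range together with a count at or above a chosen cutoff. The helper range_and_count is hypothetical, the toy values are made up, and the cutoffs in the guarded section are illustrative assumptions rather than the values behind the counts above. The guarded section assumes the ISLR2 package is installed.

```r
# Range of a variable plus the number of values at or above a chosen cutoff
range_and_count <- function(x, cutoff) {
  c(min = min(x), max = max(x), n_at_or_above = sum(x >= cutoff))
}

# Made-up toy values, for illustration only
tax <- c(187, 300, 666, 666, 711)
range_and_count(tax, cutoff = 666)

# With the real data (assumes the ISLR2 package is installed)
if (requireNamespace("ISLR2", quietly = TRUE)) {
  Boston <- ISLR2::Boston
  # The cutoffs below are illustrative assumptions, not values from the homework
  print(range_and_count(Boston$crim, cutoff = 50))
  print(range_and_count(Boston$tax, cutoff = 666))
  print(range_and_count(Boston$ptratio, cutoff = 20.2))
}
```

Stating the cutoff next to each count makes the answer reproducible, since a different threshold gives a different number of tracts.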
(e) How many of the census tracts in this data set bound the Charles river?
35 of the census tracts in this data set bound the Charles River.
(f) What is the median pupil-teacher ratio among the towns in this data set?
The median pupil-teacher ratio among the towns in this data set is 19.05
(g) Which census tract of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
Two census tracts of Boston, 399 and 406, share the lowest median value of owner-occupied homes. Neither tract has the highest crime rate in the data set. Both show a low level of investment, reflected in the minimal development within the city. Neither tract is located alongside the river. nox is in the upper quartile for both, possibly because of the suburbs' proximity to highways. The average number of rooms per dwelling is in the lower quartile for both, implying smaller living spaces. rad is at its maximum for both, indicating that these areas are located on or near the highways. The pupil-teacher ratio is also high for both (20.2, the third quartile of the variable), which may indicate underinvestment in education resources.
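The comparison in part (g) can be made systematic by locating the tracts with the minimum medv and expressing each of their values as a percentile of its column. This is a minimal sketch under assumptions: lowest_rows_percentiles is a hypothetical helper, the toy data frame holds made-up values, and the guarded call assumes the ISLR2 package is installed.

```r
# Rows attaining the minimum of `response`, with each value's percentile rank
# (the share of rows whose value is less than or equal to it)
lowest_rows_percentiles <- function(df, response) {
  idx <- which(df[[response]] == min(df[[response]]))
  pct <- as.data.frame(lapply(df, function(col) ecdf(col)(col[idx])))
  rownames(pct) <- idx
  list(rows = idx, values = df[idx, , drop = FALSE], percentile = pct)
}

# Made-up toy values, for illustration only
toy <- data.frame(medv = c(24, 5, 21.6, 5),
                  crim = c(0.006, 38.4, 0.027, 67.9),
                  rm   = c(6.6, 5.5, 6.4, 5.7))
res <- lowest_rows_percentiles(toy, "medv")
res$rows
res$percentile

# With the real data (assumes the ISLR2 package is installed)
if (requireNamespace("ISLR2", quietly = TRUE)) {
  print(lowest_rows_percentiles(ISLR2::Boston, "medv"))
}
```

A percentile near 1 means the tract sits at the top of that predictor's range and a value near 0 means it sits at the bottom, which is what the comments on quartiles and maxima are describing.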
(h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.
Among the 13 tracts where the average number of rooms per dwelling exceeds 8, the crime rate is notably low. The pupil-teacher ratio spans a narrower range than the variable's overall range. With the exception of a single tract, property tax rates are low. The proportion of non-retail business acres per town is very low in all but two tracts, implying a predominance of residential areas. These tracts appear to be situated away from highways. The majority of houses were constructed prior to 1940, although a few exceptions exist.
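The counts asked for in part (h), and the profile of the large-dwelling tracts, can be sketched as follows. The helper compare_medians and the toy values are hypothetical; the guarded section assumes the ISLR2 package is installed.

```r
# Median of each column within a subset of rows, next to the overall median
compare_medians <- function(df, keep) {
  data.frame(subset_median  = sapply(df[keep, , drop = FALSE], median),
             overall_median = sapply(df, median))
}

# Made-up toy values, for illustration only
toy <- data.frame(rm   = c(6.5, 7.2, 8.1, 5.9, 8.7),
                  crim = c(0.9, 0.4, 0.1, 3.2, 0.2))
sum(toy$rm > 7)   # rows averaging more than seven rooms
sum(toy$rm > 8)   # rows averaging more than eight rooms
compare_medians(toy, toy$rm > 8)

# With the real data (assumes the ISLR2 package is installed)
if (requireNamespace("ISLR2", quietly = TRUE)) {
  Boston <- ISLR2::Boston
  print(sum(Boston$rm > 7))
  print(sum(Boston$rm > 8))
  print(compare_medians(Boston, Boston$rm > 8))
}
```

Putting the subset median beside the overall median for every variable gives a quick check on statements such as "the crime rate is notably low" for these tracts.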