439 sample midterm
School: Rutgers University
Course: 439
Subject: Mathematics
Date: Apr 3, 2024
Question 1 - Multiple Choice - 16 points
This question covers multiple topics. Each question is worth 2 points.
1. Suppose that the locations of 100 sites are given as the latitude and longitude of each site. What is the best way to visualize this data?
(a) histogram
(b) scatter plot
(c) bar plot
(d) KDE plot
2. Your letter grade (e.g., A+, A, B+, ...) in a class that grades on a curve is most accurately described as what kind of data?
(a) ordinal
(b) nominal
(c) none of the above
3. The set that consists of all possible values of a random variable is called a
(a) random range
(b) sample space
(c) specific range
(d) none of the above
4. SVD and PCA applied to a matrix can be used to
(a) factor matrices
(b) find linearly independent columns
(c) reduce dimensions
(d) all of the above
(e) none of the above
5. A data scientist must always consider potential sources of bias in a given dataset.
(a) True
(b) False
(c) Maybe
6. Which data formats would be well suited for nested data? Select all that apply.
(a) .csv
(b) .xml
(c) .ipynb
(d) .json
(e) .tsv
7. Which of the following are reasonable motivations for applying a log transformation? Select all that apply.
(a) Perform dimensionality reduction on the data.
(b) To help straighten relationships between pairs of variables.
(c) Removing missing values.
(d) Bring data distribution closer to random sampling.
(e) To help visualize highly skewed distributions
8. The return type of the pandas.DataFrame.groupby function can either be a DataFrame or a Series object.
(a) TRUE
(b) FALSE
Question 2 - EDA - 12 points
1. Suppose you are given a data set that contains the stock market performance from January 1, 1981 to January 1, 2019. The presidents during this period were Reagan (8 yrs), H.W. Bush (4 yrs), Clinton (8 yrs), W. Bush (8 yrs), Obama (8 yrs), and Trump (2 yrs). The performances are given by the following chart. During EDA, what are some questions one can ask? We are looking for 3 brief but good questions/observations.
(a) Question 1:
(b) Question 2:
(c) Question 3:
2. During the data cleaning process, is it always a good idea to remove records that contain missing values? Briefly justify your answer.
3. TRUE or FALSE. Exploratory data analysis is the process of testing key hypotheses.
4. TRUE or FALSE. The structure of the data describes how it is formatted and organized.
5. TRUE or FALSE. Throughout the process of exploratory data analysis it is often necessary to transform and clean data.
Question 3 - Data Visualization - 12 points
1. What is the best data type description for home prices in New York City? Circle the answer and briefly justify.
(a) Nominal
(b) Ordinal
(c) Quantitative
(d) Numerical
2. Justification:
3. Consider the following graph that shows registered male and female names and the year in which they were sampled. The graph seems to show an unlikely phenomenon. Assuming the data are valid, briefly explain what might have caused this.
4. TRUE/FALSE. The descriptive statistics of a data set, such as mean and variance, are good metrics for understanding the distribution of the data. Justify your answer (briefly).
5. For each of the following cases, choose the ideal plot type from: 1D: bar chart, histogram; 2D: scatter plot, line plot, box-whisker, heatmap; 3D: scatter matrix, bubble chart.
(a) Plot 10,000 student grades consisting of letters A, B, C, D, F
(b) Compare chicken and beef prices from 50 states for 1 year of data (365 data points)
(c) Compare the average, median, max and mean temperature in 3 different counties
(d) Density of traffic in NY city during rush hour.
6. Consider the following heatmap showing the height-weight distribution of Americans. State two important facts revealed by this chart. Please be brief.
(a)
(b)
Question 4 - KDE - 15 points
1. What is the purpose of using kernel density estimators (KDE) in visualizing data? Explain in 2-4 sentences.
2. Consider the following histogram. Draw the best KDE that you think will represent this data.
3. The equation for the Triweight density function (this form is more commonly called the biweight, or quartic, kernel) is given by $K(u) = \frac{15}{16}(1 - u^2)^2$, where $|u| < 1$. Show that this function satisfies all 3 properties of a kernel density function. That is,
(a) $K(u)$ is non-negative for all $u$
(b) $K(u)$ is symmetric. That is, $K(u) = K(-u)$
(c) $\int_{-1}^{1} K(u)\,du = 1$ (hint: $\int_{-1}^{1} 1\,du = 2$, $\int_{-1}^{1} u^2\,du = 2/3$, and $\int_{-1}^{1} u^4\,du = 2/5$)
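The three kernel properties above can be sanity-checked numerically. The following is a minimal sketch, not a proof: it assumes the kernel is zero outside $|u| < 1$, and the helper names `kernel` and `integrate` are introduced here for illustration only.

```python
def kernel(u: float) -> float:
    """K(u) = 15/16 * (1 - u^2)^2 for |u| < 1; assumed to be 0 elsewhere."""
    if abs(u) >= 1:
        return 0.0
    return 15 / 16 * (1 - u * u) ** 2


def integrate(f, a: float, b: float, n: int = 10_000) -> float:
    """Composite midpoint rule with n equal subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# Check (a) and (b) on a grid that extends past the support [-1, 1].
grid = [i / 100 for i in range(-150, 151)]
non_negative = all(kernel(u) >= 0 for u in grid)
symmetric = all(abs(kernel(u) - kernel(-u)) < 1e-12 for u in grid)

# Check (c) numerically, and again using the three hinted integrals:
# (1 - u^2)^2 = 1 - 2u^2 + u^4, so the integral is 2 - 2*(2/3) + 2/5.
area = integrate(kernel, -1.0, 1.0)
area_from_hints = 15 / 16 * (2 - 2 * (2 / 3) + 2 / 5)

print(non_negative, symmetric, round(area, 6), round(area_from_hints, 6))
```

A grid check only samples finitely many points, so a written answer still needs the algebraic argument for each property.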
Question 5 - Probability Fundamentals - 15 points
1. Let X be the random variable that represents the outcome of a coin toss. Suppose a "biased" coin has $P(X = 'H') = 0.2$. Write down the entries in the sample space for tossing the biased coin twice and their probabilities.
P(H,H) =
2. Using the formulas $E[X] = \sum_{x \in X} x \cdot P(X = x)$ and $\mathrm{var}[X] = E[X^2] - (E[X])^2$, compute the expected value $E(X)$ and $\mathrm{var}(X)$.
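As a hedged illustration of how the two formulas are applied, the sketch below enumerates the two-toss sample space and then evaluates E[X] and var[X] for a single toss. Two assumptions are not in the question: the tosses are treated as independent, and the outcome is encoded numerically as H = 1 and T = 0. A different encoding gives different numbers.

```python
from fractions import Fraction
from itertools import product

# P(X = 'H') = 0.2 comes from the question; tails gets the remainder.
p = {"H": Fraction(1, 5), "T": Fraction(4, 5)}

# Sample space for two tosses, assuming the tosses are independent.
two_tosses = {(a, b): p[a] * p[b] for a, b in product("HT", repeat=2)}

# Assumption: encode a single toss as H -> 1, T -> 0.
value = {"H": 1, "T": 0}
e_x = sum(value[s] * p[s] for s in p)          # E[X]
e_x2 = sum(value[s] ** 2 * p[s] for s in p)    # E[X^2]
var_x = e_x2 - e_x ** 2                        # E[X^2] - (E[X])^2

for outcome, prob in two_tosses.items():
    print(outcome, float(prob))
print("E[X] =", float(e_x), "var[X] =", float(var_x))
```

Using `Fraction` keeps the arithmetic exact, which avoids floating-point noise such as 0.2 * 0.2 not printing as exactly 0.04.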
3. If a coin is biased, is there a way to agree on a trial where you have a 50-50 chance of winning regardless of the coin's bias? Briefly explain your answer.
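One well-known approach (often attributed to von Neumann) is to toss the coin twice, let one player win on heads-then-tails and the other on tails-then-heads, and repeat the pair of tosses on HH or TT. The sketch below checks the resulting odds exactly. It assumes independent tosses and a bias strictly between 0 and 1, and the function name is hypothetical.

```python
from fractions import Fraction


def fair_round_probabilities(p_heads: Fraction):
    """Toss twice: 'HT' means player A wins, 'TH' means player B wins.

    'HH' and 'TT' are discarded and the round is repeated, so the
    returned values are conditional on the round being decided.
    Assumes independent tosses and 0 < p_heads < 1.
    """
    q = 1 - p_heads
    p_ht = p_heads * q
    p_th = q * p_heads
    p_decided = p_ht + p_th
    return p_ht / p_decided, p_th / p_decided


biases = (Fraction(1, 5), Fraction(7, 10), Fraction(99, 100))
results = {p: fair_round_probabilities(p) for p in biases}
for p, (win_a, win_b) in results.items():
    print(float(p), win_a, win_b)
```

The trade-off is speed: the more biased the coin, the more often a round is discarded before a winner is decided.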
4. A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls?
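A conditional probability like this one can be checked by enumerating the sample space. The sketch below assumes that each child is equally likely to be a boy or a girl, that the two births are independent, and that "at least one is a girl" is the only information given. None of these assumptions is stated in the question.

```python
from fractions import Fraction
from itertools import product

# All (older, younger) combinations, assumed equally likely.
families = list(product("BG", repeat=2))

# Condition on the statement "at least one child is a girl".
at_least_one_girl = [f for f in families if "G" in f]
two_girls = [f for f in at_least_one_girl if f == ("G", "G")]

p_two_girls = Fraction(len(two_girls), len(at_least_one_girl))
print(at_least_one_girl, p_two_girls)
```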
5. There is a 30% chance that a user will click an advertisement on the page. It is known from past data that about 80% of the users who click on the ad buy the product. What percentage of people both clicked on the ad and bought the product?
6. Suppose that the probability distribution of two random variables, Weather and Cavity, is given by the following chart.
Answer the following questions. Show all work.
(a) What is P(weather = sunny) =
(b) What is P(weather = sunny | cavity = yes) =
What is P(cavity = yes | weather = sunny) =
(c) Are the two random variables weather and cavity independent? Justify your answer.
7. It is suspected that children sleeping with night lamps develop nearsightedness. From a data
sample, the following probabilities were observed. P(night lamp) = 0.4, P(nearsightedness) =
0.25, P(night lamp
nearsightedness) = 0.2
Do you think that night lamps could be responsible for nearsightedness? Justify your answer.
8. It was also suspected that having a nearsighted parent may be responsible for the nearsightedness of a child. The following data were observed. Of the 100 nearsighted parents sampled, 24% of their children grew up with a night lamp and had nearsightedness. There is a 60% chance that a nearsighted parent will have a nearsighted child. It is also known that approximately 40% of the nearsighted parents slept with a night lamp. Does this data validate or invalidate the claim that children sleeping with night lamps might develop nearsightedness? Justify your answer.
Question 6 - Linear Algebra - 15 points
1. Given two vectors $A = [3, 4]$ and $B = [6, 8]$, find the cosine of the angle between A and B.
2. Given two vectors $A = [3, 4]$ and $B = [6, 8]$, find the projection of A onto B.
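Both quantities follow from the dot product: the cosine is $A \cdot B / (\|A\| \|B\|)$ and the projection of A onto B is $(A \cdot B / B \cdot B)\,B$. The sketch below applies these standard formulas; the helper names `dot`, `cosine`, and `project` are illustrative, not part of the exam.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Cosine of the angle between x and y."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def project(x, onto):
    """Vector projection of x onto the direction of `onto`."""
    scale = dot(x, onto) / dot(onto, onto)
    return [scale * c for c in onto]


A = [3, 4]
B = [6, 8]
cos_ab = cosine(A, B)
proj_a_on_b = project(A, B)
print(cos_ab, proj_a_on_b)
```

Because B is a positive multiple of A, the two vectors point in the same direction, which is what the computed values reflect.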
3. Suppose a data file contains 100,000 observations with 47 features each, represented as a 100,000 x 47 matrix. What is the maximum possible number of linearly independent rows/observations in the matrix, and why?
4. An eigenvalue x and eigenvector v of a matrix A satisfy the equation $Av = xv$. Explain the geometric interpretation of an eigenvalue and an eigenvector.
5. Consider the following matrix A.
1
0
0
1
1
1
The SVD of the matrix A yields the following results, where the matrices $U$, $D$, and $V^T$ are shown. Write a vector expression for the rank-2 approximation of the original matrix A. Use only one decimal place in the answers. Do not simplify the answer.
Question 7 - Text processing - 15 points
1. Suppose three documents $d_1, d_2, d_3$ are represented by the vectors $d_1 = [1, 0, 2, 5, 0]$, $d_2 = [0, 1, 0, 0, 10]$, $d_3 = [1, 0, 1, 3, 0]$, where each vector represents the word count for each word (5 different words in all documents).
(a) Which two of these documents are "similar"? Why?
(b) Which two of the documents are significantly different from each other? Why?
(c) Are these vectors linearly independent or dependent? How would you interpret that in the context of documents? Show work for linear independence and explain.
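One common way to quantify how "similar" two word-count vectors are is cosine similarity. The question does not name a metric, so treat that choice as an assumption. The sketch below computes it for each pair of documents.

```python
import math
from itertools import combinations

docs = {
    "d1": [1, 0, 2, 5, 0],
    "d2": [0, 1, 0, 0, 10],
    "d3": [1, 0, 1, 3, 0],
}


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Cosine similarity: 1 means same direction, 0 means no shared words."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


# Similarity for every unordered pair of documents.
similarity = {
    (a, b): cosine(docs[a], docs[b]) for a, b in combinations(sorted(docs), 2)
}
for pair, value in similarity.items():
    print(pair, round(value, 3))
```

A similarity of exactly 0 means the two documents share no words with non-zero counts. Cosine similarity ignores document length, so a short and a long document with the same word proportions score close to 1.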
2. Solve the following regex problems
(a) Write a regex for an identifier that must start with an upper-case alphabetic character and contain a total of 5 to 10 upper- or lower-case alphanumeric characters.
(b) What would the following line of code return? There are no spaces in any of the strings.
re.findall(r"\..*", "VIXX-Error.mp3.bak") [note: findall(pattern, text) returns a list of matches]
(c) What would this return? re.findall(r"[cat|dog]", "bobcat")
(d) What would this return? re.findall(r"a?p*[le]", "apple")
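The behavior of these patterns can be explored with Python's `re` module. The sketch below is one possible reading, not an answer key. The identifier pattern assumes that "5 to 10 characters" refers to the total length including the leading upper-case letter, and the character-class example assumes the pattern in (c) is `[cat|dog]`, with a literal `|` inside the brackets.

```python
import re

# Hypothetical pattern for (a): one upper-case letter, then 4 to 9 more
# letters or digits, so the whole identifier is 5 to 10 characters long.
IDENTIFIER = re.compile(r"[A-Z][A-Za-z0-9]{4,9}")


def is_identifier(text: str) -> bool:
    """True only if the entire string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None


samples = ["Abc12", "abc12", "Ab1", "Abcdefghij", "Abcdefghijk", "Ab_cd"]
checks = {s: is_identifier(s) for s in samples}

# Inside [...], '|' is a literal character, not alternation, so this
# class matches any single character from the set c, a, t, |, d, o, g.
class_matches = re.findall(r"[cat|dog]", "bobcat")

# Optional 'a', then any number of 'p', then exactly one 'l' or 'e'.
optional_matches = re.findall(r"a?p*[le]", "apple")

print(checks)
print(class_matches)
print(optional_matches)
```

`fullmatch` is used rather than `search` so that longer strings containing a valid identifier as a substring are rejected.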