CS M146: Introduction to Machine Learning
UCLA Fall Quarter 2023
Prof. Aditya Grover
Release: 28 Sep 2023
Homework 0
Due: 5 Oct 2023, Thursday, before 11:59 pm
Except for the programming questions, you should be able to solve these problems by hand.
Problem 1 (Multivariate Calculus)
Consider $y = x \sin(z) e^{-x}$. What is the partial derivative of $y$ with respect to $x$?
Solution:
$$\frac{\partial y}{\partial x} = \sin(z)\left(-x e^{-x} + e^{-x}\right) = (1 - x)\sin(z)\, e^{-x}$$
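Not part of the required hand solution, but the derivative is easy to verify symbolically; a minimal sketch, assuming SymPy is available:

import sympy as sp

x, z = sp.symbols('x z')
y = x * sp.sin(z) * sp.exp(-x)

# Differentiate with respect to x and simplify.
dy_dx = sp.simplify(sp.diff(y, x))
print(dy_dx)  # equivalent to (1 - x)*sin(z)*exp(-x)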
Problem 2 (Linear Algebra)
(a) Consider the matrix $X$ and the vectors $y$ and $z$ below:
$$X = \begin{pmatrix} 2 & 4 \\ 1 & 3 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 3 \end{pmatrix}, \qquad z = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$$
i. What is the inner product $y^T z$?
Solution: $y^T z = 1 \cdot 2 + 3 \cdot 3 = 11$
ii. What is the product $Xy$?
Solution:
$$Xy = \begin{pmatrix} 2 \cdot 1 + 4 \cdot 3 \\ 1 \cdot 1 + 3 \cdot 3 \end{pmatrix} = \begin{pmatrix} 14 \\ 10 \end{pmatrix}$$
iii. Is $X$ invertible? If so, give the inverse; if not, explain why not.
Solution:
$\det(X) = 2 \cdot 3 - 4 \cdot 1 = 2 \neq 0$, so $X$ is invertible.
For a $2 \times 2$ matrix,
$$A^{-1} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{\det(A)} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.$$
Thus,
$$X^{-1} = \frac{1}{2} \begin{pmatrix} 3 & -4 \\ -1 & 2 \end{pmatrix}.$$
iv. What is the rank of $X$?
Solution: The rows (or equivalently, the columns) of $X$ are linearly independent, so $\operatorname{rank}(X) = 2$.
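These results are straightforward to verify numerically with numpy (the library referenced in Problem 5); a minimal sketch, not part of the required hand solution:

import numpy as np

X = np.array([[2.0, 4.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0])
z = np.array([2.0, 3.0])

print(y @ z)                     # inner product y^T z -> 11.0
print(X @ y)                     # X y -> [14. 10.]
print(np.linalg.det(X))          # determinant -> 2.0 (nonzero, so invertible)
print(np.linalg.inv(X))          # [[ 1.5 -2. ], [-0.5  1. ]] = (1/2) [[3 -4], [-1 2]]
print(np.linalg.matrix_rank(X))  # rank -> 2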
(b) Vector Norms [4 pts]
Draw the regions corresponding to vectors $x \in \mathbb{R}^2$ with the following norms (you can hand draw or use software for this question):
Solution:
Each constraint describes the unit ball of the corresponding norm, a region centered at the origin; its shape depends on the norm, as described below.
i. $\|x\|_2 \leq 1$ (Recall $\|x\|_2 = \sqrt{\sum_i x_i^2}$.)
Solution: the unit circle (origin-centered circle with diameter 2), and its interior
ii. $\|x\|_0 \leq 1$ (Recall $\|x\|_0 = \sum_{i : x_i \neq 0} 1$.)
Solution: the two coordinate axes (at most one component may be nonzero)
iii. $\|x\|_1 \leq 1$ (Recall $\|x\|_1 = \sum_i |x_i|$.)
Solution: the origin-centered diamond with vertices at $(\pm 1, 0)$ and $(0, \pm 1)$ (diagonals of length 2), and its interior
iv. $\|x\|_\infty \leq 1$ (Recall $\|x\|_\infty = \max_i |x_i|$.)
Solution: the origin-centered square with side length 2, and its interior
(c) Matrix Decompositions [6 pts]
i. Give the definition of the eigenvalues and the eigenvectors of a square matrix.
Solution:
An eigenvector of a square matrix is a nonzero vector whose direction is invariant under the associated linear transformation. In other words, if $v$ is a nonzero vector, then it is an eigenvector of a square matrix $A$ if $Av$ is a scalar multiple of $v$. This condition can be written as the equation
$$Av = \lambda v,$$
where $\lambda$ is a scalar known as the eigenvalue associated with the eigenvector $v$.
ii. Find the eigenvalues and eigenvectors of $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$.
Solution:
The eigenvectors $v$ of this transformation satisfy the equation $Av = \lambda v$. Rearranging gives $(A - \lambda I)v = 0$, which has a nonzero solution only when the determinant $|A - \lambda I|$ equals zero. Setting the determinant to zero yields the polynomial equation
$$p(\lambda) = |A - \lambda I| = (2 - \lambda)^2 - 1 = (\lambda^2 - 4\lambda + 4) - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 3)(\lambda - 1) = 0,$$
known as the characteristic polynomial of the matrix $A$. In this case, it has the roots $\lambda = 1$ and $\lambda = 3$.
For $\lambda = 1$, the equation becomes
$$(A - I)v = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$
which has the solution
$$v = \begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$
For $\lambda = 3$, the equation becomes
$$(A - 3I)w = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$
which has the solution
$$w = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$
Thus, the vectors $v$ and $w$ are eigenvectors of $A$ associated with the eigenvalues $\lambda = 1$ and $\lambda = 3$, respectively.
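The numpy.linalg.eig routine referenced in Problem 5 gives a quick numerical check of these values; a minimal sketch, not part of the required hand solution:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigendecomposition: eigenvalues and (column) eigenvectors of A.
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)   # [3. 1.] (order may vary)
print(eigvecs)   # columns are unit-norm eigenvectors, proportional to (1, 1) and (1, -1)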
iii. For any positive integer $k$, show that the eigenvalues of $A^k$ are $\lambda_1^k, \lambda_2^k, \ldots, \lambda_n^k$, the $k$-th powers of the eigenvalues of matrix $A$, and that each eigenvector of $A$ is still an eigenvector of $A^k$.
Solution:
We prove this by induction.
Base case (given): $Av = \lambda v$.
Inductive step: Prove that if $A^k v = \lambda^k v$, then $A^{k+1} v = \lambda^{k+1} v$. This holds because
$$A^{k+1} v = A(A^k v) = A(\lambda^k v) = \lambda^k (Av) = \lambda^{k+1} v.$$
Since both the base case and the inductive step have been established, by mathematical induction the statement $A^k v = \lambda^k v$ holds for all positive integers $k$. Q.E.D.
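As an illustrative numerical check (not required by the problem), raising the matrix from part ii to a power shows the claimed relationship between the eigenvalues; a sketch assuming numpy:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
k = 4

# Compare the eigenvalues of A^k with the k-th powers of the eigenvalues of A.
Ak = np.linalg.matrix_power(A, k)
print(np.sort(np.linalg.eigvals(Ak)))      # [ 1. 81.]
print(np.sort(np.linalg.eigvals(A)) ** k)  # [ 1. 81.]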
(d) Vector and Matrix Calculus [5 pts]
Consider the vectors $x$ and $a$ and the symmetric matrix $A$.
i. What is the first derivative of $a^T x$ with respect to $x$?
Solution:
Let $a_i$ and $x_i$ denote the elements of $a$ and $x$, respectively. Then $f(x) = a^T x = \sum_{i=1}^n a_i x_i$, so
$$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^n a_i x_i = a_k \qquad \text{and} \qquad \nabla_x\, a^T x = a.$$
ii. What is the first derivative of $x^T A x$ with respect to $x$? What is the second derivative?
Solution:
Let $a_{ij}$ denote the element in row $i$, column $j$ of $A$. Then $f(x) = x^T A x = \sum_{i=1}^n \sum_{j=1}^n x_i a_{ij} x_j$, so
$$\begin{aligned}
\frac{\partial f(x)}{\partial x_k} &= \frac{\partial}{\partial x_k} \sum_{i=1}^n \sum_{j=1}^n a_{ij} x_i x_j \\
&= \frac{\partial}{\partial x_k} \left[ \sum_{i \neq k} \sum_{j \neq k} a_{ij} x_i x_j + \sum_{i \neq k} a_{ik} x_i x_k + \sum_{j \neq k} a_{kj} x_k x_j + a_{kk} x_k^2 \right] \\
&= 0 + \sum_{i \neq k} a_{ik} x_i + \sum_{j \neq k} a_{kj} x_j + 2 a_{kk} x_k \\
&= \sum_{i=1}^n a_{ik} x_i + \sum_{j=1}^n a_{kj} x_j = 2 \sum_{i=1}^n a_{ki} x_i,
\end{aligned}$$
where the last equality follows from $a_{ij} = a_{ji}$. Note that the $k$-th entry of $\nabla_x f(x)$ is twice the inner product of the $k$-th row of $A$ and $x$; therefore, $\nabla_x\, x^T A x = 2Ax$.
To find the second derivative,
$$\frac{\partial^2 f(x)}{\partial x_k\, \partial x_\ell} = \frac{\partial}{\partial x_k} \frac{\partial f(x)}{\partial x_\ell} = \frac{\partial}{\partial x_k} \left[ 2 \sum_{i=1}^n a_{\ell i} x_i \right] = 2 a_{\ell k} = 2 a_{k \ell}.$$
Therefore, $\nabla_x^2\, x^T A x = 2A$.
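A finite-difference check of both identities (a sketch assuming numpy; the matrix and point below are arbitrary examples):

import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
A = (A + A.T) / 2          # make A symmetric, as the problem assumes
x = rng.normal(size=n)
eps = 1e-4

f = lambda v: v @ A @ v

# Numerical gradient of f at x, compared with the closed form 2 A x.
grad_num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
print(np.allclose(grad_num, 2 * A @ x, atol=1e-6))  # True

# Numerical Hessian via second differences, compared with 2 A.
H_num = np.array([[(f(x + eps * ek + eps * el) - f(x + eps * ek - eps * el)
                    - f(x - eps * ek + eps * el) + f(x - eps * ek - eps * el)) / (4 * eps**2)
                   for el in np.eye(n)] for ek in np.eye(n)])
print(np.allclose(H_num, 2 * A, atol=1e-5))          # True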
(e) Geometry [5 pts]
i. Show that the vector $w$ is orthogonal to the line $w^T x + b = 0$. (Hint: Consider two points $x_1, x_2$ that lie on the line. What is the inner product $w^T(x_1 - x_2)$?)
Solution:
This line is all $x$ such that $w^T x + b = 0$. Consider two such points, called $x_1$ and $x_2$. Note that $x_1 - x_2$ is a vector parallel to our line. Also note that
$$w^T x_1 + b = 0 = w^T x_2 + b \;\implies\; w^T x_1 = w^T x_2 \;\implies\; w^T(x_1 - x_2) = 0,$$
which shows that the vector $w$ is orthogonal to our line.
ii. Argue that the distance from the origin to the line $w^T x + b = 0$ is $\frac{|b|}{\|w\|_2}$.
Solution:
We can show this by first finding the closest point to the origin that lies on the line, and then computing the distance to that point. Let $a^*$ be the closest point to the origin that lies on the line. We can write $a^*$ as
$$a^* = \arg\min_a\; a^T a \quad \text{s.t.} \quad w^T a + b = 0.$$
So we first solve this constrained optimization problem to find $a^*$. Using a Lagrange multiplier, we set the derivative of the Lagrangian $a^T a - \lambda(w^T a + b)$ with respect to $a$ to zero:
$$\nabla_a \left[ a^T a - \lambda (w^T a + b) \right] = 2a - \lambda w = 0,$$
which gives $a^* = \frac{\lambda}{2} w$. Hence, plugging this value for $a$ into the constraint, we can write
$$w^T a + b = w^T \frac{\lambda}{2} w + b = \frac{\lambda}{2} w^T w + b = 0.$$
Thus,
$$\lambda = \frac{-2b}{w^T w}, \qquad a^* = \frac{-b}{w^T w}\, w.$$
Once we have $a^*$, we can compute the distance between $a^*$ and the origin:
$$\text{distance} = \|a^*\| = \sqrt{(a^*)^T a^*} = \sqrt{\left( \frac{-b}{w^T w} \right)^2 w^T w} = \frac{|b|}{w^T w} \sqrt{w^T w} = \frac{|b|}{\sqrt{w^T w}} = \frac{|b|}{\|w\|}.$$
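A quick numerical illustration of this result (a sketch assuming numpy; w and b are arbitrary example values):

import numpy as np

# Example line w^T x + b = 0 in R^2 (arbitrary illustrative values).
w = np.array([3.0, 4.0])
b = -10.0

# Closest point on the line to the origin, from the derivation above.
a_star = (-b / (w @ w)) * w
print(w @ a_star + b)              # ~0: a_star lies on the line
print(np.linalg.norm(a_star))      # 2.0
print(abs(b) / np.linalg.norm(w))  # 2.0, matches |b| / ||w||_2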
Problem 3 (Probability and Statistics)
(a) Consider a sample of data $S$ obtained by flipping a coin five times. $X_i,\ i \in \{1, \ldots, 5\}$ is a random variable that takes the value 0 when the outcome of coin flip $i$ turned up heads, and 1 when it turned up tails. Assume that the outcome of each of the flips does not depend on the outcomes of any of the other flips. The sample obtained is $S = (X_1, X_2, X_3, X_4, X_5) = (1, 1, 0, 1, 0)$.
i. What is the sample mean for this data?
Solution: $\mu = \frac{1}{n} \sum_{i=1}^n X_i = \frac{3}{5}$
ii. What is the unbiased sample variance?
Solution:
$$\sigma^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \mu)^2 = \frac{1}{4} \left[ 3 \left(1 - \tfrac{3}{5}\right)^2 + 2 \left(0 - \tfrac{3}{5}\right)^2 \right] = \frac{1}{4} \left( 3 \cdot \tfrac{4}{25} + 2 \cdot \tfrac{9}{25} \right) = \frac{3}{10}$$
Give partial credit if the student uses the biased estimator $\sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2$.
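A numerical check with numpy (not part of the hand solution); note that ddof=1 selects the unbiased estimator:

import numpy as np

S = np.array([1, 1, 0, 1, 0])

print(S.mean())       # 0.6  (= 3/5)
print(S.var(ddof=1))  # 0.3  (= 3/10, unbiased sample variance)
print(S.var(ddof=0))  # 0.24 (biased estimator, for comparison)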
iii. What is the probability of observing this data assuming that a coin with an equal probability of heads and tails was used? (i.e., the probability distribution of $X_i$ is $P(X_i = 1) = 0.5$, $P(X_i = 0) = 0.5$.)
Solution: $P(S) = \left(\tfrac{1}{2}\right)^5 = \frac{1}{32}$
iv. Note the probability of this data sample would be greater if the value of the probability $P(X_i = 1)$ was not $0.5$ but some other value. What is the value that maximizes the probability of the sample $S$? [Optional: Can you prove your answer is correct?]
Solution:
To maximize the probability of sample $S$, $p$ should take on the sample mean: $p^* = \frac{3}{5}$.
Optional: Let $p = P(X = 1)$. We wish to find the value of $p$, $0 \leq p \leq 1$, that maximizes the likelihood $L(p) = p^3 (1 - p)^2$. To find the critical points, find $p^*$ such that $\left.\frac{dL}{dp}\right|_{p^*} = 0$:
$$\frac{dL}{dp} = -2 p^3 (1 - p) + 3 p^2 (1 - p)^2 = p^2 (1 - p)\left(-2p + 3(1 - p)\right) = p^2 (1 - p)(3 - 5p)$$
$$p^* = 0,\ \tfrac{3}{5},\ 1$$
Thus, $L$ attains a minimum of 0 at $p = 0, 1$, and $L$ attains a maximum of $\left(\tfrac{3}{5}\right)^3 \left(\tfrac{2}{5}\right)^2 = \frac{108}{3125}$ at $p = \frac{3}{5}$. (We could also have computed $\frac{d^2 L}{dp^2}$ to determine the minimum and maximum.)
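A quick numerical confirmation of the maximizer (a sketch assuming numpy; the grid resolution is arbitrary):

import numpy as np

# Likelihood of the observed sample S = (1, 1, 0, 1, 0) as a function of p.
p = np.linspace(0.0, 1.0, 10001)
L = p**3 * (1 - p)**2

print(p[np.argmax(L)])  # 0.6, i.e. the sample mean 3/5
print(L.max())          # ~0.03456 = 108/3125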
v. Given the following joint distribution between $X$ and $Y$, what is $P(X = T \mid Y = b)$?

P(X, Y)      Y = a    Y = b    Y = c
X = T        0.2      0.1      0.2
X = F        0.05     0.15     0.3

Solution: $P(X = T \mid Y = b) = \frac{P(X = T, Y = b)}{P(Y = b)} = \frac{0.1}{0.1 + 0.15} = 0.4$.
(b) Match the distribution name to its formula.

(a) Gaussian       (i)   $p^x (1 - p)^{1 - x}$, when $x \in \{0, 1\}$; 0 otherwise
(b) Exponential    (ii)  $\frac{1}{b - a}$ when $a \leq x \leq b$; 0 otherwise
(c) Uniform        (iii) $\binom{n}{x} p^x (1 - p)^{n - x}$
(d) Bernoulli      (iv)  $\lambda e^{-\lambda x}$ when $x \geq 0$; 0 otherwise
(e) Binomial       (v)   $\frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)$

Solution: a-v, b-iv, c-ii, d-i, e-iii
(c) What is the mean and variance of a Bernoulli($p$) random variable?
Solution:
The PMF for the Bernoulli distribution is
$$f(k; p) = \begin{cases} p & \text{if } k = 1 \\ 1 - p & \text{if } k = 0 \end{cases} \;=\; p^k (1 - p)^{1 - k} \quad \text{for } k \in \{0, 1\}.$$
Let $q = 1 - p$. Then
$$\mu = \mathbb{E}[X] = \sum_{k \in \{0, 1\}} k\, f(k; p) = (0)(q) + (1)(p) = p$$
$$\sigma^2 = \operatorname{Var}(X) = \mathbb{E}\left[(X - \mu)^2\right] = \mathbb{E}\left[X^2\right] - (\mathbb{E}[X])^2 = (0)^2 (q) + (1)^2 (p) - (p)^2 = p(1 - p) = pq$$
(For reference, the entropy is $H = \mathbb{E}[-\ln(f(k; p))] = (-\ln q)(q) + (-\ln p)(p) = -q \ln q - p \ln p$.)
(d) If the variance of a zero-mean random variable $X$ is $\sigma^2$, what is the variance of $2X$? What about the variance of $X + 2$?
Solution:
If all values are scaled by a constant, the variance is scaled by the square of that constant: $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$. Thus, $\operatorname{Var}(2X) = 4\operatorname{Var}(X) = 4\sigma^2$.
Variance is invariant with respect to changes in a location parameter; that is, if a constant is added to all values of the variable, the variance is unchanged: $\operatorname{Var}(X + a) = \operatorname{Var}(X)$. Thus, $\operatorname{Var}(X + 2) = \operatorname{Var}(X) = \sigma^2$.
Problem 4 (Algorithms)
Big-O notation. For each pair $(f, g)$ of functions below, list which of the following are true: $f(n) = O(g(n))$, $g(n) = O(f(n))$, or both. Briefly justify your answers.
(a) $f(n) = \ln(n)$, $g(n) = \lg(n)$. Note that $\ln$ denotes log to the base $e$ and $\lg$ denotes log to the base 2.
Solution: Both, since the two functions are equal up to a multiplicative constant ($\ln n = \lg n \cdot \ln 2$).
(b) $f(n) = 3^n$, $g(n) = n^{10}$
Solution: $g(n) = O(f(n))$, since $f(n)$ grows much more rapidly as $n$ becomes large (an exponential eventually dominates any polynomial).
(c) $f(n) = 3^n$, $g(n) = 2^n$
Solution: $g(n) = O(f(n))$, since $f(n)$ grows much more rapidly as $n$ becomes large; the ratio $f(n)/g(n) = (3/2)^n$ is unbounded, so $f(n) \neq O(g(n))$.
Problem 5 (Programming Skills)
Start familiarizing yourself with the Python libraries numpy and matplotlib by completing the following exercises. You may find the following references helpful:
• https://numpy.org/doc/stable/reference/random/generated/numpy.random.multivariate_normal.html
• http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.eig.html
You do not have to submit code; simply paste screenshots of the code and the plots in your pdf.
(a) Sampling from multivariate probability distributions
i. Draw 1000 samples $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ from a 2-dimensional Gaussian distribution with mean $\begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and identity covariance matrix, i.e. $p(x) = \frac{1}{\sqrt{(2\pi)^2}} \exp\left( -\frac{\|x\|^2}{2} \right)$, and make a scatter plot ($x_1$ vs $x_2$).
Solution:
See attached code.
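The attached code is not reproduced in this document; a minimal sketch of what it might look like, using numpy and matplotlib as referenced above:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# 1000 samples from a 2-D Gaussian with mean (0, 0) and identity covariance.
mean = np.zeros(2)
cov = np.eye(2)
samples = rng.multivariate_normal(mean, cov, size=1000)

plt.scatter(samples[:, 0], samples[:, 1], s=5)
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.title("Samples from N(0, I)")
plt.show()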
ii. How does the scatter plot change if the mean is $\begin{pmatrix} -1 \\ 1 \end{pmatrix}$?
Solution:
See solution code. The data move up and to the left. Namely, the center of the data moves from roughly $[0, 0]$ to roughly $[-1, 1]$.
iii. How does the (original) scatter plot change if you double the variance of each component?
Solution:
See solution code. The data become more “spread out”.
iv. How does the (original) scatter plot change if the covariance matrix is changed to $\begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}$?
Solution:
See solution code. The data become skewed so that they stretch from the lower left to the upper right.
v. How does the (original) scatter plot change if the covariance matrix is changed to $\begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}$?
Solution:
See solution code. The data become skewed so that they stretch from the upper left to the lower right.
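A sketch of how parts ii-v might be produced by varying the mean and covariance (again assuming numpy and matplotlib; the original solution code is not included here):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# (label, mean, covariance) for parts ii through v.
cases = [
    ("ii: mean (-1, 1)",      [-1, 1], [[1, 0],    [0, 1]]),
    ("iii: doubled variance", [0, 0],  [[2, 0],    [0, 2]]),
    ("iv: covariance +0.5",   [0, 0],  [[1, 0.5],  [0.5, 1]]),
    ("v: covariance -0.5",    [0, 0],  [[1, -0.5], [-0.5, 1]]),
]

fig, axes = plt.subplots(2, 2, figsize=(8, 8))
for ax, (title, mean, cov) in zip(axes.ravel(), cases):
    samples = rng.multivariate_normal(mean, cov, size=1000)
    ax.scatter(samples[:, 0], samples[:, 1], s=5)
    ax.set_title(title)
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)
plt.tight_layout()
plt.show()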
(b) There are now lots of really interesting data sets publicly available to play with. They range in size, quality, and the type of features, and have resulted in many new machine learning techniques being developed. Find a public, free, supervised (i.e. it must have features and labels) machine learning dataset. For example, you can use one of the open datasets released by the US government here: https://catalog.data.gov/dataset. Once you have found the data set, provide the following information:
i. The name of the data set and its link.
ii. A brief (i.e. 2-3 sentences) description of the data set including what the features are and what is being predicted.
iii. The number of examples in the data set.
iv. The number of features for each example.
For this question, do not just copy and paste the description from the website or the paper; reference it, but use your own words. Your goal here is to understand the data set, where it came from, and potential issues involved.