module2_assignment
1) Explain in your own words the curse of dimensionality and its importance in modern data analysis.
The curse of dimensionality refers to phenomena that arise when analyzing data in high-dimensional spaces but do not occur in low-dimensional settings. For example, data behaves very differently when it has 200 variables than when it has only 2.
Because of this, when you work with high-dimensional data you capture more noise and can lose sight of meaningful patterns. High-dimensional data is also hard to visualize, since current tools struggle with feature-rich data, and it is much more costly to store because it takes up far more space.
So when analyzing data, working with lower-dimensional data is often more cost-effective and efficient, and it yields cleaner results.
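To make the noise point concrete, here is a small simulation (my own sketch, not part of the assignment): with 1000 random points, the gap between the nearest and farthest neighbor of a point shrinks relative to the distances themselves as the dimension grows, so "near" and "far" lose their meaning.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 200):
    points = rng.random((1000, d))  # 1000 random points in the unit cube [0,1]^d
    dists = np.linalg.norm(points - points[0], axis=1)[1:]  # distances to the first point
    print(f"d={d}: (max - min) / min = {(dists.max() - dists.min()) / dists.min():.2f}")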
2)
Choose two different distance metrics that can be applied to pairs of vectors in R^n and explain their differences. For each metric, give an example of a dataset that would be more appropriately analyzed using that metric.
I am going to look at Euclidean distance and Manhattan distance.
Euclidean: this measures the straight-line distance between two points in a space with n dimensions. It is calculated as the square root of the sum of the squared differences between corresponding coordinates of the vectors.
You might use this when calculating the distance between two points in physical, real-world space, as in robotics.
A dataset that could be analyzed with this metric is one describing houses by their attributes (square footage, bedrooms, bathrooms). Euclidean distance would let you directly compare houses to see how similar they are to each other.
Manhattan: this measures the distance between two points in n-dimensional space as the sum of the absolute differences between their coordinates along each dimension.
It is best suited to navigating on a grid system.
A dataset you could analyze with this metric is a street map of Manhattan: you could use it to find the shortest path between two points, plan efficient driving routes, or lay out a running route.
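As a quick illustration of how the two metrics can disagree on the same pair of vectors, here is a minimal numpy sketch (the vectors are invented for the example):

import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, 3.0])

euclidean = np.linalg.norm(u - v)  # sqrt(3^2 + 2^2 + 0^2) ≈ 3.61
manhattan = np.abs(u - v).sum()    # |3| + |2| + |0| = 5
print(euclidean, manhattan)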
3)
Consider the matrix

M = [[3, 2, 1],
     [2, 5, 4],
     [1, 4, 8]]

(a) Compute Q_M([-7, -8, -9]^T).
(b) Compute the distance between x = [1, 2, 3]^T and y = [-7, -8, -9]^T using the distance from Q_M.
(c) What is the maximum value of v^T M v subject to ||v|| = 1, and what v attains that value?
import numpy as np

M = np.array([[3, 2, 1],
              [2, 5, 4],
              [1, 4, 8]])

# (a) apply M to the given vector
QM_vector = np.array([-7, -8, -9])
QM_result = np.dot(M, QM_vector)
print("QM =", QM_result)
QM = [ -46 -90 -111]
# (b) distance between the images of x and y under M
x = np.array([[1], [2], [3]])
y = np.array([[-7], [-8], [-9]])
QM = np.dot(M, x) - np.dot(M, y)
distance = np.linalg.norm(QM)
print("Distance between x and y =", distance)
Distance between x and y = 192.01041638411183
# (c) the maximum of v^T M v over unit vectors is the largest eigenvalue of M,
# attained at the corresponding unit eigenvector
eigenvalues, eigenvectors = np.linalg.eig(M)
max_eigenvalue_index = np.argmax(eigenvalues)
max_eigenvalue = eigenvalues[max_eigenvalue_index]
max_eigenvector = eigenvectors[:, max_eigenvalue_index]
max_eigenvector /= np.linalg.norm(max_eigenvector)  # eig already returns unit vectors; kept for safety
print("Maximum value of v^T Mv:", max_eigenvalue)
print("Vector v that attains the maximum value:", max_eigenvector)
Maximum value of v^T Mv: 11.24577226473478
Vector v that attains the maximum value: [-0.23473499 -0.57643452 -0.7827022 ]
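One caveat on parts (a) and (b): the code above computes the matrix-vector product Mv. If Q_M denotes the quadratic form Q_M(v) = v^T M v, which part (c) suggests, then (a) would be a scalar and (b) would use the induced distance d(x, y) = sqrt(Q_M(x - y)). A minimal sketch under that assumption (the helper name quadratic_form is mine):

import numpy as np

M = np.array([[3, 2, 1], [2, 5, 4], [1, 4, 8]])

def quadratic_form(v):
    # Q_M(v) = v^T M v, assuming Q_M denotes the quadratic form
    return v @ M @ v

v = np.array([-7, -8, -9])
print("Q_M(v) =", quadratic_form(v))                 # 2041

x = np.array([1, 2, 3])
y = np.array([-7, -8, -9])
print("d(x, y) =", np.sqrt(quadratic_form(x - y)))   # ≈ 57.58

# M is symmetric, so eigh is the appropriate solver; the largest
# eigenvalue is the maximum of v^T M v over unit vectors
eigenvalues, eigenvectors = np.linalg.eigh(M)
print("max v^T M v =", eigenvalues[-1])              # eigh sorts ascending
print("attained at v =", eigenvectors[:, -1])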
4)
(a) Compute the Jaccard similarities of each pair of the following three sets: {1, 2, 3, 4}, {2, 3, 5, 7}, and {2, 4, 6}.
(b) Compute the Jaccard bag similarity of each pair of the following three bags: {1, 1, 1, 2}, {1, 1, 2, 2, 3}, and {1, 2, 3, 4}.
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union

def jaccard_bag_similarity(bag1, bag2):
    # count each item min(count1, count2) times in the intersection and
    # max(count1, count2) times in the union; iterate over the items of
    # BOTH bags so items unique to bag2 are not dropped from the union
    items = set(bag1).union(bag2)
    intersection = sum(min(bag1.count(item), bag2.count(item)) for item in items)
    union = sum(max(bag1.count(item), bag2.count(item)) for item in items)
    return intersection / union
set1 = {1, 2, 3, 4}
set2 = {2, 3, 5, 7}
set3 = {2, 4, 6}

bag1 = [1, 1, 1, 2]
bag2 = [1, 1, 2, 2, 3]
bag3 = [1, 2, 3, 4]
jaccard_similarity_set1_set2 = jaccard_similarity(set1, set2)
jaccard_similarity_set1_set3 = jaccard_similarity(set1, set3)
jaccard_similarity_set2_set3 = jaccard_similarity(set2, set3)

jaccard_bag_similarity_bag1_bag2 = jaccard_bag_similarity(bag1, bag2)
jaccard_bag_similarity_bag1_bag3 = jaccard_bag_similarity(bag1, bag3)
jaccard_bag_similarity_bag2_bag3 = jaccard_bag_similarity(bag2, bag3)

print("Jaccard Similarities for Sets:")
print(f"J(set1, set2) = {jaccard_similarity_set1_set2}")
print(f"J(set1, set3) = {jaccard_similarity_set1_set3}")
print(f"J(set2, set3) = {jaccard_similarity_set2_set3}")

print("\nJaccard Bag Similarities for Bags:")
print(f"J(bag1, bag2) = {jaccard_bag_similarity_bag1_bag2}")
print(f"J(bag1, bag3) = {jaccard_bag_similarity_bag1_bag3}")
print(f"J(bag2, bag3) = {jaccard_bag_similarity_bag2_bag3}")
Jaccard Similarities for Sets:
J(set1, set2) = 0.3333333333333333
J(set1, set3) = 0.4
J(set2, set3) = 0.16666666666666666
Jaccard Bag Similarities for Bags:
J(bag1, bag2) = 0.5
J(bag1, bag3) = 0.3333333333333333
J(bag2, bag3) = 0.5
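The values above use the min/max convention for bag similarity. Some texts (e.g., Mining of Massive Datasets) instead define the bag union so each element counts the sum of its occurrences in the two bags, making the denominator the total size of both bags. If the course uses that convention, a Counter-based sketch (jaccard_bag_mmds is my own name for it) would be:

from collections import Counter

def jaccard_bag_mmds(bag1, bag2):
    c1, c2 = Counter(bag1), Counter(bag2)
    # Counter & Counter keeps the minimum count of each item
    intersection = sum((c1 & c2).values())
    return intersection / (len(bag1) + len(bag2))

print(jaccard_bag_mmds([1, 1, 1, 2], [1, 1, 2, 2, 3]))  # 3/9 ≈ 0.333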
5)
Suppose there are 100 items, numbered 1 to 100, and also 100 baskets, also numbered 1 to 100. Item i is in basket b if and only if i divides b with no remainder. Thus, item 1 is in all the baskets, item 2 is in all fifty of the even-numbered baskets, and so on. Basket 12 consists of items {1, 2, 3, 4, 6, 12}, since these are all the integers that divide 12. Answer the following questions:
(a) If the support threshold is 5, which items are frequent?
(b) What is the confidence of the following association rules?
i. {5, 7} → 2
ii. {2, 3, 4} → 5
(c) Apply the A-Priori Algorithm with support threshold 5 to this data.
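One shortcut worth noting before the code: since item i is in basket b exactly when i divides b, the support of an itemset is just the number of multiples of its least common multiple up to 100, i.e. floor(100 / lcm(itemset)). A quick sanity check of that identity (math.lcm needs Python 3.9+):

from math import lcm

def support_via_lcm(itemset, num_baskets=100):
    # baskets containing every item in the set = multiples of lcm(itemset)
    return num_baskets // lcm(*itemset)

print(support_via_lcm({5, 7}))        # floor(100/35) = 2
print(support_via_lcm({2, 5, 7}))     # floor(100/70) = 1 -> conf({5,7} -> 2) = 1/2
print(support_via_lcm({2, 3, 4}))     # floor(100/12) = 8
print(support_via_lcm({2, 3, 4, 5}))  # floor(100/60) = 1 -> conf({2,3,4} -> 5) = 1/8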
from itertools import combinations

num_baskets = 100
num_items = 100

# (a) an item is frequent if it appears in at least 5 baskets,
# i.e. it has at least 5 multiples up to 100
frequent_items = []
for item in range(1, num_items + 1):
    basket_count = sum(1 for basket in range(1, num_baskets + 1) if basket % item == 0)
    if basket_count >= 5:
        frequent_items.append(item)
print(frequent_items)

def calculate_support(itemset):
    # item i is in basket b iff i divides b, so count baskets divisible by every item
    count = 0
    for basket in range(1, num_baskets + 1):
        if all(basket % item == 0 for item in itemset):
            count += 1
    return count

def calculate_confidence(antecedent, consequent):
    antecedent_support = calculate_support(antecedent)
    antecedent_consequent_support = calculate_support(antecedent.union(consequent))
    return antecedent_consequent_support / antecedent_support

# (b) the two association rules
association_rules = [
    ({5, 7}, {2}),
    ({2, 3, 4}, {5})
]

for antecedent, consequent in association_rules:
    support = calculate_support(antecedent.union(consequent))
    confidence = calculate_confidence(antecedent, consequent)
    print(f"Rule {antecedent} → {consequent}:")
    print(f"Support: {support}")
    print(f"Confidence: {confidence}")
    print()

# (c) frequent itemsets of size 1, 2, and 3 with support threshold 5
frequent_itemsets = []
for item in frequent_items:
    frequent_itemsets.append({item})
for itemset in combinations(frequent_items, 2):
    if calculate_support(set(itemset)) >= 5:
        frequent_itemsets.append(set(itemset))
for itemset in combinations(frequent_items, 3):
    if calculate_support(set(itemset)) >= 5:
        frequent_itemsets.append(set(itemset))

print("Frequent Itemsets with Support Threshold 5:")
for itemset in frequent_itemsets:
    support = calculate_support(itemset)
    print(f"Itemset: {itemset}, Support: {support}")
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
Rule {5, 7} → {2}:
Support: 1
Confidence: 0.5
Rule {2, 3, 4} → {5}:
Support: 1
Confidence: 0.125
Frequent Itemsets with Support Threshold 5:
Itemset: {1}, Support: 100
Itemset: {2}, Support: 50
Itemset: {3}, Support: 33
Itemset: {4}, Support: 25
Itemset: {5}, Support: 20
Itemset: {6}, Support: 16
Itemset: {7}, Support: 14
Itemset: {8}, Support: 12
Itemset: {9}, Support: 11
Itemset: {10}, Support: 10
Itemset: {11}, Support: 9
Itemset: {12}, Support: 8
Itemset: {13}, Support: 7
Itemset: {14}, Support: 7
Itemset: {15}, Support: 6
Itemset: {16}, Support: 6
Itemset: {17}, Support: 5
Itemset: {18}, Support: 5
Itemset: {19}, Support: 5
Itemset: {20}, Support: 5
Itemset: {1, 2}, Support: 50
Itemset: {1, 3}, Support: 33
Itemset: {1, 4}, Support: 25
Itemset: {1, 5}, Support: 20
Itemset: {1, 6}, Support: 16
Itemset: {1, 7}, Support: 14
Itemset: {8, 1}, Support: 12
Itemset: {1, 9}, Support: 11
Itemset: {1, 10}, Support: 10
Itemset: {1, 11}, Support: 9
Itemset: {1, 12}, Support: 8
Itemset: {1, 13}, Support: 7
Itemset: {1, 14}, Support: 7
Itemset: {1, 15}, Support: 6
Itemset: {16, 1}, Support: 6
Itemset: {1, 17}, Support: 5
Itemset: {1, 18}, Support: 5
Itemset: {1, 19}, Support: 5
Itemset: {1, 20}, Support: 5
Itemset: {2, 3}, Support: 16
Itemset: {2, 4}, Support: 25
Itemset: {2, 5}, Support: 10
Itemset: {2, 6}, Support: 16
Itemset: {2, 7}, Support: 7
Itemset: {8, 2}, Support: 12
Itemset: {9, 2}, Support: 5
Itemset: {2, 10}, Support: 10
Itemset: {2, 12}, Support: 8
Itemset: {2, 14}, Support: 7
Itemset: {16, 2}, Support: 6
Itemset: {2, 18}, Support: 5
Itemset: {2, 20}, Support: 5
Itemset: {3, 4}, Support: 8
Itemset: {3, 5}, Support: 6
Itemset: {3, 6}, Support: 16
Itemset: {9, 3}, Support: 11
Itemset: {3, 12}, Support: 8
Itemset: {3, 15}, Support: 6
Itemset: {18, 3}, Support: 5
Itemset: {4, 5}, Support: 5
Itemset: {4, 6}, Support: 8
Itemset: {8, 4}, Support: 12
Itemset: {10, 4}, Support: 5
Itemset: {4, 12}, Support: 8
Itemset: {16, 4}, Support: 6
Itemset: {4, 20}, Support: 5
Itemset: {10, 5}, Support: 10
Itemset: {5, 15}, Support: 6
Itemset: {20, 5}, Support: 5
Itemset: {9, 6}, Support: 5
Itemset: {12, 6}, Support: 8
Itemset: {18, 6}, Support: 5
Itemset: {14, 7}, Support: 7
Itemset: {8, 16}, Support: 6
Itemset: {9, 18}, Support: 5
Itemset: {10, 20}, Support: 5
Itemset: {1, 2, 3}, Support: 16
Itemset: {1, 2, 4}, Support: 25
Itemset: {1, 2, 5}, Support: 10
Itemset: {1, 2, 6}, Support: 16
Itemset: {1, 2, 7}, Support: 7
Itemset: {8, 1, 2}, Support: 12
Itemset: {1, 2, 9}, Support: 5
Itemset: {1, 2, 10}, Support: 10
Itemset: {1, 2, 12}, Support: 8
Itemset: {1, 2, 14}, Support: 7
Itemset: {16, 1, 2}, Support: 6
Itemset: {1, 2, 18}, Support: 5
Itemset: {1, 2, 20}, Support: 5
Itemset: {1, 3, 4}, Support: 8
Itemset: {1, 3, 5}, Support: 6
Itemset: {1, 3, 6}, Support: 16
Itemset: {1, 3, 9}, Support: 11
Itemset: {1, 3, 12}, Support: 8
Itemset: {1, 3, 15}, Support: 6
Itemset: {1, 18, 3}, Support: 5
Itemset: {1, 4, 5}, Support: 5
Itemset: {1, 4, 6}, Support: 8
Itemset: {8, 1, 4}, Support: 12
Itemset: {1, 10, 4}, Support: 5
Itemset: {1, 4, 12}, Support: 8
Itemset: {16, 1, 4}, Support: 6
Itemset: {1, 4, 20}, Support: 5
Itemset: {1, 10, 5}, Support: 10
Itemset: {1, 5, 15}, Support: 6
Itemset: {1, 20, 5}, Support: 5
Itemset: {1, 6, 9}, Support: 5
Itemset: {1, 12, 6}, Support: 8
Itemset: {1, 18, 6}, Support: 5
Itemset: {1, 14, 7}, Support: 7
Itemset: {8, 1, 16}, Support: 6
Itemset: {1, 18, 9}, Support: 5
Itemset: {1, 10, 20}, Support: 5
Itemset: {2, 3, 4}, Support: 8
Itemset: {2, 3, 6}, Support: 16
Itemset: {9, 2, 3}, Support: 5
Itemset: {2, 3, 12}, Support: 8
Itemset: {18, 2, 3}, Support: 5
Itemset: {2, 4, 5}, Support: 5
Itemset: {2, 4, 6}, Support: 8
Itemset: {8, 2, 4}, Support: 12
Itemset: {2, 10, 4}, Support: 5
Itemset: {2, 4, 12}, Support: 8
Itemset: {16, 2, 4}, Support: 6
Itemset: {2, 4, 20}, Support: 5
Itemset: {2, 10, 5}, Support: 10
Itemset: {2, 20, 5}, Support: 5
Itemset: {9, 2, 6}, Support: 5
Itemset: {2, 12, 6}, Support: 8
Itemset: {2, 18, 6}, Support: 5
Itemset: {2, 14, 7}, Support: 7
Itemset: {8, 16, 2}, Support: 6
Itemset: {9, 2, 18}, Support: 5
Itemset: {2, 10, 20}, Support: 5
Itemset: {3, 4, 6}, Support: 8
Itemset: {3, 4, 12}, Support: 8
Itemset: {3, 5, 15}, Support: 6
Itemset: {9, 3, 6}, Support: 5
Itemset: {3, 12, 6}, Support: 8
Itemset: {18, 3, 6}, Support: 5
Itemset: {9, 18, 3}, Support: 5
Itemset: {10, 4, 5}, Support: 5
Itemset: {20, 4, 5}, Support: 5
Itemset: {4, 12, 6}, Support: 8
Itemset: {8, 16, 4}, Support: 6
Itemset: {10, 4, 20}, Support: 5
Itemset: {10, 20, 5}, Support: 5
Itemset: {9, 18, 6}, Support: 5
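For reference, the code above tests every pair and triple of frequent items by brute force. The pruning step that gives A-Priori its speed would instead generate candidate triples only from itemsets whose 2-element subsets are all frequent. A sketch of that step, reusing frequent_items and calculate_support from above:

from itertools import combinations

frequent_pairs = [set(p) for p in combinations(frequent_items, 2)
                  if calculate_support(set(p)) >= 5]

frequent_triples = []
for p in combinations(frequent_items, 3):
    # A-Priori pruning: a triple can be frequent only if all three of its pairs are
    if all(set(pair) in frequent_pairs for pair in combinations(p, 2)):
        if calculate_support(set(p)) >= 5:
            frequent_triples.append(set(p))
print(len(frequent_triples), "frequent triples survive the pruning")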
6)
Imagine we have summarized a collection of documents and words with the following table:
zoo, kangaroo, monkey, alligator, tiger, camel, eagle, lemur, dragon, pizza
doc1: 51, 92, 14, 71, 60, 20, 82, 86, 74, 74
doc2: 87, 99, 23, 2, 21, 52, 1, 87, 29, 37
doc3: 1, 63, 59, 20, 32, 75, 57, 21, 88, 48
doc4: 90, 58, 41, 91, 59, 79, 14, 61, 61, 46
doc5: 61, 50, 54, 63, 2, 50, 6, 20, 72, 38
doc6: 17, 3, 88, 59, 13, 8, 89, 52, 1, 83
doc7: 91, 59, 70, 43, 7, 46, 34, 77, 80, 35
doc8: 49, 3, 1, 5, 53, 3, 53, 92, 62, 17
doc9: 89, 43, 33, 73, 61, 99, 13, 94, 47, 14
doc10: 71, 77, 86, 61, 39, 84, 79, 81, 52, 23
doc11: 25, 88, 59, 40, 28, 14, 44, 64, 88, 70
doc12: 8, 87, 0, 7, 87, 62, 10, 80, 7, 34
doc13: 34, 32, 4, 40, 27, 6, 72, 71, 11, 33
doc14: 32, 47, 22, 61, 87, 36, 98, 43, 85, 90
doc15: 34, 64, 98, 46, 77, 2, 0, 4, 89, 13
(a) Compute the reduced SVD embedding with 2 dimensions for all 15 documents and make a scatterplot of the reduced points.
(b) Compute the reduced SVD embedding with 2 dimensions for all 10 words and make a scatterplot of the reduced points.
(c) What patterns do you observe from these 2-d embeddings?
import numpy as np
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

words = ["zoo", "kangaroo", "monkey", "alligator", "tiger",
         "camel", "eagle", "lemur", "dragon", "pizza"]

docs = [
    [51, 92, 14, 71, 60, 20, 82, 86, 74, 74],
    [87, 99, 23, 2, 21, 52, 1, 87, 29, 37],
    [1, 63, 59, 20, 32, 75, 57, 21, 88, 48],
    [90, 58, 41, 91, 59, 79, 14, 61, 61, 46],
    [61, 50, 54, 63, 2, 50, 6, 20, 72, 38],
    [17, 3, 88, 59, 13, 8, 89, 52, 1, 83],
    [91, 59, 70, 43, 7, 46, 34, 77, 80, 35],
    [49, 3, 1, 5, 53, 3, 53, 92, 62, 17],
    [89, 43, 33, 73, 61, 99, 13, 94, 47, 14],
    [71, 77, 86, 61, 39, 84, 79, 81, 52, 23],
    [25, 88, 59, 40, 28, 14, 44, 64, 88, 70],
    [8, 87, 0, 7, 87, 62, 10, 80, 7, 34],
    [34, 32, 4, 40, 27, 6, 72, 71, 11, 33],
    [32, 47, 22, 61, 87, 36, 98, 43, 85, 90],
    [34, 64, 98, 46, 77, 2, 0, 4, 89, 13],
]
# (a) 2-d SVD embedding of the 15 documents
X = np.array(docs)
svd = TruncatedSVD(n_components=2)
document_embeddings = svd.fit_transform(X)

plt.figure(figsize=(8, 6))
plt.scatter(document_embeddings[:, 0], document_embeddings[:, 1], c='b', marker='o')
for i in range(len(docs)):
    # label each point with its document name (there are 15 documents but
    # only 10 words, so labeling these points with words would be wrong)
    plt.annotate(f"doc{i + 1}", (document_embeddings[i, 0], document_embeddings[i, 1]))
plt.title('SVD Embedding of Documents (2D)')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.grid(True)
plt.show()

# (b) 2-d SVD embedding of the 10 words: transpose so rows are words
X_transpose = X.T
svd_words = TruncatedSVD(n_components=2)
word_embeddings = svd_words.fit_transform(X_transpose)

plt.figure(figsize=(8, 6))
plt.scatter(word_embeddings[:, 0], word_embeddings[:, 1], c='r', marker='x')
for i, word in enumerate(words):
    plt.annotate(word, (word_embeddings[i, 0], word_embeddings[i, 1]))
plt.title('SVD Embedding of Words (2D)')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.grid(True)
plt.show()
Honestly, I cannot see any clear patterns. That may be because I do not fully understand how the embedding works, or because there are no strong patterns in this data. Even in the second plot, where eagle, pizza, monkey, tiger, and camel all sit near 170 on the x-axis, I do not see anything meaningful.
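One possible reason the plots look structureless: TruncatedSVD does not center the data, so the first component mostly tracks each row's overall magnitude rather than contrasts between rows. A variation worth trying (a sketch, not a guaranteed fix) is to subtract the column means before the decomposition, which makes it equivalent to PCA:

import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.array(docs)                  # reusing the docs matrix defined above
X_centered = X - X.mean(axis=0)     # subtract each word's mean count
svd = TruncatedSVD(n_components=2)
centered_embeddings = svd.fit_transform(X_centered)
# plotting these points as before may show more contrast between documents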