module2_assignment
Washington State University, Mathematics 319
Feb 20, 2024
1) Explain in your own words the curse of dimensionality and its importance in modern data analysis.

The curse of dimensionality refers to phenomena that arise when analyzing data in high-dimensional spaces and that do not occur in low-dimensional settings. Data behaves very differently with 200 variables than with 2: as dimension grows, pairwise distances tend to concentrate, so points start to look nearly equidistant from one another. Because of this, when you work with high-dimensional data you capture more noise and can lose sight of meaningful patterns. High-dimensional data is also harder to visualize, since current tools struggle with feature-rich data, and it is more costly to store. So when the application allows it, working with a low-dimensional representation is often more cost-effective, more efficient, and yields cleaner results.

2) Choose two different distance metrics that can be applied to pairs of vectors in R^n and explain their differences. For each metric, give an example of a dataset that would be more appropriately analyzed using that metric.

I am going to look at Euclidean distance and Manhattan distance.

Euclidean: this measures the straight-line distance between two points in an n-dimensional space. It is calculated as the square root of the sum of the squared differences between corresponding coordinates: d(x, y) = sqrt(sum_i (x_i - y_i)^2). You might use this when calculating the distance between two points in physical, real-world space, as in robotics. A dataset that suits this metric is one of houses and their numeric attributes (square feet, bedrooms, bathrooms); it lets you directly compare houses to see how similar they are to each other.

Manhattan: this measures the distance between two points in n-dimensional space as the sum of the absolute differences of their coordinates along each dimension: d(x, y) = sum_i |x_i - y_i|. It is best suited to navigating on a grid. A dataset you could analyze with this metric is a street map of Manhattan: you could use it to find the shortest path from one point to another, come up with efficient driving routes, or plan a running route. A small sketch of both metrics, and of the distance concentration mentioned above, follows.
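Below is a minimal sketch, assuming NumPy is available, that computes both metrics on a pair of hypothetical house vectors (h1 and h2 are made-up examples, not data from the assignment) and then illustrates the concentration effect: the ratio between the farthest and nearest of a set of random points shrinks toward 1 as the dimension grows.

import numpy as np

# Two hypothetical houses: [square feet, bedrooms, bathrooms]
h1 = np.array([1800.0, 3.0, 2.0])
h2 = np.array([2400.0, 4.0, 2.5])

euclidean = np.linalg.norm(h1 - h2)   # sqrt of the sum of squared differences
manhattan = np.sum(np.abs(h1 - h2))   # sum of absolute differences
print("Euclidean:", euclidean, "Manhattan:", manhattan)

# Distance concentration: spread of distances in 2 dimensions vs 200
rng = np.random.default_rng(0)
for d in (2, 200):
    pts = rng.random((500, d))
    dists = np.linalg.norm(pts - pts[0], axis=1)[1:]  # distances from one point
    print(f"dim={d}: max/min distance ratio = {dists.max() / dists.min():.2f}")

In 2 dimensions the max/min ratio is large; in 200 dimensions it is close to 1, which is the concentration behavior described above.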
3) Consider the matrix

M = [3 2 1]
    [2 5 4]
    [1 4 8]

(a) Compute Q_M([-7, -8, -9]^T).
(b) Compute the distance between x = [1, 2, 3]^T and y = [-7, -8, -9]^T using the distance from Q_M.
(c) What is the maximum value of v^T M v subject to ||v|| = 1, and what v attains that value?

import numpy as np

M = np.array([[3, 2, 1],
              [2, 5, 4],
              [1, 4, 8]])

# (a) Taking Q_M to be the quadratic form Q_M(v) = v^T M v
v = np.array([-7, -8, -9])
print("Q_M(v) =", v @ M @ v)

# (b) The distance induced by Q_M: d(x, y) = sqrt((x - y)^T M (x - y))
x = np.array([1, 2, 3])
y = np.array([-7, -8, -9])
diff = x - y
distance = np.sqrt(diff @ M @ diff)
print(f"Distance between x and y = {distance:.4f}")

# (c) The maximum of v^T M v over unit vectors is the largest eigenvalue
# of M, attained at the corresponding unit eigenvector.
eigenvalues, eigenvectors = np.linalg.eig(M)
i = np.argmax(eigenvalues)
max_eigenvalue = eigenvalues[i]
max_eigenvector = eigenvectors[:, i]  # np.linalg.eig returns unit-norm columns
print("Maximum value of v^T Mv:", max_eigenvalue)
print("Vector v that attains the maximum value:", max_eigenvector)

Q_M(v) = 2041
Distance between x and y = 57.5847
Maximum value of v^T Mv: 11.24577226473478
Vector v that attains the maximum value: [-0.23473499 -0.57643452 -0.7827022]
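As a sanity check on part (c), here is a short power-iteration sketch (my addition, not part of the assignment): since M is symmetric positive definite, repeatedly applying M and renormalizing converges to the dominant eigenpair, and the Rayleigh quotient should match the eigenvalue above up to the sign of v.

import numpy as np

M = np.array([[3, 2, 1], [2, 5, 4], [1, 4, 8]])

# Power iteration: apply M, renormalize, repeat
v = np.ones(3)
for _ in range(100):
    v = M @ v
    v /= np.linalg.norm(v)

print("Rayleigh quotient v^T M v:", v @ M @ v)  # ~11.2458
print("Unit vector v:", v)                      # ~[0.2347, 0.5764, 0.7827]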
4) (a) Compute the Jaccard similarities of each pair of the following three sets: {1, 2, 3, 4}, {2, 3, 5, 7}, and {2, 4, 6}.
(b) Compute the Jaccard bag similarity of each pair of the following three bags: {1, 1, 1, 2}, {1, 1, 2, 2, 3}, and {1, 2, 3, 4}.

def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union

def jaccard_bag_similarity(bag1, bag2):
    # Bag (multiset) Jaccard: count each element min(...) times in the
    # intersection and max(...) times in the union. The union must range
    # over the elements of both bags, not just bag1.
    items = set(bag1) | set(bag2)
    intersection = sum(min(bag1.count(i), bag2.count(i)) for i in items)
    union = sum(max(bag1.count(i), bag2.count(i)) for i in items)
    return intersection / union

set1 = {1, 2, 3, 4}
set2 = {2, 3, 5, 7}
set3 = {2, 4, 6}
bag1 = [1, 1, 1, 2]
bag2 = [1, 1, 2, 2, 3]
bag3 = [1, 2, 3, 4]

print("Jaccard Similarities for Sets:")
print(f"J(set1, set2) = {jaccard_similarity(set1, set2)}")
print(f"J(set1, set3) = {jaccard_similarity(set1, set3)}")
print(f"J(set2, set3) = {jaccard_similarity(set2, set3)}")

print("\nJaccard Bag Similarities for Bags:")
print(f"J(bag1, bag2) = {jaccard_bag_similarity(bag1, bag2)}")
print(f"J(bag1, bag3) = {jaccard_bag_similarity(bag1, bag3)}")
print(f"J(bag2, bag3) = {jaccard_bag_similarity(bag2, bag3)}")

Jaccard Similarities for Sets:
J(set1, set2) = 0.3333333333333333
J(set1, set3) = 0.4
J(set2, set3) = 0.16666666666666666

Jaccard Bag Similarities for Bags:
J(bag1, bag2) = 0.5
J(bag1, bag3) = 0.3333333333333333
J(bag2, bag3) = 0.5
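Definitions of bag Jaccard vary by textbook. Mining of Massive Datasets, for instance, counts the union as the sum of the two bag sizes (so identical bags score 1/2, not 1). A small sketch of that variant, if that is the intended definition, gives smaller values:

from collections import Counter

def jaccard_bag_mmds(bag1, bag2):
    # Intersection counts each element min(count1, count2) times;
    # the union is |bag1| + |bag2| under this definition.
    c1, c2 = Counter(bag1), Counter(bag2)
    intersection = sum(min(c1[i], c2[i]) for i in c1.keys() & c2.keys())
    return intersection / (len(bag1) + len(bag2))

print(jaccard_bag_mmds([1, 1, 1, 2], [1, 1, 2, 2, 3]))  # 3/9 = 0.333...
print(jaccard_bag_mmds([1, 1, 1, 2], [1, 2, 3, 4]))     # 2/8 = 0.25
print(jaccard_bag_mmds([1, 1, 2, 2, 3], [1, 2, 3, 4]))  # 3/9 = 0.333...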
5) Suppose there are 100 items, numbered 1 to 100, and also 100 baskets, also numbered 1 to 100. Item i is in basket b if and only if i divides b with no remainder. Thus, item 1 is in all the baskets, item 2 is in all fifty of the even-numbered baskets, and so on. Basket 12 consists of items {1, 2, 3, 4, 6, 12}, since these are all the integers that divide 12. Answer the following questions:
(a) If the support threshold is 5, which items are frequent?
(b) What is the confidence of the following association rules?
    i. {5, 7} → 2
    ii. {2, 3, 4} → 5
(c) Apply the A-Priori Algorithm with support threshold 5 to this data.

from itertools import combinations

num_baskets = 100
num_items = 100

# (a) Item i appears in floor(100 / i) baskets, so it is frequent
# (support >= 5) exactly when i <= 20.
frequent_items = []
for item in range(1, num_items + 1):
    basket_count = sum(1 for basket in range(1, num_baskets + 1) if basket % item == 0)
    if basket_count >= 5:
        frequent_items.append(item)
print(frequent_items)

def calculate_support(itemset):
    # A basket contains every item in the set iff it is divisible by each
    # item, i.e., by lcm(itemset); the count is floor(100 / lcm).
    count = 0
    for basket in range(1, num_baskets + 1):
        if all(basket % item == 0 for item in itemset):
            count += 1
    return count

def calculate_confidence(antecedent, consequent):
    # confidence(A -> B) = support(A ∪ B) / support(A)
    return calculate_support(antecedent.union(consequent)) / calculate_support(antecedent)

# (b) i.  {5, 7} -> 2: lcm(5, 7) = 35 gives baskets {35, 70}; adding item 2
#         gives lcm 70, basket {70} only, so confidence = 1/2.
#     ii. {2, 3, 4} -> 5: lcm(2, 3, 4) = 12 gives 8 baskets; adding item 5
#         gives lcm 60, basket {60} only, so confidence = 1/8.
association_rules = [({5, 7}, {2}), ({2, 3, 4}, {5})]
for antecedent, consequent in association_rules:
    support = calculate_support(antecedent.union(consequent))
    confidence = calculate_confidence(antecedent, consequent)
    print(f"Rule {antecedent} → {consequent}:")
    print(f"Support: {support}")
    print(f"Confidence: {confidence}")
    print()

# (c) Candidate pairs and triples are drawn only from the frequent items,
# per the A-Priori principle, then filtered by the support threshold.
frequent_itemsets = [{item} for item in frequent_items]
for size in (2, 3):
    for itemset in combinations(frequent_items, size):
        if calculate_support(set(itemset)) >= 5:
            frequent_itemsets.append(set(itemset))

print("Frequent Itemsets with Support Threshold 5:")
for itemset in frequent_itemsets:
    print(f"Itemset: {itemset}, Support: {calculate_support(itemset)}")
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

Rule {5, 7} → {2}:
Support: 1
Confidence: 0.5

Rule {2, 3, 4} → {5}:
Support: 1
Confidence: 0.125

Frequent Itemsets with Support Threshold 5:
Itemset: {1}, Support: 100
Itemset: {2}, Support: 50
Itemset: {3}, Support: 33
Itemset: {4}, Support: 25
Itemset: {5}, Support: 20
Itemset: {6}, Support: 16
Itemset: {7}, Support: 14
Itemset: {8}, Support: 12
Itemset: {9}, Support: 11
Itemset: {10}, Support: 10
Itemset: {11}, Support: 9
Itemset: {12}, Support: 8
Itemset: {13}, Support: 7
Itemset: {14}, Support: 7
Itemset: {15}, Support: 6
Itemset: {16}, Support: 6
Itemset: {17}, Support: 5
Itemset: {18}, Support: 5
Itemset: {19}, Support: 5
Itemset: {20}, Support: 5
Itemset: {1, 2}, Support: 50
Itemset: {1, 3}, Support: 33
Itemset: {1, 4}, Support: 25
Itemset: {1, 5}, Support: 20
Itemset: {1, 6}, Support: 16
Itemset: {1, 7}, Support: 14
Itemset: {8, 1}, Support: 12
Itemset: {1, 9}, Support: 11
Itemset: {1, 10}, Support: 10
Itemset: {1, 11}, Support: 9
Itemset: {1, 12}, Support: 8
Itemset: {1, 13}, Support: 7
Itemset: {1, 14}, Support: 7
Itemset: {1, 15}, Support: 6
Itemset: {16, 1}, Support: 6
Itemset: {1, 17}, Support: 5
Itemset: {1, 18}, Support: 5
Itemset: {1, 19}, Support: 5
Itemset: {1, 20}, Support: 5
Itemset: {2, 3}, Support: 16
Itemset: {2, 4}, Support: 25
Itemset: {2, 5}, Support: 10
Itemset: {2, 6}, Support: 16
Itemset: {2, 7}, Support: 7
Itemset: {8, 2}, Support: 12
Itemset: {9, 2}, Support: 5
Itemset: {2, 10}, Support: 10
Itemset: {2, 12}, Support: 8
Itemset: {2, 14}, Support: 7
Itemset: {16, 2}, Support: 6
Itemset: {2, 18}, Support: 5
Itemset: {2, 20}, Support: 5
Itemset: {3, 4}, Support: 8
Itemset: {3, 5}, Support: 6
Itemset: {3, 6}, Support: 16
Itemset: {9, 3}, Support: 11
Itemset: {3, 12}, Support: 8
Itemset: {3, 15}, Support: 6
Itemset: {18, 3}, Support: 5
Itemset: {4, 5}, Support: 5
Itemset: {4, 6}, Support: 8
Itemset: {8, 4}, Support: 12
Itemset: {10, 4}, Support: 5
Itemset: {4, 12}, Support: 8
Itemset: {16, 4}, Support: 6
Itemset: {4, 20}, Support: 5
Itemset: {10, 5}, Support: 10
Itemset: {5, 15}, Support: 6
Itemset: {20, 5}, Support: 5
Itemset: {9, 6}, Support: 5
Itemset: {12, 6}, Support: 8
Itemset: {18, 6}, Support: 5
Itemset: {14, 7}, Support: 7
Itemset: {8, 16}, Support: 6
Itemset: {9, 18}, Support: 5
Itemset: {10, 20}, Support: 5
Itemset: {1, 2, 3}, Support: 16
Itemset: {1, 2, 4}, Support: 25
Itemset: {1, 2, 5}, Support: 10
Itemset: {1, 2, 6}, Support: 16
Itemset: {1, 2, 7}, Support: 7
Itemset: {8, 1, 2}, Support: 12
Itemset: {1, 2, 9}, Support: 5
Itemset: {1, 2, 10}, Support: 10
Itemset: {1, 2, 12}, Support: 8
Itemset: {1, 2, 14}, Support: 7
Itemset: {16, 1, 2}, Support: 6
Itemset: {1, 2, 18}, Support: 5
Itemset: {1, 2, 20}, Support: 5
Itemset: {1, 3, 4}, Support: 8
Itemset: {1, 3, 5}, Support: 6
Itemset: {1, 3, 6}, Support: 16
Itemset: {1, 3, 9}, Support: 11
Itemset: {1, 3, 12}, Support: 8
Itemset: {1, 3, 15}, Support: 6
Itemset: {1, 18, 3}, Support: 5
Itemset: {1, 4, 5}, Support: 5
Itemset: {1, 4, 6}, Support: 8
Itemset: {8, 1, 4}, Support: 12
Itemset: {1, 10, 4}, Support: 5
Itemset: {1, 4, 12}, Support: 8
Itemset: {16, 1, 4}, Support: 6
Itemset: {1, 4, 20}, Support: 5
Itemset: {1, 10, 5}, Support: 10
Itemset: {1, 5, 15}, Support: 6
Itemset: {1, 20, 5}, Support: 5
Itemset: {1, 6, 9}, Support: 5
Itemset: {1, 12, 6}, Support: 8
Itemset: {1, 18, 6}, Support: 5
Itemset: {1, 14, 7}, Support: 7
Itemset: {8, 1, 16}, Support: 6
Itemset: {1, 18, 9}, Support: 5
Itemset: {1, 10, 20}, Support: 5
Itemset: {2, 3, 4}, Support: 8
Itemset: {2, 3, 6}, Support: 16
Itemset: {9, 2, 3}, Support: 5
Itemset: {2, 3, 12}, Support: 8
Itemset: {18, 2, 3}, Support: 5
Itemset: {2, 4, 5}, Support: 5
Itemset: {2, 4, 6}, Support: 8
Itemset: {8, 2, 4}, Support: 12
Itemset: {2, 10, 4}, Support: 5
Itemset: {2, 4, 12}, Support: 8
Itemset: {16, 2, 4}, Support: 6
Itemset: {2, 4, 20}, Support: 5
Itemset: {2, 10, 5}, Support: 10
Itemset: {2, 20, 5}, Support: 5
Itemset: {9, 2, 6}, Support: 5
Itemset: {2, 12, 6}, Support: 8
Itemset: {2, 18, 6}, Support: 5
Itemset: {2, 14, 7}, Support: 7
Itemset: {8, 16, 2}, Support: 6
Itemset: {9, 2, 18}, Support: 5
Itemset: {2, 10, 20}, Support: 5
Itemset: {3, 4, 6}, Support: 8
Itemset: {3, 4, 12}, Support: 8
Itemset: {3, 5, 15}, Support: 6
Itemset: {9, 3, 6}, Support: 5
Itemset: {3, 12, 6}, Support: 8
Itemset: {18, 3, 6}, Support: 5
Itemset: {9, 18, 3}, Support: 5
Itemset: {10, 4, 5}, Support: 5
Itemset: {20, 4, 5}, Support: 5
Itemset: {4, 12, 6}, Support: 8
Itemset: {8, 16, 4}, Support: 6
Itemset: {10, 4, 20}, Support: 5
Itemset: {10, 20, 5}, Support: 5
Itemset: {9, 18, 6}, Support: 5
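Since item i occupies exactly the baskets divisible by i, every support above can also be checked analytically: support(S) = floor(100 / lcm(S)). A small sketch of that check, using math.lcm (available in Python 3.9+), spot-checks a few of the entries:

from math import lcm

def support_via_lcm(itemset):
    # Baskets containing all of S are the multiples of lcm(S) up to 100.
    return 100 // lcm(*itemset)

print(support_via_lcm({2, 3}))      # 100 // 6  = 16
print(support_via_lcm({4, 8}))      # 100 // 8  = 12
print(support_via_lcm({2, 4, 5}))   # 100 // 20 = 5
print(support_via_lcm({3, 6, 12}))  # 100 // 12 = 8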
6) Imagine we have summarized a collection of documents and words with the following table:

zoo, kangaroo, monkey, alligator, tiger, camel, eagle, lemur, dragon, pizza
doc1: 51, 92, 14, 71, 60, 20, 82, 86, 74, 74
doc2: 87, 99, 23, 2, 21, 52, 1, 87, 29, 37
doc3: 1, 63, 59, 20, 32, 75, 57, 21, 88, 48
doc4: 90, 58, 41, 91, 59, 79, 14, 61, 61, 46
doc5: 61, 50, 54, 63, 2, 50, 6, 20, 72, 38
doc6: 17, 3, 88, 59, 13, 8, 89, 52, 1, 83
doc7: 91, 59, 70, 43, 7, 46, 34, 77, 80, 35
doc8: 49, 3, 1, 5, 53, 3, 53, 92, 62, 17
doc9: 89, 43, 33, 73, 61, 99, 13, 94, 47, 14
doc10: 71, 77, 86, 61, 39, 84, 79, 81, 52, 23
doc11: 25, 88, 59, 40, 28, 14, 44, 64, 88, 70
doc12: 8, 87, 0, 7, 87, 62, 10, 80, 7, 34
doc13: 34, 32, 4, 40, 27, 6, 72, 71, 11, 33
doc14: 32, 47, 22, 61, 87, 36, 98, 43, 85, 90
doc15: 34, 64, 98, 46, 77, 2, 0, 4, 89, 13

(a) Compute the reduced SVD embedding with 2 dimensions for all 15 documents and make a scatterplot of the reduced points.
(b) Compute the reduced SVD embedding with 2 dimensions for all 10 words and make a scatterplot of the reduced points.
(c) What patterns do you observe from these 2-d embeddings?

import numpy as np
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

words = ["zoo", "kangaroo", "monkey", "alligator", "tiger", "camel",
         "eagle", "lemur", "dragon", "pizza"]
docs = [
    [51, 92, 14, 71, 60, 20, 82, 86, 74, 74],
    [87, 99, 23, 2, 21, 52, 1, 87, 29, 37],
    [1, 63, 59, 20, 32, 75, 57, 21, 88, 48],
    [90, 58, 41, 91, 59, 79, 14, 61, 61, 46],
    [61, 50, 54, 63, 2, 50, 6, 20, 72, 38],
    [17, 3, 88, 59, 13, 8, 89, 52, 1, 83],
    [91, 59, 70, 43, 7, 46, 34, 77, 80, 35],
    [49, 3, 1, 5, 53, 3, 53, 92, 62, 17],
    [89, 43, 33, 73, 61, 99, 13, 94, 47, 14],
    [71, 77, 86, 61, 39, 84, 79, 81, 52, 23],
    [25, 88, 59, 40, 28, 14, 44, 64, 88, 70],
    [8, 87, 0, 7, 87, 62, 10, 80, 7, 34],
    [34, 32, 4, 40, 27, 6, 72, 71, 11, 33],
    [32, 47, 22, 61, 87, 36, 98, 43, 85, 90],
    [34, 64, 98, 46, 77, 2, 0, 4, 89, 13],
]

X = np.array(docs)

# (a) 2-d SVD embedding of the 15 documents (rows of X)
svd = TruncatedSVD(n_components=2)
document_embeddings = svd.fit_transform(X)

plt.figure(figsize=(8, 6))
plt.scatter(document_embeddings[:, 0], document_embeddings[:, 1], c='b', marker='o')
# Label each point with its document name (there are 15 documents, not 10 words)
for i in range(len(docs)):
    plt.annotate(f"doc{i + 1}", (document_embeddings[i, 0], document_embeddings[i, 1]))
plt.title('SVD Embedding of Documents (2D)')
plt.xlabel('Dimension 1')
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
plt.ylabel('Dimension 2')
plt.grid(True)
plt.show()

# (b) 2-d SVD embedding of the 10 words (rows of X^T)
X_transpose = X.T
svd_words = TruncatedSVD(n_components=2)
word_embeddings = svd_words.fit_transform(X_transpose)

plt.figure(figsize=(8, 6))
plt.scatter(word_embeddings[:, 0], word_embeddings[:, 1], c='r', marker='x')
for i, word in enumerate(words):
    plt.annotate(word, (word_embeddings[i, 0], word_embeddings[i, 1]))
plt.title('SVD Embedding of Words (2D)')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.grid(True)
plt.show()
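One way to gauge how much structure these 2-d pictures can show at all is the variance captured by the two components. A minimal check, continuing from the svd and svd_words objects fit in the cells above, uses TruncatedSVD's explained_variance_ratio_ attribute:

# Assumes the svd and svd_words objects above have already been fit.
print("Docs:", svd.explained_variance_ratio_,
      "total:", svd.explained_variance_ratio_.sum())
print("Words:", svd_words.explained_variance_ratio_,
      "total:", svd_words.explained_variance_ratio_.sum())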
I honestly cannot see any patterns. This may be because I do not fully understand what to look for, or because there are no strong patterns in this data that make sense to me. Even in the second plot, where eagle, pizza, monkey, tiger, and camel all sit close to 170 on the x-axis, I do not see anything meaningful.
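One caveat worth noting (my addition, not part of the original analysis): on uncentered nonnegative data like this, the first SVD component mostly tracks overall magnitude, so mean-centering the columns before the SVD, PCA-style, can make any remaining structure easier to see. A minimal sketch, assuming the matrix X from the cells above:

# Assumes X (the 15 x 10 document-word matrix) from the cells above.
import numpy as np

X_centered = X - X.mean(axis=0)       # center each word column
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
doc_emb = U[:, :2] * s[:2]            # 2-d document coordinates
word_emb = Vt[:2].T * s[:2]           # 2-d word coordinates
print(doc_emb.shape, word_emb.shape)  # (15, 2) (10, 2)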