Rev-Fi23

docx

School

University of Houston *

Course

3337

Subject

Computer Science

Date

Jan 9, 2024

Type

docx

Pages

9

Uploaded by samiyakhtar

Review for Final Exam COSC 3337 November 30, 2023

1) Association Rule and Sequence Mining [9]

a) Assume we have the following transaction database:
T1: {A,B,C,D}
T2: {A,C,D,E}
T3: {C,D,E,F}
T4: {B,C,D,E}
T5: {A,D,E}
What are the support and confidence of the following association rule: IF (C and D) THEN E? [3]
Support = 3/5 [1.5] (T2, T3 and T4 contain C, D and E)
Confidence = 3/4 [1.5] (T1 through T4 contain C and D)

b) Assume the APRIORI algorithm identified the following five 4-itemsets that satisfy a user-given support threshold: abcd, acde, acdf, acdg, adfg. What initial candidate 5-itemsets are created by the APRIORI algorithm, and which of those survive subset pruning? [4]
acdef, acdeg, acdfg [3] (One error: at most one point!)
None survives pruning [1]: each candidate has a 4-item subset that is not frequent (acef, aceg and acfg, respectively).

c) Why are association rule mining systems interested in finding rules with high support? [2]
Rules with high support are more likely to accurately predict the occurrence of an item based on the occurrences of other items in the transaction; it is hard to learn accurate rules from just a few examples.

d) Assume an association rule "if smoke then cancer" has a confidence of 86% and a high lift of 5.4. What does this tell you about the relationship between smoking and cancer? [2]
Confidence = 86%: 86% of the people who smoke get cancer; that is, P(Cancer|Smoke) = 0.86.
Lift = 5.4: smoking increases the probability of getting cancer by a factor of 5.4; that is, P(Cancer|Smoke)/P(Cancer) = 5.4.
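The answers to 1a and 1b can be checked mechanically. The following is a minimal sketch, not part of the original review: the function names (`support`, `confidence`, `apriori_gen`, `prune`) are illustrative, itemsets are written as sorted strings as in the question, and candidate generation assumes the standard Apriori join of two frequent k-itemsets that share their first k-1 items.

```python
from fractions import Fraction
from itertools import combinations

# Transaction database from question 1a.
transactions = [
    {"A", "B", "C", "D"},
    {"A", "C", "D", "E"},
    {"C", "D", "E", "F"},
    {"B", "C", "D", "E"},
    {"A", "D", "E"},
]

def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    hits = sum(itemset <= t for t in transactions)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"C", "D", "E"})
rule_confidence = confidence({"C", "D"}, {"E"})

def apriori_gen(frequent):
    """Join step: merge two sorted k-itemsets that share their first k-1 items."""
    k = len(frequent[0])
    candidates = []
    for a, b in combinations(sorted(frequent), 2):
        if a[:k - 1] == b[:k - 1]:
            candidates.append(a + b[-1])
    return candidates

def prune(candidates, frequent):
    """Subset pruning: keep a candidate only if all its k-subsets are frequent."""
    freq = set(frequent)
    return [
        c for c in candidates
        if all("".join(s) in freq for s in combinations(c, len(c) - 1))
    ]

# Frequent 4-itemsets from question 1b.
frequent_4 = ["abcd", "acde", "acdf", "acdg", "adfg"]
candidates_5 = apriori_gen(frequent_4)
survivors = prune(candidates_5, frequent_4)

print(rule_support, rule_confidence)  # 3/5 3/4
print(candidates_5, survivors)        # ['acdef', 'acdeg', 'acdfg'] []
```

Using `Fraction` keeps the results exact, so they can be compared directly with the fractions in the answer key.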
2) EM

a) What cluster models does EM use?
Each cluster is described by:
a. a mean value
b. a covariance matrix
c. a cluster prior/weight (the weights of the k clusters have to add up to one)
Gaussian Mixture Models — PyPR v0.1rc3 documentation (sourceforge.net)

b) How does EM determine if a point i belongs to a cluster j?
$$p(C_j \mid x_i) = \frac{p(x_i \mid C_j)\, p(C_j)}{\sum_{l=1}^{k} p(x_i \mid C_l)\, p(C_l)}$$
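The posterior in 2b is Bayes' rule applied to the mixture components. As a rough illustration, the sketch below evaluates it for a one-dimensional mixture, so each covariance matrix reduces to a single variance. The two-cluster parameters and the point x = 1.0 are hypothetical values chosen for the example, not values from the review.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def responsibilities(x, means, variances, priors):
    """p(C_j | x) for every cluster j: p(x | C_j) p(C_j) normalized over all clusters."""
    joint = [
        gaussian_pdf(x, m, v) * p
        for m, v, p in zip(means, variances, priors)
    ]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical model: two clusters with equal priors (the priors sum to one).
means = [0.0, 4.0]
variances = [1.0, 1.0]
priors = [0.5, 0.5]

posterior = responsibilities(1.0, means, variances, priors)
print(posterior)  # the cluster centered at 0 receives most of the probability mass
```

The result is a soft assignment: every point gets a probability for each cluster, and the probabilities sum to one.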
3) Fuzzy C-Means (FCM)

a. How is FCM different from K-means?
FCM uses soft cluster memberships expressed as weights w_ij, which can be interpreted as the probability of object i belonging to cluster j; that is, objects do not have to belong to exactly one cluster, as is the case with k-means. FCM uses weight-based computations to determine the centroids.

b. How does FCM update the weights in its iterations?
$$w_{ij} = \frac{\left(1/\mathrm{dist}(x_i, c_j)^2\right)^{\frac{1}{p-1}}}{\sum_{q=1}^{k} \left(1/\mathrm{dist}(x_i, c_q)^2\right)^{\frac{1}{p-1}}}$$
Let us assume we run FCM for k=2, the centroids are cluster 1 = (1,1) and cluster 2 = (2,3), the hyperparameter p is 2, and we use Manhattan distance; furthermore, point i is (1,4). Then dist(x_i, c_1) = 3 and dist(x_i, c_2) = 2, so:
w_i1 = (1/3**2)/(1/9+1/4) = 4/13 = 0.308
w_i2 = (1/2**2)/(1/9+1/4) = 9/13 = 0.692
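The weight update in 3b can be reproduced with a few lines of code. This is a minimal sketch with illustrative function names; it assumes the point does not coincide with any centroid, since a zero distance would cause a division by zero.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two points of equal dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def fcm_weights(point, centroids, p=2):
    """Membership weights of one point: (1/dist^2)^(1/(p-1)), normalized to sum to one."""
    exponent = 1.0 / (p - 1)
    inverse = [(1.0 / manhattan(point, c) ** 2) ** exponent for c in centroids]
    total = sum(inverse)
    return [v / total for v in inverse]

# Example from the question: centroids (1,1) and (2,3), point (1,4), p = 2.
weights = fcm_weights((1, 4), [(1, 1), (2, 3)], p=2)
print([round(w, 3) for w in weights])  # [0.308, 0.692], i.e. 4/13 and 9/13
```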
4) Similarity Assessment

Design a distance function to assess the similarity of graduate students; each student is characterized by the following attributes:
a) ssn
b) qud (“quality of undergraduate degree”), which is an ordinal attribute with values ‘excellent’, ‘very good’, ‘good’, ‘fair’, ‘poor’, ‘very poor’
c) gpa (a real number with mean 2.8, standard deviation 0.8, maximum 4.0 and minimum 2.1)
d) gender, a nominal attribute taking values in {male, female}
Assume that the attributes qud and gpa are of major importance and the attribute gender is of minor importance when assessing the similarity between students. Using your distance function, compute the distance between the following 2 students: c1=(111111111, ‘good’, 2.9, male) and c2=(222222222, ‘very poor’, 3.7, female)!

We convert the qud values ‘excellent’, ‘very good’, ‘good’, ‘fair’, ‘poor’, ‘very poor’ to 5:0 (excellent = 5, ..., very poor = 0); then we compute the distance by taking the L-1 norm and dividing by the range, 5 in this case.
Normalize gpa using the Z-score and compute the distance with the L-1 norm.
d_gender(a,b) := if a=b then 0 else 1
Assign weights 1 to qud, 1 to gpa and 0.2 to gender; ssn is an identifier and is not used.
[8] One error: 2.5-5; two errors: 0-2; distance functions not properly defined: at most 3 points.
d(u,v) = (1*|(u.gpa)/0.8 - (v.gpa)/0.8| + 1*|(u.qud) - (v.qud)|/5 + 0.2*d_gender(u.gender, v.gender)) / 2.2
For the 2 students c1=(111111111, ‘good’, 2.9, male) and c2=(222222222, ‘very poor’, 3.7, female):
d(c1,c2) = (1 + 3/5 + 0.2)/2.2 = 1.8/2.2 = 9/11 = 0.82 [2]

5) K-means

Assume the following dataset is given: (1,1), (3,3), (4,4), (6,6), (4,6), (6,4). K-Means is used with k=2 to cluster the dataset. Moreover, Manhattan distance is used as the distance function (formula below) to compute distances between centroids and objects in the dataset.
Moreover, K-Means’s initial clusters C1 and C2 are as follows:
C1: {(1,1), (3,3), (4,4), (6,6)}
C2: {(6,4), (4,6)}
Now K-means is run for a single iteration; what are the new clusters you obtain? [4] If there are any ties, break them whatever way you want!
d((x1,x2),(x1’,x2’)) = |x1-x1’| + |x2-x2’| (Manhattan distance)
centroid C1 = (3.5,3.5)
centroid C2 = (5,5)
New clusters:
C1 = {(1,1), (3,3), (4,4)}
C2 = {(6,6), (4,6), (6,4)}
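The hand computations for questions 4 and 5 above can be checked with a short script. This is a sketch under stated assumptions rather than the course's reference solution: the names (`QUD_RANK`, `student_distance`, `kmeans_step`) are illustrative, the qud values are assumed to map to 5 down to 0 in the order listed in the question, and ties in the K-means step are broken in favor of the cluster with the lower index.

```python
# Question 4: weighted distance between two students.
QUD_RANK = {"excellent": 5, "very good": 4, "good": 3,
            "fair": 2, "poor": 1, "very poor": 0}
GPA_STD = 0.8                           # standard deviation given in the question
W_GPA, W_QUD, W_GENDER = 1.0, 1.0, 0.2  # weights from the answer key

def student_distance(u, v):
    """u and v are (ssn, qud, gpa, gender); ssn is an identifier and is ignored."""
    _, qud_u, gpa_u, gender_u = u
    _, qud_v, gpa_v, gender_v = v
    d_gpa = abs(gpa_u - gpa_v) / GPA_STD   # the mean cancels in a z-score difference
    d_qud = abs(QUD_RANK[qud_u] - QUD_RANK[qud_v]) / 5
    d_gender = 0.0 if gender_u == gender_v else 1.0
    total = W_GPA * d_gpa + W_QUD * d_qud + W_GENDER * d_gender
    return total / (W_GPA + W_QUD + W_GENDER)

c1 = (111111111, "good", 2.9, "male")
c2 = (222222222, "very poor", 3.7, "female")
distance = student_distance(c1, c2)

# Question 5: one K-means iteration with Manhattan distance.
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def kmeans_step(clusters):
    """Compute centroids, then reassign every point to its nearest centroid."""
    centroids = [centroid(c) for c in clusters]
    new_clusters = [[] for _ in clusters]
    for point in (p for c in clusters for p in c):
        dists = [manhattan(point, c) for c in centroids]
        new_clusters[dists.index(min(dists))].append(point)
    return centroids, new_clusters

initial = [[(1, 1), (3, 3), (4, 4), (6, 6)], [(6, 4), (4, 6)]]
centroids, new_clusters = kmeans_step(initial)

print(round(distance, 2))  # 0.82
print(centroids)           # [(3.5, 3.5), (5.0, 5.0)]
print(new_clusters)        # [[(1, 1), (3, 3), (4, 4)], [(6, 6), (6, 4), (4, 6)]]
```

Only the point (6,6) changes cluster: it is at distance 5 from the first centroid but at distance 2 from the second.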
6) Autoencoders

a) What role do Kullback–Leibler (KL) divergences play in Variational Autoencoders (VAEs)?
i. KL-divergences measure the distance between two distributions, e.g. how close are N(0.3,1.3) and N(5,13) to N(0,1); obviously, d_KL(N(0.3,1.3),N(0,1)) << d_KL(N(5,13),N(0,1))
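The comparison in (i) can be made concrete with the closed-form KL divergence between a univariate Gaussian and the standard normal, KL(N(μ, σ²) || N(0, 1)) = 0.5 (σ² + μ² − 1 − ln σ²). The sketch below assumes that the second parameter in N(0.3,1.3) and N(5,13) is the variance; the review does not say whether it is the variance or the standard deviation, but the inequality holds under either reading.

```python
import math

def kl_to_standard_normal(mean, var):
    """KL( N(mean, var) || N(0, 1) ) for univariate Gaussians, var assumed > 0."""
    return 0.5 * (var + mean ** 2 - 1.0 - math.log(var))

near = kl_to_standard_normal(0.3, 1.3)   # distribution close to the prior
far = kl_to_standard_normal(5.0, 13.0)   # distribution far from the prior

print(round(near, 3), round(far, 3))  # 0.064 17.218
```

The divergence is zero only when the distribution equals the prior, which is why it can serve as a penalty term on the latent distribution.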
ii. KL-divergences are used in VAE loss functions to create a penalty that is proportional to how much the latent vector deviates from an assumed prior (e.g. from N(0,1), or from a covariance matrix which has 1 on the diagonal and 0 everywhere else, that is, no correlation between the different variables in the latent space); this accomplishes some regularization of the latent space and can also be used to enforce independence of the different latent variables.

b) How can autoencoders be used for outlier detection?
Steps:
i. Learn an autoencoder Y for your dataset D.
ii. Feed every example d ∈ D into Y and add the reconstruction loss d(x,r) in an additional column to d.
Remark: The newly added column serves as an outlier score (OLS); the larger d(x,r) is, the more likely x is an outlier.

c) If I decrease the dimensionality of the latent vector h, what will be the consequences?
The reconstruction loss will increase.

d) Likely some Task5-related question(s)?

7) Expect one more essay-style question in the final exam

Important: this is an essay: write complete sentences!
e.g. What skills are important to be hired as a Data Scientist? (see slides that discuss this topic)
• Should know R and/or Python
• Should have sound software development skills
• Should have some sound knowledge of Statistics
• Should have sound knowledge of the different data analysis tasks, e.g. clustering, classification, similarity assessment
• Should be knowledgeable in data visualization
“Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.”
“The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades.”
“But what’s even harder is finding people who have those skills and are good at communicating the story behind the data.”