Final 5080_2024 — Course 5080 (Computer Science), Austin Peay State University, Jan 9, 2024
Q1. For the following values: 200, 400, 800, 1000, 2000
1. Calculate the mean and variance.
2. Normalize the group by min-max normalization to the new range [0, 10] (new_min = 0, new_max = 10).
3. In z-score normalization, what value should the first value (200) be transformed to?

Answer:
1. Mean = 880; variance = 492,000 (sample variance, dividing by n - 1).
2. Min-max normalization with new_min = 0 and new_max = 10:
v' = (v - min) / (max - min) x (new_max - new_min) + new_min
For example, for v = 400: (400 - 200) / (2000 - 200) x (10 - 0) + 0 ≈ 1.11

Value    Min-max normalized
200      0
400      1.11
800      3.33
1000     4.44
2000     10

3. Z-score normalization: z = (x - mean) / σ, where σ = sqrt(492,000) ≈ 701.43.
For the first value: z = (200 - 880) / 701.43 ≈ -0.9695
(A short Python sketch that re-checks these numbers follows Q2 below.)

Q2. A database has four transactions. Let min_sup = 60% and min_conf = 80%.

(a) At the granularity of item category (e.g., item_i could be "Milk"), for the following rule template,
∀X ∈ transaction, buys(X, item1) ^ buys(X, item2) => buys(X, item3) [s, c]
list the frequent k-itemset for the largest k and all of the strong association rules (with their support s and confidence c) containing the frequent k-itemset for the largest k.

Answer: The largest k is 3, and the frequent 3-itemset is {Bread, Milk, Cheese}. The strong rules are:
Bread ^ Cheese => Milk [75%, 100%]
Cheese ^ Milk => Bread [75%, 100%]
Cheese => Milk ^ Bread [75%, 100%]

(b) At the granularity of brand-item category (e.g., item_i could be "Sunset-Milk"), for the following rule template,
∀X ∈ customer, buys(X, item1) ^ buys(X, item2) => buys(X, item3) [s, c]
list the frequent k-itemset for the largest k.

Answer: The largest k is 3. The frequent 3-itemsets are {Wonder-Bread, Dairyland-Milk, Tasty-Pie} and {Wonder-Bread, Sunset-Milk, Dairyland-Cheese}. (A small support/confidence helper is sketched below.)
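As a cross-check on the Q1 arithmetic, here is a minimal Python sketch (an illustration added for clarity, not part of the original answer). It assumes the sample variance (division by n - 1), which is what the 492,000 figure corresponds to.

```python
# Re-check the Q1 numbers: mean, sample variance, min-max scaling to [0, 10],
# and the z-score of the first value.
import math

values = [200, 400, 800, 1000, 2000]

n = len(values)
mean = sum(values) / n                                        # 880.0
sample_var = sum((v - mean) ** 2 for v in values) / (n - 1)   # 492000.0
std = math.sqrt(sample_var)                                   # ~701.43

lo, hi = min(values), max(values)
new_lo, new_hi = 0, 10
min_max = [(v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo for v in values]
# ~[0.0, 1.11, 3.33, 4.44, 10.0]

z_first = (values[0] - mean) / std                            # ~-0.9695

print(mean, sample_var, [round(x, 2) for x in min_max], round(z_first, 4))
```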
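Since the transaction table itself is not included in this preview, the following Python sketch only illustrates how support and confidence are computed for Q2-style rules. The four transactions are hypothetical placeholders (chosen so the example rule reproduces the [75%, 100%] figures quoted above); substitute the real table to verify the answer.

```python
# Generic support/confidence helper for association rules.
transactions = [                 # hypothetical placeholder data, NOT the exam's table
    {"Bread", "Milk", "Cheese"},
    {"Bread", "Milk", "Cheese"},
    {"Bread", "Milk", "Cheese"},
    {"Bread", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

# Example: the rule Bread ^ Cheese => Milk
s = support({"Bread", "Cheese", "Milk"})
c = confidence({"Bread", "Cheese"}, {"Milk"})
print(f"support={s:.0%}, confidence={c:.0%}")   # support=75%, confidence=100%
```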
Q3. Suppose you are asked to classify microarray data with 100 tissues and 10,000 genes. Which of the following algorithms would you recommend, and why?
1. Decision tree induction
2. Piece-wise linear regression
3. SVM
4. Associative classification
5. Genetic algorithm
6. Bayesian belief network

Answer: I would recommend the Support Vector Machine (SVM). With only 100 samples but 10,000 genes the data is extremely high-dimensional, and SVM is well suited to this setting: it handles high-dimensional data, can capture non-linear relationships through kernels, and produces a clear separation while penalizing misclassifications through its soft-margin formulation.

Q4. The Table below shows the Decision….. see text
Question 5:
* Why does an SVM algorithm have high classification accuracy in high-dimensional space?
* Why is an SVM algorithm slow on large data sets?
* Outline an extended SVM algorithm that is scalable to large data sets.

Answer:
i. SVMs classify high-dimensional data well because they (implicitly) map it into an even higher-dimensional feature space in which a separating hyperplane between the classes can be found. The "kernel trick" performs this mapping without ever computing the transformed features explicitly, which keeps the search for the separating hyperplane tractable.
ii. As data sets grow, training an SVM becomes increasingly demanding in both computation and memory: the optimization considers all data points when searching for the optimal decision boundary, so the cost scales sharply with data-set size, leading to slow training and, when memory limits are reached, outright failure.
iii. Outline of a clustering-based SVM (CB-SVM) method (a code sketch of this idea follows the list):
1. Micro-cluster formation: use a CF-tree (Clustering Feature tree) to break the large data set into smaller, more manageable micro-clusters.
2. Representative centroids: train an SVM on the centroids of the micro-clusters as representatives of the original points, reducing computational cost compared with training on the entire data set.
3. Declustering near the boundary: expand the micro-clusters that lie near the current decision boundary back into their constituent points, so the boundary region is modeled in full detail while points far from the boundary stay summarized, keeping redundancy low and training efficient.
4. Data augmentation: add points from the original data set that were not captured by the initial micro-clusters, so a more comprehensive range of information is covered.
5. Iterative training: repeat the process, updating the micro-clusters based on the current SVM model and retraining the SVM on the updated representatives, until the model converges to a stable state.
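The outline above can be sketched with standard scikit-learn components. This is a rough, hypothetical illustration of the idea, not the published CB-SVM implementation: Birch stands in for the CF-tree, a single refinement pass stands in for the iterative loop, and a binary label vector is assumed. The function name `cluster_based_svm` is made up for this sketch.

```python
# Sketch of a clustering-based SVM: micro-cluster each class, train on centroids,
# then refine using only the original points that fall near the decision boundary.
import numpy as np
from sklearn.cluster import Birch
from sklearn.svm import SVC

def cluster_based_svm(X, y, threshold=0.5, margin=1.0):
    # 1. Micro-cluster each class separately with Birch (a CF-tree-based method);
    #    the subcluster centroids act as representatives of the raw points.
    centroids, centroid_labels = [], []
    for label in np.unique(y):
        birch = Birch(threshold=threshold, n_clusters=None).fit(X[y == label])
        centroids.append(birch.subcluster_centers_)
        centroid_labels.append(np.full(len(birch.subcluster_centers_), label))
    C = np.vstack(centroids)
    c_y = np.concatenate(centroid_labels)

    # 2. Train an initial SVM on the (much smaller) set of centroids.
    svm = SVC(kernel="linear").fit(C, c_y)

    # 3. "Decluster" near the boundary: pull in the original points whose decision
    #    values fall inside the margin (assumes a binary problem, so
    #    decision_function returns a 1-D array) and retrain on them plus the centroids.
    near = np.abs(svm.decision_function(X)) < margin
    X_ref = np.vstack([C, X[near]])
    y_ref = np.concatenate([c_y, y[near]])
    return SVC(kernel="linear").fit(X_ref, y_ref)
```

Usage would be `model = cluster_based_svm(X, y)` with `X` an (n_samples, n_features) NumPy array and `y` a binary label vector; repeating steps 2-3 would give the iterative refinement described in point 5.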
Question 6: Different data types may need different similarity or distance measures. State the expected similarity measure for each of the following tasks.
1. Clustering stars in the universe
2. Clustering text documents
3. Clustering clinical data
4. Clustering houses to find delivery centers in a city with rivers and bridges

Answer (minimal implementations of these measures are sketched below):
1. Clustering stars in the universe: Euclidean distance
2. Clustering text documents: Euclidean distance, cosine similarity, or the Jaccard coefficient
3. Clustering clinical data: cosine similarity
4. Clustering houses to find delivery centers: cosine similarity
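For reference, here are minimal Python implementations of the measures named above. They are purely illustrative (libraries such as SciPy provide the same functions); the choice of measure per task follows the answer above.

```python
# Reference implementations of the Q6 distance / similarity measures.
import math

def euclidean(a, b):
    """Straight-line distance, e.g. between star coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle-based similarity, common for term-frequency vectors of documents."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def jaccard(set_a, set_b):
    """Overlap of two sets of terms: |A ∩ B| / |A ∪ B|."""
    return len(set_a & set_b) / len(set_a | set_b)

print(euclidean((0, 0), (3, 4)))                              # 5.0
print(round(cosine_similarity((1, 0, 1), (1, 1, 1)), 3))      # ~0.816
print(round(jaccard({"data", "mining"}, {"data", "science"}), 3))  # ~0.333
```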