HW4 - Clustering & Associations

BUAN-617 – Professor Cohen
Homework #4 – Clustering & Associations

The purpose of this homework is to have you practice writing Python in order to identify and master the concepts. The goals for Lab #4 are as follows:

1. Reinforce data preparation and exploration practices.
2. Provide examples of using different clustering techniques and association rules.
3. Use another Python ML package, MLXTEND (see the install note below).

Class Deliverables:

1) I will provide the first few steps in the process of analyzing the dataset to help start the lab.
2) I will prompt a few questions in between the given steps.
3) To get full credit for this lab, you must take screenshots of the various sections in the lab and paste them into an MS Word document (where prompted), attach your downloaded Jupyter notebook, and email both to me. My email is cohenl3@sacredheart.edu.

Grading Rubric (out of 10 points):

1 point for each question answered. Answers that are shorter than two sentences and not observation-based may receive partial credit.
50% point deduction if the Jupyter notebook is not attached.
50% point deduction if the MS Word document is not emailed to me as an attachment, but rather sent via Blackboard or as an attached document from SharePoint.
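MLXTEND does not ship with the standard Anaconda/Jupyter stack, so it may need to be installed before Part #2. A minimal sketch, assuming pip is available from your notebook (run once in its own cell):

# one-time setup: install mlxtend from inside a Jupyter notebook cell
!pip install mlxtend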
Part #1 – Clustering

1) Step #1 – load scikit-learn and NumPy.

2) We're going to create a synthetic dataset using this code:

# synthetic classification dataset
from numpy import where
from sklearn.datasets import make_classification
from matplotlib import pyplot

# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = where(y == class_value)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
# CODE INTENTIONALLY OMITTED – YOU SHOULD WRITE THE CODE HERE

COPY THE OUTPUT TO YOUR WORD DOCUMENT

3) Now let's do Agglomerative Clustering:

# agglomerative clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering
from matplotlib import pyplot

# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = AgglomerativeClustering(n_clusters=2)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
# CODE INTENTIONALLY OMITTED – YOU SHOULD WRITE THE CODE HERE

COPY THE OUTPUT TO YOUR WORD DOCUMENT
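Agglomerative clustering merges points bottom-up into a hierarchy, but the scatterplot only shows the final two clusters. Not required for the lab, but if you want to see the hierarchy itself, SciPy can draw a dendrogram. A minimal sketch, assuming SciPy is installed (it ships with Anaconda):

# optional: visualize the merge hierarchy behind agglomerative clustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_classification
from matplotlib import pyplot

# same synthetic dataset as above
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# Ward linkage matches AgglomerativeClustering's default criterion
Z = linkage(X, method='ward')
# truncate to the top 5 merge levels so the plot stays readable
dendrogram(Z, truncate_mode='level', p=5)
pyplot.show()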
4) Now let's do DBSCAN!

# dbscan clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
from matplotlib import pyplot

# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = DBSCAN(eps=0.30, min_samples=9)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
# CODE INTENTIONALLY OMITTED

COPY THE OUTPUT TO YOUR WORD DOCUMENT

5) Now let's do K-Means clustering!

# k-means clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot

# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = KMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
# CODE INTENTIONALLY OMITTED

COPY THE OUTPUT TO YOUR WORD DOCUMENT

6) Now it's your turn to choose the clustering model. Choose from OPTICS, Gaussian Mixture, or BIRCH. The documentation for scikit-learn is located here: LINK. Copy the code and the output scatterplot to your Word document. (A hedged BIRCH sketch follows below as one possible starting point.)

7) Answer this question: which model would you choose and why? (The silhouette-score sketch below can help you give an observation-based answer.)
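For step 6, here is one possible starting point: a minimal BIRCH sketch in the same pattern as the examples above. Treat it as an illustration rather than the required answer; the threshold value here is an assumption, not a prescribed setting, so tune it yourself.

# birch clustering - one possible choice for step 6 (sketch, not the answer)
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import Birch
from matplotlib import pyplot

# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model; threshold controls the subcluster radius (assumed value)
model = Birch(threshold=0.01, n_clusters=2)
# fit model and predict clusters
yhat = model.fit_predict(X)
# create scatter plot for samples from each cluster
for cluster in unique(yhat):
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()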
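For step 7, visual inspection of the scatterplots is a fine basis for an answer, but a quantitative check can make it more observational. A minimal sketch comparing the lab's three models with scikit-learn's silhouette score (closer to 1 is better); note that DBSCAN's noise points (label -1) are counted as one extra group here, which is a simplification:

# compare the lab's three models on the same data using the silhouette score
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import silhouette_score

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
models = {
    "agglomerative": AgglomerativeClustering(n_clusters=2),
    "dbscan": DBSCAN(eps=0.30, min_samples=9),  # noise points get label -1
    "kmeans": KMeans(n_clusters=2, n_init=10),
}
for name, model in models.items():
    yhat = model.fit_predict(X)
    print(name, round(silhouette_score(X, yhat), 3))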
Part #2 – Associations

Now we're going to do a market basket analysis with supermarket data from Kaggle. The data is located here: Grocery Store Data Set | Kaggle. We're also going to use another Python library called MLXTEND, which has some shortcuts to machine learning algorithms (GitHub documentation here).

import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("GroceryStoreDataSet.csv", names = ['products'], sep = ',')
df.head()

8) What is the shape of this dataset?

COPY OUTPUT TO WORD DOCUMENT

9) Now let's transform this list from qualitative data into a Boolean table:

data = list(df["products"].apply(lambda x: x.split(",")))
data

from mlxtend.preprocessing import TransactionEncoder

a = TransactionEncoder()
a_data = a.fit(data).transform(data)
df = pd.DataFrame(a_data, columns=a.columns_)
df = df.replace(False, 0)
df

COPY OUTPUT TO WORD DOCUMENT

10) Now let's build an Apriori association model. We're going to assume a minimum support of 20%, and run it with a minimum confidence of 60%.

df = apriori(df, min_support = 0.2, use_colnames = True, verbose = 1)
df_ar = association_rules(df, metric = "confidence", min_threshold = 0.6)
df_ar

COPY OUTPUT TO WORD DOCUMENT

11) Based on this output, what can we observe about Corn Flakes? What recommendations can we derive from these observations? (A hedged filtering sketch follows below.)
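For step 11, it can help to look at only the rules that mention Corn Flakes. A minimal filtering sketch; the item label 'CORNFLAKES' is an assumed spelling, so check the exact item name in your encoded column names (a.columns_) before running it:

# show only rules whose antecedents or consequents contain the item;
# 'CORNFLAKES' is an assumed spelling - verify it against your own columns
item = "CORNFLAKES"
mask = (df_ar["antecedents"].apply(lambda s: item in s) |
        df_ar["consequents"].apply(lambda s: item in s))
df_ar[mask]

---END OF HW#4---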