HW4 - Clustering & Associations

BUAN-617 – Professor Cohen
Homework #4 – Clustering & Associations

The purpose of this homework is to have you practice writing Python in order to identify and master the concepts. The goals for Lab #4 are as follows:

1. Reinforce data preparation and exploration practices.
2. Provide examples of using different clustering techniques and association rules.
3. Use another Python ML package, MLXTEND (see the install note below).

Class Deliverables:

1) I will provide the first few steps in the process of analyzing the dataset to help start the lab.
2) I will prompt a few questions in between the given steps.
3) To get full credit for this lab, you must take screenshots of the various sections in the lab and paste them into an MS Word document (where prompted), attach your downloaded Jupyter notebook, and email both to me. My email is cohenl3@sacredheart.edu.

Grading Rubric (out of 10 points):

1 point for each question answered. Answers that are shorter than two sentences and not observation-based may receive partial credit.
50% point deduction if the Jupyter notebook is not attached.
50% point deduction if the MS Word document is not emailed to me as an attachment, but rather sent via Blackboard or as an attached document from SharePoint.
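MLXTEND does not ship with the standard Anaconda/Jupyter stack, so it may need to be installed before Part #2. A minimal sketch, assuming pip is available from your notebook (run once in its own cell):

# one-time setup: install mlxtend from inside a Jupyter notebook cell
!pip install mlxtend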
Part #1 – Clustering

1) Step #1 – load scikit-learn and NumPy.

2) We're going to create a synthetic dataset using this code:

# synthetic classification dataset
from numpy import where
from sklearn.datasets import make_classification
from matplotlib import pyplot

# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = where(y == class_value)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
# CODE INTENTIONALLY OMITTED – YOU SHOULD WRITE THE CODE HERE

COPY THE OUTPUT TO YOUR WORD DOCUMENT

3) Now let's do Agglomerative Clustering:

# agglomerative clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering
from matplotlib import pyplot

# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = AgglomerativeClustering(n_clusters=2)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
# CODE INTENTIONALLY OMITTED – YOU SHOULD WRITE THE CODE HERE

COPY THE OUTPUT TO YOUR WORD DOCUMENT
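Agglomerative clustering merges points bottom-up into a hierarchy, but the scatterplot only shows the final two clusters. Not required for the lab, but if you want to see the hierarchy itself, SciPy can draw a dendrogram. A minimal sketch, assuming SciPy is installed (it ships with Anaconda):

# optional: visualize the merge hierarchy behind agglomerative clustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_classification
from matplotlib import pyplot

# same synthetic dataset as above
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# Ward linkage matches AgglomerativeClustering's default criterion
Z = linkage(X, method='ward')
# truncate to the top 5 merge levels so the plot stays readable
dendrogram(Z, truncate_mode='level', p=5)
pyplot.show()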
4) Now let's do DBSCAN!

# dbscan clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
from matplotlib import pyplot

# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = DBSCAN(eps=0.30, min_samples=9)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
# CODE INTENTIONALLY OMITTED

COPY THE OUTPUT TO YOUR WORD DOCUMENT

5) Now let's do K-Means clustering!

# k-means clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot

# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = KMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
# CODE INTENTIONALLY OMITTED

COPY THE OUTPUT TO YOUR WORD DOCUMENT

6) Now it's your turn to choose the clustering model. Choose from OPTICS, Gaussian Mixture, or BIRCH. The documentation for scikit-learn is located here: LINK. Copy the code and the output scatterplot to your Word document. (A hedged BIRCH sketch follows below as one possible starting point.)

7) Answer this question: which model would you choose and why? (The silhouette-score sketch below can help you give an observation-based answer.)
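For step 6, here is one possible starting point: a minimal BIRCH sketch in the same pattern as the examples above. Treat it as an illustration rather than the required answer; the threshold value here is an assumption, not a prescribed setting, so tune it yourself.

# birch clustering - one possible choice for step 6 (sketch, not the answer)
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import Birch
from matplotlib import pyplot

# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model; threshold controls the subcluster radius (assumed value)
model = Birch(threshold=0.01, n_clusters=2)
# fit model and predict clusters
yhat = model.fit_predict(X)
# create scatter plot for samples from each cluster
for cluster in unique(yhat):
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()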
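For step 7, visual inspection of the scatterplots is a fine basis for an answer, but a quantitative check can make it more observational. A minimal sketch comparing the lab's three models with scikit-learn's silhouette score (closer to 1 is better); note that DBSCAN's noise points (label -1) are counted as one extra group here, which is a simplification:

# compare the lab's three models on the same data using the silhouette score
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import silhouette_score

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
models = {
    "agglomerative": AgglomerativeClustering(n_clusters=2),
    "dbscan": DBSCAN(eps=0.30, min_samples=9),  # noise points get label -1
    "kmeans": KMeans(n_clusters=2, n_init=10),
}
for name, model in models.items():
    yhat = model.fit_predict(X)
    print(name, round(silhouette_score(X, yhat), 3))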
Part #2 – Associations

Now we're going to do a market basket analysis with supermarket data from Kaggle. The data is located here: Grocery Store Data Set | Kaggle. We're also going to use another Python library called MLXTEND, which has some shortcuts to machine learning algorithms (GitHub documentation here).

import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("GroceryStoreDataSet.csv", names = ['products'], sep = ',')
df.head()

8) What is the shape of this dataset?

COPY OUTPUT TO WORD DOCUMENT

9) Now let's transform this list from qualitative data into a Boolean table:

data = list(df["products"].apply(lambda x: x.split(",")))
data

from mlxtend.preprocessing import TransactionEncoder

a = TransactionEncoder()
a_data = a.fit(data).transform(data)
df = pd.DataFrame(a_data, columns=a.columns_)
df = df.replace(False, 0)
df

COPY OUTPUT TO WORD DOCUMENT

10) Now let's build an Apriori association model. We're going to assume a minimum support of 20%, and run it with a minimum confidence of 60%.

df = apriori(df, min_support = 0.2, use_colnames = True, verbose = 1)
df_ar = association_rules(df, metric = "confidence", min_threshold = 0.6)
df_ar

COPY OUTPUT TO WORD DOCUMENT

11) Based on this output, what can we observe about Corn Flakes? What recommendations can we derive from these observations? (A hedged filtering sketch follows below.)
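For step 11, it can help to look at only the rules that mention Corn Flakes. A minimal filtering sketch; the item label 'CORNFLAKES' is an assumed spelling, so check the exact item name in your encoded column names (a.columns_) before running it:

# show only rules whose antecedents or consequents contain the item;
# 'CORNFLAKES' is an assumed spelling - verify it against your own columns
item = "CORNFLAKES"
mask = (df_ar["antecedents"].apply(lambda s: item in s) |
        df_ar["consequents"].apply(lambda s: item in s))
df_ar[mask]

---END OF HW#4---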