Northeastern University

DS 5230: Unsupervised machine learning and data mining (Fall 2023)
Dr. Roi Yehoshua
Student name:
(Due) October 2, 2023

PS1: Data Mining and Association Rules

1 Data Mining (30%)

Download the file housing.csv from Canvas. The file contains data on median house prices in California districts, derived from the 1990 census data. The data set contains 20,640 rows, one row per district (house block). Each row contains the following features:

  • longitude: how far west a house is; values are negative in California, so a more negative value is farther west.
  • latitude: how far north a house is; a higher value is farther north.
  • housing_median_age: median age of a house within the block; a lower number is a newer building.
  • total_rooms: total number of rooms within the block.
  • total_bedrooms: total number of bedrooms within the block.
  • population: total number of people residing within the block.
  • households: total number of households (a group of people residing within a home unit) within the block.
  • median_income: median income for households within the block (measured in tens of thousands of US dollars).
  • median_house_value: median house value for households within the block (measured in US dollars).
  • ocean_proximity: location of the house with respect to the ocean. Can have one of the following values: NEAR BAY, NEAR OCEAN, <1H OCEAN, INLAND, ISLAND.

The objective in this data set is to predict the median house value in a given district based on the values of the other features.

Answer the following questions:

1. What is the data type of each feature? (ordinal/nominal/interval/ratio, discrete/continuous)
2. Display summary statistics of the data. What can you learn about the data from them?
3. Compute the correlation between each feature and the target median_house_value. Which features have a strong correlation with the target?
4. Use data visualization tools to explore the data set. Display at least three different types of graphs.
5. What type of data quality issues can you detect in the data set (e.g., missing values, duplicates, outliers)? Name at least three different issues.
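
As a starting point for questions 3 and 5 of Section 1, the checks for missing values, duplicate rows, and feature/target correlation can be sketched with the Python standard library alone. This is a minimal sketch, not a prescribed solution: the helper names (load_rows, missing_counts, duplicate_count, pearson) are hypothetical, and the code assumes housing.csv has a header row whose column names match the feature list above. A data-frame library would normally do the same work in fewer lines.

```python
import csv
import math
from collections import Counter


def _present(value):
    """True when a CSV cell holds a non-empty value."""
    return value is not None and value.strip() != ""


def load_rows(path):
    """Read a CSV file into a list of dicts (all values stay strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def missing_counts(rows):
    """Count empty cells per column; columns with no gaps are omitted."""
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            if not _present(value):
                counts[col] += 1
    return dict(counts)


def duplicate_count(rows):
    """Number of rows that exactly repeat an earlier row."""
    seen = Counter(tuple(row.items()) for row in rows)
    return sum(n - 1 for n in seen.values())


def pearson(rows, feature, target):
    """Pearson correlation between two numeric columns.

    Rows where either value is missing are skipped (an assumption of
    this sketch; imputation is another option).
    """
    pairs = [(float(r[feature]), float(r[target]))
             for r in rows if _present(r[feature]) and _present(r[target])]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)
```

A possible call sequence, assuming the file sits in the working directory: rows = load_rows("housing.csv"), then missing_counts(rows), duplicate_count(rows), and pearson(rows, "median_income", "median_house_value"). Outlier detection and the nominal ocean_proximity column are left out of this sketch.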
2 Association Rule Mining on a Toy Dataset (30%)

You are given the following dataset of transactions from a grocery store:

Transaction ID    Items
T1                Apple, Banana, Orange
T2                Apple, Grape
T3                Banana, Grape, Orange
T4                Apple, Banana, Grape
T5                Apple, Banana, Orange, Grape
T6                Banana, Orange
T7                Apple, Orange
T8                Orange, Grape

2.1 Apriori

1. Run the Apriori algorithm on this data set with a minimum support of 25%. Show the candidate and frequent itemsets for each database scan.
2. Enumerate all the final frequent itemsets.
3. Indicate all the association rules that have a confidence of at least 65%.

2.2 FP-Growth

1. Use the same data set and support threshold and build a frequent pattern tree (FP-tree). Show for each transaction how the tree evolves.
2. Use FP-Growth to discover the frequent itemsets from this FP-tree.

3 Retail Store Association Rule Mining (40%)

Download the file store_data.csv from Canvas. This data set contains 7,500 transactions over the course of a week at a French retail store.

1. Write a program to find interesting association rules in this data set. Try various support, confidence and lift thresholds.
2. Display the top scoring rules and discuss their meaning.
3. Randomly split your records into two sets with roughly 50% of data each. Now use the Apriori algorithm to determine rules on both of these sets. Do you find similar rules on both sets? What does the similarity/the differences indicate?

In this question you are allowed to use external libraries for computing the association rules (e.g., the package mlxtend mentioned in class).
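
For checking hand-worked answers to Section 2.1, the level-wise structure of Apriori (scan, join, prune) and the confidence and lift of the resulting rules can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the course's reference solution and not the mlxtend API: the function names apriori and rules are hypothetical, support is measured as a fraction of transactions, and lift is computed as confidence divided by the support of the consequent.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Scan 1: the candidate 1-itemsets are all distinct items.
    candidates = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while candidates:
        # Database scan: keep the candidates that meet the threshold.
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join step: unions of two frequent k-itemsets with k+1 items.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence, lift) tuples."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                cons = itemset - ante
                conf = sup / frequent[ante]
                if conf >= min_confidence:
                    out.append((ante, cons, conf, conf / frequent[cons]))
    return out


# The toy grocery transactions T1..T8 from Section 2.
transactions = [
    ["Apple", "Banana", "Orange"],
    ["Apple", "Grape"],
    ["Banana", "Grape", "Orange"],
    ["Apple", "Banana", "Grape"],
    ["Apple", "Banana", "Orange", "Grape"],
    ["Banana", "Orange"],
    ["Apple", "Orange"],
    ["Orange", "Grape"],
]
frequent = apriori(transactions, min_support=0.25)
strong = rules(frequent, min_confidence=0.65)
```

The same two functions could be applied to each random half of store_data.csv in Section 3 to compare the rule sets, although a library implementation is likely to be faster on 7,500 transactions, since this sketch rescans every basket for every candidate.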
Submission instructions:

  • Submit one PDF file for all the non-programming questions. If you are submitting handwritten solutions, please ensure that your handwriting is legible and the pages are scanned properly.
  • Submit your source files (.ipynb or .py) for each programming question separately (do not compress them into one zip). Clearly mention the question number in the source file name.
  • Make sure that any comments or answers related to the code are included in the notebook itself, and not separately.
  • In questions that require you to build a machine learning model, you are expected to perform sufficient hyperparameter tuning, analyze and discuss the results. Merely presenting the output of one model in one specific setting will not be sufficient to earn full points.
  • You are welcome to discuss the assignment problems with other students in class, but you must write up the solution yourself, and indicate who you discussed with (if any).
  • Assignments may be handed in up to one day late (24-hour period), penalized by 10%. Submissions later than this will not be accepted. Contact the teaching staff if there are extenuating circumstances.