School: Northeastern University
Course: 5230
Subject: Industrial Engineering
Date: Dec 6, 2023
DS 5230: Unsupervised machine learning and data mining (Fall 2023)
Dr. Roi Yehoshua
Student name:
(Due) October 2, 2023
PS1: Data Mining and Association Rules
1 Data Mining (30%)
Download the file housing.csv from Canvas. The file contains data on median house prices
in California districts, derived from the 1990 census data. The data set contains 20,640 rows,
one row per district (house block). Each row contains the following features:
• longitude: how far west a house is; a higher value is farther west.
• latitude: how far north a house is; a higher value is farther north.
• housing_median_age: median age of a house within the block; a lower number is a
  newer building.
• total_rooms: total number of rooms within the block.
• total_bedrooms: total number of bedrooms within the block.
• population: total number of people residing within the block.
• households: total number of households (a group of people residing within a home
  unit) within the block.
• median_income: median income for households within the block (measured in tens of
  thousands of US dollars).
• median_house_value: median house value for households within the block (measured
  in US dollars).
• ocean_proximity: location of the house with respect to the ocean. Can have one of
  the following values: NEAR BAY, NEAR OCEAN, <1H OCEAN, INLAND, ISLAND.
The objective in this data set is to predict the median house value in a given district based
on the values of the other features.
Answer the following questions:
1. What is the data type of each feature? (ordinal/nominal/interval/ratio, discrete/continuous)
2. Display summary statistics of the data. What can you learn about the data from them?
3. Compute the correlation between each feature and the target median_house_value.
Which features have a strong correlation with the target?
4. Use data visualization tools to explore the data set. Display at least three different
types of graphs.
5. What type of data quality issues can you detect in the data set (e.g., missing values,
duplicates, outliers)? Name at least three different issues.
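The correlation and data quality checks asked for in questions 3 and 5 can be sketched with the standard library alone. This is a minimal illustration, not a required approach: the helper names (`load_rows`, `pearson`, `correlations_with_target`, `missing_counts`, `duplicate_count`) are hypothetical, and the sketch assumes that housing.csv sits in the working directory, has a header row, and marks missing values with empty cells.

```python
import csv
import math


def load_rows(path="housing.csv"):
    """Read the CSV into a list of dicts (assumes a header row)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined for a constant column
    return cov / (sx * sy)


def correlations_with_target(rows, target="median_house_value"):
    """Correlate each numeric column with the target, skipping unusable cells."""
    result = {}
    for col in rows[0]:
        if col == target:
            continue
        pairs = []
        for row in rows:
            try:
                pairs.append((float(row[col]), float(row[target])))
            except (TypeError, ValueError):
                continue  # blank or non-numeric cell (e.g. ocean_proximity)
        if len(pairs) >= 2:
            xs, ys = zip(*pairs)
            result[col] = pearson(xs, ys)
    return result


def missing_counts(rows):
    """Number of empty cells per column (assumption: empty means missing)."""
    return {col: sum(1 for r in rows if r[col] in ("", None)) for col in rows[0]}


def duplicate_count(rows):
    """Number of rows that exactly repeat an earlier row."""
    return len(rows) - len({tuple(sorted(r.items())) for r in rows})
```

A typical call would be `rows = load_rows()` followed by `correlations_with_target(rows)`. Categorical columns are skipped rather than encoded, so they need separate treatment.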
2 Association Rule Mining on a Toy Dataset (30%)
You are given the following dataset of transactions from a grocery store:
Transaction ID    Items
T1                Apple, Banana, Orange
T2                Apple, Grape
T3                Banana, Grape, Orange
T4                Apple, Banana, Grape
T5                Apple, Banana, Orange, Grape
T6                Banana, Orange
T7                Apple, Orange
T8                Orange, Grape
2.1 Apriori
1. Run the Apriori algorithm on this data set with a minimum support of 25%. Show the
candidate and frequent itemsets for each database scan.
2. Enumerate all the final frequent itemsets.
3. Indicate all the association rules that have a confidence of at least 65%.
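A hand-worked answer can be checked against a short level-wise implementation. The sketch below is one possible way to write Apriori in plain Python, assuming support is the fraction of transactions containing an itemset and confidence is count(itemset) / count(antecedent). The function names `apriori` and `confident_rules` are illustrative only and are not taken from the course material.

```python
from itertools import combinations

# Toy transactions copied from the table in this section.
TRANSACTIONS = {
    "T1": {"Apple", "Banana", "Orange"},
    "T2": {"Apple", "Grape"},
    "T3": {"Banana", "Grape", "Orange"},
    "T4": {"Apple", "Banana", "Grape"},
    "T5": {"Apple", "Banana", "Orange", "Grape"},
    "T6": {"Banana", "Orange"},
    "T7": {"Apple", "Orange"},
    "T8": {"Orange", "Grape"},
}


def apriori(transactions, min_support):
    """Return {frozenset: support count} for every frequent itemset."""
    baskets = list(transactions)
    min_count = min_support * len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # One database scan: count every candidate of size k.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def confident_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for each qualifying rule."""
    found = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                # Every subset of a frequent itemset is frequent, so the
                # antecedent's count is already in the dictionary.
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    found.append((antecedent, itemset - antecedent, confidence))
    return found
```

Calling `apriori(TRANSACTIONS.values(), 0.25)` and passing the result to `confident_rules(..., 0.65)` gives the itemsets and rules to compare with the manual derivation. The sketch does not print the per-scan candidate lists that question 1 asks for.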
2.2 FP-Growth
1. Use the same data set and support threshold and build a frequent pattern tree
(FP-tree). Show for each transaction how the tree evolves.
2. Use FP-Growth to discover the frequent itemsets from this FP-tree.
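The tree-building step can be sketched as follows. This covers FP-tree construction only, not the recursive mining of conditional pattern bases. The class and function names (`FPNode`, `build_fp_tree`, `show`) are hypothetical, and the alphabetical tie-break between equally frequent items is an assumption: a different tie-break yields a differently shaped but equally valid tree.

```python
from collections import Counter

# Toy transactions T1..T8 copied from the table in section 2.
TRANSACTIONS = [
    {"Apple", "Banana", "Orange"},
    {"Apple", "Grape"},
    {"Banana", "Grape", "Orange"},
    {"Apple", "Banana", "Grape"},
    {"Apple", "Banana", "Orange", "Grape"},
    {"Banana", "Orange"},
    {"Apple", "Orange"},
    {"Orange", "Grape"},
]


class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""

    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Insert each transaction, with its items sorted by global frequency."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_count}
    root = FPNode()
    for t in transactions:
        # Descending frequency; ties broken alphabetically (an assumption).
        ordered = sorted(
            (item for item in set(t) if item in frequent),
            key=lambda item: (-counts[item], item),
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
            child.count += 1  # shared prefixes accumulate counts
            node = child
    return root


def show(node, depth=0):
    """Print the tree with one indented 'item:count' line per node."""
    for child in sorted(node.children.values(), key=lambda c: c.item):
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)
```

With a 25% threshold on eight transactions, `min_count` is 2, so `show(build_fp_tree(TRANSACTIONS, 2))` prints the final tree. Calling `build_fp_tree` on growing prefixes of `TRANSACTIONS` shows how the tree evolves one transaction at a time.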
3 Retail Store Association Rule Mining (40%)
Download the file store_data.csv from Canvas. This data set contains 7,500 transactions
over the course of a week at a French retail store.
1. Write a program to find interesting association rules in this data set. Try various
support, confidence and lift thresholds.
2. Display the top scoring rules and discuss their meaning.
3. Randomly split your records into two sets with roughly 50% of data each. Now use the
Apriori algorithm to determine rules on both of these sets. Do you find similar rules on
both sets? What does the similarity/the differences indicate?
In this question you are allowed to use external libraries for computing the association
rules (e.g., the package mlxtend mentioned in class).
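The split-and-compare experiment in question 3 can be illustrated without any external library. The sketch below is a simplification under stated assumptions: it mines only single-item rules (x → y), it expects `transactions` to be a list of item sets already parsed from the file, and the names `pair_rules`, `split_half`, and `rule_overlap` are hypothetical. Lift is computed as confidence(x → y) / support(y). A library such as mlxtend would handle larger itemsets.

```python
import random
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.01, min_confidence=0.2, min_lift=1.0):
    """Mine x -> y rules between single items, scored by support, confidence, lift."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            if confidence >= min_confidence and lift >= min_lift:
                rules[(x, y)] = {
                    "support": support,
                    "confidence": confidence,
                    "lift": lift,
                }
    return rules


def split_half(transactions, seed=0):
    """Shuffle with a fixed seed and cut into two roughly equal halves."""
    shuffled = list(transactions)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def rule_overlap(rules_a, rules_b):
    """Jaccard similarity between the two sets of (antecedent, consequent) keys."""
    a, b = set(rules_a), set(rules_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

One way to use it is to run `pair_rules` on each half returned by `split_half` and compare the results with `rule_overlap`. An overlap near 1 suggests the rules are stable across samples, while a low overlap hints that some rules reflect sampling noise at the chosen thresholds.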
Submission instructions:
• Submit one PDF file for all the non-programming questions. If you are submitting
  handwritten solutions, please ensure that your handwriting is legible and the pages are
  scanned properly.
• Submit your source files (.ipynb or .py) for each programming question separately (do
  not compress them into one zip). Clearly mention the question number in the source
  file name. Make sure that any comments or answers related to the code are included in
  the notebook itself, and not separately.
• In questions that require you to build a machine learning model, you are expected
  to perform sufficient hyperparameter tuning, analyze and discuss the results. Merely
  presenting the output of one model in one specific setting will not be sufficient to earn
  full points.
• You are welcome to discuss the assignment problems with other students in class, but
  you must write up the solution yourself, and indicate who you discussed with (if any).
• Assignments may be handed in up to one day late (24-hour period), penalized by 10%.
  Submissions later than this will not be accepted. Contact the teaching staff if there are
  extenuating circumstances.