IE-7275 - Fall 2023
Soumya Janardhanan
Homework 1
Due: Thursday 05/25 end of day (11.59PM)
Total Points: 85 Extra Credit: 10 points
3 extra days available. For each extra day you take, you incur a penalty of 2%
In this homework, you'll practice the basics of data cleaning, data partition, data
normalization, and data visualization. Feel free to use python or R. Hints in the
homework assume python.
Extra credit:
Knowledge check on probability and statistics which are prerequisites for
machine learning
Please refer to the Hint section wherever available if you do not know where to start.
Question 1: Data Cleaning and Preprocessing
Glass Identification Dataset -
The Glass Identification Dataset is a multivariate dataset introduced by the UCI
Machine Learning Repository. The dataset involves predicting the type of glass given a
number of physical properties.
The dataset is a study of glass composition in forensic chemistry, and it's used for
testing classification algorithms. The goal is to classify different types of glass based on
their content.
Download the dataset ONLY at this link.
1. Load the dataset
   a. Print summary statistics for it (1 point)
   b. How will you generate the summary for all columns of a DataFrame regardless of data type (numerical or categorical)? (1 point)
2. Compute the percentage of null/missing values for each variable (3 points)
3. Drop the variables which have more than 75% missing values (Avoid manual intervention. Code should work even if the attribute/data changes) (3 points)
   Hint: Handle missing data in Python; dropna() thresh option
4. If a variable contains more than 10 missing records, impute the records by using the mean value of records from the respective class instead of using the mean value of the entire column. (Avoid manual intervention. Code should work even if the attribute/data changes) (5 points)
5. If a variable contains 10 or fewer missing records, impute the records with the previous non-NaN value from a row with the same 'Class' (Avoid manual intervention. Code should work even if the attribute/data changes) (5 points)
   Hint: Consider using one or a combination of fillna, groupby, transform, and mean to complete this task
6. Check if all the missing values are handled. (2 points)
7. The target variable is the “Class” variable, which denotes the classification of glass into multiple classes. Plot a bar chart to visualize the distribution of the Class column and describe what you infer from it. (4 points)
8. Compute the correlation matrix, and plot a heat map of the numerical predictors. Which features are highly positively correlated, not correlated, and negatively correlated? Why? (6 points)
Histogram and Boxplots
9. Plot histograms for all predictors. Upon examining them:
   a) List down the modes for each histogram. (2 points)
   b) Select any two histograms of your choice and provide a comparative analysis of their distributions. Share your observations about each of them. (8 points)
      Hint: Include comparisons about central tendency, shape, skewness, number of peaks, etc.
10. Plot boxplots for all the features. Choose any two boxplots and provide the values of the “five number summary” for each plot, followed by a comparative analysis between the two plots. (10 points)
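For items 2 to 5 of Question 1, the following is a minimal sketch of one possible approach, run on a small made-up DataFrame rather than the Glass dataset. The column names `a`, `b`, `c` and the constant `THRESHOLD` are assumptions introduced for illustration; `THRESHOLD` is set to 1 so that both imputation branches run on the toy data, whereas the assignment specifies 10.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; NOT the Glass dataset.
df = pd.DataFrame({
    "Class": [1, 1, 1, 2, 2, 2],
    "a": [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],          # 2 missing values
    "b": [np.nan, np.nan, np.nan, np.nan, np.nan, 5.0],   # 5 of 6 missing
    "c": [7.0, np.nan, 8.0, 9.0, 9.5, 9.0],               # 1 missing value
})

# Item 2: percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Item 3: drop columns with more than 75% missing, without naming them.
df = df.drop(columns=missing_pct[missing_pct > 75].index)

# Items 4 and 5: choose the imputation rule from the missing count.
THRESHOLD = 1  # hypothetical value for the toy data; the assignment uses 10
n_missing = df.isna().sum()
for col in df.columns.drop("Class"):
    if n_missing[col] > THRESHOLD:
        # Many missing values: fill with the mean of the same class.
        class_mean = df.groupby("Class")[col].transform("mean")
        df[col] = df[col].fillna(class_mean)
    elif n_missing[col] > 0:
        # Few missing values: carry forward the previous value in the class.
        df[col] = df.groupby("Class")[col].ffill()

# Item 6: total number of missing values left.
remaining = int(df.isna().sum().sum())
```

A forward fill cannot fill a value that is missing in the first row of its class, so the final check in item 6 is still worth running on real data.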
Question 2: Data Visualization techniques
Download Dataset:
Forest fires
Source: https://archive.ics.uci.edu/ml/datasets/Forest+Fires
The file forestfires.csv includes data from Cortez and Morais (2007).
This dataset contains meteorological and other data from the northeast region of
Portugal, with the aim of predicting the burned area of forest fires.
Attribute Information:
Predictors:
X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
month - month of the year: 'jan' to 'dec'
day - day of the week: 'mon' to 'sun'
FFMC - FFMC index from the FWI system: 18.7 to 96.20
DMC - DMC index from the FWI system: 1.1 to 291.3
DC - DC index from the FWI system: 7.9 to 860.6
ISI - ISI index from the FWI system: 0.0 to 56.10
temp - temperature in Celsius degrees: 2.2 to 33.30
RH - relative humidity in %: 15.0 to 100
wind - wind speed in km/h: 0.40 to 9.40
rain - outside rain in mm/m2 : 0.0 to 6.4
Target variable:
area - the burned area of the forest (in ha): 0.00 to 1090.84
1. Plot a stacked bar chart to show the number of forest fires grouped by months and days of the week. (Make sure the months are in chronological order, i.e., attribute values are sorted starting with January and ending with December.) Do you see any issues with the stacked bar chart? How would you rectify them? (10 points)
   Hint: Before creating the bar chart, transform the original dataset to the data frame you need for this section. Then build a stacked bar chart with matplotlib or any other library of your choice
2. Create a heatmap of the correlation coefficients between the area, wind, and temp variables. How do these factors relate to the burned area? (5 points)
3. Additionally, plot a joint distribution of wind and area and another for temp and area. Based on these visualizations, are there any particular patterns or outliers that you can observe? (5 points)
4. Plot the scatter matrix for temp, RH, DC and DMC. How do you interpret the result in terms of correlation among the variables? (5 points)
   Hint: Create a scatter matrix with Seaborn
5. Open-ended analysis question - perform your own exploratory data analysis on this dataset, and present one finding. Feel free to use any of the concepts covered in the textbook or in class or otherwise (10 points)
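The transformation mentioned in the hint for item 1 of Question 2 could be sketched as follows. The rows below are invented for illustration and are not taken from forestfires.csv; only the `month` and `day` column names and their value formats come from the attribute list above. The names `MONTHS`, `DAYS`, `fires` and `counts` are assumptions.

```python
import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

# Toy rows invented for illustration; NOT the real forest fires data.
fires = pd.DataFrame({
    "month": ["aug", "mar", "aug", "sep", "mar", "aug"],
    "day":   ["fri", "mon", "sun", "fri", "mon", "fri"],
})

# Count fires per (month, day). crosstab sorts labels alphabetically,
# so reindex to force chronological order and add absent months/days as 0.
counts = (
    pd.crosstab(fires["month"], fires["day"])
      .reindex(index=MONTHS, columns=DAYS, fill_value=0)
)
```

With matplotlib installed, `counts.plot(kind="bar", stacked=True)` would then draw one bar per month with one segment per day of the week.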
Extra Credit Section:
Q1: Probability Knowledge Check: An Artificial Intelligence conference is taking
place, with attendees from three different countries: Country A, Country B, and
Country C. The distribution of attendees from these countries is 50%, 30%, and 20%,
respectively. During the conference, attendees have an opportunity to participate in a
hackathon.
● From Country A, 65% of attendees participate in the hackathon, and of those, 80% complete it successfully.
● From Country B, 50% of attendees decide to participate in the hackathon, and of those, 70% complete it successfully.
● From Country C, 75% of the attendees participate in the hackathon, and of those, 60% complete it successfully.
a) What is the probability that a randomly chosen attendee participates in the hackathon? (2 points)
b) Given that an attendee took part in the hackathon and completed it successfully, what is the probability they are from Country A? (2 points)
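Parts a) and b) are applications of the law of total probability and Bayes' rule. A minimal sketch of that setup, using only the percentages stated above (the variable names are assumptions introduced here):

```python
# Shares and conditional rates as stated in the problem.
share = {"A": 0.50, "B": 0.30, "C": 0.20}                 # P(country)
p_join = {"A": 0.65, "B": 0.50, "C": 0.75}                # P(participates | country)
p_finish_given_join = {"A": 0.80, "B": 0.70, "C": 0.60}   # P(success | participates, country)

# a) Law of total probability over the three countries.
p_participates = sum(share[c] * p_join[c] for c in share)

# b) Bayes' rule: joint probability of (country, participates, success)
#    divided by the total probability of (participates, success).
joint = {c: share[c] * p_join[c] * p_finish_given_join[c] for c in share}
p_success = sum(joint.values())
p_a_given_success = joint["A"] / p_success
```

Under these numbers `p_participates` evaluates to 0.625 and `p_a_given_success` to 4/7, about 0.571.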
Q2: Normal Distributions: The height of corn plants follows a Normal distribution with mean 145cm and standard deviation 22cm.
1. What percentage of plants are between 135cm and 155cm tall? (1 point)
2. If we choose a random sample of 16 plants, what is the probability that the mean height would be between 135cm and 155cm? (1 point)
3. If we choose a random sample of 32 plants, what is the probability that the mean height would be between 135cm and 155cm? (1 point)
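All three parts of Q2 use the same calculation, because the mean of n independent Normal observations is itself Normal with the same mean and standard deviation 22/sqrt(n). A small sketch using Python's standard-library `statistics.NormalDist`; the helper name `prob_between` is an assumption made for this example:

```python
from math import sqrt
from statistics import NormalDist

MEAN_CM, SD_CM = 145.0, 22.0  # values stated in the problem

def prob_between(lo, hi, n=1):
    """P(lo < mean of n plants < hi); n=1 is a single plant."""
    # The sample mean has standard deviation SD / sqrt(n).
    dist = NormalDist(MEAN_CM, SD_CM / sqrt(n))
    return dist.cdf(hi) - dist.cdf(lo)

p_single = prob_between(135, 155)        # part 1
p_n16 = prob_between(135, 155, n=16)     # part 2
p_n32 = prob_between(135, 155, n=32)     # part 3
```

The probabilities come out at roughly 0.35, 0.93 and 0.99 respectively, which shows how the sample mean concentrates around 145cm as n grows.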
Q3: In a study of behavioral asymmetries, 2,391 individuals were asked which hand they preferred to use to write, and which foot they preferred to use to kick a ball. The results are as follows:

        Right Hand               Left Hand          Total
  Right Foot   Left Foot   Right Foot   Left Foot
     2012         142         121          116       2391

Are hand and foot preferences independent? (3 points)