Homework 1

School: Northeastern University
Course: 7615
Subject: Industrial Engineering
Date: Dec 6, 2023

IE-7275 - Fall 2023 Soumya Janardhanan
Homework 1
Due: Thursday 05/25 end of day (11:59 PM)
Total Points: 85
Extra Credit: 10 points
3 extra days available. For each extra day you take, you incur a penalty of 2%.

In this homework, you'll practice the basics of data cleaning, data partition, data normalization, and data visualization. Feel free to use Python or R. Hints in the homework assume Python.
Extra credit: Knowledge check on probability and statistics, which are prerequisites for machine learning.
Please refer to the Hint section wherever available if you do not know where to start.

Question 1: Data Cleaning and Preprocessing

Glass Identification Dataset - The Glass Identification Dataset is a multivariate dataset introduced by the UCI Machine Learning Repository. The dataset involves predicting the type of glass given a number of physical properties. The dataset is a study of glass composition in forensic chemistry, and it's used for testing classification algorithms. The goal is to classify different types of glass based on their content.
Download the dataset ONLY at this link.

1. Load the dataset
   a. Print summary statistics for it (1 point)
   b. How will you generate the summary for all columns of a DataFrame regardless of data type (numerical or categorical)? (1 point)
2. Count the percentage of null/missing values for each variable (3 points)
3. Drop the variables which have more than 75% missing values (Avoid manual intervention. Code should work even if the attribute/data changes) (3 points)
   Hint: Handle missing data in Python with the thresh option of dropna()
4. If a variable contains more than 10 missing records, impute the records by using the mean value of records from the respective class instead of using the mean
value of the entire column. (Avoid manual intervention. Code should work even if the attribute/data changes) (5 points)
5. If a variable contains 10 or fewer missing records, impute the records with the previous non-NaN value from a row with the same 'Class' (Avoid manual intervention. Code should work even if the attribute/data changes) (5 points)
   Hint: Consider using one or a combination of fillna, groupby, transform, and mean to complete this task
6. Check if all the missing values are handled. (2 points)
7. The target variable is the “Class” variable, which denotes the classification of glass into multiple classes. Plot a bar chart to visualize the distribution of the Class column and describe what you infer from it. (4 points)
8. Compute the correlation matrix, and plot a heat map of the numerical predictors. Which features are highly positively correlated, not correlated, and negatively correlated? Why? (6 points)

Histogram and Boxplots

9. Plot histograms for all predictors. Upon examining them:
   a) List the modes for each histogram. (2 points)
   b) Select any two histograms of your choice and provide a comparative analysis of their distributions. Share your observations about each of them. (8 points)
   Hint: Include comparisons of central tendency, shape, skewness, number of peaks, etc.
10. Plot boxplots for all the features. Choose any two boxplots and provide the values of the “five number summary” for each plot, followed by a comparative analysis between the two plots. (10 points)

Question 2: Data Visualization techniques

Download Dataset: Forest fires
Source: https://archive.ics.uci.edu/ml/datasets/Forest+Fires
The file forestfires.csv includes data from Cortez and Morais (2007). This dataset contains meteorological and other data in the northeast region of Portugal, with the aim of predicting the burned area of forest fires.

Attribute Information:

Predictors:
X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
month - month of the year: 'jan' to 'dec'
day - day of the week: 'mon' to 'sun'
FFMC - FFMC index from the FWI system: 18.7 to 96.20
DMC - DMC index from the FWI system: 1.1 to 291.3
DC - DC index from the FWI system: 7.9 to 860.6
ISI - ISI index from the FWI system: 0.0 to 56.10
temp - temperature in Celsius degrees: 2.2 to 33.30
RH - relative humidity in %: 15.0 to 100
wind - wind speed in km/h: 0.40 to 9.40
rain - outside rain in mm/m2: 0.0 to 6.4

Target variable:
area - the burned area of the forest (in ha): 0.00 to 1090.84

1. Plot a stacked bar chart to show the number of forest fires grouped by months and days of the week. (Make sure the months are in chronological order, i.e., attribute values are sorted starting with January and ending with December.) Do you see any issues with the stacked bar chart? How would you rectify them? (10 points)
   Hint: Before creating the bar chart, transform the original dataset to the data frame you need for this section. Then build a stacked bar chart with matplotlib or any other library of your choice
2. Create a heatmap of the correlation coefficients between the area, wind, and temp variables. How do these factors relate to the burned area? (5 points)
3. Additionally, plot a joint distribution of wind and area and another for temp and area. Based on these visualizations, are there any particular patterns or outliers that you can observe? (5 points)
4. Plot the scatter matrix for temp, RH, DC and DMC. How do you interpret the result in terms of correlation among the variables? (5 points)
   Hint: Create a scatter matrix with Seaborn
5. Open-ended analysis question - perform your own exploratory data analysis on this dataset, and present one finding.
Feel free to use any of the concepts covered in the textbook or in class or otherwise (10 points)

Extra Credit Section:

Q1: Probability Knowledge Check: An Artificial Intelligence conference is taking place, with attendees from three different countries: Country A, Country B, and
Country C. The distribution of attendees from these countries is 50%, 30%, and 20%, respectively. During the conference, attendees have an opportunity to participate in a hackathon. From Country A, 65% of attendees participate in the hackathon, and of those, 80% complete it successfully. From Country B, 50% of attendees decide to participate in the hackathon, and of those, 70% complete it successfully. From Country C, 75% of the attendees participate in the hackathon, and of those, 60% complete it successfully.
a) What is the probability that a randomly chosen attendee participates in the hackathon? (2 points)
b) Given that an attendee took part in the challenge and completed it successfully, what is the probability they are from Country A? (2 points)

Q2: Normal Distributions: The height of corn plants follows a Normal distribution with mean 145cm and standard deviation 22cm.
1. What percentage of plants are between 135cm and 155cm tall? (1 point)
2. If we choose a random sample of 16 plants, what is the probability that the mean height would be between 135cm and 155cm tall? (1 point)
3. If we choose a random sample of 32 plants, what is the probability that the mean height would be between 135cm and 155cm tall? (1 point)

Q3: In a study of behavioral asymmetries, 2,391 individuals were asked which hand they preferred to use to write, and which foot they preferred to use to kick a ball. The results are as follows:

        Right Hand                Left Hand                 Total
        Right Foot   Left Foot    Right Foot   Left Foot
        2012         142          121          116          2391

Are hand and foot preferences independent? (3 points)
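
Sketch for the Question 1 hints (items 2 to 5): the snippet below is one possible way to combine isna, dropna(thresh=...), fillna, groupby, transform, and ffill without hard-coding column names. It is a minimal illustration, not a reference solution. The toy frame, its column names and values, and the LIMIT constant are made up for the example; only the 'Class' column name and the 75% and 10-record thresholds come from the assignment text.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the glass data; columns and values are made up.
toy = pd.DataFrame({
    "Class": [1, 1, 1, 2, 2, 2],
    "RI": [1.52, np.nan, 1.50, 1.53, 1.55, np.nan],
    "Na": [13.0, 13.4, np.nan, 14.0, 14.1, 14.2],
    "Empty": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Percentage of missing values per column.
missing_pct = toy.isna().mean() * 100

# Drop columns with more than 75% missing values.
# dropna(thresh=k) keeps a column only if it has at least k non-null entries,
# so "at most 75% missing" becomes "at least 25% non-null".
min_non_null = int(np.ceil(0.25 * len(toy)))
kept = toy.dropna(axis=1, thresh=min_non_null)

LIMIT = 1  # toy threshold so both branches run; the assignment uses 10
na_counts = kept.isna().sum()
filled = kept.copy()
for col in kept.columns.drop("Class"):
    if na_counts[col] > LIMIT:
        # Many gaps: fill each gap with the mean of its own class.
        class_mean = kept.groupby("Class")[col].transform("mean")
        filled[col] = kept[col].fillna(class_mean)
    elif na_counts[col] > 0:
        # Few gaps: carry the previous non-NaN value forward within the class.
        # A gap in the first row of a class has no previous value and stays NaN.
        filled[col] = kept.groupby("Class")[col].ffill()

# Remaining missing values per column (item 6).
remaining = filled.isna().sum()
```

Because the thresholds are derived from len(toy) and the loop runs over whatever columns survive, the same code should keep working if the attributes or the data change.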
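
Sketch for the Question 2 hint (item 1): the transformation step before the stacked bar chart can be done with a cross-tabulation that is then reindexed into calendar order. This is only an illustration under assumptions: the rows below are made up and are not taken from forestfires.csv, and the MONTHS and DAYS lists assume the lowercase three-letter labels given in the attribute description.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

# Made-up rows, one per fire; the real data would come from forestfires.csv.
toy = pd.DataFrame({
    "month": ["aug", "aug", "mar", "sep", "sep", "sep", "mar"],
    "day": ["fri", "sun", "mon", "sat", "sat", "tue", "fri"],
})

# Count fires per (month, day). crosstab sorts labels alphabetically,
# so reindex puts rows and columns back into calendar order and
# fills month/day combinations with no fires with 0.
counts = pd.crosstab(toy["month"], toy["day"]).reindex(
    index=MONTHS, columns=DAYS, fill_value=0
)

# One bar per month, one stacked segment per day of the week.
ax = counts.plot(kind="bar", stacked=True)
ax.set_xlabel("month")
ax.set_ylabel("number of fires")
ax.figure.tight_layout()
```

Reindexing against the full month list also keeps months with no fires on the axis, which makes gaps in the data visible instead of silently dropping them.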
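
Sketch for Extra Credit Q2: the three parts differ only in sample size, because the mean of n independent Normal(mu, sigma) draws is itself Normal with mean mu and standard error sigma / sqrt(n). The helper below is a hypothetical name introduced for illustration; it uses NormalDist from the Python standard library and assumes the plants in a sample are independent.

```python
from statistics import NormalDist


def prob_mean_between(mu, sigma, n, low, high):
    """P(low < sample mean < high) for n independent Normal(mu, sigma) draws."""
    # The sample mean is Normal with the same mean and standard error sigma / sqrt(n).
    sampling_dist = NormalDist(mu, sigma / n ** 0.5)
    return sampling_dist.cdf(high) - sampling_dist.cdf(low)


# Values from Q2: mean 145 cm, standard deviation 22 cm, interval 135 cm to 155 cm.
# n = 1 corresponds to a single plant; n = 16 and n = 32 to the two sample sizes.
probs = {n: prob_mean_between(145, 22, n, 135, 155) for n in (1, 16, 32)}
for n, p in probs.items():
    print(f"n={n:2d}: P(135 < mean < 155) = {p:.4f}")
```

The probability rises with n because the standard error shrinks while the interval stays fixed.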