Hands-On Exercise 2

School: California State University, Los Angeles
Course: 4200
Subject: Computer Science
Date: Feb 20, 2024

Kathlyne Alilain
Dr. Li
CIS 4200-01
20 Feb. 2024

1. Data Understanding:
- What does each observation represent? Each observation (row) represents one car, described by a set of variables (attributes).
- How many variables are there? 12
- Which data attributes are categorical, and which are numeric? Categorical: manufacturer, model, year, trans, drv, fl, class. Numeric: displ, cyl, cty, hwy.

2. Data Preprocessing:
- Check for duplicate and missing data. Are there any duplicate rows? Are there any missing values? There are 9 duplicate rows and 1 missing value.
- Propose solutions for handling missing data. Deletion and imputation: remove the duplicate rows with the remove-duplicates function, and fill in the missing value using similar records.

3. Data Enrichment: Create a new variable called "mpg" that represents the average of city ("cty") and highway ("hwy") miles per gallon.
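
The preprocessing and enrichment steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are hypothetical sample data (not the assignment's dataset), the helper names are made up for this sketch, and the report itself does not say which tool was used.

```python
# Hypothetical sample rows; the real assignment data is not reproduced here.
rows = [
    {"model": "a4", "year": 1999, "drv": "f", "cty": 18, "hwy": 29},
    {"model": "a4", "year": 1999, "drv": "f", "cty": 18, "hwy": 29},  # exact duplicate
    {"model": "mustang", "year": 2008, "drv": "r", "cty": 15, "hwy": None},  # missing hwy
    {"model": "tacoma 4wd", "year": 2008, "drv": "4", "cty": 15, "hwy": 20},
]


def drop_duplicates(records):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def count_missing(records):
    """Count cells whose value is None."""
    return sum(1 for rec in records for value in rec.values() if value is None)


def add_mpg(records):
    """Return new rows with mpg = (cty + hwy) / 2, or None if either input is missing."""
    enriched = []
    for rec in records:
        cty, hwy = rec["cty"], rec["hwy"]
        mpg = None if cty is None or hwy is None else (cty + hwy) / 2
        enriched.append({**rec, "mpg": mpg})
    return enriched


deduped = drop_duplicates(rows)
n_duplicates = len(rows) - len(deduped)
n_missing = count_missing(deduped)
enriched = add_mpg(deduped)

print("duplicates removed:", n_duplicates)
print("missing values:", n_missing)
print("mpg:", [rec["mpg"] for rec in enriched])
```

Leaving "mpg" empty when an input is missing keeps the gap visible, so the imputation step can decide later how to fill it.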

4. Understanding Numerical Variables: Calculate and present descriptive statistics (mean, median, range, variance, and standard deviation) for the numeric variable "mpg". Compare and explain the mean and standard deviation.
The mean is 20.24 and the standard deviation is 5.063234. The mean is the average value of mpg across all cars. The standard deviation measures how much the individual values vary around the mean: a typical car's mpg differs from the mean of 20.24 by about 5.06.

5. Understanding categorical variables:
- What are the unique values of drive train type (drv)? What is the mode for the "drv" variable? The unique values of drv are 4, f, and r. The mode is f.
- Create bar plots to illustrate the distribution of "drv". Compare the distribution of "drv" in 1999 and 2008. Summarize the difference.
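
As a rough sketch of how the summaries in sections 4 and 5 could be computed, the Python below uses only the standard library. The records are hypothetical, so the printed numbers will not match the values reported here, and the text bars stand in for a real bar plot.

```python
import statistics
from collections import Counter, defaultdict

# Hypothetical (year, drv, mpg) records, not the assignment's data.
cars = [
    (1999, "f", 23.5), (1999, "f", 25.0), (1999, "f", 22.0),
    (1999, "4", 17.0), (1999, "r", 20.5),
    (2008, "f", 24.0), (2008, "f", 21.0),
    (2008, "4", 16.5), (2008, "4", 18.0), (2008, "r", 19.5),
]

# Section 4: descriptive statistics for the numeric variable.
mpg_values = [mpg for _, _, mpg in cars]
mpg_summary = {
    "mean": statistics.mean(mpg_values),
    "median": statistics.median(mpg_values),
    "range": max(mpg_values) - min(mpg_values),
    "variance": statistics.variance(mpg_values),  # sample variance (divides by n - 1)
    "stdev": statistics.stdev(mpg_values),        # square root of the sample variance
}
for name, value in mpg_summary.items():
    print(f"{name}: {value:.3f}")

# Section 5: unique values and mode of the categorical variable.
drv_values = [drv for _, drv, _ in cars]
unique_drv = sorted(set(drv_values))
drv_mode = statistics.mode(drv_values)
print("unique drv:", unique_drv, "mode:", drv_mode)

# Distribution of drv within each year, shown as counts and shares.
drv_by_year = defaultdict(Counter)
for year, drv, _ in cars:
    drv_by_year[year][drv] += 1

for year in sorted(drv_by_year):
    total = sum(drv_by_year[year].values())
    for drv in unique_drv:
        count = drv_by_year[year][drv]
        print(f"{year} {drv}: {'#' * count} ({count / total:.0%})")
```

Comparing shares rather than raw counts keeps the two years comparable even when they contain different numbers of cars.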
The distributions of drv in 1999 and 2008 are similar to each other. Front-wheel drive was slightly more common in 1999 than in 2008, while 4-wheel drive was slightly more prevalent in 2008. Rear-wheel drive increased slightly from 1999 to 2008, but its share remained small and very similar in both years.

6. Box Plots for numeric variables: Use a box plot to show the summary distribution of the numeric variable "mpg".
- Report key statistics (Q1, median, Q3, max, min) displayed in the box plot. Q1 = 15.75, median (Q2) = 20.5, Q3 = 23.5, max = 35, min = 10.5.
- What is the "mpg" range of the middle 50% of cars in the dataset? 15.75-23.5
- Box plots by year: Use a box plot to show the distribution of the "mpg" variable in 1999 and 2008. Summarize the difference. The range is a bit larger in 2008 (10.5-32.5) than in 1999 (13-30.5). The interquartile range is also larger in 2008, where it spans 16-24. The 1999 plot shows 3 outliers, whereas the 2008 plot shows none. The means of the two years are very similar, as are their medians (Q2).
- Box plots by class: Use a box plot to show the distribution of the "mpg" variable in different classes. Summarize the difference.
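
The statistics a box plot displays can also be computed directly for each group. The sketch below is a minimal illustration, assuming hypothetical mpg values per class and the common 1.5 x IQR rule for flagging outliers; plotting tools differ slightly in how they compute quartiles, so values read off a plot may not match exactly.

```python
import statistics


def box_plot_stats(values):
    """Five-number summary plus outliers under the 1.5 * IQR rule (an assumed convention)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": iqr,
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }


# Hypothetical mpg values grouped by class, not the assignment's data.
mpg_by_class = {
    "compact": [20.5, 22.0, 23.5, 24.0, 25.0, 34.0],
    "suv": [13.0, 15.0, 16.0, 17.0, 18.5],
}

stats_by_class = {name: box_plot_stats(values) for name, values in mpg_by_class.items()}
for name, stats in stats_by_class.items():
    print(name, stats)
```

The interval from Q1 to Q3 is the "middle 50%" asked about above, and the same function can be applied to groups defined by year instead of class.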
Within each class the data are tightly grouped and not skewed, except for the subcompact class. The means and medians (Q2) of the classes fall within similar ranges of each other. The whiskers of the subcompact class are much longer than those of the other classes. The compact, subcompact, SUV, and minivan classes all have some outliers.

7. Histogram for numeric variables: Use a histogram to show the detailed distribution of the numeric variable "mpg".
- Explore different bin widths and discuss what is a proper bin width. Bins must be consecutive, non-overlapping intervals of equal size. Different bin widths can reveal different patterns or trends in the data: a width that is too small makes the histogram noisy, and one that is too large hides the shape of the distribution.
- Use a bin width of 4: how many models fall into the common range? About 4-5 models.
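
To make the bin-width discussion concrete, here is a small sketch that counts values into consecutive, equal-width, non-overlapping bins and compares several widths. The mpg values and the `histogram` helper are hypothetical, chosen only to illustrate the idea.

```python
import math
from collections import Counter


def histogram(values, bin_width):
    """Count values into bins [lo, lo + bin_width), keyed by each bin's lower edge."""
    # Align the first bin edge to a multiple of the bin width.
    start = math.floor(min(values) / bin_width) * bin_width
    counts = Counter()
    for value in values:
        index = math.floor((value - start) / bin_width)
        counts[start + index * bin_width] += 1
    return dict(sorted(counts.items()))  # empty bins are simply absent


# Hypothetical mpg values, not the assignment's data.
mpg_values = [10.5, 14.0, 15.5, 17.0, 18.5, 19.0, 20.5, 21.0, 24.5, 26.0, 35.0]

for width in (2, 4, 8):
    bins = histogram(mpg_values, width)
    print(f"bin width {width}:")
    for lower_edge, count in bins.items():
        print(f"  {lower_edge}-{lower_edge + width}: {'#' * count}")

bins_width_4 = histogram(mpg_values, 4)
most_common_bin = max(bins_width_4, key=bins_width_4.get)
print("most common range at width 4:", most_common_bin, "to", most_common_bin + 4)
```

Each bin is half-open, so a value that sits exactly on an edge is counted once, in the bin that starts at that edge.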