School: George Mason University
Course: 515
Subject: Statistics
Date: Jan 9, 2024
TITLE: Analyzing Pricing, Characteristics and Clustering of Used Cars Dataset
STAT 515 - Dr. Isuru Dassanayake
FINAL RESEARCH PROJECT
BY
Aditya Paravastu
Laxma Reddy
Soumith Alloju
INTRODUCTION:
The used car market is a major segment of today's automotive industry and offers customers
a very wide range of choices. It also plays an integral role in the life cycle of a vehicle.
Many complex factors, ranging from consumer preferences and economic conditions to car
characteristics, influence this market. It is imperative for every stakeholder involved
(consumers, sellers, manufacturers, and legislators) to understand the underlying mechanisms.
The stakeholders in this sector include prospective consumers who wish to make informed
choices, producers who want to understand the nature and scope of change, and sellers who
need to gauge shifts in consumer needs and adjust their responses accordingly. Such research
may also guide lawmakers in formulating statutes and standards that protect car buyers'
interests while ensuring the sustainability of the used car market.
ABOUT:
The study focuses on understanding the features of the used car market using an extensive
data set called “Used_Cars_Cleaned.csv”. Its variables include the price, model, brand, year
of registration, mileage, and condition of each car. Since the seller type can affect the
vehicle's price, condition, and reliability, it is essential to understand it when analysing
market dynamics. The price column provides insight into how vehicle characteristics are
valued and into consumers' willingness to pay.
The ‘gearbox’ column indicates the type of transmission, coded as manual=0 and automatic=1.
The mileage column is an important indicator of vehicle use and plays an important role in
determining a car's condition, age, and value. The fuel type column records which fuel the
car uses, coded as petrol=0 and diesel=1. The NotRepairedDamage column indicates whether the
car has unrepaired damage (yes=0, no=1); this variable plays a crucial role in buyers'
decision making.
Such variables are crucial because they help determine the marketability and resale value
of second-hand automobiles. For this reason, we try to reveal which factors explain the
differences in the price of second-hand cars and in the demand for them.
By applying statistical analysis and data visualization methods, we try to shed light on
how specific characteristics of a car influence the price at which it is offered. Such
factors include the age of the car, mileage, brand value, car type, and the condition of the
used car. Secondly, our study considers how factors such as changes in the economy and
regional tastes and preferences influence the second-hand vehicle business.
The methods and strategy selected for data analysis are described in the following sections
of this work. The research findings are then summarized, and their implications are discussed.
RESEARCH OBJECTIVES:
1. To model the relationship between car prices and their age using regression. This
analysis helps us identify trends in the market and the cars that offer the best value.
2. To predict vehicle types using classification methods (e.g., random forest, decision
trees, support vector machines (SVM)) from indicators such as brand, price, and power.
METHODOLOGY:
DATA CLEANING:
Cleaning the date columns: The data contained multiple date columns, which in most cases
held the same value. Since the analysis did not depend on dates, all date columns except
dateCreated (that is, dateCrawled and lastSeen) were dropped.
Cleaning the year of registration column: The year of registration column contained years
ranging from 1000 to 9999, which is implausible. Only registration years between 1980 and
2023 were kept, as this range seems logical.
Converting German to English: The data set contains German eBay listings, so some columns
contained German words. These were translated to English to the best of our understanding.
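The translation step can be sketched as a simple lookup per column. This is a minimal illustration in Python, not the authors' code (the report's own tooling appears to be R). The German labels in `TRANSLATIONS` are assumptions chosen for illustration; the report does not list which words were translated.

```python
# Hypothetical German -> English lookup tables, one per column.
# The German labels are assumed for illustration only.
TRANSLATIONS = {
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
    "fuelType": {"benzin": "petrol", "diesel": "diesel"},
    "notRepairedDamage": {"ja": "yes", "nein": "no"},
}


def translate_row(row):
    """Return a copy of `row` with known German labels replaced by English ones."""
    out = dict(row)
    for column, mapping in TRANSLATIONS.items():
        value = out.get(column)
        if value is not None:
            # Labels missing from the table are left unchanged rather than guessed.
            out[column] = mapping.get(value, value)
    return out


# Made-up listing used only to exercise the function.
example = {"gearbox": "manuell", "fuelType": "benzin", "notRepairedDamage": "nein", "price": 4500}
translated = translate_row(example)
print(translated)
```

Leaving unknown labels untouched makes it easy to spot words that still need a translation.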
Cleaning the powerPS column: Some vehicles in the data were listed with more than 1000 hp.
For context, a Lamborghini Aventador, a supercar, has 780 hp, and the vehicles in question
were not supercars, so these values are erroneous. A realistic range of horsepower would be
25 to 800 PS.
Cleaning the price column: The density plot of price shows that the data is skewed, has
many outliers, and requires cleaning. The price distribution in the data is as follows.
No of cars with a price lower than 1K: 63690
No of cars with a price higher than 1K: 259641
No of cars with a price higher than 10K: 57144
No of cars with a price higher than 20K: 16008
No of cars with a price higher than 30K: 5356
No of cars with a price higher than 40K: 2369
Restricting the price to the range between 200 and 20K seems reasonable; doing so handles
many of the outliers.
Fig 1. Price distribution before cleaning Vs After Cleaning
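The three range filters described above (year of registration, powerPS, and price) can be sketched as follows. This is an illustrative Python sketch rather than the authors' code; treating the bounds as inclusive is an assumption, and the sample listings are made up.

```python
# Bounds quoted in the cleaning steps; inclusive endpoints are an assumption.
YEAR_RANGE = (1980, 2023)
POWER_RANGE = (25, 800)       # powerPS
PRICE_RANGE = (200, 20_000)


def in_range(value, bounds):
    """True when `value` lies within the inclusive (low, high) bounds."""
    low, high = bounds
    return low <= value <= high


def keep_listing(row):
    """A listing survives cleaning only if all three fields are plausible."""
    return (
        in_range(row["yearOfRegistration"], YEAR_RANGE)
        and in_range(row["powerPS"], POWER_RANGE)
        and in_range(row["price"], PRICE_RANGE)
    )


# Made-up listings: one plausible, three each violating a different bound.
listings = [
    {"yearOfRegistration": 2010, "powerPS": 105, "price": 6500},
    {"yearOfRegistration": 1000, "powerPS": 90, "price": 3000},
    {"yearOfRegistration": 2015, "powerPS": 1200, "price": 9000},
    {"yearOfRegistration": 2005, "powerPS": 75, "price": 150},
]
cleaned = [row for row in listings if keep_listing(row)]
print(len(cleaned), "of", len(listings), "listings kept")
```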
Creating the Age column: The Age column was created as Current Year - Year of registration.
Dropping unnecessary columns: Columns such as nrOfPictures, seller, offerType,
monthOfRegistration, postalCode, name, and yearOfRegistration were unnecessary and were
dropped from the data frame.
Dropping null values: Finally, all rows with null values were dropped using na.omit.
Converting the factor variables to numeric: The factor variables in the gearbox, fuelType,
and notRepairedDamage columns were converted to 0 and 1 for the linear model.
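The last cleaning steps (the Age column, null removal, and 0/1 encoding) can be sketched as below. This is an illustrative Python sketch, not the authors' R code. `REFERENCE_YEAR` is an assumption, since the report only says "Current Year". Incomplete rows are dropped first here so that the age arithmetic never sees a missing year.

```python
REFERENCE_YEAR = 2023  # assumption: the report only says "Current Year"

# Codes as stated in the report: manual=0/automatic=1, petrol=0/diesel=1, yes=0/no=1.
ENCODINGS = {
    "gearbox": {"manual": 0, "automatic": 1},
    "fuelType": {"petrol": 0, "diesel": 1},
    "notRepairedDamage": {"yes": 0, "no": 1},
}


def drop_incomplete(rows):
    """Keep only rows with no missing (None) values, similar in spirit to na.omit."""
    return [row for row in rows if all(value is not None for value in row.values())]


def add_age(row, reference_year=REFERENCE_YEAR):
    """Return a copy of `row` with Age = reference year - year of registration."""
    out = dict(row)
    out["Age"] = reference_year - out["yearOfRegistration"]
    return out


def encode_row(row):
    """Return a copy of `row` with the three factor columns replaced by 0/1 codes."""
    out = dict(row)
    for column, codes in ENCODINGS.items():
        out[column] = codes[out[column]]  # KeyError flags an unexpected label
    return out


# Made-up rows; the second has a missing gearbox and is dropped.
rows = [
    {"yearOfRegistration": 2010, "gearbox": "manual", "fuelType": "diesel", "notRepairedDamage": "no"},
    {"yearOfRegistration": 2008, "gearbox": None, "fuelType": "petrol", "notRepairedDamage": "yes"},
]
prepared = [encode_row(add_age(row)) for row in drop_incomplete(rows)]
print(prepared)
```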
EDA and summary Statistics of the dataset:
Fig 2. Density Plot of Price
Fig 3. Density plot of Power
Fig 4. Frequency of Vehicle Type
Fig 5. Frequency of Gearbox Type, 0: Manual, 1: Automatic
Fig 6. Frequency of Fuel Type, 0: Petrol, 1: Diesel
Fig 7. Frequency of Unrepaired Damage, 0: Yes, 1: No
Fig 8. Density plot of Age
Fig 9. Correlation plot
In this study, we use different statistical techniques to address the research questions
above. The models and approaches for each research question are explained in detail below:
Research Question 1:
Using a linear regression model and a random forest to investigate the relationship between
car price, age (derived from the year of registration), and other factors, in order to
determine which cars are the most affordable.
Model: Linear regression, ridge regression, and random forest.
Linear Regression: To predict the price of used cars, we fit a linear regression model using
the predictor variables gearbox, powerPS, kilometer, fuelType, notRepairedDamage, and Age.
We then used best subset selection to choose the best variables for the model.
Fig. RSS, adjusted R2, Cp, and BIC plots
Best subset selection indicated that all the variables were important.
The model has a multiple R-squared of 0.6789; that is, approximately 67.89% of the
variability in price is explained by the predictor variables.
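For reference, the fitted model and the fit statistics can be written as follows. The notation is introduced here and is not taken from the report: $y_i$ is the response for listing $i$, $\hat{y}_i$ its fitted value, $\bar{y}$ the mean response, $n$ the number of listings, and $p$ the number of predictors.

```latex
\[
y_i = \beta_0 + \beta_1\,\mathrm{gearbox}_i + \beta_2\,\mathrm{powerPS}_i
    + \beta_3\,\mathrm{kilometer}_i + \beta_4\,\mathrm{fuelType}_i
    + \beta_5\,\mathrm{notRepairedDamage}_i + \beta_6\,\mathrm{Age}_i + \varepsilon_i
\]
\[
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
    = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{RSS}/(n - p - 1)}{\mathrm{TSS}/(n - 1)}
\]
```

Under this definition, the reported figure of 67.89% corresponds to a residual sum of squares of roughly 32% of the total sum of squares.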
Random Forest Model:
We then built a random forest regression model on the dataset.
We divided the dataset into a test set (20%) and a training set (80%) and fitted the random
forest model on the training set. The model was built with the following formula:
sqrt_price ~ . - abtest - brand - vehicleType - model - dateCreated, with Used_cars_data as
the data.
The random forest model was able to achieve an overall accuracy of 82.2%.
Fig. Variable Importance Plot
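The 80/20 split and the square-root response used above can be sketched as follows. This is an illustrative Python sketch, not the authors' code. The seed, the helper name `train_test_split`, and the sample prices are assumptions, and `sqrt_price` is taken to mean the square root of price.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=515):
    """Shuffle a copy of `rows` and return (train, test) with the given test share."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for reproducibility
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up prices; sqrt_price is assumed to be the square root of price.
prices = [250 * (i + 1) for i in range(50)]
rows = [{"price": p, "sqrt_price": math.sqrt(p)} for p in prices]

train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Evaluating the model only on the held-out 20% gives a performance estimate that is not inflated by rows the model has already seen.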
Research Question 2:
Determine the type of a vehicle from criteria such as brand, price, and power using
classification models.
Model: Random forest.
Response variable: vehicleType
Predictor variables: powerPS, price, brand, model
To predict the different types of vehicles, we fit a random forest classifier using the
predictor variables powerPS, brand, model, and price.
We divided the dataset into a test set (20%) and a training set (80%) and fitted the random
forest model on the training set. The model was built with the following formula:
vehicleType ~ powerPS + brand + model + price
The random forest model achieved an overall accuracy of 76.91% on the test data, meaning
that it correctly classified 76.91% of the test vehicles by type.
Fig. Confusion Matrix for the random forest model
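Overall accuracy is the share of test vehicles whose predicted type matches the actual type, that is, the diagonal of the confusion matrix divided by the total. The Python sketch below illustrates the calculation; the vehicle type labels and predictions are made up and are not the report's results.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of cases where the predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels for five test vehicles.
actual = ["suv", "bus", "suv", "coupe", "bus"]
predicted = ["suv", "bus", "coupe", "coupe", "suv"]

matrix = confusion_counts(actual, predicted)
overall = accuracy(actual, predicted)
print(matrix)
print(f"accuracy = {overall:.2%}")
```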
Research Question 3:
Identifying distinct groups (clusters) within the used car dataset based on various
attributes such as mileage, price, power, notRepairedDamage, and age to provide insights on
the condition of the car.
Model: K-means clustering
K-means clustering was performed on the scaled data with 4 cluster centers and 25 random
initial cluster assignments. The elbow method was used to determine the optimal number of
clusters by evaluating the within-cluster sum of squares (WSS) for different values of K.
Fig. Elbow Method WSS
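The scaling, the 25 random starts, and the WSS curve behind the elbow method can be sketched as follows. This is a plain Lloyd-style k-means in Python written for illustration. It is not the authors' code, and the R kmeans function may use a different algorithm by default. The synthetic points, the seed, and all function names are assumptions.

```python
import random


def scale(points):
    """Standardise each column to mean 0 and sample standard deviation 1."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    sds = [(sum((p[d] - means[d]) ** 2 for p in points) / (n - 1)) ** 0.5 for d in range(dims)]
    return [[(p[d] - means[d]) / sds[d] for d in range(dims)] for p in points]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from random initial centers; returns (WSS, labels)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return wss, labels


def kmeans(points, k, n_start=25, seed=515):
    """Best of `n_start` random starts, judged by the lowest WSS."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_start)), key=lambda r: r[0])


# Synthetic data: four well-separated blobs of 20 points each.
data_rng = random.Random(0)
blob_centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [[cx + data_rng.uniform(-1, 1), cy + data_rng.uniform(-1, 1)]
          for cx, cy in blob_centers for _ in range(20)]
scaled = scale(points)

# WSS for each K; the "elbow" is where the drop in WSS levels off.
wss_by_k = {k: kmeans(scaled, k)[0] for k in range(1, 7)}
best_wss, labels = kmeans(scaled, 4)
for k, wss in wss_by_k.items():
    print(k, round(wss, 3))
```

Scaling first matters because variables such as price and kilometer are on very different scales, and unscaled distances would be dominated by the largest one.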
The clustering was performed, and the data was divided into 4 clusters.
Using the fviz_cluster function from the factoextra package, we visualized the clustering
results from our kmeans model.
We then assigned the clustering results from the kmeans function to a new column,
clustering_vector, in the Used_cars_data dataframe.
Next, we created visualizations to analyse the clusters.
First, we created a scatter plot of the price variable against the power variable, with the
clusters as the legend, which gave the following scatter plot.
Then we created a density plot of the kilometer (mileage) variable with the clusters as the
legend.
Then we created a density plot of the Age variable with the clusters as the legend.
Similarly, we created a box plot of notRepairedDamage, as shown in the image below.
The following was observed from the clusters:
Cluster 1:
Size: 23,490
Attributes: low price, slightly lower powerPS, high kilometer, low unrepaired damage, older age.
Cluster 2:
Size: 50,662
Attributes: higher price, higher powerPS, moderate kilometer, high unrepaired damage, moderate age.
Cluster 3:
Size: 117,921
Attributes: lower price, lower powerPS, moderate kilometer, moderate unrepaired damage, moderate age.
Cluster 4:
Size: 41,491
Attributes: higher price, moderate powerPS, low kilometer, moderate unrepaired damage, slightly younger age.
With the cluster analysis above and the clustering_vector column in the Used_cars_data
dataframe, the dataset can be used to analyse patterns and relationships within and between
clusters. The clustering vector can serve as a categorical variable for further predictive
modeling or classification tasks, or as an input to other algorithms that explore
relationships between clusters and other variables.
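One way to use the cluster labels as a grouping variable is to summarise each cluster by its size and mean attribute values. The Python sketch below illustrates this. The rows are made up, and `cluster_profile` is a hypothetical helper, not part of the report.

```python
from collections import defaultdict


def cluster_profile(rows, columns, cluster_key="clustering_vector"):
    """Return {cluster: {"size": n, column: mean, ...}} for the given columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_key]].append(row)
    profile = {}
    for cluster, members in sorted(groups.items()):
        summary = {"size": len(members)}
        for column in columns:
            summary[column] = sum(m[column] for m in members) / len(members)
        profile[cluster] = summary
    return profile


# Made-up rows carrying a cluster label in clustering_vector.
rows = [
    {"price": 1500, "kilometer": 150000, "clustering_vector": 1},
    {"price": 2500, "kilometer": 125000, "clustering_vector": 1},
    {"price": 12000, "kilometer": 40000, "clustering_vector": 4},
]
profile = cluster_profile(rows, ["price", "kilometer"])
for cluster, summary in profile.items():
    print(cluster, summary)
```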
CONCLUSION:
In conclusion, an extensive analysis of the “Used_Cars_Cleaned.csv” dataset produced
important findings about trends in the used car trade. The linear regression analysis showed
that the older a vehicle is, the lower its price, reflecting depreciation. This finding
could help prospective purchasers: older cars normally offer more dollar-for-mile value,
subject to their maintenance status and condition.
Using classification models that considered characteristics such as power, price, and brand,
we were able to predict the likely type of a car. This part of the study, which identifies
features that distinguish car classes, is also important because it helps sellers classify
cars and buyers choose the type they need.
In addition, vehicle characteristics significantly affected market prices. Fuel type,
gearbox, and brand were among the attributes that strongly influenced both the appeal and
the pricing of a used car. Every potential buyer should therefore weigh a thorough set of
factors when evaluating the worth of a used car. The study has not only broadened
understanding of the used car industry but has also offered practical guidance for various
interest groups. It demonstrates the significance of a data-driven approach to decision
making, whether in buying, selling, or otherwise taking part in the auto industry.
REFERENCES:
[1] “Used Cars Data - dataset by data-society,” data.world. https://data.world/data-society/used-cars-data