TITLE: Analyzing Pricing, Characteristics and Clustering of Used Cars Dataset
STAT 515 - Dr. Isuru Dassanayake
FINAL RESEARCH PROJECT BY Aditya Paravastu Laxma Reddy Soumith Alloju
INTRODUCTION: The used car market makes up the bulk of today's automotive industry and gives customers the widest possible range of choice. It also plays an integral role in the whole life cycle of a vehicle. Many interacting factors, including consumer preferences, economic conditions, and car characteristics, influence this market. Every stakeholder involved (consumers, sellers, manufacturers, and legislators) needs to understand the underlying mechanisms. Prospective buyers want to make informed choices, producers want to understand the nature and scope of change, and sellers need to gauge shifts in consumer needs and adjust their own responses accordingly. Such research may also guide lawmakers in formulating statutes and standards that protect car buyers' interests while ensuring the sustainability of the used car market.

ABOUT: The study examines the features of the used car market using an extensive data set called "Used_Cars_Cleaned.csv". Its variables include the price, model, brand, year of registration, mileage, and condition of each car. Because the seller type can affect a vehicle's price, condition, and reliability, it is essential to understand it when analysing market dynamics. The price column provides insight into how vehicle characteristics relate to consumers' willingness to pay. The 'gearbox' column indicates the type of transmission, coded as manual=0 and automatic=1. The mileage column is an important indicator of vehicle use and plays an important role in determining condition, age, and value. The fuel type column records the fuel the car uses, coded as petrol=0 and diesel=1. The NotRepairedDamage column indicates whether the car has unrepaired damage (yes=0, no=1); this variable plays a crucial role in buyers' decision making. These variables matter because they help determine the marketability and resale value of second-hand automobiles.

For this reason, we try to reveal which factors explain the differences in the price of second-hand cars and in their demand. By applying statistical analysis and data visualization methods, we try to shed light on how specific characteristics of a car influence the price at which it is offered. Such factors include the age of the car, mileage, brand value, car type, and the condition of the used car. Secondly, our study examines how variables such as changes in the economy and regional tastes and preferences influence the second-hand motor vehicle business. The methods and strategy selected for data analysis are described in the following sections of this work. The research findings are then summarized, and their implications are discussed.
RESEARCH OBJECTIVES:
1. To model the relationship between car prices and their age using regression. This analysis helps us identify trends in the market and the cars that provide the best value.
2. To predict vehicle types using classification methods (e.g., random forest, decision trees, support vector machine (SVM)), based on indicators such as brand, price, and power.

METHODOLOGY:

DATA CLEANING:

Cleaning the date columns: The data contained multiple date columns, which in most cases held the same values. The analysis did not depend on dates, so all date columns except dateCreated (i.e. the dateCrawled and lastSeen columns) were dropped.

Cleaning the year of registration column: The year of registration column contained values ranging from 1000 to 9999, which is impractical. Only registration years between 1980 and 2023 were kept, as this range seems logical.

Converting German to English: The data set contains German eBay listings, so some columns contained German words. These were converted to English as per our understanding.

Cleaning the powerPS column: Some vehicles in the data had more than 1000 HP. To put this in context, a Lamborghini Aventador, a supercar, has 780 hp, and the vehicles in question were not supercars. Hence these values are wrong. A realistic range of horsepower is 25 - 800 PS.

Cleaning the price column: The density plot of price shows that the data is skewed, has many outliers, and requires cleaning. The price distribution in the data is as follows.
No. of cars with a price lower than 1K: 63690
No. of cars with a price higher than 1K: 259641
No. of cars with a price higher than 10K: 57144
No. of cars with a price higher than 20K: 16008
No. of cars with a price higher than 30K: 5356
No. of cars with a price higher than 40K: 2369
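The range filters and price-threshold counts described above can be sketched as follows. This is a minimal Python illustration rather than the project's own code (the report mentions R functions such as na.omit). The sample rows are made up, the column names follow those mentioned in the text, and treating the bounds as inclusive is an assumption.

```python
# Hypothetical listings; field names follow the columns named in the report.
listings = [
    {"yearOfRegistration": 2005, "powerPS": 105, "price": 3500},
    {"yearOfRegistration": 1000, "powerPS": 90, "price": 1200},    # impossible year
    {"yearOfRegistration": 2012, "powerPS": 1500, "price": 8000},  # implausible power
    {"yearOfRegistration": 1998, "powerPS": 60, "price": 450},
    {"yearOfRegistration": 2016, "powerPS": 300, "price": 32000},
]

YEAR_RANGE = (1980, 2023)  # range stated in the report
POWER_RANGE = (25, 800)    # PS, range stated in the report


def in_range(value, bounds):
    """True if value lies within bounds (inclusive bounds are an assumption)."""
    low, high = bounds
    return low <= value <= high


def clean(rows):
    """Keep rows whose registration year and power are both plausible."""
    return [
        r for r in rows
        if in_range(r["yearOfRegistration"], YEAR_RANGE)
        and in_range(r["powerPS"], POWER_RANGE)
    ]


def count_above(rows, threshold):
    """Number of listings priced strictly above the threshold."""
    return sum(1 for r in rows if r["price"] > threshold)


cleaned = clean(listings)
price_counts = {t: count_above(cleaned, t) for t in (1_000, 10_000, 20_000)}
print(len(cleaned), price_counts)
```

Counting listings above a few thresholds, as in the table above, is a quick way to see how heavy the right tail of the price distribution is before choosing cut-offs.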
Considering only prices between 200 and 20K seems reasonable; doing so handles many of the outliers.

Fig 1. Price distribution before cleaning vs. after cleaning

Creating the Age column: The Age column was created as Current Year - Year of registration.

Dropping unnecessary columns: Columns such as nrOfPictures, seller, offerType, monthOfRegistration, postalCode, name, and yearOfRegistration were unnecessary and were dropped from the data frame.

Dropping null values: Finally, all rows with null values were dropped using na.omit.

Converting the factor variables to numeric: The factor variables in the Gearbox, Fuel Type, and Not Repaired Damage columns were converted to 0 and 1 for the linear model.

EDA and summary statistics of the dataset:
Fig 2. Density plot of Price
Fig 3. Density plot of Power
Fig 4. Frequency of Vehicle Type
Fig 5. Frequency of Gearbox Type, 0: Manual, 1: Automatic
Fig 6. Frequency of Fuel Type, 0: Petrol, 1: Diesel
Fig 7. Frequency of Unrepaired Damage, 0: Yes, 1: No
Fig 8. Density plot of Age
Fig 9. Correlation plot

In this study, we use different statistical techniques to answer the research questions above. The models and approaches for each research question are explained in detail below:
Research Question 1: Use a linear regression model and a random forest to investigate the relationship between car price, age (derived from year of registration), and other factors, in order to determine which cars are the most affordable.

Model: Linear regression, ridge regression, and random forest.

Linear Regression: To predict the price of used cars, we fit a linear regression model using the predictor variables gearbox, powerPS, kilometer, fuelType, notRepairedDamage, and Age. We then use best subset selection to select the best variables for the model.
Fig. RSS, adjusted R2, Cp, and BIC plots

Best subset selection showed that all the variables were important. The model has a multiple R-squared value of 0.6789, i.e. approximately 67.89% of the variability in price can be explained by the predictor variables.
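To make the best subset idea concrete, the sketch below fits ordinary least squares on every combination of predictors and reports, for each subset size, the combination with the highest R-squared along with its adjusted R-squared. It is a minimal Python illustration on made-up, noise-free data, not the code used in the project. The helper names (solve, fit_ols, best_subsets) are hypothetical, and only three of the six predictors are included to keep it short.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    xtx = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    xty = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(p)]
    return solve(xtx, xty)


def r_squared(X, y, beta):
    """1 - RSS/TSS for the fitted coefficients."""
    preds = [beta[0] + sum(b * v for b, v in zip(beta[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    rss = sum((t - p) ** 2 for t, p in zip(y, preds))
    tss = sum((t - mean_y) ** 2 for t in y)
    return 1 - rss / tss


def best_subsets(X, y, names):
    """Best predictor subset of each size, with R^2 and adjusted R^2."""
    n = len(y)
    results = {}
    for k in range(1, len(names) + 1):
        best = None
        for cols in combinations(range(len(names)), k):
            Xk = [[row[c] for c in cols] for row in X]
            r2 = r_squared(Xk, y, fit_ols(Xk, y))
            if best is None or r2 > best[1]:
                best = (cols, r2)
        cols, r2 = best
        adjusted = 1 - (1 - r2) * (n - 1) / (n - k - 1)
        results[k] = ([names[c] for c in cols], r2, adjusted)
    return results


# Hypothetical data: (Age in years, kilometer in thousands, powerPS).
names = ["Age", "kilometer", "powerPS"]
X = [(1, 20, 150), (3, 50, 90), (5, 90, 110), (8, 150, 75),
     (10, 125, 60), (12, 150, 140), (15, 100, 100), (18, 150, 55)]
# Price built from Age and powerPS only, so the best pair is known in advance.
y = [10000 - 500 * age + 20 * power for age, _, power in X]

results = best_subsets(X, y, names)
for k, (subset, r2, adjusted) in results.items():
    print(k, subset, round(r2, 4), round(adjusted, 4))
```

On real listings the R-squared values would be well below 1, and criteria such as Cp and BIC (shown in the figure above) penalize model size differently from adjusted R-squared, so they need not agree on the preferred subset.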
Random Forest Model: We then built a random forest regression model on the dataset. We divided the dataset into a test set (20%) and a train set (80%) and fitted the random forest model on the train set. The model was built with the formula sqrt_price ~ . - abtest - brand - vehicleType - model - dateCreated on Used_cars_data, i.e. the square root of price modelled on all remaining columns except those listed. The random forest model achieved an overall accuracy of 82.2%.

Fig. Variable importance plot

Research Question 2: Determine different types of vehicles based on criteria such as brand, price, and powertrain specifications by using classification models.

Model: Random forest.
Response variable: vehicleType
Predictor variables: powerPS, price, brand, model

To predict the vehicle type, we fit a random forest classifier using the predictor variables powerPS, brand, model, and price. We divided the dataset into a test set (20%) and a train set (80%) and fitted the random forest model on the train set. The model was built with the formula vehicleType ~ powerPS + brand + model + price. The random forest model achieved an overall accuracy of 76.91% on the test data, meaning that it correctly classified 76.91% of the test vehicles by type.
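The evaluation procedure described above (an 80/20 hold-out split, then overall accuracy read off a confusion matrix) can be sketched as follows. This Python snippet illustrates only the split and the accuracy calculation, not the random forest itself. The labels and predictions are made up, and the fixed seed is an assumption added for reproducibility.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out the given fraction as a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def confusion_matrix(actual, predicted):
    """Counts keyed by (actual label, predicted label)."""
    counts = {}
    for a, p in zip(actual, predicted):
        counts[(a, p)] = counts.get((a, p), 0) + 1
    return counts


def accuracy(matrix):
    """Share of observations on the diagonal of the confusion matrix."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())


# 100 placeholder row ids stand in for the listings.
rows = list(range(100))
train, test = train_test_split(rows)

# Hypothetical test-set labels and model predictions.
actual = ["suv", "suv", "bus", "coupe", "bus", "suv"]
predicted = ["suv", "bus", "bus", "coupe", "bus", "suv"]
matrix = confusion_matrix(actual, predicted)
acc = accuracy(matrix)
print(len(train), len(test), round(acc, 4))
```

Overall accuracy hides per-class behaviour. When vehicle types are unevenly represented, the off-diagonal cells of the confusion matrix show which types are being confused with one another.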
Fig. Confusion matrix for the random forest model

Research Question 3: Identify distinct groups (clusters) within the used car dataset based on attributes such as mileage, price, power, notRepairedDamage, and age, to provide insights into the condition of the car.

Model: K-means clustering

K-means clustering was performed on the dataset after scaling the data, with 4 cluster centers and 25 random initial cluster assignments. The elbow method was used to determine the optimal number of clusters by evaluating the within-cluster sum of squares (WSS) for different values of K.
Fig. Elbow method WSS

The clustering was performed, and the data was divided into 4 clusters. Using the fviz_cluster function from the factoextra package, we visualized the clustering results from our kmeans model. We then assigned the cluster labels returned by the kmeans function to a new column, clustering_vector, in the Used_cars_data dataframe.
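The k-means procedure described above (scale the data, run several random starts, keep the run with the lowest WSS, and compare WSS across values of K) can be sketched in plain Python. This is an illustrative reimplementation on a handful of made-up (price, powerPS) points, not the R kmeans call used in the project. The function names and the seed are assumptions.

```python
import random


def standardize(rows):
    """Scale each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [(sum((v - m) ** 2 for v in c) / (len(c) - 1)) ** 0.5
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(rows, k, rng, max_iter=100):
    """One k-means run from k randomly chosen rows; returns (labels, WSS)."""
    centers = rng.sample(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centers[j]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [r for r, l in zip(rows, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    wss = sum(sq_dist(r, centers[l]) for r, l in zip(rows, labels))
    return labels, wss


def kmeans(rows, k, nstart=25, seed=0):
    """Best of nstart random starts, judged by within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = [kmeans_once(rows, k, rng) for _ in range(nstart)]
    return min(runs, key=lambda run: run[1])


# Hypothetical (price, powerPS) pairs forming two obvious groups.
cars = [(1500, 60), (1800, 75), (2200, 70), (2000, 65),
        (15000, 190), (16500, 210), (14000, 180), (17000, 200)]
scaled = standardize(cars)

# WSS for each candidate K; the "elbow" is where the drop levels off.
wss_by_k = {k: kmeans(scaled, k)[1] for k in range(1, 5)}
labels, _ = kmeans(scaled, 2)
print({k: round(w, 3) for k, w in wss_by_k.items()}, labels)
```

Scaling matters here because price and mileage are measured in thousands while the binary columns are 0 or 1. Without it, the largest-valued columns would dominate the distance calculation.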
We then created visualizations to analyse the clusters. First, we created a scatter plot of the price variable against the power variable, with the clusters as the legend. Next, we created a density plot of the kilometer (mileage) variable with the clusters as the legend, followed by a density plot of the Age variable with the clusters as the legend.
Similarly, we created a box plot of the notRepairedDamage variable, as in the image below.
The following was observed from the dataset:

Cluster 1: Size: 23,490. Attributes: low price, slightly lower powerPS, high kilometer, low unrepaired damage, older age.
Cluster 2: Size: 50,662. Attributes: higher price, higher powerPS, moderate kilometer, high unrepaired damage, moderate age.
Cluster 3: Size: 117,921. Attributes: lower price, lower powerPS, moderate kilometer, moderate unrepaired damage, moderate age.
Cluster 4: Size: 41,491. Attributes: higher price, moderate powerPS, low kilometer, moderate unrepaired damage, slightly younger age.

Using the cluster analysis above and the clustering_vector column in the Used_cars_data dataframe, the dataset can be used to analyse patterns and relationships within and between clusters. The clustering vector can serve as a categorical variable for further predictive modeling or classification tasks, or as an input to other algorithms that explore relationships between clusters and other variables.
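Cluster profiles like the ones listed above are typically obtained by grouping rows on the cluster label and summarizing each group. The sketch below shows one way to do that in Python. The rows and labels are made up, and the choice of the mean as the summary statistic is an assumption, since the report does not state how the attribute levels were derived.

```python
from collections import defaultdict
from statistics import mean


def summarize(rows, labels, columns):
    """Per-cluster size and column means, keyed by cluster label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: {"size": len(members),
                **{c: mean(m[c] for m in members) for c in columns}}
        for label, members in sorted(groups.items())
    }


# Hypothetical listings with a cluster label for each row.
cars = [
    {"price": 1200, "kilometer": 150000, "Age": 18},
    {"price": 1800, "kilometer": 125000, "Age": 14},
    {"price": 9000, "kilometer": 60000, "Age": 6},
    {"price": 11000, "kilometer": 40000, "Age": 4},
    {"price": 10000, "kilometer": 50000, "Age": 5},
]
clustering_vector = [1, 1, 2, 2, 2]

summary = summarize(cars, clustering_vector, ["price", "kilometer", "Age"])
for label, stats in summary.items():
    print(label, stats)
```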
CONCLUSION: In conclusion, a detailed assessment of the "Used_Cars_Cleaned.csv" dataset has produced important findings about trends in the used automobile trade. The linear regression analysis showed that the older a vehicle is, the lower its price, reflecting depreciation. This finding could help prospective purchasers: older cars normally offer more dollar-for-mile value, subject to their maintenance status and condition. Using classification models that considered characteristics such as power, price, and brand, we were able to predict the likely type of a car. Identifying the distinguishing features of different car classes is also important because it helps sellers classify cars and helps buyers choose the type they need. In addition, vehicle characteristics significantly affected market prices. Fuel type, gearbox, and brand were among the attributes that strongly influenced both how appealing a particular used car was and how it was priced. Every potential buyer should therefore consider a broad set of factors when evaluating the worth of a used car. The study has not only broadened understanding of the used car industry but has also offered practical guidelines for various interest groups. It demonstrates the significance of a data-driven approach to decision making, whether in buying, selling, or otherwise taking part in the auto industry.
REFERENCES: [1] "Used Cars Data - dataset by data-society," data.world. https://data.world/data-society/used-cars-data