School: George Mason University
Course: 515
Subject: Statistics
Date: Jan 9, 2024

TITLE: Analyzing Pricing, Characteristics and Clustering of Used Cars Dataset
STAT 515 - Dr. Isuru Dassanayake
FINAL RESEARCH PROJECT BY Aditya Paravastu Laxma Reddy Soumith Alloju

INTRODUCTION: The used car market makes up a large part of today's automotive industry and gives customers a very wide range of choices. It also plays an integral role in the life cycle of a vehicle. Many interacting factors influence this market, including consumer preferences, economic conditions, and car characteristics. Every stakeholder (consumers, sellers, manufacturers, and legislators) therefore needs to understand its underlying mechanisms. Prospective buyers want to make informed choices, manufacturers want to understand the nature and scope of change in the market, and sellers need to gauge shifts in consumer needs and respond accordingly. Such research may also guide lawmakers in formulating statutes and standards that protect car buyers' interests while keeping the used car market sustainable.

ABOUT: The study examines the used car market using an extensive data set called “Used_Cars_Cleaned.csv”. Its variables include the car's price, model, brand, year, mileage, and condition. Seller type can affect a vehicle's price, condition, and reliability, so it is essential to account for it when analysing market dynamics. The price column offers insight into how vehicle characteristics relate to consumers' willingness to pay. The ‘gearbox’ column gives the type of transmission, coded manual=0 and automatic=1. The mileage column is an important indicator of vehicle use and is closely tied to a car's condition, age, and value. The fuel type column records which fuel the car uses, coded petrol=0 and diesel=1. The NotRepairedDamage column records whether the car has unrepaired damage (coded yes=0, no=1), which plays a crucial role in buyers' decisions.

These variables matter because they help determine the marketability and resale value of second-hand automobiles. We therefore try to identify which factors explain differences in the price of, and demand for, second-hand cars. Using statistical analysis and data visualization, we try to show how specific characteristics of a car influence its asking price. The factors considered include the car's age, mileage, brand value, car type, and condition. Our study also examines how variables such as economic changes and regional tastes and preferences influence the second-hand vehicle business. The following sections describe the methods and strategy selected for the data analysis, summarize the research findings, and discuss their implications.
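The binary codes described above can be illustrated with a minimal Python sketch. The numeric codes are the ones stated in the text. The column names (`gearbox`, `fuelType`, `notRepairedDamage`), the English labels, and the helper `encode_row` are assumptions made for illustration, not the project's actual code.

```python
# Label-to-code maps. The codes come from the text; the column names and
# English labels are assumed for illustration.
ENCODINGS = {
    "gearbox": {"manual": 0, "automatic": 1},
    "fuelType": {"petrol": 0, "diesel": 1},
    "notRepairedDamage": {"yes": 0, "no": 1},
}


def encode_row(row):
    """Return a copy of `row` with the categorical columns replaced by codes.

    Labels that have no code in ENCODINGS (for example another fuel type)
    are mapped to None so they can be handled explicitly later.
    """
    encoded = dict(row)
    for column, codes in ENCODINGS.items():
        if column in encoded:
            encoded[column] = codes.get(encoded[column])
    return encoded


# A made-up listing, not taken from the data set.
example = {
    "brand": "volkswagen",
    "gearbox": "manual",
    "fuelType": "diesel",
    "notRepairedDamage": "no",
}
encoded_example = encode_row(example)
print(encoded_example)
```

Mapping unknown labels to None rather than to a default code keeps unexpected categories visible instead of silently merging them into one of the two classes.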

RESEARCH OBJECTIVES:
1. To fit a regression model of the relationship between car prices and car age. This analysis helps us identify market trends and find the best value.
2. To predict vehicle type using classification methods (e.g., Random Forest, decision trees, support vector machines (SVM)) from indicators such as brand, price, and power.

METHODOLOGY:

DATA CLEANING:

Cleaning the date columns: The data contained multiple date columns, which in most cases held the same value. Because the analysis does not depend on dates, all date columns except the date created (that is, date crawled and last seen) were dropped.

Cleaning the year of registration column: This column contained years ranging from 1000 to 9999, which is implausible. Only registration years between 1980 and 2023 were kept, as this range seems reasonable.

Converting German to English: The data set consists of German eBay listings, so some columns contained German words. These were translated to English to the best of our understanding.

Cleaning the powerPS column: Some vehicles in the data had a listed power above 1000 HP. For context, a Lamborghini Aventador, a supercar, has about 780 hp, yet the vehicles in question were not supercars, so these values are erroneous. A realistic range of horsepower is 25-800 PS.

Cleaning the price column: The density plot of price shows that the data is skewed, has many outliers, and requires cleaning. The price distribution in the data is as follows:
No. of cars with a price lower than 1K: 63690
No. of cars with a price higher than 1K: 259641
No. of cars with a price higher than 10K: 57144
No. of cars with a price higher than 20K: 16008
No. of cars with a price higher than 30K: 5356
No. of cars with a price higher than 40K: 2369
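The cleaning rules above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the project's actual code. The year and power ranges are the ones given in the text. The column names (`dateCrawled`, `lastSeen`, `dateCreated`, `yearOfRegistration`, `powerPS`, `price`), the inclusive bounds, and the helper names are assumed. The sample rows are made up, and it is not known whether the reported price counts were computed before or after the other filters.

```python
# Ranges taken from the text; treating both ends as inclusive is an assumption.
YEAR_RANGE = (1980, 2023)
POWER_RANGE_PS = (25, 800)
# Assumed names of the redundant date columns that were dropped.
DROPPED_DATE_COLUMNS = ("dateCrawled", "lastSeen")


def clean_rows(rows):
    """Keep rows with a plausible year and power, minus the dropped dates."""
    cleaned = []
    for row in rows:
        year_ok = YEAR_RANGE[0] <= row["yearOfRegistration"] <= YEAR_RANGE[1]
        power_ok = POWER_RANGE_PS[0] <= row["powerPS"] <= POWER_RANGE_PS[1]
        if year_ok and power_ok:
            cleaned.append(
                {k: v for k, v in row.items() if k not in DROPPED_DATE_COLUMNS}
            )
    return cleaned


def count_prices_above(rows, thresholds=(1_000, 10_000, 20_000, 30_000, 40_000)):
    """Count the listings priced strictly above each threshold."""
    return {t: sum(1 for row in rows if row["price"] > t) for t in thresholds}


def make_row(price, year, power):
    """Build a made-up listing with placeholder date values."""
    return {
        "price": price,
        "yearOfRegistration": year,
        "powerPS": power,
        "dateCrawled": "placeholder",
        "dateCreated": "placeholder",
        "lastSeen": "placeholder",
    }


# Illustrative rows only; none of these values come from the data set.
sample = [
    make_row(500, 1999, 75),
    make_row(12_500, 2012, 140),
    make_row(3_000, 9999, 90),     # impossible registration year
    make_row(45_000, 2015, 1500),  # implausible power
]

cleaned = clean_rows(sample)
price_counts = count_prices_above(cleaned)
print(len(cleaned), price_counts)
```

Counting listings above several thresholds, as in the list above, is a quick way to see how heavy the right tail of the price distribution is before choosing a cut-off for outliers.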
