DSCI 4520 - Assignment 3

docx

School

University of North Texas *

*We aren’t endorsed by this school

Course

4520

Subject

Computer Science

Date

Feb 20, 2024

Type

docx

Pages

10

Uploaded by GrandPartridge3978

Report
DSCI 4520 – Introduction to Data Mining Spring 2023 Assignment 3 Name: Kamala Priya Lingam Student ID: 11441832 Course Section: 001 A total of 1 question with 7 sections, for 60 points. This assignment is based on Question 6.3 from the course textbook. To answer the questions you need to use data in “Airfare.xlsx” file and Analytic Solver for Excel. Submission: Submit this file and the Excel file in which build the models and did the calculations.
Predicting Airfare on New Routes The following problem takes place in the United States in the late 1990s, when many major US cities were facing issues with airport congestion, partly as a result of the 1978 deregulation of airlines. Both fares and routes were freed from regulation, and low-fare carriers such as Southwest (SW) began competing on existing routes and starting nonstop service on routes that previously lacked it. Building completely new airports is generally not feasible, but sometimes decommissioned military bases or smaller municipal airports can be reconfigured as regional or larger commercial airports. There are numerous players and interests involved in the issue (airlines, city, state and federal authorities, civic groups, the military, airport operators), and an aviation consulting firm is seeking advisory contracts with these players. The firm needs predictive models to support its consulting service. One thing the firm might want to be able to predict is fares, in the event a new airport is brought into service. The firm starts with the file Airfares.csv, which contains real data that were collected between Q3-1996 and Q2- 1997. The variables in these data are listed in Table 6.11, and are believed to be important in predicting FARE. Some airport-to-airport data are available, but most data are at the city-to-city level. One question that will be of interest in the analysis is the effect that the presence or absence of Southwest has on FARE.
a. Explore the numerical predictors and target (FARE) by creating a correlation table and examining some scatterplots between FARE and those predictors. What seems to be the best single predictor of FARE? (8 points) Answer: Distance seems to be the best single predictor of FARE S_INCOM E E_INCOM E S_PO P E_PO P NEW HI DISTANC E PAX COUPO N FARE S_INCOM E E_INCOM E -0.14 S_POP 0.52 -0.14 E_POP -0.27 0.46 -0.28 NEW 0.03 0.11 -0.02 0.06 HI -0.03 0.08 -0.17 -0.06 0.05 DISTANCE 0.03 0.18 0.02 0.12 0.08 -0.31 PAX 0.14 0.26 0.28 0.31 0.01 -0.17 -0.10 COUPON -0.09 0.05 -0.11 0.09 0.02 -0.35 0.75 -0.34 FARE 0.21 0.33 0.15 0.29 0.09 0.03 0.67 -0.09 0.50
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
b. Explore the relationship between the categorical predictors (excluding the first four) and the target (FARE). For each categorical variable create a pivot table/chart with the average fare in each category. Which categorical predictor seems best for predicting FARE? (10 points) Answer: SW seems to be the best one for predicting FARE among categorical predictors.
c. Find a model for predicting the average fare on a new route: use forward elimination regression to reduce the number of predictors. Partition the data into training (60%) and validation (40$) sets. The model will be fit to the training data and evaluated on the validation set. You can ignore the first four predictors (S_CODE, S_CITY, E_CODE, E_CITY). Report the estimated model selected. (10 points) Answer: Predictor Estimate Intercept 137.80955 S_INCOME 4.5971059 E_INCOME 8.6336057 S_POP 13.461936 E_POP 11.676874 NEW - 1.4042297 HI 15.642481 DISTANCE 48.99318 PAX - 11.055972 COUPON 2.0633133 VACATION_No 33.749147 VACATION_Yes 0 SW_No 35.107694 SW_Yes 0 SLOT_Controlled 0 SLOT_Free - 17.459633 GATE_Constrained 0 GATE_Free -17.39758
d. Find another model for predicting the average fare using exhaustive search method. Report the estimated model selected. (10 points) Coefficients Predictor Estimate Intercept 137.8095 5 S_INCOME 4.597105 9 E_INCOME 8.633605 7 S_POP 13.46193 6 E_POP 11.67687 4 NEW - 1.404229 7 HI 15.64248 1 DISTANCE 48.99318 PAX - 11.05597 2 COUPON 2.063313 3 VACATION_No 33.74914 7 VACATION_Yes 0 SW_No 35.10769 4 SW_Yes 0 SLOT_Controlled 0 SLOT_Free - 17.45963 3 GATE_Constraine d 0 GATE_Free -17.39758
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
e. Compare models c and d by reporting RMSE of both training and validation sets. Do you see a sign of over-fitting? (10 points) Model c training RMSE: 35.26 Model c Validation RMSE: 35.46 Model d training RMSE: 35.26 Model d Validation RMSE: 35.46 Since RMSE of the training and validation sets for both models are close, there is no sign of over- fitting.
f. sing model c, predict the average fare on a route with the following characteristics: (6 points) COUPON = 1.202, NEW = 3, VACATION = No, SW = No, HI = 4442.141, S_INCOME = $28,760, E_INCOME = $27,664, S_POP = 4,557,004, E_POP = 3,195,503, SLOT = Free, GATE = Free, PAX = 12,782, DISTANCE = 1976 miles FARE= 4.5971059*28760+ 8.6336057*27664+13.461936*4557004+11.676874*3195503- 1.4042297*3+15.642481*4442.141+48.99318*1976- 11.055972*12782+2.0633133*1.202+33.749147*1+35.107694*1-17.459633*1-17.39758*1 Predicted Average Fare = 99055646.4 Predictor Estimate Intercept 137.80955 S_INCOME 4.5971059 E_INCOME 8.6336057 S_POP 13.461936 E_POP 11.676874 NEW - 1.4042297 HI 15.642481 DISTANCE 48.99318 PAX - 11.055972 COUPON 2.0633133 VACATION_No 33.749147 VACATION_Yes 0 SW_No 35.107694 SW_Yes 0 SLOT_Controlled 0 SLOT_Free - 17.459633 GATE_Constrained 0 GATE_Free -17.39758
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
g. How much does the predicted fare change, if Southwest decides to cover the route in section f? (6 points) COUPON = 1.202, NEW = 3, VACATION = No, SW = Yes, HI = 4442.141, S_INCOME = $28,760, E_INCOME = $27,664, S_POP = 4,557,004, E_POP = 3,195,503, SLOT = Free, GATE = Free, PAX = 12,782, DISTANCE = 1976 miles FARE= 4.5971059*28760+ 8.6336057*27664+13.461936*4557004+11.676874*3195503- 1.4042297*3+15.642481*4442.141+48.99318*1976- 11.055972*12782+2.0633133*1.202+33.749147*1+0*1-17.459633*1-17.39758*1 New Predicted Average Fare: 99055611.3 Change = 99055646.4-99055611.3 = 35.1