Assignment 7

Problem 1

Tableau Worksheet 3: Note the instructions that each person in the group should submit a file.

Part I: Getting data and starting up

1. Get data: Go to the website linked above. It should look a bit different from the screenshot above: to select the data fields, one now has to select Download under Data Tools (on the left of the screen). Select this. On that page, make sure you set the filters at the top to the latest data that is available. One month of data is fine, but you may choose to use more data for your analysis. Use the most recent month: note "Latest Available Data." (Tableau has some merge capabilities if multiple downloads are needed, or you can merge in another program, perhaps even Excel.)

Next, from the data fields on that page, ensure that you select at least the following. You can certainly select more, particularly if you think they will be helpful in generating additional insights in the open-ended portions of this assignment, but selecting the fields below is essential for completing the exercise.

a. FlightDate (Time Period)
b. Reporting_Airline
c. OriginCityName, OriginState
d. DestCityName, DestState
e. DepDel15
f. ArrDel15
g. AirTime, Distance
h. CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay

Make sure you understand all these data fields; you may need to do your own external research for this. In particular, the airline codes may be a little odd, and you may need to do some web searching to sort this out. Please also see the Table Profile for the "Reporting Carrier On-Time Performance (1987–present)" database, linked here: https://transtats.bts.gov/ : Data Finder : By Mode : Aviation : Airline On-Time Performance Data : On-Time

2. Open the data set in Tableau, and create the extract as described in Worksheet 1 (not live data). This may take a few minutes, but then you won't have to do it again. Do not attempt to open the data set in Excel; you will die waiting. Take a quick look at the data, and examine which fields are measures and which are dimensions. Make sure you understand how certain fields like DepDel15 are reported. You may occasionally need to move fields from Dimensions to Measures and vice versa.
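Before building the extract, it can help to sanity-check the raw download outside Tableau. Below is a minimal pandas sketch, assuming the BTS download was saved under the hypothetical name On_Time_Reporting.csv; it confirms which fields parsed as numbers and how the DepDel15 indicator is coded.

```python
import pandas as pd

# "On_Time_Reporting.csv" is a placeholder for whatever the BTS download is named.
df = pd.read_csv("On_Time_Reporting.csv", low_memory=False)

print(df.shape)            # rows x columns
print(df.dtypes)           # which fields parsed as numbers vs. strings

# DepDel15 / ArrDel15 are 0/1 indicators (1 = delayed 15+ minutes;
# blank/NaN typically means the flight was cancelled or diverted).
print(df["DepDel15"].value_counts(dropna=False))
```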
Part II: Directed analysis

3. How many total flights are in the data set? Classify the flights according to those that were delayed by at least 15 minutes on departure, delayed by at least 15 minutes on arrival, or both. Use the DepDel15 and ArrDel15 fields for this. Depending on the visual you choose, you may need to change those fields to Dimensions.

The dataset includes a total of 602,987 flights. Of these, 128,439 flights were delayed by at least 15 minutes on arrival and 131,733 were delayed by at least 15 minutes on departure. This classification shows that a significant portion of the total flights experienced delays on departure, on arrival, or both.

4. Which of the five types of delay are most frequent? Does the effect of different types of delay change over the different days in the most recent November? For this, you will need to use the FlightDate field; click on the plus sign next to it so it expands to the granularity of days.

The most frequent of the five delay types is Late Aircraft Delay, with a total count of 3,968,456 in 2023. To assess whether the effect of the different delay types changes over the days of the most recent November, the delays must be broken down by day within November. The view described above aggregates totals for the entire year of 2023, not November specifically; to answer this question accurately, the dataset should be filtered to November only and then segmented by day to see how the counts for each delay type fluctuate during that month (see the sketch below).
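As a cross-check outside Tableau, here is a pandas sketch of both steps: the DepDel15/ArrDel15 classification and the per-day November breakdown of the five delay causes. It assumes the same hypothetical On_Time_Reporting.csv as above, with the column names from step 1.

```python
import pandas as pd

df = pd.read_csv("On_Time_Reporting.csv", low_memory=False)

# Q3: classify flights by the two 0/1 indicators (margins give row/column totals).
print(len(df))  # total flights
print(pd.crosstab(df["DepDel15"], df["ArrDel15"], margins=True))

# Q4: total delay recorded under each cause, by day, within November only.
delay_cols = ["CarrierDelay", "WeatherDelay", "NASDelay",
              "SecurityDelay", "LateAircraftDelay"]
df["FlightDate"] = pd.to_datetime(df["FlightDate"])
nov = df[df["FlightDate"].dt.month == 11]
daily = nov.groupby(nov["FlightDate"].dt.day)[delay_cols].sum()
print(daily)
```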
5. Next, look at how different types of delays differentially affect the different airlines. Briefly describe what, if anything, you find noteworthy.

This view breaks down the different types of delays (Carrier Delay, Late Aircraft Delay, NAS Delay, Security Delay, and Weather Delay) by airline (Op Unique Carrier). Noteworthy observations:

Late Aircraft Delay prevalence: For most airlines, late aircraft delays are among the highest, suggesting that issues related to the aircraft itself significantly impact flight schedules. For instance, American Airlines (AA), Delta (DL), and Southwest (WN) exhibit notably high late aircraft delay counts, implying potentially shared aircraft-related problems across these carriers.

NAS Delay variations: NAS (National Aviation System) delays show moderate variation across airlines, indicating that factors beyond an individual airline's control, such as the air traffic system managed by the Federal Aviation Administration (FAA), contribute to these delays. Southwest (WN) and American Airlines (AA) display comparatively higher NAS delay counts.

Carrier and Weather Delay differences: Carrier delays vary considerably among airlines; American Airlines (AA), Delta (DL), Southwest (WN), and United Airlines (UA) show relatively higher counts. Conversely, Security and Weather delays show nominal counts across airlines, suggesting these issues are less frequent or less impactful for most carriers.

Varied Security Delay incidents: Security delays, though generally low in frequency across the airlines, show some variability. This could be due to differing operational security protocols or issues specific to certain airports or routes.

Unique operational challenges: Each airline faces its own operational challenges, which can be discerned from its pattern of delay types. For example, some carriers might face more weather-related problems based on their flight routes or operational hubs.

Understanding these differences among carriers can help airlines focus on specific areas of improvement and tailor their strategies to reduce delays and enhance overall flight punctuality.
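The per-airline comparison behind these observations can be reproduced with a single groupby. A sketch, again assuming the hypothetical On_Time_Reporting.csv:

```python
import pandas as pd

# Total delay recorded under each cause, per reporting airline.
df = pd.read_csv("On_Time_Reporting.csv", low_memory=False)
delay_cols = ["CarrierDelay", "WeatherDelay", "NASDelay",
              "SecurityDelay", "LateAircraftDelay"]

per_airline = (df.groupby("Reporting_Airline")[delay_cols].sum()
                 .sort_values("LateAircraftDelay", ascending=False))
print(per_airline.head(10))
```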
6. Let's look at which airports are the busiest as well as most often delayed in terms of airline departures. To do this, you will need to be able to mark the cities on the map. Unfortunately, Tableau runs into a small problem: the data has city names specified as "City, State Abbreviation", whereas Tableau prefers just the city name. Try generating a map that uses city names and you will see the problem. To fix this, do the following. Create a new calculated field that uses only the city name, not the state abbreviation. For this, you will need the split command in Tableau. That is, create a new calculated field, giving it some name that makes sense to you, defined as "split([Origin City Name], ',', 1)", without the double quotes. This command splits the text in OriginCityName by commas and takes the first piece. Change the geographic role of this new field to City. Create a similar field for DestCityName. Now, create a map that uses the city names, and shows both the number of flights and the fraction of flights delayed for each city. You may still have some cities that are unrecognized; ignore those, as long as you have most major US airports clearly visible on your map.

The map shows the locations of these airports' cities, with color indicating the average delay in flight departures and size representing the number of flights at each airport. A filter allows viewing flight delays at either the origin or the destination.
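For reference, the same city-name extraction can be checked outside Tableau. A pandas sketch of what split([Origin City Name], ',', 1) does, using the hypothetical file from earlier; the new column names are illustrative:

```python
import pandas as pd

df = pd.read_csv("On_Time_Reporting.csv", low_memory=False)

# Equivalent of Tableau's split(..., ',', 1): take the text before the
# first comma in "City, ST" and trim whitespace.
df["OriginCity"] = df["OriginCityName"].str.split(",").str[0].str.strip()
df["DestCity"] = df["DestCityName"].str.split(",").str[0].str.strip()

print(df[["OriginCityName", "OriginCity"]].drop_duplicates().head())
```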
Part III: Open-ended analysis

7. Using a single visual, present as much insight as you can about one specific airline: pick whichever airline you like. Things you can consider are: its network, on-time performance, flight distances, flight lengths, and types of delay. You obviously cannot consider all of those, because if you do then your visual will become too cluttered. So explore and then present the most compelling findings using a single visual.

The bubble chart I created visualizes Frontier Airlines (Op Unique Carrier: F9), broken down by destination state (Dest State Nm), using several key performance indicators:

X-axis (horizontal): Sum of Distance (total flight distance for each destination).
Y-axis (vertical): Sum of Arrival Delay (total delay upon arrival for each destination).
Size of the bubbles: the average Late Aircraft Delay, where larger bubbles indicate a higher average late aircraft delay.
Color of the bubbles: the destination state (Dest State Nm).

This visualization offers multiple insights:

Arrival delay vs. flight distance: each bubble's placement shows how the total distance flown relates to the total delay upon arrival, covering destinations with higher distances and varying degrees of delay.
Bubble size (average Late Aircraft Delay): the size of each bubble allows comparison of the average late aircraft delay across destinations; larger bubbles indicate higher averages.
Destination states (color): distinct colors represent specific states, making it easier to identify states and compare their delays and distances traveled.

Together, the chart shows how Frontier Airlines performs across different destinations, presenting the total distances flown, the arrival delays, and each state's average late aircraft delay in a single comprehensive view of performance across locations.
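A rough matplotlib sketch of a chart with these encodings is below. It assumes the arrival-delay-minutes field (ArrDelayMinutes, a standard BTS field, though not in the minimum list from step 1) was also included in the download; column and file names are otherwise the same hypothetical ones used above.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("On_Time_Reporting.csv", low_memory=False)
f9 = df[df["Reporting_Airline"] == "F9"]  # Frontier only

# Aggregate per destination state; ArrDelayMinutes is an assumed extra field.
agg = f9.groupby("DestState").agg(
    total_distance=("Distance", "sum"),
    total_arr_delay=("ArrDelayMinutes", "sum"),
    avg_late_aircraft=("LateAircraftDelay", "mean"),
).reset_index()
agg["avg_late_aircraft"] = agg["avg_late_aircraft"].fillna(0)

plt.scatter(agg["total_distance"], agg["total_arr_delay"],
            s=agg["avg_late_aircraft"] * 5,  # bubble size ~ avg late-aircraft delay
            alpha=0.6)
for _, row in agg.iterrows():
    plt.annotate(row["DestState"], (row["total_distance"], row["total_arr_delay"]))
plt.xlabel("Total distance flown")
plt.ylabel("Total arrival delay (minutes)")
plt.title("Frontier (F9) by destination state")
plt.show()
```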
8. Generate one more visual that uses external data: either underlying data included with Tableau, or some other data you pull from elsewhere. The easiest, of course, is to first generate a map and then include underlying map data like you did in earlier exercises, but you can do whatever you want. As usual, creativity and originality will be rewarded. Here's an example of connecting external data, although you don't have to do this. Go to https://transtats.bts.gov/ : Data Finder : By Mode : Aviation and download data from one of the databases there. One of these databases may have passenger information, for example. Try to match the data you have already loaded, using, for example, the same identifiers for origin and destination airports that you loaded in step 1 of this exercise. Save as .csv. In Tableau, click the Tableau icon button at the top left; this takes you back to the opening screen. Connect to the .csv file you just saved. Also drag in the original .csv file you've been working with, so that both .csv files are listed in the box with a little join symbol between them. Click Inner Join, and join the two data sets. Next go to the main Tableau worksheet; on the left pane, you should see data from both sources listed. Try generating visuals that have data from both sources. For example, can you generate a chart that shows any relationship between number of passengers and minutes of delay experienced?

My chart visualizes the distribution of flight traffic across various locations, providing insight into the concentration of flights and passengers at different geographical points. The markers let viewers easily identify and compare the volume of air traffic and passenger load among locations, and a filter provides the option to view the load at either the flight's origin city or its destination.
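The inner join described above can also be sketched in pandas. Everything here is hypothetical: Passengers.csv stands in for whatever second BTS download is used, and it is assumed to share the city-name identifiers and carry a Passengers column.

```python
import pandas as pd

flights = pd.read_csv("On_Time_Reporting.csv", low_memory=False)
pax = pd.read_csv("Passengers.csv")  # placeholder name for the second download

# Inner join on shared origin/destination identifiers, as in Tableau.
joined = flights.merge(pax, how="inner",
                       on=["OriginCityName", "DestCityName"])

# Example question from the prompt: do busier city pairs accumulate more
# late-aircraft delay minutes?
by_pair = (joined.groupby(["OriginCityName", "DestCityName"])
                 [["Passengers", "LateAircraftDelay"]].sum())
print(by_pair.corr())
```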
Problem 2

Chemical Features of Wine. Figure 4.17 (below and also on p. 102 of your textbook) shows the PCA output for analyses conducted on normalized data (correlations) and non-normalized data (covariances). Each variable represents a chemical characteristic of wine, and each case is a different wine.

a) The data are in the file Wine.jmp. Consider the variances in the columns labeled Eigenvalue for the principal components analysis conducted using covariances. Why is the variance for the first principal component so much larger than the others?

In principal component analysis (PCA), the objective is to derive new variables (z1, z2, ..., zp) that are weighted combinations of the original variables, constructed so that each pair of the new variables is uncorrelated. The resulting variables are ordered by variance, with z1 having the largest variance and zp the smallest. The variance of the first principal component is so much larger than the rest because, in this covariance-based analysis, the first component accounts for the vast majority of the total variability in the data: approximately 99.8% of the total variability among the wines analyzed. The first principal component therefore captures nearly all of the variation present in the chemical features of the wines, making it the dominant factor in this dataset.

b) Comment on the use of correlations versus covariances. Would the results change dramatically if PCA (in this example) were conducted on the correlations instead?

The choice between correlations and covariances can significantly affect PCA results. If PCA were conducted on correlations instead of covariances in this example, the results would change because of the differences in scaling and units among the variables.
In this case, the variables are measured in different units, and one variable, 'proline', stands out with values that are notably larger (often in the thousands) than the others. Because of this disparity in scale, 'proline' dominates the covariance-based principal components and contributes the most to the overall variability in the data. To give all variables a comparable influence, the data should be normalized: each variable is rescaled to a standard level, typically a mean of zero and a standard deviation of one. This prevents any single variable from dominating the principal components merely because of its larger scale, yielding a more balanced assessment of variability across all variables. It's worth noting that JMP Pro performs this normalization by default (i.e., it analyzes correlations) to address the varying scales among the variables when conducting PCA.
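The scale effect is easy to demonstrate in code. A sketch, assuming Wine.jmp has been exported to a hypothetical Wine.csv containing the numeric chemistry columns; scikit-learn's PCA centers but does not rescale, so raw input corresponds to the covariance analysis and standardized input to the correlation analysis:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Keep only numeric columns (drops any wine-type label column, if present).
X = pd.read_csv("Wine.csv").select_dtypes("number")

pca_cov = PCA().fit(X)                                   # covariance-based
pca_cor = PCA().fit(StandardScaler().fit_transform(X))   # correlation-based

# Covariance PC1 should dominate (driven by large-scale proline);
# correlation PCA spreads the variance across more components.
print(pca_cov.explained_variance_ratio_[:3])
print(pca_cor.explained_variance_ratio_[:3])
```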
Problem 3

University Rankings. The dataset on American college and university rankings (Colleges.jmp, in the textbook data) contains information on 1302 American colleges and universities offering an undergraduate program. For each university there are 17 measurements that include continuous measurements (e.g., tuition and graduation rate) and categorical measurements (e.g., location by state and whether it is a private or a public school).

a) Make sure the variables are coded correctly in JMP (Nominal, Ordinal, or Continuous), then use the Columns Viewer to summarize the data. Are there any missing values? How many Nominal columns are there?

| Columns | N | N Missing | N Categories | Min | Max | Mean | Std Dev | Median | Lower Quartile | Upper Quartile | Interquartile Range |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| College Name | 1302 | 0 | 1274 | . | . | . | . | . | . | . | . |
| State | 1302 | 0 | 51 | . | . | . | . | . | . | . | . |
| Public (1)/ Private (2) | 1302 | 0 | 2 | . | . | . | . | . | . | . | . |
| Math SAT | 777 | 525 | . | 320 | 750 | 506.83783784 | 67.822440706 | 500 | 459.5 | 544.5 | 85 |
| Verbal SAT | 777 | 525 | . | 280 | 665 | 461.22393822 | 58.298412857 | 457 | 422 | 492 | 70 |
| ACT | 714 | 588 | . | 11 | 31 | 22.120448179 | 2.5798989555 | 22 | 20 | 24 | 4 |
| # appli. rec'd | 1292 | 10 | . | 35 | 48094 | 2752.0975232 | 3541.9747123 | 1470 | 695.25 | 3314.75 | 2619.5 |
| # appl. accepted | 1291 | 11 | . | 35 | 26330 | 1870.6831913 | 2250.8664002 | 1095 | 554 | 2306 | 1752 |
| # new stud. enrolled | 1297 | 5 | . | 18 | 7425 | 778.88049345 | 884.57827386 | 447 | 236 | 984.5 | 748.5 |
| % new stud. from top 10% | 1067 | 235 | . | 1 | 98 | 25.671977507 | 18.312617579 | 21 | 13 | 32 | 19 |
| % new stud. from top 25% | 1100 | 202 | . | 6 | 100 | 52.35 | 20.881315688 | 50 | 36.25 | 66 | 29.75 |
| # FT undergrad | 1299 | 3 | . | 59 | 31643 | 3692.665127 | 4544.8478973 | 1812 | 965 | 4540 | 3575 |
| # PT undergrad | 1270 | 32 | . | 1 | 21836 | 1081.5267717 | 1672.202912 | 472 | 131 | 1314.75 | 1183.75 |
| in-state tuition | 1272 | 30 | . | 480 | 25750 | 7897.2743711 | 5348.1626263 | 8050 | 2560 | 11600 | 9040 |
| out-of-state tuition | 1282 | 20 | . | 1044 | 25750 | 9276.9056162 | 4170.7708507 | 8670 | 6099.25 | 11667.5 | 5568.25 |
| room | 981 | 321 | . | 500 | 7400 | 2514.6819572 | 1150.8368485 | 2200 | 1707 | 3045 | 1338 |
| board | 804 | 498 | . | 531 | 6250 | 2060.9838308 | 661.74209947 | 1980 | 1617.75 | 2404.5 | 786.75 |
| add. fees | 1028 | 274 | . | 9 | 4374 | 392.01264591 | 469.37923417 | 264.5 | 130 | 480 | 350 |
| estim. book costs | 1254 | 48 | . | 90 | 2340 | 549.97288676 | 167.35538618 | 502 | 480 | 600 | 120 |
| estim. personal $ | 1121 | 181 | . | 75 | 6900 | 1389.2917038 | 714.24785715 | 1250 | 900 | 1794.5 | 894.5 |
| % fac. w/PHD | 1270 | 32 | . | 8 | 105 | 68.645669291 | 17.82562749 | 71 | 57 | 82 | 25 |
| stud./fac. ratio | 1300 | 2 | . | 2.3 | 91.8 | 14.858769231 | 5.18639854 | 14.3 | 11.8 | 17.6 | 5.8 |
| Graduation rate | 1204 | 98 | . | 8 | 118 | 60.405315615 | 18.889058042 | 60 | 47 | 74 | 27 |

The "N Missing" column shows the count of missing values for each column; missing values occur in almost every variable in the dataset. The nominal columns are those for which a number of categories (N Categories) is reported. There are three: College Name, State (where the college or university is located), and Public (1)/Private (2), which identifies whether the institution operates as a public or private entity.
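The same Columns Viewer checks can be sketched in pandas, assuming Colleges.jmp was exported to a hypothetical Colleges.csv:

```python
import pandas as pd

df = pd.read_csv("Colleges.csv")

print(df.isna().sum())  # missing values per column
# Text columns (College Name, State) show up as non-numeric; note that
# Public (1)/Private (2) is numerically coded but should still be treated
# as nominal, as in JMP.
print(df.select_dtypes(exclude="number").columns.tolist())
print(df.describe().T)  # min/max/mean/quartiles for the continuous columns
```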
b) Conduct a principal components analysis on the data. Recall that, by default, JMP will conduct the analysis on correlations rather than covariances. This is a good thing since, as you may see from the summary statistics, the units of measurement for the variables are different. Comment on the PCA results: specifically, can you explain the first 2 principal components? (Not every PCA analysis will be human-interpretable, so it is ok if not; just look for some explanation if possible.)
In the principal components analysis, two approaches were compared: one based on correlations and one based on covariances. When PCA was conducted on correlations, the first two principal components accounted for only 56.86% of the variation; using covariances, the first two components explained over 91.24%. The covariance-based components primarily capture the variables with the largest measurement scales (such as tuition and application/enrollment counts, which run into the thousands), and the eigenvectors for the two methods show the variables influencing the components quite differently. In summary, the choice between correlations and covariances substantially changes the PCA results: the covariance-based components explain a much larger share of the variability, but largely because they are dominated by the large-scale variables, whereas the correlation-based components weight all variables comparably.
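One way to attempt an interpretation of the first two correlation-based components is to inspect their loadings (eigenvectors): variables with large same-sign weights on a component tend to move together. A sketch, reusing the hypothetical Colleges.csv from part (a) and, for simplicity, dropping rows with missing values:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Numeric columns only; dropna is a simplification (JMP handles missing
# values differently).
df = pd.read_csv("Colleges.csv").select_dtypes("number").dropna()

# Standardizing first makes this the correlation-based analysis.
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))
loadings = pd.DataFrame(pca.components_.T, index=df.columns,
                        columns=["PC1", "PC2"])

print(loadings.sort_values("PC1"))     # which variables drive each component
print(pca.explained_variance_ratio_)   # share of variance for PC1 and PC2
```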