Data Visualization Project 3

School

Virginia Commonwealth University

Course

MISC

Subject

Computer Science

Date

Feb 20, 2024

Data Visualization Project 3 (Due on Blackboard by 11:59 pm EST on Sunday of Week 6)

Instructions: Please write your name in the space provided below and answer Questions 1 through 6. All questions are based on the content of Weeks 5-6.

Note: Bold text in the questions below indicates R code, functions, or datasets. You will not be able to run the code shown in the questions unless you have installed the relevant R packages.

Name:

Question 1: Start RStudio on your computer, open an R Script file, then type and run the R code help(economics, package = "ggplot2"). A description of the dataset will appear in the Help tab in the lower-right pane of RStudio. Look at the variable information under the Format heading, then describe the variables date, pop, and unemploy (from the Help tab) in the space below this question [3 points].

Create a subset (e.g., plotdata) from the economics dataset by first grouping by date with the group_by() function and then creating a new variable, unemploy_rate = 100*(unemploy/pop), with the mutate() function. [Hint: Use group_by() and mutate() together with the pipe operator (%>%) from the dplyr package.] Copy this piece of code from your R Script file and paste it below [2 points].

Closely follow the R code above Figure 8.2 of the Kabacoff (2023) textbook to create a time series graph with date on the x-axis and unemploy_rate on the y-axis. Replace the fields in the original code so that they reflect the data you are visualizing. For example, the title within the labs() function should be "Unemployment Rate". Run the modified code to obtain the time series graph. Copy the graph and the code from RStudio and paste them below [5 points].

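The group_by(), mutate(), and pipe pattern named in the hint for Question 1 can be sketched on toy data. This is only an illustration under stated assumptions: the data frame toy, its columns (month, cases, people), and the derived case_rate are hypothetical stand-ins, not the economics dataset or the textbook's code.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for a monthly time series
toy <- data.frame(
  month  = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  cases  = c(50, 80, 120),
  people = c(1000, 1000, 2000)
)

# Group by the date column, then derive a percentage with mutate()
toy_rates <- toy %>%
  group_by(month) %>%
  mutate(case_rate = 100 * (cases / people)) %>%
  ungroup()

# Basic time series line graph with month-year labels on the x-axis
p <- ggplot(toy_rates, aes(x = month, y = case_rate)) +
  geom_line() +
  scale_x_date(date_labels = "%b-%y") +
  labs(title = "Case Rate (toy data)", x = "", y = "Rate (%)")
p
```

The same shape of pipeline should carry over to other datasets by swapping in the relevant data frame and column names.
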
Question 2: Closely follow the R code above Figure 8.3 of the Kabacoff (2023) textbook to create a multivariate time series graph of the same variables (i.e., "Date" on the x-axis and "Closing Price" on the y-axis) for the stock prices of two companies, Apple and Amazon. However, change the date range so that the dates fall between 2022-01-01 and 2023-01-31 [Hint: Use from = "2022-01-01" and to = "2023-01-31" in the corresponding fields within the getSymbols() function]. Replace the option inside the date_format() function with "%b-%y". Adjust the limits and breaks arguments within the scale_y_continuous() function so that both line graphs are visible on the plot. Also, update the subtitle within the labs() function accordingly. After running the entire code chunk, you will see the line graphs for the two companies in the Plots tab. Copy this graph and paste it below [5 points]. Also, copy the entire R code from your R Script file and paste it below the graph [5 points].

Question 3: Section "9.1 Correlation plots" of the Kabacoff (2023) textbook contains three separate R code chunks. Your task in this question is to modify the first and third code chunks by replacing the SaratogaHouses data with the msleep data available from the ggplot2 package.

First, type help(msleep, package = "ggplot2") in an R Script file in RStudio and run it. Look at the variable descriptions in the Help tab and identify the numeric variables. List these numeric variables below [2 points].

Modify the first R code chunk so that it is based on the msleep dataset, then run it. Copy the modified R code chunk and paste it below [1 point]. After running the chunk, you will see output in the R Console window. Copy the output and paste it below [1 point]. Does the list of variables in this output match your list from the first part of this question? [1 point]

Finally, modify the third R code chunk and run it so that the resulting plot is the correlation plot of the numeric variables from the msleep dataset. Copy this plot and paste it below [2.5 points]. Also, copy the R code from your R Script file and paste it below the plot [2.5 points].

Question 4: Section "9.3 Logistic regression" of the Kabacoff (2023) textbook contains three separate R code chunks. Your task in this question is to modify the first and third code chunks by replacing the CPS85 data with the TitanicSurvival data available from the carData package.

First, type help(TitanicSurvival, package = "carData") in an R Script file in RStudio and run it. Look at the variable names in the Help tab. List these variable names along with their labels below this question [2 points].

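Questions 3 and 4 both begin by inspecting a dataset's variables. One way to cross-check what the Help tab says is to query the data frame directly. The sketch below uses the built-in iris data purely as a stand-in, and the object names num_vars and r are hypothetical; it relies on base R only rather than the textbook's code.

```r
# Variable names and types of a data frame
names(iris)
str(iris)

# Keep only the numeric columns
num_vars <- names(iris)[sapply(iris, is.numeric)]
num_vars

# Correlation matrix of the numeric columns, dropping incomplete rows
r <- cor(iris[num_vars], use = "complete.obs")
round(r, 2)
```

The use = "complete.obs" argument matters for datasets with missing values, since cor() would otherwise return NA for any pair of columns that contains one.
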
Modify the first R code chunk so that it is based on the TitanicSurvival dataset. Specifically, use survived as the response variable and place all three remaining variables on the right side of the tilde (~). Name the object titanic_glm and make sure the data argument inside the glm() function refers to the correct dataset. Run this R code chunk; you will need the resulting object in the next step. Copy the modified R code chunk and paste it below this question [2 points].

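As a reminder of the general shape of a logistic regression call, here is a minimal sketch on the built-in mtcars data. The choice of am as a 0/1 response, hp and wt as predictors, and the names toy_glm and pred_prob are assumptions made for illustration; they are not the assignment's variables or the textbook's code.

```r
# Response on the left of the tilde, predictors on the right;
# family = "binomial" requests logistic regression
toy_glm <- glm(am ~ hp + wt, family = "binomial", data = mtcars)
summary(toy_glm)

# Predicted probabilities (on the 0-1 scale) for the observed rows
pred_prob <- predict(toy_glm, type = "response")
head(round(pred_prob, 3))
```

With type = "response", predict() returns probabilities rather than log-odds, which is usually the scale wanted for plotting.
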
Finally, modify the third R code chunk and run it so that the resulting plot shows the relationship between age and the probability of survival for the two levels of sex, while controlling for passenger class. These changes must be reflected in your code and in the resulting graph. Copy this plot and paste it below [3 points]. Also, copy the R code from your R Script file and paste it below the plot [3 points].

Question 5: Section "9.4 Survival plots" of the Kabacoff (2023) textbook contains two separate R code chunks. Your task in this question is to modify both code chunks by replacing the lung data with the kidney data available from the survival package.

First, type help(kidney, package = "survival") in an R Script file in RStudio and run it. Read the description of the data in the Help tab, as you will need this information in the next part of this question. Write a brief summary of the data description and list the variable names below this question [2 points].

Then modify the first R code chunk so that it is based on the kidney dataset: substitute the correct dataset and change the title within the ggsurvplot() function, but keep the same variables within the Surv() function. [Note: The kidney dataset can be loaded using the code provided in the Usage section of the Help page.] Run the modified R code chunk. Copy the code and paste it below [2 points]. Also, copy the plot and paste it below [2 points].

Finally, modify the second R code chunk and run it so that the resulting plot shows the survival curves for the two levels of sex, based on the kidney dataset. The title and xlab within the ggsurvplot() function must be changed to reflect the data being plotted. Run the modified R code chunk to obtain the plot. Copy this plot and paste it below [2 points]. Also, copy the R code from your R Script file and paste it below the plot [2 points]. What does the p-value of 0.0039 tell you? [1 bonus point]

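For background on where a p-value in a grouped survival plot typically comes from, the sketch below fits Kaplan-Meier curves by group and runs a log-rank test on the lung data that the textbook section starts from. It uses only the survival package and base graphics, which is a swapped-in simplification rather than the ggsurvplot() code; the names km_fit, lr, and p_value are hypothetical.

```r
library(survival)

# Kaplan-Meier survival curves, one per level of sex
km_fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Log-rank test of the null hypothesis that the groups share
# the same survival curve
lr <- survdiff(Surv(time, status) ~ sex, data = lung)
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value

# Base-graphics version of the grouped survival plot
plot(km_fit, lty = 1:2, xlab = "Days", ylab = "Survival probability")
legend("topright", legend = names(km_fit$strata), lty = 1:2)
```

A small p-value from this test is generally read as evidence that the survival curves differ between the groups.
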
Question 6: This question is based on the three heatmaps shown in section "10.5 Heatmaps" of the Kabacoff (2023) textbook. For the first part of this question, copy and paste the first two R code chunks into an R Script file in RStudio and run them one after the other. You will get two separate heatmaps. Copy each heatmap along with its corresponding code from RStudio and paste them below [4 points]. How does the second heatmap differ from the first? [1 point] Which option within the superheat() function clusters the rows and/or columns based on similar characteristics? [1 bonus point]

Recreate the third heatmap using the same R code chunk shown in the textbook, but include within the superheat() function the option for clustering rows and/or columns based on similar characteristics. This is the same option that you identified in the previous question. After running the entire R code chunk, you will see the desired heatmap in the Plots tab. Copy this heatmap and paste it below [2.5 points]. Also, copy the entire R code from your R Script file and paste it below the plot [2.5 points]. Name a pair of countries that have similar characteristics [1 bonus point].

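To see what clustering rows or columns "based on similar characteristics" means in practice, the sketch below applies hierarchical clustering to the built-in mtcars data with base R. It deliberately uses heatmap() and hclust() rather than superheat(), so it illustrates the idea only; the names m, row_clust, and groups, and the choice of three groups, are assumptions.

```r
# Standardize columns so no single variable dominates the distances
m <- scale(as.matrix(mtcars))

# Hierarchical clustering of rows on Euclidean distance
row_clust <- hclust(dist(m))

# Cut the tree into 3 groups (3 is an arbitrary choice here)
groups <- cutree(row_clust, k = 3)
head(sort(groups))

# Base R heatmap() reorders rows and columns by hierarchical
# clustering by default and draws the dendrograms
heatmap(m, scale = "none")
```

Rows that end up next to each other under the same low branch of the dendrogram are the ones with the most similar profiles across the columns.
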