Data Visualization Week 5 Notes

School

Virginia Commonwealth University

Course

MISC

Subject

Economics

Date

Feb 20, 2024

Facet Wrap:

- Facet wraps are a useful way to view individual categories in their own graph.

Example 1 - Obtain a scatterplot with the ‘facet_wrap()’ function:

    library(ggplot2)
    library(gapminder)
    View(gapminder)
    ggplot(data = gapminder,
           mapping = aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) +
      geom_point(alpha = 0.7) +
      facet_wrap(~year)

This R code uses the ‘ggplot2’ package to create a scatter plot from the ‘gapminder’ dataset. The ‘gapminder’ dataset contains information about various countries, including their GDP per capita (‘gdpPercap’), life expectancy (‘lifeExp’), population (‘pop’), continent, and year.

Breakdown of the code:

    library(ggplot2)
    library(gapminder)

These lines load the necessary libraries: ‘ggplot2’ for creating plots and ‘gapminder’ for accessing the dataset.

    View(gapminder)

This command opens the ‘gapminder’ dataset in a data viewer.

    ggplot(data = gapminder,
           mapping = aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) +
      geom_point(alpha = 0.7) +
      facet_wrap(~year)

- ggplot(): Initiates the creation of a new ggplot object.
- data = gapminder: Specifies the dataset to be used, which is gapminder.
- mapping = aes(...): Defines the aesthetic mappings. Here:
  - x = gdpPercap: GDP per capita on the x-axis.
  - y = lifeExp: Life expectancy on the y-axis.
  - size = pop: The size of points is determined by the population.
  - color = continent: Points are colored based on the continent.
- geom_point(alpha = 0.7): Adds points to the plot with a transparency (alpha) of 0.7, making overlapping points more visible.
- facet_wrap(~year): Creates multiple panels, one per year. The tilde ~ introduces the variable to facet by, here year.

So, the final result is a scatter plot where each point represents a country, with GDP per capita on the x-axis, life expectancy on the y-axis, point size based on population, point color based on continent, and separate panels for each year.
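To see why faceting yields one panel per category, it can help to count the distinct values of the faceting variable first. The following is a minimal sketch and not part of the original notes: it assumes the `mpg` dataset that ships with ggplot2 (so nothing beyond ggplot2 is needed), and the names `n_panels` and `p` are only illustrative.

```r
library(ggplot2)

# mpg ships with ggplot2; `class` plays the role that `year` plays above
n_panels <- length(unique(mpg$class))
n_panels  # number of distinct classes, i.e. number of panels to expect

# One scatterplot panel per vehicle class, laid out in 3 columns
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~class, ncol = 3)
p
```

The same count applied to gapminder, `length(unique(gapminder$year))`, would tell you how many panels facet_wrap(~year) draws.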
Example 2: Using the `scale_x_log10()` function to transform gdpPercap into log10 scale:

    ggplot(data = gapminder,
           mapping = aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) +
      geom_point(alpha = 0.7) +
      facet_wrap(~year) +
      scale_x_log10()

1. ggplot() and geom_point(): These functions are the same as in the previous code and are used to set up the basic structure of the scatter plot.
2. facet_wrap(~year): This part creates separate panels for each year, as in the previous example.
3. scale_x_log10(): This function transforms the x-axis to a logarithmic scale with a base of 10. This is often used when dealing with data that spans several orders of magnitude, such as GDP per capita, to make the visualization more interpretable.
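A small numeric illustration of what a log10 scale does to values that span orders of magnitude. This is a sketch using made-up GDP per capita values (not taken from gapminder); the variable names are hypothetical.

```r
# Hypothetical GDP per capita values, each 10 times the previous one
gdp <- c(300, 3000, 30000)

# On the raw scale the gaps between neighbours grow tenfold
gaps_raw <- diff(gdp)
gaps_raw

# On the log10 scale every tenfold increase is the same distance (1 unit)
log_gdp <- log10(gdp)
gaps_log <- diff(log_gdp)
gaps_log
```

On a plot, this means poor and rich countries get comparable horizontal space instead of the poor countries being squeezed against the left edge.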
So, the addition of scale_x_log10() in this code means that the x-axis (GDP per capita) is displayed on a logarithmic scale, providing a clearer representation of the data when there is a wide range of values.

Line Graphs (aka time series graphs):

Line graphs, also known as time series graphs, are a type of data visualization that represents data points over a continuous interval or time span. These graphs are particularly useful for showing trends, patterns, and relationships in data that evolve over time. Time series graphs typically have time on the x-axis (horizontal axis) and a variable of interest on the y-axis (vertical axis).
Example 1: Time series graph for lifeExp by year for the two countries in the continent Oceania

This R code uses the ggplot2 package to create a plot using the gapminder dataset, specifically focusing on the Oceania continent.

    library(ggplot2)
    library(gapminder)
    help(gapminder, package = "gapminder")
    library(dplyr)
    plotdata <- gapminder %>%
      filter(continent == "Oceania")
    plotdata
    ggplot(data = plotdata,
           mapping = aes(x = year, y = lifeExp, color = country, size = pop)) +
      geom_point(alpha = 0.7) +
      geom_line(linewidth = 1)

1. library(ggplot2): Loads the ggplot2 library, which is a powerful data visualization package in R.
2. library(gapminder): Loads the gapminder library, which contains the gapminder dataset. This dataset includes information about countries, continents, population, life expectancy, and GDP per capita over time.
3. help(gapminder, package = "gapminder"): Displays help information for the gapminder dataset.
4. library(dplyr): Loads the dplyr library, which is used for data manipulation and filtering.
5. plotdata <- gapminder %>% filter(continent == "Oceania"): Creates a new data frame called plotdata by filtering the gapminder dataset to include only rows where the continent is "Oceania."
6. plotdata: Displays the filtered data frame, showing only the data for countries in the Oceania continent.
7. The ggplot function is used to create a plot:
   a. data = plotdata: Specifies the data frame to be used for plotting.
   b. mapping = aes(...): Maps the aesthetics (variables) to the visual elements of the plot.
      i. x = year: Maps the x-axis to the "year" variable.
      ii. y = lifeExp: Maps the y-axis to the "lifeExp" (life expectancy) variable.
      iii. color = country: Colors the points based on the "country" variable.
      iv. size = pop: Sizes the points based on the "pop" (population) variable.
8. geom_point(alpha = 0.7): Adds a scatter plot layer with points, where alpha controls the transparency of the points.
9. geom_line(linewidth = 1): Adds a line plot layer connecting the points, with a specified line width of 1.

The code creates a combined scatter and line plot of life expectancy over time for the countries in the Oceania continent. Each point represents one country in one year, colored by country and sized by its population in that year. The transparency of the points is set to 0.7, and a line connects the points for each country.

Example 2: Time series graph of medianLifeExp by year for the five continents

This R code also uses the ggplot2, gapminder, and dplyr libraries to create a summarized dataset and display it.

    library(ggplot2)
    library(gapminder)
    library(dplyr)
    plotdata <- gapminder %>%
      group_by(year, continent) %>%
      summarize(medianLifeExp = median(lifeExp))
    plotdata
1. library(ggplot2): Loads the ggplot2 library, a data visualization package in R.
2. library(gapminder): Loads the gapminder library, which contains the gapminder dataset.
3. library(dplyr): Loads the dplyr library, used for data manipulation and summarization.
4. plotdata <- gapminder %>% group_by(year, continent) %>% summarize(medianLifeExp = median(lifeExp)): Creates a new data frame called plotdata by performing the following operations:
   a. %>% is the pipe operator, used to chain operations together.
   b. group_by(year, continent): Groups the data by year and continent.
   c. summarize(medianLifeExp = median(lifeExp)): Calculates the median life expectancy (medianLifeExp) for each group (combination of year and continent).
5. plotdata: Displays the summarized data frame, showing the median life expectancy for each combination of year and continent.

This code creates a new dataset (plotdata) that summarizes the median life expectancy for each combination of year and continent in the gapminder dataset. The resulting dataset is then displayed.

Line graph without `facet_wrap()`

This R code uses the ggplot2 library to create a line plot visualizing the median life expectancy over time for different continents.

    ggplot(data = plotdata,
           mapping = aes(x = year, y = medianLifeExp, color = continent)) +
      geom_line(linewidth = 1)

1. ggplot(data = plotdata, mapping = aes(...)): Initiates a ggplot object, specifying the data frame (plotdata) and mapping the aesthetics (variables) to the visual elements of the plot.
   a. x = year: Maps the x-axis to the "year" variable.
   b. y = medianLifeExp: Maps the y-axis to the "medianLifeExp" variable (median life expectancy).
   c. color = continent: Colors the lines based on the "continent" variable.
2. geom_line(linewidth = 1): Adds a line plot layer to the ggplot object, specifying a line width of 1. This draws one line per continent, connecting that continent's median life expectancy from year to year.

The code creates a line plot showing how the median life expectancy changes over time for different continents. Each line corresponds to a continent, and the x-axis represents the years, while the y-axis represents the median life expectancy.
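The grouped median can be checked by hand on a tiny table. The sketch below deliberately swaps in base R's `aggregate()` for the dplyr `group_by()` and `summarize()` pipeline so that it runs without extra packages. The data frame `toy` and its values are invented for illustration and are not gapminder data.

```r
# Made-up data: two continents, two years, two countries each
toy <- data.frame(
  year      = c(2000, 2000, 2000, 2000, 2005, 2005, 2005, 2005),
  continent = c("A", "A", "B", "B", "A", "A", "B", "B"),
  lifeExp   = c(60, 70, 50, 54, 62, 74, 52, 58)
)

# One row per (year, continent) combination, holding the group median
med <- aggregate(lifeExp ~ year + continent, data = toy, FUN = median)
names(med)[3] <- "medianLifeExp"
med
```

Each row of `med` is one point on the line graph: continent selects the line, year gives the x position, and medianLifeExp gives the y position.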
Line graph with `facet_wrap()`

This R code extends the previous ggplot visualization by adding more features, such as different line and point aesthetics, and using facet_wrap to create separate plots for each continent.

    ggplot(data = plotdata,
           mapping = aes(x = year, y = medianLifeExp)) +
      geom_line(color = "green4") +
      geom_point(size = 3, color = "steelblue") +
      facet_wrap(~continent)

1. ggplot(data = plotdata, mapping = aes(...)): Initiates a ggplot object, specifying the data frame (plotdata) and mapping the aesthetics (variables) to the visual elements of the plot.
   a. x = year: Maps the x-axis to the "year" variable.
   b. y = medianLifeExp: Maps the y-axis to the "medianLifeExp" variable (median life expectancy).
2. geom_line(color = "green4"): Adds a line plot layer to the ggplot object, specifying a line color of "green4." This creates lines connecting points for each combination of year and continent.
3. geom_point(size = 3, color = "steelblue"): Adds a point plot layer to the ggplot object, specifying point size of 3 and color of "steelblue." This overlays points on the line plot for each combination of year and continent.
4. facet_wrap(~continent): Uses facet_wrap to create separate plots (facets) for each level of the "continent" variable. This means that the visualization will have subplots for each continent, allowing you to compare the median life expectancy trends across continents more easily.

This code creates a multi-faceted line and point plot, where each facet represents a continent. The lines show the trend of median life expectancy over time, and the points mark the individual yearly values on top of the lines. The lines are colored green, and the points are colored steel blue.

Kabacoff (2023) Chapter 8

Figure 8.1: Simple time series graph

The "economics" dataset comes with the "ggplot2" package. This R code generates a simple time series graph from it, with the variable "psavert" on the y-axis and "date" on the x-axis.

    library(ggplot2)
    help(economics, package = "ggplot2")
    summary(economics)
    ggplot(data = economics, mapping = aes(x = date, y = psavert)) +
      geom_line() +
      labs(title = "Personal Savings Rate",
           x = "Date",
           y = "Personal Savings Rate")

1. library(ggplot2): Loads the ggplot2 library.
2. help(economics, package = "ggplot2"): Displays help information for the "economics" dataset that comes with the "ggplot2" package.
3. summary(economics): Provides a summary of the "economics" dataset, showing descriptive statistics for the variables in the dataset.
4. ggplot(data = economics, mapping = aes(...)): Initiates a ggplot object, specifying the "economics" dataset and mapping the aesthetics (variables) to the visual elements of the plot.
   a. x = date: Maps the x-axis to the "date" variable.
   b. y = psavert: Maps the y-axis to the "psavert" variable (personal savings rate).
5. geom_line(): Adds a line plot layer to the ggplot object, creating a line graph that connects the data points for the personal savings rate over time.
6. labs(title = ..., x = ..., y = ...): Sets the title and axis labels for the plot.

Figure 8.2: Simple time series graph with modified date

This R code extends the previous time series graph by adding more features and customizations using the "ggplot2" and "scales" libraries.

    library(ggplot2)
    library(scales)
    ggplot(data = economics, mapping = aes(x = date, y = psavert)) +
      geom_line(color = "red", linewidth = 1) +
      geom_smooth() +
      scale_x_date(date_breaks = '5 years',
                   labels = date_format("%b-%y")) +
      labs(title = "Personal Savings Rate",
           subtitle = "1967 to 2015",
           x = "",
           y = "Personal Savings Rate") +
      theme_minimal()

# Multivariate Time Series Graphs
# one time install
# install.packages("quantmod")

# Figure 8.3: Multivariate time series
# The figure is modified with dates from 2022-01-01 to 2022-10-01
library(quantmod)
library(dplyr)

# get Apple (AAPL) closing prices
apple <- getSymbols("AAPL",
                    return.class = "data.frame",
                    from = "2022-01-01",
                    to = "2022-10-01")
View(AAPL)
apple <- AAPL %>%
  mutate(Date = as.Date(row.names(.))) %>%
  select(Date, AAPL.Close) %>%
  rename(Close = AAPL.Close) %>%
  mutate(Company = "Apple")
# str(apple)

# get Amazon (AMZN) closing prices
amazon <- getSymbols("AMZN",
                     return.class = "data.frame",
                     from = "2022-01-01",
                     to = "2022-10-01")
View(AMZN)
amazon <- AMZN %>%
  mutate(Date = as.Date(row.names(.))) %>%
  select(Date, AMZN.Close) %>%
  rename(Close = AMZN.Close) %>%
  mutate(Company = "Amazon")
# str(amazon)

# combine data for both companies
mseries <- rbind(apple, amazon)
# head(mseries)
# tail(mseries)

# plot data
library(scales)
library(ggplot2)
ggplot(data = mseries,
       mapping = aes(x = Date, y = Close, color = Company)) +
  geom_line(linewidth = 1) +
  scale_x_date(date_breaks = '1 month',
               labels = scales::date_format("%b")) +
  scale_y_continuous(limits = c(50, 210),
                     breaks = seq(50, 210, 20),
                     labels = scales::dollar) +
  labs(title = "NASDAQ Closing Prices",
       subtitle = "Jan - Oct 2022",
       caption = "source: Yahoo Finance",
       y = "Closing Price") +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2")

# Dumbbell Charts
# Install the package "ggalt" by uncommenting and running the following line of code.
# install.packages("ggalt")
library(ggalt)
library(tidyr)
library(dplyr)
library(ggplot2)

# load data
data(gapminder, package = "gapminder")

# subset data
plotdata_long <- filter(gapminder,
                        continent == "Americas" &
                          year %in% c(1952, 2007)) %>%
  select(country, year, lifeExp)

# convert data to wide format
plotdata_wide <- spread(plotdata_long, year, lifeExp)
View(plotdata_wide)
names(plotdata_wide) <- c("country", "y1952", "y2007")

# create dumbbell plot
ggplot(data = plotdata_wide,
       mapping = aes(y = country, x = y1952, xend = y2007)) +
  geom_dumbbell()

# The following code sorts by 1952 life expectancy, modifies the line and point size, colors the points, adds titles and labels, and simplifies the theme.
# create dumbbell plot
ggplot(data = plotdata_wide,
       mapping = aes(y = reorder(country, y1952), x = y1952, xend = y2007)) +
  geom_dumbbell(size = 1.2,
                size_x = 3,
                size_xend = 3,
                colour = "grey",
                colour_x = "blue",
                colour_xend = "red") +
  theme_minimal() +
  labs(title = "Change in Life Expectancy",
       subtitle = "1952 to 2007",
       x = "Life Expectancy (years)",
       y = "")

# Slope Graphs
# To create a slope graph, we will use the `newggslopegraph()` function from the `CGPfunctions` package.
# The `newggslopegraph()` function requires the following parameters (in order):
#   1. data frame
#   2. time variable (which must be a factor)
#   3. numeric variable to be plotted, and
#   4. grouping variable (creating one line per group).
library(CGPfunctions)
library(dplyr)
library(ggplot2)
library(gapminder)

# create a subset of gapminder dataset
df <- gapminder %>%
  filter(year %in% c(1992, 1997, 2002, 2007) &
           country %in% c("Panama", "Costa Rica", "Nicaragua", "Honduras",
                          "El Salvador", "Guatemala", "Belize")) %>%
  mutate(year = factor(year),
         lifeExp = round(lifeExp))

# create slope graph
newggslopegraph(df, year, lifeExp, country) +
  labs(title = "Life Expectancy by Country",
       subtitle = "Central America",
       caption = "source: gapminder")

# Area Charts
# A simple area chart is basically a line graph, with a fill from the line to the x-axis.
# basic area chart
library(ggplot2)
ggplot(data = economics, mapping = aes(x = date, y = psavert)) +
  geom_area(fill = "lightblue", color = "black") +
  labs(title = "Personal Savings Rate",
       x = "Date",
       y = "Personal Savings Rate")

# Stacked area chart
help(uspopage, package = "gcookbook")
data(uspopage, package = "gcookbook")
ggplot(data = uspopage,
       mapping = aes(x = Year, y = Thousands, fill = AgeGroup)) +
  geom_area() +
  labs(title = "US Population by age",
       x = "Year",
       y = "Population in Thousands")

# Stacked area chart with simpler scale
data(uspopage, package = "gcookbook")
ggplot(data = uspopage,
       mapping = aes(x = Year, y = Thousands/1000,
                     fill = forcats::fct_rev(AgeGroup))) +
  geom_area(color = "black") +
  labs(title = "US Population by age",
       subtitle = "1900 to 2002",
       caption = "source: U.S. Census Bureau, 2003, HS-3",
       x = "Year",
       y = "Population in Millions",
       fill = "Age Group") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()

# Another Example for Area Chart
data(gapminder, package = "gapminder")
plotdata <- gapminder %>%
  group_by(year, continent) %>%
  summarize(total_pop = sum(pop)/1e+09)
plotdata
# The gapminder data cover 1952 to 2007, so the subtitle uses that range.
ggplot(data = plotdata,
       mapping = aes(x = year, y = total_pop,
                     fill = forcats::fct_rev(continent))) +
  geom_area(color = "black") +
  labs(title = "World Population by Continent",
       subtitle = "1952 to 2007",
       caption = "gapminder data",
       x = "Year",
       y = "Population in Billions",
       fill = "Continent") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()

# Kabacoff (2023) Chapter 9
# 9.1. Correlation Plots
# We use the "SaratogaHouses" dataset that comes with the "mosaicData" package
help(SaratogaHouses, package = "mosaicData")
data(SaratogaHouses, package = "mosaicData")
library(dplyr)
df <- select_if(SaratogaHouses, is.numeric)
head(df)

# calculate the correlations
r <- cor(df, use = "complete.obs")
round(r, 2)

# We use the ggcorrplot() function of the "ggcorrplot" package for visualizing these correlations.
library(ggplot2)
library(ggcorrplot)
ggcorrplot(r)

# The ggcorrplot function has a number of options for customizing the output. For example:
#   hc.order = TRUE reorders the variables, placing variables with similar correlation patterns together.
#   type = "lower" plots the lower portion of the correlation matrix.
#   lab = TRUE overlays the correlation coefficients (as text) on the plot.
ggcorrplot(r,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE)

# 9.2. Linear Regression
data(SaratogaHouses, package = "mosaicData")
houses_lm <- lm(price ~ lotSize + age + landValue + livingArea +
                  bedrooms + bathrooms + waterfront,
                data = SaratogaHouses)
summary(houses_lm)

# The visreg package provides tools for visualizing these conditional relationships.
# The visreg function takes (a) the model and (b) the variable of interest and plots the conditional relationship, controlling for the other variables. The option gg = TRUE is used to produce a ggplot2 graph.
# conditional plot of price vs. living area
library(ggplot2)
library(visreg)
visreg(houses_lm, "livingArea", gg = TRUE)

# Continuing the example, the price difference between waterfront and non-waterfront homes is plotted, controlling for the other six variables in the model. Since a ggplot2 graph is produced, other ggplot2 functions can be added to customize the graph.
# conditional plot of price vs. waterfront location
visreg(houses_lm, "waterfront", gg = TRUE) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Relationship between price and location",
       subtitle = "controlling for lot size, age, land value, bedrooms and bathrooms",
       caption = "source: Saratoga Housing Data (2006)",
       y = "Home Price",
       x = "Waterfront")

# 9.3. Logistic Regression
# Logistic regression can be used to explore the relationship between a binary response variable (e.g., yes/no, lived/died, pass/fail, malignant/benign) and an explanatory variable while other variables are held constant.
# Example: A logistic regression model for predicting marital status (married/single)
data(CPS85, package = "mosaicData")
cps85_glm <- glm(married ~ sex + age + race + sector,
                 family = "binomial",
                 data = CPS85)
summary(cps85_glm)

# Let's use the visreg() function to visualize the relationship between age and the probability of being married, holding all other explanatory variables constant.
library(ggplot2)
library(visreg)
visreg(cps85_glm, "age",
       gg = TRUE,
       scale = "response") +
  labs(y = "Prob(Married)",
       x = "Age",
       title = "Relationship of age and marital status",
       subtitle = "controlling for sex, race, and job sector",
       caption = "source: Current Population Survey 1985")

# We can create multiple conditional plots by adding a by option. For example, the following code will plot the probability of being married by age, separately for men and women, controlling for race and job sector.
library(ggplot2)
library(visreg)
visreg(cps85_glm, "age",
       by = "sex",
       gg = TRUE,
       scale = "response") +
  labs(y = "Prob(Married)",
       x = "Age",
       title = "Relationship of age and marital status",
       subtitle = "controlling for race and job sector",
       caption = "source: Current Population Survey 1985")

# 9.4. Survival Plots
# In healthcare research, we are interested in the response variable "time to an event." Some examples are: time to recovery, time to death, or time to relapse.
# For an observation, if the event has not occurred (either because the study ended or the patient dropped out), then such an observation is said to be censored.
# The NCCTG Lung Cancer dataset in the survival package provides data on the survival times of patients with advanced lung cancer following treatment. The study followed patients for up to 34 months.
# The outcome for each patient is measured by two variables:
#   time (survival time in days)
#   status (1 = censored, 2 = dead)
# Thus a patient with time = 305 & status = 2 lived 305 days following treatment. Another patient with time = 400 & status = 1 lived at least 400 days but was then lost to the study (censored). A patient with time = 1022 & status = 1 survived to the end of the study (34 months).
# A survival plot (also called a Kaplan-Meier Curve) can be used to illustrate the probability that an individual survives up to and including time t.

# plot survival curve
library(survival)
library(survminer)
help(lung, package = "survival")
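
To make the time/status coding concrete, the Kaplan-Meier estimate can be computed by hand on a handful of patients. The sketch below uses base R only and five invented patients (not rows from the lung dataset); it applies the standard product-limit formula, multiplying (1 - deaths / number at risk) over the observed death times. The names `km`, `n_at_risk`, and `n_deaths` are illustrative.

```r
# Five made-up patients, coded like the lung data: 1 = censored, 2 = dead
time   <- c(5, 8, 12, 15, 20)
status <- c(2, 1, 2, 2, 1)

# The curve only steps down at times where a death was observed
event_times <- sort(unique(time[status == 2]))

# Patients still under observation just before each death time
n_at_risk <- sapply(event_times, function(t) sum(time >= t))

# Deaths occurring exactly at each death time
n_deaths <- sapply(event_times, function(t) sum(time == t & status == 2))

# Product-limit estimate: running product of the conditional survival probabilities
surv <- cumprod(1 - n_deaths / n_at_risk)

km <- data.frame(time = event_times, n_at_risk, n_deaths, surv)
km
```

The censored patient at day 8 never causes a drop in the curve, but does leave the risk set, which is why only 3 patients are at risk at day 12.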
data(cancer, package = "survival")
sfit <- survfit(Surv(time, status) ~ 1, data = cancer)
sfit
ggsurvplot(sfit,
           title = "Kaplan-Meier curve for lung cancer survival")

# About 50% of patients are still alive 300 days after the treatment.
# It is frequently of great interest whether groups of patients have the same survival probabilities. In the next graph, the survival curves for men and women are compared.
# plot survival curve for men and women
sfit <- survfit(Surv(time, status) ~ sex, data = cancer)
ggsurvplot(sfit,
           conf.int = TRUE,
           pval = TRUE,
           legend.labs = c("Male", "Female"),
           legend.title = "Sex",
           palette = c("cornflowerblue", "indianred3"),
           title = "Kaplan-Meier Curve for lung cancer survival",
           xlab = "Time (days)")

# 9.5. Mosaic plots
# Mosaic plots can display relationships between categorical variables using rectangles whose areas represent the proportion of cases for any given combination of levels.
# The color of the tiles can also indicate the degree of relationship among the variables.
# Although mosaic plots can be created with `ggplot2` using the `ggmosaic` package, we use the `vcd` package.
# Example 1: Mosaic plot of titanic dataset
# Import `titanic.csv` dataset
library(readr)
# Download "titanic.csv" data from Readings section of Week 5 and save a copy in your working directory
titanic <- read_csv("titanic.csv")

# Create a table
tbl <- xtabs(~Survived + Class + Sex, titanic)
ftable(tbl)

# Create a mosaic plot from the table
library(vcd)
mosaic(tbl, main = "Titanic data")

# The size of each tile is proportional to the percentage of cases in that combination of levels.
# Clearly more passengers perished than survived. Those that perished were primarily 3rd class male passengers and male crew (the largest group).
# If we assume that these three variables are independent, we can examine the residuals from the model and shade the tiles to match.
# In the graph below, dark blue represents more cases than expected given independence. Dark red represents fewer cases than expected if independence holds.
mosaic(tbl,
       shade = TRUE,
       legend = TRUE,
       labeling_args = list(set_varnames = c(Sex = "Gender",
                                             Survived = "Survived",
                                             Class = "Passenger Class")),
       set_labels = list(Survived = c("No", "Yes"),
                         Class = c("1st", "2nd", "3rd", "Crew"),
                         Sex = c("F", "M")),
       main = "Titanic data")

# From the graph, we can see that if class, gender, and survival are independent, we are seeing many more male crew perishing, and 1st, 2nd and 3rd class females surviving than would be expected.
# Conversely, far fewer 1st class passengers (both male and female) died than would be expected by chance. Thus the assumption of independence is rejected.
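
The idea behind the shading can be seen on a much smaller table. The sketch below is a simplification: it uses a two-way table with invented counts (not the Titanic data, which is a three-way table) and base R only. Under independence the expected count in each cell is (row total × column total) / grand total, and the Pearson residual is (observed - expected) / sqrt(expected). Positive residuals correspond to the blue tiles and negative residuals to the red tiles.

```r
# Made-up 2 x 2 table of counts (rows: sex, columns: survival)
tbl2 <- matrix(c(30, 10,
                 20, 40),
               nrow = 2, byrow = TRUE,
               dimnames = list(Sex = c("F", "M"),
                               Survived = c("Yes", "No")))

# Expected counts if sex and survival were independent
expected <- outer(rowSums(tbl2), colSums(tbl2)) / sum(tbl2)
expected

# Pearson residuals: positive = more cases than expected, negative = fewer
pearson <- (tbl2 - expected) / sqrt(expected)
round(pearson, 2)
```

In this toy table the F/Yes cell has a positive residual (more survivors than independence predicts) and the M/Yes cell a negative one, which is the same kind of pattern the shaded mosaic plot makes visible.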