Data description and PCA
School: University of Phoenix
Course: 552
Subject: Statistics
Date: Feb 20, 2024
Pages: 9
Uploaded by AmbassadorArtPorpoise41
# Exploratory analysis of the dataset
# Libraries used in this project
library(corrplot)
library(ggplot2)
library(gridExtra)
library(dplyr)
library(caret)
library(e1071)
library(class)
# Loading the dataset in R
df = read.csv("~/Fall 2021/MATH 4339 - Multivariate Statistics/Group Project/heart_failure_clinical_records_dataset.csv")
# Exploratory Data Analysis
dim(df)
[1] 299 13
Our dataset has 299 observations and 13 variables, including the label variable
DEATH_EVENT.
str(df)
This gives a brief overview of each variable: its type and its first few values.
summary(df)
The above summary gives, for each feature, the minimum and maximum values, the mean and
median, and the 1st and 3rd quartiles.
# Checking the missing values
colSums(is.na(df))
From above we can conclude that our dataset does not have any missing values.
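As a small illustration of what this check returns, the sketch below applies the same call to a made-up data frame with a few missing entries. The data frame `toy` and its values are hypothetical and are not part of the heart failure data.

```r
# Hypothetical toy data frame, used only to illustrate the check
toy <- data.frame(
  age   = c(60, NA, 75, 50),
  time  = c(4, 6, NA, NA),
  event = c(1, 0, 1, 0)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when every column has zero missing values
no_missing <- all(na_counts == 0)
print(no_missing)
```

For the heart failure data, the conclusion above corresponds to every count being zero.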
# Correlation matrix for the dataset
df.cor <- cor(df[sapply(df, is.numeric)])
corrplot(df.cor, type = "upper", tl.srt = 45)
The above correlation plot shows how strongly each pair of variables is correlated.
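To read the correlations with the response from the matrix rather than from the plot, one option is to take the response's column and sort it by absolute value. The sketch below uses the built-in mtcars data as a stand-in, with `am` playing the role of the response; both are assumptions for illustration. With the heart failure data the column name would be "DEATH_EVENT".

```r
# Stand-in data: mtcars ships with R; 'am' (a 0/1 column) plays the response
dat  <- mtcars
resp <- "am"

# Correlation matrix of the numeric columns, as in the script above
cor_mat <- cor(dat[sapply(dat, is.numeric)])

# Correlations of every other variable with the response
cor_resp <- cor_mat[, resp]
cor_resp <- cor_resp[names(cor_resp) != resp]

# Sort from strongest to weakest absolute correlation
ranked <- cor_resp[order(abs(cor_resp), decreasing = TRUE)]
print(round(ranked, 2))
```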
# Comparing categorical variables with the response variable (DEATH_EVENT)
# DEATH_EVENT is stored as 0/1, so it is wrapped in factor() to split each bar by outcome
plt1 <- ggplot(df, aes(anaemia)) + geom_bar(aes(fill = factor(DEATH_EVENT)))
plt2 <- ggplot(df, aes(diabetes)) + geom_bar(aes(fill = factor(DEATH_EVENT)))
plt3 <- ggplot(df, aes(high_blood_pressure)) + geom_bar(aes(fill = factor(DEATH_EVENT)))
plt4 <- ggplot(df, aes(sex)) + geom_bar(aes(fill = factor(DEATH_EVENT)))
plt5 <- ggplot(df, aes(smoking)) + geom_bar(aes(fill = factor(DEATH_EVENT)))
grid.arrange(plt1, plt2, plt3, plt4, plt5, nrow = 3)
# Comparing continuous variables with the response variable (DEATH_EVENT)
# factor(DEATH_EVENT) gives one fill colour per outcome instead of a dropped numeric fill
plt6 <- ggplot(df, aes(x = age, fill = factor(DEATH_EVENT))) + geom_histogram()
plt7 <- ggplot(df, aes(x = creatinine_phosphokinase, fill = factor(DEATH_EVENT))) + geom_histogram()
plt8 <- ggplot(df, aes(x = ejection_fraction, fill = factor(DEATH_EVENT))) + geom_histogram()
plt9 <- ggplot(df, aes(x = platelets, fill = factor(DEATH_EVENT))) + geom_histogram()
plt10 <- ggplot(df, aes(x = serum_creatinine, fill = factor(DEATH_EVENT))) + geom_histogram()
plt11 <- ggplot(df, aes(x = serum_sodium, fill = factor(DEATH_EVENT))) + geom_histogram()
plt12 <- ggplot(df, aes(x = time, fill = factor(DEATH_EVENT))) + geom_histogram()
grid.arrange(plt6, plt7, plt8, plt9, plt10, plt11, plt12, nrow = 4)
# Now we analyse the variables that are highly correlated with DEATH_EVENT
# We examine age against the other continuous variables, since age is correlated with the response
# factor(DEATH_EVENT) gives two distinct point colours rather than a continuous gradient
plt13 <- ggplot(df, aes(x = age, y = creatinine_phosphokinase, color = factor(DEATH_EVENT))) + geom_point()
plt14 <- ggplot(df, aes(x = age, y = ejection_fraction, color = factor(DEATH_EVENT))) + geom_point()
plt15 <- ggplot(df, aes(x = age, y = platelets, color = factor(DEATH_EVENT))) + geom_point()
plt16 <- ggplot(df, aes(x = age, y = serum_creatinine, color = factor(DEATH_EVENT))) + geom_point()
plt17 <- ggplot(df, aes(x = age, y = serum_sodium, color = factor(DEATH_EVENT))) + geom_point()
plt18 <- ggplot(df, aes(x = age, y = time, color = factor(DEATH_EVENT))) + geom_point()
grid.arrange(plt13, plt14, plt15, plt16, plt17, plt18, nrow = 3)
# Separating the continuous variables from our original data frame
continuous = c("age", "creatinine_phosphokinase", "ejection_fraction", "platelets",
               "serum_creatinine", "serum_sodium", "time")
continuous_df = as.data.frame(df[continuous])
# Total variance of the unscaled continuous variables (trace of the covariance matrix)
sum(diag(cov(continuous_df)))
# PCA on the standardized continuous variables
pr.out = prcomp(continuous_df, scale. = TRUE)
# Variance of each principal component (the eigenvalues)
pr.out$sdev^2
summary(pr.out)
From the above summary, we can see that PC1 explains 21% of the total variance in our
dataset, PC2 explains 17%, PC3 explains 15%, PC4 explains 14%, PC5 explains 13%, and
PC6 explains 11% of the total variance of our dataset.
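The percentages in such a summary are the squared standard deviations of the components divided by their sum. A minimal sketch of that calculation follows, using the built-in USArrests data as a stand-in (an assumption, since the heart failure file is not bundled with R). It also shows that with scale. = TRUE the total variance equals the number of variables, which would be 7 for the continuous variables here, not the trace of the unscaled covariance matrix.

```r
# Stand-in data: USArrests ships with R and has 4 numeric columns
pca <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues: variance captured by each principal component
eig <- pca$sdev^2

# Proportion and cumulative proportion of variance explained
prop_var <- eig / sum(eig)
cum_var  <- cumsum(prop_var)

# With standardized variables the total variance is the number of variables
print(sum(eig))
print(round(rbind(prop_var, cum_var), 3))
```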
screeplot(pr.out,type = "l")
From the above scree plot, we can see an elbow at PC6. Together, the plot and the
summary show that the first six principal components explain about 90% of the total
variance in our dataset. So, six principal components should be used instead of the whole data
matrix.
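The 90% figure can be checked by accumulating the rounded proportions quoted above. The sketch below is only arithmetic on those six rounded values; the seventh component is not quoted in the text, so it is left out, and the 0.90 threshold is an assumed cut-off for illustration.

```r
# Rounded proportions of variance quoted in the text for PC1 to PC6
prop_var <- c(PC1 = 0.21, PC2 = 0.17, PC3 = 0.15,
              PC4 = 0.14, PC5 = 0.13, PC6 = 0.11)

# Cumulative proportion of variance explained
cum_var <- cumsum(prop_var)
print(cum_var)

# Smallest number of components whose cumulative proportion reaches the threshold
threshold <- 0.90   # assumed cut-off
n_keep <- which(cum_var >= threshold)[1]
print(n_keep)
```

With these rounded values the first five components reach 80% and the sixth brings the total to 91%, which matches the choice of six components.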
library(FactoMineR)
pr.out2 = PCA(continuous_df)
pr.out2$var$cor
From the above plot of the variables on the first two principal components, we can see how the
variables relate to each component. PC1 mainly describes serum_creatinine,
creatinine_phosphokinase and time, whereas PC2 mainly describes age, ejection_fraction,
platelets and serum_sodium. The table of correlations printed by pr.out2$var$cor gives the exact values.
# Flip the signs of the loadings and scores (the direction of each component is arbitrary)
pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out, scale = 0)
The scale = 0 argument to biplot() ensures that the arrows are scaled to
represent the loadings. The plot shows both the samples and the variables of the data on the first two principal components.
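Negating the loadings and the scores together changes only the orientation of the plot, not the fit. The sketch below checks this on the built-in USArrests data, used here as a stand-in (an assumption): the product of scores and transposed loadings reproduces the standardized data before and after the flip, and the component variances are unchanged.

```r
# Stand-in data: USArrests ships with R
pca <- prcomp(USArrests, scale. = TRUE)
X   <- scale(USArrests)            # the standardized data that prcomp analysed

# Reconstruction from the original scores and loadings
recon_orig <- pca$x %*% t(pca$rotation)

# Flip the signs of both, as in the script above
rot_flipped <- -pca$rotation
x_flipped   <- -pca$x
recon_flipped <- x_flipped %*% t(rot_flipped)

# The two reconstructions agree, and both recover the standardized data
same_recon   <- isTRUE(all.equal(recon_orig, recon_flipped))
recovers_X   <- isTRUE(all.equal(recon_orig, X, check.attributes = FALSE))
# The variance of each component is unaffected by the sign change
same_var     <- isTRUE(all.equal(unname(apply(x_flipped, 2, var)), pca$sdev^2))
print(c(same_recon, recovers_X, same_var))
```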
Our dataset has 5 binary categorical variables ("anaemia", "diabetes",
"high_blood_pressure", "sex", "smoking"), which we exclude before performing PCA. This
is because PCA summarizes variation in continuous variables, whereas our categorical
features take only the two values "0" and "1". So, including them in the PCA would not give better results.
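One way to carry out this exclusion without typing the column names is to detect the columns that take exactly two distinct values. The sketch below does this on a made-up data frame; `toy` and its values are hypothetical. On the real data this rule would also flag DEATH_EVENT, which is binary as well.

```r
# Hypothetical toy data frame mixing continuous and 0/1 columns
toy <- data.frame(
  age     = c(60, 55, 75, 50),
  anaemia = c(0, 1, 0, 1),
  smoking = c(1, 0, 0, 1),
  time    = c(4, 6, 7, 8)
)

# A column is treated as binary when it has exactly two distinct values
is_binary   <- sapply(toy, function(col) length(unique(col)) == 2)
binary_cols <- names(toy)[is_binary]
print(binary_cols)

# Keep only the non-binary columns for PCA
continuous_toy <- toy[!is_binary]
print(names(continuous_toy))
```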