Data description and PCA

School: University of Phoenix
Course: 552
Subject: Statistics
Date: Feb 20, 2024
Uploaded by: AmbassadorArtPorpoise41

# Exploratory analysis of the dataset

# Libraries used in this project
library(corrplot)
library(ggplot2)
library(gridExtra)
library(dplyr)
library(caret)
library(e1071)
library(class)

# Loading the dataset in R
df = read.csv("~/Fall 2021/MATH 4339 - Multivariate Statistics/Group Project/heart_failure_clinical_records_dataset.csv")

# Exploratory Data Analysis
dim(df)
[1] 299 13

Our dataset has 299 observations and 13 features, including the label variable DEATH_EVENT.

str(df)

This gives a brief overview of each variable.

summary(df)

The summary above reports the minimum, maximum, mean, median, and first and third quartiles of each feature.

# Checking for missing values
colSums(is.na(df))

From the output above, we conclude that the dataset has no missing values.

# Correlation matrix for the dataset
df.cor <- cor(df[sapply(df, is.numeric)])
corrplot(df.cor, type = "upper", tl.srt = 45)

The correlation plot above shows how strongly each pair of variables is correlated.

# Comparing categorical variables with the response variable (DEATH_EVENT)
# Assuming DEATH_EVENT is still stored as 0/1 integers (as read.csv loads it),
# factor() makes ggplot2 split each bar by outcome
plt1 <- ggplot(df, aes(anaemia)) + geom_bar(aes(fill = factor(DEATH_EVENT)))
plt2 <- ggplot(df, aes(diabetes)) + geom_bar(aes(fill = factor(DEATH_EVENT)))
plt3 <- ggplot(df, aes(high_blood_pressure)) + geom_bar(aes(fill = factor(DEATH_EVENT)))
plt4 <- ggplot(df, aes(sex)) + geom_bar(aes(fill = factor(DEATH_EVENT)))
plt5 <- ggplot(df, aes(smoking)) + geom_bar(aes(fill = factor(DEATH_EVENT)))
grid.arrange(plt1, plt2, plt3, plt4, plt5, nrow = 3)
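
The bar charts compare each binary feature with the outcome visually. As a rough numerical companion, the sketch below tabulates the share of deaths within each group. It runs on simulated data, not the study data: the data frame `toy` and its probabilities are made up for illustration, and only the column names follow the dataset.

```r
# Simulated stand-in for the real data (hypothetical values, not the study data)
set.seed(1)
toy <- data.frame(
  anaemia     = rbinom(200, 1, 0.4),
  DEATH_EVENT = rbinom(200, 1, 0.3)
)

# Counts of each anaemia / outcome combination
counts <- table(anaemia = toy$anaemia, DEATH_EVENT = toy$DEATH_EVENT)

# Row-wise proportions: share of each outcome within each anaemia group
rates <- prop.table(counts, margin = 1)
print(round(rates, 3))
```

On the real data, replacing `toy` with `df` should give the same kind of table for any of the five binary features.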

# Comparing continuous variables with the response variable (DEATH_EVENT)
# Assuming DEATH_EVENT is stored as 0/1 integers, factor() lets the fill split each histogram by outcome
plt6 <- ggplot(df, aes(x = age, fill = factor(DEATH_EVENT))) + geom_histogram()
plt7 <- ggplot(df, aes(x = creatinine_phosphokinase, fill = factor(DEATH_EVENT))) + geom_histogram()
plt8 <- ggplot(df, aes(x = ejection_fraction, fill = factor(DEATH_EVENT))) + geom_histogram()
plt9 <- ggplot(df, aes(x = platelets, fill = factor(DEATH_EVENT))) + geom_histogram()
plt10 <- ggplot(df, aes(x = serum_creatinine, fill = factor(DEATH_EVENT))) + geom_histogram()
plt11 <- ggplot(df, aes(x = serum_sodium, fill = factor(DEATH_EVENT))) + geom_histogram()
plt12 <- ggplot(df, aes(x = time, fill = factor(DEATH_EVENT))) + geom_histogram()
grid.arrange(plt6, plt7, plt8, plt9, plt10, plt11, plt12, nrow = 4)

# Next, we analyze the variables that are highly correlated with DEATH_EVENT
# We examine age, since it is correlated with the response variable
plt13 <- ggplot(df, aes(x = age, y = creatinine_phosphokinase, color = DEATH_EVENT)) + geom_point()
plt14 <- ggplot(df, aes(x = age, y = ejection_fraction, color = DEATH_EVENT)) + geom_point()
plt15 <- ggplot(df, aes(x = age, y = platelets, color = DEATH_EVENT)) + geom_point()
plt16 <- ggplot(df, aes(x = age, y = serum_creatinine, color = DEATH_EVENT)) + geom_point()
plt17 <- ggplot(df, aes(x = age, y = serum_sodium, color = DEATH_EVENT)) + geom_point()
plt18 <- ggplot(df, aes(x = age, y = time, color = DEATH_EVENT)) + geom_point()
grid.arrange(plt13, plt14, plt15, plt16, plt17, plt18, nrow = 3)
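
Which variables count as highly correlated with DEATH_EVENT can also be read off numerically instead of from the plot. The sketch below ranks features by the absolute value of their correlation with the response. It uses simulated data: the data frame `toy`, its distributions, and the way the outcome is generated are assumptions made only so the example runs on its own.

```r
# Simulated stand-in for the real data (hypothetical values, not the study data)
set.seed(2)
n <- 150
toy <- data.frame(
  age          = rnorm(n, 60, 12),
  serum_sodium = rnorm(n, 137, 4),
  time         = runif(n, 4, 285)
)
# The simulated outcome is made to depend on age (an assumption of this sketch)
toy$DEATH_EVENT <- rbinom(n, 1, plogis((toy$age - 60) / 10))

# Correlation of every feature with the response, dropping the response itself
cors <- cor(toy)[, "DEATH_EVENT"]
cors <- cors[names(cors) != "DEATH_EVENT"]

# Rank by absolute correlation, strongest first
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

On the real data, the same three steps could be applied to the `df.cor` matrix computed earlier.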

# Separating the continuous variables from the label variable and the rest of the original data frame
continuous = c("age", "creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium", "time")
continuous_df = as.data.frame(df[continuous])

# Total variance of the unscaled continuous variables
sum(diag(cov(continuous_df)))

# PCA on the standardized variables
pr.out = prcomp(continuous_df, scale. = TRUE)
pr.out$sdev^2   # variance of each principal component
summary(pr.out)

From the summary above, PC1 explains 21% of the total variance in the dataset, PC2 explains 17%, PC3 explains 15%, PC4 explains 14%, PC5 explains 13%, and PC6 explains 11%.

screeplot(pr.out, type = "l")

The scree plot above shows an elbow at PC6. Together, the plot and the summary indicate that the first six principal components explain about 90% of the total variance (the rounded shares above sum to 91%). So, six principal components should be used instead of the whole data matrix.

library(FactoMineR)
pr.out2 = PCA(continuous_df)
pr.out2$var$cor

The plot above of the first two principal components shows how the variables relate to them. PC1 mainly describes serum_creatinine, creatinine_phosphokinase, and time, whereas PC2 mainly describes age, ejection_fraction, platelets, and serum_sodium. The table above gives the exact correlations.
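
The variance shares and the variable-to-component correlations discussed above can be reproduced from first principles. The sketch below is a minimal check on simulated data (the data frame `X` and its columns are hypothetical, not the heart failure data): for a PCA on standardized variables, the component variances are the eigenvalues of the correlation matrix, and the correlation between a variable and a component is that variable's loading times the component's standard deviation.

```r
# Simulated data with one deliberately correlated pair (hypothetical, for illustration)
set.seed(3)
n <- 100
X <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
X$d <- X$a + rnorm(n, sd = 0.5)

# PCA on standardized variables, as in the analysis above
pr <- prcomp(X, scale. = TRUE)

# Component variances equal the eigenvalues of the correlation matrix
eig <- eigen(cor(X))$values
print(all.equal(eig, pr$sdev^2))

# Proportion and cumulative proportion of variance explained
prop_var <- pr$sdev^2 / sum(pr$sdev^2)
cum_var  <- cumsum(prop_var)
print(round(rbind(prop_var, cum_var), 3))

# Smallest number of components reaching a 90% threshold (threshold chosen for illustration)
k <- which(cum_var >= 0.90)[1]

# Variable-component correlations: computed directly, and as loading * sdev
var_cor     <- cor(X, pr$x)
loading_cor <- sweep(pr$rotation, 2, pr$sdev, "*")
print(all.equal(var_cor, loading_cor, check.attributes = FALSE))
```

With standardized variables the total variance equals the number of variables, so each eigenvalue divided by that count gives the share reported by `summary()`.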

pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out, scale = 0)

The scale = 0 argument to biplot() ensures that the arrows are scaled to represent the loadings. This graph shows both the samples and the variables of the data.

Our dataset has five binary categorical variables ("anaemia", "diabetes", "high_blood_pressure", "sex", "smoking"), which we exclude before performing PCA. PCA works best on variables with substantial variation, whereas these features take only the values 0 and 1, so including them would not improve the results.
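
Negating both the loadings and the scores, as done before the biplot, only mirrors the axes; it does not change what the PCA represents. The sketch below illustrates this on simulated data (the matrix `X` and its column names are hypothetical): the standardized data reconstructed from scores and loadings is identical before and after the flip.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(4)
X <- matrix(rnorm(60), ncol = 3, dimnames = list(NULL, c("x1", "x2", "x3")))
pr <- prcomp(X, scale. = TRUE)

# Standardized data, which prcomp(scale. = TRUE) decomposes
Z <- scale(X)

# Reconstruction from scores and loadings before the sign flip
recon_before <- pr$x %*% t(pr$rotation)

# Flip the signs of both loadings and scores
pr$rotation <- -pr$rotation
pr$x <- -pr$x
recon_after <- pr$x %*% t(pr$rotation)

# Both reconstructions agree with each other and with the standardized data
print(all.equal(recon_before, recon_after))
print(all.equal(unclass(Z), recon_before, check.attributes = FALSE))
```

The two negations cancel in the product, which is why the direction of each component is a matter of convention.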
