Lab 5B – Train and Test Dataset

School: Conestoga College
Course: CONS 1447
Subject: Statistics
Date: Apr 3, 2024
Type: docx
Pages: 15
Uploaded by: ColonelJaguar3196
Laboratory 5B - X Y relation dataset
Shijitha Sandeep
2023-03-23

Introduction

In Lab 5B we use two data sets, Training and Testing, to study the relationship between X and Y and to practise cleaning the data and removing outliers. The lab also covers how to convert raw data from its original source into a format ready for analysis, how to apply the linear regression technique to the given dataset, and the basic fundamentals of data analysis using R. With reference to these data sets, we will explore, clean and provide the required analysis of the data for the given requirements.

Load the packages relevant for the lab exercise:

```r
library(ggplot2)    # to design ggplot graphics
library(ISLR)       # statistical analysis of data
library(tidyverse)  # data modelling and visualization
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ✔ purrr   1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)      # easier data manipulation
library(rmarkdown)  # enhancements to R Markdown
```

Import both the train and test data into RStudio using the code below:

```r
# Read or import the train data
train <- read.csv("train.csv")

# Import the test data using read.csv()
test <- read.csv("test.csv")

# Open the data in the viewer to inspect it
view(train)
view(test)
```
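The same `read.csv()` round trip can be exercised on a throwaway file; a self-contained sketch in which the file and its values are made up, standing in for train.csv:

```r
# Write a tiny stand-in for train.csv to a temporary file, then read it back
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x1 = c(24, 50, 15), y1 = c(21.5, 47.5, 17.2)),
          tmp, row.names = FALSE)
d <- read.csv(tmp)
str(d)   # 'data.frame': 3 obs. of 2 variables (x1 and y1, both numeric)
```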
After importing the data, we can proceed with the given instructions to analyze the train and test data sets.

Question 1: Explore the dataset (you can provide 2 plots for exploring the data)

We can explore the data sets by first checking the first and last six records with head() and tail(); fix() additionally lets us edit values on the fly.

```r
# head() fetches the first six records of both train and test data
head(train)
##   x1       y1
## 1 24 21.54945
## 2 50 47.46446
## 3 15 17.21866
## 4 38 36.58640
## 5 87 87.28898
## 6 36 32.46387
head(test)
##   x1        y1
## 1 77 79.775152
## 2 21 23.177279
## 3 22 25.609262
## 4 20 17.857388
## 5 36 41.849864
## 6 15  9.805235

# tail() fetches the last six records of both train and test data
tail(train)
##     x1       y1
## 695 81 81.45545
## 696 58 58.59501
## 697 93 94.62509
## 698 82 88.60377
## 699 66 63.64869
## 700 97 94.97527
tail(test)
##     x1        y1
## 295  8  5.405221
## 296 71 68.545888
## 297 46 47.334876
## 298 55 54.090637
## 299 62 63.297171
## 300 47 52.459467

# fix() lets us view and modify the data on the fly
fix(train)
fix(test)
```

1.2 We can explore the data further with plots and graphs. Good ways to inspect the distributions of the variables and the relation between them are the boxplot and the scatter plot. First, compare the y1 and x1 variables with boxplots:

```r
# Boxplots of y1 and x1 for both the train and test data sets
boxplot(train$y1, train$x1, main = "y1 and x1 relationship",
        xlab = "y1", ylab = "x1", col = c("#3F7DC1", "#F60AB8"))
boxplot(test$y1, test$x1, main = "y1 and x1 relationship",
        xlab = "y1", ylab = "x1", col = c("#C63A41", "#C2A63E"))
```
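Incidentally, the points a boxplot marks as outliers can also be extracted programmatically with boxplot.stats(); a self-contained sketch, using an invented vector that mimics the one extreme x1 record:

```r
# boxplot.stats() returns, in $out, the values lying more than
# 1.5 * IQR beyond the hinges -- the points a boxplot draws separately
x <- c(1:50, 3530)        # one extreme value, as in train$x1
boxplot.stats(x)$out      # returns 3530
```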
Interpretation: From the boxplots of the train and test data, we can observe the following:

1. The boxplot of the train data set shows one outlier that is far larger than every other record. This outlier will distort the mean and median of the data and, in turn, any analysis performed on it.
2. The boxplot of the test data set shows a regular spread of the y1 and x1 values, so we can expect a linear relationship between y1 and x1 in the test data.

1.3 We can also explore the data with a scatter plot, as shown below, to better identify the correlation.

```r
# Scatter plot for the train data
plot(train$y1, train$x1, main = "Relation between y1 and x1",
     xlab = "y1", ylab = "x1", col = "orange")
```
```r
# Scatter plot for the test data
plot(test$y1, test$x1, main = "Relation between y1 and x1",
     xlab = "y1", ylab = "x1", col = "blue")
```
Interpretation: From the scatter plots of the train and test data, we can observe the following:

1. The scatter plot of the train data set shows one outlier with a very high value, and this outlier distorts the apparent linear relationship between y1 and x1.
2. The points of the test data fall almost on a straight line, so we can conclude that there is a linear relationship between y1 and x1.

Question 2: Check for any missing values or outliers

2.1 We can use basic summary statistics to inspect the structure of the data sets and look for missing values.

```r
# summary() reports the mean, median and other summary measures of each variable
summary(train)
##        x1               y1        
##  Min.   :   0.00   Min.   : -3.84  
##  1st Qu.:  25.00   1st Qu.: 24.93  
##  Median :  49.00   Median : 48.97  
##  Mean   :  54.99   Mean   : 49.94  
##  3rd Qu.:  75.00   3rd Qu.: 74.93  
##  Max.   :3530.16   Max.   :108.87  
##                    NA's   :1       
summary(test)
##        x1               y1         
##  Min.   :  0.00   Min.   : -3.468  
##  1st Qu.: 27.00   1st Qu.: 25.677  
##  Median : 53.00   Median : 52.171  
##  Mean   : 50.94   Mean   : 51.205  
##  3rd Qu.: 73.00   3rd Qu.: 74.303  
##  Max.   :100.00   Max.   :105.592  

# str() shows the variables, their data types and the number of records
str(train)
```
```r
## 'data.frame': 700 obs. of 2 variables:
##  $ x1: num 24 50 15 38 87 36 12 81 25 5 ...
##  $ y1: num 21.5 47.5 17.2 36.6 87.3 ...
str(test)
## 'data.frame': 300 obs. of 2 variables:
##  $ x1: num 77 21 22 20 36 15 62 95 20 5 ...
##  $ y1: num 79.8 23.2 25.6 17.9 41.8 ...
```

Interpretation: From the summary and structure of each data set, we can make out that:

1. The train data set has 700 records with 2 variables and contains 1 missing value in y1. The summary also shows that the maximum of x1 (3530.16) is far larger than its mean and median (both around 50), so we can conclude that the train data set has outliers.
2. The test data set has 300 records and 2 variables. From summary(), we can conclude that it has no missing values and probably no outliers, since its summary statistics all fall in the same range.

2.2 We can now verify our assumption about missing values (NAs) in both data sets with the code below. If there are any missing values, we need to take the necessary measures to handle them.

```r
# Check whether NA is present in any column of train or test
colSums(is.na(train))
## x1 y1 
##  0  1
colSums(is.na(test))
## x1 y1 
##  0  0

# There is 1 NA in the y1 column of train, so drop that row
# (base na.omit() has no cols argument; filtering on y1 is explicit)
train <- train[!is.na(train$y1), ]

# Verify that the NA has been removed from the y1 variable
sum(is.na(train$y1))
## [1] 0
```
Interpretation: From the code above, we can see that the y1 variable of the train data set contains 1 NA. Since there is only one missing value, we can simply remove that record, leaving the train data with 699 records. The test data set has no NAs or missing values.

Question 3: In case you have an outlier then just subset the range of the dataset

We identified an outlier in the train data set while checking the summaries, and it can be handled by subsetting the range of the train data set with the code below.

```r
# To remove the outlier, first compute the quartiles and the IQR of the
# variable that contains it

# Quartiles of x1 in the train data
q <- quantile(train$x1, probs = c(.25, .75), na.rm = FALSE)

# Interquartile range
iqr <- IQR(train$x1)

# Keep only the rows whose x1 lies within 1.5 * IQR of the quartiles
train <- subset(train, train$x1 > (q[1] - (1.5 * iqr)) &
                       train$x1 < (q[2] + (1.5 * iqr)))

# Cross-verify with a boxplot that the outlier is gone
boxplot(train$y1, train$x1, main = "y1 and x1 relationship",
        xlab = "y1", ylab = "x1")
```
```r
# Cross-verify the data using summary()
summary(train)
##        x1              y1        
##  Min.   :  0.00   Min.   : -3.84  
##  1st Qu.: 25.00   1st Qu.: 24.93  
##  Median : 49.00   Median : 48.97  
##  Mean   : 50.01   Mean   : 49.94  
##  3rd Qu.: 75.00   3rd Qu.: 74.93  
##  Max.   :100.00   Max.   :108.87  
```

Observation: quantile() returns the 1st (25%) and 3rd (75%) quartiles of the x1 variable, and IQR() returns the difference between the 75th and 25th percentiles. The subset() call drops any value more than 1.5 × IQR below the 1st quartile or more than 1.5 × IQR above the 3rd quartile; such values are treated as outliers. From the boxplot we can see that, after removing the outlier, the values are spread regularly and the two variables of the train data set are clearly related. Likewise, the summary no longer contains any value far larger than the mean and median, so we can conclude that the outlier has been handled.
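The 1.5 × IQR rule used above can be seen on a toy vector; a self-contained sketch in which the numbers are invented:

```r
x <- c(1:10, 500)                          # 500 is an obvious outlier
q <- quantile(x, probs = c(.25, .75))      # 3.5 and 8.5 for this vector
iqr <- IQR(x)                              # 5
keep <- x > (q[1] - 1.5 * iqr) & x < (q[2] + 1.5 * iqr)
x[keep]                                    # 1..10; 500 is dropped
```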
Question 4: Building a linear regression model on the given training data (where dependent variable is y1 and independent variable is x1)

```r
# Fit the linear regression model
train_lm <- lm(x1 ~ y1, data = train)
```

Observation: lm() above fits a linear regression between the two variables of the train data set. Note that the formula x1 ~ y1 actually treats x1 as the response; to match the question (y1 dependent, x1 independent) it would be lm(y1 ~ x1, data = train). For simple regression the R-squared and the slope p-value are the same in either direction, so the conclusions below are unaffected, but the coefficients describe x1 as a function of y1.

Question 5: To interpret the model results and the relationship between the independent and dependent variables, you need to explain the p-value and R-square measures. These values work as an indicator of how strong/well performing is the model you just built.

The interpretation of the model results can be obtained from the summary of the linear regression model:

```r
# Summary of the linear regression model
summary(train_lm)
## 
## Call:
## lm(formula = x1 ~ y1, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3598 -1.8873  0.0081  2.0166  9.5167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.571255   0.209970   2.721  0.00668 ** 
## y1          0.990052   0.003633 272.510  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.794 on 697 degrees of freedom
## Multiple R-squared:  0.9907, Adjusted R-squared:  0.9907 
## F-statistic: 7.426e+04 on 1 and 697 DF,  p-value: < 2.2e-16

# Check how well the linear regression model fits the data
AIC(train_lm)
## [1] 3424.107
```
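The quantities discussed in Question 5 can also be pulled out of the model object directly rather than read off the printed table; a sketch on simulated data (the variable names here are invented):

```r
set.seed(1)
d <- data.frame(x1 = 1:100)
d$y1 <- d$x1 + rnorm(100, sd = 2)   # y1 is roughly x1 plus noise
m <- lm(y1 ~ x1, data = d)
s <- summary(m)

s$r.squared                         # proportion of variance explained
s$adj.r.squared                     # adjusted for the number of predictors
coef(s)["x1", "Pr(>|t|)"]           # p-value of the x1 slope
AIC(m)                              # lower is better when comparing models
```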
Interpretation: From the summary of the linear regression model, we can see that the model's overall p-value and the p-value of the slope coefficient are both far below the significance level, flagged with three asterisks. Hence we can conclude that the model is significant and performs well, and that there is a strong relationship between the y1 and x1 variables. The intercept is also significant. The adjusted R-squared value of 0.9907 indicates that about 99.07% of the variation in the response is explained by the fitted model. The AIC value of 3424.107 is mainly useful for comparing this model against alternatives: the lower the AIC, the better the relative fit.

Question 6: You need to calculate the prediction accuracy (you could do some research/study about this accuracy metric and include your calculations in the R Markdown file).

6.1 Calculating the prediction accuracy

Prediction accuracy measures how well the model built on the train data performs on the test data. The main reason for checking it is to see whether the predictive model will make sensible predictions for another data set, such as the test set, and if so, how accurate those predictions are. The prediction accuracy can also be interpreted visually; here we use ggplot.

```r
# ggplot for the train data
ggplot() +
  geom_point(aes(x = train$x1, y = train$y1), colour = "yellow") +
  geom_line(aes(x = train$x1, y = predict(train_lm, newdata = train)),
            colour = "green") +
  ggtitle("x1 and y1 relation of train set") +
  xlab("x1") + ylab("y1")
```
```r
# ggplot for the test data
ggplot() +
  geom_point(aes(x = test$x1, y = test$y1), colour = "yellow") +
  geom_line(aes(x = train$x1, y = predict(train_lm, newdata = train)),
            colour = "green") +
  ggtitle("x1 and y1 relation of test set") +
  xlab("x1") + ylab("y1")
```
Observation: In both ggplot visualizations, the green line is the regression line of the model built on the train data set. From the plots of both the train and test data, we can conclude that the model built on the training data also performs well on the test data, as the points of both data sets lie almost on the regression line. Thus the predictive model fitted on the training data fits the test data very well.

6.2 Accuracy check

Even though the predictive model appears to fit the test data well, we can perform an accuracy check to cross-verify how accurate its predictions on the test data are.

```r
# Predict values for the test data using the model trained on the train data
test_predict <- predict(train_lm, test)

# Pair the actual and predicted values
test_pred_act <- data.frame(cbind(actuals = test$y1, predicted = test_predict))

# Finally, check the correlation
test_cor <- cor(test_pred_act)
test_cor
```
```r
##           actuals predicted
## actuals         1         1
## predicted       1         1
```

Interpretation: Accuracy here is measured as the correlation between the actual and the predicted values: the higher the correlation, the more closely the predictions track the actual values. The printed correlation between the actual and predicted y1 values of the test data is 1 to the precision shown, meaning the predictions are almost perfectly proportional to the actual values. This suggests the model fits the test data extremely well, although a printed correlation of 1 reflects rounding rather than a literally perfect model.

6.3 Min-max accuracy

Another way to check the agreement between predicted and actual values is the min-max accuracy, which averages the row-wise ratio of the smaller of the two values to the larger:

```r
minmax_acc <- mean(apply(test_pred_act, 1, min) / apply(test_pred_act, 1, max))
minmax_acc
## [1] 0.9953695
```

Observation: The min-max accuracy is about 99.5%, which indicates that the predicted and actual values are almost identical; for a perfect model this measure would be exactly 100%.

6.4 MAPE calculation: the Mean Absolute Percentage Error (MAPE) indicates how far, on average, the model's predictions are from the actual values.

```r
# MAPE; strictly, the denominator should be abs(actuals), since the
# test data's y1 contains negative values
mape <- mean(abs(test_pred_act$predicted - test_pred_act$actuals) /
             test_pred_act$actuals)
mape
## [1] 0.0149498
```
Interpretation: The MAPE result shows that the model's predictions are on average only about 1.49% away from the actual values of the test data, another indication that the model fits the test data very well.
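The two accuracy measures above can be wrapped into small helpers and checked on invented numbers; a sketch in which the function names and values are my own:

```r
# Min-max accuracy: row-wise ratio of the smaller value to the larger,
# averaged; 1 means every prediction equals its actual value
minmax_accuracy <- function(actual, predicted) {
  m <- cbind(actual, predicted)
  mean(apply(m, 1, min) / apply(m, 1, max))
}

# MAPE with abs() in the denominator, so negative actuals
# do not flip the sign of individual error terms
mape_fn <- function(actual, predicted) {
  mean(abs(predicted - actual) / abs(actual))
}

a <- c(100, 200, 400)
p <- c(110, 190, 400)
minmax_accuracy(a, p)   # mean(100/110, 190/200, 400/400)
mape_fn(a, p)           # mean(0.10, 0.05, 0) = 0.05
```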