Lab 5B – Train and Test Dataset

School: Conestoga College
Course: CONS 1447
Subject: Statistics
Date: Apr 3, 2024
Type: docx
Pages: 15
Uploaded by: ColonelJaguar3196
Laboratory 5B - X Y relation dataset
Shijitha Sandeep
2023-03-23

Introduction

In Lab 5B we use two data sets, Training and Testing, to study the relationship between X and Y and to practise cleaning the data and removing outliers. The lab also covers how to convert raw data from its original source into a format ready for analysis, how to apply the linear regression technique to the given dataset, and the basic fundamentals of data analysis using R. With reference to these data sets, we will explore, clean and provide the required analysis of the data for the given requirements.

Load the packages relevant for the lab exercise:

```r
library(ggplot2)    # to design ggplot graphics
library(ISLR)       # statistical analysis of data
library(tidyverse)  # data modelling and visualization
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ✔ purrr   1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)      # easier data manipulation
library(rmarkdown)  # enhancements to R Markdown
```

Import both the train and test data into RStudio using the code below:

```r
# Read or import the train data
train <- read.csv("train.csv")

# Import the test data using read.csv()
test <- read.csv("test.csv")

# Open the data in the viewer to inspect it
view(train)
view(test)
```
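The same `read.csv()` round trip can be exercised on a throwaway file; a self-contained sketch in which the file and its values are made up, standing in for train.csv:

```r
# Write a tiny stand-in for train.csv to a temporary file, then read it back
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x1 = c(24, 50, 15), y1 = c(21.5, 47.5, 17.2)),
          tmp, row.names = FALSE)
d <- read.csv(tmp)
str(d)   # 'data.frame': 3 obs. of 2 variables (x1 and y1, both numeric)
```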
After importing the data, we can proceed with the given instructions to analyze the train and test data sets.

Question 1: Explore the dataset (you can provide 2 plots for exploring the data)

We can explore the data sets by first checking the first and last six records with head() and tail(); fix() additionally lets us edit values on the fly.

```r
# head() fetches the first six records of both train and test data
head(train)
##   x1       y1
## 1 24 21.54945
## 2 50 47.46446
## 3 15 17.21866
## 4 38 36.58640
## 5 87 87.28898
## 6 36 32.46387
head(test)
##   x1        y1
## 1 77 79.775152
## 2 21 23.177279
## 3 22 25.609262
## 4 20 17.857388
## 5 36 41.849864
## 6 15  9.805235

# tail() fetches the last six records of both train and test data
tail(train)
##     x1       y1
## 695 81 81.45545
## 696 58 58.59501
## 697 93 94.62509
## 698 82 88.60377
## 699 66 63.64869
## 700 97 94.97527
tail(test)
##     x1        y1
## 295  8  5.405221
## 296 71 68.545888
## 297 46 47.334876
## 298 55 54.090637
## 299 62 63.297171
## 300 47 52.459467

# fix() lets us view and modify the data on the fly
fix(train)
fix(test)
```

1.2 We can explore the data further with plots and graphs. Good ways to inspect the distributions of the variables and the relation between them are the boxplot and the scatter plot. First, compare the y1 and x1 variables with boxplots:

```r
# Boxplots of y1 and x1 for both the train and test data sets
boxplot(train$y1, train$x1, main = "y1 and x1 relationship",
        xlab = "y1", ylab = "x1", col = c("#3F7DC1", "#F60AB8"))
boxplot(test$y1, test$x1, main = "y1 and x1 relationship",
        xlab = "y1", ylab = "x1", col = c("#C63A41", "#C2A63E"))
```
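Incidentally, the points a boxplot marks as outliers can also be extracted programmatically with boxplot.stats(); a self-contained sketch, using an invented vector that mimics the one extreme x1 record:

```r
# boxplot.stats() returns, in $out, the values lying more than
# 1.5 * IQR beyond the hinges -- the points a boxplot draws separately
x <- c(1:50, 3530)        # one extreme value, as in train$x1
boxplot.stats(x)$out      # returns 3530
```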
Interpretation: From the boxplots of the train and test data, we can observe the following:

1. The boxplot of the train data set shows one outlier that is far larger than every other record. This outlier will distort the mean and median of the data and, in turn, any analysis performed on it.
2. The boxplot of the test data set shows a regular spread of the y1 and x1 values, so we can expect a linear relationship between y1 and x1 in the test data.

1.3 We can also explore the data with a scatter plot, as shown below, to better identify the correlation.

```r
# Scatter plot for the train data
plot(train$y1, train$x1, main = "Relation between y1 and x1",
     xlab = "y1", ylab = "x1", col = "orange")
```
```r
# Scatter plot for the test data
plot(test$y1, test$x1, main = "Relation between y1 and x1",
     xlab = "y1", ylab = "x1", col = "blue")
```
Interpretation: From the scatter plots of the train and test data, we can observe the following:

1. The scatter plot of the train data set shows one outlier with a very high value, and this outlier distorts the apparent linear relationship between y1 and x1.
2. The points of the test data fall almost on a straight line, so we can conclude that there is a linear relationship between y1 and x1.

Question 2: Check for any missing values or outliers

2.1 We can use basic summary statistics to inspect the structure of the data sets and look for missing values.

```r
# summary() reports the mean, median and other summary measures of each variable
summary(train)
##        x1               y1        
##  Min.   :   0.00   Min.   : -3.84  
##  1st Qu.:  25.00   1st Qu.: 24.93  
##  Median :  49.00   Median : 48.97  
##  Mean   :  54.99   Mean   : 49.94  
##  3rd Qu.:  75.00   3rd Qu.: 74.93  
##  Max.   :3530.16   Max.   :108.87  
##                    NA's   :1       
summary(test)
##        x1               y1         
##  Min.   :  0.00   Min.   : -3.468  
##  1st Qu.: 27.00   1st Qu.: 25.677  
##  Median : 53.00   Median : 52.171  
##  Mean   : 50.94   Mean   : 51.205  
##  3rd Qu.: 73.00   3rd Qu.: 74.303  
##  Max.   :100.00   Max.   :105.592  

# str() shows the variables, their data types and the number of records
str(train)
```
```r
## 'data.frame': 700 obs. of 2 variables:
##  $ x1: num 24 50 15 38 87 36 12 81 25 5 ...
##  $ y1: num 21.5 47.5 17.2 36.6 87.3 ...
str(test)
## 'data.frame': 300 obs. of 2 variables:
##  $ x1: num 77 21 22 20 36 15 62 95 20 5 ...
##  $ y1: num 79.8 23.2 25.6 17.9 41.8 ...
```

Interpretation: From the summary and structure of each data set, we can make out that:

1. The train data set has 700 records with 2 variables and contains 1 missing value in y1. The summary also shows that the maximum of x1 (3530.16) is far larger than its mean and median (both around 50), so we can conclude that the train data set has outliers.
2. The test data set has 300 records and 2 variables. From summary(), we can conclude that it has no missing values and probably no outliers, since its summary statistics all fall in the same range.

2.2 We can now verify our assumption about missing values (NAs) in both data sets with the code below. If there are any missing values, we need to take the necessary measures to handle them.

```r
# Check whether NA is present in any column of train or test
colSums(is.na(train))
## x1 y1 
##  0  1
colSums(is.na(test))
## x1 y1 
##  0  0

# There is 1 NA in the y1 column of train, so drop that row
# (base na.omit() has no cols argument; filtering on y1 is explicit)
train <- train[!is.na(train$y1), ]

# Verify that the NA has been removed from the y1 variable
sum(is.na(train$y1))
## [1] 0
```
Interpretation: From the code above, we can see that the y1 variable of the train data set contains 1 NA. Since there is only one missing value, we can simply remove that record, leaving the train data with 699 records. The test data set has no NAs or missing values.

Question 3: In case you have an outlier then just subset the range of the dataset

We identified an outlier in the train data set while checking the summaries, and it can be handled by subsetting the range of the train data set with the code below.

```r
# To remove the outlier, first compute the quartiles and the IQR of the
# variable that contains it

# Quartiles of x1 in the train data
q <- quantile(train$x1, probs = c(.25, .75), na.rm = FALSE)

# Interquartile range
iqr <- IQR(train$x1)

# Keep only the rows whose x1 lies within 1.5 * IQR of the quartiles
train <- subset(train, train$x1 > (q[1] - (1.5 * iqr)) &
                       train$x1 < (q[2] + (1.5 * iqr)))

# Cross-verify with a boxplot that the outlier is gone
boxplot(train$y1, train$x1, main = "y1 and x1 relationship",
        xlab = "y1", ylab = "x1")
```
```r
# Cross-verify the data using summary()
summary(train)
##        x1              y1        
##  Min.   :  0.00   Min.   : -3.84  
##  1st Qu.: 25.00   1st Qu.: 24.93  
##  Median : 49.00   Median : 48.97  
##  Mean   : 50.01   Mean   : 49.94  
##  3rd Qu.: 75.00   3rd Qu.: 74.93  
##  Max.   :100.00   Max.   :108.87  
```

Observation: quantile() returns the 1st (25%) and 3rd (75%) quartiles of the x1 variable, and IQR() returns the difference between the 75th and 25th percentiles. The subset() call drops any value more than 1.5 × IQR below the 1st quartile or more than 1.5 × IQR above the 3rd quartile; such values are treated as outliers. From the boxplot we can see that, after removing the outlier, the values are spread regularly and the two variables of the train data set are clearly related. Likewise, the summary no longer contains any value far larger than the mean and median, so we can conclude that the outlier has been handled.
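The 1.5 × IQR rule used above can be seen on a toy vector; a self-contained sketch in which the numbers are invented:

```r
x <- c(1:10, 500)                          # 500 is an obvious outlier
q <- quantile(x, probs = c(.25, .75))      # 3.5 and 8.5 for this vector
iqr <- IQR(x)                              # 5
keep <- x > (q[1] - 1.5 * iqr) & x < (q[2] + 1.5 * iqr)
x[keep]                                    # 1..10; 500 is dropped
```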
Question 4: Building a linear regression model on the given training data (where dependent variable is y1 and independent variable is x1)

```r
# Fit the linear regression model
train_lm <- lm(x1 ~ y1, data = train)
```

Observation: lm() above fits a linear regression between the two variables of the train data set. Note that the formula x1 ~ y1 actually treats x1 as the response; to match the question (y1 dependent, x1 independent) it would be lm(y1 ~ x1, data = train). For simple regression the R-squared and the slope p-value are the same in either direction, so the conclusions below are unaffected, but the coefficients describe x1 as a function of y1.

Question 5: To interpret the model results and the relationship between the independent and dependent variables, you need to explain the p-value and R-square measures. These values work as an indicator of how strong/well performing is the model you just built.

The interpretation of the model results can be obtained from the summary of the linear regression model:

```r
# Summary of the linear regression model
summary(train_lm)
## 
## Call:
## lm(formula = x1 ~ y1, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3598 -1.8873  0.0081  2.0166  9.5167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.571255   0.209970   2.721  0.00668 ** 
## y1          0.990052   0.003633 272.510  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.794 on 697 degrees of freedom
## Multiple R-squared:  0.9907, Adjusted R-squared:  0.9907 
## F-statistic: 7.426e+04 on 1 and 697 DF,  p-value: < 2.2e-16

# Check how well the linear regression model fits the data
AIC(train_lm)
## [1] 3424.107
```
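The quantities discussed in Question 5 can also be pulled out of the model object directly rather than read off the printed table; a sketch on simulated data (the variable names here are invented):

```r
set.seed(1)
d <- data.frame(x1 = 1:100)
d$y1 <- d$x1 + rnorm(100, sd = 2)   # y1 is roughly x1 plus noise
m <- lm(y1 ~ x1, data = d)
s <- summary(m)

s$r.squared                         # proportion of variance explained
s$adj.r.squared                     # adjusted for the number of predictors
coef(s)["x1", "Pr(>|t|)"]           # p-value of the x1 slope
AIC(m)                              # lower is better when comparing models
```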
Interpretation: From the summary of the linear regression model, we can see that the model's overall p-value and the p-value of the slope coefficient are both far below the significance level, flagged with three asterisks. Hence we can conclude that the model is significant and performs well, and that there is a strong relationship between the y1 and x1 variables. The intercept is also significant. The adjusted R-squared value of 0.9907 indicates that about 99.07% of the variation in the response is explained by the fitted model. The AIC value of 3424.107 is mainly useful for comparing this model against alternatives: the lower the AIC, the better the relative fit.

Question 6: You need to calculate the prediction accuracy (you could do some research/study about this accuracy metric and include your calculations in the R Markdown file).

6.1 Calculating the prediction accuracy

Prediction accuracy measures how well the model built on the train data performs on the test data. The main reason for checking it is to see whether the predictive model will make sensible predictions for another data set, such as the test set, and if so, how accurate those predictions are. The prediction accuracy can also be interpreted visually; here we use ggplot.

```r
# ggplot for the train data
ggplot() +
  geom_point(aes(x = train$x1, y = train$y1), colour = "yellow") +
  geom_line(aes(x = train$x1, y = predict(train_lm, newdata = train)),
            colour = "green") +
  ggtitle("x1 and y1 relation of train set") +
  xlab("x1") + ylab("y1")
```
```r
# ggplot for the test data
ggplot() +
  geom_point(aes(x = test$x1, y = test$y1), colour = "yellow") +
  geom_line(aes(x = train$x1, y = predict(train_lm, newdata = train)),
            colour = "green") +
  ggtitle("x1 and y1 relation of test set") +
  xlab("x1") + ylab("y1")
```
Observation: In both ggplot visualizations, the green line is the regression line of the model built on the train data set. From the plots of both the train and test data, we can conclude that the model built on the training data also performs well on the test data, as the points of both data sets lie almost on the regression line. Thus the predictive model fitted on the training data fits the test data very well.

6.2 Accuracy check

Even though the predictive model appears to fit the test data well, we can perform an accuracy check to cross-verify how accurate its predictions on the test data are.

```r
# Predict values for the test data using the model trained on the train data
test_predict <- predict(train_lm, test)

# Pair the actual and predicted values
test_pred_act <- data.frame(cbind(actuals = test$y1, predicted = test_predict))

# Finally, check the correlation
test_cor <- cor(test_pred_act)
test_cor
```
```r
##           actuals predicted
## actuals         1         1
## predicted       1         1
```

Interpretation: Accuracy here is measured as the correlation between the actual and the predicted values: the higher the correlation, the more closely the predictions track the actual values. The printed correlation between the actual and predicted y1 values of the test data is 1 to the precision shown, meaning the predictions are almost perfectly proportional to the actual values. This suggests the model fits the test data extremely well, although a printed correlation of 1 reflects rounding rather than a literally perfect model.

6.3 Min-max accuracy

Another way to check the agreement between predicted and actual values is the min-max accuracy, which averages the row-wise ratio of the smaller of the two values to the larger:

```r
minmax_acc <- mean(apply(test_pred_act, 1, min) / apply(test_pred_act, 1, max))
minmax_acc
## [1] 0.9953695
```

Observation: The min-max accuracy is about 99.5%, which indicates that the predicted and actual values are almost identical; for a perfect model this measure would be exactly 100%.

6.4 MAPE calculation: the Mean Absolute Percentage Error (MAPE) indicates how far, on average, the model's predictions are from the actual values.

```r
# MAPE; strictly, the denominator should be abs(actuals), since the
# test data's y1 contains negative values
mape <- mean(abs(test_pred_act$predicted - test_pred_act$actuals) /
             test_pred_act$actuals)
mape
## [1] 0.0149498
```
Interpretation: The MAPE result shows that the model's predictions are on average only about 1.49% away from the actual values of the test data, another indication that the model fits the test data very well.
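The two accuracy measures above can be wrapped into small helpers and checked on invented numbers; a sketch in which the function names and values are my own:

```r
# Min-max accuracy: row-wise ratio of the smaller value to the larger,
# averaged; 1 means every prediction equals its actual value
minmax_accuracy <- function(actual, predicted) {
  m <- cbind(actual, predicted)
  mean(apply(m, 1, min) / apply(m, 1, max))
}

# MAPE with abs() in the denominator, so negative actuals
# do not flip the sign of individual error terms
mape_fn <- function(actual, predicted) {
  mean(abs(predicted - actual) / abs(actual))
}

a <- c(100, 200, 400)
p <- c(110, 190, 400)
minmax_accuracy(a, p)   # mean(100/110, 190/200, 400/400)
mape_fn(a, p)           # mean(0.10, 0.05, 0) = 0.05
```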