Lab 5B - Train and Test Dataset
School: Conestoga College
Course: CONS 1447
Subject: Statistics
Date: Apr 3, 2024
Type: docx
Pages: 15
Laboratory 5B - X Y relation dataset
Shijitha Sandeep
2023-03-23
Introduction
In Lab 5B, we use two data sets, Training and Testing, to find the relationship between X and Y and to practise cleaning data and removing outliers. The goals are to understand how to convert raw data from its original source into a format ready for analysis, to learn how to apply the linear regression technique to the given data set, and to understand the fundamentals of data analysis using R.
With reference to these data sets, we will explore, clean, and analyze the data for the given requirements.
Load the packages relevant for the lab exercise
library(ggplot2)   # to design ggplot graphics
library(ISLR)      # for statistical analysis of data
library(tidyverse) # data modelling and visualization
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ✔ purrr   1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)     # for easier data manipulation
library(rmarkdown) # enhancements for R Markdown
Import both the train and test data into RStudio using the code below.
# Read or import train data
train <- read.csv("train.csv")
# Import test data using the read.csv function
test <- read.csv("test.csv")
# We can view the data in the spreadsheet-style viewer (note the capital V in View())
View(train)
View(test)
After importing the data, we can proceed with the instructions given to analyze the train and test data sets.
Question 1: Explore the dataset (you can provide 2 plots for exploring the data)
# We can explore the data sets by first checking the first and last few rows using the head() and tail() functions. fix() also lets us inspect and edit the data on the fly.
# head() fetches the first six rows of both train and test data
head(train)
## x1 y1
## 1 24 21.54945
## 2 50 47.46446
## 3 15 17.21866
## 4 38 36.58640
## 5 87 87.28898
## 6 36 32.46387
head(test)
## x1 y1
## 1 77 79.775152
## 2 21 23.177279
## 3 22 25.609262
## 4 20 17.857388
## 5 36 41.849864
## 6 15 9.805235
# tail() fetches the last six rows of both train and test data
tail(train)
## x1 y1
## 695 81 81.45545
## 696 58 58.59501
## 697 93 94.62509
## 698 82 88.60377
## 699 66 63.64869
## 700 97 94.97527
tail(test)
## x1 y1
## 295 8 5.405221
## 296 71 68.545888
## 297 46 47.334876
## 298 55 54.090637
## 299 62 63.297171
## 300 47 52.459467
# fix() allows us to view and modify the data on the fly
fix(train)
fix(test)
1.2 We can explore the data further using plots and graphs. Good plots for interpreting and identifying the correlation between variables are the scatter plot and the boxplot. First, we examine the relationship between the X and Y variables using a boxplot.
# Identify the relation between the y1 and x1 variables of both train and test data sets using boxplots
boxplot(train$y1, train$x1, main = "y1 and x1 relationship",
        xlab = "y1", ylab = "x1", col = c("#3F7DC1", "#F60AB8"))
boxplot(test$y1, test$x1, main = "y1 and x1 relationship",
        xlab = "y1", ylab = "x1", col = c("#C63A41", "#C2A63E"))
Interpretation:
From the boxplots of the train and test data, we can interpret the following:
1. The boxplot of the train data set shows one outlier that is far larger than every other record in the data. This outlier will distort the mean and median and will in turn degrade the performance of any model built on the entire data.
2. The boxplot of the test data set shows a similar, well-behaved distribution for y1 and x1. Hence we can infer that the test data has a linear relationship between y1 and x1.
1.3 We can also explore the data using a scatter plot, as shown below, to better identify the correlation.
# The scatter plot for train data (plot() draws an x-y scatter plot; qqplot() would draw a quantile-quantile plot instead)
plot(train$y1, train$x1, main = "Relation between y1 and x1",
     xlab = "y1", ylab = "x1", col = "orange")
# The scatter plot for test data
plot(test$y1, test$x1, main = "Relation between y1 and x1",
     xlab = "y1", ylab = "x1", col = "blue")
Interpretation:
From the scatter plots of the train and test data, we can interpret the following:
1. The scatter plot of the train data set shows one extreme outlier, which obscures the linear relationship between the y1 and x1 variables for the rest of the data.
2. The points of the test data fall almost exactly on a straight line, so we can conclude that there is a linear relationship between y1 and x1.
Question 2: Check for any missing values or outliers
2.1 We can use basic summary measures to check the summary and structure of the data sets in order to identify missing values.
# First, find the summary of the data sets using the summary() function to get the mean, median, and other summary measures of each variable
summary(train)
##        x1                y1        
##  Min.   :   0.00   Min.   : -3.84  
##  1st Qu.:  25.00   1st Qu.: 24.93  
##  Median :  49.00   Median : 48.97  
##  Mean   :  54.99   Mean   : 49.94  
##  3rd Qu.:  75.00   3rd Qu.: 74.93  
##  Max.   :3530.16   Max.   :108.87  
##                    NA's   :1
summary(test)
##        x1               y1         
##  Min.   :  0.00   Min.   : -3.468  
##  1st Qu.: 27.00   1st Qu.: 25.677  
##  Median : 53.00   Median : 52.171  
##  Mean   : 50.94   Mean   : 51.205  
##  3rd Qu.: 73.00   3rd Qu.: 74.303  
##  Max.   :100.00   Max.   :105.592
# We can also check the variables, their data types, and the total number of records in both data sets using the str() function
str(train)
## 'data.frame': 700 obs. of 2 variables:
## $ x1: num 24 50 15 38 87 36 12 81 25 5 ...
## $ y1: num 21.5 47.5 17.2 36.6 87.3 ...
str(test)
## 'data.frame': 300 obs. of 2 variables:
## $ x1: num 77 21 22 20 36 15 62 95 20 5 ...
## $ y1: num 79.8 23.2 25.6 17.9 41.8 ...
Interpretation:
From the summary and structure of each data set, we can make out that:
1. The train data set has 700 records with 2 variables and contains 1 missing value in y1. The summary also shows that the maximum of x1 (3530.16) is far larger than its mean and median, which are both around 50. Hence we can conclude that the train data set has outliers.
2. The test data set consists of 300 records and 2 variables. From summary(), we can conclude that the data has no missing values and probably no outliers either, since the five summary measures all lie in a sensible range.
2.2 Now we can verify whether our assumption about missing values (NAs) in both data sets is correct by running the code below. If there are any missing values, we need to take the necessary measures to handle them.
# Check whether NA is present in any column of the train and test data
colSums(is.na(train))
## x1 y1
##  0  1
colSums(is.na(test))
## x1 y1
##  0  0
# There is 1 NA in the y1 column of train; omit this row from the data (base R's na.omit() takes no cols argument for data frames)
train <- na.omit(train)
# Verify that the NAs have been removed from the y1 variable of train using the sum() function
sum(is.na(train$y1))
## [1] 0
Interpretation:
From the code above, we can observe that the y1 variable of the train data set contains 1 NA record. Since there is only 1 missing value, we can simply remove that record from the data set, leaving the train data with 699 records. The test data set has no NAs or missing values.
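As an alternative to dropping the row, a single missing y1 value could also be imputed, for example with the column mean. This is a sketch of our own, not part of the lab's required steps; train_imputed is a hypothetical name, and we re-read the file so the train object above is untouched.

```r
# Sketch: mean imputation as an alternative to na.omit()
train_imputed <- read.csv("train.csv")
train_imputed$y1[is.na(train_imputed$y1)] <- mean(train_imputed$y1, na.rm = TRUE)
sum(is.na(train_imputed$y1))  # should now be 0
```

Dropping the row is the simpler choice here because only one of 700 records is affected.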
Question 3: In case you have an outlier then just subset the range of the dataset
We identified an outlier in the train data set while checking the summaries of the data sets. This outlier can be handled by subsetting the range of the train data set using the code below.
# To remove the outlier, first identify the quartiles and IQR of the variable that contains it
# Find the 25th and 75th percentiles of x1 in the train data set
q <- quantile(train$x1, probs = c(.25, .75), na.rm = FALSE)
# Find the IQR
iqr <- IQR(train$x1)
# Having calculated the quartiles and IQR of x1, eliminate the outlier present in the data set using the subset() function
train <- subset(train, train$x1 > (q[1] - (1.5 * iqr)) & train$x1 < (q[2] + (1.5 * iqr)))
# After removing the outlier, cross-verify the data set for any remaining outliers by plotting a boxplot
boxplot(train$y1, train$x1, main = "y1 and x1 relationship",
        xlab = "y1", ylab = "x1")
# We can also cross-verify the data using summary()
summary(train)
##        x1               y1        
##  Min.   :  0.00   Min.   : -3.84  
##  1st Qu.: 25.00   1st Qu.: 24.93  
##  Median : 49.00   Median : 48.97  
##  Mean   : 50.01   Mean   : 49.94  
##  3rd Qu.: 75.00   3rd Qu.: 74.93  
##  Max.   :100.00   Max.   :108.87
Observation:
The quantile() function provides the 1st (25%) and 3rd (75%) quartiles of the x1 attribute of the train data.
The IQR() function provides the difference between the 75th and 25th percentiles.
The subset() call keeps only the values that lie between Q1 - 1.5*IQR and Q3 + 1.5*IQR; any value below the lower fence or above the upper fence is treated as an outlier and dropped.
From the boxplot, we can observe that after removing the outlier the values are well distributed and the two variables of the train data set are correlated. Likewise, the summary no longer contains any value extremely larger than the mean and median. Thus we can conclude that the outlier has been handled by eliminating it.
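The fence logic above can be packaged as a small reusable helper. The function name remove_iqr_outliers is our own, not part of the lab:

```r
# Sketch: drop rows whose value in column `col` falls outside the 1.5 * IQR fences
remove_iqr_outliers <- function(df, col) {
  q <- quantile(df[[col]], probs = c(.25, .75), na.rm = TRUE)
  fence <- 1.5 * IQR(df[[col]], na.rm = TRUE)
  df[df[[col]] > q[1] - fence & df[[col]] < q[2] + fence, ]
}
# Usage: train <- remove_iqr_outliers(train, "x1")
```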
Question 4: Building a linear regression model on the given training data (where dependent variable is y1 and independent variable is x1)
# The linear regression model can be built using the function below
train_lm <- lm(x1 ~ y1, data = train)
Observation:
The lm() call above builds a linear regression model on the train data set. Note that the formula x1 ~ y1 actually treats x1 as the response; to model y1 as the dependent variable, as the question states, the formula would be y1 ~ x1. Because x1 and y1 are almost perfectly correlated here, the two fits are nearly mirror images of each other.
Question 5: To interpret the model results and the relationship between the independent and dependent variables, you need to explain the p-value and R-square measures. These values work as an indicator of how strong/well performing is the model you just built.
The model results can be interpreted from the summary of the linear regression model, obtained with the functions below.
# Provide the summary of the linear regression model
summary(train_lm)
##
## Call:
## lm(formula = x1 ~ y1, data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.3598 -1.8873  0.0081  2.0166  9.5167
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.571255   0.209970   2.721  0.00668 **
## y1          0.990052   0.003633 272.510  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.794 on 697 degrees of freedom
## Multiple R-squared: 0.9907, Adjusted R-squared: 0.9907
## F-statistic: 7.426e+04 on 1 and 697 DF, p-value: < 2.2e-16
# We can also check how well the linear regression model fits the data using AIC
AIC(train_lm)
## [1] 3424.107
Interpretation:
From the summary of the linear regression model, we can observe that the model's overall p-value and the predictor's p-value are both far below the significance level, flagged with the three-asterisk code. Hence we can conclude that the model is significant and that there is a strong relationship between the two variables.
The intercept's p-value (0.00668) indicates that the intercept term is also statistically significant.
The adjusted R-squared value of 0.9907 indicates that about 99.07% of the variation in the response is explained by the fitted model.
The AIC value is 3424.107. AIC has no absolute interpretation on its own; it is used to compare candidate models, with a lower AIC indicating a relatively better fit.
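As a sketch of how AIC is used comparatively, we can contrast the fitted model with an intercept-only baseline. The baseline model below is our own addition, not part of the lab:

```r
# Compare the fitted model against an intercept-only baseline; the lower AIC wins
baseline <- lm(x1 ~ 1, data = train)
AIC(baseline)   # AIC of the model with no predictor
AIC(train_lm)   # AIC of the model with y1 as predictor (much lower here)
```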
Question 6: You need to calculate the prediction accuracy, (you could do some research/study about this accuracy metric and include your calculations in R markdown file).
6.1 Calculating the predictive accuracy
# Prediction accuracy measures how the model built on the train data set performs on the test data set. The main reason for checking prediction accuracy is to verify whether the model will generalize to another data set, such as the test set, and how accurate its predictions would be.
# The prediction accuracy can be better interpreted using visualization; here we use ggplot2 to examine it.
# ggplot for train data
ggplot() +
  geom_point(aes(x = train$x1, y = train$y1), colour = "yellow") +
  geom_line(aes(x = train$x1, y = predict(train_lm, newdata = train)), colour = "green") +
  ggtitle("x1 and y1 relation of train set") +
  xlab("x1") +
  ylab("y1")
# ggplot for test data (note: the regression line must be evaluated on the test data, not the train data)
ggplot() +
  geom_point(aes(x = test$x1, y = test$y1), colour = "yellow") +
  geom_line(aes(x = test$x1, y = predict(train_lm, newdata = test)), colour = "green") +
  ggtitle("x1 and y1 relation of test set") +
  xlab("x1") +
  ylab("y1")
Observation:
In both ggplot visualizations, the green line represents the regression line of the model built on the train data set. From the plots of the train and test data sets, we can conclude that the model built on the training data performs very well on the test data, as the points of both data sets lie almost exactly on the regression line.
Thus we can argue that the predictive model built on the training data fits the test data very well.
6.2 Accuracy check
# Even though the model appears to fit the test data well, we can run an accuracy check to cross-verify how accurate its predictions are on the test data.
# Predict the values for the test data using the model trained on the train data
test_predict <- predict(train_lm, test)
# Put the actual and predicted values of the test data side by side
test_pred_act <- data.frame(cbind(actuals = test$y1, predicted = test_predict))
# Finally, check the correlation
test_cor <- cor(test_pred_act)
test_cor
##           actuals predicted
## actuals         1         1
## predicted       1         1
Interpretation:
Accuracy here is measured as the correlation between the actual and predicted values: the higher the correlation, the more closely the predictions track the actuals.
As the result shows, the correlation between the actual and predicted y1 values of the test data is 1 when rounded, meaning the predictions move almost perfectly in step with the actual values.
Note, however, that correlation only measures how linearly related the two series are, not how close they are in absolute terms, so it is worth checking additional accuracy metrics as well.
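A scale-aware complement to correlation is the root mean squared error (RMSE), which reports the typical prediction error in the units of y1. This sketch is our own addition, not one of the lab's required metrics:

```r
# Sketch: RMSE of the test-set predictions (lower is better, in units of y1)
rmse <- sqrt(mean((test_pred_act$predicted - test_pred_act$actuals)^2))
rmse
```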
6.3 Min-max accuracy. Another way to check the agreement between predicted and actual values is the min-max accuracy, which averages the row-wise ratio of the smaller to the larger of the two values, as shown in the code below.
minmax_acc <- mean(apply(test_pred_act, 1, min) / apply(test_pred_act, 1, max))
minmax_acc
## [1] 0.9953695
Observation:
As per the result, the min-max accuracy is 99.54%, which indicates that the predicted and actual values are almost perfectly in agreement; a perfect model would score 100%.
6.4 MAPE calculation: the Mean Absolute Percentage Error (MAPE) indicates how far the model's predictions are, on average, from the actual values, expressed as a proportion of the actuals.
# The MAPE can be calculated with the expression below
mape <- mean(abs(test_pred_act$predicted - test_pred_act$actuals) / test_pred_act$actuals)
mape
## [1] 0.0149498
Interpretation:
The MAPE result shows that the model's predictions are on average only about 1.49% off from the actual values of the test data, another indication that the model fits the test data very well.