Assignment 4_MKTG 746_Group5
School
York University *
Course
746
Subject
Statistics
Date
Apr 3, 2024
Type
docx
Pages
15
Uploaded by LieutenantSteel8402
Big Data and Predictive Analysis
Assignment 4 (Lab 2 Part 2)
Predictive Modeling Using Regression-SAS Miner
Submitted by
REGRESSION EXERCISE
1. Predictive Modeling Using Regression
a. Return to the Chapter 3 Organics diagram in My Project. Use the StatExplore tool on the ORGANICS data source.
1) First, the StatExplore node is connected to the ORGANICS node.
2) The StatExplore node results are generated.
b. In order to prepare for regression, should missing values be imputed? Why do you think we should impute?
c. What changed after imputing?
d. Add an Impute node from the Modify tab into the diagram and connect it to the Data Partition node. Set the node to impute U for unknown class variable values and the overall mean for unknown interval variable values. Create imputation indicators for all imputed inputs.
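The imputation rule in step d can be sketched outside SAS Miner in a few lines of Python. This is only an illustration under stated assumptions: missing values are represented as None, the helper name impute and the M_ prefix for indicator columns are hypothetical, and the mean is taken over whatever rows are passed in (in practice it would come from the training partition).

```python
from statistics import mean

def impute(rows, interval_cols, class_cols):
    """Return copies of rows with missing values (None) filled in.

    Interval inputs get the mean of the observed values, class inputs get
    the level "U", and every listed column gains a 0/1 indicator M_<name>
    (hypothetical naming) that records whether the value was imputed.
    """
    means = {
        col: mean(r[col] for r in rows if r[col] is not None)
        for col in interval_cols
    }
    result = []
    for r in rows:
        new = dict(r)
        for col in interval_cols:
            new["M_" + col] = int(r[col] is None)
            if r[col] is None:
                new[col] = means[col]
        for col in class_cols:
            new["M_" + col] = int(r[col] is None)
            if r[col] is None:
                new[col] = "U"
        result.append(new)
    return result

# Toy rows, not taken from the ORGANICS data.
rows = [
    {"DemAge": 40, "DemGender": "F"},
    {"DemAge": None, "DemGender": "M"},
    {"DemAge": 60, "DemGender": None},
]
imputed = impute(rows, ["DemAge"], ["DemGender"])
print(imputed[1])
print(imputed[2])
```

The indicator columns let a later model distinguish an observed value from a filled-in one, which is the purpose of creating imputation indicators.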
e. Add a Regression node to the diagram and connect it to the Impute node. Choose stepwise as the selection model and the validation error as the selection criterion.
f. Choose stepwise as the selection model and the validation error as the selection criterion.
g.
h. Run the Regression node and view the results. Maximize the Effect Plot.
i. Which variables are included in the final model? Which variables are important in this model? What is the validation ASE?
i) Go to line 664 in the Output window.
j) The odds ratios indicate the effect that each input has on the logit score.
k) Interpret the odds ratio estimates:
1. For IMP_DemAffl, the odds ratio of 1.283 suggests that for every one-unit increase in affluence grade, the odds of purchasing organic products increase by approximately 28%, assuming all else remains equal.
2. For IMP_DemAge, the odds ratio of 0.947 implies that for each additional year of age, the odds of buying organic products decrease by roughly 5%, assuming all else remains equal.
3. For IMP_DemGender, the odds ratio of 6.967 for females versus unknown suggests that women have almost seven times the odds of purchasing organic products compared with those whose gender is unknown. Likewise, the odds ratio of 2.899 for males versus unknown suggests that men have nearly three times the odds of buying organic products compared with those whose gender is unknown.
4. Because both ratios use the same reference level (unknown), dividing them shows that females have about 2.4 times the odds of purchasing compared with males (6.967 / 2.899 ≈ 2.40).
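The arithmetic behind these odds ratio readings can be checked with a short Python sketch. The odds ratios are the ones reported above; the function and variable names are hypothetical. An odds ratio is converted to a percent change in odds with (OR - 1) x 100, and two odds ratios that share a reference level can be divided to compare the two non-reference levels directly.

```python
def pct_change_in_odds(odds_ratio):
    """Percent change in the odds for a one-unit increase in the input."""
    return (odds_ratio - 1.0) * 100.0

# Odds ratio estimates quoted in the text.
or_affl = 1.283
or_age = 0.947
or_female_vs_unknown = 6.967
or_male_vs_unknown = 2.899

affl_pct = pct_change_in_odds(or_affl)
age_pct = pct_change_in_odds(or_age)

# Both gender ratios are relative to "unknown", so their quotient is the
# odds ratio for females versus males.
female_vs_male = or_female_vs_unknown / or_male_vs_unknown

print(f"DemAffl: {affl_pct:+.1f}% odds per unit")
print(f"DemAge: {age_pct:+.1f}% odds per year")
print(f"Female vs male odds ratio: {female_vs_male:.2f}")
```

Note that these are statements about odds, not probabilities; "times more likely" is only a loose paraphrase.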
l) The validation ASE is given in the Fit Statistics window.
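For reference, average squared error on a binary target can be sketched as the mean squared difference between the 0/1 outcome and the predicted probability. The snippet below is a minimal illustration with made-up validation values, not output from the Organics model, and the function name is hypothetical.

```python
def average_squared_error(targets, probabilities):
    """Mean of (y - p)^2 over all cases; lower values indicate better fit."""
    if not targets or len(targets) != len(probabilities):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(y - p) ** 2 for y, p in zip(targets, probabilities)]
    return sum(squared) / len(squared)

# Hypothetical validation cases: actual outcome and predicted probability.
y_valid = [1, 0, 0, 1]
p_valid = [0.8, 0.2, 0.4, 0.6]

ase = average_squared_error(y_valid, p_valid)
print(f"Validation ASE: {ase:.4f}")
```

Computing this on the validation partition rather than the training partition is what makes it usable as a model selection criterion.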
PART 2
a. In preparation for regression, are any transformations of the data warranted? Why or why not?
Yes. Outliers and heavily skewed inputs can distort a regression model's results, so before fitting the model we identify the inputs that are highly skewed as candidates for transformation.
i. Open the Variables window of the Regression node. Select the imputed interval inputs.
ii. Select Explore. The Explore window appears.
b. Both Card Tenure and Affluence Grade have moderately skewed distributions. Applying a log transformation to these inputs might improve the model fit.
c. Disconnect the Impute node from the Data Partition node.
d. Add a Transform Variables node from the Modify tab to the diagram and connect it to the Data Partition node.
e. Connect the Transform Variables node to the Impute node.
f. Apply a log transformation to the DemAffl and PromTime inputs.
i. Open the Variables window of the Transform Variables node.
ii. Set Method to Log for the DemAffl and PromTime inputs. Select OK to close the Variables window.
g. Run the Transform Variables node. Explore the exported training data. Did the transformations result in less skewed distributions?
Yes, the transformations resulted in less skewed distributions.
i. The easiest way to explore the created inputs is to open the Variables window in the subsequent Impute node. Make sure that you update the Impute node before opening its Variables window.
ii. With the LOG_DemAffl and LOG_PromTime inputs selected, select Explore.
The distributions are nicely symmetric.
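The effect of a log transformation on skewness can be illustrated with a small Python sketch. The values below are invented right-skewed data, not DemAffl or PromTime, and the sketch assumes strictly positive values (a zero would need an offset before taking the log). Skewness is computed as the third central moment divided by the 1.5 power of the variance.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed input with a long upper tail.
raw = [1, 2, 2, 3, 4, 6, 9, 15, 30]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

A skewness closer to zero after the transformation corresponds to the more symmetric histograms seen in the Explore window.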
h. Rerun the Regression node. Do the selected variables change? How about the validation ASE?
The selected variables changed and the validation ASE changed to 0.137535.
i. Go to line 664 of the Output window.
ii. Apparently the log transformation actually increased the validation ASE slightly.
j. Create a full second-degree polynomial model. How does the validation average squared error for the polynomial model compare to the original model?
There is a slight reduction in the average squared error for the polynomial regression compared with the original model.
i. Add another Regression node to the diagram and rename it Polynomial Regression.
ii. Make the indicated changes to the Polynomial Regression Properties panel and run the node.
iii. Go to line 1598 of the results output window.
iv. The polynomial regression node adds additional interaction terms.
v. Examine the Fit Statistics window.
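To show what a full second-degree polynomial adds, the sketch below enumerates the candidate terms for a set of inputs: the main effects, the squared terms, and the two-way interactions. The input names are used only as an example, the helper name is hypothetical, and the list is an illustration of the term structure rather than the exact effects SAS Miner generated.

```python
from itertools import combinations_with_replacement

def second_degree_terms(inputs):
    """Return the names of all terms in a full second-degree polynomial.

    Pairing each input with itself gives the squared terms; pairing it
    with every other input gives the two-way interactions.
    """
    linear = list(inputs)
    quadratic = [
        "*".join(pair) for pair in combinations_with_replacement(inputs, 2)
    ]
    return linear + quadratic

terms = second_degree_terms(["DemAffl", "DemAge", "PromTime"])
for term in terms:
    print(term)
```

With k inputs this produces k main effects plus k(k+1)/2 second-degree terms, which is why the candidate set grows quickly and why a selection method and validation data are still needed.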
k) In your own words, describe what you did in this assignment and why you had to do each of these steps. Also, how would you describe the IVs that have an impact on the DV?
Using the Regression node in SAS Miner, we performed predictive modelling. The process involved preparing the data, exploring it, transforming it, and then running the regression analysis. The steps and the reasoning behind them are as follows:
1. StatExplore and Data Imputation: We first used the StatExplore tool to examine the ORGANICS data source, which allowed us to find missing values. Imputation was required so that cases with missing inputs were not lost and the model was not biased; missing values were filled with the mean for interval variables and a default level for class variables.
2. Establishing the Regression Model: We added an Impute node and set it to use the mean for missing interval variables and U for missing class variables, with indicators for the imputed inputs. This ensured that we had a complete data set for modelling and a way to score new cases.
3. Model Building and Analysis: To determine which inputs were most useful, we used stepwise selection in the Regression node with validation error as the selection criterion. DemGender, DemAge, and DemAffl were the inputs selected. To evaluate the model's fit, we examined the validation ASE (average squared error).
4. Interpreting the Odds Ratios: Our dependent variable (DV) is whether a customer buys organic products. The odds ratios in the results indicate the influence of each independent variable (IV) on the odds of that outcome. For example, the odds increased by around 28% for every one-unit increase in DemAffl (IV).
5. Data Transformation: Regression results can be affected by outliers and skew, both of which we found in the inputs. We applied log transformations to inputs such as Card Tenure and Affluence Grade to produce less skewed distributions, with the aim of improving model fit.
6. Refining the Model: We disconnected some of the nodes, inserted a Transform Variables node, and reran the regression to see how the validation ASE and the selected variables changed. The validation ASE rose slightly after the transformation, which shows that a transformation does not always improve the model.
7. Polynomial Regression: To compare the validation ASE with that of the original model, we built a full second-degree polynomial model. We found a small decrease in error.
Each stage in this procedure improved the data or the model so that the predictive analysis would be as accurate and reliable as possible. The IVs that affect the DV (DemAffl, DemAge, and DemGender) were examined to determine how they change the odds of buying organic products. All of these steps were important for developing a predictive model that can support reasonable inferences from the results.