Lab 2 Part 2 W21 Regression_GladysVillafuerte_301264680.docx

School: Centennial College
Course: MISC
Subject: Computer Science
Date: Dec 6, 2023
Type: docx
Pages: 17

Big Data and Predictive Analysis
Assignment 4 (Lab 2 Part 2)
Predictive Modeling Using Regression - SAS Miner
Submitted to: Prof. David Parent
Submitted by: Gladys Anne Villafuerte - 301264680, Abimbola Babasola - 301249147

REGRESSION EXERCISE
1. Predictive Modeling Using Regression
a. Return to the Chapter 3 Organics diagram in My Project. Use the StatExplore tool on the ORGANICS data source.
1) First, a StatExplore node is connected to the ORGANICS node.
2) The StatExplore node is run and its results are generated.
b. In order to prepare for regression, should missing values be imputed? Why do you think we should impute? Go to line 37 of the Output window: several of the class inputs have missing values. Go to line 65 of the Output window: most of the interval inputs also have missing values.
c. What changed after imputing?
Type your answer here: Yes. Imputation is necessary to avoid obtaining a biased model; its purpose is to substitute values for the missing ones.
d. Add an Impute node from the Modify tab to the diagram and connect it to the Data Partition node. Set the node to impute U for unknown class variable values and the overall mean for unknown interval variable values. Create imputation indicators for all imputed inputs.
Type your answer here: Imputing the data before it reaches the Decision Tree node is unnecessary, because decision trees have their own methods for handling missing values.
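
The imputation settings in step d can be sketched outside SAS in plain Python. This is only an illustration, not the Impute node's actual implementation: the function name `impute_with_indicators` and the toy rows are hypothetical, missing values are assumed to be represented as `None`, and the `IMP_` and `M_` prefixes are borrowed from the variable names that appear in the regression output later in this report.

```python
from statistics import mean

def impute_with_indicators(rows, class_vars, interval_vars):
    """Return copies of rows with IMP_<name> and M_<name> columns added.

    Assumption: a missing value is stored as None.
    """
    # Overall mean of each interval input, using non-missing values only.
    means = {
        v: mean(r[v] for r in rows if r[v] is not None)
        for v in interval_vars
    }
    out = []
    for r in rows:
        new = dict(r)
        for v in class_vars:
            missing = r[v] is None
            new[f"IMP_{v}"] = "U" if missing else r[v]   # U = unknown class level
            new[f"M_{v}"] = int(missing)                  # 1 if the value was imputed
        for v in interval_vars:
            missing = r[v] is None
            new[f"IMP_{v}"] = means[v] if missing else r[v]
            new[f"M_{v}"] = int(missing)
        out.append(new)
    return out

# Made-up toy rows, not taken from the ORGANICS data.
rows = [
    {"DemGender": "F", "DemAffl": 10},
    {"DemGender": None, "DemAffl": 6},
    {"DemGender": "M", "DemAffl": None},
]
imputed = impute_with_indicators(rows, ["DemGender"], ["DemAffl"])
```

The indicator columns let a later model learn whether "the value was missing" is itself predictive, which is why the regression output can contain both an `IMP_` and an `M_` version of the same input.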
e. Add a Regression node to the diagram and connect it to the Impute node.
f. Choose stepwise as the selection model and the validation error as the selection criterion.
g. Run the Regression node and view the results. Maximize the Effect Plot.
h. Which variables are included in the final model? Which variables are important in this model? What is the validation ASE?
Type your answer here:
Variables included: DemAffl, DemGender, DemAge
Important variables: DemGender, DemAffl
Validation ASE: 0.137156
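
As a rough illustration of what the validation ASE measures, the sketch below computes an average squared error for a binary target as the mean squared difference between the 0/1 outcome and the predicted probability. This definition is an assumption made for the example; the exact bookkeeping inside SAS may differ, and the function name and toy numbers are hypothetical.

```python
def average_squared_error(targets, predicted_probs):
    """Mean of (actual - predicted probability)^2 over all cases."""
    if len(targets) != len(predicted_probs):
        raise ValueError("targets and predictions must have the same length")
    if not targets:
        raise ValueError("at least one case is required")
    squared_errors = ((y - p) ** 2 for y, p in zip(targets, predicted_probs))
    return sum(squared_errors) / len(targets)

# Made-up validation cases: actual outcome and predicted probability of the event.
targets = [1, 0, 0, 1]
probs = [0.8, 0.1, 0.4, 0.6]
ase = average_squared_error(targets, probs)
```

A lower value means the predicted probabilities sit closer to the observed outcomes, which is why the later steps compare models by their validation ASE.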
i) Go to line 664 in the Output window.
j) The odds ratios indicate the effect that each input has on the logit score.
k) Interpret the odds ratio estimate:
l) The validation ASE is given in the Fit Statistics window.
Type your answer here: The given estimates are odds ratios, which measure the relative odds of an event occurring between two groups (or, for an interval input, per one-unit increase). An odds ratio greater than 1 indicates that the odds of the event are higher in the first group than in the second; an odds ratio below 1 indicates that they are lower. The interpretations for each of the given estimates are:
- IMP_DemAffl: For every one-unit increase in DemAffl, the odds of the event occurring increase by 28.3%.
- IMP_DemAge: For every one-unit increase in DemAge, the odds of the event occurring decrease by 5.3%.
- IMP_DemGender F vs U: The odds of the event occurring are almost 7 times as high for females as for individuals with an unknown gender.
- IMP_DemGender M vs U: The odds of the event occurring are almost 3 times as high for males as for individuals with an unknown gender.
- M_DemAffl 0 vs 1: M_DemAffl is the imputation indicator for DemAffl. The odds of the event occurring are 29.2% lower for individuals whose DemAffl was observed (indicator 0) than for those whose DemAffl was missing and imputed (indicator 1).
- M_DemAge 0 vs 1: The odds of the event occurring are 20.4% lower for individuals whose DemAge was observed (indicator 0) than for those whose DemAge was missing and imputed (indicator 1).
- M_DemGender 0 vs 1: The odds of the event occurring are 31.5% lower for individuals whose DemGender was observed (indicator 0) than for those whose DemGender was missing and imputed (indicator 1).
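
The percentages in the answer above follow from a simple conversion, sketched here in Python. The odds ratios 1.283 and 0.947 are the values implied by the 28.3% and 5.3% figures in the answer, not copied from the Output window, and the helper names are hypothetical.

```python
import math

def percent_change_in_odds(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0

def logit_coefficient(odds_ratio):
    """Logistic regression coefficient that corresponds to an odds ratio."""
    return math.log(odds_ratio)

def odds_ratio_over(odds_ratio_per_unit, units):
    """Odds ratio for a change of several units in an interval input."""
    return odds_ratio_per_unit ** units

affl_change = percent_change_in_odds(1.283)   # positive: odds go up per unit
age_change = percent_change_in_odds(0.947)    # negative: odds go down per unit
affl_two_units = odds_ratio_over(1.283, 2)    # effects multiply, they do not add
```

Because odds ratios multiply, a two-unit increase in DemAffl corresponds to 1.283 squared (about 1.65), not to twice 28.3%.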
PART 2
a. In preparation for regression, are any transformations of the data warranted? Why or why not?
Answer: Yes. Outliers and highly skewed inputs can have a significant impact on the accuracy of regression models, so transforming the inputs that have high skewness can help improve the overall performance of the model.
i. Open the Variables window of the Regression node. Select the imputed interval inputs.
ii. Select Explore. The Explore window appears.
b. Both Card Tenure (PromTime) and Affluence Grade (DemAffl) have moderately skewed distributions. Applying a log transformation to these inputs might improve the model fit.
c. Disconnect the Impute node from the Data Partition node.
d. Add a Transform Variables node from the Modify tab to the diagram and connect it to the Data Partition node.
e. Connect the Transform Variables node to the Impute node.
f. Apply a log transformation to the DemAffl and PromTime inputs.
i. Open the Variables window of the Transform Variables node.
ii. Select Method Log for the DemAffl and PromTime inputs. Select OK to close the Variables window.
g. Run the Transform Variables node. Explore the exported training data. Did the transformations result in less skewed distributions?
Answer: Yes, the transformation leads to distributions that are less skewed.
i. The easiest way to explore the created inputs is to open the Variables window in the subsequent Impute node. Make sure that you update the Impute node before opening its Variables window.
ii. With the LOG_DemAffl and LOG_PromTime inputs selected, select Explore.
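
The effect of step f on skewness can be checked with a small stand-alone sketch. It assumes the transform is log(x + 1), so that zero values remain valid; the offset that SAS actually applies is not stated in this report. The data are made up and the function names are hypothetical.

```python
import math

def skewness(values):
    """Population skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def log_transform(values):
    """Assumed form of the log transform: log(x + 1)."""
    return [math.log(x + 1) for x in values]

# Made-up right-skewed values, not taken from the ORGANICS data.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 30]
logged = log_transform(raw)

skew_before = skewness(raw)
skew_after = skewness(logged)
```

For right-skewed inputs such as this toy list, the log pulls in the long upper tail, so `skew_after` is closer to zero than `skew_before`.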
The distributions are nicely symmetric.
h. Rerun the Regression node. Do the selected variables change? How about the validation ASE?
Answer: The validation ASE changed from 0.137156 to 0.138204.
i. Go to line 664 of the Output window. The log transformation actually increased the validation ASE slightly.
i. Create a full second-degree polynomial model. How does the validation average squared error for the polynomial model compare to the original model?
Answer: The extra terms decrease the validation ASE slightly.
i. Add another Regression node to the diagram and rename it Polynomial Regression.
ii. Make the indicated changes to the Polynomial Regression Properties panel and run the node.
iii. Go to line 1598 of the results output window.
iv. The polynomial regression node adds additional interaction terms. v. Examine the Fit Statistics window.
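
To make "full second-degree polynomial model" concrete, the sketch below lists the terms such a model would contain for interval inputs: each input, each squared input, and each pairwise interaction. It is an illustration only; it does not reproduce how the Regression node builds its terms, class inputs are not handled, and the function name and row values are hypothetical.

```python
from itertools import combinations_with_replacement

def second_degree_terms(row, inputs):
    """Return the linear, squared, and pairwise-interaction terms for one row."""
    terms = {name: row[name] for name in inputs}            # linear terms
    for a, b in combinations_with_replacement(inputs, 2):   # squares and interactions
        terms[f"{a}*{b}"] = row[a] * row[b]
    return terms

# Made-up values for two imputed interval inputs.
row = {"IMP_DemAffl": 10.0, "IMP_DemAge": 40.0}
expanded = second_degree_terms(row, ["IMP_DemAffl", "IMP_DemAge"])
```

With two inputs this yields five terms; the count grows quickly as inputs are added, which is one reason to judge the larger model on validation error rather than training error.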
k) In your own words, describe what you did in this assignment and why you had to do each of these steps. Also, how would you describe the IVs that have an impact on the DV?
Type your answer here: Transformation leads to a distribution that is less skewed.

Typically, when completing an assignment, the first step is to carefully read and understand the task requirements. This allows the person completing the assignment to determine what needs to be done and what resources they may need. The next step is to research and gather information relevant to the assignment. This may involve reading books, articles, or other materials, conducting experiments or surveys, or analyzing data.

Once the necessary information has been gathered, the person completing the assignment should organize it and develop a plan for how to present it. This may involve creating an outline or rough draft, developing visual aids such as charts or graphs, or creating a presentation or report. The final step is to review and edit the assignment to ensure that it is complete, accurate, and well written. This may involve checking for errors in spelling or grammar, ensuring that all sources are properly cited, and making sure that the presentation is clear and effective.

The independent variables (IVs) that impact the dependent variable (DV) vary depending on the specific assignment or research question being studied. Generally, IVs are the factors that are manipulated or controlled in an experiment, while the DV is the variable that is measured or observed as a result of the IVs.