Final Project Part 1 Instructions 1

School: University of Toronto
Course: 302
Subject: Statistics
Date: Jan 9, 2024

Final Project Part 1
Research Proposal and Data Introduction
Due: October 11, 2023 by 6:00PM ET

Goal of the Assessment:
  • To give students the opportunity to work on a topic of interest to them and to be creative about this topic.
  • To experience the process of conducting a small literature review and incorporating the knowledge gained into an analysis.
  • To think about whether a research question and/or a dataset is appropriate for use with linear regression.
  • To create a draft of the components to be included in the introduction section of a report, as well as summary figures and/or tables for the results section.

Learning Outcomes being Assessed:
  • Apply multiple linear models on various datasets using R statistical software.
  • Differentiate the relationships modelled using qualitative predictors, interactions between predictors, and continuous predictors.
  • Create appropriate residual plots to evaluate model assumptions for a given dataset using software.
  • Recognize distinct patterns in appropriate residual plots and correctly conclude which assumption is violated.
  • Report the results of a residual plot analysis and recommend a course of action.

Instructions:
1. Students will need to locate open-source data in an area of interest to them that meets the dataset requirements listed below. Some examples could be (but are certainly not limited to) sports, medicine, public health, economics, video games, literature, etc.
2. Students will then need to define an explicit research question using the information in that dataset. Note that students will need to ensure and show that linear regression can be used to answer this question with this dataset.
3. Students must also locate 3 peer-reviewed academic papers related to their specific research question or topic of interest. Students will need to describe how each article relates back to their proposed research question, and rank each article by its usefulness in informing them about the population relationship being estimated.
4. Students will need to select at least 5 variables from their dataset to be predictors in a multiple linear model, with at least one of these five being categorical in nature. A justification for why each variable is chosen must be provided. The model is then fit, and a complete residual analysis is carried out to assess the model assumptions.
5. Lastly, students will provide a table that numerically summarizes each variable used in their preliminary model, with an informative caption that highlights any interesting features of the variables (e.g., skews, possible outliers or nonsensical observations, high spread, missing values).

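As a rough illustration of instruction 4, the sketch below fits a multiple linear model with one categorical predictor and draws a few common residual plots using base R only. The data are simulated and every variable name (`y`, `x1` to `x4`, `group`) is hypothetical; this is one possible workflow under those assumptions, not the required one, and the course template may ask for additional plots.

```r
# Simulated stand-in data; all variable names here are hypothetical.
set.seed(302)
n <- 1000
dat <- data.frame(
  x1    = rnorm(n),
  x2    = runif(n, 0, 10),
  x3    = rnorm(n, mean = 5, sd = 2),
  x4    = rexp(n, rate = 1),
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
dat$y <- 2 + 1.5 * dat$x1 - 0.3 * dat$x2 + 0.8 * dat$x3 +
  0.5 * dat$x4 + ifelse(dat$group == "B", 1, 0) + rnorm(n)

# Multiple linear model: four numerical predictors and one categorical predictor.
# R creates indicator variables for the factor automatically (baseline level "A").
fit <- lm(y ~ x1 + x2 + x3 + x4 + group, data = dat)

res         <- resid(fit)
fitted_vals <- fitted(fit)

# Write the plots to a temporary PDF so the script also runs non-interactively.
plot_file <- file.path(tempdir(), "residual_plots.pdf")
pdf(plot_file)
par(mfrow = c(2, 2))

# Residuals vs fitted values: curvature suggests non-linearity,
# a fan shape suggests non-constant variance.
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Residuals vs one numerical predictor (repeat for the others as needed).
plot(dat$x1, res, xlab = "x1", ylab = "Residuals", main = "Residuals vs x1")
abline(h = 0, lty = 2)

# Residuals by level of the categorical predictor.
boxplot(res ~ dat$group, xlab = "group", ylab = "Residuals",
        main = "Residuals by group")

# Normal QQ plot: strong departures from the line suggest non-normal errors.
qqnorm(res)
qqline(res)

invisible(dev.off())
```

In an R Markdown document the `pdf()` and `dev.off()` calls can be dropped, since the plots are rendered directly into the knitted output.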
Dataset Requirements:
  • Dataset must be open-source and the website where it was found/downloaded from must be provided.
  • MUST contain at least 1000 observations (i.e., rows).
  • MUST contain 1 response variable suitable for linear regression and at least 9 predictor variables. Categorical variables with multiple levels count as 1 variable. Since at least one predictor will need to be categorical, you may convert one of your numerical variables to categorical if no such variable is available in your downloaded dataset. However, you will need to justify your choice of variable and categorization in part B.2. of the proposal.
  • Should NOT be from an educational resource, such as a textbook dataset. If you’re not sure, please ask the instructor or one of the TAs.
  • If the dataset was found in a data repository (e.g., Kaggle, UCI Repository, etc.), you MUST ensure that your research question is different from the original usage of the data.

Proposal Format:
Groups must complete each portion of the Final Project Part 1 Template. The proposal document should be no more than 5 pages in length, which includes plots and tables. Keep responses brief and to the point while ensuring that you address each point noted in the rubric requirements for that portion of the proposal.

What to Submit:
Only ONE member of the group should submit ALL required submission components. A complete submission to Quercus will include:
1. The original downloaded dataset as a CSV file (if file is too large, save it on a cloud-based storage service (e.g. OneDrive) and include a shareable link as a comment on your submission).
2. The cleaned dataset containing the variables used in your preliminary model and data summary as a CSV file (if file is too large, save it on a cloud-based storage service (e.g. OneDrive) and include a shareable link as a comment on your submission).
3. The completed proposal template, saved as a PDF.
4. The R code should be provided in the appendix or in a separate R Markdown file containing the code used to subset and clean the data, fit the model, produce a summary table, and conduct the residual analysis for checking assumptions.

Failure to meet these submission requirements, including incorrect format of components, missing components, and cloud links that do not allow shared access, will result in a one-mark deduction on the grade of the proposal.

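The dataset requirements and the cleaned-dataset submission component could be checked and produced along the following lines. This is a minimal sketch on simulated data: the column names (`response`, `pred1` to `pred9`), the tertile cut points, the choice to drop incomplete rows, and the output file name are all assumptions made for illustration, and each would need its own justification in a real proposal.

```r
# Simulated stand-in for a downloaded dataset; column names are hypothetical.
set.seed(1)
n <- 1200
raw <- data.frame(
  response = rnorm(n, mean = 50, sd = 10),
  matrix(rnorm(n * 9), nrow = n,
         dimnames = list(NULL, paste0("pred", 1:9)))
)
raw$pred2[sample(n, 20)] <- NA  # inject some missing values for the example

# Check the size requirements: at least 1000 rows and at least 9 predictors.
stopifnot(nrow(raw) >= 1000)
stopifnot(ncol(raw) - 1 >= 9)   # all columns except the response

# If no categorical predictor exists, one numerical variable may be converted.
# Here pred1 is split into three groups at its tertiles (an arbitrary choice).
cuts <- quantile(raw$pred1, probs = c(0, 1/3, 2/3, 1))
raw$pred1_cat <- cut(raw$pred1, breaks = cuts,
                     labels = c("low", "medium", "high"),
                     include.lowest = TRUE)

# Keep only the variables used in the preliminary model and drop incomplete rows
# (one possible cleaning choice; others, such as imputation, exist).
keep    <- c("response", "pred1_cat", "pred2", "pred3", "pred4", "pred5")
cleaned <- na.omit(raw[, keep])

# Save the cleaned dataset as a CSV file for submission.
out_file <- file.path(tempdir(), "cleaned_data.csv")
write.csv(cleaned, out_file, row.names = FALSE)
```

For a real project, `raw` would come from `read.csv()` on the original downloaded file, which should itself be submitted unmodified.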
Dataset Resources:
Should your group have difficulty locating a suitable dataset that matches the group's interests and meets the dataset requirements, your group can consider using one of the datasets below:
  • Ames Housing dataset
  • NHANES survey dataset
  • AirBnB dataset (requires creating a free account)
  • Million Song dataset
  • NBA player dataset

Library Resources (for locating and citing academic papers):
  • How to search for academic articles
  • Using search operators to find articles
  • Limiting search to peer-reviewed articles
  • Why and how to cite your references
  • Help getting the correct citation format
  • Exporting a citation

RMarkdown Resources:
  • Settings for displaying or not displaying R code in knitted document
  • Adding captions and other plotting features
  • Creating tables in RMarkdown using kable or manually
  • Exporting plots in RStudio
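
To connect the RMarkdown resources to the summary table asked for in the instructions, here is a small sketch that builds a numerical summary in base R and, if the knitr package happens to be installed, prints it with `knitr::kable()` and a caption. The data are simulated, and the variable names (`price`, `area`, `age`), the helper `summarise_var()`, and the caption text are hypothetical. In an R Markdown chunk, the chunk option `echo = FALSE` hides the code in the knitted document.

```r
# Simulated stand-in data; variable names are hypothetical.
set.seed(42)
n <- 1000
dat <- data.frame(
  price = rlnorm(n, meanlog = 12, sdlog = 0.4),  # right-skewed
  area  = rnorm(n, mean = 1500, sd = 300),
  age   = rpois(n, lambda = 30)
)
dat$area[sample(n, 15)] <- NA  # some missing values to report

# Hypothetical helper: numerical summary of one variable.
summarise_var <- function(x) {
  c(Mean    = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    Min     = min(x, na.rm = TRUE),
    Median  = median(x, na.rm = TRUE),
    Max     = max(x, na.rm = TRUE),
    Missing = sum(is.na(x)))
}

# One row per variable, one column per statistic.
tab <- t(sapply(dat, summarise_var))

# Use kable for a captioned table when knitr is available, else plain printing.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(
    tab, digits = 2,
    caption = "Summary of model variables. price is right-skewed; area has 15 missing values."
  ))
} else {
  print(round(tab, 2))
}
```

Categorical variables would need a separate summary, for example counts per level from `table()`.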