Statistics Cheat Sheet

School: Concordia University
Course: STAT 315
Subject: Statistics
Date: Jan 9, 2024
Uploaded by LieutenantSpider3956

1. What is out-of-range data?
The range is the spread of your data from the lowest to the highest value in the distribution. It is a commonly used measure of variability; along with measures of central tendency, measures of variability give you descriptive statistics for summarizing your data set. The range is calculated by subtracting the lowest value from the highest value. A large range means high variability in a distribution; a small range means low variability. "Out of range" normally means that a value falls outside the expected range of valid values: a result too far above or below the statistically plausible range for the variable being measured.

2. What are the two types of missing data?
- MCAR (missing completely at random): acceptable; continuous variables can be replaced with a function (only replace up to 5% of values across or down a variable).
- MNAR (missing not at random): may indicate a problem with the design or the question; exclude these cases.
- Do not replace: categorical variables, demographics, or MNAR data.
- How to replace: substitute the mean or a regression estimate (both reduce variance), or use the expectation-maximization algorithm (which uses multiple imputation and is the best approach).
- Why keep as much data as possible: it is easier to keep data than to collect more, and more data means more power and better representation of the population (if you do delete cases, be strict, e.g. p = .001).

3. Which of these two types would you feel comfortable replacing with an algorithm?
Replacing missing data with an algorithm is more appropriate for MCAR data, since the missingness occurs randomly and is not related to the value of the missing data or to any other variables in the dataset. In that case the algorithm can estimate the missing values without introducing bias or distorting the results. For MNAR data, however, replacing the missing data with an algorithm can introduce bias, since the missingness is related to the value of the missing data or to other variables in the dataset.

4. What is R^2?
R^2 is a statistical measure used to evaluate the goodness of fit of a linear regression model.
It represents the proportion of the variance in the dependent variable that is explained by the independent variable(s) in the model. R^2 takes values between 0 and 1, with higher values indicating a better fit. An R^2 of 0 means the independent variable(s) explain none of the variance in the dependent variable; an R^2 of 1 means they explain all of it.

5. What is Adjusted R^2, and how is it different from R^2?
Adjusted R^2 is a modified version of R^2 that takes into account the number of independent variables in a linear regression model. While R^2 measures the proportion of variance in the dependent variable explained by the independent variable(s), Adjusted R^2 adjusts for the number of independent variables, penalizing the addition of irrelevant ones. The formula is as follows:
Adjusted R^2 = 1 - [(1 - R^2)(n - 1) / (n - p - 1)]

where n is the number of observations and p is the number of independent variables in the model. Adjusted R^2 can take values between negative infinity and 1, with higher values indicating a better fit. Like R^2, an Adjusted R^2 of 1 means the independent variables explain all of the variance in the dependent variable. Unlike R^2, Adjusted R^2 can also be negative, indicating that the model fits worse than a model with no independent variables. Adjusted R^2 is useful for comparing regression models with different numbers of independent variables, since it adjusts the goodness-of-fit measure for model size; it should still be used alongside other statistical measures when evaluating predictive performance.

6. Given an example, identify the appropriate analysis.
(Likely a linear mixed-effects (LME) model; my original answer was wrong.)

7. How do you account for the variance in the two variables in the above problem?
- Descriptive statistics: measures of variability, such as the range, variance, and standard deviation, describe the spread of the data.
- Inferential statistics: in hypothesis testing, variance is accounted for through the standard error of the mean or the standard error of the difference between two means.
- Regression analysis: the variance in the dependent variable is modeled as a function of one or more independent variables; the estimated coefficients represent the amount of variance in the dependent variable explained by those variables.
- ANOVA: analysis of variance tests for differences in means between two or more groups, taking into account the variance of the variables within and between groups.

8. What is the sign that you should use the bootstrap-t technique instead of the percentile bootstrap?
The choice between the bootstrap-t technique and the percentile bootstrap depends on the sample size, the population distribution, and the purpose of the analysis. Bootstrap-t is preferred for small samples or unknown population distributions; the percentile bootstrap is preferred for large samples with known or symmetric population distributions.

9. What parameter should you not assume with the bootstrap-t technique?
Do not assume that the bootstrapped samples are independent and identically distributed. Examine the data carefully to identify and account for any dependencies or non-i.i.d. characteristics, to avoid biased estimates of population parameters.
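As a companion to questions 8 and 9, the percentile bootstrap can be sketched in a few lines: resample the data with replacement many times, compute the statistic on each resample, and take quantiles of the resulting distribution as the confidence interval. This is a minimal standard-library sketch; the function name, sample data, and seed are illustrative, not from the course.

```python
import random

def percentile_bootstrap_ci(data, stat=lambda xs: sum(xs) / len(xs),
                            n_boot=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample with replacement, compute the
    statistic on each resample, and take quantiles of the bootstrap
    distribution as the interval endpoints."""
    rng = random.Random(seed)
    boot = sorted(stat([rng.choice(data) for _ in data])
                  for _ in range(n_boot))
    lo = boot[int((alpha / 2) * n_boot)]
    hi = boot[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

sample = [4.1, 5.2, 6.0, 4.8, 5.5, 5.9, 4.3, 6.2, 5.1, 4.9]
print(percentile_bootstrap_ci(sample))
```

A bootstrap-t interval would instead studentize each resampled statistic (dividing by its bootstrap standard error) before taking quantiles, which is why it behaves better for small samples, as question 8 notes.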
10. What are p-hacking and the file-drawer problem, and what is the relationship between the two?
P-hacking is manipulating data or statistical tests to achieve statistically significant results; the file-drawer problem is the selective publication of studies that show significant results while non-significant ones go unpublished. P-hacking contributes to the file-drawer problem by encouraging researchers to report only significant findings, creating a biased view of the literature. Both can be addressed through transparent, reproducible research practices and rigorous statistical methods.

11. What are the advantages of Bayesian statistics over just p-values?
- More intuitive and informative interpretation of probabilities.
- Prior knowledge can be incorporated for more accurate parameter estimates.
- Posterior probabilities can be calculated to evaluate hypotheses or models.
- More complex and flexible models can be handled.
- Overall, a more powerful and useful approach to statistical inference than traditional p-values.

Other topics to review:
- Beyond hypothesis testing
- The replication crisis
- Problems with null hypothesis testing
- Power: what it is and what might increase it
- What the Type I error rate is
- Bootstrapping
- When we can rely on the central limit theorem, and whether that means the data are normally distributed
- Which analysis is most effective (t-test vs. ANOVA)
- Big picture of effect size: why it is useful and what the issues are
- Describe concepts: what they are
- Important data-screening tests needed for particular analyses
- The difference between between-groups, within-groups, and repeated-measures ANOVA
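The Adjusted R^2 formula from question 5 can be checked numerically. This sketch implements the formula directly (the function name and example numbers are illustrative) and shows the penalty for adding predictors: with R^2 held fixed, a model with more independent variables gets a lower adjusted value.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1),
    where n is the number of observations and p the number of
    independent variables. Penalizes irrelevant predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R^2 = 0.80 on n = 50 observations, differing only in p:
print(adjusted_r2(0.80, n=50, p=2))   # ~0.7915
print(adjusted_r2(0.80, n=50, p=10))  # ~0.7487
```

The second call is lower even though raw R^2 is identical, which is exactly why Adjusted R^2 is preferred when comparing models with different numbers of predictors.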