Northeastern University
Module 6 Assignment
Team: Capstone 3
Aniket Milind Telrandhe
Chirag Malik
ALY 6015: Intermediate Analytics
Mr. Zhi He
October 27, 2023
Introduction:
We've chosen a dataset from the National Incident-Based Reporting System (NIBRS) Database, encompassing data from 2000 to 2015 at the state-by-date level. This dataset comprises 327,656 records and 54 distinct variables. Our aim is to gain insights into crime occurrences within each state
during the specified timeframe, with a focus on the total number of victims.
For our analysis, we specifically used variables such as total victim count, state, year, male victim count, female victim count, total offender count, and the counts of offense types including violence, theft, drug, and sex offenses. To address this objective, we developed several regression models and selected the most accurate one; that model is used to answer the research question above.
Statistics table:
To create the summary statistics table, we selected a subset of the variables and used the stargazer function.
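As a rough sketch of that step, the code below loads the data and passes the selected columns to stargazer. The file name and the column names (total_victim, male_victim, and so on) are assumptions made for illustration, since the report does not list the exact names used.

library(stargazer)

# Load the NIBRS extract; the file name is assumed.
nibrs <- read.csv("nibrs_2000_2015.csv")

# Keep only the variables used in the report (assumed column names).
vars  <- c("total_victim", "state", "year", "male_victim", "female_victim",
           "total_offender", "violence_offense", "theft_offense",
           "drug_offense", "sex_offense")
crime <- nibrs[, vars]

# Summary statistics for the numeric columns; type = "text" prints to the
# console, while type = "html" or "latex" exports the table for the report.
stargazer(crime[sapply(crime, is.numeric)], type = "text",
          title = "Summary statistics for the selected NIBRS variables")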
The scatterplot of drug offenses against male victim count shows a large number of male victims associated with drug offenses. The corresponding scatterplot of drug offenses against female victim count shows that, compared with male victims, fewer female victims are associated with drug offenses.
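A minimal way to draw those two scatterplots in base R is shown below; it reuses the assumed column names from the sketch above, and the report's actual plotting code and colors are unknown.

# Drug offenses vs. male victims.
plot(crime$drug_offense, crime$male_victim,
     xlab = "Drug offenses", ylab = "Male victims",
     main = "Drug offenses vs. male victims", pch = 19, col = "steelblue")

# Drug offenses vs. female victims.
plot(crime$drug_offense, crime$female_victim,
     xlab = "Drug offenses", ylab = "Female victims",
     main = "Drug offenses vs. female victims", pch = 19, col = "tomato")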
The figure below shows the association between drug offenses and female victims. Most drug offenses involve roughly 700 female victims, and the number of female victims rises as the frequency of drug offenses increases.
The number of male victims in the dataset is shown in the boxplot below. Compared to female victims, there are very few male victims.
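A sketch of that boxplot comparison, again using the assumed column names:

# Side-by-side boxplots of male and female victim counts.
boxplot(crime$male_victim, crime$female_victim,
        names = c("Male victims", "Female victims"),
        ylab  = "Victim count",
        main  = "Distribution of victim counts by sex")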
Corrplot:
The correlation heatmap below displays the relationship between each pair of variables. The correlation coefficient ranges from -1 to +1: a value of -1 denotes a perfectly negative relationship, 0 denotes no association, and +1 denotes a perfectly positive relationship between two variables. The farther the coefficient is from zero, the stronger the association between the two variables, and the darker the hue used to display it.
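One common way to produce such a heatmap is with the corrplot package. The sketch below computes pairwise correlations over the numeric columns; it is an assumption about how the figure could be built, not the report's exact code.

library(corrplot)

# Pairwise correlations over the numeric columns only.
num_vars <- crime[sapply(crime, is.numeric)]
corr_mat <- cor(num_vars, use = "complete.obs")

# Darker cells correspond to correlations farther from zero.
corrplot(corr_mat, method = "color", type = "upper",
         tl.col = "black", tl.cex = 0.8)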
Linear Regression:
Linear regression is a statistical model that examines the relationship between a response variable and one or more predictor variables. Using the lm function, we carried out a regression with total victim count as the response and the other attributes, namely state, year, male victim count, and female victim count, as predictors.
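A minimal version of that call, assuming the column names used above:

# Linear regression: total victims as the response.
lm_model <- lm(total_victim ~ state + year + male_victim + female_victim,
               data = crime)

# Coefficients, residuals, R-squared, and residual standard error.
summary(lm_model)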
The summary output helps us evaluate the model's effectiveness. The "Call" section shows the regression formula, with "Total victim" as the dependent variable and the selected predictors from the dataset. The residuals are the differences between the actual and predicted values, obtained by subtracting the model's predictions from the actual values of "Total victim." Each estimated coefficient gives the average change in the response variable for a one-unit increase in that predictor, holding all other predictors constant. The p-value of the corresponding t-statistic determines whether a predictor is statistically significant; if it is below the significance level (e.g., 0.05), the predictor is deemed significant. The residual standard error is the typical deviation of observed values from the regression line, so a smaller value indicates a better fit. The Multiple R-squared value, ranging from 0 to 1, measures how well the predictors explain the response variable; the closer it is to 1, the better the predictors account for variation in the response.
The above plot depicts the relation between residuals and fitted values of the linear regression model.
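That residuals-versus-fitted plot can be reproduced directly from the fitted model object, for example:

# First diagnostic plot of the lm object: residuals vs. fitted values.
plot(lm_model, which = 1)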
Two-Way ANOVA:
A two-way ANOVA examines how the mean of a quantitative variable varies across the levels of two categorical variables, and is used to assess the combined effect of the two independent variables on the dependent variable.
As the p-value is less than the significance level of 0.05, we reject the null hypothesis.
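The report does not name the two categorical factors, so the sketch below uses state and year purely as an illustration of how a two-way ANOVA with an interaction term is fitted in R:

# Two-way ANOVA with interaction; both variables treated as factors.
anova_model <- aov(total_victim ~ factor(state) * factor(year), data = crime)

# p-values for each main effect and the interaction.
summary(anova_model)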
Logistic Regression:
Logistic Regression serves as a method for fitting a regression model within the broader framework of Generalized Linear Models (GLM). This approach is particularly relevant when dealing with datasets containing both continuous and categorical predictor variables. To implement logistic regression, the glm function is employed, requiring the specification of a formula, dataset, and the family as its arguments.
For this analysis, we partitioned the data into separate training and testing sets using a 70-30 split: 70% of the observations are allocated to the training set and the remaining 30% to the testing set. We then applied the glm function to fit the logistic regression model, with a formula involving the 'total victim' variable along with other relevant predictors such as 'state,' 'year,' 'male victim,' and 'female victim.'
The model summary provides the deviance residuals, describing their distribution through the minimum, maximum, median, and quartile values. The significance of each predictor is assessed through its p-value: a value below 0.05 indicates an important variable, and significance is also flagged by asterisks (*) next to the p-value. In our analysis, the variables state, year, female victim, total offender, violence offense, theft offense, and drug offense have p-values below 0.05, marking them as significant predictors. We then computed predicted values on the test set and generated a confusion matrix; the following statistics are derived from that matrix.
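A sketch of that workflow is shown below. The 70-30 split, the formula, and the confusion matrix follow the description above, but the column names are assumed, and glm(family = binomial) requires a binary response, so total_victim is treated here as if it were coded 0/1.

library(caret)  # for confusionMatrix()

set.seed(123)
train_idx <- sample(seq_len(nrow(crime)), size = floor(0.7 * nrow(crime)))
train <- crime[train_idx, ]
test  <- crime[-train_idx, ]

# Logistic regression on the training set (assumes a 0/1 response).
logit_model <- glm(total_victim ~ state + year + male_victim + female_victim +
                     total_offender + violence_offense + theft_offense +
                     drug_offense + sex_offense,
                   data = train, family = binomial)
summary(logit_model)  # deviance residuals and coefficient p-values

# Predicted probabilities on the test set, thresholded at 0.5.
pred_prob  <- predict(logit_model, newdata = test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

confusionMatrix(factor(pred_class, levels = c(0, 1)),
                factor(test$total_victim, levels = c(0, 1)))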
Accuracy, the ratio of correct predictions to total predictions, summarizes the model's overall predictive performance. With an accuracy of 99.77%, the model is highly effective.
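For reference, the same accuracy figure can be computed directly from the confusion-matrix counts, continuing the sketch above (pred_class and test are the assumed objects from it):

# Accuracy = correct predictions / all predictions.
cm <- table(predicted = pred_class, actual = test$total_victim)
sum(diag(cm)) / sum(cm)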
LASSO Regression:
Using the glmnet function, we carried out LASSO regression with the same predictors as the linear regression and with alpha set to 1. We then used cv.glmnet, which performs cross-validation, to compute the lambda.min and lambda.1se values. lambda.min gives the smallest mean cross-validated error, whereas lambda.1se gives the most regularized model whose cross-validated error is within one standard error of that minimum. The two dotted lines on the plot mark these two lambda values.
The graph shows values of approximately -10.633 for lambda.min and -9.33 for lambda.1se; since lambda itself is always positive, these are values on the log(lambda) scale used by the plot. Regression analysis is a useful technique for exploring the connection between one or more predictor variables and a response variable. One way to evaluate how well a regression model fits a dataset is the root mean square error (RMSE), which quantifies the average difference between the model's predicted values and the actual values in the dataset. A lower RMSE signifies a better fit of the model to the data. In our case, the RMSE value is 0.2209, indicating a strong fit of the model.
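A sketch of that workflow with glmnet, under the same assumed column names; model.matrix is used because glmnet expects a numeric predictor matrix rather than a formula.

library(glmnet)

# Drop incomplete rows so the predictor matrix and response stay aligned.
crime_cc <- na.omit(crime)

# Numeric predictor matrix (all columns except the response) and response.
x <- model.matrix(total_victim ~ ., data = crime_cc)[, -1]
y <- crime_cc$total_victim

# Cross-validated LASSO (alpha = 1); the plot shows dotted lines at
# log(lambda.min) and log(lambda.1se).
set.seed(123)
cv_fit <- cv.glmnet(x, y, alpha = 1)
plot(cv_fit)

log(cv_fit$lambda.min)  # compare with the values reported above
log(cv_fit$lambda.1se)

# RMSE of the lambda.min fit on the same data.
pred <- predict(cv_fit, newx = x, s = "lambda.min")
sqrt(mean((y - pred)^2))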
Conclusion:
We conducted Linear, Logistic, and LASSO Regression analyses using the total number of victims as the response variable, with independent variables including state, year, male victims, female victims, total offenders, violent offenses, theft offenses, drug offenses, and sex offenses. The Logistic Regression model yielded a high accuracy of 99.77%, making it the best-fitting model. For LASSO Regression, the RMSE was approximately 0.2209. Additionally, a two-way ANOVA was performed, yielding a p-value below 0.05 and leading to the rejection of the null hypothesis.