Northeastern University
Module 6 Assignment
Team: Capstone 3
Aniket Milind Telrandhe
Chirag Malik
ALY 6015: Intermediate Analytics
Mr. Zhi He
October 27, 2023
Introduction:
We've chosen a dataset from the National Incident-Based Reporting System (NIBRS) Database, encompassing data from 2000 to 2015 at the state-by-date level. This dataset comprises 327,656 records and 54 distinct variables. Our aim is to gain insights into crime occurrences within each state
during the specified timeframe, with a focus on the total number of victims.
For our analysis, we specifically used variables such as total victim count, state, year, male victim count, female victim count, total offender count, and the counts of offense types including violence, theft, drug, and sex offenses. To address this objective, we developed several regression models and selected the most accurate one; that model is used to answer the research question above.
Statistics table:
To create the summary statistics table, we selected a subset of the variables and used the stargazer function.
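As a rough sketch of that step, the code below loads the data and passes the selected columns to stargazer. The file name and the column names (total_victim, male_victim, and so on) are assumptions made for illustration, since the report does not list the exact names used.

library(stargazer)

# Load the NIBRS extract; the file name is assumed.
nibrs <- read.csv("nibrs_2000_2015.csv")

# Keep only the variables used in the report (assumed column names).
vars  <- c("total_victim", "state", "year", "male_victim", "female_victim",
           "total_offender", "violence_offense", "theft_offense",
           "drug_offense", "sex_offense")
crime <- nibrs[, vars]

# Summary statistics for the numeric columns; type = "text" prints to the
# console, while type = "html" or "latex" exports the table for the report.
stargazer(crime[sapply(crime, is.numeric)], type = "text",
          title = "Summary statistics for the selected NIBRS variables")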
The scatterplot of drug offenses against male victim count shows a large number of male victims associated with drug offenses. The corresponding scatterplot of drug offenses against female victim count shows that, compared with male victims, fewer female victims are associated with drug offenses.
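A minimal way to draw those two scatterplots in base R is shown below; it reuses the assumed column names from the sketch above, and the report's actual plotting code and colors are unknown.

# Drug offenses vs. male victims.
plot(crime$drug_offense, crime$male_victim,
     xlab = "Drug offenses", ylab = "Male victims",
     main = "Drug offenses vs. male victims", pch = 19, col = "steelblue")

# Drug offenses vs. female victims.
plot(crime$drug_offense, crime$female_victim,
     xlab = "Drug offenses", ylab = "Female victims",
     main = "Drug offenses vs. female victims", pch = 19, col = "tomato")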
The figure below shows the association between drug offenses and female victims. Most drug offenses involve roughly 700 female victims, and the number of female victims rises as the frequency of drug offenses increases.
The number of male victims in the dataset is shown in the boxplot below. Compared to female victims, there are very few male victims.
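A sketch of that boxplot comparison, again using the assumed column names:

# Side-by-side boxplots of male and female victim counts.
boxplot(crime$male_victim, crime$female_victim,
        names = c("Male victims", "Female victims"),
        ylab  = "Victim count",
        main  = "Distribution of victim counts by sex")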
Corrplot:
The correlation heatmap below displays the relationship between each pair of variables. The correlation coefficient ranges from -1 to +1: a value of -1 denotes a perfectly negative relationship, 0 denotes no association, and +1 denotes a perfectly positive relationship between two variables. The farther the coefficient is from zero, the stronger the association between the two variables, and the darker the hue used to display it.
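One common way to produce such a heatmap is with the corrplot package. The sketch below computes pairwise correlations over the numeric columns; it is an assumption about how the figure could be built, not the report's exact code.

library(corrplot)

# Pairwise correlations over the numeric columns only.
num_vars <- crime[sapply(crime, is.numeric)]
corr_mat <- cor(num_vars, use = "complete.obs")

# Darker cells correspond to correlations farther from zero.
corrplot(corr_mat, method = "color", type = "upper",
         tl.col = "black", tl.cex = 0.8)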
Linear Regression:
Linear regression is a statistical model that examines the relationship between a response variable and one or more predictor variables. Using the lm function, we carried out a regression with total victim count as the response and the other attributes, namely state, year, male victim count, and female victim count, as predictors.
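A minimal version of that call, assuming the column names used above:

# Linear regression: total victims as the response.
lm_model <- lm(total_victim ~ state + year + male_victim + female_victim,
               data = crime)

# Coefficients, residuals, R-squared, and residual standard error.
summary(lm_model)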
The summary output helps us evaluate the model's effectiveness. The "Call" section shows the regression formula, with "Total victim" as the dependent variable and the selected predictors from the dataset. The residuals are the differences between the actual and predicted values, obtained by subtracting the model's predictions from the actual values of "Total victim." Each estimated coefficient gives the average change in the response variable for a one-unit increase in that predictor, holding all other predictors constant. The p-value of the corresponding t-statistic determines whether a predictor is statistically significant; if it is below the significance level (e.g., 0.05), the predictor is deemed significant. The residual standard error is the typical deviation of observed values from the regression line, so a smaller value indicates a better fit. The Multiple R-squared value, ranging from 0 to 1, measures how well the predictors explain the response variable; the closer it is to 1, the better the predictors account for variation in the response.
The above plot depicts the relation between residuals and fitted values of the linear regression model.
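That residuals-versus-fitted plot can be reproduced directly from the fitted model object, for example:

# First diagnostic plot of the lm object: residuals vs. fitted values.
plot(lm_model, which = 1)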
Two-Way ANOVA:
A two-way ANOVA examines how the mean of a quantitative variable varies across the levels of two categorical variables, and is used to assess the combined effect of the two independent variables on the dependent variable.
As the p-value is less than the significance level of 0.05, we reject the null hypothesis.
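The report does not name the two categorical factors, so the sketch below uses state and year purely as an illustration of how a two-way ANOVA with an interaction term is fitted in R:

# Two-way ANOVA with interaction; both variables treated as factors.
anova_model <- aov(total_victim ~ factor(state) * factor(year), data = crime)

# p-values for each main effect and the interaction.
summary(anova_model)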
Logistic Regression:
Logistic Regression serves as a method for fitting a regression model within the broader framework of Generalized Linear Models (GLM). This approach is particularly relevant when dealing with datasets containing both continuous and categorical predictor variables. To implement logistic regression, the glm function is employed, requiring the specification of a formula, dataset, and the family as its arguments.
For this analysis, we partitioned the data into separate training and testing sets using a 70-30 split: 70% of the observations are allocated to the training set and the remaining 30% to the testing set. We then applied the glm function to fit the logistic regression model, with a formula involving the 'total victim' variable along with other relevant predictors such as 'state,' 'year,' 'male victim,' and 'female victim.'
The model summary provides the deviance residuals, describing their distribution through the minimum, maximum, median, and quartile values. The significance of each predictor is assessed through its p-value: a value below 0.05 indicates an important variable, and significance is also flagged by asterisks (*) next to the p-value. In our analysis, the variables state, year, female victim, total offender, violence offense, theft offense, and drug offense have p-values below 0.05, marking them as significant predictors. We then computed predicted values on the test set and generated a confusion matrix; the following statistics are derived from that matrix.
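A sketch of that workflow is shown below. The 70-30 split, the formula, and the confusion matrix follow the description above, but the column names are assumed, and glm(family = binomial) requires a binary response, so total_victim is treated here as if it were coded 0/1.

library(caret)  # for confusionMatrix()

set.seed(123)
train_idx <- sample(seq_len(nrow(crime)), size = floor(0.7 * nrow(crime)))
train <- crime[train_idx, ]
test  <- crime[-train_idx, ]

# Logistic regression on the training set (assumes a 0/1 response).
logit_model <- glm(total_victim ~ state + year + male_victim + female_victim +
                     total_offender + violence_offense + theft_offense +
                     drug_offense + sex_offense,
                   data = train, family = binomial)
summary(logit_model)  # deviance residuals and coefficient p-values

# Predicted probabilities on the test set, thresholded at 0.5.
pred_prob  <- predict(logit_model, newdata = test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

confusionMatrix(factor(pred_class, levels = c(0, 1)),
                factor(test$total_victim, levels = c(0, 1)))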
Accuracy, the ratio of correct predictions to total predictions, summarizes the model's overall predictive performance. With an accuracy of 99.77%, the model is highly effective.
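For reference, the same accuracy figure can be computed directly from the confusion-matrix counts, continuing the sketch above (pred_class and test are the assumed objects from it):

# Accuracy = correct predictions / all predictions.
cm <- table(predicted = pred_class, actual = test$total_victim)
sum(diag(cm)) / sum(cm)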
LASSO Regression:
Using the glmnet function, we carried out LASSO regression with the same predictors as the linear regression and with alpha set to 1. We then used cv.glmnet, which performs cross-validation, to compute the lambda.min and lambda.1se values. lambda.min gives the smallest mean cross-validated error, whereas lambda.1se gives the most regularized model whose cross-validated error is within one standard error of that minimum. The two dotted lines on the plot mark these two lambda values.
The graph shows values of approximately -10.633 for lambda.min and -9.33 for lambda.1se; since lambda itself is always positive, these are values on the log(lambda) scale used by the plot. Regression analysis is a useful technique for exploring the connection between one or more predictor variables and a response variable. One way to evaluate how well a regression model fits a dataset is the root mean square error (RMSE), which quantifies the average difference between the model's predicted values and the actual values in the dataset. A lower RMSE signifies a better fit of the model to the data. In our case, the RMSE value is 0.2209, indicating a strong fit of the model.
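A sketch of that workflow with glmnet, under the same assumed column names; model.matrix is used because glmnet expects a numeric predictor matrix rather than a formula.

library(glmnet)

# Drop incomplete rows so the predictor matrix and response stay aligned.
crime_cc <- na.omit(crime)

# Numeric predictor matrix (all columns except the response) and response.
x <- model.matrix(total_victim ~ ., data = crime_cc)[, -1]
y <- crime_cc$total_victim

# Cross-validated LASSO (alpha = 1); the plot shows dotted lines at
# log(lambda.min) and log(lambda.1se).
set.seed(123)
cv_fit <- cv.glmnet(x, y, alpha = 1)
plot(cv_fit)

log(cv_fit$lambda.min)  # compare with the values reported above
log(cv_fit$lambda.1se)

# RMSE of the lambda.min fit on the same data.
pred <- predict(cv_fit, newx = x, s = "lambda.min")
sqrt(mean((y - pred)^2))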
Conclusion:
We conducted Linear, Logistic, and LASSO Regression analyses using the total number of victims as the response variable, with independent variables including state, year, male victims, female victims, total offenders, violent offenses, theft offenses, drug offenses, and sex offenses. The Logistic Regression model yielded a high accuracy of 99.77%, making it the best-fitting model. For LASSO Regression, the RMSE was approximately 0.2209. Additionally, a two-way ANOVA was performed, yielding a p-value below 0.05 and leading to the rejection of the null hypothesis.