Data_wrangling
School
Cumberland University
Course
3268
Subject
Statistics
Date
Nov 24, 2024
Uploaded by ngararichrisgmail.com
Data Wrangling Assignment
Name
Institutional Affiliation
Date
Data Wrangling Task
Step 1: Loading the dataset
I started by loading the crime data and COVID-19 data into pandas DataFrames. This
allowed me to work with the datasets in a format that is easy to manipulate and analyze.
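A minimal sketch of this loading step is shown here. The inline CSV text, the column names (zip_code, offense, count, cases), and the ZIP values are hypothetical stand-ins, since the assignment does not show the actual files; in practice read_csv would be given the real file paths.

```python
import io

import pandas as pd

# Hypothetical inline samples standing in for the real crime and COVID-19 files.
crime_csv = io.StringIO(
    "zip_code,offense,count\n"
    "01001,Shoplifting,12\n"
    "01002,Shoplifting,30\n"
    "01003,Burglary,7\n"
)
covid_csv = io.StringIO(
    "zip_code,cases\n"
    "01001,410\n"
    "01002,655\n"
    "01003,298\n"
)

# Reading ZIP codes as strings keeps leading zeros intact.
crime_df = pd.read_csv(crime_csv, dtype={"zip_code": str})
covid_df = pd.read_csv(covid_csv, dtype={"zip_code": str})

print(crime_df.shape, covid_df.shape)
```

Treating the ZIP code as text rather than a number is a design choice that also makes the later merge on that column more predictable.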
Step 2: Exploring the structure and content
Next, I explored the structure and content of both datasets using the head() function.
This gave me an initial understanding of the data and helped me identify the relevant
variables for analysis.
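The exploration step might look like the sketch below. The DataFrame contents are hypothetical, and the dtypes and isna() checks are additions beyond the head() call described above; they are included only because they help spot columns that need cleaning.

```python
import pandas as pd

# Hypothetical stand-in for one of the loaded datasets.
covid_df = pd.DataFrame(
    {"zip_code": ["01001", "01002", "01003"], "cases": [410, None, 298]}
)

preview = covid_df.head()        # first rows (up to five by default)
column_types = covid_df.dtypes   # data type of each column
missing = covid_df.isna().sum()  # count of missing values per column

print(preview)
print(column_types)
print(missing)
```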
Step 3: Identifying the analytical question
To guide my data wrangling and analysis process, I formulated a specific analytical question:
What is the relationship between the number of reported shoplifting offenses and
COVID-19 cases?
This question provided a clear objective for the analysis.
Step 4: Data wrangling
I performed data wrangling to prepare the datasets for analysis. This involved
handling missing values, converting data types, and filtering relevant columns. These
operations ensured that the data was in a suitable format and contained only the necessary
information.
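The three operations named above (filtering, type conversion, and missing-value handling) can be sketched as follows. The sample rows and column names are hypothetical, and dropping incomplete rows is only one possible way to handle missing values; the assignment does not say which approach was used.

```python
import pandas as pd

# Hypothetical raw crime data with a text-typed count column and gaps.
crime_df = pd.DataFrame(
    {
        "zip_code": ["01001", "01001", "01002", None, "01003"],
        "offense": ["Shoplifting", "Burglary", "Shoplifting",
                    "Shoplifting", "Shoplifting"],
        "count": ["12", "4", "30", "5", "not reported"],
        "notes": ["", "", "", "", ""],
    }
)

# Filter: keep only shoplifting rows and the columns needed for the analysis.
is_shoplifting = crime_df["offense"] == "Shoplifting"
shoplifting = crime_df.loc[is_shoplifting, ["zip_code", "count"]].copy()

# Convert types: unparseable counts become NaN instead of raising an error.
shoplifting["count"] = pd.to_numeric(shoplifting["count"], errors="coerce")

# Handle missing values: drop rows lacking a ZIP code or a usable count.
shoplifting = shoplifting.dropna(subset=["zip_code", "count"])

print(shoplifting)
```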
Step 5: Merging the datasets
In order to analyze the relationship between shoplifting offenses and COVID-19
cases, I merged the crime data and COVID-19 data based on the common ZIP code column.
This created a unified dataset that contained information about both variables.
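A sketch of the merge on the shared ZIP code column is given here. The data and column names are hypothetical, and the inner join is an assumption: it keeps only ZIP codes present in both datasets, which may or may not match the join type used in the assignment.

```python
import pandas as pd

# Hypothetical cleaned datasets, one row per ZIP code in each.
shoplifting = pd.DataFrame(
    {"zip_code": ["01001", "01002", "01003"],
     "shoplifting_offenses": [12, 30, 8]}
)
covid = pd.DataFrame(
    {"zip_code": ["01001", "01002", "01004"],
     "covid_cases": [410, 655, 120]}
)

# Inner join on the common key; validate raises an error if either side
# has duplicate ZIP codes, which would silently multiply rows otherwise.
merged = pd.merge(
    shoplifting, covid, on="zip_code", how="inner", validate="one_to_one"
)

print(merged)
```

ZIP codes that appear in only one dataset (01003 and 01004 in this sample) are dropped by the inner join.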
Step 6: Data visualization
I created a scatter plot to visualize the relationship between the number of reported
shoplifting offenses and COVID-19 cases. This allowed me to observe any patterns or trends
in the data and gain initial insights.
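The scatter plot could be produced along these lines. This sketch assumes matplotlib as the plotting library, which the assignment does not name, and uses hypothetical data; the axes follow the orientation described in Step 7 (shoplifting offenses on the x-axis, COVID-19 cases on the y-axis).

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical merged dataset, one row per ZIP code.
merged = pd.DataFrame(
    {
        "shoplifting_offenses": [12, 30, 8, 21, 15, 40],
        "covid_cases": [410, 655, 298, 380, 520, 470],
    }
)

fig, ax = plt.subplots()
ax.scatter(merged["shoplifting_offenses"], merged["covid_cases"])
ax.set_xlabel("Reported shoplifting offenses")
ax.set_ylabel("COVID-19 cases")
ax.set_title("Shoplifting offenses vs. COVID-19 cases by ZIP code")

# To write the figure to disk: fig.savefig("scatter.png")  (hypothetical name)
```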
Step 7: Drawing conclusions
By examining the scatter plot and the distribution of data points, I made initial
inferences about the relationship between the variables:
In the scatter plot, where the x-axis is the number of reported shoplifting offenses and the
y-axis is the number of COVID-19 cases by ZIP code, most data points are clustered toward
the left side of the plot.
This clustering means that most ZIP codes reported relatively few shoplifting offenses,
whatever their COVID-19 case count. A cluster along one axis does not by itself establish the
direction of an association, so the plot alone gave no clear evidence that ZIP codes with more
COVID-19 cases had more or fewer reported shoplifting offenses.
Step 8: Calculating correlation coefficient and hypothesis testing
To quantitatively measure the strength and significance of the relationship, I
calculated the correlation coefficient and conducted hypothesis testing. This allowed me to
determine the statistical significance of the observed relationship.
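A sketch of this calculation follows. It assumes the Pearson correlation coefficient, computed with scipy.stats.pearsonr; the assignment does not state which coefficient or library was used. The data are hypothetical, so the printed values will not match the results reported in Step 9.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical merged dataset, one row per ZIP code.
merged = pd.DataFrame(
    {
        "shoplifting_offenses": [12, 30, 8, 21, 15, 40],
        "covid_cases": [410, 655, 298, 380, 520, 470],
    }
)

# pearsonr returns the coefficient and a two-sided p-value for the
# null hypothesis that the two variables are uncorrelated.
r, p_value = pearsonr(merged["shoplifting_offenses"], merged["covid_cases"])

print(f"r = {r:.4f}, p = {p_value:.4f}")
```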
Step 9: Interpreting the results
Finally, I interpreted the correlation coefficient and p-value to draw conclusions about
the relationship between the variables. This helped me answer the analytical question and
understand the practical significance of the findings, which were:
The correlation coefficient between the number of reported shoplifting offenses and
COVID-19 cases is approximately 0.2391, which suggests a weak positive correlation between
the number of reported shoplifting offenses and the number of COVID-19 cases by ZIP code.
The p-value is approximately 0.1426. The null hypothesis states that there is no correlation
between the two variables. Here the p-value is greater than the commonly used significance
level of 0.05, so there is not enough evidence to reject the null hypothesis. Therefore, I
cannot conclude that the relationship between the number of reported shoplifting offenses
and COVID-19 cases is statistically significant.
In summary, based on the correlation coefficient and p-value, there is a weak positive
correlation between the reported shoplifting offenses and the number of COVID-19 cases by
ZIP code, but it is not statistically significant.
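The decision rule applied here can be written out as a short sketch using the reported values. The interpret function is a hypothetical helper, and the 0.3 cutoff for calling a correlation weak is only a common rule of thumb assumed for illustration, not a threshold taken from the assignment.

```python
def interpret(r, p_value, alpha=0.05, weak_cutoff=0.3):
    """Describe a correlation result (hypothetical helper, assumed cutoffs)."""
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    strength = "weak" if abs(r) < weak_cutoff else "moderate or strong"
    # Reject the null hypothesis of no correlation only when p < alpha.
    significant = p_value < alpha
    return direction, strength, significant


# Values reported in Step 9.
direction, strength, significant = interpret(r=0.2391, p_value=0.1426)
print(direction, strength, significant)  # prints: positive weak False
```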
By following these steps, I was able to systematically analyze the data, perform
relevant data wrangling operations, visualize the relationship between shoplifting offenses
and COVID-19 cases, and conduct statistical analysis to draw meaningful conclusions. The
chosen solutions, such as merging the datasets and calculating correlation coefficients, were
directly aligned with the analytical question and allowed me to investigate the relationship
between the variables in a comprehensive and structured manner.