Assignment2_CategoricalVariables_ShubhamJethwa

School: Seneca College
Course: 110
Subject: Statistics
Date: Feb 20, 2024

Assignment 2
Bank Marketing Case Study: Categorical Attributes Check

The head of Marketing wants to know which customers have the highest propensity for buying a Certificate of Deposit (CD) from the institution. The goal of this assignment is to check for errors in character variables and correct them.

Learning outcomes
- Use PROC FREQ to inspect errors in character variables.
- Use character functions for data cleaning.
- Use a 2x2 contingency table to examine dependency between variables by looking at the chi-square test.
- Use a mosaic plot to visually examine dependency between variables.

Readings:

Simple frequency table:
http://support.sas.com/training/sas94/m15_2.htm

2x2 contingency table:
A contingency table shows the frequency distribution of the variables in a matrix format, while a mosaic plot graphically displays the information. Look here for an example:
http://assets.csom.umn.edu/assets/163747.pdf
https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_freq_sec

Mosaic plot:
https://blogs.sas.com/content/iml/2013/11/04/create-mosaic-plots-in-sas-by-using-proc-freq.html
https://towardsdatascience.com/mosaic-plot-and-chi-square-test-c41b1a527ce4

Measuring association between categorical variables:
The null hypothesis for a chi-square independence test is that two categorical variables are independent in some population. The p-value is the area under the right tail of the chi-square distribution beyond the χ² test value: p = P[X >= chisquareValue]. Usually we reject the null hypothesis of independence, and say the two variables are related, if p-value < 0.01 (sometimes p-value < 0.05 is also considered statistically significant). You can find the p-value in the Prob column of the statistics table.
https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_freq_sec
- The range of Cramer’s V is [-1, 1] for 2x2 tables.
- The range of the phi coefficient is [-1, 1] for 2x2 tables.
- The contingency coefficient is a measure of association derived from the Pearson chi-square; it is never negative.
https://www.statisticshowto.datasciencecentral.com/contingency-coefficient/

Understanding Contingency Coefficient Values
A contingency coefficient is particularly informative if you’re working with a large sample. The contingency coefficient helps us decide if variable b is ‘contingent’ on variable a. However, it is a rough measure and doesn’t quantify the dependence exactly; it can be used as a rough guide:
- If C is near zero (or equal to zero), you can conclude that your variables are independent of each other; there is no association between them.
- If C is away from zero, there is some relationship; C can never be negative.
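
For reference, the measures named above can be written out in standard notation. This is a sketch added for orientation, not taken from the readings: O_ij and E_ij (observed and expected cell counts), n (total count), and r and c (number of rows and columns) are symbols introduced here. The phi and V formulas give magnitudes only; for 2x2 tables SAS reports a signed value, which is why the ranges above include negative numbers.

```latex
\[
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij}-E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}
\]
\[
|\phi| = \sqrt{\frac{\chi^2}{n}},
\qquad
|V| = \sqrt{\frac{\chi^2}{n \, \min(r-1,\; c-1)}},
\qquad
C = \sqrt{\frac{\chi^2}{\chi^2 + n}}
\]
```

Because the denominator of C always exceeds its numerator, C stays strictly below 1 however strong the association is.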

Q1. Examine the target variable y:
Use PROC FREQ to list a simple frequency table for the variable y.

Q2. Examine the variable "contact" and study its dependency with the target variable y.
Use PROC FREQ to list a simple frequency table for the variable "contact". Examine the output for invalid values.
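
A minimal sketch of the two frequency tables asked for in Q1 and Q2. It assumes the data sit in a dataset called customer_all (the name used in Q7); adjust the name if your dataset differs.

```sas
/* Q1: simple frequency table for the target variable y */
proc freq data=customer_all;
  tables y;
run;

/* Q2: simple frequency table for contact.                        */
/* The MISSING option lists blank values as their own category,   */
/* which makes them easier to spot alongside misspelled values.   */
proc freq data=customer_all;
  tables contact / missing;
run;
```

Invalid values typically show up as unexpected rows in the table, for example a category that differs from a valid one only by case or spelling.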

Q3. Contingency table contact by y and mosaic plot:
Create a 2x2 contingency table along with a mosaic plot. Show the statistics for a table of contact by y.
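
One way to produce the table, the statistics, and the plot in a single step is sketched below. It assumes the dataset name customer_all (taken from Q7) and that ODS Graphics is available in your SAS session.

```sas
/* Mosaic plots are drawn by ODS Graphics, so switch it on first */
ods graphics on;

/* contact*y requests the two-way table of contact by y.         */
/* CHISQ adds the chi-square test and the association measures   */
/* (phi, contingency coefficient, Cramer's V).                   */
/* PLOTS=MOSAICPLOT requests the mosaic plot.                    */
proc freq data=customer_all;
  tables contact*y / chisq plots=mosaicplot;
run;

ods graphics off;
```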

Interpret: Based on the mosaic plot, do you assume association between the two variables? Based on the contingency coefficient, is there an association between the two variables?
- Based on the mosaic plot, the distribution of "y" differs across the levels of "contact", which suggests that the two variables may be associated.
- To quantify the strength of the association, we can use the contingency coefficient. It measures the degree of association between two categorical variables: it is 0 when there is no association and moves toward 1 (without ever reaching it) as the association strengthens.
- In SAS, the contingency coefficient is part of the chi-square output of PROC FREQ, which we requested with the CHISQ option on the TABLES statement. It appears in its own row of the statistics table, separate from the phi coefficient and Cramer's V.
- A contingency coefficient close to 0 suggests no association between the two variables; the further it is from 0, the stronger the association.
- Therefore, to determine whether "contact" and "y" are associated, we examine the contingency coefficient (and Cramer's V, reported alongside it) in the chi-square output. Values close to 0 suggest no association; values clearly away from 0 suggest some degree of association.

Q4. Examine the variable "education".
Define a new format, name it education_Check, and use it to identify invalid values for the variable education. Valid values are 'primary', 'secondary', 'tertiary', 'unknown'.
Refer to program 1.8, Chapter 1 - Working with Character Data, Cody's Data Cleaning Techniques Using SAS, Third Edition.
Use the function lowcase on the education column. Use the same dataset name for the output dataset. Show the simple frequency table after the change.
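
The steps in Q4 could be sketched as follows. This is an illustration under assumptions, not a reproduction of program 1.8: the dataset name customer_all comes from Q7, and the labels 'Valid', 'Missing', and 'Invalid' are choices made here.

```sas
/* Format that collapses education into Valid / Missing / Invalid */
proc format;
  value $education_Check
    'primary', 'secondary', 'tertiary', 'unknown' = 'Valid'
    ' '                                           = 'Missing'
    other                                         = 'Invalid';
run;

/* Apply the format so the table counts valid vs. invalid values */
proc freq data=customer_all;
  tables education / missing;
  format education $education_Check.;
run;

/* Lowercase education, writing back to the same dataset name */
data customer_all;
  set customer_all;
  education = lowcase(education);
run;

/* Simple frequency table after the change */
proc freq data=customer_all;
  tables education;
run;
```

Values that were invalid only because of capitalization should move into the valid categories after the lowcase step; anything still outside the four valid values needs a separate fix.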

Q5. Examine the variable "marital".
Use PROC PRINT with a WHERE statement to check for data errors in the variable marital. Consider the valid values as "single", "married", "divorced".
Refer to program 1.6, Chapter 1 - Working with Character Data, Cody's Data Cleaning Techniques Using SAS, Third Edition.
Use the function lowcase on the variable marital. Show the simple frequency table after the change.
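
A minimal sketch of the Q5 steps, assuming the dataset name customer_all from Q7. It is not a copy of program 1.6; it only illustrates the PROC PRINT plus WHERE pattern the question describes.

```sas
/* List observations whose marital value is not one of the valid values */
proc print data=customer_all;
  where marital not in ('single', 'married', 'divorced');
  var marital;
run;

/* Lowercase marital, keeping the same dataset name */
data customer_all;
  set customer_all;
  marital = lowcase(marital);
run;

/* Simple frequency table after the change */
proc freq data=customer_all;
  tables marital;
run;
```

Note that the WHERE condition also selects observations with a blank marital value, since a blank is not in the list of valid values.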

Q6. Examine the variable "Job".
Use PROC FREQ to list a simple frequency table. Write code to combine the categories "admin." and "ADMINISTRATION" for the job variable as "admin". Show the simple frequency table after the change.
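
The recoding in Q6 could be sketched like this, assuming the dataset name customer_all from Q7 and that the variable is stored as job.

```sas
/* Frequency table before the change */
proc freq data=customer_all;
  tables job;
run;

/* Merge the two spellings into the single category "admin" */
data customer_all;
  set customer_all;
  if job in ('admin.', 'ADMINISTRATION') then job = 'admin';
run;

/* Frequency table after the change */
proc freq data=customer_all;
  tables job;
run;
```

The comparison is case-sensitive, so only the two exact spellings given in the question are recoded.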

Q7. Checking missing values
Adapt the code in program 7.2 of Chapter 1 so it works on the customer_all dataset.
Refer to program 7.2, Counting Missing Values for Character Variables, in Chapter 1 - Working with Character Data, Cody's Data Cleaning Techniques Using SAS, Third Edition.

Q8. Create a new variable named jobMF to indicate the most frequent job category
Reuse the code provided in ch17, section 17.3.2.
- Check the most frequent job category based on the output of proc freq.
- Create the new variable jobMF.
- Print the first few observations.
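
For Q7, one common pattern for counting missing character values is a two-level format applied to every character variable. The sketch below is not necessarily identical to program 7.2; the format name $count_missing and its labels are assumptions made for illustration.

```sas
/* Collapse every character value into Missing / Nonmissing */
proc format;
  value $count_missing
    ' '   = 'Missing'
    other = 'Nonmissing';
run;

/* One small table per character variable in customer_all */
proc freq data=customer_all;
  tables _character_ / nocum missing;
  format _character_ $count_missing.;
run;
```

For Q8, the code from section 17.3.2 is not shown here, so the following is only one possible approach. It assumes jobMF is meant as a 0/1 indicator for the most frequent job category; job_counts and top_job are hypothetical helper names.

```sas
/* Count each job category and save the counts to a dataset */
proc freq data=customer_all noprint;
  tables job / out=job_counts;
run;

/* Put the most frequent category in the first row */
proc sort data=job_counts;
  by descending count;
run;

/* Store that category in the macro variable top_job */
data _null_;
  set job_counts(obs=1);
  call symputx('top_job', job);
run;

/* jobMF = 1 when job is the most frequent category, otherwise 0 */
data customer_all;
  set customer_all;
  jobMF = (job = "&top_job");
run;

/* Print the first few observations */
proc print data=customer_all(obs=5);
  var job jobMF;
run;
```

Reading the top category from the PROC FREQ output and typing it into the comparison by hand works equally well; the macro variable only avoids hard-coding the value.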