most missing values

docx

School

University of the Fraser Valley *

*We aren’t endorsed by this school

Course

7034

Subject

Economics

Date

Nov 24, 2024

Type

docx

Pages

4

Uploaded by ChiefBaboonMaster733

Report
most missing values =Income(19) least missing values=Number of orders(0) Explanation: Primary Analysis [50% marks] a) Find the % missing values for each and every feature? (variable / columns) variables missing values formula sex 2 =COUNTBLANK(B2:B20 race 8 =COUNTBLANK(C2:C20 BirthDate 6 =COUNTBLANK(D2:D20 College 1 =COUNTBLANK(E2:E20 Household 5 =COUNTBLANK(F2:F201 Zipcode 11 =COUNTBLANK(G2:G20 Income 19 =COUNTBLANK(H2:H20 Spending 2017 4 =COUNTBLANK(I2:I201 spending 2018 3 =COUNTBLANK(J2:J201 Numoforders 0 =COUNTBLANK(K2:K20 days 4 =COUNTBLANK(L2:L20 satisfaction 5 =COUNTBLANK(M2:M2 channel 4 =COUNTBLANK(N2:N20 b) Based on your answer to a), which feature has the most and least amount of missing values (ignore the first column)? most missing values =Income(19) least missing values=Number of orders(0) c) What portion of the data (200 subjects in total) is complete and can be used for analysis immediately without further processing? (We want the actual number not an estimate, use a method that could have been applied even if there were 200,00 subjects in this dataset) number of complete data set 1. On the Home tab, in the Editing group, click Find & Select. 2. Click Go To Special. 3. Select Blanks and click OK. Excel selects the blank cells. 4. On the Home tab, in the Cells group, click Delete.
5. Click Delete Sheet Rows. Result: 6. we get in this set 138 rows are perfect and does not need further processing . d) How many customers have an income greater than 53K but made a total orders of less than 9? we select Income column, then click on Home tab > sort>sort smallest to larger > expand to all tabs > ok we get sorted values now we can use countif formula or simply calculate values . answer = 64 values Method to predict Zip code Find the highest correlation of available zip codes with other variables. This allow us to find the vague relation between zip code and other variables or in an easier definition, the unsaid criterion in which people find their home. Correlation Zip code Race -0.09440716085 Income 0.01379309587 CustID -0.05379061979 Spending 2017 -0.03735486136 Spending 2018 -0.007111959855 Num of orders 0.02732980938 Days since last order -0.04115799099 Channel 0.0558540738 The highest correlation is with Race. Even though the correlation is just near to 10%, this is the best fit we can find. Now find the mode of zip codes with respect to each race Mode Zip code
White Hispanic Pacific Islander Black Asian Not specified In case of no mode found, use the universal mode for the Zip codes =90010 Thus the missing zip codes for the customers are CustId Zip code Predicted 1546138 90030 1557386 90030 1562867 90030 1549186 90010 1554342 90010 1536475 90010 1542818 90032 1552447 90032 1566529 90032 1579979 90032 1533697 90010 Explanation: Income too can be found using the similar method The correlation is Correlation Income Race 0.06602854854 Zip Code 0.01379309587
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
CustID -0.0206969211 Spending 2017 -0.01145710477 Spending 2018 -0.002618788797 Num of orders -0.01765234055 Days since last order -0.05088941745 Channel 0.05124686426 Average Spending -0.001309293822 Here average spending is the average of spending of each customer in the years 2017 and 2018. (Please note that usually income and spending is supposed to have a high correlation value according to the theory of economics. This pattern is not observed here) No variable have even a 10% correlation with income. thus the only possible method is to find the average income of the whole customer base and fill this value in the missing incomes. The average income is 73513.81215 mean income before adding the predicated value is 73513.81215 Mean income after adding the predicted value is 73513.81215 therefore no change has occurred. PS: The method used here is the basic method followed by corporates for creating golden data. Any better methods need specialized software suited for master data management.