IE6400_Day12
IE6400 Foundations for Data Analytics Engineering
¶
Fall 2023
¶
Module 2: Nonparametric Methods
¶
Nonparametric methods refer to a broad category of statistical techniques that do not make strong assumptions about the form or parameters of the underlying population distribution from which the sample data is drawn. These methods are often used when the assumptions of parametric methods (like normal distribution) are not met. Here are some key points about nonparametric methods:
1. Distribution-Free: Nonparametric methods do not assume a specific distribution for the data, such as the normal distribution. This makes them more flexible in handling data from unknown or non-normal distributions.
2. Rank-Based: Many nonparametric tests are based on the ranks of the data rather than their actual values. Examples include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
3. Applications: Nonparametric methods are particularly useful for analyzing ordinal or nominal data, as well as interval or ratio data that doesn't meet the assumptions of parametric tests.
4. Advantages:
• Flexibility: Can be applied to data that doesn't meet the assumptions of parametric tests.
• Robustness: Less sensitive to outliers and skewed data.
• Simplicity: Often easier to understand and interpret.
5. Disadvantages:
• Less Powerful: When the assumptions of parametric tests are met, nonparametric tests are generally less powerful (i.e., they might have a lower chance of detecting a true effect).
• Limited Parameters: Nonparametric methods do not provide estimates of population parameters like the mean or standard deviation.
6. Common Nonparametric Tests:
• Mann-Whitney U Test (or Wilcoxon Rank-Sum Test): Compares the distributions of two independent samples.
• Wilcoxon Signed-Rank Test: Compares the distributions of two paired samples.
• Kruskal-Wallis Test: An extension of the Mann-Whitney U test for comparing more than two samples.
• Spearman's Rank Correlation: Measures the strength and direction of the association between two ranked variables.
• Chi-Squared Test: Tests the association between two categorical variables.
7. Kernel Density Estimation: A nonparametric way to estimate the probability density function of a continuous random variable (a short sketch appears after this overview).
8. Nonparametric Regression: Techniques like LOESS (locally weighted scatterplot smoothing) that do not assume a specific form for the relationship between predictors and the response variable.
In summary, nonparametric methods offer a versatile toolkit for statistical analysis when the assumptions of traditional parametric methods are not met. They are especially useful for analyzing data that is skewed, has outliers, or comes from an unknown distribution.
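As a brief illustration of point 7 above, here is a minimal kernel density estimation sketch. It is an addition to the notebook, it uses a made-up bimodal sample, and it relies only on scipy.stats.gaussian_kde with its default bandwidth rule:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
# Hypothetical bimodal sample; no parametric family is assumed for it
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
kde = gaussian_kde(sample)              # bandwidth chosen automatically (Scott's rule)
grid = np.linspace(sample.min() - 1, sample.max() + 1, 200)
plt.hist(sample, bins=30, density=True, alpha=0.4, label='histogram')
plt.plot(grid, kde(grid), label='KDE')  # estimated probability density function
plt.legend()
plt.show()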
Exercise 1 Ranking data
¶
Ranking data refers to data representing the order or position of items relative to one another without necessarily indicating the magnitude of differences between them. In other words, ranking data tells you the order of items but not the actual values or scores that led to that order.
In [1]:
import pandas as pd
In [2]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, 72, 92, 72]}
In [3]:
df = pd.DataFrame(data)
df
Out[3]:
      Name  Score
0    Alice     85
1      Bob     72
2  Charlie     92
3    David     72
In [4]:
df['Rank'] = df['Score'].rank(method='average', ascending=False)
print(df)
Name Score Rank
0 Alice 85 2.0
1 Bob 72 3.5
2 Charlie 92 1.0
3 David 72 3.5
This method assigns ranks to data points based on their values. Ties can be handled in
various ways, such as averaging the ranks.
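As a side note (not part of the original notebook), pandas supports several tie-handling rules; this minimal sketch, assuming the df defined above, applies each of them to the same Score column so the differences are easy to compare:
# 'average', 'min', 'max', 'dense', and 'first' are the tie options accepted by rank()
for method in ['average', 'min', 'max', 'dense', 'first']:
    df[f'Rank_{method}'] = df['Score'].rank(method=method, ascending=False)
print(df)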
Exercise 2 Ranking Using the scipy.stats.rankdata() Function:
¶
In [5]:
import numpy as np
from scipy.stats import rankdata
In [6]:
scores = np.array([85, 72, 92, 72])
ranks = rankdata(-scores, method='average')
In [7]:
print('Data:', scores)
print('Ranks:', ranks)
Data: [85 72 92 72]
Ranks: [2. 3.5 1. 3.5]
The rankdata() function from SciPy can be used to rank data, and it provides various methods for handling ties.
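For comparison, this short added sketch (assuming the scores array and rankdata import from above) prints the ranks under each of the tie-handling options rankdata() accepts:
# rankdata tie-handling options
for method in ['average', 'min', 'max', 'dense', 'ordinal']:
    print(method, rankdata(-scores, method=method))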
Exercise 3 Ranking Using the pandas.Series.rank() Method:
¶
In [8]:
import pandas as pd
In [9]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, 72, 92, 72]}
In [10]:
df = pd.DataFrame(data)
df['Rank'] = df['Score'].rank(ascending=False, method='average')
print(df)
Name Score Rank
0 Alice 85 2.0
1 Bob 72 3.5
2 Charlie 92 1.0
3 David 72 3.5
Exercise 4 Ranking Using the argsort() Function:
¶
Perform an indirect sort along the given axis: argsort() returns the indices that would sort the array.
In [11]:
import numpy as np
In [12]:
scores = np.array([85, 72, 92, 72])
ranks = np.argsort(-scores) + 1
print('Data:', scores)
print('Ranks:', ranks)
Data: [85 72 92 72]
Ranks: [3 1 2 4]
The argsort() function returns the indices that would sort an array. Note that this gives the sort order of the elements (which index comes first, second, and so on) rather than the rank of each element; applying argsort() twice recovers the ranks, as in the sketch below.
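A minimal added sketch, reusing the scores array from above (here ties are broken by position, unlike the averaged ranks in the earlier exercises):
order = np.argsort(-scores)    # indices of the scores from highest to lowest: [2 0 1 3]
ranks = np.argsort(order) + 1  # rank of each original element: [2 3 1 4]
print('Order:', order)
print('Ranks:', ranks)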
Exercise 5
¶
The SciPy library provides the rankdata() function to rank numerical data, and it supports a number of variations on ranking. The example below demonstrates how to rank a numerical dataset.
In [13]:
from numpy.random import rand
from numpy.random import seed
from scipy.stats import rankdata
In [14]:
# seed random number generator
seed(1)
In [15]:
# generate dataset
data = rand(1000)
In [16]:
# review first 10 samples
print(data[:10])
[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
1.46755891e-01 9.23385948e-02 1.86260211e-01 3.45560727e-01
3.96767474e-01 5.38816734e-01]
Display the first 10 elements of the generated data.
In [17]:
# rank data
ranked = rankdata(data)
# review first 10 ranked samples
print(ranked[:10])
[408. 721. 1. 300. 151. 93. 186. 342. 385. 535.]
Exercise 6 Kendall Tau Coefficient
¶
The Kendall Tau Coefficient, often denoted as τ (tau), is a non-parametric statistic used to measure the strength and direction of the association between two ordinal variables. It's a rank correlation coefficient that assesses the degree of concordance between two sets of rankings.
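For reference (this formula is an addition to the notebook), in the simplest case with no ties, tau compares the number of concordant pairs C with the number of discordant pairs D:
$$\tau = \frac{C - D}{\binom{n}{2}} = \frac{C - D}{n(n-1)/2}$$
scipy.stats.kendalltau computes the tau-b variant by default, which additionally adjusts for ties. For the sample rankings used below, 6 of the 10 pairs are concordant and 4 are discordant, giving τ = (6 − 4)/10 = 0.2, matching the output in Step 3.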
In this exercise, we will compute the Kendall Tau Coefficient using Python to measure the rank correlation between two sets of data. We'll use the scipy.stats library, which provides a function to calculate this coefficient.
Step 1: Import Necessary Libraries
¶
In [18]:
import numpy as np
from scipy.stats import kendalltau
import matplotlib.pyplot as plt
Step 2: Create Sample Data
¶
Let's create two sets of ranked data to compute the Kendall Tau Coefficient.
In [19]:
# Sample data
rankings_A = np.array([1, 2, 3, 4, 5])
rankings_B = np.array([3, 2, 4, 1, 5])
Step 3: Compute the Kendall Tau Coefficient
¶
In [20]:
tau, p_value = kendalltau(rankings_A, rankings_B)
print(f'Kendall Tau Coefficient: {tau:.2f}')
print(f'p-value: {p_value:.2f}')
Kendall Tau Coefficient: 0.20
p-value: 0.82
In [21]:
# Interpret the correlation
alpha = 0.05
if p_value < alpha:
    print("There is a statistically significant correlation.")
else:
    print("There is no statistically significant correlation.")
There is no statistically significant correlation.
The Kendall Tau Coefficient value will lie between -1 and 1. A positive value indicates that the two sets of rankings are similar, while a negative value indicates that the rankings are dissimilar. A value close to 0 suggests little to no association.
In the conclusion, you can determine the strength and direction of the association between the two sets of rankings. A significant p-value (typically < 0.05) suggests that
the observed association is statistically significant.
Exercise 7 Case Study - Analyzing Exam Scores and Study Hours
¶
Background:
¶
A researcher wants to investigate if there is a significant correlation between the number of hours students spend studying for an exam and their scores on the exam.
Data Collection:
¶
The researcher collects data from a group of 20 students. For each student, they record the number of hours spent studying and the exam score achieved. Here's the dataset:
In [22]:
import pandas as pd
In [23]:
df = pd.read_csv('example.csv')
In [24]:
df
Out[24]:
    Hours_Study  Exam_Score
0            10          85
1             5          60
2             8          75
3             3          50
4            12          92
5            15          98
6             9          80
7             7          70
8             2          45
9            11          88
10            6          62
11            4          55
12           14          96
13           16          99
14           13          90
15           18         100
16            1          40
17           17          98
18           20         105
19           19         110
In [25]:
from scipy import stats
In [26]:
# Calculate Kendall's Tau and p-value
tau, p_value = stats.kendalltau(df['Hours_Study'], df['Exam_Score'])
# Set significance level
alpha = 0.05
In [27]:
# Compare p-value to alpha
if p_value < alpha:
    result = "Reject the null hypothesis. There is a significant correlation."
else:
    result = "Fail to reject the null hypothesis. There is no significant correlation."
In [28]:
print(f"Kendall's Tau (τ) = {tau:.2f}")
print(f"P-value = {p_value:.4f}")
print(result)
Kendall's Tau (τ) = 0.97
P-value = 0.0000
Reject the null hypothesis. There is a significant correlation.
Since the p-value is less than the chosen significance level (α = 0.05), we reject the null hypothesis. This suggests that there is a significant positive correlation between the number of hours spent studying and the exam scores.
Exercise 8 Wilcoxon test
¶
The Wilcoxon test, also known as the Wilcoxon signed-rank test, is a non-parametric statistical test used to compare two paired groups. It is used to test the null hypothesis
that two related paired samples come from the same distribution. Specifically, it tests whether the differences between the pairs follow a symmetric distribution around zero.
In this exercise, we will perform the Wilcoxon signed-rank test using Python to compare two paired groups. We'll use the scipy.stats library, which provides a function
to conduct this test.
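For reference (added here, not part of the original notebook), the test ranks the absolute pairwise differences d_i and sums those ranks separately for positive and negative differences; for a two-sided test, SciPy reports the smaller of the two sums:
$$W = \min(W^+, W^-), \qquad W^+ = \sum_{d_i > 0} \operatorname{rank}|d_i|, \qquad W^- = \sum_{d_i < 0} \operatorname{rank}|d_i|$$
For the paired data below, the sum of ranks for the negative differences is 23, which matches the test statistic printed in Step 3.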
Step 1: Import Necessary Libraries
¶
In [29]:
import numpy as np
from scipy.stats import wilcoxon
import matplotlib.pyplot as plt
Step 2: Create Sample Paired Data
¶
Let's create two sets of paired data to perform the Wilcoxon test.
In [30]:
# Sample paired data
before_treatment = np.array([20, 22, 24, 19, 18, 21, 25, 23, 22, 20])
after_treatment = np.array([19, 24, 22, 20, 19, 20, 26, 22, 21, 19])
Step 3: Perform the Wilcoxon Test
¶
In [31]:
w, p_value = wilcoxon(before_treatment, after_treatment)
print(f'Wilcoxon Test Statistic: {w}')
print(f'p-value: {p_value:.4f}')
Wilcoxon Test Statistic: 23.0
p-value: 0.6953
In [32]:
# Interpret the results
alpha = 0.05 # significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two groups.")
Fail to reject the null hypothesis: There is no significant difference between the two groups.
The p-value will help us determine whether the differences between the paired samples are statistically significant. A small p-value (typically < 0.05) suggests that the two sets of paired data are significantly different.
Note that you can conclude whether there's a statistically significant difference between the two paired groups. If the p-value is less than 0.05, it suggests that the treatment had a significant effect on the sample. Otherwise, there's no evidence to suggest a significant effect.
Exercise 9 Kruskal-Wallis H Test
¶
The Kruskal-Wallis H Test, often simply referred to as the Kruskal-Wallis test, is a non-
parametric statistical test used to determine if there are statistically significant differences between two or more groups of an independent variable on a continuous or
ordinal dependent variable. It is the non-parametric equivalent to the one-way ANOVA.
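For reference (an addition to the notebook), with N total observations split into k groups, n_j observations in group j, and R_j the sum of the ranks (over all N values) falling in group j, the statistic in the no-ties case is
$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1)$$
scipy.stats.kruskal additionally applies a tie correction. For the three groups defined below, the rank sums are 58, 41, and 21, giving H = 6.86 as printed in the output.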
In [33]:
import numpy as np
from scipy.stats import kruskal
Define three or more independent groups of data for the Kruskal-Wallis H Test.
In [34]:
group1 = [22, 25, 28, 30, 32]
group2 = [18, 21, 23, 26, 29]
group3 = [15, 16, 19, 20, 24]
Perform the Kruskal-Wallis H Test
In [35]:
statistic, p_value = kruskal(group1, group2, group3)
In [36]:
print("Kruskal-Wallis H Test:")
print(f"Kruskal-Wallis H Statistic: {statistic:.2f}")
print(f"P-value: {p_value:.4f}")
Kruskal-Wallis H Test:
Kruskal-Wallis H Statistic: 6.86
P-value: 0.0324
In [37]:
if p_value < 0.05:
    print("There is a significant difference among the groups.")
else:
    print("There is no significant difference among the groups.")
There is a significant difference among the groups.
Exercise 10 Friedman Test
¶
The Friedman Test is a non-parametric statistical test used to detect differences in
treatments across multiple test attempts or blocks. It is the non-parametric alternative
to the repeated measures ANOVA and is used to test for differences between groups when the dependent variable is ordinal or when the assumptions of parametric ANOVA
are not met for interval data.
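For reference (added, not from the original notebook), with n blocks (here, the 5 paired positions), k treatments, and R_j the sum of within-block ranks for treatment j, the statistic in the no-ties case is
$$\chi_F^2 = \frac{12}{n k (k+1)} \sum_{j=1}^{k} R_j^2 - 3 n (k+1)$$
For the three groups below, the rank sums are 14, 7, and 9, which gives 5.2, matching the printed statistic.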
In [38]:
import numpy as np
from scipy.stats import friedmanchisquare
Collect related (paired) data for three or more groups for the Friedman Test
In [39]:
group1 = [10, 12, 15, 8, 11]
group2 = [8, 10, 13, 6, 12]
group3 = [9, 11, 14, 7, 10]
In [40]:
statistic, p_value = friedmanchisquare(group1, group2, group3)
In [41]:
print("Friedman Test:")
print(f"Friedman Test Statistic: {statistic:.2f}")
print(f"P-value: {p_value:.4f}")
Friedman Test:
Friedman Test Statistic: 5.20
P-value: 0.0743
We compare the p-value with 0.05: if it is greater than 0.05, we conclude there is no significant difference among the groups.
In [42]:
if p_value < 0.05:
    print("There is a significant difference among the groups.")
else:
    print("There is no significant difference among the groups.")
There is no significant difference among the groups.
Exercise 11 Mann-Whitney U Test
¶
The Mann-Whitney U Test, also known as the Wilcoxon rank-sum test, is a non-
parametric statistical test used to determine if there are significant differences between two independent groups on a continuous or ordinal dependent variable. It is the non-parametric alternative to the independent samples t-test.
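For reference (an addition to the notebook), rank all n₁ + n₂ observations together and let R₁ be the sum of the ranks belonging to the first group; then
$$U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \qquad U_2 = n_1 n_2 - U_1$$
In the example below every value in group1 is smaller than every value in group2, so the U statistic for group1 is 0, consistent with the printed value of 0.00.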
In [43]:
import numpy as np
from scipy.stats import mannwhitneyu
Define two independent groups of data for the Mann-Whitney U Test
In [44]:
group1 = [20, 24, 22, 26, 21]
group2 = [30, 32, 31, 34, 35]
In [45]:
statistic, p_value = mannwhitneyu(group1, group2)
In [46]:
print("Mann-Whitney U Test:")
print(f"Mann-Whitney U Statistic: {statistic:.2f}")
print(f"P-value: {p_value:.4f}")
Mann-Whitney U Test:
Mann-Whitney U Statistic: 0.00
P-value: 0.0079
In [47]:
if p_value < 0.05:
    print("There is a significant difference between the two groups.")
else:
    print("There is no significant difference between the two groups.")
There is a significant difference between the two groups.
Exercise 12 Pearson’s Chi-Squared Test
¶
Pearson’s Chi-Squared Test (often simply referred to as the Chi-Squared Test) is a statistical test used to determine if there is a significant association between two categorical variables in a contingency table. It's one of the most commonly used tests for independence in categorical data.
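For reference (this formula is an addition to the notebook), the statistic compares the observed counts O_ij with the counts E_ij expected under independence:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$
Note that for 2×2 tables, scipy.stats.chi2_contingency applies Yates' continuity correction by default, so the reported value can differ slightly from this uncorrected formula.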
In this exercise, we will perform the Pearson’s Chi-Squared Test using Python to determine if there is a significant association between two categorical variables. We'll use the scipy.stats library, which provides a function to conduct this test.
Step 1: Import Necessary Libraries
¶
In [48]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Create Sample Data
¶
Let's create a sample contingency table representing the relationship between two categorical variables: Gender and Preference.
In [49]:
# Sample data
data = {'Gender': ['Male', 'Male', 'Female', 'Female'],
'Preference': ['Like', 'Dislike', 'Like', 'Dislike'],
'Count': [50, 10, 20, 40]}
df = pd.DataFrame(data)
df
Out[49]:
   Gender Preference  Count
0    Male       Like     50
1    Male    Dislike     10
2  Female       Like     20
3  Female    Dislike     40
Step 3: Perform the Chi-Squared Test
¶
In [50]:
# Create a contingency table
contingency_table = df.pivot(index='Gender', columns='Preference', values='Count').fillna(0)
# Perform the test
chi2, p_value, _, expected = chi2_contingency(contingency_table)
print(f'Chi-Squared Value: {chi2:.2f}')
print(f'p-value: {p_value:.4f}')
print('Expected Frequencies:')
print(expected)
Chi-Squared Value: 28.83
p-value: 0.0000
Expected Frequencies:
[[25. 35.]
[25. 35.]]
Step 4: Visualize the Data
¶
In [51]:
# Interpretation of the Chi-Squared Test results
alpha = 0.05 # significance level
print(f'Chi-Squared Value: {chi2:.2f}')
print(f'p-value: {p_value:.4f}')
if p_value <= alpha:
    print("The results are statistically significant. There is an association between the categorical variables.")
else:
    print("The results are not statistically significant. There is no evidence of an association between the categorical variables.")
# Visualization of Expected vs. Observed Frequencies
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
# Observed Frequencies
sns.heatmap(contingency_table, annot=True, cmap='coolwarm', fmt='g', ax=ax[0])
ax[0].set_title('Observed Frequencies')
# Expected Frequencies
sns.heatmap(expected, annot=True, cmap='coolwarm', fmt='.1f', ax=ax[1])
ax[1].set_title('Expected Frequencies')
plt.tight_layout()
plt.show()
Chi-Squared Value: 28.83
p-value: 0.0000
The results are statistically significant. There is an association between the categorical variables.
Decision:
¶
• If the p-value is less than or equal to a significance level (often 0.05), you reject the null hypothesis and conclude that there is a significant association between the two categorical variables.
• If the p-value is greater than 0.05, you fail to reject the null hypothesis and conclude that there is no significant evidence to suggest an association between the two categorical variables.
Visualization (Observed vs. Expected Frequencies):
¶
• By visually comparing the observed frequencies (from your data) with the expected frequencies (calculated by the test), you can get a sense of where the differences lie.
• Areas (cells) in the heatmap with the most contrast between observed and expected frequencies are where the most significant differences occur.
Exercise 13 Spearman’s Rank Correlation
¶
Spearman's rank correlation coefficient is a non-parametric statistical test used to measure the strength and direction of the association between two ranked (ordinal) variables. It's an alternative to Pearson's correlation coefficient when the assumptions of linearity or normality are not met.
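For reference (added here), when there are no tied ranks the coefficient can be computed from the differences d_i between the two ranks assigned to each observation:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
Equivalently, it is Pearson's correlation applied to the ranks, which is how scipy.stats.spearmanr handles ties.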
Step 1: Import Libraries and Seed Generator
¶
In [52]:
# Import necessary libraries
from numpy.random import rand
from numpy.random import seed
from scipy.stats import spearmanr
In [53]:
# Seed the random number generator
seed(1)
• Generate two sets of random data, data1 and data2, each containing 1000 data points.
• rand(1000) generates 1000 random numbers between 0 and 1.
• rand(1000) * 20 scales the random numbers in data1 to have values between 0 and 20.
• data2 is created by adding random values between 0 and 10 to data1. This introduces a degree of correlation between data1 and data2.
In [54]:
# Prepare data
data1 = rand(1000) * 20
data2 = data1 + (rand(1000) * 10)
In [55]:
# Calculate Spearman's correlation coefficient and p-value
coef, p = spearmanr(data1, data2)
In [56]:
# Print Spearman's correlation coefficient
print('Spearmans correlation coefficient: %.3f' % coef)
Spearmans correlation coefficient: 0.900
In [57]:
# Interpret the significance
alpha = 0.05
if p > alpha:
    print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
else:
    print('Samples are correlated (reject H0) p=%.3f' % p)
Samples are correlated (reject H0) p=0.000
Running the example calculates Spearman's correlation coefficient between the two variables in the test dataset. The test reports a strong positive correlation with a value of 0.9. The p-value is close to zero, which means that observing such a correlation would be very unlikely if the samples were truly uncorrelated, so we can reject the null hypothesis that the samples are uncorrelated at the 95% confidence level.
Exercise 14 Bootstrap Resampling
¶
Bootstrap resampling, commonly referred to as "bootstrapping," is a powerful statistical technique used for estimating the distribution of a statistic (like the mean or variance) by repeatedly sampling with replacement from an observed dataset. It's particularly useful when the sample size is small or when the underlying distribution of
the data is unknown.
In this exercise, we will perform bootstrap resampling to estimate the mean and its 95% confidence interval of a dataset using Python. We'll use the numpy library for data manipulation and matplotlib for visualization.
Step 1: Import Necessary Libraries
¶
In [58]:
import numpy as np
import matplotlib.pyplot as plt
Step 2: Create Sample Data
¶
Let's start with a small dataset of 15 observations.
In [59]:
data = np.array([23, 45, 56, 78, 89, 12, 67, 49, 55, 77, 88, 90, 34, 56, 71])
Step 3: Bootstrap Resampling
¶
We'll draw bootstrap samples from the original dataset and calculate the mean for
each sample.
In [60]:
n_iterations = 10000
bootstrap_means = []
for _ in range(n_iterations):
    bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
    bootstrap_means.append(np.mean(bootstrap_sample))
Step 4: Calculate 95% Confidence Interval
¶
In [61]:
confidence_level = 0.95
lower_percentile = (1 - confidence_level) / 2 * 100
upper_percentile = (1 + confidence_level) / 2 * 100
confidence_interval = (np.percentile(bootstrap_means, lower_percentile),
np.percentile(bootstrap_means, upper_percentile))
print(f'95% Confidence Interval for the Mean: {confidence_interval}')
95% Confidence Interval for the Mean: (47.46666666666667, 70.66666666666667)
Step 5: Visualize the Bootstrap Distribution
¶
In [62]:
plt.hist(bootstrap_means, bins=50, color='skyblue', edgecolor='black')
plt.axvline(confidence_interval[0], color='red', linestyle='dashed')
plt.axvline(confidence_interval[1], color='red', linestyle='dashed')
plt.title('Bootstrap Distribution of the Mean')
plt.xlabel('Mean')
plt.ylabel('Frequency')
plt.show()
Conclusion
¶
The histogram visualizes the distribution of the bootstrap means. The dashed red lines mark the 95% confidence interval. If a value hypothesized for the population mean (for example, the value stated in a null hypothesis) falls outside this interval, that hypothesis can be rejected at the 5% significance level.
Interpretation
¶
Bootstrapping provides an empirical representation of the sampling distribution of the mean. The 95% confidence interval gives us a range in which we are 95% confident that the true population mean lies. This method is especially useful when the sample size is small or the underlying distribution of the data is unknown.
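The same resampling loop works for other statistics. As a quick added illustration (assuming the data array and numpy import from the steps above), this sketch bootstraps the median instead of the mean:
bootstrap_medians = []
for _ in range(10000):
    sample = np.random.choice(data, size=len(data), replace=True)  # resample with replacement
    bootstrap_medians.append(np.median(sample))
median_ci = (np.percentile(bootstrap_medians, 2.5),
             np.percentile(bootstrap_medians, 97.5))
print(f'95% Confidence Interval for the Median: {median_ci}')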
Exercise 15 Normality Assumption
¶
The normality assumption refers to the assumption that the residuals (or errors) of a model are normally distributed. This assumption is foundational for many statistical tests and methods, especially in the context of linear regression and other parametric tests. When the normality assumption is met, it allows for the use of certain statistical techniques that are based on the properties of the normal distribution.
In this exercise, we will explore the normality assumption by checking whether a given
dataset is normally distributed. We'll use Python libraries such as numpy, scipy, and matplotlib for this purpose.
Step 1: Import Necessary Libraries
¶
In [63]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import shapiro, probplot
Step 2: Generate Sample Data
¶
Let's create a sample dataset using numpy.
In [64]:
data = np.random.randn(100)
Step 3: Visual Inspection using Histogram
¶
A simple way to check for normality is to visualize the data using a histogram.
In [65]:
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.title('Histogram of the Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 4: Quantile-Quantile Plot
¶
A Q-Q plot is another graphical method to assess normality. If the data is normally distributed, the points should fall roughly on a straight line.
In [66]:
probplot(data, plot=plt)
plt.title('Q-Q Plot')
plt.show()
Step 5: Shapiro-Wilk Test
¶
The Shapiro-Wilk test is a statistical test for normality. A p-value less than 0.05 typically indicates that the data is not normally distributed.
In [67]:
stat, p = shapiro(data)
print(f'Statistic: {stat}, p-value: {p}')
Statistic: 0.9924927949905396, p-value: 0.8556731343269348
Conclusion
¶
Based on the histogram and Q-Q plot, you can visually assess the normality of the data. The Shapiro-Wilk test provides a statistical measure. If the p-value is less than 0.05, it suggests that the data may not be normally distributed.
Interpretation
¶
• If the data appears to be normally distributed based on visualizations and the Shapiro-Wilk test p-value is greater than 0.05, you can proceed with statistical tests that assume normality.
• If the data does not appear to be normally distributed, consider data transformations or non-parametric statistical methods.
Exercise 16 Make Data Gaussian and Gaussian-Like
¶
"Making data Gaussian" or "Gaussian-like" refers to the process of transforming a dataset so that its distribution becomes closer to a Gaussian distribution (also known as a normal distribution).
In this exercise, we will explore various techniques to transform a dataset so that its distribution becomes closer to a Gaussian distribution. We'll use Python libraries such as numpy, scipy, and matplotlib for this purpose.
Step 1: Import Necessary Libraries
¶
In [68]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import boxcox, yeojohnson, shapiro, probplot
from sklearn.preprocessing import QuantileTransformer
Step 2: Generate Sample Data
¶
Let's create a skewed dataset using numpy.
In [69]:
data = np.random.exponential(scale=2, size=1000)
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.title('Original Skewed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 3: Log Transformation
¶
In [70]:
data_log = np.log(data)
plt.hist(data_log, bins=30, color='lightgreen', edgecolor='black')
plt.title('Log Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 4: Box-Cox Transformation
¶
In [71]:
data_boxcox, _ = boxcox(data)
plt.hist(data_boxcox, bins=30, color='lightcoral', edgecolor='black')
plt.title('Box-Cox Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 5: Yeo-Johnson Transformation
¶
In [72]:
data_yj, _ = yeojohnson(data)
plt.hist(data_yj, bins=30, color='lightpink', edgecolor='black')
plt.title('Yeo-Johnson Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 6: Quantile Transformation
¶
In [73]:
transformer = QuantileTransformer(output_distribution='normal')
data_quantile = transformer.fit_transform(data.reshape(-1, 1)).flatten()
plt.hist(data_quantile, bins=30, color='lightsalmon', edgecolor='black')
plt.title('Quantile Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Conclusion
¶
Each transformation method has its own characteristics and may be suitable for different types of skewed data. It's essential to visualize the transformed data and possibly use statistical tests like the Shapiro-Wilk test to check the effectiveness of the
transformation.
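As a rough check (not part of the original notebook), the Shapiro-Wilk test can be applied to each transformed array produced above; larger p-values suggest a closer match to a Gaussian shape. This sketch assumes data_log, data_boxcox, data_yj, and data_quantile from the steps above:
for name, arr in [('log', data_log), ('box-cox', data_boxcox),
                  ('yeo-johnson', data_yj), ('quantile', data_quantile)]:
    stat, p = shapiro(arr)
    print(f'{name}: W = {stat:.4f}, p-value = {p:.4f}')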
Interpretation
¶
The goal of these transformations is to make the data more Gaussian-like, which can be beneficial for certain statistical methods and algorithms. However, the transformed data might require a different interpretation, especially in the context of the problem domain.
Exercise 17 Normality Assumption Exercise Using the Iris Dataset
¶
In this exercise, we'll explore the normality assumption by checking whether the features in the Iris dataset are normally distributed. We'll use Python libraries such as pandas, seaborn, scipy, and matplotlib.
Step 1: Import Necessary Libraries
¶
In [74]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import shapiro
Step 2: Load the Iris Dataset
¶
The Iris dataset is available in the seaborn library.
In [75]:
iris = sns.load_dataset('iris')
iris.head()
Out[75]:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Step 3: Visual Inspection using Histograms
¶
A simple way to check for normality is to visualize the distribution of each feature using histograms.
In [76]:
iris.hist(figsize=(12, 10), bins=30, color='skyblue', edgecolor='black')
plt.suptitle('Histograms of Iris Dataset Features')
plt.show()
Step 4: Shapiro-Wilk Test for Normality
¶
The Shapiro-Wilk test is a statistical test for normality. A p-value less than 0.05 typically indicates that the data is not normally distributed.
In [77]:
features = iris.columns[:-1] # Exclude the 'species' column
for feature in features:
    stat, p = shapiro(iris[feature])
    print(f'{feature} - Statistic: {stat:.4f}, p-value: {p:.4f}')
sepal_length - Statistic: 0.9761, p-value: 0.0102
sepal_width - Statistic: 0.9849, p-value: 0.1011
petal_length - Statistic: 0.8763, p-value: 0.0000
petal_width - Statistic: 0.9018, p-value: 0.0000
Step 5: Q-Q Plots
¶
A Q-Q plot is another graphical method to assess normality. If the data is normally distributed, the points should fall roughly on a straight line.
In [78]:
from scipy.stats import probplot
for feature in features:
    plt.figure(figsize=(8, 6))
    probplot(iris[feature], plot=plt)
    plt.title(f'Q-Q Plot for {feature}')
    plt.show()
Conclusion
¶
Based on the histograms, Q-Q plots, and the Shapiro-Wilk test results, you can assess the normality of each feature in the Iris dataset.
Interpretation
¶
• If a feature appears to be normally distributed based on visualizations and the Shapiro-Wilk test p-value is greater than 0.05, it suggests that the feature is approximately normally distributed.
• If the p-value is less than 0.05, it suggests that the feature may not be normally distributed. In such cases, consider data transformations or non-parametric statistical methods.
Summary:
¶
• sepal_length, petal_length, and petal_width are not normally distributed.
• sepal_width is approximately normally distributed.
Revised Date: October 7, 2023
In [ ]: