IE6400_Day12

IE6400 Foundations for Data Analytics Engineering
Fall 2023

Module 2: Nonparametric Methods

Nonparametric methods refer to a broad category of statistical techniques that do not make strong assumptions about the form or parameters of the underlying population distribution from which the sample data is drawn. These methods are often used when the assumptions of parametric methods (such as normality) are not met. Key points about nonparametric methods:

1. Distribution-Free: Nonparametric methods do not assume a specific distribution for the data, such as the normal distribution. This makes them more flexible in handling data from unknown or non-normal distributions.
2. Rank-Based: Many nonparametric tests are based on the ranks of the data rather than their actual values. Examples include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
3. Applications: Nonparametric methods are particularly useful for analyzing ordinal or nominal data, as well as interval or ratio data that does not meet the assumptions of parametric tests.
4. Advantages:
   - Flexibility: Can be applied to data that does not meet the assumptions of parametric tests.
   - Robustness: Less sensitive to outliers and skewed data.
   - Simplicity: Often easier to understand and interpret.
5. Disadvantages:
   - Less Powerful: When the assumptions of parametric tests are met, nonparametric tests are generally less powerful (i.e., they may have a lower chance of detecting a true effect).
   - Limited Parameters: Nonparametric methods do not provide estimates of population parameters such as the mean or standard deviation.
6. Common Nonparametric Tests:
   - Mann-Whitney U Test (Wilcoxon Rank-Sum Test): Compares the distributions of two independent samples.
   - Wilcoxon Signed-Rank Test: Compares the distributions of two paired samples.
   - Kruskal-Wallis Test: An extension of the Mann-Whitney U test for comparing more than two samples.
   - Spearman's Rank Correlation: Measures the strength and direction of the association between two ranked variables.
   - Chi-Squared Test: Tests the association between two categorical variables.
7. Kernel Density Estimation: A nonparametric way to estimate the probability density function of a continuous random variable.
8. Nonparametric Regression: Techniques such as LOESS (locally weighted scatterplot smoothing) that do not assume a specific form for the relationship between the predictors and the response variable.

In summary, nonparametric methods offer a versatile toolkit for statistical analysis when the assumptions of traditional parametric methods are not met. They are especially useful for analyzing data that is skewed, has outliers, or comes from an unknown distribution.
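Kernel density estimation is listed above but is not revisited in the exercises below. The following is a minimal sketch, not part of the original module code, using scipy.stats.gaussian_kde on a hypothetical skewed sample to show what a nonparametric density estimate looks like next to a histogram.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Hypothetical sample drawn from a skewed (non-normal) distribution
np.random.seed(1)
sample = np.random.exponential(scale=2, size=500)

# Fit a Gaussian kernel density estimate (bandwidth chosen automatically)
kde = gaussian_kde(sample)

# Evaluate the estimated density on a grid and overlay it on a normalized histogram
xs = np.linspace(sample.min(), sample.max(), 200)
plt.hist(sample, bins=30, density=True, color='skyblue', edgecolor='black')
plt.plot(xs, kde(xs), color='red', label='KDE')
plt.title('Kernel Density Estimate of a Skewed Sample')
plt.legend()
plt.show()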
Exercise 1: Ranking Data

Ranking data refers to data representing the order or position of items relative to one another without necessarily indicating the magnitude of differences between them. In other words, ranking data tells you the order of items but not the actual values or scores that led to that order.

In [1]:
import pandas as pd

In [2]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Score': [85, 72, 92, 72]}

In [3]:
df = pd.DataFrame(data)
df

Out[3]:
      Name  Score
0    Alice     85
1      Bob     72
2  Charlie     92
3    David     72

In [4]:
df['Rank'] = df['Score'].rank(method='average', ascending=False)
print(df)

      Name  Score  Rank
0    Alice     85   2.0
1      Bob     72   3.5
2  Charlie     92   1.0
3    David     72   3.5

This method assigns ranks to data points based on their values. Ties can be handled in various ways, such as averaging the ranks.

Exercise 2: Ranking Using the scipy.stats.rankdata() Function

In [5]:
import numpy as np
from scipy.stats import rankdata

In [6]:
scores = np.array([85, 72, 92, 72])
ranks = rankdata(-scores, method='average')

In [7]:
print('Data:', scores)
print('Ranks:', ranks)

Data: [85 72 92 72]
Ranks: [2. 3.5 1. 3.5]

The rankdata() function from SciPy can be used to rank data, and it provides various methods for handling ties.
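The exercise above notes that rankdata() supports several tie-handling methods but only shows 'average'. The short sketch below, not from the original notebook, loops over the other documented options on the same scores so the differences can be compared directly.

import numpy as np
from scipy.stats import rankdata

scores = np.array([85, 72, 92, 72])

# Compare the tie-handling strategies offered by rankdata()
for method in ['average', 'min', 'max', 'dense', 'ordinal']:
    print(method, rankdata(-scores, method=method))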
Exercise 3: Ranking Using the pandas.Series.rank() Method

In [8]:
import pandas as pd

In [9]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Score': [85, 72, 92, 72]}

In [10]:
df = pd.DataFrame(data)
df['Rank'] = df['Score'].rank(ascending=False, method='average')
print(df)

      Name  Score  Rank
0    Alice     85   2.0
1      Bob     72   3.5
2  Charlie     92   1.0
3    David     72   3.5

Exercise 4: Ranking Using the argsort() Function

argsort() performs an indirect sort: it returns the indices that would sort the array, not the ranks themselves.

In [11]:
import numpy as np

In [12]:
scores = np.array([85, 72, 92, 72])
ranks = np.argsort(-scores) + 1
print('Data:', scores)
print('Ranks:', ranks)

Data: [85 72 92 72]
Ranks: [3 1 2 4]

Note that the array printed here is the (1-based) sort order of the elements from highest to lowest score: element 3 (92) comes first, then element 1 (85), and so on. To convert this ordering into ranks comparable to the previous exercises, argsort() must be applied twice, e.g. np.argsort(np.argsort(-scores)) + 1.

Exercise 5

The SciPy library provides the rankdata() function to rank numerical data, and it supports a number of variations on ranking. The example below demonstrates how to rank a numerical dataset of 1,000 random values.

In [13]:
from numpy.random import rand
from numpy.random import seed
from scipy.stats import rankdata

In [14]:
# seed random number generator
seed(1)

In [15]:
# generate dataset
data = rand(1000)

In [16]:
# review first 10 samples
print(data[:10])

[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01 1.46755891e-01 9.23385948e-02 1.86260211e-01 3.45560727e-01 3.96767474e-01 5.38816734e-01]

The first 10 elements of the dataset are displayed above.

In [17]:
# rank data
ranked = rankdata(data)
# review first 10 ranked samples
print(ranked[:10])

[408. 721. 1. 300. 151. 93. 186. 342. 385. 535.]

Exercise 6: Kendall Tau Coefficient

The Kendall Tau Coefficient, often denoted as τ (tau), is a non-parametric statistic used to measure the strength and direction of the association between two ordinal variables. It is a rank correlation coefficient that assesses the degree of concordance between two sets of rankings.

In this exercise, we will compute the Kendall Tau Coefficient using Python to measure the rank correlation between two sets of data. We'll use the scipy.stats library, which provides a function to calculate this coefficient.

Step 1: Import Necessary Libraries

In [18]:
import numpy as np
from scipy.stats import kendalltau
import matplotlib.pyplot as plt

Step 2: Create Sample Data

Let's create two sets of ranked data to compute the Kendall Tau Coefficient.

In [19]:
# Sample data
rankings_A = np.array([1, 2, 3, 4, 5])
rankings_B = np.array([3, 2, 4, 1, 5])

Step 3: Compute the Kendall Tau Coefficient

In [20]:
tau, p_value = kendalltau(rankings_A, rankings_B)
print(f'Kendall Tau Coefficient: {tau:.2f}')
print(f'p-value: {p_value:.2f}')

Kendall Tau Coefficient: 0.20
p-value: 0.82

In [21]:
# Interpret the correlation
alpha = 0.05
if p_value < alpha:
    print("There is a statistically significant correlation.")
else:
    print("There is no statistically significant correlation.")

There is no statistically significant correlation.

The Kendall Tau Coefficient lies between -1 and 1. A positive value indicates that the two sets of rankings are similar, a negative value indicates that the rankings are dissimilar, and a value close to 0 suggests little to no association. From these results you can determine the strength and direction of the association between the two sets of rankings; a significant p-value (typically < 0.05) suggests that the observed association is statistically significant.
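To make the "degree of concordance" idea concrete, the sketch below, which is not part of the original notebook, counts concordant and discordant pairs for the same two rankings by hand and reproduces tau = (C - D) / (C + D), which matches kendalltau for data without ties.

import numpy as np
from itertools import combinations

rankings_A = np.array([1, 2, 3, 4, 5])
rankings_B = np.array([3, 2, 4, 1, 5])

concordant = discordant = 0
for i, j in combinations(range(len(rankings_A)), 2):
    # A pair is concordant if both rankings order items i and j the same way
    sign = (rankings_A[i] - rankings_A[j]) * (rankings_B[i] - rankings_B[j])
    if sign > 0:
        concordant += 1
    elif sign < 0:
        discordant += 1

tau = (concordant - discordant) / (concordant + discordant)
print(concordant, discordant, tau)  # 6 concordant, 4 discordant, tau = 0.2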
Exercise 7: Case Study - Analyzing Exam Scores and Study Hours

Background: A researcher wants to investigate whether there is a significant correlation between the number of hours students spend studying for an exam and their scores on the exam.

Data Collection: The researcher collects data from a group of 20 students. For each student, they record the number of hours spent studying and the exam score achieved. Here's the dataset:

In [22]:
import pandas as pd

In [23]:
df = pd.read_csv('example.csv')

In [24]:
df

Out[24]:
    Hours_Study  Exam_Score
0            10          85
1             5          60
2             8          75
3             3          50
4            12          92
5            15          98
6             9          80
7             7          70
8             2          45
9            11          88
10            6          62
11            4          55
12           14          96
13           16          99
14           13          90
15           18         100
16            1          40
17           17          98
18           20         105
19           19         110
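Before running the test, it can help to inspect the relationship visually. The following is a minimal sketch, not in the original case study, that scatter-plots the same DataFrame df loaded above.

import matplotlib.pyplot as plt

# Quick visual check of the relationship between study hours and exam scores
plt.scatter(df['Hours_Study'], df['Exam_Score'])
plt.xlabel('Hours_Study')
plt.ylabel('Exam_Score')
plt.title('Exam Score vs. Hours of Study')
plt.show()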
In [25]:
from scipy import stats

In [26]:
# Calculate Kendall's Tau and p-value
tau, p_value = stats.kendalltau(df['Hours_Study'], df['Exam_Score'])

# Set significance level
alpha = 0.05

In [27]:
# Compare p-value to alpha
if p_value < alpha:
    result = "Reject the null hypothesis. There is a significant correlation."
else:
    result = "Fail to reject the null hypothesis. There is no significant correlation."

In [28]:
print(f"Kendall's Tau (τ) = {tau:.2f}")
print(f"P-value = {p_value:.4f}")
print(result)

Kendall's Tau (τ) = 0.97
P-value = 0.0000
Reject the null hypothesis. There is a significant correlation.

Since the p-value is less than the chosen significance level (α = 0.05), we reject the null hypothesis. This suggests that there is a significant positive correlation between the number of hours spent studying and the exam scores.

Exercise 8: Wilcoxon Test

The Wilcoxon test, also known as the Wilcoxon signed-rank test, is a non-parametric statistical test used to compare two paired groups. It tests the null hypothesis that two related paired samples come from the same distribution; specifically, it tests whether the differences between the pairs follow a symmetric distribution around zero.

In this exercise, we will perform the Wilcoxon signed-rank test using Python to compare two paired groups. We'll use the scipy.stats library, which provides a function to conduct this test.

Step 1: Import Necessary Libraries

In [29]:
import numpy as np
from scipy.stats import wilcoxon
import matplotlib.pyplot as plt

Step 2: Create Sample Paired Data

Let's create two sets of paired data to perform the Wilcoxon test.

In [30]:
# Sample paired data
before_treatment = np.array([20, 22, 24, 19, 18, 21, 25, 23, 22, 20])
after_treatment = np.array([19, 24, 22, 20, 19, 20, 26, 22, 21, 19])

Step 3: Perform the Wilcoxon Test

In [31]:
w, p_value = wilcoxon(before_treatment, after_treatment)
print(f'Wilcoxon Test Statistic: {w}')
print(f'p-value: {p_value:.4f}')

Wilcoxon Test Statistic: 23.0
p-value: 0.6953

In [32]:
# Interpret the results
alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two groups.")

Fail to reject the null hypothesis: There is no significant difference between the two groups.

The p-value helps us determine whether the differences between the paired samples are statistically significant. A small p-value (typically < 0.05) suggests that the two sets of paired data are significantly different, i.e., that the treatment had a significant effect on the sample. Otherwise, there is no evidence to suggest a significant effect.

Exercise 9: Kruskal-Wallis H Test

The Kruskal-Wallis H Test, often simply referred to as the Kruskal-Wallis test, is a non-parametric statistical test used to determine if there are statistically significant differences between two or more groups of an independent variable on a continuous or ordinal dependent variable. It is the non-parametric equivalent of the one-way ANOVA.

In [33]:
import numpy as np
from scipy.stats import kruskal

Define three or more independent groups of data for the Kruskal-Wallis H Test.

In [34]:
group1 = [22, 25, 28, 30, 32]
group2 = [18, 21, 23, 26, 29]
group3 = [15, 16, 19, 20, 24]

Perform the Kruskal-Wallis H Test.

In [35]:
statistic, p_value = kruskal(group1, group2, group3)

In [36]:
print("Kruskal-Wallis H Test:")
print(f"Kruskal-Wallis H Statistic: {statistic:.2f}")
print(f"P-value: {p_value:.4f}")

Kruskal-Wallis H Test:
Kruskal-Wallis H Statistic: 6.86
P-value: 0.0324

In [37]:
if p_value < 0.05:
    print("There is a significant difference among the groups.")
else:
    print("There is no significant difference among the groups.")

There is a significant difference among the groups.
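A significant Kruskal-Wallis result only says that at least one group differs, not which one. A common follow-up, sketched below as an assumption rather than part of the original exercise, is to run pairwise Mann-Whitney U tests with a Bonferroni-adjusted significance level on the same three groups.

from itertools import combinations
from scipy.stats import mannwhitneyu

groups = {'group1': [22, 25, 28, 30, 32],
          'group2': [18, 21, 23, 26, 29],
          'group3': [15, 16, 19, 20, 24]}

# Bonferroni correction: divide alpha by the number of pairwise comparisons
alpha = 0.05 / 3

for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    stat, p = mannwhitneyu(a, b)
    verdict = "significant" if p < alpha else "not significant"
    print(f"{name_a} vs {name_b}: U = {stat:.1f}, p = {p:.4f} ({verdict})")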
Exercise 10: Friedman Test

The Friedman Test is a non-parametric statistical test used to detect differences in treatments across multiple test attempts or blocks. It is the non-parametric alternative to the repeated-measures ANOVA and is used to test for differences between groups when the dependent variable is ordinal, or when the assumptions of parametric ANOVA are not met for interval data.

In [38]:
import numpy as np
from scipy.stats import friedmanchisquare

Collect related (paired) data for three or more groups for the Friedman Test.

In [39]:
group1 = [10, 12, 15, 8, 11]
group2 = [8, 10, 13, 6, 12]
group3 = [9, 11, 14, 7, 10]

In [40]:
statistic, p_value = friedmanchisquare(group1, group2, group3)

In [41]:
print("Friedman Test:")
print(f"Friedman Test Statistic: {statistic:.2f}")
print(f"P-value: {p_value:.4f}")

Friedman Test:
Friedman Test Statistic: 5.20
P-value: 0.0743

We compare the p-value with 0.05; if it is greater than 0.05, there is no significant difference among the groups.

In [42]:
if p_value < 0.05:
    print("There is a significant difference among the groups.")
else:
    print("There is no significant difference among the groups.")

There is no significant difference among the groups.

Exercise 11: Mann-Whitney U Test

The Mann-Whitney U Test, also known as the Wilcoxon rank-sum test, is a non-parametric statistical test used to determine if there are significant differences between two independent groups on a continuous or ordinal dependent variable. It is the non-parametric alternative to the independent-samples t-test.

In [43]:
import numpy as np
from scipy.stats import mannwhitneyu

Define two independent groups of data for the Mann-Whitney U Test.

In [44]:
group1 = [20, 24, 22, 26, 21]
group2 = [30, 32, 31, 34, 35]

In [45]:
statistic, p_value = mannwhitneyu(group1, group2)

In [46]:
print("Mann-Whitney U Test:")
print(f"Mann-Whitney U Statistic: {statistic:.2f}")
print(f"P-value: {p_value:.4f}")

Mann-Whitney U Test:
Mann-Whitney U Statistic: 0.00
P-value: 0.0079

In [47]:
if p_value < 0.05:
print("There is a significant difference between the two groups.") else: print("There is no significant difference between the two groups.") There is a significant difference between the two groups. Exercise 12 Pearson’s Chi-Squared Test Pearson’s Chi-Squared Test (often simply referred to as the Chi-Squared Test) is a statistical test used to determine if there is a significant association between two categorical variables in a contingency table. It's one of the most commonly used tests for independence in categorical data. In this exercise, we will perform the Pearson’s Chi-Squared Test using Python to determine if there is a significant association between two categorical variables. We'll use the scipy.stats library, which provides a function to conduct this test. Step 1: Import Necessary Libraries In [48]: import numpy as np import pandas as pd from scipy.stats import chi2_contingency import matplotlib.pyplot as plt import seaborn as sns Step 2: Create Sample Data Let's create a sample contingency table representing the relationship between two categorical variables: Gender and Preference. In [49]: # Sample data data = {'Gender': ['Male', 'Male', 'Female', 'Female'], 'Preference': ['Like', 'Dislike', 'Like', 'Dislike'], 'Count': [50, 10, 20, 40]} df = pd.DataFrame(data) df Out[49]: Gender Preference Count 0 Male Like 50 1 Male Dislike 10 2 Female Like 20 3 Female Dislike 40 Step 3: Perform the Chi-Squared Test In [50]: # Create a contingency table contingency_table = df.pivot(index='Gender', columns='Preference', values='Count').fillna(0) # Perform the test chi2, p_value, _, expected = chi2_contingency(contingency_table) print(f'Chi-Squared Value: {chi2:.2f}') print(f'p-value: {p_value:.4f}')
print('Expected Frequencies:')
print(expected)

Chi-Squared Value: 28.83
p-value: 0.0000
Expected Frequencies:
[[25. 35.]
 [25. 35.]]

Step 4: Visualize the Data

In [51]:
# Interpretation of the Chi-Squared Test results
alpha = 0.05  # significance level

print(f'Chi-Squared Value: {chi2:.2f}')
print(f'p-value: {p_value:.4f}')

if p_value <= alpha:
    print("The results are statistically significant. There is an association between the categorical variables.")
else:
    print("The results are not statistically significant. There is no evidence of an association between the categorical variables.")

# Visualization of Expected vs. Observed Frequencies
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Observed Frequencies
sns.heatmap(contingency_table, annot=True, cmap='coolwarm', fmt='g', ax=ax[0])
ax[0].set_title('Observed Frequencies')

# Expected Frequencies
sns.heatmap(expected, annot=True, cmap='coolwarm', fmt='.1f', ax=ax[1])
ax[1].set_title('Expected Frequencies')

plt.tight_layout()
plt.show()

Chi-Squared Value: 28.83
p-value: 0.0000
The results are statistically significant. There is an association between the categorical variables.
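To see where the expected frequencies above come from, the short sketch below, which is not in the original notebook, recomputes them directly from the marginal totals (expected count = row total × column total / grand total) for the same contingency table, whose rows and columns are sorted alphabetically by the pivot (Female/Male and Dislike/Like).

import numpy as np

# Observed contingency table: rows = Gender (Female, Male), columns = Preference (Dislike, Like)
observed = np.array([[40, 20],
                     [10, 50]])

row_totals = observed.sum(axis=1, keepdims=True)   # [[60], [60]]
col_totals = observed.sum(axis=0, keepdims=True)   # [[50, 70]]
grand_total = observed.sum()                       # 120

# Expected count in each cell under independence
expected_manual = row_totals * col_totals / grand_total
print(expected_manual)  # [[25. 35.]
                        #  [25. 35.]]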
Decision: If the p-value is less than or equal to the significance level (often 0.05), you reject the null hypothesis and conclude that there is a significant association between the two categorical variables. If the p-value is greater than 0.05, you fail to reject the null hypothesis and conclude that there is no significant evidence of an association between the two categorical variables.

Visualization (Observed vs. Expected Frequencies): By visually comparing the observed frequencies (from your data) with the expected frequencies (calculated by the test), you can get a sense of where the differences lie. Cells in the heatmaps with the most contrast between observed and expected frequencies are where the largest differences occur.

Exercise 13: Spearman's Rank Correlation

Spearman's rank correlation coefficient is a non-parametric statistic used to measure the strength and direction of the association between two ranked (ordinal) variables. It is an alternative to Pearson's correlation coefficient when the assumptions of linearity or normality are not met.

Step 1: Import Libraries and Seed the Generator

In [52]:
# Import necessary libraries
from numpy.random import rand
from numpy.random import seed
from scipy.stats import spearmanr

In [53]:
# Seed the random number generator
seed(1)

Generate two sets of random data, data1 and data2, each containing 1000 data points. rand(1000) generates 1000 random numbers between 0 and 1, and rand(1000) * 20 scales the random numbers in data1 to values between 0 and 20. data2 is created by adding random values between 0 and 10 to data1, which introduces a degree of correlation between data1 and data2.

In [54]:
# Prepare data
data1 = rand(1000) * 20
data2 = data1 + (rand(1000) * 10)

In [55]:
# Calculate Spearman's correlation coefficient and p-value
coef, p = spearmanr(data1, data2)

In [56]:
# Print Spearman's correlation coefficient
print('Spearmans correlation coefficient: %.3f' % coef)

Spearmans correlation coefficient: 0.900

In [57]:
# Interpret the significance
alpha = 0.05
if p > alpha:
    print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
else:
    print('Samples are correlated (reject H0) p=%.3f' % p)

Samples are correlated (reject H0) p=0.000

Running the example calculates Spearman's correlation coefficient between the two variables in the test dataset. The test reports a strong positive correlation with a value of 0.9. The p-value is close to zero, meaning that the probability of observing such data if the samples were truly uncorrelated is very low, so we can reject the null hypothesis that the samples are uncorrelated (at the 95% confidence level).
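Spearman's coefficient is, in effect, Pearson's correlation computed on the ranks of the data. The sketch below is not part of the original exercise; it assumes data1 and data2 from the cells above and illustrates the equivalence.

from scipy.stats import pearsonr, rankdata

# Spearman's rho equals Pearson's r applied to the ranks of the data
rho_via_ranks, _ = pearsonr(rankdata(data1), rankdata(data2))
print('Pearson correlation of the ranks: %.3f' % rho_via_ranks)  # ~0.900, matching spearmanr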
Exercise 14: Bootstrap Resampling

Bootstrap resampling, commonly referred to as "bootstrapping," is a powerful statistical technique used for estimating the distribution of a statistic (such as the mean or variance) by repeatedly sampling with replacement from an observed dataset. It is particularly useful when the sample size is small or when the underlying distribution of the data is unknown.

In this exercise, we will perform bootstrap resampling to estimate the mean of a dataset and its 95% confidence interval using Python. We'll use the numpy library for data manipulation and matplotlib for visualization.

Step 1: Import Necessary Libraries

In [58]:
import numpy as np
import matplotlib.pyplot as plt

Step 2: Create Sample Data

Let's start with a small dataset of 15 observations.

In [59]:
data = np.array([23, 45, 56, 78, 89, 12, 67, 49, 55, 77, 88, 90, 34, 56, 71])

Step 3: Bootstrap Resampling

We'll draw bootstrap samples from the original dataset and calculate the mean for each sample.

In [60]:
n_iterations = 10000
bootstrap_means = []

for _ in range(n_iterations):
    bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
    bootstrap_means.append(np.mean(bootstrap_sample))

Step 4: Calculate the 95% Confidence Interval

In [61]:
confidence_level = 0.95
lower_percentile = (1 - confidence_level) / 2 * 100
upper_percentile = (1 + confidence_level) / 2 * 100
confidence_interval = (np.percentile(bootstrap_means, lower_percentile),
                       np.percentile(bootstrap_means, upper_percentile))
print(f'95% Confidence Interval for the Mean: {confidence_interval}')

95% Confidence Interval for the Mean: (47.46666666666667, 70.66666666666667)

Step 5: Visualize the Bootstrap Distribution

In [62]:
plt.hist(bootstrap_means, bins=50, color='skyblue', edgecolor='black')
plt.axvline(confidence_interval[0], color='red', linestyle='dashed')
plt.axvline(confidence_interval[1], color='red', linestyle='dashed')
plt.title('Bootstrap Distribution of the Mean')
plt.xlabel('Mean')
plt.ylabel('Frequency')
plt.show()

Conclusion

The histogram visualizes the distribution of the bootstrap means, and the dashed red lines represent the 95% confidence interval.
If a hypothesized value of the population mean falls outside this interval, the difference is considered statistically significant at the 5% level.

Interpretation

Bootstrapping provides an empirical representation of the sampling distribution of the mean. The 95% confidence interval gives us a range in which we are 95% confident that the true population mean lies. This method is especially useful when the sample size is small or the underlying distribution of the data is unknown.

Exercise 15: Normality Assumption

The normality assumption refers to the assumption that the residuals (or errors) of a model are normally distributed. This assumption is foundational for many statistical tests and methods, especially in the context of linear regression and other parametric tests. When the normality assumption is met, it allows for the use of statistical techniques that are based on the properties of the normal distribution.

In this exercise, we will explore the normality assumption by checking whether a given dataset is normally distributed. We'll use Python libraries such as numpy, scipy, and matplotlib for this purpose.

Step 1: Import Necessary Libraries

In [63]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import shapiro, probplot

Step 2: Generate Sample Data

Let's create a sample dataset using numpy.

In [64]:
data = np.random.randn(100)

Step 3: Visual Inspection Using a Histogram

A simple way to check for normality is to visualize the data using a histogram.

In [65]:
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.title('Histogram of the Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 4: Quantile-Quantile Plot

A Q-Q plot is another graphical method to assess normality. If the data is normally distributed, the points should fall roughly on a straight line.

In [66]:
probplot(data, plot=plt)
plt.title('Q-Q Plot')
plt.show()

Step 5: Shapiro-Wilk Test

The Shapiro-Wilk test is a statistical test for normality. A p-value less than 0.05 typically indicates that the data is not normally distributed.
In [67]:
stat, p = shapiro(data)
print(f'Statistic: {stat}, p-value: {p}')

Statistic: 0.9924927949905396, p-value: 0.8556731343269348

Conclusion

Based on the histogram and Q-Q plot, you can visually assess the normality of the data, and the Shapiro-Wilk test provides a statistical measure. If the p-value is less than 0.05, it suggests that the data may not be normally distributed.

Interpretation

If the data appears to be normally distributed based on the visualizations and the Shapiro-Wilk test p-value is greater than 0.05, you can proceed with statistical tests that assume normality. If the data does not appear to be normally distributed, consider data transformations or non-parametric statistical methods.
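Shapiro-Wilk is only one of several normality checks. As a complementary sketch, not part of the original exercise and assuming the same data array generated above, the D'Agostino-Pearson test from scipy.stats can be run alongside it.

from scipy.stats import normaltest

# D'Agostino-Pearson K^2 test: combines skewness and kurtosis into one statistic
stat, p = normaltest(data)
print(f'K^2 statistic: {stat:.4f}, p-value: {p:.4f}')
if p > 0.05:
    print('No evidence against normality at the 5% level.')
else:
    print('The data may not be normally distributed.')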
Exercise 16: Make Data Gaussian and Gaussian-Like

"Making data Gaussian" or "Gaussian-like" refers to the process of transforming a dataset so that its distribution becomes closer to a Gaussian distribution (also known as a normal distribution).

In this exercise, we will explore various techniques to transform a dataset so that its distribution becomes closer to a Gaussian distribution. We'll use Python libraries such as numpy, scipy, and matplotlib for this purpose.

Step 1: Import Necessary Libraries

In [68]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import boxcox, yeojohnson, shapiro, probplot
from sklearn.preprocessing import QuantileTransformer

Step 2: Generate Sample Data

Let's create a skewed dataset using numpy.

In [69]:
data = np.random.exponential(scale=2, size=1000)

plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.title('Original Skewed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Step 3: Log Transformation

In [70]:
data_log = np.log(data)

plt.hist(data_log, bins=30, color='lightgreen', edgecolor='black')
plt.title('Log Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Step 4: Box-Cox Transformation

In [71]:
data_boxcox, _ = boxcox(data)

plt.hist(data_boxcox, bins=30, color='lightcoral', edgecolor='black')
plt.title('Box-Cox Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Step 5: Yeo-Johnson Transformation

In [72]:
data_yj, _ = yeojohnson(data)

plt.hist(data_yj, bins=30, color='lightpink', edgecolor='black')
plt.title('Yeo-Johnson Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 6: Quantile Transformation

In [73]:
transformer = QuantileTransformer(output_distribution='normal')
data_quantile = transformer.fit_transform(data.reshape(-1, 1)).flatten()

plt.hist(data_quantile, bins=30, color='lightsalmon', edgecolor='black')
plt.title('Quantile Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
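The conclusion below recommends checking the effectiveness of each transformation with a test such as Shapiro-Wilk. Here is a minimal sketch of that check, not in the original notebook, assuming the transformed arrays created in the previous steps are still in memory.

from scipy.stats import shapiro

# Compare Shapiro-Wilk p-values across the original and transformed datasets
for label, values in [('original', data),
                      ('log', data_log),
                      ('box-cox', data_boxcox),
                      ('yeo-johnson', data_yj),
                      ('quantile', data_quantile)]:
    stat, p = shapiro(values)
    print(f'{label}: W = {stat:.4f}, p-value = {p:.4f}')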
Conclusion

Each transformation method has its own characteristics and may be suitable for different types of skewed data. It is essential to visualize the transformed data and possibly use statistical tests such as the Shapiro-Wilk test to check the effectiveness of the transformation.

Interpretation

The goal of these transformations is to make the data more Gaussian-like, which can be beneficial for certain statistical methods and algorithms. However, the transformed data might require a different interpretation, especially in the context of the problem domain.

Exercise 17: Normality Assumption Exercise Using the Iris Dataset

In this exercise, we'll explore the normality assumption by checking whether the features in the Iris dataset are normally distributed. We'll use Python libraries such as pandas, seaborn, scipy, and matplotlib.

Step 1: Import Necessary Libraries

In [74]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import shapiro

Step 2: Load the Iris Dataset

The Iris dataset is available through the seaborn library.

In [75]:
iris = sns.load_dataset('iris')
iris.head()

Out[75]:
   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa
2           4.7          3.2           1.3          0.2   setosa
3           4.6          3.1           1.5          0.2   setosa
4           5.0          3.6           1.4          0.2   setosa

Step 3: Visual Inspection Using Histograms

A simple way to check for normality is to visualize the distribution of each feature using histograms.

In [76]:
iris.hist(figsize=(12, 10), bins=30, color='skyblue', edgecolor='black')
plt.suptitle('Histograms of Iris Dataset Features')
plt.show()
Step 4: Shapiro-Wilk Test for Normality

The Shapiro-Wilk test is a statistical test for normality. A p-value less than 0.05 typically indicates that the data is not normally distributed.

In [77]:
features = iris.columns[:-1]  # Exclude the 'species' column
for feature in features:
    stat, p = shapiro(iris[feature])
    print(f'{feature} - Statistic: {stat:.4f}, p-value: {p:.4f}')

sepal_length - Statistic: 0.9761, p-value: 0.0102
sepal_width - Statistic: 0.9849, p-value: 0.1011
petal_length - Statistic: 0.8763, p-value: 0.0000
petal_width - Statistic: 0.9018, p-value: 0.0000

Step 5: Q-Q Plots

A Q-Q plot is another graphical method to assess normality. If the data is normally distributed, the points should fall roughly on a straight line.

In [78]:
from scipy.stats import probplot

for feature in features:
    plt.figure(figsize=(8, 6))
    probplot(iris[feature], plot=plt)
    plt.title(f'Q-Q Plot for {feature}')
    plt.show()
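The Shapiro-Wilk results above pool all three species together, so some of the apparent non-normality may simply reflect the mixture of species. As a hedged refinement that is not part of the original exercise, the sketch below repeats the test within each species, assuming the iris DataFrame and features from the cells above.

from scipy.stats import shapiro

# Repeat the Shapiro-Wilk test separately for each species
for species, group in iris.groupby('species'):
    print(f'--- {species} ---')
    for feature in features:
        stat, p = shapiro(group[feature])
        print(f'{feature}: W = {stat:.4f}, p-value = {p:.4f}')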
Conclusion

Based on the histograms, Q-Q plots, and the Shapiro-Wilk test results, you can assess the normality of each feature in the Iris dataset.

Interpretation

If a feature appears normally distributed in the visualizations and its Shapiro-Wilk p-value is greater than 0.05, the feature is approximately normally distributed. If the p-value is less than 0.05, the feature may not be normally distributed; in such cases, consider data transformations or non-parametric statistical methods.

Summary: sepal_length, petal_length, and petal_width are not normally distributed, while sepal_width is approximately normally distributed.

Revised Date: October 7, 2023