Manchester City Performance Analysis and Models
Vanja Vidanovic
Keiser University
ISM 4117: Data Mining and Warehousing
Professor Sase Singh
December 9, 2023

Introduction

In the ever-evolving landscape of professional football, analytics has become essential to understanding team performance and making informed decisions. This report presents a comprehensive analysis of Manchester City's football performance. The study combines exploratory analysis, hypothesis testing, cluster analysis, regression analysis, classification analysis, and comparative analysis, and aims to extract actionable insights that can guide strategic decisions for the team's success in the next season. The Kaggle Premier League dataset covers the performance of Premier League football teams from the 2019/2020 season to the 2022/2023 season.

Methodology

Data Processing

The data was provided in a single CSV file containing match information, including home team, opponent, season, goals scored and conceded, and match outcome, among others. The data ranges from 2019 to 2023. The study uses data from the most recent season, 2022/2023, for performance analysis and data from all seasons (2019/2020 to 2022/2023) for modelling. We filtered the data to retain only matches in which Manchester City played. The columns relevant to this study contained no missing values, and all variables were in the correct format.

Feature engineering was a crucial step in data preprocessing, as it created new variables from the existing ones that made the data better suited to the study. The first step was to create win, loss, and draw columns from the results column. The three new columns are binary: if the match result is a win, the win column is assigned True for that match; otherwise, it is assigned False. The same rule applies to the loss and draw columns. We then converted the new columns from Boolean to numeric type by assigning 1 to True and 0 to False.
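The filtering and indicator-column steps described above can be sketched with pandas as follows. This is a minimal illustration rather than the code used in the study: the column names (`team`, `opponent`, `season`, `result`), the result codes (`W`, `D`, `L`), and the sample rows are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample rows; column names and result codes are assumptions,
# not the actual schema of the Kaggle dataset.
matches = pd.DataFrame({
    "team": ["Manchester City", "Manchester City", "Arsenal", "Manchester City"],
    "opponent": ["Arsenal", "Liverpool", "Chelsea", "Brentford"],
    "season": ["2022/2023"] * 4,
    "result": ["W", "D", "W", "L"],
})

# Keep only the matches played by Manchester City.
city = matches[matches["team"] == "Manchester City"].copy()

# Build Boolean indicator columns from the result column,
# then cast them to integers (True -> 1, False -> 0).
for column, code in {"win": "W", "loss": "L", "draw": "D"}.items():
    city[column] = (city["result"] == code).astype(int)

print(city[["opponent", "result", "win", "loss", "draw"]])
```

Casting with `astype(int)` performs the Boolean-to-numeric conversion in one step, so exactly one of the three indicator columns equals 1 for each match.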
We aggregated the resulting data by team, computing the totals of wins, losses, draws, and goals, along with the average possession time. Additionally, we calculated the goal difference by subtracting goals conceded from goals scored.

Data transformation for the machine learning models included all seasons from 2019 to 2023. We dropped irrelevant variables from the data frame and scaled the data using scikit-learn's standard scaling method so that all variables were measured on the same scale. We then conducted feature selection separately for the K-means, regression, and classification models, because the models have different target variables and predictors. Visual and correlation analysis provided helpful insights into variable selection. We split the data 80:20 to provide separate datasets for model training and testing (Mueller & Massaron, 2021).

Hypothesis Test

The study utilises the Chi-Square Test of independence to check for dependence between categorical variables (Chen et al., 2022). The test checks the following hypotheses:

H0: Variables X and Y are independent.
H1: Variables X and Y are dependent.

The study adopts the conventional significance level of 0.05 and rejects the null hypothesis if the test's p-value is less than 0.05. A p-value below the significance level indicates sufficient evidence to reject the null hypothesis, whereas a p-value above it indicates insufficient evidence to do so.

Cluster Analysis

The study adopts the K-means algorithm to partition the data into distinct clusters such that teams with similar characteristics fall in the same cluster. We first performed principal component analysis on the scaled data to obtain new components. The new components are
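The scaling, principal component analysis, and K-means steps described so far can be sketched with scikit-learn as follows. This is an illustrative sketch only: the feature matrix is synthetic, and the choices of two components and three clusters are assumptions, since the text available here does not state the values used in the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for per-team features (e.g. wins, goals, possession).
# These are random numbers, not values from the dataset.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 5))

# Standard scaling: each column gets zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)

# Project the scaled data onto principal components.
# n_components=2 is an assumption made for illustration.
components = PCA(n_components=2).fit_transform(scaled)

# Cluster the teams in component space.
# n_clusters=3 is likewise an assumption.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(components)

print(labels)
```

Scaling before PCA matters because principal components are driven by variance, so an unscaled variable with a large range would otherwise dominate the projection.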

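The Chi-Square Test of independence described in the Hypothesis Test section can be sketched with SciPy as follows. The two categorical variables (`venue` and `result`) and the counts are hypothetical, chosen only to show the mechanics; the source does not say which variable pairs the study tested.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical data; variable names and counts are assumptions.
data = pd.DataFrame({
    "venue": ["Home"] * 10 + ["Away"] * 10,
    "result": ["W"] * 8 + ["D"] * 1 + ["L"] * 1
              + ["W"] * 4 + ["D"] * 3 + ["L"] * 3,
})

# Contingency table of observed counts for the two variables.
table = pd.crosstab(data["venue"], data["result"])

# Null hypothesis: venue and result are independent.
chi2, p_value, dof, expected = chi2_contingency(table)

# Reject the null hypothesis when the p-value falls below 0.05.
alpha = 0.05
reject_null = p_value < alpha

print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.3f}, reject H0: {reject_null}")
```

With counts this small the chi-square approximation is rough, so the example should be read as a demonstration of the decision rule rather than as a meaningful result.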