Manchester City Performance Analysis and Models
Vanja Vidanovic
Keiser University
ISM 4117: Data Mining and Warehousing
Professor Sase Singh
December 9, 2023
Manchester City Performance Analysis and Models

Introduction

In the ever-evolving landscape of professional football, analytics has become essential to understanding team performance and making informed decisions. This report presents a comprehensive analysis of Manchester City's football performance. The study adopts a multifaceted approach, combining exploratory analysis, hypothesis testing, cluster analysis, regression analysis, classification analysis, and comparative analysis. It aims to extract actionable insights that can guide strategic decisions for the team's success in the next season. The Kaggle Premier League dataset covers the performance of Premier League football teams from the 2019/2020 season to the 2022/2023 season.

Methodology

Data Processing

The data was provided in one CSV file containing match information, including home team, opponent, season, goals scored and conceded, and match outcome, among others. The data ranges from 2019 to 2023. The study uses data from the most recent season, 2022/2023, for performance analysis and data from all seasons, 2019/2020 to 2022/2023, for modelling. The data was also filtered to keep only matches in which Manchester City played. The columns relevant to this study contained no missing values, and all variables were in the correct format.

Feature engineering was a crucial step in data preprocessing, as it created new variables from the existing ones and made the data better suited to the study. The first step was to create win, loss, and draw columns from the results column. The three new columns are binary: if the match result is a win, the win column is assigned True for that match; otherwise, it is assigned False. The same rule applies to the loss and draw columns. We then converted the new columns from Boolean to numerical type by assigning 1 to True and 0 to False.
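The win, loss, and draw encoding described above can be sketched with pandas as follows. This is a minimal illustration rather than the study's actual code: the column names `team` and `result`, the result codes `W`, `D`, and `L`, and the sample rows are all assumptions.

```python
import pandas as pd

# Hypothetical sample rows; column names and result codes are assumptions.
matches = pd.DataFrame({
    "team": ["Manchester City", "Arsenal", "Manchester City", "Manchester City"],
    "result": ["W", "L", "D", "W"],
})

# Keep only matches where Manchester City played.
city = matches[matches["team"] == "Manchester City"].copy()

# Build a Boolean indicator for each outcome from the result column,
# then convert True/False to 1/0.
for name, code in [("win", "W"), ("loss", "L"), ("draw", "D")]:
    city[name] = (city["result"] == code).astype(int)

print(city[["result", "win", "loss", "draw"]])
```

Because each match has exactly one outcome, the three indicator columns sum to 1 in every row, which is a quick sanity check on the encoding.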
We aggregated the resulting data by team, computing the sums of wins, losses, draws, and goals, together with the average possession time. Additionally, we calculated the goal difference by subtracting goals conceded from goals scored.

Data transformation for the machine learning models included all seasons from 2019 to 2023. We dropped irrelevant variables from the data frame and scaled the data using scikit-learn's standard scaling method so that all variables were measured on the same scale. We then conducted feature selection separately for the K-means, regression, and classification models, because the models have different target variables and predictors. Visual and correlation analysis provided helpful insights into variable selection. We split the data 80:20 to provide separate datasets for model training and testing (Mueller & Massaron, 2021).

Hypothesis Test

The study uses the Chi-Square Test of Independence to check whether two categorical variables are associated (Chen et al., 2022). The test checks the following hypotheses:

H0: Variables X and Y are independent.
H1: Variables X and Y are dependent.

The study adopts the conventional significance level of 0.05 and rejects the null hypothesis if the test's p-value is less than 0.05. A p-value less than the significance level indicates sufficient evidence to reject the null hypothesis, and a p-value greater than the significance level indicates insufficient evidence to reject it.

Cluster Analysis

The study adopts the K-means algorithm to partition the data into distinct groups so that teams with similar characteristics fall in the same cluster. We first performed principal component analysis on the scaled data to obtain new components. The new components are formed by combining the original variables based on their contribution to the data's variance (Gray, 2017). The optimal number of components is determined by plotting the cumulative proportion of variance explained against the number of components (Gray, 2017). The optimal number is the smallest number of components that together explain at least 80 per cent of the total variance in the data (Gray, 2017). These components are then used as the input to the cluster analysis model.

Determining the optimal number of clusters for the K-means model is crucial. The model is first fitted repeatedly over a range of cluster counts, up to an arbitrary maximum of, say, ten clusters, and the sum of squared errors is calculated for each count. The study adopts the elbow method to determine the optimal number of clusters by plotting the sum of squared errors against the number of clusters. The appropriate number of clusters is the x-axis value at which the graph forms a sharp elbow (Mueller & Massaron, 2021).

Regression Analysis

The study uses scikit-learn's linear regression algorithm and a decision tree regressor to predict the number of goals a team scored from a set of independent variables. It is important to ensure that the predictor variables are not strongly correlated with one another, in order to satisfy the no-multicollinearity assumption of linear regression (Arkes, 2023). The study evaluates the regression models by calculating the mean squared error. A lower mean squared error indicates a better model, although its magnitude depends on the scale of the data (Mueller & Massaron, 2021).

Classification Analysis

The study applies classification algorithms from the decision tree family to predict the outcome of a match. For simplicity, the study focuses on only two outcomes (win or otherwise). The otherwise category in the target column is denoted by 0 and indicates a loss or draw.
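A minimal sketch of this classification setup, assuming scikit-learn, is shown here. It combines the binary target, the 80:20 split described under Data Processing, and a decision tree classifier scored with accuracy and a confusion matrix. The two synthetic features, the random results, the `max_depth` value, and the coding of a win as 1 are assumptions made for illustration. Fitting the scaler on the training portion only is also a choice of this sketch, since the report does not state the order of scaling and splitting.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for match features and results (hypothetical data).
n_matches = 200
X = rng.normal(size=(n_matches, 2))
results = rng.choice(["W", "D", "L"], size=n_matches)

# Binary target: 1 for a win (assumed coding), 0 for a loss or draw.
y = (results == "W").astype(int)

# 80:20 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)

accuracy = accuracy_score(y_test, y_pred)
# Rows are actual classes and columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(accuracy)
print(cm)
```

Since the features here are random noise, the printed accuracy says nothing about football; the sketch only shows how the pieces fit together.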
Model selection and evaluation in this section rely on two evaluation metrics: the accuracy score and the confusion matrix. The accuracy score is used to identify the best model, and the final model is evaluated using the confusion matrix. The accuracy score compares the predicted labels with the actual labels and gives the percentage of correct predictions (Demidenko, 2020). The confusion matrix, on the other hand, shows the number of true negatives, false negatives, true positives, and false positives (Demidenko, 2020).

Results

Figure 1
Goals Scored Grouped by Captain

İlkay Gündoğan is the clear leader: the club scored significantly more goals with him as captain than with any other captain. This suggests that the team performs well when he is captain. The club also does well under Kevin De Bruyne, but there is a noticeable gap between him and Gündoğan. This could be due to several factors, such as Gündoğan having captained more matches than De Bruyne. Kyle Walker ranks third among captains, and the gap between him and De Bruyne is even larger. This suggests that the team may not be as prolific under Walker as under the other two captains. Gündoğan is evidently a key player, and the team's scoring record is strongest under him.
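The comparison behind Figure 1 can be sketched as a grouped sum. Dividing by the number of matches captained is one way to check the caveat noted above, that a captain with more matches will tend to accumulate more goals. The column names `captain` and `goals_for`, the captain labels, and the sample values are hypothetical and are not figures from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; labels and numbers are illustrative only.
matches = pd.DataFrame({
    "captain": ["A", "A", "A", "B", "B", "C"],
    "goals_for": [3, 2, 1, 4, 0, 1],
})

by_captain = matches.groupby("captain")["goals_for"].agg(
    total_goals="sum",       # what a goals-by-captain chart displays
    matches_played="count",  # number of matches captained
)

# Goals per match removes the effect of captaining more matches.
by_captain["goals_per_match"] = (
    by_captain["total_goals"] / by_captain["matches_played"]
)

print(by_captain.sort_values("total_goals", ascending=False))
```

In this toy data, captain A leads on total goals only because of the extra match; A and B have the same goals per match.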
Figure 2
Goals Conceded Grouped by Captain

The figure shows that the club conceded the most goals under Kyle Walker. Comparing the goals scored and conceded under Walker suggests that the club lost more matches with him as captain than with the other captains.