Manchester City Performance Analysis and Models
Vanja Vidanovic
Keiser University
ISM 4117: Data Mining and Warehousing
Professor Sase Singh
December 9, 2023
Introduction

In the ever-evolving landscape of professional football, analytics has become essential to understanding team performance and making informed decisions. This report presents a comprehensive analysis of Manchester City's football performance. The study adopts a multifaceted approach, combining exploratory analysis, hypothesis testing, cluster analysis, regression analysis, classification analysis, and comparative analysis. It aims to extract actionable insights that can guide strategic decisions for the team's success in the next season. The Kaggle Premier League dataset covers the performance of Premier League football teams from the 2019/2020 season to the 2022/2023 season.

Methodology

Data Processing

The data was provided in one CSV file containing match information, including home team, opponent, season, goals scored and conceded, and match outcome, among others. The data ranges from 2019 to 2023. The study uses the most recent season, 2022/2023, for performance analysis and all seasons from 2019 to 2023 for modelling. It was also necessary to filter the data to matches in which Manchester City played. The columns relevant to this study had no missing values, and all variables were in the correct format.

Feature engineering was a crucial preprocessing step, as it created new variables from existing ones and made the data better suited to the study. The first step was to create win, loss, and draw columns from the results column. The three new columns are binary: if the match result is a win, the win column is assigned True for that match; otherwise, it is assigned False. The same rule applies to the loss and draw columns. We then converted the new columns from Boolean to numerical type by assigning 1 to True and 0 to False.
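The indicator-column step described here can be sketched with pandas. This is a minimal illustration under stated assumptions: the column names `team` and `result`, the result codes `W`, `D`, and `L`, and the sample rows are hypothetical and are not taken from the Kaggle file.

```python
import pandas as pd

# Hypothetical sample; column names and result codes are assumptions,
# not the actual headers or values of the Kaggle dataset.
matches = pd.DataFrame({
    "team": ["Manchester City"] * 4,
    "result": ["W", "D", "L", "W"],
})

# One binary indicator per outcome: the comparison yields True/False,
# and astype(int) maps True to 1 and False to 0.
for column, code in {"win": "W", "loss": "L", "draw": "D"}.items():
    matches[column] = (matches["result"] == code).astype(int)

print(matches)
```

Because each match has exactly one outcome, the three indicator columns sum to 1 in every row, which is a quick sanity check on the encoding.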
We aggregated the resulting data by team, computing the sums of wins, losses, draws, and goals, and the average possession. Additionally, we calculated goal difference by subtracting goals conceded from goals scored.

Data transformation for the machine learning models included all seasons from 2019 to 2023. We dropped irrelevant variables from the data frame and scaled the data using scikit-learn's standard scaling method so that all variables were measured on the same scale. We then conducted feature selection separately for the K-means, regression, and classification models, because the models use different target variables and predictors. Visual and correlation analysis provided helpful insights into variable selection. We split the data 80:20 to provide separate datasets for model training and testing (Mueller & Massaron, 2021).

Hypothesis Test

The study uses the chi-square test of independence to check for dependence between categorical variables (Chen et al., 2022). The test checks the following hypotheses:

H0: Variables X and Y are independent.
H1: Variables X and Y are dependent.

The study adopts the conventional significance level of 0.05 and rejects the null hypothesis if the test's p-value is less than 0.05. A p-value below the significance level indicates sufficient evidence to reject the null hypothesis, and a p-value above it indicates insufficient evidence to reject the null hypothesis.

Cluster Analysis

The study adopts the K-means algorithm to group the data so that teams with similar characteristics fall in the same cluster. We first performed principal component analysis on the scaled data to obtain new components. The new components are
formed by combining original variables based on their contribution to the data's variance (Gray, 2017). The optimal number of components is determined by plotting the cumulative proportion of variance explained against the number of components (Gray, 2017). The optimal number is the smallest number of components that explains at least 80 per cent of the total variance in the data (Gray, 2017). The new components are then used as input to the cluster analysis model.

Determining the optimal number of clusters for the K-means model is crucial. The model is fitted repeatedly over a range of cluster counts, say one to ten, and the sum of squared errors is calculated for each count. The study adopts the elbow method, plotting the sum of squared errors against the number of clusters. The appropriate number of clusters is the x-axis value at which the curve forms a sharp elbow (Mueller & Massaron, 2021).

Regression Analysis

The study uses scikit-learn's linear regression algorithm and a decision tree regressor to predict the number of goals a team scores from a set of independent variables. Linear regression assumes the absence of multicollinearity, so it is important to check that the predictor variables are not strongly correlated with one another (Arkes, 2023). The study evaluates the regression models by their mean squared error. A lower mean squared error indicates a better model, although its magnitude depends on the scale of the data (Mueller & Massaron, 2021).

Classification Analysis

The study applies classification algorithms from the decision tree family to predict the outcome of a match. For simplicity, the study considers only two outcomes (win or otherwise). The otherwise category is denoted by 0 in the target column and indicates a loss or draw.
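A binary win-or-otherwise classification of this kind, with an 80:20 split, accuracy-based model selection, and a confusion matrix for the selected model, might be sketched as follows. The data is synthetic, and the two candidate trees differ only in split criterion; both choices are assumptions made for illustration and do not reproduce the study's features or model configurations.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the match features (assumption, not the study's data).
# The target is binary: 1 = win, 0 = otherwise (loss or draw).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# 80:20 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Two hypothetical decision-tree candidates, distinguished by split criterion.
models = {
    "gini": DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0),
    "entropy": DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0),
}

# Fit each candidate and record its accuracy on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Select the candidate with the higher accuracy, then inspect its confusion matrix.
best = max(scores, key=scores.get)
cm = confusion_matrix(y_test, models[best].predict(X_test), labels=[0, 1])
tn, fp, fn, tp = cm.ravel()  # rows are actual classes, columns are predicted classes

print(scores, "selected:", best)
print(cm)
```

With `labels=[0, 1]`, the first row of the matrix holds the actual non-wins and the second row the actual wins, so `fp` counts matches predicted as wins that were not won.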
Model selection and evaluation in this section rely on two evaluation metrics: the accuracy score and the confusion matrix. The accuracy score is used to identify the best model, and the final model is evaluated using the confusion matrix. The accuracy score compares the predicted labels to the actual labels and calculates the percentage of correct predictions (Demidenko, 2020). The confusion matrix, on the other hand, shows the number of true negatives, false negatives, true positives, and false positives (Demidenko, 2020).

Results

Figure 1
Goals Scored Grouped by Captain

İlkay Gündoğan is the clear leader: the club scored significantly more goals under his captaincy than under any other captain. This suggests that the team performs well when he is captain. The club also does well under Kevin De Bruyne, but there is a noticeable gap between him and Gündoğan. This could be due to several factors, such as Gündoğan having captained more matches than De Bruyne. Kyle Walker ranks third among the captains, and the gap between him and De Bruyne is even larger. This suggests that the team may not be as prolific under Walker as under the other two captains. The figure indicates that Gündoğan is a key player and that the team's performance is better under him.
Figure 2
Goals Conceded Grouped by Captain

The figure suggests that the club fared worst under Kyle Walker. Comparing the goals scored and conceded under Walker indicates that the club's losses were concentrated under his captaincy.
Figure 3
Match Proportions by Venue

A large proportion of Manchester City's matches were played at home.
Figure 4
Match Outcome by Venue

The team performs better at home than away, with a significantly higher win rate at home. This suggests that the home crowd may influence its success. The team rarely draws: there are very few draws compared to wins and losses, which suggests that its matches tend to be decisive rather than evenly balanced. While inconclusive, the higher home win rate points to a possible home-field advantage arising from factors such as crowd support, familiarity with the pitch, or psychological advantage. Overall, the graph suggests that the team is strong at home and generally plays to win rather than draw.
Figure 5
Possession Time by Venue

The team has a higher possession percentage when playing at home, which could suggest that the home crowd or other home-field factors give it an edge. Conversely, possession is slightly lower away from home. This could be due to several factors, such as facing stronger opposition away from home, struggling to adapt to different playing surfaces, or lacking the same level of crowd support.
Figure 6
Match Outcome by Captain

İlkay Gündoğan is the clear leader among the captains, with the most wins and the fewest losses. This suggests he is a successful captain and that the team performs well under his leadership. Kevin De Bruyne is also a strong captain, with the second-most wins and a decent number of draws. Kyle Walker has the fewest wins and the most losses among the three captains.
Figure 7
Goals Scored by Captain Over Time

The number of goals scored under İlkay Gündoğan tends to increase slightly over time, which suggests that he may be improving as a captain and leading the team to more goals. However, the line has recently shown plateaus and dips, which could indicate that the improvement is not consistent or that external factors influence performance. The goals scored under Kevin De Bruyne fluctuate more than those under Gündoğan, with highs and lows throughout the period shown. This suggests that the team's output under his captaincy is less predictable, with periods of high scoring as well as stretches with fewer goals.

Hypothesis Test Results
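A chi-square test of independence of the kind reported in this section can be run with SciPy's `chi2_contingency`. The sketch below is illustrative only: the captain labels and the outcome counts are hypothetical placeholders, not the study's data, so the printed p-value will not match the reported one.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table of match outcomes per captain
# (placeholder labels and counts, for illustration only).
observed = pd.DataFrame(
    {"win": [20, 15, 8], "draw": [3, 4, 3], "loss": [2, 4, 5]},
    index=["Captain A", "Captain B", "Captain C"],
)

# The test compares observed counts with the counts expected under independence.
chi2, p_value, dof, expected = chi2_contingency(observed.to_numpy())

# H0: match outcome and captain are independent.
alpha = 0.05
reject_h0 = p_value < alpha

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

For a table with three rows and three columns, the test has (3 - 1) x (3 - 1) = 4 degrees of freedom. Failing to reject H0 means only that the data do not provide evidence of dependence at the chosen significance level.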
The chi-square test of independence on game outcome and captain returns a p-value of 0.276, which is greater than the conventional significance level of 0.05. There is therefore insufficient evidence to reject the null hypothesis of independence, and we cannot conclude that match outcome depends on the team captain. The captain alone is thus not a statistically reliable predictor of match outcome in this data. In contrast, the chi-square test on venue and game outcome shows that match outcome depends on match venue.

After data transformation, we conducted Pearson's correlation test on pairs of variables to determine whether the potential predictor variables are correlated. The results show a correlation of 0.71 between a team's total wins and its possession percentage, and a correlation of 0.74 between a team's total goals scored and its possession percentage. In contrast, there is a moderate negative correlation of -0.60 between a team's possession and goals conceded, and a negative correlation of -0.58 between a team's total losses and its possession.

Comparative Analysis

Table 1
Top Five Clubs by Wins

Team                      Win  Loss  Draw  Possession (%)  Goals Conceded  Goals Scored  Goal Difference
Arsenal                   16   2     3     58              18              46            28
Manchester City           16   4     3     64              23              59            36
Manchester United         14   5     4     53.13           28              38            10
Tottenham Hotspur         12   8     3     50.26           35              42            7
Brighton and Hove Albion  10   6     5     58.52           28              39            11
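The goal-difference column in Table 1 follows the rule stated in the methodology: goals scored minus goals conceded. A short check over the table's rows, using only the values printed in Table 1, could look like this:

```python
# Rows copied from Table 1:
# (team, goals conceded, goals scored, reported goal difference).
table_1 = [
    ("Arsenal", 18, 46, 28),
    ("Manchester City", 23, 59, 36),
    ("Manchester United", 28, 38, 10),
    ("Tottenham Hotspur", 35, 42, 7),
    ("Brighton and Hove Albion", 28, 39, 11),
]

# Recompute each goal difference and compare it with the reported value.
for team, conceded, scored, reported in table_1:
    computed = scored - conceded
    status = "ok" if computed == reported else "mismatch"
    print(f"{team}: {scored} - {conceded} = {computed} ({status})")
```

For Manchester City, for example, 59 goals scored minus 23 conceded gives the reported goal difference of 36.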
Manchester City performed well with 16 wins, but there is room for improvement, especially in reducing losses and draws. The matches that ended in losses and draws should be evaluated to identify patterns or areas of weakness. Manchester City scored the most goals (59) among the listed teams. However, the team also conceded 23 goals, five more than Arsenal (18), which indicates room for improvement in defense. The team should consider strengthening the defense through training, tactical adjustments, or potential transfers. Manchester City has the highest possession percentage (64%) among the top five teams, indicating good control of matches. Successful possession patterns should be analyzed and replicated consistently across different formations and opponents, and the players should convert possession into dangerous situations quickly. The players should also study how Arsenal (58%) uses wide areas to stretch defenses and create space for crosses or cutbacks.

Figure 8
Cluster Visualization by Components 1 & 2

The data can be grouped into four clusters, as shown in the figure above. Manchester City falls in the second cluster.
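The clustering pipeline behind a figure of this kind (standard scaling, principal components retaining at least 80 per cent of the variance, the elbow method, and K-means) could be sketched as below. The data is synthetic, and the final choice of four clusters is hard-coded to match the number reported here rather than derived automatically; both are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the team-level feature table (assumption).
X, _ = make_blobs(n_samples=200, n_features=8, centers=4, random_state=0)

# Put all variables on the same scale before PCA.
X_scaled = StandardScaler().fit_transform(X)

# Smallest number of components whose cumulative explained variance is >= 80%.
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.80) + 1)
components = PCA(n_components=n_components).fit_transform(X_scaled)

# Elbow method: sum of squared errors (inertia) for k = 1..10.
sse = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(components)
    sse[k] = model.inertia_

# The elbow is normally read off a plot of sse against k; k = 4 is assumed here.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)

print("components kept:", n_components)
print("SSE by k:", {k: round(v, 1) for k, v in sse.items()})
```

Plotting `sse` against `k` gives the elbow curve described in the methodology, and the first two columns of `components` are the axes of a scatter plot like Figure 8.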
Table 2
Regression Models Comparison

Model                  Mean Squared Error
Linear Regression      0.94
DecisionTreeRegressor  2.01

The two models share several predictors, including shots home, shots on target, free kicks, penalty kicks, penalty kicks attempted, and possession. The linear regression model has a smaller mean squared error than the decision tree regressor (0.94 versus 2.01) and is therefore the better of the two models for predicting goals scored by a team.

Table 3
Classification Models Comparison

Model            Accuracy
CART Classifier  60%
C4.5 Classifier  62.01%

The second model performs better in predicting match outcomes. We used accuracy to select the best model and a confusion matrix to evaluate the selected model.

Figure 9
Confusion Matrix Plot
The model seems to be good at predicting losses (the 0 class, which also includes draws): it usually predicts a loss correctly and rarely misses an actual loss. The model sometimes incorrectly predicts a win when the team loses, but it predicts a win more often when the team actually wins than when the team loses.

Conclusion

This comprehensive analysis equips the coaching staff and management with valuable information for making data-driven decisions to improve Manchester City's performance. Whether through strategic adjustments, player-specific considerations, or a focus on defensive capabilities, these insights provide a solid foundation for ongoing performance improvement and success in future seasons.

Recommendations

i. Continue to prioritize İlkay Gündoğan as team captain, given the team's stronger results under his leadership.
ii. Evaluate and address the factors contributing to the performance gap between captains, with a particular focus on improving Kyle Walker's effectiveness as captain.
iii. Analyze matches that ended in losses and draws to identify patterns or weaknesses, and adjust game strategies accordingly.
iv. Acknowledge the significant home-field advantage and consider strategies to replicate its benefits when playing away.
v. Emphasize the importance of maintaining strong home performance and consider the psychological factors that may contribute to the higher win rate at home.
vi. Strengthen the defense through targeted training, tactical adjustments, or potential transfers to reduce the number of goals conceded.
vii. Acquire data on, and conduct a detailed analysis of, matches in which goals were conceded to identify defensive weaknesses and address them proactively.

References

Arkes, J. (2023) Regression analysis: A practical introduction. London: Routledge, Taylor & Francis Group.
Chen, D.-G., Manda, S.O.M. and Chirwa, T.F. (2022) Modern biostatistical methods for evidence-based global health research. Cham, Switzerland: Springer.
Demidenko, E. (2020) Advanced statistics with applications in R. Hoboken, NJ: John Wiley & Sons, Inc.
Gray, V. (2017) Principal component analysis: Methods, applications, and technology. Hauppauge, NY: Nova Science Publishers, Incorporated.
Mueller, J. and Massaron, L. (2021) Machine learning. Hoboken, NJ: John Wiley & Sons.
Premier League Match Data (2019-2023). (n.d.).