Analysis1

pdf

School

University of Waterloo *

*We aren’t endorsed by this school

Course

847

Subject

Statistics

Date

Apr 3, 2024

Type

pdf

Pages

17

Uploaded by BailiffJellyfishMaster1195

Report
STAT 847: Analysis Assignment 1 DUE: Thursday, February 2 2023 by 11:59pm EST NOTES Your assignment must be submitted by the due date listed at the top of this document, and it must be submitted electronically in .pdf format via Crowdmark. Organization and comprehensibility is part of a full solution. Consequently, points will be deducted for solutions that are not organized and incomprehensible. Furthermore, if you submit your assignment to Crowdmark, but you do so incorrectly in any way (e.g., you upload your Question 2 solution in the Question 1 box), you will receive a 5% deduction (i.e., 5% of the assignment’s point total will be deducted from your point total). This assignment is all about cleaning (wrangling) text data of a custom structure, please show all your code and comment it accordingly. There are a total of 50 points possible. Look at the end of the Week 02 notes for a start on this assignment. 1
1. [7 points] Create a data frame or a tibble of tags/metadata portion of the games in chess_classic_games.pgn. One row should represent one game, and one column should represent one tag. Tags that appear in some games, but not a given game, should be left as NA in any game that doesn’t have them. For example, the first five rows should look like classic_first_five.csv Show the next five lines and the skim() of the dataset. 2. [3 points] Add two columns to the left end of the data set (hint: rbind() can do this, and so can select() ). These added first two columns should have the first line in the file chess_classic_games.pgn which includes a tag and the moves for the given game, respectively. For example, the first five values of tag_line should be 1, 21, 41, 61, 81, and the first five values of moves_line should be 19, 39, 59, 79, 99. library(plyr) library(tidyverse) ## -- Attaching packages --------------------------------------- tidyverse 1.3.2 -- ## v ggplot2 3.3.6 v purrr 0.3.4 ## v tibble 3.1.8 v dplyr 1.0.10 ## v tidyr 1.2.1 v stringr 1.4.1 ## v readr 2.1.3 v forcats 0.5.2 ## -- Conflicts ------------------------------------------ tidyverse_conflicts() -- ## x dplyr::arrange() masks plyr::arrange() ## x purrr::compact() masks plyr::compact() ## x dplyr::count() masks plyr::count() ## x dplyr::failwith() masks plyr::failwith() ## x dplyr::filter() masks stats::filter() ## x dplyr::id() masks plyr::id() ## x dplyr::lag() masks stats::lag() ## x dplyr::mutate() masks plyr::mutate() ## x dplyr::rename() masks plyr::rename() ## x dplyr::summarise() masks plyr::summarise() ## x dplyr::summarize() masks plyr::summarize() library(stringr) library(skimr) pgn_chess_classic = readLines( "chess_classic_games.pgn" ) is_metadata = str_detect(pgn_chess_classic, \\ [.* \\ ]$" ) pgn_classic_meta = pgn_chess_classic pgn_classic_meta[!is_metadata] = "" pgn_classic_meta = str_split_fixed(pgn_classic_meta, " " , 2 ) vars = str_replace(pgn_classic_meta[, 1 ], " \\ [" , "" ) values = pgn_classic_meta[, 2 ] values = str_replace_all(values, " \" " , "" ) values = str_replace_all(values, " \\ ]" , "" ) unique_keys = unique(vars) unique_keys = unique_keys[- 18 ] #18 for classic df = data.frame(matrix( nrow = 0 , ncol = length(unique_keys))) colnames(df) = unique_keys dict = list() dict[[ "WhiteTitle" ]] = "" dict[[ "BlackTitle" ]] = "" 2
for (i in 1 :length(vars)){ if (vars[i] == "" ){ next } dict[[vars[i]]] = values[i] if (vars[i] == "Termination" ){ if (!( "WhiteRatingDiff" %in% names(dict))){ dict[[ "WhiteRatingDiff" ]] = NA dict[[ "BlackRatingDiff" ]] = NA } df = rbind(df,dict) dict = list() dict[[ "WhiteTitle" ]] = "" dict[[ "BlackTitle" ]] = "" } } df = df %>% relocate(WhiteTitle, .after = last_col()) %>% relocate(BlackTitle, .after = last_col()) df$WhiteRatingDiff = as.numeric(df$WhiteRatingDiff) df$BlackRatingDiff = as.numeric(df$BlackRatingDiff) df$BlackElo = as.numeric(df$BlackElo) df$WhiteElo = as.numeric(df$WhiteElo) df = as_tibble(df) is_tag = str_detect(pgn_chess_classic, \\ [Event.*" ) is_move = str_detect(pgn_chess_classic, "ˆ[0-9].*|ˆ " ) tag_lines = which(is_tag) move_lines = which(is_move) df = cbind(df, move_lines = move_lines ) df = cbind(df, tag_lines = tag_lines) write.csv(df, file= "Q1-2.csv" ) df[ 21 : 40 ,] ## Event ## 21 Rated Bullet tournament https://lichess.org/tournament/rs53Xpda ## 22 Rated Bullet game ## 23 Rated Blitz game ## 24 Rated Rapid game ## 25 Rated Rapid game ## 26 Rated Rapid game ## 27 Rated Rapid game ## 28 Rated Rapid game ## 29 Rated Rapid game ## 30 Rated Rapid game ## 31 Rated Rapid game ## 32 Rated Bullet game ## 33 Rated Bullet game ## 34 Rated Bullet game ## 35 Rated Bullet game 3
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
## 36 Rated Bullet game ## 37 Rated Bullet game ## 38 Rated Bullet game ## 39 Rated Bullet game ## 40 Rated Bullet game ## Site Date Round White Black ## 21 https://lichess.org/NaPtjbza 2018.09.30 - WhyNoGm Ayopublic ## 22 https://lichess.org/Q5hOlUTw 2018.09.30 - kokinho jj1987 ## 23 https://lichess.org/H1DjlehD 2018.09.30 - MACHETAZO bobo92 ## 24 https://lichess.org/Q5wlBOt3 2018.09.30 - joseesosam aliahmadi89 ## 25 https://lichess.org/mPhEK7yW 2018.09.30 - rochris Vagouras ## 26 https://lichess.org/oxqZoPEc 2018.09.30 - leo2105 Diegomart ## 27 https://lichess.org/6AdgfTIM 2018.09.30 - robertlemalin bydcollin ## 28 https://lichess.org/UTsDBRti 2018.09.30 - ivobb BlackPoison ## 29 https://lichess.org/fAr31RWf 2018.09.30 - ganeshtsk anubhavdawer ## 30 https://lichess.org/iGn3UquI 2018.09.30 - Maximewolf DARTAGNAN1414 ## 31 https://lichess.org/O6bFp6hs 2018.09.30 - AlexRose dideilci ## 32 https://lichess.org/EyZMfDR2 2018.09.30 - DirtySilver congeja ## 33 https://lichess.org/bFnSVV95 2018.09.30 - case963 KrzysztofG ## 34 https://lichess.org/RAa9RP7z 2018.09.30 - kirito18 l0xp ## 35 https://lichess.org/sFjLQq9j 2018.09.30 - joshdube Davidsti1 ## 36 https://lichess.org/a6Fut2lC 2018.09.30 - Russiabah pianista2018 ## 37 https://lichess.org/t5BgyMzJ 2018.09.30 - shyonfire rato23 ## 38 https://lichess.org/NAU4rpI0 2018.09.30 - schu007 VictorAlencar ## 39 https://lichess.org/BII1pt2F 2018.09.30 - omranlm oneofalone ## 40 https://lichess.org/IhwxefAm 2018.09.30 - mbkonacoglu ManuelMagnifico ## Result UTCDate UTCTime WhiteElo BlackElo WhiteRatingDiff ## 21 1-0 2018.09.30 22:00:11 1664 1596 9 ## 22 1-0 2018.09.30 22:00:11 2030 1992 26 ## 23 1-0 2018.09.30 22:00:13 1781 1669 7 ## 24 0-1 2018.09.30 22:00:13 1070 919 -14 ## 25 1-0 2018.09.30 22:00:13 1523 1527 21 ## 26 1-0 2018.09.30 22:00:13 1500 1500 7 ## 27 1-0 2018.09.30 22:00:13 1420 1410 10 ## 28 1-0 2018.09.30 22:00:13 1226 1218 11 ## 29 1-0 2018.09.30 22:00:13 1592 1536 9 ## 30 0-1 2018.09.30 22:00:13 1655 1675 -10 ## 31 0-1 2018.09.30 22:00:13 1139 1158 -10 ## 32 1-0 2018.09.30 22:00:13 1496 1599 14 ## 33 1-0 2018.09.30 22:00:13 1243 1227 10 ## 34 1/2-1/2 2018.09.30 22:00:13 1828 1805 -1 ## 35 0-1 2018.09.30 22:00:13 1706 1705 -10 ## 36 0-1 2018.09.30 22:00:13 1541 1555 -10 ## 37 1/2-1/2 2018.09.30 22:00:13 2069 2051 -1 ## 38 1-0 2018.09.30 22:00:13 1020 922 8 ## 39 1-0 2018.09.30 22:00:13 1105 1176 13 ## 40 0-1 2018.09.30 22:00:13 1895 1950 -9 ## BlackRatingDiff ECO Opening ## 21 -13 A04 Zukertort Opening: Pirc Invitation ## 22 -9 C10 French Defense: Rubinstein Variation ## 23 -7 B30 Sicilian Defense: Old Sicilian ## 24 52 C00 French Defense: Knight Variation ## 25 -11 C44 Scotch Game ## 26 -175 C40 King ' s Pawn Game: Damiano Defense 4
## 27 -10 D02 Queen ' s Pawn Game: Symmetrical Variation, Pseudo-Catalan ## 28 -17 C40 Elephant Gambit: Paulsen Countergambit ## 29 -9 A00 Hungarian Opening: Sicilian Invitation ## 30 10 D02 Queen ' s Pawn Game: Anti-Torre ## 31 10 B01 Scandinavian Defense ## 32 -13 D02 Queen ' s Pawn Game: Symmetrical Variation, Pseudo-Catalan ## 33 -10 B10 Caro-Kann Defense: Two Knights Attack ## 34 1 A04 Zukertort Opening: Speelsmet Gambit ## 35 10 A06 Zukertort Opening: Tennison Gambit ## 36 10 D02 Queen ' s Pawn Game: Symmetrical Variation, Pseudo-Catalan ## 37 1 C40 Elephant Gambit ## 38 -7 A43 Benoni Defense: Old Benoni ## 39 -13 A01 Nimzo-Larsen Attack: English Variation ## 40 9 C30 King ' s Gambit Declined, Queen ' s Knight Defense ## TimeControl Termination WhiteTitle BlackTitle move_lines tag_lines ## 21 60+0 Time forfeit 419 401 ## 22 60+0 Time forfeit 439 421 ## 23 180+0 Normal 459 441 ## 24 600+0 Normal 479 461 ## 25 600+0 Normal 499 481 ## 26 600+0 Time forfeit 519 501 ## 27 600+0 Time forfeit 539 521 ## 28 600+0 Normal 559 541 ## 29 600+0 Normal 579 561 ## 30 600+0 Normal 599 581 ## 31 600+0 Normal 619 601 ## 32 60+0 Normal 639 621 ## 33 60+0 Normal 659 641 ## 34 60+0 Time forfeit 679 661 ## 35 60+0 Normal 699 681 ## 36 60+0 Time forfeit 719 701 ## 37 60+0 Time forfeit 739 721 ## 38 60+0 Time forfeit 759 741 ## 39 60+0 Time forfeit 779 761 ## 40 60+0 Time forfeit 799 781 Answer: I have added the code along with the output below. In the question it is said to append the columns to the left where as the example file had it appended to the right so I have done the same as it makes more sense. Along with this I used cbind instead of rbind as rbind is for concatenating rows. I have added a screenshot of the csv file for the next 20 rows along with the skim of the data. 5
6
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
3. [3 points] Find the quantile(. . . , probs=c(0.01,0.05,0.25,0.5,0.75,0.90,0.99,0.999,1)) of all player Elos. Make sure to include both black and white players. I am assuming by all players it means the White Player and black Player.I also included combined ELO quantile I have printed their quantile below df$WhiteElo %>% quantile( probs= c( 0.01 , 0.05 , 0.25 , 0.5 , 0.75 , 0.90 , 0.99 , 0.999 , 1 )) ## 1% 5% 25% 50% 75% 90% 99% 99.9% ## 922.000 1081.000 1357.250 1584.000 1801.000 1996.000 2307.140 2484.084 ## 100% ## 2544.000 df$BlackElo %>% quantile( probs= c( 0.01 , 0.05 , 0.25 , 0.5 , 0.75 , 0.90 , 0.99 , 0.999 , 1 )) ## 1% 5% 25% 50% 75% 90% 99% 99.9% ## 913.930 1072.650 1357.000 1582.000 1797.750 1990.000 2315.070 2493.035 ## 100% ## 2527.000 c(df$WhiteElo,df$BlackElo) %>% quantile( probs= c( 0.01 , 0.05 , 0.25 , 0.5 , 0.75 , 0.90 , 0.99 , 0.999 , 1 )) ## 1% 5% 25% 50% 75% 90% 99% 99.9% ## 919.000 1076.350 1357.000 1582.000 1799.000 1994.000 2313.000 2493.039 ## 100% ## 2544.000 4. [4 points] Create a linear model of proportion of time the first player (white) wins as a function of the amount of higher rating the first player has over the second player. Use this model to determine how many rating points playing first is worth (i.e., how many rating points worse does the player using white have to be in order to win exactly half of the time). this question was very confusing for me because it asks as the PROPORTION of time the first player wins as a function of the amount of HIGHER rating first player has over second player. This means our response variable is a proportion/probability where if White is 30 points higher than second player then how many of such player won or lost that is the proportion. Based on the wording both the variables are continuous and I did accordingly.However, on Piazza Post 30, Professor stated that the response variable is 0 or 1 and that we have to change the subject of the equation so that we can get x when y=0.5. Which is directly contradicting what the question has asked that is how WORSE should first player be as compared to second player to win the games half the time. white_elo_higher = df %>% filter(Result != "1/2-1/2" ) %>% summarise( victory = as.numeric(str_detect(Result, "1-0" )), diff_elo = WhiteElo-BlackElo) %>% group_by(diff_elo) %>% summarise( white_win_perc = mean(victory), diff_elo = min(diff_elo)) linear_white_win_model = lm(white_win_perc~diff_elo, data= white_elo_higher) model_summary = summary(linear_white_win_model) model_summary ## ## Call: ## lm(formula = white_win_perc ~ diff_elo, data = white_elo_higher) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.93561 -0.17065 -0.00051 0.16803 1.06104 ## 7
## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.070e-01 1.044e-02 48.56 <2e-16 *** ## diff_elo 8.659e-04 3.816e-05 22.69 <2e-16 *** ## --- ## Signif. codes: 0 ' *** ' 0.001 ' ** ' 0.01 ' * ' 0.05 ' . ' 0.1 ' ' 1 ## ## Residual standard error: 0.2794 on 714 degrees of freedom ## Multiple R-squared: 0.419, Adjusted R-squared: 0.4182 ## F-statistic: 514.9 on 1 and 714 DF, p-value: < 2.2e-16 rating = ( 0.5 - model_summary$coefficients[ 1 , 1 ])/model_summary$coefficients[ 2 , 1 ] rating ## [1] -8.071927 So this is the rating that indicates how much worse should White Player Elo be from Black player to win half the times. Show the summary() of the lm() 5. [3 points] Repeat the linear model from part 4, but filter() so that only games between players where both players have a 2000 rating or better. Does the pattern hold. I have produced the same result with same explanation above. The first model is according to Piazza Post 30 while second is according to the actual interpretation of the question. white_elo_higher_2000 = df %>% filter(Result != "1/2-1/2" , WhiteElo >= 2000 , BlackElo >= 2000 )%>% summarise( victory = as.numeric(str_detect(Result, "1-0" )), diff_elo = WhiteElo-BlackElo) %>% group_by(diff_elo) %>% summarise( white_win_perc = mean(victory), diff_elo = min(diff_elo)) linear_white_win_model_2000 = lm(white_win_perc~diff_elo, data= white_elo_higher_2000) model_summary_2000 = summary(linear_white_win_model_2000) rating_2000 = ( 0.5 - model_summary_2000$coefficients[ 1 , 1 ])/model_summary_2000$coefficients[ 2 , 1 ] rating_2000 ## [1] -10.04814 model_summary_2000 ## ## Call: ## lm(formula = white_win_perc ~ diff_elo, data = white_elo_higher_2000) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.93218 -0.38487 0.01425 0.39191 0.92425 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.5130782 0.0293185 17.500 < 2e-16 *** ## diff_elo 0.0013016 0.0002265 5.745 3.08e-08 *** ## --- ## Signif. codes: 0 ' *** ' 0.001 ' ** ' 0.01 ' * ' 0.05 ' . ' 0.1 ' ' 1 ## ## Residual standard error: 0.4337 on 217 degrees of freedom ## Multiple R-squared: 0.132, Adjusted R-squared: 0.128 8
## F-statistic: 33.01 on 1 and 217 DF, p-value: 3.077e-08 Show the summary() of the lm() Yes the pattern still holds as more or less both have similar coefficients and the same difference of elo *Note: A logistic regression would be better for each of these, we will return to this in class or in a later assignment. 9
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
6. [6 points] Add the first move to the cleaned dataset. Also add an indicator variable that is 1 if the first move is either e4, and 0 otherwise. The moves_line variable from part 2 will be useful. Get two separate table()s, one of each of the new variables. The first move of the first five games is d4, f4, d4, e4, and d4, respectively. I have created seperate tables and I also imputed the values and filled them through simple interpolation using zoo library. Professor also had similar graph on Piazza first_move_df = df %>% mutate( first_move = str_extract(pgn_chess_classic[df$move_lines], "[a-zA-Z]+ \\ d" )) table(first_move_df$first_move) ## ## a3 a4 b3 b4 c3 c4 d3 d4 e3 e4 f3 f4 g3 g4 h3 h4 ## 7 4 68 17 6 170 46 1334 93 2868 6 47 86 13 3 10 ## Na3 Nc3 Nf3 Nh3 ## 1 28 176 1 move_indicator_df = first_move_df %>% mutate( move_indicator = as.numeric(first_move == "e4" )) table(move_indicator_df$move_indicator) ## ## 0 1 ## 2116 2868 7. [4 points] Plot the proportion games where WhiteElo is in (-infty, 850), [850, 900), [900, 950). . . starts the game by playing e4 as broken line graph (Hint: plot(. . . , type=‘b’). You may use ggplot instead, but it will be worth no extra points. max_elo = round_any(max(move_indicator_df$WhiteElo), 50 , f= ceiling) series = seq( from= 850 , to= max_elo, by= 50 ) series[ 35 ] = Inf interval_data = table(cut(move_indicator_df$WhiteElo, breaks= c(-Inf, series), right = FALSE)) start_move_interval = move_indicator_df %>% mutate( intervals = cut(WhiteElo, breaks= c(-Inf, series), right = FALSE)) %>% complete(intervals) %>% group_by(intervals) %>% summarise( win_proportion_e4 = mean(move_indicator)) library(zoo) ## ## Attaching package: ' zoo ' ## The following objects are masked from ' package:base ' : ## ## as.Date, as.Date.numeric z = zoo(start_move_interval$win_proportion_e4) start_move_interval$imputed = na.approx(z) x_ticks = 1 :length(start_move_interval$win_proportion_e4) %>% map( function (x) (x* 50 )+( 800-25 )) 10
plot(x_ticks, start_move_interval$imputed, type= ' b ' , main= "Proportion Of Games with \n E4 starting for White Player" , xlab= "Elo Ratings Intervals" , ylab= "Starting Proportion of E4" ) 1000 1500 2000 2500 0.3 0.5 0.7 0.9 Proportion Of Games with E4 starting for White Player Elo Ratings Intervals Starting Proportion of E4 11
8. [6 points] A question mark on a move indicates that it is a blunder. A double question mark indicates it is an extreme blunder. Make a multivariate model of the number of question marks that appear as a function of the average of WhiteElo and BlackElo. (Hint: str_count() ) Show the summary() of the lm() In this I regarded ?? and ?! as 2 question marks where as ? as 1 question mark q8_blunder_df = df %>% summarise( blunder = str_count(pgn_chess_classic[df$move_lines], pattern = " \\ ?" ), ext_blunder_excl = ( summarise( total_blunder = blunder + ext_blunder_excl, mean_elo = mean_elo) q8_lm = lm(total_blunder ~ mean_elo, data = q8_blunder_df) summary(q8_lm) ## ## Call: ## lm(formula = total_blunder ~ mean_elo, data = q8_blunder_df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.963 -2.275 -1.998 -1.712 90.895 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.923993 0.663620 5.913 3.58e-09 *** ## mean_elo -0.001193 0.000412 -2.895 0.00381 ** ## --- ## Signif. codes: 0 ' *** ' 0.001 ' ** ' 0.01 ' * ' 0.05 ' . ' 0.1 ' ' 1 ## ## Residual standard error: 8.911 on 4992 degrees of freedom ## Multiple R-squared: 0.001676, Adjusted R-squared: 0.001476 ## F-statistic: 8.381 on 1 and 4992 DF, p-value: 0.003809 9. [4 points] A time control of 300 + 1, for example, indicates that each player started the game with 5 minutes (300 seconds) to play all their moves in the match, and that each move played added (i.e., incremented) 1 second to their clock. Add log(starting time) (Hint: + I(log(x)) ) in seconds to your model from question 7. Show the summary() of the lm() comment on the differences between this model and the question 7 model. q8_blunder_df = q8_blunder_df %>% mutate( starting_time = ifelse(as.numeric(str_extract(df$TimeControl, "( \\ d+)" ))== 0 , 1 , as.numeric(str q8_lm_multi = lm(total_blunder ~ mean_elo+I(log(starting_time)), data = q8_blunder_df) summary(q8_lm_multi) ## ## Call: ## lm(formula = total_blunder ~ mean_elo + I(log(starting_time)), ## data = q8_blunder_df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.052 -2.644 -2.027 -0.998 90.319 ## 12
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -2.5062518 1.0844364 -2.311 0.0209 * ## mean_elo -0.0005229 0.0004199 -1.245 0.2130 ## I(log(starting_time)) 1.0491728 0.1403341 7.476 8.99e-14 *** ## --- ## Signif. codes: 0 ' *** ' 0.001 ' ** ' 0.01 ' * ' 0.05 ' . ' 0.1 ' ' 1 ## ## Residual standard error: 8.866 on 4975 degrees of freedom ## (16 observations deleted due to missingness) ## Multiple R-squared: 0.01277, Adjusted R-squared: 0.01237 ## F-statistic: 32.17 on 2 and 4975 DF, p-value: 1.314e-14 q8_lm_multi ## ## Call: ## lm(formula = total_blunder ~ mean_elo + I(log(starting_time)), ## data = q8_blunder_df) ## ## Coefficients: ## (Intercept) mean_elo I(log(starting_time)) ## -2.5062518 -0.0005229 1.0491728 The difference between these models is that univariate model has a positive intercept of 4 and the coefficient for average elo is negative 0.001 where as the intercept for multivariate model is negative and the coefficient of average elo is negative 0.0005 with a positive coefficient of 1 for log starting time 13
10. [5 points] Chess 960 is a variant of chess where the starting positions of some of the pieces are randomized before the match. Repeat steps 1 and 2 for the chess_960_database. Show the skim() of the resulting dataset. pgn_chess_classic = readLines( "chess_960_games.pgn" ) is_metadata = str_detect(pgn_chess_classic, \\ [.* \\ ]$" ) pgn_classic_meta = pgn_chess_classic pgn_classic_meta[!is_metadata] = "" pgn_classic_meta = str_split_fixed(pgn_classic_meta, " " , 2 ) vars = str_replace(pgn_classic_meta[, 1 ], " \\ [" , "" ) values = pgn_classic_meta[, 2 ] values = str_replace_all(values, " \" " , "" ) values = str_replace_all(values, " \\ ]" , "" ) unique_keys = unique(vars) unique_keys = unique_keys[- 19 ] #18 for classic df = data.frame(matrix( nrow = 0 , ncol = length(unique_keys))) colnames(df) = unique_keys dict = list() dict[[ "WhiteTitle" ]] = "" dict[[ "BlackTitle" ]] = "" for (i in 1 :length(vars)){ if (vars[i] == "" ){ next } dict[[vars[i]]] = values[i] if (vars[i] == "Variant" ){ if (!( "WhiteRatingDiff" %in% names(dict))){ dict[[ "WhiteRatingDiff" ]] = "" dict[[ "BlackRatingDiff" ]] = "" } df = rbind(df,dict) dict = list() dict[[ "WhiteTitle" ]] = "" dict[[ "BlackTitle" ]] = "" } } df = df %>% relocate(WhiteTitle, .after = last_col()) %>% relocate(BlackTitle, .after = last_col()) df$WhiteRatingDiff = as.numeric(df$WhiteRatingDiff) df$BlackRatingDiff = as.numeric(df$BlackRatingDiff) df$BlackElo = as.numeric(df$BlackElo) df$WhiteElo = as.numeric(df$WhiteElo) df$SetUp = as.numeric(df$SetUp) df = as_tibble(df) pgn_classic_moves = pgn_chess_classic is_tag = str_detect(pgn_chess_classic, \\ [Event.*" ) is_move = str_detect(pgn_chess_classic, "ˆ[0-9].*|ˆ " ) 14
tag_lines = which(is_tag) move_lines = which(is_move) tag_lines = tag_lines[- 4420 ] df = df[- 4420 ,] df = cbind(df, tag_line = tag_lines ) df = cbind(df, move_line = move_lines) Figure 1: Skim of the dataframe 15
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
11. [5 points] Repeat questions 6 and 7 for the chess_960_database. (Get opening move, and plot the proportion starting e4). Comment on the differences in the popularity of e4 as an opening move. (You do not need to demonstrate any chess knowledge to make this comparison. You can if you wish, but you are not being marked on chess knowledge) I did not imputate the proportions in this because If the proportion does not exist then it should be zero. In previous case, the answer was given by the professor in Piazza first_move_df = df %>% mutate( first_move = str_extract(pgn_chess_classic[df$move_line], "[A-Za-z]+ \\ d" )) table(first_move_df$first_move) ## ## a3 a4 b3 b4 c3 c4 c6 d3 d4 d5 e3 e4 e5 f3 f4 g3 ## 6 17 546 100 198 252 1 85 470 1 132 885 1 151 216 648 ## g4 h3 h4 Na3 Nab3 Nb3 Nbc3 Nbc6 Nc3 Nc6 Ncb3 Ncd3 Nd3 Ndc3 Ne3 Nef3 ## 81 10 23 3 15 140 11 2 86 1 3 3 29 2 22 1 ## Nf3 Nfe3 Ng3 Ng6 Ngf3 Nhg3 ## 102 3 125 1 15 21 move_indicator_df = first_move_df %>% mutate( move_indicator = as.numeric(first_move == "e4" )) table(move_indicator_df$move_indicator) ## ## 0 1 ## 3523 885 move_indicator_df = move_indicator_df[- 22 ] #Q7 max_elo = round_any(max(move_indicator_df$WhiteElo), 50 , f= ceiling) series = seq( from= 850 , to= max_elo, by= 50 ) series[length(series)] = Inf intervals = cut(move_indicator_df$WhiteElo, breaks= c(-Inf, series), right = FALSE) start_move_interval = move_indicator_df %>% mutate( intervals = cut(WhiteElo, breaks= c(-Inf, series), right = FALSE)) %>% complete(intervals) %>% group_by(intervals) %>% summarise( win_proportion_e4 = mean(move_indicator)) start_move_interval$win_proportion_e4 = start_move_interval$win_proportion_e4 %>% replace(is.na(.), 0 ) x_ticks = 1 :length(start_move_interval$intervals) %>% map( function (x) (x* 50 )+( 800-25 )) plot(x_ticks, start_move_interval$win_proportion_e4, type= ' b ' , main= "Proportion Of Games with \n E4 starting for White Player" , xlab= "Elo Ratings Intervals" , ylab= "Starting Proportion of E4" ) 16
1000 1500 2000 2500 0.0 0.2 0.4 0.6 0.8 1.0 Proportion Of Games with E4 starting for White Player Elo Ratings Intervals Starting Proportion of E4 Although the opening move of E4 is generally more common with players that have less ELO scores however, as they move towards higher elo rating or players with higher elo scores this popularity decreases. When comparing the graphs, we can see that opening move of E4 is more common in classic chess as compared to 960 variant of chess where the graph is lying towards the range of 0.2 17