Analysis1
pdf
keyboard_arrow_up
School
University of Waterloo *
*We aren’t endorsed by this school
Course
847
Subject
Statistics
Date
Apr 3, 2024
Type
Pages
17
Uploaded by BailiffJellyfishMaster1195
STAT 847: Analysis Assignment 1
DUE: Thursday, February 2 2023 by 11:59pm EST
NOTES
Your assignment must be submitted by the due date listed at the top of this document, and it must be
submitted electronically in .pdf format via Crowdmark. Organization and comprehensibility is part of a full
solution. Consequently, points will be deducted for solutions that are not organized and incomprehensible.
Furthermore, if you submit your assignment to Crowdmark, but you do so incorrectly in any way (e.g., you
upload your Question 2 solution in the Question 1 box), you will receive a 5% deduction (i.e., 5% of the
assignment’s point total will be deducted from your point total).
This assignment is all about cleaning (wrangling) text data of a custom structure, please show all your code
and comment it accordingly. There are a total of 50 points possible. Look at the end of the Week 02 notes
for a start on this assignment.
1
1.
[7 points] Create a data frame or a tibble of tags/metadata portion of the games in chess_classic_games.pgn.
One row should represent one game, and one column should represent one tag.
Tags that appear in some games, but not a given game, should be left as NA in any game that doesn’t have
them.
For example, the first five rows should look like classic_first_five.csv
Show the next five lines and the skim() of the dataset.
2.
[3 points] Add two columns to the left end of the data set (hint: rbind() can do this, and so can select()
). These added first two columns should have the first line in the file chess_classic_games.pgn which
includes a tag and the moves for the given game, respectively. For example, the first five values of
tag_line should be 1, 21, 41, 61, 81, and the first five values of moves_line should be 19, 39, 59, 79, 99.
library(plyr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6
v purrr
0.3.4
## v tibble
3.1.8
v dplyr
1.0.10
## v tidyr
1.2.1
v stringr 1.4.1
## v readr
2.1.3
v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::arrange()
masks plyr::arrange()
## x purrr::compact()
masks plyr::compact()
## x dplyr::count()
masks plyr::count()
## x dplyr::failwith()
masks plyr::failwith()
## x dplyr::filter()
masks stats::filter()
## x dplyr::id()
masks plyr::id()
## x dplyr::lag()
masks stats::lag()
## x dplyr::mutate()
masks plyr::mutate()
## x dplyr::rename()
masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(stringr)
library(skimr)
pgn_chess_classic
=
readLines(
"chess_classic_games.pgn"
)
is_metadata
=
str_detect(pgn_chess_classic,
"ˆ
\\
[.*
\\
]$"
)
pgn_classic_meta
=
pgn_chess_classic
pgn_classic_meta[!is_metadata]
=
""
pgn_classic_meta
=
str_split_fixed(pgn_classic_meta,
" "
,
2
)
vars
=
str_replace(pgn_classic_meta[,
1
],
"
\\
["
,
""
)
values
=
pgn_classic_meta[,
2
]
values
=
str_replace_all(values,
"
\"
"
,
""
)
values
=
str_replace_all(values,
"
\\
]"
,
""
)
unique_keys
=
unique(vars)
unique_keys
=
unique_keys[-
18
]
#18 for classic
df
=
data.frame(matrix(
nrow =
0
,
ncol =
length(unique_keys)))
colnames(df)
=
unique_keys
dict
=
list()
dict[[
"WhiteTitle"
]]
=
""
dict[[
"BlackTitle"
]]
=
""
2
for
(i
in
1
:length(vars)){
if
(vars[i] ==
""
){
next
}
dict[[vars[i]]]
=
values[i]
if
(vars[i] ==
"Termination"
){
if
(!(
"WhiteRatingDiff"
%in% names(dict))){
dict[[
"WhiteRatingDiff"
]]
=
NA
dict[[
"BlackRatingDiff"
]]
=
NA
}
df
=
rbind(df,dict)
dict
=
list()
dict[[
"WhiteTitle"
]]
=
""
dict[[
"BlackTitle"
]]
=
""
}
}
df
=
df %>% relocate(WhiteTitle,
.after =
last_col()) %>% relocate(BlackTitle,
.after =
last_col())
df$WhiteRatingDiff
=
as.numeric(df$WhiteRatingDiff)
df$BlackRatingDiff
=
as.numeric(df$BlackRatingDiff)
df$BlackElo
=
as.numeric(df$BlackElo)
df$WhiteElo
=
as.numeric(df$WhiteElo)
df
=
as_tibble(df)
is_tag
=
str_detect(pgn_chess_classic,
"ˆ
\\
[Event.*"
)
is_move
=
str_detect(pgn_chess_classic,
"ˆ[0-9].*|ˆ "
)
tag_lines
=
which(is_tag)
move_lines
=
which(is_move)
df
=
cbind(df,
move_lines =
move_lines )
df
=
cbind(df,
tag_lines =
tag_lines)
write.csv(df,
file=
"Q1-2.csv"
)
df[
21
:
40
,]
##
Event
## 21 Rated Bullet tournament https://lichess.org/tournament/rs53Xpda
## 22
Rated Bullet game
## 23
Rated Blitz game
## 24
Rated Rapid game
## 25
Rated Rapid game
## 26
Rated Rapid game
## 27
Rated Rapid game
## 28
Rated Rapid game
## 29
Rated Rapid game
## 30
Rated Rapid game
## 31
Rated Rapid game
## 32
Rated Bullet game
## 33
Rated Bullet game
## 34
Rated Bullet game
## 35
Rated Bullet game
3
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
## 36
Rated Bullet game
## 37
Rated Bullet game
## 38
Rated Bullet game
## 39
Rated Bullet game
## 40
Rated Bullet game
##
Site
Date Round
White
Black
## 21 https://lichess.org/NaPtjbza 2018.09.30
-
WhyNoGm
Ayopublic
## 22 https://lichess.org/Q5hOlUTw 2018.09.30
-
kokinho
jj1987
## 23 https://lichess.org/H1DjlehD 2018.09.30
-
MACHETAZO
bobo92
## 24 https://lichess.org/Q5wlBOt3 2018.09.30
-
joseesosam
aliahmadi89
## 25 https://lichess.org/mPhEK7yW 2018.09.30
-
rochris
Vagouras
## 26 https://lichess.org/oxqZoPEc 2018.09.30
-
leo2105
Diegomart
## 27 https://lichess.org/6AdgfTIM 2018.09.30
- robertlemalin
bydcollin
## 28 https://lichess.org/UTsDBRti 2018.09.30
-
ivobb
BlackPoison
## 29 https://lichess.org/fAr31RWf 2018.09.30
-
ganeshtsk
anubhavdawer
## 30 https://lichess.org/iGn3UquI 2018.09.30
-
Maximewolf
DARTAGNAN1414
## 31 https://lichess.org/O6bFp6hs 2018.09.30
-
AlexRose
dideilci
## 32 https://lichess.org/EyZMfDR2 2018.09.30
-
DirtySilver
congeja
## 33 https://lichess.org/bFnSVV95 2018.09.30
-
case963
KrzysztofG
## 34 https://lichess.org/RAa9RP7z 2018.09.30
-
kirito18
l0xp
## 35 https://lichess.org/sFjLQq9j 2018.09.30
-
joshdube
Davidsti1
## 36 https://lichess.org/a6Fut2lC 2018.09.30
-
Russiabah
pianista2018
## 37 https://lichess.org/t5BgyMzJ 2018.09.30
-
shyonfire
rato23
## 38 https://lichess.org/NAU4rpI0 2018.09.30
-
schu007
VictorAlencar
## 39 https://lichess.org/BII1pt2F 2018.09.30
-
omranlm
oneofalone
## 40 https://lichess.org/IhwxefAm 2018.09.30
-
mbkonacoglu ManuelMagnifico
##
Result
UTCDate
UTCTime WhiteElo BlackElo WhiteRatingDiff
## 21
1-0 2018.09.30 22:00:11
1664
1596
9
## 22
1-0 2018.09.30 22:00:11
2030
1992
26
## 23
1-0 2018.09.30 22:00:13
1781
1669
7
## 24
0-1 2018.09.30 22:00:13
1070
919
-14
## 25
1-0 2018.09.30 22:00:13
1523
1527
21
## 26
1-0 2018.09.30 22:00:13
1500
1500
7
## 27
1-0 2018.09.30 22:00:13
1420
1410
10
## 28
1-0 2018.09.30 22:00:13
1226
1218
11
## 29
1-0 2018.09.30 22:00:13
1592
1536
9
## 30
0-1 2018.09.30 22:00:13
1655
1675
-10
## 31
0-1 2018.09.30 22:00:13
1139
1158
-10
## 32
1-0 2018.09.30 22:00:13
1496
1599
14
## 33
1-0 2018.09.30 22:00:13
1243
1227
10
## 34 1/2-1/2 2018.09.30 22:00:13
1828
1805
-1
## 35
0-1 2018.09.30 22:00:13
1706
1705
-10
## 36
0-1 2018.09.30 22:00:13
1541
1555
-10
## 37 1/2-1/2 2018.09.30 22:00:13
2069
2051
-1
## 38
1-0 2018.09.30 22:00:13
1020
922
8
## 39
1-0 2018.09.30 22:00:13
1105
1176
13
## 40
0-1 2018.09.30 22:00:13
1895
1950
-9
##
BlackRatingDiff ECO
Opening
## 21
-13 A04
Zukertort Opening: Pirc Invitation
## 22
-9 C10
French Defense: Rubinstein Variation
## 23
-7 B30
Sicilian Defense: Old Sicilian
## 24
52 C00
French Defense: Knight Variation
## 25
-11 C44
Scotch Game
## 26
-175 C40
King
'
s Pawn Game: Damiano Defense
4
## 27
-10 D02 Queen
'
s Pawn Game: Symmetrical Variation, Pseudo-Catalan
## 28
-17 C40
Elephant Gambit: Paulsen Countergambit
## 29
-9 A00
Hungarian Opening: Sicilian Invitation
## 30
10 D02
Queen
'
s Pawn Game: Anti-Torre
## 31
10 B01
Scandinavian Defense
## 32
-13 D02 Queen
'
s Pawn Game: Symmetrical Variation, Pseudo-Catalan
## 33
-10 B10
Caro-Kann Defense: Two Knights Attack
## 34
1 A04
Zukertort Opening: Speelsmet Gambit
## 35
10 A06
Zukertort Opening: Tennison Gambit
## 36
10 D02 Queen
'
s Pawn Game: Symmetrical Variation, Pseudo-Catalan
## 37
1 C40
Elephant Gambit
## 38
-7 A43
Benoni Defense: Old Benoni
## 39
-13 A01
Nimzo-Larsen Attack: English Variation
## 40
9 C30
King
'
s Gambit Declined, Queen
'
s Knight Defense
##
TimeControl
Termination WhiteTitle BlackTitle move_lines tag_lines
## 21
60+0 Time forfeit
419
401
## 22
60+0 Time forfeit
439
421
## 23
180+0
Normal
459
441
## 24
600+0
Normal
479
461
## 25
600+0
Normal
499
481
## 26
600+0 Time forfeit
519
501
## 27
600+0 Time forfeit
539
521
## 28
600+0
Normal
559
541
## 29
600+0
Normal
579
561
## 30
600+0
Normal
599
581
## 31
600+0
Normal
619
601
## 32
60+0
Normal
639
621
## 33
60+0
Normal
659
641
## 34
60+0 Time forfeit
679
661
## 35
60+0
Normal
699
681
## 36
60+0 Time forfeit
719
701
## 37
60+0 Time forfeit
739
721
## 38
60+0 Time forfeit
759
741
## 39
60+0 Time forfeit
779
761
## 40
60+0 Time forfeit
799
781
Answer: I have added the code along with the output below. In the question it is said to append the columns
to the left where as the example file had it appended to the right so I have done the same as it makes more
sense. Along with this I used cbind instead of rbind as rbind is for concatenating rows. I have added a
screenshot of the csv file for the next 20 rows along with the skim of the data.
5
6
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
3.
[3 points] Find the quantile(. . . , probs=c(0.01,0.05,0.25,0.5,0.75,0.90,0.99,0.999,1)) of all player Elos.
Make sure to include both black and white players.
I am assuming by all players it means the White Player and black Player.I also included combined ELO
quantile I have printed their quantile below
df$WhiteElo %>% quantile(
probs=
c(
0.01
,
0.05
,
0.25
,
0.5
,
0.75
,
0.90
,
0.99
,
0.999
,
1
))
##
1%
5%
25%
50%
75%
90%
99%
99.9%
##
922.000 1081.000 1357.250 1584.000 1801.000 1996.000 2307.140 2484.084
##
100%
## 2544.000
df$BlackElo %>% quantile(
probs=
c(
0.01
,
0.05
,
0.25
,
0.5
,
0.75
,
0.90
,
0.99
,
0.999
,
1
))
##
1%
5%
25%
50%
75%
90%
99%
99.9%
##
913.930 1072.650 1357.000 1582.000 1797.750 1990.000 2315.070 2493.035
##
100%
## 2527.000
c(df$WhiteElo,df$BlackElo) %>% quantile(
probs=
c(
0.01
,
0.05
,
0.25
,
0.5
,
0.75
,
0.90
,
0.99
,
0.999
,
1
))
##
1%
5%
25%
50%
75%
90%
99%
99.9%
##
919.000 1076.350 1357.000 1582.000 1799.000 1994.000 2313.000 2493.039
##
100%
## 2544.000
4.
[4 points] Create a linear model of proportion of time the first player (white) wins as a function of the
amount of higher rating the first player has over the second player. Use this model to determine how
many rating points playing first is worth (i.e., how many rating points worse does the player using
white have to be in order to win exactly half of the time).
this question was very confusing for me because it asks as the PROPORTION of time the first player wins as
a function of the amount of HIGHER rating first player has over second player. This means our response
variable is a proportion/probability where if White is 30 points higher than second player then how many of
such player won or lost that is the proportion. Based on the wording both the variables are continuous and I
did accordingly.However, on Piazza Post 30, Professor stated that the response variable is 0 or 1 and that we
have to change the subject of the equation so that we can get x when y=0.5. Which is directly contradicting
what the question has asked that is how WORSE should first player be as compared to second player to win
the games half the time.
white_elo_higher
=
df %>% filter(Result !=
"1/2-1/2"
) %>%
summarise(
victory =
as.numeric(str_detect(Result,
"1-0"
)),
diff_elo =
WhiteElo-BlackElo) %>%
group_by(diff_elo) %>%
summarise(
white_win_perc =
mean(victory),
diff_elo =
min(diff_elo))
linear_white_win_model
=
lm(white_win_perc~diff_elo,
data=
white_elo_higher)
model_summary
=
summary(linear_white_win_model)
model_summary
##
## Call:
## lm(formula = white_win_perc ~ diff_elo, data = white_elo_higher)
##
## Residuals:
##
Min
1Q
Median
3Q
Max
## -0.93561 -0.17065 -0.00051
0.16803
1.06104
##
7
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.070e-01
1.044e-02
48.56
<2e-16 ***
## diff_elo
8.659e-04
3.816e-05
22.69
<2e-16 ***
## ---
## Signif. codes:
0
'
***
'
0.001
'
**
'
0.01
'
*
'
0.05
'
.
'
0.1
' '
1
##
## Residual standard error: 0.2794 on 714 degrees of freedom
## Multiple R-squared:
0.419,
Adjusted R-squared:
0.4182
## F-statistic: 514.9 on 1 and 714 DF,
p-value: < 2.2e-16
rating
=
(
0.5
- model_summary$coefficients[
1
,
1
])/model_summary$coefficients[
2
,
1
]
rating
## [1] -8.071927
So this is the rating that indicates how much worse should White Player Elo be from Black player to win half
the times.
Show the summary() of the lm()
5.
[3 points] Repeat the linear model from part 4, but filter() so that only games between players where
both players have a 2000 rating or better. Does the pattern hold.
I have produced the same result with same explanation above. The first model is according to Piazza Post 30
while second is according to the actual interpretation of the question.
white_elo_higher_2000
=
df %>% filter(Result !=
"1/2-1/2"
, WhiteElo >=
2000
, BlackElo >=
2000
)%>%
summarise(
victory =
as.numeric(str_detect(Result,
"1-0"
)),
diff_elo =
WhiteElo-BlackElo) %>%
group_by(diff_elo) %>%
summarise(
white_win_perc =
mean(victory),
diff_elo =
min(diff_elo))
linear_white_win_model_2000
=
lm(white_win_perc~diff_elo,
data=
white_elo_higher_2000)
model_summary_2000
=
summary(linear_white_win_model_2000)
rating_2000
=
(
0.5
- model_summary_2000$coefficients[
1
,
1
])/model_summary_2000$coefficients[
2
,
1
]
rating_2000
## [1] -10.04814
model_summary_2000
##
## Call:
## lm(formula = white_win_perc ~ diff_elo, data = white_elo_higher_2000)
##
## Residuals:
##
Min
1Q
Median
3Q
Max
## -0.93218 -0.38487
0.01425
0.39191
0.92425
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5130782
0.0293185
17.500
< 2e-16 ***
## diff_elo
0.0013016
0.0002265
5.745 3.08e-08 ***
## ---
## Signif. codes:
0
'
***
'
0.001
'
**
'
0.01
'
*
'
0.05
'
.
'
0.1
' '
1
##
## Residual standard error: 0.4337 on 217 degrees of freedom
## Multiple R-squared:
0.132,
Adjusted R-squared:
0.128
8
## F-statistic: 33.01 on 1 and 217 DF,
p-value: 3.077e-08
Show the summary() of the lm()
Yes the pattern still holds as more or less both have similar coefficients and the same difference of elo
*Note: A logistic regression would be better for each of these, we will return to this in class or in a later
assignment.
9
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
6.
[6 points] Add the first move to the cleaned dataset. Also add an indicator variable that is 1 if the
first move is either e4, and 0 otherwise. The moves_line variable from part 2 will be useful. Get two
separate table()s, one of each of the new variables.
The first move of the first five games is d4, f4, d4, e4, and d4, respectively.
I have created seperate tables and I also imputed the values and filled them through simple interpolation
using zoo library. Professor also had similar graph on Piazza
first_move_df
=
df %>%
mutate(
first_move =
str_extract(pgn_chess_classic[df$move_lines],
"[a-zA-Z]+
\\
d"
))
table(first_move_df$first_move)
##
##
a3
a4
b3
b4
c3
c4
d3
d4
e3
e4
f3
f4
g3
g4
h3
h4
##
7
4
68
17
6
170
46 1334
93 2868
6
47
86
13
3
10
##
Na3
Nc3
Nf3
Nh3
##
1
28
176
1
move_indicator_df
=
first_move_df %>%
mutate(
move_indicator =
as.numeric(first_move ==
"e4"
))
table(move_indicator_df$move_indicator)
##
##
0
1
## 2116 2868
7.
[4 points] Plot the proportion games where WhiteElo is in (-infty, 850), [850, 900), [900, 950). . . starts
the game by playing e4 as broken line graph (Hint: plot(. . . , type=‘b’). You may use ggplot instead,
but it will be worth no extra points.
max_elo
=
round_any(max(move_indicator_df$WhiteElo),
50
,
f=
ceiling)
series
=
seq(
from=
850
,
to=
max_elo,
by=
50
)
series[
35
]
=
Inf
interval_data
=
table(cut(move_indicator_df$WhiteElo,
breaks=
c(-Inf, series),
right =
FALSE))
start_move_interval
=
move_indicator_df %>%
mutate(
intervals =
cut(WhiteElo,
breaks=
c(-Inf, series),
right =
FALSE)) %>%
complete(intervals) %>%
group_by(intervals) %>%
summarise(
win_proportion_e4 =
mean(move_indicator))
library(zoo)
##
## Attaching package:
'
zoo
'
## The following objects are masked from
'
package:base
'
:
##
##
as.Date, as.Date.numeric
z
=
zoo(start_move_interval$win_proportion_e4)
start_move_interval$imputed
=
na.approx(z)
x_ticks
=
1
:length(start_move_interval$win_proportion_e4) %>%
map(
function
(x) (x*
50
)+(
800-25
))
10
plot(x_ticks, start_move_interval$imputed,
type=
'
b
'
,
main=
"Proportion Of Games with
\n
E4 starting for White Player"
,
xlab=
"Elo Ratings Intervals"
,
ylab=
"Starting Proportion of E4"
)
1000
1500
2000
2500
0.3
0.5
0.7
0.9
Proportion Of Games with
E4 starting for White Player
Elo Ratings Intervals
Starting Proportion of E4
11
8.
[6 points] A question mark on a move indicates that it is a blunder. A double question mark indicates
it is an extreme blunder. Make a multivariate model of the number of question marks that appear as a
function of the average of WhiteElo and BlackElo. (Hint: str_count() )
Show the summary() of the lm()
In this I regarded ?? and ?! as 2 question marks where as ? as 1 question mark
q8_blunder_df
=
df %>%
summarise(
blunder =
str_count(pgn_chess_classic[df$move_lines],
pattern =
"
\\
?"
),
ext_blunder_excl =
(
summarise(
total_blunder =
blunder + ext_blunder_excl,
mean_elo =
mean_elo)
q8_lm
=
lm(total_blunder ~ mean_elo,
data =
q8_blunder_df)
summary(q8_lm)
##
## Call:
## lm(formula = total_blunder ~ mean_elo, data = q8_blunder_df)
##
## Residuals:
##
Min
1Q Median
3Q
Max
## -2.963 -2.275 -1.998 -1.712 90.895
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
3.923993
0.663620
5.913 3.58e-09 ***
## mean_elo
-0.001193
0.000412
-2.895
0.00381 **
## ---
## Signif. codes:
0
'
***
'
0.001
'
**
'
0.01
'
*
'
0.05
'
.
'
0.1
' '
1
##
## Residual standard error: 8.911 on 4992 degrees of freedom
## Multiple R-squared:
0.001676,
Adjusted R-squared:
0.001476
## F-statistic: 8.381 on 1 and 4992 DF,
p-value: 0.003809
9.
[4 points] A time control of 300 + 1, for example, indicates that each player started the game with
5 minutes (300 seconds) to play all their moves in the match, and that each move played added (i.e.,
incremented) 1 second to their clock. Add log(starting time) (Hint: + I(log(x)) ) in seconds to your
model from question 7.
Show the summary() of the lm() comment on the differences between this model and the question 7 model.
q8_blunder_df
=
q8_blunder_df %>%
mutate(
starting_time =
ifelse(as.numeric(str_extract(df$TimeControl,
"(
\\
d+)"
))==
0
,
1
, as.numeric(str
q8_lm_multi
=
lm(total_blunder ~ mean_elo+I(log(starting_time)),
data =
q8_blunder_df)
summary(q8_lm_multi)
##
## Call:
## lm(formula = total_blunder ~ mean_elo + I(log(starting_time)),
##
data = q8_blunder_df)
##
## Residuals:
##
Min
1Q Median
3Q
Max
## -5.052 -2.644 -2.027 -0.998 90.319
##
12
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
-2.5062518
1.0844364
-2.311
0.0209 *
## mean_elo
-0.0005229
0.0004199
-1.245
0.2130
## I(log(starting_time))
1.0491728
0.1403341
7.476 8.99e-14 ***
## ---
## Signif. codes:
0
'
***
'
0.001
'
**
'
0.01
'
*
'
0.05
'
.
'
0.1
' '
1
##
## Residual standard error: 8.866 on 4975 degrees of freedom
##
(16 observations deleted due to missingness)
## Multiple R-squared:
0.01277,
Adjusted R-squared:
0.01237
## F-statistic: 32.17 on 2 and 4975 DF,
p-value: 1.314e-14
q8_lm_multi
##
## Call:
## lm(formula = total_blunder ~ mean_elo + I(log(starting_time)),
##
data = q8_blunder_df)
##
## Coefficients:
##
(Intercept)
mean_elo
I(log(starting_time))
##
-2.5062518
-0.0005229
1.0491728
The difference between these models is that univariate model has a positive intercept of 4 and the coefficient
for average elo is negative 0.001 where as the intercept for multivariate model is negative and the coefficient
of average elo is negative 0.0005 with a positive coefficient of 1 for log starting time
13
10.
[5 points] Chess 960 is a variant of chess where the starting positions of some of the pieces are randomized
before the match. Repeat steps 1 and 2 for the chess_960_database. Show the skim() of the resulting
dataset.
pgn_chess_classic
=
readLines(
"chess_960_games.pgn"
)
is_metadata
=
str_detect(pgn_chess_classic,
"ˆ
\\
[.*
\\
]$"
)
pgn_classic_meta
=
pgn_chess_classic
pgn_classic_meta[!is_metadata]
=
""
pgn_classic_meta
=
str_split_fixed(pgn_classic_meta,
" "
,
2
)
vars
=
str_replace(pgn_classic_meta[,
1
],
"
\\
["
,
""
)
values
=
pgn_classic_meta[,
2
]
values
=
str_replace_all(values,
"
\"
"
,
""
)
values
=
str_replace_all(values,
"
\\
]"
,
""
)
unique_keys
=
unique(vars)
unique_keys
=
unique_keys[-
19
]
#18 for classic
df
=
data.frame(matrix(
nrow =
0
,
ncol =
length(unique_keys)))
colnames(df)
=
unique_keys
dict
=
list()
dict[[
"WhiteTitle"
]]
=
""
dict[[
"BlackTitle"
]]
=
""
for
(i
in
1
:length(vars)){
if
(vars[i] ==
""
){
next
}
dict[[vars[i]]]
=
values[i]
if
(vars[i] ==
"Variant"
){
if
(!(
"WhiteRatingDiff"
%in% names(dict))){
dict[[
"WhiteRatingDiff"
]]
=
""
dict[[
"BlackRatingDiff"
]]
=
""
}
df
=
rbind(df,dict)
dict
=
list()
dict[[
"WhiteTitle"
]]
=
""
dict[[
"BlackTitle"
]]
=
""
}
}
df
=
df %>% relocate(WhiteTitle,
.after =
last_col()) %>% relocate(BlackTitle,
.after =
last_col())
df$WhiteRatingDiff
=
as.numeric(df$WhiteRatingDiff)
df$BlackRatingDiff
=
as.numeric(df$BlackRatingDiff)
df$BlackElo
=
as.numeric(df$BlackElo)
df$WhiteElo
=
as.numeric(df$WhiteElo)
df$SetUp
=
as.numeric(df$SetUp)
df
=
as_tibble(df)
pgn_classic_moves
=
pgn_chess_classic
is_tag
=
str_detect(pgn_chess_classic,
"ˆ
\\
[Event.*"
)
is_move
=
str_detect(pgn_chess_classic,
"ˆ[0-9].*|ˆ "
)
14
tag_lines
=
which(is_tag)
move_lines
=
which(is_move)
tag_lines
=
tag_lines[-
4420
]
df
=
df[-
4420
,]
df
=
cbind(df,
tag_line =
tag_lines )
df
=
cbind(df,
move_line =
move_lines)
Figure 1: Skim of the dataframe
15
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
11.
[5 points] Repeat questions 6 and 7 for the chess_960_database. (Get opening move, and plot the
proportion starting e4). Comment on the differences in the popularity of e4 as an opening move. (You
do not need to demonstrate any chess knowledge to make this comparison. You can if you wish, but
you are not being marked on chess knowledge)
I did not imputate the proportions in this because If the proportion does not exist then it should be zero. In
previous case, the answer was given by the professor in Piazza
first_move_df
=
df %>%
mutate(
first_move =
str_extract(pgn_chess_classic[df$move_line],
"[A-Za-z]+
\\
d"
))
table(first_move_df$first_move)
##
##
a3
a4
b3
b4
c3
c4
c6
d3
d4
d5
e3
e4
e5
f3
f4
g3
##
6
17
546
100
198
252
1
85
470
1
132
885
1
151
216
648
##
g4
h3
h4
Na3 Nab3
Nb3 Nbc3 Nbc6
Nc3
Nc6 Ncb3 Ncd3
Nd3 Ndc3
Ne3 Nef3
##
81
10
23
3
15
140
11
2
86
1
3
3
29
2
22
1
##
Nf3 Nfe3
Ng3
Ng6 Ngf3 Nhg3
##
102
3
125
1
15
21
move_indicator_df
=
first_move_df %>%
mutate(
move_indicator =
as.numeric(first_move ==
"e4"
))
table(move_indicator_df$move_indicator)
##
##
0
1
## 3523
885
move_indicator_df
=
move_indicator_df[-
22
]
#Q7
max_elo
=
round_any(max(move_indicator_df$WhiteElo),
50
,
f=
ceiling)
series
=
seq(
from=
850
,
to=
max_elo,
by=
50
)
series[length(series)]
=
Inf
intervals
=
cut(move_indicator_df$WhiteElo,
breaks=
c(-Inf, series),
right =
FALSE)
start_move_interval
=
move_indicator_df %>%
mutate(
intervals =
cut(WhiteElo,
breaks=
c(-Inf, series),
right =
FALSE)) %>%
complete(intervals) %>%
group_by(intervals) %>%
summarise(
win_proportion_e4 =
mean(move_indicator))
start_move_interval$win_proportion_e4
=
start_move_interval$win_proportion_e4 %>% replace(is.na(.),
0
)
x_ticks
=
1
:length(start_move_interval$intervals) %>%
map(
function
(x) (x*
50
)+(
800-25
))
plot(x_ticks, start_move_interval$win_proportion_e4,
type=
'
b
'
,
main=
"Proportion Of Games with
\n
E4 starting for White Player"
,
xlab=
"Elo Ratings Intervals"
,
ylab=
"Starting Proportion of E4"
)
16
1000
1500
2000
2500
0.0
0.2
0.4
0.6
0.8
1.0
Proportion Of Games with
E4 starting for White Player
Elo Ratings Intervals
Starting Proportion of E4
Although the opening move of E4 is generally more common with players that have less ELO scores however,
as they move towards higher elo rating or players with higher elo scores this popularity decreases. When
comparing the graphs, we can see that opening move of E4 is more common in classic chess as compared to
960 variant of chess where the graph is lying towards the range of 0.2
17
Recommended textbooks for you

Algebra: Structure And Method, Book 1
Algebra
ISBN:9780395977224
Author:Richard G. Brown, Mary P. Dolciani, Robert H. Sorgenfrey, William L. Cole
Publisher:McDougal Littell
Recommended textbooks for you
- Algebra: Structure And Method, Book 1AlgebraISBN:9780395977224Author:Richard G. Brown, Mary P. Dolciani, Robert H. Sorgenfrey, William L. ColePublisher:McDougal Littell

Algebra: Structure And Method, Book 1
Algebra
ISBN:9780395977224
Author:Richard G. Brown, Mary P. Dolciani, Robert H. Sorgenfrey, William L. Cole
Publisher:McDougal Littell