Data Wrangling Assignment: Analyzing Chess Game Metadata

STAT 847: Analysis Assignment 1

DUE: Thursday, February 2, 2023 by 11:59pm EST

NOTES

Your assignment must be submitted by the due date listed at the top of this document, and it must be submitted electronically in .pdf format via Crowdmark.

Organization and comprehensibility are part of a full solution. Consequently, points will be deducted for solutions that are disorganized or incomprehensible. Furthermore, if you submit your assignment to Crowdmark but do so incorrectly in any way (e.g., you upload your Question 2 solution in the Question 1 box), you will receive a 5% deduction (i.e., 5% of the assignment’s point total will be deducted from your point total).

This assignment is all about cleaning (wrangling) text data of a custom structure. Please show all your code and comment it accordingly. There are a total of 50 points possible. Look at the end of the Week 02 notes for a start on this assignment.

1. [7 points] Create a data frame or a tibble of the tags/metadata portion of the games in chess_classic_games.pgn. One row should represent one game, and one column should represent one tag. Tags that appear in some games, but not in a given game, should be left as NA in any game that doesn’t have them. For example, the first five rows should look like classic_first_five.csv. Show the next five lines and the skim() of the dataset.

2. [3 points] Add two columns to the left end of the data set (hint: cbind() can do this, and so can select()). These two new first columns should hold, for each game, the line number in chess_classic_games.pgn of the first line that contains a tag and of the line that contains the moves, respectively. For example, the first five values of tag_line should be 1, 21, 41, 61, 81, and the first five values of moves_line should be 19, 39, 59, 79, 99.

library(plyr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.8     v dplyr   1.0.10
## v tidyr   1.2.1     v stringr 1.4.1
## v readr   2.1.3     v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(stringr)
library(skimr)

# Read the PGN file, one element per line
pgn_chess_classic = readLines("chess_classic_games.pgn")

# A metadata (tag) line starts with "[" and ends with "]"
is_metadata = str_detect(pgn_chess_classic, "^\\[.*\\]$")

# Blank out every non-tag line so only tags remain
pgn_classic_meta = pgn_chess_classic
pgn_classic_meta[!is_metadata] = ""

# Split each tag line at the first space into a name part and a value part
pgn_classic_meta = str_split_fixed(pgn_classic_meta, " ", 2)

# Tag names: strip the leading "["
vars = str_replace(pgn_classic_meta[, 1], "\\[", "")
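# Tag values: strip the quotation marks and the trailing "]"
values = pgn_classic_meta[, 2]
values = str_replace_all(values, "\"", "")
values = str_replace_all(values, "\\]", "")

# Column names are the distinct tag names; position 18 is dropped
unique_keys = unique(vars)
unique_keys = unique_keys[-18] #18 for classic

# Empty data frame with one column per tag
df = data.frame(matrix(nrow = 0, ncol = length(unique_keys)))
colnames(df) = unique_keys

# dict collects the tags of the game currently being read
dict = list()
dict[["WhiteTitle"]] = ""
dict[["BlackTitle"]] = ""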

for (i in 1:length(vars)) {
  # Skip lines that are not tags
  if (vars[i] == "") {
    next
  }
  dict[[vars[i]]] = values[i]
  # Termination is treated as the last tag of a game: store the row and reset
  if (vars[i] == "Termination") {
    if (!("WhiteRatingDiff" %in% names(dict))) {
      dict[["WhiteRatingDiff"]] = NA
      dict[["BlackRatingDiff"]] = NA
    }
    df = rbind(df, dict)
    dict = list()
    dict[["WhiteTitle"]] = ""
    dict[["BlackTitle"]] = ""
  }
}

# Move the title columns to the right end
df = df %>%
  relocate(WhiteTitle, .after = last_col()) %>%
  relocate(BlackTitle, .after = last_col())

# Convert the rating columns from character to numeric
df$WhiteRatingDiff = as.numeric(df$WhiteRatingDiff)
df$BlackRatingDiff = as.numeric(df$BlackRatingDiff)
df$BlackElo = as.numeric(df$BlackElo)
df$WhiteElo = as.numeric(df$WhiteElo)
df = as_tibble(df)

# Line numbers of each game's first tag (Event) and of its moves
is_tag = str_detect(pgn_chess_classic, "^\\[Event.*")
is_move = str_detect(pgn_chess_classic, "^[0-9].*|^ ")
tag_lines = which(is_tag)
move_lines = which(is_move)

# cbind() with df as the first argument appends the new columns at the right end
df = cbind(df, move_lines = move_lines)
df = cbind(df, tag_lines = tag_lines)
write.csv(df, file = "Q1-2.csv")
df[21:40, ]
##                                                              Event
## 21 Rated Bullet tournament https://lichess.org/tournament/rs53Xpda
## 22                                               Rated Bullet game
## 23                                                Rated Blitz game
## 24                                                Rated Rapid game
## 25                                                Rated Rapid game
## 26                                                Rated Rapid game
## 27                                                Rated Rapid game
## 28                                                Rated Rapid game
## 29                                                Rated Rapid game
## 30                                                Rated Rapid game
## 31                                                Rated Rapid game
## 32                                               Rated Bullet game
## 33                                               Rated Bullet game
## 34                                               Rated Bullet game
## 35                                               Rated Bullet game
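
The two questions can also be approached without a row-by-row loop. The following is only a minimal base R sketch, not the assignment's reference solution. It runs on a made-up two-game sample (pgn_lines and every value in it are hypothetical stand-ins for chess_classic_games.pgn), and it assumes that every game starts with an Event tag and has exactly one moves line beginning with a digit.

```r
# Hypothetical two-game sample standing in for chess_classic_games.pgn;
# the player names and results are invented for illustration.
pgn_lines <- c(
  '[Event "Rated Blitz game"]',
  '[White "alice"]',
  '[Black "bob"]',
  '[Result "1-0"]',
  '',
  '1. e4 e5 2. Nf3 Nc6 1-0',
  '',
  '[Event "Rated Rapid game"]',
  '[White "carol"]',
  '[Black "dave"]',
  '[Result "0-1"]',
  '[WhiteTitle "FM"]',
  '',
  '1. d4 d5 0-1'
)

# A tag line has the form [Key "Value"]; capture the key and the value.
tag_pattern <- '^\\[([A-Za-z0-9_]+) "(.*)"\\]$'
is_tag <- grepl(tag_pattern, pgn_lines)

# Assumption: every game starts with an Event tag, so a running count of
# Event lines gives the game number of every line.
is_event <- grepl('^\\[Event ', pgn_lines)
game_id <- cumsum(is_event)

keys   <- sub(tag_pattern, '\\1', pgn_lines[is_tag])
values <- sub(tag_pattern, '\\2', pgn_lines[is_tag])
games  <- game_id[is_tag]

# Start from an all-NA character matrix (one row per game, one column per
# tag) so that tags a game does not have remain NA.
all_keys <- unique(keys)
tag_matrix <- matrix(NA_character_, nrow = max(game_id), ncol = length(all_keys),
                     dimnames = list(NULL, all_keys))
tag_matrix[cbind(games, match(keys, all_keys))] <- values

# Line numbers of each game's first tag and of its moves.
# Assumption: each game has exactly one moves line starting with a digit.
tag_line <- which(is_event)
moves_line <- which(grepl('^[0-9]', pgn_lines))

# Listing the two new columns first places them at the left end.
meta <- data.frame(tag_line = tag_line, moves_line = moves_line,
                   tag_matrix, stringsAsFactors = FALSE)
print(meta)
```

In this sketch the first game has no WhiteTitle tag, so its WhiteTitle entry stays NA, while the second game's entry is "FM". On the real file the moves pattern may need to be widened, for example to also match lines that begin with a space.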
