2023-01-30_dataframes

School: University of Houston
Course: 404
Subject: Statistics
Date: Apr 3, 2024

Notes 3: Data Frames and Control
STAT 404: Statistical Computing

Recap

1. What attribute defines an S3 object on a base type? What are some (one main one) S3 objects we have discussed?
   If it has a class attribute, it's S3. A data frame has class data.frame.
2. .Rmd demo: Open a new .Rmd file and practice the following three ways to insert a new code chunk:
   - The keyboard shortcut Cmd + Option + I / Ctrl + Alt + I.
   - The "Insert" button icon in the editor toolbar.
   - By manually typing the chunk delimiters ```{r} and ```.
3. What do the chunk options eval and echo control?

Outline

- Making and working with data frames
- Subsetting
- Adding new variables (columns)
- Removing variables (columns)

In our last thrilling episode

- Atomic vectors: series of values all of the same type, e.g., v[5], v["name"]
- Arrays: multi-dimensional generalization of atomic vectors, e.g., a[5,6,2], a[,6,], a["rowname", "colname", "layername"]
- Matrices: special 2D arrays with matrix math, e.g., m[5,6], m[,6], m[,"colname"]
- Lists: vector of values of mixed types, e.g., l[[3]], l$name
- Data frames: list with data.frame class attribute; matrix and list indexing work
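The recap point about S3 objects can be checked directly in the console. The following is a minimal sketch, not part of the original notes; the object names df and plain are made up for illustration.

```r
# A data frame is a list (its base type) plus a class attribute (which makes it S3).
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

typeof(df)         # base type: "list"
attr(df, "class")  # the class attribute: "data.frame"
is.object(df)      # TRUE, because a class attribute is set

# Removing the class attribute leaves an ordinary list behind.
plain <- unclass(df)
class(plain)       # "list"
is.object(plain)   # FALSE

# Both list-style and matrix-style indexing work on the data frame.
df[[2]]            # list-style: second column
df$y               # list-style, by name
df[2, "x"]         # matrix-style: row 2, column "x"
```

Seeing unclass() turn the data frame back into a plain list is a quick way to remember why both indexing styles work.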
Data frames, encore

- 2D tables of data
- Each case/observation is a row
- Each variable/feature is a column
- Variables can be of any type (numbers, text, Booleans, ...)
- Both rows and columns can get names

Creating an example data frame

Use data.frame(), similar to how we create lists with list()

my.df = data.frame(nums = seq(0.1, 0.6, by = 0.1), chars = letters[1:6],
                   bools = sample(c(TRUE, FALSE), 6, replace = TRUE))
my.df
##   nums chars bools
## 1  0.1     a FALSE
## 2  0.2     b FALSE
## 3  0.3     c FALSE
## 4  0.4     d FALSE
## 5  0.5     e FALSE
## 6  0.6     f  TRUE
attributes(my.df)
## $names
## [1] "nums"  "chars" "bools"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6

# Note, a list can have different lengths for different elements!
my.list = list(nums = seq(0.1, 0.6, by = 0.1), chars = letters[1:12],
               bools = sample(c(TRUE, FALSE), 6, replace = TRUE))
my.list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE
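The contrast between lists and data frames on element lengths can be made concrete. This is a small sketch under the assumption that base R's data.frame() recycling rule applies (shorter atomic vectors are recycled only a whole number of times); the names bad and recycled are illustrative, not from the notes.

```r
nums <- seq(0.1, 0.6, by = 0.1)  # length 6

# Lengths 6 and 4: 6 is not a whole multiple of 4, so data.frame() signals an error.
# tryCatch() captures the error message instead of stopping the script.
bad <- tryCatch(
  data.frame(nums = nums, chars = letters[1:4]),
  error = function(e) conditionMessage(e)
)
print(bad)

# Lengths 6 and 12: the shorter vector is silently recycled to fill 12 rows.
recycled <- data.frame(nums = nums, chars = letters[1:12])
nrow(recycled)

# A list, by contrast, keeps every element at its own length.
lengths(list(nums = nums, chars = letters[1:12]))
```

The silent recycling in the second case is worth knowing about: a data frame built from the same inputs as my.list would not fail, it would quietly repeat the shorter columns.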
Indexing a data frame

- By rows/columns: similar to how we index matrices
- By columns only: similar to how we index lists

my.df[, 1] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df[, "nums"] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df$nums # Doesn't work for a matrix, but works for a list
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df$chars # Note: before R 4.0.0, this column would have been converted into a factor
## [1] "a" "b" "c" "d" "e" "f"
as.character(my.df$chars) # Converts a factor back to character; since R 4.0.0, data.frame() no longer converts strings to factors by default
## [1] "a" "b" "c" "d" "e" "f"

Creating a data frame from a matrix

Often it's helpful to start with a matrix, and add columns (of different data types) to make it a data frame

class(state.x77) # Built-in matrix of states data, 50 states x 8 variables
## [1] "matrix" "array"
head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
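Why move from a matrix to a data frame at all? A matrix holds a single type, so attaching a text column coerces everything. The sketch below uses a tiny made-up matrix m (not from the notes) to show the difference.

```r
# A small numeric matrix with row and column names (hypothetical example data).
m <- matrix(c(1.5, 2.5, 3.5, 4.5), nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b")))

# cbind() keeps the result a matrix, so the numbers are coerced to character.
m.chr <- cbind(m, label = c("x", "y"))
typeof(m.chr)

# data.frame() keeps each column's own type and carries the row names over.
d <- data.frame(m, label = c("x", "y"))
sapply(d, class)
rownames(d)
```

This is the same pattern the notes apply to state.x77: start from the numeric matrix and let data.frame() attach columns of other types.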
class(state.region) # Factor of regions for the 50 states
## [1] "factor"
head(state.region)
## [1] South West  West  South West  West
## Levels: Northeast South North Central West
class(state.division) # Factor of divisions for the 50 states
## [1] "factor"
head(state.division)
## [1] East South Central Pacific            Mountain           West South Central
## [5] Pacific            Mountain
## 9 Levels: New England Middle Atlantic South Atlantic ... Pacific
levels(state.division)
## [1] "New England"        "Middle Atlantic"    "South Atlantic"
## [4] "East South Central" "West South Central" "East North Central"
## [7] "West North Central" "Mountain"           "Pacific"
is.object(state.division)
## [1] TRUE
typeof(state.division)
## [1] "integer"

# Combine these into a data frame with 50 rows and 10 columns
state.df = data.frame(state.x77, Region = state.region, Division = state.division)
class(state.df)
## [1] "data.frame"
head(state.df) # Note that the first 8 column names carried over from state.x77
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
##            Region           Division
## Alabama     South East South Central
## Alaska       West            Pacific
## Arizona      West           Mountain
## Arkansas    South West South Central
## California   West            Pacific
## Colorado     West           Mountain
data.frame(unname(state.x77), Region = state.region, Division = state.division)
##       X1   X2  X3    X4   X5   X6  X7     X8        Region           Division
## 1   3615 3624 2.1 69.05 15.1 41.3  20  50708         South East South Central
## 2    365 6315 1.5 69.31 11.3 66.7 152 566432          West            Pacific
## 3   2212 4530 1.8 70.55  7.8 58.1  15 113417          West           Mountain
## 4   2110 3378 1.9 70.66 10.1 39.9  65  51945         South West South Central
## 5  21198 5114 1.1 71.71 10.3 62.6  20 156361          West            Pacific
## 6   2541 4884 0.7 72.06  6.8 63.9 166 103766          West           Mountain
## 7   3100 5348 1.1 72.48  3.1 56.0 139   4862     Northeast        New England
## 8    579 4809 0.9 70.06  6.2 54.6 103   1982         South     South Atlantic
## 9   8277 4815 1.3 70.66 10.7 52.6  11  54090         South     South Atlantic
## 10  4931 4091 2.0 68.54 13.9 40.6  60  58073         South     South Atlantic
## 11   868 4963 1.9 73.60  6.2 61.9   0   6425          West            Pacific
## 12   813 4119 0.6 71.87  5.3 59.5 126  82677          West           Mountain
## 13 11197 5107 0.9 70.14 10.3 52.6 127  55748 North Central East North Central
## 14  5313 4458 0.7 70.88  7.1 52.9 122  36097 North Central East North Central
## 15  2861 4628 0.5 72.56  2.3 59.0 140  55941 North Central West North Central
## 16  2280 4669 0.6 72.58  4.5 59.9 114  81787 North Central West North Central
## 17  3387 3712 1.6 70.10 10.6 38.5  95  39650         South East South Central
## 18  3806 3545 2.8 68.76 13.2 42.2  12  44930         South West South Central
## 19  1058 3694 0.7 70.39  2.7 54.7 161  30920     Northeast        New England
## 20  4122 5299 0.9 70.22  8.5 52.3 101   9891         South     South Atlantic
## 21  5814 4755 1.1 71.83  3.3 58.5 103   7826     Northeast        New England
## 22  9111 4751 0.9 70.63 11.1 52.8 125  56817 North Central East North Central
## 23  3921 4675 0.6 72.96  2.3 57.6 160  79289 North Central West North Central
## 24  2341 3098 2.4 68.09 12.5 41.0  50  47296         South East South Central
## 25  4767 4254 0.8 70.69  9.3 48.8 108  68995 North Central West North Central
## 26   746 4347 0.6 70.56  5.0 59.2 155 145587          West           Mountain
## 27  1544 4508 0.6 72.60  2.9 59.3 139  76483 North Central West North Central
## 28   590 5149 0.5 69.03 11.5 65.2 188 109889          West           Mountain
## 29   812 4281 0.7 71.23  3.3 57.6 174   9027     Northeast        New England
## 30  7333 5237 1.1 70.93  5.2 52.5 115   7521     Northeast    Middle Atlantic
## 31  1144 3601 2.2 70.32  9.7 55.2 120 121412          West           Mountain
## 32 18076 4903 1.4 70.55 10.9 52.7  82  47831     Northeast    Middle Atlantic
## 33  5441 3875 1.8 69.21 11.1 38.5  80  48798         South     South Atlantic
## 34   637 5087 0.8 72.78  1.4 50.3 186  69273 North Central West North Central
## 35 10735 4561 0.8 70.82  7.4 53.2 124  40975 North Central East North Central
## 36  2715 3983 1.1 71.42  6.4 51.6  82  68782         South West South Central
## 37  2284 4660 0.6 72.13  4.2 60.0  44  96184          West            Pacific
## 38 11860 4449 1.0 70.43  6.1 50.2 126  44966     Northeast    Middle Atlantic
## 39   931 4558 1.3 71.90  2.4 46.4 127   1049     Northeast        New England
## 40  2816 3635 2.3 67.96 11.6 37.8  65  30225         South     South Atlantic
## 41   681 4167 0.5 72.08  1.7 53.3 172  75955 North Central West North Central
## 42  4173 3821 1.7 70.11 11.0 41.8  70  41328         South East South Central
## 43 12237 4188 2.2 70.90 12.2 47.4  35 262134         South West South Central
## 44  1203 4022 0.6 72.90  4.5 67.3 137  82096          West           Mountain
## 45   472 3907 0.6 71.64  5.5 57.1 168   9267     Northeast        New England
## 46  4981 4701 1.4 70.08  9.5 47.8  85  39780         South     South Atlantic
## 47  3559 4864 0.6 71.72  4.3 63.5  32  66570          West            Pacific
## 48  1799 3617 1.4 69.48  6.7 41.6 100  24070         South     South Atlantic
## 49  4589 4468 0.7 72.48  3.0 54.5 149  54464 North Central East North Central
## 50   376 4566 0.6 70.29  6.9 62.9 173  97203          West           Mountain

- data.frame() is combining a pre-existing matrix (state.x77) and two vectors of qualitative categorical variables (called factors; state.region, state.division)
- Column names are preserved or guessed if not explicitly set

colnames(state.df)
##  [1] "Population" "Income"     "Illiteracy" "Life.Exp"   "Murder"
##  [6] "HS.Grad"    "Frost"      "Area"       "Region"     "Division"
state.df[1, ]
##         Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area Region
## Alabama       3615   3624        2.1    69.05   15.1    41.3    20 50708  South
##                   Division
## Alabama East South Central

Data frame access

By row and column index

state.df[49, 3]
## [1] 0.7

By row and column names

state.df["Wisconsin", "Illiteracy"]
## [1] 0.7
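The printed column names are not quite the ones in state.x77: "Life Exp" and "HS Grad" became "Life.Exp" and "HS.Grad". That comes from data.frame()'s check.names argument, which defaults to TRUE and makes names syntactically valid. The sketch below is an illustration of that behavior, with raw.df and fixed.df as made-up names.

```r
# The original matrix uses a column name containing a space.
colnames(state.x77)[4]

# Default behaviour: check.names = TRUE rewrites the name to a syntactic one.
fixed.df <- data.frame(state.x77, Region = state.region)
colnames(fixed.df)[4]

# check.names = FALSE keeps the names exactly as they were in the matrix.
raw.df <- data.frame(state.x77, Region = state.region, check.names = FALSE)
colnames(raw.df)[4]

# Non-syntactic names still work for access, but need quotes or backticks.
raw.df["Wisconsin", "Life Exp"]
raw.df$`Life Exp`[49]

# make.names() is the function that performs the rewriting.
make.names(c("Life Exp", "HS Grad"))
```

Keeping the default is usually more convenient, since names like Life.Exp can be used with $ and inside subset() or with() without backticks.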
rownames(state.df)
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia"
## [49] "Wisconsin"      "Wyoming"
state.df["Wisconsin", 3]
## [1] 0.7
class(state.df["Wisconsin", 3])
## [1] "numeric"

Data frame access (cont'd.)

All of a row:

state.df["Wisconsin", ]
##           Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Wisconsin       4589   4468        0.7    72.48      3    54.5   149 54464
##                  Region           Division
## Wisconsin North Central East North Central
class(state.df["Wisconsin", ])
## [1] "data.frame"

Exercise: what class is state.df["Wisconsin",]? Data frame

Data frame access (cont'd.)

All of a column:
head(state.df[, 3])
## [1] 2.1 1.5 1.8 1.9 1.1 0.7
head(state.df[, "Illiteracy"])
## [1] 2.1 1.5 1.8 1.9 1.1 0.7
head(state.df$Illiteracy)
## [1] 2.1 1.5 1.8 1.9 1.1 0.7

Data frame access (cont'd.)

Rows matching a condition:

state.df[state.df$Division == "New England", "Illiteracy"]
## [1] 1.1 0.7 1.1 0.7 1.3 0.6
state.df[state.df$Region == "South", "Illiteracy"]
##  [1] 2.1 1.9 0.9 1.3 2.0 1.6 2.8 0.9 2.4 1.8 1.1 2.3 1.7 2.2 1.4 1.4

Adding columns to a data frame

To add columns: we can either use data.frame(), or directly define a new named column

# First way: use data.frame() to concatenate on a new column
state.df = data.frame(state.df, Cool = sample(c(TRUE, FALSE), nrow(state.df), replace = TRUE))
head(state.df, 4)
##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama        3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska          365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona        2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas       2110   3378        1.9    70.66   10.1    39.9    65  51945
##          Region           Division  Cool
## Alabama   South East South Central FALSE
## Alaska     West            Pacific  TRUE
## Arizona    West           Mountain FALSE
## Arkansas  South West South Central FALSE
# Second way: just directly define a new named column
state.df$Score = sample(1:100, nrow(state.df), replace = TRUE)
head(state.df, 4)
##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama        3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska          365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona        2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas       2110   3378        1.9    70.66   10.1    39.9    65  51945
##          Region           Division  Cool Score
## Alabama   South East South Central FALSE    42
## Alaska     West            Pacific  TRUE    49
## Arizona    West           Mountain FALSE    84
## Arkansas  South West South Central FALSE    75
ncol(state.df)
## [1] 12
state.df[, 13] <- NA
head(state.df)
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
##            Region           Division  Cool Score V13
## Alabama     South East South Central FALSE    42  NA
## Alaska       West            Pacific  TRUE    49  NA
## Arizona      West           Mountain FALSE    84  NA
## Arkansas    South West South Central FALSE    75  NA
## California   West            Pacific FALSE    89  NA
## Colorado     West           Mountain  TRUE     4  NA
state.df <- state.df[, -13]

Deleting columns from a data frame

To delete columns: we can either use negative integer indexing, or set a column to NULL

# First way: use negative integer indexing
state.df = state.df[, -ncol(state.df)]
head(state.df, 4)
##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama        3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska          365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona        2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas       2110   3378        1.9    70.66   10.1    39.9    65  51945
##          Region           Division  Cool
## Alabama   South East South Central FALSE
## Alaska     West            Pacific  TRUE
## Arizona    West           Mountain FALSE
## Arkansas  South West South Central FALSE

# Second way: just directly set a column to NULL
state.df$Cool = NULL
head(state.df, 4)
##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama        3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska          365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona        2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas       2110   3378        1.9    70.66   10.1    39.9    65  51945
##          Region           Division
## Alabama   South East South Central
## Alaska     West            Pacific
## Arizona    West           Mountain
## Arkansas  South West South Central

Reminder: Boolean indexing

With matrices or data frames, we'll often want to access a subset of the rows corresponding to some condition. You already know how to do this, with Boolean indexing

# Compare the averages of the Frost column between states in New England and
# Pacific divisions
mean(state.df[(state.df$Division == "New England"), "Frost"])
## [1] 145.3333
mean(state.df[(state.df$Division == "Pacific"), "Frost"])
## [1] 49.6

What is the average of Frost for the division that contains Texas?

# Which division contains Texas?
tex_div <- state.df["Texas", "Division"]
state.df["Texas", ]
##       Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area Region
## Texas      12237   4188        2.2     70.9   12.2    47.4    35 262134  South
##                 Division
## Texas West South Central
mean(state.df[(state.df$Division == "West South Central"), "Frost"])
## [1] 48.5
mean(state.df[(state.df$Division == tex_div), "Frost"])
## [1] 48.5

subset()

The subset() function provides a convenient alternative way of accessing rows for data frames

# Using subset(), we can just use the column names directly (i.e., no need for
# using $)
state.df.ne.1 = subset(state.df, Division == "New England")
# Get same thing by extracting the appropriate rows manually
state.df.ne.2 = state.df[state.df$Division == "New England", ]
all(state.df.ne.1 == state.df.ne.2)
## [1] TRUE

# Same calculation as in the last slide, using subset()
mean(subset(state.df, Division == "New England")$Frost)
## [1] 145.3333
mean(subset(state.df, Division == "Pacific")$Frost) # Wimps
## [1] 49.6
mean(subset(state.df, Division == "New England", Frost, drop = TRUE))
## [1] 145.3333

Replacing values

Parts or all of the data frame can be assigned to:
summary(state.df$HS.Grad)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   37.80   48.05   53.25   53.11   59.15   67.30
state.df$HS.Grad <- state.df$HS.Grad / 100
summary(state.df$HS.Grad)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.3780  0.4805  0.5325  0.5311  0.5915  0.6730
state.df$HS.Grad <- 100 * state.df$HS.Grad
# state.df$HS.Grad[1:5] <- NA

with()

The with() function provides a way of expressing operations by column names only.

What percentage of literate adults graduated high school?

head(100 * (state.df$HS.Grad / (100 - state.df$Illiteracy)))
## [1] 42.18590 67.71574 59.16497 40.67278 63.29626 64.35045

with() takes a data frame and evaluates an expression "inside" it:

with(state.df, head(100 * (HS.Grad / (100 - Illiteracy))))
## [1] 42.18590 67.71574 59.16497 40.67278 63.29626 64.35045

(so you don't have to type state.df$xyz)

Data arguments

Lots of functions take data arguments, and look variables up in that data frame:

plot(Illiteracy ~ Frost, data = state.df)
[Figure: scatterplot of Illiteracy (y-axis, roughly 0.5 to 2.5) against Frost (x-axis, roughly 0 to 150)]

Summary

- Data frames are a representation of the "classic" data table in R: rows are observations/cases, columns are variables/features
- Each column can be a different data type (but must be the same length)
- subset(): function for extracting rows of a data frame meeting a condition
- with(): function for operating on data frame columns without indexing
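As a compact recap, the sketch below rebuilds the state data frame from the built-in objects and combines subset() and with() in one place. The select argument of subset() is not used in the notes themselves; it is included here as an assumption about what is useful, and the names ne and south.small are made up.

```r
# Rebuild the data frame from built-in objects so the example is self-contained.
state.df <- data.frame(state.x77, Region = state.region, Division = state.division)

# subset(): keep only the rows that meet a condition.
ne <- subset(state.df, Division == "New England")
nrow(ne)

# with(): evaluate an expression using column names directly.
with(ne, mean(Frost))

# Same value via Boolean indexing, for comparison.
mean(state.df$Frost[state.df$Division == "New England"])

# subset() can also pick columns through its select argument.
south.small <- subset(state.df, Region == "South" & Illiteracy < 1,
                      select = c(Illiteracy, Frost))
south.small
```

All three approaches read the same columns; subset() and with() mainly save typing and make the condition easier to read.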