2023-01-30_dataframes

School

University of Houston

Course

404

Subject

Statistics

Date

Apr 3, 2024


Notes 3: Data Frames and Control
STAT 404: Statistical Computing

Recap

1. What attribute defines an S3 object on a base type? What are some (one main one) S3 objects we have discussed?
   If it has a class attribute, it's S3. A data frame has class data.frame.
2. .Rmd demo: Open a new .Rmd file and practice the following three ways to insert a new code chunk:
   • The keyboard shortcut Cmd + Option + I / Ctrl + Alt + I.
   • The “Insert” button icon in the editor toolbar.
   • By manually typing the chunk delimiters ```{r} and ```.
3. What do the chunk options eval and echo control?

Outline

• Making and working with data frames
  – Subsetting
  – Adding new variables (columns)
  – Removing variables (columns)

In our last thrilling episode

• Atomic vectors: series of values all of the same type, e.g., v[5], v["name"]
• Arrays: multi-dimensional generalization of atomic vectors, e.g., a[5,6,2], a[,6,], a["rowname", "colname", "layername"]
• Matrices: special 2D arrays with matrix math, e.g., m[5,6], m[,6], m[,"colname"]
• Lists: vector of values of mixed types, e.g., l[[3]], l$name
• Data frames: list with data.frame class attribute; matrix and list indexing work
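As a quick check of the first recap answer, the sketch below inspects a toy data frame (the column names x and y are placeholders, not from the notes). It relies only on base R functions and shows that a data frame is a list underneath with a class attribute on top.

```r
# Toy data frame; the column names x and y are placeholders for illustration.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

typeof(df)                  # the underlying base type is "list"
attr(df, "class")           # the class attribute is "data.frame"
is.object(df)               # TRUE, because a class attribute is set
inherits(df, "data.frame")  # TRUE

# Removing the class attribute leaves the underlying plain list
plain <- unclass(df)
is.list(plain)              # TRUE
is.object(plain)            # FALSE
```

Because the underlying type is a list, list-style access such as df$x and df[["x"]] works on a data frame.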

Data frames, encore

• 2D tables of data
• Each case/observation is a row
• Each variable/feature is a column
• Variables can be of any type (numbers, text, Booleans, ...)
• Both rows and columns can get names

Creating an example data frame

Use data.frame(), similar to how we create lists with list()

my.df = data.frame(nums=seq(0.1, 0.6, by=0.1), chars=letters[1:6],
                   bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.df
##   nums chars bools
## 1  0.1     a FALSE
## 2  0.2     b FALSE
## 3  0.3     c FALSE
## 4  0.4     d FALSE
## 5  0.5     e FALSE
## 6  0.6     f  TRUE

attributes(my.df)
## $names
## [1] "nums"  "chars" "bools"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6

# Note, a list can have different lengths for different elements!
my.list = list(nums=seq(0.1, 0.6, by=0.1), chars=letters[1:12],
               bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE
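To make the contrast between list() and data.frame() concrete, here is a minimal sketch with toy values (all names are illustrative). It assumes only base R: a list accepts elements of any lengths, while data.frame() stops with an error when column lengths cannot be reconciled, and recycles a shorter column only when it fits a whole number of times.

```r
# A list accepts elements of different lengths
ok.list <- list(nums = 1:6, chars = letters[1:12])
lengths(ok.list)

# data.frame() needs columns of compatible length; 6 and 5 cannot be
# reconciled, so the call fails. tryCatch() captures the error message
# instead of stopping the script.
msg <- tryCatch(
  {
    data.frame(nums = 1:6, chars = letters[1:5])
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# Shorter columns are recycled only when they fit a whole number of times
recycled <- data.frame(id = 1:6, flag = TRUE, half = c("lo", "hi"))
dim(recycled)
```

Checking dim() or nrow() right after construction is a cheap way to catch silent recycling.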

Indexing a data frame

• By rows/columns: similar to how we index matrices
• By columns only: similar to how we index lists

my.df[,1] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6

my.df[,"nums"] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6

my.df$nums # Doesn't work for a matrix, but works for a list
## [1] 0.1 0.2 0.3 0.4 0.5 0.6

my.df$chars # Note: in R versions before 4.0.0, this column would have been converted into a factor
## [1] "a" "b" "c" "d" "e" "f"

as.character(my.df$chars) # Converts a factor back to character; as of R 4.0.0, data.frame() no longer converts strings to factors by default
## [1] "a" "b" "c" "d" "e" "f"

Creating a data frame from a matrix

Often it's helpful to start with a matrix and add columns (of different data types) to make it a data frame

class(state.x77) # Built-in matrix of states data, 50 states x 8 variables
## [1] "matrix" "array"

head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
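The reason for moving from a matrix to a data frame can be seen in a small sketch. The toy matrix m and the label column below are made up for illustration; the point is that a matrix must hold a single type, so attaching a text column with cbind() coerces every value to character, whereas data.frame() keeps each column's own type.

```r
# Toy numeric matrix; row and column names are placeholders
m <- matrix(c(1.5, 2.5, 3.5, 4.5), nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b")))

# cbind() on a matrix must keep a single type, so everything becomes character
m.chr <- cbind(m, label = c("low", "high"))
typeof(m.chr)

# data.frame() keeps each column's own type and carries the row names over
df <- data.frame(m, label = c("low", "high"))
sapply(df, class)
rownames(df)
```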


class(state.region) # Factor of regions for the 50 states
## [1] "factor"

head(state.region)
## [1] South West  West  South West  West
## Levels: Northeast South North Central West

class(state.division) # Factor of divisions for the 50 states
## [1] "factor"

head(state.division)
## [1] East South Central Pacific            Mountain           West South Central
## [5] Pacific            Mountain
## 9 Levels: New England Middle Atlantic South Atlantic ... Pacific

levels(state.division)
## [1] "New England"        "Middle Atlantic"    "South Atlantic"
## [4] "East South Central" "West South Central" "East North Central"
## [7] "West North Central" "Mountain"           "Pacific"

is.object(state.division)
## [1] TRUE

typeof(state.division)
## [1] "integer"

# Combine these into a data frame with 50 rows and 10 columns
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
class(state.df)
## [1] "data.frame"

head(state.df) # Note that the first 8 column names carried over from state.x77 (with spaces replaced by dots, e.g. Life.Exp)

##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
##            Region Division
## Alabama    South  East South Central
## Alaska     West   Pacific
## Arizona    West   Mountain
## Arkansas   South  West South Central
## California West   Pacific
## Colorado   West   Mountain

data.frame(unname(state.x77), Region=state.region, Division=state.division)
##       X1   X2  X3    X4   X5   X6  X7     X8 Region        Division
##  1  3615 3624 2.1 69.05 15.1 41.3  20  50708 South         East South Central
##  2   365 6315 1.5 69.31 11.3 66.7 152 566432 West          Pacific
##  3  2212 4530 1.8 70.55  7.8 58.1  15 113417 West          Mountain
##  4  2110 3378 1.9 70.66 10.1 39.9  65  51945 South         West South Central
##  5 21198 5114 1.1 71.71 10.3 62.6  20 156361 West          Pacific
##  6  2541 4884 0.7 72.06  6.8 63.9 166 103766 West          Mountain
##  7  3100 5348 1.1 72.48  3.1 56.0 139   4862 Northeast     New England
##  8   579 4809 0.9 70.06  6.2 54.6 103   1982 South         South Atlantic
##  9  8277 4815 1.3 70.66 10.7 52.6  11  54090 South         South Atlantic
## 10  4931 4091 2.0 68.54 13.9 40.6  60  58073 South         South Atlantic
## 11   868 4963 1.9 73.60  6.2 61.9   0   6425 West          Pacific
## 12   813 4119 0.6 71.87  5.3 59.5 126  82677 West          Mountain
## 13 11197 5107 0.9 70.14 10.3 52.6 127  55748 North Central East North Central
## 14  5313 4458 0.7 70.88  7.1 52.9 122  36097 North Central East North Central
## 15  2861 4628 0.5 72.56  2.3 59.0 140  55941 North Central West North Central
## 16  2280 4669 0.6 72.58  4.5 59.9 114  81787 North Central West North Central
## 17  3387 3712 1.6 70.10 10.6 38.5  95  39650 South         East South Central
## 18  3806 3545 2.8 68.76 13.2 42.2  12  44930 South         West South Central
## 19  1058 3694 0.7 70.39  2.7 54.7 161  30920 Northeast     New England
## 20  4122 5299 0.9 70.22  8.5 52.3 101   9891 South         South Atlantic
## 21  5814 4755 1.1 71.83  3.3 58.5 103   7826 Northeast     New England
## 22  9111 4751 0.9 70.63 11.1 52.8 125  56817 North Central East North Central
## 23  3921 4675 0.6 72.96  2.3 57.6 160  79289 North Central West North Central
## 24  2341 3098 2.4 68.09 12.5 41.0  50  47296 South         East South Central
## 25  4767 4254 0.8 70.69  9.3 48.8 108  68995 North Central West North Central
## 26   746 4347 0.6 70.56  5.0 59.2 155 145587 West          Mountain
## 27  1544 4508 0.6 72.60  2.9 59.3 139  76483 North Central West North Central
## 28   590 5149 0.5 69.03 11.5 65.2 188 109889 West          Mountain
## 29   812 4281 0.7 71.23  3.3 57.6 174   9027 Northeast     New England
## 30  7333 5237 1.1 70.93  5.2 52.5 115   7521 Northeast     Middle Atlantic
## 31  1144 3601 2.2 70.32  9.7 55.2 120 121412 West          Mountain
## 32 18076 4903 1.4 70.55 10.9 52.7  82  47831 Northeast     Middle Atlantic
## 33  5441 3875 1.8 69.21 11.1 38.5  80  48798 South         South Atlantic
## 34   637 5087 0.8 72.78  1.4 50.3 186  69273 North Central West North Central
## 35 10735 4561 0.8 70.82  7.4 53.2 124  40975 North Central East North Central
## 36  2715 3983 1.1 71.42  6.4 51.6  82  68782 South         West South Central
## 37  2284 4660 0.6 72.13  4.2 60.0  44  96184 West          Pacific
## 38 11860 4449 1.0 70.43  6.1 50.2 126  44966 Northeast     Middle Atlantic
## 39   931 4558 1.3 71.90  2.4 46.4 127   1049 Northeast     New England
## 40  2816 3635 2.3 67.96 11.6 37.8  65  30225 South         South Atlantic
## 41   681 4167 0.5 72.08  1.7 53.3 172  75955 North Central West North Central
## 42  4173 3821 1.7 70.11 11.0 41.8  70  41328 South         East South Central
## 43 12237 4188 2.2 70.90 12.2 47.4  35 262134 South         West South Central
## 44  1203 4022 0.6 72.90  4.5 67.3 137  82096 West          Mountain
## 45   472 3907 0.6 71.64  5.5 57.1 168   9267 Northeast     New England
## 46  4981 4701 1.4 70.08  9.5 47.8  85  39780 South         South Atlantic
## 47  3559 4864 0.6 71.72  4.3 63.5  32  66570 West          Pacific
## 48  1799 3617 1.4 69.48  6.7 41.6 100  24070 South         South Atlantic
## 49  4589 4468 0.7 72.48  3.0 54.5 149  54464 North Central East North Central
## 50   376 4566 0.6 70.29  6.9 62.9 173  97203 West          Mountain

data.frame() is combining a pre-existing matrix (state.x77) and two vectors of qualitative categorical variables (called factors; state.region, state.division)

Column names are preserved or guessed if not explicitly set

colnames(state.df)
##  [1] "Population" "Income"     "Illiteracy" "Life.Exp"   "Murder"
##  [6] "HS.Grad"    "Frost"      "Area"       "Region"     "Division"

state.df[1,]
##         Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area Region
## Alabama       3615   3624        2.1    69.05   15.1    41.3    20 50708  South
##                   Division
## Alabama East South Central

Data frame access

By row and column index

state.df[49,3]
## [1] 0.7

By row and column names

state.df["Wisconsin","Illiteracy"]
## [1] 0.7
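The two access styles above can be cross-checked with a short self-contained sketch. It rebuilds state.df from the built-in datasets, looks up the numeric positions of a row and column name (the helper variables i and j are illustrative), and also shows how data.frame() adjusts the matrix column names through its default check.names = TRUE.

```r
# Rebuild the data frame from the built-in state datasets
state.df <- data.frame(state.x77, Region = state.region,
                       Division = state.division)

# data.frame() makes column names syntactically valid by default
# (check.names = TRUE), so "Life Exp" becomes "Life.Exp"
colnames(state.x77)[4]
colnames(state.df)[4]

# Look up positions by name, then confirm both access styles agree
i <- which(rownames(state.df) == "Wisconsin")
j <- which(colnames(state.df) == "Illiteracy")
c(i, j)
state.df[i, j] == state.df["Wisconsin", "Illiteracy"]
```

Name-based access is usually the safer choice in scripts, since it keeps working if rows or columns are reordered.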


rownames(state.df)
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia"
## [49] "Wisconsin"      "Wyoming"

state.df["Wisconsin",3]
## [1] 0.7

class(state.df["Wisconsin",3])
## [1] "numeric"

Data frame access (cont’d.)

All of a row:

state.df["Wisconsin",]
##           Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Wisconsin       4589   4468        0.7    72.48      3    54.5   149 54464
##           Region        Division
## Wisconsin North Central East North Central

class(state.df["Wisconsin",])
## [1] "data.frame"

Exercise: what class is state.df["Wisconsin",]? Answer: data.frame

Data frame access (cont’d.)

All of a column:

head(state.df[,3])
## [1] 2.1 1.5 1.8 1.9 1.1 0.7

head(state.df[,"Illiteracy"])
## [1] 2.1 1.5 1.8 1.9 1.1 0.7

head(state.df$Illiteracy)
## [1] 2.1 1.5 1.8 1.9 1.1 0.7

Data frame access (cont’d.)

Rows matching a condition:

state.df[state.df$Division=="New England", "Illiteracy"]
## [1] 1.1 0.7 1.1 0.7 1.3 0.6

state.df[state.df$Region=="South", "Illiteracy"]
##  [1] 2.1 1.9 0.9 1.3 2.0 1.6 2.8 0.9 2.4 1.8 1.1 2.3 1.7 2.2 1.4 1.4

Adding columns to a data frame

To add columns: we can either use data.frame(), or directly define a new named column

# First way: use data.frame() to concatenate on a new column
state.df = data.frame(state.df, Cool=sample(c(TRUE,FALSE), nrow(state.df), replace=TRUE))
head(state.df, 4)
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
##            Region Division           Cool
## Alabama    South  East South Central FALSE
## Alaska     West   Pacific            TRUE
## Arizona    West   Mountain           FALSE
## Arkansas   South  West South Central FALSE
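The data.frame() approach also works for a column computed from existing ones rather than random values. The sketch below is an illustration, not part of the original notes: the column name Density and the object names with.density and same are hypothetical, and cbind() is shown as a base R call that should give the same result here.

```r
# Start from a fresh copy built from the built-in datasets
state.df <- data.frame(state.x77, Region = state.region)

# Derived column (hypothetical name Density): Population divided by Area
with.density <- data.frame(state.df,
                           Density = state.df$Population / state.df$Area)
ncol(with.density) - ncol(state.df)   # one column was added

# cbind() on a data frame appends the column in the same way
same <- cbind(state.df, Density = state.df$Population / state.df$Area)
all.equal(with.density, same)
```

Either call returns a new data frame, so the result has to be assigned back (as in state.df = data.frame(state.df, ...)) for the change to persist.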

# Second way: just directly define a new named column
state.df$Score = sample(1:100, nrow(state.df), replace=TRUE)
head(state.df, 4)
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
##            Region Division           Cool  Score
## Alabama    South  East South Central FALSE    42
## Alaska     West   Pacific            TRUE     49
## Arizona    West   Mountain           FALSE    84
## Arkansas   South  West South Central FALSE    75

ncol(state.df)
## [1] 12

# Assigning to the column index one past the last column also adds a column (auto-named V13)
state.df[,13] <- NA
head(state.df)
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
##            Region Division           Cool  Score V13
## Alabama    South  East South Central FALSE    42  NA
## Alaska     West   Pacific            TRUE     49  NA
## Arizona    West   Mountain           FALSE    84  NA
## Arkansas   South  West South Central FALSE    75  NA
## California West   Pacific            FALSE    89  NA
## Colorado   West   Mountain           TRUE      4  NA

# Drop the placeholder column again
state.df <- state.df[,-13]

Deleting columns from a data frame

To delete columns: we can either use negative integer indexing, or set a column to NULL

# First way: use negative integer indexing (here, dropping the last column, Score)
state.df = state.df[,-ncol(state.df)]
head(state.df, 4)


##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
##            Region Division           Cool
## Alabama    South  East South Central FALSE
## Alaska     West   Pacific            TRUE
## Arizona    West   Mountain           FALSE
## Arkansas   South  West South Central FALSE

# Second way: just directly set a column to NULL
state.df$Cool = NULL
head(state.df, 4)
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
##            Region Division
## Alabama    South  East South Central
## Alaska     West   Pacific
## Arizona    West   Mountain
## Arkansas   South  West South Central

Reminder: Boolean indexing

With matrices or data frames, we’ll often want to access a subset of the rows corresponding to some condition. You already know how to do this, with Boolean indexing

# Compare the averages of the Frost column between states in New England and
# Pacific divisions
mean(state.df[(state.df$Division == "New England"), "Frost"])
## [1] 145.3333

mean(state.df[(state.df$Division == "Pacific"), "Frost"])
## [1] 49.6

What is the average of Frost for the division that contains Texas?

# Which division contains Texas?
tex_div <- state.df["Texas", "Division"]
state.df["Texas",]

##       Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area Region
## Texas      12237   4188        2.2     70.9   12.2    47.4    35 262134  South
##       Division
## Texas West South Central

mean(state.df[(state.df$Division == "West South Central"), "Frost"])
## [1] 48.5

mean(state.df[(state.df$Division == tex_div), "Frost"])
## [1] 48.5

subset()

The subset() function provides a convenient alternative way of accessing rows for data frames

# Using subset(), we can just use the column names directly (i.e., no need for
# using $)
state.df.ne.1 = subset(state.df, Division == "New England")

# Get same thing by extracting the appropriate rows manually
state.df.ne.2 = state.df[state.df$Division == "New England", ]
all(state.df.ne.1 == state.df.ne.2)
## [1] TRUE

# Same calculation as in the last slide, using subset()
mean(subset(state.df, Division == "New England")$Frost)
## [1] 145.3333

mean(subset(state.df, Division == "Pacific")$Frost) # Wimps
## [1] 49.6

# select = Frost keeps only that column; drop = TRUE returns it as a vector
mean(subset(state.df, Division == "New England", Frost, drop = TRUE))
## [1] 145.3333

Replacing values

Parts or all of the data frame can be assigned to:

summary(state.df$HS.Grad)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   37.80   48.05   53.25   53.11   59.15   67.30

state.df$HS.Grad <- state.df$HS.Grad/100
summary(state.df$HS.Grad)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.3780  0.4805  0.5325  0.5311  0.5915  0.6730

state.df$HS.Grad <- 100*state.df$HS.Grad
# state.df$HS.Grad[1:5] <- NA

with()

The with() function provides a way of expressing operations by column names only.

What percentage of literate adults graduated high school?

head(100*(state.df$HS.Grad/(100-state.df$Illiteracy)))
## [1] 42.18590 67.71574 59.16497 40.67278 63.29626 64.35045

with() takes a data frame and evaluates an expression “inside” it:

with(state.df, head(100*(HS.Grad/(100-Illiteracy))))
## [1] 42.18590 67.71574 59.16497 40.67278 63.29626 64.35045

(so you don’t have to type state.df$xyz)

Data arguments

Lots of functions take data arguments, and look variables up in that data frame:

plot(Illiteracy~Frost, data=state.df)


[Figure: scatterplot of Illiteracy (y-axis) against Frost (x-axis)]

Summary

• Data frames are a representation of the “classic” data table in R: rows are observations/cases, columns are variables/features
• Each column can be a different data type (but must be the same length)
• subset(): function for extracting rows of a data frame meeting a condition
• with(): function for operating on data frame columns without indexing
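To tie the summary together, the following sketch computes one quantity three ways: with Boolean indexing, with subset(), and with with(). It is self-contained and uses only base R; the toy data frame at the end (names toy, x, y are made up) illustrates one practical difference, namely how the two row-selection styles treat a condition that evaluates to NA.

```r
state.df <- data.frame(state.x77, Region = state.region,
                       Division = state.division)

# Three equivalent ways to average Frost over the Pacific division
a  <- mean(state.df[state.df$Division == "Pacific", "Frost"])
b  <- mean(subset(state.df, Division == "Pacific")$Frost)
c3 <- with(state.df, mean(Frost[Division == "Pacific"]))
c(a, b, c3)

# subset() drops rows where the condition is NA, while logical indexing
# with [ ] returns an all-NA row for them instead
toy <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))
nrow(subset(toy, x > 1))
nrow(toy[toy$x > 1, ])
```

When a column may contain missing values, subset() or an explicit which() around the condition avoids the extra NA rows.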
