Subscripting with vectors and matrices
When working with a vector of data (a one-dimensional array), you select a single element by placing its index in square brackets after the variable name. Remember that indices start at ONE in R, not zero.
Input:
    x <- seq(1, 10, by=0.5)
    x[2]
Output:
    1.5
To select more than one element from the vector, we place a vector of indices, built with c(), within the square brackets.
Input:
    x[c(4,7)]
Output:
    2.5 4
We can also select data using boolean (logical, TRUE/FALSE) indexing. In the example below, we create a vector of numbers y, create a logical vector l that records whether each element of y is greater than three, then select all of the elements of y for which l is TRUE.
Input:
    y <- c(3,7,5,2,8)
    l <- y > 3
    l
    y[l]
Output:
    FALSE TRUE TRUE FALSE TRUE
    7 5 8
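The intermediate logical vector is not strictly needed. As a minimal sketch using the same y as above, the condition can be written directly inside the brackets, and conditions can be combined element-wise with & (and) or | (or). The names big and mid are only illustrative.

```r
y <- c(3, 7, 5, 2, 8)

# Index directly with the condition instead of storing it first
big <- y[y > 3]

# Combine two conditions element-wise with &
mid <- y[y > 2 & y < 8]

print(big)  # 7 5 8
print(mid)  # 3 7 5
```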
Negative indexing can be used to select all of the elements of y not included in the subscript. This is done by placing a minus sign in front of the index vector.
Input:
    y[-c(2,4)]
Output:
    3 5 8
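Two details of negative indexing are easy to miss, so here is a short sketch with the same y. Subscripting returns a new vector and leaves y itself untouched, and positive and negative indices cannot be mixed in one subscript. The variable names and the tryCatch() wrapper are illustrative additions, used here only so the error does not stop the script.

```r
y <- c(3, 7, 5, 2, 8)

dropped <- y[-c(2, 4)]   # drops elements 2 and 4
first_removed <- y[-1]   # drops only the first element

# y is unchanged by either subscript
n_original <- length(y)

# Mixing positive and negative indices raises an error
mixed <- tryCatch(y[c(-1, 2)], error = function(e) "error")

print(dropped)        # 3 5 8
print(first_removed)  # 7 5 2 8
print(n_original)     # 5
print(mixed)          # "error"
```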
Subscripting of matrices is similar to the subscripting of vectors, except that you are now working in two dimensions instead of one. Below we create a simple matrix with three rows and four columns.
Input:
    z <- matrix(c(1,2,5,3,2,4,6,4,5,3,6,1), nrow=3, ncol=4, byrow=TRUE)
    print(z)
Output:
         [,1] [,2] [,3] [,4]
    [1,]    1    2    5    3
    [2,]    2    4    6    4
    [3,]    5    3    6    1
Subscripting a matrix is still done using square brackets, except that we use a comma to separate the row value(s) on the left from the column value(s) on the right: [rows,columns]. Leaving the row or column position empty returns all rows or all columns, as appropriate. Conditional and negative subscripting work the same for matrices as for vectors.
Input:
    z[2,]                  # all of row 2
    z[,4]                  # all of column 4
    z[2,4]                 # the element in row 2, column 4
    z[c(1,2),]             # rows 1 and 2
    z[c(TRUE,FALSE,TRUE),] # rows 1 and 3, chosen with a logical vector
Output:
    2 4 6 4

    3 4 1

    4

    1 2 5 3
    2 4 6 4

    1 2 5 3
    5 3 6 1
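The examples above do not show conditional or negative subscripting on a matrix, so the following sketch fills that in using the same z. The drop = FALSE argument is an extra not mentioned in the text: by default R simplifies a single selected row or column to a plain vector, and drop = FALSE keeps the result as a matrix. The variable names are illustrative.

```r
z <- matrix(c(1,2,5,3,2,4,6,4,5,3,6,1), nrow = 3, ncol = 4, byrow = TRUE)

# Negative subscripting: every row except row 2
no_row2 <- z[-2, ]

# Conditional subscripting: rows whose first column is greater than 1
keep <- z[z[, 1] > 1, ]

# Keep a single row as a 1 x 4 matrix instead of a vector
row2 <- z[2, , drop = FALSE]

print(no_row2)    # rows 1 and 3 of z
print(keep)       # rows 2 and 3 of z
print(dim(row2))  # 1 4
```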
Subsetting with dataframes
Data frames are a convenient way to handle two-dimensional, table-like data in which the columns hold different types of data (numeric, character, logical etc.). Let's start by converting the matrix z into a data frame using the as.data.frame() function.
Input:
    data <- as.data.frame(z)
    data
Output:
      V1 V2 V3 V4
    1  1  2  5  3
    2  2  4  6  4
    3  5  3  6  1
To select a single vector of data from a data frame we can use the column name in the format dataFrameName$columnName.
Input:
    dataOne <- data$V1
    dataOne
Output:
    1 2 5
If you are going to be performing a lot of manipulations on a data frame, the dataFrameName$columnName format can become cumbersome. You can use the attach() function to add the data frame to R's search path and then refer to columns by name alone, without the dataFrameName$ prefix. To remove the data frame from the search path we use the detach() function. R will let you attach more than one data frame at a time, but columns that share a name then mask one another, so remember to use detach() when changing data frames!
Input:
    attach(data)
    dataOne <- V1
    dataOne
    detach(data)
Output:
    1 2 5
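If you would rather not have to remember the detach() call, base R's with() function is a commonly used alternative. It is not the approach described above, just a related option: it evaluates a single expression with the columns of the data frame in scope and leaves the search path unchanged. A minimal sketch, rebuilding the same data frame so it runs on its own (the name total is illustrative):

```r
z <- matrix(c(1,2,5,3,2,4,6,4,5,3,6,1), nrow = 3, ncol = 4, byrow = TRUE)
data <- as.data.frame(z)

# Columns V1 and V2 are visible only inside this one expression
total <- with(data, V1 + V2)

print(total)  # 3 6 8
```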
Multiple columns can be subset from a data frame by using the column names or indices in the same way as for vectors.
Input:
    dataOne <- data[c("V2","V3")]
    dataOne
Output:
      V2 V3
    1  2  5
    2  4  6
    3  3  6
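The example above selects columns by name. A short sketch of the index-based equivalent, assuming the same data frame, shows that positive indices pick columns by position and negative indices drop them, just as for vectors. The variable names are illustrative.

```r
z <- matrix(c(1,2,5,3,2,4,6,4,5,3,6,1), nrow = 3, ncol = 4, byrow = TRUE)
data <- as.data.frame(z)

by_name  <- data[c("V2", "V3")]
by_index <- data[c(2, 3)]   # columns 2 and 3, the same selection by position
same <- identical(by_name, by_index)

# Negative index: every column except the first
without_first <- data[-1]

print(same)                  # TRUE
print(names(without_first))  # "V2" "V3" "V4"
```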
Rows can also be subset in a data frame using the row indices.
Input:
    dataTwo <- data[1:2,]
    dataTwo
Output:
      V1 V2 V3 V4
    1  1  2  5  3
    2  2  4  6  4
When using a data frame, we can also conditionally subset the rows that we wish to extract using the subset() function. The function is usually called with at least two arguments: the data frame and the condition by which to select the rows to include. A third argument, select=, can be added to extract only certain columns for the rows that meet the prescribed condition.
Input:
    dataTwo <- subset(data, V1 >= 2)
    dataTwo
    dataThree <- subset(data, V1 >= 2, select=c(V2,V3))
    dataThree
Output:
      V1 V2 V3 V4
    2  2  4  6  4
    3  5  3  6  1

      V2 V3
    2  4  6
    3  3  6
The subset() function is usually the easiest way to select observations in a data frame. However, subset() evaluates its condition in a non-standard way, and R's own documentation describes it as a convenience function intended for interactive use; inside your own functions it can throw an error or give unexpected results. In these instances it is good to have an alternate way to subset your data frame. The example below uses the which() function to find the indices of the rows in the data frame that meet the specified conditions. These indices are then used to subscript the rows in the same way as for matrices.
Input:
    dataFour <- data[which(data$V1 == 2 | data$V3 == 6),]
    dataFour
Output:
      V1 V2 V3 V4
    2  2  4  6  4
    3  5  3  6  1
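One practical reason to wrap the condition in which() rather than passing the logical vector directly is missing data. The sketch below uses a small hypothetical data frame, df, that is not part of the example above and contains an NA. With a bare logical index the NA produces an extra row filled with NA, whereas which() returns only the positions where the condition is TRUE.

```r
# Hypothetical data with a missing value in V1
df <- data.frame(V1 = c(1, NA, 5), V3 = c(5, 6, 6))

cond <- df$V1 >= 2             # FALSE NA TRUE

by_logical <- df[cond, ]       # the NA in cond yields an all-NA row
by_which   <- df[which(cond), ] # which() drops the NA position

print(nrow(by_logical))  # 2
print(nrow(by_which))    # 1
```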