Merge and append data

rbind, cbind and append with vectors

Vectors can be added to matrices/arrays and dataframes using functions such as rbind() or cbind(), which add either extra rows or columns respectively. Let’s start by creating a matrix then we will a new row then a new column.

The c() function is a concatenation operator that combines single elements into a single dimension vector, it is used all of the time.


Input:
A<-matrix(c(1,2,3,4),nrow =2, ncol =2)
A
Output:
1	3
2	4

Input:
# Use rbind to add a row
B<-c(5,6)
C<-rbind(A,B)
C

Output:
	1	3
        2	4
B	5	6

Input:
# Use cbind to add a column instead
D<-c(5,6)
E<-cbind(A,D)
E

Output:
		D
1	3	5
2	4	6

To add observations to a single vector, we can use the append() function.


Input:
a<-c(1,2,3,4)
b<-append(a,c(5,6))
b

Output:
1 2 3 4 5 6

A feature of append() is that you can specify where the observations are to be added. For example:


Input:
a<-c(1,2,3,4)
b<-append(a,c(5,6), after =2)
b

Output:
1 2 5 6 3 4

merge, rbind and cbind with dataframes

When combining dataframes we can use the merge() function

First we create some data. Note the use of the signif() function in the generation of age. Round() or as.Integer() are other possibilities. signif rounds a number to the set number of significant figures, round rounds a number to a certain number of decimal places and as.integer rounds a number to the nearest whole number. The following code generates 100 random patients, with their ages, whether or not they received intervention or control treatment in a clinical trial, and which one of five centres they were allocated to.


Input:
clinical_trial1 <-
    data.frame(patient = 1:100,
               age = signif(rnorm(100, mean = 60, sd = 6),
               digits=2),treatment = gl(2, 50,
               labels = c("Treatment", "Control")),
               centre = sample(paste("Centre", LETTERS[1:5]),
                  100,replace = TRUE))

Now we create a second set of data to be merged with the first. This adds columns for Body Mass Index(BMI) and gender. For ease, this has the same number of rows, and we create a duplicate identifier column over which the values will be matched, in this case, the column 'patient'.

Note the use of the rep function to repeat the assignation of the gender factor labels over the second set of 50 observations.


Input:
clinical_trial2 <-
    data.frame(patient = 1:100,
               BMI = rpois(100, lambda=25),
               gender = rep(gl(2, 25,
               labels = c("Male", "Female"),ordered=FALSE),
                  times=2))

Now we merge the data, and we specify which column in each dataframe is the one that contains the row identifier by which entries should be matched.


Input:
clinical_trial3<-merge(clinical_trial1, clinical_trial2,
                       by.x="patient", by.y="patient",
                       all=TRUE) 
clinical_trial3[1:10,]

Output:
patient	age	treatment	centre	BMI	gender
1	60	Treatment	Centre C	21	Male
2	58	Treatment	Centre E	24	Male
3	62	Treatment	Centre E	34	Male
4	60	Treatment	Centre B	32	Male
5	55	Treatment	Centre E	25	Male
6	63	Treatment	Centre A	26	Male
7	60	Treatment	Centre C	29	Male
8	58	Treatment	Centre D	22	Male
9	74	Treatment	Centre B	34	Male
10	55	Treatment	Centre B	38	Male
....

We now want to add another column of data to say which GP each patient has.
Each GP is represented as a numeric code from one to five.


Input:
GP<-rep(1:5,20) #i.e 100 observations.

Here we use the cbind() function to add the vector of GP codes to the dataframe. The order of the arguments determine where the new column is added.


Input:
clinical_trial4<-cbind(clinical_trial3,GP)
clinical_trial4

Output:
patient	age	treatment	centre	        BMI     gender	GP
1	60	Treatment	Centre C	21	Male	1
2	58	Treatment	Centre E	24	Male	2
3	62	Treatment	Centre E	34	Male	3
4	60	Treatment	Centre B	32	Male	4
5	55	Treatment	Centre E	25	Male	5
6	63	Treatment	Centre A	26	Male	1
7	60	Treatment	Centre C	29	Male	2
8	58	Treatment	Centre D	22	Male	3
9	74	Treatment	Centre B	34	Male	4
10	55	Treatment	Centre B	38	Male	5

We can use the rbind() function in the same way. In both cases, the added data must have the correct number of entries for the data it is being joined to, i.e. the correct number of rows for cbind(), or the correct number of columns in the case of rbind().


Input:
row<-c(101,57,"Treatment","Centre D", 27,
      "Male", 4) #one row of 7 columns.
clinical_trial5<-rbind(clinical_trial4,row)
clinical_trial5[91:101,]

Output:
	patient	age   treatment	centre	       BMI	gender	GP
91	91	55	Control	Centre B	22	Female	1
92	92	70	Control	Centre B	27	Female	2
93	93	61	Control	Centre A	28	Female	3
94	94	68	Control	Centre B	30	Female	4
95	95	54	Control	Centre A	21	Female	5
96	96	60	Control	Centre B	25	Female	1
97	97	64	Control	Centre E	25	Female	2
98	98	69	Control	Centre D	27	Female	3
99	99	61	Control	Centre D	29	Female	4
100	100	62	Control	Centre E	31	Female	5
101	101	57    Treatment	Centre D	27	Male	4