A common task in R is assessing correlations and performing linear regressions on data. In the example below we create a dataframe of distance and time data with a forced high level of association between the variables. To assess the correlation between the two variables we use the
cor() function. Its first argument is a matrix or dataframe containing the data whose correlations we want. The use argument controls the treatment of missing data; in this instance we want to use all of the data, so we enter
use="all.obs". The method argument then selects the correlation method, in this case Pearson correlation, but Spearman's rank or Kendall's rank correlations are also available.
We can see from the correlation results that the data are highly correlated, as we would expect. To then fit a linear regression model we use the function
lm(). Its first argument is a formula describing the relationship between the variables, in this case distance as a function of time (distance ~ time), and the data argument specifies the dataframe where these variables exist.
The lm() function produces various pieces of information about the regression model it generates and stores them as a named list. The most important values are normally the coefficients. We can access these using
a$coefficients, and we can see that the intercept value is -10.85… and the gradient (the time coefficient) is 0.60…. This means our linear equation is y = 0.60*x - 10.85.
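The coefficients can also be used directly to compute a fitted value by hand. A minimal sketch on simulated data (the names fit, b and x0 are ours, not from the example above):

```r
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)

fit <- lm(y ~ x)
b <- fit$coefficients        # named vector: (Intercept) and x

# y = gradient * x + intercept, evaluated at x0 = 10
x0 <- 10
manual <- b["x"] * x0 + b["(Intercept)"]

# This agrees with predict() at the same point
predict(fit, newdata = data.frame(x = x0))
```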
In the example we plot the time and distance data as a scatter plot and use the
abline() function to add the regression line, passing the model variable a as the input.
Input:
distance <- sort(runif(100,1,70))
time <- sort(runif(100,5,120))
data <- cbind(distance,time)
data <- as.data.frame(data)
cor(data,use="all.obs",method="pearson")

Output:
         distance      time
distance 1.0000000 0.9743785
time     0.9743785 1.0000000

Input:
a <- lm(distance ~ time, data=data)
summary(a)

Output:
(Intercept) -10.85511    1.12056  -9.687 5.78e-16 ***
time          0.60352    0.01407  42.887  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.453 on 98 degrees of freedom
Multiple R-squared: 0.9494, Adjusted R-squared: 0.9489
F-statistic: 1839 on 1 and 98 DF, p-value: < 2.2e-16

Input:
b <- a$coefficients
b

Output:
(Intercept)             time
-10.8551065123979       0.603524548466263

Input:
plot(data$time,data$distance)
abline(a)

Output: (scatter plot of distance against time with the fitted regression line)
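The figures printed by summary() can also be pulled out programmatically rather than read off the console. A sketch on simulated data resembling the example (the variable names fit and s are ours):

```r
set.seed(42)
time <- sort(runif(100, 5, 120))
distance <- 0.6 * time + rnorm(100, sd = 4)
fit <- lm(distance ~ time)

s <- summary(fit)
s$r.squared                          # Multiple R-squared
s$adj.r.squared                      # Adjusted R-squared
s$coefficients                       # matrix: Estimate, Std. Error, t value, Pr(>|t|)
s$coefficients["time", "Estimate"]   # the gradient on its own
```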
We can also plot the 95% confidence intervals for the regression line. To start with we need the minimum and maximum values of the x-axis variable, which in this instance is time. We then create a vector of values using the
seq() function, running from the minimum to the maximum time value in steps of 0.05. Next we use the predict() function to compute the confidence intervals at the 0.95 level. The final step is to create the scatter plot, add the regression line and then use the
lines() function to add the upper
([,3]) and lower
([,2]) confidence bounds.
Input:
c <- summary(data$time)
cMin <- min(data$time)
cMax <- max(data$time)
d <- seq(cMin,cMax,by=0.05)
confInt <- predict(a, newdata=data.frame(time=d), interval="confidence", level=0.95)
plot(data$time,data$distance, main="Distance over time", xlab="Time", ylab="Distance", pch=20)
abline(a)
lines(d,confInt[,2], col="blue", lty=2)
lines(d,confInt[,3], col="blue", lty=2)

Output: (scatter plot with the regression line and dashed blue 95% confidence bands)
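A related option of predict(), shown here only as a sketch on simulated data (the names fit, d and pred are ours), is interval="prediction", which gives the wider band for individual new observations rather than for the fitted mean:

```r
set.seed(7)
time <- sort(runif(100, 5, 120))
distance <- 0.6 * time + rnorm(100, sd = 4)
fit <- lm(distance ~ time)

d <- seq(min(time), max(time), by = 0.05)
ci   <- predict(fit, newdata = data.frame(time = d), interval = "confidence", level = 0.95)
pred <- predict(fit, newdata = data.frame(time = d), interval = "prediction", level = 0.95)

# The prediction band is always at least as wide as the confidence band
head(ci[, "upr"] - ci[, "lwr"])
head(pred[, "upr"] - pred[, "lwr"])
```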