A common task in R is assessing correlations and performing linear regressions on data. In the example below we create a dataframe of distance and time data with a forced high level of association between the variables. To assess the correlation between the two variables we use the
cor() function. Its first argument is a matrix or dataframe containing the data whose correlations we want. The use argument controls the treatment of missing data; in this instance we want to use all of the data, so we enter
use="all.obs". The method argument then selects the correlation method, in this case Pearson correlation, but Spearman's rank or Kendall's rank correlations are also available.
We can see from the correlation results that the data are highly correlated, as we would expect. To then fit a linear regression model we use the function
lm(). Its first argument is a formula describing the relationship between the variables, in this case distance as a function of time (distance ~ time), and the data argument specifies the dataframe where these variables exist.
The lm() function produces various pieces of information about the regression model it generates and stores them as a named list. The most important values are normally the coefficients. We can access these using
a$coefficients, and we can see that the intercept value is -10.85… and the gradient (the time coefficient) is 0.60…. This means our linear equation is y = 0.60*x - 10.85.
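The coefficients can also be used directly to compute a fitted value by hand. A minimal sketch on simulated data (the names fit, b and x0 are ours, not from the example above):

```r
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)

fit <- lm(y ~ x)
b <- fit$coefficients        # named vector: (Intercept) and x

# y = gradient * x + intercept, evaluated at x0 = 10
x0 <- 10
manual <- b["x"] * x0 + b["(Intercept)"]

# This agrees with predict() at the same point
predict(fit, newdata = data.frame(x = x0))
```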
In the example we plot the time and distance data as a scatter plot and use the
abline() function to add the regression line, passing the model variable a as the input.
Input:
distance <- sort(runif(100,1,70))
time <- sort(runif(100,5,120))
data <- cbind(distance,time)
data <- as.data.frame(data)
cor(data,use="all.obs",method="pearson")

Output:
         distance      time
distance 1.0000000 0.9743785
time     0.9743785 1.0000000

Input:
a <- lm(distance ~ time, data=data)
summary(a)

Output:
(Intercept) -10.85511    1.12056  -9.687 5.78e-16 ***
time          0.60352    0.01407  42.887  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.453 on 98 degrees of freedom
Multiple R-squared: 0.9494, Adjusted R-squared: 0.9489
F-statistic: 1839 on 1 and 98 DF, p-value: < 2.2e-16

Input:
b <- a$coefficients
b

Output:
(Intercept)             time
-10.8551065123979       0.603524548466263

Input:
plot(data$time,data$distance)
abline(a)

Output: (scatter plot of distance against time with the fitted regression line)
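The figures printed by summary() can also be pulled out programmatically rather than read off the console. A sketch on simulated data resembling the example (the variable names fit and s are ours):

```r
set.seed(42)
time <- sort(runif(100, 5, 120))
distance <- 0.6 * time + rnorm(100, sd = 4)
fit <- lm(distance ~ time)

s <- summary(fit)
s$r.squared                          # Multiple R-squared
s$adj.r.squared                      # Adjusted R-squared
s$coefficients                       # matrix: Estimate, Std. Error, t value, Pr(>|t|)
s$coefficients["time", "Estimate"]   # the gradient on its own
```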
We can also plot the 95% confidence intervals for the regression line. To start with we need the minimum and maximum values of the x-axis variable, which in this instance is time. We then create a vector of values using the
seq() function, running from the minimum to the maximum time value in steps of 0.05. Next we use the predict() function to compute the confidence intervals at the 0.95 level. The final step is to create the scatter plot, add the regression line and then use the
lines() function to add the upper
([,3]) and lower
([,2]) confidence bounds.
Input:
c <- summary(data$time)
cMin <- min(data$time)
cMax <- max(data$time)
d <- seq(cMin,cMax,by=0.05)
confInt <- predict(a, newdata=data.frame(time=d), interval="confidence", level=0.95)
plot(data$time,data$distance, main="Distance over time", xlab="Time", ylab="Distance", pch=20)
abline(a)
lines(d,confInt[,2], col="blue", lty=2)
lines(d,confInt[,3], col="blue", lty=2)

Output: (scatter plot with the regression line and dashed blue 95% confidence bands)
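A related option of predict(), shown here only as a sketch on simulated data (the names fit, d and pred are ours), is interval="prediction", which gives the wider band for individual new observations rather than for the fitted mean:

```r
set.seed(7)
time <- sort(runif(100, 5, 120))
distance <- 0.6 * time + rnorm(100, sd = 4)
fit <- lm(distance ~ time)

d <- seq(min(time), max(time), by = 0.05)
ci   <- predict(fit, newdata = data.frame(time = d), interval = "confidence", level = 0.95)
pred <- predict(fit, newdata = data.frame(time = d), interval = "prediction", level = 0.95)

# The prediction band is always at least as wide as the confidence band
head(ci[, "upr"] - ci[, "lwr"])
head(pred[, "upr"] - pred[, "lwr"])
```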