Dates and times

Dates and times can be awkward to manage in data analysis because of the many different formats encountered and because of missing values. Other things to look out for are dates written in US format (month before day), and whether or not times have been adjusted for daylight saving. Let’s start with some simple date data.


Input:
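df_birth <- data.frame(ID = 1:10,
                       dob = c("12/04/1976", "16/06/1965", "23/11/1985",
                               "24/02/1973", "01/04/1946", "27/05/1983", "09/08/2001",
                               "30/03/1957", "14/07/2007", "19/12/1994"),
                       stringsAsFactors = TRUE)  # dob as a factor, as in the output below (the default before R 4.0)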
df_birth

Output:
   ID        dob
1   1 12/04/1976
2   2 16/06/1965
3   3 23/11/1985
4   4 24/02/1973
5   5 01/04/1946
6   6 27/05/1983
7   7 09/08/2001
8   8 30/03/1957
9   9 14/07/2007
10 10 19/12/1994

The following code, designed to generate the difference in days between two dates, does not work: dob is stored as a factor rather than as a date, so R returns NA with a warning.


Input:
attach(df_birth)
dob[1] - "23/3/45"

Output:
[1] NA
Warning message:
In Ops.factor(dob[1], "23/3/45") : ‘-’ not meaningful for factors

We need to change the formats! The as.Date() function converts text (or a factor) to an object of class Date, which R stores internally as the number of days since 1 January 1970. To do this, it needs to know what format the original dates are in. The notation used to describe the date format is a combination of the following:

  • %d day of the month
  • %m month as an integer (1, 2, etc.)
  • %b abbreviated month name (Jan, Feb, etc.)
  • %y year without a century (00 to 68 are read as 20xx, 69 to 99 as 19xx)
  • %Y year with a century (2001, 1997, etc.)

The common European date format with a four-digit year is written as "%d/%m/%Y" (use "%d/%m/%y" for a two-digit year). Notice that it is in quotation marks, meaning it is entered as a character string. The separator in the format must match the one in the data: here it is "/", but "-" or another character may be used instead, the forward slash and hyphen being the most common.
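As a small illustration of the format codes above, the sketch below parses one made-up date written three different ways and then shows how two-digit years are read. The values are invented for the example and are not part of df_birth.

```r
# The same date written three ways; each needs its own format string.
d1 <- as.Date("12/04/1976", format = "%d/%m/%Y")  # slash separator, day first
d2 <- as.Date("12-04-1976", format = "%d-%m-%Y")  # hyphen separator
d3 <- as.Date("04/12/1976", format = "%m/%d/%Y")  # US order: month first
d1 == d2 && d2 == d3

# Two-digit years: 00 to 68 become 20xx, 69 to 99 become 19xx.
two_digit <- as.Date(c("23/03/45", "23/03/76"), format = "%d/%m/%y")
two_digit

# A format that does not match the text gives NA rather than an error.
mismatch <- as.Date("12/04/1976", format = "%Y-%m-%d")
mismatch
```

Note that "23/03/45" is read as 2045, not 1945, so old dates are safer with four-digit years.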


Input:
dob_int <- as.Date(dob, "%d/%m/%Y")  # four-digit year
df_birth1 <- data.frame(df_birth, dob_int)  # original columns first, then the converted dates
df_birth1

Output:
   ID        dob    dob_int
1   1 12/04/1976 1976-04-12
2   2 16/06/1965 1965-06-16
3   3 23/11/1985 1985-11-23
4   4 24/02/1973 1973-02-24
5   5 01/04/1946 1946-04-01
6   6 27/05/1983 1983-05-27
7   7 09/08/2001 2001-08-09
8   8 30/03/1957 1957-03-30
9   9 14/07/2007 2007-07-14
10 10 19/12/1994 1994-12-19

We can see the data types in the data frame using the str() function.


Input:
str(df_birth1)

Output:
'data.frame':	10 obs. of  3 variables:
 $ ID     : int  1 2 3 4 5 6 7 8 9 10
 $ dob    : Factor w/ 10 levels "01/04/1946","09/08/2001",..: 3 5 7 8 1 9 2 10 4 6
 $ dob_int: Date, format: "1976-04-12" "1965-06-16" "1985-11-23" "1973-02-24" ...
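Once the dates are of class Date, the subtraction that failed earlier works. A minimal sketch, using two of the birth dates and a reference date written with a four-digit year (the names ref and gap are just for this example):

```r
dob_dates <- as.Date(c("12/04/1976", "16/06/1965"), format = "%d/%m/%Y")
ref <- as.Date("23/03/1945", format = "%d/%m/%Y")

dob_dates[1] - ref                  # a difftime object, printed in days
gap <- as.numeric(dob_dates - ref)  # plain numbers of days
gap

difftime(dob_dates[1], ref, units = "weeks")  # the same gap in weeks
```

Converting with as.numeric() is handy when the difference is going to be used in further arithmetic.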

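For more complex data, useful extra arguments include tryFormats, a vector of formats to try when no format is supplied, and optional, which makes as.Date() return NA instead of throwing an error when none of those formats fits. Note that as.Date() settles on one format for the whole vector, so a column that really mixes several formats still has to be converted piece by piece. Missing values (NA) simply stay NA.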
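The sketch below shows how these two arguments might be used. It assumes R 3.5.0 or later, and the input values are made up for the example.

```r
x <- c("12/04/1976", NA, "16/06/1965")

# No format argument: as.Date() tries each entry of tryFormats in turn
# on the first non-missing value and uses the first one that parses.
parsed <- as.Date(x, tryFormats = c("%Y-%m-%d", "%d/%m/%Y"))
parsed

# Text that matches none of the formats is an error by default ...
result <- tryCatch(as.Date("not a date"),
                   error = function(e) "error caught")
result

# ... but optional = TRUE returns NA instead.
fallback <- as.Date("not a date", optional = TRUE)
fallback
```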

Extracting date and time features

Sometimes we want to know which day of the week a date refers to, or we might want to work simply with a single component of a date (or time).


Input:
day<-weekdays(df_birth1$dob_int)
num_days<-julian(df_birth1$dob_int)
day
num_days

Output:
[1] "Monday"    "Wednesday" "Saturday"  "Saturday"  "Monday"    "Friday"    "Thursday"  "Saturday"  "Saturday"  "Monday"

[1]  2293 -1660  5805  1150 -8676  4894 11543 -4660 13708  9118
attr(,"origin")
[1] "1970-01-01"

Other possibilities include months() and quarters().
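A short sketch of pulling out single components, using two of the dates from above. Month and weekday names depend on the locale of the R session, so format() with a numeric code is the safer choice when the result feeds into further calculation.

```r
d <- as.Date(c("1976-04-12", "1985-11-23"))

months(d)    # month names, in the language of the current locale
qtr <- quarters(d)                      # "Q1" to "Q4"
yr  <- as.integer(format(d, "%Y"))      # year as a number
mon <- as.integer(format(d, "%m"))      # month as a number
dom <- as.integer(format(d, "%d"))      # day of the month
data.frame(d, qtr, yr, mon, dom)
```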

Including time

Times can be handled in a similar way to dates. Combined date and time data can be stored in a class called POSIXct, where the date/time variable is held as the number of seconds since the start of 1 January 1970 (UTC).


Input:
time<-as.POSIXct(c("12/09/1971 23:49:00","09/06/1970 04:07:57"),format="%d/%m/%Y %H:%M:%S")
time
time[2]-time[1]

Output:
[1] "1971-09-12 23:49:00 BST" "1970-06-09 04:07:57 BST"
Time difference of -460.8202 days

Date/time data can also be stored in a class called POSIXlt, which holds the date/time variable as a list of components (seconds, minutes, hours and so on). In the example below we compare POSIXct to POSIXlt.


Input:
x <- c("06-07-19, 5:12am", "06-07-20, 5:15am", "06-07-21, 5:18pm", "06-07-22, 5:22am", "06-07-23, 5:25am")
dct <- as.POSIXct(x, format = "%y-%m-%d, %I:%M%p")
dlt <- strptime(x, format = "%y-%m-%d, %I:%M%p")  # %I = 12-hour clock, %p = am/pm

dct           # printed as date/times, but stored as the number of seconds since 1/1/1970. To see this, type:
unclass(dct)
dlt           # prints in the same way as dct
unclass(dlt)  # shows the list of components that POSIXlt stores

Output:
[1] "2006-07-19 05:12:00 BST" "2006-07-20 05:15:00 BST"
[3] "2006-07-21 17:18:00 BST" "2006-07-22 05:22:00 BST"
[5] "2006-07-23 05:25:00 BST"

1153282320 1153368900 1153498680 1153542120 1153628700

[1] "2006-07-19 05:12:00 BST" "2006-07-20 05:15:00 BST"
[3] "2006-07-21 17:18:00 BST" "2006-07-22 05:22:00 BST"
[5] "2006-07-23 05:25:00 BST"
$sec
0 0 0 0 0
$min
12 15 18 22 25
$hour
5 5 17 5 5
$mday
19 20 21 22 23
$mon
6 6 6 6 6
$year
106 106 106 106 106
$wday
3 4 5 6 0
$yday
199 200 201 202 203
$isdst
1 1 1 1 1
$zone
'BST' 'BST' 'BST' 'BST' 'BST'
$gmtoff
    


Input:
dlt[4] - dlt[1]          # difference between two POSIXlt date/times
dlt[4]$min - dlt[1]$min  # difference between the minute components only
dct[4] - dct[1]          # the same difference using POSIXct

Output:
Time difference of 3.006944 days

10

Time difference of 3.006944 days
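When working with the POSIXlt components directly, it helps to remember that some of them are offsets rather than calendar values, as the $year and $mon values in the earlier output suggest. A minimal sketch, using one of the date/times from above and fixing the time zone to UTC so that the result does not depend on the machine:

```r
lt <- as.POSIXlt("2006-07-21 17:18:00", tz = "UTC")

lt$year + 1900  # year is stored as years since 1900
lt$mon + 1      # month is stored as 0 to 11
lt$mday         # day of the month, 1 to 31
lt$wday         # day of the week, 0 = Sunday
lt$yday + 1     # day of the year is stored as 0 to 365
lt$hour         # hour on the 24-hour clock
```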

More advanced options include time zones, which can be set with the tz argument of functions such as as.POSIXct(). R generally ignores leap seconds, i.e. the seconds occasionally added to clock time to keep it in step with the earth's rotation. This is not likely to be a problem for most of us!
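As a brief sketch of time zone handling, the example below creates one instant in UTC and displays it in two other zones. It assumes the zone names "Europe/London" and "America/New_York" are available on the system, which is normally the case.

```r
# One instant, created in UTC.
t_utc <- as.POSIXct("2006-07-19 05:12:00", tz = "UTC")

# The same instant as shown on clocks in two other zones.
london   <- format(t_utc, tz = "Europe/London", usetz = TRUE)
new_york <- format(t_utc, tz = "America/New_York", usetz = TRUE)
london
new_york

# The stored number of seconds does not change with the display zone.
as.numeric(t_utc)
```

Only the printed clock time changes between zones; differences between two POSIXct values are unaffected.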