Data.Table - everything you need to know to get you started in R

I will take you through step by step how to use the data.table package, and compare it with base R operations, to see the performance gains you get when using this optimised package.

Load in data.table

To load the package in you can follow the below instructions:

#install.packages(data.table)
library(data.table)

Reading in a data.table csv

To read files in data.table you use the fread syntax to bring files in. I will load the NHSRDatasets package and export this out and then I will use the data.table functionality to read it back in:

library(NHSRdatasets)
ae <- NHSRdatasets::ae_attendances
write.csv(ae, "ae_nhsr.csv", row.names = FALSE) #Set row names to false
#Use data.table to read in the document
ae_dt <- fread("ae_nhsr.csv")

Benchmarking the speed of data.table vs base R

Here, we will create a synthetic frame, using a random number generator, to show how quick the data.table package is compared to base R:

#Create a random uniform distribution
big_data <- data.frame(BigNumbers=runif(matrix(10000000, 10000000)))
write.csv(big_data, "bigdata.csv")

Start the benchmarking:

# Start to benchmark using system.time
# Read CSV with base
base_metrics <- system.time(
  read.csv("bigdata.csv")
)

dt_metrics <- system.time(
  data.table::fread("bigdata.csv")
)
print(base_metrics)
##    user  system elapsed 
##   56.69    1.33   58.95
print(dt_metrics)
##    user  system elapsed 
##    1.11    0.21    0.38
# # user  system elapsed 
#   25.78    0.42   26.74 
#    user  system elapsed 
#    1.09    0.07    0.33 

Compare on a ggplot side by side:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tibble)
library(ggplot2)


df <- data.frame(
  base_run_time = base_metrics[1], #Grab the elapsed time for the user
  data.table_run_time = dt_metrics[1] #Grab the elapsed time for the user
)
#Flip data.frame over to get the run times using transpose
df<- data.frame(t(df)) %>% 
  rownames_to_column() %>% 
  setNames(c("Type", "TimeRan"))

# Make the ggplot
library(ggplot2)

plot <- df %>% 
  ggplot(aes(x=Type, 
             y=TimeRan, 
             fill=as.factor(Type))) + geom_bar(stat="identity", 
                                               width = 0.6) +
         scale_fill_manual(values= c("#26ACB5", "#FAE920")) + theme_minimal() +
         theme(legend.position = "none") + coord_flip() +
         labs(title="Run time comparison data.table and ggpplot", 
              y="Run Time (seconds)", 
              x="Data.Table vs Base",
              caption="Produced by Gary Hutson") 

print(plot)

As you can see - data.table is lightening fast compared to base R and it is great for working with large datasets.

We detract, this section is just to highlight how useful the data.table package is for dealing with larger datasets.

Conversion between data.table and data.frame (base) objects

Time to time you may want to convert the data.table objects back to base R, to do this you can follow the below:

#Convert base data.frame to data.table
ae_dt <- as.data.table(ae)
class(ae_dt)
## [1] "data.table" "data.frame"
#Using the setDT command
ae_copy <- ae
data.table::setDT(ae_copy)
class(ae_copy)
## [1] "data.table" "data.frame"
# Converting this back to a data.frame
data.table::setDF(ae_copy)
class(ae_copy)
## [1] "data.frame"
# [1] "data.table" "data.frame"
# [1] "data.table" "data.frame"
# [1] "data.frame"

To expand on the above:

Filtering on a data.table object

The general rule to filtering is to use the below visual:

Markdown Monster icon

We will used our accident and emergency dataset (ae_dt) to work with some of the fields and commence filtering:

# Filter out hospital types and high attendances
ae_reduced <- ae_dt[type == 1 & attendances > 30000 & org_code != "R1H", ] #The comma indicates rows and no columns
print(ae_reduced)
##        period org_code type attendances breaches admissions
## 1: 2019-03-01      RRK    1       32017    10670      10850
## 2: 2019-01-01      RRK    1       31935    12502      11493
## 3: 2018-12-01      RRK    1       30979     8981      11401
## 4: 2018-11-01      RRK    1       30645     8120      11230
## 5: 2018-10-01      RRK    1       30570     7089      10770
## 6: 2018-07-01      RRK    1       32209     6499      11332

This selects my arrival type equal to 1, filters out accident and emergency attendances greater than (>) 30000 and organisation code is not equal (1=) to R1H, which relates to a specific NHS trust. The ampersand means filter on this field and(&) that field.

Indexing and selecting data.table objects

This will show you how to perform selections on columns:

Selecting given columns

The code snippet shows how to select only given columns, as an index:

#Select by index
ae_dt[,1] #Select the first column
##            period
##     1: 2017-03-01
##     2: 2017-03-01
##     3: 2017-03-01
##     4: 2017-03-01
##     5: 2017-03-01
##    ---           
## 12761: 2018-04-01
## 12762: 2018-04-01
## 12763: 2018-04-01
## 12764: 2018-04-01
## 12765: 2018-04-01
ae_dt[,c(1,3)] #Select the first and third column
##            period  type
##     1: 2017-03-01     1
##     2: 2017-03-01     2
##     3: 2017-03-01 other
##     4: 2017-03-01     1
##     5: 2017-03-01     2
##    ---                 
## 12761: 2018-04-01 other
## 12762: 2018-04-01     1
## 12763: 2018-04-01 other
## 12764: 2018-04-01 other
## 12765: 2018-04-01     1
ae_dt[, 1:4] #Select a range of the first 1 to 4 columns
##            period org_code  type attendances
##     1: 2017-03-01      RF4     1       21289
##     2: 2017-03-01      RF4     2         813
##     3: 2017-03-01      RF4 other        2850
##     4: 2017-03-01      R1H     1       30210
##     5: 2017-03-01      R1H     2         807
##    ---                                      
## 12761: 2018-04-01   Y02584 other         471
## 12762: 2018-04-01      RA3     1        3825
## 12763: 2018-04-01      AXG other        2980
## 12764: 2018-04-01    NLX24 other        1538
## 12765: 2018-04-01      RA4     1        4388
#Select by column name
head(ae_dt[, period]) #Select by name
## [1] "2017-03-01" "2017-03-01" "2017-03-01" "2017-03-01" "2017-03-01"
## [6] "2017-03-01"
#[1] "2017-03-01" "2017-03-01" "2017-03-01" "2017-03-01" "2017-03-01" "2017-03-01"

Selecting multiple columns using a character vector

By character vector, I mean a list of columns to select from the data.frame:

# One column
my_col <- "period"
ae_dt[, my_col, with=FALSE]
##            period
##     1: 2017-03-01
##     2: 2017-03-01
##     3: 2017-03-01
##     4: 2017-03-01
##     5: 2017-03-01
##    ---           
## 12761: 2018-04-01
## 12762: 2018-04-01
## 12763: 2018-04-01
## 12764: 2018-04-01
## 12765: 2018-04-01
#Multiple columns
#First way
ae_dt[, list(period, attendances)]
##            period attendances
##     1: 2017-03-01       21289
##     2: 2017-03-01         813
##     3: 2017-03-01        2850
##     4: 2017-03-01       30210
##     5: 2017-03-01         807
##    ---                       
## 12761: 2018-04-01         471
## 12762: 2018-04-01        3825
## 12763: 2018-04-01        2980
## 12764: 2018-04-01        1538
## 12765: 2018-04-01        4388
#My preferred way
ae_dt[, .(period, attendances)]
##            period attendances
##     1: 2017-03-01       21289
##     2: 2017-03-01         813
##     3: 2017-03-01        2850
##     4: 2017-03-01       30210
##     5: 2017-03-01         807
##    ---                       
## 12761: 2018-04-01         471
## 12762: 2018-04-01        3825
## 12763: 2018-04-01        2980
## 12764: 2018-04-01        1538
## 12765: 2018-04-01        4388

The first way means you have to specify a list wrapper, otherwise you can use the special period operator to do this.

Dropping columns

To drop columns in data.table you can use the below commands:

ae_dt_copy <- ae_dt
drop_cols <- c("period", "breaches")
ae_drops <- ae_dt_copy[, !drop_cols, with=FALSE]
# This says keep the columns that are not equal to ! my list of drop cols
names(ae_drops)
## [1] "org_code"    "type"        "attendances" "admissions"

Renaming columns

To rename columns, use this convention below:

# Rename a single column
setnames(ae_dt_copy, "period", "month", skip_absent = TRUE)
colnames(ae_dt_copy)
## [1] "month"       "org_code"    "type"        "attendances" "breaches"   
## [6] "admissions"

Viola, there you go, renamed from old to new.

Column creation from existing columns like mutate

To create a new column in base you use the dollar sign, to do it in dplyr you use mutate and in data.table you use the below conventions:

new_col <- ae_dt_copy[, ed_performance := 1-(breaches/attendances)]
glimpse(new_col)
## Rows: 12,765
## Columns: 7
## $ month          <date> 2017-03-01, 2017-03-01, 2017-03-01, 2017-03-01, 2017-0~
## $ org_code       <fct> RF4, RF4, RF4, R1H, R1H, R1H, AD913, RYX, RQM, RQM, RJ6~
## $ type           <fct> 1, 2, other, 1, 2, other, other, other, 1, other, 1, ot~
## $ attendances    <dbl> 21289, 813, 2850, 30210, 807, 11352, 4381, 19562, 17414~
## $ breaches       <dbl> 2879, 22, 6, 5902, 11, 136, 2, 258, 2030, 86, 1322, 140~
## $ admissions     <dbl> 5060, 0, 0, 6943, 0, 0, 0, 0, 3597, 0, 2202, 0, 0, 0, 3~
## $ ed_performance <dbl> 0.8647658, 0.9729397, 0.9978947, 0.8046342, 0.9863693, ~

To create multiple columns, we use the below syntax - the special column command needs to be wrapped in backticks:

mult_cols <- ae_dt_copy[, `:=` (ed_performance_inv = breaches/attendances,
                   admit_to_attend_ratio = admissions/attendances)]

glimpse(mult_cols)
## Rows: 12,765
## Columns: 9
## $ month                 <date> 2017-03-01, 2017-03-01, 2017-03-01, 2017-03-01,~
## $ org_code              <fct> RF4, RF4, RF4, R1H, R1H, R1H, AD913, RYX, RQM, R~
## $ type                  <fct> 1, 2, other, 1, 2, other, other, other, 1, other~
## $ attendances           <dbl> 21289, 813, 2850, 30210, 807, 11352, 4381, 19562~
## $ breaches              <dbl> 2879, 22, 6, 5902, 11, 136, 2, 258, 2030, 86, 13~
## $ admissions            <dbl> 5060, 0, 0, 6943, 0, 0, 0, 0, 3597, 0, 2202, 0, ~
## $ ed_performance        <dbl> 0.8647658, 0.9729397, 0.9978947, 0.8046342, 0.98~
## $ ed_performance_inv    <dbl> 0.1352341585, 0.0270602706, 0.0021052632, 0.1953~
## $ admit_to_attend_ratio <dbl> 0.237681432, 0.000000000, 0.000000000, 0.2298245~

Grouping - GroupBy

Suppose I want to create a summary frame, similar to group_by() and summarise() in dplyr. This can be achieved in data.table, like below:

summary_frame <- ae_dt_copy[, .(mean_attendance=mean(attendances),
               mean_breaches=mean(breaches), 
               sum_attendances=sum(attendances)),
           by=.(org_code)]

glimpse(summary_frame)
## Rows: 274
## Columns: 4
## $ org_code        <fct> RF4, R1H, AD913, RYX, RQM, RJ6, Y02696, NX122, RVR, RJ~
## $ mean_attendance <dbl> 8091.7500, 13625.7500, 4437.1944, 18374.1111, 12686.66~
## $ mean_breaches   <dbl> 1403.6666667, 1897.4444444, 1.5555556, 164.7222222, 76~
## $ sum_attendances <dbl> 873909, 1471581, 159739, 661468, 913440, 522130, 39777~

Chaining in data.table - similar to magrittr’s piping

Chaining in data.table can be achieved by the following:

chained <- ae_dt_copy[, .(mean_attendance=mean(attendances),
               mean_breaches=mean(breaches), 
               sum_attendances=sum(attendances)),
           by=.(org_code)][order(org_code)]
# Adding square brackets, instead of %>%, chains the ordering
# Here we create a group by and summarise function and at the end we add another
# Command sequence i.e. group by org code, summarise the mean and then order by ord code

glimpse(chained)
## Rows: 274
## Columns: 4
## $ org_code        <fct> 8J094, AAH, AC008, AD913, AF002, AF003, AJN, ATQ02, AX~
## $ mean_attendance <dbl> 2857.8077, 189.1111, 1032.3200, 4437.1944, 628.5000, 5~
## $ mean_breaches   <dbl> 0.00000000, 0.05555556, 0.76000000, 1.55555556, 0.0000~
## $ sum_attendances <dbl> 74303, 6808, 25808, 159739, 7542, 6046, 27425, 127445,~
#Show ordering by org_code:

That is how simple it is - just add another set of square brackets after your summary function.

What is .SD and why should I care

Let’s suppose, you want to compute the mean of all the variables, grouped by ‘org_code’. How to do that?

You can create the columns one by one by writing by hand. Or, you can use the lapply() function to do it all in one go. But lapply() takes the data.frame as the first argument. Then, how to use `lapply() inside a data.table?

ae_summarised <- ae_dt_copy[, lapply(.SD[, 4:6, with=F], mean), by=org_code]
# .SD allows for it to be used in an lapply statement to create the column mean group by org_code 
# of multiple columns
glimpse(ae_summarised)
## Rows: 274
## Columns: 4
## $ org_code       <fct> RF4, R1H, AD913, RYX, RQM, RJ6, Y02696, NX122, RVR, RJ1~
## $ breaches       <dbl> 1403.6666667, 1897.4444444, 1.5555556, 164.7222222, 767~
## $ admissions     <dbl> 1556.87037, 2404.24074, 0.00000, 0.00000, 1868.90278, 1~
## $ ed_performance <dbl> 0.9217378, 0.9263262, 0.9996570, 0.9910515, 0.9529410, ~

Instead of me slicing the breaches to ed_performance column - I could add .SDcols to specify the exact columns to use in the function:

sd_cols_agg <- ae_dt_copy[, lapply(.SD, mean), by=org_code,
           .SDcols=c("breaches", "admissions")]

# Take the mean, group by organisation code and use SDCols breaches and admissions to performs aggregations on
glimpse(sd_cols_agg)
## Rows: 274
## Columns: 3
## $ org_code   <fct> RF4, R1H, AD913, RYX, RQM, RJ6, Y02696, NX122, RVR, RJ1, Y0~
## $ breaches   <dbl> 1403.6666667, 1897.4444444, 1.5555556, 164.7222222, 767.805~
## $ admissions <dbl> 1556.87037, 2404.24074, 0.00000, 0.00000, 1868.90278, 1018.~

It takes some getting used to for tidyverse users, but the performance benefits are massive.

Setting Keys - to speed up searches

Setting one or more keys on a data.table enables it to perform binary search, which is many order of magnitudes faster than linear search, especially for large data. To set keys, follow the routine below:

setkey(ae_dt_copy, org_code)
# Check the key has been assigned
key(ae_dt_copy) #Prints out org_code as the key
## [1] "org_code"

Merging data.tables together now becomes a breeze, and will speed up and even quicker process:

dt1 <- ae_dt_copy[, .(org_code, breaches, admissions)]
dt2 <- ae_dt_copy[1:10, .(org_code, type)]

# Join the tables
merged <- dt1[dt2]
glimpse(merged)
## Rows: 260
## Columns: 4
## $ org_code   <fct> 8J094, 8J094, 8J094, 8J094, 8J094, 8J094, 8J094, 8J094, 8J0~
## $ breaches   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ admissions <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ type       <fct> other, other, other, other, other, other, other, other, oth~

Removing keys

To remove the key on the data.table we use this logic:

setkey(ae_dt_copy, NULL)
key(ae_dt_copy)
## NULL

The key has now been removed.

Joining tables

Next, I show how data.table handles joining tables. First, I will set up the data.table:

dt1 <- ae_summarised
dt2 <- ae_summarised[1:10, .(org_code, new_admits=admissions)]

Inner Join

To join the matching rows, we will use an inner join:

inner_joined_df <- merge(dt1, dt2, by="org_code")
glimpse(inner_joined_df)
## Rows: 10
## Columns: 5
## $ org_code       <fct> AD913, NX122, R1H, RF4, RJ1, RJ6, RQM, RVR, RYX, Y02696
## $ breaches       <dbl> 1.555556, 0.000000, 1897.444444, 1403.666667, 667.20370~
## $ admissions     <dbl> 0.000, 0.000, 2404.241, 1556.870, 1093.269, 1018.139, 1~
## $ ed_performance <dbl> 0.9996570, 1.0000000, 0.9263262, 0.9217378, 0.9472967, ~
## $ new_admits     <dbl> 0.000, 0.000, 2404.241, 1556.870, 1093.269, 1018.139, 1~

Left Join

Using the left join:

left_joined <- merge(dt1, dt2, by="org_code", all.x = TRUE, allow.cartesian = FALSE)
glimpse(left_joined)
## Rows: 274
## Columns: 5
## $ org_code       <fct> 8J094, AAH, AC008, AD913, AF002, AF003, AJN, ATQ02, AXG~
## $ breaches       <dbl> 0.00000000, 0.05555556, 0.76000000, 1.55555556, 0.00000~
## $ admissions     <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000~
## $ ed_performance <dbl> 1.0000000, 0.9996942, 0.9992052, 0.9996570, 1.0000000, ~
## $ new_admits     <dbl> NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~

Outer Join

The outer join can be performed like so:

outer_join <- merge(dt1, dt2, by="org_code", all=TRUE)
glimpse(outer_join)
## Rows: 274
## Columns: 5
## $ org_code       <fct> 8J094, AAH, AC008, AD913, AF002, AF003, AJN, ATQ02, AXG~
## $ breaches       <dbl> 0.00000000, 0.05555556, 0.76000000, 1.55555556, 0.00000~
## $ admissions     <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000~
## $ ed_performance <dbl> 1.0000000, 0.9996942, 0.9992052, 0.9996570, 1.0000000, ~
## $ new_admits     <dbl> NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~

That concludes the section on joins. The next section looks at the Pivot Table functionality in data.table.

Pivoting using data.table

This example uses the copy data frame we made and uses the organisation code by the type of attendances. I want to then summarise the mean admissions by type and organisation code. To achieve this:

dcast.data.table(ae_dt_copy, org_code ~ type, fun.aggregate = mean, value.var = 'admissions')
##      org_code   1   2 other
##   1:    8J094 NaN NaN     0
##   2:      AAH NaN NaN     0
##   3:    AC008 NaN NaN     0
##   4:    AD913 NaN NaN     0
##   5:    AF002 NaN NaN     0
##  ---                       
## 270:   Y03201 NaN NaN     0
## 271:   Y03218 NaN NaN     0
## 272:   Y03571 NaN NaN     0
## 273:   Y03645 NaN NaN     0
## 274:   Y04538 NaN NaN     0

Conclusion

That concludes the tour into data.table. This is a really flexible package and can be used for a variety of purposes. For Base R users the transition will be easier, as some of the subsetting concepts are not taught if you approach it from a tidy perspective, but this can lead to massive performance gains in your code. Perhaps to even match Python?