I will take you step by step through the data.table package, comparing it with base R operations so you can see the performance gains you get from this optimised package.
To load the package in you can follow the below instructions:
#install.packages("data.table") #Uncomment to install
library(data.table)
To read files into data.table you use the fread() function. I will load the NHSRdatasets package, export a dataset to CSV, and then use data.table to read it back in:
library(NHSRdatasets)
ae <- NHSRdatasets::ae_attendances
write.csv(ae, "ae_nhsr.csv", row.names = FALSE) #Set row names to false
#Use data.table to read in the document
ae_dt <- fread("ae_nhsr.csv")
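fread() can also read just part of a large file, which saves both time and memory. A minimal sketch, using a throwaway example.csv (not the NHS data) so it is self-contained; the column names here are made up for illustration:

```r
library(data.table)

#Write a small example file so the snippet is self-contained
write.csv(data.frame(a = 1:5, b = letters[1:5], c = runif(5)),
          "example.csv", row.names = FALSE)

#select reads only the named columns; nrows caps how many rows are read
dt <- fread("example.csv", select = c("a", "b"), nrows = 3)
print(dt)
```

On genuinely large files, skipping unneeded columns at read time is often a bigger win than anything you do afterwards.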
Here, we will create a synthetic data frame, using a random number generator, to show how fast the data.table package is compared to base R:
#Create a random uniform distribution
big_data <- data.frame(BigNumbers=runif(10000000)) #Ten million random uniform values
write.csv(big_data, "bigdata.csv")
Start the benchmarking:
# Start to benchmark using system.time
# Read CSV with base
base_metrics <- system.time(
read.csv("bigdata.csv")
)
dt_metrics <- system.time(
data.table::fread("bigdata.csv")
)
print(base_metrics)
## user system elapsed
## 56.69 1.33 58.95
print(dt_metrics)
## user system elapsed
## 1.11 0.21 0.38
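The same gap appears on the write side: fwrite() is data.table's counterpart to write.csv(). A quick sketch, using one million rows rather than the ten million above so it runs quickly; exact timings will vary by machine:

```r
library(data.table)

#One million random values - enough to show the difference
big <- data.frame(BigNumbers = runif(1e6))

base_write <- system.time(write.csv(big, "big_base.csv", row.names = FALSE))
dt_write   <- system.time(fwrite(big, "big_dt.csv"))

print(base_write)
print(dt_write)
```

On my understanding, fwrite() is multi-threaded, which is where most of its speed comes from.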
Compare on a ggplot side by side:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tibble)
library(ggplot2)
df <- data.frame(
base_run_time = base_metrics[1], #Grab the user time
data.table_run_time = dt_metrics[1] #Grab the user time
)
#Flip data.frame over to get the run times using transpose
df<- data.frame(t(df)) %>%
rownames_to_column() %>%
setNames(c("Type", "TimeRan"))
# Make the ggplot
plot <- df %>%
ggplot(aes(x=Type,
y=TimeRan,
fill=as.factor(Type))) + geom_bar(stat="identity",
width = 0.6) +
scale_fill_manual(values= c("#26ACB5", "#FAE920")) + theme_minimal() +
theme(legend.position = "none") + coord_flip() +
labs(title="Run time comparison: data.table vs base R",
y="Run Time (seconds)",
x="Data.Table vs Base",
caption="Produced by Gary Hutson")
print(plot)
As you can see - data.table is lightning fast compared to base R and it is great for working with large datasets.
But we digress - this section is just to highlight how useful the data.table package is for dealing with larger datasets.
From time to time you may want to convert between base data.frames and data.tables. To do this you can follow the below:
#Convert base data.frame to data.table
ae_dt <- as.data.table(ae)
class(ae_dt)
## [1] "data.table" "data.frame"
#Using the setDT command
ae_copy <- ae
data.table::setDT(ae_copy)
class(ae_copy)
## [1] "data.table" "data.frame"
# Converting this back to a data.frame
data.table::setDF(ae_copy)
class(ae_copy)
## [1] "data.frame"
To expand on the above:
The general rule for filtering follows data.table's DT[i, j, by] form: i filters rows, j selects or computes columns, and by groups.
We will use our accident and emergency dataset (ae_dt) to work with some of the fields and commence filtering:
# Filter out hospital types and high attendances
ae_reduced <- ae_dt[type == 1 & attendances > 30000 & org_code != "R1H", ] #The comma indicates rows and no columns
print(ae_reduced)
## period org_code type attendances breaches admissions
## 1: 2019-03-01 RRK 1 32017 10670 10850
## 2: 2019-01-01 RRK 1 31935 12502 11493
## 3: 2018-12-01 RRK 1 30979 8981 11401
## 4: 2018-11-01 RRK 1 30645 8120 11230
## 5: 2018-10-01 RRK 1 30570 7089 10770
## 6: 2018-07-01 RRK 1 32209 6499 11332
This selects arrival type equal to 1, keeps accident and emergency attendances greater than (>) 30000, and excludes organisation code (!=) R1H, which relates to a specific NHS trust. The ampersand (&) combines the conditions: filter on this field and that field.
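Filtering on a set of values is also common. A small sketch using toy data standing in for the A&E dataset (the org codes and figures below are made up for illustration):

```r
library(data.table)

#Toy data standing in for the A&E dataset
dt <- data.table(org_code    = c("RF4", "R1H", "RRK", "RQM"),
                 attendances = c(21289, 30210, 32017, 17414))

#%in% filters on a set of values; ! negates the whole condition
in_set  <- dt[org_code %in% c("RF4", "RRK")]
out_set <- dt[!(org_code %in% c("RF4", "RRK"))]
print(in_set)
print(out_set)
```

This reads much like base R subsetting, but without repeating the data frame name inside the brackets.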
Next, column selection. The code snippet shows how to select given columns by index:
#Select by index
ae_dt[,1] #Select the first column
## period
## 1: 2017-03-01
## 2: 2017-03-01
## 3: 2017-03-01
## 4: 2017-03-01
## 5: 2017-03-01
## ---
## 12761: 2018-04-01
## 12762: 2018-04-01
## 12763: 2018-04-01
## 12764: 2018-04-01
## 12765: 2018-04-01
ae_dt[,c(1,3)] #Select the first and third column
## period type
## 1: 2017-03-01 1
## 2: 2017-03-01 2
## 3: 2017-03-01 other
## 4: 2017-03-01 1
## 5: 2017-03-01 2
## ---
## 12761: 2018-04-01 other
## 12762: 2018-04-01 1
## 12763: 2018-04-01 other
## 12764: 2018-04-01 other
## 12765: 2018-04-01 1
ae_dt[, 1:4] #Select a range of the first 1 to 4 columns
## period org_code type attendances
## 1: 2017-03-01 RF4 1 21289
## 2: 2017-03-01 RF4 2 813
## 3: 2017-03-01 RF4 other 2850
## 4: 2017-03-01 R1H 1 30210
## 5: 2017-03-01 R1H 2 807
## ---
## 12761: 2018-04-01 Y02584 other 471
## 12762: 2018-04-01 RA3 1 3825
## 12763: 2018-04-01 AXG other 2980
## 12764: 2018-04-01 NLX24 other 1538
## 12765: 2018-04-01 RA4 1 4388
#Select by column name
head(ae_dt[, period]) #Select by name
## [1] "2017-03-01" "2017-03-01" "2017-03-01" "2017-03-01" "2017-03-01"
## [6] "2017-03-01"
You can also select with a character vector - that is, a vector of column names to pull from the data.table:
# One column
my_col <- "period"
ae_dt[, my_col, with=FALSE]
## period
## 1: 2017-03-01
## 2: 2017-03-01
## 3: 2017-03-01
## 4: 2017-03-01
## 5: 2017-03-01
## ---
## 12761: 2018-04-01
## 12762: 2018-04-01
## 12763: 2018-04-01
## 12764: 2018-04-01
## 12765: 2018-04-01
#Multiple columns
#First way
ae_dt[, list(period, attendances)]
## period attendances
## 1: 2017-03-01 21289
## 2: 2017-03-01 813
## 3: 2017-03-01 2850
## 4: 2017-03-01 30210
## 5: 2017-03-01 807
## ---
## 12761: 2018-04-01 471
## 12762: 2018-04-01 3825
## 12763: 2018-04-01 2980
## 12764: 2018-04-01 1538
## 12765: 2018-04-01 4388
#My preferred way
ae_dt[, .(period, attendances)]
## period attendances
## 1: 2017-03-01 21289
## 2: 2017-03-01 813
## 3: 2017-03-01 2850
## 4: 2017-03-01 30210
## 5: 2017-03-01 807
## ---
## 12761: 2018-04-01 471
## 12762: 2018-04-01 3825
## 12763: 2018-04-01 2980
## 12764: 2018-04-01 1538
## 12765: 2018-04-01 4388
The first way requires the list() wrapper; alternatively, you can use the .() shorthand, which data.table treats as an alias for list().
To drop columns in data.table you can use the below commands:
ae_dt_copy <- ae_dt
drop_cols <- c("period", "breaches")
ae_drops <- ae_dt_copy[, !drop_cols, with=FALSE]
# This says keep the columns that are not equal to ! my list of drop cols
names(ae_drops)
## [1] "org_code" "type" "attendances" "admissions"
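If you do not need a copy, data.table can also drop columns in place with := NULL, which modifies the table by reference rather than allocating a new one. A minimal sketch on toy data:

```r
library(data.table)

dt <- data.table(a = 1:3, b = 4:6, c = 7:9)

#:= NULL removes the columns by reference - no copy of the table is made
dt[, c("b", "c") := NULL]
names(dt)
```

Because this works by reference, be aware that it changes the original object - copy first with copy() if you want to keep it.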
To rename columns, use this convention below:
# Rename a single column
setnames(ae_dt_copy, "period", "month", skip_absent = TRUE)
colnames(ae_dt_copy)
## [1] "month" "org_code" "type" "attendances" "breaches"
## [6] "admissions"
Voilà, there you go, renamed from old to new.
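setnames() also accepts parallel vectors of old and new names, so several columns can be renamed in one call. A small sketch on toy data (the column names here are illustrative):

```r
library(data.table)

dt <- data.table(period = 1:3, org = c("A", "B", "C"))

#Old and new names are passed as matching character vectors
setnames(dt, old = c("period", "org"), new = c("month", "org_code"))
names(dt)
```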
To create a new column in base R you use the dollar sign, in dplyr you use mutate(), and in data.table you use the := operator, like below:
new_col <- ae_dt_copy[, ed_performance := 1-(breaches/attendances)]
glimpse(new_col)
## Rows: 12,765
## Columns: 7
## $ month <date> 2017-03-01, 2017-03-01, 2017-03-01, 2017-03-01, 2017-0~
## $ org_code <fct> RF4, RF4, RF4, R1H, R1H, R1H, AD913, RYX, RQM, RQM, RJ6~
## $ type <fct> 1, 2, other, 1, 2, other, other, other, 1, other, 1, ot~
## $ attendances <dbl> 21289, 813, 2850, 30210, 807, 11352, 4381, 19562, 17414~
## $ breaches <dbl> 2879, 22, 6, 5902, 11, 136, 2, 258, 2030, 86, 1322, 140~
## $ admissions <dbl> 5060, 0, 0, 6943, 0, 0, 0, 0, 3597, 0, 2202, 0, 0, 0, 3~
## $ ed_performance <dbl> 0.8647658, 0.9729397, 0.9978947, 0.8046342, 0.9863693, ~
To create multiple columns, we use the below syntax - the := operator must be wrapped in backticks when used in this functional form:
mult_cols <- ae_dt_copy[, `:=` (ed_performance_inv = breaches/attendances,
admit_to_attend_ratio = admissions/attendances)]
glimpse(mult_cols)
## Rows: 12,765
## Columns: 9
## $ month <date> 2017-03-01, 2017-03-01, 2017-03-01, 2017-03-01,~
## $ org_code <fct> RF4, RF4, RF4, R1H, R1H, R1H, AD913, RYX, RQM, R~
## $ type <fct> 1, 2, other, 1, 2, other, other, other, 1, other~
## $ attendances <dbl> 21289, 813, 2850, 30210, 807, 11352, 4381, 19562~
## $ breaches <dbl> 2879, 22, 6, 5902, 11, 136, 2, 258, 2030, 86, 13~
## $ admissions <dbl> 5060, 0, 0, 6943, 0, 0, 0, 0, 3597, 0, 2202, 0, ~
## $ ed_performance <dbl> 0.8647658, 0.9729397, 0.9978947, 0.8046342, 0.98~
## $ ed_performance_inv <dbl> 0.1352341585, 0.0270602706, 0.0021052632, 0.1953~
## $ admit_to_attend_ratio <dbl> 0.237681432, 0.000000000, 0.000000000, 0.2298245~
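The := operator can also update just a subset of rows, by combining it with a filter in i - a common pattern for flagging records. A sketch on made-up data (the threshold and column name are illustrative):

```r
library(data.table)

dt <- data.table(org_code    = c("A", "B", "C"),
                 attendances = c(100, 40000, 20000))

#:= on a filtered subset updates only the matching rows;
#the rest of the new column is filled with NA
dt[attendances > 30000, flag := "high"]
print(dt)
```

This is the data.table equivalent of a conditional mutate, done by reference.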
Suppose I want to create a summary frame, similar to group_by() and summarise() in dplyr. This can be achieved in data.table, like below:
summary_frame <- ae_dt_copy[, .(mean_attendance=mean(attendances),
mean_breaches=mean(breaches),
sum_attendances=sum(attendances)),
by=.(org_code)]
glimpse(summary_frame)
## Rows: 274
## Columns: 4
## $ org_code <fct> RF4, R1H, AD913, RYX, RQM, RJ6, Y02696, NX122, RVR, RJ~
## $ mean_attendance <dbl> 8091.7500, 13625.7500, 4437.1944, 18374.1111, 12686.66~
## $ mean_breaches <dbl> 1403.6666667, 1897.4444444, 1.5555556, 164.7222222, 76~
## $ sum_attendances <dbl> 873909, 1471581, 159739, 661468, 913440, 522130, 39777~
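Alongside mean() and sum(), the special symbol .N gives the row count within each group - the equivalent of dplyr's n(). A minimal sketch on toy data:

```r
library(data.table)

dt <- data.table(org_code    = c("A", "A", "B", "B", "B"),
                 attendances = c(10, 20, 5, 15, 25))

#.N is the number of rows in each by group
counts <- dt[, .(n_rows = .N, mean_attendance = mean(attendances)),
             by = org_code]
print(counts)
```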
Chaining in data.table can be achieved by the following:
chained <- ae_dt_copy[, .(mean_attendance=mean(attendances),
mean_breaches=mean(breaches),
sum_attendances=sum(attendances)),
by=.(org_code)][order(org_code)]
# Adding square brackets, instead of %>%, chains the ordering
# Here we create a group by and summarise function and at the end we add another
# command sequence i.e. group by org_code, summarise the means, then order by org_code
glimpse(chained)
## Rows: 274
## Columns: 4
## $ org_code <fct> 8J094, AAH, AC008, AD913, AF002, AF003, AJN, ATQ02, AX~
## $ mean_attendance <dbl> 2857.8077, 189.1111, 1032.3200, 4437.1944, 628.5000, 5~
## $ mean_breaches <dbl> 0.00000000, 0.05555556, 0.76000000, 1.55555556, 0.0000~
## $ sum_attendances <dbl> 74303, 6808, 25808, 159739, 7542, 6046, 27425, 127445,~
That is how simple it is - just add another set of square brackets after your summary function.
Let’s suppose you want to compute the mean of all the variables, grouped by ‘org_code’. How do you do that?
You could create the columns one by one by hand, or you can use the lapply() function to do it all in one go. But lapply()
takes a data.frame as its first argument - so how do you use lapply() inside a data.table? The answer is the .SD (Subset of Data) symbol:
ae_summarised <- ae_dt_copy[, lapply(.SD[, 4:6, with=FALSE], mean), by=org_code]
# .SD allows for it to be used in an lapply statement to create the column mean group by org_code
# of multiple columns
glimpse(ae_summarised)
## Rows: 274
## Columns: 4
## $ org_code <fct> RF4, R1H, AD913, RYX, RQM, RJ6, Y02696, NX122, RVR, RJ1~
## $ breaches <dbl> 1403.6666667, 1897.4444444, 1.5555556, 164.7222222, 767~
## $ admissions <dbl> 1556.87037, 2404.24074, 0.00000, 0.00000, 1868.90278, 1~
## $ ed_performance <dbl> 0.9217378, 0.9263262, 0.9996570, 0.9910515, 0.9529410, ~
Instead of slicing the breaches to ed_performance columns, I could add .SDcols to specify the exact columns to use in the function:
sd_cols_agg <- ae_dt_copy[, lapply(.SD, mean), by=org_code,
.SDcols=c("breaches", "admissions")]
# Take the mean, group by organisation code and use .SDcols breaches and admissions to perform aggregations on
glimpse(sd_cols_agg)
## Rows: 274
## Columns: 3
## $ org_code <fct> RF4, R1H, AD913, RYX, RQM, RJ6, Y02696, NX122, RVR, RJ1, Y0~
## $ breaches <dbl> 1403.6666667, 1897.4444444, 1.5555556, 164.7222222, 767.805~
## $ admissions <dbl> 1556.87037, 2404.24074, 0.00000, 0.00000, 1868.90278, 1018.~
It takes some getting used to for tidyverse users, but the performance benefits are massive.
Setting one or more keys on a data.table enables it to perform binary search, which is many orders of magnitude faster than a linear search, especially for large data. To set keys, follow the routine below:
setkey(ae_dt_copy, org_code)
# Check the key has been assigned
key(ae_dt_copy) #Prints out org_code as the key
## [1] "org_code"
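Once a key is set, rows can be looked up directly by key value - the bracket takes the value itself rather than a logical condition, and the lookup uses binary search. A sketch on toy data with made-up org codes:

```r
library(data.table)

dt <- data.table(org_code    = c("RF4", "R1H", "RF4", "RRK"),
                 attendances = c(1, 2, 3, 4))
setkey(dt, org_code)

#With a key set, this keyed subset finds matching rows by binary search
dt["RF4"]
```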
Merging data.tables together now becomes a breeze, and the key makes an already quick process even quicker:
dt1 <- ae_dt_copy[, .(org_code, breaches, admissions)]
dt2 <- ae_dt_copy[1:10, .(org_code, type)]
# Join the tables
merged <- dt1[dt2]
glimpse(merged)
## Rows: 260
## Columns: 4
## $ org_code <fct> 8J094, 8J094, 8J094, 8J094, 8J094, 8J094, 8J094, 8J094, 8J0~
## $ breaches <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ admissions <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ type <fct> other, other, other, other, other, other, other, other, oth~
To remove the key on the data.table we use this logic:
setkey(ae_dt_copy, NULL)
key(ae_dt_copy)
## NULL
The key has now been removed.
Next, I show how data.table handles joining tables. First, I will set up the two data.tables:
dt1 <- ae_summarised
dt2 <- ae_summarised[1:10, .(org_code, new_admits=admissions)]
To join the matching rows, we will use an inner join:
inner_joined_df <- merge(dt1, dt2, by="org_code")
glimpse(inner_joined_df)
## Rows: 10
## Columns: 5
## $ org_code <fct> AD913, NX122, R1H, RF4, RJ1, RJ6, RQM, RVR, RYX, Y02696
## $ breaches <dbl> 1.555556, 0.000000, 1897.444444, 1403.666667, 667.20370~
## $ admissions <dbl> 0.000, 0.000, 2404.241, 1556.870, 1093.269, 1018.139, 1~
## $ ed_performance <dbl> 0.9996570, 1.0000000, 0.9263262, 0.9217378, 0.9472967, ~
## $ new_admits <dbl> 0.000, 0.000, 2404.241, 1556.870, 1093.269, 1018.139, 1~
Using the left join:
left_joined <- merge(dt1, dt2, by="org_code", all.x = TRUE, allow.cartesian = FALSE)
glimpse(left_joined)
## Rows: 274
## Columns: 5
## $ org_code <fct> 8J094, AAH, AC008, AD913, AF002, AF003, AJN, ATQ02, AXG~
## $ breaches <dbl> 0.00000000, 0.05555556, 0.76000000, 1.55555556, 0.00000~
## $ admissions <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000~
## $ ed_performance <dbl> 1.0000000, 0.9996942, 0.9992052, 0.9996570, 1.0000000, ~
## $ new_admits <dbl> NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
The full (outer) join can be performed like so:
outer_join <- merge(dt1, dt2, by="org_code", all=TRUE)
glimpse(outer_join)
## Rows: 274
## Columns: 5
## $ org_code <fct> 8J094, AAH, AC008, AD913, AF002, AF003, AJN, ATQ02, AXG~
## $ breaches <dbl> 0.00000000, 0.05555556, 0.76000000, 1.55555556, 0.00000~
## $ admissions <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000~
## $ ed_performance <dbl> 1.0000000, 0.9996942, 0.9992052, 0.9996570, 1.0000000, ~
## $ new_admits <dbl> NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
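One join worth knowing that merge() does not cover directly is the anti-join: keep the rows of one table that have no match in the other. In data.table this is the ! operator inside the bracket. A sketch on toy keyed tables:

```r
library(data.table)

dt1 <- data.table(org_code = c("A", "B", "C"), admissions = c(10, 20, 30))
dt2 <- data.table(org_code = c("A", "B"))
setkey(dt1, org_code)
setkey(dt2, org_code)

#dt1[!dt2] keeps the rows of dt1 whose key has no match in dt2 (an anti-join)
dt1[!dt2]
```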
That concludes the section on joins. The next section looks at the Pivot Table functionality in data.table.
This example uses the copy of the data frame we made, pivoting organisation code against the type of attendance. I want to summarise the mean admissions by type and organisation code. To achieve this:
dcast.data.table(ae_dt_copy, org_code ~ type, fun.aggregate = mean, value.var = 'admissions')
## org_code 1 2 other
## 1: 8J094 NaN NaN 0
## 2: AAH NaN NaN 0
## 3: AC008 NaN NaN 0
## 4: AD913 NaN NaN 0
## 5: AF002 NaN NaN 0
## ---
## 270: Y03201 NaN NaN 0
## 271: Y03218 NaN NaN 0
## 272: Y03571 NaN NaN 0
## 273: Y03645 NaN NaN 0
## 274: Y04538 NaN NaN 0
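The inverse operation - going from wide back to long - is melt(). A sketch on a small made-up wide table shaped like the dcast output above:

```r
library(data.table)

wide <- data.table(org_code = c("A", "B"),
                   type_1 = c(10, 20), type_2 = c(1, 2), other = c(5, 6))

#melt() is the inverse of dcast(): one row per org_code and type combination
long <- melt(wide, id.vars = "org_code",
             variable.name = "type", value.name = "admissions")
print(long)
```

Together, dcast() and melt() cover the same ground as tidyr's pivot_wider() and pivot_longer().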
That concludes the tour of data.table. It is a really flexible package and can be used for a variety of purposes. For base R users the transition will be easier, as some of the subsetting concepts are not taught if you approach R from a tidy perspective, but making the switch can lead to massive performance gains in your code. Perhaps even enough to match Python?