data.table (by Matt Dowle, Arun Srinivasan, et al.) is a package written in C to make high-performance data wrangling tasks a breeze. Despite being incredibly powerful, it is dependency-free and has a rock-solid API: data.table code reliably works decades apart.
Installation
Before continuing, make sure that you have installed data.table. You only
have to do this once (or as often as you want to update the package).
Once data.table is installed, don’t forget to load it whenever you want to use it. Unlike in Stata, you have to reload a package every time you start a new R session.
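A minimal sketch of both steps (run the install line once, the load line at the start of every session):

```r
# Install once (or whenever you want to update the package)
install.packages('data.table')

# Load it in every new R session where you want to use it
library(data.table)
```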
All of the examples in this section will use real-life 2014 New York air traffic
data. You can use the following commands to import the dataset into both Stata
and R.
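Here is a minimal sketch of the R import step. The URL below points at the copy of the 2014 NYC flights data used in the data.table introductory vignette; if you have the file locally, just pass the local path to fread() instead.

```r
library(data.table)

# Read the CSV straight into a data.table called `dat`
dat = fread('https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv')
```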
Introduction
The data.table package centers
around data.tables, which are highly efficient data frames that can be
manipulated using the package’s concise syntax. For example, say we have a
data.table called dat (you can call it whatever you want). Then we can
manipulate it by putting arguments into its square brackets, i.e. dat[]. The
three main components of a data.table operation are i, j, and by,
which go in the order dat[i, j, by]. Note that you don’t have to specify the latter two if you aren’t using them. Each component is described below, followed by a short example.
i, the first component, selects the rows of the data.table that you’ll be working with, much like Stata’s if and in qualifiers let you refer to certain rows.
j, the second component, both selects and operates on the columns of the data.table, like how in Stata the keep or drop commands select specific columns of your data, or how generate or replace create or modify columns in your data.
by, the third component, gives the variable(s) designating groups that you’ll be doing your calculations within, like how in Stata you can precede a command with bysort.
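Putting the three components together, here is a minimal sketch (it assumes the flights dataset imported above, with columns like origin, dep_delay, and month):

```r
dat[origin == 'LGA',                   # i:  keep only LaGuardia flights
    .(mean_delay = mean(dep_delay)),   # j:  compute the average departure delay
    by = month]                        # by: ... separately for each month
```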
data.table uses these simple components very flexibly. The upshot is that you can perform complicated operations in a single line of concise data.table code, operations that might require multiple commands in other languages or libraries. But even if you aren’t doing anything fancy, data.table has
you covered with a stable set of functions that can be deployed on virtually
any data wrangling task.
Like Stata, data.table also provides some special shortcut symbols for
common operations. For example, _N in Stata is equivalent to .N in
data.table, while .(x1, x2) is short for list(x1, x2). We’ll see more
examples in the cheatsheet that follows. But we do want to quickly highlight one
special symbol in particular: .SD refers to the (S)ubset of (D)ata you’re
working with. This can be used to do complex within-group calculations when you
have by specified. But more generally it’s a way to perform operations on lots
of columns with one line of code. By default, .SD refers to all columns in the
dataset (excepting those in by). But you can specify the columns you want with
the .SDcols argument. Again, we’ll see a bunch of examples below.
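Here are a couple of minimal sketches of these special symbols in action, again assuming the flights dataset:

```r
# .N: the number of rows, here counted within each origin airport
dat[, .N, by = origin]

# .SD + .SDcols: take the mean of several columns at once, by group
dat[, lapply(.SD, mean), .SDcols = c('dep_delay', 'arr_delay'), by = origin]
```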
Finally, data.table is extremely fast. It has long set the standard for
in-memory data wrangling benchmarks
across a variety of libraries and languages. You will likely see an order of magnitude (or more) difference in performance as you compare the code chunks below. As a
bonus for Stata users, who are used to operations changing a single dataset in
memory, many data.table operations can be done in-place. This means that
you don’t have to (re)assign the result to a new data.table. In-place
modifications are also very efficient, since they will only affect the parts
you’re actually changing, without wasting memory and time on the parts that
aren’t being changed. Any time in the cheatsheet below you see a function with the word set in it, or the := operator, that’s an in-place operation.
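For instance, a quick sketch of in-place modification with := (the set* functions, e.g. setnames() or setorder(), behave the same way):

```r
dat[, dist_sq := distance^2]   # add a column without copying the rest of the data
dat[, dist_sq := NULL]         # := can also delete a column, again in place
```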
Data I/O
Like Stata’s .dta file format, R has its own native .rds storage format.
(See also the fst package.) However,
we generally recommend that users avoid native (and especially proprietary) file formats, since they hamper interoperability and reproducibility. We’ll hence concentrate
on common open-source file types below. We’ll make an exception for .dta given
our target audience, but we still recommend avoiding it if possible. Note that
all of the examples in this section will assume generic datasets.
Read and write .csv
Single file.
Read many files and append them together.
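A minimal sketch of both cases (the file names are hypothetical):

```r
# Single file
dat = fread('file.csv')
fwrite(dat, 'file.csv')

# Many files: read each one, then stack them into a single data.table
files   = list.files(pattern = '\\.csv$')
dat_all = rbindlist(lapply(files, fread))
```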
Read and write .dta
Note: .dta is Stata’s native (proprietary) filetype.
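data.table itself does not read .dta files, so one common approach on the R side (the haven package is an assumption here, not part of data.table) is:

```r
library(haven)
dat = as.data.table(read_dta('file.dta'))   # read .dta into a data.table
write_dta(dat, 'file.dta')                  # write back out to .dta
```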
Random numbers. Note that the random draws will differ across the two languages, even with the same seed value, because the underlying generators are different.
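A minimal sketch of the R side:

```r
set.seed(123)      # fix the seed for reproducibility (within R)
x = runif(10)      # 10 uniform draws
y = rnorm(10)      # 10 standard-normal draws
```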
Some convenience functions for specific data types.
Order
Sort rows
Sort columns
Rename columns
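Hedged sketches of all three operations; each line stands alone, and each modifies dat in place:

```r
setorder(dat, origin, -dep_delay)               # sort rows: origin ascending, dep_delay descending
setcolorder(dat, c('origin', 'dest'))           # sort columns: move these two to the front
setnames(dat, 'dep_delay', 'departure_delay')   # rename a column (old name, new name)
```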
Subset
In newer versions of Stata, it’s possible to keep multiple datasets in memory,
or “frames” as Stata calls them. But this still requires extra steps that will feel unfamiliar to users of other languages.
In contrast, keeping multiple datasets in memory is
extremely common in R. Moreover, subsetting and collapsing operations don’t
overwrite your original dataset. The upshot is that you don’t need to wrap
everything in preserve/restore. However, it also means that you’ll need to
(re)assign your subsetted/collapsed data if you want to use it again later. E.g.
dat1 = dat[origin=='LGA'].
Subset rows
Reminder: You’ll need to use preserve/restore if you want to retain the
original dataset in the examples that follow.
Reminder: You’ll need to (re)assign the subsetted dataset if you want to use it
later, e.g. dat1 = dat[...].
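For example:

```r
dat1 = dat[origin == 'LGA']   # rows matching a condition
dat2 = dat[1:200]             # the first 200 rows
```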
Subset columns
Reminder: You’ll need to use preserve/restore if you want to retain the
original dataset in the examples that follow.
Reminder: You’ll need to (re)assign the subsetted dataset if you want to use it
later, e.g. dat1 = dat[...].
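For example:

```r
dat1 = dat[, .(origin, dest, dep_delay)]   # keep only these columns
dat2 = dat[, !c('origin', 'dest')]         # drop these columns, keep the rest
```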
Subset rows and columns
Reminder: You’ll need to use preserve/restore if you want to retain the
original dataset in the examples that follow.
Reminder: You’ll need to (re)assign the subsetted dataset if you want to use it
later, e.g. dat1 = dat[...].
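For example:

```r
dat1 = dat[origin == 'LGA', .(dest, dep_delay)]   # filter rows and select columns in one call
```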
Drop duplicates
Reminder: You’ll need to use preserve/restore if you want to retain the
original dataset in the examples that follow.
Reminder: You’ll need to (re)assign the subsetted dataset if you want to use it
later, e.g. dat1 = dat[...].
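For example:

```r
dat1 = unique(dat)                             # drop rows duplicated across all columns
dat2 = unique(dat, by = c('origin', 'dest'))   # ...or duplicated on selected columns only
```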
Drop missing
Reminder: You’ll need to use preserve/restore if you want to retain the
original dataset in the examples that follow.
Reminder: You’ll need to (re)assign the subsetted dataset if you want to use it
later, e.g. dat1 = dat[...].
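For example:

```r
dat1 = na.omit(dat)                       # drop rows with an NA in any column
dat2 = na.omit(dat, cols = 'dep_delay')   # restrict the NA check to given column(s)
dat3 = dat[!is.na(dep_delay)]             # equivalent subset for a single column
```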
Modify
Aside: You can force print a data.table’s in-place modifications to screen by
adding a trailing [], e.g. dat[, dist_sq := distance^2][].
Create new variables
Here are some data.table modifying operations that don’t have direct Stata
equivalents (although you could implement a loop).
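One hedged example of what this looks like: modifying, or creating, many columns in a single call.

```r
cols = c('dep_delay', 'arr_delay')

# Modify several existing columns at once (here: convert to double)
dat[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

# Create several new columns at once (here: delays in hours rather than minutes)
dat[, paste0(cols, '_hrs') := lapply(.SD, function(x) x / 60), .SDcols = cols]
```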
Create new variables within groups
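A quick sketch: group statistics are broadcast back to every row of the group, so this adds a column rather than collapsing the data.

```r
# Each flight gets its origin airport's average departure delay
dat[, mean_delay_origin := mean(dep_delay), by = origin]
```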
Aside: In R, any missing (i.e. “NA”) values will propagate through
aggregating functions. If you have NA values in your real-life dataset—we
don’t in this example dataset—you probably want to add na.rm=TRUE to remove
these on the fly. E.g. mean(var1, na.rm=TRUE) or
lapply(.SD, mean, na.rm=TRUE).
Aside: We don’t normally use a gen -> replace workflow in R, the way we do in
Stata. See the Using Booleans & control-flow
section below for a more idiomatic approach.
Using Booleans & control-flow
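Hedged sketches of the idiomatic alternatives to a gen -> replace workflow:

```r
# A logical (Boolean) column straight from a comparison
dat[, long_flight := distance > 1000]

# fifelse() for a single condition, fcase() for several
dat[, delay_cat := fcase(dep_delay <= 0,  'early or on time',
                         dep_delay <= 30, 'minor delay',
                         default = 'major delay')]
```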
Row-wise calculations
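A hedged sketch using base R's row-wise helpers inside j:

```r
cols = c('dep_delay', 'arr_delay')
dat[, delay_total := rowSums(.SD),  .SDcols = cols]   # row-wise sum
dat[, delay_mean  := rowMeans(.SD), .SDcols = cols]   # row-wise mean
```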
Fill in Time Series/Panel Data
Lags and leads (generic dataset)
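A hedged sketch on a generic panel (the pdat table and its id, time, and x columns are assumptions):

```r
setorder(pdat, id, time)   # make sure the panel is sorted before shifting
pdat[, x_lag1  := shift(x, 1, type = 'lag'),  by = id]   # lag within each unit
pdat[, x_lead1 := shift(x, 1, type = 'lead'), by = id]   # lead within each unit
```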
Replace missing values forward or back (generic dataset)
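And a hedged sketch of filling on the same generic panel; note that nafill() works on numeric columns:

```r
pdat[, x_fwd := nafill(x, type = 'locf'), by = id]   # carry last observation forward
pdat[, x_bwd := nafill(x, type = 'nocb'), by = id]   # next observation carried backward
```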
Collapse
In newer versions of Stata, it’s possible to keep multiple datasets in memory,
or “frames” as Stata calls them. But this still requires extra steps that will feel unfamiliar to users of other languages.
In contrast, keeping multiple datasets in memory is
extremely common in R. Moreover, subsetting and collapsing operations don’t
overwrite your original dataset. The upshot is that you don’t need to wrap
everything in preserve/restore. However, it also means that you’ll need to
(re)assign your subsetted/collapsed data if you want to use it again later. E.g.
dat1 = dat[, mean(var1)]. Finally, remember our earlier note about aggregating
functions on columns that have missing values: Use na.rm=TRUE to remove these
on the fly. E.g. dat[, mean(var1, na.rm=TRUE)].
Collapse with no grouping
Reminder: You’ll need to use preserve/restore if you want to retain the
original dataset in the examples that follow.
Reminder: You’ll need to (re)assign the subsetted dataset if you want to use it
later, e.g. dat1 = dat[...].
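For example:

```r
dat1 = dat[, .(mean_delay = mean(dep_delay), max_delay = max(dep_delay))]
```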
Collapse by group
Reminder: You’ll need to use preserve/restore if you want to retain the
original dataset in the examples that follow.
Reminder: You’ll need to (re)assign the subsetted dataset if you want to use it
later, e.g. dat1 = dat[...].
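For example:

```r
dat1 = dat[, .(mean_delay = mean(dep_delay), n_flights = .N), by = .(origin, month)]
```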
Count rows
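For example:

```r
dat[, .N]                # total number of rows
dat[, .N, by = origin]   # number of rows per group
```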
Advanced collapse (tips and tricks)
These next few examples are meant to highlight some specific data.table
collapse tricks. They don’t really have good Stata equivalents (that we’re aware
of).
Use keys for even faster grouped operations
The data.table website
describes
keys as “supercharged rownames”. Essentially, setting a key means ordering
your data in a way that makes it very efficient to do subsetting or aggregating
operations. data.table is already highly performant, but setting keys can
give a valuable speed boost for big data tasks.
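A hedged sketch: set the key once, and subsequent subsets and aggregations on that column can use it automatically.

```r
setkey(dat, origin)                           # physically sorts dat by origin
dat['LGA']                                    # fast keyed subset
dat['LGA', .(mean_delay = mean(dep_delay))]   # keyed subset plus aggregation
```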
Grouped calculations and complex objects inside a data.table
data.table supports list columns, so you can have complex objects like
regression models inside a data.table. Among many other things, this means you
can nest simulations inside a data.table as easily as you would perform any
other (grouped) operation. Here we illustrate with a simple grouped regression,
i.e. a separate regression for each month of our dataset.
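A hedged sketch of that grouped regression (the model specification is purely illustrative):

```r
# One lm() object per month, stored in a list column
mods = dat[, .(mod = list(lm(arr_delay ~ dep_delay, data = .SD))), by = month]

# Pull the slope coefficient back out of each stored model
mods[, dep_delay_coef := sapply(mod, function(m) coef(m)['dep_delay'])]
```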
Grouped aggregations when reshaping
You can do complicated (grouped) aggregations as part of a data.table::dcast()
(i.e. reshape wide) call. Here’s an example where we summarise both the
departure and arrival delays—getting the minimum, mean, and maximum
values—by origin airport.
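A hedged sketch of that call:

```r
dcast(dat, origin ~ .,
      value.var     = c('dep_delay', 'arr_delay'),
      fun.aggregate = list(min, mean, max))
```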
Reshape
Reshape prep (this dataset only)
Note: We need to do a bit of prep to our air-traffic dataset to better
demonstrate the reshape examples in this section. You probably don’t need to do
this for your own dataset.
Reshape long
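A hedged sketch on a generic wide dataset (wide_dat, its id column, and the x_* year columns are all assumptions):

```r
long = melt(wide_dat,
            id.vars       = 'id',
            measure.vars  = c('x_1990', 'x_2000'),
            variable.name = 'year',
            value.name    = 'x')
```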
Reshape wide
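And a hedged sketch going back the other way:

```r
wide = dcast(long, id ~ year, value.var = 'x')
```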
Merge
Import and prep secondary dataset
Note: Our secondary dataset for demonstrating the merges in this section
will be one on airport characteristics.
Inner merge
Only keep the matched rows across both datasets.
Full merge
Keep all rows of both datasets, regardless of whether matched.
Left merge
Keep all rows from the “main” dataset.
Right merge
Keep all rows from the “secondary” dataset.
Anti merge
Keep non-matched rows only.
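Hedged sketches of all five merge types, assuming the airport characteristics live in a data.table called apts that shares a dest column with dat (both names are assumptions):

```r
merge(dat, apts, by = 'dest')                 # inner: matched rows only
merge(dat, apts, by = 'dest', all = TRUE)     # full:  all rows from both datasets
merge(dat, apts, by = 'dest', all.x = TRUE)   # left:  all rows from dat
merge(dat, apts, by = 'dest', all.y = TRUE)   # right: all rows from apts
dat[!apts, on = 'dest']                       # anti:  rows of dat with no match in apts
```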
Appending data
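A hedged sketch (dat1 and dat2 are hypothetical datasets with the same columns):

```r
dat_all = rbindlist(list(dat1, dat2))                # stack rows
dat_all = rbindlist(list(dat1, dat2), fill = TRUE)   # pad any non-shared columns with NA
```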
Advanced merges (tips and tricks)
These next few examples are meant to highlight some specific data.table
merge tricks. They don’t really have good Stata equivalents (that we’re aware
of).
Merge on different ID names
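A hedged sketch, assuming the airport identifier is called faa in apts but dest in dat:

```r
merge(dat, apts, by.x = 'dest', by.y = 'faa')   # merge() with by.x / by.y
dat[apts, on = c(dest = 'faa')]                 # or data.table's bracket-join syntax
```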
Set keys for even faster merges and syntax shortcuts
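A hedged sketch: once both tables are keyed on the same column(s), the join key no longer has to be spelled out.

```r
setkey(dat, dest)
setkey(apts, dest)
dat[apts]          # bracket join on the shared key (one result row per row of apts)
merge(dat, apts)   # merge() also picks up the shared key automatically
```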
Non-equi joins
Non-equi joins are a bit hard to understand if you’ve never seen them before.
But they are incredibly powerful and solve a surprisingly common problem:
Merging datasets over a range (e.g. start to end dates), rather than on exact
matches. Here’s a simple example where we want to subset the 1st quarter flights
for American Airlines and the 2nd quarter flights for United Airlines:
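A hedged sketch of that example (the carrier codes 'AA' and 'UA' are assumptions):

```r
# Lookup table giving each carrier's month range (Q1 for AA, Q2 for UA)
ranges = data.table(carrier  = c('AA', 'UA'),
                    mo_start = c(1, 4),
                    mo_end   = c(3, 6))

# Keep flights whose month falls inside their carrier's range
dat[ranges, on = .(carrier, month >= mo_start, month <= mo_end)]
```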
Rolling Joins
Rolling joins are similar and allow you to match a set of dates forwards or
backwards. For example, our dat dataset ends in October. Let’s say we want to
carry the last known entries for American and United Airlines forward to
(random) future dates.
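A hedged sketch (it assumes we first build a proper date column from the year/month/day columns; the future dates themselves are made up):

```r
# Build a date column (assumption: year, month, and day columns exist)
dat[, date := as.IDate(paste(year, month, day, sep = '-'))]

# Hypothetical future dates we want matched to the last known flight row
future = data.table(carrier = c('AA', 'UA'),
                    date    = as.IDate(c('2014-11-15', '2014-12-01')))

# roll = TRUE carries the last observation before each date forward
dat[future, on = .(carrier, date), roll = TRUE]
```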