Other Packages
While we think you can get really far in R with just data.table and fixest, of course these two packages don’t cover everything.
This page covers a small list of packages you may find especially useful when getting started with R. We won’t try to cover everything under the sun here. Just a few places to get going. For the rest, well, that’s what StackOverflow or your favourite search engine is for.
All of the below packages have far more applications than is shown here. We’ll just provide one or two examples of how each can be used. Finally, don’t forget to install them with install.packages('PKGNAME') and load them with library(PKGNAME). The former command you only have to run once per package (or as often as you want to update it); the latter whenever you want to use a package in a new R session.
base
Where it all begins
Like many programming languages, R draws much of its strength from its package ecosystem. But none of that would be possible without the scaffolding provided by base R. The “base” part here refers to a set of core libraries and routines that get installed and loaded automatically whenever you start an R session. And you get a lot right out of the gate, because base R is incredibly versatile and function-rich. Many of the operations that we have shown you on the preceding pages could equally have been implemented using off-the-shelf base R equivalents. We won’t attempt to persuade you of that here, but there are lots of good tutorials available if you’re interested (here for example). Below we’ll just highlight a few simple examples to give you an idea.
Plotting (simple histogram)
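For example, a minimal sketch using the built-in airquality dataset:

```r
# Base R histogram: one function call, no packages needed
hist(airquality$Temp, breaks = 20,
     main = "Daily temperatures", xlab = "Temp (F)")
```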
Linear regression
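For example, with the built-in mtcars dataset:

```r
# OLS with base R's lm()
mod = lm(mpg ~ wt + cyl, data = mtcars)
summary(mod)  # coefficients, standard errors, R-squared, etc.
```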
Iteration (loops)
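A quick sketch of both a classic for loop and a vectorized alternative:

```r
# A simple for loop
for (i in 1:3) {
  print(i^2)
}

# The same result with sapply(), which is often more idiomatic in R
sapply(1:3, function(i) i^2)
```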
ggplot2
Beautiful and customizable plots
ggplot2 is widely considered one of the preeminent plotting libraries available in any language. It provides an intuitive syntax that applies in the same way across many, many different kinds of visualizations, and with a deep level of customization. Plus, endless additional plugins to do what you want, including easy interactivity, animation, maps, etc. We thought about giving ggplot2 its own dedicated page like data.table and fixest. But instead we’ll point you to the Figures section of the Library of Statistical Techniques, which already shows how to do many different graphing tasks in both Stata and ggplot2. For a more in-depth overview you can always consult the excellent package documentation, or Kieran Healy’s wonderful Data Visualization book.
Basic scatterplot(s)
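For example, a minimal sketch using mtcars again:

```r
library(ggplot2)

# Scatterplot of weight vs. fuel economy, colored by cylinder count
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "MPG", color = "Cylinders")
```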
tidyverse
A family of data science tools
The tidyverse provides an extremely popular
framework for data science tasks in R. This meta-package is actually a
collection of smaller packages that are all designed to work together, based on
a shared philosophy and syntax. We’ve already covered ggplot2 above, but
there are plenty more. These include dplyr and tidyr, which offer an
alternative syntax and approach to data wrangling tasks. While we personally
recommend data.table, these tidyverse packages have many ardent fans
too. You may find that you prefer their modular design and verbal syntax. But
don’t feel bound either way: it’s totally fine to combine them. Some other
tidyverse packages worth knowing about include purrr, which contains a suite of functions for automating and looping your work; lubridate, which makes working with date-based data easy; and stringr, which offers functions with straightforward syntax for working with string variables. In the examples that follow, note that |> is R’s native pipe operator.
Data wrangling with dplyr
Note: dplyr doesn’t modify data in place. So you’ll need to (re)assign if you want to keep your changes. E.g. dat = dat |> group_by...
Subset by rows and then columns.
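A sketch, assuming your data are loaded as dat (e.g. the flights dataset used elsewhere on this site):

```r
library(dplyr)

# filter() subsets rows; select() subsets columns
dat |>
  filter(month == 1 & origin == "JFK") |>
  select(origin, dest, dep_delay)
```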
Create a new variable by group.
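Again using the hypothetical dat from above (note the reassignment, per the note):

```r
# group_by() + mutate() adds a new column computed within each group
dat = dat |>
  group_by(origin) |>
  mutate(mean_delay = mean(dep_delay)) |>
  ungroup()
```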
Collapse by group.
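And to collapse rather than mutate:

```r
# group_by() + summarise() collapses each group down to one row
dat |>
  group_by(origin, dest) |>
  summarise(mean_delay = mean(dep_delay), .groups = "drop")
```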
Manipulating dates with lubridate
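A few illustrative calls:

```r
library(lubridate)

d = ymd("2021-07-04")    # parse a string into a Date
month(d)                 # 7
wday(d, label = TRUE)    # day of the week
d + months(3)            # date arithmetic: "2021-10-04"
```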
Iterating with purrr
Read in many files and append them together.
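A sketch, assuming a hypothetical data/ folder of same-format CSVs:

```r
library(purrr)
library(readr)

files = list.files("data/", pattern = "\\.csv$", full.names = TRUE)
all_dat = map_dfr(files, read_csv)  # read each file, then row-bind them all
```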
Iterate over variables.
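For example, applying the same function to several columns of mtcars:

```r
library(purrr)

# Data frames are lists of columns, so map_dbl() iterates over variables
map_dbl(mtcars[c("mpg", "wt", "hp")], mean)
```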
String operations with stringr
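A few representative functions:

```r
library(stringr)

s = c("Stata", "R", "data.table")
str_detect(s, "a")        # TRUE FALSE TRUE
str_replace(s, "a", "o")  # replace the first "a" in each string
str_to_upper(s)           # "STATA" "R" "DATA.TABLE"
```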
arrow and duckdb
Fast data storage and databases
One advantage of learning the tidyverse syntax is that it plugs in seamlessly with databases via the dbplyr package. The arrow package provides a really fast and memory-efficient in-memory data storage format, as well as a matching on-disk storage format (.parquet). The duckdb package provides a very similar in-memory format, so the two play remarkably well together.
These two packages together (or separately) make working with very large datasets super fast. For an example using the massive NYC Taxi dataset, see https://arrow.apache.org/docs/2.0/r/articles/dataset.html. That dataset is 37 gigabytes (bigger than most people’s RAM), and arrow can do a group_by() and summarize() on it in a few seconds on a standard MacBook.
Import/export data with arrow
For this example, we will create a small arrow dataset from the flights14 dataset using the following code:
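```r
library(arrow)

# Assumes flights14 is already loaded as a data frame (e.g. from the CSV
# used on the data.table page). Partitioning by origin writes one parquet
# subfolder per origin airport; the folder name here is our choice.
write_dataset(flights14, "flights14_data", partitioning = "origin")
```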
The write_dataset
and open_dataset
functions are useful for very large datasets and for efficiently “pre-grouping” the data (e.g. if you know you’re going to group_by(origin)
a lot, then saving the dataset this way is efficient). If you want a single file rather than a folder system, you can read/write to the .parquet
file format. The parquet version will be much smaller in size (in this case, the .csv
is 9.6MB versus the .parquet
at 1.4MB) and load/save faster.
We can load the dataset back into R with open_dataset():
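```r
# Lazily point R at the on-disk dataset; nothing is read into RAM yet
dat_arrow = open_dataset("flights14_data")
```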
Data wrangling with arrow
With our dataset loaded, we can begin to do standard data wrangling using dplyr.
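For instance, reusing the dat_arrow object we opened above (the object name sum is ours, and matches the text below):

```r
library(dplyr)

# Build up a query with normal dplyr verbs
sum = dat_arrow |>
  group_by(origin) |>
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))
```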
If you type sum into the console, the output will not look like a data.frame. Instead, you’ll see that a query
object has been created:
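```r
sum  # prints the query's schema and pending operations, not the result rows
```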
Instead of loading the output, arrow
has “prepared” a set of commands that will need to be triggered with the collect()
function. Note that once you collect()
, the result returned is a standard R data.frame, so do so carefully.
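```r
# Trigger the computation and pull the result into R as a data.frame
sum |> collect()
```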
Under the hood, arrow uses the Acero computation engine. This engine is what allows arrow to do really efficient operations on large datasets. However, it is not as fast or as fully featured as DuckDB’s computation engine, which we discuss below.
duckdb
Because of the way that arrow stores data in memory, it can be passed around between coding languages (e.g. to and from Python) and to the duckdb database without copying any data. This makes the operation incredibly efficient. We can convert from arrow to duckdb using the to_duckdb() function:
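```r
# Zero-copy hand-off of our arrow dataset to DuckDB
dat_duck = to_duckdb(dat_arrow)
```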
With the data transferred to a duckdb database, we can continue as before with the dplyr and collect() syntax. Under the hood, the duckdb computation engine will be used, which has a more complete feature set that permits additional operations (e.g. rolling windows). One nice thing duckdb/arrow offer is the explain() function, which shows what is happening under the hood for those curious.
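For example (q is just our name for the query object):

```r
# Build a query on the duckdb side...
q = dat_duck |>
  group_by(origin) |>
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))

# ...and inspect the plan that duckdb will execute
explain(q)
```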
Then, the data can be collect()
ed into a data.frame:
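```r
q |> collect()  # now a standard R data.frame
```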
The duckdb engine is very powerful and can do things that arrow does not support (yet?), e.g. pivot/reshape (NB: these require the tidyr package to be loaded as well; see the sketch below). The duckdb compute engine has a bunch of awesome features, including spatial data manipulation: DuckDB has a fully functional and very fast spatial computation set. See duckdbfs for an R integration.
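As a sketch of the reshape point, assuming tidyr is loaded alongside dplyr so that pivot_wider() dispatches on the duckdb-backed table:

```r
library(tidyr)

# Count flights per origin-month, then spread months across columns
dat_duck |>
  count(origin, month) |>
  pivot_wider(names_from = month, values_from = n) |>
  collect()
```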
collapse
Extra convenience functions and super fast aggregations
Sure, we’ve gone on and on about how fast data.table is compared to just
about everything else. But there is another R package that can boast even faster
computation times for certain grouped calculations and transformations, and
that’s
collapse.
The collapse package doesn’t try to do everything that data.table does.
But the two
play very well together
and the former offers some convenience functions like descr
and collap
,
which essentially mimic the equivalent functions in Stata and might be
particularly appealing to readers of this guide. (P.S. If you’d like to load
data.table and collapse at the same time, plus some other
high-performance packages, check out the
fastverse.)
Quick Summaries
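For example:

```r
library(collapse)

# Rich descriptive statistics, in the spirit of Stata's summarize/describe
descr(mtcars)
```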
Multiple grouped aggregations
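And a collap() call that mimics Stata’s collapse:

```r
# Collapse mtcars by cyl, applying several aggregations at once
collap(mtcars, mpg + wt ~ cyl, FUN = list(fmean, fsd))
```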
sandwich
More standard error adjustments
The fixest package comes with plenty of shortcuts for accessing standard-error adjustments like HC1 heteroskedasticity-robust standard errors, Newey-West, Driscoll-Kraay, clustered standard errors, etc. But of course there are still more than that. A host of additional options are covered by the sandwich package, which comes with a long list of functions like vcovBS() for bootstrapped standard errors, or vcovHC() for HC1-5. sandwich supports nearly every model class in R, so it shouldn’t surprise you that these can slot right into fixest estimates, too.
You shouldn’t be using those , robust
errors for smaller samples anyway… but
you knew that, right?
Linear Model Adjustments
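A sketch with base lm(); the same sandwich matrices can equally be passed to fixest or modelsummary via their vcov arguments:

```r
library(sandwich)
library(lmtest)

m = lm(mpg ~ wt + cyl, data = mtcars)

# HC2 heteroskedasticity-robust standard errors
coeftest(m, vcov = vcovHC(m, type = "HC2"))

# Bootstrapped standard errors
coeftest(m, vcov = vcovBS(m))
```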
modelsummary
Summary tables, regression tables, and more
The fixest package already has the etable()
function for generating
regression tables. However, it is only really intended to work with models from
the same package. So we also recommend checking out the fantastic
modelsummary package.
It works with all sorts of model objects, including those not from fixest, is incredibly customizable, and outputs to a bunch of different formats (PDF, HTML, DOCX, etc.). Similarly, modelsummary has a wealth of options for
producing publication-ready summary tables. Oh, and it produces coefficient
plots too. Check out the package
website for more.
Summary tables
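For example:

```r
library(modelsummary)

# One-line descriptive overview of a dataset
datasummary_skim(mtcars)
```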
Regression tables
Aside: Here we’ll use the base R lm()
(linear model) function, rather than
feols()
, to emphasize that modelsummary works with many different model
classes.
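A minimal sketch:

```r
library(modelsummary)

mods = list(
  "Bivariate"    = lm(mpg ~ wt, data = mtcars),
  "Multivariate" = lm(mpg ~ wt + cyl + hp, data = mtcars)
)
modelsummary(mods)  # add output = "latex", "docx", etc. as needed
```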
marginaleffects
Marginal effects, contrasts, joint hypothesis tests, etc.
The Stata margins command is great. To replicate it in R, we highly recommend the marginaleffects package. It handles individual or average marginal effects for nonlinear models, models with interactions or transformations, and more. The documentation is outstanding and the underlying functions are also very fast.
Marginal effects and plots
Here’s a simple example of a hypothetical logit model.
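A minimal sketch (function names follow recent marginaleffects releases):

```r
library(marginaleffects)

# Hypothetical logit: transmission type as a function of weight and horsepower
m_logit = glm(am ~ wt + hp, data = mtcars, family = binomial)

# Average marginal effects, a la Stata's `margins, dydx(*)`
avg_slopes(m_logit)
```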
And here’s another of a hypothetical continuous * categorical interaction model.
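Again, a sketch:

```r
# Hypothetical continuous * categorical interaction
m_int = lm(mpg ~ wt * factor(cyl), data = mtcars)

# Marginal effect of wt at each level of cyl, plus a quick plot
avg_slopes(m_int, variables = "wt", by = "cyl")
plot_slopes(m_int, variables = "wt", by = "cyl")
```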
Joint coefficient and (non)linear hypothesis tests
Stata provides a number of inbuilt commands for (potentially complex)
postestimation coefficient tests. We’ve already seen the testparm
command
equivalent with fixest::wald()
. But what about combinations of coefficients a
la Stata’s lincom
and nlcom
commands? While several R packages do this,
we’ll again recommend the marginaleffects package. It’s lightweight and fast,
and supports
hypothesis testing
of both linear and non-linear combinations.
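For example, with the hypotheses() function (using coefficient names as they appear in the model):

```r
library(marginaleffects)

m = lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# Linear combination, a la Stata's lincom
hypotheses(m, "wt + hp = 0")

# Non-linear combination, a la Stata's nlcom
hypotheses(m, "wt / hp = 1")
```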
lme4
Random effects and mixed models
fixest can do a lot, but it can’t do everything. This site isn’t even going to attempt to go into how to translate every single model into R. But we’ll quickly highlight random-effects and mixed models. The lme4 package and its lmer() function cover not just random-intercept models but also hierarchical models where slope coefficients follow random distributions. (Aside: If you prefer Bayesian models for this kind of thing, check out brms.)
Random effects and mixed models
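For example, using lme4’s built-in sleepstudy dataset:

```r
library(lme4)

# Random intercepts and random slopes for Days, varying by Subject
m_mixed = lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(m_mixed)
```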
P.S. Take a look at the CRAN Econometrics Task View page for a thorough list of econometric methods and relevant packages.
sf
Geospatial operations
R has outstanding support for geospatial computation and mapping. There are a variety of packages to choose from here, depending on what you want (e.g. vector vs raster data, interactive maps, high-dimensional data cubes, etc.) But the workhorse geospatial tool for most R users is the incredibly versatile sf package. We’ll only provide a simple mapping example below. The sf website has several in-depth tutorials, and we also recommend the Geocomputation with R book by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow.
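For example, using the North Carolina counties shapefile that ships with sf:

```r
library(sf)
library(ggplot2)

nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Quick base plot of one attribute...
plot(nc["BIR74"])

# ...or a ggplot2 map via geom_sf()
ggplot(nc) + geom_sf(aes(fill = BIR74))
```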