Other Packages

While we think you can get really far in R with just data.table and fixest, of course these two packages don't cover everything.

This page covers a small list of packages you may find especially useful when getting started with R. We won't try to cover everything under the sun here. Just a few places to get going. For the rest, well, that's what Stack Overflow or your favourite search engine is for.

All of the below packages have far more applications than we can show here. We'll just provide one or two examples of how each can be used. Finally, don't forget to install them with install.packages('PKGNAME') and load them with library(PKGNAME). The former command only needs to be run once per package (or as often as you want to update it); the latter whenever you want to use a package in a new R session.
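For example, to install and then load ggplot2 (covered below):

install.packages('ggplot2')  # once per machine (or to update)
library(ggplot2)             # once per R session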

base

Where it all begins

Like many programming languages, one of R's great strengths is its package ecosystem. But none of that would be possible without the scaffolding provided by base R. The "base" part here represents a set of core libraries and routines that get installed and loaded automatically whenever you start an R session. And you get a lot right out of the gate, because base R is incredibly versatile and function-rich. Many of the operations that we have shown you on the preceding pages could equally have been implemented using off-the-shelf base R equivalents. We won't attempt to persuade you of that here, but there are lots of good tutorials available if you're interested (here, for example). Below we'll just highlight a few simple examples to give you an idea.

Plotting (simple histogram)

set obs 100
gen x = rnormal()
histogram x
x = rnorm(100)
hist(x)

Linear regression

reg y x1 x2
lm(y ~ x1 + x2, dat)
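Note that, unlike Stata's reg, calling lm() on its own only prints minimal output. A quick sketch of the more typical workflow, where you assign the fitted model and then pass it to summary() for the full table of coefficients and standard errors:

mod = lm(y ~ x1 + x2, dat)
summary(mod)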

Iteration (loops)

foreach i of numlist 1/10 {
   display `i' + 100
}
for (i in 1:10) {
    print(i + 100) 
}

# Aside 1: A single line works too here.
for (i in 1:10) print(i + 100)

# Aside 2: R provides "functional programming" equivalents
# to for-loops via the *apply family of functions. These
# have various advantages, which we won't get into here.
# Still, the most important member is arguably "lapply", which 
# we've already seen a couple of times and returns a list
# result (which is great for programming). Here's the
# equivalent lapply code to the previous for-loop.
lapply(1:10, function(i) print(i + 100))

ggplot2

Beautiful and customizable plots

ggplot2 is widely considered one of the preeminent plotting libraries available in any language. It provides an intuitive syntax that applies in the same way across many, many different kinds of visualizations, and with a deep level of customization. Plus, there are endless additional plugins to do what you want, including easy interactivity, animation, maps, etc. We thought about giving ggplot2 its own dedicated page like data.table and fixest. But instead we'll point you to the Figures section of the Library of Statistical Techniques, which already shows how to do many different graphing tasks in both Stata and ggplot2. For a more in-depth overview you can always consult the excellent package documentation, or Kieran Healy's wonderful Data Visualization book.

Basic scatterplot(s)

twoway scatter yvar xvar

twoway (scatter yvar xvar if group == 1, mc(blue)) ///
        (scatter yvar xvar if group == 2, mc(red))
ggplot(dat, aes(x = xvar, y = yvar)) + geom_point()

ggplot(dat, aes(x = xvar, y = yvar, color = group)) + 
  geom_point()
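To give a small taste of the customization we mentioned above, here's a sketch that extends the previous scatterplot with separate facets per group and a cleaner theme (assuming dat still contains the same xvar, yvar, and group columns):

ggplot(dat, aes(x = xvar, y = yvar, color = group)) +
  geom_point() +
  facet_wrap(~ group) +  # one panel per group
  theme_minimal()        # swap in a different look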

tidyverse

A family of data science tools

The tidyverse provides an extremely popular framework for data science tasks in R. This meta-package is actually a collection of smaller packages that are all designed to work together, based on a shared philosophy and syntax. We've already covered ggplot2 above, but there are plenty more. These include dplyr and tidyr, which offer an alternative syntax and approach to data wrangling tasks. While we personally recommend data.table, these tidyverse packages have many ardent fans too. You may find that you prefer their modular design and verbal syntax. But don't feel bound either way: it's totally fine to combine them. Some other tidyverse packages worth knowing about include purrr, which contains a suite of functions for automating and looping your work; lubridate, which makes working with date-based data easy; and stringr, which offers functions with straightforward syntax for working with string variables. In the examples that follow, note that %>% is a pipe operator.
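If pipes are new to you, the basic idea is that x %>% f(y) is just another way of writing f(x, y). So the following two lines are equivalent:

head(dat, 5)
dat %>% head(5)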

Data wrangling with dplyr

Note: dplyr doesn't modify data in place. So you'll need to (re)assign if you want to keep your changes. E.g. dat = dat %>% group_by...

Subset by rows and then columns.

keep if var1=="value"
keep var1 var2 var3
dat %>%
   filter(var1=="value") %>%
   select(var1, var2, var3)

Create a new variable by group.

bysort group1: egen mean_var1 = mean(var1)
dat %>%
  group_by(group1) %>%
  mutate(mean_var1 = mean(var1))

Collapse by group.

collapse (mean) mean_var1 = var1, by(group1)
dat %>%
  group_by(group1) %>%
  summarise(mean_var1 = mean(var1))

Manipulating dates with lubridate

* Shift a date forward one month (not 30 days, one month)
* ???
# Shift a date forward one month (not 30 days, one month)
shifted_date = date + months(1)
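To make that example self-contained, here's a minimal sketch using lubridate's ymd() parser (the date value is obviously just for illustration):

library(lubridate)
date = ymd("2021-01-15")
date + months(1)  # "2021-02-15"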

Iterating with purrr

Read in many files and append them together.

local filelist: dir "data/" files "*.csv"
tempfile mytmpfile
save `mytmpfile', replace empty
foreach x of local filelist {
	qui: import delimited "data/`x'", case(preserve) clear
	append using `mytmpfile'
	save `mytmpfile', replace
}
filelist = dir("data/", pattern=".csvquot;, full.names=TRUE)
dat = map_df(filelist, data.table::fread)

# Note: map_*df* means map (iterate) then coerce the
# result to a data frame

Iterate over variables.

ds, has(type long)
collapse (mean) `r(varlist)'
# Note: map is a stand-in replacement for lapply
dat[, map(.SD, mean), .SDcols=is.numeric]

String operations with stringr

subinstr("Hello world", "world", "universe", .)
substr("Hello world", 1, 3)
regexm("Hello world", "ello")

str_replace_all("Hello world", "world", "universe")
str_sub("Hello world", 1, 3)
str_detect("Hello world", "ello")
# Note: all stringr functions accept regex input
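To illustrate the regex point from that last comment, here's a quick (made-up) example:

str_replace_all("Hello world", "[aeiou]", "_")  # "H_ll_ w_rld"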

collapse

Extra convenience functions and super fast aggregations

Sure, we've gone on and on about how fast data.table is compared to just about everything else. But there is another R package that can boast even faster computation times for certain grouped calculations and transformations, and that's collapse. The collapse package doesn't try to do everything that data.table does. But the two play very well together and the former offers some convenience functions like descr and collap, which essentially mimic the equivalent functions in Stata and might be particularly appealing to readers of this guide. (P.S. If you'd like to load data.table and collapse at the same time, plus some other high-performance packages, check out the fastverse.)

Quick Summaries

summarize
describe
qsu(dat)
descr(dat)

Multiple grouped aggregations

collapse (mean) var1, by(group1)
collapse (min) min_var1=var1 min_var2=var2 (max) max_var1=var1 max_var2=var2, by(group1 group2)
collap(dat, var1 ~ group1, fmean) # 'fmean' => fast mean
collap(dat, var1 + var2 ~ group1 + group2, FUN = list(fmin, fmax))

sandwich

More standard error adjustments

The fixest package comes with plenty of shortcuts for accessing standard error adjustments like HC1 heteroskedasticity-robust standard errors, Newey-West, Driscoll-Kraay, clustered standard errors, etc. But of course there are still more than that. A host of additional options are covered by the sandwich package, which comes with a long list of functions like vcovBS() for bootstrapped standard errors, or vcovHC() for HC1-5. sandwich supports nearly every model class in R, so it shouldn't surprise you that these can slot right into fixest estimates, too. You shouldn't be using those ", robust" errors for smaller samples anyway... but you knew that, right?

Linear Model Adjustments

* ", robust" uses hc1 which isn't great for small samples
regress Y X Z, vce(hc3)
# sandwich's vcovHC uses HC3 by default
feols(Y ~ X + Z, dat, vcov = sandwich::vcovHC) 

# Aside: Remember that you can also adjust the SEs 
# for existing models on the fly 
m = feols(Y ~ X + Z, dat) 
summary(m, vcov = sandwich::vcovHC)
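The same slot-in approach extends to the other sandwich estimators mentioned above. For instance, a sketch of bootstrapped standard errors via vcovBS(), reusing the m object from the previous chunk (note that vcovBS refits the model for each bootstrap replication, so it can be slow on big problems):

summary(m, vcov = sandwich::vcovBS)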

modelsummary

Summary tables, regression tables, and more

The fixest package already has the etable() function for generating regression tables. However, it is only really intended to work with models from the same package. So we also recommend checking out the fantastic modelsummary package. It works with all sorts of model objects, including those not from fixest, is incredibly customizable, and outputs to a bunch of different formats (PDF, HTML, DOCX, etc.). Similarly, modelsummary has a wealth of options for producing publication-ready summary tables. Oh, and it produces coefficient plots too. Check out the package website for more.

Summary tables

* Summary stats table 
estpost summarize 
esttab, cells("count mean sd min max") nomtitle nonumber 

* Balance table 
by treat_var: eststo: estpost summarize 
esttab, cells("mean sd") label nodepvar
# Summary stats table 
datasummary_skim(dat) 


# Balance table 
datasummary_balance(~treat_var, dat)

Regression tables

Aside: Here we'll use the base R lm() (linear model) function, rather than feols(), to emphasize that modelsummary works with many different model classes.

reg Y X Z 
eststo est1 
esttab est1

reg Y X Z, vce(hc3) 
eststo est1b 
esttab est1b 

esttab est1 est1b

reg Y X Z A, vce(hc3)
eststo est2
esttab est1 est1b est2
est1 = lm(Y ~ X + Z, dat) 
msummary(est1) # msummary() = alias for modelsummary()

# Like fixest::etable(), SEs for existing models can
# be adjusted on-the-fly 
msummary(est1, vcov='hc3')

# Multiple SEs for the same model
msummary(est1, vcov=list('iid', 'hc3')) 

est3 = lm(Y ~ X + Z + A, dat) 
msummary(list(est1, est1, est3),
         vcov = list('iid', 'hc3', 'hc3'))
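As promised, coefficient plots are a one-liner too. Reusing est1 from above:

modelplot(est1)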

lme4

Random effects and mixed models

fixest can do a lot, but it can't do everything. This site isn't even going to attempt to go into how to translate every single model into R. But we'll quickly highlight random-effects and mixed models. The lme4 package and its lmer() function cover not just random-intercept models but also hierarchical models where slope coefficients follow random distributions. (Aside: If you prefer Bayesian models for this kind of thing, check out brms.)

Random effects and mixed models

xtset group time
xtreg Y X, re
mixed lifeexp || countryn: gdppercap
# No need for an xtset equivalent
m = lmer(Y ~ (1 | group) + X, data = dat)       # random intercepts by group
nm = lmer(Y ~ (1 + X | group) + X, data = dat)  # random intercepts and slopes
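Once a model is fit, lme4's extractor functions pull out the different components. A quick sketch using the m object from above:

summary(m)  # full model summary
fixef(m)    # fixed effects
ranef(m)    # random intercepts for each group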

marginaleffects

Marginal effects, contrasts, etc.

The Stata margins command is great. To replicate it in R, we highly recommend the marginaleffects package. It handles individual or average marginal effects for nonlinear models, models with interactions or transformations, and more. It's also very fast.

Basic logit marginal effects

* A logit:
logit Y X Z
margins, dydx(*)
# This example incorporates the fixest function feglm()
m = feglm(Y ~ X + Z, family = binomial, data = dat)
summary(marginaleffects(m))
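One caveat: the marginaleffects package has evolved quickly. In more recent versions, the summary(marginaleffects(m)) idiom shown above has been superseded by avg_slopes() (assuming you have an up-to-date version installed):

avg_slopes(m)  # average marginal effects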

multcomp / nlWaldTest

Joint coefficient tests

Stata provides a number of inbuilt commands for (potentially complex) postestimation coefficient tests. We've already seen the testparm command equivalent with fixest::wald(). But what about combinations of coefficients à la Stata's lincom and nlcom commands? The multcomp package handles a variety of linear tests and combinations, while nlWaldTest has you covered for nonlinear combinations.

Test other null hypotheses and coefficient combinations

regress y x z 

* One-sided test 
test _b[x]=0 
local sign_wgt = sign(_b[x]) 
display "H0: coef <= 0  p-value = " ttail(r(df_r),`sign_wgt'*sqrt(r(F))) 

* Test linear combination of coefficients 
lincom x + z 


* Test nonlinear combination of coefficients 
nlcom _b[x]/_b[z]
m = feols(y ~ x + z, dat)

# One-sided test 
m2 = multcomp::glht(m, 'x <= 0')
summary(m2) 


# Test linear combination of coefficients 
m3 = multcomp::glht(m, 'x + z = 0') 
summary(m3) # or confint(m3) 

# Test nonlinear combination of coefficients 
nlWaldTest::nlWaldtest(m, 'b[2]/b[3]') # or nlWaldTest::nlConfint()

sf

Geospatial operations

R has outstanding support for geospatial computation and mapping. There are a variety of packages to choose from here, depending on what you want (e.g. vector vs raster data, interactive maps, high-dimensional data cubes, etc.) But the workhorse geospatial tool for most R users is the incredibly versatile sf package. We'll only provide a simple mapping example below. The sf website has several in-depth tutorials, and we also recommend the Geocomputation with R book by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow.

Simple Map

* Mapping in Stata requires the spmap and shp2dta 
* commands, and also that you convert your (say) 
* shapefile to .dta format first. We won't go through 
* all that here, but see: 
* https://www.stata.com/support/faqs/graphics/spmap-and-maps/
# This example uses the North Carolina shapefile that is
# bundled with the sf package. 
nc = st_read(system.file("shape/nc.shp", package = "sf")) 
plot(nc[, 'BIR74'])
# Or, if you have ggplot2 loaded: 
ggplot(nc, aes(fill=BIR74)) + geom_sf()