Regression analysis with fixest
fixest (by Laurent Bergé) is a package designed from the ground up in C++ to make running regressions fast and incredibly easy. It provides built-in support for a variety of linear and nonlinear models, as well as regression tables and plotting methods.
Installation
Before continuing, make sure that you have installed fixest. You only have to do this once (or as often as you want to update the package).
# Install from CRANinstall.packages("fixest")
# Alternatively, you can install the latest development version# install.packages("fixest", repos = "https://fastverse.r-universe.dev")
Once fixest is installed, don’t forget to load it whenever you want to use it. Unlike Stata, you have to re-load a package every time you start a new R session.
# Load fixest into our current R sessionlibrary(fixest)
All of the examples in this section will use a modified dataset from the CPS with some added variables for demonstration purposes. To load the data run the following:
import delimited using /// "https://raw.githubusercontent.com/stata2r/stata2r.github.io/main/data/cps_long.csv", clear
# Base R reads CSVs too, but we'll use data.table heredat = data.table::fread('https://raw.githubusercontent.com/stata2r/stata2r.github.io/main/data/cps_long.csv')
Introduction
The fixest package contains a highly flexible set of functions that allow you to estimate a large set of regression models. While the package obviously doesn’t cover every model out there, there is a non-negligible subset of Stata users for whom every model they’ve ever needed to estimate is covered by fixest.
This includes regular linear regression via the feols()
function, which extends the Base R lm()
function by supporting (high-dimenional) fixed effects. But fixest isn’t just limited to linear regression. The package supports efficient instrumental variables (IV) estimation, as well as a wide range of GLM models like logit, probit, Poisson, and negative binomial.
You also get a lot of convenience features with fixest. Adjusting for heteroskedasticity-robust or clustered standard errors is easily done via the vcov
option. The package provides built-in methods for exporting regression tables and coefficient plots. You can select long lists of control variables without having to type them all in, retrieve estimated fixed effects, conduct Wald tests, adjust the reference levels for categorical variables and interactions on the fly, efficiently estimate simulation studies, etc. etc. You even get some stuff that’s rather tricky in Stata, like automatically iterating over a bunch of model specifications, basic and staggered difference-in-difference support, or Conley standard errors.
fixest offers all of this in addition to being very fast. If you felt a speed boost going from Stata’s xtreg
to reghdfe,
get ready for another significant improvement when moving to fixest.
Using fixest for regression starts with writing a formula. While there are plenty of bells and whistles to add, at its core regression formulas take the form y ~ x1 + x2 | fe1 + fe2
, where y
is the outcome, x1
and x2
are predictors, and fe1
and fe2
are your sets of fixed effects.
Models
Unlike Stata, which only ever has one active dataset in memory, remember that
having multiple datasets in your global environment is the norm in R. We
highlight this difference to head off a very common error for new Stata R users:
you need to specify which dataset you’re using in your model calls, e.g.
feols(..., data = dat)
. We’ll see lots of examples below. At the same time,
note that fixest allows you to set various global
options,
including which dataset you want to use for all of your regressions. Again,
we’ll see examples below.
Simple model
reg wage educreg wage educ age
feols(wage ~ educ, data = dat)feols(wage ~ educ + age, data = dat)
# Aside 1: `data = ...` is the first argument after# the model formula. So many R users would just write:feols(wage ~ educ, dat)
# Aside 2: You can also set your dataset globally so# that you don't have to reference it each time.setFixest_estimation(data = dat)feols(wage ~ educ)feols(wage ~ educ + age)# etc.
Categorical variables
reg wage educ i.treat
* Specifying a baseline:reg wage educ ib1.treat
feols(wage ~ educ + i(treat), dat)
# Specifying a baseline:feols(wage ~ educ + i(treat, ref = 1), dat)
Fixed effects
reghdfe wage educ, absorb(countyfips) cluster(countyfips)
reghdfe wage educ, absorb(countyfips)
* Add more fixed effects...reghdfe wage educ, absorb(countyfips year) /// vce(cluster countyfips year)reghdfe wage educ, absorb(countyfips#year) /// vce(cluster countyfips#year)
feols(wage ~ educ | countyfips, dat, vcov = ~countyfips)# feols(wage ~ educ | countyfips, dat) # same (see below)
# Note: fixest automatically clusters SEs by the first# fixed effect (if there are any). We'll get to SEs# later, but if you just want iid errors for a fixed# effect model:feols(wage ~ educ | countyfips, dat, vcov = 'iid')
# Add more fixed effects...feols(wage ~ educ | countyfips + year, dat, vcov = ~countyfips + year)feols(wage ~ educ | countyfips^year, dat) # defaults to vcov = ~countyfips^year
Weights
gen w = runiform(0.1, 0.9)reg wage educ i.treat [aw = w]
dat$w = runif(nrow(dat), 0.1, 0.9)feols(wage ~ educ + i(treat), dat, weights = ~w)
Instrumental variables
ivreg 2sls wage (educ = age)ivreg 2sls wage marr (educ = age)
* With fixed effectsivreghdfe 2sls wage marr (educ = age), absorb(countyfips)
feols(wage ~ 1 | educ ~ age, dat)feols(wage ~ marr | educ ~ age, dat)
# With fixed effects (IV 1st stage always comes last)feols(wage ~ marr | countyfips | educ ~ age, dat)
Nonlinear models
While we don’t really show it here, note that (almost) all of the functionality that
this page demonstrates w.r.t. feols()
carries over to fixest’s non-linear
estimation functions too (feglm()
, fepois()
, etc.). This includes SE
adjustment, and so forth.
xtset statefipslogit marr age black hisp
* Note: Attempting to replicate the feglm() model with fixed* effects at right using xtlogit or xtprobit leads to* numerical overflow or matsize issues
* https://github.com/sergiocorreia/ppmlhdfeppmlhdfe educ age black hisp, absorb(statefips year) /// vce(cluster statefips)
feglm(marr ~ age + black + hisp, dat, family = 'logit')
# Add fixed effects (probit this time)feglm(marr ~ age + black + hisp | statefips + year, dat, family = 'probit')
# fepois() is there for Poisson regressionfepois(educ ~ age + black + hisp | statefips + year, dat)
Macros, wildcards and shortcuts
local ctrls age black hisp marrreg wage educ `ctrls'
reg wage educ x*reg wage educ *spreg wage educ *ac*
ctrls = c("age", "black", "hisp", "marr")feols(wage ~ educ + .[ctrls], dat)
feols(wage ~ educ + ..('^x'), dat) # ^ = starts withfeols(wage ~ educ + ..('sp$'), dat) # $ = ends withfeols(wage ~ educ + ..('ac'), dat)
# Many more macro options. See `?setFixest_fml` and# `?setFixest_estimation`. Example (reminder) where# you set your dataset globally, so you don't have to# retype `data = ...` anymore.setFixest_estimation(data = dat)feols(wage ~ educ)feols(wage ~ educ + .[ctrls] | statefips)# Etc.
Multi-model estimations (advanced)
fixest supports a variety of
multi-model
capabilities. Not only are these efficient from a coding perspective (you can get
away with much less typing), but they are also highly optimized. For example,
if you run a multi-model estimation with the same group of fixed-effects then
fixest will only compute those fixed-effects once for all models. The next
group of examples are meant to highlight some specific examples of this
functionality. They don’t necessarily have direct Stata equivalents that we are
aware of. Moreover, while we don’t show it here, please note that all of these
options can be combined (e.g. split sample with stepwise regression).
Multi-model objects can also be sent directly to presentation
functions like etable()
and coefplot()
.
Split sample
## Separate regressions for hispanic and non-hispanicfeols(wage ~ educ | countyfips, dat, split = ~hisp)
# As above, but now includes the full sample as a 3rd regfeols(wage ~ educ | countyfips, dat, fsplit = ~hisp)
Multiple dependent variables
# Regress wage & educ separately on the same indep. vars & FEsfeols(c(wage, educ) ~ age + marr | countyfips + year, dat)
Stepwise regression
# Stepwise. First reg doesn't include "marr", second reg# doesn't include "age"feols(wage ~ educ + sw(age, marr), dat)
# Cumulative stepwise. As above, except now the second reg# includes "age".feols(wage ~ educ + csw(age, marr), dat)
# Stepwise operators work in the FE slot toofeols(wage ~ educ | csw(year, statefips), dat)
# Aside: You can also use "sw0()" and "csw0()", in which# case you'll get an extra regression at the start that# doesn't include the stepwise components.
Interactions
Interact continuous variables
reg wage c.educ#c.agereg wage c.educ##c.age
* Polynomialsreg wage c.age#c.agereg wage c.age##c.age
feols(wage ~ educ:age, dat)feols(wage ~ educ*age, dat)
# Polynomialsfeols(wage ~ I(age^2), dat)feols(wage ~ poly(age, 2, raw = TRUE))
Interact categorical variables
reg wage i.treat#i.hisp
reg wage i.treat i.treat#i.hispreg wage i.treat##i.hisp
feols(wage ~ i(treat, i.hisp), dat)
# Aside: i() is a fixest-specific shortcut that also# has synergies with some other fixest functions. But# base R interaction operators all still work, e.g.feols(wage ~ factor(treat)/factor(hisp), dat)feols(wage ~ factor(treat)*factor(hisp), dat)
Interact categorical with continuous variables
reg wage i.treat#c.age
reg wage i.treat#c.agereg wage i.treat i.treat#c.agereg wage i.treat##c.age
feols(wage ~ i(treat, age), dat)
# Aside: i() is a fixest-specific shortcut that also# has synergies with some other fixest functions. But# base R interaction operators all still work, e.g.feols(wage ~ factor(treat):age, dat)feols(wage ~ factor(treat)/age, dat)feols(wage ~ factor(treat)*age, dat)
Difference-in-differences
In addition to the ability to estimate a difference-in-differences design using two-way fixed effects (if the design is appropriate for that; no staggered treatment, for instance), fixest offers several other DID-specific tools. The below examples use generic data sets, since the CPS data used in the rest of this page is not appropriate for DID.
* No immediate Stata equivalent to did_means that we know of,* although you could replicate much of it by hand with an* elaborate call to table
* Sun and Abraham can be estimated using the* eventstudyinteract package on ssc
# did_means() provides tables of means, SEs, and treatment/# control and pre/post differences for 2x2 DIDdid_means(outcome + control ~ treat | post)
# sunab() produces interactions of the type that allow you to# estimate the Sun & Abraham model for staggered treatment# timing, and automatically get average treatment effects for# each relative periodfeols(y ~ control + sunab(year_treated, year))
Interact fixed effects
* Combine fixed effectsreghdfe wage educ, absorb(statefips#year)
* Varying slopes (e.g. time trend for each state)reghdfe wage educ, absorb(statefips#c.year) /// vce(cluster statefips#c.year)
# Combine fixed effectsfeols(wage ~ educ | statefips^year, dat)
# Varying slopes (e.g. time trend for each state)feols(wage ~ educ | statefips[year], dat)
Standard errors
While you can specify standard errors inside the original fixest model call (just like Stata), a unique feature of R is that you can adjust errors for an existing model on-the-fly. This has several benefits, including being much more efficient since you don’t have to re-estimate your whole model. We’ll try to highlight examples of both approaches below.
HC
reg wage educ, vce(robust)reg wage educ, vce(hc3)
feols(wage ~ educ, dat, vcov = 'hc1')feols(wage ~ educ, dat, vcov = sandwich::vcovHC)
# Note: You can also adjust the SEs of an existing modelm = feols(wage ~ educ, dat) # iidsummary(m, vcov = 'hc1') # switch to HC1
HAC
xtset id yearivreghdfe wage educ, bw(auto) vce(robust)
feols(y ~ x, dat, vcov = 'NW', panel.id = ~unit + time)# feols(y ~ x, dat, vcov = 'NW') # if panel id already set (see below)
Clustered
reghdfe wage educ, absorb(countyfips) /// vce(cluster countyfips)
* Twoway clustering etc.reghdfe wage educ, absorb(countyfips year) /// vce(cluster countyfips year)
reghdfe wage educ, absorb(countyfips#year) /// vce(cluster countyfips#year)
feols(wage ~ educ | countyfips, dat) # Auto clusters by FE# feols(wage ~ educ | countyfips, dat, vcov = ~countyfips) # ofc can be explicit too
# Twoway clustering etc.feols(wage ~ educ | countyfips + year, dat, vcov = ~countyfips + year)# feols(wage ~ educ | countyfips + year,# dat, vcov = 'twoway') ## same as above
feols(wage ~ educ | countyfips^year, dat, vcov = ~countyfips^year)
Conley standard errors
* See: http://www.trfetzer.com/conley-spatial-hac-errors-with-fixed-effects/
feols(wage ~ educ, dat, vcov = conley("25 mi"))
On-the-fly SE adjustment
We’re belabouring the point now, but one last reminder that you can adjust the
standard errors for existing models “on the fly” by passing the vcov = ...
argument. There’s no performance penalty, since the adjustment is done
instantaneously and it therefore has the virtue of separating the mechanical
computation stage of model estimation from the inference stage. As we’ll see
below, on-the-fly SE adjustment works for a variety of other fixest
functions, e.g. etable()
. But here is a quick example using summary()
:
m = feols(wage ~ educ | countyfips + year, dat)m # Clustered by countyfips (default)summary(m, vcov = 'iid') # Switch to iid errorssummary(m, vcov = 'twoway') # Cluster by countyfips and yearsummary(m, vcov = ~countyfips^year) # Cluster by countyfips*year interaction
Presentation
Regression table
reg wage educ ageeststo est1esttab est1
* Add second regressionreg wage educ age black hispeststo est2esttab est1 est2
* Export to TeXesttab using "regtable.tex", replace
est1 = feols(wage ~ educ + age, dat)etable(est1)
# Add second regressionest2 = feols(wage ~ educ + age + black + hisp, dat)etable(est1, est2)
# Export to Texetable(est1, est2, file = "regtable.tex")
Note: The etable()
function is extremely flexible and includes support for
many things that we won’t show you here. See the relevant vignettes for more
(1,
2). Below we highlight a few unique features that don’t have direct Stata
equivalents. (You could potentially mimic with a loop, but that will require
more code and be slower, since your whole model has to be re-estimated each
time.)
# SEs for existing models can be adjusted on-the-flyetable(est1, est2, vcov = 'hc1')
# Report multiple SEs for the same modeletable(est1, vcov = list('iid', 'hc1', ~id, ~countyfips))
# Multi-model example: Two dep. vars, stepwise coefs,# varying slopes, split sample, etc. (18 models in total!)est_mult = feols(c(wage, age) ~ educ + csw0(hisp, black) | statefips[year], dat, fsplit = ~marr)etable(est_mult, vcov = ~statefips^year)
Joint test of coefficients
* Rename so we can use the wildcard laterrename (black hisp) (raceeth_black raceeth_hisp)regress wage educ age raceeth_* marrtestparm raceeth_black raceeth_hisptestparm raceeth_*
# Rename so we can use a regular expression laterdata.table::setnames(dat, c('black','hisp'), c('raceeth_black','raceeth_hisp'))est1 = feols(wage ~ educ + age + ..('raceeth_') + marr, dat)wald(est1, c('raceeth_black','raceeth_hisp'))wald(est1, 'raceeth_')
Coefficient plot
* Assume we have est1 and est2 from abovecoefplot est1coefplot est1 est2
# Assume we have est1 and est2 from abovecoefplot(est1)coefplot(list(est1, est2))
Interaction Plot
regress wage hisp##c.age
* Show how effect differs by groupmargins hisp, dydx(age)marginsplot
# Show predictive margins with an interactionregress wage hisp##c.agemargins hisp, at(age = (16(1)55))* The recast here gives a line and error area instead of points and linesmarginsplot, recast(line) recastci(rarea)
est1 = feols(wage ~ i(hisp, age), dat)
# Show how effect differs by groupiplot(est1)
# Show predictive margins with an interaction# This requires plot_predictions from the marginaleffects packagelibrary(marginaleffects)plot_predictions(est1, condition = c('age','hisp'))
Panel
Note: You don’t need to specify your panel variables globally and this functionality is mostly for convenience features associated with time-series operations like leads and lags. You can also use panel(dat, ~ id + var)
to do so on-the-fly in your regression call. But Laurent, the fixest author, recommends setting the panel ID globally when applicable, so that’s what we do below.
Lag variables
xtset id yearreg wage educ l1.wage
setFixest_estimation(panel.id = ~id+year)feols(wage ~ educ + l(wage, 1), dat)# feols(wage ~ educ + l(wage, 1), dat, panel.id = ~id+year) # if not set
Lead variables
xtset id yearreg wage educ f1.wage
# setFixest_estimation(panel.id = ~id+year) # already setfeols(wage ~ educ + f(wage, 1), dat)
First difference
xtset id yearreg wage educ D.x
# setFixest_estimation(panel.id = ~id+year) # already setfeols(wage ~ educ + d(wage), dat)