Michael Roberts has been trying to convince me to us restricted cubic splines to plot highly nonlinear functions, in part because they are extremely flexible and they have nice properties near their edges. Unlike polynomials, information at one end of the support only weakly influences fitted values at the other end of the support. Unlike the binned non-parametric methods I posted a few weeks ago, RC-splines are differentiable (smooth). Unlike other smooth non-parametric methods, RC-splines are fast to compute and easily account for control variables (like fixed effects) because they are summarized by just a few variables in an OLS regression. They can also be used with spatially robust standard errors or clustering, so they are great for nonlinear modeling of spatially correlated processes.
In short: they have lots of advantages. The only disadvantage is that it takes a bit of effort to plot them since there's no standard Stata command to do it.
Here's my function plot_rcspline.ado, which generates the spline variables for the independent variable, fits a spline while accounting for control variables, and plots the partial effect of the specified independent variables (adjusting for the control vars) with confidence intervals (computed via delta method). It's as easy as
plot_rcspline y x
and you get something like
where the "knots" are plotted as the vertical lines (optional).
Help file below the fold. Enjoy!
Related non-linear plotting functions previously posted on FE:
- Boxplot regression
- Visually-weighted regression
- Watercolor regression
- Non-parametric three-dimensional regression
- Binned non-parametric regression
- Polynomial regression of any degree
SHSIANG@PRINCETON.EDU
2/2013
-------------------------------------------------------
SYNTAX
PLOT_RCSPLINE Yvar Xvar [Zvars] [if] , [NKNOTS(integer 5) KNOTPREF(string) BINS(integer 100) SAVE_data_file(string) plotcommand(string) regcommand(string) cilevel(integer 95) PLOTKNOTS NOCI]
-------------------------------------------------------
PLOT_RCSPLINE is designed to show a nonlinear relationship between Xvar and Yvar, adjusted for other variables Zvar1 ... Zvarn that are specified as controls. Confidence intervals are computed by the delta method.
PLOT_RCSPLINE uses a restricted cubic spline approach to estimating the nonlinear relationship between Yvar and Xvar. The spline is estimated with NKNOTS (5 is the default) and the spline variables in Xvar are generated by PLOT_RCSPLINE. One usage of PLOT_RCSPLINE is to generate the spline variables that will be used elsewhere in an analysis, while allowing the user to visually check that the partial response function (in Xvar) is reasonably specified using the spline.
PLOT_RCSPLINE does not automatically save the figure. Use GRAPH EXPORT or GRAPH SAVE afterwards to save the figure.
-------------------------------------------------------
Required arguments:
Yvar - the dependent variable to be plotted
Xvar - the independent variable to be plotted
Optional arguments:
Zvars - control variables in the model
PLOTKNOTS - plots the knots as vertical lines
NKNOTS - the number of knots used in the spline, default = 5 (the number of variables representing the spline in the regression is one less than NKNOTS)
KNOTPREF() - a prefix used to name the variables generated for the spline
NOCI - suppress the plotting of the confidence intervals
CILEVEL() - Integer specifying the confidence level to be plotted. The default is cilevel(95).
REGCOMMAND() - String that is added to the regression command as an option, eg. "cluster(year)"
BINS() - Number of "bins" that are used to plot the final response funtion. Fewer bins will make the response function look "jagged", more bins will make the response look smoother. Default is 100.
SAVE_data_file() - String that specifies a path/file to save the dataset of estimates used to plot the response function. These are the values of the response function (with SE) at each bin of the independent variable. If nothing is specified, the data used to plot the results will not be saved.
PLOTCOMMAND() - String arguments that are appending to the final plotting command. Eg. to label the plot with the title "X vs Y" and xtitle "X" type: plot_options("title(X vs Y) xtit(X)").
-------------------------------------------------------
EXAMPLES:
set obs 1000
gen year = ceil(_n/100)
gen x=4*uniform()
gen z=2*uniform()
gen w=2*uniform()
gen e = 4*rnormal()
gen y= w + 3*z + 4*x - x^2 + .1*x^3 + e
plot_rcspline y x
plot_rcspline y x z w, nknots(7) noci plotknots bins(20) knotpref(_x_knots_)
plot_rcspline y x z w, nknots(7) plotknots cilevel(90) plotcommand("tit(E[y|x,z,w])") regcommand("cluster(year)")
-------------------------------------------------------
*/
Solomon: I agree with your advocacy of restricted cubic splines as a good scatter plot smoother.
ReplyDeleteYou are correct that there is no official command to do this in Stata, if that's what you mean by "standard". As far as user-written commands are concerned, -rcspline- has been downloadable from SSC since 2007. It doesn't have quite the same objectives as your program: yours goes further in supporting what you call control variables). -rcspline- always shows the original data for the variables plotted.
In terms of your program, Stata users would typically expect a help file that would explain the syntax in a standard way, not documentation embedded in the code. Your implementation makes use of permanent variable names in a way that poses a small but not zero risk of messing up existing datasets and in a future revision could usefully support -in- as well as -if-.
In a quick test of your program I went
sysuse auto, clear
plot_rcspline mpg weight
and I see predictions shown for -mpg- that are all negative. What did I do wrong?
Nick: thanks for pointing out rcspline (and writing it!). It looks really nice.
DeleteOne day, I will learn how to write proper help files for Stata... but until then, I apologize that getting the details on my functions is a little unwieldy.
In your test, the predictions for mpg are all negative because plot_rcspline does not account for the constant term. The reason is that when there are multiple additive constants, I was only interested in the marginal effect of the variable of interest, all other non-weight variables are not included in the prediction. If you center the variables first (so the constant is zero), you see that the scatter is identical to the spline:
sysuse auto, clear
center mpg weight
sc c_mpg c_weight
plot_rcspline c_mpg c_weight
In the next version, I'll make including the constant an option. Thanks for pointing this out.
Thanks for this. When there are no control variables, it's possible just to use -predict- directly to get predictions in terms of the response. I'd guess other users than myself might expect that to be the default with no control variables, but as you imply that clashes with your broader purpose in getting marginal effects. That in turn raises the question of getting Stata's -marginsplot- to do your work for you.
DeleteAlright, a few other folks emailed about plots that dropped the constant term in a bivariate model, so I've updated the code to have an option "keepcons". This option will keep the constant term when computing the spline, which is appropriate only in models where there are no control variables. So if you are plotting the relationship between X and Y, and you want it look similar to what you would get if you typed "lowess Y X" then you should type
ReplyDeleteplot_rcspline Y X, keepcons
and you should recover a curve that looks more familiar. If you include controls Z but type
plot_rcspline Y X Z, keepcons
then you will keep the constant but drop the average effect of Z when you plot the spline. This can be interpreted as the prediction of the model when all covariates are set to zero (note that this doesn't make too much sense in many models, so I would stick to using the "keepcons" option only in bivariate models).