Two- and three-dimensional non-parametric regressions

A pet peeve of mine is people who run dozens of linear regressions but never check whether linearity is a good assumption. Often, people will run linear models first and look for non-linearities only later. But there's no point in estimating lots of linear models first if they might be meaningless. For example, imagine a U-shaped curve of Y as a function of X. In this case, fitting a linear model and declaring that there is no relation between Y and X is not meaningful. As a general rule, it's probably a better strategy to look for non-linearities early and often.
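To make the U-shape point concrete, here is a minimal Python sketch (simulated data, not anything from the toolbox discussed in this post). It assumes Y = X^2 plus noise with X symmetric around zero; the function names `ols_slope` and `bin_means` are made up for this illustration. The fitted linear slope should come out close to zero even though the average of Y clearly moves with X.

```python
import random

random.seed(0)

# Simulated U-shaped relationship: Y = X^2 + noise, with X symmetric around 0.
x = [random.uniform(-2, 2) for _ in range(2000)]
y = [xi ** 2 + random.gauss(0, 0.1) for xi in x]

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def bin_means(x, y, edges):
    """Mean of y within each interval [edges[i], edges[i+1])."""
    means = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ys = [b for a, b in zip(x, y) if lo <= a < hi]
        means.append(sum(ys) / len(ys))
    return means

slope = ols_slope(x, y)
means = bin_means(x, y, [-2, -1, 0, 1, 2])
print("OLS slope:", round(slope, 3))
print("mean of Y by X bin:", [round(m, 2) for m in means])
```

The binned means are a crude conditional-mean estimate: high in the outer bins, low in the middle ones. A linear slope alone would hide this.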

With the goal of searching for specifications in mind, it's useful to have a method of non-parametric regression that's fast, since the researcher will have to try many things. Locally weighted polynomial regressions are fairly slow, so the faster and simpler Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964) comes in handy.
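For reference, the Nadaraya-Watson estimator in its standard textbook form is written below. The notation is mine rather than taken from the toolbox: K is a kernel function, h is the bandwidth, and (X_i, Y_i) are the n observations.

```latex
\hat{m}_h(x) \;=\;
\frac{\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right) Y_i}
     {\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)}
```

The estimate at x is just a kernel-weighted average of the Y_i. Unlike a local polynomial fit, it does not need a least-squares problem solved at every evaluation point, which is why it is cheaper.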

My code for the NW estimator in Matlab is here.
(If your data is in Stata and you want to move it to Matlab, see my code for that here).

It contains 4 .m files. The file test.m will generate some random data and estimate the conditional mean with the NW estimator. The code uses bootstrap resampling to estimate confidence intervals. All models use a user-chosen but fixed bandwidth.
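As a rough idea of how bootstrapped confidence intervals around an NW fit can be computed, here is a short Python sketch. It is not a translation of the Matlab files. It assumes that (X, Y) pairs are resampled with replacement and that pointwise percentile intervals are used, which may differ from what the toolbox does. The names `nw_gauss` and `bootstrap_band`, the simulated data, and the bandwidth are all made up for this example.

```python
import math
import random

def nw_gauss(x0, xs, ys, h):
    """Nadaraya-Watson estimate of E[Y | X = x0] with a normal kernel."""
    w = [math.exp(-0.5 * ((x0 - xi) / h) ** 2) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

def bootstrap_band(grid, xs, ys, h, reps=200, alpha=0.05, seed=0):
    """Pointwise percentile intervals from resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    draws = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        draws.append([nw_gauss(g, bx, by, h) for g in grid])
    lower, upper = [], []
    for j in range(len(grid)):
        col = sorted(d[j] for d in draws)
        lower.append(col[int((alpha / 2) * (reps - 1))])
        upper.append(col[int((1 - alpha / 2) * (reps - 1))])
    return lower, upper

# Simulated data: a smooth non-linear mean plus noise.
random.seed(3)
xs = [random.uniform(0, 3) for _ in range(200)]
ys = [math.sin(2 * xi) + random.gauss(0, 0.3) for xi in xs]
grid = [0.5, 1.0, 1.5, 2.0, 2.5]
h = 0.2  # fixed bandwidth, chosen by hand for this example

fit = [nw_gauss(g, xs, ys, h) for g in grid]
lower, upper = bootstrap_band(grid, xs, ys, h)
for g, f, lo, hi in zip(grid, fit, lower, upper):
    print(f"x={g:.1f}  fit={f:+.2f}  band=[{lo:+.2f}, {hi:+.2f}]")
```

The bandwidth stays the same in every bootstrap replication, in line with the fixed-bandwidth setup described above.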

With one independent variable, NWbootstrap.m (with a normal kernel) and NWbootstrap_epkv.m (with an Epanechnikov kernel) are appropriate.
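The two kernels differ mainly in how they weight distant points. The normal kernel gives every observation some weight, while the Epanechnikov kernel gives zero weight to anything further than one bandwidth away. Here is a small Python sketch of a one-regressor NW estimate with either kernel. The function names, the simulated data, and the bandwidth of 0.3 are illustrative assumptions, not the interface of the Matlab files.

```python
import math
import random

def normal_kernel(u):
    """Standard normal density: positive weight everywhere."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def epanechnikov_kernel(u):
    """Epanechnikov kernel: zero weight outside |u| <= 1."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def nw_estimate(x0, xs, ys, h, kernel):
    """Kernel-weighted average of ys around x0; NaN if no point gets weight."""
    w = [kernel((x0 - xi) / h) for xi in xs]
    total = sum(w)
    if total == 0:
        return float("nan")
    return sum(wi * yi for wi, yi in zip(w, ys)) / total

# Simulated U-shaped data: Y = X^2 + noise.
random.seed(1)
xs = [random.uniform(-2, 2) for _ in range(500)]
ys = [xi ** 2 + random.gauss(0, 0.2) for xi in xs]

for x0 in (-1.5, 0.0, 1.5):
    g = nw_estimate(x0, xs, ys, 0.3, normal_kernel)
    e = nw_estimate(x0, xs, ys, 0.3, epanechnikov_kernel)
    print(f"x0={x0:+.1f}  normal={g:.2f}  epanechnikov={e:.2f}")
```

One practical consequence of the compact support is that the Epanechnikov version returns no estimate (NaN in this sketch) at points with no data within one bandwidth, while the normal kernel will still extrapolate from far-away points.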

With two independent variables, NWbootstrap_epkv_3D.m is useful. This should be used, for example, if you are looking for "interaction terms" in a multiple regression model (the slope of Y with respect to one variable changes as a function of another variable).
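To illustrate how a fitted surface reveals an interaction, here is a Python sketch of a two-regressor NW estimate. It assumes a product of two Epanechnikov kernels, one per regressor; I am not claiming this is what NWbootstrap_epkv_3D.m does internally. The data are simulated as Y = X1 * X2 plus noise, and the names `nw_surface` and `slope_in_x1` are made up for the example.

```python
import random

def epan(u):
    """Epanechnikov kernel: zero weight outside |u| <= 1."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def nw_surface(p, x1, x2, y, h1, h2):
    """Estimate E[Y | X1 = p[0], X2 = p[1]] with a product kernel."""
    num = den = 0.0
    for a, b, yi in zip(x1, x2, y):
        w = epan((p[0] - a) / h1) * epan((p[1] - b) / h2)
        num += w * yi
        den += w
    return num / den if den > 0 else float("nan")

# Simulated interaction: the slope of Y in X1 equals X2.
random.seed(2)
n = 3000
x1 = [random.uniform(-1, 1) for _ in range(n)]
x2 = [random.uniform(-1, 1) for _ in range(n)]
y = [a * b + random.gauss(0, 0.1) for a, b in zip(x1, x2)]

def slope_in_x1(x2_level, h=0.3):
    """Rise of the fitted surface as X1 goes from -0.5 to 0.5, X2 held fixed."""
    hi = nw_surface((0.5, x2_level), x1, x2, y, h, h)
    lo = nw_surface((-0.5, x2_level), x1, x2, y, h, h)
    return hi - lo  # the run in X1 is 1.0, so this is the average slope

print("slope in X1 at X2 = -0.5:", round(slope_in_x1(-0.5), 2))
print("slope in X1 at X2 = +0.5:", round(slope_in_x1(0.5), 2))
```

The sign of the slope in X1 flips as X2 moves from negative to positive, which is the kind of pattern a regression without an interaction term would miss.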

UPDATE: A problem with dropping missing values has been fixed. The function drop_missing_Y is now in the NW toolbox to solve the problem.

All files have helpfiles with syntax. Below is a demonstration of the code.

This is the example in the file test.m.

Say you have three variables, Y, X1 and X2.  You want to model Y as a function of the other two.  If you are thinking of single variable regressions, you might just look at one variable at a time with a scatter:

If you estimate the conditional mean non-parametrically, you see there is some curvature.

Perhaps you believe that a second variable (X2) is also important for explaining the slope of Y on X1. You can plot all three in a scatter:

but it's a little bit overwhelming. The NWbootstrap_epkv_3D code will fit a surface that is the mean of Y conditional on a location (X1, X2).

Once you remove the overwhelming scatter, it becomes much easier to see what's going on:

(The whiskers are showing the bootstrapped confidence interval at each point.)

1 comment:

  1. Thank you for this post. Finally somebody who speaks about the possibility of non-linearity. It's possible to go through the whole first year of this program without ever hearing the possibility that data may be non-linear and consequently not multivariate normal. From studying natural science we know that most processes are in fact non-linear. No reason to think the same should not apply in social systems.