
6.05.2013

Souped-up Watercolor Regression

I introduced "watercolor regression" here on FE several months ago, after some helpful discussions with Andrew Gelman and our readers. Over the last few months, I've made a few upgrades that I think significantly increase the utility of this approach for people doing work similar to my own.

First, the original paper is now on SSRN and documents the watercolor approach, explaining its relationship to the more general idea of visual-weighting.
Visually-Weighted Regression 
Abstract: Uncertainty in regression can be efficiently and effectively communicated using the visual properties of statistical objects in a regression display. Altering the “visual weight” of lines and shapes to depict the quality of information represented clearly communicates statistical confidence even when readers are unfamiliar with the formal and abstract definitions of statistical uncertainty. Here we present examples where the color-saturation and contrast of regression lines and confidence intervals are parametrized by local measures of an estimate’s variance. The results are simple, visually intuitive and graphically compact displays of statistical uncertainty. This approach is generalizable to almost all forms of regression.
Second, the Matlab code I've posted to do watercolor regression is now parallelized. If you have Matlab running on multiple processors, the code automatically detects this and runs the bootstrap procedure in parallel. This is helpful because a large number of resamples (>500) is important for getting the distribution of estimates (the watercolored part of the plot) to converge, but serial resampling gets very slow for large data sets (eg. >1M obs), especially when block-bootstrapping (see below).
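For anyone curious what the parallel bootstrap loop looks like, here is a minimal sketch (illustrative only, not the posted code; it assumes a data vector y is in memory and a worker pool is open):

n = numel(y);                 % sample size
B = 1000;                     % number of resamples
est = zeros(B,1);
parfor b = 1:B                % resamples are independent, so they parallelize
    idx = randi(n, n, 1);     % resample observations with replacement
    est(b) = mean(y(idx));    % hypothetical statistic computed per resample
end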

Third, the code now has an option to run a block bootstrap. This is important if you have data with serial or spatial autocorrelation (eg. models of crop yields that change in response to weather). To see this at work, suppose we have some data where there is a weak dependence of Y on X, but all observations within a block (eg. obs within a single year) have a uniform level-shift induced by some unobservable process:
e = randn(1000,1);              % iid noise
block = repmat([1:10]',100,1);  % block labels 1-10, each appearing 100 times
x = 2*randn(1000,1);            % independent variable (weak effect on y)
y = x + 10*block + e;           % common level-shift within each block
The scatter of this data looks like:


where each one of the stripes of data is a block of obs with correlated residuals. Running watercolor_reg without block-bootstrapping
watercolor_reg(x,y,100,1.25,500)
we get an exaggerated sense of precision in the relationship between Y and X:


If we try to account for the fact that residuals within a block are not independent by using the block bootstrap
watercolor_reg(x,y,100,1.25,500,block)
we get a very different result:



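The mechanics of the block bootstrap are simple: resample whole blocks (with replacement) instead of individual observations, so that within-block dependence survives in every resample. A minimal sketch of a single resample, using the variables generated above (illustrative, not necessarily the exact code inside watercolor_reg):

ids = unique(block);                          % distinct block labels
draw = ids(randi(numel(ids), numel(ids), 1)); % draw labels with replacement
idx = [];
for b = 1:numel(draw)
    idx = [idx; find(block == draw(b))];      % collect obs in each drawn block
end
xb = x(idx); yb = y(idx);                     % one block-bootstrap sample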
Finally, the last addition to the code is a simple option to clip the watercoloring at the edge of a specified confidence interval (default is 95%), an idea suggested by Ted Miguel. This gives us a watercolor plot that also lets us conduct some traditional hypothesis tests visually, without violating the principles of visual weighting. Applying this option to the example above
blue = [0 0 .3];
watercolor_reg(x,y,100,1.25,500,block,blue,'CLIPCI')
we obtain a plot with a clear 95% CI, where the likelihoods within the CI are indicated by watercoloring:


Code is here. Enjoy!

8.31.2012

Watercolor regression

I'm in a rush, so I will explain this better later. But Andrew Gelman posted my idea for a type of visually-weighted regression that I jokingly called "watercolor regression" without a picture, so it's a little tough to see what I was talking about. Here is the same email but with the pictures to go along with it. The code to do your own watercolor regression is here as the 'SMOOTH' option to vwregress. (Update: I've added a separate, cleaner function watercolor_reg.m for folks who want to tinker with the code but don't want to wade through all the other options built into vwregress. Update 2: I've added watercolor regression as a second example in the revised paper here.)

This was the email with figures included:

Two small follow-ups based on the discussion (the second/bigger one is to address your comment about the 95% CI edges).

1. I realized that if we plot the confidence intervals as a solid color that fades (eg. using the "fixed ink" scheme from before), we can make sure the regression line also has heightened visual weight where confidence is high by plotting the line in white. This makes the contrast (and thus visual weight) between the regression line and the CI highest when the CI is narrow and dark. As the CI fades near the edges, so does its contrast with the regression line. This is a small adjustment, but I like it because it is so simple and it makes the graph much nicer.


My posted code has been updated to do this automatically.

2. You and your readers didn't like that the edges of the filled CI were so sharp and arbitrary. But I didn't like that the contrast between the spaghetti lines and the background had so much visual weight.  So to meet in the middle, I smoothed the spaghetti plot to get a nonparametric estimate of the probability that the conditional mean is at a given value:


To do this, after generating the spaghetti through bootstrapping, I estimate a kernel density of the spaghetti in the Y dimension for each value of X. I set the visual-weighting scheme so it still "preserves ink" along a vertical line-integral, so the distribution dims where it widens, since the ink is being "stretched out". To me, it kind of looks like a watercolor painting -- maybe we should call it a "watercolor regression" or something like that.

The watercolor regression turned out to be more of a coding challenge than I expected, because the bandwidth for the kernel smoothing has to adjust to the width of the CI. And since several people seem to like R better than Matlab, I attached 2 figs to show them how I did this. Once you have the bootstrapped spaghetti plot:


I defined a new coordinate system that spanned the range of bootstrapped estimates for each value in X:


The kernel smoothing is then executed along the vertical columns of this new coordinate system.
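In pseudo-Matlab, the procedure looks something like this (a minimal sketch with illustrative names -- BETA holds the bootstrapped estimates, one row per resample and one column per x-bin -- not the posted function):

[B, nb] = size(BETA);
ny = 200;                                 % vertical resolution
u = linspace(0, 1, ny)';                  % common rescaled grid
D = zeros(ny, nb);
for j = 1:nb
    lo = min(BETA(:,j)); hi = max(BETA(:,j));
    z = (BETA(:,j) - lo) / (hi - lo);     % spaghetti mapped to [0,1] coords
    D(:,j) = ksdensity(z, u, 'width', 0.08); % bandwidth now tracks the CI width
end
D = D ./ repmat(sum(D,1), ny, 1);         % "preserve ink" in each column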

I've updated the code posted online to include this new option. This Matlab code will generate a similar plot using my vwregress function:

x = randn(100,1);       % independent variable
e = randn(100,1);       % noise
y = 2*x + x.^2 + 4*e;   % quadratic signal

bins = 200;             % number of x-bins
color = [.5 0 0];       % base color (dark red)
resamples = 500;        % bootstrap resamples
bw = 0.8;               % kernel bandwidth

vwregress(x, y, bins, bw, resamples, color, 'SMOOTH');


NOTE TO R USERS: The day after my email to Andrew, Felix Schönbrodt posted a nice, similar variant with code in R here.

Update: For overlaid regressions, I still prefer the simpler visually-weighted line (last two figs here) since this is what overlaid watercolor regressions look like:


It might look better if the scheme made the blue overlay fade from blue-to-clear rather than blue-to-white, but then it would be mixing (in the color sense) with the red so the overlaid region would then start looking like very dark purple. If someone wants to code that up, I'd like to see it. But I'm predicting it won't look so nice.

8.20.2012

Visually-weighted confidence intervals

Following up on my earlier post describing visually weighted regression (paper here) and some suggestions from Andrew Gelman and others, I adjusted my Matlab function (vwregress.m, posted here) and just thought I'd visually document some of the options I've tried.

All of this information is available if you type help vwregress once the program is installed, but I think looking at the pictures helps.

The basic visually weighted regression is just a conditional mean where the visual weight of the line reflects uncertainty. Personally, I like the simple non-parametric plot overlaid with the OLS regression, since it's clean and helps us see whether a linear approximation is a reasonable fit or not:
vwregress(x, y, 300, .5,'OLS',[0 0 1])


Confidence intervals (CI) can be added, and visually-weighted according to the same scheme as the conditional mean:
vwregress(x, y, 300, .5, 200,[0 0 1]);

Since the CI band is bootstrapped, Gelman suggested that we overlay the spaghetti plot of resampled estimates, so I added the option 'SPAG' to do this. If the spaghetti are plotted using a solid color (option 'SOLID'), this ends up looking quasi-visually-weighted:
vwregress(x, y, 300, .5,'SPAG','SOLID',200,[0 0 1]);

But since it gets kind of nasty looking near the edges, where the estimates go a little haywire as the observations get thin, we can visually weight the spaghetti too to keep it from getting distracting (just omit the 'SOLID' option).
vwregress(x, y, 300, .5,'SPAG',200,[0 0 1]);

Gelman also suggested that we try approximating the spaghetti by smoothly filling in the CI band, using the original visual-weighting scheme. To do this, I added the 'FILL' option. I like the result quite a bit (even more than the spaghetti, but others may disagree). [Warning: this plotting combination may be very slow, especially with a lot of resamples.]
vwregress(x, y, 300, .5,'FILL',200,[0 0 1]);

If 'SOLID' is combined with 'FILL', only the conditional mean is plotted with solid coloring. (This differs from 'SPAG' and the simple CI bounds).
vwregress(x, y, 300, .5,'FILL','SOLID',200,[0 0 1]);

Finally, I included the 'CI' option which changes the visual-weighting scheme from using weights 1/sqrt(N) to using 1/(CI_max - CI_min), where CI_max is the upper limit (point-wise) of the CI and CI_min is the lower limit. 

I like this because if we combine this with 'FILL', then the confidence band "conserves ink" (which we equate with confidence) in the y-dimension. Imagine that we squirt out ink uniformly to draw the conditional mean and then smear the ink vertically so that it stretches from the lower confidence bound to the upper confidence bound.  In places where the CI band is narrow, this will cause very little spreading of the ink so the CI band will be dark. But in places where the CI band is wide, the ink is smeared a lot so it gets lighter. For any vertical sliver of the CI band (think dx) the amount of ink displayed (integrated along a vertical line) will be constant.
vwregress(x, y, 300, .5,'CI','FILL',200,[0 0 1]);
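In code, the 'CI' weighting is just a couple of lines (a minimal sketch with illustrative names; ci_hi and ci_lo stand for the pointwise CI bounds):

w = 1 ./ (ci_hi - ci_lo);  % visual weight inversely proportional to CI width
w = w / max(w);            % rescale to [0,1] for use as color saturation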

For Stata users, I have written vwlowess.ado (here), but unfortunately it does not yet have any of these options.

Lucas Leeman has implemented some of these ideas in R (see here), so maybe he'll make that code available.

All of the above plots were made with the random data:
x = randn(200,1);                    % independent variable
e = randn(200,1).*(1+abs(x)).^1.5;   % heteroskedastic noise (widens at the edges)
y = 2*x + x.^2 + 4*e;                % quadratic signal

7.30.2012

Visually-Weighted Regression

[This is the overdue earth-shattering sequel to this earlier post.]

I recently posted this working paper online. It's very short, so you should probably just read it (I was actually originally going to write it as a blog post), but I'll run through the basic idea here.  

Since I'm proposing a method, I've written functions in Matlab (vwregress.m) and Stata (vwlowess.ado) to accompany the paper. You can download them here, but I expect that other folks can do a much better job implementing this idea.

Solomon M. Hsiang
Abstract: Uncertainty in regression can be efficiently and effectively communicated using the visual properties of regression lines. Altering the "visual weight" of lines to depict the quality of information represented clearly communicates statistical confidence even when readers are unfamiliar or reckless with the formal and abstract definitions of statistical uncertainty. Here, we present an example by decreasing the color-saturation of nonparametric regression lines when the variance of estimates increases. The result is a simple, visually intuitive and graphically compact display of statistical uncertainty. This approach is generalizable to almost all forms of regression.
Here's the issue. Statistical uncertainty seems to be important for two different reasons. (1) If you have to make a decision based on data, you want to have a strong understanding of the possible outcomes that might result from your decision, which itself rests on how we interpret the data. This is the "standard" logic, I think, and it requires a precise, quantitative estimate of uncertainty. (2) Because there is noise in data, and because sampling is uneven across independent variables, a lot of data analysis techniques generate artifacts that we should mostly just ignore. We are often unnecessarily focused on, or intrigued by, the wackier results that show up in analyses, but thinking carefully about statistical uncertainty reminds us not to focus too much on these features. Except when it doesn't.

"Visually-weighted regression" is a method for presenting regression results that tries to address issue (2), taking a normal person's psychological response to graphical displays into account. I had grown a little tired of talks and referee reports where people speculate about the cause of some strange non-linearity at the edge of a regression sample, where there was no reason to believe the non-linear structure was real. I think this and related behaviors emerge because (i) there seems to be an intellectual predisposition to thinking that "nonlinearity" is inherently more interesting than "linearity" and (ii) the traditional method for presenting uncertainty subconsciously focuses viewers' attention on features of the data that are less reliable. I can't solve issue (i) with data visualization, but we can try to fix (ii).

The goal of visually-weighted regression is to take advantage of a viewer's psychological response to images in order to focus their attention on the results that are the most informative. "Visual weight" is a concept from art and graphical design that is used to direct a viewer's focus within an image. Large, dark, high-contrast, and complex structures tend to "grab" a viewer's attention. Our brains are constantly looking for visual information and, somewhere along the evolutionary line, detailed/high-contrast structures in our field of view were probably more informative and more useful for survival, so we are programmed to give them more of our attention. Unfortunately, the traditional approaches to displaying statistical uncertainty give more visual weight to the uncertain portions of the analysis, which is exactly backwards of what we want. Ideally, a viewer will focus more of their attention on the portions of analysis that have some statistical confidence and will mostly ignore the portions that are so uncertain that they contain little or no information.

[continued below the fold]

5.18.2012

Plotting a two-dimensional line using color to depict a third dimension (in Matlab)

I had seen other people do this in publications, but was annoyed that I couldn't find a quick function to do it in a general case. So I'm sharing my function to do this.

If you have three vectors describing data, eg. lat, lon and windspeed of Hurricane Ivan (yes, this data is from my env. sci. labs), you could plot it in 3 dimensions, which is awkward for readers:


or using my nifty script, you can plot wind speed as a color in two dimensions:


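If you want to roll your own, the standard Matlab trick is to draw the line as a zero-height surface whose edge color interpolates the third variable. A minimal sketch with made-up data (not necessarily how my script does it internally):

lat  = cumsum(randn(100,1));        % made-up track coordinates
lon  = cumsum(randn(100,1));
wind = abs(cumsum(randn(100,1)));   % made-up wind speeds
surface([lon lon], [lat lat], zeros(100,2), [wind wind], ...
    'FaceColor','none', 'EdgeColor','interp', 'LineWidth', 2);
colorbar                            % color scale for the third dimension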
Not earth-shattering, but useful. Next week, I will post an earth-shattering application.

[BTW, if someone sends me code to do the same thing in Stata, I will be very grateful.]

Also, you can make art:


Help file below the fold.

4.16.2012

Cheap parallel computing for students

I don't like to advertise for companies, but Matlab recently released its 2012 version for students and I know that a lot of students don't realize how much computational bang they can get for their buck.  The parallel computing package for Matlab only costs students $29 and allows them to run parallel code on an unlimited number of cores. (Compare this to Stata, which charges hundreds of dollars per additional core.)  And even more importantly, parallelizing your code is amazingly easy (in the best cases, it may only involve changing your "for" commands to "parfor").  As a PhD student, I think this $29 investment saved me several months on my dissertation (I'm not kidding).
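To give a sense of how little code changes, here is a minimal sketch of the for-to-parfor swap (expensive_step is a hypothetical placeholder for your loop body; in older releases the pool is opened with "matlabpool open" rather than parpool):

parpool;                         % open a pool of workers
out = zeros(500,1);
parfor i = 1:500                 % iterations dispatched across cores
    out(i) = expensive_step(i);  % hypothetical, must be independent across i
end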

I'm guessing there's a similar free package for R which I don't know about (please comment).

1.09.2012

"Boxplot regression" - my new favorite non-parametric method

I like non-parametric regressions since they let you see more of what's going on in data (see this post too). I also like when people plot more than just a mean regression, since seeing the dispersion in a conditional distribution is often helpful. So why don't people combine both to make a souped-up graph with the best of both worlds? Probably because there's no standard command in software packages (that I know of, at least). So I decided to fix that. Introducing boxplot regression! I'm not the first to do this, but I'm giving this code away for free, so I'm taking the liberty of making up a name since I haven't seen one before (heaven knows that I may be way off here...).

For one of our projects, Michael Roberts (his blog is solid) suggested we mimic this plot from one of his earlier papers.  This seemed like such a good idea, I wrote a Matlab function up so I can make a similar plot for every paper I write from here on out.

The function takes a continuous independent variable and bins it according to an arbitrary bin-width.  It then makes a boxplot for values of the dependent variable over each bin in the independent variable.  The result is basically a non-parametric quantile regression showing the median, 25th and 75th percentiles.  I then plot the mean for each bin overlaid as a line, so one can see how more traditional non-parametric mean regressions will compare.  Simple.
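A minimal sketch of that binning logic (not the posted function, which has more options; boxplot requires the Statistics Toolbox):

nbin  = 20;                               % arbitrary bin count
edges = linspace(min(x), max(x), nbin+1);
[~, bin] = histc(x, edges);               % bin index for each observation
bin(bin > nbin) = nbin;                   % fold the top edge into the last bin
boxplot(y, bin);                          % one box per bin
hold on
m = accumarray(bin, y, [nbin 1], @mean);  % conditional mean in each bin
plot(1:nbin, m, 'r-', 'LineWidth', 2);    % overlay the mean "regression"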

I'm not the first, but I'm not sure why this isn't done more often. Boxplots are usually used to compare distributions over qualitatively different groups, like races, sexes or treatment groups.  But it's not a huge conceptual leap to discretize an independent variable so we can apply the approach to standard regression. It's just annoying to code up.

My boxplot regression function is here (along with a utility to bin any variable without making the plot).  Now making this plot takes a single command.

Example: We take a globally gridded dataset from the SAGE group (Google Earth file of it here) and do a boxplot regression of area planted with crops on the agriculture suitability index of that grid cell:


We get a bi-variate graph packed full of information, right?  I hope Tufte would approve.  If you specify >25 bins, I've set the function to switch to a slightly more compact style for the boxplots.


Enjoy!

[If you like this, see this earlier post too. Help-file cut and pasted below the fold.]

7.06.2011

Some shapefile utilities for Matlab (and data transfer to Stata)

Here are some extremely simple but extremely useful functions when working with shapefiles in Matlab.

Functions to drop or keep polygons in your workspace based on their attributes (drop_by_attribute.m, keep_by_attribute.m).

A simple command to plot all your polygons (plot_districts.m).

A Matlab function to merge attributes (combine_attributes.m) and an accompanying Stata function to "un-merge" those attributes (break_var.ado).  This is particularly useful if you want a single attribute to uniquely identify a polygon (eg. combining State and County into a single attribute State_County) which you use when you do some calculations in Matlab (eg. compute avg. temperature by State_County) and export to Stata (using my scripts here) where you then want to retrieve the original distinct attributes (eg. State and County) for statistical analysis.


6.29.2011

Email yourself from Matlab

(I'm trying to get back in the habit of posting the most useful scripts from the piles of code I've been generating.  Step 1: start small.)

I wrote this little function to email me from Matlab. It's very convenient for telling you when a long script finishes (or crashes).

It's simple to use: type "email_me('message to yourself')"

If you give no second argument (for the email's body), then it just sends you the time.

When you first install the function, open it up and change the first three lines of code to match your gmail account info.  You can specify that it sends email from a different account, so if you're anxious about typing your password somewhere, you can just open a new gmail account for Matlab-only uses.  [I'm not 100% sure if it will work for non-gmail accounts.]
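For reference, here's a minimal sketch of how such a function can be built on Matlab's built-in sendmail -- this is the standard Gmail-over-SSL recipe with placeholder credentials, not my posted code:

function email_me(subject, body)
if nargin < 2, body = datestr(now); end      % no body given: just send the time
addr = 'you@gmail.com'; pass = 'yourpassword';  % placeholders -- edit these
setpref('Internet', 'E_mail', addr);
setpref('Internet', 'SMTP_Server', 'smtp.gmail.com');
setpref('Internet', 'SMTP_Username', addr);
setpref('Internet', 'SMTP_Password', pass);
props = java.lang.System.getProperties;      % Java properties for SSL auth
props.setProperty('mail.smtp.auth', 'true');
props.setProperty('mail.smtp.socketFactory.class', ...
    'javax.net.ssl.SSLSocketFactory');
props.setProperty('mail.smtp.socketFactory.port', '465');
sendmail(addr, subject, body);               % mail yourself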

12.11.2010

G-ECON and SAGE data in Google Earth

I was working with William Nordhaus's G-ECON dataset and the SAGE cropland datasets, so I figured I would format them for Google Earth.  (In an earlier post, I explained how to do this with your own data).

Here they are for download, if you'd like to explore them. (Instructions: Download the file, and double-click it to open it in Google Earth. They all change with time, which you can control with the slider at the upper left of the screen. The images may take a second to load. You can save the file in Google Earth by dragging the layer to "my places" so Google Earth will always open it when it starts up.)

Gross Cell Product (Data from G-Econ)
Log10 Gross cell product (like GDP, but higher resolution)
1990-2005
(Data from William Nordhaus's G-ECON project)

Fraction of land cultivated
1700-1990
(data from SAGE)

Also, since I posted this last time:

Maximum tropical cyclone windspeed
1950-2008
(data from LICRICE)

8.09.2010

Two and three-dimensional non-parametric regressions

A pet peeve of mine is people who run dozens of linear regressions but never check whether linearity is a good assumption. Often, people will run linear models and then look for non-linearities later. But there's no point in estimating lots of linear models first if they might be meaningless. For example, imagine a U-shaped curve of Y as a function of X. In this case, estimating a linear model and declaring that it finds no relation between Y and X is not meaningful. As a general rule, it's probably a better strategy to look for non-linearities early and often.

With the goal of searching-for-specifications in mind, it's useful to have a method of non-parametric regression that's fast, since the researcher will have to try many things. Locally weighted polynomial regressions are fairly slow, so the faster and simpler Nadaraya-Watson estimator (Nadaraya, 1964 and Watson, 1964) comes in handy.
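For reference, the NW estimate of the conditional mean is just a kernel-weighted average of y near each evaluation point. A minimal sketch with a normal (Gaussian) kernel and fixed bandwidth h (not the posted toolbox code):

function m = nw_sketch(x, y, xgrid, h)
m = zeros(size(xgrid));
for j = 1:numel(xgrid)
    w = exp(-0.5 * ((x - xgrid(j)) / h).^2);  % kernel weights around xgrid(j)
    m(j) = sum(w .* y) / sum(w);              % weighted local average
end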

My code for the NW estimator in Matlab is here.
(If your data is in Stata and you want to move it to Matlab, see my code for that here).

It contains 4 .m files. The file test.m will generate some random data and estimate the conditional mean with the NW estimator.  The code uses bootstrap-resampling to estimate confidence intervals. All models have flexible but fixed bandwidth.

With one independent variable, NWbootstrap.m (with a normal kernel) and NWbootstrap_epkv.m (with an Epanechnikov kernel) are appropriate.

With two independent variables, NWbootstrap_epkv_3D.m is useful. This should be used, for example, if you are looking for "interaction terms" in a multiple regression model (where the slope of a function changes as a function of another variable).

UPDATE: A problem with dropping missing values has been fixed. The code drop_missing_Y is now in the NW toolbox to solve the problem.

All files have helpfiles with syntax. Below is a demonstration of the code.


7.14.2010

Displaying Matlab data in Google Earth

[ONE UPDATE AT BOTTOM]


It struck me the other day that Google Earth would be a good platform for sharing my data with other people, and I found out that lots of people display their own data on the platform. Google even has a website to distribute Google Earth files that they like (this one, of the world population, was one of my favorites).

So then I looked around to see if anyone had written code to help people convert Matlab data into Google Earth files. The Google Earth Toolbox (recommended by my colleague Amir) worked great. It allows you to display data using functions very similar to standard Matlab functions (eg. imagesc()), except that the output image is projected onto the planet and can be explored using Google Earth. The image above is a map of the tropical cyclone climatology that I imported to Google Earth with only two lines of code.

A = ge_imagesc(LON, LAT, flipud(DATA));  % build the Google Earth image overlay
ge_output('FILENAME.kml', A);            % write it out as a .kml file

[As indicated, the only bug I found was that some of the functions flipped the data vertically, which is easy to fix with one usage of the function flipud().]

It seems to me that one huge advantage of this presentation method is that pretty much anybody can view and explore global data-sets, since the interface is entirely intuitive and requires no manipulation (similar, in many respects, to the contribution of Gapminder).

With a little extra work, I made the data-set dynamic in Google Earth, so that users can view data from different dates using a slider (see code below), and made this movie for use in presentations:



Below the fold is the script I used to make a Google Earth data set that changes over time.  Making the movie above required two additional steps.

  1. Add a "Tour" in Google Earth (under "Add") which just records your browsing of your data in a movie that plays back in the Google Earth Application.
  2. Use a screen capture program to record that movie as it plays back. (My friend Pam recommended this screen capture software, which works nicely and is easy to use). 
UPDATE:  SINCE I'VE REPEATED THIS PROCESS A FEW TIMES NOW, I'VE WRITTEN A FUNCTION IN MATLAB THAT GENERATES THE GOOGLE EARTH FILE.  ITS SIMILAR TO THE SCRIPT BELOW, BUT IS EASIER TO USE FOR AN ARBITRARY DATASET. DOWNLOAD IT HERE.

6.16.2010

Standard error adjustment (OLS) for spatial correlation and serial correlation in panel data (in Stata and Matlab)

I've been doing statistical work on climate impacts (see a typhoon climatology of the Philippines to the right) and have been having trouble finding code that will properly account for spatial correlation and serial correlation when estimating linear regression models (OLS) with panel (longitudinal) data.  So I wrote my own scripts for Matlab and Stata.  They implement GMM estimates similar to Newey-West (see Conley, 2008). You can download them here.

They are "beta versions" at this point, so they may contain errors. If you find one, please let me know.

In the future, I may try to document what I've learned about the different estimators out there. But for the time being, some of the advantages of what I've written are:
  • The weighting kernels are isotropic in space
  • The spatial weighting kernel can be uniform (as Conley, 2008 recommends) or decay linearly (a la Bartlett).
  • Serial correlation at a specific location is accounted for non-parametrically
  • Locations are specified with Lat-Lon, but kernel cutoffs are specified in kilometers; the code accounts for the curvature of the earth (to first order) so relatively large spatial domains can be accounted for in a single sample
  • An option allows you to examine the impact of adjustments for spatial correlation and spatial + serial correlation on your standard errors
  • The Stata version follows the format of all Stata estimates, so it should be compatible with post-estimation commands (eg. you can output your results using "outreg2").
[If you use this code, please cite: Hsiang (PNAS 2010)]
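To give a flavor of the distance and kernel steps (a minimal sketch with illustrative names, not the posted code):

km_deg = 111.3;                               % approx. km per degree latitude
dy = km_deg * (lat_i - lat_j);
dx = km_deg * cosd((lat_i + lat_j)/2) .* (lon_i - lon_j);  % first-order curvature correction
d  = sqrt(dx.^2 + dy.^2);                     % pairwise distance in km
w_uniform  = double(d <= cutoff);             % uniform kernel (as Conley, 2008 recommends)
w_bartlett = max(0, 1 - d/cutoff);            % linearly decaying (Bartlett) kernel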


STATA VERSION 2 UPDATE 2013: Thanks to my field-testing team (Gordon McCord and Kyle Meng), several bugs in the code have been fixed and additional options added. Most useful changes: the code now correctly accepts the wildcard "*" when specifying variables, and the option "dropvar" can be used to drop variables that Stata regards as 'too collinear' (contributed by KM).

STATA VERSION 3 UPDATE 2018: Thanks to some careful bug-chasing by Mathias Thoenig and Jordan Adamson, I've fixed a single line of code that led to a miscalculation in the weights for autocorrelation across within-unit panel observations. This error did not affect the Matlab implementation. It also did not affect calculations for the influence of spatial autocorrelation among contemporaneous observations or adjustments for heteroskedasticity. The previous version (v2) sometimes over- or under-stated the standard error adjustment due to auto-correlation, but the magnitude of this effect depends on both the serial-correlation structure in the data and the maximum lag length parameter (lagcutoff) determined by the user. For very long lag lengths (e.g. infinity) the standard error estimates are too small, but for shorter lag lengths the bias may be of either sign.

My (currently ad-hoc) help file for the Stata script is below the fold. The Matlab code has an associated help-file embedded.

5.18.2010

Data transfer from Matlab to Stata and reverse

Zip file with Matlab code to export datasets to Stata or import them from Stata. For use in conjunction with "insheet/outsheet". Importing data from Stata is restricted to numeric variables (but not the reverse).

12.11.2009

Matlab toolbox updated

I've updated my Matlab toolboxes here.

Some useful additions include scripts to flexibly coarsen spatial data, scripts to compute area- or population-weighted averages of global spatial data, and a new (small) suite of network functions.