[This is the overdue earth-shattering sequel to this earlier post.]
I recently posted this working paper online. It's very short, so you should probably just read it (I was actually originally going to write it as a blog post), but I'll run through the basic idea here.
Since I'm proposing a method, I've written functions in Matlab (vwregress.m) and Stata (vwlowess.ado) to accompany the paper. You can download them here, but I expect that other folks can do a much better job implementing this idea.
Solomon M. Hsiang
Abstract: Uncertainty in regression can be efficiently and effectively communicated using the visual properties of regression lines. Altering the "visual weight" of lines to depict the quality of information represented clearly communicates statistical confidence even when readers are unfamiliar or reckless with the formal and abstract definitions of statistical uncertainty. Here, we present an example by decreasing the color-saturation of nonparametric regression lines when the variance of estimates increases. The result is a simple, visually intuitive and graphically compact display of statistical uncertainty. This approach is generalizable to almost all forms of regression.
Here's the issue. Statistical uncertainty seems to be important for two different reasons. (1) If you have to make a decision based on data, you want to have a strong understanding of the possible outcomes that might result from your decision, which itself rests on how you interpret the data. This is the "standard" logic, I think, and it requires a precise, quantitative estimate of uncertainty. (2) Because there is noise in data, and because sampling is uneven across independent variables, a lot of data analysis techniques generate artifacts that we should mostly just ignore. We are often unnecessarily focused on or intrigued by the wackier results that show up in analyses, but thinking carefully about statistical uncertainty reminds us to not focus too much on these features. Except when it doesn't.
"Visually-weighted regression" is a method for presenting regression results that tries to address issue (2), taking a normal person's psychological response to graphical displays into account. I had grown a little tired of talks and referee reports where people speculate about the cause of some strange non-linearity at the edge of a regression sample, where there was no reason to believe the non-linear structure was real. I think this and related behaviors emerge because (i) there seems to be an intellectual predisposition to thinking that "nonlinearity" is inherently more interesting that "linearity" and (ii) the traditional method for presenting uncertainty subconsciously focuses viewers attention on features of the data that are less reliable. I can't solve issue (i) with data visualization, but we can try to fix (ii).
The goal of visually-weighted regression is to take advantage of viewers' psychological response to images in order to focus their attention on the results that are the most informative. "Visual weight" is a concept from art and graphic design that is used to direct a viewer's focus within an image. Large, dark, high-contrast, and complex structures tend to "grab" a viewer's attention. Our brains are constantly looking for visual information and, somewhere along the evolutionary line, detailed/high-contrast structures in our field of view were probably more informative and more useful for survival, so we are programmed to give them more of our attention. Unfortunately, the traditional approaches to displaying statistical uncertainty give more visual weight to the uncertain portions of the analysis, which is exactly the opposite of what we want. Ideally, viewers will focus more of their attention on the portions of the analysis that have some statistical confidence and will mostly ignore the portions that are so uncertain that they contain little or no information.
The standard approach to depicting statistical uncertainty in a regression is to add more objects or "ink" to the image in locations where there is high uncertainty: error bars, confidence limits and shading. These objects all increase the visual weight of the display in exactly the regions where there is less information. Including these objects is very important for the formal interpretation of results, per issue (1) above, but there are scenarios where viewers are not sufficiently trained or disciplined in statistical thinking to overcome their natural tendency to focus on these "visually heavy" regions in a display. To counter this tendency, visually-weighted regression tries to remove visual weight from statistically uncertain portions of a regression, thereby focusing viewers on the more certain portions of the analysis.
Here's an example. Say we have data with a non-linear conditional mean (solid line), but the data is noisy:
and we want to recover the conditional mean using regression while conveying uncertainty. The standard approach is to plot standard errors or confidence intervals using additional graphical "ink" (using Tufte's language). But this draws viewers' attention towards the edge of the display, where the large, dark, high-contrast confidence intervals grow and become conspicuous. This approach makes the edges of the display more interesting to look at than the center, even though most of the data is in the center. In contrast, if we make the visual weight of the regression line proportional to its statistical certainty, viewers' attention is brought back toward the middle of the display. Compare these two approaches by relaxing your eyes and looking at these two displays of the same non-parametric result:
In Panel A, our eyes are drawn outward, away from the center of the display and toward the swirling confidence intervals at the edges. But in Panel B, our eyes are attracted to the middle of the regression line, where the contrast between the line and the background is sharp and visually heavy. By using visual-weighting, we focus our readers' attention on those portions of the regression that contain the most information and where the findings are strongest. Furthermore, when we attempt to look at the edges of Panel B, we actually feel a little uncertain, as if we are trying to make out the shape of something through a fog. This is a good thing, because everyone knows that feeling, even if we have no statistical training (or ignore that training when it's inconvenient). By aligning the feeling of uncertainty with actual statistical uncertainty, we can more intuitively and more effectively communicate uncertainty in our results to a broader set of viewers.
Visually-weighted regression has other benefits as well. For example, it is visually compact. Below are four sets of overlaid data. If we were to plot error bars or confidence intervals for all four regressions, the result would be visually chaotic and difficult to read. Clearly plotting all the regressions along with their uncertainty (using the traditional method) would require 2-4 panels and wouldn't allow us to easily compare the results across all four samples. But if we use visually-weighted regressions, they overlay quite nicely while still communicating where the results are more or less certain:
The general innovation here is to utilize the visual properties of a regression line to communicate statistical uncertainty intuitively, but there are many different ways to do this. I have only given a single example here, since in all of these graphics the color saturation is equal to the square root of the local observational density (a measure of certainty) for a kernel-weighted moving average. But there are many measures of uncertainty one could use (e.g. the width of the 95% confidence interval or a p-value), many other properties of a line that can be altered to give it more or less visual weight (e.g. the width of the line or its color), and many other regression techniques where statistical confidence varies throughout a display (e.g. OLS, parametric MLE regression or Bayesian approaches). In fact, the approach of visually-weighting a regression is extremely generalizable, and I think it can (and maybe should?) be applied almost anywhere.
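To make the recipe in the paragraph above concrete, here is a minimal Python sketch of a kernel-weighted moving average whose color saturation follows the square root of the local observational density. It is not a port of vwregress.m or vwlowess.ado. The function names, the Gaussian kernel, the rescaling of saturation so the densest point equals 1, and the blend-toward-white color mapping are all assumptions made for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def visually_weighted_fit(x, y, grid, bandwidth):
    """Kernel-weighted moving average plus one saturation value per grid point.

    Returns (means, saturations). Saturation is sqrt(local density), rescaled
    so the densest grid point gets 1.0 (the rescaling is an assumption).
    """
    n = len(x)
    means, densities = [], []
    for g in grid:
        weights = [gaussian_kernel((xi - g) / bandwidth) for xi in x]
        total = sum(weights)
        # Local weighted mean; NaN if no observation carries any weight here.
        if total > 0:
            means.append(sum(w * yi for w, yi in zip(weights, y)) / total)
        else:
            means.append(float("nan"))
        # Kernel density estimate of the observations at this grid point.
        densities.append(total / (n * bandwidth))
    peak = max(densities)
    saturations = [math.sqrt(d / peak) if peak > 0 else 0.0 for d in densities]
    return means, saturations


def saturation_to_rgb(s, base=(0.0, 0.0, 1.0)):
    """Blend a base color toward white as saturation falls from 1 to 0."""
    return tuple(1.0 - s * (1.0 - c) for c in base)


# Toy data: most observations sit near x = 0, with one point at each edge.
x = [-2.0, -0.5, -0.2, 0.0, 0.1, 0.3, 0.6, 2.0]
y = [xi * xi for xi in x]
means, saturations = visually_weighted_fit(x, y, grid=[-2.0, 0.0, 2.0], bandwidth=0.5)
colors = [saturation_to_rgb(s) for s in saturations]
```

Each segment of the regression line would then be drawn in the color computed for its grid point, so the line is fully saturated where the data are dense and fades toward the background where they are sparse.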
The paper is more eloquent and provides a slightly more formal description of the approach. General functions that can reproduce all of the graphs above are here. The Matlab code is fairly flexible and contains several options, including plotting visually-weighted bootstrapped confidence intervals as well as using these CIs to do the visual-weighting. The Stata code is substantially less flexible and slower for large data sets. If anyone writes a script to implement this method in other platforms or using different estimators, I would love to see (and post) it.
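For readers on other platforms, the idea of using bootstrapped confidence intervals to do the visual-weighting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of what the Matlab code does internally: the pairs bootstrap, the Gaussian kernel, the percentile interval, and the "narrowest width divided by local width" mapping to a weight are all choices made here, and the function names are hypothetical.

```python
import math
import random


def kernel_mean(x, y, g, bandwidth):
    """Gaussian kernel-weighted mean of y at grid point g."""
    weights = [math.exp(-0.5 * ((xi - g) / bandwidth) ** 2) for xi in x]
    total = sum(weights)
    if total > 0:
        return sum(w * yi for w, yi in zip(weights, y)) / total
    return float("nan")


def bootstrap_ci_weights(x, y, grid, bandwidth, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap CI at each grid point, plus a visual weight in (0, 1].

    Weight = narrowest CI width / local CI width (an assumed mapping), so the
    most precisely estimated grid point is drawn at full weight. Assumes at
    least one resample yields a finite estimate at every grid point.
    """
    rng = random.Random(seed)
    n = len(x)
    draws = [[] for _ in grid]
    for _ in range(n_boot):
        # Resample (x, y) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        for k, g in enumerate(grid):
            m = kernel_mean(xb, yb, g, bandwidth)
            if not math.isnan(m):
                draws[k].append(m)
    alpha = (1.0 - level) / 2.0
    lower, upper = [], []
    for d in draws:
        d.sort()
        lower.append(d[int(alpha * (len(d) - 1))])
        upper.append(d[int(round((1.0 - alpha) * (len(d) - 1)))])
    widths = [u - l for l, u in zip(lower, upper)]
    narrowest = min(widths)
    weights = [narrowest / w if w > 0 else 1.0 for w in widths]
    return lower, upper, weights


# Toy data: 40 observations near the center, 3 at each edge, with y = +/-1 noise.
x = [-3.1, -3.0, -2.9] + [-0.5 + i / 39 for i in range(40)] + [2.9, 3.0, 3.1]
y = [(-1.0) ** i for i in range(len(x))]
lower, upper, weights = bootstrap_ci_weights(x, y, grid=[-3.0, 0.0, 3.0], bandwidth=0.4)
```

The resulting weights could drive color saturation, line width, or the shading of the interval itself; which property to alter is a design choice.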
UPDATE: After some discussion on Andrew Gelman's blog (here, here, here and here) I've posted some additional examples of visually weighted confidence intervals and "watercolor regression," both of which would fall under the general class of "visually-weighted regression."
Should the confidence bounds be color-weighted opposite the regression line to emphasize the distribution of potential outcomes when the regression is weak? It's something of a nitpick and probably more audience-specific than anything. Overall it's a nice, elegant approach to resolve something that is embedded in scientific culture but not elsewhere.
I just tried your Stata ado and saw that the names of the two variables vwlowess takes are stored in the locals x and y but are afterwards addressed as variables rather than locals. A simple find/replace resolves this, though.
Just wanted to give you a hint on that one.
Best
Thanks. I will fix it soon. Some other folks have already given me multiple tips on improving the ado file.