
8.04.2016

Responding to critique of ivory-sale analysis

I've written an absurdly long (but comprehensive) response, posted on the sister G-FEED blog, to the critique of our paper on elephant poaching induced by legal ivory sales.

The critique we respond to is one leveled by Dr. Fiona Underwood and Dr. Robert Burn, which was posted to Dr. Underwood's blog.  Dr. Underwood and Dr. Burn are consultants who were hired by CITES to evaluate whether the legal sale designed and implemented by CITES was effective at reducing elephant poaching globally.  We have received numerous inquiries from experts and government analysts about their critique, so we thought it was important to respond to it directly.

The short answer is that the criticisms in Dr. Underwood and Dr. Burn's post are based on a mischaracterization of the analysis and discussion contained in our paper, an inaccurate portrayal of how the models we use work, and an incorrect description of the assumptions needed for causal inference when implementing an event study/regression discontinuity research design. The long answer is here.

12.01.2015

Choosing experiments to accelerate collective discovery

How efficient are research agendas?


Abstract: A scientist’s choice of research problem affects his or her personal career trajectory. Scientists’ combined choices affect the direction and efficiency of scientific discovery as a whole. In this paper, we infer preferences that shape problem selection from patterns of published findings and then quantify their efficiency. We represent research problems as links between scientific entities in a knowledge network. We then build a generative model of discovery informed by qualitative research on scientific problem selection. We map salient features from this literature to key network properties: an entity’s importance corresponds to its degree centrality, and a problem’s difficulty corresponds to the network distance it spans. Drawing on millions of papers and patents published over 30 years, we use this model to infer the typical research strategy used to explore chemical relationships in biomedicine. This strategy generates conservative research choices focused on building up knowledge around important molecules. These choices become more conservative over time. The observed strategy is efficient for initial exploration of the network and supports scientific careers that require steady output, but is inefficient for science as a whole. Through supercomputer experiments on a sample of the network, we study thousands of alternatives and identify strategies much more efficient at exploring mature knowledge networks. We find that increased risk-taking and the publication of experimental failures would substantially improve the speed of discovery. We consider institutional shifts in grant making, evaluation, and publication that would help realize these efficiencies.
The paper is Rzhetsky et al. (2015), "Choosing experiments to accelerate collective discovery." (via Shanee)

4.06.2015

Data-driven causal inference

Distinguishing cause from effect using observational data: methods and benchmarks

From the abstract:
The discovery of causal relationships from purely observational data is a fundamental problem in science. The most elementary form of such a causal discovery problem is to decide whether X causes Y or, alternatively, Y causes X, given joint observations of two variables X, Y . This was often considered to be impossible. Nevertheless, several approaches for addressing this bivariate causal discovery problem were proposed recently. In this paper, we present the benchmark data set CauseEffectPairs that consists of 88 different "cause-effect pairs" selected from 31 datasets from various domains. We evaluated the performance of several bivariate causal discovery methods on these real-world benchmark data and on artificially simulated data. Our empirical results provide evidence that additive-noise methods are indeed able to distinguish cause from effect using only purely observational data. In addition, we prove consistency of the additive-noise method proposed by Hoyer et al. (2009).
From the arxiv.org blog (note):
The basis of the new approach is to assume that the relationship between X and Y is not symmetrical. In particular, they say that in any set of measurements there will always be noise from various cause. The key assumption is that the pattern of noise in the cause will be different to the pattern of noise in the effect. That’s because any noise in X can have an influence on Y but not vice versa.
There's been a lot of research in stats on "causal discovery" techniques, and the paper essentially runs a horse race between additive-noise methods (ANM) and Information Geometric Causal Inference, with ANM winning out. Some nice overview slides providing background are here.

3.21.2014

When evidence does not suffice

Halvard Buhaug and numerous coauthors have released a comment titled “One effect to rule them all? A comment on climate and conflict,” which critiques research on climate and human conflict that I published in Science and Climatic Change with my coauthors Marshall Burke and Edward Miguel.

The comment does not address the actual content of our papers.  Instead, it states that our papers say things they do not say (or that our papers do not say things they actually do say) and then uses those inaccurate claims as evidence that our work is erroneous.

I have posted my reaction to the comment on the G-FEED blog, written as the referee report that I would write if I were asked to referee the comment.

(This is not the first time Buhaug and I have disagreed on what constitutes evidence. Kyle Meng and I recently published a paper in PNAS demonstrating that Buhaug’s 2010 critique of an earlier paper made aggressive claims that the earlier paper was wrong without actually providing evidence to support those claims.)

1.17.2014

FAQs for "Reconciling disagreement over climate–conflict results in Africa"

[This is a guest blog post by my coauthor Kyle Meng.]

Sol and I just published an article in PNAS in which we reexamine a controversy in the climate-conflict literature. The debate centers on two previous PNAS articles: the first, by Burke et al. (PNAS, 2009), claims that higher temperatures increase conflict risk in sub-Saharan Africa; the second, by Buhaug (PNAS, 2010), disputes the earlier study.

How did we get here?

First, a bit of background. Whether climate change causes societies to be more violent is a critical question for our understanding of climate impacts. If climate change indeed increases violence, the economic and social costs of climate change may be far greater than previously considered, further strengthening the case for reducing greenhouse gas emissions. To answer this question, researchers in recent years have turned to data from the past, asking whether violence has responded historically to changes in the local climate. Despite the increasing volume of research (summarized by Sol, Marshall Burke, and Ted Miguel in their meta-analysis published in Science and the accompanying review article in Climatic Change), this question remained somewhat controversial in the public eye. Much of this controversy was generated by this pair of PNAS papers.

What did we do?

Our new paper takes a fresh look at these two prior studies by statistically examining whether the evidence provided by Buhaug (2010) overturns the results in Burke et al. (2009). Throughout, we examine the two central claims made by Buhaug:
1) that Burke et al.'s results "do not hold up to closer inspection" and
2) that climate change does not cause conflict in sub-Saharan Africa.
Because these are quantitative papers, Buhaug’s two claims can be evaluated using statistical methods. What we find is that Buhaug did not run the appropriate statistical procedures needed for the claims he makes. When we apply the correct statistical tests, we find that:
a) the evidence in Buhaug is not statistically different from that of Burke et al. and
b) Buhaug’s results cannot support the claim that climate does not cause conflict. 
A useful analogy

The statistical reasoning in our paper is a bit technical, so an analogy may be helpful here. Burke et al.'s main result is equivalent to saying "smoking increases lung cancer risks by roughly 10%". Buhaug's claims above are equivalent to stating that his analysis demonstrates that “smoking does not increase lung cancer risks” and furthermore that “smoking does not affect lung cancer risks at all”.

What we find, after applying the appropriate statistical method, is that the only equivalent claim that can be supported by Buhaug’s analysis is "smoking may increase lung cancer risks by roughly 100%, or may decrease them by roughly 100%, or may have no effect whatsoever". Notice that this is a very different statement from what Buhaug claims to have demonstrated in 1) and 2) above. Basically, the results presented in Buhaug are so uncertain that they do not reject a zero effect, but they also do not reject the original work by Burke et al.

Isn’t Buhaug just showing Burke et al.’s result is “not robust”?

In statistical analyses, we often seek to understand if a result is “robust” by demonstrating that reasonable alterations to the model do not produce dramatically different results. If successful, this type of analysis sometimes convinces us that we have not failed to account for important omitted variables (or other factors) that would alter our estimates substantively.

Importantly, however, the reverse logic does not hold, and “non-robustness” is not a conclusive (or logical) finding. Obtaining different estimates when the model is altered does not necessarily imply that the original result is wrong, since it might be the new estimate that is biased.  Unstable results indicate only that some (or all) of the models are misspecified; in other words, the analyst has not yet found the right statistical model.

There can be only one “true” relationship between climate and conflict: it may be a coefficient of zero or a larger coefficient consistent with Burke et al., but it cannot be all of these coefficients at the same time. If models with very different underlying assumptions produce dramatically different estimates, this suggests that all of the models (except perhaps one) are misspecified and should be thrown out.
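To make this concrete, here is a toy simulation in Stata (a sketch with made-up data and hypothetical variable names, not the actual conflict data): the true effect of x on y is 1, but a model that omits a relevant group-level confounder recovers a very different coefficient. The instability across the two models signals that one of them is misspecified, not that the true effect is zero.

clear
set obs 1000
set seed 1234
* 10 groups of 100 observations; each group has its own unobserved level shift
gen group = ceil(_n/100)
gen u = 3*group
* x is correlated with the unobserved group effect; the true effect of x on y is 1
gen x = 0.5*group + rnormal()
gen y = x + u + rnormal()
* accounting for group differences recovers a coefficient on x near 1
reg y x i.group
* dropping the group effects produces a badly biased coefficient on x --
* evidence of misspecification, not evidence that the true effect is zero
reg y x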

A central error in Buhaug is his interpretation of his findings.  He removes critical parts of Burke et al.’s model (e.g. those that account for important differences in geography, history and culture) or re-specifies them in other ways and then advocates that the various inconsistent coefficients produced should all be taken seriously. In reality, the varying estimates produced by Buhaug are either due to added model biases or to sampling uncertainty caused by the techniques that he is using. It is incorrect to interpret this variation as evidence that Burke et al.’s estimate is “non-robust”.

So are you saying Burke et al. was right?

No. And this is a very important point. In our article, we carefully state:
“It is important to note that our findings neither confirm nor reject the results of Burke et al.. Our results simply reconcile the apparent contradiction between Burke et al. and Buhaug by demonstrating that Buhaug does not provide evidence that contradicts the results reported in Burke et al. Notably, however, other recent analyses obtain results that largely agree with Burke et al., so we think it is likely that analyses following our approach will reconcile any apparent disagreement between these other studies and Buhaug.”
That is, taking Burke et al.'s result as given, we find that the evidence provided in Buhaug does not refute Burke et al. (the central claim of Buhaug). Whether Burke et al. was right about climate causing conflict in sub-Saharan Africa is a different question. We’ve tried to answer that question in other settings (e.g. our joint work published in Nature), but that’s not the contribution of this analysis.

Parting note

Lastly, we urge those interested to read our article carefully. Simply skimming the paper by hunting for statistically significant results would be missing the paper’s point. Our broader hope besides helping to reconcile this prior controversy is that the statistical reasoning underlying our work becomes more common in data-driven analyses.

1.15.2014

Reconciling disagreement over climate–conflict results in Africa

Kyle and I have a paper out in the Early Edition of PNAS this week:

Reconciling disagreement over climate–conflict results in Africa
Solomon M. Hsiang and Kyle C. Meng
Abstract: A recent study by Burke et al. [Burke M, Miguel E, Satyanath S, Dykema J, Lobell D (2009) Proc Natl Acad Sci USA 106(49):20670– 20674] reports statistical evidence that the likelihood of civil wars in African countries was elevated in hotter years. A following study by Buhaug [Buhaug H (2010) Proc Natl Acad Sci USA 107 (38):16477–16482] reports that a reexamination of the evidence overturns Burke et al.’s findings when alternative statistical models and alternative measures of conflict are used. We show that the conclusion by Buhaug is based on absent or incorrect statistical tests, both in model selection and in the comparison of results with Burke et al. When we implement the correct tests, we find there is no evidence presented in Buhaug that rejects the original results of Burke et al. 
Related reconciliation of different results in Kenya.

A brief refresher and discussion of the controversy that we are examining is here.

1.10.2014

Reconciling temperature-conflict results in Kenya

Marshall, Ted and I have a new short working paper out. When we correct the coding of a single variable in a previous study (that uses a new data set), we obtain highly localized temperature-conflict associations in Kenya that are largely in line with the rest of the literature. I think this is a useful example for why we should be careful with how we specify interaction terms.

Reconciling temperature-conflict results in Kenya
Solomon M. Hsiang, Marshall Burke, and Edward Miguel
Abstract: Theisen (JPR, 2012) recently constructed a novel high-resolution data set of intergroup and political conflict in Kenya (1989-2004) and examined whether the risk of conflict onset and incidence responds to annual pixel-level variations in temperature and precipitation.  Theisen concluded that only extreme precipitation is associated with conflict incidence and that temperature is unrelated to conflict, seemingly at odds with recent studies that found a positive association at the pixel scale (O'Loughlin et al., PNAS 2012), at the country scale (Burke et al., PNAS 2009), and at the continental scale (Hsiang et al., Nature 2011) in Africa.  Here we show these findings can be reconciled when we correct the erroneous coding of temperature-squared in Theisen. In contrast to the original conclusions presented in Theisen, both conflict onset and conflict incidence are significantly and positively associated with local temperature in this new and independently assembled data set.
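On the interaction-term point: the specific correction is described in the paper, but as a general matter, letting Stata construct squared and interaction terms through factor-variable notation makes this kind of coding slip much harder. A minimal sketch with hypothetical variable names (not the Theisen specification):

* hand-coded quadratic terms are easy to get wrong, and post-estimation
* commands like margins will not know that temp2 is a function of temp
gen temp2 = temp^2
reg conflict temp temp2 precip
* factor-variable notation builds the squared term internally, so the
* specification and marginal effects stay consistent
reg conflict c.temp##c.temp c.precip
margins, dydx(temp) at(temp = (20(5)35))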

12.09.2013

What is identification?

There are relatively few non-academic internet resources on identification and causal inference in the social sciences, especially of the sort that can be consumed by a nonspecialist. To remedy that slightly I decided to tidy up and post some slides I've used to give talks on causal inference a few times in the past year. They're aimed at senior undergrad or graduate students with at least some background in statistics or econometrics, and can be found here:

Causal Inference, Identification, and Identification Strategies

Feel free to drop me a line and give me feedback, especially if something seems unclear or incorrect. Thanks!

7.29.2013

Forward vs. reverse causal questions

Andrew Gelman has a thought-provoking post on asking "Why?" in statistics:
Consider two broad classes of inferential questions: 
1. Forward causal inference. What might happen if we do X? What are the effects of smoking on health, the effects of schooling on knowledge, the effect of campaigns on election outcomes, and so forth? 
2. Reverse causal inference. What causes Y? Why do more attractive people earn more money? Why do many poor people vote for Republicans and rich people vote for Democrats? Why did the economy collapse? [...] 
My question here is: How can we incorporate reverse causal questions into a statistical framework that is centered around forward causal inference? (Even methods such as path analysis or structural modeling, which some feel can be used to determine the direction of causality from data, are still ultimately answering forward causal questions of the sort, What happens to y when we change x?) 
My resolution is as follows: Forward causal inference is about estimation; reverse causal inference is about model checking and hypothesis generation.
Among many gems is this:
A key theme in this discussion is the distinction between causal statements and causal questions. When Rubin dismissed reverse causal reasoning as “cocktail party chatter,” I think it was because you can’t clearly formulate a reverse causal statement. That is, a reverse causal question does not in general have a well-defined answer, even in a setting where all possible data are made available. But I think Rubin made a mistake in his dismissal. The key is that reverse questions are valuable in that they focus on an anomaly—an aspect of the data unlikely to be reproducible by the current (possibly implicit) model—and point toward possible directions of model improvement.
 You can read the rest here.

6.05.2013

Souped-up Watercolor Regression

I introduced "watercolor regression" here on FE several months ago, after some helpful discussions with Andrew Gelman and our readers. Over the last few months, I've made a few upgrades that I think significantly increase the utility of this approach for people doing work similar to my own.

First, the original paper is now on SSRN and documents the watercolor approach, explaining its relationship to the more general idea of visual-weighting.
Visually-Weighted Regression 
Abstract: Uncertainty in regression can be efficiently and effectively communicated using the visual properties of statistical objects in a regression display. Altering the “visual weight” of lines and shapes to depict the quality of information represented clearly communicates statistical confidence even when readers are unfamiliar with the formal and abstract definitions of statistical uncertainty. Here we present examples where the color-saturation and contrast of regression lines and confidence intervals are parametrized by local measures of an estimate’s variance. The results are simple, visually intuitive and graphically compact displays of statistical uncertainty. This approach is generalizable to almost all forms of regression.
Second, the Matlab code I've posted to do watercolor regression is now parallelized. If you have Matlab running on multiple processors, the code automatically detects this and runs the bootstrap procedure in parallel.  This is helpful because a large number of resamples (>500) is important for getting the distribution of estimates (the watercolored part of the plot) to converge, but serial resampling gets very slow for large data sets (eg. >1M obs), especially when block-bootstrapping (see below).

Third, the code now has an option to run a block bootstrap. This is important if you have data with serial or spatial autocorrelation (eg. models of crop yields that change in response to weather).  To see this at work, suppose we have some data where there is a weak dependence of Y on X, but all observations within a block (eg. maybe obs within a single year) have a uniform level-shift induced by some unobservable process.
e = randn(1000,1);              % idiosyncratic noise
block = repmat([1:10]',100,1);  % block ID for each obs (10 blocks of 100 obs)
x = 2*randn(1000,1);            % independent variable
y = x+10*block+e;               % weak dependence on x plus a common block-level shift
The scatter of this data looks like:


where each of the stripes of data is a block of obs with correlated residuals. Running watercolor_reg without block-bootstrapping
 watercolor_reg(x,y,100,1.25,500)
we get an exaggerated sense of precision in the relationship between Y and X:


If we try to account for the fact that residuals within a block are not independent by using the block bootstrap
watercolor_reg(x,y,100,1.25,500,block)
we get a very different result:



Finally, the last addition to the code is a simple option to clip the watercoloring at the edge of a specified confidence interval (default is 95%), an idea suggested by Ted Miguel. This produces a watercolor plot that also lets us conduct some traditional hypothesis tests visually, without violating the principles of visual weighting. Applying this option to the example above
blue = [0 0 .3]
watercolor_reg(x,y,100,1.25,500,block, blue,'CLIPCI')
we obtain a plot with a clear 95% CI, where the likelihoods within the CI are indicated by watercoloring:


Code is here. Enjoy!

6.03.2013

Weather and Climate Data: a Guide for Economists

Now posted as an NBER working paper (it should be out in REEP this summer):

Using Weather Data and Climate Model Output in Economic Analyses of Climate Change
Maximilian Auffhammer, Solomon M. Hsiang, Wolfram Schlenker, Adam Sobel
Abstract: Economists are increasingly using weather data and climate model output in analyses of the economic impacts of climate change. This article introduces weather data sets and climate models that are frequently used, discusses the most common mistakes economists make in using these products, and identifies ways to avoid these pitfalls. We first provide an introduction to weather data, including a summary of the types of datasets available, and then discuss five common pitfalls that empirical researchers should be aware of when using historical weather data as explanatory variables in econometric applications. We then provide a brief overview of climate models and discuss two common and significant errors often made by economists when climate model output is used to simulate the future impacts of climate change on an economic outcome of interest.

5.13.2013

What is the debate over climate and conflict about?

Last week, Andrew Solow published a Nature comment titled "A call for peace on climate and conflict." In the article, Solow raises many important points that I whole-heartedly agree with, such as trying to avoid data-mining, looking deep into statistical models when they disagree, engaging with qualitative researchers, and presenting and publishing across research communities. My coauthors and I agree so strongly with these latter points that we regularly present and engage with researchers outside of our field -- e.g. Marshall Burke recently presented at the International Studies Association (a political science meeting) and I recently presented at the Association of American Geographers, at an interdisciplinary water resources conference at UCSD, and I will be presenting to a community of medical doctors at Harvard today.

However, I worry that Solow's comment may confuse readers as to why there is controversy in the field. Solow begins his comment:
Among the most worrying of the mooted impacts of climate change is an increase in civil conflict as people compete for diminishing resources, such as arable land and water [1]. Recent statistical studies [2–4] reporting a connection between climate and civil violence have attracted attention from the press and policy-makers, including US President Barack Obama. Doubts about such a connection have not been as widely aired [5–7], but a fierce battle has broken out within the research community.
The battle lines are not always clear, but on one side are the ‘quants’, who use quantitative methods to identify correlations between conflict and climate in global or regional data sets. On the other side are the ‘quals’, who study individual conflicts in depth. They argue that the factors that underlie civil conflict are more complex than the quants allow and that the reported correlations are statistical artefacts. 
The papers he references are:
1. Homer-Dixon (Princeton University Press, 1999).
2. Miguel, Satyanath, Sergenti, J. Polit. Econ. (2004).
3. Burke, Miguel, Satyanath, Dykema, Lobell, Proc. Natl Acad. Sci. USA (2009).
4. Hsiang, Meng, Cane, Nature (2011).
5. Buhaug, Proc. Natl Acad. Sci. USA (2010).
6. Theisen, Holtermann, Buhaug, Int. Secur. (2011).
7. Buhaug, Hegre, Strand (Peace Research Institute Oslo, 2010). 
Thus, the dispute that motivates the comment (referenced in the first paragraph) is the disagreement between Miguel-Burke-Hsiang et al. and Buhaug-Theisen et al., while the transition in the second paragraph shifts the discussion to a dispute between ‘quants’ and ‘quals’ (which is the topic of most of the text).  Because these two discussions are so intermingled, a careless reader might incorrectly conclude that the Miguel-Burke-Hsiang vs. Buhaug-Theisen debate is the qual vs. quant debate. This is not the case. Miguel-Burke-Hsiang et al. and Buhaug-Theisen et al. are all quantitative research groups. The debate between the two groups is about how quantitative research should be executed and interpreted. It is not a debate over whether quantitative or qualitative methods are better.

Because the Miguel-Burke-Hsiang vs. Buhaug-Theisen debate is raised in the comment, but not outlined, I summarize the papers that Solow cites here:

2004: Miguel et al. demonstrate that annual fluctuations in rainfall are positively correlated with annual fluctuations in GDP growth and negatively correlated with civil conflict in African countries. Miguel et al. argue that rainfall changes influence conflict through this economic channel.

2009: Burke et al. (which includes Miguel and Satyanath, both authors on the 2004 paper) revisit this problem but include growing season temperature in their statistical model, motivated in part by other findings that temperature is a strong predictor of agricultural performance (even once rainfall is controlled for). They find that temperature appears to have an even stronger effect on conflict than rainfall. They conduct a number of robustness checks and project how conflict might change under global warming.

2010: Buhaug (PNAS) argues that Burke et al. arrive at incorrect conclusions because they should not include country fixed effects or country-specific trends in their statistical model. Buhaug instead advocates for a model that assumes all countries are identical (with respect to conflict) except for GDP and an index of political exclusion. Using this model, Buhaug argues that temperature has zero effect on conflict. Buhaug concludes his article with the statement:
"The challenges imposed by future global warming are too daunting to let the debate on social effects and required countermeasures be sidetracked by atypical, nonrobust scientific findings and actors with vested interests."
This is when the debate begins to get attention (eg. here).

2010: Buhaug et al. (PRIO) examine several additional dimensions of the result in Burke et al., such as its out-of-sample prediction and how results look when other measures of civil conflict are used. The authors conclude:
"In conclusion, the sensitivity assessments documented here reveal little support for the alleged positive association between warming and higher frequency of major civil wars in Africa… More research is needed to get a better understanding of the full range of possible social dimensions of climate change."
2011: Theisen et al. revisit civil conflict in Africa by trying to pinpoint the locations where the first battle deaths in major wars occurred. Theisen et al. examine whether the 0.5 degree pixels where these first deaths occurred were experiencing drought at the time of those deaths.  The authors follow Buhaug and do not use fixed effects; instead they use a model that assumes all pixels are identical except for six control variables (e.g. democracy, infant mortality). The authors do not find a statistically significant association between drought and the location of first battle deaths, so they conclude that climate does not affect civil conflict in Africa.

2011: Hsiang et al. examine whether the global climate (not local temperature) has any effect on global rates of civil conflict. Hsiang et al. identify the tropical and sub-tropical regions of the world that are most strongly affected by the El Niño-Southern Oscillation (ENSO) and then examine the likelihood that countries in this region start new civil conflicts, conditional on the state of ENSO. They find that in cooler/wetter La Niña years the rate of conflicts is half of what it is in hotter/drier El Niño years -- but only in the tropical and sub-tropical regions that are affected by this global climate oscillation. The authors show that the additional conflicts observed in El Niño years only occur after El Niño begins and are focused in the poorest countries.

Some of my thoughts on the above debate (in no particular order):
  1. Clearly, this discussion is all based on statistical evidence -- it is not a debate as to whether quals or quants are better suited to answer this question.
  2. No statistical evidence undermining the findings of Hsiang et al. has been released or published in the last two years (to my knowledge). Many authors have casually stated in reviews that "there are issues with the paper" or that Buhaug (2010) or Theisen et al. (2011) disprove our findings (eg. here). But valid "issues" have not been pointed out to me, publicly or privately, and I do not see how these other papers can possibly be interpreted as disproving our results. Since I'm fairly certain that these authors have been trying to find problems with our paper but have not released any in the last two years, I am gaining confidence that our findings are extremely robust. Furthermore, one of Chris Blattman's graduate students recently replicated our paper successfully for an econometrics assignment.
  3. Buhaug and Theisen et al. generally overstate their findings. The estimates they obtain are extremely noisy, so they have very large confidence intervals, preventing them from rejecting a "zero effect" or very large effects. This is far from proving there is zero effect. For example, saying that X is somewhere between -100 and 100 is not evidence that X is exactly equal to 0. 
  4. Buhaug and Theisen et al.'s approach of dropping fixed effects, and assuming Africa is homogeneous except for a handful of controls, is easily rejected by the data. A simple F-test for the joint significance of the fixed effects in Burke's model easily rejects their hypothesis that these effects are the same throughout Africa (see the sketch after this list). 
  5. I think the paper by Theisen et al. is very difficult to interpret, since they are assigning all the potential causes of a conflict to conditions within the 50 x 50 km pixel where the first battle death occurred. Regardless of what results they report or whether the statistical techniques are sound, I'm not sure how I would interpret any of their results, since I tend to think that many factors located beyond that pixel would affect the likelihood of civil war in a country.
  6. There is a general argument underlying all the Buhaug-Theisen articles that "because regression coefficients change a lot across our models, the result of Burke must be non-robust." But this is faulty statistical logic. If the regression coefficients change between models, this means that all the models (or all but one) are mis-specified because they have different omitted variables, which causes a different amount of bias in each model (and thus the different regression coefficients). This does not imply that the "true effect" of climate is equal to zero. There can only be one true effect. A good model might identify this effect and be robust to small variations in the model, but the true relationship between any X and Y cannot be generally "non-robust", and presenting non-robust estimates certainly does not prove that the true effect is zero.
  7. Plotting the results in Burke et al. is pretty compelling evidence. There is some noise (which is what drives the Buhaug claims) but just plotting the data early on might have prevented all this controversy (perhaps I am dreaming). 
  8. I think Miguel and Satyanath should be praised for revisiting their 2004 findings, including an additional and important control variable and then altering their conclusions based on their new findings.
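The F-test mentioned in point 4 takes one extra line in Stata once the fixed effects are in the model. A minimal sketch with hypothetical variable names (not the actual Burke et al. specification):

* hypothetical country-year panel with conflict, temperature, and precipitation
reg conflict c.temp c.precip i.country i.year
* joint F-test that the country fixed effects are all zero, i.e. that countries
* are interchangeable once the other controls are included
testparm i.country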
Note: I am not opposed to qualitative research. However, I do think that qualitative researchers must carefully consider the limited extent of their observations when drawing inferences.  Large-scale political conflict is a rare event, so it is unlikely that a randomly sampled case study will observe conflict in association with climatic events, even if there is a strong relationship. More discussion of this point is here.

My coauthor Marshall Burke has some additional thoughts on Solow's Comment and the general debate on G-FEED.

5.02.2013

Getting in touch with our feelings

If the goal of our work is to improve global human welfare, we should be finding ways to measure it.

The Expression of Emotions in 20th Century Books
Alberto Acerbi, Vasileios Lampos, Philip Garnett, R. Alexander Bentley
Abstract: We report here trends in the usage of “mood” words, that is, words carrying emotional content, in 20th century English language books, using the data set provided by Google that includes word frequencies in roughly 4% of all books published up to the year 2008. We find evidence for distinct historical periods of positive and negative moods, underlain by a general decrease in the use of emotion-related words through time. Finally, we show that, in books, American English has become decidedly more “emotional” than British English in the last half-century, as a part of a more general increase of the stylistic divergence between the two variants of English language.
Historical periods of positive and negative moods. Difference between z-scores of Joy and Sadness for years from 1900 to 2000 (raw data and smoothed trend). Values above zero indicate generally ‘happy’ periods, and values below zero indicate generally ‘sad’ periods.

People are unhappy during economic depressions and world wars...

h/t Brenda

4.23.2013

Self-control and long run outcomes


A gradient of childhood self-control predicts health, wealth, and public safety
Terrie E. Moffitt et al.
Abstract: Policy-makers are considering large-scale programs aimed at self-control to improve citizens’ health and wealth and reduce crime. Experimental and economic studies suggest such programs could reap benefits. Yet, is self-control important for the health, wealth, and public safety of the population? Following a cohort of 1,000 children from birth to the age of 32 y, we show that childhood self-control predicts physical health, substance dependence, personal finances, and criminal offending outcomes, following a gradient of self-control. Effects of children’s self-control could be disentangled from their intelligence and social class as well as from mistakes they made as adolescents. In another cohort of 500 sibling-pairs, the sibling with lower self-control had poorer outcomes, despite shared family background. Interventions addressing self-control might reduce a panoply of societal costs, save taxpayers money, and promote prosperity.
 Related results from China's One Child Policy here.

4.01.2013

The potential for global fisheries management



Status and Solutions for the World’s Unassessed Fisheries
Christopher Costello, Daniel Ovando, Ray Hilborn, Steven D. Gaines, Olivier Deschenes, Sarah E. Lester
Recent reports suggest that many well-assessed fisheries in developed countries are moving toward sustainability. We examined whether the same conclusion holds for fisheries lacking formal assessment, which comprise >80% of global catch. We developed a method using species’ life-history, catch, and fishery development data to estimate the status of thousands of unassessed fisheries worldwide. We found that small unassessed fisheries are in substantially worse condition than assessed fisheries, but that large unassessed fisheries may be performing nearly as well as their assessed counterparts. Both small and large stocks, however, continue to decline; 64% of unassessed stocks could provide increased sustainable harvest if rebuilt. Our results suggest that global fishery recovery would simultaneously create increases in abundance (56%) and fishery yields (8 to 40%).

3.14.2013

Now on Stata-bloggers...

Francis Smart has set up Stata-bloggers, a new blog aggregator for Stata users (modeled after R-bloggers). FE will be contributing there, but there's lots of other goodies worth checking out from more prolific bloggers.

Everyone say "Thank you, Francis."

2.18.2013

Plotting restricted cubic splines in Stata [with controls]

Michael Roberts has been trying to convince me to use restricted cubic splines to plot highly nonlinear functions, in part because they are extremely flexible and they have nice properties near their edges.  Unlike polynomials, information at one end of the support only weakly influences fitted values at the other end of the support. Unlike the binned non-parametric methods I posted a few weeks ago, RC-splines are differentiable (smooth). Unlike other smooth non-parametric methods, RC-splines are fast to compute and easily account for control variables (like fixed effects) because they are summarized by just a few variables in an OLS regression. They can also be used with spatially robust standard errors or clustering, so they are great for nonlinear modeling of spatially correlated processes. 
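For reference, the regression side of this is standard Stata: the built-in mkspline command generates the spline terms, and they enter an OLS regression like any other variables. A minimal sketch with hypothetical variable names:

* restricted cubic spline in x with 5 knots: mkspline creates just 4 regressors
mkspline xs = x, cubic nknots(5) displayknots
* the spline terms enter an ordinary regression, so fixed effects and
* clustered standard errors work exactly as usual
reg y xs* i.region, vce(cluster region)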

In short: they have lots of advantages. The only disadvantage is that it takes a bit of effort to plot them since there's no standard Stata command to do it.

Here's my function plot_rcspline.ado, which generates the spline variables for the independent variable, fits a spline while accounting for control variables, and plots the partial effect of the specified independent variables (adjusting for the control vars) with confidence intervals (computed via delta method). It's as easy as

plot_rcspline y x

and you get something like


where the "knots" are plotted as the vertical lines (optional).

Help file below the fold.  Enjoy!

Related non-linear plotting functions previously posted on FE:
  1. Boxplot regression
  2. Visually-weighted regression
  3. Watercolor regression
  4. Non-parametric three-dimensional regression
  5. Binned non-parametric regression
  6. Polynomial regression of any degree

1.24.2013

Quickly plotting nonparametric response functions with binned independent variables [in Stata]

Yesterday's post described how we can bin the independent variable in a regression to get a nice non-parametric response function even when we have large data sets, complex standard errors, and many control variables.  Today's post is a function to plot these kinds of results.

After calling bin_parameter.ado to discretize an independent variable (see yesterday's post), run a regression of the outcome variable on the sequence of generated dummy variables (this command can be as complicated as you like, so feel free to throw your worst semi-hemi-spatially-correlated-auto-regressive-multi-dimensional-cluster-block-bootstrap standard errors at it). Then run plot_response.ado (today's function) to plot the results of that regression (with your fancy standard errors included). It's that easy.

Here's an example. Generate some data where Y is a quadratic function of X and a linear function of Z:
set obs 1000
gen x = 10*runiform()-5
gen z = rnormal()
gen e = rnormal()
gen y = x^2+z+5*e
Then bin the parameter using yesterday's function and run a regression of your choosing, using the dummy variables output by bin_parameter:
bin_parameter x, s(1) t(4) b(-4) drop(-1.6) pref(_dummy)
reg y _dummy* z
After the regression, call plot_response.ado to plot the results of that regression (only the component related to the binned variables). To make this easier, the arguments describing the bins take the same format as those used by bin_parameter:
plot_response, s(1) t(4) b(-4) drop(-1.6) pref(_dummy)
The result is a plot that clearly shows us the correct functional form:


Note: plot_response.ado requires parmest.ado (download from SSC  by typing "net install st0043.pkg" at the command line). It also calls a function parm_bin_center.ado that is included in the download.

Citation note: If you use this suite of functions in publication, please cite: Hsiang, Lobell, Roberts & Schlenker (2012): "Climate and the location of crops."

Help file below the fold.

1.23.2013

Binning a continuous independent variable for flexible nonparametric models [in Stata]


Sometimes we want a flexible statistical model to allow for non-linearities (or to test if an observed relationship is actually linear). It's easy to run a model containing a high-degree polynomial (or something similar), but these can become complicated to interpret if the model contains many controls, such as location-specific fixed effects. Fully non-parametric models can be nice, but they require partialling out the data and standard errors can become awkward if the sample is large or something sophisticated (like accounting for spatial correlation) is required.

An alternative that is easy to interpret, and handles large samples and complex standard errors well, is to convert the independent variable into discrete bins and to regress the outcome variable on dummy variables that represent each bin.

For example, in a paper with Jesse we take the typhoon exposure of Filipino households (a continuous variable) and make dummy variables for each 5 m/s bin of exposure. So there is a 10_to_15_ms dummy variable that is zero for all households except for those households whose exposure was between 10 and 15 m/s, and there is a different dummy for exposure between 15 and 20 m/s, etc.  When we regress our outcomes on all these dummy variables (and controls) at the same time, we recover their respective coefficients -- which together describe the nonlinear response of the outcome. In this case, the response turned out to be basically linear:


The effect of typhoon exposure on Filipino households' finances.
From Anttila-Hughes & Hsiang (2011)


This approach coarsens the data somewhat, so there is some efficiency loss, and we should be wary of Type II error if we compare bins to one another. But as a way to determine the functional form of a model, it works well so long as you have enough data.
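Concretely, the construction is just a set of indicator variables; here is a minimal sketch with a hypothetical wind-exposure variable and controls, which is roughly what the function below automates:

* 5 m/s bins of a continuous wind-exposure variable (hypothetical names)
gen byte w10_15 = (wind >= 10 & wind < 15)
gen byte w15_20 = (wind >= 15 & wind < 20)
gen byte w20_25 = (wind >= 20 & wind < 25)
* ...one dummy per bin across the support, leaving one bin out as the comparison group
reg outcome w10_15 w15_20 w20_25 i.province i.year, vce(cluster province)
* the coefficients on the w* dummies trace out the nonlinear response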

I found myself rewriting Stata code to bin variables like this in many different contexts, so I wrote bin_parameter.ado to do it for me quickly.  Running these models can now be done in two lines of code (one of which is the regression command). bin_parameter allows you to specify a bin width, a top bin, a bottom bin and a dropped bin (for your comparison group). It spits out a bunch of dummy variables that represent all the bins which cover the range of the specified variable. It also has options for naming the dummy variables so you can use the wildcard notation in regression commands. Here's a short example of how it can be used:

set obs 1000
gen x = 10*runiform()-5
gen y = x^2
bin_parameter x, s(1) t(4) b(-4) drop(-1.6) pref(x_dummy)
reg y x_dummy*

Help file below the fold.

1.22.2013

Prettier graphs with less headache: use schemes in Stata

I'm picky about graphs looking nice. So for a long time I did silly things that drove me nuts, like typing "ylabel(, angle(horizontal))" for every single Stata graph I made (since some of the default settings in Stata are not pretty). I always knew that you could set a default scheme in Stata to manage colors, but I didn't realize that it could do more or be customized. See here to learn more.

After a bit of playing around, I wrote my own scheme. The text file looks pitiful, I know. But it saves me lots of headache and makes each plot look nicer with zero marginal effort.  You can make your own, or if you download mine, put the file scheme-sol.scheme in ado/personal/ (where you put other ado files) and then type
set scheme sol
at the command line. Or set this as the default on your machine permanently with
set scheme sol, perm
It will make your plots look kind of like this:
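In day-to-day use, the payoff is that graph commands shed their cosmetic options. A rough before/after sketch with hypothetical variables (which specific defaults change depends on what you put in the scheme file):

* without a scheme, cosmetic options pile up on every single command
scatter y x, ylabel(, angle(horizontal)) graphregion(color(white))
* with a scheme set once, defaults like these are handled for you
set scheme sol
scatter y x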