Sunday, October 30, 2016

More on the collapse of science

The first dropped shoe announced the “collapse” of science. It clearly dropped with a loud bang as this “news” has become a staple of conventional wisdom. The second shoe is poised and ready to drop. Its ambition? To explain why the first shoe fell. Now that we know that science is collapsing we all want to know why exactly it is doing so and whether there is anything we can do to bring back the good old days.

So why the fall? The current favorite answer appears to be a combination of bad incentives for ambitious scientists and statistical tools (significance testing being the current bête noire) that “gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding” (now that’s a rhetorical flourish! cited here, p. 12). So, powerful tools in ambitious hands lead to scientific collapse. In fact, ambition may be beside the point; academic survival alone may be a sufficient motive. Put people in hypercompetitive environments and give them a tool that “lets” them get their work “done” in a timely manner and all hell breaks loose.[1]

I have just read several papers that develop this theme in great detail. They are worth reading, IMO, for they do a pretty good job of identifying real forces in contemporary academic research (and not limited to the sciences). These forces are not new. The above “baloney” quote is from 1998 and there are prescient observations relating to somewhat similar (though not identical) effects made as early as 1948. Here’s Leo Szilard (cited here):

Answer from the hero in Leo Szilard’s 1948 story “The Mark Gable Foundation” when asked by a wealthy entrepreneur who believes that science has progressed too quickly, what he should do to retard this progress: “You could set up a foundation with an annual endowment of thirty million dollars. Research workers in need of funds could apply for grants, if they could make a convincing case. Have ten committees, each composed of twelve scientists, appointed to pass on these applications. Take the most active scientists out of the laboratory and make them members of these committees. ...First of all, the best scientists would be removed from their laboratories and kept busy on committees passing on applications for funds. Secondly the scientific workers in need of funds would concentrate on problems which were considered promising and were pretty certain to lead to publishable results. ...By going after the obvious, pretty soon science would dry out. Science would become something like a parlor game. ...There would be fashions. Those who followed the fashions would get grants. Those who wouldn’t would not.”

The papers I’ve read come in two flavors. The first are discussions of the perils of p-values. Those who read the Andrew Gelman blog are already familiar with many of the problems. The main issue seems to be that fishing for significance is extremely hard to avoid, even by those with true hearts and noble natures (see the Simonsohn (a scourge of p-hacking) quote here). Here (and the more popular here) are a pair of papers that go into how this works in ways that I found helpful. One important point the author (David Colquhoun (DC)) makes is that the false discovery (aka the false positive) problem is quite general, and endemic to all forms of inductive reasoning. It follows from the “obvious rules of conditional probabilities.” So this is not just a problem for Fisher and significance testing, but applies to all modes of inductive inquiry, including Bayesian modes.
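To make the conditional-probability point concrete, here is a minimal sketch. The function name and the numbers (the share of tested hypotheses that are actually true, the power of the test, and the significance threshold) are illustrative assumptions, not figures taken from DC's papers. It just applies Bayes' rule to ask what fraction of "significant" results are false positives.

```python
def false_discovery_rate(prior, power, alpha):
    """Fraction of 'significant' results that are false positives.

    prior: assumed share of tested hypotheses that are really true
    power: probability a real effect yields a significant result
    alpha: probability a null effect yields a significant result
    """
    true_positives = power * prior            # real effects that are detected
    false_positives = alpha * (1.0 - prior)   # null effects that pass anyway
    return false_positives / (true_positives + false_positives)


# Hypothetical scenario: 1 in 10 tested hypotheses is true,
# tests have 80% power, and the threshold is p < 0.05.
rate = false_discovery_rate(prior=0.1, power=0.8, alpha=0.05)
print(f"False discovery rate: {rate:.2f}")
```

Under these assumed numbers, roughly 36% of the "discoveries" are false even though every test was run at the 5% level and nobody cheated. Lower the prior (i.e. test hypotheses in a domain where little is known) and the rate climbs further.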

Assuming this is right and that even the noble might be easily misled statistically, is there some way of mitigating the problem? One rather pessimistic paper suggests that the answer is no. Here (with a popular exposition here) is a paper that gives an evolutionary model of how bad science must win out over good in our current academic environment. It is a kind of Gresham’s law theory where quick, successful bad work drives out slower, careful good work. In fact, the paper argues that not even a culture where replication is highly valued will stop bad work from pushing out the good so long as “original” research remains more highly valued than “mere” replication.

The authors, Smaldino and McElreath (S&M), base these grim projections on an evolutionary model they develop which tracks the reward structure of publication and the incentives that these impose on individuals and labs. I am no expert in these matters, but the model looks reasonable enough and the forces it identifies and incorporates seem real enough. The solution: shift from a culture that rewards “discovery” to one that rewards “understanding.”

I personally like the sound of this (see below), but I am skeptical that it is operationalizable, at least institutionally. The reason is that valuing understanding requires exercising judgment (it involves more than simple bookkeeping) and this is both subjective (and hence hard to defend in large institutional settings) and effortful (which makes it hard to get busy people to do). Moreover, it requires some very non-trivial understanding of the relevant disciplines and this is a lot to expect even within small departments, let alone university-wide APT committees or broad-based funding agencies. A tweet by a senior scientist (quoted in S&M p.2) makes the relevant point: “I’ve been on a number of search committees. I don’t remember anybody looking at anybody’s papers. Number and IF [impact factor] of pubs are what counts.” I don’t believe that this is only the result of sloth and irresponsibility. In many circumstances it is silly to rely on your own judgment. Given how specialized so much good work has become, it is unreasonable to think that we can as individuals make useful judgments about the quality of work. I don’t see this changing, especially above the department level, anytime soon.

Let me belabor this. It is not clear how people above the department level would competently judge work outside their area of expertise. I know that I would not feel competent to read and understand a paper in most areas outside of syntax, especially if my judgment carried real consequences. If so, who can we get to judge whose judgments would be reasonable? And if there is no one then what can one do but count papers weighted by some “prestige” factor? Damned if I know. So, I agree that it would be nice if we could weight matters towards more thoughtful measures that involved serious judgment, but this will require putting most APT decisions in the hands of those that can make these judgments, namely, leaving them effectively at the department level, which will not be happening anytime soon (and which has its own downsides if my own institution is anything to go by).

An aside: this is where journals should be stepping in. However, it appears that they are no longer reliable indicators of quality. Many are very conservative institutions whose stringent review processes tend to promote “safe” incremental findings. Many work hard to protect their impact factors to the point of only very reluctantly publishing work critical of previously published work. Many seem just a stone’s throw removed from show business where results are embargoed until an opening day splash can be arranged. At any rate, professional journals are a venue in which responsible judgment could be exercised, but it appears that it is difficult even here.

So, there are science (indeed academy) wide forces imposing shallow measures for evaluation and reward that bad statistical habits can successfully game. I have no problem believing this. But I still do not see how these forces suffice to explain the “crisis” before us. Why? Because such explanations are too general and the problems appear to hold not in general but in localizable domains of inquiry. More exactly, the incentives S&M cite and the problems of induction that DC elaborates are pervasive. Nonetheless, the science (more particularly, replication) crisis seems localized in specific sub-areas of investigation, ones that I would describe as more concerned with establishing facts than with detailing causal mechanisms.[2] Here’s what I mean.

What’s the aim of inquiry? For DC it is “to establish facts, as accurately as possible” (here, 1). For me, it is to explain why things are as they are.[3] Now, I concede that the second project relies on the first. But I would equally claim that the first relies on the second. Just as we need facts to verify theories, we need theories to validate facts. The main problem with lots of “science” (and I am sure you won’t be surprised to hear me write this) is that it is theory free. Thus, the only way to curb its statistical enthusiasm is by being methodologically pristine. You gotta get the stats exactly right for this is the only thing grounding the result. In most cases of drug trials, for example, we have no idea why they work, and for practical purposes we may not (immediately) care. The question is do they, not how. Sciences stuck in the “does it” stage rather than the “how does it do it and why” stages, not surprisingly, have it tough. Fact gathering in the absence of understanding is going to be really hard even with great stats tools. Should we be surprised that in areas where we know very little stats can and do regularly mislead?

Note that the real sciences do not seem to be in the same sad state as psych, bio-med and neuroscience. You don’t see tons of articles explaining how the physics of the last 20 years is rotten to its empirical core. Not that Nobel winning results are not challenged. They can be and are. Here’s a recent example in which dark energy and the thesis that the universe is expanding at an accelerating rate are being challenged (see here) based on more extensive data. But in this case, evaluation of the empirical possibilities heavily relies on a rich theoretical background. Here’s a quote from one of the lead critics. Note how the critique relies on an analysis of an “oversimplified theoretical model” and how some further theoretical sophistication would lead to different empirical results. This interplay between theory and data (statistically interpreted data by and large) is not available in domains where there is no “fundamental theory” (i.e. non-trivial theory).

'So it is quite possible that we are being misled and that the apparent manifestation of dark energy is a consequence of analysing the data in an oversimplified theoretical model - one that was in fact constructed in the 1930s, long before there was any real data. A more sophisticated theoretical framework accounting for the observation that the universe is not exactly homogeneous and that its matter content may not behave as an ideal gas - two key assumptions of standard cosmology - may well be able to account for all observations without requiring dark energy. Indeed, vacuum energy is something of which we have absolutely no understanding in fundamental theory.'

So, IMO, the problem with most problematic “science” is that it is not yet really science. It has not moved from the earliest data collection stage to the explanation stage where what’s at issue are not facts but mechanisms. If this is roughly right, then the “end of science” problems will dissipate as understanding deepens (if it ever does (no guarantee that it will or should)) in these domains. So understood, the demise of science that replication problems herald is more a problem for the particular areas identified (and more an indication of how little is known here) than for science as a whole.[4]

That said, let me end with one or two caveats. The science-in-crisis narrative rests on the slew of false discoveries regularly churned out. Szilard’s worry mooted in the quote above is different. His worry is not false discoveries but the trivialization of research as big science promotes quantity and incrementalism over quality and concern for the big issues. Interestingly, this too is a recurrent theme. Szilard voiced this worry over 60 years ago. More recently (the last 15 years or so), Peter Lawrence voiced similar concerns in two pieces that discuss Szilard’s problem in the context of how scientific work is evaluated for granting and publication (here and here). And the problem is discussed in very much the same terms today. Here (and here) are two papers in Nature from 2016 which address virtually the same questions in virtually the same terms (i.e. how institutions reward more of the same research, punish thinking about new questions, look at publication numbers rather than judge quality etc.). What is striking is that this is all stuff noted and lamented before and the proposed fixes are pretty much the same: calls for judgment to replace auditing.

I agree that this would be a good idea. In fact, I believe that one of the reasons for the disparagement of theory in linguistics is that it makes the same demands on judgment for adequate evaluation. It is easier to see if a story “captures” the facts than to see if it offers an interesting explanation. So I am all in favor of promoting judgment as an important factor in scientific evaluation. However, to repeat, I am skeptical that this is actually doable, as judgment is not something that bureaucracies do well and, like it or not, today science is big and so, not surprisingly, it comes with a large bureaucracy attached. Let me explain.

Today science is conducted in big settings (universities, labs, foundations, funding agencies). Big settings engender bureaucratic oversight, and not for entirely bad reasons. Bureaucracies arise in response to real needs where the actions of large numbers of people require coordination. And given the size of modern science, bureaucracy is inevitable. Unfortunately, bureaucracies by necessity favor blunt metrics over refined judgment (i.e. quantitative auditable measures over nuanced hard to compare evaluations). And all of this fosters the problems that Szilard and Lawrence and the Nature comments worry about. As noted, I think that this is simply unavoidable given the current economics of research. The hopeful (e.g. Lawrence) think that there are ways of mitigating these trends. I hope they are right. However, given the fact that this problem recurs regularly and the same solutions get suggested just as regularly, I have my doubts.

Let me end on a more positive note. It may not be possible to inject judgment into the process in a systematic way. However, it may be possible to find ways to promote unconventional research by having a sub-part of the bureaucracy looking for it. In the old days when money was plentiful, “wacky” research got institutional support because everything did (think of the early days of GG funding, or early CS). When money gets scarcer we still need to put some aside to support unconventional work. This is a problem in portfolio management: put most of your cash on safe stuff and 10% or so on unconventional stuff. The latter will mostly fail, but when it pays off, it pays off big. The former rarely fails, but its payoffs are small. Maybe the best we can do right now is allow our institutions to start thinking about the wild 10% just a little bit more.
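For what it is worth, the portfolio arithmetic can be sketched in a few lines. Everything here is hypothetical: the function name, the success probabilities and the payoffs are made-up numbers chosen only to illustrate the "mostly fails, but pays off big" shape, not estimates of anything real.

```python
def portfolio_summary(n_projects, wild_share, safe, wild):
    """Expected payoff of a funding portfolio, plus the chance of one big hit.

    safe, wild: (success_probability, payoff_if_success) pairs (assumed values).
    Projects are treated as equally costly and independent, a simplification.
    """
    n_wild = round(n_projects * wild_share)
    n_safe = n_projects - n_wild
    expected = n_safe * safe[0] * safe[1] + n_wild * wild[0] * wild[1]
    # Probability that at least one unconventional project succeeds.
    p_any_wild_hit = 1.0 - (1.0 - wild[0]) ** n_wild
    return expected, p_any_wild_hit


# Hypothetical numbers: safe work succeeds 90% of the time with a small payoff,
# unconventional work succeeds 5% of the time with a large one.
expected, p_hit = portfolio_summary(100, 0.10, safe=(0.9, 1.5), wild=(0.05, 40.0))
print(f"expected payoff: {expected:.1f}, chance of a big hit: {p_hit:.2f}")
```

With these invented figures, ten long shots out of a hundred projects give about a 40% chance of at least one big payoff, while a portfolio with no wild share has none.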

So, the replication crisis will take care of itself as it is largely a reflection of the primitive nature of most of the “science” that it infects. The trivialization problem, IMO, is more serious and here, IMO, the problem is and will remain much harder to solve.

[1] I have long thought that stats should be treated a little like the Rabbis treated Kabbalah. The Rabbis banned its study as too dangerous until the age of forty, i.e. explosive in the hands of clever but callow neophytes.
[2] The collapse seems to be restricted. In psych, it is largely restricted to social psych. Perception and cognition, for example, seem relatively immune to the non-replicability disease. In bio-medicine, the bio part also seems healthy. Nobody is worrying about the findings in basic cell biology or physiology. The problem seems limited to non “basic” discoveries (e.g. is cholesterol/fat bad for you, does such and such drug work as advertised, and so on). In neuroscience the problems also seem largely restricted to fMRI results of the sort that make it into the NYTs. If one were inclined to be skeptical, one might say that the problems arise not in those areas where we know something about the underlying mechanisms but in those domains where we know relatively little. But who would be so skeptical?
[3] The search for explanation ends up generating novel data (facts). But the aim is not to establish new facts but to understand what is going on. In the absence of theory it might even be hard to know what a “fact” is. Is it a fact that the sun rises in the East and sets in the West? Well, yes and no. It depends.
[4] It also reflects the current scientism of the age. Nothing nowadays is legit unless wrapped up in scientific looking layers. Not surprisingly much trivial insight is therefore statistically marinated so that it can look scientific.