Thursday, April 25, 2013

Methodological Hygiene

I was planning to write something thoughtful today on formal versus substantive universals and how we have made lots of progress wrt the former but quite a bit less wrt the latter. This post, however, will have to wait.  Why? Lunch!  Over lunch I had an intriguing discussion of the latest academic brouhaha, and how it compares with one that linguists know about, “L’affaire Hauser.”  For those of you who don’t read the financial pages, or don’t religiously follow Krugman (here, first of many), or don’t watch Colbert (here), let me set the stage.

In 2010, two big name economists, Carmen Reinhart and Kenneth Rogoff (RR), wrote a paper building on their very important work (This Time is Different) chronicling the aftermath of financial crises through the ages. The book garnered unbelievable reviews and made the two rock stars of the Great Recession.  The paper that followed was equally provocative, though not nearly as well received.  The paper claimed to find an important kind of debt threshold which when crossed caused economic growth to tank.  Actually, this is a bit tendentious.  What the paper claimed was that there was a correlation between debt to GDP ratio of 90% and higher and the collapse of growth.  Note: correlation, not causation.  However, what made the paper hugely influential was the oft-suggested hint that the causality was from high debt to slow growth rather than the other way around or some combination of the two. The first interpretation was quickly seized upon by the “Very Special Serious People” (VSP), aka “austerians,” to justify policies of aggressively cutting budget deficits rather than fiscally priming the economic pump to combat high unemployment.[1] Keynesians like Krugman (and many others, including Larry Summers, another famous Harvardian) argued that the causality was from slow growth to large deficits and so the right policy was to boost government spending to fight unemployment as doing this would also alleviate the debt “problem.”[2] At any rate, it is safe to say that RR’s 2010 paper had considerable political and economic impact.  Ok, let’s shift to the present, or at any rate the last week.

Three U Mass economists (Herndon, Ash and Pollin: the lead being a first-year grad student whose econometrics class project was to replicate some well-known result in order to learn econometric methods) showed that the 2010 paper was faulty in several important ways: (i) there was a spreadsheet error with some important info left out (this accounted for a small part of RR’s result), (ii) there was a trimming decision where some data points that could be deemed relevant, as they trended against the RR conclusion, were left out (this accounted for a decent percentage of the RR effect), and (iii) there was a weighting decision in which one year’s results were weighted the same as 17 years’ worth of results (this accounted for a good chunk of RR’s results).  All together, when these were factored in, RR’s empirical claim disappeared. Those who click onto the Colbert link above will get to meet the young grad student who started all of this.  If you are interested in the incident, just plug “Reinhart and Rogoff” into Google and start reading. To say that this is now all over the news is an understatement. Ok, why do I find this interesting for us?  Several reasons.

First, though this is getting well discussed and amply criticized in the media, I did not read anywhere that Harvard was putting together a panel to investigate bad scientific practice. Spreadsheet errors are to be expected. But the other maneuvers look like pretty shoddy empirical practice, i.e. even if defensible, they should be up front and center in any paper. They weren’t. But, still no investigation. Why not? It cannot be because this is “acceptable,” for once exposed it seems that everyone finds it odd. Moreover, RR’s findings have been politically very potent, i.e. consequential.  So, the findings were important, false and shoddy. Why no investigation? Because this stuff, though really important, is hard to distinguish from what everyone does?

Second, why no exposé in the Chronicle accompanied by a careful think piece about research ethics?  One might think that this would be front page academic news and that venues that got all excited over fraud might find it right up their alley to discuss such an influential case.

It is worth comparing this institutional complacency to the reaction our own guardians of scientific virtue had wrt Hauser.  They went ape (tamarin?) shit! Professors were impanelled to review his lab’s work, he was censured and effectively thrown out of the university, big shot journal editors reviled him in the blogosphere, and he was held out as an object lesson in scientific vice. The Chronicle also jumped onto the bandwagon, tsk-tsking about dishonesty and how it derails serious science. Moreover, even after all the results in all the disputed papers were replicated: no second thoughts, no revisiting and re-evaluating the issues, nothing.  However, if one were asked to weigh the risks to scientific practice of RR’s behavior and Hauser’s alleged malpractice it’s pretty clear that the former are far more serious than the latter. RR’s results did not replicate. And, I am willing to bet, their sins are far more common and so pollute the precious data stream much much more. Indeed, there is a recent paper (here) that suggests that the bulk of research in neuroscience is not replicable, i.e. the data are simply not, in general, reliable. Do we know how generally replicable results in psycho are?  Anyone want to lay a bet that the number is not as high as we like to think?

Is this surprising? Not really, I think. We don’t know that much about the brain or the mind. It strikes me that a lot of research consists of looking for interesting phenomena rather than testing coherent hypotheses. When you know nothing, it’s not clear what to count or how to count it.  The problem is that the powerful methods of statistics encourage us to think that we know something when in fact we don’t. John Maynard Smith, I think, said that statistics is a tool that allows one to do 20 experiments and get one published in Nature (think p<.05).  Fraud is not the problem, and I suspect that it never has been. The problems lie in the accepted methods, which, unless used very carefully and intelligently, can muddy the empirical waters substantially. What recent events indicate (at least to me) is that if you are interested in good data, then it’s the accepted methods that need careful scrutiny. Indeed, if replicability is what we want (and isn’t that the gold standard for data?), maybe we should all imitate Hauser for he seems to know how to get results that others can get as well.
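To put a rough number on the Maynard Smith quip, here is a minimal sketch. It assumes 20 independent experiments, no real effect in any of them, and the conventional 0.05 significance cutoff; those assumptions and the function names are mine, purely for illustration, and real experiments are rarely this tidy.

```python
# Back-of-the-envelope arithmetic for the "20 experiments" quip.
# Assumptions (illustrative only): the experiments are independent,
# the null hypothesis is true in every one, and the cutoff is 0.05.
ALPHA = 0.05
N_EXPERIMENTS = 20


def expected_false_positives(n, alpha):
    """Average number of experiments that clear the cutoff by chance alone."""
    return n * alpha


def prob_at_least_one_false_positive(n, alpha):
    """Chance that at least one of n null experiments clears the cutoff."""
    return 1.0 - (1.0 - alpha) ** n


expected = expected_false_positives(N_EXPERIMENTS, ALPHA)
p_any = prob_at_least_one_false_positive(N_EXPERIMENTS, ALPHA)

print(f"expected spurious 'hits': {expected:.2f}")
print(f"chance of at least one:   {p_any:.2f}")
```

Under these assumptions one spurious "finding" per 20 tries is the expected yield, and the chance of getting at least one is a bit under two in three, with nobody having done anything fraudulent.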

I will end on a positive note: we linguists are pretty lucky.  Our data is easily accessed and very reliable (as Sprouse and Almeida have made abundantly clear).  We are also lucky in that we have managed to construct non-trivial theories with reasonable empirical reach.  This acts to focus research and, just as importantly, makes it possible to identify “funny looking” data so that it can be subjected to careful test. Theories guard against gullibility.  So, despite the fact that we don’t in general gather our data as “carefully” as neuroscientists and psychologists and economists gather theirs, we don’t need to.  It’s harder to “cheat,” statistically or otherwise, because we have some decent theory and because the data is ubiquitous, easy to access and surprisingly robust.  This need not always be so. In the future, we may need to devise fancy experiments to get data relevant to our theories. But to date, informal methods have proven sufficient.  Strange that some see this as a problem, given the myriad ways there are to obscure the facts when one is being ultra careful.

[1] VSP is Krugman’s coinage. I am not sure who first coined the second.
[2] The scare quotes are to indicate that there is some debate about whether there actually is a problem, at least in the medium term.


  1. I don't really see why there would be an investigation. Some people get the gold and some people get the shaft. Linguists should know this better than any school. For example, what's UG based on? It's not empirical and it's not a spreadsheet error so...

    And yet Chomsky has a position at MIT and Pinker has a position at Harvard, even though their arguments scurry faster than a cockroach when held up to the light (for one example, see:

    I can't speak too much for the Hauser case, but I think it boils down to money (as all things possibly do). Harvard can shaft Hauser because they need to save face. RR got picked up by some high ranking politicians who smell like endowments. Now if you'll excuse me, I'm off to go hack my version of Excel...

    1. Followed your link and remain dumbfounded. Maybe you'd like to elaborate on how this bears on anything.

    2. I was just trying to point out that a comparison between RR and Hauser isn't really possible since I think there's a double standard at work. On the one hand, when there's a public outcry, linguists can be thrown under the bus, but apparently researchers in other fields can get off scot-free. On the other hand, linguists can publish misleading or wrong information (which is what my link was related to) for years without so much as a glance from the administration. Shoddy empirical practice in linguistics will lead to censure if it suits the institution. A comparison of the Hauser case to RR just shows that those in other fields are playing by different rules.

    3. Linguists huh? Of the four - John Kim, Steven Pinker, Alan Prince, and Sandeep Prasada - one is a linguist, the other three psychologists. Note the journal was Cognitive Psychology, not Linguistic Inquiry. But let's put that aside: there is no question that "false" data get published and even widely used. I personally think that the horror of data pollution is a badly misplaced fear, one that stems I suspect from a naive view (empiricist actually) of the scientific method. Put this aside and we can get down to serious data evaluation: is the data point important? Is it reliable? Is it theoretically illuminating? Does it make a difference? Is the effect size big or small? These questions in place will make it harder for bad data to gain a foothold.

    4. Yep, Linguistics. Because language! I'm not sure who you're claiming the sole linguist author of that paper is, but I count more than one. Google tells me there's a phonologist and a guy who's published extensively about language (hint: his name starts with a P and ends with a inker). But let's put that aside and assume every scholar is only ever one kind of scholar. And let's talk about data in linguistics. The horror of data pollution in linguistics (which I thought the link to LL pointed out) is that sometimes the data is not so much polluted as it is purely made up. Data evaluation needs to start with: did you actually look at the data or are you just making this stuff up? That question will make it harder for bad data to gain a foothold. What I was getting at in the previous comments is that if linguists aren't going to check themselves, then no one else is either, unless the school or organization that the naughty data polluter works for stands to lose money.

    5. I actually think in linguistics it is not always the case that the data are ubiquitous and easy to access. Sure, for English or German that is true. But when a researcher reports data from a virtually unknown *exotic* language EL, not every linguist can just check for him or herself. In those cases linguists are in no better position than other scientists [e.g. the psychologists who depended on Hauser to report genuine findings] - so they need to be able to trust that the researcher who knows EL is reporting genuine findings. Especially in cases where these data are [or seem to be] making a big difference to a research program/hypothesis/theory. And I seem to remember that linguists are very cautious in such cases...

    6. Joemcveigh: let me concede that 'linguist' counting is a mug's game. Let me also concede that there are times when stylized facts enter the discourse more robustly than they should. However, let me also say that equating linguistics with language is off the mark, at least for someone like me. 'Language' is not a term of art, but a descriptive term. The unit of analysis is the idiolect, though through some (lots of) idealization we can talk about 'English,' 'French,' etc. I recall the issue in lexical storage that your post referred to. I also recall there was lots of push back among people working on this as not all swung the same way on the data. Let me remind you of the Sprouse-Almeida stuff, for here someone looked at the ling data and found it remarkably robust and reliable. Is it perfect? No. Is it very good by psych standards? Oh yes, very, very good. Can it be made better still? That I don't know as the numbers reported by SA were very high. That said, yes, linguists can also pollute, it's just not generally fraught with nasty scientific or practical consequences.

    7. My previous comment was a reply to Norbert but it did not show up where I intended it. This is just a short reply to joemcveigh:

      Pinker is 'officially' a psychologist and ever since he teamed up with Ray Jackendoff to argue against Hauser, Chomsky, and Fitch this seems to be worth stressing.

      Regarding the data: in addition to what I just wrote there is also some disagreement about data interpretation of plain old English data. One piece that might interest you is at [there are other interesting pieces by the same author on LingBuzz, so have a look around]

    8. There is some disagreement within linguistics over data points and their interpretation, as in any other scientific discipline. However, the vast majority of the acceptability judgment data in the literature appear to be replicable (Sprouse, Schütze & Almeida). I think Christina and joemcveigh are getting the wrong end of the stick because they don't have a sense for what is and isn't a central hypothesis or data point within syntactic theory. For example, Pinker's claim regarding regular/irregular past tense forms is used to support his own particular theory of verbal inflection. If the data supporting this theory turn out to be wrong, that's too bad for Pinker, but it's not something that would significantly undermine syntactic theory as a whole. Similarly, the A-over-A condition is just one of many proposals for a locality constraint on movement (and not an especially popular one at that). If someone refutes the A-over-A condition, that's just part of the usual scientific hurly-burly of creating, testing and rejecting hypotheses. Postal uses the literature on the A-over-A condition as the springboard for a character attack on Chomsky, but I don't really see his point. Even supposing he's correct that Chomsky has dishonestly concealed data which refute the A-over-A condition, this can't be said to have had much of a knock-on effect on the field, since the principle has never been widely adopted.

    9. Side remark on Alex Drummond's comment (for aficionados only): One might also note that the A-over-A condition might actually be true, once we view movement as involving a probe head that is looking for a particular feature on its goal (cf. Preminger's MIT dissertation for recent discussion), if we also allow domination to count for intervention. This is exactly the move that Kitahara made in order to account for the Müller-Takano generalization about remnant movement. So while Chomsky's A-over-A condition may have seemed to founder over particular problems with particular analyses in the late 1960s, it's not so obvious today that it was incorrect after all.

      reference: Kitahara, Hisatsugu. "Restricting Ambiguous Rule-Application: A Unified Analysis of Movement." In MIT Working Papers in Linguistics #24. Edited by Masatoshi Koizumi and Hiroyuki Ura. Cambridge, MA: MIT Working Papers in Linguistics, 1994. (apparently not freely downloadable anywhere, alas)

    10. For what it's worth, Chomsky (1973, 235) explicitly notes that under one possible interpretation of the version of the A-over-A condition that he adopts, this condition ``does not establish an absolute prohibition against transformations that extract a phrase of type A from a more inclusive phrase of type A. Rather, it states that if a transformational rule is nonspecific with respect to the configuration defined, it will be interpreted in such a way as to satisfy the condition.'' This relativized interpretation of the A-over-A condition is then adopted (and further modified) in Bresnan (1976). It can be viewed as very similar to the feature-based reconstruction that Kitahara and others later came up with: In more current terminology, a transformation like wh-movement is ``nonspecific with respect to the configuration defined'' just in case there is more than one item that bears a wh-feature and could in principle be affected by the transformation, and the A-over-A condition then demands movement of the higher, ``more inclusive'' wh-item; similarly for other transformations.

      As far as I can tell, this interpretation of the A-over-A condition, which follows directly if "A" is viewed as a movement-related feature rather than a category label, has not been seriously called into question yet. Arguably, one should therefore stop claiming that the A-over-A condition has been refuted; this is simply not the case.

      Refs: Chomsky, Noam. 1973. Conditions on Transformations. In A Festschrift for Morris Halle. Bresnan, Joan. 1976. On the Form and Functioning of Transformations. Linguistic Inquiry 7.

    11. @Gereon. Postal's recent article on the A-over-A condition does consider the interpretation of the A-over-A condition that you mention. He claims, without really filling out the argument, that Chomsky hasn't shown that this formulation could overcome various counterexamples noted by Ross and others. It seems to me that Postal is basing his argument on the assumption that it is not legitimate to assume that topics etc. have special features which trigger movement. (Representative quote: “Such cases are particularly important because there is no independently motivated or visible feature picking out an adjective phrase as a candidate for such left edge positioning.”; p.6 here.) Most of the problematic cases Postal raises could be handled by assuming that topics have a particular feature which triggers movement. If so, nothing would block (non-remnant) topicalization of an XP out of an XP, since the containing XP would not bear the topic feature in terms of which the topicalization transformation is specified. I suppose it might be argued that if one can freely postulate such features, it removes some of the empirical content from the A-over-A constraint. The Müller-Takano generalization still stands, though, given the reasonable assumption that topicalization and scrambling are each triggered by only one feature.

    12. Thank you for these interesting comments. They answer almost all my concerns about A-over-A. Just a few questions remain:

      1. The reason I [and seemingly others too] are getting the wrong end of the stick is [at least in the A-over-A case] how things have been 'sold' to the community. A-over-A was called a 'principle' which was supposed to apply without exception to all languages. This was done at a time when Chomsky was already quite aware of Ross' work that showed A-over-A has exceptions even in English. Now in the sciences if you say X is a principle that has no exceptions, finding an exception refutes the claim that X is an exceptionless principle. And if you continue to call X a principle even when you know it has exceptions you commit fraud. Linguistics is a natural science [at least according to Chomsky and most who comment here] yet it seems to have different rules. What justifies this difference?

      2. This discussion reminds me of what Postal calls the 'phantom principle' move: not once is mentioned what the A-over-A condition is or how some of the things you suggest come to the rescue. Again that is very much unlike in other sciences. So why do you not give specific examples, especially for those in the multidisciplinary audience who are not linguists?

      3. What exactly is the status of A-over-A? Seemingly it has been demoted from a principle to a condition, though Alex also calls it a constraint. Are these terms synonyms? If so, why not just stick to one [as done elsewhere in the sciences]? If they are not synonyms what is the difference between a principle and a condition and a constraint?

      4. When I asked Chomsky not that long ago about the A-over-A principle he replied right away that that had been refuted in the 1960s by Ross. This seems inconsistent with what you say above - so was Chomsky wrong?

  2. I admit I also struggled with the comparison between RR and Hauser but think the point Norbert was trying to make in the last paragraph is actually quite interesting. He says linguists are fortunate because:

    "Our data is easily accessed and very reliable (as Sprouse and Almeida have made abundantly clear). We are also lucky in that we have managed to construct non-trivial theories with reasonable empirical reach. This acts to focus research and, just as importantly, makes it possible to identify “funny looking” data so that it can be subjected to careful test. Theories guard against gullibility."

    So while possibly some linguistic "arguments scurry faster than a cockroach when held up to the light", theories are stable and here to stay; especially the non-trivial ones with reasonable reach. And the easy access to data allows for the public to evaluate theories as mentioned by joemcveigh. Of course the foundation of biolinguistics, MP, is not a theory but a research program. And SMT is not a theory either but a thesis. So I am curious: what are the current theories?

  3. As it happens, the New York Times had an article about another case of fraud, in my home country:

    I believe this case can serve to illustrate some of Norbert's points (even though in this case, the professor in question WAS fired). My impression is that social psychology is even worse than economics in its equating science with finding correlations in large data sets: especially if you read Stapel's book it becomes clear that social psychology is all about doing experiments and interpreting them in the 'right' way statistically, and hardly ever about trying to construct a theory with some explanatory depth.

    If Stapel's research had not been fraudulent, not many things would have changed. He found correlations between eating meat and being aggressive, or between seeing the word 'capitalism' and eating M&M's. In this way, he became an academic superstar, at least at the Dutch scale: he published in Science, was a Dean at Tilburg University (where, as you may know, a thriving department of linguistics was closed a few years ago because of its unfeasibility) and appeared on tv a lot with the outcomes of his 'research'.

    People are now discussing what should be the consequences of this. The die-hard empiricists say that experiments should be more standardly replicated, we should do statistics on data sets just to see how likely it is that they have been made up, etc. But it seems to me that having a good and solid theory, or a number of competing theories, also helps here.

    The point is, once you have a theory, some data can be PROBLEMATIC (or 'funny-looking', as Norbert says) for somebody believing in that theory, so that person will become suspicious and therefore motivated to replicate the experiments, or at least check all the relevant data. This apparently is hardly ever the case in social psychology: the (fabricated) observation that people who see the word 'capitalism' eat more M&Ms was just not problematic for anybody, since nobody had any deep expectations about the relation between seeing that word and consuming chocolate sweets to begin with.

    But to be fair, it has to be noted that in this case after a number of years a few junior researchers were brave enough to discover the fraud and talk to the rector about it, and the guy was fired. (A detail which might interest linguists, and which is not mentioned in the NYT article, is that the committee which examined the fraud was led by the well-known psycholinguist Willem Levelt.) And that might shed some light on the asymmetry between the Hauser case and the RR case. The differences might have less to do with issues of methodology than with prestige and political power.
    (I have to admit that I know much more about the Stapel case than about Hauser or RR.)

  4. This comment has been removed by a blog administrator.

    1. Yup and my point has been that this was a nutty response to what allegedly took place. I guess I have less faith in these processes than you do.

    2. This was spam (see the link to "Body Mint"). The trick seems to have been to find the topic of this article ("Marc Hauser") and then to concoct a generic one-sentence comment that would fit. Your reaction to the conjectured intention of this NLP robot was premature :)

    3. My gullibility led to rage and I DELETED the bot!! Take that!