Thursday, May 21, 2015

Manufacturing facts: the case of Subject Advantage Effects

Real science data is not natural. It is artificial. It is rarely encountered in the wild and (as Nancy Cartwright has emphasized (see here for discussion)) it standardly takes a lot of careful work to create the conditions in which the facts are observable. The idea that science proceeds by looking carefully at the natural world is deeply misleading, unless, of course, the world you inhabit happens to be CERN. I mention this because one of the hallmarks of a progressive research program is that it supports the manufacture of such novel artificial data and their bundling into large scale “effects,” artifacts which then become the targets of theoretical speculation.[1] Indeed, one measure of how far a science has gotten is the degree to which the data it concerns itself with is factitious and the number of well-established effects it has managed to manufacture. Actually, I am tempted to go further: as a general rule only very immature scientific endeavors are based on naturally available/occurring facts.[2]

Why do I mention this? Well, first, by this measure, Generative Grammar (GG) has been a raging success. I have repeatedly pointed to the large number of impressive effects that GG has collected over the last 60 years and the interesting theories that GGers have developed trying to explain them (e.g. here). Island and ECP effects, binding effects and WCO effects do not arise naturally in language use. They need to be constructed, and in this they are like most facts of scientific interest.

Second, one nice way to get a sense of what is happening in a nearby domain is to zero in on the effects its practitioners are addressing. Actually, more pointedly, one quick and dirty way of seeing whether some area is worth spending time on is to canvass the variety and number of different effects it has manufactured. In what follows I would like to discuss one such effect that has recently come to my attention and that holds some interest for a GGer like me.

A recent paper (here) by Jiwon Yun, Zhong Chen, Tim Hunter, John Whitman and John Hale (YCHWH) discusses an interesting processing fact concerning relative clauses (RC) that seems to hold robustly cross linguistically. The effect is called the “Subject Advantage” (SA). What’s interesting about this effect is that it holds in languages where the head both precedes and follows the relative clause (i.e. for languages like English and those like Japanese). Why is this interesting? 

Well, first, this argues against the idea that the SA simply reflects increasing memory load as a function of linear distance between gap and filler (i.e. head). Linear distance cannot be the relevant variable: though it could account for SA effects in languages like English, where the head precedes the RC (thus making the subject gap closer to the head than the object gap), in Japanese-style RCs, where the head follows the clause, the object gap is linearly closer to the head than the subject gap is. A linear account thus predicts an object advantage there, contrary to experimental fact.

Second, and here let me quote John Hale (p.c.):

SA effects defy explanation in terms of "surprisal". The surprisal idea is that low probability words are harder, in context. But in relative clauses surprisal values from simple phrase structure grammars either predict effort on the wrong word (Hale 2001) or get it completely backwards --- an object advantage, rather than a subject advantage (Levy 2008, page 1164).
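The surprisal metric the quote refers to is easy to state: a word's surprisal is its negative log probability in context, so words that are unexpected in context carry more bits and are predicted to be harder. Here is a minimal sketch of the general idea (my illustration, not code from Hale 2001 or Levy 2008):

```python
import math

def surprisal(p_word_given_context):
    """Surprisal in bits: the lower a word's probability in its
    context, the harder it is predicted to be to process."""
    return -math.log2(p_word_given_context)

# A word with contextual probability 1/8 carries 3 bits of surprisal;
# a coin-flip word carries 1 bit.
print(surprisal(0.125))  # 3.0
print(surprisal(0.5))    # 1.0
```

The point in the quote is that when the contextual probabilities come from simple phrase structure grammars, this measure puts the difficulty on the wrong word, or predicts an object advantage outright.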

Thus, SA effects are interesting in that they appear to be stable across languages as diverse as English on the one hand and Japanese on the other, and they seem to be refractory to many of the usual processing explanations.

Furthermore, SA effects suggest that grammatical structure is important, or to put this in more provocative terms, that SA effects are structure dependent in some way. Note that this does not imply that SA effects are grammatical effects, only that G structure is implicated in their explanation.  In this, SA effects are a little like Island Effects as understood (here).[3] Purely functional stories that ignore G structure (e.g. like linearly dependent memory load or surprisal based on word-by-word processing difficulty) seem to be insufficient to explain these effects (see YCHWH 117-118).[4]

So how to explain the SA? YCHWH proposes an interesting idea: what makes object relatives harder than subject relatives is that they involve different amounts of “sentence medial ambiguity” (the former more than the latter), and resolving this ambiguity takes work that is reflected in processing difficulty. Or, put more flatfootedly, finding an object gap requires getting rid of more grammatical ambiguity than finding a subject gap, and getting rid of this ambiguity requires work, which is reflected in processing difficulty. That’s the basic idea. The work is in the details that YCHWH provides. And there are a lot of them. Here are some.

YCHWH defines a notion of “Entropy Reduction” based on the weighted possible continuations available at a given point in a parse. One feature of this is that the model provides a way of specifying how much work parsing is engaged in at a particular point. This contrasts with, for example, a structural measure of memory load. As note 4 observes, such a measure could explain a subject advantage but as John Hale (p.c.) has pointed out to me concerning this kind of story:

This general account is thus adequate but not very precise. It leaves open, for instance, the question of where exactly greater difficulty should start to accrue during incremental processing.
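The Entropy Reduction measure itself can be sketched in a few lines. This is a toy illustration of the general idea only (uniform toy weights, not YCHWH's actual probabilized Minimalist Grammars): uncertainty about how the sentence might continue is measured as Shannon entropy over the weighted continuations, and a word's processing cost is the amount by which it reduces that uncertainty.

```python
import math

def entropy(weights):
    """Shannon entropy (bits) over possible grammatical continuations,
    given unnormalized weights."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_reduction(weights_before, weights_after):
    """Entropy Reduction: the drop in uncertainty about the rest of
    the sentence caused by a word. Clamped at zero, since increases
    in uncertainty are not predicted to cost anything."""
    return max(0.0, entropy(weights_before) - entropy(weights_after))

# Toy example: a word that rules out two of four equally weighted
# continuations reduces entropy from 2 bits to 1 bit.
print(entropy_reduction([1, 1, 1, 1], [1, 1]))  # 1.0
```

Because the measure is computed word by word, it says exactly where in the sentence the difficulty should accrue, which is the precision the quote above finds lacking in the structural memory load account.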

That said, whether to go for the YCHWH account or the less precise structural memory load account is ultimately an empirical matter.[5] One thing that YCHWH suggests is that it should be possible to obviate the SA effect given the right kind of corpus data. Here’s what I mean.

YCHWH defines entropy reduction by (i) specifying a G for a language that defines the possible G continuations in that language and (ii) assigning probabilistic weights to these continuations. Thus, YCHWH shows how to combine Gs with probabilities of their use. Parsing, not surprisingly, relies on the details of a particular G and on the corpus of usages of those G possibilities. Thus, what options a particular G allows affects how much entropy reduction a given word licenses, as do the details of the corpus used to probabilize the G. This means that it is possible that SA might disappear given the right corpus details. Or, put another way, it allows us to ask what corpus details, if any, could wipe out SA effects. This, as Tim Hunter noted (p.c.), raises two possibilities. In his words:

An interesting (I think) question that arises is: what, if any, different patterns of corpus data would wipe out the subject advantage? If the answer were 'none', then that would mean that the grammar itself (i.e. the choice of rules) was the driving force. This is almost certainly not the case. But, at the other extreme, if the answer were 'any corpus data where SRCs are less frequent than ORCs', then one would be forgiven for wondering whether the grammar was doing anything at all, i.e. wondering whether this whole grammar-plus-entropy-reduction song and dance were just a very roundabout way of saying "SRCs are easier because you hear them more often".

One of the nice features of the YCHWH discussion is that it makes it possible to analytically approach this problem. It would be nice to know what the answer is both analytically as well as empirically.
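The probabilizing step described above can be pictured as relative-frequency estimation over grammar rules, as in this toy sketch (a deliberate simplification assuming a flat list of rule uses; YCHWH work with Minimalist Grammars and a more careful parametrization of the distribution):

```python
from collections import Counter

def probabilize(rule_uses):
    """Weight each grammar rule by its relative frequency among rules
    sharing a left-hand side, estimated from corpus counts.
    `rule_uses` is a list of (lhs, rhs) pairs observed in a treebank."""
    counts = Counter(rule_uses)
    lhs_totals = Counter(lhs for lhs, _ in rule_uses)
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Toy corpus: subject relatives observed three times, object relatives once.
uses = [("RC", "SRC")] * 3 + [("RC", "ORC")]
probs = probabilize(uses)
print(probs[("RC", "SRC")])  # 0.75
print(probs[("RC", "ORC")])  # 0.25
```

Tim Hunter's question can then be rephrased concretely: which settings of these corpus-derived weights, if any, would make the entropy reduction profile favor object relatives instead?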

Another one of the nice features of YCHWH is that it demonstrates how to probabilize MGs of the Stabler variety so that one can view parsing as a general kind of information processing problem. In such a context difficulties in language parsing are the natural result of general information processing demands. Thus, this conception of parsing locates it in a more general framework of information processing, parsing being one specific application where the problem is to determine the possible G compatible continuations of a sentence. Note that this provides a general model of how G knowledge can get used to perform some task.

Interestingly, on this view, parsing does not require a parser. Why? Because parsing just is information processing when the relevant information is fixed. It’s not like we do language parsing differently than we do, say, visual scene interpretation once we fix the relevant structures being manipulated. In other words, parsing on the YCHWH view is just information processing in the domain of language (i.e. there is nothing special about language processing except the fact that it is Gish structures that are being manipulated). Or, to say this another way, though we have lots of parsing, there is no parser that does it.

YCHWH is a nice example of a happy marriage of grammar and probabilities to explain an interesting parsing effect, the SA. The latter is a discovery about the ease of parsing RCs that suggests that G structure matters and that language independent functional considerations just won’t cut it. It also shows how easy it is to combine MGs with corpora to deliver probabilistic Gs that are plausibly useful in language use. All in all, fun stuff, and very instructive.

[1] This is all well discussed by Bogen and Woodward (here).
[2] This is one reason why I find admonitions to focus on natural speech as a source of linguistic data to be bad advice in general. There may be exceptions, but as a general rule such data should be treated very gingerly.
[3] See, for example, the discussion in the paper by Sprouse, Wagers and Phillips.
[4] A measure of distance based on structure could explain the SA. For example, there are more nodes separating the object trace and the head than separating the subject trace and the head. If memory load were a function of depth of separation, that could account for the SA, at least at the whole sentence level. However, until someone defines an incremental version of the Whole-Sentence structural memory load theory, it seems that only Entropy Reduction can account for the word-by-word SA effect across both English-type and Japanese-type languages.
[5] The following is based on some correspondence with Tim Hunter. Thus he is entirely responsible for whatever falsehoods creep into the discussion here.


  1. Norbert, the link to the YCHWH paper seems broken.

  2. A quick follow-up on the point about what patterns of corpus data would wipe out the subject advantage. It turns out (at least in Japanese, which is the only language I tested) that it is not the case that the subject advantage only appears when SRCs are more frequent than ORCs in the corpus. So we are not at the "other extreme" mentioned in the post: the theory put forward by YCHWH is not just a roundabout way of appealing to the corpus frequencies.

    To be more precise: if you leave all the corpus weights as they are in the paper except for the SRC-vs-ORC frequency, and replace this with an artificial 50-50 split, then you still get the subject advantage. So the subject advantage appears even though SRCs and ORCs were equally frequent.

  3. This comment has been removed by the author.

  4. A minor point about surprisal: there's nothing inherently grammar-sensitive about entropy reduction or grammar-insensitive about surprisal. Surprisal in Hale (2001), Levy (2008) and elsewhere is calculated based on probabilistic grammars. In the other direction you can get entropy reduction estimates from language models that don't have any hierarchical structure (e.g. Stefan Frank's work). You could make the case that entropy reduction is in many cases more sensitive to representational assumptions than surprisal, though.

    Empirically, surprisal and entropy reduction make very different predictions; each metric is right in some cases and wrong in others (though we have more evidence for surprisal effects). But the debate over which metric is correct (or whether both are) is orthogonal to whether you use probabilistic grammars or more "linear" models.

    1. thx. John told me the same thing and I forgot to remedy my ignorance. I appreciate your making this clear both to others and to me.

    2. Tal's own work, appearing soon in _Cognitive Science_, confirms Entropy Reduction using phrase structure grammars based on the Penn Treebank: "We find that uncertainty about the full structure...was a significant predictor of processing difficulty"

    3. The final manuscript of my paper with Florian Jaeger that John mentions (thanks!) can be found here.

  5. @Tal: You are absolutely right. And I don't think this is a minor point: any probabilistic or information theoretic notion of language requires a clearly specified underlying model. But I think this point is missed by many practitioners in the cognitive science of language, who seem to interpret previous findings as a demonstration that language functions--in use, change, learning, etc.--to facilitate communication in some very general sense. Of course, one needs to ask: if you have a well motivated specific model of language, are such general considerations still necessary? (And they may be wrong.) This is especially important because calculating surprisal or similar information-theoretic measures is computationally very difficult. It seems that YCHWH took an important step in this direction and I hope their paper is widely read and discussed.

  6. The students in my MG parsing research project and I have a sort-of follow-up paper on this that will be presented at MOL in July (which is colocated with the LSA summer institute this year, btw). We're still working on the revisions, but I'll put a link here once it's done.

    We approach the SA from a very different perspective. We completely ignore distributional facts and ask instead what assumptions about memory usage in an MG parser can derive the SA. It turns out that one needs a fairly specific (albeit simple and plausible) story, and it is a story that hooks directly into the movement dependencies one sees with subject relative clauses and object relative clauses. More specifically, there are three simple ways of measuring memory usage:

    1) the number of items that must be held in memory
    2) the duration that an item must be held in memory
    3) the size of the items that must be held in memory

    1 and 2 can at best get you a tie between subjects and objects; you need 3 to derive the SA. Intuitively, the structurally higher position of subjects in comparison to objects leads to shorter movement paths, which reduces the size of parse items and thus derives the SA.
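As a toy illustration of how the three measures can come apart (illustrative definitions only, not the formal ones from the MOL paper): represent each parse item held in memory as a (size, time-in, time-out) triple, and compute the three metrics over a set of such items.

```python
def memory_metrics(items):
    """Three toy measures of parser memory usage. Each item held in
    memory is a (size, time_in, time_out) triple; these definitions
    are illustrative, not the formal ones from the MG parsing work."""
    number = len(items)                                          # 1) how many items
    max_tenure = max(t_out - t_in for _, t_in, t_out in items)   # 2) longest hold
    max_size = max(size for size, _, _ in items)                 # 3) largest item
    return number, max_tenure, max_size

# Toy comparison: the object gap's longer movement path yields a larger
# stored item even when item counts and holding times tie.
subj = [(1, 0, 2), (1, 1, 3)]
obj = [(2, 0, 2), (1, 1, 3)]
print(memory_metrics(subj))  # (2, 2, 1)
print(memory_metrics(obj))   # (2, 2, 2)
```

On this toy picture, metrics 1 and 2 tie between the two configurations and only metric 3 separates them, which is the shape of the result described above.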

    I'm not sure how the two approaches differ in the predictions they make for other constructions; one of the things that really is missing from the parsing literature right now is a formal method for comparing models, similar to how formal language theory provides a scaffolding for comparing macroclasses of syntactic proposals.

    1. This sounds like a version of the whole sentence theory mentioned in note 4. If so, it seems like it will still be missing the word-by-word effect that Entropy Reduction can accommodate, or am I missing something?

    2. You're right that it only gives you an offline difficulty measure that does not map neatly to online performance. That's the main gripe some of my students have with the project, but I'm actually quite happy to abstract away from this for now. That's because the overall goal of the projects is slightly different:

      We're building on Kobele, Gerth and Hale (2012), who use a memory-based mapping from tree structures to processing difficulty to argue against specific syntactic analyses on the basis that they cannot derive the observed processing effects with this simple machinery. So we are less concerned with modeling specific processing effects; the real issue is how far a purely syntactically informed mapping from structures to levels of processing difficulty can take us before we have to switch to (presumably) more powerful methods. The SA is interesting because it looks like something that intuitively could have a memory-based explanation, but at the same time it is rather tricky to accommodate. Our result confirms that picture: you have to add metric 3 to get it right; 1 and/or 2 by itself is not enough even if you're willing to play around with your analysis of relative clauses (wh VS promotion VS no movement).

      I actually feel like this result is still way too specific. What I want is a general theory that tells us something like "if your metric is in complexity class C, and your structures satisfy property P, then you can only derive processing effects of type T". To the best of my knowledge nobody has ever tried to do something like that, but it seems to me that this is the only way to combat the problem of combinatorial explosion you run into when you want to do this kind of work on a bigger scale (dozens of syntactic analyses, parsing models, difficulty metrics, etc).

    3. Actually, let me rephrase the first sentence: it might map neatly to online processing, but nobody has really worked that out yet. The MG parser is incremental and moves through the derivations in a specific way, so you can map its position in the structure to a specific position in the string. For all I know, this might make it straightforward to turn the global difficulty metric into an incremental one. I haven't really thought about it all that much, so I can't even make an educated guess.

    4. The project sounds fine as an analytic exercise, but it's not clear to me how hard it is. There are two obvious measures of load based on distance: linear proximity and hierarchical proximity. It sounds like you opted for door number 2. That's fine if we have any reason to think that this is a general metric for memory load. Is it? I dunno. But I guess I don't see why this was tricky. What's tricky is to see if this is more or less on the right track, and the SA suggests that it is. However, it also suggests that the SA is categorical (it should hold in every language regardless of frequency of usage). Tim has done some analytic work thinking about what kind of frequency data would overturn SA given an Entropy Reduction approach. If the SA did not hold categorically it would be very interesting. But thx for explaining what you are up to and why.

    5. The challenge came mostly from spelling out that idea, grounding it in memory usage, and ensuring that it does not interfere with any of the previous findings --- the memory-load story works for at least three other phenomena: crossing VS nested dependencies, right embedding VS center embedding, and relative clause within a sentential clause VS sentential clause within a relative clause. I agree that it's not a spectacularly surprising finding, but it made for a nice seminar topic and gave the students quite a bit of material to chew on.

      Regarding the universality of the SA, the next language to look at should probably be Basque, for which a fairly recent study has reported an object advantage. One conceivable story is that this is related to the fact that Basque is an ergative language and thus might use a very different structural configuration for subjects and objects.

    6. @Thomas: It depends what you mean by a very different structural configuration. Basque, like most (if not all) ergative languages, exhibits all the regular subject-object asymmetries that you're familiar with from nominative-accusative languages (except for those having to do with case & agreement, of course). So, for example, the (ergative) subject binds the (absolutive) object but not vice versa, the (absolutive) object is structurally closer to the verb than the (ergative) subject is, etc. etc.

      Now, none of that guarantees that the subject and object in Basque are literally in the same structural positions as, say, their counterparts in Japanese. But what I can tell you with confidence is that Basque is not what used to be called a "deep ergative" language, where subjecthood and objecthood (and Agenthood and Patienthood) are underlyingly inverted. As I intimated above, there are probably somewhere between zero and three (inclusive range) languages that are actually "deep ergative."

      (My money is on zero, fwiw.)

    7. @Omer: If you're right then there are at least three possible outcomes: i) Basque is a clear-cut counterexample to structural explanations of subject/object preferences with relative clauses (which would be a neat and very strong result), ii) the metric can be refined even further to accommodate Basque (not particularly appealing imho), or iii) the experimental findings about Basque are wrong (though I can't imagine what kind of confound would produce a clear object advantage).

      It will be interesting to see whether YCHWH's account has a good chance of extending to Basque. The Basque study (behind a paywall *sigh*) talks a little bit about structural ambiguity in relative clauses, but they look very similar to East Asian RCs in that respect.

    8. As I noted earlier, Tim has been working on some analytical results trying to explain what sort of frequencies are required to reverse SA in Korean style RCs. The YCHWH account can accommodate non SA, but it requires a pretty specific set of facts. In other words, it may come close to making a prediction about Basque data (though I suspect I am being over optimistic here). Tim?

    9. Yes, the YCHWH account (i.e. the Entropy Reduction Hypothesis) can definitely accommodate non-SA. The SA prediction in the three languages we looked at is a function of the grammars of those languages *and* the probabilities that describe the comprehender's expectations (which we drew from corpora, but you can get them from anywhere you want). I played around with the Japanese grammar, and found that there are definitely ways the corpus frequencies could have been which would have led to an object-advantage prediction, holding the grammar fixed. So knowing exactly what "the correct" grammar of Basque is would not itself be sufficient to get a prediction about whether SRCs or ORCs are predicted to be easier.

      Put differently, it's in effect an empirical discovery that corpora tend to have the properties which lead the ERH to make the subject-advantage prediction. There's no necessary analytic connection between the ERH and the subject advantage.

      I don't think we can confidently say that getting a non-SA prediction in the Chinese/Japanese/Korean cases that we looked at "requires a pretty specific set of facts". If I'm understanding right this would mean that in some sense "most of the ways the corpora could have been" end up producing the SA prediction, but I just have no idea whether this is right or not. (And of course the question of whether that's right or not is going to be different for each grammar you define.)

    10. The ERH leads you to make the subject-advantage prediction based on a combination of your corpus and the syntactic representations you use to encode it, right? It would be interesting to see how robust the results are to various representational decisions in your Minimalist Grammar.

      I also wonder about the consequences of analyzing a fragment of the grammar of the language rather than the full grammar - couldn't you be underestimating the entropy of a nonterminal that you happened not to care about that much in your fragment?

    11. To the first point: yes -- but if I'm understanding what you mean correctly, that's just to say that the predictions are a function of the grammar and the corpus. The procedure that you follow to produce a probabilistic grammar from a grammar G and a corpus C will (typically?) involve working out what syntactic representations G assigns to the sentences in C. Even holding the grammar and corpus fixed, there are many such procedures: for one thing, you can parametrize the probability distribution in different ways (as I talked about here). That's even before considering different representational decisions in the grammar.

      On the second point: yes, I think that's definitely possible (John may have thought about this more than I have), but I think the assumption/hope is that while the actual entropy values we compute are no doubt much smaller than the "real" values, the relationships among them (and therefore the points where higher and lower ER takes place) might be unaffected. The reasons for concentrating only on a small fragment are really just practical, at this point.

    12. I should clarify one more thing: when I said above that "it's in effect an empirical discovery that corpora tend to have the properties which lead the ERH to make the subject-advantage prediction", this is not simply the discovery that corpora tend to have the property that SRCs are more frequent than ORCs. The properties that lead the ERH to make the subject-advantage prediction are much more subtle and complex, and dependent on all sorts of other things like the choice of grammar.

    13. Yes, I don't think we disagree, it's just that the shorthand you used earlier ("the grammar of the language") could imply that there's only one possible grammar that derives the language, whereas in practice there are a wide range of grammars. And my hunch is that two grammars that have the same weak generative capacity can lead to different ERH predictions - it really depends on whether and where in the grammar you have "spurious" ambiguity.

      As for the fragments, I think your assumption might hold if your productions are a representative sample of the grammar in some sense, though I'm not sure. But imagine a situation where your fragment doesn't have AdvP, but in reality AdvPs have a lot of internal entropy, and SRCs are more likely to have AdvPs than ORCs; wouldn't you underestimate the entropy of the SRCs in that case, and potentially derive the opposite predictions than if you estimated an empirical grammar?

    14. Possibly... but we believe we considered a realistic subset of relevant alternative constructions. The burden of proof that we left out some really important alternative is really on the proposer of an alternative theory :)

    15. Tal wrote: Yes, I don't think we disagree, it's just that the shorthand you used earlier ("the grammar of the language") could imply that there's only one possible grammar that derives the language, whereas in practice there are a wide range of grammars. And my hunch is that two grammars that have the same weak generative capacity can lead to different ERH predictions - it really depends on whether and where in the grammar you have "spurious" ambiguity.

      Right, I don't think we disagree either. When I wrote "the grammar of the language" I just meant "the grammar in the Basque speaker's head". There's only one of those (under the usual idealizing assumptions). There are no doubt many possible grammars that are "weakly equivalent to Basque", or "grammars which weakly generate Basque", and those would definitely make different ERH predictions, even holding the corpus and everything else fixed. The point I wanted to make was that even when you work out what that one true grammar is -- even when Thomas and Omer's questions about the Basque case system (note these are not just questions about the string language) were answered in full detail -- no predictions follow until you assign probabilities.

    16. This comment has been removed by the author.

    17. Oh, I understand what you meant now. I think the point still stands that we don't know exactly which of those weakly equivalent grammars are in the Basque speaker's head, how different are the grammars across Basque speakers, etc. This is clearly science fiction, but in principle you could even use the ERH in an individual differences study to figure out which grammar predicts each person's reading times best...