Monday, March 31, 2014

Optimal design

Here is a terrific short piece on optimal design. The writer is John Rawls, one of the great philosophers of the 20th century. The discussion is about baseball and so you might dismiss the observations as more in the vein of parody than serious reflection. This, I believe, would be a mistake. There are serious points here about how rules can be evaluated with respect to optimal game design. Note that the optimality refers to how humans are built (compare: minimality in the context of bounded content addressable memory) and how games entertain (compare: how grammars are used). The piece also discusses how the rules have not changed over time, i.e. are stable (compare: UG not having changed since humans first hit the scene), and how they apply to all kinds of people, i.e. are universal. There is also discussion of how easy it is to use the rules in actual play (compare: conditions that must be specified to be usable at all). At any rate, I leave the analogies with minimalist principles to you, dear readers. Many of you will just laugh this off, regarding anything that cannot be mathematically defined as meaningless and a waste of time. This Rawlsian meditation is not for you. For the rest of you, enjoy it. It's fun.

Sunday, March 30, 2014

A Big Data update

Seems that the overselling of Big Data is becoming more widely evident. Here is a piece by Tim Harford that reflects the growing skepticism about the idea that lots of data will replace any need for theoretical insight.

Friday, March 28, 2014

The logic of the POS, one more time

Alex C has provided some useful commentary on my intentionally incendiary post concerning the POS (here). It is useful because I think that it highlights an important misunderstanding concerning the argument and its relation to the phenomenon of Auxiliary Inversion (AI). So, as a public service, let me outline the form of the argument once more.

The POS is a tool for investigating the structure of FL. The tool is useful for factoring out the causal sources of one or another feature of a rule or principle of a language-particular grammar G. Some features of G (e.g. rule types) are what they are because the input is what it is. Other features look as they do because they reflect innate principles of FL.

For example: we trace the fact that English Wh movement leaves a phonetic gap after the predicate that assigns it a theta role to the fact that English kids hear questions like what did you eat. Chinese kids don’t form questions in this way as they hear the analogues of you ate what. In contrast, we trace the fact that sentences like *Everyone likes him, with him interpreted as a pronoun bound by everyone, are ill-formed back to Principle B, which we (or at least I) take to arise from some innate structure of FL. So, again, the aim of the POS is to distinguish those features of our Gs that have their etiology in the PLD from those that have it in the native structure of FL.

Note, for POS to be so deployable, its subject matter must be Gs and their properties: their operations and principles. How does this apply to AI? Well, in order to apply the POS to AI we need a G for AI. IMO, and in the opinion of a very large chunk of the field, AI involves a transformation that moves a finite Aux to C. Why do we/I believe this? Well, we judge that the best analysis of AI (i.e. the phenomenon) involves a transformation that moves Aux to Comp (A-to-C) (i.e. A-to-C names a rule/operation)[1]. The analysis was first put forward in Syntactic Structures (LSLT actually, though SS was the place where many (including me) first encountered it) and has been refined over time, in particular by Howard Lasnik (the best discussion being here). The argument is that this analysis is better for a variety of reasons than alternative analyses. One of the main alternatives is to analyze the inversion via a phrase structure operation, an alternative that Chomsky and Lasnik considered in detail and argued against on a variety of grounds. Some were not convinced by the Chomsky/Lasnik story (e.g. Sag, Pullum, Gazdar), as Alex C notes in linking to Sag’s Sapir Lecture on the topic. Some (e.g. me) were convinced and still are. What’s this have to do with the POS?

Well, for those convinced by this story, there follows another question: what do A-to-C’s properties tell us about FL? Note, this question makes no sense if you don’t think that this is the right description (rule) for AI. In fact, Sag says as much in his lecture slides (here). Reject this presupposition and the conclusion of the POS as applied to AI will seem to you unconvincing (too bad for you, but them’s the breaks). Not because the logic is wrong, but because the factual premise is rejected. If you do accept this as an accurate description of the English G rule underlying the phenomenon of AI then you should find the argument of interest.

So, given the rule of Aux to Comp that generates AI phenomena, we can ask what features of the rule and how it operates are traceable to the structure of the Primary Linguistic Data (PLD: viz. data available to and used by the English Child in acquiring the rule A-to-C) and how much must be attributed to the structure of FL. So here we ask how much of the details of an adequate analysis of AI in terms of A-to-C can be traced to the structure of the PLD and how much cannot. What cannot, the residue, it is proposed, reflects the structure of FL.

I’ve gone over the details before, so I will refrain yet again here. However, let’s consider for a second how to argue against the POS argument.

1.     One can reject the analysis, as Sag does. This does not argue against the POS, it argues against the specific application in the particular case of AI. 
2.     One can argue that the PLD is not as impoverished as indicated. Pullum and Scholz have so argued, but I believe that they are simply incorrect. Legate and Yang have, IMO, the best discussion of how their efforts miss the mark.

These are the two ways to argue against the conclusion that Chomsky and others have regularly drawn. The debate is an empirical one resting on analyzed data and a comparison of PLD to features of the best explanation.

What did Berwick, Pietroski, Yankama and Chomsky add to the debate? Their main contribution, IMO, was twofold.

First, they noted that many of the arguments against the POS are based on an impoverished understanding of the relevant description of AI. Many take the problem to be a fact about good and bad strings. As BPYC note, the same argument can be made in the domain of AI where there are no ill-formed strings, just strings that are monoguous where one might have a priori expected ambiguity.

Second, they noted that the pattern of licit and illicit movement that one sees in AI data appears as well in many other kinds of data, e.g. in cases of adverb fronting and, as I noted, even in cases of WH movement (both argument and adjunct). Indeed, it appears in any case of an A’ dependency. BPYC’s conclusion is that whatever is happening in AI is not a special feature of AI data and so not a special feature of the A-to-C rule. In other words, in order to be adequate, any account of AI must carry over to these other cases as well. Another way of making the same point: if an analysis explains only AI phenomena and does not extend to these other cases as well then it is inadequate![2]

As I noted (here), these cases all unify when you understand that movement from a subject relative clause is in general prohibited. I also note (BUT THIS IS EXTRA, as Alex D commented) that the islandhood of subject RCs is an effect generally accounted for in terms of something like subjacency theory (this latter coming in various guises within GG (bounding nodes, barriers, phases) and having analogues in other frameworks). Moreover, I believe that it would be easy to construct a POS argument that island effects reflect innately specified biases of FL.[3]

So, that’s the logic of the argument. It is very theory internal in the sense that it starts from an adequate description of the rules generating the phenomenon. It ends with a claim about the structure of FL. This should not be surprising: one cannot conclude anything about an organ that regulates the structure of grammars (FL) without having rules/principles of grammar. One cannot talk about explanatory adequacy without having candidates that are descriptively adequate, just as one cannot address Darwin’s Problem without candidate solutions to Plato’s. This is part of the logic of the POS.[4] So, if someone talks as if he can provide a POS argument that is not theory internal, i.e. that does not refer to the rules/operations/principles involved, run fast in the other direction and reach for your wallet.


For the interested, BPYC in an earlier version of their paper note that analyses of the variety Sag presents in the linked-to slides have an analogous POS problem to the one associated with transformations in Chomsky’s original discussion. This is not surprising. POS arguments are not proprietary to transformational approaches. They arise within any analysis interested in explaining the full range of positive and negative data. At any rate, here are parts of two deleted footnotes that Bob Berwick was kind enough to supply me with that discuss the issue as it relates to these settings. The logic is the following: relevant AI pairings suggest mechanisms and, given a mechanism, a POS problem can be stated for that mechanism. What the notes make clear is that analogous POS problems arise for all the mechanisms that have been proposed once the relevant data is taken into account (see Alex Drummond’s comment here (March 28th), which makes a similar point). The take home message is that non-transformational analyses don’t sidestep POS conclusions so much as couch them in different technical terms. This should not be a surprise to those that understand that the application of the POS tool is intimately tied to the rules that are being proposed and the rules that are often proposed are usually (sadly) tightly tied to the relevant data that is being considered.[5] At any rate, here is some missing text from two notes in an earlier draft of the BPYC paper. I have highlighted two particularly relevant observations.

Such pairings are a part of nearly every linguistic theory that considers the relationship between structure and interpretation, including modern accounts such as HPSG, LFG, CCG, and TAG. As it stands, our formulation takes a deliberately neutral stance, abstracting away from details as to how pairings are determined, e.g., whether by derivational rules as in TAG or by relational constraints and lexical-redundancy rules, as in LFG or HPSG. For example, HPSG (Bender, Sag, and Wasow, 2003) adopts an “inversion lexical rule” (a so-called ‘post-inflectional’ or ‘pi-rule’) that takes ‘can’ as input, and then outputs ‘can’ with the right lexical features so that it may appear sentence initially and inverted with the subject, with the semantic mode of the sentence altered to be ‘question’ rather than ‘proposition’. At the same time this rule makes the Subject noun phrase a ‘complement’ of the verb, requiring it to appear after ‘can’. In this way the HPSG implicational lexical rule defines a pair of exactly the sort described by (5a,b), though stated declaratively rather than derivationally. We consider one example in some detail here precisely because, according to at least one reviewer, CCG does not ‘link’ the position before the main verb to the auxiliary. Note, however, that combinatory categorial grammar (CCG), as described by Steedman (2000) and as implemented as a parser by Clark and Curran (2007), produces precisely the ‘paired’ output we discuss for “can eagles that fly eat.” In the Clark and Curran parser, ‘can’ (with a part of speech MD, for modal) has the complex categorial entry (S[q]/(S[b]\NP))/NP, while the entry for “eat” has the complex part of speech label S[b]\NP.
Thus the lexical feature S[b]\NP, which denotes a ‘bare’ infinitive, pairs the modal “can” (correctly) with the bare infinitive “eat” in the same way as GPSG (and more recently, HPSG), by assuming that “can” has the same (complex) lexical features as it does in the corresponding declarative sentence. This information is ‘transmitted’ to the position preceding eat via the proper sequence of combinatory operations, e.g., so that ultimately “can,” with the feature S[q]/(S[b]\NP), along with “eat,” with the feature S[b]\NP, can combine. At this point, note that the combinatory system combines “can” and “eat” in that order, as per all combinatory operations, exactly as in the corresponding ‘paired’ declarative, and exactly following our description that there must be some mechanism by which the declarative and its corresponding polar interrogative form are related (in this case, by the identical complex lexical entries and the rules of combinatory operations, which work in terms of adjacent symbols) [my emphasis, NH]. However, it is true that not all linguistic theories adopt this position; for example, Rowland and Pine (2000) explicitly reject it (thereby losing this particular explanatory account for the observed cross-language patterns). A full discussion of the pros and cons of these differing approaches to linguistic explanation is outside the scope of the present paper.
As the main text indicates, one way to form pairs more explicitly is to use the machinery proposed in Generalized Phrase Structure Grammar (GPSG), or HPSG, to ‘remember’ that a fronted element has been encountered by encoding this information in grammar rules and nonterminals, in this case linking a fronted ‘aux’ to the position before the main verb via a new nonterminal name. This is straightforward: we replace the context-free rules that PTR use, S → aux IP, etc., with new rules, S → aux IP/aux, IP → aux/aux vi, aux/aux → v, where the ‘slashed’ nonterminal names IP/aux and aux/aux ‘remember’ that an aux has been generated at the front of a sentence and must be paired with the aux/aux expansion to follow. This makes explicit the position for interpretation, while leaving the grammar’s size (and so prior probability) unchanged. This would establish an explicit pairing, but it solves the original question by introducing a new stipulation since the nonterminal name explicitly provides the correct place of interpretation rather than the wrong place and does not say how this choice is acquired [my emphasis NH]. Alternatively, one could adopt the more recent HPSG approach of using a ‘gap’ feature that stands in the position of the ‘unpronounced’ v, a, wh, etc., but like the ‘slash category’ proposal this is irrelevant in the current context since it would enrich the domain-specific linguistic component (1), contrary to PTR’s aims – which, in fact, are the right aims within the biolinguistic framework that regards language as a natural object, hence subject to empirical investigation in the manner of the sciences, as we have discussed.
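
The CCG pairing described in the first of these notes can be made concrete with a few lines of Python. This is only a sketch under stated assumptions: it reads the entry for ‘can’ as (S[q]/(S[b]\NP))/NP, treats “eagles that fly” as an already-built NP, and implements nothing but forward application. The class and function names (`Fwd`, `Bwd`, `forward_apply`) are invented for illustration and are not taken from the Clark and Curran parser.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Fwd:
    """Category X/Y: combines with a Y to its right to give X."""
    result: "Cat"
    arg: "Cat"

@dataclass(frozen=True)
class Bwd:
    """Category X\\Y: combines with a Y to its left to give X."""
    result: "Cat"
    arg: "Cat"

Cat = Union[str, Fwd, Bwd]

def forward_apply(left: Cat, right: Cat) -> Optional[Cat]:
    """Forward application: X/Y followed by Y yields X; otherwise None."""
    if isinstance(left, Fwd) and left.arg == right:
        return left.result
    return None

BARE_VP = Bwd("S[b]", "NP")               # S[b]\NP, the assumed entry for 'eat'
CAN_Q = Fwd(Fwd("S[q]", BARE_VP), "NP")   # (S[q]/(S[b]\NP))/NP, assumed entry for 'can'

# 'can' first consumes the subject NP, then the bare VP headed by 'eat'.
after_subject = forward_apply(CAN_Q, "NP")            # S[q]/(S[b]\NP)
whole_clause = forward_apply(after_subject, BARE_VP)  # S[q]

# 'can' cannot combine with the bare VP before it has found its subject.
too_early = forward_apply(CAN_Q, BARE_VP)             # None
```

The point of the sketch is the one the note makes: the category assigned to fronted ‘can’ itself encodes that a bare VP must follow the subject, so the pairing with “eat” is built into the lexical entry rather than learned from the string.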

[1] In what follows it does not matter whether A-to-C is an instance of a more general rule, e.g. move alpha or merge, which I believe is likely to be the case.
[2] For what it’s worth, the Sag analysis that Alex C linked to and I relinked to above fails this requirement.
[3] Whether these biases are FL specific, is, of course, another question. The minimalist conceit is that most are not.
[4] One last point: one thing that POS arguments highlight is the value of understanding negative data. Any good application of the argument tries to account not only for what is good (e.g. that the rule can generate must Bill eat) but also account for what is not (e.g. that the system cannot generate *Did the book Bill read amused Frank). Moreover, the POS often demands that the negative data be ruled out in a principled manner (given the absence of PLD that might be relevant). In other words, what we want from a good account is that what is absent should be absent for some reason other than the one we provide for why we move WHs in English but not in Chinese. I mention this because if one looks at Sag’s slides, for example, there is no good discussion of why one cannot have a metarule that targets an Aux within a subject.  And if there is an answer to this, one wants an account of how this prohibition against this kind of metarule extends to the cases of adverb fronting and WH question formation that seem to illustrate the exact same positive and negative data profiles. In my experience it is the absence of attention to the negative data that most seriously hampers objections to the AI arguments. The POS insists that we answer two questions: why is what we find ok ok and why is what we find to be not ok not ok. See the appendix for further discussion of this point as applied to the G/HPSG analyses that Sag discusses.
[5] To repeat, one of the nicest features of the BPYC paper is that it makes clear that the domain of relevant data (the data that needs covering) goes far beyond the standard AI cases in polar questions that are the cynosure of most analyses.

Thursday, March 27, 2014

POS and the inverse problem in vision

About a month ago, Bill Idsardi gave me an interesting book by Dale Purves to read (here). Purves is a big deal neuroscientist at Duke who works on vision. The book is a charming combination of personal and scientific biography; how Purves got into the field, how it changed since he entered it and how his personal understanding of the central problem in visual perception has changed over his career. For someone like me, interested in language from a cog-neuro perspective, it’s fun to read about what’s going on in a nearby, related discipline.  The last chapter is especially useful for in it Purves presents a kind of overview of his general conclusions concerning what vision can tell us about brains.  Three things caught my eye (get it?).

First, he identifies “the inverse problem” as the main cog-neuro problem within vision (in fact, in perception more generally). The problem is effectively a POS problem: the stimulus info available on the retina is insufficient for figuring out the properties of the distal stimulus that caused it. Why? Because there are too many ways that the pattern of stimulation on the eyeball could have been caused by environmental factors. This reminds me of Lila’s old quip about word learning: a picture is worth a thousand words and this is precisely the problem. So, the central problem is the inverse problem and the only way of “solving” it is by finding the biological constraints that allow for a “solution.”[1] Thus, because the information available at the eyeball is too poor to deliver its cause, yet we make generalizations in some ways but not others, there must be some constraints on how we do this that need recovering. As Purves notes, illusions are good ways of studying the nature of these constraints for they hint at the sorts of constraints the brain imposes to solve the problem. For Purves, the job of the cog-neuro of vision is to find these constraints by considering various ways of bridging this gap.
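
A toy numerical illustration of why the retinal information underdetermines its cause (this is not an example from Purves’ book, just the textbook geometry of visual angle, with made-up sizes and distances): an object of size s at distance d subtends an angle of 2·atan(s / 2d), so any two objects with the same size-to-distance ratio produce the same retinal extent.

```python
import math

def visual_angle_deg(size_m: float, distance_m: float) -> float:
    """Angle subtended at the eye by an object of a given size and distance."""
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))

# Three very different hypothetical scenes: (object size, viewing distance) in metres.
scenes = [(0.1, 1.0), (1.0, 10.0), (10.0, 100.0)]
angles = [visual_angle_deg(size, dist) for size, dist in scenes]

for (size, dist), angle in zip(scenes, angles):
    print(f"size={size:6.1f} m  distance={dist:6.1f} m  angle={angle:.2f} deg")
```

All three scenes yield the same angle (about 5.72 degrees), so the angle alone cannot tell a small near object from a large far one. Whatever picks among them has to come from somewhere other than this piece of the stimulus, which is the shape of the inverse problem.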

This way of framing the problem leads to his second important point: Purves thinks that because the vision literature has largely ignored the inverse problem it has misconceived what kinds of brain mechanisms we should be looking for. The history as he retells it is interesting. He traces the misconception, in part, to two very important neuroscience discoveries: Hubel and Wiesel’s discovery of “feature detecting” neurons and Mountcastle’s discovery of the columnar structure of brains. These two ideas combined to give the following picture: perception is effectively feature detection. It starts with detecting feature patterns on the retina and then ever higher order feature patterns of the previously detected patterns. So it starts with patterns in the retina (presumably products of the distal stimulus) and does successive higher order pattern recognition on these. Here’s Purves (222-3):

…the implicit message of Hubel and Wiesel’s effort [was] to understand vision in terms of an anatomical and functional hierarchy in which simple cells feed onto complex cells, complex cells feed onto hypercomplex cells, and so on up to the higher reaches of the extrastriate cortex….Nearly everyone believed that the activity of neurons with specific receptive field properties would, at some level of the visual system, represent the combined image features of a stimulus, thereby accounting for what we see.

This approach, Purves notes, “has not been substantiated” (223). 

This should come as no surprise to linguists. The failed approach that Purves describes sounds to a linguist very much like the classical structuralist discovery procedures that Chomsky and others argued to be inadequate over 50 years ago within linguistics. Here too the idea was that linguistic structure was the sum total of successive generalizations over patterns of previous generalizations. I described this (here) as the idea that there are detectable patterns in the data that inductions over inductions over inductions would reveal. The alternative idea is that one needs to find a procedure that generates the data and that there is no way to induce this procedure from the simple examination of the inputs, in effect, the inverse problem. If Purves is right, this suggests that within cog-neuro the inverse problem is the norm and that generalizing over generalizations will not get you where you want to go. This is the same conclusion as Chomsky’s 50 years earlier. And it seems to be worth repeating given the current interest in “deep learning” methods, which, so far as I can tell (which may not be very far, I concede), seem attracted to a similar structuralist view.[2] If Purves (and Chomsky) are right (and I know that at least one of them is, guess which) then this view will lead cog-neuro down the wrong path.

Third, Purves documents how studying the intricacies of cognition using behavioral methods was critical in challenging the implicit, very simple theory common in the neuro literature. Purves notes how understanding the psycho literature was critical in zeroing in on the right cog-neuro problem to solve. Moreover, he notes how hostile the neuro types were to this conclusion (including the smart ones like Crick). It is not surprising that the prestige science does not like being told what to look at by the lowly behavioral domains. So, in place of any sensible cognitive theory, neuro types invented the obvious ones that they believed to be reflected in the neuro structure. But, as Purves shows (and any sane person should conclude), neuro structure, at least at present, tells us very little about what the brain is doing. This is not quite accurate, but it is accurate enough. In the absence of explicit theory, implicit “empiricism” always emerges as the default theory. Oh well.

There is lots more in the book, much of it, btw, that I find either oddly put or wrong. Purves, for example, has an odd critique of Marr, IMO. He also has a strange idea of what a computational theory would look like and places too much faith in evolution as the sole shaper of the right solutions to the inverse problem. But big deal. The book raises interesting issues relevant to anyone interested in cog-neuro regardless of the specific domain of interest. It’s a fun, informative and enjoyable read.

[1] I use quotes here for Purves argues that we never make contact with the real world. I am not a fan of this way of putting the issue, but it’s his. It seems to me that the inverse problem can be stated without making this assumption: the constraints being one way of reconstructing the nature of the distal stimulus given the paucity of data on the retina.
[2] As the Wikipedia entry puts it: “Deep learning algorithms in particular exploit this idea of hierarchical explanatory factors. Different concepts are learned from other concepts, with the more abstract, higher level concepts being learned from the lower level ones. These architectures are often constructed with a greedy layer-by-layer method that models this idea. Deep learning helps to disentangle these abstractions and pick out which features are useful for learning.”

Friday, March 21, 2014

Let's pour some oil on the flames: A tale of too simple a story

Olaf K asks in the comments section to this post why I am not impressed with ML accounts of Aux-to-C (AC) in English. Here’s the short answer: proposed “solutions” have misconstrued the problem (both the relevant data and its general shape) and so are largely irrelevant. As this judgment will no doubt seem harsh and “unhelpful” (and probably offend the sensibilities of many (I’m thinking of you GK and BB!!)) I would like to explain why I think that the work as conducted heretofore is not worth the considerable time and effort expended on it. IMO, there is nothing helpful to be said, except maybe STOP!!! Here is the longer story. Readers be warned: this is a long post. So if you want to read it, you might want to get comfortable first.[1]

It’s the best of tales and the worst of tales. What’s ‘it’? The AC story that Chomsky told to explicate the logic of the Poverty of Stimulus (POS) argument.[2] What makes it a great example is its simplicity. Understanding it requires no great technical knowledge and so the AC version of the POS is accessible even to those with the barest of abilities to diagram a sentence (a skill no longer imparted in grade school with the demise of Latin).

BTW, I know this from personal experience for I have effectively used AC to illustrate to many undergrads and high school students, to family members and beer swilling companions how looking at the details of English can lead to non-obvious insights into the structure of FL. Thus, AC is a near perfect instrument for initiating curious tyros into the mysteries of syntax.

Of course, the very simplicity of the argument has its down sides. Jerry Fodor is reputed to have said that all the grief that Chomsky has gotten from “empiricists” dedicated to overturning the POS argument has served him right. That’s what you get (and deserve) for demonstrating the logic of the POS with such a simple straightforward and easily comprehensible case. Of course, what’s a good illustration of the logic of the POS is, at most, the first, not last, word on the issue. And one might have expected professionals interested in the problem to have worked on more than the simple toy presentation. But, one would have been wrong. The toy case, perfectly suitable for illustration of the logic, seems to have completely enchanted the professionals and this is what critics have trained their powerful learning theories on. Moreover, treating this simple example as constituting the “hard” case (rather than a simple illustration), the professionals have repeatedly declared victory over the POS and have confidently concluded that (at most) “simple” learning biases are all we need to acquire Gs. In other words, the toy case that Chomsky used to illustrate the logic of the POS to the uninitiated has become the hard case whose solution would prove rationalist claims about the structure of FL intellectually groundless (if not senseless and bankrupt).

That seems to be the state of play today (as, for example, rehearsed in the comments section of this). This despite the fact that there have been repeated attempts (see here) to explicate the POS logic of the AC argument more fully. That said, let’s run the course one more time. Why? Because, surprisingly, though the AC case is the relatively simple tip of a really massive POS iceberg (cf. Colin Phillips’ comments here March 19 at 3:47), even this toy case has NOT BEEN ADEQUATELY ADDRESSED BY ITS CRITICS! (See, in particular, BPYC here for the inadequacies.) Let me elaborate by considering what makes the simple story simple and how we might want to round it out for professional consideration.

The AC story goes as follows. We note, first, that AC is a rule of English G. It does not hold in all Gs. Thus we cannot assume that the AC is part of FL/UG, i.e. it must be learned. Ok, how would AC be learned, viz: What is the relevant PLD? Here’s one obvious thing that comes to mind: kids learn the rule by considering its sentential products.[3] What are these? In the simplest case polar questions like those in (1) and their relation to appropriate answers like (2):

(1)  a. Can John run
b. Will Mary sing
c. Is Ruth going home

(2)  a. John can run
b. Mary will sing
c. Ruth is going home

From these the following rule comes to mind:

(3)  To form a polar question: Move the auxiliary to the front. The answer to a polar question is the declarative sentence that results from undoing this movement.[4]

The next step is to complicate matters a tad and ask how well (3) generalizes to other cases, say like those in (4):

(4)  John might say that Bill is leaving

The answer is “not that well.” Why? The pesky ‘the’ in (3). In (4), there is a pair of potentially moveable Auxs and so (3) is inoperative as written. The following fix is then considered:

            (3’) Move the Aux closest to the front to the front.

This serves to disambiguate which Aux to target in (4) and we can go on. As you all no doubt know, the next question is where the fun begins: what does “closest” mean? How do we measure distance? It can have a linear interpretation: the “leftmost” Aux and, with a little bit of grammatical analysis, we see that it can have a hierarchical interpretation: the “highest” Aux. And now the illustration of the POS logic begins: the data in (1), (2) and (4) cannot choose between these options. If this is representative of what there is in the PLD relevant to AC, then the data accessible to the child cannot choose between (3’) where ‘closest’ means ‘leftmost’ and (3’) where ‘closest’ means ‘highest.’ And this, of course, raises the question of whether there is any fact of the matter here. There is, as the data in (5) shows:

(5)  a. The man who is sleeping is happy
b. Is the man who is sleeping happy
c. *Is the man who sleeping is happy

The fact is that we cannot form a polar question like (5c) to which (5a) is the answer and we can form one like (5b) to which (5a) is the answer. This argues for ‘closest’ meaning ‘highest.’ And so, the rule of AC in English is “structure” dependent (as opposed to “linear” dependent) in the simple sense of ‘closest’ being stated in hierarchical, rather than linear, terms.

Furthermore, choice of the hierarchical conception of (3’) is not and cannot be based on the evidence if the examples above are characteristic of the PLD. More specifically, unless examples like (5) are part of the PLD it is unclear how we might distinguish the two options, and we have every reason to think (e.g. based on CHILDES searches) that sentences like (5b,c) are not part of the PLD. And, if this is all correct, then we have reason for thinking: (i) that a rule like AC exists in English and that its properties are in part a product of the PLD we find in English (as opposed to Brazilian Portuguese, say), (ii) that AC in English is structure dependent, and (iii) that English PLD includes examples like (1), (2) and maybe (4) (though not if we are degree-0 learners) but not (5). And so we conclude (iv) that if AC is structure dependent, then the fact that it is structure dependent is not itself a fact derivable from inspecting the PLD. That’s the simple POS argument.
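
For readers who like to see the logic run, here is a minimal Python sketch of the two readings of (3’). Everything about the encoding is an assumption made for illustration: sentences are flat word lists, each word is hand-annotated with a clause-embedding depth (0 for the matrix clause) as a crude stand-in for hierarchical structure, and the little `*word` / `word/1` notation and the function names are invented here.

```python
from typing import List, NamedTuple

class Tok(NamedTuple):
    word: str
    depth: int     # clause-embedding depth, 0 = matrix clause (hand-annotated)
    is_aux: bool

def sent(spec: str) -> List[Tok]:
    """Read a made-up notation: '*word' marks an auxiliary, 'word/1' gives its depth."""
    toks = []
    for item in spec.split():
        is_aux = item.startswith("*")
        word, _, depth = item.lstrip("*").partition("/")
        toks.append(Tok(word.lower(), int(depth or 0), is_aux))
    return toks

def front(toks: List[Tok], i: int) -> str:
    """Move the word at index i to the front of the string."""
    return " ".join([toks[i].word] + [t.word for j, t in enumerate(toks) if j != i])

def leftmost_rule(toks: List[Tok]) -> str:
    """(3') with 'closest' read linearly: front the first auxiliary."""
    return front(toks, next(i for i, t in enumerate(toks) if t.is_aux))

def highest_rule(toks: List[Tok]) -> str:
    """(3') with 'closest' read hierarchically: front the least embedded auxiliary."""
    aux = [i for i, t in enumerate(toks) if t.is_aux]
    return front(toks, min(aux, key=lambda i: toks[i].depth))

# Stand-ins for the PLD: cases like (2a) and (4).
PLD = [sent("John *can run"),
       sent("John *might say that/1 Bill/1 *is/1 leaving/1")]
# The critical case (5a), assumed absent from the PLD.
CRITICAL = sent("The man who/1 *is/1 sleeping/1 *is happy")

for s in PLD:
    print(leftmost_rule(s) == highest_rule(s))  # True: the PLD cannot tell the rules apart
print(leftmost_rule(CRITICAL))  # is the man who sleeping is happy   (= 5c)
print(highest_rule(CRITICAL))   # is the man who is sleeping happy   (= 5b)
```

The two rules are indistinguishable on the PLD stand-ins and come apart only on (5a), which is exactly the shape of the argument: a learner restricted to the first two sentences has no data-driven reason to prefer one rule over the other.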

Now some observations: First, the argument above supports the claim that the right rule is structure dependent. It does not strongly support the conclusion that the right rule is (3’) with ‘closest’ read as ‘highest.’ This is one structure dependent rule among many possible alternatives. All we did above is compare one structure dependent rule and one non-structure dependent rule and argue that the former is better than the latter given these PLD.  However, to repeat, there are many structure dependent alternatives.[5] For example, here’s another that bright undergrads often come up with:

            (3’’) Move the Aux that is next to the matrix subject to the front

There are many others. Here’s the one that I suspect is closest to the truth:

            (3’’’) Move Aux

The very simple rule (3’’’) moves the correct Aux to the right place when it operates in conjunction with general FL constraints. These constraints (e.g. minimality, the Complex NP constraint (viz. bounding/phase theory)) themselves exploit hierarchical rather than linear structural relations and so the broad structure dependence conclusion of the simple argument follows as a very special case.[6] Note that if this is so, then AC effects are just a special case of Island and Minimality effects. But, if this is correct, it completely changes what an empiricist learning theory alternative to the standard rationalist story needs to “learn.” Specifically, the problem is now one of getting the ML to derive cyclicity and the minimality condition from the PLD, not just partition the class of acceptable and unacceptable AC outputs (i.e. distinguish (5b) from (5c)). I return to a little more discussion of this soon, but first one more observation.

Second, the simple case above uses data like (5) to make the case that the ‘leftmost’ aux cannot be the one that moves. Note that the application of (3’)-‘leftmost’ here yields the unacceptable string (5c). This makes it easy to judge that (3’)-‘leftmost’ cannot be right, for the resulting string is clearly unacceptable regardless of what it is intended to mean. However, using this sort of data is just a convenience, for we could have reached the exact same conclusion by considering sentences like (6):

(6)  a. Eagles that can fly swim
b. Eagles that fly can swim
c. Can eagles that fly swim

(6c) can be answered using (6b) but not (6a). The relevant judgment here is not a simple one concerning a string property (i.e. it sounds funny) as it is with (5c). It is rather unacceptability under an interpretation (i.e. this can’t mean that, or, it sounds funny with this meaning). This does not change the logic of the example in any important way. It just uses different data (viz. the kind of judgment relevant to reaching the conclusions is different).
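The point about (6) can be made concrete with another small Python sketch, again a toy under stated assumptions rather than anyone's actual proposal. Sentences are hand-coded as (subject tokens, matrix Aux or None, predicate tokens), the function names are invented for illustration, and do-support is ignored, so a sentence with no matrix Aux, like (6a), simply has no output under the hierarchical rule. The question asked is which declaratives each rule pairs with the string (6c).

```python
from typing import List, Optional, Tuple

AUXES = {"can"}  # toy Aux inventory (an assumption)

# (subject tokens, matrix Aux or None, predicate tokens), hand-annotated
Sentence = Tuple[List[str], Optional[str], List[str]]

S6A: Sentence = (["eagles", "that", "can", "fly"], None, ["swim"])  # (6a)
S6B: Sentence = (["eagles", "that", "fly"], "can", ["swim"])        # (6b)
Q6C = "can eagles that fly swim"                                     # (6c)


def flatten(s: Sentence) -> List[str]:
    subject, aux, predicate = s
    return subject + ([aux] if aux else []) + predicate


def front_leftmost_aux(s: Sentence) -> Optional[str]:
    """Linear rule: front the first Aux in the string, if there is one."""
    toks = flatten(s)
    hits = [k for k, t in enumerate(toks) if t in AUXES]
    if not hits:
        return None
    i = hits[0]
    return " ".join([toks[i]] + toks[:i] + toks[i + 1:])


def front_matrix_aux(s: Sentence) -> Optional[str]:
    """Hierarchical rule: front the matrix Aux; no output without one (no do-support here)."""
    subject, aux, predicate = s
    if aux is None:
        return None
    return " ".join([aux] + subject + predicate)


def sources(question: str, rule) -> List[str]:
    """Names of the declaratives that the given rule maps onto the question string."""
    return [name for name, s in (("6a", S6A), ("6b", S6B)) if rule(s) == question]


print(sources(Q6C, front_leftmost_aux))  # ['6a', '6b']
print(sources(Q6C, front_matrix_aux))    # ['6b']
```

In this toy, the linear rule pairs (6c) with both (6a) and (6b), which would make (6c) ambiguous, while the hierarchical rule pairs it with (6b) alone, which is the judgment reported above. Both rules produce a perfectly good string; they differ only in what they say that string can mean.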

Berwick, Pietroski, Yankama and Chomsky (BPYC) emphasize that data like (6), what they dub constrained homophony, best describes the kind of data linguists typically use and have exploited since, as Chomsky likes to say, “the earliest days of generative grammar.” Think: ‘flying planes can be dangerous,’ or ‘I saw the woman with the binoculars,’ and their disambiguating ‘flying planes is/are dangerous’ and ‘which binoculars did you see the woman with.’ At any rate, this implies that the more general version of the AC phenomenon is really independent of string acceptability and so any derivation of the phenomenon in learning terms should not obsess over cases like (5c). They are just not that interesting, for the POS problem arises in the exact same form even in cases where string acceptability is not a factor.

Let’s return briefly to the first point and then wrap up. The simple discussion concerning how to interpret (3’) is good for illustrating the logic of POS. However, we know that there is something misleading about this way of framing the question. How do we know this? Well, because the pattern of the data in (5) and (6) is not unique to AC movement. Analogous dependencies (i.e. where some X outside of the relative clause subject relates to some Y inside it) are banned quite generally. Indeed, the basic fact, one, moreover, that we all have known about for a very long time, is that nothing can move out of a relative clause subject. For example, BPYC discuss sentences like (7):

(7)  Instinctively, eagles that fly swim

(7) is unambiguous, with ‘instinctively’ necessarily modifying ‘swim’ rather than ‘fly.’ This is the same restriction illustrated in (6) with fronted ‘can’ restricted in its interpretation to the matrix clause. The same facts carry over to examples like (8) and (9) involving Wh questions:

(8)  a. Eagles that like to eat like to eat fish
b. Eagles that like to eat fish like to eat
c. What do eagles that like to eat like to eat

(9)  a. Eagles that like to eat when they are hungry like to eat
b. Eagles that like to eat like to eat when they are hungry
c. When do eagles that like to eat like to eat

(8a) and (9b) are appropriate answers to (8c) and (9c) but (8b) and (9a) are not: in both questions the Wh word can only be related to the matrix ‘like to eat,’ not to the one inside the relative clause. Once again this is the same restriction as in (7) and (6) and (5), though in a slightly different guise. If this is so, then the right answer as to why AC is structure dependent has nothing to do with the rule of AC per se (and so, plausibly, nothing to do with the pattern of AC data). It is part of a far more general motif, the AC data exemplifying a small sliver of a larger generalization. Thus, any account that narrowly concentrates on AC phenomena is simply looking at the wrong thing! To be within the ballpark of the plausible (more pointedly, to be worthy of serious consideration at all), a proffered account must extend to these other cases as well. That’s the problem in a nutshell.[7]

Why is this important? Because criticisms of the POS have exclusively focused on the toy example that Chomsky originally put forward to illustrate the logic of POS.  As noted, Chomsky’s original simple discussion more than suffices to motivate the conclusion that G rules are structure dependent and that this structure dependence is very unlikely to be a fact traceable to patterns in the PLD. But the proposal put forward was not intended to be an analysis of ACs, but a demonstration of the logic of the POS using ACs as an accessible database. It’s very clear that the pattern attested in polar questions extends to many other constructions and a real account of what is going on in ACs needs to explain these other data as well. Suffice it to say, most critiques of the original Chomsky discussion completely miss this. Consequently, they are of almost no interest.

Let me state this more baldly: even were some proposed ML able to learn to distinguish (5c) from other sentences like it (which, btw, seems currently not to be the case), the problem is not just with (5c) but with sentences very much like it that are string kosher (like (6)). And even were they able to accommodate (6) (which so far as I know, they currently cannot) there is still the far larger problem of generalizing to cases like (7)-(9). Structure dependence is pervasive, AC being just one illustration. What we want is clearly an account where these phenomena swing together: AC, Adjunct Wh movement, Argument Wh movement, Adverb fronting, and much much more.[8] Given this, the standard empiricist learning proposals for AC are trying (and failing) to solve the wrong problem, and this is why they are a waste of time. What’s the right problem? Here’s one: show how to “learn” the minimality principle or Subjacency/Barriers/Phase theory from PLD alone. Now, were that possible, that would be interesting. Good luck.

Many will find my conclusion (and tone) harsh and overheated. After all, isn’t it worth trying to see if some ML account can learn to distinguish good from bad polar questions using string input? IMO, no. Or more precisely, even were this done, it would not shed any light on how humans acquire AC. The critics have simply misunderstood the problem: the relevant data, the general structure of the phenomenon and the kind of learning account that is required. If I were in a charitable mood, I might blame this on Chomsky. But really, it’s not his fault. Who would have thought that a simple illustrative example aimed at a general audience should have so captured the imagination of his professional critics! The most I am willing to say is that maybe Fodor is right and that Chomsky should never have given a simple illustration of the POS at all. Maybe he should in fact be banned from addressing the uninitiated altogether, or allowed to do so only if proper warning labels are placed on his popular works.

So, to end: why am I not impressed by empiricist discussions of AC? Because I see no reason to think that this work has yielded or ever will yield any interesting insights into the problems that Chomsky’s original informal POS discussion was intended to highlight.[9] The empiricist efforts have focused on the wrong data to solve the wrong problem. I have a general methodological principle, which I believe I have mentioned before: those things not worth doing are not worth doing well. What POS’s empiricist critics have done up to this point is not worth doing. Hence, I am, when in a good mood, not impressed. You shouldn’t be either.

[1] One point before getting down and dirty: what follows is not at all original with me (though feel free to credit me exclusively). I am repeating in a less polite way many of the things that have been said before. For my money, the best current careful discussion of these issues is in Berwick, Pietroski, Yankama and Chomsky (see link to this below). For an excellent sketch on the history of the debate with some discussion of some recent purported problems with the POS arguments, see this handout by Howard Lasnik and Juan Uriagereka.
[2] I believe (actually I know, thx Howard) that the case is first discussed in detail in Language and Mind (L&M) (1968:61-63). The argument form is briefly discussed in Aspects (55-56), but without attendant examples. The first discussion with some relevant examples is L&M. The argument gets further elaborated in Reflections on Language (RL) and Rules and Representations (RR) with the good and bad examples standardly discussed making their way prominently into view. I think that it is fair to say that the Chomsky “analysis” (btw, these are scare quotes) that has formed the basis of all of the subsequent technical discussion and criticism is first mooted in L&M and then elaborated in his other books aimed at popular audiences. Though the stuff in these popular books is wonderful, it is not LGB, Aspects, the Black Book, On Wh movement, or Conditions on transformations. The arguments presented in L&M, RL and RR are intended as sketches to elucidate central ideas. They are not fully developed analyses, nor, I believe, were they intended to be. Keep this in mind as we proceed.
[3] Of course, not sentences, but utterances thereof, but I abstract from this nicety here.
[4] Those who have gone through this know that the notion ‘Aux’ does not come tripping off the tongue of the uninitiated. Maybe ‘helping verb,’ but often not even this. Also, ‘move’ can be replaced with ‘put,’ ‘reorder,’ etc. If one has an inquisitive group, some smart ass will ask about sentences like ‘Did Bill eat lunch’ and ask questions about where the ‘did’ came from. At this point, you usually tell them (with an interior smile) to be patient and that all will be revealed anon.
[5] And many non-structure dependent alternatives, though I leave these aside here.
[6] Minimality suffices to block (4) where the embedded Aux moves to the matrix C. The CNPC suffices to block (5c). See below for much more discussion.
[7] BTW, none of this is original with me here. This is part of BPYC’s general critique.
[8] Indeed, every case of A’-movement will swing the same way. For example: in It’s fresh fish that eagles that like to eat like to eat, the focused fresh fish is complement of the matrix eat not the one inside the RC.
[9] Let me add one caveat: I am inclined to think that ML might be useful in studying language acquisition when combined with a theory of FL/UG. Chomsky’s discussion in Chapter 1 of Aspects still looks to me very much like what a modern Bayesian theory with rich priors and a delimited hypothesis space might look like. Matching Gs to PLD, even given this, does not look to me like a trivial task (and work by those like Yang, Fodor and Berwick strikes me as trying to address this problem). This, however, is very different from the kind of work criticized here, where the aim has been to bury UG, not to use it. This has been both a failure and, IMO, a waste of time.