Monday, October 21, 2013

Mothers know best

My mother always told me that you should be careful what you wish for because you just might get it. In fact, I've discovered that her advice was far too weak: you should be careful what you idly speculate about, as it may come to pass. As readers know, my last post (here) questioned the value added of the review process, based on recent research noting the absence of evidence that reviewing serves its purported primary function of promoting quality and filtering out the intellectually less deserving. Well, no sooner did I write this than I received proof positive that our beloved LSA has implemented a no-review policy for Language's new online journal Perspectives. Before I review the evidence for this claim, let me say that though I am delighted that my ramblings have so much influence and can so quickly change settled policy, I am somewhat surprised at the speed with which the editors at Language have adopted my inchoate maunderings. I would have hoped that we might make haste slowly by first trying to make the review process progressively less cumbersome before adopting more exciting policies. I did not anticipate that the editors at Language would be so impressed with my speculations that they would immediately throw all caution aside and allow anything at all, no matter how slipshod and ignorant, to appear under its imprimatur. It's all a bit dizzying, really, and unnerving (What power! It's intoxicating!). But why do things by halves, right? Language has chosen to try out a bold policy, one that will allow us to see whether the review process has any utility at all.

Many of you will doubt that I am reporting the aforementioned editorial policy correctly. After all, how likely is it that anything I say could have such immediate impact? In fact, how likely is it that the LSA and the editors of Language and its online derivatives even read FoL? Not likely, I am sorry to admit. However, unbelievable as it may sound, IT IS TRUE, and my evidence for this is the planned publication of the target article by Ambridge, Pine and Lieven (APL) ("Child language: why universal grammar doesn't help" here[1]). This paper is without any redeeming intellectual value and I can think of only two explanations for how it got accepted for publication: (i) the radical change in review policy noted above and (ii) the desire to follow the Royal Society down the path of parody (see here). I have eliminated (ii) because, unlike the Royal Society's effort, APL is not even slightly funny haha (well, maybe as slapstick, I'll let you decide). So that leaves (i).[2]

How bad is the APL paper? You can’t begin to imagine.  However, to help you vividly taste its shortcomings, let me review a few of its more salient “arguments” (yes, these are scare quotes). A warning, however, before I start. This is a long post. I couldn’t stop myself once I got started. The bottom line is that the APL paper is intellectual junk. If you believe me, then you need not read the rest. But it might interest you to know just how bad a paper can be. Finding zero on a scale can be very instructive (might this be why it is being published? Hmm).

The paper goes after what APL identify as five central claims concerning UG: identifying syntactic categories, acquiring basic morphosyntax, structure dependence, islands and binding. They claim to "identify three distinct problems faced by proposals that include a role for innate knowledge – linking, inadequate data coverage, and redundancy… (6)." 'Linking' relates to "how the learner can link …innate knowledge to the input language (6)." 'Data-coverage' refers to the empirical inadequacy of the proposed universals, and 'redundancy' arises when a proposed UG principle proves to be accurate but unnecessary, since the same ground is covered by "learning procedures that must be assumed by all accounts," obviating the need "for the innate principle or constraint" (7). APL's claim is that all proposed UG principles suffer from one or another of these failings.

Now far be it from me to defend the perfection of extant UG proposals (btw, the principles APL discusses are vintage LGB conceptions, so I will stick to these).[3] Even rabid defenders of the generative enterprise (e.g. me) can agree that the project of defining the principles of UG is not yet complete. However, this is not APL’s point: their claim is that the proposals are obviously defective and clearly irreparable. Unfortunately, the paper contains not a single worthwhile argument, though it does relentlessly deploy two argument forms: (i) The Argument from copious citation (ACC), (ii) The Argument from unspecified alternatives (AUA).  It combines these two basic tropes with one other: ignorance of the relevant GB literature. Let me illustrate.

The first section is an attack on the assumption that we need assume some innate specification of syntactic categories so as to explain how children come to acquire them, e.g. N, V, A, P etc.  APL’s point is that distributional analysis suffices to ground categorization without this parametric assumption. Indeed, the paper seems comfortable with the idea that the classical proposals critiqued “seem to us to be largely along the right lines (16),” viz. that “[l]earners will acquire whatever syntactic categories are present in a particular language they are learning making use of both distributional …and semantic similarities…between category members (16).” So what’s the problem? Well, it seems that categories vary from language to language and that right now we don’t have good stories on how to accommodate this range of variation. So, parametric theories seeded by innate categories are incomplete and, given the conceded need for distributional learning, not needed.

Interestingly, APL does not discuss how distributional learning is supposed to achieve categorization. APL is probably assuming non-parametric models of categorization. However, to function, these latter require specifications of the relevant features that are exploited for categorization. APL, like everyone else, assumes (I suspect) that we humans follow principles like "group words that denote objects together," "group words that denote events together," "group words with similar 'endings' together," etc. APL's point is that these are not domain specific and so not part of UG (see p. 12). APL is fine with innate tendencies, just not language-particular ones like "tag words that denote objects as Nouns," "tag words that denote events as Verbs." In short, APL's point is that calling the groups acquired nouns, verbs, etc. serves no apparent linguistic function. Or does it?
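
To make the distributional idea concrete, here is a toy sketch of my own (not APL's model, nor anyone's actual proposal): it groups words purely by the set of words that immediately precede them, on an invented mini-corpus. The corpus, the one-word context window, and the hand-picked sample of words are all illustrative assumptions; real distributional proposals use much richer features.

```python
# Toy distributional grouping: cluster words by the set of words that
# immediately precede them. Corpus and feature choice are invented for
# illustration only.
from collections import defaultdict

corpus = ("the dog sees a cat the cat sees a dog "
          "the dog chases a cat the cat chases a dog").split()

def context_signature(word, tokens):
    """The set of words immediately preceding each occurrence of `word`."""
    return frozenset(tokens[i - 1] for i, w in enumerate(tokens)
                     if w == word and i > 0)

# Only categorize a hand-picked sample of content words, to keep the toy clean.
sample = ["dog", "cat", "sees", "chases"]
groups = defaultdict(set)
for word in sample:
    groups[context_signature(word, corpus)].add(word)

for sig, words in sorted(groups.items(), key=str):
    print(sorted(words), "share the contexts", sorted(sig))
```

On this corpus the nounish words (dog, cat) fall together because both follow determiners, and the verbish words fall together because both follow nounish words. But notice: nothing in the procedure itself names either group "N" or "V," which is exactly the point at issue below.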

Answering this question requires asking why UG distinguishes categories, e.g. nouns from verbs. What's the purpose of distinguishing N or V in UG? To ask this question another way: which GB module of UG cares about Ns, Vs, etc.? The only one that I can think of is the Case Module. This module identifies (i) the expressions that require case (Nish things), (ii) those that assign it (P and Vish things), and (iii) the configurations under which the assigners assign case to the assignees (roughly, government). I know of no other part of UG that cares much about category labels. [4] [5]

If this is correct, what must an argument aiming to show that UG need not natively specify categorical classes show? It requires showing that the distributional facts that Case Theory (CT) concerns itself with can be derived without such a specification. In other words, even if categorization could take place without naming the categories categorized, APL would need to show that the facts of CT could also be derived without mention of Ns and Vs etc. APL doesn’t do any of this. In fact, APL does not appear to know that the facts about CT are central to UG’s adverting to categorical features.

Let me put this point another way: Absent CT, UG would function smoothly if it assigned arbitrary tags to word categories, viz. ‘1’, ‘2’ etc.  However, given CT and its role in regulating the distribution of nominals (and forcing movement) UG needs category names. CT uses these to explain data like: *It was believed John to be intelligent, or *Mary to leave would be unwise or *John hopes Bill to leave or *who do you wanna kiss Bill vs who do you wanna kiss. To argue against categories in UG requires deriving these kinds of data without mention of N/V-like categories. In other words, it requires deriving the principles of CT from non-domain specific procedures. I personally doubt that this is easily done. But, maybe I am wrong. What I am not wrong about is that absent this demonstration we can’t show that an innate specification of categories is nugatory. As APL doesn't address these concerns at all, its discussion is irrelevant to the question they purport to address.

There are other problems with APL's argument: it has lots of citations of "problems" pre-specifying the right categories (i.e. ACC), lots of claims that all that is required is distributional analysis, but it contains no specification of what the relevant features to be tracked are (i.e. AUA). Thus, it is hard to know if they are right that the kinds of syntactic priors that Pinker and Mintz (and Gleitman and Co., sadly absent from the APL discussion) assume can be dispensed with.[6] But all of this is somewhat beside the point given the earlier observation: APL doesn't correctly identify the role that categories play in UG, and so the presented argument, even if correct, doesn't address the relevant issues.

The second section deals with learning basic morphosyntax. APL frames the problem in terms of divining the extension of notions like SUBJECT and OBJECT in a given language. It claims that nativists require that these notions be innately specified parts of UG because they are “too abstract to be learned” (18).

I confess to being mystified by the problem so construed. In GB world (the one that APL seem to be addressing), notions like SUBJECT and OBJECT are not primitives of the theory. They are purely descriptive notions, and have been since Aspects.  So, at least in this little world, whether such notions can be easily mapped to external input is not an important problem.  What the GB version of UG does need is a mapping to underlying structure (D-S(tructure)). This is the province of theta theory, most particularly UTAH in some version. Once we have DS, the rest of UG (viz. case theory, binding theory, ECP) regulate where the DPs will surface in S-S(tructure).

So though GB versions of UG don't worry about notions like SUBJECT/OBJECT, they do need notions that allow the LAD to break into the grammatical system. This requires primitives with epistemological priority (EP) (Chomsky's term) that allow the LAD to map PLD onto grammatical structure. Agent and patient seem suited to the task (at least when suitably massaged, as per Dowty and Baker). APL discusses Pinker's version of this kind of theory. Its problem with it? APL claims that there is no canonical mapping of the kind that Pinker envisages that covers every language and every construction within a language (20-21). APL cites work on split ergative languages and notes that deep ergative languages like Dyirbal may be particularly problematic. It further observes that many of the problems raised by these languages might be mitigated by adding other factors (e.g. distributional learning) to the basic learning mechanism. However, and this is the big point, APL concludes that adding such learning obviates the need for anything like UTAH.

APL's whole discussion is very confused. As APL notes, the notions of UG are abstract. To engage them, we need a few notions that enjoy EP. UTAH is necessary to map at least some input smoothly to syntax (note: EP does not require that every input to the syntax be mapped via UTAH to D-S). There need only be a core set of inputs that cleanly do so in order to engage the syntactic system. Once primed, other kinds of information can be used to acquire a grammar. This is the kind of process that Pinker describes, and it obviates the need for a fully general UTAH-like mapping.

Interestingly, APL agrees with Pinker's point, but it bizarrely concludes that this obviates the need for EPish notions altogether, i.e. for finding a way to get the whole process started. However, the fact that other factors can be used once the system is engaged does not mean that the system can be engaged without some way of getting it going. Given a starting point, we can move on. APL doesn't explain how to get the enterprise off the ground, which is too bad, as this is the main problem that Pinker and UTAH address.[7] So once again, APL's discussion fails to engage UG's main worry: how to initially map linguistic input onto DS so that UG can work its magic.

APL has a second beef with UTAH-like assumptions. APL asserts that there is just so much variation cross-linguistically that there really is NO possible canonical mapping to DS to be had. What's APL's argument? Well, the ACC, argument by citation. The paper cites research claiming there is unbounded variation in the mapping principles from theta roles to syntax and concludes that this is indeed the case. However, as any moderately literate linguist knows, this is hotly contested territory. Thus, to make the point APL wants to make responsibly requires adjudicating these disputes. It requires discussing, e.g., Baker's and Legate's work and showing that their positions are wrong. It does not suffice to note that some have argued that UTAH-like theories cannot work if others have argued that they can. Citation is not argumentation, though APL appears to treat it as if it were. There has been quite a bit of work on these topics within the standard tradition that APL ignores (Why? Good question). The absence of any discussion renders APL's conclusions moot. The skepticism may be legitimate (i.e. it is not beside the point). However, nothing APL says should lead any sane person to conclude that the skepticism is warranted, as the paper doesn't exercise the due diligence required to justify its conclusions. Assertions are a dime a dozen. Arguments take work. APL seems to confuse the first for the second.

The first two sections of APL are weak. The last three sections are embarrassing. In these, APL fully exploits AUAs and concludes that principles of UG are unnecessary. Why? Because the observed effects of UG principles can all be accounted for using pragmatic discourse principles that boil down to the claim that “one cannot extract elements of an utterance that are not asserted, but constitute background information” …and “hence that only elements of a main clause can be extracted or questioned” (31-32). For the case of structure dependence, APL supplements this pragmatic principle with the further assertion that “to acquire a structure-dependent grammar, all a learner has to do is to recognize that strings such as the boy, the tall boy, war and happiness share both certain functional and –as a consequence- distributional similarities” (34). Oh boy!! How bad is this? Let me count some of the ways.

First, there is no semantic or pragmatic reason why back-grounded information cannot be questioned. In fact, the contention is false. Consider the Y/N question in (1) and appropriate negative responses in (2):

(1)  Is it the case that eagles that can fly can swim?
(2)  a. No, eagles that can SING can swim
b. No, eagles that can fly can SING

Both (2a,b) are fine answers to the question in (1). Given this, why can we form the question with answer (2b), as in (3b), but not the question conforming to the answer in (2a), as in (3a)? Whatever is going on has nothing to do with whether it is possible to question the content of relative clause subjects. Nor is it obvious how "recogniz[ing] that strings such as the boy, the tall boy, war and happiness share both certain functional …and distributional similarities" might help matters.

(3)  a. *Can eagles that fly can swim?
b. Can eagles that can fly swim?
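
The contrast in (3) can be rendered as a toy sketch of my own (not from any of the papers under discussion): a structure-blind linear rule that fronts the first auxiliary in the string produces the unattested (3a) pattern, while a structure-dependent rule that fronts the main-clause auxiliary produces (3b). The flat list and the little dictionary "parse" below are invented shorthand, not serious grammatical representations.

```python
# Linear vs. structure-dependent auxiliary fronting, as a toy contrast.
# Representations are invented shorthand for illustration only.
words = ["eagles", "that", "can", "fly", "can", "swim"]
parse = {
    "subject": ["eagles", "that", "can", "fly"],  # relative clause contains an aux
    "aux": "can",                                 # the main-clause auxiliary
    "vp": ["swim"],
}
AUX = {"can"}

def front_linear(tokens):
    """Structure-blind rule: front the first auxiliary in the string."""
    i = next(j for j, w in enumerate(tokens) if w in AUX)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

def front_structural(p):
    """Structure-dependent rule: front the auxiliary of the main clause."""
    return [p["aux"]] + p["subject"] + p["vp"]

print("*" + " ".join(front_linear(words)) + "?")  # yields the unattested (3a) pattern
print(" ".join(front_structural(parse)) + "?")    # yields the attested (3b) pattern
```

The linear rule grabs the relative-clause auxiliary and derives the string in (3a); only the rule stated over constituent structure derives (3b). The question for the learner, of course, is why the linear rule is never entertained.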

This is not a new point and it is amazing how little APL has to say about it. In fact, the section on structure dependence quotes and seems to concede all the points made in the Berwick et al. 2011 paper (see here). Nonetheless APL concludes that there is no problem in explaining the structure dependence of T to C if one assumes that back-grounded info is frozen for pragmatic reasons. However, as this is obviously false, as a moment's thought will show, APL's alternative "explanation" goes nowhere.

Furthermore, APL doesn’t really offer an account of how back-grounded information might be relevant as the paper nowhere specifies what back-grounded information is or in which contexts it appears.  Nor does APL explicitly offer any pragmatic principle that prevents establishing syntactic dependencies with back-grounded information. APL has no trouble specifying the GB principles it critiques, so I take the absence of a specification of the pragmatic theory to be quite telling.

The only hint APL provides as to what it might intend (again copious citations, just no actual proposal) is that because questions ask for new information and back-grounded structure is old information, it is impossible to ask a question regarding old information (cf. p. 42). However, this, if it's what APL has in mind (which, again, is unclear, as the paper never actually makes the argument explicitly), is both false and irrelevant.

It is false because we can focus within a relative clause island, the canonical example of a context where we find back-grounded info (cf. (4a)). Nonetheless, we cannot form the question (4b) for which (4a) would be an appropriate answer. Why not? Note, it cannot be because we can't focus within islands, for we can, as (4a) indicates.

(4)  a. John likes the man wearing the RED scarf
b. *Which scarf does John like the man who wears?

Things get worse quickly. We know that there are languages that in fact have no trouble asking questions (i.e. asking for new info) using question words inside islands. Indeed, a good chunk of the last thirty years of work on questions has involved wh-in-situ languages like Chinese or Japanese where these kinds of questions are all perfectly acceptable. You might think that APL’s claims concerning the pragmatic inappropriateness of questions from back-grounded sources would discuss these kinds of well-known cases. You might, but you would be wrong. Not a peep. Not a word. It’s as if the authors didn’t even know such things were possible (nod nod wink wink).

But it gets worse still: ever since forever (i.e. from Ross) we have known that island effects per se are not restricted to questions. The same effects appear with structures having nothing to do with focus, e.g. relativization and topicalization, to name two relevant constructions. These exhibit the very same island effects that questions do, but in these constructions the manipulanda do not involve focused information at all. If the problem is asking for new info from a back-grounded source, then why can operations that target old back-grounded information not form dependencies into the relative clause? The central fact about islands is that it really doesn't matter what the moved element means: you cannot move it out ('move' here denotes a particular kind of grammatical operation). Thus, if you can't form a question via movement, you can't relativize or topicalize using movement either. APL does not seem acquainted with this well-established point.

One could go on: e.g. resumptive pronouns can obviate island effects even though their non-resumptive analogues remain bad despite semantic and pragmatic informational equivalence, and languages like Swedish and Norwegian famously permit extraction from certain relative clause islands, contrary to what APL's account suggests. All of this is relevant to APL's claims concerning islands. None of it is discussed, nor hinted at. Without mention of these factors, APL once again fails to address the problems that UG-based accounts have worried about and discussed for the last 30 years. As such, the critique advanced in this section on islands is, once again, largely irrelevant.

APL's last section on binding theory (BT) is more of the same. The account of principle C effects in cases like (5) relies on another pragmatic principle, viz. that it is "pragmatically anomalous to use a full lexical NP in part of the sentence that exists only to provide background information" (48). It is extremely unclear what this might mean. However, on at least the most obvious reading, it is either incorrect or much too weak to account for principle C effects. Thus, one can easily get full NPs within back-grounded structure (e.g. within relative clauses, as in (5a)). But within the relative clause (i.e. within the domain of back-grounded information)[8], we still find principle C effects (contrast (5a,b)).

(5)  a. John met a woman who knows that Frank1 loves his1 mother
b. * John met a woman who knows that he1 loves Frank’s1 mother

The discussion of principles A and B is no better. APL does not explain how pragmatic principles explain why reflexives must be "close" to their antecedents (*John said that Mary loves himself or *John believes himself is tall), why they cannot be anteceded by John in structures like John's mother upset himself (where the antecedent fails to c-command the reflexive though no clause boundary intervenes), why they must be preceded by their antecedents (*Mary believes himself loves John), etc. In other words, APL does not discuss BT and the facts that have motivated it at all, and so the paper provides no evidence for the conclusion that BT is redundant and hence without explanatory heft.

This has been a long post. I am sorry. Let me end. APL is a dreadful paper. There is nothing there. The question then is: why did Perspectives accept it for publication? Why would a linguistics venue accept such a shoddy piece of work on linguistics for publication? It's a paper that displays no knowledge of the relevant literature, and presents not a single argument (though assertions aplenty) for its conclusions. Why would a journal sponsored by the LSA allow the linguistic equivalent of flat-earthism to see the light of day under its imprimatur? I can think of only one reasonable explanation for this: the editors of Language have decided to experiment with a journal that entirely does away with the review process. And I fear I am to blame. The moral: always listen to your mother.

[1] It's currently the first entry on Ambridge's web page.
[2] There are a couple of other possibilities that I have dismissed out of hand: (i) that the editors thought that this paper had some value and (ii) that linguistic self-loathing has become so strong that anything that craps on our discipline is worthy of publication precisely because it dumps on us. As I said, I am putting these terrifying possibilities aside.
[3] Thus, when I say ‘UG’ I intend GB’s version thereof.
[4] Bounding theory cares too (NP is, but VP is not, a bounding node). APL discusses island effects and I discuss their points below. However, suffice it to say, if we need something like a specification of bounding nodes then we need to know, among other things, which groups are Nish and which are not.
[5] X’ theory will project the category of the head of a phrase to the whole phrase. But what makes something an NP requiring case is that N heads it.
[6] APL also seems to believe that unless the same categories obtain cross-linguistically they cannot be innate (cf. p. 11). This confuses Greenberg's conception of universals with Chomsky's, and so is irrelevant. Say that the principle "words that denote events are grouped as V" is a prior that can be changed given enough data. This does not imply that the acquisition of linguistic categories can proceed in the absence of this prior. Such a prior would be part of UG on Chomsky's conception, even if not on Greenberg's.
[7] It’s a little like saying that you can get to New York using a good compass without specifying any starting point. Compass readings are great, but not if you don’t know where you are starting from.
[8] Just to further back-ground the info, (5) embeds the relative clause within know, which treats the embedded information as presupposed.


  1. From a very brief scan, this one looks like a pretty good representative of the entire school of thought that produced it. & although I basically belong to the 'less UG' school of thought, it seems to me that every worked out account (that is, something that looks like primary linguistic data in, something that looks like a grammar out) that gets a child from zero to the ability to say things like 'I want to push Owen while being carried' in about 4 years (there is a dispute between Cindy and me as to whether our oldest came out with this in his late 3s or early 4s) has some kind of UG (such as, for example, Combinatory Categorial Grammar with its fixed system of basic categories and combinators). 'UG' being something that at least for now has to be presented as if it was some mechanisms specific to language, even if it might someday reduce to something else.

    AUA indeed.

  2. The syntactic categories thing is interesting, since we talked about it before.

    So one explanation is that they are learned using distributional methods (no semantics) and this has been an active area of research
    for some years. So there are lots of algorithms for this (you can download one from my website if you want), and they work quite well
    on a wide range of languages; and they have some mathematical guarantees, and they tie up with the AGL experiments along the lines of Saffran et al.
    So that seems like a very good explanation of how categories might be acquired. Or at least an explanation that has a lot of detail to it.
    (I think your stipulation that it should also explain case theory is a non-starter)

    So your alternative is that the categories themselves are innate and then there is some mechanism that learns which words are in which categories.
    Could you give a citation or two to papers where your alternative is explained?
    One would like to see: a list of the innate categories, a specified algorithm, some evaluation on a range of languages, and some theoretical analysis that shows that it works. But failing that, what is the best, fullest, most recent explanation of this?

    Then we can see which of the two is the unexplained alternative.

    1. I have no problem with distributional methods for grouping categories. In fact, I am not wedded to a parametric theory in which sorting bins are pre-specified. However, contrary to your parenthetical comment, case theory is not a "non-starter"; it is critical if APL's critique of UG is to be relevant. Why? Because, IF their target is GB-based conceptions of UG, which seems to be what the paper is after (if not, I have no idea what the prey is), then APL needs to address the empirical issues of concern to GB or show that they should be ignored. If APL fails to do this, it is equivocating.

      Now, within GB there is a requirement that we distinguish Nish things from Vish things. Why? Case theory. This implies that grouping of LIs into categories had better be able to distinguish N from V, if case theory is correct. Or, a critique must show that case theoretic data do not exist, or they can be accounted for in some way that does not refer to N/V etc. I take this to be a point of logic: you cannot show that X is irrelevant by ignoring what X does, or purports to do. So, no non-starter here. It's in this sense that APL need to say something. Now, as far as I can tell, even if it is possible to group LIs purely distributionally, we need a way of NAMING those distributions so as to make them amenable to case theory. So we need some way of mapping groups (however concocted) into N groups, V groups etc. We need to name names! Or find some other way to account for the data of case theory.

      How to do this? Well, I suspect that there is some mapping principle that seeds the N and V distinction and then finer distinctions are included as required. So, we identify the Nish things (something APL concedes can be done, and people like Gleitman and Co have shown is doable pretty early on). This gives an N/V distinction. Then subcategories of these can be induced as required. How this is done is a fine research question, not one that I personally investigate. But there had better be some way to do this IF case theory is accurate (which I believe it is, to a good first approximation). Absent doing something like this, all the categorization in the world, no matter how perfect, fails to address the main UG points, and as this is what APL aimed to do, the paper completely misses its target. That was my point.

      To recap: if APL aimed to show that categories need not be prespecified as GB versions of UG do, then they need to address those data that UG seeks to explain. Absent this, APL's musings, though maybe interesting for some other purposes, are beside the point wrt the stated goal of the paper.

    2. The stated goal of the paper is quite clear:
      "In many different domains of language acquisition, there exists an apparent learnability problem, to which innate knowledge of some aspect of Universal Grammar (UG) has been proposed as a solution. The present article reviews these proposals in the core domains of (a) identifying syntactic categories,.... We conclude that, in each of these domains, the innate UG- specified knowledge posited does not, in fact, simplify the task facing the learner."

      They are not arguing directly that syntactic categories are not innate. They are arguing that UG does not simplify significantly the task of learning.
      There are I suppose lots of other arguments why syntactic categories are innate (you have just given one). But they are attacking only one such argument, and giving another argument is beside the point.

      Suppose you say "I have been burgled; I know that because my burglar alarm just went off". And I say, "your alarm is broken and goes off all the time".
      And you say "but I can't find my wallet either".
      I don't have to find your wallet to make my point. Maybe you were burgled after all. I am just saying that your alarm doesn't work.
      Saying " so where's my wallet then if I wasn't burgled" doesn't address the issue.

      (burgled = innateness, alarm = learnability, wallet = case theory, in case my parable isn't clear).

      And anyway, hasn't case theory been abandoned in the MP anyway ?

      (wait, I found my wallet in the back pocket of my jeans..)

    3. "Does not simplify the task facing the learner." So what is that task? The GB version of UG (in assuming that there is a Case module) describes the relevant task as identifying N like things from V like things. This task includes categorizing groupings of LIs AS N and AS V. So, grouping the nouns together and the verbs together does not suffice. They must be identified as such as well. Now, does this help with the categorization problem? Well, it depends on what facts you think categorization is responsible for. If you INCLUDE the distribution of NPs as part of the categorization problem, then yes it helps. How? Well it says that once you've identified groups as Nish then you need say nothing more about where they are found in the sentence for Case theory will take care of that.

      Now you may say that this is not a proper part of the categorization problem. But then you are not dealing with the GB claim, and so when you say that assuming innate N/V categories has not advanced the problem in any way, you have not addressed what GB takes the problem to be. You are ignoring data that GB deems relevant. Note that this is consistent with the further claim that APL is correct that, wrt the grouping problem (which is not the categorization problem), postulating innate N/V categories does not help. It may well be true (I am not saying it is, but that it may be) that distributional analysis is all you need to group LIs. But in addition to this, if GB is correct, you need to have a mapping rule that says something like things that group together and have property P are Ns. And this rule will have to be part of any story of categorization, and this mapping rule looks to involve innate specification of categories.

      So, does the GB theory "work"? Well, if correct, in part. Is it sufficient to solve the grouping problem? No. Did anyone say that it was? Here I am not sure, but frankly I never thought that anyone did, and this is why people like Pinker and Gleitman devised theories of LI grouping that included semantic and syntactic bootstrapping principles and mapping principles like "thingy LIs get mapped to N and actiony things get mapped to V, ceteris paribus."

      Two more points. First: Case facts are facts. MP deals with them differently than GB did, but they are still taken to be facts. MP questions whether one needs a Case module different in kind from, say, the movement module. The facts remain and have been dealt with in roughly the way GB did (the Case filter is now a Bare Output Condition, and government has been replaced by a dependency mediated by a head with certain featural demands).

      Second, if you want to say that theory T does not help solve problem P then it is important to identify the problem. It may help with some parts of the problem and not others. To be told that assuming innate N/V categories only helps solve PART of the problem is a critique I can live with. If that is what APL had in mind, it's not news. Oddly, APL does not discuss syntactic bootstrapping much and if assuming N and V as innate categories plays any role in the grouping problem it will be if there is syntactic bootstrapping. So even for the grouping problem, APL fails to discuss the relevant issues.

      Btw, thanks: I was hoping that someone would notice that it was indeed a rant. What better way to deal with junk?

    4. "But in addition to this, if GB is correct, you need to have a mapping rule that says something like things that group together and have property P are Ns."

      I don't think that Ambridge et al are making, as you are, the assumption that GB theory is correct.

    5. You think? They struck me as hard core nativists. Damn fooled again!

      But let's say you are right and APL doesn't like GB. APL appears to argue that its assumptions are wrong and then tries to demonstrate this by arguing that the nativist assumptions that are central to GB are nugatory. I ranted shrilly that this misses the boat, for in order to argue against a certain view you must address the arguments advanced, the evidence adduced, and the problems explored by that view. APL did none of this. So even if APL doesn't like GB (hard as this is to believe), this does not mean a purported critique of GB can take as a premise that it is incorrect and then conclude that indeed it is. Though 'P --> P' is a valid form of inference, it is not a very interesting one.

      So how does one go about arguing the case that APL wants to make? Well by showing (i) that what GB tries to explain can be explained without making GBish assumptions or (ii) that what GB wants to explain should not be explained because it is not "real," e.g. the facts are not as described, the problems/generalizations identified are only apparent, etc. APL combines these two modes. Unfortunately, the critique is garbage for the reasons noted in the post. APL either doesn't know why certain assumptions are made and so doesn't address the relevant issues or offers alternatives that are laughably inadequate. That's why I was and am shrill! This is not serious work, and should not be treated as such.

      Last point: Even if you have empiricist sympathies, there is no way that the APL paper could have been useful to you. I am pretty sure that you are largely in agreement with me on this (or should be). Could you really be convinced that, say, Pinker is wrong, or that Gleitman et al might not be onto something, or that semantic/syntactic bootstrapping is hopeless after reading APL? Does this paper advance the critique in any way? It's at most a rehash of stuff, and the "argument" is effectively accomplished by massive citation. APL never even organizes the arguments in the cited papers to re-present them in a compelling way. That, however, is what a critique based on a review of the literature does (and this is at most what APL is, as there is nothing original in it). It simply cites and concludes that there are problems. This is not even an undergrad-level discussion. And I am pretty sure you know this. So repeat after me: this is junk. This is junk. This is junk.

    6. "So even if APL doesn't like GB (hard as this is to believe) this does not mean a purported critique of GB can take as a premise that it is incorrect and then conclude that indeed it is"

      They don't assume that GB is true and they also don't assume that GB is false. Those aren't the only two possibilities.

      It is possible also to keep an open mind and examine the arguments for and against a position, and evaluate them objectively like, say, a scientist.
      That is always useful.

  3. A couple of points on categories and distributions.

    1. As Alex notes, there are a lot of distributional methods for category induction, but I suspect that he'd agree with me that none has produced sufficiently good results to match human children, who very rarely make category mistakes (Valian 1986). This is to put aside valid psychological concerns (e.g., the amount of data, the complexity of computation, etc.)

    2. Erwin Chan, in his 2008 Penn dissertation, reviewed many distributional methods for category induction. He took the distributional neighborhood profiles of the most frequent words of English and plotted the first two principal components of the similarity matrix: the overlap across the basic categories (N, V, ADJ) seems too great for any purely distributional method to separate them out. This is what "the data says". (The Reply option does not allow images, but you can find Erwin's dissertation at; the diagram is on page 158.)
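      To make the procedure concrete, here is a toy sketch of this style of analysis (an illustrative reconstruction under simplifying assumptions, not Chan's actual pipeline or data): count each word's immediate left and right neighbors, build a cosine-similarity matrix over those profiles, and project it onto its first two principal components.

```python
import numpy as np

# Toy corpus; real experiments of the kind Chan ran use millions of tokens.
corpus = ("the dog saw the cat . the cat saw the dog . "
          "a dog ate a bone . the cat ate the fish . "
          "dogs run . cats run . the big dog ran .").split()

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Distributional profile: counts of the word immediately before (first V
# columns) and immediately after (last V columns) each word type.
profiles = np.zeros((V, 2 * V))
for i, w in enumerate(corpus):
    if i > 0:
        profiles[idx[w], idx[corpus[i - 1]]] += 1
    if i < len(corpus) - 1:
        profiles[idx[w], V + idx[corpus[i + 1]]] += 1

# Cosine-similarity matrix over length-normalized profiles.
norms = np.linalg.norm(profiles, axis=1, keepdims=True)
norms[norms == 0] = 1.0
P = profiles / norms
sim = P @ P.T

# First two principal components of the (column-centered) similarity matrix.
centered = sim - sim.mean(axis=0)
U, S, Vt = np.linalg.svd(centered)
coords = U[:, :2] * S[:2]          # one 2-D point per word

for w in ["dog", "cat", "saw", "ate"]:
    print(w, coords[idx[w]].round(2))
```

      Nothing here is tuned; the point is only the shape of the computation. On a handful of sentences the toy profiles happen to be fairly distinct, which is exactly what stops being true at realistic scale with one-word contexts.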

    3. Grimshaw/Pinker-style semantic bootstrapping--which assumes an innate set of categories--needn't be mutually exclusive with more structure-based accounts. The child could use "semantics" to figure out a handful of nouns, verbs, etc., build a grammar out of them, and let the grammar do the rest of category learning. And this connects with Norbert's point about the Case filter. The term "syntactic bootstrapping" was introduced, I believe, by Howard Lasnik, as an alternative/complementary strategy for the category learning problem. And there is psycholinguistic evidence that very young children can infer syntactic categories of novel words based on their structural configurations relative to other categories (Shi & Melancon 2010, Infancy, among many others); Virginia Valian refers to these category learning helpers as "anchoring points".

    4. Fully agree with Alex. It's incumbent on the innate category believers to produce a working model. Stay tuned.

    1. Hi Charles,

      1. Completely agree: the algorithms fall far short of human performance; indeed, these algorithms have, as you know, a bunch of other undesirable properties (often you need to specify the number of categories in advance, the categories are a flat partition, they often don't work well with ambiguity, etc.).
      Indeed I can't think of *any* NLP task where even the supervised systems match up to human performance. But I don't think that rules out drawing some reasonable inferences from the moderate success of what are rather simple algorithms.

    2. Chan's diagram shows that if you use naive clustering based on the words before and after, then you probably won't get that far. That suggests that one should use longer-range information (even Schütze back in 1993 had figured that out), either directly or indirectly by using cluster labels rather than words, which is what all current techniques use.
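      The indirect route can be sketched minimally as follows (a toy construction, not Schütze's system or any current technique): cluster words on word-based contexts first, then re-describe each word's context by the cluster labels of its neighbors, so that information propagates beyond the immediate bigram.

```python
import numpy as np

corpus = ("the dog saw the cat . the cat saw the dog . "
          "a dog ate a bone . the cat ate the fish .").split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

def profiles(labels, n_labels):
    """Context counts where each neighbor is described by its label."""
    M = np.zeros((V, 2 * n_labels))
    for i, w in enumerate(corpus):
        if i > 0:
            M[idx[w], labels[idx[corpus[i - 1]]]] += 1
        if i < len(corpus) - 1:
            M[idx[w], n_labels + labels[idx[corpus[i + 1]]]] += 1
    return M

# Pass 1: every word is its own label, i.e. plain word-based contexts.
M1 = profiles(np.arange(V), V)

# Crude single-link clustering at a fixed cosine threshold.
norms = np.linalg.norm(M1, axis=1, keepdims=True)
norms[norms == 0] = 1.0
S = (M1 / norms) @ (M1 / norms).T
labels = np.arange(V)
for i in range(V):
    for j in range(i + 1, V):
        if S[i, j] > 0.6:
            labels[labels == labels[j]] = labels[i]   # merge the two clusters
_, labels = np.unique(labels, return_inverse=True)    # compact relabeling

# Pass 2: contexts described by cluster labels rather than words.
M2 = profiles(labels, labels.max() + 1)
print(dict(zip(vocab, labels)))
```

      Real systems iterate this (and add wider windows, HMM structure, etc.); the sketch only shows why label-based contexts are denser than word-based ones.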

      3. I think that once you have some syntactic information, then the rest becomes a lot easier, and certainly we know from e.g. Pinker that children start to leverage semantics and syntactic structure pretty early on. So the disagreement is really about how the first few steps get taken; I agree that later on semantics/syntax/prosody etc all play a role. So maybe more consensus on the later stages?

      4. Ok, it's an active area of research and we all have papers in the pipeline at various stages. It's overreach though for Norbert to start ranting about unexplained alternatives when he can't point to a proposal that spells out the details of his approach. What's the best thing I can read now?

    2. Looking at this discussion I am once more reminded of the Intelligent Design [ID] vs. Evolution Theory [ET] debates. IDers will point at phenomenon P1 and say: you can't explain this with ET, therefore ET must be wrong and ID is confirmed. ETers come up with an explanation for P1, and IDers first denounce the solution but finally reply: yeah, so what, but you can't account for phenomenon P2 with ET, therefore ET must be wrong and ID is confirmed. ETers come up with an explanation for P2, and IDers first denounce the solution but finally reply: yeah, so what, but you can't account for phenomenon P3 with ET, therefore ET must be wrong and ID is confirmed. And so on. This style of argumentation has even gotten its own name: 'God of the Gaps' [GOG]. IDers use GOG to 'prove' they are right [and assuming we won't have any accounts that leave no gaps any time soon, we cannot refute GOG]. Still, many of us resist adopting 'divine creation' over ET. Word has it even Chomsky rejects ID...

      I am in full agreement with Alex C. and would like to add that even if the paper does not meet Norbert's discriminating taste [one wonders of course how "Science of Language" could ever pass his quality control], it is hardly justified to accuse the editors of Language that "they would immediately throw all caution aside and allow anything at all, no matter how slipshod and ignorant, to appear under its imprimatur" or to assert that "the APL paper is intellectual junk". Clearly, one has to have the sanctimoniousness of David Pesetsky to be surprised that those who speak publicly like this [and do not object to such speech] are disliked. Some ethical housekeeping could go a long way here...

      As for content: I agree with Norbert, it is high time to 'name names' - for BIOLINGUISTS. Just saying X is innate no longer cuts it: tell us how it is realized in human brains. If you cannot do that, then stop your pretentious finger-pointing. On Chomsky's count you have been working on the biology of language since the 1950s - so tell us: WHAT are your biological results? A lot of the research Norbert attacks so viciously has a much shorter history. And while results are modest, these researchers certainly have produced results. To make myself clear: I do not deny that generativists have produced results. But, for example, the work by Collins & Stabler has little to say about the BIOLOGY of language. The debates with Alex C. re computational issues show that at best competing models are at a similar level - there are no huge [or even modest] advantages of nativist models. I understand that David A. has been very busy and had no time to answer my questions about these models. But he is surely not the ONLY person who could have - so why does it take so long to generate answers about YOUR models?

    3. It doesn't matter how many bad candidate Ps ID proponents have come up with in the past. Each candidate has to be evaluated on its merits. To bring this back to linguistics, consider the question of whether all natural languages have context-free string languages. Chomsky and others came up with lots of bad arguments for the conclusion that they do not. But eventually some good arguments were found, and these arguments aren't any the worse just because they happen to have been preceded by some bad arguments.

    4. Dear Alex D., you make a valid albeit irrelevant point. I am not skeptical about MP because Chomsky came up with some bad arguments in the past. In fact, I think some of his arguments in the 1950s were superior to his most recent "arguments" - you may want to have a look at the discussion about the "Galilean Style" on this blog [I hope the link is to the main post, not one of my comments].

      I am skeptical because in 60+ years Chomsky [and his collaborators] has [have] not specified a single biological property of human brains that is language specific. As long as this does not happen it does not really matter how convincing some of the recent arguments are [again, note I do not deny progress of generativists]. If you claim to do BIOlinguistics you have to pay attention to biology. We had arguments for innateness [good or bad, general or specific] for decades. So it is time to specify biological properties. By denigrating others as Norbert routinely does [and yes it was impossible NOT to notice that this post was a rant] generativists make ZERO progress on issues that are important...

  4. This comment has been removed by the author.

  5. Although this is definitely one of those areas where Ambridge and I disagree, I think there's a bit more substance to this article than you suggest. One nice thing about the article -- and Ben's work in general -- is that it tries to grapple with (some) Nativist theory on its own terms but from an Empiricist standpoint. Maybe it doesn't succeed, but I think this effort is worthwhile and helps clarify where the issues are.

    In any case, Language is soliciting open commentaries, so you might consider sending one in.

    1. I disagree: APL does not grapple with anything because it doesn't have the foggiest idea of what the nativist positions are. Look, I think that there is quite a bit of nativism required to get an understanding of linguistic competence off the ground. However, even if I didn't, to argue against this view requires that I know what the view entails. APL displays nothing approaching an understanding of the issues involved. And, as such, it does not clarify the issues involved. It simply is another piece of junk that will be used to obfuscate further discussion, and that's why I am ranting against it!

  6. APL probably represents the majority opinion in the field of language acquisition, and they are, of the empiricists, among those who know the nativist claims & literature best. So if they don't have the foggiest idea of what the nativist positions are, then nativists have a lot of work to do. And APL have helpfully gotten the ball rolling by telling us exactly what they think the nativist position is.

    1. Funny, I would have thought you'd say that those in language acquisition have a lot of work to do. But your view seems to be that their ignorance is my responsibility. Politically, this may indeed be right. But that supposes that if my kind and I were more open and helpful, these positions would not be so ubiquitous. I used to believe that. I've grown up since. If this is really the best that's out there, then it's a waste of time trying to engage seriously with the intellectual issues. There are none. This leaves a political problem, if, as you insist, this represents enlightened empiricism. Frankly, I am not sure what to do, except to continue to deal with the real issues, try to block out the empiricist noise, and every now and then fire a broadside aimed at showing how ignorant and silly the enterprise is. Why this tack? Well, rational discussion clearly does not work and engagement is a serious waste of time. And it's been repeatedly tried. I vote for ridicule and embarrassment. Here's my advice: every time you encounter such junk, stand up and say that it is junk. Say it loud and often. You can be polite, but you must be insistent. This may get noticed and maybe, just maybe, it will have the desired effect. I personally cannot wait to be able to do this in a public venue. I will ask my students and friends and colleagues to do this as well. No more attempts to meet this ignorance half way. That, in my view, is a giant mistake, as it makes it look like APL-like views are reasonable. As they are not, this is a losing strategy.

    2. The thing is that it's impossible for "nativists" to take the argumentation in the paper seriously because there are so many basic errors. E.g. pages 54-55 present two supposed alternatives to Conditions A/B. Both of these alternative conditions have obvious counterexamples. The first condition is that “Reflexive pronouns are used in English if and only if they are direct recipients or targets of the actions represented by the sentences.” This is too vague to evaluate properly, but I don't see how it would account for the contrast between, say, “John embarrassed himself” and “*John's appearance embarrassed himself.” The second principle is that “reflexive pronouns denote a referent as seen from his or her own point of view, non-reflexive pronouns from a more objective viewpoint.” As stated, this appears to imply that “itself” should not exist. Even apart from this the claim is simply false in general, as shown by e.g. “Forgetting that he was the author of the paper, John unknowingly criticized himself”. The paper constantly says things like the following: “Discourse-functional principles that must be included in formal accounts to explain particular counterexamples can, in fact, explain the entire pattern of data.” But in many cases the authors do not consider even the most basic textbook data points used to motivate the conditions under discussion.

      The broader problem here is the question of whether or not Conditions A/B should be replaced with some "discourse functional" conditions is completely orthogonal to the issue the paper is supposed to be addressing. There is no reason to think that these alternatives would be any easier to learn inductively from the data.

    3. I have not had the time yet to read the paper, so I have little to say about its quality, but the statement Alex D quotes is baffling:
      Discourse-functional principles that must be included in formal accounts to explain particular counterexamples can, in fact, explain the entire pattern of data.
      That might be a valid line of reasoning in areas where you do not care about computability (even there it would make my toes curl), but for learnability the fact that an extremely powerful system such as the principles regulating discourse can handle simpler problems is a) hardly surprising, and b) utterly useless. One might just as well say that all of phonology can be done with Minimalist grammars, which completely misses the point that phonological rules are less complex than syntactic ones, and they are also much easier to learn (both for machines and humans, as far as I know). Discourse is hard, very hard. Syntax not so much.

    4. For Binding Theory, consider Yag Dii (Niger-Congo) as analysed by Mary Dalrymple (; purely functionalist theories would appear to be completely hopeless (especially for the 'LD' series), and the phenomena go far beyond traditional BT in complexity (especially the 'LD2' series), but can be accommodated in a 'weak UG with a constraint language for stipulations' (i.e. LFG with the iofu approach to binding as worked out in Dalrymple's thesis).

      It is an interesting question what allows a language to get such a complex system of anaphors; at a seminar where Mary presented this material, somebody suggested that it might be because of the extremely elaborate story-telling tradition that exists in that culture (from which all of the spectacular examples appear to be taken).

    5. So consider this quote: (not from the paper under discussion)

      "It is standardly held that having a highly restricted hypothesis space makes
      it possible for such a learning mechanism to successfully acquire a grammar that is compatible with the learner’s experience and that without such restrictions, learning would be impossible (Chomsky 1975, Pinker 1984, Jackendoff 2002). In many respects, however, it has remained a promissory note to show how having a well-defined initial hypothesis space makes grammar induction possible in a way that not having an initial hypothesis space does not (see Wexler 1990 and Hyams 1994 for highly relevant discussion).
      The failure to cash in this promissory note has led, in my view, to broad
      skepticism outside of generative linguistics of the benefit of a constrained initial hypothesis space."

      This seems a reasonable point to me, and more or less the same one that is made in this paper: namely, that the proposed UG doesn't actually solve the learnability problem.

    6. @AlexC. I think from the perspective of a generative linguist things look something like the following. We have sentences that look superficially similar but which instantiate different underlying dependencies:

      John wants to win often. (control)
      John seems to win often. (raising)

      John believes Bill to have left. (ecm)
      John persuaded Bill to leave. (control)

      And we have sentences which look superficially different but which instantiate the same underlying dependencies:

      Who did John talk to?
      I talk to more people than you do.
      John is difficult to talk to.

      There's a plausible sketch of how innate knowledge would help here. E.g., if you have some idea of what ‘wants’ means, you should be able to figure out that it must assign an external theta-role and hence can't be a raising verb. I don't know of any interesting account of how one would settle on the correct analyses for the above cases without the help of some built-in constraints along the lines of those provided by GB theory.

    7. @Alex D -- yes that's exactly the sort of thing I mean (and APL mean too, I guess?). There is a plausible sketch, and intuitively it seems like the innate knowledge will help, but if you try to flesh out the details you find that you need a learning system anyway, and the learning system that you need doesn't in the end need the innate knowledge.
      Or actually of course, no one ever bothers to fill in the details. ("the promissory note").

    8. Has anyone fleshed out the details for those cases?

    9. @Alex C: "...and the learning system that you need doesn't in the end need the innate knowledge."

      Could you please point me to any learner that learns, for example, principle A or B or C effects, or island effects or ECP effects or anything similar that develops competence in these domains without any built in knowledge of the GB variety. I'd love to see how they work. Note, I'm not asking for a system that learns an entire grammar. I am happy with something that learns any things of this sort.


    10. So assuming that all these effects lie within the class of MGs, and therefore MCFGs, there are learners for MCFGs that can learn large classes of these; Yoshinaka's work is a good example. But the granularity of the descriptions is quite different -- these are abstract mathematical models of learning, with assumptions that probably seem weird and wrong to you, and that don't mention Principle B effects -- so there is a gulf here that I hope to overcome in future work.

      But the point here isn't (from my point of view) a general "my theory is better than yours" argument -- there is a more specific issue.

      So as an illustrative example, take the Wexler and Culicover model, which we argued about before, and which assumes a deep structure tree as an input (more or less). The argument here is something like this:
      We have a model of UG that allows us to learn given these inputs. But where do the inputs come from? From some other unspecified learning component that infers the deep structure. Call this learning component System X. The claim is then that System X plus UG accounts for learning.
      The counter-argument is that System X actually has to be quite powerful, and when we work out the details we find that System X on its own can learn the deep structure *and* the transformation. In which case UG turns out not to be necessary.

      In more modern terms, maybe you assume a distributional component that can learn the crude phrase structure -- well, given the MP/Stablerian unification of movement with phrase structure, maybe the component that can learn the phrase structure can also learn the structure-dependent movement rules which produce the island effects that you are interested in. And given the way that CFG learners turn out to be easily adaptable to MCFG learners, that seems quite fair.

      So that is the argument: the starting point for the discussion therefore shouldn't be my theories of how things can be learned *without* UG, but your theories of how things can be learned *with* UG.

      (for a suitably vague definition of UG).

    11. So assuming that all these effects lie within the class of MGs and therefore MCFGs then there are learners for MCFGs that can learn large classes of these;

      Only given sufficiently rich data. How much data would you need to learn that wh-movement, tough-movement and comparative operator movement all obey (e.g.) the wh-island constraint? (Not a rhetorical question by the way — I'd like to know.)

      If you want an account of how these three constructions can be learned with UG, here goes. Take any account of how simple instances can be learned without UG. Now, the principles of UG will tell you that the movement is unbounded with the exception of island constraints and the ECP, no further learning required.

      In the absence of UG you are forced to make the quite implausible claim that kids learn the locality constraints on A-bar movement separately for each construction in which it is involved. It's far from clear that kids get enough data to do this.

    12. "In the absence of UG you are forced to make the quite implausible claim .."
      why is that? Do you mean UG in the language specific sense?

      I think you are right that the amount of data is a really important consideration -- and also crucial when you look at the move between MCFGs and MGs -- because the MCFGs don't have the sort of feature system that allows you to represent those cross-categorial constraints.

    13. Nor do grammars invidiously distinguish A' constructions, i.e. have questions obey islands but allow relativization to skirt them. Again, this is a feature of UG accounts and so need not be learned.

    14. "So as an illustrative example take the Wexler and Culicover model which we argued about before, and which assumes a deep structure tree as an input (more or less). So the argument here is something like.
      We have a model of UG that allows us to learn given these inputs. But where do the inputs come from?"

      If these inputs were typed meanings (perhaps very approximate, from word learning) plus 'axiom links' (proof-net literature term for assembly instructions) derived from context and plausibility, then they would have a non-syntactic source, & syntactic UG would be whatever structure in the hypothesis space was then needed to explain learnability, typology, and the prevalence of recurrent patterns of generalizations. Learnability might turn out to require less than Norbert expects, but there might be quite a lot in the other two. For example, there is so much data that perhaps the island constraints could be learned from the data by the time the students show up in your syntax class to accept or reject the standard examples, but there is still the recurrent tendency for all the various types to be pretty much the same in each language, even when the constraints differ across languages.

    15. "Frankly, I am not sure what to do, except to continue to deal with the real issues, try to block out the empiricist noise and every now and then fire a broadside aimed at showing how ignorant and silly the enterprise is."

      The problem I see with this view is that many of the classic nativist arguments have been revealed to be unsound. Wanna-contraction, for example, is obviously weak, and Andrea Zukowski has, I think, more or less demolished it; the simple ABC binding theories that people mostly talk about grossly under-represent the actual typological diversity of these things; island constraints can now be redone in quite a variety of ways; etc. So it's really at this point a non-race between two horses, one of which is half zombie, the other of which is half imaginary (due to the failure of the empiricists to address the basic phenomena of the 'Jimmy test'). Nativists could at least start trying to clean up their own position by losing the obsolete and dodgy arguments.

    16. I agree re wanna. Not sure about the others. The ABC effects underestimate the diversity, but do you know of clean cases where reflexivization is non-local but pronominalization is local? Or where in simplex cases reflexivization and pronominalization are both fine? As for islands, I'm not sure what you intend. The various ways I know of for dealing with them all amount to making the same two or three distinctions, though coded in slightly different ways. So, I am less critical of the basic facts than you seem to be. Of course a more refined set of descriptions would be useful to probe deeper, but the rough lay of the land strikes me as pretty well described.

    17. @Norbert "do you know of clean cases where reflexivization is non local but pronominalization is?" The problem with this statement is that there is no independent criterion (such as, say, complex morphology involving something that looks vaguely like a possessive) for distinguishing a 'reflexive' from a 'nonreflexive'.

      There are I think some reasonably solid universal cases of principle C, such as

      *he_i left before John_i got tired

      also in the same paddock Postal's no overlap constraint for local co-arguments, so no language that inflects for both subject and object has forms for things like 'I like us' and 'we criticized me'. (but you do seem to be able to say these things anyway in English, with pronouns if you really want to, rendering the systematic gaps in all inflectional paradigms that I've ever heard of rather odd).

      But for A and B, there are forms that need to be bound in various local environments, and others that need to be free in local environments, & I don't claim to know what a proper typology might or might not reveal about UG, but the use-by date for the simple ABC story is definitely past (note the challenging data from the paper by Mary Dalrymple that I link to from somewhere in this discussion).

      The point about islands is a bit different: at the moment, 'believers' in UG seem to be betting everything on learnability, but I don't think it's too clear to anybody what can or cannot be learned by people who've had a parsed corpus of 150-200 million words run past them (e.g. the students in our syntax classes), so perhaps the island constraints could be learned independently for the various constructions that show them. Do we *know* (as opposed to merely conjecture) that this is impossible? I doubt it. But we do know that the constraints tend to be the same for large numbers of constructions, subject to various kinds of uniform tendencies, such as being a bit looser for relative clauses (especially NRCs) than for questions. But roughly the same, so that's typology (in this case, under the guise of recurrent patterns of generalizations).

      The poster child for respecting typology as much as learnability would be 'structure preservation'; this does seem to be learnable in principle from child-accessible data, but no language has anything like 'apparent structure-blind preposing' such as would produce things like

      *is the dog that barking is hungry?

      So I think these things need to be sorted out and displayed properly in order to expect anybody not brought up in our peculiar tradition to understand them. Even then, maybe not, but I'd be more willing to write them off as hopeless (which, these days, I am not).

    18. Reflexive vs non-reflexive: Yes, that's why I tried to mention reflexivization versus pronominalization. Even in copy reflexive languages there are pretty good tests (e.g. sloppy identity) that pull these apart. Morphology is not everything, though it is often useful. The focus on morphology alone as the key to Binding is, I believe, a mistake, one that earlier Lees-Klima theories avoided and that GB goofed on.

      Re the islands: We don't really know, but the typology stuff and the absence of violations should make the problem harder. For the absence of negative data to be useful, some pre-packaged expectations (i.e. UG) are required. So, I think that the bet is a safe one (but see comment below on Colin's remark). I think that both the typology and recent work by Sprouse have presented pretty strong reasons for thinking that islands are not construction sensitive, a point that goes back to Ross and seems to me pretty well established by now. This too is interesting for it requires an abstract conception of what islands are relevant for. Not dependencies, not constructions, but movement dependencies. This is pretty abstract, and for it to be gleaned from input, if that is possible at all, would require pre-packaged lines of generalization. Again, UG.

      Last point: so that I don't sound triumphalist or complacent, let me assert that there are still monsters out there. Hard problems that require hard thinking. But, and this is my main point, I think that over 50 years of research we have found a good stable group of generalizations that are real. These point to a simple conclusion, at least to me: that there is interesting structure to UG that undergirds language acquisition. I don't expect anything we find in the future to render this judgment moot. I could be wrong, but I doubt it, and I think that a generalized skepticism that acts as if the remaining problems will dislodge this conclusion is counter-productive, and misdescribes the state of play.

    19. Well the connection between morphology & strict vs sloppy is not so simple, I believe. But my claim here is not that there is no UG to be found in Binding Theory phenomena, but that the presentation needs to be different for me to think that the 'empiricists' ought to be impressed by it (and that typology might well gain *a lot* of ground over learnability).

    20. Alex C (Oct 25, 3am) gives a quote from a different paper to say that APL have identified a real problem and that UG doesn't solve learnability problems.

      The odd thing, however, is that this quote comes from a paper that attempts to cash in on that promissory note, showing in specific cases what the benefit of UG would be. Here are some relevant examples.

      Sneed's 2007 dissertation examines the acquisition of bare plurals in English. Bare plural subjects in English are ambiguous between a generic and an existential interpretation. However, in speech to children they are uniformly generic. Nonetheless, Sneed shows that by age 4, English learners can access both interpretations. She argues that if something like Diesing's analysis of how these interpretations arise is both true and innate, then the learner's task is simply to identify which DPs are Heim-style indefinites and the rest will follow. She then provides a distributional analysis of speech to children that does just that. The critical thing is that the link between the distributional evidence that a DP is indefinite and the availability of existential interpretations in subject position can be established only if there is an innate link between these two facts. The data themselves simply do not provide that link. Hence, this work successfully combines a UG theory with distributional analysis to show how learners acquire properties of their language that are not evident in their environment.

      Viau and Lidz (2011, which appeared in Language and oddly enough is cited by APL for something else) argue that UG provides two types of ditransitive construction, but that the surface evidence for which is which is highly variable cross-linguistically. Consequently, there is no simple surface trigger which can tell the learner which strings go with which structures. Moreover, they show that 4-year-olds have knowledge of complex binding facts which follow from this analysis, despite the relevant sentences never occurring in their input. However, they also show what kind of distributional analysis would allow learners to assign strings to the appropriate category, from which the binding facts would follow. Here again, there is a UG account of children's knowledge paired with an analysis of how UG makes the input informative.

      Takahashi's 2008 UMd dissertation shows that 18-month-old infants can use surface distributional cues to phrase structure to acquire basic constituent structure in an artificial language. She shows also that having learned this constituent structure, the infants also know that constituents can move but nonconstituents cannot, even if there was no movement in the familiarization language. Hence, if one consequence of UG is that only constituents can move, these facts are explained. Distributional analysis by itself can't do this.

      Misha Becker has a series of papers on the acquisition of raising/control, showing that a distributional analysis over the kinds of subjects that can occur with verbs taking infinitival complements could successfully partition the verbs into two classes. However, the full range of facts that distinguish raising/control do not follow from the existence of two classes. For this, you need UG to provide a distinction.

      In all of these cases, UG makes the input informative by allowing the learner to know what evidence to look for in trying to identify abstract structure. In all of the cases mentioned here, the distributional evidence is informative only insofar as it is paired with a theory of what that evidence is informative about. Without that, the evidence could not license the complex knowledge that children have.

      It is true that APL is a piece of shoddy scholarship and shoddy linguistics. But it is right to bring to the fore the question of how UG makes contact with data to drive learning. And you don't have to hate UG to think that this is a valuable question to ask.

  7. Are there any systems that do acquire things functionally like NL grammars (defining a sound-meaning correspondence sort of like in NLs) from stuff that is sort of like NL data (interpreted corpora with some glossing) that *don't* have a well defined hypothesis space?

    1. to strengthen the point I think your question gets at: are there any models that acquire anything from anything that don't have a well defined hypothesis space?

    2. So there is a trivial sense in which all models implicitly define a hypothesis space, namely the set of all outputs which they might produce under any possible input.
      But that isn't the same thing as a restricted hypothesis space which is well-defined initially.

    3. Perhaps we are just arguing semantics (or I'm being dense), but I'm afraid I fail to understand your point. What does it mean for a hypothesis space to be well-defined initially? Just that somebody bothered to spell it out?

      Perhaps you could give an example for a model that only "trivially" and "implicitly" defines a hypothesis space, as opposed to one that has an "initially" "well-defined" one?

    4. So to give a machine learning example; say you are learning regular languages -- contrast say the Hsu-Kakade-Zhang spectral HMM learner versus the Clark-Thollard PDFA learner. What is the hypothesis space of the former?

      Or contrast say a Charles Yang learner for a parameterised finite class of context free grammars versus a distributional clustering learning algorithm like ADIOS? The former would try to exploit the structure of the explicit hypothesis class.

      Not sure if it lines up quite with the parametric versus non-parametric difference.
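
For concreteness, here is a minimal sketch of the first kind of learner contrasted above: a Yang-style variational learner over an explicit, finite parameter space. The linear reward-penalty scheme is standard, but the toy 'wh-fronting' parameter and the mini-domain are my own simplifications, not Charles Yang's actual implementation.

```python
import random

# Variational learning sketch (hypothetical mini-domain): each binary
# parameter i has a weight p[i]; on each input the learner samples a
# grammar, and rewards the sampled settings if that grammar parses the
# sentence, punishing them (moving away) otherwise.

def variational_learner(inputs, parses, n_params, steps=5000, gamma=0.02, seed=0):
    rng = random.Random(seed)
    p = [0.5] * n_params
    for _ in range(steps):
        s = rng.choice(inputs)
        g = tuple(rng.random() < pi for pi in p)   # sample a grammar
        success = parses(g, s)
        for i, on in enumerate(g):
            target = 1.0 if on else 0.0
            if success:                            # reward the sampled value
                p[i] += gamma * (target - p[i])
            else:                                  # punish: move away from it
                p[i] += gamma * ((1.0 - target) - p[i])
    return p

# Toy domain: parameter 0 = "wh-fronting"; parameter 1 is inert.
def parses(g, s):
    return g[0] == s.startswith("what")

inputs = ["what did you eat", "what did Bill say"]
weights = variational_learner(inputs, parses, n_params=2)
# weights[0] converges towards 1, since only wh-fronting grammars parse the input
```

The explicit hypothesis space here is just the 2^n grammars; a distributional clusterer like ADIOS has no such enumerable space to exploit, which is the contrast being drawn.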

      But the point of that quote is that regardless of its correctness, it is at least a reasonable point, and it is basically the same point as Ambridge et al's, but the quote I made is from one of Norbert's allies and I think that rather than misrepresenting the point and ranting one could examine the arguments in a more rational way.

    5. I'll rephrase my query to end with "don't have an explicitly defined hypothesis space designed on the basis of some ideas of what language is like" (ie some kind of UG, albeit perhaps a weak one). So for example the old Wexler and Culicover TG learner would count, as would Charles' P&P learner and the Kwiatkowski et al. CCG one, but not the old Rumelhart et al. PDP device, nor the recent Rescorla-Wagner based proposals of Michael Ramscar et al (paper downloadable at …). The latter I find particularly interesting because I think it does demonstrate that UG is weaker than I used to think it was (by not containing any general Morphological Blocking Principle), and the general approach should also I think extend further to explaining the absence of things like feetses, hobbitses, kissededed (kissed+2 further iterations of the past tense suffix). I like this because I spent a lot of time once trying to get principles to explain such things to work out properly, without much success, and would rather think that the reason is that they don't exist rather than that they do and I was too stupid and/or lazy to manage to find them.

      But I don't see how Ramscar et al's approach can possibly give a definite account of how a child gets from linguistic zero to being able to say 'I want to push Owen while being carried' in about 4 years. So: less or different UG, very likely; zero UG, not so much. But if there is some 'hard core empiricist learner' that has a reasonable prospect of passing the 'Jimmy test' (what the composer of the push sentence was called at the time), I'd like to know about it.

      Of course you might be able to get a RW model to pass the Jimmy test by giving some account of what the cues and outputs were allowed to be, but that would be a kind of UG (according to me, at least).
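
For readers who don't know the Rescorla-Wagner model mentioned above, here is a minimal sketch of the standard update rule. The cue/outcome encoding is a toy of my own, not Ramscar et al.'s actual setup, but it shows the blocking effect that would rule out stacked suffixes like kissededed without any Morphological Blocking Principle.

```python
from collections import defaultdict

# Rescorla-Wagner: association strength V[cue, outcome] is nudged by the
# prediction error (lambda minus the summed V of all cues present).

def rw_update(V, cues, outcome, all_outcomes, alpha=0.1, lam=1.0):
    for o in all_outcomes:
        total = sum(V[(c, o)] for c in cues)
        target = lam if o == outcome else 0.0   # absent outcomes get lambda = 0
        error = target - total
        for c in cues:
            V[(c, o)] += alpha * error

V = defaultdict(float)
outcomes = ["PAST", "PLURAL"]

# Blocking-style training: "ed" comes to predict PAST reliably; a redundant
# second suffix then gains almost no associative strength, because the
# outcome is already fully predicted and the error term is near zero.
for _ in range(200):
    rw_update(V, ["stem", "ed"], "PAST", outcomes)
for _ in range(20):
    rw_update(V, ["stem", "ed", "ed2"], "PAST", outcomes)
```

So a cue corresponding to an extra iteration of the suffix simply never gets learned, which is one way of cashing out "they don't exist" without a dedicated blocking principle.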

    6. "Not sure if it lines up quite with the parametric versus non-parametric difference." (Alex C.)

      I don't think it does. To give one example, take Mark Johnson's Adaptor Grammars. While non-parametric in the sense that there is an infinite number of (model) parameters, their nature is constrained by whatever (Meta-)grammar the modeler comes up with, and as of now, it seems as if certain (admittedly, weak) biases are needed even for "lowly" tasks such as (word / morph) segmentation.

      "Or contrast say a Charles Yang learner for a parameterised finite class of context free grammars versus a distributional clustering learning algorithm like ADIOS?"

      The trouble with models like ADIOS is that their inductive biases aren't explicit (I grant you that), but that's hardly a feature, at least that's how I view it. Of course there are biases, and they fall out of the definition of the model, "trivially so", perhaps, but I don't quite see how the care people take in stating as explicitly as possible the exact biases their models embody makes a difference as to whether or not there are any.

      "Hsu-Kakade-Zhang spectral HMM learner versus the Clark-Thollard PDFA learner. What is the hypothesis space of the former?"

      I have the strong suspicion I'm missing the point you're trying to make, but, HMM-parameters (for any given HMM)? As opposed to, well, "everything".

      Presumably it's really just a question of what to call things, but if there is some substantial point I'm missing (or something I misunderstood about your examples), please correct me.

      (And it goes without saying that I whole-heartedly agree with this: "I think that rather than misrepresenting the point and ranting one could examine the arguments in a more _rational_ way." (emphasis added) )

    7. "Hsu-Kakade-Zhang spectral HMM learner versus the Clark-Thollard PDFA learner. What is the hypothesis space of the former?"

      I have the strong suspicion I'm missing the point you're trying to make, but, HMM-parameters (for any given HMM)? As opposed to, well, "everything".

      Sorry I was not being very clear -- hypothesis class is used in several different ways in the literature. I mean something like the set of possible outputs for any possible inputs or the range of the learner considered as a function.

      So the Hsu et al learner has a hypothesis class which is some rather poorly defined submanifold of the space of all HMMs, and it has a bias towards HMMs with high separability -- leaving aside the fact that actually it doesn't output HMMs at all but rather approximations that may not even define a probability distribution, and the fact that the things it outputs only have rational coefficients, and various other wrinkles.
      So saying the hypothesis space is all HMMs is I think completely wrong.
      I mean all HMMs are also PCFGs right? so one might as well say the hypothesis space is all PCFGs (I am going a bit far here maybe).
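
The remark that every HMM is also a PCFG can be made concrete: an HMM is exactly a right-linear PCFG. A small sketch (my own construction, with made-up numbers) of the conversion, where a stopping probability per state is needed so the grammar generates finite strings:

```python
# Convert an HMM into a right-linear PCFG:
#   N_s -> o N_t   with prob  emit[s][o] * trans[s][t]
#   N_s -> o       with prob  emit[s][o] * stop[s]
# so the PCFG hypothesis space properly contains the HMM one.

def hmm_to_pcfg(trans, emit, stop):
    """trans[s][t] and stop[s] sum to 1 per state; emit[s][o] sums to 1."""
    rules = {}
    for s in trans:
        for o, pe in emit[s].items():
            for t, pt in trans[s].items():
                rules[(f"N_{s}", o, f"N_{t}")] = pe * pt   # binary rule
            rules[(f"N_{s}", o)] = pe * stop[s]            # terminating rule
    return rules

trans = {"A": {"A": 0.6, "B": 0.3}, "B": {"A": 0.4, "B": 0.4}}
stop  = {"A": 0.1, "B": 0.2}
emit  = {"A": {"x": 0.7, "y": 0.3}, "B": {"x": 0.2, "y": 0.8}}
rules = hmm_to_pcfg(trans, emit, stop)
# rule probabilities for each nonterminal N_s sum to 1
```

Which is why "the hypothesis space is all HMMs" and "the hypothesis space is all PCFGs" are equally cheap things to say about a learner, absent more detail.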

      Biases are something completely different.

      In the context of the discussion here, the question is does restricting the hypothesis space at the beginning in the way that generative grammar conceives of the problem (limiting UG) actually ever help to solve the learning problem?
      I don't have a very strong view one way or the other, but it would be nice to see some examples, and that would help to answer the questions raised by Ambridge et al. as it seems like a reasonable point that we could discuss in a rational way.

    8. Thanks, that helps. I actually only skimmed the Hsu-et-al-paper, and took their LearnHMM-description too literally, without reading the entire paper with enough care (and I also have to admit that my use of terminology is (unintentionally) somewhat non-standard...)

      "[H]ypothesis class is used in several different ways in the literature. I mean something like the set of possible outputs for any possible inputs or the range of the learner considered as a function. [...] Biases are something completely different."

      I'm not sure I agree with taking biases to be completely different from the space of admissible hypotheses. Now I don't want to turn this into an argument about words, but isn't the fact that a learner attempts to acquire something specific (phrase structure, morphs (and possibly grouping them into morphemes), (pseudo-)HMM-parameters (as opposed to simply collecting bigram-frequencies of the observed elements), etc.) something you can meaningfully attribute to the learner's inductive bias to look for exactly that kind of thing? And that inductive bias effectively determines the hypothesis space and explains why it excludes certain things, and includes other things.

      "In the context of the discussion here, the question is does restricting the hypothesis space at the beginning in the way that generative grammar conceives of the problem (limiting UG) actually ever help to solve the learning problem?"

      I get the sense that part of the problem might be that one's theory of language determines the learning problems one is interested in, up to the point that certain problems might not even be describable without notions which seem like prime candidates for what is provided by UG. Which isn't really a helpful answer to your question, but it might explain why the discussion doesn't seem to go anywhere, and why people might feel as if the other side is begging the really important questions.

    9. On your final point, I completely agree. Even expressing what an island constraint is is quite hard in a theoretically neutral way (if you want to do it cross linguistically) -- so it is hard to talk about learning them except through
      a) buying into a theory which accepts some syntactic notions as universal and thus presumably innate
      b) zooming out and saying, well I will look at learning MCFGs or MGs and worry about these later.
      I obviously choose b).
      I just used this quote by Putnam from 1971 in a paper which I think sums up one view that I think has some merit:

      "Invoking "Innateness" only postpones the problem of learning; it does not solve it. Until we understand the strategies which make general learning possible - and vague talk of 'classes of hypotheses' and 'weighting functions' is utterly useless here - no discussion of the limits of learning can even begin."

      Alex D upthread a bit said that one way of viewing UG is as accelerating an existing non-UG based learner. And that seems a reasonable view, but again maybe an argument for strategy b).

    10. I find the idea that you would look for a theoretically neutral way to describe anything somewhat odd. Would you look for a theoretically neutral way to describe, say, gravitational attraction, or, say, the fine structure constant, or, say, the law of entropy? If you did, nobody would be interested in listening to you. The aim is not to eschew theory but to explain stuff. There is no problem displaying, say, island phenomena, and no problem specifying a theory of islands (e.g. subjacency) in a way that is testable and usable. That should suffice.

      Re Putnam: I have no idea what this could mean. Everyone invokes innate capacities. Without them there is no generalization beyond cases and thus nothing like what people call learning. So everyone needs innate mechanisms. The only question then is what kinds. Thus, not only does postulating such mechanisms not postpone the problem, a specification of the innate mechanisms is a pre-requisite for stating the problem. The view that general mechanisms are legit while specific ones are ipso facto not is a prejudice. Everyone has innate mechanisms. The only question is which, and this is entirely an empirical issue.

      Last point: there is no reason, so far as I can tell, why a theory that explains how a learner can learn any MCFG will explain how I learn the one we have. It might and it might not. After all, we are also interested in why we learn some MCFGs and not others and it is entirely conceivable that the kind of learner that does the second is nothing like the kind that does the first. This needs some demonstration. And IMO the best way to show that the methods are useful is to show how some specific feature that has been isolated as characteristic of our Gs is learned using these methods. So go ahead: show us. Take just ONE case and do it. Then we can tell how useful these methods are.

    11. The Putnam quote was maybe too short -- it is from a paper called "The Innateness Hypothesis"; here is another quote that is clearer, and closer to Ambridge's point:

      "The theorems of mathematics, the solutions to puzzles, etc., cannot on any theory be individually 'innate'; what must be 'innate' are heuristics, i.e., learning strategies. In the absence of any knowledge of what general multipurpose learning strategies might even look like, the assertion that such strategies (which absolutely must exist and be employed by all humans) cannot account for this or that learning process, that the answer or an answer schema must be 'innate', is utterly unfounded."

      So several people on this very thread -- Avery, Alex D and I think also Colin -- have made arguments along this line: a general purpose learner (direct learner, learner without UG, etc.) can't learn a phenomenon (islands, that-trace, ...), so it must be innate. Putnam's point, with which I agree, is that before you can make that claim you need a good idea of what direct general purpose learning mechanisms might be able to achieve.

      Your first and last points are kind of related -- why would we want a theoretically neutral description (TND) as opposed to a theoretical explanation (TE)? So by TND, I mean -- to adopt your gravitational example -- the claim that the path of a falling body follows a parabola, or that the orbits are conic sections -- as opposed to the TE that there is an inverse square force.
      So sure, we are interested in TE. But suppose you want to compare two different theories, or critique one. Then obviously you can't use the TE, since that would presuppose the very question that is under examination.
      For example Berwick & Chomsky in their cognitive science 2011 paper take some care to use only TND of the auxiliary fronting examples to avoid this pitfall.

      Your last point is a little baffling -- by the way there aren't any learners that can learn all MCFGs (or CFGs for that matter) -- of course I agree that just because we have an algorithm that can learn a sufficiently large class of grammars, doesn't mean that that is how we humans actually do things.
      It would just be a hypothesis. So you want me to "show how some specific feature that has been isolated as characteristic of our Gs is learned using these methods... [t]ake just ONE case and do it".

      So "characteristic" is quite strong -- so maybe let's pick Chomsky's discrete infinity and recursion (following Hauser Chomsky and Fitch) ..
      can we learn an infinite set of discrete hierarchically structured expressions from strings?
      Yes. See my paper about to come out in JMLR -- which is available at lingbuzz/001735.

    12. "Avery, Alex D and I think also Colin, have made arguments along this line: a general purpose learner (direct learner, learner without UG etc etc ) can't learn a phenomenon (islands, that-trace, ..) so it must be innate."

      Just speaking for myself, I'm open minded about this in principle. If you can show how a general purpose learner could plausibly learn one of these constraints given a not implausibly huge amount of data, then great. With regard to constraints on A' movement this strikes me as highly unlikely on the face of it. You'd need a ton of data to find (i) evidence for all the details of subjacency and the ECP and (ii) evidence that these details stay constant across a bunch of superficially unrelated constructions which all instantiate the same abstract dependency type. So e.g., even if you get it all figured out for wh-questions, you need to find evidence that the same constraints apply in comparatives and tough-movement.

      Re your last two paragraphs, I would say that if discrete infinity and recursion were the only interesting properties of syntax then that wouldn't provide very much support for a rich UG. This is a case where I think Minimalism can be a bit of a PR disaster with regard to the interaction of syntax with other fields. Syntacticians have not suddenly decided that there are no interesting syntactic phenomena besides discrete infinity and recursion. The speculative research program that Minimalism is pursuing is that of showing that these properties are in some highly abstract sense the only essential properties of human language (the rest deriving from interaction with other cognitive faculties, some notion of efficiency or optimality, etc. etc). From the point of view of a learner, all the rich and complex generalizations of GB theory are still there. I sort of wish that we could conduct these discussions just forgetting about Minimalism and assuming something like GB.

      So maybe Norbert was being a tad hyperbolic. We don't want you to take just any old case and do it. Take one of our parade cases and do it.

    13. Auxiliary fronting in polar interrogatives in English?

      There is something I am uneasy about here; ok, so I understand that syntacticians are interested in the things they are interested in -- quirky case in Icelandic and parasitic gaps and so on. And these are the most complex bits of language, and there is a lot of theoretical debate about how islands for example should be modeled. So aren't the active areas of debate in syntax the worst areas to study learnability? I understand why you pick them -- because they are maybe the hardest challenge, and thus the strongest argument for innateness.
      And I think that we are still a few years away from understanding how we could learn these more complex phenomena.

      But if what we are interested in is finding out roughly where the boundary is between what can be learned by domain-general methods and what needs a rich UG to be learned -- this is what I am interested in anyway -- then the models published up to now are obviously a poor guide to where the ultimate boundary is going to end up. This theory is very new and rapidly developing, and what we can do is changing every year -- MCFG learning is only a few years old (first journal paper published in 2011!).

      So basing your argument on the fact that we don't at the moment have a program that can learn parasitic gaps from a corpus of child directed speech seems a little strange ... was that the sort of demonstration you had in mind? Because you are right that that is still some way off.
      If I had an algorithm that could learn all of syntax, even the tricky bits, from CDS then that would be a complete answer to nearly all of linguistics, right?
      So that seems a high bar to set opposing theories.

      Maybe if you could give an example of the sort of (positive ) demonstration that the innate UG helps the learning process, then I can think about whether that is within the scope of the learning models we have at hand.

      So give me an example of a case where a theory you like demonstrates success in learning in a manner you find convincing.

    14. If I am not mistaken, Alex C delivered one such. Here it is again. Take any learning account of Topic, WH-question formation, focus movement etc in SIMPLE clauses (e.g. any learning account that manages to "learn" how to form questions like 'what (did) Bill eat'). Add a GBish version of UG and that same system has now acquired the knowledge that you cannot form "*What (did) you meet a man who ate." So, take your favorite theory of simple clauses, add UG and you get an account of what can happen in complex clauses. Done.

      Here's another: take your favorite theory of learning that "DP-self" is a reflexive morpheme in English. Add Binding Theory and you derive that "John believes Mary to like himself" is unacceptable.

      So, the UG accounts (if correct) cut down what needs acquiring. As any theory will have to account for both the acquisition of knowledge about what happens in both simple and complex clauses, I assume that reducing the learning problem to simple clauses is a contribution. At any rate, WHATEVER the theory for what happens in simplex clauses is, once you have it, you get the rest.

      How's that? A nice feature of UG is that it localizes where acquisition needs to take place. It thus reduces the problem to a subset of the whole. It thus provides an instant recipe for learning the whole once one knows certain specified subparts (generally in simple clauses, but this is up for debate, you know degree 0 vs degree 1 etc). So UG "helps" with the "learning problem" (though I am not sure that "learning" is the right term) pretty directly.

      Your turn.
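
The recipe above can be rendered as a deliberately trivial sketch (entirely hypothetical; the island filter here is a stand-in for subjacency, not a real GB implementation): a learner induces wh-fronting from simple clauses only, and a fixed, unlearned constraint then handles the complex-clause cases with no further learning.

```python
# Toy "UG localizes acquisition" sketch: the only thing learned is the
# fronting rule, from simple clauses; the island constraint is given.

WH = {"what", "who", "where"}

def learned_fronting_rule(examples):
    """'Learn' from simple clauses that the wh-word appears clause-initially."""
    return all(ex.split()[0] in WH for ex in examples)

def generates(question, gap_inside_relative_clause, rule_learned):
    """Generate a question iff the learned fronting rule applies AND the
    innate (never learned) island constraint is respected."""
    if not rule_learned:
        return False
    if gap_inside_relative_clause:      # subjacency-style filter, given by UG
        return False
    return question.split()[0] in WH

rule = learned_fronting_rule(["what did you eat", "who left", "where do you live"])
ok      = generates("what did Bill say you ate", False, rule)      # long-distance: in
blocked = generates("what did you meet a man who ate", True, rule) # island: out
```

Of course, as the reply below presses, the hard part is whether a real learner and a real UG module compose this cleanly; the sketch only shows the shape of the claim, not that it can be made to work.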

    15. There is certainly a lot of debate regarding how islands should be modeled, but the basic generalizations are pretty well understood and in many cases extremely robust cross-linguistically (as David P noted wrt adjunct extraction). Theoretical syntax is probably never going to be "finished", so it's always going to be the case that there will be debate at the cutting edge of the field over how to model particular phenomena. I hope this doesn't mean we have to stick with subject/aux inversion as our only case study. We should stick to cases which are well understood at a descriptive level, and island phenomena certainly meet this criterion.

      The thing about POS arguments is that they argue not from the lack of a suitable learning theory but from the lack of any evidence in the input. You can't learn that for which there is no evidence, however fancy your learning algorithm. So I personally would not bet the nativist farm on there being no learning algorithm for MGs (or whatever) that makes it possible in principle to learn islands, parasitic gaps, etc. etc. Whether or not such an algorithm exists is a complex mathematical question and I agree with you that we should not try to guess at the answer.

    16. @Norbert -- sure, I understand that is the idea but can you make it work?
      Has anyone ever made this work?

      At this point you are just saying: if you, Alex, build a car, then I, Norbert, will come and paint some go-faster stripes on the side. That's very kind, but I'd like some detail. Pick some algorithm, define your UG, and show that this makes it faster. Otherwise, it's just wishful thinking.

    17. I just want to get this right. Your claim is that the problem of learning a rule, say, for WH questions that just applies correctly in simple clauses, is no different from the problem of learning the rule as it applies in both simple clauses AND in all sorts of embedded clauses. Or, a learner that acquires a rule for correctly generating WH questions in simple clauses will necessarily also acquire the rule for forming good WH questions (and not forming bad ones) in arbitrarily complex clauses. Really? There is no difference in the complexity of the two rules involved. That's your claim?

      I should add, that in many parts of the world showing that what appears to be a complex problem reduces to a simpler problem is considered progress. In my view, learning a rule of question formation that applies correctly in simple clauses is quite a bit easier than one that applies in arbitrary clauses. The learner will have to acquire the wh-fronting rule, more or less 'move the wh word to the front', based on examples like 'who left', 'what did you eat', 'where do you live'. From such examples the kid has to learn the stated rule. Now, I'm no expert here, but I'd be happy to take a bet that there is a pretty trivial learner out there that can do this. Interested? And if there is, then GIVEN a GB version of UG, that's all that needs learning. Or, such a rule, once learned will generalize to cases like 'what did Bill say you ate' but not to 'what did you meet a man who ate.' Get one, get them all. But you're telling me that in your view, assuming nothing about bounding and subjacency, learning a rule that correctly generates 'what did you eat' will already have enough to it to NOT generate 'what did you meet a man who ate.' Right? I want to hear you say this, so please answer.

    18. Forensic examination of the time stamp of my last post will reveal a post-pub lack of judgment -- in the morning my comment doesn't seem as clear ...
      I agree with you on two points: first that we should reduce the complex problems of learning to simple ones, and secondly that we need simple learners as well as more complex ones.

      I agree also that the learners we have at the moment are too slow (this was Alex D's point) -- and that you can't learn each set of construction specific movement rules separately. And so we definitely need some "go-faster stripes" as I dismissively called them -- maybe a "turbo-charger" would have been a better example.

      But given my expertise in learning algorithms I don't see how one can attach a UG based turbocharger (where UG means innate and language specific) to a non UG based car. I just don't see how that would work. It may be possible, but it depends on the detail of the simple algorithm and the details of the constraints on the feature system etc. etc.

      Parenthetically, you say "learning a rule that correctly generates 'what did you eat' will already have enough to it to NOT generate 'what did you meet a man who ate.'".
      The approaches I have in mind don't have a general move-alpha with constraints, rather each local movement rule is learned separately so actually, learning the former won't automatically lead to the latter. The problem is undergeneration and needing too many positive examples, rather than blocking overgeneration.

      (Public service announcement: don't drink and post)

    19. Norbert: "Here's another: take your favorite theory of learning that "DP-self" is a reflexive morpheme in English. Add Binding Theory and you derive that "John believes Mary to like himself" is unacceptable."

      Not quite, because how do you know that DP-self isn't a long-distance anaphor? (Thrainsson 2007 on Icelandic, pp. 473-474, might be especially worth looking at; subjects but not non-subjects can antecede sig-reflexives in a complement infinitive.) E.g.:

      Anna_i telur [þig hafa svikið sig_i]
      Anna believes you to have betrayed self
      "Anna believes that you betrayed her"

      You can't just derive it; you would seem to have to look. And is there yet a 'finished' system of parameters for all of the variations that are found, often in rather complex situations?

    20. @Avery: You are, of course, correct. The GB binding theory is incomplete, and how to incorporate long-distance anaphors is not particularly clear. However, this said, I don't believe that it invalidates the general point I was trying to make; viz. that if GB correctly describes UG, then this would advance the learning problem. APL, for example, appear to deny this (as have others). I am arguing that they are incorrect. The argument presupposes that GB is roughly adequate, for their argument seems to be that even were it so, it would not advance the problem any. What you are pointing out is that the GB conception is not adequate, even in its own terms. This is a very good objection. The one that claims that even if correct it adds nothing is, IMO, close to incoherent. That's what I wanted to show.

    21. Yes; I quibble because it's the most oversimplified stuff that is likely to be remembered when people look in here to find out what people who think that there's some kind of UG are up to, so some gesturing at the fuller picture is needed, I think. But I do buy Alex C's point about the dubious prospects of hooking up a random theory of UG to a random learner. The fitting and joining is likely to be insanely difficult.

    22. Quibble away. As for your last sentence, I have no idea what you mean. I suspect that finding the right UG will be hard. I think that if this is found, the learning problem will not prove to be that difficult as there will be far less to learn than we currently believe. Indeed, I am starting to think that there is very little learning when it comes to core biological capacities. Acquisition yes. Learning no. But I am patient and will wait and see.

    23. Alex: "So aren't the active areas of debate in syntax the worst areas to study learnability?" Or possibly the best, since they are where the most simple-minded and superficially attractive ideas are most likely to crash and burn? I'm not going to sneer at some learner that can't do QC in Icelandic yet, but one such that I can't imagine how it might ever be extended or modified to manage that is a different story. (Part of the interest in QC comes from the fact that it produces one of the more spectacular arguments that some (possibly very weakened) form of the UTAH is correct, at least to the extent that all arguments selected by a verb in simple syntactic environments are present syntactically in more complex ones where it is conceptually possible, but apparently false, that they are syntactically absent.)

  8. Norbert writes: "Could you please point me to any learner that learns [...] island effects [...] or anything similar that develops competence in these domains without any built in knowledge of the GB variety. I'd love to see how they work."
    Good question. One place to look would be a couple of chapters in a just-published collection on islands edited by Sprouse and Hornstein. Lisa Pearl & Jon Sprouse have a very interesting model of learning island constraints from real parental input, without prior knowledge of the constraints. And then there's a chapter by me that argues that it doesn't work. Nevertheless the Pearl & Sprouse model is very interesting. There's a fuller version of that work in an article in Language Acquisition, 2013.
    Norbert also writes: "I am less critical of the basic facts than you seem to be. Of course a more refined set of descriptions would be useful to probe deeper, but the rough lay of the land strikes me as pretty well described."
    But this is what makes the problem very interesting right now. Extant proposals -- on either side -- have practically nothing to offer to date. In the domain of islands, for example, it does look like the "rough lay of the land" is more-or-less accurate. The same kinds of things create islands in more-or-less the same way in many constructions in most any language. We're not discovering new island types all the time, and we don't find languages that diverge dramatically from the cross-language tendencies. But there's lots of annoying variation in the extant data. Yes, in general if wh-fronting can't escape island X, then relativization and topicalization also won't escape island X. And if construction Y is island-sensitive (e.g., scrambling, comparatives), then it's sensitive to the usual gamut of islands. But we find all kinds of persnickety exceptions to this, at least on the surface. We need good ideas for how that kind of variation could be learned ... or how learners could somehow peel away the confounds. Reductionist accounts don't address that. But nor do accounts that say "well, the facts seem to be roughly right".

    1. Re Jon and Lisa's stuff: I agree that this is interesting, but right now, I don't think it works, for reasons YOU are familiar with. First, it's very, very sensitive to single pieces of PLD, given the extreme sparsity of the relevant input, as you note. Second, it presupposes that we can acquire full phrase structure without UG. Once embedding is addressed, as any theory of islands must address it, this is not at all obvious. Third, even if entirely correct, it's a search for legal paths FOR MOVEMENT, requiring that movement as a type of dependency be stipulated. It holds for these and not, say, for binding, and it holds for all of these regardless of the construction type. So, yes, the approach is worth exploring, but it is not UG innocent even if correct. It effectively comes down to how one learns the bounding nodes.

      I agree the problem is interesting. I see that you find my apparent complacency disturbing. I do not mean to suggest, nor do I believe, that all problems have been solved. What I do believe, and what I think is far more contentious than it deserves to be, is that we have established a serious set of factual baselines and that these are very real. Moreover, taken together, these established facts already suffice to argue for an interesting view of UG. It may not be exactly right, but, if we wait for that, we will wait forever. Nothing is EVER exactly right and people should always want more. But not at the price of dissing what has been found. When I say that the facts are roughly right, I mean that the well-known complications will not disturb the broad conclusions we can now draw. Of course, I may be wrong. Waiting for the perfect description will protect one from this terrible fate, but at the cost of never saying much at all. So, unless you believe that the refinements (yes, they are worth pursuing) will overturn the general lines of what we have found, failure to consider what our results imply is not a service to the enterprise. Modesty about one's accomplishments is a virtue. False modesty is not, for it can impede insight as much as complacency can.

    2. Yes, I'm aware of the limitations of the Pearl & Sprouse learning model, but I think it's a serious attempt, and they are quite upfront about their learner's priors. No attempt to bury their assumptions. So I hold it up as a model of the kind of attempt that we'd like to see, in order to make the discussion as serious as possible.

      But on the main topic - I do not mean to belittle the accomplishments of the past, and I agree that there are many important and enduring generalizations. But I think that the big challenge going forward will be to explain why those generalizations seem to be only approximately right. This is not a matter of waiting around for the perfect description. Rather, it's where I'd place my money on the route to the next big breakthrough. If there's no cross-language or cross-construction (e.g., wh vs. topicalization) variation in hard-to-observe stuff, then all is dandy, and we can simply hard-wire it into the kid, end of story. (Well, apart from the pesky biology business.) But if there's variation in the hard-to-observe stuff, then life gets a whole lot more difficult. My hope is that the apparent variation will turn out to be illusory, masked by surface confounds, and that the "roughly true" generalizations will turn out to be "really true" at the right level of analysis. And I don't know of any other plausible ways of explaining microvariation in hard-to-observe phenomena. But hope is just that. And I'm an inveterate worrier, as you know.

    3. I have the impression that the standard repertoire of islands makes 100% correct predictions in 100% of all studied languages, so long as we restrict our attention to monomorphemic words for "why" and "how" -- i.e. that the variation comes when we extract nominals (or in some versions, when we extract canonical arguments). Or is there reported variation in this domain as well? If my impression is correct (and this is as good a place as any to learn if it's wrong), there's some reason to be optimistic about the "illusory, masked by surface confounds" view -- since we have now localized the variation against a background of indisputable uniformity. (That is, something about nominals. Start thinking about phi-probes etc.)

    4. Chiming in a bit late, but it is not true that Pearl & Sprouse's paper (admirable though it is) represents a model with no knowledge of the GB variety. What they argue is that if the learner knows that movement dependencies will be bounded in their paths, it is possible to learn the constraints on their paths. So, it's very much like GB knowledge (since the bounding theory is a theory of what's a possible path), only it doesn't prespecify the paths. Now, I don't think even this attempt succeeds (at least for the reasons that you laid out in your reply, but also because the same learning algorithm will yield roughly the same conclusions for quantifier variable binding, which is unbounded and also unconstrained by subjacency), but even if it did succeed, it would still represent a theory with a significant amount of innate syntactic knowledge about long-distance dependencies. It would be an advance in the sense of taking part of what we thought was in UG (e.g., bounding nodes or their equivalents) out of UG, but it would in no way be evidence that a general purpose learner with no antecedent knowledge about movement dependencies could discover that such dependencies were constrained by subjacency.
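For readers wanting a feel for the kind of path-based learner discussed in this thread, here is a toy sketch in the spirit of the Pearl & Sprouse idea (my own loose reconstruction, not their code: the learner tallies trigrams of container nodes along attested dependency paths and scores a novel path by the product of its smoothed trigram probabilities; the node labels, numbers, and smoothing scheme are all invented for illustration):

```python
from collections import Counter

START, END = "<s>", "</s>"

def trigrams(path):
    """Pad a container-node path and yield its trigrams."""
    padded = [START, START] + path + [END]
    return zip(padded, padded[1:], padded[2:])

def train(attested_paths):
    """Count trigrams over the dependency paths attested in the input."""
    counts = Counter()
    for p in attested_paths:
        counts.update(trigrams(p))
    return counts

def score(counts, path):
    """Score a candidate path: product of (crudely) smoothed trigram probabilities."""
    total = sum(counts.values())
    prob = 1.0
    for tg in trigrams(path):
        prob *= (counts[tg] + 0.01) / (total + 0.01)
    return prob

# Toy "input": container-node paths of attested wh-dependencies
attested = [["CP", "IP"],
            ["CP", "IP", "VP"],
            ["CP", "IP", "VP", "CP", "IP", "VP"]]
counts = train(attested)

# An unattested path through an NP (e.g., a relative clause island)
# scores far lower than a licit path built from attested trigrams.
licit = score(counts, ["CP", "IP", "VP"])
island = score(counts, ["CP", "IP", "NP", "CP", "IP", "VP"])
```

Even this toy makes the point raised above: what is "learned" is which movement paths are legal, so the learner still presupposes that there are movement dependencies whose paths need tracking.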

  9. @Colin "One place to look would be a couple of chapters in a just-published collection on islands edited by Sprouse and Hornstein." Lots of homework there, but a quibble: I think Ash Asudeh mauls that-trace as a universal rather severely in his 2009 LFG paper, and there's also Gathercole 2002 on its late and iirc somewhat flaky acquisition.

    I wonder if a possible form of solution to typological tendencies like this might be that there are universal functionally based inhibitions that tend to lower the frequency of certain kinds of structures, which the learner then sometimes crystallizes as grammatical constraints.

  10. @Avery: Not the main point of this discussion, but since you brought it up. I'm not sure what you mean by "Ash Asudeh mauls that-trace as a universal rather severely in his 2009 LFG paper". I assume you have in mind this paper (click), right? I don't see the mauling. Asudeh offers an LFG account of the standard facts, and then criticizes various other theories, but I don't see any mauling relevant to the current discussion.

    He adds one fact to the traditional stew, claiming that though various adverbs between "that" and the extraction site ameliorate the effect, it is striking that VP fronting doesn't:

    *Who does Mary know that doubt her never could?

    -- but fails to note that VP fronting creates an island for any kind of extraction, so it's not clear that the data is even relevant to the topic of "that-trace" effects, cf:

    *Which day does Mary know that celebrate her birthday we never would on __.

    So where's the mauling?

    1. @Avery: as it happens, I've been working with a group of students at UMD on following up on some of the that-trace 'anomalies'. Our cross-language digging has found the problematic cross-language cases to be less problematic than we had expected. There will be a paper on that in the hopefully not-too-distant future. Of course, there are well-known cases where the surface sequence appears to be ok, but it's old news that those examples are probably misleading.

      Re: that-t and acquisition. Being learned late or in a flaky fashion does not undermine the learning problem. What matters is whether speakers reliably arrive at the same conclusion within a community with similar input. It seems that they do. (Yes, there's alleged dialect variation in English. Cowart found no evidence of that, nor do we.) More interestingly, Pearl & Sprouse analyze the input that children receive, and the parental input that one might hope would support direct learning of that-t turns out to be wildly misleading. The parental input seems to show overwhelming that-deletion, irrespective of which argument is extracted, i.e., no evidence for a constraint that selectively targets subject positions.

      As for the notion that roughly-true generalizations are the result of universal, functionally based inhibitions that then get tweaked -- I'm skeptical. That leaves open the question of how the tweaking happens. And given the scarcity of the relevant input, it's far from clear how the input would reliably lead to the appropriate tweaking.

    2. Gathercole found that they didn't really acquire it until they started going to school so perhaps parental input really isn't enough. My belief is that tweaking in the form of distributional learning without benefit of a finite set of parameters is something that we will need to deal with in any event, but time will tell.

      @David, perhaps I overestimated the degree of mauling, but the big problem with the last sentence might be stranding a PP under VP preposing:

      ?? insult a policeman we never would in Boston

      Extraction without such PP stranding doesn't seem so bad to me:

      ?which day does Mary know that celebrate her birthday on we never would

      Ash's example lacks potentially offending features such as extraction from a moved item and stranding PP (fragments) by VP movement, so still looks significant to me (so far).

    3. This comment has been removed by the author.

    4. @Avery. I actually stranded the preposition to make sure we know where the gap is, and picked an adjunct (in my example, temporal) PP precisely because it *can* be left behind by VP-fronting:

      Celebrate her birthday we never would on any national holiday.

      Not the kind of sentence I go around saying on a daily basis, of course, but I don't think the PP is the problem.

    5. I've tended to find PPs stranded by VP-preposing pretty bad, which is why I left them out of the squib I wrote on it once. But they do seem better with this construction.

      Asudeh seems to have left the empirical aspects mostly to Falk 2006 (subjects book); if dialect variation in English is taken care of, then there's Hebrew she vs im (p 130-131) to sort out, plus the supposedly late acquisition claimed by Gathercole. Given the role of that-trace as a poster child for UG, it would be very interesting if it could be fully restored (including a spiffy explanation).

  11. Readers of this thread might be interested in Lisa Pearl's commentary on the Ambridge et al paper.

    1. Interesting paper indeed; it will certainly help me write my commentary [as will some of the opinions voiced here, so in advance a thank-you to everyone who provided stimulation!]