Wednesday, November 21, 2012

A Quick Thanksgiving Reply to Alex, Avery, and Noah

I am still playing impresario to Bob Berwick's arias. Enjoy! What follows is all Robert C. Berwick: 

Excellent questions all; each deserves a reply in itself. But for now, we’ll have to content ourselves with this Thanksgiving appetizer, with more to come. The original post made just two simple points: (1) wrt Gold, virtually everyone who’s done serious work in the field immediately moved to a stochastic setting, over 40 years ago, in the best case coupling it to a full-fledged linguistic theory (e.g., Wexler and Degree-2 learnability); and (2) that simply adding probabilities to get PCFGs doesn’t get us out of the human language learning hot water. So far as I can make out, none of the replies to date has really blunted the force of these two points. If anything, the much more heavy-duty (and much more excellent and subtle) statistical armamentarium that Shay Cohen & Noah Smith bring to bear on the PCFG problem actually reinforces the second message. Evidently, one has to use sophisticated estimation techniques to get sample complexity bounds, and even then, one winds up with an unsupervised learning method that is computationally intractable, solvable only by approximation (a point I’ll have to take up later).

Now, while I personally find such results enormously valuable, as CS themselves say in their introduction, they’re considering probabilistic grammars that are “used in diverse NLP applications…” “ranging from syntactic and morphological processing to applications like information extraction, question answering, and machine translation.” Here, I suspect, is where my point of view probably diverges from what Alex and Noah subscribe to (though they ought to speak for themselves of course): what counts as a result? I can dine on Cohen and Smith’s fabulous statistical meal, but then I walk away hungry: What does it tell us about human language and human language acquisition that we did not already know? Does it tell us why, e.g., in a sentence like He said Ted criticized Morris, he must be used deictically, and can’t be the same person as Morris? And, further, why this constraint just happens to mirror the one in sentences such as Who did he say Ted criticized, where again he must be deictic, and can’t be who; and, further, why this just happens to mirror the one in sentences such as He said Ted criticized everyone, where again he must be deictic, and can’t mean everybody; and, then finally, and most crucially of all, why these same constraints appear to hold, not only for English but for every language we’ve looked at, and, further, how it is that children come to know all this before age 3? (Examples from Stephen Crain by way of Chomsky.) Now that, as they say, is a consummation devoutly to be wished. And yet that’s what linguistic science currently can deliver. Now, if one's math-based account could do the same – why this pattern must hold, ineluctably – well, then that would be a Happy Meal deserving of the name. But this, for me, is what linguistic science is all about. In short, I hunger for explanations in the usual scientific sense, rather than theorems about formal systems.
Explanations like the ones we have about other biological systems, or the natural world generally. Again, remember, this is what Ken Wexler's Degree-2 theory bought us – to ensure feasible learnability, one had to impose locality constraints on grammars that are empirically attested. In this regard it seems my taste runs along the lines of Avery Andrews’ comment regarding the disability placard to be awarded to PCFGs. (Though as I’ll write in a later post, his purchase of Bayes as a “major upgrade over Chomsky” in fact reveals, perhaps surprisingly, that he’s bought the older, original model.) Show me a better explanation and I’ll follow you anywhere.

As for offering constructive options, well, but of course! The original post mentioned two: the first, Ken Wexler's degree-2 learnability demonstration with an EST-style TG; and the second, Mark Steedman's more recent combinatory categorial grammar approach (which, as I conjectured just the other day with Mark, probably has a degree-1 learnability proof). (And in fact both of these probably have nice Vapnik-Chervonenkis learnability results hiding in there somewhere – something I’m confident CS could make quick work of, given their obvious talents.) The EST version badly wants updating, so there’s plenty of work to do. Time to be fruitful and multiply.


  1. It is entirely reasonable to criticise the positive PCFG learning results on the grounds that they fail to address computational complexity considerations. For me, computation is the central problem -- not the lack-of-negative-evidence issue, which is a sideshow in light of a probabilistic view of learning -- we agree on that.

    But does either the Wexler and Culicover work or the Kwiatkowski et al. (incl. Steedman) work have a satisfactory answer to this problem?

    Moreover, the W & C work is based on very unrealistic inputs -- if I recall correctly (and I may not), they assume a deep structure tree as input. Is that a reasonable assumption in your view? Is the assumption of learning from sound/meaning pairs (as in the Edinburgh work) reasonable as a model of language acquisition?

  2. RCB replies:

    It’s good to know that for some, the lack of negative evidence “becomes a sideshow” resolved by probabilistic learning. Unhappily, this doesn't help me sleep any better at night, because I don’t subscribe to this view. None of the responses so far has offered any explanation for how it is that children acquire knowledge that’s never attested in adult input, or knowledge that runs counter to adult input – so that probability matching simply fails (including all the much-touted Bayesian models known to me). And by this I mean an explanation of, e.g., how children by age 3 come to know that pronoun anaphora, wh-crossover examples, and quantifier-pronoun anaphora all pattern alike. There are many more examples; Stephen Crain's latest book is chock full of them. This is the bread-and-butter of modern linguistic theory, of course – as Crain shows, linguistic theory does explain all this. I’ll touch on this in a later post.

    To recall a bit more history, in 1970, Ken Wexler and Henry Hamburger first tried for a learnability proof working only from surface strings – and showed this was not possible, for EST style transformational grammars. (It’s in the chapter immediately following Ken’s measure-1 learnability account.) Good scientist that he is, Ken then resorted to (base, surface string) pairs – and wound up writing nearly 100 pages in his 1980 book justifying this strong medicine, which he, along with everyone else, knew was too strong.
    But to stop here misses the point. As Ken stressed, even when given this enriched data, one still requires substantive, empirically verified constraints on the space of possible grammars to establish learnability – locality constraints on movement, as mentioned in previous posts – and several others besides. That’s what we already knew 40 years ago. Thirty years ago (shameless plug follows), I tried to improve on this by reducing base structure to just a rough thematic argument structure – again, computational feasibility was the goal. Again strong locality constraints were required. Can we now do better? I hope so. But here it seems to me that Ken’s strategy still applies: one starts with a known linguistic theory – which is to say, one that we already know has attained some degree of explanatory adequacy – and builds a learning infrastructure around that.

    1. I think some aspects of children's not running counter to their input are explainable in Bayesian terms, such as, for example, not making basic word-order mistakes. Suppose that the correct theory of surface phrase structure were the ID/LP theory plus Bresnan's (2001) general ideas about how X-bar theory works (right-branching extended projections with discourse-function-linked specifier positions, plus an exocentric S that can be free word order or not, depending on the LP rules).

      Some languages such as Japanese or Tagalog will have Linear Precedence restrictions such as YP < X^0; others won't have them at all (Warlpiri, in finite clauses), or only relatively weakly (Ancient Greek). (N.b. I think Legate makes an excellent case for discourse projections and some configurationality in Warlpiri, but not for configurationality inside what would be S for Bresnan.) The LP restrictions are a stipulation, so grammars containing them will be longer than those that don't, creating the negative-evidence problem of how Japanese speakers learn to put the verb at the end, and Tagalog speakers at the beginning.

      With a two-part MDL approach (e.g., Mike Dowman, Anne Hsu, and others), the restriction, added to the grammar (increasing its cost), reduces the information needed to specify where the verb goes, reducing the cost of specifying the corpus; and it should be clear that, since the corpus can grow indefinitely, the savings there will ultimately compensate for the extra cost of the rules.

      Indeed it seems to me that if you measure the cost of the corpus as the amount of information needed, in bits, to specify which possible way of expressing the meaning is actually used (4 variant word orders = 2 bits, for example), corpora consisting of around a dozen glossed sentences, such as the Japanese problem in the Akmajian and Heny book, will be enough to select the grammar with the head-final restriction as best. And of course we do seem to expect students to be able to do that too.

      Children have presumably understood and heard far more than a dozen simple sentences before they start producing multi-word utterances that could in fact violate basic word order patterns of their language, so there's no evidence problem in explaining their behavior.

      Finding the correct grammar amongst the possibilities is another matter, but we really don't know what the true computational resources are that the brain can bring to bear on that problem. I think Chomsky's idea of distinguishing the problem of mathematically determining the best grammar vs finding it is still valid, and that the former problem is more suitable for descriptively inclined linguists to worry about.
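The two-part MDL bookkeeping described in this comment can be illustrated with a toy calculation. The specific numbers here are invented for illustration (in particular, the 10-bit cost of an LP rule is a made-up figure, not drawn from Dowman's or Hsu's work):

```python
import math

def corpus_cost_bits(n_sentences, n_orders):
    # Each sentence costs log2(n_orders) bits to pin down which of the
    # possible word orders was actually used.
    return n_sentences * math.log2(n_orders)

def total_cost(grammar_bits, n_sentences, n_orders):
    # Two-part MDL: cost of the grammar plus cost of the corpus given it.
    return grammar_bits + corpus_cost_bits(n_sentences, n_orders)

# Hypothetical figures: an LP rule (e.g. head-final) adds 10 bits to the
# grammar but leaves only 1 possible order, so the corpus costs nothing
# per sentence once the rule is in place.
RULE_BITS = 10
no_rule = lambda n: total_cost(0, n, 4)            # 4 free orders: 2 bits each
with_rule = lambda n: total_cost(RULE_BITS, n, 1)  # order fully determined

for n in (3, 6, 12):
    print(n, no_rule(n), with_rule(n))
```

With these invented numbers the crossover comes after five sentences: beyond that, the grammar that pays the one-time cost of the LP rule is shorter overall, which is the sense in which a dozen glossed sentences would already suffice to select the head-final grammar.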

  3. I was just talking about the lack of negative evidence -- not things like probability matching etc. Why doesn't the switch to probabilistic learning deal with the lack of negative evidence? If it doesn't then how is it different from Gold learning?

    Wexler's strategy (I don't feel I can say Ken's) is problematic -- because the theory may be wrong, in which case one's learning infrastructure is irrelevant. Now EST has been completely abandoned, so I don't see why looking at the learning of EST is a good idea.
    I like Stabler's MGs -- which seem not to have the problems of EST. I don't know what 'attained some degree of explanatory adequacy' means -- I don't think MGs have a learning theory yet -- but it seems better to target a live theory than to resuscitate an inanimate corpse, which would only give you a grammatical zombie ....

    The natural question is, if you are studying learnability of MGs, whether you should consider the inputs to be sound/meaning pairs or derivation trees or just strings -- I'd be interested in your view on that.

    1. "The natural question is, if you are studying learnability of MGs, whether you should consider the inputs to be sound/meaning pairs or derivation trees or just strings -- I'd be interested in your view on that."

      I've always been puzzled by people's interest in learning from plain strings. If the goal is to learn about language acquisition (as opposed to ways to train systems for practical applications from easily available data), how is treating the input to the learner as literally meaningless sequences of characters a plausible assumption? It's interesting in its own right to know what kind of results we can get from plain strings, and perhaps that's the best we can currently do as far as syntax is concerned: I don't think we have a good theory of what plausible representations on the meaning side should look like, and even if we did, we seem to lack sufficient amounts of this kind of data to perform any interesting computational experiments.
      But meaning seems to pop up in the earliest stages of language acquisition -- identifying phones and morphs without access to meaning seems like a pretty odd task to me -- making me extremely skeptical about what 'some kind of syntax just from strings' results can be taken to show about actual language acquisition by humans. (If it doesn't work, probably something essential is missing; if it does work, why should we believe that this is even remotely like what humans do?)

    2. I think phones and morphs without meaning are highly plausible -- there's a fair amount of work underway on meaning-free morphological analysis (some examples are on Mark Johnson's publication page at Macquarie Uni), and it's striking to me how much of morphological form is made out of bits and pieces that lack distinct meanings. For example, as far as I can tell, not a single one of the many derivational formations of Bahasa Indonesia is made out of a morph with a single meaning. 'meN-' (the marker for transitive verbs in actor focus) is about as close as it gets, but it occurs with some intransitives as well. The components of the notorious 'circumfixes', for example, all have alternate uses as prefixes or suffixes. The thematic vowels of Latin and Greek, still running rampant in the modern descendants, would be another example.

      Syntax I suspect very much is a different story.

    3. And of course John Goldsmith since about 2001 or so.

    4. I share Bebboe's skepticism about learning strictly from strings. When I started out in this game -- a long, long time ago, in a galaxy far, far away -- all the positive results about learnability that I could dig out (literally, since this was pre-internet) shared one thing in common: they all relied on structure in some way or another. And people knew the Hamburger and Wexler result about the unlearnability of transformational grammars from surface strings alone, and the corresponding positive result once one had (b, s) pairs.
      So for example, L. Crespi-Reghizzi (1971, IFIPS conference) developed a series of algorithms that worked for CFGs if one was given partial precedence structures along with strings. (As far as I can make out, this is roughly comparable to more recently cited work along these lines for CFGs.) Crespi-Reghizzi also developed algorithms for so-called 'non-counting' CFGs (JACM, 1978, 25:571-580).
      That was about it. Even so, all of this seems to have been forgotten in the current era. (Shameless plug again): I tried to incorporate some of this in the computer model described in my thesis work (1982/85), but only partly developed some of Crespi-Reghizzi's ideas. For example, there was an attempt to model a 'bootstrapping' system that could exploit prosody to get the main precedence breaks in a language, and once those were in place, could then refine the structural description of a sentence. (There's a method in computer science called 'precedence parsing' that only needs to know whether an element is a 'head' and another element is an 'argument or adjunct' -- when mapped over to more familiar linguistic terminology.) This was worked out for Japanese. The goal of course was to come up with the weakest assumptions that would still get you learnability -- something far weaker than (b, s) pairs. Eventually, the computer model wound up replicating the Wexler Degree-2 theory -- at least in terms of its constraints on possible grammar rules. The motivation for all of this was cognitive fidelity -- along with Bebboe, it didn't seem plausible to me then (or now) that language learning was simply a matter of surface string induction.
      One more result, also down the Orwellian memory hole because it was never published as far as I know: Michael Arbib (the computer scientist) had shown that 'the base component' was learnable from a sample of skeleton (i.e., unlabeled) tree structures of a certain depth (1978). This was done by analogy with the finite-state automata case, using the bridge between fsa's and cfg's by way of tree automata. For fsa's with n states, it's easy to show that their behavior can be characterized by all strings of length less than or equal to 2n; for CFGs, this carries over to all _trees_ of _depth_ less than or equal to 2n. So that's too deep: you need other restrictions.
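The string-length characterization for finite-state automata mentioned above can be sketched as a brute-force check. This is a toy illustration only; the DFA encoding and the two example machines are mine, not from the comment:

```python
from itertools import product

def accepts(dfa, s):
    # dfa = (transition dict, start state, set of accepting states)
    delta, state, accepting = dfa
    for symbol in s:
        state = delta[(state, symbol)]
    return state in accepting

def agree_up_to(d1, d2, alphabet, max_len):
    # Compare the two machines on every string of length <= max_len.
    for n in range(max_len + 1):
        for s in product(alphabet, repeat=n):
            if accepts(d1, s) != accepts(d2, s):
                return False
    return True

# Two 2-state DFAs over {a}: 'even number of a's' vs. 'accept everything'.
d_even = ({('q0', 'a'): 'q1', ('q1', 'a'): 'q0'}, 'q0', {'q0'})
d_all  = ({('q0', 'a'): 'q0'}, 'q0', {'q0'})

# Per the length bound quoted above, with n = 2 states apiece, strings of
# length up to 2n = 4 suffice; here the machines already disagree on 'a'.
print(agree_up_to(d_even, d_all, ['a'], 4))  # False
```

The point of the bound is that agreement on this finite set of strings is enough to conclude full equivalence; the analogous check over trees of bounded depth is what makes the CFG case combinatorially "too deep" without further restrictions.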

    5. There is work by Sakakibara that seems to be essentially the same as the Arbib result you describe, but later.

      Are you sceptical about whether learning from strings is the right model or whether it is computationally feasible in the absence of some rich prior knowledge?
      The P and P models were, I thought, all based around learning from strings -- like Charles Yang's.

    6. Yes: I’m quite familiar with the Sakakibara work on learning grammars given structural evidence. That’s precisely why I brought up the Arbib paper. I’ve never commented on this wrt S.’s work, since Arbib’s manuscript was never published. But I’ll post it if there’s interest and I can get Michael to OK this. (The only extant copy is in Ken Wexler’s files – again, it was Ken who was deeply into this starting from about 1970.) Personally, I see acquisition from ‘strings’ as taking one in the direction of learning external patterns, and in that sense going in the wrong direction; the focus for me is what happens inside. As for P&P models and learning – offhand, my understanding is that models like Fodor & Sakas’ are structural, not string-based. And there’s no reason in principle why Charles Yang couldn’t adopt this as well. At the moment, I don’t have any magic bullet for the challenge of bootstrapping structures from strings, beyond my modest contribution years ago – an incremental parser. There, if the learner couldn’t parse a particular sentence, it waited until a simpler one came along. I wasn’t entirely happy with that, but at the time, 35 years ago, it was the best I could muster.

    7. Please do post that stuff if you can find it -- I am reading a lot of early automata theory at the moment. People like Samuel Eilenberg knew a thing or two, and much of this theory is not part of the standard curriculum anymore.

      If you take language acquisition with trees as input, then as you seem to say (I don't want to put words in your mouth again) one still needs a story of how one can bootstrap trees from strings;
      and learning structures from strings is a very hard problem indeed.

      An alternative is to project them from meanings -- if one can 'parse the scene' into hierarchically structured meanings, then the learner could in principle project the structure from the meanings à la Steedman et al. (and much other work by people like Isabel Tellier, Annie Foret, Tim Oates, etc.). That would obviate the need for the tree-bootstrapping step, which may be intractable (though I have some work in the pipeline on this).

    8. From RCB:

      Alex: “I was just talking about the lack of negative evidence - not things like probability matching etc. Why doesn't the switch to probabilistic learning deal with the lack of negative evidence? If it doesn’t then how is it different from Gold learning?”

      Perhaps my last reply was unclear, or perhaps this again raises the point that Alex and I may simply be pursuing different explanatory goals. For in truth, I don’t see any concrete proposals at all for explaining what we actually see in language acquisition coming from the ‘switch to probabilistic learning.’ For the sake of argument, let’s consider two polar positions regarding language acquisition: (1) the “prior constraints” (prior constraints on grammars) view (PC); and (2) the “experience-driven” view, using ‘probabilistic learning’ (EDP). (Please, no quibbles, space is short here: clearly, nobody who subscribes to the prior knowledge view believes that experience plays no role in language acquisition. Similarly, one can maintain position (2), but weaken it.) Now consider just three out of the flood of examples that Stephen Crain presents in his recent book, The Emergence of Meaning (2012). There’s space here only for the briefest of sketches, lacking the impressive array of evidence that Crain marshals to bolster his account; perhaps in a future post we can return to the details. (Or better yet, we can have Stephen respond.)
      Fact 1: “children acquiring historically unrelated languages acquire the inclusive-or meaning of disjunction” and “never the exclusive meaning” (52). That is, the or in a sentence like “John ate sushi or pasta” means that either “John ate pasta” or “John ate sushi”, or both – not just in English, but in Russian, Dutch, Norwegian, Mandarin… and so on. The acquisition puzzle this raises is this: “it is a matter of record that the vast majority of young children’s experience is consistent with the exclusive-or interpretation… (e.g., Is it red or blue?; Do you want ice cream or cake?)” (35). Take a look at the Adam CHILDES transcript. Or any other. Now, perhaps I’m missing something here, but as far as I can make out, on the experience-driven probabilistic view, there’s no account of how to deal with this.
      Fact 2: children produce sentences that have zero support from the child-directed utterances they receive – rather, they speak some other language instead. Now, at first glance, this all seems bizarre – my daughters speaking Romani or Mandarin? Were they shanghaied at some point or switched at birth? But it’s all true, every last bit, and this seems pretty powerful evidence indeed for the PC view, and completely inexplicable on the EDP account. To take but two examples: (1) English, Basque, French, and Spanish children sometimes produce sentences that are only found in German or Romani, where a single wh word is repeated in a middle position, e.g., What do you think what pigs eat (cf. Wer glaubst du wer nach Hause geht, ‘who do you think (who) goes home’). (2) As confirmed by many experiments unpacked in Crain’s book, with respect to conjunctions, English-speaking children start off aligned with what Japanese and Mandarin adult speakers do, not English speakers.
      These three examples stand for many, many more. (Recall as well the anaphora/wh-question/quantifier alignment in a previous post, and Crain’s book has much, much more.) The evidence just keeps piling up, higher and higher.

      So, let’s take stock. On the PC side, we have an elaborated account that explains why all these empirically observed behaviors arise in child language acquisition, and why human languages (both child and adult) look the way they do, rather than, say, like Fortran. And, at least so far, on the EDP side we have – exactly nothing. I don’t know about you, but it would seem to me that the burden of proof now rests with those pursuing EDP models to deliver the goods. But perhaps they’re involved in some other pursuit. If so, it’s not the one that grips my interest.

    9. A good demonstration that some things follow from some kind of universal principle doesn't show that they all do -- indeed, some favored examples that the generative community used to make much of seem to have fallen on hard times, such as 'wanna contraction' (Andrea Zukowsky has a paper under review on it). I think each individual case has to be looked at on its merits, rather more closely than we used to consider necessary.

      Stephen is certainly doing everybody a service by lining up a lot of tough new cases for people who disfavor innatism to deal with.

    10. edit "that some things follow from" -> "that some things are well explained by"

    11. I am sure there is lots of interesting stuff about the acquisition of semantics in that Crain book which I have just ordered. I am more interested in syntax but I don't think that is the interesting difference between us.

      I am certainly not claiming that probabilistic learning is a magic bullet that solves every problem -- but I do think that it solves one problem, namely the no-negative-evidence problem. We can retreat from an overly general hypothesis without explicit negative evidence, using -- Chomsky's term -- indirect negative evidence.
      In particular, your Subset Principle, which derives from Angluin's result in a Gold paradigm, seems unjustified once you make the change to a probabilistic model.
      I honestly don't know at this point whether this is something you disagree with or whether you consider it so trivially obvious it's not worth confirming...

      And claiming there are no results at all is an exaggeration; the literature is full of debates where the merits of competing models are discussed -- e.g., Regier and Gahl, and Pearl and Lidz, on one-anaphora. I am not claiming that Regier and Gahl are right here -- just that they exist!

  4. Avery, that's a good source. But I also think one might bear in mind the original -- and that is Carl deMarcken's MIT thesis, which learned morphological forms from phonemic transcriptions. (Carl assumed there was some front-end 'speech recognizer' that could map a raw speech signal stream to phonemes. He had to start somewhere.) This was based on minimum description length (deMarcken, 1996). Goldsmith's work flows from this, as John will tell you. In fact, Carl did much, much more than this (but my prior is biased: I supervised his thesis). He even had a component that learned the mappings to both syntax and semantics. Most of that did not surface in his thesis, so I'll post links both to his thesis and to those missing bits later. Carl's work will join the discussion again when there's time to post about Bayesian learning methods and minimum description length -- promises, promises, I know. But I'll make good on this promissory note soon.