Some of this post is thinking out loud. I am not as sure as I would like to be about certain things (e.g. how to understand feasibility; see below). This said, I thought I’d throw it up and see if comments etc. allow me to clarify my own thoughts.
In Aspects (chapter 1: 30ff), Chomsky outlines an abstract version of an “acquisition model.” I want to review some of its features here. I do this for two reasons. First, this model was later replaced with a principles and parameters (P&P) account and in order to see why this happened it’s useful to review the theory that P&P displaced. Second, it seems to me that the Aspects model is making a comeback, often in a Bayesian wrapper, and so reviewing the features of the Aspects model will help clarify what Bayesians are bringing to the acquisition party beyond what we already had in the Aspects model. BTW, in case I haven’t mentioned this before, chapter 1 of Aspects is a masterpiece. Everyone should read it (maybe once a year, around Passover, when we commemorate our liberation from the tyranny of Empiricism). If you haven’t done so yet, stop reading this post and do it! It’s much better than anything you are going to read below.
Chomsky presents an idealized acquisition model in §6 (31). The model has 5 parts:
i. an enumeration of the class s1, s2… of possible sentences
ii. an enumeration of the class SD1, SD2… of possible structural descriptions
iii. an enumeration of the class G1, G2,… of possible generative grammars
iv. specification of a function f such that SDf(i,j) is the structural description assigned to sentence si by grammar Gj for arbitrary i, j
v. specification of a function m such that m(i) is an integer associated with the grammar Gi as its value
(i-v) describe the kinds of capacities a language acquisition device (LAD) must have to use primary linguistic data (PLD) to acquire a G. It must have (i) a way of representing the input signals, (ii) a way of assigning structures to these signals, (iii) a way of restricting the class of possible grammars available to languages, (iv) a way of figuring out what each hypothetical G implies for each sentence (i.e. an input/structure pair), and (v) a method for selecting one of the very very many (maybe infinitely many) hypotheses allowed by (iii) that are compatible with the given PLD. So, we need a way of representing the input, a way of matching that input to a Gish representation of that input, and a way of choosing the “right” match (the correct G) from the many logically possible G-matches (i.e. a way of “evaluating alternative proposed grammars”).
How would (i-v) account for language acquisition? A LAD with structure (i-v) could use PLD to search the space of Gs to find the one that generates that PLD. The PLD, given (i, ii), is a pairing of inputs with (partial) SDs. (iii, iv) allow these SDs to be related to particular Gs. As Chomsky puts it (32):
The device must search through the set of possible hypotheses G1, G2,…, which are available to it by virtue of condition (iii), and must select grammars that are compatible with the primary linguistic data, represented in terms of (i) and (ii). It is possible to test compatibility by virtue of the fact that the device meets condition (iv).
The last step is to select one of these “potential grammars” using the evaluation measure provided by (v). Thus, if a LAD has these five components, the LAD has the capacity to build a “theory of the language of which the primary linguistic data are a sample” (32).
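To make the moving parts a bit more concrete, here is a minimal Python sketch of the five components working together. Everything in it (the toy grammars, the compatibility check, the symbol-counting value function) is invented purely for illustration and is not meant to model any real G or any actual proposal in Aspects.

```python
# A toy rendering of the Aspects LAD: all names and data are invented for illustration.

# (i) possible "sentences" and (ii) possible structural descriptions are left implicit;
# each candidate grammar simply pairs sentences with SDs directly.

# (iii) an enumerated class of candidate grammars, each assigning SDs to sentences
CANDIDATE_GRAMMARS = [
    {"name": "G1", "assigns": {"the dog barks": "S(NP,VP)"}},
    {"name": "G2", "assigns": {"the dog barks": "S(NP,VP)", "dogs bark": "S(NP,VP)"}},
    {"name": "G3", "assigns": {"dogs bark": "S(NP,VP)"}},
]

# (iv) the function f: what structural description (if any) does Gj assign to si?
def f(sentence, grammar):
    return grammar["assigns"].get(sentence)  # None if the G assigns no SD

# (v) an evaluation measure m: here, crudely, fewer "rules" = more highly valued
# (following the convention that more highly valued Gs get lower numbers)
def m(grammar):
    return len(grammar["assigns"])

def acquire(pld):
    """Select the most highly valued grammar compatible with the PLD."""
    compatible = [g for g in CANDIDATE_GRAMMARS
                  if all(f(s, g) == sd for s, sd in pld)]
    return min(compatible, key=m) if compatible else None

# PLD: (signal, structural description) pairs, per (i) and (ii)
print(acquire([("the dog barks", "S(NP,VP)")]))  # -> G1, the smallest compatible G
```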
As Chomsky notes, (i-v) packs a lot of innate structure into the LAD. And, interestingly, what he proposes in Aspects matches pretty closely how our thoroughly modern Bayesians would describe the language acquisition problem: A space of possible Gs, a way of matching empirical input to structures that the Gs in the space generate, and a way of choosing the right G among the available Gs given the analyzed input and the structure of the space of Gs. The only thing “missing” from Chomsky’s proposal is Bayes rule, but I have no doubt that were it useful to add, Chomsky would have had no problem adding it. Bayes rule would be part of (v), the rule specifying how to choose among the possible Gs given PLD. It would say: “Choose the G with the highest posterior probability.” The relevant question is how much this adds. I will return to this question anon.
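Cast in Bayesian terms, (v) would look roughly like the following sketch. The priors, likelihood, and numbers are all made up; the only point is that Bayes rule slots into (v) as a way of scoring candidate Gs against PLD.

```python
# Hedged sketch: Bayes rule playing the role of (v). Priors and likelihoods are invented.

def posterior_scores(grammars, pld, prior, likelihood):
    """Unnormalized posterior P(G|PLD) proportional to P(G) * P(PLD|G), per candidate G."""
    return {g: prior[g] * likelihood(pld, g) for g in grammars}

def choose(grammars, pld, prior, likelihood):
    """'Choose the G with the highest posterior probability.'"""
    scores = posterior_scores(grammars, pld, prior, likelihood)
    return max(scores, key=scores.get)

# toy usage
grammars = ["G1", "G2"]
prior = {"G1": 0.7, "G2": 0.3}
likelihood = lambda pld, g: 0.9 if g == "G2" else 0.2   # pretend G2 fits the data better
print(choose(grammars, "some PLD", prior, likelihood))   # -> G2 (0.3*0.9 > 0.7*0.2)
```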
Chomsky describes theories that meet conditions (i-iv) as descriptive; those that also meet (v) he describes as explanatory. Chomsky further notes that gaining explanatory power is very hard, the reason being that there are potentially way too many Gs compatible with given PLD. If so, then choosing the right G (the needle) given the PLD (a very large haystack) is not a trivial task. In fact, in Chomsky’s view (35):
… the real problem is almost always to restrict the range of possible hypotheses [i.e. candidate Gs, NH] by adding additional structure to the notion “generative grammar.” For the construction of a reasonable acquisition model, it is necessary to reduce the class of attainable grammars compatible with given primary linguistic data to the point where selection among them can be made by a formal evaluation measure. This requires a precise and narrow delimitation of the notion “generative grammar”- a restrictive and rich hypothesis concerning the universal properties that determine the form of language…[T]he major endeavor of the linguist must be to enrich the theory of linguistic form by formulating more specific constraints and conditions on the notion “generative grammar.”
So, the main explanatory problem, as Chomsky sees it, is to so circumscribe (and articulate) the grammatical hypothesis space that for any given PLD, only a very few candidate Gs are possible acquisition targets. In other words, the name of the explanatory game is to structure the hypothesis space (either by delimiting the options or biasing the search (e.g. via strong priors)) so that, for any given PLD, very few candidates are being simultaneously evaluated. If this is correct, the focus of theoretical investigation is the structure of this space, which, as Chomsky further argues, amounts to finding the universals that suitably structure the space of Gs. Indeed, Chomsky effectively identifies the task of achieving explanatory adequacy with the “attempt to discover linguistic universals” (36), principles of G that will deliver a space of possible Gs such that, for any given PLD, only a very small number of candidate Gs need be considered.
I have noted that the Aspects model shares many of the features that a contemporary Bayesian model of acquisition would also assume. Like the Aspects model, a Bayesian one would specify a structured hypothesis space that ordered the available alternatives in some way (e.g. via some kind of simplicity measure?). It would also add a rule (viz. Bayes rule) for navigating this space (i.e. by updating values of Gs) given input data and a decision rule that roughly enjoins that one choose (at some appropriate time) the highest valued alternative. Here’s my question: what does Bayes add to Aspects?
In one respect, I believe that it reinforces Chomsky’s conclusion: that we really really need a hypothesis space that focuses the LAD’s attention on a very small number of candidates. Why?
The answer, in two words, is computational tractability. Doing Bayes proud is computationally expensive. A key feature of Bayesian models is that with each new input of data the whole space of alternatives (i.e. all potential Gs) is updated. Thus, if there are, say, 100 possible grammars, then for each datum D all 100 are evaluated with respect to D (i.e. Bayes computes a posterior for each G given D). And this is known to be computationally so expensive as to not be feasible if the space of alternatives is moderately large. Here, for example, is what O’Reilly, Jbabdi and Behrens (OJB) say (see note 5):
…it is well known that adding parameters to a model (more dimensions to the model) increases the size of the state space, and the computing power required to represent and update it, exponentially (1171).
As OJB further notes, the computational problems arise even when there are only “a handful of dimensions of state spaces” (1175, my emphasis, NH).
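To see the scaling point concretely, here is a back-of-the-envelope sketch (assuming, purely for illustration, binary parameters and a fully joint distribution): a fully Bayesian learner that updates the whole joint posterior touches every cell of the table on every datum, and the table doubles with each added parameter.

```python
def full_joint_update_cost(n_params, n_data):
    """Cells in the joint table over n binary parameters, and the total update operations
    if every datum requires revisiting every cell (the 'fully Bayesian' picture)."""
    cells = 2 ** n_params
    return cells, cells * n_data

for n in (5, 10, 20, 30):
    cells, ops = full_joint_update_cost(n, n_data=1000)
    print(f"{n:2d} binary parameters -> {cells:,} grammars to score, "
          f"{ops:,} updates for 1,000 data points")
```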
This would not be particularly problematic if only a small number of relevant alternatives were the focus of Bayesian attention, as would be the case given Chomsky’s conception of the problem, and that’s why I say that the Aspects formulation of what’s needed seems to fit well with Bayesian concerns. Or, to put this another way: if you want to be Bayesian then you’d better hope that something like Chomsky’s position is correct and that we can find a way of using universals to develop an evaluation measure that serves to severely restrict the relevant Gs under consideration for any given PLD.
There is one way, however, in which Chomsky’s guess in the Aspects model and contemporary Bayesians seem to part ways, or at least seem to emphasize different parts of the research problem. (I say ‘seem’ because what I note does not follow from Bayes-like assumptions. Rather, it is characteristic of what I have read (and, recall, I have not read tons in this area, just some of the “hot” papers by Tenenbaum and company).) Chomsky says the following (36-7):
It is logically possible that the data might be sufficiently rich and the class of potential grammars sufficiently limited so that no more than a single permitted grammar will be compatible with the available data at the moment of successful language acquisition…In this case, no evaluation procedure will be necessary as part of linguistic theory – that is, as an innate property of an organism or a device capable of language acquisition. It is rather difficult to imagine how in detail this logical possibility might be realized, and all concrete attempts to formulate an empirically adequate linguistic theory certainly leave ample room for mutually inconsistent grammars, all compatible with the primary linguistic data of any conceivable sort. All such theories therefore require supplementation by an evaluation measure if language acquisition is to be accounted for and the selection of specific grammars is to be justified: and I shall continue to assume tentatively…that this is an empirical fact about the innate human faculté de langage and consequently about general linguistic theory as well.
In other words, the HARD acquisition problem in Chomsky’s view resides in figuring out the detailed properties of the evaluation metric. Once we have this, the other details will fall into place. So, the emphasis in Aspects strongly suggests that serious work on the acquisition problem will focus on elaborating the properties of this innate metric. And this means working on developing “a restrictive and rich hypothesis concerning the universal properties that determine the form of language.”
Discussions of this sort are largely missing from Bayesian proposals. It’s not that they are incompatible with these and it’s not even that nods in this direction are not frequently made (see here). Rather most of the effort seems placed on Bayes Rule, which, from the outside (where I sit) looks a lot like bookkeeping. The rule is fine, but its efficacy rests on a presupposed solution to the hard problem. And it looks as if Bayesians worry more about how to navigate the space (on the updating procedure) given its structure rather than on what the space looks like (its algebraic structure and the priors on it). So, though Bayes and Chomsky in Aspects look completely compatible, what they see as the central problems to be solved look (or seem to look) entirely different.
What happened to the Aspects model? In the late 70s and early 80s, Chomsky came to replace this “acquisition model” with a P&P model. Why did he do this and how are they different? Let’s consider these questions in turn.
Chomsky came to believe that the Aspects approach was not feasible. In other words, he despaired of finding a formal simplicity metric that would so order the space of grammars as required. Not that he didn’t try. Chomsky discusses various attempts in §7, including ordering grammars in accord with the number of symbols they use to express their rules (42). However, it proved to be very hard (indeed impossible) to come up with a general formal way of so ordering Gs.
So, in place of general formal evaluation metrics, Chomsky proposed P&P systems in which the class of available grammars is finitely specified by substantive two-valued parameters. P&P parameters are not formally interesting. In fact, there have been no general theories (not even failed ones) of what a possible parameter is (a point made by Gert Webelhuth in this thesis, and subsequently). In this sense, P&P approaches to acquisition are less theoretically ambitious than earlier theories based on evaluation measures. In effect, Chomsky gave up on the Aspects model because it proved to be hard to give a general definition of “generative grammar” that served to order the infinite variety of Gs according to some general (symbol counting) metric. So, in place of this, he proposed that all Gs have the same general formal structure and only differ in a finite number of empirically pre-specified ways. On this revised picture, Gs as a whole are no longer formally simpler than one another. They are just parametrically different. Thus, in place of an overall simplicity measure, P&P theories concentrate on the markedness values of the specific parameter values, some values being more highly valued than others.
Let me elaborate a little. The P&P “vision” as illustrated by GB (as one example) is that Gs all come with a pre-specified set of rules (binding, case marking, control, movement, etc.). Languages differ, but not in the complexity of these rules. Take movement rules as an example. They are all of the form ‘move alpha’ with the value of alpha varying across languages. This very simple rule has no structural description and no structural change, unlike earlier rules like passive, or raising or relativization. In fact, the GB conceit was that rules like Passive did not really exist! Constructions gave way to congeries of simple rules with no interesting formal structure. As such there was little for an evaluation measure to do. There remains a big role for markedness theory (which parameter values are preferred over others (i.e. priors in Bayes speak)), but these do not seem to have much interesting formal structure.
Let me put this one more way: the role of the evaluation metric in Aspects was to formally order the relevant Gs by making some rule formats more unnatural than others. As rules become simpler and simpler, counting symbols to differentiate them becomes less and less useful. A P&P theory need not order Gs, as it specifies the formal structure of every G: each is a vector with certain values for the open parameters. The values may be ranked, but formally, the rules in different Gs look pretty much the same. The problem of acquisition moves from ordering the possible Gs by considering the formally distinct rule types to finding the right values for pre-specified parameters.
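To make the contrast concrete, here is a toy rendering of the P&P picture (parameter names and markedness values are invented): every G has the same formal shape, a vector of binary parameter settings, and what varies is not formal simplicity but which values are set, with markedness playing roughly the role priors would play in Bayes speak.

```python
# Toy P&P sketch: every grammar is the same kind of object, a vector of binary parameters.
# Parameter names and "markedness" penalties are invented for illustration.

PARAMETERS = ["head_initial", "null_subject", "wh_movement_overt"]

# Unmarked (default) values; deviating from them is "marked" (a crude stand-in for priors).
UNMARKED = {"head_initial": 1, "null_subject": 0, "wh_movement_overt": 1}

def markedness(grammar):
    """Count of marked parameter settings; lower = more highly valued."""
    return sum(grammar[p] != UNMARKED[p] for p in PARAMETERS)

# Two "languages": formally identical objects, differing only in parameter values.
english_like = {"head_initial": 1, "null_subject": 0, "wh_movement_overt": 1}
italian_like = {"head_initial": 1, "null_subject": 1, "wh_movement_overt": 1}

print(markedness(english_like))  # 0: all values unmarked
print(markedness(italian_like))  # 1: one marked value
```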
As it turns out, even though P&P models are feasible in the required formal sense, they still have problems. In particular, setting parameters incrementally has proven to be a non-trivial task (as people like Dresher and Fodor & Sakas have shown) largely because the parameters proposed are not independent of one another. However, this is not the place to rehearse this point. What is of interest here is why evaluation metrics gave way to P&P models, namely that it proved to be impossible to find general evaluation measures to order the set of possible Gs and hence impossible to specify (v) above and thus attain explanatory adequacy.
Let me end here for now (I really want to return to these issues later on). The Aspects model outlines a theory of acquisition in which a formal ordering of Gs is a central feature. With such a theory, the space of possible Gs can be infinite, acquisition amounting to going up the simplicity ladder looking for the simplest G compatible with the PLD. The P&P model largely abandoned this vision, construing acquisition instead as setting the values of a finite number of fixed parameters (some values being “better” than others (i.e. unmarked)). The ordering of all possible Gs gave way to a pre-specification of the formal structure of all Gs. Both stories are compatible with Bayesian approaches. The problem is not their compatibility, but what going Bayes adds. It’s my impression that Bayesians as a practical matter slight the concerns that both Aspects style models and P&P models concentrate on. This is not a matter of principle, for any Bayesian story needs what Chomsky has emphasized is both central and required. What is less clear, at least to me, is what we really learn from models that concentrate more on Bayes rule than the structures that the rule is updating. Enlighten me.
It’s worth emphasizing that what is offered here is not an actual learning theory, but an idealized one. See his notes 19 and 22 for further discussion.
 Chomsky suggests the convention that lower valued Gs are associated with higher numbers.
One thing I’ve noticed, though, is that many Bayesians seem reluctant to conclude that this information about the hypothesis space and the decision rule is innately specified. I have never understood this (so maybe if someone out there thinks this they might drop a comment explaining it). It always seemed to me that were they not part of the LAD then we would have no explanation of acquisition. At any rate, Chomsky did take (i-v) to specify innate features of the LAD that were necessary for acquisition.
 With the choice being final after some period of time t.
 See especially note 22 where Chomsky says:
What is required of a significant linguistic theory…is that given primary linguistic data D, the class of grammars compatible with D be sufficiently scattered, in terms of value, so that the intersection of the class of grammars compatible with D and the class of grammars which are highly valued be reasonably small. Only then can language learning actually take place.
 See OJB discussed here. It is worth noting that many Bayesians take Bayesian updating over the full parameter space to be a central characteristic of the Bayes perspective. Here again is OJB:
It is a central characteristic of fully Bayesian models that they represent the full state space (i.e. the full joint probability distribution across all parameters) (1171).
It is worth noting, perhaps, that a good part of what makes Bayesian modeling “rational” (aka “optimal,” its key purported virtue) is that it considers all the consequences of all the evidence. One can truncate this so that only some of the consequences of only some of the evidence are relevant, but then it is less clear what makes the evaluations “rational/optimal.” Not that there aren’t attempts to truncate the computation citing resource constraints and evaluating optimality wrt these constraints. However, this has the tendency of being a mug’s game, as it is always possible to add just enough to get the result that you want, whatever these happen to be. See Glymour here. However, this is not the place to go into these concerns. Hopefully, I can return to them sometime later.
Indeed, many of the papers I’ve seen try to abstract from the contributions of the priors. Why? Because sufficient data washes out any priors (so long as they are not set to 0 or 1, a standard assumption in the modeling literature precisely to allow the contribution of priors to be effectively ignored). So, the papers I’ve seen say little of a general nature about the hypothesis space and little about the possible priors (viz. what I have been calling the hard problem).
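For what it’s worth, the washing-out point is easy to illustrate. Here is a toy sketch with made-up numbers: two candidate Gs, non-extreme priors, and repeated updating on data that favors one of them.

```python
def update(prior_g1, likelihood_g1, likelihood_g2, n_data):
    """Sequentially update P(G1) vs P(G2) on n_data i.i.d. data points."""
    p1, p2 = prior_g1, 1.0 - prior_g1
    for _ in range(n_data):
        p1, p2 = p1 * likelihood_g1, p2 * likelihood_g2
        total = p1 + p2
        p1, p2 = p1 / total, p2 / total   # renormalize after each datum
    return p1

# G2 fits each datum better (0.6 vs 0.4); even a heavy prior on G1 gets swamped.
for n in (0, 10, 50, 200):
    print(n, round(update(prior_g1=0.95, likelihood_g1=0.4, likelihood_g2=0.6, n_data=n), 4))
# A prior of exactly 0 or 1, by contrast, never moves, which is why such priors are avoided.
```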
See for example the Perfors et al. paper (here). The space of options is pretty trivial (5 possible grammars: three regular and two PCFGs) and it is hand-coded in. It is not hard to imagine a more realistic problem: say, including all possible PCFGs. Then the choice of the right one becomes a lot more challenging. In other words, seen as an acquisition model, this one is very much a toy system.
Chomsky emphasizes that the notion of simplicity is proprietary to Gs; it is not some general notion of parsimony (see p. 38). It would be interesting to consider how this fits with current Minimalist invocations of simplicity, but I won’t do so, or at least not now, not here.
 This model was more fully developed in Sound Patterns and earlier in The morphophonemics of modern Hebrew. It also plays a role in Syntactic Structures and the arguments for a transformational approach to the auxiliary system. Lasnik (here) has some discussion of this. I hope to write something up on this in the near future (I hope!).
Which is precisely what makes parameters internal to FL minimalistically challenging.
This is not quite right: a G that has alpha = any category is more highly valued than one that limits alpha’s reach to, e.g., just DPs. Similarly for things like the head parameter. Simplicity is then a matter of specifying more or less generally the domain of the rule. Context specifications, which played a large part in the earlier theory, however, are no longer relevant given such a slimmed-down rule structure. So the move to simple rules does not entirely eliminate considerations of formal simplicity, but it truncates it quite a bit.
The Dresher-Fodor/Sakas problem is a problem that arises on the (very reasonable and almost certainly correct) assumption that Gs are acquired incrementally. The problem is that unless the parameter values are independent, no parameter setting is fixed unless all the data is in. P&P models abstract away from real time learning. So too with Aspects style models. They were not intended as models of real time learning. Halle and Chomsky note this on p. 331 of Sound Patterns, where they describe acquisition as an “instantaneous process.” When Chomsky concludes that evaluation measures are not feasible, he abstracts away from the incrementality issues that Dresher-Fodor/Sakas zero in on.
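A toy illustration of the non-independence worry (entirely made up): suppose a datum is licensed by a grammar just in case its two binary parameters disagree. Then the datum is compatible with (1,0) and with (0,1), so neither parameter can be set on its own, and an incremental learner that commits to one value before knowing how the other will be set can be led astray.

```python
from itertools import product

# Toy setup: four candidate grammars, each a pair of binary parameter values.
grammars = list(product([0, 1], repeat=2))

# Invented compatibility condition: the datum is licensed iff the two parameters disagree.
def compatible_with_datum(g):
    p1, p2 = g
    return p1 != p2

survivors = [g for g in grammars if compatible_with_datum(g)]
print(survivors)  # [(0, 1), (1, 0)]

# For each parameter, both values survive, so the datum fixes neither parameter by itself.
for i in range(2):
    print(f"parameter {i+1} values still possible:", sorted({g[i] for g in survivors}))
```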