There are two kinds of questions linguists would like to
address: (1) Why do we see some kinds of Gs and never see others and (2) Why do
kids acquire the particular Gs that they do. GG takes it that the answer to (2)
is usefully informed by an answer to (1). One reason for thinking this is that
both questions have a similar structure. Kids are exposed to products of a G
and on the basis of these products they must infer the structure of the G that
produces them. In other words, from a finite set of examples, a Language
Acquisition Device (LAD) must infer the *correct* underlying function, G, that generates these examples. What does ‘correct’ mean? That G is correct which not only covers the finite set of given examples, but also correctly predicts the properties of the unbounded number of linguistic objects that might be encountered. In other words, the “right” G is one that correctly projects all possible unseen data from exposure to the limited input data.[1] GG calls the input examples the ‘primary linguistic data’ (PLD), and contrasts this with ‘linguistic data’ (LD), which comprises the full range of possible linguistic expressions of a given language L (e.g. ‘Who did John see’ is an example of PLD, ‘*Who did John see a man who likes’ is an example of LD). The correct G is that G which covers the PLD *and* also covers all the non-observed LD. As LD is in effect infinite, and PLD is necessarily finite, there’s a lot of unseen stuff that G needs to cover.[2]

This very general characterization, let’s call it the
Projection Problem (PrP), can cover both (1) and (2) above. Indeed, the
standard PoS argument is based on a specific characterization of PrP. How so?

First, a standard PoS argument gives the following characterization
of the PLD. It consists of well-formed, “simple,” sound/meaning (SM) pairs
generated from a single G. In other words, the data used to infer the right G is
“perfect” (i.e. no noise to speak of) but circumscribed (i.e. only “simple”
data (see here
for some discussion)).[3]
Second, it assumes that the data is abundant. Indeed, it is counterfactually
presumed that the PLD is presented “all at once,” rather than in smaller incremental
chunks.[4]
In short, the PoS makes two important assumptions about the PLD: (i) it is
restricted to “simple” data, (ii) it is noiseless, homogeneous, and abundant
(i.e. there is no room for variance as there would be were the data presented
incrementally in smaller bits). Last, the LAD is also assumed to be “perfect”
in having no problem in accurately coding the information the PLD contains and
no problems computing its structure and relating it to the G that generated it.
This idealization eliminates another source of potential noise. Thus, the quality
of the data wrt input and intake, is assumed to be flawless.

Given these (clearly idealized) assumptions the PoS question
is how does the LAD go from PLD/LAD so described to a G able to generate the
full range of data (i.e. both simple *and complex*)? The idealization isolates the core of the PoS argument: the choice of the “correct” G is massively underdetermined by the PLD even if we assume that the PLD is of immaculate quality. The standard PoS conclusion is that the only way to explain why some kinds of Gs are unattested is to assume that some (logically possible) inductions from PLD to G are formally illicit. That’s the Projection Problem as it relates to (1). UG (i.e. formal restrictions on the set of admissible Gs) is the proposed answer.

Next step: assume now that we have a fully developed theory
of UG. In other words, let’s assume that we have completely limned the borders
of *possible* Gs. We are still left with question (2). How does the LAD acquire the specific G that it does? How does the LAD use the PLD to select one among the many possible Gs? Note that it appears (at least at first blush) that restricting our attention to selecting the specific G compatible with the given PLD from among the *grammatically* possible Gs (rather than from all the *logically* possible Gs) simplifies the problem. There are a whole lot of Gs that the LAD need never consider precisely because they are grammatically impossible. And it is conceivable that finding the right G among the grammatically admissible ones requires little more than matching PLD to Gs. So, one possible interpretation of the original Chomsky program is that once UG is fixed, acquisition reduces to simple learning (e.g. once the UG principles are specified, acquisition is little more than standard matching of data to Gs). On this view, UG so restricts the class of accessible Gs that using PLD to search for the right G is relatively trivial.

There is another possibility, however. Even with the *invariant* principles fixed (i.e. even once we have specified the *impossible* (kinds of) Gs), the PLD is still too insubstantial to select the right G (i.e. the PLD still underdetermines choice of the right G). On this second scenario, additional machinery (perhaps some of it domain specific) is required to navigate the remaining space of possible grammatical options. Or, to put it another way: fixing the invariant principles of UG does not suffice to uniquely select a G given PLD.

There is reason to think that Chomsky, at least in *Aspects*, took door number 2 above.[5] In other words, “since the earliest days of generative grammar” (as Chomsky likes to say), it has been assumed that a usable acquisition model will likely need *both* a way of eliminating the impossible Gs and another (perhaps related, perhaps not) set of principles to guide the LAD to its actual G.[6] So, in addition to invariant principles of UG, GG also deployed markedness principles (i.e. “priors”) to play a hefty explanatory role. So, for example, say the principles of UG delimit the borders of the hypothesis space, Gs within the borders being possible. Acquisition theory (most likely) still requires that the Gs within the borders have some kind of preferential ordering, with some Gs better than others.

To repeat, this is roughly the *Aspects* view of the world and it is one that fits well with the Bayes conception where, in addition to a specification of the hypotheses entertained, some are endowed with higher priors than others. P&P models endorse a similar conception as some parameters, the unmarked ones, are treated as more equal than others. Thus, while the invariant principles and open parameters delimit the space of G options, markedness theory (or the evaluation metric) is responsible for getting an LAD to specific parameter values on the basis of the available PLD.

This division of labor seems reasonable, but is not
apodictic. There is a trading relation
between specifying high priors and delimiting the hypothesis space. Indeed,
saying that some option is *impossible* amounts to setting the prior for this option to 0 and saying that it is necessary amounts to setting the prior to 1. Moreover, given our current state of knowledge, it is unclear what the difference is between assuming that something is *impossible* given PLD versus saying that it is very *improbable*. However, it is not unreasonable, IMO, to divide the problem up as above as several kinds of things really do seem unattested while other things though possible are not required.
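
The trading relation between priors and the hypothesis space can be illustrated with a toy Bayesian update. What follows is only a sketch under invented assumptions: the three grammar names, their priors and their per-observation likelihoods are hypothetical numbers chosen for illustration, not values from any acquisition model. It shows one formal difference between "impossible" and "very improbable": a prior of 0 can never be revised by data, while a small prior can be overwhelmed by enough of it.

```python
def posterior(priors, likelihoods, n):
    """Posterior over hypotheses after n independent observations.

    priors[h]      : prior probability of hypothesis h
    likelihoods[h] : probability h assigns to each single observation
    """
    unnorm = {h: priors[h] * likelihoods[h] ** n for h in priors}
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}


# Hypothetical toy grammars (all numbers invented for illustration):
# G_impossible fits the data best but has prior 0,
# G_marked fits well but is dispreferred (small prior),
# G_unmarked is preferred a priori but fits a little worse.
likelihoods = {"G_unmarked": 0.5, "G_marked": 0.6, "G_impossible": 0.9}
priors = {"G_unmarked": 0.99, "G_marked": 0.01, "G_impossible": 0.0}

for n in (0, 10, 50):
    post = posterior(priors, likelihoods, n)
    print(n, {h: round(p, 3) for h, p in post.items()})
```

With little data the unmarked option stays in the lead; with more data the marked option takes over; the option with prior 0 never gets off the ground however well it fits.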
With this as background, I want to now turn to a kind of PoS
argument that builds on (steals from?) a terrific paper that I’ve recently read
by Gigerenzer and Brighton (G&B) (here)
and that I have been recommending to all and any in my general vicinity in the
last week.

G&B discuss the role of biases in inductive learning. The
discussion is under the rubric of heuristics. They note that biases/heuristics
have commonly been motivated on grounds of reducing computational complexity. As
noted several times before in other posts (e.g. here),
many inductive theories are computationally intensive if implemented directly.
In fact, so intensive as to be intractable.
I’ve mentioned this wrt Bayesian models and several commentators noted (here)
that there are reasons to hope that these problems can be finessed using
various well-known (in the sense of well-known to those in the know, i.e. not
to me) statistical sampling methods/algorithms. These methods can be used to
approximate the kinds of solutions the computationally intractable direct
Bayesian methods would produce were they tractable. Let’s call these methods
“heuristics.” If correct, this constitutes one good cognitive argument for
heuristics; they reduce the computational complexity of a problem making its
solution tractable. As G&B note, on
this conception, heuristics (and the biases they incorporate) are the price one
has to pay for tractability. Or; though it would be best to do the obvious
calculation, such calculations are sadly intractable and so we use heuristics
to get the calculations done even though this sacrifices (or might sacrifice)
some accuracy for tractability. They call this the accuracy-effort tradeoff (AET).
As G&B put it:

If you invest less effort the cost
is lower accuracy. Effort refers to searching for more information, performing
more computation, or taking more time; in fact these typically go together.
Heuristics allow for fast and frugal decisions; thus, it is commonly assumed
that they are second best approximations of more complex “optimal” computations
and serve the purpose of trading off accuracy for effort. If information were
free and humans had eternal time, so the argument goes, more information and
computation would always be better (109).

G&B note that this is the common attitude towards
heuristics/biases.[7] They exist to make the job doable. And though
G&B agree that this *might* be one reason for them, they think that it is not the most important benefit that heuristics/biases provide. So what is? G&B highlight a second consideration, what they call the “bias-variance dilemma” (BVD). They describe it as follows:[8]

… achieving a good fit to
observations does not necessarily mean we have found a good model, and choosing
a model with the best fit is likely to result in poor predictions…(118).

Why? Because

…bias is only one source of error
impacting on the accuracy of model predictions. The second source is variance,
which occurs when making inferences from finite samples of noisy data. (119).

In other words, a potentially very serious problem is
“overfitting,” a problem that flexible models standardly enjoy. In G&B’s
words:

The more flexible the model, the
more likely it is to capture not only the underlying pattern but unsystematic
patterns such as noise…[V]ariance reflects the sensitivity of the induction
algorithm to the specific contents of samples, which means that for different
samples of the environment, potentially very different models are being
induced. [In such circumstances NH] a
biased model can lead to more accurate predictions than an unbiased model.
(119)

Hence the dilemma: To best cover the *input* data set, the “model must accommodate a rich class of patterns in order to insure low bias.” But “[t]he price is an increase in variance, as the model will have greater flexibility, this will enable it to accommodate not only systematic patterns but also accidental patterns such as noise” (119-120). Thus a better fit to the input may have deleterious effects on predicting future data. Hence the BVD:

Combating high bias requires using
a rich class of models, while combating high variance requires placing
restrictions on this class of models. We cannot remain agnostic and do both
unless we are willing to make a bet on what patterns will occur. This is why
“general purpose” models tend to be poor predictors of the future when data are
sparse (120).

And the moral G&B draw?

The bias-variance dilemma shows
formally why a mind can be better off with an adaptive toolbox of biased
specialized heuristics. A single, general-purpose tool with many adjustable
parameters is likely to be unstable and incur greater prediction error as a
result of high variance. (120)
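
The dilemma can be made concrete with a small simulation. This is a sketch under assumptions that are not taken from G&B: the "world" is the function x², the biased learner can only fit a straight line, and the flexible learner is a nearest-neighbor (exemplar-style) model that reproduces whatever it has seen. All function names are hypothetical. Error is measured against the true function on unseen points, averaged over many resampled data sets.

```python
import random


def true_f(x):
    # The hypothetical "world" the learners are trying to predict.
    return x * x


def sample(n, noise, rng):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    # Biased learner: least-squares straight line (cannot represent x**2).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    # Flexible learner: predict the value stored with the nearest exemplar.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mean_error(fit, n, noise, trials=50, seed=0):
    # Average squared distance from the true function over resampled data.
    rng = random.Random(seed)
    grid = [i / 50 for i in range(51)]
    total = 0.0
    for _ in range(trials):
        model = fit(*sample(n, noise, rng))
        total += sum((model(x) - true_f(x)) ** 2 for x in grid) / len(grid)
    return total / trials


for label, n, noise in [("sparse, noisy", 10, 0.5), ("rich, clean", 200, 0.02)]:
    line_err = mean_error(fit_line, n, noise)
    near_err = mean_error(fit_nearest, n, noise)
    print(f"{label}: line={line_err:.4f} nearest={near_err:.4f}")
```

In this toy setting the biased line predicts better when the data are sparse and noisy, while the flexible learner predicts better when data are plentiful and clean. Which regime a given acquisition problem resembles is the empirical question.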

What consequences might the BVD have for work on language?
Well, note first of all that it provides the template for an additional kind of
PoS argument. In contrast to the standard one reviewed above, this one holds
when we relax the standard idealizations reviewed above; in particular, the
assumption that the PLD is noise free and that it is provided all-at-once. We know that these assumptions are false.
What the BVD suggests is that when these are relaxed we potentially encounter
another kind of inductive problem in which biases can be empirically very
useful. I say “suggests” rather than “shows” because as G&B demonstrate
quite nicely, whether the problem is a real one, depends on how sparse and
noisy the relevant PLD is.

The severity of the BVD problem in linguistics will likely depend
on the particular linguistic case being studied. So for example, work by
Gleitman, Trueswell and friends (discussed here,
here,
here)
suggests that at least early word learning occurs in very noisy data sparse
environments. This is just the kind that G&B point to as favoring shallow
non-intensive data analysis. The procedure that Gleitman, Trueswell and friends
argue for seems to fit well into this picture.

I’m no expert in the language acquisition literature, but
from what I’ve seen, the scenarios that G&B argue promote BVDs are rife in
the wild. It sure looks like many people converge to (very close to) the same G
despite plausibly having very different individual inputs (isn’t this the basis
for the overwhelming temptation to reify languages? Believe me my Polish parent English PLD was
quite a bit different from that of my Montreal peers and we ended up sounding
and speaking very much the same). If so, the kinds of biased systems that GG is
very comfortable with will be just what G&B ordered. However, whether this always holds or even whether it ever
holds is really an empirical question.[9]

G&B contrasts heuristic systems with more standard
models, including Bayesian models, exemplar models, multiple regression models
etc. that embody Carnap’s “principle of total evidence” (110). From what
G&B say (and I have sort of confirmed by doing econometrician on the campus
interviews), it appears that most of the current favored approaches to rational
decision making embody this principle, at least as an ideal. As a favorite
conceit is to assume that cognitively speaking, humans are very rational,
indeed optimal decision makers, Carnap’s principle is embodied in most of the
common approaches (indeed Bayesians love to highlight this). Theories that
embody Carnap’s principle understand “rational decision making as the process
of weighing and adding all information” up to computational tractability. The phenomena that G&B isolates (what the
paper dubs “less is more” effects) challenge this vision. These effects,
G&B argues, illustrate that it’s just false that more is always better *even in the absence of computational constraints*. Rather, in some circumstances, the ones that G&B identifies, shallow and blinkered is the way to go. And if this is correct, then the empirical questions will have to be settled on a case by case basis, sometimes favoring total evidence based models and sometimes not. Further, if this is correct (and if Bayesian models are species of total evidence models) then whether a Bayesian approach is apposite in a given cognitive context becomes an *empirical* question, the answer depending on how well behaved the data samples are.

Third, it would not be surprising (at least to me) were
there two (or at least two) kinds of native FL biases, corresponding to the two
kinds of PoS arguments discussed above.
It is possible that the biases motivated via the classical PoS argument
(the invariances that circumscribe the class of possible Gs) alone suffice to lead
the LAD to its specific G. However, this clearly need not be so. Nor is it obvious (again at least to me) that
the principles that operate within the circumscribed class of grammatically
possible grammars would operate as well within the wider class of logically
possible ones. Indeed, when specific
examples are considered (e.g. ECP effects, island effects, binding effects) the
case for the two-prong attack on the PoS problem seems reasonable. In short,
there are two different kinds of PoS problems invoking different kinds of
mechanisms.

G&B ends with a description of two epistemological scenarios
and the worlds where they make sense.[10]
Let me recap them, comment very briefly and end.

The first option has a mind with no biases “with an
infinitely flexible system of abstract representations.” This massive
malleability allows the mind to “reproduce perfectly” “whatever structure the
world has.” This mind works best with “large samples of observations” drawn
from world that is “relatively stable.” Because such a mind “must choose from
an infinite space of representations, it is likely to require resource
intensive cognitive processing.” G&B believes that exemplar models and
neural networks are excellent models for this sort of mind. (136)

The second mind makes inferences “quickly from a few
observations.” The world it lives in changes in unforeseen ways and the data it
has access to is sparse and noisy. To overcome this it uses different specialized
biases that can “help to reduce the estimation error.” This mind need not have
“knowledge of all relevant options, consequences and probabilities both now and
in the future” and it “relies on several inference tools rather than a single
universal tool.” Last, in this second scenario intensive processing is not
required nor favored. Rather minds come
packed with specialized heuristics able to offset the problems that small noisy
data brings with it.

You probably know where I am about to go. The first kind of
mind seems more than just a tad familiar from the Empiricist literature.
“Infinitely flexible” minds that “reproduce perfectly” “whatever structure the
world has” sound like the perfect wax tablets waiting to faithfully receive the
contours that the world via “large sample of observations” is ready to
structure it with. The second with its biases and specialized heuristics has a
definite Rationalist flavor. Such minds contain domain specific operations.
Sound familiar?

What G&B adds to the standard Empiricism-Rationalism
discussion is not these two conceptions of different minds, but the kinds of
advantages we can expect from each given the nature of the input and the
“worlds” that produce it. When a world is well behaved, G&B observes, minds
can be lightly structured and wait for the environment to do its work. When it
is a blooming buzzing confusion, bias really helps.

There is a lot more in the G&B paper. I found it one of
the more stimulating and thought provoking things I’ve read in the last several
years. If G&B is correct, the BVD is rich in consequences for language
acquisition models that begin to loosen the idealizations characteristic of
Plato’s Problem ruminations. Most interestingly, at least to me, coarsening the
idealization adds new reasons for assuming that biological systems come packed
with rich innately structured minds. In the right circumstances, they don’t
only relieve computational burdens, they allow for good inference, indeed
better inference than a mind that more carefully tracks the world and
intensively computes the consequences of this careful tracking. Interesting,
very interesting. Take a look.

[1]
This problem goes back to the very beginning of GG, see Stanley Peters’ paper “The
Projection Problem: How is a grammar to be selected” in *Goals of Linguistic Theory*. As he noted in his paper, this problem is closely tied to the question of Explanatory Adequacy. The logic outlined above is very clearly articulated in Peters’ paper. He describes the projection problem as the “problem of providing a general scheme which specifies the grammar (or grammars) that can be provided by a human upon exposure to a possible set of basic data” (172).

[2]
Note that the projection problem can hold for finite sets as well. The issue is
how to select the function that covers the unobserved on the basis of the
observed (i.e. how to generalize from a small sample to a larger one). How does
a system “project” to the unobserved data based on the observed sample. The
infinity assumption allows for a clear example of the logic of projection. It
is not a necessary feature.

[3]
Peters also zeros in on the idea that PLD is “simple.” As he puts it: “as has
often been remarked, one rarely hears a fully grammatical sentence of any
complexity…One strategy open to him [the LAD, NH] is to put the greatest
confidence in short utterances, which are likely to be less complex than longer
ones and thus more likely to be grammatical” (175).

[4]
As noted here, this assumption, quite explicit in *Aspects*, is known to be a radical idealization. *However*, this does not indicate that it has baleful consequences. It does seem that kids in the same linguistic environment come to acquire very similar competences (no doubt the source of our view that languages exist). This despite the reasonable conjecture that they are not exposed to (or do not intake) exactly the same (kinds of) sentences in the same order. This suggests that order of presentation is not that critical, and this follows from the all-at-once idealization. That said, I return to this assumption below. For some useful discussion see Peters where the idealization is defended (p.175).

[5]
Again, see Peters for illuminating discussion.

[6]
This overstates the case. The evaluation measure did not tell the LAD how to *construct* a G given PLD. Rather it specified how to *order* Gs as better or worse given PLD. In other words, it specifies how to rank two *given* Gs. Specifying how to actually build these was considered (and probably still is) too ambitious a goal.

[7]
I suspect that the general disdain for priors in Bayesian accounts is the
belief that they do not fundamentally alter the acquisition scenario. What I
mean by this is that though they may accelerate or impede the rate at which one
gets to the best result, over enough time the data will overwhelm the priors so
that even if one starts, as it were, in the wrong place in the hypothesis
space, the optimal solution will be attained. So priors may affect computation
and the rate of convergence to the optimum but they cannot fundamentally alter
the destination.

[8]
By “fit” here G&B mean fit with the input data sets.

[9]
So Jeff Lidz noted that perhaps all LADs enjoy a good number of rich learning
encounters where sufficient amounts of the same good data are used. In other words, though the data overall might
stink, there are reliable instances where the data is robust and these are
where the acquisition action takes place.
This is indeed possible, it seems to me, and this is what makes the BVD
problem an empirical, rather than a conceptual, one.

[10]
There are actually three, but I ignore the first as it has little real
interest.

I'll have to read the G&B paper, but I think the description of the bias-variance dilemma sketched here is not the way it's usually thought of. The bias-variance dilemma or trade-off is a computational level property (in Marr terms) about sets of models, not about heuristics, approximations or other properties of algorithms. (There are lots of good explanations of the bias-variance dilemma, such as in Wikipedia or the (free) book Elements of Statistical Learning).

The bias-variance dilemma, as usually formulated, assumes that the "true model" we're trying to learn lies outside the set of models that the learner can formulate, so all the learner can hope to do is approximate the true model to a greater or lesser degree. The distance between the "true model" and the best model in the set of models available to the learner is called the "bias" of the learner.

In addition to bias, there's another kind of error that learners can suffer from. The "variance" in a learner arises from the fact that the learner only sees a finite amount of noisy data, and this "observation noise" may cause the learner to make incorrect generalisations, i.e., there's high variance in its model estimates.

In the common situation in statistics and machine learning there's a trade-off between bias and variance. If we try very hard to reduce the bias by making the set of possible models very large, e.g., with a huge number of parameters, then we increase the likelihood that one or more of those parameters will be incorrectly set, i.e., increase the variance.

But I'm not sure that the bias-variance trade-off is applicable in human language acquisition. I'm pretty sure a human child doesn't select the class of possible grammars to trade-off bias and variance the way a statistician might in a generic machine-learning problem. I think the human child has a set of possible grammars available to it and the true grammar is in that set, i.e., the bias is zero.

So the real question is: how large is the class of possible grammars available to the child? There are several ways to approach this. The conventional approach from Aspects on (which I think is quite reasonable) is to study cross-linguistic variation -- the grammars of actual human languages provide a lower bound on the set of possible grammars.

But I think we can also try to identify the class of possible grammars by identifying sets of grammars from which learning algorithms succeed in learning when given data that is plausibly available to a child. I admit we're still a long way from having learning algorithms that can learn anything remotely as complex as a human language, of course, but we do have models of the acquisition of things like the lexicon and rudiments of syntax.

I'd love to show that a GB linguistic universal is crucial for making a learning algorithm work. Such a result might take the following form: a learning procedure P fails to learn when given data D, but procedure P+U succeeds when given the same data D. ("+U" might mean that universal U is incorporated into P's prior, or that the set of models P considers is restricted to those compatible with U).

I am pretty sure that I reported the G&B claims more or less correctly and I am pretty sure that they call this the Bias-Variance Dilemma. They discuss this in the context of heuristics as I described. I would love it if you looked at the paper. It might be they used a common term unconventionally. There is also a technical paper they base their work on. The paper is by Geman, Bienenstock and Doursat (1992) and it's in the bibliography. I've peeked at it and it is something I would need some hand-holding to fully (partially!) understand. But please take a look and let us know.

The point that G&B makes seems plausible to me and they discuss many cases where they think it applies. They do mention one language case. So take a look. I'd love to know what you think.

The Geman et al paper popularised the bias-variance trade-off in machine learning.

The bias-variance tradeoff is a pretty standard part of statistics, but they are using it unconventionally. Basically there are two different fields, bounded rationality and statistical learning theory (or how to avoid overfitting), and they are making some links -- more precisely they are psychologists interested in bounded rationality who are making claims that using heuristics prevents overfitting.

Let me translate some of this into terms that might be more familiar.

In NLP and machine learning we sometimes talk about model errors and search errors.

So you have some model M, and you look for some solution that e.g. maximizes some objective function, and you have some search function that looks for it. E.g. M is a Bayesian model, and you want to find the MAP estimate E, and because the state space is very large or infinite and non-convex you can't exhaustively search it, so you have some approximate algorithm S that searches and returns a guess at E which we can call G.

So there are two ways that G can be bad/wrong -- it might be that the search algorithm performs badly and that G is far from E, and that E would work really well but you didn't find it and G doesn't. So this is a search error.

Or it could be that G=E (or is very close) but it might be bad because the model M is wrong. That is a model error: you found the best solution, but it wasn't good enough.

So the argument in the paper seems to be that search errors can compensate for model errors. I.e. you have a model where E is in fact worse than G.

i.e. that humans have really bad models, and really bad search algorithms but they interact to work quite well.

Why? Because E overfits. e.g. if the model isn't regularised.

So if the model is say the set of all PCFGs, then E will be the grammar H which maximizes P(D|H) which will just memorize the training data and not generalise.

But if we have a heuristic that only searches for small models, then that may compensate for this.

One way of formalizing this heuristic might be to define a function P on the space of hypotheses, and we could only search through the hypotheses where P is high, and then find some way of trading off P against the goodness of fit of a hypothesis -- which we could measure using, say, P(D|H).

Indeed we could maybe just multiply P(H) and P(D|H) to get some objective function. Maybe we could even call the function P a *prior*.

But then we are just back at Bayes, and a good heuristic would just be exactly the sort of stochastic approximation algorithms (MCMC) that Bayesians use.
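
The contrast between maximizing P(D|H) alone and maximizing P(H)·P(D|H) can be sketched with a toy example. Everything here is a hypothetical illustration: the two "grammars" are just sets of strings with a uniform distribution over their members, and the prior is an unnormalised description-length penalty (2 to the minus the length of a written description), which is one common way of preferring small hypotheses but is an assumption of this sketch, not something specified in the comment above.

```python
import math
from itertools import product

# Observed strings (tokens); six distinct types, two repeats.
data = ["ab", "aab", "abb", "aabb", "aaab", "abbb", "ab", "aab"]

# Hypothetical hypotheses: name -> (written description, language).
memo_lang = set(data)  # memorises exactly what was seen
rule_lang = {"a" * i + "b" * j for i, j in product(range(1, 4), repeat=2)}
hypotheses = {
    "memorise": ("|".join(sorted(memo_lang)), memo_lang),
    "rule": ("a{1,3}b{1,3}", rule_lang),
}


def log_likelihood(lang, observed):
    # Uniform distribution over the language; zero probability outside it.
    if any(s not in lang for s in observed):
        return float("-inf")
    return -len(observed) * math.log(len(lang))


def log_prior(description):
    # Assumed prior: unnormalised 2 ** (-description length).
    return -len(description) * math.log(2)


def ml_score(name):
    return log_likelihood(hypotheses[name][1], data)


def map_score(name):
    description, lang = hypotheses[name]
    return log_prior(description) + log_likelihood(lang, data)


ml_choice = max(hypotheses, key=ml_score)
map_choice = max(hypotheses, key=map_score)
print("maximum likelihood picks:", ml_choice)
print("prior times likelihood picks:", map_choice)
```

On this data the likelihood-only objective prefers the hypothesis that memorises the sample, while the objective that also weighs the prior prefers the shorter rule that generalises to unseen strings.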

The difference is I suppose that the prior would be defined algorithmically.

I am just back from a learning theory conference (ALT) where the problem of overfitting (in the form of empirical process theory nowadays -- empirical Rademacher complexity etc.) is a core concern. Bias-variance is a bit *vieux jeu* but the more general point is still a live issue.

This is not to criticize bounded rationality theory in decision making, which is very important; and though I haven't seen it applied much in linguistics it surely has a role to play in production and utterance planning if not in language acquisition.

A follow up comment -- so on my interpretation (I didn't read the paper very carefully) a successful example from current ML practice would be what is called "early stopping". So rather than training an algorithm to convergence you stop after a few iterations which sometimes improves the generalisation. Rather than adding a regularisation term and training to convergence which would be the non heuristic approach.
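
A minimal sketch of early stopping, under assumptions of its own: the data are synthetic (a sine curve plus Gaussian noise), the model is a degree-7 polynomial trained by plain gradient descent, and the stopping point is chosen by keeping the weights that do best on a held-out validation set, which is a common variant rather than the "stop after a few iterations" rule of thumb described above. All names are hypothetical, and the sketch makes no claim that early stopping always helps.

```python
import math
import random


def features(x, degree=7):
    return [x ** k for k in range(degree + 1)]


def predict(w, x):
    return sum(wi * fi for wi, fi in zip(w, features(x)))


def mse(w, xs, ys):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(train_xy, val_xy, steps=2000, lr=0.05):
    """Gradient descent on training error, remembering the best validation step."""
    (tx, ty), (vx, vy) = train_xy, val_xy
    w = [0.0] * 8
    best_w, best_val, best_step = list(w), mse(w, vx, vy), 0
    for step in range(1, steps + 1):
        grad = [0.0] * len(w)
        for x, y in zip(tx, ty):
            phi = features(x)
            err = sum(wi * fi for wi, fi in zip(w, phi)) - y
            for k, fk in enumerate(phi):
                grad[k] += 2 * err * fk / len(tx)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        val = mse(w, vx, vy)
        if val < best_val:  # early-stopping candidate
            best_w, best_val, best_step = list(w), val, step
    return w, best_w, best_step


def draw(n, rng, noise=0.3):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [math.sin(3 * x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


rng = random.Random(0)
train_xy, val_xy = draw(10, rng), draw(50, rng)
final_w, best_w, best_step = train(train_xy, val_xy)
print("best validation step:", best_step)
print("validation error at that step:", mse(best_w, *val_xy))
print("validation error at the last step:", mse(final_w, *val_xy))
```

The non-heuristic alternative mentioned in the comment would instead add a regularisation term to the training objective and run to convergence.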

Certainly heuristics like early stopping can be viewed as a kind of regularisation; I think there are "deep learning" papers that make that connection explicitly. But the bias-variance trade-off really isn't an algorithmic issue at all; it has to do with the information present in the data (i.e., even with unbounded computation you still face a bias-variance trade-off). Regularisation and early stopping may help you make a reasonably good bias-variance trade-off, but you're still making one.

But as I tried to explain in my comment, I suspect the bias-variance trade-off is not relevant for language acquisition by humans.

DeleteBut the bias-variance trade-off really isn't an algorithmic issue at all; it has to do with the information present in the dataIn statistics, bias and variance are properties of estimated quantities (e.g., regression parameters, expected values, etc). So the bias-variance tradeoff has to do with the information present in the data

andwhatever model you're dealing with (and, at least for bias, whatever we assume the true data-generating model is, since bias tells us about the difference between an estimated quantity and the corresponding quantity in the true model). If we have data and no model, it's not clear what bias or variance would even mean.Yes, you're right of course -- the bias of an estimator depends on the prior distribution over models assumed by the learner.

My point is that it's not an algorithmic issue -- the bias-variance dilemma can't be "solved" with more computational power or even new algorithms. Machine learning (in the case where the true model is not in the set considered by the learner) will always face a bias-variance dilemma simply because the information present in the data is limited.

I think Mark is certainly right that the BVD is not quite the right tool to think about language acquisition, since we assume that the correct model is in the class of hypotheses (would creolization be a case where this is not true?) and for various other technical reasons (it's not a regression problem -- though BVD has been extended), but I think one can use, as Partha Niyogi argued, the tools of machine learning to get some insight into language acquisition. Though quite which tool is right is a knotty problem.

@mark: I am not sure that we always believe that the right G is in the class of hypotheses if Creolization is a possible problem. Note that every case of G acquisition is actually a case of Creolization, the difference being the divergence of the relevant Gs. No two Gs are identical, so the PLD is never homogeneous. In fact, because the PLD is the product of different Gs, there may not be a single G that produces it. The idealization to the case where there is one is fine for classical PoS arguments, but not for the "realistic" case. The goal of the child then is not to match the G whose products it is exposed to, but to find a G that does well enough, covering some of the PLD and leaving out some. Even this may not be quite right, as what is acquired may not be one G but a family of Gs. Now, if I understand the things Mark and Alex pointed me to (again highly doubtful), this might make BVD issues relevant. The G attained will be a possible G, but it may not be any of the Gs that are generating the PLD. What you are looking for then may well be described as a good enough G from the class of possible Gs that approximates the given Gs. So a question: would this be relevant to how to model the problem?

I haven't thought much about Creolisation yet, so I'll hold off on expressing an opinion. My point is that the BVD is something that a statistician faces when trying to decide what class of models to use to analyse some data in the kind of "theory-free" approach Norbert was complaining about a few weeks ago. E.g., do I fit 1st order or 2nd order polynomials to my earthquake data? I don't know how earthquakes really are generated, but it's unlikely they are either 1st or 2nd order polynomials! I don't see why a child would ever face a similar problem. (I have a similar comment about Amy Perfors' work: yes, Bayesian methods can be used -- in principle at least -- to decide if a sample comes from a finite state or a context free language, but I don't see this as a question that a child learner ever has to answer.)
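
The statistician's version of the dilemma can be simulated in a few lines. The sketch below is only an illustration: it assumes a stand-in "true" process (a sine curve, which is neither a 1st nor a 2nd order polynomial), invented noise and sample sizes, and hypothetical helper names. It repeatedly draws noisy samples, fits polynomials of degree 1 and 2 by least squares, and estimates the squared bias and the variance of the fitted curve at the sampled points.

```python
import math
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (tiny degrees only)."""
    m = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y] for basis 1, x, x^2, ...
    rows = [[sum(x ** (i + j) for x in xs) for j in range(m)]
            + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(m):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][m] / rows[i][i] for i in range(m)]

def predict(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))

def bias_variance(degree, true_f, xs, noise_sd=0.3, trials=300, seed=0):
    """Estimate average squared bias and variance of the fitted curve at xs."""
    rng = random.Random(seed)
    preds = []  # preds[t][i] = prediction at xs[i] in trial t
    for _ in range(trials):
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        coeffs = fit_polynomial(xs, ys, degree)
        preds.append([predict(coeffs, x) for x in xs])
    bias_sq = variance = 0.0
    for i, x in enumerate(xs):
        column = [p[i] for p in preds]
        mean = sum(column) / trials
        bias_sq += (mean - true_f(x)) ** 2
        variance += sum((v - mean) ** 2 for v in column) / trials
    return bias_sq / len(xs), variance / len(xs)

# Assumed stand-in for the unknown process: sin(x) sampled on [0, pi].
xs = [i * math.pi / 14 for i in range(15)]
results = {d: bias_variance(d, math.sin, xs) for d in (1, 2)}
for d, (b, v) in results.items():
    print(f"degree {d}: bias^2 ~ {b:.4f}, variance ~ {v:.4f}")
```

With these made-up settings the straight line has the larger squared bias and the quadratic the larger variance. Neither model class contains the true function, which is exactly the situation in which the trade-off arises.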

Here's a reason to think a child faces a similar problem. Suppose that language is about efficiently communicating meaning representations. Suppose moreover that children are restricted to considering compositional representations; i.e. a meaning representation is decomposed into component parts, and the length of the signal for the whole meaning is the sum of the lengths of the signals for each part. Then, for reasons I've detailed on my blog, natural language assumes that each component occurs independently at random. Of course, meaning components do not occur independently at random: if a meaning representation contains a transitive verb, it is very likely to also have two noun phrases. So, if children restrict themselves to compositional grammars, and their goal is communicating meanings efficiently, then the true distribution over meanings is outside the family of distributions they can characterize.
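
One way to see the cost of the independence assumption is to count bits. The sketch below is an illustration under invented numbers, not the argument from the blog referred to above. It takes a hypothetical joint distribution over two meaning components (verb type and number of noun phrases), computes the expected length of the best code that sums per-component lengths, and compares it with the best code for whole meanings.

```python
import math

# Hypothetical joint distribution over two meaning components:
# verb type and number of noun phrases. The probabilities are invented.
joint = {
    ("transitive", 2): 0.45,
    ("transitive", 1): 0.05,
    ("intransitive", 1): 0.45,
    ("intransitive", 2): 0.05,
}

def marginal(joint, index):
    """Sum the joint distribution down to one component."""
    out = {}
    for key, p in joint.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out

def entropy(dist):
    """Shannon entropy in bits: the best achievable expected code length."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

verbs, nps = marginal(joint, 0), marginal(joint, 1)

# Expected length of the best compositional code: each component gets an
# ideal length of -log2(marginal probability), and the lengths are summed.
compositional_bits = sum(
    p * -(math.log2(verbs[v]) + math.log2(nps[n])) for (v, n), p in joint.items()
)
joint_bits = entropy(joint)                     # best code for whole meanings
excess_bits = compositional_bits - joint_bits   # the mutual information

print(compositional_bits, joint_bits, excess_bits)
```

The excess is the mutual information between the two components. It is zero only when they really are independent, so a learner restricted to summed code lengths pays this cost whenever the components are correlated.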
