Faculty of Language: Guest Post: Tim Hunter on Minimalist Grammars and Stats

Friday, July 26, 2013

Guest Post: Tim Hunter on Minimalist Grammars and Stats

I have more than once gotten the impression that some think that generative grammarians (minimalists in particular) have a hostility to combing grammars and stats because of some (misguided, yet principled) belief that grammars and probabilities don't mix. Given the wide role that probability estimates play in processing theories, learnability models, language evolution proposals, etc. the question is not whether grammars and stats ought to be combined (yes they should be) but how they should be combined. Grammarians should not fear stats and the probabilistically inclined should welcome grammars. As Tim notes below there are two closely related issues: what to count and how to count it. Grammars specify the whats, stats the hows. The work Tim discusses was done jointly with Chris Dyer (both, I am proud to say, UMD products) and I hope that it encourages some useful discussion on how to marry work on grammars with stats to produce useful and enlightening combinations.

Tim Hunter Post:

Norbert came across this paper, which defines a kind of probabilistic minimalist grammar based on Ed Stabler's formalisation of (non-probabilistic) minimalist grammars, and asked how one might try to sum up "what it all means". I'll mention two basic upshots of what we propose: the first is a simple point about the compatibility of minimalist syntax with probabilistic techniques, and the second is a more subtle point about the significance of the particular nuts and bolts (e.g. merge and move operations) that are hypothesised by minimalist syntacticians. Most or all of this is agnostic about whether minimalist syntax is being considered as a scientific hypothesis about the human language faculty, or as a model that concisely captures useful generalisations about patterns of language use for NLP/engineering purposes.

Norbert noted that it is relatively rare to see minimalist syntax combined explicitly with probabilities and statistics, and that this might give the impression that minimalist syntax is somehow "incompatible" with probabilistic techniques. The straightforward first take-home message is simply that we provide an illustration that there is no deep in-principle incompatibility there.

This, however, is not a novel contribution. John Hale (2006) combined probabilities with minimalist grammars, but this detail was not particularly prominent in that paper because it was only a small piece of a much larger puzzle. The important technical property of Stabler's formulation of minimalist syntax that Hale made use of had been established even earlier: Michaelis (2001) showed that the well-formed derivation trees can be defined in the same way as those of a context-free grammar, and given this fact probabilities can be added in essentially the same straightforward way that is often used to construct probabilistic context-free grammars. So everything one needs for showing that it is at least possible for these minimalist grammars to be supplemented with probabilities has been known for some time.

While the straightforward Hale/Michaelis approach should dispel any suspicions of a deep in-principle incompatibility, there is a sense in which it does not have as much in common with (non-probabilistic) minimalist grammars as one might want or expect. The second, more subtle take-home message from our paper is a suggestion for how to build on the Hale/Michaelis method in a way that better respects the hypothesised grammatical machinery that distinguishes minimalist/generative syntax from other formalisms.

As mentioned above, an important fact for the Hale/Michaelis method is that minimalist derivations can be given a context-free characterisation; more precisely, any minimalist grammar can be converted into an equivalent multiple context-free grammar (MCFG), and it is from the perspective of this MCFG that it becomes particularly straightforward to add probabilities. The MCFG that results from this conversion, however, "misses generalisations" that the original minimalist grammar captured. (The details are described in the paper, and are reminiscent of the way GPSG encodes long-distance dependencies in context-free machinery by using distinct symbols for, say, "verb phrase" and "verb phrase with a wh-object", although MCFGs do not reject movement transformations in the way that GPSG does.) In keeping with the slogan that "Grammars tell us what to count, and statistical methods tell us how to do the counting", in the Hale/Michaelis method it is the MCFG that tells us what to count, not the minimalist grammar that we began with. This means that the things that get counted are not defined by notions such as merge and move operations, theta roles or case features or wh features, which appeared in the original minimalist grammar; rather, the counts are tied to less transparent notions that emerge in the conversion to the MCFG.

We suggest a way around this hurdle, which allows the "what to count" question to be answered in terms of merge and move and feature-checking and so on (while still relying on the context-free characterisation of derivations to a large extent). The resulting probability model therefore works within the parameters that one would intuitively expect to be laid out for it by the non-probabilistic machinery that defines minimalist syntax; to adopt merge and move and feature-checking and so on is to hypothesise certain joints at which nature is to be carved, and the probability model we propose works with these same joints. Therefore to the extent that this kind of probability model fares empirically better than others based on different nuts and bolts, this would (in principle, prima facie, all else being equal, etc.) constitute evidence in favour of the hypothesis that merge and move operations are the correct underlying grammatical machinery.

32 comments:

davidadgerJuly 26, 2013 at 9:52 AM
Can I just plug Adger 2006 and Adger and Smith 2010 where we argue that one can handle probabilistic distributions of data in a simple minimalist feature checking system in a way that brings together minimalist syntax with probabilistic variationist sociolinguistics. The 2006 paper is in journal of linguistics and the 2010 in Lingua.
ReplyDelete
Replies
Avery AndrewsJuly 26, 2013 at 10:10 AM
Excellent to see a reference list on this building up.

A conceptual aspect of this stuff I'm wondering about is how to tell what if any aspects of the grammar are probabilistic, as opposed to the probabilities being caused by the environment of use. My judgement would be that for example the use of discourse positions really is probabilistic, but proving it is a different matter.
ReplyDelete
Replies
benjamin.boerschingerJuly 27, 2013 at 12:40 AM
Mark Johnson gave a talk about learning parameters (Pollock 1989 style V-to-T, ...) in a minimalist grammar framework using statistical inference, at this year's International Congress of Linguists building directly on Tim's and Chris's paper.
I think it was recorded but I don't know where or when the video will be available. The slides are available here: http://web.science.mq.edu.au/~mjohnson/papers/Johnson12ICLtalk.pdf
ReplyDelete
Replies
Alex ClarkJuly 27, 2013 at 5:40 AM
You should have put a spoiler alert here -- you have spoiled the drama of Tim's talk at MOL which I was looking forward to ..

Obviously I am completely on-side here with Tim's work as I am a big fan both of Stablerian MGs and probabilistic modelling, but one quibble. There doesn't seem, to me at least, to be much relationship between MGs and Minimalist syntax as exemplified by e.g. David Adger's core syntax book, (since he just posted above) or Norbert's recent book on theory of syntax (since this is his blog), or Chomsky's view of Minimalism, (
based on some brief discussion with Sandiway Fong who is trying to formalise it).
So Tim has shown that there is no incompatibility between MGs and the use of probabilities, which is roughly because MGs are well behaved computationally, and equivalent in one sense to a PSG formalism. But this certainly does not imply that there is a compatibility between minimalist syntax and probabilities.

On another note, it's interesting to thing about parameterising MCFGs in the light of Postal 64's arguments about discontinuous constituents. One of his arguments is basically Tim's point about about the naive parameterisation missing generalisations.
So you need some sort of feature calculus to control the derivations in an MCFG/MG --
it would be nice to have some more abstract way of thinking about these features and what they do, in the way that we can now think of the derivations in an abstract way.
ReplyDelete
Replies
benjamin.boerschingerJuly 27, 2013 at 1:27 PM
The video of Mark Johnson's talk about statistical inference and minimalist grammars for language acquisition is available here: https://mediaserver.unige.ch/play/80084
ReplyDelete
Replies
Charles YangJuly 29, 2013 at 6:28 AM
As one of the look-at-LSLT revisionists that Alex refers to, I also think it's a good idea to shout as loudly as possible, that grammars are not incompatible with statistical inference. That doesn't mean we will be heard. The field has gone through this a couple of times already; see, among others, the variable rule debate, which was ignited by Labov's probabilistic amendment to SPE.

More to the point here, and also echoing Alex's point: it has more to do with acquisition rather than representation. On my view, the most compelling argument for using probabilistic models for language learning is that it offers a straightforward account of the gradualness in child language acquisition. But the puzzling fact is that children's syntactic development does not generally follow the usual type of frequency effects.

For instance, as Amy Pierce showed many years ago and replicated for all verb raising languages, French children learn V-to-T at 18-20th month such as virtually no errors ever occur. On what kind of statistical information? A long time ago, I counted child directed French and found that 7% of utterances contain a finite verb followed by negation or VP level adverb, which seems adequate to facilitate early acquisition. But problems come up when we turn to other aspects of child language--in fact, the most studied aspects. If I have to name three biggest topics in language acquisition, they would be (a) Null Subjects, (b) Optional Infinitives and (c) English past tense (true, it's morphology). In all three cases, children stubbornly resist statistical trends in adult language: the Null Subject stage for English children last 3 years, Optional Infinitives even longer, and even adults still over-regularize. To compound the matter further, Italian and Chinese children, who learn the opposite grammars to English with respect to subject use, are at near adult level at the age of 2.

The question, then, seems to come up with the best representation-learning combo to account for child language quantitatively and cross-linguistically. Formal properties matter too: after all, every child learns approximately the same grammar (at least they can understand each other), and this suggests that the search space be suitably smooth.

ReplyDelete
Replies
Alex ClarkJuly 29, 2013 at 10:07 AM
Yes, there is good stats (e.g. Tim's work) and bad stats (n-grams), and in the anti bad stats broadsides that Chomsky has delivered over the years, I feel the good stats have suffered some collateral reputational damage. So it is worth carrying on shouting, though like you I am somewhat pessimistic.

I still don't quite understand the "probabilities in the performance module" versus "probabilities in the competence grammar" debate, or even what the consensus view on it is at the moment.
ReplyDelete
Replies
davidadgerJuly 29, 2013 at 11:19 AM
Alex, have a look at Stabler's forthcoming TICS paper on his website for a very nice illustration of probabilities in the parser vs the grammar. I agree with Charles that where probabilistic info looks most likely (ahem) is in acquisition and in parsing and other aspects of performance like lexical choice. The principles of the grammar don't, at least as far as current arguments go, look probabilistic to me.
ReplyDelete
Replies
Alex ClarkJuly 29, 2013 at 12:36 PM
I am just trying to fit Tim and Chris's work into this debate. If everybody agrees that stats have some place in syntax, then the only thing to discuss is where precisely the stats should go: in the lexicon, the grammar, the parser, the LAD ...

So ... MGs are a fully lexicalised formalism, so in a certain sense the grammar is the lexicon, assuming a universal set of features (?) -- if the lexicon is part of performance and thus probabilistic then one approach is just to attach probabilities only to the productions that introduce lexical items, and then everything is deterministic bottom up.. but that doesn't work mathematically (I think).

I guess Tim's approach is one alternative answer -- but the probabilities there are not in the parser or the lexicon .. but defined over local chunks of the derivation tree, which seems pretty close to "the principles of the grammar".

So where does the paper that is the topic of the post fit in to the representation/acquisition/performance space?

(thanks for the pointer to the new Stabler paper!)
ReplyDelete
Replies
NorbertJuly 30, 2013 at 2:37 PM
Norbert here:

Thomas Graf has tried to post this comment twice and it has not appeared for some reason. The point is interesting so I have posted it for him. Here it is:

****

I'm a little late to the party, but I'm curious where you, Alex and Tim, see big discrepancies between MGs and Minimalist syntax. The strength of MGs is that they are an extremely malleable formalism, they can easily be modified without altering their fundamental properties. Personally, I think of standard MGs as characterized by the following properties:

1) their derivation tree languages fall into a particular formal class thanks to the SMC (regular tree languages),
2) the mapping from derivations to derived trees has limited complexity (definable in monadic second-order logic),
3) the derivation trees are lexicalized via a (usually symmetric) feature calculus.

As far as I can tell, Tim's probabilistic work depends only on 1), as this is what makes it possible to view MGs as underlyingly context-free. You can easily expand MGs with new movement types, Adjunction, (limited) Late Merger, locality restrictions, reference-set computation, change the feature checking mechanism, or relax the SMC, and 1) will still hold. The only proposals in the literature that strike me as problematic are those that incorporate notions of semantic or structural identity.
ReplyDelete
Replies
Tim HunterJuly 30, 2013 at 4:14 PM
This comment has been removed by the author.
ReplyDelete
Replies
Alex ClarkJuly 31, 2013 at 2:36 AM
There was another post by Thomas Graf which doesn't seem to have shown up on the page though it came through on the subscription so I copy it here:

---- from TG
"Thanks, Norbert, for pushing through my comment, i'm also on the road --- just like Tim --- so maybe this confused blogspot in some way i cannot fathom (previous comments went through just fine). I agree with Tim that relaxing the SMC from, say, 1 to 3 is missing the point. But i do not think that's what the SMC is about. The SMC is a very brute-force way of ensuring MG derivations are regular and involve symmetric feature checking, and there's many alternative routes (mirroring the point made by Alex D).

I think it's worth giving a quick summary of what the SMC does in Minimalist Grammars. In MGs, every movement operation is triggered by mapping a movement licensor feature (+f) to a corresponding licensee feature (-f). This mapping involves two processes: a derivational feature checking mechanism (``if you have +f and -f, elide them from your equation'') and a mapping process (``put a -f element where the +f element used to be''). Now for various reasons, we do not want to have any ambiguity in how these features are mapped to each other. That is to say, if we have one +f feature and more than one -f feature that could check it, we're not happy because this raises several questions: which -f feature is mapped to the +f feature, and what are we supposed to do with the remaining -f features that did not happen to be among the precious chosen -f features? The SMC does away with these questions in a very blunt way --- we simply block all those configurations as ungrammatical

But there are many viable alternatives. For example, we might have some mechanism to decide which -f feature was closer to the relevant +f feature and just decide not to care about -f features that cannot be checked this way. That would be very close to the Closeness condition in Minimalist syntax and still preserve property 1) mentioned above.

Frankly, i don't see why anyone would expect the original MG setup, which was designed in 1997, to be compatible with recent interations of Minimalist syntax. That doesn't mean early iterations of MGs were a waste of time, because all the interesting theorems about the early kind of MGs still carry over to the new variants. But it does bring me back to my original point: what kind of analysis or proposal is incompatible with MGs? Alex D's answer suggests that the answer is none, but i'm still curious what Tim and Alex C have to say about
this. "
--- end of the post from TG

So I am not sure I think they are incompatible as such -- there are as you say a wide variety of MGs and an even wider variety of proposals in the Minimalist syntax literature of various degrees of formality -- but I was thinking about for example, the sorts of models where set theoretic merge is defined as { A, B } without linear order, and linearisation comes afterwards and contains some learned components that only pronounce one of each copied element and so on. It may be possible to formalise that within the MG framework, but superficially at least it doesn't seem to be closely related. Or no more closely related to MGs than to some CG proposals.

And I should clarify that if there is a fundamental incompatibility between MGs and some proposal in Minimalist syntax, then I consider that more of a problem for the syntactic proposal than for MGs, which I think are very much on the right track. For example, if a proposal takes the class of language outside of PTIME.
ReplyDelete
Replies
NorbertJuly 31, 2013 at 4:52 AM
From David Adger:

***

Completely agree with Alex (Drummond) above. Even looking at the system I presented way back in Core Syntax, it's fairly straightforward to formalise most of the analysis given there using MGs so I think there can actually be a fairly close relationship between MGs and minimalism in the wild as presented in undergrad textbooks. I think there's a sociological issue here though. When I talk to friends who work in LFG or HPSG, many bemoan what they see as the absence of formal work in minimalism, and most simply don't believe me when I say that much of the work is straightforward to make formally explicit or they say that an MG type formalisation is not really minimalism. I think this is because they want a uniform, mostly agreed upon, formally explicit and fairly complete theory (looking at you Miram Butt and Ash Asudeh!), Essentially a grammar fragment for UG. But we working syntacticians in minimalism (and elsewhere) look like we are constantly changing even what seem to be fairly crucial and core theoretical precepts. Which from the LFG/HPSG perspective must be a bit annoying. The reason why this is sociological, I think, is that its about aims and interests. Theoretical minimalist syntacticians are trying to solve theoretical problems (sometimes raised by empirical analysis, sometimes by theoretical qualms) so we are constantly trying out new ways of configuring assumptions (which would lead to different formalisations) basically because although the research programme is fairly clear and has had numerous successes, how it will pan out in detail is not. So for many syntacticians working in the framework (although not all) its about exploring which ways of configuring rather inexplicit theoretical hunches lead to interesting new ways of understanding the phenomena (which can then be made explicit). This gets quite messy and disparate (and interesting!). I hesitate to speak for my LFG/HPSG friends, but my impression is that, possibly because of the discipline imposed by computational interests (building usable grammars), buy probably for other reasons too, such disparate messiness is unattractive, and uniformity of the basic theoretical framework is more highly valued. But this is an issue of interests and is, I think, orthogonal to questions of formalizability. This is then related to the question that Alex (Clark) raises about the relation between MGs vs minimalism in the wild: MGs provide a great way of formalising ideas that are being explored by theoretical syntacticians even though these ideas might be quite disparate, but MG is not intended to be a constraining formal framework in the same way that, for example, LFG is. I may have got that wrong, so please correct me if so!
ReplyDelete
Replies
Avery AndrewsJuly 31, 2013 at 5:58 AM
@David via Norbert. The computational project is certainly a big factor. Another, which motivates the more descriptively oriented LFG-ers, is to capture generalisations in an at least semiformal framework that has a good chance of remaining accessible for a reasonably long time.

Another consideration is that we tended to find the explanatory ambitions of GB/MP implausible, on the basis that learning seemed to be probably more powerful than Chomsky was speculating in 1979, & many of the supposed principles and parameters seemed to have bad & unacknowledged problems from the beginning.
ReplyDelete
Replies
Avery AndrewsJuly 31, 2013 at 7:11 AM
Past tenses in the above because I think the ground has shifted a lot under the various entrenched positions, and things need to be deeply rethought and rephrased.
ReplyDelete
Replies
Tim HunterSeptember 4, 2013 at 4:38 PM
(Sorry to have to leave this interesting discussion for so long. I'll add this anyway and see if anyone's still interested ...)

I agree with the comments from Thomas and Alex D. that we needn't get bogged down in the precise details of the SMC, as it was formulated in the original MGs in 1997. The bigger point is what the SMC gets us, which is "ensuring that MG derivations are regular" (which in turn ensures that they can be characterised by a context-free grammar, with some missed generalisations and blowup); swapping in some other method of ensuring that the derivations are regular will leave the "nice computational properties" in place, including the probability model discussed above, for example. My worry is not about whether minimalism-in-the-wild follows Stabler's SMC to the letter, it's about whether the derivations we find in the wild are such that there is in fact another regularity-enforcing constraint that we could swap in for the SMC.

To illustrate with a fairly contrived example: let's suppose that quantifiers take scope via syntactic movement (QR), and that all such movements are driven by the same type of feature (say, '-q'). The number of quantifiers we can have in a single clause doesn't seem to be bounded in any principled sense, because we can construct things like:
(1) every man met some woman [on every day] [in some building] [with every friend] ...
Let's suppose that there's a derivation where all of these quantifiers move to scope-taking positions at the top of this clause. (I don't think their relative scope-taking positions actually matter at all, nor whether there is more than one option or not.) Then at a certain point in the derivation, we have say a TP constituent that has one unchecked '-q' feature somewhere inside it for each quantifier, each of which needs to be checked by some future move operation. There's no limit on how many of these to-be-moved quantifiers we might need to be keeping track of by the time we get to this TP level, so doesn't this violate the finiteness that is required for the derivations to be regular?

In one sense, it doesn't matter at all whether the assumptions I made about the data are plausible. My point is just that if such data turned up, and a syntactician made the theoretical moves that I sketched in order to try to account for it, then I don't think any of those theoretical moves would be considered particularly outlandish. And this means we have a mismatch between (i) MGs in the broad sense, encompassing all those possible variations that maintain the nice computational properties, and (ii) the things syntacticians might do that are not considered outlandish.

Of course, in another sense, the data does matter: if we don't need that extra stuff, then we don't need it, and so much the better for MGs as an empirical hypothesis. But this doesn't affect the mismatch we have at the moment.
ReplyDelete
Replies

Add comment

Faculty of Language

Comments

Friday, July 26, 2013

Guest Post: Tim Hunter on Minimalist Grammars and Stats

32 comments:

Contributors