Faculty of Language: Why Formalize- II?

Monday, September 23, 2013

Ewan (here) provides me the necessary slap upside the head, thereby preventing a personality shift from stiff-necked defender of the faith to agreeable, reasonable interlocutor. Thanks, I needed that. My recognition that Alex C (and Thomas G) had reasonable points to make in the context of our discussion had stopped me from thinking through what I take the responsibilities of formalizations to be. I will try to remedy this a bit now.

Here’s the question: what makes for a good formalization? My answer: a good formalization renders perspicuous the intended interpretation of the theory that it is formalizing. In other words, a good formalization (among other things) clarifies vague points that, though not (necessarily) particularly relevant in theoretical practice, constitute areas where understanding is incomplete. A good formalization, therefore, consults the theoretical practice of interest and aims to rationalize it through formal exposition. Thus formalizing theoretical practice can have several kinds of consequences. Here are three (I’m sure there are others): it might reveal that a practice faces serious problems of one kind or another due to implicit features of that practice (see Berwick’s note here), or even possible inconsistency (think Russell on Frege’s program); it might lay deeper foundations (and so set new questions) for a practice that is vibrant and healthy (think Hilbert on Geometry); or it may attempt to clarify the conceptual bases of a widespread practice (think Frege and Russell/Whitehead on the foundations of arithmetic). At any rate, on this conception, it is always legit to ask if the formalization has in fact captured the practice accurately. Formalizations are judged against the accuracy of their depictions of the theory of interest, not vice versa.

Now rendering the intended interpretation of a practice is not at all easy. The reason is that most practices (at least in the sciences) consist of a pretty well articulated body of doctrine (a relatively explicit theoretical tradition) and an oral tradition. This is as true for the Minimalist Program (MP) as for any other empirical practice. The explicit tradition involves the rules (e.g. Merge) and restrictions on them (e.g. Shortest Attract/Move, Subjacency). The oral tradition includes (partially inchoate) assumptions about what the class of admissible features is, what a lexical item is, how to draw the functional/lexical distinction, how to understand thematic notions like ‘agent,’ ‘theme,’ etc. The written tradition relies on the oral one to launch explanations: e.g. thematic roles are used by UTAH to project vP structure, which in turn feeds into a specification of the class of licit dependencies as described by the rules and the conditions on them. Now in general, the inchoate assumptions of the oral tradition are good enough to serve various explanatory ends, for there is wide agreement on how they are to be applied in practice. So for example, in the thread to A Cute Example (here), what I (and, I believe, David A) found hard to understand in Alex C’s remarks revolved around how he was conceptualizing the class of possible features. Thomas G came to the rescue and explained what kinds of features Alex C likely had in mind:

"Is the idea that to get round 'No Complex Values', you add an extra feature each time you want to encode a non-local selectional relation? (so you'd encode a verb that selects an N which selects a P with [V, +F] and a verb that selects an N which selects a C with [V, +G], etc)?"  

Yes, that's pretty much it. Usually one just splits V into two categories V_F and V_G, but that's just a notational variant of what you have in mind.

Now, this really does clarify things. How? Well, for people like me, these kinds of features fall outside the pale of our oral tradition, i.e. nobody would suggest using such contrived items/features to drive a derivation. They are deeply unlike the garden variety features we standardly invoke (e.g. +Wh, case, phi, etc.) and, so far as I can tell, restricted to these kinds of features, the problem Alex C notes does not seem to arise.
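To make the feature-splitting trick concrete, here is a minimal sketch. It is a toy, not an MG implementation: the lexical items, the chain-shaped derivations and the refined categories N_F and N_G (modeled on the V_F and V_G of the quote) are all assumptions made for illustration. The point is only that once an atomic category is split, purely local selection can enforce a dependency between a verb and the complement of its complement.

```python
# Toy lexicon: word -> (category, category it selects, or None).
# All items and category names here are hypothetical illustrations.
BASE = {
    "verb_a": ("V", "N"),
    "verb_b": ("V", "N"),
    "noun_p": ("N", "P"),   # a noun that selects a P
    "noun_c": ("N", "C"),   # a noun that selects a C
    "prep":   ("P", None),
    "comp":   ("C", None),
}

# Same words, but N is split into two unrelated atomic categories.
# verb_a now only combines with nouns that themselves select a P.
REFINED = {
    "verb_a": ("V", "N_F"),
    "verb_b": ("V", "N_G"),
    "noun_p": ("N_F", "P"),
    "noun_c": ("N_G", "C"),
    "prep":   ("P", None),
    "comp":   ("C", None),
}


def licit(lexicon, heads):
    """Check a head-complement chain using only local category matching."""
    for head, complement in zip(heads, heads[1:]):
        selected = lexicon[head][1]
        if selected != lexicon[complement][0]:
            return False
    # The last head in the chain must not select anything further.
    return lexicon[heads[-1]][1] is None


# In BASE the verb cannot "see" what its noun selected.
print(licit(BASE, ["verb_a", "noun_c", "comp"]))
# In REFINED the same chain is excluded, by local selection alone.
print(licit(REFINED, ["verb_a", "noun_c", "comp"]))
```

To the checker, N_F and N_G are just two more atomic labels, no more related to each other than C and T are. That is what makes such features look contrived from the point of view of practice, and also what makes them hard to rule out by inspection.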

Does this mean that all is well in the MP world? Yes and No.

Yes, in the sense that Alex C’s worries, though logically on point, are weak in a standard MP (or GB) context, for nobody supposes that the kinds of features he is using to generate the problem exist. This is why I still find the adverb fronting argument convincing and dispositive with regard to the learnability concerns it was deployed to address. Though I may not know how to completely identify the feature malefactors that Thomas G describes, I am pretty sure that nothing like them is part of a standard MPish account of anything.[1] For the problem Alex C identifies to be actually worrisome (rather than just possibly so), one would have to show that the run-of-the-mill, everyday, garden variety features that MPers use daily could generate trouble, not merely that features practitioners would reject as “nutty” could.[2]

No, because it would be nice to know how to define these possible “nutty” features out of existence and not merely rely on inchoate aspects of the oral tradition to block them. If we could provide an explicit definition of what counts as a legit feature (and what does not), then we would have learned something of theoretical interest, even if it fails to have much of an impact on the practice as a result of the latter’s phobia for such kinds of features to begin with. Let me be clear here: this is definitely a worthwhile thing to do and I hope that someone figures out a way to do this. However, I doubt that it will (or should) significantly alter conclusions concerning FL/UG like those animated by Chomsky’s “cute example.”[3]

[1] Alex D makes this same point (here), and I agree:

I completely agree with you that if we reject P&P, the evaluation measure ought to receive a lot more attention. However, in the case of a trivial POS argument such as subject/aux inversion, I think the argument can be profitably run without having a precise theory of the evaluation measure.

[2] Bob Berwick’s comment (here) shows how to do this in the context of GPSG and HPSG style grammars. These systems rely on complex features to do what transformations do in GB style theories. What Bob notes is that in the context of features endowed with these capacities, serious problems arise. The take home message: don’t allow such features.

[3] This seconds David Adger’s conclusion in the comment section (here). Let me quote:

I am convinced by what I think is a slightly different argument, which is that you can use another technology to get the dependency to work (passing up features), but that just means our theory should be ruling out that technology in these cases, as the core explanandum (i.e. the generalization) still needs explained. I think that makes sense…

38 comments:

Unknown, September 23, 2013 at 3:58 PM
I don't think restricting the set of features is a good solution, for several reasons. In increasing order of importance:

1) It obscures the inner workings of the formalism. Restricting the set of category features in MGs is not like Ristad's restrictions of the GPSG feature calculus. Ristad really prunes down the complexity of feature expressions. But category features in MGs are already atomic, flat units without any structure to them. Writing V_F and V_G is just a convenient way of constructing the proof; to the grammar those features are just as different as C and T. So all you can do is restrict the size of the set of category features, and that's like studying only numbers between -17 and +17: you deprive yourself of the opportunity to say anything interesting about integers vs rationals vs reals.

2) Just like formal universals are preferable to substantive universals, we should first try to patch up the feature checking mechanism. The feature coding strategy only works because subcategorization is in a sense symmetric: A needs to be selected to enter the derivation, and B needs an argument of category A.
But this kind of pattern does not hold for adjuncts, for example. B does not need to be adjoined to, and an adjunct does not need to be selected. The standard MG implementation of adjunction captures this via a variant of Merge in which feature checking is asymmetric --- the adjunct has an adjunction feature that must match the category feature of the phrase it adjoins to, but only the adjunction feature is deleted. Move can no longer be reliably emulated via Merge in this case. I have argued recently that this asymmetry even derives the Adjunct Island Constraint.

3) You are presupposing that features are tangible objects that can easily be distinguished from other parts of the grammar. But the feature coding result not only tells us that constraints can be represented as features, it also tells us that features represent constraints; the two are interchangeable from a technical perspective. So whatever set of features I start out with, I can reduce it to a fixed one by adding constraints to the grammar.

4) The previous argument raises an even bigger issue. Mainstream minimalism does not just consist of Merge and Move and some features, there are tons of additional constraints such as the Specifier Island Constraint that started this whole discussion.
So let's assume that the SPIC holds for all MGs, and now let's look at it from the perspective of feature coding. We know that the number of new features you need to encode a constraint through Merge varies between grammars even if they started out with the same set of features. So a universal SPIC actually entails that not all grammars have the same set of features if viewed from the perspective of feature refinement. So if you want to have a fixed set of features, it must include every feature that is included in some grammar. But since not every feature occurs in every MG, some MGs will have "unused" features that we can exploit to encode new constraints, and we are back to square 1.

Imho the issue is not really with the set of features, it is with the way we currently allow information to flow through the tree via Merge. You cannot stop that in a conceptually coherent way by assuming a fixed set of features because the whole notion of feature has become ephemeral in MGs. Constraints are features, features are constraints. So ultimately, we are dealing with a constraint problem, not a feature problem, and I think that's how we should try to solve it.
Alex Clark, September 24, 2013 at 2:16 AM
There is a useful difference between "being precise" and "being formalized".
The examples you give are all cases where perfectly precise theories (naive set theory, plane geometry, arithmetic) were formalized and some problems appeared or didn't appear. It's notable also that none of these are *scientific* examples.

But there is no vagueness or imprecision in the pretheoretic notion of an integer or of addition.

So take a linguistic example -- your bog standard tree. There are several different ways of formalizing the notion of a tree -- terms over a ranked alphabet or Gorn style tree domains, and no doubt some others -- but they don't matter. What matters is that the idea of a tree is perfectly precise, and maybe you need to specify what the labels are and so on, but basically it just doesn't matter which formalization you use (footnote: probably there are some areas where it might matter but let's leave them to one side).

The problem we have been discussing here is nothing to do with formalization and everything to do with precision. The "theory" you are talking about is just not specified precisely enough for it to have any empirical content.

Whether the set of features is universal and innate (and if so what they are, and if not where they come from) is not some theoretically uninteresting detail of the formalization (like the tree formalization issue): this is a centrally important point of the theory.

So sure, you can start adding a whole lot of extra stuff to your theory, but that makes it a different theory -- more complex, maybe less evolutionarily plausible etc etc.
What irritates me is the shifting between a clean simple and elegant theory that just doesn't explain what you claim it does, and some hugely complex ad hoc cobbled together theory that perhaps might (though I have my doubts). These are just two different theories, and you can't cherry-pick the positive features of both and pretend it is a consistent theory.
Alex Clark, September 25, 2013 at 3:57 AM
I am not sure about the technical part of that as the learning theory doesn't necessarily guarantee that the languages would be represented by monadic branching MCFGs, even if the languages are all monadic branching MCFLs.
Indeed I think there are even finite languages where one would learn a non-monadic grammar.

But you are right, no empirical content is overreaching -- I got carried away by my rant.

So what *are* we arguing about? The origin of structure dependence of syntax (SDS), and in particular whether it is learned or innate.
The claim was that the MP had an explanation of SDS because it had a hypothesis class that excluded the structure independent hypothesis -- and this is the standard Chomsky line going all the way back to when it was framed in terms of transformations of surface sentences -- front the first auxiliary, etc.
And I think it is clear now that the current theories of minimalist syntax do not exclude the structure independent (SI) rules in any meaningful way.
So one answer is to say that the hypothesis class includes both SI and SD options, and the child learns which ones are which.
And there are quite precise models of how this might happen.
Then you need to explain why natural languages only use the SD option, and there are a number of options there; probably it arises out of some Herb Simon style functional considerations interacting with some biases of the learner. But that is a different problem, and one I am not very interested in, at least at the moment.

And the other is to say that we need to make the hypothesis class smaller in some way, which is an approach that in my view is unlikely to succeed, because it is, I think, mathematically impossible to express that sort of language theoretic property through a grammar theoretic restriction -- I think I mentioned this earlier to Benjamin B -- for example, you can't define a restriction that gives you the class of CFGs that generate non-regular languages. So I think you can't define the class of MGs that only give you SD rules, even if you could make the notion of a SD rule precise. ("Define" here means having a decidable property on the grammars.)

So that is the real reason I think you want to put it in the learner rather than in UG.
Alex Drummond, September 25, 2013 at 4:42 AM
"And I think it is clear now that the current theories of minimalist syntax do not exclude the structure independent (SI) rules in any meaningful way."

This is where I don't quite follow. I can see two ways of encoding structure-independent rules in MGs. One would be to have a grammar generating a uniformly right-branching derivation tree language, so that there was no real distinction between hierarchy and linear order. The other would be to sneak in linear relations via selection. (I haven't worked out the details, but I guess you could probably use selectional features to run a finite state machine over the linear sequence of heads, and this would presumably be sufficient to correlate the presence of a +Q head with inversion of the linearly first auxiliary.) The first method depends on the learner adopting a structural hypothesis which is completely inadequate (the size of the grammar will balloon as soon as they encounter e.g. complex subjects, and in any case, uniformly-branching hypotheses are plausibly ruled out by UG). The second method requires the addition of enormous numbers of spurious features which would be punished by any conceivable evaluation mechanism. So, not surprisingly, as we get into the technical details, claims need to be stated in a more nuanced and sophisticated way. "UG does not permit rules of this type" turns out to mean "UG does not permit rules of this type unless you also have (i) uniformly-branching tree structures (which are probably blocked by UG too and which in any case will eventually result in a huge blow-up in the size of the grammar) or (ii) a huge lexicon".

In other words, yes, you can have linear rules, but what you can't have is linear rules and a structure that isn't completely wrong and a reasonably small lexicon. So, in effect you can't have them at all. And you can't have them because UG makes it a real PITA to encode what is on the face of it a perfectly reasonable hypothesis.
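As a rough illustration of why option (ii) inflates the lexicon, here is a minimal sketch. The details are not worked out above, so this assumes one particular encoding: every category is paired with a state of the finite state machine, and each head passes the updated state down through what it selects. The lexicon, the state names and the helper `refine` are all hypothetical.

```python
def refine(lexicon, states, step):
    """Thread automaton states through selection (an assumed encoding).

    lexicon: word -> (category, selected category or None)
    step:    (state, word) -> next state
    Each refined entry has category (cat, q) and selects (sel, step(q, word)).
    """
    refined = []
    for word, (cat, sel) in lexicon.items():
        for q in states:
            selected = None if sel is None else (sel, step(q, word))
            refined.append((word, (cat, q), selected))
    return refined


# Hypothetical three-item lexicon and a two-state machine that only
# records whether an auxiliary has been encountered yet.
LEXICON = {"comp_q": ("C", "T"), "aux": ("T", "V"), "verb": ("V", None)}
STATES = ["no_aux_yet", "aux_seen"]


def step(state, word):
    return "aux_seen" if word == "aux" else state


REFINED = refine(LEXICON, STATES, step)
print(len(LEXICON), "base entries ->", len(REFINED), "refined entries")
```

Under this encoding every lexical item needs one copy per state, so the lexicon grows by a factor equal to the number of states, and the refined categories are exactly the kind of spurious features an evaluation measure would penalize.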