Sunday, October 6, 2013

The Merge Conspiracy [Part 1]

A while ago, Norbert wrote several interesting posts regarding the Specifier Island Constraint (SPIC) that prompted me to chime in with some technical observations. The basic upshot was that SPIC cannot reliably block extraction from a specifier because we can use Merge rather than Move to displace constituents, in which case SPIC does not apply. Since the comments section isn't the ideal place for discussing such technical matters in an accessible form, Norbert was kind enough to endow me with the divine power of publishing my remarks directly on his blog. So here's the first entry of a three-part epic, the story of the power of Merge and how it can be used for both good and evil (rated PG-13 for some technical details; but hey, we're all grown-ups here).

Subcategorization in Minimalism

Syntactic frameworks are a diverse bunch, but they all involve some mechanism to capture subcategorization. Words cannot be combined freely like appetizers at a Lebanese restaurant. You can say both He likes to bake and He likes baking, but the semantically equivalent enjoy is more restricted --- He enjoys baking is fine, whereas He enjoys to bake is ungrammatical (in many dialects of English). Combining enjoy with to bake is the linguistic equivalent of soaking your crêpe in your soup --- the two just don't go together. Thanks to the wonders of scholarly diversity, how subcategorization is encoded on a technical level differs between theories. I am just going to discuss the Minimalist view of things here, but rest assured, all the issues we will encounter in later posts also arise in CCG, GPSG, LFG, and HPSG, among others.

One of the most explicit accounts of selection in Minimalist syntax can be found in David Adger's textbook Core Syntax, and a similar implementation is used in Ed Stabler's formalization of Minimalist syntax known as Minimalist grammars. First, every word has a syntactic category feature. Verbs have V, nouns have N, complementizers have C, and so on. Second, if a word takes, say, three arguments of category X, Y, and Z, where it first combines with X, then with Y, and finally with Z, then the word has corresponding selector features Arg1:X, Arg2:Y, Arg3:Z. All Arg features of a word must be discharged by merging the word with arguments bearing a matching category feature, in the correct order.

The split between like and enjoy can now be handled as follows. First one posits two distinct feature specifications for like. In both cases its category feature is V and its second argument must be a DP. The specification for the first argument determines whether like combines with the TP to bake or the DP baking. In contrast to like, enjoy has only one entry, which requires both arguments to be DPs. The relevant feature specifications for all the words in the examples above are as follows:
  1. like[Cat:V, Arg1:T, Arg2:D]
  2. like[Cat:V, Arg1:D, Arg2:D]
  3. enjoy[Cat:V, Arg1:D, Arg2:D]
  4. to bake[Cat:T]
  5. baking[Cat:D]
  6. he[Cat:D]
We can now combine these words via Merge to create bigger expressions as long as the relevant features match. If we want to build He likes to bake, we first merge like[Cat:V, Arg1:T, Arg2:D] and to bake[Cat:T]. This is licit because the category feature of to bake matches the value of the Arg1 feature on like. And of course we can then merge like to bake with he because the latter is of category D and the head of like to bake has the matching Arg2 feature value. The expression He enjoys to bake, on the other hand, cannot be built because enjoy always has D as the value for Arg1. So merging to bake with enjoy is like trying to hammer a square peg into a round hole.

Note that likes to bake by itself is not a well-formed expression because the Arg2-feature of likes still needs to be checked via Merge. Similarly, he he likes to bake is ungrammatical because likes in he likes to bake has no further Arg-features that would license a third application of Merge.
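To make the checking procedure concrete, here is a minimal sketch of feature-driven Merge in Python. The representation and names are my own invention for illustration, not Adger's or Stabler's actual notation --- just the bookkeeping it implies: Merge succeeds only if the head's next undischarged Arg feature matches the argument's category feature.

```python
# Toy sketch of feature-driven Merge (representation is mine, for illustration).
# A lexical item is (category, Arg features in checking order, number checked).

def merge(head, arg):
    """Check the head's next undischarged Arg feature against the
    argument's category feature; fail on a mismatch."""
    cat, args, checked = head
    if checked >= len(args):
        raise ValueError("no Arg features left to check")
    if args[checked] != arg[0]:
        raise ValueError(f"feature mismatch: need {args[checked]}, got {arg[0]}")
    return (cat, args, checked + 1)

like_T  = ("V", ("T", "D"), 0)   # like[Cat:V, Arg1:T, Arg2:D]
enjoy   = ("V", ("D", "D"), 0)   # enjoy[Cat:V, Arg1:D, Arg2:D]
to_bake = ("T", (), 0)
he      = ("D", (), 0)

# "He likes to bake": both Merge steps go through.
vp = merge(merge(like_T, to_bake), he)

# "He enjoys to bake": the very first Merge fails, since enjoy wants a D.
try:
    merge(enjoy, to_bake)
except ValueError as e:
    print(e)   # feature mismatch: need D, got T
```

The same sketch reproduces the point about he he likes to bake: attempting a third Merge on `vp` raises "no Arg features left to check".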

Local Dependencies via Subcategorization

The feature-driven system captures the essential properties of subcategorization:
  • Arguments must have a specific category.
  • A phrase cannot be an argument of more than one head.
  • Heads combine with a fixed number of arguments.
But every mechanism that correctly handles these subcategorization dependencies can also handle other local dependencies. Suppose we want to ensure that our grammar generates He likes him rather than He likes he. In Minimalism this is handled by the operation Agree, but our simple Merge mechanism can get the job done, too. All we have to do is split the category D into S for subjects and O for objects.
  1. like[Cat:V, Arg1:T, Arg2:S]
  2. like[Cat:V, Arg1:O, Arg2:S]
  3. enjoy[Cat:V, Arg1:O, Arg2:S]
  4. to bake[Cat:T]
  5. baking[Cat:S]
  6. baking[Cat:O]
  7. he[Cat:S]
  8. him[Cat:O]
Now he can only occur as the second argument of the verb, ruling out he likes he.
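The effect of the split can be verified mechanically. Below is a tiny sketch (my own encoding, just for illustration) in which a derivation succeeds iff the arguments' category features match the head's Arg features in checking order:

```python
# Toy demo of the S/O refinement (encoding is mine, for illustration).

def derives(head_args, arg_cats):
    """True iff each argument category matches the head's corresponding
    Arg feature, in checking order."""
    return list(head_args) == list(arg_cats)

like = ("O", "S")   # like[Cat:V, Arg1:O, Arg2:S]

print(derives(like, ["O", "S"]))  # "he likes him": True
print(derives(like, ["S", "S"]))  # "he likes he": False
```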

This kind of feature refinement can also be used to enforce gender agreement between an object reflexive and the subject. First we introduce himself with category feature O(masc) and then we ensure via the subcategorization properties of like that the subject is he whenever the object is himself.
  1. like[Cat:V, Arg1:T, Arg2:S]
  2. like[Cat:V, Arg1:O, Arg2:S]
  3. like[Cat:V, Arg1:O(masc), Arg2:S(masc)]
  4. enjoy[Cat:V, Arg1:O, Arg2:S]
  5. enjoy[Cat:V, Arg1:O(masc), Arg2:S(masc)]
  6. to bake[Cat:T]
  7. baking[Cat:S]
  8. baking[Cat:O]
  9. he[Cat:S]
  10. he[Cat:S(masc)]
  11. him[Cat:O]
  12. himself[Cat:O(masc)]
Just looking at the lexical entries required to regulate local case and gender agreement via Merge, we can already tell why feature refinement is not an option usually entertained by linguists. The size of the lexicon has doubled, and the generalizations underlying agreement are obscured by lumping them together with subcategorization requirements into a baroque system of category features.
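To see that the refined lexicon really does enforce the agreement pattern, one can check each entry of like against the argument categories (again my own toy encoding, not part of any published proposal):

```python
# Toy check of the gender-refined lexicon (encoding mine, for illustration).
# Each entry for `like` lists its Arg feature values in checking order.

LIKE = [("T", "S"), ("O", "S"), ("O(masc)", "S(masc)")]

def grammatical(arg_cats):
    """True iff some entry for `like` matches the argument categories."""
    return any(list(entry) == list(arg_cats) for entry in LIKE)

print(grammatical(["O(masc)", "S(masc)"]))  # "he likes himself": True
print(grammatical(["O(masc)", "S"]))        # plain-S subject with himself: False
```

Since himself only carries O(masc) and only the third entry of like accepts O(masc), the subject slot is forced to be S(masc), i.e. the gendered entry of he.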

But elegance is not a factor in determining what our proposals are in principle capable of. If a theory can account for phenomenon X in a roundabout way, it can account for phenomenon X. Sure, as scientists we favor the account that seems less crude to our refined palate. But the point here is not whether there are more elegant accounts for agreement, it is that a system designed to capture just subcategorization requirements actually is capable of regulating a lot more than just that. Right now this looks like a technical curiosity at best because we have only seen a few innocent toy examples --- nothing complicated like long-distance dependencies or anything related to movement. But believe me, long-distance dependencies are child's play compared to some of the things Merge is capable of. Tune in on Wednesday for Part 2, where the digestive waste product hits the rotary air perturber.

Postscript: Just for the record, putting crêpes in your soup is perfectly fine if the former are sliced and the latter is Austrian beef broth or oxtail soup.


  1. Along with the doubling of the size of the lexicon, the introduction of S and O categories brings the false typological expectation that languages should be able to differ substantially in the internal structure of DPs in those two positions, which doesn't seem to happen (modulo case, and some minor things involving determiners in some languages such as Greek, which I expect can be 'explained away'). So this possibility needs to be headed off at the pass pronto.

    My LFG-2008 conference paper contains an attempt to manage some aspects of this problem in a somewhat innovative LFG+glue semantics framework, but it's got substantial problems.

    1. Whether splitting D into S and O makes new predictions is a subtle issue. For the toy grammar we are looking at, it only captures the case-marking differences between he and him, so it does exactly the same work as Agree in this limited case. But you are right that feature refinement makes some very wacky typological predictions that are clearly wrong; I will discuss that issue on Wednesday.

      In Part 3 I'll talk about possible ways of restricting the refinement strategy, but I haven't thought at all about using semantics for this purpose. Now I'm curious, how exactly does glue semantics enforce structural similarity across arguments? And why does it do it just in some cases, and not for a head selecting, say, a DP and a CP?

    2. Are we assuming a fixed universal set of features here? If not, how do the typological predictions come out of this theory?

    3. I can't tell if the question was directed at me or Avery (unbounded nesting would be neat for comments), but here's my $.02: No fixed set of features at this point, that's why the S-O split is not a strong case for unwanted typological claims. But feature refinement entails various closure properties over Minimalist derivation tree languages that seem wrong from a typological perspective, e.g. closure under union, intersection and relative complement.

    4. @Thomas top: it's a rather involved story involving the interaction between glue semantics and other aspects of LFG, under some ways of developing it. People with zero interest in LFG should stop reading now.

      So here goes:

      1. Classic LFG did subcategorization/selection with argument lists in the PRED features that provided lists of grammatical relations and (often implicitly) assigned semantic roles to them. E.g. the PRED feature for all the inflected forms of transitive 'eat' would be PRED 'Eat(SUBJ, OBJ)'. The GR labels were not to be identified with category features, a property occasionally exploited in analyses.

      2. Glue semantics introduces the possibility, and, according to me, the necessity, of removing the argument lists from the PRED features and doing the same job with the meaning-constructors associated with lexical items. So we would have:

      Eat^{e->e->p} : (^ OBJ) -> (^ SUBJ) -> ^

      Where I'm using latex notation for the type superscript to Eat, and the '^' instances to the left of the colon are LFG up-arrows that are supposed to be instantiated.

      3. This move however raises numerous questions (which is perhaps why most LFG-ers have not bitten this bullet, as discussed briefly by Ash Asudeh in his resumption book), such as what the PRED features are there for.

      4. Another issue is that the classic way to introduce meaning-constructors in LFG is in the lexicon, alongside all of the other grammatical features, etc, which creates its own collection of problems which I discuss in my 2007 paper in the Joan Bresnan Festschrift volume, and also in the 2008 lfg conference paper. What I propose in both places is that the meaning-constructors actually work by 'interpreting' features in the f-structure (on an 'interpret once and once only' basis), so that you can think of yourself as starting with an f-structure with features and no meaning-constructors, and removing features while adding meaning-constructors. This involves splitting the traditional lexicon into two components, a 'morphological' lexicon that looks just like the traditional one but lacking the argument-lists in the PRED-features, and a 'semantic' lexicon that licenses meaning-constructors on the basis of the features in the f-structure.

      Afaik nobody in the LFG literature has taken up my suggestions, but that's how the actual implementations of semantics in the XLE system have tended to work (including the glue one). The technical names for the two approaches here are 'codescription' (classic LFG+glue) and 'description by analysis' (removing f-structure while adding meaning-constructors). Most of the time, PRED and other features are interpreted individually, but in idiomatic and various other kinds of constructions, combinations of them are interpreted jointly. E.g. the PREDs of 'get', 'up' and 'nose' are jointly interpreted to mean 'annoy somebody [the possessor of the nose] intensely'.

      5. So the question of what the PRED features are really doing intensifies, and in the 2008 paper I propose a bunch of special properties that they are supposed to have to explain various things, including a constraint that if any feature of a grammatical function-bearer is interpreted jointly with the PRED of whatever it bears the function to, then the PRED of that bearer must also be. So you can select for a plural (2nd) object as in 'give NP the creeps', but you can't arbitrarily impose a plural feature on your object without also fixing its PRED. The reverse doesn't seem to work: you can get up Fred's nose, but also up the reviewers' noses.

      6. So at the end of this twisty path, we are supposed to have a system wherein it is impossible for a lexical item, conventionally understood as a potentially arbitrary pairing between a chunk of sound and a chunk of meaning, to specify a grammatical feature of one of its syntactic arguments without specifying the entire syntactic argument as an idiom chunk.

  2. I'm curious about your previous comment ruling out the "restrict the features" approach to this:

    "the feature coding result not only tells us that constraints can be represented as features, it also tells us that features represent constraints, the two are interchangeable from a technical perspective. So whatever set of features I start out with, I can reduce it to a fixed one by adding constraints to the grammar."

    "from a technical perspective" is key here. It's generally understood that the formalism does more than simply delimit the class of possible mappings that grammars can specify. Rather, the notation does some work; in the classical theory this was limited to pointing to an evaluation measure, but in say Berwick and Weinberg the "transparency" of the online processing mechanism wrt the specification of the grammar was also raised. In other words, the notational specification of the grammar stands in some homomorphic relation with something more than just the function it computes, in an ideal world. So that's just to say that, well, one _could_ still in principle divide the world into feature-based restrictions and constraint-driven restrictions, given independent evidence and agreed upon criteria for which restriction should go where.

    I wonder (since I don't know) if there is still any conceivable MG lever we could pull to have a feature/non-featural restriction division continue to be meaningful, not necessarily in what the grammars can do, but perhaps in what the grammars "mean" - just to see how far we could push these things. Such questions should no longer be antithetical to the "formal view" (I sometimes get the impression that they are although they never should have been) given the new Stabler parsing work.

    1. That's an excellent point. I mention at the end of the post that feature refinement blows up the size of the lexicon, and while doubling seems bad, the worst case scenario is a lot worse. If somebody wanted to argue in favor of a distinction between features and constraints, that's the attack vector they should pick.

      For example, parsing performance is dependent on grammar size, so a big lexicon is a bad thing from a parsing perspective. At this point there is no formal proof that a refined grammar is harder to parse with than the original grammar with several constraints on top of it, but my hunch is that this is true across the board once the constraints pass some complexity threshold. A refined grammar might also make different predictions as to which constructions are difficult to parse.

      So the parsing perspective could establish that there is some use for constraints. I am less sure what would constitute an argument for category features, because those simply regulate the Merge mechanism and can easily be replaced by very local constraints. Is there a succinctness/parsing argument for having at least some features?