Friday, March 28, 2014

The logic of the POS, one more time

Alex C has provided some useful commentary on my intentionally incendiary post concerning the POS (here). It is useful because I think that it highlights an important misunderstanding concerning the argument and its relation to the phenomenon of Auxiliary Inversion (AI). So, as a public service, let me outline the form of the argument once more.

The POS is a tool for investigating the structure of FL. The tool is useful for factoring out the causal sources for some or another feature of a rule or principle of a language-particular grammar G. Some features of G (e.g. rule types) are what they are because the input is what it is. Other features look as they do because they reflect innate principles of FL.

For example: we trace the fact that English Wh movement leaves a phonetic gap after the predicate that assigns it a theta role to the fact that English kids hear questions like “what did you eat.” Chinese kids don’t form questions in this way as they hear the analogues of “you ate what.”  In contrast, we trace the ill-formedness of sentences like “*Everyone likes him,” with “him” interpreted as a pronoun bound by “everyone,” back to Principle B, which we (or at least I) take to arise from some innate structure of FL. So, again, the aim of the POS is to distinguish those features of our Gs that have their etiology in the PLD from those that have it in the native structure of FL.

Note, for POS to be so deployable, its subject matter must be Gs and their properties; their operations and principles. How does this apply to AI?  Well, in order to apply the POS to AI we need a G for AI.  IMO, and that of a very large chunk of the field, AI involves a transformation that moves a finite Aux to C. Why do we/I believe this? Well, we judge that the best analyses of AI (i.e. the phenomenon) involve a transformation that moves Aux to Comp (A-to-C) (i.e. A-to-C names a rule/operation)[1]. The analysis was first put forward in Syntactic Structures (LSLT actually, though SS was the first place where many (including me) first encountered it) and has been refined over time, in particular by Howard Lasnik (the best discussion being here). The argument is that this analysis is better for a variety of reasons than alternative analyses. One of the main alternatives is to analyze the inversion via a phrase structure operation, an alternative that Chomsky and Lasnik considered in detail and argued against on a variety of grounds. Some were not convinced by the Chomsky/Lasnik story (e.g. Sag, Pullum, Gazdar), as Alex C notes in linking to Sag’s Sapir Lecture on the topic.  Some (e.g. me) were convinced and still are. What’s this have to do with the POS?

Well, for those convinced by this story, there follows another question: what do A-to-C’s properties tell us about FL? Note, this question makes no sense if you don’t think that this is the right description (rule) for AI.  In fact, Sag says as much in his lecture slides (here). Reject this presupposition and the conclusion of the POS as applied to AI will seem to you unconvincing (too bad for you, but them’s the breaks). Not because the logic is wrong, but because the factual premise is rejected.  If you do accept this as an accurate description of the English G rule underlying the phenomenon of AI then you should find the argument of interest.

So, given the rule of Aux to Comp that generates AI phenomena, we can ask what features of the rule and how it operates are traceable to the structure of the Primary Linguistic Data (PLD: viz. data available to and used by the English Child in acquiring the rule A-to-C) and how much must be attributed to the structure of FL. So here we ask how much of the details of an adequate analysis of AI in terms of A-to-C can be traced to the structure of the PLD and how much cannot. What cannot, the residue, it is proposed, reflects the structure of FL.

I’ve gone over the details before, so I will refrain yet again here. However, let’s consider for a second how to argue against the POS argument.

1.     One can reject the analysis, as Sag does. This does not argue against the POS, it argues against the specific application in the particular case of AI. 
2.     One can argue that the PLD is not as impoverished as indicated. Pullum and Scholz have so argued, but I believe that they are simply incorrect. Legate and Yang have, IMO, the best discussion of how their efforts miss the mark. 

These are the two ways to argue against the conclusion that Chomsky and others have regularly drawn. The debate is an empirical one resting on analyzed data and a comparison of PLD to features of the best explanation.

What did Berwick, Pietroski, Yankama and Chomsky add to the debate? Their main contribution, IMO, was twofold.

First, they noted that many of the arguments against the POS are based on an impoverished understanding of the relevant description of AI. Many take the problem to be a fact about good and bad strings. As BPYC note, the same argument can be made in the domain of AI where there are no ill-formed strings, just strings that are monoguous where one might have a priori expected ambiguity.

Second, they noted that the pattern of licit and illicit movement that one sees in AI data appears as well in many other kinds of data, e.g. in cases of adverb fronting and, as I noted, even in cases of WH movement (both argument and adjunct). Indeed, for any case of an A’ dependency.  BPYC’s conclusion is that whatever is happening in AI is not a special feature of AI data and so not a special feature of the A-to-C rule. In other words, in order to be adequate, any account of AI must extend to these other cases as well. Another way of making the same point: if an analysis explains only AI phenomena and does not extend to these other cases as well then it is inadequate![2]

As I noted (here), these cases all unify when you understand that movement from a subject relative clause is in general prohibited. I also note (BUT THIS IS EXTRA, as Alex D commented) that subject RCs as islands is an effect generally accounted for in terms of something like subjacency theory (this latter coming in various guises within GG (bounding nodes, barriers, phases) and having analogues in other frameworks).  Moreover, I believe that a POS argument would be easy to construct that island effects reflect innately specified biases of FL.[3]

So, that’s the logic of the argument. It is very theory internal in the sense that it starts from an adequate description of the rules generating the phenomenon. It ends with a claim about the structure of FL. This should not be surprising: one cannot conclude anything about an organ that regulates the structure of grammars (FL) without having rules/principles of grammar. One cannot talk about explanatory adequacy without having candidates that are descriptively adequate, just as one cannot address Darwin’s Problem without candidate solutions to Plato’s. This is part of the logic of the POS.[4] So, if someone talks as if he can provide a POS argument that is not theory internal, i.e. that does not refer to the rules/operations/principles involved, run fast in the other direction and reach for your wallet.


For the interested, BPYC in an earlier version of their paper note that analyses of the variety Sag presents in the linked-to slides have an analogous POS problem to the one associated with transformations in Chomsky’s original discussion. This is not surprising. POS arguments are not proprietary to transformational approaches. They arise within any analysis interested in explaining the full range of positive and negative data. At any rate, here are parts of two deleted footnotes that Bob Berwick was kind enough to supply me with that discuss the issue as it relates to these settings. The logic is the following: relevant AI pairings suggest mechanisms and, given a mechanism, a POS problem can be stated for that mechanism.  What the notes make clear is that analogous POS problems arise for all the mechanisms that have been proposed once the relevant data is taken into account (see Alex Drummond’s comment here (March 28th), which makes a similar point). The take home message is that non-transformational analyses don’t sidestep POS conclusions so much as couch them in different technical terms. This should not be a surprise to those that understand that the application of the POS tool is intimately tied to the rules that are being proposed and the rules that are often proposed are usually (sadly) tightly tied to the relevant data that is being considered.[5]  At any rate, here is some missing text from two notes in an earlier draft of the BPYC paper. I have highlighted two particularly relevant observations.

Such pairings are a part of nearly every linguistic theory that considers the relationship between structure and interpretation, including modern accounts such as HPSG, LFG, CCG, and TAG. As it stands, our formulation takes a deliberately neutral stance, abstracting away from details as to how pairings are determined, e.g., whether by derivational rules as in TAG or by relational constraints and lexical-redundancy rules, as in LFG or HPSG.  For example, HPSG (Bender, Sag, and Wasow, 2003) adopts an “inversion lexical rule” (a so-called ‘post-inflectional’ or ‘pi-rule’) that takes ‘can’ as input, and then outputs ‘can’ with the right lexical features so that it may appear sentence initially and inverted with the subject, with the semantic mode of the sentence altered to be ‘question’ rather than ‘proposition’.  At the same time this rule makes the Subject noun phrase a ‘complement’ of the verb, requiring it to appear after ‘can’. In this way the HPSG implicational lexical rule defines a pair of exactly the sort described by (5a,b), though stated declaratively rather than derivationally.  We consider one example in some detail here precisely because, according to at least one reviewer, CCG does not ‘link’ the position before the main verb to the auxiliary. Note, however, that Combinatory Categorial Grammar (CCG), as described by Steedman (2000) and as implemented as a parser by Clark and Curran (2007), produces precisely the ‘paired’ output we discuss for “can eagles that fly eat.”  In the Clark and Curran parser, ‘can’ (with a part of speech MD, for modal), has the complex categorial entry (S[q]/(S[b]\NP))/NP, while the entry for “eat” has the complex part of speech label S[b]\NP.
Thus the lexical feature S[b]\NP, which denotes a ‘bare’ infinitive, pairs the modal “can” (correctly) with the bare infinitive “eat” in the same way as GPSG (and more recently, HPSG), by assuming that “can” has the same (complex) lexical features as it does in the corresponding declarative sentence. This information is ‘transmitted’ to the position preceding “eat” via the proper sequence of combinatory operations, e.g., so that ultimately “can,” with the feature S[q]/(S[b]\NP), along with “eat,” with the feature S[b]\NP, can combine.  At this point, note that the combinatory system combines “can” and “eat” in that order, as per all combinatory operations, exactly as in the corresponding ‘paired’ declarative, and exactly following our description that there must be some mechanism by which the declarative and its corresponding polar interrogative form are related (in this case, by the identical complex lexical entries and the rules of combinatory operations, which work in terms of adjacent symbols) [my emphasis, NH]. However, it is true that not all linguistic theories adopt this position; for example, Rowland and Pine (2000) explicitly reject it (thereby losing this particular explanatory account for the observed cross-language patterns). A full discussion of the pros and cons of these differing approaches to linguistic explanation is outside the scope of the present paper.
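The combinatory step described in this note can be made concrete with a minimal sketch. It assumes the categories (S[q]/(S[b]\NP))/NP for “can” and S[b]\NP for “eat”, and it treats the subject “eagles that fly” as an already assembled NP, which is a simplification made here for illustration. The class and function names are hypothetical and are not taken from the Clark and Curran parser.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    """An atomic category such as NP or S[q]."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Slash:
    """A functor category: result/arg (looks right) or result\\arg (looks left)."""
    result: "Cat"
    direction: str  # "/" or "\\"
    arg: "Cat"

    def __str__(self) -> str:
        return f"({self.result}{self.direction}{self.arg})"


Cat = Union[Atom, Slash]


def forward_apply(left: Cat, right: Cat) -> Optional[Cat]:
    """Forward application: X/Y followed by Y yields X; otherwise None."""
    if isinstance(left, Slash) and left.direction == "/" and left.arg == right:
        return left.result
    return None


NP = Atom("NP")
S_Q = Atom("S[q]")                               # polar question
S_B = Atom("S[b]")                               # bare-infinitive clause
BARE_VP = Slash(S_B, "\\", NP)                   # category assumed for "eat"
CAN = Slash(Slash(S_Q, "/", BARE_VP), "/", NP)   # category assumed for inverted "can"

# "can" first consumes the whole subject NP ("eagles that fly", pre-assembled here),
# leaving a category that is still looking for a bare-infinitive VP to its right.
after_subject = forward_apply(CAN, NP)
# That remaining category then combines with "eat".
whole_question = forward_apply(after_subject, BARE_VP)
```

On this sketch the verb “fly” never gets a chance to combine with “can”: it sits inside the subject NP, which “can” consumes as a unit before it looks for its bare infinitive.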
As the main text indicates, one way to form pairs more explicitly is to use the machinery proposed in Generalized Phrase Structure Grammar (GPSG), or HPSG, to ‘remember’ that a fronted element has been encountered by encoding this information in grammar rules and nonterminals, in this case linking a fronted ‘aux’ to the position before the main verb via a new nonterminal name.  This is straightforward: we replace the context-free rules that PTR use, S → aux IP, etc., with new rules, S → aux IP/aux, IP → aux/aux vi, aux/aux → v, where the ‘slashed’ nonterminal names IP/aux and aux/aux ‘remember’ that an aux has been generated at the front of a sentence and must be paired with the aux/aux expansion to follow.  This makes explicit the position for interpretation, while leaving the grammar’s size (and so prior probability) unchanged.  This would establish an explicit pairing, but it solves the original question by introducing a new stipulation since the nonterminal name explicitly provides the correct place of interpretation rather than the wrong place and does not say how this choice is acquired [my emphasis, NH]. Alternatively, one could adopt the more recent HPSG approach of using a ‘gap’ feature that stands in the position of the ‘unpronounced’ v, a, wh, etc., but like the ‘slash category’ proposal this is irrelevant in the current context since it would enrich the domain-specific linguistic component (1), contrary to PTR’s aims – which, in fact, are the right aims within the biolinguistic framework that regards language as a natural object, hence subject to empirical investigation in the manner of the sciences, as we have discussed.
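To see why the slash-category move amounts to a stipulation, here is a small sketch of a recognizer for a toy slashed grammar. The rules used (S → aux IP/aux, IP/aux → NP aux/aux V, aux/aux → empty) and the tiny lexicon are illustrative assumptions made up for this example; they are not PTR's grammar nor the exact rules quoted in the note. The function returns the position at which the fronted aux is interpreted.

```python
from typing import List, Optional

AUX = {"can", "must"}    # toy lexicon (assumption)
VERBS = {"eat", "fly"}   # toy lexicon (assumption)


def parse_np(tokens: List[str], i: int) -> Optional[int]:
    """NP -> 'eagles' | 'eagles' 'that' V. Returns the index just past the NP."""
    if i < len(tokens) and tokens[i] == "eagles":
        i += 1
        if i + 1 < len(tokens) and tokens[i] == "that" and tokens[i + 1] in VERBS:
            i += 2  # swallow the relative clause "that fly" as part of the NP
        return i
    return None


def gap_position(tokens: List[str]) -> Optional[int]:
    """S -> aux IP/aux ; IP/aux -> NP aux/aux V ; aux/aux -> (empty).

    Returns the token index where the fronted aux is interpreted,
    or None if the string is not generated by these toy rules.
    """
    if not tokens or tokens[0] not in AUX:
        return None
    after_np = parse_np(tokens, 1)
    if after_np is None:
        return None
    gap = after_np  # aux/aux is written into the rule right after the subject NP
    if after_np == len(tokens) - 1 and tokens[after_np] in VERBS:
        return gap
    return None


question = "can eagles that fly eat".split()
position = gap_position(question)
```

Here `position` is 4, the slot immediately before “eat”. Nothing in the procedure discovers that location: it falls out of where the IP/aux rule places aux/aux. A rule placing aux/aux inside the relative clause would be just as easy to write, which is the sense in which the choice is stipulated rather than explained.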

[1] In what follows it does not matter whether A-to-C is an instance of a more general rule, e.g. move alpha or merge, which I believe is likely to be the case.
[2] For what it’s worth, the Sag analysis that Alex C linked to and I relinked to above fails this requirement.
[3] Whether these biases are FL specific is, of course, another question. The minimalist conceit is that most are not.
[4] One last point: one thing that POS arguments highlight is the value of understanding negative data. Any good application of the argument tries to account not only for what is good (e.g. that the rule can generate must Bill eat) but also account for what is not (e.g. that the system cannot generate *Did the book Bill read amused Frank). Moreover, the POS often demands that the negative data be ruled out in a principled manner (given the absence of PLD that might be relevant). In other words, what we want from a good account is that what is absent should be absent for some reason other than the one we provide for why we move WHs in English but not in Chinese. I mention this because if one looks at Sag’s slides, for example, there is no good discussion of why one cannot have a metarule that targets an Aux within a subject.  And if there is an answer to this, one wants an account of how this prohibition against this kind of metarule extends to the cases of adverb fronting and WH question formation that seem to illustrate the exact same positive and negative data profiles. In my experience it is the absence of attention to the negative data that most seriously hampers objections to the AI arguments. The POS insists that we answer two questions: why is what we find ok ok and why is what we find to be not ok not ok. See the appendix for further discussion of this point as applied to the G/HPSG analyses that Sag discusses.
[5] To repeat, one of the nicest features of the BPYC paper is that it makes clear that the domain of relevant data (the data that needs covering) goes far beyond the standard AI cases in polar questions that are the cynosure of most analyses.


  1. As they say, one man's ponens is another man's tollens. If you have a theory of grammar and then on analysis it is clear that large chunks of it cannot be learned and must be innate, then there are two approaches. One is to say as you do, "Darwin be damned." The other is to question whether your theory might be false.

    Particularly when, as is uncontroversial, the grammars are severely underdetermined by the linguistic evidence available [1], and Chomskyan linguists rely heavily on non-empirical assumptions (like the SMT, full interpretation etc etc) in the process of theory construction. Given these facts, and the antipathy that many here have to theories of learning, it is unsurprising that the theories you come up with are not learnable.
    One conclusion is that Darwin is wrong, the other is that these theories (the standard theory, the revised extended standard theory, P & P, etc etc. ) are wrong.

    [1] e.g. " Choice of a descriptively adequate grammar for the language L is always much underdetermined (for the linguist, that is) by data from L."

    1. There is an interesting presupposition in your answer that I want to make explicit: you seem to assume that empiricist learning is a general cognitive mechanism that explains how cognition functions in other domains. In other words, it treats language as the outlier while other areas of mentation are easily described in empiricist terms (without much innate hardware necessary). So far as I can tell, (see Gallistel and my posts on him for discussion), nothing could be further from the truth. Given that this is so, the standard ML approaches, which make very weak assumptions about domain specific knowledge are probably wrong EVERYWHERE. So even if one is a partisan of Darwin, the specific approaches you seem to favor have very little if any Darwinian street cred.

      Seen in this light, it is not only my theories that are unlearnable (actually they are learnable, or it is reasonable to think they are, given the right set up of the hypothesis space and the right priors) but almost every form of cognitive competence we are aware of. You really should read Gallistel's stuff on classical learning in rats. If he is right, and I believe he is, then classical learning theories of the empiricist (ML) variety are biologically hopeless. I take that to imply that Darwin would not favor them.

    2. I think it is genuinely uncontroversial that there are many behaviours in many species that are innate: spider webs, ungulates walking a few minutes after birth, some bird songs (but not all), nesting behaviours of some birds etc etc. and this can be verified by raising the animal in isolation and so on.
      and it is I think pretty clear that there are learned behaviours that are very highly canalised by some innate structures --- in vision, navigation etc.
      But all of these have some common factors -- they are evolutionarily very ancient (tens of millions of years), and they are clearly adaptive, and there is as a result no problem for Darwin.
      Clearly there is a difference between these things and relative clause extraction.

      It's worth pointing out that Gallistel's notion of a domain is somewhat different from the way that you and I use it --- for example, he considers probabilistic learning a domain, whereas for me it would be a mechanism that could be used widely in many different domains, of which one might be language processing. This terminological difference might account for some of this disagreement.

  2. However, it's different if the contentious theoretical proposals go on to predict true facts for which clear evidence in the PLD is close to nonexistent, which seems to be the case for agreements with quirky and non-quirky case-marked convert subjects in Icelandic. Unfortunately, most of the cases people argue about aren't anywhere near so extreme.

    1. I take it you meant 'covert' subjects? Also did you mean 'fortunately'?

    2. Yes, I do mean 'covert', but also think I mean 'unfortunately', because if the usual cases were more extreme, there would be more solid evidence for UG. Of course the rhetorical force of 'unfortunately' in English is interesting; I recall Howard Lasnik pointing out that it usually means 'unfortunately for the proponents of the idea I am attacking, but fortunately for me'.