Monday, September 16, 2013

Why Formalize?

This is a long post.  I guess I got carried away. In my defense, I think that the topic keeps coming up obliquely and so it’s time to take it head on. The issue is the role of formalization in grammatical research. Let me be up front with the conclusion: there’s nothing wrong with it, it can be valuable if done the right way, and it has been valuable in some cases where it has been done the right way, but there seems to be an attitude floating in the atmosphere that there is something inherently indispensable about it, and that anything that is not formalized is ipso facto contentless and/or incapable of empirical examination.  This, IMO, is balderdash (a word I’ve always wanted to use).  Not only is this false within linguistics, it is false for other sciences as well (I mention a few examples below).  Moreover, the golden glow of formalization often comes wrapped in its own dark cloud. In particular, if done without a sensitivity to the theories/concepts of interest, it risks being imposing-looking junk.  This is actually more deleterious than simple junk, for formalizations can suggest depth and can invite a demand for respect that simple junk does not. This point will take me some time to elaborate, and that’s why the post is way too long. Here goes.

There is an interesting chain of conversation in the thread to this post. In the original, I outlined “a cute example” from Chomsky, which illustrated the main point about structure dependence but sans the string-altering features that T to C movement (TCM) in Yes/No questions (Y/Ns) can induce. To briefly review, the example in (1) requires that instinctively modify the main verb swim rather than the verb fly inside the relative clause; this despite (i) the fact that eagles do instinctively fly but don’t instinctively swim (at least I don’t think they do) and (ii) the fact that fly is linearly closer to instinctively than swim is.

(1)  Instinctively, eagles that fly swim

The analogy to TCM in Y/Ns lies in both adverbial modification and TCM being insensitive to string-linear properties. I thought the example provided a nice simple illustration of how one can be distracted by irrelevant surface differences (e.g. in *Is the boy who sleeping is dreaming, the pair who sleeping is an illicit bigram) and so miss the main point, viz. that examples like this are a dime a dozen and that this surface blemish cannot generally be expected to help matters much (e.g. Can boys that sleep dream means ‘is it the case that boys that sleep can dream’ and not ‘is it the case that boys that can sleep dream’).  At any rate, I thought that the adverbial modification facts were useful in making this point, but I was wrong.  And I was wrong because I failed to consider how the adverbial case would be generated and how its generation compared to that of Y/Ns. I plan to try and rectify this mistake here before getting on to a discussion of how and whether formalization can clarify matters, a point raised in the notes by Alex Clark and Benjamin Boerschinger, though in somewhat different ways. But first let’s consider the syntax of the matter.

The standard analysis of Y/Ns is that T moves to C to discharge some demand of C. More specifically, in some Gs (note TCM is not a feature of UG!), +WH C needs a finite T for “support.” GB analyzes this operation as a species of head movement from the genus ‘move alpha’.  Minimalists (at least some) have treated this as an instance of I-merge. Let’s assume something like this is correct. The UGish question is why, in sentences like (2), can must be interpreted as moving from the matrix T and not from the embedded relative clause T.

(2)  [Can+C+WH [DP boys [CP that (*tcan) sleep]] *(tcan) dream][1]

What makes (2) interesting is precisely the fact that in cases like this there are two potential candidates available for satisfying the needs of C+WH but that only one of them can serve, the other being strictly prohibited from moving. The relevant question is what principle selects the right T for movement? One answer is that the relevant T is the one “closest” to C and that proximity is measured hierarchically, not linearly.  Indeed, such examples show that were locality measured string linearly then the opposite facts should obtain. Thus, these kinds of data indicate that in such cases we had better prohibit string linear measures of proximity as they never seem to be empirically condign for Y/Ns.
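To make the contrast concrete, here is a toy sketch of how the two notions of “closest T” come apart. The hand-built tree and the two selection rules are my own illustrative assumptions (this is not a parser and not a serious analysis); the tree stands in for something like boys that can sleep can dream, with a T inside the subject relative and a matrix T:

```python
# Toy tree: each node is (label, children) and each leaf is (label, word).
# A hand-built stand-in for "boys that can sleep can dream" (assumption).
tree = ("TP",
        [("DP",
          [("N", "boys"),
           ("CP",
            [("C", "that"),
             ("T", "can"),        # T inside the subject relative clause
             ("V", "sleep")])]),
         ("T", "can"),            # matrix T
         ("V", "dream")])

def leaves(node, depth=0):
    """Yield (label, word, embedding depth) left to right."""
    label, rest = node
    if isinstance(rest, str):
        yield (label, rest, depth)
    else:
        for child in rest:
            yield from leaves(child, depth + 1)

toks = list(leaves(tree))
t_positions = [i for i, (lab, _, _) in enumerate(toks) if lab == "T"]

# String-linear "closeness": the first T encountered in the word string.
linear_pick = t_positions[0]
# Hierarchical "closeness": the T embedded least deeply.
struct_pick = min(t_positions, key=lambda i: toks[i][2])

print(linear_pick, struct_pick)  # the two rules pick different T tokens
```

The linear rule grabs the relative-clause T (the one that may not move); the hierarchical rule grabs the matrix T (the one that must). That is the whole point of (2) in five lines.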

So far, I believe, there is no controversy. The battle begins in trying to locate the source of the prohibition against string-linear restrictions like the one proposed. For people like me, we take these kinds of cases to indicate that the prohibition has its roots in UG. Thus, a particular G eschews string-linear notions of proximity because UG eschews such notions and thus they cannot be components of particular Gs. Others argue that the prohibitions are G specific and thus that structure-independent processes of syntax are possible given the appropriate input.  In other words, were our young LADs and LASs (language acquisition devices/systems) exposed to the relevant input they would acquire rules that moved the linearly most proximate T to C rather than the hierarchically most prominent one. Thus, the disagreement, like all disagreements about scientific principles, is one about counterfactual situations.  So, how does one argue about a counterfactual?

My side scours the Gs of the world and argues that linearly sensitive rules are never found in their syntax and that this argues that structure dependence is part of UG. Or, we argue that the relevant data to eliminate the string-linear condition is unavailable in the PLD available to LADs and LASs, and so the absence of string-linear conditions in GEnglish, for example, cannot be traced to the inductive proclivities of LADs and LASs. The other side argues (or has to argue) that this is just a coincidence, for there is nothing inherent to Gs that prohibits such processes, and that in cases where particular Gs eschew string-linear conditions it’s because the data surveyed in the acquisition process sufficed to eliminate them, not because such conditions could not have been incorporated into particular Gs to regulate how their rules apply.

Note that the arguments are related but independent. Should the absence of string-linear conditions from the Gs of the world prove correct (and I believe that it is very strongly supported) it should cast doubt on any kind of “coincidence” theory (btw, this is where the adverbial cases in (1) are relevant). So too should evidence that the PLD is too weak to provide an inductive basis for eliminating the string-linear option (which I also believe has been demonstrated, at least to my satisfaction).[2]

This said, it is important to note that this is a very weak conclusion.  It simply indicates that something inherent to UG eliminates string-linear conditions as options; it does not specify what the structure-relevant feature of UG is.[3] And here is where collapsing (1) and (2) can mislead. Let me explain.

The example in (1) would typically be treated as a case of adverb fronting (AF), along the lines of (3).[4]

(3)  Adverb1 […t1…]

AF contrasts with TCM in several respects. First, it is not obligatory. Thus declaratives without AF are perfectly acceptable, in contrast with Y/Ns without TCM.[5] Second, whereas every finite clause must contain a T, not every finite declarative need contain an adverb.

The first difference can be finessed in a simple (quite uninteresting) way. We can postulate that whenever AF has applied some head F has attracted it. Thus, AF in (1) is not really optional. What’s optional is the F feature that attracts the adverb. Once there, it functions like its needy C+WH counterpart.

The second difference, I believe, drives a wedge between the two cases.  Here’s how: given that a relative clause and a matrix clause will each contain a (finite) T0 and hence both be potential C+WH rescuers, there is no reason to think that they will each contain adverbs. Why’s this relevant? Because whereas for TCM we can argue that we need a principle to select the relevant T that moves, there is no obvious choice of mover required for AF.  So whereas we can argue that the right principle for TCM in (4a) is something like Shortest Attract/Move (SA/M), this would not suffice for (4b), where there is but one adverb available for fronting.[6] Thus, if SA/M is the right principle regulating TCM it does not suffice to regulate AF cases (if, as I assume here, they are species of I-merge).

(4)  a. [C+WH [RC …T…]…T…]
b. [F+ADV [RC …ADV…]…]

What else is required? Well, the obvious answer is something like the CNPC and/or the Subject Condition (SC). Both would suffice to block AF in (4b). Moreover, both have long been considered properties of UG and both are clearly structure sensitive prohibitions (they are decidedly not string linear).[7] However, island conditions and minimality restrictions are clearly different locality conditions even if both are structure dependent.[8]
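The division of labor can be caricatured in a few lines of code. The encoding of a movement dependency as a list of crossed node labels plus a count of skipped competitors is my own toy assumption, not an implementation of any MG; the point is only that an SA/M-style filter passes the bad AF case in (4b) while a CNPC/SC-style island filter blocks it:

```python
# Two toy locality filters over a caricatured "dependency": the labels of
# the nodes the mover crosses, plus how many closer candidates it skipped.
# This encoding is an illustrative assumption, not Stabler's formalism.

def violates_shortest_move(path, skipped):
    # SA/M cares only about whether a closer candidate was passed over.
    return skipped > 0

def violates_island(path):
    # CNPC/Subject Condition sketch: no extraction out of a relative
    # clause (CP) sitting inside a DP.
    return any(a == "DP" and b == "CP" for a, b in zip(path, path[1:]))

# (4a): TCM from inside the subject relative skips the matrix T.
tcm_bad = dict(path=["C", "TP", "DP", "CP", "T"], skipped=1)
# (4b): AF from inside the subject relative skips nothing -- there is
# only one adverb -- yet the result is still bad.
af_bad = dict(path=["F", "TP", "DP", "CP", "AdvP"], skipped=0)

print(violates_shortest_move(**tcm_bad))   # True: SA/M suffices for (4a)
print(violates_shortest_move(**af_bad))    # False: SA/M lets (4b) through
print(violates_island(af_bad["path"]))     # True: the island filter blocks it
```

Nothing here is an argument; it just makes vivid that minimality and islands are different filters, both structure-sensitive, and that only the second one catches the AF case.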

Now this has been a long-winded and overly didactic review of a much over-discussed example. Why do I bring this up again?! Because of some comments by Alex Clark suggesting that the AF facts could be derived in formalized minimalist grammars and that this therefore nullifies any explanation of the kind provided above, viz. that UG explains the data in (1) and (2) by noting that the relevant structures are underivable (my emphasis in what follows):

So here is a more controversial technical claim:
let English+ be English with additionally the single incorrect pairing (s,w2). English+ can be generated by an MCFG; ergo it can be generated by an MG. English++ is English but additionally with the fronted adverbs out of subject relatives; again generable by an MG. (MG means Stabler's Minimalist grammars with shortest move constraint). So I think these claims are correct, and if not could someone technical chime in and correct me.

So Norbert is right that the grammars will look strange. Very strange indeed if you actually convert them from an MCFG. But they are allowed by this class of grammars, which in a sense defines the notion of licit grammatical dependencies in the theory. So Norbert wants to say, oh well if my theory makes the wrong predictions then it has been formalized incorrectly, and when it is formalized correctly it will make the right predictions, Period. But while this is certainly a ballsy argument, it's not really playing the game.

Alex is right: Stabler’s MG with SA/M can derive the relevant AF examples, as the derivation implicit in (4b) does not violate SA/M. That’s why an MG including this (let’s call it ‘SMG’) cannot prevent the derivation. However, as Stabler (here) notes, there are other MGs that code for other kinds of locality restrictions. In fact, there are whole families of such, some encoding relativized minimality (RMG) and some embodying phases (PMG and PCMG). I assume, though Stabler does not explicitly discuss this, that it is also possible to combine different locality restrictions together in an MG (i.e. an RPMG that combines both relativized minimality and phases). So what we have are formalizations of various MG grammars (SMG, RMG, PMG, PCMG, and RPMG), all with slightly different locality properties, that generate slightly different licit structural configurations. Stabler shows that despite these differences, these restricted versions of MG all share some common computational properties, such as efficient recognition and parsability.  However, they are, as Stabler notes, also different in that they allow for different kinds of licit configurations, PMG/PCMGs blocking dependencies that violate the PIC and RMGs blocking those that violate relativized minimality (see his section 4). In sum, there are varieties of MGs that have been formalized by Stabler and Co., and these encode different kinds of conditions that have been empirically motivated in the minimalist literature.[9] There is no problem in formalizing these different MGs, nor in recognizing that despite being different in what structures they license they can still share some common general properties.

Three observations: First, I leave it as an exercise for the reader to code island restrictions (like the CNPC) in phase based terms. This is not hard to do given that phases and the original subjacency theory (i.e. employing bounding nodes) are virtually isomorphic (hint: D is a phase without an accessible phase edge).[10]

Second, the Stabler paper offers one very good reason for formalizing grammars. The paper shows that different theories (i.e. those that characterize UG differently) can nonetheless share many features in common.  Though empirically relevant, the different locality conditions do not differ in some of their more general computational features. Good. What we see is that not all empirically different characterizations of FL/UG need have entirely different computational properties.[11]

Third, Stabler recognizes that the way to explore MG and UG computationally is to START with the empirically motivated features research has discovered and then develop formalizations that encode them. More pointedly, this seems to contrast with Alex’s favored method of choosing some arbitrary formalism (simple MG) and then insisting that anyone who thinks that this is the wrong formalism (e.g. moi) for the problem at hand is (though “ballsy,” hmm, really! OK), “not really playing the game.”  Au contraire: to be interesting the formal game requires formalizing the right things.  If research has found that FL/UG contains islands and minimality, then to be interesting your formalization had better code both these restrictions. If it doesn’t it’s just the wrong formalization and is not, and should not, be part of any game anyone focused on FL/UG plays. There may be some other game (as David Adger suggests in his comments on the post) but it is arguably of no obvious relevance to any research that syntacticians like moi are engaged in and of dubious relevance to the investigation of FL/UG or the acquisition of grammar. Boy, that felt good to say![12]

Now, it is possible that some grammars encode inconsistent principles and that formalization could demonstrate this (think Russell’s Paradox and Frege’s naïve set theory). However, this is not at issue here. What is at issue is how one conceives of the proper role of formalization in this sort of inquiry. Frankly, I am a big fan. I think that there has been some very insightful work of the formalizing sort.[13] However, there has also been a lot of bullying. And there has been a lot of misguided rhetoric. Formalization is useful, but hardly indispensable. Remember, Euclidean geometry did just fine for thousands of years before it was finally formalized by Hilbert; so too the Calculus before Cauchy/Weierstrass. Not to mention standard work in biology and physics, which though (sometimes) mathematical is hardly formalized (not at all the same thing; formal does not equate with formalized). What we need are clear models that make clear predictions and that can be explored. Formalization can help in this process, and to the degree that it does, it should be encouraged. But PULEEEZE, it is not a panacea and it is not even a pre-requisite for good work.  And, in general, it should be understood to be the handmaiden of theory, not its taskmaster. To repeat, formalizations that formalize the wrong things or leave out the right things are of questionable value. Jim Higginbotham has expressed this well in his discussion of whether English is a context free language (here).[14] As he put it:

…once our attention turns to core grammar as the primary object of linguistic study, questions such as the one that I have tried to answer here are of secondary importance (232).

What matters are the properties of FL/UG. Formalizations that encompass these are interesting and can be useful tools for investigating further properties of FL/UG. But, and this is the important part (maybe this is what makes my attitude “ballsy”), they must earn their keep, and if they fail to code the relevant features of “core grammar” or FL/UG then, no matter how careful and precise their claims, I don’t see what they bring to the game.  Quite often what we find is all hat, no cattle.

[1] The traces are here to mark the possible base positions of can. (*…) means ‘unacceptable if included’ while *(…) means ‘unacceptable if left out’.
[2] There are also actual language acquisition studies like Crain and Nakayama that are relevant to evaluating the UG claim.
[3] Indeed, it need not be a feature of UG at all, at least in principle.  Imagine an argument to the effect that learning in general, i.e. whatever the cognitive domain, ignores string-linear information. Were that so, it would suffice to explain why it is so ignored in the acquisition of Gs.  However, this view strikes me as quite exotic and, I believe, would cause no small degree of problems in the acquisition of phonology, for example, where string-linear relations are where all the action is.
[4] I assume that this is how to treat AF. If, however, adverbs could be base generated sentence initially and their interpretation were subject to a rule like “modify the nearest V,” then AF phenomena would be entirely analogous to TCMs. The main reason I doubt that this is the right analysis comes from data like that in note 7, where it appears that adverbs can move quite a distance from the verbs they modify. This is certainly true of WH adverbs like when and how, but I also find cases like (i) in note 7 acceptable with long distance readings.
[5] TCM also seems obligatory in WH questions, though there is some debate right now about whether TCM applies in questions like who left. For Y/Ns, however, it is always required in matrix clauses. Here is a shameless plug for a discussion of these matters in a recent joint paper with Vicki Carstens and Dan Seely (here).
[6] I hope it goes without saying that I am simplifying matters here. The relevant F is not +ADV for example, but something closer to +focus, but none of this is relevant here as far as I can see.
[7] There is one further important difference between TCM and AF. The latter is not strictly local. Thus, in (i) tomorrow can modify either the matrix or the embedded clause. Nonetheless, it cannot modify the relative clause in (ii):
(i)             Tomorrow, NBR is reporting that Bill will be in Moscow
(ii)           Tomorrow, the news anchor that is reporting from DC will be in Moscow
[8] I am setting aside the question whether there is a way of unifying the two. It is to be hoped that there is, but I don’t know of any currently workable suggestions. Such a unification would not alter anything said below, though it would make for a richer and nicer theory of FL/UG.
[9] In other papers Stabler considers MGs that allow for sidewards movement and those that don’t. It is nice to see that to date the theoretical innovations proposed have been formalizable in pretty straightforward ways, or so it appears.
[10] Stabler observes the obvious parallelism with bounding node versions of Subjacency. The isomorphism between the older GB theory and modern Phase Theory means that Phase Theory does not advance our understanding of islands in any minimalistically interesting way. However, for current purposes the fact that Phases and the PIC can code the CNPC and SC suffices. We all welcome the day when we have a deeper understanding of the locality restrictions underlying islands.
[11] IMO, this is the best reason to formalize: to see if two things that look very different may nonetheless be similar (or identical) with respect to other properties of interest.  It is possible to be careful and clear without being formalized (indeed, the latter often obscures as much as it enlightens). However, formalizing often allows one to take a bird’s eye view of theoretical matters that matter, and when used thus it can be extremely enlightening.
[12] Let me insist: this is not to argue that formalization does not have a place in such investigations. Rather, it is to argue that the fact that some formalization fails to make the relevant cut is generally a problem about the formalization not the adequacy of the empirical cut. See below.
[13] I am a big fan of Bob Berwick’s work, as well as that of Stabler and Co., Tim Hunter’s (with Chris and alone), and the stuff on phonology by Heinz and Idsardi, to name a few. What makes these all interesting to me is their being anchored firmly in the syntactic theory literature.
[14] Thx to Bob Berwick for the reference.


  1. This may be overly pedantic but I think there's a point in the background here (and in all the discussion on the previous post) that has not been made explicit: if we want to conclude anything about structure-dependence from "Instinctively, eagles that fly swim", then we need to at least attempt to show that no non-structure-dependent rule works. And that single example does nothing to discredit a rule like "A fronted adverb always modifies the last verb in the sentence". Of course it's not hard to put together some relevant example with a complex object to spoil the party for that "last verb" rule, but then you have to make sure you also show that a "second verb" rule can't work, etc. In a way it may seem harmless to skip over these steps of the argument because everyone's familiar with the facts, but I think skipping over them sometimes clouds the issue.

    Starting off by saying "Look, it's impossible to form this dependency into a subject", and stating this as if it's an observation, is begging the question entirely (because "subject" is a structural notion).

    1. This last sentence is a deep question that I've been struggling with for the last few months. Our generalizations are often stated in highly theoretical terms and it seems very difficult to strip down the theory to its core and really figure out what's being said. Constituency, categories, etc. are all highly theoretical, and we're not even into the fancier stuff like islands. It gives me much angst.

  2. Goodness, quite long. But, just a brief comment before I read it all: "who sleeping" is quite the licit bigram! After all, I'm a guy who sleeping appeals to quite often! In fact, sleeping appeals to me pretty much every time I'm trying to wake up!

  3. Just extracting from that comment above (i.e. without delving into the old comments) it sounds like the point is not to say the line of argument is wrong. Rather, sounds like the point was to say that, upon closer inspection, your theory actually doesn't explain the pattern because the cleaned up version of the theory allows us to inspect very closely and find a derivation that gets a wrong fact. And then you replied saying yes but there are other ways to clean it up that get the facts right - so the theory isn't wrong, in fact. Sorry if this is incoherent. But was that it?

    1. Not exactly. The point I was trying to make was that both TCM and AF provide evidence that Gs DON'T use structure INdependent processes. However, this does not imply that they use the SAME structure dependent ones. On the most plausible analyses, I suspect that they don't (though the island explanation used for AF extends to TCM, i.e. the CNPC generalizes to both; SA/M does not). Moreover, this was easily seen once one considered the analyses available. The theory did not need cleaning up; the question at issue changed.

      However, I had a larger fish that I wanted to fry: that when one looks at some formalization/theory and concludes that it fails, it is very useful to ask WHY it fails. And then to ask whether it can be made whole or whether the data are truly inconsistent with the major lines of the theory. In this context, Alex took one version of MP to be definitive and concluded that MP provided no account of these phenomena. However, (big surprise) like any story, MP involves a family of theories that can be formalized in many ways. So, saying MP does or doesn't derive the discussed data requires fixing what the question is and what the theories at issue are. I take none of this to be anything but anodyne.

  4. Often the exact formalisation doesn't matter. I think the problem here is with the theory, not the formalisation of it. But I do think that the inadequacies of the argument have been hidden by the lack of formalization, so I guess there is an argument for formalization lurking there somewhere.

    So my argument is something like:

    Premise A:
    Any formalisation of your theory of UG you come up with, if it allows English, will allow English+.

    Your theory of UG does not explain *on its own* why we see English rather than English+.

    So we can argue about whether this is a valid argument or we can argue about whether premise A is true.

    Again if one of the technical people reading this thinks that Premise A is wrong, I would like to be put straight.
    But generally formalizations like this can
    a) generate any finite set of sound-meaning pairs and
    b) are closed under union,
    from which two assumptions Premise A follows.
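    For the nontechnical reader, the shape of this argument can be sketched with sets of sound-meaning pairs standing in for the languages the grammars generate. This is an illustrative simplification of my own (the substantive claim is that MCFG/MG-definable languages have properties (a) and (b)); the pair names are placeholders, not the actual example sentences:

```python
# Sets of (sound, meaning) pairs stand in for generated languages.
english = {("s", "w1"), ("some other sound", "its meaning")}

# (a) any finite set of pairs is generable, including the bad pair alone:
exception = {("s", "w2")}

# (b) the class is closed under union, so if English is generable,
# English+ = English plus the single bad pair is generable too.
english_plus = english | exception

print(("s", "w2") in english_plus)  # True
```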

    But you might also say that I am missing the point of the example which you might claim is about admissible structures rather than admissible sound meaning pairs.
    I would say that
    a) we don't observe the structures. Syntactic structures aren't data, they are theoretical constructs that we posit to explain certain aspects of the sound/meaning relation. (Tim H's point which is also "anodyne") And
    b) crucially, (s,w2) (which is the example in English+ - English) has a well-formed structure/sound/meaning mapping *by definition* , since it is generated by a grammar in your class which defines the class of legitimate structures.
    c) anyway you explicitly said that "Berwick et. al. emphasize that the Poverty of Stimulus Problem has always aimed to explain “constrained homophony,” a fact about absent meanings for given word strings.."

    So how then does your UG explain the "cute example"? My view is that MGs, or any reasonable formalisation of them, are far too unconstrained to explain some detail like that.

    (ETA: I really think that the mathematical study of the set of allowable sound-meaning relationships is really important, and understudied, and you should all look at a recent paper which I think will turn out to be pivotal: Makoto Kanazawa and Sylvain Salvati. 2013. The string-meaning relations definable by Lambek grammars and context-free grammars.
    It's very dense so just read the abstract. So this could be the right place to look for some very good explanations.)

    1. @Alex C. Would it be fair to roughly paraphrase the claim you are making as the claim that we do not currently have an adequate formalization of island constraints? The existence of island constraints is of course an undeniable descriptive fact[1], and it also seems overwhelmingly likely that ‘instinctively’ cannot associate with ‘fly’ due to the presence of subject and RC islands. I'm guessing you would agree up to this point (without assuming that the relevant island constraints necessarily derive from constraints on grammars themselves). So is the problem in your view just that no-one has so far succeeded in adequately formalizing the relevant island constraints (or the deeper principles that give rise to them)? Or rather, that no-one has succeeded in doing this in a way which does not presuppose some kind of evaluation measure which can block the acquisition of “crazy” grammars? If it's the latter then there may not be a very big point of disagreement here, since at least some Chomskyans (including earlier iterations of Chomsky himself) are open to the possibility that the evaluation measure plays a significant role.

      [1] There are interesting exceptions, but none that appear to be relevant here.

    2. How indeed. The explanation goes as follows: given that the words mean what they do and that adverbs that modify verbs do so by being related to them in some way, the question is why doesn't 'instinctively' modify 'fly'? The answer is that to establish the relevant relation it would have to be mediated by a kind of dependency that UG disallows. What can we say about this prohibition? First, that it is not string linear: e.g. given the case at hand, and given that words mean what we think they do, the rule cannot be 'modify the string-linearly closest verb'. Here's another thing we can say: the dependency is not regulated by minimality as coded in MG. Here's another thing we can say: if some version of PMG is right (one that codes for islands) then we could explain it. That's it.

    3. @Alex C:

      I am most definitely a nontechnical person so would like a little more detail on how Premise A is true. English+ is not meant to be the same as the English'' from the "A cute example" post, is it? Because those two languages, as you've defined them, are not the same. English'' is English with additional lexical items from English' (which was English with a shuffled lexicon), while English+ was "English with additionally the single incorrect pairing (w,s2)".

      If this was meant to be a description of English'', it's not accurate. There are many more differences between English'' and English+ than simply whether they allow (s,w2). For instance, English'' allows a sentence like "I like instinctively eagles fly", meaning 'I like eagles that fly'; this is also not allowed in English. Maybe this is a point that doesn't need to be made, but I find it misleading to describe English+ the way you do (unless you don't mean it to be identical to English''). As far as I can tell, you can't get (s,w2) unless you *also* get sentences like "I like instinctively eagles fly", and so on.

      David Adger made a similar point in a comment on the other thread, though I'm not sure I agree with him. He says that English'' probably wouldn't be learned for the reasons I just mentioned, but actually if we are serious about UG then in fact English'' should be *easier* to learn than English+. UG doesn't allow extraction from specifiers, so (by hypothesis) a language that includes both (w,s1) and (w,s2) shouldn't be learned *unless* it also includes (e.g.) both "I like instinctively eagles fly" and "I like eagles that fly", both with the same meaning ('I like eagles that fly'), which serve to clue the learner into the fact that there are two homophonous lexical items "eagles", and two homophonous lexical items "that", and two phonologically distinct words meaning 'eagles', and two phonologically distinct words meaning 'that', and so forth. Of course, whether such a grammar actually *is* easier to learn is an empirical question; but I think our current theory of UG makes a testable prediction in this case.

      (continued below...)

    4. (...continued from above)

      So English'', if it actually existed, would be somewhat akin to one of those "apparent" (but not actual) counterexamples which are usually trotted out towards the end of a linguistics paper. English+ would be an actual counterexample. A language like English'' is permitted by our theory of UG, as it should be. Then why don't we see any languages like English'' in the real world? I imagine two factors are at play.

      (1) The amazing coincidence that would have to have transpired for just the right interaction of homophony and synonymy that we see in English'' (e.g., there are two words pronounced "that", one of which means 'instinctively', and there are two words pronounced "eagles", one of which means 'that', and there are two words pronounced "instinctively", one of which means 'eagles'). UG doesn't rule out languages like English'', but the probability of a language with just the right lexicon is so low that we just don't see them.

      (2) As coincidental as such a language would be, I imagine its likelihood would diminish even further as a result of functional constraints on language transmission and ease of learning, as D. Adger suggested. (Though I stress again that if we're serious about UG then English'', while harder to learn than English, should still be easier to learn than English+.) Though natural language is known to tolerate tons of ambiguity, I imagine a language like English'' would be a bit too much for even NL (or, rather, a learner of NL) to handle. One could see this appeal to functionalism as an admission of defeat, but I don't see it as such. Generativist explanations are not fundamentally opposed to functionalist ones, and can often be complementary (as argued by e.g. Newmeyer); on the other hand, generativism and functionalism often do not even seek to solve the same problems, and so again are not diametrically opposed. While a generativist explanation is, I suppose, not needed to account for the absence of English'', I don't think a purely functionalist approach can explain the absence of languages like English+ (which, again, is *not* English'').

      Apologies if I've totally misunderstood; the points made above are invalid if English+ is not meant to be something like English''. If that's the case, is it possible to see in just what way Premise A is true, dumbed down just enough for a nontechnical person like me?

    5. Apologies for the slightly rushed responses:

      @RG: English+ was meant to be the language (set of sound-meaning pairs) that is just English with the single "wrong" additional sound-meaning pairing (s,w2). So different from English''.

      So the idea here is that natural languages often have exceptional constructions that are idiosyncratic both syntactically and semantically ("by and large", "I could care less", etc. etc.), and any adequate grammar formalism must be able to represent some finite set of these exceptions. And we can give these analyses syntactic features that are not shared by any other lexical items, so we can then take the union of the grammar for English with a mini-grammar that just generates the pair (s,w2), and the resulting grammar will just generate English+.

      You say "Or rather, that no-one has succeeded in doing this in a way which does not presuppose some kind of evaluation measure which can block the acquisition of “crazy” grammars? If it's the latter then there may not be a very big point of disagreement here, since at least some Chomskyans (including earlier iterations of Chomsky himself), are open to the possibility that the evaluation measure plays a significant role.".
      I guess that is my point, but of course saying "evaluation measure" is vacuous: which measure is it?
      And I think this is very contentious. Because if the explanatory work is being done by the evaluation measure, then the arguments about UG become largely empty.
      Hale and Reiss make this point well, I think, in their book The Phonological Enterprise, when they talk about assimilating UG to the LAD.

    6. @AlexC Does this mean that if there is NO evaluation measure and the island restrictions are absolute then there is no problem? Recall Alex D's question noted that there are two schools of thought re Eval Measures. P&P theories have tended to eschew these. But is this right: if there are no eval measures, and so no exceptions to islands for example, then your point is moot?

    7. Hi Norbert; I don't follow. It is the absence of the evaluation measure that causes the problem. Sorry if I misunderstand, but I am about to catch a plane and am a bit rushed.

      P&P models don't have these problems I think. Or rather wouldn't have if they were specified fully.

    8. Thx. I was assuming that islands were not parametrized and that there was no way of evading them using more complex rule formats, as was possible in earlier eval-measure-based accounts. So, IF this proves correct, there is no argument? Or does the argument concern whether we need eval measures in UG?

    9. Because if the explanatory work is being done by the evaluation measure, then the arguments about UG become largely empty. (Alex C)

      Well, the explanatory work is being done by a combination of the innate constraints with the evaluation measure. And as Chomsky used to emphasize in the old days, the evaluation measure is built in just as much as the constraints are. I'm just guessing here, but I'd assume that even hardcore P&P advocates don't really think that there's no evaluation measure. They just think that the universal constraints are so strong that pretty much any sensible evaluation measure would do the job (three lexical items is better than 10^15, all else being equal, that sort of thing). In other words, what P&Pers ought to say is not that there's no evaluation measure, but rather that the evaluation measure plays a negligible explanatory role as compared to the universal constraints. I completely agree with you that if we reject P&P, the evaluation measure ought to receive a lot more attention. However, in the case of a trivial POS argument such as subject/aux inversion, I think the argument can be profitably run without having a precise theory of the evaluation measure.
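The division of labour described here (strong constraints doing most of the work, plus a nearly trivial evaluation measure) can be caricatured in a few lines. This is purely an illustration; the "grammars", the entry notation, and the size metric are all invented for the example:

```python
# Illustrative sketch: a crude size-based evaluation measure.
# A "grammar" here is just a set of lexical entries; the measure
# prefers the grammar with fewer entries, all else being equal.

def grammar_size(lexicon):
    """Number of lexical entries: a stand-in for a real complexity metric."""
    return len(lexicon)

def evaluate(grammars_consistent_with_data):
    """Return the grammar the measure ranks highest (smallest here)."""
    return min(grammars_consistent_with_data, key=grammar_size)

# Two toy grammars that (by stipulation) both fit the data:
g_small = {"eagle :: N", "fly :: V", "that :: C"}
g_large = {f"item{i} :: X" for i in range(10_000)}

best = evaluate([g_large, g_small])
assert best is g_small  # three entries beat ten thousand, all else equal
```

The point being made above is that with strong enough universal constraints, almost any sensible choice of `grammar_size` picks the same winner, so the measure does little explanatory work.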

    10. That's a good point. But note that the reaction of Norbert and David A is not to say, oh ok so we need to pay some attention to the evaluation measure/learning procedure/inference mechanism, but rather to say that we need to tighten up the notion of UG. I think in the context of a discussion about the innateness of the structure dependence of syntax, the claim that the evaluation measure has any role to play can be seen as an admission that structure dependence is at least partially learned.
      Which may or may not be acceptable depending on your prior commitments.

      Further, I think that if you think the explanation is a theory of UG, say H, and an evaluation measure E, then it is hard to see why you can't have a much larger hypothesis space H' and another evaluation measure E' which gives the same effect -- i.e. ranks H ahead of H'; and so the end point of this is just a hypothesis space which represents a very general class of grammars and a specific evaluation measure which does all the work. And so there is a "slippery slope" argument here which may be why admitting any theoretically interesting role for the learning algorithm has been anathema.
      Perhaps also related to why even using the word "learning" for language acquisition is considered problematic.

    11. I am happy to buy Alex D's version. I don't see that it makes much of a difference, with one caveat: you take the Eval Measure to clearly be domain general. If it is, great. Fine with me. I am happy to suppose that it is. I also have nothing against thinking that the hypothesis space is very large and that what explains the grammars we end up with is the properties of the procedures used to navigate this domain, though I may differ from you in allowing this procedure to have some non-domain-general features (that's an empirical question, as they say). This was Chomsky's proposal in Aspects and I have nothing in principle against it. Chomsky's main objection to it had to do with the feasibility of such a theory, but I've never been that clear on how it is that P&P theories finesse the feasibility issue. I've even suggested in print that we should reconsider the earlier Eval Measure view, as I have less faith in P&P theories than I once had. So, I see nothing amiss about having learning algorithms or Eval Measures as part of an explanation of how Gs are acquired. That would be fine with me. All I require are proposals that explain what I am interested in (e.g. Island effects, Minimality, Binding effects, etc.) in these terms. So go ahead and derive some of these for me using these general learning algorithms and I am on board.

      Last point: re "learning" my objection to the term is that it has been bleached of all content when used carefully but has connotations that I believe to be misleading. But here, you agree with Lila and both of you disagree with me. Oh well.

  5. Alex, strings aren't data then either. The only "data" you have is the raw sound stream. If we want to do any worthwhile theory we have to abstract. Our grasp of the generalisations about structure is pretty good, and the ones under discussion are certainly good enough to say that there is no structural dependency available between the adverb and the verb in the subject. That's the fact we want our theory to capture and, as I pointed out on the other thread, we can do that via the kind of derivation I provided. If you aren't willing to abstract beyond strings, good luck with explaining any kind of intra- or inter-language generalisation! And ask why "strings" are an ok abstraction from phonetic multidimensional structures, while multidimensional syntactic structures aren't. I think strings are just a deeply wrong model for dealing with anything linguistic at really any level of abstraction.

    1. On the data question, data doesn't have to be merely raw sound streams; it can be slightly structured and interpreted, but crucially, at least if we're using Bogen and Woodward's stance, it has to be contingent on experiment. That means judgments are data; grammaticality or not isn't. A big Excel spreadsheet with sentences and yes/no flags is data if it's one person's judgments and we understand that it's tied to time and context and so forth. But if you take a bunch of such spreadsheets, and you the theorist process them to extract the "truth" of whether some sentence is grammatical or not, that ain't data, that's phenomenon. Trees, though, certainly aren't data; they're not phenomena either; they're part of the explanatory apparatus.

    2. Sorry, I shouldn't have been facetious. Of course raw sound streams aren't data for building syntactic theories. I was just making the point that you have to abstract at each level. For building a theory of the Language Faculty, the relevant data (or explananda if you prefer) are generalisations (like the generalisation that you can't create a dependency from C into the T of a relative clause embedded in a subject). These generalisations are what your theory is developed to account for, and the theory so developed ideally goes beyond those generalisations to new ones. So the data for theory building are not sound-meaning pairings, contrary to what Alex C assumes in his argument - I made this point on the other thread `A cute example' but obviously not very well! To account for particular patternings in sound-meaning pairings you need to build an analysis, not a theory, and judgments are the data (or explananda) for analyses, where an analysis is a particular configuration of the primitives you want to posit (possibly constrained by your theoretical framework, but perhaps you want to try other things, so possibly not). Trees, features, categories etc. are the primitives of your analysis, but a theory is not constituted of trees (or not necessarily anyway, pace TAG). A theory is the system that gives you the trees (or derivations, or whatever you think the right model for the analysis is). So what goes into the theory (the atoms and algorithms that constitute it) are motivated by generalisations that emerge from analyses: these are the explananda of the theory.

      I think that the fundamental assumption that Alex (C) and I don't share is that the import of the Chomsky sentence really is the generalisation that the example supports. You can state this generalisation in various ways, depending on how you want to set up the primitives (via movement dependencies in constituent structures, via admissible types and combinatorial operations in a categorial grammar, via features and feature-passing dependencies in a GPSG, or whatever), but the generalisation is fundamental to ensuring that you develop the right kind of theory. I think that a theory which says that the primitive operations are string-building operations imbued with linear information and restricted to contiguous relations is utterly hopeless for capturing this kind of generalisation, and since this kind of generalisation is true (or true enough to be interesting), such theories are no good as theories of grammar.

      You could, as Norbert mentioned, play the game another way (by saying that the system generates the bad meanings for the string, but that something else means they are inaccessible - e.g. processing or whatever), but I think the evidence doesn't make that look massively hopeful at the moment. But I guess that is not what is at issue. Stablerian MGs with his Specifier Impenetrability constraints, as specified in his recent Topics in CogSci paper, will just not generate the bad examples, and hence will capture the relevant generalization.

    3. So I think we just have a terminological misalignment between what we call "data" and "theory". So the architecture for me is:
      we have an internally represented grammar G that generates some structural descriptions SD, and we map each expression to a sound-meaning pair (or maybe set of pairs) like (s,m), where s is a flat sequence of acoustic categories (phones or phonemes, perhaps with some prosodic info mixed in) and m is an expression in some meaning representation language, say lambda expressions.
      So there is lots of complicated structure in the grammar and the SDs, but we don't observe these; but we do observe the outputs (modulo some caveats about the meanings because we don't know what they are).
      So the data then are the uncontentious facts about the sound-meaning pairs that are allowed (or not allowed) in the language.
      And the theoretical constructs are the LAD, the various Gs, the sets of SDs generated by the individual Gs etc.

      So obviously there are also components that map the sequence of acoustic categories to lip and tongue movements, and map sounds to acoustic categories, and any complete psychological theory has to account for this, and for the acquisition of these mappings, but I think it is reasonable to cut it at this point for all of the standard reasons (e.g. it's not specifically linguistic, there aren't any theoretically interesting learnability issues, the existence of writing systems that represent language as a flat string of discrete symbols, this is the point where the continuous and the discrete systems interface, animals can be trained to recognize acoustic categories, there is almost no debate about the phonemic inventory of particular languages etc. etc.).
      Whereas I think it is completely unreasonable to take the data to be the hypothesized SDs, for reasons that are basically the same as the list above but with *not* inserted everywhere.

      I genuinely thought that my view was the orthodox one. Am I incorrect? (About whether my view is standard or not, that is, not about whether it is right :)

    4. I'm not sure how orthodox or not your view is, but I was just saying that, for the construction of a linguistic theory, there are explananda that go beyond particular sound-meaning pairs. So the generalization that you can't topicalize an adverb from inside a subject holds of English, and every other language that I know or have tested this on (quite a few from a typology class I taught a few years back). This generalization across languages is data for the construction of a theory of the human linguistic capacity. Is the generalization an `observable'? I guess I don't really know what `observable' means in this case. All of our observations are deeply theory laden, because they involve abstraction. (Think of the constructional homonymy cases like `visiting relatives can be a pleasure'. Are there really two meanings, or is it just vague? Is this just one string? Well, varying the stress placement rules out particular readings. Do we abstract away from varying stress placement? All of these are choices we make about which properties we are happy to ignore, and which ones we aren't.) I take the point of the Chomsky example to be quite an abstract one about the generalization that adverb topicalization from inside a subject is disallowed generally, and that's what is in need of explanation, which is why I wasn't (and still amn't) convinced by the English' argument (where you redefine the lexical items so that `instinctively' means `eagles' etc). I am convinced by what I think is a slightly different argument, which is that you can use another technology to get the dependency to work (passing up features), but that just means our theory should be ruling out that technology in these cases, as the core explanandum (i.e. the generalization) still needs to be explained. I think that makes sense, but have spent the whole day so far arguing with the university Finance department, so my brain may be addled!

    5. I agree, of course, but I believe that the problem is how to define features so that passing them up cannot occur in the guise of not passing. We need some handle on the idea of a feature so that these rogue elements are banned. Now, as I understand matters, the features we have used within MG have been rather anodyne: case, agreement, +WH, Top, etc., even subcat features that are very local. At any rate, these features are pretty safe, it seems. Maybe as a working hypothesis we restrict ourselves to features that are morphologized in SOME language and ban all those that never are. This is an old GB chestnut and may serve to ban exotica till we better understand the formal problem. Of course this will have the effect of drastically cutting down what counts as a feature and what kinds of operations we can avail ourselves of, but this seems ok, at least to me, as a way of plugging this unwanted very big hole.

  6. "Stablerian MGs with his Specifier Impenetrability constraints, as specified in his recent Topics in CogSci paper will just not generate the bad examples, and hence will capture the relevant generalization."

    This statement is not quite true without further qualifications. Yes, adding the SPIC to MGs lowers their weak generative capacity by disallowing extraction from specifiers, but this does not imply that MGs cannot generate strings that linguists would analyze as involving movement from such a position.

    The thing is, many instances of Move can be replaced by Merge, even without significant changes to the phrase structure. Rather than moving XP from A to B, we can Merge an empty category at A that requires XP to be merged at B, e.g. with a mechanism similar to slash feature percolation. This strategy does not work for cases that require head movement or remnant movement, so those will be affected by the SPIC. But the example above is not such a case. If you want the SPIC to block the tree for the illicit reading, you have to posit a universally fixed set of categories and prove(!) that this set cannot be exploited in a fashion that makes slash feature percolation possible.
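To make the slash-percolation strategy concrete, here is a minimal sketch in the GPSG spirit, not Stabler's actual formalism: a movement-like dependency handled entirely by Merge, by threading a "slash" annotation upward through category labels. All category names and the `project` helper are invented for the illustration.

```python
# Illustrative sketch: "S/WH" means "an S with a WH-sized gap".
# The gap is introduced by an empty element and the slash percolates
# up at each Merge step, so no Move operation ever applies.

def project(result, children):
    """Build a node label; any undischarged slash on a child percolates up."""
    slashes = [c.split("/")[1] for c in children if "/" in c]
    assert len(slashes) <= 1, "at most one gap in this toy system"
    return result + ("/" + slashes[0] if slashes else "")

gap = "D/WH"                                  # empty element at the gap site
vp  = project("VP", ["V", gap])               # slash percolates: "VP/WH"
s   = project("S",  ["D", vp])                # still percolating: "S/WH"
cp  = project("CP", ["WH", s.split("/")[0]])  # filler merged, slash discharged

assert (vp, s, cp) == ("VP/WH", "S/WH", "CP")
```

The dependency between the filler and the gap is thus enforced purely by local Merge steps over refined categories, which is exactly why the SPIC, which only constrains Move, does not see it.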

    My hunch is that Alex's resistance to Norbert's argument is at least partially due to the fact that i) nobody has given him such a proof, and ii) more importantly, he does not believe that the set of categories is fixed across languages.

    1. Thomas: yes exactly. Norbert was explicit on an earlier post that the assumption that "the set of categories is fixed across languages." was NOT part of his theory. So I don't think his argument goes through.

    2. So let me get this straight: if the set of categories that induce islands, e.g. CNPC and Subject Islands, is fixed universally, the bulk of the problem goes away?

      Also, I'm a little confused about the problem of proving that one cannot merge across an island. Why can't one use Stabler's PMGs, suitably restricted universally so that the relative-clause C and D are phase heads with D having no relevant edge, to derive the impermeability of subjects? If PMGs code for phases and phases have this feature, why can't an MG prohibit such grammatical commerce?

    3. So far as I know, there is little reason to think that the nodes relevant for islands are NOT fixed across grammars. There was talk for a while of a parametrized theory of bounding nodes, but this was quickly given up. So, is the following correct: if the relevant island-inducing nodes DO NOT VARY ACROSS GRAMMARS then your problem goes away?

    4. You have to fix the set of all categories, not just which ones count as islands. Basically, you have to say something like "all lexical items are Cs, Ts, Vs, Ns, or Ds, and that's it".

      The problem is that category features can be abused to enforce non-local dependencies in a local fashion via Merge, including certain dependencies that are usually handled via Move. Slash feature percolation is a common example of this, but things can get a lot sneakier and elaborate than that. The SPIC does not regulate these "camouflaged" dependencies.

    5. A quick addendum: Enforcing non-local dependencies via Merge does come at the cost of blowing up the size of the lexicon, often to a ludicrous degree (for the technically minded: the blow-up is linear in the size of the original lexicon but exponential in the size of the tree automaton encoding the dependency). Replacing Move by Merge can take you from a lexicon with 2 lexical items to one whose number of lexical items exceeds the number of seconds since the Big Bang.
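The blow-up can be made concrete with a back-of-the-envelope sketch (the exact exponents depend on the encoding, so all figures below are illustrative). If the dependency is enforced by a tree automaton with q states, each lexical item of arity k gets split into one refined copy per assignment of states to its result and argument positions, roughly q^(k+1) copies:

```python
SECONDS_SINCE_BIG_BANG = 4.35e17  # ~13.8 billion years, rough figure

def refined_lexicon_size(n_items, n_states, arity=2):
    """Each item gets one refined copy per choice of automaton state
    for its result and for each of its (up to `arity`) arguments."""
    return n_items * n_states ** (arity + 1)

# A small automaton barely inflates things:
assert refined_lexicon_size(2, 3) == 54
# But the state count can itself be exponential in the size of the
# constraint's description, at which point 2 items explode:
assert refined_lexicon_size(2, 610_000) > SECONDS_SINCE_BIG_BANG
```

This is the "linear in the lexicon, exponential in the automaton" shape of the result: the multiplier on `n_items` is fixed by the automaton, but that multiplier can be astronomical.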

      Maybe this could be used to construct a learnability argument along the following lines: 1) If possible, use Move rather than Merge to model certain dependencies. 2) Since Move is the preferred option, the Merge grammar analysis is never entertained for our example sentence. 3) SPIC blocks Move. 4) No illicit reading.

  7. This is getting interesting and I completely agree. If you use some other means to get the dependency than internal Merge, then you could weakly generate the string, and perhaps even strongly generate a structure that will support the meaning. But you don't have to fix the categories. You just have to have a theory that doesn't allow category-valued features. For such a theory, see my `A minimalist theory of feature structures' in the Kibort/Corbett CUP `Features' book (draft on lingbuzz). There I propose that the admissible categories in human language don't have any more complexity than being bundles of feature-value pairs where the values are atomic in nature. I try to derive this from a general theoretical principle that confines recursive structures to being the outputs of Merge. Since Merge takes lexical items as its arguments and produces non-lexical items as its outputs, it follows that lexical items can't include recursive structures. This has some nice side effects, in that we can't have lexical items that, say, select for a CP that contains a V that selects for a PP.

    Does Stablerian SpecMG plus my `No Complex Values' idea do the trick? I guess perhaps Norbert and I were assuming something like that.

    1. Unfortunately this does not help either.

      Regarding valued features one has to be careful to distinguish systems that only allow a finite number of feature values (e.g. person specification) from those that allow an infinite number (HPSG-style recursive feature matrices). The former can actually be made atomic, the latter cannot. The camouflaging techniques for Merge only require a finite number of feature values, so even a purely atomic feature system is sufficient if we are free to pick as many features as we need.

      So yes, this implies that even the simplest type of feature system makes it possible to write MGs in which X selects only Ys that contain Zs that select Qs. If you do not limit the set of features your lexical items are built over, Merge can do a lot of things that we simply do not find in language; it can enforce structural conditions that are outlandish. For example, a tree could be well-formed only if it holds that if we interpret each leaf as 0 or 1 depending on whether it is an even number of nodes away from the root, one obtains a string that is the binary encoding of a sentence from the collected works of Shakespeare.
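For concreteness, the outlandish leaf-labelling condition just described can be computed in a few lines. This is a hypothetical illustration only: trees are nested tuples, and distance is counted in edges from the root rather than nodes, so the 0/1 convention may differ from the one intended above.

```python
def leaf_bits(tree, depth=0):
    """Label each leaf 0 or 1 by the parity of its distance from the
    root, reading leaves left to right: the kind of string an
    unrestricted feature system could be (ab)used to constrain."""
    if not isinstance(tree, tuple):          # a leaf
        return [depth % 2]
    bits = []
    for child in tree:                       # recurse into each subtree
        bits.extend(leaf_bits(child, depth + 1))
    return bits

# A small binary tree with leaves at depths 1, 2, 2:
t = ("a", ("b", "c"))
assert leaf_bits(t) == [1, 0, 0]
```

A constraint of the form "leaf_bits(tree) must encode a sentence of Shakespeare" is trivially computable, and the point above is that feature-driven Merge is powerful enough to enforce it, even though nothing remotely like it occurs in natural language.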

      There are only two ways around this: 1) Fix the set of features and make sure it is too small to allow for camouflaging, or 2) change the way Merge is controlled by features. So far, nobody in the MG community has found a way to redefine Merge that still allows us to capture basic subcategorization facts.

    2. Actually, there's also a third option, a weaker version of 1).

      Just like we can establish a movement-like dependency via Merge and then camouflage it by coding it into the feature system, we can also posit a constraint that blocks such Merge-mediated dependencies if the equivalent Move operation would be illicit. Call this constraint Merge-SPIC. Positing Merge-SPIC as a universal constraint limits how features can be distributed over lexical items yet does not require a fixed feature set.

      However, this still rules out certain MGs that one could write if our choice of features and how we assign them to lexical items was completely unrestricted. So it might be close to what Norbert and you have in mind, but it probably won't satisfy Alex.

  8. Hi Thomas, v quick as am in start-of-semester hell. The issue of atomicity of feature values vs complex values is exactly addressed in that paper I mentioned. But if I've understood `camouflaging', you're right. Is the idea that to get round `No Complex Values', you add an extra feature each time you want to encode a non-local selectional relation (so you'd encode a verb that selects an N which selects a P with [V, +F], and a verb that selects an N which selects a C with [V, +G], etc.)? So this will indeed require us to disallow such features in one of the ways you suggest: restrict the features, or restrict the way that features can be manipulated through Merge so the information is locked in place. I think Norbert and I were assuming that such feature percolation either is illicit or, if it is licit, is only licit in particular circumstances, not involving specs. Thanks, this has made me think about this.

    1. Let me add my thx as well, both to you and to AlexC. The discussion has been helpful (and it seems that considering the issue formally HAS BEEN VERY USEFUL, as AlexC insisted: you were right, Alex C). The nature of features has been largely ignored in the standard generative literature (but see David's notes above), though not in other formalizations. The "Chomsky" group have tried to put most of the combinatoric "power" into the operations, while treating the features relevant to these on LIs as "simple." If one takes the standard inventory of such features (ones that have been postulated), they are pretty anodyne. However, other formalisms have used more interesting features (slash categories of various kinds) to gain the effects of things like IM/Move. The take-home message for me here is that when combining these two "traditions" one must exercise caution, for fancy features combined with interesting operations can nullify properties that the less complex combinations have tried hard to block. This raises two important questions for minimalists: what feature inventories are permitted, i.e. what does "simple" mean, and how big can these be (a topic that Bob Berwick has worried about in the context of computational complexity)? The hope is that the inventory is pretty small and that the features are not too complex. However, I agree that specifying exactly what this entails is a very good question. As BenjaminB noted somewhere in an earlier comment, it amounts to starting to develop a richer conception of Substantive Universals/features, an area that has been neglected heretofore. Thx for the enlightenment.

    2. "Is the idea that to get round 'No Complex Values', you add an extra feature each time you want to encode a non-local selectional relation? (so you'd encode a verb that selects an N which selects a P with [V, +F] and a verb that selects an N which selects a C with [V, +G], etc)?"

      Yes, that's pretty much it. Usually one just splits V into two categories V_F and V_G, but that's just a notational variant of what you have in mind.
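The category-splitting move can be sketched in a few lines. Everything here is hypothetical (the lexical items, the `N_P`/`N_C` category names, and the `selects` helper are invented for the illustration): refining N into subcategories that record what the noun itself selects turns a non-local fact into a purely local selection check.

```python
# Hypothetical sketch: to make a verb select an N whose own complement
# is a P (a non-local fact), split N into N_P and N_C, which record
# what the noun selects.  Each entry is (category, subcategorization).
LEXICON = {
    "rely":  ("V",   ["N_P"]),  # a verb demanding an N that takes a P
    "claim": ("N_C", ["C"]),    # an N taking a C complement
    "proof": ("N_P", ["P"]),    # an N taking a P complement
}

def selects(head, dep):
    """Check selection purely locally, via the refined categories."""
    _, subcat = LEXICON[head]
    return LEXICON[dep][0] in subcat

assert selects("rely", "proof")      # N_P matches: licit
assert not selects("rely", "claim")  # N_C doesn't: blocked locally
```

Written with feature bundles instead of split category names, `N_P` is just [N, +F] and `N_C` is [N, +G], which is why the two formulations are notational variants.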

      If you're curious, the details can be found in my LACL 2011 paper. It's a purely technical paper, but the picture on page 9 might be enough to get the idea. The same results can also be found in Greg Kobele's LACL 2011 paper, but it's just as difficult a read and has no colorful pictures :)