Saturday, June 28, 2014

Thomas Can't Into Cognitive Modelling

This week I got to present at CMCL 2014, a workshop on computational models of language-related cognition, i.e. processing, acquisition, discourse representation, and so on. My talk was about the connection between Stabler's top-down parser for Minimalist grammars and the processing of relative clauses, something I've been working on for a while now with Bradley Marcinek, a student of mine. Thanks to Greg Kobele, John Hale and Sabrina Gerth, we already know that the predictions of this parser depend on one's syntactic analysis in interesting ways, so we wanted to extend their line of work to some other well-known phenomena. Long story short, our results are rather messy and it will be a while until we can get this idea truly off the ground.

That is why I won't blog about this research quite yet (except for the shameless self-promotion above) and instead focus on the talks I heard, rather than the one I gave. Don't get me wrong, many of them were very interesting to me on a technical level; some of them even pierced my 90s habitus of acerbic cynicism and got me a bit excited. Quite generally, a fun time was had by all. But the talks made me aware of a gaping hole in my understanding of the field, a hole that one of you (I believe we have some readers with serious modelling chops) may be able to plug for me: Just what is the point of cognitive modelling?

Sunday, June 22, 2014

Comments on lecture 2; part deux

In the first post (here), I discussed Chomsky’s version of Merge and the logic behind it.  The main idea is that Merge, the conceptually simplest conception of recursion, has just the properties to explain why NL Gs generate structures with unbounded hierarchical structure, why NLs allow displacement, show reconstruction effects, and why rules of G are structure dependent. Not bad for any story. Really good (especially for DP concerns) if we get all of this from a very simple (nay, simplest) conception. In what follows I turn to a discussion of the last three properties Chomsky identified and see how he aims to account for them. I repeat them here for convenience.

(v) its operations apply cyclically
(vi) it can have lots of morphology
(vii) in externalization only a single “copy” is pronounced

In contrast to the first four properties, the last three do not follow simply from the properties of the conceptually “simplest” combination operation. Rather Chomsky argues that they reflect principles of computational efficiency. Let’s see how.

With respect to (vii), Chomsky assumes that externalization (i.e. “vocalizing” the structures) is computationally costly. In other words, actually saying the structures out loud is hard. How costly? Well, it must be more costly than copy deletion at Transfer is. Here’s why. Given the copy theory as a consequence of Merge, FL must contain a procedure to choose which copy/occurrence is pronounced (note: this is not a conceptual observation but an inference based on the fact that typically only one copy is pronounced). This decision/choice, I assume, requires some computation. I further assume that choosing which copies/occurrences to externalize requires some computation that would not be required were all copies/occurrences pronounced. Chomsky’s assumption is that the cost of choosing is less than the cost of externalizing.  Thus, FL’s choice lowers overall computational cost.

Furthermore, we must assume that the cost of pronunciation also exceeds the computational cost of being misunderstood, for otherwise it would make sense for FL to facilitate parsing by pronouncing all the copies, or at least those that would facilitate a hearer’s parsing of our sentences. None of these assumptions are self-evidently true or false. Plus, the supposition that deleting copies is more computationally efficient than pronouncing them does not follow simply from considerations of conceptual simplicity, at least as far as I can tell. It involves substantive assumptions about actual computational costs, for which, so far as I can tell, we have little independent evidence.

One more point: If copy deletion exists in Transfer to the CI interface (as Chomsky argued in his original 1993 paper and that underlies standard accounts of reconstruction effects and that so far as I know is still part of current theory) then in the normal case only a single copy/occurrence makes it to either interface, though which copy is interpreted at CI can be different from the copy spoken at AP (and this is typically how displacement is theoretically described). But if this is correct, then it suggests that Chomsky’s argument here might need some rethinking. Why? If deletion is part of Transfer to CI then copy deletion cannot be simply a fact about the computational cost of externalization, as it applies to the mapping of linguistic objects to the internal thought system as well. It seems that copies per se are the problem, not just copies that must be pronounced.

Before moving on to (v) and (vi) it is worth pausing to note that Chomsky’s discussion here reverberates with pretty standard conceptions of computational efficiency (viz. he is making claims about how hard it is to do something). This moves away from the purely conceptual matters that motivated the discussion of the first four features of FL. There is a very interesting hypothesis that might link the two: that the simplest computational operation will necessarily be embedded in a computationally efficient system. This is along the lines of how I interpreted the SMT in earlier posts (linked to in the first part of this post). However, whether or not you think this is feasible, it appears, at least to me, that there are two different kinds of arguments being deployed to SMT ends, a purely conceptual one and a more conventional “resource” argument.

Ok, let’s return to (v) and (vi). Chomsky suggests that considerations of computational efficiency also account for these properties. In particular, they follow from something like the strict cycle as embodied in phase theory. So the question is: what’s the relation between the strict cycle and efficient computation?

Chomsky supposes that the strict cycle, or something like it, is what we would expect from a computationally well-designed system. There are times when (to me) Chomsky seems to be assuming that the conceptually simplest system will necessarily be computationally efficient.[1] I don’t see why. In particular, if I understand the lecture correctly, Chomsky is suggesting that the link between conceptual simplicity and computational efficiency should follow as a matter of natural law. Even if correct, it is clear that this line of reasoning goes considerably beyond considerations of conceptual simplicity. What I mean is that even if one grants that the simplest computational operation will be something like Merge, it does not follow that the simplest system that includes Merge will also incorporate the strict cycle. Phases, then (Chomsky’s mechanism for realizing the strict cycle), are motivated not on grounds of conceptual simplicity alone but on grounds of efficiency (i.e. a well/optimally designed system will incorporate something like the strict cycle). So far as I can tell Chomsky does not explain the relation (if any) between conceptual simplicity and computational efficiency, though to be fair, I may be over-interpreting his intent here.

This said, how does the strict cycle bear on computational efficiency? It allows computational decisions to be made locally and incrementally. This is a generically nice feature for computational systems to have, for it simplifies computations.[2] Chomsky notes that it also simplifies the process of distinguishing two selections of the same expression from the lexicon vs two occurrences of the same expression. How does it simplify it? By making the decision a bounded one. Distinguishing them, he claims, requires recalling whether a given occurrence/copy is a product of E- or I-Merge. If such decisions are made strict cyclically (at every phase) then phases reduce memory demand: because phases are bounded, you need not retain information in memory regarding the provenance of a valued occurrence beyond the phase where an expression’s features are valued.[3] So phases ease the memory burdens that computations impose. Let me note again, without further comment, that if this is indeed a motivation for phases, then it presupposes some conception of performance, for only in this kind of context do resource issues (viz. memory concerns) arise. God has no need for bounding computation.

Now I have a confession to make.  I could not come up with a concrete example where this logic is realized involving DP copies, given standard views.  It’s easy enough to come up with a relevant case if e.g. reflexivization is a product of movement.[4] If reflexives involve A-chains with two thematically marked “links” then we need to distinguish copies from originals (e.g. Everyone loves himself differs from everyone loves everyone in that the first involves one selection of everyone from the lexicon (and so one chain with two occurrences of everyone) while the second involves two selections of everyone from the lexicon and so two different chains). However, if you don’t assume this, I personally had a hard time finding an example of what’s worrying Chomsky, at least with copies. This might mean that Chomsky is finally coming to his senses and appreciating the beauty of movement theories of Control and Binding OR it might mean that I am a bear of little brain and just couldn’t come up with a relevant case. I know which option I would bet on, even given my little brain, and it’s not the first. So, anyone with a nice illustration is invited to put it in the comments section or send it to me and I will post it. Thanks.

It is not hard to come up with cases that do not involve DPs, but the problem then is not distinguishing copies from originals. Take the standard case of Subject-Predicate agreement for example. Here the unvalued features of T are valued by the inherently valued features of the subject DP. Once valued, the features on T and D are indistinguishable qua features. However, there is assumed to be an important difference between the two, one relevant to the interpretation at the CI interface. Those on D are meaning relevant but those on T are uninterpretable. What, after all, could it mean to say that the past tense is first person and plural?[5] If one assumes that all features at the interfaces must be interpretable at those interfaces if they make it there, then the valued features on T must disappear at Transfer to CI. But if (by assumption) they are indistinguishable from the interpretable ones on D, the computational system must remember how the features got onto T (i.e. by valuation rather than inherently). The ones that get there by valuation in the grammar must be removed or the derivation will not converge. Thus, Gs need to know how features get onto the expressions they sit on and it would be very nice memory-wise if this was a bounded decision.

Before moving on, it’s worth noting that even this version of the argument is hardly straightforward. It assumes that phi-features on T are not interpretable and that these cause derivations to crash (rather than, for example, converge as gibberish) (also see note 5). It also requires that deletion not be optional, otherwise there would be derivations where all the good features remained on all of the right objects and all of the uninterpretable ones freely deleted. Nor does it allow Transfer (which, after all, straddles the syntax and CI) to peek at the meaning of T during Transfer, thereby determining which features are interpretable on which items and so which should be deleted and which retained. Note that such a peek-a-boo decision to delete during Transfer would be very local, relying just on the meaning of T and the meaning of phi-features. Were this possible, we could delay Transfer indefinitely. So, to make Chomsky’s argument we must assume that Transfer is completely “blind” to the interpretation of the syntactic objects at every point in the syntactic computation including the one that interfaces with CI. This amounts to a very strong version of the autonomy of syntax thesis; one in which no part of the syntax, even the rules that directly interface with the interpretive interfaces, can see any information that the interfaces contain.[6]

Let’s return to the main point. Must the simplest system imaginable be computationally efficient? It’s not clear. One might imagine that the conceptually “simplest” system would not worry about computational efficiency at all (damn memory considerations!). The simplest system might just do whatever it can and produce whatever structured products it can without complicating FL with considerations of resource demands like memory burdens. True, this might render some products of FL unusable or hard to use (and so we would probably perceive them as unacceptable) but then we just wouldn’t use them (sort of like what we say about self-embedded clauses). So, for example, we would tend not to use sentences with multiple occurrences of the same expressions where this made life computationally difficult (e.g. you would not talk about two Norberts in the same sentence). Or without phases we might leave to context the determination of whether an expression is a copy or a lexical primitive or we might allow Transfer to see if features on an expression were kosher or not. At any rate, it seems to me that all of these options are as conceptually “simple” as adding phases to FL unless, of course, phases come for free as a matter of “natural law.” I confess to being skeptical about this supposition. Phases come with a lot of conceptual baggage, which I personally find quite cumbersome (reminds me of Barriers actually, not one of the aesthetic high points in GG (ugh!)). That said, let’s accept that the “simplest” theory comes with phases.

As Chomsky notes, phases themselves have complex properties. For example, phases bring with them a novel operation, feature lowering, which now must be added to the inventory of FL operations. However, feature lowering does not seem to be either a conceptually simple or cognitively/computationally generic kind of operation. Indeed, it seems (at least to me) quite linguistically parochial. This, of course, is not a good thing if one’s sights are set on answering Darwin’s problem. If so, phases don’t fit snugly with the SMT. This does not mean there are none. It just means that they complicate matters conceptually and pull against Chomsky’s first conceptual argument wrt Merge.

Again, let’s put this all aside and assume that strict cyclicity is a desirable property to have and that phases are an optimal way of realizing this. Chomsky then asks how we identify phases. He argues that we can identify phases by their heads, as phase heads are where unvalued features live. Thus a phase is the minimal domain of a phase head with unvalued features.[7] A possible virtue of this way of looking at things is that it might provide a way of explaining why languages contain so much morphology. It is an adventitious by-product of identifying the units/domains of the optimal computational system. Chomsky notes that what he means by morphology is abstract (a la Vergnaud), so a little more has to be said, especially given that externalization is costly, but it’s an idea in an area where we don’t have many (see here).[8]

One remark: on this reconstruction of Chomsky’s arguments, unvalued features play a very big role. They identify phases, which implement strict cyclicity and are the source of overt morphology. I confess to being wary here. Chomsky originally introduced unvalued features to replace uninterpretable ones. Now he assumes that features are both +/- valued and +/- interpretable. As unvalued features are always uninterpretable, this seems like an unwanted redundancy in the feature system. At any rate, as Chomsky notes, uninterpretable features really do look sort of strange in a perfect system. Why have them only to get rid of them? Chomsky’s big idea is that they exist to make FL computationally efficient. Color me very unconvinced.

So this is the main lay of the land. I should mention that, as others have pointed out (especially Dennis O), part of Chomsky’s SMT argument here (i.e. the one linked to conceptual simplicity concerns) is different from the interpretation of the SMT that I advanced in other posts (here, here, here). Thus, my version is definitely NOT the one that Chomsky elaborates when considering these. However, there is a clear second strand dealing with pretty standard efficiency concerns, and here my speculations and his might find some common ground. That said, Chomsky’s proposals rest heavily on certain assumptions about conceptual simplicity, and of a very strong kind. In particular, Chomsky’s argument rests on a very aggressive use of Occam’s razor. Here’s what I mean. The argument he offers is not that we should adopt Merge because all other notions are too complex to be biologically plausible units of genetic novelty. Rather, he argues that in the absence of information to the contrary, Occamite considerations should rule: choose the simplest (not just a simple) starting point and see where you get. Given that we don’t know much about how operations that describe the phenotype (the computational properties of FL) relate to the underlying biological substrate that is the thing that actually evolved, it is not clear (at least to me) how to weight such strong Occamite considerations. They are not without power, but, to me at least, we don’t really know how to assess whether all things are indeed equal and how seriously to weight this very strong demand for simplicity.

Let me end by fleshing this out a bit. I confess to not being moved by Chomsky’s conceptual simplicity arguments. There are lots of simple starting points (even if some may be simpler than others). Ordered pairs are not that much more conceptually complex than sets. Symmetric operations are not obviously simpler than asymmetric ones, especially given that it appears that syntax abhors symmetry (see Moro and Chomsky). So, the general starting point that we need to start with the conceptually simplest conception of “combination” and that this means an operation that creates sets of expressions seems based on weak considerations. IMO, we should be looking for basic concepts that are simple enough to address DP (and there may be many) and evaluate them in terms of how well they succeed in unifying the various apparently disparate properties of FL. Chomsky does some of this here, and it’s great. But we should not stop here. Let me give an example.

One of the properties that modern minimalist theory has had trouble accounting for is the fact that the unit of syntactic movement/interpretation/deletion is the phrase. We may move heads, but we typically move/delete phrases. Why? Right now standard minimalist accounts have no explanation on hand. We occasionally hear about “pied piping” but more as an exercise in hand waving than in explanation. Now, this feature of FL is not exactly difficult to find in NL Gs. That constituency matters is one of the obvious facts about how displacement/deletion/binding operates. There is a simple story about this that labels and headedness can be used to deliver.[9] If this means that we need a slightly less conceptually simple starting point than sets, then so be it.

More generally: the problem that motivates the minimalist program is DP. To address DP we need to factor out most of the linguistic specific structure of FL and attribute it to more cognitively generic operations (or/and, if Chomsky is right, natural laws). What’s simple in a DP context is not what is conceptually most basic, but what is simple given what our ancestors had available cognitively about 100k years ago. We need a simple addition to this, not something that is conceptually simple tout court.[10] In this context it’s not clear to me that adding a set construction operation (which is what Merge amounts to) is the simplest evolutionary alternative. Imagine, for example, that our forebears already had an iterative concatenation operation.[11] Might not some addition to this be just as simple as adding Merge in its entirety? Or imagine that our ancestors could combine lexical atoms together into arbitrarily big unstructured sets, might not an addition that allowed that operation to yield structured sets be just as simple in the DP context as adding Merge? Indeed, it might be simpler depending on what was cognitively available in the mental life of our ancestors. And once we are at it, how “simple” is an operation that forms arbitrary sets from atoms and other sets? Sets may be simple objects with just the properties we need, but I am not sure that operations that construct them are particularly simple.[12]

Ok, let me end this much too long second post. And moreover, let me end on a very positive note. In the second lecture Chomsky does what we all should be doing when we are doing minimalist syntax. He is interested in finding simple computational systems that derive the basic properties of FL. He concentrates on some very interesting key features: unbounded hierarchy, displacement, reconstruction, etc. and makes concrete proposals (i.e. he offers a minimalist theory) that seem plausible. Whether he is right in detail is less important IMO than that his ambitions and methods are worth copying. He identifies non-trivial properties of FL that GG has discovered over the last 60 years and he tries to explain why they should exist. This is exactly the right kind of thing MPers should be doing. Is he right? Well, let’s just say that I don’t entirely agree with him (yet!). Does lecture 2 provide a nice example of what MP research should look like? You bet. It identifies real deep properties of FL and sees how to derive them from more general principles and operations. If we are ever to solve Darwin’s problem, we will need simple systems that do just what Chomsky is proposing.

[1] Note, we want the necessarily here. That it is both simple and efficient does not explain why it need be efficient if simple.
[2] It is also a necessary condition for incrementality in the use systems (e.g. parsing), as Bill Idsardi pointed out to me.  I know that the SMT does not care about use systems according to some (Dennis and William this is a shout-out to you), but this is a curious and interesting fact nonetheless.  Moreover, if I am right that the last three properties do not follow (at least not obviously) from conceptual considerations, it seems that Chomsky might be pursuing a dual route strategy for explaining the properties of FL.
[3] Note that this assumes that there is no syntactic difference between inherent features and features valued in the course of the derivation.
[4] And even this requires a special version of the theory, one like Idsardi and Lidz’s rather than Zwart’s.
[5] However, if v raised to T before Transfer then one might try and link these features to the thematic argument that v licenses. And then it might make lots of sense to say that phi-features are interpretable on T. They would say that the variable of the predicate bound by the subject must have such and such an interpretation. This information might be redundant, but it is not obviously uninterpretable.
[6] The ‘autonomy of syntax’ thesis refers to more than one claim. The simplest one is that syntactic primitives/operations are not reducible to phonetic or semantic ones. This is not  the version adverted to above. This is a more specific version of the thesis; one that requires a complete separation between syntactic and semantic information in the course of a derivation. Note, that the idea that one can add EPP/edge features only if it affects interpretation (the Reinhart-Fox view that Chomsky has at times endorsed) violates this strong version of the autonomy thesis.
[7] Note, we still need to define ‘domain’ here.
[8] Note, incidentally, that Chomsky assumes both that features are +/- valued and that they are +/- interpretable. At one time, the former was considered a substitute for the latter. Now, they are both theoretically required, it seems. As -valued features seem to always be -interpretable, this seems like an unwanted redundancy.
[9] I provide a story here based on labels and minimality.
[10] A question: we can define ordered pairs set theoretically. I assume the argument against labels is that ordered sets are conceptually more complex than unordered sets. So {a,b} is conceptually simpler than {a,{a,b}}.  If this is the argument, it is very very subtle. I find it hard to believe that whereas the former is simple enough to be biologically added, the latter is not. Or even that the relative simplicity of the two could possibly matter. Ditto for other operations like concatenation in place of Merge as the simplest operation.  Given how long this post is already, I will refrain from elaborating these points here.
[11] Birds (and mice and other animals) can string “syllables” together (put them together in a left/right order) to make songs. From what I can tell, there is no hard upper bound on how many syllables can be so combined.  These do not display hierarchy, but they may be recursive in the sense that the combination operation can iterate. Might it not be possible that what we find in FL builds on this iteration operation? That the recursion we find in FL is iteration plus something novel (I have suggested labeling is the novelty)? My point here is not that this is correct, but that the question of simplicity in a DP context need not just be a matter of conceptual simplicity.  
[12] How are sets formed? How computationally simple is the comprehension axiom in set theory, for example? It is actually logically quite involved (see here). I ask because Merge is a set forming operation, so the relevant question is how cognitively complex is it to form arbitrary sets. We have been assuming that this is conceptually simple and hence cognitively easy. However, it is worth considering just how easy. The Wikipedia entry suggests that it is not a particularly simple operation. Sets are funny things and what mental powers go into being able to construct them is not all that clear.

Saturday, June 21, 2014

Baker's Paradox IV: Transformation and Variation

How does the learner acquire the following patterns in dative constructions:

(1) a. John told a story to Bill.
       John told Bill a story.
    b. John promised a car to Bill.
       John promised Bill a car.
    c. John donated a painting to the museum/them.
      *John donated the museum/them a painting.

Lexical conservatism is not the way to go. Children productively (over)generalize both constructions (“I said him no”) about 5% of the time (Gropen et al. 1989 Lg.), a rate comparable to that of past tense overregularization. As young as age 3, they can extend novel verbs from one construction to another (“I pilked the cup to Petey”=>”I pilked Petey the cup”; Conwell & Demuth 2007 Cognition) though the DOC to PC extension is more robust than the other way around.

There is pretty good agreement on the semantic conditions for the dative constructions: DOC generally involves caused possession of the theme by the goal and PC requires caused motion of the theme along the path to the goal. These are what Pinker (1989) calls “broad range rules” but they are clearly only necessary conditions on the dative constructions, as the examples in (1) illustrate. Moreover, there is considerable crosslinguistic variation: in some languages, (the equivalent of) dative constructions are limited to a handful of verbs.

Pinker then proposes a set of “narrow range rules”, each defining a subclass of verbs on the basis of semantics, e.g., verbs of instantaneous causation of ballistic motion (“throw”), verbs of future having (“leave”), verbs of instrument of communication (“telegraph”), etc., which allow DOC, and verbs of fulfilling (“present”), verbs of manner of speaking (“shout”) etc., which allow PC only. Beth Levin refined these lists in her 1993 EVCA book. But as noted by Melissa Bowerman and others, these subclasses do not solve the learning problem. First, it’s not clear how the child learner can conjure up these subclasses: we probably don’t want to build the telecommunication class into an innate UG. Second, these subclasses do not behave consistently across languages (Levin 2008 Stanford ms.); even if they are available for the learner’s consideration, their productivity still needs to be determined.

You know where we are going with this. I looked at a 3 million word corpus of child directed English and found a total of 49 verbs attested in either dative construction:

(2) a. 48 appear in PC, of which 37 also appear in DOC. 
b. 38 appear in DOC, of which 37 also appear in PC. 

Applying the N/ln N formula, we see that both PC=>DOC and DOC=>PC are productive generalizations. That is, if the child sees a verb used in one of the constructions, it will automatically generalize to the other. This appears to be what children do; see above. The DOC=>PC rule, which is virtually exceptionless, is a far more reliable generalization than the PC=>DOC rule, which may account for the asymmetry in the extension of novel verbs in Conwell and Demuth’s study.
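As a rough check on the arithmetic, here is a minimal sketch of the productivity test applied to the counts in (2). It assumes the formula is read as "a rule over N verbs is productive if its exceptions number at most N/ln N"; the function names are mine, chosen for illustration only.

```python
import math

def tolerance_threshold(n):
    """Maximum number of exceptions tolerated by a rule over n items (N / ln N)."""
    return n / math.log(n)

def is_productive(n, exceptions):
    """Assumed reading of the test: productive iff exceptions <= N / ln N."""
    return exceptions <= tolerance_threshold(n)

# Counts from (2): verbs attested in each construction, and in both.
pc_total, doc_total, both = 48, 38, 37

# Verbs seen in the source construction but never in the target one.
pc_to_doc_exceptions = pc_total - both    # 11
doc_to_pc_exceptions = doc_total - both   # 1

for rule, n, e in [("PC=>DOC", pc_total, pc_to_doc_exceptions),
                   ("DOC=>PC", doc_total, doc_to_pc_exceptions)]:
    print(f"{rule}: N={n}, exceptions={e}, "
          f"threshold={tolerance_threshold(n):.1f}, "
          f"productive={is_productive(n, e)}")
```

On these counts PC=>DOC clears the bar narrowly (11 exceptions against a threshold of roughly 12.4), while DOC=>PC clears it with a lot of room (1 exception against roughly 10.4), which matches the asymmetry noted above.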

So there is no Baker’s paradox for a 3 year old, as both constructions can be productively learned. The paradox arises for certain verbs such as the Latinate class but there are hardly any Latinate dative verbs in the child directed data (and not a single instance of the telecommunication verbs; these are data collected before everyone was online). As the child grows older, especially after the onset of literacy which will begin to feature more Latinate words, his vocabulary will expand and he will encounter more examples of dative constructions: some verbs will appear in both DOC and PC while others will only appear in PC. But even the ungrammaticality of Latinate verbs in DOCs is a matter of tendency, not to mention individual variation. Those such as “assign”, “advance”, “award”, “guarantee” etc. do allow DOC and Germanic verbs such as “shout”, “trust”, “lift”, “pick” do not. Collectively, Gropen et al’s list contains 54 Latinate verbs that can participate in PC but only 14 can be used in DOC: Latinate verbs, then, do not productively participate in DOC and the learner will have to lexicalize the 14. Levin’s longer list shows the same pattern.

So the child grows into a paradox: in other words, the productivity of rules/constructions must change over the course of language acquisition. Gropen et al. (1989) list 73 DOC/PC verbs and 34 PC-only verbs for a total of 107, which yields a threshold of 23. If the child learns all of the 107 verbs, the PC=>DOC extension will no longer be justified. A rule that was productive when he was three will cease to be productive when he’s 30.
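The tipping point can be sketched by running the same N/ln N test at two stages of vocabulary growth. This is only an illustration under my own assumptions: the early stage uses the child-directed corpus counts in (2), the later stage uses the Gropen et al. lists, and the stage labels and variable names are hypothetical.

```python
import math

def max_exceptions(n):
    """Largest number of exceptions a rule over n items can tolerate (N / ln N)."""
    return n / math.log(n)

# (stage label, verbs attested in PC, of which never attested in DOC)
stages = [
    ("child-directed corpus", 48, 11),
    ("Gropen et al. (1989) lists", 73 + 34, 34),
]

results = {}
for label, n, exceptions in stages:
    threshold = max_exceptions(n)
    results[label] = exceptions <= threshold
    print(f"{label}: N={n}, exceptions={exceptions}, "
          f"threshold={threshold:.1f}, PC=>DOC productive={results[label]}")
```

With 107 verbs the threshold rounds to 23, so 34 PC-only verbs are too many and the PC=>DOC rule stops being productive, even though the same test licensed it on the smaller early vocabulary.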

I think this is when the child will be prompted to look for subclasses or narrow range rules. Not having a productive linguistic system is a crime against nature. Sometimes we are genuinely stuck when there isn’t any to be found (such as the paradigmatic gap examples I mentioned in the previous post) but the child will not give up trying. In a paper published in the same volume as Berwick, Chomsky and Piattelli-Palmarini, Julie Legate and I studied how the metrical stress parameters of English can be acquired. It’s well known that the overwhelming majority of English words are stress initial (up to 80-90%; Cutler & Carter 1987, Comp. Speech & Lg.), but no metrical theory of English, or any English speaker, treats English as a quantity insensitive (QI) system like Afrikaans while lexically listing 10% of exceptions. Using child directed English words, we found that indeed, the QI system fails to reach productivity despite being the overwhelming majority, and a productive system (as described in Halle 1998 LI) can only be established if the child subdivides the vocabulary into nouns and verbs and considers different stress marking options for these subclasses. Conceivably, this is how they learn the narrow range rules. OK, PC=>DOC may be bust, but if I cut up the verbs into semantic classes, I can still find some productive ones.

This work has tormented me for quite some time. I have argued for a variational conception of language learning, where the learner acquires a probabilistic distribution over grammatical hypotheses, which contrasts with what can be called a “transformational” model of learning, where the learner goes from one grammar to another. Yet what we have on hand is exactly a transformational model of language learning a la hypothesis testing (see Aspects), where the hypotheses are confirmed or rejected by an evaluation metric for productivity.

There really do seem to be two kinds of learning in child language. On the one hand, there is probabilistic adjustment, where non-target grammars show up. The case for parameters remains strong; I hope to provide a report on some recent collaborative work soon. On the other, we have tipping point phenomena such as U-shaped learning curves and other forms of linguistic induction, where a hypothesis suddenly emerges.

I’m happy to concede that I’m treating unattested examples as negative evidence. As noted earlier, the child must be able to generalize over unseen data so that much seems unavoidable.  But I still think this work is different from at least the conventional use of indirect negative evidence. Under the standard view, the learner has two (or many) hypotheses and performs some kind of comparison, discrete or probabilistic, to select the best. (For a recent take on the dative acquisition, see Perfors et al. JCL 2010 and Villavicencio et al. ACL 2013.) The model developed here considers one hypothesis at a time by working over two numbers: it keeps a hypothesis that is good enough and moves on to find another if not. This is the classic error driven learning in much of the inductive learning business (Aspects, Wexler & Culicover, Berwick).

In any case, I think the empirical aspects of productivity are far more important than theoretical formulations and deserve much more attention: 

  • A productive system requires a super duper majority: see English metrical stress.
  • Productivity can change over the course of language acquisition.
  • The failure of productivity results in ineffability such as paradigmatic gaps: sometimes the best isn't good enough. 

Off to hiking in Yunnan, with some Peking ducks along the way. 

Thursday, June 19, 2014

Comments on lecture 2; first part

This was once a 10 page post. I’ve decided to break it into two to make it more manageable. I welcome discussion as there is little doubt that I got many things wrong. However, it’s been my experience that talking about Chomsky’s stuff with others, even if it begins in the wrong place, ends up being very fruitful. So engage away.

In lecture 2, Chomsky starts getting down to details. Before reviewing these, however, let me draw attention to one of Chomsky’s standard themes concerning semantics, with which he opens. He does not really believe that semantics exists (at least as part of FL). Or more accurately, he doubts that there is any part of FL that recursively specifies truth (or satisfaction) conditions on the basis of reference relations that lexical atoms have to objects “in the world.”

Chomsky argues that lexical atoms within natural language (viz. words, more or less) do not refer. Speakers can use words to refer, but words in natural languages (NL) have no intrinsic reference relation to objects or properties or relations or qualities or whatever favorite “things” in the world one cares to name. Chomsky interestingly contrasts words with animal symbols, which he observes really do look like they fit the classical referential conception as they are tightly linked to external states or current appetites on every occasion of use. As Chomsky has repeatedly stressed, this contrast between our “words” and animal “words” needs explaining, as it appears to be a distinctive (dare I say species specific) feature of NL atoms.

Interestingly (at least to me), the point Chomsky makes here echoes ideas in Wittgenstein’s (W) later writings. Take a look at W’s slab language in the Investigations. This is a “game” in which terms are explicitly referentially anchored. This language has a very primitive tone (a point that W wants to make IMO) and has none of the suppleness characteristic of even the simplest words in NL. This resonates very clearly with Chomsky’s Aristotelian observations about how words function.

Chomsky pushes these observations further. If he is right about the absence of an intrinsic reference relation between words and the world and that words function in a quasi Aristotelian way, then semantics is just a species of syntax, in that it specifies internal relations between different types of symbols. Chomsky once again (he does this here for example) urges an analogy with phonological primitives, which also have no relations to real world objects but can be used to create physical effects that others built like us can interpret. So, no semantics, just various kinds of syntax and some pragmatics describing how these different sets of symbols are used by speakers.

Two remarks and we move on to discuss the meat of the lecture: (i) Given Chomsky’s skepticism concerning theories of use, this suggests that there is unlikely to be a “theory” of how linguistic structures are used to “refer,” make assertions, ask questions etc.  We can get informal descriptions that are highly context sensitive, but Chomsky is likely skeptical about getting much more, e.g. a general theory of how sentences are used to assert truths. Interestingly, here too Chomsky echoes W. W noted that there are myriad language games, but he doubted that there could be theories of such games. Why? Because games, W observes, are very loosely related to one another and a game’s rules are often constructed on the fly. 

With very few exceptions semanticists, both linguists and philosophers, have not reacted well to these observations. Most of the technology recursively specifies truth conditions based on satisfaction conditions of predicates. There is a whole referentialist metaphysics based on this. If Chomsky is right, then this will all have to be re-interpreted (and parts scrapped). So far as I know, Paul Pietroski (see here) is unique among semanticists in developing interpretive accounts of sentence meaning not based on these primitive referential conceptions.

Ok, let’s now move on to the main event. Chomsky, despite his standard comments noting that Minimalism is a program and not a theory, outlines a theory that, he argues, addresses minimalist concerns.[1] The theory he outlines aims to address Darwin’s Problem (DP). In reviewing the intellectual lay of the land (as described more fully in lecture 1) he observes that FL arose quickly, all at once, in the recent past, and has remained stable ever since. He concludes from this that the change, whatever it was, was necessarily “simple.” Further, Chomsky specifies the kinds of things that this “simple” change should account for, viz. a system with (at least) the following characteristics:

(i) it generates an infinite number of hierarchically structured objects
(ii) it allows for displacement
(iii) it displays reconstruction effects
(iv) its operations are structure dependent
(v) its operations apply cyclically
(vi) it can have lots of morphology
(vii) in externalization only a single “copy” is pronounced

Chomsky argues that these seven properties are consequences of the simplest conceivable conception of a recursive mechanism. Let’s follow the logic.

Chomsky assumes that whatever emerged had to be “simple.” Why? One reason is that complexity requires time, and if the timeline that experts like Tattersall have provided is more or less correct, then the timeline is very short in evo terms (roughly 50-100k years). So whatever change occurred must have been a simple modification of the previous cognitive system. Another reason for thinking it was simple is that it has been stable since it was first introduced. In particular, human FLs have not changed since humans left Africa and dispersed across the globe about 50-100k years ago. How do we know? Because any human kid acquires any human language in effectively the same way. So, whatever the change was, it was simple.

Next question: what’s “simple” mean? Here Chomsky makes an interesting (dare I say, bold?) move. He equates evolutionary simplicity with conceptual simplicity. So he assumes that what we recognize as conceptually simple corresponds to what our biochemistry takes to be simple. I say that this is “interesting/bold” for I see no obvious reason why it need be true. The change was “simple” at the genetic/chemical level. It was anything but at the cognitive one.  Indeed, that’s the point; a small genetic/biochemical change can have vast phenotypic effects, language being the parade case. However, what Chomsky is assuming, I think, is that the addition of a simple operation to our cognitive inventory will correspond to a simple change at the genetic/developmental level.[2] We return to this assumption towards the end.

As is well known, Chomsky’s candidate for the “simplest” change is the addition of an operation that “takes two things already (my emphasis NH) constructed and forms a new thing from them” (at about 28:20). Note the ‘already.’ The simplest operation, let’s call it by its common name, “Merge,” does not put just any two things together. It puts two constructed things together. We return to this too.

How does it put them together? Again, the simplest operation will leave the combinees unchanged in putting them together (it will obey the No Tampering Condition (NTC)) and the simplest operation will be symmetric (i.e. impose no order on the elements combined).[3] So the operation will be something like “combine A and B,” not “combine A with B.” The latter is asymmetric and so imposes a kind of order on the combinees. Merge so conceived can be represented as an operation that creates sets. Sets have both the required properties. Their elements are unordered and putting things into sets (i.e. making things elements of a set) does not thereby change the elements so besetted.[4]
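To make the two properties concrete, here is a toy sketch of set-Merge. The encoding is my own assumption, not anything Chomsky proposes: lexical items are plain strings and syntactic objects are Python frozensets, which are immutable and unordered.

```python
def merge(a, b):
    """Set-Merge: form the unordered set {a, b}.

    The inputs are immutable (strings or frozensets), so they cannot be
    altered by being combined: a toy stand-in for the NTC.
    """
    return frozenset({a, b})


# Symmetry: "combine A and B" is the same object as "combine B and A".
the_man = merge("the", "man")
assert the_man == merge("man", "the")

# No tampering: {the, man} sits unchanged inside the larger object.
vp = merge("saw", the_man)
assert the_man in vp
assert the_man == frozenset({"the", "man"})

# Hierarchy without order: vp has two members, one of which is itself a set.
print(len(vp), sorted(the_man))
```

Nothing in the resulting objects records which argument came first, which is the sense in which Merge so conceived imposes no order.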

We have heard this song before. However, Chomsky puts a new spin on things here. He notes that the “simplest” application of Merge is one where you pick an expression X that is within another expression Y and combine X and Y.  Thus I(nternal)-Merge is the simplest application/instance of Merge. The cognoscenti will recognize that this is not how Chomsky elaborated things before. In earlier versions, taking two things neither of which was contained in the other and Merging them (viz. E-merge) was taken to be simpler.  Not now, however. Chomsky does not go into why he changes his mind, but he hints that the issue is related to “search.” It is easier to “find” a term within a term than to find two terms in a workspace (especially one that contains a lexicon).[5]  So, the simplest operation is I-merge, E-merge being only slightly more complex, and so also available.

Comments: I found this discussion a bit hard to follow. Here’s why. A logical precondition for the application of I-merge is the existence of structured objects and many (most) of these will be products of E-merge. That would seem to suggest that the “simplest” version of the operation is not the conceptually most basic as it logically presupposes that another operation exist.  It is coherent to assume that even if E-merge is more conceptually basic, I-merge is easier to apply (think search). But if one is trucking in conceptual simplicity, it sure looks like E-merge is the more basic notion. After all, one can imagine derivations with E-merges and no I-merges but not the reverse.[6] Clearly we will be hearing more about this in later lectures (or so I assume). Note that this eliminates the possibility of Economy notions like “merge over move” (MoM). This is unlikely to worry Chomsky given the dearth of effects regulated by this MoM economy condition (Existential constructions? Fougetaboutit!).[7] Nonetheless, it is worth noting. Indeed, it looks like Chomsky is heading towards a conception more like “move over merge” or “I over E merge” (aka: Pesetsky’s Earliness principle), but stay tuned.

Chomsky claims that these are the simplest conceivable pair of operations and so we should eschew all else.[8] Some may not like this (e.g. moi) as it purports to eliminate operations like inter-arboreal/sidewards Merge (where one picks a term within one expression and merges it with a term from the lexicon). I am not sure, however, why this should not be allowed. If we grant that finding mergeables in the lexicon is more complex than finding a mergeable within a complex term, then why shouldn’t finding a term within a term (bounded search here) and merging it with a term from the lexicon be harder than I-merge but simpler than E-merge? After all, for interarboreal merge we need scour the big vast nasty lexicon but once rather than twice, as is the case with many cases of E-merge (e.g. forming {the,man}). At any rate, Chomsky wants none of this, as it goes beyond the conceptually simplest possibilities.

Chomsky also does not yet mention pair merge, though in other places he notes that this operation, though more complex than set merge (note: it does imply an ordering, hence the ‘pair’ in (ordered?) pair merge), is also required. If this is correct, it would be useful to know how pair merge relates to I and E merge: is it a different operation altogether (that would not be good for DP purposes as we need keep miracles to a small minimum) and where does it sit in the conceptual complexity hierarchy of merge operations? Stay tuned.

So, to return to the main theme, the candidate for the small simple change that occurred is the arrival of Merge, an operation that forms new sets of expressions both from already constructed sets of expressions (I-merge) and from lexical items (which are themselves atomic, at least as far as merge is concerned) (E-merge). The strong minimalist thesis (SMT) is the proposal that these conceptual bare bones suffice to get us many (in the best case, all) of the distinctive properties of NL Gs. In other words, that the conceptually “simplest” operation (i.e. the one that would have popped into our genomes/developmental repertoires if anything did) suffices to explain the basic properties of FL. Let’s see how merge manages this.

Recall that Merge forms sets in accord with the NTC. Thus, it can form bigger and bigger (with no bound to how big) hierarchically structured objects. The hierarchy is a product of the NTC. The recursion is endemic to Merge. Thus, Merge, the “simplest” recursive operation, suffices to derive (i) above (i.e. the fact that NLs contain an infinite number of hierarchically structured objects).

In addition, I-merge models displacement (an occurrence of the same expression in two different places) and as I-merge is the simplest application of Merge, we expect any system built on Merge to have displacement as an inherent property (modulo AP deletion, see next post).[9]

We also expect to find (iii) reconstruction effects, for Merge plus NTC implies the copy theory of movement. Note that we are forming sets, so when we combine A (contained in B) with B via merge we don’t change A or B (due to NTC) and so we get another instance of A in its newly merged position. In effect, movement results in two occurrences of the same expression in two places. These copies suffice to support reconstruction effects so the simplest operation explains (iii), at least in part (see note 10).[10]
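The copy logic can be sketched in the same toy style. Everything here is an illustrative assumption of mine (strings for lexical items, frozensets for syntactic objects, the helper names `terms`, `internal_merge` and `occurrences`); the only point is that merging A with a B that already contains A yields two occurrences of A without altering B.

```python
def merge(a, b):
    """Set-Merge: form the unordered set {a, b} without altering a or b."""
    return frozenset({a, b})


def terms(obj):
    """Yield obj itself and every term contained in it, at any depth."""
    yield obj
    if isinstance(obj, frozenset):
        for member in obj:
            yield from terms(member)


def internal_merge(x, y):
    """I-merge: merge y with a term x that y properly contains."""
    if x == y or x not in terms(y):
        raise ValueError("x must be a proper term of y")
    return merge(x, y)


def occurrences(x, obj):
    """Count how many times x shows up as a term of obj."""
    return sum(1 for t in terms(obj) if t == x)


vp = merge("see", "what")          # E-merge of two lexical items
tp = merge("you", vp)              # E-merge again
cp = internal_merge("what", tp)    # I-merge: "what" is found inside tp

assert tp in cp                    # tp is untouched (NTC)
print(occurrences("what", tp), occurrences("what", cp))
```

The lower occurrence is still sitting inside `vp` after I-merge, which is the structural precondition for reconstruction; as note 10 stresses, it is a necessary condition only.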

(iv) follows as well. The objects created have no left/right order, as the objects created are sets and sets have no order at all, and so no left/right order.[11] This means that operations on such set theoretic structures cannot exploit left/right order as such relations are not defined for the set theoretic objects that are the objects of syntactic manipulation. Thus, syntactic operations must be structure dependent as they cannot be structure independent.[12]

This seems like a good place to stop. The discussion continues in the next post where I discuss the last three properties outlined above.

[1] Chomsky would argue, correctly, that should his particular theory fail then this would not impugn the interest of the program.  However, he is also right in thinking that the only way to advance a program is by developing specific theories that embody its main concerns.
[2] I use ‘genetic/developmental’ as shorthand for whatever physical change was responsible for this new cognitive operation. I have no idea what the relation between cognitive primitives and biological primitives is. But, from what I can tell, neither does anyone else. Ah dualism! What a pain!!
[3] We need to distinguish order from a left/right ordering. For example, in earlier proposals, labels were part of Merge. Labels served to order the arguments: {a,{a,b}} is equivalent to the ordered pair <a,b>. However, Merge plus label does not impose a left-right ordering on ‘a’ and ‘b’. Chomsky in this lecture explicitly rejects a label based conception of Merge so he is arguing that the combiners are formed into simple sets, not ordered sets. The issue about ordering, then, is more general than whether Merge, like earlier Phrase Structure rules in GG, imposes a left-right order on the atoms in addition to organizing them into “constituents.”
[4] If it did, we could not identify a set in terms of the elements it contains.
[5] I heard Chomsky analogize this to finding something in your pocket vs finding it on your desk, the former being clearly simpler. This clearly says something about Chomsky’s pockets versus his desks.  But substitute purses or school bags for pockets and the analogy, at least in my case, strains. This said, I like this analogy better than Chomsky’s old refinery analogy in his motivation of numerations.
[6] Indeed, one can imagine an FL that only has an operation like E-merge (no I-merge) but not the converse. Restricting Merge to E-merge might be conceptually ad hoc, as Chomsky has argued before, but it is doable. A system with I-merge alone (no E-merge at all) is, at least for me, inconceivable.
[7] I know of only three cases where MoM played a role: Existential constructions, adjunct control and the order of shifted objects and base generated subjects. I assume that Chomsky is happy to discount all three, though a word or two why they fail to impress would be worth hearing given the largish role MoM played in earlier proposals. In particular, what is Chomsky’s current account of *there seems a man to be here?
[8] Chomsky talks as if these are two different operations with one being simpler than the other. But I doubt that this is what he means. He very much wants to see structure building and movement as products of the same single operation and that on the simplest story, if you get one you necessarily get the other. This is not what you get if the two merge operations are different, even slightly so. Rather I think we should interpret Chomsky as saying that E/I-Merge are two applications of the same operation with the application of I-merge being simpler than E-merge.
[9] What I mean is that I-merge implies the presence of non-local dependencies. It does not yet imply displacement phenomena if these are understood to mean that an expression appears at AP in a position different from where it is interpreted at CI. For this we need copy deletion as well as I-merge.
[10] Actually, this is not quite right. Reconstruction requires allowing information from lower copies to be retained for binding. This need not have been true. For example, if CI objects were like the objects that AP interprets, the lower copies would be minimized (more or less deleted) and so we would not expect to find reconstruction effects.  So what Merge delivers is a necessary (not a sufficient) condition for reconstruction effects. Further technology is required to actually deliver the goods. I mention this, for any additional technology must find its way into the FL genome and so complicates DP.  It seems that Chomsky here may be claiming a little more for the simplest operation than Merge actually delivers. In Chomsky’s original 1993 paper, Chomsky recognized this. See his discussion of the Preference Principle, wherein minimizing the higher copy is preferred to minimizing the lower one.
[11] As headedness imposes an ordering on the arguments (it effectively delivers ordered pairs), headedness is also excluded as a basic part of the computational system as it does not follow from the conceptually “simplest” possible combination operation. I discuss this a bit more in the next post.
[12] Note that we need one more assumption to really seal the deal, viz. that there are no syntax-like operations that apply after Transfer. Thus, there can be no “PF” operations that move things around. Why not? Because Transfer results in left-right ordered objects. Such operations were occasionally proposed and it would be worth going back to look at these cases to see what they imply for current assumptions.