In the first post (here),
I discussed Chomsky’s version of Merge and the logic behind it. The main idea is that Merge, the conceptually simplest conception of recursion, has just the properties needed to explain why NL Gs generate structures with unbounded hierarchical structure, why NLs allow displacement and show reconstruction effects, and why rules of G are structure dependent. Not bad for
any story. Really good (especially for DP concerns) if we get all of this from
a very simple (nay, simplest) conception. In what follows I turn to a discussion
of the last three properties Chomsky identified and see how he aims to account
for them. I repeat them here for convenience.
(v) its operations apply cyclically
(vi) it can have lots of morphology
(vii) in externalization only a single “copy” is pronounced
In contrast to the first four properties, the last three do
not follow simply from the properties of the conceptually “simplest” combination
operation. Rather, Chomsky argues that they reflect principles of computational
efficiency. Let’s see how.
With respect to (vii), Chomsky assumes that externalization
(i.e. “vocalizing” the structures) is computationally costly. In other words,
actually saying the structures out loud is hard. How costly? Well, it must be more
costly than copy deletion at Transfer is. Here’s why. Given the copy theory as
a consequence of Merge, FL must contain a procedure to choose which
copy/occurrence is pronounced (note: this is not a conceptual observation but
an inference based on the fact that typically only one copy is pronounced).
This decision/choice, I assume, requires some computation. I further assume that
choosing which copies/occurrences to externalize requires some computation that
would not be required were all
copies/occurrences pronounced. Chomsky’s assumption is that the cost of
choosing is less than the cost of externalizing. Thus, FL’s choice lowers overall
computational cost.
Furthermore, we must also assume that the cost of pronunciation exceeds the computational cost of being misunderstood, for otherwise it would make sense for FL to facilitate parsing by pronouncing all the copies, or at least those that would facilitate a hearer’s parsing of our sentences. None of these assumptions is self-evidently true or false. Plus, the supposition that deleting copies is computationally cheaper than pronouncing them does not follow simply from considerations of conceptual simplicity, at least as far as I can tell. It involves substantive assumptions about actual computational costs, for which we have little independent evidence.
One more point: if copy deletion exists in Transfer to the CI interface (as Chomsky argued in his original 1993 paper, an assumption that underlies standard accounts of reconstruction effects and that, so far as I know, is still part of current theory), then in the normal case only a single copy/occurrence makes it to either interface, though which copy is interpreted at CI can be different from the copy spoken at AP
(and this is typically how displacement is theoretically described). But if
this is correct, then it suggests that Chomsky’s argument here might need some
rethinking. Why? If deletion is part of Transfer to CI then copy deletion cannot
be simply a fact about the computational cost of externalization, as it applies to the mapping of linguistic objects
to the internal thought system as
well. It seems that copies per se are
the problem, not just copies that must be pronounced.
Before moving on to (v) and (vi) it is worth pausing to note
that Chomsky’s discussion here reverberates with pretty standard conceptions of
computational efficiency (viz. he is making claims about how hard it is to do something). This moves away from the
purely conceptual matters that motivated the discussion of the first four
features of FL. There is a very interesting hypothesis that might link the two:
that the simplest computational operation will necessarily be embedded in a
computationally efficient system. This is along the lines of how I interpreted
the SMT in earlier posts (linked to in the first part of this post). However, whether or not you think this is feasible, it appears, at least to me, that there are two different kinds of arguments being deployed to SMT ends: a purely conceptual one and a more conventional
“resource” argument.
Ok, let’s return to (v) and (vi). Chomsky suggests that
considerations of computational efficiency also account for these properties of FL. In particular, they follow from something like the strict cycle as embodied in phase theory. So the question is:
what’s the relation between the strict cycle and efficient computation?
Chomsky supposes that the strict cycle, or something like
it, is what we would expect from a computationally well-designed system. There are times when (to me) Chomsky seems to be assuming that the conceptually simplest system will necessarily be computationally efficient.[1]
I don’t see why. In particular, if I understand the lecture correctly, Chomsky
is suggesting that the link between conceptual simplicity and computational
efficiency should follow as a matter of natural law. Even if correct, it is
clear that this line of reasoning goes considerably beyond considerations of conceptual
simplicity. What I mean is that even if one grants that the simplest
computational operation will be something like Merge, it does not follow that
the simplest system that includes Merge will also incorporate the strict cycle. Phases, then (Chomsky’s mechanism for realizing the strict cycle), are motivated not on grounds of conceptual simplicity
alone but on grounds of efficiency (i.e. a well/optimally designed system will
incorporate something like the strict cycle). So far as I can tell Chomsky does
not explain the relation (if any) between conceptual simplicity and computational efficiency, though, to be fair, I may be over-interpreting his
intent here.
This said, how does the strict cycle bear on computational efficiency? It allows computational decisions to be made locally and incrementally. This is a generically nice feature for computational systems to have, for it simplifies computations.[2]
Chomsky notes that it also simplifies the process of distinguishing two
selections of the same expression from the lexicon vs two occurrences of the
same expression. How does it simplify it? By making the decision a bounded one.
Distinguishing them, he claims, requires recalling whether a given
occurrence/copy is a product of E- or I-Merge. If such decisions are made
strict cyclically (at every phase) then phases reduce memory demand: because
phases are bounded, you need not retain information in memory regarding the
provenance of a valued occurrence beyond the phase where an expression’s
features are valued.[3]
So phases ease the memory burdens that computations impose. Let me note again, without further comment, that if this is
indeed a motivation for phases, then it presupposes some conception of
performance for only in this kind of context do resource issues (viz. memory
concerns) arise. God has no need for bounding computation.
Now I have a confession to make. I could not come up with a concrete example where
this logic is realized involving DP copies, given standard views. It’s easy enough to come up with a relevant
case if e.g. reflexivization is a product of movement.[4]
If reflexives involve A-chains with two thematically marked “links” then we need to distinguish copies from originals (e.g. “Everyone loves himself” differs from “everyone loves everyone” in that the first involves one selection of “everyone” from the lexicon (and so one chain with two occurrences of “everyone”) while the second involves two selections of “everyone” from the lexicon and so two different chains). However, if you don’t assume
this, I personally had a hard time finding an example of what’s worrying
Chomsky, at least with copies. This might mean that Chomsky is finally coming
to his senses and appreciating the beauty of movement theories of Control and
Binding OR it might mean that I am a bear of little brain and just couldn’t
come up with a relevant case. I know which option I would bet on, even given my
little brain, and it’s not the first. So, anyone with a nice illustration is
invited to put it in the comments section or send it to me and I will post it.
Thanks.
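In the meantime, here is a toy sketch of the copy/repetition distinction itself, just to fix terminology. The sketch is mine, not Chomsky’s: the “selection token” representation and the function names are expository assumptions. The idea is that repetitions are separate selections from the lexicon and so carry distinct tokens, while copies produced by I-Merge share their antecedent’s token; the memory question is then how long such tokens must be tracked, and phases would bound that.

```python
# Toy illustration only: selection tokens distinguish copies from repetitions.
# The (word, token) representation is an assumption made for exposition.
from itertools import count

_tokens = count()

def select(word):
    # External selection from the lexicon: a fresh token per selection.
    return (word, next(_tokens))

def internal_merge(occurrence):
    # I-Merge: the displaced occurrence is a copy, sharing its token.
    return occurrence

# "everyone loves everyone": two selections, hence repetitions (two chains).
subj = select("everyone")
obj = select("everyone")
print(subj[1] == obj[1])          # False

# movement-style reflexive: one selection, two occurrences, one chain.
antecedent = select("everyone")
copy = internal_merge(antecedent)
print(antecedent[1] == copy[1])   # True
```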
It is not hard to come up with cases that do not involve
DPs, but the problem then is not distinguishing copies from originals. Take the
standard case of Subject-Predicate agreement, for example. Here the unvalued features of T are valued by the inherently valued features of the subject DP. Once valued, the features on T and D
are indistinguishable qua features.
However, there is assumed to be an important difference between the two, one
relevant to the interpretation at the CI interface. Those on D are meaning
relevant but those on T are uninterpretable.
What, after all, could it mean to say that the past tense is first person and
plural?[5]
If one assumes that all features at the interfaces must be interpretable at
those interfaces if they make it there, then the valued features on T must
disappear at Transfer to CI. But if (by assumption) they are indistinguishable
from the interpretable ones on D, the computational system must remember how the features got onto T (i.e. by valuation rather than inherently). The ones that get there by valuation in the
grammar must be removed or the derivation will not converge. Thus, Gs need to
know how features get onto the expressions they sit on and it would be very
nice memory-wise if this was a bounded decision.
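Here is a minimal sketch of that point (my own toy encoding, not anything from the lectures; the attribute names and the “valued_by_agree” flag are assumptions for illustration): once T’s phi-features are valued from D’s, the two bundles are identical qua features, so only a record of how they got there tells Transfer which bundle to strip, and strict cyclicity would let that record be discarded at the phase boundary.

```python
# Toy illustration only: why feature provenance must be remembered.
def agree(probe, goal):
    # Value the probe's unvalued phi-features from the goal's and record it.
    probe["phi"] = dict(goal["phi"])
    probe["valued_by_agree"] = True

D = {"phi": {"person": 3, "number": "pl"}, "valued_by_agree": False}
T = {"phi": None, "valued_by_agree": False}

agree(T, D)
assert T["phi"] == D["phi"]        # qua features, now indistinguishable

def transfer(items):
    # At the phase boundary, strip features valued in the syntax;
    # after that, the provenance record need not be kept in memory.
    for item in items:
        if item["valued_by_agree"]:
            item["phi"] = None
        item["valued_by_agree"] = False

transfer([D, T])
print(D["phi"], T["phi"])          # {'person': 3, 'number': 'pl'} None
```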
Before moving on, it’s worth noting that even this version
of the argument is hardly straightforward. It assumes that phi-features on T are not interpretable and that these cause derivations to crash (rather than, for example, converge as gibberish) (also see note 5). It also requires that
deletion not be optional, otherwise there would be derivations where all the
good features remained on all of the right objects and all of the uninterpretable
ones freely deleted. Nor does it allow Transfer (which, after all, straddles the syntax and CI) to peek at the meaning of T, thereby determining which features are interpretable on which items and so which should be deleted and which retained. Note that such a peek-a-boo decision to delete during Transfer would be very local,
relying just on the meaning of T and the meaning of phi-features. Were this
possible, we could delay Transfer indefinitely. So, to make Chomsky’s argument we
must assume that Transfer is completely “blind” to the interpretation of the
syntactic objects at every point in the syntactic computation including the one
that interfaces with CI. This amounts to a very strong version of the autonomy
of syntax thesis; one in which no part of the syntax, even the rules that directly
interface with the interpretive interfaces, can see any information that the
interfaces contain.[6]
Let’s return to the main point. Must the simplest system
imaginable be computationally efficient? It’s not clear. One might imagine that
the conceptually “simplest” system would not worry about computational
efficiency at all (damn memory considerations!). The simplest system might just
do whatever it can and produce whatever structured products it can without
complicating FL with considerations of resource demands like memory burdens.
True, this might render some products of FL unusable or hard to use (and so we would probably perceive them as unacceptable), but then we just wouldn’t use them (sort of like what we say about self-embedded clauses). So, for example, we would tend
not to use sentences with multiple occurrences of the same expressions where
this made life computationally difficult (e.g. you would not talk about two
Norberts in the same sentence). Or, without phases, we might leave to context the determination of whether an expression is a copy or a lexical primitive, or we might allow Transfer to see whether features on an expression were kosher or not. At
any rate, it seems to me that all of these options are as conceptually “simple”
as adding phases to FL unless, of course, phases come for free as a matter of
“natural law.” I confess to being skeptical
about this supposition. Phases come with a lot of conceptual baggage, which I
personally find quite cumbersome (reminds me of Barriers actually, not one of the aesthetic high points in GG
(ugh!)). That said, let’s accept that the “simplest” theory comes with
phases.
As Chomsky notes, phases themselves come with complex
properties. For example, phases bring
with them a novel operation, feature lowering, which now must be added to the
inventory of FL operations. However, feature lowering does not seem to be either
a conceptually simple or cognitively/computationally generic kind of operation.
Indeed, it seems (at least to me) quite linguistically parochial. This, of
course, is not a good thing if one’s sights are set on answering Darwin’s
problem. If so, phases don’t fit snugly
with the SMT. This does not mean there are no phases. It just means that they
complicate matters conceptually and pull against Chomsky’s first conceptual
argument wrt Merge.
Again, let’s put this all aside and assume that strict
cyclicity is a desirable property to have and that phases are an optimal way of
realizing this. Chomsky then asks: how do we identify phases? He argues that we can identify phases by their heads, as phase heads are where unvalued features live. Thus a phase is the minimal domain of a phase head with unvalued features.[7] A possible virtue of this way of looking at things is that it might provide a way of explaining why languages contain so much morphology. Morphology, on this view, is an adventitious by-product of the need to identify the units/domains of the optimal computational system. Chomsky notes that
what he means by morphology is abstract (a la Vergnaud), so a little more has
to be said, especially given that externalization is costly, but it’s an idea
in an area where we don’t have many (see here).[8]
One remark: on this reconstruction of Chomsky’s arguments,
unvalued features play a very big role. They identify phases, which implement
strict cyclicity and are the source of overt morphology. I confess to being wary here. Chomsky
originally introduced unvalued features to replace uninterpretable ones. Now he
assumes that features are both +/- valued and +/- interpretable. As unvalued features are always uninterpretable, this seems like an unwanted redundancy in
the feature system. At any rate, as
Chomsky notes, uninterpretable features really do look sort of strange in a
perfect system. Why have them only to get rid of them? Chomsky’s big idea is that they exist to make
FL computationally efficient. Color me very unconvinced.
So this is the main lay of the land. I should mention that,
as others have pointed out (especially Dennis O), part of Chomsky’s SMT argument
here (i.e. the one linked to conceptual simplicity concerns) is different from
the interpretation of the SMT that I advanced in other posts (here,
here,
here). Thus, my version is definitely NOT the one
that Chomsky elaborates when considering these. However, there is a clear
second strand dealing with pretty standard efficiency concerns, and here my
speculations and his might find some common ground. That said, Chomsky’s proposals
rest heavily on certain assumptions about conceptual simplicity, and of a very
strong kind. In particular, Chomsky’s argument rests on a very aggressive use
of Occam’s razor. Here’s what I mean.
The argument he offers is not that we should adopt Merge because all other notions
are too complex to be biologically plausible units of genetic novelty. Rather,
he argues that in the absence of information to the contrary, Occamite
considerations should rule: choose the simplest
(not just a simple) starting point
and see where you get. Given that we don’t know much about how operations that
describe the phenotype (the computational properties of FL) relate to the
underlying biological substrate that is the thing that actually evolved, it is
not clear (at least to me) how to weight such strong Occamite considerations.
They are not without power, but, to me at least, we don’t really know how to
assess whether all things are indeed equal and how seriously to weight this
very strong demand for simplicity.
Let me end by fleshing this out a bit. I confess to not being moved by Chomsky’s
conceptual simplicity arguments. There are lots of simple starting points (even if some may be simpler than others).
Ordered pairs are not that much more conceptually
complex than sets. Symmetric operations are not obviously simpler than
asymmetric ones, especially given that it appears that syntax abhors symmetry
(see Moro and Chomsky). So, the general claim that we need to start with the conceptually simplest conception of “combination,” and that this means an operation that creates sets of expressions, seems based on weak considerations. IMO, we should be looking
for basic concepts that are simple enough
to address DP (and there may be many) and evaluate them in terms of how well
they succeed in unifying the various apparently disparate properties of FL. Chomsky
does some of this here, and it’s great. But we should not stop here. Let me give an example.
One of the properties that modern minimalist theory has had
trouble accounting for is the fact that the unit of syntactic
movement/interpretation/deletion is the phrase.
We may move heads, but we typically move/delete phrases. Why? Right now
standard minimalist accounts have no explanation on hand. We occasionally hear
about “pied piping” but more as an exercise in hand waving than in explanation.
Now, this feature of FL is not exactly difficult to find in NL Gs. That
constituency matters is one of the obvious facts about how
displacement/deletion/binding operates. There is a simple story about this that
labels and headedness can be used to deliver.[9]
If this means that we need a slightly less conceptually simple starting point
than sets, then so be it.
More generally: the problem that motivates the minimalist
program is DP. To address DP we need to factor out most of the linguistically specific structure of FL and attribute it to more cognitively generic
operations (or/and, if Chomsky is right, natural laws). What’s simple in a DP context is not what is conceptually most basic, but what is
simple given what our ancestors had
available cognitively about 100k years ago. We need a simple addition to this, not something that is conceptually
simple tout court.[10] In this context it’s not clear to me that
adding a set construction operation
(which is what Merge amounts to) is the simplest evolutionary alternative. Imagine, for example, that our forebears already had an iterative concatenation operation.[11] Might not some addition to this be just as simple as adding Merge in its entirety? Or imagine that our ancestors could combine lexical atoms together into arbitrarily big unstructured sets; might not an addition that allowed that operation to yield structured sets be just as simple in the DP context as adding Merge? Indeed, it might be simpler depending on what was cognitively available in the mental life of our ancestors.
And while we are at it, how “simple” is an operation that forms arbitrary
sets from atoms and other sets? Sets may
be simple objects with just the properties we need, but I am not sure that
operations that construct them are particularly simple.[12]
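To make the contrast concrete, here is a sketch of my own (the function names are made up for exposition, and this is not a proposal about FL): an iterative concatenation operation strings atoms into a flat, ordered sequence, while Merge as set formation builds two-membered sets whose outputs can themselves be merged, which is where hierarchy comes from.

```python
# Toy contrast only: iteration/concatenation vs. Merge as set formation.
def concatenate(*items):
    # Iteration: combine arbitrarily many items; flat, ordered, no hierarchy.
    return tuple(items)

def merge(a, b):
    # Merge: form the two-membered set {a, b}; outputs can be re-merged.
    return frozenset({a, b})

song = concatenate("syl1", "syl2", "syl3", "syl1")   # bird-song-like string
phrase = merge("the", merge("tame", "bird"))          # nested, hierarchical

print(song)     # ('syl1', 'syl2', 'syl3', 'syl1')
print(phrase)   # {'the', {'tame', 'bird'}} (display order arbitrary)
```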
Ok, let me end this much too long second post, and let me end it on a very positive note. In the second lecture Chomsky does what we
all should be doing when we are doing minimalist syntax. He is interested in
finding simple computational systems that derive the basic properties of FL. He
concentrates on some very interesting key features: unbounded hierarchy,
displacement, reconstruction, etc. and makes concrete proposals (i.e. he offers
a minimalist theory) that seem
plausible. Whether he is right in detail is less important IMO than that his
ambitions and methods are worth copying. He identifies non-trivial properties
of FL that GG has discovered over the last 60 years and he tries to explain why
they should exist. This is exactly the
right kind of thing MPers should be doing. Is he right? Well, let’s just say
that I don’t entirely agree with him (yet!). Does lecture 2 provide a nice example of what MP research should look like? You bet. It identifies real deep
properties of FL and sees how to derive them from more general principles and
operations. If we are ever to solve Darwin’s problem, we will need simple
systems that do just what Chomsky is proposing.
[1]
Note, we want the “necessarily” here.
That it is both simple and efficient
does not explain why it need be efficient if
simple.
[2]
It is also a necessary condition for incrementality in the use systems (e.g.
parsing), as Bill Idsardi pointed out to me.
I know that the SMT does not care about use systems according to some
(Dennis and William this is a shout-out to you), but this is a curious and
interesting fact nonetheless. Moreover,
if I am right that the last three properties do not follow (at least not
obviously) from conceptual considerations, it seems that Chomsky might be
pursuing a dual route strategy for explaining the properties of FL.
[3]
Note that this assumes that there is no syntactic difference between inherent
features and features valued in the course of the derivation.
[4]
And even this requires a special version of the theory, one like Idsardi and
Lidz’s rather than Zwart’s.
[5]
However, if v has raised to T before Transfer, then one might try to link these
features to the thematic argument that v licenses. And then it might make lots
of sense to say that phi-features are interpretable on T. They would say that
the variable of the predicate bound by the subject must have such and such an
interpretation. This information might be redundant, but it is not obviously
uninterpretable.
[6]
The ‘autonomy of syntax’ thesis refers to more than one claim. The simplest one
is that syntactic primitives/operations are not reducible to phonetic or
semantic ones. This is not the version adverted to above. This is a more
specific version of the thesis; one that requires a complete separation between
syntactic and semantic information in the course of a derivation. Note that
the idea that one can add EPP/edge features only if it affects interpretation
(the Reinhart-Fox view that Chomsky has at times endorsed) violates this strong
version of the autonomy thesis.
[7]
Note, we still need to define ‘domain’ here.
[8]
Note, incidentally, that Chomsky assumes both that features are +/- valued and
that they are +/- interpretable. At one time, the former was considered a
substitute for the latter. Now, they are both theoretically required, it seems.
As -valued features seem to always be -interpretable, this seems like an
unwanted redundancy.
[10]
A question: we can define ordered pairs set theoretically. I assume the
argument against labels is that ordered sets are conceptually more complex than
unordered sets. So {a,b} is conceptually simpler than {a,{a,b}}. If this is the argument, it is very very
subtle. I find it hard to believe that whereas the former is simple enough to
be biologically added, the latter is not. Or even that the relative simplicity
of the two could possibly matter. Ditto for other operations like concatenation
in place of Merge as the simplest operation.
Given how long this post is already, I will refrain from elaborating
these points here.
[11]
Birds (and mice and other animals) can string “syllables” together (put them
together in a left/right order) to make songs. From what I can tell, there is
no hard upper bound on how many syllables can be so combined. These do not display hierarchy, but they may
be recursive in the sense that the combination operation can iterate. Might it
not be possible that what we find in FL builds on this iteration operation?
That the recursion we find in FL is iteration plus something novel (I have
suggested labeling is the novelty)? My point here is not that this is correct,
but that the question of simplicity in a DP context need not just be a matter
of conceptual simplicity.
[12]
How are sets formed? How computationally simple is the comprehension axiom in
set theory, for example? It is actually logically quite involved (see here). I
ask because Merge is a set forming
operation, so the relevant question is how cognitively complex is it to form arbitrary sets. We have been assuming
that this is conceptually simple and hence cognitively easy. However, it is
worth considering just how easy. The Wikipedia entry suggests that it is not a
particularly simple operation. Sets are funny things and what mental powers go
into being able to construct them is not all that clear.
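For reference (my addition, not part of the original footnote): the restricted comprehension (separation) schema of standard set theory is, for each first-order formula φ in which y does not occur free,

\[
\forall z\;\exists y\;\forall x\,\bigl(x \in y \leftrightarrow (x \in z \land \varphi(x))\bigr)
\]

It is a schema, one axiom per formula, which gives some sense of why "form a set of the things satisfying a condition" is logically less trivial than it sounds.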
Depending on one's assumptions regarding case and regarding how internal arguments of nouns become possessors, the following may be the example you're looking for:
(1) John[1]'s arrest t[1]
(2) John[1]'s arrest of [John][2]
So, to be (only very slightly) more concrete: if the "of" in (2) is just the case morphology given to DPs that have not A-moved out of [Compl,N], then syntactically, (1) and (2) are distinguished only by the fact that the two "John"s in (1) are copies of the same object and the two "John"s in (2) are not.
This is a question that has always puzzled me but that probably has a straightforward answer: In what sense can notions such as computational efficiency be applied to things which are not to be interpreted as corresponding to "actual processes" in the (vague) sense of performance?
To elaborate a bit, I feel as if Norbert shares part of my puzzlement when saying "it presupposes some conception of performance for only in this kind of context do resource issues (viz. memory concerns) arise", though in a slightly more limited context than I think is appropriate. It's not just memory; even search seems, to me at least, to only make sense when thinking about performance -- "God has no need for bounding search". Or does he?
@BB
As you say, I sort of agree with you (though I'm not sure what you mean by "actual" above. Real? The ones we use? If so, yes). Search, memory load etc only makes sense to me in the context of some system that uses the G. That's why I tried to suggest in earlier posts that we understand the SMT as committing hostages to the kinds of issues that Berwick, Wexler, Weinberg, DeMarcken etc discussed so fruitfully. So, I agree.
What I did not fully appreciate is that Chomsky wants to get a lot of mileage out of conceptual simplicity concerns. Of the 7 properties he discusses, he believes 4 follow directly from the "simplest" conception of the combine operation. Say we agree, then these aspects of FL have little to do with resource issues. They are, as it were, purely facts about the data structures and the kinds of info they code. My own view is that even wrt these we can peek at their performance implications (the NTC in particular). However, one need not. When it comes to phases and copy deletion however, I think that even Chomsky is thinking in a performancy manner, albeit at a very abstract level. I personally don't think that 'search' is the right way to put things. But I do think that bounding computation is a good idea for finite minds like ours. If resources are infinite (God?) then computational cost is irrelevant. But then if minds are infinite do we even need to recursively specify anything? I can see why only the brave indulge in metaphysics!
One issue, not entirely clear to me, is how the deletion operation is implemented. This is not part of narrow syntax, right? That consists only of Merge. So the deletion operation is properly part of the interface(s), correct? I don't know about deletion occurring at CI, but if deletion occurs at SM, then presumably this interface can appropriately take into account the externalization system's computational costs. So, I suppose we have notions of computational efficiency/optimality that occur at multiple levels, both at narrow syntax (Merge) and at the interfaces, each with different notions of efficiency at play.
@ William
That's a good point. It's not clear what to make of deletion processes. One option is that there IS a deletion operation that cleans the syntactic phrase marker up. So something like FULL INTERPRETATION, read as saying that the interface is passive and reads all that it gets, suggests that there is some pre-interface process that cleans the relevant representations up. This is assuredly NOT Merge. But then, feature lowering is not Merge either but this is part of the syntactic computation so it contains more than Merge. Ditto with Probing and Agreeing. So, Merge may be the newbie on the block but it is not the only thing that FL does.
I think that I agree that there are various notions of "complexity" that Chomsky is playing with. And they may respond to different concerns. A big hypothetical way of putting these all together would be to argue that systems with the conceptually simplest rules are necessarily embedded in systems with optimal computational properties. Maybe. It's logically possible. But we would need an argument.