Monday, September 16, 2013

Why Formalize?

This is a long post.  I guess I got carried away. In my defense, I think that the topic keeps coming up obliquely and so it’s time to take it head on. The issue is the role of formalization in grammatical research. Let me be up front with the conclusion: there’s nothing wrong with it, it can be valuable if done the right way, and it has been valuable in some cases where it has been done the right way, but there seems to be an attitude floating in the atmosphere that there is something inherently indispensable about it, and that anything that is not formalized is ipso facto contentless and/or incapable of empirical examination.  This, IMO, is balderdash (a word I’ve always wanted to use).  Not only is this false within linguistics, it is false for other sciences as well (I mention a few examples below).  Moreover, the golden glow of formalization often comes wrapped in its own dark cloud. In particular, if done without a sensitivity to the theories/concepts of interest, it risks being imposing-looking junk.  This is actually more deleterious than simple junk, for formalizations can suggest depth and can invite a demand for respect that simple junk does not. This point will take me some time to elaborate, and that’s why the post is way too long. Here goes.

There is an interesting chain of conversation in the thread to this post. In the original, I outlined “a cute example” from Chomsky, which illustrated the main point about structure dependence but sans the string-altering features that T to C movement (TCM) in Yes/No questions (Y/Ns) can induce. To briefly review, the example in (1) requires that instinctively modify the main verb swim rather than the verb fly inside the relative clause; this despite (i) the fact that eagles do instinctively fly but don’t instinctively swim (at least I don’t think they do) and (ii) the fact that fly is linearly closer to instinctively than swim is.

(1)  Instinctively, eagles that fly swim

The analogy to TCM in Y/Ns lies in both adverbial modification and TCM being insensitive to string-linear properties. I thought the example provided a nice simple illustration of how one can be distracted by irrelevant surface differences (e.g. in *Is the boy who sleeping is dreaming, the pair who sleeping is an illicit bigram) and so miss the main point, viz. that examples like this are a dime a dozen and that this surface blemish cannot generally be expected to help matters much (e.g. Can boys that sleep dream means ‘is it the case that boys that sleep can dream’ and not ‘is it the case that boys that can sleep dream’).  At any rate, I thought that the adverbial modification facts were useful in making this point, but I was wrong.  And I was wrong because I failed to consider how the adverbial case would be generated and how its generation compared to that of Y/Ns. I plan to try and rectify this mistake here before getting on to a discussion of how and whether formalization can clarify matters, a point raised in the notes by Alex Clark and Benjamin Boerschinger, though in somewhat different ways. But first let’s consider the syntax of the matter.

The standard analysis of Y/Ns is that T moves to C to discharge some demand of C. More specifically, in some Gs (note TCM is not a feature of UG!), +WH C needs a finite T for “support.” GB analyzes this operation as a species of head movement from the genus ‘move alpha’.  Minimalists (at least some) have treated this as an instance of I-merge. Let’s assume something like this is correct. The UGish question is why, in sentences like (2), can must be interpreted as moving from the matrix T and not from the embedded relative clause T.

(2)  [Can+C+WH [DP boys [CP that (*tcan) sleep]] *(tcan) dream][1]

What makes (2) interesting is precisely the fact that in cases like this there are two potential candidates available for satisfying the needs of C+WH but that only one of them can serve, the other being strictly prohibited from moving. The relevant question is what principle selects the right T for movement? One answer is that the relevant T is the one “closest” to C and that proximity is measured hierarchically, not linearly.  Indeed, such examples show that were locality measured string linearly then the opposite facts should obtain. Thus, these kinds of data indicate that in such cases we had better prohibit string linear measures of proximity as they never seem to be empirically condign for Y/Ns.
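To make the contrast concrete, here is a toy sketch of how the two notions of “closest T” come apart. The hand-built tree and the two selection rules are my own illustrative assumptions (this is not a parser and not a serious analysis); the tree stands in for something like boys that can sleep can dream, with a T inside the subject relative and a matrix T:

```python
# Toy tree: each node is (label, children) and each leaf is (label, word).
# A hand-built stand-in for "boys that can sleep can dream" (assumption).
tree = ("TP",
        [("DP",
          [("N", "boys"),
           ("CP",
            [("C", "that"),
             ("T", "can"),        # T inside the subject relative clause
             ("V", "sleep")])]),
         ("T", "can"),            # matrix T
         ("V", "dream")])

def leaves(node, depth=0):
    """Yield (label, word, embedding depth) left to right."""
    label, rest = node
    if isinstance(rest, str):
        yield (label, rest, depth)
    else:
        for child in rest:
            yield from leaves(child, depth + 1)

toks = list(leaves(tree))
t_positions = [i for i, (lab, _, _) in enumerate(toks) if lab == "T"]

# String-linear "closeness": the first T encountered in the word string.
linear_pick = t_positions[0]
# Hierarchical "closeness": the T embedded least deeply.
struct_pick = min(t_positions, key=lambda i: toks[i][2])

print(linear_pick, struct_pick)  # the two rules pick different T tokens
```

The linear rule grabs the relative-clause T (the one that may not move); the hierarchical rule grabs the matrix T (the one that must). That is the whole point of (2) in five lines.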

So far, I believe, there is no controversy. The battle begins in trying to locate the source of the prohibition against string-linear restrictions like the one proposed. For people like me, we take these kinds of cases to indicate that the prohibition has its roots in UG. Thus, a particular G eschews string-linear notions of proximity because UG eschews such notions and thus they cannot be components of particular Gs. Others argue that the prohibitions are G specific and thus that structure-independent processes of syntax are possible given the appropriate input.  In other words, were our young LADs and LASs (language acquisition devices/systems) exposed to the relevant input they would acquire rules that moved the linearly most proximate T to C rather than the hierarchically most prominent one. Thus, the disagreement, like all disagreements about scientific principles, is one about counterfactual situations.  So, how does one argue about a counterfactual?

My side scours the Gs of the world and argues that linearly sensitive rules are never found in their syntax and that this argues that structure dependence is part of UG. Or, we argue that the relevant data to eliminate the string-linear condition is unavailable in the PLD available to LADs and LASs, and so the absence of string-linear conditions in GEnglish, for example, cannot be traced to the inductive proclivities of LADs and LASs. The other side argues (or has to argue) that this is just a coincidence, for there is nothing inherent to Gs that prohibits such processes, and that in cases where particular Gs eschew string-linear conditions it’s because the data surveyed in the acquisition process sufficed to eliminate them, not because such conditions could not have been incorporated into particular Gs to regulate how their rules apply.

Note that the arguments are related but independent. Should the absence of string-linear conditions from the Gs of the world prove correct (and I believe that it is very strongly supported) it should cast doubt on any kind of “coincidence” theory (btw, this is where the adverbial cases in (1) are relevant). So too should evidence that the PLD is too weak to provide an inductive basis for eliminating the string-linear option (which I also believe has been demonstrated, at least to my satisfaction).[2]

This said, it is important to note that this is a very weak conclusion.  It simply indicates that something inherent to UG eliminates string-linear conditions as options; it does not specify what the structure-relevant feature of UG is.[3] And here is where collapsing (1) and (2) can mislead. Let me explain.

The example in (1) would typically be treated as a case of adverb fronting (AF), along the lines of (3).[4]

(3)  Adverb1 […t1…]

AF contrasts with TCM in several respects. First, it is not obligatory. Thus declaratives without AF are perfectly acceptable, in contrast with Y/Ns without TCM.[5] Second, whereas every finite clause must contain a T, not every finite declarative need contain an adverb.

The first difference can be finessed in a simple (quite uninteresting) way. We can postulate that whenever AF has applied some head F has attracted it. Thus, AF in (1) is not really optional. What’s optional is the F feature that attracts the adverb. Once there, it functions like its needy C+WH counterpart.

The second difference, I believe, drives a wedge between the two cases.  Here’s how: given that a relative clause and a matrix clause will each contain a (finite) T0 and hence both be potential C+WH rescuers, there is no reason to think that they will each contain adverbs. Why’s this relevant? Because whereas for TCM we can argue that we need a principle to select the relevant T that moves, there is no obvious choice of mover required for AF.  So whereas we can argue that the right principle for TCM in (4a) is something like Shortest Attract/Move (SA/M), this would not suffice for (4b), where there is but one adverb available for fronting.[6] Thus, if SA/M is the right principle regulating TCM it does not suffice to regulate AF cases (if, as I assume here, they are species of I-merge).

(4)  a. [C+WH [RC …T…]…T…]
b. [F+ADV [RC …ADV…]…]

What else is required? Well, the obvious answer is something like the CNPC and/or the Subject Condition (SC). Both would suffice to block AF in (4b). Moreover, both have long been considered properties of UG and both are clearly structure sensitive prohibitions (they are decidedly not string linear).[7] However, island conditions and minimality restrictions are clearly different locality conditions even if both are structure dependent.[8]
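The division of labor can be caricatured in a few lines of code. The encoding of a movement dependency as a list of crossed node labels plus a count of skipped competitors is my own toy assumption, not an implementation of any MG; the point is only that an SA/M-style filter passes the bad AF case in (4b) while a CNPC/SC-style island filter blocks it:

```python
# Two toy locality filters over a caricatured "dependency": the labels of
# the nodes the mover crosses, plus how many closer candidates it skipped.
# This encoding is an illustrative assumption, not Stabler's formalism.

def violates_shortest_move(path, skipped):
    # SA/M cares only about whether a closer candidate was passed over.
    return skipped > 0

def violates_island(path):
    # CNPC/Subject Condition sketch: no extraction out of a relative
    # clause (CP) sitting inside a DP.
    return any(a == "DP" and b == "CP" for a, b in zip(path, path[1:]))

# (4a): TCM from inside the subject relative skips the matrix T.
tcm_bad = dict(path=["C", "TP", "DP", "CP", "T"], skipped=1)
# (4b): AF from inside the subject relative skips nothing -- there is
# only one adverb -- yet the result is still bad.
af_bad = dict(path=["F", "TP", "DP", "CP", "AdvP"], skipped=0)

print(violates_shortest_move(**tcm_bad))   # True: SA/M suffices for (4a)
print(violates_shortest_move(**af_bad))    # False: SA/M lets (4b) through
print(violates_island(af_bad["path"]))     # True: the island filter blocks it
```

Nothing here is an argument; it just makes vivid that minimality and islands are different filters, both structure-sensitive, and that only the second one catches the AF case.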

Now this has been a long-winded and overly didactic review of a much over-discussed example. Why do I bring this up again?! Because of some comments by Alex Clark suggesting that the AF facts could be derived in formalized minimalist grammars and that this therefore nullifies any explanation of the kind provided above, viz. that UG explains the data in (1) and (2) by noting that the relevant structures are underivable (my emphasis in what follows):

So here is a more controversial technical claim:
let English+ be English with additionally the single incorrect pairing (s,w2). English+ can be generated by an MCFG; ergo it can be generated by an MG. English++ is English but additionally with the fronted adverbs out of subject relatives; again generable by an MG. (MG means Stabler's Minimalist grammars with shortest move constraint). So I think these claims are correct, and if not could someone technical chime in and correct me.

So Norbert is right that the grammars will look strange. Very strange indeed if you actually convert them from an MCFG. But they are allowed by this class of grammars, which in a sense defines the notion of licit grammatical dependencies in the theory. So Norbert wants to say, oh well if my theory makes the wrong predictions then it has been formalized incorrectly, and when it is formalized correctly it will make the right predictions, Period. But while this is certainly a ballsy argument, it's not really playing the game.

Alex is right: Stabler’s MG with SA/M can derive the relevant AF examples, as the derivation implicit in (4b) does not violate SA/M. That’s why an MG including this (let’s call it ‘SMG’) cannot prevent the derivation. However, as Stabler (here) notes, there are other MGs that code for other kinds of locality restrictions. In fact, there are whole families of such, some encoding relativized minimality (RMG) and some embodying phases (PMG and PCMG). I assume, though Stabler does not explicitly discuss this, that it is also possible to combine different locality restrictions together in an MG (i.e. an RPMG that combines both relativized minimality and phases). So what we have are formalizations of various MG grammars (SMG, RMG, PMG, PCMG, and RPMG), all with slightly different locality properties, that generate slightly different licit structural configurations. Stabler shows that despite these differences, these restricted versions of MG all share some common computational properties, such as efficient recognition and parsability.  However, they are, as Stabler notes, also different in that they allow for different kinds of licit configurations, PMG/PCMGs blocking dependencies that violate the PIC and RMGs blocking those that violate relativized minimality (see his section 4). In sum, there are varieties of MGs that have been formalized by Stabler and Co., and these encode different kinds of conditions that have been empirically motivated in the minimalist literature.[9] There is no problem in formalizing these different MGs, nor in recognizing that despite being different in what structures they license they can still share some common general properties.

Three observations: First, I leave it as an exercise for the reader to code island restrictions (like the CNPC) in phase based terms. This is not hard to do given that phases and the original subjacency theory (i.e. employing bounding nodes) are virtually isomorphic (hint: D is a phase without an accessible phase edge).[10]

Second, the Stabler paper offers one very good reason for formalizing grammars. The paper shows that different theories (i.e. those that characterize UG differently) can nonetheless share many features in common.  Though empirically relevant, the different locality conditions do not differ in some of their more general computational features. Good. What we see is that not all empirically different characterizations of FL/UG need have entirely different computational properties.[11]

Third, Stabler recognizes that the way to explore MG and UG computationally is to START with the empirically motivated features research has discovered and then develop formalizations that encode them. More pointedly, this seems to contrast with Alex’s favored method of choosing some arbitrary formalism (simple MG) and then insisting that anyone who thinks that this is the wrong formalism (e.g. moi) for the problem at hand is (though “ballsy,” hmm, really! OK), “not really playing the game.”  Au contraire: to be interesting the formal game requires formalizing the right things.  If research has found that FL/UG contains islands and minimality, then to be interesting your formalization had better code both these restrictions. If it doesn’t it’s just the wrong formalization and is not, and should not, be part of any game anyone focused on FL/UG plays. There may be some other game (as David Adger suggests in his comments on the post) but it is arguably of no obvious relevance to any research that syntacticians like moi are engaged in and of dubious relevance to the investigation of FL/UG or the acquisition of grammar. Boy, that felt good to say![12]

Now, it is possible that some grammars encode inconsistent principles and that formalization could demonstrate this (think Russell’s Paradox and Frege’s naïve set theory). However, this is not at issue here. What is at issue is how one conceives of the proper role of formalization in this sort of inquiry. Frankly, I am a big fan. I think that there has been some very insightful work of the formalizing sort.[13] However, there has also been a lot of bullying. And there has been a lot of misguided rhetoric. Formalization is useful, but hardly indispensable. Remember, Euclidean geometry did just fine for thousands of years before it was finally formalized by Hilbert; so too the Calculus before Cauchy/Weierstrass. Not to mention standard work in biology and physics, which though (sometimes) mathematical is hardly formalized (not at all the same thing; formal does not equate with formalized). What we need are clear models that make clear predictions and that can be explored. Formalization can help in this process, and to the degree that it does, it should be encouraged. But PULEEEZE, it is not a panacea and it is not even a pre-requisite for good work.  And, in general, it should be understood to be the handmaiden of theory, not its taskmaster. To repeat, formalizations that formalize the wrong things or leave out the right things are of questionable value. Jim Higginbotham has expressed this well in his discussion of whether English is a context free language (here).[14] As he put it:

…once our attention turns to core grammar as the primary object of linguistic study, questions such as the one that I have tried to answer here are of secondary importance (232).

What matters are the properties of FL/UG. Formalizations that encompass these are interesting and can be useful tools for investigating further properties of FL/UG. But, and this is the important part (maybe this is what makes my attitude “ballsy”), they must earn their keep, and if they fail to code the relevant features of “core grammar” or FL/UG then, no matter how careful and precise their claims, I don’t see what they bring to the game.  Quite often what we find is all hat, no cattle.

[1] The traces are here to mark the possible base positions of can. (*…) means ‘unacceptable if included’ while *(…) means ‘unacceptable if left out’.
[2] There are also actual language acquisition studies like Crain and Nakayama that are relevant to evaluating the UG claim.
[3] Indeed, it need not be a feature of UG at all, at least in principle.  Imagine an argument to the effect that learning in general, i.e. whatever the cognitive domain, ignores string-linear information. Were that so, it would suffice to explain why it is so ignored in the acquisition of Gs.  However, this view strikes me as quite exotic and, I believe, would cause no small degree of problems in the acquisition of phonology, for example, where string-linear relations are where all the action is.
[4] I assume that this is how to treat AF. If, however, adverbs could be base generated sentence initially and their interpretation were subject to a rule like “modify the nearest V,” then AF phenomena would be entirely analogous to TCMs. The main reason I doubt that this is the right analysis comes from data like that in note 7, where it appears that adverbs can move quite a distance from the verbs they modify. This is certainly true of WH adverbs like when and how, but I also find cases like (i) in note 7 acceptable with long distance readings.
[5] TCM also seems obligatory in WH questions, though there is some debate right now about whether TCM applies in questions like who left. For Y/Ns, however, it is always required in matrix clauses. Here is a shameless plug for a discussion of these matters in a recent joint paper with Vicki Carstens and Dan Seely (here).
[6] I hope it goes without saying that I am simplifying matters here. The relevant F is not +ADV for example, but something closer to +focus, but none of this is relevant here as far as I can see.
[7] There is one further important difference between TCM and AF. The latter is not strictly local. Thus, in (i) tomorrow can modify either the matrix or the embedded clause. Nonetheless, it cannot modify the relative clause in (ii):
(i)             Tomorrow, NBR is reporting that Bill will be in Moscow
(ii)           Tomorrow, the news anchor that is reporting from DC will be in Moscow
[8] I am setting aside the question whether there is a way of unifying the two. It is to be hoped that there is, but I don’t know of any currently workable suggestions. Such a unification would not alter anything said below, though it would make for a richer and nicer theory of FL/UG.
[9] In other papers Stabler considers MGs that allow for sidewards movement and those that don’t. It is nice to see that to date the theoretical innovations proposed have been formalizable in pretty straightforward ways, or so it appears.
[10] Stabler observes the obvious parallelism with bounding node versions of Subjacency. The isomorphism between the older GB theory and modern Phase Theory means that Phase Theory does not advance our understanding of islands in any minimalistically interesting way. However, for current purposes the fact that Phases and the PIC can code the CNPC and SC suffices. We all welcome the day when we have a deeper understanding of the locality restrictions underlying islands.
[11] IMO, this is the best reason to formalize: to see if two things that look very different may nonetheless be similar (or identical) with respect to other properties of interest.  It is possible to be careful and clear without being formalized (indeed, the latter often obscures as much as it enlightens). However, formalizing often allows one to take a bird’s eye view of theoretical matters that matter, and when used thus it can be extremely enlightening.
[12] Let me insist: this is not to argue that formalization does not have a place in such investigations. Rather, it is to argue that the fact that some formalization fails to make the relevant cut is generally a problem about the formalization not the adequacy of the empirical cut. See below.
[13] I am a big fan of Bob Berwick’s work, as well as that of Stabler and Co., Tim Hunter’s (with Chris and alone), and the stuff on phonology by Heinz and Idsardi, to name a few. What makes these all interesting to me is their being anchored firmly in the syntactic theory literature.
[14] Thx to Bob Berwick for the reference.


  1. This may be overly pedantic but I think there's a point in the background here (and in all the discussion on the previous post) that has not been made explicit: if we want to conclude anything about structure-dependence from "Instinctively, eagles that fly swim", then we need to at least attempt to show that no non-structure-dependent rule works. And that single example does nothing to discredit a rule like "A fronted adverb always modifies the last verb in the sentence". Of course it's not hard to put together some relevant example with a complex object to spoil the party for that "last verb" rule, but then you have to make sure you also show that a "second verb" rule can't work, etc. In a way it may seem harmless to skip over these steps of the argument because everyone's familiar with the facts, but I think skipping over them sometimes clouds the issue.

    Starting off by saying "Look, it's impossible to form this dependency into a subject", and stating this as if it's an observation, is begging the question entirely (because "subject" is a structural notion).

    1. This last sentence is a deep question that I've been struggling with for the last few months. Our generalizations are often stated in highly theoretical terms and it seems very difficult to strip down the theory to its core and really figure out what's being said. Constituency, categories, etc. are all highly theoretical, and we're not even into the fancier stuff like islands. It gives me much angst.

  2. Goodness, quite long. But, just a brief comment before I read it all: "who sleeping" is quite the licit bigram! After all, I'm a guy who sleeping appeals to quite often! In fact, sleeping appeals to me pretty much every time I'm trying to wake up!

  3. Just extracting from that comment above (i.e. without delving into the old comments) it sounds like the point is not to say the line of argument is wrong. Rather, sounds like the point was to say that, upon closer inspection, your theory actually doesn't explain the pattern because the cleaned up version of the theory allows us to inspect very closely and find a derivation that gets a wrong fact. And then you replied saying yes but there are other ways to clean it up that get the facts right - so the theory isn't wrong, in fact. Sorry if this is incoherent. But was that it?

    1. Not exactly. The point I was trying to make was that both TCM and AF provide evidence that Gs DON'T use structure INdependent processes. However, this does not imply that they use the SAME structure dependent ones. On the most plausible analyses, I suspect that they don't (though the island explanation used for AF extends to TCM, i.e. the CNPC generalizes to both; SA/M does not). Moreover, this was easily seen once one considered the analyses available. The theory did not need cleaning up; the question at issue changed.

      However, I had a larger fish that I wanted to fry: that when one looks at some formalization/theory and concludes that it fails, it is very useful to ask WHY it fails. And then to ask whether it can be made whole or whether the data are truly inconsistent with the major lines of the theory. In this context, Alex took one version of MP to be definitive and concluded that MP provided no account of these phenomena. However, (big surprise) like any story, MP involves a family of theories that can be formalized in many ways. So, saying MP does or doesn't derive the discussed data requires fixing what the question is and what the theories at issue are. I take none of this to be anything but anodyne.

  4. Often the exact formalisation doesn't matter. I think the problem here is with the theory, not the formalisation of it. But I do think that the inadequacies of the argument have been hidden by the lack of formalization, so I guess there is an argument for formalization lurking there somewhere.

    So my argument is something like:

    Premise A:
    Any formalisation of your theory of UG you come up with, if it allows English, will allow English+.

    Your theory of UG does not explain *on its own* why we see English rather than English+.

    So we can argue about whether this is a valid argument or we can argue about whether premise A is true.

    Again if one of the technical people reading this thinks that Premise A is wrong, I would like to be put straight.
    But generally formalizations like this can
    a) generate any finite set of sound-meaning pairs and
    b) are closed under union,
    from which two assumptions Premise A follows.
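    For the nontechnical reader, the shape of this argument can be sketched with sets of sound-meaning pairs standing in for the languages the grammars generate. This is an illustrative simplification of my own (the substantive claim is that MCFG/MG-definable languages have properties (a) and (b)); the pair names are placeholders, not the actual example sentences:

```python
# Sets of (sound, meaning) pairs stand in for generated languages.
english = {("s", "w1"), ("some other sound", "its meaning")}

# (a) any finite set of pairs is generable, including the bad pair alone:
exception = {("s", "w2")}

# (b) the class is closed under union, so if English is generable,
# English+ = English plus the single bad pair is generable too.
english_plus = english | exception

print(("s", "w2") in english_plus)  # True
```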

    But you might also say that I am missing the point of the example which you might claim is about admissible structures rather than admissible sound meaning pairs.
    I would say that
    a) we don't observe the structures. Syntactic structures aren't data, they are theoretical constructs that we posit to explain certain aspects of the sound/meaning relation. (Tim H's point which is also "anodyne") And
    b) crucially, (s,w2) (which is the example in English+ - English) has a well-formed structure/sound/meaning mapping *by definition* , since it is generated by a grammar in your class which defines the class of legitimate structures.
    c) anyway you explicitly said that "Berwick et. al. emphasize that the Poverty of Stimulus Problem has always aimed to explain “constrained homophony,” a fact about absent meanings for given word strings.."

    So how then does your UG explain the "cute example"? My view is that MGs, or any reasonable formalisation of them, are far too unconstrained to explain some detail like that.

    (ETA: I really think that the mathematical study of the set of allowable sound-meaning relationships is really important, and understudied, and you should all look at a recent paper which I think will turn out to be pivotal: Makoto Kanazawa and Sylvain Salvati. 2013. The string-meaning relations definable by Lambek grammars and context-free grammars.
    It's very dense so just read the abstract. So this could be the right place to look for some very good explanations.)

    1. @Alex C. Would it be fair to roughly paraphrase the claim you are making as the claim that we do not currently have an adequate formalization of island constraints? The existence of island constraints is of course an undeniable descriptive fact[1], and it also seems overwhelmingly likely that ‘instinctively’ cannot associate with ‘fly’ due to the presence of subject and RC islands. I'm guessing you would agree up to this point (without assuming that the relevant island constraints necessarily derive from constraints on grammars themselves). So is the problem in your view just that no-one has so far succeeded in adequately formalizing the relevant island constraints (or the deeper principles that give rise to them)? Or rather, that no-one has succeeded in doing this in a way which does not presuppose some kind of evaluation measure which can block the acquisition of “crazy” grammars? If it's the latter then there may not be a very big point of disagreement here, since at least some Chomskyans (including earlier iterations of Chomsky himself) are open to the possibility that the evaluation measure plays a significant role.

      [1] There are interesting exceptions, but none that appear to be relevant here.

    2. How indeed. The explanation goes as follows: given that the words mean what they do and that adverbs that modify verbs do so by being related to them in some way, the question is why doesn't 'instinctively' modify 'fly'? The answer is that to establish the relevant relation it would have to be mediated by a kind of dependency that UG disallows. What can we say about this prohibition? First, that it is not string linear: e.g. given the case at hand, and given that words mean what we think they do, the rule cannot be 'modify the string-linearly closest verb'. Here's another thing we can say: the dependency is not regulated by minimality as coded in MG. Here's another thing we can say: if some version of PMG is right (one that codes for islands) then we could explain it. That's it.

    3. @Alex C:

      I am most definitely a nontechnical person so would like a little more detail on how Premise A is true. English+ is not meant to be the same as the English'' from the "A cute example" post, is it? Because those two languages, as you've defined them, are not the same. English'' is English with additional lexical items from English' (which was English with a shuffled lexicon), while English+ was "English with additionally the single incorrect pairing (w,s2)".

      If this was meant to be a description of English'', it's not accurate. There are many more differences between English'' and English+ than simply whether they allow (s,w2). For instance, English'' allows a sentence like "I like instinctively eagles fly", meaning 'I like eagles that fly'; this is also not allowed in English. Maybe this is a point that doesn't need to be made, but I find it misleading to describe English+ the way you do (unless you don't mean it to be identical to English''). As far as I can tell, you can't get (s,w2) unless you *also* get sentences like "I like instinctively eagles fly", and so on.

      David Adger made a similar point in a comment on the other thread, though I'm not sure I agree with him. He says that English'' probably wouldn't be learned for the reasons I just mentioned, but actually if we are serious about UG then in fact English'' should be *easier* to learn than English+. UG doesn't allow extraction from specifiers, so (by hypothesis) a language that includes both (w,s1) and (w,s2) shouldn't be learned *unless* it also includes (e.g.) both "I like instinctively eagles fly" and "I like eagles that fly", both with the same meaning ('I like eagles that fly'), which serve to clue the learner into the fact that there are two homophonous lexical items "eagles", and two homophonous lexical items "that", and two phonologically distinct words meaning 'eagles', and two phonologically distinct words meaning 'that', and so forth. Of course, whether such a grammar actually *is* easier to learn is an empirical question; but I think our current theory of UG makes a testable prediction in this case.

      (continued below...)

    4. (...continued from above)

      So English'', if it actually existed, would be somewhat akin to one of those "apparent" (but not actual) counterexamples which are usually trotted out towards the end of a linguistics paper. English+ would be an actual counterexample. A language like English'' is permitted by our theory of UG, as it should be. Then why don't we see any languages like English'' in the real world? I imagine two factors are at play.

      (1) The amazing coincidence that would have to have transpired for just the right interaction of homophony and synonymy that we see in English'' (e.g., there are two words pronounced "that", one of which means 'instinctively', and there are two words pronounced "eagles", one of which means 'that', and there are two words pronounced "instinctively", one of which means 'eagles'). UG doesn't rule out languages like English'', but the probability of a language with just the right lexicon is so low that we just don't see them.

      (2) As coincidental as such a language would be, I imagine its likelihood would diminish even further as a result of functional constraints on language transmission and ease of learning, as D. Adger suggested. (Though I stress again that if we're serious about UG then English'', while harder to learn than English, should still be easier to learn than English+.) Though natural language is known to tolerate tons of ambiguity, I imagine a language like English'' would be a bit too much for even NL (or, rather, a learner of NL) to handle. One could see this appeal to functionalism as an admission of defeat, but I don't see it as such. Generativist explanations are not fundamentally opposed to functionalist ones, and can often be complementary (as argued by e.g. Newmeyer); on the other hand, generativism and functionalism often do not even seek to solve the same problems, and so again are not diametrically opposed. While a generativist explanation is, I suppose, not needed to account for the absence of English'', I don't think a purely functionalist approach can explain the absence of languages like English+ (which, again, is *not* English'').

      Apologies if I've totally misunderstood; the points made above are invalid if English+ is not meant to be something like English''. If that's the case, is it possible to see in just what way Premise A is true, dumbed down just enough for a nontechnical person like me?

    5. Apologies for the slightly rushed responses:

      @RG: English+ was meant to be the language (set of sound-meaning pairs) that is just English with the single "wrong" additional sound-meaning pairing (s,w2). So different from English''.

      So the idea here is that natural languages often have exceptional constructions that are idiosyncratic both syntactically and semantically ("by and large", "I could care less", etc. etc.), and any adequate grammar formalism must be able to represent some finite set of these exceptions. And we can give these analyses syntactic features that are not shared by any other lexical items, so we can then take the union of the grammar for English with a mini-grammar that just generates the pair (s,w2), and the resulting grammar will just generate English+.

      You say "Or rather, that no-one has succeeded in doing this in a way which does not presuppose some kind of evaluation measure which can block the acquisition of “crazy” grammars? If it's the latter then there may not be a very big point of disagreement here, since at least some Chomskyans (including earlier iterations of Chomsky himself), are open to the possibility that the evaluation measure plays a significant role.".
      I guess that is my point, but of course saying "evaluation measure" is vacuous: which measure is it?
      And I think this is very contentious. Because if the explanatory work is being done by the evaluation measure, then the arguments about UG become largely empty.
      Hale and Reiss make this point well, I think, in their book The Phonological Enterprise, when they talk about assimilating UG to the LAD.

    6. @AlexC Does this mean that if there is NO evaluation measure and the island restrictions are absolute then there is no problem? Recall Alex D's question noted that there are two schools of thought re Eval Measures. P&P theories have tended to eschew these. But is this right: if there are no eval measures, and so no exceptions to islands for example, then your point is moot?

    7. Hi Norbert; I don't follow. It is the absence of the evaluation measure that causes the problem. Sorry if I misunderstand, but I am about to catch a plane and am a bit rushed.

      P&P models don't have these problems I think. Or rather wouldn't have if they were specified fully.

    8. Thx. I was assuming that islands were not parametrized and that there was no way of evading them using more complex rule formats, as was possible in earlier eval-measure-based accounts. So, IF this proves correct, there is no argument? Or does the argument concern whether we need eval measures in UG?

    9. Because if the explanatory work is being done by the evaluation measure, then the arguments about UG become largely empty. (Alex C)

      Well, the explanatory work is being done by a combination of the innate constraints with the evaluation measure. And as Chomsky used to emphasize in the old days, the evaluation measure is built in just as much as the constraints are. I'm just guessing here, but I'd assume that even hardcore P&P advocates don't really think that there's no evaluation measure. They just think that the universal constraints are so strong that pretty much any sensible evaluation measure would do the job (three lexical items is better than 10^15, all else being equal, that sort of thing). In other words, what P&Pers ought to say is not that there's no evaluation measure, but rather that the evaluation measure plays a negligible explanatory role as compared to the universal constraints. I completely agree with you that if we reject P&P, the evaluation measure ought to receive a lot more attention. However, in the case of a trivial POS argument such as subject/aux inversion, I think the argument can be profitably run without having a precise theory of the evaluation measure.
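The division of labour described here (strong constraints doing most of the work, plus a nearly trivial evaluation measure) can be caricatured in a few lines. This is purely an illustration; the "grammars", the entry notation, and the size metric are all invented for the example:

```python
# Illustrative sketch: a crude size-based evaluation measure.
# A "grammar" here is just a set of lexical entries; the measure
# prefers the grammar with fewer entries, all else being equal.

def grammar_size(lexicon):
    """Number of lexical entries: a stand-in for a real complexity metric."""
    return len(lexicon)

def evaluate(grammars_consistent_with_data):
    """Return the grammar the measure ranks highest (smallest here)."""
    return min(grammars_consistent_with_data, key=grammar_size)

# Two toy grammars that (by stipulation) both fit the data:
g_small = {"eagle :: N", "fly :: V", "that :: C"}
g_large = {f"item{i} :: X" for i in range(10_000)}

best = evaluate([g_large, g_small])
assert best is g_small  # three entries beat ten thousand, all else equal
```

The point being made above is that with strong enough universal constraints, almost any sensible choice of `grammar_size` picks the same winner, so the measure does little explanatory work.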

    10. That's a good point. But note that the reaction of Norbert and David A is not to say, oh ok so we need to pay some attention to the evaluation measure/learning procedure/inference mechanism, but rather to say that we need to tighten up the notion of UG. I think in the context of a discussion about the innateness of the structure dependence of syntax, the claim that the evaluation measure has any role to play can be seen as an admission that structure dependence is at least partially learned.
      Which may or may not be acceptable depending on your prior commitments.

      Further, I think that if you think the explanation is a theory of UG, say H, and an evaluation measure E, then it is hard to see why you can't have a much larger hypothesis space H' and another evaluation measure E' which gives the same effect -- i.e. ranks H ahead of H'; and so the end point of this is just a hypothesis space which represents a very general class of grammars and a specific evaluation measure which does all the work. And so there is a "slippery slope" argument here which may be why admitting any theoretically interesting role for the learning algorithm has been anathema.
      Perhaps also related to why even using the word "learning" for language acquisition is considered problematic.

    11. I am happy to buy Alex D's version. I don't see that it makes much of a difference, with one caveat: you take the Eval Measure to clearly be domain general. If it is, great. Fine with me. I am happy to suppose that it is. I also have nothing against thinking that the hypothesis space is very large and that what explains the grammars we end up with is the properties of the procedures used to navigate this domain, though I may differ from you in allowing this procedure to have some non-domain-general features (that's an empirical question, as they say). This was Chomsky's proposal in Aspects and I have nothing in principle against it. Chomsky's main objection to it had to do with the feasibility of such a theory, but I've never been that clear on how it is that P&P theories finesse the feasibility issue. I've even suggested in print that we should reconsider the earlier Eval Measure view, as I have less faith in P&P theories than I once had. So, I see nothing amiss about having learning algorithms or Eval Measures as part of an explanation of how Gs are acquired. That would be fine with me. All I require are proposals that explain what I am interested in (e.g. Island effects, Minimality, Binding effects, etc.) in these terms. So go ahead and derive some of these for me using these general learning algorithms and I am on board.

      Last point: re "learning" my objection to the term is that it has been bleached of all content when used carefully but has connotations that I believe to be misleading. But here, you agree with Lila and both of you disagree with me. Oh well.

  5. Alex, strings aren't data then either. The only "data" you have is the raw sound stream. If we want to do any worthwhile theory we have to abstract. Our grasp of the generalisations about structure is pretty good, and the ones under discussion are certainly good enough to say that there is no structural dependency available between the adverb and the verb in the subject. That's the fact we want our theory to capture and, as I pointed out on the other thread, we can do that via the kind of derivation I provided. If you aren't willing to abstract beyond strings, good luck with explaining any kind of intra- or inter-language generalisation! And ask why "strings" are an ok abstraction from phonetic multidimensional structures, while multidimensional syntactic structures aren't. I think strings are just a deeply wrong model for dealing with anything linguistic at really any level of abstraction.

    1. On the data question, data doesn't have to be merely raw sound streams; it can be slightly structured and interpreted, but crucially, at least if we're using Bogen and Woodward's stance, it has to be contingent on experiment. That means judgments are data; grammaticality or not isn't. A big Excel spreadsheet with sentences and yes/no flags is data if it's one person's judgments and we understand that it's tied to time and context and so forth. But if you take a bunch of such spreadsheets, and you the theorist process them to extract the "truth" of whether some sentence is grammatical or not, that ain't data, that's phenomenon. Trees, though, certainly aren't data; they're not phenomena either; they're part of the explanatory apparatus.

    2. Sorry, I shouldn't have been facetious. Of course raw sound streams aren't data for building syntactic theories. I was just making the point that you have to abstract at each level. For building a theory of the Language Faculty, the relevant data (or explananda if you prefer) are generalisations (like the generalisation that you can't create a dependency from C into the T of a relative clause embedded in a subject). These generalisations are what your theory is developed to account for, and the theory so developed ideally goes beyond those generalisations to new ones. So the data for theory building are not sound-meaning pairings, contrary to what Alex C assumes in his argument - I made this point on the other thread `A cute example' but obviously not very well! To account for particular patternings in sound-meaning pairings you need to build an analysis, not a theory, and judgments are the data (or explananda) for analyses, where an analysis is a particular configuration of the primitives you want to posit (possibly constrained by your theoretical framework, but perhaps you want to try other things, so possibly not). Trees, features, categories etc. are the primitives of your analysis, but a theory is not constituted of trees (or not necessarily anyway, pace TAG). A theory is the system that gives you the trees (or derivations, or whatever you think the right model for the analysis is). So what goes into the theory (the atoms and algorithms that constitute it) are motivated by generalisations that emerge from analyses: these are the explananda of the theory.

      I think that the fundamental assumption that Alex (C) and I don't share is that the import of the Chomsky sentence really is the generalisation that the example supports. You can state this generalisation in various ways, depending on how you want to set up the primitives (via movement dependencies in constituent structures, via admissible types and combinatorial operations in a categorial grammar, via features and feature-passing dependencies in a GPSG, or whatever), but the generalisation is fundamental to ensuring that you develop the right kind of theory. I think that a theory which says that the primitive operations are string-building operations imbued with linear information and restricted to contiguous relations is utterly hopeless for capturing this kind of generalisation, and since this kind of generalisation is true (or true enough to be interesting), such theories are no good as theories of grammar.

      You could, as Norbert mentioned, play the game another way (by saying that the system generates the bad meanings for the string, but that something else means they are inaccessible - e.g. processing or whatever), but I think the evidence doesn't make that look massively hopeful at the moment. But I guess that is not what is at issue. Stablerian MGs with his Specifier Impenetrability constraints, as specified in his recent Topics in CogSci paper, will just not generate the bad examples, and hence will capture the relevant generalization.

    3. So I think we just have a terminological misalignment between what we call "data" and "theory". So the architecture for me is:
      we have an internally represented grammar G that generates some structural descriptions SD, and we map each expression to a sound-meaning pair (or maybe set of pairs) like (s,m), where s is a flat sequence of acoustic categories (phones or phonemes, perhaps with some prosodic info mixed in) and m is an expression in some meaning representation language, say lambda expressions.
      So there is lots of complicated structure in the grammar and the SDs, but we don't observe these; but we do observe the outputs (modulo some caveats about the meanings because we don't know what they are).
      So the data then are the uncontentious facts about the sound-meaning pairs that are allowed (or not allowed) in the language.
      And the theoretical constructs are the LAD, the various Gs, the sets of SDs generated by the individual Gs etc.

      So obviously there are also components that map the sequence of acoustic categories to lip and tongue movements, and map sounds to acoustic categories, and any complete psychological theory has to account for this, and for the acquisition of these mappings, but I think it is reasonable to cut it at this point for all of the standard reasons (e.g. it's not specifically linguistic, there aren't any theoretically interesting learnability issues, the existence of writing systems that represent language as a flat string of discrete symbols, this is the point where the continuous and the discrete systems interface, animals can be trained to recognize acoustic categories, there is almost no debate about the phonemic inventory of particular languages etc. etc.).
      Whereas I think it is completely unreasonable to take the data to be the hypothesized SDs, for reasons that are basically the same as the list above but with *not* inserted everywhere.

      I genuinely thought that my view was the orthodox one. Am I incorrect? (About whether my view is standard or not, that is, not about whether it is right :)

    4. I'm not sure how orthodox or not your view is, but I was just saying that, for the construction of a linguistic theory, there are explananda that go beyond particular sound-meaning pairs. So the generalization that you can't topicalize an adverb from inside a subject holds of English, and every other language that I know or have tested this on (quite a few from a typology class I taught a few years back). This generalization across languages is data for the construction of a theory of the human linguistic capacity. Is the generalization an `observable'? I guess I don't really know what `observable' means in this case. All of our observations are deeply theory laden, because they involve abstraction. (Think of the constructional homonymy cases like `visiting relatives can be a pleasure'. Are there really two meanings, or is it just vague? Is this just one string? Well, varying the stress placement rules out particular readings. Do we abstract away from varying stress placement? All of these are choices we make about which properties we are happy to ignore, and which ones we aren't.) I take the point of the Chomsky example to be quite an abstract one about the generalization that adverb topicalization from inside a subject is disallowed generally, and that's what is in need of explanation, which is why I wasn't (and still amn't) convinced by the English' argument (where you redefine the lexical items so that `instinctively' means `eagles' etc). I am convinced by what I think is a slightly different argument, which is that you can use another technology to get the dependency to work (passing up features), but that just means our theory should be ruling out that technology in these cases, as the core explanandum (i.e. the generalization) still needs to be explained. I think that makes sense, but have spent the whole day so far arguing with the university Finance department, so my brain may be addled!

    5. I agree, of course, but I believe that the problem is how to define features so that passing them up cannot occur in the guise of not passing. We need some handle on the idea of a feature so that these rogue elements are banned. Now, as I understand matters, the features we have used within MG have been rather anodyne: case, agreement, +WH, Top, etc., even subcat features that are very local. At any rate, these features are pretty safe, it seems. Maybe as a working hypothesis we restrict ourselves to features that are morphologized in SOME language and ban all those that never are. This is an old GB chestnut and may serve to ban exotica till we better understand the formal problem. Of course this will have the effect of drastically cutting down what counts as a feature and what kinds of operations we can avail ourselves of, but this seems ok, at least to me, as a way of plugging this unwanted very big hole.

  6. "Stablerian MGs with his Specifier Impenetrability constraints, as specified in his recent Topics in CogSci paper will just not generate the bad examples, and hence will capture the relevant generalization."

    This statement is not quite true without further qualifications. Yes, adding the SPIC to MGs lowers their weak generative capacity by disallowing extraction from specifiers, but this does not imply that MGs cannot generate strings that linguists would analyze as involving movement from such a position.

    The thing is, many instances of Move can be replaced by Merge, even without significant changes to the phrase structure. Rather than moving XP from A to B, we can Merge an empty category at A that requires XP to be merged at B, e.g. with a mechanism similar to slash feature percolation. This strategy does not work for cases that require head movement or remnant movement, so those will be affected by the SPIC. But the example above is not such a case. If you want the SPIC to block the tree for the illicit reading, you have to posit a universally fixed set of categories and prove(!) that this set cannot be exploited in a fashion that makes slash feature percolation possible.
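To make the slash-percolation strategy concrete, here is a minimal sketch in the GPSG spirit, not Stabler's actual formalism: a movement-like dependency handled entirely by Merge, by threading a "slash" annotation upward through category labels. All category names and the `project` helper are invented for the illustration.

```python
# Illustrative sketch: "S/WH" means "an S with a WH-sized gap".
# The gap is introduced by an empty element and the slash percolates
# up at each Merge step, so no Move operation ever applies.

def project(result, children):
    """Build a node label; any undischarged slash on a child percolates up."""
    slashes = [c.split("/")[1] for c in children if "/" in c]
    assert len(slashes) <= 1, "at most one gap in this toy system"
    return result + ("/" + slashes[0] if slashes else "")

gap = "D/WH"                                  # empty element at the gap site
vp  = project("VP", ["V", gap])               # slash percolates: "VP/WH"
s   = project("S",  ["D", vp])                # still percolating: "S/WH"
cp  = project("CP", ["WH", s.split("/")[0]])  # filler merged, slash discharged

assert (vp, s, cp) == ("VP/WH", "S/WH", "CP")
```

The dependency between the filler and the gap is thus enforced purely by local Merge steps over refined categories, which is exactly why the SPIC, which only constrains Move, does not see it.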

    My hunch is that Alex's resistance to Norbert's argument is at least partially due to the fact that i) nobody has given him such a proof, and ii) more importantly, he does not believe that the set of categories is fixed across languages.

    1. Thomas: yes exactly. Norbert was explicit on an earlier post that the assumption that "the set of categories is fixed across languages." was NOT part of his theory. So I don't think his argument goes through.

    2. So let me get this straight: if the set of categories that induce islands, e.g. CNPC and Subject Islands, is fixed universally, the bulk of the problem goes away?

      Also, I'm a little confused about the problem of proving that one cannot merge across an island. Why can't one use Stabler's PMGs, suitably restricted universally so that the relative-clause C and D are phase heads with D having no relevant edge, to derive the impermeability of subjects? If PMGs code for phases and phases have this feature, why can't an MG prohibit such grammatical commerce?

    3. So far as I know, there is little reason to think that the nodes relevant for islands are NOT fixed across grammars. There was talk for a while of a parametrized theory of bounding nodes, but this was quickly given up. So, is the following correct: if the relevant island-inducing nodes DO NOT VARY ACROSS GRAMMARS then your problem goes away?

    4. You have to fix the set of all categories, not just which ones count as islands. Basically, you have to say something like "all lexical items are Cs, Ts, Vs, Ns, or Ds, and that's it".

      The problem is that category features can be abused to enforce non-local dependencies in a local fashion via Merge, including certain dependencies that are usually handled via Move. Slash feature percolation is a common example of this, but things can get a lot sneakier and elaborate than that. The SPIC does not regulate these "camouflaged" dependencies.

    5. A quick addendum: Enforcing non-local dependencies via Merge does come at the cost of blowing up the size of the lexicon, often to a ludicrous degree (for the technically minded: the blow-up is linear in the size of the original lexicon but exponential in the size of the tree automaton encoding the dependency). Replacing Move by Merge can take you from a lexicon with 2 lexical items to one whose number of lexical items exceeds the number of seconds since the Big Bang.
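The blow-up can be made concrete with a back-of-the-envelope sketch (the exact exponents depend on the encoding, so all figures below are illustrative). If the dependency is enforced by a tree automaton with q states, each lexical item of arity k gets split into one refined copy per assignment of states to its result and argument positions, roughly q^(k+1) copies:

```python
SECONDS_SINCE_BIG_BANG = 4.35e17  # ~13.8 billion years, rough figure

def refined_lexicon_size(n_items, n_states, arity=2):
    """Each item gets one refined copy per choice of automaton state
    for its result and for each of its (up to `arity`) arguments."""
    return n_items * n_states ** (arity + 1)

# A small automaton barely inflates things:
assert refined_lexicon_size(2, 3) == 54
# But the state count can itself be exponential in the size of the
# constraint's description, at which point 2 items explode:
assert refined_lexicon_size(2, 610_000) > SECONDS_SINCE_BIG_BANG
```

This is the "linear in the lexicon, exponential in the automaton" shape of the result: the multiplier on `n_items` is fixed by the automaton, but that multiplier can be astronomical.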

      Maybe this could be used to construct a learnability argument along the following lines: 1) If possible, use Move rather than Merge to model certain dependencies. 2) Since Move is the preferred option, the Merge grammar analysis is never entertained for our example sentence. 3) SPIC blocks Move. 4) No illicit reading.

  7. This is getting interesting and I completely agree. If you use some other means to get the dependency than internal Merge, then you could weakly generate the string, and perhaps even strongly generate a structure that will support the meaning. But you don't have to fix the categories. You just have to have a theory that doesn't allow category-valued features. For such a theory, see my `A minimalist theory of feature structures' in the Kibort/Corbett CUP `Features' book (draft on lingbuzz). There I propose that the admissible categories in human language don't have any more complexity than being bundles of feature-value pairs where the values are atomic in nature. I try to derive this from a general theoretical principle that confines recursive structures to being the outputs of Merge. Since Merge takes lexical items as its arguments and produces non-lexical items as its outputs, it follows that lexical items can't include recursive structures. This has some nice side effects, in that we can't have lexical items that, say, select for a CP that contains a V that selects for a PP.

    Does Stablerian SpecMG plus my `No Complex Values' idea do the trick? I guess perhaps Norbert and I were assuming something like that.

    1. Unfortunately this does not help either.

      Regarding valued features one has to be careful to distinguish systems that only allow a finite number of feature values (e.g. person specification) from those that allow an infinite number (HPSG-style recursive feature matrices). The former can actually be made atomic, the latter cannot. The camouflaging techniques for Merge only require a finite number of feature values, so even a purely atomic feature system is sufficient if we are free to pick as many features as we need.

      So yes, this implies that even the simplest type of feature system makes it possible to write MGs in which X selects only Ys that contain Zs that select Qs. If you do not limit the set of features your lexical items are built over, Merge can do a lot of things that we simply do not find in language; it can enforce structural conditions that are outlandish. For example, a tree could be well-formed only if it holds that if we interpret each leaf as 0 or 1 depending on whether it is an even number of nodes away from the root, one obtains a string that is the binary encoding of a sentence from the collected works of Shakespeare.
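For concreteness, the outlandish leaf-labelling condition just described can be computed in a few lines. This is a hypothetical illustration only: trees are nested tuples, and distance is counted in edges from the root rather than nodes, so the 0/1 convention may differ from the one intended above.

```python
def leaf_bits(tree, depth=0):
    """Label each leaf 0 or 1 by the parity of its distance from the
    root, reading leaves left to right: the kind of string an
    unrestricted feature system could be (ab)used to constrain."""
    if not isinstance(tree, tuple):          # a leaf
        return [depth % 2]
    bits = []
    for child in tree:                       # recurse into each subtree
        bits.extend(leaf_bits(child, depth + 1))
    return bits

# A small binary tree with leaves at depths 1, 2, 2:
t = ("a", ("b", "c"))
assert leaf_bits(t) == [1, 0, 0]
```

A constraint of the form "leaf_bits(tree) must encode a sentence of Shakespeare" is trivially computable, and the point above is that feature-driven Merge is powerful enough to enforce it, even though nothing remotely like it occurs in natural language.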

      There are only two ways around this: 1) Fix the set of features and make sure it is too small to allow for camouflaging, or 2) change the way Merge is controlled by features. So far, nobody in the MG community has found a way to redefine Merge that still allows us to capture basic subcategorization facts.

    2. Actually, there's also a third option, a weaker version of 1).

      Just like we can establish a movement-like dependency via Merge and then camouflage it by coding it into the feature system, we can also posit a constraint that blocks such Merge-mediated dependencies if the equivalent Move operation would be illicit. Call this constraint Merge-SPIC. Positing Merge-SPIC as a universal constraint limits how features can be distributed over lexical items yet does not require a fixed feature set.

      However, this still rules out certain MGs that one could write if our choice of features and how we assign them to lexical items was completely unrestricted. So it might be close to what Norbert and you have in mind, but it probably won't satisfy Alex.

  8. Hi Thomas, v quick as am in start-of-semester hell. The issue of atomicity of feature values vs complex values is exactly addressed in that paper I mentioned. But if I've understood `camouflaging', you're right. Is the idea that to get round `No Complex Values', you add an extra feature each time you want to encode a non-local selectional relation (so you'd encode a verb that selects an N which selects a P with [V, +F], and a verb that selects an N which selects a C with [V, +G], etc.)? So this will indeed require us to disallow such features in one of the ways you suggest: restrict the features, or restrict the way that features can be manipulated through Merge so the information is locked in place. I think Norbert and I were assuming that such feature percolation either is illicit or, if it is licit, is only licit in particular circumstances, not involving specs. Thanks, this has made me think about this.

    1. Let me add my thx as well, both to you and to AlexC. The discussion has been helpful (and it seems that considering the issue formally HAS BEEN VERY USEFUL, as AlexC insisted: you were right, Alex C). The nature of features has been largely ignored in the standard generative literature (but see David's notes above), though not in other formalizations. The "Chomsky" group have tried to put most of the combinatoric "power" into the operations, while treating the features relevant to these on LIs as "simple." If one takes the standard inventory of such features (ones that have been postulated), they are pretty anodyne. However, other formalisms have used more interesting features (slash categories of various kinds) to gain the effects of things like IM/Move. The take-home message for me here is that when combining these two "traditions" one must exercise caution, for fancy features combined with interesting operations can nullify properties that the less complex combinations have tried hard to block. This raises two important questions for minimalists: what feature inventories are permitted, i.e. what does "simple" mean, and how big can these be (a topic that Bob Berwick has worried about in the context of computational complexity)? The hope is that the inventory is pretty small and that the features are not too complex. However, I agree that specifying exactly what this entails is a very good question. As BenjaminB noted somewhere in an earlier comment, it amounts to starting to develop a richer conception of Substantive Universals/features, an area that has been neglected heretofore. Thx for the enlightenment.

    2. "Is the idea that to get round 'No Complex Values', you add an extra feature each time you want to encode a non-local selectional relation? (so you'd encode a verb that selects an N which selects a P with [V, +F] and a verb that selects an N which selects a C with [V, +G], etc)?"

      Yes, that's pretty much it. Usually one just splits V into two categories V_F and V_G, but that's just a notational variant of what you have in mind.
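The category-splitting move can be sketched in a few lines. Everything here is hypothetical (the lexical items, the `N_P`/`N_C` category names, and the `selects` helper are invented for the illustration): refining N into subcategories that record what the noun itself selects turns a non-local fact into a purely local selection check.

```python
# Hypothetical sketch: to make a verb select an N whose own complement
# is a P (a non-local fact), split N into N_P and N_C, which record
# what the noun selects.  Each entry is (category, subcategorization).
LEXICON = {
    "rely":  ("V",   ["N_P"]),  # a verb demanding an N that takes a P
    "claim": ("N_C", ["C"]),    # an N taking a C complement
    "proof": ("N_P", ["P"]),    # an N taking a P complement
}

def selects(head, dep):
    """Check selection purely locally, via the refined categories."""
    _, subcat = LEXICON[head]
    return LEXICON[dep][0] in subcat

assert selects("rely", "proof")      # N_P matches: licit
assert not selects("rely", "claim")  # N_C doesn't: blocked locally
```

Written with feature bundles instead of split category names, `N_P` is just [N, +F] and `N_C` is [N, +G], which is why the two formulations are notational variants.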

      If you're curious, the details can be found in my LACL 2011 paper. It's a purely technical paper, but the picture on page 9 might be enough to get the idea. The same results can also be found in Greg Kobele's LACL 2011 paper, but it's just as difficult a read and has no colorful pictures :)