Monday, February 24, 2014

DTC redux

Syntacticians have effectively used just one kind of probe to investigate the structure of FL, viz. acceptability judgments. These come in two varieties: (i) simple “sounds good/sounds bad” ratings, with possible gradations of each (effectively a six-ish point scale: ok, ?, ??, ?*, *, **), and (ii) “sounds good/sounds bad under this interpretation” ratings (again with possible gradations). This rather crude empirical instrument has proven to be very effective, as the non-trivial nature of our theoretical accounts indicates.[1] Nowadays, this method has been partially systematized under the name “experimental syntax.” But, IMO, with a few conspicuous exceptions, these more refined rating methods have largely confirmed what we knew before. In short, the added precision has been useful, but not revolutionary.[2]

In the early heady days of Generative Grammar (GG), there was an attempt to find other ways of probing grammatical structure. Psychologists (following the lead that Chomsky and Miller (1963) (C&M) suggested) took grammatical models and tried to correlate them with measures involving things like parsing complexity or rate of acquisition. The idea was a simple and appealing one: more complex grammatical structures should be more difficult to use than less complex ones, and so measures involving language use (e.g. how long it takes to parse/learn something) might tell us something about grammatical structure. C&M contains the simplest version of this suggestion, the now infamous Derivational Theory of Complexity (DTC). The idea was that there was a transparent (i.e. at least homomorphic) relation between the rules required to generate a sentence and the rules used to parse it, and so parsing complexity could be used to probe grammatical structure.

Though appealing, this simple picture can (and many believed did) go wrong in very many ways (see Berwick and Weinberg 1983 (BW) here for a discussion of several).[3] Most simply, even if it is correct that there is a tight relation between the competence grammar and the one used for parsing (which there need not be, though in practice there often is, e.g. the Marcus Parser), the effects of this algorithmic complexity need not show up in the usual temporal measures of complexity, e.g. how long it takes to parse a sentence. One important reason for this is that parsers need not apply their operations serially, and so the supposition that every algorithmic step takes one time step is just one reasonable assumption among many. So, even if there is a strong transparency between competence Gs and the Gs parsers actually deploy, no straightforward measurable time prediction follows.
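To make the point about linking assumptions concrete, here is a toy sketch in Python (mine, not any actual parsing model): the same hypothetical derivations yield different time predictions depending on whether operations are costed serially or in parallel batches, so derivation-step counts alone fix nothing.

```python
# Toy illustration (not any actual parser): why derivational complexity need
# not surface as measurable time. We count hypothetical derivation steps for
# two "sentences" and map them onto time under two different cost models.

def serial_cost(steps, unit_time=1.0):
    """Every operation consumes one time unit: complexity shows up directly."""
    return len(steps) * unit_time

def parallel_cost(steps, width=8, unit_time=1.0):
    """Operations apply in batches of `width`: extra steps may cost nothing extra."""
    batches = -(-len(steps) // width)  # ceiling division
    return batches * unit_time

# Hypothetical derivations: suppose the passive involves more operations.
active  = ["merge(V,NP)", "merge(T,VP)", "merge(NP,TP)"]
passive = ["merge(V,NP)", "merge(T,VP)", "move(NP)", "merge(NP,TP)", "agree(T,NP)"]

for label, deriv in [("active", active), ("passive", passive)]:
    print(label, "serial:", serial_cost(deriv), "parallel:", parallel_cost(deriv))
# Serial costs differ (3.0 vs 5.0); with enough parallelism both cost 1.0,
# even though the underlying derivational complexity is unchanged.
```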

This said, there remains something very appealing about DTC reasoning (after all, it’s always nice to have different kinds of data converging on the same conclusion, i.e. Whewell’s consilience), and though the DTC need not be true, it might be worth looking for places where the reasoning succeeds. In other words, though the failure of DTC-style reasoning need not in and of itself imply defects in the competence theory used, a successful DTC-style argument can tell us a lot about FL. And because there are many ways for a DTC-style explanation to fail and only a few ways that it can succeed, successful stories, if they exist, can shed interesting light on the basic structure of FL.

I mention this for two reasons. First, I have been reading some reviews of the early DTC literature and have come to believe that its demonstrated empirical “failures” were likely oversold. And second, it seems that the simplicity of MP grammars has made it attractive to go back and look for more cases of DTC phenomena. Let me elaborate on each point a bit.

First, the apparent demise of the DTC. Chapter 5 of Colin Phillips’ thesis (here) reviews the classical arguments against the DTC. Fodor, Bever and Garrett (in their 1974 text) served as the three horsemen of the DTC apocalypse. They interred the DTC by arguing that the evidence for it was inconclusive. There was also some experimental evidence against it (BW note the particular importance of Slobin (1966)). Colin’s review goes a very long way toward challenging this pessimistic conclusion. He sums up his in-depth review as follows (p. 266):

…the received view that the initially corroborating experimental evidence for the DTC was subsequently discredited is far from an accurate summary of what happened. It is true that some of the experiments required reinterpretation, but this never amounted to a serious challenge to the DTC, and sometimes even lent stronger support to the DTC than the original authors claimed.

In sum, Colin’s review strongly implies that linguists should not have abandoned the DTC so quickly.[4] Why, after all, give up on an interesting hypothesis just because of a few counter-examples, especially ones that, when considered carefully, seem on the weak side? In retrospect, it looks like the abandonment of the strong hypothesis was less a matter of reasonable retreat in the face of overwhelming evidence than a decision that disciplines occasionally make to leave one another alone for self-interested reasons. With the demise of the DTC, linguists could assure themselves that they could stick to their investigative methods and didn’t have to learn much psychology, and psychologists could concentrate on their experimental methods and stay happily ignorant of any linguistics. The DTC directly threatened this comfortable “live and let live” world, and perhaps this is why its demise was so quickly embraced by all sides.

This state of comfortable isolation is now under threat, happily.  This is so for several reasons. First, some kind of DTC reasoning is really the only game in town in cog-neuro. Here’s Alec Marantz’s take:

...the “derivational theory of complexity” … is just the name for a standard methodology (perhaps the dominant methodology) in cognitive neuroscience (431).

Alec rightly concludes that given the standard view within GG that what linguists describe are real mental structures, there is no choice but to accept some version of the DTC as the null hypothesis. Why? Because, ceteris paribus:

…the more complex a representation- the longer and more complex the linguistic computations necessary to generate the representation- the longer it should take for a subject to perform any task involving the representation and the more activity should be observed in the subject’s brain in areas associated with creating or accessing the representation or performing the task (439).

This conclusion strikes me as both obviously true and salutary, with one caveat. As BW have shown us, the ceteris paribus clause can in practice be quite important. Thus, the common indicators of complexity (e.g. time measures) may be only indirectly related to algorithmic complexity. This said, GG is (or should be) committed to the view that algorithmic complexity reflects generative complexity and that we should be able to find behavioral or neural correlates of this (e.g. Dehaene’s work (discussed here), in which BOLD responses were seen to track phrasal complexity in pretty much a linear fashion, or Forster’s work, mentioned in note 4, finding temporal correlates).
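In schematic form, this null hypothesis is just a monotonic (here, for simplicity, linear) dependence of some measure on derivational complexity. Here is a minimal Python sketch of that form; the numbers are invented purely for illustration and stand in for whatever counts and measures a real study would use.

```python
# Schematic form of the DTC-style null hypothesis: some dependent measure
# (reading time, BOLD signal) should increase with derivational complexity.
# The numbers below are invented purely for illustration.
import numpy as np

merges       = np.array([3, 5, 7, 9])          # hypothetical operations per test sentence
reading_time = np.array([410, 455, 490, 540])  # hypothetical ms per sentence

slope, intercept = np.polyfit(merges, reading_time, 1)
print(f"estimated cost per operation: {slope:.1f} ms")

# Ceteris paribus, a reliably positive slope is what DTC reasoning predicts.
# A flat slope does not by itself refute the grammar: the linking assumptions
# (seriality, one-step-one-cost) may be what is at fault.
```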

Alec (439) makes an additional, IMO correct and important, observation. Minimalism in particular, “in denying multiple routes to linguistic representations,” is committed to some kind of DTC thinking.[5] Furthermore, by emphasizing the centrality of interface conditions to the investigation of FL, Minimalism has embraced the idea that how linguistic knowledge is used should reveal a great deal about what it is. In fact, as I’ve argued elsewhere, this is how I would like to understand the “strong minimalist thesis” (SMT), at least in part. I have suggested that we interpret the SMT as committed to a strong “transparency hypothesis” (TH) (in the sense of Berwick & Weinberg), a proposal that can only be systematically elaborated by attending to how linguistic knowledge is used.

Happily, IMO, paradigm examples of how to exploit “use” and TH to probe the representational format of FL are now emerging. I’ve already discussed how Pietroski, Hunter, Lidz and Halberda’s work relates to the SMT (e.g. here and here). But there is other work of obvious relevance too: e.g. BW’s early work on parsing and Subjacency (aka Phase Theory) and Colin’s work on how islands are evident in incremental sentence processing. This work is the tip of an increasingly impressive iceberg. For example, there is analogous work showing that parsing exploits binding restrictions incrementally during processing (e.g. by Dillon, Sturt, Kush).

This latter work is interesting for two reasons. It validates results that syntacticians have independently arrived at using other methods (which, to re-emphasize, is always worth doing on methodological grounds). And, perhaps even more importantly, it has started raising serious questions for syntactic and semantic theory proper. This is not the place to discuss this in detail (I’m planning another post dedicated to this point), but it is worth noting that given certain reasonable assumptions about what memory is like in humans and how it functions in, among other areas, incremental parsing, the results on the online processing of binding noted above suggest that binding is not stated in terms of c-command but some other notion that mimics its effects.

Let me say a touch more about the argument form, as it is both subtle and interesting. It has the following structure: (i) we have evidence of c-command effects in the domain of incremental binding, (ii) we have evidence that the kind of memory we use in parsing cannot easily code a c-command restriction, thus (iii) what the parsing Grammar (G) employs is not c-command per se but another notion compatible with this sort of memory architecture (e.g. clausemate or phasemate). But, (iv) if we adopt a strong SMT/TH (as we should), (iii) implies that c-command is absent from the competence G as well as the parsing G. In short, the TH interpretation of SMT in this context argues in favor of a revamped version of Binding Theory in which FL eschews c-command as a basic relation. The interest of this kind of argument should be evident, but let me spell it out. We S-types are starting to face the very interesting prospect that figuring out how grammatical information is used at the interfaces will help us choose among alternative competence theories by placing interface constraints on the admissible primitives. In other words, here we see a non-trivial consequence of Bare Output Conditions on the shape of the grammar. Yessss!!!
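For readers unfamiliar with the memory architecture assumed in step (ii), here is a toy Python sketch of cue-based (content-addressable) retrieval. The feature names are invented for illustration; the point is only that such a memory matches stored items against feature cues and has no way to consult a structural path like c-command.

```python
# Toy sketch of cue-based (content-addressable) retrieval; invented features,
# not anyone's actual model. Retrieval matches feature cues against stored
# items; there is no representation of a c-command path to walk.

def retrieve(memory, cues):
    """Return the stored item whose features best match the retrieval cues."""
    def match(item):
        return sum(1 for k, v in cues.items() if item["features"].get(k) == v)
    return max(memory, key=match)

memory = [
    {"word": "John", "features": {"phasemate": True,  "subject": True}},
    {"word": "Mary", "features": {"phasemate": False, "subject": True}},
]

# Retrieving an antecedent for a reflexive via cues like [phasemate, subject]
# mimics the c-command restriction without ever encoding c-command itself.
print(retrieve(memory, {"phasemate": True, "subject": True})["word"])  # John
```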

We live in exciting times. The SMT (in the guise of TH) conceptually moves DTC-like considerations to the center of theory evaluation. Additionally, we now have some useful parade cases in which this kind of reasoning has been insightfully deployed (and which, thereby, provide templates for further mimicking). If so, we should expect that these kinds of considerations and methods will soon become part of every good syntactician’s armamentarium.




[1] The fact that such crude data can be used so effectively is itself quite remarkable. This speaks to the robustness of the system being studied, for such weak signals would not otherwise be expected to be so useful.
[2] Which is not to say that these more careful methods don’t have their place. There are some cases where being more careful has proven useful. I think that Jon Sprouse has given the most careful thought to these questions. Here is an example of some work where I think the extra care has proven useful.
[3] I have not been able to find a public version of the paper.
[4] BW note that Forster provided evidence in favor of the DTC even as Fodor et al. were in the process of burying it. By changing the experimental task a little (viz. using an RSVP presentation of the relevant materials), Forster effectively found temporal measures of psychological complexity that tracked the grammatical complexity the DTC identified.
[5] I believe that what Alec intends here is that in a theory where the only real operation is merge then complexity is easy to measure and there are pretty clear predictions of how this should impact algorithms that use this information. It is worth noting that the heyday of the DTC was in a world where complexity was largely a matter of how many transformations applied to derive a surface form. We have returned to that world again, though with a vastly simpler transformational component.

Sunday, February 23, 2014

MOOCs, education, credentials and basic research

Here's another essay investigating the MOOC issue and its relation to education. There are two points. First, lots of what is intended when people speak of "education" is really how to effectively provide credentials that will enhance job prospects. Second, what you learn is less important than who you meet. MOOCs may help with the first, but won't address the second. And as the second dominates in determining one's life opportunities, MOOCs will simply serve to further disadvantage the less advantaged. This, in part, reflects my own jaundiced views about MOOCs, with one more perverse twist. Should MOOCs win the day, we might discover that we have destroyed research as well as education. Look at what happened to Bell Labs when we made telecommunication more efficient. A by-product of MOOCing the university might be the elimination of any venue for basic research. At least in the US and Canada, the only place basic research happens is the university, and the university in the US and Canada effectively tethers undergrad instruction to a research engine. Break the bond and there is no reason to suppose that any place for basic research will remain. So, MOOCs may not only kill education (that's my hunch) but destroy basic inquiry as well.

Friday, February 14, 2014

Derivation Trees and Phrase Structure

Another snow day, another blog post. Last week we took a gander at derivation trees and noticed that they satisfy a number of properties that should appeal to Minimalists. Loudmouth that I am, I took things a step further and proclaimed that there is no good reason to keep using phrase structure trees now that we have this shiny new toy that does everything phrase structure trees do, just better. Considering the central role of phrase structure trees, that's a pretty bold claim... or is it?

Tuesday, February 11, 2014

Plato, Darwin, P&P and variation

Alex C (in the comment section here (Feb. 1)) makes a point that I’ve encountered before that I would like to comment on. He notes that Chomsky has stopped worrying about Plato’s Problem (PP) (as has much of “theoretical” linguistics as I noted in the previous post) and suggests (maybe this is too much to attribute to him, if so, sorry Alex) that this is due to Darwin’s Problems (DP) occupying center stage at present. I don’t want to argue with this factual claim, for I believe that there’s lots of truth to it (though IMO, as readers of the last several posts have no doubt gathered, theory of any kind is largely absent from current research). What I want to observe is that (1) there is a tension between PP and DP and (2) that resolving it opens an important place for theoretical speculation. IMO, one of the more interesting facets of current theoretical work is that it proposes a way of resolving this tension in an empirically interesting way. This is what I want to talk about.

First the tension: PP is the observation that the PLD the child uses in developing its G is impoverished in various ways when one compares it to the properties of the Gs that children attain. PP, then, is another name for the Poverty of Stimulus Problem (POS). Generative Grammarians have proposed to “solve” this problem by packing FL with principles of UG, many of which are very language specific (LS), at least if GB is taken as a guide to the content of FL. By LS, I mean that the principles advert to very linguisticky objects (e.g. Subjects, tensed clauses, governors, case assigners, barriers, islands, c-command, etc.) and very linguisticky operations (agreement, movement, binding, case assignment, etc.). The idea has been that making UG rich enough and endowing it with LS innate structure will allow our theories of FL to attain explanatory adequacy, i.e. to explain how, say, Gs come to obey islands despite the absence from the PLD of the good and bad data relevant to fixing them.

By now, all of this is pretty standard stuff (which is not to say that everyone buys into the scheme (Alex?)), and, for the most part, I am a big fan of POS arguments of this kind and their attendant conclusions. However, even given this, the theoretical problem that PP poses has hardly been solved. What we do have (again assuming that the POS arguments are well founded (which I do believe)) is a list of (plausibly) invariant(ish) properties of Gs and an explanation for why these can emerge in Gs in the absence of the relevant data in the PLD required to fix them. Thus, why do movement rules in a given G resist extraction from islands? Because something like the Subjacency/Barriers theory is part of every Language Acquisition Device’s (LAD) FL, that’s why.

However, even given this, what we still don’t have is an adequate account of how the variant properties of Gs emerge when planted in a particular PLD environment. Why is there V to T in French but not in English? Why do we have inverse control in Tsez but not Polish? Why wh-in-situ in Chinese but multiple wh movement to C in Bulgarian? The answer GB provided (and so far as I can tell, the answer still) is that FL contains parameters that can be set in different ways on the basis of PLD, and the various Gs we have are the result of differential parameter setting. This is the story, but we have known for quite a while that this is less a solution to the question of how Gs emerge in all their variety than an explanation schema for a solution. P&P models, in other words, are not so much well worked out theories as they are part of a general recipe for a theory that, were we able to cook it, would produce just the kind of FL that could provide a satisfying answer to the question of how Gs can vary so much. Moreover, as many have observed (Dresher and Janet Fodor are two notable examples, see below), there are serious problems with successfully fleshing out a P&P model.

Here are two: (i) the hope that many variant properties of Gs would hinge on fixing a small number of parameters seems increasingly empirically uncertain. Cedric Boeckx and Fritz Newmeyer have been arguing this for a while, and while their claims are debated (and by very intelligent people, so, at least for a non-expert like me, the dust is still too unsettled to reach firm conclusions), it seems pretty clear that the empirical merits of earlier proposed parameterizations are less obvious than we took them to be. Indeed, there appears to be some skepticism about whether there are any macro-parameters (in Baker’s sense[1]), and many of the micro-parametric proposals seem to end up restating what we observe in the data: that languages can differ. What made early macro-parameter theories interesting is the idea that differences among Gs come in largish clumps. The relation between a given parameter setting and the attested surface differences was understood as one to many. If, however, it turns out that every parameter correlates with just a single difference, then the value of a parametric approach becomes quite unclear, at least so far as acquisition considerations are concerned. Why? Because it implies that surface differences are just due to differing PLD, not to the different options inherent in the structure of FL. In other words, if we end up with one parameter per surface difference, then variation among Gs will not be as much of a window into the structure of FL as we thought it could be.

Here’s another problem: (ii) the likely parameters are not independent. Dresher (and friends) has demonstrated this for stress systems and Fodor (and friends) has provided analogous results for syntax. The problem with a theory where parameters are not independent is that it makes it very hard to see how acquisition could be incremental. If it turns out that the value of any parameter is conditional on the value of every other parameter (or very many others), then it would seem that we are stuck with a model in which all parameters must be set at once (i.e. instantaneous learning). This is not good! To evade this problem, we need some way of imposing independence on the parameters so that they can be set piecemeal without fear of having to re-set them later on. Both Dresher and Fodor have proposed ways of solving this independence problem (both elaborate a richer learning theory for parameter values to accommodate it). But I think it is fair to say that we are still a long way from a working solution. Moreover, the solutions provided all involve greatly enriching FL in a very LS way. This is where PP runs into DP. So let’s return to the aforementioned tension between PP and DP.
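To see why non-independence threatens incremental setting, here is a toy Python sketch; the two binary parameters and the mapping to "surface patterns" are entirely invented, and the sketch is not Dresher's or Fodor's actual learning model.

```python
# Toy illustration of the independence problem: two invented binary parameters
# jointly determine the observable "surface patterns", so neither can be fixed
# in isolation. Not anyone's actual parameter system or learning model.
from itertools import product

def patterns(p1, p2):
    """Invented mapping from parameter values to observable surface patterns."""
    if p1 and p2:
        return {"SVO", "V2"}
    if p1 and not p2:
        return {"SOV"}
    if not p1 and p2:
        return {"SVO"}
    return {"SOV", "V2"}

target = {"SVO", "V2"}  # the "PLD" the learner must end up matching

# Instantaneous (all-at-once) setting succeeds:
print([g for g in product([True, False], repeat=2) if patterns(*g) == target])
# -> [(True, True)]

# But a greedy learner that fixes p1 first, judging it only against the partial
# evidence "SVO", can settle on p1=False (since (False, True) also yields SVO);
# after that, no value of p2 recovers V2 alongside SVO, and p1 must be re-set.
```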

One way to solve PP is to enrich FL. The problem is that the richer and more linguistically parochial FL is, the harder it becomes to understand how it might have evolved. In other words, our standard GB tack in solving PP (LS enrichment of FL) appears to make answering DP harder. Note I say ‘appears.’ There are really two problems, and they are not equally acute. Let me explain.

As noted above, we have two things that a rich FL has been used to explain: (a) invariances characteristic of all Gs and (b) the attested variation among Gs. In a P&P model, the first ‘P’ handles (a) and the second (b). I believe that we have seen glimmers of how to resolve the tension between PP’s demands on FL and DP’s as regards the principles part of P&P. Where things are far more obscure (and even this might be too kind) is with the second, parametric P. Here’s what I mean.

As I’ve argued in the past, one important minimalist project has been to do for the principles of GB what Chomsky did for islands and movement via the theory of subjacency in On Wh Movement (OWM). What Chomsky did in this paper was theoretically unify the disparate island effects by treating all non-local (A’) dependency constructions as sharing a common movement core (viz. move WH) subject to the locality restrictions characterized by Bounding Theory (BT). This was a terrifically inventive theory, and aside from rationalizing/unifying Ross’s very disparate Island Effects, the combination of Move WH + BT predicted that all long movement would have to be successive cyclic (and even predicted a few more islands, e.g. subject islands and Wh-islands).[2]

But to get back to PP and DP, one way of regarding MP work over the last 20 years is as an attempt to do for GB modules what Chomsky did for Ross’s Islands. I’ve suggested this many times before but what I want to emphasize here is that this MP project is perfectly in harmony with the PP observation that we want to explain many of the invariances witnessed across Gs in terms of an innately structured FL. Here there is no real tension if this kind of unification can be realized. Why not? Because if successful we retain the GB generalizations. Just as Move WH + BT retain Ross’s generalizations, a successful unification within MP will retain GB’s (more or less) and so we can continue to tell the very same story about why Gs display the invariances attested as we did before. Thus, wrt this POS problem, there is a way to harmonize DP concerns with PP concerns. Of course, this does not mean that we will successfully manage to unify the GB modules in a Move WH + BT way, but we understand what a successful solution would look like and, IMO, we have every reason to be hopeful, though this is not the place to defend this view.

So, the principles part of P&P is, we might say, DP compatible (little joke here for the cognoscenti). The problem lies with the second P. FL on GB was understood to provide not only the principles of invariance but also to specify all the possible ways that Gs could differ. The parameters in GB were part of FL! And it is hard to see how to square this with DP given the terrific linguistic specificity of these parameters. The MP conceit has been to try and understand what Gs do in terms of one (perhaps)[3] linguistically specific operation (Merge) interacting with many general cognitive/computational operations/principles.  In other words, the aim has been to reduce the parochialism of the GB version of FL. The problem with the GB conception of parameters is that it is hard to see how to recast them in similarly general terms. All the parameters exploit notions that seem very very linguo-centric. This is especially true of micro parameters, but it is even true of macro ones. So, theoretically, parameters present a real problem for DP, and this is why the problems alluded to earlier have been taken by some (e.g. me) to suggest that maybe FL has little to say about G-variation. Moreover, it might explain why it is that, with DP becoming prominent, some of the interest in PP has seemed to wane. It is due to a dawning realization that maybe the structure of FL (our theory of UG) has little to say directly about grammatical variation and typology. Taken together PP and DP can usefully constrain our theories of FL, but mainly in licensing certain inferences about what kinds of invariances we will likely discover (indeed have discovered). However, when it comes to understanding variation, if parameters cannot be bleached of their LSity (and right now, this looks to me like a very rough road), it looks to me like they will never be made to fit with the leading ideas of MP, which are in turn driven by DP. 

So, Alex C was onto something important IMO. Linguists tend to believe that understanding variation is key to understanding FL. This is taken as virtually an article of faith. However, I am no longer so sure that this is a well founded presumption. DP provides us with some reasons to doubt that the range of variation reflects intrinsic properties of FL. If that is correct, then variation per se may be of little interest for those interested in limning the basic architecture of FL. Studying various Gs will, of course, remain a useful tool for getting the details of the invariant principles and operations right. But, unlike earlier GB P&P models, there is at least an argument to be made (and one that I personally find compelling) that the range of G-variation has nothing whatsoever to do with the structure of FL and so will shed no light on two of the fundamental questions in Generative Grammar: what’s the structure of FL and why?[4]





[1] Though Baker, a really smart guy, thinks that there are, so please don’t take me as endorsing the view that there aren’t any. I just don’t know. This is just my impression from linguist-in-the-street interviews.
[2] The confirmation of this prediction was one of the great successes of generative grammar and the papers by, e.g. Kayne and Pollock, McCloskey, Chung, Torrego, and many others are still worth reading and re-reading. It is worth noting that the Move WH + BT story was largely driven by theoretical considerations, as Chomsky makes clear in OWM. The gratifying part is that the theory proved to be so empirically fecund.
[3] Note the ‘perhaps.’ If even merge is in the current parlance “third factor” then there is nothing taken to be linguistically special about FL.
[4] Note that this leaves quite a bit of room for “learning” theory. For if the range of variation is not built into FL, then why we see the variation we do must be due to how we acquire Gs given FL/UG. The latter will still be important (indeed critical) in that any learning theory will have to incorporate the isolated invariances. However, a large part of the range of variation will fall outside the purview of FL. I discuss this somewhat in the last chapter of A Theory of Syntax for any of you with a prurient interest in such matters. See, in particular, the suggestion that we drop the switch analogy in favor of a more geometrical one.

Monday, February 10, 2014

Where Norbert posts Chris's revised post after screwing things up

In my haste to get Chris's opinions out there, I jumped the gun and posted an early draft of what he intended to see the light of day. So all of you who read the earlier post, fogetabouit!!! You are to ignore all of its contents and concentrate on the revised edition below.  I will try (but no doubt fail) never to screw up again.  So, sorry Chris. And to readers, enjoy Chris's "real" post.

*****

Why Formalize?
I read with interest Norbert’s recent post on formalization: “Formalization and Falsification in Generative Grammar”. Here I write some preliminary comments on his post.  I have not read other relevant posts in this sprawling blog, which I am only now learning how to navigate. So some of what I say may be redundant. Lastly, the issues that I discuss below have come up in my joint work with Edward Stabler on formalizing minimalism, to which I refer the reader for more details.
I take it that the goal of linguistic theory is to understand the human language faculty by formulating UG, a theory of that faculty. Formalization is a tool toward that goal: it means stating a theory clearly and formally enough that one can establish conclusively (i.e., with a proof) the relations between various aspects of the theory, and between claims of the theory and claims of alternative theories.
Frege in the Begriffsschrift (pg. 6 of the Begriffsschrift in the book Frege and Gödel) analogizes the “ideography” (basically first and second order predicate calculus) to a microscope: “But as soon as scientific goals demand great sharpness of resolution, the eye proves to be insufficient. The microscope, on the other hand, is perfectly suited to precisely such goals, but that is just why it is useless for all others.” Similarly, formalization in syntax is a tool that should be employed when needed. It is not an absolute necessity, and there are many ways of going about things (as I discuss below). By citing Frege, I am in no way claiming that we should aim for the same level of formalization that Frege aimed for.
There is an important connection with the ideas of Rob Chametzky (posted by Norbert in another place on this blog). As we have seen, Rob divides up theorizing into meta-theoretical, theoretical and analytical. Analytical work, according to Chametzky, is “concerned with investigating the (phenomena of the) domain in question. It deploys and tests concepts and architecture developed in theoretical work, allowing for both understanding of the domain and sharpening of the theoretical concepts.” It is clear that more than 90% of all linguistics work (maybe 99%) is analytical, and that there is a paucity of true theoretical work.
A good example of analytical work would be Noam Chomsky’s “On Wh-Movement”, which is one of the most beautiful and important papers in the field. Chomsky proposes the wh-diagnostics and relentlessly subjects a series of constructions to those diagnostics uncovering many interesting patterns and facts. The consequence that all these various constructions can be reduced to the single rule of wh-movement is a huge advance, allowing one insight into UG. Ultimately, this paper led to the Move-Alpha framework, which then led to Merge (the simplest and most general operation yet).
 “On Wh-Movement” is what I would call “semi-formal”. It has semi-formal statements of various conditions and principles, and also lots of assumptions are left implicit. As a consequence it has the hallmark property of semi-formal work: there are no theorems and no proofs.
Certainly, it would have been a waste of time to fully formalize “On Wh-Movement”. It would have expanded the text 10-20 fold at least, and added nothing. This is something that I think Pullum completely missed in his 1989 NLLT contribution on formalization. The semi-formal nature of syntactic theory, also found in such classics as “Infinite Syntax” by Haj Ross and “On Raising” by Paul Postal, has led to a huge explosion of knowledge that people outside of linguistics/syntax do not really appreciate (hence all the uninformed and uninteresting discussion out there on the internet and Facebook about what the accomplishments of generative grammar have been), in part because syntacticians are generally not very good popularizers.
Theoretical work, according to Chametzky, “is concerned with developing and investigating primitives, derived concepts and architecture within a particular domain of inquiry.” There are many good examples of this kind of work in the minimalist literature. I would say Juan Uriagereka’s original work on multi-spell-out qualifies and so does Sam Epstein’s work on c-command, amongst others.
My feeling is that theoretical work (in Chametzky’s sense) is the natural place for formalization in linguistic theory. One reason is that it is possible, using formal assumptions to show clearly the relationship between various concepts, assumptions, operations and principles. For example, it should be possible to show, from formal work, that things like the NTC, the Extension Condition and Inclusiveness should really be thought of as theorems proved on the basis of assumptions about UG.  If they were theorems, they could be eliminated from UG. One could ask if this program could be extended to the full range of what syntacticians normally think of as constraints.
In this, I agree with Norbert who states: “It can lay bare what the conceptual dependencies between our basic concepts are.” Furthermore, as my previous paragraph makes clear, this mode of reasoning is particularly important for pushing the SMT (Strong Minimalist Thesis) forward. How can we know, with certainty, how some concept/principle/mechanism fits into the SMT? We can formalize and see if we can prove relations between our assumptions about the SMT (assumptions about the interfaces and computational efficiency) and the various concepts/principles/mechanisms. Using the ruthless tools of definition, proof and theorem, we can gradually whittle away at UG, until we have the bare essence. I am sure that there are many surprises in store for us. Given the fundamental, abstract and subtle nature of the elements involved, such formalization is probably a necessity, if we want to avoid falling into a muddle of unclear conclusions.
A related reason for formalization (in addition to clearly stating/proving relationships between concepts and assumptions) is that it allows one to compare competing proposals. One of the biggest such areas nowadays is whether syntactic dependencies make use of chains, multi-dominance structures or something else entirely. Chomsky’s papers, including his recent ones, make references to chains at many points. But other recent work invokes multi-dominance. What are the differences between these theories?  Are either of them really necessary? The SMT makes it clear that one should not go beyond Merge, the lexicon, and the structures produced by Merge. So any additional assumptions needed to implement multi-dominance or chains are suspect. But what are those additional assumptions? I am afraid that without formalization it will be impossible to answer these questions.
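As a purely illustrative aid (mine, not Chomsky's or anyone's formal proposal), here is a toy Python contrast between the two encodings: a chain adds a record of occurrences over and above the merged object, whereas multidominance literally shares one token between two positions. Either way, something beyond bare Merge output has to be said, which is exactly what formalization would force us to state.

```python
# Toy contrast between two encodings of displacement; invented representations,
# not anyone's formal proposal.

# Chain encoding: one object plus a separate record of its occurrences
# (extra bookkeeping beyond the structure Merge builds).
chain_encoding = {
    "object": "which book",
    "occurrences": ["Spec-CP", "complement-of-read"],
}

# Multidominance encoding: the very same token is a daughter of two parents.
wh = ["which", "book"]
vp = ["read", wh]
cp = [wh, ["C", ["you", vp]]]   # wh occurs in two positions, but it is one object

print(cp[0] is vp[1])  # True: shared structure, no copies and no chain record,
                       # though linearization now has to decide which position
                       # gets pronounced.
```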
Questions about syntactic dependencies interact closely with TransferPF (Spell-Out) and TransferLF, which, to my knowledge, have not only not been formalized but not even been stated in an explicit manner (other than the initial attempt in Collins and Stabler 2013). Investigating the question of whether multi-dominance, chains or something else entirely (perhaps nothing else) is needed to model human language syntax will require a concomitant formalization of TransferPF and TransferLF, since these are the functions that make use of the structures formed by Merge. Giving explicit and perhaps formalized statements of TransferPF and TransferLF should in turn lead to new empirical work exploring the predictions of the algorithms used to define these functions.
A last reason for formalization is that it may bring out complications in what appear to be innocuous concepts (e.g., “workspaces”, “occurrences”, “chains”).  It will also help one to understand what alternative theories without these concepts would have to accomplish. In accordance with the SMT, we would like to formulate UG without reference to such concepts, unless they are really needed.
Minimalist syntax calls for formalization in a way that previous syntactic theories did not. First, the nature of the basic operations is simple enough (e.g., Merge) to make formalization a real possibility. The baroque and varied nature of transformations in the “On Wh-Movement” framework and preceding work made the prospect for a full formalization more daunting.
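As a crude illustration of that first point, here is a toy rendering of Merge as binary set formation in Python, in the spirit of (but far simpler than) Collins and Stabler 2013 and emphatically not their actual definitions; the point is only that the core operation fits in a few lines, in a way no battery of construction-specific transformations would.

```python
# Bare-bones sketch of Merge as binary set formation; a toy rendering only,
# not Collins and Stabler's (2013) actual definitions.

def merge(x, y):
    """Merge(X, Y) = {X, Y}: unordered, label-free, and nothing new is added
    to the inputs (Inclusiveness), nor are the inputs altered (No Tampering)."""
    return frozenset([x, y])

# Building "the boy left" as nested sets. Because merge can only wrap existing
# objects in a larger set, never reach inside them, extension-like behavior
# looks like a consequence of the definition rather than a stipulation.
the_boy = merge("the", "boy")
clause  = merge("left", the_boy)
print(clause)  # e.g. frozenset({'left', frozenset({'the', 'boy'})}); element order may vary
```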
Second, the concepts involved in minimalism, because of their simplicity and generality (e.g., the notion of copy), are just too fundamental, subtle and abstract to resolve by talking through them in an informal or semi-formal way. With formalization we can hope to state things in such a way as to make clear the conceptual and the empirical properties of the various proposals, and to compare and evaluate them.

My expectation is that selective formalization in syntax will lead to an explosion of interesting research issues, both of an empirical and a conceptual nature (in Chametzky’s terms, both analytical and theoretical). One can only look at a set of empirical problems against the backdrop of a particular set of theoretical assumptions about UG and I-language. The more these assumptions are articulated, the more one will be able to ask interesting questions about UG.