Thursday, April 17, 2014

The SMT again

Two revisions below, thx to Ewan spotting a thinko, a slip of the mind.

I have recently urged that we adopt a particular understanding of the Strong Minimalist Thesis (SMT) (here).  The version that I favor treats the SMT as a thesis about systems that use grammars and suggests that central features of the grammatical representations that they use will be crucial to explaining why they are efficient. If this proves to be doable, then it is reasonable to describe FL and the grammars it makes available as “well designed” and “computationally efficient.” Stealing from Bob Berwick (here), I will take parsing efficiency to mean real time parsing, and acquisition efficiency to mean easy acquisition given the PLD.  Put this all together and the SMT is the conjecture that the grammatical format of Gs and UG is critical to allowing parsers, acquirers, producers, etc. to be very good at what they do (i.e. to be well-designed). On this view, grammars are “well designed” or “computationally efficient” in virtue of having properties that allow their users to be good at what they do when such grammars are embedded transparently within these systems.

One particularly attractive virtue of this interpretation (for me) is that I understand how I could go about empirically investigating it.  I confess that this is not true for other versions of the SMT that talk about neat fits between grammar and the CI interface, for example. So far as I can tell, we know rather little about the CI interface and so the question of fit is, at best, premature. On the other hand we do know a bit about how parsing works and how acquisition proceeds so we have something to fit the grammar to.[1]

So how to proceed? In two steps I believe. The first is to see if use systems (e.g. parsers) actually deploy grammars in real time, i.e. as they parse. Thus, if it is true that the basic features of grammatical representations are responsible for how (e.g.) parsers manage to efficiently do what they do then we should find real time evidence implicating these representations in real time parsing. Second, we should look for how exactly the implicated features manage to make things so efficient. Thus, we should look for theoretical reasons for why parsers that transparently embody, say, Subjacency like principles, would be efficient.  Let me discuss each of these points in turn.


There is increasing evidence from psycho-ling research indicating that real time parsing respects grammatical distinctions, even very subtle ones.  Colin Phillips is a leader in this kind of work and he and his (ex) students (e.g. Brian Dillon, Matt Wagers, Ellen Lau, Dave Kush, Masaya Yoshida) have produced a body of work that demonstrates how very closely parsers respect grammatical conditions like islands, c-command, and local binding domains. And by closely I mean very closely.  So, for example, Colin shows (here) that online parsing respects the grammatical conditions that license parasitic gaps. So, not only do parsers respect islands, but they even treat configurations where island effects are amnestied as if they were not islands. Thus, parsers respect both the general conditions that grammars lay down regarding islands and the exceptions to these general conditions that grammars allow. This is what I mean by ‘close.’

There is a recent excellent demonstration of this from Masaya Yoshida, Lauren Ackerman, Morgan Purier and Rebekah Ward (YLPW) (here are slides from a recent CUNY talk).[2] YLPW analyze the processing of backward sluicing constructions like (1):

(1)  I don’t recall which writer, but the editor notified a writer about a new project

There is an ellipsis “gap” right after which writer that is resolved by anchoring it to a writer in the following clause. What YLPW are looking to determine is whether the elided gap site is sensitive to online parsing effects. YLPW use a plausibility effect as a probe, as follows.

First, it is well known that a wh in CP triggers an active search for a verb/gap that will give it an interpretation. ‘Active’ here means that the parser uses a top-down predictive process and is eagerly looking to link the wh to a predicate without first consulting bottom-up information that would indicate the link to be ill-advised. YLPW show that the eagerness to “fill a gap” is as true for implicit gaps within ellipsis sites as it is for “real” gaps in regular wh sentences.  YLPW show this by demonstrating a plausibility effect slowdown in sentences like (2a) parallel to the ones found in (2b):

(2)  a. I don’t remember which writer/which book, but the editor notified a writer about a new book
b. I don’t remember which writer/which book the editor notified GAP about a new book

When the wh is which book there is a significant pause at notified in both sentences in (2), as contrasted with the same sentences where which writer is the antecedent of the gap.  This is because parsers, we know, greedily try to relate the wh to the first syntactically available position encountered; in the case of which book the wh is not a plausible filler of the gap, and the attempted filling results in a little lingering at the verb (*notify this book about…). If the antecedent is which writer no such pause occurs, for obvious reasons.  The plausibility effect, then, is just a version of the well-known filled gap effect, with a little semantic kicker to add some frisson. At any rate, the first important discovery is that we find the same plausibility effect in both (2a), with the gap inside a sluiced ellipsis site, and (2b), where the gap is “overt.”

The next step is to see if this plausibility/filled gap effect slowdown occurs when the relevant antecedent for the sluiced ellipsis site is inside an island. It is well known that ellipsis is not subject to island restrictions. Thus, if the parser cleaves tightly to the distinctions the grammar makes (as the SMT would lead us to expect), then when the potential gap site sits inside an island we should find no plausibility slowdowns for overt wh-dependencies, but we should still find them when the gap is inside an ellipsis site, for the latter is not subject to island restrictions and so should induce filled gap/plausibility effects (thx Ewan).  And that’s exactly what YLPW find. Though plausibility effects are not found at notified in cases like (3), they are found in cases like (4), where the “gap” is inside a sluice site.

(3)  I don’t remember which book [the editor who notified the publisher about some science book] had recommended to me
(4)  I don’t remember which book, but [the editor who notified the publisher about some science book] recommended a new book to me

This is just what we expect from a parser that transparently embeds a UG-like grammar in which wh-dependencies, but not ellipsis, are products of (long) movement and hence subject to islands.

The conclusion: it seems that parsers make just the distinctions that grammars make when they parse in real time, just as the SMT would lead us to expect.

So, there is growing evidence that parsers transparently embed UG like grammars.  This readies us for the second step. Why should they do so?  Here, there is less current research that bears on the issue. However, there is work from the 80s by Mitch Marcus, Bob Berwick and Amy Weinberg that showed that a Marcus style parser that incorporated grammatical features like Subjacency (and, interestingly, Extension) could parse sentences efficiently (effectively, in real time).  This is just what the doctor ordered. It goes without saying (though I will say it) that this work needs updating to bear more directly on the SMT and minimalist accounts of FL. However, it provides a useful paradigm of how one might go about connecting the discoveries concerning online parsing with computational questions of parsing efficiency and their relationship to central architectural features of FL/UG.
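
To make the real time idea concrete, here is a minimal sketch of the sort of thing Marcus-style determinism buys you. To be clear about what is mine and what is Marcus's: the toy grammar, the category labels and the lookahead rule below are invented for the illustration, and the sketch is not PARSIFAL or the Berwick-Weinberg implementation. It also illustrates only the real time side of the story (a single left-to-right pass, a small lookahead buffer that keeps the choices deterministic, constituents that are never reopened once closed, and a bounded amount of work per word on average, so parsing time grows linearly with sentence length), not the connection to Subjacency.

```python
# Toy deterministic shift-reduce parser: one left-to-right pass, a one-cell
# lookahead buffer, no backtracking. The grammar and labels are invented for
# this illustration; this is not PARSIFAL or the Berwick-Weinberg parser.

RULES = [                        # (left-hand side, right-hand side)
    ("NP", ("Det", "N", "N")),   # e.g. "the science book"
    ("NP", ("Det", "N")),        # e.g. "the editor"
    ("VP", ("V", "NP")),
    ("S", ("NP", "VP")),
]

def parse(tags):
    """Parse a list of POS tags into a nested (label, children) tree."""
    stack, i = [], 0
    while True:
        # Reduce whenever the top of the stack matches a rule, unless the
        # one-cell lookahead says a larger constituent is still coming.
        # Once built, a constituent is never torn apart again (no backtracking).
        reduced = True
        while reduced:
            reduced = False
            for lhs, rhs in RULES:
                n = len(rhs)
                if len(stack) >= n and tuple(x[0] for x in stack[-n:]) == rhs:
                    lookahead = tags[i] if i < len(tags) else None
                    if rhs == ("Det", "N") and lookahead == "N":
                        continue         # wait: a compound NP is still coming
                    stack[-n:] = [(lhs, tuple(stack[-n:]))]
                    reduced = True
                    break
        if i < len(tags):
            stack.append((tags[i], ()))  # shift the next word (constant work)
            i += 1
        elif len(stack) == 1 and stack[0][0] == "S":
            return stack[0]              # a single S spanning the whole input
        else:
            raise ValueError("no parse")

# "the editor notified the writer" vs. "the editor notified the science book":
# the same single deterministic pass handles both, using the lookahead to
# decide where each NP closes. Work per word is bounded, so time is linear.
print(parse(["Det", "N", "V", "Det", "N"]))
print(parse(["Det", "N", "V", "Det", "N", "N"]))
```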

The SMT is a bold conjecture. Indeed, it is likely false, at least in fine detail. This does not, however, detract from its programmatic utility.  The fact is that there is currently lots of research that can be understood as bearing on its accuracy and that fruitfully brings together work in syntax, psycholinguistics and computational linguistics.  The SMT, in other words, is a terrific hypothesis that will generate fascinating work regardless of its ultimate empirical fate.  That’s what we want from a research program and that’s something that the Strong Minimalist Thesis is ready to deliver. Were this all that the Minimalist Program provided, it would have been enough (dayenu!). There is more, but for the nonce, this is more than enough. Yay, for the Minimalist Program!!!




[1] Let me qualify this: we know something about some other parts; see here for discussion of magnitude estimation in the visual domain. Note that this discussion fits well with the version of the SMT deployed here precisely because we know something about how this part of the visual system works. We cannot say as much about most of the other parts of CI. Indeed, we don’t really know how many “parts” CI has.
[2] They are running some more experiments, so this work is not yet finished. Nonetheless, it illustrates the relevant point well, and it is really fun stuff.

67 comments:

  1. Quick correction: you write:

    (i) if [X] then we should find plausibility slowdowns except when [SLUICING]
    (ii) plausibility effects are not found at notified in cases like [NOT SLUICING] [and] they are found in cases like [SLUICING]

    (ii) is correct, (i) is (a fortiori) backwards, no? (i) asserts that there should be plausibility slowdowns in case "NOT SLUICING" and no plausibility slowdowns in case "SLUICING". But that would mean the parser isn't respecting islands in the case where it should, and is respecting islands (i.e., ignoring the possible-but-island-filtered-out gap site) in the case where it shouldn't. Their Experiment 4 shows the opposite (i.e. the right thing), unless my brain is completely farted out.

    Unless I'm missing something, the island experiment also serves as a needed control. One might have argued based only on (their) Experiment 2 (ex (2)) that the reason you see the plausibility effect in the sluicing cases is because the gap-filling doesn't respect the structure at all, it's just the linearly recent wh that triggers it. But Experiment 4 (ex (3-4)) shows (yet again) that active gap filling is island sensitive, pulling apart the linear-filling strategy and the grammatical-filling strategy for these sentences. In other words if you're not already convinced that active gap filling should be island sensitive, E4 serves to COMPLETE E2 as an argument that in the case of sluicing we see the appropriate grammar-hewing; if you are it's as your treatment reads, another piece of evidence ON TOP of E2.

    Replies
    1. You are, of course, correct. I've corrected the prose to reflect the correct description of the prediction. Thx.

  2. OK, now that the facts are straight: is there any sense in maintaining this version of SMT at all? Not to be difficult. It says that the parser should respect the grammar, but the whole premise of (SMT being about) systems that "use the grammar" respecting it presupposes that there can be systems that "use the grammar" which are in some sense separate from the grammar. I'm pretty sure this implies what I call the Explicit Grammar Hypothesis - namely, that there is a possibility of distinguishing between the grammar and the systems that "use" it. I.e. the grammar is not merely implicit in the systems that use it, but has some existence apart from them which can be meaningfully characterized. I've always wondered what this hypothesis - always implicit, never argued for - really amounts to. Consider learning, parsing/recognition, and production. Is there really anything meaningful apart from this? do we think that these systems "use" the grammar, e.g., by accessing it from memory? If so, what's the evidence for this? And if not, what's the force of (this treatment of) SMT?

    Replies
    1. I'm surprised that you find the hypothesis hard to fathom. Take Colin's or Masaya's work. There is nothing logically incoherent about finding that the online systems and the off line systems make different distinctions. So, it COULD have happened that, for example, parsers did not respect islands or that ellipsis did. We would even have accounts as to why this might be the case; and in fact Bever once provided them. One could imagine that systems that parse used heuristics or even systems that did not reflect grammatical categorization. Given this perfectly coherent logical possibility, we could distinguish grammars from the system of rules that parsers use to parse a sentence. We might even give them a name, e.g. heuristics. Moreover, it is logically possible that the "heuristics" the parser uses are different from those the producer or acquirer does.

      Now, say we found this. What could we conclude? Well, one thing we could conclude is that the system of knowledge, the data structures that grammars define, are related to these heuristics but not very directly. So, Bever said that we look for N-V-N templates rather than for syntactic structure. Why N-V-N? Well presumably because English is SVO more or less. I doubt that we would look for this in Japanese, for example. Same with Islands: we don't compute these structures on line and use them in guiding our parse but we do compute them later to "check" our parse. Again this is all coherent and what we should say is somewhat up in the air. We might conclude that we have the wrong theory of grammar, or the wrong parser or that we have both. The SMT is the hypothesis that this does not happen. That we should expect a high level of transparency between the rules parsers/producers/acquirers use and those that the grammar, AS MEASURED USING OFFLINE JUDGEMENT DATA for example, gives us. In other words, the off line data and the online data tap the very same system and so we should expect them to coincide. Need this have been true? Not so far as I can tell. Hence, it is a hypothesis; what I would call the SMT.

    2. This comment has been removed by the author.

    3. This comment has been removed by the author.

    4. This comment has been removed by the author.

    5. This comment has been removed by the author.

    6. That's true although what I was saying was not that I find SMT or transparency or non-transparency in whatever form hard to fathom, to be clear. I was saying I think the notion of the grammar being separate in a meaningful sense from the systems that embody it is pervasive but unsupported. Maybe that is what you are getting at - i.e.

      (1) acceptability judgment (AJ) - online measure (OM) mismatches could be interpreted as evidence for the Explicit Grammar Hypothesis (EGH)

      The reasoning being, AJ - OM mismatches are parser - grammar mismatches, and how can there be a parser-grammar mismatch if there isn't a grammar (implicit: "explicit"). Maybe that's what you're suggesting I should consider when considering EGH.

      I'm not so sure. First of all, I'm not immediately inclined to interpret mismatches between "fast stuff" in sentence processing (e.g. self paced reading slowdowns) and "slow stuff" (acceptability judgments) as indicative of the fast stuff being non-faithful (e.g. heuristic). Implicit in (1) is that AJs are different from OMs. Okay, duh. But not just different. Reasoning about THE PARSER (C - comprehension mechanism) from OMs requires assuming C is under investigation with OMs and not with AJs - or a different kind of investigation. AJs are seen as the output of C, while OMs give you some moment by moment trace. But you need to assume that there is a there there, that there is a G apart from C, a grammar apart from the parser. And then and only then you can ask (I try to keep it chill so I've stopped frontin') how much C is like G. And remember, we're now talking about the internal workings, not the inputs and outputs. So what I'm saying is, to argue about whether transparency is true in the sense of "the steps you go through in C are very G-like" on the basis ONLY of C and its output (the kernel of the AJs) - which in this case would just be the question of the timing of when island constraints are in force - always for G, maybe not always for C, if yes then transparent if no then not - THEEEEN you've got to assume that you've defined G independently of C. The EGH, or some weaker version of it, but crucially not the IGH - Implicit Grammar Hypothesis, more on that in my other comment below. So that's the first part - the crucial bit of reasoning where it's licit to assume AJs are necessarily more G-like than OMs presupposes EGH or something like it. So it can't also be taken as evidence for EGH.

    7. Under IGH no such conclusion can be drawn. The IGH is different from transparency (T), although there's still an IGH version of T. IGH implies that there isn't even a coherent way of defining T without looking at MORE than one G-implementing system. It implies that the closest thing you could get would be to ask how similar (say) C and P (production) are in terms of how they work. It says that the knowledge that G is is an abstraction of C and P, nothing more. It's the overlap between C and P, once mapped into the appropriate space where they're comparable. So to take a trivial example but a relevant one here you might ask whether C and P both have island constraints "on all the time" or if in some sense P "avoids" islands more actively than C, which (hypothetically) goes for it and then filters it out. If they align, the island constraint is part of G, if not, it isn't - definitionally. That's simplifying a bit I think but you get the idea.

      Both EGH+T and IGH are worlds in which the parser (say) more or less is the grammar, but in very different ways. IGH says "you can have a BIG gap between parser and grammar - by having a big gap between C and P - " and I think learning, L, as well, but this takes a lot of abstraction to see how works, not in the obvious way, so I'll leave it for now - " -- having a big gap between C and P -- but STILL the parser and the grammar ARE/can be the same thing in a certain sense, in the sense that G is merely an abstraction of C."

    8. Back to SMT. SMT_{Chomsky}:

      (2) FL is an optimal solution to minimal design specifications

      Your idea is it seems to me to get at properties of FL, which is predicated over G's, by looking at C's (or P's) and seeing how efficient it (they) is (I'll leave that one so your C can suffer - G's are above suffering). And here I think I think "efficient" is being cashed out as "not doing a bunch of non-G-like stuff". Under IGH, I think it still makes sense to look at the overlap between C and P as being "the interesting part". Call it FL - fair enough. It's a solid, though not rock, bit of reasoning to see that overlap as the crucial part for language, and dub it FL. Although we could call it FHQWGADS just as easily.

      So the question - does this means of investigating SMT rely on EGH? That's what I meant. I think the answer is "kind of." I think under IGH it's possible / more plausible that the gap between C (say) and G is due to things that are orthogonal to the "how implementable is this G" question/logic that SMT_{Norbert} (not a distinct SMT at all, but rather a means of investigating SMT - sorry I got this wrong before) relies on - i.e., not facts about G but annoying facts about the peripheral systems - but I don't really know, and I suspect that actually SMTN is equally prone to this caveat.

    9. (I said "that's the first part" - but there is no second part. Sorry.)

    10. I don't think that the SMT relies on the EGH, though that would be one explanation for why all systems that use grammars make the same distinctions, viz. because they are all using the same thing. But, one could imagine making this the case without locating the grammar outside of its various users. I have nothing against thinking this to be so, but I have no present evidence to think that it is. What I am pretty sure of is that we don't want to identify Gs with what we find in parsers. Colin has evidence bearing on this. Recall, he finds filled gap effects into some subject islands; those that license PGs. This means that the parser can and does establish dependencies into these islands AND unless there is a PG downstream speakers will judge these perfectly parsed sentences to be unacceptable. I assume that to describe this we need to say that a parsable dependency is nonetheless ungrammatical. Is this incompatible with the SMT? I don't think so, for the question is how does the parser deploy the relevant information in real time: Given the info the parser has, there may not be enough to conclude that the subject island is indeed an island and so dipping into it would be ok. This would contrast with relative clauses where there is enough local information to conclude this. So, the parser is not the grammar, rather the parser uses the grammar and information flow may affect what the parser concludes. At any rate, I suspect that this is all too far afield.

    11. Right, when I said "parser is grammar under IGH" I was sloganizing, I meant "IGH says grammar is the parser/producer/learner overlap so in a certain sense parser and grammar are inseparable but it's a very different kind than if you're thinking of grammars as separate entities". But anyway for the relation between SMT and EGH see my last comment below. I agree that SMT and EGH are totally separate beings. My musing is on _YOUR_ SMT package, which relies on transparency.

  3. This comment has been removed by the author.

    Replies
    1. I think Ron Kaplan has motivated it somewhere on the basis that parsing and production require different algorithms, but appear to involve the same underlying competence; I don't think you can learn to produce language in a way that is 'free but appropriate to circumstances' without being able to understand similar productions by other people (shared grammar & use rules being the standard of similarity).

  4. Shared competence is fine, but that's different from explicit competence as far as I can see. Shared competence says, give me production mechanism P, comprehension mechanism C. There's something you could call the grammar G(P) and something you could call the grammar G(C) and they're the same. Fine. Of course.

    But now consider what we're saying when we ask "does P [/C] respect G(P) [/C]?". Hard to say exactly what that means, because in a certain sense it seems tautological. We need another ingredient - Berwick and Weinberg's notion of "transparency". How much does P "differ" from G(P)? You'd need some more details to spell out exactly what that means but we can take the Marcus parser as a positive example - there was a negative example in the book if I recall, I don't remember what it was - N will.

    OK. Now. There are two ways of fixing the function G() (which needs to be overloaded for P/C/etc but whatever). One is to say "G is fixed by virtue of the fact that there is an explicit grammar which is separate from P or C" - as I suggested, one example of this might be loading the grammar from storage. Think of loading the Java virtual machine (P or C) and then that loads the program up (G). There might be some other way to spell this out neurophysiologically other than memory access.

    But anyway there is a completely OTHER way you could fix G(.), which is, G(.) is simply DEFINED as the set of all the things I can find in common between P and C; but it is completely implicit in both. The idea would be that, as I learn a language, I am simply learning two separate programs under the rule that they have to be consistent in a bunch of ways. Again, you'd have to do some work to figure out exactly what this means, and where you would ever stop with this task of finding "all" the things that P and C have in common. But you get the idea. G is just the bit where P and C, mapped into the appropriate comparable space, overlap. They both respect islands, for example.

    The slogan - are language mechanisms centralized planners (Explicit Grammar Hypothesis), or friendly anarchists who get along by copying each other (Implicit Grammar Hypothesis)?

    In both cases, something is shared. But SMT_{Norbert} takes a very different meaning on under these two ideas about competence. Under the IGH, it really just works out to "how close is the correspondence between C and P?"; under EGH, that's a derived question - the basic question is "how close is the correspondence between C and G and between P and G?" With EGH, the answer could be "not close at all" and there would still be a grammar. With IGH, if the answer is sufficiently far from "close" then the only overlap might be in their extension (the set of acceptable/generable sound-meaning pairs). We would find ourselves in the position where all the principles of grammar we've discovered are actually principles of (I guess it would be) parsing.

    Replies
    1. This comment has been removed by the author.

    2. I am not sure that I fully follow, so let me ask a question: Say we find online effects analogous to filled gap effects in production and say that we find that these incorporate island sensitivity in the way it appears that parsing does. What would we say? I would be inclined to say that these processes all use the same system of rules in parsing and producing AND these are the same rules that govern off line judgements. Say that we then went into real time acquisition and found that kids treat islands as different as well (e.g. won't generalize rules into islands were they movement rules). Then these three grammar users all make the same distinction. Say that for ALL relevant grammatically proposed restrictions these three users make the same distinctions. What would we say? I think we would say that they all use the grammar transparently (very transparently) and that's why they all act the same way. Does this mean that the grammar is something "independent" of the systems that use it? I don't know. I don't even know if that is relevant. What it does show is that they all share the same system of knowledge. For now, that would suffice to make the SMT an interesting thesis for I am interpreting it as saying that this is exactly what we will find.

      Defending this position, I believe, will be very hard. I suspect that we will soon enough find ways that grammatical mechanisms are not reflected in parsing data (think 'the key to the cabinets are…). The SMT bet is that this is very much the exception. I find it heartening to note that for the major features of grammar, c-command and locality, this seems to have some non-trivial support. So is the SMT true? Dunno. Is it an interesting conjecture? Yup. That's good enough for me right now.

    3. Yes yes, who could follow such stuff? Not even my hairdresser.

      OK, the answer should clarify I guess. And recapitulate some of what I wrote in the other comments and then maybe there can be one comment to unify them all at some point. So: I take this

      "What it does show is that they all share the same system of knowledge."

      to be at least compatible with IGH. EGH would add "by virtue of the fact that they interact with some separate representation of G" again, e.g., by accessing it from memory. My claim is that people tend to think of G's in the EGH way and that's no good because there's no reason to think that.

      And the point here was that to even formulate your SMT as "does online processing do what G does" implies EGH. But now what you've got is

      "[C, P and L all work the same and fo]r now, that would suffice to make the SMT an interesting thesis for I am interpreting it as saying that this is exactly what we will find."

      Without the "for now," that's the IGH version of transparency. It's really really different from the EGH version. It says "for grammars to be transparent it is DEFINITIONALLY sufficient for comprehension/production/learning to 'work the same' in some well-defined sense." I stress the tangential point that in fact what counts as "the grammar" in this view is merely an abstraction of the various different linguistic systems, not necessarily meaningful - we could have chosen another abstraction for which transparency wouldn't hold - or the weakest one, the extension, where we hope it does.

      So I think you got about as far as I did. But now I can finish the challenge above to this version of SMT. First let me just ask you to clarify - is it the case that you see your reasoning package here as being in short "transparency is evidence for SMT"? That's what I concluded before. "Transparency is evidence for SMT because it shows that the grammar is usable, therefore efficiently usable." Is that it? If so then we can plug and play the EGH and the IGH versions of transparency and see what happens. I think something.

    4. I think I am saying that if the SMT is right then we should find that the systems that use the grammar will display the effects of this use as they run in real time. Why? Well, the way I am thinking of this is that part of the reason that these use systems run as well as they do (are very good at doing what they do) is (partly) in virtue of linguistic representations having the properties they have (e.g. subjacency, phases, monotonic derivations, minimality etc.). If this is so, we should expect users to respect the distinctions the G makes in the process of doing what the user does. So, if this is correct, we should expect to see effects of these representations in real time data. Given that this is expected, finding it, I think, supports the SMT, the thesis that users use grammars with the properties competence theories attribute to them to do what they do well.

      So first question, do we indeed find this? I take the Colin and Masaya stuff (and Dillon's, Sturt's, Kush's, Wagers's etc) as evidence that we do actually cleave to the distinctions Gs make in actually computing the structures of sentences as we hear them. So parsing fits with the SMT (at least in part). How about the other users? Don't know. Production is hard to investigate. Acquisition? Well here Wexler and Culicover and later Berwick argue that there is an intimate connection between learning and some grammatical restrictions in Gs that UG imposes. The results are not good enough yet (degree-2, rather than degree-0+) but they are suggestive. Berwick actually ties this to the fact that G learning is dependent on parsing with something like a Marcus parser. At any rate, this is the right way to go. If it can be made to work, it would seem to confirm the SMT again.

      Last point: I take the SMT to be a very productive hypothetical. I suspect that it is not entirely right. However, it is useful for guiding research and it seems to have done so in some areas productively. There are also well known anomalies (e.g. The key to the cabinets are…). But for now, it has surprised me that islands should have online effects, as do binding locality conditions and c-command requirements. So, let's see where this goes. On one version, the SMT becomes a modern version of the DTC, which, I now think, we gave up on too quickly.

      Last point: It seems to me that the IGH and EGH are orthogonal to these issues. The SMT is compatible with both, or so it seems to me. Maybe I am wrong however. That said, should all use systems make all the same distinctions online that would be an interesting thing to discover. Let me ask you: would that have an implication for IGH vs. EGH? If not, what would?

    5. (1) All systems do/do not make the same distinctions
      (2) The grammar is/is not in a meaningful sense distinct from these systems

      No, indeed, it seems (2) is independent of (1). But now add

      (3) [SMT] Grammar is an optimal solution to the problem of mapping between certain interfaces

      And now add

      (4) [SMTN] The more of the properties of grammar that can be found in efficient systems, the more the SMT is confirmed

      That's how I'm summing up your first paragraph. Now, I think this will be tangential, but I actually just realized that the systems referenced in (4) don't need to have anything to do with language at all. But we were talking about language systems anyway, say parsing. And we were assuming that by doing online measures, we were getting measures of the "efficient part" of that system.

      And, to be clear, (1) is related to (4) because of (5):

      (5) The only way we get any information about any properties of grammar is via cognitive systems that do languagey stuff (whatever turns out to be the case for (2)); so we can only operationalize (4) by looking for cases of (1), where at least one of the systems in question is efficient.

      But again, that's tangential: just to put the whole picture together.

      So, now I can answer your question. I think (2) interacts with (4). Suppose it were the case empirically that:

      (6) All efficient systems implicated in language are in utter disagreement with respect to how they work.

      In order for this world to make any sense, something like (7) would have to be the case:

      (7) The inefficient parts of the systems that we're looking at when we find out properties of grammar are the ones that are in agreement.

      So, the first stages of parsing are heuristic, the first stages of production are shoddy, and, crucially, in very different ways. In sum, everything would be dysfunctional except for some eleventh hour superman systems who come along to save the day and inject island constraints. For example.

      In this case, (4) would lead us to the conclusion that SMT is false if IGH is true. But under EGH, (4) would have failed us, because we would still have to hold out the possibility that grammar is, in principle, a very good solution to the problem of mapping between the interfaces, except that when that very good solution is combined with the problem of getting that information through the external systems, to/from the form where it can be used by grammar (the interface), that efficiency consistently evaporates.

      To bring it back to earth, EGH allows for the possibility that the Grammar exists and decided to encode its inputs and outputs in ways that are in principle usable by the relevant systems - but godawful for the systems in practice - and then chose the optimal way of mapping between these two sub-optimal encodings. But IGH won't let you do that, because there's no there there.

    6. I would point out that this conclusion completely contradicts the interaction I alluded to before - whereby SMTN presupposed EGH (in something other than its wording). It's the reverse, sort of. SMTN gets stickier if EGH is true. I think.

    7. Whew: more involved than I had dreamed it would be. I am still not sure I get the point but let me try to say it another way.
      All systems make distinctions, and, as you note, these need not be the same. The grammar does as well, and, as you say, this need not be the same as those the users make (though they can be). Now to (3)/(4). The locution you use is Chomsky's, where the SMT is thought of as an optimal solution to bridging AP and CI. This is not a view, I believe, that he now endorses as he has started downplaying the AP relation. At any rate, the proposal I made is that the SMT is saying something about how things that use grammars look and it is making two claims. First that users embody grammars transparently and second that it is in virtue of so embodying grammars that they are good at what they do.

      What do I mean by "good at what they do"? I intend that part of the reason that Parsers are efficient (incrementally parse quickly) and acquirers are able to do what they do (learn languages easily) is that the representations in Gs that UG makes available have the properties they have. So, because Gs have locality conditions of the subjacency variety they can be used by parsers that humans deploy to parse sentences quickly incrementally. This is the KIND of argument B&W make that I am proposing the SMT generalizes.

      How do we operationalize this version of the SMT? I think we have paradigms of how to do this already: (i) we look to see how incremental systems embody Gs. How transparently do they do so? One answer is that they do so VERY transparently, and we find that online dependency formation (that's all a parse is right?) is regulated by the kinds of relations and conditions that Gs license. Another answer is that the online incremental process is less transparent, e.g. it builds DPs and VPs but makes no distinction between dependencies into islands and those that are not into islands. The SMT leads us to expect that the relation will be very transparent; in the best case we expect to find a match between parsing probes and grammar probes (i.e. late results and early results match (btw: I doubt that this will hold up)). Say we find some of these. This sets up a following question: (ii) do the features of the grammar that the user exploits explain (in part) why the user does as well as it does? Note: we have some sense that parsers are "efficient" independently of whether they embody grammatical properties. They are "fast," "incremental," generally "reliable," etc. That means that the algorithm that underlies the user routines is a good one (in some intuitive sense to be sharpened with research). Now we can ask to what degree its virtues depend on the fact that it crunches representations with the particular formats the competence theory proposes. For example, would the Marcus parser do as well without a finite left bound? Or if popped expressions could be re-opened? If no, then we can say that these features are "efficient" in the derivative sense of being part of a use system that does what it does well in virtue of having features like these.

      Now, I am not sure how this all relates to your points above. I am pretty sure that it is orthogonal to the E/IGH issues you raise. I also agree that evaluating the "truth" of the SMT will be indirect, hence the programmatic nature of the SMT. Moreover, I wouldn't have thought it a proposal worth entertaining (in fact until recently I DIDN'T think it such) until I started thinking about the recent psycho stuff in conjunction with the older findings by Berwick, Weinberg, Marcus etc. With these paradigms in mind, it struck me that there is a real project here and that the bits and pieces are starting to emerge that show that it is empirically explorable in non-trivial ways. Personally, I have been surprised that online measures have made G distinctions as nicely as they have. I am expecting a rocky future.

    8. OK yes. I think your version of the SMT skirts the issue. My point in the previous comment was that if you believe grammars are separate from the systems that implement them, you could have a version of the SMT that's ONLY predicated over grammars, and not the systems that use them. Like (3). And then there would be some operationalizational daylight. It seems like you've formulated your SMT to be compatible with IGH, not necessarily EGH.

    9. What I mean is not that it's INCOMPATIBLE with EGH. Rather I mean that SMTN is not the only way the broader SMT could be true if EGH is true. If IGH is true, then SMTN is I think all there could be of SMT. There could be a different SMT if the G is E.

    10. One more thought on that: given that the relation between SMTN and SMT (which you might call SMTCO, for Chomsky/old) wasn't immediately transparent to me, and some of the bits about "efficiency" weren't transparent to Tim and Alex (or me but I think I sorted it out for myself with the fast systems/slow systems part, not sure if that made much sense to the external world). ... given this, I would vote for a further post with more rumination/exegesis narrowly on these two points, i.e., the logic of your SMT.

  5. So (to pose the question) is there any empirical evidence for one treatment of competence over the other?

    And to add a bit, under IGH, we'd need to say something more about exactly HOW production does islands in order to assess the SMT in this case (I think).

  6. I am confused here about the term "real time". From my conversations with Bob Berwick I think he means some technical sense of real time; ie. recognized by deterministic (perhaps non-deterministic?) Turing machines in linear time with bounded delay. That is a particularly stringent notion of computational efficiency that is I think incompatible with the approach you are taking, where real-time means something else, but I don't know what.

    Replies
    1. I think that I would like this stringent condition. Something Berwick suggests at times is to take the Marcus parser as a good proxy for what the parser looks like, for it runs in real time: a sentence N units long takes N units of time to parse. We parse it as we hear it. That would be a good thing to aim for. No?

  7. Norbert wrote: There is nothing logically incoherent about finding that the online systems and the off line systems make different distinctions. So, it COULD have happened that, for example, parsers did not respect islands. [But that is not what we see.]

    While I agree with the gist of this, the logical possibility that is being ruled out seems to me to be a very strange and complicated one (and this strangeness and complicated-ness often seems to slip by unmentioned). Suppose that some studies of word-by-word reading times (or whatever) revealed to us that "parsers did not respect islands". Then something else would be required to account for the fact that island violations produce unacceptability. I don't think it's enough to just say that "well, while the *parser* doesn't respect islands, the *grammar* does, and it's the grammar that's responsible for the 'slower' things like acceptability judgements". This would have to be supplemented by some story of how the acceptability judgements themselves are computed, after the parser does its non-island-respecting thing: this sounds to me like we are in effect positing two distinct parsers, only one of which is incremental. (What is it that produces acceptability judgements if not a parser of some kind?) One way of doing this is Bever's, which takes grammatical derivations to describe the bottom-up operations of the "second parser", but (and I suppose this is my main point) this goes beyond simply making a distinction between a parser and a grammar. So is this two-parser view being implicitly assumed every time we entertain the logical possibility that "parsers did not respect islands"?

    Norbert wrote: What I am pretty sure of is that we don't want to identify Gs with what we find in parsers. Colin has evidence bearing on this. Recall, he finds filled gap effects into some subject islands; those that license PGs. This means that the parser can and does establish dependencies into these islands AND unless there is a PG downstream speakers will judge these perfectly parsed sentences to be unacceptable. I assume that to describe this we need to say that a parsable dependency is nonetheless ungrammatical.

    On a similar note, I'm a bit confused about what is meant by a "parsable dependency" here (and I suspect this may be related to Alex C's question about the term "real time"). It seems to refer to "a dependency that is not rendered impossible due to memory constraints [or other implementation details]". But what about the ungrammatical subject-verb dependency in "You is tall". That is presumably parsable in this sense, right? If so, then you don't need anything as involved as Colin's PGs experiment to make the general point that parsable dependencies can be ungrammatical.

    Of course the important point of that experiment was that those dependencies had been claimed by others to be unacceptable due to non-parsability rather than due to non-grammaticality. So the finding that those particular dependencies were parsable was relevant to that debate. (Indeed, it seems to me to be a genuinely knock-down argument of a kind that we very rarely get.) But it seems to me to be a much more obvious and less subtle point that there exist dependencies that are parsable and yet ungrammatical. Am I misunderstanding the terms?

    Replies
    1. The background - my interpretation of N's SMT reasoning is that "the more nuancèdly/completely the off-line judgments are captured in on-line measures, the better reason we have to believe the SMT"; because (axiom) on-line measures show us things about "efficient" computations ("real-time" - yes I agree not much more specific, but not Bever Stage 2 if there is such a thing, I'll take that as something to hang on to) - N has yet to correct me on this reconstruction of his logic, so now that's what it is.

      So leaving aside the "grammar" association for the offline judgments, just saying "fast system" and "slow system", accepting that they may (Bever) or may not be qualitatively different - I _think_ N is saying

      (i) there is OBSERVED fast system behavior which is different from slow system behavior

      whereas (given that there's no illusion in "You is tall") that case would be

      (ii) POTENTIAL parses which are different from the slow system behavior

      note the vagueness of "potential" and the lack of tying it to the fast system. That's because I think that case is "parseable" in a different way. It differs in a lexical item _and not anything else_ - so if for example I were to parse the POS tags without any agreement then it would be fine. But there's no evidence that any receptive system does this. So it wouldn't count.

      So then here's the obvious question: why is this case not a failure for SMT? I think that's what's muddying up the reasoning. Why is "fast system tracks islands for YLPW" nice, but "fast system kinda misses the boat on the precise nuances of certain islands but does it in a way that sorta makes sense if you consider that those islands are exceptional in certain other cases for the slow system (but crucially not the case under investigation!)" .. why is that ALSO nice??

    2. (by "this case" I mean the PG-type subject island violation cases)

    3. @Tim:
      I propose taking the SMT to commit hostages to properties of systems that use grammars. Parsers use grammars to parse in real time (as we hear the sentence, incrementally). The proposal is that (i) parsers use these grammars to produce a parse and (ii) that they are able to parse incrementally in real time (as we hear the words) in virtue (at least in part) of the structures that these representations have. Given this view of the SMT we can look for two different kinds of evidence to support it.

      First, we can look to see if incremental parsing honors the distinctions Gs make. There is work suggesting that they do and they do so in surprising detail (that's the upshot of Colin's PG paper and Masaya's ellipsis slides).

      Second, we can ask if having representations that embody locality effects like subjacency or c-command or principles like monotonicity are critical in letting users do what they do well. I interpret earlier work by Berwick, Weinberg, Wexler, Culicover Marcus as suggesting that certain features that Gs have are very important in allowing efficient parsing or easy learnability. Now, I do not think that these results suffice. But they are very suggestive of how one might go about looking at these matters. For example, something that interests me now: is endocentricity useful for users? Maybe. Bob reminds me that Carl deMarken argued that endocentricity has beneficial effects in both domains.

      So that's the thesis. Could it be wrong? Sure. Indeed, it has been assumed to be wildly false for a long time, since the demise of the DTC. If it were wrong we would not expect to find footprints of the grammar in its real time use; or, as this seems too strong, we would at least not expect to see FINE grammatical distinctions honored in incremental use.

      Last point: by "parsable dependency" I meant something anodyne (I hope): In Colin's stuff one finds filled gap effects into Noun complement subjects (in contrast to relative clauses). I assumed that the presence of such effects indicated the attempt to forge a link between antecedent and verb. Thus, it is an online indication of a successful antecedent gap dependency being forged. What is interesting is that the capacity for doing this is compatible with a sentence being judged unacceptable if there is no downstream PG. Thus, we cannot reduce grammaticality to parsability for we can successfully parse (set up the requisite dependency) even though we will judge the sentence unacceptable. In the case you cite, we have not successfully parsed the sentence for there is no successful parse of it. We can attribute our unacceptability reaction to its not having a good parse. True we understand it, so we have done something. But we have not executed a successful parse of the sentence.

      I hope this helps. At any rate, I am glad that the SMT is generating discussion. I think it is a really interesting conjecture, one that unifies lots of current linguistic work from different angles. And that, I believe, is good.

    4. I'm still rather baffled. Not sure what else to say so let me try saying the same thing again just in a different way.

      As you say below in the response to Greg, the question is *how quickly* the knowledge (say, knowledge of island constraints) gets deployed in real time. There is no question of whether or not it gets deployed; if it were never deployed, we would not observe the eventual unacceptability judgements. So the knowledge is deployed at some point. The thing that deploys it is, pretty much by definition, a parser. So there is a parser that respects those constraints (island constraints or whatever). That does not seem to be up for debate.

      The potential finding that is sometimes described as "the parser doesn't respect island constraints", would therefore I think be better described as "there is an additional, first-pass parser that doesn't respect island constraints". This is not impossible or a downright crazy idea, but it does seem like the more surprising possibility. Doesn't it?

      (Perhaps some of the temptation to think of this two-parser view as the less surprising one stems from the fascination at how a parser would manage to achieve what it does on the one-parser view, island constraints and all, so quickly. I share this fascination. It is not obvious how this is achieved, and I would like to know how it is. But it seems "even less obvious" how things would fit together on the two-parser view.)

    5. I guess we are surprised by different things, but that's ok. I am happy with your version. Note that there is no reason to think that the second late parser need actually parse left to right at all. In fact, it could simply try to "generate" the structure bottom up to see if the input can be parsed correctly. This possibility would make it inadequate as a real-time parser I would suppose but would serve for rendering "grammaticality" judgements. So, what surprised me was not that we could give each sentence a parse but that we could do it in real time as the words came in. Now, given that the grammar "generates" sentences bottom up, it is likely that whatever parsing is, it is not generation in this sense given the obvious facts about incrementality. For the parser to be interesting it must be, in some sense, left-right. Now the question of whether the grammar can be used productively by a left-right parser is more interesting. We know that to do this, the parser must make more than a few top-down "predictions." Ok, how faithful are these "predictions" to what the bottom up generation would license? Well, very faithful. How does a left-right parser use a bottom up generative system? Good question. But it is in part by making certain kinds of surprising predictions. So might the first pass parser that operates left-to-right work on principles different from those of a bottom up generation device? Sure. Why not? At any rate, that's what got me surprised.

      So, yes, if we mean by a parser something that gives an analysis of the string into something like a structured object, of course acceptability judgements require parsing. However, the very late nature of this judgement allows for parsing to proceed in exactly the way we believe real time parsing CANNOT.

      Last point: as Paul Pietroski keeps reminding me: what we really want from our on-line parser is something that relates phons to sems. Currently we take something like an annotated S-structure to be a proxy for sems. But they are not sems. So we should also start trying to generate these, not S-structures. But for now, we can, perhaps, abstract away from this issue.

    6. One more small point: I suspect that the problem I have had expressing what I have in mind comes from the fact that "parse" has two different uses, informally. For some it simply means assigning a structure to a string of words. To others it means assigning such a structure as the sentence is heard. On the second, parsers are left-right thingies; on the first they have no obvious preferred direction. The parses that our standard grammars assign are generally done by bottom up generation. This is not a good model of parsing in the second sense.

    7. I share Tim's aesthetic judgements (obviously).
      1. Crocker and Brants have suggested that a correct parser which maintains a very small number of candidate parses might explain the "dual nature -- generally good and occasionally pathological -- of human linguistic performance." It is simple to impose such a narrow beam width on parsers, including those for minimalist grammars.
      2. Top-down parsers, including those for minimalist grammars, work their way through the string from left to right. The difference between the `direction' of the grammar and that of the parser is well-understood. Stabler has a recent paper discussing top-down parsing in MGs.
      3. A directly compositional semantic theory assigns meanings to the intermediate structures generated by a parser. We have a directly compositional semantics for MGs (my paper on montagovian dynamics), and so a parser which incrementally assigns meanings to strings is trivial to implement.
      4. There are many different ideas about how to link the actions of a parser to actual data, both on- and off-line. Late measures of complexity are perfectly compatible with correct and incremental parsers.

    8. corrigendum: left-to-right-ness of a parser is not tied to its top-to-bottom-ness, of course.

    9. Here's the way I understood (I think wrongly) the thrust of Alex C's comment in light of this, tell me if this makes sense:

      (1) The fast parser has to be fast
      (2) but there has to be a lot of stuff that's therefore not handled right by the fast machine (let's not say parser or grammar) because you reject AP ... and thus a bunch of stuff is only computed in the slow machine
      (3) but taken to its logical extreme the SMT would say there's no daylight between the fast machine and the slow machine

      So, given (3) and the other comments about not identifying fast and slow, we're not after the logical extreme. So this is more complicated than I thought. Now I'm asking, what (if any) important properties of language should lead us to reject AP, and, meanwhile what - presumably DISTINCT important properties of language should lead us to accept SMT?

      I don't think that's a topic to work out today, but I think there's got to be a tension there unless I've got this all wrong.

    10. @Greg. Re 4. I don't believe I suggested otherwise. After all that's what left corner parsers do. My point was that this needs showing and that it is not hard to imagine how a fast parser might not be responsive to G-like restrictions while a slow parser might.

    11. @Ewan I think I have the same difficulty. "optimal" means best, not just good, so what sense of "better than all the alternatives" is actually being used in the SMT?

      To be concrete, there is the class of NTS (non-terminally separated) languages, which is a subclass of the context-free languages that can be parsed incrementally in real time, PAC-learned efficiently from positive data, and so on; but natural languages can't be described by them. So in what sense are natural languages "better" than NTS languages?

      I really find it hard to take seriously a notion of optimal efficiency for parsing that allows for a correct parsing algorithm which is intractably hard (viz., not in P).


    12. "To be concrete, there is a class of the NTS languages, which are a subclass of the context-free languages that can be parsed incrementally in real time, PAC-learned efficiently from positive data and so on; but natural languages can't be described by them. So in what sense are natural languages "better" than NTS languages?"

      Less concretely, I think that's exactly the question I asked Norbert above. In any case, that's the question I wanted to ask, and I share Alex's feeling that "optimal" must have a rather peculiar meaning here that warrants some more explanation.

    13. +1 to this formulation of the problem ("in what sense are natural languages 'better' than NTS languages?"). The rational analysis part of me would like to at least try to work out in what sense that could be true, for the normal reasons (it will be progress even if it's false). But this, I think, puts the difficulty of the question into relief. It seems obvious that it's not as simple as enumerating properties of natural languages and finding they are uniformly processed efficiently: succeeding unqualifiedly at this could put us in a real pickle.

    14. I suspect the answer to Alex's question will be something like "there is some property P such that NLs have P but NTS languages do not have P, and NLs are the optimal solution for [parsing + learning + P]." Clearly this game can be played forever (or we can win immediately by setting P = "equal to NLs"). On the other hand, if one imposes conditions on what counts as reasonable properties, then this is not vacuous, and one can legitimately ask whether there are some reasonable properties such that NLs are the best among classes with those properties. I think that this is the SMT game.

    15. I wouldn't call "setting P = 'equal to NLs'" a winning move (not even a vacuous winning move); it's really just ending up with a circular non-explanation of the form "NLs are the optimal solution for [parsing + learning + being equal to NLs]".

      I think there is something more going on here, though, and that's an odd equivocation on NLs: not just in the boring and all-too-familiar "grammar / set of expressions" sense, but between NLs as something that provides a solution (grammars that we can use in parsing, say) and NLs as something that is given independently of the solution (objects of parsing and learning) and with respect to which we can ask questions about optimality.

      But I thought we all agree that when it comes to language, the objects for my ability to deal with language are just the products of your ability to deal with language? And that there is no NL in the latter sense?

      But then I don't think you can explain the kind of grammar we have (answering "Why are NLs not regular, or NTS, or X, but Y") by pointing to optimality for parsing or learning; that's essentially just doing what Greg caricatured so nicely: NLs have the properties they have because they are NLs, i.e. because they have the properties they have. And because they do have the properties they have, they are actually really good for parsing their own outputs.

      On second thought, I think I've lost track of this discussion and am simply not getting the point at all. But I'd still really like to know in what sense natural languages are 'better' than NTS (or for that matter, regular) languages.

  8. There is an informal sense of real time which means taking n seconds to process n seconds of input; so call this RT-I.
    A formal sense (actually several), which involves multi-tape Turing machines, linear bounds and finite delay; call this RT-F.

    So we observe that humans can in general understand language in RT-I, subject to occasional slowdowns, pauses and complete failures (think garden-path sentences, multiple sentence embeddings, etc.). Call this empirical observation (E).

    So one argument (AR) is that E implies that the sets of grammatical strings of the language are in RT-F.

    That may or may not be a good argument (one can say a lot about it), but it doesn't have anything to do with the SMT, right?

    The weaker argument (AP) is that E implies that the sets of grammatical strings of the language are in PTIME (the set of efficiently parseable languages).

    So clearly if you reject AP you must reject AR (since RT-F is a subset of PTIME). And I think that you and Bob B reject AP.
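    (A schematic gloss of that logic, my restatement rather than Alex's own wording, with E the empirical observation above and L the set of grammatical strings:)

```latex
% Schematic gloss (a restatement, not Alex's own formulation)
\begin{align*}
\text{(AR)}\quad & E \Rightarrow L \in \text{RT-F}\\
\text{(AP)}\quad & E \Rightarrow L \in \text{PTIME}\\
\text{Since}\quad & \text{RT-F} \subseteq \text{PTIME},\ \text{(AR) entails (AP)};\
\text{so rejecting (AP) forces rejecting (AR).}
\end{align*}
```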

    Replies
    1. Wrt AR:

      "the sets of grammatical strings of the language are in RT-F.
      . . . That may or may not be a good argument, one can say a lot about it, but it doesn't have anything to do with the SMT right?"

      - Actually, from what N and I have triangulated throughout the comments, I take the first statement to be more or less what he is understanding as the SMT, or at least an immediate consequence of it.

      If N will let me summarize what I got from our various discussions, I believe his version of the SMT is (restricting attention to parsing; this shouldn't change anything):

      [SMTN] The properties of language are "optimal" in the sense that they're there to facilitate efficient parsing.

      So we should expect to see a correlation between efficient parsing and the properties of language, by which is meant the various nuances like island constraints.

      It might be a useful public service to highlight and spell out the tension between !AP and AR a bit further, because it seems to me that the psycholinguists N is referring to, who are trying to demonstrate how nicely on-line measures (and thus efficient mechanisms) track the grammatical nuances of the language, are apt to wind up taking this contradictory position sooner or later if they're not careful.

    2. This may just be me being slow, but wouldn't languages that ensured linear-time parseability and avoided any kind of garden pathing (basically, languages that, if they were spoken, would justify an empirical observation like Alex's without the "subject to occasional...") be even more "optimal" in that sense? And if so, how come we aren't speaking these kinds of languages? And if not, why aren't these languages more "optimal"?

    3. Thanks to Ewan for unpacking my thoughts, but, at the risk of embarrassing myself, let me say something here. I agree that we should take real time parsing PERFORMANCE to be a fact, as Alex observes. The aim is to explain this fact. It is due to a combination of factors, ONE OF WHICH IS THE SHAPE OF HUMAN Gs. The SMT is the proposition that our remarkable ability to do what we do as well as we do IN THE RELEVANT COGNITIVE DOMAIN (say sentences with at most 6 levels of embedding) is in part due to the fact that our Gs have the structures that they in fact have. Thus, the theory of competence, which I assume describes these Gs and their properties, will play a large part in our theory of parsing performance. It won't be the whole thing, but a big part of the account.

      How's this related to Alex's observations? Well, so far as I can tell, his observations concern the efficient recognizability of grammatical strings. It is not clear to me what this has to do with the performance issue of parsing in real time. Moreover, as Bob Berwick has pointed out to me, the relevance of this to the performance issue also eludes Greg Kobele, for he writes in his thesis (appendix 2, p. 250):
      "Mild context sensitivity can be given a bipartite characterization as a list of conditions that a language must meet. The first condition is that the language be among those strings that are recognizable by a deterministic turing machine in polynomial time. This is often called efficient recognizability. Although it contains the words 'efficient' and 'recognize,' this criterion is emphatically *not* related to idea about human language processing."

      Greg put the words right into my mouth. The SMT is a thesis about how competence theory will interact with performance theories, and it claims that the fact that Gs have the properties the competence theory claims they have will be a significant part of the explanation of why they are very good at what they do.

      Last point: how to investigate the thesis? Berwick suggested someplace that we use the Marcus parser as a benchmark for good performance. It implemented a fair amount of the EST theory and got pretty good linear time performance over a reasonable range of sentences. It also failed where people do. We can inspect its properties and ask why it did so well. Marcus and Berwick & Weinberg argue that this has to do, in part, with the fact that it implemented certain kinds of grammatical constraints. If correct, this is a good SMT result. This, of course, is not the last word on the issue. But it is an excellent FIRST word. It is much better than the formal results that Alex keeps referring us to (and Ewan seems seduced by), for, as Kobele notes, they are irrelevant to the SMT issue at hand.

      Last point: it would be nice to address these issues entirely in formal terms. We cannot, so far as I can see. We can start to build realistic models that incorporate Gs and see how they run. This requires building parsers that parse over a reasonable range (Sandiway Fong has done this and Marcus did this and Berwick did this) and asking what makes them run well. This is the SMT project, at least in part.

    4. @Benjamin: it is entirely consistent with the SMT that what allows a parser to parse well will also force it to stumble at places. This, indeed, is what the Marcus parser does.

  9. @Norbert I am not advocating the arguments AP or AR here, just trying to understand the SMT as you view it.

    So just to be clear, you reject AR and AP, and the research strategy you advocate to investigate the SMT is to implement broad coverage parsers and test them empirically somehow? Presumably on real corpora, or on artificial examples?

    Replies
    1. Yes, something like what Marcus did and for that matter Fong has done. It's not perfect, but a good start. As for what the example set should be? I am catholic here, though I suspect that for the time being real corpora would be more difficult.

      Oh yes: I reject both and do advocate what you say. You got it quite right.

  10. I did indeed say those things, but I intended them to be understood differently. An interesting result of comp sci is the division of the logical space of possible languages into hierarchies (Chomsky hierarchy, complexity hierarchies, etc.). It is a non-trivial discovery of linguistics that the actual languages that have been observed are not randomly distributed throughout the logical space of possible languages, but rather cluster into a small group. One of the properties shared by this group is that of being recognizable in polynomial time on a deterministic Turing machine. Although it is tempting to think of this property in terms of a parser, there are equivalent characterizations of this class of languages (being describable in first order logic with a least fixed point operator) which do not have anything to do with the dynamic processing of recognizing strings. Still, the fact that all known natural languages can be recognized efficiently is very suggestive! It also means that we can write correct algorithms which do exactly what the grammar says they should do (in terms of accept/reject). This seems like a natural place to start when trying to develop theories of the human sentence processor. There are otherwise infinitely many procedures that agree with the correct ones on a fixed finite number of strings. The idea to start with a correct procedure is related to the idea of rational analysis in the cog sci/psych literature (in addition to being the basis of the levels hypothesis).
    One thing that has always puzzled me about the idea that the grammar should be ontologically distinct from the parser is what explanatory role that leaves for the grammar. Norbert seems to be suggesting that the grammar should be used for the explanation of acceptability judgments, and the parser for the explanation of other things (like eye-tracking). This can't be right as I've stated it, because we appeal to the parser to explain why center embeddings are unacceptable. So then the grammar is used for the explanation of whatever acceptability judgments that the parser doesn't account for. It feels like this is a fairly slippery slope; how does the grammar actually get used here? We need some sort of theory of the use of the grammar; a separate parser for acceptability. Now we have two parsers; the parser which respects the grammar, and the one which only sort of respects the grammar. What a mess!
    Work in computer science shows how we can use the parser which respects the grammar to do lots of things, including seeming not to respect the grammar (NVN-style heuristic effects). Why not start here? An uncharitable person might describe this as looking for lost keys under the lamppost, but where else are we going to start looking for them?
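    For concreteness, here is a minimal sketch, scaled all the way down, of the kind of correct, efficient recognizer being described: a CKY recognizer for a toy CFG in Chomsky normal form (my illustration, not Greg's MG machinery). It accepts exactly the strings the grammar generates, and it does so in time polynomial in the length of the input.

```python
# Toy illustration of a correct, polynomial-time recognizer: CKY for a small
# CFG in Chomsky normal form. The grammar below is a made-up example.
from itertools import product

RULES = {          # binary rules: (B, C) -> set of parents A, for A -> B C
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
}
LEXICON = {        # preterminal rules: word -> set of categories
    "police": {"NP", "V"},
}


def recognize(words, start="S"):
    """Return True iff the grammar derives the string -- accept/reject only,
    exactly as the grammar dictates, in time roughly cubic in the length."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the categories spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] |= LEXICON.get(w, set())
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= RULES.get((b, c), set())
    return start in chart[0][n]


print(recognize("police police police".split()))  # True: [NP [V NP]]
print(recognize("police police".split()))         # False
```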

    Replies
    1. Me being slow, but the key sentence here to explain your previous comment is

      "Although it is tempting to think of this property in terms of a parser, there are equivalent characterizations of this class of languages (being describable in first order logic with a least fixed point operator) which do not have anything to do with the dynamic processing of recognizing strings."

      , yes? Trying to decide whether this is to be interpreted as -

      (a) it might not turn out that the actual parsing mechanism operates in polynomial time even though the language is in P

      OR

      (b) it might not turn out that the operation of an efficient parsing mechanism is causally responsible for the fact that the language is in P

      Or something else. ?

    2. I liked the early Greg K's comments better than the revised version. But that is neither here nor there. I would just make two comments.

      First, there are many places to "start," and far be it from me to legislate where to look. However, for my money, as I take the big fact about language use to be that we parse incrementally and quickly (i.e. we understand a sentence as it is being spoken), we want to investigate parsers that have a chance of doing this. So a boundary condition on being "interesting" (for me, but you decide for yourself) is (i) that it produce pi-lambda pairs in real time and (ii) that it lead to easy learnability. Given these two criteria, I think that the (sadly forgotten) work in the mid-80s by Berwick, Marcus, Weinberg, Fong, etc. is a better place to look. True, these efforts did not start from a principled mathematical basis and they were empirically driven, but they addressed the right questions and produced models for how this kind of investigation might move forward. As early Greg noted, this is far less clear for the well-grounded investigations that he now endorses. Second, since the 80s we have good empirical evidence that parsers/learners really do use UG principles "in real time" to do what they do. Hence we should investigate systems that in fact incorporate these as part of their performance models. This too suggests we start from the mid-80s work. So, being catholic in my research principles, I say let people look where they want, but given my interests, I would look in different places than Greg would.

      Moreover, I think that the SMT has a place in this second program. Thus, what I found particularly interesting about the earlier work is that it in fact tried to explain the success of the systems in part from the kinds of representations being manipulated. This is a good model for SMT-like thinking, in my view. So, not only were the results empirically in the right direction, but the work could also serve as a useful model of SMT-like thinking.

      Last point: the grammar is not used to explain acceptability judgments. Rather, these judgments are evidence for the competence theory. I accept that there is a difference between what one knows and how one puts what one knows to use, i.e. competence vs performance. The evidence we have used to triangulate on the former has been acceptability-under-an-interpretation judgments. There is a second question of how quickly this knowledge is deployed in real time. The evidence seems to point to the (to me) surprising conclusion that it is used very quickly and robustly on-line. Unlike Greg, I do not see a slippery slope, just the standard problem of trying to sort out what the data is telling us about the structure of the underlying system. This is what scientific inquiry always worries about, and it is no different here. Some unacceptability can be traced to the structure of the performance system (e.g. it has a limited memory, etc.), some to the nature of the data structures (e.g. structures like this are ill-formed). How to divide up the raw data is what we do. The object of inquiry is not the data, but the mechanisms that cause this data, and like all interesting problems, arguing from data to mechanism is a complicated affair. We have paradigms for how to do this and, with luck and skill, we might do better. But there is no slippery slope and no principled problem. At least, I don't see the problem.

      So, where to look? Wherever you want. But for my money, were I making a research bet, I would look backwards a little, in fact to the mid-80s, for this, IMO, is the best work done on the relation between competence and performance we have.

    3. @ewan: I would say both; the fact that there is an efficient parser for a language doesn't necessitate that it is being used. And the *why* question is, I'd guess, answered by a combination of stories about the learning algorithm, the environment's influence on the primary linguistic data (the possible presentations, in the sense of Gold), communicative efficiency, the dynamics of language communities, etc.

    4. @Norbert: I don't think anyone here is really criticising the scientific merit of something like the SMT; I certainly accept it implicitly as a very natural thing to go after. But many things proved to be a distraction here:

      - I (and it seems not only me) get confused when you start talking about "parsers" versus "grammars"; after a while I got convinced that even though you were talking loosely about the grammar as if it were ontologically separate, you didn't mean it and were on board with letting the grammar be an implicit property - but then some of the reasoning about the "link between parsers and grammars" becomes highly obscure - I would insist on translating it, a la Colin, to "fast versus slow"

      - facts about "fast stuff" necessarily have implications for what the "slow stuff" can be, not only for WHY it is, but for WHAT it is; you can't dismiss the efficient recognizability property when you're playing this game, because a simple and straightforward version of SMT would actually IMPLY it, no matter what you think of it. The closer the parser gets to doing the sum total of what is done up there in the head (other judgments and all) the easier it is to draw up a proof about the properties of grammars based solely on the fact that "parsing has property X".

      - further distraction on the WHY line - now I can integrate Greg's reply - it sounds like the intention was "have healthy skepticism about drawing conclusions about the link between the parsing efficiency and the reason language is the way it is" . . . which is exactly the SMT enterprise. It doesn't matter whether you put "efficiently recognizable" or "has island constraints" or "has the strict cycle condition" or whatever as the property of grammars. You may be skeptical about the relevance of being in P per se, but the logic here would apply to ANY attempt to draw inferences about why language is the way it is by relating it to the performance of the parser. So you can't endorse that criticism and accept the SMT at the same time.

    5. Two points:
      "Facts about the fast stuff": I can dismiss it if I find that it asks us to look at the wrong problem. Explain why I need to worry about recognizability? What does IT tell me about UG? So far as I can tell, very little. The problem is not to recognize whether an arbitrary string is in my "language." So, I guess I do not see the relevance of this concern. Enlighten me.

      Second: I do not say that language "is the way it is" because of parsing efficiency. I am proposing an interpretation of the SMT: it is the thesis that in virtue of Gs and UG having the properties they have, languages can be efficiently parsed and easily learned. This does not say nor imply that Gs and UG have these properties IN ORDER TO ALLOW FOR efficient parsability and easy learning. That is an entirely different claim, having to do with the etiology of FL. The SMT as I understand it commits no hostages to this claim. Indeed, I believe that I noted in the post that should the SMT be true then it RAISES the question of why it is true.

      This said, IF some version of the SMT is true then it raises interesting research questions that are currently being actively investigated, and with interesting beneficial results. So, IF it is true then it implies that real time parsing cleaves closely to grammatical distinctions, as appears to be the case. It further implies that the reason we are able to parse so well, and by this I explicitly mean in something like real time, is that grammatical objects have the representational properties they have. I take it as a datum that we parse really well overall (Marcus notes this, as have many others). The question is whether this is true (yes) and why. I am happy to dump 'optimal' for 'well-designed', taking as evidence of this that we parse in real time. The empirical problem is to understand how and why. The SMT does commit hostages to this, as I understand it.

      So, be as skeptical as you wish to be. The SMT is programmatic. It is a conjecture that right now seems to be bearing fruit, at least in one direction (I plan to talk about what the SMT can do for syntax in a following post). But it would be more fruitful still, IMO, if people redirected their attention from PTIME issues and others of the kind that computationalists find fascinating, to others more akin to what Marcus, Berwick, Fong, etc. looked at. These provided models of fast parsers/learners that covered an interesting domain of data. We could ask OF THESE PROPOSALS how they did what they did. It seems that part of what allows them to do what they did well is that they embodied properties of the kind that competence grammars were proposing. So far as I can tell, this line of inquiry has nothing to do with the line advocated by Greg or Alex or, it seems, you. Fine. We can do different things. But as I am interested in why we parse quickly and learn easily (and by this I mean in real time and on the basis of more or less simple input), this may focus my attention in ways different from the focus of the more computationally adept, who seem to have their sights set on questions only tangentially related to these. Indeed, if I understand things right, they take my interests to be barely coherent and I seem to find theirs largely irrelevant. That's fine: it's a big area and the problems are hard, so maybe the right strategy is to pursue them in different ways. At any rate, that's still how I see the SMT and I don't see why I need worry, for my version of the SMT, about the other concerns.

    6. they take my interests to be barely coherent and I seem to find theirs largely irrelevant.
      and that's why I feel so frustrated. I think that our interests are largely identical. Yet somehow, we end up talking in circles.

      Example: (Parsing quickly)
      Of course it is true that we parse in real time (and to meanings to boot) most of the time. As I see it there are two proposals for this. N: there are two parsers, one is a fast and frugal one, which sorta kinda respects the grammar sometimes, the other is a slow and ponderous one, which respects the grammar more. G: there is exactly one parser, which exactly represents the grammar, but uses heuristics to guide it.
      Clearly these are different avenues of explanation of the same basic datum. I have no idea how to decide a priori which one is right/better/more fruitful/etc, but it seems like we're here interested in the same thing.

    7. I would frame the dispute about parsing quickly differently.
      Here are two approaches

      The Formal view: take a grammar class which is well defined and has a PTIME recognition algorithm; use this as the basis for some heuristics (e.g. a beam search width) which take the complexity down from n^6 or whatever to something that might explain the real time parsing (I don't think this has to be linear on a conventional computer, for various reasons).

      The Informal view: PTIME recognition is irrelevant; instead we should take whatever grammar Chomsky says is right (apologies for this irrelevant snarkiness), even if it doesn't have a PTIME algorithm, and experimentally explore heuristics that bring down the parsing time on a conventional computer to something that is roughly linear.

      So we could talk about F versus I (personally I think I has little chance of success, and there are in any event approximately zero people working on it, since this is not what Sandiway is trying to do), but the important question here is whether this has anything to do with the SMT. And I just don't see that it does.
      The SMT is not about "efficiency", which is uncontroversial, but about "optimality", which is highly controversial. And dropping down from optimality to efficiency is tantamount to abandoning the SMT in favour of some Weak MT.

    8. @Norbert:

      "Explain why I need to worry about recognizability? What does IT tell me about UG? So far as I can tell, very little. The problem is not to recognize whether an arbitrary string is in my 'language.' "

      I think you are saying it is not obvious to you what the relation is between the properties of parsing/understanding in "real time" (fast), the properties of sound-meaning computations that may or may not work in real time (slow), and non-trivial properties of the "language". In fact, I think you are even saying it seems to you there isn't any useful relation between these things. Let me give it a shot.

      The "language", if we're going to be useful about it, is the set of sounds for which there's some sound-meaning computation that converges. This might seem like a boring non-window into the computation, but it's potentially interesting. Why? Because if I can parse efficiently (find the m for a given s - for our purposes) then I can compute the language efficiently. Now, who's I - that's a counterfactual. It's not something we think the human computation is doing, it's just a fact we can keep in the back of our minds.

      Like, for example, if there's an efficient and correct parser, then the search-for-a-meaning problem is (at least) in P. And if the search-for-a-meaning problem is in P, then the language is in P. I think you don't think the first question is boring, but you see that it necessarily gives an answer to the second problem, so they can't possibly be unrelated.

      So the relation to the SMT is that the SMT implies that the (fast) parser is also correct. It "has the properties of the grammar". In its simplest form that means "it computes what the grammar computes", and I'm not talking about the string language, I'm talking about the nitty-gritty details of how, because that's what you're talking about. And I know that the idea that it's doing things exactly "right" is stronger than what you're trying to say, but the suggestion is that they're close, and how close is yet to be determined. So if they're that close, and the parser is efficient, then you had better tell me exactly why it is that you still think you haven't said anything about the "language" property of the grammar. Remember, saying that language is in P is a counterfactual. It says "there exists" some algorithm; well, you just found it: run the parser, and tell me if it converged.

      So then how can you care about one and not about the other? The greater the degree to which the SMT is true, the more you narrow in on it computing everything correctly, "as the grammar does" so to speak, and the more you have to be careful about saying WHY you maintain that the AP property is not true. What sentences can't be parsed efficiently? Why? Why doesn't the SMT help in these cases? Knowing this is at least as important as knowing about the cases where it does. That's why efficient recognizability is an issue.
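      A minimal sketch of that last sentence (my illustration; parse() is a hypothetical stand-in for whatever fast, correct parser the SMT posits): a correct sound-to-meaning parser automatically doubles as a recognizer, so whatever efficiency the parser has carries over to the "language is in P" claim.

```python
# Toy illustration of "run the parser, and tell me if it converged":
# a correct sound-to-meaning parser doubles as a membership test.

def parse(sound):
    """Hypothetical stand-in for the fast, correct parser: return a meaning
    for 'sound', or None if no analysis converges."""
    raise NotImplementedError  # assumed, for the argument, to be efficient and correct


def in_language(sound):
    """The counterfactual recognizer: if parse() is correct and runs in
    polynomial time, this membership test is in P as well."""
    return parse(sound) is not None
```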

    9. @Ewan: If I understand the results about recognizability, they are upper bounds: they say that, for a class of systems (MGs or CFGs or whatever), the "longest" it will take to recognize a sentence with an arbitrary grammar from this class is XXX. Moreover, this applies to recognizing ANY sentence in the relevant class. So, worst case for any sentence. My question is what does this have to do with a cognitively relevant theory of parsing? Not much, I fear. Why not?

      First, I don't care what ANY MG (or suitably interesting candidate grammar) can do but what the ones that FL allows do. The upper bound is a very weak condition on this. Furthermore, we have an empirical hint that this worst case stuff is not relevant, for we parse "normal" sentences in real time, incrementally as we hear them (actually: we either parse very fast or stumble around a lot, e.g. garden paths and 'police police police police police', where we do very badly). Given this, we want a theory that does, for the most part, very well indeed.

      Second, I don't think we care about parsing ALL sentences well. Like I said, let's declare victory if, say, we can parse sentences with 6 levels of embedding of roughly length 25 words and less very fast. That's a good target for now. I really don't care what happens as N goes to infinity. And I don't care for two reasons. First, the obvious one that we are probably pretty bad at parsing 50-word sentences (try reading Henry James as a test). Second, recognizability results for the very long sentences abstract away from the size of G (as Berwick and Weinberg told us long ago). I would be happy with a version of the SMT that predicted fast parsing for sentences where the properties of G matter, and these are the cases where the size of G is not overwhelmed by the length N.

      So, recognition results make two idealizations whose relevance for the cognitive matters of interest leaves me cold: they are results for the worst case Gs in a large class, whereas I am interested in the properties of specific Gs, and they worry about sentences of length N as N gets very, very big. I don't expect any cognitively interesting parsing story to say much about what happens in either of these cases. That's why I care about one but not the other.

    10. Yes, the crucial point is just:

      "let's declare victory if, say, we can parse sentences with 6 levels of embedding of roughly length 25 words and less very fast"

      This is much too casual an attempt to strike the balance between NOT necessarily having overall fastness and WANTING to necessarily have SOME fastness. Whether some simple bound on embeddings like this will turn out to work to get an interesting SMT while not necessarily preserving fastness in toto, I don't know; I don't know enough about the technicalia of syntax. It's very optimistic though, that's the point. In general you will have to worry a lot about which sentences the SMT works for and which sentences it doesn't.

    11. The thing is, we already have parsers for MGs that are polynomial time (you can download them from Stabler's web page); whereas we just don't have parsers for standard transformational grammars or Minimalist Program grammars (i.e. non-Stablerian grammars) that "can parse sentences with 6 levels of embedding of roughly length 25 words and less very fast".
      So methodologically we are comparing something we already know how to do, and that *has already been done*, with something that experts in the field now think is an inappropriate way to approach the problem.
      If we actually had an observed linear time parser that worked on this large finite set of sentences, then this would be a different discussion.

    12. Sorry for the typos..

      Just to continue the discussion: there are some different arguments being run together here.
      A) arguments against worst case asymptotic complexity in general (as used widely in CS)
      B) arguments against formal analysis in general
      C) arguments specifically against considering P time complexity of the set of strings in parsing.

      I have some sympathy with the C arguments, in particular that we need to consider the size of the grammar, and perhaps also the number of embeddings (k) or some other parameters (number of simultaneously moving constituents, stack depth, etc.) that may cause problems, which would lead one to what is called a fixed-parameter tractability analysis, i.e. a more refined formal analysis that takes account of these.

      On the A arguments, I could point you to a lot of literature, but, crudely, the success of information technology in general suggests that the general analysis has some value. We teach CS students this stuff for a reason.
      There are of course cases where we use a sort of average-case complexity analysis; PAC learning is a good example, where the worst case is that every example you see is the same.

    13. One more thing: "First, I don't care what ANY MG (or suitably interesting candidate grammar) can do but what the ones that FL allows do."
      So yes, I think this is right:
      if we have a class of grammars C (say C = all MGs with any set of features, etc.) but FL only allows some subclass C' (say all MGs with some fixed universal set of features), then we are only interested in the behaviour on C'.
      And since C' is a subset of C, the worst case behaviour of C will be an upper bound on the worst case behaviour of C'.

      So I agree that we should study C' not C, but has anyone suggested otherwise?
