Monday, June 9, 2014

POS and PLD

What do linguists study? The obvious answer is language. GGers beg to differ. The object of study, at least if you buy into the Chomsky program in GG (which all right thinking linguists do), is not language but the faculty of language (FL). GGers treat FL realistically. It is part of the mind/brain, a chunk of psychology/biology.  IMO, perhaps Chomsky’s biggest achievement lies in having found a way of probing the fine structure of FL by studying its outputs and thinking backwards to the properties that a mind/brain with such outputs would have to have. This mode of reasoning has a name: the Poverty of Stimulus argument (POS).

The POS licenses inferences about the structure of FL from the structure of the Gs that native speakers of a natural language (NL) acquire. As I've illustrated before how the POS can be pressed into service to this end, I won't do so again here. Instead, I'd like to recommend a good paper by Jeff Lidz and Annie Gagliardi (L&G) (here) that goes over these issues again with novel illustrations. Here's some of what's in it.

First, the paper illustrates something that's often obscured: how POS reasoning focuses attention on the properties of the Primary Linguistic Data (PLD). As usually structured, the argument relies on there being a gap between what can be gleaned from the data the LAD can exploit and the structure of the knowledge attained. It's through this gap that the structure of FL can be discerned. L&G nicely walk us through the logic of the gap and the inferential procedure underlying the POS. They observe, correctly in my view, that there is nothing antithetical between a commitment to UG-like principles and the use of statistical learning procedures (a point Charles has made in his recent posts as well). The question is not whether stats are relevant, but how they are. Here's what L&G says.

L&G notes that there are effectively two views of how stats can enter into acquisition discussions. The first, which L&G dubs the "input driven view" (IDV), conceives of acquisition as a process of "track[ing] patterns of co-occurrence in the environment and stor[ing] them in a summarized format" (3). Generalizing beyond actual experience is licensed by this "summary representation of experience" (4). IDVs tend to take this learning procedure to be domain general, rather than linguistically specific.

There is a second approach, the "knowledge-driven view" (KDV). For the KDV, experience functions to "select" a knowledge representation from a pre-specified (aka innate) domain of options.

There is a long tradition distinguishing "instructive" versus "selective" theories of "learning." This has even played a role in studies of the immune system, where several people got Nobels (Jerne, Edelman) for showing that earlier instructive theories of antibody formation were wrong and that antibody formation is actually selection from among a given set of possible options.

Nowadays, given the rise of Bayes, selection theories of learning are all the rage. However, there is a sense in which any statistical theory must be selective, for probabilities presuppose an antecedent demarcation of the possibilities over which the probabilities are assigned. In other words, we need a given hypothesis space/algebra in order to assign, e.g., probability densities. No space, no probabilities.[1] The probable selects from the possible.
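To see the point in miniature, here's a toy sketch (mine, not L&G's; the grammars, sentences, and numbers are all invented): the data can only redistribute probability over hypotheses that the space already contains.

```python
# Toy sketch of "the probable selects from the possible": a Bayesian
# learner can only redistribute probability over hypotheses it already
# has. All grammars, sentences, and numbers below are invented.

hypotheses = {
    "G-OV":   {"she the ball kicked"},
    "G-VO":   {"she kicked the ball"},
    "G-free": {"she the ball kicked", "she kicked the ball"},
}
prior = {g: 1 / len(hypotheses) for g in hypotheses}

def likelihood(grammar, sentence):
    # A grammar strongly favors the sentences it generates.
    return 0.9 if sentence in hypotheses[grammar] else 0.01

pld = ["she kicked the ball", "she kicked the ball"]

posterior = dict(prior)
for s in pld:
    posterior = {g: posterior[g] * likelihood(g, s) for g in posterior}
    z = sum(posterior.values())
    posterior = {g: p / z for g, p in posterior.items()}

# Probability mass shifts toward G-VO and G-free; no new grammar appears.
print(posterior)
```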

One of the nice features of the L&G discussion is that it notes that the instruction/selection question is orthogonal to the empiricism/rationalism issue. The relevant question for the latter is the nature of the relation between the input and what is "selected." If selection requires matching inputs (i.e. if acquisition is essentially a bottom-up affair in which the generalizations are simply distillations of regularities found in the input), then the relevant mechanism is Empiricist as, using traditional terminology, the generalization "resembles" the input. If one allows for a "distance" between the attained generalization and the input, then one is approaching the issue from a Rationalist perspective. Rationalists are necessarily selectionists, but Empiricists need not be instructionists. The deep question lies not on this dimension, but on how closely the input must "resemble" the generalized output. As L&G explains, the important question is whether the statistically massaged PLD "shapes" or "triggers" the attained generalization.[2]

Here’s how L&G contrasts IDV and KDV:

1. For the IDV, the learning process goes from "specific to general" as the main operation is a process that "generalizes over specific cases" (4). For the KDV, learners do not need to "arrive at abstract representations via a process of generalization" as these are built in. Rather, input is sampled to "identify the realization of those abstract representations" in the PLD (5).
2. For the IDV, the end product of learning is "a recapitulation of the inputs to the learner, [t]he acquired representation [being] a compressed memory representation of the regularities found in the input" (5). For the KDV, the attained representations (or many features thereof) are innately provided, so the "representation may bear no obvious relation to the input that triggered it" (5). There need be, in other words, no similarity between the attained generalizations and the specific structure of the PLD. Or, another way of putting this (which L&G does) is that whereas for IDVs the input shapes the attained representation, for KDVs it merely triggers it.

Thus, though both IDVs and KDVs recognize that experience is critical in G attainment (i.e. to learn English you need English PLD), how this PLD functions differs in the two cases: it forms the G-ish generalizations in the case of IDVs and triggers them in the case of KDVs. As such, both approaches require inferential mechanisms that take one from PLD to Gs.

Indeed, because KDVs cannot rely on "similarity" to bridge the gap from data to generalizations as IDVs do,[3] KDVs need learning theories that show how FL/UG supplies "predictions about what the learner should expect to find in the environment" to guide acquisition, which, for KDVs, consists in "compar[ing] those [FL/UG provided (NH)] predicted features against the perceptual intake…[which] drives an inference about the grammatical features responsible for the sentence under consideration" (10-11). L&G provide a nice little schema for this interactive process in their table 1.
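To make the shape of that predict-compare-infer loop concrete, here's a rough, purely illustrative rendering (the candidate analyses, frames, and numbers are invented by me, not taken from L&G):

```python
# Rough illustration (not L&G's model) of a knowledge-driven loop:
# UG supplies candidate analyses plus the intake they predict;
# observed intake then drives an inference over those candidates.

candidates = {"analysis-A": 0.5, "analysis-B": 0.5}      # supplied in advance
predicted_intake = {                                     # invented predictions
    "analysis-A": {"doc-frame": 0.30, "to-frame": 0.70},
    "analysis-B": {"doc-frame": 0.02, "to-frame": 0.98},
}

def infer(beliefs, observed_frame):
    """Compare predicted features against the intake; update beliefs."""
    scored = {a: beliefs[a] * predicted_intake[a][observed_frame] for a in beliefs}
    z = sum(scored.values())
    return {a: p / z for a, p in scored.items()}

for frame in ["doc-frame", "to-frame", "doc-frame"]:     # toy intake stream
    candidates = infer(candidates, frame)
    print(frame, {a: round(p, 3) for a, p in candidates.items()})
```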

Given these distinctions, one can now begin to investigate to what degree language acquisition fits the IDV or the KDV. L&G provide several interesting case studies. Let me mention two.

The first is based on Gagliardi's thesis work on noun class acquisition in Tsez. Here's a surprise: there are lots of noun classes and the system is complicated. L&G notes that classification depends (at least in part) on a noun's semantic and phonological features. What's striking is that despite being sensitive to both kinds of info, LADs use them "out of proportion with their statistical reliability, even ignoring highly predictive semantic features" (24-5). L&G asks why. The discussion relies on distinguishing the input to the LAD from its intake; the former being the information available in the ambient linguistic environment, the latter being the PLD that the child actually uses.

This is an important distinction and it has been put to use elsewhere to great effect. One of my favorites is the explanation of how Old English, which was a good Germanic OV NL, changed to VO. David Lightfoot (here, here) asked a very interesting question about this transition: how could the change have happened? Here's the problem: while the main clause data concerning the OV status of Old English (OE) was obscure and not particularly conclusive, the data that OE was OV was very clear in embedded clauses. He reasoned that if LADs had (robust) access to embedded clause PLD then there would have been no change from OV to VO in English, as there was plenty of very reliable data showing that the language was OV in embedded clauses. Conclusion: the LAD did not make use of embedded clause information. Given the reasonable conclusion that OE kids had the same FLs as anyone else, this means that in language acquisition kids do not "intake" embedded clause data (i.e. the PLD is degree 0+).

Lisa Pearl, in her thesis work and later publications (here, here), addresses this question in terms analogous to L&G's (viz. the intake vs. the input of PLD for the OE kids). Pearl asks how much embedded data would have been enough to prevent the shift to VO order. She does this to pin down whether the LAD actively ignores embedded data (an intake restriction) or whether embedded data info is absent because complex clauses with embedding are just rare in the PLD (an input restriction). Lisa was able to "quantify" the question, and the data she harvested from CHILDES (current English data, not OE, of course) suggested that kids must actively ignore available data if we are to account for the English shift from OV to VO. Note that there could have been another answer to Lightfoot's question: degree 1 sentences are too rare in the input for their information to be effectively usable, i.e. the restricted intake is due to restricted input. However, both Lisa and L&G show apparent cases where this answer is insufficient. It appears that the LAD might be deliberately myopic. And this raises another interesting question: WHY?
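For concreteness, here's a toy rendering of that logic (emphatically not Pearl's actual model; the counts and the threshold are invented): if embedded-clause cues enter the intake, OV survives; if the intake is degree-0 only, it doesn't.

```python
# Toy rendering of the Lightfoot/Pearl logic (not Pearl's model; the
# counts and the threshold are invented for illustration).

counts = {
    "main_OV_cues":       60,    # main-clause OV evidence: sparse/ambiguous
    "main_VO_looking":   300,    # main-clause strings that look VO
    "embedded_OV_cues":   80,    # embedded clauses: clear OV evidence
}
THRESHOLD = 0.25                 # invented: share of OV cues needed to keep OV

def keeps_OV(intake_includes_embedded: bool) -> bool:
    ov = counts["main_OV_cues"]
    total = counts["main_OV_cues"] + counts["main_VO_looking"]
    if intake_includes_embedded:
        ov += counts["embedded_OV_cues"]
        total += counts["embedded_OV_cues"]
    return ov / total >= THRESHOLD

print("degree-1 intake keeps OV:", keeps_OV(True))    # True  -> no change predicted
print("degree-0 intake keeps OV:", keeps_OV(False))   # False -> shift to VO predicted
```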

L&G discuss a second case that I would like to bring to your attention: acquisition of Double Object Constructions (DOC) in Kannada. The facts are once again subtle, though recognizably similar to what we find in languages like English and Spanish. At any rate, the pattern is very complex, with binding allowed between pronouns in the accusative and dative DPs in some conditions but not in others. L&G notes that the data present a standard POS puzzle and show how it might be addressed. What was most interesting to me is the discussion of possible triggers and how exactly FL/UG might focus attention on some kinds of available information (spoiler alert: it has to do with animacy in possession constructions and their relation to DOCs), which would function as triggers. L&G notes that this kind of info is available with some statistical regularity if you are primed to look for it. In other words, the abstract POS considerations immediately generate a search for triggers that FL/UG might make salient and thus be in-takeable (i.e. perceptually prominent data that would permit the triggering). This is a very nice illustration of the fecundity of POS considerations in thinking about how real-time acquisition works. It also nicely illustrates how the gaps that POS arguments identify can be bridged via data that is in no way "similar" to the representations attained.

Let me end. L&G is a fun paper. And there's lots of other good stuff (I really liked the discussion of Korean and how POS allows a kind of data-free parameter setting). It gives a good overview of the intuitions behind the competing acquisition traditions, makes some nice distinctions concerning how to think of PLD in the wild, and provides a budget of nice novel illustrations of POS arguments. A colleague (Colin Phillips) keeps insisting that he is tired of the focus on Y/N questions in English as the "hard" case of POS reasoning. I am not nearly as tired as he is, but we agree on one thing: POS arguments are thick on the ground once one looks at any even slightly complex bit of grammatical competence. There is nothing special about Chomsky's original illustration, except perhaps its extreme accessibility. L&G provides some good new grist for those interested in this kind of milling and shows how useful such reasoning is in attacking the question of how real kids (not just ideal speaker-hearer LADs) manage to acquire Gs.



[1] Perfors makes essentially this point (here).
[2] For the interested, this is further discussed (here).

[3] This is not to concede that 'similarity' is a serviceable notion. It's not, as philosophers (e.g. Nelson Goodman) have repeatedly shown.

Comments:

  1. I think it is a good idea to keep separate two distinctions:
    A) the distinction between domain-general and language specific
    B) the distinction between algorithms that go from general to specific and algorithms that go from specific to general.

    Just as an easy example, there are clustering algorithms that go bottom up, clustering individual items and then progressively making larger clusters (e.g. hierarchical agglomerative clustering), and those that start with one giant cluster that contains everything and split it into two (e.g. spectral clustering); see the toy sketch below.

    By the way, what's wrong with similarity? I have read Goodman but I have also read Vapnik.
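    A toy sketch of that bottom-up vs. top-down contrast, using scikit-learn on invented data (purely illustrative):

```python
# Toy contrast of bottom-up vs. top-down clustering (illustrative only).
import numpy as np
from sklearn.cluster import AgglomerativeClustering, SpectralClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# Specific to general: merge singletons into ever larger clusters.
bottom_up = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# General to specific: split one all-inclusive cluster via a similarity graph.
top_down = SpectralClustering(n_clusters=2, random_state=0).fit_predict(X)

print(bottom_up)
print(top_down)
```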

    1. Agree 1 and 2 should be kept apart. It's just that many domain general approaches of the Empiricist variety have tended to endorse bottom up procedures.

      I'm not sure what you intend by Vapnik. Goodman's point is that there is no useful sense of similarity to be had without explicit specification of what the similarity is wrt. And even then one needs specification of the entrenched predicates that are being projected. Empiricists took a long time coming to appreciate this point, and some still do. Does Vapnik show that this is false?

    2. Oops, some still do not...

    3. I agree that one needs to be precise about what similarity measure is being used (precision being important...), but I just wanted to say that there is a very large and well developed theory of learning based on similarity (kernel methods, Mercer's theorem, RKHS, etc.). A generalized and abstracted notion of similarity (via what is called a kernel) is the basis of all these techniques (toy sketch below).

      Anyway you are right that many empiricists, Tomasello especially, seem wedded to a bottom-up, exemplar-based learning model that is unfortunately (IMO) not precise enough to evaluate. Charles Yang has done some nice work showing some of the problems with Tomasello's claims.
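      A toy illustration of similarity-via-a-kernel (a character-bigram kernel over invented strings; nothing here is specific to any linguistic proposal):

```python
# Toy kernel: similarity made explicit as an inner product in a
# feature space (here, shared character bigrams). The kernel fixes
# *in what respect* two items count as similar.
from collections import Counter

def bigram_kernel(s, t):
    bs = Counter(s[i:i + 2] for i in range(len(s) - 1))
    bt = Counter(t[i:i + 2] for i in range(len(t) - 1))
    return sum(bs[b] * bt[b] for b in bs)

words = ["meatball", "meatloaf", "penguin"]
for w in words:
    print(w, [bigram_kernel(w, v) for v in words])
```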

    4. I agree with all of this too. What's not clear to me (because I don't know enough about kernel methods) is whether the kinds of new dimensions of similarity that get invented with kernel methods can produce the kinds of clustering of surface unrelated phenomena that syntacticians like to think about (e.g., the link between possession and ditransitives in our paper, the link between postverbal subjects and the presence/absence of that-trace effects, the parallel locality domains in raising, control and reflexivization, etc. etc.)

    5. It's not clear to me either... I think that while we have good notions of syntactic similarity, things like the binding theory, quantifier scope etc. depend at least in part on the semantic representations, and I don't think we have a good idea of semantic similarity (apart from, say, lexical similarity) -- like Norbert says, we don't know much about the CI interface.
      As I have said before, I find stuff like the Kannada ditransitives really baffling. Locality domains I think we can start to see how some similarity measures on the derivation trees might help -- e.g. the Pearl and Sprouse approach.

  2. I want to thank Norbert for drawing an important distinction:

    "The object of study, at least if you buy into the Chomsky program in GG (which all right thinking linguists do), is not language but the faculty of language (FL). GGers treat FL realistically. It is part of the mind/brain, a chunk of psychology/biology"

    Acknowledging the distinction between 'language' and the 'faculty of language' is an important step forward from decades of conflating this distinction. It is of course no new distinction; Jerry Katz insisted on it back in the 1980s [for some references see http://ling.auf.net/lingbuzz/001607 ]. Good to know that some leading GGers finally accept Katz' insight. Are those who continue to insist that language and FL are the same not 'right thinking linguists'?

    Now one question arises in the current context: are PLD part of language or part of FL? If the former it would seem that the [direct] study of language allows GGers to draw conclusions about FL [which can only be studied indirectly].

    1. Christina, I'm afraid your polemic commentary once again just serves to reveal your ignorance. From Chomsky's earliest work on, it has always been clear that "language" is not the object of inquiry, and could not be given that the word doesn't denote a scientific category. He first explicitly talked about "I-language" in the 80s, but that was just making explicit what had been a foundation of GG all along. As far as I can see, it's typically self-proclaimed critics of GG that conflate the distinction, which leads to all kinds of confusion -- the Evans & Levinson paper is a striking case. So I see no basis for your claim that "some leading GGers finally accept Katz' insight," given that it's neither Katz's insight nor has it ever been disputed by anybody working in the field.

    2. Please don't be afraid, dear Dennis O. Your fear is groundless: I can be accused of many things but ignorance of Chomsky's writings is not among them. I do not know when you date Chomsky's earliest work but in 1957 he certainly claimed to study language:

      "I will consider a language to be a set (finite or infinite) of sentences, each finite in length and constructed out of a finite set of elements. All natural languages are languages in this sense. ...The fundamental aim of linguistic analysis of a language L is to separate the grammatical sequences, which are sentences of L from the ungrammatical sequences, which are not sentences of L, and to study the structure of the grammatical sequences. The grammar of L will thus be a device that generates all of the grammatical sequences of L and none of the ungrammatical ones." (Chomsky, 1957, 13)

      As those who worked with Chomsky during his early years at MIT note: back then there was no talk of biological organs. So the focus on I-language has not been "a foundation of GG all along".

      You are also incorrect about the conflation. In 2000 Chomsky committed in print what you attribute to Evans & Levinson. When asked what the nature of the LAD is, Chomsky replied: “Well whatever the nature of language is, that is what it is” [Chomsky, 2000, 54]. This was long after he introduced the E-/I-language distinction, and neither he nor his editors seemed concerned about drawing distinctions you seem to think are important. If he is entitled to such 'off the cuff' remarks in his published works [I am not aware of you ever criticizing him], then you cannot blame others for doing the same.

    3. You may be well-read, but you clearly understand very little. Take a look at Syntactic Structures and see how much you can find there that has anything to do with the common-sensical notion of "language." Hint: it's very little.

      It is always a pleasure to debate with someone as knowledgeable yet as modest as you are, dear Dennis O. Could you be so kind and tell me when I said Chomsky talks in SS about the common-sensical notion of "language"? There are 2 reasons that make it unlikely I used such a phrase. First, I have no idea what this 'common-sensical' notion is; maybe you can enlighten me? Possibly with something more illuminating than Chomsky's "E-language is everything that is not I-Language" slogan. Second, I would expect that professional linguists talk about the formal notion of language [as Chomsky does in the passage I cite above].

      Maybe you can further be so kind and cite a specific passage in SS that indicates that Chomsky was conceiving of language back then as a biological organ? It was your claim that this commitment "had been a foundation of GG all along". And while you're at it, maybe you can also show where in the post-1993 work [the beginnings of Minimalism] we find confirmation that there has been no change to "The fundamental aim of linguistic analysis of a language L is to separate the grammatical sequences, which are sentences of L from the ungrammatical sequences, which are not sentences of L"?

  3. Perhaps it will be useful to clarify the distinction between compressed memory representations and abstract representations. What counts as compression? Only clustering methods that use a Euclidean distance metric over raw perceptual features (e.g. contours or formants)?

    I'm also not sure how this distinction relates to the domain-general vs. language-specific distinction. Is the idea that compression algorithms are necessarily domain-general?

    1. Yes, it is orthogonal to the domain-general/specific distinction - and I remember thinking that was my least favourite line in what was otherwise a good paper. I try not to give up hope that some day linguists and psychologists will deconflate indispensable generic notions like abstractness and similarity from issues about being domain-general or domain-specific. Completely orthogonal and there isn't much hope for the field until that distinction is made. The exchange between Norbert and Alex in the first comment thread gives me a bit of optimism (because Norbert came around) but it can get tiring to have to say it again and again and again.

    2. I think we meant something slightly different in our use of the term "abstract". What we (or at least I; I can't speak for Annie) mean by abstractness is not just that the representation generalizes away from (i.e., abstracts over) certain surface features, but that the representations have consequences for phenomena with no surface similarity. In the case of the ditransitives that we discuss in the paper, a single aspect of the representation controls both the class of NPs that can serve as the dative argument (#I sent NY the letter) and the binding facts. This aspect of the representation is abstract because it is only partially revealed in the surface distributions.

      But for sure, the two issues are orthogonal. I do think, however, that the kind of domain-specific representations that are implicated by a POS argument are necessarily abstract. The reverse, however, is not true. That is, one can be abstract (in at least some sense of abstract, possibly not ours) without being domain specific.

    3. @ewan
      Norbert has never conflated these notions, to the best of my knowledge. Indeed, not conflating them is a pre-condition, in his opinion, for engaging in minimalist inquiry. I confess to getting tired of hearing that the problems psychos and linguists have talking to one another productively rest on missing this obvious distinction. As Norbert argued extensively before, this is to trivialize the differences. They are far deeper, and trivializing them in this way will only further hinder productive discussion.

    4. Thanks for the clarification. I think I understand the intuition now, but the distinction gets fuzzy once you move beyond storing exemplars over raw perceptual features and allow memory representations with abstract features. As an obvious example, you can think of a CFG rewrite rule like NP -> A N as a language-specific compressed memory representation of both "big meatball" and "red penguin", two phrases that share no surface (phonetic or orthographic) features. Then you could compare a new phrase to these memory representations to check whether it is grammatical -- which is what Pearl and Sprouse do to account for some island constraints (as Alex mentioned). Of course, you could ask where these language-specific input representations come from; maybe that's what needs to be innate. But if I understand the intuition correctly, this case would fall on the non-abstract side of the divide.
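      For concreteness, a toy NLTK rendering of that example (the grammar is obviously invented):

```python
# Toy sketch: a single rewrite rule "compresses" surface-dissimilar
# phrases; new strings are checked against the stored rule.
import nltk

grammar = nltk.CFG.fromstring("""
    NP -> A N
    A -> 'big' | 'red'
    N -> 'meatball' | 'penguin'
""")
parser = nltk.ChartParser(grammar)

for phrase in (["big", "meatball"], ["red", "penguin"], ["meatball", "big"]):
    ok = any(True for _ in parser.parse(phrase))
    print(phrase, "grammatical" if ok else "ungrammatical")
```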

    5. @ Tal: Yes, I think the distinction can get fuzzy, but the need for the kind of abstraction we're talking about does not. So, you're right that the category NP can be seen as a compressed memory representation of some distributional facts, and hence would fall on the nonabstract side of the divide. However, if being of the category NP has consequences that are not exhibited in the data, then that category has some deductive power to it that puts it on the abstract side of the divide. For example, if being an NP makes a phrase subject to the case filter and hence the kind of thing that can undergo certain kinds of movement (e.g., passive, raising...) subject to restrictions that are not themselves expressed in the distributions (e.g., not out of a finite clause), then that category is abstract in the sense that I'm trying to get at. The category can be identified by distributional analysis, but if there are consequences that are not expressed in the distributions, the category is abstract in the relevant sense.

    6. I guess what I was trying to say is that it's hard to talk about what's discoverable from the statistical distribution of the data and what isn't independently of the compression algorithm you use to represent the input. If your grammar formalism has movement and finiteness, and you use this formalism to encode every sentence you hear, then statistics over your memory representations might actually be enough to acquire a ban on NP movement out of a finite clause. I definitely agree that the term "compressed memory representations" sounds less appropriate when your input representations get this abstract, but I don't know if it's possible to come up with a formal definition that would clarify the distinction.

    7. Agreed. But then, this is what the debates are all about and why it is actually hard to make a fully explicit POS argument, though there are many to be made. A POS argument depends on a particular representation of the input, what we're calling the intake. If the intake representations have all of the information that the (target) acquired representations have, then learning is a lot easier. A POS argument therefore has to be made against a set of assumptions about what the intake representations are, what the space of possible acquired representations is, and what the learning algorithm is. See Ewan Dunbar's dissertation for some excellent discussion of this.

    8. Thanks for the pointer to Ewan Dunbar's dissertation -- I agree it is very clear-headed on these points. (here)

    9. @Norbert

      The reason I harp on this is lost, then. The idea is that the sword is supposed to cut both ways. The people who think that there can be "assumption free" learning, or learning which is based "only" on similarity without saying "similar in what sense," are wrong.

      But the point is that by that token the "good guys" cannot then turn around and set up a contrast between learning that "[only] works by generalization, compresses the regularities in the input" versus learning that doesn't. The whole point is that as soon as you buy into that false dichotomy you're automatically playing the wrong game.

      The whole point should always be to underscore, each time the "input driven" people say it's "just" this, that they've made a commitment to a particular way of doing it, one among many; there is no "just." That's the Goodman logic, that's the argument I want to see. The skepticism about input-driven-ness should go so deep that you come out and say "there is no serviceable notion of input-driven." Because I don't think there is, and therefore I think all rebuttals that presuppose there is are incoherent.
