Faculty of Language: POS and PLD

Monday, June 9, 2014

POS and PLD

What do linguists study? The obvious answer is language. GGers beg to differ. The object of study, at least if you buy into the Chomsky program in GG (which all right thinking linguists do), is not language but the faculty of language (FL). GGers treat FL realistically. It is part of the mind/brain, a chunk of psychology/biology. IMO, perhaps Chomsky’s biggest achievement lies in having found a way of probing the fine structure of FL by studying its outputs and thinking backwards to the properties that a mind/brain with such outputs would have to have. This mode of reasoning has a name: the Poverty of Stimulus argument (POS).

The POS licenses inferences about the structure of FL from the structure of the Gs that native speakers of a natural language (NL) acquire. As I’ve before illustrated how the POS can be pressed into service for this end, I will refrain from doing so again here. Instead, I’d like to recommend a good paper by Jeff Lidz and Annie Gagliardi (L&G) (here) that goes over these issues again with novel illustrations. Here’s some of what’s in it.

First, the paper illustrates something that’s often obscured: how POS reasoning focuses attention onto the properties of the Primary Linguistic Data (PLD). As usually structured, the argument relies on there being a gap between what can be gleaned from the data the LAD can exploit and the structure of the knowledge attained. It’s through this gap that structure of FL can be discerned. L&G nicely walk us through the logic of the gap and the inferential procedure underlying the POS. They observe, correctly in my view, that there is nothing antithetical between GGers and a commitment to UG like principles and statistical learning procedures (a point Charles has made in his recent posts as well). The question is not whether stats are relevant, but how they are. Here’s what L&G says.

L&G notes that there are effectively two views of how stats can enter into acquisition discussions. The first view, what L&G dubs the “input driven view,” (IDV) conceives of acquisition as a process of “track[ing] patterns of co-occurrence in the environment and stor[ing] them in a summarized format” (3). Generalizing beyond actual experience is licensed by this “summary representation of experience” (4). IDVs tend to take this learning procedure to be domain general, rather than linguistically specific.

There is a second approach, the “knowledge-drive view” (KDV). For KDV, experience functions to “select” a knowledge representation from a pre-specified (aka: innate) domain of options.

There is a long tradition distinguishing “instructive” versus “selective” theories of “learning.” This has even played a role in studies of the immune system where several people got Nobels (Jerne, Edelman) for showing that earlier instructive theories of antibody formation were wrong and that antibody formation is actually selection against a given set of possible options.

Nowadays, given the rise of Bayes, selection theories of learning are all the rage. However, there is a sense in which any statistical theory must be selective for probabilities presuppose an antecedent demarcation of possibilities over which the probabilities are assigned. In other words, we need a given hypothesis space/algebra in order to assign, e.g. probability densities. No space, no probabilities.[1] The probable selects from the possible.

One of the nice features of the L&G discussion is that it notes that the instruction/selection question is orthogonal to the empiricism/rationalism issue. The relevant question for the latter is the nature of the relation between the input and what is “selected.” If selection requires matching inputs (i.e. if acquisition is essentially a bottom up affair in which the generalizations are simply distillations of regularities found in the input) then the relevant mechanism is Empiricist as, using traditional terminology, the generalization “resembles” the input. If one allows for a “distance” between the attained generalization and the input, then one is approaching the issue from a Rationalist perspective. Rationalists are necessarily selectionists, but Empiricists need not be instructionists. The deep question lies not on this dimension, but on how close the input must “resemble” the generalized output. As L&D explains, the important question is whether the statistically massaged PLD “shapes” or “triggers” the attained generalization.[2] For the interested, this is further discussed (here).

Here’s how L&G contrasts IDV and KDV:

1. For IDV the learning process goes from “specific to general” as the main operation is a process that “generalizes over specific cases” (4). For KDV, learners do not need to “arrive at abstract representations via a process of generalization” as these are built in. Rather input is sampled to “identify the realization of those abstract representations” in the PLD (5).

2. For IDV the end product of learning is “a recapitulation of the inputs to the learner, [t]he acquired representation [being] a compressed memory representation of the regularities found in the input” (5). For KDV, the attained representations (or many features thereof) are innately provided so the “representation may bear no obvious relation to the input that triggered it” (5). There need be, in other words, no similarity between the attained generalizations and the specific structure of the PLD. Or, another way of putting this (which L&G does) is that whereas for IDVs the input shapes the attained representation, for KDVs it merely triggers it.

Thus, though both IDVs and KDVs recognize that experience is critical in G attainment (i.e. to learn English you need English PLD), how this PLD functions is different in the two cases. Forming the Gish generalizations in the case of IDVs and triggering them in the case of KDVs. As such, both approaches require inferential mechanisms that take one from PLD to Gs.

Indeed, because KDVs cannot rely on “similarity” to bridge the gap from data to generalizations as IDVs do,[3] KDVs need learning theories that show how FL/UG supplies “predictions about what the learner should expect to find in the environment” to guide acquisition which, for KVDs consists in “compar[ing] those [FL/UG provided (NH)] predicted features against the perceptual intake…[which] drives an inference about the grammatical features responsible for the sentence under consideration” (10-11). L&G provide a nice little schema for this interactive process in their table 1.

Given these distinctions one can now begin to investigate to what degree language acquisition resembles IDVs vs KDVs. L&G provide several interesting case studies. Let me mention two.

The first is based on Gagliardi’s thesis work on noun class acquisition in Tsez. Here’s a surprise: there are lots of them and it’s complicated. L&G notes that classification (at least in part) depends on a noun’s semantic and phonological features. What L&G notes is that despite being sensitive to both kinds of info, LADs used them “out of proportion with their statistical reliability, even ignoring highly predictive semantic features” (24-5). L&G asks why. The discussion relies on distinguishing the input to the LAD from its intake; the former being the information available in the ambient linguistic environment, the latter being the PLD that the child actually uses.

This is an important distinction and it has been put to use elsewhere to great effect. One of my favorites is the explanation of how Old English, which was a good Germanic OV NL, changed to VO. David Lightfoot (here, here) asked a very interesting question about this transition: how could the change have happened? Here’s the problem: while the data in main clauses concerning the OV status of Old English (OE) was obscure and not particularly conclusive, the data that OE was OV was very clear in embedded clauses. He reasoned that if LADs had (robust) access to embedded clause PLD then there would have been no change from OV to VO in English as there was plenty of very reliable data showing that the language was OV in embedded clauses. Conclusion: the LAD did not have use embedded clause information. Given the reasonable conclusion that OE kids had the same FLs as anyone else, this means that in language acquisition kids do not “intake” embedded clause data (i.e. the PLD is degree 0+).

Lisa Pearl in her thesis work and later publications (here, here) addresses this question in terms analogous to L&G (viz. intake vs input of PLD to the OE kids). Pearl asks how much embedded data would have been enough to prevent the shift to VO order. She does this to pin down whether the LAD actively ignores embedded data (an intake restriction) or whether embedded data info is absent because complex clauses with embedding are just rare in the PLD (an input restriction). Lisa was able to “quantify” the question and the data she harvested from CHILDES (current English data not OE, of course) suggested that kids must actively ignore available data if we are to account for the English shift from OV to VO. Note, there could have been another answer to Lightfoot’s question: degree 1 sentences are too rare in the input for their information to be effectively usable, i.e. the restricted intake is due to restricted input. However, both Lisa and L&G show apparent cases where this answer is insufficient. It appears that the LAD might be deliberately myopic. And this raises another interesting question: WHY?

L&G discuss a second case that I would like to bring to your attention; acquisition of Double Object Constructions (DOC) in Kannada. The facts are once again subtle, though recognizably similar to what we find in languages like English and Spanish. At any rate, it is very complex, with binding allowed between pronouns in the accusative and dative DPs in some conditions but not in others. L&G notes that the data present a standard POS puzzle and show how it might be addressed. What was most interesting to me is the discussion of possible triggers and how exactly FL/UG might focus attention on some kinds of available information (spoiler alert: it has to do with animacy in possession constructions and their relation to DOCs), which would function as triggers. L&G notes that this kind of info is available with some statistical regularity if you are primed to look for it. In other words, the abstract POS considerations immediately generate a search for triggers that FL/UG might make salient and thus be in-takeable (i.e perceptually prominent data that would permit the triggering). This is a very nice illustration of the fecundity of POS considerations in considering how real time acquisition works. It also nicely illustrates how the gaps that POS arguments identify can be bridged via data that is in no way “similar” to the representations attained.

Let me end. L&G is a fun paper. And there’s lots of other good stuff (I really liked the discussion of Korean and how POS allows a kind of data free parameter setting). It gives a good overview of the intuitions behind the competing acquisition traditions, makes some nice distinctions concerning how to think of PLD in the wild, and provides a budget of nice novel illustrations of POS arguments. A colleague (Colin Phillips) keeps insisting that he is tired of the focus on Y/N questions in English as the “hard” case of POS reasoning. I am not nearly as tired as he is, but we agree on one thing: POS arguments are thick on the ground once one looks at any even slightly complex bit of grammatical competence. There is nothing special about Chomsky’s original illustration, except perhaps its extreme accessibility. L&G provides some good new grist for those interested in this kind of milling and shows how useful such reasoning is in attacking the question of how real kids (not just ideal speaker-hearer LADs) manage to acquire Gs.

[1] Perfors makes essentially this point (here).

[2] For the interested, this is further discussed (here).

[3] This is not to agree that ‘similarity’ is a serviceable notion. It’s not, as philosopher’s (e.g. Nelson Goodman) have repeatedly shown.

23 comments:

Alex ClarkJune 9, 2014 at 2:40 PM
I think it is a good idea to keep separate two distinctions:
A) the distinction between domain-general and language specific
B) the distinction between algorithms that go from general to specific and algorithms that go from specific to general.

Just as an easy example, there are clustering algorithms that go bottom up, clustering individual items and then progressively making larger clusters (e.g. hierarchical agglomerative clustering) and those that start with one giant cluster that contains everything and splitting it into two (e.g. spectral clustering).

By the way what's wrong with similarity? I have read Goodman but I have also read Vapnik.
ReplyDelete
Replies
UnknownJune 9, 2014 at 10:02 PM
I want to thank Norbert for drawing an important distinction:

"The object of study, at least if you buy into the Chomsky program in GG (which all right thinking linguists do), is not language but the faculty of language (FL). GGers treat FL realistically. It is part of the mind/brain, a chunk of psychology/biology"

Acknowledging the distinction between 'language' and the 'faculty of language' is an important step forward from decades of conflating this distinction. It is of course no new distinction, Jerry Katz insisted on it back in the 1980s [for some references see http://ling.auf.net/lingbuzz/001607 ]. Good to know that some leading GGers finally accept Katz' insight. Are those who continue to insist that language and FL are the same no 'right thinking linguists'?

Now one question arises in the current context: are PLD part of language or part of FL? If the former it would seem that the [direct] study of language allows GGers to draw conclusions about FL [which can only be studied indirectly].
ReplyDelete
Replies
UnknownJune 9, 2014 at 10:13 PM
Perhaps it will be useful to clarify the distinction between compressed memory representations and abstract representations. What counts as compression? Only clustering methods that use a euclidean distance metric over raw perceptual features (e.g. contours or formants)?

I'm also not sure how this distinction relates to the domain-general vs. language-specific distinction. Is the idea that compression algorithms are necessary domain-general?
ReplyDelete
Replies
M.June 15, 2014 at 8:14 AM
I must say that I really appreciate Chomsky's contribution. Lately I visited his speech in the Czech Republic, but it was mainly aimed at politics...
https://play.google.com/store/apps/details?id=cz.prialabs.dictionary&hl=cs
ReplyDelete
Replies

Add comment