Faculty of Language: The 3rd Hilbert Question: What can you learn from degree 0-ish data

Sunday, December 7, 2014

The 3rd Hilbert Question: What can you learn from degree 0-ish data

This Hilbert Question is the joint effort of Bill Idsardi and me (Norbert). It arose from several lunchtime conversations, a delicious source of Hilbert Problems I believe. Nothing suggests juicy digestible questions like conversation over edibles soon to be digested. Maybe it’s because cognitive science marches on its stomach or maybe because there really is a close connection between our frontal lobes and the gut brain. So that’s its source. Here’s the content.

****

There is quite a large literature on formal learning theory that is beyond the scope of many linguists (note, we said ‘many’ not ‘all’ or even ‘most’). Norbert counts himself among these, though Bill, is very definitely not a fellow traveller. However, even a monkey can write a sonnet given enough typewriters, paper and patience and in discussion Bill and Norbert have come up with the following kind of problem that can be approached both formally and empirically. Moreover, answers to this question would be of real interest to linguists like us (and we think there could well be a multiplicity of answers). We like to think of it as a formal version of the POS argument, a Gedankenexperiment exploration of POS space. It is a version of the problem that Wexler and Culicover investigated (here) and that seems to have since fallen out of fashion.[1] We would like to revive it, with a twist to make it more empirically relevant. Our question is what can be learned from degree 0+ data? That’s the question. Let’s unpack it.

To turn this question into a tractable one requires explaining what the parts mean; in particular: What does ‘degree 0+’ mean? What does ‘learn’ mean? What does ‘data’ mean? In other words, to make the question approachable we need to give serviceable “definitions” of these terms. Fortunately, the formal learning literature provides ways (yes, plural) for doing so. (Oh, btw, there is now a great paper by Jeff Heinz called “Computational theories of learning and developmental psycholinguistics” (here), which should really be called Computational Learning Theory for Dummies that Norbert recommends most highly especially to fellow members of the computationally challenged community.)

There are various assumptions (or rules of the game) that go into proofs in the formal learning world. The first is setting a criterion for what it is to have learned something. The something we care about (and the computationalists Heinz discusses care about) is learning a grammar G that generates the target language (we return to what we mean by ‘language’). There are many different criteria (as Heinz nicely reviews) concerning when a G is considered learned, but a big cut is between “identification” which is perfect (we get the relevant G with probability 1) and “identification” which is less than perfect (we get the G with some probability less than 1). There are various versions of this, but for our problem we don’t care which one you choose, so choose away (and let a hundred flowers bloom).

The second thing that needs doing is to specify a time/data limit within which we evaluate the criterion. So, how much time/data can pass until identification in some form is required for learning to have taken place? Most of the work Heinz reviews (which is most of the important work; again great paper!!) assumes that the criteria should be identification in the limit. That means that the more and more data we get the closer and closer we get to the identification criterion (i.e. as time/data → ∞). Now, as any parent will tell you (and most linguists assume) our cute little LADs and LASs (language acquisition devices/systems) attain competence not in the limit but by about 4 or 5 (and for many many features of competence much much earlier (this is what Jeff and Colin and their friends do; show how linguistically able younger and younger kids are).[2] In this time, LAD/Ss have been exposed to about two million tokens of child directed speech. So a better empirical estimate of the convergence criteria would be not “the limit” but after exposure to about two million tokens. Of course, this is just a rough estimate (and it doesn’t take into account ambient speech). But for purposes of our question we are ready to idealize away from this bound and are happy assume that the limit is where we will assess criteria. We again return to this idealization at the end when we discuss additional questions that relate to our primary one. And we recognize that mathematicians won't be much interested in proofs that converge on a result as x → 2,000,000; that's just not a general mathematical condition (whereas → 0 or → ∞ are).[3]

Third, there is the question of how the learning is organized. One big cut here is between supervised and non-supervised. By ‘supervised’ what is meant, more or less, is that the LAD/S has access to both positive and negative data. For our purposes, we want to restrict investigation to the non-supervised case. We have very strong reason to think that LAD/Ss have no direct access to negative data (indirect negative data is another thing, and semi-supervised learning is also a thing). This is a standard assumption within GG and for our problem to be of interest to GGers (which is the point of the Hilbert Problem enterprise that Hubert and Henk initiated, thx again) we restrict our attention to the unsupervised case. But we won't find much fault with someone who discovers how much a system can learn from degree 0+ data in a supervised or semi-supervised setting (though we will be pretty unsure how relevant such proofs are for the case of human language acquisition as there is pretty good reason to think that there is no negative evidence directly available to the LAD/S).

Fourth, and here the rubber really hits the road, comes the data presentation. In the formal literature, the data is sometimes called the “text” (sounds very POMO doesn’t it?). At any rate, we need to characterize the text. We have already said that it contains only positive data. We now want to add a big restriction: the positive data are all of degree 0+. What’s this mean?

The term is David Lightfoot’s (see here). He argued that to understand how languages change (in particular how English moved from OV to VO) it is necessary to assume that LAD/Ss do not have reliable access to embedded clause information.[4] If the matrix is degree 0 then embedded clauses are degree 1, 2…N depending on how deeply they are embedded. The ‘+’ in degree 0+ come from allowing a bit more than just main clause data in. It also allows information about complementizers and non-finite subjects.[5] So the only linguistic items in the text presented to the learner are of degree 0+; no looking at embedded clauses in toto.

This definition can be made more or less stringent: so for example, we can also insist that there be no complex DPs (e.g. ‘Man from Texas’) in the text (or, more accurately, that the LAD/Ss intake ignores the embedded portion of complex DPs). But we won’t do this here. Our proposed degree 0+ cut is made at clauses (so, ‘man from texas’ is in but ‘man who is from Texas’ is not). For the nonce, let’s assume that the text contains linguistic objects with no more complexity than main clauses and the little extra ‘+’ noted above.

Why is this a useful assumption? Here are a few reasons: (i) There is some interesting data that supports this, like Lightfoot’s on historical change. We recommend taking a look, for failure to veridically transmit G from one generation to the next is particularly instructive about the LAD/S. (ii) There are theory internal generalizations in line with this, e.g. Ross’s Penthouse Principle, which notes that embedded clauses are far more “regular” than matrix ones (things can occur in the latter that are unattested in the former; the rules are different if you live in the penthouse), or Emonds's class of root transformations. (iii) we both happen to think that this is a pretty good first pass description of the kind of data that is robustly available to the LAD/S (or taken in by the LAD/S). There are surprisingly few embedded clauses addressed to kids (some, but not tons). So it may be false that all learning is based on degree 0+ input, but it is probably not wildly false and, anyhow, we are happy enough to relax this assumption. In fact, if you don’t like degree 0+ and prefer degree 1 as a specification of the input, go for it! Or forget the + and go with pure degree 0. That would all be fine with us. But no more than 1 please (sorry, Ken and Peter). Why? Because so far as we know, there is precious little that happens in clauses of N>1 degrees of embedding that doesn’t happen at degree 1. In other words, though there is a matrix-embedded clause cut among grammatical phenomena, there is to our knowledge no degree N>1 vs degree 1 cut in the data (the observant will note that this is just a way of re-adverting to Ross and Emonds). For concreteness then, let’s stick with degree 0+.

Can we make degree 0+ precise enough to fit into a proof? Sure. How do we know? Because we can define it as follows: assume that texts are not strings but “trees” (say from your favorite tree bank or personally annotated) with marked out constituents. We can define 0+ as information from non-embedded clauses up to and including the first embedded clause marker (i.e. a C or embedded T). So in (1) this would be the highlighted parts:

(1) a. [[The man from Texas] [believes [that Mary ate a bagel]]]

b. [[The man from Texas] [past] [kiss a woman from Baltimore]]

c. [John [expects [Mary to leave the party early]]]

c. I love you (we leave the bracketing here as an exercise)

We are sure that one could find string proxies for this specification of the input to simplify matters if you have no access to a syntactic arboretum, but we are happy enough for the text to contain structured objects in addition to strings. We assume that this will make the learning problem easier (see Jim Morgan's book, From Simple Input to Complex Grammar, for example) and we are very nice people always wanting to help. And even with this information available, we believe the problem will be hard enough.

We could also assume that the above phrase markers come annotated with some thematic information. So let’s assume this. We know that ‘the man from Texas’ is the external argument of ‘believes’ in (1), for example. Note that these assumptions basically follow the suggestions made in Wexler and Culicover concerning the PLD. They assume something like DS/SS pairs were available to the learner. We believe in a modern rendition of this idea too. The main difference between our question and their investigations regards the cutoff point. They dug three clauses deep, we go one and a bit.

So that’s the text. We want to know how much of what we think of as, say, G_English can be projected from this input. And (once again with a hat tip to Wexler and Culicover) what could be added to the learner to bridge whatever gaps (and we believe there will be many) appear. We suspect that many of the UG principles that have been proposed over the years will be very fruitful in getting from degree 0+ texts of a G_L to the G_L that generates the degree 0+ text (along with the rest of L, of course). But we don’t care if we are wrong. We would like to know how far one can go with just this kind of info and the learning problem set up in this way.

There are many projects that follow from this one. For example, another formal project would be to see how extending texts to degree 1 would change what we could learn. What are the formal limits of the learnable given degree 1 rather than degree 0+ data? Another formal project would be to compare various learners with various “innate” additions to see how much they could learn with degree 0+/1 data. For example, does packing PISH, the ECP or Subjacency into the learner facilitate learning whatever G you are interested in?[6] Another question: What kind of degree 0+/1 data is critical for learning what features of the G of interest? Note that we can begin to restrict degree 0+/1 data as well and see what this does. After all, it need not be the case that every kind of degree 0+/1 data is actually available in the PLD (Jeff’s data in (2) here might be of degree 0+ but he has excellent evidence that it is unavailable in the PLD and so should not be available in our learning texts if we want to formally model his problem). Another project would be to see if allowing convergence in the limit rather than after some finite amount of data made any difference (it will certainly help with the proofs). Maybe most learners that got to the right G did so after, say, 1 million lines of “representative” text and having more doesn’t change matters much. That would be interesting to know.[7]

There are also a variety of empirical studies that would usefully follow from this question: for example, how varied and robust is the degree 0+/1 data (how many different kinds of forms are present, how often?)? How good an idealization is degree 0+/1? Is there typological variation only attested in Degree 1/>1 contexts? A catalogue of these would be very useful. Can we design and run Artificial Grammar Learning experiments using degree 0+ data? (Maybe all we gotta do is some Hugh Laurie style mumbling https://www.youtube.com/watch?v=Q8chs2ncYIw to obscure the embedded content beyond degree 0+ – the non-underlined stuff in (1).)

Last point and we wrap up: we have illustrated the logic here using syntactic examples (that’s where degree 0+/1 assumptions are likely to have real bite, we believe), but the line of attack we outlined can be used for exploring learning in any domain. Of course many already know this, but we want to make this clear for an audience that may not be familiar with this sort of work. Change the “text” and the same approach can be applied to study phonological and morphological learning (see Heinz and collaborators again) or even DNA sequence learning in bioinformatics (where we expect however that the appropriate priors are somewhat different from those for human language). For phonology, Hammond (1991, Parameters of metrical theory and learnability) has suggested that having data on words up to seven syllables long suffices to learn any stress system (at least given extant theories of metrical stress). But not all languages are that profligate with long words. What can we learn if we only hear (or attend to) relatively short words? How much of English phonotactics can we learn if we only have words of 1 or 2 syllables? (And 1 and 2 syllable words comprise 95% of the Switchboard corpus tokens; we wouldn't be surprised if around 5% of natural speech data are simply speech errors and disfluencies, so learners might be hard-pressed to sort the trisyllabic wheat from the extra-grammatical chaff.) Put another way, if pyrrhic sequences were absent in the PLD could we predict the syllable contact phenomena within them anyway?

Everyone now seems on board with the idea that all learners and learning systems must extrapolate beyond the data (see the comments from Chris Dyer and Alex Clark on Jeff's PISH post), so we're proposing exploring POS models constructively, to learn with one hand tied behind our back, to explore the limits of extrapolation.

So that’s our question. It’s really a plea to return to an old method of investigation using more realistic empirical (data) assumptions. An answer will help sharpen our POS arguments (as that’s really all that this is, the abstract specifications for a POS using computational learning vocab) and allow us to compare different suggestions concerning the structure of FL/UG across different theories (i.e. it also provides a platform that very different conceptions of UG can use to compare proposals). Our formal friends never tire of telling us how much they value the capacity to compare approaches across different conceptions of what a G/UG is, so this should appeal to them. Both of us would love to know the answer, so when you figure it out let us know. We are confident that Norbert will even dedicate some posts to it.

[1] Wexler and Culicover showed what was required of a learner to learn from degree 2 data. An interesting question is why this sort of work fell out of fashion? Here’s a suspicion: it’s of little interest to the CS community for the devices investigated will have little technological value. They used computational methods to study a problem of specific linguistic interest. The standard theorems don’t worry about the specific boundary conditions of the linguistic problem, so such problems have too many specificities to be of general interest. Thus, answering our question may not get you CS kudos, though linguists like us will be forever grateful.

[2] Jeff’s estimate is that modulo “pragmatics” LADs identify their Gs at roughly 4. This is somewhat far from “the limit,” which is more akin to Keynes’ “long run” (the place, to refresh your memories, where we are all dead!).

[3] We should note for the record, however, that being mathematically interesting is not the same as being linguistically interesting (and as linguists we care far more about the second than the first). Nor does it mean that proofs that show what happens as data goes to infinity cannot shed light on what happens considerably before this limit is reached. After all functions can converge at different rates and one that converges very quickly toward asymptote would tell us quite a lot about how much data would be required to be convergish.

[4] Lisa Pearl further developed this work (here and here).

[5] These more or less coincide with binding domains (and control of non-finite subjects) in GB and Lightfoot described degree 0+ as being unembedded binding domains.

[6] This is what Wexler and Culicover did so well.

[7] We could also study the creolization from pidgins using this. This would be particularly interesting as we suspect that the degree 0+ assumption aligns well with pidgin input. And there is the added complication that the relevant texts will be generated from two different Gs. Interestingly, however, it seems that this results in a usable induced G nonetheless, i.e. allows for the emergence of a Creole (i.e. just another G).

21 comments:

Olaf K.December 9, 2014 at 3:41 AM
I was wondering how much of the degree 0+ thinking relies on Lightfoot's analysis of the change of English word order, because I have never been convinced by that analysis. If English changed from OV to VO because children have no access to embedded clauses, then (i) why did the change not occur earlier (ii) why did Dutch not go through the same change and (iii) why are Dutch children geniuses in figuring out the OV base order, which is in fact the order they start out with? As far as I can tell, Old English was similar to Modern (or Middle) Dutch in all relevant aspects.
ReplyDelete
Replies
Olaf K.December 10, 2014 at 5:42 AM
What he argues is that the OV signposts are ambiguous in Old English, as this language allowed direct objects and particles from particle verbs to appear after the verb in its base position, in contrast to Dutch. But the point is that these signposts were just as ambiguous in Middle Dutch, facts that, as far as I recall, Lightfoot does not discuss.

I also find Degree 0+ conceptually implausible. It's like saying that the complementizer is a signal for children to close their ears.
ReplyDelete
Replies
Alex ClarkDecember 10, 2014 at 5:50 AM
"supervision" is a slightly tricky word, as it can mean either the existence of positive and negative data, or it can mean that the input consists of labelled trees.
So inducing a grammar from a treebank would count as "supervised learning" in the second sense, and unsupervised in the first, But most people call it supervised.

There is no easy answer to the question as it depends on the algorithm: some algorithms do indeed only need degree-0 data and some need degree-1.
So to take an easy example: how does the child learn that you can say
"that is a very^* good question" (where the * means zero or more repetitions.
Some learning algorithms will generalise quite readily from seeing
"that is a good question" and "that is a very good question."
Other algorithms don't.
So you can't say "what can we induce from this data?", you can only say "what does this learning algorithm learn from this data?".
ReplyDelete
Replies
AveryAndrewsDecember 10, 2014 at 1:51 PM
It's maybe worth pointing out that overt forms with at least some semantic annotation is weaker than labelled trees, because the labels give away the syntactic category information, whose allegedly arbitrary nature is an issue for people like Bill Croft and Martin Haspelmath. So a learner that can extract the usual N, V, Det etc. from data without these labels (but perhaps semantic ones such as OBJECT, ACTION, STATE, ...) would constitute a sensible and nondogmatic answer to Bill and Martin's complaint (and I don't that the current state of syntactic theory actually provides one).
ReplyDelete
Replies
Jeff HeinzDecember 20, 2014 at 5:51 PM
I agree with Colin that work in formal language theory was not particularly influential in linguistic theory; it was a convenient result supporting an existing mindset. It's too bad because it is an example of research in computational learning theory that directly relates to important questions in theoretical linguistics. After discussing Gold's result, I'll try to give another more recent example of research in computational theory here that I think is relevant to this particular Hilbert question, which theoretical linguists ought to be aware of. The work there doesn't answer this question yet, but it is a step in the right direction.
ReplyDelete
Replies

Add comment

Faculty of Language

Comments

Sunday, December 7, 2014

The 3rd Hilbert Question: What can you learn from degree 0-ish data

21 comments:

Contributors