Sunday, December 7, 2014

The 3rd Hilbert Question: What can you learn from degree 0-ish data

This Hilbert Question is the joint effort of Bill Idsardi and me (Norbert).  It arose from several lunchtime conversations, a delicious source of Hilbert Problems I believe. Nothing suggests juicy digestible questions like conversation over edibles soon to be digested. Maybe it’s because cognitive science marches on its stomach or maybe because there really is a close connection between our frontal lobes and the gut brain.  So that’s its source. Here’s the content.


There is quite a large literature on formal learning theory that is beyond the scope of many linguists (note, we said ‘many’ not ‘all’ or even ‘most’). Norbert counts himself among these, though Bill, is very definitely not a fellow traveller. However, even a monkey can write a sonnet given enough typewriters, paper and patience and in discussion Bill and Norbert have come up with the following kind of problem that can be approached both formally and empirically. Moreover, answers to this question would be of real interest to linguists like us (and we think there could well be a multiplicity of answers). We like to think of it as a formal version of the POS argument, a Gedankenexperiment exploration of POS space. It is a version of the problem that Wexler and Culicover investigated (here) and that seems to have since fallen out of fashion.[1] We would like to revive it, with a twist to make it more empirically relevant. Our question is what can be learned from degree 0+ data? That’s the question. Let’s unpack it.

To turn this question into a tractable one requires explaining what the parts mean; in particular: What does ‘degree 0+’ mean? What does ‘learn’ mean? What does ‘data’ mean? In other words, to make the question approachable we need to give serviceable “definitions” of these terms. Fortunately, the formal learning literature provides ways (yes, plural) for doing so. (Oh, btw, there is now a great paper by Jeff Heinz called “Computational theories of learning and developmental psycholinguistics” (here), which should really be called Computational Learning Theory for Dummies that Norbert recommends most highly especially to fellow members of the computationally challenged community.) 

There are various assumptions (or rules of the game) that go into proofs in the formal learning world. The first is setting a criterion for what it is to have learned something. The something we care about (and the computationalists Heinz discusses care about) is learning a grammar G that generates the target language (we return to what we mean by ‘language’). There are many different criteria (as Heinz nicely reviews) concerning when a G is considered learned, but a big cut is between “identification” which is perfect (we get the relevant G with probability 1) and “identification” which is less than perfect (we get the G with some probability less than 1). There are various versions of this, but for our problem we don’t care which one you choose, so choose away (and let a hundred flowers bloom).

The second thing that needs doing is to specify a time/data limit within which we evaluate the criterion. So, how much time/data can pass until identification in some form is required for learning to have taken place? Most of the work Heinz reviews (which is most of the important work; again great paper!!) assumes that the criteria should be identification in the limit. That means that the more and more data we get the closer and closer we get to the identification criterion (i.e. as time/data ). Now, as any parent will tell you (and most linguists assume) our cute little LADs and LASs (language acquisition devices/systems) attain competence not in the limit but by about 4 or 5 (and for many many features of competence much much earlier (this is what Jeff and Colin and their friends do; show how linguistically able younger and younger kids are).[2] In this time, LAD/Ss have been exposed to about two million tokens of child directed speech. So a better empirical estimate of the convergence criteria would be not “the limit” but after exposure to about two million tokens.  Of course, this is just a rough estimate (and it doesn’t take into account ambient speech). But for purposes of our question we are ready to idealize away from this bound and are happy assume that the limit is where we will assess criteria. We again return to this idealization at the end when we discuss additional questions that relate to our primary one. And we recognize that mathematicians won't be much interested in proofs that converge on a result as x 2,000,000; that's just not a general mathematical condition (whereas → 0 or ∞ are).[3]

Third, there is the question of how the learning is organized. One big cut here is between supervised and non-supervised.  By ‘supervised’ what is meant, more or less, is that the LAD/S has access to both positive and negative data. For our purposes, we want to restrict investigation to the non-supervised case. We have very strong reason to think that LAD/Ss have no direct access to negative data (indirect negative data is another thing, and semi-supervised learning is also a thing). This is a standard assumption within GG and for our problem to be of interest to GGers (which is the point of the Hilbert Problem enterprise that Hubert and Henk initiated, thx again) we restrict our attention to the unsupervised case. But we won't find much fault with someone who discovers how much a system can learn from degree 0+ data in a supervised or semi-supervised setting (though we will be pretty unsure how relevant such proofs are for the case of human language acquisition as there is pretty good reason to think that there is no negative evidence directly available to the LAD/S).

Fourth, and here the rubber really hits the road, comes the data presentation. In the formal literature, the data is sometimes called the “text” (sounds very POMO doesn’t it?). At any rate, we need to characterize the text. We have already said that it contains only positive data. We now want to add a big restriction: the positive data are all of degree 0+.  What’s this mean?

The term is David Lightfoot’s (see here). He argued that to understand how languages change (in particular how English moved from OV to VO) it is necessary to assume that LAD/Ss do not have reliable access to embedded clause information.[4] If the matrix is degree 0 then embedded clauses are degree 1, 2…N depending on how deeply they are embedded.  The ‘+’ in degree 0+ come from allowing a bit more than just main clause data in. It also allows information about complementizers and non-finite subjects.[5]  So the only linguistic items in the text presented to the learner are of degree 0+; no looking at embedded clauses in toto.

This definition can be made more or less stringent: so for example, we can also insist that there be no complex DPs (e.g. ‘Man from Texas’) in the text (or, more accurately, that the LAD/Ss intake ignores the embedded portion of complex DPs). But we won’t do this here. Our proposed degree 0+ cut is made at clauses (so, ‘man from texas’ is in but ‘man who is from Texas’ is not). For the nonce, let’s assume that the text contains linguistic objects with no more complexity than main clauses and the little extra ‘+’ noted above. 

Why is this a useful assumption?  Here are a few reasons: (i) There is some interesting data that supports this, like Lightfoot’s on historical change. We recommend taking a look, for failure to veridically transmit G from one generation to the next is particularly instructive about the LAD/S. (ii) There are theory internal generalizations in line with this, e.g. Ross’s Penthouse Principle, which notes that embedded clauses are far more “regular” than matrix ones (things can occur in the latter that are unattested in the former; the rules are different if you live in the penthouse), or Emonds's class of root transformations. (iii) we both happen to think that this is a pretty good first pass description of the kind of data that is robustly available to the LAD/S (or taken in by the LAD/S). There are surprisingly few embedded clauses addressed to kids (some, but not tons). So it may be false that all learning is based on degree 0+ input, but it is probably not wildly false and, anyhow, we are happy enough to relax this assumption. In fact, if you don’t like degree 0+ and prefer degree 1 as a specification of the input, go for it! Or forget the + and go with pure degree 0. That would all be fine with us. But no more than 1 please (sorry, Ken and Peter). Why? Because so far as we know, there is precious little that happens in clauses of N>1 degrees of embedding that doesn’t happen at degree 1. In other words, though there is a matrix-embedded clause cut among grammatical phenomena, there is to our knowledge no degree N>1 vs degree 1 cut in the data (the observant will note that this is just a way of re-adverting to Ross and Emonds).  For concreteness then, let’s stick with degree 0+. 

Can we make degree 0+ precise enough to fit into a proof? Sure. How do we know? Because we can define it as follows: assume that texts are not strings but “trees” (say from your favorite tree bank or personally annotated) with marked out constituents. We can define 0+ as information from non-embedded clauses up to and including the first embedded clause marker (i.e. a C or embedded T). So in (1) this would be the highlighted parts:

(1)  a. [[The man from Texas] [believes [that Mary ate a bagel]]]
b. [[The man from Texas] [past] [kiss a woman from Baltimore]]
c. [John [expects [Mary to leave the party early]]]
c. I love you (we leave the bracketing here as an exercise)

We are sure that one could find string proxies for this specification of the input to simplify matters if you have no access to a syntactic arboretum, but we are happy enough for the text to contain structured objects in addition to strings. We assume that this will make the learning problem easier (see Jim Morgan's book, From Simple Input to Complex Grammar, for example) and we are very nice people always wanting to help. And even with this information available, we believe the problem will be hard enough.

We could also assume that the above phrase markers come annotated with some thematic information. So let’s assume this. We know that ‘the man from Texas’ is the external argument of ‘believes’ in (1), for example. Note that these assumptions basically follow the suggestions made in Wexler and Culicover concerning the PLD. They assume something like DS/SS pairs were available to the learner. We believe in a modern rendition of this idea too. The main difference between our question and their investigations regards the cutoff point. They dug three clauses deep, we go one and a bit.

So that’s the text. We want to know how much of what we think of as, say, GEnglish can be projected from this input.  And (once again with a hat tip to Wexler and Culicover) what could be added to the learner to bridge whatever gaps (and we believe there will be many) appear.  We suspect that many of the UG principles that have been proposed over the years will be very fruitful in getting from degree 0+ texts of a GL to the GL that generates the degree 0+ text (along with the rest of L, of course). But we don’t care if we are wrong. We would like to know how far one can go with just this kind of info and the learning problem set up in this way.

There are many projects that follow from this one. For example, another formal project would be to see how extending texts to degree 1 would change what we could learn. What are the formal limits of the learnable given degree 1 rather than degree 0+ data? Another formal project would be to compare various learners with various “innate” additions to see how much they could learn with degree 0+/1 data.  For example, does packing PISH, the ECP or Subjacency into the learner facilitate learning whatever G you are interested in?[6] Another question: What kind of degree 0+/1 data is critical for learning what features of the G of interest? Note that we can begin to restrict degree 0+/1 data as well and see what this does.  After all, it need not be the case that every kind of degree 0+/1 data is actually available in the PLD (Jeff’s data in (2) here might be of degree 0+ but he has excellent evidence that it is unavailable in the PLD and so should not be available in our learning texts if we want to formally model his problem). Another project would be to see if allowing convergence in the limit rather than after some finite amount of data made any difference (it will certainly help with the proofs). Maybe most learners that got to the right G did so after, say, 1 million lines of “representative” text and having more doesn’t change matters much. That would be interesting to know.[7]

There are also a variety of empirical studies that would usefully follow from this question: for example, how varied and robust is the degree 0+/1 data (how many different kinds of forms are present, how often?)? How good an idealization is degree 0+/1? Is there typological variation only attested in Degree 1/>1 contexts?  A catalogue of these would be very useful. Can we design and run Artificial Grammar Learning experiments using degree 0+ data? (Maybe all we gotta do is some Hugh Laurie style mumbling to obscure the embedded content beyond degree 0+ – the non-underlined stuff in (1).)

Last point and we wrap up: we have illustrated the logic here using syntactic examples (that’s where degree 0+/1 assumptions are likely to have real bite, we believe), but the line of attack we outlined can be used for exploring learning in any domain. Of course many already know this, but we want to make this clear for an audience that may not be familiar with this sort of work. Change the “text” and the same approach can be applied to study phonological and morphological learning (see Heinz and collaborators again) or even DNA sequence learning in bioinformatics (where we expect however that the appropriate priors are somewhat different from those for human language). For phonology, Hammond (1991, Parameters of metrical theory and learnability) has suggested that having data on words up to seven syllables long suffices to learn any stress system (at least given extant theories of metrical stress).  But not all languages are that profligate with long words. What can we learn if we only hear (or attend to) relatively short words? How much of English phonotactics can we learn if we only have words of 1 or 2 syllables? (And 1 and 2 syllable words comprise 95% of the Switchboard corpus tokens; we wouldn't be surprised if around 5% of natural speech data are simply speech errors and disfluencies, so learners might be hard-pressed to sort the trisyllabic wheat from the extra-grammatical chaff.) Put another way, if pyrrhic sequences were absent in the PLD could we predict the syllable contact phenomena within them anyway?
Everyone now seems on board with the idea that all learners and learning systems must extrapolate beyond the data (see the comments from Chris Dyer and Alex Clark on Jeff's PISH post), so we're proposing exploring POS models constructively, to learn with one hand tied behind our back, to explore the limits of extrapolation.

So that’s our question. It’s really a plea to return to an old method of investigation using more realistic empirical (data) assumptions. An answer will help sharpen our POS arguments (as that’s really all that this is, the abstract specifications for a POS using computational learning vocab) and allow us to compare different suggestions concerning the structure of FL/UG across different theories (i.e. it also provides a platform that very different conceptions of UG can use to compare proposals). Our formal friends never tire of telling us how much they value the capacity to compare approaches across different conceptions of what a G/UG is, so this should appeal to them. Both of us would love to know the answer, so when you figure it out let us know. We are confident that Norbert will even dedicate some posts to it.

[1] Wexler and Culicover showed what was required of a learner to learn from degree 2 data. An interesting question is why this sort of work fell out of fashion? Here’s a suspicion: it’s of little interest to the CS community for the devices investigated will have little technological value. They used computational methods to study a problem of specific linguistic interest. The standard theorems don’t worry about the specific boundary conditions of the linguistic problem, so such problems have too many specificities to be of general interest. Thus, answering our question may not get you CS kudos, though linguists like us will be forever grateful.
[2] Jeff’s estimate is that modulo “pragmatics” LADs identify their Gs at roughly 4. This is somewhat far from “the limit,” which is more akin to Keynes’ “long run” (the place, to refresh your memories, where we are all dead!).
[3] We should note for the record, however, that being mathematically interesting is not the same as being linguistically interesting (and as linguists we care far more about the second than the first). Nor does it mean that proofs that show what happens as data goes to infinity cannot shed light on what happens considerably before this limit is reached. After all functions can converge at different rates and one that converges very quickly toward asymptote would tell us quite a lot about how much data would be required to be convergish.
[4]  Lisa Pearl further developed this work (here and here).
[5] These more or less coincide with binding domains (and control of non-finite subjects) in GB and Lightfoot described degree 0+ as being unembedded binding domains.
[6]  This is what Wexler and Culicover did so well.
[7]  We could also study the creolization from pidgins using this. This would be particularly interesting as we suspect that the degree 0+ assumption aligns well with pidgin input. And there is the added complication that the relevant texts will be generated from two different Gs. Interestingly, however, it seems that this results in a usable induced G nonetheless, i.e. allows for the emergence of a Creole (i.e. just another G).


  1. I was wondering how much of the degree 0+ thinking relies on Lightfoot's analysis of the change of English word order, because I have never been convinced by that analysis. If English changed from OV to VO because children have no access to embedded clauses, then (i) why did the change not occur earlier (ii) why did Dutch not go through the same change and (iii) why are Dutch children geniuses in figuring out the OV base order, which is in fact the order they start out with? As far as I can tell, Old English was similar to Modern (or Middle) Dutch in all relevant aspects.

    1. It's one of the reasons I find it plausible. Other reasons include the relative scarcity of degree 1 data in child directed speech (Lisa Pearl goes over this in her papers). We list some of the other reasons in the post, some theory internal.

      That said, I do find the Lightfoot argument compelling for two reasons. First he does worry about the Dutch case and notes that the change in English stems from two facts: first the landmarks for verb movement in main clauses get erased in the English case but not the Dutch one. One example is the capacity to strand parts of the verb under movement in the latter but not the former (e.g. separable particles). OE was like Dutch in this but the surface evidence for this begins to fade for various reasons that Lightfoot discusses in chapter 3 of "How to set Parameters." So, for various reasons, he argues, the degree 0 data pointing to verb movement begin to disappear (he has other diagnostics as well). However, the second important assumption is that one has relatively little access to degree 1 data. Why? Because here the evidence fort OV order is robust and unambiguous. So, were the LAD to have access to this data it would "know" that the underlying structure is OV even were the degree 0 data weak. That's the argument. I am no expert in the empirical facts, but the logic is sound. Importantly, the story involves 2 parts, only one of which involves degree 0 assumptions.

      Second, say that the degree 0+ assumption is wrong. Well it would be still interesting to know what going from degree 0 to degree 0+ to degree 1 brings to the learnability table. In other words, the problem is analytically interesting. I am pretty confident that even the degree 1 case will prove challenging.

  2. What he argues is that the OV signposts are ambiguous in Old English, as this language allowed direct objects and particles from particle verbs to appear after the verb in its base position, in contrast to Dutch. But the point is that these signposts were just as ambiguous in Middle Dutch, facts that, as far as I recall, Lightfoot does not discuss.

    I also find Degree 0+ conceptually implausible. It's like saying that the complementizer is a signal for children to close their ears.

    1. I don't share your judgment about conceptual implausibility, but it is of little moment, IMO. I personally would find it interesting to ask what one could acquire using Degree0, 0+, 1. These would be interesting. I am pretty sure that degree 0 is too little to do anything approximating what we do. 0+ strikes ME (but this is a judgment) as on the right track. I think that Degree 1 is possible, but the kind of facts that Lightfoot points to would then be a problem unless Degree 1 data was not that robust so though available in principle, there wasn't enough of it. But Lightfoot's is a real problem. Here's why:

      Say that ME were OV. We would argue that this is not surprising given that OE was OV and so kids just acquired their parental Gs and preserved OV ness. But it's not OV it's VO. Why? Well, maybe because main clauses looked VO despite being OV. Here the issue might well be one of quantity: how VO did it look? Did Old Dutch look as VO as OE did? What's the obscurity metric? There is room for empirical wiggle here. Note, I am not saying you are wrong, I am noting that there is room to move. WHere there is no room to wiggle is IF one has good access to embedded clause data. Why? Well then we have strong unambiguous data that the underlying structure is OV. After all the order is UNIFORMLY OV in embedded classes. So were we to have unfettered access to this the change would be mysterious. This is why Lightfoot's points are well taken. The fact that change occurred is more problematic than that it did not. That's my view. But, as I said, this is just one of the reasons I find 0+ reasonable, though I am not wedded to this.

    2. I agree, Norbert, there is room to wiggle. It could be that numerically sentence types were distributed differently in OE and Middle Dutch.

      >>"WHere there is no room to wiggle is IF one has good access to embedded clause data. Why? Well then we have strong unambiguous data that the underlying structure is OV."

      I wonder if this is true. If we want to account for the fact that the Dutch degree 0+ learner gets to the OV order, Lightfoot must allow for SOME embedded clauses to be accessible to such a learner, e.g. non-finite constructions of the type "John must the radio up-turn". If not, OV is out of reach and the child would not be able to acquire the fact that the particle is a signpost to begin with. If non-finite clauses are Degree 0+-accessible, they must have been in OE. The problem there is that objects and particles could follow the non-finite verb. But then the fact that FINITE embedded clauses are inaccessible for the Degree 0+ learner is simply irrelevant for the analysis of the change to VO. It plays no role. Alternatively, we say that non-finite clauses are also inaccessible, but then a problem arises for Dutch.

    3. So here is the problem: If kids have robust access to degree 1 data there is no explanation for how English moved from OV to VO. Why? Because embedded clauses are UNIFORMLY OV and so the kid should have known that the language is OV underlyingly. Let me repeat, the evidence is unambiguous in embedded clauses so there had better not be too much of it or the kid will know that the underlying grammar is OV. So, can't have robust degree 1 data.

      What then about Dutch? Why did it NOT change? Well it cannot be that it had robust degree 1 data for this is inconsistent with the English change. However, we agree that there is one place where there was a difference: the degree of obscurity regarding OV landmarks in matrix (Degree 0) clauses is greater in English than in Dutch. WERE this true then the difference between the two languages could be explained without recourse to degree 1 data. And as recourse to degree 1 data is a non-starter, then something like this MUST be true. Sadly, it looks like it is hard to pin down how English was more obscurely OV than Dutch was as it seems that many of the same things Lightfoot points to in English occurred in Dutch. I agree that is a problem; but it is an empirical one, not a conceptual one. The conceptual point Lightfoot made stands: if there is too much reliable degree 1 data the child would never have changed. As noted, Lisa Pearl even had a number she was able to put on how much is too much. So, I think we should hole onto this fact and look at the quality and nature of degree 0 data in the two languages. Were I a betting man, this is where I would put my money. But as this is not may area, it would be a bet that I would not work to win.

    4. >>However, we agree that there is one place where there was a difference: the degree of obscurity regarding OV landmarks in matrix (Degree 0) clauses is greater in English than in Dutch.

      I am not sure because that depends on what you count as a matrix clause. If a matrix clause is a clause with only one finite verb (i.e. any non-finite verb sits in an embedded clause), then neither Old English, nor Middle Dutch , nor Modern Dutch matrix clauses provide any evidence for a base OV order to the Degree 0+ learner. The fact that particles can appear after the object ("John turns the radio up") does nothing to reveal the base position of the verb if you have not first acquired that this particle is part of the verb to begin with. This evidence for this comes from embedded clauses, which are unavailable. So if the OV order is to be learnable at all, something has to give in Lightfoot's account.

      The most modest adjustment would be to say that non-finite clauses (e.g. John has the radio up-turned) don't count as embedded clauses whereas finite embedded clauses ("I think that John the radio up-turns") do. But then Lightfoot's proposal empirically requires that particles and objects in OE were able to follow verbs in non-finite clauses ("John has the radio turned up"/"John has up-turned the radio"), but not in embedded finite clauses (i.e. embedded finite clauses are strictly verb final). That is not how I remember the facts, but I happily stand corrected. I have to reread that chapter, Norbert, which I will do. If the data are not like this, then Lightfoot has a conceptual problem as well.

    5. If you are correct (which I have no doubt you are) then Lightfoot's argument needs supplementing. I agree wholeheartedly. What I don't think we should do, however, is supplement this by allowing wide ranging access to embedded clause information for the reasons Lightfoot reviewed. Robust availability of Degree 1 data will make the change impossible. So, don't allow it.

      One more point: it is a necessary condition for the change to occur that degree 1 data not be for the taking. It is NOT sufficient. This leaves room to play ion the degree 0 domain. Again, this is where I suggest we look to address your very reasonable points. The only place I would disagree with what you say above is that Lightfoot has a "conceptual" problem. I would say that he has an "empirical" one, not quite the same thing. That does not make the problem less interesting, but it makes it different. So to repeat: IMO, Lightfoot's case provides very nice evidence that one cannot have access to (much) Degree 1 data. Lisa Pearl's work substantiates this. So this supports something like a Degree 0+ hypothesis. This does not answer every interesting question, however. Many problems and puzzles remain. Nor does it imply that Lightfoot's conjecture is correct, only that there is interesting evidence in its favor. For my money, I would be happy to frame the question in terms of degree 1 data as well. My favorite scenario would be to investigate Degree 0, 0+ and 1 and compare them. What kinds of theories are able to acquire the kinds of Gs we find based on the three different kinds of "texts"?

    6. Ah yes I agree, then the problem would be empirical: he would have no story for the OV -> VO change but the Degree 0+ hypothesis might still be correct..

  3. "supervision" is a slightly tricky word, as it can mean either the existence of positive and negative data, or it can mean that the input consists of labelled trees.
    So inducing a grammar from a treebank would count as "supervised learning" in the second sense, and unsupervised in the first, But most people call it supervised.

    There is no easy answer to the question as it depends on the algorithm: some algorithms do indeed only need degree-0 data and some need degree-1.
    So to take an easy example: how does the child learn that you can say
    "that is a very^* good question" (where the * means zero or more repetitions.
    Some learning algorithms will generalise quite readily from seeing
    "that is a good question" and "that is a very good question."
    Other algorithms don't.
    So you can't say "what can we induce from this data?", you can only say "what does this learning algorithm learn from this data?".

    1. First point: fine.We meant no negative data by unsupervised. We are happy to start with just sound meaning pairs (i.e. acoustic/theta role pairs) but as I am interested mainly in syntactic learning, I am happy to assume that the relevant input is more "supervised" in the standard sense so long as there is no DIRECT negative data.

      Of course it depends on the algorithm!! That's the point of the question. Your job, Mr Phelps, should you decide to play, is to answer the question by proposing algorithms that will answer it. Those are the one's linguists should care about. Algorithms that address a different "text" presentation may be of little interest to the linguist. So many learnability proofs in the CS literature are, IMO, of dubious interest even though they are called learning proofs. So, yes: find the algorithms that meet this condition for they are the ones that will be of interest. Others MIGHT be, but that takes lots of argument. These will be, because they frame the problem in the right way. Remember the urns Alex. Same point, different version.

    2. Norbert, most of the negative results in the learning literature are based on text presentations; is it your view that these are irrelevant to linguistic theorizing?

    3. Funny you should ask: they are in fact based on different kinds of text presentations (i.e. as you know better than I do, they are not all of a piece). So in fact, (again as you know better than I do) there are a series of "learning" results all based on what definition of learning is adopted AND what kinds of texts are being considered. Now, IMO, most of the results are not particularly significant for linguistics for the learning problem being studied (i.e. the texts being considered) bear little verisimilitude (even under idealized conditions, which I am fine with) to the acquisition problem that occurs in the Language case. And as proofs are interesting to the degree that their premises model what we are interested in, and as the standard assumptions bear no obvious relation to the acquisition situation as I understand it, I cannot say that the proofs are all that enlightening. What makes them somewhat interesting is that the techniques developed for these uninteresting cases might be adopted in proving more interesting conclusions.

      One point: the assumption that does have some purchase is the positive data only dictum. However, we have a pretty good idea what the right texts are (again at least in idealized form) and we should consider just those cases if we are interested in the language case. Oddly, Wexler and Culicover already did something like this, but their work seems to have dropped off the radar. Even their assumptions are not to my liking, but they are framed in the right way. This is more than I can say for the standard proofs. Tant pis.

    4. My question was really about negative results (like Gold's theorems) which many researchers in the generative paradigm have appealed to to justify for example, the Subset Principle or a finite space of possible grammars.

    5. I am not sure how large a role these results have actually played, beyond some simple motivation by analogy. But really, this is beyond my pay grade. Whatever, the results have been, their influence, rightly, has been pretty slight given that the problem they portray is actually very dependent on the nature of the text assumed and these texts have been pretty unlike what I think occurs in the actual case. My recollection is that Gold, for example, proved that some languages were in fact learnable (those generated by primitive recursive functions; his third proof?) while others were not (generated by arbitrary functions). What I want to know is what is learnable from texts generated from Degree 0+ functions of the kind I mentioned. I know why THESE would be interesting results. I am far less sure why the former are.

    6. Interesting; regardless of the primary point of whether these results should be important, I definitely ended up with a different interpretation based on my reading on the secondary question of whether they actually were influential. I always thought that one of the motivations for the P&P model was a learnability argument derived from these theorems. And the Subset Principle? That seems to have been quite influential, and as far as I can tell was directly motivated by an interpretation of a theorem by Angluin.

      I would be very interested to hear from any other readers of the blog about this.

    7. Isn't one of the reasons why Wexler & Culicover fell off the radar the apparent falsity of their Uniqueness Principle? I think I recall reading something to that effect somewhere. And couldn't this be at least somewhat remedied by the observation that if your grammar gets a bit more complicated by adding a constraint to eliminate a possible way of saying something, this can pay for itself by reducing the data score (MDL), or increasing the accuracy of predication of what happens next (Ramscar, Dye, and people like that). The idea reduction of the possible number of saying something is to 1, but it doesn't have to be attained, 3 is better than 240.

    8. @ Alex. I don't think that those proofs were actually particularly influential. I think that they fall in the category of "nice things to invoke when they seem to confirm what you want to say anyway". In that respect they are in good company. There are many findings in "nearby fields" that have attracted much interest when they seem to corroborate what people already believed in, and then were ignored when the findings turned out to be less obviously supportive. I wouldn't think that those proofs played any genuine role in the emergence of P&P etc.

  4. It's maybe worth pointing out that overt forms with at least some semantic annotation is weaker than labelled trees, because the labels give away the syntactic category information, whose allegedly arbitrary nature is an issue for people like Bill Croft and Martin Haspelmath. So a learner that can extract the usual N, V, Det etc. from data without these labels (but perhaps semantic ones such as OBJECT, ACTION, STATE, ...) would constitute a sensible and nondogmatic answer to Bill and Martin's complaint (and I don't that the current state of syntactic theory actually provides one).

  5. I agree with Colin that work in formal language theory was not particularly influential in linguistic theory; it was a convenient result supporting an existing mindset. It's too bad because it is an example of research in computational learning theory that directly relates to important questions in theoretical linguistics. After discussing Gold's result, I'll try to give another more recent example of research in computational theory here that I think is relevant to this particular Hilbert question, which theoretical linguists ought to be aware of. The work there doesn't answer this question yet, but it is a step in the right direction.

    1. Gold's discovery that no superfinite class of languages is identifiable in the limit from positive data tracks very closely with Lila Gleitman's (1992) statement that "The trouble is that an observer who notices everything can learn nothing for there is no end of categories known and constructible to describe a situation [emphasis in original]."

      I think humans can't learn every finite language. One consequence of this is that there are some finite languages, some finite sets of experience, from which humans automatically generalize to some kind of grammar whose extension (set of recognizable strings, tree structures, etc) is much bigger than the finite set of experience. This is relevant to the 3rd Hilbert question posed here which suggests that the finite kernel of data from which human children generalize from contains only "degree 0+" data. It also implies the converse that there are some very long strings which are not needed for learning. I remember Ed Keenan saying something like this: no natural language invokes new rules for only very long sentences. This is like learning from "degree 0+" idea, except the 'complexity' of a sentence is measured in terms of its length instead of depth of embedding.

      There is work ongoing in the grammatical inference community which speaks to this kind of question. In 1997, Colin de la Higuera published an influential paper which gets away from learning from text in Gold's sense and defines set-based learning from small (= reasonably-sized) samples. Informally, an algorithm A learns a class of languages C if for each language L in C there is a finite sample S (the 'characteristic' sample) belonging to L such that for any superset S' in L of S, A returns grammar G for L. Furthermore, de la Higuera imposed the normal time-complexity requirements on A (so A must return a grammar in time polynomial to the size of the input sample), but he also imposed a size complexity requirement on the characteristic sample. The size of S must be polynomial in the size of G (typically the size of S is measured as the sum of the length of its strings plus the cardinality of S). This is not exactly the same as "degree 0+" but it is the same kind of idea (pace Keenan) and one of the reasons I think linguists ought to pay attention to work in grammatical inference. Often, the algorithms posited in GI say exactly what kind of data (finite lg) is sufficient for learners to generalize in a certain way. That's very different from heuristic-based algorithms, which don't have clear statements in this regard.

      One issue with de la Higuera's paper is that it only deals with regular languages, for which there exist established canonical representations. In the case of nonregular languages, the above definition is not so useful because there are context-free grammars for finite languages which are exponentially smaller than the finite language they describe. That's a good thing and de la Higuera's definition gives the wrong result for cases like this. Ryo Yoshinaka and R\'emi Eyraud have some concrete ideas about how to address this, and their ideas are a step closer to measuring the complexity of a sentence in terms of the number of embeddings (because their metrics explictly refer to strings generated by the CFG production rules), but there is plenty of room here for future efforts that either build on or complement these approaches.