Sunday, January 20, 2013

A comment, not a post


0. This will have to do for now in lieu of several posts on the different varieties of computational complexity that have, like Gandalf at The Prancing Pony, been delayed.  I hope the inevitable brevity does not come across as ‘shrill,’ even though the delay is of a nature to vex any sane person –  Norbert will tell you the story if you ask.
1. Where I grew up, a language acquisition model had 4 parts: (i) an initial state; (ii) a target state (or states); (iii) input data; & (iv) an acquisition algorithm, mapping (i) to (ii) using (i) and (iii). I have to side with Stephen Crain & Norbert on this one: figuring out some of the constraints on (i), viz., Condition C, takes us part of the way, and in my book a really long, long way, to an understanding of language acquisition, even though we don’t understand much about (iv). Of course, one can decide to throw away all we’ve learned about (i) so far and choose another lifestyle. But it would seem to me that the burden of proof lies with those who choose to walk this alternative route, since, e.g., Alex notes that he has “no idea” where the constraints in (i) come from.
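Just to fix ideas before the comparisons below, here is a toy rendering of that four-part scheme in code; the type names, the stub 'hypothesize' step, and the trivial learner are placeholders of my own, purely for illustration, not anybody's actual proposal.

# Toy rendering of the four-part scheme in (1).  Purely illustrative: the type
# names, the stub 'hypothesize' step, and the trivial learner are placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class InitialState:                    # (i): the innate constraints, e.g. Condition C
    constraints: frozenset

@dataclass(frozen=True)
class Grammar:                         # (ii): an attained (target) state
    constraints: frozenset
    rules: frozenset

def hypothesize(datum):
    return "rule-for(" + datum + ")"   # stand-in for real inference from one datum

def acquire(s0, data):
    # (iv): the acquisition algorithm, mapping (i) to (ii) using (i) and (iii).
    # The body is a placeholder; the point of (1) is that knowing a lot about s0
    # already does real work, even while this function stays largely unknown.
    return Grammar(constraints=s0.constraints,
                   rules=frozenset(hypothesize(d) for d in data))   # (iii): the data

print(acquire(InitialState(frozenset({"Condition C"})), ["the boy saw him"]))

Restating the point of (1): we can know a great deal about InitialState, and that is real progress, even while the body of acquire remains mostly a black box.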
2. As far as I can make out, much the same holds for any biological phenomenon we know anything about. By way of comparison, consider, e.g., human bipedalism. We know quite a bit about the initial state: H. sap., but not Pan troglodytes, winds up walking on 2 legs, and we know a lot about the genomic differences. But not enough. No biologist would venture to say, even given all this knowledge about the differences in initial states, that they have a ‘complete acquisition algorithm’ that explains how human zygotes map to walking babies. Prediction is very difficult, especially about the future, but it’s a fair guess that no such account is likely to be forthcoming any time soon. However, this brute fact doesn’t seem to stop the sequencing machines running at the Broad Institute, nor the long string of Nobel prizes handed down from Stockholm, nor, for that matter, the day-to-day research of any biologist I know. They just keep plugging away, finding out more and more about the initial state and how it differs between us and other species. Nor does any biologist I know ever rely on, or even expect there to be, a ‘formal model’ for this mapping from (i) to (ii), in the sense intended by folks like Alex. Now, all this parallels Condition C: a universal competence that H. sap. has but Pan trog. lacks. Why throw away all this hard-won knowledge? Would such folks really say that we must now abandon DNA and figure out some general mapping ensuring that people wind up walking but chimps don't?
3. I don’t know about you, but Alex could do me a big favor by carefully explaining exactly how all the neat results he always points to, e.g., Ryo Yoshinaka’s work on the learnability of multiple context-free languages (the languages generated by multiple context-free grammars, MCFGs; see, e.g., Theor. Comp. Sci. 412:19), actually work with natural language/grammar examples. Because, to be honest, I for one simply don’t get it – at least when I read through these articles, the results always pull punches; e.g., Yoshinaka’s results are actually about “new subclasses of multiple context-free languages with variants of substitutability”, and he says, “the obtained learnable classes are however not rich, as we have seen in Section 3 that several rather simple languages are not 2d-substitutable. pd-substitutability easily causes too much generalization from finite languages even when p = 2” (p. 1828, italics mine; sorry, no space to cover the terminology here). So are English, Warlpiri, etc. included here or not? I simply can’t tell. And the MCFG-MG bridge seems equally shaky. Take for example Alex’s own version of this: learning via the MCFG/substitution method as applied to (one version of) minimalist grammars. (I’m referring here to the slides and talk he gave on this at UQAM in September 2011.) Here, at least for me, there’s one big catch (really several): at first blush, my intuition tells me that the MCFGs all blow up exponentially in size as compared to their strongly equivalent MG counterparts. So does the learning. This is a common succinctness phenomenon, as I’ll detail in my next post. In my old thesis (1982), chapter 4, pp. 384-390, there’s a demonstration that GPSG grammars wind up exponentially larger than (strongly) equivalent GB-style grammars (sadly, this part didn’t make it into the published book, due to page limits). As for the whole business of learnability-by-substitution, I’ve yet to see any method that blunts the force of Chomsky’s thorough discussion (1955) of its pitfalls – take the discussion of homonyms for starters. But I’m prepared to be enlightened. Where’s the beef?
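Since the homonym point carries a lot of weight here, a toy sketch, entirely my own and not a rendering of Yoshinaka's or Alex's actual algorithms, of why naive learning-by-substitution gets into trouble: one shared context is taken to license intersubstitutability everywhere, and a homonym then bridges distributions that should stay separate.

# Toy illustration of the homonym pitfall for naive substitution (cf. Chomsky 1955).
# Purely illustrative; not anyone's actual learner.
from collections import defaultdict

corpus = [
    "they saw her duck",      # 'duck': bird or action -- the homonym
    "they saw her swan",
    "a duck can fly",
    "they saw her run",
]

def contexts(sentences):
    """Map each word to the set of (left, right) contexts it occurs in."""
    ctx = defaultdict(set)
    for s in sentences:
        words = s.split()
        for i in range(len(words)):
            ctx[words[i]].add((tuple(words[:i]), tuple(words[i + 1:])))
    return ctx

def substitutable(ctx, a, b):
    """Naive heuristic: one shared context => treat a and b as fully interchangeable."""
    return bool(ctx[a] & ctx[b])

ctx = contexts(corpus)
print(substitutable(ctx, "duck", "swan"))   # True: they share 'they saw her __'
print(substitutable(ctx, "duck", "run"))    # True: same shared context, verbal reading
# Chaining these identifications, the learner would also sanction strings like
# 'a run can fly' -- the homonym 'duck' has glued two distributions together.

That, in miniature, is the worry from Chomsky (1955); the open question for me is whether the pd-substitutability refinements genuinely dodge it for real languages like English or Warlpiri.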
4. I apologize in advance, but I won't be able to respond to the expected torrent of replies for a day or so – see under (0) above.

1 comment:

  1. I can answer some of this very easily:
    natural languages are certainly not in the class learnable by the algorithm in that TCS paper -- it uses 2d-substitutability, which is far stronger than 1d-substitutability, and natural languages don't have that property, as you know.

    I conjecture that all natural languages (weakly) lie in the classes learnable by the algorithm in my joint paper with Ryo on learning PMCFGs (parallel multiple context-free grammars). These grammars include a copying operation; I was convinced by Greg Kobele's thesis that such operations are needed in syntax, and they can also handle things like case-stacking and reduplication in morphology.

    But -- and this is a big "but" -- that learner uses membership queries. And it is a weak learner (also a big but).
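    (A quick gloss for readers who haven't met the term, with a toy sketch of my own rather than the actual PMCFG algorithm: a membership query lets the learner ask an informant whether a candidate string is in the target language, and so test a generalization before adopting it.)

    # Toy sketch only -- not the actual PMCFG learner.
    def adopt_generalization(newly_licensed, in_language):
        """Adopt a rule only if every string it newly licenses passes the oracle."""
        return all(in_language(s) for s in newly_licensed)

    # E.g. the naive substitution step sketched under point 3 predicts 'a run can
    # fly'; an oracle that rejects that string blocks the over-generalization.
    # The oracle is an idealization, of course -- no child has such an informant.
    print(adopt_generalization(["a run can fly"], lambda s: False))   # False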

    So I have a learner which is probabilistic, and I have some proposals for strong learners (only in draft form but I hope to publish this summer), but all in separate papers. In an ideal world we would have everything in one formal model, but for the reasons that you sketch in part 1, that is unlikely to happen real soon, and perhaps unrealistic.

    It also has one more flaw, which you point out very precisely in the second half of point 3.
    MGs are equivalent to MCFGs, but MGs are exponentially more compact, just as GPSGs are exponentially more compact than CFGs (slightly different technically, but you know what I mean).
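    (To make the blow-up concrete, here is a toy calculation of my own -- not the actual MG-to-MCFG or GPSG-to-CFG construction: compile away k independent binary features from a single category schema and you get on the order of 2**k atomic nonterminals, with the rule set growing to match.)

    # Illustrative only: count the atomic categories you get by compiling away
    # k independent binary features from one category schema.
    from itertools import product

    def compiled_nonterminals(features):
        """Each fully specified feature bundle becomes its own atomic category."""
        return ["NP[" + ",".join(v + f for f, v in zip(features, vals)) + "]"
                for vals in product(("+", "-"), repeat=len(features))]

    feats = ["wh", "case", "agr", "def", "pl"]
    print(len(compiled_nonterminals(feats)))   # 2**5 = 32 atomic categories
    # With k features that's 2**k categories: the compiled-out grammar can be
    # exponentially larger than the compact, feature-based one it came from.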

    So I completely agree that even the best learning algorithm for CFGs is too weak as we need to learn something more compact. That is something I have only vague ideas about -- what sort of feature systems can we learn? Another very good problem, but I am working mostly on the strong learning stuff at the moment.

    I also completely agree with your comment about Chomsky 55 -- I just wrote a survey paper which said: (I quote)
    "The naive application
    of these heuristic approaches has been known for a long time to suffer from
    some serious problems; Chomsky was perhaps the first to articulate these problems
    explicitly in his doctoral thesis [6], and indeed much of the technical work that we
    describe can be seen as an attempt to either answer or avoid those problems, which
    at the time were taken to be compelling arguments against the very possibility of
    distributional learning."

    So I think the criticisms that Chomsky 55 makes are exactly the right ones -- they are amazingly prescient, and though we have solutions to some of them now, some still resist analysis.
