For various reasons, Mark J could not post this as a comment on this post. As he knows much more than I do about these matters, I thought it a public service to lift these remarks from the comments section to make them more visible. I think that this is worth reading in conjunction with Charles' recent post (here). At any rate, this is interesting stuff and I don't disagree much (or have not found reasons to disagree much) with what Mark J says below. I will, of course, allow myself some comments later. Thx Mark.
This was originally written as a comment for the "Bayes and Gigerenzer" post, but a combination of a length restriction on comments and my university's not enabling blog posts from our accounts meant I had to email Norbert directly.
As Norbert has remarked, Bayesian approaches are often conflated with strong empiricist approaches, and I think this post does that too. But even within a Bayesian approach, there are powerful reasons not to be a "tabula rasa" empiricist. The "bias-variance dilemma" is a mathematical statement of something I've seen Norbert say in this blog: learning only works when the hypothesis space is constrained. In mathematical terms, you can characterise a learning problem in terms of its bias -- the range of hypotheses being considered -- and the variance or uncertainty with which you can identify the correct hypothesis. There's a mathematical theorem that says that in general as the bias goes down (i.e., the class of hypotheses increases) the variance increases.
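The bias-variance tradeoff can be seen in a small simulation (a minimal sketch; the underlying curve, noise level, and polynomial degrees are invented for illustration, not from any acquisition model). A constrained hypothesis class (lines) is badly biased but its fits barely move from dataset to dataset; a rich class (degree-9 polynomials) can represent the truth, but which hypothesis it picks swings wildly with the noise in each sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
x_test = 0.5  # the point at which we compare fitted predictions

def prediction_variance(degree, n_datasets=200):
    """Fit a polynomial of the given degree to many noisy datasets drawn
    from the same underlying curve, and return the variance of the
    fitted predictions at x_test across those datasets."""
    preds = []
    for _ in range(n_datasets):
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_test))
    return float(np.var(preds))

# Lower bias (a larger hypothesis class) comes with higher variance:
print(prediction_variance(1) < prediction_variance(9))
```

The theorem says this pattern is not an accident of the example: enlarging the class of hypotheses generally increases the uncertainty with which data can pin down the correct one.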
Given this, I think a very reasonable approach is to formulate a model that includes as much relevant information from universal grammar as we can put into it, and perform inference that is as close to optimal as we can achieve from data that is as close as possible to what the child receives. I think this ought to be every generative linguist's baseline model of acquisition! Even with an incomplete model and incomplete data, we can obtain results of the kind "innate knowledge X plus data Y can yield knowledge Z".
But for some strange (I suspect largely historical) reason, this is not how Chomskyan linguists think of computational models of language acquisition. Instead, they prefer ad hoc procedural models. Everyone agrees there has to be some kind of algorithm which children use to learn language. I know there are lots of psychologists who are sure they have a good idea of the kinds of things kids can and can't do, but I suspect nobody really has the faintest idea of what algorithms are "cognitively plausible". We have little idea of how neural circuitry computes, especially over the kinds of hierarchical representations we know are involved in language. Algorithms which can be coded up as short computer programs (which is what most people have in mind when they say "simple") might turn out to be neurally complex, while we know that the massive number of computational elements in the brain enables it to solve computationally very complex problems. In vision -- a domain we can sort-of study because we can stick electrodes into animals' brains -- it turns out that the image processing algorithms implemented in brains are actually very sophisticated and close to Bayes-optimal, backed up with an incredible amount of processing power. Why not start with the default assumption that the same is true of language?
It's true that in word segmentation, a simple ad hoc procedure -- a greedy learning algorithm that ignores all inter-word dependencies -- actually does a "reasonable job" of segmentation, and that improving either the algorithm's search procedure or making the model track more complex dependencies actually decreases the overall word segmentation accuracy. Here I think Ben Borschinger's comment has it right -- sometimes a suboptimal algorithm can correct for the errors of a deficient model if their errors go in opposite directions. We've known since at least Goldwater's work that an inaccurate "unigram" model that ignores inter-word interactions will prefer to find multi-word collocations and hence under-segment. On the other hand, a naive greedy search procedure tends to over-segment, i.e., find word boundaries where there are none. Because the unigram model under-segments while the naive greedy search over-segments, the combination actually does better than improving only the search procedure or only the model (by incorporating inter-word dependencies), since either improvement on its own leaves "uncancelled errors".
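The unigram model's under-segmentation bias is easy to see with toy numbers (the probabilities below are invented for illustration, not estimates from any corpus). Because a unigram model treats each word as an independent draw, every word it posits incurs a separate probability cost, so a frequent collocation can outscore its two-word analysis:

```python
# Invented unigram probabilities, for illustration only.
p = {"the": 0.10, "dog": 0.05, "thedog": 0.01}

# The two-word analysis of "thedog" costs the product of two probabilities:
two_words = p["the"] * p["dog"]   # 0.005
one_word = p["thedog"]            # 0.010

# The single-token (under-segmented) analysis wins whenever the collocation
# is frequent enough, even though "the" and "dog" are individually more
# probable than "thedog".
print(one_word > two_words)
```

A model that tracks inter-word dependencies can instead explain the collocation's frequency as "dog" being likely after "the", removing the pressure to treat "thedog" as a single word.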
Of course it's logically possible that children use ad hoc learning procedures that rely on this kind of "happy coincidence", but I think it's unlikely for several reasons.
First, these procedures are ad hoc -- there is no theory, no principled reason why they should work. Their main claim to fame is that they are simple, but there are lots of other "simple" procedures that don't actually solve the problem at hand (here, learn words). We know that they work (sort of) because we've tried them and checked that they do. But a child has no way of knowing that this simple procedure works while this other one doesn't, so the procedure would need to be innately associated with the word learning task. This raises Darwin's problem for the ad hoc algorithm (as well as other related problems: if the learning procedure is really innately specified, then we ought to see dissociation disorders in acquisition, where the child's knowledge of language is fine, but their word learning algorithm is damaged somehow).
Second, ad hoc procedures like these only partially solve the problem, and there's usually no clear way to extend them to solve the problem fully, so some other learning mechanism will be required anyway. For example, the unigram+greedy approach can find around 3/4 of the tokens and 1/2 of the types, and there's no obvious way to extend it so it learns all the tokens and all the types. But children do eventually learn all the tokens and all the types, and we'll need another procedure for doing this. Note that the Bayesian approach that relies on more complex models does have an account here, even though it currently involves "wishful thinking": as the models become more accurate by including more linguistic phenomena and the search procedures become more accurate, the word segmentation accuracy continues to improve. We don't know how to build models that include even a fraction of the linguistic knowledge of a 3 year old, but the hope is that eventually these models would achieve perfect word segmentation, and indeed, be capable of learning all of a language. In other words, there isn't a plausible path by which the ad hoc approach would generalise to learning all of a language, while there is a plausible path for the Bayesian approach that relies on more and more accurate linguistic models.
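The token/type distinction here can be made concrete with a small scoring sketch (the gold and predicted segmentations below are my own toy example, not output from any of the systems discussed): a token counts as found only if both of its boundaries match the gold segmentation, while a type counts as found if it appears anywhere in the output.

```python
def spans(words):
    """Map a segmentation to the set of (start, end) character spans
    it induces over the unsegmented string."""
    out, i = set(), 0
    for w in words:
        out.add((i, i + len(w)))
        i += len(w)
    return out

def token_recall(gold, pred):
    """Fraction of gold tokens whose spans the prediction recovers exactly."""
    return len(spans(gold) & spans(pred)) / len(gold)

def type_recall(gold, pred):
    """Fraction of gold word types found somewhere in the prediction."""
    return len(set(gold) & set(pred)) / len(set(gold))

gold = ["the", "dog", "barks"]
pred = ["thedog", "barks"]        # an under-segmented output
print(token_recall(gold, pred))   # 1/3: only "barks" has both boundaries right
print(type_recall(gold, pred))    # 1/3: only the type "barks" is recovered
```

Scores like these make the shortfall precise: a learner can get a respectable token score while still missing half the vocabulary, which is why "3/4 of the tokens" is not the same achievement as learning the lexicon.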
Finally -- and I find it strange to be saying this to a linguist who is otherwise providing very cogent arguments for linguistic structure -- there really are linguistic structures and linguistic dependencies, and it seems weird to assume that children use a learning procedure that just plain ignores them. Maybe there is a stage where children think language consists of isolated words (this is basically what a unigram model assumes), and the child only hypothesises larger linguistic structures after some "maturation" period. But our work shows that you don't need to assume this; instead, a single model that does incorporate these dependencies combined with a more effective search procedure actually learns words from scratch more accurately than the ad hoc procedures.
Norbert sometimes seems very sure he knows what aspects of language have to be innate. I'm much less sure myself of just what has to be innate and what can be learned, but I suspect a lot has to be innate (I think modern linguistic theory is as good a bet as any). I think an exciting thing about Bayesian models is that they give us a tool for investigating the relationship between innate knowledge and learnability. For example, if we can show that a model with innate knowledge X+X' can learn Z from data Y, but a model with only innate knowledge X fails to learn Z, then probably innate knowledge X' plays a role in learning Z. I said "probably" because someone could claim that the child's data isn't just Y but also includes Y', and from model X and data Y+Y' it's possible to infer Z. Or someone might show that a completely different set of innate knowledge X'' suffices to learn Z from Y. So of course a Bayesian approach won't definitively answer all the questions about language acquisition, but it should provide another set of useful constraints on the process.