Tuesday, April 19, 2016

Indirect negative evidence

One of the features of Charles’ paper (CY) that I did not comment on before and that I would like to bring to your attention here is the relevance (or, more accurately, lack thereof) of indirect negative evidence (INE) for real time acquisition. CY’s claim is that it is largely toothless and unlikely to play much of a role in explaining how kids acquire their Gs.  A few comments.

CY is not the first time I have been privy to this observation. I recall that my “good and great friend” Elan Dresher said as much when he was working on learnability of stress with Jonathan Kaye. He noted (p.c.) that very few Gs lined up in the sub/superset configuration required for an application of the principle. Thus, though it is logically possible that INE could provide info to the LAD for zeroing in on the right G, in fact it was all but useless given the nature of the parameter space and the Gs that such a space supports. So, nice try INE, but no cigar.[1]

CY makes this point elaborately. It notes several problems with INE as, for example, embodied in Bayes models (see p 14-15).

First, generating the sets necessary to make the INE comparison is computationally expensive. CY cites work by Osherson et al. (1986) noting that generating such sets may not even be computable, and work by Fodor and Sakas that crunches the numbers in cases with a finite set of G alternatives and finds that here too computing the extensions of the relevant Gs in order to apply INE is computationally costly.

Nor should this be surprising. If even updating several Gs wrt data quickly gets out of computational control, then it is hardly surprising that using Gs to generate sets of outputs and then comparing them wrt containment is computationally demanding. In sum, surprise, surprise, INE runs into the same kind of tractability issues that Bayes is already rife with.[2]
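To make the cost concrete, here is a minimal sketch of what "generate the extensions and compare them wrt containment" involves. It is a toy of my own devising, not the calculation in Fodor and Sakas or anything in CY: the "grammars" are just sets of licensed adjacent symbol pairs, and the names `extension`, `G_SUB` and `G_SUPER` are hypothetical. Even so, the number of strings that must be enumerated before the subset check can be run grows exponentially with string length.

```python
from itertools import product

def extension(licensed_pairs, alphabet, max_len):
    """Return every string up to max_len whose adjacent symbol pairs are all licensed."""
    strings = set()
    for n in range(1, max_len + 1):
        for symbols in product(alphabet, repeat=n):
            if all((a, b) in licensed_pairs for a, b in zip(symbols, symbols[1:])):
                strings.add("".join(symbols))
    return strings

ALPHABET = "abcd"
# Hypothetical toy grammars: the "super" G licenses every adjacent pair,
# the "sub" G bans two of them.
G_SUPER = {(a, b) for a in ALPHABET for b in ALPHABET}
G_SUB = G_SUPER - {("a", "a"), ("d", "c")}

for max_len in (2, 4, 6, 8):
    sub = extension(G_SUB, ALPHABET, max_len)
    sup = extension(G_SUPER, ALPHABET, max_len)
    # Columns: length bound, |sub extension|, |super extension|, containment holds?
    print(max_len, len(sub), len(sup), sub <= sup)
```

Note that this is the friendly case: a four-symbol alphabet, two Gs, and a hard bound on string length. Real Gs have none of these luxuries.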

Second, and maybe more interesting still, CY diagnoses why it is that INE is not useful in real world contexts. Here is CY (note: ‘super-hypothesis’ is what some call the supersets):

The fundamental problem can be stated simply: the super-hypothesis cannot be effectively ruled out due to the statistical properties of child directed English. (16)

What exactly is the source of the problem? Zipf’s law.

The failure of indirect negative evidence can be attributed to the inherent statistical distribution of language. Under Zipf’s law, which applies to linguistic units (e.g. words) as well as their combinations (e.g. N-grams, phrases, rules; see Yang (2013)), it is very difficult to differentiate low probability events and impossible events.

And this makes it inadvisable to use the absence of a particular form as evidence of its non-generability. In other words, Zipf’s law cuts the ground from under INE.
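A small simulation may help make the Zipf point vivid. This is only a sketch under assumptions of my own (a closed inventory of equally "grammatical" types whose frequencies fall off as 1/rank; the function name `unseen_fraction` and the sample sizes are hypothetical, not CY's). The learner sees a finite sample and we ask what share of the perfectly generable types never showed up.

```python
import random

def unseen_fraction(num_types, num_tokens, seed=0):
    """Share of generable types that never occur in a Zipf-distributed sample."""
    rng = random.Random(seed)
    # Zipfian weights: the type of rank r has weight proportional to 1/r.
    weights = [1.0 / rank for rank in range(1, num_types + 1)]
    sample = rng.choices(range(num_types), weights=weights, k=num_tokens)
    return 1.0 - len(set(sample)) / num_types

# Every one of the 10,000 types is "grammatical" by construction.
for num_tokens in (10_000, 100_000):
    print(num_tokens, round(unseen_fraction(10_000, num_tokens), 3))
```

In a setup like this a sizable share of the types typically goes unattested even when the sample is much larger than the inventory. A learner that treated each gap as evidence of non-generability would be wrong about every one of them.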

Here CY (as it notes) is making a point quite similar to that made over 25 years ago by Steve Pinker (here) (14):

…it turns out to be far from clear what indirect negative evidence could be. It can’t be true that the child literally rules out any sentence he or she hasn’t heard, because there is always an infinity of sentences that he or she hasn’t heard that are grammatical …And it is trivially true that the child picks hypothesis grammars that rule out some of the sentences that he or she hasn’t heard, and that if a child hears a sentence he or she will often entertain a different hypothesis grammar than if he or she hasn’t heard it. So the question is, under exactly what circumstances does a child conclude that a non witnessed sentence is ungrammatical?

What CY notes is that this is not only a conceptual point given the infinite number of grammatical linguistic objects. Because of the Zipfian distribution of linguistic forms in the PLD, it is statistically likely that the evidence relevant to concluding G absence from statistical absence (or rarity) will be very spotty, and that building on such absence will lead in very unfortunate directions. CY discusses a nice case of this wrt adjectives, but the point is quite general. It seems that Zipf’s law makes relying on gaps in the data to draw conclusions about (il)licit grammatical structures a bad strategy.

This is a very nice point, which is why I have belabored it. So, not only are the computations intractable but the evidence relevant for using INE is inadequate for principled reasons. Conclusion: forget about INE.

Why mention this? It is yet another problem with Bayes. Or, more directly, it suggests that the premier theoretical virtue of Bayes (the one that gets cited whenever I talk to a Bayesian) is empirically nugatory. Bayes incorporates the subset principle (i.e. Bayesian reasoning can explain why the subset principle makes sense). This might seem like a nice feature. And it would be were INE actually an important feature of the LAD’s learning strategy (i.e. a principle that guided learning). But, it seems that it is not. It cannot be used both for computational and statistical reasons. Thus, it is a strike against any theory of the ideal learner that it incorporates the subset principle in a principled manner. Why? Because, the idealization points in the wrong direction. It suggests that negative evidence is important to the LAD in getting to its G. But if this is false, then a theory that incorporates it in a principled fashion is, at best, misleading. And being misleading is a major strike against an idealization. So, bad idealization! Again!
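For those who have not seen how Bayes gets the subset principle "for free," here is a minimal sketch of one standard way the reasoning goes (often called the size principle). The assumptions are mine and are labeled as such: two nested hypotheses with finite extensions, examples sampled uniformly from the extension of the true hypothesis, and every observed example compatible with both. The function name and the extension sizes are hypothetical.

```python
def posterior_subset(n_examples, size_sub, size_super, prior_sub=0.5):
    """Posterior of the subset hypothesis after n examples consistent with both.

    Assumes uniform sampling from the true hypothesis's extension, so each
    example has likelihood 1/size under a hypothesis with that extension size.
    Written in odds form to avoid numerical underflow for large n.
    """
    prior_odds_super = (1.0 - prior_sub) / prior_sub
    likelihood_ratio = (size_sub / size_super) ** n_examples  # super vs. sub
    return 1.0 / (1.0 + prior_odds_super * likelihood_ratio)

# The superset hypothesis is never contradicted, yet it steadily loses ground.
for n in (0, 1, 5, 10):
    print(n, round(posterior_subset(n, size_sub=10, size_super=20), 4))
```

The smaller hypothesis wins merely because the forms that only the larger one generates keep failing to appear. That is INE in probabilistic dress, which is exactly why the computational and Zipfian problems above bite.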

And it’s worse still because there is an alternative. Here’s CY (18):

The alternative strategy is a positive one, as it exploits the distributional similarities … Under this approach, the over-hypothesis is never available to the learner, and there is no need to rule it out.

So, frame the problem well (i.e. adopt the right idealization) and you point yourself in the right direction (i.e. by avoiding dealing with problems that the wrong idealization generates).

As CY notes, none of these arguments are “decisive.” Arguments against idealizations never are (though the ones CY presents and that I have rehearsed wrt Bayes in the last several posts seem to me pretty close to dispositive). But, they are important. Like all matters scientific, idealizations need to be defended. One way to defend them is to note that they point to the right kinds of problems and suggest the kinds of solutions we ought to explore. If an idealization consistently points in the wrong direction, then it’s time to chuck it. It’s worse than false, it is counter-productive. In the domain of language, whatever the uses of the technology Bayes makes available, it looks like it is misleading in every possible way. The best that we seem to be able to say for it is that if we don’t take any of its claims seriously then it won’t cause too much trouble. Wow, what an endorsement. Time to let the thing go and declare the “revolution” over. Let’s say this loudly and all together: Bye bye Bayes!

[1] It is worth noting that the Dresher-Kaye system was pretty small, about 10 parameters. Even in this small system, the subset principle proved to be idle.
[2] In fact, it might be worse in this case. The Bayes maneuver generally circumvents the tractability issue by looking for algorithms that can serve to “update” the hypotheses without actually directly updating them. For INE we will need cheap algorithms to generate the required sets and then compare them. Do such quick and dirty algorithms exist for generation and comparison of the extensions of hypotheses?

Monday, April 18, 2016

Two interviews with Chomsky on linguistics (and politics)

Here is a recent interview with Chomsky (thx to Elika for the link) where he talks about things like Big Data in linguistics, Experimental Syntax, islands, superiority and other things. The interview (Jeff Runner doing the ling questioning) is short but interesting.

He makes at least three important points.

First, that there is a difference between data collection and scientific experimentation. The idea, implicit in most of the big data PR, is that one can collect data quite a-theoretically and expect to gain scientific insight. Chomsky notes that this runs against the accumulated wisdom of the last 500 years of scientific research. As he compactly put it:
...theory-driven experimental investigation has been the nature of the sciences for the last 500 years.
Quite right. Experiments are not just looking. They are looking with an attitude and the tude is a function of theory.

Second, much of what linguists study has NO relevant data in any conceivable corpus. He cites the ECP, but this is just the tip of a very large iceberg. If there is no relevant data, then big data collection is beside the point:
In linguistics we all know that the kind of phenomena that we inquire about are often exotic. They are phenomena that almost never occur. In fact, those are the most interesting phenomena, because they lead you directly to fundamental principles. You could look at data forever, and you’d never figure out the laws, the rules, that are structure dependent. Let alone figure out why. And somehow that’s missed by the Silicon Valley approach of just studying masses of data and hoping something will come out. It doesn’t work in the sciences, and it doesn’t work here.
Let me underline one point Chomsky makes: it's the manufactured experimental data that is important to gaining insight. As in the other sciences, linguists create data not found in the wild and use this factitious data to understand what is happening. Real life data is often (IMO, generally) useless because it is too complex. The aim of good data is to reduce irrelevant interference effects that arise from the interaction of many component causes. Real life data is just that; too complex. In linguistics, of particular importance is negative data; data that some structure is unacceptable or cannot have a specific meaning. This is not the kind of data that Big Data can get because it is data that is missing from everyday usage of language. And yes, PoS arguments are built from this kind of data and that is why they are so useful.

Third, I am still not sure what Chomsky's take on island effects is. One of the interesting debates in the Sprouse and Hornstein volume revolved around whether these were reducible to simple complexity effects. My read on this is that Sprouse and Wagers and Phillips got the better of the discussion and that reducing islands to complexity just wasn't going to fly. I'd be interested to know what others think.

At any rate, take a quick look, as it is short and interesting.

Chomsky's recent Sophia Lectures are another excellent recent source of Chomsky syntax speculation. The lectures (plus an excellent interview by Naoki Fukui and Mihoko Zushi) are contained in volume 64 of Sophia Linguistica. I have no online link, unfortunately. But I recommend getting hold of the volume and reading it. Interesting stuff.

Friday, April 15, 2016

Another three things to read: pedagogy, connectors and Marr

Here are three pieces (and one youtube clip) that you might find interesting and provocative. In the last one, Chomsky discusses Marr.

The first is a piece on teaching. It responds to a piece by Brian Leiter on teaching philosophy in mixed gender environments and whether or not males create environments which make it harder for females to participate and learn. Leiter and the blogger Harry Brighouse (HB) are philosophers so their concern is with philo pedagogy. But I believe that ling classes and philo classes have very similar dynamics (less lecture and more discussion, “discovery,” give and take, argument) and so the observations HB makes on Leiter’s original post (link included in above piece) seem relevant to what we do. Take a look and let me know what you think.

FWIW, I personally found some of the suggestions useful, and not only as applied to women. In my experience some very smart people can be quite reluctant to participate in class discussion. This is unfortunate for I know for a fact that the class (and me too) would benefit from their participation (as, I suspect, would they). IMO, learning takes place in classes less because information is imparted and more because a certain style of exploration of ideas is promoted. If lucky, the process is fun and develops a dynamic of its own, which leads to new ideas which leads to more discussion which promotes more amusement which… A really good class shows how to ride this kind of enthusiasm and think more clearly and originally. The problem that Leiter and HB identify would impede this. So, is it a problem for linguistics? My guess is absolutely. If so, what to do? Comments welcome.

The second paper (here) is on a new IARPA funded project to get machines to “think” more like brains. I don’t really care about the technology concerns (not that I think they are uninteresting or trivial, though the ends to which they will be put are no doubt sinister), but it is interesting to hear how leaders in cog-neuro see the problem. The aim is to get machines to think like brains and so what do they fund? Projects aimed at complete wiring diagrams of the brain. So, for example, Christof Koch and his team at the very well endowed Allen Institute are going to do a “complete wiring diagram of a small cube of brain – a million cubic microns, totaling one five-hundredth of cortex.” The idea is that once we have complete wiring diagrams we will know how brains do what they do. Here’s Andreas Tolias being quoted: “without knowing all of the component parts,” he said, “maybe we’re missing the beauty of the structure.” Maybe. Then again, maybe not. Who knows? Well, I think I do and that’s because of observations that Koch has made in the past.

It is simply false that we do not have complete wiring diagrams. We do. We have the complete wiring diagram and genome of the nematode c-elegans. Despite this we know very little about what the little bugger does (actually we do know a lot about how it defecates, David Poeppel informed me recently). So, having the complete diagram and genome has not helped crack the critter’s cognitive issues. Once you see this, you understand that the whole project discussed here is based on the assumption that the relation of human cognition/behavior to brain diagrams is simpler than that of the behavior/cognition of a very simple worm to its wiring diagram and genome. A bold conjecture, you might say. Yup, very bold. Foolhardy anyone? But see below.

It is hard to avoid the suspicion that this is another case of research following the money. Koch knows that there is little reason to think that this will work. But big deal, there’s money there so work it will. And if it fails, then it means we have not gotten to the right level of wiring detail. We need yet more fine grained maps, or maps of other things, or maps between maps of other things and the connectome or.... There really is no end to this and so it is the perfect project.

The little piece is also worth reading for it reports many of the assumptions that our leaders in neuroscience make about brains. Here’s one I liked: Some brain types really believe the neural networks of the 1980s vintage “mimic the basic structure of brains.” So now we know why neural nets were so popular: they looked “brainy”! I used to secretly think that this kind of belief was too silly to attribute to anyone. But, nope, it seems that some really take arguments from pictorial resemblance to be dispositive.

We also know that they have no idea what “feedback loops” are doing, especially from higher order to lower order layers. Despite the mystery surrounding what top down loops do, the assumption still seems to be that, largely, “information flows from input to output through a series of layers, each layer is trained to recognize certain features…with each successive layer performing complex computations on the data.” In other words, the standard learning model is a “discovery procedure,” and the standard view of the learning involved is standard Empiricism/Associationism, the only tweak being that maybe we can do inductions over inductions over inductions as well as inductions over just the initial input. This is the old discredited idea central to American Structuralist Linguistics. Early GG showed that this could not be true and that the relations between levels are much more complex than this picture envisaged. However, the idea that levels might be autonomous is not even on the neuroscience agenda, or so it appears.

In truth, none of this should be surprising. If the report in Quanta accurately relays the standard wisdom, neuroscience is completely unencumbered by any serious theories of cognition. The idea seems to be that we will reverse engineer cognition from wiring diagrams. This is nuts. Imagine reverse engineering the details of a word processing program from a PC’s wiring diagram. It would be a monumental task, though a piece of cake compared to the envisioned project of reverse engineering brains from connectomes.

At any rate, read the piece (and weep).

As a relevant addendum to the above piece take a look at the following. Ellen Lau sent me a link to a debate about the utility of studying the connectome moderated by David Poeppel at the last CNS meeting in NYC. It is quite amusing. The protagonists are Moritz Helmstaedter (MH) and Tony Movshon (TM). The former holds the pro connectome position (don’t let his first remarks fool you, they are intended to be funny), while the latter embraces a more skeptical Marr-like view.

Here’s one remarkable bit: MH presents an original argument regarding the recognized failure of c-elegans connectomics to get much function out of structure. He claims that simple systems are more complex than more complex ones. As TM notes, this is more guess than argument (and there is no argument given). I am pretty sure that were the c-elegans case “successful” this would be generally advertised. David P questions him on this with, IMO, little satisfactory reply. Let’s just say that the position he holds is, ahem, possible but a stretch.

The one thing about the debate that I found interesting is that MH seems to be defending a position that nobody could object to while TM is addressing a question that is very hard to be dispositive about. MH is arguing that connectomics has been and can be useful. TM is arguing that there are other better ways to proceed right now. Or, more accurately, that the Marr three prong attack is the way to go and that we will not get cognition from wiring diagrams, no matter how carefully drawn they are.

IMO, TM has the better of this discussion because he notes that the cases that MH points to as success stories for connectomics are areas where we have had excellent functional stories (Barlow results are the basis of MH’s results) for a while. And in this context, looking at the physiology is likely to be very useful and likely successful. To put this crudely, TM (who cited Marr) seems to appreciate that questions of CN interest can be pursued at different levels, which are somewhat independent. And of course, we want them to be related to each other. MH seems to think in a more reductive manner, that level 3 is basic and that we will be able to deduce/infer level 2 and level 1 stories once we understand the connectomic details. Thus, we can get cognition from wiring diagrams (hence the relevance of the failure of c-elegans).

You know where I stand on this. But the discussion is interesting and worth the 90 minutes. There is a lot of mother and apple pie here (as questioners point out). Nobody argues (reasonably enough) against doing any connectomics work. The argument should be (but often isn’t) about research strategy: about whether connectomics can bypass the C part of CNS. As David P puts it: can one reverse engineer the other two levels given level 3 (see discussion from about 1:15 ff)? Connectomics (MH) leans towards a ‘yes,’ the critics (TM) think ‘no.’ Given the money at stake, this is no small debate. Those who want to see the relevance of Marrian methodological reasoning need look no further than here.

The last piece is something that I probably already posted once before but might be of interest to those following the Marr discussion in recent posts. It’s Chomsky talking about AI and its prospects (here). It’s a fun interview and a good antidote to the second piece I linked to. It also has the longest extended discussion of Marr as it relates to linguistics that I know of.

Chomsky makes two points. First, the point that David Adger made: that there is “no real algorithmic level” when it comes to GG because “it’s just a system of knowledge” and “there is no process”; a system of knowledge is not reducible to how it gets used. (24)

He also makes a second point. Chomsky allows that “[m]aybe information about how it’s used [can] tell you something about the mechanisms.” So ontologically speaking, Gs are not things that do anything, but it might be possible for us (Chomsky notes that some higher (Martian?) intelligence might not require this) to learn something about the knowledge by inspecting how it is used: “Maybe looking at process by which it’s used gives you helpful information” about the structure of the knowledge. (26)

The upshot: there is an ontological/conceptual difference between the knowledge structures that GG describes and how this knowledge is put to use algorithmically, but looking at how the system of knowledge is used may be helpful in figuring out what the structure of that knowledge is.

I agree with the ontological point, but I think that Marr might too. Level 2 theories, as I read him, are not less abstruse descriptions of level 1 theories. Rather, level 1 theories specify computational problems that level 2 theories must solve if they are to explain how humans see or speak or hear or…. In other words, level 2 theories must solve level 1 problems to do what they do. So, for example, in the domain of language, to (at least in part) explain linguistic creativity (humans can produce and understand sentences never before encountered) we must show how information Gs describe (i.e. rules relating sound with meaning) is extracted by parsers in real time. So, the Marr schema does not deny the knowledge/use distinction that Chomsky emphasizes here, and that is a good thing as the two are not the same thing.

However, putting things in this way misidentifies the value of the Marr schema. It is less a metaphysical doctrine than a methodological manual. It notes that it is very useful in vision to parse a problem into three parts and ask how they talk to one another. Why is it helpful? Because it seems that the parts do often enough talk to one another. In other words, asking how the knowledge is put to use can be very helpful in figuring out what the structure of that knowledge is. I think that this is especially true in linguistics where there is really nothing like physical optics or arithmetic to ground level 1 speculations. Rather we discover the content of level 1 theories by inspecting a particular kind of use (i.e. judgments in reflective equilibrium). It seems very reasonable (at least to me) to think that insight we get into the structures using this kind of data will carry over to our study of processing and real time acquisition. Thus, the structures that the processor or LAD is looking for are very close to those that our best theories of linguistic knowledge say that they are. Another way of saying this is that we assume that there is a high level of transparency between what we know and those things we parse. There may even be a pretty close relation between derivations that represent knowledge and variables that measure occurrent psychological processes (think the DTC). This need not be the case, for Chomsky and Adger are right that there is an ontological distinction between knowledge and how knowledge is put to use, but it might be the case. Moreover, if it is, that would offer a terrifically useful probe into the structure of linguistic knowledge. And this is precisely what a methodological reading of Marr’s schema suggests, which is why I would like to emphasize that reading.

Let me add one more point while I am beating a hobbyhorse that I have lately ridden silly: not only is this a possibility, but we have seen recent efforts that suggest its fecundity. Transparency plays an important conceptual role in Pietroski et al’s argument for their proposed semantic structure of most and it also plays an important role in Yang’s understanding of the Elsewhere Principle. I found these arguments very compelling. They use a strong version of transparency to motivate the conclusions. This provides a reason for valuing transparency as a regulative ideal. And this is what, IMO, a Marr schema encourages.

Ok, I’ll stable the pony now with the following closing comments: Chomsky and Adger are right about the ontology. However, there is an interesting reading of Marr where the methodology urged is very relevant to linguistic practice. And Marr is very worthwhile under that reading for it urges a practice where competence and performance issues are more tightly aligned, to the benefit of each.

Oh yes: there is lots more interesting stuff in the Chomsky interview. He takes shots at big data, the distinction between engineering and science, and the difference between reduction and unification. You’ve no doubt seen/heard/read him make these points before, but the interview is compact and easy to read.

Thursday, April 14, 2016

Once more into the breach: Re (3d)

So, what makes an inductive theory Bayesian? I have no idea. Nor, it appears, does anyone else. This is too bad. Why? Because though it is always the case that particular models must be evaluated on their own merits (as Charles rightly notes in the previous post), the interest in particular models, IMO, stems from the light they shine on the class of models of which they are a particular instance. In other words, specific models are interesting both for their empirical coverage AND (IMO, more specifically) for the insight they provide for the theoretical commitments a model embodies (hence one model from the class of models).

My discussion of Bayes rested on the assumption that Bayes commits one to some interesting theoretical claims and that the specific models offered are in service of advancing more general claims that Bayes embodies. From where I sit, it seems to me that for many there are no theoretical claims that Bayes embodies, so that the supposition that a Bayes model intends to tell us something beyond what the specific model is a model of is off base. Ok. I can live with that. It just means that the whole Bayes thing is not that interesting, except technologically. What's of potential interest are the individual proposals, but they don't have theoretical legs as they are not in service of larger claims.

I should add, however, that many "experts" are not quite so catholic. Here is a quote from Gelman and Shalizi's paper on Bayes.

The common core of various conceptions of induction is some form of inference from particulars to the general – in the statistical context, presumably, inference from the observations y to parameters describing the data-generating process. But if that were all that was meant, then not only is ‘frequentist statistics a theory of inductive inference’ (Mayo & Cox, 2006), but the whole range of guess-and-test behaviors engaged in by animals (Holland, Holyoak, Nisbett, & Thagard, 1986), including those formalized in the hypothetico-deductive method, are also inductive. Even the unpromising-sounding procedure, ‘pick a model at random and keep it until its accumulated error gets too big, then pick another model completely at random’, would qualify (and could work surprisingly well under some circumstances – cf. Ashby, 1960; Foster & Young, 2003). So would utterly irrational procedures (‘pick a new random when the sum of the least significant digits in y is 13’). Clearly something more is required, or at least implied, by those claiming that Bayesian updating is inductive. (25-26)    
Note the theories that they count as "inductive" under the general heading but find to be unlikely candidates for the Bayes moniker. See what they consider not Bayes inductive rules? Here are two, in case you missed it: "the whole range of guess-and-test behaviors" and even the "pick a model at random and keep it until its accumulated error gets too big, then pick another model completely at random." G&S take it that if even these methods are instances of Bayesian updating, then there is nothing interesting to discuss for it denudes Bayes of any interesting content.
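For concreteness, here is a minimal sketch of the second procedure G&S mention. Everything specific in it is a hypothetical toy of my own (the hypothesis space "x is a multiple of k", the error tolerance, the function names); it is not a model from G&S, Charles, or Trueswell and Gleitman. The learner holds one hypothesis at a time, counts its errors, and when the count gets too big drops it for another picked completely at random. No posterior over hypotheses is maintained or updated anywhere.

```python
import random

def predicts(k, x):
    """Toy hypothesis k: 'x belongs to the target category iff x is a multiple of k'."""
    return x % k == 0

def learn(candidates, data, tolerance=3, seed=0):
    """Keep one model until its accumulated error exceeds tolerance, then pick anew at random."""
    rng = random.Random(seed)
    current, errors = rng.choice(candidates), 0
    for x, label in data:
        if predicts(current, x) != label:
            errors += 1
            if errors > tolerance:
                # Discard the current model; the new pick ignores all past data.
                current, errors = rng.choice(candidates), 0
    return current

TARGET = 3  # hypothetical target, used only to label the toy data
data = [(x, predicts(TARGET, x)) for x in range(3000)]
print(learn([2, 3, 4, 5, 6], data))
```

In this toy setting the target never errs, so once the learner stumbles onto it, it stays put, while every rival is eventually thrown out. With enough data the learner almost surely ends up in the right place, which is the sense in which such a method "could work surprisingly well."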

Of course, you will have noticed that these two procedures are in fact the ones that people (e.g. Charles, Trueswell and Gleitman and Co) have been arguing in fact characterize acquisition in various linguistic domains of interest. Thus, they reasonably enough (at least if they understand things the way Gelman and Shalizi do) conclude that these methods are not interestingly Bayesian (or for that matter "inductive," except in a degenerate sense).

So, there is a choice: treat "Bayes" as an honorific in which case there is no interesting content to being Bayesian beyond "hooray!!" or treat it as having content, in which case it seems opposed to systems like "guess-and-test" or "pick at random." Which one picks is irrelevant to me. It would be nice to know, however, which is intended when someone offers up a Bayesian model. In the first case 'Bayesian' just means "one that I think is correct." In the second, it has slightly more content. But what that is? Beats me.

One last thing. It is possible to understand the Aspects model of G acquisition as Bayesian (I have this from an excellent (let's say, impeccable) source). Chomsky took the computational intractability of that model (its infeasibility) to imply that we need to abandon the Aspects model in favor of a P&P view of acquisition (though whether this is tractable is an open question as well). In other words, Chomsky took seriously the mechanics of the Aspects model and thought that its intractability indicated that it was fatally flawed. Good for him. He opted for being wrong over being vacuous. May this be a lesson for us all.

Wednesday, April 13, 2016

Yang (himself) on Bayes

Norbert has brought out the main themes of my paper much more clearly than I could have (many thanks for that). This entry is something of a postscript triggered by the comments over the past few days.

The comments remind me of the early days in the Past Tense debate. What does it mean to be a connectionist model? Can't it pass the Wug test if we just get rid of those awful Wickelfeatures? If not backprop, maybe a recurrent net? Most commentators tread a similar terrain:  What’s the distinction between a normative Bayesian model and a cognitive one? How essential is the claim of optimality? Is a model that uses non-Bayesian approximations Bayesian in name only? If not MAP, then how about a full posterior interpretation … [1]

These questions can never be fully resolved because they are questions about frameworks. As Norbert notes, frameworks can only be evaluated by the questions they raise and the answers they provide, not by whether they can or cannot do X because one can always patch things up. (Of course this holds for the Minimalist Framework as well.) A virtue of the Past Tense debate was that it grounded a largely conceptual/philosophical discussion in a well-defined empirical domain, and we have it to thank for a refined understanding of morphology and language acquisition. That represents progress, even if no minds were changed. So let’s focus on some concrete empirical cases, be it probability matching by rodents or Polish genitives by kids. Framework-level questions go nowhere, especially when the highest priests of Bayesianism disagree.

As I said in the paper, none of my criticisms is necessarily decisive but taken together, I hope they make it worthwhile to pursue alternatives [2]: alternatives that linguists have always been good at (e.g., restricting hypothesis space), alternatives that take the psychological findings of language acquisition seriously, and alternatives that do not take forever to run. It’s disappointing to see that all the hard lessons have been forgotten. For instance, indirect negative evidence, which was always viewed with suspicion, is now freely invoked without actually working through its complications. The problem doesn't go away when the modeler peeks at the target grammar and rigs the machinery accordingly, even though the modeler is some kind of idealized observer.

Somewhere during the Second Act of the Past Tense debate, connectionist models that implicitly implemented the regular/irregular distinction started to appear. I remember it annoyed the heck out of a young Gary Marcus, but I suspect that an older and wiser Gary would take that as a compliment. 

[1]  A "true" Bayesian model does not necessarily do better. As I noted in the paper, one such model for morphological learning took a week to train on supervised data but only offers very marginal improvement over an online incremental and psychologically motivated unsupervised model, which processed almost a million words in under half an hour.

[2] The paper does offer an alternative, one embedded in a framework that insists on a transparent mapping between the Marrian levels. As in the Past Tense debate, a critique is never enough, and one needs a positive counterproposal. So let's hear some counter-counter-proposals.

Monday, April 11, 2016

Pouring gasoline on the flames: Yang on Bayes 3

I want to pour some oil on the flames. Which flames? The ones that I had hoped that my two recent posts on Yang’s critique of Bayes (here and here) would engender. There has been some mild pushback (from Ewan, Tal, Alex and Avery). But the comments section has been pretty quiet. I want to restate what I take to be the heart of the critique because, if correct, it is very important. If correct, it suggests that there is nothing worth salvaging from the Bayes “revolution” for there is no there there. Let me repeat this. If Yang is right, then Bayes is a dead end with no redeeming scientific (as opposed to orthographic) value. This does not mean that specific Bayes proposals are worthless. They may not be. What it means is that Bayes per se not only adds nothing to the discussion, but that taking its tenets to heart will mislead inquiry. How so? It endorses the wrong idealization of how stats are relevant to cognition. And misidealizations are as big a mistake as one can make, scientifically speaking. Here’s the bare bones of the argument.

1.     Everyone agrees that data matters for hypothesis choice
2.     Everyone agrees that stats matter in making this choice
3.     Bayes makes 4 specific claims about how stats matter for hypothesis choice:
a.     The hypothesis space is cast very wide. In the limit all possible hypotheses are in the space of options
b.     All potentially relevant data is considered, i.e. any data that could decide between competing hypotheses is used to adjudicate among the hypotheses in the space
c.     All hypotheses are evaluated wrt all of the data. So, as data is considered every hypothesis’ chance of being true is evaluated wrt every data point considered
d.     When all data has been considered the rule is to choose that hypothesis in the space with the highest score
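
For concreteness, here is a minimal sketch of what (3a-d) amount to mechanically. It is not taken from Yang's paper: the coin-bias hypotheses, the data string and the function names (bayes_update, map_choice) are toy assumptions of mine, chosen only to make each of the four steps visible.

```python
# Toy illustration of claims (3a-d). Everything here (hypotheses, data,
# function names) is a hypothetical example, not a model from the paper.

# (3a) The hypothesis space: here just three candidate coin biases, P(heads).
HYPOTHESES = {"fair": 0.5, "heads_biased": 0.8, "tails_biased": 0.2}

# A uniform prior over the space.
PRIOR = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}

# (3b) All of the data gets used: six heads, two tails.
DATA = "HHTHHHHT"


def likelihood(hypothesis, datum):
    """P(datum | hypothesis) for a single coin flip."""
    p_heads = HYPOTHESES[hypothesis]
    return p_heads if datum == "H" else 1.0 - p_heads


def bayes_update(prior, likelihood_fn, data):
    """(3c) Rescore every hypothesis against every data point."""
    posterior = dict(prior)
    for datum in data:
        for h in posterior:
            posterior[h] *= likelihood_fn(h, datum)
        total = sum(posterior.values())
        if total == 0.0:
            raise ValueError("data has zero probability under every hypothesis")
        # Renormalize so the scores remain a probability distribution.
        posterior = {h: p / total for h, p in posterior.items()}
    return posterior


def map_choice(posterior):
    """(3d) Choose the hypothesis with the highest score."""
    return max(posterior, key=posterior.get)


posterior = bayes_update(PRIOR, likelihood, DATA)
best = map_choice(posterior)
print(best, posterior)
```

Note that the inner loop touches every hypothesis for every datum, so the work grows with the size of the hypothesis space times the size of the data. With three hypotheses this is trivial, but that cost is exactly what is at issue once the space in (3a) is cast wide.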

Two things are worth noting about the above.

First, that (3) provides serious content to a Bayesian theory, unlike (1) and (2). The latter are trivial in that nobody has ever thought otherwise. Nobody. Ever. So if this is the point of Bayes, then this ain’t no revolution!

Second, (3) has serious normative motivation. It is a good analysis of what kind of inference an inference to the best explanation might be. Normatively, an explanation is best if it is better than all other possible explanations and accounts for all of the possibly relevant data. Ideally, this implies evaluating all alternatives wrt all the data and choosing the best. This gives us (3a-d). Cognitive Bayes (CB) is the hypothesis that normative Bayes (NB) is a reasonable idealization of what people actually do when they learn/acquire something. And we should appreciate that this could be the case. Let’s consider how for a moment.

The idealization would make sense for the following kind of case (let’s restrict ourselves to language). Say that the hypothesis space of potential Gs was quite big. For concreteness, say that we were always considering about 50 different candidate Gs. This is not all possible Gs, but 50 is a pretty big number computationally speaking. So say 50 or more alternatives is the norm. Then Bayes (3a) would function a lot like the standard linguistic assumption that the set of well-formed syntactic objects in a given language is effectively infinite. Let me unpack the analogy.

This infinity assumption need not be accurate to be a good idealization. Say it turns out that the number of well-formed sentences a native speaker of English is competent wrt is “only” 10^1000. Wouldn’t this invalidate the infinity assumption? No, it would show that it is false, but not that it is a bad idealization. Why? Because the idealization focuses attention on the right problem. Which one? The Projection Problem: how do native speakers go from a part of the language to all of it? How, given exposure to only a subset of the language, does a LAD get mastery over a whole language? The answer: you acquire recursive rules, a G, that’s how. And this is true whether the “language” is infinite or just very big. The problem, going from a subset to its containing superset, will transit via a specification of rules whether or not the set is actually infinite. All the infinite idealization does is concentrate the mind on the projection problem by making the alternative tempting idea (learning by listing) silly. This is what Chomsky means when he says in Current Issues: “once we have mastered a language, the class of sentences with which we can operate fluently and without difficulty or hesitation is so vast that for all practical purposes (and, obviously, for all theoretical purposes), we may regard it as infinite” (7, my emphasis NH). See: the idealization is reasonable because it does not materially change the problem to be solved (i.e. how to go from part of the language you are exposed to, to the whole language that you have mastery over).

A similar claim could be true of Bayes. Yes, the domain of Gs a LAD considers is in fact big. Maybe not thousands or millions of alternatives, but big enough to be worth idealizing to a big hypothesis space in the same way that it is worth assuming that the class of sentences a native speaker is competent wrt is infinite. Is this so? Probably not. Why not? Because even moderately large hypothesis spaces (say with over 5 competing alternatives) turn out to be very hard to manage. So the standard practice is to use really truncated spaces, really small SWSs. But when you so radically truncate the space, there is no reason to think that the inductive problem remains the same. Just think if the number of sentences we actually knew was about 5 (roughly what happens in animal communication systems). Would the necessity of rules really be obvious? Might we not reject the idealization Chomsky argues for (and note that I emphasize ‘argue’)? So, rejecting (3a) means rejecting part of the Bayes idealization.

What of the other parts, (3b-d)? Well, as I noted in my posts, Charles argues that each and every one is wrong in such a way as to be not worth making. It gets the shape of the problem wrong. He may be right. He may be wrong (not really, IMO), but he makes an argument. And if he is right, then what’s at stake is the utility of NB as a useful idealization for cognitive purposes. And, if you accept this, we are left with (1-2), which is methodological pablum.

I noted one other thing: the normative idealization above was once considered as a cognitive option within linguistics. It was known as the child-as-little-linguist theory. And it had exactly the same problems that Bayes has. It suggests that what kids do is what linguists do. But it is not the same thing at all. And realizing this helped focus on what the problem the LAD faces is. Bayes is not unique in misidealizing a problem.

Three more points and I end today’s diatribe.

First, one can pick and choose among the four features above. In other words, there is no law saying that one must choose the various assumptions as a package. One can adopt a SWS assumption (rejecting 3a) while adopting a panoramic view of the updating function (assuming that every hypothesis in the space is updated wrt every new data point) and rejecting choice optimization (3d). In other words, mixing and matching is fine and worth exploring. But what gives Bayes content, and makes it more than one of many bookkeeping notations, is the idealization implicit in CB as NB.

Second, what makes Bayes scientifically interesting is the idealization implicit in it. I mention this because as Tal notes in a comment (here), it seems that current Bayesians are promoting their views as just a “set of modeling practices.” The ‘just’ is mine, but this seems to me what Tal is indicating about the paper he links to. But the “just” matters. Modeling practices are scientifically interesting to the degree that they embody ideas about the problem being modeled. The good ones are ones that embody a good idealization. So, either these practices are based on substantive assumptions or they are “mere” practices. If the latter, then the Bayes modeling is in itself of zero scientific interest. Does anyone really want to defend Bayes in this way? I confess that if this is the intent then there is nothing much to argue about given how modest (how really modest, how really really modest) the Bayes claim is.

Last, there is a tendency to insulate one’s work from criticism. One way of doing this is to refuse to defend the idealizations implicit in one’s technology. But technology is never innocent. It always embodies assumptions about the way the world is so that the technology used is a good technology in that it allows one to see/do things that other technologies do not permit or, at least, does not distort how the basic problems of interest are to be investigated. But researchers hate having to defend their technology, more often favoring the view that how it runs is its own defense. I have been arguing that this is incorrect. It does matter. So, if it turns out that Bayesians now are urging us to use the technology but are backing away from the idealizations implicit in it, that is good to know. This was not how it was initially sold. It was sold as a good way of developing level 1 cognitive theories. But if Bayes has no content then this is false. It cannot be the source of level 1 theories for on the revised version of Bayes as a “set of modeling practices” Bayes per se has no content so Bayes is not and cannot be a level 1 theory of anything. It is vacuous. Good to know. I would be happy if this is now widely conceded by our most eminent Bayesians. If this is now the current view of things, then there is nothing to argue about. If only Bayes had told us this sooner.