
Tuesday, April 19, 2016

Indirect negative evidence

One of the features of Charles’ paper (CY) that I did not comment on before and that I would like to bring to your attention here is the relevance (or, more accurately, lack thereof) of indirect negative evidence (INE) for real time acquisition. CY’s claim is that it is largely toothless and unlikely to play much of a role in explaining how kids acquire their Gs.  A few comments.

CY is not the first time I have been privy to this observation. I recall that my “good and great friend” Elan Dresher said as much when he was working on the learnability of stress with Jonathan Kaye. He noted (p.c.) that very few Gs lined up in the sub/superset configuration relevant for an application of the principle. Thus, though it is logically possible that INE could provide info to the LAD for zeroing in on the right G, in fact it was all but useless given the nature of the parameter space and the Gs that such a space supports. So, nice try INE, but no cigar.[1]

CY makes this point elaborately. It notes several problems with INE as, for example, embodied in Bayes models (see pp. 14-15).

First, generating the sets necessary to make the INE comparison is computationally expensive. CY cites work by Osherson et al. (1986) noting that generating such sets may not even be computable, and work by Fodor and Sakas that crunches the numbers in cases with a finite set of G alternatives and finds that here too computing the extensions of the relevant Gs in order to apply INE is computationally costly.

Nor should this be surprising. If even updating several Gs wrt data quickly gets out of computational control, then it is hardly surprising that using Gs to generate sets of outputs and then comparing them wrt containment is computationally demanding. In sum, surprise, surprise, INE runs into the same kind of tractability issues that Bayes is already rife with.[2]
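To get a feel for the numbers, here is a toy sketch (mine, not Fodor and Sakas's actual calculation). Even for two made-up grammars over a two-symbol alphabet, brute-force enumeration and comparison of extensions grows exponentially with string length, and for richer grammar classes (e.g. context-free) the containment question is not decidable at all.

```python
# Toy illustration of why comparing grammar extensions is costly.
# The grammars below are hypothetical stand-ins, not anything from CY.
from itertools import product

ALPHABET = ["a", "b"]

def extension(accepts, max_len):
    """All strings up to max_len accepted by a grammar's recognizer."""
    out = set()
    for n in range(1, max_len + 1):
        for chars in product(ALPHABET, repeat=n):   # 2^n candidates at length n
            w = "".join(chars)
            if accepts(w):
                out.add(w)
    return out

# Two hypothetical grammars: G1 generates (ab)+, G2 generates any nonempty string.
g1 = lambda w: len(w) > 0 and len(w) % 2 == 0 and all(
    w[i:i + 2] == "ab" for i in range(0, len(w), 2))
g2 = lambda w: len(w) > 0

for max_len in (8, 12, 16):
    e1, e2 = extension(g1, max_len), extension(g2, max_len)
    # The candidate space doubles with every extra symbol of string length.
    print(max_len, len(e1), len(e2), e1 <= e2)
```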

Second, and maybe more interesting still, CY diagnoses why it is that INE is not useful in real world contexts. Here is CY (note: ‘super-hypothesis’ is what some call the supersets):

The fundamental problem can be stated simply: the super-hypothesis cannot be effectively ruled out due to the statistical properties of child directed English. (16)

What exactly is the source of the problem? Zipf’s law.

The failure of indirect negative evidence can be attributed to the inherent statistical distribution of language. Under Zipf’s law, which applies to linguistic units (e.g. words) as well as their combinations (e.g. N-grams, phrases, rules; see Yang (2013)), it is very difficult to differentiate low probability events and impossible events.

And this makes it inadvisable to use the absence of a particular form as evidence of its non-generability. In other words, Zipf’s law cuts the ground from under INE.
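To see why, consider a toy simulation (the inventory size and sample size are made-up numbers; CY's argument concerns child-directed English). Under a Zipfian distribution over 100,000 possible forms, a sizeable chunk of perfectly possible forms never appears in a million-token sample, so absence fails to distinguish the rare from the ungenerable.

```python
# Toy Zipfian sampling: absence of a form is weak evidence of impossibility.
import random

V = 100_000                            # hypothetical inventory of possible forms
ranks = list(range(1, V + 1))
weights = [1.0 / r for r in ranks]     # Zipf's law: frequency proportional to 1/rank

sample = random.choices(ranks, weights=weights, k=1_000_000)
unseen = V - len(set(sample))
print(f"{unseen} of {V} possible forms ({unseen / V:.0%}) never occur in a "
      f"1,000,000-token sample, yet none of them is impossible.")
```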

Here CY (as it notes) is making a point quite similar to that made over 25 years ago by Steve Pinker (here) (14):

…it turns out to be far from clear what indirect negative evidence could be. It can’t be true that the child literally rules out any sentence he or she hasn’t heard, because there is always an infinity of sentences that he or she hasn’t heard that are grammatical …And it is trivially true that the child picks hypothesis grammars that rule out some of the sentences that he or she hasn’t heard, and that if a child hears a sentence he or she will often entertain a different hypothesis grammar than if he or she hadn’t heard it. So the question is, under exactly what circumstances does a child conclude that a non-witnessed sentence is ungrammatical?

What CY notes is that this is not only a conceptual possibility given the infinite number of grammatical linguistic objects. Because of the Zipfian distribution of linguistic forms in the PLD, it is statistically likely that the evidence relevant to inferring grammatical absence from statistical absence (or rarity) will be very spotty, and that building on such absence will lead in very unfortunate directions. CY discusses a nice case of this wrt adjectives, but the point is quite general. It seems that Zipf’s law makes relying on gaps in the data to draw conclusions about (il)licit grammatical structures a bad strategy.

This is a very nice point, which is why I have belabored it. So, not only are the computations intractable, but the evidence relevant for using INE is inadequate for principled reasons. Conclusion: forget about INE.

Why mention this? It is yet another problem with Bayes. Or, more directly, it suggests that the premier theoretical virtue of Bayes (the one that gets cited whenever I talk to a Bayesian) is empirically nugatory. Bayes incorporates the subset principle (i.e. Bayesian reasoning can explain why the subset principle makes sense). This might seem like a nice feature. And it would be were INE actually an important feature of the LAD’s learning strategy (i.e. a principle that guided learning). But it seems that it is not: it cannot be used, for both computational and statistical reasons. Thus, it is a strike against any theory of the ideal learner that it incorporates the subset principle in a principled manner. Why? Because the idealization points in the wrong direction. It suggests that negative evidence is important to the LAD in getting to its G. But if this is false, then a theory that incorporates it in a principled fashion is, at best, misleading. And being misleading is a major strike against an idealization. So, bad idealization! Again!
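For concreteness, here is the advertised feature in toy form (my numbers, not anyone's actual model). Under the "size principle," data consistent with both a subset and a superset grammar are likelier under the subset, so the posterior drifts toward the subset grammar as superset-only forms keep failing to appear. That drift just is a use of INE, and it is exactly what the computational and Zipfian considerations above call into question.

```python
# Toy "size principle": Bayes builds in a preference for subset hypotheses.
# The two grammar extensions below are hypothetical.
H_sub = {"ab", "abab", "ababab"}             # subset grammar's extension
H_super = H_sub | {"ba", "baba", "aabb"}     # superset grammar's extension
prior = {"sub": 0.5, "super": 0.5}

def posterior(data):
    """Posterior over the two grammars, assuming uniform sampling from each extension."""
    like = {"sub": 1.0, "super": 1.0}
    for x in data:
        like["sub"] *= (1 / len(H_sub)) if x in H_sub else 0.0
        like["super"] *= (1 / len(H_super)) if x in H_super else 0.0
    z = sum(prior[h] * like[h] for h in prior)
    return {h: prior[h] * like[h] / z for h in prior}

# Ten subset-compatible sightings and no superset-only forms: the subset grammar wins.
print(posterior(["ab", "abab"] * 5))
```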

And it’s worse still, because there is an alternative. Here’s CY (18):

The alternative strategy is a positive one, as it exploits the distributional similarities … Under this approach, the over-hypothesis is never available to the learner, and there is no need to rule it out.

So, frame the problem well (i.e. adopt the right idealization) and you point yourself in the right direction (i.e. by avoiding dealing with problems that the wrong idealization generates).

As CY notes, none of these arguments are “decisive.” Arguments against idealizations never are (though the ones CY presents and that I have rehearsed wrt Bayes in the last several posts seem to me pretty close to dispositive). But they are important. Like all matters scientific, idealizations need to be defended. One way to defend them is to note that they point to the right kinds of problems and suggest the kinds of solutions we ought to explore. If an idealization consistently points in the wrong direction, then it’s time to chuck it. It’s worse than false, it is counter-productive. In the domain of language, whatever the uses of the technology Bayes makes available, it looks like it is misleading in every possible way. The best that we seem to be able to say for it is that if we don’t take any of its claims seriously then it won’t cause too much trouble. Wow, what an endorsement. Time to let the thing go and declare the “revolution” over. Let’s say this loudly and all together: Bye bye Bayes!





[1] It is worth noting that the Dresher-Kaye system was pretty small, about 10 parameters. Even in this small system, the subset principle proved to be idle.
[2] In fact, it might be worse in this case. The Bayes maneuver generally circumvents the tractability issue by looking for algorithms that can serve to “update” the hypotheses without actually directly updating them. For INE we will need cheap algorithms to generate the required sets and then compare them. Do such quick and dirty algorithms exist for generation and comparison of the extensions of hypotheses?

Monday, September 15, 2014

Computations, modularity and nativism

The last post (here) prompted three useful comments by Max, Avery and Alex C. Though they appear to make three different points (Max pointing to Fodor’s thoughts on modularity, Avery on indirect negative evidence and Alex C on domain specific nativism) I believe that they all end up orbiting a similar small set of concerns. Let me explain.

Max links to (IMO) one of Fodor’s best ever book reviews (here). The review brings together many themes in discussing a pair of books (one by Pinker, the other by Plotkin). It outlines some links between computationalism, modularity, nativism and Darwinian natural selection (DNS). I’ll skip the discussion of DNS here, though I know that there will be many of you eager to battle his pernicious and misinformed views (not!).  Go at it.  What I think is interesting given the earlier post is Fodor’s linking together of computationalism, modularity and nativism.  How do these ideas talk to one another? Let’s start by seeing what they are.

Fodor takes computationalism to be Turing’s “simply terrific idea” about how to mechanize rationality (i.e. thinking). As Fodor puts it (p. 2):

…some inferences are rational in virtue of the syntax of the sentences that enter into them; metaphorically, in virtue of the ‘shapes’ of these sentences.

Turing noted that, wherever an inference is formal in this sense, a machine can be made to execute the inference. This is because…you can make them [i.e. machines NH] quite good at detecting and responding to syntactic relations among sentences.

 And what makes syntax so nice? It’s LOCAL. Again as Fodor puts it (p. 3):

…Turing’s account of computation…doesn’t look past the form of sentences to their meanings and it assumes that the role of thoughts in a mental process is determined entirely by their internal (syntactic) structure.

Fodor continues to argue that where this kind of locally focused computation is not available, computationalism ceases to be useful.  When does this happen? When belief fixation requires the global canvassing and evaluation of disparate kinds of information, all of which have variable and very non-linear effects on the process. Philosophers call this ‘inference to the best explanation’ (IBT) and the problem with IBT is that it’s a complete and utter mystery how it gets done.[1]

[often] your cognitive problem is to find and adopt whatever beliefs are best confirmed on balance. ‘Best confirmed on balance’ means something like: the strongest and simplest relevant beliefs that are consistent with as many of one’s prior epistemic commitments as possible. But as far as anyone knows, relevance, strength, simplicity, centrality and the like are properties, not of single sentences, but of whole belief systems: and there’s no reason at all to suppose that such global properties of belief systems are syntactic.[2]

And this is where modularity comes in; for modular systems limit the range of relevant information for any given computation, and limiting what counts as relevant is critical to allowing one to syntactify a problem and allow computationalism to operate.  IMO, one of the reasons that GG has been a doable and successful branch of cog sci is that FL is modular(ish) (i.e. that something like the autonomy of syntax is roughly correct).  ‘Modular’ means “largely autonomous with respect to the rest of one’s cognition” (p. 3). Modularity is what allows Turing’s trick to operate. Turing’s trick, the mechanization of cognition, relies on the syntactification of inference, which in turn relies on isolating the formal features that computations exploit.
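To make Turing’s trick concrete, here is a minimal sketch (my toy example, not Fodor’s or Turing’s): modus ponens implemented purely over the “shapes” of sentences, with no access to what they mean. This is the sense in which formal inference can be mechanized.

```python
# Syntactic inference: modus ponens applied in virtue of sentence shape alone.
def modus_ponens(premises):
    """From 'if P then Q' and 'P', derive 'Q', matching on form, not meaning."""
    conclusions = set()
    for p in premises:
        if p.startswith("if ") and " then " in p:
            antecedent, consequent = p[len("if "):].split(" then ", 1)
            if antecedent in premises:
                conclusions.add(consequent)
    return conclusions

print(modus_ponens({"if it rains then the grass is wet", "it rains"}))
# -> {'the grass is wet'}
```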

All of which brings us (at last!) to nativism.  Modularity just is domain specificity.  Computations are modular if they are “more or less autonomous” and “special purpose” and “the information [they] can use to solve [cognitive problems] are proprietary” (p. 3).  So construed, if FL is modular, then it will also be domain specific. So if FL is a module (and we have lots of apparent evidence to suggest that it is) then it would not be at all surprising to find that FL is specially tuned to linguistic concerns, that it exploits and manipulates “proprietary information,” and that its computations were specifically “designed” to deal with the specific linguistic information it worries about.  So, if FL is a module, then we should expect it to contain lots of domain specific computational operations, principles and primitives.

How do we go about investigating the if-clause immediately above?  It helps to go back to the schema we discussed in the previous post. Recall the general schema in (1) that we used to characterize the relevant problem in a given domain, ‘X’ ranging over different domains.  (2) is the linguistic case.

(1)  PXD -> FX -> GX
(2)  PLD -> FL -> GL

Linguists have discovered many properties of FL.  Before the Minimalist Program (MP) got going, the theories of FL were very linguistically parochial. The basic primitives, operations and principles did not appear to have much to say about other cognitive domains (e.g. vision, face recognition, causal inference). As such it was reasonable to conclude that the organization of FL was sui generis. And to the degree that this organization had to be taken as innate (which, recall, was based on empirical arguments about what Gs did), to that degree we had an argument for innate domain specific principles of FL.  MP has provided (a few) reasons for thinking that earlier theories overestimated the domain specificity of FL’s organization. However, as a matter of fact, the unification of FL with other domains of cognition (or computation) has been very, very, very modest.  I know what I am hoping for and I try not to confuse what I want to be true with what we have good reason to believe is true. You should too. Ambitions are one thing, results quite another. How might one go about realizing these MP ambitions?

If (1) correctly characterizes the problem, then one way of arguing against a dedicated capacity is to show that for various values of ‘X,’ FX is the same. So, say we look at vision and language: were FL = FV we would have an argument that the very same kind of information and operations are cognitively at play in both vision and language.  I confess that stating things this baldly makes it very implausible that FL does equal FV, but hey, it’s possible. The impressive trick would be to show how to pull this off (as opposed to simply expressing hopes or making windy assertions that this could be done), at least for some domains. And the trick is not an easy one to execute: we know a lot about the properties of natural language Gs, and we want an FL that explains these very properties. We don’t want a unification with other FXs that trades this hard-won knowledge for some mushy kind of “unification” (yes, these are scare quotes) that sacrifices the specifics we have worked so hard to establish (yes Alex, I’m talking to you). An honest appraisal of how far we’ve come in unifying the principles across modules would conclude that, to date, we have very few results suggesting that FL is not domain specific. Don’t get me wrong: there are reasons to search for such unifications and I for one would be delighted if this happens. But hoping is not doing and ambitions are not achievements. So, if FL is not a dedicated capacity, but is merely the reflection of more general cognitive principles, then it should be possible to find FL being the same as some FX (if not vision, then something else) such that this unified FX’ (i.e. one which encompasses FL and FX) can derive the relevant Gs with all their wonderful properties given the appropriate PLD. There’s a Nobel prize awaiting such a unification, so hop to it.[3]

It is worth noting that there is tons of standard variety psycho evidence that FL really is modular with respect to other cognitive capacities.  Susan Curtiss (here and here) reviews the wealth of double dissociations between language and virtually any other capacity you might be interested in. Thus, at least in one perfectly coherent sense, FL is a module and so a dedicated special purpose system. Language competence swings independently of visual acuity, auditory facility, IQ, hair color, height, vocab proficiency, you name it. So if one takes such dissociations as dispositive (and they are the gold standard) then FL is a module with all that this entails.

However, there is a second way of thinking about what unification of the cognitive modules consists in, and this may be the source of much (what I take to be) confused discussion. In particular, we need to separate out two questions: ‘Is FL a module?’ and ‘Does FL contain linguistically proprietary parts/circuits?’ One can maintain that FL is a module without also thinking that its parts are entirely different from those in every other module.  How so? Well, FL might be composed from the same kinds of parts present in other modules, albeit put together in distinctive ways. Same parts, same computations, different wiring. If this were so, then there would be a sense in which FL is a module (i.e. it has special distinctive proprietary computations etc.), yet when seen at the right grain it shares many (most? all?) of its basic computational features with other domains of cognition. In other words, it is possible that FL’s computations are distinctive and dedicated, and yet that they are built from the same simple parts found in other modules. Speaking personally, this is how I now understand the Minimalist Bet (i.e. that FL shares many basic computational properties with other systems). 

This is a coherent position (which does not imply it is correct). At the cellular level our organs are pretty similar. Nonetheless, a kidney is not a heart, and neither is a liver or a stomach.  So too with FL and other cognitive “organs.”  This is a possibility (in fact, I have argued in places that this is also plausible and maybe even true). So, seen from the perspective of the basic building blocks, it is possible that FL, though a separate module, is nonetheless “just like” every other kind of cognition. This version of the “modularity” issue asks not whether FL is a domain specific dedicated system (it is!), but whether it employs primitive circuits/operations proprietary to it (i.e. not shared with other cognitive domains). Here ‘domain specific’ means using basic operations not attested in other domains of non-linguistic cognition.

Of course, the MP bet is easy to articulate at a general level. What’s hard is to show that it’s true (or even plausible).  As I’ve argued before, collecting on this bet requires, first, reducing FL’s internal modularity (which in turn requires showing that Binding, movement, control, agreement, etc. are really only apparently different) and, second, showing that this unification rests on cognitively generic basic operations.[4] Believe me when I tell you that this program has been a hard sell.

Moreover, the mainstream Minimalist position is that though this may be largely correct, it is not exactly right: there are some special purpose linguistic devices and operations (e.g. Merge), which are responsible for Gs’ distinctive recursive property. At any rate, I think the logic is clear so I will not repeat the mantra yet again.

This brings me to the last point I want to make: Avery notes that more often than not positive evidence relevant to fixing a grammatical option is missing from the PLD.  In other words, Avery notes that the PLD is in fact even more impoverished than we tend to believe. He rightly notes that this implies that indirect negative evidence (INE) is more important than we tend to think.  Now if he is right (and I have no reason to think that he isn’t), then FL must be chock-full of domain specific information. Why? Because INE requires a sharp specification of the options under consideration in order to be operative.  Induction that uses INE effectively must be richer than induction exploiting only positive data.[5] INE demands a more articulated hypothesis space, not a less articulated one. INE can compensate for poor direct evidence, but only if FL knows what absences it’s looking for! You can hear the dogs that don’t bark, but only if you are listening for barking dogs. If Avery’s cited example is correct (see here), then it seems that FL is attuned to micro variations, and this suggests a very rich system of very linguistically specific micro parameters internal to FL. Thus, if Avery is right, then FL will contain quite a lot of very domain specific information, and given that this information is logically necessary to exploit INE, it looks like these options must be innately specified and that FL contains lots of innate domain specific information. Of course, Avery may be wrong, and those who don’t like this conclusion are free (indeed urged) to reanalyze the relevant cases (i.e. to indulge in some linguistic research and produce some helpful results).
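Here is the logic in toy form (the parameter settings and forms are entirely hypothetical): a learner can count the absence of a form against a setting only if FL already specifies which forms each setting would license. That table of predictions is the innate, domain-specific knowledge doing the work, the barking dogs the learner has to be listening for.

```python
# Toy INE: absence is informative only relative to an articulated hypothesis space.
PREDICTED_FORMS = {
    "micro_parameter_A": {"form-1", "form-2", "form-3"},   # hypothetical setting
    "micro_parameter_B": {"form-1", "form-2"},             # hypothetical setting
}

def gaps_per_setting(attested):
    """Count, per setting, the predicted forms that are conspicuously missing."""
    return {setting: len(forms - attested)
            for setting, forms in PREDICTED_FORMS.items()}

# If only form-1 and form-2 are ever heard, setting B leaves no gap unexplained,
# while setting A predicts a form the child never encounters.
print(gaps_per_setting({"form-1", "form-2"}))
```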

This is a good place to stop.  There is an intimate connection between modularity, computationalism, and nativism. Computations can only do useful work where information is bounded. Bounded information is what modules provide. More often than not the information that a module exploits is native to it. MP is betting that with respect to FL, there is less language specific basic circuitry than heretofore assumed. However, this does not imply that FL is not a module (i.e. part of “general intelligence”). Indeed, given the kinds of evidence that Curtiss reviews, it is empirically very likely that FL is a module. And this can be true even if we manage to unify the internal modules of FL and demonstrate that the requisite remaining computations largely exploit domain general computational principles and operations. Avery’s important question remains: how much acquisition is driven by direct and how much by indirect negative evidence? Right now, we don’t really know (at least not to the level of detail that we want). That’s why these are still important research topics.  However, the logic is clear, even if the answers are not.



[1] Incidentally, IBT is one of the phenomena that dualists like Descartes pointed to in favor of a distinct mental substance. Dualism, in other words, is roughly the observation that much of thought cannot be mechanized.
[2] It’s important to understand where the problem lies. The problem is not giving a story in specific cases in specific contexts. We do this all the time. The problem is providing principles that select out the IBT antecedent to a specification of the contextually relevant variables. The hard problem is specifying what is relevant ex ante.
[3] Successful unifications almost always win kudos. Think electricity and magnetism, the unification of those two with the weak force, terrestrial and celestial mechanics, chemistry and mechanics. These all get their own chapters in the greatest hits of science books. And in each case, it took lots of work to show that the desired unification was possible. There is no reason to think that cognition should be any easier.
[4] I include generic computational principles here, so-called third factor computational principles.
[5] In fact, if I understand Gold correctly (which is a toss up), acquiring modestly interesting Gs strictly using induction over positive data is impossible.