Wednesday, December 3, 2014

Anatomy of a POS argument; remarks on Jeff's PISH

Those who have not checked out the discussion thread to Jeff’s post here should take a look. It is both enlightening and entertaining. As a public service, I would like to dissect the logic of a POS argument one more time using Jeff’s post as illustration and then project some of the critiques of Jeff’s argument against this backdrop. I do this, of course, because I believe that they generally fail to grapple with the issues POSs make vivid. And like Tolstoy’s unhappy families, they each fail in their own way. So, let’s begin by carefully outlining the steps in a POS argument. Here’s the recipe.

Step 1: Any POS argument begins with a data set consisting of acceptability judgments. This set always consists of both good and bad examples: sentences acceptable (or not) under a certain interpretation. Jeff’s is no different. His consists of eight sentences, the relevant datum being whether ‘Norbert’ can antecede ‘himself’ in each sentence and the important contrasts being the ones in (2):

(1)  a.    Norbert remembered that Ellen painted a picture of herself
      b. * Norbert remembered that Ellen painted a picture of himself
      c.    Norbert remembered that Ellen was very proud of herself
      d. * Norbert remembered that Ellen was very proud of himself

(2)  a.    Norbert remembered which picture of herself Ellen painted
      b.    Norbert remembered which picture of himself Ellen painted
      c.    Norbert remembered how proud of herself Ellen was
      d. * Norbert remembered how proud of himself Ellen was

The first kind of objection to a POS argument arises at this point: the data are mis-presented. This objection is beloved of psychologists and some CSers. They note the cavalier attitude that linguists have of asking their friends what they think of such and such sentences, believing this to be a decent form of data collection. Shocked at this, critics dismiss the reported facts and/or deny that they have been rightly described (as not gradient enough, for example). [1]

This kind of objection can, in this particular case, be easily parried. Colin notes that the data have been psycho-linguistically vetted (using our favorite 7-point rating system and administered in the whitest of white lab coats) and that the data are exactly as presented (here) (nor, btw, are the data in this case particularly gradient, as Colin also observes).[2] This should be no surprise given Sprouse and Almeida’s repeated demonstrations that acceptability data are rather robust and consistent, but given the unbounded energy of the skeptic, it is nice to show that the “informal” yet perfectly serviceable techniques of the average linguist can be backed up by the more formal obsessive techniques of modern psycho-linguistics. So, the data are sound.

Step 2: The second move in any good POS argument is to ask whether the distinctions found in the data might themselves have an inductive basis in the input; in other words, whether speakers judge as they do because they are faithfully tracking the information they have been exposed to.[3]

Recall that a good POS argument loads the dice against itself. The argument is a tool to probe the structure of FL. The default assumption is that if there is a decent inductive basis in the Primary Linguistic Data (PLD: the data that the child has available to it in learning its G) for learning the facts exemplified in the data set of interest (i.e. (1) and especially (2) above), then we refrain from concluding anything about the innate structure of FL from the argument. This is a pretty high bar, for it does not follow from the fact that there is data in the PLD relevant to learning the facts that such data is in fact so used. Nonetheless, this is what is generally assumed, so as to make it a bit harder for the enthusiastic Rationalist to jump to an interesting nativistic conclusion.

Being a good fabricator of POSs, Jeff presents evidence that the sentences of interest (those in (2)) are unattested in the PLD. Thus, whatever differences native English speakers register regarding them cannot be based on inspecting instances of these sentences.[4] How did Jeff establish this? Well, he looked at the CHILDES database (where he found no examples of these sentences) and did a Google search (where again he came up empty-handed[5]). So, this is pretty good evidence that the contrasting judgments in (2) (ones that native speakers make) do not reflect induction from sentences like (2) that native speakers hear. But if not driven inductively by the data, it is reasonable to conclude that the knowledge reflected in the differential judgments reflects some innate feature of the LAD.
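For the concretely minded, here is a rough sketch of the kind of corpus check just described. It is an illustration only: the corpus path and search patterns are stand-ins of mine, not Jeff’s actual procedure.

    # Illustrative sketch (hypothetical paths/patterns, not Jeff's actual
    # procedure): scan CHILDES .cha transcripts for reflexives inside
    # fronted wh-phrases of the (2) type.
    import re
    from pathlib import Path

    PATTERNS = [
        re.compile(r"\bwhich \w+ of (him|her|it|my|your|them)sel(f|ves)\b", re.I),
        re.compile(r"\bhow \w+ of (him|her|it|my|your|them)sel(f|ves)\b", re.I),
    ]

    hits = []
    for cha in Path("childes/").rglob("*.cha"):   # hypothetical corpus root
        for line in cha.read_text(errors="ignore").splitlines():
            if line.startswith("*"):              # CHILDES speaker tiers start with '*'
                if any(p.search(line) for p in PATTERNS):
                    hits.append((cha.name, line.strip()))

    print(f"{len(hits)} candidate utterances")    # the empirical claim: this comes out 0

The same patterns, pointed at a web-scale corpus, implement the Google half of the check.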

Note the weakness of this conclusion. No mechanism has been cited as the source of the knowledge. Rather, the conclusion is that whatever the right explanation, it cannot be based on direct induction from the proffered data, for there is no induction to be had as there is no data (in the PLD and beyond) to induce from. In other words, the data is poor wrt the knowledge displayed in this particular data set.

Note that this step of the argument can be contested by arguing that the PLD is replete with the relevant sentences (i.e. that it is in fact richer than it appears to be). This is a favorite kind of retort (especially beloved by Pullum). But absent this (and remember that Jeff has done the relevant empirical legwork required to (IMO, strongly) argue that there is nary a soupçon of data like (2) in the PLD), a POS problem has been identified.

Before proceeding, let me stress how important it is for step 2 of the argument to distinguish PLD from the more expansive linguistic data (LD). LD includes examples like (2); PLD does not. LD is what the linguist uses to argue for one or another proposal. PLD is what the child uses to construct its G. Kids are not little linguists, for they do not mine the same data stream as linguists do. They have access to a vastly truncated set of data points, truncated both in quantity and quality.

Let’s beat this horse good and dead. The problem is (at a minimum) to explain how the LAD projects from PLD to LD. In a well-constructed POS, like Jeff’s, it is possible to show that the PLD contains no trace of the relevant data of interest. Of course, by assumption, the LD does, as the examples in (2) are part of the LD but not part of the PLD. If so, the “learning” problem is to explain how to get from PLD containing no instances of the relevant data to LD that does contain it. In other words, the problem is not merely to show how to go from a subset of the data to the full data set (i.e. how to generalize from a small number of cases to a larger number of cases), but how to go from certain informationally deficient subsets of the data to the qualitatively richer “full” data. In particular, in this case, how to go from PLD that likely contains sentences like (1) but none like (2) to LD that contains both, as schematized below.[6]
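Schematically (my own encoding, not Jeff’s), the projection problem looks like this:

    # Schematic restatement (my encoding): the attested PLD contains
    # sentences like (1) but nothing like (2), yet the attained G must
    # classify all of (2) correctly.
    PLD_attested = ["(1a)", "(1c)"]          # roughly the kind of thing the child hears
    target = {"(2a)": "ok", "(2b)": "ok", "(2c)": "ok", "(2d)": "*"}

    unattested = [s for s in target if s not in PLD_attested]
    print("judgments required but never exemplified in PLD:", unattested)
    # Nothing in PLD_attested distinguishes (2c) from (2d); that gap is the POS.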

With that off my chest, let’s return to the main line of exposition. The next step in a POS is to use the conclusion established at step 2 to investigate the structure of the LAD.

There are two options to investigate. The first is that the contrast in (2) itself reflects innate features of the LAD. The second is that though the contrast itself is not learned, it is based on G features that have themselves been learned. The only way to address either question is via a grammatical analysis of the contrast in (2). In other words, we need a theory of this particular data set. So, on to the third step.

Step 3: So we have a data set and we have evidence that the pattern it manifests is absent from the PLD, so now we look for a way to explain the pattern. It goes without saying that explanations are based on assumptions. Nonetheless, let me say it out loud, shouting with all my considerable wind power: “explanations are based on assumptions” and “this necessarily makes them theory laden” (btw, stentorian announcements along these lines a couple of times a day are good for your methodological health). Jeff (following Huang) offers one possible explanation for the data in (2). IMO, Jeff’s/Huang’s story is a very conservative proposal in the context of GBish analyses (actually most extant GG analyses, as I note below). How so?

Well, first, it takes as given a standard version of Principle A of the Binding Theory (A/BT), and notes that PISH, when added to A/BT, suffices to derive the relevant data in (2). The assumption that something like A/BT holds of Gs is very anodyne. Virtually every theory of binding within GG (not just GB) has the functional equivalent of A/BT (though these have been coded in various ways). Thus, it is fair to say that accepting Principle A as part of the background assumptions is pretty innocuous. Of course this does not mean that making such an assumption is correct. Maybe A/BT is not part of the G of English. But I really doubt this. Why? Well, there is plenty of evidence in its favor.[7] But if someone wants to challenge its adequacy, the right way to go about this is well understood: present a better alternative. The data are well known (as are the exceptions), so if you don't like A/BT, your job is to present an alternative and argue that it does a better job. Good luck![8]

Second, the ingredients for PISH themselves rest on what appear to be very weak assumptions.[9] They number four: (i) ‘proud of himself’ is a predicate while ‘picture of himself’ is an argument, (ii) ‘Ellen’ in (2) is an argument of ‘proud of himself’ while ‘Ellen’ is not an argument of ‘picture of himself,’ (iii) for an argument to be marked as an argument of a predicate (i.e. theta-marked) it must be in the projection of the predicate (at DS in GB), and (iv) predicate/argument information is preserved in the course of the derivation (in GB trace theory does this, in MP copies). None of these assumptions seems particularly heroic in the current theoretical climate and most theories of GG that I know of adopt some version of each. Of the four, (iii) is probably the most controversial (and it is the heart of the PISH proposal), but conceptually it simply extends to external arguments the common assumption for internal arguments.[10] (i-iii) lead to the conclusion that ‘Ellen’ is inside the projection of ‘proud of himself’ (but not ‘picture of himself’) before moving to Spec T to be case marked. This, in conjunction with principle A/BT (and some version of trace theory, be it semantic or syntactic), suffices to explain the pattern of data in (2), as the sketch below illustrates.[11]
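To make the division of labor concrete, here is a toy sketch (mine, and drastically simplified) of how PISH feeds Principle A under assumptions (i)-(iv):

    # Toy sketch (mine): PISH decides whether a copy of 'Ellen' sits inside
    # the fronted phrase; Principle A then asks whether the anaphor finds a
    # gender-matching antecedent in its binding domain (the smallest XP
    # containing the anaphor and a subject).

    def principle_A_ok(anaphor_gender, domain_antecedents):
        """Principle A (toy version)."""
        return any(g == anaphor_gender for _, g in domain_antecedents)

    # 'proud of X' is a predicate: by PISH, Ellen's copy sits inside the AP,
    # so the AP is the binding domain and Ellen is the only antecedent.
    # 'picture of X' is an argument: Ellen never occupies a position inside
    # it, so the domain extends to the clause, where Norbert is also available.
    domains = {
        "(2a) which picture of herself": ("fem",  [("Norbert", "masc"), ("Ellen", "fem")]),
        "(2b) which picture of himself": ("masc", [("Norbert", "masc"), ("Ellen", "fem")]),
        "(2c) how proud of herself":     ("fem",  [("Ellen", "fem")]),
        "(2d) how proud of himself":     ("masc", [("Ellen", "fem")]),
    }

    for sentence, (gender, antecedents) in domains.items():
        print(" ok" if principle_A_ok(gender, antecedents) else "  *", sentence)

Running this reproduces the judgments in (2): only (2d) fails, because PISH forces a copy of ‘Ellen’ into the fronted AP.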

To reiterate, this is a very conservative analysis of the data in (2): given the theoretical context (broadly construed), adding a conceptually modest PISH suffices to account for the data. Thus PISH is a good candidate explanation for the data in (2) given standard (i.e. non-exotic) assumptions. The explanation works and it is based on conceptually conservative assumptions. So, it’s a good story.

Let’s now take one further step: say for purposes of discussion that PISH plus the other assumptions is in fact the best explanation of the data in (2).[12] What conclusion would this license? That PISH is part of the G of English. In other words, the fact that PISH fits well with standard assumptions to get some new data, and by assumption does this better than competing proposals, licenses the conclusion that PISH is part of the grammar of English. And given this conclusion, we can now ask a more refined POS question: is PISH part of the structure of the LAD (a principle of UG or derived from a principle of UG) or is it “learned”? In other words, given that this proposal is true, what does it tell us about FL?

Many might object at this point that the refined POS question is very theory internal, relying as it does on all sorts of non-trivial assumptions (e.g. A/BT, structured headed phrases, the distinction between an argument and a predicate, some version of trace theory, etc.). And I would agree. But this is not a bug, it’s a feature. To object to this way of proceeding is to misunderstand the way the scientific game is played. Now repeat after me, and loudly: “All explanations are theory internal as all accounts go beyond the data.” The relevant question is not whether assumptions are being made, but how good these assumptions are. The PISH argument Jeff outlines is based on very well-grounded assumptions within GG (versions of which appear in many different “frameworks”). Of course, one may be skeptical about any results within GG, but if you are one of these people then IMO you are not really part of the present discussion. If the work over the last 60 years has, in your opinion, been nugatory, then nothing any linguist says to you can possibly be of interest (here’s where I would normally go into a rant about flat earthers and climate change deniers). This kind of sweeping skepticism is simply a refusal to play the game (here). Of course, there is no obligation to play, but then there’s no obligation to listen to those that so refuse.

To return to the POS, observe that the refined question above is not the same question we asked in Step 2 above. What step 2 showed us was that whatever the explanation of the data in (2), it did not come from the LAD’s generalizing from data like that in (2). Such data does not exist. So if PISH+A/BT+… is the right account of (2), this account has not been induced from data like (2). Rather, the explanation must be more indirect, transiting through properties of Gs incorporating principles like A/BT and PISH. Indeed, if A/BT+PISH+… is the best theory we have for binding data like (2), and if we assume (as is standard in the sciences) that our best theories should be taken as (tentatively) true (at least until something better comes on the scene), then we are in a position to entertain a very refined and interesting POS question (in fact, formulating this question (and considering answers to it) is the whole point of POSs), one relevant to probing the fine structure of FL.

Step 4: Are A/BT and PISH innate features of the LAD or acquired features of the LAD?

Happily, this question can be addressed in the same way deployed at step 2. We can ask what data would be relevant to inducing A/BT and PISH etc. and ask whether this data was robustly available in the PLD. Let’s concentrate on half the problem, as this is what Jeff’s post concentrated on. Let’s assume that A/BT and the other ancillary features are innate features of FL (though if you want to assume they are learned, that’s also fine, though likely incorrect) and ask: what about PISH? The question then is, if PISH correctly describes the G of English, is it learned (i.e. induced from PLD) or innate (either an innate primitive or the consequence of innate primitives)?

Of course, to answer this question we need a specification of PISH, so here’s one which applies not just to subjects (external arguments) but to all arguments of a predicate.

PISH: A predicate P can mark X as an argument only if X is in the projection of P.

This means that it is a necessary condition for X being marked as an argument of a predicate P that X be in a structure more or less like [PP X…P…]. So, if a DP is an internal or external argument of a verb V or an adjective A then it must be contained in the VP or AP of that V or A.[13] So, can PISH so described be induced from the PLD? I doubt it. Why? Because there is very little evidence for this principle in surface forms. Movement operations regularly move expressions from their argument positions to positions arbitrarily far away (i.e. DS constituents are not SS constituents). In other words, in a vast array of cases, surface forms are a very poor reflection of thematic constituency. Thus, the local relation that PISH insists holds between predicates and their arguments is opaque in surface patterns. Or, said another way, surface patterns provide poor evidence that a principle like PISH (which requires that predicates and their arguments form constituents headed by these predicates) holds.[14]
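As a structural condition, PISH is nothing more exotic than a containment requirement. A minimal formalization (mine, not anything in Jeff’s post):

    # PISH as a containment check (my formalization): X can be theta-marked
    # by P only if X is dominated by P's maximal projection. Trees are
    # (label, children) pairs; words are strings.
    def contains(tree, word):
        if isinstance(tree, str):
            return tree == word
        _, children = tree
        return any(contains(c, word) for c in children)

    # PISH-compliant base: Ellen originates in Spec,AP.
    ap_with_subject = ("AP", [("DP", ["Ellen"]),
                              ("A'", [("A", ["proud"]), ("PP", ["of", "himself"])])])
    # PISH-violating alternative: Ellen generated outside the AP.
    ap_bare = ("AP", [("A", ["proud"]), ("PP", ["of", "himself"])])

    print(contains(ap_with_subject, "Ellen"))  # True: theta-marking licensed
    print(contains(ap_bare, "Ellen"))          # False: theta-marking would fail

The trouble for the inductivist, as just noted, is that surface strings do not wear this containment relation on their sleeve once movement has applied.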

Note, in addition, the modal ‘must’ in the PISH principle above. It is not enough that DPs optionally generate “low” in the domain of their thematic predicates. If PISH were only an option, then the unacceptable example in (2) should be fine, as there would be a derivation, one in which ‘Ellen’ was not generated within ‘proud of himself’, which would pattern like (2b).[15]
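In the toy terms used above, an optional PISH reopens exactly the derivation that must be blocked:

    # If PISH were optional (contra the 'must'), (2d) would have a derivation
    # with no copy of 'Ellen' inside the AP; the anaphor's binding domain
    # would then reach the matrix clause, wrongly licensing 'himself'.
    def principle_A_ok(anaphor_gender, domain_antecedents):
        return any(g == anaphor_gender for _, g in domain_antecedents)

    matrix_domain = [("Norbert", "masc"), ("Ellen", "fem")]
    print(principle_A_ok("masc", matrix_domain))  # True: wrongly predicts (2d) is fine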

My conclusion: if PISH is (part of) the right account of the data in (2), then PISH reflects properties of the built-in structures of FL and it is not induced from PLD. Indeed, I don’t see any real way to induce this principle, for it reflects no surface pattern in the data (but maybe others can be more imaginative). So, if PISH exists at all, then it is likely to be a reflection of the innate structure of FL.

As you can see, I find this line of reasoning persuasive. But perhaps I am wrong and PISH can be induced from surface forms. If you think that this is so, your job is to show how to do this. It’s pretty clear what such a demo would do: show how to induce the data in (2) from the PLD without using a principle like PISH.[16] Do it and we defuse the argument above. Again, good luck!

I should add that critics rarely try to do this. Certainly none of the critics of Jeff’s original post attempt anything like this. Rather, there are two or three reactions that we should now take a look at.

The first is that this kind of argument is very involved and so cannot be trusted. It relies on a number of intermediate steps, which quickly involve non-trivial assumptions. So, PISH relies on A/BT and on the idea that phrases are headed. To this one is tempted to reply: grow up! That’s the way science works. You build arguments piecemeal, relying on previous conclusions. You take conclusions of earlier work as premises for the next steps. There is no way around this, and not just in linguistics. Successful inquiry builds on past insight. These form the fixed points for the next steps. Denying the legitimacy of thus proceeding is simply to eschew scientific inquiry in the domain of language.[17] It amounts to a kind of methodological dualism that Chomsky has been right to excoriate.

Moreover, it misunderstands the logic of the challenge Jeff’s case provides. The challenge is not a theory internal one. Here’s what I mean. The problem is a standard induction problem. Everyone agrees that one covers all of LD by inducing Gs based on data drawn from a subpart of LD. The POS challenge above is to show how to do this using only PLD. This challenge does not hold us hostage to a particular conception of grammar or UG or FL. In this sense it is very neutral. All it requires is recognizing that what we have here is a problem of induction, and a specification of what’s in (and not in) the inductive base. Jeff’s solution uses theory-particular assumptions in responding to this challenge. But the challenge does not require buying into these assumptions. If, for example, you think that all human Gs are (roughly) CFGs, then show how to induce those CFGs that have the contrasts in (2) as outputs based on PLD that does not contain data like (2). So, though there is nothing wrong, IMO, in using GGish assumptions to solve the problem (given my view that GG has learned a lot about language in the last 60 years), if you think that this is all junk then take your favorite assumptions and show how to get from PLD to (2) using them. Put another way, what the PLD consists in in this case is not a complex theory internal matter, and all one needs to raise the POS question is agreement that Jeff’s characterization of the PLD in this case is more or less correct. So if you don’t like his solution, you now know what needs doing. So do it.

The second reaction is to reject POS conclusions because they overly encumber FL with domain-specific innate constraints. More concretely: PISH cannot be innate despite the POS argument, for there is no way it could have gotten into FL. This is a kind of Minimalist maneuver, and I personally find it somewhat compelling. But the right conclusion is not that the POS argument has failed; it is that there is yet another project worth embarking on: showing how to derive the linguistically specific principles from more general computational and cognitive ones. This is a project that I am very sympathetic to. But it is very different from the one that simply rejects conclusions that are not to its liking. Both Plato’s Problem and Darwin’s Problem are real. The right way to reconcile them is not to wish one or the other away.

Let me end this anatomy lesson here with some summary remarks. POSs have been and will continue to be invaluable tools for the investigation of the structure of FL. They allow us to parse our analyses of particular phenomena into those features that relate to structural features of FL and those that are plausibly traced to the nature of the input data. POS arguments live within a certain set of assumptions: they are tools for investigating the properties of generative procedures (Gs) and the mental structures that restrict the class of such procedures available to humans (UG). In this sense, answers to POS arguments are not assumption free. But who could have thought that they would be? Is there any procedure anywhere in the sciences that is assumption free? I doubt it. At any rate, POS is not one of these. However, POS problems are run-of-the-mill inductive ones where we have a reasonably good specification of what the inductive base looks like. This problem is neutral wrt an extremely wide range of possible solutions.

Three of the more important features of a POS are the following. First, the distinction between PLD and LD: the linguistic data the child has available to it versus the full range of structures that the attained G (i.e. generative procedure) generates. Whatever learning goes on uses PLD alone, though the G attained on the basis of this PLD extends far beyond it. Second, the data of interest includes both acceptable and unacceptable cases. We want to explain both what does and what doesn’t “exist.” Lastly, the objects that POSs investigate are Gs and UG. It is assumed that linguistic competence amounts to having acquired a generative procedure (G). POSs are then deployed to examine the fine structure of these Gs and parse their properties into those that expose the structural properties of FL and those that reflect the nature of the input. Like all tools, POSs will be useful to the degree that this is your object of inquiry. Now, many people do not think that this IS the right way of framing the linguistic problem. For them, this sort of argument will, not surprisingly, lose much of its appeal. Just another indication that you can’t please everyone all the time. Oh well.



[1] Thomas moots a version of this in the comments, though I believe that he is simply expressing a well-worn objection rather than expressing his own views (see here).
[2] As Colin notes there is nothing squishy about the data: “For what it’s worth, we tested the facts in Jeff’s paradigm in (2), and the ratings are uncommonly categorical (Omaki, Dyer, Malhotra, Sprouse, Lidz, & Phillips, 2007).”
[3] Even a hard boiled rationalist like me thinks that this is sometimes the right kind of account to give. Why do I put determiners before NPs rather than after them (as, e.g. in Icelandic)? Well because I have been exposed to English data and that’s what English does. Had I been exposed to Icelandic data I would do otherwise.
[4] This is important: the best POS notes a complete absence of the relevant data. There can be no induction in the absence of any data to induce over. That’s why many counterarguments try to find some relevant input that statistical processes can then find and clean up for use. However, when there really is no relevant data in the PLD, then the magic of statistical inference becomes considerably less relevant.
[5] Well, not quite: he did get hits from linguistic papers talking about the phenomenon of interest. Rashly, perhaps, he took this to indicate that such sentences were not available to the average LAD.
[6] It is worth remarking that the PLD/LD distinction is not likely to be one specifiable (i.e. definable) in general terms (though I suspect that a rough cut between degree 0+ and others is correctish). This, however, does not imply that the PLD/LD distinction cannot be determined in specific cases of interest, as Jeff demonstrates. The fact that there is no general definition of what data constitute the PLD does not imply that we cannot fix the character of the PLD in specific cases of interest. In this, PLD is like “coefficient of friction.” The value of the latter depends on the materials used. There is no general all-purpose value to be had. But it is easily determined empirically.
[7] There is a reason for this: A/BT is a generalization very close to the data. This is a pretty low level generalization of what we see in the LD.
[8] This is all obvious. I mention it to parry one kind of retort: “well it’s possible that A/BT is not right, so we have no reason to assume that it is.” This is not an objection that any right thinking person should take seriously.
[9] Ewan makes a similar point here and here.
[10] In GB this is the assumption that predicates govern what they theta mark. Analogous assumptions are made in other “frameworks.”
[11] Actually, we need also assume that displacement targets constituents. Again, this is hardly a heroic assumption.
[12] There are other accounts out there (e.g. Heycock’s reconstruction analysis) but the logic outlined here will apply to the other proposed analyses I know and lead to similar qualitative conclusions concerning the structure of FL.
[13] Or in MP terms, it must have merged with whatever theta marks it.  Note that this is also a standard assumption in the semantic literature where arguments saturate a predicate via lambda conversion.
[14] This DS/SS disconnect is what lay behind the earliest arguments for innate linguistic structure. That’s why Deep Structure was such a big deal in early GG theory.
[15] David Basilico (here and following) makes a very relevant observation concerning Existential Constructions and whether they might help ground PISH in the PLD rather than FL/UG. His observations are very relevant, though for reasons I mention in the thread, and given the ‘must’ nature of PISH, I think that there is still some slack that needs reeling in by UG. That said, his point is a good and fair one.
[16] Actually this would be a good first step. In addition, one would hope that the principles used to license the induction would be equally well grounded; i.e. at least as good as PISH. But for now, I would be happy to forgo this additional caveat.
[17] As I’ve noted before, there are a group of skeptics that don’t really think much of work in GG at all. I actually have nothing at all to say to them as IMO they are not persuadable, at least by the standards of rational inquiry that I am acquainted with. For a recent expression of this see Alex D here. I completely agree with the sentiment.

11 comments:

  1. There is this idea of "direct induction" which I think is maybe a bit misleading... any inductive algorithm has some bias, so the generalisation always arises out of an interaction between the biases of the learner and the data. Even if the learner is just memorising the data, that is a bias too -- a bias towards the most conservative hypothesis.

    I get the intuition that learning, say, that determiners are always (almost always .. there is "galore") before the noun in English is easy to do, but that reflects a bias of a learner that pays attention to the surface order of strings. A different and maybe simpler algorithm might assume that the surface order was irrelevant, so that if "the cat" is grammatical then so is "cat the".

    If you follow this line a bit further I think you come to the conclusion that there is no such thing as direct induction, there are just inductive processes (learning algorithms) of various types. It is easier to see how a simple learning algorithm that pays attention to surface order can work, and harder to see how learning algorithms that work with derivation trees of MCFGs can work, but I don't see any very great difference between them, as the surface order is just the derivation tree of a regular grammar (which is an MCFG of a special type; rules like N(au) := P(u)).

    1. @ Alex:
      Your comment reminds me of a Yogi Berra insight: In theory, there's no difference between theory and practice, in practice there is. Let's consider some urn scenarios to pump intuitions:

      A chooser is asked to say (bet?) whether urn X has more black balls than white ones. S/he takes 100 pulls from the urn with replacement etc. After 100 pulls there are 99 black balls and 1 white one. S/he is asked to bet. I'm pretty sure I know what s/he would bet and why. As do you.
      Scenario B: Same thing, but after 100 pulls s/he is asked to say (bet?) whether urn Y has more black balls than pink ones. S/he has sampled nothing from urn Y, only urn X. Whatever s/he does here is different from what s/he did with urn X. Maybe something like a principle of indifference drives the choice, maybe something else, but it is very different. Now urn Z: s/he again pulls 100 times and again gets 99 black, 1 white. Then an undisclosed number of pink balls are added to Z and, without any further selection, s/he is asked to say if there are more black balls than pink ones. Again, whatever s/he does, based on whatever principle you choose, is different from the first case.

      The moral: I think that the first case is what we have in the determiner example. I think that the 2nd and 3rd cases are roughly what we have in many linguistic cases (e.g. Islands, ECP, PISH etc). Imagine further that after sampling urn X our chooser always answered rightly wrt any other urn with any other colors. Would we say that s/he got it right BECAUSE of his/her samplings of urn X? And if s/he did get it right, wouldn't we say it's because there is some non-trivial causal relation between the urns that our chooser "knows", e.g. s/he knows something about all color choices other than white and black? I'm sure you get the point.

      These cases are different "in practice" regardless of whether we can mathematically assimilate them to the same thing. In fact, I would go further: if we do mathematically assimilate them to the same thing, then all this means is that the mathematical perspective on the problem is unilluminating.
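      To put rough numbers on the contrast (a toy gloss of mine, nothing more):

        # Urn X supports ordinary induction; urns Y and Z give the chooser
        # nothing relevant to induce from.
        black, white = 99, 1
        # Urn X with a flat Beta(1,1) prior: posterior mean of P(black).
        posterior_black = (black + 1) / (black + white + 2)
        print(f"urn X: P(black) ~ {posterior_black:.3f}")  # ~0.980 -- bet black

        # Urns Y and Z: zero draws bearing on black vs. pink. Whatever number
        # the chooser assigns (e.g. 0.5 by indifference) comes from the prior
        # alone, not from any sampling.
        print("urns Y/Z: no relevant samples; any answer is prior-driven")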

    2. @Norbert:
      Alex presented the established view according to which everything the learner does is revealing of a bias; here it doesn't even make sense to ask whether there is some innate aspect to learning---of course there is!
      As soon as one starts to formalize learning settings, the memorizing learner emerges as a logical possibility, on a par with others. One can surely decide that purely memorizing learners are not interesting, and try to exclude them from particular learning settings (by perhaps requiring that the learner's memory is finitely bounded, etc), and people have done this.
      However, you seem to want memorization (or, more generally, easy/surface-based/??? strategies) to be treated as qualitatively different from `real' generalization. Why? I find your analogy between urns and language intuitive, but I don't understand why you want more than a quantitative distinction.

    3. I'm glad you find this intuitive. And I am not looking for a qualitative distinction so much as a meaningful one. These are two different kinds of induction problems. Linguists have distinguished them as relevantly different, looking for different kinds of biases. Whether they are different ends of a quantitative continuum or qualitatively different is of no concern to me. But they should be distinguished, for they ask for different modes of explanation and investigation. Anything that reduces them to the same thing, therefore, is IMO ipso facto the wrong way to look at things. That's what I tried to bring out with the urn examples. One can call all these cases examples of generalization from the data, but the inductive grounds and principles relied upon are different. The last two use something like the principle of insufficient reason, if they use anything at all. The first does not.

      Maybe an analogy will serve. Being a sure thing and not is the difference between probability 1 and probability less than 1. Is this a quantitative or qualitative difference? My Bayesian friends think the former, for setting priors to probability 1 or 0 creates all sorts of problems, or so I am told. So, qualitative or quantitative? Is the question even well posed? And who cares, so long as the difference is recognized and distinguished, something that I did not take Alex to be doing.

    4. Let me tweak your urn example a little bit to make my point.
      Suppose you are observing a sequence of black and white balls, and you
      see the sequence of 99 balls WBBWBB .... WBBWBB so WBB repeated 33 times, and you are asked to guess what the next ball will be.
      Well one learner might assume that they are drawn from an urn with replacement and that the urn only contains black and white balls and say we have seen 33 white and 66 black so the probability of a black is 2/3.
      Or the learner might be detecting a repeating pattern in the data and say it will be white, W.
      Both of these are equally "direct"; they just depend on the assumptions they make about the generating process.

      So suppose we have a "direct" learner whose assumption is that the data is generated by ngrams or regular grammars or CFGs or MCFGs or MGs, or MGs with PISH, or some finite class of parametric grammars defined by transformational rules. Is that possible? Or does the idea of a direct learner only apply to, say, ngram models?
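      A small illustration (an added gloss, not part of the original comment) of the two equally "direct" learners on the WBB sequence:

        seq = "WBB" * 33   # the 99 observed balls

        # Learner 1: assumes i.i.d. draws with replacement; predicts by frequency.
        print(f"iid learner: P(next = B) = {seq.count('B') / len(seq):.2f}")  # 0.67

        # Learner 2: assumes a deterministic repeating pattern of period 3.
        print(f"pattern learner: next = {seq[len(seq) % 3]}")                 # W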

    5. No one doubts that there are different ways to extrapolate/generalize from data, which is what your tweaking points to. Yes, even when there is data to generalize from, this can be done in endlessly many ways. We have Hume and Goodman to thank for bringing this so vividly to our attention (even Wittgenstein in his foundations of math book). Yes, this is very important. BUT, let me say this again, BUT: the reason I pointed you to my cases is that they are different from these. In the latter two cases we are not generalizing from the data at all. There is no pattern we are trying to discern, as there are no relevant samplings of the relevant urns. That was the point. And in these cases we are doing something different. However (and this was the point), it looks like many POS cases are just like these second two cases. Jeff presented one like this. There are many, many more. So what we need is to divide the relevant cases into two piles: those where we are NOT inducing from data and those where we are. For the former cases it looks like we are learning something (I would say) about the nature of the hypothesis space; in the latter, something about the natural inductive dimensions relevant to the problem (i.e. the strength and shape of priors). I am sure that things can be cut up differently. However, these clearly seem to me like different kinds of cases and mushing them together is not very useful.

      Last point: a standard reply to a POS is that the kind of case Jeff points to, where there is 0 data in the PLD to generalize from (like the last two urn cases), doesn't really exist. My point is that such cases do exist, in fact they do in spades. And missing this will only lead one off in entirely irrelevant directions.

    6. I think the problem with the argument you are making is precisely this false dichotomy between inductive processes which are very shallow and non-inductive processes. You rely on this when you say "there is 0 data in the PLD to generalize from". That only makes sense if you have some idea of a learning algorithm which needs data of type X to learn phenomenon Y.
      Then you can say, there is no data of type X in the PLD; therefore no inductive learning algorithm can learn Y from the PLD, therefore Y is innate.
      Have I got your logic right?

      So my question is meant to try to understand what you mean by "inducing from data". If we induce a minimalist grammar using standard Bayesian techniques is that "inducing from data"? Or does it only apply to ngram type techniques?

      (Side point: The multiple urn problem is interesting -- there are of course learning algorithms that can exploit structure between urns; say if we have seen 10 urns that all have 75% black balls and 25% white, then this might affect the expected probability we would give to a ball from a new urn being black. This would be a hierarchical Bayesian model, and another type of induction.)
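      A quick numerical gloss of that side point (mine, added for illustration): pooling across the observed urns shifts the expectation for a fresh, unsampled urn.

        urns = [(75, 25)] * 10   # (black, white) counts for 10 sampled urns
        b = sum(x for x, _ in urns)
        w = sum(y for _, y in urns)
        # The pooled rate serves as the prior mean for a new urn:
        print(f"expected P(black) for a new urn ~ {b / (b + w):.2f}")  # 0.75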

    7. I knew you'd say this. I even predicted this. So you know my answer and there is little reason to discuss this any further. When you come up with a way of getting any of the data linguists look at using the PLD that we have evidence is the pertinent kind, we can talk again. Till then, further discussion is not useful.

      As for your side point: this is really nutty. I set up the problem precisely so that there was NO relation between the urns. I set it up this way for the reasons I mentioned. The aside simply tries to assimilate a problem FOR induction to another problem OF induction. In other words, it denies that the problem as illustrated exists. And that was my point in the first place.

      BTW, Bill Idsardi and I have a forthcoming post that should present a semi-formal version of this problem.

    8. Maybe I have got the dialectic wrong, but if I think you are setting up a false dichotomy, then I do want to show that there is an example which is an alternative; so the urn example I gave seems the right move? It may not convince, of course.

      And once again, I am not trying to sell my way of doing things, I am just pointing out an implausible assumption in your argument. I don't need to solve the problem to point out an inadequacy in your argument or the solution you favour. And I am certainly not trying to diminish or ignore the problem -- it is a great problem! and much better than AUX inversion.

    9. You are rejecting the description of the problem as I gave it. Of course, the debate is how well my description models the actual situation. I am not arguing that it does, I was asserting that it did. The argument comes from Jeff's PISH example which, given his description of the PLD, suggests that this is a fair model of the problem. You may disagree. In fact I'm sure you do. Fine. The question is: is there a way of adjudicating this disagreement rationally? I'm not sure, but I think that there is. Bill and I will post something on one possible approach next week. Stay tuned. For now, I am happy if the point of our disagreement is clear, which I think it is.
