Those who have not checked out the discussion thread to
Jeff’s post here
should take a look. It is both enlightening and entertaining. As a public
service, I would like to dissect the logic of a POS argument one more time
using Jeff’s post as illustration and then project some of the critiques of
Jeff’s argument against this backdrop. I do this, of course, because I believe
that they generally fail to grapple with the issues POSs make vivid. And like
Tolstoy’s unhappy families, they each fail in their own way. So, let’s begin by
carefully outlining the steps in a POS argument. Here’s the recipe.
Step 1: Any POS argument begins with a data set of acceptability judgments. This set always contains both good and bad examples: sentences that are acceptable (or not) under a certain interpretation. Jeff’s is no different. His consists of eight sentences, the relevant datum being whether ‘Norbert’ can antecede ‘himself’ in each sentence and the important contrasts being the ones in (2):
(1)
a. Norbert remembered that Ellen painted a picture of herself
b. * Norbert remembered that Ellen painted a picture of himself
c. Norbert remembered that Ellen was very proud of herself
d. * Norbert remembered that Ellen was very proud of himself
(2)
a. Norbert remembered which picture of herself Ellen painted
b. Norbert remembered which picture of himself Ellen painted
c. Norbert remembered how proud of herself Ellen was
d. * Norbert remembered how proud of himself Ellen was
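(To fix ideas, here is a minimal sketch of how one might encode this data set. The Python and the record format are my own illustrative choices, not anything in Jeff’s post; the judgments themselves are just the ones in (1)-(2).)

```python
# A minimal, hypothetical encoding of the judgment data in (1)-(2).
# Each record: (example id, sentence, intended antecedent of the reflexive, acceptable?)
judgments = [
    ("1a", "Norbert remembered that Ellen painted a picture of herself", "Ellen",   True),
    ("1b", "Norbert remembered that Ellen painted a picture of himself", "Norbert", False),
    ("1c", "Norbert remembered that Ellen was very proud of herself",    "Ellen",   True),
    ("1d", "Norbert remembered that Ellen was very proud of himself",    "Norbert", False),
    ("2a", "Norbert remembered which picture of herself Ellen painted",  "Ellen",   True),
    ("2b", "Norbert remembered which picture of himself Ellen painted",  "Norbert", True),
    ("2c", "Norbert remembered how proud of herself Ellen was",          "Ellen",   True),
    ("2d", "Norbert remembered how proud of himself Ellen was",          "Norbert", False),
]

# The crucial contrast the argument turns on is (2b) vs (2d).
for ex_id, sentence, antecedent, ok in judgments:
    mark = "" if ok else "*"
    print(f"({ex_id}) {mark}{sentence}  [reflexive bound by {antecedent}]")
```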
The first kind of objection to a POS argument arises at this point: the data are misrepresented. This objection is beloved of psychologists and some CSers. They note the cavalier attitude that linguists have of asking their friends what they think of such and such sentences, believing this to be a decent form of data collection. Shocked at this, critics dismiss the reported facts and/or deny that they have been rightly described (not gradient enough, for example). [1]
This kind of objection is easily parried in this particular case. Colin notes that the data have been psycho-linguistically vetted (using our favorite 7 point rating system and administered in the whitest of white lab coats) and that the data are exactly as presented (here) (nor, btw, are the data in this case particularly gradient, as Colin also observes).[2]
This should be no surprise given Sprouse and Almeida’s repeated demonstrations that acceptability data are rather robust and consistent, but given the unbounded energy of the skeptic, it is nice to show that the “informal” but perfectly serviceable techniques of the average linguist can be backed up by the more formal, obsessive techniques of modern psycho-linguistics. So, the data are sound.
Step 2: The second move in any good POS argument is to see
whether the distinctions found in the data might themselves be the basis of the
judgments that native speakers make concerning the data. In other words,
speakers judge as they do because they are faithfully tracking the information
they have been exposed to.[3]
Recall that a good POS argument loads the dice against itself. The argument is a tool to probe the structure of FL. The default assumption is that if there is a decent inductive basis in the Primary Linguistic Data (PLD: data that the child has available to it in learning its G) for learning the facts exemplified in the data set of interest (i.e. (1) and especially (2) above), then we refrain from concluding anything about the innate structure of FL from the argument. This is a pretty high bar, for it does not follow from the fact that there is data in the PLD relevant to learning the pattern of interest that such data is in fact so used. Nonetheless, this is what is generally assumed, so as to make it a bit harder for the enthusiastic Rationalist to jump to an interesting nativistic conclusion.
Being a good fabricator of POSs, Jeff presents evidence that the sentences of interest (those in (2)) are unattested in the PLD. Thus, whatever distinctions native English speakers draw regarding them cannot be based on inspecting instances of these sentences.[4]
How did Jeff establish this? Well, he looked at the CHILDES database (where he found no examples of these sentences) and did a Google search (where again he came up empty handed[5]).
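For the curious, a check of this sort is easy to mock up. The sketch below is not Jeff’s actual procedure; the corpus file name and the regular expressions are illustrative assumptions of mine, meant only to show what “look for the critical configuration in child-directed speech” amounts to.

```python
import re

# Hypothetical path to a plain-text dump of child-directed utterances (e.g. from CHILDES).
CORPUS_FILE = "child_directed_speech.txt"  # assumed, for illustration only

# Rough patterns for the critical configurations in (2): a reflexive inside a
# fronted wh-phrase ("which picture of himself ...", "how proud of herself ...").
patterns = [
    re.compile(r"\bwhich\s+\w+\s+of\s+(himself|herself|themselves)\b", re.IGNORECASE),
    re.compile(r"\bhow\s+\w+\s+of\s+(himself|herself|themselves)\b", re.IGNORECASE),
]

hits = []
with open(CORPUS_FILE, encoding="utf-8") as f:
    for line_no, utterance in enumerate(f, start=1):
        if any(p.search(utterance) for p in patterns):
            hits.append((line_no, utterance.strip()))

print(f"Utterances matching the critical configurations: {len(hits)}")
for line_no, utterance in hits[:20]:
    print(line_no, utterance)
```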
So, this is pretty good evidence that the contrasting judgments in (2) (ones that native speakers make) do not reflect induction from sentences like (2) that native speakers hear. But if not driven inductively by the data, it is reasonable to conclude that the knowledge reflected in the differential judgments reflects some innate feature of the LAD.
Note the weakness of this conclusion. No mechanism has been cited as the source of the knowledge. Rather, the conclusion is that whatever the right explanation, it cannot be based on direct induction from the proffered data, for there is no induction to be had as there is no data (in the PLD and beyond) to induce from. In other words, the data is poor wrt the knowledge displayed in this particular data set.
Note that this step of the argument can be contested by arguing that the PLD is replete with the relevant sentences (i.e. that it is in fact richer than it appears to be). This is a favorite kind of retort (especially beloved by Pullum). But absent this (and remember that Jeff has done the empirical legwork required to argue, IMO strongly, that there is nary a soupçon of data like (2) in the PLD), a POS problem has been identified.
Before proceeding, let me stress how important it is for step 2 of the argument to distinguish PLD from the more expansive linguistic data (LD). LD includes examples like (2); PLD does not. LD is what the linguist uses to argue for one or another proposal. PLD is what the child uses to construct its G. Kids are not little linguists, for they do not mine the same data stream as linguists do. They have access to a vastly truncated set of data points, truncated both in quantity and quality.
Let’s beat this horse good and dead. The problem is (at a
minimum) to explain how the LAD projects from PLD to LD. In a well constructed
POS, like Jeff’s, it is possible to show that the PLD contains no trace of the
relevant data of interest. Of course, by assumption, the LD does, as the
examples in (2) are part of the LD but not part of the PLD. If so, the
“learning” problem is to explain how to get from PLD containing no instances of
the relevant data to LD that does contain it. In other words, the problem is
not merely to show how to go from a subset of the data to the full data set
(i.e. how to generalize from a small number of cases to a larger number of
cases), but how to go from certain informationally deficient subsets of the
data to the qualitatively richer “full” data. In particular, in this case, how
to go from PLD that likely contains sentences like (1) but none like (2) to LD
that contains both.[6]
This off my chest, let’s return to the main line of
exposition. The next step in a POS is to use the conclusion established at step
2 to investigate the structure of the LAD.
There are two options to investigate. The first is that the contrast in (2) itself reflects innate features of the LAD. The second is that though the contrast itself is not learned, it is based on G features that have themselves been learned. The only way to address either question is
via a grammatical analysis of the contrast in (2). In other words, we
need a theory of this particular data
set. So, onto the third step.
Step 3: So we have a data set and we have evidence that the
pattern it manifests is absent from
the PLD so now we look for a way to explain the pattern. It goes without saying
that explanations are based on assumptions. Nonetheless, let me say it out
loud, shouting with all my considerable wind power: “explanations are based on assumptions”
and “this necessarily makes them theory laden” (btw, stentorian announcements
along these lines a couple of times a day are good for your methodological health). Jeff (following Huang) offers one possible
explanation for the data in (2). IMO, Jeff’s/Huang’s story is a very
conservative proposal in the context of GBish analyses (actually most extant GG
analyses, as I note below). How so?
Well, first, it takes as given a standard version of Principle
A of the Binding Theory (A/BT), and notes that PISH when added to A/BT,
suffices to derive the relevant data in (2). The assumption that something like
A/BT holds of Gs is very anodyne.
Virtually every theory of binding within GG (not just GB) has the
functional equivalent of A/BT (though these have been coded in various ways).
Thus, it is fair to say that accepting Principle A as part of the background assumptions is pretty innocuous. Of course, this does not mean that making such an assumption is correct. Maybe A/BT is
not part of the G of English. But I really doubt this. Why? Well, there is
plenty of evidence in its favor. [7]
But if someone wants to challenge its adequacy the right way to go about this
is well understood: present a better alternative. The data are well known (as
are the exceptions) so if you don't like the A/BT, your job is to present an
alternative and argue that it does a better job. Good luck![8]
Second, the ingredients for PISH themselves rest on what
appear to be very weak assumptions.[9]
They number four: (i) ‘proud of himself’ is a predicate while ‘picture of
himself’ is an argument, (ii) ‘Ellen’ in (2) is an argument of ‘proud of
himself’ while ‘Ellen’ is not an argument of ‘picture of himself,’ (iii) for an
argument to be marked as an argument of a predicate (i.e. theta-marked) it must
be in the projection of the predicate (at DS in GB) and (iv) that
predicate/argument information is preserved in the course of the derivation (in
GB trace theory does this, in MP copies). None of these assumptions seems particularly heroic in the current theoretical climate and most theories of GG that I know of adopt some version of each. Of the four, (iii) is probably the most controversial (and it is the heart of the PISH proposal), but conceptually it simply extends to external arguments the common assumption for internal arguments.[10]
(i-iii) lead to the conclusion that ‘Ellen’ is inside the projection of ‘proud
of himself’ (but not ‘picture of himself’) before moving to Spec T to be case
marked. This in conjunction with
principle A/BT (and some version of trace theory, be it semantic or
syntactic) suffices to explain the pattern of data in (2).[11]
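To see how these pieces fit together, here is a deliberately crude toy encoding of the logic (mine, not Jeff’s or Huang’s formal analysis). It hard-codes the consequence of PISH, namely that ‘Ellen’ originates inside the AP ‘proud of x-self’ but not inside the NP ‘picture of x-self’, and then applies a bare-bones Principle A over the copies.

```python
# Toy sketch (my own simplification) of how PISH + Principle A derive the pattern in (2).
#
# Assumptions encoded:
#  - PISH: a predicate's arguments originate inside its projection, so 'Ellen'
#    originates inside the AP 'proud of x-self' but NOT inside the NP
#    'picture of x-self' (Ellen is not an argument of 'picture').
#  - Copies/traces preserve this information, so the fronted phrase is
#    evaluated at both its surface and its base position.
#  - Principle A: the reflexive needs a binder within its minimal domain; if
#    the fronted phrase itself contains a subject (Ellen's base position,
#    courtesy of PISH), that domain is closed off and only that subject can bind.

EXAMPLES = {
    # id: (fronted phrase is a predicate?, intended antecedent)
    "2a": (False, "Ellen"),    # which picture of herself Ellen painted
    "2b": (False, "Norbert"),  # which picture of himself Ellen painted
    "2c": (True,  "Ellen"),    # how proud of herself Ellen was
    "2d": (True,  "Norbert"),  # how proud of himself Ellen was
}

def possible_binders(is_predicate: bool) -> set[str]:
    """Antecedents that can satisfy Principle A for the reflexive in the fronted phrase."""
    if is_predicate:
        # PISH: the AP contains Ellen's base position, so the binding domain is
        # the AP itself; only Ellen can bind the reflexive.
        return {"Ellen"}
    # No subject inside the NP: Principle A can be satisfied at the low copy
    # (bound by Ellen) or at the fronted copy (bound by Norbert).
    return {"Ellen", "Norbert"}

for ex_id, (is_pred, antecedent) in EXAMPLES.items():
    ok = antecedent in possible_binders(is_pred)
    print(f"({ex_id}) predicate={is_pred!s:5} antecedent={antecedent:7} -> {'OK' if ok else '*'}")
# Predicted: (2a) OK, (2b) OK, (2c) OK, (2d) *  -- matching the judgments above.
```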
To reiterate, this is a very conservative analysis of the data in (2): given the theoretical context (widely construed), adding a conceptually modest PISH suffices to account for the data. Thus PISH is a good candidate explanation for the data in
(2) given standard (i.e. non exotic) assumptions. The explanation works and it
is based on conceptually conservative assumptions. So, it’s a good story.
Let’s now take one further step: say for purposes of discussion that PISH plus the other assumptions is in fact the best explanation of the data in (2).[12]
What conclusion would this license? That PISH is part of the G of English. In other words, the fact that PISH fits well with standard assumptions to get some new data, and by assumption does this better than competing proposals, licenses the conclusion that PISH is part of the grammar of English. And given this conclusion, we can now ask a more refined POS question: is PISH part of the structure of the LAD (a principle of UG or derived from a principle of UG) or is it “learned”? In other words, given that this proposal is true, what does it tell us about FL?
Many might object at this point that the refined POS
question is very theory internal, relying as it does on all sorts of
non-trivial assumptions (e.g. A/BT, structured headed phrases, the distinction
between an argument and a predicate, some version of trace theory, etc.). And I
would agree. But this is not a bug, it’s a feature. To object to this way of
proceeding is to misunderstand the way the scientific game is played. Now repeat after me and loudly: “All
explanations are theory internal as all accounts go beyond the data.” The
relevant question is not whether assumptions are being made, but how good these
assumptions are. The PISH argument Jeff outlines is based on very well grounded
assumptions within GG (versions of which appear in many different
“frameworks”). Of course, one may be skeptical about any results within GG, but
if you are one of these people then IMO you are not really part of the present discussion.
If the work over the last 60 years has, in your opinion, been nugatory, then
nothing any linguist says to you can possibly be of interest (here’s where I
would normally go into a rant about flat earthers and climate change deniers). This
kind of sweeping skepticism is simply a refusal to play the game (here).
Of course, there is no obligation to play, but then there’s no obligation to
listen to those that so refuse.
To return to the POS, observe that the refined question above
is not the same question we asked in
Step 2 above. What step 2 showed us was that whatever the explanation of the
data in (2) it did not come from the LAD’s generalizing from data like that in
(2). Such data does not exist. So if PISH+A/BT+… is the right account of (2)
this account has not been induced from data like (2). Rather the explanation
must be more indirect, transiting through properties of Gs incorporating
principles like A/BT and PISH. Indeed,
if A/BT+PISH+… is the best theory we have for binding data like (2) and if we
assume (as is standard in the sciences) that our best theories should be taken
as (tentatively) true (at least until something better comes on the scene) then
we are in a position to entertain a very refined and interesting POS question
(in fact, formulating this question (and considering answers to it) is the
whole point of POSs), one relevant to probing the fine structure of FL.
Step 4: Are A/BT and PISH innate features of the LAD or
acquired features of the LAD?
Happily, this question can be addressed in the same way deployed at step 2. We can ask what data would be relevant to inducing A/BT and PISH etc. and ask whether this data was robustly available in the PLD. Let’s concentrate on half the problem, as this is what Jeff’s post concentrated on. Let’s assume that A/BT and the other ancillary features are innate features of FL (though if you want to assume they are learned, that’s also fine, though likely incorrect) and ask: what about PISH? The question then is, if PISH correctly describes the G of English, is it learned (i.e. induced from PLD) or innate (either an innate primitive or the consequence of innate primitives)?
Of course, to answer this question we need a specification
of PISH, so here’s one which applies not just to subjects (external arguments)
but to all arguments of a predicate.
PISH: A predicate P can mark X as an argument only if X is in the projection of P.
This means that it is a necessary condition for X being
marked as an argument of a predicate P that X be in a structure more or less
like [PP X…P…]. So, if a DP is an internal or external argument of a
verb V or an adjective A then it must be contained in VP or AP of that V or A.[13]
So, can PISH so described be induced from the PLD? I doubt it. Why? Because
there is very little evidence for this principle in surface forms. Movement
operations regularly move expressions from their argument positions to
positions arbitrarily far away (i.e. DS constituents are not SS constituents). In
other words, in a vast array of cases, surface forms are a very poor reflection
of thematic constituency. Thus, the local relation that PISH insists holds between
predicates and their arguments is opaque in surface patterns. Or, said another
way, surface patterns provide poor evidence that a principle like PISH (which
requires that predicates and their arguments form constituents headed by these
predicates) holds.[14]
Note, in addition, the modal must in the PISH principle above. It is not enough that DPs optionally generate “low” in the domain of their thematic predicates. If PISH were only an option, then the unacceptable datum in (2) (i.e. (2d)) should be fine, as there would be a derivation, one in which “Ellen” was not generated within ‘proud of himself’, which would pattern like (2b).[15]
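To make the role of the modal vivid, here is a tiny variant of the earlier toy sketch (again my own simplified encoding): if origination inside the predicate’s projection is merely optional, an extra derivation becomes available for (2d) and the account wrongly predicts it to be acceptable.

```python
# Toy illustration (my assumption-laden encoding, as before) of why PISH must be
# obligatory. If 'Ellen' only OPTIONALLY originates inside the AP 'proud of
# himself', there is an alternative derivation in which the AP contains no
# subject, the binding domain opens up, and 'Norbert' can bind 'himself' --
# wrongly predicting (2d) to be fine.

def binders(pish_obligatory: bool) -> set[str]:
    """Possible binders for the reflexive in 'how proud of himself Ellen was'."""
    if pish_obligatory:
        # Only derivation: Ellen originates inside the AP; she is the only binder.
        return {"Ellen"}
    # Optional PISH: a second derivation exists with no subject inside the AP,
    # so the reflexive can also be bound from outside, e.g. by Norbert.
    return {"Ellen", "Norbert"}

for obligatory in (True, False):
    ok = "Norbert" in binders(obligatory)
    print(f"PISH obligatory={obligatory}: (2d) predicted {'acceptable' if ok else '*'}")
# Predicted: obligatory -> *, optional -> acceptable (contrary to fact).
```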
My conclusion: if PISH is (part of) the right account of the data in (2), then PISH reflects properties of the built-in structure of FL and is not induced from PLD.
Indeed, I don’t see any real way to induce this principle for it reflects no
surface pattern in the data (but maybe others can be more imaginative). So, if
PISH exists at all, then it is likely to be a reflection of the innate
structure of FL.
As you can see, I find this line of reasoning persuasive.
But perhaps I am wrong and PISH can be induced from surface forms. If you think that this is so, your job is to show how to do this. It’s pretty clear what such a demo would have to do: show how to induce the data in (2) from the PLD without using a principle like PISH.[16]
Do it and we defuse the argument above. Again, good luck!
I should add that critics rarely try to do this. Certainly
none of the critics to Jeff’s original post attempt anything like this. Rather
there are two or three reactions that we should now take a look at.
The first is that this kind of argument is very involved and so
cannot be trusted. It relies on a number of intermediate steps, which quickly
involve non-trivial assumptions. So, PISH relies on A/BT and on the idea that
phrases are headed. To this one is tempted to reply: grow up! That’s the way
science works. You build arguments piecemeal, relying on previous conclusions.
You take conclusions of earlier work as premises for the next steps. There is
no way around this, and not just in linguistics. Successful inquiry builds on
past insight. These form the fixed points for the next steps. Denying the
legitimacy of thus proceeding is simply to eschew scientific inquiry in the
domain of language.[17]
It amounts to a kind of methodological dualism that Chomsky has been right to
excoriate.
Moreover, it misunderstands the logic of the challenge
Jeff’s case provides. The challenge is not a theory internal one. Here’s what I
mean. The problem is a standard induction problem. Everyone agrees that one
covers all of LD by inducing Gs based on data drawn from a subpart of LD. The
POS challenge above is to show how to do this using only PLD. This challenge does not commit one to a particular conception of grammar or UG or FL. In this sense it is very neutral. All it requires
is recognizing that what we have here is a problem of induction and a
specification of what’s in (and not in)
the inductive base. Jeff’s solution uses
theory particular assumptions in responding to this challenge. But the
challenge does not require buying into these assumptions. If, for example, you
think that all human Gs are (roughly) CFGs, then show how to induce those CFGs that have the contrasts in
(2) as outputs based on PLD that does not contain data like (2). So, though there is nothing wrong, IMO, in using GGish assumptions to solve the problem (given my view that GG has learned a lot about language in the last 60 years), if you think that this is all junk then take your favorite assumptions and show how to get from PLD to (2) using them.
Put another way, what the PLD consists in in this case is not a complex theory-internal matter, and all one needs to raise the POS question is agreement that Jeff’s characterization of the PLD in this case is more or less correct. So if you don’t like his solution, you now know what needs doing. So do it.
The second reaction is to reject POS conclusions because
they overly encumber FL with domain specific innate constraints. More
concretely, PISH cannot be innate despite the POS argument, for there is no way it could have gotten into FL. This is a kind of Minimalist maneuver, and I personally find it somewhat compelling. But the right conclusion is not that the POS argument has failed but that there is yet another project worth embarking on: showing how to derive the linguistically specific principles from more general computational and cognitive ones. This is a project that I am very sympathetic to. But it is very different from the one that simply rejects conclusions that are not to its liking. Both Plato’s Problem and Darwin’s Problem are real. The right way to reconcile them is not to wish one or the other away.
Let me end this anatomy lesson here with some summary
remarks. POSs have been and will continue to be invaluable tools for the
investigation of the structure of FL. They allow us to ask how to parse our
analyses of particular phenomena into those features that relate to structural
features of FL and those that are plausibly traced to the nature of the input
data. POS arguments live within a certain set of assumptions: they are tools
for investigating the properties of generative procedures (Gs) and the mental
structures that restrict the class of such procedures available to humans (UG).
In this sense, answers to POS arguments are not assumption free. But who could have thought that they would be? Is there any procedure anywhere in the sciences that is assumption free? I doubt it. At any rate, POS is not one of these. However, POS problems are run of the mill inductive ones where we have a reasonably good specification of what the inductive base looks like. This problem is neutral wrt an extremely wide range of possible solutions.
Three of the more important features of a POS are these. First, the distinction between PLD and LD: the linguistic data the child has available to it versus the full range of structures that the attained G (i.e. generative procedure) generates. Whatever learning goes on uses PLD alone, though the G attained on the basis of this PLD extends far beyond it. Second, the data of interest include both acceptable and unacceptable cases. We want to explain both what does and what doesn’t “exist.” Lastly, the objects that POSs investigate are Gs and UG. It is assumed that linguistic competence amounts to having acquired a generative procedure (G). POSs are then deployed to examine the fine structure of these Gs and parse their properties into those that expose the structural properties of FL and those that reflect the nature of the input. Like all tools, POSs will be useful to the degree that this is your object of inquiry. Now, many people do not think that this IS the right way of framing the linguistic problem. For them, this sort of argument will, not surprisingly, lose much of its appeal. Just another indication that you can’t please everyone all the time. Oh well.
[2]
As Colin notes there is nothing squishy about the data: “For what it’s worth, we tested the facts in Jeff’s
paradigm in (2), and the ratings are uncommonly categorical (Omaki, Dyer,
Malhotra, Sprouse, Lidz, & Phillips, 2007).”
[3]
Even a hard boiled rationalist like me thinks that this is sometimes the right kind of account to give. Why do I put
determiners before NPs rather than after them (as, e.g. in Icelandic)? Well
because I have been exposed to English data and that’s what English does. Had I
been exposed to Icelandic data I would do otherwise.
[4]
This is important: the best POS notes a complete absence of the relevant data.
There can be no induction in the absence of any data to induce over. That’s why
many counterarguments try to find some
relevant input that statistical processes can then find and clean up for use.
However, when there really is no
relevant data in the PLD, then the magic of statistical inference becomes
considerably less relevant.
[5]
Well, not quite: he did get hits from linguistic papers talking about the
phenomenon of interest. Rashly, perhaps, he took this to indicate that such
sentences were not available to the average LAD.
[6]
It is worth remarking that the PLD LD distinction is not likely to be one
specifiable (i.e. definable) in general terms (though I suspect that a rough
cut between degree 0+ and others is correctish). This, however, does not imply
that the PLD/LD distinction cannot be determined in specific cases of interest,
as Jeff demonstrates. The fact that there is no general definition of what data
constitute the PLD does not imply that we cannot fix the character of the PLD
in specific cases of interest. In this respect, PLD is like “coefficient of friction.”
The value of the latter depends on the materials used. There is no general all
purpose value to be had. But it is easily determined empirically.
[7]
There is a reason for this: A/BT is a generalization very close to the data.
This is a pretty low level generalization of what we see in the LD.
[8]
This is all obvious. I mention it to parry one kind of retort: “well it’s possible
that A/BT is not right, so we have no reason to assume that it is.” This is not
an objection that any right thinking person should take seriously.
[10]
In GB this is the assumption that predicates govern what they theta mark.
Analogous assumptions are made in other “frameworks.”
[11]
Actually, we need also assume that displacement targets constituents. Again,
this is hardly a heroic assumption.
[12]
There are other accounts out there (e.g. Heycock’s reconstruction analysis) but
the logic outlined here will apply to the other proposed analyses I know and
lead to similar qualitative conclusions concerning the structure of FL.
[13]
Or in MP terms, it must have merged with whatever theta marks it. Note that this is also a standard assumption
in the semantic literature where arguments saturate a predicate via lambda
conversion.
[14]
This DS/SS disconnect is what lay behind the earliest arguments for innate
linguistic structure. That’s why Deep Structure was such a big deal in early GG
theory.
[15]
David Basilico (here
and following) makes a very relevant observation concerning Existential
Constructions and whether they might help ground PISH in the PLD rather than
FL/UG. His observations are very relevant, though for reasons I mention in the thread, and given the must nature of PISH, I think that there is still some slack that needs reeling in by UG. That said,
his point is a good and fair one.
[16]
Actually this would be a good first step. In addition, one would hope that the principles used to license the induction would be equally well grounded; i.e. at
least as good as PISH. But for now, I would be happy to forgo this additional
caveat.
[17]
As I’ve noted before, there are a group of skeptics that don’t really think
much of work in GG at all. I actually have nothing at all to say to them as IMO
they are not persuadable, at least by the standards of rational inquiry that I
am acquainted with. For a recent expression of this see Alex D here.
I completely agree with the sentiment.
There is this idea of "direct induction" which I think is maybe a bit misleading... any inductive algorithm has some bias, so the generalisation always arises out of an interaction between the biases of the learner and the data. Even if the learner is just memorising the data that is a bias too -- a bias towards the most conservative hypothesis.
I get the intuition that learning, say, that determiners are always (almost always .. there is "galore") before the noun in English, is easy to do, but that reflects a bias of a learner that pays attention to the surface order of strings. A different and maybe simpler algorithm might assume that the surface order was irrelevant and that if "the cat" is grammatical then so is "cat the".
If you follow this line a bit further I think you come to the conclusion that there is no such thing as direct induction, there are just inductive processes (learning algorithms) of various types. It is easier to see how a simple learning algorithm that pays attention to surface order can work, and harder to see how learning algorithms that work with derivation trees of MCFGs can work, but I don't see any very great difference between them, as the surface order is just the derivation tree of a regular grammar (which is an MCFG of a special type; rules like N(au) := P(u)).
@ Alex:
Your comment reminds me of a Yogi Berra insight: In theory, there's no difference between theory and practice, in practice there is. Let's consider some urn scenarios to pump intuitions:
Our chooser is asked to say (bet?) whether urn X has more black balls than white ones. S/he takes 100 pulls from the urn with replacement etc. After 100 pulls there are 99 black balls and 1 white one. S/he is asked to bet. I'm pretty sure I know what s/he would bet and why. As do you.
Scenario B: Same thing, but after 100 pulls s/he is asked to say (bet?) whether urn Y has more black balls than pink ones. S/he has sampled nothing from urn Y, only urn X. Whatever s/he does here is different from what s/he did with urn X. Maybe something like a principle of indifference drives the choice, maybe something else, but it is very different. Now urn Z: s/he again pulls 100 times and again gets 99 blacks and 1 white. Then an undisclosed number of pink balls are added to Z and without any further selection s/he is asked to say if there are more black balls than pink ones. Again, whatever s/he does, based on whatever principle you choose, is different from the first case.
The moral: I think that the first case is what we have in the determiner example. I think that the 2nd and 3rd cases are roughly what we have in many linguistic cases (e.g. Islands, ECP, PISH etc). Imagine further that after sampling urn X our chooser always answered rightly wrt any other urn with any other colors. Would we say that s/he got it right BECAUSE of his/her samplings of urn X? And if s/he did get it right wouldn't we say it's because there is some non-trivial causal relation between the urns that our chooser "knows", e.g. s/he knows something about all color choices other than white and black? I'm sure you get the point.
These cases are different "in practice" regardless of whether we can mathematically assimilate them to the same thing. In fact, I would go further: if we do mathematically assimilate them to the same thing, then all this means is that the mathematical perspective on the problem is unilluminating.
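(A rough simulation, with illustrative numbers of my own, makes the asymmetry concrete: in the first case the bet is driven by the sample; in the other two there is simply no sample from the queried urn, so there is nothing to induce from.)

```python
import random

random.seed(0)

def estimate_p_black(draws):
    """Sample-based estimate of P(black); None when there is nothing to induce from."""
    return draws.count("black") / len(draws) if draws else None

# Scenario 1: urn X is actually sampled (composition assumed here purely for simulation).
urn_x = ["black"] * 99 + ["white"] * 1
draws_from_x = [random.choice(urn_x) for _ in range(100)]

# Scenario 2: urn Y is never sampled at all.
draws_from_y = []

# Scenario 3: urn Z was sampled for black vs white, then pink balls were added;
# the existing sample contains no information about pink.
draws_from_z_about_pink = []

print("X:", estimate_p_black(draws_from_x))                       # ~0.99: the bet tracks the sample
print("Y:", estimate_p_black(draws_from_y))                       # None: nothing to generalize from
print("Z (re pink):", estimate_p_black(draws_from_z_about_pink))  # None again
```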
@Norbert:
Alex presented the established view according to which everything the learner does is revealing of a bias; here it doesn't even make sense to ask whether there is some innate aspect to learning---of course there is!
As soon as one starts to formalize learning settings, the memorizing learner emerges as a logical possibility, on a par with others. One can surely decide that purely memorizing learners are not interesting, and try to exclude them from particular learning settings (by perhaps requiring that the learner's memory is finitely bounded, etc), and people have done this.
However, you seem to want memorization (or, more generally, easy/surface-based/??? strategies) to be treated as qualitatively different from `real' generalization. Why? I find your analogy between urns and language intuitive, but I don't understand why you want more than a quantitative distinction.
I'm glad you find this intuitive. And I am not looking for a qualitative distinction so much as a meaningful one. These are two different kinds of induction problems. Linguists have distinguished them as relevantly different, looking for different kinds of biases. Whether they are different ends of a quantitative continuum or qualitatively different is of no concern to me. But they should be distinguished, for they ask for different modes of explanation and investigation. Anything that reduces them to the same thing, therefore, is IMO ipso facto the wrong way to look at things. That's what I tried to bring out with the urn examples. One can call all these cases examples of generalization from the data, but the inductive grounds and principles relied upon are different. The last two use something like the principle of insufficient reason, if they use anything at all. The first does not.
Maybe an analogy will serve. Being a sure thing and not is the difference between probability 1 and probability less than 1. Is this a quantitative or qualitative difference? My Bayesian friends think the former, for setting priors to probability 1 or 0 creates all sorts of problems, or so I am told. So, qualitative or quantitative? Is the question even well posed? And who cares, so long as the difference is recognized and distinguished, something that I did not take Alex to be doing.
Let me tweak your urn example a little bit to make my point.
Suppose you are observing a sequence of black and white balls, and you
see the sequence of 99 balls WBBWBB .... WBBWBB so WBB repeated 33 times, and you are asked to guess what the next ball will be.
Well one learner might assume that they are drawn from an urn with replacement and that the urn only contains black and white balls and say we have seen 33 white and 66 black so the probability of a black is 2/3.
Or the learner might be detecting a sequence in the data and say, it will be white W.
Both of these are equally "direct", they just depend on the assumptions they make about the generating process.
So suppose we have a "direct" learner whose assumption is that the data is generated by ngrams or regular grammars or CFGs or MCFGs or MGs, or MGs with PISH, or by some finite class of parametric transformational grammars. Is that possible? Or does the idea of a direct learner only apply to, say, ngram models?
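(To make the two learners concrete, here is a sketch under my own simplifying assumptions: both get the same 99-ball sequence and they disagree about the next ball only because of their different built-in assumptions about the generating process.)

```python
sequence = "WBB" * 33  # 99 observed balls: WBBWBB...WBB

def iid_learner(seq: str) -> str:
    """Assumes draws with replacement from a fixed urn; predicts the majority colour."""
    p_black = seq.count("B") / len(seq)      # 66/99 = 2/3
    return "B" if p_black > 0.5 else "W"

def pattern_learner(seq: str, period: int = 3) -> str:
    """Assumes the sequence repeats with a fixed period; predicts by position."""
    return seq[len(seq) % period]            # next position in the WBB cycle -> 'W'

print("i.i.d. learner predicts:", iid_learner(sequence))       # B
print("pattern learner predicts:", pattern_learner(sequence))  # W
```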
No one doubts that there are different ways to extrapolate/generalize from data, which is what your tweaking points to. Yes, even when there is data to generalize from, this can be done in endlessly many ways. We have Hume and Goodman to thank for bringing this so vividly to our attention (even Wittgenstein in his foundations of math book). Yes, this is very important. BUT, let me say this again, BUT: the reason I pointed you to my cases is that they are different from these. In the two latter cases we are not generalizing from the data at all. There is no pattern we are trying to discern as there are no relevant samplings of the relevant urns. That was the point. And in these cases we are doing something different. However (and this was the point) it looks like many POS cases are just like these second two cases. Jeff presented one like this. There are many many more. So what we need is to divide the relevant cases into two piles: those where we are NOT inducing from data and those where we are. For the former cases it looks like we are learning something (I would say) about the nature of the hypothesis space, in the second something about the natural inductive dimensions relevant to the problem (i.e. the strength and shape of priors). I am sure that things can be cut up differently. However, these clearly seem to me like different kinds of cases and mushing them together is not very useful.
Last point: a standard reply to a POS is that cases like the one Jeff points to, where there is 0 data in the PLD to generalize from (like the last two urn cases), don't really exist. My point is that they do, in fact they do in spades. And missing this will only lead one off in entirely irrelevant directions.
I think the problem with the argument you are making is precisely this false dichotomy between inductive processes which are very shallow and non-inductive processes. You rely on this when you say "there is 0 data in the PLD to generalize from". That only makes sense if you have some idea of a learning algorithm which needs data of type X to learn phenomenon Y.
Then you can say, there is no data of type X in the PLD; therefore no inductive learning algorithm can learn Y from the PLD, therefore Y is innate.
Have I got your logic right?
So my question is meant to try to understand what you mean by "inducing from data". If we induce a minimalist grammar using standard Bayesian techniques is that "inducing from data"? Or does it only apply to ngram type techniques?
(Side point: The multiple urn problem is interesting -- there are of course learning algorithms that can exploit structure between urns; say if we have seen 10 urns that all have 75% black balls and 25% white, then this might affect the expected probability we would give to a ball from a new urn being black. This would be a hierarchical Bayesian model, and another type of induction.)
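(A toy version of that side point, with made-up numbers: pooling across previously seen urns shifts the expectation for a brand-new urn before a single ball is drawn from it.)

```python
# Toy empirical-Bayes flavour of the side point (illustrative numbers only):
# after seeing several urns that each ran about 75% black, a hierarchical-style
# learner expects a NEW urn to lean black before drawing anything from it.

observed_urns = [(75, 25)] * 10           # (black, white) counts from 10 earlier urns

total_black = sum(b for b, w in observed_urns)
total = sum(b + w for b, w in observed_urns)

# Use the pooled counts as a Beta prior over the new urn's P(black).
alpha, beta = total_black, total - total_black
prior_mean_new_urn = alpha / (alpha + beta)

print(f"Prior expectation that a ball from a brand-new urn is black: {prior_mean_new_urn:.2f}")
# ~0.75 -- the urns are no longer treated as unrelated.
```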
I knew you'd say this. I even predicted this. So you know my answer and there is little reason to discuss this any further. When you come up with a way of getting any of the data linguists look at using the PLD we have evidence is the pertinent kind, we can talk again. Till then, further discussion is not useful.
As for your side point: this is really nutty. I set up the problem precisely so that there was NO relation between the urns. I set it up this way for the reasons I mentioned. The aside simply tries to assimilate a problem FOR induction into another problem OF induction. In other words, it denies that the problem as illustrated exists. And that was my point in the first place.
BTW, Bill Idsardi and I have a forthcoming post that should present a semi-formal version of this problem.
Maybe I have got the dialectic wrong, but if I think you are setting up a false dichotomy then I do want to show that there is an example which is an alternative, and so the urn example I gave seems the right move? It may not convince of course.
And once again, I am not trying to sell my way of doing things, I am just pointing out an implausible assumption in your argument. I don't need to solve the problem to point out an inadequacy in your argument or the solution you favour. And I am certainly not trying to diminish or ignore the problem -- it is a great problem! and much better than AUX inversion.
You are rejecting the description of the problem as I gave it. Of course, the debate is how well my description models the actual situation. I am not arguing that it does, I was asserting that it did. The argument comes from Jeff's PISH example, which, given his description of the PLD, suggests that this is a fair model of the problem. You may disagree. In fact I'm sure you do. Fine. The question is whether there is a way of adjudicating this disagreement rationally. I'm not sure, but I think that there is. Bill and I will post something on one possible approach next week. Stay tuned. For now, I am happy if the point of our disagreement is clear, which I think it is.
DeleteThe Native POS Mobile application for your online website. Using this native app, your cashiers can make use of this single application on their Android or iOS devices for your multiple physical stores. Create the most flexible Point-Of-Sale Native App to increase revenue, improve operations, and provide a positive customer experience.
ReplyDelete