Some of this post is thinking out loud. I am not as sure as I would like to be about
certain things (e.g. how to understand feasibility; see below).
This said, I thought I’d throw it up and see if comments etc. allow me to
clarify my own thoughts.
In Aspects
(chapter 1; 30ff), Chomsky outlines an abstract version of an “acquisition
model.”[1]
I want to review some of its features here. I do this for two reasons. First,
this model was later replaced with a principles and parameters (P&P)
account and in order to see why this happened it’s useful to review the theory
that P&P displaced. Second, it seems to me that the Aspects model is making a comeback, often in a Bayesian wrapper,
and so reviewing the features of the Aspects
model will help clarify what Bayesians are bringing to the acquisition party
beyond what we already had in the Aspects
model. BTW, in case I haven’t mentioned this before, chapter 1 of Aspects is a masterpiece. Everyone
should read it (maybe once a year, around Passover, when we commemorate our
liberation from the tyranny of Empiricism). If you haven’t done so yet, stop
reading this post and do it! It’s much better than anything you are going to
read below.
Chomsky presents an idealized acquisition model in §6
(31). The model has 5 parts:
(i) an enumeration of the class s1, s2, … of possible sentences
(ii) an enumeration of the class SD1, SD2, … of possible structural descriptions
(iii) an enumeration of the class G1, G2, … of possible generative grammars
(iv) specification of a function f such that SDf(i,j) is the structural description assigned to sentence si by grammar Gj, for arbitrary i, j
(v) specification of a function m such that m(i) is an integer associated with the grammar Gi as its value[2]
(i-v) describe the
kinds of capacities a language acquisition device (LAD) must have to use primary
linguistic data (PLD) to acquire a G. It must have (i) a way of representing the input signals, (ii) a way of assigning structures to these signals, (iii) a way of restricting the class of possible structures available to languages, (iv) a way of figuring out what each hypothetical G implies for each sentence (i.e. an input/structure pair) and (v) a method for selecting one of the very, very many (maybe infinitely many) hypotheses allowed by (iii) that are compatible with the given PLD. So, we need a way of representing the input, a way of matching that input to a Gish representation of that input, and a way of choosing the “right” match (the correct G) from the many logically possible G-matches (i.e. a way of “evaluating alternative proposed grammars”).
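To make the division of labor concrete, here is a minimal sketch of (i-v) in code. Everything in it (the grammar objects, the helpers `sd_of`, `compatible`, `acquire`, the measure `m`) is my own illustrative invention, not Chomsky's formalism; the point is only where the empirical weight falls.

```python
# Toy rendering of the Aspects LAD, components (i)-(v).
# All names and objects here are invented for illustration.

def compatible(grammars, pld, sd_of):
    """Condition (iv) at work: keep the grammars that assign some structural
    description to every sentence in the PLD."""
    return [g for g in grammars if all(sd_of(g, s) is not None for s in pld)]

def acquire(grammars, pld, sd_of, m):
    """(iii) supplies `grammars`, (i)-(ii) the representation of the `pld`,
    (iv) the pairing function `sd_of`, and (v) the evaluation measure `m`.
    Return the most highly valued grammar compatible with the data."""
    candidates = compatible(grammars, pld, sd_of)
    # Whether "most highly valued" is a max or a min is bookkeeping (cf. note [2]);
    # the substance lies entirely in `grammars` and `m`.
    return max(candidates, key=m, default=None)
```

On this rendering the selection step is trivial; all the empirical action is in how the class of Gs and the measure m get specified, which is exactly where the rest of the chapter puts it.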
How would (i-v) account for language acquisition? A LAD with
structure (i-v) could use PLD to search the space of Gs to find the one that generates that PLD. The PLD, given (i, ii), is a pairing of inputs with (partial) SDs. (iii, iv) allow these SDs to be related to particular Gs.
As Chomsky puts it (32):
The device must search through the
set of possible hypotheses G1,
G2,…, which are available to it by virtue of condition (iii),
and must select grammars that are compatible with the primary linguistic data,
represented in terms of (i) and (ii). It is possible to test compatibility by
virtue of the fact that the device meets condition (iv).
The last step is to select one of these “potential grammars”
using the evaluation measure provided by (v). Thus, if a LAD has these five
components, the LAD has the capacity to build a “theory of the language of
which the primary linguistic data are a sample” (32).
As Chomsky notes, (i-v) packs a lot of innate structure into
the LAD. And, interestingly, what he proposes in Aspects matches pretty closely how our thoroughly modern Bayesians
would describe the language acquisition problem: A space of possible Gs, a way
of matching empirical input to structures that the Gs in the space generate,
and a way of choosing the right G among the available Gs given the analyzed
input and the structure of the space of Gs.[3]
The only thing “missing” from Chomsky’s proposal is Bayes rule, but I have no
doubt that were it useful to add, Chomsky would have had no problem adding it.
Bayes rule would be part of (v), the rule specifying how to choose among the
possible Gs given PLD. It would say: “Choose the G with the highest posterior
probability.”[4]
The relevant question is how much this adds. I will return to this question anon.
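To see what the substitution would amount to, here is a minimal sketch with invented numbers: the prior plays the role of (v)'s valuation, the likelihood is what (i-iv) must deliver, and "choose the G with the highest posterior" is a single line.

```python
# Bayes rule slotted in as the evaluation measure (v).
# The three grammars and all the probabilities are made up purely for illustration.

priors = {"G1": 0.5, "G2": 0.3, "G3": 0.2}            # the valuation that (v) provides

def likelihood(g, pld):
    """What (i)-(iv) must supply: how probable the analyzed PLD is under g."""
    per_sentence = {"G1": 0.10, "G2": 0.15, "G3": 0.15}[g]
    return per_sentence ** len(pld)

pld = ["some sentence"] * 20
posterior = {g: priors[g] * likelihood(g, pld) for g in priors}   # unnormalized
best = max(posterior, key=posterior.get)              # "highest posterior": G2 here

# With little data the prior (i.e. (v)) does the work; with lots of data the
# likelihood dominates. Either way, the interesting question is where the space
# of Gs and the priors come from in the first place.
```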
Chomsky describes theories that meet conditions (i-iv) as descriptive and those that add (v) as well as explanatory. Chomsky
further notes that gaining explanatory power is very hard, the reason being
that there are potentially way too many Gs compatible with given PLD. If so, then choosing the right G (the
needle) given the PLD (a very large haystack) is not a trivial task. In fact, in Chomsky’s view (35):
… the real problem is almost always
to restrict the range of possible hypotheses [i.e. candidate Gs, NH] by adding
additional structure to the notion “generative grammar.” For the construction
of a reasonable acquisition model, it is necessary to reduce the class of
attainable grammars compatible with given primary linguistic data to the point
where selection among them can be made by a formal evaluation measure. This
requires a precise and narrow delimitation of the notion “generative grammar”-
a restrictive and rich hypothesis concerning the universal properties that
determine the form of language…[T]he major endeavor of the linguist must be to
enrich the theory of linguistic form by formulating more specific constraints
and conditions on the notion “generative grammar.”
So, the main explanatory problem, as Chomsky sees it, is to
so circumscribe (and articulate) the grammatical hypothesis space such that for
any given PLD, only a very few candidate Gs are possible acquisition targets.[5]
In other words, the name of the explanatory game is to structure the hypothesis
space (either by delimiting the options or biasing the search (e.g. via strong
priors)) so that, for any given PLD, very
few candidates are being simultaneously evaluated. If this is correct, the focus of theoretical
investigation is the structure of this space, which as Chomsky further argues,
amounts to finding the universals
that suitably structure the space of Gs. Indeed, Chomsky effectively identifies
the task of achieving explanatory adequacy with the “attempt to discover
linguistic universals” (36), principles of G that will deliver a space of possible Gs such that, for any given PLD, only a very small number of candidate Gs need be considered.
I have noted that the Aspects
model shares many of the features that a contemporary Bayesian model of
acquisition would also assume. Like the Aspects
model, a Bayesian one would specify a structured hypothesis space that ordered
the available alternatives in some way (e.g. via some kind of simplicity
measure?). It would also add a rule (viz. Bayes rule) for navigating this space
(i.e. by updating values of Gs) given input data and a decision rule that
roughly enjoins that one choose (at some appropriate time) the highest-valued alternative. Here’s my question:
what does Bayes add to Aspects?
In one respect, I believe that it reinforces Chomsky’s
conclusion: that we really, really need a hypothesis space that focuses the LAD’s attention
on a very small number of candidates. Why?
The answer, in two words, is computational tractability.
Doing Bayes proud is computationally expensive. A key feature of Bayesian
models is that with each new input of data the whole space of alternatives
(i.e. all potential Gs) is updated. Thus, if there are, say, 100 possible grammars, then for each datum D
all 100 are evaluated with respect to D (i.e. Bayes computes a posterior for
each G given D). And this is known to
be computationally so expensive as to not be feasible if the space of
alternatives is moderately large.[6]
Here, for example, is what O’Reilly, Jbabdi and Behrens (OJB) say (see note 5):
…it is well known that “adding parameters to a model (more dimensions to the model) increases the size of the state space, and the computing power required to represent and update it, exponentially” (1171).
As OJB further note, the computational problems arise even when there are only “a handful of dimensions of state spaces”
(1175, my emphasis, NH).
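To put a rough number on the worry (the parameters, the likelihood, and the datum below are all invented): a "fully Bayesian" update touches every cell of the joint space, and with binary parameters that space grows as 2**n.

```python
from itertools import product

def full_update(posterior, likelihood, datum):
    """One fully Bayesian step: rescore *every* grammar in the joint space
    against the new datum, then renormalize. Cost per datum: O(2**n)."""
    rescored = {g: p * likelihood(g, datum) for g, p in posterior.items()}
    z = sum(rescored.values())
    return {g: p / z for g, p in rescored.items()}

def toy_likelihood(g, datum):
    """Invented likelihood: grammars whose first parameter matches the datum fit better."""
    return 0.9 if g[0] == datum else 0.1

n = 16                                               # 2**16 = 65,536 grammars: manageable
space = list(product([0, 1], repeat=n))              # at n = 40 it is ~10**12 cells: hopeless
posterior = {g: 1.0 / len(space) for g in space}     # uniform prior, for illustration
posterior = full_update(posterior, toy_likelihood, 1)   # every datum costs a full sweep
```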
This would not be particularly problematic if only a small
number of relevant alternatives were the focus of Bayesian attention, as would
be the case given Chomsky’s conception of the problem, and that’s why I say
that the Aspects formulation of
what’s needed seems to fit well with Bayesian concerns. Or, to put this another
way: if you want to be Bayesian then you’d better hope that something like
Chomsky’s position is correct and that we can find a way of using universals to
develop an evaluation measure that serves to severely restrict the relevant Gs
under consideration for any given PLD.
There is one way, however, in which Chomsky’s guess in the Aspects model and contemporary Bayesians
seem to part ways, or at least seem to emphasize different parts of the
research problem. (I say ‘seem’ because what I note does not follow from Bayes-like assumptions; rather, it is characteristic of what I have read, and, recall, I have not read tons in this area, just some of the “hot” papers by Tenenbaum and company.) Chomsky says the following (36-7):
It is logically possible that the
data might be sufficiently rich and the class of potential grammars
sufficiently limited so that no more than a single permitted grammar will be
compatible with the available data at the moment of successful language
acquisition…In this case, no evaluation procedure will be necessary as part of
linguistic theory – that is, as an innate property of an organism or a device
capable of language acquisition. It is rather difficult to imagine how in
detail this logical possibility might be realized, and all concrete attempts to
formulate an empirically adequate linguistic theory certainly leave ample room
for mutually inconsistent grammars, all compatible with the primary linguistic
data of any conceivable sort. All such theories therefore require
supplementation by an evaluation measure if language acquisition is to be
accounted for and the selection of specific grammars is to be justified: and I
shall continue to assume tentatively…that this is an empirical fact about the
innate human faculté de langage and
consequently about general linguistic theory as well.
In other words, the HARD acquisition problem in
Chomsky’s view resides in figuring out the detailed properties of the
evaluation metric. Once we have this,
the other details will fall into place. So, the emphasis in Aspects strongly suggests that serious work
on the acquisition problem will focus on elaborating the properties of this
innate metric. And this means working on developing “a restrictive and rich
hypothesis concerning the universal properties that determine the form of
language.”
Discussions of this sort are largely missing from Bayesian
proposals. It’s not that they are incompatible with these and it’s not even
that nods in this direction are not frequently made (see here).
Rather most of the effort seems placed on Bayes Rule, which, from the outside
(where I sit) looks a lot like bookkeeping. The rule is fine, but its efficacy
rests on a presupposed solution to the hard problem. And this looks as if
Bayesians worry more about how to navigate the space (on the updating
procedure) given its structure rather
than on what the space looks like (its algebraic structure and the priors on
it).[7]
So, though Bayes and Chomsky in Aspects
look completely compatible, what they see as the central problems to be solved
look (or seem to look) entirely different.[8]
What happened to the Aspects
model? In the late 70s and early 80s, Chomsky came to replace this “acquisition
model” with a P&P model. Why did he do this and how are they different? Let’s
consider these questions in turn.
Chomsky came to believe that the Aspects approach was not feasible. In other words, he despaired of
finding a formal simplicity metric
that would so order the space of grammars as required.[9] Not that he didn’t try. Chomsky discusses
various attempts in §7, including ordering grammars in accord with the number
of symbols they use to express their rules (42).[10]
However, it proved to be very hard (indeed impossible) to come up with a
general formal way of so ordering Gs.
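For concreteness, here is roughly the shape such a metric had; the two rule sets are invented stand-ins, and the point is only what "counting symbols" comes to.

```python
# An Aspects-style evaluation measure: value a grammar by how few symbols its
# rules need, and pick the most compact grammar compatible with the PLD.
# The rule sets below are toy stand-ins.

def symbol_count(grammar):
    """Total symbols over all rules; fewer symbols = more highly valued."""
    return sum(1 + len(rhs) for lhs, rhs in grammar)   # 1 for the LHS, the rest for the RHS

G1 = [("S", ["NP", "VP"]), ("VP", ["V", "NP"])]                   # 6 symbols
G2 = [("S", ["NP", "V", "NP"]), ("S", ["NP", "V", "NP", "PP"])]   # 9 symbols

def select(grammars, fits_pld):
    """(v) as symbol counting: the simplest grammar compatible with the data."""
    return min((g for g in grammars if fits_pld(g)), key=symbol_count)

best = select([G1, G2], fits_pld=lambda g: True)      # -> G1 on this toy measure
```

The trouble was never writing down a counter like this; it was finding any general metric of this kind whose orderings reliably tracked the linguistically right choices.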
So, in place of general formal evaluation metrics, Chomsky
proposed P&P systems where the class of available grammars is finitely specified by substantive two-valued parameters. P&P parameters are not
formally interesting. In fact, there have been no general theories (not even
failed ones) of what a possible parameter is (a point made by Gert Webelhuth in his thesis, and subsequently).[11]
In this sense, P&P approaches to acquisition are less theoretically
ambitious than earlier theories based on evaluation measures. In effect,
Chomsky gave up on the Aspects model
because it proved to be hard to give a general definition of “generative
grammar” that served to order the infinite variety of Gs according to some
general (symbol counting) metric. So, in place of this, he proposed that all Gs
have the same general formal structure and only differ in a finite number of empirically
pre-specified ways. On this revised picture, Gs as a whole are no longer
formally simpler than one another. They are just parametrically different.
Thus, in place of an overall simplicity measure, P&P theories concentrate
on the markedness values of the specific parameter values; some values being
more highly valued than others.
Let me elaborate a little. The P&P “vision” as
illustrated by GB (as one example) is that Gs all come with a pre-specified set
of rules (binding, case marking, control, movement etc.). Languages differ, but
not in the complexity of these rules. Take movement rules as an example. They
are all of the form ‘move alpha’ with
the value of alpha varying across languages. This very simple rule has no
structural description and no structural change, unlike earlier rules like
passive, raising, or relativization. In fact, the GB conceit was that rules
like Passive did not really exist! Constructions gave way to congeries of
simple rules with no interesting formal structure. As such there was little for an evaluation measure
to do.[12]
There remains a big role for markedness theory (which parameter values are preferred over others, i.e. priors in Bayes speak), but these do not seem to have much interesting formal structure.
Let me put this one more way: the role of the evaluation
metric in Aspects was to formally order
the relevant Gs by making some rule formats more unnatural than others. As
rules become simpler and simpler, counting symbols to differentiate them becomes less and less useful. A P&P theory need not order Gs, as it specifies the formal structure of every G: each is a vector with certain values for the open parameters.
The values may be ranked, but formally, the rules in different Gs look
pretty much the same. The problem of
acquisition moves from ordering the possible Gs by considering the formally
distinct rule types, to finding the right values for pre-specified parameters.
As it turns out, even though P&P models are feasible in
the required formal sense, they still have problems. In particular, setting
parameters incrementally has proven to be a non-trivial task (as people like
Dresher and Fodor & Sakas have shown) largely because the parameters
proposed are not independent of one another. However, this is not the place to
rehearse this point. What is of interest here is why evaluation metrics gave
way to P&P models, namely that it proved to be impossible to find general
evaluation measures to order the set of possible Gs and hence impossible to
specify (v) above and thus attain explanatory adequacy.[13]
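To set the two pictures side by side with the symbol-counting sketch above, here is the P&P picture in the same toy style, together with the setting problem just mentioned. The two parameters and the word-order table are a drastically simplified version of the standard illustration from the parameter-setting literature (head direction plus V2); nothing here is meant as a serious fragment.

```python
from itertools import product

# In P&P a grammar is a vector of pre-specified binary parameter values;
# acquisition is fixing those values from the PLD.

def surface_orders(head_initial, verb_second):
    """Crude mapping from a parameter setting to the main-clause orders it allows."""
    base = ("S", "V", "O") if head_initial else ("S", "O", "V")
    if not verb_second:
        return {base}
    # crude V2: the verb surfaces in second position, some other constituent first
    rest = [x for x in base if x != "V"]
    return {(first, "V", *(x for x in rest if x != first)) for first in rest}

grammars = list(product([True, False], repeat=2))     # four parameter vectors
datum = ("S", "V", "O")
consistent = [g for g in grammars if datum in surface_orders(*g)]
# Three of the four vectors survive, including head-final + V2: a plain SVO
# sentence fixes neither parameter on its own. That interdependence is the
# Dresher / Fodor-Sakas incremental-setting problem in miniature.
```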
Let me end here for now (I really want to return to these
issues later on). The Aspects model outlines a theory of
acquisition in which a formal ordering of Gs is a central feature. With such a
theory, the space of possible Gs can be infinite, acquisition amounting to
going up the simplicity ladder looking for the simplest G compatible with the
PLD. The P&P model largely abandoned
this vision, construing acquisition instead as setting a finite number of fixed
parameters (some values being “better” than others (i.e. unmarked)). The
ordering of all possible Gs gave way to a pre-specification of the formal
structure of all Gs. Both stories are
compatible with Bayesian approaches. The problem is not their compatibility,
but what going Bayes adds. It’s my
impression that Bayesians as a practical matter slight the concerns that both Aspects style models and P&P models
concentrate on. This is not a matter of principle, for any Bayesian story needs
what Chomsky has emphasized is both central and required. What is less clear, at least to me, is what
we really learn from models that concentrate more on Bayes rule than the
structures that the rule is updating. Enlighten me.
[1]
It’s worth emphasizing that what is offered here is not an actual learning theory, but an idealized one. See his notes 19 and 22 for further discussion.
[2]
Chomsky suggests the convention that lower valued Gs are associated with higher
numbers.
[3]
One thing I’ve noticed though is that many Bayesians seem reluctant to conclude
that this information about the hypothesis space and the decision rule are
innately specified. I have never understood this (so maybe if someone out there
thinks this they might drop a comment explaining it). It always seemed to me that were they not
part of the LAD then we had no explanation of acquisition. At any rate, Chomsky
did take (i-v) to specify innate features of the LAD that were necessary for
acquisition.
[4]
With the choice being final after some period of time t.
[5]
See especially note 22 where Chomsky says:
What is required of a
significant linguistic theory…is that given primary linguistic data D, the
class of grammars compatible with D be sufficiently scattered, in terms of
value, so that the intersection of the class of grammars compatible with D and
the class of grammars which are highly valued be reasonably small. Only then
can language learning actually take place.
[6]
See OJB discussed here. It is worth noting that many Bayesians take Bayesian
updating over the full parameter
space to be a central characteristic of the Bayes perspective. Here again is OJB:
It is a central characteristic of fully Bayesian
models that they represent the full state space (i.e. the full joint
probability distribution across all parameters) (1171).
It
is worth noting, perhaps, that a good part of what makes Bayesian modeling
“rational” (aka “optimal,” its key purported virtue) is that it considers
all the consequences of all the evidence. One can truncate this
so that only some of the consequences of only some of the evidence are relevant,
but then it is less clear what makes the evaluations “rational/optimal.” Not that there aren’t attempts to truncate
the computation citing resource constraints and evaluating optimality wrt these
constraints. However, this has the tendency of being a mug’s game as it is
always possible to add just enough to get the result that you want, whatever
these happen to be. See Glymour here. However, this is not the place to go into these
concerns. Hopefully, I can return to them sometime later.
[7]
Indeed, many of the papers I’ve seen try to abstract from the contributions of
the priors. Why? Because sufficient data washes out any priors (so long as they
are not set to 1, a standard assumption in the modeling literature precisely to
allow the contribution of priors to be effectively ignored). So, the papers I’ve seen say little of a
general nature about the hypothesis space and little about the possible priors
(viz. what I have been calling the hard problem).
[8]
See for example the Perfors et al. paper (here).
The space of options is pretty trivial (5 possible grammars: three regular and two PCFGs) and it is hand-coded in. It is not hard to imagine a more realistic
problem: say including all possible PCFGs. Then the choice of the right one
becomes a lot more challenging. In other words, seen as an acquisition model,
this one is very much a toy system.
[9]
Chomsky emphasizes that the notion of simplicity is proprietary to Gs; it is
not some general notion of parsimony (see p. 38). It would be interesting to consider how this
fits with current Minimalist invocations of simplicity, but I won’t do so, or
at least not now, not here.
[10]
This model was more fully developed in Sound
Patterns and earlier in The
morphophonemics of modern Hebrew. It also plays a role in Syntactic Structures and the arguments
for a transformational approach to the auxiliary system. Lasnik (here)
has some discussion of this. I hope to write something up on this in the near
future (I hope!).
[11]
Which is precisely what makes parameters internal to FL minimalistically challenging.
[12]
This is not quite right: a G that has alpha = any category is more highly
valued than one that limits alpha’s reach to, e.g., just DPs. Similarly for things like the head parameter.
Simplicity is then a matter of specifying more or less generally the domain of
the rule. Context specifications, which played a large part in the earlier
theory, however, are no longer relevant given such a slimmed-down rule structure. So the move to simple rules does not entirely eliminate considerations of formal simplicity, but it truncates them quite a bit.
[13]
The Dresher-Fodor/Sakas problem is a problem that arises on the (very
reasonable and almost certainly correct) assumption that Gs are acquired
incrementally. The problem is that unless the parameter values are independent,
no parameter setting is fixed unless all
the data is in. P&P models abstract away from real time learning. So
too with Aspects style models. They
were not intended as models of real time learning. Halle and Chomsky note this on p. 331 of Sound Patterns where they
describe acquisition as an “instantaneous process.” When Chomsky concludes that
evaluation measures are not feasible, he abstracts away from the incrementality
issues that Dresher-Fodor/Sakas zero in on.
Bayes does have rather different things to say about Aspects and P&P since infinite hypothesis spaces have different asymptotic guarantees than finite ones. There is a lot of theory in asymptotic statistics that I won't pretend to know much about, but the high-level message is that the Bernstein-von Mises theorem, which gives the nice "your prior doesn't matter with enough data" guarantee, doesn't hold with countably infinite hypothesis (parameter) spaces. As someone who works on lots of infinite hypothesis spaces (maybe even ones that bear a passing resemblance to Aspects-like things?), my experience is that even with "lots of data" priors matter a great deal in infinite models. In other words, my guess is that the "hard" evaluation function problem probably corresponds to a "hard" prior problem in Bayes-land. And then you still have to do updating. Thus, I suppose there is some sense in which I concede that there isn't a great answer to "what does Bayes buy you?".
On the other hand, I do think the Bayesian program is useful practically, since reasonably uninformative priors (e.g., MDL-like things) often do behave in interesting ways given "lots of data", which lets us think about hypothesis spaces without solving all the hard problems at once.
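One way to see the point about infinite spaces with a trivial computation (all the numbers are invented): if an infinite family of grammars fits the observed data equally well -- essentially the "many Gs compatible with the PLD" scenario from the post -- the posterior over that family is just the renormalized prior, no matter how much data arrives.

```python
import math

K = 10_000                                   # truncate the countable family, for computation only
log_prior = [-(k + 1) * math.log(2) for k in range(K)]   # an invented geometric prior, in logs

def posterior_from_logs(log_prior, log_lik):
    """Bayes in log space (avoids underflow with large data sets)."""
    logs = [lp + ll for lp, ll in zip(log_prior, log_lik)]
    top = max(logs)
    weights = [math.exp(l - top) for l in logs]
    z = sum(weights)
    return [w / z for w in weights]

# Suppose every grammar in the family assigns the same probability to the data:
n_data = 1_000_000
log_lik = [n_data * math.log(0.01)] * K      # identical for each G_k, by assumption

post = posterior_from_logs(log_prior, log_lik)
# post[k] is (up to the truncation) just 2**-(k+1), i.e. the prior. The data never
# adjudicates among these grammars, so the prior -- in effect, an evaluation
# measure -- does all the work, however much data you see.
```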
The Aspects model is missing one key part of the Bayesian model (as it is standardly used), namely the likelihood -- how well the grammar fits the data. As a result the Aspects model has a big problem controlling overgeneralisation (aka no negative evidence, the subset problem, the logical problem of language acquisition, etc.) whereas the Bayes model doesn't.
There are as a result a lot of papers showing theoretical learning results in the Bayes framework and none (as far as I know? This is a good place to find them!) in the Aspects model.
I want to elaborate on Alex Clark's point here. If we think about the likelihood function in the right way, we can see why getting it right is important for paring down the number of “relevant” grammars.
The likelihood function defines how well the grammar fits the data, and can be richly structured. If the likelihood function is richly structured, it will be a product of many local factors. Because probabilities must sum to one, this product will result in very small numbers. The likelihood function then decays exponentially in the number of local factors (= richness of the structure; and also usually the length of the sentence, but we usually take the sentence lengths to be fixed in the PLD). However, it will decay much more slowly for grammars that fit the data well. So a richer likelihood function implies an exponentially larger space of possible grammars, but it also means that the likelihood of "bad" grammars decays exponentially more quickly. This has the effect that a relatively small set of settings for any group of variables, including grammar-level variables, dominates the probability mass. This small set of settings is called the typical set.
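A toy numerical version of this, with invented per-factor probabilities: both likelihoods are astronomically small, but the badly fitting grammar's shrinks exponentially faster, so essentially all the posterior mass ends up on the small well-fitting (typical) set.

```python
import math

def log_likelihood(p_per_factor, n_factors):
    """Likelihood as a product of local factors, computed in log space."""
    return n_factors * math.log(p_per_factor)

n = 1000                                 # total number of local factors across the PLD
good = log_likelihood(0.50, n)           # grammar that fits each factor well: about -693
bad  = log_likelihood(0.05, n)           # grammar that fits each factor badly: about -2996

log_odds = good - bad                    # about 2300 nats
# The bad grammar carries roughly e**-2300 of the good grammar's posterior weight,
# so poorly fitting grammars are negligible even though the space of grammars is huge.
```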
OJB must be addressing the computational level of analysis when they talk about updating the entire state space. Nobody does inference in a non-trivial model by updating the entire state space. Instead, they take sampling or variational approaches to inference. A sampling approach randomly explores the state space in proportion to its posterior probability; if the typical set is small, then it can explore the important parts of the state space relatively quickly, regardless of the size of the entire state space. A variational approach (“variational” here refers to the calculus of variations, not Charles Yang's models) takes a probability distribution, the variational distribution, that has a simpler functional form with fewer degrees of freedom than the true posterior, and squishes it so that it matches the true distribution as closely as possible. If the typical set is small, then the reduced degrees of freedom of the variational distribution will more readily approximate the true posterior.
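As a concrete stand-in for the sampling idea (a generic sketch, not anyone's actual model): a Metropolis-style sampler only ever scores the current grammar and one proposed neighbour, so its per-step cost does not depend on the size of the whole space.

```python
import math
import random

def metropolis_over_grammars(log_posterior, propose, start, steps=10_000):
    """Random walk over grammars. `log_posterior` is the unnormalized log posterior
    and `propose` suggests a neighbouring grammar (assumed symmetric). Each step
    scores just two grammars, never the full space; the sample frequencies
    approximate the posterior over grammars."""
    g, samples = start, []
    for _ in range(steps):
        candidate = propose(g)
        accept_logprob = min(0.0, log_posterior(candidate) - log_posterior(g))
        if random.random() < math.exp(accept_logprob):
            g = candidate
        samples.append(g)
    return samples
```

With a binary-parameter space, `propose` could flip one parameter at random; the walk then concentrates on high-posterior grammars without ever sweeping the other 2**n - 1 settings at each step.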
I agree with you (Norbert) that a Bayesian approach is consistent with Chomsky's Aspects and P&P models. But as Alex points out, because Bayesian inference is probabilistic rather than logical, it can avoid problems with negative evidence that plague Aspects and P&P approaches. It's also not true that an inference problem necessarily gets harder as the state space grows because there are inference methods that avoid searching the entire space. In fact, relaxing a discrete space inference problem (e.g., with Boolean parameters) to a continuous one (e.g., where the parameters are probabilistic) can sometimes dramatically simplify an inference problem. There's been a huge explosion of methods for approximate Bayesian inference, such as the variational approximations John mentions, which let us perform inference over grammars without enumerating them. Other techniques can perform inference over infinite parameter spaces (as Chris mentions); I've used these techniques to model the acquisition of the lexicon.
I think there are several reasons why there's not much work integrating Bayesian inference with a Chomskyian approach to grammar. Both areas are very technical, and there just aren't many people with expertise in both areas. There's also a big difference in terms of what counts as a "success" in both fields.
I think the complexity problem is an issue that Bayesians haven't yet come to grips with, and it's worth thinking about.
From one point of view, the Bayesian program is a computation level model (like the Aspects one) and so the issue of multiple choices and the size of the hypothesis space is a question of convergence -- i.e. will this function converge to the right grammar? -- rather than a computational complexity issue -- how many alternatives can the algorithm consider given the computational constraints that it operates under?
But of course it may be the case that there just are no algorithms that can efficiently compute or approximate the function defined by the computational level theory.
And further we know that this is in fact the case because of the complexity results (Abe & Warmuth, Gold 78, etc. etc., all the way up to Cohen and Smith).
And that seems to me to be a very good argument that the approach needs to be constrained in some further way. But those constraints don't, in the Bayesian approaches, need to be baked into the prior in the way that they do in the Aspects model.
Looks like I'm a little late to the party but I'll have a go. I'm coming from the perspective of someone on the "other side" though I also don't self-identify as a Bayesian. So really I have no foot in either camp. I learned (Chomskyan) syntax during the GB days so it's nice to see this historical perspective!
First, the quote from OJB is just false. Mark already pointed out one reason why. OJB appears to be suggesting something like curse of dimensionality, and, yes, it's true that sometimes (maybe even often) going to a higher dimensional space increases your computation exponentially, but frequently this is not the case, particularly if you are willing to make independence assumptions. There are also well known (introductory) results from computational learning theory that show that learning becomes easier (i.e., drops from exponential time/sample complexity to polynomial time/sample complexity) when you increase the size of your hypothesis space.
Okay, now I may be misreading, but if I cast all of this analysis in a learning theory framework (forgetting Bayes for a moment -- he is pretty incidental to the discussion here), the basic change from Aspects to P&P was that the hypothesis space got a lot smaller. As a result, we quite naturally do not require an ordering on the set of hypotheses because the data will be consistent with far fewer of them. Now, whether this makes learning easier or harder is a totally open question. As I said before, smaller is not always easier to learn!
But I feel like this isn't saying anything useful. If you like P&P, then take all grammars that are consistent with P&P and give them 5 points, and all that are not consistent with P&P and give them 1 point. There's your "prior" and now you can run the Aspects computation and nothing changes. (This argument acutely hinges on the logical/probabilistic issue that Alex and Mark raised, which I think is actually a huge issue missing from the discussion.)
And I guess while I'm here, I feel like I need to push back so Norbert can put me in my place :). The whole notion of "identify the right grammar from a set of possible grammars" just feels like a horribly broken question to me. Would you really argue that Hal, Norbert, Alex and Mark all have the same grammar in our heads? That seems like quite a stretch. But once you admit that we might have different grammars, then the whole P&P business makes much less sense: there just aren't enough grammars to go around.
I think this last point is actually one of the reasons why pragmatic Bayesians might lean closer to an Aspects interpretation than a P&P one. At least speaking for myself, I find it hard to believe that all English speakers acquire the same G, and I would go further to say that no two English speakers acquire the same G. Once you take this position, P&P just cannot work, except in an uninteresting way, that turns it back into Aspects.
I think the reason we want to say that people get (roughly) the same grammar in their heads from (roughly) the same data is that people generalize in (roughly) the same way, and assigning a grammar to bodies of data as proposed in Aspects seems to be a reasonable way to produce a theory of generalization that says what they will do from given data (the only way that appears to have any traction in grammar learning afaik).
Forex, in all of the 81 NPs in Child Directed Speech in the CHILDES English database (NA & UK, about 14.4 million words if my counting script is not horribly wrong) that I've found that have a possessor which is itself possessed by a full NP ('Kalie's cow's name', etc.), the innermost possessor is always a proper name or kinship term ('Daddy', 'Auntie Marian'), and never, for example, a pronominally possessed common noun ('your friend's dog's name'), but this is not a generalization that people learn. A theory that causes people to acquire a rule roughly equivalent to NP -> (NP POS) ... N .. rather than something without recursion in the possessor position can explain this; a data-hugging exemplar-based learner might miss it. Such a learner would also be in peril of learning that the top possessor must be 'name' (most of them are), and that the two possessums can't have compositional semantics (there are no clear examples of this; the closest one is 'Karen's little boy's ..' but it's unclear what this really is, and 'little boy' is presumably common enough to be noncompositional). The other 4 are like 'Marcella's mother's dressing table' with a compound as a possessum (this is all adult/sibling speech, not CHI). This is actually a bit mysterious, because there is also clearly a certain amount of exemplar-based-looking stuff going on, as described in the construction grammar literature.
So there's a fair amount of pretty basic stuff to pick up, without a well developed doctrine of how it happens, or, even, as far as I can see, a well-developed story about what needs to be picked up on the basis of what, let alone how.
Chomsky's 'acquire roughly the same grammar from roughly the same data' is at least a way to think about it; an alternative would be interesting if somebody produced such a thing, but has anybody?
Nearly all of machine learning focuses just on learning a hypothesis that is extensionally equivalent to the target concept being learned.
It's complicated in the case of language acquisition by the question of whether you consider the meanings to be observed or not.
If you think the meanings are observed then there is, AFAIK, no argument in favour of requiring the learners to have the same (or structurally equivalent) grammars.
If you think meanings are not observed, then there is an argument that the grammars learned must be structurally equivalent in some way to the target grammar(s). But structural equivalence does not in general imply identity or isomorphism.
I don't think you can do an inference to the best explanation from the Aspects model, since the Aspects model doesn't work and thus fails to explain anything.
@Hal: Do we all have the same G? Probably not. Indeed, I doubt that each of us has only one G. Recall that the project has been to find that G that an *ideal speaker hearer* would acquire, not what an actual speaker does. Why? Well, it's an idealization to simplify the problem. Is it a worthwhile one? It is if you believe that the mechanisms required to acquire a G in the idealized case are similar enough to what is required in the less idealized ones. Now, from what I can gather (but I am willing to be corrected), having a homogenous PLD without errors (slips), incomplete sentences, unclear articulations, etc., as opposed to what kids are really exposed to, should simplify the acquisition problem. So, can we even explain what goes on in this case? And, if so, is it reasonable to think that the mechanisms we identify are implicated also in the *real* case (up to statistical massaging)? If it is, then the fact that we don't all have the same Gs is not a problem. We know this. The real question is whether we all acquire the same TYPES of Gs as in the ideal case. So far we have little reason to think that we don't (see Avery's comment) and we have a hard enough time handling this case.
In Aspects Chomsky makes it clear that he is discussing a very ideal case. For example, there is no incremental learning, as the evidence is assumed to be provided all at once. So the Aspects model is taken to be a necessary, not sufficient, step in developing an acquisition model. The question is whether this idealization seems right. Why do you think it isn't? Do you think that the complex case results in a problem that is different in kind from the ideal one? It might, so this is a real question. I just don't know.
@Alex: Is meaning observed or not? Well, some is and some isn't. Do we observe quantifier scope interactions in the complex cases? I personally doubt it. Do we observe theta roles? I suspect we do, at least in part. Do we observe principle C effects? Well not in the general case and certainly the exceptions are remarkably rare. Do we observe WCO effects? I doubt it.
As you know, these are the sorts of questions that deploy the full syntactician's armamentarium. Here we postulate structure and conditions on it and infer stuff about UG. So, the question you raise, how much meaning is visible, is the right question. And like many other such questions, PoS considerations suggest that we need a lot of UG.
Suppose we have an idealised uniform community of language users, who agree exactly on the set of legitimate sound/meaning pairings in the language. By definition the grammars (I-languages if you must) that they all have are extensionally equivalent in the sense that they all generate exactly the same set of sound/meaning pairings.
What are the arguments that the grammars must be equivalent in some stronger sense? i.e. that
A) the grammars generate the same set of structural descriptions
or
B) the grammars are isomorphic or intensionally equivalent
It's not an issue that has been discussed much (other than the Quine/Lewis/Stich papers from way back, which aren't very helpful). I don't find any of the arguments very good.
Linguists assume similar structures across grammars and speakers for a variety of reasons.
The weakest one is a methodological one: it is the easiest route to take, and it doesn't run into major problems most of the time. If you assume that all grammars work more or less the same, you can bootstrap results from one language and apply them to similar looking problems in other languages. And that works surprisingly well most of the time, so there's little to be gained from dropping that assumption when it comes to empirical coverage.
The more profound argument is that speakers of the same community generalize in similar ways (wrt acquisition, meaning of nonce words, acceptability of novel constructions, etc), and that doesn't follow if you have extensionally equivalent sound-meaning mappings. Just because two functions agree on the values for every a in A doesn't imply that they agree on A ∪ {b}. So if you reject that the grammars are very similar and thus allow for the same kind of structural generalizations, you have to assume that either 1) the differences are so minimal and systematic that they don't matter for generalization, which is just a less idealized version of what linguists do, or 2) the generalization procedures also differ between speakers, in exactly such a way that speakers make the same generalizations nonetheless.
If the latter turns out to be true, fair enough, then linguists are working with some kind of normal form that has no analogue on the individual level. How would you prove it to be true, though?
I'm also not sure what would be gained on the learnability level. Let's take your recent paper on inferring trees from strings. Rather than inferring one grammar for a given string set via some function h, we have a relation that associates the string set to any grammar that generates it (or a subclass for which that problem is decidable in polynomial time if you care about computability). How is that an important change?
@Alex:
Thomas took the word sour of my mouth (actually, gave a more coherent version of what I would have liked to say). The issue is now and always has been the projection problem: why/how do we generalize as we do. The structures are there to license/favor some generalizations and forbid/restrict others. This is something that intensional specification of functions does well and that's why we postulate them. Do people generalize in the same way? For some things, the things that PoS types like me focus on, the answer is yes. For other things, not so much. It would be nice to have an explanation for the former cases and this is where we postulate lots of grammatical structure.
For sophistications on this point, see Thomas above.
'Sour'? Wth! It's "out of my"! I hate auto-correct!
I am assuming that the language community consists of people with identical LADs exposed to slightly different samples of the PLD. They all converge to extensionally equivalent grammars that are slightly different -- maybe one memorizes larger chunks of frequently occurring phrases, or has a slightly more fine-grained set of categories (unnecessarily fine-grained). In this circumstance I would expect them to generalize in the same way to nonce words, since the generalization device is identical.
It may not matter in learnability terms -- but it matters a bit with CFGs, and my hunch is it will matter a lot more when you move to feature based grammars like MGs, because there are lots of equivalent feature systems that will generate the same (or equivalent) derivation trees. I don't want to try to prove something unnecessarily hard.
Do you think these are things that we could test experimentally (or what I really mean is, getting some data other than simple judgments) - like do you see a way of operationalizing any of the fine inter-speaker differences that might need to be tolerated for various learnability results to hold? It might be good to have a list of them, a clear statement of what the learnability difference means practically, and then see how many of them could be sussed out empirically.
DeleteI think it would be pretty difficult to identify any real correlates of such subtle differences. There might be structurally priming effects, but the current methods aren't delicate enough to detect inter-personal differences, AFAIK.
Indeed they may not be real differences in the sense that while it might be meaningful to talk of two distinct grammars that differ in some respect, it wouldn't be meaningful to talk of them as describing two distinct states of the brain.
E.g. say you have a CFG which has some rules with right hand sides longer than 2 and a binarised version of that grammar. Is there going to be a meaningful difference between the cognitive states described by these two grammars?
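For what it's worth, the contrast is easy to make concrete (the grammar fragment below is my own toy, not anything from the thread): binarization preserves the string language but changes the constituents, so the two grammars are weakly but not strongly equivalent.

```python
from itertools import product

def binarize(rules):
    """Split every rule A -> X1 ... Xn (n > 2) into a chain of binary rules with
    fresh nonterminals: A -> X1 A_1, A_1 -> X2 A_2, ... (same strings, new constituents)."""
    out, fresh = {}, 0
    for lhs, rhss in rules.items():
        for rhs in rhss:
            cur_lhs, cur_rhs = lhs, list(rhs)
            while len(cur_rhs) > 2:
                new = f"_X{fresh}"
                fresh += 1
                out.setdefault(cur_lhs, []).append([cur_rhs[0], new])
                cur_lhs, cur_rhs = new, cur_rhs[1:]
            out.setdefault(cur_lhs, []).append(cur_rhs)
    return out

def strings(rules, symbol, depth):
    """All terminal strings derivable from `symbol` in at most `depth` expansions."""
    if symbol not in rules:                          # terminal
        return {(symbol,)}
    if depth == 0:
        return set()
    result = set()
    for rhs in rules[symbol]:
        for combo in product(*[strings(rules, s, depth - 1) for s in rhs]):
            result.add(tuple(w for part in combo for w in part))
    return result

FLAT = {"S": [["NP", "V", "NP"]],        # one ternary rule: no VP-like constituent
        "NP": [["Kim"], ["Sandy"]],
        "V": [["saw"]]}

BIN = binarize(FLAT)                     # S -> NP _X0, _X0 -> V NP: a "VP" appears
assert strings(FLAT, "S", 4) == strings(BIN, "S", 4)   # same strings, different structure
```

Whether that structural difference corresponds to a difference in cognitive states is exactly the question being asked here; weak equivalence alone won't settle it.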
"Is there going to be a meaningful difference between the cognitive states described by these two grammars?"
I fear as long as Norbert [or anyone defending the same view] refuses to commit himself to a proposal that could be implemented in a human brain this entire debate remains academic. Either they are serious about BIO-linguistics or they are not. There are many linguists who can do at least as well [and a lot better] as [than] Norbert & co on the syntax/linguistics proper front. So unless he has some genuine insights into the BIO of LF why would it even matter what he answers?
@Alex: Your example is interesting, although maybe not in the way you intended.
A binarized CFG defines different constituents than a non-binarized one. So if the two speakers nonetheless agree on all sentences, whatever work the notion of constituency is responsible for in the binarized grammar will have to be handled by some other concept in the non-binarized CFG. Without restrictions on what is a valid generalization mechanism, we can match up each grammar with distinct mechanisms so that both speakers generalize exactly the same way irrespective of constituency --- differences in the grammars are then indeed untestable.
But there are clear restrictions on how people generalize (which you can see in artificial language learning experiments; irrespective of whether they tell us anything about UG, they do show that some methods of generalization are never entertained).
So we can actually turn this into a practical exercise: let's compare two grammars, say, one where PPs are not binarized, and one where they have the form PP --> P NP. How complex a generalization mechanism do we need so that the former grammar can build the basis for the cluster of inferences that's already inherent in the PP rule of the second grammar: that a preposition can be followed by anything that is a well-formed NP, that it can be followed by multiple coordinated NPs, that the preposition can be stranded, and so on.
Or the other way round: how much can grammars G and G' differ such that we can find two psychologically plausible generalization mechanisms M and M' with M(G) = M'(G')? That's something that can be both tested (psychological plausiblity in its weakest form means avoiding structural generalizations that humans don't make) and studied formally.
My own hunch is that if we go even further and start optimizing the complexity of the composite bundle of grammar and generalization procedure, we'll end up with something that looks awfully close to what linguists are already doing.
There's a lot packed in here, it sounds like you're conceding the point that -- even in a cleanly controlled experimental setting where you could control exactly what the learner's generalized FROM, and therefore characterize exactly how they've extended their input -- you're saying, even in that setting, you pretty much wouldn't be able to tell which of the two grammars the subjects were using
... EXCEPT if you start to put strong (complexity) constraints on the inference (generalization) mechanism. Have I got that right?
If so, what exactly is it that makes you think the generalization mechanism that would give you the binary solution is less complex (it sounds like that's what you're guessing)? Maybe it's very simple, but just to spell it out, and to see if I've got this right.
Reverting to Norbert’s comments at the beginning, the Aspects chapter may be a brilliant piece of Chomskyan thinking but IMHO it is a prime example of where this thinking goes wrong. I refer in particular to §6 that Norbert quotes from and the ‘acquisition model’ of schedules (12) – (14). The chief flaw is the failure to recognise the prior development of cognitive, conceptual and LOT capabilities. These build a large part of the space in which acquisition occurs. The process he describes never recovers from this failure, which is a consequence of ignoring the richness of semantic data in PLD.
A more appropriate sequence, I propose, is along the following lines that are entirely dependent on the small set of innate biological/ cognitive/ conceptual/semantic factors that I have sketched in earlier posts:
Conditions for the emergence of cognitive, conceptual and linguistic competence.
(i) Possession of the means of interpreting patterns of sensory input signals as recurrent experiential gestalts (categorisation);
(ii) the means of interpreting such gestalts as significant in terms of their effects on the experiencer (cognition);
(iii) the means of representing such gestalts independent from their direct stimuli (conceptualisation);
(iv) the means of freely combining these gestalts in macro-gestaltic chunks (LOT);
(v) the accumulation in memory of discrete clusters of vocal sounds (word sounds);
(vi) the combination of significant experiential gestalts with learned sequences of articulated sounds (word formation);
(vii) the accumulation in memory of structural patterns of word combinations (phrases and short sentences);
(viii) the articulation of LOT macro-gestalts into linear verbal and sign language of growing complexity using learned constructions and following pragmatic principles (spoken and signed phrases, clauses and sentences).
The semantic factors have a critical function as parameters of the dimensional space of all of these operations. They form the US, Universal Semantics, that operates alongside grammars. The semantic factors have an extraordinarily wide range of functions across these fields and crucially in human behaviour. The power of the factors stems from their meshing physical dimensions and affective/evaluative dimensions. This is the real biolinguistics.
Chomsky showed substantial interest in and insight into semantic features but he has always stressed their inscrutability. This was particularly clear in ‘Aspects’ and the ‘Science of Language’ interviews. But a canonical type, the semantic factors, has always been under our noses. See my paper ‘How to Spell the Meanings of Words’ (currently subject to revision) at trevorlloyd.ac.nz