Comments on Faculty of Language: Cedric Boeckx replies to some remarks by Hornstein on Berwick and Chomsky's "Why Only Us"

@Willem. That is interesting. I was taking the vie...

2017-05-11T16:17:21.370-07:00

@Willem. That is interesting. I was taking the view to be one that didn't really endorse a competence/performance distinction. How is understanding depicted when it separates from performance?

Just on 'empiricism': I had in mind more the traditional picture of abstraction as a means of arriving at the constituents of cognition a la Locke et al., rather than a direct opposition to nativism, although the two strands obviously coalesce.

I agree that reality is messy. I suppose the issue is how far explanation will take us into the mess or force us to always remain at a level of idealization and treat the mess as noise. Pleased to hear the PDP-ers are aghast :)

What makes this new generation of networks so inte...

2017-05-11T14:32:51.859-07:00

What makes this new generation of networks so interesting is that they appear to pass an equivalent kind of test for 'recursiveness' that you now formulate for humans. I.e., if the network we describe understands (2+3)-7, then it also understands (7-3)-(2+(3+2)). There is no finite bound on this understanding, although there is on performance: for longer expressions the expected error becomes larger.

The models remain simplistic, and it is easy to find unrealistic properties. But we should use these models not as efforts to *imitate* human language processing, but as conceptual tools: as a kind of thought experiment that helps us critically assess whether the reasoning steps we take as linguists -- the ways we go from empirical observations to theoretical constructs -- are valid.

I don't think 'empiricism' or 'nativism' are very useful labels anymore for this kind of discussion. When I talk to people in old school connectionism, they think I am too obsessed with hierarchical structure, and too open to language-specific biological specialisations. But I think both camps are missing something important: there is non-trivial, hierarchical structure in natural language, and there are now neural models that can really do justice to it, in the sense that they support the kind of generalizations that you discuss. Symbolic grammars remain a useful level of description, as they avoid the unwieldy mess of matrices, vectors, and nonlinearities in neural networks. But they are best thought of as an approximation/idealisation of the messy neural reality. For many linguistic questions working with that approximation is fine, in the same way that Newtonian physics is perfectly fine for physics at a day-to-day scale. But for some questions, including those about the neural basis and evolution of language, it is not.

Their understanding of any given instance of a rel...

2017-05-11T13:51:03.924-07:00

Their understanding of any given instance of a relevant construction provides the evidence. If I understand 'The boy behind the girl', then I understand 'The girl behind the boy behind the girl'. That is something like a logical truth. There is no finite bound on this understanding, although there is on performance; any such bound would be arbitrary, and get the counterfactuals wrong (if I had enough time,...). It is a bit like asking how you know there is no greatest natural number; maybe we only ever approximate the unbounded. I think empiricism is conceptually false for these kinds of reasons. It trades in the clear and distinct, as Descartes would have it, for a more-or-less approximation without explaining what is approximated, for what is approximated is presupposed in treating the approximation as an approximation to it.

What is the evidence that humans arrive at it, r...

2017-05-11T05:39:20.132-07:00

What is the evidence that *humans* arrive at it, rather than approximate it?

@Willem. Thanks for the clarification and the furt...

2017-05-10T12:26:25.310-07:00

@Willem. Thanks for the clarification and the further interesting thoughts. Just a little point, if I may. I agree, of course, that any specific level of embedding would be arbitrary as a criterion for 'recursion' (here just letting recursion be CF-ish). I take the Chomsky-Norbert line hereabouts precisely to be that you only have recursion if there is no finite level that would suffice for recursion; hence, there is no finite criterion. That is, the relevant rule/principle/function delivers an infinite output (any restriction would presuppose the unrestricted case). Thus, no mere approximation to the ideal would count, just as, as discussed above, no irregular shape counts as a circle. What is real here is nothing platonic, an infinite object, but just the principles that apply unrestrictedly. Something like this holds for counting. One can count only once one has a conception of the infinite, albeit implicitly. FWIW, I tend to steer clear of evo issues, but recursion in the relevant sense is an all or nothing property, so I still wonder how one can arrive at it gradually. If I read you aright, you claim we never arrive at it, but only approximate it, which seems conceptually awry, although I appreciate how it undermines the argument at hand.

@Alex C. I'm not sure if I really understand t...

2017-05-10T09:42:31.150-07:00

@Alex C. I'm not sure if I really understand the argument either, and I don't have any particular sense of recursion in mind, for whatever that's worth.

I feel the RNN example is a little more convincing...

2017-05-10T09:21:10.108-07:00

I feel the RNN example is a little more convincing than e.g. a PDA with a finite bound on the stack, for a couple of different reasons, but that depends on some of the details of the argument that it is meant to be a counterexample to, and those details are a little obscure. (For a start, what sense of "recursion" we are meant to be using; Willem and Alex D seem to be converging on a sense that has something to do with the boundary between regular and context-free formalisms, and I am not sure that is the relevant sense.)

Agreed. But somehow people keep making the same ar...

2017-05-10T09:04:33.858-07:00

Agreed. But somehow people keep making the same argument about 'no gradual route to recursion', so I have tried to illustrate the counterargument in various ways. Plus: we're interested in RNNs' expressivity for its own sake, and in ways of learning hierarchical structure from examples (here through backprop).

@Willem. I originally said "finite state tran...

2017-05-10T08:38:24.959-07:00

@Willem. I originally said "finite state transducer". The device I described in my previous comment is a transducer with finite memory. Depending on exactly how the relevant terms are defined, it either is a finite state transducer, or is trivially equivalent to one.

You could indeed use such a device to make the same point as you were making with RNNs. That is why I questioned whether the RNN simulations are really all that relevant to the point that you, Alex C and others are making about recursion. It seems to me that this is a very straightforward point that can be made in a few sentences on the basis of a bit of armchair reflection.

@AlexD With 'finite state approximation' I...

2017-05-10T08:21:52.530-07:00

@AlexD By 'finite state approximation' I meant an FS automaton that needs a separate subgraph for each length of string. You're talking about an automaton/transducer with a memory, very similar to a PDA (but with a bounded stack). And yes, such a model can both model the recursion and show graceful degradation. In fact, you could use it to make the exact same point I was making with RNNs: by gradually varying the memory capacity from 0 to infinity you show a gradual path to recursion.
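
To make the bounded-stack idea concrete, here is a minimal sketch in Python. It is an illustration only, not code from anyone in this thread: the function names `accepts` and `max_depth_handled`, the choice of the toy language a^n b^n, and the `probe_up_to` limit are all assumptions. Note that this toy version fails abruptly once the stack is full, so it shows only the "one rule, adjustable capacity" part of the point, not graceful degradation.

```python
def accepts(string, capacity):
    """Recognise a^n b^n using a stack that holds at most `capacity` symbols.

    `capacity` may be any non-negative number, including float("inf").
    """
    stack = []
    seen_b = False
    for symbol in string:
        if symbol == "a":
            # Reject an 'a' after a 'b', or an 'a' that does not fit in memory.
            if seen_b or len(stack) >= capacity:
                return False
            stack.append(symbol)
        elif symbol == "b":
            seen_b = True
            if not stack:
                return False  # more b's than a's
            stack.pop()
        else:
            return False  # symbol outside the alphabet
    return not stack  # every 'a' must have been matched


def max_depth_handled(capacity, probe_up_to=50):
    """Largest n (probing 0..probe_up_to) for which a^n b^n is accepted."""
    return max(
        n for n in range(probe_up_to + 1)
        if accepts("a" * n + "b" * n, capacity)
    )


if __name__ == "__main__":
    for capacity in (0, 1, 3, 10, float("inf")):
        print(capacity, max_depth_handled(capacity))
```

The same push/pop rule is used at every capacity; only the memory parameter changes, and passing `float("inf")` removes the bound altogether.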

@Willem. It’s actually very easy to get graceful d...

2017-05-10T03:57:54.852-07:00

@Willem. It’s actually very easy to get graceful degradation with length using a finite state transducer. Say that the FST has n bits available to store context, and think of the context as a stack of numbers. If the transducer has to store m numbers, then it can use n/m bits to store the value of each number. When an additional number is added to the context, the number of bits available to represent the existing numbers decreases. So the more numbers the transducer has to ‘remember’, the less accurately the value of each number is represented. This is of course exactly what we see with floating point arithmetic in computers. You can convert an array of 64-bit floats to an array of 32-bit floats and halve the storage requirements — but at the cost of precision.
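
The bit-budget argument above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions: values are taken to lie in [0, 1), the budget is split evenly with integer division, and the names `quantize`, `store` and `worst_error` are made up for this example.

```python
def quantize(x, bits):
    """Round x in [0, 1) down to a grid with 2**bits levels."""
    if bits <= 0:
        return 0.0  # no bits left: nothing about the value is retained
    levels = 2 ** bits
    return int(x * levels) / levels


def store(values, total_bits):
    """Store a non-empty list of values in a shared budget of total_bits bits.

    Each value gets total_bits // len(values) bits, so adding more values
    leaves fewer bits (and hence less precision) for each one.
    """
    bits_each = total_bits // len(values)
    return [quantize(v, bits_each) for v in values]


def worst_error(values, total_bits):
    """Largest absolute difference between a value and its stored version."""
    stored = store(values, total_bits)
    return max(abs(v - s) for v, s in zip(values, stored))


if __name__ == "__main__":
    budget = 16  # assumed total number of context bits, n in the text
    for m in (1, 2, 4, 8, 16, 32):
        print(m, worst_error([0.3] * m, budget))
```

With a fixed budget, the worst-case error per stored number is bounded by 2 to the power of minus (n // m), so it can only get worse as m grows; once m exceeds n, no information about the values survives.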

There is a nice quote from this paper in Nature wh...

2017-05-10T02:26:45.208-07:00

There is a nice quote from this paper in Nature which has Partha Niyogi as a coauthor, which puts this very well:

"Thus, for example, it may be possible that the
language learning algorithm may be easy to describe mathematically
while the class of possible natural language grammars may be
difficult to describe."

It has some interesting discussion on this topic, though it is quite technical in parts.

@Charles - you're absolutely right that most f...

2017-05-09T14:48:34.403-07:00

@Charles - you're absolutely right that most formal learning work starts with classes of languages, and tries to find learning algorithms (or guarantees for learning algorithms) for that class, and that in this work (as in most current machine learning, I would say) the learning algorithms come first and questions about what class of languages they can learn are largely unanswered. But I'd argue that formal learning theory has it the wrong way around. The primate brain was there first. Language, when it emerged and started to be transmitted from generation to generation, had to pass through the bottleneck of being learnable by that primate brain, plus or minus a few human-specific and perhaps language-specific innovations. The class of learnable languages ("constraints on variation") is thus in many ways a derivative of the algorithm.

@Alex - A finite-state approximation of a recursiv...

2017-05-09T14:48:13.403-07:00

@Alex - A finite-state approximation of a recursive system is much less interesting, and wouldn't exhibit the graceful degradation with increasing length that we observe in RNNs. You want a model that captures the fundamental relation that exists between strings of different lengths, and that can exploit these relations when, for instance, the language is extended. E.g., imagine we build a model of the context-free language A^nB^n over classes A and B. If we add another member to class A, we want our model to easily accommodate such a change. A PDA does, and so does an RNN with the right weights -- even if its state space is continuous -- but discrete finite-state automata do not.
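
As a rough illustration of the contrast drawn above, the following Python sketch compares a rule-based recogniser for A^nB^n with an explicit table of strings up to a fixed depth. Everything here is an assumption made for the example: the token names, the requirement that the two classes be disjoint, and the helpers `accepts` and `lookup_table`. The table is only a stand-in for a finite-state approximation that spells out each length separately.

```python
from itertools import product


def accepts(tokens, class_a, class_b):
    """Accept n tokens from class_a followed by n tokens from class_b.

    Assumes class_a and class_b are disjoint sets of tokens.
    """
    i = 0
    count_a = 0
    while i < len(tokens) and tokens[i] in class_a:
        count_a += 1
        i += 1
    count_b = 0
    while i < len(tokens) and tokens[i] in class_b:
        count_b += 1
        i += 1
    # All tokens must be consumed and the two counts must match.
    return i == len(tokens) and count_a == count_b


def lookup_table(class_a, class_b, max_n):
    """Enumerate every string of A^n B^n for n = 0..max_n as a set of tuples."""
    table = set()
    for n in range(max_n + 1):
        for left in product(sorted(class_a), repeat=n):
            for right in product(sorted(class_b), repeat=n):
                table.add(left + right)
    return table


if __name__ == "__main__":
    class_a = {"a1", "a2"}
    class_b = {"b1"}
    extended_a = class_a | {"a3"}  # add one new member to class A

    sentence = ["a1", "a3", "b1", "b1"]
    print(accepts(sentence, class_a, class_b))     # new token not yet known
    print(accepts(sentence, extended_a, class_b))  # same rule, larger class

    print(len(lookup_table(class_a, class_b, 3)))
    print(len(lookup_table(extended_a, class_b, 3)))
```

Extending class A is a one-element change for the rule-based recogniser, whereas the explicit table has to be regenerated and grows with every depth it covers.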

@Benjamin & John - You are absolutely right that the models I discussed (here and at Evolang'16) do not address the origins of hierarchical structure. I was only concerned with the question whether or not 'a gradual route to recursion' is a logical possibility (but check out my Evolang'08 abstract :)). I'm the first to admit there are many fundamental questions to be asked about how neural machinery may deal with linguistic structure, and about why natural languages have the complex structure that they do. The frame 'language is recursive, recursion is an all or nothing phenomenon, hence recursive language must have emerged in a single sweep' is, however, unhelpful in answering those questions (just as unhelpful as 'language is context-sensitive, context-sensitive languages are unlearnable, language is not learned' -- Evolang'02), because the inferences are incorrect.
The graceful degradation also answers John's other concern - that there would still be a 'leap' to recursion, even if the underlying processes are continuous. If we agree on a specific criterion for 'recursiveness' there is of course a particular point along a continuous trajectory where we might suddenly meet that criterion. But what would the criterion be? 5 levels of embedding? 4? 3? Any criterion is arbitrary. A neural system can approximate the Platonic ideal of a symbolic recursive system to an arbitrary degree. The Platonic ideal (the competence grammar, if you like) is useful, but not because it is the true underlying system corrupted by performance noise, but because it is the attractor that the continuous system is approaching. Knowing what the attractor is, and *why* it is the attractor, is immensely useful. In the same way, good symbolic grammars that capture important generalizations about language might be immensely useful. But they shouldn't be confused with the cognitive reality.

Let's put aside the evolution business for the...

2017-05-05T14:16:47.066-07:00

Let's put aside the evolution business for the moment. The RNN results are interesting but are difficult to interpret. For almost all we do in formal learning research, a class of languages is defined and then we study its learnability properties (e.g., FSA, CFG, MG, etc.). The RNN stuff is different. It is an algorithm (A), and presumably there is a class of languages, L(A), that is learnable by A. So far as I know, no one in machine learning works in this algorithm-first-language-later fashion.

The results, if sound, show that L(RNN) contains some elements similar to human languages. I'm afraid that unless we understand formally what an RNN actually does, we are not in the position to say much about anything.

I think the transition from a nonrecursive to a re...

2017-05-05T13:21:15.766-07:00

I think the transition from a nonrecursive to a recursive system could have occurred prior to some more explicitly recursive devices evolving, so I think the validity of the argument that Merge must have evolved instantaneously is independent of one's position on the neural reality of recursive grammars.

Thanks Alex - very helpful. So, is your position -...

2017-05-05T11:05:12.103-07:00

Thanks Alex - very helpful. So, is your position - or at least the position you are entertaining - something as follows? There are formally specifiable devices (the kind of devices one finds on the Chomsky hierarchy), but it is potentially a mistake to think that such devices are real beyond systems approximating the idealised behaviour of such devices. In particular, the sharpness of the distinction between one device and another (going from finite embedding, as might be, to unbounded) need not be mirrored in an underlying system that simply approximates one device or another without respecting the sharp cut-offs formally specifiable at the level of the devices.

If that is the position, then a load of questions arise, but one radical thought hereabouts is that we simply are not recursive devices, but only approximate them. That sounds unduly performative, though. Think about the circle case. There are no circles, in one sense, but we see and reason over circles, not irregular arbitrary shapes. Likewise, our understanding of language appears to be sharply unbounded, even though our performance will only ever approximate unboundedness. So, the metaphysical question (well, it might just be the evolutionary one) is why we evolved a system that approximates an ideal device such that the device, rather than the approximation, reflects our understanding, just like we see and understand sharp Euclidean figures rather than variable irregular shapes.

I don't think a formal property need correspond to some 'real' property, but insofar as the formal property is not an artefact of the notation in some sense, it must have empirical content, and so there must be some property, potentially a quite complex one, which the formal property is tracking.

I don't really know what recursive means here ...

2017-05-05T09:53:37.324-07:00

I don't really know what recursive means here, but assuming that a push-down automaton counts as recursive in the relevant sense, then one can view these recurrent NNs as being an approximation of a deterministic push-down automaton where the stack is encoded as a vector (of weights of hidden units).
(This is an oversimplification in several different respects).

A function that calls itself would then correspond under some circumstances to a geometrical property of certain spaces which could be approximated gradually in various ways. So sure a circle is an all or nothing property but things can be round to greater or lesser extents. There need be no great leap.

But more generally why should a formal property of a theory of X correspond directly to some property of X?
Even if the theory is right. What's the standard philosophical example here? Maybe centers of gravity.

Thanks Alex. Yes, I see the idea, but I'm stil...

2017-05-05T04:49:44.005-07:00

Thanks Alex. Yes, I see the idea, but I'm still not sure of the significance. I take Norbert's point, or at least the point Chomsky and Dawkins make, to be a formal one: you either have a function that can call itself, or you don't, just as any function that enumerates a finite set greater than some other finite set is equally distant from enumerating an infinite set as the function that enumerates the lesser set. That looks like some kind of conceptual truth, but it doesn't, all by itself, tell us how any system can or does recurse, as it were, in a way that takes us beyond the specification of the relevant function. Still, the formal truth imposes a constraint on any other story we want to tell, viz., the system must respect some formal condition, such as, at some point, employing a variable defined over strings or being able to loop (something like that). If nothing like that is provided, one is left scratching one's head. Put naively, the point is, given the formal truth of the great leap, what point between 0 and 1 corresponds to the leap? What happened? If there is no answer, then the behaviour of the system at 1 might not be recursive after all; it might just be very robust over length, say. I hope that makes sense.

@Alex, sorry, posting without reading carefully. ...

2017-05-05T03:50:26.419-07:00

@Alex, sorry, posting without reading carefully.

From my point of view though, a finite state transducer or indeed a "humongous look up table" would not serve as an appropriate counterexample.

To amplify, if we take the starting model (recurre...

2017-05-05T03:38:32.515-07:00

To amplify, if we take the starting model (recurrent neural network before training) to be at time t = 0, and the final model (after training) to be at time t = 1, the training sequence (suitably interpolated) will give a continuously varying set of models between t = 0 and t = 1, where there is no discrete change, and yet at t = 0 the system does not exhibit recursive behaviour, while at t = 1 it does. And I think this potentially constitutes a counterexample to Norbert's claim that recursion is an all or nothing affair.

I think the recurrent neural networks illustrate t...

2017-05-05T03:34:39.421-07:00

I think the recurrent neural networks illustrate the gradual emergence of a system that exhibits recursive behaviour. I agree that if the input is trees as in the recursive NN models, then it doesn't really bear on the point at issue.

Right, but we are talking about the recursive ones...

2017-05-05T03:21:42.264-07:00

Right, but we are talking about the recursive ones, which Willem above explained as all involving supervision.

@Alex. Yes I know. That is why I mentioned perform...

2017-05-05T03:20:13.322-07:00

@Alex. Yes I know. That is why I mentioned performance decreasing with the length of the input.