Faculty of Language: A modest proposal

Wednesday, April 4, 2018

A modest proposal

Bill Idsardi and Eric Raimy

[Note: the following owes tremendous debts to Jeff Heinz and Paul Pietroski, but that does NOT imply their endorsement. But they can provide endorsements or disavowals in the comments if they want to. They also have the right to remain silent, because ‘Murica.]

Taking phonology to be the mind/brain model for speech, it needs to have interfaces to at least three other systems: the articulatory motor control system (action), the auditory system (perception) and the long term memory system (memory), forming a Memory-Action-Perception (MAP) loop (Poeppel & Idsardi 2012). Likewise, signed languages will have to have interfaces to action (the motor system for the hands, arms, etc.), perception (the visual system) and to memory. As discussed in the comments last week, we think therefore that sign language phonology probably also includes spatial primitives which spoken language phonology lacks.

So the data structures inside phonology proper must be able to effectively receive and send information across those interfaces. This condition is a basic tenet of the minimalist program. Pietroski (2003: 198, in Chomsky and His Critics) is on point and unmistakable:

"Indeed, a sentence is said to be a pair of instructions -- called "PF" and "LF" -- for the A/P and C/I systems. If these instructions are to be usable by the extralinguistic systems, PFs and LFs may have to respect constraints that would be arbitrary from a purely linguistic perspective."

(And see Chomsky's replies in the same book for further remarks on the importance of the interface conditions, e.g. p 275.) That is, the data structures inside phonology must be sufficiently similar to the data structures in the directly connecting modules to allow for the relevant information to be transferred. Such interface conditions are also definitely part of Marr's program, though people seem to miss this point, perhaps because it's buried in his detailed account of the visual system, e.g. pp 317 ff on transforms between coordinate systems.

Reiss and Hale often refer to the interfaces as “transducers”, but the whole of phonology is a transducer between LTM and motor control, and between perception and LTM (“From memory to speech and back”, Halle 2003). (There also seem to be some connections between action and perception that do not include a way station at memory. We ignore those things here.) Moreover, at times the SFP proposals seem to imply transducers which would have very impressive computational abilities, and which look implausible to us as direct interfaces which can be instantiated in human neural hardware (which in perceptual systems seem largely limited to affine transformations and certain changes in topology and discretization such as are accomplished in the vision system, Palmer 1999, though see Koch 1999 for an idea of how much computation a single neuron might be able to do -- a lot).

So, first, we will adopt the proposals of Jakobson, Fant and Halle 1952 (minus the acoustic definitions, which are nevertheless very relevant for engineering applications), and Halle 1983 regarding distinctive features and their neural instantiation, but drawing the feature set from Avery & Idsardi 1999. Here is a relevant diagram from Halle 1983:

And here is A&I’s wildly speculative proposal:

We believe that SFP (or at least the Reiss & Hale contingent) is sort of ok with this, although they remain more open on the feature set (which for them continues to include things like [±voiced] -- by the way, also in the Halle 1983 figure above -- despite Halle & Stevens 1971, Iverson & Salmons 1995 et seq).

We understand the perception side to involve neural assemblies including spectro-temporal receptive fields (STRFs, Mesgarani, David, Fritz & Shamma 2008, Mesgarani, Cheung, Johnson & Chang 2014) and a coupled dual-time-window temporal analysis (Poeppel 2003, Giraud & Poeppel 2012). On the action side, we’ll go with Bouchard, Mesgarani, Johnson & Chang 2013 and another off-the-shelf component, Guenther 2014. Much less is known about the relevant memory systems, but we’ll take stuff like Hasselmo 2012 and Murray, Wise & Graham 2017 as some starting points.

Most importantly in our opinion (we've been telegraphing this point in previous posts) we need a reasonable understanding of what “precedes” is, and we think precedes needs to be front and center in the theory. We feel that it is very unfortunate that phonological practice has often favored implicit depictions of “precedes” (as horizontal position in a diagram) rather than being explicit about it (but we’ve rehearsed these arguments before, to not much effect). We take “precedes” to be a temporal relation at the action and perception interfaces, providing the basis in the phonology for notions such as “before” and “after” and “at the same time, more or less”. (And for speech “more or less” appears to be about 50 ms, Saberi & Perrott 1999, again see Poeppel & Giraud & Ghitza & co. It's not impossible for the auditory system to detect changes that are faster than this, but such rapid transitions will be encoded in phonology as features rather than as separate events.) We feel that it bears repeating here that the precedes relation in phonology is interfacing to the data structures and relations for time that are available in motor control and auditory perception and memory, which themselves are not going to capture perfectly the physical nature of time (whatever that is). That is, the acuity of precedes will be limited by things such as the fact that there is an auditory threshold for the detection of order between two physical events. (A point that Charles brought up in the comments last week.)
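To make the acuity point concrete, here is a minimal sketch of an order judgment with a threshold. The function name, the hard cutoff and the use of millisecond onsets are all assumptions made for illustration; the figure of about 50 ms is the approximate value mentioned above, and a real auditory threshold would be graded rather than a sharp boundary.

```python
def perceived_order(onset_e_ms, onset_f_ms, threshold_ms=50.0):
    """Classify the order of two physical onsets as seen through a
    limited-acuity interface (hypothetical hard threshold)."""
    delta = onset_f_ms - onset_e_ms
    if abs(delta) < threshold_ms:
        # Too close together to be ordered: "at the same time, more or less".
        return "simultaneous"
    return "e precedes f" if delta > 0 else "f precedes e"


print(perceived_order(0, 20))   # inside the window
print(perceived_order(0, 80))   # outside the window
```

On this sketch, two onsets 20 ms apart come out as simultaneous, so any difference between them would have to be encoded as a property of one event rather than as two ordered events.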

There are a number of ways to construct pertinent data structures and relations, but we will choose to do this in terms of events, abstract points in time (compare Carson-Berndsen’s 1998 Time-Map Phonology in which events cover spans or intervals of linear time). This will probably seem weird at first, but we believe that it leads to a better overall model. NB: This is absolutely NOT the ONLY way to go about formalizing phonology. SFP (Bale & Reiss 2018) take quite a different approach based on set theory. We will discuss the differences in a later post or two.

Within the phonology that means that we have at least:

(1) events/elements/entities, which are points in abstract time. We will use lower case letters (e, f, g, ...) to indicate these.
(2) features, which we construct as properties of events. When we don’t care what their content is we will use upper case letters to indicate them (F, G, H, …). So Fe means that event e has feature F. We will enclose specific features in brackets, following common usage in phonology, e.g. [spread]e. Notationally, [F, G]e will mean Fe AND Ge. We will drop the event variable when it’s clear in context (think Haskell point-free notation).
(3) precedes, a 2-place relation of order over events, notated e^f (e precedes f). The exact “meaning” of this relation is a little tricky given that we are not going to put many restrictions on it. For example, following Raimy 2000, we will allow “loops in time”. We're not sure that model internal relations really have any "meaning" apart from how they function inside the system and across the interfaces, but if it helps, e^f is something like "after e you can send f next" at the motor interface and "perceived e and then perceived f next" at the perceptual interface.
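The three ingredients above can be sketched as a small data structure. This is only one possible encoding, and the class and method names (Workspace, add_event, add_precedes, has) are hypothetical: events are plain names, features are sets of property names attached to events, and precedes is a counted collection of ordered pairs so that self-edges and repeated edges are both allowed.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Workspace:
    # event name -> set of feature names holding of that event
    features: dict = field(default_factory=dict)
    # (e, f) -> number of precedes edges from e to f (multigraph)
    precedes: Counter = field(default_factory=Counter)

    def add_event(self, event, *feats):
        """Introduce an event and/or add features to it."""
        self.features.setdefault(event, set()).update(feats)

    def add_precedes(self, e, f):
        """Record e^f; no restrictions, so e^e and duplicates are fine."""
        self.add_event(e)
        self.add_event(f)
        self.precedes[(e, f)] += 1

    def has(self, event, *feats):
        """[F, G]e: true when every listed feature holds of the event."""
        return set(feats) <= self.features.get(event, set())


w = Workspace()
w.add_event("e", "spread")      # [spread]e
w.add_event("f", "F", "G")      # [F, G]f
w.add_precedes("e", "f")        # e^f
w.add_precedes("e", "f")        # a second edge between the same events
w.add_precedes("f", "f")        # a self-edge
print(w.has("f", "F", "G"), w.precedes[("e", "f")])
```

Nothing here forces the graph to be connected or acyclic, which matches the lack of restrictions on precedes.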

(We’ll do things in this way partly because of Bromberger 1988, though I [wji] still don’t think I fully understand Sylvain’s point. Also, having (1-3) allows us to steal some of Paul Pietroski’s ideas.)

So far (1-3) give us a directed multigraph (it allows self-edges and multiple edges between nodes), and it comes with no guarantees of connectedness yet. (And Jon Rawski would like us to point out that there's a foreshadowing of a model-theoretic approach here. Jon, please say more in the comments if you'd like.) We suppose this thing needs a name, so let’s call it Event-Feature-Precedence (EFP) Theory (would PFE be better? that could be pronounced [p͡fɛ], as in “[p͡fɛ], that’s not much of a theory”). With (1-3) we have a feature-based version of Raimy 2000 (as opposed to its original x-tier orientation), but since we will allow events to have multiple properties (features), we can recreate Raimy diagrams, such as this one for “kitty-kitty” where the symbols are the usual shorthands for combinations of features.
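As a rough illustration of what a loop buys us, the sketch below assumes a graph of the shape # → k → ɪ → t → i → % with one extra edge from i back to k. The traversal policy (take a pending back edge once, otherwise follow the plain edge) is a simplification chosen for this example, not a statement of the linearization procedure in Raimy 2000, and the function name and the split into "plain" and "loops" are hypothetical.

```python
def linearize(plain, loops, start="#", end="%"):
    """Walk from start to end. Each loop edge is taken exactly once,
    when first available; plain edges may be reused."""
    pending = list(loops)
    node, output = start, []
    while node != end:
        loop = next((edge for edge in pending if edge[0] == node), None)
        if loop is not None:
            pending.remove(loop)
            node = loop[1]
        else:
            node = plain[node]   # KeyError if the walk gets stuck
        if node != end:
            output.append(node)
    return output


# Symbols stand in for feature bundles on single events.
plain = {"#": "k", "k": "ɪ", "ɪ": "t", "t": "i", "i": "%"}
loops = [("i", "k")]
print("".join(linearize(plain, loops)))
```

The same four events are visited twice, so the repeated form comes from the extra precedence edge rather than from copying any events.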

And now, for some random quotes about non-linear time (add more in the comments, please!):

“This time travel crap, just fries your brain like a egg.” Looper

“There is no time. Many become one.” Arrival

“Thirty-one years ago, Dick Feynman told me about his "sum over histories" version of quantum mechanics. "The electron does anything it likes," he said. "It just goes in any direction at any speed, forward or backward in time, however it likes, and then you add up the amplitudes and it gives you the wave-function." I said to him, "You're crazy." But he wasn't.” Freeman Dyson
(Note: Freeman Dyson is a physicist, not a movie.)

We also take properties (features) to be brain states, as in Halle 1983. Then [spread]e means that event e has the property spread glottis. (For us being brain states doesn't preclude the properties from being other things too. We mean (1-3) in a Marrian way across implementations, algorithms and problem specifications.) This discussion will sometimes be cast as if features are single neurons. This is certainly a vast over-simplification, but it will do for present purposes. For us this means that a feature (= neuron (group)) can be “activated”. (The word “feature” seems to induce a lot of confusion, so we might call these constructs fneurons, which we will insist should be pronounced [fnɚɑ̃n] without a prothetic [ɛ].) The innervation of the [spread] fneuron in event e (in conjunction with the correct state of volitional control circuits) will cause a signal to be sent to the motor control system that will ultimately innervate the descending laryngeal nerve to innervate the posterior cricoarytenoid muscle (and reciprocally de-innervate the lateral cricoarytenoid muscle). On the perception side, we assume (facts not in evidence because we’re too lazy to look through all of the STRFs in Mesgarani et al 2008) that there are auditory neurons whose STRFs calculate the intensity difference between bark bands 1 and 2 versus bark bands 3 and 4 (probably modulated by the overall spectral tilt in bark bands 5 to 10). The greater this intensity difference the more likely a [spread] fneuron is to be innervated (activated past threshold). We doubt that there’s much effect of volitional control on the perceptual side, as auditory MMNs can be observed in comatose patients, or at least in those that eventually recover, 30/33 patients in Fischer, Morlet & Giard 2000.

To Charles's point last week about phonological delusions, it's well known that large positive values of voice onset time (VOT) can signal [spread] also, without the inclusion of voice quality differences in the first few pitch periods of the vowel. So the working hypothesis is that the neurons for [spread] are connected to auditory neurons with at least two kinds of STRFs, the bark-based one mentioned above, and a neuron yielding a double-on response, see various publications by Steinschneider.

As many people are aware, graphs (and multigraphs) are usually defined over sets (or bags or multisets) of vertices and edges. We haven’t included the set stuff here. Why not? The idea, for the moment at least, is that we’re calculating in a workspace, and there isn’t significant substructure in the workspace in terms of individuated phonological forms. So the workspace universe provides the set structures, such as they are, the events, the properties of the events and the relations between events. This could well be a big mistake, as it seems to preclude asking (or answering) questions like “do these two words rhyme?” as this would involve comparing sub-structures of two different phonological representations. That is, in order to evaluate rhyme (or alliteration, or …) we would have to evaluate/find a matching relation between two sub-graphs, so we would need to be able to represent two separate graphs in the workspace and know which one was which. To do this we could add labels to keep track of multiple representations, or add the extra set structure. There are (different) mathematical consequences for either move, and it isn’t at all clear which would be preferable. So we won’t do anything for now. That is, we’re wimping out on this question.
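The labelling option can be sketched as follows. Everything here is hypothetical (the label names "A" and "B", the event names, and the helper subgraph); the point is only that once each event carries a tag saying which representation it belongs to, the sub-graph for one form can be pulled out of a shared workspace and compared with another.

```python
# One shared workspace: every event carries a label naming its form.
labels = {"e1": "A", "e2": "A", "f1": "B", "f2": "B", "f3": "B"}
precedes = [("e1", "e2"), ("f1", "f2"), ("f2", "f3")]


def subgraph(label, labels, precedes):
    """Return the events tagged with `label` and the precedes edges
    that stay entirely inside that set of events."""
    events = {event for event, tag in labels.items() if tag == label}
    edges = [(a, b) for a, b in precedes if a in events and b in events]
    return events, edges


events_a, edges_a = subgraph("A", labels, precedes)
events_b, edges_b = subgraph("B", labels, precedes)
print(sorted(events_a), edges_a)
print(sorted(events_b), edges_b)
```

A rhyme or alliteration check would then be some matching relation between the two extracted sub-graphs; what that relation should be is left open here, as it is in the text.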

Finally, we will say once more that there are some strong similarities between this approach and Carson-Berndsen’s work. But in Time-Map Phonology time is represented using intervals on a continuous linear timeline whereas here we have discretized time instead and we allow precedence to be non-linear.

Next time: It’s Musky

15 comments:

Veno Volenec, April 5, 2018 at 7:09 AM
Let me start this longish post by saying that it is nice to see a phonological discussion develop on this wonderful blog. As a kind of a preamble, I am going to comment on some of the claims made in the previous post, and then I will focus on some issues raised in this post, but all in all my comments will revolve around the phonology-phonetics interface, or to be more specific, around the relationship between phonological features and phonetic substance.

What is substance free phonology about? Idsardi and Raimy (I&R) (sometimes “I” is used in these posts, sometimes “we”, so forgive me if I misattribute some claims) offer the following interpretation of the SFP enterprise:

“(…) how I understood the Hale & Reiss 2000 charge of “substance abuse”: there’s too much appeal to substance, and this should be reduced (take your medicine). As a methodological maxim "Reduce Substance!" then I'm all on board.”

I don’t think that SFP is to be interpreted as a phonological methodology. SFP is not a recommendation that can roughly be read as: ‘take it easy with the substance abuse, though a little abuse here and there is tolerable’. Rather, I take SFP to be a theoretically coherent (coherent in the broader context of generative linguistics and the rationalist approach to the study of language) and an empirically testable claim about the nature of phonological competence: phonological rules do not refer to articulatory, or acoustic, or perceptual information. In other words, phonology (as an aspect of the mind/brain) treats features (and other units of phonological representation) as arbitrary symbols. From the point of view of phonology, then, features are substance-free units. This, of course, does not mean that features are not related to phonetic substance, and such a conceptualization of features does not preclude the construction of a neurobiologically plausible interface theory (even spelled out in Marr’s terms). I will return to this point in a moment, but first let me just clarify what I mean by ‘substance’ in ‘substance free phonology’.

I&R write the following:

“let's not confuse ourselves into thinking that all reference to substance can be completely eliminated, for the theory has to be about something”

And before that:

“And a theory without any substance is not a theory of anything.”

I take ‘substance’ to mean ‘phonetic substance’, i.e., things like movements of the tongue, values of formants, loudness, duration expressed in milliseconds etc. I don’t think that ‘substance’ should be equated with ‘any kind of content’ or ‘aboutness’, which is what the above quotes seem to suggest (correct me if I am misinterpreting). In the final paragraph of the same post, I&R elaborate on their understanding of the term ‘substance’: “substantive = veridical and useful”. But in my opinion ‘veridical and useful’ is not at all what ‘substance’ means in ‘substance free phonology’. Can’t purely formal operations (e.g., merge) be veridical and useful? Features understood as substance-free units are veridical and useful in at least two senses: In conjunction with rules they allow for the expression of linguistically relevant generalizations, and they play a role in the phonetic interpretation of the surface representation.

[continued below]

Veno Volenec, April 7, 2018 at 3:56 AM
The conception of features as ‘instructions to the articulators’ (e.g. Kenstowicz & Kisseberth 1979: 239) raises more questions about their relation to substance than it answers. While I do think that features are lawfully related to articulatory movements, I think that that relation is indirect and complex. Take, for example, the feature +HIGH. In order for the sensorimotor (SM) system to execute this feature it needs to know for how long the +HIGH configuration must be maintained; importantly, this temporal information is irrelevant for phonological computation. If we care about the competence/performance distinction, we cannot ascribe temporal information expressed in milliseconds to features (or segments) because the duration of a speech sound depends on speech rate, and surely we agree that speech rate is not part of competence. This is why we drew a line between the phonological module of the grammar (competence) and cognitive phonetics (performance). Both are cognitive (and ultimately neurobiological) systems: the first one computes (i.e., preserves the representational format), the second one transduces (i.e., changes the representational format). This is also the reason why I think it is slightly misleading to define phonology as the mental model for speech -- speech contains so much information that phonological computation systematically ignores; and also why I don’t quite agree that phonology is a transducer between LTM and motor control (since phonological computation preserves the representational format, and therefore cannot be a transducer).

In my view, effective communication between adjacent modules does not entail overlap or identity in data structures characteristic for each module. (If there is identity in data structures between two modules, then why do we think that we’re dealing with two modules instead of a single module?) Rather, I take that the distinctness of data structures is surmounted by transduction (a conversion of one data structure into a different data structure), which can be found between many different systems within a single organism. For example, the process of hearing entails several transductions: air pressure differentials are transduced into biomechanical vibrations of the tympanic membrane and the ossicles of the middle ear, which are transduced via the oval window into fluidic movements within the cochlea, which are in turn transduced by the organ of Corti into electrical signals. In our paper, Charles and I proposed that transduction is also present at the phonology-phonetics interface: SRs (consisting of features) are transformed into what we call ‘true phonetic representations’ and it is this representational format that can be viewed as a “score” (what Paul mentioned) for articulation, not the format with which phonology operates. The ‘proximal substance’ is a bit suspect to me since it relies on the assumption about the overlap/identity in data structures between phonology and the SM system.

Bill Idsardi, April 7, 2018 at 8:56 AM
Thanks Veno, I think we're getting close to clarifying the differences here, which are not all that large, but I do consider important, and I think you do too. With the clarifications, I then think the differences turn into empirical issues which I think can then be investigated. A few things, I'm sure this discussion will continue across subsequent posts.

VV: "In order for the sensorimotor (SM) system to execute this feature it needs to know for how long the +HIGH configuration must be maintained; importantly, this temporal information is irrelevant for phonological computation."

I'd be a bit more cautious here. If you allow for autosegmental association of +HIGH, or for geminates, then there is some (perhaps small) use of timing information in the phonology. The mapping to motor commands isn't going to be isomorphic, but my feeling is that it will be quasi-homomorphic, preserving some aspects of the difference. (Put rather bluntly phonological geminates will not be systematically shorter in articulation or audition than singletons.) Also, the motor system *eventually* has to establish the temporal extents, it's not at all clear to me that the temporal extents are established by the *beginning* of the motor calculation (in fact it seems pretty clear to me that they aren't). I think the motor system only receives a rough idea from the phonology of how long the gestures should be. That part is in the interface, the rest of the calculation of the temporal extents is within the motor system.

VV: "we cannot ascribe temporal information expressed in milliseconds to features"

I'm sure that you can find quotes (even some from me) expressed in milliseconds. When you find some of mine, then I was being sloppy, offering some frame of reference for the reader or I would disavow the statements. The idea that we're pursuing here is that brain timing is done by endogenous oscillations (with theta and gamma bands being particularly important for speech). Even this is certainly too simple an answer, but we will try not to mix endogenous and exogenous descriptions of time. I was just on a NACS/Kinesiology dissertation defense on walking yesterday, and all of the discussion of time was done in terms of phase in the gait cycle. That's the sort of thing we would mean by time here, phase in delta-theta-gamma bands (which now has me worried about possible fraternity connotations).

Raimy, April 7, 2018 at 10:47 AM
Yes, Veno thanks for continuing the discussion. One thing that I want to add here is that I think there might be a latent disagreement or misunderstanding about how many modules we are talking about. Incorrectly or not, I get the impression that Veno (sorry to talk about you while you are right here) assumes that there is one big (or maybe small) phonology module that feeds directly into the 'cognitive phonetics' module which then goes to the MS. Along with this assumption is that a 'gestural score' continuous type representation is down in the MS somewhere. This appears to me to be a kind of 'fewer modules are better' type of position.

We on the other hand (Bill can correct this here but he has never thrown anything at me when I talk this way) believe that there are more modules than probably what other folk assume. As part of this, Browman & Goldstein's Articulatory Phonology (and maybe Carson-Berndsen’s 1998 Time-Map Phonology, I say maybe because I haven't read it) could be distinct modules that are in the speech chain between the MS and LTM. Regardless of whether this is correct or not, I think we (at least me and Bill) are talking about the last module which interfaces with LTM when we are talking about EFP here. I'm sure everyone else's mileage varies here...

To me the importance of thinking about how many modules/links there are in the speech chain affects what one thinks about the overlap conditions between modules. Fewer modules (or less? I'll talk pretty one day) will give the impression that there can be much more deviance between modules or more freedom in the mapping from one to the other. More modules allow for a much tighter restriction on the interfaces between modules. Tighter interface conditions enforce 'baby steps' in the transformations of representations.

A final point on this is that if there are more modules involved here I then take any veracity between distant modules to be strong evidence for some sort of substance in features.

Veno Volenec, April 10, 2018 at 11:42 AM
Thank you Bill and Eric for your replies. I’ll try to clarify one of my previous points, as it seems to me that it was taken slightly differently than I intended. I have a feeling that our discussion is beginning to narrow on very specific issues, which is only possible because we agree on so many fundamental concepts.

One idea that I tried to argue for is that it cannot be the case that the data structure that exits the phonological module of the grammar is the same data structure that enters what’s informally called ‘the phonetic implementation system’ (PIS). Why not? Because the output of phonology – a surface representation – lacks information that the PIS (I’m having a hard time not pronouncing this as [pʰɪs]) needs in order to produce speech. The logic of the argument is basically this:

(1) Outputs of the phonological module, SRs consisting of features, do not contain substantial and temporal information.
(2) The PIS requires articulatory, auditory and temporal information in order to produce speech.
∴ SRs are not legible to the PIS and phonology cannot in principle feed speech production directly.
∴ The interface between phonology and the PIS is mediated by transduction.

By “substantial and temporal information” in (1) I mean, for example, the information about which muscles to contract and for how long. Why would (1) hold? Because phonological computation treats features equivalently despite wild variation in their articulatory (and concomitant acoustic) realizations. As Charles mentioned in a previous post, if we define phonological features through precise, richly specified articulatory configurations or acoustic measures, then we won’t be able to formally capture a tremendous amount of obviously important generalizations. On the other hand, the fact that Sylvester Stallone cannot contract his orbicularis oris (due to the paralysis of a facial nerve) in the same way as, say, Bruce Willis, is phonologically irrelevant, i.e., we should not assume that Sly’s [+ROUND] is different than Bruce’s [+ROUND]. And that irrelevancy suggests to me that features contain only a very rough, highly abstract characterization of what needs to be achieved in articulatory and auditory terms. Another kind of information that I think is missing from SRs is exact temporal information. Note that I am not talking about abstract timing, but rather about concrete time; abstract timing is useful for phonology, but not enough for speech production. Take, for example, the SR [sluga] (meaning ‘servant’ in Croatian). While the phonological precedence relation s^l^u holds, in speech the articulatory realization of [u]’s [+ROUND] feature is temporally (over)extended across all of [l]’s duration and most (or all) of [s]’s duration. In other words, [s] is [–ROUND], but it is realized as if it were [+ROUND]. Note also that anticipatory coarticulation cannot be a simple consequence of speech organ inertia – it has to be planned/calculated before the final efferent neuromuscular instructions are sent to speech effectors, thus motivating a cognitive processing stage which is distinct from both phonology and from (traditionally conceived) articulatory phonetics.

This seems to suggest that what enters the motor system is not an SR (i.e., features), but rather a more richly specified data structure created by cognitive transduction; while there might be some isomorphism between data structures characteristic for phonology and the motor system, this isomorphism is quite modest in my opinion.

[continued below]

Fred, April 11, 2018 at 9:30 AM
I think I may need some clarifications about your formalization, but it might turn out that my confusion is merely terminological. You open the next post with:

“Recall from last time that we are trying to formalize phonology in terms of events (e, f, g, ...; points in abstract time), distinctive features (F, G, ...; properties of events), and precedence, a non-commutative relation of order over events (e^f, etc.). So far, together this forms a directed multigraph.”

You've used the term multigraph a few times now, but I don't understand why. A multigraph allows for multiple edges between adjacent vertices, but AFAICS there's no real need (yet?) for that.

Your precedence relation is a preorder, it is transitive and reflexive, but not a(nti)symmetric. I.e. you can have a in E and b in E s.t. a^b and b^a and a<>b (these are your "loops"). Your use of "non-commutative" is also confusing to me, as that's typically a descriptor of operations...is that what you mean by the a(nti)symmetry?
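The order-theoretic properties mentioned here can be checked mechanically on a small example. The sketch below takes a two-event loop (a^b and b^a) as its input; the helper names are hypothetical. One assumption worth flagging: it treats the stated edges and their reflexive-transitive closure ("reachable by following precedes") as two separate relations, since the raw edges of a loop are not themselves reflexive or transitive.

```python
from itertools import product


def is_reflexive(rel, elems):
    return all((x, x) in rel for x in elems)


def is_transitive(rel):
    return all((a, d) in rel
               for (a, b), (c, d) in product(rel, rel) if b == c)


def is_antisymmetric(rel):
    return all(a == b or (b, a) not in rel for (a, b) in rel)


def reflexive_transitive_closure(rel, elems):
    """Smallest reflexive and transitive relation containing rel."""
    closure = set(rel) | {(x, x) for x in elems}
    while True:
        new = {(a, d) for (a, b), (c, d) in product(closure, closure)
               if b == c}
        if new <= closure:
            return closure
        closure |= new


elems = {"a", "b"}
edges = {("a", "b"), ("b", "a")}          # a^b and b^a: a "loop"
reach = reflexive_transitive_closure(edges, elems)

print(is_reflexive(edges, elems), is_transitive(edges))
print(is_reflexive(reach, elems), is_transitive(reach),
      is_antisymmetric(reach))
```

On this example the closure is a preorder that fails antisymmetry, so it is not a partial order, while the bare edge relation is neither reflexive nor transitive.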

Also, I'm not clear on what "time" means here. You refer to your events as "points in abstract time" (also as "abstract points in time", which is maybe the same?)...I'm not sure what "abstract time" is, but if it's anything like real time, which is basically the ne plus ultra of total orders (at least in the subrelativistic, non-quantum world we're dealing with in the present case), then it seems to me that your precedence relation is something rather different, and so saying that you "allow loops in time" is also confusing at best (Feynman and heptapods notwithstanding).

Sorry to nitpick...I like the idea of a full and tight formal model of phonology (I'm slowly working through Charles & Alan's book in parallel) and so want to make sure I'm grokking it all.

Charles Reiss, April 22, 2018 at 7:50 PM
I am finally able to come back to the discussion, which seems to have taken a Kafka-esque turn (Rambo qua roach) in my absence. One of these days, I'll have to come clean and take the blame for making a mess of the phrase "Substance Free". I think Mark warned me about this at some point, so don't blame him.