Infants can show a puzzling range of abilities and deficits in comparison with adults, out-performing adults on many phonetic perception tasks while lagging behind in other ways. Some null results obtained with one procedure can be overturned with more sensitive procedures, and some contrasts are "better" than others in terms of effect size and various acoustic or auditory measures of similarity (Sundara et al 2018). There are other oddities in the infant speech perception literature, including the fact that the syllabic stimuli generally need to be much longer than the average syllable durations in adult speech (often twice as long). One persistent idea is that infants start with a syllable-oriented perspective and later move to a more segment-oriented one (Bertoncini & Mehler 1981), and that in some languages adults still have a primary orientation toward syllables, at least for some speech production tasks (O'Seaghdha et al 2010; but see various replies, e.g. Qu et al 2012).
More than a decade ago, I worked with Rebecca Baier and Jeff Lidz to investigate audio-visual (AV) integration in two-month-old infants (Baier et al 2007). Infants were presented with one audio track along with two synchronized silent movies of the same person (namely Rebecca) shown on a large TV screen. The movies were of different syllables being produced; the audio track generally matched one of the movies. Using this method we were able to replicate the results of Kuhl & Meltzoff 1982 that two-month-old infants are able to match faces and voices among /a/, /i/, and /u/. Taking this one step further, we were also able to show that infants could detect dynamic syllables, matching faces with, for example, /wi/ vs. /i/. We did some more poking around with this method, but got several results that were difficult to understand. One of them was a failure of the infants to match on /wi/ vs. /ju/. (And we are pretty sure that "we" and "you" are fairly frequently heard words for English-learning infants.) Furthermore, when they were presented with /ju/ audio alongside /wi/ and /i/ faces, they matched the /ju/ audio with the /wi/ video. This behavior is at least consistent with a syllable-oriented point of view: they hear a dynamic syllable with something [round] and something [front] in it, but they cannot tell the relative order of [front] and [round]. This also seems consistent with the relatively poor abilities of infants to detect differences in serial order (Lewkowicz 2004). Rebecca left to pursue electrical engineering and this project fell by the wayside.
This is not to say that infants cannot hear a difference between /wi/ and /ju/, though. I expect that dishabituation experiments would succeed on this contrast. The infants also did not match faces for /si/ vs. /ʃi/, but the dishabituation experiment worked fine on that contrast (as expected). So there are certainly task differences between the two experimental paradigms as well.
But I think that now we may have a way to understand these results more formally, using the Events, Features and Precedence model discussed on the blog a year ago, and developed more extensively in Papillon 2018. In that framework, we can capture a /wi~ju/ syllable schematically as (other details omitted):
The relative ordering of [front] and [round] is underspecified here, as is the temporal extent of the events. The discrimination between /wi/ and /ju/ amounts to incorporating the relative ordering of [front] and [round], that is, which of the dashed lines is needed in:
When [round] precedes [front], that is the developing representation for /wi/; when [front] precedes [round], that is the developing representation for /ju/. Acquiring this kind of serial order knowledge between different features might not be easy: it is possible that [front] and [round] are initially segregated into different streams (Bregman 1990), and order perception across streams is worse than order perception within a stream. It's conceivable that the learner would be driven to look for additional temporal relations when the temporally underspecified representations incorrectly predict such "homophony", akin to a hash collision.
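The "hash collision" analogy can be made concrete with a rough sketch. Everything in it is an illustrative assumption rather than part of the EFP formalism itself: a syllable is reduced to a set of feature labels plus a set of (earlier, later) precedence pairs, and the hypothetical helpers `without_order` and `with_order` stand in for the early and later learner.

```python
from collections import defaultdict
from typing import FrozenSet, NamedTuple, Tuple


class Syllable(NamedTuple):
    """Toy representation: feature events plus (earlier, later) pairs."""
    features: FrozenSet[str]
    precedence: FrozenSet[Tuple[str, str]]


def syllable(features, order=()):
    return Syllable(frozenset(features), frozenset(order))


# Hypothetical mini-inventory; only [round] and [front] are modeled.
SYLLABLES = {
    "wi": syllable({"round", "front"}, {("round", "front")}),
    "ju": syllable({"round", "front"}, {("front", "round")}),
    "i": syllable({"front"}),
}


def without_order(s):
    """Early learner (assumed): the features are kept, their order is not."""
    return Syllable(s.features, frozenset())


def with_order(s):
    """Later learner (assumed): the precedence relations are kept as well."""
    return s


def collisions(percept):
    """Group syllables that receive the same representation under `percept`."""
    buckets = defaultdict(list)
    for name, s in SYLLABLES.items():
        buckets[percept(s)].append(name)
    return sorted(sorted(names) for names in buckets.values() if len(names) > 1)


def match_face(audio, faces, percept):
    """Return the faces whose representation equals that of the audio."""
    target = percept(SYLLABLES[audio])
    return [f for f in faces if percept(SYLLABLES[f]) == target]


print(collisions(without_order))                   # [['ju', 'wi']]
print(collisions(with_order))                      # []
print(match_face("ju", ["wi", "i"], without_order))  # ['wi']
```

Under these assumptions the order-free learner treats /wi/ and /ju/ as the same object and pairs /ju/ audio with the /wi/ face, which is the pattern described above; once the precedence pair is stored, the collision disappears.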
If we pursue this idea more generally, the EFP graphs will gradually become more segment-oriented as additional precedence relations are added, as in:
And if we then allow parallel, densely-connected events to fuse into single composite events, we can get even closer to a segment-oriented representation:
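One way to make the fusion step concrete is the sketch below. It rests on assumptions that are mine, not part of the proposal as stated: two events count as "parallel and densely connected" when they have at least one precedence relation and share exactly the same predecessors and successors, and the feature labels other than [round] and [front] are placeholders chosen only for illustration. The function name `fuse_parallel` is hypothetical.

```python
from collections import defaultdict


def fuse_parallel(events, precedes):
    """Fuse events that share identical, non-empty predecessor/successor sets.

    events:   iterable of feature labels, one per event
    precedes: set of (earlier, later) pairs between events
    Returns (composite_events, composite_edges); each composite event is a
    frozenset of the feature labels that were fused together.
    """
    groups = defaultdict(set)
    for e in events:
        preds = frozenset(a for a, b in precedes if b == e)
        succs = frozenset(b for a, b in precedes if a == e)
        # Events with no relations at all are left alone (assumption):
        # being unordered is not the same as being densely connected.
        key = (preds, succs) if (preds or succs) else e
        groups[key].add(e)

    composite = {e: frozenset(g) for g in groups.values() for e in g}
    fused_events = set(composite.values())
    fused_edges = {(composite[a], composite[b]) for a, b in precedes}
    return fused_events, fused_edges


events = ["round", "back", "front", "high"]  # placeholder labels for a /wi/-like syllable

# Early stage: a single precedence relation has been learned.
sparse = {("round", "front")}

# Later stage: every onset-like event precedes every nucleus-like event.
dense = {(a, b) for a in ("round", "back") for b in ("front", "high")}

print(len(fuse_parallel(events, sparse)[0]))  # 4
print(len(fuse_parallel(events, dense)[0]))   # 2
```

With only one learned relation nothing fuses and the graph still has four separate events. Once the relations are dense, the events collapse into two composite events joined by a single precedence edge, which looks much more like a two-segment string.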
So the general proposal would be that the developing representation of the relative order of features is initially rather poor and is underspecified for order between features in different "streams". Testing this is going to be a bit tricky, though. An even more general conclusion would be that features are not learned from phonetic segments (Mielke 2008) but that features are gradually combined during development to form segment-sized units. We could also add other features encoding extra phonetic detail to these developing representations. It could then be the case that different phonetic features have different temporal acuity for the learners, and so cohere with other features to different extents.
Baier, R., Idsardi, W. J., & Lidz, J. (2007). Two-month-olds are sensitive to lip rounding in dynamic and static speech events. In Proceedings of the International Conference on Auditory-Visual Speech Processing.
Bertoncini, J., & Mehler, J. (1981). Syllables as units in infant speech perception. Infant Behavior and Development, 4, 247-260.
Bregman, A. (1990). Auditory Scene Analysis. MIT Press.
Kuhl, P. K., & Meltzoff, A. N. (1982). The bimodal perception of speech in infancy. Science, 218, 1138-1141.
Lewkowicz, D. J. (2004). Perception of serial order in infants. Developmental Science, 7(2), 175-184.
Mielke, J. (2008). The Emergence of Distinctive Features. Oxford University Press.
O’Seaghdha, P. G., Chen, J. Y., & Chen, T. M. (2010). Proximate units in word production: Phonological encoding begins with syllables in Mandarin Chinese but with segments in English. Cognition, 115(2), 282-302.
Papillon, M. (2018). Precedence Graphs for Phonology: Analysis of Vowel Harmony and Word Tones. ms.
Qu, Q., Damian, M. F., & Kazanina, N. (2012). Sound-sized segments are significant for Mandarin speakers. Proceedings of the National Academy of Sciences, 109(35), 14265-14270.
Sundara, M., Ngon, C., Skoruppa, K., Feldman, N. H., Onario, G. M., Morgan, J. L., & Peperkamp, S. (2018). Young infants’ discrimination of subtle phonetic contrasts. Cognition, 178, 57–66.