Linguistic commentary from a guy who takes things too literally

Grover and the Excellent Idea

Posted by Neal on January 2, 2019

It’s been called “the new Laurel vs. Yanny”: A six-second video clip from Sesame Street in which Grover expresses his enthusiastic approval for an idea that a fellow Muppet named Rosita suggests. In case you haven’t already read what people are hearing Grover say, I’ll let you listen to it before I bring in the spoilers. Here’s a clip of just the audio. Further commentary below the fold.

OK, so here’s the clip in context:

Here are the two ways that people have been hearing the clip:

  1. Yes! That sounds like an excellent idea!
  2. Yes! That’s a fuckin’ excellent idea!

Of course, if you’re the least bit familiar with Sesame Street, you know that Grover had to have been saying the first sentence. Even so, people have been hearing the second one. Even I heard the f-bombastic sentence: Doug played the clip for me without telling me what I should be listening for, and I heard “That’s a fuckin’ excellent idea!” Only when he told me the actual words did I hear them clearly on a second listen.

Let’s look at the utterance more closely, highlighting the places where the actual phonemes making up the perceived words would be different:

  1. Yes! That sounds like an excellent idea! (the obviously intended utterance)
  2. Yes! That’s a fuckin’ excellent idea! (the obviously unintended utterance)

To get a more precise look at the troublesome segment, here’s a visual representation of it. The top tier is the waveform, showing the intensity (i.e. loudness) of the utterance at different points in time. Just underneath is a spectrogram, which I’ll say more about later. I’ve annotated the waveform and spectrogram first with the “sounds like” interpretation, in ordinary spelling and then in International Phonetic Alphabet just below. Just below that is the “fuckin'” interpretation, written in IPA so that we can see how the specific phonemes line up on the timeline. Finally, I’ve put in the “fuckin'” interpretation written in ordinary spelling.

Overall, and not surprisingly, the spectrogram shows more support for the “sounds like” interpretation than for the “fuckin'” interpretation. Notice the waveform for the portion that I’ve annotated with the [ʌ] (“uh”) vowel of fuckin’. You can see that there is a distinct change in the waveform where there should just be a single [ʌ] sound. It suddenly gets louder. However, this is what we would expect if that interval of time contained two sounds: [l] and [ɑɪ]. We’re going from a slightly constricted airflow for [l] to the unconstricted airflow of the following vowel.

Next, let’s look at the vowel in sounds. Specifically, look at the yellow line in the spectrogram. This, like the waveform above, shows intensity, and you can see from the first three bumps in the line that the syllables that, sounds, and like are all at about the same level of intensity. This is what we would expect for three “content” words (the demonstrative pronoun that, the verb sounds, and the adjective like). On the other hand, “function” words such as conjunctions, helping verbs, and articles (like a) get less stress, so if Grover were saying “That’s a fuckin’ excellent idea,” we’d expect a less-intense a. In fact, we’d expect a lower intensity like the one you can see for the an right before excellent, with the smaller bump in the yellow line.
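If you’re curious how an intensity contour like that yellow line can be computed, here’s a minimal sketch in Python. It isn’t necessarily how the software that drew the yellow line does it. It just takes the root-mean-square level of each chunk of a signal and converts it to decibels. The signal is synthetic (three equally loud “syllables” followed by a weaker one), and the names frame_intensity_db and tone, the 8000 Hz sample rate, and the amplitudes are all made up for illustration.

```python
import math

def frame_intensity_db(samples, frame_len):
    """RMS level in dB (relative to an amplitude of 1.0) for each non-overlapping frame."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Clamp so that a silent frame does not produce log10(0).
        levels.append(20 * math.log10(max(rms, 1e-12)))
    return levels

def tone(amplitude, freq_hz, n_samples, rate):
    """A plain sine wave standing in for one syllable (an assumption, not real speech)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate)
            for i in range(n_samples)]

RATE = 8000      # samples per second (hypothetical)
SYLLABLE = 800   # 100 ms per "syllable"

# Three stressed syllables ("that", "sounds", "like") and one weak one ("an").
signal = (tone(0.8, 200, SYLLABLE, RATE)
          + tone(0.8, 200, SYLLABLE, RATE)
          + tone(0.8, 200, SYLLABLE, RATE)
          + tone(0.2, 200, SYLLABLE, RATE))

levels = frame_intensity_db(signal, SYLLABLE)
for word, level in zip(["that", "sounds", "like", "an"], levels):
    print(f"{word:7s} {level:6.1f} dB")
```

The first three frames come out at the same level and the fourth is noticeably lower, which is the pattern of three big bumps and one small bump described above.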

So far, the things I’ve pointed out support what we basically already know: that Grover is saying “That sounds like an”. So what is it that is allowing people to hear it as “That’s a fuckin'”? Let’s look first at the vowels in sounds and like. In order to talk about vowels, we need to know about formants. I’ll say more about them in a moment, but the TL;DR is that the vowels (and a few other sounds) have these fuzzy black horizontal lines–known as formants–hovering over them in the spectrogram. Since a diphthong has two vowels squeezed into one syllable, there should be a transition as Grover moves from one vowel to the other. In other words, at least one of the formants should bend quite a bit, instead of looking more or less flat like it does here. Here’s a waveform and spectrogram of me pronouncing [ɑʊ] for the clickable IPA project I wrote about a few months ago:

You can see the top formant (F2) sloping down to meet the bottom one (F1).

Next, here’s my pronunciation of [ɑɪ]:

This time, you can see F2 curving upward, away from F1.

So how do Grover’s diphthongs appear? Here’s the spectrogram of sounds like, where I’ve outlined the formants in red. The formant lines for [ɑɪ] are as expected, with F2 curving up, and indeed, when I play this segment by itself, it does sound like the diphthong [ɑɪ]. On the other hand, it’s hard to tell which formant lines we’re dealing with for [ɑʊ]. The one on top does curve down, but there seem to be two formants under it, not just one, so this might be the third formant (F3) rather than F2. Grover might be monophthongizing his [ɑʊ], making it more like an extended [ɑ]. When I play this sound in isolation, it’s hard to say what I’m hearing. So it’s possible to hear the clip up to this point as That’s a without too much trouble. The a has only a slightly longer duration and greater intensity than we might expect, and that is easily attributed to stylistic variation in Grover’s peculiar speech patterns. And once we’ve parsed that much, what happens when we get to the [nz] of sounds?

What seems to be going on here is that the cues for both [nz] and [f] are mostly absent, forcing the listener’s brain to fill in the missing sounds. First of all, both [z] and [f] are fricatives, so we expect to hear more white noise: soundwaves of many frequencies, all about the same intensity. In a spectrogram, this shows up as a curtain of gray, without the distinct black caterpillars crawling across it that you see above vowels. For example, here are my pronunciations of [aza] and [afa]. Notice that in the spectrogram, the regions between the [a] sounds are a lot grayer than the regions of silence before and after the [aza] and [afa].
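The “curtain of gray” versus “black caterpillars” contrast can be put into a single number. Here’s a rough sketch, using a measure called spectral flatness: the geometric mean of the power spectrum divided by its arithmetic mean. It comes out close to 0 when the energy sits in a few narrow bands and much higher when the energy is spread evenly, as in white noise. The signals are synthetic stand-ins (random noise for the fricative, a few sine-wave harmonics for the vowel), and the function names and the 256-sample frame size are my own choices for illustration, not anything taken from the Grover clip.

```python
import cmath
import math
import random

def power_spectrum(samples):
    """Power at each positive-frequency DFT bin, skipping 0 Hz and the Nyquist bin."""
    n = len(samples)
    powers = []
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        powers.append(abs(coeff) ** 2)
    return powers

def spectral_flatness(samples, floor=1e-12):
    """Geometric mean divided by arithmetic mean of the power spectrum (between 0 and 1)."""
    # The floor keeps log() defined for bins that hold essentially no energy.
    powers = [max(p, floor) for p in power_spectrum(samples)]
    log_mean = sum(math.log(p) for p in powers) / len(powers)
    return math.exp(log_mean) / (sum(powers) / len(powers))

N = 256                  # samples per frame (hypothetical)
rng = random.Random(0)   # fixed seed so the run is repeatable

# Fricative stand-in: white noise, energy at all frequencies.
noise = [rng.gauss(0.0, 1.0) for _ in range(N)]

# Vowel stand-in: three harmonics, energy in a few narrow bands.
vowel_like = [sum(math.sin(2 * math.pi * h * 8 * i / N) / h for h in (1, 2, 3))
              for i in range(N)]

noise_flatness = spectral_flatness(noise)
vowel_flatness = spectral_flatness(vowel_like)
print(f"noise: {noise_flatness:.3f}   vowel-like: {vowel_flatness:.3g}")
```

A very quiet, noisy stretch scores high on this measure whether it came from a [z] or an [f], which is one way of seeing why the signal alone doesn’t settle which one the listener hears.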

In the waveform and spectrogram for [nz], I actually can’t tell where to divide the two sounds, which is why they’re listed together. You can see in the spectrogram, and in the waveform, that this segment is pretty quiet. In fact, when I listen to it in isolation, I hear nothing at all that sounds like speech. It’s just white noise. Although the software did detect some voicing (indicated by the faint formant lines and by the regularity of the spikes in the waveform), it’s still quiet enough that a listener might not pay attention to that. Furthermore, since [nz] is not a phonotactically acceptable beginning to an English word, our brains will try to find a different solution. [f] would work: The very quiet signal accommodates that as well as it does an [nz], and that’s an acceptable beginning of a word.

Once we’ve parsed that far, it’s easier to force the more-or-less clear [ɑɪ] into an [ʌ] mold, coming between an [f] and a [k], in exactly the place where fuckin’ would naturally go in an enthusiastic utterance.

So that’s my take on how a listener could hear cute, little, furry Grover drop an F-bomb even if they would never expect him to do such a thing, in an audio clip that also can be heard as That sounds like an excellent idea. Maybe I’ll submit it as the homework assignment given in Mark Liberman’s Language Log post on this topic.

Finally, here’s the more-detailed information about formants if you’re interested. To make sense of a formant, it helps to imagine a third dimension on the spectrogram. The x-axis is time, and the y-axis shows frequency in Hertz. The sounds made during speech, and in particular the vowels, are a blend of many sound waves of different frequencies, but they’re not all equally intense. Depending on the height of the tongue for any particular vowel, how far forward or back it’s positioned in the mouth, and how rounded the lips are, different frequencies get amplified. If we had a third axis on the spectrogram, pointing directly off the screen towards our eyes, we could use it to measure the intensity. But since we don’t, the intensity is represented by darkness: The darker the point at an x-y coordinate, the more intense the wave of that frequency at that point in time. (We do get a precise measure of intensity in the waveform, but this is the overall intensity of the signal, not the intensity of particular wave components.) The amplified frequencies show up on the spectrogram as those fuzzy black horizontal lines at different frequency levels. These are the formants, and they’re the main cue that allows listeners to figure out which vowel a speaker is pronouncing.
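To make the three dimensions concrete, here’s a toy spectrogram in Python. It’s only a sketch of the general idea, not the method any particular phonetics program uses. The signal gets chopped into short frames (the x-axis), each frame gets a Fourier transform that splits it into frequency bins (the y-axis), and the magnitude in each bin is the value that a real spectrogram would draw as darkness. The signal here is a made-up one that sits at 500 Hz and then jumps to 1500 Hz, and the sample rate, frame length, and function names are all assumptions for illustration.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitude at each DFT bin from 0 Hz up to (not including) the Nyquist bin."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2)]

def spectrogram(samples, frame_len):
    """One magnitude spectrum per non-overlapping frame: grid[time][frequency_bin]."""
    return [magnitude_spectrum(samples[start:start + frame_len])
            for start in range(0, len(samples) - frame_len + 1, frame_len)]

def loudest_frequency(column, rate, frame_len):
    """Frequency in Hz of the strongest bin in one time slice (the darkest spot)."""
    k = max(range(len(column)), key=lambda i: column[i])
    return k * rate / frame_len

RATE = 8000   # samples per second (hypothetical)
FRAME = 64    # samples per frame, so bins are 125 Hz apart

def tone(freq_hz, n_samples):
    return [math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(n_samples)]

# A band of energy at 500 Hz that jumps to 1500 Hz halfway through.
signal = tone(500, 256) + tone(1500, 256)

grid = spectrogram(signal, FRAME)
peaks = [loudest_frequency(column, RATE, FRAME) for column in grid]
print(peaks)
```

A real vowel would show several such bands at once (those are the formants), and in a diphthong they would glide from one position to another instead of jumping.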

7 Responses to “Grover and the Excellent Idea”

  1. Neal said

    Here is another linguist’s blog post on the subject.

  2. Breffni said

    Great analysis, Neal. One thing: you don’t say anything about the role of the velarised [l] of like in the mishearing. Isn’t that likely to play a role in making the [ɑɪ] sound like [ʌ]?

    • Neal said

      You’re probably right. Since /l/ is a sonorant sound, it has formants too, although not as pronounced. And velarized /l/ would have formants a bit like a back vowel, which (at least according to the latest IPA charts), [ʌ] is.
