A peculiar audio clip has turned into a viral sensation, the acoustic equivalent of "the dress" — which, you'll recall, was either white and gold or blue and black, depending on your point of view. This time around, the dividing line is between "Yanny" and "Laurel."
The Yanny vs. Laurel perceptual puzzle has been fiercely debated (see coverage in the New York Times, the Atlantic, Vox, and CNET, for starters). Various linguists have chimed in on social media (notably, Suzy J. Styles and Rory Turnbull on Twitter). On Facebook, the University of Minnesota's Benjamin Munson shared a cogent analysis that he provided to an inquiring reporter, and he has graciously agreed to have an expanded version of his explainer published here as a guest post.
Sounds produced with a relatively open vocal tract (like the vowels in "Laurel" and "Yanny") and some consonants (like the "l", "r", "y", and "n" sounds in "Laurel" and "Yanny") have infinitely many frequencies in them. Think of it like hundreds of tuning forks playing at once. If the lowest-frequency tuning fork vibrates at 100 cycles per second, then the tuning forks will be at integer multiples of 100 Hz: 100, 200, 300, up to infinity. If the lowest-frequency tuning fork vibrates at 120 cycles per second, then the tuning forks will be at integer multiples of 120 Hz: 120, 240, 360, up to infinity. We can change the frequency of the so-called 'lowest frequency fork' by changing the tension in our vocal folds (layperson's term: 'vocal cords'), which causes them to vibrate more slowly or more quickly. We hear those changes as changes in the frequency of the voice, like the pitch glide upward when you ask a yes-no question, or the pitch glide downward when you make a statement. But speech has many more frequency components than just that lowest-frequency component. Remember, infinitely many tuning forks. The difference between an "ee" and an "ah" vowel is that some of the frequencies that are especially loud in "ee" are quiet in "ah" and vice versa. The same pitches are present–the tuning forks are always vibrating–but the loudness of each of the frequency components (each of the tuning forks) changes from vowel to vowel.
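For readers who like to tinker, the 'tuning forks' picture can be sketched in a few lines of Python. This is purely illustrative and not from any real analysis of the clip: the harmonic amplitudes below are invented, but the principle is the one just described, namely the same set of frequencies with a different loudness pattern per vowel.

```python
import numpy as np

def harmonic_tone(f0, amplitudes, sr=44100, dur=0.5):
    """Sum sinusoids at integer multiples of f0 (the 'tuning forks')."""
    t = np.arange(int(sr * dur)) / sr
    tone = np.zeros_like(t)
    for k, amp in enumerate(amplitudes, start=1):
        if k * f0 < sr / 2:                  # stay below the Nyquist limit
            tone += amp * np.sin(2 * np.pi * k * f0 * t)
    return tone / np.max(np.abs(tone))       # normalize to +/- 1

# The same 'forks' (harmonics of 120 Hz), two different loudness patterns:
# roughly what distinguishes one vowel quality from another.
vowel_a = harmonic_tone(120, [1.0, 0.9, 0.7, 0.2, 0.1])
vowel_b = harmonic_tone(120, [1.0, 0.3, 0.2, 0.8, 0.6])
```

Write either array to a WAV file and both play at the same pitch; only the timbre differs, just as the text describes.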
Hopefully I've been clear so far, because here's where it gets weird.
So, the frequencies (the 'tuning forks') that are loudest are what we call formants. The formants are the horizontal stripes in the picture above, called a spectrogram. A spectrogram is a quasi-3D picture. It's sort of like a topographic map. The x axis is time, so a spectrogram can show things changing over time. The y axis is frequency. There are many different frequencies in speech—many different 'tuning forks' in our analogy—and we need to represent as many of them as it takes to describe speech. In speech, we usually focus on the frequencies between 0 and 10,000 Hz, but since the youngest, healthiest humans can hear up to 20,000 Hz, we sometimes show 0-20,000 Hz, as in the figure above. The shading shows which of the frequencies are loudest. Think of the shading on a topographic map: the shading shows where the mountains are. It's the third dimension of the map. The dark-shaded regions are the highest amplitude (=loudest, though 'loudness' and 'amplitude' are subtly different for reasons we won't worry about here). The formants (=the frequencies where there are amplitude peaks) change over the course of the utterance, as you go from "l" to the "aw" vowel to the "er" vowel to "l".

Roughly speaking, each formant has an articulatory correlate (or, the higher up you go, correlates, plural). The frequency of the lowest-frequency formant roughly tracks tongue movement in the up-down dimension (from a high position in the "ee" of "beet" to a low position in the 'short a' of "bat"). The second-lowest formant roughly tracks tongue movement in the front-back dimension (from the front position of the 'short a' in "bat" to the back position of the 'ah' in "bot"). We tend to perceive these differences regardless of the absolute frequency of the formants. A child's formants are higher than an adult's, because kids' mouths and necks are smaller than adults'.
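If you'd like to make a picture like this yourself, a spectrogram can be computed with standard tools. The sketch below uses SciPy on a synthetic harmonic tone; the signal and the analysis settings are invented for illustration, not the ones behind the figure above.

```python
import numpy as np
from scipy.signal import spectrogram

sr = 22050                                    # sampling rate, Hz
t = np.arange(sr) / sr                        # one second of time points
# Toy signal: harmonics of 200 Hz, with made-up amplitudes
sig = (1.0 * np.sin(2 * np.pi * 200 * t)
       + 0.8 * np.sin(2 * np.pi * 600 * t)
       + 0.6 * np.sin(2 * np.pi * 1600 * t))

# freqs is the y axis, times the x axis, and power the 'shading'
freqs, times, power = spectrogram(sig, fs=sr, nperseg=1024, noverlap=768)
loudest = freqs[power.mean(axis=1).argmax()]  # strongest frequency band
```

Plotting `power` (for instance with matplotlib's `pcolormesh`, on a log scale) gives exactly the kind of time-by-frequency-by-amplitude picture described above, with the loudest bands showing up as dark stripes.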
Still, we perceive the lowest-frequency formant as tracking up-down tongue movement, regardless of whether it's lower frequency overall (as in an adult) or higher-frequency overall (as in a kid). Think of the way a melody sounds on an alto saxophone versus a baritone saxophone. We can hear the same notes and melody even though the timbre—the 'tone quality', so to speak—is different because of the overall size differences of the instrument. For illustration, here is the lowest 2000 Hz of the above spectrogram, with the formants overlaid on it in red.
Now, right away we see that there is something awry about this signal. Where there should be a second formant, there are just speckles that look random, which makes the F2 hard to track. This is perhaps the first ingredient in why the signal is so susceptible to being identified differently. The F2 is, for some reason (overlapping voices? Intentional shenanigans? The girl from The Ring?), masked. That means that people can use 'top-down' knowledge (expectations, beliefs, priming from the words on the screen that this was presented with) to fill in their perception of the tongue's back-front movement. This was first pointed out by Rory Turnbull in his nice analysis of this signal.
Now, look at the first spectrogram, and focus on the higher frequencies. You see some faint stripes that look like lighter-gray formants at those higher frequencies. Those shouldn't be there. Humans can't produce those. Perhaps they were made by some bored undergrad who just discovered a few facts about signal processing. Let's imagine that the bored undergrad made them by taking some formants at lower frequencies and raising their frequencies (not a hard thing to do, honestly, using the 'change gender' function in the free software application Praat), then mixing them back in with the original signal. Now, just to be clear, the higher-frequency formants are not an additional 'new voice' (scare quotes intentional) in the signal. It's the same tuning forks. It's just that the higher-frequency tuning forks have the same patterns of high and low loudnesses as the lower-frequency ones. I heard the higher-frequency formant sequences when I first listened to this signal two hours ago and thought that they maybe were someone talking in the background. Then I thought "ERMERGERD, IT'S THE AUDIO VERSION OF THE RING."
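For what it's worth, the hypothesized trick is easy to approximate. The sketch below is emphatically not Praat's 'change gender' algorithm (which shifts formants while treating pitch separately); it is just a crude frequency shift by resampling, with an invented test signal, enough to show how a copy of a signal can be transposed upward and mixed back in with the original.

```python
import numpy as np
from scipy.signal import resample

def crude_raise(signal, factor):
    """Crudely shift every frequency up by `factor`. The signal also gets
    shorter in time; Praat's 'change gender' is far more careful than this."""
    return resample(signal, int(len(signal) / factor))

sr = 22050
t = np.arange(sr) / sr
voice = np.sin(2 * np.pi * 150 * t)          # stand-in for a low-frequency component
raised = crude_raise(voice, 1.5)             # 150 Hz becomes 225 Hz
mixed = voice[:len(raised)] + raised         # original plus transposed copy
```

A spectrogram of `mixed` would show the same pattern stacked at two frequency scales, which is the signature being described in the figure.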
Then I remembered that I was an adult, and that The Ring was just a movie. So, I decided to look at the spectrogram. Then, I made the attached spectrogram and realized that one of the primary weird things in this signal is that it has a lower-frequency formant pattern repeated at higher frequencies.
Excursus: Rory Turnbull pointed out that we don't know what kind of compression algorithm is used on Twitter. Maybe that's to blame, and not a bored undergrad. But no matter, I will assert with some confidence that these *shouldn't* be there based on what we know about human speech production.
Returning from the excursus, we can now ask: what does that mean?
One possibility is that the formant pattern at the higher frequencies is just "Laurel" transposed to higher frequencies, and that "Laurel" sounds like "Yanny" at higher frequencies. That's plausible–we never hear that kind of higher-frequency speech, and we don't have a huge body of scholarship on what higher-frequency formants would sound like.
Excursus: since writing this, Suzy Styles has tweeted a B-E-A-U-T-I-F-U-L tutorial on speech acoustics, along with her equally beautiful interpretation of this phenomenon. WTG, Dr. Prof. Styles! If you read Dr. Prof. Styles' tutorial, you will learn a figurative ton about the perception of formant-frequency changes. If you want to learn more about the perception of higher-frequency formants, you can dive head-first into UC-Davis professor Santiago Barreda's work on formant-frequency shifting in speech.
So, circling back: high-frequency formants like those highlighted with "shenanigans" above just don't occur in human speech.
Is that the only possibility? Of course not. Another possibility is that the lower-frequency signal has "Laurel" and "Yanny" mixed together on top of one another. Maybe the pitch (=the 'lowest frequency tuning fork') for "Yanny" is lower than that for "Laurel", which would jibe with the percept that was reported to me by Vox's Jen Kirby. If that were true, and if the higher-frequency formant sequences (the ones that were added artificially) were also "Yanny", then you'd be setting people up to hear "Yanny" if they could hear high frequencies well. That is, the higher-frequency formant ensembles (the girl from The Ring/bored undergrad stuff that is outlined in red boxes on my figure) might be most clearly audible to folks with good headphones and good hearing, and those who can hear them might then be more likely to hear the lower-frequency "Yanny" 'pop' away from the "Laurel" than those of us with Dad hearing (sorry, I love quacking about the fact that I made it to middle age) and cheap headphones. I'm not sure–there's no easy way to analyze overlapping speech signals. Separating concurrent voices is actually one of the hardest problems in speech engineering, and is something that human ears can do better than machines most of the time. So, we're all left just speculating at this time. But, I'd say that my story is as plausible as anyone's. I still say that it's possible that it's just the formant pattern "Laurel" repeated at multiple frequencies, but if you say that you hear a lower-frequency "Yanny", I don't want to argue.
Now, just to give some additional evidence for this argument, I have embedded sound files that have various frequency ranges in them, from low to high. To be clear, these don't correspond to the frequency ranges above. The lowest is 0-4500 Hz, the next-lowest is 2000-6500 Hz, and the highest is 4500 to 9500 Hz. To quote University of Arizona professor Natasha Warner, "I tried filtering a whole bunch of ways (low-pass at various frequencies, high-pass at various frequencies), all on my computer with the sound turned up, and I consistently get Yanny. Of course high-pass filter at 5000 Hz makes the whole thing sound like crickets, roughly. But crickets saying 'Yanny'."
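Filtering of the sort Warner describes is straightforward with standard signal-processing tools. Here is a minimal Butterworth band-pass sketch: the 2000-6500 Hz range matches the middle clip above, but everything else, including the toy two-tone test signal, is invented for illustration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(signal, low_hz, high_hz, sr):
    """Keep only the frequencies between low_hz and high_hz."""
    sos = butter(6, [low_hz, high_hz], btype='bandpass', fs=sr, output='sos')
    return sosfiltfilt(sos, signal)

# Toy check: a 300 Hz + 6000 Hz mixture, band-passed to 2000-6500 Hz,
# should keep the cricket-like 6000 Hz component and drop the 300 Hz one.
sr = 22050
t = np.arange(sr) / sr
mix = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 6000 * t)
filtered = bandpass(mix, 2000, 6500, sr)
```

Swapping `btype` to `'lowpass'` or `'highpass'` (with a single cutoff) reproduces the other filtering conditions Warner tried.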
OK, hopefully that was a little clearer. There are a few long-term takeaways from this kerfuffle:
(1) Damn, sure is nice not to be talking about '45' for once. We need stuff like this to go down more often!
(2) Mad kudos to the woman who posted this on her Twitter. She was described as a "Social Media Influencer," and I say that she is at the top of her game. By sharing this, she brought enormous attention to herself. That's her job, and she's doing it cunningly and shrewdly.
(3) Speech is hard to understand. We might naively think it's easy to understand because we use it all the time (at least, folks who communicate in the oral/aural modalities do). Speech acoustics (and acoustics in general) are, in my thinking, not intuitive. Moreover, this body of knowledge doesn't build on other bodies of knowledge that most people have. When you learn about language in school, it's mostly about written language, not spoken. That's not me being snotty, but rather me saying that it must be hard to write about this kind of information for a broad audience, because it's three layers removed from what most people think about daily. Even disentangling the types of frequencies ('what is the lowest-frequency tuning fork?' vs. 'what are the frequencies of the loudest tuning forks?') takes a little bit of a conceptual leap. One of the reasons why speech is such a neat phenomenon is because there is so much work to be done still at the ground level. I hope that this phenomenon will inspire people to think more about speech science, experimental phonetics, and the nascent field called 'laboratory phonology'. Good places to start looking for work on these topics are www.acousticalsociety.org and www.labphon.org.
(4) Building on (3), way to go to the speech science/experimental phonetics/labphon communities for working together as a team to talk about this phenomenon. I feel hashtag blessed to be part of a community that has precious few members who are driven by credentialist ambition, and who are instead driven to work as a community to solve problems as they arise.
(5) And lastly, as someone whose primary job is to train people to be speech-language pathologists, consider this. Did you find listening to this audio sample maddeningly hard? Welcome to the daily world of people for whom speech perception is not always automatic. This includes people with even mild hearing loss, people with subtle auditory perception and processing problems that are associated with various learning disabilities (developmental language disorder, speech sound disorder, dyslexia, autism spectrum conditions), and even new second-language learners. The frustration that you might have felt listening to this signal is what many of these folks face on a daily basis when listening to something as seemingly simple as trying to identify speech in the presence of background noise. Turn your frustration into empathy and advocacy for those folks. Learn more at www.asha.org, and support your local speech-language pathologists and audiologists!
[end guest post by Benjamin Munson]
A postscript (from Ben Zimmer): It turns out the audio in question comes from the pronunciation given for the word laurel on Vocabulary.com (link). This was revealed on Reddit and has been subsequently analyzed on Twitter by Carolyn McGettigan and Suzy J. Styles. As it happens, I used to work for Vocabulary.com and its sister site, the Visual Thesaurus. In fact, back in 2008 when the audio pronunciations were first rolled out for the Visual Thesaurus, I wrote about the project here on Language Log (as well as on the VT site), explaining how we had worked with performers trained in opera, who were adept at reading the International Phonetic Alphabet. So the laurel audio ultimately comes from one of those IPA-savvy opera singers!
Update (myl): The NYT has an app that lets you use a slider to vary the low-to-high preemphasis of the recording, which should help you to understand what's going on.
Even later update (from Ben Zimmer): This article in Wired gives the whole backstory, including more on the opera singer who pronounced the word laurel for Vocabulary.com, and the high school students who circulated the audio. I also spoke briefly to Wired editor-in-chief Nicholas Thompson about the whole Yanny/Laurel debate on CBS Evening News.