Harnessed: How Language and Music Mimicked Nature and Transformed Ape to Man, by Mark Changizi (part 8)

Chapter 2

Speech Events

Grasshopper

In M. Night Shyamalan’s movie The Village, a young woman, Ivy, sets off on a journey into an unknown forest. She has persuaded the elders of her tribe to let her find other people on the far side of the forest, get medicine, and return to save the life of her sick lover. She has no knowledge of anything beyond the several acres of her village, except that beyond their meadow and inside the forest are chilling, otherworldly beasts that occasionally invade the village and carve up one of the pets.

As if this quest were not harrowing enough, there’s an important fact I left out: she is blind. Now, the village leaders know the truth about what’s beyond their meadow—no beasts (but the costumed elders themselves), just woods, and then modern civilization, from which they’ve sheltered their children. That’s why they allow her to go into the forest. But no one of Ivy’s generation knows this. And neither do we, the moviegoers. We’re terrified for her. As it turns out, terrifying things do happen to her in that forest, because a monster (really a man from the village in a monster costume) secretly follows her, and eventually attacks her.

The movie would be considerably less dramatic if our heroine were deaf, rather than blind. Instead of a woman waving her arms and tramping about through the thorny tangles, we’d be watching a woman walking normally through the forest, keeping to deer trails. In fact, many of us regularly do just this, wearing headphones and blasting music as we deafly, yet deftly, jog through our local park. This would not quite elicit the thrill Shyamalan had in mind. A deaf person on a forest quest does not make a good movie. Being deaf just doesn’t seem like much of a big deal compared to blindness. If not for the inability to hear speech, we might hardly miss our auditory systems if they fell out through our ears.

Then again, there’s another twist to the story that may change one’s feeling about audition: our young blind heroine defeats her attacker. She kills him, in fact. She may look out of sorts crashing into trees, but her hearing makes it impossible for her attacker to sneak up on her. Especially in the forest. Had she been deaf, not blind, her attacker could have whistled “Dixie” with an accordion accompaniment while following her through the woods and still taken her completely by surprise.

If deaf-maiden-alone-in-the-forest is not spine-tingling to movie audiences, it is only because we tend not to appreciate all that our ears do for us beyond language. Providing a sneakproof alert system is just one of the many powers of audition.

The greatest respect for our ears is found among blind kung fu masters. Every “Grasshopper” learns from his old blind master that by attending to and dissecting the ambient sounds around oneself, it is possible to sense how many attackers surround one, their locations, stances, weapons, intent, confidence level, and which one is the enemy mastermind. I once saw, in an old movie, one of these scrawny geezers defeat six men using only a baseball bat wielded upside down. But you don’t have to be a fictional blind kung fu master to have a mastery of audition and know how to sense the world with it. We all do; we just don’t get all “Grasshopper” about it. Our brains have a mastery of it even if we’ve never thought about it.

In fact, when I first began pondering whether speech might sound like natural events, I had great difficulty thinking of any important natural-event sounds. I was initially dumbfounded: what is so useful about having ears that nearly all vertebrates have them? It seemed to me that I primarily use my ears for listening to speech, and that obviously cannot explain why all those other vertebrates have ears as well. Sure, it is difficult to sneak up on me, but one hardly needs such a fine-tuned ear and auditory system for a simple alarm.

After some months of contemplation, however, I came to consciously appreciate my ability to use sound to recognize the world and what’s happening around me. I began to notice every tap, clink, rub, burble, and skid. And I noticed how difficult it was for me to do anything without making a sound that gave away what I was doing, like eating from my daughter’s Halloween stash. When you’re next at home and your family is active around you, close your eyes and listen. You will hear sounds such as the plink of a spoon in a coffee mug, the scrape of a drawer opening, or the scratch of crayons on drywall. It will typically take some time before you hear an event that you cannot recognize. In the late 1980s, the psychologist William Gaver played environmental sounds to listeners, and asked them to identify what they heard. He found that people are impressive at this: most are capable, for example, of distinguishing running upstairs from running downstairs. Research following in the tradition of work done by the psychologist William H. Warren in the mid-1980s has shown that people are even able to use sound to sense the shapes and textures of some objects.

Our ears and auditory systems are, then, highly designed for and competent at sensing and recognizing what is happening around us. Our auditory systems are priceless pieces of machinery, just the kind of hardware that cultural evolution shouldn’t let go to waste, perfect for harnessing. In this chapter, I sift through the sounds of nature and distill a host of regularities found there, regularities that apply nearly anywhere—in the jungle, on the tundra, or in a modern city. The idea is that our auditory system, having evolved in the presence of these regularities for hundreds of millions of years, will have evolutionarily “internalized” them; our auditory system will therefore work best when incoming sounds conform to these regularities. I will then ask whether the sounds of speech across human languages tend to respect these regularities. That’s what we expect if language harnesses us.

Over Hear

It can be difficult for students to attract my attention when I am lecturing. My occasional glances in their direction aren’t likely to notice a static arm raised in the standing-room-only lecture hall, and so they are reduced to jumping and gesturing wildly in the hope of catching my eye. And that’s why, whenever possible, I keep the house lights turned off. There are, then, three reasons why my students have trouble visually signaling me: (i) they tend to be behind my head as I write on the chalkboard, (ii) many are occluded by other people, are listening from behind pillars, or are craning their necks out in the hallway, and (iii) they’re literally in the dark.

These three reasons are also the first ones that come to mind for why languages everywhere employ audition (with the secondary exceptions of writing and signed languages for the deaf) rather than vision. We cannot see behind us, through occlusions, or in the dark; but we can hear behind us, through occlusions, and in the dark. In situations where one or more of these—(i), (ii), and (iii) above—apply, vision fails, but audition is ideal. Between me and the students in my course lectures, all three of these conditions apply, and so vision is all but useless as a route to my attention. In such a scenario a student could develop a firsthand appreciation of the value of speech for orienting a listener. And if it weren’t for the fact that I wear headphones blasting Beethoven when I lecture, my students might actually learn this lesson.

The three reasons for vision’s failure mentioned above are good reasons why audition might be favored for language communication, but there is a much more fundamental reason, one that would apply to us even if we had eyes in the backs of our heads and lived on wide-open prairies in a magical realm of sunlit nights. To understand this reason, we must investigate what vision and audition are each good at.

Vision excels at answering the questions “What is it?” and “Where is it?” but not “What happened?” Each glance cannot help but inform you about what objects are around you, and where. But nearly everything you see isn’t doing anything. Mostly you just see nature’s set pieces, currently not participating in any event—and yet each one is visually screaming, “I’m here! I’m here!” There’s a simple reason for this: light is reflecting off all parts of the scene, whether or not the parts have anything interesting to say. Not only are all parts of a scene sending light toward you even when they are not involved in any event, but the visual stimulus often changes in dramatic ways even when the objects out there are not moving. In particular, this happens whenever we move. As we change position, objects in our visual field dynamically shift: their shapes distort, nearer objects move more quickly, and objects shift from visible to occluded and vice versa. Visual movement and change are not, therefore, surefire signals that an event has occurred. In sum, vision is not ideal for sensing events because events have trouble visually outshouting all the showy nonevents.

If visual nature is the loquacious coworker you avoid eye contact with, auditory nature is (ironically) the silent fellow who speaks up only to say, “Piano falling.” Audition excels at the “What’s happening?” question, sensing a signal only when there’s an event. Audition not only captures events we cannot see, like my (fictional) gesticulating students, but also serves to alert us to events occurring within our view. Nonevents may be screaming visually, but they are not actually making any noise, and so audition has unobstructed access to events, for the simple reason that sound waves are cast only when there is an event.

That’s why audition, but not vision, is intrinsically about “what’s happening.” Audition excels at event perception. And this is crucial to why audition, but not vision, is best suited for everyday language communication. Communication is a kind of event, and thus is a natural for audition. That is, everyday person-to-person language interactions are acute events intended to be comprehended at that moment. Writing is not like this; it is a longer-term record of our thoughts. And when writing does try to be an acute person-to-person means of communication, it tends to take measures to ensure that the receiver gets the message now—and often this is done via an auditory signal, such as when one’s e-mail or text messaging beeps an alert that there is a new message.

That language is auditory and not visual is, in the broadest sense, a case of harnessing, or being like nature for the purpose of best utilizing our hardware. Language was culturally selected to utilize the auditory modality because sound is nature’s modality of event communication.

That’s nice as far as it goes, but it does not take us very far. The Morse code for electric telegraphy utilizes sound (dots and dashes), and even the world-record Morse code reader, Ted McElroy, could only handle reading 75.2 Morse code words per minute (a record set in 1939), whereas we can all comprehend speech comfortably at around 150 words per minute—and with effort, at rates approaching 750 words per minute. Fax machines and modems also communicate by sound, but no human language asks us to squeal and bleep like that. Clearly, not just any auditory communication will do. And that brings us to the main aim of this chapter: to say what auditory communication should sound like in order to best harness our auditory system. We move next to the first step in this project: searching for the atoms of natural sounds, akin to the contours in natural scenes on the visual side.

Nature’s Phonemes

By understanding the different evolutionary roles for vision and audition, we just saw that audition is the appropriate modality to harness for language: sound is nature’s standard event stream, and language therefore wants to utilize sound to make sure language utterances get received. But what kinds of sounds, more specifically, should language use to best harness our brains? The sounds of nature, of course. But the natural world has a large portfolio of sounds it can make, and people are good at mimicking a fair share of these sounds, mostly with their mouths, but sometimes with the help of their hands and underarms. Saying that a well-designed language will use sounds from nature is like saying one had “a sandwich” in a deli. Which sounds from nature? Wind blowing, water splashing, trees falling (when someone is around), leaves rustling, thunder, animal vocalizations, knuckle cracks, eggs breaking? Where is language to begin?

Although nature’s sounds are all over the map, there’s order to the cacophony. Most events we hear are built out of just three fundamental building blocks: hits, slides, and rings.

Hits happen whenever a solid object bumps into another object. When you walk, your feet hit the ground. When you knock, your knuckles hit the door. A tennis match is a game of hits—ball hits racket, ball hits net, ball hits ground. Hits make a distinctive sound. They happen suddenly, and the auditory signal consists of an almost instantaneous explosive burst of energy emanating from the impact.

Slides are the other common kind of physical interaction between solid objects. Slides occur whenever there is a long duration of friction contact between surfaces. If you drag your finger down the page of this book, you’re making a slide. If you push a box along the floor, that’s a slide. The auditory structure of slides differs from that of hits: Rather than a nearly instantaneous release of energy, slides have a non-sudden start and a white-noise-like sound that can last for a more extended period of time. Slides are less common than hits. First, they require a special circumstance, the extended interaction of two surfaces; hits, on the other hand, are what perception scientists call “generic,” because no special coincidences are needed to carry off a hit. Second, when slides do happen their friction tends to significantly lower the energy in the event, and therefore they commonly occur at the tail ends of events. Third, whereas a long sequence of hits is possible (with intervening rings, as discussed in a moment)—as when a ping pong ball bounces lower and lower, for instance—a long sequence of distinct slides is not typically possible; something would have to stop one slide to allow another one to start, but any such interference with a slide is likely to involve a hit.

Hits and slides are the only physical interactions among solid objects that we regularly experience, and they are certainly the primary ones our ancestors would have experienced. We are land mammals. Splashes, involving a solid and a liquid, are neither hits nor slides, and although they could shape the auditory system of otters, seals, and whales, they’re unlikely to be of central significance to our auditory system.

With the two kinds of solid-object physical interaction out of the way, we are left with the final fundamental constituent of these natural events: rings. A ring is what happens to a solid object after a physical interaction, that is, after a hit or a slide. When a solid object is physically impinged upon, it vibrates and wobbles, and although one can almost never see these vibrations, one can hear them. You can tell from the sound whether your pen is tapping your desk, your computer, or your coffee mug, because the same pen hit leads to different rings; you may also be able to tell that it is the same pen hitting the three different objects.

Different objects ring in distinct “timbres,” a word (pronounced “TAM-ber”) that refers to the overall perceptual nature of the sound. For example, a piano C and a violin C have the same pitch, or frequency, but they differ in the quality or texture of their sound, and timbre refers to this. Most objects have very short-lived rings—unlike the long-drawn-out ring of a gong—but they do ring, and once you set your mind to noticing, you’ll be amazed to hear these rings everywhere. And it is not just hits that ring, but slides as well. The vibrations that occur when any two objects hit each other will have many similarities to the vibrations resulting from the same two objects sliding together, so that we can tell that a coffee mug is being dragged along the desk because the ring possesses certain features also found in the ring of a pinged coffee mug.

Hits, slides, and rings are, therefore, nature’s primary phonemes (see Figure 3). They are a consequence of how solid physical objects interact and vibrate. Although these three kinds of sound are special in the lexicon of nature, there is nothing requiring language to carve sounds at these joints. Dog woofs, cat calls, horse neighs, whale song, and bird song do not carve at these joints. Neither does the auditory communication of a fax machine. But if a language is to be designed to harness the human auditory system, then it will be built out of the sounds of hits, slides, and rings.

Figure 3. The three principal constituents of physical events: (a) hits, (b) slides, and (c) rings. They sound suspiciously similar to plosives, fricatives, and sonorant phonemes in human languages.
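The acoustic signatures described above (a sudden burst for a hit, a gradual and sustained noise for a slide, a periodic vibration at a few characteristic frequencies for a ring) can be made concrete with a small synthetic sketch. Everything in it is an assumption chosen for illustration rather than anything measured in the book: the sample rate, the decay rates, the two ring frequencies, and the function names `hit`, `slide`, `ring`, `attack_time`, `sustained_time`, and `peak_share` are all hypothetical.

```python
import cmath
import math
import random

SAMPLE_RATE = 8000  # samples per second (arbitrary choice for this sketch)
WINDOW = 40         # envelope window of 40 samples, i.e. 5 ms


def hit(duration=0.05, decay=200.0, seed=0):
    """Noise burst at full strength from the first sample, dying away fast."""
    rng = random.Random(seed)
    n = int(duration * SAMPLE_RATE)
    return [rng.uniform(-1, 1) * math.exp(-decay * i / SAMPLE_RATE) for i in range(n)]


def slide(duration=0.4, ramp=0.1, seed=1):
    """Sustained noise with a gradual (non-sudden) onset and offset."""
    rng = random.Random(seed)
    n = int(duration * SAMPLE_RATE)
    ramp_n = int(ramp * SAMPLE_RATE)
    return [
        rng.uniform(-1, 1) * min(1.0, i / ramp_n, (n - 1 - i) / ramp_n)
        for i in range(n)
    ]


def ring(duration=0.4, freqs=(440.0, 1120.0), decay=8.0):
    """A few slowly decaying sinusoids: energy at only certain frequencies."""
    n = int(duration * SAMPLE_RATE)
    return [
        math.exp(-decay * i / SAMPLE_RATE)
        * sum(math.sin(2 * math.pi * f * i / SAMPLE_RATE) for f in freqs) / len(freqs)
        for i in range(n)
    ]


def envelope(signal):
    """Root-mean-square level of each consecutive 5 ms window."""
    return [
        math.sqrt(sum(x * x for x in signal[i:i + WINDOW]) / WINDOW)
        for i in range(0, len(signal) - WINDOW + 1, WINDOW)
    ]


def attack_time(signal):
    """Seconds until the envelope first reaches half of its peak."""
    env = envelope(signal)
    first = next(k for k, level in enumerate(env) if level >= 0.5 * max(env))
    return first * WINDOW / SAMPLE_RATE


def sustained_time(signal):
    """Seconds the envelope spends above 10% of its peak."""
    env = envelope(signal)
    return sum(level > 0.1 * max(env) for level in env) * WINDOW / SAMPLE_RATE


def peak_share(signal, n=400, top=6):
    """Fraction of spectral energy in the `top` strongest bins of a naive DFT."""
    segment = signal[:n]
    power = []
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(segment))
        power.append(abs(coeff) ** 2)
    return sum(sorted(power, reverse=True)[:top]) / sum(power)


if __name__ == "__main__":
    for name, signal in (("hit", hit()), ("slide", slide()), ("ring", ring())):
        print(name, attack_time(signal), sustained_time(signal), round(peak_share(signal), 2))
```

Under these assumptions the synthetic hit reaches its peak immediately and is over within a few hundredths of a second, the slide builds up gradually and lasts far longer, and the ring concentrates nearly all of its energy in a handful of frequency bins while both noise-based sounds spread theirs widely. These are toy signals, so the numbers illustrate the three categories rather than model any real object.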

Are human languages built out of these constituents? Yes. In fact, the most fundamental universal of human speech is that phonemes, the “atoms” of speech, come in three primary types, and these types match nature’s phonemes! Language’s hits, slides, and rings are, respectively, plosives, fricatives, and sonorants.

Plosives—like b, p, d, t, g, and k—are found in every language, and consist of sudden, explosive, high-energy inceptions. Plosives sound like hits (even embedding their explosive hitlike starts in the name). Figure 4a shows the time-varying frequency distribution for the sound made when I hit my desk with a small plastic cup, and one can see that the hit begins with a sharp vertical line indicating the presence of a wide range of frequencies at the instant of the collision. That same figure shows, on the right, the same kind of plot when I made a “k” sound. Again one can see the sharp edge at the beginning of the sound, characteristic of a hit. (Also note that, in English, at least, one finds many plosive-filled words with meanings related to hits: bam, bang, bash, blam, bop, bonk, bump, clack, clang, clink, clap, clatter, click, crack, crush, hit, klunk, knock, pat, plunk, pop, pound, pow, punch, push, rap, rattle, tap, and thump.)

Languages have a second principal kind of consonant called the fricative, such as s, sh, th, f, v, and z. They are extended and noisy, and sound like slides. (In fact, the very word “fricative” captures the friction nature of a slide.) And just as slides are rarer than hits, fricatives are less common than plosives. All languages have plosives, whereas many languages (especially in Australia) do not have fricatives. Figure 4b, on the left, shows the frequencies of sound emanating from a small cup that I slid on my desk, and one can see that there is no longer a crisp start to the sound as there was for hits. There is also a longer duration of sound, all of it with a wide range of frequencies. On the right of Figure 4b is the same kind of plot, this one generated when I made a “sh” sound. One sees the signature features of a slide in fricatives. (Also note that in English, at least, one finds many fricative-filled words with meanings related to slides: fizzle, hiss, rustle, scratch, scrunch, shuffle, sizzle, slash, slice, slip, swoosh, whiff, whiffle, and zip.)

The third principal phoneme type used across human languages is the sonorant, including vowels like a, e, i, o, u, but also sonorant consonants like l, r, y, w, m, and n. Each of these phonemes has strongly periodic vibrations, and has a complex spectral shape. Sonorants sound like rings. Figure 4c, left, shows the ringing after tapping my coffee mug. Only certain frequencies occur during the quickly decaying ring, and these frequency bands are characteristic of the shape and material properties of my mug. To the right of that in Figure 4c is the signal of me saying “ka.” (The plosive “k” sound corresponds to the tap.) As with the coffee mug, there are certain frequency bands that are more active, and these patterns are what characterize the sound as an “a.”

Lo and behold! The principal three classes of phonemes in human speech sound just like nature’s three classes of phonemes. We speak in hits, slides, and rings!

Before getting overly excited by the realization that language’s phonemes are like nature’s phonemes, we must, however, address a worry: How else could we speak? What if human vocalization can’t help but sound like hits, slides, and rings? If that were the case, then the observations made in this section would have little significance for harnessing; culture would not need to design language to sound like hits, slides, and rings, because our mouths would make these sounds by default. We take this up next.

Figure 4. Illustration that plosives, fricatives, and sonorants sound like hits, slides, and rings, respectively. These plots show the frequencies on the y-axis, and time on the x-axis. Comparison of (a) hits and plosives, (b) slides and fricatives, and (c) rings and sonorants.

Tongue Wagging

When the Mars Rover landed on Mars, it bounced several times on balloon-like cushions; the cushions then deflated, allowing the rover to roll gently onto the iron-red dirt. If you had been there watching the bouncy landing, you would have heard (as you writhed in pain from decompression in the low-pressure atmosphere) a sequence of hits, with rings in between. And once the rover found a place to take a sample of Martian soil, it would have scraped debris into a container for analysis, and that scrape would have sounded like a slide, followed by a ring characteristic of the rover’s scraping arm. Hits, slides, and rings on Mars! It is not so much that hits, slides, and rings are Earthly nature’s phonemes as that they are physics’ phonemes. These sounds are the principal building blocks of event sounds anywhere there are solid objects interacting, even in our mouths.

Our mouths have moving parts, including a powerful and acrobatic tongue; fleshy, maneuverable lips; and a jaw rigged with rock-hard teeth. When we speak, these parts physically interact in complex ways, creating speech events. But speech events are events, and if hits, slides, and rings are the fundamental constituents of physical events, then speech events must also be built from hits, slides, and rings in the mouth. It is no wonder, then, that human speech sounds like hits, slides, and rings. Speech is built from the fundamental constituents of physical events because speech is a physical event. Harnessing would appear to have nothing to do with it.

However, when we speak, our mouth is not simply a container with a tongue, lips, and teeth rattling around. We are not, for example, making hit sounds by tapping our teeth together, or slide sounds by grinding our teeth. When our mouth (in collaboration with our nose, throat, and lungs) makes sounds, it is using mechanisms for sound production that go well beyond the solid-object event atoms: hits, slides, and rings. Although hits, slides, and rings are the most fundamental kinds of physical events (because solid-object events are the most fundamental kind of physical event), they are not the only kinds. There are hosts of others. In particular, there are many physical events that involve the flow of fluid or air. The events in our mouths that make the sounds of speech are events involving airflow, not hits, slides, or rings at all. Airflow events in our mouths mimic hits, slides, and rings, the constituents of solid-object physical events. Our mouths make a plosive by a sudden release of air, not by an actual collision in the mouth. Fricatives are made by the noninstantaneous movement of air through a tight passage; no surfaces in the mouth are actually rubbed against one another. And sonorants are not due to an object vibrating because of a hit or slide; instead, sonorants come from the vocal cords vibrating as air passes by.

Hit, slide, and ring sounds without hits, slides, or rings! What a coincidence! Human speech employs three principal sounds via airflow mechanisms, and yet they happen to sound just like the three principal sounds that happen in events with physical interactions between solid objects. Utterly different mechanisms, but the same resultant sound. That’s too coincidental to be a coincidence. That’s just what harnessing expects: airflow sound-producing mouths settling on just a few sounds for language—the sounds of physical interactions among solid objects.

We must be careful, though. What if airflow mechanisms cannot help but make hit, slide, and ring sounds? Or, more to the point, could it be that the particular airflow mechanisms our mouths are capable of can lead only to sounds like hits, slides, and rings? No. Human mouths are capable of sounds much more varied than the sounds of interacting solid objects. For example, people can mimic many animal sounds—quacks, moos, barks, ribbits, meows, and even human sounds like slurps, burps, sneezes, and yawns—that are constructed out of constituents beyond simple hit, slide, and ring sounds. People can mimic water-related sounds—like splashes, flushes, and drips—none of which are built from hit, slide, and ring sounds. And our airflow sound-mimicking mouths can, of course, mimic airflow sounds—like a soda pop being opened, howling wind, or even breaking wind—also unrelated to the sounds of hits, slides, and rings. People can mimic “hot” sounds, like sizzling bacon and roaring fires. They can even mimic the sounds of revving motorcycles, fax machines, digital alarm clocks, shrilling phones, and alien spaceships, none of which are sounds built from hits, slides, and rings. We see, then, that our airflow sound-producing mouths have a very wide repertoire, and yet speech has employed only the barest of our talents for mimicry, preferring exactly the sounds that occur among interacting macroscopic solid objects. We’re not, therefore, speaking in hits, slides, and rings by default. That we find these in all languages is a sign that we have been harnessed.

In upcoming sections, I will also concentrate on some other kinds of sounds our mouths can produce, but that language tends to avoid; these cases deserve special attention because of their prima facie similarity to sounds we do find in speech. Thus, they can help to answer the question of why speech utilizes some sounds we can make, but not others we can make just as easily. For example, we will see in the upcoming section that although we can make the sounds of wiggly hits and slides, we do not have them as phonemes—and this is consistent with their absence in physics. In the section following that we will see that although we can make slide-hit sounds and hit-slide sounds, only the latter is given the honor of phoneme status in languages (see the section titled “Nature’s Other Phoneme”), consistent with hit-slides being a fundamental sound in physics, while slide-hits are not. And we’ll see in the “Two-Hit Wonder” section that a simple kind of sound (a “beep”) that could exist as a phoneme does not occur in human languages, consistent with its nonprimitive status in physics. More generally, for the next five sections I will brandish a magnifying glass and closely examine the internal structures of hits, slides, and rings, asking whether those same fine structures are found in plosives, fricatives, and sonorants, respectively.

Wiggly Rings

Harmonicas don’t get no respect. They’re cheap (I just found one online for $5), tiny hunks of metal that tend to be played by guys who didn’t finish finishing school. I’ve had a couple of harmonicas for years, and have never understood them: they don’t have all the notes and can only play three chords. Blowing on a harmonica can’t help but sound fairly good, but I have always been frustrated by my inability to get it to do much more. A serious blues harmonica player can create sounds far richer than seems possible from what would appear to be little more than a toy.

A harmonica is deceptive because it is, in a sense, not an entire instrument at all. It is perhaps half an instrument—maybe that’s why they’re so inexpensive. The other half of the instrument is the human hand. That explains why the best harmonica players have hands, and, in addition, tend to move them all about the instrument when playing. This is described as “bending” the notes, and by doing so, the performer can provide a musical dynamism not possible with just the twenty or so notes in the harmonica’s range. The sounds reaching the listener’s ears are not only those coming directly from the harmonica, but also the harmonica sounds that first bounce off objects in the environment before reflecting toward the listener’s ears. For the note-bending blues performer, the hands are the objects the sounds bounce off. Each time a sound bounces off something, some sound frequencies are absorbed more than others, and so the timbre of the sound coming from that reflection is changed. The total timbre depends on the totality of harmonica sounds that reach the ear directly and indirectly from all points in the environment. And we’re able to hear these sound shapes, which is why harmonica benders go to all the trouble of wiggling their hands—and why there are acoustics engineers who worry about the physical layout of auditoriums.

Bending and acoustic reflections don’t just matter in the blues and in concert halls where instruments (including half instruments) are crooning out musical tones. Objects involved in events also croon, or ring. A ring has a complex timbre that informs us of the object’s size, shape, and material. But just like harmonica sounds, rings can get bent by the environmental surroundings. And our brains can decode the bends, and can give us a sense of our surroundings purely on the basis of the shapes of the sounds reaching our ears. The psychologist James J. Jenkins demonstrated in 1985 that blindfolded students, after a little practice, can navigate very well amongst obstacles by utilizing such auditory cues.

These acoustical observations about how the surroundings affect sound have an important consequence for the internal structure of rings: rings can be wiggly. There are several converging reasons for this. First, an event that causes a ring often also sets the ringing object in motion: something has been hit, or something is sliding. Because the shape of a ring reaching one’s ears depends on the object’s surroundings, ringing objects that are moving produce rings that vary over time. Second, when an event occurs, we are often on the move. Because the shape of the ring we receive depends, in part, upon our position in the world, the shape of the ring reaching our ears may be varying over time. In each case, whether we are moving or the object is, the timbre of a ringing object can change, and these are wiggles we notice, at least subconsciously. In addition to such dynamic changes in the subtleties of a ring’s timbre, there is another dimension in which rings can often vary: pitch, the musical-note-like “higher” or “lower” quality of sound. When motion is involved—either our own motion or that of the objects involved in events—we get Doppler shifts, a phenomenon we are all familiar with, as when a car approaching you sounds higher-pitched than when it is moving away. (See also the later section of this chapter titled “Unresolved Questions” for more about the Doppler effect and its stamp upon speech. And see the following chapters on music, where the Doppler effect will be discussed in detail.)
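The Doppler shift mentioned above can be put into numbers with the standard textbook formula for a moving source and a stationary listener. The following is a minimal sketch, not something taken from the book: the function name `doppler_frequency`, the 440 Hz tone, and the 20 m/s speed are illustrative assumptions, and the speed of sound is the usual approximate value for air at about 20 °C.

```python
SPEED_OF_SOUND = 343.0  # metres per second in air at roughly 20 °C (approximate)


def doppler_frequency(source_hz, radial_speed):
    """Frequency heard by a stationary listener.

    radial_speed is the source's speed along the line to the listener,
    in metres per second: positive when approaching, negative when receding.
    Uses f_heard = f_source * c / (c - v), valid for |v| below the speed of sound.
    """
    if abs(radial_speed) >= SPEED_OF_SOUND:
        raise ValueError("formula assumes the source is slower than sound")
    return source_hz * SPEED_OF_SOUND / (SPEED_OF_SOUND - radial_speed)


if __name__ == "__main__":
    # Hypothetical example: a 440 Hz ring on an object moving at 20 m/s.
    print(doppler_frequency(440.0, 20.0))   # approaching: about 467 Hz
    print(doppler_frequency(440.0, -20.0))  # receding: about 416 Hz
```

With these assumed values the same ring is heard at about 467 Hz while approaching and about 416 Hz while receding, a difference of roughly two semitones, which is why a single ring from a moving object can vary in pitch as it is heard.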

Rings can therefore change over time, both in timbre and in pitch. That is, a single ring can often be intrinsically dynamic. What about hits and slides?

Hits are nearly instantaneous, and for this simple reason they cannot change over time, at least not in the sense of continuously varying from one kind of hit to another. Hits can, of course, happen in quick succession, such as when you drop a pen and one end hits an instant before the other. But such a pen event would be two physical interactions, not one. Unlike a single ring, which can wiggle, a single hit has no wiggle room.

How about slides? Slides can occur for a lot longer than an instant, and so they can, in principle, dynamically vary over their occurrence. Although slides can be long—for example, a single snowy hill run on a sled may be one continuous slide—they are much more commonly short (though not instantaneous) in duration, because they quickly dissipate the energy of an event, sometimes ending it. Do the sounds of slides ever, in fact, dynamically vary over time? Before answering this, let’s be clear on what we mean by the sound of a slide. A slide can cause a ring, as we have discussed, but that is not what we’re interested in at the moment. We are, instead, interested in the sound made by the physical interaction of the two sliding surfaces—the noisy friction sound itself, caused by the coarseness of the objects involved. Therefore, to produce a wiggly slide, the coarseness of the surface being slid upon would have to vary, so that one friction sound would change gradually to another friction sound. Although coarseness varies randomly on lots of materials, few objects vary in a systematic, graded fashion, and thus slides will tend to have a rather nonvarying sound.

Rings, then, can be wiggly. But not hits, and not slides. If language has culturally evolved to sound like nature, then we would expect that sonorant phonemes (language’s rings) would sometimes be dynamically varying, but not plosives (language’s hits) or fricatives (language’s slides).

Languages do, indeed, often have sonorants that vary during their utterance. Although vowels like those in “sit” and in “set” are nonvarying, some vowels do vary, like those in “skate” and “dive.” When one says “skate,” for example, notice how the vowel sound requires your mouth to vary its shape, thereby dynamically modulating its timbre (in particular, modulating something called the formant structure, where formants are the bands of frequencies emanating from a sonorant). Vowel sounds like these are called diphthongs. Furthermore, sonorant consonants like l, r, y, w, and m demand ring changes. For example, when you say “yet,” notice how during the “y” your mouth dynamically varies its shape. These sonorants incorporate timbre changes. Recall that rings in nature also can change in pitch due to the Doppler effect. Do we find something like the Doppler shift in sonorant phonemes? Yes, in fact, in the many tonal languages of the world (such as Chinese), where vowels may be distinguished from one another only by virtue of how they dynamically vary their pitch during their utterance.

Whereas sonorants are commonly wiggly, effectively making more than one ringing sound during their utterance, no language possesses phonemes having in them more than one hit sound. It is possible in principle to have a single phoneme that sounds like two hits in very quick succession—for example, the “ct” in “ectoplasm”—but while we can make such sounds, and they even occur in language, they are never given building-block, or phoneme, status.

Are language’s slides like nature’s slides in being non-wiggly? First, let’s be clear on what it would even mean to have a fricative that varies dynamically as it is spoken. Try saying the sound “fs.” That is, begin with an “f” sound, and then slowly morph it to become “s” at the end. You make this sound when, for example, you say “puffs.” Languages could, in principle, have fricative phonemes that sound like “fs.” That is, languages could possess a single phoneme that has this complex dynamic fricative sound, just as languages possess single sonorant phonemes that are dynamic. One does not, however, find phonemes like this among human languages.

Nature’s rings are wiggly but hits and slides are not, and culture has given us language with the same wiggles: language commonly has sonorant phonemes that dynamically vary, but does not have plosive or fricative phonemes that dynamically vary. Our auditory systems are happy with dynamic rings, but not with dynamic hits or slides, and culture has given us speech that conforms to these tastes.

In addition to looking at dynamic changes within phonemes, we can make similar observations at the level of how phonemes combine into words: languages commonly have words with multiple sonorants in a row, but more rarely have multiple plosives or multiple fricatives in a row. For example, consider the following English words, which I found by perusing the second paragraph of this chapter: “harrowing” possesses six sonorants in a row (a, rr, o, w, i, and ng, the latter of which is a nasal sonorant), “village” has three in a row, “generation” has five in a row, and “eventually” has four in a row. One can find adjacent plosives, like in “packed” (“kt”) and “grabbed” (“bd”), and one can find adjacent fricatives like in “puffs” (“fs”), “gives” (“vz”), and “isthmus” (“sth”), but finding more than two in a row is difficult, and five or six in a row is practically impossible.

We now know how, and how much, each of the three kinds of “event atoms” can vary in sound while they are occurring. We have not, however, considered whether an event of one of these three kinds can ever dynamically change into another kind of event. Could some simple event pairs be so common that we are likely to possess special auditory mechanisms for their recognition, mechanisms language harnesses? We turn to this question next, and uncover a kind of event sufficiently fundamental in physics that it is also found as a fourth kind of phoneme in language.

Nature’s Other Phoneme

I have been treating hits and slides as two different kinds of physical interaction. But slides are more complex than hits. This is because slides consist of very large numbers of very low-energy hits. For example, if you rub your fingernail on this piece of paper, it will be making countless tiny collisions at the microscopic level. Or, if you close this book and run your fingernail over the edges of the pages of the book, the result will be a slide with one little hit for each page of the book. But it would not be sensible to conclude, on this basis, that there are just two fundamental natural building blocks for events—hits and rings—because describing a slide in terms of hits could require a million hits! We still want to recognize slides as one of nature’s phonemes, because slides are a kind of supersequence of little hits that is qualitatively unlike the hits produced when objects simply collide.

But there are implications to the fact that slides are built from very many hits, but not vice versa: that fact opens up the possibility of a fundamental event type that is not quite a hit, and not quite a slide. To understand this new event type, let’s look at a slide at the level of its million underlying hits. Imagine that the first of these million hits is appreciably more energetic than the others. If this were the case, then the start of the slide would acquire a crispness normally found in hits. But this hit would be just the first of a long sequence of hits, and would thus be part of the slide itself. Such a hit-slide would, if it existed, be neither a hit nor a slide.

Hit-slides do exist, for two converging reasons. First, slides have a tendency to be initiated by hits. Try sliding this book on a desk. The first time you tried, you may have bumped your hand into the book in the process of attempting to make it slide. That is, you may have hit the book prior to the slide (see Figure 5). It requires careful attention to gently touch the book without hitting it first. Now grab hold of the book and try to slide it without an initial hit. Even in this case there can often be an initial hitlike event. This is because in order to slide an object, you must overcome static friction, the “sticky” friction preventing the initiation of a slide. This initial push is hitlike because the sudden overcoming of static friction creates a sudden burst of many frequencies, as in Figure 4a. Slides, then, often begin with a hit. Second, hits often have slides following them. If you hit a wall with a straight jab, you will get a lone hit, with no follow-up slide. But if you move your arm horizontally next to the wall as you are hitting it, in order to give it a more glancing blow, there will sometimes be a small skid, or slide, after the initial hit.

Figure 5. A hit-slide is a fourth fundamental constituent of physical events. It sounds like a kind of phoneme in language called the affricate, which is like a plosive followed by a fricative.

Although a hit followed by a slide is a natural regularity in the world, a slide followed by a hit is not a natural physical regularity. First, it is common to have a hit not preceded by a slide. To see this, just hit something. Odds are you managed to make a hit without a slide first. Second, when there is a slide, there is no physical regularity tending to lead to a hit. Slides followed by hits are possible, of course—in shuffleboard, for example (and note the fricatives in “shuffle”)—but they really are two separate events in succession. A hit-slide, on the other hand, can effectively be a single event, as we discussed a moment ago.

If language sounds like nature, then we should expect linguistic hit-slide sounds to be more common than slide-hit sounds. Later in this chapter—in the section titled “Nature’s Words”—I will provide evidence that this is true of the way phonemes combine into words across human languages. But in this section I want to focus on the single-phoneme level. The question is, since hit-slides are a special kind of fundamental event atom, but slide-hits are not, do we find that languages have phonemes that sound like hit-slides, but not phonemes that sound like slide-hits?

Languages, like nature, are asymmetrical in this way. There is a kind of phoneme found in many languages called an affricate, which is a fricative that begins as a plosive. One example in English is “ch,” which is a single phoneme that possesses a “t” sound followed by a “sh” sound. In addition to words like “chair,” it also occurs in words like “congratulate” (spoken like “congratchulate”), and often in words like “trash” (spoken like “chrash”). Another example is “j,” which begins with a “d” sound followed by a voiced version of the “sh” phoneme. Although we can describe “ch” as a “sh” initiated by a “t,” it is not the same sound that occurs when we say “t” and quickly follow it up with “sh.” The “ch” phoneme has the “t” and “sh” sounds bound up so closely to one another that they sound like a single atomic event. The “tsh” sound in “hotshot,” on the other hand, will typically sound different from “ch”; that is, we do not pronounce the word “ha-chot.”

Language has incorporated nature’s hit-slide as one of its phoneme types. Slide-hits, on the other hand, are not one of nature’s phonemes, and a harnessing language is not expected to have phonemes that sound like slide-hits. Indeed, that is the case. Languages do not have the symmetric counterpart to affricates, namely phonemes that sound like a plosive initiated by a fricative. It is not that we can’t make such sounds: “st” is a standard sound combination in English of this slide-hit form, but it is not a single phoneme. Other cases would be the sounds “fk” and “shp,” which occur as pairs of phonemes in words in some languages, but not as phonemes themselves.

By examining physics in greater detail, in this section we have realized that there is a fourth fundamental building block of events: hit-slides. And just as languages have honored the other three fundamental event atoms as their principal phoneme types, this fourth natural event atom is also so honored. Furthermore, the symmetrical fifth case, slide-hits, is not a fundamental event type in nature, and we thus expect—if harnessing has occurred—not to find fricative-plosives as language phonemes. And indeed, we don’t find them.
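The correspondence built up so far can be written down as a small lookup table: three single event atoms plus the one two-part sequence that earns phoneme status. This is only a sketch of the chapter's claim, not a linguistic tool. The names `PHONEME_CLASS` and `phoneme_class` are made up for illustration.

```python
# Event sequences that count as a single "event atom" in this chapter,
# mapped to the phoneme type said to mimic them.
PHONEME_CLASS = {
    ("hit",): "plosive",
    ("slide",): "fricative",
    ("ring",): "sonorant",
    ("hit", "slide"): "affricate",
}


def phoneme_class(events):
    """Return the phoneme type for an event sequence, or None if the
    sequence is not treated as a single phoneme."""
    return PHONEME_CLASS.get(tuple(events))


print(phoneme_class(["hit", "slide"]))  # like "ch": a single phoneme
print(phoneme_class(["slide", "hit"]))  # like "st": two phonemes, so no entry
```

The asymmetry shows up as a missing key: there is an entry for a hit followed by a slide, but none for a slide followed by a hit.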

Slides that Sing

Recall that slides are, in essence, built from very many little hits in quick succession. The pattern of hits occurring inside a slide depends on the nature of the materials sliding together, and this pattern is what determines the nature of the slide’s sound. If you scrape your pencil on paper, then because the paper’s microscopic structure is fairly random, the sound resulting from the many little hits is a bit “noisy,” or like radio static, in having no particular tone to it. (The pencil scraping may also cause some ringing in the table or the pencil, but at the moment I want you to concentrate only on the sound emanating from the slide itself.)

However, now unzip your pants. You just made another slide. Unlike a pencil on paper, however, the zipper’s regularly spaced ribs create a slide sound that has a tonality to it. And the faster you unzip it, the higher the pitch of the zip. Slides can sing. That is, slides can have a ringlike quality to them, due not to the periodic vibrations of the objects, but to the periodicity in the many tiny hits that make up a slide.
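Why a faster zip sounds higher can be seen with one division: if the ribs are evenly spaced, the number of little hits per second is the slide speed divided by the rib spacing. The sketch below assumes a rib spacing of 2 mm and a few hand speeds. Both are hypothetical figures chosen for illustration, and the calculation gives only the basic repetition rate, not the full sound.

```python
def slide_pitch_hz(slide_speed_m_per_s, rib_spacing_m):
    """Hits per second for a slide over evenly spaced ribs."""
    if rib_spacing_m <= 0:
        raise ValueError("rib spacing must be positive")
    return slide_speed_m_per_s / rib_spacing_m


RIB_SPACING = 0.002  # 2 mm between zipper ribs (hypothetical)

for speed in (0.25, 0.5, 1.0):  # hand speeds in m/s (hypothetical)
    print(f"{speed} m/s -> {slide_pitch_hz(speed, RIB_SPACING):.0f} Hz")
```

Doubling the speed doubles the hit rate, which raises the pitch by an octave.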

Whether or not a slide sings depends on the nature of the materials involved, and that’s why the voice of a slide is an auditory feature that brains have evolved to take notice of: our brains treat singing and hissing slides as fundamentally different because these differences in slide sounds are informative as to the identity of the objects involved in the slides. Although slides can sing, it is more common that they don’t, because texture with periodicity capable of a ringlike sound is rare, compared to random texture that leads to generic friction sounds akin to white noise.

Do human languages treat singing slide sounds as different from otherwise similar nonsinging slide sounds? Yes. Languages have fricatives of both the singing and the hissing kinds, called the voiced and unvoiced fricatives, respectively. Voiced fricatives include “z,” “v,” “th” as in “the,” and the sound after the beginning of “j” (which you will recall is an affricate, discussed earlier in “Nature’s Other Phoneme”). Unvoiced fricatives include “s,” “f,” “th” as in “thick,” and “sh.” Just as singing slides will be rarer than nonsinging slides—because the former require special circumstances, namely, slides built out of many periodically repeating hits—voiced fricatives are rarer in languages than unvoiced fricatives. John L. Locke tabulated data in his excellent 1983 book, Phonological Acquisition and Change, and discovered that “s” is found in 172 of 197 languages in the Stanford Handbook[1] (87 percent) and in 102 of 317 languages in the UCLA Phonological Segment Inventory Database (32 percent), whereas “z” (the voiced version of “s”) is found in 77 of 197 languages (39 percent) and 36 of 317 languages (11 percent), respectively. Similarly, “f” is found in 106 of 197 languages (54 percent) and in 135 of 317 languages (43 percent), whereas “v” is found in 61 of 197 languages (31 percent) and in 67 of 317 languages (21 percent), respectively. These data suggest that unvoiced slides are about twice as likely as voiced slides to be found in a language. (And notice how, in English at least, one finds voiced-fricative words with meanings related to slides that sing: rev, vroom, buzz, zoom, and fizz. One also finds unvoiced-fricative words with meanings related to unsung slides: slash, slice, and hiss.)
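The "about twice as likely" summary can be checked directly from the counts quoted above. The sketch below copies those counts and divides the share of languages with each unvoiced fricative by the share with its voiced partner. The variable and function names are made up for illustration.

```python
# (languages containing the phoneme, languages surveyed), as quoted in the text
COUNTS = {
    "Stanford Handbook": {"s": (172, 197), "z": (77, 197), "f": (106, 197), "v": (61, 197)},
    "UCLA database": {"s": (102, 317), "z": (36, 317), "f": (135, 317), "v": (67, 317)},
}
PAIRS = [("s", "z"), ("f", "v")]  # (unvoiced, voiced)


def share(found, surveyed):
    """Fraction of surveyed languages that contain the phoneme."""
    return found / surveyed


def unvoiced_to_voiced_ratios(counts=COUNTS, pairs=PAIRS):
    """How many times more common each unvoiced fricative is than its voiced partner."""
    ratios = {}
    for survey, table in counts.items():
        for unvoiced, voiced in pairs:
            ratios[(survey, unvoiced, voiced)] = share(*table[unvoiced]) / share(*table[voiced])
    return ratios


for (survey, unvoiced, voiced), ratio in unvoiced_to_voiced_ratios().items():
    print(f"{survey}: {unvoiced}/{voiced} = {ratio:.2f}")
```

The four ratios fall between about 1.7 and 2.8, consistent with the rough factor of two.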

Voiced and unvoiced fricatives are found in languages because they’re found in the physics of slides. Hits can also be voiced or unvoiced, but for completely different physical reasons than slides. Zip up your pants and let’s get to this.

Two-Hit Wonder

Each day, more than a billion people wake to the sound of a ringing alarm, reach over, and hit the alarm clock, thereby terminating the ring and giving themselves another five minutes of sleep. In these billion cases a hit appears to stop a ring, rather than starting one as we talked about earlier. Of course, the hit on the clock does cause periodic vibrations of the clock (and of the sleeper’s hand), but the sound of these vibrations is likely drowned out by the sound of the alarm still ringing in one’s ears.

Although hitting the snooze button of an alarm clock is not a genuine case of a hit stopping a ring, there are such genuine cases. Imagine a large bell that has been struck and is ringing. If you now suddenly place your hand on it, and keep it there, the ringing will suddenly stop. Such a sudden hand placement amounts to a hit—a hit that sticks its landing. And it is, in this case, by virtue of dampening, a hit that leads to the termination of a ring. Some dampening will occur even if your hand doesn’t stick the landing, so long as you hit the bell much less energetically than it is currently ringing; the temporary contact will “smother” some of the periodic vibrations occurring in the bell.

Although in such cases it can sound as if the bell’s ringing has terminated, in reality one can leave the bell with a residual ring. A hit on a quiet bell would sound like an explosive hit, because in contrast to the bell’s stillness, the hit is a sudden discontinuous rise in the ringing magnitude. But that same hit on an already very loudly ringing bell causes a sudden discontinuous drop in the ringing magnitude. In contrast to the loud ringing before the hit, the hit will sound like the sudden ceasing of a ring, even if there is residual ringing.

Hits, therefore, have two voices, not just the one we discussed earlier in the section called “Nature’s Phonemes.” Hits not only can create the sudden appearance of a wide range of frequencies, but can also sometimes quite suddenly dampen out a wide range of frequencies. These two sounds of hits are, in a sense, opposites, and yet both are possible consequences of one and the same kind of hit. This second voice is rarer, however, because it depends on there already being a higher-energy ring before the hit, which is uncommon because rings typically decay quickly. That is, the explosive voice of hits is more common than the dampening voice, because most objects are not already ringing when they are hit.

If languages have harnessed our brain’s competencies for natural events, then we might expect languages to utilize both of these hit sounds. And indeed they do. The plosives we discussed earlier consisted of an explosive release of air, after having momentarily stopped the airflow and let pressure build. But plosives also occur when the air is momentarily stopped, but not released. This happens most commonly when plosives are at the ends of words. For example, when you utter “what” in the sentence “What book is this?” your mouth goes to the anatomical position for a “t,” but does not ever release the “t” (unless, say, you are angry and slowly enunciating the sentence). Such instances of plosive stop sounds are quite common in language, but less so than released plosive sounds—there are many languages that do not allow unreleased plosives, but none that do not allow released plosives. John Locke tabulated from the Stanford Handbook that, in 32 languages that possessed word-position information, no plosives were off limits at word starts (where they would be released), but 79 plosives were impermissible at word-final position (where they are typically unreleased). Also, among the words we collected from 18 languages, 16,130 of a total of 18,927 plosives, or 85 percent, were directly followed by a sonorant (and thus were released), and therefore only 2,797 plosives, or 15 percent, were unreleased. And even in languages (like English) that allow both kinds of plosive sounds, plosives are more commonly employed in their explosive form, something we will talk about in a later section (“In the Beginning”). This fits with the pattern in nature, where explosive hits are more common than dampening hits.
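The 85 and 15 percent figures for the 18-language word sample follow from the two counts given above. Here is a short check of that arithmetic. It adopts the text's own simplification that a plosive directly followed by a sonorant counts as released and any other plosive as unreleased, and the function name is made up for illustration.

```python
TOTAL_PLOSIVES = 18_927        # plosives in the 18-language word sample
FOLLOWED_BY_SONORANT = 16_130  # counted in the text as released


def released_unreleased_split(total, released):
    """Return (unreleased count, released share, unreleased share)."""
    unreleased = total - released
    return unreleased, released / total, unreleased / total


unreleased, released_share, unreleased_share = released_unreleased_split(
    TOTAL_PLOSIVES, FOLLOWED_BY_SONORANT
)
print(f"unreleased: {unreleased}")
print(f"released share: {released_share:.0%}, unreleased share: {unreleased_share:.0%}")
```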

Not only does language have both hit sounds as part of its repertoire, but, like nature, it treats the unreleased “t” sound and the released “t” sound as the same phoneme. This is remarkable, because they are temporal opposites: one is like a little explosion, the other like a little antiexplosion. One can imagine, as a thought experiment, that people could have ended up with a language that treats these two distinct “t” sounds as two distinct phonemes, rather than two instances of a single one. In light of the auditory structure of nature, however, it is not at all mysterious: any given hit can have two very different sounds, and language carves at nature’s joints.

In light of the two sounds hits make, there is a simple kind of sound we can make, but that language never includes as a phoneme: “beep,” like an electronic beep or like Road Runner. A beep consists of a sudden start of a tone, and then a sudden stop. Beeps might, at first glance, seem to be a candidate for a fundamental constituent of communicating by sound: what could be simpler, or more “raw,” than a beep? However, although our first intuitions tell us that beeps are simple, in physics they are not. In the real world of physical events among objects, beeps can only happen when there is a hit (the abrupt start to the beep), a ring that follows (the beep’s tone), and a second hit, this one a dampening one (the abrupt beep ending). A “simple” beep can’t happen in everyday physics unless three simple constituent events occur. And we find that in languages as well: there are no beeplike phonemes. To make a beep sound in language requires one to first say a plosive of the released kind, then a (nonwiggly) sonorant, and finally an unreleased plosive . . . just like when we say the word “beep.”

Hesitant Hits

Bouncing a basketball could hardly be a simpler event. A bounce is just a hit, followed by a ring. And as we discussed earlier, the sound is a sudden explosion of many frequencies at the initiation of the hit, followed by a more tonal sound with a timbre due to the periodic vibrations of the basketball and floor. Although hits seem simple, they become complicated when viewed in super slow motion. After the ball first touches the ground, the ball begins to compress, a bit like a spring. After compression, the ball then decompresses as it rises on its upward bounce. Although these ball compressions and decompressions are typically very fast, they are not instantaneous: the physical changes that occur during a hit occur over an extended period of time, albeit short. What happens during this short period of time depends on the nature of the objects involved.

One of the most important acoustical observations about collisions is that ringing doesn’t tend to occur until the collision is entirely finished. There are several reasons why this is so. First, the ground rings less during the collision because even though the ground has already been struck, the ball’s contact with the ground dampens the ground’s vibrations. Similarly, the ground’s contact with the ball dampens the ball’s vibrations. Second, during the ball’s compression, its shape is continually varying, and so any vibrations it is undergoing are changing in their timbre and pitch very quickly, far more quickly than the ring-wiggles we discussed earlier. In fact, the vibration changes occur at a time scale so short that any rings that do occur during the collision will not sound like rings at all. Third, during the period of the collision when the ball is not yet at maximum compression, the ball is continually hitting new parts of the ground. This is because, as the ball compresses, the ball’s footprint on the ground keeps enlarging, which means that new parts of the ball continually come into contact with the ground. In fact, even if the surface area of contact never enlarges, the mass in parts of the ball continues to descend during the ball’s compression, providing further impetus upon the surface area of contact. Because the compression period is filled with many little hits, any ringing occurring during compression will have a tendency to be drowned out by the little hits.

For several converging reasons, then, the ringing that occurs after a hit doesn’t tend to begin until the compressions and decompressions are over. For the basketball, the ringing occurs most vigorously when the ball rebounds back into the air. There is a simple lesson from these super slow motion observations: there is often a gap between the time of the start of a collision and the start of the ringing.

What determines the length of these hit-to-ring gaps? When your basketball is blown up fully, and the ground is firm, then the time duration of the contact with the floor is very short, and so the gap between the start of the hit and the ring is very small. However, when the ball is fairly flat—low in pressure—it spends more time interacting with the ground. Bouncing a ball on soft dirt would also lead to more ground time (see Figure 6). Figure 7a shows the sound waveform signal of a book falling onto a crumpled piece of paper—producing similar acoustics (in the relevant respects here) to those of a ball dropped on soft dirt—and one can see a hit-to-ring gap that is larger than that for the same book falling directly onto the table. For the flat basketball, then, the gap between hit and ring is larger than that for the properly blown-up ball.

Figure 6. (a) A rigid hit (i.e., involving rigid objects) rebounds—and rings—with little delay after the initial collision. (b) A nonrigid hit takes some time before rebounding and ringing. These physical distinctions are similar to the voiced and unvoiced plosives.

The key difference between the high-pressure ball and the flat ball—and the difference between the book falling on a solid desk versus crumpled paper—is that the former is more rigid than the latter. The more rigid the objects in a collision, the shorter the compression period, and the shorter the gap between the initial hit and the ring. The high-pressure ball is not only more rigid than the flat ball, but also more elastic. More elastic objects regain their original shape and kinetic energy after decompression, lose less energy to heat during compression, and tend to have shorter gaps. Also, if an object breaks, cracks, or fractures as it hits—a kind of nonrigidity and inelasticity—the gap is longer.

Therefore, although some hits ring with effectively no delay, other kinds of hits take their time before ringing. Hits can be hesitant, and the delay between hit and ring is highly informative because it tells us about the rigidities of the objects involved. Our auditory systems understand this information very well: they have been designed by evolution to possess mechanisms for sensing this gap and thus for perceiving the rigidity of the objects involved in events.

Because our auditory systems are evolutionarily primed to notice these hit-to-ring delays, we expect that languages should have come to harness this capability, so that plosives may be distinguished on the basis of such hit-to-ring delays. That is, we would expect that plosive phonemes will have as part of their identity a characteristic gap between the initial explosive sound and the subsequent sonorant. Language does, indeed, pay homage to the hit-ring gaps in nature, in the form of voiced and unvoiced plosives. Voiced plosives are like “b,” “g,” and “d,” and in these cases the sonorant sound following them occurs with negligible delay (Figure 7b, left). They even sound bouncy—“boing,” “bob,” and “bounce”—like a properly inflated basketball. Unvoiced plosives are like “p,” “k,” and “t,” and in these cases there is a significant delay after the plosive and before the sonorant sound begins, a delay called the voice onset time (Figure 7b, right). (Try saying “pa,” and listen for when your voice kicks in.) In English we have short voice onset times and long ones, corresponding to voiced and unvoiced plosives, respectively. Some other languages have plosives with voice onset times in between those found in English.

Figure 7. Illustration that voiced plosives are like rigid, elastic hits, and unvoiced plosives like nonrigid, inelastic hits. These plots show the amplitude of the sound on the y-axis, and time on the x-axis. (a) The sound made by a stiff hardcover book landing on my wooden desk on the left, followed by the sound of that same book landing on my desk, but where a wrinkly piece of paper cushioned the landing (making it less rigid and less elastic). (b) Me saying “bee” and “pee.” Notice that in the inelastic book-drop and the unvoiced plosive cases—i.e., the right in (a) and (b)—there is a delay after the initial collision before the ringing begins.

Not only do languages utilize a wide variety of voice onset times—hit-to-ring gaps—for plosive phonemes, but one does not find plosive phonemes that don’t care about the length of the gap. One could imagine that, just as the intensity of a spoken plosive doesn’t change the identity of the plosive, the voice onset time after a plosive might not matter to the identity of a plosive. But what we find is that it always does matter. And that’s because the intensity of a hit in nature is not informative about the objects involved, but the gap from hit to ring is informative (as is the timbre). That’s why the gap from hit to ring is harnessed in language. And that’s why, as we saw earlier, the distinct plosive sounds at the start and end of words are treated as the same, despite being acoustically more different than are voiced and unvoiced plosives (like “b” and “p”).

In light of the ecological meaning of voiced versus unvoiced plosives, consider the following two letters from a mystery language: ◆ and ✴. Each stands for a plosive, but one is voiced and the other unvoiced. Which is which? Most people guess that ◆ is voiced, and that ✴ is unvoiced. Why? My speculation is that it is because ◆ looks rigid, and would tend to be involved in hits that are voiced (i.e., a short gap from hit to ring), whereas ✴ looks more kinked, and thus would be likely to have a more complex collision, one that is unvoiced (i.e., a long gap between hit and ring). My “mystery language” is fictional, but could it be that more rigid-looking letters across real human writing systems have a tendency to be voiced, and more kinked-looking letters have a tendency to be unvoiced? It is typically assumed that the shapes of letters are completely arbitrary, and have no connection to the sounds of speech they stand for, but could it be that there are connections because objects with certain shapes tend to make certain sounds? This is the question Kyle McDonald—a graduate student at Rensselaer Polytechnic Institute (RPI) working with me—raised and set out to investigate. He found that letters having junctions with more contours emanating from them—i.e., the more kinked letters—have a greater probability of being unvoiced. For example, in English the three voiced plosives are “b,” “d,” and “g,” and their unvoiced counterparts are “p,” “t,” and “k.” Notice how the unvoiced letters—the “t” and “k,” in particular—have more complex structures than the voiced ones. Kyle McDonald’s data—currently unpublished—show that this is a weak but significant tendency across writing systems generally.

Rigid Muffler

As I walk along my upstairs hallway, I accidentally bump the hammer I’m carrying into the antique gong we have, for some inexplicable reason, hung outside the bedroom of our sleeping infant. I need to muffle it, quickly! I have one bare hand, and the other wielding the guilty hammer; what do I do? It’s obvious. I should use my bare hand, not the hammer, to muffle the gong. Whereas my hand will dampen out the gong ring quickly, the hammer couldn’t be worse as a dampener. My hand serves as a good gong-muffler because it is fleshy and nonrigid. My hand muffles the gong faster than the rigid hammer, yet recall from the previous section that nonrigid objects cause explosive hits with long hit-to-ring gaps. Nonrigid hits create rings with a delay, and yet diminish rings without delay. And, similarly, rigid hits create rings without delay, but are slow dampeners of rings.

These gong observations are crucial for understanding what happens to voiced and unvoiced plosives when they are not released (i.e., when the air in the mouth and lungs is not allowed to burst out, creating the explosive hit sound), which often occurs at word endings (as discussed in the section titled “Two-Hit Wonder”). When a plosive is not released, there clearly cannot be a hit-to-ring gap—because it never rings. So how do voiced and unvoiced plosives retain their voiced-versus-unvoiced distinction at word endings? For example, consider the word “bad.” How do we know it is a “d” and not a “t” at the end, given that it is unreleased, and thus there is no hit-to-ring delay characterizing it as a “d” and not a “t”?

My gong story makes a prediction in this regard. If voiced plosives really have their foundation in rigid objects (mimicking rigidity’s imperceptibly tiny hit-to-ring gap at a word’s beginning), then, because rigid objects are poor mufflers, the sonorant preceding an unreleased voiced plosive at a word ending should last longer than the sonorant preceding an unreleased unvoiced plosive at a word ending. For example, the vowel sound in “bad” should last longer than in the word “bat.” The nonrigid “t” at the end of the latter should muffle it quickly. Are words like “bad” spoken with vowels that ring longer than in words like “bat”?

Yes. Say “bad” and “bat.” The main difference is not whether the final plosive is voiced—neither is, because neither is ever released, and thus neither ever gets to ring. Notice how when you say “bad,” the “a” gets more drawn out, lasting longer, than the “a” sound in “bat.” Most nonlinguist readers may never have noticed that the principal distinguishing feature of voiced and unvoiced plosives at word endings is not whether they are voiced at all. It is a seemingly unrelated feature: how long the preceding vowel lasts. But, as we see from the physics of events, a longer-lasting ring before a dampening hit is the signature of a rigid object’s bouncy hit, and so there is a fundamental ecological order to the seemingly arbitrary linguistic phonological regularity. (See Figure 8.)

Figure 8. Matrix illustrating the tight match between the qualities of hits (not in parentheses) and plosives (within parentheses). For hits, the columns distinguish between rigid and nonrigid hits, and the rows distinguish between hits that initiate rings and hits that muffle rings. Inside the matrix are short descriptions of the auditory signature of the four kinds of hits. For plosives, the columns distinguish the analogs of rigid and nonrigid hits, which are, respectively, voiced and unvoiced plosives; the rows distinguish the analogs of ring-initiating and ring-muffling hits, which are, respectively, released and unreleased plosives. Together, this means four kinds of hits, and four expected kinds of plosives, matching the signature features of the respective hits. If the meaning of voiced versus unvoiced concerns rigid versus nonrigid objects, then we expect that plosives at word starts should have little versus a lot of voice-onset time, respectively, for voiced and unvoiced. And we expect that for plosives at word endings the voiced ones should reveal themselves via a longer preceding sonorant (slow to damp) whereas unvoiced should reveal themselves via a shorter preceding sonorant (fast to damp). Plosives do, in fact, modulate across this matrix as predicted from the ecological regularities of rigid and nonrigid hits at ring-inceptions and ring-dampenings.

Over the last half dozen sections of this chapter we have analyzed the constituents—the hits, slides, and rings—of events and language. Hits, slides, and rings may be the fundamental building blocks for human speech, but that alone doesn’t make speech sound natural. Just as natural contours can be combined in unnatural ways for vision, natural sound atoms can be combined unnaturally for audition. Language will not effectively harness our auditory system if speech combines plosives, fricatives, and sonorants in unnatural ways, like “yowoweelor” or “ptskf.” To find out whether speech sounds like nature, we need to understand how nature’s phonemes combine, and then see if language combines in the same way. For the rest of this chapter, we will look at successively larger combinations of sounds. But we turn first to the simplest combination.

Nature’s Syllables

My friend’s boy made a video of himself solving a Rubik’s Cube blindfolded, and then posted it on the Web. As I watched him put the blindfold on, pick up the cube, and begin twisting, I noticed something strange about the sound, but I couldn’t put my finger on what was unusual. Later, when I commented to my friend how his bright boy must owe it to inheritance, he replied, “Indeed, the apple doesn’t fall far from the tree. He faked it. The movie was in reverse.”

The world does not sound the same when run backward. What had raised my antennae when watching the Rubik’s Cube video was the unusual sounds that occur when one hears events in reverse. One of the first strange sounds occurred when he picked up the cube at the start of the video. Knowing now that it was shown in reverse, what appeared in the video to be him picking up the cube to begin unscrambling it was actually him setting the cube down after having scrambled it. Setting the cube down caused a hit and a ring, but in reverse what one hears is a ring coming out of nowhere, and ending with a sudden ring-stopping hit (the second voice of a hit, as discussed earlier in the section titled “Two-Hit Wonder”). That just doesn’t happen much in nature. When nature comes to the door, it knocks before ringing, not the other way around. Rings don’t start events. Rings are due to the periodic vibrations of objects, and objects do not typically ring without first being in physical contact with another object. Rings therefore do not typically occur without a hit or slide occurring first.

Hits, slides, and rings may be the principal fundamental building blocks for events, but rings are a different animal than hits and slides. Hits and slides involve objects in motion, physically interacting with other objects. Hits and slides are the backbone of the causal chain in an event. Rings, on the other hand, occur as a result of hits or slides, but don’t themselves cause more events. Rings are free riders, contributing nothing to the causality. Events do not have a ring followed by another ring. That’s impossible (although a single complex, or wiggly, ring is possible, as we discussed in an earlier section). And events never have an interaction (i.e., a hit or a slide) followed directly by another interaction without an intervening ring. Sometimes a ring will be inaudible, and so there will appear to be two interactions without an intervening ring, but physically there’s always an intervening ring, because objects that are involved in a physical interaction always vibrate to some extent. Events also always end with a ring, although whether it is audible is another matter.

The most basic way in which hits, slides, and rings combine is, then, this:

Interaction—Ring

where the interaction can be either a hit or a slide. If we let c stand for a hit or a slide (because “c” can be pronounced either as a plosive, “k,” or as a fricative, “s”), and a stand for a ring (which, recall, can sometimes be wiggly), then the fundamental structure of solid-object physical events is exemplified by caca. Not acac. Not cccaccca. Not accacc. And so on. Letting b stand for hits and s for slides, events take forms such as ba, sa, baba, saba, basaba, and so on. Not ab or sba or a or bbb or ssb or assb or the like. This interaction-ring combination is perhaps the most fundamental event regularity in nature, and is perhaps the most perceptually salient. Objects percussively interact via either a hit or slide, and give off a ring. Our auditory system—and probably that of most other mammals—is designed to expect nature’s phonemes to come in this interaction-ring form.
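The ordering rule just described can be checked mechanically. What follows is a minimal sketch in Python, not part of the original analysis: the function name `is_natural_event` and the regular expression are illustrative assumptions. They encode only the rule that every hit (b) or slide (s) is immediately followed by a ring (a).

```python
import re

# Illustrative assumption: an event is written as a string over {b, s, a},
# where b = hit, s = slide, a = ring. A natural event is one or more
# interaction-ring pairs, i.e. (b or s) immediately followed by a.
_INTERACTION_RING = re.compile(r"(?:[bs]a)+")


def is_natural_event(event: str) -> bool:
    """Return True if the event consists only of interaction-ring pairs."""
    return _INTERACTION_RING.fullmatch(event) is not None


if __name__ == "__main__":
    examples = ["ba", "sa", "baba", "saba", "basaba",
                "ab", "sba", "a", "bbb", "ssb", "assb"]
    for event in examples:
        print(event, is_natural_event(event))
```

The check is deliberately strict: it describes the physical structure, in which a ring always follows an interaction, and it does not model rings that are present but inaudible.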

Given the fundamental status of interaction-ring combinations, if language harnesses the innate powers of our auditory system, then we expect language to be built out of vocalizations that sound like interaction-ring. Do languages have this feature? That is, do plosives and fricatives tend to be followed by sonorants? Yes. A plosive or fricative followed by a sonorant is, in fact, the most basic and most common phoneme combination across languages. It is the quintessential example of a syllable. Words across humankind tend to look approximately like ca, or caca, or cacaca, where c stands for a plosive or fricative, and a for one or more consecutive sonorants. All languages have syllables of this ca form. And many languages, such as Japanese, allow little beyond syllables of this form.

Whereas interaction-ring is the most fundamental natural combination of event atoms, ring-interaction is a combination that is not possible. A ring followed by an interaction sounds out of this world, as in my friend’s son’s Rubik’s Cube video. We therefore expect that languages tend to avoid combinations like ac and acac. This is, in fact, the case. The rarest syllable type is of this ac form, and words starting with a sonorant and followed by a plosive or fricative are rare. In data I collected at RPI in 2008 with the help of undergraduate student Elizabeth Counterman and graduate student Kyle McDonald, about 80 percent of our sampled words (with three or fewer non-sonorants) across 18 widely varying languages begin with a plosive or a fricative. (See the legend of Figure 9 for a list of the sampled languages.) And a large proportion of the words starting with a sonorant start with a nasal, like “m” and “n,” the least sonorant-like of the sonorant consonants (nasals at word starts can have a fairly sudden start, and are more plosive-like than other sonorant consonants).

Note that a word starting with a vowel does not start with a sonorant, because when one speaks such a word, the utterance actually begins with something called a glottal plosive, produced via the sudden hitlike release of air at one’s voice box. To illustrate the glottal plosive, slowly say “packet,” and then slowly say “pack it.” When you say the latter, there can often be a sharp beginning to the “it,” something that will never occur before the “et” sound in “packet.” That sharp beginning is the glottal plosive. Words starting with sonorants are, thus, less common than one might at first suspect. Even words like “ear,” “I,” “owe,” and “owl,” then, are cases of plosives followed by sonorants, and agree with the common hit-ring (the most common kind of interaction-ring) structure of nature.

Words truly beginning with a sonorant sound begin not with a vowel, but with a sonorant consonant like w, y, l, r, and m. When one says, “what,” “yup,” “lid,” “rip,” and “map,” the start of the word is nonsudden (or less sudden than a plosive), ramping up more gradually to the sonorant sound instead. And notice that words such as these—with a sonorant at the start and a plosive at the end—do sound like backwards sounds. Try saying the following meaningless sentence: “Rout yab rallod.” Now say this one: “Cort kabe pullod.” Although they are similar, the first of these meaningless sentences sounds more like events in reverse. This is because it has words of the ring-hit form, the signature sound of a world in reverse. The second sentence, while equally meaningless, sounds like typical speech (and event) sounds, because it starts with plosives.

Language’s most universal structure above the level of phonemes—the syllable—has its foundation, then, in physics. The interaction-rings of physical events got instilled into our auditory systems over hundreds of millions of years of vertebrate and mammalian evolution, and culture shaped language to sound like physics in order to best harness our hardware.

Before we move next to the shape of words, there is another place where syllables play a central role: in rhyme. Two words rhyme if their final syllables have the same sonorant sound, and the same plosive or fricative following the sonorant, as in “snug as a bug in a rug.” The sonorant sound is the more important of the two: “bug” rhymes better with “bud” than with “bag.” Our ecological understanding of syllables may help to make sense of the perceptual salience of rhyme. When two events share the same ring sound, it means the same kind of object is involved in both events. For example, “tell” and “sell” rhyme, and in terms of nature’s physics, they sound like two distinct events involving the same object. “Tell” might suggest that some object has been hit, and “sell” that that same object is now sliding. The “ell” in each case signals that it is the same object undergoing different events. This is just the kind of gestalt perceptual mechanism humans are well known to possess: we attempt to group stimuli into meaningful units. In vision this can lead to contours at distant corners of an image being perceptually treated as parts of one and the same object, and in audition it can lead to sounds separated in time nevertheless being grouped into the same object. That’s what happens in rhyme: the second word of a rhyming pair may occur several lines later, but our brain hears the similar ringing sound and groups it with the earlier one, because it would be likely in nature that such sounds were made by one and the same object.

In the Beginning

The Big Bang is the ultimate event, and even it illustrates the typical physical structure of events: it started with a sudden explosion, one whose ringing is still “heard” today as the background microwave radiation permeating all space. Slides didn’t make an appearance in our universe until long after the Bang. As we will see in this section, hits, slides, and rings tend to inhabit different parts of events, with hits and rings—bangs—favoring the early parts.

To get a feeling for where hits, slides, and rings occur in events, let’s take a look at a simpler event than the one that created the universe. Take a pen and throw it onto a table. What happened? The first thing that happened is that the pen hit the table; the audible event starts with a hit. Might this be a general feature of solid-object physical events? There are fundamental reasons for thinking so, something we discussed in the earlier section, “Nature’s Other Phoneme.” We concluded that whereas hits can occur without a preceding slide, slides do not tend to occur without a preceding hit. Another reason why slides do not tend to start events is that friction turns kinetic energy into heat, decreasing the chance for the slide to initiate much of an event at all. So, while hits can happen at any part of an event, they are most likely to occur at the start. And while slides can also happen anywhere in an event, they are less likely to occur near the start. Note that I am not concluding that slides are more common than hits at the nonstarts. Hits are more common than slides, no matter where one looks within solid-object physical events. I’m only saying that hits are more common at event starts than they are at nonstarts, and that slides are less common at event starts than they are at nonstarts.

Is this regularity about the kinds of interaction at the starts and nonstarts of events found in spoken language? Yes. Words of the form bas are more common than words of the form sab (where, as earlier, b stands for a plosive, s for a fricative, and a for any number of consecutive sonorants). Figure 9 shows the probability that a non-sonorant is a plosive (rather than a fricative) as one moves from the start of a word to non-sonorants further into the word. The data come from 18 widely varying languages, listed in the legend. One can see that the probability that the non-sonorant phoneme is a plosive begins high at the start of words, after which it falls, matching the pattern expected from physics. And, as anticipated, one can also see that the probability of plosives after the start is still higher than the probability of a fricative.

Figure 9. This shows how plosives are more probable at the start of words, and fall in probability after the start. The y-axis shows the plosive-to-fricative ratio, and the x-axis the ith non-sonorant in a word. The dotted line is for words with two non-sonorants, and the solid line for words with three non-sonorants. The main points are (i) that plosives are always more probable than fricatives, as seen here because the plosive-to-fricative probability ratios are always greater than 1, and (ii) that the ratio falls after the start of the word, meaning fricatives are disproportionately rare at word starts. These data come from common words (typically about a thousand) from each of the following languages: Japanese, Zulu, Malagasy, Somali, Fijian, Lango, Inuktitut, Bosnian, Spanish, Turkish, English, German, Bengali, Yucatec, Wolof, Tamil, Taino, Haya.

We just concluded that hits are disproportionately common at the starts of events in nature, and that this feature is also found in language. But we ignored rings. Where in events do rings tend to reside? In the previous section (“Nature’s Syllables”) we discussed the fact that rings do not start events, a phenomenon also reflected in language. How about after the start of a word? There would appear to be a simple answer: rings always occur after physical interactions, and so rings should appear at all spots within events, following each hit or slide.

But as we will see next, reality is more subtle.

The First Was a Doozy

While it is true that all physical interactions cause ringing, the ringing need not be audible, a point that already came up in the section called “Two-Hit Wonder.” In this light, we need to ask, where in events are the rings most audible? Consider the generic pen-on-table event again. The beginning of that event—the audible portion of it, starting when the pen hit the table—is where the greatest energy tends to be, and the ring sound after the first hit will therefore tend to be the loudest. If the pen bounces and hits the table again, the ring sound will be significantly lower in magnitude, and it will be lower still for any further bounces. Because energy tends to get dissipated during the course of an event, rings have a tendency to be louder earlier in the event than later in the event. This is a tendency, but it is not always the case. If energy gets added during the event, ring magnitude can increase. For example, if your pen bounces a couple of times on the table, but then bounces off the table onto the floor, then the floor hit may well be louder than the first table hit (gravity is the energy-adding culprit). Nevertheless, in the generic or typical case, energy will dissipate over the course of a physical event, and thus ringing magnitude will tend to be reduced as an event unfolds. Therefore, the audibility of a ring tends to be higher near the start of an event; or, correspondingly, the probability is higher later in an event that a ring might not be audible.

If language is instilled with physics, we would accordingly expect that sonorant phonemes are more likely to follow a plosive or fricative near the start of a word, and are more likely to go missing near the end of a word. This is, in fact, the case. Figure 10 shows how the probability of a sonorant following a non-sonorant falls as one moves further into a word, using the same data set mentioned earlier. For example, words like “pact” are not uncommon in English, but words like “ctap” do not exist, and are rare in languages generally.

Figure 10. This shows that sonorant phonemes are more probable near the starts of words, namely just after the first non-sonorant (usually a plosive). The square data are for words having two non-sonorants, and the triangle data for words having three non-sonorants.

We have begun to get a grip on how hits, slides, and rings occur within events, but we have only considered their probability as a function of how far into the event they occur. In real events there will be complex dependencies, so that if, say, a slide occurs, it changes the probability of another slide occurring next. In the next section we’ll ask, more generally, which combinations of hits and slides are common and which are rare, and then check for the same patterns in language.

Nature’s Words

Rube Goldberg machines excel at producing very long events, all part of a single causal chain. Like most events, Rube Goldberg events are built mostly out of hits, slides, and rings. Again letting b, s, and a stand for hits, slides, and rings, Rube Goldberg events sound something like basabababasababababababasabababasa, although the chains are very often much longer than even this. If events were typically like Rube Goldberg events, then even if spoken words have many of the auditory features found in events, words would be much too short to be event-like. Events are, however, not typically Rube Goldberg-like. Events are, instead, much more typically like a pen thrown on a table, the generic event we discussed in the previous section. Pen-on-table events may consist of a hit, hit, and slide. Or possibly just a hit and a slide. Or even just a lone hit. Most events have just several physical interactions or fewer, much nearer in length to spoken words than to Rube Goldberg events.

This is what nature-harnessing expects. Spoken words across human languages are not only built out of sounds like those in solid-object physical events, but words tend to have the size of typical physical events. Words tend to sound like events with up to several interaction sounds—plosives or fricatives—not, say, ten. And although words with a single interaction sound are allowed, two or three interaction sounds are more common, again like solid-object physical events.

Words are not only approximately the size of solid-object physical events—i.e., having several interaction sounds—words also take the amount of time for a typical event. This is something I have thus far ignored. But notice that plosives, fricatives, and rings do not just have similar acoustic characteristics to hits, slides, and rings; they also occur over periods of time similar to those typical of hits, slides, and rings. For example, although I described both hits and plosives as nearly instantaneous explosions, the notion of “instantaneous” depends on the time scale relevant to the listener—what’s instantaneous to a human may not be instantaneous to a fly. Hits and plosives are both instantaneous explosions as heard by human ears. This is why plosives sound hitlike; for example, if a hitlike sound were stretched out it would, instead, sound more slidelike (something we discussed in the earlier section called “Hesitant Hits”). Similarly, fricatives and sonorants tend to occur over time scales similar to the slides and rings of physical events. Typical syllables of human speech—e.g., of the form ba or sa—tend to have a duration approximately on the order of tenths of seconds, roughly the same time scale as is common for physical events involving macroscopic objects. In fact, you’ll notice in Figure 4 earlier that the physical and linguistic analogs (e.g., a hit and “k”) are on the same scale for the time (x) axis.

Words tend to be built out of the constituents of natural solid-object physical events, and to have approximately the size and temporal duration of such events. But are words actually structured like solid-object physical events? Are the natural-sounding phonemes and syllables put together into natural-sounding words? In particular, I’m interested in asking whether the sequences of physical interactions that occur in events—the hits and slides—are similar to the sequences of plosives and fricatives in words. My students and I analyzed the “event structure” of common words across 18 languages, and for each language we measured the distribution of six event types: hit (b), slide (s), hit-hit (bb), hit-slide (bs), slide-hit (sb), and slide-slide (ss). For example, “tea” is a b, “far” is an s, and “faker” is an sb.
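To make the classification concrete, here is a rough Python sketch of how a word might be mapped to one of these event types. It is an illustration under loud assumptions, not the procedure used for the data: the names `PLOSIVES`, `FRICATIVES`, `event_structure`, and `structure_counts` are hypothetical, and the classes are approximated from English spelling, whereas a real analysis would work from phonemic transcriptions.

```python
from collections import Counter

# Crude letter-based stand-ins for phoneme classes (an assumption made
# for illustration only; spelling is not a reliable guide to phonemes).
PLOSIVES = set("pbtdkg")
FRICATIVES = set("fvszh")


def event_structure(word: str) -> str:
    """Map a word to its sequence of hits (b) and slides (s)."""
    structure = []
    for letter in word.lower():
        if letter in PLOSIVES:
            structure.append("b")
        elif letter in FRICATIVES:
            structure.append("s")
        # Any other letter is treated as a sonorant (a ring) and skipped.
    return "".join(structure)


def structure_counts(words):
    """Tally how often each event structure occurs in a list of words."""
    return Counter(event_structure(word) for word in words)


if __name__ == "__main__":
    for word in ["tea", "far", "faker"]:
        print(word, event_structure(word))
```

On the three examples from the text, this sketch gives b for “tea,” s for “far,” and sb for “faker.” It will misclassify words whose spelling hides the sound, such as those with “th,” “sh,” or a silent letter, which is why transcriptions would be needed for real data.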

Figure 11. The frequency of the structure types found in words across 18 widely diverse languages (listed in the legend of Figure 9). (Standard error bars shown. See Appendix for details.)

To estimate how common these simple event types are in nature, students Elizabeth Counterman, Kyle McDonald, and Romann Weber counted the kinds of events occurring in a wide variety of videos. In deciding upon the kinds of videos to sample, we were not especially interested in having videos of, say, the savanna. Recall our discussion in the previous chapter, where we observed that there are “hard cores” of nature likely found in most or all habitats with solid objects crashing about. In choosing twenty videos from which to enumerate solid-object physical events, we simply aimed for a variety of scenarios in which solid-object physical events occur, including cooking, children playing, family gatherings, assembly instructions, and acrobatics. Each student acquired data on the events occurring, and did so using only the visual modality (that is, the videos were on mute); this helped to deal with a worry that our auditory systems are biased by speech so that we hear speechlike structure in events (akin to seeing faces in clouds). The three observers identified an average of 650 events across the 20 videos. Figure 12 shows the average results for the videos as a dotted line, overlaid on the language data from Figure 11. One can see the close similarity in the plots. (Notice that a simple model assuming hits are more common than slides does not explain why bs occurs more often than sb in the language data.)

Figure 12. The relative frequency of simple event types in videos and in language. One can see their considerable similarity. (Standard error bars shown. See Appendix for details.)

Again, we find the signature of solid-object physical events—of nature—in spoken language! Our final story in this chapter on speech concerns the sounds of speech above the level of words: the structure of whole phrases and sentences.

Unresolved Questions

Earlier in the chapter I remarked on how audition is nature’s more terse modality, only speaking up when there’s an event. In real life, though, there can often be “event overload.” I’m sitting at an airport right now, and I just counted 30 distinct sound events occurring around me over the last 30 seconds. How can we possibly pick out the sounds that matter to us amongst all the noise? There are, in fact, auditory cues that can tell an observer whether an event is relevant to him or her. In particular, these cues can tell the observer that “an event you should pay attention to is coming.”

The most obvious such auditory cue is loudness. As a sequence of events nears me—be it footsteps, the whir of a whiffle ball, or the siren of a police car—it gets louder. Loudness is also worthy of attention because louder events can sometimes be the more energetic events. The ecological importance of loudness may underlie the role of emphasis in language, the way that more important words or sentences are sometimes spoken more loudly. That louder speech is more important speech is one of those things that is so obvious it is difficult to notice. But its analog in vision is not true—brighter parts of a scene are not the more important parts. Brightness in a scene is usually just a matter of where the sun is, and where it glares off objects. The importance of loudness modulations in speech needs explaining, and the explanation is found in the structure of nature.

In addition to loudness, events in nature have another sound quality that is even more informative: pitch (the musical, note-like quality of sound). The pitch of an event depends not on how close it is to the observer, but on the rate at which it is getting closer to the observer. To understand why, let’s imagine standing next to a passing train, the standard example used to explain the Doppler effect. The main observation is that the pitch of the train’s whistle starts high and changes to low as it passes. More specifically, note that when the train is far away and approaching, its whistle is at a fixed high pitch, that is, a pitch that is not changing. (It is actually falling, but negligibly and imperceptibly.) The pitch only begins falling audibly when the train is very close to passing you. And shortly after the train has passed you, the pitch has dropped to nearly its low point, so that from then on the pitch stays effectively constant and low. This drop in pitch would apply in any scenario where sequences of events are passing us by. It also occurs any time we are moving past noisy objects. Our auditory systems can sense pitch changes on the order of half a percent of the sound frequency, sufficient for sensing (if not consciously) the pitch changes due to our walking by a source of sound.
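The shape of this pitch curve follows from the standard Doppler formula for a moving source and a stationary listener: the heard frequency is the emitted frequency multiplied by c / (c − v), where c is the speed of sound and v is the part of the source's velocity directed toward the listener. The Python sketch below is only an illustration of that formula. The function name `observed_pitch`, the 440 Hz whistle, the 30 m/s train, and the 5 m distance from the track are all assumed values, and positions refer to where the train was when the sound was emitted.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, roughly the value in air at 20 °C


def observed_pitch(source_hz, speed, x, offset):
    """Pitch heard from a source moving along a straight track.

    x is the source's position along the track at emission time
    (negative = still approaching, positive = already passed), and
    offset is the listener's perpendicular distance from the track.
    Assumes the source is slower than sound.
    """
    distance = math.hypot(x, offset)
    if distance == 0.0:
        return source_hz  # source at the listener: no motion toward or away
    # Component of the source's velocity directed toward the listener.
    approach_speed = speed * (-x) / distance
    return source_hz * SPEED_OF_SOUND / (SPEED_OF_SOUND - approach_speed)


if __name__ == "__main__":
    # Hypothetical train: 440 Hz whistle, 30 m/s, listener 5 m from the track.
    for x in (-500, -100, -20, -5, 0, 5, 20, 100, 500):
        print(x, round(observed_pitch(440.0, 30.0, x, 5.0), 1))
```

Printing the values shows the pattern described above: the pitch sits near a constant high value while the train is far off, falls quickly within a few tens of meters of the listener, and then settles near a constant low value. Setting `offset` to 0 keeps the approach pitch exactly constant until the train arrives.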

The important conclusion of these observations is that a typical sequence of events will tend to have this signature falling pitch (unless headed directly toward you). One might speculate that this is why language has a tendency to signal the approaching end of a sentence with a falling intonation—a drop in pitch. That’s what events typically sound like in nature.

Sequences of events do not always have pitches that fall, however. Pitches can sometimes rise, but special circumstances are required. First, let’s consider what happens if you stand on the railroad tracks rather than beside them. Now the pitch of the train stays the same, right up to the moment that it hits you. Of course, at the instant it hits you, the sound you would be hearing if you were conscious abruptly drops to a lower pitch (because it passes you in a single brain-crushingly short instant), and stays at that pitch as the train moves away. A constant pitch accompanied by increasing loudness is the signature of an impending collision. That same loudness increase, but with a pitch decrease, signals a near miss.

What could make a pitch increase? Considering the train again, imagine first standing beside the tracks as it approaches, but then walking onto the tracks before it gets there. Because you have moved to a position more directly in the train’s path of motion, the frequency your ears receive from the train will increase as you walk onto the tracks. Alternatively, the pitch would also increase if you stayed off to the side, but the train jumped the tracks and headed toward you at the last moment. A pitch increase is the signature of a sequence of events that is changing its direction in your direction. This is true not only when an approaching sequence of events veers toward you, but also when a receding sequence of events veers so as to begin turning around, perhaps to come back and get you after a miss. An increase in the pitch is, in a sense, more important than loudness. An event might be loud and getting louder, but if its pitch is decreasing, it is not going to hit you. But if an event is not so loud, but has a pitch that is increasing, that means it is aiming itself more toward you (or you are aiming more toward it).

A rising pitch suggests, then, that the sequence of events is not finished. Events are coming your way. Or, if the sequence of events is moving away from you, then a rising pitch means it is beginning to turn around. This unresolved nature of rising pitches may be the reason why rising pitches in many languages tend to indicate a question. The spoken sentence, “Is that the elephant that stepped on your car?” is a request for further speech. And what better way to sound unresolved than to mimic the sound of nature’s unresolved events?

This is a natural lead-in to the rest of the book, which deals with the origins of music, where loudness and pitch are even more crucial. We will see that “unresolved” pitch even tends to get resolved in melody.

Summary Table

In our modern lives we hear hits, slides, and rings all around us, and we also hear the sounds of speech. They mean fundamentally different things to us, and so our brains quickly learn to treat them differently. Our brains can treat them differently because, despite the many similarities between solid-object physical event sounds and speech sounds that I have pointed to throughout this chapter, there are ample auditory cues distinguishing them (e.g., the timbre of a voice is fundamentally different from the timbre of most solid objects). And once our brains treat these sounds as fundamentally different in their ecological meaning, it can be next to impossible to hear that there are deep similarities in how they sound. A fish struggling up onto land for the first time, however, and listening to human speech intermingled with the solid-object event sounds in the terrestrial environment, might find the similarity overwhelming. “What is wrong with these apes,” it might wonder, “that they spend so much of their day mimicking the sounds of solid-object physical events?”

In this chapter, I have tried to bring out the fish in all of us, pointing out the solid-object event sounds we make when we’re speaking, but fail to notice because of our overfamiliarity with them (and because of the similarities not holding “all the way up,” as discussed in the previous chapter). The table below summarizes the many ways in which speech sounds like solid-object physical events, with references to the earlier sections where we discussed each of them.

| Section of chapter | Solid-object physical events | Language |
| --- | --- | --- |
| 1. Mother Nature’s Voice | Physical events are best sensed by audition. | Language uses audition. |
| 2. Nature’s Phonemes | The main three event constituents are hits, slides, and rings. | The main three kinds of phoneme are plosives, fricatives, and sonorants. |
| 3. Nature’s Phonemes | Hits are more common than slides. | Plosives are more common than fricatives. |
| 4. Wiggly Rings | Rings can change in timbre and tone during their occurrence. | Sonorants can change in formants (diphthongs and sonorant consonants) and tone during utterance. |
| 5. Wiggly Rings | Hits and slides do not tend to change their sound during their occurrence. | Plosives and fricatives do not tend to change during their utterance. |
| 6. Nature’s Other Phoneme | A fourth main constituent of events is the hit-slide. But not slide-hit. | A fourth main phoneme type is the affricate. But there is no “fricative-plosive” phoneme type. |
| 7. Two-Hit Wonder | A hit between two objects can have two distinct auditory consequences. Usually it is an instantaneous explosive burst, but sometimes it is a sudden dampening. | Plosives of any kind have two forms, explosive and dampened (usually word-final). |
| 8. Slides That Sing | Slides usually occur on nonregular surfaces, but sometimes occur on surfaces with periodic regularities, leading to sound with periodicity (tonality). | Fricatives are more commonly voiceless, but are still often voiced. Whether or not a fricative is voiced is usually part of the identity of the fricative (as the surface periodicity is part of the identity of a surface). |
| 9. Hesitant Hits | Hits can vary widely in the rigidity of the objects involved, and thus vary widely in the time from first explosion to the ring. This can help to identify the objects involved. | Plosives vary in the duration of time to the following sonorant sound. This is called the voice onset time (VOT), and is part of a phoneme’s identity. |
| 10. Rigid Muffler | Rigid hits (which cause short hit-to-ring delays when initiating a ring) are poor dampeners of rings. | Voiced plosives (which have short voice onset times when released) are, when unreleased at word-endings, preceded by longer sonorant sounds. |
| 11. Nature’s Syllables | Hits and slides cause (usually audible) rings. | Plosives and fricatives tend to be followed by a sonorant. This is the basic syllable form, consonant-vowel (CV). |
| 12. In the Beginning | Hits tend to start events disproportionately more often than slides do. | Plosives tend to start words disproportionately more often than fricatives do. |
| 13. The First Was a Doozy | Rings are more audible early in an event. | Sonorants are more likely to follow a plosive or fricative near the starts of words. |
| 14. Nature’s Words | The number of interactions in an event tends to be from one to several, and the time scale of natural solid-object physical events tends to be on the order of several hundred milliseconds (with a lot of variability). | The number of plosives or fricatives in a word tends to be one to several, and the time scale of its utterance tends to be on the order of several hundred milliseconds (with a lot of variability). |
| 15. Nature’s Words | The combinations of hits and slides that occur in natural solid-object physical events have a characteristic, theoretically comprehensible pattern. | The combinations of plosives and fricatives in words of languages have the signature pattern of solid-object physical events. |
| 16. Unresolved Questions | Events with rising pitch are often due to the Doppler effect, wherein an object is veering more toward the observer; i.e., it is the signature auditory pattern of an event “headed your way.” Falling pitch means the object is directing itself less and less toward you. | Phrases with rising intonation tend to connote a question or something that is unresolved, metaphorically akin to an event suddenly being directed toward you. Phrases with falling intonation tend to connote greater resolution, metaphorically akin to an object veering away from you, which you no longer have to deal with. |

This chapter, together with the fourth chapter in The Vision Revolution, argues that our linguistic ability, for both speech and writing, may well be due to nature-harnessing, rather than to a built-in “language instinct” or to general learning. Although language is central to our modern human identity, so is art, and it is natural to wonder whether some of humankind’s artistic wonders also have their origins in nature-harnessing. The remainder of the book takes up music, arguably the pinnacle of humankind’s artistic achievement.

[1] Handbook of phonological data from a sample of the world’s languages: A report of the Stanford Phonology Archive (1979). Stanford University, Department of Linguistics.