There's an approach to cutting dialog that appears in most NLE tutorials and is even taught in some film schools:
look at a waveform
find pauses where the waveform drops to zero
and edit during the silence
This kind of editing is simple, easy to describe, and a damn shame: At best, it limits your creativity. At worst, it forces you to discard otherwise perfect takes when they have easily fixable sound problems.
The problem is that this method doesn't edit sound; it edits pictures of sound. Many audio details just don't show up on a waveform. That's why professional sound editors use waveforms as a rough guide only and mark their edits while scrubbing and listening. This method is faster and more flexible than relying on waveforms and just as precise for dialog purposes.
It's also easy to learn how to edit this way. All you need is a little ear training and an understanding of how sounds fit together. Then, you'll be able to replace individual syllables of on-camera dialog, piece together the most challenging interview, or assemble inhumanly perfect voice-overs. We'll start by stretching your ears a little. Then, we'll go over some tips for how to edit what you'll soon be able to hear.
Film and video work because your eyes can't distinguish individual stills when they're flashed quickly, so they blur into an illusion of motion. But you can identify sounds that are much faster than a single frame. Want proof? Say these two phrases aloud: "the small pot" and "the small tot". Figure 1 shows the first phrase on the upper channel and the second one on the bottom. Only about a dozen milliseconds of sound -- in the red circles -- have any significant difference. That's less than half a video frame... but you'd never confuse a cooking utensil with a child.
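The arithmetic behind that claim is easy to check. Here's a quick sketch in Python (the 29.97 fps NTSC frame rate and the 12-millisecond figure are just the numbers from the example above):

```python
# How long is one NTSC video frame, in milliseconds?
FRAME_RATE = 29.97              # NTSC frames per second
frame_ms = 1000 / FRAME_RATE

# The /p/-vs-/t/ difference in "pot" vs. "tot" lasts only about 12 ms
phoneme_ms = 12

print(f"One frame lasts {frame_ms:.1f} ms")               # → One frame lasts 33.4 ms
print(f"That phoneme is {phoneme_ms / frame_ms:.2f} of a frame")  # → 0.36 of a frame
```

So the entire difference between a pot and a tot fits comfortably inside half a frame -- far below anything you could mark by eye on a timeline.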
Figure 1:
These two phrases have very different meanings. But except for one tiny phoneme, they sound the same.
This ability to hear and understand fast sounds is already built into most of us. If you want to edit successfully, you just have to learn to analyze what you've heard. Many musicians and actors already have this talent. But even if you can't carry a tune, you can hone your ears with some fairly simple voice-and-mind exercises to improve your short-term audio memory.
Say "the small pot" aloud, listening very closely to your own voice while you do. Then, immediately afterwards, hear those three syllables again in your head. Try to remember exactly what your voice sounded like. Repeat this a few times, and you'll find that you can recall the actual sound of your voice more and more accurately. Try saying the phrase with different intonations and timings, and hear those versions in your head as well. Try saying slightly longer phrases until you can recall them also. Then try listening to very short dialog clips, and practice until you can recall them as well.
Now do one more thing: Slow down what you're hearing in your head, so you can hear the tiny changes inside each word. In the "small pot" example, you should be able to hear definite transitions between /th/, /uh/, /s/, /maw/, /l/, /p/, /ah/, and /t/. Don't give up if you aren't hearing individual sounds immediately -- this is a skill that can be difficult at first, particularly for someone who's visually-oriented. But I've found that most video editors catch on after about a dozen tries, and from then on it's easy.
We could predict exactly which tiny sounds were in "the small pot" because there aren't very many ways humans move their mouths during speech. In the entire English vocabulary there are only about four dozen of these sounds, called phonemes. Table 1 shows them organized into groups, based on how they can be useful to editors. Note that they don't correspond to the letters that spell a word. There's no phoneme for "C", because it's pronounced as /s/ or /k/, but there are 16 phonemes for the five vowels. When you're looking through a script or interview transcript to find a likely edit, it helps to say the words aloud rather than just rely on their spelling.
Friction Consonants
f (food) / v (very)
s (silly) / z (zebra)
sh (shoe) / zh (leisure)
th (thin) / TH (then)
h (hope) -- unvoiced only, no voiced partner
Nasal Consonants
m (mighty)
n (nap)
ng (lung)
Glide Consonants
w (willing)
y (yes)
l (locate)
r (rub)
Double Vowels
ay-ih (play)
i-ih (high)
aw-ih (toy)
Stop Consonants
p (paint) / b (barbell)
k (cat) / g (gut)
t (tot) / d (dot)
Double Consonants
t-sh (church) / d-zh (judge)
Vowels
ee (eat)
ae (hat)
eh (lend)
ih (sit)
aw (all)
ah (father)
o (note)
u (bull)
oo (tool)
uh (up)
er (worker)
Note there are a lot more phonemes than letters in the alphabet... and some letters don't have an equivalent phoneme. There's a more formal way of writing these sounds, but I doubt your browser has the International Phonetic Alphabet font.
People slide words together when they're speaking, so you can't always tell where one word ends and the next begins. But you can almost always hear when one phoneme slides into another. So it's often easier to edit from phoneme to phoneme rather than look for word breaks. Download the file audio1.mp3: it's got a voice delivering two takes of the line "This is a special occasion." While the first take is usable, the second has more emotion. But the second has a noise under the first few words. The only way to get a performance that starts cleanly and ends enthusiastically is to use a few words from each.
The line was delivered at normal speed, so there are no pauses between words. But checking the tips below, you'll see that the /p/ in "special" is a stop consonant: it creates a pause inside the word. You can scrub through the takes, easily hear and mark the pause in each one, and edit from one take to the other seamlessly. This principle also works when the words aren't strictly identical: with extra material, perhaps from a long interview segment, you could splice the same opening onto different endings to get "This is a spectacular show" or "This is a specious argument".
This is the method professional dialog editors use every day:
Listen to the phrase you want to edit, slowly, in your head. Identify any phonemes that might be useful for the edit you want to make, and decide which one will be easiest to use.
Scrub slowly through the audio clip for the first half of the edit. Even though speech is continuous, you should be able to hear most of the places where one phoneme changes into another. Stop precisely at the beginning of the desired phoneme.
What you do next depends on the program you're using. If you're scrubbing in an NLE's clip window, mark where you've stopped as the out-point of one clip. Then open another clip where you'll be able to mark an in-point for the other side of the edit. (You can also use a razor tool in the timeline -- if your NLE lets you scrub there -- to split a single clip into pieces.) If you're using a word-processor style audio program, mark this as the start of a selection to delete.
Do the same scrub-stop-and-mark for the other side of the edit.
Join the clips together in the NLE, or press Delete in your audio editor, and listen to the result. If you've marked the start of the phonemes accurately, it should be fine. Sometimes there'll be a volume difference between the two pieces of the edit, but that's easy to adjust in any program. Occasionally there'll be an intonation difference that can't be fixed without special tools, and sometimes the two pieces are so radically different that the edit is impossible. But if you're using this technique, at least you'll have the comfort of knowing that no professional sound cutter could have done the job any better.
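In a destructive audio editor, the whole operation boils down to cutting each take at its marked sample and joining the halves. Here's a minimal sketch in Python -- the "takes" are stand-in character lists rather than real samples, and the marker positions are hypothetical; in practice they'd be sample indices you found while scrubbing for the silence inside the /p/ stop:

```python
def splice(take1, cut1, take2, cut2):
    """Keep take1 up to the marked phoneme boundary (cut1),
    then continue with take2 from the same boundary (cut2).
    cut1 and cut2 are indices you marked while scrubbing."""
    return take1[:cut1] + take2[cut2:]

# Stand-in "audio": each character represents a run of samples.
take1 = list("THIS IS A SPECIAL OCCASION")   # clean start, flat ending
take2 = list("this is a special occasion")   # noisy start, better ending
# Hypothetical markers: the silent closure inside the /p/ of "special"
cut1 = take1.index("P")
cut2 = take2.index("p")

edited = splice(take1, cut1, take2, cut2)
print("".join(edited))   # → "THIS IS A Special occasion"
```

The capitalization change in the output shows exactly where the edit lands: everything before the /p/ comes from the first take, everything after it from the second.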
Apply the same principle to copying and inserting dialog. If you want to replace a syllable or a few words in the middle of a clip, however, treat it as an insert instead of pasting it over the original. Then delete the extra sounds. This makes a smoother edit because the replacement is seldom the same length as the original.
If you're doing just a few audio cuts in an on-camera sequence, you can link picture and sound and edit them simultaneously. A cutaway will hide any jumps in the video. But for a lot of edits, it's usually faster to export the track to an audio program and work there (see "Tools of the Trade"). This inevitably changes the length of the audio, but Figure 2 shows one way to get things back in sync. The on-camera interview was originally on linked tracks V1/A1. I exported a small section that needed massaging, fixed it in an audio program, and brought it back into the NLE on A2. Then, checking the waveform and an audio playback, I slid the new audio so its front was in sync with A1 (around 1:00:00;26). Next I used the razor tool to cut the original, unedited interview and moved the second part to linked tracks V2/A3, which gave me room to slide it earlier to match the end of the audio edits (around 1:00:01;25). The final step would be to clean up the overlaps and find a visual cutaway. Of course, edited sequences usually run longer than a single second... but for the purposes of this article, I wanted everything to fit on a single screenshot.
Figure 2: Track A2 has edits that could only be done in an audio-only program. Although this changed its length, it blends perfectly and stays in sync.
That's the technique. Here's what you should be listening for, based on the categories in Table 1.
STOP CONSONANTS
All of these are created by stopping the flow of air, letting pressure build, and then releasing it in a burst. There's a moment of silence in the middle of each stop consonant, right before the pressure is released. It can be as short as a third of a frame.
If a stop consonant is followed by a pause, it usually has two distinct sounds: one when the pressure is cut off, and another when it's released. But the second part isn't important. Eliminate it if you want to shorten the pause or move on to another word.
If two stop consonants are next to each other (as in "The Fat Cat"), they're usually elided: the closure uses the mouth shape of the first, and the release uses the mouth shape of the second. But when people are self-conscious, they often pronounce each stop separately for a total of four distinct sounds. Editing from one silence to the next will make your performer sound more relaxed.
FRICTION CONSONANTS
With the exception of /h/, these are created by forcing air through a narrow opening: between the lips for /f/ and /v/, between the tip of the tongue and back of teeth for /th/ and /TH/, and so on. This always makes a high-pitched sound that's very easy to spot while you're scrubbing.
You can often edit from the start of one friction consonant to the start of a completely different one.
/h/ is also created by air pressure, but it's flowing through an open mouth. There's very little friction there, so this phoneme can be very quiet and not even show up on a waveform display. Be careful that you don't accidentally delete it while you're editing.
DOUBLE CONSONANTS
These are actually two phonemes, one after another, that we usually hear as a single sound. But if you scrub through them slowly -- or have a well-trained inner ear -- you can hear the transition. You can also edit them separately, clipping the /d/ to turn my name into "Zhay" or borrowing a /t/ from the beginning of "chicken".
CONSONANT PAIRS
The stop, friction, and double consonants are listed two to a line for a reason. Each pair uses exactly the same tongue and lip movement. The only difference is that the first in a pair relies on air pressure alone, while the second adds a buzzing from the vocal cords. Phoneticists call these 'unvoiced' and 'voiced' consonants.
Unvoiced consonants don't carry any pitch, so they tend to stay consistent even if the speaker has a lot of tonal variety. This makes it easier to match them, even over a long speech. They also don't carry much that can be identified as a specific voice: you can often substitute one person's unvoiced phoneme for someone else's.
Since the mouth movements are identical, you can occasionally substitute one consonant in a pair for its brother. Sometimes this may be the only way to build words that weren't in the original. It lends a slight accent to the dialog, because we're used to hearing foreigners confuse these pairs when learning English.
When the consonant /b/ begins a word, some people start their vocal cords a half-second or so before the release. The result turns a word like "Baby" into "mmmBaby". Deleting the hum or covering it with room tone makes it sound better.
NASALS
For these three consonants, air comes out of the nose instead of the mouth (try saying a long "nnn" as you pinch your nostrils together). That's not particularly relevant to editing, but can make a difference if your performer has a head cold.
The /ng/ phoneme is written that way because it's heard at the end of words like "ring". But it's not a double consonant -- there's no separate /g/ in it. Many people say one anyway, as in the New York regionalism "Long Guyland". Feel free to delete the extra sound.
GLIDE CONSONANTS
These change shape while they're sounding. They're influenced a lot by the sounds on either side, so they're more difficult to match during editing.
The /l/ glide involves lifting your tongue from the ridge behind your upper front teeth. If the speaker's mouth is dry, saliva can stick and cause a tiny click in the middle of this sound. You can delete the click easily.
Some people have trouble with an initial /r/, turning it almost into a /w/. When this happens it's usually consistent throughout a take, so it may be hard to find a good /r/ to substitute. If you get a chance to re-record, ask the talent to add a tiny /d/ to the start of a critical word. This puts the tongue in the right place for the /r/ that follows. Then drop the extra /d/ -- it's a stop consonant, so it's easy to find -- while you're editing.
VOWELS
Practice saying them aloud, and learn to recognize them in dialog, because they're all different. You can't substitute one for another.
Vowels and voiced consonants carry the pitch of a voice, which varies a lot during normal speech. After you edit them, make sure the pitch doesn't jump unnaturally. If it does, try moving the edit to a nearby unvoiced consonant instead. As a last resort, varispeed or pitch-shift a few words one or two percent.
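That last-resort varispeed is just resampling: reading the samples slightly faster or slower. Here's a bare-bones sketch in Python using linear interpolation -- real tools use much better filters, and the 2% rate is simply the limit suggested above:

```python
def varispeed(samples, rate=1.02):
    """Resample by linear interpolation. rate > 1 plays faster
    (shorter and higher-pitched); rate < 1 plays slower and lower."""
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        # Linearly interpolate between neighboring samples
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        pos += rate
    return out

# A toy waveform standing in for a few words of dialog
original = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5] * 100
sped_up = varispeed(original, 1.02)
# The result is about 2% shorter (and 2% higher in pitch)
```

A shift this small is usually inaudible on a few words, which is why it can rescue an edit with a slight pitch jump -- but it's still a last resort, not a routine fix.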
Vowels and friction consonants carry most of the pacing of a voice. If a word is said too slowly, you can often make a small cut in the middle of one of these sounds to pick up the speed.
When nervous performers pause before a word that starts with a vowel, they often build up pressure in the back of their throats. When they release it to say the word, the result is a tiny click, almost like a stop consonant. It sounds tense. But you can calm things down by deleting the click.
There are three double vowels in normal English, similar to the double consonants. They always end with /ih/. (Say "Play" aloud and you'll hear it at the end.) Frequently, the two phonemes can be edited separately.
Bad dialog edits grow on you. While you're practicing these techniques, don't make the mistake of listening to a questionable edit over and over until it starts to sound good. Instead, trust your first instinct -- or move on to something else, and review the edit a few hours later.
Of course, you'll also need some way to move the pieces of dialog around. For most readers this will be an NLE or audio editing program, but the principles of editing are the same whether you're using a computer, analog audio tape, or magnetic film. A few things to bear in mind:
The system must be able to scrub, letting you hear the sound while you shuttle back and forth at very low speeds. It should also be responsive, changing speed and direction instantly while you move the mouse or turn a large knob. If there's a delay or the system seems sluggish, you'll have a hard time pinpointing where you want to edit.
The system should not restrict you to editing on frame boundaries. A thirtieth of a second is a long time for sound. Some NLEs let you mark audio edits to a tenth or a hundredth of a frame. Most audio editing programs let you edit with single-sample precision -- as fine as 1/1600th of a frame.
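That 1/1600th figure comes from simple arithmetic at the 48 kHz sample rate standard for video audio (a quick check in Python; the CD rate is shown just for comparison):

```python
SAMPLE_RATE = 48_000    # samples per second, standard for video audio
FRAME_RATE = 30         # video frames per second

samples_per_frame = SAMPLE_RATE / FRAME_RATE
print(samples_per_frame)   # → 1600.0: one sample = 1/1600th of a frame

# At the 44.1 kHz CD rate the figure is similar:
print(44_100 / FRAME_RATE)  # → 1470.0 samples per frame
```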
It's usually faster to use a program that lets you cut and paste along a track, word-processor style, rather than being limited to clip-based edits on a timeline.
For all these reasons, many video editors prefer to run a separate audio program for fine-tuning sound. Most NLEs export audio clips as .wav or AIFF files, which can be opened and tweaked in an audio program and then put back into the video timeline. Many good audio programs can also display QuickTime or AVI video in sync while you're working on the sound.
You also need a monitoring environment that lets you hear what you're editing. This doesn't require the kind of high-quality speakers that are essential for making mix decisions, but the speakers should be close to you and the room shouldn't have much of an echo.
Among Jay Rose's recent dialog editing projects have been national commercials for Disney and an art film featuring Johnny Depp. He also writes DV's Audio Solutions column, and has written a couple of books about audio for video that are constantly at the top of their category at amazon.com.
©2001 Jay Rose. Posted 22 Aug 04.