Automating the removal of terrible mouth sounds from podcasts

It is important to me to create Small Findings episodes that contain information I myself would want to hear. At the same time, for the sake of sustainability, it is also important to keep them cheap in terms of time and effort. That is to say, if it’s too much work, I can’t justify doing it.

Therefore, I cast aside concerns about slick audio and perfect phrasing and vocal performance.

Here are two other things I do to keep it small enough to fit into my life:

  • I record just using my phone without a complex setup or a fixed location. This lets me record a segment wherever I happen to be, whenever I have a few moments of spare time.
  • I automate much of the production. A shell script and a Makefile take the m4a files from my phone, normalize them, convert them to the proper formats, concatenate them with the theme music and stings, and make the final mp3. Sometimes I have to step into some part of the chain to make a manual edit, but I still save time because the remaining parts are automated.

This saves a lot of time. However, one consequence of recording with a handheld phone is that sometimes I hold it too close to my face and get the dreaded “mouth crackle” in the recording, which can sound distracting at best and gross at worst to the listener.

Here’s an example.

There are a lot of audio issues that I don’t think are as important as people think they are, but this one cannot be ignored.

To get rid of the crackle, you can either re-record the segment (which makes the podcast too expensive), or you can isolate the frequencies that the crackle lives in, then cut them out with a graphic equalizer. The crackle noise lives in the 6 kHz-12 kHz range. That coincides with the frequency range for perceived brightness in music, but luckily, this is speech, not music.

To manually cut those frequencies, you can do something like this with a graphic EQ filter.

Graphic EQ settings for cutting mouth crackle in Audacity.

However, manually doing this for every segment is laborious. It’s going to make me want to quit the podcast. Fortunately, this can be done via the SoX command line program.

Here is an example bash script that will run sox on every file in a directory.

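#!/bin/bash

# Make sure the output directory exists before sox writes to it.
mkdir -p meta/after

for file in meta/before/*.wav
do
  # Strip the directory part, leaving just the file name.
  filename=${file##*/}
  # One gentle, wide cut at 5 kHz, then four deep, narrow cuts
  # across the crackle range.
  sox "${file}" \
    "meta/after/${filename}" \
    equalizer 5k 1.0q -7 \
    equalizer 6.3k 5.0q -20 \
    equalizer 8k 5.0q -20 \
    equalizer 10k 5.0q -20 \
    equalizer 12k 5.0q -20
done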

sox commands take an input file, switches (none were used in this example), an output file, and optionally, an effects chain. In the command in the above script, the effects chain is five equalizer filters. Each equalizer clause specifies:

  • The center of the frequencies to be affected
  • The width of the band (how far from the center to go)
  • The amount to adjust the affected band

For example, this clause, equalizer 6.3k 5.0q -20, says “take the frequencies in a narrow band (5.0q) around 6.3 kHz and drop them by as much as 20 dB”. So, sox will drop frequencies at 6.3 kHz by 20 dB, and neighboring frequencies by a little less.

(The q suffix means the width is given as a Q factor. The higher the value, the narrower the band. This online book has some nice graphs of various q values.)
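To get a rough feel for what a q value means in Hz, a common rule of thumb is that the bandwidth is about the center frequency divided by Q. The sketch below applies that approximation. The function name band_edges is made up for illustration, and the real filter curve slopes off gradually rather than stopping at these edges, so treat the numbers as ballpark figures only.

```shell
#!/bin/bash

# Rough estimate of the band affected by "equalizer <center> <q>q <gain>".
# Assumption: bandwidth ~= center / Q, split evenly around the center.
# band_edges is a hypothetical helper, not part of SoX.
band_edges() {
  local center_hz=$1 q=$2
  awk -v f="$center_hz" -v q="$q" 'BEGIN {
    bw = f / q
    printf "%.0f %.0f\n", f - bw / 2, f + bw / 2
  }'
}

band_edges 6300 5.0   # the 6.3 kHz clause from the script
band_edges 5000 1.0   # the wider 5 kHz clause
```

By this estimate, 5.0q around 6.3 kHz covers roughly 5.67 kHz to 6.93 kHz, while 1.0q around 5 kHz covers roughly 2.5 kHz to 7.5 kHz, which is why the 5 kHz cut is kept much gentler.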

Again, here’s the terrible mouth crackling clip:

And here it is after running the command to turn those frequencies way down:

It’s now muddier because, as we said, we are reducing the brightness by attenuating those frequencies. However, it is now more listenable, which is the trade-off that we want.

Here is my eq command in context. I bet there are better values I can use, but that’s what I found via ten minutes of experimentation. If you have parameter suggestions, let me know!
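If you want to experiment with other values, it may help to generate the effects chain instead of typing each clause by hand. This is only a sketch: eq_chain is a hypothetical helper, and the values passed to it are the ones from the script above, not recommendations.

```shell
#!/bin/bash

# Print a sox effects chain that applies the same cut to several bands.
# eq_chain is a hypothetical helper, not part of SoX.
# Usage: eq_chain <width> <gain_db> <center>...
eq_chain() {
  local width=$1 gain=$2
  shift 2
  local chain="" center
  for center in "$@"; do
    # Append one equalizer clause per center frequency.
    chain="${chain:+$chain }equalizer $center $width $gain"
  done
  printf '%s\n' "$chain"
}

# Prints the four narrow cuts used in the script above.
eq_chain 5.0q -20 6.3k 8k 10k 12k
```

The output can be handed to sox with the command substitution left unquoted so that it splits into separate arguments, for example sox before.wav after.wav $(eq_chain 5.0q -15 6.3k 8k 10k 12k) to try a gentler cut.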

Plosives

The other problem with holding a phone rather than having a mic in a fixed position is that you sometimes get pops from speaking plosives too close to the phone mic. A burst of air really close to a mic is unpleasant to hear in a recording.

Here’s what that sounds like:

Fortunately, there are frequencies we can attenuate to sweep these under the rug, too. Unlike the mouth crackle frequencies, plosive pops live in low frequency bands, around 100-130 Hz, in my estimation. (Note that it’s Hz, not kHz.)

If we’re going to filter out stuff that low, we might as well get rid of everything else down there, too. Stuff under 80 Hz doesn’t really serve a purpose outside of music. Everything under 20 Hz is unhearable and is essentially a waste of computer energy, so you might as well always roll that off.

Given that, it’s preferable to just say “roll everything off lower than 130 Hz” instead of specifying several individual frequency peaks. And that is exactly what highpass filters are for.

This command tells sox to keep only the frequencies above 130 Hz:

sox before.wav after.wav highpass 130 4q

(The 4q says to do a steeper-than-default rolloff.)

Here’s the clip with the pop again:

And here it is after the highpass filter is run:

It’s not entirely smooth-sounding now, but it is no longer jarringly percussive.

So, now you can have scripts deal with both mouth crackling and plosive pops. Enjoy not having to do that yourself!
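Since sox accepts several effects in one chain, both fixes can plausibly run in a single pass. Here is a minimal sketch of that idea. The clean_dir function and the DRY_RUN variable are made up for this example, the filter values are simply the ones used earlier in this post, and I have assumed the highpass should come before the equalizer cuts.

```shell
#!/bin/bash

# Clean every WAV in a directory: highpass for plosives, then EQ cuts for crackle.
# Assumptions: clean_dir and DRY_RUN are hypothetical names for this sketch;
# the filter values are the ones from earlier in the post.
clean_dir() {
  local in_dir=$1 out_dir=$2 file
  mkdir -p "$out_dir"            # created even in a dry run
  for file in "$in_dir"/*.wav; do
    [ -e "$file" ] || continue   # skip if the glob matched nothing
    # With DRY_RUN set, print the command instead of running it.
    ${DRY_RUN:+echo} sox "$file" "$out_dir/${file##*/}" \
      highpass 130 4q \
      equalizer 5k 1.0q -7 \
      equalizer 6.3k 5.0q -20 \
      equalizer 8k 5.0q -20 \
      equalizer 10k 5.0q -20 \
      equalizer 12k 5.0q -20
  done
}

# Example: DRY_RUN=1 clean_dir meta/before meta/after
```

Running it with DRY_RUN=1 first shows the exact sox commands without touching any audio, which is handy for checking paths before processing a whole directory.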