erm: A Local CLI That Strips Ums, Uhs, and Erms From Speech

May 2, 2026 · 1547 words · 8 minute read

Linguists have a word for the ums, uhs, ers, and elongated versions (ummmm, uhhhhh) that pad spoken English: disfluencies.

I don’t record a lot of voice audio, but a few friends do, and they tell me editing those out by hand is miserable. So I built erm to do it.

uvx erm input.wav

That’s the whole interface for the common case. It writes a cleaned .wav and a JSON cut list next to the input. This post walks through how it works, because the obvious approach doesn’t sound very good and most of the code is the stuff that fixes that.

The naive version doesn’t work

You’d expect the job to be: transcribe with word-level timestamps, find tokens like um and uh, cut those ranges with ffmpeg.
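
In code, the naive version is only a few lines. A sketch using faster-whisper (introduced below), stopping just short of the ffmpeg splice:

from faster_whisper import WhisperModel

FILLERS = {"um", "uh", "er", "erm"}

model = WhisperModel("medium.en")
segments, _ = model.transcribe("input.wav", word_timestamps=True)

cuts = []                          # (start, end) ranges to remove
for segment in segments:
    for word in segment.words:
        if word.word.strip(" ,.").lower() in FILLERS:
            cuts.append((word.start, word.end))
# ...then cut those ranges with ffmpeg and stitch the rest together.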

That gets you maybe 60% of the way, and the result sounds worse than the original. Three reasons:

  • Whisper quietly leaves a lot of fillers out of the transcript, so there’s no um token to match in the first place.
  • Slicing audio at an arbitrary point in time produces a tiny step in the waveform. Your ear hears it as a click.
  • Even when the splice itself is clean, the background hiss before and after the cut doesn’t quite match, so you hear a faint shift at every edit.

Most of erm is the work of fixing those three things.

A quick word on Whisper

Whisper is OpenAI’s open-source speech-to-text model. You hand it audio, it hands you back a transcript, and with the right flag it’ll also tell you the start and end timestamp of every word. It runs locally, which is what makes a tool like this possible without sending your recordings anywhere.

erm uses faster-whisper, a reimplementation that’s several times faster than the reference implementation and uses less memory. Same model weights, same output, just a better runtime. The default is the medium.en model, a good speed/accuracy balance. You can override with --model if you want small.en (faster), but I’d actually reach for large-v3: it’s noticeably better at picking up fillers and worth the extra compute.
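
In practice:

uvx erm input.wav --model small.en   # faster
uvx erm input.wav --model large-v3   # slower, catches more fillers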

Detection

First, run Whisper. erm asks for word-level timestamps and gives it a small instruction up front telling it not to clean up the transcript. Whisper, left alone, will edit out fillers because most of its training transcripts are clean prose. Any word that comes back as a known filler (um, uh, er, etc.) is flagged for cutting. Elongated versions like ummmm get matched against the um stem on the fly.
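
The elongation matching is easy to sketch: collapse repeated letters and compare against the stems. (Illustrative; erm’s actual stem list and normalization may differ.)

import re

FILLER_STEMS = {"um", "uh", "er", "erm"}

def is_filler(token: str) -> bool:
    word = re.sub(r"[^a-z]", "", token.lower())   # drop punctuation
    collapsed = re.sub(r"(.)\1+", r"\1", word)    # "ummmm" -> "um"
    return word in FILLER_STEMS or collapsed in FILLER_STEMS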

Whisper still misses things, so three more passes look at the audio directly:

Gap fillers. If there’s an unusually long pause between two transcribed words (more than 350ms by default), erm checks whether somebody is actually making a sound during that “pause.” If a chunk of voice is sitting inside what Whisper marked as silence, that’s a filler Whisper deleted entirely. It really does just drop them. No token at all, just a hole in the transcript where an um used to be.
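
In outline, assuming words sorted by time and a hypothetical has_voice() energy check (this is the shape of the pass, not erm’s internals):

MIN_GAP = 0.35  # seconds; the default pause threshold

def gap_candidates(words, audio):
    for prev, cur in zip(words, words[1:]):
        if cur.start - prev.end > MIN_GAP and has_voice(audio, prev.end, cur.start):
            # voiced sound inside a transcript "pause": a dropped filler
            yield (prev.end, cur.start)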

Fillers hiding inside a word. Whisper sometimes glues a filler onto an adjacent word, so "in, uhhhhh" comes back as a single in token. erm looks at long single-token words, splits them at brief dips in the audio, figures out which chunk is the actual word (based on how long that word should reasonably take to say), and treats the rest as filler.
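
A sketch of that split, assuming the filler trails the word (as in the example above), a hypothetical find_dips() that returns brief low-energy moments inside the word, and a rough per-character speaking budget rather than erm’s tuned constants:

def split_glued_filler(word, dips, seconds_per_char=0.1):
    expected = len(word.word.strip(" ,.")) * seconds_per_char
    if word.end - word.start < 2 * expected or not dips:
        return None  # nothing suspicious about this word
    # pick the dip that leaves the most plausible-length word behind,
    # and treat everything after it as filler
    split = min(dips, key=lambda t: abs((t - word.start) - expected))
    return (split, word.end)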

Words that are much too long. If a word lasts way longer than its text could plausibly take to pronounce, the tail end is suspicious. erm scans the tail for voiced sound, and optionally double-checks with a pitch test: does the suspicious chunk sound like someone holding a vowel (uhhhhh), or like someone just speaking slowly? A held vowel has a steady, simple acoustic shape; real speech is constantly changing as you move between sounds. The pitch test keeps the tool from trimming slow talkers.
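
The acoustic side of that test can be sketched with spectral flux: how much the short-time spectrum changes from frame to frame. A held vowel barely changes; moving speech changes constantly. (Illustrative threshold; erm’s actual pitch test may be built differently.)

import numpy as np

def looks_like_held_vowel(spec: np.ndarray, threshold: float = 0.05) -> bool:
    # spec: magnitude spectrogram of the suspect chunk,
    # shape (n_frames, n_bins), each frame normalized to unit energy
    if len(spec) < 2:
        return False
    flux = np.linalg.norm(np.diff(spec, axis=0), axis=1)
    return float(flux.mean()) < threshold  # steady spectrum: held vowel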

All four passes (the Whisper one and the three audio ones) produce candidate cuts independently, and the lists get merged before the next step.
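
The merge itself is standard interval merging; a sketch (the tolerance parameter comes back in the next section):

def merge_cuts(cuts, tolerance=0.0):
    # merge (start, end) cuts that overlap or sit within
    # `tolerance` seconds of each other
    merged = []
    for start, end in sorted(cuts):
        if merged and start - merged[-1][1] <= tolerance:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(c) for c in merged]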

Refining the cut points

A cut at exactly t = 1.234s lands wherever the waveform happens to be at that instant, almost never at zero. Stitching two arbitrary points together leaves a step in the waveform, and that step is the click you hear.

Two small fixes, in order. First, each cut endpoint is allowed to slide a tiny bit (up to 60ms) to land in the quietest spot nearby. If there’s a momentary lull in the audio just before or after the original cut point, slide there. The slide is bounded so it can’t cross into a neighboring word, otherwise you’d chew off real speech. Second, from that quiet spot, the endpoint snaps to the nearest moment when the waveform is exactly crossing zero. Two zero points stitched together produce a continuous waveform with no step, and no click.
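
Both fixes together, as a sketch over float mono samples in [-1, 1] (the guard that keeps the slide out of neighboring words is omitted here):

import numpy as np

def refine_cut(t, samples, sr, slide=0.06, win=0.005):
    center = int(t * sr)
    lo = max(0, center - int(slide * sr))
    hi = min(len(samples) - 1, center + int(slide * sr))
    w = max(1, int(win * sr))
    # 1. slide to the quietest nearby spot (smallest short-window energy)
    best = min(range(lo, max(lo + 1, hi - w)),
               key=lambda i: float(np.mean(samples[i:i + w] ** 2)))
    # 2. snap to the nearest zero crossing (adjacent samples change sign)
    zeros = np.nonzero(np.diff(np.sign(samples[lo:hi])))[0] + lo
    if zeros.size:
        best = int(zeros[np.argmin(np.abs(zeros - best))])
    return best / sr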

After all that, very short surviving fragments get cleaned up: if two adjacent cuts would leave a sliver of audio shorter than about 120ms between them, the sliver gets merged into one bigger cut. A fragment that small can’t survive the smoothing on either side anyway and just sounds like a blip.
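
With the merge_cuts sketch from earlier, this is just a second pass with a tolerance:

cuts = merge_cuts(cuts, tolerance=0.12)   # absorb sub-120ms slivers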

Splicing

ffmpeg does the actual stitching using a crossfade. Instead of butting the two pieces of audio together, it briefly overlaps them and fades one out as the other fades in. That smooths over any remaining mismatch.

The trick is picking how long to overlap. A fixed length (most tutorials say 80ms or so) sounds wrong both ways: short cuts get smeared together, long cuts still pop. erm scales the length to the size of the cut: a tiny clip of uh gets a short crossfade, a long ummmmm gets a longer one. There’s a floor and ceiling (50ms to 120ms), and the crossfade is never allowed to reach back across the start of a real word, which would muddy the speech on either side.
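
The length calculation is roughly a clamp. The floor and ceiling are the numbers above; the scale factor here is illustrative:

def crossfade_ms(cut_seconds, lo=50, hi=120, scale=0.15):
    # longer cut, longer crossfade, clamped to [50ms, 120ms]
    return min(hi, max(lo, cut_seconds * 1000 * scale))

(The overlap-and-fade itself is ffmpeg’s acrossfade filter, e.g. acrossfade=d=0.08 for an 80ms fade.)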

Room tone

Even after all of the above, the background hiss of the recording (the ambient sound of the room when nobody’s talking) doesn’t perfectly match across cuts. Every room has a slightly different “silence,” and stitching two near-silences together still produces a faint shift you can hear.

The fix is dumb but it works. Find a quiet stretch in the original recording (a real piece of “this room when nobody’s talking”) and loop it underneath the entire output at low volume. Now the background is identical everywhere, because it’s the same loop everywhere. Any small mismatch at each splice gets covered up by the steady tone sitting on top.
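
As a standalone ffmpeg command, the idea looks roughly like this, with roomtone.wav standing in for the extracted quiet stretch (a sketch, not erm’s exact invocation or mix level):

ffmpeg -i cleaned.wav -stream_loop -1 -i roomtone.wav \
  -filter_complex "[1:a]volume=0.15[rt];[0:a][rt]amix=inputs=2:duration=first:normalize=0" \
  out.wav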

By default the quiet stretch is found automatically. You can also point it at a specific time range if you know a good one.

The denoiser is sneaky

ffmpeg has a built-in noise reducer, and you can run it on the audio at various points in the pipeline. The catch: denoising smooths out the very details (volume bumps and pitch wiggles) that the detectors rely on to find fillers. So it matters when you do it.
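
One such filter is afftdn, ffmpeg’s FFT-based denoiser; a standalone pass looks like this (whether erm uses this exact filter and settings is its own business):

ffmpeg -i input.wav -af afftdn denoised.wav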

erm has four modes:

Mode      Detection looks at    The output is cut from
none      the original          the original
pre       a denoised copy       the denoised copy
post      the original          the original; denoised at the end
hybrid    the original          a denoised copy

hybrid is the default, and the one you want: detection runs on the original audio (so it can see all the cues), but the actual cuts come from a clean, denoised copy (so the splices sound nice).

pre looks sensible but is the worst option, because running the detectors on denoised audio hides the very things they’re looking for.

Validation

Audio renders can break in subtle ways, so there’s a validate subcommand:

uvx erm validate input.wav cleaned.wav --cuts cuts.json

It runs three checks:

  • The output file actually opens.
  • The output is shorter than the input by roughly the total length of the cuts (within a small margin).
  • When you transcribe the cleaned file back to text, no fillers come back.

That last one is the useful one. It’s end-to-end: it tells you the tool actually did what it claimed.
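
The duration check, for what it’s worth, is simple arithmetic. A sketch, assuming the cut list is a JSON array of [start, end] pairs and a hypothetical duration_of() built on ffprobe:

import json

def duration_check(in_path, out_path, cuts_path, margin=0.25):
    with open(cuts_path) as f:
        cuts = json.load(f)
    removed = sum(end - start for start, end in cuts)
    expected = duration_of(in_path) - removed
    return abs(duration_of(out_path) - expected) <= margin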

What it won’t touch

It leaves like, you know, and I mean alone. Those sound like fillers but they’re doing real work in the sentence, and cutting them automatically would change what someone said. The rule erm follows: only remove things that are sound, not language.

It also doesn’t touch repeated words, false starts, or long thinking pauses. Those aren’t noise on top of the speech; they are the speech, just messier than the speaker would like. Cleaning them up is an editorial decision about which take to keep, and erm doesn’t have an opinion about that.

Try it

The quickest way is with uv, which fetches and runs the tool in one step without a permanent install:

uvx erm input.wav --dry-run     # see what would be cut
uvx erm input.wav               # render

If you’d rather install it the usual way:

pip install erm                 # or: pipx install erm
erm input.wav

You’ll also need ffmpeg and ffprobe on your PATH (brew install ffmpeg on macOS).

github.com/dougcalobrisi/erm. Audio stays local. If you record voice notes or podcasts and every other word is um, give it a try.