Transcription for the Masses
- Ron Dahlgren

- May 16
Transcribing the text from an audio source is a time-consuming task when done manually. Having done it myself, I can tell you that an amateur will easily spend four times the duration of the source material. That adds up quickly when transcribing interviews for a reporting endeavor or a research project. Technology is the solution, right? Well, yes, but maybe not what you think.

Some of the people I interview have concerns about where the audio is stored and shared. The content of the interview combined with the sound of their voice would uniquely identify them. In other cases, they have concerns about their voice being used to train impersonation models. Then there's the more basic concern of feeling "the spotlight" when they know they are being recorded. To address these concerns, I commit to my interviewees that nobody ever gets the source audio except me, and I only use it for transcription purposes. To honor that commitment, I need to take some specific steps.
Manual transcription is out. It's too time-consuming to be useful. Transcription services (both human-powered and automated) are out too - they require surrendering the original recording to an untrusted third party. That leaves running something locally, on hardware I control. Luckily, OpenAI's Whisper provides exactly that sort of functionality. To follow along, you'll need a spoken audio source (an mp3 is fine), ffmpeg, and a functional Python environment.
Setting It Up
First, let's create a virtual environment for this project, then install the primary dependency. Always use a virtual environment. The system Python should be kept pristine.
$ python3 -m venv transcription-venv
$ source transcription-venv/bin/activate
(transcription-venv) $ pip3 install openai-whisper
Next, we need to make sure ffmpeg is available. This component is used for audio conversions by Whisper. If you're on macOS, use brew:
$ brew install ffmpeg
For Linux platforms, use your distribution's package manager (`apt` for Debian-based distributions, `pacman` for cool distributions). Now we are ready to process the audio file! Assuming your sample file is "~/interview.mp3", you can perform the transcription with:
$ whisper --model large \
--output_dir transcription-output \
--output_format vtt \
--task transcribe \
--language English \
--temperature 0.6 \
--best_of 10 \
--initial_prompt "Transcribe this interview between two people" \
--no_speech_threshold 0.6 \
--hallucination_silence_threshold 2.2 \
~/interview.mp3
The Whisper GitHub repository includes details about the available models and their performance relative to one another. I use the invocation above on a Lovelace-class GPU with 48GB of VRAM. I have also used the "base.en" model on a CPU-only system with good performance - no GPU required! If you omit the `--output_format` flag, Whisper will generate all five available formats. For my use, the VTT format is the most useful. If you want structured processing downstream, the JSON format includes segment metadata.
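Here's a minimal sketch of reading that metadata back, assuming you ran with `--output_format json` (or omitted the flag so every format was written). Whisper names each output file after the input, so interview.mp3 yields interview.json:

import json

# Whisper names outputs after the input file, so "interview.mp3"
# produces "interview.json" inside the output directory.
with open("transcription-output/interview.json") as f:
    result = json.load(f)

# Each segment carries start/end offsets in seconds plus its text.
for segment in result["segments"]:
    start, end = segment["start"], segment["end"]
    print(f"[{start:7.1f}s - {end:7.1f}s] {segment['text'].strip()}")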
Putting It Together
I use a digital voice recorder that I purchased for less than $40 to capture the original audio. This keeps the interview off devices that may aggressively sync to the cloud. In practice, it's also faster and more convenient than using an app on my phone or setting up a laptop to record. When I'm back home, I transfer the audio to my processing computer, delete it from the recorder, and run the transcription process. The output is a mostly accurate transcript with timestamps for referencing the original material. No third-party services required. I've successfully transcribed multi-hour interviews with exactly this setup. My only complaint is the lack of speaker labeling!
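If you'd rather script that last step than retype the CLI flags, the openai-whisper package exposes the same functionality as a Python library. Here's a minimal sketch mirroring the invocation above (the file path and model choice are just placeholders):

import whisper

# "base.en" runs acceptably on a CPU-only machine; swap in "large"
# if you have the GPU memory for it.
model = whisper.load_model("base.en")

# These options mirror the CLI invocation from earlier.
result = model.transcribe(
    "interview.mp3",
    language="English",
    temperature=0.6,
    best_of=10,
    initial_prompt="Transcribe this interview between two people",
    no_speech_threshold=0.6,
)

# result["text"] is the full transcript; result["segments"] holds
# the timestamped pieces.
print(result["text"])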
Enhancements
The missing piece here is attributing speech to the individual speakers on the recording. This is sometimes called "diarization". As of May 2025, I haven't found a model I can run locally that performs this task with any useful degree of accuracy. Have you found a "run local" setup that accurately attributes speakers? Let us know!