Whisper Transcripts
Whisper and whisper.cpp
Whisper is an open-source speech recognition model from OpenAI. It is a multitask, multilingual model that can perform speech recognition, speech translation, and language identification. Released in September 2022, it achieved state-of-the-art results. The latest model is whisper-large-v3, which OpenAI released at their Dev Day.
I had heard about Whisper when it was released but only got interested in using it after Georgi Gerganov, of llama.cpp fame, created an equivalent library called whisper.cpp. He ported Whisper to C/C++ and added accelerated inference via Apple Metal. This means llama.cpp and whisper.cpp both let you run these transformer-based models locally on a CPU or, if you have an M1/M2/M3 Mac, at a reasonable speed.
Using whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make clean
make -j
# download a model
bash ./models/download-ggml-model.sh large
# run inference
./main -m models/ggml-large.bin -f samples/jfk.wav
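whisper.cpp's main binary expects 16-bit WAV input sampled at 16 kHz, so most podcast audio needs converting first. A minimal conversion, assuming you have ffmpeg installed (the filename podcast.mp3 is just an example):

```shell
# convert to mono 16-bit PCM WAV at 16 kHz, the input format whisper.cpp expects
ffmpeg -i podcast.mp3 -ar 16000 -ac 1 -c:a pcm_s16le podcast.wav
```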
Speed benchmarks
The accuracy of Whisper’s large model is quite good. Coupled with the accelerated inference whisper.cpp offers on a Mac, it’s becoming easy to transcribe audio quickly and accurately. For example, running ggml-large-v2, I can transcribe around 1 hour of audio in about 13 minutes on an M2 Pro (200 GB/s memory bandwidth) or about 4 minutes on an M2 Ultra (800 GB/s).
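Put differently, those timings work out to real-time factors of roughly 4.6x and 15x. A quick sanity check of that arithmetic:

```shell
# real-time factor = minutes of audio / minutes to transcribe
awk 'BEGIN { printf "M2 Pro:   %.1fx real time\n", 60/13 }'   # 4.6x
awk 'BEGIN { printf "M2 Ultra: %.1fx real time\n", 60/4 }'    # 15.0x
```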
Building a personal transcript repository
One use case is building a personal transcript library. I enjoy listening to podcasts and sermons and often want to reference what was said; having a transcript handy makes that trivial. I was inspired by Andrej Karpathy's transcriptions of Lex Fridman's podcast (https://karpathy.ai/lexicap/) and built something similar for a handful of podcasts I enjoy listening to: https://lawwu.github.io/transcripts/.
The code to generate the transcripts using Whisper and to generate the webpages is here: https://github.com/lawwu/transcripts.
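To transcribe many episodes, the per-file command above scales into a simple loop. A sketch, assuming a built whisper.cpp checkout, 16 kHz WAV files in an audio/ directory, and an existing transcripts/ directory (these paths are illustrative, not the actual layout of my repo):

```shell
# transcribe every WAV in audio/ to a plain-text transcript in transcripts/
# -otxt writes .txt output; -of sets the output filename (extension is added automatically)
for f in audio/*.wav; do
  ./main -m models/ggml-large.bin -f "$f" -otxt -of "transcripts/$(basename "$f" .wav)"
done
```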