I admit it, I’m a yapper. I’ve even started yapping at my computer now.

I type a lot, and I figured that improving my typing speed could be a major lever to boost my productivity. So recently I started to learn to type with ten fingers (up from ~7) using tipp10.com, and it has been working well for me, although I am still far off from the pros.

But then I thought: why am I using my silly little fingers at all? Fast, accurate local transcription is a reality now, so I wanted to see whether it would suit my workflow.

My needs

I wanted a simple, global hotkey to record my voice and get the text ASAP. Naturally, I looked at existing solutions, but they were either cloud-based (not free and a privacy nightmare) or a hassle to set up. Note: I have since found Voxtype, which seems to fill this niche with a pretty polished solution and a lot of extra features. If you want to keep it simple, read on.

So, as one does, I coded my own solution called YapType. It uses OpenAI’s Whisper model locally, so my voice never leaves the machine. It’s made for Linux, but could probably be made to work on macOS without too much extra work.

TL;DR: see the README and try it for yourself!

The Old Way (and why it sucked)

My first iteration was… a hack. A glorious, fragile, horrible hack.

It was a Python script that used evdev to listen directly to the raw keyboard hardware events via e.g. /dev/input/event0. It intercepted a specific shortcut key combo, recorded audio, ran the transcription, and then - this is the worst part - simulated keystrokes to “type” the text back out into whatever window had focus.

This had several major drawbacks:

Hardware changes: If I plugged in my USB keyboard, the /dev/input/eventX ID changed, and the script didn’t work anymore.
Permissions: To listen to raw inputs, the user needs to be in the input group (or run as root). Not ideal.
The “Typing” Bottleneck: Simulating keystrokes is fast, but not instant. If I clicked away or moved the cursor while it was “typing,” my transcription would end up in the wrong place or trigger random shortcuts.

The fixed version

I realized I was over-engineering the input/output handling. Why was I listening to raw hardware events when the Desktop Environment (GNOME, KDE, Hyprland, etc.) already has a perfectly good shortcut system?

I replanned the project with a cleaner separation of concerns:

The Server: A background service that keeps the transcription model loaded in RAM. This is crucial, as loading the model takes time, and having it ready in memory means recording can start instantly. Spoiler: it takes 250MB of RAM.
The Client: A tiny, dumb script that just sends a “TOGGLE” signal to the server via a Unix socket. It gets invoked by a shortcut.
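To make the split concrete, here is a minimal sketch of a toggle server and client talking over a Unix socket. It is not YapType's actual code: the socket path, the function names, and the `max_connections` parameter (which only exists so the demo loop can end) are assumptions made for illustration.

```python
import os
import socket
import tempfile
import threading

# Hypothetical socket path; the real project may use a different location.
SOCKET_PATH = os.path.join(tempfile.gettempdir(), "yaptype-demo.sock")


def serve(on_toggle, ready, max_connections=None):
    """Listen on a Unix socket and call on_toggle() for every TOGGLE message."""
    if os.path.exists(SOCKET_PATH):
        os.unlink(SOCKET_PATH)  # remove a stale socket left by an earlier run
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(SOCKET_PATH)
        server.listen()
        ready.set()  # signal that clients can connect now
        handled = 0
        while max_connections is None or handled < max_connections:
            conn, _ = server.accept()
            with conn:
                if conn.recv(64).strip() == b"TOGGLE":
                    on_toggle()
            handled += 1
    os.unlink(SOCKET_PATH)


def send_toggle():
    """The whole client: connect, send one message, exit."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(SOCKET_PATH)
        client.sendall(b"TOGGLE")
```

In this sketch the server would run `serve(...)` in the background with the model already loaded, while the keyboard shortcut runs a script that only calls `send_toggle()`. The client stays trivial, so it starts and exits almost instantly.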

How it works now

I simply bound the client script to Ctrl+Alt+- in my GNOME settings.

When I press the shortcut:

The Client pings the server and exits immediately.
The Server wakes up and starts recording from the default microphone. This makes the microphone privacy indicator light up in the system menu (top-right corner, at least on default GNOME).
I yap ahead.
I press the shortcut again.
The Server stops recording, runs the audio through faster-whisper, saves the text to a file, and opens it in my text editor of choice.
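The steps above boil down to a two-state toggle. The following is a rough sketch of that logic under stated assumptions, not the project's real implementation: `start_recording`, `stop_recording`, and `transcribe` are hypothetical callables standing in for the audio and model code, the file naming scheme is made up, and `xdg-open` is used here as one common way to open a file with the desktop's default application.

```python
import subprocess
import time
from pathlib import Path


class ToggleRecorder:
    """First toggle starts recording; the second stops, transcribes, and saves."""

    def __init__(self, start_recording, stop_recording, transcribe, out_dir, open_file=None):
        self.start_recording = start_recording  # hypothetical: begins capturing audio
        self.stop_recording = stop_recording    # hypothetical: returns the captured audio
        self.transcribe = transcribe            # hypothetical: audio -> text
        self.out_dir = Path(out_dir)
        # Default: hand the file to the desktop's default application.
        self.open_file = open_file or (lambda p: subprocess.Popen(["xdg-open", str(p)]))
        self.recording = False

    def toggle(self):
        if not self.recording:
            self.start_recording()
            self.recording = True
            return None
        # Second press: stop, transcribe, persist, and show the result.
        self.recording = False
        text = self.transcribe(self.stop_recording())
        self.out_dir.mkdir(parents=True, exist_ok=True)
        path = self.out_dir / f"transcript-{time.strftime('%Y%m%d-%H%M%S')}.txt"
        path.write_text(text + "\n", encoding="utf-8")
        self.open_file(path)
        return path
```

Each call to `toggle()` corresponds to one press of the shortcut. Because the transcript goes to a file instead of being typed into the focused window, clicking elsewhere while the transcription is running cannot send text to the wrong place.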

This approach solves the problems I mentioned before. It’s more robust because it relies on the desktop environment to handle the shortcut, and safer because it runs in user space without special groups. And since the transcription is no longer typed out via simulated keystrokes, it can’t land in the wrong window. As a nice side effect, the transcriptions are preserved, too (unless you write them to /tmp/).

Optimization: RAM vs Speed

I’m using faster-whisper, which is a reimplementation of OpenAI’s Whisper model using CTranslate2. It is faster, uses less memory, and supports further speedups via int8 quantization, which is perfect for CPUs!

I went with the base.en model, quantized to int8, using 8 threads to transcribe. It eats up about 250MB of my RAM, but considering that I have a bunch of unused RAM anyway, that’s a cheap price to pay. The model itself is pretty accurate, sans minor hiccups here and there. Note that for languages other than English, you have to switch to a checkpoint without the .en suffix.
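For reference, this is roughly how those settings map onto faster-whisper's `WhisperModel`. Treat it as a sketch rather than the exact code in the repository: the helper names `load_model` and `transcribe_file` are made up for this example, and the import is done lazily only so the snippet can be loaded without the library installed.

```python
def load_model():
    # Imported lazily so this module can be imported without faster-whisper installed.
    from faster_whisper import WhisperModel

    # base.en checkpoint, int8 quantization, 8 CPU threads, as described above.
    return WhisperModel("base.en", device="cpu", compute_type="int8", cpu_threads=8)


def transcribe_file(model, audio_path):
    # transcribe() returns a lazy generator of segments plus an info object;
    # the actual decoding happens while the segments are iterated.
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text.strip() for segment in segments)
```

Loading the model once and reusing it for every call to `transcribe_file` is what keeps the server responsive. For other languages, a multilingual checkpoint such as `base` would take the place of `base.en`.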

Open source: YapType

I am open-sourcing the code, which is <200 lines of code total. It’s a small, sharp tool that does one thing well. Just as I like it.

It takes just a few minutes to set up, and I wrote a clean README to guide you through it.

Code and more technical info can be found here.

That’s all!

Yap away! :-)
