Wispr Flow is genuinely impressive. You hold a key, speak naturally, and it inserts polished text wherever your cursor is - emails, Slack, terminal, VS Code, anywhere. The experience feels like magic.
Then the $15/month bill hits. That's $180 a year for a dictation tool. Not outrageous, but it got me thinking - this shouldn't cost that much. Voice recognition is a solved problem. So why is good voice dictation still locked behind a subscription?
That question turned into a weekend project. That weekend turned into SpeakInk. And if you want to try it right now - grab the repo, run python setup.py, and you're up in under 5 minutes.
The Journey to Get Here
I didn't start with NVIDIA. I started with local models.
My first instinct was to run everything offline. No API keys, no internet dependency, completely private. I set up Whisper locally and it worked - but it was slow. On CPU alone the latency was noticeable enough to break the flow of dictating. You'd finish a sentence and wait. That waiting kills the whole point.
So I switched to looking at paid cloud options. AssemblyAI, ElevenLabs, Cartesia - all solid, all fast. But then I was just building a slightly cheaper Wispr Flow. Still paying per hour of audio, still vendor-dependent. That felt wrong. The goal wasn't to save a few dollars, it was to make this genuinely free and accessible to anyone.
Then I came across NVIDIA's Parakeet TDT 0.6B v2 on build.nvidia.com. Free API. Rate-limited for personal use, but the limits are generous enough that you'd have to dictate for hours every single day to hit them.
I honestly wasn't expecting much - free usually means a catch. But when I tested it, the accuracy was right there with Wispr Flow. Same level. I was genuinely amazed. Fast, clean transcripts, handles natural speech well, no weird artifacts. It runs on NVIDIA's infrastructure, so latency is minimal. And it costs nothing.
That was the missing piece. The whole goal - free, fast, accurate, no subscriptions - suddenly had a real answer.
But while I was building around Parakeet, I kept thinking - not everyone has the same needs. Some people need 99+ language support. Some people work in environments where they can't send audio to any cloud. Some people want the cleanest possible transcript for formal writing. One provider can't be perfect for everyone.
So instead of building around a single provider, I built around the ability to swap them.
How It Works
Hold Right Option on macOS (or Right Alt on Windows). Speak. Release the key. Text appears at your cursor, in whatever app you're focused on.
That's the whole interaction. No window to switch to, no button to click, no copy-paste. It works in VS Code, Slack, Chrome, Terminal, Notes, email - anywhere you can type.
You can also switch to toggle mode if you prefer tapping the key once to start and once to stop. Both modes are in settings and apply immediately, no restart needed.
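Under the hood, the two modes come down to a small state machine around key press and release events. Here's a minimal sketch of that logic - the `Recorder` and `HotkeyController` names are illustrative, not SpeakInk's actual classes:

```python
# Sketch of hold-to-talk vs. toggle recording modes.
# Class names here are illustrative, not SpeakInk's real API.

class Recorder:
    def __init__(self):
        self.recording = False

    def start(self):
        self.recording = True   # real app: open the audio stream

    def stop(self):
        self.recording = False  # real app: close stream, send audio to STT

class HotkeyController:
    def __init__(self, recorder, mode="hold"):
        self.recorder = recorder
        self.mode = mode  # "hold" or "toggle"; switchable live from settings

    def on_press(self):
        if self.mode == "hold":
            self.recorder.start()
        else:
            # Toggle mode: each tap flips the recording state.
            if self.recorder.recording:
                self.recorder.stop()
            else:
                self.recorder.start()

    def on_release(self):
        # Only hold mode cares about key release.
        if self.mode == "hold":
            self.recorder.stop()
```

Because the mode is just a field on the controller, flipping it in settings takes effect on the very next keypress - which is why no restart is needed.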
Pick Your Provider, Pick Your Experience
This is the part I'm most proud of. Every STT provider has a different personality and different strengths. SpeakInk lets you switch between them based on what you're doing:
| Provider | Cost | Best For |
|---|---|---|
| NVIDIA Parakeet TDT 0.6B v2 | Free | Everyday use, best accuracy |
| Whisper (local) | Free | Fully offline, privacy-first |
| Cartesia | ~$0.13/hr | 99+ languages, real-time streaming |
| AssemblyAI | ~$0.15/hr | Partial transcripts, turn detection |
| ElevenLabs Scribe v2 | ~$0.40/hr | Speaker diarization |
NVIDIA Parakeet is the default for a reason - it's free and genuinely best-in-class. For most people, most of the time, this is all you need.
Whisper local is for people who refuse to send their voice to any server, ever. If you're dictating sensitive content, working offline, or just philosophically opposed to cloud APIs, this is your provider. It's slower on CPU, fast on a decent GPU.
Cartesia shines for multilingual use. If you switch between languages mid-sentence or need support for a language that NVIDIA handles less cleanly, Cartesia's 99+ language support is hard to beat.
AssemblyAI does something the others don't - it gives you partial transcripts as you're speaking. If you're building something on top of SpeakInk or just want to watch the text appear word by word, this is interesting.
ElevenLabs is the choice if you're doing interview transcription or any recording where multiple speakers are involved. Speaker diarization - knowing who said what - is its standout strength among the providers here.
Switch between them in settings. No code changes, no config file editing, just a dropdown. The whole app hot-swaps the provider live.
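Conceptually, hot-swapping works because everything talks to one common interface, and the active provider is just an object the app can reassign. Here's a hand-wavy sketch of that idea - these are not SpeakInk's real classes, just the shape of the pattern:

```python
from abc import ABC, abstractmethod

class STTProvider(ABC):
    """Minimal provider interface; the real one is richer."""
    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...

# Fakes standing in for real cloud/local backends:
class FakeParakeet(STTProvider):
    def transcribe(self, audio: bytes) -> str:
        return "hello from parakeet"

class FakeWhisper(STTProvider):
    def transcribe(self, audio: bytes) -> str:
        return "hello from whisper"

class ProviderRegistry:
    def __init__(self):
        self._providers = {}

    def register(self, name, factory):
        self._providers[name] = factory

    def create(self, name) -> STTProvider:
        return self._providers[name]()

registry = ProviderRegistry()
registry.register("parakeet", FakeParakeet)
registry.register("whisper", FakeWhisper)

# The settings dropdown just swaps which object the app holds:
active = registry.create("parakeet")
active = registry.create("whisper")  # hot-swap, no restart
```

Since the rest of the app only ever calls `transcribe()`, it never needs to know which backend is behind it.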
Architecture Deep Dive
The core of the app is AppController. It's initialized with five things: the STT provider, the text insertion method, the AI correction provider, a ConfigManager, and an EventBus. Everything is wired through dependency injection - no global state, no singletons.
The EventBus keeps everything decoupled. When audio recording starts, an event fires. When transcription completes, another event fires. When there's an error, it flows through the bus. Components subscribe to the events they care about and ignore the rest.
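A bus like that can be as small as a dict mapping event names to subscriber lists. A minimal sketch (illustrative, not the real EventBus):

```python
from collections import defaultdict

class EventBus:
    """Tiny publish/subscribe bus: components never call each other directly."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event, handler):
        self._subscribers[event].append(handler)

    def publish(self, event, payload=None):
        # Fire-and-forget: events with no subscribers are simply ignored.
        for handler in self._subscribers[event]:
            handler(payload)

bus = EventBus()
log = []
bus.subscribe("transcription_done", lambda text: log.append(text))
bus.publish("recording_started")            # no one listening: no-op
bus.publish("transcription_done", "hello")  # log now holds "hello"
```

The payoff is exactly the decoupling described above: the recorder publishes without knowing who listens, and the UI subscribes without knowing who publishes.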
The trickiest part of PyQt6 is its thread model. Qt is strict - you cannot update the UI from a background thread. STT providers run on background threads, so UiBridge exists to solve this. It uses Qt signals to marshal events from background threads back to the main thread safely.
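If you haven't used Qt, the underlying idea is the same as pushing results onto a queue that only the main thread drains - Qt signals are a thread-safe way to do exactly that. A stdlib-only analogy of what UiBridge does (this is the pattern, not Qt code):

```python
import queue
import threading

ui_queue = queue.Queue()  # only the "main thread" reads from this

def stt_worker():
    # Background thread: never touches the UI directly.
    text = "transcribed text"            # pretend STT result
    ui_queue.put(("insert_text", text))  # marshal result to main thread

worker = threading.Thread(target=stt_worker)
worker.start()
worker.join()

# Main thread drains the queue and performs the UI update itself.
event, payload = ui_queue.get()
# In Qt, emitting a signal plays the role of ui_queue.put(...),
# and the connected slot runs on the main thread automatically.
```

Qt's signal/slot mechanism handles the queuing and wakeup for you, which is why UiBridge is mostly a bundle of signal declarations rather than explicit queue code.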
When you change a setting - switch providers, change the hotkey, toggle AI correction - on_settings_changed() gets called and everything propagates live. No restart required for any setting.
AI Correction (Optional)
Raw transcripts aren't always clean. "lets meet at three p m" should be "Let's meet at 3 PM." The AI correction step handles this.
After transcription, you can optionally run the text through Gemini Flash or a local Ollama model. It fixes punctuation, capitalizes properly, removes filler words like "um" and "uh" - without changing what you actually said.
Gemini Flash is the right choice for most people. Fast, cheap, and if you're already using Google AI Studio it's effectively free at dictation volumes. Ollama is fully local if you have 16GB RAM and want zero network dependency.
Off by default because the raw transcript is usually good enough. Turn it on when you're dictating formal writing or emails where polish matters.
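The correction step is essentially a single prompt to a small model. Here's a sketch of what that call might look like, with a stub standing in for the real Gemini Flash or Ollama client - the prompt wording and function names are my guesses, not SpeakInk's actual code:

```python
# Hypothetical correction prompt; SpeakInk's real prompt may differ.
CORRECTION_PROMPT = (
    "Fix punctuation and capitalization, and remove filler words "
    "(um, uh) from this dictated text. Do not change the meaning "
    "or rephrase anything:\n\n{text}"
)

def correct(text: str, llm=None) -> str:
    """Run the raw transcript through an LLM; fall back to the raw text."""
    if llm is None:  # correction disabled (the default)
        return text
    return llm(CORRECTION_PROMPT.format(text=text))

# Stub in place of a real Gemini/Ollama call:
fake_llm = lambda prompt: "Let's meet at 3 PM."
print(correct("lets meet at three p m", llm=fake_llm))
```

Keeping the LLM behind a plain callable is also what makes "off by default" cheap: with no model configured, the transcript passes straight through.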
Building for macOS
macOS has strict privacy controls that make this non-trivial. Keyboard simulation via pynput requires Accessibility permission. Without it, the event tap silently fails - no error, your text just doesn't appear. That's a maddening bug to debug the first time.
SpeakInk handles this with a PermissionsDialog on first run. It detects which permissions are missing, explains what each one is for, and links directly to the right pane in System Settings. No digging through menus.
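For reference, macOS exposes deep links into System Settings panes via the `x-apple.systempreferences:` URL scheme, which is how an app can jump straight to the Accessibility pane. A sketch of the idea (not SpeakInk's exact code, and macOS-only):

```python
import subprocess

# Deep link to Privacy & Security > Accessibility in System Settings.
ACCESSIBILITY_PANE = (
    "x-apple.systempreferences:com.apple.preference.security"
    "?Privacy_Accessibility"
)

def open_accessibility_settings():
    """Open the exact Settings pane the user needs (macOS only)."""
    subprocess.run(["open", ACCESSIBILITY_PANE], check=True)
```

Linking to the exact pane, instead of telling the user "go to Settings", is most of what makes the first-run flow painless.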
You Can Add Your Own Provider
The provider interface is intentionally simple. Adding a new STT provider is one file - implement the interface, register it in ProviderRegistry, done. No other code changes needed anywhere in the app.
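Assuming the interface really is just a transcribe method plus a registration call (check the repo for the actual signatures), a new provider might be no more than this:

```python
# Hypothetical shape of a community-contributed provider; the real
# interface lives in the SpeakInk repo and may differ in detail.

class MyProvider:
    name = "my-provider"

    def transcribe(self, audio: bytes) -> str:
        # Call your STT service of choice here.
        return "transcript goes here"

# Registration: one line against a registry (hypothetical API).
PROVIDERS = {}

def register(provider_cls):
    PROVIDERS[provider_cls.name] = provider_cls

register(MyProvider)
```

Everything else - the hotkey, insertion, settings UI - picks the new provider up automatically, because nothing outside the registry knows concrete provider types.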
If there's a provider you want that isn't here, the path to adding it is genuinely short. The repo has a contributing guide. A new provider is a good first PR.
Setup
```shell
git clone https://github.com/samrathreddy/Speakink
cd Speakink
python setup.py
```
setup.py creates a virtual environment, installs dependencies, and launches the app. Add NVIDIA_API_KEY to .env - get a free key at build.nvidia.com - and you're done.
The app lives in your system tray. Right-click for settings, history, and provider switching.
What's Next
Windows testing and packaging, PyInstaller builds so you don't need Python installed, and more providers. The architecture makes adding providers easy - what takes time is testing them properly across different accents, environments, and edge cases.
If you try it and something breaks, open an issue. If you add a provider, open a PR. That's how this gets better for everyone.
The repo is at github.com/samrathreddy/Speakink. Free. Open source. No subscription.