
I've always wanted a transcription machine because for years, typing has been a bottleneck.
Not thinking.
Not clarity.
Not ideas.
Typing.
Back when the first iPhones came out, I had a simple wish:
let me talk, and let my words appear in my blog.
At the time, that was fantasy. Speech recognition existed, but only in research labs, big companies, or cloud services that didn’t really work well and definitely weren’t private. I moved on, kept typing, and learned to live with the speed limit.
Fast-forward to now.
Modern hardware.
Local machine learning.
Open models.
Enough computing power sitting on my desk to do what used to require a lab.
So I finally did it.
I built a fully local voice cloning and publishing pipeline on my own laptop. No cloud inference. No subscriptions. No dashboards. No usage caps. Neither my data nor my intellectual property leaves my machine unless I explicitly choose it.
That constraint mattered more than the tech itself.
## What I wanted (and what I refused)
I didn’t want:
- another AI subscription
- another web interface
- another service asking me to “upgrade” my own brain
- another place my raw thoughts were stored on someone else’s servers
I wanted:
- text → audio
- audio → text
- both directions
- locally
- for free
- automated, but only when I asked for it
This isn’t about replacing judgment.
It’s about removing friction.
Automation that empowers judgment instead of eroding it.
## The tool I built
At a high level, the system now does two things:
- Transcription
  - I record audio
  - Drop it in a folder
  - Whisper runs locally on Apple Silicon using Metal
  - Clean, readable text appears
  - Optional publishing happens only if I explicitly speak intent
- Voice synthesis
  - I provide my own voice reference
  - Text files dropped into a folder become .m4a files
  - The voice is mine
  - The processing is local
  - The output is mine to keep or discard
No GPU calls inside Python ML stacks.
No fragile cloud dependencies.
No long-running services pretending to be “magic.”
Just files, folders, and clear contracts.
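
To make that folder contract concrete, here is a minimal sketch of what a pipeline like this can look like. It is not the actual implementation: it assumes whisper.cpp's `whisper-cli` binary with a local model for the audio → text direction, `ffmpeg` for the final `.m4a` wrap, and a hypothetical local voice-cloning command I've named `clone-tts` as a stand-in for whatever model handles text → audio. The folder names and polling loop are illustrative.

```python
#!/usr/bin/env python3
"""Folder-watching sketch: audio in -> text out, text in -> .m4a out.

Assumptions (illustrative, not the exact setup described above):
  * whisper.cpp is installed and its `whisper-cli` binary is on PATH,
    with a local ggml model file for fully offline transcription
  * `clone-tts` is a hypothetical stand-in for whatever local
    voice-cloning command turns text + a reference voice into a WAV
  * ffmpeg wraps the WAV into an .m4a
"""
import subprocess
import time
from pathlib import Path

AUDIO_IN = Path("inbox/audio")            # drop recordings here
TEXT_IN = Path("inbox/text")              # drop .txt drafts here
OUT = Path("outbox")                      # transcripts and .m4a files land here
MODEL = Path("models/ggml-base.en.bin")   # local Whisper weights (example path)


def transcribe(audio: Path) -> None:
    """Run whisper.cpp locally; writes <stem>.txt into the outbox."""
    subprocess.run(
        ["whisper-cli", "-m", str(MODEL), "-f", str(audio),
         "-otxt", "-of", str(OUT / audio.stem)],
        check=True,
    )


def synthesize(text_file: Path) -> None:
    """Hypothetical local TTS step, then wrap the WAV as .m4a with ffmpeg."""
    wav = OUT / f"{text_file.stem}.wav"
    m4a = OUT / f"{text_file.stem}.m4a"
    subprocess.run(                        # `clone-tts` is a placeholder name
        ["clone-tts", "--ref", "voice/reference.wav",
         "--text-file", str(text_file), "--out", str(wav)],
        check=True,
    )
    subprocess.run(["ffmpeg", "-y", "-i", str(wav), "-c:a", "aac", str(m4a)],
                   check=True)


def main() -> None:
    for d in (AUDIO_IN, TEXT_IN, OUT):
        d.mkdir(parents=True, exist_ok=True)
    seen: set[Path] = set()
    while True:                            # simple polling; launchd or a watcher also works
        for audio in AUDIO_IN.glob("*.m4a"):
            if audio not in seen:
                transcribe(audio)
                seen.add(audio)
        for text in TEXT_IN.glob("*.txt"):
            if text not in seen:
                synthesize(text)
                seen.add(text)
        time.sleep(5)


if __name__ == "__main__":
    main()
```

In a real setup, a file-system watcher or a scheduled job would replace the polling loop, and any publish step would only fire after the transcript confirms explicit intent.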
## Why this is finally possible
In 2008, this idea simply wasn’t realistic.
Speech models weren’t good enough. Hardware wasn’t accessible. Tooling didn’t exist outside academic circles.
Today, it is.
Not because of one model or one framework, but because the ecosystem finally matured:
- open speech models
- commodity GPUs
- local inference
- better system-level tooling
This is the kind of problem that’s only solvable now.
## What this unlocks for me
I can think out loud without restraint.
I can write at the speed of thought.
I can turn raw thinking into drafts without ceremony.
And I can do it knowing:
- my data stays local
- my voice is mine
- my process is under my control
This isn’t a product (yet).
It’s a personal tool.
But it’s also a case study in how I approach problems:
constraints first, workflow second, technology last.
If you’re curious how it works in detail, I’ve written more about the architecture and tradeoffs here:
👉 My Local Transcription Pipeline
More soon.
