Building a local, private voice clone without burning my time (or my data)
For years I’ve wanted a way to think out loud and have it turn into something usable. Not just a transcript, but a real artifact I could shape, edit, and publish.
Typing has always felt like a speed limit. Speaking doesn’t.
Recently, I built a fully local audio-to-text pipeline on macOS using native GPU tools. That system now works. It’s fast, private, and costs nothing to run.
Once that was stable, a new question emerged naturally:
If I can go from voice → text locally…
Can I go from text → my voice just as cleanly?
That question led me into voice cloning.
Not as a product idea. Not as a demo.
As a personal tool—one that respects privacy, judgment, and intent.
What follows isn’t a tutorial. It’s a log.
A record of how I’m tuning a system in the open, one version at a time.
The constraint that mattered most
Privacy wasn't a feature.
It was the constraint.
I wasn’t interested in cloud services, subscriptions, or uploading years of my voice to someone else’s servers.
I wanted the same philosophy I used for transcription:
- Local
- Offline = free
- Explicit control
- Nothing happens unless I choose it
That constraint immediately shaped the stack and ruled out most commercial options.
Version 0.1—Proof of life
The first goal was simple: Can a local model produce audio that sounds vaguely like me? Using an open-source multilingual TTS model running locally, I generated my first test file. Technically, it worked. Emotionally, it didn’t. The voice was metallic. Slightly cartoonish. Too fast. Too high. It sounded like a narrator doing an impression of me. That was expected — and useful. At this stage, the question wasn’t “Is this good?” It was: What knob actually matters?
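For the curious, a proof-of-life test like this is only a few lines. Here is a minimal sketch assuming an open-source multilingual model such as Coqui's XTTS v2; the model name, reference clip, and output paths are illustrative, not my exact setup.

```python
# Minimal v0.1-style test, assuming Coqui TTS with the multilingual XTTS v2
# model. The reference clip and output paths are placeholders.
from TTS.api import TTS

# Load the model once; after the initial weight download it runs fully offline.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone from a short reference recording and write a test file to disk.
tts.tts_to_file(
    text="This is a first proof-of-life test of the local voice clone.",
    speaker_wav="anchor/voice_sample.wav",  # hypothetical reference clip
    language="en",
    file_path="out/test_v01.wav",
)
```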
Version 0.2—Speed and pacing
The first obvious issue was speed. Spoken words were coming out too fast, with no breathing room. That was easy to fix. Slowing the model down helped immediately. Adding pauses helped more. Transcripts don't naturally contain breath. Speech does. Simple text preprocessing to insert pauses between phrases made a noticeable difference. Result: better pacing. Still not me.
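The preprocessing itself is nothing clever. Here is a sketch of the idea; the pause convention is an assumption, since different models render breaks differently.

```python
import re

# Split a transcript on sentence boundaries and add explicit breaks so the
# model has room to breathe. The "..." pause marker is an assumption; check
# what your TTS model actually treats as a pause.

def add_breathing_room(transcript: str) -> str:
    # Collapse whitespace left over from the speech-to-text pass.
    text = re.sub(r"\s+", " ", transcript).strip()
    # Split on sentence-ending punctuation, keeping the punctuation.
    phrases = re.split(r"(?<=[.!?]) ", text)
    # Re-join with ellipses, which many TTS models read as short pauses.
    return " ... ".join(phrases)

print(add_breathing_room("I speak quickly. There is no breath. It runs together."))
```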
Version 0.3—Pitch reality check
The next issue was pitch. The output consistently sounded too high — almost falsetto — even when the rhythm felt right. Instead of guessing, I measured. I took my original voice recordings, split them into chunks, and analyzed the pitch using phonetics tools. The data was consistent: my speaking voice sits roughly in the 130–140 Hz range. Deep tenor. No mystery there. That confirmed something important: the source audio wasn’t the problem. The model was drifting upward. I experimented briefly with post-processing pitch correction. It helped technically — but identity suffered. The voice moved closer in frequency but further away in feel. That was the first hard lesson: You can’t fix identity downstream.
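Measuring is cheap. Here is a sketch of that pitch check using parselmouth, the Python wrapper around the Praat phonetics toolkit; the file path and pitch bounds are assumptions.

```python
import numpy as np
import parselmouth  # Python interface to Praat

# Measure the median fundamental frequency of one chunk of the source audio.
snd = parselmouth.Sound("anchor/chunk_01.wav")  # hypothetical chunk path
pitch = snd.to_pitch(pitch_floor=75.0, pitch_ceiling=300.0)

f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]  # Praat reports unvoiced frames as 0 Hz; drop them

print(f"median F0: {np.median(f0):.1f} Hz")
```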
Version 0.4—Anchor quality over everything
At this point, the pattern was clear. No amount of tweaking speed, pitch, or filters would solve the core issue if the anchor voice wasn’t right. I went back to the beginning. The original anchor audio I fed the system was clean but short. Technically valid, but emotionally thin. So I recorded a longer segment—30 minutes—and listened to it carefully for the first time.
That alone revealed problems: uneven loudness, inconsistent energy, and moments I wouldn’t want the system to learn from. I’m now preparing a new 8–10 minute anchor, deliberately recorded, leveled, and representative of how I actually sound when I’m thinking, not performing.
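Leveling is the one part of the new anchor that benefits from tooling rather than discipline. Here is a sketch using pyloudnorm for integrated loudness normalization; the paths and the -23 LUFS target are assumptions.

```python
import soundfile as sf
import pyloudnorm as pyln

# Level the raw anchor recording to a consistent integrated loudness
# (ITU-R BS.1770 / EBU R128 style measurement).
data, rate = sf.read("anchor/anchor_raw.wav")    # hypothetical input path

meter = pyln.Meter(rate)                         # BS.1770 loudness meter
loudness = meter.integrated_loudness(data)       # measured loudness in LUFS

leveled = pyln.normalize.loudness(data, loudness, -23.0)
sf.write("anchor/anchor_leveled.wav", leveled, rate)
```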
This is the moment when most people rush ahead. I’m doing the opposite.
This process mirrors something I’ve seen repeatedly in software design:
- Don’t automate instability
- Don’t optimize confusion
- Don’t tune outputs before understanding inputs
Voice cloning turns out to be less about AI and more about restraint.
Right now, the system works.
It’s fast.
It’s local.
It’s free.
But it’s not finished—because it doesn’t yet feel like me. That’s okay.
This log exists so I can track decisions, roll back when needed, and keep judgment in the loop.
When it locks in, I’ll know—and when it does, the automation will follow.