No Fees or Subscriptions, Ever.
You Don’t Lose Your Intellectual Property.
Overview
I needed a reliable way to transcribe audio locally on macOS.
Not as a demo.
Not as a one-off experiment.
As infrastructure.
Most existing solutions fell into one of two categories:
- Cloud services that charge per minute, per month, or per tier
- Local solutions that technically work but are fragile, unstable, or in need of constant maintenance
I wanted something different: a system that runs locally, uses the hardware I already own, costs nothing to operate, and can be trusted to keep working months from now without babysitting.
This case study documents how I built that system, where AI helped, where it didn’t, and what I learned along the way.
The Real-World Problem
The problem wasn’t “how do I transcribe audio?”
That’s already solved.
The real problem was:
How do I build a stable, automated transcription pipeline on macOS that doesn’t require subscriptions, external services, or constant maintenance?
Most solutions today optimize for convenience or scale. They assume:
- You’re fine uploading private audio
- You’re fine with recurring fees
- You’re fine being locked into a service
I wasn’t.
I needed something:
- Local
- Automated
- Cost-free to run
- Private
- Durable
Initial Approach, And Why It Failed
My first instinct was the obvious one: Python-based ML tooling.
I explored:
- Whisper via Python
- Faster-Whisper
- PyTorch with Apple’s MPS backend
- Virtual environments
- Version pinning
This approach mostly worked. And that was the problem.
Over roughly an hour, I ran into:
- Python version incompatibilities
- Homebrew’s PEP 668 restrictions
- Silent CPU fallbacks
- Numerical instability on MPS, including NaNs during inference
- Backend limitations that only surfaced under real use
None of these issues was catastrophic on its own. Together, they made the system fragile.
It became clear that I was trying to force Python into a role it wasn’t well suited for on macOS: long-running, unattended GPU inference.
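Two of the items above, Python version mismatches and Homebrew's PEP 668 restrictions, can at least be detected before anything is installed. The following is a minimal preflight sketch, not part of the original project. It uses only the standard library and assumes the marker file location that PEP 668 describes (a file named EXTERNALLY-MANAGED in the interpreter's stdlib directory). The function names are hypothetical.

```python
import sys
import sysconfig
from pathlib import Path


def in_virtualenv() -> bool:
    """True when running inside a venv (prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


def externally_managed_marker() -> Path:
    """Path where PEP 668 says the EXTERNALLY-MANAGED marker file would live."""
    return Path(sysconfig.get_path("stdlib")) / "EXTERNALLY-MANAGED"


def describe_environment() -> str:
    """Summarize whether pip installs are isolated, blocked, or unrestricted."""
    if in_virtualenv():
        return "virtual environment: pip installs stay isolated"
    if externally_managed_marker().exists():
        return "externally managed interpreter: pip will refuse global installs"
    return "unmanaged interpreter: pip is not blocked by PEP 668"


if __name__ == "__main__":
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {version}: {describe_environment()}")
```

A check like this does not remove the fragility. It only surfaces one class of failure early instead of midway through an install.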
The Key Insight
The breakthrough wasn’t technical. It was architectural.
I arrived at a durable rule:
On macOS + Apple Silicon, prefer native Metal tools over Python ML stacks for production workflows.
This reframed the entire problem.
Python is still excellent for:
- Glue code
- Automation
- Orchestration
- Text processing
But it’s a poor choice for:
- Stability-critical GPU inference
- Fire-and-forget pipelines
- Systems that should survive OS updates untouched
Once I accepted that, the solution became obvious.
The Final Architecture
I rebuilt the system around a native Whisper implementation that uses Metal directly.
The result is a pipeline with:
- A watch folder for incoming audio
- Automatic file handling and logging
- Native GPU-accelerated transcription via Metal
- No Python ML dependencies
- No subscriptions
- No cloud services
- No per-minute costs
Python still plays a role, but only as orchestration. The heavy lifting happens in native code, where macOS is strongest.
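To make the division of labor concrete, here is a rough sketch of what the orchestration layer could look like. It is an illustration under assumptions, not the code from the repository: the folder layout, the suffix list, and every function name are hypothetical, and the native binary is passed in as a placeholder command because its actual name and flags depend on the Whisper build you install.

```python
import logging
import shutil
import subprocess
from pathlib import Path
from typing import Callable

# Assumed set of audio formats; adjust to whatever the native tool accepts.
AUDIO_SUFFIXES = {".wav", ".mp3", ".m4a", ".flac"}

log = logging.getLogger("transcribe")


def pending_audio(inbox: Path) -> list[Path]:
    """Return audio files waiting in the watch folder, in name order."""
    return sorted(
        p for p in inbox.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def run_native_transcriber(command: list[str], audio: Path) -> str:
    """Run a native transcription binary and return whatever it prints.

    `command` is a placeholder argv prefix (binary plus any flags); the
    audio path is appended as the final argument.
    """
    result = subprocess.run(
        [*command, str(audio)], capture_output=True, text=True, check=True
    )
    return result.stdout


def process_inbox(inbox: Path, done: Path, failed: Path,
                  transcribe: Callable[[Path], str]) -> int:
    """Transcribe each pending file once, then move it out of the inbox."""
    done.mkdir(parents=True, exist_ok=True)
    failed.mkdir(parents=True, exist_ok=True)
    completed = 0
    for audio in pending_audio(inbox):
        try:
            text = transcribe(audio)
        except Exception:
            # Park the file so one bad input cannot block the queue.
            log.exception("transcription failed: %s", audio.name)
            shutil.move(str(audio), str(failed / audio.name))
            continue
        (done / f"{audio.stem}.txt").write_text(text, encoding="utf-8")
        shutil.move(str(audio), str(done / audio.name))
        log.info("transcribed %s", audio.name)
        completed += 1
    return completed
```

In use, a small loop or a scheduled job would call process_inbox every few seconds, passing a transcribe function that wraps run_native_transcriber with the real command. Python never touches the model or the GPU here. It only moves files, calls the binary, and writes logs.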
The final system is intentionally boring:
- Predictable
- Quiet
- Stable
- Repeatable
That’s exactly what infrastructure should be.
How AI Was Used In This Project
AI played two distinct roles.
1. As The Engine
The transcription itself is powered by a large speech-to-text model. This is AI in the most literal sense: inference over real audio data to produce usable text.
Running it locally eliminates ongoing cost and preserves privacy, but it only works if the system is architected correctly.
2. As A Thinking Partner
AI was also used as a collaborative tool during development:
- To reason through backend limitations
- To compare architectural tradeoffs
- To sanity-check assumptions
- To accelerate debugging and decision-making
What it didn’t do was replace thinking.
The final solution wasn’t generated automatically. It emerged through iteration, constraint analysis, and recognizing when an approach was fighting the platform instead of working with it.
AI was effective as an amplifier, not a replacement.
Why This Matters
Most transcription services today keep charging for as long as you use them.
Even modest usage adds up:
- Monthly fees
- Usage tiers
- API overages
- Vendor lock-in
This system costs:
- $0 to run
- $0 per minute
- $0 per month
It uses hardware already owned, runs entirely locally, and can be audited, modified, or frozen as needed.
That matters not just financially, but cognitively. Once the system is built, it stops demanding attention.
Outcome
The end result is a production-ready transcription system that:
- Solves a real, recurring problem
- Avoids subscriptions entirely
- Uses AI responsibly and locally
- Aligns with the strengths of macOS
- Can be trusted long-term
The full project is open-sourced here:
https://github.com/berchman/macos-whisper-metal
Final Reflection
This project reinforced something I’ve learned repeatedly over the years:
Good systems aren’t defined by how clever they are.
They’re defined by how little they ask of you once they exist.
The right use of AI doesn’t create more complexity.
It removes it.