Convert any audio or video file to text—completely offline and private. Choose between CPU-optimized (recommended) or GPU-accelerated depending on your hardware.
✅ Works offline — No cloud uploads, no account needed
✅ Fast and accurate — Powered by advanced AI models
✅ Free and open — No subscriptions or usage limits
✅ Cross-platform — Windows, macOS, and Linux supported
✅ Choose your tool — CPU-based is significantly faster for single files; GPU-based is better for batch processing
| Feature | CPU-Based | GPU-Based |
|---|---|---|
| Best for | Single files, laptops | Batch processing |
| Engine | Parakeet V3 (ONNX) | Whisper AI (PyTorch) |
| Speed | 13.8x real-time | 2.91x real-time |
| Setup | Simple | Requires CUDA setup |
| GPU needed | No | Yes (NVIDIA) |
| Model size | ~671MB | ~1GB |
Test Results (1:14 min, 1MB audio file):
- CPU: 5.36 seconds (13.8x real-time, 0.07 RTF)
- GPU: 25.45 seconds (2.91x real-time, 0.34 RTF)
- CPU is 4.75x faster for single files
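The figures above follow directly from the clip length (1:14 = 74 seconds); a quick sanity check:

```python
# Recompute the benchmark figures for the 1:14 (74-second) test clip.
clip = 74.0              # audio duration in seconds
cpu, gpu = 5.36, 25.45   # measured transcription times in seconds

print(round(clip / cpu, 1))  # 13.8 -> CPU speed vs. real time
print(round(clip / gpu, 2))  # 2.91 -> GPU speed vs. real time
print(round(cpu / clip, 2))  # 0.07 -> CPU real-time factor (RTF)
print(round(gpu / cpu, 2))   # 4.75 -> CPU speedup over GPU for this file
```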
→ Recommended: Start with CPU-based for single transcriptions. Use GPU for batch processing 50+ files.
This is the easiest option. It works on any computer without requiring a GPU.
Step 1: Navigate to the CPU folder

```bash
cd "CPU Based Audio Transcriber"
```

Step 2: Create a virtual environment

macOS / Linux:

```bash
python3 -m venv venv
source venv/bin/activate
```

Windows (PowerShell):

```powershell
python -m venv venv
venv\Scripts\Activate.ps1
```

Windows (Command Prompt):

```bat
python -m venv venv
venv\Scripts\activate.bat
```

Step 3: Install dependencies

```bash
pip install -r requirements.txt
```

Step 4 (Optional): Install FFmpeg for video support

macOS:

```bash
brew install ffmpeg
```

Ubuntu / Debian:

```bash
sudo apt-get install ffmpeg
```

Windows:

```powershell
choco install ffmpeg
```

Launch the app:

```bash
python app.py
```

On first run:
- Download splash screen — App automatically downloads AI models from Hugging Face (~671MB)
- Progress indicator — Shows download status
- Main window — Ready to transcribe
Subsequent runs:
- No download splash screen — Models are cached locally, so nothing is downloaded
- Loading screen — App loads the model into memory
- Main window — Ready to transcribe
Then:
- Click Browse and select an audio or video file
- Click Transcribe
- Watch the countdown timer
- Results appear automatically
Audio: MP3, WAV, M4A, FLAC, OGG, AIFF, AU, and more
Video: MP4, MKV, AVI, MOV, WEBM, FLV, WMV, etc. (auto-converted)
- Smart time estimation — App measures your CPU performance on startup
- Accurate countdown — Timer shows real remaining time, not guesses
- Responsive UI — Never freezes; cancel anytime
- No language selection — Model auto-detects what you're speaking
1. Select file
↓
2. App measures CPU speed (5-second benchmark)
↓
3. Calculate: estimated_time = file_duration × benchmark_RTF
↓
4. Transcribe with accurate countdown
↓
5. Show results
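The estimation step above can be sketched in a few lines (an illustration of the idea, not the actual `performance_profiler.py` code):

```python
# Minimal sketch of the countdown estimate: the startup benchmark yields a
# real-time factor (RTF = processing_time / audio_duration), and the ETA is
# the file's duration scaled by that factor. Numbers are illustrative.
def estimate_seconds(file_duration_s: float, benchmark_rtf: float) -> float:
    """Estimated transcription time for a file of the given duration."""
    return file_duration_s * benchmark_rtf

# Example: the 74-second test clip at the measured 0.07 RTF
print(round(estimate_seconds(74.0, 0.07), 2))  # 5.18 seconds
```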
```
CPU Based Audio Transcriber/
├── app.py                      # Main interface (start here)
├── config.py                   # Model auto-download configuration
├── download_splash.py          # Download progress splash screen
├── model_manager.py            # Parakeet V3 model loader
├── transcription_service.py    # Transcription engine
├── performance_profiler.py     # CPU benchmarking
├── video_converter.py          # Video-to-audio conversion
├── audio_handler.py            # Audio file loading
├── requirements.txt            # Dependencies list
│
├── models/                     # ONNX model files (auto-downloaded on first run)
│   ├── encoder.int8.onnx
│   ├── decoder.int8.onnx
│   ├── joiner.int8.onnx
│   └── tokens.txt
└── assets/                     # UI resources
```

Note: Model files are automatically downloaded from Hugging Face on first run (~671MB). Subsequent runs use the cached models.
- Python: 3.14+
- RAM: 2GB minimum (4GB recommended)
- Disk: ~700MB for models
- OS: Windows, macOS (Intel/Apple Silicon), Linux
- `sherpa-onnx>=1.9.0` — ONNX runtime for Parakeet V3
- `soundfile>=0.12.1` — Audio file reading
- `customtkinter>=5.0.0` — Modern graphical interface
- `numpy>=1.21.0` — Audio processing
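As a rough sketch of how the downloaded ONNX files are wired together, sherpa-onnx exposes a transducer loader that takes the four files listed in the `models/` tree. The helper names here are illustrative, not the actual `model_manager.py` code:

```python
# Hedged sketch: load the Parakeet V3 transducer files with sherpa-onnx.
# from_transducer is the sherpa-onnx API; model_paths and load_recognizer
# are hypothetical helpers for illustration.
from pathlib import Path

MODEL_FILES = ("encoder.int8.onnx", "decoder.int8.onnx",
               "joiner.int8.onnx", "tokens.txt")

def model_paths(model_dir: str) -> dict:
    """Map each required file to its path under the models/ directory."""
    d = Path(model_dir)
    return {name.split(".")[0]: str(d / name) for name in MODEL_FILES}

def load_recognizer(model_dir: str = "models"):
    import sherpa_onnx  # requires the sherpa-onnx package
    p = model_paths(model_dir)
    return sherpa_onnx.OfflineRecognizer.from_transducer(
        encoder=p["encoder"], decoder=p["decoder"],
        joiner=p["joiner"], tokens=p["tokens"],
    )
```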
Use this if you have an NVIDIA GPU for batch processing.
Step 1: Navigate to the GPU folder

```bash
cd "GPU Based Audio Transcriber"
```

Step 2: Create a virtual environment

macOS / Linux:

```bash
python3 -m venv venv
source venv/bin/activate
```

Windows (PowerShell):

```powershell
python -m venv venv
venv\Scripts\Activate.ps1
```

Step 3: Install dependencies

```bash
pip install -r requirements.txt
```

Step 4: Install PyTorch with CUDA support

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```

Step 5: Verify GPU detection

```bash
python -c "import torch; print('GPU available:', torch.cuda.is_available())"
```

If `True` appears, your GPU is ready. If `False`, see GPU troubleshooting.

Launch the app:

```bash
python main.py
```

On first run:
- Download splash screen — App automatically downloads Whisper AI model from Hugging Face (~1GB)
- Progress indicator — Shows download status
- Splash screen — Model loads into memory
- Main window — Two columns for single/batch transcription
Subsequent runs:
- No download — Models are cached, app loads model directly
- Splash screen — Model loads into memory
- Main window — Ready to transcribe
Choose Single File Transcription or Batch Processing:
Single File:
- Click Browse File
- Select language
- Click Transcribe
Batch Processing:
- Click Select Folder
- Choose output folder
- Click Start Batch
- 🌐 Multi-language — Supports multiple languages
- 📁 Batch processing — Transcribe entire folders automatically
- 💾 Auto-save — Results saved to text files
- ⏱️ Real-time feedback — Monitor transcription progress
- `torch>=2.0.0` — PyTorch with CUDA support
- `openai-whisper>=20230314` — Whisper speech recognition
- `pydub>=0.25.0` — Audio handling
- `PyQt5>=5.0.0` — Professional interface
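For reference, the batch-processing loop can be sketched on top of the openai-whisper API (`whisper.load_model` / `model.transcribe`). The folder handling and helper names below are illustrative, not the actual `main.py` code:

```python
# Hedged sketch of a Whisper batch loop: transcribe every media file in a
# folder and save one .txt per input. AUDIO_EXTS and output_path are
# hypothetical helpers; load_model/transcribe are the openai-whisper API.
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".mp4", ".mkv", ".mov"}

def output_path(src: Path, out_dir: Path) -> Path:
    """Map an input file to its transcript, e.g. meeting.mp4 -> meeting.txt."""
    return out_dir / (src.stem + ".txt")

def batch_transcribe(folder: str, out_dir: str, model_name: str = "base") -> None:
    import whisper  # requires the openai-whisper package
    model = whisper.load_model(model_name)  # cached after first download
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in Path(folder).iterdir()
                   if p.suffix.lower() in AUDIO_EXTS)
    for i, f in enumerate(files, 1):
        print(f"Processing {i}/{len(files)}: {f.name}")
        result = model.transcribe(str(f))
        output_path(f, out).write_text(result["text"], encoding="utf-8")
```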
Note: AI models are automatically downloaded from Hugging Face on first run (~1GB). Subsequent runs use the cached models.
Example: Transcribe a podcast episode
1. Start app: python app.py
2. App benchmarks your CPU (watch progress):
"⏳ Starting CPU benchmark..."
"✓ CPU benchmark complete (RTF: 1.2x)"
3. Click "Browse" → Select "podcast.mp4"
4. Click "Transcribe"
Status: "⏱️ 15.2s remaining"
(countdown updates every 0.5 seconds)
5. Wait for transcription
Status: "✓ Ready"
6. Results appear in text box
(Select all + copy, or save to file)
Example: Transcribe 10 MP4 files
1. Start app: python main.py
(Splash screen shows model loading progress)
2. Click "Select Folder with Files"
→ Choose folder with 10 MP4s
3. Shows "✓ 10 file(s) found"
4. Click "Start Batch"
→ Choose output folder
5. Watch real-time progress:
"Processing 3/10: meeting-notes.mp4"
"⏱️ Est. Time: 2:45"
6. Results saved:
✓ meeting-notes.txt
✓ Q1-summary.txt
...
MIT — Free to use, modify, and distribute
Q: Can I use this without internet?
A: Yes! After you run the app once (models download), you can work completely offline.
Q: Is there a file size limit?
A: No. Very large files will just take longer to transcribe.
Q: Does it work on Mac with Apple Silicon?
A: Yes, the CPU version works on Apple Silicon (M1/M2/M3).
Q: Can I batch transcribe with CPU version?
A: Not in the UI, but you can run multiple instances.
Q: Can I improve transcription accuracy?
A: Use clear audio and minimize background noise for best results.
Found a bug? Want to improve something?
- Check existing GitHub issues
- Create a new issue with clear description
- Submit a pull request with improvements


