phase iii — voice lab
Train the voice.
Build training datasets from your vocal stems. Register checkpoints you trained on a GPU pod. Render new performances in your voice. Score how close you got.
Training happens off-platform; see ml/voice/README.md for the RunPod recipe.
Phase 3 · step 1
Dataset builder
Selects vocal stems by quality, slices them, exports a training manifest.
Source tags (all must match)
Min vocal quality ≥ 60
Segment seconds
Overlap (0–0.9)
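The builder's settings can be sketched as follows. This is a minimal illustration, not the platform's implementation: the stem fields `tags` and `vocal_quality` and both function names are hypothetical, and it assumes overlap means the fraction of each segment shared with the next one, with any trailing audio shorter than a full segment dropped.

```python
from typing import Dict, Iterable, List, Tuple


def select_stems(stems: Iterable[Dict], required_tags: Iterable[str],
                 min_quality: float = 60) -> List[Dict]:
    """Keep stems that carry ALL required tags and meet the quality floor.

    The keys "tags" and "vocal_quality" are hypothetical field names.
    """
    required = set(required_tags)
    return [
        s for s in stems
        if required <= set(s["tags"]) and s["vocal_quality"] >= min_quality
    ]


def segment_windows(duration_s: float, segment_s: float,
                    overlap: float) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for fixed-length segments.

    Assumes `overlap` is the fraction of a segment shared with its
    successor, so the hop between starts is segment_s * (1 - overlap).
    """
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    if not 0 <= overlap <= 0.9:
        raise ValueError("overlap must be within 0 to 0.9")
    hop = segment_s * (1 - overlap)
    windows = []
    i = 0
    # Multiply by the index instead of accumulating, to limit float drift.
    while i * hop + segment_s <= duration_s + 1e-9:
        start = i * hop
        windows.append((round(start, 6), round(start + segment_s, 6)))
        i += 1
    return windows
```

Under these assumptions, a 10-second stem cut into 4-second segments with 0.5 overlap yields four windows starting at 0, 2, 4 and 6 seconds; with overlap 0 it yields two.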
Phase 3 · step 2
Off-platform training
Download a dataset manifest, ship it to a GPU pod, run RVC training, bring the .pth + .index files back. Detailed steps in ml/voice/README.md.
- RunPod RTX 3090 (~$0.30/hr) — clone the RVC repo, install requirements.
- Sync your storage/datasets/{id} folder onto the pod.
- Process → extract features (RMVPE) → one-click train.
- 200–400 epochs for 30 min of clean vocal data.
- scp the .pth + .index back, upload below.
Phase 3 · step 3
Checkpoint registry
Upload the .pth (and optional .index) you trained on the GPU pod. See ml/voice/README.md.
.pth weights (required)
.index (optional, RVC)
Phase 3 · step 4
Render console
Convert a guide vocal / hum / acapella into the trained voice. Requires a working inference backend — set SM_VOICE_BACKEND in .env.
Input audio
Checkpoint
Transpose +0 semitones
Dryness 0.75
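For intuition about the transpose control, here is a small sketch of the standard equal-temperament relation between semitones and pitch ratio, along with a check that SM_VOICE_BACKEND is set. The function names are hypothetical, and how the inference backend applies the shift internally is an assumption; the valid values of SM_VOICE_BACKEND are not listed on this page, so the check only tests for presence.

```python
import os


def transpose_ratio(semitones: float) -> float:
    """Frequency multiplier for a shift of `semitones` in equal temperament."""
    return 2.0 ** (semitones / 12.0)


def backend_configured(env=None) -> bool:
    """True if SM_VOICE_BACKEND is set to a non-empty value.

    Reads os.environ by default; pass a dict to check other settings.
    """
    env = os.environ if env is None else env
    return bool(env.get("SM_VOICE_BACKEND", "").strip())
```

A transpose of +0 leaves pitch unchanged (ratio 1.0), +12 doubles every frequency, and -12 halves it.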
Phase 3 · scoring
Voice similarity
Compare two clips, e.g. a real reference vocal vs. a rendered one. The score is the cosine similarity of their Resemblyzer speaker embeddings; ≥ 0.75 indicates a strong match.
Clip A (e.g. real reference)
Clip B (e.g. rendered)
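The score can be reproduced offline with a sketch like the one below. It assumes the `resemblyzer` package is installed and uses its `VoiceEncoder` and `preprocess_wav`; the helper names and the way the platform preprocesses clips are assumptions, so results may differ slightly from the on-page score.

```python
import math
from typing import Sequence, Tuple

STRONG_MATCH = 0.75  # threshold quoted on this page


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return float(dot / (norm_a * norm_b))


def embed_clip(path: str):
    """Speaker embedding for one audio file (requires resemblyzer)."""
    # Imported lazily so cosine_similarity works without the package.
    from resemblyzer import VoiceEncoder, preprocess_wav

    encoder = VoiceEncoder()
    wav = preprocess_wav(path)  # loads, resamples and trims silence
    return encoder.embed_utterance(wav)


def score_clips(path_a: str, path_b: str) -> Tuple[float, bool]:
    """Return (similarity, is_strong_match) for two audio files."""
    sim = cosine_similarity(embed_clip(path_a), embed_clip(path_b))
    return sim, sim >= STRONG_MATCH
```

Calling `score_clips("reference.wav", "rendered.wav")` with your own file paths returns the similarity and whether it clears the 0.75 threshold.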