phase iii — voice lab

Train the voice.

Build training datasets from your vocal stems. Register checkpoints you trained on a GPU pod. Render new performances in your voice. Score how close you got.

Training happens off-platform — see ml/voice/README.md for the RunPod recipe.

Phase 3 · step 1

Dataset builder

Selects vocal stems by tag and quality, slices them into overlapping segments, and exports a training manifest.

Source tags (all must match)

Min vocal quality ≥ 60

Segment seconds

Overlap (0–0.9)
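The selection and slicing described above can be sketched roughly as follows. This is a minimal illustration, not the builder's actual code: the `Stem` fields, the function names, the manifest row shape, and the choice to drop a trailing partial segment are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Stem:
    # Hypothetical record; the real stem schema is not shown in this page.
    path: str
    tags: set
    quality: float   # assumed to be a 0-100 vocal quality score
    duration: float  # seconds


def select_stems(stems, required_tags, min_quality=60):
    """Keep stems that carry every required tag and meet the quality floor."""
    required = set(required_tags)
    return [s for s in stems if required <= s.tags and s.quality >= min_quality]


def slice_windows(duration, segment_seconds, overlap):
    """Return (start, end) windows of fixed length.

    Consecutive windows share `overlap` (a fraction, 0-0.9) of their length.
    A trailing remainder shorter than one segment is dropped (an assumption).
    """
    if not 0 <= overlap <= 0.9:
        raise ValueError("overlap must be in [0, 0.9]")
    hop = segment_seconds * (1 - overlap)
    windows = []
    i = 0
    while i * hop + segment_seconds <= duration + 1e-9:
        start = i * hop
        windows.append((round(start, 3), round(start + segment_seconds, 3)))
        i += 1
    return windows


def build_manifest(stems, required_tags, min_quality, segment_seconds, overlap):
    """Produce one manifest row per segment of every selected stem."""
    rows = []
    for stem in select_stems(stems, required_tags, min_quality):
        for start, end in slice_windows(stem.duration, segment_seconds, overlap):
            rows.append({"path": stem.path, "start": start, "end": end})
    return rows
```

With 4-second segments and an overlap of 0.5, each window starts 2 seconds after the previous one, so a 10-second stem yields four segments.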

Phase 3 · step 2

Off-platform training

Download a dataset manifest, ship it to a GPU pod, run RVC training, and bring the .pth and .index files back. Detailed steps are in ml/voice/README.md.

Start a RunPod RTX 3090 pod (~$0.30/hr), clone the RVC repo, and install its requirements.
Sync your storage/datasets/{id} folder onto the pod.
Process → extract features (RMVPE) → one-click train.
Plan on 200–400 epochs for 30 min of clean vocal data.
Copy the .pth and .index back with scp, then upload them below.

Phase 3 · step 3

Checkpoint registry

Upload the .pth (and optional .index) you trained on the GPU pod. See ml/voice/README.md.

.pth weights (required)

.index (optional, RVC)
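The required/optional rule for the two files could be checked before upload along these lines. This is only a sketch: `validate_checkpoint_upload` is a hypothetical helper, and it inspects file names only, not file contents.

```python
from pathlib import Path


def validate_checkpoint_upload(weights_name, index_name=None):
    """Return a list of problems; an empty list means the pair looks acceptable.

    Hypothetical helper: checks extensions only, never the file contents.
    """
    problems = []
    if not weights_name:
        problems.append(".pth weights file is required")
    elif Path(weights_name).suffix.lower() != ".pth":
        problems.append(f"weights must be a .pth file, got {weights_name!r}")
    # The .index file is optional, so it is only checked when provided.
    if index_name and Path(index_name).suffix.lower() != ".index":
        problems.append(f"index must be a .index file, got {index_name!r}")
    return problems
```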

Phase 3 · step 4

Render console

Convert a guide vocal / hum / acapella into the trained voice. Requires a working inference backend — set SM_VOICE_BACKEND in .env.

Input audio

Checkpoint

Transpose +0 semitones

Dryness 0.75
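For reference, a transpose value in semitones corresponds to a frequency ratio under twelve-tone equal temperament. The sketch below shows only that arithmetic; how the configured inference backend actually applies the shift is not described here, and the function names are illustrative.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def transpose_hz(frequency_hz, semitones):
    """Shift a frequency by the given number of semitones."""
    return frequency_hz * semitone_ratio(semitones)
```

A setting of +0 leaves pitch unchanged, +12 doubles every frequency (one octave up), and -12 halves it.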

Phase 3 · scoring

Voice similarity

Compare two clips, e.g. a real reference vocal vs. a rendered one. The score is the cosine similarity of their Resemblyzer speaker embeddings; ≥ 0.75 indicates a strong match.

Clip A (e.g. real reference)

Clip B (e.g. rendered)
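The score itself is simple to compute once both embeddings exist. The sketch below assumes the two embeddings have already been extracted (with Resemblyzer this would typically be done through its `VoiceEncoder`) and are passed in as plain sequences of floats; the function names and the `STRONG_MATCH` constant are illustrative, with the 0.75 threshold taken from the text above.

```python
import math

STRONG_MATCH = 0.75  # threshold quoted in the scoring description


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def is_strong_match(a, b, threshold=STRONG_MATCH):
    """True when the two embeddings score at or above the threshold."""
    return cosine_similarity(a, b) >= threshold
```

Because cosine similarity ignores vector length, only the direction of the two embeddings affects the score.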