Voice Model Creation

Build the best possible voice model by creating a high-quality dataset using the tips below.

How to create your dataset.

Gather 30-60 total minutes of dry (no effects) and monophonic (one note at a time) vocals.

  • No reverb, delay, chorus, or instrumentals.

  • No harmonies, layering, double-tracking, or stereo effects.

  • No variation in vocal style, e.g., just singing or just rapping, but not both.

Bad vocals

Stereo, reverb, delay

Good vocals

Mono, clean tone, low noise

Getting your file(s) ready.

Export your files with no silence and consistent volume as a 16-bit lossless audio file (.wav preferred).

Before: silence, inconsistent volume levels

After: truncated silence, consistent volume

Once you’ve compiled your vocals, the next step is to prepare your files for training:

  • Remove any extra silence (we recommend doing this automatically with Audacity)

  • Export as true mono (rather than stereo with equal L + R channels)

  • Export as 16-bit .wav (there is no per-file length requirement; one 15-minute file or fifteen 1-minute files both work)
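The checklist above can be verified with a short script before uploading. The following is a minimal sketch using only Python's standard `wave` module; the function names `describe_wav` and `check_dataset` are hypothetical helpers, not part of any Kits.AI tool. Note that it only checks the channel count and bit depth; it cannot tell whether the audio itself is dry or free of silence.

```python
import wave


def describe_wav(path):
    """Return channel count, bit depth, sample rate, and duration of a WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        bit_depth = wav.getsampwidth() * 8  # bytes per sample -> bits
        sample_rate = wav.getframerate()
        seconds = wav.getnframes() / sample_rate
    return {
        "channels": channels,
        "bit_depth": bit_depth,
        "sample_rate": sample_rate,
        "seconds": seconds,
    }


def check_dataset(paths):
    """Collect format problems and the total duration (in minutes) for a list of WAV files."""
    problems = []
    total_seconds = 0.0
    for path in paths:
        info = describe_wav(path)
        total_seconds += info["seconds"]
        if info["channels"] != 1:
            problems.append(f"{path}: {info['channels']} channels, expected true mono")
        if info["bit_depth"] != 16:
            problems.append(f"{path}: {info['bit_depth']}-bit, expected 16-bit")
    return problems, total_seconds / 60.0
```

Calling `check_dataset(["take1.wav", "take2.wav"])` (placeholder file names) returns a list of problem descriptions and the combined length in minutes, which you can compare against the dataset length you are aiming for.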

How to convert to mono and remove silence with Audacity

Use the Kits.AI Vocal Separator tool to isolate vocals for your dataset.

To isolate vocals from a song, simply upload a file into the Kits.AI Vocal Separator tool. This is an easy way to create your own dataset.

Advanced dataset techniques.

Pre-process your audio for higher quality.

Your audio can be:

  • Cleaned up with subtractive EQ to reduce muddy or harsh frequencies in the recording

  • Subtly pitch corrected (slow attack, moderate strength), unless stronger pitch correction is a key part of the vocal style

  • De-essed to reduce any harsh sibilance

  • Compressed lightly to even out dynamic range and reduce peaks (~4-5 dB of gain reduction at most)

  • Boosted with additive EQ to fit the style of the vocal

  • Limited to a peak of -6 dB, with overall levels between -12 and -6 dB

  • High-passed and low-passed to remove frequencies below 40–100 Hz and above 20 kHz

  • Phase re-balanced
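To make the level targets above concrete, here is a small sketch of how a peak level in dBFS can be measured from 16-bit samples, and how much linear gain would move that peak to -6 dB. This is plain peak normalization, not a limiter, and it is offered only as an illustration; `peak_dbfs` and `gain_to_target` are hypothetical names, and the samples are assumed to be signed 16-bit integers.

```python
import math

FULL_SCALE = 32768  # largest magnitude of a signed 16-bit sample


def peak_dbfs(samples):
    """Peak level in dBFS for a sequence of signed 16-bit integer samples."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence has no finite level
    return 20 * math.log10(peak / FULL_SCALE)


def gain_to_target(samples, target_dbfs=-6.0):
    """Linear gain factor that would move the measured peak to target_dbfs."""
    current = peak_dbfs(samples)
    if current == float("-inf"):
        return 1.0  # nothing to scale
    return 10 ** ((target_dbfs - current) / 20)
```

A sample at half of full scale measures roughly -6 dBFS, so a file whose loudest sample is around 16384 already sits at the suggested peak.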

Record your own vocals.

Recording vocals for your model? Here are some configurations to get you started:

  • Use a quality mic with a wide frequency range (40 Hz–20 kHz)

  • Set your recording sample rate to 48 kHz and file type to lossless (.wav, .aiff, .flac)

  • Limit breath sounds and try to capture a clean tone (avoid plosives, place the mic off-axis and/or use a pop filter if singing in a breathy style)

  • Avoid room reflections (record in a room with soft surfaces like carpet and furniture to absorb sound, place microphones away from walls, move closer to the mic and reduce your input gain)

  • Monitor your recording volume and avoid exceeding -6 dBFS. Try to keep your levels between -12 and -6 dBFS.

  • Export your audio as true mono (rather than stereo with equal L + R channels)

  • Avoid any hard cuts on audio (add a short fade out to avoid pops that come from cutting audio before or after a zero crossing)
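The short fade-out mentioned in the last tip can be sketched as follows. This is an illustrative example only: the 10 ms default is an assumed value, not a recommendation from this guide, `fade_out` is a hypothetical helper, and any DAW or audio editor offers the same operation.

```python
def fade_out(samples, sample_rate, fade_ms=10):
    """Return a copy of samples with a linear fade to silence over the last fade_ms."""
    faded = list(samples)
    # Number of samples covered by the fade, capped at the clip length.
    n = min(len(faded), int(sample_rate * fade_ms / 1000))
    for i in range(n):
        index = len(faded) - n + i
        # Gain falls linearly and reaches exactly 0 at the final sample.
        faded[index] = int(faded[index] * (n - 1 - i) / n)
    return faded
```

Because the final sample is scaled to zero, the clip ends at silence regardless of where the cut falls in the waveform.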

Content

The more variety, the better.

It is best to have examples covering your entire range: chest, mix, and falsetto; large and small intervals; gritty and clean notes; and so on.

You can sing the same lyrics in different keys, a couple of songs from your repertoire, originals, etc. The audio can be in multiple files or in one single take, as long as the singing time adds up to 10–15 minutes.

Techniques

How to convert to True Mono

Use the free Audacity program to convert stereo files to true mono.
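If you prefer a scripted route, the same conversion can be sketched with Python's standard library. This is an alternative to the Audacity workflow, not a description of it: the hypothetical `stereo_to_mono` helper assumes a 16-bit stereo WAV file and simply averages the left and right channels into a single channel.

```python
import array
import sys
import wave


def stereo_to_mono(src_path, dst_path):
    """Average the left and right channels of a 16-bit stereo WAV into one mono channel."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo WAV file")
        sample_rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    interleaved = array.array("h")  # signed 16-bit samples: L, R, L, R, ...
    interleaved.frombytes(raw)
    if sys.byteorder == "big":
        interleaved.byteswap()  # WAV sample data is little-endian

    left = interleaved[0::2]
    right = interleaved[1::2]
    mono = array.array("h", ((l + r) // 2 for l, r in zip(left, right)))
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)  # one channel, i.e. true mono
        dst.setsampwidth(2)
        dst.setframerate(sample_rate)
        dst.writeframes(mono.tobytes())
```

The output file has a single channel rather than two identical ones, which is the distinction between true mono and stereo with equal L + R channels.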

How to remove silence

Use the free Audacity program to quickly remove silence from an acapella.

(Copy the settings in this video but feel free to experiment. Choose a threshold between -20 dB and -40 dB, depending on the noise level of your acapella.)
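To illustrate what the threshold setting means, here is a simplified sketch of threshold-based silence removal. It does not reproduce Audacity's algorithm: it assumes signed 16-bit samples, splits the audio into fixed windows (the 50 ms window size is an assumed value), and drops every window whose peak stays below the threshold. `remove_silence` is a hypothetical helper.

```python
def remove_silence(samples, sample_rate, threshold_db=-40.0, window_ms=50):
    """Drop fixed-size windows whose peak level is below threshold_db (in dBFS)."""
    # Convert the dB threshold to a 16-bit amplitude (32768 = full scale).
    limit = 32768 * 10 ** (threshold_db / 20)
    window = max(1, int(sample_rate * window_ms / 1000))
    kept = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        if max(abs(s) for s in chunk) >= limit:
            kept.extend(chunk)  # loud enough: keep the whole window
    return kept
```

A higher threshold such as -20 dB removes more material, including quiet breaths and room noise, while -40 dB only removes near-silence. Joining windows this way can leave abrupt cuts, so a dedicated editor is still the safer choice for final files.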

FAQ

Q: How long does model training take?

A: Depending on the size of your dataset, model training can take anywhere from 30 minutes to several hours. Don't worry, though: as long as you see Training on your create voices dashboard, your model will finish soon.

Q: My model is taking forever to upload! What’s going on?

A: If you're uploading a large file, it can take a long time for the data to reach our backend. Just press "Upload" and be patient; it will process eventually. Be sure not to refresh the page during your upload.

Q: What do I do if I see an error?

A: If you see an error during upload, contact us through our bug form.

Get started, free. No credit card required.

Streamline your vocal production workflow with Kits AI's free plan. Convert a voice and hear what's possible.
