Common Mistakes to Avoid When Using AI Vocals

Read on for the best tips and tricks for making the most of your AI vocal conversions, compiled by writer, songwriter, and producer Sam Kearney.

Written by

Sam Kearney

Published on

August 23, 2024

Introduction

Thanks to advances in artificial intelligence, AI vocals have become an exciting and innovative tool for musicians and producers. Like any new technology, though, they require some fine-tuning to get the best results. At Kits.AI, we process datasets to create ideal setups for accurate and realistic AI vocal model training. Over time, I’ve noticed common mistakes that can hinder the performance of AI-generated vocals. In this article, I’ll highlight these pitfalls and offer tips on how to optimize your AI vocal models.

Level and Dynamics

The human voice is unique, much like a fingerprint, with its own timbre and emotional nuance. Singing is typically a heightened form of emotional expression and can naturally vary in loudness. When recording vocals, these variations are often managed using mic techniques and compressors. Experienced session singers may “self-compress” by adjusting their distance from the mic during loud sections. However, even with this technique, additional compression is usually needed to maintain a balanced mix.

Just as natural compression benefits songs, it also enhances the training process for AI vocal models. At Kits.AI, we’ve found that vocal tracks with a controlled dynamic range produce better vocal cloning results. My personal technique for preparing a vocal for training is to import the track into my DAW and use clip gain to level out the more extreme sections before applying any additional compression. This ensures the compressor works efficiently without introducing unnatural sounds.

In the image below, the top track shows the original data set, while the bottom track illustrates my leveling adjustments:

Two tracks in a DAW

By using this approach, only a light touch of compression is necessary. I recommend no more than 3-5 dB of gain reduction.

For optimal results, aim for an average level around -12 dBFS with peaks no higher than -6 dBFS. This provides a great foundation for machine learning and creates more realistic AI voice models.
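If you want to sanity-check a file against those targets before uploading, a few lines of code will do it. Below is a minimal sketch using Python’s soundfile and numpy packages; the filename is hypothetical, and the targets simply restate the figures above.

```python
import numpy as np
import soundfile as sf

# Targets from the recommendation above: ~-12 dBFS average, peaks <= -6 dBFS.
TARGET_RMS_DB = -12.0
TARGET_PEAK_DB = -6.0

def check_levels(path: str) -> None:
    """Print the RMS (average) and peak levels of a vocal file in dBFS."""
    audio, _sr = sf.read(path)
    if audio.ndim > 1:  # fold stereo to mono for metering
        audio = audio.mean(axis=1)
    rms_db = 20 * np.log10(np.sqrt(np.mean(audio ** 2)) + 1e-12)
    peak_db = 20 * np.log10(np.max(np.abs(audio)) + 1e-12)
    print(f"RMS:  {rms_db:6.1f} dBFS (target ~{TARGET_RMS_DB})")
    print(f"Peak: {peak_db:6.1f} dBFS (target <= {TARGET_PEAK_DB})")

check_levels("lead_vocal.wav")  # hypothetical filename
```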

De-ess to Reduce Harsh Sibilance

Harsh sibilance, caused by consonants like “s,” “t,” and “z,” can be distracting and unpleasant in vocal recordings. A de-esser, such as FabFilter’s Pro-DS, is essential for controlling these bright sounds and ensures your AI voice model isn’t trained to replicate them, resulting in smoother, more professional output.

FabFilter Pro DS
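If you’re curious what a de-esser actually does, the core idea is easy to sketch: isolate the sibilance band, watch its level, and duck the signal whenever it spikes. Here is a rough Python illustration using scipy and numpy, not a substitute for a tuned plugin like Pro-DS; the band edges, threshold, and reduction amount are assumptions to adjust by ear.

```python
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfilt

def deess(audio, sr, band=(5000, 10000), threshold_db=-30.0, reduction_db=6.0):
    """Crude de-esser: duck the signal when the sibilance band gets loud."""
    sos = butter(4, band, btype="bandpass", fs=sr, output="sos")
    sib = sosfilt(sos, audio)  # isolate the "s"/"t"/"z" band
    # Short-term envelope of the sibilance band (~5 ms frames).
    frame = max(1, int(0.005 * sr))
    env = np.sqrt(np.convolve(sib ** 2, np.ones(frame) / frame, mode="same"))
    env_db = 20 * np.log10(env + 1e-12)
    gain = np.where(env_db > threshold_db, 10 ** (-reduction_db / 20), 1.0)
    return audio * gain

audio, sr = sf.read("lead_vocal.wav")  # hypothetical filename
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # assumes mono (see the true-mono section below)
sf.write("lead_vocal_deessed.wav", deess(audio, sr), sr)
```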

EQ: Balancing the Spectrum

Equalization (EQ) plays a crucial role in shaping the sound of a vocal recording. While specific EQ settings vary with the musical content, a well-balanced EQ can significantly improve the quality of your AI voice model and provide a strong starting point for whatever context and genre the model will be used in.

Start with a high-pass filter to remove any unnecessary low-end frequencies that don’t contribute to the vocal tone. However, take care when going above 100 Hz, as this could strip away important elements of the vocal timbre.

On the other end of the spectrum, be mindful of the harsh high-end frequencies that many more affordable microphones introduce; not everyone has a vintage Neumann to sing through (myself included). A gentle low-pass filter, typically set around 20 kHz, can help tame this content.

Using an EQ like the Pultec EQP-1A, known for its smooth and warm character, is a great choice for cleaning up low-end rumble and softening the highs. 

Adjusting EQ with the Pultec EQP-1A
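For batch processing, the same high-pass/low-pass cleanup can be expressed in code. Below is a minimal sketch using scipy’s Butterworth filters; the cutoffs mirror the suggestions above, and it makes no attempt to model the Pultec’s character.

```python
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def cleanup_eq(path_in, path_out, hp_hz=100, lp_hz=20000):
    """High-pass out low-end rumble and low-pass the harsh extreme highs."""
    audio, sr = sf.read(path_in)
    hp = butter(2, hp_hz, btype="highpass", fs=sr, output="sos")
    audio = sosfiltfilt(hp, audio, axis=0)  # zero-phase: no smearing of the vocal
    if lp_hz < sr / 2:  # a low-pass is only valid below the Nyquist frequency
        lp = butter(2, lp_hz, btype="lowpass", fs=sr, output="sos")
        audio = sosfiltfilt(lp, audio, axis=0)
    sf.write(path_out, audio, sr)

cleanup_eq("lead_vocal.wav", "lead_vocal_eq.wav")  # hypothetical filenames
```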

Pitch Correction: When and How to Use It

Pitch correction tools, like the free version of Antares Auto-Tune, are often used as an effect in modern music production. However, when training an AI voice model, I recommend keeping the vocals natural and applying pitch correction after the vocal has already been cloned. This approach maintains the realism of your AI model and offers flexibility for future projects that may require a more natural sound.

Vocal Variety: Expand Your Source Material

One of the most common mistakes in AI vocal training is a lack of variety in the vocal dataset. Machine learning models can only learn from the material provided, so a limited dataset results in a limited vocal model. For example, I’ve received submissions of a singer performing the same song over and over. They may sound great on that one song, but the higher and lower pitches and the more intense or softer inflections I know they’re capable of never appear in the data, so the model can’t learn them. The result is an AI voice model with a very limited use case.

To create versatile AI voices, include a wide range of vocal performances in your training material. This should cover different pitches, emotional expressions, and vocal techniques, including both chest and falsetto voices, to mimic the versatility of a real artist. Although the minimum requirement is 15 minutes of audio, I recommend utilizing the full 30 minutes to capture the full range of the vocalist’s abilities.
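Before submitting, it can help to audit your dataset for total length and pitch coverage. Here is a rough sketch using the librosa library’s pYIN pitch tracker; the file list is hypothetical, and the duration targets restate the figures above.

```python
import librosa
import numpy as np

files = ["ballad.wav", "uptempo.wav", "falsetto_takes.wav"]  # hypothetical dataset

total_sec = 0.0
f0_all = []
for path in files:
    y, sr = librosa.load(path, sr=None, mono=True)
    total_sec += len(y) / sr
    # pYIN fundamental-frequency estimate across a wide vocal range (C2-C6).
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    f0_all.append(f0[voiced])  # keep only frames detected as voiced

f0_all = np.concatenate(f0_all)
print(f"Total audio: {total_sec / 60:.1f} min (15 min minimum, aim for the full 30)")
print(f"Pitch range: {librosa.hz_to_note(f0_all.min())} "
      f"to {librosa.hz_to_note(f0_all.max())}")
```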

Remove Empty Space

Vocal submissions are often a cappella versions of songs in their entirety. Since the machine learning process only analyzes the vocal performance, long empty spaces, which may correspond to a song’s instrumental sections, are unnecessary and take up valuable time in the dataset. To optimize your AI voice model, remove any non-vocal sections and ensure the audio is continuous, as shown in my initial example above. This maximizes the training data and helps your model retain as much realism as possible.
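Trimming silence by hand works, but it can also be automated. Below is a minimal sketch using librosa’s split function; the 30 dB silence threshold is an assumption worth adjusting per recording.

```python
import librosa
import numpy as np
import soundfile as sf

def strip_silence(path_in, path_out, top_db=30):
    """Keep only the sung regions and join them into continuous audio."""
    y, sr = librosa.load(path_in, sr=None, mono=True)
    intervals = librosa.effects.split(y, top_db=top_db)  # non-silent spans
    voiced = np.concatenate([y[start:end] for start, end in intervals])
    sf.write(path_out, voiced, sr)

strip_silence("full_song_vocal.wav", "vocal_trimmed.wav")  # hypothetical filenames
```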

Export Your Audio As True Mono

Finally, always export your vocal stems as true mono tracks. Submitting stereo tracks, even if the recording was in mono, doubles the perceived data and reduces the amount of usable material for training. To get the best voice cloning results, maximize the amount of material your model can be trained on by bouncing your vocal track to mono before uploading to Kits.AI.
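If you do end up with a stereo bounce, the fold-down to true mono is easy to perform yourself. Here is a minimal sketch with the soundfile package; averaging the channels assumes the file is genuinely dual-mono (identical left and right).

```python
import soundfile as sf

audio, sr = sf.read("vocal_stereo.wav")  # hypothetical filename
if audio.ndim > 1:
    # Average the channels; for a dual-mono recording they are identical.
    audio = audio.mean(axis=1)
sf.write("vocal_mono.wav", audio, sr, subtype="PCM_24")
```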

Conclusion

By following these tips, you can avoid common AI vocal mistakes and start unlocking the full potential of this powerful tool. Remember: AI is not a creative tool; it’s a creator’s tool. Like all new tools and emerging technology, there is a learning curve, but with the right approach, incorporating AI vocals into your music can open up possibilities that were once unimaginable.




-SK

Sam Kearney is a producer, composer and sound designer based in Evergreen, CO.
