How to Optimize AI Voice Model Training

Read our guide on how to build the best AI voice clone with Kits AI.

Written by

Sam Kearney

Published on

September 17, 2024

Though it may seem counterintuitive, a great sounding AI Voice Model doesn’t require singers with perfect pitch. One of the most common mistakes I encounter when reviewing submissions for our Community Voices program is datasets heavily altered with auto-tune. From the outside, it’s understandable that many would assume pitch-perfect datasets equal pitch-perfect models. In this post, we’ll explore why using pitch correction can actually harm the quality of your AI voice model, along with other helpful tips to train a more natural, realistic model.

Quality in = quality out

The More, the Better!

AI vocal models thrive on diverse data. If you upload a three-and-a-half-minute song in a low vocal range, the model might sound great for that particular song, but it will lack the versatility of a real-life singer’s full range. For optimal results, aim for at least 30 minutes of vocal material that spans a wide range of pitches, dynamics, and delivery styles.

Incorporate everything from soft, delicate notes to full-energy belts, covering the broad spectrum of a singer’s abilities. This diversity ensures your model sounds natural and versatile, capable of performing across a wide array of material without being constrained by a limited dataset.
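If you want a quick sanity check before uploading, here’s a rough Python sketch (not an official Kits AI tool) that totals up the material in a folder of WAV takes and estimates the pitch range it covers. The folder name is a placeholder, and it assumes the librosa library is installed:

```python
# A minimal sketch for surveying a dataset: total duration plus a rough
# estimate of the pitch range covered. The folder path is hypothetical.
from pathlib import Path

import librosa
import numpy as np

DATASET_DIR = Path("training_vocals")  # hypothetical folder of dry vocal takes

total_seconds = 0.0
f0_values = []

for wav_path in sorted(DATASET_DIR.glob("*.wav")):
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    total_seconds += librosa.get_duration(y=y, sr=sr)

    # Estimate the fundamental frequency over a wide vocal range (C2-C6).
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    f0_values.append(f0[voiced_flag])  # keep only voiced frames

f0_all = np.concatenate(f0_values) if f0_values else np.array([])

print(f"Total material: {total_seconds / 60:.1f} minutes (aim for 30+)")
if f0_all.size:
    low, high = np.nanpercentile(f0_all, [5, 95])
    print(f"Pitch range covered: {librosa.hz_to_note(low)} to {librosa.hz_to_note(high)}")
```

If the total comes in well under 30 minutes, or the estimated range only spans a few notes, that’s a sign to record more varied material before training.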

[Image: File upload page of the Kits AI voice cloning feature]

Bounce to True Mono!

A common oversight is uploading stereo audio instead of true mono when training a voice model. Kits currently allows a maximum of 200 MB of training data, so bouncing tracks to stereo, even if recorded with a single microphone, can unnecessarily double your file size. This reduces the amount of usable training data.

By ensuring your vocals are bounced to true mono, you maximize the amount of training data and avoid hitting the size limit too soon. Stereo is essential for modern productions, but a mono bounce carries everything the model needs at half the file size.
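If you’re not sure whether your bounces are true mono, here’s a minimal sketch using the soundfile library; the file names are placeholders. It keeps one channel when the two are identical (a “dual mono” bounce) and sums to mono otherwise:

```python
# A minimal sketch for bouncing a stereo WAV down to true mono before upload.
import numpy as np
import soundfile as sf

IN_PATH = "lead_vocal_stereo.wav"   # hypothetical stereo bounce
OUT_PATH = "lead_vocal_mono.wav"

data, sr = sf.read(IN_PATH)  # shape is (frames, channels) for multichannel audio

if data.ndim == 1:
    mono = data                     # already mono
elif np.allclose(data[:, 0], data[:, 1]):
    mono = data[:, 0]               # dual mono: channels are identical, keep one
else:
    mono = data.mean(axis=1)        # genuine stereo: sum to mono

sf.write(OUT_PATH, mono, sr)
print(f"Wrote {OUT_PATH}: {len(mono) / sr:.1f} s, 1 channel at {sr} Hz")
```

For reference, a mono 24-bit, 44.1 kHz WAV runs roughly 8 MB per minute, so true mono leaves far more headroom under the 200 MB cap than a stereo bounce of the same takes.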

[Image: Antares Auto-Tune]

Autotune and Pitch Correction Aren’t Necessary!

As I mentioned earlier, pitch-perfect vocals aren’t required for training data. Every singer, even one with exceptional pitch, has natural variations in their voice. Hard-tuning with Antares Auto-Tune might suit your production style, but feeding tuned vocals to the model can result in robotic, unnatural-sounding AI voices.

The key is to save pitch correction for post-production. Training your AI voice model with natural, unprocessed vocals will yield a more realistic sound and prevent your model from being locked into one specific, overly processed style.
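If you’re unsure whether a take was already tuned before it reached you, one rough heuristic (my own illustrative check, not anything Kits AI documents) is to look at how far the pitch drifts from the equal-tempered grid. Natural singing usually shows a healthy spread of cents, while hard-tuned vocals collapse toward zero:

```python
# An illustrative heuristic: measure how far each voiced frame sits from the
# nearest semitone. A very small spread can hint the take was hard-tuned.
import librosa
import numpy as np

PATH = "lead_vocal_dry.wav"  # hypothetical dry vocal take

y, sr = librosa.load(PATH, sr=None, mono=True)
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
f0 = f0[voiced_flag]  # keep only voiced frames

# Distance (in cents) from each voiced frame to the nearest semitone.
midi = librosa.hz_to_midi(f0)
cents_off = (midi - np.round(midi)) * 100

print(f"Median deviation from the grid: {np.median(np.abs(cents_off)):.1f} cents")
```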

[Image: Guidelines for vocal input for the Kits AI voice clone feature]

Save the Effects For Post!

Effects like reverb, delay, and modulation are great for enhancing vocal performances, but they should be avoided when creating training data. These effects can interfere with the machine learning process, which focuses on capturing the natural essence of the human voice. Including them in your dataset can result in models filled with digital artifacts, making them sound less lifelike.

Instead, focus on capturing dry, clean vocals. You can always add effects later. If room reflections are an issue, try recording in a small space like a closet, or use a reflection filter like the sE RF-X to minimize reverb and ensure a cleaner dataset.

[Image: Avoid background noise]

Prioritize Sonic Consistency

While diversity in vocal delivery can enhance your AI model, consistency in recording quality is crucial. Background noise from fans, air conditioners, or other household items can negatively affect the outcome of your model. Take note of preamp levels and any distortion caused by clipping the mic or interface. Keep an ear out for any inconsistencies and ensure a clean, distortion-free capture.

Slight vocal variations due to daily changes in the singer’s voice can actually add depth to your model, but make sure the technical side of your recording remains consistent to maintain high-quality results.
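Before a take goes into the dataset, it can be worth running a quick technical check for clipping and background noise. Here’s a small sketch with numpy and soundfile; the file name and thresholds are illustrative assumptions, not Kits AI requirements:

```python
# A minimal sketch that flags technical problems in a take: clipped samples
# and a high noise floor. Thresholds below are illustrative assumptions.
import numpy as np
import soundfile as sf

PATH = "take_03.wav"  # hypothetical vocal take

data, sr = sf.read(PATH)
if data.ndim > 1:
    data = data.mean(axis=1)  # run the checks on a mono mix

# Clipping check: count samples sitting at (or extremely close to) full scale.
clipped = int(np.sum(np.abs(data) >= 0.999))
print(f"Samples at full scale: {clipped}")

# Noise-floor estimate: RMS of the quietest 10% of 50 ms windows.
frame = int(0.05 * sr)
n_frames = len(data) // frame
frames = data[: n_frames * frame].reshape(n_frames, frame)
rms = np.sqrt(np.mean(frames**2, axis=1))
noise_floor_db = 20 * np.log10(np.percentile(rms, 10) + 1e-12)
print(f"Estimated noise floor: {noise_floor_db:.1f} dBFS")

if clipped > 0:
    print("Warning: possible clipping at the mic or interface.")
if noise_floor_db > -60:  # illustrative threshold
    print("Warning: noticeable background noise (fans, AC, room tone).")
```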

Conclusion

When building an AI voice model, it’s easy to assume traditional vocal production techniques will improve the result. In practice, using natural, diverse data, maintaining technical consistency, and saving effects for post-production will give you a more realistic, versatile voice model. Kits AI can unlock incredible creative possibilities, and with the right approach, you can get the most out of your AI voice models. For additional recording guidelines, follow this link for Kits’ recommendations for capturing high-quality datasets.


-SK

Sam Kearney is a producer, composer and sound designer based in Evergreen, CO.
