Kits Voice Conversion (KVC)

KVC: Studio-Quality Singing Voice Conversion

Kits.AI is the world’s leading platform for professional AI singing voice conversion. Millions of music producers and vocalists rely on Kits for studio-quality AI vocals that capture the natural intonation, dynamics, and nuance of the human voice.

The research team at Kits.AI has designed Kits Voice Conversion (KVC), an industry-leading voice-to-voice conversion system that pushes the boundaries of output quality.

This page is an overview of the growing list of innovations within KVC — improved architecture, robust pre-trained weights, and optimized infrastructure — that make it the top choice for industry professionals worldwide.

KVC Architecture: Optimized for Singing

KVC's architecture is optimized specifically for professional-quality singing output. This section outlines the architectural improvements that enable KVC to outperform open-source singing voice conversion (SVC) systems across a number of dimensions, including pronunciation, pitch accuracy, frequency range, and dynamics.

Kits Base Weights

Kits has curated and hand-processed a proprietary dataset sourced from individual vocalists who are compensated for the rights to train on recordings of their voice. These recordings form the dataset that KVC base weights are trained on. Whenever a voice is cloned with KVC, it draws from the quality of this dataset.

Our training data, data sourcing and data management practices are certified as Fairly Trained. We remain committed to respecting the rights of artists and supporting them financially.

Pitch Detection: Kits Hybrid Pitch

Accurate detection of the fundamental frequency (F0) is critical for the SVC task. The Kits Research team has developed a custom pitch detection algorithm, Kits Hybrid Pitch, that outperforms baseline Crepe, RMVPE, and Mangio-Crepe, leading to improved conversion results.

Metrics for RMVPE

Metrics for Hybrid
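
This page does not state which metrics were measured. As a rough illustration of how pitch detectors are often compared, the sketch below computes raw pitch accuracy: the share of voiced reference frames where the estimate lands within 50 cents of the reference. The function name, the 50-cent tolerance, and the toy data are assumptions for illustration, not the evaluation used for Kits Hybrid Pitch.

```python
import math


def raw_pitch_accuracy(reference_hz, estimate_hz, tolerance_cents=50.0):
    """Fraction of voiced reference frames whose estimate is within tolerance.

    Frames with a reference of 0 Hz are treated as unvoiced and skipped.
    An estimate of 0 Hz on a voiced frame counts as a miss.
    """
    voiced = [(r, e) for r, e in zip(reference_hz, estimate_hz) if r > 0]
    if not voiced:
        return 0.0
    hits = 0
    for ref, est in voiced:
        if est > 0 and abs(1200.0 * math.log2(est / ref)) <= tolerance_cents:
            hits += 1
    return hits / len(voiced)


# Toy data (hypothetical): one good frame, one octave error, one missed frame.
reference = [220.0, 220.0, 0.0, 440.0]
estimate = [221.0, 440.0, 0.0, 0.0]
rpa = raw_pitch_accuracy(reference, estimate)
```

Octave errors like the second frame are a typical failure mode of pitch trackers, which is why a cents-based tolerance is used rather than a plain difference in Hz.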

Open Source: RVC with RMVPE

Kits Hybrid

Adaptive Content Retrieval

KVC uses adaptive content feature retrieval smoothing, which yields higher speaker similarity than standard retrieval-based SVC systems like RVC. During inference, KVC takes the input features and applies retrieval strength adaptively: the more closely the features are aligned, the more the content features are pulled towards the retrieved features.

This preserves more phonemic content, improving both pronunciation and speaker similarity.
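
One way to picture adaptive retrieval strength is as a similarity-weighted blend between an input content frame and its nearest neighbor in an index of target-speaker features. The sketch below illustrates that idea only; it is not KVC's implementation. The names `adaptive_blend` and `max_strength`, and the use of cosine similarity as the measure of alignment, are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def adaptive_blend(frame, index, max_strength=0.75):
    """Pull `frame` toward its nearest index entry, scaled by similarity.

    Hypothetical illustration: the blend weight is max_strength * similarity,
    so well-aligned frames are pulled harder than poorly aligned ones.
    """
    nearest = max(index, key=lambda entry: cosine(frame, entry))
    similarity = max(0.0, cosine(frame, nearest))  # ignore negative matches
    weight = max_strength * similarity
    blended = [(1 - weight) * f + weight * n for f, n in zip(frame, nearest)]
    return blended, weight


index = [[1.0, 0.0], [0.0, 1.0]]  # toy target-speaker features
_, w_close = adaptive_blend([0.9, 0.1], index)
_, w_far = adaptive_blend([0.6, 0.5], index)
```

In this toy case the frame that sits close to an index entry receives a larger weight than the ambiguous one. A fixed-strength retrieval step, by contrast, would apply the same weight to both.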

Open Source: Contentvec + nearest neighbor retrieval

Kits: Adaptive feature retrieval

Advanced Content Encoding: Xeus, Hybrid

Open-source SVC systems use HuBERT or ContentVec weights. KVC integrates ContentVec as well as advanced content encoders like Xeus and hybrid systems, which can improve pronunciation. Examples are included below.

Training Pre-Processing

Intelligent Slicing

KVC uses a smarter slicing method that trains on longer, more complete phrases and avoids cutting in the middle of a word or phrase.
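
The slicing method itself is not described here. A common way to avoid mid-word cuts is to cut only inside sustained silence, as sketched below. This is an assumed approach rather than KVC's: the per-frame RMS input, the function names, and the threshold values are all hypothetical.

```python
def find_cut_points(frame_rms, silence_threshold=0.01, min_silence_frames=3):
    """Return frame indices at the midpoints of sufficiently long silent runs."""
    cuts = []
    run_start = None
    for i, level in enumerate(frame_rms):
        if level < silence_threshold:
            if run_start is None:
                run_start = i  # a silent run begins here
        elif run_start is not None:
            if i - run_start >= min_silence_frames:
                cuts.append((run_start + i) // 2)
            run_start = None  # shorter pauses inside a phrase are ignored
    return cuts


def slice_bounds(num_frames, cuts):
    """Turn cut points into (start, end) frame ranges."""
    edges = [0] + cuts + [num_frames]
    return [(a, b) for a, b in zip(edges, edges[1:]) if b > a]


# Toy per-frame loudness: a phrase, a long pause, then a phrase with a brief gap.
rms = [0.2, 0.3, 0.0, 0.0, 0.0, 0.0, 0.25, 0.3, 0.0, 0.2]
cuts = find_cut_points(rms)
segments = slice_bounds(len(rms), cuts)
```

Raising `min_silence_frames` yields fewer, longer slices, which matches the stated goal of training on more complete phrases.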

Breath and Noise Removal

KVC includes additional steps for adaptive noise removal to enhance quality.

Adaptive EQ for Spectral Balance

KVC applies automatic EQ adjustment in both training and inference, resulting in better spectral balance and closer parity between input and output audio.
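
To make the idea of automatic EQ adjustment concrete, the sketch below computes a clamped per-band gain that moves a signal's band energies toward a reference. It is a simplified illustration under assumptions: the band energies are taken as already measured, and `band_gains_db` and its 6 dB limit are hypothetical rather than part of KVC.

```python
import math


def band_gains_db(reference_energy, signal_energy, max_gain_db=6.0, eps=1e-12):
    """Per-band gain (dB) that moves `signal_energy` toward `reference_energy`.

    Both arguments are lists of non-negative band energies (hypothetical
    inputs, e.g. averaged over a whole file). Gains are clamped to
    +/- max_gain_db so that no band is boosted or cut too aggressively.
    """
    gains = []
    for ref, sig in zip(reference_energy, signal_energy):
        gain = 10.0 * math.log10((ref + eps) / (sig + eps))
        gains.append(max(-max_gain_db, min(max_gain_db, gain)))
    return gains


# Toy example with three bands (low, mid, high): lows too loud, highs too quiet.
gains = band_gains_db([1.0, 1.0, 1.0], [4.0, 1.0, 0.01])
```

Clamping the gain is a design choice that keeps the correction gentle, since a large boost in a nearly empty band would mostly amplify noise.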

Inference Post-Processing

Pitch Correction

Automatic pitch correction, inspired by tools like Antares Auto-Tune, is optionally applied during conversion.
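
A minimal sketch of the basic idea behind this kind of pitch correction: move each detected pitch toward the nearest note. The version below assumes a chromatic equal-tempered scale tuned to A4 = 440 Hz and a hypothetical `strength` control. Neither KVC's correction method nor its parameters are specified here.

```python
import math

A4_HZ = 440.0  # assumed tuning reference


def snap_to_semitone(f0_hz, strength=1.0):
    """Move a pitch toward the nearest equal-tempered semitone.

    strength=1.0 snaps fully, 0.0 leaves the pitch unchanged.
    Unvoiced frames (f0 <= 0) pass through untouched.
    """
    if f0_hz <= 0:
        return f0_hz
    semitones = 12.0 * math.log2(f0_hz / A4_HZ)  # distance from A4
    target = round(semitones)                    # nearest note
    corrected = semitones + strength * (target - semitones)
    return A4_HZ * 2.0 ** (corrected / 12.0)


slightly_sharp = snap_to_semitone(445.0)       # pulled down to A4
half_corrected = snap_to_semitone(445.0, 0.5)  # moved part of the way
```

Intermediate `strength` values keep some of the singer's natural pitch drift, which matters for vibrato and slides.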

Stylistic Effects

Stylistic effects like stereo widening and reverb are built directly into the inference pipeline, improving the stylistic quality of singing outputs.
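
Stereo widening is commonly done with mid/side processing: split the signal into a shared (mid) and a difference (side) component, then scale the side component. The sketch below shows that general technique on plain lists of samples. The `widen` function and its `width` parameter are illustrative assumptions, not KVC's effect chain.

```python
def widen(left, right, width=1.5):
    """Scale the side (L - R) component of a stereo signal.

    width=1.0 returns the input unchanged, 0.0 collapses it to mono,
    and values above 1.0 exaggerate the difference between channels.
    """
    out_left, out_right = [], []
    for l, r in zip(left, right):
        mid = 0.5 * (l + r)
        side = 0.5 * (l - r) * width
        out_left.append(mid + side)
        out_right.append(mid - side)
    return out_left, out_right


left, right = [1.0, 0.5], [0.0, 0.5]
same_l, same_r = widen(left, right, width=1.0)
wide_l, wide_r = widen(left, right, width=2.0)
mono_l, mono_r = widen(left, right, width=0.0)
```

Samples that are identical in both channels have no side component, so they are unaffected at any width. Wider settings can push samples beyond the original peak level, so a real pipeline would need to manage headroom.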

Audio Examples

Pitch Stability

Whereas open-source weights are largely trained on speech data, the base weights of KVC are optimized for singing. The result: fuller, clearer notes across (and even beyond) the range of a singer.

Open Source (RVC)

Kits VC

Vocal Energy

With KVC, the energy level of an input file is reproduced much more realistically than with open-source alternatives. Volume fluctuations, breathiness, and smooth note onsets are carried through, giving a much more natural result.

Open Source (RVC)

Kits VC

Volume

Through adaptive pre-processing, KVC addresses volume artifacts common to open-source RVC conversions.

Open Source (RVC)

Kits VC

Sonic Quality

Without careful EQ and dynamic range processing, a voice model can quickly sound harsh. KVC adaptively balances the volume and frequency response of training datasets for smooth, low-distortion conversions.

Open Source (RVC)

Kits VC

Pitch/Vocal Fry

Through improvements to pitch detection, feature retrieval, and temporal resolution, KVC better reproduces subtle inflections like vocal fry and breathy singing styles.

Open Source (RVC)

Kits VC
