Abstract

We are interested in a challenging task, Realistic-Music-Score based Singing Voice Synthesis (RMS-SVS). RMS-SVS aims to generate high-quality singing voices given realistic music scores. Though significant progress has been achieved, current singing voice synthesis (SVS) methods are limited to fine-grained music scores, which require a complicated data collection pipeline with time-consuming manual annotation to align music notes with phonemes. Furthermore, existing approaches cannot synthesize rhythmic singing voices given realistic music scores due to the domain gap between fine-grained music scores and realistic music scores. To tackle these challenges, we propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input, eliminating most of the tedious manual annotation for data collection and avoiding the aforementioned domain gap. Noting that music scores are based on words rather than phonemes, we introduce word-level modeling in RMSSinger to avoid the time-consuming phoneme duration annotation and the complicated phoneme-note alignment. Furthermore, we propose the first diffusion-based pitch modeling method, which improves on the naturalness of existing pitch-modeling methods. To support this work, we collect a new dataset containing realistic music scores and recordings of professional singers performing these scores. Extensive experiments on the dataset demonstrate the effectiveness of our methods.

Audio Samples

We provide audio samples generated by the different SVS systems.

  • AP SP AP SP
  • GT(vocoder) FFTSinger DiffSinger RMSSinger
    wav

  • AP AP
  • GT(vocoder) FFTSinger DiffSinger RMSSinger
    wav

  • SP AP AP AP
  • GT(vocoder) FFTSinger DiffSinger RMSSinger
    wav

  • 怀 AP AP
  • GT(vocoder) FFTSinger DiffSinger RMSSinger
    wav

  • AP AP AP AP
  • GT(vocoder) FFTSinger DiffSinger RMSSinger
    wav

Ablation Studies

We provide audio samples generated by the different ablations.

  • AP AP AP
  • GT(vocoder) Full model w/o UV diffusion w/o F0 diffusion w/o postnet
    wav