
This is the demo page for the ISMIR 2022 paper "JukeDrummer: Conditional Beat-aware Audio-domain Drum Accompaniment Generation via Transformer VQ-VAE".

Authors: Yueh-Kao Wu, Ching-Yu Chiu, Yi-Hsuan Yang

Abstract

JukeDrummer generates a drum track in the audio domain to play along with a user-provided drum-free recording. Specifically, using paired data of drumless tracks and the corresponding human-made drum tracks, we train two vector-quantized variational autoencoders (VQ-VAEs) to discretize both the drumless and the drum Mel spectrograms. We then train a Transformer to improvise the drum part of an unseen drumless recording in terms of these discrete drum tokens. Finally, we use MelGAN as the vocoder to convert the Mel spectrogram produced by the VQ-VAE decoder into an audio waveform. This demo page contains several results for inputs from different domains.
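
Below is a minimal sketch of the inference pipeline described above, assuming PyTorch-style modules. The module names and their encode/sample/decode methods (drumless_vqvae, drum_vqvae, drum_lm, vocoder) are hypothetical placeholders used for illustration, not the actual API of our released code.

```python
# A minimal, hypothetical sketch of the inference pipeline; the modules and
# their encode/sample/decode methods are placeholders, not the released API.
import torch

@torch.no_grad()
def generate_drum_track(drumless_mel, drumless_vqvae, drum_vqvae, drum_lm, vocoder):
    # 1) Discretize the drum-free Mel spectrogram with the drumless VQ-VAE encoder.
    drumless_tokens = drumless_vqvae.encode(drumless_mel)      # sequence of code indices
    # 2) Autoregressively sample drum tokens conditioned on the drumless tokens.
    drum_tokens = drum_lm.sample(cond_tokens=drumless_tokens)  # sequence of code indices
    # 3) Decode the drum tokens into a drum Mel spectrogram with the drum VQ-VAE decoder.
    drum_mel = drum_vqvae.decode(drum_tokens)
    # 4) Invert the Mel spectrogram to a waveform with the MelGAN vocoder.
    return vocoder(drum_mel)
```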

Model Structure & Configurations

Our model design closely follows Jukebox. While there are hundreds of self-attention layers in Jukebox, there are only 9 layers in each of the encoder and the decoder in our work. In addition, we apply a so-called "Beat Information Extractor" that extracts beat information externally to help the model generate rhythmically consistent drum accompaniment audio.

Fig 1. The Flowchart of JukeDrummer
Fig 2. The language model of JukeDrummer
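
For concreteness, the sketch below shows how a 9-layer encoder, 9-layer decoder Transformer prior with an additional beat-embedding input could be wired up from standard PyTorch modules. The codebook size, model width, head count, and beat-feature dimension here are illustrative guesses, not the exact values used in the paper.

```python
# An illustrative configuration sketch of the Transformer language model
# (Fig. 2), using a standard PyTorch encoder-decoder Transformer.
# Hyperparameters other than the 9+9 layers are assumptions for illustration.
import torch.nn as nn

class DrumPrior(nn.Module):
    def __init__(self, codebook_size=2048, d_model=512, n_heads=8, n_layers=9, beat_dim=4):
        super().__init__()
        self.drumless_emb = nn.Embedding(codebook_size, d_model)  # drumless VQ tokens
        self.drum_emb = nn.Embedding(codebook_size, d_model)      # drum VQ tokens
        self.beat_emb = nn.Linear(beat_dim, d_model)               # external beat features
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers,   # 9 self-attention layers in the encoder
            num_decoder_layers=n_layers,   # 9 self-attention layers in the decoder
            batch_first=True,
        )
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, drumless_tokens, beat_feats, drum_tokens):
        # Condition on drumless tokens plus beat information; predict drum tokens.
        src = self.drumless_emb(drumless_tokens) + self.beat_emb(beat_feats)
        tgt = self.drum_emb(drum_tokens)
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        return self.head(self.transformer(src, tgt, tgt_mask=mask))
```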

Demo audio

This demo section contains three parts. The first part compares different model configurations using test-set recordings as input. The second part shows the diversity of the drum accompaniment tracks generated by our best model on the test data. Finally, the third part provides results from our best model using external but well-known drumless tracks as input.

Part 1: Evaluation of Different Model Variants on the Test Set


  Drumless | Ground Truth | W/ Encoder, W/ BeatInfo | W/ Encoder, W/O BeatInfo | W/O Encoder, W/ BeatInfo | W/O Encoder, W/O BeatInfo
1.
2.
3.
4.
5.
6.

Part 2: Evaluation on Diversity


We use our best model (W/ Encoder, W/ BeatInfo) to repeatedly generate drum tracks for the same input with identical parameters and configuration.
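
Because the drum-token language model is sampled stochastically, repeated runs on the same input with the same settings yield different drum tracks. Conceptually, reusing the hypothetical generate_drum_track helper sketched in the Abstract section:

```python
# Draw four independent samples for the same drumless input; each call samples
# the drum tokens stochastically, so the resulting drum tracks differ.
samples = [
    generate_drum_track(drumless_mel, drumless_vqvae, drum_vqvae, drum_lm, vocoder)
    for _ in range(4)   # corresponds to Sample 1-4 below
]
```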

  Drumless | Ground Truth | Sample 1 | Sample 2 | Sample 3 | Sample 4
1.
2.
3.
4.

Part 3: Evaluation on External Data


We use our best model (W/ Encoder, W/ BeatInfo) to repeatedly generate drum tracks for external input data. We use Spleeter to extract the drumless versions of the first and second tracks.
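
As a rough sketch of how such a drumless input can be prepared, the snippet below uses Spleeter's 4-stem model and mixes back every stem except the drums. This assumes Spleeter 2.x's Python API; the file names are illustrative.

```python
# Separate a song into 4 stems with Spleeter, then sum everything but the
# drums to obtain a drumless mix. File names are illustrative.
from spleeter.separator import Separator
from spleeter.audio.adapter import AudioAdapter

sample_rate = 44100
audio_adapter = AudioAdapter.default()
waveform, _ = audio_adapter.load("september.mp3", sample_rate=sample_rate)

separator = Separator("spleeter:4stems")   # vocals / drums / bass / other
stems = separator.separate(waveform)       # dict: stem name -> waveform

# Sum every stem except the drums to obtain the drumless mix.
drumless = sum(v for name, v in stems.items() if name != "drums")
audio_adapter.save("september_drumless.wav", drumless, sample_rate)
```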

Earth, Wind & Fire - September

Drumless | Sample 1 | Sample 2 | Sample 3

伍佰 Wu Bai & China Blue - 挪威的森林 (Norwegian Forest)

Drumless | Sample 1 | Sample 2 | Sample 3

All of Me

Drumless | Sample 1 | Sample 2 | Sample 3

Coldplay - Viva La Vida

Drumless | Sample 1 | Sample 2 | Sample 3

Limitation

First, the generalizability of our model is limited. Based on our observations, the model performs reasonably on most of our test data, which was split from the combined dataset of MUSDB18, MedleyDB, and MixingSecret prior to training. However, the results are noticeably worse when the input is a drumless recording from outside this combined dataset. We conjecture that the model is sensitive to audio compression, the original sample rate, or the way the music is mixed and mastered.

Second, the stability of our model still needs improvement. At times, the model struggles to adapt its tempo across different sections of a song. Moreover, the generation may be out of sync with the input during the first few seconds, before the model has accumulated sufficient context.

Last but not least, it helps if the drumless input contains "rhythmic hints," such as a strong bass line, rhythm guitar, or any other sources that help our model locate beats and downbeats; in such cases the model is likely to perform better. Conversely, if the model cannot find enough cues in the input to locate the beats and tempo, the generated result degrades considerably.

To sum up, generalizability, stability, and rhythm dependency are issues that should be addressed in future work.

Contact


Yueh-Kao Wu yk.lego09@gmail.com