Jacob Krucinski - email
Maximilian Huber - email
Surya Mani - email
Noah Smith - email
It is highly recommended to create a new Python 3.13 virtual environment to
run this repo. If using conda, the following commands can be used to create and activate the new environment:
conda create -n cs_7150_stem_sep python=3.13
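Then activate the environment before installing any packages:
conda activate cs_7150_stem_sep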
As PyTorch is the ML framework used for this project, follow the PyTorch installation instructions (with CUDA support if desired); v2.6.0 has been tested for this project:
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
All remaining dependencies can be installed using pip:
pip install -r requirements.txt
We use the MusDB18 dataset and implement a custom torch.utils.data.Dataset
class for it. It loads all 6-second segments of every stem for every song.
For the text prompts, it uses the original stem names
(drums, bass, vocals, other) as well as slight variants
(e.g., for vocals, "vocals", "voice", "singing", and "the vocals" are used as
additional prompts). The dataset can be downloaded for
free from the link above.
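A minimal sketch of this segment/prompt pairing using the musdb package is shown below. The class name and the prompt variants for drums, bass, and other are illustrative placeholders (only the vocals variants above come from our implementation), and the actual dataset class in this repo handles chunking and caching differently:

import random
import musdb
import torch
from torch.utils.data import Dataset

# Prompt variants: the vocals list mirrors the examples above; the others are placeholders.
PROMPTS = {
    "vocals": ["vocals", "voice", "singing", "the vocals"],
    "drums": ["drums", "the drums"],
    "bass": ["bass", "the bass"],
    "other": ["other", "other instruments"],
}

class MusDBStemSegments(Dataset):
    def __init__(self, root, subset="train", segment_seconds=6.0):
        self.db = musdb.DB(root=root, subsets=subset)
        self.segment_seconds = segment_seconds
        # Enumerate (track, stem, segment index) triples covering every 6 s chunk.
        self.index = []
        for t_idx, track in enumerate(self.db.tracks):
            n_segments = int(track.duration // segment_seconds)
            for stem in PROMPTS:
                for s in range(n_segments):
                    self.index.append((t_idx, stem, s))

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        t_idx, stem, s = self.index[i]
        track = self.db.tracks[t_idx]
        seg = int(self.segment_seconds * track.rate)
        start = s * seg
        # Decodes the track on access; the real implementation can cache or pre-chunk.
        mixture = torch.from_numpy(track.audio[start:start + seg].T).float()
        target = torch.from_numpy(track.targets[stem].audio[start:start + seg].T).float()
        prompt = random.choice(PROMPTS[stem])
        return mixture, target, prompt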
Our model is built on the existing Hybrid Transformer Demucs
(HTDemucs) model from Meta.
It consists of two parallel U-Net models with skip connections, one operating in the
time/waveform domain and the other in the frequency domain. A cross-attention
bottleneck layer learns harmonic representations unique to each stem to aid
the separation process. The decoders then use the attended features, and an
inverse Short-Time Fourier Transform (iSTFT)
converts the final spectrogram back to a waveform.
A model diagram provided by the authors is shown below:

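If the demucs Python package is installed, the pretrained weights can be loaded directly; the snippet below is a minimal sketch of running the stock four-stem checkpoint (the "htdemucs" name is the package's default model, and the 6-second random mixture is only for illustration):

import torch
from demucs.pretrained import get_model
from demucs.apply import apply_model

# Load Meta's pretrained Hybrid Transformer Demucs checkpoint.
model = get_model("htdemucs")
model.eval()

# HTDemucs expects stereo audio at 44.1 kHz: (batch, channels, samples).
mix = torch.randn(1, 2, 44100 * 6)
with torch.no_grad():
    stems = apply_model(model, mix)   # (batch, n_sources, channels, samples)
print(model.sources, stems.shape)     # sources: drums, bass, other, vocals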
We extend this model to support any user-defined stem name by adding text conditioning with Contrastive Language-Audio Pretraining (CLAP) text embeddings. We add another cross-attention between the text embeddings and the combined time & frequency embeddings from HTDemucs, and we modify the U-Net decoders to return a single stem (instead of the previous four). This gives the HTDemucs model zero-shot ability for stems it has not seen before: even if the model was not explicitly trained on a given stem, the CLAP embedding guides the separation toward a similar stem it has trained on.
Our model architecture is shown below. The blue boxes denote pre-trained and frozen
model components, and the tan boxes represent our modifications that are trained.

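The sketch below illustrates the text-conditioning idea using the laion_clap package and a single cross-attention layer. The 384-dimensional bottleneck size, module names, and residual connection are assumptions for illustration, not the exact layers in this repo:

import torch
import torch.nn as nn
import laion_clap

# Frozen CLAP text encoder (kept on CPU for this sketch; downloads the default checkpoint).
clap = laion_clap.CLAP_Module(enable_fusion=False, device="cpu")
clap.load_ckpt()
prompts = ["the vocals", "the electric guitar"]
text_emb = clap.get_text_embedding(prompts, use_tensor=True)  # (2, 512)

class TextConditioning(nn.Module):
    """Trainable cross-attention: audio bottleneck tokens attend to the text embedding."""
    def __init__(self, feat_dim=384, text_dim=512, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(text_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)

    def forward(self, feats, text_emb):
        # feats: (batch, tokens, feat_dim) combined time/frequency bottleneck features
        text = self.proj(text_emb).unsqueeze(1)     # (batch, 1, feat_dim)
        attended, _ = self.attn(feats, text, text)  # queries = audio, keys/values = text
        return feats + attended                     # residual conditioning

cond = TextConditioning()
feats = torch.randn(2, 128, 384)  # placeholder bottleneck tokens for a batch of 2
out = cond(feats, text_emb)       # (2, 128, 384)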
More details on our implementation and audio samples can be found in our presentation.
To train the model, run the Python script main.py from the project root
directory.
All data, model, and logging configurations are specified in the config.yaml file. The full training loop can be found in src/train.py.
The YAML configuration file is organized by data,
model, training, and Weights and Biases (wandb) parameters.
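An illustrative layout is sketched below; the key names and values here are assumptions for readability, and the actual names are defined in config.yaml:

# Illustrative layout only; see config.yaml for the real keys and values.
data:
  musdb_root: /path/to/musdb18
  segment_seconds: 6
model:
  freeze_htdemucs: true
  clap_checkpoint: default
training:
  epochs: 100
  batch_size: 8
  learning_rate: 1.0e-4
  loss: l1_sdr        # one of: sdr, sisdr, sdr_sisdr, l1_sdr
wandb:
  enabled: true
  project: cs_7150_stem_sep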
Four loss functions are available: Signal-to-Distortion Ratio (SDR), Scale-Invariant SDR (SI-SDR), a linear combination of the SDR and SI-SDR losses, and a combined L1 + SDR loss.
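As a reference for the SDR-based terms, a minimal PyTorch sketch is shown below; the combination weights are placeholders, and the repo's actual implementations in src/train.py may differ:

import torch

def sdr_loss(est, ref, eps=1e-8):
    # Negative signal-to-distortion ratio in dB (so that lower is better).
    num = torch.sum(ref ** 2, dim=-1)
    den = torch.sum((ref - est) ** 2, dim=-1) + eps
    return -10 * torch.log10(num / den + eps).mean()

def sisdr_loss(est, ref, eps=1e-8):
    # Scale-invariant SDR: project the estimate onto the (zero-mean) reference first.
    ref_zm = ref - ref.mean(dim=-1, keepdim=True)
    est_zm = est - est.mean(dim=-1, keepdim=True)
    alpha = torch.sum(est_zm * ref_zm, dim=-1, keepdim=True) / (
        torch.sum(ref_zm ** 2, dim=-1, keepdim=True) + eps)
    target = alpha * ref_zm
    noise = est_zm - target
    ratio = torch.sum(target ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
    return -10 * torch.log10(ratio + eps).mean()

def l1_sdr_loss(est, ref, w_l1=1.0, w_sdr=0.1):
    # Weighted combination; the weights here are placeholders.
    return w_l1 * torch.nn.functional.l1_loss(est, ref) + w_sdr * sdr_loss(est, ref)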
For logging purposes during training, we use a Weights & Biases (wandb)
project dashboard.
If you also want to use wandb, create an account by following the instructions here.
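After creating the account, authenticate on your machine once with the standard wandb CLI command:
wandb login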
To avoid training from scratch, the best model file can be found on HuggingFace.
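One way to fetch the checkpoint programmatically is with the huggingface_hub client; the repo id and filename below are placeholders for the actual ones linked above:

from huggingface_hub import hf_hub_download

# Placeholders: substitute the repo id and filename from the HuggingFace link above.
ckpt_path = hf_hub_download(repo_id="<user>/<model-repo>", filename="best_model.ckpt")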
To simplify the inference process, we have created a Gradio demo that supports uploading a local audio file or using a YouTube link. To run the Gradio app locally, run:
python app.py
A file-upload-only version of the Gradio app is hosted on the Hugging Face Hub.
A screenshot of the app is shown below:

For custom inference, the test-inference.py file can be modified.
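HTDemucs operates on stereo 44.1 kHz audio (the MusDB18 format), so custom inputs should be resampled first; a minimal preprocessing sketch with torchaudio is shown below. The file name and prompt are placeholders, and the final separation call is left as a comment because its exact signature is defined in test-inference.py:

import torchaudio

# Load a local file and convert it to stereo 44.1 kHz.
wav, sr = torchaudio.load("song.wav")
wav = torchaudio.functional.resample(wav, sr, 44100)
if wav.shape[0] == 1:
    wav = wav.repeat(2, 1)  # duplicate mono to stereo

# Hypothetical call: the actual entry point and arguments live in test-inference.py.
# stem = separate(wav.unsqueeze(0), prompt="the electric guitar")
# torchaudio.save("guitar.wav", stem.squeeze(0), 44100)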
Défossez, A. (2021). Hybrid Spectrogram and Waveform Source Separation. Proceedings of the ISMIR 2021 Workshop on Music Source Separation.
Wu, Y., Chen, K., Zhang, T., Hui, Y., Berg-Kirkpatrick, T., & Dubnov, S. (2023). Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Ma, H., Peng, Z., Li, X., Shao, M., Wu, X., & Liu, J. (2024). CLAPSep: Leveraging Contrastive Pre-trained Models for Multi-Modal Query-Conditioned Target Sound Extraction. arXiv preprint arXiv:2402.17455.