Overview

MUSE is a text-to-music generation system built as a final project for the Parallel Computation course. It addresses the fundamental “one-to-many” problem in creative AI: a single text prompt like “a sad piano melody” can correspond to infinitely many valid musical interpretations.

[Figure: MUSE pipeline]

The One-to-Many Problem

Traditional regression approaches predict the mean of possible outputs, resulting in blurry, mode-averaged music. MUSE instead models the full conditional distribution p(Y ∣ X), enabling diverse, high-fidelity sampling.

| Approach | Output | Issue |
| --- | --- | --- |
| Regression Y = f(X) | Mean E[Y ∣ X] | Blurry, averaged |
| Generative p(Y ∣ X) | Diverse samples | ✓ Preserves creativity |
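
A quick way to see the failure mode (an illustrative toy, not part of the MUSE code): when one prompt has two equally valid targets, the MSE-optimal prediction converges to their midpoint, which matches neither interpretation.

```python
import torch

# Toy example: one prompt, two equally valid "interpretations" (hypothetical targets).
y_a = torch.tensor([1.0, 0.0])
y_b = torch.tensor([0.0, 1.0])

# Fit a single prediction by minimizing MSE against both targets.
y_hat = torch.nn.Parameter(torch.zeros(2))
opt = torch.optim.SGD([y_hat], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    loss = ((y_hat - y_a) ** 2).mean() + ((y_hat - y_b) ** 2).mean()
    loss.backward()
    opt.step()

print(y_hat.data)  # ~[0.5, 0.5]: the averaged output, matching neither target
```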

Two-Stage Architecture

Stage 1: Text2MuQFlow (271M params)

  • Cross-attention Flow Matching
  • Output: 512-dim MuQ-MuLan embedding
  • 50-step ODE (dopri5 solver), CFG scale 3.0
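
A minimal sketch of what CFG-guided flow-matching sampling with these settings could look like, assuming a hypothetical velocity network `v_model(x, t, cond)` and the `torchdiffeq` package for the dopri5 solver; the actual Text2MuQFlow code may structure this differently.

```python
import torch
from torchdiffeq import odeint  # pip install torchdiffeq

CFG_SCALE = 3.0
EMB_DIM = 512  # MuQ-MuLan embedding dimension

@torch.no_grad()
def sample_muq_embedding(v_model, text_emb, null_emb, steps=50):
    """Integrate the flow-matching ODE from noise (t=0) to data (t=1).

    v_model(x, t, cond) is a hypothetical velocity network; null_emb is the
    unconditional ("empty prompt") embedding used for classifier-free guidance.
    """
    def velocity(t, x):
        v_cond = v_model(x, t, text_emb)
        v_uncond = v_model(x, t, null_emb)
        # Classifier-free guidance: push the velocity toward the conditional direction.
        return v_uncond + CFG_SCALE * (v_cond - v_uncond)

    x0 = torch.randn(text_emb.shape[0], EMB_DIM)   # start from Gaussian noise
    t_grid = torch.linspace(0.0, 1.0, steps)       # 50 evaluation points
    traj = odeint(velocity, x0, t_grid, method="dopri5")
    return traj[-1]                                # predicted MuQ-MuLan embedding
```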

Stage 2: StableAudioMuQ (1.05B params)

  • Diffusion Transformer + Classifier-Free Guidance
  • 100-step ODE sampling
  • Output: 44.1kHz stereo, up to 47 seconds
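
Putting the two stages together, end-to-end inference could look roughly like this (a sketch; `text_encoder`, `stage1.sample`, and `stage2.sample` are placeholder names, not the actual MUSE API):

```python
import torch
import torchaudio

@torch.no_grad()
def generate(prompt, text_encoder, stage1, stage2, out_path="sample.wav"):
    """Hypothetical pipeline: text -> 512-dim MuQ embedding -> stereo waveform."""
    text_emb = text_encoder(prompt)        # text conditioning
    muq_emb = stage1.sample(text_emb)      # Stage 1: Text2MuQFlow (50-step ODE)
    audio = stage2.sample(muq_emb)         # Stage 2: StableAudioMuQ (100-step ODE)
    # Expected output: (2, num_frames) stereo at 44.1 kHz, up to ~47 s
    torchaudio.save(out_path, audio.cpu(), sample_rate=44100)
    return audio
```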

Parallel Computing Highlights

| Challenge | Solution |
| --- | --- |
| Memory-heavy (1B+ params) | PyTorch DDP across 4×A800 GPUs |
| Compute-bound (100-step ODE) | Batch inference parallelization |
| I/O bottleneck (44.1kHz audio) | Cached MuQ embeddings |
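
The DDP row boils down to the standard PyTorch data-parallel setup; a minimal sketch, assuming the job is launched with `torchrun --nproc_per_node=4` (model and data loading omitted):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    """Replicate the 1B-parameter Stage 2 model across 4 GPUs, one process per GPU."""
    dist.init_process_group(backend="nccl")       # rendezvous via torchrun env vars
    local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
    torch.cuda.set_device(local_rank)
    return DDP(model.to(local_rank), device_ids=[local_rank])
```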

Training Summary

| Stage | GPUs | Wall Time | GPU-hrs | Best Loss |
| --- | --- | --- | --- | --- |
| Stage 1 | 2×A800 | ~8.2h | ~16 | 0.0569 |
| Stage 2 | 4×A800 | ~32h | ~128 | 1.078 |

[Figure: Stage 1 training loss curve]

Batch Inference Performance

The key insight: model size determines batching efficiency.

| Model Size | Batch Speedup | Reason |
| --- | --- | --- |
| < 500M | 5-30× | GPU under-utilized |
| 500M-1B | 2-5× | Partial saturation |
| > 1B | ~1× | Already saturated |

  • Stage 1 (271M): 15× speedup at BS=16; memory stays constant at 4.58 GB
  • Stage 2 (1.05B): 1.5× speedup at BS=16; memory scales linearly (18 GB → 36 GB)
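
A rough sketch of how these speedups can be measured (illustrative only; `model.sample` and `prompt_embs` are placeholders for the actual interface):

```python
import time
import torch

@torch.no_grad()
def time_generation(model, prompt_embs, batch_size):
    """Generate all samples in chunks of batch_size; return wall time (s) and peak memory (GB)."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for i in range(0, len(prompt_embs), batch_size):
        model.sample(prompt_embs[i:i + batch_size])
    torch.cuda.synchronize()
    return time.perf_counter() - start, torch.cuda.max_memory_allocated() / 1e9

# e.g. speedup at BS=16:
# time_generation(model, embs, 1)[0] / time_generation(model, embs, 16)[0]
```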

End-to-End Result

  • 42% time reduction for 16 samples (185s → 108s)
  • Reproducible outputs (max diff < 1e-7)
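
The reproducibility figure is presumably a maximum element-wise difference between runs (e.g., batched vs. one-at-a-time generation); a minimal check could look like this sketch, with placeholder names:

```python
import torch

def max_abs_diff(batched: torch.Tensor, one_by_one: list[torch.Tensor]) -> float:
    """Max absolute difference between batched and per-sample outputs (expect < 1e-7)."""
    return (batched - torch.cat(one_by_one, dim=0)).abs().max().item()
```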

MUSE Application

The full-stack application includes:

  • 🎹 Studio: Text-to-music generation with batch sampling
  • 🔬 Lab: Latent space exploration & interpolation
  • 📚 Library: Audio management, tagging & playback

[Figure: Diverse samples visualized in the MuQ-MuLan latent space]
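
This kind of latent-space view can be reproduced with the UMAP and Plotly parts of the stack, roughly as follows (a sketch; `embs` is an (N, 512) array of MuQ-MuLan embeddings and `labels` are prompt tags, both placeholders):

```python
import numpy as np
import plotly.express as px
import umap  # pip install umap-learn

def plot_latent_space(embs: np.ndarray, labels: list[str]):
    """Project 512-dim MuQ-MuLan embeddings to 2-D with UMAP and scatter-plot them."""
    xy = umap.UMAP(n_components=2, metric="cosine", random_state=42).fit_transform(embs)
    return px.scatter(x=xy[:, 0], y=xy[:, 1], color=labels,
                      title="MuQ-MuLan latent space (UMAP)")
```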

Tech Stack

Gradio · PyTorch · torchaudio · Plotly · UMAP


📎 View Research Poster | 🔗 GitHub Repository