Follow the links above for audio examples, full training code, and the arXiv pre-print.
We then apply this method to a dataset of sounds sampled from Surge XT, a feature-rich software synthesizer, and find that it dramatically outperforms state-of-the-art baselines on audio reconstruction.
🤔 However, in the case of real synthesizers, we may not know the appropriate symmetries a priori. To allow them to be discovered adaptively, we introduce a technique called Param2Tok, which learns a mapping from synthesizer parameters to model tokens.
🗺️ We can further improve performance by designing a model with equivariance to the appropriate symmetry.
We design a toy task that isolates this phenomenon and find that the presence of permutation symmetry degrades the performance of conventional methods. We then show that a generative approach, which can assign predictive weight to multiple possible solutions, performs considerably better.
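(For the curious: the failure mode of conventional point-estimate methods under permutation symmetry can be sketched in a few lines. The toy "synthesizer" below is my own illustrative stand-in, not the task from the paper.)

```python
import numpy as np

def synth(p):
    # a toy "synthesizer" that is invariant to swapping its two parameters
    return np.sin(p[0] * 3.0) + np.sin(p[1] * 3.0)

theta = np.array([0.2, 0.9])
swapped = theta[::-1]

# both orderings produce exactly the same sound...
assert np.isclose(synth(theta), synth(swapped))

# ...so the MSE-optimal point estimate is their mean, which can
# correspond to a different sound entirely
mse_optimal = (theta + swapped) / 2
assert not np.isclose(synth(mse_optimal), synth(theta))
```

A generative model can instead place probability mass on both valid orderings rather than averaging them.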
‼️ In this work, we argue that the problem is ill-posed: there are multiple sets of parameters that produce any given sound. Further, we show that many of these equivalent solutions are due to intrinsic symmetries of the synthesizer!
🧑‍🔬 Previous approaches have struggled to scale to the full complexity of synthesizers used in modern audio production. Why?
Programming synthesizers is a fiddly business, and so a line of work known as "sound matching" has, over the last few decades, sought to answer the question: given an audio signal and a synthesizer, which configuration of parameters best approximates the signal?
🎹 Audio synthesizers are diverse and complex beasts, combining a variety of techniques to produce sounds ranging from familiar to entirely alien.
TL;DR: Predicting synthesizer parameters from audio is hard because multiple parameter configurations can produce the same sound. We design a model that accounts for this and find that it dramatically outperforms previous approaches, and works on production-grade, feature-rich VST synthesizers.
Very excited to share that our latest work, "Audio synthesizer inversion in symmetric parameter spaces with approximately equivariant flow matching", has been accepted to ISMIR 2025 in Daejeon, Korea!
Paper: arxiv.org/abs/2506.07199
Audio: benhayes.net/synth-perm/
Code: github.com/ben-hayes/sy...
🧵
going to Korea, baby! 🇰🇷 #ISMIR2025
DiffVox integrates differentiable vocal effects; analysis reveals parameter correlations and connections to McAdams' timbre dimensions; parameter distributions are non-Gaussian; code and datasets are available.
wake up, babe. new @sedielem.bsky.social just dropped
sander.ai/2025/04/15/l...
amazing how the soothing beep of stolen Lime bikes has so naturally woven itself into the London soundscape
turned on an old computer and found some old unfinished music gathering dust. uploading it so it at least lives somewhere.
the best ones combine two or more
realised tonight there are only 3 red hot chili peppers songs:
1. california
2. zoop di blamp
3. heroin, but it's a woman
A low-latency neural audio synthesizer (BRAVE) was designed by analyzing latency sources in existing models (RAVE); BRAVE improved pitch and loudness replication while maintaining timbre modification capabilities, implemented in a specialized inference framework.
negative \vspace season approaches
NablAFx, an open-source PyTorch framework, supports differentiable black-box and gray-box modeling of audio effects; it includes model architectures, datasets, training features, and plotting functions.
Two excellent recent resources:
1. (not strictly a paper) This tutorial from the last ISMIR, courtesy of: geoffroypeeters.github.io/deeplearning...
2. This overview of model-based deep learning for MIR: arxiv.org/abs/2406.11540
I look at it as squeezing a *slightly* better coupling out of the batch.
they do something related here (arxiv.org/abs/2306.15030) with the Kabsch algorithm, but they transform the target samples, as they're specifically trying to learn a rotation-invariant distribution with an equivariant flow.
haven't crunched through it on paper, but my hunch is this works because of the spherical symmetry of the Gaussian distribution, so any orthogonal transformation of the batch is exactly as probable (should work for any O(d)-invariant distribution if true)
very anecdotally, I've found that when using a normal source distribution, performing orthogonal Procrustes on the source samples (to match the target samples) after minibatch coupling by exact linear assignment (Hungarian algorithm) seems to speed up convergence by a noticeable amount.
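a minimal NumPy/SciPy sketch of the recipe, in case it's unclear (all names and data here are illustrative, not from an actual training run):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
d, n = 8, 64
x0 = rng.standard_normal((n, d))        # source batch: standard normal
x1 = rng.standard_normal((n, d)) + 3.0  # stand-in target batch

# 1) exact minibatch coupling: Hungarian algorithm on squared distances
cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
row, col = linear_sum_assignment(cost)
x0, x1 = x0[row], x1[col]  # matched pairs

# 2) orthogonal Procrustes: rotate the source batch towards its targets;
# since the normal source is O(d)-invariant, the rotated samples are
# still valid draws from the source distribution
R, _ = orthogonal_procrustes(x0, x1)
x0_aligned = x0 @ R

# identity is a feasible orthogonal matrix, so alignment can only
# shrink (or preserve) the total transport cost
assert ((x0_aligned - x1) ** 2).sum() <= ((x0 - x1) ** 2).sum() + 1e-9
```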
amazing, @drscotthawley.bsky.social ! I've been recommending this post to everyone recently.
🎶✨ New Paper Announcement! ✨🎶
We present "Improving Musical Accompaniment Co-creation via Diffusion Transformers" 🎹🎸, a study advancing our Diff-A-Riff stem generator through improved quality, efficiency, and control.
Read the full paper here: arxiv.org/pdf/2410.23005 🧵
This seems to be where ML-facing config libraries (hydra, gin, jsonargparse, etc) converge, and is what I grudgingly end up doing. It makes me wince, though, because it seems to lead invariably to non-trivial and untested instantiation logic being encoded in the relationships between config files.
1. this is excellent work
2. your vocal imitations are everything ❤️
speaking at Akademie der Bildenden Künste in Munich on Dec 16th
"Phantasmagoria: Sound Synthesis after the Turing Test"
about the methodological, ethical, and environmental implications of Generative AI for audio
by invitation from Florian Hecker
hal.science/hal-04650754