Zero-Shot Structure Labeling with Audio And Language Model Embeddings - Equipe Signal, Statistique et Apprentissage
Conference paper - Year: 2024

Abstract

Recent progress on audio-based music structure analysis has closely aligned with the appearance of new deep learning paradigms, notably for the extraction of robust spectro-temporal audio features and their sequential modeling. However, most recent methods resort to supervised learning, which requires careful annotation of audio music pieces. Such annotations may sometimes operate at different temporal scales from one dataset to another or comprise inconsistent variation markers across repetitions of identical segments. This work explores language models as an alternative to manual pre-processing of the section label space, thus facilitating training and predictions across different annotated corpora. We propose a joint audio-to-text embedding space in which latent representations of audio frames and their respective section labels are close. We take inspiration from recent works on cross-modal contrastive learning and demonstrate the plausibility of this paradigm in the context of music structure analysis.
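The abstract alludes to a CLIP-style cross-modal contrastive objective that pulls paired audio-frame and section-label embeddings together. The paper itself specifies the actual models and training setup; the following is only a minimal, generic sketch of such a symmetric contrastive (InfoNCE) loss, where the function names, batch layout, and temperature value are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit hypersphere so dot products
    # become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, text_emb: arrays of shape (B, D); row i of each is
    assumed to come from the same (audio frame, section label) pair.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature        # (B, B) similarity matrix
    idx = np.arange(len(a))               # matching pairs sit on the diagonal

    def xent(lg):
        # Cross-entropy of each row against its diagonal target,
        # with the usual max-subtraction for numerical stability.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs and a small temperature the loss approaches zero, while unrelated random pairs give a loss near log(B); zero-shot labeling then amounts to assigning each audio frame the nearest section-label embedding in this shared space.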
Main file: ISMIR24_LBD_ZeroShotSegmentation.pdf (322.27 KB)
Origin: files produced by the author(s)

Dates and versions

hal-04764247, version 1 (03-11-2024)

Identifiers

  • HAL Id: hal-04764247, version 1

Cite

Morgan Buisson, Christopher Ick, Tom Xi, Brian McFee. Zero-Shot Structure Labeling with Audio And Language Model Embeddings. Extended Abstracts for the Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval Conference (ISMIR), Nov 2024, San Francisco, California, United States. ⟨hal-04764247⟩