Zero-Shot Structure Labeling with Audio And Language Model Embeddings - Equipe Signal, Statistique et Apprentissage
Conference paper - Year: 2024

Abstract

Recent progress on audio-based music structure analysis has closely aligned with the appearance of new deep learning paradigms, notably for the extraction of robust spectro-temporal audio features and their sequential modeling. However, most recent methods resort to supervised learning, which requires careful annotation of audio music pieces. Such annotations may sometimes operate at different temporal scales from one dataset to another or comprise inconsistent variation markers across repetitions of identical segments. This work explores language models as an alternative to manual pre-processing of the section label space, thus facilitating training and predictions across different annotated corpora. We propose a joint audio-to-text embedding space in which latent representations of audio frames and their respective section labels are close. We take inspiration from recent works on cross-modal contrastive learning and demonstrate the plausibility of this paradigm in the context of music structure analysis.
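The abstract alludes to a CLIP-style cross-modal contrastive objective that pulls paired audio-frame and section-label embeddings together. The paper itself specifies the actual models and training setup; the following is only a minimal, generic sketch of such a symmetric contrastive (InfoNCE) loss, where the function names, batch layout, and temperature value are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit hypersphere so dot products
    # become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, text_emb: arrays of shape (B, D); row i of each is
    assumed to come from the same (audio frame, section label) pair.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature        # (B, B) similarity matrix
    idx = np.arange(len(a))               # matching pairs sit on the diagonal

    def xent(lg):
        # Cross-entropy of each row against its diagonal target,
        # with the usual max-subtraction for numerical stability.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs and a small temperature the loss approaches zero, while unrelated random pairs give a loss near log(B); zero-shot labeling then amounts to assigning each audio frame the nearest section-label embedding in this shared space.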
Main file: ISMIR24_LBD_ZeroShotSegmentation.pdf (322.27 KB)
Origin: files produced by the author(s)

Dates and versions

hal-04764247, version 1 (03-11-2024)

Identifiers

  • HAL Id: hal-04764247, version 1

Cite

Morgan Buisson, Christopher Ick, Tom Xi, Brian McFee. Zero-Shot Structure Labeling with Audio And Language Model Embeddings. Extended Abstracts for the Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval Conference (ISMIR), Nov 2024, San Francisco, California, United States. ⟨hal-04764247⟩