Read, look and detect: Bounding box annotation from image-caption pairs - IRT Saint Exupéry - Institut de Recherche Technologique
Preprints, Working Papers, ... Year: 2023

Read, look and detect: Bounding box annotation from image-caption pairs

Abstract

Various methods have been proposed to detect objects while reducing the cost of data annotation. For instance, weakly supervised object detection (WSOD) methods rely only on image-level annotations during training. Unfortunately, data annotation remains expensive, since annotators must provide the categories describing the content of each image, and labeling is restricted to a fixed set of categories. In this paper, we propose a method to locate and label objects in an image using a weaker form of supervision: image-caption pairs. By leveraging recent advances in vision-language (VL) models and self-supervised vision transformers (ViTs), our method performs phrase grounding and object detection in a weakly supervised manner. Our experiments demonstrate the effectiveness of our approach: it achieves a 47.51% recall@1 score in phrase grounding on Flickr30k Entities and establishes a new state of the art in object detection when relying exclusively on image-caption pairs, with 21.1 mAP@50 and 10.5 mAP@50:95 on MS COCO.
Main file
RLD.pdf (4.84 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04121503, version 1 (07-06-2023)

Cite

Eduardo Hugo Sanchez. Read, look and detect: Bounding box annotation from image-caption pairs. 2023. ⟨hal-04121503⟩

Collections

IRT_SAINT-EXUPERY