DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Authors :
Ren, Tianhe
Chen, Yihao
Jiang, Qing
Zeng, Zhaoyang
Xiong, Yuda
Liu, Wenlong
Ma, Zhengyu
Shen, Junyi
Gao, Yuan
Jiang, Xiaoke
Chen, Xingyu
Song, Zhuheng
Zhang, Yuhong
Huang, Hongjie
Gao, Han
Liu, Shilong
Zhang, Hao
Li, Feng
Yu, Kent
Zhang, Lei
Publication Year :
2024

Abstract

In this paper, we introduce DINO-X, which is a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extends its input options to support text prompt, visual prompt, and customized prompt. With such flexible prompt options, we develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model's core grounding capability, we have constructed a large-scale dataset with over 100 million high-quality grounding samples, referred to as Grounding-100M, for advancing the model's open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset leads to a foundational object-level representation, which enables DINO-X to integrate multiple perception heads to simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, object-based QA, etc. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val benchmarks, improving the previous SOTA performance by 5.8 AP and 5.0 AP. Such a result underscores its significantly improved capacity for recognizing long-tailed objects.

Comment: Technical Report

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2411.14347
Document Type :
Working Paper