LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Hao Zhang; Hongyang Li; Feng Li; Tianhe Ren; Xueyan Zou; Shilong Liu; Shijia Huang; Jianfeng Gao; Lei Zhang; Chunyuan Li; Jianwei Yang

doi:10.48550/arxiv.2312.02949

Verified authors • Institutional access • DOI aware

50,000+ researchers120,000+ datasets90% satisfaction

Preprint

2023

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

0 Datasets

0 Files

2023

DOI: 10.48550/arxiv.2312.02949 arxiv.org/abs/2312.02949

Get instant academic access to this publication’s datasets.

Create free account How it works

Frequently asked questions

Is access really free for academics and students?

Yes. After verification, you can browse and download datasets at no cost. Some premium assets may require author approval.

How is my data protected?

Files are stored on encrypted storage. Access is restricted to verified users and all downloads are logged.

Can I request additional materials?

Yes, message the author after sign-up to request supplementary files or replication code.

Advance your research today

Join 50,000+ researchers worldwide. Get instant access to peer-reviewed datasets, advanced analytics, and global collaboration tools.

Get free academic access Learn more

✓ Immediate verification • ✓ Free institutional access • ✓ Global collaboration

With the recent significant advancements in large multi-modal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their capabilities for grounding and chat are usually separate, and their chat performance drops dramatically when asked to ground. The problem is the lack of a dataset for grounded visual chat (GVC). Existing grounding datasets only contain short captions. To address this issue, we have created GVC data that allows for the combination of grounding and chat capabilities. To better evaluate the GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that can support GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding .

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Frequently asked questions

Is access really free for academics and students?

How is my data protected?

Can I request additional materials?

Advance your research today

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Frequently asked questions

Is access really free for academics and students?

How is my data protected?

Can I request additional materials?

Advance your research today

Access Research Data

This PDF is not available in different languages.

Lei Zhang

Abstract

How to cite this publication

Related publications

Why join Raw Data Library?

Quality

Control

Free for Academia

Publication Details

Join Research Community