Computer Vision Faculty Publications

How Good is Google Bard’s Visual Understanding? An Empirical Study on Open Challenges

Haotong Qin, ETH Zürich
Ge Peng Ji, The Australian National University
Salman Khan, Mohamed Bin Zayed University of Artificial Intelligence
Deng Ping Fan, ETH Zürich
Fahad Shahbaz Khan, Mohamed Bin Zayed University of Artificial Intelligence
Luc Van Gool, ETH Zürich

Document Type

Article

Publication Title

Machine Intelligence Research

Abstract

Google’s Bard has emerged as a formidable competitor to OpenAI’s ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard’s impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned on text questions. This exploration has the potential to reveal new insights and challenges for Bard and other forthcoming multi-modal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, underwater and remote sensing data to comprehensively evaluate Bard’s performance. Our primary finding is that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released at https://github.com/htqin/GoogleBard-VisUnderstand.

First Page

605

Last Page

613

DOI

10.1007/s11633-023-1469-x

Publication Date

8-30-2023

Keywords

chatbot, conversational AI, Google Bard, large language models, multi-modal understanding, visual comprehension

Comments

IR Deposit conditions:

OA version (pathway b) Accepted version

12 months embargo

License: Publisher's Bespoke License

Published source must be acknowledged with citation

Must link to publisher version with DOI

Post-prints are subject to Springer Nature re-use terms

Set statement to accompany deposit (see policy)

Recommended Citation

H. Qin, G.P. Ji, S. Khan, D.P. Fan, F.S. Khan and L. Van Gool, "How Good is Google Bard’s Visual Understanding? An Empirical Study on Open Challenges," Machine Intelligence Research, vol. 20, no. 5, pp. 605-613, Aug 2023. doi:10.1007/s11633-023-1469-x

Additional Links

https://doi.org/10.1007/s11633-023-1469-x
