Machine Learning Faculty Publications

Skin-Former: Mobile-Friendly Transformer for Skin Lesion Diagnosis

Sheng Zhang, Mohamed Bin Zayed University of Artificial Intelligence
Muzammal Naseer, Mohamed Bin Zayed University of Artificial Intelligence
Guangyi Chen, Mohamed Bin Zayed University of Artificial Intelligence
Zhiqiang Shen, Mohamed Bin Zayed University of Artificial Intelligence
Salman Khan, Mohamed Bin Zayed University of Artificial Intelligence
Kun Zhang, Mohamed Bin Zayed University of Artificial Intelligence
Fahad Shahbaz Khan, Mohamed Bin Zayed University of Artificial Intelligence

Document Type

Conference Proceeding

Publication Title

Proceedings of the AAAI Conference on Artificial Intelligence

Abstract

Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. Despite this success, most traditional VLM-based methods are restricted by the assumption of partial source supervision or ideal target vocabularies, which rarely hold in open-world scenarios. In this paper, we target a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotations but instead a broad vocabulary. To address this new problem, we propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning. Our S3A framework adopts a unique Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. The CVPR algorithm consists of iteratively clustering images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with large language models to distinguish confusing candidates, and realigning images and the vocabulary as structural semantic alignment. Finally, we propose to self-train the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Our comprehensive experiments across various generic and fine-grained benchmarks demonstrate that the S3A method substantially improves over existing VLM-based approaches, achieving an average accuracy improvement of more than 15% over CLIP. Our code, models, and prompts are publicly released at https://github.com/shengeatamath/S3A.
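
As an illustration only, the voting step named in the abstract might look roughly like the sketch below. It is not the authors' implementation: the function names (normalize, vote_candidates), the top_k parameter, the top-1 cosine-similarity voting rule, and the toy 2-D vectors standing in for CLIP image and text embeddings are all assumptions made here, and the cluster assignments are taken as given rather than computed.

```python
from collections import Counter
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def vote_candidates(image_embs, cluster_ids, vocab_embs, vocab, top_k=2):
    """For each cluster, return up to top_k vocabulary words ranked by votes.

    Assumed voting rule: each image casts one vote for the vocabulary word
    whose embedding is most cosine-similar to the image embedding.
    """
    vocab_unit = [normalize(v) for v in vocab_embs]
    votes = {}
    for emb, cid in zip(image_embs, cluster_ids):
        unit = normalize(emb)
        sims = [sum(a * b for a, b in zip(unit, w)) for w in vocab_unit]
        best = max(range(len(vocab)), key=sims.__getitem__)
        votes.setdefault(cid, Counter())[vocab[best]] += 1
    return {cid: [w for w, _ in c.most_common(top_k)] for cid, c in votes.items()}


# Hypothetical toy data: 2-D vectors in place of real CLIP features.
vocab = ["cat", "dog", "car"]
vocab_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [1.0, 0.05], [0.7, 0.6], [0.1, 0.9], [0.0, 1.0]]
cluster_ids = [0, 0, 0, 1, 1]  # assumed output of a prior clustering step

candidates = vote_candidates(image_embs, cluster_ids, vocab_embs, vocab)
print(candidates)  # {0: ['cat', 'dog'], 1: ['car']}
```

In this toy run, cluster 0 keeps two candidates ("cat" and "dog"), which is the kind of confusable pair the abstract says the prompt-generation step is meant to separate.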

First Page

7278

Last Page

7286

DOI

10.1609/aaai.v38i7.28557

Publication Date

3-25-2024

Recommended Citation

S. Zhang et al., "Skin-Former: Mobile-Friendly Transformer for Skin Lesion Diagnosis," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, pp. 7278-7286, Mar 2024.

The definitive version is available at https://doi.org/10.1609/aaai.v38i7.28557

