Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead.
Yunkang Cao*, Xiaohao Xu*, Chen Sun*, Xiaonan Huang, Weiming Shen. (*These authors contributed equally.)
Anomaly detection is a crucial task across different domains and data types. However, existing anomaly detection models are often designed for specific domains and modalities. This study explores the use of GPT-4V(ision), a powerful visual-linguistic model, to address anomaly detection tasks in a generic manner. We investigate the application of GPT-4V in multi-modality, multi-domain anomaly detection tasks, including image, video, point cloud, and time series data, across multiple application areas, such as industrial, medical, logical, video, 3D anomaly detection, and localization tasks. To enhance GPT-4V's performance, we incorporate different kinds of additional cues such as class information, human expertise, and reference images as prompts. Based on our experiments, GPT-4V proves to be highly effective in detecting and explaining global and fine-grained semantic patterns in zero/one-shot anomaly detection. This enables accurate differentiation between normal and abnormal instances. Overall, GPT-4V exhibits promising performance in generic anomaly detection and understanding, thus opening up a new avenue for anomaly detection.
All cases and corresponding prompts for the evaluations can be found in ./Cases.
The evaluations were conducted before November 7, 2023. GPT-4V has shown gradual improvement and may deliver even stronger performance in the future.
- Evaluate quantitative results
- Evaluate GPT-4V in multi-round conversations
GPT-4V demonstrates robust anomaly detection capabilities in various multi-modal and multi-field tasks. It excels in comprehending image context, discerning normal standards, and comparing provided images effectively.
- Multi-Modality Anomaly Detection: GPT-4V handles diverse data types like images, point clouds, and X-rays, making it adaptable to multi-modal tasks, surpassing single-modal detectors.
- Multi-Field Anomaly Detection: GPT-4V performs well in industrial, medical, pedestrian, traffic, and time series anomaly detection, showcasing its versatility across domains.
- Zero/One-Shot Anomaly Detection: GPT-4V adapts to different inference scenarios, utilizing language prompts for anomaly detection, with or without reference images.
- Global Semantics: GPT-4V recognizes overarching abnormal patterns or behaviors, making it suitable for identifying anomalies in broader contexts, e.g., traffic anomaly detection.
- Fine-Grained Semantics: GPT-4V precisely localizes anomalies within complex data, enhancing its ability to detect subtle irregularities, e.g., industrial image anomaly detection.
GPT-4V automatically reasons complex normal standards and provides explanations for detected anomalies. It adds interpretability to its results, making it valuable for understanding irregularities in various domains.
Additional prompts, such as class information, human expertise, and reference images, further improve GPT-4V's anomaly detection performance.
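As an illustration, these cues can be assembled into a single multimodal message. The sketch below is a hypothetical helper, not the exact prompting procedure used in the study (the real prompts are in ./Cases); it builds an OpenAI-style message payload combining class information, an expert hint, and an optional one-shot reference image:

```python
def build_anomaly_prompt(query_image_b64, class_name=None,
                         expert_hint=None, ref_image_b64=None):
    """Assemble a GPT-4V message payload from optional cues.

    Hypothetical helper for illustration only; the study's actual
    prompts live under ./Cases.
    """
    text = "Decide whether the following image contains an anomaly."
    if class_name:
        text += f" The object class is '{class_name}'."
    if expert_hint:
        text += f" Expert note: {expert_hint}"
    content = [{"type": "text", "text": text}]
    if ref_image_b64:  # one-shot: show a normal reference image first
        content.append({"type": "text", "text": "Reference (normal) image:"})
        content.append({"type": "image_url", "image_url": {
            "url": f"data:image/png;base64,{ref_image_b64}"}})
        content.append({"type": "text", "text": "Query image:"})
    content.append({"type": "image_url", "image_url": {
        "url": f"data:image/png;base64,{query_image_b64}"}})
    return [{"role": "user", "content": content}]

# Example: zero-shot detection with class information and expertise
messages = build_anomaly_prompt("QUERY_B64", class_name="hazelnut",
                                expert_hint="Look for cracks on the shell.")
```

The returned list can be passed as the `messages` argument of a chat-completion request against a vision-capable model.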
While GPT-4V shows promise, challenges exist in highly complex scenarios, such as industrial applications and ethical constraints in the medical field. Further enhancements and fine-tuning may be required to address these challenges and unlock its potential.
Task Introduction: Industrial image anomaly detection is a critical component of manufacturing processes aimed at upholding product quality. Following the establishment of the MVTec AD dataset, various methods have thrived in this field.
Task Introduction: Industrial image anomaly localization entails a more intricate process than mere image anomaly detection. GPT-4V's potential for image anomaly localization warrants further exploration.
Task Introduction: Geometrical information plays a crucial role in fields like industrial anomaly detection, especially for categories lacking texture information. CPMF addresses this by transforming point clouds into depth images.
Task Introduction: In addition to structural anomalies, there exists another type of anomaly, known as logical anomalies. Existing logical anomaly detection methods have typically relied solely on visual context.
Task Introduction: Anomaly detection in medical imaging is pivotal for early diagnosis and effective treatment planning. GPT-4V shows promise for enhancing anomaly detection across various medical imaging modalities.
Task Introduction: Anomaly localization is imperative for clinicians to understand the extent and nature of the pathology.
Task Introduction: Pedestrian anomaly detection is dedicated to recognizing irregular activities within pedestrian interactions captured in video streams.
Task Introduction: Traffic anomaly detection aims at identifying the commencement and conclusion of abnormal events in traffic scenarios.
Task Introduction: Time series anomaly detection refers to the task of identifying unusual or abnormal patterns in sequential data over time.
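Since GPT-4V consumes text, one straightforward way to present a time series to it (our assumption for illustration, not necessarily the study's exact procedure; the real prompts are in ./Cases) is to serialize the numeric sequence into the prompt:

```python
def serialize_series(values, precision=2, window=64):
    """Render the most recent `window` points of a univariate time
    series as a comma-separated string for a language-model prompt.

    Hypothetical helper for illustration only.
    """
    recent = values[-window:]
    body = ", ".join(f"{v:.{precision}f}" for v in recent)
    return (
        "The following is a univariate time series sampled at equal "
        f"intervals: [{body}]. Identify any anomalous points and "
        "explain why they deviate from the normal pattern."
    )

# Example: a series with one obvious spike
prompt = serialize_series([1.0, 1.1, 0.9, 9.7, 1.0])
```

Windowing keeps long series within the model's context budget; rounding keeps the prompt compact without hiding the spike.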
For more details, please refer to the document.
If you find this study useful in your research or applications, please cite it using the following BibTeX:
@article{cao2023genericad,
title={Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead},
author={Yunkang Cao and Xiaohao Xu and Chen Sun and Xiaonan Huang and Weiming Shen},
journal={arXiv preprint arXiv:2311.02782},
year={2023}
}
Our study is largely inspired by GPT-4V, GPT-4V for Medical, SoM, CPMF. Thanks for their wonderful work!
@article{yang2023dawnoflmms,
title={The Dawn of LMMs: Preliminary Explorations with GPT-4V (ision)},
author={Yang, Zhengyuan and Li, Linjie and Lin, Kevin and Wang, Jianfeng and Lin, Chung-Ching and Liu, Zicheng and Wang, Lijuan},
journal={arXiv preprint arXiv:2309.17421},
year={2023}
}
@article{yang2023SoM,
title = {Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in {GPT}-4V},
journal = {arXiv preprint arXiv:2310.11441},
year = {2023},
author = {Yang, Jianwei and Zhang, Hao and Li, Feng and Zou, Xueyan and Li, Chunyuan and Gao, Jianfeng},
}
@article{liu2023hallusionbench,
title={HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V (ision), LLaVA-1.5, and Other Multi-modality Models},
author={Liu, Fuxiao and Guan, Tianrui and Li, Zongxia and Chen, Lichang and Yacoob, Yaser and Manocha, Dinesh and Zhou, Tianyi},
journal={arXiv preprint arXiv:2310.14566},
year={2023}
}
@article{wu2023cangpt,
title = {Can {GPT}-4V(ision) Serve Medical Applications? Case Studies on {GPT}-4V for Multimodal Medical Diagnosis},
author = {Wu, Chaoyi and Lei, Jiayu and Zheng, Qiaoyu and Zhao, Weike and Lin, Weixiong and Zhang, Xiaoman and Zhou, Xiao and Zhao, Ziheng and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
journal={arXiv preprint arXiv:2310.09909},
year={2023}
}
@article{cao2023cpmf,
title = {Complementary Pseudo Multimodal Feature for Point Cloud Anomaly Detection},
journal = {arXiv preprint arXiv:2303.13194},
author = {Cao, Yunkang and Xu, Xiaohao and Shen, Weiming},
year = {2023},
}