
Analysis of Vision-Language Models for Underwater Mine Detection and Classification
Ⓒ 2026 Korea Society for Naval Science & Technology
Abstract
Underwater mines play a strategic role in destroying ships and vessels and enforcing maritime blockades, making mine detection research essential. Accordingly, a model is required that can adapt to rapidly changing marine environments, diverse mine types, and operational shifts in wartime. Vision-language models (VLMs) process images and text jointly, making them well suited to the frequently changing underwater environment. This paper analyzes the latest VLMs and explores methods for applying them to underwater mine detection and classification.
Keywords:
Underwater Mine Detection, Vision-Language Model, Multimodal Dataset, Image Classification, Open-World Object Detection
Acknowledgments
This research was supported by the Korea Research Institute for defense Technology planning and advancement (KRIT), funded by the government (Defense Acquisition Program Administration) in 2023 (No. KRIT-CT-23-035-03, AI-based Underwater Mine Detection Technology (Swarm Operation Technology for Mine-Detection Unmanned Underwater Vehicles)).