Lightweight Real-Time Underwater Object Detection Transformer Based on Gated Convolution Network
-
摘要: 针对水下目标检测算法图像特征处理困难、模型结构冗余、参数量庞大等问题, 提出基于轻量化门控卷积网络的实时Transformer水下目标检测方法, 该方法首先基于门控思想构建卷积门控线性单元, 动态调节特征的传递, 并以此为基础提出门控通道交互模块, 该模块通过完全解耦token mixer和channel mixer, 并针对token mixer部分引入结构重新参数化技术, 极大降低模型在推理过程中的计算成本。混合编码器针对门控骨干网络提取的3个特征分别进行尺度内信息交互和多尺度特征融合, 实现浅层高频率信息和深层语义空间信息之间的高度融合。文中模型在多个不同模态数据集上进行了大量实验, 其中mAP@0.5达到了0.849, 整体参数量为23.3, FPS检测帧率为136.8, 保持该系列模型优秀检测精度的同时, 实现了较小的模型参数量和较高的检测帧率, 整体优于其他模型。结果表明, 文中模型与一系列优秀的目标检测模型相比具有优秀的检测性能和高效的实时检测能力。
-
关键词:
- 水下目标检测 /
- 轻量化网络 /
- 门控卷积 /
- Transformer
Abstract: To address the challenges in underwater object detection algorithms, including difficulties in image feature extraction, redundant model architectures, and excessive parameter complexity. This paper proposes a lightweight real-time underwater object detection transformer based on a gated convolution network. The proposed approach involves the construction of a convolutional gated linear unit based on the gating mechanism to dynamically modulate feature transmission. Furthermore, the paper proposes a gated channel interaction module to fully decouple the token mixer from the channel mixer. The structural reparameterization technique significantly reduces the computational cost of the model during inference, and the hybrid encoder promotes the intra-scale information exchange and multi-scale feature fusion of the three features extracted by the gated linear unit backbone, realising the high fusion of shallow high-frequency information and deep semantic spatial information. The model has carried out a large number of experiments on different modal data sets, where mAP@0.5 reaches 0.849, the overall number of parameters is 23.3, and the FPS detection frame rate is 136.8. While maintaining the excellent detection accuracy of this series of models, it achieves a smaller number of model parameters and a higher detection frame rate, which is better than other models.-
Key words:
- underwater object detection /
- lightweight network /
- gated convolution /
- Transformer
-
表 1 在DUO数据集上对比实验结果
Table 1. Comparative experimental results on the DUO dataset
模型 mAP@0.5 mAP@
0.5:0.95Params
(M)FLOPs
(G)FPS Faster R-CNN 0.819 0.613 41.14 63.26 19.2 Cascade R-CNN 0.839 0.500 69.0 72.0 27.3 RetinaNet 0.704 0.461 36.17 52.62 39.2 GFL 0.837 0.655 74.2 47.5 19.8 YOLOv5s 0.813 0.492 7.03 16.0 164.2 YOLOv7 0.801 0.559 6.0 13.3 191.0 YOLOv8 0.812 0.573 3.2 8.7 205.6 Deformable DETR 0.844 0.637 41.3 193.0 16.0 RTDETR 0.844 0.635 38.6 57.0 131.1 RTDETR+GLbone 0.849 0.639 23.3 30.6 136.8 表 2 水下目标检测算法在DUO数据集上对比实验
Table 2. The underwater target detection algorithms are compared on DUO dataset
模型 mAP@0.5 精确率 召回率 F1-score Holothurian Echinus Scallop Starfish Holothurian Echinus Scallop Starfish Boosting-R-CNN 0.832 0.710 0.745 0.416 0.793 0.903 0.919 0.746 0.910 0.754 RoIA 0.828 0.839 0.731 0.451 0.849 0.750 0.896 0.422 0.843 0.723 Rtmdet 0.846 0.755 0.825 0.589 0.794 0.870 0.893 0.644 0.894 0.781 GCC-Net 0.839 0.839 0.731 0.452 0.849 0.750 0.896 0.422 0.843 0.723 文中方法 0.849 0.842 0.864 0.783 0.857 0.800 0.859 0.783 0.857 0.811 表 3 在Trashcan数据集上对比实验结果
Table 3. Comparative experimental results on the Trashcan dataset
模型 mAP@0.5 mAP@
0.5:0.95Params
/MbFLOPs
/GbFPS Faster R-CNN 0.553 0.312 41.14 63.26 19.2 Cascade R-CNN 0.543 0.341 42.00 270.30 9.6 Dino 0.286 0.183 47.56 226.30 61.2 YOLOv7 0.434 0.241 6.00 13.30 191.0 Deformable DETR 0.569 0.361 41.30 193.00 16.0 RTDETR 0.578 0.387 38.60 57.00 131.1 RTDETR+GLbone 0.577 0.384 23.30 30.60 136.8 表 4 在URPC数据集上各类别实验结果
Table 4. Experimental results on URPC dataset
目标 精确率 召回率 mAP@0.5 Ball 0.929 0.938 0.945 Circle cage 0.943 0.947 0.936 Cube 0.963 0.999 0.995 Cylinder 0.946 0.846 0.902 Human body 0.901 0.898 0.907 Metal bucket 0.771 0.931 0.928 Square cage 0.894 0.881 0.837 Tyre 0.960 0.728 0.783 -
[1] XU S, ZHANG M, SONG W, et al. A systematic review and analysis of deep learning-based underwater object detection[J]. Neurocomputing, 2023, 527: 204-232. doi: 10.1016/j.neucom.2023.01.056 [2] YEH C H, LIN C H, KANG L W, et al. Lightweight deep neural network for joint learning of underwater object detection and color conversion[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 33(11): 6129-43. [3] KAUR R, SINGH S. A comprehensive review of object detection with deep learning[J]. Digital Signal Processing, 2023, 132: 103812. doi: 10.1016/j.dsp.2022.103812 [4] GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014: 580-587. [5] HE K, ZHANG X, REN S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904-16. doi: 10.1109/TPAMI.2015.2389824 [6] GIRSHICK R. Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1440-48. [7] HE K, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[C]//2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2020. [8] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-49. doi: 10.1109/TPAMI.2016.2577031 [9] LIU W, ANGUELOV D, ERHAN D, et al. SSD: Single shot multibox detector[C]//Computer Vision-ECCV 2016: 14th European Conference. Amsterdam, The Netherlands: Springer International Publishing, 2016: 21-37. [10] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR). Las Vegas, USA: IEEE, 2016: 779-788. [11] REDMON J, FARHADI A. YOLO9000: better, faster, stronger[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA : IEEE, 2017: 7263-7271. [12] BOCHKOVSKIY A, WANG C Y, LIAO H Y M. YOLOv4: Optimal speed and accuracy of object detection[EB/OL]. [2025-02-13]. http://arxiv.org/abs/2004.10934. [13] Dosovitskiy A. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. [2025-02-13]. https://arxiv.org/abs/2010.11929 [14] 刘麒东, 沈鑫, 刘海路, 等. 基于GPA+CBAM的域自适应水下目标检测方法[J]. 水下无人系统学报, 2024, 32(5): 846-854.LIU Q, SHEN X, LIU H, et al. Domain-adaptive underwater target detection method based on GPA+ CBAM[J]. Journal of Unmanned Undersea Systems, 2024, 32(5): 846-854. [15] 徐凤强. 水下机器人视域中小目标检测方法研究[D]. 大连: 大连海事大学, 2021. [16] KHAN A, FOUDA M M, Do D T, et al. Underwater target detection using deep learning: methodologies, challenges, applications and future evolution[J]. IEEE Access, 2024. [17] DAI L, LIU H, SONG P, et al. A gated cross-domain collaborative network for underwater object detection[J]. Pattern Recognition, 2024, 149: 110222. doi: 10.1016/j.patcog.2023.110222 [18] FANG P, ZHENG M, FEI L, et al. S-FPN: A shortcut feature pyramid network for sea cucumber detection in underwater images[J]. Expert Systems with Applications, 2021, 182: 115306. doi: 10.1016/j.eswa.2021.115306 [19] GAO J, ZHANG Y, GENG X, et al. PE-Transformer: Path enhanced transformer for improving underwater object detection[J]. Expert Systems with Applications, 2024, 246: 123253. doi: 10.1016/j.eswa.2024.123253 [20] KNAUSGÅRD K M, WIKLUND A, SØRDALEN T K, et al. Temperate fish detection and classification: A deep learning based approach[J]. Applied Intelligence, 2022, 52(6): 6988-7001. doi: 10.1007/s10489-020-02154-9 [21] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]//European conference on computer vision. Cham: Springer International Publishing, 2020: 213-229. [22] ZHANG L, YANG K, HAN Y, et al. TSD-DETR: A lightweight real-time detection transformer of traffic sign detection for long-range perception of autonomous driving[J]. Engineering Applications of Artificial Intelligence, 2025, 139: 109536. doi: 10.1016/j.engappai.2024.109536 [23] ZHAO Y, LV W, XU S, et al. Detrs beat YOLOs on real-time object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 16965-16974. [24] WANG A, CHEN H, LIN Z, et al. Repvit: Revisiting mobile cnn from vit perspective[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 15909-15920. [25] DAUPHIN Y N, FAN A, AULI M, et al. Language modeling with gated convolutional networks[J]. The Journal of Machine Learning Research, 2017, 70: 933-941. [26] YU W, LUO M, ZHOU P, et al. Metaformer is actually what you need for vision[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 10819-29. [27] SHI D. TransNeXt: Robust foveal visual perception for vision transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 17773-83. -