Cetacean call recognition and classification model based on multimodal MAE data augmentation network

LIU Yueyue, NIU Qiuna, SUN Yue, WANG Jingjing, SHI Wei

Citation: LIU Yueyue, NIU Qiuna, SUN Yue, WANG Jingjing, SHI Wei. Cetacean call recognition and classification model based on multimodal MAE data augmentation network[J]. Journal of Unmanned Undersea Systems. doi: 10.11993/j.issn.2096-3920.2026-0052

doi: 10.11993/j.issn.2096-3920.2026-0052
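Funding: National Natural Science Foundation of China, General Program (62571290); National Natural Science Foundation of China, Joint Fund Key Program (U24A20215); National Natural Science Foundation of China, General Program (62171246); Natural Science Foundation of Shandong Province, General Program (ZR2025MS1085).
Details
    About the author:

    LIU Yueyue (born 1999), female, master's degree; main research interest: bionic underwater acoustic communication

  • CLC number: TJ630; TN929.3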

Cetacean call recognition and classification model based on multimodal MAE data augmentation network

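  • Abstract: Call recognition and classification in passive acoustic monitoring is an important tool for marine animal conservation and population surveys. Data scarcity and inter-class imbalance are persistent problems in this task, which makes data augmentation both practically valuable and worth studying. However, marine animal calls carry rich acoustic information, and methods that rely only on frequency-domain feature extraction cannot model the structure and semantics of the audio, so they struggle to capture the deeper characteristics of the calls. To address this, the paper proposes a data augmentation network based on a multimodal masked autoencoder (MAE-MF). It goes beyond single-modality input by taking the Mel spectrogram as the primary modality and fusing it with temporal features and frame-level statistical metrics to form a multimodal input, and it introduces semantic labels as a condition to guide reconstruction. To validate the effectiveness and practical value of the augmentation network, a cetacean call recognition and classification model is built on top of it. On the Watkins dataset, the model reconstructs spectrograms better than mainstream algorithms. In experiments, the average recognition accuracy over six cetacean species reaches 97.6%, which is 6.72 percentage points higher than the baseline MAE method. The approach effectively mitigates class imbalance and provides reliable technical support for cetacean conservation research.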
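The abstract does not list which temporal features or frame-level statistics are fused with the Mel spectrogram. As a minimal sketch of what a frame-level side channel could look like, the following computes three common descriptors per frame with NumPy only. The choice of RMS energy, zero-crossing rate, and spectral centroid, the frame length, the hop size, and all function names are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def frame_level_features(x, sr, frame_len=1024, hop=512):
    """Return an (n_frames, 3) array: RMS energy, zero-crossing rate,
    spectral centroid in Hz. Feature choice is an assumption."""
    frames = frame_signal(x, frame_len, hop)

    # Short-time RMS energy of each frame.
    rms = np.sqrt(np.mean(frames ** 2, axis=1))

    # Zero-crossing rate: fraction of adjacent sample pairs whose sign differs.
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

    # Spectral centroid: magnitude-weighted mean frequency of each frame.
    window = np.hanning(frame_len)
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    centroid = (mag @ freqs) / (mag.sum(axis=1) + 1e-12)

    return np.stack([rms, zcr, centroid], axis=1)

# Synthetic 1 kHz tone standing in for a recording (hypothetical input).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
feats = frame_level_features(tone, sr)
print(feats.shape)
print(feats[0])
```

In a multimodal setup of the kind the abstract describes, a per-frame matrix like `feats` would be aligned with the time axis of the Mel spectrogram before fusion; how MAE-MF performs that fusion is not specified here.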

     

  • Figure 1. Architecture of the cetacean call recognition and classification model based on the MAE-MF data augmentation network

    Figure 2. Structure of temporal feature and frame-level metric extraction

    Figure 3. Schematic of the different masking strategies

    Figure 4. Encoder structure

    Figure 5. MLP structure

    Figure 6. Mel spectrogram reconstruction at a 75% mask ratio

    Figure 7. MSE and PSNR versus mask ratio

    Figure 8. Classification performance of different networks

    Figure 9. Recognition and classification performance before and after data augmentation

    Figure 10. Confusion matrices of the MAE-MF model before and after data augmentation

    Figure 11. Effect of different combinations on classification performance

    Figure 12. Performance of different masking strategies and mask ratios
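The captions refer to several masking strategies and to a 75% mask ratio, but this page does not describe the strategies themselves. The sketch below shows only uniform random patch masking, the strategy used by the original masked autoencoder, applied to a spectrogram-shaped array. The patch size, the zero fill value, and the function name `random_patch_mask` are assumptions for illustration.

```python
import numpy as np

def random_patch_mask(spec, patch=(16, 16), mask_ratio=0.75, seed=0):
    """Mask a fraction of non-overlapping patches of a 2-D array.

    Returns the masked array, the boolean patch grid (True = masked),
    and the same mask expanded to pixel resolution.
    """
    rng = np.random.default_rng(seed)
    n_mels, n_frames = spec.shape
    ph, pw = patch
    if n_mels % ph or n_frames % pw:
        raise ValueError("spectrogram shape must be divisible by patch size")

    gh, gw = n_mels // ph, n_frames // pw
    n_patches = gh * gw
    n_masked = int(round(mask_ratio * n_patches))

    # Pick the patches to hide uniformly at random.
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.permutation(n_patches)[:n_masked]] = True
    grid = flat.reshape(gh, gw)

    # Expand the patch-level mask to pixel level.
    pixel_mask = np.repeat(np.repeat(grid, ph, axis=0), pw, axis=1)
    masked = np.where(pixel_mask, 0.0, spec)
    return masked, grid, pixel_mask

# Random array standing in for a 128 x 128 Mel spectrogram (hypothetical input).
spec = np.random.default_rng(1).random((128, 128))
masked, grid, pixel_mask = random_patch_mask(spec)
print(grid.shape, int(grid.sum()), "patches masked of", grid.size)
```

In an MAE, only the visible patches are passed to the encoder and the decoder is trained to reconstruct the hidden ones. Structured variants, such as masking whole time or frequency bands, would change only how `flat` is chosen.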

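    Table 1. Species distribution and sample size of the Watkins dataset

    | No. | Species | Sample count |
    |---|---|---|
    | 1 | Killer whale | 2 134 |
    | 2 | Humpback whale | 1 204 |
    | 3 | Sperm whale | 1 922 |
    | 4 | Long-finned pilot whale | 1 616 |
    | 5 | Minke whale | 607 |
    | 6 | Bottlenose dolphin | 965 |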

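    Table 2. Data augmentation factors

    | No. | Species | Sample count | Augmentation factor | Sample count after augmentation |
    |---|---|---|---|---|
    | 1 | Killer whale | 2 134 | 1.2 | 2 560 |
    | 2 | Humpback whale | 1 204 | 2.1 | 2 528 |
    | 3 | Sperm whale | 1 922 | 1.3 | 2 498 |
    | 4 | Long-finned pilot whale | 1 616 | 1.6 | 2 586 |
    | 5 | Minke whale | 607 | 4.0 | 2 428 |
    | 6 | Bottlenose dolphin | 965 | 2.6 | 2 509 |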
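To make the effect of the augmentation factors concrete, the following sketch recomputes each post-augmentation class size as sample count times factor and compares the class imbalance (largest class divided by smallest class) before and after. The numbers are taken from Tables 1 and 2. The imbalance ratio is a measure chosen here for illustration, and the small differences between the product and the reported size presumably come from rounding, which the page does not explain.

```python
# (species, original count, augmentation factor, reported count after augmentation)
rows = [
    ("killer whale",            2134, 1.2, 2560),
    ("humpback whale",          1204, 2.1, 2528),
    ("sperm whale",             1922, 1.3, 2498),
    ("long-finned pilot whale", 1616, 1.6, 2586),
    ("minke whale",              607, 4.0, 2428),
    ("bottlenose dolphin",       965, 2.6, 2509),
]

def imbalance_ratio(sizes):
    """Largest class size divided by smallest class size."""
    return max(sizes) / min(sizes)

for name, n, factor, reported in rows:
    product = n * factor
    synthetic = reported - n  # samples the augmentation network would need to add
    print(f"{name:24s} {n:5d} x {factor:.1f} = {product:7.1f} "
          f"(reported {reported}, about {synthetic} synthetic)")

before = imbalance_ratio([n for _, n, _, _ in rows])
after = imbalance_ratio([r for _, _, _, r in rows])
print(f"imbalance ratio before: {before:.2f}, after: {after:.2f}")
```

The factors are roughly inversely proportional to the original class sizes, so every class ends up near 2 500 samples.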

    Table 3. Model training hyperparameters

    | Stage | Optimizer | Initial learning rate | Batch size | Epochs | Weight decay |
    |---|---|---|---|---|---|
    | Training | AdamW | 0.000 2 | 128 | 100 | $5\times 10^{-2}$ |
    | Fine-tuning | AdamW | 0.000 1 | 64 | 50 | $1\times 10^{-2}$ |
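The hyperparameters in Table 3 can be captured in a small configuration object. This is only a sketch: the class name `StageConfig`, the stage keys, and the idea of passing the values to a framework optimizer are assumptions, and the page does not say which framework the authors used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one stage, as listed in Table 3."""
    optimizer: str
    learning_rate: float
    batch_size: int
    epochs: int
    weight_decay: float

CONFIGS = {
    "training":    StageConfig("AdamW", 2e-4, 128, 100, 5e-2),
    "fine_tuning": StageConfig("AdamW", 1e-4, 64, 50, 1e-2),
}

for stage, cfg in CONFIGS.items():
    # With PyTorch (an assumption), these values would map to
    # torch.optim.AdamW(params, lr=cfg.learning_rate, weight_decay=cfg.weight_decay).
    print(f"{stage}: {cfg}")
```

Fine-tuning uses half the learning rate, half the batch size, half the epochs, and a fivefold smaller weight decay than training.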

    Table 4. Spectrogram reconstruction quality of different models

    | Model | MSE ($\times 10^{-2}$) | PSNR/dB | SSIM |
    |---|---|---|---|
    | DCGAN | 0.30 | 25.23 | 0.82 |
    | MAE-Res2Net | 0.20 | 26.99 | 0.88 |
    | MAE | 0.18 | 27.45 | 0.90 |
    | MAE-MF (this paper) | 0.14 | 28.66 | 0.92 |
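For readers unfamiliar with the metrics in Table 4, here is a minimal sketch of how MSE and PSNR are commonly computed between a reference spectrogram and its reconstruction. It assumes both arrays are normalized to the range [0, 1], so the peak value is 1. That normalization is an assumption, and the authors' evaluation code is not shown on this page. SSIM is omitted to keep the example dependency-free; scikit-image provides it as `skimage.metrics.structural_similarity`.

```python
import numpy as np

def mse(ref, rec):
    """Mean squared error between two arrays of the same shape."""
    ref = np.asarray(ref, dtype=float)
    rec = np.asarray(rec, dtype=float)
    return float(np.mean((ref - rec) ** 2))

def psnr(ref, rec, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(ref, rec)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / err))

# Synthetic stand-ins for a reference spectrogram and a noisy reconstruction.
rng = np.random.default_rng(0)
ref = rng.random((128, 128))
rec = np.clip(ref + rng.normal(0.0, 0.04, ref.shape), 0.0, 1.0)

print(f"MSE  = {mse(ref, rec):.5f}")
print(f"PSNR = {psnr(ref, rec):.2f} dB")
```

Lower MSE and higher PSNR both indicate a reconstruction closer to the reference, which matches the ordering of the models in the table.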

    Table 5. Spectrogram reconstruction quality by species

    | Species | MSE ($\times 10^{-2}$) | PSNR/dB | SSIM |
    |---|---|---|---|
    | Killer whale | 0.12 | 29.21 | 0.94 |
    | Humpback whale | 0.10 | 30.00 | 0.95 |
    | Sperm whale | 0.14 | 28.53 | 0.92 |
    | Long-finned pilot whale | 0.15 | 28.24 | 0.90 |
    | Minke whale | 0.16 | 27.96 | 0.89 |
    | Bottlenose dolphin | 0.19 | 27.21 | 0.86 |
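The MSE and PSNR columns in Tables 4 and 5 can be cross-checked against each other. The sketch below applies the standard relation PSNR = 10 log10(peak² / MSE) with peak = 1 to the reported MSE values. The peak value is an inference from the numbers, not something the page states. Since the MSE values are rounded to two digits, small gaps are expected.

```python
import math

def psnr_from_mse(mse, peak=1.0):
    """PSNR in dB implied by a given MSE, for a signal with the given peak."""
    return 10.0 * math.log10(peak ** 2 / mse)

# name: (reported MSE in units of 1e-2, reported PSNR in dB)
reported = {
    "DCGAN":                   (0.30, 25.23),
    "MAE-Res2Net":             (0.20, 26.99),
    "MAE":                     (0.18, 27.45),
    "MAE-MF":                  (0.14, 28.66),
    "killer whale":            (0.12, 29.21),
    "humpback whale":          (0.10, 30.00),
    "sperm whale":             (0.14, 28.53),
    "long-finned pilot whale": (0.15, 28.24),
    "minke whale":             (0.16, 27.96),
    "bottlenose dolphin":      (0.19, 27.21),
}

for name, (mse_e2, psnr_db) in reported.items():
    implied = psnr_from_mse(mse_e2 * 1e-2)
    print(f"{name:24s} reported {psnr_db:5.2f} dB, implied by MSE {implied:5.2f} dB")

max_gap = max(abs(psnr_from_mse(m * 1e-2) - p) for m, p in reported.values())
print(f"largest gap: {max_gap:.2f} dB")
```

Under this assumption every row agrees to within the precision allowed by the rounded MSE values, which suggests the spectrograms were normalized to a unit peak before evaluation.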
Publication history
  • Received: 2026-03-10
  • Revised: 2026-04-27
  • Accepted: 2026-05-13
  • Published online: 2026-05-28