LIU Yueyue, NIU Qiuna, SUN Yue, WANG Jingjing, SHI Wei. Cetacean call recognition and classification model based on multimodal MAE data augmentation network[J]. Journal of Unmanned Undersea Systems. doi: 10.11993/j.issn.2096-3920.2026-0052

Cetacean call recognition and classification model based on multimodal MAE data augmentation network

doi: 10.11993/j.issn.2096-3920.2026-0052
  • Received Date: 2026-03-10
  • Accepted Date: 2026-05-13
  • Revised Date: 2026-04-27
  • Available Online: 2026-05-28
  • Call recognition and classification based on passive acoustic monitoring are essential tools for marine animal conservation and population surveys. Because call datasets are scarce and imbalanced across classes, data augmentation has significant practical and research value. However, marine animal calls carry rich acoustic information, and frequency-domain feature extraction alone cannot model the structure and semantics of the audio, which makes it difficult to capture the deep features of calls. This paper therefore proposes a data augmentation network based on a multimodal masked autoencoder (MAE-MF) that overcomes the limitations of single-modal information. The network uses Mel-spectrograms as the primary modality, combines them with temporal features and frame-level statistical metrics to form multimodal inputs, and incorporates semantic labels as conditional guidance for reconstruction. To validate the effectiveness and practical value of the augmentation network, a cetacean call recognition and classification model is further built on MAE-MF. Experiments on the Watkins dataset show that the proposed method reconstructs spectrograms with higher quality than mainstream algorithms and achieves an average recognition accuracy of 97.6% across six cetacean species, 6.72 percentage points higher than the baseline MAE method. The scheme effectively alleviates inter-class imbalance and provides reliable technical support for cetacean conservation research.
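
The input preparation described in the abstract can be illustrated with a minimal sketch. It is not the authors' MAE-MF implementation: the function names, the 16×16 patch size, the 0.75 mask ratio, the choice of mean and standard deviation as frame-level statistics, and the random array standing in for a Mel-spectrogram are all assumptions made for illustration. Only the general ideas (masking spectrogram patches, adding frame-level statistics, and conditioning on one of six class labels) come from the text.

```python
import numpy as np

def patchify(spec, patch):
    """Split a (n_mels, n_frames) array into flattened patch x patch tiles."""
    n_mels, n_frames = spec.shape
    assert n_mels % patch == 0 and n_frames % patch == 0
    tiles = spec.reshape(n_mels // patch, patch, n_frames // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Return sorted indices of visible and masked patches (MAE-style masking)."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])

def frame_stats(spec):
    """Per-frame mean and standard deviation over Mel bins, shape (n_frames, 2)."""
    return np.stack([spec.mean(axis=0), spec.std(axis=0)], axis=1)

rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 128))    # stand-in for a log-Mel spectrogram (assumed size)
patches = patchify(spec, patch=16)       # 32 patches of 256 values each
visible_idx, masked_idx = random_mask(len(patches), mask_ratio=0.75, rng=rng)
visible = patches[visible_idx]           # the subset an encoder would receive
stats = frame_stats(spec)                # hypothetical frame-level statistics modality
label = np.eye(6)[2]                     # one-hot condition: hypothetical class 2 of 6 species
```

In a full model, the masked patches would be reconstructed by a decoder from the visible patches, the auxiliary features, and the label condition; that part is omitted here because the abstract does not specify it.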

     


