Cetacean call recognition and classification model based on multimodal MAE data augmentation network

LIU Yueyue; NIU Qiuna; SUN Yue; WANG Jingjing; SHI Wei

doi:10.11993/j.issn.2096-3920.2026-0052


Article Navigation > Journal of Unmanned Undersea Systems > 2026 > Accepted Manuscript

LIU Yueyue, NIU Qiuna, SUN Yue, WANG Jingjing, SHI Wei. Cetacean call recognition and classification model based on multimodal MAE data augmentation network[J]. Journal of Unmanned Undersea Systems. doi: 10.11993/j.issn.2096-3920.2026-0052



Shandong Key Laboratory of Intelligent Network for Underwater Equipment, College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China

Received Date: 2026-03-10
Revised Date: 2026-04-27
Accepted Date: 2026-05-13

Available Online: 2026-05-28

Abstract


Call recognition and classification based on passive acoustic monitoring are essential tools for marine animal conservation and population surveys. Data augmentation is of practical value and research interest for addressing the data scarcity and inter-class imbalance that affect these tasks. However, marine animal calls carry rich acoustic information, and frequency-domain feature extraction alone cannot model the structure and semantics of the audio, which makes it difficult to capture the deep features of calls. This paper therefore proposes a data augmentation network based on a multimodal masked autoencoder (MAE-MF) that overcomes the limitations of single-modality information. The network uses Mel-spectrograms as the primary modality, combines them with temporal features and frame-level statistical metrics to form multimodal inputs, and incorporates semantic labels as conditional guidance for reconstruction. To validate the effectiveness and practical value of the augmentation network, a cetacean call recognition and classification model is further built on MAE-MF. Experiments on the Watkins dataset show that the proposed method improves spectrogram reconstruction quality compared with mainstream algorithms. It achieves an average recognition accuracy of 97.6% across six cetacean species, an improvement of 6.72 percentage points over the baseline MAE method. The scheme effectively alleviates inter-class imbalance and provides reliable technical support for cetacean conservation research.
Keywords:
- data augmentation
- masked autoencoder
- multimodal
- marine mammal calls
- spectrogram reconstruction
- call recognition
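
The masked-reconstruction idea behind an MAE-style augmentation network can be illustrated with a minimal sketch. This is not the authors' MAE-MF implementation. The patch size, the 75% masking ratio (the default reported for the original MAE), the function names `random_patch_mask` and `frame_statistics`, and the choice of per-frame mean and standard deviation as frame-level statistics are all assumptions made for illustration, and a random array stands in for a real Mel-spectrogram.

```python
import numpy as np


def random_patch_mask(spec, patch=(16, 16), mask_ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping patches (MAE-style masking).

    spec: 2-D array of shape (n_mels, n_frames); both dimensions must be
    divisible by the patch size. Returns the masked copy and a boolean
    patch grid in which True marks a masked patch.
    """
    n_mels, n_frames = spec.shape
    ph, pw = patch
    if n_mels % ph or n_frames % pw:
        raise ValueError("spectrogram shape must be divisible by the patch size")
    grid = (n_mels // ph, n_frames // pw)
    n_patches = grid[0] * grid[1]
    n_masked = int(round(mask_ratio * n_patches))

    # Pick which patches to hide (assumed: uniform random selection).
    rng = np.random.default_rng(seed)
    masked_ids = rng.permutation(n_patches)[:n_masked]
    patch_mask = np.zeros(n_patches, dtype=bool)
    patch_mask[masked_ids] = True
    patch_mask = patch_mask.reshape(grid)

    # Expand the patch grid to pixel resolution and blank the masked patches.
    pixel_mask = np.repeat(np.repeat(patch_mask, ph, axis=0), pw, axis=1)
    masked = np.where(pixel_mask, 0.0, spec)
    return masked, patch_mask


def frame_statistics(spec):
    """Hypothetical frame-level statistics: per-frame mean and standard
    deviation over the frequency axis, returned with shape (2, n_frames)."""
    return np.stack([spec.mean(axis=0), spec.std(axis=0)])


# Stand-in for a log-Mel spectrogram with 64 Mel bands and 128 frames.
spec = np.random.default_rng(1).random((64, 128))
masked, patch_mask = random_patch_mask(spec)
stats = frame_statistics(spec)
print(masked.shape, patch_mask.shape, stats.shape)
```

In a full pipeline, a network would be trained to reconstruct the hidden patches from the visible ones, with the auxiliary features and class labels supplied as additional inputs; that training step is outside the scope of this sketch.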

