Crowd Speaker Identification Methodologies, Datasets And Features: Review
DOI: https://doi.org/10.37385/jaets.v6i1.4952

Keywords: Crowd Speaker Identification, Deep Learning, Speech Features, Crowd Datasets

Abstract
Crowded speech, or overlapping speech, occurs when multiple individuals speak simultaneously, a common occurrence in real-life scenarios such as telephone conversations, meetings, and debates. The critical task in these situations is to identify all the speakers rather than just one. Overlapping speech identification is a significant research domain with applications in human-machine interaction and criminal detection in airports, train stations, and other public spaces. Our work examines crowd speech identification from four perspectives: the most commonly used datasets, the most effective features for crowd speaker identification, the best methodologies employed, and the highest results achieved. This study presents a comprehensive survey of research on crowd speech identification covering the period from 2016 to the present. The survey includes ninety research papers, fifty of which are empirical studies. Initially, statistical methods were predominant, but the current trend leans towards artificial intelligence, particularly deep learning, which has demonstrated considerable efficacy in this field.
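Among the speech features the survey discusses, mel-frequency cepstral coefficients (MFCCs) are the classic front-end for speaker identification systems. As a hedged illustration (not taken from the surveyed papers), the sketch below computes MFCCs from scratch with NumPy: framing with a Hann window, power spectrum, a triangular mel filterbank, log compression, and a DCT-II. All parameter values (16 kHz sample rate, 512-point FFT, 26 mel bands, 13 coefficients) are common defaults, not prescriptions from this review.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Toy MFCC extractor: framing -> power spectrum -> mel filterbank -> log -> DCT-II."""
    window = np.hanning(n_fft)
    # frame the signal with a sliding Hann window
    frames = np.array([signal[s:s + n_fft] * window
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # triangular mel filterbank between 0 Hz and the Nyquist frequency
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope

    logmel = np.log(power @ fb.T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T

# 1-second synthetic tone as a stand-in for real speech
t = np.linspace(0, 1, 16000, endpoint=False)
feats = mfcc(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (97, 13): 97 frames, 13 coefficients each
```

In the overlapping-speech setting such frame-level features typically feed a CNN or permutation-invariant training pipeline rather than a single-speaker classifier, which is what distinguishes the methodologies surveyed here from classic speaker identification.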
References
Abdulmohsin, H. A. (2022). Automatic Health Speech Prediction System Using Support Vector Machine. Proceedings of International Conference on Computing and Communication Networks: ICCCN 2021, 165–175.
Abdulmohsin, H. A., Al-Khateeb, B., Hasan, S. S., & Dwivedi, R. (2022). Automatic illness prediction system through speech. Computers and Electrical Engineering, 102, 108224.
Abdulmohsin, H. A., Stephan, J. J., Al-Khateeb, B., & Hasan, S. S. (2022). Speech Age Estimation Using a Ranking Convolutional Neural Network. Proceedings of International Conference on Computing and Communication Networks: ICCCN 2021, 123–130.
Alsalam, E. A., Razoqi, S. A., & Ahmed, E. F. (2021). Effects of using static methods with contourlet transformation on speech compression. Iraqi Journal of Science, 62(8), 2784–2795. https://doi.org/10.24996/ijs.2021.62.8.31
Andrei, V., Cucu, H., & Burileanu, C. (2017). Detecting Overlapped Speech on Short Timeframes Using Deep Learning. Interspeech, 1198–1202.
Arons, B. (1992). A review of the cocktail party effect. Journal of the American Voice I/O Society, 12(7), 35–50.
Bansal, P., Singh, V., & Beg, M. T. (2019). A multi-featured hybrid model for speaker recognition on multi-person speech. Journal of Electrical Engineering & Technology, 14, 2117–2125.
Bronkhorst, A. W. (2015). The cocktail-party problem revisited: early processing and selection of multi-talker speech. Attention, Perception, & Psychophysics, 77(5), 1465–1487.
Chang, X., Kanda, N., Gaur, Y., Wang, X., Meng, Z., & Yoshioka, T. (2021). Hypothesis stitcher for end-to-end speaker-attributed asr on long-form multi-talker recordings. ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6763–6767.
Chang, X., Qian, Y., & Yu, D. (2018a). Adaptive permutation invariant training with auxiliary information for monaural multi-talker speech recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5974–5978.
Chang, X., Qian, Y., & Yu, D. (2018b). Monaural Multi-Talker Speech Recognition with Attention Mechanism and Gated Convolutional Networks. INTERSPEECH, 1586–1590.
Chang, X., Zhang, W., Qian, Y., Le Roux, J., & Watanabe, S. (2019). MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition. 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 237–244.
Chen, Z., Li, J., Xiao, X., Yoshioka, T., Wang, H., Wang, Z., & Gong, Y. (2017). Cracking the cocktail party problem by multi-beam deep attractor network. 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 437–444.
Fan, C., Liu, B., Tao, J., Wen, Z., Yi, J., & Bai, Y. (2018). Utterance-level permutation invariant training with discriminative learning for single channel speech separation. 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), 26–30.
Gales, M., & Young, S. (2008). The application of hidden Markov models in speech recognition. Foundations and Trends® in Signal Processing, 1(3), 195–304.
Huang, L., Cheng, G., Zhang, P., Yang, Y., Xu, S., & Sun, J. (2019). Utterance-level permutation invariant training with latency-controlled BLSTM for single-channel multi-talker speech separation. 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 1256–1261.
Kanda, N., Chang, X., Gaur, Y., Wang, X., Meng, Z., Chen, Z., & Yoshioka, T. (2021). Investigation of end-to-end speaker-attributed ASR for continuous multi-talker recordings. 2021 IEEE Spoken Language Technology Workshop (SLT), 809–816.
Kanda, N., Fujita, Y., Horiguchi, S., Ikeshita, R., Nagamatsu, K., & Watanabe, S. (2019). Acoustic modeling for distant multi-talker speech recognition with single-and multi-channel branches. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6630–6634.
Kanda, N., Gaur, Y., Wang, X., Meng, Z., Chen, Z., Zhou, T., & Yoshioka, T. (2020). Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers. ArXiv Preprint ArXiv:2006.10930.
Kanda, N., Gaur, Y., Wang, X., Meng, Z., & Yoshioka, T. (2020). Serialized output training for end-to-end overlapped speech recognition. ArXiv Preprint ArXiv:2003.12687.
Kantowitz, B. H., & Sorkin, R. D. (1983). Human factors: Understanding people-system relationships. Wiley.
Kumar, M. K. P., & Kumaraswamy, R. (2016). Speech separation with EMD as front-end for noise robust co-channel speaker identification. 2016 International Conference on Circuits, Controls, Communications and Computing (I4C), 1–4.
Kunešová, M. (2018). Detection of overlapping speech using a convolutional neural network: first experiments.
Li, Z., & Whitehill, J. (2021). Compositional embedding models for speaker identification and diarization with simultaneous speech from 2+ speakers. ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7163–7167.
Mao, H. H., Li, S., McAuley, J., & Cottrell, G. (2020). Speech recognition and multi-speaker diarization of long conversations. ArXiv Preprint ArXiv:2005.08072.
Mehrish, A., Majumder, N., Bharadwaj, R., Mihalcea, R., & Poria, S. (2023). A review of deep learning techniques for speech processing. Information Fusion, 101869.
Meng, L., Kang, J., Cui, M., Wu, H., Wu, X., & Meng, H. (2023). Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator. ArXiv Preprint ArXiv:2305.16263.
Menne, T., Sklyar, I., Schlüter, R., & Ney, H. (2019). Analysis of deep clustering as preprocessing for automatic speech recognition of sparsely overlapping speech. ArXiv Preprint ArXiv:1905.03500.
Mitchell, O. M. M., Ross, C. A., & Yates, G. H. (1971). Signal processing for a cocktail party effect. The Journal of the Acoustical Society of America, 50(2B), 656–660.
Mohammed, T. S., Aljebory, K. M., Rasheed, M. A. A., Al-Ani, M. S., & Sagheer, A. M. (2021). Analysis of Methods and Techniques Used for Speaker Identification, Recognition, and Verification: A Study on Quarter-Century Research Outcomes. Iraqi Journal of Science, 62(9), 3255–3281. https://doi.org/10.24996/ijs.2021.62.9.38
Mohammed, Z. K., & Abdullah, N. A. Z. (2022). Survey For Arabic Part of Speech Tagging based on Machine Learning. Iraqi Journal of Science, 63(6), 2676–2685. https://doi.org/10.24996/ijs.2022.63.6.33
Qian, Y., Chang, X., & Yu, D. (2018). Single-channel multi-talker speech recognition with permutation invariant training. Speech Communication, 104, 1–11.
Quatieri, T. F., & Danisewicz, R. G. (1990). An approach to co-channel talker interference suppression using a sinusoidal model for speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(1), 56–69.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Sato, H., Ochiai, T., Delcroix, M., Kinoshita, K., Moriya, T., & Kamo, N. (2021). Should we always separate?: Switching between enhanced and observed signals for overlapping speech recognition. ArXiv Preprint ArXiv:2106.00949.
Seki, H., Hori, T., Watanabe, S., Le Roux, J., & Hershey, J. R. (2018). A purely end-to-end system for multi-speaker speech recognition. ArXiv Preprint ArXiv:1805.05826.
Settle, S., Le Roux, J., Hori, T., Watanabe, S., & Hershey, J. R. (2018). End-to-end multi-speaker speech recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4819–4823.
Shakat, A., Arif, K. I., Hasan, S., Dawood, Y., & Mohammed, M. A. (2021). YouTube keyword search engine using speech recognition. Iraqi Journal of Science, 2021, 167–173. https://doi.org/10.24996/ijs.2021.SI.1.23
Shangguan, Y., & Yang, J. (2019). Permutation Invariant Training Based Single-Channel Multi-Talker Speech Recognition with Music Background. 2019 International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM), 427–430.
Shi, Y., & Hain, T. (2021). Supervised speaker embedding de-mixing in two-speaker environment. 2021 IEEE Spoken Language Technology Workshop (SLT), 758–765.
Sun, H., & Ma, B. (2011). Study of overlapped speech detection for NIST SRE summed channel speaker recognition. Twelfth Annual Conference of the International Speech Communication Association.
Svendsen, B., & Kadry, S. (2024). A Dataset for recognition of Norwegian Sign Language. 2, 2–4.
Swadi, H. M., & Ali, H. M. (2019). Mobile-based Human Emotion Recognition based on Speech and Heart rate. Journal of Engineering, 25(11), 55–66. https://doi.org/10.31026/j.eng.2019.11.05
Thakker, M., Vyas, S., Ved, P., & Shanthi Therese, S. (2018). Speaker identification in a multi-speaker environment. Information and Communication Technology for Sustainable Development: Proceedings of ICT4SD 2016, Volume 2, 239–244.
Wang, Y., & Sun, W. (2019). Multi-speaker recognition in cocktail party problem. Communications, Signal Processing, and Systems: Proceedings of the 2017 International Conference on Communications, Signal Processing, and Systems, 2116–2123.
Wu, B., Yu, M., Chen, L., Xu, Y., Weng, C., Su, D., & Yu, D. (2020). Distortionless multi-channel target speech enhancement for overlapped speech recognition. ArXiv Preprint ArXiv:2007.01566.
Yoshioka, T., Erdogan, H., Chen, Z., Xiao, X., & Alleva, F. (2018). Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks. ArXiv Preprint ArXiv:1810.03655.
Yousefi, M., & Hansen, J. H. L. (2020). Frame-based overlapping speech detection using convolutional neural networks. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6744–6748.
Zhang, W., Chang, X., & Qian, Y. (2019). Knowledge Distillation for End-to-End Monaural Multi-Talker ASR System. INTERSPEECH, 2633–2637.