Divided spectro-temporal transformer for sound event localization and detection in real scenes

Sound event localization and detection (SELD) combines sound event detection (SED) with estimation of the direction of arrival (DoA) of the detected events from multichannel sound signals. Recent SELD research has predominantly focused on deep neural network (DNN) models that emphasize learning temporal context, such as the convolutional recurrent neural network (CRNN) and the ResNet-Conformer architecture, which treat spectral and channel information only as embeddings of temporal features. To fully exploit spectral information, which provides a crucial cue for both SED and DoA estimation, a network architecture is needed that learns both spectral and temporal contexts effectively. In this regard, we propose a divided transformer architecture that attends to the spectral and temporal contexts separately, encouraging the model to learn more of the spectral characteristics of signals while retaining the temporal context. The efficacy of the divided spectro-temporal transformer is validated on the DCASE 2022 and 2023 Challenge Task 3 datasets. Furthermore, a series of parameter studies carried out to optimize SELD performance shows that the number of frequency bins used for attention and the location of pooling affect performance, and that the divided spectro-temporal transformer benefits both SED and DoA estimation.
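
The sketch below illustrates, in PyTorch, one way a divided spectro-temporal attention block of this kind could look: self-attention is applied first across frequency bins within each frame and then across frames within each frequency bin. The module name, embedding size, head count, and feed-forward design are illustrative assumptions; the paper's actual layer configuration, pooling placement, and number of attention frequency bins are not specified in this abstract.

# Minimal, illustrative sketch of divided spectro-temporal self-attention.
# Names and hyperparameters are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class DividedSpectroTemporalBlock(nn.Module):
    """Self-attention along the frequency axis, then along the time axis,
    so that spectral and temporal contexts are modeled separately."""

    def __init__(self, embed_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.spec_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.temp_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm_spec = nn.LayerNorm(embed_dim)
        self.norm_temp = nn.LayerNorm(embed_dim)
        self.norm_ffn = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, freq, embed_dim)
        b, t, f, d = x.shape

        # Spectral attention: attend across frequency bins within each frame.
        xs = x.reshape(b * t, f, d)
        out, _ = self.spec_attn(xs, xs, xs)
        x = self.norm_spec(xs + out).reshape(b, t, f, d)

        # Temporal attention: attend across frames within each frequency bin.
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, d)
        out, _ = self.temp_attn(xt, xt, xt)
        x = self.norm_temp(xt + out).reshape(b, f, t, d).permute(0, 2, 1, 3)

        # Position-wise feed-forward with a residual connection.
        return self.norm_ffn(x + self.ffn(x))

if __name__ == "__main__":
    block = DividedSpectroTemporalBlock()
    features = torch.randn(2, 100, 16, 128)   # (batch, frames, freq bins, embedding)
    print(block(features).shape)              # torch.Size([2, 100, 16, 128])

Factorizing the attention this way keeps the per-block cost near O(T·F² + F·T²) rather than the O((T·F)²) of joint spectro-temporal attention, while still letting every time-frequency position exchange information along both axes.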
Publisher
Acoustical Society of America
Issue Date
2023-12-07
Language
English
Citation
Acoustics 2023
DOI
10.1121/10.0023458
URI
http://hdl.handle.net/10203/316670
Appears in Collection
EE-Conference Papers (Conference Papers)