Exploiting temporal context in CNN based multisource DOA estimation

Abstract
Supervised learning methods are a powerful tool for direction of arrival (DOA) estimation because they can cope with adverse conditions where simplified models fail. In this work, we consider a previously proposed convolutional neural network (CNN) approach that estimates the DOAs for multiple sources from the phase spectra of the microphones. For speech, specifically, the approach was shown to work well even when trained entirely on synthetically generated data. However, as each frame is processed separately, temporal context cannot be taken into account. This prevents the exploitation of interframe signal correlations, and the fact that DOAs do not change arbitrarily over time. We therefore consider two different extensions of the CNN: the integration of a long short-term memory (LSTM) layer, or of a temporal convolutional network (TCN). In order to accommodate the incorporation of temporal context, the training data generation framework needs to be adjusted. To obtain an easily parameterizable model, we propose to employ Markov chains to realize a gradual evolution of the source activity at different times, frequencies, and directions, throughout a training sequence. A thorough evaluation demonstrates that the proposed configuration for generating training data is suitable for the tasks of single-, and multi-talker localization. In particular, we note that with temporal context, it is important to use speech, or realistic signals in general, for the sources. Experiments with recorded impulse responses and noise reveal that the CNN with the LSTM extension outperforms all other considered approaches, including the plain CNN, and the TCN extension.
Keywords
ARRIVAL ESTIMATION, NEURAL-NETWORKS, LOCALIZATION, MULTIPLE, SIGNALS, Direction-of-arrival estimation, Estimation, Training, Feature extraction, Training data, Time-frequency analysis, Microphone arrays, Convolutional neural networks, direction-of-arrival, temporal context, training data generation

Downloads

  • (...).pdf
    • full text (Published version) | UGent only | PDF | 3.25 MB
  • DS422 acc.pdf
    • full text (Accepted manuscript) | open access | PDF | 451.85 KB

Citation

Please use this url to cite or link to this publication:

MLA
Bohlender, Alexander, et al. “Exploiting Temporal Context in CNN Based Multisource DOA Estimation.” IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, vol. 29, 2021, pp. 1594–608, doi:10.1109/TASLP.2021.3067113.
APA
Bohlender, A., Spriet, A., Tirry, W., & Madhu, N. (2021). Exploiting temporal context in CNN based multisource DOA estimation. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 29, 1594–1608. https://doi.org/10.1109/TASLP.2021.3067113
Chicago author-date
Bohlender, Alexander, Ann Spriet, Wouter Tirry, and Nilesh Madhu. 2021. “Exploiting Temporal Context in CNN Based Multisource DOA Estimation.” IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING 29: 1594–1608. https://doi.org/10.1109/TASLP.2021.3067113.
Chicago author-date (all authors)
Bohlender, Alexander, Ann Spriet, Wouter Tirry, and Nilesh Madhu. 2021. “Exploiting Temporal Context in CNN Based Multisource DOA Estimation.” IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING 29: 1594–1608. doi:10.1109/TASLP.2021.3067113.
Vancouver
1. Bohlender A, Spriet A, Tirry W, Madhu N. Exploiting temporal context in CNN based multisource DOA estimation. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING. 2021;29:1594–608.
IEEE
[1] A. Bohlender, A. Spriet, W. Tirry, and N. Madhu, “Exploiting temporal context in CNN based multisource DOA estimation,” IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, vol. 29, pp. 1594–1608, 2021.
@article{8710290,
  abstract     = {{Supervised learning methods are a powerful tool for direction of arrival (DOA) estimation because they can cope with adverse conditions where simplified models fail. In this work, we consider a previously proposed convolutional neural network (CNN) approach that estimates the DOAs for multiple sources from the phase spectra of the microphones. For speech, specifically, the approach was shown to work well even when trained entirely on synthetically generated data. However, as each frame is processed separately, temporal context cannot be taken into account. This prevents the exploitation of interframe signal correlations, and the fact that DOAs do not change arbitrarily over time. We therefore consider two different extensions of the CNN: the integration of a long short-term memory (LSTM) layer, or of a temporal convolutional network (TCN). In order to accommodate the incorporation of temporal context, the training data generation framework needs to be adjusted. To obtain an easily parameterizable model, we propose to employ Markov chains to realize a gradual evolution of the source activity at different times, frequencies, and directions, throughout a training sequence. A thorough evaluation demonstrates that the proposed configuration for generating training data is suitable for the tasks of single-, and multi-talker localization. In particular, we note that with temporal context, it is important to use speech, or realistic signals in general, for the sources. Experiments with recorded impulse responses and noise reveal that the CNN with the LSTM extension outperforms all other considered approaches, including the plain CNN, and the TCN extension.}},
  author       = {{Bohlender, Alexander and Spriet, Ann and Tirry, Wouter and Madhu, Nilesh}},
  issn         = {{2329-9290}},
  journal      = {{IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING}},
  keywords     = {{ARRIVAL ESTIMATION,NEURAL-NETWORKS,LOCALIZATION,MULTIPLE,SIGNALS,Direction-of-arrival estimation,Estimation,Training,Feature extraction,Training data,Time-frequency analysis,Microphone arrays,Convolutional neural networks,direction-of-arrival,temporal context,training data generation}},
  language     = {{eng}},
  pages        = {{1594--1608}},
  title        = {{Exploiting temporal context in CNN based multisource DOA estimation}},
  url          = {{http://doi.org/10.1109/TASLP.2021.3067113}},
  volume       = {{29}},
  year         = {{2021}},
}
