Predicting human speech perception using deep phoneme classifiers

Abstract

Deep learning has brought major performance gains to many research fields, including computer vision, natural language processing, and automatic speech recognition (ASR). In my talk, I will give examples of how this technology, specifically ASR based on deep learning, can be used to build models of speech perception that predict speech intelligibility, perceived speech quality, or subjective listening effort for normal-hearing and hearing-impaired listeners. At the core of these models, a deep neural network computes phoneme probabilities; the model output is obtained by quantifying how these probabilities degrade in the presence of noise, reverberation, or other distortions. In some cases, these algorithms outperform baseline models even though they operate on a mixture of noise and speech, whereas other approaches often require separate noise and speech inputs. The models therefore need less a priori knowledge, which could make them attractive for hearing research, e.g., for continuous optimization of parameters in future hearing devices. The underlying statistical models were trained on hundreds or thousands of hours of speech and are harder to analyze than many established models; yet they are not black boxes, since various methods exist to study their properties, which I will briefly outline.
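To make the idea of "quantifying the degradation of phoneme probabilities" more concrete, here is a minimal sketch of one possible reference-free measure: the average per-frame entropy of the network's phoneme posteriors. This is an illustrative assumption and not necessarily the measure used in the models presented in the talk; the function names and the toy probability values are hypothetical.

```python
import math


def frame_entropy(posteriors):
    """Shannon entropy (in bits) of one frame of phoneme probabilities."""
    # Zero-probability classes contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in posteriors if p > 0.0)


def mean_entropy(posteriorgram):
    """Average per-frame entropy over an utterance (a list of frames)."""
    if not posteriorgram:
        raise ValueError("posteriorgram must contain at least one frame")
    return sum(frame_entropy(f) for f in posteriorgram) / len(posteriorgram)


# Hypothetical posteriors over 4 phoneme classes, 2 frames each.
# Sharp posteriors: the classifier is confident (e.g. little distortion).
clean_like = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
# Flat posteriors: the classifier is uncertain (e.g. strong noise).
noisy_like = [[0.40, 0.30, 0.20, 0.10], [0.25, 0.25, 0.25, 0.25]]

# In this sketch, higher mean entropy is read as stronger degradation.
degradation_clean = mean_entropy(clean_like)
degradation_noisy = mean_entropy(noisy_like)
```

Because the measure is computed from the posteriors alone, it only needs the noisy mixture as input, which matches the reduced a priori knowledge described above. Mapping such a score to intelligibility, quality, or listening effort would require a further step that is not shown here.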

**Please note: This talk will take place on Wednesday, Feb 2nd at 1:30 pm!**