The role of binaural hearing in speech understanding and its representation in the human brain
Jaime A. Undurraga
2018-09-07
Cocktail-party problem (Cherry, 1953)
How do we recognize what one person is saying when others are speaking at the same time?
When everyone at a well-attended party talks at the same level, the speech of the attended talker at a distance of 0.7 m has a signal-to-noise ratio (SNR) of about 0 dB (the background is as intense as the target talker Plomp 1977). This level is sufficient to give adequate intelligibility for listeners with normal hearing (Miller, 1947).

Auditory scene analysis
A critial role of the auditory system is to parse the inputs from the left and right ears into auditory objects - auditory scene analysis. This is an essential role for survival, recognition, and communication.
Binaural hearing provides cues that allow us to estimate the relative number and location of sources and objects in the environment.
These cues also help us to estimate the dimensions and characteristics of rooms as well as to hear out speakers in the presence of interfering noise.

Grothe et al. (2010)
- When the speech sources are spatially separated, normal listeners can perceptually segregate and selectively attend to the source of interest if the sounds arrive from the same location - spatial release from masking (SRM) (Freyman et al., 1999; Brungart, 2001; Freyman et al., 2001; Hawley et al., 2004)
- This task is extreamly challenging for listeners with sensorineural hearing loss (with or without hearing aids) or with cochlear implants (Loizou et al., 2009; Marrone et al., 2008).
- Moreover, listeners with “normal” hearing and elder listeners experience great difficulty when listening in the presence of background noise (Ruggles et al., 2011; Swaminathan et al., 2015; Gallun et al., 2013).
- Hearing impaired listeners with symmetric binaural hearing often demonstrate reduced SRM primarily due to increased thresholds in spatially separated conditions (Arbogast et al., 2005; Marrone et al., 2008b; Best et al., 2012)
Monaural and binaural cues
Monaural cues and Auditory scence analysis
- Normal hearing listeners can segregate sounds remarkably well from monaural signals.
- The brain can use the statistical regularities of a sound to group acoustic characteristics that are likely originated from the same source.
common onset
harmonicity
repetition
Harmonicity

Which features are conveyed by speech sounds?
- Temporal fine structure (TFS)
- Envelope information (ENV)
Play 6 Channel vocoder
Play Original

Binaural cues
When listening to sounds, we rely on three mechanisms for both sound localization and auditory scene analysis
Interaural level differences (ILDs)
Interaural time differences (ITDs)
Interaural coherence
ITDs and ILDs

ILDs
ILDs only useful for frequencies higher than 1200 Hz
ILDs can be as large as 20 dB (e.g. Bronkhorst & Plomp, 1988)
ILDs resulting from the head shadow provide an advantage by sheltering the ear turned toward the target source from noise from the other side. However, listeners with unilateral hearing loss struggle when the target sound comes from the impaired side, specially in the presence of background noise.
Localization using ITDs

ITDs within the physiological range experienced by human listeners are about ±760 μs; (e.g. Constan and Hartmann (2003) and Hartmann and Macaulay (2014))
ITDs (using fine structure) are useful for frequencies below 1500 Hz
ITDs in the envelope of the signal are also used to determine the location of a source in both lower and higher frequencies.
Temporal fine structure and envelope ITDs
500 Hz tone TFS left-to-right
Modulated 500 Hz tone
with envelope itd (ipd) left-to-right
4000 Hz tone TFS itd (ipd) left-to-rigth
Modulated 4000 Hz tone
with envelope itd left-to-rigth
Localization using ITDs

ITDs are ambiguous for frequencies which half-wavelengths are smaller than the head ‘diameter’.
Localization on the vertical plane

The main cues are provided by diffraction patterns by the head and pinna, which result in characteristic spectral dips
Hearing-impaired subjects have wider cochlear filters (increasing spectral smearing), preventing them to localize a source in the vertical plane.
The precedence effect
(Dietz et al. 2013)
A single auditory event is perceived at the direction of the first direct wave front (2 and 50 ms later, even when reflections are 10 dB louder)
Asymmetric hearing loss, hearing impairment and ageing negatively affect the precedence effect (Akeroyd and Guy, 2011). This can only partially restored by hearing aids.
Binaural processing and speech understanding
Binaural Redundancy
Loudness doubles when the two ears are used instead of one ear for a sound coming from the front of the listener (a single ear would require an increase of about 10 dB; Fletcher and Munson, 1933)
Just noticeable differences in intensity and frequency improve with signal redundancy
Speech recognition in the presence of background noise improves (Marrone 2008, Neher 2009)
Hearing impairment may lead to a slightly weaker binaural benefit in patients (Dillon, 2001)
Binaural stimulation sounds can be louder than with a monaural presentation without causing discomfort (even true for CI-treated patient)
Binaural Release from Masking (or Binaural Squelch; or Hirsh effect)

Binaural release from masking may improve detection threshold up to about 16 dB for frequencies around 250 Hz and around 3 dB at 1500 Hz
Spatial Release from Masking

Binaural release from masking may improve detection thresholds up to 12 dB for multiple speech interferers (Jones and Litovsky, 2011), and facilitates source segregation provided that streaming can build up and natural onset cues are present (Drennan, Gatehouse, and Lever, 2003).
Segregation is always better for the combination of both ITDs and ILDs cues (Culling, Hawley, and Litovsky 2004)
A separation of only 10° between two voices is already strong enough to allow segregation (Brungart and Simpson, 2007).
ITD is a critical spatial cue for sound localization and speech perception in noise (Bronkhorst & Plomp, 1988; Wightman & Kistler, 1992).
General Conclusions
Interaural level and, in particular, interaural time differences conveyed by low frequencies (< 1500 Hz) are critical for speech understanding and segregation in the cocktail party problem (e.g. Swaminathan et al. 2016)
Hearing-impaired listeners who typically find noisy environments disproportionately difficult for understanding speech, whether they wear a hearing aid or not (Killion 1997; Moore 1998).
Similar problems are encountered by users of cochlear implants, again owing to the limited frequency resolution available with implants (Clark 2003).
Binaural release from masking may improve detection thresholds up to 12 dB for multiple speech interferers (Jones and Litovsky, 2011), and facilitates source segregation provided that streaming can build up and natural onset cues are present (Drennan, Gatehouse, and Lever, 2003).
Segregation is always better for the combination of both ITDs and ILDs cues (Culling, Hawley, and Litovsky 2004)
Objective measures of binaural processing agree with difficulties to localize and understand speech in observed in hearing impaired and elder listeners.
Input to one ear
Output at the auditory nerve level

Auditory pathway

ILD pathway

(Grothe et al. 2010)
The initial site of ILD processing is generally considered to be the LSO.
LSO are innervated by direct excitatory ipsilateral inputs from spherical bushy cells and inhibitory inputs indirectly originating from the globular bushy cells.
ITD pathway

(Grothe et al. 2010)
The initial site of ITD processing is considered to be the MSO.
MSO are innervated by direct excitatory ipsi- anc contra-lateral inputs from spherical bushy cells and inhibitory inputs indirectly originating from the globular bushy cells.
Coincidence detection (MSO)
Objective detection of binaural processing in humans
Electroencephalogram (EEG)

(From Luck 2005)
EEG recording and validation
Real experiments

EEG allows to record electrical activity of the brain by means of electrodes placed along the scalp.

Objective measures of binaural processing can be obtained by using stimuli where the temporal fine structure is manipulated so that the perceived location of the sound image changes periodically (e.g. 6.7 Hz) over time.

(Undurraga et al. 2016)
Interaural-phase modulation
- The size of the interaural phase difference (relative ITD with respect to the frequency of the sound) will determine the lateralization of the sound.
Interaural-phase modulation following-responses (IPM-FR)
By swithcing the “sound image” from left to right at 6.7 Hz, a strong steady-state response is evoked at that particular frequency
IPM-FRs are larger when the stimuli causes a strong lateralization percept.
How well do objective IPM-FR and behavioural measurements correlate with each other?
Behavioural task
The level of the masking noise to perform at chance is determined and compared and then compared with brain responses (IPM-FR)
Behavioural vs. Objective measures
The correlation between objective and behavioural measurements is excellent (r = 0.96)
The effect of age and hearing loss in objective binaural measures
- Does the IPM-FR differ between healthy and impaired hearing listeners?

Interaural difference in hearing thresholds near the stimulation frequency (500 Hz) did not exceed 10 dB HL.
The stimulus level was 65 dB SPL for normal listeners.
The stimulus level for hearing impaired listeners was adjusted using visual analogue scale (“inaudible”, “very soft”, “soft”, “comfortable”, “loud”, “very loud”, and “uncomfortably loud”). The subjective loudness was adjusted to match that of normal listeners.
Stimuli were amplitude modulated tones (500 Hz) using IPDs ranging from 0 (diotic) to 180 (dichotic).



Binaural processing of low frequency sounds is significantly deteriorated by age and hearing status.
A larger IPM-FR dynamic range was associated with lower (better) IPD discrimination.
Models for ITD processing
Straightness weighting

Peaks of the cross-correlograms at –1.5 ms < peaks closer to 0 - central weighting.
Correct localization estimated from second processing level (in the inferior colliculus, gray curve on top of each panel).
The \(\pi\)-limit

- ITD detectors in the mammalian brain restricted to \(\approx\) half a cycle of best frequency.
- Frequency-dependent weighting for centrality
Thompson et al. (2006)

- Inferior colliculus consistent with the \(\pi\)-limit.
- Cortical responses to sounds with ITDs within the \(\pi\)-limit are in line with the predictions of both models.
- However, neural activation is bilateral for “long” ITDs, despite these being perceived as clearly lateralized
- Long ITDs leads to higher activation in cortex than processing of short ITDs.
Kriegstein et al. (2008)

- human and animal Magnetoencephalography (MEG) recordings as well as computational models with ITD detectors restricted either to ±π or with their best ITDs well beyond ±π but with strong weighting to ITDs near zero - “central-weighting” agree with a limited range of ITD detectors (Salminen et al., 2018; Stern and Shear, 1996).
Neural representation of asymmetric interaural time modulations (ITMs)
- ITMs: 0/ + 0.5 ms, 0/ + 1.5 ms, 0/ + 2.0 ms, 0/ + 3.0 ms, 0/ + 4.0 ms.

Testing the independence of ITD detectors
- ITMs: −0.5/ + 0.5 ms, −0.5/ + 1.5 ms, and +0.5/ + 1.5 ms.
The predictions are:
- the ITM between −0.5 ms and −0.5 ms should be less adapted than the ITM between −0.5 ms and +1.5 ms.
- the ITM between +0.5 ms and +1.5 ms should be less adapted than the ITM between −0.5 ms and +1.5 ms
Neural representation of symmetric ITMs
- ITMs: −0.5/ + 0.5 ms, −1.0/ + 1.0 ms, −1.5/ + 1.5 ms, −2.0/ + 2.0 ms, −2.5/ + 2.5 ms, −3.0/ + 3.0 ms, −4.0/ + 4.0 ms
Results
At −0.5/ + 0.5 ms (−90 ◦ / 90 ◦ IPD), the IPD of the centre frequency is consistent with the envelope ITD.
At −1.0/ + 1.0 ms (−180 ◦ / 180 ◦ IPD), the IPD of the centre frequency is ambiguous.
At −1.5/ + 1.5 ms (−270 ◦ / 270 ◦ IPD), the IPD of the centre frequency is conflicting with the envelope ITD, i.e. the fine structure IPD leads in the right ear whilst the envelope ITD leads on the left ear.
At −2.0/ + 2.0 ms (−360 ◦ / 360 ◦ IPD), there is not IPD in the fine structure at the centre frequency, and so no conflicting cues.
Conclusions
ITMs could be reliable recorded from all participants across conditions.
Asymmetric ITMs showed large and consistent responses with a damping pattern at ITMs with ITDs multiple of the central frequency of the noise.
Similarly ITMs showed a decreasing damping function. The damping function is consistent with conflicting cues between the envelope and the fine structure of the signal.
The adaptor paradigm in Experiment 2 could be also explained by conflictive processing between envelope and fine structure.
At −0.5/ + 0.5 ms (−90 ◦ / 90 ◦ IPD), the IPD of the centre frequency is consistent with the envelope ITD.
At −0.5/ + 1.5 ms (−90 ◦ / 270 ◦ IPD), the IPD of the centre frequency does not change, whilst the envelope ITD does.
At +0.5/ + 1.5 ms (90 ◦ / 270 ◦ IPD), the IPD of the centre frequency effectively changes between 90 ◦ and −90 ◦ , whilst the envelope ITD is always leading on the right ear.
Acknowledgments
- Australian Research Council [project number FL160100108]
- Thanks to you for “listening”
