A neural-network framework for modelling auditory sensory cells and synapses

In classical computational neuroscience, transfer functions are derived from neuronal recordings to derive analytical model descriptions of neuronal processes. This approach has resulted in a variety of Hodgkin-Huxley-type neuronal models, or multi-compartment synapse models, that accurately mimic neuronal recordings and have transformed the neuroscience field. However, these analytical models are typically slow to compute due to their inclusion of dynamic and nonlinear properties of the underlying biological system. This drawback limits the extent to which these models can be integrated within large-scale neuronal simulation frameworks and hinders an uptake by the neuro-engineering field which requires fast and efficient model descriptions. To bridge this translational gap, we present a hybrid, machine-learning and computational-neuroscience approach that transforms analytical sensory neuron and synapse models into artificial-neural-network (ANN) neuronal units with the same biophysical properties. Our ANN-model architecture comprises parallel and differentiable equations that can be used for backpropagation in neuro-engineering applications, and offers a simulation run-time improvement factor of 70 and 280 on CPU or GPU systems respectively. We focussed our development on auditory sensory neurons and synapses, and show that our ANN-model architecture generalizes well to a variety of existing analytical models of different complexity. The model training and development approach we present can easily be modified for other neuron and synapse types to accelerate the development of large-scale brain networks and to open up avenues for ANN-based treatments of the pathological system.

1 Introduction 1 Following the fundamental work of Hodgkin and Huxley in modelling action-potential generation 2 and propagation [1], numerous specific neuronal models were developed that proved essential 3 for shaping and driving modern-day neuroscience [2]. In classical computational neuroscience, 4 transfer functions between stimulation and recorded neural activity are derived and approximated 5 analytically. This approach resulted in a variety of stimulus-driven models of neuronal firing and 6 was successful in describing the nonlinear and adaptation properties of sensory systems [3][4][5][6]. For 7 example, the mechano-electrical transduction of cochlear inner-hair-cells (IHCs) was described 8 using conductance models [7][8][9][10], and the IHC-synapse firing rate using multi-compartment 9 diffusion models [11][12][13]. Such mechanistic models have substantially improved our understanding 10 of how individual neurons function, but even the most basic models use coupled sets of ordinary 11 differential equations (ODEs) in their descriptions. This computational complexity hinders their 12 further development to simulate more complex behaviour, limits their integration within large-13 scale neuronal simulation platforms, such as brain-machine interfaces [14,15], and their uptake in 14 neuro-engineering applications that require real-time, closed-loop neuron model units [16,17]. 15 To meet this demand, neuroscience recently embraced deep-learning [18]; a technique that 16 quickly revolutionised our ability to construct large-scale neuronal networks and to quantify 17 complex neuronal behaviour [19][20][21][22][23][24][25][26][27][28]. These machine-learning methods can yield efficient, end- 18 to-end descriptions of neuronal transfer functions, population responses or neuro-imaging data 19 without having to rely on detailed analytical descriptions of the individual neurons responsible 20 for this behaviour. Deep Neural Networks (DNNs) learn to map input to output representations 21 and are composed of multiple layers with simplified units that loosely mimic the integration 22 and activation properties of real neurons [29]. Examples include DNN-based models that were 23 succesfully trained to mimic the representational transformations of sensory input [30,31], or 24 DNNs that use neural activity to manipulate sensory stimuli [32,33]. Even though deep-learning 25 has become a powerful research tool to help interpret the ever-growing pool of neuroscience and 26 neuro-imaging recordings [34,35], these models have an important drawback when it comes to 27 predicting responses to novel inputs. DNNs suffer from their data-driven nature that requires 28 a vast amount of data to accurately describe an unknown system, and can essentially be only 29 as good as the data that were used for training. Insufficient experimental data can easily lead 30 to overfitted models that describe the biophysical systems poorly while following artifacts or 31 noise present in the recordings [36]. The boundaries of experimental neuroscience and limited 32 experiment duration hence pose a serious constraint on the ultimate success of ANN models in 33 predicting responses of neuronal systems. 34 To overcome these difficulties and merge the advantages of analytical and ANN model 35 descriptions, we propose a hybrid approach in which analytical neuronal models are used to 36 generate a sufficiently large and diverse dataset to train DNN-based models of sensory-cells and 37 synapses. A combination of traditional and machine-learning approaches was recently adopted 38 to optimise analytical model descriptions [37], but our method moves in the opposite direction 39 and takes advantage of deep-learning benefits to develop convolutional neural network (CNN) 40 models from mechanistic descriptions of neurons and synapses. We show here that the resulting 41 CNN models can accurately simulate outcomes of traditional Hodgkin-Huxley neuronal models 42 and synaptic diffusion models, but in a differentiable and computationally-efficient manner. The 43 CNN-based model architecture is compatible with GPU computing and facilitates the integration of 44 our model-units within large-scale, closed-loop or spiking neuronal networks. The most promising 45 design feature relates to the backpropagation property, a mathematically-complex trait to achieve 46 for nonlinear, coupled ODEs of traditional neural models. We will illustrate here how normal and 47 pathological CNN models can be used in backpropagation to modify the sensory stimuli such to 48 yield an optimised (near-normal) response of the pathological system. 49 We develop and test our hybrid approach on sensory neurons and synapses within the auditory 50 system. The cochlea, or inner-ear, encodes sound via the inner hair cells (IHCs). IHCs sense the 51 vibration of the basilar membrane in response to sound using their stereocilia, and translate this 52 movement into receptor potential changes. By virtue of Ca 2+ -driven exocytosis, glutamate is 53 released to drive the synaptic transmission between the IHC and the innervated auditory-nerve fiber 54 (ANF) synapses and neurons [38]. Experimentally extracted IHC parameters from in-vitro, whole-55 cell patch clamp measurements of the cellular structures and channel properties [39,40] have led to 56 different model descriptions of the nonlinear and frequency-dependent IHC transduction [10,[41][42][43]. 57 Parameters for analytical IHC-ANF synapse models have mainly been derived from single-unit 58 AN recordings to basic auditory stimuli in cats and small rodents [44][45][46][47][48][49][50]. Progressive insight into 59  the function of IHC-ANF synapses over the past decades has inspired numerous analytical model 60 descriptions of the IHC, IHC-ANF synapse and ANF neuron complex [11][12][13][51][52][53][54][55][56][57][58][59][60][61]. 61 To generate sufficient training data for our CNN-based models of IHC-ANF processing, we 62 adopted a state-of-the-art biophysical model of the human auditory periphery that simulates 63 mechanical as well as neural processing of sound [60]. We describe here how the CNN model 64 architecture and hyperparameters can be optimised for complex neuron or synapse models and 65 we evaluate the quality of our CNN models on the basis of key IHC-ANF complex properties 66 described in experimental studies, i.e., the IHC AC/DC ratio and excitation patterns, ANF firing 67 rate, rate-level curves and modulation synchrony. The stimuli we adopted for the evaluations were 68 not included in the CNN training datasets. We then determine the model run-time benefit over 69 analytical models and investigate the extent to which our methodology is generalisable to different 70 mechanistic descriptions of the IHC-ANF complex. Lastly, we provide two user cases: one in 71 which IHC-ANF models are connected to a CNN-based cochlear mechanics model (CoNNear [62]) 72 to capture the full transformation of acoustic stimuli into IHC receptor potentials and ANF firing 73 rates along the cochlear tonotopy and hearing range, and a second one where we illustrate how 74 backpropagation can be used to modify the CNN model input to restore a pathological output. 75 2 The CoNNear IHC and ANF models 76 Figure 1(a) depicts the adopted training and evaluation method to calibrate the parameters of each 77 CoNNear module. Three modules that correspond to different stages of the analytical auditory 78 periphery model described in Verhulst et al. [60] were considered: cochlear processing, IHC 79 transduction and ANF firing. The calibration of the cochlear mechanics module (CoNNear cochlea ) 80 is described elsewhere [62,63], here we focus on developing the sensory neuron models (i.e., 81 CoNNear IHC and CoNNear ANF ). Fig. 1(a) illustrates the training procedure for the CoNNear ANf L 82 module that approximates the functioning of a low-spontaneous-rate ANF. Acoustic speech 83 material is given as input to an analytical description of cochlear and IHC-ANF processing, after 84 which simulated ANF firing rates are used as training material to determine the CoNNear ANf L 85 parameters. CoNNear modules were trained separately for each stage of the IHC-ANF complex, 86 resulting in one model for IHC transduction and three models for different ANF types: a high-87 (H; 68.5 spikes/s), medium-(M; 10 spikes/s) and low-(L; 1 spike/s) spontaneous-rate (SR) ANF 88 fiber. We chose a modular approach because this facilitates future simulations of the pathological 89 system, where the IHC receptor potential can be impaired through presbycusis [64], or where 90 selective damage to the ANF population can be introduced through cochlear synaptopathy [65]. 91 Each module was modelled using a convolutional encoder-decoder architecture, consisting of 92 a distinct number of CNN layers, as shown in Figs. 1(b),(c). Within these architectures, each 93 CNN layer is comprised of a set of filterbanks followed by a nonlinear operation [18], except 94 for the last layer where the nonlinear operation was omitted. The parameters of the nonlinear 95 operations were shared across the frequency and time dimensions (first two dimensions) of the 96 model, i.e., weights were applied only on the filter dimension (third dimension). The encoder 97 CNN layers use strided convolutions, i.e., the filters were shifted by a time-step of two such to 98 half the temporal dimension after every CNN layer. Thus, after N encoder CNN layers, the 99 input signal was encoded into a representation of size L/2 N × k N , where k N equals the number of 100 filters in the N th CNN layer. The decoder uses N deconvolution, or transposed-convolutional, 101 layers, to double the temporal dimension after every layer to re-obtain the original temporal 102 dimension of the input (L). Skip connections were used to bypass temporal audio information 103 from the encoder to the decoder layers to preserve the stimulus phase information across the 104 architecture. Skip connections have earlier been adopted for speech enhancement applications 105 to avoid the loss of temporal information through the encoder compression [66][67][68][69], and could 106 benefit the model-training to best simulate the nonlinear and the level-dependent properties of 107 auditory processing by providing interconnections between several CNN layers [62,70]. Lastly, 108 we provided context information by making a number of previous and following input samples 109 also available to the CoNNear modules when simulating an input of length L. Because CNN 110 models treat each input independently, providing context is essential to avoid discontinuities at 111 the simulation boundaries and take into account the neural adaptation processes [62]. A final 112 cropping layer was added to remove the context after the last CNN decoder layer. Even though 113 we trained each module using a fixed L, CoNNear models can process input of any length L and 114 N CF once they are trained due to their convolutional architecture.

115
To provide realistic input to the IHC-ANF models for training, acoustic speech waveforms were 116 input to the cochlear model after which simulated cochlear basilar-membrane (BM) outputs were 117 used to train and evaluate the IHC-ANF models. To this end, the IHC transduction model was 118 trained using N CF = 201 cochlear filter outputs that span the human hearing range (0.1-12kHz) 119 and that were spaced according to the Greenwood place-frequency description of the human 120 cochlea [71]. Similarly, simulated IHC receptor potentials of the analytical model cochlear regions 121 (N CF = 201) were used as training material for the different ANF models. The CoNNear model 122 parameters of each module were optimised to minimise the mean absolute error (L1-loss) between 123 the predicted CoNNear outputs and the reference analytical model outputs. It should be noted 124 that even though we trained the models on the basis of 201 inputs, the optimal weights for a 125 single CF-independent IHC or ANF model were determined during the training phase. Thus, 126 these model units can afterwards be connected to each N CF input to simulate CF-dependent IHC 127 or ANF processing of the entire cochlea.

128
To evaluate the CoNNear IHC-ANF models, it is important to characterise their properties 129 to acoustic stimuli that were not seen during training. Training was performed using a single 130 speech corpus [72], but IHC and ANF processing have very distinct adaptation, and frequency-131 and level-dependent properties to basic auditory stimuli such as tones, clicks or noise. To test 132 how well the CoNNear modules generalise to unseen stimuli and to other analytical IHC-ANF 133 model descriptions, we evaluated their performance on a set of classical experimental neuroscience 134 recordings of IHC transduction and ANF firing. The six considered evaluation metrics together 135 form a thorough evaluation of the CoNNear IHC-ANF complex, and outcomes of these simulations 136 were used to optimise the final model architecture and its hyperparameters. Additional details on 137 the model architecture, training procedure and IHC-ANF evaluation metrics are given in Methods. 138 In the following sections, we first describe how we optimised the architectures of each CoNNear 139 model. We evaluate how well the trained CoNNear models capture signature experimental IHC 140 and ANF processing properties using stimuli that were not part of the model training. Afterwards, 141 we quantify the run-time benefit of our CoNNear models over the analytical descriptions and 142 show how the modules can be connected to simulate processing of the entire cochlear periphery. 143 To demonstrate the versatility of our method, we describe the extent to which our methodology 144 can be applied to different mechanistic descriptions of the IHC-ANF complex. And lastly, we 145 illustrate how the differential properties of our CoNNear models can be used within a closed-loop 146 backpropagation network to restore the function of a pathological system.  Table 1 shows the final layouts of all the CoNNear modules we obtained after taking into account: 149 (i) the L1-loss on the training speech material (i.e., the absolute difference between simulated 150 CNN and analytical responses), (ii) the desired auditory processing characteristics, and (iii) the 151 computational load. A principled fine-tuning approach was followed for each CoNNear module 152 architecture on which we elaborate in the following sections. Fixed parameters: Prior knowledge of fine-tuning a neural-network-based model of human 155 cochlear processing [62] helped us to make initial assumptions about the needed architecture to 156 accurately capture the computations performed by the analytical IHC model [60]. The shorter 157 IHC adaptation time constants [73] enabled us to use fewer convolution layers and shorter filter 158 lengths than were used in the CoNNear cochlea model. Thus, we opted for an architecture 159 with 6 convolution layers and a filter length of 16. In each layer, 128 convolution filters were 160 used and a stride of 2 was applied for dimensionality reduction. Each layer was followed by a 161 hyperbolic-tangent (tanh) nonlinearity. The input length was set to L c = 2048 + 2 · 256 = 2560 162 samples (102.8 ms). Figure 2 shows that the trained architecture (b) generally followed the 163 pure-tone excitation patterns of the reference model (a), but showed a rather noisy response 164 across CFs, especially for the higher stimulation levels. Although we tried architectures with 165 different layer numbers or filter durations, these models did not show significant improvements 166 over Fig. 2 Hyperparameters: The shape of the activation function, or nonlinearity, is crucial to enable 168 CoNNear to learn the level-dependent cochlear compressive growth properties and negative signal 169 deflections present in BM and IHC processing. A tanh nonlinearity was initially preferred for 170 each CNN layer, since it shows a compressive characteristic similar to the outer-hair-cell (OHC) 171 input/output function [74] and crosses the x-axis. To optimise the rather noisy response of the 172 trained IHC model ( Fig. 2(b)) different nonlinear activation functions were compared for the 173 encoder and decoder layers. Because the IHC receptor potential is expressed as a (negative) 174 voltage difference, we opted for a sigmoid nonlinear function in the decoding layers to better 175 capture the reference model outputs, while ensuring that the compressive nature present in the 176 tanh could be preserved. Figure 2(c) shows that using a sigmoid activation function instead 177 of a tanh for the decoder layers outperforms the tanh architecture (b) and better predicts the 178 excitation patterns of the reference model (a).
179 Figure 3 furthermore depicts how different combinations of activation functions affected the 180 simulated AC/DC ratios of the IHC responses across CF, and the half-wave rectified IHC receptor 181 potential as a function of stimulus level. The logarithmic decrease of the AC/DC ratio and the 182 linear-like growth of the IHC potential were predicted similarly using both architectures, but the 183 tanh architecture overestimated the responses for high frequencies and levels. Overall, a much 184 smoother response was achieved when using a sigmoid activation function in the decoder layers, 185 motivating our final choice for the CoNNear IHC architecture (Table 1). Fixed parameters: Our modular approach enabled the use of the preceding stages to optimise 188 our ANF model parameters. To determine the architecture, we first took into account the 189 slower adaptation time constants (i.e., the much longer exponential decay) of the analytical ANF 190 model description compared to those observed in the cochlea and IHC [12]. The choice of the 191 window size L will thus be important to realistically capture the steady-state ANF response to 192 sustained stimulation, e.g., for acoustic pure tones. Figure 4(a) visualises the exponential decay of 193 simulated ANF firing rates, in response to a 1-kHz pure-tone presented at 70 dB SPL. At the time 194 corresponding to a window size of 2048 samples (∼100 ms), the firing rates of the three ANFs 195 have not significantly decayed to their steady state and hence we chose to use a longer window 196 duration of L of 8192 samples (∼400 ms) for our ANF models. At 400 ms, the firing rates of the 197 HSR, MSR and LSR fibers have respectively reached 99.5%, 95% and 93.4% of their final (1-sec) 198 firing rate ( Fig. 4(a)).

199
Another important aspect relates to capturing the experimentally [49] as well as computationally 200 [60] observed slow recovery of ANF onset-peak responses after prior stimulation. Since CNN 201 models treat each input independently, the duration of the context window is crucial to sufficiently 202 present prior stimulation to the CoNNear ANF models. Figure 4(b) shows the exponential 203 recovery of the onset-peak, for simulated responses of the three ANF types, as a function of 204 the inter-stimulus interval between a pair of pure-tones. Two 2-kHz pure-tones were generated 205 according to experimental procedures [49], i.e., 100 ms pure-tones presented at 40 dB above the 206 firing threshold of each ANF (60, 65 and 75 dB for the HSR, MSR and LSR fibers respectively) 207 with an inter-stimulus interval from 0.1 to 1.9 secs. Since the 1.9-sec interval corresponds to 208 38,000 samples (f s = 20 kHz), a compromise was made to select a final context window that was 209 short enough to limit the needed computational resources, but that was still able to capture the 210 recovery and adaptation properties of the ANF models faithfully. We chose 7936 samples for the 211 context window (∼400 ms) which resulted in a total input size of L c = 7936 + 8192 + 256 = 16384 212 samples. For a 400-ms inter-stimulus interval, the onset-peak of the HSR, MSR and LSR fibers has 213 recovered to the 92.4%, 94.2% and 95.8% of the onset-peak of the 1.9-sec interval tone respectively 214 ( Fig. 4(b)). 215 We further illustrate the effect of adding a context window in    an architecture with a short context window (c), the simulated response was unable to reach the 218 onset amplitude of the reference LSR fiber model (b) observed for the high CFs at approximately 219 100 ms (grey dashed box). At the same time, the response for the short context architecture 220 decayed to a more saturated output after the onset peak, compared to the reference model. In 221 contrast, when adopting an architecture with a longer context window (d), the CoNNear ANf L 222 model better captured the onset peak observed after the long inter-stimulus interval while showing 223 an unsaturated fiber response which was similar to the reference model (b).

224
Lastly, the much slower adaptation properties of the ANF responses and the chosen input size 225 of L c = 16384 samples led us to realise that a larger number of CNN layers might be required 226 to model the ANF stage, compared to the IHC transduction stage. A much deeper architecture 227 might be necessary to simultaneously capture the characteristic ANF onset-peak and subsequent 228 exponentially-decaying adaptation properties of ANF firing rates to step-like stimuli, and this 229 is demonstrated in Fig. 4(c). A trained architecture consisting of 16 layers failed to capture the 230 adaptation properties of the LSR ANF, while the use of 28 layers successfully approximated the 231 The audio stimulus was presented to the reference cochlear and IHC model and the simulated IHC receptor potential output was used to stimulate the three ANF models. The N CF =201 considered output channels are labeled per channel number: channel 0 corresponds to a CF of 100 Hz and channel 200 to a CF of 12 kHz.
reference firing rates. By introducing an architecture which encodes the input to a very condensed 232 representation, preferably of a size of 1, we can ensure that the long-term correlations existent in 233 the input can be captured faithfully by the convolutional filters. To this end, we opted for an 234 architecture of 28 total layers and a condensed representation of size 1 × N CF (Fig. 1(c) Hyperparameters: The compressive properties of BM and IHC processing are not retained in 241 ANF processing, so a linear activation function (a Parametric ReLU; PReLU ) was initially used 242 for each CNN layer. Figure 6 shows the responses of the three trained CoNNear ANF models (b) 243 for different tonal stimuli in comparison to the reference ANF model (a). The firing rates of the 244 three ANF models, CoNNear ANf H , CoNNear ANf M and CoNNear ANf L , are visualised in red, blue 245 and cyan respectively.

246
The good match between analytical and CoNNear predictions in Fig. 6 was extended to ANF 247 rate-level growth as well (Fig. 7), and together these simulations show that the chosen architecture 248 and P ReLU non-linearity were suitable to model characteristic ANF properties of the three ANF 249 types. Compared to the reference firing rates, the architectures in panel (b) introduced noise, 250 which might be eliminated by using a more compressive activation function (tanh) between the 251 encoder layers. The tanh function was able to transform the negative potential of the IHC stage 252 to the positive firing response of the ANFs (Fig. 6(c)), and yielded similar firing rates for all ANF 253 models. However, for the CoNNear ANf H and CoNNear ANf M architectures, the tanh non-linearity 254 introduced an undesirable compressive behaviour at higher stimulus levels, as depicted in Fig. 7(a). 255 This was not the case for CoNNear ANf L , and hence we also tested using a sigmoid nonlinearity 256 in the decoder layers. This combination of non-linearities (d) compressed the responses of the 257 CoNNear ANf H and CoNNear ANf M models even more, and negatively affected the onset responses. 258 However, this combination (d) was found to best approximate the LSR ANFs firing rates. The excitation patterns of the final CoNNear IHC model (Fig. 2(c)) are generally consistent with 263 the reference IHC model (a). The IHC AC/DC components ( Fig. 3(a)) followed the simulated and 264 measured curves well, and showed a slight overestimation for the lower frequency responses. The 265 simulated half-wave-rectified IHC receptor potential (Fig. 3(a)) corroborated the in-vivo guinea 266 pig IHC measurements [76], by showing an unsaturated, almost linear, growth of the half-wave 267 rectified IHC receptor potential (in dB) for stimulation levels up to 90 dB.

268
The properties of single-unit ANF responses were accurately captured by the CoNNear 269  (Fig. 6). As expected, phase-locking to the 271 stimulus fine-structure was present for the 1-kHz ANF response and absent for the 4-kHz ANF. 272 Phase-locking differences between the 1 and 4-kHz CF fibers were also evident from their responses 273 to amplitude-modulated tones.

274
The level-dependent properties of different ANF types were also captured by our CoNNear 275 architectures, as shown in Fig. 7. Compared to the reference data, the 4-kHz simulations captured 276 the qualitative differences between LSR, MSR and HSR guinea pig ANF rates well. The mouse rate-277 level curves show somewhat steeper growth than our simulations, especially when comparing the 278 lower SR fiber data with the simulated MSR fiber responses. Given that the cochlear mechanics are 279 fundamentally different across species, it is expected that the responses are not overly generalisable 280 across species. The shape of the simulated rate-level curves was different for the different CF 281 fibers (1-kHz dotted lines compared to 4-kHz solid lines) despite the CF-independent parameters 282 of the ANF model. This illustrates that differences in BM processing across CF, given as input 283 to the IHC-ANF model, are still reflected in the shape of ANF rate-level curves. The smaller 284 dynamic range of levels encoded by the BM for the 1-kHz than the 4-kHz CF (e.g., Fig. 2 in [60]) 285 was also reflected, yielding ANF level-curve compression at lower stimulus levels for the 1-kHz CF. 286 Lastly, ANF synchrony-level curves were overall captured well by our final CoNNear ANF 287 architectures, while showing no apparent differences between the different non-linearities ( Fig. 7(b)). 288 In qualitative agreement with the reference data, the maxima of the synchrony-level curves shifted 289 towards higher levels as the fibers' threshold and rate-level slope increased. At the same time, 290 enhanced synchrony for LSR over HSR fibers was observed for medium to high stimulus levels, 291 with the most pronounced difference for the 1-kHz simulations (dotted lines). For MSR and LSR 292 fibers, the CoNNear models were able to simulate modulation gain, i.e., vector strength > 0.5 [50]. 293

CoNNear as a real-time model for audio applications 294
The CoNNear IHC-ANF computations can be sped up when run on an AI accelerator (GPU, 295 VPU etc.). Table 2 summarises the computation time required to execute the final CoNNear 296 architectures on a CPU and GPU, for 201-channel and single-channel inputs. A TIMIT speech 297 utterance of 2.3 secs was used for this evaluation and served as input to the analytical model [60] 298 to simulate the outputs of the cochlear and IHC stages. The cochlear BM outputs were then 299 framed into windows of 2560 samples (102.4 ms) to evaluate the CoNNear IHC model and the 300 Table 2. Model processing time. Comparison of the average time required to calculate each stage of the reference and the CoNNear model on a CPU (Intel Xeon E5-2620 v4) and a GPU (Nvidia GTX 1080). For each of the separate stages, the reported time corresponds to the average time needed to process a fixed-size input of N CF = 201 frequency channels (population response) and N CF = 1 channel (single-unit response), corresponding to the output of the preceding stage of the analytical model to a speech stimulus. The same results are shown for the CoNNear IHC-ANF complex, after connecting all the individual modules. The last row shows the computation time needed to transform a speech window input to ANF firing rates, after connecting the CoNNear cochlea and IHC-ANF modules together.  [62] was connected with CoNNear IHC-ANF to directly transform the 305 speech inputs to ANF firing rates. 306 We did not observe a processing time benefit when running the IHC-ANF stages with 201-307 channel inputs on a CPU: the CoNNear ANF models actually increased the computation time 308 on a CPU when compared to the reference models. However, the execution of the 201-channel 309 IHC-ANF models on the GPU reduced the computation time 12-fold, when compared to the 310 reference model CPU calculations. At the same time, our modular design choice makes it possible 311 to use CoNNear IHC-ANF modules for only a subset of CFs or for single-unit responses. A significant 312 speed up was seen for the latter case, with an almost 70-times faster CPU computation than for 313 the reference model and a 280-times speed up when executed on the GPU. ANF firing rates can 314 thus be simulated in ∼800 ms on a CPU, and less than 20 ms on a GPU for a stimulus window of 315 more than 800 ms. Here, we test whether our CNN-based IHC-ANF architectures can be used to approximate other 319 analytical model descriptions of auditory neurons and synapses. This attests to the generalisability 320 of our method for analytical model descriptions with varied levels of complexity. In Fig. 8 Figure 8(a) shows that the trained CNN model was 330 able to accurately simulate the steady-state responses of this detailed IHC description, as reflected 331 by the AC/DC ratio. However, a property that was not fully captured by our architecture was 332 the adaptation of the responses after the stimulus onset, as shown in Figure 8 potential responses, and the IHC outputs were subsequently given as inputs to the ANF module 345 to extract the mean firing rates for each fiber type. The resulting datasets were used to train 346 the CNN models, thus omitting the additive Gaussian noise and the subsequent spike generator 347 due to the noisy and probabilistic character which is beyond the scope of our model architectures. 348 However, after training the CNN models, the outputs can be fed to the spike generator present in 349 the analytical ANF model to simulate PSTH responses.

350
The trained CNN models accurately approximated mean firing rates of the different ANF 351 types, as shown in response to two different tonal stimuli (Fig. 9(c)). With the predicted outputs 352 given as inputs to the spike generator model, the simulated PSTH responses were used to compute 353 the ANF rate-and synchrony-level curves of the different types of ANFs ( Fig. 9(a),(b)). The 354 predicted curves show a similar trend to the Zilany et al. ANF model, however it is not possible 355 to directly compare the resulting curves due to the inherent noise of the non-deterministic spike 356 generator model in the reference analytical model. example user case is presented in Fig. 10(a), where a DNN model was trained to minimise the 362 difference between the outputs of two IHC-ANF models: a normal and pathological model. Each 363 model comprised the CoNNear IHC and CoNNear ANf H modules, and the firing rates of each model 364 were multiplied by a factor of 10 and 8 respectively, to simulate innervations of a normal-hearing 365 human IHC at 4 kHz ( Fig. 5 in [77]), and a pathological IHC that has a 20% fiber deafferentation 366 due to cochlear synaptopathy [65]. The DNN model was trained based on the responses of these 367 two CoNNear models to modify the stimulus such to restore the output of the pathological model 368 back to the normal-hearing model output. Training was done using a small input dataset of 4 kHz 369 tones with different levels and modulation depths, normalised to the amplitude ranges of IHC 370 inputs, and the DNN model was trained to minimise the L1 loss between the time and frequency 371 representations of the outputs. After training, the DNN model provides a processed inputx to the 372 8-fiber model to generate an outputr F that matches the normal-hearing firing rate r F as much as 373 possible. The result for a modulated tone stimulus is shown in Fig. 10(b), for which the amplitude 374 of the 8-fiber model response is restored to that of the normal-hearing IHC-ANF. This example 375 demonstrates the backpropagation capabilites of our CNN models and their application range can 376 be extended to more complex datasets such as a speech corpus, to derive suitable signal-processing 377 strategies for speech processing restoration in hearing-impaired cochleae. Analytical descriptions of IHC-ANF processing have evolved over the years, with the IHC 380 transduction shifting from simplified low-pass filter implementations [13,51,56,57,59,78] to 381 detailed models of basolateral outward K + currents [7][8][9][10]. State-of-the-art IHC-ANF models 382 estimate the vibrations of the IHC stereocilia based on the mechanical drive to the IHC and 383 often describe the ANF spikes or instantaneous firing rate resulting from the depletion and 384 replenishment of different neurotransmitter stores [57,58,60,79]. While such sensory models 385 have progressed to accurately capture the nonlinear properties of human hearing, they typically 386 comprise "hand-constructed" mechanistic descriptions that incorporate coupled sets of ODEs to 387 describe small neuronal systems. 388 We presented a hybrid framework to develop a DNN-based model of IHC-ANF auditory process-389 ing, CoNNear IHC-ANF . Different from pre-existing IHC-ANF models, the CoNNear architectures 390 are based on DNNs that are differentiable and computationally efficient to accelerate and facilitate 391 future studies of complex neuronal systems and behaviours. Our general framework for modelling 392 sensory-cells and synapses consists of the following steps: (i) Derive an analytical description of 393 the biophysical system using available neuroscience recordings. (ii) Use this analytical description 394 to generate a training dataset that contains a broad and representative set of sensory stimuli. 395 (iii) Define a suitable DNN-based architecture and optimise its hyperparameters on the basis of 396 the training dataset and its performance on known physiological characteristics. (iv) Train the 397 architecture to predict the behaviour of the biophysical system and evaluate using unseen data. 398 Apart from requiring an analytical description that accurately describes the system, we showed 399 that a careful design of the DNN architecture and a broad range of sensory input stimuli are 400 essential to derive a maximally generalisable model.

401
The resulting IHC-ANF complex models were trained to apply the same operations to each 402 frequency channel, such that they can be used for either single-unit or population response 403 simulations across the cochlear partition. Simulating all N CF = 201 frequency channels on the 404 same CPU negatively impacted the required computation time compared to analytical descriptions, 405 but nevertheless resulted in a biophysically-plausible, and rather versatile, model description. 406 Single-channel CoNNear IHC-ANF CPU simulations did offer a 70-fold speed-up compared to their 407 analytical counterparts, and can be executed with latencies below 20 ms when simulating ∼800 ms 408 inputs on a GPU. This holds promise for the future uptake of our models within audio-processing 409 pipelines that require real-time processing capabilities. At the same time, our models offer 410 a differentiable solution that can directly be used in closed-loop systems for auditory feature 411 enhancement or augmented hearing.

412
The trained DNN models can be further optimised using normal or pathological experimental 413 data via transfer learning [63], or can be retrained on the basis of large neural datasets, when these 414 become available. Approximately 3 and 8 days were needed to train each CoNNear ANF model and 415 IHC model, respectively. To improve these training durations, a different scaling of the datasets, 416 or batch normalisation between the convolutional layers, could prove beneficial [80]. Lastly, when 417 considering their use for real-time applications, where ANF adaptation and recovery properties 418 may be of lesser importance, it is possible to further reduce the context and window sizes of the 419 ANF CoNNear models and bring execution times below 10 ms. However, this will always result in 420 saturated, steady-state responses, rendering the models blind to long inter-stimulus intervals and 421 unable to fully capture recovery properties, as visualised in Figs. 4 and 5. A different approach 422 could be the use of recurrent layers (e.g., LSTM) within the CoNNear architectures to capture 423 the dependency to prior stimulation without requiring long context windows. 424 We demonstrated that the proposed framework and architectures generalises well to unseen 425 stimuli as well as to other auditory sensory cell and synapse model descriptions, and this provides 426 a promising outlook. On the one hand, our method might be applicable to other neuronal systems 427 that depend on nonlinear and/or coupled ODEs (e.g., see also their application to cochlear me-428 chanics descriptions [62]). On the other hand, the CNN model architectures can easily be retrained 429 when improved analytical model descriptions become available. When properly benchmarked, 430 CNN-based neuronal models can provide new tools for neuroscientists to explain complex neuronal 431 mechanisms such as heterogenous neural activity, circuit connectivity or optimisation [34,35].  The developed CoNNear models are suitable for implementation in data processing devices 441 such as a cochlear implant to provide biophysically-accurate stimuli to the auditory nerve. The 442 ANF responses could also be used to drive neural-network back-ends that simulate brainstem 443 processing or even the generation of auditory evoked potentials, such as the auditory brainstem 444 response [59,60] or the compound action potential [81]. All the developed CoNNear architectures 445 can easily be integrated as part of brain networks, neurosimulators, or closed-loop systems for 446 auditory enhancement or neuronal-network based treatments of the pathological system. Further 447 neural network models can be developed on the basis of the present framework to compose large-448 scale neuronal networks and advance our understanding of the underlying mechanisms of such 449 systems, making use of the transformative ability of backpropagating through these large-scale 450 systems. We think that this type of neural networks can provide a breakthrough to delve deeper 451 into unknown systems of higher processing levels, such as the brainstem, midbrain and cortical 452 pathway of the human auditory processing.  The source code of the auditory periphery model used for training is available via 10.5281/zen-462 odo.3717431 or github/HearingTechnology/Verhulstetal2018Model, the TIMIT speech corpus used 463 for training can be found online [72]. All figures in this paper can be reproduced using the trained 464 CoNNear models. A supplementary zip file is included which contains the evaluation procedure of 465 the trained CoNNear models.

466
Code availability 467 The code for running and evaluating the trained CoNNear models, including instructions of how 468 to execute the code, is included together with this manuscript and will be made available as a 469 GitHub repository upon acceptance of this paper. A non-commercial, academic UGent license 470 applies. The procedure of Fig. 1(a) was used to train the CoNNear IHC and ANF modules using simulated 478 responses of an analytical Hodgkin-Huxley-type IHC model [43] and a three-store diffusion model 479 of the ANF synapse [9] respectively. We adopted the implementations described in [60] and found 480 on 10.5281/zenodo.3717431. Figure 1(b) depicts the CoNNear IHC encoder-decoder architecture we used: an input of size 482 L c × N CF cochlear BM waveforms is processed by an encoder (comprised of three CNN layers) 483 which encodes the input signal into a condensed representation, after which the decoder layers 484 map this representation onto L × N CF IHC receptor potential outputs, for N CF = 201 cochlear 485 locations corresponding to the filters' centre frequencies. Context is provided by making the 486 previous L l = 256 and following L r = 256 input samples also available to an input of length 487 L = 2048, yielding a total input size of L c = L l + L + L r = 2560 samples.

488
The three CoNNear ANF models follow an encoder-decoder architecture as depicted in Fig. 1(c): 489 an IHC receptor potential input of size L c × N CF is first processed by an encoder (comprised 490 of N = 14 CNN layers) which encodes the IHC input signal into a condensed representation 491 of size 1 × k N using strided convolutions, after which the decoder, using the same number of 492 layers, maps this representation onto L × N CF ANF firing outputs corresponding to N CF = 201 493 cochlear centre frequencies. Context is provided by making the previous L l = 7936 and following 494 L r = 256 input samples also available to an input of length L = 8192, yielding a total input size 495 of L c = L l + L + L r = 16384 samples.

496
Training the CoNNear IHC-ANF complex 497 The IHC-ANF models were trained using reference analytical BM or IHC model simulations [60] 498 to 2310 randomly selected recordings from the TIMIT speech corpus [72], which contains a large 499 amount of phonetically balanced sentences with sufficient acoustic diversity. The 2310 TIMIT 500 sentences were upsampled to 100 kHz to solve the analytical model accurately [82]. The root-501 mean-square (RMS) energy of half the sentences was adjusted to 70 dB and 130 dB sound pressure 502 level (SPL), respectively. These levels were chosen to ensure that the stimuli contained sufficiently 503 high instantaneous intensities, necessary for the CoNNear models to capture the characteristic 504 input-output and saturation properties of individual IHC [47] and ANFs [76].

505
BM displacements, IHC potentials and ANF firing rates were simulated across 1000 cochlear 506 sections with CFs between 25 Hz and 20 kHz [60]. The corresponding 1000 y BM , V m and ANf h/m/l 507 output waveforms were downsampled to 20 kHz and only 201 uniformly distributed CFs between 508 112 Hz and 12 kHz were selected to train the CoNNear models. Above 12 kHz, human hearing 509 sensitivity becomes very poor [83], motivating the chosen upper limit of considered CFs. The 510 simulated data were then transformed into a one-dimensional dataset of 2310 × 201 = 464310 511 different training sequences. This dimension reduction was necessary because the IHC and ANF 512 models are assumed to have CF-independent parameters, whereas the simulated BM displacements 513 have different impulse responses for different CFs, due to the cochlear mechanics [84]. Hence, 514 parameters for a single IHC or ANF model (N CF = 1) were determined during training, based on 515 simulated CF-specific BM inputs and corresponding IHC, or ANF outputs from the same CF.

516
For each of the resulting 464310 training pairs, the simulated BM and IHC outputs were sliced 517 into windows of 2048 samples with 50% overlap and 256 context samples for the IHC model. In 518 the case of the ANF models, silence was also added before and after each sentence with duration 519 of 0.5 and 1 s, respectively, to ensure that our models can accurately capture the recovery and 520 adaptation properties of ANF firing rates. The resulting simulated IHC and ANF outputs were 521 sliced into windows of 8192 samples with 50% overlap, using 7936 context samples before and 256 522 samples after each window.

523
A scaling of 10 6 was applied to the simulated BM displacement outputs before they were given 524 as inputs to the CoNNear IHC model, expressing them in [µm] rather than in [m]. Similarly, 525 the simulated IHC potential outputs were multiplied by a factor of 10, expressed in [dV] instead 526 of [V], and a scaling of 10 −2 was applied to the simulated ANF outputs, expressing them in 527 [x100 spikes/s]. These scalings were necessary to enforce training of CoNNear with sufficiently 528 high digital numbers, while retaining as much as possible the datasets' statistical mean close to 529 0 and standard deviation close to 1 to accelerate training [80]. For visual comparison between 530 the original and CoNNear outputs, the values of the CoNNear models were scaled back to their 531 original units in all following figures and analyses.

532
CoNNear model parameters were optimised to minimise the mean absolute error (L1-loss) 533 between the predicted CoNNear outputs and the reference analytical model outputs. A learning 534 rate of 0.0001 was used with an Adam optimiser [85] and the entire framework was developed 535 using the Keras machine learning library [86] with a Tensorflow [87] back-end.

536
After completing the training phase, the IHC and ANF models were extrapolated to compute 537 the responses across all 201 channels corresponding to the N CF = 201 tonotopic centre frequencies 538 located along the BM. The trained architectures were adjusted to apply the same calculated 539 weights (acquired during training) to each of the N CF channels of the input, providing an output 540 with the same size, as shown in Fig. 1(c). In the same way, the trained models can easily simulate 541 single-CF IHC responses, or be used for different numbers of channels or frequencies than those 542 we used in the cochlear model.

543
Evaluating the CoNNear IHC-ANF complex 544 Three IHC and three ANF evaluation metrics were used to determine the final model architecture 545 and its hyperparameters, and to ensure that the trained models accurately captured auditory 546 properties, did not overfit to the training data and can be generalised to new inputs. Even though 547 any speech fragment can be seen as a combination of basic stimuli such as impulses and tones of 548 varying levels and frequencies, the acoustic stimuli used for the evaluation can be considered as 549 unseen to the models, as they were not explicitly present in the training material. The evaluation 550 stimuli were sampled at 20 kHz and had a total duration of 128 ms (2560 samples) and 819.2 ms 551 (16384 samples) for the CoNNear IHC model and the CoNNear ANF models, respectively. The 552 first 256 samples of the IHC stimuli and 7936 samples of the ANF stimuli consisted of silence, to 553 account for the respective context of the models. Each time, the evaluation stimuli were passed 554 through the preceding processing stages of the analytical model to provide the necessary input for 555 each CoNNear model, i.e., through the cochlear model for evaluating the CoNNear IHC model 556 and through the cochlear and IHC models for evaluating the CoNNear ANF models.

IHC excitation patterns 558
Inner-hair-cell excitation patterns can be constructed from the mean IHC receptor potential at 559 each measured CF in response to tonal stimuli of different levels. Similar to cochlear excitation 560 patterns, IHC patterns show a characteristic half-octave basal-ward shift of their maxima as 561 stimulus level increases [88]. These excitation patterns also reflect the nonlinear compressive 562 growth of BM responses with level observed when stimulating the cochlea with a pure-tone which 563 has the same frequency as the CF of the measurement site in the cochlea [89]. 564 We calculated excitation patterns for all 201 simulated IHC receptor potentials in response to 565 pure tones of 0.5, 1 and 2 kHz frequencies and levels between 10 and 90 dB SPL using: 566 tone(t) = p 0 · √ 2 · 10 L/20 · sin(2πf tone t), where p 0 = 2 × 10 −5 Pa, L corresponds to the desired RMS level in dB SPL and f tone to the 567 stimulus frequencies. The pure-tones were multiplied with Hanning-shaped 5-ms ramps to ensure 568 gradual onsets and offsets.

569
IHC transduction aspects 570 Palmer and colleagues recorded intracellular receptor potentials from guinea-pig IHCs in response 571 to 80-dB-SPL tones [73], and reported the ratio between the AC and DC response components as 572 a function of stimulus frequency. The AC/DC ratio shows a smooth logarithmic decrease over 573 frequency and is used as a metric to characterise synchronisation in IHCs, with higher ratios 574 indicating more efficient phase-locking. Our simulations were conducted for 80-ms, 80-dB-SPL 575 tone bursts of different frequencies presented at the respective CFs, and were compared against 576 experimental AC/DC ratios reported for two guinea-pig IHCs. We used a longer stimulus than 577 adopted during the the experimental procedures (50 ms), to ensure that the AC component would 578 reach a steady-state response after the stimulus onset. A 5-ms rise and fall ramp was used for 579 the stimuli, and the AC and DC components of the responses were computed within windows of 580 50-70 ms after and 5-15 ms before the stimulus onset, respectively.

581
Capturing the dynamics of outward IHC K + currents has an important role in shaping 582 ANF response properties of the whole IHC-ANF complex [9,10]. This feature of mechanical-to-583 electrical transduction compresses IHC responses dynamically and thereby extends the range of 584 v BM amplitudes that can be encoded by the IHC, as postulated in experimental and theoretical 585 studies [8,39]. As the v BM responses only show compressive growth up to levels of 80 dB SPL [60,62], 586 the simulated half-wave rectified IHC receptor potential is expected to grow roughly linearly with 587 SPL (in dB) for stimulus levels up to 90 dB SPL, thus extending the compressive growth range by 588 10 dB. To simulate the IHC receptor potential, tonal stimuli with a frequency of 4 kHz and levels 589 from 0 to 100 dB SPL were generated, using the same parameters as before (80-ms duration, 5-ms 590 rise/fall ramp). The responses were half-wave rectified by subtracting their DC component, and 591 the root-mean-square of the rectified responses was computed for each level.

592
ANF firing rates 593 We evaluate key properties of simulated ANF responses to amplitude-modulated and pure tone 594 stimuli for which single-unit reference ANF recordings are available. We simulated the firing rate 595 for low-, medium-and high-SR fibers to 1 and 4-kHz tone-bursts and amplitude-modulated tones, 596 presented at 70 dB SPL and calculated at the respective CFs. Based on physiological studies 597 that describe phase-locking properties of the ANF [50,90], stronger phase-locking to the stimulus 598 fine structure is expected for the 1-kHz fiber response than for the 4-kHz, where the response is 599 expected to follow the stimulus envelope after its onset. Similar differences are expected for the 600 amplitude-modulated tone responses as well.

601
The pure-tone stimuli were generated according to Eq. 1 and the amplitude-modulated tone 602 stimuli using: 603 SAM-tone(t) = [1 + m · cos(2πf mod t + π)] · sin(2πf tone t), where m = 100% is the modulation depth, f mod = 100 Hz the modulation frequency, and f tone 604 the stimulus frequency. Amplitude-modulated tones were multiplied with a 7.8-ms rise/fall ramp 605 to ensure a gradual onset and offset. The stimulus levels L were adjusted using the reference 606 pressure of p 0 = 2 × 10 −5 Pa, to adjust their root-mean-square energy to the desired level.

607
ANF level-dependent properties 608 Rate-level curves can be computed to evaluate ANF responses to stimulus level changes, in 609 agreement with experimental procedures [47,75]. Using Eq. 1, we generated pure-tone stimuli 610 (50-ms duration, 2.5-ms rise/fall ramp) with levels between 0 and 100 dB and frequencies of 611 approximately 1 and 4 kHz, based on the corresponding CFs of the ANF models (1007 and 612 3972.7 Hz). The rate levels were derived by computing the average response 10-40 ms after 613 the stimulus onset (i.e., excluding the initial and final 10 ms, where some spike intervals may 614 include spontaneous discharge [75]). Data from the experimental studies are plotted alongside our 615 simulations and reflect a variety of experimental ANF rate-level curves from different species and 616 CFs.

617
Synchrony-level functions were simulated for fully-modulated 400-ms long pure tones with a 618 modulation frequency f m of 100 Hz [50] and carrier frequencies of 1007 and 3972.7 kHz (henceforth 619 referred to as 1 and 4 kHz), generated using Eq. 2. Synchrony to the stimulus envelope was 620 quantified using vector strength [91] and was calculated by extracting the magnitude of the f m 621 component from the Fourier spectrum of the fibers' firing rate. The f m magnitude was normalised 622 to the DC component (0 Hz) of the Fourier spectrum, corresponding to the average firing rate of 623  Fig 11. CoNNear model of the auditory periphery. Acoustic stimuli can be transformed to IHC receptor potentials and ANF firing rates along the cochlear tonotopy and hearing range, after connecting the CoNNear cochlea [62], IHC and ANF modules together.
the fiber [90]. Experimental synchrony-level functions [50] show a non-monotonic relation to the 624 stimulus level and exhibit maxima that occur near the steepest part of ANF rate-level curves.

625
Connecting the different CoNNear modules 626 We considered the evaluation of each CoNNear module separately, without taking into account 627 the CoNNear models of the preceding stages and thus eliminating the contamination of the 628 results by other factors. Each time, the evaluation stimuli were given as inputs to the reference 629 analytical model of the auditory periphery and the necessary outputs were extracted and given as 630 inputs to the respective CoNNear models. However, the different CoNNear models can be merged 631 together to form different subsets of the human auditory periphery, such as CoNNear IHC-ANF 632 or CoNNear cochlea-IHC-ANF , by connecting the output of the second last layer of each model 633 (before cropping) to the input layer of the next one. This can show how well these models can 634 work together and how any internal noise in these neural-network architectures would affect the 635 final response for each module. Using a CNN model of the whole auditory periphery (Fig. 11), 636 population responses can be simulated and similar ANN-based back-ends can be added afterwards 637 to expand the pathway and simulate higher levels of auditory processing.