Automatic spike sorting for high-density microelectrode arrays

Roland Diggelmann; Michele Fiscella; Andreas Hierlemann; Felix Franke

doi:10.1152/jn.00803.2017

2018 Sep 12;120(6):3155–3171. doi: 10.1152/jn.00803.2017

PMCID: PMC6314465 EMSID: EMS79687 PMID: 30207864

Abstract

High-density microelectrode arrays can be used to record extracellular action potentials from hundreds to thousands of neurons simultaneously. Efficient spike sorters must be developed to cope with such large data volumes. Most existing spike sorting methods for single electrodes or small multielectrodes, however, suffer from the “curse of dimensionality” and cannot be directly applied to recordings with hundreds of electrodes. This holds particularly true for the standard reference spike sorting algorithm, principal component analysis-based feature extraction, followed by k-means or expectation maximization clustering, against which most spike sorters are evaluated. We present a spike sorting algorithm that circumvents the dimensionality problem by sorting local groups of electrodes independently with classical spike sorting approaches. It is scalable to any number of recording electrodes and well suited for parallel computing. The combination of data prewhitening before the principal component analysis-based extraction and a parameter-free clustering algorithm obviated the need for parameter adjustments. We evaluated its performance using surrogate data in which we systematically varied spike amplitudes and spike rates and that were generated by inserting template spikes into the voltage traces of real recordings. In a direct comparison, our algorithm could compete with existing state-of-the-art spike sorters in terms of sensitivity and precision, while parameter adjustment or manual cluster curation was not required.

NEW & NOTEWORTHY We present an automatic spike sorting algorithm that combines three strategies to scale classical spike sorting techniques for high-density microelectrode arrays: 1) splitting the recording electrodes into small groups and sorting them independently; 2) clustering a subset of spikes and classifying the rest to limit computation time; and 3) prewhitening the spike waveforms to enable the use of parameter-free clustering. Finally, we combined these strategies into an automatic spike sorter that is competitive with state-of-the-art spike sorters.

Keywords: HD-MEA surrogate data, high-density microelectrode array, prewhitening, spike sorting

INTRODUCTION

High-density microelectrode arrays (HD-MEAs) are important tools in electrophysiology research and are used to simultaneously record the electrical activity of large numbers of neurons. Recent advances in complementary metal-oxide-semiconductor (CMOS) technology have increased the number of simultaneously active recording electrodes from a few hundred to several thousand per chip (Berdondini et al. 2009; Bertotti et al. 2014; Frey et al. 2010; Johnson et al. 2012; Müller et al. 2013; Obien et al. 2017; Viswam et al. 2016). At the same time, the center-to-center electrode distances (pitch) have decreased significantly to a point where the electrode density comes close to or even exceeds the density of neurons in certain tissues, for example, the density of ganglion cells in the murine retina (Fiscella et al. 2012). Each electrode records the activity of all neurons in its vicinity, and because of the high electrode density each neuron is usually recorded on many electrodes. Thus the recorded multielectrode activity can be thought of as the sum of mixed neuronal signals and correlated noise. We do not know a priori the number of active neurons or the exact time at which an action potential was fired. The process of estimating the number of recorded neurons and assigning each detected action potential to one of those neurons is called “spike sorting.”

Various spike sorting methods have been described previously, and many constitute a variant of the following approach (Lewicki 1998): First, spike events are detected through thresholding, and the spike waveforms are extracted from the recordings (throughout this report we assume that the recorded signals have been band-pass filtered before spike detection is performed). Next, a feature extraction, such as principal component analysis, reduces the dimensionality of the cut-out spike waveforms. Finally, a clustering technique groups the spikes and assigns them to putative neurons based on assumptions about the distribution of spikes of the same neuron in the feature space (Einevoll et al. 2012; Jun et al. 2017; Lewicki 1998; Schmidt 1984).

This approach represents the standard reference method against which most other spike sorting algorithms are evaluated; however, it cannot be scaled to HD-MEA data in a straightforward way: When using a thousand parallel readout electrodes at a typical sampling rate of 20 kHz, even 1-ms-short spike waveforms would span a vector space with 20,000 dimensions. Having that many dimensions makes feature extraction and clustering extremely susceptible to even small quantities of noise because of a phenomenon known as the “curse of dimensionality” (Bishop 2007). Furthermore, temporal spike overlaps, i.e., spikes of different neurons that occur at the same time, are difficult to resolve and become more abundant with increasing numbers of recorded neurons (Dragas et al. 2015; Franke et al. 2012; Pillow et al. 2013).

On the other hand, HD-MEAs have the advantage that spikes of each neuron are simultaneously detected on many electrodes, which provides spatial information in addition to the characteristic temporal spike waveforms of each neuron. This combined information increases the discriminability of neurons and has been used in various spike sorting algorithms (Franke et al. 2015a; Hill et al. 2011; Jäckel et al. 2012; Lambacher et al. 2011; Marre et al. 2012; Muthmann et al. 2015; Prentice et al. 2011; Swindale and Spacek 2014).

In this article, we present and discuss three strategies with which the previously sketched standard reference method could be scaled up for HD-MEA data: The first strategy involved splitting the overall set of recording electrodes into smaller local electrode groups (LEGs) and treating each of them independently as a classical spike sorting problem. Yger et al. (2018) used a similar approach with one group around each electrode. At the end, we combined the results from each LEG into a final, global sorting result. This strategy helped to circumvent the “curse of dimensionality” and, at the same time, helped to eliminate the problem of temporal spike overlaps between spatially disjoint neurons (we do not address spatiotemporal overlaps within a single LEG in this report). Another advantage was that all LEGs could be processed in parallel, which made the overall processing duration effectively independent of the number of recording electrodes. We split the recording electrodes into LEGs in a way that each neuron would be well sorted in at least one LEG, while keeping the total number of LEGs low and thus minimizing the number of detections of the same neuron in multiple LEGs.

The second strategy involved performing feature selection and clustering only on a subset of spikes, which limited the computational cost of these steps and rendered them independent of the total number of spikes. In a subsequent step, we classified all spikes by template matching. During template matching, we compared the voltage traces around each detected spike with characteristic waveforms of putative neurons and matched them to the neuron with the most similar waveform (Franke et al. 2015b; Lewicki 1998). This strategy has been employed by several other spike sorters previously (Franke et al. 2015b; Marre et al. 2012; Rutishauser et al. 2006).

The third strategy included the use of prewhitening before feature extraction, followed by nonparametric clustering. We describe how prewhitening increased cluster separation and made cluster shapes and sizes predictable, which allowed us to use parameter-free clustering by means of the mean-shift algorithm.

Finally, we developed an algorithm that incorporates these three strategies, which can be easily parallelized, is scalable to thousands of electrodes, and relies on a clustering algorithm that does not require any parameter adjustment. We evaluated this algorithm against newly generated HD-MEA benchmark data sets with up to 1,000 electrodes, and we compared it to five other spike sorters using publicly available data sets. All of our evaluations were done without manual curation of the sorting results, and all parameters were fixed before we ran the evaluation on the benchmarks, i.e., they were identical for all evaluations and not fitted to individual data sets to improve the performance of the algorithm.

It remains difficult to obtain good benchmarking data sets to evaluate the performance of HD-MEA spike sorting algorithms. Patch-clamp recordings in combination with MEA recordings can be used to provide a ground truth, but currently only a handful of neurons can be patched at the same time (Fournier et al. 2016; Franke et al. 2015a). Simulated data may provide a surrogate ground truth for all neurons in a given data set (Einevoll et al. 2012; Hagen et al. 2015; Jäckel et al. 2012), but the generation of useful data sets requires a well-characterized electrophysiological model for each experimental condition.

To generate our benchmarking data sets, we used an established method to create surrogate ground-truth data by adding typical spike waveforms at random time points into recorded data (Pachitariu et al. 2016; Rossant et al. 2016). We adapted this method for HD-MEA data to obtain a simple and robust tool with which we assessed the performance of our spike sorter.

METHODS

Tissue Preparation

Syrian hamsters (Mesocricetus auratus; Janvier Labs) were anesthetized and killed under protocols that were approved by the Basel City Veterinary Office in accordance with Swiss federal laws on animal welfare. Each hamster was kept in darkness for 10 min, anesthetized (Telazol 30 mg/kg, xylazine 10 mg/kg), and decapitated. Retinas from both eyes were immediately removed under dim red light and immersed in Ames medium (8.8 g/l, supplemented with 1.9 g/l sodium bicarbonate; Sigma-Aldrich Chemie, Buchs, Switzerland), which was perfused with Oxycarbon at room temperature (PanGas, Dagmersellen, Switzerland) for at least 30 min before the optical stimulus sequence was started. The retina patch was placed ganglion cell side down on the electrode array and perfused with Ames medium (pH 7.4, 36°C) equilibrated with 5%CO₂-95%O₂. Retinal ganglion cell extracellular activity was recorded for 7–8 h.

Light Stimulation

For retinal light stimulation, we used the Acer K10 light projector (60-Hz refresh rate) in a previously developed and described setup in which extracellular electrophysiological measurements under light projections can be performed on a microscope stage [more details about the microscope and the optics for focusing the light stimulus on the retina can be found in Fiscella et al. (2012)]. We simultaneously used blue (460 ± 15 nm) and green (523 ± 23 nm) projector LEDs for stimulating retinal photoreceptors with moving, flashing squares and moving bars. We did not analyze the light responses here, as the recordings were used only to create the surrogate data.

MEA Recordings

We used a high-density CMOS MEA (Müller et al. 2015) with a total of 26,400 electrodes at a density of 3,300 electrodes/mm². The electrodes have a size of 9.3 × 5.45 µm², and their center-to-center pitch is 17.5 µm. The MEA featured 1,024 readout channels with a 10-bit, 20-kHz analog-to-digital converter each and a noise level of 2.4 µV_rms in the action potential band (300 Hz to 10 kHz). A reprogrammable switch matrix allowed us to connect two arbitrarily located, disjoint blocks of 23 × 23 adjacent electrodes to the readout channels with only a few electrodes per block that remained unconnected. We created these high-density blocks so that they were translation symmetric, which means that the unconnected electrodes were located at the same relative positions within each block. This was important for the generation of the surrogate data, as can be seen below. We produced six different recordings with a duration of ~20 min each. The signals were digitally band-pass filtered between 300 and 7,000 Hz before the spike sorting started.

RESULTS

Spike Sorting Algorithm

The spike sorting process consisted of three steps as depicted in Fig. 1. In the first step, we divided the global set of electrodes into LEGs. In the second step, we performed spike sorting on each group independently, and we processed all groups in parallel. In the third step, we identified and resolved duplicates of neuronal units, i.e., neurons that were detected in multiple LEGs. The spike sorting process within each LEG started with spike detection and the extraction of the waveforms for each spike. To limit computational complexity, we selected a random subset of N_spikes spikes within each LEG and performed waveform alignment, feature extraction, and clustering only on this subset. The resulting cluster centers were then used to classify the entire set of spikes with template matching. Finally, we merged clusters based on the similarity of their mean waveforms.

We semantically distinguish here between neurons as biological entities in the experiment and neuronal units as the results of the spike sorting process. In an optimal case, each neuronal unit corresponds to one neuron and reproduces all of its action potentials. The values of all parameters that were used and are not specified separately are listed in Table 1.

Table 1.

Values of parameters used in this evaluation

Parameter	Value	Remark
θ_el	4.2σ_n	Spike-detection threshold
Δt_dist	1.25 ms	Maximal temporal distance between threshold crossing and spike peak
Δt_event	0.9 ms	Maximal temporal distance between peaks belonging to the same spike event
f_s	20 kHz	Sampling rate
t_w	1.0 ms	Temporal extension of waveforms
t_peak	0.4 ms	Temporal location of the peak within the waveforms
N_spikes	50,000	Maximal no. of spikes selected for clustering
k	6	No. of principal components used for clustering
c_h	1.8	Parameter for mean-shift bandwidth estimation
M_spikes	10	Minimal number of spikes per cluster
D_max	0.30σ_n	Maximal Euclidean distance between merged templates
P_min	0.93	Minimal value of vector projection between merged templates
Δt_overlap	0.5 ms	Maximal temporal distance between spikes that count as overlaps

σ_n, Noise standard deviation.

The algorithm was implemented in MATLAB, and the source code is available for download at https://git.bsse.ethz.ch:443/hima_public/HDsort.git.

Subdivision into local electrode groups.

We divided the overall array of recording electrodes into groups of adjacent electrodes. These LEGs can overlap, i.e., some electrodes were members of multiple LEGs. The number of LEGs as well as the number of electrodes, N_els, within an LEG varied according to the size and the topology of the recording electrode sets. The LEGs were obtained in an iterative process that assigned each electrode to at least one LEG. The goal was to assign all electrodes in close vicinity to each other to the same LEG in order to minimize the number of groups and the overlap between groups. For a detailed description of this process see Subdivision into Local Electrode Groups in the appendix. We set the maximal number of electrodes per LEG to 9, because we wanted to compartmentalize the recording area into squares of 3 × 3 electrodes. This arrangement proved to be well suited for recordings from murine retina and for the given electrode pitch. There is a trade-off between having more electrodes in an LEG and thereby increasing the amount of information on a specific neuron vs. having fewer electrodes and thereby keeping the number of recorded neurons, the number of spike overlaps, as well as the computational complexity low. We found that, within a radius of 43.5 µm from the center of an LEG, the precise location of the neuron did not affect the sorting performance or, in other words, as long as the maximal signal amplitude of a neuron was within the area covered by the LEG, it did not matter if this neuron was only partially covered by the LEG (for more information see Sorting Performance as a Function of LEG Distance in the appendix). In experiments with different spatial arrangements of the electrodes, electrode pitches, or cell densities, other solutions for the creation of LEGs may be more appropriate.
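As a rough illustration of the 3 × 3 compartmentalization described above, the following Python sketch tiles a rectangular electrode grid into square groups. It is a simplification under stated assumptions: the function name `tile_into_legs` is hypothetical, the groups produced here do not overlap, and unconnected electrodes are ignored. The iterative assignment that was actually used is the one described in the appendix.

```python
def tile_into_legs(n_rows, n_cols, side=3):
    """Partition an n_rows x n_cols grid into blocks of at most side x side electrodes.

    Assumption: a plain, non-overlapping tiling; edge blocks may be smaller.
    """
    legs = []
    for r0 in range(0, n_rows, side):
        for c0 in range(0, n_cols, side):
            leg = [(r, c)
                   for r in range(r0, min(r0 + side, n_rows))
                   for c in range(c0, min(c0 + side, n_cols))]
            legs.append(leg)
    return legs


# One 23 x 23 high-density block, as used in the recordings.
legs = tile_into_legs(23, 23)
largest = max(len(leg) for leg in legs)
covered = sum(len(leg) for leg in legs)
print(len(legs), largest, covered)
```

Because 23 is not a multiple of 3, the blocks at the lower and right edges contain fewer than 9 electrodes in this simple tiling.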

Spike detection.

Spikes were detected on each electrode independently whenever the signal crossed a threshold (θ_el). The threshold was defined independently for each electrode as a multiple of the electrode’s noise standard deviation (σ_el). We estimated σ_el in a two-step procedure: In the first step, we computed a preliminary threshold, based on the signal standard deviation, and detected all spikes that surpassed this threshold. In the second step, we excluded these spikes and computed the median absolute deviation on the remaining noise signal. This allowed us to estimate σ_el in a way that is robust also in the presence of spikes and other outliers.
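The two-step noise estimate can be sketched in Python as follows. This is a minimal illustration, not the published implementation: the function name, the factor used for the preliminary threshold, and the synthetic trace are assumptions, and individual samples beyond the preliminary threshold are excluded here instead of whole spike waveforms. The constant 0.6745 converts a median absolute deviation into a standard deviation for Gaussian noise.

```python
import random
import statistics


def estimate_noise_sd(trace, prelim_factor=4.2):
    """Two-step robust estimate of the noise standard deviation of one electrode."""
    # Step 1: preliminary threshold based on the overall signal standard deviation.
    # prelim_factor is an assumed value; the text does not specify it.
    prelim = prelim_factor * statistics.pstdev(trace)
    # Step 2: exclude samples beyond the preliminary threshold and compute the
    # median absolute deviation (MAD) of the remaining samples.
    noise = [v for v in trace if abs(v) < prelim]
    med = statistics.median(noise)
    mad = statistics.median(abs(v - med) for v in noise)
    return mad / 0.6745


# Synthetic trace (assumption): Gaussian noise plus 50 large negative deflections.
random.seed(0)
trace = [random.gauss(0.0, 2.4) for _ in range(20000)]
for t in range(100, 20000, 400):
    trace[t] -= 60.0

sigma_el = estimate_noise_sd(trace)
theta_el = 4.2 * sigma_el  # detection threshold as a multiple of the noise SD
print(round(sigma_el, 2), round(theta_el, 2))
```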

We defined the spike event as a time point at which the electrical signal reached its maximal amplitude within a short period (Δt_dist) after threshold crossing. When two spike events on different electrodes within a single LEG occurred within a defined short time interval (Δt_event), we considered them to be the same spike and kept only the spike event with the highest amplitude. Then, we extracted the waveforms, $w_{i, j} (t)$, for each spike event i on each electrode j within the LEG and grouped them to obtain a multielectrode waveform $w_{i} (t) = {[w_{i, 1} (t) ... w_{i, N_{els}} (t)]}^{T}$. We selected a random subset of N_spikes spikes, which we used in the following tasks of waveform alignment, feature selection, and clustering.
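The peak search after a threshold crossing and the merging of near-simultaneous events within an LEG might look like the Python sketch below. The function names and the toy traces are hypothetical, amplitude is taken as the absolute deflection, and the search simply skips ahead by one window after each crossing. These are assumptions for illustration only.

```python
def detect_peaks(trace, theta, fs=20000, dt_dist=1.25e-3):
    """Return (sample, amplitude) of the largest deflection after each threshold crossing."""
    win = int(round(dt_dist * fs))
    peaks, t = [], 0
    while t < len(trace):
        if abs(trace[t]) > theta:
            seg = trace[t:t + win + 1]
            k = max(range(len(seg)), key=lambda i: abs(seg[i]))
            peaks.append((t + k, abs(seg[k])))
            t += win + 1  # continue after the search window
        else:
            t += 1
    return peaks


def merge_events(peaks_per_electrode, fs=20000, dt_event=0.9e-3):
    """Among peaks closer together than dt_event, keep only the largest one."""
    win = int(round(dt_event * fs))
    all_peaks = sorted(p for peaks in peaks_per_electrode for p in peaks)
    events = []
    for t, amp in all_peaks:
        if events and t - events[-1][0] <= win:
            if amp > events[-1][1]:
                events[-1] = (t, amp)
        else:
            events.append((t, amp))
    return events


# Toy LEG with two electrodes (assumed data): one spike seen on both, one on a single electrode.
el0 = [0.0] * 1000
el0[99], el0[100], el0[101] = -20.0, -50.0, -20.0
el1 = [0.0] * 1000
el1[102] = -30.0
el1[500] = -40.0

events = merge_events([detect_peaks(el0, 10.0), detect_peaks(el1, 10.0)])
print(events)
```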

Waveform alignment.

As the recording sampling rate, f_s, is finite, the spike waveforms, their amplitudes, and the exact timing of their peaks may slightly differ because of small temporal shifts of the waveform with respect to the sampling interval (so-called registration jitter). We corrected for this jitter through upsampling, alignment of the upsampled waveforms, and subsequent downsampling. The upsampling factor (L_up = 3) and the downsampling factor (L_down = 3) were chosen to be the same in this case, which meant that the aligned waveforms had the same length as the original waveforms. To align the spikes, we searched for the maximum in the cross-correlation between each upsampled waveform $w_{i}^{up}$ and the mean waveform $\bar{w}^{up}$, which gave us the temporal shift $τ_{i} = \arg \max_{τ} {\sum_{t} w_{i}^{up} {(t)}^{T} \cdot \bar{w}^{up} (t + τ)}$. We shifted the waveforms by this value, $w_{i}^{up} (t) \leftarrow w_{i}^{up} (t - τ_{i})$, and repeated this process until convergence was reached.
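The iterative alignment can be illustrated with a short Python sketch that follows the shift rule stated above. It is a simplified version under assumptions: single-electrode waveforms, integer shifts only (the upsampling and downsampling steps are omitted), zero padding at the edges, and hypothetical function names.

```python
def shift(w, tau):
    """Return w(t - tau), zero-padded at the edges."""
    n = len(w)
    return [w[t - tau] if 0 <= t - tau < n else 0.0 for t in range(n)]


def best_shift(w, ref, max_shift):
    """Shift tau that maximizes sum_t w(t) * ref(t + tau)."""
    n = len(w)

    def xcorr(tau):
        return sum(w[t] * ref[t + tau] for t in range(n) if 0 <= t + tau < n)

    return max(range(-max_shift, max_shift + 1), key=xcorr)


def align(waveforms, max_shift=3, max_iter=10):
    """Shift every waveform toward the current mean until no shift is needed."""
    waves = [list(w) for w in waveforms]
    for _ in range(max_iter):
        mean = [sum(col) / len(waves) for col in zip(*waves)]
        taus = [best_shift(w, mean, max_shift) for w in waves]
        if not any(taus):
            break  # converged: all shifts are zero
        waves = [shift(w, tau) for w, tau in zip(waves, taus)]
    return waves


# Three copies of one toy spike, jittered by -1, 0, and +1 samples.
base = [0, 0, 0, -1, -4, -9, -4, -1, 0, 0, 0]
jittered = [shift(base, s) for s in (-1, 0, 1)]
aligned = align(jittered)
peaks = [min(range(len(w)), key=w.__getitem__) for w in aligned]
print(peaks)
```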

We trimmed the aligned waveforms to a length t_w with their peaks located at t_peak relative to their first sample. Then, we concatenated them over all electrodes to form a vector w_i with dimension N_dim = t_w·f_s·N_els.
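As a worked example of N_dim = t_w·f_s·N_els, the Python lines below evaluate the dimension for the parameter values in Table 1, once for the 1,000-electrode case mentioned in the introduction and once for a single LEG of 9 electrodes. The helper name `waveform_dim` is hypothetical.

```python
def waveform_dim(t_w_ms, f_s_khz, n_els):
    """Dimension of a concatenated multielectrode waveform: samples per electrode x electrodes."""
    samples_per_electrode = round(t_w_ms * f_s_khz)
    return samples_per_electrode * n_els


dim_full_array = waveform_dim(1.0, 20, 1000)  # all electrodes at once
dim_single_leg = waveform_dim(1.0, 20, 9)     # one 3 x 3 LEG
print(dim_full_array, dim_single_leg)  # 20000 180
```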

Feature selection.

We started the feature selection with prewhitening of the waveforms. For this, we estimated the spatiotemporal noise covariance matrix, C, on periods of data where no spike events were detected. We computed C with a method similar to that described in Pouzat et al. (2002). In brief, we computed the auto- and cross-correlations of noise epochs (i.e., in voltage traces where no spikes occurred) between all pairs of electrodes. With these correlation functions, we built the blockwise Toeplitz matrix C, where each block corresponds to the temporal noise covariance between two electrodes (for more details see Noise Covariance Matrix in the appendix).

Through Cholesky decomposition, we obtained the whitening matrix U that has the property U^TU = C. The spike waveforms were then transformed by w_i ← U⁻¹w_i (Franke et al. 2015b; Pouzat et al. 2002; Rutishauser et al. 2006). After prewhitening, we performed a principal component analysis and reduced the dimensionality of the spike vectors by keeping only the projections of each spike on the first k principal components (PCs).
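The effect of Cholesky-based prewhitening can be checked numerically with the Python sketch below, which uses only the standard library and a hypothetical 2 × 2 noise covariance. One convention is assumed here: waveforms are treated as column vectors, for which the transform that yields identity noise covariance under U^TU = C is (U^T)⁻¹w, computed by forward substitution. When waveforms are stored as rows, the same operation is a right-multiplication by U⁻¹. The principal component analysis step is not shown.

```python
import math
import random


def cholesky_upper(C):
    """Return upper-triangular U with U^T U = C (C symmetric positive definite)."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(C[i][i] - s)
            else:
                L[i][j] = (C[i][j] - s) / L[j][j]
    return [[L[j][i] for j in range(n)] for i in range(n)]  # U = L^T


def whiten(w, U):
    """Solve U^T y = w by forward substitution, i.e., y = (U^T)^-1 w."""
    y = [0.0] * len(w)
    for i in range(len(w)):
        s = sum(U[k][i] * y[k] for k in range(i))
        y[i] = (w[i] - s) / U[i][i]
    return y


def covariance(samples, i, j):
    """Sample covariance between dimensions i and j."""
    n = len(samples)
    mi = sum(s[i] for s in samples) / n
    mj = sum(s[j] for s in samples) / n
    return sum((s[i] - mi) * (s[j] - mj) for s in samples) / (n - 1)


C = [[4.0, 2.0], [2.0, 3.0]]  # assumed toy noise covariance
U = cholesky_upper(C)

random.seed(1)
noise = []
for _ in range(20000):
    z = (random.gauss(0, 1), random.gauss(0, 1))
    # Correlated noise with covariance C: x = U^T z
    noise.append([sum(U[k][i] * z[k] for k in range(2)) for i in range(2)])

white = [whiten(x, U) for x in noise]
print(round(covariance(noise, 0, 1), 2), round(covariance(white, 0, 1), 2))
```

In this toy case the off-diagonal covariance is close to 2 before whitening and close to 0 afterward, while the variances of the whitened samples are close to 1.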

Prewhitening is a linear operation that transforms signals with a given noise covariance matrix C so that the resulting signals are decorrelated, i.e., their noise covariance becomes the identity matrix. This means, in an ideal case where cluster shapes are fully defined by the noise covariance, that after prewhitening all clusters become hyperspherical with σ² = 1 (Fig. 2A). In this case, the first few PCs of all spikes will represent the largest between-cluster variance, and the cluster distances in this subspace will increase on average by 50% (Fig. 2, A–D) in comparison to cluster distances in the PC space computed with nonprewhitened spike waveforms.

Fig. 2. — Effects of prewhitening on cluster separation in the linear subspace spanned by the first principal components (PCs) (*A–D*) and on cluster shapes (*E–G*). A: simplified schematic of the effect of prewhitening (W, and its inverse W⁻¹)on the PCs in a 2-dimensional example case (x₁ and x₂). Correlated noise between x₁ and x₂ gives clusters an elliptic shape. *PC1* is not an optimal direction to separate these 2 clusters. Prewhitening decorrelates the noise and produces circular clusters. After prewhitening *PC1′* represents the optimal subspace for cluster separation. B and C: example of a labeled cluster (dark gray) corresponding to an artificial unit surrounded by other clusters in its local electrode group (light gray). B: scatterplot in *PCs 1–3*. C: histogram of Euclidean distances in *PCs 1–6* for each spike to the center of the labeled cluster. The distance was normalized such that the variance of the labeled cluster became 1. Vertical dashed lines show the distance d within which 5% of spikes from other clusters are located. D: change of distances Δd introduced by prewhitening for 109 labeled clusters across first 30 PCs. Each point represents 1 cluster. Horizontal black lines indicate the median values. E: schematic of the contribution of cluster PCs to amplitude variability. We computed principal component analyses for each cluster and determined the contribution μ of each cluster PC to the direction of amplitude variability, which is the direction from the origin to the cluster center. F and G: means (black line) and standard deviations (gray areas) over 440 clusters. F: % of amplitude variability captured by each PC. The total contribution over the first 6 PCs is 75% and 64% in the nonwhitened and the prewhitened case, respectively. G: standard deviations within each PC (dark gray) compared with the standard deviation within the direction of amplitude variability (light gray).

Cluster shapes in our data, however, were not determined by noise correlation alone but also by intrinsic spike variation, mainly amplitude variation. A large share of the variability captured by the first 6 PCs is due to variability in the amplitude of the spikes (Fig. 2, E and F). Prewhitening therefore did not produce completely spherical clusters; instead, the median standard deviation in the first PC was σ₁ = 6.8, and the average standard deviation over the first 6 PCs was $\sqrt{\bar{σ^{2}}} = 4.3$ (Fig. 2G).

Clustering.

Prewhitening had the effect that the cluster distributions became close to standard normal, which allowed us to implement a parameter-free spike clustering method by using mean-shift clustering with a flat kernel (Fukunaga and Hostetler 1975). Mean-shift is a density-based clustering algorithm that requires only one input parameter, h, called bandwidth. It works by iteratively shifting each data point x toward the mean of all points within a neighborhood around x, until all points converge to their cluster centers. The size of the neighborhood is defined by the bandwidth. The shift of each point toward the mean within the neighborhood is given by

m (x) = \frac{\sum_{x_{i} \in N (x)} K (x - x_{i}) x_{i}}{\sum_{x_{i} \in N (x)} K (x - x_{i})}

where N(x) is the neighborhood of x and K(x) is a flat kernel defined as

K (x) = \begin{cases} 1 & \text{if } x^{T} x \leq h^{2} \\ 0 & \text{if } x^{T} x > h^{2} \end{cases}

To avoid under- or overclustering, the size of the kernel should be roughly on the same order as the spread of the clusters in the data. For a k-dimensional cluster $X \sim N_{k} (μ, C)$ with mean μ and covariance matrix C, the mean squared Euclidean distance (d² = x^Tx) between each data point and the mean is given as

E [{(x - μ)}^{T} (x - μ)] = k \bar{σ^{2}}

Using our empirical estimate of the residual variability of the clusters after prewhitening, we could therefore fix the bandwidth to h² = c_h·k with a constant parameter $c_{h} \approx \bar{σ^{2}}$.

At the end of the clustering step, we retained all clusters that had at least M_spikes spikes, which prevented single outliers from forming their own clusters. It is important to keep in mind that the clustering was not used for the final assignments of spikes to neuronal units but only to compute the templates, which were subsequently used for template matching.
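A compact Python sketch of flat-kernel mean-shift with the bandwidth h² = c_h·k and the minimum cluster size M_spikes is given below. It is an illustration under assumptions, not the published code: the function names are hypothetical, the rule for grouping converged points (modes closer than h/2 belong together) is an assumed convention, and the test data are two tidy, well-separated toy clusters plus one outlier, far cleaner than real spike features.

```python
import math


def mean_shift(points, h, max_iter=100, tol=1e-9):
    """Move each point to the mean of all data points within distance h until nothing moves."""
    modes = [list(p) for p in points]
    for _ in range(max_iter):
        largest_move = 0.0
        for i, x in enumerate(modes):
            nbrs = [p for p in points if math.dist(x, p) <= h]  # flat kernel
            new = [sum(col) / len(nbrs) for col in zip(*nbrs)]
            largest_move = max(largest_move, math.dist(x, new))
            modes[i] = new
        if largest_move < tol:
            break
    return modes


def group_modes(modes, merge_dist, min_size):
    """Group converged points by proximity and drop groups smaller than min_size."""
    centers, members = [], []
    for idx, m in enumerate(modes):
        for j, c in enumerate(centers):
            if math.dist(m, c) < merge_dist:
                members[j].append(idx)
                break
        else:
            centers.append(m)
            members.append([idx])
    keep = [j for j in range(len(centers)) if len(members[j]) >= min_size]
    return [centers[j] for j in keep], [members[j] for j in keep]


k, c_h, m_spikes = 2, 1.8, 10   # k reduced to 2 for this toy example
h = math.sqrt(c_h * k)          # bandwidth from h^2 = c_h * k

offsets = [-0.6, -0.2, 0.2, 0.6]
points = [(cx + dx, dy) for cx in (0.0, 8.0) for dx in offsets for dy in offsets]
points.append((4.0, 5.0))       # isolated outlier

modes = mean_shift(points, h)
centers, members = group_modes(modes, merge_dist=h / 2, min_size=m_spikes)
print(len(centers), [len(m) for m in members])
```

The isolated point converges to itself and is discarded by the minimum-size rule, which mirrors the role of M_spikes described above.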

Template matching.

In the previous step we clustered only a subset of all detected spikes and obtained clusters that corresponded to putative neuronal units. For each cluster n we computed a template ξ_n by averaging the waveforms w of their assigned spikes.

In this step, we classified all detected spikes in a process called “template matching.” This process compares the waveform of each spike to all templates and matches each waveform to the template with the highest similarity. Here we used a specific method called Bayes optimal template matching. It combines matched filters with Fisher linear discriminant analysis. Matched filters maximize the separation between signal and noise, whereas linear discriminant analysis optimizes the discrimination between classes that have the same covariance matrix (Franke 2011; Franke et al. 2015b) (for more details see Template Matching in the appendix). In brief, the multielectrode waveform of each detected spike is convolved with a set of multielectrode matched filters. Each filter is matched to one of the templates. From each filter’s output a discriminant function can be computed, which ranks how well the spike fits to the respective template. The values of all discriminant functions are then compared against each other, and the spike is assigned to the template with the largest associated discriminant function value.
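The assignment rule can be illustrated with a reduced Python sketch. It makes strong simplifying assumptions that differ from the full method described in the appendix: waveforms are already prewhitened (identity noise covariance) and aligned, each spike is a single vector, so no convolution along the voltage trace is performed, and priors are equal unless given. Under these assumptions the linear discriminant for template ξ_n takes the standard form wᵀξ_n − ½ξ_nᵀξ_n + ln p_n. The function names and the toy templates are hypothetical.

```python
import math


def discriminant(w, template, prior):
    """Linear discriminant for Gaussian classes with identity covariance."""
    dot = sum(a * b for a, b in zip(w, template))  # matched-filter output
    energy = sum(b * b for b in template)          # template energy
    return dot - 0.5 * energy + math.log(prior)


def classify(w, templates, priors=None):
    """Return the index of the template with the largest discriminant value."""
    if priors is None:
        priors = [1.0 / len(templates)] * len(templates)
    scores = [discriminant(w, t, p) for t, p in zip(templates, priors)]
    return max(range(len(templates)), key=scores.__getitem__)


# Two toy templates with the same shape but different amplitudes.
templates = [[0.0, -5.0, -10.0, -5.0, 0.0],
             [0.0, -2.0, -4.0, -2.0, 0.0]]

large_spike = [0.3, -4.6, -9.1, -5.4, 0.2]
small_spike = [0.1, -2.2, -3.5, -1.8, -0.3]
labels = [classify(large_spike, templates), classify(small_spike, templates)]
print(labels)
```

The energy term is what lets the rule separate templates that differ only in amplitude; a pure matched-filter output would always favor the larger template.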

Cluster merging.

Spikes from the same neuron may be distorted by noise and overlying spikes of other neurons so that they do not always have the same waveform. Additionally, the spike generation process can cause variability in the waveforms, e.g., during a burst (Fee et al. 1996). The first spike within the burst will have a larger amplitude than the second spike within the same burst. Therefore, a single neuron can produce spikes that fall into multiple, well-defined clusters if its intrinsic waveform variability is on the same order as or larger than its extrinsic variability (i.e., noise and overlapping spikes).

There are two ways to address this problem: First, one can increase the mean-shift bandwidth in an effort to prevent split neurons, but this approach increases the risk of merging two separate units together. Second, one can keep a smaller bandwidth and end up with neurons that are represented by multiple clusters. The second option is the preferred one. Not only is it easier to do automatic merging of clusters than automatic splitting, it is also beneficial to have more than one template per neuron for template matching. For this reason, we set the mean-shift bandwidth parameter lower than the previously estimated value.

It is therefore important that there is a step in which the detected neuronal units are compared to each other and merged if they represent the same neuron. This step was done after template matching, when all the spikes were classified and the final template for each neuronal unit could be computed. Since the final templates were averaged over more spikes than those obtained through clustering, their noise component was lower. We merged two neuronal units when the following two conditions with respect to their templates ξ_n and ξ_m were met:

\max (ξ_{n} - ξ_{m}) < D_{\max}

and

\frac{ξ_{n}^{T} ξ_{m}}{‖ ξ_{n} ‖ ‖ ξ_{m} ‖} > P_{min}

This process of merging was done iteratively: 1) The two units with the smallest distance were merged (i.e., the spikes of one unit were assigned to the other). 2) The template of the newly formed unit was computed by averaging the waveforms of all its spikes. 3) The distances and the projections with respect to the other templates were recalculated. 4) The process was repeated until no pair of units met the merging criteria.
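The iterative merging loop can be sketched in Python as follows. Several points are assumptions made for illustration: the first criterion is read as the largest absolute sample-wise difference between two templates, waveforms are expressed in units of the noise standard deviation, templates are plain averages of the assigned waveforms, and the function names and toy units are hypothetical.

```python
import math


def max_abs_diff(a, b):
    """Largest absolute sample-wise difference between two templates (assumed reading)."""
    return max(abs(x - y) for x, y in zip(a, b))


def projection(a, b):
    """Normalized vector projection (cosine similarity) of two templates."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def merge_units(units, d_max=0.30, p_min=0.93):
    """units: one list of waveforms per unit. Merge until no pair meets both criteria."""
    units = [list(u) for u in units]
    while True:
        temps = [[sum(col) / len(u) for col in zip(*u)] for u in units]
        candidates = []
        for n in range(len(units)):
            for m in range(n + 1, len(units)):
                d = max_abs_diff(temps[n], temps[m])
                if d < d_max and projection(temps[n], temps[m]) > p_min:
                    candidates.append((d, n, m))
        if not candidates:
            return units
        _, n, m = min(candidates)        # closest pair first
        units[n].extend(units.pop(m))    # reassign spikes; templates are recomputed next pass


unit_a = [[0.0, -10.0, 0.0], [0.0, -10.2, 0.0]]
unit_b = [[0.0, -10.1, 0.1]]
unit_c = [[0.0, -3.0, 0.0]]
merged = merge_units([unit_a, unit_b, unit_c])
print([len(u) for u in merged])
```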

Duplicate resolution.

As a result of the process described in the previous section, we obtained a number of neuronal units for every LEG, each consisting of a spike train and a template. In a final step, we resolved and removed duplicated neuronal units between LEGs, each of which was the result of a neuron being detected independently in multiple LEGs. Occurrence of the same neuronal unit in multiple LEGs was found in cases when the extracellular signal of a neuron was large and spread over several LEGs or when a neuron was located at the intersection of two LEGs. We used a simple heuristic that was computationally efficient and that compared the global templates (i.e., the templates over the entire set of recording electrodes) and the spike trains to decide whether two units represented the same neuron.

We began with the assumption that a neuron was always detected best in the LEG where its amplitude was largest. Consequently, when the global template of a neuronal unit found in LEG A had its maximal amplitude in LEG B, the respective neuron must have been detected more reliably in LEG B. Therefore, we removed all units that featured global templates with maximal amplitude located in other LEGs, which accounted for roughly half of all duplicates (data not shown).

The remaining duplicates were units that had their maximal template amplitudes on electrodes at the intersection of two LEGs. We made pairwise comparisons of all these units and earmarked those pairs that featured sufficient template similarity (for more details see Duplicate Resolution in the appendix).

Finally, we compared the spike trains of these earmarked pairs: We counted the number of overlapping spikes, i.e., the spikes within the same time window of Δt_overlap. When the percentage of overlapping spikes was >50%, we determined the pair to be duplicates and discarded the unit with the smaller maximum template amplitude. All remaining units were retained as final results of the spike sorting algorithm.
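The spike-train comparison for an earmarked pair can be sketched in Python as below. The text does not state which spike count serves as the denominator of the overlap percentage, so this sketch assumes the shorter of the two trains. The function name and the example spike times are likewise hypothetical.

```python
import bisect


def overlap_fraction(train_a, train_b, dt_overlap=0.5e-3):
    """Fraction of spikes in the shorter train that have a partner within dt_overlap."""
    short, long_ = sorted((train_a, train_b), key=len)
    long_ = sorted(long_)
    hits = 0
    for t in short:
        # First spike in the longer train at or after t - dt_overlap.
        i = bisect.bisect_left(long_, t - dt_overlap)
        if i < len(long_) and long_[i] <= t + dt_overlap:
            hits += 1
    return hits / len(short)


# Spike times in seconds for two units from neighboring LEGs (toy data).
unit_1 = [0.0100, 0.0500, 0.1200, 0.2000]
unit_2 = [0.0102, 0.0499, 0.1203, 0.3000, 0.4000]

fraction = overlap_fraction(unit_1, unit_2)
is_duplicate = fraction > 0.5
print(fraction, is_duplicate)
```

Of a pair flagged in this way, the unit with the smaller maximum template amplitude would then be discarded, as described above.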

Surrogate Data Generation

For the performance assessment of a spike sorter, it is necessary to have a benchmarking data set in which the exact spike times of many neurons are known. We wanted to emulate realistic noise properties and spike shapes for each experimental condition that we analyzed. In this section, we describe a simple method with which we generated surrogate data based on real HD-MEA recordings. We created two separate benchmarks, each containing 12 data sets, with different randomly inserted neuronal units. In the first benchmark, the amplitudes of the inserted spikes were varied, in the second benchmark the firing rates.

Initially, we ran our spike sorter on six different 20-min recordings from murine retinas without surrogate ground truth and detected a total of 4,034 neuronal units that we named “original units.” As our HD-MEA allowed us to record from an almost arbitrary selection of 1,024 electrodes of the 26,000 available electrodes, we placed the electrodes in two similar blocks of 23 × 23 adjacent electrodes. The electrode blocks (Fig. 3A) were spatially separated by ∼0.1 mm. Because of design constraints of our HD-MEA, some electrodes were not connected during the recording, resulting in apparent gaps in the recording area. These gaps did not influence the performance assessment. For each recording, we computed global templates of 10 original units per electrode block. To make sure that the templates contained little noise, we only took the mean spike waveforms of units with at least 4,000 spikes (~3.3 spikes/s). We also band-pass filtered these templates with the same filter settings that were used to prefilter the raw data (300–6,000 Hz) and multiplied them by a Tukey window to ensure that each waveform started and ended at zero.

Fig. 3. — Generation of benchmark data exemplified with a representative artificial neuron. A: schematic of “waveform swapping” to generate artificial templates: the microelectrode data were recorded with an electrode configuration consisting of 2 high-density electrode blocks. The data were spike sorted, and 20 units from the sorting were chosen (“original units”). An “artificial unit” was created from each of the templates by swapping of the waveforms between the 2 high-density blocks. B: example template of an artificial neuron. *Inset*: region of the array in which the template had the largest-amplitude waveforms. (Note that the high-density blocks contained gaps. This is a consequence of a design constraint in our high-density microelectrode arrays that only subsets of electrodes can be simultaneously recorded with high-density electrode configurations.) C: example waveforms of a spike that has been inserted into the recordings. D: recording traces of the electrodes corresponding to the one in C before insertion of the spike (*top*) and after insertion (*bottom*). E: superposition of all spike waveforms on the one electrode before insertion into the recordings (*top*) and after insertion into the recordings (*bottom*). *Insets*: histograms of spike amplitudes. B–E: * marks the electrode where the amplitude of the example waveform was maximal.

For each of these templates, we created a new, artificial template by interchanging the waveforms of the original template between the two high-density blocks (Fig. 3, A and B). The obtained new templates formed the basis of 20 “artificial units.” The advantage of this procedure is that the shape and spatial distribution of the inserted spikes were identical to those obtained from real neurons in the recordings, yet the interchange between the blocks put them at new locations so that they were sufficiently different from their originals.
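The waveform swapping can be sketched in a few lines. The sketch assumes that a template is stored as an array of shape (electrodes, samples) and that the electrodes of the two blocks are listed in corresponding order, so that the i-th electrode of one block is exchanged with the i-th electrode of the other; the function and variable names are hypothetical.

```python
import numpy as np

def swap_blocks(template, block_a, block_b):
    """Exchange the waveforms of paired electrodes between two blocks.

    template: array (n_electrodes, n_samples)
    block_a, block_b: equal-length electrode index lists in corresponding order
    """
    block_a, block_b = np.asarray(block_a), np.asarray(block_b)
    if block_a.shape != block_b.shape:
        raise ValueError("blocks must contain the same number of electrodes")
    swapped = template.copy()
    swapped[block_a] = template[block_b]
    swapped[block_b] = template[block_a]
    return swapped

# Toy template: 4 electrodes (2 per block), 3 samples each
original = np.arange(12.0).reshape(4, 3)
artificial = swap_blocks(original, block_a=[0, 1], block_b=[2, 3])
```

Applying the swap twice returns the original template, which reflects that the waveforms themselves are unchanged and only their location on the array differs.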

The templates created this way were inserted as spikes into the original recordings by adding the waveforms onto the respective recorded traces (Fig. 3, C and D). Before insertion of an artificial spike i we multiplied the respective templates by a factor α_i drawn from a normal distribution with mean 1 and variance $σ_{α}^{2}$, $α_{i} \sim \mathcal{N}(1, σ_{α}^{2})$, so that the amplitudes of the inserted spikes reflected the amplitude distribution of neurons in the recordings (Fig. 3E, top). The spike amplitudes after insertion were therefore randomly distributed, but their variance was larger than $σ_{α}^{2}$ because of the noisy background to which they were added (Fig. 3E, bottom; see Spike Amplitude Distribution in the appendix). We also jittered each spike in time before inserting it into the data: we upsampled the respective template by a factor of 10, shifted the upsampled template by a random number of samples between 0 and 9, and then downsampled it again. This process ensured that the inserted spikes were not always perfectly aligned with the sampling intervals.
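The amplitude scaling and subsample jitter can be sketched as follows. Linear interpolation stands in for the upsampling step, the shift is applied as a delay, and the value of σ_α in the example is arbitrary; all three are assumptions made for illustration, as are the function and variable names.

```python
import numpy as np

def scale_and_jitter(template, sigma_alpha, rng, upsampling=10):
    """Scale a template by alpha ~ N(1, sigma_alpha^2) and delay it by a
    random fraction of a sample.

    template: array (n_electrodes, n_samples); returns (spike, alpha, shift).
    """
    n_samples = template.shape[1]
    alpha = rng.normal(1.0, sigma_alpha)
    shift = int(rng.integers(0, upsampling))        # 0 .. upsampling - 1
    t_coarse = np.arange(n_samples)
    t_fine = np.arange(n_samples * upsampling) / upsampling
    # Assumption: linear interpolation replaces the actual upsampling filter
    fine = np.stack([np.interp(t_fine, t_coarse, w) for w in template])
    # Delay by `shift` fine samples; zero padding relies on the zero onset
    shifted = np.pad(fine, ((0, 0), (shift, 0)))[:, :fine.shape[1]]
    return alpha * shifted[:, ::upsampling], alpha, shift

template = np.array([[0.0, 1.0, -4.0, 2.0, 0.0],
                     [0.0, 0.5, -2.0, 1.0, 0.0]])
rng = np.random.default_rng(1)
spike, alpha, shift = scale_and_jitter(template, sigma_alpha=0.1, rng=rng)
```

The same factor and the same shift are applied to all electrodes of a spike, so that the spatial footprint of the template is preserved.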

To choose the time points at which spikes were inserted into the data, we computed an independent Poisson process for each artificial unit with spiking rate parameter λ_i and a refractory period of 1.5 ms.
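One way to generate such spike times is sketched below. The sketch adds the refractory period to every exponentially distributed interspike interval, which slightly lowers the effective rate relative to λ_i; whether this or a deletion-based scheme was used is not stated here, so this choice and the function name are assumptions.

```python
import numpy as np

def poisson_spike_times(rate_hz, duration_s, rng, refractory_s=1.5e-3):
    """Spike times (s) of a Poisson process with an absolute refractory period."""
    times = []
    t = 0.0
    while True:
        # Each interval is the refractory period plus an exponential waiting time
        t += refractory_s + rng.exponential(1.0 / rate_hz)
        if t >= duration_s:
            break
        times.append(t)
    return np.array(times)

rng = np.random.default_rng(0)
spikes = poisson_spike_times(rate_hz=5.0, duration_s=20 * 60, rng=rng)
```

For a 20-min recording and λ_i = 5 Hz, this yields on the order of 6,000 spikes per artificial unit.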

Performance Assessment

We evaluated the performance of our algorithm by sorting the surrogate data and comparing the resulting sorted units with the inserted artificial units. We benchmarked the sorting performance as a function of spike amplitude and spike rate, using two dedicated sets of surrogate data. Since our surrogate data consisted not only of the artificial units but also of many more unknown neurons, it was not trivial to obtain meaningful metrics to assess the sorting performance. We matched the artificial units to the sorted units by using two independent procedures: In the first procedure, we compared the spike trains of all pairs of one sorted unit and one artificial unit and counted the number of true positives as well as the detection errors categorized as false positives, false negatives, and false classifications (for a detailed description of these terms see Performance Assessment in the appendix). We matched those pairs that produced the smallest number of detection errors. In the second procedure, we compared the templates of the units and matched the pairs with the smallest Euclidean distance of their templates. When both procedures matched the same sorted unit to an artificial unit we categorized it as “found,” and otherwise as “lost.”
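The two matching procedures and the “found”/“lost” decision can be sketched as follows. The sketch assumes a precomputed matrix of detection error counts for all pairs and templates flattened to vectors; the names and the toy numbers are hypothetical.

```python
import numpy as np

def match_units(error_counts, templates_artificial, templates_sorted):
    """Match artificial units to sorted units with two independent criteria.

    error_counts[i, j]: detection errors between artificial unit i and sorted unit j
    templates_*:        arrays (n_units, n_features) of flattened templates
    """
    # Procedure 1: sorted unit with the fewest detection errors
    by_spikes = error_counts.argmin(axis=1)
    # Procedure 2: sorted unit with the smallest Euclidean template distance
    diff = templates_artificial[:, None, :] - templates_sorted[None, :, :]
    by_template = np.linalg.norm(diff, axis=2).argmin(axis=1)
    # A unit counts as "found" only when both procedures agree
    found = by_spikes == by_template
    return by_spikes, by_template, found

error_counts = np.array([[5, 40, 60],
                         [50, 30, 45]])
templates_artificial = np.array([[1.0, 0.0], [0.0, 1.0]])
templates_sorted = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
by_spikes, by_template, found = match_units(
    error_counts, templates_artificial, templates_sorted)
```

In the toy example, the first artificial unit is matched to the same sorted unit by both procedures and is “found,” whereas the second is matched to different sorted units and is “lost.”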

We then computed the sorting performance metrics, sensitivity, precision, and error rate, to quantify the detection accuracy of each artificial unit based on spike train similarity to its matched sorted unit. The descriptions of these metrics are given in Table 2 (for a more detailed description see Performance Assessment in the appendix).

Table 2.

Definitions of sorting performance metrics

Term	Description
Sensitivity	Percentage of true positives in surrogate ground truth spikes
Precision	Percentage of true positives in sorted unit spikes
Error rate	Percentage of detection errors in surrogate ground truth spikes
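The three metrics can be written out as a minimal sketch that takes spike counts as input. The function name and the numbers in the example are hypothetical; the example is chosen to show that the error rate, which is normalized by the number of ground truth spikes, can exceed 100% when a sorted unit contains many spikes from other neurons.

```python
def sorting_metrics(n_true_positives, n_ground_truth, n_sorted, n_detection_errors):
    """Return sensitivity, precision, and error rate in percent (see Table 2)."""
    sensitivity = 100.0 * n_true_positives / n_ground_truth
    precision = 100.0 * n_true_positives / n_sorted
    error_rate = 100.0 * n_detection_errors / n_ground_truth
    return sensitivity, precision, error_rate

# Hypothetical low-rate unit: all 60 inserted spikes are detected, but the
# sorted unit also contains 200 spikes that belong to other neurons.
sensitivity, precision, error_rate = sorting_metrics(
    n_true_positives=60, n_ground_truth=60, n_sorted=260, n_detection_errors=200)
```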


Sorting performance as a function of spike amplitudes.

The templates of the artificial units used in this section were selected randomly from the original recordings. The mean spiking rate was the same for each artificial unit with a value of λ_i = 5 Hz. This means that all artificial units in this evaluation had roughly the same number of spikes. All spike amplitudes shown here are given as multiples of the noise standard deviation, σ_n.

Figure 4, A–D, shows the relationship between the spike amplitude and sensitivity, precision, and error rate. Table 3 lists the mean detection metrics of all found units. There is a steep drop in sorting performance for units with amplitudes around the detection threshold of 4.2. This is because half of these units were completely lost and because the found units within this range were sorted with a mean error rate of 115%. Units with amplitudes between 4.2 and 10.0 were generally sorted well, with ~10% of them being lost. The found units, however, were sorted with >90% sensitivity and precision. Except for a few outliers, all units with amplitudes >10.0 were found and sorted with a median sensitivity and precision of 99.0% and 100.0%, respectively.

Fig. 4. — Evaluation of the sorting performance with respect to spike amplitudes in units of noise standard deviation (σ_n) (*A–D*) and spike rates (Hz) (*E–H*). A and E: sensitivity: no. of true positives divided by no. of inserted spikes. B and F: precision: no. of true positives divided by no. of detected spikes. C and G: error rate: no. of detection errors divided by no. of inserted spikes. D and H: histogram of found and lost units. *A–D*: *left* dashed vertical line indicates the spike detection threshold (4.2σ_n); *right* dashed vertical line indicates 10σ_n.

Table 3.

Mean and median values to assess sorting performance of found units

	Amplitude, σ_n
	0–4.2	4.2–10.0	>10.0
Sensitivity, %
Mean	62.6	92.4	97.7
Median	72.4	94.8	99.0
Precision, %
Mean	52.4	92.3	98.6
Median	49.7	98.4	100.0
Error rate, %
Mean	115.0	16.8	3.56
Median	101	8.42	1.01
Found units	7	111	99
All units	15	124	101


σ_n, Noise standard deviation.

Sorting performance as a function of spike rate.

To investigate how the number of spikes per neuron that were available for clustering affected the detection performance, we used the second surrogate data set in which we varied the spike rates of the artificial units. The templates were created in the same way as described above; however, the original units were not selected randomly. Instead, we looked at the amplitude distribution of the original units and selected the 10 units per high-density block that produced signals closest to the mean amplitude in each recording (≈10σ_n, data not shown). This way, we ended up with 20 artificial units per recording that had similar spike amplitudes.

The spike rates λ_i that we used as input to the Poisson spike-time generator were evenly distributed (on a logarithmic scale) in the range of 0.05–50 Hz. This produced spike counts in the range of 40–60,000.

Analogously to the previous section, Fig. 4, E–H, shows spike rate vs. sensitivity, precision, and error rate, and Table 4 lists the mean detection metrics of all found units. We saw a drop in the number of found units at spike rates <0.2 Hz, with a few outliers above. The found units with spike rates <0.1 Hz (~120 spikes) had a large mean error rate (3,340%). These errors were all due to a lack of precision, as the sensitivity was 100%, i.e., there were no false negatives but only falsely classified spikes from other neurons. In general, the sensitivity was ∼99% for all found units irrespective of their spike rate. The mean precision also approached 99% for higher spike counts. The error rate was <2% for units with spike rates >1 Hz (corresponding to spike counts of 1,500 and more).

Table 4.

Mean and median values to characterize sorting performance of found units

	Spike Rate, Hz
	<0.1	0.1–1	1–10	>10
Sensitivity, %
Mean	99.9	98.9	99.7	99.0
Median	100.0	100.0	99.9	99.9
Precision, %
Mean	64.5	95.3	98.5	99.1
Median	100.0	100.0	100.0	100.0
Error rate, %
Mean	3340	21.1	0.846	1.90
Median	0	0.131	0.0877	0.231
Found units	12	68	69	56
All units	28	80	72	60


Comparison to other spike sorters.

We compared the performance of our algorithm to other spike sorters using a publicly available spike sorting benchmark data set. We used the algorithm with parameters identical to those in the previous section. The benchmark data set is available for download at http://phy.cortexlab.net/data/sortingComparison (Steinmetz 2016). On this data set five other spike sorters have been evaluated and compared (phy: Rossant et al. 2016; spykingCircus: Yger et al. 2018; globalSuper: Shabnam 2016; kiloSort: Pachitariu et al. 2016; JRClust: Jun et al. 2017). The data set consisted of recordings of cortical neurons with 118 electrodes, arranged in two columns, and the duration of the recordings ranged between 46 and 83 min. The surrogate ground truth spikes were created by adding denoised waveforms into recorded data. Thus the final data set contained, besides the artificially inserted neurons, recorded spikes of an unknown number of real neurons (Pachitariu et al. 2016).

To ensure that the performance metrics were comparable to those used to evaluate the other sorters, we matched and compared our sorted units to the surrogate ground truth by using the code that was provided together with the data sets. This code was used to compute a score for each pair of sorted units and artificial units and matched the pairs with the highest scores. The score was equivalent to sensitivity plus precision minus 1. The code further helped to assess whether the obtained score could be improved by merging units, but we only report here the initial scores before the merging.
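With sensitivity and precision expressed as fractions, the score can be written explicitly. The symbols $N_{TP}$, $N_{inserted}$, and $N_{detected}$ are notation introduced here for the numbers of true positives, inserted spikes, and detected spikes of a matched pair:

$$\text{score} = \frac{N_{TP}}{N_{inserted}} + \frac{N_{TP}}{N_{detected}} - 1$$

A perfectly sorted unit therefore reaches a score of 1, whereas, for example, a sensitivity of 0.9 combined with a precision of 0.8 gives a score of 0.7.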

Figure 5 compares the results of all six sorters per given data set. Our sorter showed consistently high median scores for all data sets, either matching or exceeding those of the top three sorters. The median precision was nearly 100% in all data sets, whereas the results for sensitivity were more mixed. We observed a loss of sensitivity in some units of data set 6 that we attribute to overclustering as a consequence of multimodal amplitude distributions; other sorters seemed to have the same problem.

Fig. 5. — Performance comparison with other spike sorters using 6 data sets of increasing difficulty. Data sets were sorted separately with our algorithm (hdsort) using default parameters. Score, sensitivity, and precision were computed with software that was published together with the data sets at http://phy.cortexlab.net/data/sortingComparison (Steinmetz 2016). A: score combining sensitivity and precision (sensitivity + precision – 1). B: sensitivity: no. of true positives divided by no. of inserted spikes. C: precision: no. of true positives divided by no. of detected spikes.

Runtime estimation.

To estimate the runtime of our algorithm in a realistic scenario, we measured the time required to process each LEG (parallel processes) and for the final duplicate resolution step. We excluded the time that was necessary to load the data sets, filter them, and detect the spikes, as these time spans are highly dependent on the performance of the file system and the file format in which the recordings were saved. We compare the runtimes of a 20-min and a 63-min recording in Table 5. The results showed that the runtimes of individual LEGs could vary by a factor of more than 20, which was mainly due to the fact that the number of spikes per LEG could differ significantly. The theoretical total runtime upon using one CPU per LEG in a parallel approach was limited by the slowest of the parallel processes. It amounted to 10.5 min for the 20-min recording and 23.3 min for the 63-min recording. These results further showed that the runtime did not scale linearly with the recording duration but was comparatively shorter for longer recordings. This was due to the fact that the slowest process was the mean-shift clustering and that we defined a maximum of 50,000 spikes for this step within a single LEG (for more on this see Runtime Estimation in the appendix).
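The theoretical parallel runtime follows directly from the values in Table 5: it is the runtime of the slowest LEG plus the duplicate resolution step. The short sketch below reproduces this arithmetic and relates it to the reported runtime without parallelization; the function and dictionary names are hypothetical.

```python
def parallel_runtime_min(slowest_leg_min, duplicate_resolution_min):
    """Theoretical runtime with one CPU per LEG: slowest LEG plus duplicate resolution."""
    return slowest_leg_min + duplicate_resolution_min

# Values taken from Table 5 (all in minutes), keyed by recording duration
table5 = {
    20: {"max_leg": 9.7, "duplicates": 0.8, "serial": 353.0},
    63: {"max_leg": 20.9, "duplicates": 2.4, "serial": 1269.0},
}

for duration, row in table5.items():
    parallel = parallel_runtime_min(row["max_leg"], row["duplicates"])
    speedup = row["serial"] / parallel
    print(f"{duration}-min recording: {parallel:.1f} min in parallel, "
          f"{speedup:.1f}x shorter than without parallelization")
```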

Table 5.

Runtime estimation for two data sets with different recording durations

	Recording duration, min
	20	63
No. of LEGs	92	166
No. of electrodes	578	890
Parallel processes, min
Mean	3.8	7.6
Median	3.6	5.8
Min.	0.4	0.8
Max.	9.7	20.9
Duplicate resolution, min	0.8	2.4
Total runtime without parallelization, min	353	1,269
Theoretical runtime with 1 CPU per parallel process, min	10.5	23.3


LEG, local electrode group.