Abstract
Peristimulus time histograms are a widespread form of visualizing neuronal responses. Kernel convolution methods transform these histograms into a smooth, continuous probability density function. This provides an improved estimate of a neuron's actual response envelope. We here develop a classifier, called the h-coefficient, to determine whether time-locked fluctuations in the firing rate of a neuron should be classified as a response or as random noise. Unlike previous approaches, the h-coefficient takes advantage of the more precise response envelope estimation provided by the kernel convolution method. The h-coefficient quantizes the smoothed response envelope and calculates the probability of a response of a given shape to occur by chance. We tested the efficacy of the h-coefficient in a large data set of Monte Carlo simulated smoothed peristimulus time histograms with varying response amplitudes, response durations, trial numbers, and baseline firing rates. Across all these conditions, the h-coefficient significantly outperformed more classical classifiers, with a mean false alarm rate of 0.004 and a mean hit rate of 0.494. We also tested the h-coefficient's performance in a set of neuronal responses recorded in humans. The algorithm behind the h-coefficient provides various opportunities for further adaptation and the flexibility to target specific parameters in a given data set. Our findings confirm that the h-coefficient can provide a conservative and powerful tool for the analysis of peristimulus time histograms with great potential for future development.
Keywords: peristimulus time histogram, kernel convolution, signal detection, action potentials, spike trains, firing rate, human data
neurophysiological publications regularly report the modulation of a cell's firing rate in response to a specific stimulus (Abeles 1982; Adrian 1928). These findings are often presented in the form of a peristimulus time histogram (PSTH; Gerstein and Kiang 1960). The PSTH provides an intuitive format for data visualization and analysis by plotting the cross-correlation between stimulus presentation times and neuronal spike times (Moore et al. 1966; Perkel et al. 1967). Surprisingly, since its introduction in the 1960s (Gerstein 1960), the use of the PSTH in the neurophysiological literature has, for the most part, remained largely unchanged. There are, however, two fundamental problems inherent to the PSTH: 1) choosing the right bin width or bandwidth to generate the PSTH and 2) deciding whether an observed modulation in firing rate is sufficiently different from baseline to warrant classifying it as a response. An elegant solution to the first challenge was recently introduced by Shimazaki and Shinomoto (2010). The aim of the present study was to develop a solution to the second challenge by taking advantage of the merits provided by Shimazaki and Shinomoto's solution.
In a PSTH, the use of an actual histogram limits statistical application compared with a smooth, twice continuously differentiable function (sometimes Bayesian adaptive regression splines are used to smooth the histogram; Kass et al. 2005). A histogram is also limited in its resolution of the underlying data because it weighs spikes occurring in the center of a bin equally to those occurring close to a bin's edge. Because of these limitations, instead of calculating an actual histogram, spiking sequences are often convolved with a smoothing kernel to generate a spike density function of the underlying firing rate (Parzen 1962). In the interest of continuity, we will still refer to such a kernel-smoothed spike density function as a PSTH (with the differentiation of nPSTH for a histogram-based, nonsmoothed PSTH and sPSTH for a kernel-smoothed PSTH where needed).
The goodness of fit of a PSTH is determined by the bin width in the case of a nPSTH or the smoothing kernel width in the case of a sPSTH. If this parameter is chosen too small (i.e., a high bandwidth), the estimated firing rate becomes jagged and uneven; if it is chosen too big (i.e., a low bandwidth), then the estimate is averaged out to the point where any real fluctuations of the underlying data may be lost (e.g., Fig. 9A in Levick and Zacks 1970). Various approaches have been suggested to calculate the optimal nPSTH bin width (Shimazaki and Shinomoto 2007) or sPSTH kernel width (for a comprehensive review see Jones et al. 1996). However, these optimized parameters remain constant across low firing rate periods where a lower bandwidth would be optimal and high firing rate periods where a higher bandwidth would be optimal. A more powerful approach is to vary the kernel width in time according to the local spike density (Abramson 1982; Devroye and Lugosi 2000; Loader 1999; Loftsgaarden and Quesenberry 1965). The advantage of Shimazaki and Shinomoto's (2010) approach is that it provides a powerful yet easy to implement optimization algorithm for such continuous adjustment of the smoothing kernel width by minimizing a local cost function. This method produces a PSTH that is smooth during low firing rate periods but can still resolve transient increases in spiking activity optimally.
A PSTH computed with Shimazaki and Shinomoto's algorithm provides a good approximation of a neuron's underlying firing rate. However, on its own, neither the nPSTH nor the sPSTH provides a framework for statistical inference of whether a fluctuation in the data should be considered a response or not. Various solutions have been suggested to make such a distinction, with many individual researchers developing their own classifiers (an exhaustive review is beyond the scope of this introduction). Sometimes, the signal-to-noise ratio in a PSTH is improved by transforming the PSTH into a cumulative sum histogram, which plots sustained deviations from baseline as a slope (Ellaway 1978). This adjustment can assist statistical inference (Falzett et al. 1985; Ushiba et al. 2004) but is limited by its bias toward prolonged responses and its sensitivity to nonrandom effects in the spiking sequence (e.g., refractory periods; Davey et al. 1986). An alternative approach is to search the PSTH for the point at which the firing rate deviates from an expected value predicted by the distribution measured or modeled during baseline (classical approaches: Hanes et al. 1995; Maunsell and Gibson 1992; Thompson et al. 1996; Bayesian approaches: Bélisle et al. 1998; Ritov et al. 2002). In theory, order statistics, which measure the actual order of interspike intervals within a response, could also be used to detect such a “change point” in a PSTH. To the authors' knowledge, however, order statistical methods have so far only been applied to the analysis of individual trials within a PSTH (Richmond 2009). In recent years, bootstrapping has become an increasingly popular tool for hypothesis testing in general (Efron 1982). Although this method has been applied to the analysis of PSTHs early on, its adoption in the field is not as widespread as one might hope and further developments in this area are needed (Hung et al. 2002; Richmond et al. 1987).
Advanced analysis methods, such as those mentioned above, find only limited adoption in the neurophysiological literature. More often, classifiers are based on a simple comparison of the mean or median firing rate during a predefined response period to that during a baseline period or on the comparison of individual bins of the PSTH either to baseline or to each other. Examples include the comparison of the response period to baseline through a z score (Ison et al. 2011), a binomial test (e.g., Dörrscheidt 1981), a Mann-Whitney U-test (e.g., Mormann et al. 2008), an ANOVA with post hoc analysis (e.g., Oram and Perrett 1992), or a t-test (most widespread approach, e.g., Mukamel et al. 2010). The prevalence of these more classical approaches may be due to their ease of use and intuitiveness. However, the ability to register and classify responses based on more subtle discriminators is becoming increasingly more relevant in neurophysiology (Hung et al. 2002).
In most of the approaches mentioned above the only parameter that is used for classification is the response amplitude. Sometimes studies include additional parameters, such as response onset time or duration (e.g., Richmond et al. 1987), but we are not aware of any studies that systematically quantify and classify the actual shape of the PSTH. Shimazaki and Shinomoto's algorithm provides a good description of the response envelope (i.e., the shape of the response), which can be utilized for such a classifier. In the present study, we developed an algorithm that quantifies the response envelope provided by Shimazaki and Shinomoto's method and classifies a neuronal response on the basis of this quantification. This algorithm was designed to be nonparametric and easy to use and to provide a classification value that is as informative and comparative as the P value provided by most alternative classifiers. We refer to this parameter as the h-coefficient. The h-coefficient provides a measure of how similar or different a test response is to a randomized smoothed PSTH drawn from the same data. It provides information about the proportion of the response that is higher or wider than a response produced by randomly shuffling the spiking sequence. Focusing on experimental neurophysiology, we developed the h-coefficient in an application-oriented and empirical (rather than purely theoretical) way. We compared the h-coefficient's performance to that of classifiers based on a t-test and on a z score. These classifiers were chosen for comparison, as they are exemplary of the most widespread classifiers used today. The h-coefficient's performance was tested on a large set of Monte Carlo simulated PSTHs. Because of the stochastic nature of the simulated data, the mean and the median response amplitudes can be assumed to be similar; therefore these classifiers are also informative relative to median-based classifiers (e.g., Ison et al. 2011).
Recording single-unit responses in human subjects is one of many research areas in which the h-coefficient may improve sensitivity and yield (Fried et al. 2014). Human single-cell recordings are typically limited to only small groups of subjects, low trial numbers per recording session, and low mean neuronal firing rates (frequently neurons have firing rates < 3 Hz; Buzsaki and Mizuseki 2014). Time-dependent fluctuations in the firing rate further contribute to high noise levels and a small signal-to-noise ratio in this type of data. Under such circumstances, increasing yield can substantially improve experimental outcome. We therefore also investigated the h-coefficient's performance in a data set recorded in humans.
METHODS
To evaluate the efficacy of the h-coefficient, we employed a two-step approach. In the first step, we analyzed the h-coefficient's ability to report responses in a large set of artificially generated PSTHs. In this data set the ground truth, of whether a response was present or not, was always known and the h-coefficient's performance could therefore be quantified by measuring its hit rate (H, the proportion of trials in which a response was present and the h-coefficient detected it) and false alarm rate (FA, the proportion of trials in which the h-coefficient falsely reported the presence of a response). This first step provided a reliable measure of the h-coefficient's efficacy in modeled data. In the second step, we calculated the FA of the h-coefficient in a data set of PSTHs that were generated by randomly shuffling real data recorded in humans. This second step evaluated whether the h-coefficient demonstrates a similarly low FA in real data compared with modeled data (because of the shuffling the ground truth of this data set was also known, i.e., no real responses were present). For comparison, both steps were also repeated for two alternative response classifiers, which find widespread application in PSTH analysis: a t-test and a z score-based measure. Finally, in a “real-world” application the three classifiers were used to analyze neuronal spiking data recorded during a behavioral paradigm in humans (this final comparison was a qualitative measure only, as here the ground truth was unknown).
Calculating the h-coefficient.
We first defined the h-coefficient as an algorithm in eight steps and then generated a large data set of Monte Carlo simulated PSTH responses as our ground truth.1 We used this data set to test the efficacy of the h-coefficient to discriminate real responses from random fluctuations and to compare H and FA of the h-coefficient to two other classifiers: a classifier that measured the number of standard deviations that the mean firing rate during the response period was above the mean firing rate during the baseline period (SD) and a paired t-test between the firing rate during the response period and the baseline period (TT; see below for precise definitions of the baseline and the response periods). For each simulated PSTH we calculated the h-coefficient and then compared the h-coefficient's H and FA to that measured for SD and TT.
For any given test PSTH the h-coefficient is calculated according to Algorithm 1 (Fig. 1A). Algorithm 1 assumes the availability of data that were recorded or modeled in the format of a typical neurophysiological paradigm. In this format we referred to raw data as the continuous spiking data recorded or modeled over the duration of a complete experiment. Interspersed within the raw data are a set number of time points at which a stimulus was presented. The individual trials of the test PSTH are aligned to these time points (t = 0). The goal of Algorithm 1 is to determine the probability of whether fluctuations in the response period of the test PSTH (e.g., 200–1,000 ms) are correlated to stimulus presentation or whether they occurred randomly. Algorithm 1 is dependent upon optimal smoothing of the PSTH for precise response shape estimation.
Fig. 1.

Calculating the h-coefficient. A: the h-coefficient measures the probability with which fluctuations of the firing rate (y-axis) may occur by chance within the response period of a test peristimulus time histogram (PSTH) (e.g., 200–800 ms on x-axis, dashed lines). It does so by comparing the shape of the response in the test PSTH to those in a large set of shuffled PSTHs in which the individual trials were drawn randomly from the raw data. Top: in each shuffled PSTH the area underneath the response envelope (black curve) and within the response period is quantized by dividing it into horizontal stripes of amplitude s (gray lines). The surface area below the response curve and within each individual stripe (e.g., gray area in shuffle n) is measured and stored in a column vector (gray arrows). All n column vectors are then concatenated into 1 matrix mij (black arrows). Bottom: a function f1 is then applied across the rows of m to calculate M. Here we use the maximum function as an example of f1: Mi = max(mi). The test PSTH is striped and quantized in the same way as the shuffled PSTHs, and M is then compared to the test response quantization vector r through logical function f2. Here we use ri > Mi as an example for f2. The true or false values of f2 are also recorded in a column vector (bottom right). The h-coefficient is calculated as , where a is the number of true elements in the logical vector for all corresponding nonzero elements in M, b is the number of true elements in the logical vector for all corresponding zero elements in M, and c is the number of nonzero elements in M (see methods for details). B–D: examples of how the h-coefficient compares the shape of a test PSTH to that of M. In all 3 panels the ranges of a, b, and c are shown with brackets (that of c in red). The surface area measurements in M are plotted as shaded red areas within the respective stripes. In B, M is wider throughout but the test PSTH is higher. In C, both M and the test PSTH are the same height but the test PSTH is wider throughout. In D, the test PSTH is only in some places wider than M but it is at the same time also higher than the test PSTH. In all 3 examples h = 1.
The h-coefficient is calculated by first generating a large set of n randomly shuffled sPSTHs from the raw data (Algorithm 1.1). Each shuffled PSTH is constructed by cutting nt segments from the raw data, where nt is the number of trials in the test PSTH. Each segment is of the same duration as the test PSTH, but its temporal position in the raw data is random (i.e., in the shuffled PSTH t = 0 is not aligned to any event). The shuffled PSTHs are therefore comparable to the test PSTH (e.g., mean firing rate, fluctuations, and history effects in the firing rate, interspike interval distribution, etc.), but any changes of the firing rate measured within these shuffled PSTHs are generated randomly. The firing rate in the shuffled PSTHs is normalized according to x = x̃/ν, where x̃ is the firing rate calculated by the optimized kernel-smoothed probability density function over the spike sequence and ν is the mean firing rate across the raw data (Algorithm 1.2). For each shuffled PSTH, the surface area below the response envelope and within the predefined response period is then quantized and stored in a column vector (Algorithm 1.3). For the quantization, only the continuous area that lies below the response peak and above 1 is included (i.e., above the mean firing rate ν). This area is divided into horizontal stripes of height 0.1 (i.e., ν/10), and the area of each stripe is measured individually. A vector of these area measurements contains information about the response amplitude and shape. The number of nonzero elements in this vector is directly proportional to the response amplitude (in units of ν/10), and the value of each element is a measurement of the mean duration of the response (the area of the stripe) at the amplitude at which the respective stripe was measured. The individual column vectors containing the stripe measurements of the shuffled responses are then concatenated into a matrix m (Algorithm 1.4). Each column j in m contains all the stripe area measurements from the jth shuffled PSTH. Each row i in m contains the mean response duration measurements across all n shuffled responses at the ith stripe amplitude above baseline:
| (1) |
g(λj(t),ν,i) is a functional of the jth probability density function λj(t) normalized to ν and limited to the ith stripe. The ith stripe is the area between the response amplitudes amp2i(ν)–amp1i(ν). After normalization to the baseline firing rate ν, the limits of the ith stripe are defined by amp1i = 1 + 0.1(i − 1) and amp2i = 1 + 0.1i, within the response period t2–t1. To aid intuition, the first dimension of m is inverted m(end + 1 − i,j) so that the orientation of the quantized responses in m remains the same as that of the PSTHs. As a result of the inversion, the last (bottom) row of m contains the area measurements of the first stripe above baseline and the area measurements from stripes of increasing amplitude are added on top (see Fig. 1A). Analogous to its use in the MATLAB programming environment, the notation end marks the last element of i and is chosen sufficiently large to contain all response amplitudes generated during the shuffling process.
A function f1 is then applied across each row of m, and the results of f1 are saved in a new column vector M (Algorithm 1.5). Here we focus on the maximum function Mi = max(mi) as an example for f1. In this example the number of nonzero elements in M is proportional to the maximum response amplitude that was measured across all n shuffled responses (in units of ν/10). At the same time, each value within M corresponds to the maximum response duration that was measured across all n shuffled responses at the amplitude of the respective stripe. M represents the highest and, at every amplitude, longest possible response that was generated randomly from the raw data within n shuffles.
The test PSTH response is smoothed, normalized, and quantized in the same way as the shuffled PSTHs, and the measured stripe area values of the test PSTH are stored in vector r. The values in r are then compared to the shuffled PSTH values in M through logical function f2 (Algorithm 1.6). Here we focus on “is larger than” as an example for logical function f2. To calculate the h-coefficient, Algorithm 1 counts three sets of elements: a: for all nonzero elements in M the number of elements in r that are larger than their equivalent in M; b: for all zero elements in M the number of elements in r that are larger than their equivalent in M; and c: the number of nonzero elements in M (Algorithm 1.7). In the final step the h-coefficient is calculated according to Eq. 2 (Algorithm 1.8):
| (2) |
There are two ways of reporting the h-coefficient. Reporting only the resulting scalar value h summarizes to what proportion the test PSTH response was either larger (higher and/or wider) or smaller (shorter and/or narrower) than the shuffled responses. Alternatively, the h-coefficient can be reported in the format of the full Eq. 2, providing more detailed information about the dimensions of the underlying response. The three values in Eq. 2 contain the following information (all in units of ν/10): a: Up to what amplitude of c was the test response wider than c? b: How much higher than c was the test response? c: How high was the shuffled response? In this format the h-coefficient provides information about the proportion to which the test response was wider than the shuffled responses (a); the normalized test response amplitude, if it was higher than the shuffled responses (b + c if b > 0); and the normalized maximum amplitude of the shuffled responses (c). c can also provide a qualitative parameter for nonrandom properties of the baseline firing rate. Neurons with nonrandom firing behavior are more likely to generate large c values than random-firing neurons because of coincidence and resonance across trials (e.g., bursting or phase-locked firing behavior). From these parameters we can gain an understanding of what the range and meaning of the h-coefficient is:
To understand how the actual values of the h-coefficient relate to probability it can be helpful to look to a simpler, one-dimensional case of the shuffling procedure. For example, to know whether a response in a PSTH arose by chance, one can generate 100 randomly shuffled PSTHs and then simply measure their response amplitudes only (e.g., the mean firing rate during the response period). In a second step, a P value can then be calculated by counting the number of shuffled responses with higher response amplitudes than the test response. If no responses were higher than the test response, then that would mean that <1% of responses were higher and so there is a <1% chance of the test response being a false alarm, i.e., P < 0.01. The reliability of this P value is limited by the number of shuffles and is improved by increasing the number of shuffles used to generate it.
The same is true if h > 1 is chosen as the threshold to determine whether a response was real or random. A response with h > 1 has per definition a larger response amplitude than any of the shuffled responses, independent of the number of shuffles (Fig. 1, B–D). In analogy to the example provided above, we can therefore say that it has a P value of at least P < 0.01 as long as at least 100 shuffles were used to generate it (as before, the number of shuffles will influence the reliability and conservativeness of this P value and should ideally be in the range of ∼1,000 or more). In some applications it may be interesting to reduce the threshold to a value h ≤ 1. In this situation it would be important to either measure the ground truth h-coefficient distribution within the data at hand (e.g., during baseline) or simply use the full equation of the h-coefficient and set individual thresholds for a and b independently. The latter option additionally allows for a more differentiated analysis in these cases, providing a probability measure for both the height and the duration of the test response.
Algorithm 1: calculating the h-coefficient for a test sPSTH.
The h-coefficient for a test sPSTH is calculated as follows.
1. Generate a large set of sPSTHs (Shimazaki and Shinomoto 2010) from randomly shuffled segments taken from the raw data and of the same duration and number of trials as the test sPSTH.
2. Normalize the firing rate of each shuffled sPSTH to the mean firing rate ν across the raw data.
3. For each shuffled sPSTH:
a. Find the maximum of the response envelope within a predefined response period and follow the response envelope downward toward the left and the right until it either reaches the limit of the response period or is equal to 1 (i.e., intersects ν; note that the sPSTH does not necessarily need to be monotonically decreasing).
b. Within the response period, vertically section the continuous area below the curve traced in step 3a and above the mean firing rate ν into horizontal stripes of a predefined height s.
c. Calculate the area of each stripe and store these values in a long column vector with zeros for all other values.
4. Concatenate all column vectors generated under step 3 into a matrix m.
5. Apply a function f1 across each row of m and store the resulting values in a new column vector M with zeros for all other values.
6. Apply steps 2 and 3 to the test sPSTH and compare the resulting test column vector r to M through logical function f2.
7. Count:
a. For all nonzero elements in M the number of equivalent elements in r for which f2 is true.
b. For all zero elements in M the number of elements in r for which f2 is true.
c. The number of nonzero elements in M.
8. Calculate the h-coefficient from the counts in step 7 according to .
Applying a continually optimized smoothing algorithm to a large set of PSTHs (Algorithm 1.1) can require substantial computational resources. Ideally ∼1,000 shuffled PSTHs should be generated to calculate reliable h-coefficients. However, fewer shuffles can be sufficient if a conservative function f1 is used (Algorithm 1.5). To speed up computation even further, fewer but longer smoothed PSTHs can be generated and then cut into separate PSTHs of the right length in a separate step. Generating fewer PSTHs and then cutting these into shorter segments is appropriate for time-independent data (e.g., homogeneous Poisson process) but should be used with caution when the signal is time-dependent.
The quantized response area is limited to only the continuous area between the maximum and ν (Algorithm 1.3a). This limitation was introduced to only measure the larger peak in response envelopes that show two peaks but returns to ν in between them. Algorithm 1 cannot distinguish between single-peaked and double-peaked responses beyond this simple selection criterion. Additional conditions can, however, easily be added to step 3a to classify such responses more specifically, if needed.
The height s of the stripes in Algorithm 1.3c has no influence on the sensitivity of the h-coefficient but will change its resolution and its unit and should therefore always be reported. For example, if Mi = 0.84 and ri = 0.86, then Algorithm 1 would still recognize ri > Mi, even if s = 0.1ν. However, h, a, b, and c in Eq. 2 would only be reported in units of 0.1ν. For many neurophysiological applications a value of s = 0.1ν may suffice (i.e., a resolution down to 1/10th of the mean firing rate of the respective neuron).
Monte Carlo simulated PSTHs.
As a ground truth we generated a data set of 14,700 Monte Carlo simulated sPSTHs to test the efficacy of the h-coefficient. Half of the simulated PSTHs contained responses; the other half served as controls. Each PSTH was 10 s long (−5,000 to 5,000 ms sampled at 1 kHz). We used a nonhomogeneous Poisson process to generate the simulated neuronal responses. Spikes recorded from an individual neuron are correlated within the individual trials in a PSTH due to nonrandom processes such as refractory periods or spike-field coherence. Nonetheless, superimposed spikes of multiple trials across a PSTH can be approximated as a single inhomogeneous Poisson point process (Shimazaki and Shinomoto 2010).
The expected firing rate λ(t) for each PSTH was generated by first defining a ground truth consisting of a predefined constant baseline firing rate ν. In response PSTHs λ(t) was then convolved with a Gaussian response of defined amplitude, standard deviation, and mean. This step was omitted in control PSTHs. In one block of 2,100 PSTHs (block A, see below) we then convolved λ(t) for each trial with a sinusoidal function of random frequency, amplitude, and phase limited to a frequency range of 0–1 Hz and an amplitude range of 0–1 times the mean firing rate ν. This sinusoidal convolution was added to simulate low-frequency baseline modulation and ongoing alternate processing typical for in vivo recordings (e.g., Mainen and Sejnowski 1995). The resulting noise resonance across trials also added further complexity to the data, providing a more conservative estimate of a classifier's H and FA. Finally, for all PSTHs we added a 3-ms absolute refractory period to our Poisson spike model.
For each PSTH we defined a baseline period and a response period. The baseline period was set at −1,000 to −200 ms and the response period at 200 to 1,000 ms. In all subsequent t-test- and standard deviation-based analysis we always compared the response period to the baseline period. We also defined a baseline range in addition to the baseline period. The baseline range was set at −5,000 to −200 ms and was used only for the generation of the PSTHs. Each Monte Carlo simulation was run until the mean firing rate across the baseline range ν′ = was equal to the target mean firing rate across the whole virtual experiment ν. The differentiation between this baseline range and the baseline period allowed us to generate PSTHs with average baseline period firing rates that were similar but not necessarily equal to the mean firing rate across the whole experiment (ν, see discussion).
The complete modeled data set.
The total number of 14,700 PSTHs was made up of seven blocks (blocks A–G) of 2,100 PSTHs each with baseline firing rates and response durations varying across blocks. Block A's baseline firing rate was ν = 3 Hz, and the standard deviation of the Gaussian response envelope was σ = 100 ms. Block A was the only block that had sinusoidal noise added to its PSTHs (see above). Blocks B and C were also held at ν = 3 Hz but without the sinusoidal noise and with σ = 25 ms (block B) and σ = 100 ms (block C). Blocks D and E had a baseline firing rate of ν = 30 Hz with σ = 25 ms (block D) and σ = 100 ms (block E), and finally, blocks F and G had a baseline firing rate of ν = 90 Hz with σ = 25 ms (block F) and σ = 100 ms (block G).
Each block consisted of three sets of 700 PSTHs with 8, 12, or 24 trials per PSTH, respectively. Each set was made up of seven subsets of varying response amplitudes. The tested response amplitudes were 0.5, 0.75, 1, 1.25, 1.5, 2, and 2.5 times ν. As mentioned, half of the PSTHs in every subset were controls without any responses in them. For each subset condition (number of trials × response amplitude) we therefore compared 50 response PSTHs to 50 control PSTHs. The response mean was located at 450 ms across all conditions, approximating human medial temporal lobe response times (Mormann et al. 2008).
The individual trials of each PSTH were distributed equally within a simulated neurophysiological experiment that was defined to last 30 min. The modeled experiment had interstimulus intervals of 257.1, 163.6, and 78.3 s, respectively, for the 8-, 12-, and 24-trial sets. The interstimulus intervals were filled with response-free spike sequences with the same parameters as the PSTH trials. Either end of the experiment was also padded with baseline firing of the same duration as the interstimulus intervals. Randomly shuffled segments from the complete 30-min-long raw data (plus padding) were then used to calculate the h-coefficient according to Algorithm 1.
Biological data.
In a final analysis, we compared the performance of the h-coefficient to that of SD and TT in a data set recorded in awake and behaving human subjects. All experiments were carried out at the UCLA Medical Center. The study protocol was approved by the center's ethical review board (IRB no. 10-000973), and all subjects provided written informed consent to participate in the study. The detailed methods for data acquisition have been described elsewhere (Fried et al. 1997). Briefly, subjects suffering from intractable epilepsy were implanted with chronic depth electrodes to localize seizure foci for possible surgical resection. The recording sites included in this study were the medial prefrontal cortex, the anterior cingulate cortex, and the amygdala. All localization targets were based exclusively on clinical criteria. Through the lumen of the clinical electrodes Pt/Ir microwires were inserted into the tissue to record action potentials. The differential signal from these microwires was amplified and sampled at 30 kHz (BlackRock, Salt Lake City, UT) and then clustered with wave_clus (Quian-Quiroga et al. 2004). Each recording session lasted ∼30 min.
All electrophysiological data were recorded during the execution of a two-alternative forced choice task. In this task subjects had to choose one of two playing cards displayed on a computer screen by pushing a button. Their choice was followed by a 1-s interval before the outcome was revealed, which could either be winning $10 or $100 or losing the same amounts in virtual cash. The outcomes were displayed by turning the virtual card over and revealing the words “win $10” or equivalent. The time points of revealing a winning outcome were aligned to generate a large set of behaviorally relevant PSTHs.
In contrast to the modeled data set, the ground truth of the biological data was not known. As is typical for such neurophysiological experiments, the true number of responsive neurons was, accordingly, unknown and no H or FA could be calculated. We therefore also generated a second data set based on these in vivo recordings consisting of randomly shuffled PSTHs. The time points of the individual trials in each of these shuffled PSTHs were chosen randomly from within the spiking data of one individual neuron. Per recorded neuron we generated 10 shuffled PSTHs, resulting in a large number of shuffled PSTHs with the same distribution of spiking parameters as the underlying behavioral PSTH data set. Any response registered by a classifier in this shuffled data set was known to be the result of the random shuffling. In this second biological data set we were therefore able to calculate and compare the FA of the three classifiers.
The h-coefficient analyzes the response period independent of the baseline period. SD and TT are, however, equally sensitive to fluctuations in both the baseline period and the response period. To be able to more closely compare the performance of the h-coefficient to SD and TT we therefore corrected the baseline period firing rates of the biological PSTHs so that their mean was equal to the mean firing rate across the whole recording session (see discussion). This baseline period correction also allowed us to more closely compare the biological data to the modeled data. In the modeled data, the mean firing rate during the baseline period was per definition similar to the mean firing rate across the raw data ν and no such correction was necessary (see above).
Statistical analysis.
For each Monte Carlo simulated PSTH we recorded the output of the three classifiers included in our comparison (SD, TT, and h-coefficient). We compared these three values to each other through a receiver operator characteristic (ROC) and a bias criterion (C; Fawcett 2006) as well as through a direct comparison of both the H and the FA of each parameter at various thresholds. All comparisons across multiple conditions were done with repeated-measures ANOVA, sphericity was tested with Mauchly's sphericity test, P values were corrected with Greenhouse-Geisser ϵ correction where necessary (Pϵ), and effect size was measured with partial η2. Pairwise post hoc comparisons were Bonferroni corrected.
RESULTS
In our modeled data set the smoothing algorithm usually generated an almost flat baseline and, in cases where the response was weak and may have been ambiguous, the sPSTHs often remained equally flat (e.g., Fig. 2A). Only when a precisely timed or sufficiently strong response was present did the sPSTHs deviate from baseline to provide a closer representation of the ground truth (e.g., Fig. 2B). In this form, the smoothing algorithm already provided a certain degree of classification even before application of any quantitative analysis. As the mean firing rates and response amplitudes increased, the sPSTH also more consistently represented the ground truth (e.g., Fig. 2, C and D).
Fig. 2.

Examples of Monte Carlo simulated data. Each panel consists of a ground truth plot (top), a raster plot (middle), and a PSTH (bottom). Ground truth plots show the ground truth λ(t) of all trials (gray lines) as well as their mean (black line) and SE (gray shaded area, x-axis: time, y-axis: normalized response amplitude). Raster plots show the time points of individual action potentials (black dashes) for each trial (rows). PSTHs show the histograms across the raster plots (nPSTH, gray bars, bin width = 200 ms) as well as the smoothed histogram (sPSTH, black line) and its bootstrapped 95% confidence interval (shaded area around black line; x-axis: time, y-axis: normalized response amplitude). A: at low mean firing rates and amplitudes (amp) weak responses are sometimes smoothed out in the sPSTH and therefore missed by the h-coefficient classifier. B: however, the h-coefficient is sensitive to narrow but precisely timed responses. The increase in firing rate during the response period in this example is minimal and too weak to be registered by SD or TT. However, the coincidence of individual single spikes across trials was sufficient for the sPSTH to show a clear, narrow response, which was also registered by the h-coefficient (h = 2.2). C: at high firing rates and amplitudes the nPSTH and the sPSTH are similar, but the continuous trace of the sPSTH allows for a closer estimate of response parameters such as response shape and time to peak. D: the nPSTH weighs all spikes within a bin equally, resulting in a lower response peak amplitude than that measured in the sPSTH. For the same reason in the nPSTH the time to peak can only be estimated within the limits of the 200-ms bin width, while the sPSTH reproduces the response peak more closely (all response peaks were at 0.45 s).
ROC curves, bias, and raw classifier values.
Our initial focus is on the results of block A, which were qualitatively similar to those recorded in the other blocks (see below). We calculated ROC curves to compare the sensitivity of the h-coefficient, SD, and TT. The area under the curve (AUC) of an ROC is equivalent to the probability that the classifier will rank a randomly chosen PSTH containing a response higher than a randomly chosen control PSTH (Hanley and McNeil 1982). All three classifiers had ROC curves with similar AUCs, although that of TT was consistently smaller than that of SD and the h-coefficient (Fig. 3). The mean d′ values from the ROCs across all subsets were for 8 trials d′SD = 1.02, d′TT = 0.71, d′h−coefficient = 1.1; for 12 trials d′SD = 1.13, d′TT = 0.77, d′h−coefficient = 1.15; and for 24 trials d′SD = 1.33, d′TT = 1.2, d′h−coefficient = 1.54.
Fig. 3.

Mean receiver operator characteristic (ROC) and isobias curves. For each set of block A ROC and isobias curves for all 3 classifiers are shown calculated with the mean d′ and C measures taken from the raw ROC curves of those sets. The mean ROC curves (running parallel to the diagonal) were similar for all 3 classifiers although consistently lowest for TT. The isobias curves (running perpendicular to the diagonal) of the SD and TT classifiers were also similar to each other and close to zero, but the isobias curves of the h-coefficient showed a strong negative bias (skewed toward the upper right corner, see results for values of d′ and C). H, hit rate; FA, false alarm rate.
The AUCs of SD, TT, and the h-coefficient were similar, but we did observe a difference in their bias measures (C). Both the SD and TT bias remained close to zero, while the h-coefficient showed a strong negative bias. The bias values for the three classifiers were for 8 trials CSD = −0.07, CTT = −0.21, Ch-coefficient = −0.73; for 12 trials CSD = −0.03, CTT = −0.13, Ch-coefficient = −0.8; and for 24 trials CSD = −0.05, CTT = −0.13, Ch-coefficient = −0.95 (Fig. 3). The strong negative bias of the h-coefficient implies that it should achieve a high H even when a conservative threshold is used (i.e., a low FA). This property was reflected in the distribution of the actual h-coefficient values compared with those of SD and TT (Fig. 4). As expected, all three classifiers improved separation of the response and control distributions with increasing response amplitudes. However, the range of the response values was larger and their separation from the control values increased further in the h-coefficient compared with SD and TT.
Fig. 4.

Distribution of classifier values: raw values and their means and standard deviations for all 3 classifiers (SD, top; TT, middle; h-coefficient, bottom) and each set (8 trials, left; 12 trials, center; 24 trials, right) of block A. The measured values of the response trials (n = 50 per subset) and control trials (n = 50 per subset) are shown for all subsets (x-axis). With increasing response amplitude all 3 classifiers showed increased separation of the response and control values; 99% of h-coefficient values for the control trials remained h < 1 across all response amplitudes. At the same time a large number of h-coefficient response values were h > 1, suggesting that h = 1 may serve as a good discrimination threshold for the h-coefficient classifier.
We used the complete set of control values (1,050 data points, Fig. 4) to calculate probability values for the h-coefficient at various thresholds. We found that, within this data set, 95% of all data points remained below a threshold of h > 0.6333 (i.e., P < 0.05) and 99% of all data points remained below a threshold of h > 0.9889 (i.e., P < 0.01). We also calculated the equivalent P values for h-coefficient thresholds higher than 1. For a threshold of h > 1 we found a P value of P < 0.0076, for h > 1.2 the equivalent P value was P < 0.0057, and for h > 1.5 we found a P value of P < 0.0048.
False alarm and hit rates at different thresholds.
We compared H and FA across all three classifiers to test whether an h-coefficient close to h = 1 would provide a better classifier than SD or TT at equally conservative thresholds. For this comparison we plotted H and FA for all subsets with 8, 12, and 24 trials at thresholds corresponding to α-levels of α = 0.05 and α = 0.01 (Fig. 5). For SD these thresholds corresponded to mean response firing rates higher than 1.645 and 2.326 times the standard deviation over the baseline period. For TT the respective thresholds were simply the equivalent P values of P < 0.05 and P < 0.01. For the h-coefficient these α-levels corresponded to threshold values of h > 0.6336 and h > 0.9889, respectively (see above). However, we substituted the higher h-coefficient threshold with h > 1 instead of h > 0.9889 as this is a more intuitive value for this metric, as mentioned in methods (albeit a more conservative one, too, at P < 0.0076). To the h-coefficient analysis we also added two additional thresholds of h > 1.2 and h > 1.5, equivalent to α = 0.0057 and α = 0.0048. These arbitrarily chosen high thresholds were added to test whether the h-coefficient would still achieve a high H at conservative thresholds, as predicted by its strong negative bias.
Fig. 5.

H and FA at various thresholds. H (positive y-axis) and FA (negative y-axis) for all 3 classifiers were compared to each other across all response amplitudes in block A (x-axis). The respective classifier thresholds for plots in A–D were equivalent to a P value < 0.05. The thresholds for plots in E–H were equivalent to a P value < 0.01 for TT and SD. For the h-coefficient the respective thresholds in E–H were h > 1 (red), h > 1.2 (orange), and h > 1.5 (yellow; see False alarm and hit rates at different thresholds for details). On average the h-coefficient achieved higher H across all thresholds and sets (8, 12, or 24 trials, A–C and E–G) as well as across the whole block (D and H, respectively). Especially at more conservative thresholds the h-coefficient outperformed the other 2 classifiers with a H that was up to twice as high as that of TT or SD without increasing the FA (H, see Table 1 for details).
All three classifiers improved their H with increasing response amplitudes, while the FA for all three classifiers remained below 10% across all conditions (Fig. 5). TT and the h-coefficient improved their performance with increasing numbers of trials, but SD's performance was highly variable and even seemed to decrease for high numbers of trials. As a result of this behavior SD and TT remained similar over 8 trials (Fig. 5, A and E) but TT outperformed SD over 12 and 24 trials (Fig. 5, B, C, F, and G). The h-coefficient outperformed the other two classifiers throughout all conditions with one exception at response amplitude 0.5 during 12 trials at P < 0.05 (Fig. 5B). The [FA, H] pairs averaged across all amplitudes and number of trials at α = 0.05 (Fig. 5D) were [0.004, 0.111]SD, [0.033, 0.301]TT, and [0.043, 0.453]h-coefficient (Table 1). A repeated-measures ANOVA revealed a significant effect of the three classifiers on both the FA [F(2,12) = 28.3, Pε = 0 < 0.05, partial η2 = 0.83] and the H [F(2,12) = 21.9, Pε = 0 < 0.05, partial η2 = 0.79] for a threshold of α = 0.05.
Table 1.
[FA,H] pairs at low and high thresholds
| Amp | SD P < 0.05 | TT P < 0.05 | h-coeff h > 0.63 | SD P < 0.01 | TT P < 0.01 | h-coeff h > 1 |
|---|---|---|---|---|---|---|
| 0.5 | 0.01, 0.03 | 0.04, 0.09 | 0.05, 0.1 | 0.01, 0.01 | 0.01, 0.02 | 0.01, 0.05 |
| 0.75 | 0.01, 0.02 | 0.03, 0.08 | 0.07, 0.17 | 0.00, 0.01 | 0.01, 0.01 | 0.02, 0.04 |
| 1 | 0.00, 0.02 | 0.03, 0.15 | 0.05, 0.29 | 0.00, 0.01 | 0.00, 0.04 | 0.01, 0.11 |
| 1.25 | 0.00, 0.08 | 0.03, 0.32 | 0.03, 0.48 | 0.00, 0.02 | 0.01, 0.10 | 0.00, 0.33 |
| 1.5 | 0.00, 0.15 | 0.02, 0.31 | 0.04, 0.55 | 0.00, 0.03 | 0.01, 0.15 | 0.01, 0.45 |
| 2 | 0.01, 0.16 | 0.05, 0.47 | 0.03, 0.69 | 0.00, 0.07 | 0.00, 0.25 | 0.01, 0.59 |
| 2.5 | 0.00, 0.32 | 0.03, 0.69 | 0.03, 0.89 | 0.00, 0.07 | 0.01, 0.40 | 0.01, 0.83 |
| Mean (SE) | 0.00, 0.11 (0.00, 0.04) | 0.03, 0.30 (0.00,0.08) | 0.04, 0.45 (0.01, 0.11) | 0.00, 0.03 (0.00, 0.01) | 0.01, 0.14 (0.02, 0.05) | 0.01, 0.34 (0.00, 0.11) |
[FA,H] pairs at each amplitude (Amp) and for each classifier (SD, TT, and the h-coefficient) are listed; the mean and SE were calculated across each classifier.
At the higher threshold of α = 0.01 (h > 1) on average, across all numbers of trials, TT had more than double the H over SD and the h-coefficient had more than double the H again over TT, while all three classifiers maintained a similarly low FA throughout (Fig. 5H). The mean [FA, H] pairs across all amplitudes and number of trials at α = 0.01 (Fig. 5H) were [0.001, 0.031]SD, [0.007, 0.139]TT, and [0.01, 0.343]h-coefficient (Table 1). A repeated-measures ANOVA showed a significant effect of the classifiers on the FA [F(2,12) = 6, P = 0.016 < 0.05, partial η2 = 0.5] and on the H [F(2,12) = 9.08, Pε = 0.01 < 0.05, partial η2 = 0.6] for a threshold of α = 0.01. From this comparison we concluded that with increasing response amplitudes the h-coefficient achieves a better ratio of H to FA than SD or TT.
Comparing false alarm and hit rates across blocks.
We plotted the mean H and FA of all three classifiers at the threshold equivalent to α = 0.01 (h > 1) across blocks B–G to compare the performance of TT, SD, and the h-coefficient relative to the varying parameters across these blocks (Fig. 6). The mean [FA, H] pairs for the different classifiers and blocks were SD: [0, 0.01]B, [0, 0.03]C, [0, 0.01]D, [0, 0.43]E, [0, 0.03]F, [0, 0.72]G; TT: [0.01, 0.02]B, [0.01, 0.14]C, [0.01, 0.11]D, [0.01, 0.74]E, [0.01, 0.29]F, [0.01, 0.92]G; h-coefficient: [0.01, 0.04]B, [0, 0.35]C, [0, 0.34]D, [0.01, 0.88]E, [0, 0.54]F, [0, 0.97]G. A repeated-measures ANOVA showed a significant effect of the classifiers on both the FA [F(2,5) = 17.77, P = 0.001 < 0.05, partial η2 = 0.78] and the H [F(2,5) = 17.5, Pε = 0.009 < 0.05, partial η2 = 0.78] across all blocks. Pairwise comparison of the FA only showed a significant difference between the FA of SD and TT (P = 0 < 0.017) but not between SD and the h-coefficient (P = 0.175 > 0.017) and not between TT and the h-coefficient (P = 0.025 > 0.017). Pairwise comparison of the H showed a significant difference between the H of SD and TT (P = 0 < 0.0153), between SD and the h-coefficient (P = 0.006 < 0.017), and between TT and the h-coefficient (P = 0.0128 < 0.017). The mean [FA, H] pairs across blocks B–G were [0, 0.205]SD, [0.01, 0.37]TT, [0.003, 0.52]h-coefficient. Across the whole modeled data set (blocks A–G) the [FA, H] pairs were [0, 0.191]SD, [0.01, 0.337]TT, [0.004, 0.494]h-coefficient. From these results we conclude that on average across all conditions the h-coefficient significantly outperformed SD and TT.
Fig. 6.

H and FA across blocks B–G. The H and the FA for all 3 classifiers (x-axis) were compared to each other for all remaining blocks (blocks B–G) with histograms across all 3 sets (left: 8 trials, center: 12 trials, right: 24 trials for each classifier, respectively). Only the more conservative thresholds were used in this comparison (TT < 0.01, SD > 2.326, h > 1; see False alarm and hit rates at different thresholds for details). For all 3 classifiers and sets the FA was low at these conservative thresholds and the only significant difference in FA was found between SD and TT. TT had a significantly higher H than SD across all sets, and the h-coefficient had a significantly higher H than both SD and TT, respectively. TT is calculated based on the average firing rate across the response period. Therefore, narrow responses (e.g., σ = 25 ms, top) are more likely to be averaged out and missed by TT. The mean difference in H between TT and the h-coefficient was accordingly significantly larger in blocks D and F than in blocks E and G combined (see Comparing false alarm and hit rates across blocks for details).
At low trial numbers and firing rates the H of SD was highly variable (see also results for block A above). However, with increasing mean firing rates the independence of SD from the number of trials became more apparent. SD's independence of the number of trials results from the fact that SD is based on the standard deviation, a descriptive statistic (Cumming et al. 2007). In contrast, TT and the h-coefficient (both inferential statistics) showed a clear increase in the H for higher trial numbers at 30 and 90 Hz.
SD and TT are calculated based on the mean response during the whole response period. They are therefore biased toward longer-lasting responses because short responses (t ≪ response period) are more likely to be averaged out in the process. In an attempt to measure this bias of TT toward longer-lasting responses we calculated the difference between the H for TT and the h-coefficient across blocks with short responses (blocks D and F) and compared these values to those across blocks with longer-lasting responses (blocks E and G). Blocks B and C were not included in the comparison because the H for all three classifiers in block B was very low as the response duration was small relative to the average interspike interval at 3 Hz. SD was not included in this analysis because of its low H across all short response conditions (the mean H of SD across blocks D and F was 0.01). Across blocks D and F the mean H of the h-coefficient was 0.43 and that of TT was 0.20. Across blocks E and G the mean H of the h-coefficient was 0.92 and that of TT was 0.82. The difference in the H between TT and the h-coefficient was 0.23 in blocks with short responses but only 0.1 in blocks with longer-lasting responses. A t-test confirmed that the difference in the H between TT and the h-coefficient was significantly larger in blocks with short responses than in blocks with longer-lasting responses (P = 0.005 < 0.05). This difference shows that for short responses TT's efficacy dropped significantly more strongly compared with that of the h-coefficient.
Performance in biological data.
In a final analysis we compared the performance of SD, TT, and the h-coefficient in two noisy biological data sets. The first data set consisted of 627 units recorded in 21 sessions across 7 awake and behaving human subjects. No distinction between single units and multiunits was made in this analysis. In this first data set the PSTHs were aligned to a behaviorally relevant stimulus (winning $10 or $100). Each PSTH contained between 27 and 37 trials with a mean of 33.07 (±0.107) trials per PSTH. The mean firing rate across the whole recording session across all units in this data set was 1.74 Hz (±0.105). The number of registered responses (putative hits, Hp), for all three classifiers at an α-level of α = 0.01 (h > 1) were HpSD = 0, HpTT = 0.01, Hph-coefficient = 0.065 (0 responses registered with SD, 6 registered with TT, and 41 registered with the h-coefficient; Fig. 7). Four of the six units that were registered as a response by TT were also registered by the h-coefficient (Fig. 7). SD did actually register three units as responsive in this comparison. However, all three units were outliers, with 0 spikes in their baseline period, and were therefore not counted (no such outliers were registered as responses by either TT or the h-coefficient).
Fig. 7.

Examples of single-unit human recordings. We recorded neuronal responses in humans during a 2-alternative forced choice task (x-axis: time, ms; y-axis: normalized firing rate; t = 0 is the time point at which an outcome of winning $10 or $100 was revealed; y = 1 is the normalized mean firing rate of the neuron across the whole recording session). Unlike the modeled data set, in this noisy biological data set the ground truth was not known (i.e., which neurons actually encoded the relevant event in their firing rate). Nonetheless, some units (A–D) showed visually more clearly discernible putative responses, while others were most probably false alarms (E and F). A and D are examples of PSTHs that were registered as a response by both TT and the h-coefficient: TTA: P = 0 < 0.01; h-coefficientA: = 5.556 and TTD: P = 0 < 0.01; h-coefficientD: = 1.375. As in the modeled data set, here too the sPSTH allows for better approximation of the response onset, peak, and overall shape than the nPSTH (bin width: 200 ms, see Fig. 2 for examples from the modeled data set). B and E are examples of units that were registered by the h-coefficient but missed by TT: TTB: P = 0.421 > 0.01; h-coefficientB: = 1.3 and TTE: P = 0.173 > 0.01; h-coefficientE: = 1.533. Averaging across the initial reduction in the firing rate (at ∼300 ms) and the narrow but strong increase in firing rate (at ∼700 ms) prevented TT from registering the response in B as being significantly different from baseline. In E it becomes apparent from the raster plot that either the response in this PSTH may be a false alarm of the h-coefficient or the underlying unit encoded different trials differently (e.g., because of alternate processing). The response in E was included here as an example of where the simple maximum function (Algorithm 1) may lead to ambiguous results because it does not take the distribution of the underlying data into account (see discussion). Unit C was 1 of only 2 units that were registered as a response by TT but not by the h-coefficient: TTC: P = 0.004 < 0.01; h-coefficientC: = 0.9. Unlike the h-coefficient, TT is equally sensitive to a decrease in the baseline period firing rate (−1,000 to −200 ms) as it is to an increase in the response period firing rate (200 to 1,000 ms), which can lead to false alarms. To prevent such false alarms we applied a correction to the baseline period firing rate (see discussion), which also made the biological data set more comparable to the modeled data set, in which the mean baseline period firing rate was, per definition, always close to 1. Without correction TT would have registered F as a response: TTF(no corr): P = 0.002 < 0.01. With baseline period correction TT did not register F as a response: TTF(corr): P = 0.594 > 0.01, h-coefficientF: = 0.125. SD registered none of these units as a response: SDA–F < 1.983.
The second data set that we included in this analysis consisted of randomly shuffled PSTHs drawn from the data of the first data set (above). This second data set contained 6,270 shuffled PSTHs (for every recorded behavioral PSTH we generated 10 shuffled PSTHs from the same underlying raw data) with the same distribution of parameters as the first data set (e.g., number of trials, mean firing rate, etc.). Within this second data set the FA of all three classifiers was FA < 0.01 [FA(SD) = 0.001 FA(TT) = 0.001, FA(h-coefficient) = 0.008]. Ninety-five percent of the data points in the shuffled data set had an h-coefficient h < 0.44 and ninety-nine percent of the data was below h < 0.87, confirming that a threshold of h > 1 corresponds to a P value of at least P < 0.01 in these biological data sets, too. While all three classifiers showed comparatively low FA in the shuffled biological data, the number of responses registered by the h-coefficient in nonshuffled biological data was higher than that registered by SD and TT (above).
In conclusion, the first set of results presented here was based on a large number of modeled neuronal responses. In this data set the ground truth was known and a direct measure of the FA and H of all three classifiers could be made and compared, revealing a better performance of the h-coefficient compared with SD and TT. The second set of findings was based on noisy biological data recorded in humans. Here the ground truth to calculate H was not available and the mean and variability of the firing rate in this data set were different from those in the modeled data set. Nonetheless, we found that in both the modeled and the biological data sets the h-coefficient displayed a low FA while still registering a high number of H or Hp. From these results we may conclude that the h-coefficient can provide a powerful statistic for the analysis of PSTHs.
DISCUSSION
In this study, we developed a classifier called the h-coefficient to determine whether a given PSTH contained a response or not. To make this classification, the h-coefficient uses information about the actual shape of a given response envelope in a sPSTH. The h-coefficient quantizes this response shape and compares it to a set of shuffled responses generated from trials drawn randomly from the underlying raw data. The h-coefficient provides a metric of how likely any given response shape is to occur by chance within the raw data.
We generated a large data set of simulated PSTHs divided into seven blocks (A–G) to test and compare the h-coefficient's efficacy across varying conditions. Each PSTH was analyzed with three different classifiers: a measure of the number of standard deviations that the mean of the response period was above the mean of the baseline period (SD), a t-test between the mean firing rate of the response period and the baseline period (TT), and the h-coefficient calculated according to Algorithm 1. The reasoning behind using SD and TT for comparison goes beyond their extensive application as classifiers in their own right. Many (probably most) classifiers in widespread use today are based on, or are fundamentally similar to, either SD or TT (or both, e.g., Ison et al. 2011). SD is a descriptive statistic that compares the response to the distribution of the baseline, and it is therefore, in theory, independent of the number of trials (Cumming et al. 2007). TT is an inferential statistic that compares the response to the mean of the baseline and is therefore dependent on the number of trials. Additionally, SD and TT allow the reader to easily compare the performance of the h-coefficient to their own classifier by using these comparison classifiers as an intermediate step.
ROC analysis revealed similar AUCs for all three classifiers but a stronger negative bias of the h-coefficient compared with that of the other two classifiers. The h-coefficient's strong negative bias resulted in an improved H at a comparable FA. This advantage of the h-coefficient over the other two classifiers increased with increasing numbers of trials and was more pronounced at more conservative thresholds. We investigated these findings further in a set of biological data. In these data we compared the performance of the three classifiers in real neurons recorded during a two-alternative forced choice task in humans. All three classifiers had a low FA (<0.01), while the h-coefficient had a higher Hp than SD or TT.
In blocks B–G of the modeled data set spike times outside of the response period were random with the exception of an absolute refractory period of 3 ms. Even in the presence of sinusoidal noise (block A), the variability of the mean firing rate in the baseline period was limited because of the condition of preserving the mean firing rate of the raw data across the baseline region (see methods). These conditions resulted in a mean firing rate of 3.12 Hz across all baseline regions within block A with a standard deviation of only 0.39 Hz. In comparison, the mean firing rate across the baseline regions in the biological data set was 1.74 Hz with a standard deviation of 2.64 Hz. The high variability in the baseline periods of the biological data set can lead to an increase in FA in SD and TT (but not in the h-coefficient). We therefore corrected the baseline period firing rates in the biological data so that their mean was the same as the mean across the raw data (i.e., across the whole recording session; see methods). On average the mean firing rate across the baseline region was 0.22 Hz lower than the mean across the raw data, with a standard deviation of the difference of 0.65 Hz. The correction of the mean baseline firing rate allowed for a closer comparison of the three classifiers to each other and also of the biological data set to the modeled data set.
Under some experimental conditions it is feasible that a reduction in the firing rate during the baseline period constitutes part of the response profile of a neuron. For example, one could argue that the firing rate during the baseline period may be reduced if an event occurring at t = 0 is anticipated or predicted. Indeed, without correction of the baseline period firing rate, the Hp of TT in the biological data was similar to that of the h-coefficient [Hp(TT, no corr) = 0.075]. The Hp of SD remained equally low without baseline correction [Hp(SD, no corr) = 0.002]. However, closer inspection of the responses registered by TT without baseline correction revealed that many of them showed only a minor increase of the firing rate during the response period but were heavily influenced by low firing rates during the baseline period and may therefore have been false alarms (Fig. 7). Given the high variability of the firing rate in this data set and the absence of a ground truth it is difficult to determine whether a reduction in the baseline period firing rate should be taken into account when classifying the responsiveness of a neuron. TT (as well as SD and many similar classifiers) is equally sensitive to fluctuations of the firing rate in the baseline period and the response period, while the h-coefficient analyzes the response period only. Which of these characteristics may be better suited for the analysis of a given data set will need to be determined based on the nature of the data at hand.
In some neurophysiological experiments a high number of trials can be recorded or an electrode is moved along a trajectory until a neuron with a certain response profile is detected. In such cases responses are usually strong and less sensitive classifiers may be sufficient to register a response. In other applications, such as in human single-cell neurophysiology, electrodes are fixed in place and a large set of neurons with unknown response properties are recorded over a low number of trials only. In this type of experiment the classification of whether a unit was responsive or not is made post hoc. Under these conditions highly conservative and sensitive classifiers are needed because of the often large number of units that are being recorded and the unknown response properties of these units.
Choosing a shorter response period could increase the sensitivity of SD, TT, and the h-coefficient as long as any response is still fully contained within the response period. For example, if a neuronal population is known to reliably respond 100 ms after stimulus onset with a transient increase in firing rate lasting <200 ms, then choosing a response period of 100–300 ms would increase the sensitivity of all three classifiers. However, when the neuronal population's response properties are not known a priori a long response period is needed to “scan” for any type of response occurring time locked to the stimulus. Algorithm 1 predicts that the h-coefficient's performance remains robust across different durations of the response period. In step 3a, the h-coefficient searches for the maximum within the response period and then only includes the firing rate fluctuation associated with that peak; the rest of the data is ignored. That is, even in very long-lasting response periods, the h-coefficient will always only focus on the most dominant change in firing rate within that period. This is a key difference of the h-coefficient over other classifiers including SD and TT, which average responses out across the whole response period and therefore are intrinsically biased toward longer-lasting responses. However, the duration of the response period does still have an effect on the h-coefficient. On the one hand, the longer the response period is in the test PSTH, the more responses with different response onset times can be detected. On the other hand, the same long response period is then also used in the shuffled PSTHs to construct M. Here longer response periods increase the likelihood of finding shuffled responses that are larger than the test response, thus rendering the classifier more conservative. The response period should therefore always be chosen proportional to the period within which responses are actually expected to occur, i.e., as long as necessary and as short as possible.
In theory there are many ways of further optimizing the sensitivity of all three classifiers included in this study. Here we only compared their most fundamental forms for basic reference. The h-coefficient provides two key areas where its sensitivity may be improved. The first one is the point at which the response shape is quantized by measuring the striped surface areas (Algorithm 1.3). Here additional parameters (e.g., response onset times or time to peak, etc.) could be included in the quantification of the overall response by generating additional response matrices m1, m2, . . . mn. With these additional matrices a higher-dimensional h-coefficient could be calculated, which may provide a more sensitive metric for some applications. The second step where the h-coefficient could be improved further is function f1 (and per extension f2, Algorithm 1.5 and 1.6). In this study we used the maximum function as an example for f1. We used this function because in shuffling- and bootstrapping-based spike train analysis real responses are per definition expected to be larger than the shuffled or bootstrapped data (within the range of a given α-level). However, the maximum function does not take the distribution of the underlying data into account and is therefore sensitive to outliers. A more sophisticated analysis of the properties of the shuffled responses and of the distribution of these properties could lead to a more fine-tuned and sensitive h-coefficient. For example, one could compare r directly to m, which would allow for a calculation of the percentile of values within a row in m that were larger than the corresponding element in r. As a result the elements in a and b would then be weighted by probabilities instead of being only binary.
In this study we only analyzed responses that were defined by an increase in firing rate, as opposed to responses for which the firing rate decreased time locked to a stimulus. Such suppressed responses can be more difficult to classify because of the limited range between the baseline and the natural floor at 0 Hz. The h-coefficient can, however, be easily adjusted to classify such suppressed responses by simply adding the sentence “[…]Invert the response envelope λ(t) by multiplying it with −1 and then adding 2: λ′ = −λ(t) + 2.” to Algorithm 1.2. This addition inverts the responses in the test PSTH and in all the shuffled PSTHs and, as a result, now quantizes any suppressed fluctuations in the data. As a convention, the h-coefficient for suppressed responses may be denoted with negative values. Responses that consist of both a suppressed part and a part in which the firing rate increases can accordingly be classified by simply providing two h-coefficients, one for each type of response. In this case, the h-coefficient can be reported as a pair of values in the form h = [−hs,he], where hs is the h-coefficient of the suppressed part of the response and he is the h-coefficient of the enhanced part of the response. Analogously, two independent thresholds can be applied to this type of h-coefficient, allowing for a more fine-tuned response criterion in data sets where suppressed, enhanced, or combined responses are expected.
Conclusion.
Many electrophysiological applications call for a conservative yet sensitive response classifier that can be applied to a large number of neurons with unknown response properties. Applications that do not fall within these criteria could also profit from this type of classifier by potentially increasing their yield without increasing their false alarm rates. The h-coefficient provides such a statistic with the following advantages over more classical approaches. 1) The h-coefficient analyzes the continuous, smooth response envelope generated by optimized kernel convolution of the spiking sequence as opposed to analyzing the discrete nPSTH. 2) The h-coefficient is sensitive to the actual shape of the response, analyzing the response amplitude as well as the duration at various amplitudes independently of each other. It can also report both these parameters separately. 3) As a result of this more differentiated approach, the h-coefficient is not biased toward longer-lasting responses. It is therefore more sensitive to short responses occurring within prolonged response periods than the classical approaches. 4) The h-coefficient can easily be extended and modified to target specific parameters of interest within a given data set. 5) The h-coefficient provides a parameter-free statistic that is based on the respective neuron's baseline firing behavior. This property allows for an objective comparison of responses across neurons with different baseline and response profiles, from different brain areas, or from different species. 6) The h-coefficient can generate a higher yield in a given experiment, resulting in a lower number of required trials and more efficient use of available resources.
The h-coefficient provides a versatile and advanced tool for PSTH analysis and a statistic that is both conservative and powerful. Based on our findings we believe that the h-coefficient can help enhance and standardize the analysis and reporting of spiking data across many areas of PSTH-based research.
GRANTS
This work was supported by the Swiss National Science Foundation (PBSKP3-124730) and the G. Harold & Leila Y. Mathers Foundation (09212007).
DISCLOSURES
No conflicts of interest, financial or otherwise, are declared by the author(s).
AUTHOR CONTRIBUTIONS
Author contributions: M.R.H.H. and C.K. conception and design of research; M.R.H.H. and I.F. performed experiments; M.R.H.H. analyzed data; M.R.H.H. interpreted results of experiments; M.R.H.H. prepared figures; M.R.H.H. drafted manuscript; M.R.H.H. and C.K. edited and revised manuscript; M.R.H.H., I.F., and C.K. approved final version of manuscript.
ACKNOWLEDGMENTS
We thank Hideaki Shimazaki for valuable advice on optimization algorithms and Ueli Rutishauser for insightful comments on signal detection theory.
Footnotes
The authors have made a MATLAB toolbox called h-coefficient available for download on the MATLAB File Exchange repository (www.mathworks.com).
REFERENCES
- Abeles M. Quantification, smoothing, and confidence limits for single-units' histograms. J Neurosci Methods 5: 317–325, 1982. [DOI] [PubMed] [Google Scholar]
- Abramson IS. On bandwidth variation in kernel estimates—a square root law. Ann Stat 10: 1217–1223, 1982. [Google Scholar]
- Adrian ED. The Basis of Sensation. London: Hafner, 1928, p. 122. [Google Scholar]
- Bélisle P, Joseph L, MacGibbon B, Wolfson DB, du Berger R. Change-point analysis of neuron spike train data. Biometrics 54: 113–123, 1998. [PubMed] [Google Scholar]
- Buzsaki G, Mizuseki K. The log-dynamic brain: how skewed distributions affect network operations. Nat Rev Neurosci 15: 264–278, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cumming G, Fidler F, Vaux DL. Error bars in experimental biology. J Cell Biol 177: 7–11, 2007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Davey NJ, Ellaway PH, Stein RB. Statistical limits for detecting change in the cumulative sum derivative of the peristimulus time histogram. J Neurosci Methods 17: 153–166, 1986. [DOI] [PubMed] [Google Scholar]
- Devroye L, Lugosi G. Variable Kernel Estimates: On the Impossibility of Tuning the Parameters. Boston, MA: Birkhäuser, 2000, p. 405–424. [Google Scholar]
- Dörrscheidt GH. The statistical significance of the peristimulus time histogram (PSTH). Brain Res 220: 397–401, 1981. [DOI] [PubMed] [Google Scholar]
- Efron B. The Jackknife, the Bootstrap, and Other Resampling Plans. Philadelphia, PA: SIAM, 1982. [Google Scholar]
- Ellaway PH. Cumulative sum technique and its application to the analysis of peristimulus time histograms. Electroencephalogr Clin Neurophysiol 45: 302–304, 1978. [DOI] [PubMed] [Google Scholar]
- Falzett M, Moore RK, Petry HM, Powers MK. A method for determining threshold from single-unit neural activity. Brain Res 347: 127–131, 1985. [DOI] [PubMed] [Google Scholar]
- Fawcett T. An introduction to ROC analysis. Pattern Recognition Lett 27: 861–874, 2006. [Google Scholar]
- Fried I, MacDonald K, Wilson C. Single neuron activity in human hippocampus and amygdala during recognition of faces and objects. Neuron 18: 753–765, 1997. [DOI] [PubMed] [Google Scholar]
- Fried I, Rutishauser U, Cerf M, Kreiman G. Single Neuron Studies of the Human Brain. Cambridge, MA: MIT Press, 2014. [Google Scholar]
- Gerstein GL. Analysis of firing patterns in single neurons. Science 131: 1811–1812, 1960. [DOI] [PubMed] [Google Scholar]
- Gerstein GL, Kiang NY. An approach to the quantitative analysis of electrophysiological data from single neurons. Biophys J 1: 15–28, 1960. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hanes DP, Thompson KG, Schall JD. Relationship of presaccadic activity in frontal eye field and supplementary eye field to saccade initiation in macaque: Poisson spike train analysis. Exp Brain Res 103: 85–96, 1995. [DOI] [PubMed] [Google Scholar]
- Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143: 29–36, 1982. [DOI] [PubMed] [Google Scholar]
- Hung CP, Ramsden BM, Roe AW. Weakly modulated spike trains: significance, precision, and correction for sample size. J Neurophysiol 87: 2542–2554, 2002. [DOI] [PubMed] [Google Scholar]
- Ison MJ, Mormann F, Cerf M, Koch C, Fried I, Quian-Quiroga R. Selectivity of pyramidal cells and interneurons in the human medial temporal lobe. J Neurophysiol 106: 1713–1721, 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jones MC, Marron JS, Sheather SJ. A brief survey of bandwidth selection for density estimation. J Am Stat Assoc 91: 401–407, 1996. [Google Scholar]
- Kass RE, Ventura V, Brown EN. Statistical issues in the analysis of neuronal data. J Neurophysiol 94: 8–25, 2005. [DOI] [PubMed] [Google Scholar]
- Levick WR, Zacks JL. Responses of cat retinal ganglion cells to brief flashes of lights. J Physiol 206: 677–700, 1970. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Loader C. Local Regression and Likelihood (Statistics and Computing). New York: Springer, 1999, p. 308. [Google Scholar]
- Loftsgaarden DO, Quesenberry CP. A nonparametric estimate of a multivariate density function. Ann Math Stat 36: 1049–1051, 1965. [Google Scholar]
- Mainen ZF, Sejnowski TJ. Reliability of spike timing in neocortical neurons. Science 268: 1503–1506, 1995. [DOI] [PubMed] [Google Scholar]
- Maunsell JH, Gibson JR. Visual response latencies in striate cortex of the macaque monkey. J Neurophysiol 68: 1332–1344, 1992. [DOI] [PubMed] [Google Scholar]
- Moore GP, Perkel DH, Segundo JP. Statistical analysis and functional interpretation of neuronal spike data. Annu Rev Physiol 28: 493–522, 1966. [DOI] [PubMed] [Google Scholar]
- Mormann F, Kornblith S, Quian-Quiroga R, Kraskov A, Cerf M, Fried I, Koch C. Latency and selectivity of single neurons indicate hierarchical processing in the human medial temporal lobe. J Neurosci 28: 8865–8872, 2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mukamel R, Ekstrom AD, Kaplan J, Iacoboni M, Fried I. Single-neuron responses in humans during execution and observation of actions. Curr Biol 20: 750–756, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Oram MW, Perrett DI. Time course of neural responses discriminating different views of the face and head. J Neurophysiol 68: 70–84, 1992. [DOI] [PubMed] [Google Scholar]
- Parzen E. On estimation of a probability density function and mode. Ann Math Stat 33: 1065–1076, 1962. [Google Scholar]
- Perkel DH, Gerstein GL, Moore GP. Neuronal spike trains and stochastic point processes. I. The single spike train. Biophys J 7: 391–418, 1967. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Quian-Quiroga R, Nadasdy Z, Ben-Shaul Y. Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering. Neural Comput 16: 1661–1687, 2004. [DOI] [PubMed] [Google Scholar]
- Richmond BJ. Stochasticity, spikes and decoding: sufficiency and utility of order statistics. Biol Cybern 100: 447–457, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Richmond BJ, Optican LM, Podell M, Spitzer H. Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. I. Response characteristics. J Neurophysiol 57: 132–146, 1987. [DOI] [PubMed] [Google Scholar]
- Ritov Y, Raz A, Bergman H. Detection of onset of neuronal activity by allowing for heterogeneity in the change points. J Neurosci Methods 122: 25–42, 2002. [DOI] [PubMed] [Google Scholar]
- Shimazaki H, Shinomoto S. A method for selecting the bin size of a time histogram. Neural Comput 19: 1503–1527, 2007. [DOI] [PubMed] [Google Scholar]
- Shimazaki H, Shinomoto S. Kernel bandwidth optimization in spike rate estimations. J Comput Neurosci 29: 171–182, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thompson KG, Hanes DP, Bichot NP, Schall JD. Perceptual and motor processing stages identified in the activity of macaque frontal eye field neurons during visual search. J Neurophysiol 76: 4040–4055, 1996. [DOI] [PubMed] [Google Scholar]
- Ushiba J, Honda S, Onishi Y, Tomita Y, Masakado Y. Integration-plot test for peri-stimulus time histograms in human motor units. Med Biol Eng Comput 42: 264–271, 2004. [DOI] [PubMed] [Google Scholar]
