Abstract
The correlation coefficient squared, r2, is commonly used to validate quantitative models on neural data, yet it is biased by trial-to-trial variability: as trial-to-trial variability increases, measured correlation to a model’s predictions decreases. As a result, models that perfectly explain neural tuning can appear to perform poorly. Many solutions to this problem have been proposed, but no consensus has been reached on which is the least biased estimator. Some currently used methods substantially overestimate model fit, and the utility of even the best performing methods is limited by the lack of confidence intervals and asymptotic analysis. We provide a new estimator, r̂²_ER, that outperforms all prior estimators in our testing, and we provide confidence intervals and asymptotic guarantees. We apply our estimator to a variety of neural data to validate its utility. We find that neural noise is often so great that confidence intervals of the estimator cover the entire possible range of values ([0, 1]), preventing meaningful evaluation of the quality of a model’s predictions. This leads us to propose the use of the signal-to-noise ratio (SNR) as a quality metric for making quantitative comparisons across neural recordings. Analyzing a variety of neural data sets, we find that up to ∼ 40% of some state-of-the-art neural recordings do not pass even a liberal SNR criterion. Moving toward more reliable estimates of correlation, and quantitatively comparing quality across recording modalities and data sets, will be critical to accelerating progress in modeling biological phenomena.
Author summary
Quantifying the similarity between a model and noisy data is fundamental to advancing the scientific understanding of biological phenomena, and it is particularly relevant to modeling neuronal responses. A ubiquitous metric of similarity is the correlation coefficient, but this metric depends on a variety of factors that are irrelevant to the similarity between a model and data. While neuroscientists have recognized this problem and proposed corrected methods, no consensus has been reached as to which are effective. Prior methods have wide variation in their accuracy, and even the most successful methods lack confidence intervals, leaving uncertainty about the reliability of any particular estimate. We address these issues by developing a new estimator along with an associated confidence interval that outperforms all prior methods. We also demonstrate how a signal-to-noise ratio can be used to usefully threshold and compare noisy experimental data across studies and recording paradigms.
Introduction
Building an understanding of the nervous system requires the quantification of model performance on neural data, and this often involves computing Pearson’s correlation coefficient between model predictions and neural responses. Yet this typical estimator, r̂², is fundamentally confounded by the trial-to-trial variability of neural responses: a low r̂² could be the result of a poor model or high neuronal variability.
One approach to this problem is to average over many repeated trials of the same stimulus in order to reduce the influence of trial-to-trial variability. With a finite number of trials, however, this approach will never wholly remove the influence of noise and its confounding effect; moreover, the collection of additional trials is expensive. A more principled approach has been to account for trial-to-trial variability in the estimation of the fraction of explainable variance, or r2. Most often, this takes the form of attempting to estimate what the r2 would have been in the absence of trial-to-trial variability. Here we call this quantity r²_ER, the r2 between the model prediction and the expected response (ER) of the neuron (i.e., the ‘true’ mean, or expected value, of the estimated tuning curve). While a variety of solutions have been proposed to estimate this quantity [1–10], they have not been quantitatively compared; thus there is no basis for reaching a consensus on which methods are appropriate or, more importantly, inappropriate. We find that several estimators still in recent use have large biases. Estimators that did have relatively small biases lacked associated confidence intervals; thus the degree of uncertainty in these sometimes highly variable estimates remains ambiguous. Finally, none of these methods have been analyzed asymptotically to give a theoretical guarantee that they will converge to r²_ER, i.e., it has not been shown that they are consistent estimators.
To address these substantial problems, we introduce r̂²_ER, a simple analytic estimator of r²_ER, along with a method for generating α-level confidence intervals. We validate our estimator in simulation, prove that it is consistent, and provide head-to-head comparisons to prior methods. We then demonstrate the use of r̂²_ER and its confidence interval on two sets of neural data. We find many cases where neuronal data is so noisy that estimates of r²_ER provide little inferential power about the quality of a model fit. This naturally leads to a useful metric of the quality of a neuronal recording that we will refer to as the signal-to-noise ratio (SNR), and which can be interpreted in terms of the number of trials needed to reliably detect tuning. Across a diverse set of neural recordings, we find that many neurons do not pass even a liberal criterion for providing meaningful insight into the quality of a model fit.
Results
Our results are organized as follows. First, we give the essential intuition into the source of the bias in r̂² and explain how r̂²_ER removes this bias. Next, we evaluate r̂²_ER through simulation and compare it to prior methods. We then demonstrate the method on two neural data sets: one from a study of motion direction tuning in area MT and one from a study of responses to natural images in area V4. Finally, we develop an estimator of the signal-to-noise ratio (SNR) as a metric to determine the inferential power of a given neuronal recording.
Bias of r̂² and its correction
Consider a typical scenario in sensory neuroscience where the responses of a neuron to m stimuli across n repeated trials of each stimulus have been collected and the average of these responses, the estimated tuning curve (Fig 1, dashed green line), is compared to responses predicted by a model (red line). These responses could be spike counts from a neuron but could just as well be any other neural signal. Even if the m expected values of the neuronal response, μi (solid green trace), perfectly correlate with the model predictions, νi (red trace is scaled and shifted relative to green), the m sample averages, Yi (dashed green trace), will deviate from their expected value owing to the sample mean’s variability. Here, we quantify this variability using the variance, σ2, of the distribution of responses from trial-to-trial (see Methods, “Assumptions and terminology for derivation of unbiased estimators”). We assume σ2 is constant across responses to different stimuli, which can be achieved by applying a variance stabilizing transform to the data. The variance of the sample mean for all stimuli will thus be σ2/n. Owing to the variance of the sample mean, the reported r̂² can be appreciably less than 1 even though the r2 between the underlying expected values of the neuronal response and the model is 1.
Fig 1. Sampling noise confounds estimation of the correlation between model prediction and neuronal tuning curve.
The expected (true) spike counts in response to a set of 10 stimuli (solid green points) are perfectly correlated with a model (red points), yet owing to sampling error (neural trial-to-trial variability) the estimated tuning curve (green open circles) has a correlation of less than one with the model (r̂² < 1).
The quantity we attempt to estimate in this paper is the r2 between the model predictions (νi) and the expected neuronal responses (μi). We will call this quantity r²_ER, the fraction of variance of the ‘expected response’ explained by the model:

$$ r^2_{\mathrm{ER}} = \frac{\left(\sum_{i=1}^{m}(\nu_i-\bar{\nu})(\mu_i-\bar{\mu})\right)^2}{\sum_{i=1}^{m}(\nu_i-\bar{\nu})^2\,\sum_{i=1}^{m}(\mu_i-\bar{\mu})^2} \tag{1} $$

We will show that the naive sample estimator, which uses Yi in place of μi,

$$ \hat{r}^2 = \frac{\left(\sum_{i=1}^{m}(\nu_i-\bar{\nu})(Y_i-\bar{Y})\right)^2}{\sum_{i=1}^{m}(\nu_i-\bar{\nu})^2\,\sum_{i=1}^{m}(Y_i-\bar{Y})^2} \tag{2} $$

has an expected value that can be well approximated as the ratio of the expected values of its numerator and denominator as follows (for asymptotic justification see Methods, “Inconsistency of r̂² in m”):

$$ E\!\left[\hat{r}^2\right] \approx \frac{\left(\sum_i(\nu_i-\bar{\nu})(\mu_i-\bar{\mu})\right)^2 + \frac{\sigma^2}{n}\sum_i(\nu_i-\bar{\nu})^2}{\sum_i(\nu_i-\bar{\nu})^2\left(\sum_i(\mu_i-\bar{\mu})^2 + (m-1)\frac{\sigma^2}{n}\right)} \tag{3} $$

While the terms on the left in the numerator and denominator are the same as those of r²_ER, the terms on the right are proportional to the trial-to-trial variability (σ2) and cause r̂² to deviate from r²_ER. This is the essential problem: r̂² is biased away from r²_ER by terms proportional to the amount of variability, σ2/n, in the estimated responses.
The strategy we take to solve this problem is straightforward: find unbiased estimators of these noise terms and subtract them from the numerator and denominator of Eq 2 for r̂², thus:

$$ \hat{r}^2_{\mathrm{ER}} = \frac{\left(\sum_i(\nu_i-\bar{\nu})(Y_i-\bar{Y})\right)^2 - \frac{\hat{\sigma}^2}{n}\sum_i(\nu_i-\bar{\nu})^2}{\sum_i(\nu_i-\bar{\nu})^2\left(\sum_i(Y_i-\bar{Y})^2 - (m-1)\frac{\hat{\sigma}^2}{n}\right)} \tag{4} $$

where σ̂² is an unbiased estimator of the trial-to-trial variability, after a variance stabilizing transform if necessary. Typically σ̂² is the sample variance, but not necessarily. For example, if stimuli are shown only once (n = 1), then an assumed value of trial-to-trial variability could be substituted for σ̂². The numerator and denominator of the fraction are unbiased estimators of the numerator and denominator of r²_ER; however, this solution is only approximate, since the expected value of a ratio is not necessarily the ratio of the expected values of the numerator and denominator (see Methods, “Bias of r̂²_ER”). Yet we show in simulation that the approximation is very good for typical neural statistics, and we show analytically that, unlike r̂², our estimator converges to the true r²_ER as the number of stimuli m → ∞ (see Methods, “Consistency of r̂²_ER in m”). We next evaluate this estimator in simulation.
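As a concrete illustration, here is a minimal numpy sketch of Eq 4 (an illustrative implementation of ours; the function name r2_er and the choice of the pooled sample variance for σ̂² are our assumptions, not prescribed by the text):

```python
import numpy as np

def r2_er(Y, nu):
    """Approximately unbiased estimate of the r^2 between model predictions
    and a neuron's expected responses (Eq 4). Assumes Y has already been
    variance-stabilized (e.g., square-rooted for Poisson-like counts).

    Y  : (m, n) array, n repeats of each of m stimuli
    nu : (m,) array of model predictions
    """
    m, n = Y.shape
    y_bar = Y.mean(axis=1)                        # estimated tuning curve
    sigma2_hat = Y.var(axis=1, ddof=1).mean()     # pooled trial-to-trial variance
    nu_c = nu - nu.mean()
    y_c = y_bar - y_bar.mean()
    num = np.dot(nu_c, y_c) ** 2 - (sigma2_hat / n) * np.sum(nu_c ** 2)
    den = np.sum(nu_c ** 2) * (np.sum(y_c ** 2) - (m - 1) * sigma2_hat / n)
    return num / den
```

Note that, as discussed below, the estimate is deliberately not truncated to [0, 1].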
Validation of r̂²_ER by simulation
To demonstrate the effectiveness and key properties of r̂²_ER, we ran a simulation with m = 362 stimuli, n = 4 repeats, and σ2 = 0.25 (the trial-to-trial variance of a Poisson neuronal response after a variance-stabilizing transform; see Methods, “Assumptions and terminology for derivation of unbiased estimators”). This corresponds, for example, to a minimal experiment to characterize shape tuning in V4 neurons, which requires hundreds of unique shapes and takes on the order of 1 hour [2]. In the case where the model prediction (νi) and expected response (μi) were perfectly correlated (as in Fig 1) and SNR was moderate at 0.5, the distribution of the naive estimator, r̂², is centered well below 1 (Fig 2A, blue). Thus, the model appears to be a poor fit to data that it in fact generated, indicating that r̂² is a poor estimator of r²_ER. On the other hand, the distribution of our corrected estimator, r̂²_ER, is appropriately centered at 1 (Fig 2A, orange). Approximately 50% of the time our estimator exceeds 1, taking on impossible values of r²_ER, but this is necessary to achieve unbiased estimates for high r²_ER because truncating the values would shift the mean below 1.
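For readers who want to reproduce this behavior, the following sketch implements the simulation just described (the simulate_dataset helper and its sinusoidal tuning-curve/model pair are our own illustrative choices; it reuses the r2_er function sketched earlier):

```python
import numpy as np

def simulate_dataset(rho2, d2, sigma2, m, n, rng):
    """Draw an m x n response matrix whose expected tuning curve has squared
    correlation rho2 with a sinusoidal model, per-stimulus variance d2 across
    stimuli, and trial-to-trial variance sigma2 (cf. Methods, "Simulation
    procedure")."""
    x = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
    nu = np.cos(x)                                        # model predictions
    theta = np.arccos(np.clip(np.sqrt(rho2), 0.0, 1.0))   # phase offset sets the correlation
    mu = np.sqrt(2.0 * d2) * np.cos(x + theta)            # expected responses
    Y = rng.normal(mu[:, None], np.sqrt(sigma2), size=(m, n))
    return Y, nu

rng = np.random.default_rng(0)
m, n, sigma2, snr = 362, 4, 0.25, 0.5
d2 = snr * sigma2                                         # SNR = d2 / sigma2
naive, corrected = [], []
for _ in range(1000):
    Y, nu = simulate_dataset(1.0, d2, sigma2, m, n, rng)
    y = Y.mean(axis=1)
    naive.append(np.corrcoef(nu, y)[0, 1] ** 2)           # naive r^2
    corrected.append(r2_er(Y, nu))                        # corrected estimator (Eq 4)
print(np.mean(naive), np.mean(corrected))                 # roughly 0.67 vs 1.00 (cf. Fig 2A)
```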
Fig 2. Simulation of the naive and unbiased estimators for model-to-neuron fits at varying levels of r²_ER, where m = 362, n = 4, and σ2 = 0.25.
(A) For true r2 = 1, at a moderately low SNR = 0.5, r̂² (blue) is on average 0.67 whereas r̂²_ER (orange) is on average 1.00. The bias of r̂²_ER (see Methods, “Bias of r̂²_ER”) is small relative to its variability (90% quantile = [0.93, 1.07], vertical bars) and to the bias of r̂². (B) Same simulation as A but at five levels of r²_ER (0, 0.25, 0.5, 0.75, 1). Lines show mean values of r̂² (blue) and r̂²_ER (orange). Black line (beneath orange) shows true r²_ER; error bars show 90% quantile.
We evaluated the estimators r̂² and r̂²_ER at five values of r²_ER (0, 0.25, 0.5, 0.75, 1) and plotted the mean and 90% quantiles. Fig 2B shows that r̂² (blue line) grossly underestimates r²_ER (black line) at all levels except r²_ER = 0, whereas r̂²_ER (orange line) correctly estimates the true correlation (orange and black lines overlap). Thus the corrected estimator performs favorably in this simulation. We ran similar simulations on the square root of Poisson distributed spike counts and found similar results for both low and high average firing rates. Next, we characterize the performance of r̂²_ER relative to r̂² in simulations that cover a wide range of the parameters m, n, and SNR.
Asymptotic properties of r̂² and r̂²_ER
We ran simulations to determine the bias and variance of r̂²_ER relative to r̂² as a function of the parameters SNR, n, and m. Fig 3A shows that as SNR increases, r̂² (blue) and r̂²_ER (orange) converge (n = 4, m = 362). Thus, for neuronal recordings where variation in response strength across stimuli is large relative to trial-to-trial variability, these two estimators should have similar values. At low values of SNR, e.g., 0.1, r̂² has a large downward bias, whereas r̂²_ER has a small upward bias relative to its own variability and to the bias of r̂² (for the source of this bias see Methods, “Bias of r̂²_ER”). This small upward bias of r̂²_ER quickly diminishes as SNR increases, whereas the large negative bias of r̂² remains across a much wider range of SNR. The essential problem this simulation reveals is that if SNR varies widely from neuron to neuron, the bias in the naive estimate will cause apparent variation in r2 across neurons that depends on SNR and not on the underlying tuning curve. Neuronal SNR is not typically under experimental control, making this problem difficult to avoid.
Fig 3. Comparison of r̂² and r̂²_ER for estimating model-to-neuron fit across broad, relevant ranges of SNR, n, and m.
(A) Average performance of the naive r̂² (blue) and corrected r̂²_ER (orange) as a function of SNR for a simulation where the true r²_ER is fixed (horizontal black line), m = 362, n = 4, and σ2 = 0.25. Error bars indicate 90% quantiles. (B) Performance of the estimators as a function of n, the number of repeats of each stimulus. Simulation like (A), except SNR = 0.5 and n is varied. (C) Performance as a function of m, the number of unique stimuli, for a low number of repeats (n = 4). Like (A), except SNR = 0.5 and m is varied.
The number of repeats, n, is under the experimenter’s control but is expensive to increase. Fig 3B shows how r̂² and r̂²_ER converge as n increases. Thus the bias in r̂² can be reduced by increasing the number of repeats, but achieving this requires a very high number of repeats at low SNR. An advantage of r̂²_ER is that even for low n, it on average estimates the true correlation to the model (orange trace overlaps black trace, Fig 3B), providing a large gain in total trial efficiency for estimating the quality of model fit.
When increasing the number of stimuli, m, unlike the previous two cases, r̂² and r̂²_ER do not converge to the same value (Fig 3C). While the variability of both estimators decreases (90% quantiles narrow), it is clear in simulation that r̂² is not a consistent estimator of r²_ER in m, since it does not converge to r²_ER. While there is a small upward bias of r̂²_ER for low m, as m increases this bias is reduced (see Methods, “Consistency of r̂²_ER in m”).
Comparison to prior methods
Accounting for noise when interpreting the fit of models to neural data has been examined and applied in the neuroscientific literature for some time [1–10]. Several studies have followed the approach of estimating the upper bound on the quality of fit of a model given noise and then referencing the measured fit to this quantity. Roddey et al. [1] estimate this upper bound by computing their measure of model fit, ‘coherence’, across split trials and then plotting the coherence of the data with the model predictions relative to the split-trial coherence. Yamins et al. [7] normalize r2 by the split-trial correlation transformed by the Spearman-Brown prediction formula (averaged across randomly resampled subsets of trials). Hsu et al. [4] also use split-half correlation (averaged across randomly resampled subsets of trials) to estimate an upper bound (CCmax) via a transformation they derive, attempting to estimate the correlation of the true mean with the firing rate of the neuron; for the purpose of comparison, we square this estimator. Schoppe et al. [8] improve upon this method by giving an analytic form, thus removing the need for resampling; they do this by using the ‘signal-power’ (SP) estimate developed by Sahani and Linden [3]. Kindel et al. [10] take inspiration from Schoppe et al., except that to estimate CCmax they measure the correlation of responses from a Gaussian simulation (based on the sample mean and variance of the neural data) with the sample mean; we square their estimator as well (a parametric bootstrap, PB, approach). Pasupathy and Connor [2] estimate the fraction of total variance accounted for by trial-to-trial variability, intuitively the fraction of unexplainable variance, and then use it to normalize r̂². With a similar motivation, Cadena et al. [9] provide a metric they call “fraction explainable variance explained” (FEVE): they form the ratio of the mean squared prediction error over the total variance of the response (after subtracting an estimate of trial-to-trial variability from both) and subtract this ratio from one. While all of these methods might be intuitively appealing, the quantities to which they converge, and their relationship to r²_ER, are unclear.
Unlike the above approaches, we follow a line of research [3, 6] that explicitly attempts to construct an unbiased estimator of r2 in the absence of noise (see Methods, “Prior analytic methods of estimating r²_ER”). Heretofore, many of the methods reviewed above have not been quantitatively validated, and none have been directly compared. We now compare all of these methods with respect to estimating r²_ER. We exclude from this comparison David and Gallant [5] because their method depends on a large number of repeated trials, a regime in which the utility of such corrections diminishes.
We quantified the ability of all methods to estimate r²_ER in a simulation with n = 4 trials and m = 362 stimuli (see Methods, “Simulation procedure”). We sort the estimators (Fig 4, y-axis) by their MSE in a test case where the true r²_ER = 1. We generally find that r̂²_ER, ϒ, SPEnorm, FEVE, and two of the other prior estimators are all comparable in their performance (red trace, top 6 points), with r̂²_ER performing slightly, but significantly, better. SPEnorm and one of these prior estimators are numerically identical in their performance and their trial-to-trial results. On the other hand, two of the prior estimators substantially overestimate r²_ER, the naive estimator r̂², as expected, yields an underestimate (mean = 0.50), and one further estimator also underestimates r²_ER. When the true r²_ER is 0.5, we find similar results: the same two estimators produce overestimates (0.63 and 1.04 on average, respectively) and the mean naive r̂² is 0.25. Thus, serious caution should be taken when interpreting these two overestimating methods. We applied these estimators to neural data from V4 fit by a deep neural network (see Methods, “Electrophysiological data”) and found a similar pattern of results: the top 6 estimators give similar estimates to each other, the overestimating methods tend to be greater, and the underestimating methods are lower (Fig 4C). We conclude that r̂²_ER is as good as any available estimator of r²_ER, has a simple analytic form, and, in contrast to ϒ, can still be calculated without computing the sample variance, for example, if no repeats are collected and the variance must be assumed (see Discussion, “Relationship to prior methods”). None of the top five prior estimators we reviewed have associated confidence intervals, and thus we now provide confidence intervals for r̂²_ER.
Fig 4. Comparison of r̂²_ER with published estimators of r²_ER on the basis of simulated and real data.
(A) Low SNR (0.25) simulation where the estimators on the vertical axis are sorted from top to bottom by smallest MSE with respect to estimating r²_ER. Traces show mean and SD of each estimator. (B) Same simulation at higher SNR (1.0) but same m, n. (C) Estimated fit of the DNN to V4 data by r̂²_ER and published estimators. Each trace is the estimated fit of the model for one neural recording.
Confidence intervals for r̂²_ER
In order to interpret point estimates such as r̂²_ER, it is important to be able to meaningfully quantify uncertainty about the estimate relative to the true parameter r²_ER. An α-level confidence interval (CI) provides an interval that will contain the true parameter α × 100% of the time for IID estimates. We considered three typical generic approaches to forming CIs for r²_ER: the non-parametric bootstrap, the parametric bootstrap, and BCa [11]. We found all of these methods to be lacking because they did not achieve the desired α in simulations with ground truth. Motivated by these problems, we developed a novel Bayesian method. We first recount the issues we found with the bootstrap methods and then provide a basic account of the Bayesian method we use throughout the paper. For more detailed exposition, see Methods: “Quantifying uncertainty in the estimator”.
The non-parametric bootstrap is a commonly used method to approximate CIs. In our case, it involves randomly re-sampling with replacement from the n trials in response to each of the m stimuli and then calculating r̂²_ER for the bootstrap sample. Repeating this many times allows the quantiles of the bootstrap distribution of r̂²_ER to be used as CIs. We applied this method across a simulated population of 3000 neurons with m = 40 and n = 4 and found that it suffered from two problems. First, the CIs were not centered around the true r²_ER; specifically, the interval was too low (Fig 5A), with the upper and lower bounds of the interval (orange and blue traces, respectively) almost always falling below the true value (green). Secondly, as the true r²_ER increased from 0 to 1, CIs contained it at rates far lower than the desired α = 0.8 (Fig 6, cyan trace; open circles under the trace indicate a significant difference, p < 0.01, Bonferroni-corrected z-test). Thus at practically all levels of correlation, the non-parametric bootstrap performs poorly. The problem is a result of the empirical distribution being centered on the sample means, whose correlation with the model is typically much lower than r²_ER. To overcome this, we turned to the parametric bootstrap, where we could explicitly estimate r²_ER with our estimator r̂²_ER. This method approximates CIs by estimating the parameters of an assumed distribution from which samples are generated. In our case it involves estimating σ2, r̂²_ER, and the variance of the neuronal tuning curve, d2 (see Results, “Signal-to-noise ratio as recording quality metric”), and then simulating observations from the distribution with these parameters to calculate r̂²_ER. Drawing many such bootstrap estimates, we again use the sample quantiles as CI estimates. Fig 5B shows that this overcomes the main failure of the non-parametric bootstrap, but the method tended to be too conservative for low r²_ER values (Fig 6, red trace below 0.8 at left side) and too liberal for high values (red trace above 0.8 at right side). Deviations such as these are well known for bootstrap percentile methods when the variance is a non-constant function of the mean and/or the distribution of the estimator is skewed [11]. A correction to the bootstrap, the bias-corrected and accelerated bootstrap (BCa), can help ameliorate these issues by implicitly approximating the skewness and the mean-variance relationship from bootstrap samples. We employed BCa with our parametric bootstrap and found that performance improved relative to the parametric bootstrap (Fig 6, green trace closer to the desired α than red for low r²_ER) but still deviated from the desired α for low and high r²_ER.
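For illustration, here is a minimal sketch of the parametric bootstrap percentile interval described above, reusing the r2_er and simulate_dataset helpers from the earlier sketches (the plug-in formula for d2 and the clipping of the plug-in estimate to [0, 1] are our own choices):

```python
import numpy as np

def parametric_bootstrap_ci(Y, nu, level=0.8, n_boot=2000, seed=0):
    """Percentile CI for r2_ER: estimate (sigma2, d2, r2_ER) from the data,
    resimulate datasets with those plug-in parameters, and take quantiles of
    the re-estimated values (reuses r2_er and simulate_dataset from above)."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    sigma2_hat = Y.var(axis=1, ddof=1).mean()               # trial-to-trial variance
    d2_hat = max(Y.mean(axis=1).var()
                 - (m - 1) / m * sigma2_hat / n, 1e-9)       # noise-corrected dynamic range
    rho2_hat = float(np.clip(r2_er(Y, nu), 0.0, 1.0))
    boot = [r2_er(*simulate_dataset(rho2_hat, d2_hat, sigma2_hat, m, n, rng))
            for _ in range(n_boot)]
    return np.quantile(boot, [(1.0 - level) / 2.0, (1.0 + level) / 2.0])
```

As Fig 6 shows, percentile intervals of this kind can be miscalibrated at the extremes of r²_ER, which is what motivates the ECCI method described next.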
Fig 5. Validation of confidence interval (CI) methods by simulation—example CIs for three methods.
Simulation parameters: n = 4, m = 40, true r²_ER = 0.91, dynamic range d2 = 0.25, trial-to-trial variability σ2 = 0.25, and target confidence level α = 0.8. Of 2000 independent simulations, CIs for the first 100 are plotted here for three different methods. CIs for all methods were calculated using the same set of randomly generated responses. (A) For the non-parametric bootstrap method, the upper end (orange) and lower end (blue) of the CI were almost always both below the true correlation value (0.91, green line), indicating an overwhelming failure to achieve 80% containment of the true value. (B) The parametric bootstrap method and (C) our ECCI method perform substantially better. Performance of all three methods over the full range of true r²_ER is plotted in Fig 6.
Fig 6. Comparison of four methods for computing confidence intervals for r²_ER spanning the full range of true correlation.
The fraction of times the CI contained the true value is plotted for each method (see line style inset) as a function of the true correlation value, r²_ER, at 100 values linearly spaced between 0 and 1. The target α-level was 0.8. Open circles indicate that the fraction deviated from 0.8 significantly (p < 0.01, Bonferroni corrected).
We aimed to create a CI with better α-level performance. To do this, we assumed uninformative priors on the parameters σ2 and d2 so that, conditioned on estimates of these parameters, we can draw from the distribution of r̂²_ER for an arbitrary r²_ER (see Methods, “Confidence intervals for r̂²_ER”). This allows us to compute the highest true r²_ER that would have given the observed r̂²_ER or a lower value in α/2 proportion of IID samples. We take this as the high end of our CI. Similarly, we determine the low end of the CI as the lowest r²_ER that produces a value greater than or equal to the observed r̂²_ER in α/2 of the samples. In Methods we give conditions under which this procedure will provide α-level CIs (see Methods, “Confidence intervals for r̂²_ER”). In our simulations, this method consistently achieves the desired α at all levels of r²_ER (Fig 6, orange trace). We use this CI method, which we call the estimate-centered credible interval (ECCI), throughout the rest of the paper.
Application of the r̂²_ER estimator to MT data
We have shown in simulation that the use of r̂² introduces ambiguity as to whether a low correlation value is the result of trial-to-trial variability or a poor model, whereas r̂²_ER removes this ambiguity. Here we demonstrate with neural data how this, in tandem with confidence intervals, allows investigators to distinguish between units that systematically deviate from model predictions and those that simply have noisy responses. We re-analyzed data from single neurons in the visual cortical motion area MT responding to dot motion in eight equally spaced directions [12, 13]. A classic model of these responses is a single-cycle sinusoid as a function of the direction of dot motion, with free parameters for phase, amplitude, and average firing rate. We chose this MT data set as the first example application because it has a high number of repeats (n = 10) and a low-dimensional model; thus it is simple to visually inspect whether the neuronal tuning curves agree with the model predictions.
An example MT neuron direction tuning curve (Fig 7A, orange trace) has an excellent sinusoidal fit (blue trace), as reflected in its estimated r̂²_ER. Furthermore, the short confidence interval ([0.99, 1.0]) quantifies the lack of ambiguity about the quality of the fit. On the other hand, the tuning curve of a neuron with a low r̂²_ER (Fig 7B) shows a clear systematic deviation from the least-squares fit. Here the tuning curve is double-peaked and thus largely orthogonal to any single-cycle sinusoid. It is important to notice that this neuron has far lower SNR (2.8 here vs. 20 for the example in A), as quantified by our SNR estimator (Eq 16), which corrects for trial-to-trial variability (described below and defined in Methods, “Estimators of correction terms”). Thus without r̂²_ER, there would be plausible doubt about whether the correlation was lower because of noise or because of systematic deviation. Furthermore, with low SNR it would be plausible that the estimate itself is noisy (Fig 3A), but the short confidence interval ([0.01, 0.11]) unambiguously characterizes the fit as being systematically poor.
Fig 7. Applying our unbiased estimator with CIs to fit four example MT neuronal direction tuning curves to a sinusoidal model.
(A) Example neuron tuning curve (orange trace with SEM bars) with excellent fit to the sinusoidal model (blue trace), high SNR, and tight CI (parameters specified above the plot panel). (B) Example neuron with poor fit to the sinusoidal model but with a reasonable SNR and narrow CI that provide confidence that the neuronal tuning systematically deviates from the model. (C) Example neuron with poor SNR and a wild estimate of r²_ER, reflected in the large CI = [0.3, 1], suggesting that no conclusion can be made about how well the model describes any actual tuning here. (D) Example neuron with a seemingly reasonable estimate, but the low SNR and a CI covering the entire interval [0, 1] reveal that this fit cannot be trusted.
In some cases, neurons show little tuning for direction and thus have very low SNR over a set of directional stimuli. This in turn can cause r̂²_ER to give wild estimates (Fig 7C). If we truncate the value to the nearest possible value of r²_ER, we might be tempted to interpret this as a well-fit, direction-selective neuron. But the CI covers most of the interval of possible values ([0.3, 1]), making it clear that little information can be gleaned about r²_ER from these data. Extreme values themselves can indicate when the estimator is unreliable, but even a reasonable-seeming value (Fig 7D) can be unreliable when the SNR is low (0.12). In this case, the confidence interval covers the maximal range ([0, 1]), indicating that the point estimate is unreliable. Thus, r̂²_ER and its associated confidence interval quickly and unambiguously show how well the model fits the MT data, avoiding the tiresome and unreliable process of judging each fit by eye for the 162 neurons.
While we have shown to a good approximation that r̂²_ER is unbiased and that its expected value is largely invariant to SNR, this is definitely not the case for the variance of the estimator. Fig 3A shows clearly that the variability of the estimator is larger for lower SNR. This fact should be kept in mind when interpreting the spread of r̂²_ER values. For example, we calculated r̂²_ER and confidence intervals across our entire population of MT neurons. Of the estimates with high SNR (Fig 8, right side, SNR above the median), most neurons are well fit by the model and only a few have less than 3/4 of their variance accounted for (8/81). For the estimates with low SNR (left side of Fig 8), this fraction is substantially higher (39/81), but the increased variability of these estimates will spread out the distribution, thus this difference in quantiles must be interpreted carefully. When estimating population dispersion, conclusions may be confounded by the SNR-dependence of the variability of r̂²_ER.
Fig 8. Confidence intervals (α = 90%, vertical lines) and point estimates (red dashes) for r̂²_ER across all MT neuron direction tuning curves fit to the sinusoidal model.
Data points are grouped into two intervals on the basis of the SNR of the direction tuning curves being less than or equal to, or greater than, the median value (3.5), revealing that lower SNR (left interval) is associated with much longer CIs.
Comparing the naive r̂² to our unbiased r̂²_ER (Fig 9), the high-SNR units (red points) lie close to the diagonal. Thus, for these units, one could exchange the two estimates and come to similar general conclusions about model fits. The utility of r̂²_ER is that it removes ambiguity about whether trial-to-trial variability may be spuriously pushing fits down (black points). The interpretation of the naive estimator remains ambiguous for any given unit until it can be confirmed that it does not suffer from this issue.
Fig 9. Relationship between the naive r̂² and corrected r̂²_ER for fits of the sinusoidal model to MT data.
Units with SNR greater than the median across the population (3.5) are plotted in red and those less than or equal to the median in black.
The MT data considered here have relatively few stimuli and many repeats, but other experimental paradigms involve a larger number of stimuli and, consequently, fewer repeats. Below we apply r̂²_ER in these more challenging conditions.
Application of the r̂²_ER estimator to V4 data
The primate mid-level visual cortical area V4 is known to have complex, high-dimensional selectivity for visual inputs. To rigorously assess models of neuronal responses in areas like V4, validation needs to be performed on responses to a large corpus of natural images to ensure that models capture ecologically valid selectivity [14, 15]. Thus, the number of unique stimuli, m, will be large at the expense of having relatively few repeats, n, and SNR can be low because stimuli are not customized to the preferences of a given neuron. These are the challenging conditions under which r̂²_ER avoids the major confounds of r̂². Here we estimate r̂²_ER and associated 90% confidence intervals for a model that won the University of Washington V4 Neural Data Challenge by most accurately predicting single-unit (SUA) and multi-unit activity (MUA) for held-out stimuli (see Methods, “Electrophysiological data”). Plotting r̂²_ER against r̂² (Fig 10A) shows that the corrected estimates are higher than the naive estimates (points lie above the diagonal line). Using r̂²_ER here is important because it provides confidence that the poor fit quality is not a result of noise and that the best performing model often did not explain more than 50% of the variance in the tuning curve.
Fig 10. Applying r̂²_ER to analyze the performance of a deep neural network (DNN) in predicting V4 responses to natural images.
(A) For single-unit (orange) and multi-unit (blue) recordings, r̂²_ER is plotted against the naive r̂². The relatively short α = 0.1 CIs (vertical bars) suggest that most of these correlation values are trustworthy. (B) The mean r̂²_ER value across multi-unit recordings (horizontal blue line) is significantly higher than that for the set of single-unit recordings (orange horizontal line; Welch’s t-test, t = 3.7, p = 0.005). Because individual estimates are asymptotically unbiased, the group average inherits this lack of bias.
While we have examined r̂²_ER for individual recordings, it can also be useful to estimate the average quality of model fit across a population of neurons. Since the individual estimates are unbiased, the group average is also an unbiased estimate of the population mean r²_ER. We computed such group means for the single-unit and multi-unit V4 recordings (Fig 10B) and found that the model performed significantly better in predicting the responses of multi-unit activity (Welch’s t-test, p = 0.005; MUA mean = 0.57, SUA mean = 0.35). If instead the naive r̂² were used, this finding could have been dismissed as the result of MUA having higher SNR and thus naturally higher r̂². As it stands, this interesting observation can be followed up to potentially gain insight about the structure of selectivity across multiple units recorded nearby in V4.
Finally, this V4 data set provides a good example of how using r̂²_ER can allow testing of a larger stimulus space, as predicted by the simulations above in Fig 3B. Fig 11 shows that with r̂²_ER (solid lines), on average two trials are enough to estimate the true correlation, whereas the naive estimator requires more repeats (higher n) to converge. For example, for recording 1 (red), r̂²_ER (solid line) on average predicts the same quality of model fit for two or more stimulus repeats, whereas even after six repeats the naive r̂² has not converged.
Fig 11. Relationship of the naive r̂² and corrected r̂²_ER with n, the number of repeats, for the V4 data.
Different colors indicate different recordings. Solid lines show the average r̂²_ER across random shuffling of trials (with replacement); vertical bars indicate SD. Dashed lines show the average naive r̂².
Signal-to-noise ratio as recording quality metric
We have shown above that correcting for bias in r2 is important, but it is also critical to recognize when recordings are so noisy that they are effectively useless for evaluating a model. Here we demonstrate the use of the signal-to-noise ratio (SNR) as a quality metric to help make this determination. We define the SNR for a neuronal tuning curve to be the ratio of the variation in the expected response across stimuli to the trial-to-trial variability across repeats:
$$ \mathrm{SNR} = \frac{\frac{1}{m}\sum_{i=1}^{m}(\mu_i-\bar{\mu})^2}{\sigma^2} \tag{5} $$

where μi is the expected response to the ith stimulus and μ̄ is the mean of the expected responses across the m stimuli. For experimental data, we do not know μi in Eq 5, and rather than substituting the sample estimates, Yi, which would give an inflated estimate, we use an equation that corrects for trial-to-trial noise (Eq 16, Methods). The removal of this bias in our SNR metric allows for direct comparisons between studies with different numbers of repeats and amounts of trial-to-trial variability. This is appropriate because SNR is not a function of n or m; rather, it can vary across neurons, sets of stimuli, and recording modalities, as we show below.
We examined a diverse collection of neural data sets (see Methods, “Electrophysiological data”) and found wide variation in SNR both within and across the data sets (Fig 12A). At the low end, calcium imaging data from cortical neurons in area VISp of mouse responding to gratings (pink trace, N = 40,520 neuronal ROIs, [16]) had a median SNR of 0.01, while at the high end, MT neurons responding to dot motion [12] had a median of 3.5 (blue trace, N = 162). A stimulus protocol nearly identical to that used for the VISp Ca2+ data (pink and gray traces for gratings and natural images, respectively) was used to collect the Allen Institute Neuropixel electrode data [17] (purple and brown traces, N = 2,015); however, the Ca2+ data had substantially lower median SNR (0.01 and 0.02) compared to the electrode data (0.12 and 0.19), suggesting that this difference relates to the recording modality.
Fig 12. A comparison of our data quality metric, the signal-to-noise ratio estimator (Eq 16), across several datasets.
(A) The cumulative distribution of estimated SNR under the original experimental protocols. Traces with the same line thickness have similar values of n and m. Thick line (blue): MT data, n ≈ 10, m = 8. Medium lines (green, orange, red): V4 data, n ≈ 5, m ≈ 350. Thin lines: Allen Inst. data, n ≈ 50, m ≈ 120. The Allen Inst. data include two recording modalities: extracellular action potentials (spikes) on Neuropixel probes (NP) and two-photon calcium imaging (Ca). Both were recorded for the same stimuli: natural scenes and gratings (see Methods, “Electrophysiological data”). (B) Distribution of estimated SNR after normalization with respect to the duration of the spike counting window (traces for the calcium signal are not included). The normalization assumes that the original average spike rate can be applied to a 1 s counting window. But, if firing rates tend to decay over time, this will produce overestimates for recordings shorter than 1 s and underestimates for recordings longer than 1 s.
In the case of spiking neurons, SNR can be improved by increasing the stimulus duration and thus the spike counting window. Under the generally optimistic assumption that the spike rate stays constant in the counting window, and assuming that the spike counting window could be changed given experimental constraints, we can normalize SNR across the data sets to what it would have been had all spike count windows been 1 second long (Fig 12B). Under these assumptions, this normalization allows us to examine SNR differences across studies with the counting window length removed as a factor. We find this reduces the differences in SNR across the spiking data sets (the six right-most traces); thus the outstanding SNR of the MT data set could potentially have been achieved if spike count windows had been longer in the other experiments. Still, of the spiking data, the Allen Neuropixel data have the lowest medians, thus additional efforts to ameliorate low SNR (via the number of trials or stimulus choice) could be utilized. Furthermore, the assumption of a constant spike rate will hold to different degrees: neural responses can peak shortly after stimulus onset and then return close to baseline. Thus, different experimental conditions call for different standards for the number of trials and stimulus duration needed to adequately characterize a tuning curve.
To provide concrete meaning to SNR values, we suggest interpreting them in terms of the number of trials (m and n) needed to reliably detect stimulus modulation in an F-test. Specifically, for a given m and n we computed the minimal SNR required to achieve a high probability (β = 0.99) of rejecting the null hypothesis that the mean response to all stimuli is the same (see Methods, “SNR relation to F-test and number of trials”, Eq 22). We plot a color map of this minimal SNR as a function of m and n (Fig 13), where the diagonal grey contour lines indicate a fixed total number of trials (mn) for different m:n ratios. In general, as the total number of trials increases (moving perpendicular to the grey diagonals toward the upper right), the SNR required for reliable tuning curve estimates decreases. The SNR threshold is lower when n is favored over m for the same number of total trials, i.e., the SNR threshold iso-contours have steeper slopes than the grey diagonals.
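The exact derivation is given in Methods (Eq 22); as a rough stand-in, the standard one-way-ANOVA power calculation below captures the same relationship between SNR, m, and n (the significance level of 0.01, the per-stimulus normalization of SNR, and the function names are our assumptions, so the numbers will not exactly match Fig 13):

```python
import numpy as np
from scipy import stats

def ftest_power(snr, m, n, alpha=0.01):
    """Power of a one-way ANOVA F-test for stimulus modulation, with SNR taken
    as (1/m) * sum_i (mu_i - mu_bar)^2 / sigma^2, m stimuli, n repeats each."""
    dfn, dfd = m - 1, m * (n - 1)
    nc = n * m * snr                               # noncentrality parameter
    f_crit = stats.f.ppf(1.0 - alpha, dfn, dfd)
    return stats.ncf.sf(f_crit, dfn, dfd, nc)

def minimal_snr(m, n, beta=0.99, alpha=0.01):
    """Smallest SNR at which the F-test rejects the (false) null hypothesis of
    equal means with probability >= beta (bisection on the monotone power)."""
    lo, hi = 1e-6, 100.0
    for _ in range(60):
        mid = np.sqrt(lo * hi)
        if ftest_power(mid, m, n, alpha) >= beta:
            hi = mid
        else:
            lo = mid
    return hi

# e.g., minimal_snr(362, 4) for a V4-like design, minimal_snr(8, 10) for the MT design
```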
Fig 13. The minimal SNR needed to reliably detect tuning as a function of m, the number of unique stimuli, and n, the number of repeats of each stimulus.
White arrows indicate the approximate location in (m, n) corresponding to the datasets used in Fig 12. Gray diagonal lines indicate constant number of total trials (n × m).
On this map, we can locate points corresponding, roughly, to the m and n for the data sets in Fig 12. The three V4 data sets have about the same number of stimuli and repeats (arrow marked “V4”, Fig 13), and thus require SNR ≈ 0.1 or greater, implying that from 3% to 23% of the V4 data does not pass the criterion (Fig 12A, red and green traces, respectively, define the endpoints). The MT data has the fewest total trials and thus the highest threshold, SNR ≈ 0.5, which leaves 10% of the neurons with poorly estimated tuning curves. If more stimuli had been used at the expense of fewer repeats, say n = 2 and m = 40, then only a quarter of the neurons would have exceeded the increased threshold of SNR > 1. The Allen Ca2+ and spike data sets both had similar m and n. Relative to the other data sets they had far more total trials and a greater number of repeats, thus the SNR criterion is substantially lower (SNR > 0.01). Still, for the Ca2+ data, ∼ 37% of the grating and ∼ 25% of the natural image data did not have reliable tuning (Fig 12A, thin pink and grey traces). The Allen spiking data, on the other hand, had much higher SNR, and thus more trials could have been spent on expanding the stimulus set and fewer on repeated presentation (Fig 12A, thin brown and purple traces).
We have shown SNR can be employed as a simple metric with a concrete interpretation to judge data quality across different organisms, recording modalities and brain regions for the purpose of making comparative analyses and setting aside data that has little or no power. The expected distribution of SNR, based on prior data, can be taken into account when choosing n and m to achieve a criterion level of statistical power for an experiment. If SNR is high, recording time can be reduced by keeping n and m low, or a larger stimulus set can be explored by increasing m at the expense of n.
Discussion
Summary
We have investigated the estimation of the correlation between a model prediction and the expected response of a neuron. Although it has long been established that trial-to-trial variability will cause the classic estimator, Pearson’s r̂², to underestimate this correlation, there has been no direct comparison of prior methods to account for this confound. We found that some methods grossly overestimate correlation in high noise conditions, and we built upon the best performing method to derive a more generally applicable estimator, r̂²_ER, that performs as well as or better than prior methods. We analytically validated r̂²_ER by determining that it is a consistent estimator in the number of stimuli. We found in simulation that it had a small upward bias, but this was only appreciable at very high noise levels. None of the prior methods that we examined had validated confidence intervals, thus we developed confidence intervals for r̂²_ER. Motivated by the failure of generic bootstrap methods to achieve satisfactory coverage, we developed a confidence interval method that outperformed them.
Applying our estimator to neural data, we demonstrated its essential value. In the case of the MT recordings, it was able to unambiguously distinguish neurons for which a sinusoidal model was a good fit from those for which it was a poor fit specifically because of systematic deviation and not because of noise. The associated confidence intervals allowed the systematic identification of noisy recordings that served no practical use in assessing the fit of the model. Poor model fits caused by noise vs. those caused by systematic differences in selectivity have very different interpretations, yet the traditional r̂² does not differentiate between them while r̂²_ER does.
Application of the estimator to the winning UW neural data challenge model, a deep neural network (DNN), provides the only validated assessment of state-of-the-art predictive model performance in V4. The estimator, along with its CIs, identified neurons that were challenging for the DNN and perhaps require a different modeling approach. It also validated the existence of single units that had nearly 50% of their variance explained, indicating that the DNN functionally captured a substantial part of what these units encode across natural images and thus could provide real insight into naturalistic V4 single-unit encoding. On a practical level, we showed how the estimator allows for gains in trial efficiency since it converges more rapidly than r̂² (Figs 3B and 11). This is important when many stimuli are needed to validate models of high-dimensional neural tuning.
Our tests on experimental data revealed that some neurons had confidence intervals covering the entire range of possible values, motivating us to propose the signal-to-noise ratio (SNR) as a metric of neural recording quality in the context of model fitting. We provide an unbiased estimator of SNR (Eq 16) and a practical interpretation: for a given number of stimuli and repeats, the SNR should be sufficient to reliably detect stimulus-driven response modulation on the basis of an F-test (Eq 22). Examining a variety of data sets, we found differences with respect to how the numbers of stimuli vs. repeats (m vs. n) were balanced, revealing how adjustments can be made on the basis of SNR to improve experimental efficiency. We also found large differences in SNR across data sets that are likely related to recording modality (e.g., Ca2+ imaging vs. electrode recording), which could be important for selecting appropriate experimental approaches and for guiding the refinement of current techniques to improve SNR.
Interpretation of r²_ER
We have introduced an estimator and confidence intervals for the correlation between the true tuning curve of a neuron (its expected value across stimuli) and model predictions: r²_ER. In the context of sensory neurophysiology, we believe it is reasonable to think of r²_ER as reflecting solely how well a model explains a sensory representation. We justify this by the fact that r²_ER is solely a function of E[Response|Stimulus] and thus solely a function of the stimuli. We note two caveats: (1) non-sensory signals can influence sensory responses, e.g., eye movements, which may be stimulus dependent, and (2) E[Response|Stimulus] is not the only component of the sensory response, e.g., variability can also be stimulus dependent [18].
Relationship of r̂²_ER to ϒ
We have taken a similar approach to Haefner and Cumming [6], and commensurately their estimator gives nearly identical results to ours in simulation (less than 0.0001% power unexplained for SNR = 1 vs. >1% for the other estimators), though we provide an important generalization. Their formula requires the calculation of the sample variance because their derivation relies on the F-distribution formed by taking the ratio of the sum of squares of the model residual over the sample variance (see Methods, ϒ). This is problematic if stimuli are never repeated in an experiment (for example, in free-viewing experiments); then one has to assume the trial-to-trial variability a priori, either from previous experimental measurements or by asserting a theoretical mean-variance relationship (e.g., the square root of Poisson distributed spiking gives σ2 ≈ 0.25).
Haefner and Cumming’s estimator is more general than the r̂²_ER we have presented. The r̂²_ER estimate measures the variance of the mean-centered data explained by a single covariate, whereas ϒ measures the variance explained by a linear combination of up to m covariates. In particular, ϒ accounts for the decrease in degrees of freedom available to the noise as more covariates are added. We also provide a more general version of our estimator for the case of variance explained by a linear model (Eq 23), with the advantage discussed in the previous paragraph.
SNR
We found that differences in SNR can be substantial and widely varying across neurons, data sets, and recording methods. Given the rise of large-scale recordings and the sharing of neuronal data, we believe unbiased estimates of SNR should be reported so that researchers can quickly judge whether a data set has sufficient statistical power or whether its power is on par with that of data sets from potentially comparable studies. We provide a concrete criterion by which to interpret SNR: the statistical power to detect stimulus-driven response modulation. Strikingly, in our small sample of data sets, many neurons do not pass this criterion, suggesting that the adoption of a standard criterion for data quality, such as our SNR metric, could have a major impact in practice. Furthermore, guided by the metric, the experimentalist can take steps to improve SNR by increasing stimulus duration and the associated spike counting windows or by customizing stimuli to the preferences of a neuron. On the other hand, the deleterious effects of low SNR can be ameliorated by favoring repeats over the number of stimuli (Fig 13).
One conceptual interpretation of the SNR metric we introduced is that it quantifies, for the time scale of the spike count window, the overall variance in the responses of the neuron attributable to the tuning curve of the neuron vs. trial-to-trial variability about that tuning curve. For example, on the time scale of 1 second, a large fraction of spike-based recordings had SNR > 1, indicating that more variance was caused by the stimulus than by other sources (Fig 12B, blue, orange, and green traces: median SNR > 1). Still, an appreciable number of neurons were dominated by their trial-to-trial variability. Whether this is the result of stimulus choice, and would perhaps be different in a more natural context, is an open question. Recent theoretical and experimental work has argued that weakly tuned and untuned neurons can contribute to sensory encoding [19–21]. The corrected estimate of SNR we provide (Eq 16), along with naturalistic stimulation, can help to identify such neurons.
Further work
Small improvements to our estimator could be made by decreasing its bias in the case of very low SNR (see Methods, “Bias of r̂²_ER”). In the case of very low SNR, a single neuronal recording has little inferential power, but across a population of neurons, estimates of the average correlation to a model’s predictions can have low enough variance to provide useful inference. Yet at very low SNR an appreciable bias begins to appear that will remain in the population average. We showed this bias is a function of the covariance between the numerator and the inverse of the denominator of r̂²_ER, and of Jensen’s gap, the difference between the expectation of the reciprocal of the denominator and the reciprocal of its expectation (Eq 17). The former covariance can be removed by using separate subsets of the data for estimation of the numerator and denominator. To reduce the influence of Jensen’s gap, further work could attempt to directly estimate and correct for its value. In addition, analytic results on how this small-sample bias varies as a function of the critical parameters m, n, and SNR would be helpful for its interpretation.
In the derivation of our estimator we assume the m responses across which the model predictions are evaluated are independent. Thus the estimator in its current form would not be appropriate for evaluating models that make predictions across spike counts in adjacent time bins. In future work we plan to extend our estimator to the case of correlated responses.
Here we have derived an estimator for the case where deterministic model predictions are correlated to a noisy signal. Often, one noisy signal is correlated to another, for example when judging the similarity of tuning curves from two neurons (termed signal correlation). We have extended the methods described here to the neuron-to-neuron case and will describe this in a forthcoming publication.
A subtle but important point about our estimator is that it assumes stimuli are fixed: it estimates r²_ER for the exact stimuli shown. An investigator may be interested in the quality of a model across a large corpus of natural images of which only a small fraction can be included in a recording session. In this case, one collects responses for a random sample of images, fits the model to some (training set), and tests the model on others (test set). The random test set will account for over-fitting, and using r̂²_ER will account for noise in the neural responses in the evaluation on the test set. Crucially, though, this does not account for the variability in the parameters of the model induced by the random training sample. Intuitively, estimated model parameters will vary across image sets even in the absence of trial-to-trial variability. The correct interpretation of r̂²_ER in this case is that it estimates how well a model can perform given finite noisy training data on noiseless test-set data, and not the best the model could possibly perform given infinite training data. Indeed, with more neural responses and less noise, model test-set performance would improve. David and Gallant [5] explored this issue, calling it ‘estimation noise’, and provided an extrapolation method for estimating the fit of a linear model given unlimited stimuli. Their estimator was not evaluated in terms of its bias or variance, and no analytic solutions that directly remove the bias of finite training data have been proposed. Both are valuable directions to pursue: the former to build confidence in the current method and the latter for potential gains in trial efficiency. A data-driven re-sampling approach may be unavoidable when evaluating complex models where the relationship between the amount of training data and model performance is analytically intractable, such as a deep neural network or a biophysical model.
Materials and methods
Simulation procedure
To simulate model-to-neuron fits, the square roots of the neural responses, ri,j, for the ith of m stimuli and the jth of n trials are modeled as independent, normally distributed responses:

$$ Y_{i,j} = \sqrt{r_{i,j}} \sim \mathcal{N}\!\left(\mu_i, \sigma^2\right) \tag{6} $$

where the variance σ2 is the same across all Yi,j. The mean response of the neuron to the ith stimulus (the tuning curve, μi) is a single-cycle sinusoid (Fig 1A, green trace, solid dots) whose correlation with the model predictions (red trace, solid dots) is estimated; the true correlation is r²_ER = cos²(θ), where θ is the phase offset between the tuning curve and the model. The results of the simulation are only a function of the magnitude of the centered vector of expected responses, d2, the correlation between model prediction and tuning curve, m, n, and σ2; thus the exact form of the model and true tuning curve is arbitrary. We chose a sinusoid for the simplicity of adjusting the phase, θ, to simulate different values of r²_ER.
From this model we draw n responses for each of the m stimuli and apply our estimator to this sample. We repeat this across many IID simulations to accumulate reliable statistics.
Assumptions and terminology for derivation of unbiased estimators
Below we derive an unbiased estimator of the fraction of variance explained when a known signal is being fit to noisy neural responses. For this derivation, we assume the responses have undergone a variance stabilizing transform such that trial-to-trial variability is the same across all stimuli. For example, if the neural responses are Poisson distributed, Yi,j ∼ P(λi), where Yi,j is the response to the jth repeat of the ith stimulus, which has expected response λi, then a variance stabilizing transform is the square root. In particular, if λi is not too small, then,

$$ E\!\left[\sqrt{Y_{i,j}}\right] \approx \sqrt{\lambda_i} $$

and

$$ \mathrm{Var}\!\left(\sqrt{Y_{i,j}}\right) \approx \tfrac{1}{4}. $$

The expected value of the transformed response, √λi, still increases with λi, whereas the variance is now approximately constant. To improve the estimate of the mean response, n repeats of each stimulus are collected. Invoking the central limit theorem, we can make the approximation:

$$ Y_i = \frac{1}{n}\sum_{j=1}^{n}\sqrt{Y_{i,j}} \;\sim\; \mathcal{N}\!\left(\sqrt{\lambda_i},\, \tfrac{1}{4n}\right), $$

where the average across the n repeats is approximately normally distributed with variance decreasing with n. The assumption of a Poisson distributed neural response is not always accurate. A more general mean-to-variance relationship,

$$ \mathrm{Var}\!\left(Y_{i,j}\right) = a\,\big(E[Y_{i,j}]\big)^{b}, $$

can be approximately stabilized to 1 by,

$$ f(y) = \frac{y^{\,1-b/2}}{(1-b/2)\sqrt{a}}. $$

A square root will stabilize any linear mean-to-variance relationship (b = 1), but an unknown slope, a, requires that this parameter be estimated. In the case of the linear relationship, this simply requires taking a square root and then averaging the estimated variance, which is constant, across all stimuli. To account for more diverse mean-to-variance relationships, the Box-Cox technique can be used to find an appropriate exponent by which to transform the data [22]. For the derivation below, we assume that variance-stabilized responses to n repeats have been averaged for each of m stimuli to yield the mean response to the ith stimulus, Yi ∼ N(μi, σ2/n), where σ2 is the trial-to-trial variability and μi the ith expected value.
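As one rough recipe for applying this in practice, the sketch below estimates the exponent b (and scale a) with a log-log regression across stimuli and then applies the transform above; this regression shortcut is our own illustration and is not the Box-Cox procedure cited in the text:

```python
import numpy as np

def stabilize(R, eps=1e-9):
    """Approximate variance stabilization of an m x n response matrix R.
    Fits Var ~ a * Mean^b across stimuli by log-log regression (an illustrative
    choice), then applies y -> y^(1 - b/2) / ((1 - b/2) * sqrt(a)), which has
    roughly unit variance; b near 2 falls back to log(y) / sqrt(a)."""
    mean = R.mean(axis=1) + eps
    var = R.var(axis=1, ddof=1) + eps
    b, log_a = np.polyfit(np.log(mean), np.log(var), 1)   # slope, intercept
    a = np.exp(log_a)
    if np.isclose(b, 2.0, atol=0.05):
        return np.log(R + eps) / np.sqrt(a)
    p = 1.0 - b / 2.0
    return (R + eps) ** p / (p * np.sqrt(a))

# For Poisson-like counts (a ~ 1, b ~ 1) this reduces to ~2 * sqrt(R), whose
# trial-to-trial variance is ~1; a plain sqrt(R) has variance ~1/4 instead.
```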
Unbiased estimation of r2
Given a set of mean neural responses, $\bar{Y}_{i\cdot}$, and model predictions, νi, the naive estimator, $\hat{r}^2$, is calculated as follows:

$$\hat{r}^2 = \frac{\left(\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)\right)^2}{\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2\ \sum_{i=1}^{m}\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2}, \tag{7}$$

where $\bar{\nu}$ and $\bar{Y}_{\cdot\cdot}$ are means taken over stimuli. Our goal is to find an estimator, $\hat{r}^2_{\mathrm{ER}}$, such that,

$$\mathrm{E}\left[\hat{r}^2_{\mathrm{ER}}\right] \approx r^2_{\mathrm{ER}} = \frac{\left(\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)\left(\mu_i - \bar{\mu}\right)\right)^2}{\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2\ \sum_{i=1}^{m}\left(\mu_i - \bar{\mu}\right)^2}, \tag{8}$$

where $r^2_{\mathrm{ER}}$ is the correlation in the absence of noise, i.e., the fraction of variance explained by the model prediction, νi, of the expected response (ER), μi, of the neuron. Our strategy will be to remove the bias in the numerator and denominator separately and then reform the ratio of these unbiased estimators for an approximately unbiased estimator.
Unbiased estimate of numerator
The numerator of Eq 7 is a weighted sum of normal random variables that is then squared; thus it has a scaled non-central chi-squared distribution:

$$\left(\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)\right)^2 \sim \frac{\sigma^2}{n}\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2\ \chi^2_{1}\!\left(\lambda\right), \qquad \lambda = \frac{\left(\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)\left(\mu_i - \bar{\mu}\right)\right)^2}{\frac{\sigma^2}{n}\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2}, \tag{9}$$

and since $\mathrm{E}\left[\chi^2_{k}(\lambda)\right] = k + \lambda$, its expectation is:

$$\mathrm{E}\left[\left(\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)\right)^2\right] = \left(\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)\left(\mu_i - \bar{\mu}\right)\right)^2 + \frac{\sigma^2}{n}\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2. \tag{10}$$

In the final line, the term on the left is the desired numerator and the term on the right is the bias contributed by σ2. To form our estimator for the numerator of Eq 15, we simply subtract an unbiased estimator of this bias term from the numerator of the naive estimator (Eq 7):

$$\left(\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)\right)^2 - \frac{\hat{\sigma}^2}{n}\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2, \tag{11}$$

where $\hat{\sigma}^2$ is typically the sample variance, s2, estimated from the data, but it can be any unbiased estimator, even an assumed constant. For example, if stimuli are not repeated (i.e., n = 1) and one is willing to assume that responses are Poisson distributed, then the square root of these responses will give σ2 ≈ 1/4, and thus one can substitute $\hat{\sigma}^2 = 1/4$. The case for the denominator is similar.
Unbiased estimate of denominator
The denominator of Eq 7 is a weighted sum of squared normal random variables and thus also follows a scaled non-central chi-squared distribution:

$$\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2\ \sum_{i=1}^{m}\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2 \sim \sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2\ \frac{\sigma^2}{n}\ \chi^2_{m-1}\!\left(\frac{n}{\sigma^2}\sum_{i=1}^{m}\left(\mu_i - \bar{\mu}\right)^2\right), \tag{12}$$

with expectation,

$$\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2\ \sum_{i=1}^{m}\left(\mu_i - \bar{\mu}\right)^2 \;+\; \sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2\ (m-1)\frac{\sigma^2}{n}. \tag{13}$$

Similarly to the numerator, the first term is the desired denominator, and the second term is the bias. Thus, we subtract an unbiased estimate of this second term from the naive denominator:

$$\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2\left(\sum_{i=1}^{m}\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2 - (m-1)\frac{\hat{\sigma}^2}{n}\right). \tag{14}$$

Taking the ratio of these two unbiased estimators (Eqs 11 and 14) we have:

$$\hat{r}^2_{\mathrm{ER}} = \frac{\left(\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)\right)^2 - \frac{\hat{\sigma}^2}{n}\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2}{\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2\left(\sum_{i=1}^{m}\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2 - (m-1)\frac{\hat{\sigma}^2}{n}\right)}. \tag{15}$$

This equation can be further simplified by scaling the model predictions such that $\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)^2 = 1$.
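For reference, a compact Python/NumPy sketch of Eq 15 as reconstructed above; the function name and interface are illustrative, and the authors' released implementation (see Data Availability) should be preferred for actual analyses.

```python
import numpy as np

def r2_er_hat(Y, nu):
    """Approximately unbiased estimate of r2_ER (Eq 15, as reconstructed).

    Y  : (n, m) array of variance-stabilized responses (n repeats, m stimuli)
    nu : (m,) array of model predictions for the same m stimuli
    """
    n, m = Y.shape
    y_bar = Y.mean(axis=0)                       # trial-averaged responses
    s2 = Y.var(axis=0, ddof=1).mean()            # pooled trial-to-trial variance
    yc = y_bar - y_bar.mean()                    # centered trial means
    vc = nu - nu.mean()                          # centered predictions

    num = (vc @ yc) ** 2 - (s2 / n) * np.sum(vc ** 2)              # Eq 11
    den = np.sum(vc ** 2) * (np.sum(yc ** 2) - (m - 1) * s2 / n)   # Eq 14
    return num / den
```

Because the noise corrections are subtracted estimates, the result is not strictly confined to [0, 1] for very noisy data.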
Estimators of correction terms
Two important parameters, d2 and σ2, are unknown. Below we provide unbiased estimators of each of these terms. An unbiased estimate of the sample variance for trials of the ith stimulus is $s_i^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(Y_{i,j} - \bar{Y}_{i\cdot}\right)^2$, where the dot in the subscript of $\bar{Y}_{i\cdot}$ indicates the mean over repeats. Assuming the variance is the same across stimuli, we can average over i for a global estimate:

$$s^2 = \frac{1}{m}\sum_{i=1}^{m} s_i^2.$$

Throughout the paper we use this as our estimate of trial-to-trial variability, $\hat{\sigma}^2 = s^2$.
For d2 we have:

$$\sum_{i=1}^{m}\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2,$$

which would be inflated by trial-to-trial variability, so as an unbiased estimator we use,

$$\hat{d}^2 = \sum_{i=1}^{m}\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2 - (m-1)\frac{s^2}{n}.$$

We use this estimator to correct the estimate of SNR (Eq 5) for trial-to-trial variability as follows:
| (16) |
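A short Python/NumPy sketch of the two correction terms defined above (the helper name is illustrative; the SNR correction of Eq 16 itself is not reproduced here):

```python
import numpy as np

def correction_terms(Y):
    """Y: (n, m) array of variance-stabilized responses (n repeats, m stimuli).

    Returns (s2, d2_hat): the pooled trial-to-trial variance and the
    noise-corrected dynamic range described above."""
    n, m = Y.shape
    s2 = Y.var(axis=0, ddof=1).mean()            # average per-stimulus sample variance
    y_bar = Y.mean(axis=0)                       # trial-averaged responses
    d2_naive = np.sum((y_bar - y_bar.mean()) ** 2)
    d2_hat = d2_naive - (m - 1) * s2 / n         # remove expected inflation by noise
    return s2, d2_hat
```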
Bias of $\hat{r}^2_{\mathrm{ER}}$
To remove the bias of Pearson's $\hat{r}^2$, we follow the approach of subtracting off its effect in the numerator and denominator. Prior work has not examined a potential problem with this approach: the expectation of a non-linear transformation of a set of random variables is not necessarily the transformation of their expected values. In this particular case, writing the numerator of Eq 15 as $\hat{u}$ and its denominator as $\hat{v}$, the expectation of the ratio is not necessarily the ratio of the expectations: $\mathrm{E}\left[\hat{u}/\hat{v}\right] \neq \mathrm{E}\left[\hat{u}\right]/\mathrm{E}\left[\hat{v}\right]$. Thus, even though we have removed the bias in the numerator and denominator, it does not imply their ratio is unbiased. Calculating the expectation of the ratio, we see the conditions under which it will be unbiased:

$$\mathrm{E}\left[\frac{\hat{u}}{\hat{v}}\right] = \mathrm{E}\left[\hat{u}\right]\,\mathrm{E}\left[\frac{1}{\hat{v}}\right] + \mathrm{Cov}\left(\hat{u},\ \frac{1}{\hat{v}}\right). \tag{17}$$

Thus, $\hat{r}^2_{\mathrm{ER}}$ is unbiased if $\mathrm{Cov}\left(\hat{u}, 1/\hat{v}\right) = 0$ and $\mathrm{E}\left[1/\hat{v}\right] = 1/\mathrm{E}\left[\hat{v}\right]$, but we find in simulation that the covariance is often non-zero, and by Jensen's inequality $\mathrm{E}\left[1/\hat{v}\right] \geq 1/\mathrm{E}\left[\hat{v}\right]$.
Thus, if the estimator is not strictly unbiased for $r^2_{\mathrm{ER}}$, what recommends it over the naive $\hat{r}^2$? While our main argument is that, in simulation, it has a lower bias across typical ranges of parameters (Fig 3), it also has a theoretical justification. As we saw in simulation, as the number of stimuli, m, increases, its bias diminishes while that of $\hat{r}^2$ does not (Fig 3C). Convergence to the parameter of interest, otherwise known as consistency, gives a theoretical justification for an estimator. Below we show that $\hat{r}^2_{\mathrm{ER}}$ is consistent for $r^2_{\mathrm{ER}}$ while $\hat{r}^2$ is not.
We note that the covariance in Eq 17 can be removed by using separate subsets of the data for the estimation of $\hat{u}$ and $\hat{v}$. This leaves the inflation from Jensen's inequality, which could be estimated and corrected for via a simulation-based method such as the parametric bootstrap (see Discussion, "Further work").
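As an illustration of that first point, the sketch below estimates the numerator and denominator of Eq 15 from disjoint halves of the repeats. The splitting rule and function name are assumptions for the example (it is not the procedure used in the paper), and it requires at least four repeats.

```python
import numpy as np

def r2_er_hat_split(Y, nu, rng=None):
    """Split-trial variant: numerator and denominator of Eq 15 are estimated
    from disjoint halves of the repeats, removing their covariance (a sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = Y.shape
    idx = rng.permutation(n)
    A, B = Y[idx[: n // 2]], Y[idx[n // 2:]]     # disjoint halves of the repeats
    vc = nu - nu.mean()

    def parts(half):
        k = half.shape[0]                         # repeats in this half (>= 2)
        y = half.mean(axis=0)
        yc = y - y.mean()
        s2 = half.var(axis=0, ddof=1).mean()
        num = (vc @ yc) ** 2 - (s2 / k) * np.sum(vc ** 2)
        den = np.sum(vc ** 2) * (np.sum(yc ** 2) - (m - 1) * s2 / k)
        return num, den

    num, _ = parts(A)     # numerator from one half of the trials
    _, den = parts(B)     # denominator from the other half
    return num / den
```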
Consistency of $\hat{r}^2_{\mathrm{ER}}$ in m
We aim to show that $\hat{r}^2_{\mathrm{ER}}$ is consistent for $r^2_{\mathrm{ER}}$ in m; more formally:

$$\lim_{m\to\infty} \Pr\left(\left|\hat{r}^2_{\mathrm{ER}} - r^2_{\mathrm{ER}}\right| > \epsilon\right) = 0 \quad \text{for all } \epsilon > 0.$$

We make use of the continuous mapping theorem, which guarantees that if a random vector converges in probability, then so does any continuous transformation g of it, provided the limit is almost surely different from any discontinuity points of g. Taking our random vector to be the numerator and denominator of $\hat{r}^2_{\mathrm{ER}}$ (Eqs 11 and 14), and our continuous transformation to be their ratio, $g(u, v) = u/v$ (assuming the expectation of the denominator is non-zero), it then suffices to show that the numerator and denominator are themselves consistent estimators of the numerator and denominator of $r^2_{\mathrm{ER}}$.
First, we have already shown that the numerator and denominator (Eqs 11 and 14) are unbiased estimators. Next, we must show that their variance decreases with m, and then via Chebyshev's inequality,

$$\Pr\left(\left|X - \mathrm{E}\left[X\right]\right| \geq \epsilon\right) \leq \frac{\mathrm{Var}\left[X\right]}{\epsilon^2},$$

we can show their convergence to their expectation. Here we consider the case where $\hat{\sigma}^2 = s^2$. Since the model predictions (νi) are fixed for the purpose of the proof, we assume the dot product between model predictions and neural responses scales linearly with m:

$$\sum_{i=1}^{m}\left(\nu_i - \bar{\nu}\right)\left(\mu_i - \bar{\mu}\right) \propto m,$$

as does the dynamic range of the neuron:

$$d^2 = \sum_{i=1}^{m}\left(\mu_i - \bar{\mu}\right)^2 \propto m,$$

and we scale the numerator and denominator of $\hat{r}^2_{\mathrm{ER}}$ by the same factor, which makes no change to their ratio.
The numerator (Eq 11, scaled as above) has variance equal to the sum of the variances of its first and second terms (since they are independent). Since the variances are, respectively,
and
thus
The denominator (Eq 14, scaled as above) also has variance equal to the sum of the variances of its first and second terms (by independence). The variances are, respectively,
and
thus
For both the numerator and the denominator, everything except m is constant; therefore, we can find an m that brings the variance below any given ϵ. So by Chebyshev's inequality we have:
Since these variances vanish as m → ∞, we have that,
and similarly,
Thus by the continuous mapping theorem:
In contrast, we show below that the naive estimator, $\hat{r}^2$, is not consistent, and we provide insight into when the difference between $\hat{r}^2$ and $\hat{r}^2_{\mathrm{ER}}$ is large.
Inconsistency of $\hat{r}^2$ in m
Similarly to the previous derivation, we can take the numerator and denominator of $\hat{r}^2$ (Eqs 9 and 12), scale them by the same factor, find their expected values, and in turn find the asymptotic value of $\hat{r}^2$. Here, though, we simplify by setting the model to be unit length,
This result shows that $\hat{r}^2$ is not a consistent estimator in m of $r^2_{\mathrm{ER}}$.
Confidence intervals
Here we develop and prove a method that provides α-level confidence intervals for the estimator $\hat{r}^2_{\mathrm{ER}}$. We considered the typical parametric bootstrap and non-parametric bootstrap approaches, but found that they were not reliable for typical ranges of parameters (see Results, "Confidence intervals for $\hat{r}^2_{\mathrm{ER}}$").
Our approach hinges upon finding the lowest $r^2_{\mathrm{ER}}$ whose distribution would give an estimate greater than the observed $\hat{r}^2_{\mathrm{ER}}$ with probability α/2, calling this the lower limit, and finding the highest $r^2_{\mathrm{ER}}$ that would give an estimate less than the observed with probability α/2, calling this the upper limit. The interval between these two limits then serves as our α-level confidence interval (see Fig 14 for a graphical explanation). We use a Bayesian framework to sample from the probability distribution of $\hat{r}^2_{\mathrm{ER}}$ at a candidate $r^2_{\mathrm{ER}}$, parameterized by the observed neural statistics s2 and $\hat{d}^2$, allowing us to find these limits under assumed uninformative priors on σ2 and d2 (see "Computing confidence intervals", below).
Fig 14. Illustrative schematic of confidence interval estimation.
Given an observed estimate x* (green dashed vertical) from the distribution of the estimator X with CDF T(x) (solid black curve), associated with the parameter being estimated, μ (black dashed vertical), the upper limit of the α-level confidence interval is the μU (purple vertical dashed) corresponding to the cumulative distribution of XU, U(x) (solid purple curve), that would generate values less than x* with probability α/2 (purple horizontal dashed). Thus U(x) is defined by U(x*) = α/2. Under the assumption that the family of CDFs of X is stochastically increasing in μ, the event that T(x) ≥ α/2 corresponds to the event that μ < μU; thus the upper limit of the confidence interval contains the true value of μ. In graphical terms, if the black horizontal dashed line is above the purple, then it is guaranteed that the purple vertical dashed is to the right of the black. Thus these two events have the same probability: Pr(μ ≤ μU) = Pr(α/2 ≤ T(X)) = 1 − α/2. Here we have used generic symbols for illustrative purposes; they correspond to the quantities defined in the proof (see Methods, "Proof of α-level confidence intervals").
Proof of α-level confidence intervals
Here, we justify this procedure for the case of the upper limit of the confidence interval (the lower limit is similar). Our two main assumptions are that the cumulative distribution of $\hat{r}^2_{\mathrm{ER}}$ is stochastically increasing in $r^2_{\mathrm{ER}}$,
| (18) |
and that we can always find a value of $r^2_{\mathrm{ER}}$ such that, for any observed $\hat{r}^2_{\mathrm{ER}}$,
| (19) |
We now consider two mutually exclusive possibilities. First, with probability 1 − α/2, the observed $\hat{r}^2_{\mathrm{ER}}$ is large enough to satisfy:
Then by the assumption in Eq 19 we can find a where,
and under our initial assumption (Eq 18), this implies
because
Second, if on the other hand the observed $\hat{r}^2_{\mathrm{ER}}$ is small enough that
then
Thus, under repeated sampling, with the desired probability α/2, the upper limit of our confidence interval does not contain the true $r^2_{\mathrm{ER}}$. The proof for the lower end of the confidence interval is similar. The probability of the mutually exclusive events that the true value falls above the upper limit or below the lower limit is the sum of the probabilities of the two events, α. See Fig 14 for a graphical explanation of this proof.
For simplicity of the proof, we assumed that it was possible to find , which is not necessarily the case because is bounded but is not. If or , then there is no that will achieve α/2. Under the condition where , we simply set , and since implies , the confidence interval will contain the true value.
Under the condition , we set , but we must set the confidence interval, though normally inclusive, to be non-inclusive. Intuitively, this is because if , then the upper end of the confidence interval would always contain the true value, and we would be restricted to α = 1. Making the CI non-inclusive avoids this problem. Doing this does not cause a problem when the true , because implies , the confidence interval should not contain the true and it does not because . The case for is similar.
In summary, our confidence interval is defined to be when and but ∅ (the empty set) if or . The lower bound, , satisfies , except if or , then respectively or . The upper bound, , satisfies , except if or , then respectively or .
To sample from this conditional distribution, we assume that σ2 and d2 follow uninformative non-negative uniform priors (U[0, ∞]), and, given their observed estimates s2 and $\hat{d}^2$, we obtain samples from the posterior distribution of σ2 and d2 via the Metropolis-Hastings sampling method (for details see "Bayesian model and simulation"). For a chosen value of $r^2_{\mathrm{ER}}$ (e.g., a candidate for one end of the confidence interval), we sample from the distribution of $\hat{r}^2_{\mathrm{ER}}$ by drawing samples of σ2 and d2 from the posterior distribution while $r^2_{\mathrm{ER}}$ is fixed to the desired value. For each such sample we then draw observations Y and predictions νi from the model described in Eq 6 and finally calculate $\hat{r}^2_{\mathrm{ER}}$, yielding a sample from its distribution.
Computing confidence intervals
We use a simple iterative bracketing algorithm to narrow down the range of candidate values for the ends of our confidence interval. For example, to estimate the upper limit within [0, 1], we first evaluate the highest possible value: 1. We sample N = 2,500 draws of $\hat{r}^2_{\mathrm{ER}}$ at this candidate value to find the proportion, $\hat{p}$, of draws less than or equal to the observed $\hat{r}^2_{\mathrm{ER}}$. We then calculate a z-statistic to test whether this proportion is significantly different from the desired α/2:

$$z = \frac{\hat{p} - \alpha/2}{\sqrt{\dfrac{(\alpha/2)\left(1 - \alpha/2\right)}{N}}}.$$
At some desired significance level (here p < 0.01), we either do not reject the null and accept the candidate, or we reject the null. In the latter case, if z is positive we determine that the candidate must be higher, whereas if z is negative it must be lower. In the case where the candidate is already 1 and z is positive, there are no higher possible values and thus we accept 1. Otherwise, on the next step we choose a new candidate from within the current bracketing interval, evaluate it in the same way, and, if we reject the null, set the new interval to the upper portion when z is positive and to the lower portion when z is negative. If we do not reject the null, we accept the candidate. We continue this bracketing until we do not reject the null or a pre-determined number of splits has passed (here we use 100). The accuracy of this algorithm increases with the number of splits and the number of simulation samples.
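A minimal Python sketch of this bracketing search for the upper limit. Here `sample_estimator` is a stand-in for the sampling procedure described above (draws of $\hat{r}^2_{\mathrm{ER}}$ at a candidate $r^2_{\mathrm{ER}}$), and bisecting the bracket is one concrete choice for picking the next candidate; both are assumptions for the example.

```python
import numpy as np
from scipy import stats

def upper_ci_limit(r2er_obs, sample_estimator, alpha,
                   n_samples=2500, max_splits=100, p_crit=0.01):
    """Bracketing search for the upper confidence limit.

    sample_estimator(r2er, size): returns `size` draws of the estimator when
    the true r2_ER equals `r2er` (user-supplied; a toy stub is used below)."""
    lo, hi = 0.0, 1.0
    cand = 1.0                                    # start at the highest possible value
    for _ in range(max_splits):
        draws = sample_estimator(cand, n_samples)
        p_hat = np.mean(draws <= r2er_obs)        # proportion at or below the observed estimate
        se = np.sqrt((alpha / 2) * (1 - alpha / 2) / n_samples)
        z = (p_hat - alpha / 2) / se
        if 2 * stats.norm.sf(abs(z)) > p_crit:    # cannot reject p = alpha/2: accept candidate
            return cand
        if z > 0:
            if cand == 1.0:                       # no higher value of r2_ER is possible
                return 1.0
            lo = cand                             # candidate too low: search upper half
        else:
            hi = cand                             # candidate too high: search lower half
        cand = (lo + hi) / 2.0                    # bisect the bracketing interval
    return cand

# toy stand-in sampler: estimator ~ Normal(candidate, 0.1); illustration only
toy = lambda r2, size: np.random.default_rng(0).normal(r2, 0.1, size)
print(upper_ci_limit(r2er_obs=0.6, sample_estimator=toy, alpha=0.1))
```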
Confidence interval validation
We used simulations to evaluate our confidence interval method under the relevant sampling distribution. Conceptually, this is the distribution of $\hat{r}^2_{\mathrm{ER}}$ after data have been collected and the sample variance and sample dynamic range calculated, when we wish to evaluate the data's fit to a model with an unknown but fixed $r^2_{\mathrm{ER}}$. To demonstrate that our method contains the unknown $r^2_{\mathrm{ER}}$ at the desired α level, our procedure is as follows. For a chosen n, m, σ2, d2, and $r^2_{\mathrm{ER}}$, sample an n × m data matrix (Y) and calculate its sample variance s2 and dynamic range $\hat{d}^2$. Then, using the Metropolis-Hastings algorithm (see Methods, "Bayesian model and simulation"), draw 5,000 samples from the posterior distribution of σ2 and d2. Next, we simulate the distribution of $\hat{r}^2_{\mathrm{ER}}$ by drawing data from the model of Eq 6 for each of these posterior samples. For each of these draws, we construct confidence intervals, and then we calculate the proportion of times that the confidence intervals contain the true $r^2_{\mathrm{ER}}$. This proportion estimates the true α level of the confidence interval method.
Bayesian model and simulation
We sample from the posterior of two parameters: σ2 and d2. Their associated sufficient statistics are:
and their distributions are:
| (20) |
| (21) |
By Bayes theorem we have,

$$P\left(\sigma^2, d^2 \mid s^2, \hat{d}^2\right) \propto P\left(s^2 \mid \sigma^2\right)\,P\left(\hat{d}^2 \mid \sigma^2, d^2\right),$$

where the right-hand side follows from recognizing that the sample variance (s2) and dynamic range ($\hat{d}^2$) are independent and setting the prior to be uniform and non-negative. The estimates s2 and $\hat{d}^2$ are fixed, calculated from the data, and our goal is to look up the distribution of the parameters given these fixed values. We use the Metropolis-Hastings algorithm to draw from the desired distribution and approximate it with the empirical distribution (a histogram). Our sampling procedure is as follows. We initialize our parameter samples σ2, d2 at their estimates, and we then sample a new candidate from our proposal distribution: a truncated multivariate normal with means equal to the current samples and diagonal variances equal to the variances of the distributions (Eqs 20 and 21) evaluated at the current samples. We take the ratio of likelihoods,

$$a = \frac{P\left(s^2, \hat{d}^2 \mid \sigma^2_{\mathrm{cand}}, d^2_{\mathrm{cand}}\right)}{P\left(s^2, \hat{d}^2 \mid \sigma^2_{\mathrm{cur}}, d^2_{\mathrm{cur}}\right)}.$$

If a > 1, we accept the candidates as our new current samples; if a < 1, we draw u ∼ U[0, 1], and if u < a we also accept the candidates, but if not, we retain the current samples. Throughout the paper we run the chain for 5,000 iterations and then randomly sample with replacement from it.
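A sketch of this sampler in Python/SciPy. Because Eqs 20 and 21 are not reproduced here, the likelihoods below are an assumption: they take the two statistics to be the pooled sample variance and the uncorrected between-stimulus sum of squares of the trial means, with the scaled χ2 and non-central χ2 distributions implied by the normal model of Eq 6. The function name, proposal step sizes, and numerical safeguards are likewise illustrative.

```python
import numpy as np
from scipy import stats

def mh_posterior_samples(s2_obs, d2_obs, m, n, n_iter=5000, seed=0):
    """Metropolis-Hastings draws of (sigma^2, d^2) given the observed pooled
    sample variance and the uncorrected between-stimulus sum of squares."""
    rng = np.random.default_rng(seed)
    k = m * (n - 1)                              # degrees of freedom of the pooled variance

    def log_lik(sig2, d2):
        # s^2 ~ sigma^2 / (m(n-1)) * chi^2_{m(n-1)}
        ll = stats.chi2.logpdf(s2_obs * k / sig2, df=k) + np.log(k / sig2)
        # sum_i (Ybar_i. - Ybar..)^2 ~ sigma^2/n * chi^2_{m-1}(n d^2 / sigma^2)
        scale = sig2 / n
        ll += stats.ncx2.logpdf(d2_obs / scale, df=m - 1,
                                nc=n * d2 / sig2) - np.log(scale)
        return ll

    step = np.array([s2_obs, d2_obs]) * 0.1 + 1e-6   # arbitrary proposal widths
    cur = np.array([s2_obs, max(d2_obs, 1e-6)])      # initialize at the estimates
    samples = np.empty((n_iter, 2))
    for t in range(n_iter):
        cand = np.maximum(np.abs(cur + rng.normal(0, step)), 1e-12)  # stay non-negative
        a = np.exp(min(log_lik(*cand) - log_lik(*cur), 50.0))        # likelihood ratio
        if a > 1 or rng.uniform() < a:
            cur = cand
        samples[t] = cur
    return samples      # columns: sigma^2, d^2
```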
SNR relation to F-test and number of trials
Our goal is to be able to find, for a given SNR and number of repeats, the number of stimuli needed to reliably detect tuning under an F-test. To calculate the F-statistic for testing whether there is variation in the expected responses across stimuli (i.e., stimulus selectivity), we form the ratio,

$$F = \frac{\dfrac{n}{m-1}\sum_{i=1}^{m}\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2}{\dfrac{1}{m(n-1)}\sum_{i=1}^{m}\sum_{j=1}^{n}\left(Y_{i,j} - \bar{Y}_{i\cdot}\right)^2},$$

where for clarity we indicate dimensions averaged over with a dot. The numerator calculates the amount of variance explained by stimuli and the denominator calculates the amount of variance unexplained by stimuli. The numerator is a scaled non-central χ2 distribution:
where the final equality comes from the definition of SNR (5). The denominator is a central χ2 distribution:

$$\frac{1}{m(n-1)}\sum_{i=1}^{m}\sum_{j=1}^{n}\left(Y_{i,j} - \bar{Y}_{i\cdot}\right)^2 \sim \frac{\sigma^2}{m(n-1)}\,\chi^2_{m(n-1)}.$$
Thus taking the ratio we have a singly non-central F-distribution:
| (22) |
To test for significant tuning, we set an α-level criterion, cF(α), under the null hypothesis that the observed F-statistic is from a central F-distribution:

$$\Pr\left(F_{m-1,\ m(n-1)} > c_F(\alpha)\right) = \alpha.$$

Finally, given m and n, we can find the SNR where, for some high probability β,

$$\Pr\left(F > c_F(\alpha)\right) \geq \beta,$$

where F follows the non-central distribution of Eq 22.
We set β = 0.99 and α = 0.01 and numerically solve for the SNR.
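The numerical solve can be done with a one-dimensional root finder, as in the Python/SciPy sketch below. Because Eq 5 is not reproduced here, the mapping from SNR to the non-centrality parameter of Eq 22 is an assumption (lam = n·(m−1)·SNR, i.e., SNR taken as d2/((m−1)σ2)); the function name and bracketing range are also illustrative.

```python
import numpy as np
from scipy import stats, optimize

def min_snr_for_tuning(m, n, alpha=0.01, beta=0.99):
    """Smallest SNR at which the F-test detects stimulus selectivity with
    probability beta at criterion alpha, for m stimuli and n repeats."""
    df1, df2 = m - 1, m * (n - 1)
    c = stats.f.ppf(1 - alpha, df1, df2)          # alpha-level criterion under the null

    def power_gap(snr):
        lam = n * (m - 1) * snr                   # assumed SNR -> non-centrality mapping
        return stats.ncf.sf(c, df1, df2, lam) - beta

    return optimize.brentq(power_gap, 1e-6, 1e3)  # numerically solve for the SNR

print(min_snr_for_tuning(m=50, n=4))
```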
Electrophysiological data
We reanalyzed a variety of neuronal data from previous studies. This includes three experiments in area V4 and one in area MT of the awake, fixating rhesus monkey (Macaca mulatta), as well as spiking and two-photon imaging data from awake mouse VISp. Experimental protocols for all studies are described in detail in the original publications.
From Pasupathy and Connor [2], we examined responses of 109 V4 neurons to a set of 362 shapes. There were typically 3-5 repeats of each stimulus, but we used only the 96 cells that had at least 4 repeats for all stimuli. We used the spike count for each trial during the 500 ms stimulus presentation.
From Popovkina et al. [23], we examined responses of 43 V4 neurons (7 from one monkey, 36 from another) to filled stimuli (drawn from the same set of shapes used for the previous study) and to outline stimuli that were the same except the fill was set to be equivalent to background color. Stimulus color and luminance were customized to elicit a robust response from the recorded neuron. Spikes were counted over the 300 ms duration of each stimulus presentation.
From the 2019 UW V4 Neural Data Challenge, we examined single-unit (SUA) and multi-unit (MUA) data from 7 V4 recordings. Up to 601 images were shown, with 3-20 repeats of each image. The images were drawn semi-randomly from the 2012 ILSVRC validation set [24]; an 80 × 80 pixel patch was sampled and a soft window applied (circular Gaussian, SD 16 pixels, applied to the alpha channel). Images were shown for 300 ms with 250 ms between images. The model we analyze was the winner of the Neural Data Challenge (out of 32 competitors) on held-out data from the 14 sets of V4 responses to natural images.
From Zohary et al. [12, 13], we examined responses from 81 pairs of MT neurons recorded from three awake rhesus monkeys (Macaca mulatta) viewing dynamic random dots (stimuli described in Britten et al. [25]). The optimal speed of the drifting dots was found for one of the two neurons being recorded. Eight directions of motion at 45° increments were repeated 10-20 times. Monkeys performed a two-alternative forced choice motion direction discrimination task during the experiment. Post-stimulus spikes were counted in the 2 s window of stimulus presentation. Experimenters were rigorous in only recording from pairs of neurons whose spike waveforms were strikingly different.
From the Allen Institute for Brain Science (AIBS) mouse database [16], we examined calcium fluorescence data. Fluorescence of mouse visual cortex neurons expressing GCaMP6f was measured via two-photon imaging through a cranial window. We analyzed signals recorded in response to natural scenes and static gratings presented for 0.25 s each, with no interval between stimuli and 50 repeats in random order. The natural scene stimulus consisted of 118 natural images from a variety of databases. The static grating stimulus consisted of a full-field static sinusoidal grating at a single contrast (80%). Gratings were presented at 6 orientations, 5 spatial frequencies (0.02, 0.04, 0.08, 0.16, 0.32 cycles/degree), and 4 phases (0, 0.25, 0.5, 0.75). For every trial, ΔF/F was estimated using the average fluorescence of the preceding second (4 image presentations) as baseline. We analyzed the average change in fluorescence during the 0.5 s period after the start of the image presentation relative to the 1 s before. We also examined SUA data from the AIBS mouse Neuropixels data set [17], which was recorded in response to the same stimuli as the calcium data. Spike counting windows were 0.25 s.
Prior analytic methods for estimating $r^2_{\mathrm{ER}}$
Our estimator, $\hat{r}^2_{\mathrm{ER}}$, is derived via the strategy of Haefner and Cumming [6] of unbiasing the numerator and denominator of the coefficient of determination. Haefner and Cumming in turn cite Sahani and Linden [3] as the predecessor to their method. Sahani and Linden constructed an unbiased estimator of the variation in the expected response of the neuron (i.e., the tuning curve), which they called 'signal power'. They use it to normalize an estimator of explained variation that is unbiased with respect to noise only under conditions they did not specify (1-parameter regression of the model predictions; see Methods, "Derivation of Normalized Signal Power Explained (SPEnorm)"). By not specifying how a given model prediction should be fit to the neural data before estimating the quality of the fit, Sahani and Linden introduced potential problems into their estimator. This was recognized by Schoppe et al. [8], who point out that the estimator is sensitive to differences between the mean and amplitude of the model predictions and of the neural data. Consequently, the estimator can give large negative values because the squared error between model predictions and neural responses is unbounded. This criticism, while technically correct, is easily overcome by regressing (with an intercept term) the given model predictions onto the neural data before using normalized SPE. Schoppe et al., motivated by the problems they found in SPE, focused on simplifying the CCnorm of Hsu et al. [4] so as not to require re-sampling, by making use of the signal power estimator developed by Sahani and Linden. They derived a simple estimator whose square we find is essentially numerically equivalent to the SPEnorm of Sahani and Linden in the case of one-parameter regression.
For the purpose of comparison, below we write out the exact formulas and approximate expected values of two prior methods [3, 6] that are closely related to $\hat{r}^2_{\mathrm{ER}}$, in the notation we use throughout our paper. For all estimators, we assume responses to m stimuli with n repeats whose variance has been stabilized. The response to the ith stimulus, jth repeat, is Yi,j ∼ N(μi, σ2), where σ2 is the trial-to-trial variability and μi the ith expected value of the response after variance stabilization. The predictions are fixed for the m stimuli, the ith predicted expected value of the data is νi, and we assume the predictions have been fit by a linear model with d degrees of freedom. When averaging data across trials our notation will be $\bar{Y}_{i\cdot}$ and across stimuli $\bar{Y}_{\cdot j}$.
Derivation of normalized signal power explained (SPEnorm)
The original description of SPE [3] did not specify, when calculating sample estimates of 'power' (better known as variance), whether to normalize by m − 1 (the unbiased estimate) or m (the MLE). Nor was it specified whether an average across trials was used when calculating the difference between the variance of the measured response and that of the residual. We thus use the formula as described by Schoppe et al. [8], who provide code to unambiguously calculate SPE, albeit in a manner that differs from their text: their text defines the signal power term similarly to Eq 5 of Sahani and Linden, but their code implements it differently. In the context of the quantities being estimated, the latter makes more sense (see the calculation of the expected value of the denominator below). For comparison to the derivation in Schoppe et al., their notation equates to ours as follows: N = n, T = m, and R = Yi,j. Finally, in our notation their estimator is:
Calculating the expectation of the numerator and the denominator for the fit of a linear model with d degrees of freedom, we can find the asymptotic expectation. Numerator:
Denominator:
Putting the expectations into the numerator and denominator we have:
We note that only if d = 1 (i.e., the model has only 1 term) is the numerator unbiased. Below we describe how Haefner and Cumming developed an estimator that accounts for degrees of freedom more generally.
Derivation of ϒ
For comparison to the original paper of Haefner and Cumming [6], we give their notation and its equivalent terms in our notation: di,j = Yi,j, Σ2 = σ2, N = m, Nσ = m(n − 1), R = n, n = d, Di = μi, Mi = νi.
These authors explicitly attempt to remove the bias of the coefficient of determination:
Their unbiased estimator is derived by dividing the numerator and denominator by the sample trial-to-trial variability ( and their respective degrees of freedom (d below being the degrees of freedom of the linear model), noting the numerator and denominator become non-central F-distributions, then shifting and scaling these to provide unbiased estimates. Since , the expectation of the numerator is,
thus the unbiased estimate of the numerator is:
The expectation of the denominator is:
thus the unbiased estimate of the denominator is:
Forming the ratio, we obtain the Haefner and Cumming estimator:
Extension of $\hat{r}^2_{\mathrm{ER}}$ to the fit of a linear model
The derivation of Haefner and Cumming via the non-central F was not necessary: the expectations of the numerator and denominator are straightforward to calculate as non-central χ2 random variables. While our $\hat{r}^2_{\mathrm{ER}}$ is explicitly meant to be the analogue of Pearson's r2, here we re-derive the Haefner and Cumming formula along the lines of $\hat{r}^2_{\mathrm{ER}}$ for measuring the fit of a linear model. We specifically avoid the non-central F-distribution so that it is unnecessary to estimate the variance (if, for example, there is a strong prior for the variance and/or multiple trials were not collected). We assume, as did Haefner and Cumming, that the predictions νi were fit from a linear model via least squares with d coefficients. The expectation of the numerator is:
thus its unbiased estimate is:
The expectation of the denominator is:
thus its unbiased estimate is:
Thus their ratio forms:
| (23) |
Acknowledgments
We thank Polina Zamarashkina for collecting data for the UWNDC dataset. We thank Anitha Pasupathy, Dina Popovkina, and the Allen Institute for sharing data. We thank Greg Horwitz and Yen-Chi Chen for helpful suggestions and advice. We thank the participants in the UWNDC, including the winner Oleg Polosin, for their neural models.
Data Availability
The MT data are available at: http://www.neuralsignal.org/data/21/nsa2021.1.html. The V4 data are available at: https://www.kaggle.com/c/uwndc19. The code to calculate all estimators and confidence intervals is available at: https://github.com/deanpospisil/er_est.
Funding Statement
This work was supported by a National Science Foundation Graduate Research Fellowship DGE-1256082 (D.A.P.), National Institutes of Health Grant NEI R01-EY02999, and National Institutes of Health Grant NEI R01-EY027023 (W.B.). All funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
1. Roddey JC, Girish B, Miller JP. Assessing the Performance of Neural Encoding Models in the Presence of Noise. Journal of Computational Neuroscience. 2000;8(2):95–112. doi: 10.1023/A:1008921114108
2. Pasupathy A, Connor CE. Shape Representation in Area V4: Position-Specific Tuning for Boundary Conformation. Journal of Neurophysiology. 2001;86(5):2505–2519. doi: 10.1152/jn.2001.86.5.2505
3. Sahani M, Linden JF. How Linear are Auditory Cortical Responses? In: Becker S, Thrun S, Obermayer K, editors. Advances in Neural Information Processing Systems 15. MIT Press; 2003. p. 125–132.
4. Hsu A, Borst A, Theunissen FE. Quantifying variability in neural responses and its application for the validation of model predictions. Network: Computation in Neural Systems. 2004;15(2):91–109. doi: 10.1088/0954-898X_15_2_002
5. David SV, Gallant JL. Predicting neuronal responses during natural vision. Network (Bristol, England). 2005;16(2-3):239–260. doi: 10.1080/09548980500464030
6. Haefner RM, Cumming BG. An improved estimator of Variance Explained in the presence of noise. Advances in Neural Information Processing Systems. 2008;2008:585–592.
7. Yamins DLK, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences of the United States of America. 2014;111(23):8619–8624. doi: 10.1073/pnas.1403112111
8. Schoppe O, Harper NS, Willmore BDB, King AJ, Schnupp JWH. Measuring the Performance of Neural Models. Frontiers in Computational Neuroscience. 2016;10.
9. Cadena SA, Denfield GH, Walker EY, Gatys LA, Tolias AS, Bethge M, et al. Deep convolutional models improve predictions of macaque V1 responses to natural images. PLOS Computational Biology. 2019;15(4):e1006897. doi: 10.1371/journal.pcbi.1006897
10. Kindel WF, Christensen ED, Zylberberg J. Using deep learning to probe the neural code for images in primary visual cortex. Journal of Vision. 2019;19(4). doi: 10.1167/19.4.29
11. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. CRC Press; 1994.
12. Zohary E, Shadlen MN, Newsome WT. Correlated neuronal discharge rate and its implications for psychophysical performance. Nature. 1994;370(6485):140–143. doi: 10.1038/370140a0
13. Bair W, Zohary E, Newsome WT. Correlated firing in macaque visual area MT: time scales and relationship to behavior. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience. 2001;21(5):1676–1697. doi: 10.1523/JNEUROSCI.21-05-01676.2001
14. Olshausen BA, Field DJ. How close are we to understanding V1? Neural Computation. 2005;17(8):1665–1699.
15. Rust NC, Movshon JA. In praise of artifice. Nature Neuroscience. 2005;8(12):1647–1650. doi: 10.1038/nn1606
16. de Vries SEJ, Lecoq JA, Buice MA, Groblewski PA, Ocker GK, Oliver M, et al. A large-scale standardized physiological survey reveals functional organization of the mouse visual cortex. Nature Neuroscience. 2020;23(1):138–151. doi: 10.1038/s41593-019-0550-9
17. Siegle JH, Jia X, Durand S, Gale S, Bennett C, Graddis N, et al. A survey of spiking activity reveals a functional hierarchy of mouse corticothalamic visual areas. bioRxiv. 2019; p. 805010.
18. Cohen MR, Kohn A. Measuring and interpreting neuronal correlations. Nature Neuroscience. 2011;14(7):811–819. doi: 10.1038/nn.2842
19. Leavitt ML, Pieper F, Sachs AJ, Martinez-Trujillo JC. Correlated variability modifies working memory fidelity in primate prefrontal neuronal ensembles. Proceedings of the National Academy of Sciences of the United States of America. 2017;114(12):E2494–E2503. doi: 10.1073/pnas.1619949114
20. Insanally MN, Carcea I, Field RE, Rodgers CC, DePasquale B, Rajan K, et al. Nominally non-responsive frontal and sensory cortical cells encode task-relevant variables via ensemble consensus-building. bioRxiv. 2018; p. 347617.
21. Zylberberg J. The role of untuned neurons in sensory information coding. bioRxiv. 2018; p. 134379.
22. Box GEP, Cox DR. An Analysis of Transformations. Journal of the Royal Statistical Society, Series B (Methodological). 1964;26:211–252. doi: 10.1111/j.2517-6161.1964.tb00553.x
23. Popovkina DV, Bair W, Pasupathy A. Modeling diverse responses to filled and outline shapes in macaque V4. Journal of Neurophysiology. 2019;121(3):1059–1077. doi: 10.1152/jn.00456.2018
24. Deng J, Dong W, Socher R, Li L, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition; 2009. p. 248–255.
25. Britten KH, Shadlen MN, Newsome WT, Movshon JA. Responses of neurons in macaque MT to stochastic motion signals. Visual Neuroscience. 1993;10(6):1157–1169. doi: 10.1017/S0952523800010269