Abstract
Extracting signals from noisy backgrounds is a fundamental problem in signal processing. In this paper, we introduce Noisereduce, an algorithm for reducing noise across a variety of domains, including speech, bioacoustics, neurophysiology, and seismology. Noisereduce uses spectral gating to estimate a frequency-domain mask that effectively separates signals from noise. It is fast, lightweight, requires no training data, and handles both stationary and non-stationary noise, making it both a versatile tool and a convenient baseline for comparison with domain-specific algorithms. We provide a detailed overview of Noisereduce and evaluate its performance on a variety of time-domain signals.
Keywords: Noise reduction, Signal enhancement, Time-domain signals
Subject terms: Data processing, Software
Introduction
Natural signals such as speech, electrophysiology, and bioacoustics are challenging to record in isolation. Sensors, both biological and artificial, tend to record these signals in the context of noisy environments. To record from a singing songbird in its natural environment, for example, a microphone will pick up not only the bird but the richness of its sensory environment: a babbling brook, chirping crickets, wind passing through leaves, and the croaks of a nearby frog. Such ’noise’ can both provide important context for the signal of interest and introduce important confounds. For example, a classifier trained to predict bird species from song recordings can be biased by environmental context; a babbling brook in the background might cause a Wood Thrush to be classified as a Robin. That error would in turn lead to downstream inaccuracies in estimating the migratory patterns of both birds. These same technical challenges arise in a variety of domains, from detecting action potentials to distinguishing seismic events from human activity.
Determining what constitutes noise versus signal is highly context-dependent. Consider two researchers: one focusing on the croaking of the American Bullfrog and the other analyzing the song of the Wood Thrush. They might approach the same audio recording yet define signal and noise in vastly different ways. Fortunately for these hypothetical researchers, the vocalizations of Bullfrogs and Wood Thrushes can be relatively easily distinguished from each other. Bullfrogs produce sounds in a frequency range of approximately 200-2000 Hz43, whereas Wood Thrushes vocalize in a higher spectrum, roughly 2000-9000 Hz22. Therefore, by applying a simple low-pass or high-pass filter, each researcher can effectively isolate the vocalizations of their respective species with minimal effort.
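The band-splitting described above can be sketched with standard filters. The following is a hypothetical illustration using scipy; the sample rate, cutoff, and the toy 500 Hz "croak" and 4000 Hz "song" tones are our assumptions, not values from any real recording:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 22050  # assumed sample rate (Hz)
t = np.arange(0, 1.0, 1 / fs)

# Toy mixture: a low-band "croak" plus a high-band "song".
croak = np.sin(2 * np.pi * 500 * t)
song = np.sin(2 * np.pi * 4000 * t)
mixture = croak + song

# A low-pass at 2 kHz isolates the croak; a high-pass at 2 kHz isolates the song.
sos_lp = butter(4, 2000, btype="lowpass", fs=fs, output="sos")
sos_hp = butter(4, 2000, btype="highpass", fs=fs, output="sos")
croak_est = sosfiltfilt(sos_lp, mixture)  # zero-phase filtering
song_est = sosfiltfilt(sos_hp, mixture)
```

Because the two toy vocalizations occupy disjoint frequency bands, each filtered output correlates almost perfectly with the corresponding source.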
Signal and noise events that overlap spectro-temporally pose a greater challenge but are not insurmountable. If the signal and noise retain identifiable structures, we can devise algorithms to exploit these structures and eliminate unwanted noise. For instance, the persistent 60-Hz hum from nearby electronics in a poorly grounded electrophysiology implant exhibits temporal structure. This constant hum can be identified and algorithmically removed from the signal. Noise reduction algorithms harness these structural differences to distinguish and separate noise from the signal.
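For the 60-Hz hum example, a narrow notch filter exploits the hum’s fixed frequency. This is a minimal sketch; the sampling rate and the two component frequencies are illustrative assumptions:

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 1000.0  # assumed sampling rate (Hz)
t = np.arange(0, 2.0, 1 / fs)
signal = np.sin(2 * np.pi * 7 * t)       # slow "physiological" component
hum = 0.5 * np.sin(2 * np.pi * 60 * t)   # mains interference
noisy = signal + hum

# Narrow notch centered on the hum frequency; Q controls the notch width.
b, a = iirnotch(w0=60.0, Q=30.0, fs=fs)
cleaned = filtfilt(b, a, noisy)          # zero-phase application
```

The notch removes nearly all of the 60 Hz energy while leaving the 7 Hz component essentially untouched.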
The challenge of separating signal from noise exists across many domains. While much focus in recent years has been on machine-learning-based noise reduction algorithms, these algorithms are generally domain-specific; machine learning typically relies on large, often labeled, datasets that do not exist in all domains. These algorithms have been exhaustively reviewed for various signal domains in prior literature44,58,64. For domain-general applications outside the purview of domain-specific machine learning models, conventional approaches to noise reduction remain valuable58.
Here, we survey the utility of our algorithm, Noisereduce, a fast, domain-general, spectral-subtraction-based algorithm available in Python. Noisereduce has been available as open-source software for over five years and has found utility in a variety of domains including bioacoustics17,41,42,45,48,51, brain-machine interfacing11,25,26, livestock welfare monitoring4,24, human emotion analysis29,40, medical and clinical diagnostics27,32,39,55,68, and seismic monitoring38, among many others. Until now, its performance has not been rigorously validated. Here, we address this by validating Noisereduce on several time-domain signals and comparing it to other conventional algorithms. Our findings show that Noisereduce is fast and performs well, making it a strong candidate for domain-general applications where large datasets are unavailable and a solid baseline for comparison with machine-learning-based algorithms.
Noisereduce algorithm
Noisereduce belongs to a class of noise reduction algorithms that perform spectral subtraction in the time-frequency domain5. Spectral subtraction algorithms subtract an estimate of the noise spectrum from a noisy signal in an attempt to improve its signal-to-noise ratio. The challenge in developing a spectral subtraction algorithm is in determining what constitutes noise and how that estimate of noise should be subtracted from the signal. For example, in Stephan Boll’s original paper on spectral subtraction5, a Fast Fourier Transform (FFT) is taken of a noise-only portion of a speech recording, a Short-Time Fourier Transform (STFT) is then computed over the signal, and the estimated noise magnitude is subtracted from each frequency component (clamping negative values to zero). In practice, this approach can leave behind unwanted noise artifacts, and several variants exist to overcome these issues60. Spectral gating, the approach Noisereduce takes, is one such variant: it merges spectral subtraction with the concept of noise gating. Spectral gates are noise gates that act in the time-frequency domain, masking specific time-frequency components to be subtracted away while leaving other time-frequency components unaltered. This approach is commonly used in auditory scene analysis, where an auditory mixture is decomposed into time-frequency components and an Ideal Binary Mask28 is estimated to determine which components to attenuate (zeros) and which to pass through unaffected (ones). In practice, masks are rarely binary but are instead used to determine what proportion of the signal to attenuate. The success of spectral gating in noise reduction can be seen in its adoption in professional audio analysis software such as Adobe Audition (Effects → Noise Reduction/Restoration → Noise Reduction in version 25.2), Audacity (Effects → Noise Reduction in version 3.7.3), and iZotope RX (Spectral De-noise in version RX11). Noisereduce represents an open-source, lightweight, Python approach to spectral gating.
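Boll-style magnitude subtraction can be sketched in a few lines. This is a simplified illustration of the general scheme, not the exact implementation from the original paper:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, noise, fs, nperseg=512):
    """Subtract the mean noise magnitude per frequency bin from the signal
    STFT, clamping negative magnitudes to zero (half-wave rectification),
    and reconstruct using the noisy phase."""
    _, _, S = stft(x, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)
    noise_mag = np.abs(N).mean(axis=1, keepdims=True)  # per-bin noise estimate
    mag = np.maximum(np.abs(S) - noise_mag, 0.0)       # subtract and clamp
    S_hat = mag * np.exp(1j * np.angle(S))             # keep the noisy phase
    _, x_hat = istft(S_hat, fs=fs, nperseg=nperseg)
    return x_hat[: len(x)]
```

On a tone buried in white noise, this reduces the overall error relative to the clean signal, though the rectification step is exactly what produces the residual "musical noise" artifacts mentioned above.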
Noisereduce accepts two inputs: (1) X, the time-domain recording to be denoised, and (2, optionally) a time-domain recording containing only noise, used to calculate noise statistics. Noisereduce operates through the following steps (Fig 1).

1. Estimate noise:
   - 1.1 Compute a Short-Time Fourier Transform (STFT) on each channel of the noise recording.
   - 1.2 For each frequency channel, compute spectral statistics (mean and standard deviation) over the noise STFT.
   - 1.3 Compute a noise threshold based upon the statistics of the noise and the desired sensitivity.
2. Mask noise:
   - 2.1 Compute an STFT over each channel of the recording (X).
   - 2.2 Compute a mask (M) over the signal STFT, based on the thresholds for each frequency channel.
   - 2.3 (optional) Smooth the mask (M) with a filter over frequency and time.
   - 2.4 Apply the mask (M) to the STFT of the signal to produce the masked STFT.
   - 2.5 Invert the masked STFT back into the time-domain.

If the noise recording is not provided to the algorithm, the noise statistics are computed directly on the recording (X). A more detailed description of the algorithm and its parameters is given in 5.1.
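The steps above can be sketched as follows. This is a simplified numpy/scipy illustration of stationary spectral gating with a hard binary mask; the variable names and the choice of dB-domain statistics are ours, not the package’s internals:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_gate(x, noise, fs, n_fft=1024, n_std=1.5, prop_decrease=1.0):
    # 1.1-1.2: STFT of the noise clip and per-frequency statistics (in dB)
    _, _, N = stft(noise, fs=fs, nperseg=n_fft)
    noise_db = 20 * np.log10(np.abs(N) + 1e-12)
    mean_db = noise_db.mean(axis=1, keepdims=True)
    std_db = noise_db.std(axis=1, keepdims=True)
    # 1.3: per-frequency noise threshold
    thresh = mean_db + n_std * std_db
    # 2.1-2.2: STFT of the signal and a binary mask (1 where above threshold)
    _, _, S = stft(x, fs=fs, nperseg=n_fft)
    sig_db = 20 * np.log10(np.abs(S) + 1e-12)
    mask = (sig_db > thresh).astype(float)
    # 2.3: mask smoothing omitted here for brevity
    # 2.4: attenuate below-threshold components by prop_decrease
    gain = mask + (1.0 - mask) * (1.0 - prop_decrease)
    # 2.5: invert the masked STFT back to the time domain
    _, x_hat = istft(S * gain, fs=fs, nperseg=n_fft)
    return x_hat[: len(x)]
```

Applied to a tone in white noise (with a separate noise-only clip supplied), the gated output sits substantially closer to the clean signal than the noisy input does.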
Fig. 1.
Basic outline of Noisereduce algorithm. (A) A block diagram of the steps of Noisereduce. The stationary version of the time-frequency mask is depicted. (B) An example waveform (U.S. President George W. Bush stating “I know that human beings and fish can coexist peacefully”) passing through the Noisereduce pipeline. The non-stationary algorithm is not shown here.
Nonstationary noise reduction
In natural settings, background noise often varies over extended periods. For example, in bioacoustics, weather can shift within minutes, while in electrophysiology, the activity rates of nearby neurons may increase as animals transition between states, such as sleeping and waking. Consequently, it is advantageous to enable Noisereduce to adapt its noise definition over time48. To address this, we introduced a non-stationary variant of Noisereduce, where mask statistics are calculated using a sliding window across the signal rather than relying solely on an isolated noise clip. This non-stationary approach is particularly beneficial for signals such as those from hydrophones in underwater bioacoustics, where the engine hum of a boat can fluctuate as the hydrophone drifts toward and away from the boat towing it. To decide whether non-stationary noise reduction is appropriate, one can test whether the signal is stationary either using formal testing6 or by inspecting the signal manually for periods of fluctuating noise levels. Normalizing audio signals to channel-specific fluctuations in amplitude has proven useful for tasks like bioacoustic species identification35.
The non-stationary algorithm omits the need for a separate noise recording, since noise statistics are derived directly from the signal recording itself. In this revised approach, statistics for the noise threshold are computed over a sliding window for each frequency channel. This dynamically sets noise gate thresholds for each frequency channel, as opposed to static settings across the entire recording. For additional details, see Section 5.1.
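The sliding-window threshold can be sketched as follows. This is a minimal illustration of the idea, not the package’s implementation; the window size in STFT frames is an assumed parameterization:

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def sliding_noise_floor(spec_db, win_frames=50, n_std=1.5):
    """Per-frequency, time-varying threshold: mean + n_std * std of the
    dB spectrogram over a sliding window of STFT frames (axis 1 is time)."""
    mean = uniform_filter1d(spec_db, size=win_frames, axis=1, mode="nearest")
    sq = uniform_filter1d(spec_db ** 2, size=win_frames, axis=1, mode="nearest")
    std = np.sqrt(np.maximum(sq - mean ** 2, 0.0))
    return mean + n_std * std
```

Masking then proceeds exactly as in the stationary case (`mask = spec_db > thresh`), but the threshold now tracks slow changes in the noise floor, such as a boat engine drifting closer to a hydrophone.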
Figure 2 illustrates the non-stationary algorithm’s utility. We took a one-minute recording of an American Robin (Macaulay Library 321642131; Fig 2A) and added the non-stationary noise of an airplane passing overhead (Fig 2B). We then applied both stationary and nonstationary Noisereduce (Fig 2C-D). During the highest amplitude period of airplane noise, the stationary algorithm leaves additional noise artifacts in the recording, unlike the non-stationary version (Fig 2E-G, blue/green, 20-30 seconds). Conversely, more of the signal is lost in sections with lower noise amplitude (e.g. 40-60 seconds, red/purple). We quantified this as the absolute error in dB, relative to the noise-free recording (Fig 2H), showing that the non-stationary algorithm performs consistently better with non-stationary noise in this case.
Fig. 2.
Comparison of stationary and non-stationary noise reduction. (A) Spectrogram of clean recording of an American Robin (Macaulay Library 321642131). (B) Airplane noise imposed over Robin Recording. (C-D) Denoising of (B) with (C) stationary noisereduce and (D) nonstationary noisereduce (window size of 2 seconds) (E-F) Magnitude error in stationary noisereduce vs ground truth for (E) stationary noisereduce and (F) nonstationary noisereduce. (G) Magnitude difference between stationary and nonstationary noisereduce. (H) Error (in dB) from ground truth for stationary (green) and nonstationary (purple) noisereduce.
Results
To evaluate the performance of Noisereduce, we tested it on a set of benchmark datasets across four domains: speech, bioacoustics, electrophysiology, and seismology (see 5.2). We compared its results against several noise reduction algorithms (see 5.3). The evaluation metrics used in the comparison are detailed in 5.5.
Speech
Speech is the best-established domain for enhancement and noise reduction33. Many speech noise reduction applications are well-suited to machine learning methods, especially deep neural networks like convolutional neural networks (CNNs)36,66, long short-term memory networks (LSTMs)12,19, and Generative Adversarial Networks (GANs)18,46, which outperform any conventional algorithm. We therefore present Noisereduce in this domain for two purposes. First, as a candidate “conventional algorithm” baseline. Second, Noisereduce may remain useful for speech applications where machine-learning-based approaches might not be well suited, such as out-of-domain speech signals, very lightweight applications where the computational costs of machine-learning-based approaches are too cumbersome, or the creation of new datasets with varying manipulations of noise levels.
We evaluated Noisereduce against other conventional noise reduction algorithms on the NOIZEUS dataset21,34,65 across various SNR levels (0, 5, 10, and 15 dB). Examples of speech spectrograms obtained with Noisereduce, Wiener30, Iterative Wiener30, Subspace16,20, Spectral Subtraction5 and Savitzky-Golay52 appear in Fig 3. Notably, Noisereduce preserves the speech signal without distortions, unlike other algorithms that introduce artifacts, particularly under low SNR. The performance metrics used were Short-Time Objective Intelligibility (STOI)57 and Perceptual Evaluation of Speech Quality (PESQ)47, which assess speech intelligibility and quality, respectively. STOI and PESQ results are in Tables 1 and 2, showing that Noisereduce outperforms the other conventional algorithms at all tested SNR levels, for the hyperparameters we sampled (Table 9).
Fig. 3.
Noise reduction samples from different algorithms applied to the ’sp04’ sample from the NOIZEUS dataset (SNR: 10 dB, exhibition noise).
Table 1.
STOI performance metric on NOIZEUS dataset (mean ± SEM) for different algorithms across various SNR levels.
| Algorithm | SNR 0 | SNR 5 | SNR 10 | SNR 15 |
|---|---|---|---|---|
| Baseline | 0.671 ± 0.004 | 0.783 ± 0.003 | 0.878 ± 0.003 | 0.937 ± 0.002 |
| Iterative Wiener | 0.509 ± 0.004 | 0.594 ± 0.004 | 0.664 ± 0.005 | 0.704 ± 0.005 |
| NoiseReduce (ours) | 0.683 ± 0.004 | 0.799 ± 0.003 | 0.893 ± 0.002 | 0.946 ± 0.002 |
| Savitzky-Golay | 0.668 ± 0.004 | 0.779 ± 0.003 | 0.875 ± 0.003 | 0.934 ± 0.002 |
| Spectral Subtraction | 0.417 ± 0.003 | 0.451 ± 0.002 | 0.479 ± 0.002 | 0.493 ± 0.002 |
| Subspace | 0.608 ± 0.004 | 0.682 ± 0.003 | 0.712 ± 0.004 | 0.724 ± 0.004 |
| Wiener | 0.668 ± 0.004 | 0.766 ± 0.004 | 0.840 ± 0.003 | 0.879 ± 0.002 |
Table 2.
PESQ performance metric on NOIZEUS dataset (mean ± SEM) for different algorithms across various SNR levels.
| Algorithm | SNR 0 | SNR 5 | SNR 10 | SNR 15 |
|---|---|---|---|---|
| Baseline | 1.421 ± 0.009 | 1.600 ± 0.010 | 1.878 ± 0.011 | 2.238 ± 0.013 |
| Iterative Wiener | 1.374 ± 0.010 | 1.516 ± 0.011 | 1.687 ± 0.013 | 1.874 ± 0.017 |
| Noisereduce (ours) | 1.559 ± 0.008 | 1.854 ± 0.009 | 2.286 ± 0.011 | 2.778 ± 0.012 |
| Savitzky-Golay | 1.475 ± 0.010 | 1.672 ± 0.011 | 1.973 ± 0.012 | 2.353 ± 0.014 |
| Spectral Subtraction | 1.493 ± 0.010 | 1.733 ± 0.009 | 2.064 ± 0.011 | 2.449 ± 0.012 |
| Subspace | 1.415 ± 0.009 | 1.407 ± 0.008 | 1.380 ± 0.006 | 1.379 ± 0.007 |
| Wiener | 1.458 ± 0.009 | 1.634 ± 0.009 | 1.858 ± 0.011 | 2.095 ± 0.012 |
Table 9.
Parameters for noise reduction.
| Parameter | Description |
|---|---|
| n_fft | Length of the windowed signal after padding with zeros, by default 1024. |
| win_length | Each frame of audio is windowed by a window of length win_length and then padded with zeros to match n_fft, by default None. |
| hop_length | Number of audio samples between adjacent STFT columns, by default None. |
| n_std_thresh | Number of standard deviations above mean to place the threshold between signal and noise, by default 1.5. |
| noise_window_size_nonstationary_ms | The window size (in milliseconds) to compute the noise floor over in the non-stationary algorithm, by default 1. |
| freq_mask_smooth_hz | The frequency range to smooth the mask over in Hz, by default 500. |
| time_mask_smooth_ms | The time range to smooth the mask over in milliseconds, by default 50. |
| prop_decrease | The proportion to reduce the noise by (1.0 = 100%), by default 1.0. |
To further evaluate its performance, we compared Noisereduce against a state-of-the-art deep learning-based model, Denoiser12. While Denoiser had higher STOI and PESQ scores (Tables 3 and 4), Noisereduce achieved competitive results with substantially lower computational overhead. Specifically, Denoiser requires over 33 million trainable parameters, whereas Noisereduce uses efficient signal processing techniques that require minimal computational resources and provide faster runtime (see Section 2.5).
Table 3.
Comparison of STOI performance metric for Noisereduce and Denoiser across various SNR levels (mean ± SEM).
| Algorithm | SNR 0 | SNR 5 | SNR 10 | SNR 15 |
|---|---|---|---|---|
| Noisereduce (ours) | 0.683 ± 0.004 | 0.799 ± 0.003 | 0.893 ± 0.002 | 0.946 ± 0.002 |
| Denoiser | 0.796 ± 0.005 | 0.88 ± 0.003 | 0.927 ± 0.002 | 0.951 ± 0.002 |
Table 4.
Comparison of PESQ performance metric for Noisereduce and Denoiser across various SNR levels (mean ± SEM).
| Algorithm | SNR 0 | SNR 5 | SNR 10 | SNR 15 |
|---|---|---|---|---|
| Noisereduce (ours) | 1.559 ± 0.008 | 1.854 ± 0.009 | 2.286 ± 0.011 | 2.778 ± 0.012 |
| Denoiser | 1.671 ± 0.015 | 2.04 ± 0.017 | 2.39 ± 0.018 | 2.703 ± 0.023 |
Bioacoustics
Bioacoustic signals are recorded across Earth’s diverse bioregions, with conditions often unique to each dataset. Consequently, state-of-the-art machine-learning methods are rarely available64, making bioacoustics an ideal domain for applying Noisereduce. To our knowledge, no benchmark dataset exists for bioacoustic noise reduction, unlike NOIZEUS34 for speech. To fill this gap, we developed “NOIZEUS Birdsong”49, a benchmark dataset modeled after NOIZEUS’s methodology and structure. We sampled recordings from 14 European starlings, with five 40-second songs from each bird, all recorded in an acoustically isolated chamber2. To simulate realistic conditions, we added noise at four SNRs: 0, 5, 10, and 15 dB. Noise samples were taken from the “Soundscapes from around the world” dataset from Xeno Canto61. We selected eight distinct soundscape categories, which we named “rain”, “town”, “wind”, “waterfall”, “insects”, “swamp”, “frogscape”, and “forest”. Each soundscape contains various sources of noise and was sampled from the European Starling’s natural geographic range. The dataset exhibits diverse spectro-temporal noise characteristics, illustrated in 5.2 (Fig 9).
Fig. 9.
Spectrograms of a sample from the “Birdsong NOIZEUS” dataset at an SNR of 10 dB, showcasing the clean signal, the noisy signals with different types of environmental noise.
We evaluated Noisereduce’s performance using the NOIZEUS Birdsong dataset and compared it with Savitzky-Golay and Wiener filtering, which are both domain-general noise reduction algorithms that performed well in the speech analysis. We measure improvements in Segmental Signal-to-Noise Ratio (SegSNR), which evaluates the quality of noise reduction across temporal segments, and Source-to-Distortion Ratio (SDR), which quantifies both signal degradation and residual noise. We find that Noisereduce outperforms the other conventional algorithms on both metrics (Tables 5, 6; Figure 4).
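As a rough guide to these two metrics, simplified versions can be computed as follows. This is a sketch: the SegSNR clamping range is a common convention, and this SDR omits the source-projection steps of full BSS-eval:

```python
import numpy as np

def seg_snr(clean, est, frame=256):
    """Segmental SNR in dB: mean of per-frame SNRs, each clamped to
    [-10, 35] dB as is conventional; trailing partial frames are dropped."""
    n = min(len(clean), len(est)) // frame * frame
    c = clean[:n].reshape(-1, frame)
    e = (clean[:n] - est[:n]).reshape(-1, frame)
    snr = 10 * np.log10((c ** 2).sum(axis=1) / ((e ** 2).sum(axis=1) + 1e-12) + 1e-12)
    return float(np.clip(snr, -10, 35).mean())

def sdr(clean, est):
    """Simplified source-to-distortion ratio in dB: total signal energy
    over total error energy (no distortion/artifact decomposition)."""
    err = clean - est
    return float(10 * np.log10((clean ** 2).sum() / ((err ** 2).sum() + 1e-12)))
```

Both metrics increase as the estimate approaches the clean reference, but they weight errors differently: SegSNR penalizes locally noisy segments, while SDR is a single global ratio.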
Table 5.
SegSNR [dB] performance metric on Birdsong NOIZEUS dataset (mean ± SEM) for different algorithms across various SNR levels.
| Algorithm | SNR 0 | SNR 5 | SNR 10 | SNR 15 |
|---|---|---|---|---|
| Baseline | −0.38 ± 0.40 | 4.59 ± 0.40 | 9.58 ± 0.40 | 14.58 ± 0.40 |
| Noisereduce (ours) | 6.96 ± 0.31 | 9.61 ± 0.33 | 11.78 ± 0.33 | 13.50 ± 0.39 |
| Savitzky-Golay | 0.27 ± 0.48 | 4.45 ± 0.46 | 8.41 ± 0.43 | 11.91 ± 2.47 |
| Wiener | −0.09 ± 0.42 | 4.51 ± 0.41 | 8.61 ± 0.39 | 11.82 ± 0.37 |
Table 6.
SDR [dB] performance metric on Birdsong NOIZEUS dataset (mean ± SEM) for different algorithms across various SNR levels.
| Algorithm | SNR 0 | SNR 5 | SNR 10 | SNR 15 |
|---|---|---|---|---|
| Baseline | −6.37 ± 0.36 | −1.43 ± 0.36 | 3.55 ± 0.37 | 8.56 ± 0.37 |
| Noisereduce (ours) | 0.79 ± 0.40 | 4.94 ± 0.37 | 8.69 ± 0.31 | 11.61 ± 0.25 |
| Savitzky-Golay | −5.41 ± 0.49 | −0.93 ± 0.49 | 3.61 ± 0.47 | 7.95 ± 0.44 |
| Wiener | −5.71 ± 0.43 | −0.71 ± 0.44 | 4.18 ± 0.45 | 8.53 ± 0.38 |
Fig. 4.
Noise reduction samples from different algorithms applied to the ’B335’ sample from the NOIZEUS Birdsong dataset (SNR: 10 dB, waterfall noise).
Electrophysiology
Extracellular electrophysiology is a key tool in recording single-neuron activity as animals interact with their environment. A challenge here is detecting extracellular spikes and assigning them to individual neurons, a process known as spikesorting. Current algorithms tackle this in steps: initially detecting spikes by thresholding amplitude or convolving the signal with spike templates, then iteratively clustering these putative spikes to estimate neuron identities, which provide templates for further detection. We tested whether Noisereduce could enhance initial spike detection by improving the SNR between spikes and background noise.
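A minimal amplitude-threshold spike detector of the kind described above can be sketched as follows. The MAD-based noise estimate and the 5-sigma threshold are common conventions in spike sorting, not the exact settings of any particular package:

```python
import numpy as np

def detect_spikes(x, n_mad=5.0, refractory=30):
    """Flag putative spikes where |x| exceeds n_mad times a robust noise
    estimate (median(|x|)/0.6745, the MAD-based sigma for Gaussian noise);
    `refractory` samples are skipped after each detection."""
    sigma = np.median(np.abs(x)) / 0.6745
    above = np.flatnonzero(np.abs(x) > n_mad * sigma)
    spikes, last = [], -refractory
    for i in above:
        if i - last >= refractory:
            spikes.append(i)
            last = i
    return np.array(spikes, dtype=int)
```

Because the median is robust to the spikes themselves, the threshold tracks the background noise level rather than the spike amplitudes, which is why improving the spike-to-background SNR (e.g., with Noisereduce) directly improves detection.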
We created a dataset of biophysically realistic neural recordings using the MEArec library9, simulating extracellular electrophysiology. Ground truth spikes were simulated from 10 neurons (8 excitatory, 2 inhibitory, Fig 5A), with noise from 300 background neurons. Simulated data were used as real recordings lack ground truth.
Fig. 5.
Noisereduce results on a simulated extracellular recording. (A) Sample neuron waveform templates. (B) A sample of 100ms of z-scored sampled neural data, with the original data in red and the denoised signal in black. (C-D) A spectrogram of the same data in B. (E) Amplitude of action potentials (blue) versus background noise (grey) in the original signal versus the denoised signal. (F) Receiver Operating Characteristic (ROC) curve of spike detection using the SpikeInterface detect_peaks algorithm.
We applied a modified Noisereduce approach to this data (Fig 5B), omitting the spectral mask smoothing step, which is computationally intensive and unnecessary for preliminary spike detection where spike shape is not used. We compared the output of Noisereduce to the untreated signal (bandpass filtered at 200-6000Hz; Fig 5C-D). We found that the spike amplitude (z-scored; Fig 5E) increased relative to background noise. To assess detection improvement, we used the SpikeInterface detection algorithm10 and computed an ROC curve by varying the detection threshold. Noisereduce was compared against three conditions: baseline bandpass filtering, Wiener filtering, and Savitzky-Golay filtering (Fig 5F). An Area Under the Curve (AUC) analysis found highest performance with Noisereduce (Noisereduce=0.97; Savitzky-Golay=0.96; Wiener = 0.94; Baseline=0.91), suggesting its suitability for initial spike detection. Given that spectral masking can alter spike shapes, we advise using Noisereduce solely for initial spike detection, not clustering.
Seismology
Seismic event detection methods focus on identifying the onset of these events, a critical step for accurately locating and characterizing seismic activity1,15. A widely used approach is the Short-Time Average over Long-Time Average (STA/LTA) algorithm59,63, which calculates the ratio of short-term to long-term signal averages to detect events. However, background noise from the environment and equipment makes detection less reliable, resulting in missed detections and false alarms.
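The STA/LTA characteristic function can be sketched as follows. This is a minimal illustration; the window lengths and trigger threshold are assumptions that would be tuned per dataset:

```python
import numpy as np

def sta_lta(x, n_sta=20, n_lta=200):
    """Classic STA/LTA characteristic function: the ratio of short-term to
    long-term moving averages of signal energy, with both windows aligned
    to end at the same sample."""
    csum = np.concatenate(([0.0], np.cumsum(x ** 2)))
    sta = (csum[n_sta:] - csum[:-n_sta]) / n_sta
    lta = (csum[n_lta:] - csum[:-n_lta]) / n_lta
    m = min(len(sta), len(lta))
    return sta[len(sta) - m:] / (lta[len(lta) - m:] + 1e-12)

def trigger_onset(ratio, threshold=3.0):
    """Index of the first ratio sample that crosses the trigger threshold
    (-1 if no crossing); ratio index i ends at signal sample i + n_lta - 1."""
    idx = np.flatnonzero(ratio > threshold)
    return int(idx[0]) if idx.size else -1
```

When an event begins, the short window fills with high-energy samples long before the long window does, so the ratio spikes at the onset; background noise inflates both averages, which is why denoising sharpens the trigger.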
Following Zhu et al. (2019)67, we tested Noisereduce on seismic waveforms from the ObsPy library3 (see Fig 6, top). To simulate realistic conditions, we added white and pink noise at SNRs ranging from 0 to 15 dB and evaluated detection accuracy by comparing STA/LTA-detected onset times between denoised and clean recordings. As with the spike-detection analysis, we applied a modified Noisereduce, omitting the smoothing step. In detecting the onset time of seismic activity, Noisereduce outperformed three baseline methods (no filtering, Wiener, and Savitzky-Golay) across all SNR levels, with the most significant improvements in low-SNR conditions (0 and 5 dB) for both noise types (see Tables 7 and 8).
Fig. 6.
(Top) Seismic recording sample “ev0_6.a01.gse2” from ObsPy dataset. The trigger, determined using the STA/LTA algorithm, marks the signal onset (red line). Noise added (pink, SNR = 1dB) and Noisereduce and DeepDenoiser are compared. (Bottom) Performance metrics for the seismology dataset. DeepDenoiser comparisons were generated using the DeepDenoiser API at “https://ai4eps-deepdenoiser.hf.space”.
Table 7.
Onset detection error (mean ± SEM) for different algorithms across various SNR levels for white noise.
| Algorithm | SNR 0 | SNR 5 | SNR 10 | SNR 15 |
|---|---|---|---|---|
| Baseline | 0.569 ± 0.106 | 0.385 ± 0.082 | 0.186 ± 0.042 | 0.090 ± 0.019 |
| Noisereduce (ours) | 0.192 ± 0.065 | 0.187 ± 0.058 | 0.124 ± 0.034 | 0.069 ± 0.019 |
| Wiener | 0.297 ± 0.063 | 0.237 ± 0.052 | 0.197 ± 0.054 | 0.080 ± 0.024 |
| Savitzky-Golay | 0.242 ± 0.058 | 0.201 ± 0.045 | 0.169 ± 0.044 | 0.090 ± 0.023 |
Table 8.
Onset detection error (mean ± SEM) for different algorithms across various SNR levels for pink noise.
| Algorithm | SNR 0 | SNR 5 | SNR 10 | SNR 15 |
|---|---|---|---|---|
| Baseline | 0.392 ± 0.102 | 0.308 ± 0.073 | 0.226 ± 0.052 | 0.111 ± 0.028 |
| Noisereduce (ours) | 0.129 ± 0.036 | 0.106 ± 0.025 | 0.074 ± 0.020 | 0.067 ± 0.017 |
| Wiener | 0.253 ± 0.085 | 0.244 ± 0.057 | 0.193 ± 0.039 | 0.093 ± 0.027 |
| Savitzky-Golay | 0.134 ± 0.023 | 0.171 ± 0.034 | 0.160 ± 0.028 | 0.101 ± 0.027 |
To further assess Noisereduce’s quality in detecting seismic signals, we compared it to DeepDenoiser67, a deep neural network-based approach for denoising seismic waveforms. We compared the SNR of the signal post-denoising, the correlation coefficient between clean and denoised signals, and the change in maximum amplitude of the signal from the clean recording. While Noisereduce underperforms the deep learning approach on all metrics, its performance is closer to the deep learning model than that of any other conventional algorithm, with the exception of the amplitude of the denoised signal, which Noisereduce attenuates more strongly (Fig. 6, bottom).
Run-time analysis
Speed is a critical factor in selecting a noise reduction method, especially in applications requiring real-time or near-real-time analysis. Noise reduction algorithms are of limited use if they cannot process signals in a timely manner, as delays can become bottlenecks in analytical workflows. Noisereduce supports GPU parallelization, which significantly improves processing speed. To evaluate performance, we measured the average runtime across various signal lengths using an NVIDIA GeForce RTX 3070 GPU. The results (Fig 7) demonstrate that GPU-accelerated Noisereduce outperforms other noise reduction algorithms, highlighting its potential for real-time applications.
Fig. 7.

Runtime analysis comparing GPU-based Noisereduce, CPU-based Noisereduce, GPU-based Denoiser, CPU-based Denoiser, Wiener filter, and Savitzky-Golay filter on an RTX 3070 GPU with batch size of 32, and sample rate of 16 kHz.
Selecting hyperparameters
Noisereduce relies on a small number of hyperparameters which impact how noise is detected and attenuated (Table 9).
The main two parameters to consider are n_std_thresh_stationary and prop_decrease. n_std_thresh_stationary sets the threshold for what to consider signal in terms of standard deviations of power above (or below) the mean power for each frequency channel. prop_decrease then determines the extent to which we remove the below-threshold noise. We additionally include noise_window_size_nonstationary_ms in the nonstationary version of the algorithm, which is the window over which threshold statistics are computed. freq_mask_smooth_hz and time_mask_smooth_ms, are used to smooth the mask using a Gaussian kernel, with the shape of the kernel defined by those parameters. A further implicit parameter is the duration of noise clip presented to the algorithm. For example, a very short noise clip may not accurately reflect the statistics of the noise profile in the full recording. Finally, n_fft, win_length, and hop_length are all parameters used to compute the spectrogram and should be set at values that would visibly capture spectrotemporal structure in your signal, if you were to plot the spectrogram.
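The role of prop_decrease can be illustrated as a blend between the original and fully gated STFT. This is a sketch of the concept, not the package’s internal code:

```python
import numpy as np

def apply_gate(S, mask, prop_decrease=1.0):
    """Blend between the original STFT and the fully gated STFT.
    prop_decrease=1.0 removes below-threshold components entirely,
    0.5 attenuates them by half, and 0.0 leaves the signal unchanged."""
    gated = S * mask
    return prop_decrease * gated + (1.0 - prop_decrease) * S
```

Values below 1.0 trade residual noise for fewer masking artifacts, which can sound more natural in audio applications.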
We performed an analysis of the robustness of this parameter selection on the Birdsong NOIZEUS dataset (Fig. 8). Although the optimal parameters will be both dataset- and downstream-application-specific, the analysis given here may provide some intuition for Noisereduce users. Broadly, we observe that a good range of choices for n_std_thresh_stationary is an intermediate one, between 1 and 5. Performance with prop_decrease generally continues to improve up to 1.0. Mask smoothing has more variable effects on the metrics we analyzed (SDR and SegSNR). Finally, as we increase the duration of the noise clip, thus obtaining a better estimate of the noise, noise reduction improves (here the maximum noise clip duration was the dataset maximum of 1 second). However, these metrics are an imperfect proxy for both perceptual quality and value in downstream tasks. To aid in intuiting the value of these parameters, we include supplementary audio clips of each of the samples in Fig. 8.
Fig. 8.
Visualization of parameters varied on the Birdsong NOIZEUS dataset (forest noise, SNR=0). (A) n_std_thresh (B) prop_decrease (C) freq_mask_smooth_hz and time_mask_smooth_ms (D) Noise clip length. (E) SDR values over a range of parameters for the Birdsong NOIZEUS dataset.
Discussion
In this work we provide a validation for Noisereduce as a domain-general noise-reduction algorithm. Our findings demonstrate that Noisereduce can perform similarly to and often outperform traditional noise reduction algorithms, making it suitable for a number of applications such as in bioacoustics and electrophysiology. Additionally, it can be used as a baseline comparison in domains where extensive domain-general machine-learning based approaches already exist. An important advantage of Noisereduce is its support for GPU parallelization, which accelerates processing speed compared to CPU-only algorithms. The algorithm does not rely on training data, and its lightweight design makes it suitable for real-time use or resource-limited settings where deep learning may not be practical. Noisereduce is publicly available as a Python package, actively maintained, and easy to use.
Limitations
Limitations of the Noisereduce algorithm
We do not recommend Noisereduce for all applications. In many domains, including speech, domain-specific and often supervised approaches exist that will generally outperform Noisereduce; one example is the Denoiser algorithm presented above. In other applications, we find that Noisereduce is a valuable starting place to improve signal-to-noise ratio. Even in domains where substantial domain-specific efforts exist, Noisereduce can remain a valuable tool. For example, Noisereduce can be used as an augmentation tool in creating training datasets, and it can be used in applications where a high-throughput, low-latency approach is needed. Noisereduce is also subject to the challenges of spectral masking. When time-frequency components contain both signal and noise, a binary masking approach will not optimally separate the signal from noise. Noisereduce also operates by assuming that the highest amplitude components of the recording are signal, which is not always the case. Noisereduce works best when noise is either stationary or nonstationary over timescales that are longer than the signal; when noise is intermittent over short timescales, particularly when its amplitude and frequency are similar to those of the signal, Noisereduce will not be able to differentiate signal from noise. In all cases, we recommend users carefully analyze the outputs of Noisereduce before blindly using it in an analysis pipeline.
Limitations in comparisons The work presented here attempts to benchmark Noisereduce against a set of comparison algorithms on several noise-reduction quality metrics. However, these comparisons are neither complete nor exhaustive. Each of the algorithms presented here has applications at which it excels and applications at which it fails. In the analyses provided here, we made a good-faith attempt to parameterize each algorithm in such a way that it would perform well. However, the hyperparameters chosen for both Noisereduce and its comparisons (Table 9) were neither exhaustively scanned nor systematically optimized. There are also several other domain-general noise-reduction approaches that have not been compared here, for example wavelet-based approaches13,14 and Empirical Mode Decomposition7,8. We therefore ask that readers interpret these results as we have: Noisereduce performs very well on the metrics we have provided, and at least consistently with other approaches. It is also true that the 'metrics' we chose for comparison are themselves only a rough proxy for what is wanted out of a noise-reduction algorithm, which differs depending on the application. No algorithm perfectly reflects human perceptual judgement in any domain. Even if one did, noise-reduction algorithms optimized for human perception would not necessarily be optimized for the many possible downstream tasks that one might perform on the denoised signal.
Methods
Implementation details
Algorithm
To reduce noise from time-series recordings, Noisereduce builds a mask over a time-frequency representation of the signal, which is used to separate noise from signal. The concept of spectral gating/subtraction/masking originates with Boll in 19795, and many variants of spectral gating have been developed since, ranging from simple statistics to more recent deep-learning-based approaches54. Noisereduce generates a spectral mask by computing descriptive statistics over the time-frequency representation of a noise clip and comparing them to the signal. The spectral mask is generated using the following steps:
Let $X$ denote the noisy input signal. If an isolated noise clip $X_n$ is available, it is used for statistics; otherwise, we set $X_n = X$. The noise Short-Time Fourier Transform (STFT), obtained with a Hann window of length $N_{\mathrm{fft}}$ and hop $H$, is:

$$Y_n[i, j] = \mathrm{STFT}(X_n)[i, j] \tag{1}$$

where indices $i$ and $j$ denote time frames and frequency bins, respectively. $Y_n$ is converted to a magnitude spectrogram in decibels ($S_n$):

$$S_n[i, j] = 20 \log_{10}\left(\lvert Y_n[i, j] \rvert + \epsilon\right) \tag{2}$$

where $\epsilon$ is a small positive constant for numerical stability.

We next compute statistics over the noise spectrogram $S_n$. For every frequency bin $j$, we compute the mean ($\mu[j]$) and standard deviation ($\sigma[j]$) across all time frames:

$$\mu[j] = \frac{1}{T} \sum_{i=1}^{T} S_n[i, j] \tag{3}$$

$$\sigma[j] = \sqrt{\frac{1}{T} \sum_{i=1}^{T} \left(S_n[i, j] - \mu[j]\right)^2} \tag{4}$$

where $T$ is the total number of time frames.

These statistics are used to create a threshold for each frequency bin ($\theta[j]$), from $\mu[j]$, $\sigma[j]$, and a hyperparameter ($k$) which sets the number of standard deviations above the mean:

$$\theta[j] = \mu[j] + k\,\sigma[j] \tag{5}$$

We can then create the mask for the signal. To do this, we compute the STFT ($Y$) of the signal clip ($X$):

$$Y[i, j] = \mathrm{STFT}(X)[i, j] \tag{6}$$

$Y$ is converted to a magnitude spectrogram in decibels ($S$):

$$S[i, j] = 20 \log_{10}\left(\lvert Y[i, j] \rvert + \epsilon\right) \tag{7}$$

The binary mask ($M$) is then computed on the signal spectrogram ($S$), based on the threshold for each frequency bin:

$$M[i, j] = \begin{cases} 1 & \text{if } S[i, j] > \theta[j] \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

To reduce artifacts from sharp transitions, $M$ can optionally be smoothed using a 2-D filter $W$, characterized by $w_f$ and $w_t$, which define the half-width of the filter in frequency and time, respectively. The filter $W$ is expressed as a separable matrix:

$$W = c \left(w_{\mathrm{freq}} \otimes w_{\mathrm{time}}\right) \tag{9}$$

where $\otimes$ denotes the outer product, and the components are defined as:

$$w_{\mathrm{freq}}[m] = 1 - \frac{\lvert m \rvert}{w_f} \tag{10}$$

$$w_{\mathrm{time}}[n] = 1 - \frac{\lvert n \rvert}{w_t} \tag{11}$$

The expressions are defined for $\lvert m \rvert \le w_f$ and $\lvert n \rvert \le w_t$, effectively creating symmetric triangular windows. The normalization constant $c$ is determined such that the sum of all elements in the 2-D filter satisfies $\sum_{m,n} W[m, n] = 1$.

If smoothing is applied, the smoothed mask $M_s$ is obtained by convolving the original mask $M$ with the smoothing filter $W$:

$$M_s = M * W \tag{12}$$

We can then apply the mask to the STFT of the signal ($Y$) by multiplying $M$ (or $M_s$) with $Y$ to produce the masked STFT ($\hat{Y}$):

$$\hat{Y}[i, j] = Y[i, j]\left(1 - p\left(1 - M[i, j]\right)\right) \tag{13}$$

where $p \in [0, 1]$ is a scaling factor that controls the strength of the masking effect.

The masked STFT ($\hat{Y}$) is then inverted back into the time domain ($\hat{X}$) using an inverse STFT:

$$\hat{X} = \mathrm{ISTFT}(\hat{Y}) \tag{14}$$
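The steps above can be sketched in a few lines of NumPy/SciPy. This is an illustrative re-implementation under simplifying assumptions (no mask smoothing, and the parameter names `k`, `p`, and `eps` are ours), not the package's actual code:

```python
# Illustrative sketch of stationary spectral gating (not the Noisereduce source).
import numpy as np
from scipy.signal import stft, istft

def spectral_gate(x, x_noise, sr, n_fft=1024, k=2.0, p=1.0, eps=1e-10):
    # STFTs of noise clip and signal (Hann window), Eqs. (1) and (6)
    _, _, Yn = stft(x_noise, fs=sr, nperseg=n_fft)
    _, _, Y = stft(x, fs=sr, nperseg=n_fft)
    # dB magnitude spectrograms, Eqs. (2) and (7)
    Sn = 20 * np.log10(np.abs(Yn) + eps)
    S = 20 * np.log10(np.abs(Y) + eps)
    # Per-frequency-bin mean/std and threshold, Eqs. (3)-(5)
    theta = Sn.mean(axis=1) + k * Sn.std(axis=1)
    # Binary mask, Eq. (8), applied with strength p, Eq. (13)
    M = (S > theta[:, None]).astype(float)
    Y_hat = Y * (1.0 - p * (1.0 - M))
    # Back to the time domain, Eq. (14)
    _, x_hat = istft(Y_hat, fs=sr, nperseg=n_fft)
    return x_hat[: len(x)]
```

Applied to a tone embedded in broadband noise, the gated output suppresses energy in noise-only regions while retaining the tone.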
Non-stationary The non-stationary algorithm differs from the stationary version of Noisereduce in how the noise mask is computed. The central goal of the non-stationary algorithm is to compute a noise mask locally in time rather than globally across the entire recording or dataset, to account for fluctuations in the noise floor. To accomplish this, we simply compute the mean and standard deviation of the frequency components over a sliding window on X without a noise clip, and then proceed with the rest of the algorithm normally.
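The local statistics of the non-stationary variant can be sketched with a simple boxcar sliding window over time (the window length `win_frames` and the function name are our illustrative choices; the package exposes its own smoothing parameters):

```python
# Illustrative sketch: sliding-window statistics per frequency bin,
# yielding a time-varying threshold analogous to Eq. (5).
import numpy as np
from scipy.ndimage import uniform_filter1d

def local_threshold(S, win_frames=50, k=2.0):
    """S: dB spectrogram (freq x time). Returns a threshold per bin and frame."""
    mu = uniform_filter1d(S, size=win_frames, axis=1)
    # local variance via E[x^2] - E[x]^2 over the same window
    var = uniform_filter1d(S**2, size=win_frames, axis=1) - mu**2
    sigma = np.sqrt(np.clip(var, 0.0, None))  # clip tiny negatives from rounding
    return mu + k * sigma
```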
Soft Mask While the current implementation uses a binary mask (0 or 1), future work could explore a soft mask with values between 0 and 1 to achieve smoother signal-noise separation.
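One way such a soft mask could look, sketched here as a sigmoid of the dB distance above the per-bin threshold (the `sharpness` parameter is hypothetical, not part of the current release):

```python
# Hypothetical soft mask: a sigmoid replaces the hard 0/1 decision of Eq. (8).
import numpy as np

def soft_mask(S, theta, sharpness=1.0):
    """S: dB spectrogram (freq x time); theta: per-bin thresholds."""
    return 1.0 / (1.0 + np.exp(-sharpness * (S - theta[:, None])))
```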
Datasets
Speech The evaluation included thirty phonetically balanced speech utterances from the “Noisy Speech Corpus” (NOIZEUS) database21,34,65, a database specifically designed for noisy speech research. These utterances were combined with eight distinct real-world noise types, including suburban train, babble, car, exhibition hall, restaurant, street, airport, and train station noises, at SNRs of 0 dB, 5 dB, 10 dB, and 15 dB, following Method B of the ITU-T P.56 standard23.
Birdsong We created the Birdsong NOIZEUS dataset49 in the likeness of the speech NOIZEUS dataset34. We selected 70 song samples from 14 European starlings (5 samples of 40 seconds each). These recordings were selected from a larger collection previously gathered by the authors for prior publications2,50. The original dataset contains several hundred 30- to 60-second recordings per bird, obtained from wild-caught European starlings in Southern California. Recordings were performed in acoustically isolated chambers to ensure high-quality audio capture. We sampled noise from the “Soundscapes from around the world” dataset from Xeno Canto61. We hand-selected 8 soundscapes from this dataset, which we named “rain”, “town”, “wind”, “waterfall”, “insects”, “swamp”, “frogscape”, and “forest”. Each soundscape contains various sources of noise and was sampled from the European Starling’s natural range. For each song, we selected a different segment of the soundscape (soundscapes were around 5-20 minutes each). An example of the dataset can be seen in Figure 9. We set the SNR based on loudness measured using the pyloudnorm Python library56. Additionally, for each song and noise clip we included a 1-second clip of noise sampled randomly (at the same SNR as the audio clip). This dataset is publicly available on Zenodo (DOI: 10.5281/zenodo.13947444).
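The mixing procedure can be sketched as follows. Note that the published dataset set SNR using perceptual loudness via pyloudnorm; the simpler power-based scaling below only illustrates the idea:

```python
# Illustrative sketch: mix a noise clip into a signal at a target SNR,
# using power ratios rather than the perceptual loudness used in the paper.
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    noise = noise[: len(signal)]
    p_sig = np.mean(signal**2)
    p_noise = np.mean(noise**2)
    # scale noise so that 10*log10(p_sig / p_noise_scaled) == snr_db
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise
```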
Seismology The seismic data used in this study were obtained from the ObsPy Trigger/Picker Tutorial3, including waveform recordings from three seismic stations: EV, RJOB, and MANZ. The dataset, recorded in January 1970, contains natural seismic events specifically selected for evaluating triggering and picking algorithms. For our analysis, we used all available signals from the trigger dataset, excluding those with low SNR. To simulate realistic seismic and instrumental noise, we added both white noise and pink noise at varying SNRs (0, 5, 10, and 15 dB).
Electrophysiology Electrophysiology datasets were generated using the MEArec9 Python library so that we would have access to ground-truth spiking events alongside the electrophysiology. Some non-simulated ephys datasets record ground-truth events, e.g. by pairing extracellular recordings with intracellular recordings37, but, since not all cells are recorded intracellularly, they are of limited value in differentiating between false-positive and true-positive detections of other cells. MEArec facilitates the generation of customizable extracellular spiking-activity datasets by leveraging biophysically detailed simulations. It achieves this by first creating templates of extracellular action potentials using realistic cell models, positioned around electrode probes within a simulation environment. These cell models, drawn from established neuroscience databases, undergo intracellular simulation to compute transmembrane currents using tools like NEURON, while the extracellular potentials are calculated using methods such as the line-source approximation via the LFPy package. This process allows MEArec to accurately simulate various neural dynamics and probe configurations, offering a flexible framework for evaluating and developing spike-sorting methods under controlled experimental conditions.
We generated a dataset comprising a monotrode (single-channel) recording 10 minutes in length. The recording had 10 neurons (8 excitatory and 2 inhibitory). Spikes ranged in amplitude from 75 to 150 µV. Background noise was generated using 300 simulated neurons located further from the probe (each with a maximum amplitude of 75 µV). Simulated data were bandpass filtered between 300 and 6000 Hz.
Additional algorithms
Wiener Filter The Wiener filter, as implemented in31, is an adaptive noise-reduction algorithm that analyzes local statistics within a sliding window. The filter adapts to local signal characteristics by weighting the difference between the noisy observation and the local mean based on the local variance. It applies minimal smoothing in high-variance regions to preserve significant signal features, while employing more aggressive smoothing in low-variance areas presumed to be noise-dominated.
The output signal $\hat{x}[n]$ at index $n$ is computed using:

$$\hat{x}[n] = \mu[n] + \frac{\max\left(\sigma^2[n] - \nu^2,\, 0\right)}{\max\left(\sigma^2[n],\, \nu^2\right)} \left(y[n] - \mu[n]\right) \tag{15}$$

where $y[n]$ is the observed noisy signal, $\mu[n]$ is the local mean within a window centered around $n$, $\sigma^2[n]$ is the variance of the signal in that window, and $\nu^2$ is the estimated noise variance, calculated as the average of all local variances across the signal.
We used the SciPy implementation62 as a comparison.
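A direct NumPy sketch of Eq. (15) might look as follows (`scipy.signal.wiener` provides an equivalent, optimized implementation; the function name here is ours):

```python
# Illustrative sketch of the local Wiener filter of Eq. (15).
import numpy as np

def wiener_1d(y, win=5):
    pad = win // 2
    yp = np.pad(y, pad, mode="edge")
    # local mean and variance within a sliding window centered on each sample
    windows = np.lib.stride_tricks.sliding_window_view(yp, win)
    mu = windows.mean(axis=1)
    var = windows.var(axis=1)
    nu2 = var.mean()  # noise variance = average of all local variances
    # gain -> 0 in low-variance (noise-dominated) regions, -> 1 where var >> nu2
    gain = np.where(var > nu2, (var - nu2) / np.maximum(var, 1e-12), 0.0)
    return mu + gain * (y - mu)
```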
Iterative Wiener The Iterative Wiener algorithm30 performs noise reduction in the frequency domain by iteratively computing a Wiener filter for each frame. The Wiener filter for each frequency $\omega$ is defined as:

$$H(\omega) = \frac{P_s(\omega)}{P_s(\omega) + \sigma_d^2} \tag{16}$$

where $P_s(\omega)$ is the speech power spectral density, and $\sigma_d^2$ is the noise variance.
To determine if a frame contains speech, a simple energy threshold is used. When speech is detected, the algorithm refines the clean signal estimate by iteratively calculating Linear Predictive Coding (LPC) coefficients of the input frame, which model the vocal tract as an all-pole filter. These LPC coefficients help estimate the speech power spectrum and update the Wiener filter to reduce noise. After denoising, new LPC coefficients are calculated from the denoised signal, further improving the filter.
When no speech is detected, the noise variance is updated. The algorithm uses an IIR filter to smooth the noise estimate. The noise variance $\sigma_d^2$ is updated as follows:

$$\sigma_d^2 \leftarrow \alpha\, \sigma_d^2 + (1 - \alpha)\, E \tag{17}$$

where $\alpha$ is the smoothing factor, and $E$ is the energy of the input frame.
We used the pyroomacoustics library53 as a comparison.
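The per-frequency gain of Eq. (16) and the noise update of Eq. (17) are simple enough to sketch directly (function names are ours, for illustration only):

```python
# Illustrative sketches of Eqs. (16) and (17).
import numpy as np

def wiener_gain(P_speech, noise_var):
    """Eq. (16): per-frequency Wiener gain, between 0 and 1."""
    return P_speech / (P_speech + noise_var)

def update_noise_var(noise_var, frame_energy, alpha=0.8):
    """Eq. (17): IIR-smoothed noise-variance update for noise-only frames."""
    return alpha * noise_var + (1.0 - alpha) * frame_energy
```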
Savitzky-Golay Filter The Savitzky-Golay filter52 is a smoothing technique that employs a polynomial fitting approach. By fitting low-degree polynomials to successive subsets of adjacent data points, it effectively reduces noise while preserving important signal features. The output signal $\hat{x}[n]$ at index $n$ is computed using:

$$\hat{x}[n] = \sum_{i=-m}^{m} c_i\, y[n + i] \tag{18}$$

where $2m + 1$ represents the window size, $y[n + i]$ are the input data points within the window, and $c_i$ are the convolution coefficients derived by the polynomial fitting.
We used the SciPy implementation62 as a comparison.
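A naive sketch of the polynomial-fitting view behind Eq. (18): fit a low-degree polynomial to each window and evaluate it at the center. `scipy.signal.savgol_filter` computes the same result far more efficiently via fixed convolution coefficients:

```python
# Illustrative (slow) Savitzky-Golay smoothing via explicit per-window fits.
import numpy as np

def savgol_naive(y, window_length=5, polyorder=2):
    half = window_length // 2
    yp = np.pad(y, half, mode="edge")
    out = np.empty_like(y, dtype=float)
    xs = np.arange(window_length) - half  # window positions relative to center
    for n in range(len(y)):
        coeffs = np.polyfit(xs, yp[n : n + window_length], polyorder)
        out[n] = np.polyval(coeffs, 0)  # value of the fit at the window center
    return out
```

Because the fit is exact for polynomials up to `polyorder`, a quadratic input passes through unchanged (away from the padded edges).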
Spectral Subtraction The Spectral Subtraction algorithm5 performs noise reduction by subtracting an estimate of the noise spectrum from the spectrum of the noisy signal. It operates under the assumption that the noise is additive and uncorrelated with the signal. The output signal spectrum $\lvert \hat{X}(\omega) \rvert$ is computed as:

$$\lvert \hat{X}(\omega) \rvert = \max\left(\lvert Y(\omega) \rvert - \lvert \hat{D}(\omega) \rvert,\, 0\right) \tag{19}$$

where $Y(\omega)$ is the Fourier transform of the noisy signal, and $\hat{D}(\omega)$ is the estimated noise spectrum. The $\max$ function ensures that the resulting magnitude is non-negative. The noise spectrum $\hat{D}(\omega)$ is obtained during periods of silence in the signal, under the assumption that only noise is present.

To reconstruct the clean signal, the inverse Fourier transform is applied to $\lvert \hat{X}(\omega) \rvert$ while using the phase information from the noisy signal $Y(\omega)$.
We used the pyroomacoustics library53 as a comparison.
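Eq. (19) can be sketched in a few lines of NumPy, reusing the noisy phase for reconstruction (here with a single FFT over the whole clip for simplicity, rather than the frame-by-frame processing used in practice):

```python
# Illustrative sketch of magnitude spectral subtraction, Eq. (19).
import numpy as np

def spectral_subtract(y, noise_mag):
    """y: noisy signal; noise_mag: estimated noise magnitude per rfft bin."""
    Y = np.fft.rfft(y)
    mag = np.maximum(np.abs(Y) - noise_mag, 0.0)  # floor at zero
    # reconstruct with the subtracted magnitude and the original (noisy) phase
    return np.fft.irfft(mag * np.exp(1j * np.angle(Y)), n=len(y))
```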
Subspace The Subspace algorithm16,20 performs noise reduction by projecting the noisy signal $y[n]$ onto a lower-dimensional subspace that primarily contains the clean signal, while the noise is assumed to lie in the complementary subspace.

An eigendecomposition is performed on the matrix:

$$\Sigma = R_d^{-1} R_y - I \tag{20}$$

where $R_d$ is the noise covariance matrix, $R_y$ is the covariance matrix of the input noisy signal, and $I$ is the identity matrix. $R_d$ is obtained during periods of silence in the signal, under the assumption that only noise is present.

The cleaned signal $\hat{x}[n]$ is obtained by projecting the noisy signal $y[n]$ onto the signal subspace:

$$\hat{x}[n] = H\, y[n] \tag{21}$$

where $H$ is the projection matrix, derived from the eigenvectors of $\Sigma$ with positive eigenvalues.
We used the pyroomacoustics library53 as a comparison.
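A simplified sketch of the projection in Eqs. (20) and (21), omitting the eigenvalue-dependent gain weighting that full implementations such as pyroomacoustics apply to the retained components:

```python
# Simplified sketch of subspace denoising: keep only the eigenvectors of
# Sigma = Rn^{-1} Ry - I with positive eigenvalues (Eqs. 20-21).
import numpy as np

def subspace_denoise(y_frames, Rn):
    """y_frames: (n_frames, frame_len) matrix of frames; Rn: noise covariance."""
    Ry = (y_frames.T @ y_frames) / len(y_frames)       # noisy-signal covariance
    Sigma = np.linalg.inv(Rn) @ Ry - np.eye(Rn.shape[0])
    vals, vecs = np.linalg.eig(Sigma)
    keep = vals.real > 0                               # signal subspace
    V = vecs[:, keep].real
    H = V @ np.linalg.pinv(V)                          # projection onto span(V)
    return y_frames @ H.T
```

When the signal power dominates the noise in every direction, all eigenvalues are positive and the projection reduces to the identity, passing the frames through unchanged.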
DeepDenoiser DeepDenoiser67 is a deep neural network designed to denoise seismic signals by learning to separate the signal from noise. It is trained on datasets containing both noisy and clean waveform data. During the denoising process, the input seismic signal is first converted into the time-frequency domain. The network then uses a series of fully convolutional layers with skip connections to generate two masks: one for the signal and one for the noise. Finally, the denoised signal and the estimated noise are obtained by applying the inverse Short-Time Fourier Transform.
Denoiser Denoiser12 is a deep learning model designed to denoise speech signals. It processes raw audio waveforms using an encoder-decoder architecture with skip connections. The model is optimized across both time and frequency domains through multiple loss functions and is trained end-to-end on paired datasets of noisy and clean speech.
Hyperparameters
We used the hyperparameters listed in Table 10 for comparison.
Table 10.
Noise reduction algorithms and their parameters.
| Algorithm | Parameters |
|---|---|
| Noisereduce | n_fft=1024, win_length=256, n_std_thresh_stationary=2, prop_decrease=0.75, freq_mask_smooth_hz=50, time_mask_smooth_ms=32 |
| Spectral Subtraction | nfft=512, db_reduce=10, lookback=5, beta=20, alpha=3 |
| Savitzky-Golay | window_length=5, polyorder=2 |
| Subspace | frame_len=32, mu=10, lookback=10, skip=2, thresh=0.001 |
| Iterative Wiener | frame_len=64, lpc_order=20, iterations=2, alpha=0.8, thresh=0.01 |
| Wiener | window_size=5 |
Metrics
STOI Short-Time Objective Intelligibility (STOI)57 is a widely used objective metric for assessing speech intelligibility, particularly in noisy environments. It functions by comparing short-time segments of clean reference and degraded speech signals, quantifying the level of degradation in terms of intelligibility. STOI calculates the correlation between the two signals over overlapping time windows, producing a score between 0 and 1. Higher scores indicate better intelligibility.
PESQ Perceptual Evaluation of Speech Quality (PESQ)47 is another objective metric designed to evaluate speech quality as perceived by human listeners. It incorporates a psychoacoustic model to simulate the human auditory system, comparing degraded or processed speech to a clean reference. PESQ captures both time-domain distortions and perceptual differences, generating a score between −0.5 and 4.5. Higher scores signify better perceived speech quality. Unlike STOI, which focuses on intelligibility, PESQ is more concerned with overall speech quality.
SDR Signal-to-Distortion Ratio (SDR) evaluates the quality of source separation by measuring the logarithmic ratio of the power of the target source signal to the power of distortions, such as interference, noise, and artifacts. Higher SDR values indicate better separation performance, with fewer distortions. SDR is defined as follows:

$$\mathrm{SDR} = 10 \log_{10}\left(\frac{\sum_n x[n]^2}{\sum_n \left(x[n] - \hat{x}[n]\right)^2}\right) \tag{22}$$

where $x$ is the true clean signal, and $\hat{x}$ is the estimated signal.
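Eq. (22) translates directly to code:

```python
# SDR, Eq. (22): clean-signal power over residual-error power, in dB.
import numpy as np

def sdr(x, x_hat):
    return 10 * np.log10(np.sum(x**2) / np.sum((x - x_hat) ** 2))
```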
SegSNR Segmental Signal-to-Noise Ratio (SegSNR or SSNR), a modified version of the Signal-to-Noise Ratio (SNR), provides a more localized assessment of signal quality. While traditional SNR evaluates the signal-to-noise ratio across the entire signal, SegSNR calculates the SNR within smaller segments and then averages these values. SegSNR is defined as follows:

$$\mathrm{SegSNR} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{SNR}(n) \tag{23}$$

where $N$ is the total number of segments, and $\mathrm{SNR}(n)$ is the SNR of the $n$-th segment.
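A sketch of Eq. (23); note that practical implementations often clamp per-segment SNRs to a fixed range, shown here as an option with bounds chosen for illustration:

```python
# SegSNR, Eq. (23): average of per-segment SNRs, optionally clamped.
import numpy as np

def seg_snr(x, x_hat, seg_len=256, clamp=(-10.0, 35.0)):
    n_seg = len(x) // seg_len
    snrs = []
    for n in range(n_seg):
        s = x[n * seg_len : (n + 1) * seg_len]
        e = s - x_hat[n * seg_len : (n + 1) * seg_len]
        snrs.append(10 * np.log10(np.sum(s**2) / (np.sum(e**2) + 1e-12)))
    return float(np.mean(np.clip(snrs, *clamp)))
```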
AUC The Receiver Operating Characteristic (ROC) curve is used to evaluate the performance of binary classifiers. The ROC curve plots true positive rate against false positive rate for various classification thresholds. The Area Under the Curve (AUC) quantifies the overall performance, ranging from 0.5 (random guessing) to 1.0 (perfect classification).
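For reference, AUC can be computed directly from detection scores as the probability that a randomly chosen positive outranks a randomly chosen negative (ties counted half), which is equivalent to the area under the ROC curve:

```python
# AUC via the pairwise-ranking formulation (equivalent to area under ROC).
import numpy as np

def auc(labels, scores):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```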
Acknowledgements
We thank David Burshtein for their feedback on an earlier version of this manuscript. Published by a grant from the Wetmore Colles fund.
Author contributions
Both authors contributed equally to writing, software development, and experiments.
Data availability
The Birdsong NOIZEUS dataset was generated for this publication. It is available at https://zenodo.org/records/13947444
Code availability
The implementation of the Noisereduce algorithm is available at: https://github.com/timsainb/noisereduce. A future version of this software may be migrated to https://github.com/noisereduce/noisereduce. The experimental results, including all necessary configuration files and scripts for reproducing the experiments, are provided at: https://github.com/noisereduce/paper_noisereduce.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Tim Sainburg and Asaf Zorea contributed equally to this work.
Asaf Zorea is an independent researcher
References
- 1.Allen, R. V. Automatic earthquake recognition and timing from single traces. Bulletin of the Seismological Society of America68(5), 1521–1532. 10.1785/BSSA0680051521 (1978). [Google Scholar]
- 2.Arneodo, Z., Sainburg, T., Jeanne, J. & Gentner, T. An acoustically isolated european starling song library, (2019).
- 3.Beyreuther, M. et al. ObsPy: A Python Toolbox for Seismology. Seismological Research Letters81(3), 530–533. 10.1785/gssrl.81.3.530 (2010). [Google Scholar]
- 4.Bhatt, R., Singh, S., Choudhary, P. & Saini, M. An experimental study of the concept drift challenge in farm intrusion detection using audio. In 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–8. IEEE, (2022).
- 5.Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on acoustics, speech, and signal processing27(2), 113–120 (1979). [Google Scholar]
- 6.Borgnat, P., Flandrin, P., Honeine, P., Richard, C. & Xiao, J. Testing stationarity with surrogates: A time-frequency approach. IEEE Transactions on Signal Processing58(7), 3459–3470 (2010). [Google Scholar]
- 7.Boudraa, A.-O. & Cexus, J.-C. Emd-based signal filtering. IEEE transactions on instrumentation and measurement56(6), 2196–2202 (2007). [Google Scholar]
- 8.Boudraa, A.-O. et al. Denoising via empirical mode decomposition. Proc. IEEE ISCCSP4, 2006 (2006). [Google Scholar]
- 9.Buccino, A. P. & Einevoll, G. T. Mearec: a fast and customizable testbench simulator for ground-truth extracellular spiking activity. Neuroinformatics19(1), 185–204 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Buccino, A. P. et al. Spikeinterface, a unified framework for spike sorting. Elife9, e61834 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Chen, X., Wang, R., Khalilian-Gourtani, A., Yu, L., Dugan, P., Friedman, D., Doyle, W., Devinsky, O., Wang, Y. & Flinker, A. A neural speech decoding framework leveraging deep learning and speech synthesis. Nature Machine Intelligence, pages 1–14, (2024).
- 12.Defossez, A., Synnaeve, G. & Adi, Y. Real time speech enhancement in the waveform domain, (2020). URL https://arxiv.org/abs/2006.12847.
- 13.Donoho, D. L. & Johnstone, I. M. Ideal spatial adaptation by wavelet shrinkage. biometrika81(3), 425–455 (1994). [Google Scholar]
- 14.Donoho, D. L., Johnstone, I. M., Kerkyacharian, G. & Picard, D. Wavelet shrinkage: asymptopia?. Journal of the Royal Statistical Society: Series B (Methodological)57(2), 301–337 (1995). [Google Scholar]
- 15.Earle, P. S. & Shearer, P. M. Characterization of global seismograms using an automatic-picking algorithm. Bulletin of the Seismological Society of America84(2), 366–376. 10.1785/BSSA0840020366 (1994). [Google Scholar]
- 16.Ephraim, Y. & Van Trees, H. A signal subspace approach for speech enhancement. IEEE Transactions on Speech and Audio Processing3(4), 251–266. 10.1109/89.397090 (1995). [Google Scholar]
- 17.Fleishman, E. et al. Ecological inferences about marine mammals from passive acoustic data. Biological Reviews98(5), 1633–1647 (2023). [DOI] [PubMed] [Google Scholar]
- 18.Hao, X., Su, X., Wang, Z., Zhang, H. & Unetgan, Batushiren. A robust speech enhancement approach in time domain for extremely low signal-to-noise ratio condition. In Interspeech 2019. ISCA, Sept. (2019). URL 10.21437/Interspeech.2019-1567.
- 19.Hao, X., Su, X., Horaud, R. & Li, X. Fullsubnet: A full-band and sub-band fusion model for real-time single-channel speech enhancement. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, (June 2021). 10.1109/icassp39728.2021.9414177.
- 20.Hu, Y. & Loizou, P.C. A subspace approach for enhancing speech corrupted by colored noise. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I–573–I–576, (2002). 10.1109/ICASSP.2002.5743782.
- 21.Hu, Y. & Loizou, P. C. Evaluation of Objective Quality Measures for Speech Enhancement. IEEE Transactions on Audio, Speech, and Language Processing16(1), 229–238. 10.1109/TASL.2007.911054 (2008). [Google Scholar]
- 22.Injaian, A. S., Lane, E. D. & Klinck, H. Aircraft events correspond with vocal behavior in a passerine. Scientific Reports11(1), 1197 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.International Telecommunication Union. P.56:Objective measurement of active speech level, (1993). URL https://www.itu.int/rec/T-REC-P.56.
- 24.Jung, D.-H. et al. Deep learning-based cattle vocal classification model and real-time livestock monitoring system with noise filtering. Animals11(2), 357 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Lee, Y.-E., Kim, S.-H., Lee, S.-H., Lee, J.-S., Kim, S. & Lee, S.-W. Speech synthesis from brain signals based on generative model. In 2023 11th International Winter Conference on Brain-Computer Interface (BCI), pages 1–4. IEEE, (2023a).
- 26.Lee, Y.-E., Lee, S.-H., Kim, S.-H. & Lee, S.-W. Towards voice reconstruction from eeg during imagined speech. In Proceedings of the AAAI Conference on Artificial Intelligence37, 6030–6038 (2023). [Google Scholar]
- 27.Li, J.-H. et al. Multi-sensor fusion approach to drinking activity identification for improving fluid intake monitoring. Applied Sciences14(11), 4480 (2024). [Google Scholar]
- 28.Li, N. & Loizou, P. C. Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction. The Journal of the Acoustical Society of America123(3), 1673–1682 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Li, W. et al. Global-local-feature-fused driver speech emotion detection for intelligent cockpit in automated driving. IEEE Transactions on Intelligent Vehicles8(4), 2684–2697 (2023). [Google Scholar]
- 30.Lim, J. & Oppenheim, A. All-pole modeling of degraded speech. IEEE Transactions on Acoustics, Speech, and Signal Processing26(3), 197–210. 10.1109/TASSP.1978.1163086 (1978). [Google Scholar]
- 31.Lim, J. S. Two-Dimensional Signal and Image Processing (Prentice Hall, 1990). [Google Scholar]
- 32.Liu, Z. et al. Machine learning of transcripts and audio recordings of spontaneous speech for diagnosis of alzheimer’s disease. Alzheimer’s & Dementia17, e057556 (2021). [Google Scholar]
- 33.Loizou, P. Speech Enhancement: Theory and Practice, Second Edition. Taylor & Francis, ISBN 9781466504219. (2013). URL https://books.google.co.il/books?id=ntXLfZkuGTwC.
- 34.Loizou, P.C. NOIZEUS: Noisy speech corpus - Univ. Texas-Dallas, (2007). URL https://ecs.utdallas.edu/loizou/speech/noizeus/.
- 35.Lostanlen, V. et al. Per-channel energy normalization: Why and how. IEEE Signal Processing Letters26(1), 39–43 (2018). [Google Scholar]
- 36.Macartney, C. & Weyde, T. Improved speech enhancement with the wave-u-net, (2018). URL https://arxiv.org/abs/1811.11307.
- 37.Magland, J. et al. Spikeforest, reproducible web-facing ground-truth validation of automated neural spike sorters. Elife9, e55167 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Maher, S. P., Dawson, P. B., Hotovec-Ellis, A. J., Thelen, W. A. & Matoza, R. S. Automated detection of volcanic seismicity using network covariance and image processing. Seismological Research Letters95(5), 2580–2594 (2024). [Google Scholar]
- 39.Mandala, S. et al. Enhanced myocardial infarction identification in phonocardiogram signals using segmented feature extraction and transfer learning-based classification. IEEE Access11, 136654–136665 (2023). [Google Scholar]
- 40.Mazzocconi, C., O’Brien, B. & Chaminade, T. How do you laugh in an fmri scanner? laughter distribution, mimicry and acoustic analysis. In Disfluency in Spontaneous Speech (DiSS) Workshop 2023, (2023).
- 41.McEwen, B. et al. Automatic noise reduction of extremely sparse vocalisations for bioacoustic monitoring. Ecological Informatics77, 102280 (2023). [Google Scholar]
- 42.McGinn, K., Kahl, S., Peery, M. Z., Klinck, H. & Wood, C. M. Feature embeddings from the birdnet algorithm provide insights into avian ecology. Ecological Informatics74, 101995 (2023). [Google Scholar]
- 43.Megela Simmons, A., Simmons, J. A. & Bates, M. E. Analyzing acoustic interactions in natural bullfrog (rana catesbeiana) choruses. Journal of Comparative Psychology122(3), 274 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Mehrish, A., Majumder, N., Bharadwaj, R., Mihalcea, R. & Poria, S. A review of deep learning techniques for speech processing. Information Fusion99, 101869 (2023). [Google Scholar]
- 45.Michaud, F., Sueur, J., Le Cesne, M. & Haupert, S. Unsupervised classification to improve the quality of a bird song recording dataset. Ecological Informatics74, 101952 (2023). [Google Scholar]
- 46.Pascual, S., Bonafonte, A. & Serrà, J. Segan: Speech enhancement generative adversarial network, (2017). URL https://arxiv.org/abs/1703.09452.
- 47.Rix, A., Beerends, J., Hollier, M. & Hekstra, A. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), volume 2, pages 749–752 vol.2, (2001). 10.1109/ICASSP.2001.941023.
- 48.Sainburg, T. & Gentner, T. Q. Toward a computational neuroethology of vocal communication: from bioacoustics to neurophysiology, emerging tools and future directions. Frontiers in Behavioral Neuroscience15, 811737 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Sainburg, T. & Zorea, A. Birdsong NOIZEUS: Bioacoustics noise reduction benchmark dataset10.5281/zenodo.13947444 (2024). [Google Scholar]
- 50.Sainburg, T., Theilman, B., Thielk, M. & Gentner, T. Q. Parallels in the sequential organization of birdsong and human speech. Nature communications10(1), 3636 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Sainburg, T., Thielk, M. & Gentner, T. Q. Finding, visualizing, and quantifying latent structure across diverse animal vocal repertoires. PLoS computational biology16(10), e1008228 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Savitzky, A. & Golay, M. J. E. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry36(8), 1627–1639. 10.1021/ac60214a047 (1964). [Google Scholar]
- 53.Scheibler, R., Bezzam, E. & Dokmanić, I. Pyroomacoustics: A python package for audio room simulation and array processing algorithms. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 351–355. IEEE, (2018).
- 54.Soni, MH., Shah, N. & Patil, HA. Time-frequency masking-based speech enhancement using generative adversarial network. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5039–5043. IEEE, (2018).
- 55.Spiller, M., Esmaeili, N., Sühn, T., Boese, A., Turial, S., Gumbs, AA., Croner, R., Friebe, M. & Illanes, A. Enhancing veress needle entry with proximal vibroacoustic sensing for automatic identification of peritoneum puncture. Diagnostics, 14 (15), (2024). [DOI] [PMC free article] [PubMed]
- 56.Steinmetz, C.J. & Reiss, J. pyloudnorm: A simple yet flexible loudness meter in python. In Audio Engineering Society Convention 150. Audio Engineering Society, (2021).
- 57.Taal, C.H., Hendriks, R.C., Heusdens, R. & Jensen, J. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4214–4217, (Mar. 2010). 10.1109/ICASSP.2010.5495701. URL https://ieeexplore.ieee.org/abstract/document/5495701. ISSN: 2379-190X.
- 58.Taha, T. M., Adeel, A. & Hussain, A. A survey on techniques for enhancing speech. International Journal of Computer Applications179(17), 1–14 (2018). [Google Scholar]
- 59.Trnkoczy, A. Understanding and parameter setting of sta/lta trigger algorithm. In P. Bormann, editor, New Manual of Seismological Observatory Practice 2 (NMSOP-2). Deutsches GeoForschungsZentrum GFZ, (2009) 10.2312/GFZ.NMSOP-2_IS_8.1.
- 60.Upadhyay, N. & Karmakar, A. Spectral subtractive-type algorithms for enhancement of noisy speech: an integrative review. International Journal of Image, Graphics and Signal Processing5(11), 13 (2013). [Google Scholar]
- 61.Vellinga, W. Xeno-canto - soundscapes from around the world10.15468/9u3zaq, (2024). Occurrence dataset accessed via GBIF.org on 2024-10-17.
- 62.Virtanen, P. et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature methods17(3), 261–272 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Weber, M. & Davis, J. P. Evidence of a laterally variable lower mantle structure from P- and S-waves. Geophysical Journal International102(1), 231–255. 10.1111/j.1365-246X.1990.tb00544.x (1990). [Google Scholar]
- 64.Xie, J., Colonna, J. G. & Zhang, J. Bioacoustic signal denoising: a review. Artificial Intelligence Review54, 3575–3597 (2021). [Google Scholar]
- 65.Hu, Yi, & Loizou, P. Subjective Comparison of Speech Enhancement Algorithms. In 2006 IEEE International Conference on Acoustics Speed and Signal Processing Proceedings, volume 1, pages I–153–I–156, Toulouse, France, (2006). IEEE. ISBN 9781424404698. 10.1109/ICASSP.2006.1659980. URL http://ieeexplore.ieee.org/document/1659980/.
- 66.Zheng, N. & Zhang, X.-L. Phase-aware speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing27(1), 63–76. 10.1109/TASLP.2018.2870742 (2019). [Google Scholar]
- 67.Zhu, W., Mousavi, S. M. & Beroza, G. C. Seismic signal denoising and decomposition using deep neural networks. IEEE Transactions on Geoscience and Remote Sensing57(11), 9476–9488. 10.1109/TGRS.2019.2926772 (2019). [Google Scholar]
- 68.Zhu, Y., Smith, A. & Hauser, K. Automated heart and lung auscultation in robotic physical examinations. IEEE Robotics and Automation Letters7(2), 4204–4211 (2022). [Google Scholar]