Abstract
A tracer breakthrough curve (BTC) for each sampling station is the ultimate goal of every quantitative hydrologic tracing study, and dataset size can critically affect the BTC. Groundwater-tracing data obtained using in situ automatic sampling or detection devices may result in very high-density datasets, and the resulting data-dense tracer BTCs can appear visually cluttered with overlapping data points. The relatively large amounts of data acquired with the high-frequency settings available on in situ devices and stored in dataloggers ensure that important tracer BTC features, such as data peaks, are not missed, but such dense datasets can also be difficult to interpret. More difficult still is the application of such dense datasets in solute-transport models, which may not be able to adequately reproduce tracer BTC shapes because of the overwhelming mass of data. One solution to the difficulties associated with analyzing, interpreting, and modeling dense datasets is the selective removal of blocks of data from the total dataset. Although it is possible to skip blocks of tracer BTC data in a periodic sense (data decimation) so as to lessen the size and density of the dataset, skipping or deleting blocks of data may also result in missing the very features that the high-frequency detection settings were intended to capture. Rather than removing, reducing, or reformulating overlapping data, signal filtering and smoothing may be utilized, but smoothing errors (e.g., averaging errors, outliers, and potential time shifts) need to be considered. Fitting appropriate probability distributions to tracer BTCs may be used to describe typical tracer BTC shapes, which usually include long tails. Recognizing the probability distributions applicable to tracer BTCs can help in understanding some aspects of tracer migration.
Keywords: high-density datasets, tracer-breakthrough curves, data smoothing, downsampling, probability
1. Introduction
Conducting solute fate and transport investigations in groundwater environments is much more difficult than in air or surface-water environments because of the general inaccessibility of the subsurface. Realistically, only a modest number of wells may be installed over an area, and with only a minimal number of sampling depths. For porous-media aquifers, basic hydrogeologic measurements using wells are generally taken as sufficient when substantial efforts are made to integrate the hydraulic properties measured in the wells (e.g., head level, hydraulic conductivity, etc.) with chemical-specific parameters (e.g., retardation, decay, etc.) in appropriate flow and solute-transport models that include some delineation of heterogeneities. Conducting groundwater tracing studies in porous media is still considered a valuable and important undertaking, however (Ptak et al., 2004).
Conducting solute fate and transport investigations in fractured-rock aquifers is much more difficult than in porous-media aquifers because of the heterogeneous and anisotropic nature of the aquifer. Conducting groundwater tracing studies in fractured-rock aquifers is generally regarded as being more important than doing so in porous-media aquifers because of the typically greater heterogeneity and anisotropy of fractured-rock aquifers (Tsang, 1993).
In karstic aquifers the problem of extreme heterogeneity and anisotropy is generally taken as being much more complex than that of fractured-rock aquifers, and the difficulties associated with conducting solute fate and transport investigations are correspondingly greater. Typically, groundwater investigations of karstic aquifers still rely on basic investigative techniques (e.g., potentiometric-surface mapping, aquifer testing, etc.), but these methods have been shown to be of much more limited value in karstic aquifers than in other aquifer types.
When investigating solute fate and transport in karstic aquifers, it is now conventional knowledge that comprehensive groundwater-tracing studies are essential. It is also well established that water wells only very rarely intersect the solution conduits draining the aquifer (Field, 1992–93; Quinlan and Ewers, 1985). Water-level measurements using these wells will, in most instances, only provide a very rough indication of groundwater-flow trajectories that are often radically incorrect (e.g., 90° from the direction determined from measured water levels (see, for example, Figure 4 in Arnow, 1963)). Flow and solute-transport velocities are even more poorly estimated because, whereas the bulk of the volume of water in the aquifer is stored in the rock matrix, the overwhelming majority of flow occurs in the solution conduits at velocities that can rival those of surface streams and man-made conduits according to open-channel and closed-conduit flow equations.
Quantitative-tracing studies provide definitive information regarding flow and solute-transport trajectories and velocities for the season and environmental conditions under which the tracing studies were conducted (e.g., wet-spring vs. dry-summer conditions). Such flow and solute-transport information, besides providing some insight into subsurface conditions, also provides valuable information regarding possible contamination at downgradient resurgences and drinking-water wells in terms of connections with contaminated sites, arrival times, and concentrations, because the released tracer serves as a surrogate pollutant.
Quantitative-tracing studies require that a measured mass of tracer be released and that downstream water samples be collected or analyzed in situ while also measuring resurgence discharge or flow in or pumped from a well. The injected tracer mass is then compared with the recovered tracer mass, calculated by summing the measured downstream tracer concentrations multiplied by the measured discharges. As would seem apparent, the more frequently the downstream measurements are taken, the more precise the calculated recovered tracer mass. Although grab sampling is often employed, such infrequent sampling and irregular sampling intervals tend to result in data aliasing (Quinlan et al., 1993), whereas automatic water samplers tend to better facilitate sample collection. Even more useful and appropriate are field detection devices (e.g., in situ field fluorometers) coupled with dataloggers, which can result in the collection of huge datasets over very short periods. These latter devices allow for very frequent in situ sample analyses and long-term storage of very large amounts of data, which, although not big data as generally understood (Jagadish et al., 2014; WFI, 2019a), are sufficiently large as to require any of a variety of numerical routines for analysis provided that the nature of the underlying system is included in the analyses (Siegel and Hinchey, 2019).
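For illustration, the mass-recovery calculation described above can be sketched in a few lines of code. The sketch below uses hypothetical sampling times, concentrations, and discharges (none of these values are from the study data) and approximates the integral of concentration times discharge with the trapezoidal rule.

```python
import numpy as np

# Hypothetical example values (not from the study data): sampling times in
# days since release, tracer concentrations in ug/L, and discharge in L/s.
t_days = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
conc_ugL = np.array([0.00, 0.02, 0.35, 0.50, 0.28, 0.09, 0.01])
discharge_Ls = np.array([12.0, 12.5, 13.0, 12.8, 12.4, 12.1, 12.0])

# Recovered mass = integral of C(t) * Q(t) dt, approximated with the
# trapezoidal rule; days are converted to seconds so that
# (ug/L) * (L/s) * s yields micrograms.
t_sec = t_days * 86400.0
flux = conc_ugL * discharge_Ls                                   # ug/s
mass_ug = np.sum(0.5 * (flux[1:] + flux[:-1]) * np.diff(t_sec))
print(f"Recovered tracer mass: {mass_ug / 1.0e9:.4f} kg")
```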
The relatively large amounts of data detected with high-frequency settings available when using in situ devices and stored in dataloggers ensure that important tracer breakthrough curve (BTC) features, such as data peaks, are not missed. Large tracer BTC datasets obtained using in situ devices and stored in dataloggers can, however, be visually cluttered by data points that may appear to overlap due to large plot symbol sizes. Additionally, such dense datasets can also be difficult to numerically analyze.
One solution to the difficulties associated with analyzing, interpreting, and modeling dense datasets is the selective removal of blocks of the data from the total dataset. Although it is possible to arrange to skip blocks of BTC data in a periodic sense (e.g., data decimation, oversampling and processing, or employing a lower sampling frequency at the start of a tracer study) so as to lessen the size and density of the dataset, skipping or deleting blocks of data also may result in missing the important features that the high-frequency detection settings were intended to detect. Rather than removing, reducing, or reformulating apparent data overlap, signal filtering and signal smoothing may be utilized.
The purpose of this paper is to develop an in-depth investigation of tracer BTCs in terms of size, shape, and statistical distribution. A fairly comprehensive literature search indicated that the issues investigated in this paper have not been the focus of previous studies regarding tracer BTCs, although quite a lot has been published in the fields of chromatography and digital signal processing, which has been applied to the analysis of seismic data. The analysis begins with a general explanation of the sampling process as it applies to a tracing study. An example BTC dataset obtained from an instantaneous (mathematical impulse function) tracer release, consisting of a very large number of densely-packed data points, is then used to illustrate the value of frequent sampling so as to not miss important BTC features. Reducing the sampling frequency and smoothing the tracer BTC further emphasize important aspects of the BTC. Lastly, because it was found that tracer BTCs have only rarely been evaluated from a statistical perspective, a detailed statistical assessment of the example BTC was undertaken with the objective of showing that application of appropriate statistical distributions can assist in understanding the shape and nature of a measured tracer BTC.
2. Tracer Time Series and Sampling
Environmental tracer studies involving the release of an artificial substance to trace the flow of air or water may be thought of as time-series investigations in that measurements of downstream tracer concentrations are most appropriately taken in reference to the time of tracer release. A time series consists of a set of measurements collected sequentially over time (Bai and Li, 2014). It is rare, however, for groundwater tracing studies utilizing anthropogenic tracers to be analyzed as a time series. There are undoubtedly many reasons why time-series analysis is not readily applied to most groundwater tracer studies, but the most likely reason is that samples are collected too infrequently. Connecting the measured data with a line from datum to datum generally shows the resulting BTCs to be right-skewed time-series plots (see, for example, Figure 12 in Mull et al., 1988, p. 55).
2.1. Measurement Errors
Although generally ignored in most, if not all, groundwater tracing studies, the measured tracer concentration C is a random variable dependent on the controlled variable, time t, and consists of two parts, the true value of the measured quantity (i.e., the true tracer concentration CT) and a measurement error ε, according to (Brandt, 1998, p. 427)

$$C(t) = C_T(t) + \varepsilon(t) \tag{1}$$
Typically, measurement errors consist of at least two parts: a systematic error associated with the analytical instrument (e.g., drift, miscalibration, sensor fouling, and suspended particulates) and random errors that are always present and occur as a result of factors that cannot be predicted or controlled.
2.2. Samples with Background
Groundwater tracing studies involve the collecting of downgradient water samples following tracer release. Each sample may or may not include a quantity of the released tracer and may or may not include some background entity that serves to adversely affect accurate tracer detection and concentration determination. For example, when tracing with fluorescent dyes, background fluorescence at or near the same wavelength as that of the released tracer can be very problematic (see, for example, Brown, 2009).
2.2.1. Statistics of Small Samples
Statistically, the sampling process may be described for each sampling station with n samples collected, of which k have a detectable level of tracer, such that n − k represents the number of samples with no detectable levels of tracer (Brandt, 1998, p. 163). The statistical error Δk then follows a Poisson distribution described by (Brandt, 1998, p. 164)
| (2) |
which, for small k, allows λ = np to be defined according to (Brandt, 1998, p. 95)
| (3) |
to obtain lower and upper confidence limits λ− and λ+, as well as a one-sided upper confidence limit λ(up) (Brandt, 1998, p. 166–168). The parameters p and q are probabilities that are related to the sample space. Brandt (1998, p. 87) demonstrated this by supposing an experiment with two possible outcomes, A and its complement Ā, for which
$$P(A) = p \tag{4a}$$

$$P(\bar{A}) = q \tag{4b}$$

$$p + q = 1 \tag{4c}$$
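The limiting relationship λ = np invoked above can be checked numerically. The following sketch (with assumed values of n and p, not taken from the paper) compares the binomial probability of observing k detectable samples with the corresponding Poisson probability.

```python
from scipy import stats

# Illustration only (values are assumed, not from the paper): for a large
# number of samples n with a small per-sample detection probability p, the
# binomial distribution of the count k is well approximated by a Poisson
# distribution with lambda = n * p.
n, p = 1000, 0.005
lam = n * p

for k in range(11):
    print(f"k={k:2d}  binomial={stats.binom.pmf(k, n, p):.5f}  "
          f"Poisson={stats.poisson.pmf(k, lam):.5f}")
```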
2.2.2. Effect of Background
Background tracer concentrations further complicate statistical assessments of tracing studies. Although it is desirable that background concentrations be collected during and after tracer release, such background analyses are very difficult because signal events CS occur on top of background events CB and thus obscure CB. Efforts have been undertaken to address this problem using, for example, spectral deconvolution methods (e.g., Alexander, 2005).
2.2.3. Statistics of Small Samples with Background
Brandt (1998, p. 169) noted that for experiments in which it is impossible to separate signal events from background events, the number of events follows a Poisson distribution with the parameter λ = λS + λB, where the signal component corresponds to the desired (sought) concentration CS, so Equation (1) must be revised to consider background according to
$$C(t) = C_S(t) + C_B(t) + \varepsilon(t) \tag{5}$$
Equation (5) may generally be considered to be less precise as sample sizes decrease and/or as the number of background samples decreases. It is also conspicuous that nB ≤ n will always be true; the smaller nB is relative to n, the greater the uncertainty.
For tracing studies, it is well established that, prior to tracer release, a selected number of background samples nB must be collected at each prospective sampling station and averaged to create the pre-release mean tracer background concentration C̄B, rather than a consecutive time-background concentration datafile, according to
$$\bar{C}_B = \frac{1}{n_B}\sum_{i=1}^{n_B} C_{B,i} \tag{6}$$
and resulting in an estimate that is even less precise relative to CS. Application of Equation (6) is necessary in order to counteract the adverse effect that individual background concentrations may impart on each measured sample. The mean background concentration C̄B calculated for each sampling station must be subtracted from all measured concentrations taken after tracer release (e.g., Mull et al., 1988, p. 52–54). Such a methodology introduces a temporal error because no valid real-time comparisons can be made between background concentrations and measured tracer concentrations after tracer release. In addition, calculating a single mean background concentration from a set of given background measurements introduces an additional random error (Field, 2011). As also noted in the same paper by Field, outliers are to be expected as well, which can adversely skew the mean background concentration C̄B.
Subtracting the mean background C̄B of Equation (6) from the measured concentration of Equation (1), rather than the time-varying background CB shown in Equation (5), does not result in CS actually being calculated. Rather, Equation (6) results in only an approximation of CS, because there is no simple methodology to reliably estimate CB at each sampling time. Typically, the background-corrected concentration is accepted as CS when analyzing and reporting tracing study results. In an effort to alleviate some of the difficulties associated with estimating CB, Bailly-Comte et al. (2018) proposed a method for correcting artificial tracer results that reportedly makes it possible to fine-tune the effect of natural background variation on BTCs for clear identification of the tracer presence and a more precise quantification of its recovery.
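A minimal sketch of the background-correction step described above is shown below, using hypothetical background and post-release concentrations (not from the study data); the median is included only to show one simple way of limiting the influence of a background outlier.

```python
import numpy as np

# Hypothetical pre-release background concentrations (ug/L) at one station;
# the 0.045 value mimics a background outlier of the kind noted by Field (2011).
background = np.array([0.012, 0.015, 0.010, 0.045, 0.011, 0.013])

mean_background = background.mean()        # Equation (6)
median_background = np.median(background)  # one simple outlier-resistant alternative

# Hypothetical post-release measured concentrations (ug/L): subtract the
# mean background and clip negative values, which have no physical meaning.
measured = np.array([0.014, 0.020, 0.180, 0.350, 0.210, 0.060])
corrected = np.clip(measured - mean_background, 0.0, None)

print(f"mean background = {mean_background:.4f}, median = {median_background:.4f}")
print(corrected)
```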
2.2.4. Probability of Observing a Number of Events
Brandt (1998, p. 170) explains that the probability of observing k = kS + kB total events is given according to

$$P(k) = \frac{(\lambda_S + \lambda_B)^{k}}{k!}\, e^{-(\lambda_S + \lambda_B)} \tag{7}$$

and the probabilities of observing kS signal events and kB background events are

$$P(k_S) = \frac{\lambda_S^{\,k_S}}{k_S!}\, e^{-\lambda_S} \tag{8}$$

$$P(k_B) = \frac{\lambda_B^{\,k_B}}{k_B!}\, e^{-\lambda_B} \tag{9}$$
Brandt (1998, p. 525–527) applied the binomial theorem to Equations (8) and (9) to validate Equation (7) as shown by
$$P(k) = \sum_{k_S=0}^{k} P(k_S)\, P(k - k_S) = \frac{e^{-(\lambda_S + \lambda_B)}}{k!} \sum_{k_S=0}^{k} \binom{k}{k_S} \lambda_S^{\,k_S}\, \lambda_B^{\,k - k_S} = \frac{(\lambda_S + \lambda_B)^{k}}{k!}\, e^{-(\lambda_S + \lambda_B)} \tag{10}$$
From Equations (7) – (10) it is apparent that obtaining some number of events from some number of signal events and background events is a complex probabilistic process, the results of which can be seen in Table 1. For many tracer studies in which grab samples are collected, Equations (7) – (9) may be relevant because relatively few samples are typically collected. Use of automatic water samplers or in situ fluorometers, however, can alleviate the problem of small sample sizes and/or small background sample sizes and result in more refined data (Bailly-Comte et al., 2018). When calculating the parameters listed in Table 1 for a larger number of signal detections (e.g., tracer detections), the values for λS and λB theoretically do not change from one experiment (e.g., initial tracer release) to the next (e.g., subsequent tracer release(s)).
Table 1.
Limits for λS for a confidence level α = 0.90 for 1000 samples collected each for 20 tracer releases with λS = 5 and λB = 2 according to the methods described in Brandt (1998).
| Num | kS | kB | ka | Lower Limit for λS | Upper Limit for λS | One-Sided Upper Limit for λS |
|---|---|---|---|---|---|---|
| 1 | 5 | 3 | 8 | 1.998 | 12.435 | 10.995 |
| 2 | 3 | 1 | 4 | 0.226 | 7.241 | 6.087 |
| 3 | 5 | 0 | 5 | 0.433 | 8.542 | 7.306 |
| 4 | 4 | 5 | 9 | 2.699 | 13.705 | 12.206 |
| 5 | 5 | 2 | 7 | 1.350 | 11.150 | 9.773 |
| 6 | 2 | 2 | 4 | 0.226 | 7.241 | 6.087 |
| 7 | 2 | 0 | 2 | 0.077 | 4.824 | 3.877 |
| 8 | 7 | 3 | 10 | 3.426 | 14.962 | 13.407 |
| 9 | 7 | 3 | 10 | 3.426 | 14.962 | 13.407 |
| 10 | 4 | 3 | 7 | 1.350 | 11.150 | 9.773 |
| 11 | 2 | 0 | 2 | 0.077 | 4.824 | 3.877 |
| 12 | 0 | 1 | 1 | 0.051 | 3.817 | 2.995 |
| 13 | 6 | 1 | 7 | 1.350 | 11.150 | 9.773 |
| 14 | 8 | 1 | 9 | 2.699 | 13.705 | 12.206 |
| 15 | 3 | 1 | 4 | 0.226 | 7.241 | 6.087 |
| 16 | 5 | 4 | 9 | 2.699 | 13.705 | 12.206 |
| 17 | 10 | 1 | 11 | 4.169 | 16.207 | 14.598 |
| 18 | 5 | 2 | 7 | 1.350 | 11.150 | 9.773 |
| 19 | 3 | 1 | 4 | 0.226 | 7.241 | 6.087 |
| 20 | 7 | 3 | 10 | 3.426 | 14.962 | 13.407 |
Realistically, only k is ever known; kS and kB cannot be determined.
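The equivalence asserted by Equations (7) and (10) is easy to verify numerically. The sketch below (using the λS = 5 and λB = 2 of Table 1) convolves the signal and background Poisson distributions and compares the result with a single Poisson distribution having parameter λS + λB; it does not reproduce the confidence limits of Table 1, which follow Brandt's (1998) more involved procedure.

```python
import numpy as np
from scipy import stats

lam_s, lam_b = 5.0, 2.0           # the signal and background rates of Table 1

# Probability of k total events obtained by convolving the signal and
# background Poisson distributions (the sum in Equation (10)) ...
k_values = np.arange(15)
p_conv = np.array([
    sum(stats.poisson.pmf(j, lam_s) * stats.poisson.pmf(k - j, lam_b)
        for j in range(k + 1))
    for k in k_values
])

# ... agrees with a single Poisson distribution having parameter
# lambda_S + lambda_B (Equation (7)) to machine precision.
p_direct = stats.poisson.pmf(k_values, lam_s + lam_b)
print(np.max(np.abs(p_conv - p_direct)))
```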
3. Tracer Test Dataset
For this paper a tracer dataset obtained from a tracing study conducted in the Appalachian Mountains was selected. The study involved the collection of a large amount of tracer data from a geologically complex karstic terrane that resulted in complexly-appearing tracer BTCs. The dataset for this paper and Fortran source program are available as a Supplemental File.
3.1. Tracer Test
The tracing study was initiated at a site located in the Hagerstown Valley, Maryland (part of the Great Valley) of the eastern Valley and Ridge province of the Appalachians, with recovery at several locations; details may be found in Field (2017). The purpose of the tracer test was to assess the distribution and migration of several pesticides and related contaminants in groundwater in the area: 2,4-dichlorodiphenyltrichloroethane (2,4-DDT), 2,4-dichlorodiphenyldichloroethane (2,4-DDD), 4,4-dichlorodiphenyltrichloroethane (4,4-DDT), 4,4-dichlorodiphenyldichloroethane (4,4-DDD), Aldrin, alpha-Chlordane, arsenic, Benzo-(a)-pyrene, alpha-hexachlorocyclohexane (α-HCH), beta-hexachlorocyclohexane (β-HCH), delta-hexachlorocyclohexane (δ-HCH), gamma-hexachlorocyclohexane (γ-HCH), Dieldrin, Heptachlor Epoxide, Heptachlor, gamma-Chlordane, Endrin Ketone, Manganese, Thallium, Atrazine, and Toxaphene; total dichlorodiphenyltrichloroethane (DDX), which consists of the summation of dichlorodiphenyltrichloroethane (DDT), dichlorodiphenyldichloroethane (DDD), and dichlorodiphenyldichloroethylene (DDE); and total hexachlorocyclohexane (HCH), the summation of the HCH isomers.
The tracing study consisted of a release of 7.16 kg of fluorescein dye (Colour Index, Acid Yellow 73) into a small sinkhole ~0.5 m in diameter (Figure 1). Various springs and wells were monitored continuously for dye using Turner Designs Cyclops-7 Logger in situ fluorometers and dataloggers, with recoveries in radial directions as a result of the site overlying a groundwater mound (Field, 2017). For this study, tracer recovery at a nearby monitoring well was used for the analyses. All the fluorometers were set to take a measurement reading every 30 minutes so as to ensure that no peaks in the BTCs would be missed.
Figure 1.
Addition of pre-dye and post-dye release flush water for the tracing study.
3.2. Site Geology
The Hagerstown Valley is characterized by a series of weathered, tightly folded sedimentary rocks that have resulted in a series of closely spaced ridges (Means, 2010, p. 48) aligned northeast–southwest (Schmidt, 1993, p. 10). The sedimentary rocks are very complexly folded (Figure 2a) and faulting of the rocks is common. The Hagerstown area is underlain by the Middle Member of the Stonehenge Formation with a tongue of the Upper Member of the Stonehenge Formation on top. An example of the complex geology formed by the Stonehenge Formation is shown in Figure 2b. The site where the tracing study was initiated is located in an area that is underlain primarily by the fractured limestone and dolomite of the Upper Cambrian and Lower Ordovician Conococheague Limestone and the Lower Ordovician Stonehenge Limestone and Rockdale Run Formation. Bordering on the west is a NE-SW trending reverse fault, with a second NE-SW trending reverse fault slightly further to the west (Brezinski, 2013). A third NE-SW trending reverse fault is also evident a slightly further distance to the east of the site. Both reverse faults west of the site are associated with considerable folding and overturned beds at the site.
Figure 2.
Features evidencing the complex nature of the site geology where the tracing study was conducted.
Sinkholes, grikes (Figures 2c and 2d, respectively), and caves are apparent and well known to exist throughout the area (see, for example, Table 1 in Duigon, 2001, p. 15 and Franze and Slifer, 1971, p. 68–104), all of which are diagnostic of karst. Groundwater flow, however, can be quite slow in some aquifers underlying the Hagerstown Valley relative to what is normally expected in karstic terranes, but this appears to not be atypical of groundwater flow in some karst aquifers (or parts thereof) of the Appalachians (see, for example, Kozar et al., 2007 and Duigon, 2009). Duigon (2001) mapped springs throughout the Hagerstown Valley and mapped the potentiometric surface, but tracing studies were not conducted for that study. Duigon (2009) documented the common occurrence of losing streams, whereas Duigon (2001) noted the rare occurrence of sinking streams and documented the extreme heterogeneity and anisotropy of the Hagerstown Valley aquifers and the need for tracing studies.
3.3. Tracer Data
Consider the tracer BTC shown in Figure 3, which consists of 4118 measured data points from the release of fluorescein dye. For clarity, and to emphasize the appearance of a typical tracer BTC shape, Figure 3 represents the latter portion of the original measured dataset obtained from a downgradient monitoring well; the BTC does not appear to conform to a normally distributed population. This was confirmed by the small value of the Shapiro-Wilk test statistic W calculated for the measured dataset (W = 0.92073, P-value < 0.001). The large number of data points obtained was a result of setting the datalogger to take a reading every 30 minutes at a site where flow and transport rates were quite slow.
Figure 3.
Example BTC plot consisting of a large number of densely-packed data points with some apparent outliers.
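The normality check itself is straightforward to reproduce. The sketch below assumes the measured concentrations are held in an array named conc (a right-skewed synthetic series is substituted here purely for illustration) and applies the Shapiro-Wilk test from SciPy.

```python
import numpy as np
from scipy import stats

# 'conc' is assumed to hold the 4118 measured concentrations (ug/L); a
# right-skewed synthetic series is substituted here purely for illustration.
rng = np.random.default_rng(1)
conc = rng.lognormal(mean=-1.5, sigma=0.6, size=4118)

W, p_value = stats.shapiro(conc)
print(f"Shapiro-Wilk W = {W:.5f}, P-value = {p_value:.3g}")
# A small W with P < 0.05 indicates departure from normality, as was
# found for the measured BTC (W = 0.92073, P < 0.001).
```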
4. Methods for Simplifying Large Densely-Concentrated Tracer Datasets
The latter data plotted in Figure 3a are so densely packed as to obscure most individual data points. The basic right-skewed histogram shape of the BTC shown in Figure 3a is apparent but the density of the data suggests the existence of a significant degree of complexity in regard to the discharge of tracer dye. Similarly complex BTCs appear to be common when using a Turner Designs Cyclops-7 Logger in situ fluorometer when tracing flows through glacier-hydrological systems (e.g., Fountain, 1993; Fyffe, 2013; Fyffe et al., 2012; Gulley et al., 2012). Of particular concern might be the difficulty in determining the true peak concentration. The maximum measured BTC concentration was 0.5054 μg L−1 but perhaps a value somewhere between the maximum measured peak concentration and the minimum measured peak concentration of 0.4281 μg L−1 would be more representative of tracer recovery.
A basic analysis of the BTC shown in Figure 3a is possible by the method of moments, such as may be done using the Qtracer2 program (Field, 2002), but parameter refinement utilizing such programs as CXTFIT2 (Field and Pinsky, 2000; Toride et al., 1993, 1995), DADE (Field and Leij, 2012), and PhysChem (Field and Leij, 2014) may be problematic with such densely-packed data (CXTFIT2 is currently restricted to 405 data points). In order to more easily model the BTC shown in Figure 3 it is necessary to either reduce the size of the dataset or to smooth the data and apply either a reduced or smoothed dataset in a solute-transport model.
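For readers unfamiliar with the method of moments, the sketch below outlines the zeroth and first temporal moments of a BTC; it is only a simplified illustration of the kind of calculation Qtracer2 performs, and the function arguments are assumed inputs rather than values from the study.

```python
import numpy as np

def temporal_moments(t_days, conc, discharge):
    """Zeroth and first temporal moments of a BTC (method of moments).

    t_days    : sampling times since tracer release (d)
    conc      : background-corrected concentrations (ug/L)
    discharge : discharge at the sampling station (L/s)
    """
    flux = conc * discharge                                   # tracer flux
    dt = np.diff(t_days)
    m0 = np.sum(0.5 * (flux[1:] + flux[:-1]) * dt)            # zeroth moment
    m1 = np.sum(0.5 * (t_days[1:] * flux[1:]
                       + t_days[:-1] * flux[:-1]) * dt)       # first moment
    mean_travel_time = m1 / m0                                 # days
    return m0, mean_travel_time

# With an assumed straight-line travel distance x (m), the mean velocity
# follows as x / mean_travel_time, analogous to the Qtracer2 entries in Table 2.
```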
Figure 3b is a plot of the same BTC shown in Figure 3a but without the data being connected by a line in a vain attempt at clarifying important aspects of the BTC. Figure 3c is a plot of the same BTC shown in Figure 3a but without the actual data being depicted, which allows for slightly better visualization of the areas where the data trend changes. For example, in Figure 3c it may be noted that between 113 days and 122 days the downward slope of the recession limb of the BTC appears to lessen and nearly levels off and then begins to descend more steeply. After 122 days the BTC then appears to descend less steeply. Such visual assessment is not so easily determined from Figures 3a and 3b.
4.1. Downsampling by an Integer Factor
Downsampling the dataset shown in Figure 3 is no more complicated than using every second, third, or fourth, etc. data point in the analysis as necessary while avoiding excessive data reduction. As such, downsampling is simply the process of reducing the sampling rate of a signal (Xiong et al., 2016), preferably by some number that divides evenly into the maximum number of samples. In signal processing, downsampling is known as decimation, which incorporates an anti-aliasing filter, and is a two-step process (WFI, 2018b), as shown in the short sketch that follows this list:
Reduce high-frequency signal components with a digital low-pass filter.
Decimate the filtered signal by the factor M, retaining every Mth sample so that y[n] = x[nM].
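The following sketch contrasts the two approaches. It assumes a hypothetical 30-minute concentration series (a synthetic noisy pulse is used here) and relies on scipy.signal.decimate, which applies the anti-aliasing low-pass filter (step 1) before retaining every Mth sample (step 2), whereas plain slicing performs step 2 only.

```python
import numpy as np
from scipy import signal

# 'conc' is assumed to be the 30-minute concentration series; a synthetic
# noisy pulse is used here in place of the measured data.
rng = np.random.default_rng(0)
t = np.arange(4118) * 0.5 / 24.0                      # elapsed time, days
conc = np.exp(-0.5 * ((t - 43.0) / 6.0) ** 2) + 0.01 * rng.standard_normal(t.size)

M = 10
naive = conc[::M]                                     # step 2 only: keep every Mth sample
filtered = signal.decimate(conc, M, ftype="fir", zero_phase=True)  # steps 1 and 2
print(naive.shape, filtered.shape)
```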
Figure 4 depicts the BTC shown in Figure 3 at various percentages of the full dataset and illustrates the effect of step number 2 in signal processing downsampling. Qtracer2 analyses of the BTCs shown in Figure 4 are shown in Table 2. Although the Qtracer2 results shown in Table 2 mostly do not indicate radical differences between any of the BTCs depicted in Figure 4, the results do serve to illustrate how less frequent sampling can influence basic BTC analyses.
Figure 4.
Example BTC plot developed using varying downsampled amounts of the measured data.
Table 2.
Comparison of Qtracer2 results for varying percentages of the breakthrough curves (BTCs)a,b shown in Figure 4 and in Figure 5.
| Figure BTC | Percentage of Data Used, % | Sampling Interval, h | Massc Recovered, mg | Time to Leading Edge, d | Time to Peak, d | Peak Concen., μg L−1 | Mean Time of Travel, d | Mean Velocity, m d−1 | Longitudinal Dispersivity, m | Peclet Number |
|---|---|---|---|---|---|---|---|---|---|---|
| Fig. 4a | 100.0 | 0.5 | 669.71 | 81.59 | 84.92 | 0.5149 | 109.67 | 8.04 × 10−2 | 1.38 × 10−1 | 64.02 |
| Fig. 4b | 50.0 | 1 | 669.79 | 81.96 | 85.46 | 0.5113 | 109.60 | 8.05 × 10−2 | 1.69 × 10−1 | 52.21 |
| Fig. 4c | 33.0 | 1.5 | 665.10 | 81.67 | 84.92 | 0.5054 | 109.65 | 8.04 × 10−2 | 1.48 × 10−1 | 59.75 |
| Fig. 4d | 20.0 | 2.5 | 669.48 | 81.96 | 85.40 | 0.5096 | 109.75 | 8.04 × 10−2 | 1.71 × 10−1 | 51.43 |
| Fig. 4e | 10.0 | 5 | 671.11 | 81.75 | 85.30 | 0.4917 | 109.64 | 8.04 × 10−2 | 1.53 × 10−1 | 57.58 |
| Fig. 4f | 5.0 | 10 | 668.41 | 81.96 | 85.30 | 0.4917 | 109.58 | 8.05 × 10−2 | 1.55 × 10−1 | 57.07 |
| Fig. 4g | 3.4 | 15 | 674.11 | 82.80 | 85.30 | 0.4917 | 109.81 | 8.03 × 10−2 | 1.71 × 10−1 | 51.66 |
| Fig. 4h | 2.5 | 20 | 669.52 | 82.38 | 84.88 | 0.4851 | 109.68 | 8.04 × 10−2 | 1.67 × 10−1 | 52.87 |
| Fig. 4i | 2.0 | 25 | 672.59 | 83.21 | 85.30 | 0.4917 | 109.41 | 8.06 × 10−2 | 1.68 × 10−1 | 52.65 |
| Fig. 4j | 1.7 | 30 | 671.28 | 82.80 | 85.30 | 0.4917 | 109.53 | 8.05 × 10−2 | 1.71 × 10−1 | 51.58 |
| Fig. 4k | 1.4 | 35 | 677.92 | 81.96 | 84.88 | 0.4851 | 109.54 | 8.05 × 10−2 | 1.02 × 10−1 | 86.23 |
| Fig. 4l | 1.3 | 40 | 660.05 | 82.38 | 85.71 | 0.4772 | 109.71 | 8.04 × 10−2 | 1.16 × 10−1 | 75.71 |
| Fig. 4m | 1.12 | 45 | 684.83 | 82.80 | 84.67 | 0.4786 | 109.32 | 8.07 × 10−2 | 1.70 × 10−1 | 51.84 |
| Fig. 4n | 1.02 | 50 | 686.91 | 83.21 | 85.30 | 0.4917 | 109.04 | 8.09 × 10−2 | 1.68 × 10−1 | 52.61 |
| Fig. 4o | 0.92 | 55 | 665.45 | 83.63 | 85.92 | 0.4586 | 109.61 | 8.05 × 10−2 | 1.70 × 10−1 | 51.97 |
| Fig. 4p | 0.85 | 60 | 671.85 | 84.05 | 86.55 | 0.4728 | 110.11 | 8.01 × 10−2 | 1.70 × 10−1 | 51.79 |
| Fig. 4q | 0.78 | 65 | 645.05 | 81.75 | 84.46 | 0.4761 | 108.88 | 8.10 × 10−2 | 7.49 × 10−2 | 117.68 |
| Fig. 4r | 0.73 | 70 | 670.87 | 81.96 | 84.88 | 0.4851 | 109.97 | 8.02 × 10−2 | 7.82 × 10−2 | 112.76 |
| Fig. 4s | 0.68 | 75 | 665.34 | 85.30 | 85.30 | 0.4917 | 110.33 | 7.99 × 10−2 | 1.58 × 10−1 | 55.86 |
| Fig. 4t | 0.63 | 80 | 634.56 | 82.38 | 85.71 | 0.4772 | 109.82 | 8.03 × 10−2 | 8.66 × 10−2 | 101.85 |
| Fig. 5a | 0.32 | 168 | 705.02 | 86.05 | 86.05 | 0.4835 | 109.50 | 8.05 × 10−2 | 1.59 × 10−1 | 55.64 |
| Fig. 5b | 0.17 | 336 | 625.75 | 93.05 | 93.05 | 0.3627 | 112.69 | 7.83 × 10−2 | 1.32 × 10−1 | 66.59 |
a. The calculated elapsed time from tracer injection until cessation of sampling was 23 wk, 3 d, 20 h, 0 min, and 0 s.
b. The calculated elapsed time from first tracer detection until cessation of sampling was 12 wk, 1 d, 18 h, 55 min, and 0 s.
c. It should be noted that mass recovered and all estimated transport parameters will be adversely affected by likely substantial errors in discharge measurements.
Figure 4a is the same plot as Figure 3a. Figures 4b and 4c are plots depicting 50 % and 33 % of the data, respectively. It is readily apparent that while Figures 4a and 4c still include the two most apparent outliers occurring at 112 d and 140 d, these outliers are not evident in Figures 4b and 4d – 4t. The disappearance of the outliers in Figure 4b is not so much a result of plotting a reduced percentage of the data; rather, it is just coincidence that plotting one half the number of data points for this particular dataset skips past them. It should be noted that outliers should not be deleted directly; rather, apparent possible outliers should be investigated carefully in order to understand why they appeared and what they may actually mean to the study (Gentle, 2009, p. 559; NIST/SEMATECH, 2013a).
Figures 4d (20 % of the data) – 4t (0.63 % of the data) allow visualization of the changes in slope of the descending curve of the BTC between 113 days and 122 days, as was observed in Figure 3c. Figures 4e (10 % of the data) – 4t really begin to depict the thinning out of the data, with Figures 4f (5 % of the data) – 4t reflecting the skipping over of the less apparent outliers. As subsequent plots (Figures 4i (2 % of the data) – 4t) are displayed, the spread of the data becomes problematic. For example, the depicted peak concentration is no longer as great as in the more concentrated dataset plots (Table 2), although consideration of the maximum concentration depicted in Figure 4a may be inappropriate (i.e., a peak concentration slightly less than the maximum but slightly greater than the minimum near the peak time might be more appropriate). This possible problem is most evident in Figure 4o (0.92 % of the data), in which the peak concentration appears to have leveled off to the lowest value displayed (Table 2).
The BTC plots depicted in Figure 4 suggest various concerns because either too many densely concentrated data points are displayed or too few data points are displayed. Basic tracer BTC analyses generally do not result in major discrepancies between parameter estimates in most instances, such as mean time of travel, but there are some parameter estimates that may be concerning, such as longitudinal dispersivity and Péclet number (Table 2).
If in situ fluorometers or automatic water samplers are not used then grab sampling is required. Grab sampling is costly and man-power intensive so sample collection often consists of the use of packets of activated carbon for simple detection of tracer dye in a qualitative sense. Common practice involving the use of carbon packets requires the collecting and replacing of carbon packets on a weekly or biweekly basis and includes the collection of grab samples of water at the time the packets are collected. The water grab samples are typically only analyzed if the packets of activated carbon indicate dye detection when analyzed. Such a practice can result in serious problems with BTC shape and analysis as shown in Figure 5 and indicated in Table 2.
Figure 5.
Example BTC plot exhibiting the result of collecting samples on a weekly (a) and biweekly (b) basis.
One substantial problem with downsampling that can be as serious as infrequent grab sampling is the potential loss of early time rise in measured concentrations and the detection of first arrival, which is critical when assessing pathogen transport (see for example, Worthington et al., 2002). Such critically important information must not be missed but is a real possibility as shown in Figures 4 and 5.
4.2. Curve Fitting and Digital Filtering
Numerous methods are available for filtering datasets and curve fitting (a type of smoothing), but not all methods are appropriate for any given dataset. Curve fitting is the process of creating a smooth function that approximately fits the data (Guest, 2012, p. 349). Application of any particular filtering or curve-fitting routine requires careful and judicious assessment of the method and its limitations. In general, it is usually preferable to apply a filtering routine in lieu of a curve-fitting routine because filtering is more theoretically and scientifically based.
Choosing a relatively simple model that is a good approximation to the data is called smoothing (Gentle, 2009, p. 157). Smooth is most simply defined as an approximation that is continuous and has continuous derivatives (Gentle, 2009, p. 179). Smoothing is a process that removes high-frequency fluctuations from a signal and may be regarded as an integral process of graphing discrete data with a continuous function representing an underlying model of the process that generated the data (Gentle, 2009, p. 341). Low-pass filtering is another term for the same thing, but is restricted to methods that are linear (van den Bogert, 1996).
According to Press et al. (1997a, p. 644) the premise of data smoothing is that one is measuring a variable that is both slowly varying and also corrupted by random noise, as might be an expected problem with tracer data. As such, it may be useful and even appropriate to replace each data point by some kind of local average of surrounding data points. Nearby points measure very nearly the same underlying value as the detected data point so averaging can reduce the level of noise while not excessively biasing the value (Press et al., 1997a, p. 644). As noted above regarding curve fitting, Press et al. (1997a, p. 644) point out that smoothing data lies in a murky area, beyond the fringe of some better posed, and therefore more highly recommended, techniques such as low-pass filters.
4.2.1. Curve Fitting
The main purpose of curve fitting is to estimate the effects of covariates X on a response y nonparametrically by letting the data suggest the appropriate functional form. In a simple sense, curve fitting is the smoothing of a linear model with one predictor that is defined as (Rodríguez, 2019)
$$y_i = f(x_i) + \varepsilon_i \tag{11}$$
in which the determination of f as a trend or smooth curve is desired.
Kernel Smoothers.
An alternative approach is to use a weighted moving average (MA), with weights that decline as one moves away from the target value. To calculate the smoothed estimate ŷi, the j-th point receives weight (Rodríguez, 2019)

$$w_{ij} = c_i\, K\!\left(\frac{x_i - x_j}{h}\right) \tag{12}$$

where K(·) is an even function, h is a tuning constant called the window width or bandwidth, and ci is a normalizing constant so that the weights add up to one for each xi. Popular choices for the kernel function K are (Rodríguez, 2019)
Gaussian density: $K(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$,

Epanechnikov: $K(z) = \frac{3}{4}(1 - z^2)$ for $|z| < 1$ and 0 otherwise,

Minimum variance: $K(z) = \frac{3}{8}(3 - 5z^2)$ for $|z| < 1$ and 0 otherwise,
but as Rodríguez (2019) points out, a kernel smoother still exhibits bias at the end points.
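As an illustration of these definitions, the following sketch implements a simple Nadaraya-Watson-type kernel smoother with the Epanechnikov kernel; the arrays x and y, the evaluation points, and the bandwidth h are all assumed inputs rather than values from the study.

```python
import numpy as np

def kernel_smooth(x, y, x_eval, h):
    """Nadaraya-Watson-type kernel smoother using the Epanechnikov kernel.

    x, y   : measured times and concentrations
    x_eval : points at which the smooth curve is evaluated
    h      : bandwidth (window width)
    """
    y_hat = np.empty(len(x_eval))
    for i, x0 in enumerate(x_eval):
        z = (x - x0) / h
        w = np.where(np.abs(z) < 1.0, 0.75 * (1.0 - z ** 2), 0.0)  # Epanechnikov weights
        y_hat[i] = np.sum(w * y) / np.sum(w) if w.sum() > 0 else np.nan
    return y_hat
```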
LOESS/LOWESS.
One method for dealing with the bias at end points typical of kernel smoothers is to apply a locally estimated scatterplot smoothing (LOESS) methodology or a locally weighted scatterplot smoothing (LOWESS) methodology, which are generalizations of the moving average and polynomial regression. LOWESS, also known as robust locally weighted regression, is basically an improvement built on top of LOESS (Cleveland, 1979; Cleveland and Devlin, 1988). They are two strongly related non-parametric regression methods that combine multiple regression models in a k-nearest-neighbor-based meta-model (WFI, 2018d). The smoothed estimate ŷi is calculated using LOESS by (Rodríguez, 2019)
find a symmetric nearest neighborhood of xi,
find the distance from xi to the furthest neighbor and use this distance as the local bandwidth hi,
use a tri-cube weight function: $w_{ij} = \left(1 - \left|\frac{x_i - x_j}{h_i}\right|^3\right)^3$ for $|x_i - x_j| < h_i$ and 0 otherwise,
estimate a local line using these weights and take the fitted value at xi as ŷi,
whereas the variant, LOWESS, uses robust regression in each neighborhood.
According to NIST/SEMATECH (2013b) the biggest advantage LOESS has over many other methods is that it does not require the specification of a function to fit a model to all of the data. Rather, only a smoothing parameter value and the degree of the local polynomial is required. In addition, LOESS is very flexible, making it ideal for modeling complex processes for which no theoretical models exist. These two advantages, combined with the simplicity of the method, make LOESS one of the most attractive of the modern regression methods for applications that fit the general framework of least squares regression but which have a complex deterministic structure. LOESS also accrues most of the benefits typically shared by those procedures, the most important of which is the theory for computing uncertainties for prediction and calibration.
Although LOESS shares many of the best features of other least squares methods, its inefficient use of data is a disadvantage. LOESS requires fairly large, densely sampled datasets in order to produce good models. This is not really surprising, however, because LOESS needs good empirical information on the local structure of the process in order to perform the local fitting. Given this fact, the results LOESS provides may be more efficient overall than other methods like nonlinear least squares (NIST/SEMATECH, 2013b). As such, it may be the most appropriate smoothing method for large, very densely-packed datasets. Another potential disadvantage of LOESS is that it does not produce a regression function that is easily represented by a mathematical formula. Depending on the application, this could be either a major or a minor drawback to using LOESS (NIST/SEMATECH, 2013b).
Finally, as noted above, LOESS is a computationally intensive method. This may be a problem for very large datasets. LOESS is also prone to the effects of outliers in the data set, similar to other least squares methods. Application of LOWESS, rather than LOESS, can reduce sensitivities to outliers, but extreme outliers can still overcome LOWESS (NIST/SEMATECH, 2013b).
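For reference, LOWESS is available in several statistical packages. A minimal sketch using the statsmodels implementation is shown below, with a synthetic stand-in for the measured series and an assumed smoothing fraction.

```python
import numpy as np
import statsmodels.api as sm

# 't' and 'conc' stand in for the measured BTC arrays (synthetic here); 'frac'
# is the fraction of the data used in each local fit and 'it' the number of
# robustifying iterations that down-weight outliers.
rng = np.random.default_rng(0)
t = np.linspace(80.0, 160.0, 500)
conc = np.exp(-0.5 * ((t - 85.0) / 6.0) ** 2) + 0.02 * rng.standard_normal(t.size)

smoothed = sm.nonparametric.lowess(conc, t, frac=0.05, it=3, return_sorted=False)
```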
4.2.2. Moving Statistics
One of the most common scatterplot smoothing methods is the well-known MA or moving mean, which is also known as a running average or rolling average (WFI, 2018e). Various modifications, such as the simple moving average, cumulative moving average, weighted moving average, and exponential moving average, are common. In addition, a moving median (MM) is common.
When calculating a MA in the sciences and engineering, the mean is normally taken from an equal number of data on either side of a central value to ensure that variations in the mean are aligned with the variations in the data rather than being shifted in time. In most instances it is advantageous to avoid the shifting induced by using only past data so a central MA is computed using data equally spaced on either side of the point in the series where the mean is calculated. As such, this requires using an odd number of data in the sample window (WFI, 2018e).
A simple moving average may be defined as (Rodríguez, 2019)
$$\hat{y}_i = \frac{1}{n_i} \sum_{j \in N(x_i)} y_j \tag{13}$$
to estimate the smooth curve at xi by averaging the y’s corresponding to the x’s in a neighborhood of xi for a neighborhood with ni observations. Typically, a symmetric neighborhood consisting of the nearest 2k + 1 points is taken centrally according to (Rodríguez, 2019)
$$N(x_i) = \left\{\max(i - k,\, 1), \ldots, i - 1,\, i,\, i + 1, \ldots, \min(i + k,\, n)\right\} \tag{14}$$
As with the calculation of an average, Equation (13) is susceptible to outliers (WFI, 2018e). In order to avoid the adverse influence of outliers, Equation (13) can be modified for the calculation of a simple moving median. Statistically, the MA may be regarded as optimal for recovering the underlying trend of the time series when the fluctuations about the trend are normally distributed. Outliers have a disproportionately large effect on trend estimates because the normal distribution does not place a high probability on very large outliers. If the fluctuations are assumed to be Laplace distributed, then the MM will be statistically optimal. For a given variance, the Laplace distribution places a higher probability on rare events than does the normal, which explains why the MM tolerates shocks better than does the MA (WFI, 2018e).
An exponentially weighted moving average (EWMA) is a first-order infinite impulse response filter that applies weighting factors that decrease exponentially. The weighting for each older datum decreases exponentially but never reaches zero (WFI, 2018e). An EWMA may be calculated from (NIST/SEMATECH, 2013c)
$$\mathrm{EWMA}_{t_p} = \alpha\, y_{t_p} + (1 - \alpha)\, \mathrm{EWMA}_{t_p - 1} \tag{15}$$
where tp represents the time period for the analysis and α is the smoothing constant.
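A compact way to compute these moving statistics is with pandas. The sketch below assumes the 30-minute concentration series is held in a pandas Series (a short synthetic series is used here), with the window length and α chosen only for illustration.

```python
import pandas as pd

# 'conc' stands in for the 30-minute concentration series; window length and
# alpha are illustrative choices only.
conc = pd.Series([0.01, 0.02, 0.05, 0.12, 0.35, 0.50, 0.41, 0.22, 0.09, 0.04])

window = 5                                              # odd window -> centered statistics
ma = conc.rolling(window, center=True).mean()           # centered moving average
mm = conc.rolling(window, center=True).median()         # centered moving median
ewma = conc.ewm(alpha=0.1, adjust=False).mean()         # recursive EWMA, Equation (15)
```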
4.2.3. Digital Filtering
A digital filter is a system utilized in signal processing that performs mathematical operations on a sampled, discrete-time signal to reduce or enhance certain aspects of that signal (WFI, 2018a). Typical examples of frequency functions are (WFI, 2018c):
A low-pass filter is used to cut unwanted high-frequency signals.
A high-pass filter passes high frequencies fairly well; it is helpful as a filter to cut any unwanted low-frequency components.
A band-pass filter passes a limited range of frequencies.
A band-stop filter passes frequencies above and below a certain range. A very narrow band-stop filter is known as a notch filter.
A differentiator has an amplitude response proportional to the frequency.
A low-shelf filter passes all frequencies, but increases or reduces frequencies below the shelf frequency by specified amounts.
A high-shelf filter passes all frequencies, but increases or reduces frequencies above the shelf frequency by specified amounts.
A peak equalizer filter makes a peak or a dip in the frequency response, commonly used in parametric equalizers.
For smoothing large densely-concentrated tracer datasets, only a low-pass filter is really appropriate because high-frequency signals are most representative of noise in the data. Figures 6a – 6o depict the BTC shown in Figure 3 smoothed using various routines and are examples of step number 1 in signal-processing downsampling (Section 4.1). It will be noted that all of the smoothing routines greatly affect the appearance of the two outliers apparent in Figure 3. Only Figure 6g shows any indication of the original apparent outliers, but Figure 6g does not appear to be the best smoothed BTC from a visual perspective.
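As one concrete low-pass option (the Butterworth filter listed in Table 3), the sketch below assumes a concentration series sampled every 30 minutes; the filter order and cut-off frequency are illustrative only, and filtfilt is used so that the smoothed curve is not shifted in time.

```python
import numpy as np
from scipy import signal

# A concentration series sampled every 0.5 h (48 samples per day) is assumed;
# the filter order and cut-off frequency are illustrative only.
fs = 48.0                                  # samples per day
cutoff = 2.0                               # cut-off frequency, cycles per day
b, a = signal.butter(N=4, Wn=cutoff, btype="low", fs=fs)

rng = np.random.default_rng(0)
t = np.arange(4118) / fs                   # elapsed time, days
conc = np.exp(-0.5 * ((t - 43.0) / 6.0) ** 2) + 0.02 * rng.standard_normal(t.size)

smoothed = signal.filtfilt(b, a, conc)     # zero-phase low-pass filtering
```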
Figure 6.
Example BTC plots developed using varying smoothing routines of the measured data. (α is the smoothing parameter.)
Qtracer2 analyses of the BTCs shown in Figure 6 are shown in Table 3. Although the Qtracer2 results shown in Table 3 mostly do not indicate radical differences between any of the smoothed BTCs, the results do serve to illustrate how differing smoothing routines can influence basic BTC analyses.
Table 3.
Comparison of Qtracer2 results using differing smoothing routines for the breakthrough curves (BTCs) shown in Figure 6.
| Figure 6 BTC | Smoothing Method | Smoothing Parameter | Sampling Interval, h | Massc Recovered, mg | Time to Leading Edge, d | Time to Peak, d | Peak Concen., μg L−1 | Mean Time of Travel, d | Mean Velocity, 10−3 m d−1 | Longitudinal Dispersivity, 10−3 m | Peclet Number |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fig. 6a | Moving Average | 35 | 17.5 | 668.95 | 81.96 | 85.61 | 0.481 | 109.86 | 80.28 | 114.36 | 77.12 |
| Fig. 6b | Moving Median | 35 | 17.5 | 668.07 | 82.69 | 85.61 | 0.480 | 109.94 | 80.22 | 163.49 | 53.95 |
| Fig. 6c | Expon. Mov. Ave. | 151 | 75.5 | 651.44 | 79.05 | 85.34 | 0.480 | 110.16 | 80.06 | 23.86 | 369.60 |
| Fig. 6d | Cubic Spline | … | 0.5 | 670.43 | 80.17 | 84.75 | 0.487 | 109.63 | 80.45 | 73.20 | 120.48 |
| Fig. 6e | Quintic Spline | 50 | 25.0 | 693.26 | 79.05 | 85.30 | 0.493 | 108.54 | 81.26 | 80.73 | 109.25 |
| Fig. 6f | Wt. Quintic Spline | 0.96a | 48.0 | 681.08 | 83.05 | 85.05 | 0.479 | 109.06 | 80.87 | 167.17 | 52.76 |
| Fig. 6g | B-Spline Fit | 20 | 10.0 | 670.76 | 81.96 | 84.88 | 0.486 | 109.58 | 80.48 | 154.33 | 57.15 |
| Fig. 6h | Poly. Fit | … | 0.5 | 646.98 | 79.25 | 84.34 | 0.484 | 108.82 | 81.04 | 159.04 | 55.46 |
| Fig. 6i | Savit.-Golay Fit | 120 | 60.0 | 664.10 | 84.05 | 84.05 | 0.473 | 109.60 | 80.47 | 166.78 | 52.88 |
| Fig. 6j | Butterworth Filter | 50b | 25.0 | 669.81 | 82.17 | 85.30 | 0.481 | 109.80 | 80.33 | 123.63 | 71.34 |
| Fig. 6k | Smooft Fit | 120 | 60.0 | 715.29 | 79.05 | 86.55 | 0.467 | 107.36 | 82.15 | 46.82 | 188.37 |
| Fig. 6l | Curve Fit | … | 0.5 | 675.27 | 81.09 | 86.09 | 0.481 | 109.48 | 80.56 | 116.25 | 75.87 |
| Fig. 6m | Gelkern Fit | 15 | 7.5 | 670.15 | 80.61 | 85.30 | 0.481 | 109.60 | 80.47 | 72.94 | 120.91 |
| Fig. 6n | Lokern Fit | 15 | 7.5 | 670.98 | 80.92 | 84.98 | 0.482 | 109.58 | 80.49 | 86.20 | 102.32 |
| Fig. 6o | Lowess Fit | … | 0.5 | 670.40 | 80.09 | 85.25 | 0.480 | 109.62 | 80.45 | 67.67 | 130.33 |
a. This is the “weight value” for the weighted quintic.
b. This is the cut-off frequency for the Butterworth Filter.
c. It should be noted that mass recovered and all estimated transport parameters will be adversely affected by likely substantial errors in discharge measurements.
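Two of the smoothing routines listed in Table 3 are readily available in SciPy and can be sketched as follows; the window length, polynomial order, and spline smoothing factor below are illustrative assumptions, and the synthetic series merely stands in for the measured BTC.

```python
import numpy as np
from scipy.signal import savgol_filter
from scipy.interpolate import UnivariateSpline

# 't' (days) and 'conc' (ug/L) stand in for the measured BTC; window length,
# polynomial order, and the spline smoothing factor are illustrative only.
rng = np.random.default_rng(0)
t = np.linspace(80.0, 160.0, 2000)
conc = np.exp(-0.5 * ((t - 85.0) / 6.0) ** 2) + 0.02 * rng.standard_normal(t.size)

# Savitzky-Golay: fit a low-order polynomial over a moving window.
sg = savgol_filter(conc, window_length=121, polyorder=3)

# Cubic smoothing spline; larger s gives a smoother (less faithful) curve.
spline = UnivariateSpline(t, conc, k=3, s=len(t) * 0.02 ** 2)
spline_fit = spline(t)
```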
Assessment of Fit.
Typically, the fit of a smoothing routine to a set of data is assessed using a statistical routine. Most common is the use of Pearson’s Correlation Coefficient R and the Coefficient of Determination R2, but other statistical measures of fit (e.g., Spearman’s Rank-Order Correlation Coefficient RS, Kendall’s Rank-Order Correlation Coefficient τ, etc.) can be used. All measures of fit are dependent on specific conditions (Press et al., 1997a, p. 633–634), which often appear to be generally ignored (e.g., parametric versus nonparametric methods). Table 4 lists various statistical measures of fit for the smoothing routines shown in Figure 6.
Table 4.
Comparison of measures of fit for the smoothing routines for the breakthrough curves (BTCs) shown in Figure 6.
| Figure 6 BTC | Smoothing Method | Pearson’s Correlation Coefficient Ra | P-value | Spearman’s Rank-Order Coefficient RSa | P-value | Kendall’s Rank-Order Coefficient τa | P-value | Coefficient of Determination R2 | Nash-Sutcliffe Efficiency NSE | Root Mean Sq. Error, RMSE | Percent Bias, PBIAS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fig. 6a | Moving Average | 0.9821 | 0.00 | 0.9745 | 0.00 | 0.8805 | 0.00 | 0.9645 | 0.9645 | 0.0230 | −0.0940 |
| Fig. 6b | Moving Median | 0.9811 | 0.00 | 0.9715 | 0.00 | 0.8798 | 0.00 | 0.9626 | 0.9625 | 0.0236 | −0.1097 |
| Fig. 6c | Expon. Mov. Ave. | 0.9868 | 0.00 | 0.9804 | 0.00 | 0.8900 | 0.00 | 0.9737 | 0.9737 | 0.0198 | −0.0745 |
| Fig. 6d | Cubic Spline | 0.9917 | 0.00 | 0.9838 | 0.00 | 0.8959 | 0.00 | 0.9834 | 0.9834 | 0.0157 | 0.0006 |
| Fig. 6e | Quintic Spline | 0.9904 | 0.00 | 0.9819 | 0.00 | 0.8893 | 0.00 | 0.9809 | 0.9809 | 0.0168 | 0.0506 |
| Fig. 6f | Wt. Quintic Spline | 0.9932 | 0.00 | 0.9874 | 0.00 | 0.9086 | 0.00 | 0.9865 | 0.9865 | 0.0142 | 0.0070 |
| Fig. 6g | B-Spline Fit | 0.9973 | 0.00 | 0.9949 | 0.00 | 0.9420 | 0.00 | 0.9946 | 0.9946 | 0.0090 | −0.0012 |
| Fig. 6h | Polynomial Fit | 0.9874 | 0.00 | 0.9795 | 0.00 | 0.8839 | 0.00 | 0.9751 | 0.9724 | 0.0203 | −3.4959 |
| Fig. 6i | Savitsky-Golay Fit | 0.9904 | 0.00 | 0.9816 | 0.00 | 0.8900 | 0.00 | 0.9809 | 0.9809 | 0.0169 | 0.0700 |
| Fig. 6j | Butterworth Filter | 0.9861 | 0.00 | 0.9775 | 0.00 | 0.8814 | 0.00 | 0.9724 | 0.9724 | 0.0203 | −0.0678 |
| Fig. 6k | Smooft Fit | 0.9901 | 0.00 | 0.9815 | 0.00 | 0.8891 | 0.00 | 0.9803 | 0.9803 | 0.0171 | −0.1234 |
| Fig. 6l | Curve Fit | 0.9767 | 0.00 | 0.9607 | 0.00 | 0.8631 | 0.00 | 0.9540 | 0.9536 | 0.0263 | 0.7232 |
| Fig. 6m | Gelkern Fit | 0.9895 | 0.00 | 0.9803 | 0.00 | 0.8869 | 0.00 | 0.9791 | 0.9791 | 0.0177 | −0.0028 |
| Fig. 6n | Lokern Fit | 0.9898 | 0.00 | 0.9805 | 0.00 | 0.8870 | 0.00 | 0.9797 | 0.9797 | 0.0174 | 0.1231 |
| Fig. 6o | Lowess Fit | 0.9886 | 0.00 | 0.9791 | 0.00 | 0.8845 | 0.00 | 0.9774 | 0.9772 | 0.0184 | −0.0030 |
a. The calculated P-value equaled zero for Pearson’s Correlation Coefficient R, Spearman’s Rank-Order Correlation Coefficient RS, and Kendall’s Rank-Order Correlation Coefficient τ for all smoothing routines.
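A minimal sketch of how several of the fit measures listed in Table 4 might be computed is given below; the function arguments are assumed arrays of measured and smoothed concentrations, and sign conventions for percent bias vary between sources.

```python
import numpy as np

def fit_measures(measured, smoothed):
    """Pearson's R, R-squared, Nash-Sutcliffe efficiency, RMSE, and percent
    bias between a measured BTC and a smoothed BTC (cf. Table 4).  Note that
    sign conventions for percent bias vary between sources."""
    resid = measured - smoothed
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((measured - measured.mean()) ** 2)
    r = np.corrcoef(measured, smoothed)[0, 1]            # Pearson's R
    nse = 1.0 - ss_res / ss_tot                          # Nash-Sutcliffe efficiency
    rmse = np.sqrt(np.mean(resid ** 2))                  # root mean square error
    pbias = 100.0 * np.sum(resid) / np.sum(measured)     # percent bias
    return r, r ** 2, nse, rmse, pbias
```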
If the desire of a selected smoothing routine is to obtain the smoothest possible curve to the measured data then it is apparent from Table 4 that reliance on statistical measures of a fit may not be well grounded. For example, Table 4 would suggest that, if percent bias (PBIAS) is ignored, the B-Spline method produces the statistically best fit to the data by all other measures of fit but is visually the least smooth curve (see Figure 6g). The PBIAS measure suggests that the Cubic Spline (Figure 6d) represents the best fit to the BTC, although other smoothing routines (e.g., Gelkern and Lowess, Figures 6m and 6o) also may be regarded as reasonable according to the PBIAS measure.
Error sources play a significant role in regard to fit. For example, if instrumental measurement errors are solely responsible for a poor fit, then a normal distribution might be expected. Alternatively, outliers can adversely affect smoothing fit. Although not evaluated in this study, it might be possible to quantify sample to sample instrument error in one distribution and resolve outliers in a second distribution, which then could allow for a statistical analysis on just the outliers but even a higher sampling frequency (e.g., every minute or every second) would be required.
Test for independence.
The Chi-Square Test Statistic χ2 and 1-D Kolmogorov-Smirnov Test Statistic D1 for each smoothing routine are shown in Table 5, but neither of these test statistics really indicates that one routine is any better than the others. For D1, only the MA, MM, weighted quintic spline, and Butterworth filter exceeded the Kolmogorov-Smirnov critical value (Table 5). Calculation of a 2-D Kolmogorov-Smirnov statistic D2 can be used to determine if the measured tracer BTC data and the various smoothing routines are drawn from the same distribution. When either D1 or D2 is greater than the Kolmogorov-Smirnov critical value, the null hypothesis Ho (the measured dataset and the smoothed datasets are drawn from the same distribution) may be rejected. A small Kolmogorov-Smirnov P-value also suggests that the two samples are significantly different (Press et al., 1997b, p. 1283). For the smoothing routines shown in Figure 6 and listed in Table 5, only the B-spline (Figure 6g) had a calculated D2 less than the Kolmogorov-Smirnov critical value (Table 5). In addition, only the Weighted Quintic Spline and B-spline (Figures 6f and 6g, respectively) had estimated P-values greater than 0.05 (Table 5).
Table 5.
Some statistical measures of assessment for the differing smoothing routines for the breakthrough curves (BTCs) shown in Figure 6.
| Figure 6 BTC | Smoothing Method | Pearson Chi-Square Test Statistic χ2 a | P-value | 1-D Kolmogorov-Smirnov Test Statistic D1b | P-value | 2-D Kolmogorov-Smirnov Test Statistic D2b | P-value | Durbin-Watson Test Statistic d | P-value |
|---|---|---|---|---|---|---|---|---|---|
| Fig. 6a | Moving Averageb,c | 6.9398975 | 1.00 | 0.0279 | 0.08 | 0.0421 | 8.68 × 10−3 | 0.93 | 1.00 |
| Fig. 6b | Moving Medianc,d,e | 7.8217144 | 1.00 | 0.0342 | 0.02 | 0.0395 | 1.70 × 10−2 | 0.88 | 1.00 |
| Fig. 6c | Expon. Mov. Ave.d | 5.3769827 | 1.00 | 0.0002 | 1.00 | 0.0404 | 1.33 × 10−2 | 1.20 | 1.00 |
| Fig. 6d | Cubic Splined,g | 4.1638837 | 1.00 | 0.0039 | 1.00 | 0.0412 | 1.09 × 10−2 | 2.05 | 1.00 |
| Fig. 6e | Quintic Splined | 5.1174531 | 1.00 | 0.0107 | 0.97 | 0.0464 | 2.66 × 10−3 | 1.79 | 1.00 |
| Fig. 6f | Wt. Quintic Splinec,d,g | 3.2597842 | 1.00 | 0.0279 | 0.08 | 0.0288 | 0.16 | 1.96 | 1.00 |
| Fig. 6g | B-Spline Fit | 1.3903918 | 1.00 | 0.0024 | 1.00 | 0.0170 | 0.77 | 3.24 | 1.00 |
| Fig. 6h | Polynomial Fitd | 8.2148256 | 1.00 | 0.0236 | 0.20 | 0.0847 | 4.51 × 10−10 | 1.25 | 1.00 |
| Fig. 6i | Savitsky-Golay Fitd | 4.8710556 | 1.00 | 0.0075 | 1.00 | 0.0470 | 2.23 × 10−3 | 1.74 | 1.00 |
| Fig. 6j | Butterworth Filterc,d | 5.8660245 | 1.00 | 0.0279 | 0.08 | 0.0404 | 1.33 × 10−2 | 1.25 | 1.00 |
| Fig. 6k | Smooft Fitd | 6.0584912 | 1.00 | 0.0112 | 0.96 | 0.0567 | 9.30 × 10−5 | 1.73 | 1.00 |
| Fig. 6l | Curve Fitd | 12.0661 | 1.00 | 0.0228 | 0.23 | 0.0567 | 9.30 × 10−5 | 0.74 | 1.00 |
| Fig. 6m | Gelkern Fitd | 5.4183507 | 1.00 | 0.0172 | 0.57 | 0.0488 | 1.29 × 10−3 | 1.63 | 1.00 |
| Fig. 6n | Lokern Fitd | 5.2129078 | 1.00 | 0.0187 | 0.46 | 0.0469 | 0.23 × 10−2 | 1.68 | 1.00 |
| Fig. 6o | Lowess Fitd | 5.9411678 | 1.00 | 0.0080 | 1.00 | 0.0497 | 9.95 × 10−4 | 1.50 | 1.00 |
a. Chi-Square values ranged from 1.3904 to 12.0661, but all smoothing routines resulted in a P-value equal to 1.0.
b. Kolmogorov-Smirnov Test critical value = 0.02697.
c. For the 1-D Kolmogorov-Smirnov Test, D1 exceeds the Kolmogorov-Smirnov critical value.
d. For the 2-D Kolmogorov-Smirnov Test, D2 exceeds the Kolmogorov-Smirnov critical value.
e. For the 1-D Kolmogorov-Smirnov Test, P-value less than 0.05.
f. For the 2-D Kolmogorov-Smirnov Test, P-value greater than 0.05.
g. For the Durbin-Watson Test, the Cubic Spline and Weighted Quintic Spline suggest the least probability of autocorrelation, but all smoothing routines resulted in a P-value equal to 1.0.
The Durbin-Watson Test Statistic d, also listed in Table 5, suggests that the Cubic Spline and Weighted Quintic Spline methods are the only smoothing routines with little to no autocorrelation (see Figures 6d and 6f) because 1.5 < d < 2.5, which implies independence between observations and fit (Karacan, 2008). Visually, Figure 6d appears reasonable but Figure 6f less so. Realistically, however, Figures 6d – 6f, 6i, 6k, 6m – 6n, and perhaps Figure 6o may be regarded as reasonable according to the Durbin-Watson Test Statistic. The B-Spline method is the only method that suggested a negative serial correlation (see Figure 6g and Table 5).
A careful examination of Table 5 shows that for the Chi-Square and Durbin-Watson Test Statistics p > 0.05 for all smoothing routines and that for the 1D Kolmogorov-Smirnov Test Statistic p > 0.05 for all smoothing routines, except for one method (moving median), which would generally be regarded as strong evidence favoring a null hypothesis (Briggs, 2019b; Wasserstein et al., 2019) but is really nonsense (Amrheine et al., 2019). Such large P-values would suggest that all the smoothing routine fits are due only to chance, which is impossible because chance cannot cause anything (Briggs, 2019e). In this sense, even though P-values are reported in Table 5, they are essentially worthless (Briggs, 2019a,c,d) and can be ignored when applying a smoothing routine to data-dense tracer BTCs.
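The two test statistics discussed above are readily available in SciPy and statsmodels. The sketch below uses synthetic stand-ins for the measured and smoothed BTCs purely to show the calls; the two-sample Kolmogorov-Smirnov test corresponds to the 1-D test of Table 5, and the Durbin-Watson statistic is computed on the residuals.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# 'measured' and 'smoothed' stand in for the BTC and one smoothing routine's
# output (synthetic series here, for illustration only).
rng = np.random.default_rng(0)
t = np.linspace(80.0, 160.0, 2000)
measured = np.exp(-0.5 * ((t - 85.0) / 6.0) ** 2) + 0.02 * rng.standard_normal(t.size)
smoothed = np.convolve(measured, np.ones(25) / 25.0, mode="same")

# Two-sample (1-D) Kolmogorov-Smirnov test: are the two samples drawn from
# the same distribution?
D1, p_ks = stats.ks_2samp(measured, smoothed)

# Durbin-Watson statistic of the residuals: values near 2 suggest little
# autocorrelation; values below 1.5 (above 2.5) suggest positive (negative)
# serial correlation.
d = durbin_watson(measured - smoothed)
print(D1, p_ks, d)
```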
4.2.4. Smoothing Error Analysis
Analyzing the appropriateness or correctness of a smoothing routine (curve fitting or digital filtering) applied to a set of measured data is not easily accomplished from a quantitative (mathematical or statistical) perspective. Traditional measures of fit, such as the correlation coefficient and coefficient of determination, are less appropriate when a smoothing routine is applied to very dense datasets such as that shown in Figure 3. In some instances, a qualitative visual determination of fit is considered to be an acceptable measure (see Figure 6, particularly Figures 6l – 6o).
Residual Data Plots.
A generally accepted method for assessing smoothing fit problems is the analysis of residual plots (Costa, 2017, p. 414). According to Tsai et al. (1998), a null linear residual plot generally shows that there are no obvious defects in the model, a curved plot indicates nonlinearity, and a fan-shaped or double-bow pattern indicates nonconstant variance. Tsai et al., citing Cook (1994), further noted, however, that linear residual plots can be misleading and may be insufficiently powerful by themselves to detect nonlinearity or heteroscedasticity. Figure 7 is a plot of residual values versus the explanatory variable (time), which is commonly employed when regression is performed on time-series data, and Figure 8 is a plot of residual values versus fitted values for assessing linearity, independence, and homoscedasticity (Costa, 2017, p. 416).
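The two residual diagnostics just described are straightforward to generate for any smoothing fit. The sketch below is a minimal Python illustration, not the plotting code used to produce Figures 7 and 8; the arrays `t`, `c_meas`, and `c_fit` are hypothetical placeholders for the sampling times, measured concentrations, and smoothed (fitted) concentrations.

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plots(t, c_meas, c_fit):
    """Plot residuals against time and against fitted values.

    A patternless band around zero supports the fit; curvature suggests
    nonlinearity and a fan or double-bow shape suggests nonconstant variance.
    """
    resid = np.asarray(c_meas) - np.asarray(c_fit)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
    ax1.scatter(t, resid, s=5)
    ax1.axhline(0.0, color="k", lw=0.8)
    ax1.set(xlabel="Time", ylabel="Residual", title="Residuals vs. time")
    ax2.scatter(c_fit, resid, s=5)
    ax2.axhline(0.0, color="k", lw=0.8)
    ax2.set(xlabel="Fitted concentration", ylabel="Residual",
            title="Residuals vs. fitted values")
    fig.tight_layout()
    return fig
```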
Figure 7.
Plots of residual values versus explanatory variables (time) developed from their respective BTC plots shown in Figure 6.
Figure 8.
Plots of residual values versus fitted values developed from their respective BTC plots shown in Figure 6.
Figure 7 shows that, in general, the bulk of the data for all the smoothing routines plots along the zero residual line. In particular, Figure 7d (cubic spline), Figure 7f (weighted quintic spline), Figure 7g (B-spline), and Figure 7i (Savitsky-Golay) appear to be acceptable, while Figure 7l (curve fit), Figure 7m (Gelkern), Figure 7n (Lokern), and Figure 7o (Lowess) appear to be reasonably acceptable as well.
In contrast to Figure 7, Figure 8 suggests that several of the smoothing routines may be seriously problematic. Figures 8a – 8c, 8j, and 8l (moving average, moving median, exponential moving average, Butterworth, and curve fit, respectively) suggest a lack of linearity, independence, and homoscedasticity. Only Figures 8d (cubic spline) and 8f (weighted quintic spline) appear to be acceptable, although Figures 8g (B-spline), 8i (Savitsky-Golay), 8k (smoothing), 8m (Gelkern), 8n (Lokern), and 8o (Lowess) may also be regarded as reasonable.
An important aspect of determining the appropriateness of a regression analysis is to assess homoscedasticity. According to Statistics Solutions (2019) the assumption of homoscedasticity (i.e., same variance) is central to linear regression models. Homoscedasticity describes a situation in which the random error is the same across all values of the independent variables. Heteroscedasticity (the violation of homoscedasticity) is present when the size of the random error differs across values of an independent variable. The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as heteroscedasticity increases. When heteroscedasticity is present, the cases with larger disturbances have more drag than do other observations. Statistics Solutions (2019) further emphasizes that a more serious problem associated with heteroscedasticity is the fact that the standard errors are biased. Standard errors are central to conducting significance tests and calculating confidence intervals, so biased standard errors lead to incorrect conclusions about the significance of the regression coefficients. In general, however, the violation of the homoscedasticity assumption must be quite severe in order to present a major problem given the robust nature of regression (Statistics Solutions, 2019).
Figure S1 shows homoscedasticity plots for the BTCs shown in Figure 6, of which only Figures S1d–S1f appear most reasonable. Figures S1g–S1i, S1k, and S1m–S1o may be regarded as acceptable as well. Figures S2 and S3 are squared residual plots versus the explanatory variable (time) and versus the fitted values, respectively, as suggested by Tsai et al. (1998).
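Beyond visual inspection of the squared-residual plots, a formal check such as the Breusch-Pagan test, which regresses the squared residuals on the explanatory variable, can supplement the graphical assessment. The following is a minimal sketch of such a check (not a test used in this study), assuming hypothetical arrays `t` and `resid` holding the sampling times and the residuals of a smoothing fit.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

def breusch_pagan(t, resid):
    """Breusch-Pagan test for heteroscedasticity.

    Small Lagrange-multiplier p-values indicate that the residual variance
    changes with the explanatory variable (i.e., heteroscedasticity).
    """
    exog = sm.add_constant(np.asarray(t, dtype=float))  # constant + time
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, exog)
    return lm_stat, lm_pvalue
```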
5. Survival Analysis
Survival analysis is the analysis of time-to-event data that describe the elapsed time from a time origin to some endpoint of interest (Kartsonaki, 2016), which, from a tracing-study perspective, would be the elapsed time from injection until tracer detection at some downstream location. Assume that T is a continuous random variable with a probability density function (PDF) f(t) and a cumulative density function (CDF) F(t) = Pr(T ≤ t), the probability that an event will occur by duration t (Rodríguez, 2007). In terms of a tracing study, the survival function is the probability that the tracing study continues beyond time t without tracer detection, which can be described by the survival function (Kartsonaki, 2016; Rodríguez, 2007)
| S(t) = \Pr(T > t) = 1 - F(t) = \int_t^\infty f(x)\,dx | (16) |
Alternatively, the distribution of T is given by the hazard function, or instantaneous rate of occurrence of the event, that is defined as (Kartsonaki, 2016; Rodríguez, 2007)
| h(t) = \lim_{dt \to 0} \frac{\Pr(t \le T < t + dt \mid T \ge t)}{dt} = \frac{f(t)}{S(t)} | (17) |
which suggests that
| h(t) = -\frac{d}{dt} \ln S(t) | (18) |
and
| S(t) = \exp\left[-\int_0^t h(x)\,dx\right] | (19) |
For a tracing study, the hazard function represents the successful detection of tracer and thus is regarded as a positive; the greater the hazard the lower the survival.
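To make the connection to tracer data concrete, an empirical survival function can be constructed by treating the normalized cumulative tracer recovery as F(t). The following is a minimal sketch under the simplifying assumption of constant discharge (so that concentration is proportional to mass flux); the array names `t` and `c` are placeholders, not the study data.

```python
import numpy as np

def empirical_survival_hazard(t, c):
    """Empirical F(t), S(t), and h(t) from a tracer BTC.

    F(t) is taken as the normalized cumulative recovery, S(t) = 1 - F(t),
    and h(t) = f(t) / S(t), where f(t) is the concentration normalized so
    that its integral over the test duration equals one.
    """
    t = np.asarray(t, dtype=float)
    c = np.asarray(c, dtype=float)
    # Cumulative trapezoidal integration of the concentration curve.
    increments = 0.5 * (c[1:] + c[:-1]) * np.diff(t)
    cum = np.concatenate(([0.0], np.cumsum(increments)))
    total = cum[-1]                    # total recovered mass per unit discharge
    f = c / total                      # empirical arrival-time density
    F = cum / total                    # empirical CDF (cumulative recovery)
    S = 1.0 - F                        # survival: tracer not yet detected by t
    with np.errstate(divide="ignore", invalid="ignore"):
        h = np.where(S > 0, f / S, np.nan)  # hazard: instantaneous detection rate
    return F, S, h
```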
5.1. Parametric Survival Functions
A probability distribution Pr is determined from the probability of a scalar random variable T being in a half-open interval (a, b], so the probability distribution is fully characterized by its CDF (WFI, 2019b). The CDF describes the probability that the random variable will be no larger than a given value. This means that the probability that an outcome lies within a given interval can be computed by taking the difference between the values of the CDF at the endpoints of the interval (WFI, 2019b). The measurement of tracer concentrations during a tracing study yields a CDF, which is generally described by (WFI, 2019b)
| F(t) = \Pr(T \le t) = \int_{-\infty}^{t} f(x)\,dx | (20) |
The CDF (which defines the area under the PDF from −∞ to t) is defined for an infinite number of points over a continuous interval, and the probability at a single point is always zero. Probabilities are measured over intervals, not single points, so the area under the curve between two distinct points defines the probability for that interval. This means that the height of the probability density function can in fact be greater than one. The property that the integral must equal one is equivalent to the property for discrete distributions that the sum of all the probabilities must equal one (NIST/SEMATECH, 2013d).
The percent point function (PPF) (the inverse of the cumulative distribution function) specifies the value of the random variable such that the probability of the variable being less than or equal to that value equals the given probability (NIST/SEMATECH, 2013e; WFI, 2019c). The PDF helps identify regions of higher and lower tracer detection probabilities, and the PPF gives the corresponding tracer detection time for each cumulative probability. For a selected distribution function, the probability that the variable is less than or equal to t is calculated, and for the PPF the probability is used to calculate the corresponding t from the cumulative distribution, which may be expressed mathematically as (NIST/SEMATECH, 2013e)
| F(t) = \Pr(T \le t) = p | (21) |
or as
| t = G(p) = F^{-1}(p) | (22) |
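The complementary roles of the CDF and PPF in Equations (21) and (22) can be illustrated with any standard distribution library; in the minimal sketch below the exponential scale value is arbitrary and purely illustrative.

```python
from scipy import stats

dist = stats.expon(scale=0.12)  # illustrative scale parameter only

t = 0.25
p = dist.cdf(t)        # probability that the arrival time is <= t
t_back = dist.ppf(p)   # the PPF (inverse CDF) recovers the original t

print(p, t_back)       # t_back equals t up to floating-point precision
```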
5.1.1. Weibull Distribution
A probability distribution function is not typically determined for tracer BTCs even though appropriate probability distribution functions are commonly developed and applied to measured datasets. One exception, developed in Hansen et al. (2018), suggested that solute BTCs generated in heterogeneous aquifers can be described by a lognormal distribution, and that important aspects of the BTCs, such as BTC shape and mean time of travel, can be predicted. Another exception developed in Hansen and Berkowitz (2014) suggested that a Pareto distribution can be used in some instances to model experimental BTCs obtained in heterogeneous media. In still another instance, Zhou et al. (2002) suggested that the gamma distribution can be used to synthesize a BTC.
For the data set depicted in Figure 3 it was found that neither the lognormal nor the Pareto distributions could be applied. Rather, it was determined that a Weibull distribution given by the PDF (NIST/SEMATECH, 2013f; Walck, 2007, p. 152)
| f(t) = \frac{\gamma}{\eta}\left(\frac{t - \beta}{\eta}\right)^{\gamma - 1} \exp\left[-\left(\frac{t - \beta}{\eta}\right)^{\gamma}\right], \quad t \ge \beta;\ \gamma, \eta > 0 | (23) |
most appropriately describes the tracer BTCs shown in Figures 4, 5, and 6 (Table 6). For the case in which the location parameter β = 0 and the scale parameter η = 1, Equation (23) reduces to the standardized form (NIST/SEMATECH, 2013f; Walck, 2007, p. 152)
| f(t) = \gamma\, t^{\gamma - 1} \exp\left(-t^{\gamma}\right), \quad t \ge 0;\ \gamma > 0 | (24) |
Table 6.
Weibull analysis of the full and reduced data breakthrough curves (BTCs) shown in Figures 4 and 5, and smoothed data BTCs shown in Figure 6.
| | | Probability Distribution Parameters | | | | | | | | | | | |
| | | Basic Distribution Statisticsa | | | | Weibull/Exponential Distributionsb | | | | Extreme Value Type-I Distribution (Gumbelmax) | | | |
| Figure BTC | Data Examination Method | Mean, μ | Standard Deviation, σ | Minimum | Maximum | Scale, η | Location, β | Correlation Coefficient, R | Coefficient of Determination, R2 | Scale, η | Location, β | Correlation Coefficient, R | Coefficient of Determination, R2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data Reduction | | | | | | | | | | | | | |
| Fig. 4a | Full Data Set | 0.1809 | 0.1220 | 0.0000 | 0.5149 | 0.1171 | 0.0638 | 0.9579 | 0.9176 | 0.0933 | 0.1270 | 0.9797 | 0.9597 |
| Fig. 4b | 50% | 0.1807 | 0.1218 | 0.0000 | 0.5113 | 0.1172 | 0.0636 | 0.9584 | 0.9185 | 0.0933 | 0.1269 | 0.9798 | 0.9600 |
| Fig. 4c | 33% | 0.1806 | 0.1225 | 0.0000 | 0.5149 | 0.1183 | 0.0624 | 0.9599 | 0.9214 | 0.0941 | 0.1264 | 0.9810 | 0.9623 |
| Fig. 4d | 20% | 0.1808 | 0.1223 | 0.0000 | 0.5096 | 0.1184 | 0.0625 | 0.9603 | 0.9221 | 0.0940 | 0.1266 | 0.9802 | 0.9608 |
| Fig. 4e | 10% | 0.1810 | 0.1223 | 0.0000 | 0.4917 | 0.1192 | 0.0623 | 0.9605 | 0.9226 | 0.0944 | 0.1268 | 0.9799 | 0.9602 |
| Fig. 4f | 5.0% | 0.1803 | 0.1220 | 0.0000 | 0.4917 | 0.1207 | 0.0604 | 0.9650 | 0.9312 | 0.0951 | 0.1259 | 0.9827 | 0.9656 |
| Fig. 4g | 3.4% | 0.1810 | 0.1226 | 0.0000 | 0.4917 | 0.1224 | 0.0597 | 0.9666 | 0.9344 | 0.0963 | 0.1261 | 0.9844 | 0.9690 |
| Fig. 4h | 2.5% | 0.1807 | 0.1229 | 0.0000 | 0.4851 | 0.1240 | 0.0581 | 0.9694 | 0.9398 | 0.0970 | 0.1256 | 0.9838 | 0.9678 |
| Fig. 4i | 2.0% | 0.1803 | 0.1274 | 0.0000 | 0.4917 | 0.1296 | 0.0524 | 0.9715 | 0.9437 | 0.1011 | 0.1230 | 0.9853 | 0.9708 |
| Fig. 4j | 1.7% | 0.1805 | 0.1219 | 0.0000 | 0.4917 | 0.1252 | 0.0572 | 0.9742 | 0.9491 | 0.0976 | 0.1254 | 0.9884 | 0.9770 |
| Fig. 4k | 1.4% | 0.1826 | 0.1258 | 0.0000 | 0.4851 | 0.1300 | 0.0549 | 0.9739 | 0.9484 | 0.1006 | 0.1259 | 0.9832 | 0.9667 |
| Fig. 4l | 1.3% | 0.1769 | 0.1226 | 0.0000 | 0.4772 | 0.1278 | 0.0515 | 0.9780 | 0.9565 | 0.0987 | 0.1214 | 0.9869 | 0.9740 |
| Fig. 4m | 1.12% | 0.1841 | 0.1240 | 0.0000 | 0.4786 | 0.1295 | 0.0573 | 0.9744 | 0.9495 | 0.1003 | 0.1278 | 0.9873 | 0.9748 |
| Fig. 4n | 1.02% | 0.1823 | 0.1294 | 0.0000 | 0.4917 | 0.1363 | 0.0491 | 0.9778 | 0.9561 | 0.1053 | 0.1234 | 0.9896 | 0.9792 |
| Fig. 4o | 0.92% | 0.1772 | 0.1237 | 0.0000 | 0.4586 | 0.1305 | 0.0500 | 0.9740 | 0.9487 | 0.1006 | 0.1210 | 0.9852 | 0.9706 |
| Fig. 4p | 0.85% | 0.1784 | 0.1237 | 0.0000 | 0.4728 | 0.1315 | 0.0504 | 0.9781 | 0.9597 | 0.1013 | 0.1221 | 0.9889 | 0.9779 |
| Fig. 4q | 0.78% | 0.1728 | 0.1278 | 0.0000 | 0.4761 | 0.1373 | 0.0394 | 0.9831 | 0.9665 | 0.1052 | 0.1144 | 0.9899 | 0.9800 |
| Fig. 4r | 0.73% | 0.1779 | 0.1261 | 0.0000 | 0.4851 | 0.1363 | 0.0457 | 0.9854 | 0.9710 | 0.1041 | 0.1202 | 0.9907 | 0.9815 |
| Fig. 4s | 0.68% | 0.1766 | 0.1280 | 0.0000 | 0.4917 | 0.1388 | 0.0422 | 0.9844 | 0.9691 | 0.1064 | 0.1178 | 0.9937 | 0.9875 |
| Fig. 4t | 0.63% | 0.1699 | 0.1204 | 0.0000 | 0.4772 | 0.1319 | 0.0425 | 0.9892 | 0.9784 | 0.1002 | 0.1147 | 0.9915 | 0.9830 |
| Fig. 5bc | 0.32% | 0.1802 | 0.1378 | 0.0000 | 0.4835 | 0.1600 | 0.0294 | 0.9909 | 0.9819 | 0.1197 | 0.1163 | 0.9922 | 0.9844 |
| Fig. 5cd | 0.17% | 0.1491 | 0.1287 | 0.0000 | 0.3627 | 0.1602 | 0.0036 | 0.9879 | 0.9759 | 0.1182 | 0.0889 | 0.9929 | 0.9858 |
| Data Smoothing | | | | | | | | | | | | | |
| Fig. 6a | Moving Average | 0.1801 | 0.1216 | 0.0000 | 0.4813 | 0.1221 | 0.0592 | 0.9686 | 0.9383 | 0.0957 | 0.1256 | 0.9836 | 0.9675 |
| Fig. 6b | Moving Median | 0.1799 | 0.1223 | 0.0000 | 0.4795 | 0.1228 | 0.0583 | 0.9678 | 0.9367 | 0.0962 | 0.1251 | 0.9834 | 0.9671 |
| Fig. 6c | Expon. Mov. Ave. | 0.1719 | 0.1246 | 4.6600 × 10−6 | 0.4797 | 0.1356 | 0.0407 | 0.9880 | 0.9761 | 0.1034 | 0.1148 | 0.9926 | 0.9852 |
| Fig. 6d | Cubic Spline | 0.1809 | 0.1209 | 0.0000 | 0.4872 | 0.1158 | 0.0651 | 0.9558 | 0.9136 | 0.0922 | 0.1277 | 0.9766 | 0.9537 |
| Fig. 6e | Quintic Spline | 0.1797 | 0.1219 | 0.0000 | 0.4926 | 0.1244 | 0.0570 | 0.9736 | 0.9478 | 0.0968 | 0.1249 | 0.9852 | 0.9707 |
| Fig. 6f | Weighted Quintic Spline | 0.1838 | 0.1249 | 0.0000 | 0.4791 | 0.1310 | 0.0558 | 0.9747 | 0.9501 | 0.1014 | 0.1271 | 0.9874 | 0.9749 |
| Fig. 6g | B-Spline Fit | 0.1809 | 0.1220 | 0.0000 | 0.4863 | 0.1207 | 0.0610 | 0.9653 | 0.9318 | 0.0950 | 0.1266 | 0.9819 | 0.9642 |
| Fig. 6h | Polynomial Fit | 0.1745 | 0.1206 | 0.0000 | 0.4839 | 0.1155 | 0.0591 | 0.9556 | 0.9131 | 0.0923 | 0.1213 | 0.9804 | 0.9611 |
| Fig. 6i | Savitsky-Golay Fit | 0.1765 | 0.1251 | 0.0000 | 0.4733 | 0.1334 | 0.0466 | 0.9813 | 0.9629 | 0.1024 | 0.1195 | 0.9889 | 0.9780 |
| Fig. 6j | Butterworth Filter | 0.1796 | 0.1216 | 0.0000 | 0.4810 | 0.1237 | 0.0575 | 0.9710 | 0.9428 | 0.0966 | 0.1248 | 0.9861 | 0.9723 |
| Fig. 6k | Fast Fourier Fit | 0.1764 | 0.1239 | 1.1995 × 10−3 | 0.4675 | 0.1322 | 0.0477 | 0.9812 | 0.9628 | 0.1015 | 0.1199 | 0.9891 | 0.9782 |
| Fig. 6l | Curve Fit | 0.1822 | 0.1174 | 0.0000 | 0.4809 | 0.1127 | 0.0696 | 0.9574 | 0.9166 | 0.0896 | 0.1305 | 0.9774 | 0.9553 |
| Fig. 6m | Gelkern Fit | 0.1806 | 0.1199 | 0.0000 | 0.4806 | 0.1179 | 0.0632 | 0.9636 | 0.9286 | 0.0930 | 0.1272 | 0.9810 | 0.9624 |
| Fig. 6n | Lokern Fit | 0.1808 | 0.1205 | 0.0000 | 0.4818 | 0.1184 | 0.0629 | 0.9632 | 0.9278 | 0.0934 | 0.1272 | 0.9806 | 0.9616 |
| Fig. 6o | Lowess Fit | 0.1809 | 0.1192 | 0.0000 | 0.4797 | 0.1142 | 0.0667 | 0.9562 | 0.9142 | 0.0909 | 0.1284 | 0.9767 | 0.9539 |
The Basic Distribution Statistics are the same for the Weibull, Exponential, and Extreme Value Type-I distributions.
For all BTCs the Weibull shape parameter γ = 1, resulting in the Weibull distribution devolving to the exponential distribution.
BTC for data collected weekly.
BTC for data collected biweekly.
The CDF is then simply (NIST/SEMATECH, 2013f)
| F(t) = 1 - \exp\left(-t^{\gamma}\right), \quad t \ge 0;\ \gamma > 0 | (25) |
which translates into a PPF as (NIST/SEMATECH, 2013f)
| G(p) = \left[-\ln(1 - p)\right]^{1/\gamma}, \quad 0 \le p < 1 | (26) |
For all the BTCs depicted in Figures 4, 5, and 6, the Weibull shape parameter was found to be γ = 1. When the Weibull shape parameter γ = 1, the Weibull distribution devolves to the exponential distribution (Table 6), which is a special case of the gamma distribution (Walck, 2007, p. 69).
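A minimal sketch of how such a check might be performed with a standard statistics library is given below; the synthetic arrival-time sample is only a placeholder for BTC-derived values, and a fitted shape parameter near one indicates that the Weibull distribution effectively devolves to the exponential distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Placeholder data: exponential arrival times stand in for BTC-derived times.
arrival_times = rng.exponential(scale=0.12, size=2000)

# Fit a two-parameter Weibull (location fixed at zero for simplicity).
shape, loc, scale = stats.weibull_min.fit(arrival_times, floc=0.0)
print(f"Weibull shape (gamma) = {shape:.3f}, scale (eta) = {scale:.3f}")

# A shape parameter near 1 means the Weibull reduces to the exponential,
# which can also be fit directly for comparison.
eloc, escale = stats.expon.fit(arrival_times, floc=0.0)
print(f"Exponential scale = {escale:.3f}")
```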
5.1.2. Exponential Distribution
The exponential distribution is given by the PDF (NIST/SEMATECH, 2013f; Walck, 2007, p. 54)
| f(t) = \frac{1}{\eta} \exp\left(-\frac{t - \beta}{\eta}\right), \quad t \ge \beta;\ \eta > 0 | (27) |
and the case where β = 0 and η = 1 is called the standard exponential distribution. The equation for the standard exponential distribution is
| f(t) = \exp(-t), \quad t \ge 0 | (28) |
The formula for the CDF of the exponential distribution is (NIST/SEMATECH, 2013f; Walck, 2007, p. 54)
| F(t) = 1 - \exp(-t), \quad t \ge 0 | (29) |
which translates into a PPF as (NIST/SEMATECH, 2013f)
| G(p) = -\ln(1 - p), \quad 0 \le p < 1 | (30) |
It will be noted that Equations (28), (29), and (30) are the same as Equations (24), (25), and (26), respectively, when the Weibull shape parameter γ = 1.
5.1.3. Extreme Value Type-I Distribution
The extreme value type-I (EVTI) distribution (Table 6), also known as the Fisher-Tippett distribution (type I), the log-Weibull distribution, and the Gumbel distribution after E. J. Gumbel (1891–1966) (Walck, 2007, p. 57), is given by the PDF (NIST/SEMATECH, 2013g; Walck, 2007, p. 57)
| f(t) = \frac{1}{\eta} \exp\left[\mp\frac{t - \beta}{\eta} - \exp\left(\mp\frac{t - \beta}{\eta}\right)\right] | (31) |
where the upper sign is for the maximum and the lower sign for the minimum but only the maximum is applicable to tracer BTCs (see Figure 11 in Walck, 2007). As noted by Walck (2007) the extreme value distribution provides the limiting distribution for the largest or smallest elements of a set of independent observations from a distribution of exponential type (normal, gamma, exponential, etc.).
For the case where β = 0 and η = 1, Equation (31) reduces to the standard Gumbel (EVTI) distribution (NIST/SEMATECH, 2013g; Walck, 2007, p. 57)
| f(t) = \exp\left[\mp t - \exp(\mp t)\right] | (32) |
The CDF is then given by (NIST/SEMATECH, 2013f)
| F(t) = \exp\left[-\exp(-t)\right] \quad \text{(maximum case)} | (33) |
which translates into a PPF as (NIST/SEMATECH, 2013f)
| G(p) = -\ln\left[-\ln(p)\right], \quad 0 < p < 1 \quad \text{(maximum case)} | (34) |
5.1.4. Tracer Test Probability Plots
Figure S4 and Figure S5 show probability plots for the Weibull (exponential) and Gumbelmax (EVTI) distributions for the reduced datasets shown in Figure 4. For every corresponding plot between Figure S4 and Figure S5, the R and R2 values are higher for the Gumbelmax (EVTI) distribution relative to the Weibull (exponential) distribution. A gamma probability plot is not shown but was observed to be identical to that for the Weibull (exponential) distribution.
Similarly, Figure S6 shows probability plots for the weekly and biweekly datasets shown in Figure 5 for both the Weibull (exponential) and Gumbelmax (EVTI) distributions. Again, the R and R2 values are higher for the Gumbelmax (EVTI) distribution relative to the Weibull (exponential) distribution.
Figure S7 and Figure S8 show probability plots for the Weibull (exponential) and Gumbelmax (EVTI) distributions for the smoothed datasets shown in Figure 6. As with the nonsmoothed datasets, for every corresponding plot between Figure S7 and Figure S8, the R and R2 values are higher for the Gumbelmax (EVTI) distribution relative to the Weibull (exponential) distribution.
Figure S9 shows the Weibull (exponential) PPF plot and the Gumbelmax (EVTI) PPF plot for the full datasets for both the smoothed and nonsmoothed datasets. The horizontal axis is a probability, so it goes from zero to one, and the vertical axis goes from the smallest to the largest value of the cumulative distribution function (NIST/SEMATECH, 2013e). Calculation of the PPF for each distribution type was identical for all datasets, regardless of modification, so only one plot for each PPF is shown. Figure S9a appears as an expected exponentially increasing curve because γ = 1 for every dataset; had γ ≠ 1 for the example dataset, then Figure S9a would have more closely approximated a Weibull PPF (see third figure in NIST/SEMATECH, 2013f). Comparing Figure S9a with Figure S9b emphasizes the differences between the Weibull (exponential) PPF and the Gumbelmax PPF.
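A comparison of candidate distributions by their probability-plot correlation coefficients, in the spirit of Figures S4–S8, can be sketched as follows. The synthetic sample is a placeholder for the BTC-derived values, and the distribution with the larger probability-plot R (and R2) is the better descriptor.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
times = rng.exponential(scale=0.12, size=1000)  # placeholder sample only

for name, dist in [("exponential", stats.expon),
                   ("Gumbel-max (EVTI)", stats.gumbel_r)]:
    # probplot returns the ordered data, theoretical quantiles, and the
    # least-squares fit of the probability plot, including its correlation r.
    (osm, osr), (slope, intercept, r) = stats.probplot(times, dist=dist)
    print(f"{name}: probability-plot R = {r:.4f}, R^2 = {r * r:.4f}")
```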
5.2. Parametric Survival Analysis
Parametric functions provide estimates for distribution quantiles, the hazard time (e.g., event occurrence = tracer detection), estimates for S(t) and h(t), and smooth curves for S(t) and h(t). Although not particularly useful to tracing studies, it is worthwhile to recognize the appropriateness of such curves from a tracer-test perspective.
5.2.1. Weibull Survival and Hazard
The Weibull survival function is described by (Rodríguez, 2007)
| S(t) = 1 - F(t) = 1 - \left\{1 - \exp\left[-(t/\eta)^{\gamma}\right]\right\} | (35) |
which can be simplified to
| S(t) = \exp\left[-(t/\eta)^{\gamma}\right] | (36) |
Similarly, the Weibull hazard function is described by (Rodríguez, 2007)
| h(t) = \frac{f(t)}{S(t)} = \frac{(\gamma/\eta)\,(t/\eta)^{\gamma - 1} \exp\left[-(t/\eta)^{\gamma}\right]}{\exp\left[-(t/\eta)^{\gamma}\right]} | (37) |
which can be simplified to
| h(t) = \frac{\gamma}{\eta}\left(\frac{t}{\eta}\right)^{\gamma - 1} | (38) |
When γ = 1 Equations (37) and (38) reduce to the exponential distribution suggesting a constant probability of tracer detection over time (Rodríguez, 2007).
The cumulative hazard function is given by (NIST/SEMATECH, 2013g)
| H(t) = \int_0^t h(x)\,dx = \left(\frac{t}{\eta}\right)^{\gamma} | (39) |
As noted above for the Weibull distribution, when γ = 1 the Weibull survival and hazard functions, Equations (35) and (37), respectively, devolve to the simple exponential survival and hazard functions.
5.2.2. Exponential Survival and Hazard
The exponential survival function can be described by (NIST/SEMATECH, 2013g)
| S(t) = \exp\left(-\frac{t}{\eta}\right) | (40) |
and the exponential hazard function can be described by (NIST/SEMATECH, 2013g)
| h(t) = \frac{1}{\eta} | (41) |
which results in a constant probability of tracer detection over time. The cumulative hazard function is given by (NIST/SEMATECH, 2013g)
| H(t) = \frac{t}{\eta} | (42) |
5.2.3. Extreme Value Type-I Survival and Hazard
The extreme value type-I (EVTI) survival function can be described by (NIST/SEMATECH, 2013g)
| S(t) = 1 - F(t) = 1 - \exp\left[-\exp\left(-\frac{t - \beta}{\eta}\right)\right] | (43) |
and the EVTI hazard function can be described by (NIST/SEMATECH, 2013g)
| h(t) = \frac{\frac{1}{\eta}\exp\left(-\frac{t - \beta}{\eta}\right)\exp\left[-\exp\left(-\frac{t - \beta}{\eta}\right)\right]}{1 - \exp\left[-\exp\left(-\frac{t - \beta}{\eta}\right)\right]} | (44) |
which suggests improving probability of tracer detection with increasing time. The cumulative hazard function is then described by (NIST/SEMATECH, 2013g)
| H(t) = -\ln\left\{1 - \exp\left[-\exp\left(-\frac{t - \beta}{\eta}\right)\right]\right\} | (45) |
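The contrast between the constant exponential hazard and the increasing EVTI hazard can be illustrated numerically, as in the minimal sketch below; the location and scale values are arbitrary placeholders rather than the fitted parameters of Table 6.

```python
import numpy as np
from scipy import stats

t = np.linspace(0.01, 1.0, 200)

expon = stats.expon(scale=0.12)                # illustrative scale only
gumbel = stats.gumbel_r(loc=0.13, scale=0.10)  # illustrative location/scale only

for name, dist in [("exponential", expon), ("Gumbel-max (EVTI)", gumbel)]:
    S = dist.sf(t)                 # survival function S(t) = 1 - F(t)
    h = dist.pdf(t) / dist.sf(t)   # hazard h(t) = f(t) / S(t)
    print(name, "hazard at t = 0.1 and t = 0.9:",
          round(float(h[np.searchsorted(t, 0.1)]), 3),
          round(float(h[np.searchsorted(t, 0.9)]), 3))
```

The exponential hazard is constant at 1/η, whereas the Gumbel-max hazard grows with time, consistent with the behavior described above.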
6. Discussion and Summary
With the exception of simple qualitative tracing studies using activated carbon as a detector, typical of many tracing studies in karstic terranes, obtaining breakthrough curves (BTCs) is the ultimate goal of all tracing studies. Only a quantitative tracing study results in a BTC, which allows investigators to be reasonably assured of a successful tracing study and allows for calculating such transport parameters as mass recovery, travel time, velocity, and dispersion.
Actual BTC development, although generally appearing to be a rather simple process, involves a number of factors, such as random and systematic errors, that affect the measured data in various ways. Problems associated with the collection and application of background concentrations further affect BTC development. These factors are generally known to exist, but little if any effort is commonly taken to address them.
Application of a solute-transport model fit to a measured BTC is common, but it is rare that any consideration is given to the number of samples collected, although Field (2003) did attempt to address this consideration from a practical perspective. By applying basic solute-transport theory to the design of tracing studies, Field (2003) was able to provide an estimate of the minimum number of samples that need to be collected at an appropriate frequency such that the BTC peak (statistically known as the mode) might be adequately defined. The advent of relatively inexpensive in situ fluorometers with data loggers now obviates the need to estimate a minimum number of samples to be collected during a tracing study, but very frequent analysis and data storage can result in very large and densely packed datasets that can be difficult to interpret and assess.
Simplification of BTCs composed of very large and densely-packed datasets may be accomplished by such methods as downsampling and curve fitting. Application of downsampling or smoothing routines can assist in visualizing and understanding important BTC appearance and form, such as dual peaks (see for example, Field and Leij, 2012; Leij et al., 2012) and complex recession limbs with long tails (see for example, Field and Leij, 2014; Field and Pinsky, 2000). Developing solute-transport models that adequately represent important and complex BTC appearance and form is an ongoing endeavor of many individuals.
Utilization of additional tools, such as downsampling and smoothing, may allow for additional insights into BTC appearance and form. Some aspects of the insights that might be gained are evident in Figures 4 and 6, where potential datum outliers and additional BTC peaks are suggested. In some instances either of these two methods can allow for more insight and understanding regarding a BTC and tracer-test results. For example, from 100 h to 120 h, the recession limbs shown in Figure 4j and Figure 6o suggest some data smoothing that is not so readily evident in Figure 3a. It is even possible that such methods may allow for consideration of factors not always apparent from the full measured dataset. Multiple peaks in the BTC (compare, for example, Figure 3a with Figure 6e, where additional BTC peaks do not appear unreasonable) are one possible factor of considerable importance.
Alternatively, downsampling and curve fitting can also result in seriously flawed alterations of a BTC, and their application must be considered judiciously. For example, too much downsampling can have radical effects on BTCs, as depicted in Figure 5a, which clearly indicates an excessive loss of data, and Figure 5b, which clearly shows a substantial reduction of the BTC peak and mean concentrations. In terms of curve fitting, Figures 6h and 6i suggest serious problems with the BTC descending limb. Adding an extrapolation routine similar to the three methods developed in the Qtracer2 program by Field (2002) might alleviate the problems evident in Figures 6h and 6i but may, in turn, create additional problems.
Prior to accepting a selected downsampling or curve fitting routine it is necessary that the altered BTC be visually compared with the original BTC to ensure a certain degree of reasonableness in the altered BTC. If a curve fitting routine is to be selected, it is essential that the residual plots that may be generated from the curve fitting routine be carefully examined (see, for example, Figures 7–8 and Figures S1–S3).
From a statistical perspective, a typical right-skewed tracer BTC generally conforms to a Weibull (exponential) distribution when the Weibull shape parameter γ = 1, but it is possible for γ ≠ 1, which would make the Weibull distribution more appropriate. Interestingly, the maximum extreme value type-I (Gumbelmax) distribution was actually found to be the best descriptor of the example BTC evaluated in this study, and this could possibly always be the case for any typical right-skewed tracer BTC. This is evident from both the correlation coefficient R and the coefficient of determination R2 that were each calculated using the example BTC (Table 6) and visually from the probability plots for the Weibull (exponential) and Gumbelmax distributions (Figures S4–S8). It will be further noted that if an exponential distribution is applied then a hazard plot of the data (probability of tracer detection) is a constant, whereas when a Gumbelmax distribution is applied then a hazard plot of the data causes the probability of tracer detection to increase with increasing time.
Supplementary Material
Figure S1. Homoscedasticity plots of their respective BTC plots shown in Figure 6.
Figure S2. Plots of squared residual values versus explanatory variables (time) developed from their respective BTC plots shown in Figure 6.
Figure S3. Plots of squared residual values versus fitted values developed from their respective BTC plots shown in Figure 6.
Figure S4. Weibull (exponential) probability plots for the BTC plots developed using varying downsampled amounts of the measured data.
Figure S5. Gumbelmax (EVTI) probability plots for the BTC plots developed using varying downsampled amounts of the measured data.
Figure S6. Weibull (exponential) probability plots and Gumbelmax (EVTI) probability plots for the BTC plots developed from datasets collected weekly and biweekly.
Figure S7. Weibull (exponential) probability plots for the BTC plots developed using various smoothing routines for the measured data. (α is the smoothing parameter.)
Figure S8. Gumbelmax (EVTI) probability plots for the BTC plots developed using various smoothing routines for the measured data. (α is the smoothing parameter.)
Figure S9. Weibull (exponential) PPF probability plot and Gumbelmax (EVTI) PPF probability plot for the BTC plots developed from the full measured dataset.
Acknowledgements
The author appreciates the review and comments provided by Dr. Carol Wicks of Louisiana State University, Dr. Zargham Mohammadi of the University of Waterloo and Shiraz University, Dr. Walter Illman of the University of Waterloo, Dr. William White of Pennsylvania State University, and four anonymous reviewers. Their thoughts and comments greatly improved this manuscript. The author also thanks Dr. Graham Sander and four anonymous reviewers for their helpful comments. Photos depicted in Figures 1 and 2 courtesy of Robert Wallace.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Acronyms
- α-HCH
alpha-hexachlorocyclohexane
- β-HCH
beta-hexachlorocyclohexane
- BTC
breakthrough curve
- CDF
cumulative density function
- DDD
dichlorodiphenyldichloroethane
- 2,4-DDD
2,4-dichlorodiphenyldichloroethane
- 4,4-DDD
4,4-dichlorodiphenyldichloroethane
- DDE
dichlorodiphenyldichloroethylene
- DDT
dichlorodiphenyltrichloroethane
- 2,4-DDT
2,4-dichlorodiphenyltrichloroethane
- 4,4-DDT
4,4-dichlorodiphenyltrichloroethane
- DDX
dichlorodiphenyltrichloroethane
- δ-HCH
delta-hexachlorocyclohexane
- EVTI
extreme value type-I
- EWMA
exponentially weighted moving average
- γ-HCH
gamma-hexachlorocyclohexane
- HCH
hexachlorocyclohexane
- LOESS
locally estimated scatterplot smoothing
- LOWESS
locally weighted scatterplot smoothing
- MA
moving average
- MM
moving median
- PDF
probability density function
- PPF
percent point function
Notation
- A and
two possible experimental outcomes [ ]
- C
measured concentration [M L−3]
- CB
background tracer concentration [M L−3]
mean background tracer concentration for sample collected prior to tracer release [M L−3]
- Cε
measurement error impacting measured tracer concentration [M L−3]
- CS
desired (sought) tracer concentration affected by an unknown background concentrations [M L−3]
desired (sought) tracer concentration affected by the mean of the background concentration [M L−3]
- CΥ
true tracer concentration [M L−3]
- d
Durbin-Watson Test Statistic [ ]
- D1
1D Kolmogorov-Smirnov Test Statistic [ ]
- D2
2D Kolmogorov-Smirnov Test Statistic [ ]
- E
experiment [ ]
- F
cumulative density function [ ]
- G
inverse cumulative density function [ ]
- k
number of samples with detectable tracer concentrations [ ]
- kB
number of background samples with detectable tracer concentrations [ ]
- kS
number of signal samples with detectable tracer concentrations [ ]
- n
number [ ]
- nB
number of background events [ ]
- nS
number of signal events [ ]
- NSE
Nash-Sutcliffe Efficiency [ ]
- Pr
probability [ ]
- pk
probability of k trials [ ]
- PBIAS
percent bias [ ]
- qn−k
probability of n − k trials [ ]
- R
Pearson’s Correlation Coefficient [ ]
- R2
coefficient of determination [ ]
- RMSE
root mean square error [ ]
- RS
Spearman’s Rank-Order Correlation Coefficient [ ]
- Sm
Smoothed curve fit to a measured data set [ ]
- t
time [T]
- tp
time period for a smoothing operation [T]
- Wij
a weighting value in a smoothing operation [ ]
- W
Shapiro-Wilk test-statistic [ ]
Greek
- α
a smoothing constant [ ]
- β
statistical distribution location parameter [ ]
- γ
statistical distribution shape parameter [ ]
- η
statistical distribution scale parameter [ ]
- λ±
confidence limits [ ]
- λ+
upper confidence level [ ]
- μ
mean of the distribution [ ]
- σ
standard deviation of the distribution [ ]
kernel smoothing normalizing constant [ ]
- τ
Kendall’s Rank-Order Correlation Coefficient [ ]
- χ2
Chi-Square Test Statistic [ ]
- ψ
kernel smoothing tuning constant [ ]
Footnotes
Declaration of Competing Interest
The author declares that he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Disclaimer: The views expressed in this paper are solely those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency. Mention of trade names does not constitute endorsement. Declarations of interest: none.
References
- Alexander SC, 2005. Spectral deconvolution and quantification of natural organic material and fluorescent tracer dyes, in: Beck BF (Ed.), Sinkholes and the Engineering and Environmental Impacts of Karst. Proceedings of the 10th Multidisciplinary Conference, American Society of Civil Engineers, San Antonio, Texas. pp. 441–448. 10.1061/40796(177)47. [DOI] [Google Scholar]
- Amrhein V, Greenland S, McShane B, 2019. Scientists rise up against statistical significance. Nature 567, 305–307. 10.1038/d41586-019-00857-9. [DOI] [PubMed] [Google Scholar]
- Arnow T, 1963. Ground-Water Geology of Bexar County, Texas. Geological Survey Water-Supply Paper 1588. U.S. Geological Survey. Washington, D.C. https://pubs.usgs.gov/wsp/1588/report.pdf,. [Google Scholar]
- Bai C, Li Y, 2014. Time series analysis of contaminant transport in the subsurface: Applications to conservative tracer and engineered nanomaterials. Journal of Contaminant Hydrology 164, 153–162. 10.1016/j.jconhyd.2014.06.002. [DOI] [PubMed] [Google Scholar]
- Bailly-Comte V, Durepaire X, Batiot-Guilhe C, Schnegg PA, 2018. In situ monitoring of tracer tests: how to distinguish tracer recovery from natural background. Hydrogeology Journal 26, 2057–2069. Doi: 10.1007/s10040-018-1748-8. [DOI] [Google Scholar]
- van den Bogert T, 1996. Practical Guide to Data Smoothing and Filtering. Published online. https://isbweb.org/software/sigproc/bogert/filter.pdf.
- Brandt S, 1998. Data Analysis: Statistical and Computational Methods for Scientists and Engineers. 3rd ed., Springer. [Google Scholar]
- Brezinski DK, 2013. Geologic and Karst Features Map of the Hagerstown Quadrangle, Washington County, Maryland. Map. http://www.mgs.md.gov/geology/hagerstown.html.
- Briggs WM, 2019a. Beyond Traditional Probabilistic Methods in Economics. Springer Nature, Cham, Switzerland. chapter Everything Wrong with P-Values Under One Roof. pp. 22–44. https://wmbriggs.com/public/Briggs.EverthingWrongWithPvalues.pdf. [Google Scholar]
- Briggs WM, 2019b. Death Blow To Statistical Significance! — Bonus: Here’s The Replacement. Published online. https://wmbriggs.com/post/26701/.
- Briggs WM, 2019c. Reality-based probability & statistics: solving the evidential crisis. Asian Journal of Economics and Banking 3, 37–80. https://wmbriggs.com/public/Briggs.Reality.Based.Prob.Stats.pdf. [Google Scholar]
- Briggs WM, 2019d. Stop Using P-values & Parameter-Centric Methods. Published online. https://wmbriggs.com/post/27051/.
- Briggs WM, 2019e. Using P-Values To Diagnose “Trends” Is Invalid. Published online. https://wmbriggs.com/post/27088/.
- Brown TL, 2009. Fluorescence Characterization of Karst Aquifers in East Tennessee. Master’s thesis. University of Tennessee. Knoxville, Tenn. [Google Scholar]
- Cleveland WS, 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74, 829–836. 10.1080/01621459.1979.10481038. [DOI] [Google Scholar]
- Cleveland WS, Devlin SJ, 1988. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association 83, 596–610. 10.2307/2683591. [DOI] [Google Scholar]
- Cook RD, 1994. On the interpretation of regression plots. Journal of the American Statistical Association 89, 177–189. 10.1080/01621459.1994.10476459. [DOI] [Google Scholar]
- Costa V, 2017. Fundamentals of Statistical Hydrology. Springer International Publishing, Cham, Switzerland. chapter Correlation and Regression. pp. 391–440. [Google Scholar]
- Duigon MT, 2001. Karst Hydrogeology of the Hagerstown Valley, Maryland. Report of Investigations No. 73. Maryland Geological Survey. Baltimore, Md. [Google Scholar]
- Duigon MT, 2009. Phase 2 Study of the Area Contributing Groundwater to the Spring Supplying the A.M. Powell State Fish Hatchery, Washington County, Maryland. Technical Report Open-File Report 2008-02-18. Maryland Geological Survey. Baltimore, Md. http://www.mgs.md.gov/reports/0FR_08-02-18.pdf. [Google Scholar]
- Field MS, 1992–93. Karst hydrology and chemical contamination. Journal of Environmental Systems 22, 1–26. [Google Scholar]
- Field MS, 2002. The Qtracer2 Program for Tracer-Breakthrough Curve Analysis for Hydrological Tracer Tests. Technical Report EPA/600/R-02/001 and EPA/600/CR-02/001. U.S. Environmental Protection Agency. Washington, D.C. http://cfpub.epa.gov/ncea/cfm/recordisplay.cfm?deid=54930 [accessed April 9, 2019]. [Google Scholar]
- Field MS, 2003. Tracer-Test Planning Using the Efficient Hydrologic Tracer-Test Design EHTD Program. Technical Report EPA/600/R-03/034 and EPA/600/CR-03/034. U.S. Environmental Protection Agency. Washington, D.C., 175 p. [Google Scholar]
- Field MS, 2011. Application of robust statistical methods to background tracer data characterized by outliers and left-censored data. Water Research 45, 3107–3118. Doi: 10.1016/j.watres.2011.03.018. [DOI] [PubMed] [Google Scholar]
- Field MS, 2017. Tracer-Test Results for the Central Chemical Superfund Site, Hagerstown, Md. May 2014 – December 2015. Technical Report EPA/600/R-17/032. U.S. Environmental Protection Agency. Washington, D.C. https://edg.epa.gov/metadata/catalog/main/home.page;jsessionid=0E8386D9DF2E39A5A59A94AE806001C4. [Google Scholar]
- Field MS, Leij FJ, 2012. Solute transport in solution conduits exhibiting multi-peaked breakthrough curves. Journal of Hydrology 440–441, 26–35. 10.1016/j.jhydrol.2012.03.018. [DOI] [Google Scholar]
- Field MS, Leij FJ, 2014. Combined physical and chemical nonequilibrium transport model for solution conduits. Journal of Contaminant Hydrology 157, 37–46. 10.1016/j.jconhyd.2013.11.001. [DOI] [PubMed] [Google Scholar]
- Field MS, Pinsky PF, 2000. A two-region nonequilibrium model for solute transport in solution conduits in karstic aquifers. Journal of Contaminant Hydrology 44, 329–351. Doi: 10.1016/S0169-7722(00)00099-1. [DOI] [Google Scholar]
- Fountain AG, 1993. Geometry and flow conditions of subglacial water at South Cascade Glacier, Washington State, U.S.A.; an analysis of tracer injections. Journal of Glaciology 39, 143–156. 10.1017/S0022143000015793. [DOI] [Google Scholar]
- Franze R, Slifer D, 1971. Caves of Maryland. Educational Series No. 3. Maryland Geological Survey. Baltimore, Md. http://www.mgs.md.gov/output/reports/ES/ES_3.pdf. [Google Scholar]
- Fyffe CL, 2013. 3.4.3. Tracer Investigations. online ed.. British Society for Geomorphology. pp. 1–8.
- Fyffe CL, Brock BW, Kirkbride MP, Mair DWF, Diotri F, 2012. The hydrology of a debris-covered glacier, the Miage Glacier, Italy, in: BHS Eleventh National Symposium, Hydrology for a changing world, British Hydrological Society, Dundee. pp. 1–5.
- Gentle JE, 2009. Computational Statistics. Springer, New York. [Google Scholar]
- Guest PG, 2012. Numerical Methods of Curve Fitting. Cambridge University Press, New York. [Google Scholar]
- Gulley JD, Walthard P, Martin J, Banwell AF, Benn DI, Catania G, 2012. Conduit roughness and dye-trace breakthrough curves: why slow velocity and high dispersivity may not reflect flow in distributed systems. Journal of Glaciology 58, 915–925. 10.3189/2012JoG11J115. [DOI] [Google Scholar]
- Hansen SK, Berkowitz B, 2014. Interpretation and nonuniqueness of ctrw transition distributions: Insights from an alternative solute transport formulation. Advances in Water Resources 74, 54–63. 10.1016/j.advwatres.2014.07.011. [DOI] [Google Scholar]
- Hansen SK, Haslauer CP, Cirpka OA, Vesselinov VV, 2018. Direct breakthrough curve prediction from statistics of heterogeneous conductivity fields. Water Resources Research 54, 271–285. 10.1002/2017WR020450. [DOI] [Google Scholar]
- Jagadish HV, Gehrke J, Labrinidis A, Papakonstantinou Y, Patel JM, Ramakrishnan R, Shahabi AC, 2014. Big data and its technical challenges. Communications of the ACM 57, 86–94. 10.1145/2611567. [DOI] [Google Scholar]
- Karacan CO, 2008. Evaluation of the relative importance of coalbed reservoir parameters for prediction of methane inflow rates during mining of longwall development entries. Computers & Geosciences 34, 1093–1114. 10.1016/j.cageo.2007.04.008. [DOI] [Google Scholar]
- Kartsonaki C, 2016. Survival analysis. Diagnostic Histopathology 22, 263–270. 10.1016/j.mpdhp.2016.06.005. [DOI] [Google Scholar]
- Kozar MD, McCoy KJ, Weary DJ, Field MS, Pierce HJ, Schill WB, Young JA, 2007. Hydrogeology and Water Quality of the Leetown Area, West Virginia. Technical Report Open-File Report 2007–1358. U.S. Geological Survey. Reston, Va. http://pubs.usgs.gov/of/2007/1358. [Google Scholar]
- Leij FJ, Toride N, Field MS, Sciortino A, 2012. Solute transport in dual-permeability porous media. Water Resources Research 48, 1–13. 10.1029/2011WR011502. [DOI] [Google Scholar]
- Means J, 2010. Roadside Geology of Maryland, Delaware, and Washington, D.C. Mountain Press Publ. Co., Missoula, Mont. [Google Scholar]
- Mull DS, Liebermann TD, Smoot JL, Woosley LH Jr., 1988. Application of Dye-Tracing Techniques for Determining Solute-Transport Characteristics of Ground Water in Karst Terranes. Technical Report EPA904/6–88-001. U.S. Environmental Protection Agency. Atlanta, Ga. http://karstwaters.org/wp-content/uploads/2015/04/dye-tracing.pdf [accessed June 10, 2013]. [Google Scholar]
- NIST/SEMATECH, 2013a. e-Handbook of Statistical Methods. Engineering Statistics Handbook 7.1.6. What are outliers in the data?. National Institute of Standards and Technology. http://www.itl.nist.gov/div898/handbook/ [accessed April 24, 2019]. [Google Scholar]
- NIST/SEMATECH, 2013b. e-Handbook of Statistical Methods. Engineering Statistics Handbook 4.1.4.4. LOESS (aka LOWESS). National Institute of Standards and Technology.https://www.itl.nist.gov/div898//handbook/pmd/section1/pmd144.htm [accessed December 12, 2018],. [Google Scholar]
- NIST/SEMATECH, 2013c. e-Handbook of Statistical Methods. Engineering Statistics Handbook 6.4.3.1. Single Exponential Smoothing. National Institute of Standards and Technology. http://www.itl.nist.gov/div898/handbook/ [accessed December 12, 2018],. [Google Scholar]
- NIST/SEMATECH, 2013d. e-Handbook of Statistical Methods. Engineering Statistics Handbook 1.3.6.1. What is a Probability Distribution. National Institute of Standards and Technology. http://www.itl.nist.gov/div898/handbook/ [accessed December 12, 2018]. [Google Scholar]
- NIST/SEMATECH, 2013e. e-Handbook of Statistical Methods. Engineering Statistics Handbook 1.3.6.2. Related Distributions. National Institute of Standards and Technology. http://www.itl.nist.gov/div898/handbook/ [accessed April 19, 2019],. [Google Scholar]
- NIST/SEMATECH, 2013f. e-Handbook of Statistical Methods. Engineering Statistics Handbook 1.3.6.6.8. Weibull Distribution. National Institute of Standards and Technology. http://www.itl.nist.gov/div898/handbook/ [accessed December 12, 2018]. [Google Scholar]
- NIST/SEMATECH, 2013g. e-Handbook of Statistical Methods. Engineering Statistics Handbook 1.3.6.6.16. Extreme Value Type I Distribution. National Institute of Standards and Technology. http://www.itl.nist.gov/div898/handbook/ [accessed December 12, 2018]. [Google Scholar]
- Press WH, Teukolsky SA, Vetterling WT, Flannery BP, 1997a. Numerical Recipes in Fortran 77: The Art of Scientific Computing. volume 1. Second ed., Cambridge University Press, New York. [Google Scholar]
- Press WH, Teukolsky SA, Vetterling WT, Flannery BP, 1997b. Numerical Recipes in Fortran 90. volume 2. Second ed., Cambridge University Press, New York. [Google Scholar]
- Ptak T, Piepenbrink M, Martac E, 2004. Tracer tests for the investigation of heterogeneous porous media and stochastic modelling of flow and transport — a review of some recent developments. Journal of Hydrology 294, 122–163. 10.1016/j.jhydrol.2004.01.020. [DOI] [Google Scholar]
- Quinlan JF, Davies GJ, Worthington SRH, 1993. Review of ground-water quality monitoring network design. Journal of Hydraulic Engineering 119, 1436–1441. https://ascelibrary.org/doi/pdf/10.1061/%28ASCE%290733-9429%281993%29119%3A12%281436%29. [Google Scholar]
- Quinlan JF, Ewers RO, 1985. Ground water flow in limestone terranes: Strategy rationale and procedure for reliable, efficient monitoring of ground water quality in karst areas, in: National Symposium and Exposition on Aquifer Restoration and Ground Water Monitoring, 5th, National Water Well Association, Worthington, Ohio. pp. 197–234. [Google Scholar]
- Rodríguez G, 2007. Survival Models, in: Lecture Notes on Generalized Linear Models. Princeton University. http://data.princeton.edu/wws509/notes/ [accessed March 19, 2019]. [Google Scholar]
- Rodríguez G, 2019. Interpolation and Graduation: Smoothing and Non-Parametric Regression, in: Demographic Methods. Princeton University. https://data.princeton.edu/eco572/ [accessed March 20, 2019]. [Google Scholar]
- Schmidt MF Jr., 1993. Maryland’s Geology. Tidewater Publ., Centreville, Md. [Google Scholar]
- Siegel DI, Hinchey EJ, 2019. Big data and the curse of scale. Groundwater 57, 505. 10.1111/gwat.12905. [DOI] [PubMed] [Google Scholar]
- Statistics Solutions, 2019. Homoscedasticity. https://www.statisticssolutions.com/homoscedasticity/ [accessed April 2, 2019]. [Google Scholar]
- Toride N, Leij FJ, van Genuchten MT, 1993. A comprehensive set of analytical solutions for nonequilibrium solute transport with first-order decay and zero-order production. Water Resources Research 29, 2167–2182. Doi: 10.1029/93WR00496. [DOI] [Google Scholar]
- Toride N, Leij FJ, van Genuchten MT, 1995. The CXTFIT Code for Estimating Transport Parameters from the Laboratory or Field Tracer Experiments; Version 2.0. Technical Report Research Report 137. U.S. Salinity Laboratory. Riverside, Calif. https://www.ars.usda.gov/arsuserfiles/20360500/pdf_pubs/P1444.pdf. [Google Scholar]
- Tsai CL, Cai Z, Wu X, 1998. The examination of residual plots. Statistica Sinica 8, 445–465. [Google Scholar]
- Tsang CF, 1993. Flow and Contaminant Transport in Fractured Rock. Academic Press, Inc., San Diego. chapter Tracer Transport in Fracture Systems. pp. 237–266. [Google Scholar]
- Walck C, 2007. Hand-Book on Statistical Distributions for Experimentalists. Internal Report SUF-PFY/96–01. Particle Physics Group. Fysikum, University of Stockholm. http://www.stat.rice.edu/dobelman/textfiles/DistributionsHandbook.pdf [accessed December 12, 2018]. [Google Scholar]
- Wasserstein RL, Schirm AL, Lazar NA, 2019. Moving to a world beyond “p < 0.05”. The American Statistician 73, 1–19. 10.1080/00031305.2019.1583913. [DOI] [Google Scholar]
- WFI, 2018a. Digital filter, in: Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. https://en.wikipedia.org/wiki/Digital_filter [accessed November 16, 2018]. [Google Scholar]
- WFI, 2018b. Downsampling (signal processing), in: Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. https://en.wikipedia.org/wiki/Downsampling_(signal_processing) [accessed November 29, 2018]. [Google Scholar]
- WFI, 2018c. Filter design, in: Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. https://en.wikipedia.org/wiki/Filter_design [accessed November 28, 2018]. [Google Scholar]
- WFI, 2018d. Local regression, in: Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. https://en.wikipedia.org/wiki/Local_regression [accessed December 14, 2018]. [Google Scholar]
- WFI, 2018e. Moving average, in: Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. https://en.wikipedia.org/wiki/Moving_average [accessed December 12, 2018]. [Google Scholar]
- WFI, 2019a. Big data, in: Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. https://en.wikipedia.org/wiki/Big_data [accessed July 29, 2019]. [Google Scholar]
- WFI, 2019b. Probability distribution, in: Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. https://en.wikipedia.org/wiki/Probability_distribution [accessed March 12, 2019]. [Google Scholar]
- WFI, 2019c. Quantile function, in: Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. https://en.wikipedia.org/wiki/Digital_filter [accessed April 19, 2019]. [Google Scholar]
- Worthington SRH, Smart CC, Ruland WW, 2002. Assessment of Groundwater Velocities to the Municipal Wells at Walkerton, in: Stolle D, Piggott AR, Crowder J. (Eds.), Ground and Water: Theory to Practice, Southern Ontario Section of the Canadian Geotechnical Society, Ontario, Canada. pp. 1081–1086. [Google Scholar]
- Xiong L, Wang G, Wessel P, 2016. Anti-aliasing filters for deriving high-accuracy DEMs from TLS data: A case study from Freeport, Texas. Computers & Geosciences 100, 125–134. Doi: 10.1016/j.cageo.2016.11.006. [DOI] [Google Scholar]
- Zhou W, Beck BF, Pettit AJ, Stephenson BJ, 2002. A groundwater tracing investigation as an aid of locating groundwater monitoring stations on the Mitchell Plain of southern Indiana. Environmental Geology 41, 842–851. 10.1007/s00254-001-0464-0. [DOI] [Google Scholar]