Abstract
In order to understand fish biology and reproduction it is important to know the fecundity patterns of individual fish, as frequently established by recording the output of mixed-sex groups of fish in a laboratory setting. However, for understanding individual reproductive health and modeling purposes it is important to estimate individual fecundity from group fecundity. We created a multistage method that disaggregates group level data into estimates for individual-level clutch size and spawning interval distributions. The first stage of the method develops estimates of the daily spawning probability of fish. Daily spawning probabilities are then used to calculate the log-likelihood of candidate distributions of clutch size. Selecting the best candidate distribution for clutch size allows for a Monte Carlo resampling of annotations of the original data which state how many fish spawned on which day. We verify this disaggregation technique by combining data from fathead minnow pairs, and checking that the disaggregation method reproduced the original clutch sizes and spawning intervals. This method will allow scientists to estimate individual clutch size and spawning interval distributions from group spawning data without specialized or elaborate experimental designs.
Keywords: disaggregation, maximum likelihood, deconvolution, inverse problems
1 Introduction
Understanding individual-level fish fecundity is important for ecological questions involving aquatic ecosystems. However, for both logistic and biological reasons, fecundity data from controlled laboratory studies often are collected from groups of spawning fish [2,8,14,13]. From these groups, scientists can obtain estimates about the group as a whole and estimates of the average daily female fecundity, but cannot easily characterize the distribution of individual reproductive behaviors.
The ideal scenario would be an experimental design that allowed researchers to determine which eggs in a tank were produced by which female fish. While there are experimental approaches that might allow this to be achieved using a group spawning design, constraints on cost and complexity can limit their use. For example, Lange et al. [8] worked with a species (Rutilus rutilus) that only spawns in groups; they were able to utilize genotyping of spawns to determine parentage of the fish, but this approach is difficult to do routinely. As an alternative, we propose a computational method which takes a daily record of egg production from a group spawning study, and disaggregates the data, thereby producing an inferred distribution of clutch sizes.
Our method involves two main steps. First, the probability of a female spawning on a given day is determined by the proportion of tanks in which no spawning occurred. Second, we search parameterized clutch size distributions for a distribution which maximizes the likelihood of the observed data under the assumption of independence between spawning events.
This work is in some ways similar to work that focuses on deconvolution. Deconvolution generally seeks to determine a function g from an observed signal h = g ∗ f (where g ∗ f is the convolution of g and f). Although the term is overloaded, self-deconvolution seeks to find a function g such that the observed function h is the convolution of g with itself, i.e. h = g ∗ g. Indeed, if exactly two females spawned on each day, then disaggregating fish clutch sizes would be exactly the self-deconvolution of the observed histogram of clutch sizes, a problem in which even the simplest case requires some care [10]. In contrast, we consider a scenario where a variable number of female fish, labeled 1, 2, …, n, contribute to the observed spawns, such that the problem could be stated as h = α1 g + α2 g ∗ g + α3 g ∗ g ∗ g + … + αn g^{∗n}, where g^{∗n} denotes the n-fold convolution of g with itself and α1, α2, …, αn are unknown parameters. Moreover, our disaggregation goal is further complicated by the fact that only discrete spawning events are observed, implying that the observed signal h is fundamentally noisy. Indeed, as in some deconvolution approaches [12], the presence of noise serves as a primary motivation for adopting likelihood methods.
Following an estimation of the clutch size distribution, a number of other useful statistics can be estimated as well. For applications which need to estimate individual spawning intervals, we demonstrate how the estimated clutch size distribution can be used to estimate the number of females that contributed to the number of eggs spawned on a given day and subsequently, the number of days between spawns. Thus, two outputs of our disaggregation method are a clutch size distribution and a spawning interval distribution.
One possible use of individual-level clutch size distributions is as input into models of oocyte growth and dynamics [9]. Originally applied to populations of fathead minnows (Pimephales promelas), the Oocyte Growth Dynamics Model (OGDM) predicts fish fecundity in response to changes in plasma vitellogenin (egg yolk protein) concentrations as affected by exposure to environmental endocrine-disrupting chemicals. However, adapting the OGDM to other species can only produce accurate estimates of the effects of endocrine disruptors if individual-level estimates of spawning intervals and clutch size distributions are available for unexposed fish.
2 Disaggregating daily spawns
As in Table 1, suppose there exist daily records for the total number of eggs found in each of m tanks, where D(i, j) is the number of eggs found in tank i on day j. While D(i, j) gives the number of eggs found in a tank, it is unknown whether this number represents one clutch from a single female, or whether several of the n female fish in the tank each contributed a clutch of their own. There are thus two interrelated questions:
1. On a given day, how many females spawned?
2. When a female spawns, how many eggs is she likely to lay (what is her clutch size)?
These two questions are intimately related. For example, it is possible that most spawns involve only a single fish, in which case individual clutch sizes will be large, but it is also possible that some spawns involve many of the n fish, in which case clutch sizes will be a fraction of the size.
Table 1.
Example format of input data from m separate tanks where each tank holds n females and the total daily egg count is recorded on each of the d days.
| days | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ⋯ | d |
|---|---|---|---|---|---|---|---|---|---|---|
| tank 1 | 0 | 0 | 0 | 0 | 34 | 0 | 0 | 16 | ⋯ | 0 |
| tank 2 | 57 | 0 | 21 | 0 | 0 | 0 | 8 | 0 | ⋯ | 64 |
| tank 3 | 9 | 0 | 0 | 15 | 0 | 87 | 0 | 0 | ⋯ | 3 |
| ⋮ | | | | | | | | | ⋱ | |
| tank m | 0 | 13 | 3 | 0 | 46 | 0 | 0 | 0 | ⋯ | 0 |
In order to address these two questions we make two simplifying assumptions. First, we assume that the probability that a fish spawns on day j is the same for all fish and equals p(j). Second, we assume that there is a single clutch size distribution g across fish and days, such that the probability that a fish lays a clutch of size x equals g(x). For some species, the assumption that all fish exhibit the same clutch size distribution may be a strong one; however, disaggregation is likely impossible without such an assumption, or at least a very similar one.
In the following sections we develop a method that addresses this inverse problem, as illustrated in the flowchart in Figure 1.
Fig. 1.

Starting from data D(i, j) we first estimate the daily spawning probabilities, then find a maximum likelihood estimate for the clutch size distribution and finally use both of those to estimate the distribution of spawning intervals.
2.1 Daily spawning probability
To get a first-order estimate of the probability that an individual fish spawns on a given day j, we assume that fish spawn independently of one another, with day-dependent probability p(j). Based upon this assumption, the number of fish that spawn on the same day and in the same tank follows a binomial distribution. In particular, for a tank with n fish, the probability that r fish spawn is:

$$B\bigl(r, p(j)\bigr) = \binom{n}{r}\, p(j)^{r}\, \bigl(1 - p(j)\bigr)^{n-r} \tag{1}$$

Since the probability that no female spawns in a tank on day j is (1 − p(j))^n, we estimate that if, on day j, λ of the m tanks had no eggs, then

$$p(j) = 1 - \left(\frac{\lambda + 1}{m + 2}\right)^{1/n},$$

where we add 1 to the numerator and 2 to the denominator to regularize the probabilities and ensure that p(j) is neither 0 nor 1. Using this estimate of the daily spawning probability, we can estimate the number of females that spawn on any given day. Clearly, this approach will have problems when the number of tanks m is small, n is large, or p is close to 1, but as we document later, it works surprisingly well in regimes similar to empirical data. Allowing for a per-day probability accommodates species of fish that spawn periodically, or whose spawning varies over the experimental period [4]. In contrast, for experiments with only a few tanks or a single tank, there may not be enough data to allow the spawning probability to vary with time, and instead a single value p should be estimated, where p(j) = p for all j.
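The estimate described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it reads the regularized estimate as (1 − p(j))^n ≈ (λ + 1)/(m + 2), and the function names are hypothetical. The pooled variant, which treats every tank-day as one observation when a single p is wanted, is likewise an assumption, since the text does not say how the single value should be computed.

```python
def daily_spawning_probability(D, n):
    """Estimate p(j) for each day from the share of tanks with no eggs.

    D is a list of m rows (one per tank), each with d daily egg counts;
    n is the number of females per tank.
    """
    m = len(D)
    d = len(D[0])
    p = []
    for j in range(d):
        lam = sum(1 for i in range(m) if D[i][j] == 0)  # tanks without eggs
        q0 = (lam + 1) / (m + 2)  # regularized estimate of (1 - p(j))**n
        p.append(1.0 - q0 ** (1.0 / n))
    return p


def pooled_spawning_probability(D, n):
    """Single p for all days, pooling all tank-days (an assumed variant)."""
    cells = [x for row in D for x in row]
    lam = sum(1 for x in cells if x == 0)
    q0 = (lam + 1) / (len(cells) + 2)
    return 1.0 - q0 ** (1.0 / n)
```

For example, with m = 3 tanks of n = 4 females and one tank without eggs on a given day, the regularized proportion is 2/5, so the estimate is 1 − 0.4^(1/4), roughly 0.205.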
2.2 Maximum likelihood clutch sizes
After estimating the daily spawning probabilities, we find a distribution for clutch size probabilities that maximizes the likelihood of the observed data. For a given clutch size distribution g, let g_r be the corresponding distribution of the total number of eggs if r fish (1 ≤ r ≤ n) spawn separately and their clutches are subsequently grouped together. Notice that g_r is simply the convolution of g_{r−1} with g, i.e. g_r = g_{r−1} ∗ g, with g_1 = g.
Given g, the probability of observing D(i, j) eggs on day j in tank i is the sum over r of the probability that r females spawn on day j, B(r, p(j)), multiplied by the probability that r fish together produce D(i, j) eggs, g_r(D(i, j)):

$$P\bigl(D(i,j) \mid g, p\bigr) = \sum_{r=0}^{n} B\bigl(r, p(j)\bigr)\, g_r\bigl(D(i,j)\bigr), \tag{2}$$

where the r = 0 term covers tank-days without eggs, with g_0 taken to place all of its mass on zero eggs.
Thus the likelihood L of observing all the data is equal to the product:
$$L = \prod_{i=1}^{m} \prod_{j=1}^{d} P\bigl(D(i,j) \mid g, p\bigr) \tag{3}$$
Since likelihoods are typically extraordinarily small, a standard practice is to use the corresponding log-likelihood:
$$\log L = \sum_{i=1}^{m} \sum_{j=1}^{d} \log P\bigl(D(i,j) \mid g, p\bigr) \tag{4}$$
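As an illustration of how the log-likelihood in equation 4 might be evaluated, the following sketch represents g as a list in which entry x is the probability of a clutch of x eggs, builds the grouped distributions by repeated convolution, and sums the log-probabilities over all tank-days. The function names, the convention that zero spawners produce zero eggs with probability one, and the small floor that guards against log(0) are assumptions made for this example.

```python
import math


def convolve(a, b):
    """Discrete convolution of two probability lists indexed by egg count."""
    out = [0.0] * (len(a) + len(b) - 1)
    for x, ax in enumerate(a):
        if ax == 0.0:
            continue
        for y, by in enumerate(b):
            out[x + y] += ax * by
    return out


def convolution_powers(g, n):
    """Return [g_0, g_1, ..., g_n]; g_0 is a point mass at zero eggs."""
    powers = [[1.0]]
    for _ in range(n):
        powers.append(convolve(powers[-1], g))
    return powers


def binomial(r, n, p):
    """Probability that r of n independent fish spawn, each with probability p."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)


def log_likelihood(D, g, p, n, floor=1e-300):
    """Sum over tanks and days of the log-probability of each egg count."""
    powers = convolution_powers(g, n)
    total = 0.0
    for row in D:
        for j, eggs in enumerate(row):
            prob = 0.0
            for r in range(n + 1):
                g_r = powers[r]
                if eggs < len(g_r):
                    prob += binomial(r, n, p[j]) * g_r[eggs]
            total += math.log(max(prob, floor))  # floor avoids log(0)
    return total
```

With two fish, p = 0.5, and clutches of one or two eggs equally likely, a count of two eggs has probability 0.5 · 0.5 + 0.25 · 0.25 = 0.3125, which is the value the sketch assigns to that tank-day.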
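Utilizing the value for p(j) determined in Section 2.1, we thus search for a distribution g that maximizes this log-likelihood, equation 4. In order to perform this search we parameterize g as a mixture of three Gaussians with parameters θ = (w_k, μ_k, σ_k), k = 1, 2, 3:

$$g(x; \theta) = \sum_{k=1}^{3} w_k\, \mathcal{N}\bigl(x; \mu_k, \sigma_k^{2}\bigr)$$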
We searched for the optimal θ using MATLAB’s implementation of pattern search with 50 independent starts. In the discussion, we address attempts to optimize equation 4 simultaneously for both p and g using pattern search.
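The search itself can be illustrated with a short, self-contained sketch. It is not MATLAB's pattern search: `compass_search` below is a simplified coordinate-polling variant of the same family of derivative-free methods, written only to show the idea of polling around the current point and shrinking the step when no poll improves. The parameter layout in `mixture_pmf`, the upper clutch size of 299, and all function names are hypothetical choices for this example.

```python
import math
import random


def mixture_pmf(theta, max_eggs=299):
    """Discretize a three-Gaussian mixture onto clutch sizes 1..max_eggs.

    theta = (w1, w2, w3, mu1, mu2, mu3, s1, s2, s3) is a hypothetical layout;
    absolute values keep weights and widths positive during the search.
    """
    w, mu, s = theta[0:3], theta[3:6], theta[6:9]
    pmf = [0.0] * (max_eggs + 1)  # pmf[0] stays 0: a clutch has at least one egg
    for x in range(1, max_eggs + 1):
        for wk, mk, sk in zip(w, mu, s):
            sk = max(abs(sk), 1e-6)
            pmf[x] += abs(wk) / sk * math.exp(-0.5 * ((x - mk) / sk) ** 2)
    total = sum(pmf)
    if total == 0.0:  # degenerate theta: fall back to a uniform distribution
        return [0.0] + [1.0 / max_eggs] * max_eggs
    return [v / total for v in pmf]


def compass_search(f, x0, step=1.0, tol=1e-6, max_iter=10000):
    """Minimize f by polling +/- step along each coordinate.

    The step is halved whenever a full round of polls yields no improvement.
    """
    x = list(x0)
    fx = f(x)
    it = 0
    while step > tol and it < max_iter:
        improved = False
        for k in range(len(x)):
            for delta in (step, -step):
                trial = x[:]
                trial[k] += delta
                ft = f(trial)
                if ft < fx:
                    x, fx, improved = trial, ft, True
        if not improved:
            step *= 0.5
        it += 1
    return x, fx


def multistart(f, sample_start, starts=50, seed=0, **kwargs):
    """Run compass_search from several random starts and keep the best result."""
    rng = random.Random(seed)
    best = None
    for _ in range(starts):
        x, fx = compass_search(f, sample_start(rng), **kwargs)
        if best is None or fx < best[1]:
            best = (x, fx)
    return best
```

In this setting the objective passed to `multistart` would be the negative of the log-likelihood in equation 4, evaluated at `mixture_pmf(theta)`, with `sample_start` drawing an initial θ at random.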
2.3 The number of spawners per day
The estimate of clutch size distribution g allows us to address the question of how many fish spawned on a given day. Indeed, given g, the probability that r fish spawned in tank i on day j is:
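$$P\bigl(r \mid D(i,j)\bigr) = \frac{B\bigl(r, p(j)\bigr)\, g_r\bigl(D(i,j)\bigr)}{\sum_{s=1}^{n} B\bigl(s, p(j)\bigr)\, g_s\bigl(D(i,j)\bigr)}, \qquad D(i,j) > 0 \tag{5}$$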
Sampling r according to the probability P(r | D(i, j)) allows us to annotate each nonzero entry in D with an estimate of the number of fish that participated in the spawning, as exemplified in Table 2. Once each D(i, j) is annotated with an estimated number of spawning fish, we use this annotation and our independence assumption to estimate the time interval between spawns of an individual fish. Repeating this procedure via Monte Carlo sampling creates a more accurate estimate of fish spawning intervals.
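The annotation step can be sketched as follows. The example assumes g is a list indexed by egg count, weights each candidate number of spawners r by the binomial probability of r spawners times the probability that r clutches sum to the observed count, and then samples r from the normalized weights. Function names are hypothetical, and the helper that builds the grouped distributions is repeated here so that the block runs on its own.

```python
import math
import random


def convolution_powers(g, n):
    """Return [g_0, ..., g_n], where g_r is the r-fold convolution of g."""
    powers = [[1.0]]
    for _ in range(n):
        prev = powers[-1]
        out = [0.0] * (len(prev) + len(g) - 1)
        for x, ax in enumerate(prev):
            for y, by in enumerate(g):
                out[x + y] += ax * by
        powers.append(out)
    return powers


def spawner_posterior(eggs, powers, p_j, n):
    """Probabilities that r = 1..n fish produced a nonzero egg count."""
    weights = []
    for r in range(1, n + 1):
        g_r = powers[r]
        like = g_r[eggs] if eggs < len(g_r) else 0.0
        prior = math.comb(n, r) * p_j ** r * (1 - p_j) ** (n - r)
        weights.append(prior * like)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("egg count has zero probability under g")
    return [w / total for w in weights]


def annotate(D, g, p, n, rng):
    """One Monte Carlo draw of the number of spawners for every entry of D."""
    powers = convolution_powers(g, n)
    annotated = []
    for row in D:
        counts = []
        for j, eggs in enumerate(row):
            if eggs == 0:
                counts.append(0)  # no eggs means no spawners
            else:
                post = spawner_posterior(eggs, powers, p[j], n)
                counts.append(rng.choices(range(1, n + 1), weights=post)[0])
        annotated.append(counts)
    return annotated
```

Calling `annotate` repeatedly with the same `random.Random` instance yields the independent annotations over which spawning intervals can then be averaged.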
Table 2.
A clutch size distribution implies probabilities for the number of fish that spawned in a given tank on a given day. Sampling from this probability allows us to estimate the number of females that spawned for each entry in D.
| days | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ⋯ | d |
|---|---|---|---|---|---|---|---|---|---|---|
| tank i | 64 | 0 | 0 | 0 | 34 | 0 | 0 | 16 | ⋯ | 0 |
| estimated # of fish | 3 | | | | 2 | | | 1 | | |
| tank i+1 | 0 | 124 | 3 | 0 | 0 | 0 | 57 | 2 | ⋯ | 33 |
| estimated # of fish | | 4 | 1 | | | | 2 | 1 | | 2 |
3 Results for simulated input
Testing the accuracy of the disaggregation method was done by creating aggregated training input from different trial distributions of individual clutch size, disaggregating them and then comparing the disaggregated result to the original trial distributions. In particular, we investigated the method’s ability to disaggregate depending on the underlying individual clutch size distribution, and the number of females in a tank. Simulated training input was created by assuming each tank held n_t fish; each fish had an IID 21% probability of spawning each day, and each fish had a clutch size drawn from the same distribution of clutch sizes (g).
Because fish clutch sizes are the result of complex reproductive processes there is not an obvious choice of prior clutch size distributions. Thus, we investigated the ability to disaggregate clutch sizes from a variety of differently shaped trial distributions. We simulated input using clutch sizes of the following forms, each constrained to positive values below 300:
- exponential distribution; a distribution without an interior mode, we used one with mean μ = 55.
- log normal distribution; previously used to parameterize fish clutch sizes [9], we used mean μ ≈ 84.24 and standard deviation σ ≈ 43.01.
- Gaussian distribution; with mean μ = 150 and standard deviation σ = 45.
- double Gaussian; used to evaluate the method when fish heterogeneity results in multimodal distributions, composed of two equally weighted Gaussians with means μ1 = 35 and μ2 = 165 and both with standard deviation σ = 30.
- quadruple Gaussian; the sum of four Gaussians, used specifically to evaluate the failure of the disaggregation method.
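The simulated input described above can be generated with a short sketch like the following. It assumes integer clutch sizes and enforces the constraint to positive values below 300 by rejection sampling; the text does not say how the constraint was applied, so that choice, like the function names, is an assumption of this example.

```python
import random


def truncated_gaussian_clutch(rng, mu=150.0, sigma=45.0, upper=300):
    """Draw an integer clutch size in 1..upper-1 by rejection sampling."""
    while True:
        x = round(rng.gauss(mu, sigma))
        if 0 < x < upper:
            return x


def simulate_tanks(n_tanks, n_days, n_fish, p_spawn, draw_clutch, rng):
    """Simulate daily egg totals for tanks of independently spawning fish.

    Each fish spawns on each day with probability p_spawn; the clutch sizes
    of all fish that spawn in a tank on a day are summed into one count.
    """
    D = []
    for _ in range(n_tanks):
        row = []
        for _ in range(n_days):
            eggs = 0
            for _ in range(n_fish):
                if rng.random() < p_spawn:
                    eggs += draw_clutch(rng)
            row.append(eggs)
        D.append(row)
    return D
```

A call such as `simulate_tanks(20, 50, 4, 0.21, truncated_gaussian_clutch, random.Random(0))` produces one table in the layout of Table 1 for the Gaussian trial distribution with four females per tank.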
A qualitative visual analysis of the output (Figures 2 and 3) shows that disaggregation recovers much of the shape of the original trial distributions. As expected and shown in Figure 2, tanks with larger numbers of females (n_t = 10) are more difficult to disaggregate than tanks with fewer fish (n_t = 4). Indeed, as n_t increases, using days with no spawns to estimate the daily spawning probabilities becomes less accurate. Thus, for much larger tanks it may be worthwhile to consider more robust mechanisms of estimating spawning probabilities. However, this method performed well for study designs using three or four females.
Fig. 2.

As illustrated by disaggregating simulated input (with 20 tanks for 50 days), disaggregated clutch size distributions overestimate the number of large clutches when tanks hold four female fish, but are otherwise able to match much of the original Gaussian, as displayed by either the PDF (upper left) or the CDF (upper right). As expected, disaggregating 10 fish is more difficult, though the results are still a large qualitative improvement over the aggregated input (bottom left and right).
Fig. 3.

As illustrated by disaggregating simulated input (20 tanks for 50 days), disaggregating increasingly complex clutch size distributions is increasingly difficult, yet yields large qualitative improvements even for multi-modal input.
As seen in Figure 3, input generated from more complicated, multimodal distributions was harder to disaggregate. For instance, the quality of the fit decreases with the number of modes. The difference between the CDF of the four-Gaussian distribution and its disaggregated fit is entirely predictable, as the disaggregation technique was implemented with a parameterization limited to the sum of at most three Gaussians. Of course, a different choice of parameterization could extend this method to any number of modes, but an increasing number of modes will likely be associated with decreased accuracy. Notably though, while the fit of the four-mode simulated input is not ideal, it represents a graceful rather than catastrophic failure.
In order to quantitatively compare the quality of the disaggregation, we cross-validated the method, comparing the log likelihood of simulated input reserved for testing under the estimated disaggregated values of g and p with its log likelihood under the true distributions used to create the simulated input. Specifically, for each of the aforementioned distributions a second table of simulated input was created so that the estimated disaggregated results could be tested with input that the method had never seen. As seen in the ‘test input’ column of Table 3, the log likelihood of the test input under the estimated distribution g and values p is very similar to the likelihood under the true trial distributions. Thus it does not appear that the disaggregation method has significant problems with over-fitting.
Table 3.
This table lists the per day per tank differences between the log likelihood of the simulated input under the true distributions and under the estimated disaggregated distributions, where negative numbers represent a better fit from the disaggregated distributions. The difference in log likelihood between the estimated disaggregation distributions and the true distributions on test input unseen by the disaggregation method is listed in the columns ‘test input’. To compare the effect of fitting the spawning probability p with that of fitting the clutch sizes g, we show the differences in log likelihood of the training input between the true distribution and both the fitted p and g (‘fitted’) and the fitted g paired with the true spawning probability of 21% (‘21%’). These estimates were done for tank sizes of 4 and 10, on data consisting of 50 days with 20 tanks and of 200 days with 40 tanks.
| Distribution | Fish in tank | 50 days, 20 tanks: test input | 50 days, 20 tanks: fitted p(j) | 50 days, 20 tanks: p(j) = 21% | 200 days, 40 tanks: test input | 200 days, 40 tanks: fitted p(j) | 200 days, 40 tanks: p(j) = 21% |
|---|---|---|---|---|---|---|---|
| Gaussian | 4 | −0.0483 | −0.0841 | −0.0010 | 0.0027 | −0.0077 | −0.0001 |
| Gaussian | 10 | 0.0909 | 0.0841 | 0.0114 | 0.0740 | 0.0726 | 0.0026 |
| Lognormal | 4 | 0.1277 | 0.1058 | 0.0243 | −0.0354 | −0.0476 | 0.0240 |
| Lognormal | 10 | 0.0060 | −0.0105 | 0.0192 | −0.0004 | −0.0018 | 0.0167 |
| Exponential | 4 | 0.1658 | 0.1345 | −0.0029 | 0.0817 | 0.0722 | −0.0005 |
| Exponential | 10 | 0.0304 | 0.0238 | 0.0004 | 0.0270 | 0.0236 | −0.0005 |
| Two Modes | 4 | 0.0575 | 0.0334 | 0.0083 | 0.0690 | 0.0571 | 0.0075 |
| Two Modes | 10 | 0.0596 | 0.0540 | −0.0001 | 0.0222 | 0.0164 | 0.0030 |
| Four Modes | 4 | −0.0359 | −0.0396 | 0.0614 | 0.0315 | 0.0168 | 0.0761 |
| Four Modes | 10 | 0.0014 | 0.0298 | 0.0375 | 0.0128 | 0.0164 | 0.0363 |
In order to compare the accuracy of the estimation of spawning probabilities p versus the clutch size distribution g, we also computed the difference in log likelihoods of the training input between the true trial distributions and the estimated g, both with the estimated values of p and with the true value p(j) = 0.21. As seen in the Table 3 columns ‘fitted’ and ‘21%’ respectively, there is not a large difference in likelihoods between the fitted values of p and the true value. This further supports the notion that estimating p based on the daily proportion of tanks without spawns performs adequately.
Using the same simulated input, the method’s ability to find spawning intervals was also tested, as seen in Figure 4. These results suggest that as long as the estimated clutch size distribution is reasonable, the estimated spawning intervals will match well too. However, as tanks include more fish, it becomes harder for the method to reproduce the true trial distribution.
Fig. 4.

The previous fits for p and g are used to estimate the spawning intervals.
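One way the spawning intervals might be derived from an annotated table is sketched below. The text states only that the annotation and the independence assumption are used, so the specific scheme here is an assumption: on each day the sampled number of spawners is assigned to fish chosen uniformly at random without replacement, and the interval is the number of days since that fish last spawned. All names are hypothetical.

```python
import random
from collections import Counter


def sample_intervals(spawners_per_day, n, rng):
    """One random assignment of spawners to fish for a single tank.

    spawners_per_day[j] is the sampled number of fish that spawned on day j.
    Returns the list of day gaps between consecutive spawns of each fish.
    """
    last_spawn = [None] * n
    intervals = []
    for day, r in enumerate(spawners_per_day):
        for fish in rng.sample(range(n), r):  # r distinct fish spawn today
            if last_spawn[fish] is not None:
                intervals.append(day - last_spawn[fish])
            last_spawn[fish] = day
    return intervals


def interval_distribution(annotated, n, draws=1000, seed=0):
    """Pool intervals over tanks and repeated random assignments."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(draws):
        for row in annotated:
            counts.update(sample_intervals(row, n, rng))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: v / total for k, v in sorted(counts.items())}
```

In a full Monte Carlo run the annotated table itself would also be resampled between draws, so that uncertainty in the number of spawners carries through to the interval distribution.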
Additional numerical experiments were performed where likelihood equation 4 was optimized with regard to both p and g simultaneously. This approach also yields reasonable results, though the added parameter complexity increases the run time of the estimation and makes regularizing or controlling estimates of daily spawning probabilities more opaque.
4 Results for fathead minnow data
Next, the method was tested using fathead minnow data obtained by combining controls from multiple 21-day studies testing the effects of different endocrine-disrupting chemicals on male-female pairs [1,3,5,7,15,16]. As in the earlier evaluations of simulated input, the paired data were first manually aggregated into groups, to simulate data that would be obtained from a tank of multiple fish. The aggregated data were then disaggregated and the output compared to the original paired data. This allows for the method to work with real clutch sizes while still having a basis for accurate comparison. Results from this are shown in Figure 5. Since the data are coming from different studies, and the characteristics (e.g., age, size) of the fish varied, these data may be slightly less uniform than an ideal scenario and thus the disaggregation method may be subject to slightly more noise.
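The aggregation of paired data into pseudo-tanks amounts to summing the daily egg counts of several pairs. A minimal sketch follows; it assumes that all pair records cover the same number of days, groups consecutive records, and drops any leftover pairs that do not fill a group. How pairs were actually assigned to groups is not stated, so this grouping rule and the function name are assumptions.

```python
def aggregate_pairs(pairs, group_size=4):
    """Sum daily egg counts of consecutive pair records into pseudo-tanks.

    pairs is a list of per-pair daily egg counts, all of the same length.
    Leftover pairs that do not fill a complete group are dropped.
    """
    groups = []
    for start in range(0, len(pairs) - group_size + 1, group_size):
        block = pairs[start:start + group_size]
        groups.append([sum(day) for day in zip(*block)])
    return groups
```

Each resulting row then plays the role of one tank holding `group_size` females, while the original pair records provide the true clutch sizes for comparison.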
Fig. 5.

Results from the disaggregation method when used on real fish data. The PDF panel shows a histogram of the actual clutch sizes from the original paired fathead minnow data together with the distribution output by the method. The CDF panel also includes the fathead minnow data after aggregation into groups of 4.
Nonetheless, results indicate that the method is able to accurately estimate underlying distributions of the clutch size. This is most clearly seen from the PDF shown in Figure 5, as the estimated distribution g closely resembles the original histogram of paired fathead minnow clutch sizes. That the estimated distribution is a better estimator than the aggregated input is clear by comparing their CDFs, also shown in Figure 5.
5 Conclusion
Results show that our disaggregation method can successfully estimate underlying clutch size distributions as well as spawning intervals under experimental conditions. A limitation of the disaggregation method is the estimation of p, the probability of a fish spawning. When the number of females is too large, the number of tanks is too small, or p is close to 1, using only the proportion of tanks with no spawns is an inefficient estimator of p. In this case we suggest optimizing the log likelihood (equation 4) over both p and g simultaneously. However, for scenarios similar to the experimental settings we are familiar with, estimating p separately from g performs well. Furthermore, estimating p first is slightly more computationally efficient and easier to directly regularize or manipulate.
In other settings it may make more sense to consider fewer degrees of freedom for the spawning probability p, and instead introduce a parameterized effective tank population function h, which could account for a subset of the females in a tank being dormant non-spawners during the experimental period. In such a case, the modified likelihood function would become:
| (6) |
Additionally, since in some species spawning is influenced by the lunar cycle, it may be worthwhile to directly incorporate external influences into the likelihood model or into the parameterization of p, g, or both.
Finally, while this work was performed specifically with the disaggregation of recorded fish fecundity data, the fundamentals are not specific to fish or aquatic systems in general and thus these techniques may apply to disaggregation problems in other settings, biological or not.
Acknowledgments
The authors would like to thank Drs. Diane Nacci, Bryan Clark, and Denise Champlin from the U.S. Environmental Protection Agency, as well as Dr. Thijs Bosker from Leiden University, for initiating conversations leading to the development of this method.
References
- 1. Ankley GT, Jensen KM, Durhan EJ, Makynen EA, Butterworth BC, Kahl MD, Villeneuve DL, Linnum A, Gray LE, Cardon M, et al. Effects of two fungicides with multiple modes of action on reproductive endocrine function in the fathead minnow (Pimephales promelas). Toxicological Sciences. 2005;86(2):300–308. doi: 10.1093/toxsci/kfi202.
- 2. Ankley GT, Jensen KM, Kahl MD, Korte JJ, Makynen EA. Description and evaluation of a short-term reproduction test with the fathead minnow (Pimephales promelas). Environmental Toxicology and Chemistry. 2001;20(6):1276–1290.
- 3. Ankley GT, Kuehl DW, Kahl MD, Jensen KM, Linnum A, Leino RL, Villeneuve DA. Reproductive and developmental toxicity and bioconcentration of perfluorooctanesulfonate in a partial life-cycle test with the fathead minnow (Pimephales promelas). Environmental Toxicology and Chemistry. 2005;24(9):2316–2324. doi: 10.1897/04-634r.1.
- 4. Bosker T, Munkittrick KR, Nacci DE, MacLatchy DL. Laboratory spawning patterns of mummichogs, Fundulus heteroclitus (Cyprinodontiformes: Fundulidae). Copeia. 2013;2013(3):527–538.
- 5. Jensen KM, Makynen EA, Kahl MD, Ankley GT. Effects of the feedlot contaminant 17α-trenbolone on reproductive endocrinology of the fathead minnow. Environmental Science & Technology. 2006;40(9):3112–3117. doi: 10.1021/es052174s.
- 6. Kauppinen JK, Moffatt DJ, Mantsch HH, Cameron DG. Fourier self-deconvolution: a method for resolving intrinsically overlapped bands. Applied Spectroscopy. 1981;35(3):271–276.
- 7. LaLone CA, Villeneuve DL, Cavallin JE, Kahl MD, Durhan EJ, Makynen EA, Jensen KM, Stevens KE, Severson MN, Blanksma CA, et al. Cross-species sensitivity to a novel androgen receptor agonist of potential environmental concern, spironolactone. Environmental Toxicology and Chemistry. 2013;32(11):2528–2541. doi: 10.1002/etc.2330.
- 8. Lange A, Paull GC, Hamilton PB, Iguchi T, Tyler CR. Implications of persistent exposure to treated wastewater effluent for breeding in wild roach (Rutilus rutilus) populations. Environmental Science & Technology. 2011;45(4):1673–1679. doi: 10.1021/es103232q.
- 9. Li Z, Villeneuve DL, Jensen KM, Ankley GT, Watanabe KH. A computational model for asynchronous oocyte growth dynamics in a batch-spawning fish. Canadian Journal of Fisheries and Aquatic Sciences. 2011;68(9):1528–1538.
- 10. Martinez V. Global methods in the inversion of a self-convolution. Journal of Electron Spectroscopy and Related Phenomena. 1979;17(1):33–43.
- 11. Maudsley A. Spectral lineshape determination by self-deconvolution. Journal of Magnetic Resonance, Series B. 1995;106(1):47–57. doi: 10.1006/jmrb.1995.1007.
- 12. Mendel JM. Maximum-likelihood deconvolution: a journey into model-based signal processing. Springer Science & Business Media; 2012.
- 13. Park JW, Rinchard J, Liu F, Anderson TA, Kendall RJ, Theodorakis CW. The thyroid endocrine disruptor perchlorate affects reproduction, growth, and survival of mosquitofish. Ecotoxicology and Environmental Safety. 2006;63(3):343–352. doi: 10.1016/j.ecoenv.2005.04.002.
- 14. Paull GC, Filby AL, Giddins HG, Coe TS, Hamilton PB, Tyler CR. Dominance hierarchies in zebrafish (Danio rerio) and their relationship with reproductive success. Zebrafish. 2010;7(1):109–117. doi: 10.1089/zeb.2009.0618.
- 15. Villeneuve DL, Blake LS, Brodin JD, Cavallin JE, Durhan EJ, Jensen KM, Kahl MD, Makynen EA, Martinović D, Mueller ND, et al. Effects of a 3β-hydroxysteroid dehydrogenase inhibitor, trilostane, on the fathead minnow reproductive axis. Toxicological Sciences. 2008;104(1):113–123. doi: 10.1093/toxsci/kfn073.
- 16. Villeneuve DL, Murphy MB, Kahl MD, Jensen KM, Butterworth BC, Makynen EA, Durhan EJ, Linnum A, Leino RL, Curtis LR, et al. Evaluation of the methoxytriazine herbicide prometon using a short-term fathead minnow reproduction test and a suite of in vitro bioassays. Environmental Toxicology and Chemistry. 2006;25(8):2143–2153. doi: 10.1897/05-604r.1.
