Learning to localize sounds in a highly reverberant environment: Machine-learning tracking of dolphin whistle-like sounds in a pool

Sean F Woodward; Diana Reiss; Marcelo O Magnasco

doi:10.1371/journal.pone.0235155

PLoS One. 2020 Jun 25;15(6):e0235155. doi: 10.1371/journal.pone.0235155



Editor: Haru Matsumoto³

PMCID: PMC7316258 PMID: 32584861

Abstract

Tracking the origin of propagating wave signals in an environment with complex reflective surfaces is, in its full generality, a nearly intractable problem which has engendered multiple domain-specific literatures. We posit that, if the environment and sensor geometries are fixed, machine learning algorithms can “learn” the acoustical geometry of the environment and accurately track signal origin. In this paper, we propose the first machine-learning-based approach to identifying the source locations of semi-stationary, tonal, dolphin-whistle-like sounds in a highly reverberant space, specifically a half-cylindrical dolphin pool. Our algorithm works by supplying a learning network with an overabundance of location “clues”, which are then selected under supervised training for their ability to discriminate source location in this particular environment. More specifically, we deliver estimated time-difference-of-arrivals (TDOA’s) and normalized cross-correlation values computed from pairs of hydrophone signals to a random forest model for high-feature-volume classification and feature selection, and subsequently deliver the selected features into linear discriminant analysis, linear and quadratic Support Vector Machine (SVM), and Gaussian process models. Based on data from 14 sound source locations and 16 hydrophones, our classification models yielded perfect accuracy at predicting novel sound source locations. Our regression models yielded better accuracy than the established Steered-Response Power (SRP) method when all training data were used, and comparable accuracy along the pool surface when deprived of training data at testing sites; our methods additionally boast improved computation time and the potential for superior localization accuracy in all dimensions with more training data. Because of the generality of our method we argue it may be useful in a much wider variety of contexts.

Introduction

Principled methods to track the spatial origin of sounds using a microphone array rely on differences in the time of arrival and the intensity of the sound as registered across array elements. This is subject to a central assumption: that the direct, straight-line-path arrival of the sound at each element can be computationally identified and isolated from the raw waveforms recorded in the array. Acoustically-reflective surfaces create echoes which arrive at the destination through different paths; the pattern of differences in time of arrival and intensities may make the echoes appear to originate from locations other than the source, just like optical reflections in a mirror appear to originate in a place other than the source, called a virtual image, as shown in Fig 1. When sound propagates through multiple reflections before being damped, we have a reverberant environment, where it may be impossible to computationally distinguish the straight-line-path incidence from the echoes. Particularly for tonal sounds, the amplitude envelope may be distorted differentially across different array elements due to the echoes arriving in a different temporal order at different array elements, as shown in Fig 1B. At this point, principled methods are unable to track. This problem has also been studied in the radio wave propagation literature, where it goes by the name of “multipath”, being particularly relevant to GPS and cell signal propagation in cities with dense high-rises.

Fig 1. (A) Sound originates from a source and arrives at two sensors, either directly (red paths) or after bouncing off walls (green and blue). The sound bouncing off the walls arrives at the sensors at angles and delay times matching what would have happened if the sound had originated at a “virtual image”, located at the mirror reflection of the source on the corresponding wall. (B) The arrival times of the sounds at each sensor; because the source is slightly to the left, the direct incidence (red) arrives just slightly earlier at sensor 1; because the left image (green) is to the left, that echo arrives much earlier at sensor 1 than at sensor 2; and because the right image (blue) is to the right, that echo arrives much earlier at sensor 2 than at sensor 1. If the sounds were purely impulsive then the first incidence would always be clear, but matching later echoes requires computation. However, if the sounds are longer than the typical delay between echoes, then the recorded waveform contains overlapping pieces, and any particular method of computing arrival-time differences (e.g. by cross-correlation) may give erroneous results due to the echoes arriving in a different temporal order at each sensor.

In this paper we consider the case of tracking tonal dolphin whistles in an aquarium pool with extensive reverberation. Underwater sound propagation has unique characteristics which distinguish it from sound propagation in air: sound travels roughly five times faster in water than in air, it is subject to refraction at thermoclines and haloclines, it has longer propagation distance at many frequencies, and the water-air interface presents a consistent overhead reflective and scattering surface. On the other hand, the 1.5 km/s speed of sound, together with the lower speed of swimming animals compared with, say, flying animals, conspires to make Doppler shift a lesser concern.

As we shall discuss in far more detail below, dolphins produce an array of different vocalization types, both impulsive and tonal. The tonal sounds that dolphins produce, termed whistles, are generally in the 4-25 kHz range [1] and are hard to track in reverberant environments because multiple echoes may arrive at any given hydrophone within the duration of the sound itself, and these echoes may arrive in a different order at different hydrophones, making time-delay tracking unviable, as shown in Fig 1.

However, for a situation where a fixed hydrophone array has been placed in a fixed environment, we could imagine that the geometric idiosyncrasies of the environment could be mapped out. Particularly with several hydrophones, one can build a lot of information which in principle is redundant, but which in a particular environment may have vastly variable degrees of reliability. It should be in principle possible to choose the right combination that permits localization in this particular environment; the question is what is the best way to do so.

Our proposal is in principle simple. First, we created a “training dataset”, by placing an emitter in a number of known locations within the aquarium pool, and playing back through the emitter a repertoire of computer-generated sounds approximating dolphin whistles; over 1700 of these sounds were recorded using our 16-hydrophone array deployed at the boundary of the pool. The original signal that was played through the emitter is, of course, never again used. The 16-track, 2-second recording of each signal and its echoes is a datum in our dataset, and is associated with a label denoting the source location. Second, for each datum we compute an overabundance of “clues”; for example, we use several distinct cross-correlation methods to find time difference estimates Δt between every pair of channels, of which there are of course 16*15/2 = 120 pairs.
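The pair count quoted above can be checked with a short enumeration. This is only an illustrative sketch: the labeling of hydrophones by (array, sensor) index and the `feature_count` helper are assumptions introduced here, not part of the original analysis code.

```python
from itertools import combinations

N_ARRAYS = 4            # four hydrophone panels (assumed labels 1..4)
SENSORS_PER_ARRAY = 4   # four hydrophones on each panel

# Label each hydrophone by (array index, sensor index); 16 labels in total.
hydrophones = [(a, s)
               for a in range(1, N_ARRAYS + 1)
               for s in range(1, SENSORS_PER_ARRAY + 1)]

# Every unordered pair of distinct hydrophones: 16 * 15 / 2 = 120.
pairs = list(combinations(hydrophones, 2))

# Pairs within one panel versus pairs spanning two different panels.
within_array = [p for p in pairs if p[0][0] == p[1][0]]
across_arrays = [p for p in pairs if p[0][0] != p[1][0]]


def feature_count(n_estimators):
    """Time-difference clues obtained if n_estimators methods are each run on every pair."""
    return n_estimators * len(pairs)
```

Of the 120 pairs, 24 lie within a single panel and 96 span two panels, so running several estimators on every pair quickly yields hundreds of candidate clues per recording.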

What is the logic in doing so? To calculate the 3D location of the source by triangulation, a minimum of 4 reliable Δt are required, since in addition to the (x, y, z) coordinates we need the time at which the sound originated at the source. In order to use more data than 4 Δts we can set up an overdetermined system and solve it in the least squares sense. In a non-reverberant environment this is useful since Δt estimates have some imprecision or temporal jitter; obviously, the smaller the jitter, the smaller the Δx imprecision in the estimation of the sound source coordinates, but also this imprecision is lowered if there are more Δt available. However, in a reverberant environment, many of those clues will be wrong, because any given estimator of Δt may try to match a first arrival in one channel to an echo in another channel, yielding a Δt estimate which is, we emphasize, not inaccurate but incorrect. Due to the position of the hydrophones with respect to the pool walls, and the way the reverberation echoes arrive at those particular locations, it may be the case that some comparisons are often incorrect for many sounds. Inputting such potentially misleading data into a least squares algorithm risks corrupting the whole estimate. By computing a large number of different estimates we make sure we have some that are often correct. Learning is then a matter of sorting the wheat from the chaff.
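The triangulation argument can be sketched in symbols, under the idealized assumption of direct-path propagation at a constant sound speed $c$. The notation is introduced here for illustration only: $\mathbf{h}_i$ is the position of hydrophone $i$, $\mathbf{x}$ the source position, $t_0$ the unknown emission time, and $\widehat{\Delta t}_{ij}$ an estimated time difference.

$$t_i = t_0 + \frac{\lVert \mathbf{x} - \mathbf{h}_i \rVert}{c}, \qquad \Delta t_{ij}(\mathbf{x}) = t_i - t_j = \frac{\lVert \mathbf{x} - \mathbf{h}_i \rVert - \lVert \mathbf{x} - \mathbf{h}_j \rVert}{c}$$

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \sum_{(i,j)} \left( \widehat{\Delta t}_{ij} - \Delta t_{ij}(\mathbf{x}) \right)^{2}$$

Because each residual enters the sum squared, a single $\widehat{\Delta t}_{ij}$ that matches a direct arrival to an echo can dominate the objective and drag $\hat{\mathbf{x}}$ away from the true source, which is the corruption described above.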

Third, after computing our overabundance of estimates, we use a supervised learning method to train a classifier to predict the label from these many clues. Any regularization that promotes sparseness will rapidly discard the cross-comparisons that are often misleading. In doing so, the machine learning method is, de facto, learning how to track the sounds in this particular acoustical environment. If the environment or the hydrophone position were changed, the good clues would change. We shall show below that this strategy performs rather well.

Fourth and finally, once the system has learned to classify, running the system on novel data only entails computing those specific clues that were selected by learning and running the classifier forward. This is computationally a rather lightweight operation in comparison with training, and it can be performed online in real time.

The rest of this paper is organized as follows. In Background and Significance we review the motivation to study these particular bioacoustical signals and give a very brief review of the extensive work that has been done in the field of marine mammal acoustic localization. In Materials and Methods we describe our hydrophone array design and layout, the creation of the sounds used, playback and acquisition methods, features, and machine learning methodology. We then present our Results, discuss them in Discussion, and finally speculate on generalizations in Conclusions and Outlook.

Background and significance

Dolphin communication research is in an active period of growth. Many researchers expect to find significant communicative capacity in dolphins given their complex social structure [1–3], advanced cognition including the capacity for mirror self-recognition [4], culturally transmitted tool-use and other behaviors [5], varied and adaptive foraging strategies [6], and their capacity for metacognition [7]. Moreover, given dolphins’ well-studied acoustic sensitivity and echolocation ability [8–10], some researchers have speculated that dolphin vocal communication might share properties with human languages [11–13]. However, there is an insufficiency of work in this area to make substantive comparisons.

Among most dolphin species, a particular narrowband class of call, termed the whistle, has been identified as socially important. In particular, for the common bottlenose dolphin, Tursiops truncatus—arguably the focal species of most dolphin cognitive and communication research—research has focused on signature whistles, individually distinctive whistles [14–16] that may convey an individual’s identity to conspecifics [15, 17] and that can be mimicked, potentially to gain conspecifics’ attention [18].

Signature whistle studies aside, most studies of bottlenose dolphin calls concern group-wide repertoires of whistles and other, pulse-form call types [19–23]; there is a paucity of studies that seek to examine individual repertoires of non-signature whistles or the phenomenon of non-signature acoustic exchanges among dolphins. Regarding the latter, difficulties with whistle source attribution at best allow for sparse sampling of exchanges [17, 24]. Nevertheless, such studies constitute a logical prerequisite to an understanding of the communicative potential of whistles.

The scarcity of such studies can be explained in part by limitations in available Passive Acoustic Monitoring (PAM) analyses, in particular for the source attribution of whistles to freely-interacting members of a social group. A version of the sound attribution problem is encountered in many areas of passive acoustic research, often representing a necessary step to making source comparisons. We provide a brief overview of sound source attribution in the context of marine mammalogy, with a focus on the present application.

Sound source attribution is closely related to sound source separation, the separation of audio data composed of sounds (such as calls) from many sources into individual, source-specific tracks, which is a common task in marine mammalogy. Sound separation might be accomplished by recognizing sound features (durations, repetitions, frequency components, frequency component modulations, etc.) that are distinctive to sources and that imply separations. Examples include the separation of dolphin signature whistles from non-signature whistles based on differences in repetition characteristics [24], and the separation of calls belonging to different marine mammal species [25] or different intra-species regional groups, such as groups of sperm whales [26], based on differences in call spectrographic characteristics. However, if sound sources are not known to generate inherently distinct sounds, as is the case when focusing on bottlenose non-signature whistles, sound separation might be performed on the basis of identifying their distinctive source locations. Often this is achieved by obtaining explicit source locations or coordinates for the sounds of interest, which is called sound source localization.

At this point, we distinguish sound source separation from sound source attribution. In marine mammalogy, the former often carries the connotation that sound sources are separated exclusively on the basis of audio data. When this is done using sound source localization, source attribution is often implied by the sustained large spacings among oceanic individuals/groups [27] or by an emphasis on regions as sources rather than individuals/groups, as in abundance surveys [25, 28, 29]. In other words, in these cases a source’s location identifies the source. By contrast, aquarium dolphins intermix rapidly as compared with their whistle rate, such that a sound’s location does not uniquely identify the source. For this case, we refer to sound attribution as the process of not only separating a sound by localization, but individually attributing it to a particular dolphin based on the dolphin’s visual coordinates and identifiers, obtained from accompanying video feed.

Whether the end goal is sound source separation or explicit attribution, the fundamentals of sound source localization are the same. Sound source localization refers to locating the spatial origin of a sound based on an understanding of predictable phenomena that affect its waveform between its production at the source and its receipt at (hydro)phones, from which acoustic data are taken. Such phenomena include but are not necessarily limited to intensity attenuation, dispersion, and time delay. The last is the focus of many marine mammal call localization methods. Because a sound’s travel distance is the mathematical product of its travel time and the speed of sound, time delay information implies distance information about a sound source with respect to the hydrophones. Because in general we do not know the absolute travel time of a sound between source and hydrophone (given that the time of sound production is unknown), we are often interested in time-difference-of-arrivals (TDOA’s), the differences in a sound’s times of arrival (TOA’s) within pairs of hydrophones. Several methods are available for solving the optimization problem of localizing a sound based on a set of estimated TDOA’s for several hydrophones (typically four or more) [30–32], and in this paper we primarily refer to one, Spherical Interpolation. This method has been proven to be optimal if the error in TDOA estimates is Gaussian [33]. Alternatively, some sound localization methods (namely beamforming methods) are based on TDOA estimations that are not explicitly obtained [34–37], though ultimately these rely on similar techniques of sound processing as discussed below.

Much of the difficulty in sound source localization lies in obtaining reliable TDOA estimates from the waveforms of received sounds. For broadband, impulse-like sounds (characterized by fast, high-intensity, sparse onsets) in certain environments, this can be as straightforward as estimating arrival times to be the peaks in the waveforms [38]. More frequently, one seeks to leverage information from the whole waveforms. Traditionally, this can be done by using the cross-correlation or one of its close mathematical relatives. In essence, the cross-correlation computes the dot-product of two waveforms across a range of relative shifts in time. Assuming the sound waveforms received by two hydrophones are identical apart from their arrival times, their cross-correlation will reach a maximum at the relative time shift corresponding to their TDOA. This assumption of identical, shifted waveforms is valid enough to accurately estimate TDOA’s, and in turn perform sound source localization, for certain sounds (particularly impulse-type), in large and low-reverberation recording environments, and/or when localization precision can be relatively low. These conditions often co-occur and apply to oceanic survey studies seeking to achieve source separation versus attribution [39–46].
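The cross-correlation estimate described above can be sketched in a few lines of plain Python. The function names, the synthetic chirp pulse, and the 7-sample delay are illustrative assumptions, and the example deliberately uses the idealized case of identical, shifted waveforms.

```python
import math


def cross_correlation(x, y, max_lag):
    """Return {lag: sum over n of x[n] * y[n + lag]} for lags in [-max_lag, max_lag]."""
    values = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, xi in enumerate(x):
            j = i + lag
            if 0 <= j < len(y):
                total += xi * y[j]
        values[lag] = total
    return values


def estimate_tdoa(x, y, sample_rate, max_lag):
    """Estimate (arrival time at y) - (arrival time at x) in seconds from the correlation peak."""
    cc = cross_correlation(x, y, max_lag)
    best_lag = max(cc, key=cc.get)
    return best_lag / sample_rate


# Idealized check: y is an exact copy of x delayed by 7 samples (no echoes, no noise).
SAMPLE_RATE = 192000  # Hz, illustrative
pulse = [math.sin(2 * math.pi * (0.05 * n + 0.0005 * n * n))
         * math.sin(math.pi * n / 199) ** 2 for n in range(200)]
true_delay = 7
x = pulse + [0.0] * 20
y = [0.0] * true_delay + pulse + [0.0] * (20 - true_delay)
tdoa = estimate_tdoa(x, y, SAMPLE_RATE, max_lag=15)
```

In this echo-free case the peak falls exactly at the imposed delay; the difficulty described in the surrounding text arises once the two channels contain differently ordered, overlapping echoes.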

However, the assumption of identical, shifted received waveforms is never strictly correct, in large part because of the effects of multipath. Multipath refers to a sound taking multiple paths through space between its source and each hydrophone, including both the direct path as well as paths involving reflections from nearby surfaces. As a result, each hydrophone receives multiple imperfect copies of the source waveform (imagine your ears receiving echoes in an opera house), rather than just one perfect copy. In an enclosed environment (which induces substantial multipath), copies that arrive within 50 ms are termed reverberations. If the original sound was a short-duration impulse, such that a single waveform copy fully arrives in a hydrophone before the next starts, it may be possible to undo multipath effects. In general, however, the result of cross-correlating two hydrophones’ samples of a sound subject to multipath is a data series with multiple local peaks (N² peaks given that waveforms from N paths arrive in each hydrophone) and no easy way to determine the peak corresponding to the desired direct-path TDOA.

Various strategies have been proposed to make the cross-correlation robust to multipath effects [47, 48], such as by locating the correct peak in the cross-correlation of two sets of stacked waveforms [49]. More recent methods have been proposed for sperm whale click trains, using statistical approaches to reconcile multiple sets of estimated TDOA’s (including direct-path and higher-order paths) obtained for clustered impulse sounds [50, 51]. While generally promising, it remains unclear to what extent these methods generalize to longer-duration narrowband sounds, particularly in reverberant environments. These methods belong to a larger class of methods targeted at impulse-like sounds such as sperm whale clicks or to narrowband sounds produced in large non-reverberant environments (where reflections are sparse), in which direct-path and secondary-path waveforms are separable and amenable to geometric interpretation [27, 52–54]. Perhaps more pertinent to narrowband sounds in reverberant environments is a method recently proposed for transforming complex waveforms into impulse-like waveforms (impulse response functions) for easier TDOA extraction [55]; however, at present this method has not been tested for marine mammal call localization in any context.

For the task of localizing dolphin whistles in particular, standard cross-correlation methods have typically been applied, with at best modest results in the relatively irregular, low-reverberation environments where they have been evaluated [44, 56–58]. The method with the best proven ability to localize dolphin whistles in a reverberant environment to date falls under the umbrella of Steered-Response Power (SRP). In short, these methods rely on finding the spatial coordinates that best explain the received signals, under the assumption that cross-correlating (and summing) the received signals shifted by the set of TDOA’s implied by those spatial coordinates will result in a numerical maximum [59]; this is discussed further in the Tonal Localization subsection of our Materials and Methods. Thomas et al. [60] have demonstrated the use of such a method with bottlenose whistles in an enclosed environment with reasonable success (about 40% recall of caller identity).
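The SRP idea can be illustrated with a toy time-domain grid search. This is a minimal sketch under strong assumptions (two dimensions, four hypothetical sensors at the corners of a 10 m square, echo-free synthetic pulses, whole-sample delays); it is not the implementation evaluated in this paper or in [59, 60], and all names and values are invented for the example.

```python
import math
from itertools import combinations

SOUND_SPEED = 1500.0   # m/s, nominal speed of sound in water
SAMPLE_RATE = 48000.0  # Hz, illustrative


def delay_samples(source, sensor):
    """Direct-path travel time from source to sensor, rounded to whole samples."""
    return round(math.dist(source, sensor) / SOUND_SPEED * SAMPLE_RATE)


def lagged_dot(x, y, lag):
    """One cross-correlation value: the sum of x[n] * y[n + lag] over valid n."""
    return sum(x[i] * y[i + lag] for i in range(len(x)) if 0 <= i + lag < len(y))


def srp_localize(signals, sensors, candidates):
    """Return the candidate whose implied TDOAs maximize the summed pairwise correlation."""
    best_point, best_power = None, -math.inf
    for point in candidates:
        delays = [delay_samples(point, s) for s in sensors]
        power = sum(lagged_dot(signals[i], signals[j], delays[j] - delays[i])
                    for i, j in combinations(range(len(sensors)), 2))
        if power > best_power:
            best_point, best_power = point, power
    return best_point


def simulate(source, sensors, total=600, length=64):
    """Echo-free recordings: one short chirp pulse per sensor, delayed by the direct path."""
    pulse = [math.sin(2 * math.pi * (0.05 * n + 0.002 * n * n))
             * math.sin(math.pi * n / (length - 1)) ** 2 for n in range(length)]
    recordings = []
    for sensor in sensors:
        d = delay_samples(source, sensor)
        signal = [0.0] * total
        signal[d:d + length] = pulse
        recordings.append(signal)
    return recordings


sensors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]       # hypothetical layout
candidates = [(float(x), float(y)) for x in range(11) for y in range(11)]
true_source = (3.0, 7.0)
estimate = srp_localize(simulate(true_source, sensors), sensors, candidates)
```

Here the search recovers the grid point used to generate the data. In a reverberant pool each recording would also contain echoes, so the summed correlation surface acquires competing maxima, and the cost of the search grows with the number of candidate points.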

We note that an alternative solution to whistle source attribution that does not involve sound source localization is the use of sound transducers attached to the dolphins themselves [61–63]. While promising, shortfalls include the need to manually tag every member of the group under consideration, the tendency of tags to fall off, and the tags’ inherent lack of convenient means for visualizing caller behavior. Most significant to research with captive dolphins, the use of tags can conflict with best husbandry practices (e.g., due to risk of skin irritation, of ingestion) and be forbidden, as is the case at the National Aquarium. At such locations, less invasive means of sound source attribution are necessary.

In this paper, we propose methods of sound source localization suited for sound source attribution that offer good results for long-duration, whistle-like sounds in a highly reverberant marine environment, the half-cylindrical artificial dolphin pool located at the National Aquarium in Baltimore, Maryland. Our methods, taken from machine learning, are distinct from those previously discussed in that they do not approach the problem of multipath by explicitly modeling it in any way, and in that a majority of computation is done before a sound of interest is to be localized. The disadvantage of these methods is the need to prepare them by playing numerous calibration tones in a space of interest, which suits them to small enclosures.

Our raw data consist of a set of artificial, frequency-modulated, narrowband sounds (referred to as “tonals” here) that were designed to broadly reflect the scope of T. truncatus whistles. The data included the waveforms received by 16 hydrophones, separated among four 4-hydrophone-arrays surrounding the pool, after tonal play at several known source locations inside the pool. We first show that a nonlinear random forest “classification model” (i.e., a predictor of categorical group membership) succeeds at correctly choosing a played tonal’s source location from the set of all source locations. Later, we show that a linear classification model achieves similar results. Finally, we show that regression models can predict the two or three dimensional source coordinates of played tonals with dolphin-length accuracy (if not inter-dolphin-head accuracy, which approximates minimum possible source separation), depending on whether the models were built with data from source locations that were also to be predicted. The latter two models rely on fewer than 10,000 values to predict each played tonal source location. We implement an established SRP method for evaluation of our source localization regression models, obtaining favorable results along the pool surface but not in the direction of pool depth. Finally, reducing our feature set even further by building a parsimonious (minimum-feature) classification tree for the same task, we find that a minimally sufficient feature set for classification includes data from all pairs of 4-hydrophone-arrays, consistent with the data valued by a strictly geometric, TDOA-based approach to sound source localization.

Materials and methods

Hydrophone setup

All data were obtained from equipment deployed at the Dolphin Discovery exhibit of the National Aquarium in Baltimore, Maryland. The exhibit’s 33.5-m-diameter cylindrical pool is subdivided into one approximate half cylinder, termed the exhibit pool (or EP, Fig 2A), as well as three smaller holding pools, by thick concrete walls and 1.83 m x 1.30 m perforated wooden gates; all pools are acoustically linked. The acoustic data were obtained from the EP, when the seven resident dolphins were in the holding pools; their natural sounds were present in recordings.

Sound data were collected by 16 hydrophones (SQ-26-08’s from Cetacean Research Technology, with approximately flat frequency responses between 200 and 25,000 Hz) placed inside the EP. The details of our hydrophone placement were constrained in large part by National Aquarium operational concerns. Within these constraints, we chose to split the 16 hydrophones among four 4-hydrophone-arrays designed for long-term deployment, which we splayed about the EP as highlighted by the yellow circles in Fig 2A. We chose to place four hydrophones in each array primarily for redundancy, secondarily to accommodate beamforming methods of sound source localization. We chose to evenly distribute the four 4-hydrophone-arrays along the curved EP wall to approximately maximize the time-difference-of-arrivals (TDOA’s) among the four hydrophone clusters across all potential source locations inside the EP. This has been shown to be helpful to TDOA-based methods of sound source localization [64–66].

Moving from left to right in Fig 2A, we refer to the 4-hydrophone-arrays as H1, H2, H3, and H4. The functional component of each 4-hydrophone-array was a 0.3058 m x 0.4572 m x 0.0762 m acrylic panel, which contained one locally-addressed hydrophone near each corner (Fig 2B) and was cushioned against the acrylic pool wall 1.6000 m (measuring from the panel center) below the water surface. The origin of our three-dimensional Cartesian coordinate system was located at the center of H1’s acrylic panel. The coordinates of the 16 hydrophones are given in Table 1. We manually measured the coordinates of the individual hydrophones using both rulers and laser rangefinders, and demonstrated that these hydrophone coordinates accommodated localization of calibration signals (using the geometric, TDOA-based method of sound source localization termed Spherical Interpolation [31, 33, 67]) with error < 1 m. While we did not require precise coordinates for the vertical pool walls for any purpose, for visualization we approximated these walls as located on the closed bottom half of a 33.5-m-diameter circle centered at (15.7 m, 3.87 m).

Table 1. Hydrophone coordinates.

Hydrophone Number	Inter-Intra Array Address	X (m)	Y (m)	Z (m)
1	H1-S1	0.0059	-0.0760	0.1905
2	H1-S2	-0.0059	0.0760	0.1905
3	H1-S3	0.0059	-0.0760	-0.1905
4	H1-S4	-0.0059	0.0760	-0.1905
5	H2-S1	9.1521	-10.5672	0.1905
6	H2-S2	9.0089	-10.5148	0.1905
7	H2-S3	9.1521	-10.5672	-0.1905
8	H2-S4	9.0089	-10.5148	-0.1905
9	H3-S1	23.2718	-10.3148	0.1905
10	H3-S2	23.1288	-10.3672	0.1905
11	H3-S3	23.2718	-10.3148	-0.1905
12	H3-S4	23.1288	-10.3672	-0.1905
13	H4-S1	31.7886	0.2984	0.1905
14	H4-S2	31.7768	0.1464	0.1905
15	H4-S3	31.7886	0.2984	-0.1905
16	H4-S4	31.7768	0.1464	-0.1905


Refer to Fig 2 for Inter-Intra Array Address orientation.
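To give a sense of scale for the time differences these coordinates imply, the following sketch computes a direct-path TDOA between two hydrophones listed in Table 1. The source position is a hypothetical point chosen for illustration, the 1500 m/s sound speed is the nominal figure quoted in the Introduction, and the function name is an assumption.

```python
import math

SOUND_SPEED = 1500.0  # m/s, nominal value (the 1.5 km/s quoted in the Introduction)

# Hydrophone coordinates in metres, copied from Table 1.
H1_S1 = (0.0059, -0.0760, 0.1905)
H4_S1 = (31.7886, 0.2984, 0.1905)


def direct_path_tdoa(source, hydrophone_a, hydrophone_b, c=SOUND_SPEED):
    """Arrival time at a minus arrival time at b, in seconds, ignoring all reflections."""
    return (math.dist(source, hydrophone_a) - math.dist(source, hydrophone_b)) / c


hypothetical_source = (10.0, -4.0, 0.0)  # illustrative point, not a measured source location
tdoa = direct_path_tdoa(hypothetical_source, H1_S1, H4_S1)

# By the triangle inequality, no direct-path TDOA can exceed the inter-hydrophone travel time.
max_tdoa = math.dist(H1_S1, H4_S1) / SOUND_SPEED
```

For this pair the bound `max_tdoa` is about 21 ms, far shorter than the 0.3 s to 1 s tonal durations used here, which is why echoes overlap the direct arrival within a single tonal.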

Tonal creation

The sound data collected by our hydrophones consisted of computer-generated, whistle-like tonals played over an underwater speaker (a Lubell LL916H). This choice of sound data, intended to approximate whistle data collected from bottlenose dolphins, was motivated by a few considerations. First, we required that the tonals were not previously subject to multipath phenomena (i.e., were not pre-recorded), so as not to risk skewing our source localization models with false source information. Second, we required that the tonals were broadly representative of the approximate “whistle space” for Tursiops truncatus, or distributed over a generous range of relevant whistle characteristics. These first two considerations complicated the use of real whistle playbacks. Lastly, we desired that the tonals were obtained in sufficient quantity (on the order of hundreds of tonals per classification group, consistent with typical machine learning sample sizes), from well-measured source locations inside the EP. While this would not have been impossible using whistles produced by the pools’ live dolphins, at this stage of study we hoped to isolate sound source localization from the separate task of visual object localization (i.e., obtaining dolphin coordinates from cameras), which this method would have conflated; a semi-stationary speaker could be more reliably measured.

Another choice we made during tonal creation and subsequent source localization model creation was to focus on the “fundamental” (lowest-frequency component) of the dolphin whistle, ignoring the accompanying set of harmonics, or components that are “stacked” above the fundamental in frequency. This choice was made to avoid making perilous and unnecessary assumptions about the mathematical relationship between fundamentals and harmonics during signal creation, and to avoid expanding the size of our target whistle space by adding degrees of freedom. Above all, this choice was made because the whistle fundamental is generally understood to be the strongest-intensity whistle component as well as separable from the harmonics as a result of their signal-spanning displacements in frequency [68, 69]. Thus, with appropriate filtering, the problem of localizing the source of a whistle should be reducible to the problem of localizing its fundamental, with the excluded harmonics representing additional information that we would only expect to improve the quality of localization (were we to overcome the drawbacks). Moreover, by focusing on fundamentals, we hoped our methods would be more applicable to localizing two or more whistles produced simultaneously by two or more sources; the methods only require that the whistles’ fundamentals, and not their harmonics, be cleanly filtered of overlaps in time-frequency space.

We created 128 unique tonals with pitch, duration and other parameters spread across published ranges of T. truncatus whistles [69, 70]—we note that these parameters assume whistles with sinusoid characteristics. Our chosen tonal parameter values are given in Table 2. Two tonals were created for each permutation of table parameter values, as described below. To construct a tonal’s waveform, we began with an instantaneous frequency, f(t), that described a goal time-frequency (or spectrographic) trace, for instance the trace shown in Fig 3. For simplicity, and consistent with the parameters typically used to describe dolphin whistles, we approximated dolphin whistles as sinusoidal traces in spectrographic space—thus f(t) was always a sinusoid. Based on the standard definition of the instantaneous frequency as $f (t) = \frac{1}{2 π} \frac{d Φ (t)}{d t}$ , we obtained the phase Φ(t) by integration of f(t) with respect to time. The phase could be straightforwardly transformed into a playable waveform y(t) as y(t) = A(t)sin(Φ(t) + α), where α represents “Phase Start” from Table 2 and A(t) denotes a piecewise function that enforced a gradual onset and decay of tonal intensity. Given a tonal length of X, a “Power/Decay Onset Fraction” of P (see Table 2), and a tonal onset time of t = 0, the component of A(t) responsible for a gradual onset would take the form $s i n^{2} (t \frac{2 π}{4 X P})$ , acting between t = 0 and t = XP; the decay component of A(t) would be a mirrored version, acting at the end of the tonal. Instead of creating the signal y(t) as just described, the phase derived for f(t) could be transformed with a heuristic into a waveform corresponding to a slightly modified version of f(t), specifically $y (t) = A (t) \frac{a r c s i n (m \cdot s i n (Φ (t)))}{a r c s i n (m)}$ . 
Here, m is a nonzero parameter less than one; as it approaches one, the underlying waveform becomes more triangular, creating a harmonic stack in time-frequency space (with each successive harmonic weaker than the last). We used a value of m = 0.8, for which four harmonics were created; in practice, we rarely observed the second harmonic (and never the third onward), which is ~50 dB less in signal power than the fundamental, above noise in the pool.
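The construction above can be sketched in a few lines. This is a minimal illustration, not the Matlab script used in the study: the function and argument names are hypothetical, and the specific form f(t) = f_c + a·sin(2πNt/X) (center frequency f_c, cycle amplitude a, N cycles over duration X) is an assumption, chosen because it integrates in closed form.

```python
import math

def envelope(t, X, P):
    """A(t): sin^2 ramp over the first X*P seconds, mirrored at the end."""
    ramp = X * P
    if t < ramp:
        return math.sin(math.pi * t / (2 * ramp)) ** 2   # = sin^2(t * 2*pi / (4*X*P))
    if t > X - ramp:
        return math.sin(math.pi * (X - t) / (2 * ramp)) ** 2
    return 1.0

def make_tonal(X, n_cycles, f_c, a, alpha, P, fs=192_000, m=None):
    """Return samples of one tonal; m=None gives the sinusoidal waveform,
    0 < m < 1 gives the arcsin-shaped ("triangular") waveform."""
    n = int(round(X * fs))
    w = 2 * math.pi * n_cycles / X
    samples = []
    for k in range(n):
        t = k / fs
        # Phi(t) = 2*pi * integral of f, for the assumed f(t) = f_c + a*sin(w*t).
        phi = 2 * math.pi * f_c * t + (a * X / n_cycles) * (1 - math.cos(w * t))
        if m is None:
            s = math.sin(phi + alpha)
        else:
            # Written as in the text, which omits alpha for this variant.
            s = math.asin(m * math.sin(phi)) / math.asin(m)
        samples.append(envelope(t, X, P) * s)
    return samples

# Parameters of the Fig 3 tonal (onset fraction 0.1 is an arbitrary pick from Table 2).
y_sin = make_tonal(0.3, 2, 10500.0, 2000.0, -math.pi / 2, 0.1)
y_tri = make_tonal(0.3, 2, 10500.0, 2000.0, -math.pi / 2, 0.1, m=0.8)
```

Integrating f(t) analytically avoids accumulating numerical error in the phase; a cumulative sum of f(t)/fs would serve for traces without a closed-form integral.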

Table 2. Parameter values of created tonals.

Parameter	Value Set
Duration (sec)	[0.3, 1]
Number of Cycles	[1, 2]
Center Frequency (Hz)	[6000, 10500]
Cycle Amplitude (Hz)	[2000, 5000]
Phase Start (rad)	[$-\frac{\pi}{2}$, $\frac{\pi}{2}$]
Power Onset/Decay Fraction*	[0.1, 0.25]

* Refer to the body for an explanation of “Power Onset/Decay Fraction.”

Fig 3. Displayed is a standard, 1024-bin, Hamming-window spectrogram of one of the 128 tonals that was played (and here sampled) at 192 kHz; frequency resolution of the plot is 187.5 Hz (smoothing added). Note that the spectrogram was constructed from the unplayed, computational source signal (power level is arbitrary). Duration = 0.3 s, Number of Cycles = 2, Center Frequency = 10500 Hz, Cycle Amplitude = 2000 Hz, Phase Start = -π/2.

Tonal play and acquisition

Using a Lubell LL916H underwater speaker, which possesses an approximately omnidirectional sound profile, the 128 created tonals were played at calibrated volume levels at each of 14 source locations inside the EP, corresponding to 7 surface positions and two depths, 1.60 m and 5.87 m below the water surface (Fig 2A, Table 3). Surface spacing between adjacent source locations was approximately 1.5–5 m. The speaker was suspended by rope from a custom flotation device and moved across the pool surface by four additional ropes extending from the device to research assistants standing on ladders poolside. Importantly, the speaker was permitted to sway from its center point by ~1/3 m (as much as 1 m) in an arbitrary direction during calibration. These assistants also used handheld Bosch 225 ft (68.58 m) Laser Measure devices to determine the device’s distance from their reference points (several measurements were taken for each source location), and through a least-squares trilateration procedure [71] the device location could always be placed on a Cartesian coordinate system common with the hydrophones.
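To illustrate how laser-measured distances can be turned into surface coordinates, the following is a minimal sketch of one common linearized least-squares trilateration. It is not necessarily the procedure of [71]; the function name and the reference-point coordinates are hypothetical.

```python
import math

def trilaterate_2d(anchors, distances):
    """Least-squares (x, y) from >= 3 reference points and measured distances.

    Subtracting the first range equation from the others removes the
    quadratic terms, leaving a linear system solved via normal equations.
    """
    (x0, y0), d0 = anchors[0], distances[0]
    rows, rhs = [], []
    for (xi, yi), di in zip(anchors[1:], distances[1:]):
        rows.append((2 * (xi - x0), 2 * (yi - y0)))
        rhs.append(xi**2 - x0**2 + yi**2 - y0**2 - (di**2 - d0**2))
    # Normal equations (A^T A) p = A^T b for the two unknowns.
    saa = sum(a * a for a, _ in rows)
    sab = sum(a * b for a, b in rows)
    sbb = sum(b * b for _, b in rows)
    sar = sum(a * r for (a, _), r in zip(rows, rhs))
    sbr = sum(b * r for (_, b), r in zip(rows, rhs))
    det = saa * sbb - sab * sab
    return ((sbb * sar - sab * sbr) / det, (saa * sbr - sab * sar) / det)

# Hypothetical poolside reference points (m) and noise-free distances to P3.
anchors = [(0.0, 0.0), (30.0, 0.0), (30.0, -12.0), (0.0, -12.0)]
p3 = (15.18, -4.91)
distances = [math.hypot(p3[0] - ax, p3[1] - ay) for ax, ay in anchors]
estimate = trilaterate_2d(anchors, distances)
```

With noisy distances the same function returns the least-squares compromise; repeated measurements per location can simply be appended as extra anchor/distance pairs.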

Table 3. Tonal source locations.

Source Location Identifier	X (m)	Y (m)	Z (m)
P1	8.84	-5.61	0.00
P2	12.00	-5.49	0.00
P3	15.18	-4.91	0.00
P4	19.45	-5.03	0.00
P5	22.19	-5.49	0.00
P6	15.43	-2.02	0.00
P7	15.24	-7.01	0.00
P8	9.05	-5.88	-4.27
P9	12.13	-5.33	-4.27
P10	15.33	-5.03	-4.27
P11	19.23	-5.43	-4.27
P12	22.01	-5.36	-4.27
P13	15.39	-2.13	-4.27
P14	15.14	-7.62	-4.27

Tonals were both played and sampled at 192 kHz over two networked MOTU 8M audio interfaces connected by fiber optic to a 2013 Apple desktop running macOS Sierra. Tonals were played using a custom Matlab 2018a script, while sounds were manually saved to the Audacity AUP project format, which accommodated long recording sessions in 16 channels. Multiple versions of Matlab (primarily 2015b, 2018a) were used for downstream data management and handling.

In total, 1,783 of 1,792 sampled tonals were successfully extracted to individual 2-s, 16-channel WAV’s (referred to subsequently as “snippets”). This was performed in two steps. First, the raw AUP files were manually separated into 16-channel WAV files containing multiple tonals apiece. Second, a Matlab script read each multiple-tonal WAV, and extracted single-tonal snippets by detecting and orienting to a 0.25-s, 2-kHz tone that was programmed to precede every tonal by 2 s during playing; each snippet window began approximately 1.5 s after the leader tone start. The quality of extraction was confirmed manually. The last preprocessing step at this stage involved band-pass filtering the snippets with a Matlab script based on their maximum and minimum detected tonal frequencies (0.5 kHz gaps were left above and below the maximum and minimum frequencies, respectively) to eliminate a degree of ambient noise.

All snippets were saved with alphanumeric information about their tonal types and source locations for later analysis. Data are available at https://doi.org/10.6084/m9.figshare.7956212. An example of data contained in a single snippet is shown in Fig 4.

Fig 4 — Shown here is real sound data obtained by playing the signal from Fig 3 at Source Location P14 and receiving it at Hydrophone 1. (A) The received signal waveform. Signal amplitude is uncalibrated. (B) The spectrogram of the signal waveform. As above, this is a standard, 1024-bin, Hamming-window spectrogram; sampling is 192 kHz, and frequency resolution of the plot is 187.5 Hz (smoothing added). Color scale is in uncalibrated decibels.

Feature extraction

Overall, our goal was to use information contained in each snippet to perform source localization of the underlying played tonal. In other words, we sought to predict tonal source location from snippet data. We implemented supervised learning techniques to achieve this. This involved computationally building statistical models to predict tonal source location from snippet data by “training” them with a subset of snippet data paired (or labeled) with what we ultimately wished to predict, which included both snippets’ discrete source location identifiers (the first column of Table 3) and their source location coordinates (columns two through four of Table 3). A semi-random 90% of snippets were thus labeled and used for model training, and together composed the training data (or set). For the remaining 10% of snippets, which composed the test data (or set), the explicit source location information remained unknown to the models and was used only for evaluating the models’ predictions of explicit source location from the snippet data alone. Note that our separation of snippets into training and testing sets was semi-random in that we randomly distributed the snippets under the constraints that the two sets contain a proportional number of snippets from each source location, and that snippets derived from tonals that differed only in their underlying waveform type (sinusoidal or triangular) be distributed together.
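The two constraints on the semi-random split (proportional representation of each source location, and keeping sinusoidal/triangular siblings together) can be sketched as below. The record layout and function name are hypothetical, and the rounding of the per-location test count is an assumption.

```python
import random
from collections import defaultdict

def split_snippets(snippets, test_fraction=0.1, seed=0):
    """snippets: iterable of (snippet_id, location, tonal_key) tuples, where
    tonal_key is shared by tonals differing only in waveform type.
    Returns (train_ids, test_ids)."""
    rng = random.Random(seed)
    by_loc = defaultdict(lambda: defaultdict(list))
    for snippet_id, location, tonal_key in snippets:
        by_loc[location][tonal_key].append(snippet_id)
    train, test = [], []
    for location in sorted(by_loc):
        # Each group holds the waveform variants of one tonal at one location.
        groups = [by_loc[location][key] for key in sorted(by_loc[location])]
        rng.shuffle(groups)
        n_test = round(test_fraction * len(groups))
        for i, group in enumerate(groups):
            (test if i < n_test else train).extend(group)
    return train, test
```

Because whole groups are assigned, the realized test fraction can differ slightly from the nominal 10% at each location.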

To select and cast the snippet data in a way we felt was useful to supervised source localization techniques across a broad range of contexts (including real dolphin whistles manually extracted from recordings), we transformed each snippet into a set of 897,856 numerical features that the models would rely on for source localization (a feature set). In performing this transformation, we needed to comply with our machine learning models’ demand that the ith feature of every snippet feature set contain information about the snippet’s source location that was directly comparable to the information contained in the ith feature of every other snippet feature set; in other words, each feature had to measure the same property across all snippets. This last requirement immediately excluded the snippet samples themselves from consideration as features; for the general applicability of the proposed methods we assumed our tonals were windowed arbitrarily (e.g., that the duration between the snippet start and the tonal “onset” was variable), and that therefore sampling times could not be meaningfully compared across snippets.

Ultimately, we chose features that reflected information about the snippet tonals’ time-difference-of-arrivals (TDOA’s) to our hydrophones. First, this choice was made because the TDOA’s for all pairs of 4 or more hydrophones are theoretically sufficient to determine sound source coordinates using a geometric technique such as Spherical Interpolation, which transforms TDOA’s into inter-hydrophone distances via the speed of sound and algebraically determines the most probable source-hydrophone distances [31, 33, 67]. Second, this choice was made because a complete set of TDOA’s is theoretically available from the information contained in every snippet (which included waveforms from 16 hydrophone channels sharing a clock). Third, this choice allowed us to choose features that were directly comparable across snippets despite variability in their tonal windowing. Lastly, by not choosing features that reflected snippet tonal sound intensity (one possible alternative to TDOA-reflective features), we hoped to produce models that were not limited to classifying sounds of the same source intensity.

Our first features included one estimation of a snippet tonal’s TDOA for each unique pair of 16 hydrophones (excluding self-pairs) represented in a snippet, amounting to $\binom{16}{2} = \frac{16!}{2!\,(16 - 2)!} = 120$ features that were comparable across snippets. By standard practice, a TDOA between hydrophone i and hydrophone j could be estimated by choosing the relative shift between the two hydrophones’ 2 s tonal waveforms that maximized the value of their cross-correlation. We deviated from standard practice only in replacing the standard cross-correlation with the Generalized Cross-Correlation Phase Transform (GCC-PHAT), a form of cross-correlation that equalizes power across frequency bands and allows for alignment of signals based primarily on relative phase shifts [72]. In many contexts, GCC-PHAT has demonstrated better performance than the standard cross-correlation for TDOA estimation [73]. While we found that these TDOA estimations were not reliable enough to accurately perform tonal source localization using Spherical Interpolation, a possibility mentioned in the previous paragraph, we suspected that they still might be useful as part of a larger machine learning feature set.
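A GCC-PHAT estimate of one TDOA can be sketched as follows. This is an illustrative, standard-library-only version (with a small radix-2 FFT, so input lengths must be powers of two; real snippets would need zero-padding) and is not the study's Matlab code. All names are hypothetical.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + tw, even[k] - tw
    return out

def ifft(X):
    """Inverse FFT via the conjugation identity."""
    n = len(X)
    return [v.conjugate() / n for v in fft([u.conjugate() for u in X])]

def gcc_phat_tdoa(x1, x2, fs):
    """Delay (s) of x1 relative to x2; positive if x1 arrives later."""
    n = len(x1)
    cross = [a * b.conjugate() for a, b in zip(fft(x1), fft(x2))]
    # PHAT weighting: keep only the phase of the cross-spectrum.
    whitened = [c / (abs(c) + 1e-12) for c in cross]
    r = [v.real for v in ifft(whitened)]
    k = max(range(n), key=lambda i: r[i])
    lag = k if k <= n // 2 else k - n   # map circular index to signed lag
    return lag / fs
```

Repeating this for every unordered hydrophone pair yields the 120 TDOA features described above.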

The next and last 6601 × 136 features consisted of elements from standard, normalized circular cross-correlations [74]: for each unique pair of the 16 hydrophones (including self-pairs) represented in a snippet, 136 in total, we computed the standard, normalized circular cross-correlation for a snippet. We chose to include standard cross-correlation elements because, as discussed in the previous paragraph, they contain information about TDOA’s. Similarly, these elements are naturally aligned across snippets regardless of variation in tonal onset time inside a snippet (or other properties such as tonal length); each sample corresponds to a relative time-shift between hydrophone channels that depends only on snippet length and sampling frequency. We normalized the cross-correlations (dividing a snippet’s cross-correlation elements by the largest absolute value among them) in order to suppress amplitude information. Our early investigations suggested that these standard cross-correlation elements were more helpful to classification than those derived from the GCC-PHAT, discussed previously. While each correlation series was initially 384,000 elements long (192,000 samples/s × 2 s), we only kept the central 6601 elements from each, corresponding to TDOA information for time shifts up to ~±17 ms between hydrophones. This comfortably ensured we included TDOA information for at least the first arrival of snippet tonals for any pair of hydrophones and any source location (we calculated the longest possible first arrival TDOA to be ~12 ms); without specific guarantees this also accommodated information about a number of (but certainly not all) TDOA’s between later arrivals. Note that later arrival times do not necessarily correspond to greater differences in arrival times (TDOA’s). The main purpose of this feature reduction was improved computation times. In general, this reduction is not necessary, and it may in fact be preferable to avoid it.
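The feature counts and the cropped, normalized circular cross-correlation can be checked with a short sketch. The function name is hypothetical, and normalizing each pair's cropped series by its own peak is an assumption (the text does not specify whether normalization was per pair or applied before cropping).

```python
import math

FS = 192_000          # samples per second
N_HYDROPHONES = 16
KEPT = 6601           # central cross-correlation elements kept per pair

n_tdoa = math.comb(N_HYDROPHONES, 2)           # pairs excluding self-pairs
n_xcorr_pairs = n_tdoa + N_HYDROPHONES         # pairs including self-pairs
n_features = n_tdoa + n_xcorr_pairs * KEPT     # full feature-set size
max_shift_ms = (KEPT // 2) / FS * 1e3          # largest retained time shift

def central_xcorr(x1, x2, half_width):
    """Circular cross-correlation of x1 against x2 at lags
    -half_width..+half_width, divided by its largest absolute value."""
    n = len(x1)
    c = [sum(x1[(i + k) % n] * x2[i] for i in range(n))
         for k in range(-half_width, half_width + 1)]
    peak = max(abs(v) for v in c)
    return [v / peak for v in c]
```

The direct sum is O(n) per lag and is only practical for short signals; an FFT-based product would be the usual choice at 384,000 samples per channel.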

Tonal localization by classification, regression and SRP

We began with the task of multiclass classification [75], which entailed training models to predict a tonal’s source location identifier from the total set of 14 identifiers (first column of Table 3). The first model class we considered was the Breiman random forest [76, 77]. The random forest class was chosen for several reasons. First, it was chosen for its low susceptibility to erroneously modeling relationships in training/validation data that are not present in the test data, a source of prediction error known as overfitting. Second, the random forest was chosen for its reliable performance across many choices of user-specified input parameters, also called hyperparameter values. Lastly, it was chosen for its ability to provide ancillary information about the input features, specifically information helpful to feature set reduction; this is the common goal of intelligently merging/eliminating features in the input data without loss of prediction accuracy, for increased computational performance and possibly increased accuracy. This ancillary information consisted of a measure of each feature’s relative importance to classification. The specific measure we derived from the random forest is termed the permuted variable delta error, referring to the increase in classification error when a given feature is effectively randomized (representing an estimation of that feature’s importance).

Except where indicated, we built the random forest using Matlab 2018a’s “TreeBagger” class with default hyperparameter values. This allowed us to grow a Breiman random forest composed of classification trees (built from the CART, Classification And Regression Tree, algorithm [78]) on the training data; each tree was sequentially trained on a random ~75% subset of the training set snippets (labeled with their corresponding source location identifiers), using a random ~$\sqrt{897{,}856}$-element (roughly 948-element) subset of the total available features (as per standard practice). Out-of-bag (OOB) error, referring to the classification error on snippets not randomly chosen for the training subset, was used to evaluate validation accuracy. Validation accuracy (or error) is a pre-testing measure of model performance that is often repeatedly calculated while tuning a model’s hyperparameters (which was not done here). Note that background on the methods we considered is available at MathWorks (https://www.mathworks.com) and scikit-learn (https://scikit-learn.org/stable/).

Finally, we evaluated the random forest’s accuracy in predicting the source location identifiers of the test snippets, which were fed to the classifier without source location identifier labels. As for all classification models, we measured the random forest’s classification accuracy as the simple percentage of source location identifiers correctly predicted from the test set snippets. Wherever classification accuracy is accompanied by a confidence interval, this 95% confidence interval was computed as the Wilson confidence interval, which is a function of the sample size.
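For reference, the Wilson score interval for a proportion of correct predictions can be computed as in the sketch below. The function name is hypothetical, and z = 1.959964 is the usual two-sided 95% normal quantile.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval (lower, upper) for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

Unlike the simple normal-approximation interval, the Wilson interval stays within [0, 1] and narrows as the number of test snippets n grows.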

Next, we used the permuted variable delta error as a measure of feature importance to obtain a reduced feature set ahead of training models from additional classification and regression classes. Our reduced feature set included all features that possessed nonzero permuted variable delta error (ultimately, the random forest did not use most features for classification). Using this new set of features, we similarly trained models from classes including the basic CART classification tree [75, 78], the linear and quadratic Support Vector Machine (SVM) [75, 79], and linear discriminant analysis [80]. Again, these standard multiclass classification models were built with Matlab 2018a, using the “fitctree,” “fitcecoc,” and “fitcdiscr” classes (respectively) with default hyperparameter values except where indicated. In particular, based on evaluation on the training/validation set we limited the number of the classification tree’s “splits” (effectively, a measure of the complexity of the tree’s decision making) to 20, to reduce this model class’ natural tendency to overfit. These model classes were initially evaluated with 10-fold cross validation on the training/validation data set (after randomly splitting the training/validation data into 10 groups, the accuracy of a model trained on 9 groups was evaluated on the 10th group, the result averaged across testing on all 10 groups), before the training data were combined to train models for evaluation on the test data set.
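The 10-fold cross-validation procedure described above can be sketched generically. The nearest-centroid classifier here is only a toy stand-in so the example runs on its own; it is not one of the model classes used in the study, and all names are hypothetical.

```python
import math
import random

def fit_centroids(X, y):
    """Toy classifier: store the mean feature vector of each label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def predict_centroid(model, x):
    """Return the label whose centroid is nearest to x."""
    return min(model, key=lambda lab: math.dist(model[lab], x))

def k_fold_accuracy(X, y, fit, predict, k=10, seed=0):
    """Mean held-out accuracy over k random folds (assumes len(X) >= k)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / k
```

Any classifier exposing the same fit/predict pair could be passed in place of the toy model.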

After considering tonal source location prediction by classification, we next considered source location prediction by regression. Whereas the classification models were designed to predict a snippet tonal’s source location from the set of 14 source location identifiers, the regression models were designed to predict a snippet tonal’s three-dimensional source coordinates from anywhere in three-dimensional space. Thus, training snippets were now labeled with source location coordinates rather than source location identifiers.

We specifically considered Gaussian process regression (also termed kriging) [81, 82]. This form of regression performs predictive interpolation by merging the weighted labels of nearby training points, with weights empirically determined by calculating autocorrelations between points (i.e., their tendency towards similarity) and inferring trends in the underlying feature landscape. The method was developed in geostatistics and is correspondingly suited to interpolation over smoothly varying landscapes, and where appropriate its assumptions are powerful for interpolating within data that is sparsely sampled, as our data from 14 source locations were. Judging by validation error on our training/validation data, we determined Gaussian regression was more effective than ordinary regression methods on our data; we used a squared exponential kernel function and otherwise the default hyperparameters belonging to the “fitrgp” class in Matlab 2019b. Gaussian regression models were built and evaluated in two ways. In the first way, the reduced feature sets of all training snippets were used to build three models for predicting snippet tonal source coordinates; one model was built to predict each of the three source coordinates. These three models were then evaluated on all test snippets. In the second way, the reduced feature sets of training snippets from all but one of the fourteen source locations were likewise used to build three models. These models were then evaluated only on test snippets from the excluded source location; in other words, the models were prompted to predict the source coordinates of snippet tonals from novel locations for which they received no training, to perform spatial interpolation. Three regression models were built and evaluated with every source location excluded from training and used for evaluation in turn, and the results aggregated.
While we were doubtful of our models’ ability to precisely interpolate at distances of several meters from 14 points, this second way of building and evaluating the models was envisioned to show that reduced-feature-set Gaussian regression models are capable of a degree of spatial interpolation on our problem.
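The core of Gaussian process prediction with a squared exponential kernel can be sketched as below for a single output coordinate. Unlike Matlab's “fitrgp”, this sketch does not fit hyperparameters: the length scale, signal and noise standard deviations are fixed, hypothetical values, and a constant mean equal to the training-label average is assumed.

```python
import math

def sq_exp(a, b, length, sigma_f):
    """Squared exponential kernel between feature vectors a and b."""
    d2 = sum((u - v) ** 2 for u, v in zip(a, b))
    return sigma_f ** 2 * math.exp(-0.5 * d2 / length ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(X_train, y_train, X_test, length=1.0, sigma_f=1.0, sigma_n=1e-3):
    """Posterior mean of a GP at X_test for one coordinate."""
    mean = sum(y_train) / len(y_train)
    n = len(X_train)
    K = [[sq_exp(X_train[i], X_train[j], length, sigma_f)
          + (sigma_n ** 2 if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = chol_solve(cholesky(K), [v - mean for v in y_train])
    return [mean + sum(sq_exp(x, xt, length, sigma_f) * a
                       for xt, a in zip(X_train, alpha)) for x in X_test]
```

Calling `gp_predict` three times, once per coordinate, mirrors the one-model-per-axis arrangement; the leave-out variant simply drops all training rows from one source location before each call.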

For evaluation of our regression models, we implemented a Steered-Response Power approach to localizing the snippet tonals in the test set, using established methodology involving spatial gridding [59] together with what details were made available by Thomas et al. [60], primarily pertaining to limiting the search field. We gridded the pool into 15 cm cubes, with cube corners representing potential tonal source locations. For a hypothetical tonal from each corner, we calculated the full set of expected TDOA’s for our 16 hydrophones (as before, 120 TDOA’s). We then shifted the 16 waveforms of a snippet in our test set by each set of TDOA’s, simulating the reversal of the relative shifts caused by travel from each grid corner, before summing the dot products of every unique waveform pair. In theory, using the TDOA set from a snippet tonal’s true source location would maximize this final quantity. Consequently, the source location predicted by SRP corresponded to the cube corner whose TDOA’s produced the largest value. Unlike the sound source localization methods we propose, SRP required the speed of sound as input, which we calculated to be ~1533 m/s from the Del Grosso equation [83]; the pool salinity was 31.5 ppt and temperature 26.04°C. As an aside, we attempted to replace the standard cross-correlation in the prior calculation with the Generalized Cross Correlation with Phase Transform [72], which constitutes the SRP-PHAT technique for sound localization discussed in [59], but found the results to be inferior.
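The grid search at the heart of SRP can be illustrated on a toy problem. This sketch is two-dimensional, uses a coarse 0.5 m grid, four hypothetical hydrophone positions and a synthetic noise burst with integer-sample delays and no reverberation; it illustrates the delay-and-sum idea only and is not the implementation of [59] or [60].

```python
import itertools
import math
import random

def srp_localize(signals, hydrophones, grid, c, fs):
    """Return the grid point whose expected TDOA's best align the channels."""
    n = len(signals[0])
    best, best_power = None, -math.inf
    for cand in grid:
        delays = [round(math.dist(cand, h) / c * fs) for h in hydrophones]
        shifts = [d - min(delays) for d in delays]   # delays relative to earliest
        span = n - max(shifts)
        power = 0.0
        # Undo each channel's relative delay, then sum pairwise dot products.
        for i, j in itertools.combinations(range(len(signals)), 2):
            xi, xj, si, sj = signals[i], signals[j], shifts[i], shifts[j]
            power += sum(xi[k + si] * xj[k + sj] for k in range(span))
        if power > best_power:
            best, best_power = cand, power
    return best

# Toy setup: all positions and the burst are hypothetical.
C, FS = 1533.0, 192_000
hydrophones = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
true_source = (1.0, 2.0)
rng = random.Random(0)
burst = [rng.gauss(0.0, 1.0) for _ in range(200)]
signals = []
for h in hydrophones:
    delay = round(math.dist(true_source, h) / C * FS)
    signals.append([0.0] * delay + burst + [0.0] * (1024 - delay - 200))
grid = [(0.5 * i, 0.5 * j) for i in range(9) for j in range(7)]
estimate = srp_localize(signals, hydrophones, grid, C, FS)
```

In a reverberant pool the received channels also contain reflected copies of the source signal, which is one reason SRP peaks can be displaced from the true source.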

We first compared the performance of Gaussian process regression with SRP by comparing their source prediction/localization errors, in particular the deviations (simple differences) between model-estimated source coordinates and the physically measured source coordinates. We calculated both Euclidean (alternatively, straight-line or 3D) deviations and deviations along the three individual component axes, using the Median Absolute Deviation (MAD) metric bounded by standard 25th-75th percentile Interquartile Ranges (IQR) to combine and express this information across all test set snippets; one MAD and set of IQR’s was computed for each model class along each axis of interest. We also generated histograms of the absolute deviations, to illustrate their distributions for different model-axis sets. While the MAD and IQR metrics allowed us to compare model performances in a way that was intuitive, strictly speaking they were not appropriate for determining whether the MAD’s were statistically distinct. To make such comparisons given that the sets of absolute deviations were not necessarily drawn from normal distributions (which we established with Anderson-Darling tests), we first used a Kruskal-Wallis test (a nonparametric equivalent of the ANOVA) to show that the different sets of absolute deviations likely did not all originate from the same distribution. Second, we used Dunn’s post-hoc tests (akin to Tukey’s tests used after an ANOVA) to determine whether individual pairs of MAD’s significantly differed at a 5% significance level, which was determined from a comparison of the nonparametric “mean ranks” values.
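The per-axis and Euclidean summaries can be computed as in the following sketch. The function name is hypothetical, and the “inclusive” quartile convention is an assumption; percentile definitions differ slightly between software packages.

```python
import math
import statistics

def deviation_summary(predicted, measured):
    """Median and 25th/75th percentiles of absolute deviations,
    per axis and for the Euclidean (3D) distance."""
    columns = {"X": [], "Y": [], "Z": [], "Eucl.": []}
    for p, m in zip(predicted, measured):
        for axis, a, b in zip("XYZ", p, m):
            columns[axis].append(abs(a - b))
        columns["Eucl."].append(math.dist(p, m))
    summary = {}
    for axis, dev in columns.items():
        q1, _, q3 = statistics.quantiles(dev, n=4, method="inclusive")
        summary[axis] = (statistics.median(dev), q1, q3)
    return summary
```

Each returned tuple corresponds to one “MAD [IQR]” entry for a given model and axis.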

Finally, we sought confirmation that our features were used by the machine learning models to distinguish source locations based on inferred TDOA information and geometric reasoning. Noting that data from two hydrophones in a single 4-hydrophone-array would not be as helpful to TDOA-based, geometric sound source localization as two hydrophones in two different 4-hydrophone-arrays (owing to larger changes in their TDOA’s with changes in source location), we predicted that direct or indirect information from all possible pairs of the four 4-hydrophone-arrays would be necessary for optimum classification. We evaluated our prediction by looking more closely at the classification tree previously constructed, which we found to be naturally “shallow” (composed of few branches) and parsimonious, selecting a minimum (or near-minimum) subset of features for use in classification. We then mapped these features’ importance (again, using the permuted variable delta error from a random forest, constructed as before but on the new feature set) back to the hydrophone and 4-hydrophone-array pairs from which they were derived, as a measure of which hydrophone and 4-hydrophone-array pairs were most important to classification.

Results

The random forest classification model trained on the full feature set reached 100.0% OOB accuracy at a size of ~180 trees. We continued training to 300 trees, and evaluated the resulting model on the test set: 100.0% accuracy was achieved, with 6,778 features possessing permuted variable delta error greater than 0 (based on OOB evaluations). Note that, because random forest construction did not consider all features equally (even excluding some) given that each member tree used a random subset of available features during training, subsets of features other than this set of 6,778 could potentially accommodate 100.0% test accuracy. When we considered which 4-hydrophone-array pairs the 6,778 TDOA and cross-correlation features represented, we found that all pairs of the 4-hydrophone-arrays were represented with no significant preference.

We trained several more models on the reduced, 6,778-element feature set, including a basic classification tree, a linear and quadratic SVM, and linear discriminant analysis (LDA), performing initial evaluations with 10-fold cross-validation before aggregating all training data for final evaluation on all test data. Results are shown in Table 4.

Table 4. Accuracy of source location prediction by classification.

Model	10-fold CV Accuracy (%)	Test Accuracy [Wilson CI] (%)
Classification Tree	96.90	97.75 [97.06–98.44]
Linear SVM	100.0	99.44 [98.34–100.0]
Quadratic SVM	100.0	100.0 [n/a]
LDA	100.0	100.0 [n/a]

Using the reduced feature set, we performed Gaussian process regression (kriging) to predict source coordinates of test snippet tonals, training one model for each Cartesian axis. We trained and evaluated models in two ways. First, we trained three models on all training/validation data for evaluation on all test data; this method is referred to as Standard Gaussian Regression (S GR) below. A random subset of predictions made this way is plotted in Fig 5. Second, we trained three models on training/validation data from all but one source location for evaluation on test data exclusively from the excluded source location, repeating the process with each source location excluded in turn and aggregating the predictions; this method is referred to as Leave-Out Gaussian Regression (LO GR) below. For comparison, we also predicted test set source coordinates using SRP. For all sets of predictions, we computed absolute deviations, which are the basis of the statistics in Table 5, Figs 6 and 7. Our use of nonparametric statistics was motivated by Anderson-Darling tests that rejected the null hypotheses that the sets of absolute deviations were drawn from normal distributions (5% significance level). Further, a Kruskal-Wallis test on the deviation sets’ mean ranks determined the sets did not originate from the same distribution at a 5% significance level.

Fig 5 — The half-cylindrical National Aquarium EP is depicted. Large unfilled circles indicate the true source coordinates of the test snippet tonals; each has a unique color. Small filled circles indicate the source coordinates of test snippet tonals predicted by Gaussian process regression, colors matching their respective true coordinates.

Table 5. MAD and IQR of test set source coordinate predictions.

Model	Axis	MAD [IQR] (m)
S GR	Eucl.	0.66 [0.34–1.57]
S GR	X	0.19 [0.07–0.39]
S GR	Y	0.13 [0.05–0.34]
S GR	Z	0.52 [0.16–1.18]
LO GR	Eucl.	3.37 [2.85–3.75]
LO GR	X	0.56 [0.26–1.03]
LO GR	Y	0.50 [0.17–1.62]
LO GR	Z	2.73 [2.14–3.39]
SRP	Eucl.	1.56 [0.73–2.48]
SRP	X	0.43 [0.18–0.73]
SRP	Y	0.52 [0.21–0.98]
SRP	Z	0.91 [0.15–1.98]

Fig 6 — For the Standard Gaussian Regression (S GR), Leave-Out Gaussian Regression (LO GR), and SRP methods, test snippet tonal localization error (absolute deviation) is displayed (A) for the Euclidean (straight-line) direction, (B) for the X-axis, (C) for the Y-axis, and (D) for the Z-axis.

Fig 7 — The mean ranks for the localization error (absolute deviation) sets belonging to every source localization method and axis of localization, computed ahead of a Kruskal-Wallis test, are shown with 95% CI’s. Based on Dunn’s tests with an overall 5% significance threshold, overlapping intervals reflect sets whose underlying distributions are not statistically distinct (implying non-distinct MAD’s); non-overlapping intervals reflect sets that are distinct (implying distinct MAD’s).

The classification tree previously built on the reduced, 6,778-element feature set displayed the unique property of parsimony, in that it naturally identified a smaller, 22-element feature subset as sufficient for performing source location classification. We trained a parsimonious random forest on this 22-element feature set (using the same hyperparameter values as before), this model achieving 98.88% test accuracy (95% CI [97.73–100.0]). Thus, we considered this 22-element feature set both sufficient and small enough to determine whether classification made use of features derived from a mixed set of spatially distant hydrophone and 4-hydrophone-array pairs, as would be favored by a geometric, TDOA-based approach to sound source localization (such as Spherical Interpolation). The permuted variable delta error was summed within hydrophone pairs and then averaged within 4-hydrophone-array pairs, which is visualized in Fig 8. Overall, we note that features representing all pairs of 4-hydrophone-arrays, directly or through a transitive relationship (e.g., in Fig 8 Panel D, TDOA information between arrays 3 and 4 is implied by TDOA information between arrays 1 and 3 and between arrays 1 and 4), are utilized for classification.

Fig 8 — Feature importance values for a parsimonious, 22-feature random forest model summed within corresponding hydrophone pairs to make (A) and (C), with those values subsequently averaged within 4-hydrophone-arrays to make (B) and (D). Note that values are equal between reversed hydrophone/4-hydrophone-array pairs (e.g., values are equal for hydrophone pairs 1-2 and 2-1) but plotted only once. (A) Cross-hydrophone importances for cross-correlation features. (B) Cross-4-hydrophone-array importances for cross-correlation features. (C) Cross-hydrophone importances for TDOA features. (D) Cross-4-hydrophone-array importances for TDOA features.

Discussion

We provide a basic proof of concept that sound source localization of bottlenose dolphin whistles might be achieved with classification and regression methods in a half-cylindrical captive dolphin enclosure, by localizing the semi-stationary sources of artificial tonal sounds. An enclosure of this type traditionally poses challenges to source localization of tonal sounds (by tag-less methods) given the prominence of multipath effects, which complicate the acquisition of TDOA’s by cross-correlation methods for successful use with standard geometric sound source localization methods (such as Spherical Interpolation). Moreover, for the same conditions we showed that, for the localization of test set tonals played near training set tonals (within ~1/3 m, the speaker sway range), Gaussian regression outperforms an established Steered-Response Power (SRP) method for whistle source localization. Localizing in the plane of the pool surface, Gaussian process models localized tonals played at source locations not included in training with similar accuracy to SRP, interpolating at distances as long as several meters. This is significant because source localization along the pool surface can be sufficient for attributing whistles to dolphins, as dolphins often occupy distinct locations in this plane. Moreover, this is often the only plane for which imaging is available (i.e., using an overhead camera) and in which the dolphins can be assigned position coordinates, and thus potentially the only plane in which sound source coordinates are useful to whistle source attribution.

The initial data consisted of 2-s, 16-channel WAV samples (“snippets”) of 128 unique bottlenose-whistle-like tonals played at 14 source locations inside the National Aquarium EP. Each of the 16 channels corresponded to one hydrophone among four 4-hydrophone-arrays. The 1,783 good-quality snippets were semi-randomly divided into training/validation and test sets in a 9:1 ratio for model building/training and evaluation, respectively.

First, we showed that a random forest classifier with fewer than 200 trees achieved 100% testing accuracy at predicting from which of the 14 source locations a snippet tonal originated. The classifier used 6,778 of 897,856 available features (those features with nonzero permuted variable delta error, or equivalently nonzero feature importance), which included TDOA’s obtained from GCC-PHAT as well as normalized cross-correlations from all pairs of hydrophones. We then showed that linear discriminant analysis and a quadratic SVM achieved the same classification accuracy on the same reduced, 6,778-element feature set. Despite the classification models’ success, we concede that the distant spacing between adjacent source locations in our data is unlikely to be conducive to the practical localization of dolphin whistles for source attribution. However, were these model classes to achieve similar accuracy when trained on data reflecting shorter spacing between adjacent source locations (spacings of approximately 0.5–1.0 m, noting some spacings in our present data were less than 2 m), the linear model classes might offer simple and computationally efficient source localization of dolphin whistles for attribution.

Although it remains unclear to what extent test snippet tonals corresponding to novel source locations not included in the training set are classified to their most logical (i.e., nearest) training set source locations, we note that our classifiers’ success was achieved despite the ~1/3 m drift of the speaker during play-time. This together with the success of regression may indicate a degree of smoothness in the classifiers’ decision-making, and the likelihood of classification of snippet tonals from novel locations to nearby training set source locations. Also, it is reassuring that a linear classifier, which by definition cannot support nonlinear decision making, can achieve high accuracy: because we expect the TDOA and cross-correlation features to vary continuously over real space, the (linear) hyperplanes that partition classification zones in feature space—the mathematical basis of linear classification—suggest continuous classification zones in real space. Nevertheless, the nature of the models’ classification of snippet tonals from novel source locations to training set source locations still warrants further investigation.

We more suitably addressed the localization of test snippet tonals from source locations that were novel to the models by using Gaussian process regression, which we used to predict the source coordinates of test snippet tonals (with one regression model trained for each of the three coordinates). This model class was rather successful when trained on the full training data, achieving Euclidean test MAD of 0.66 m (IQR = 0.34–1.57). In order to better assess the regression models’ capacity for long-distance interpolation, we evaluated the regression models’ performance on test snippets for which no training snippets from the same source locations were used for model training; in effect, the models were prompted to predict the source coordinates of test snippet tonals from novel locations. While the regression models’ overall performance on test snippets representing novel source locations was not satisfactory, admitting error larger than average dolphin length (MAD of 3.37 m, IQR = 2.85–3.75), when we decomposed the error along three Cartesian axes (X-axis MAD of 0.56 m with IQR = 0.26–1.03, Y-axis MAD of 0.50 m with IQR = 0.17–1.62, and Z-axis MAD of 2.73 m with IQR = 2.14–3.39), we found that the Euclidean localization error was dominated by localization error along the Z-axis, or direction of pool depth (Fig 7). This is significant because only two distinct pool depths were represented in our data, which are intuitively insufficient for meaningful interpolation in this direction. We think it reasonable to suggest that interpolation along the Z-axis, and thereby overall Euclidean interpolation, would substantially improve with finer depth-wise spatial sampling among the training data. Increasing spatial sampling from our 14 points in the lateral directions might be expected to improve Euclidean interpolation as well. Nevertheless, we note that MAD for the interpolative case in the lateral directions was still less than average adult dolphin body length, if still greater than the minimum possible separation between two dolphins’ whistle-transmitting heads (perhaps 1/3 m).
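The held-out-location evaluation described above can be sketched with a small Gaussian process regressor written directly in NumPy. Everything in this example is an assumption made for illustration: the free-field sensor geometry, the range-difference features (a stand-in for TDOA features, with no reverberation), the RBF kernel, and the fixed `length_scale` and `noise` values. The three output coordinates share one kernel here, whereas the study trained a separate model per coordinate.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / length_scale ** 2)

def gp_fit_predict(X_train, Y_train, X_test, length_scale=1.0, noise=1e-2):
    """GP posterior mean; each column of Y_train is an independent output."""
    mean = Y_train.mean(axis=0)          # constant mean estimated from the data
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y_train - mean))
    return rbf_kernel(X_test, X_train, length_scale) @ alpha + mean

# Hypothetical free-field geometry in metres (not the pool's real layout).
sensors = np.array([[0., 0., 0.], [10., 0., 0.], [0., 10., 0.], [10., 10., 3.]])
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]

def range_difference_features(src):
    """Differences in source-to-sensor distance for every sensor pair."""
    d = np.linalg.norm(sensors - src, axis=1)
    return np.array([d[i] - d[j] for i, j in pairs])

# Training sources on a coarse grid at two depths; one location is held out.
grid = np.array([[x, y, z] for x in (2., 5., 8.)
                 for y in (2., 5., 8.) for z in (1., 2.)])
held_out = np.array([5., 5., 1.])
keep = ~np.all(grid == held_out, axis=1)

X_train = np.array([range_difference_features(s) for s in grid[keep]])
Y_train = grid[keep]
X_test = range_difference_features(held_out)[None, :]

pred = gp_fit_predict(X_train, Y_train, X_test, length_scale=3.0)
print("predicted:", pred[0], "true:", held_out)
```

Because only two depths appear in this toy training grid, as in the study's data, the sketch is also a convenient place to experiment with how adding depth levels changes the held-out prediction along the Z-axis.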

To benchmark the Gaussian process regression models, we also predicted test snippet tonal source coordinates using a standard SRP approach that has met success elsewhere [60]. Gaussian process regression significantly outperformed SRP when all training data were used (Fig 7). When the regression models were deprived of training data from source locations of evaluated test snippets (the novel source location or long-distance interpolative case), SRP performed better overall and along the Z-axis; there was no significant performance difference along the other two component axes. While this indicates that, with the current training data, Gaussian process regression does not outperform SRP in three dimensions when prompted to interpolate at longer distances, it also indicates that Gaussian process regression is capable of long-distance interpolation with accuracy comparable to that of SRP in the lateral directions. Further, related to the previous discussion, we might expect Gaussian process regression to perform at least as well as SRP were more than two pool depths represented in the training data. Most importantly, the results suggest that Gaussian process regression can perform just as well as SRP for localizing tonals across the pool surface (i.e., disregarding depth), which is often sufficient for distinguishing among potential sound sources based on overhead imaging (as in Thomas et al.). Thus, even with our limited training data, Gaussian process regression seems to be a promising option for whistle localization ahead of whistle attribution in an environment with pronounced multipath effects.
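The per-axis comparison above can be made concrete with a short sketch that summarizes localization error overall, in the plane of the pool surface, and along each axis. It assumes that MAD is read as the median of the absolute errors (this section does not expand the abbreviation), and the function names and toy coordinates are hypothetical.

```python
import numpy as np

def median_and_iqr(values):
    """Return the median and the (25th, 75th) percentiles of `values`."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    return float(med), (float(q1), float(q3))

def localization_errors(pred, true):
    """Summarize errors between predicted and true (x, y, z) coordinates."""
    err = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
    return {
        "euclidean": median_and_iqr(np.linalg.norm(err, axis=1)),
        "planar_xy": median_and_iqr(np.linalg.norm(err[:, :2], axis=1)),
        "x": median_and_iqr(np.abs(err[:, 0])),
        "y": median_and_iqr(np.abs(err[:, 1])),
        "z": median_and_iqr(np.abs(err[:, 2])),
    }

# Toy data (hypothetical): small lateral errors, large depth errors.
true = np.zeros((5, 3))
pred = np.column_stack((
    [0.1, 0.2, 0.3, 0.4, 0.5],   # x errors in metres
    [0.0, 0.0, 0.0, 0.0, 0.0],   # y errors in metres
    [1.0, 2.0, 3.0, 4.0, 5.0],   # z errors in metres
))

summary = localization_errors(pred, true)
for name, (med, (q1, q3)) in summary.items():
    print(f"{name}: median={med:.2f} m, IQR=({q1:.2f}, {q3:.2f})")
```

Reporting the planar error separately mirrors the argument in the text: when attribution relies on overhead imaging, a method can be judged on its accuracy in the surface plane even if its depth estimates are poor.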

Lastly, we showed that an extremely sparse, 22-feature random forest with high classification accuracy (98.88%)—a parsimonious model—includes direct or indirect TDOA information from all possible pairs of the four 4-hydrophone-arrays. The information, manifested in the 22 features that included TDOA estimates from GCC-PHAT and normalized cross-correlation elements for pairs of hydrophone channels, reflected the largest of the temporal separations among pairs of our 16 hydrophones. As increasing the temporal separation of hydrophones is expected to optimize the performance of TDOA-based geometric methods of sound source localization (such as Spherical Interpolation), which in fact motivated our initial placement of the four 4-hydrophone-arrays, the parsimonious classifiers’ selection of these features suggests that they are reproducing similar geometric reasoning in their decision making. However, we concede that the inner logic of the models ultimately remains unknown.

Caveats

These methods must still be evaluated on tonals originating from dolphins (i.e., whistles). Not only do dolphins move faster than our speaker, but they possess different acoustical directional properties, and their whistles potentially fall outside of the “whistle space” spanned by our tonals. Regarding movement speed, we are hopeful because the past success of the Thomas et al. SRP method on dolphin whistles suggests that whistle cross-correlations, which both methods rely on, retain information sufficient for localization even when a source moves at dolphin speed. Regarding whistle directionality, though the directionality of whistles strictly differs from that of our speaker-produced tonals, it has been observed that the “fundamental” or lowest-frequency trace of whistles (the focus of our method) is approximately omnidirectional [84]. Lastly, regarding whistles existing outside of our artificial “whistle space,” we note that dolphin groups of the approximate size of the National Aquarium’s seven have been shown to possess fewer than 100 unique call types [21]. This suggests the possibility of creating a complete, customized set of training sounds for any small group of dolphins. A final consideration is whether these methods are applicable to tonals/whistles originating simultaneously from more than one source; while these cases have been rare at the National Aquarium, they would likely factor into the methods’ utility in the wild. This case must still be studied, but we would expect our methods to succeed (dependent on the previous considerations) in cases where whistle fundamentals can be cleanly separated by signal preprocessing.

Conclusions and outlook

We feel this study offers a valid argument that machine learning methods are promising for solving the problem of bottlenose whistle localization in highly reverberant aquaria, where tag-based solutions to whistle source attribution are not feasible. We offer evidence to suggest that these methods might be capable of greater accuracy than SRP methods given adequate training data, with smaller real-time computational expense, at the cost of initial model training. While we caution that these methods still must be trained and evaluated on whistles originating from real dolphins, we opine that our results are rather encouraging.

And looking further, there are a huge number of scientific areas where researchers need to localize the origin coordinates of sounds of interest to them. Because both the nature of the sounds and the nature of the acoustical environment matter to how this problem is solved, there’s a massive spread of methods tailored to the specific sounds and the specific acoustical environment. Yet one problem that often recurs in many situations is that of reverberations and multiple closely-spaced echoes (“multipath”) causing confounds that badly impact localization. While we are motivated specifically by the problem of tracking dolphin vocalizations in a man-made aquarium pool, we believe that the solution we have found in our specific problem has great potential for broad applicability. Our solution is to harness the power of regularized machine learning to, de facto, map out and learn the idiosyncrasies of the acoustical environment; once the environment has been learned, localization can be both accurate and computationally fast even in the face of massive reverberation.

Acknowledgments

We thank the many helpers, support staff, and backers involved in this project. At the Rockefeller University, we thank Ana Hočevar, Sanjee Abeytunge, Vadim Sherman, Brigid Maloney, Dimitrios Moirogiannis, A. James Hudspeth, Alipasha Vaziri, and Fernando Nottebohm. At Hunter College, we thank Megan McGrath, Stephanie Bousseau, Raymond Van Steyn, Miranda Trapani, Jennifer Savoie, Eric Ramos, Kristi Collom, Adrienne Koepke, Robert Dutchen, and Ofer Tchernichovski. At the National Aquarium, we thank James Tunney, Manuel Chico, Richard Snader, Allan Consolati, Kerry Diehl, Susie Rodenkirchen Walker, Allison Ginsburg, April Martin, Kimmy Barron, Rebekah Miller, Kelsey Fairhurst, Gretchen Geiger, Kelsey Wood, Angela Lopresti, Nicole Guyton, Leigh Clayton, Brent Whitaker, Jill Arnold, Mark Kennedy, Holly Bourbon, Andrew Pulver, and John Racanelli. We thank unaffiliated helper Elias Buchanan Ohrstrom.

Data Availability

Data used for model building and testing are available at https://doi.org/10.6084/m9.figshare.7956212.

Funding Statement

This work was supported by Awards 1530544 and 1607280 from the National Science Foundation (https://www.nsf.gov) to MO Magnasco and DR Reiss. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Shirihai H, Jarrett B. Whales, Dolphins, and Other Marine Mammals of the World. Princeton and Oxford: Princeton University Press; 2006.
2. Connor RC, Heithaus MR, Barre LM. Complex social structure, alliance stability and mating access in a bottlenose dolphin ‘super-alliance’. Proceedings of the Royal Society B: Biological Sciences. 2001;268(1464):263–267. doi:10.1098/rspb.2000.1357
3. Krutzen M, Sherwin WB, Connor RC, Barre LM, Van de Casteele T, Mann J, et al. Contrasting relatedness patterns in bottlenose dolphins (Tursiops sp.) with different alliance strategies. Proceedings of the Royal Society B: Biological Sciences. 2003;270(1514):497–502. doi:10.1098/rspb.2002.2229
4. Reiss D, Marino L. Mirror self-recognition in the bottlenose dolphin: A case of cognitive convergence. Proceedings of the National Academy of Sciences. 2001;98(10):5937–5942. doi:10.1073/pnas.101086398
5. Krutzen M, Mann J, Heithaus MR, Connor RC, Bejder L, Sherwin WB. Cultural transmission of tool use in bottlenose dolphins. Proceedings of the National Academy of Sciences. 2005;102(25):8939–8943. doi:10.1073/pnas.0500232102
6. Sargeant BL, Mann J, Berggren P, Krutzen M. Specialization and development of beach hunting, a rare foraging behavior, by wild bottlenose dolphins (Tursiops sp.). Canadian Journal of Zoology. 2005;83(11):1400–1410. doi:10.1139/z05-136
7. Smith JD. Inaugurating the Study of Animal Metacognition. International Journal of Comparative Psychology. 2010;23(3):401–413.
8. Au WWL, Moore PWB, Pawloski D. Echolocation transmitting beam of the Atlantic bottlenose dolphin. The Journal of the Acoustical Society of America. 1986;80:688–691. doi:10.1121/1.394012
9. Johnson CS. Discussion. In: Busnel RG, editor. Animal Sonar Systems Biology and Bionics. Jouy-en-Josas, France; 1967. p. 384–398.
10. Au WWL. Echolocation in Dolphins. In: Hearing by Whales and Dolphins. New York City, New York: Springer; 2000.
11. Lilly JC, Miller AM. Vocal Exchanges between Dolphins. Science. 1961;134(3493):1873–1876. doi:10.1126/science.134.3493.1873
12. Dreher JJ. Linguistic Considerations of Porpoise Sounds. The Journal of the Acoustical Society of America. 1961;33(12):1799–1800. doi:10.1121/1.1908584
13. McCowan B, Hanser SF, Doyle LR. Quantitative tools for comparing animal communication systems: information theory applied to bottlenose dolphin whistle repertoires. Animal Behaviour. 1998;57(2):409–419. doi:10.1006/anbe.1998.1000
14. Caldwell MC, Caldwell DK. Individualized Whistle Contours in Bottlenosed Dolphins (Tursiops truncatus). Nature. 1965;207(1):434–435. doi:10.1038/207434a0
15. Caldwell MC, Caldwell DK, Tyack PL. Review of the signature-whistle-hypothesis for the Atlantic bottlenose dolphin. In: Leatherwood S, Reeves RR, editors. The Bottlenose Dolphin. San Diego; 1990. p. 199–234.
16. Sayigh LS, Esch HC, Wells RS, Janik VM. Facts about signature whistles of bottlenose dolphins, Tursiops truncatus. Animal Behaviour. 2007;74(6):1631–1642. doi:10.1016/j.anbehav.2007.02.018
17. Janik VM, Sayigh LS. Communication in bottlenose dolphins: 50 years of signature whistle research. Journal of Comparative Physiology A. 2013;199(6):479–489. doi:10.1007/s00359-013-0817-7
18. Tyack PL. Whistle Repertoires of Two Bottlenosed Dolphins, Tursiops truncatus: Mimicry of Signature Whistles? Behavioral Ecology and Sociobiology. 1986;18(4):251–257. doi:10.1007/BF00300001
19. McCowan B. A New Quantitative Technique for Categorizing Whistles Using Simulated Signals and Whistles from Captive Bottlenose Dolphins (Delphinidae, Tursiops truncatus). Ethology. 1995;100:177–193. doi:10.1111/j.1439-0310.1995.tb00324.x
20. McCowan B, Reiss D. Whistle Contour Development in Captive-Born Infant Dolphins (Tursiops truncatus): Role of Learning. Journal of Comparative Psychology. 1995;109(3):242–260. doi:10.1037/0735-7036.109.3.242
21. McCowan B, Reiss D. Quantitative Comparison of Whistle Repertoires from Captive Adult Bottlenose Dolphins (Delphinidae, Tursiops truncatus): a Re-evaluation of the Signature Whistle Hypothesis. Ethology. 1995;100:194–209. doi:10.1111/j.1439-0310.1995.tb00325.x
22. McCowan B, Reiss D, Gubbins C. Social familiarity influences whistle acoustic structure in adult female bottlenose dolphins (Tursiops truncatus). Aquatic Mammals. 1998;24(1):27–40.
23. McCowan B, Reiss D. The fallacy of ‘signature whistles’ in bottlenose dolphins: a comparative perspective of ‘signature information’ in animal vocalizations. Animal Behaviour. 2001;62(6):1151–1162. doi:10.1006/anbe.2001.1846
24. Janik VM, King SL, Sayigh LS, Wells RS. Identifying signature whistles from recordings of groups of unrestrained bottlenose dolphins (Tursiops truncatus). Marine Mammal Science. 2013;29(1):109–122. doi:10.1111/j.1748-7692.2011.00549.x
25. Mellinger DK, Stafford KM, Moore SE, Dziak RP, Matsumoto H. An Overview of Fixed Passive Acoustic Observation Methods for Cetaceans. Oceanography. 2007;20(4):36–45. doi:10.5670/oceanog.2007.03
26. Gero S, Whitehead H, Rendell L. Individual, unit and vocal clan level identity cues in sperm whale codas. Royal Society Open Science. 2016;3(1):150372. doi:10.1098/rsos.150372
27. Širović A, Hildebrand JA, Wiggins SM. Blue and fin whale call source levels and propagation range in the Southern Ocean. The Journal of the Acoustical Society of America. 2007;122(2):1208–1215. doi:10.1121/1.2749452
28. Charif RA, Clapham PJ, Clark CW. Acoustic Detections of Singing Humpback Whales in Deep Waters off the British Isles. Marine Mammal Science. 2001;17(4):751–768. doi:10.1111/j.1748-7692.2001.tb01297.x
29. Stafford KM, Fox CG, Clark DS. Acoustic Detections of Singing Humpback Whales in Deep Waters off the British Isles. J Acoust Soc Am;104(6):3616–3625.
30. Li X, Deng ZD, Sun Y, Martinez JJ, Fu T, McMichael GA, et al. A 3D approximate maximum likelihood solver for localization of fish implanted with acoustic transmitters. Scientific Reports. 2014;4(1):668–9.
31. Li X, Deng ZD, Rauchenstein LT, Carlson TJ. Source-localization algorithms and applications using time of arrival and time difference of arrival measurements. Review of Scientific Instruments. 2016;87(4):041502–13. doi:10.1063/1.4947001
32. Spiesberger JL. Locating animals from their sounds and tomography of the atmosphere: Experimental demonstration. The Journal of the Acoustical Society of America. 1999;106(2):837–846. doi:10.1121/1.427100
33. Smith JO, Abel JS. The Spherical Interpolation Method of Source Localization. IEEE Journal of Oceanic Engineering. 1987;OE-12(1):246–252. doi:10.1109/JOE.1987.1145217
34. Di Claudio ED, Parisi R. Robust ML wideband beamforming in reverberant fields. IEEE Transactions on Signal Processing. 2003;51(2):338–349. doi:10.1109/TSP.2002.806866
35. Van Veen BD, Buckley KM. Beamforming: A Versatile Approach to Spatial Filtering. IEEE ASSP Magazine. 1998; p. 1–21.
36. Greensted A. type; 2012. Available from: http://www.labbookpages.co.uk/audio/beamforming/delaySum.html.
37. Do H, Silverman HF, Yu Y. A Real-Time SRP-PHAT Source Location Implementation Using Stochastic Region Contract (SRC) on a Large-Aperture Array. Proc ICASSP-2007. 2007; p. I-121 to I-124.
38. Freitag LE. Passive acoustic localization of the Atlantic bottlenose dolphin using whistles and echolocation clicks. J Acoust Soc Am. 1993;93:2197. doi:10.1121/1.406681
39. Watkins WA, Schevill WE. Listening to Hawaiian Spinner Porpoises, Stenella Cf. Longirostris, with a Three-Dimensional Hydrophone Array. Journal of Mammalogy. 1974;55(2):319–328. doi:10.2307/1379001
40. Küsel ET, Munoz T, Siderius M, Mellinger DK, Heimlich S. Marine mammal tracks from two-hydrophone acoustic recordings made with a glider. Ocean Science;13(2):273–288.
41. Yack TM, Barlow J, Calambokidis J, Southall B, Coates S. Passive acoustic monitoring using a towed hydrophone array results in identification of a previously unknown beaked whale habitat. The Journal of the Acoustical Society of America. 2013;134(3):1–7. doi:10.1121/1.4816585
42. Barlow J, Taylor BL. Estimates of Sperm Whale Abundance in the Northeastern Temperate Pacific from a Combined Acoustic and Visual Survey. Marine Mammal Science. 2005;21(3):429–445. doi:10.1111/j.1748-7692.2005.tb01242.x
43. Clark CW, Ellison WT. Calibration and comparison of the acoustic location methods used during the spring migration of the bowhead whale, Balaena mysticetus, off Pt. Barrow, Alaska, 1984–1993. J Acoust Soc Am. 2000;107:3509–3517. doi:10.1121/1.429421
44. Janik VM, Thompson M. A Two-Dimensional Acoustic Localization System for Marine Mammals. Marine Mammal Science. 2000;16(2):437–447. doi:10.1111/j.1748-7692.2000.tb00935.x
45. Whitman EC. SOSUS. Undersea Warfare. 2005;7(1).
46. Cummings WC, Thompson PO. Characteristics and seasons of blue and finback whale sounds along the U.S. west coast as recorded at SOSUS stations. The Journal of the Acoustical Society of America. 1994;95(5):2853–2853. doi:10.1121/1.409514
47. Giraudet P, Glotin H. Real-time 3D tracking of whales by echo-robust precise TDOA estimates with a widely-spaced hydrophone array. Appl Acoust. 2006;67:1106–1117. doi:10.1016/j.apacoust.2006.05.003
48. White P, Leighton T, Finfer D, Powles C, Baumann O. Localisation of sperm whales using bottom-mounted sensors. Appl Acoust. 2006;67:1074–1090. doi:10.1016/j.apacoust.2006.05.002
49. Spiesberger JL. Linking auto- and cross-correlation functions with correlation equations: Application to estimating the relative travel times and amplitudes of multipath. The Journal of the Acoustical Society of America. 1998;104(1):300–312. doi:10.1121/1.423257
50. Nosal EM. Methods for tracking multiple marine mammals with wide-baseline passive acoustic arrays. J Acoust Soc Am. 2013;134(3):2383–2392. doi:10.1121/1.4816549
51. Baggenstross PM. An algorithm for the localization of multiple interfering sperm whales using multi-sensor time difference of arrival. J Acoust Soc Am. 2011;130(102).
52. Nosal EM, Frazer LN. Track of a sperm whale from delays between direct and surface-reflected clicks. Applied Acoustics. 2006;67(11-12):1187–1201. doi:10.1016/j.apacoust.2006.05.005
53. Michalopoulou ZH, Ma X. Source localization in the Haro Strait primer experiment using arrival time estimation and linearization. J Acoust Soc Am. 2011;(118):2924.
54. Thode AM. Three-dimensional passive acoustic tracking of sperm whales (Physeter macrocephalus) in ray-refracting environments. J Acoust Soc Am. 2005;(118):3575. doi:10.1121/1.2049068
55. Rideout B, Nosal E, Høst-Maden A. Blind channel estimation of underwater acoustic waveguide impulse responses using marine mammal vocalizations. Canadian Acoustics [Internet]. 2016;44(3).
56. Bell BM, Ewart TE. Separating Multipaths by Global Optimization of a Multidimensional Matched Filter. IEEE Transactions on Acoustic, Speech, and Signal Processing. 1986;ASSP-34(5):1029–1036. doi:10.1109/TASSP.1986.1164940
57. Freitag LE, Tyack PL. Passive acoustic localization of the Atlantic bottlenose dolphin using whistles and echolocation clicks. The Journal of the Acoustical Society of America. 1993;93(4):2197–2205. doi:10.1121/1.406681
58. López-Rivas RM, Bazúa-Durán C. Who is whistling? Localizing and identifying phonating dolphins in captivity. Applied Acoustics. 2010;71(11):1057–1062. doi:10.1016/j.apacoust.2010.05.006
59. DiBiase JH. A High-Accuracy, Low-Latency Technique for Talker Localization in Reverberant Environments Using Microphone Arrays; 2000.
60. Thomas RE, Fristrup KM, Tyack PL. Linking the sounds of dolphins to their locations and behavior using video and multichannel acoustic recordings. The Journal of the Acoustical Society of America. 2002;112(4):1692–1701. doi:10.1121/1.1494805
61. Tyack PL. An optical telemetry device to identify which dolphin produces a sound. The Journal of the Acoustical Society of America. 1985;78(5):1892–1895. doi:10.1121/1.392777
62. Watwood SL, Owen ECG, Tyack PL, Wells RS. Signature whistle use by temporarily restrained and free-swimming bottlenose dolphins, Tursiops truncatus. Animal Behaviour. 2005;69(6):1373–1386. doi:10.1016/j.anbehav.2004.08.019
63. Akamatsu T, Wang D, Wang K, Naito Y. A method for individual identification of echolocation signals in free-ranging finless porpoises carrying data loggers. The Journal of the Acoustical Society of America. 2000;108(3):1353–5. doi:10.1121/1.1287841
64. Meng W, Xie L, Xiao W. Optimal Sensor Pairing for TDOA based Source Localization and Tracking in Sensor Networks. In: 15th International Conference on Information Fusion FUSION; 2012. p. 1897–1902.
65. Isaacs JT, Klein DJ, Hespanha JP. Optimal Sensor Placement for Time Difference of Arrival Localization. In: Joint 48th IEEE Conference on Decision and Control and the Chinese Control Conference; 2009. p. 7878–7844.
66. Yang B. Different Sensor Placement Strategies for TDOA Based Localization. In: IEEE International Conference on Acoustics, Speech and Signal Processing; 2007. p. 1093–1096.
67. Zimmer WMX. Passive Acoustic Monitoring of Cetaceans. Cambridge: Cambridge University Press; 2011.
68. Mallawaarachchi A, Ong SH, Chitre M, Taylor E. Spectrogram denoising and automated extraction of the fundamental frequency variation of dolphin whistles. The Journal of the Acoustical Society of America. 2008;124(2):1159–1170. doi:10.1121/1.2945711
69. Rendell LE, Matthews JN, Gill A, Gordon JCD, MacDonald DW. Quantitative analysis of tonal calls from five odontocete species, examining interspecific and intraspecific variation. Journal of Zoology, London. 1999;249:403–410. doi:10.1111/j.1469-7998.1999.tb01209.x
70. Steiner WW. Species-Specific Differences in Pure Tonal Whistle Vocalizations of Five Western North Atlantic Dolphin Species. Behavioral Ecology and Sociobiology. 1981;9(4):241–246. doi:10.1007/BF00299878
71. Navidi W, Murphy WSJ, Hereman W. Statistical methods in surveying by trilateration. Computational Statistics & Data Analysis. 1998;27:209–227. doi:10.1016/S0167-9473(97)00053-4
72. Knapp CH, Carter C. The Generalized Correlation Method for Estimation of Time Delay. IEEE Transactions on Acoustic, Speech, and Signal Processing. 1976;24(4):320–327. doi:10.1109/TASSP.1976.1162830
73. Lin C, Liu Y, Kong F, He N. Acoustic Source Localization Based on Generalized Cross-correlation Time-delay Estimation. Procedia Engineering;15:4912–4919.
74. Oppenheim AV, Schafer RW, Buck JR. Discrete-Time Signal Processing. Upper Saddle River, NJ; 1999.
75. Kotsiantis SB, Zaharakis ID, Pintelas PE. Machine learning: a review of classification and combining techniques. Artificial Intelligence Review. 2006;26:159–190. doi:10.1007/s10462-007-9052-3
76. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32. doi:10.1023/A:1010933404324
77. Belgiu M, Drăguţ L. Random forest in remote sensing: A review of applications and future directions. ISPRS Journal of Photogrammetry and Remote Sensing. 2016;114:24–31. doi:10.1016/j.isprsjprs.2016.01.011
78. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software; 1984.
79. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20:273–297. doi:10.1023/A:1022627411411
80. Fisher RA. The use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics. 1936;7:179–188. doi:10.1111/j.1469-1809.1936.tb02137.x
81. Wahba G. Spline Models for Observational Data; 1990.
82. Williams CKI. Prediction with Gaussian Processes: From Linear Regression to Linear Prediction and Beyond. In: Jordan MI, editor. Learning in Graphical Models. NATO ASI Series (Series D: Behavioural and Social Sciences). vol. 89. Springer, Dordrecht; 1998.
83. Del Grosso VA. New equation for the speed of sound in natural waters (with comparisons to other equations). The Journal of the Acoustical Society of America. 1974;56(4):1084–1091. doi:10.1121/1.1903388
84. Branstetter BK, Moore PW, Finneran JJ, Tormey MN, Aihara H. Directional properties of bottlenose dolphin (Tursiops truncatus) clicks, burst-pulse, and whistle sounds. The Journal of the Acoustical Society of America. 2012;131(2):1613–1621. doi:10.1121/1.3676694

PLoS One. 2020 Jun 25;15(6):e0235155. doi: 10.1371/journal.pone.0235155.r001

Author response to previous submission

8 Dec 2019

Attachment

Submitted filename: FINAL_reviewer_response.pdf

PLoS One. doi: 10.1371/journal.pone.0235155.r002

Decision Letter 0

Haru Matsumoto

9 Jan 2020

PONE-D-19-33956

Machine Source Localization of Tursiops truncatus Whistle-like Sounds in a Reverberant Aquatic Environment

PLOS ONE

Dear Dr. Woodward,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Feb 23 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Haru Matsumoto

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Acknowledgments Section of your manuscript:

'We thank the National Aquarium for participating in this study, as well the National Science Foundation (Awards 1530544, 1607280), the Eric and Wendy Schmidt Fund for Strategic Innovation, and the Rockefeller University for funding.'

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

'MO Magnasco, DR Reiss

Awards 530544, 1607280

National Science Foundation

https://www.nsf.gov

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.'

Additional Editor Comments (if provided):

Dear Dr. Woodward,

The manuscript has improved, but it still lacks details and scientific discussion. Although the application of machine learning to animal localization is unique and the results are interesting, I have to agree with both reviewers that the manuscript needs a major revision. As guidelines: 1) As Reviewer 2 pointed out, PLOS ONE draws a broad audience, not just experts in machine learning or bioacoustics, so the manuscript needs more explanation to avoid losing that audience. 2) The experiment must be repeatable, with details of the experimental setup (e.g., hydrophone locations in XYZ, as pointed out by both reviewers). I am sure that marine bioacoustics researchers are eager to apply your ML method to other animals in the ocean if you can describe the experiment clearly and with little jargon. Unlike the ocean, the aquarium tank is a controlled environment. 3) Resubmission without the details of your aquarium experimental setup and methodology, especially the dimensional information, will jeopardize further review. 4) Before resubmitting the revision, please ask your coauthors to proofread your manuscript carefully in order to save the reviewers' time.

Regards,

Haru Matsumoto


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper aims to tackle an important challenge in localization: associating a recorded whistle with the sound-producing animal. Instead of TDOA-based localization algorithms, the authors classify the approximate source locations and estimate the locations (regression) using random forests, support vector machines and other machine learning algorithms. It is promising work, but the data could be summarized more clearly, the experimental setup laid out more precisely and the results reported more systematically. It seems to be a work written by a junior academic member without proofreading from the other co-authors. There are simply too many errors and inconsistencies all over the manuscript. I recognize that there might be a diamond in the rough. However, the paper has to be better organized so that the scientific findings can be communicated effectively.

Detailed comments are as follows:

In abstract, please refrain from using abbreviations without the full names. What is MAD? What is IQR? Why is SRP the abbreviation of Steered-Response? What does P stand for?

It's not clear what is meant by "...comparable accuracy - even when interpolating at several meters - in the lateral directions when deprived of training data at testing sites..." and "...in all domains...".

Line 120-123: It'll be easier to understand through an equation or an example on the figure.

Line 127: It's almost impossible to imagine what y(t) might look like. The denominator arcsin(Phi(t)) is very likely incorrect, since the input to arcsin needs to lie in [-1, 1] whereas Phi(t) can be outside this range.

Line 129: Why does m=0.8 correspond to two harmonics?

Line 133: What is "pre-speaker"?

In the legend of Figure 1, it says duration = 1 s whereas Figure 1 only shows roughly 340 ms. In addition, please use SI units everywhere in the manuscript, i.e., "s" instead of "second" or "sec". Details and examples of SI units can be seen:

https://www.bipm.org/en/measurement-units/

Line 151: What are the depths of the hydrophones? What is the separation of the four hydrophones at each location? If there is no significant separation in depth, how can you expect them to resolve the depth of the source locations?

Line 163: the authors mentioned "Audacity AUP sound format". AUP is not a sound format. It's Audacity's format for organizing projects.

Line 175: "...with sinusoids and quasi-sinusoids with the same parameters grouped together given their close similarity." There might be too many "with"s.

Line 205 - 207: Why do 27,126 Fourier transform elements correspond to 0 to 27,126 Hz? What is your FFT/DFT size? In order to have 1 Hz frequency resolution, the DFT size needs to be 192,000!
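
The arithmetic behind this point is the bin-spacing relation Δf = f_s / N for an N-point DFT. The following is a minimal sketch, assuming a sampling rate of 192 kHz (inferred from the figure of 192,000 quoted above; the rate itself is not stated in this letter). The function names are hypothetical.

```python
import math

# Assumption: 192 kHz sampling rate, inferred from the reviewer's figure of 192,000.
ASSUMED_SAMPLE_RATE_HZ = 192_000


def dft_size_for_resolution(sample_rate_hz: float, resolution_hz: float) -> int:
    """Smallest DFT length N whose bin spacing sample_rate_hz / N is <= resolution_hz."""
    return math.ceil(sample_rate_hz / resolution_hz)


def bin_spacing_hz(sample_rate_hz: float, dft_size: int) -> float:
    """Frequency spacing between adjacent DFT bins."""
    return sample_rate_hz / dft_size


n_required = dft_size_for_resolution(ASSUMED_SAMPLE_RATE_HZ, 1.0)
print(n_required, bin_spacing_hz(ASSUMED_SAMPLE_RATE_HZ, n_required))
```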

Line 210: Why is the number of features 897,871? I calculated it to be 897,736 (=6,601 x 136)

Line 251-255: The training and testing data are unclear from the authors' description, and they were not described in the previous section either. The second way seems to imply that the authors used "leave-one-out" validation, i.e., training the model on all data points except the one being tested. However, if this is true, what are the testing data for the first way? Did the authors use all of the training data for testing, which is definitely an incorrect approach?

Line 285-289: 6,788 or 6,778 features? Two different numbers appeared. The authors must do a better job of proofreading before submitting the paper.

Line 295: "...achieved 100.0% cross-validation and 100.0% test accuracy..." I doubt that this statement is true. What is your training/validation data split? After 10-fold cross-validation, how do you use the data to train the final model?

Line 296-297: How were you able to have test accuracies of 97.75% and 99.44%, respectively? Line 171 mentioned that there are 1,605 recorded tones and 10% were used for final testing. Thus, either 160 or 161 were the number of testing tones. Either number would not result in 97.75% or 99.44% unless repeated experiments were conducted and average results were reported. In addition, confidence intervals were reported. There must have been repeated experiments, and this was not mentioned at all in the methods.
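
The reviewer's argument can be checked by enumerating every accuracy that a test set of a given size can produce. The sketch below is illustrative only: the test-set sizes of 160 and 161 are the reviewer's own inference from the quoted 1,605 tones, and the helper name is hypothetical.

```python
def is_achievable(accuracy_percent: float, n_test: int, tol: float = 0.005) -> bool:
    """True if some integer number of correct predictions out of n_test yields
    accuracy_percent, allowing for rounding to two decimal places."""
    return any(
        abs(100.0 * k / n_test - accuracy_percent) <= tol
        for k in range(n_test + 1)
    )


# Reported accuracies checked against the two test-set sizes the reviewer considers.
results = {
    (n_test, acc): is_achievable(acc, n_test)
    for n_test in (160, 161)
    for acc in (97.75, 99.44)
}
for (n_test, acc), ok in results.items():
    print(f"n_test={n_test}, accuracy={acc}%: achievable={ok}")
```

A single evaluation on 160 or 161 items cannot produce either figure, which is consistent with the reviewer's point that the reported values must come from a different test-set size or from averaging.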

Line 305: What are "EP front" or "EP Wall"? A figure might be needed for readers to understand.

All figures need to have better quality. Legends are difficult to read, almost unintelligible.

Reviewer #2: This work represents a non-deterministic approach to source separation/localization using machine learning methods. The work likely has merit as proof of concept, but the lack of details makes it non-repeatable and I have some concerns about the methodology. Subsequently, I cannot recommend publication as presented. My review largely echoes those of previous reviewers and while I note that there has been some positive movement towards explaining the methods and results in full, much remains to be done before this can be accepted into the bioacoustics and source separation canon.

A major concern is how the sounds were generated. It appears that the sounds were produced sequentially at each of the speaker locations and the authors used machine learning (ML) to discriminate between these locations. Thus the authors ask, which of these locations could this sound have come from. This is a valid question from a theoretical standpoint but its value is only useful when animals are greater than half a body length apart and whistles are not produced in isolation. To make it applicable to the field the authors should ask, ‘Which animal did each source come from’. In doing so the authors would play sources from multiple locations simultaneously. This is especially important given that in ML, it’s difficult to determine which features are the most useful in the discrimination task. If the features are relative amplitude of the first arrival at the four hydrophones then the value of the proposed method is limited.

Technical Concerns

No definition about how tonal extraction was done. Were the 2 second clips taken from 1 after the onset of the tonal recording? From the peak time? From which hydrophone was the timing obtained? For example, the sound clips could be slightly offset on each hydrophone. Did an analyst manually select the clips? How the clips are extracted will determine the features useful for source separation and subsequently what their system uses to discriminate.

Organization

The authors have not organized the Methods section of the manuscript in a manner consistent with the field of bioacoustics and machine learning which is causing considerable confusion. To be consistent with the field, better document their work, and convey their results to the broadest audience possible, the authors should arrange the methods section as follows.

Array setup

This includes speaker and receiver positions. Failing to include the hydrophone locations within the body of the manuscript is inexcusable. Where hydrophones are placed relative to the sources and each other is fundamental to the study and the ability of the methods to generalize. As noted by the previous reviewers, the authors need to include this in the methods as well as in the figure.

Signal generation

This section does not need to be as extensive and could possibly go in supplemental information. One or two paragraphs with the included table will suffice.

The authors' choice to exclude harmonics may be problematic without considerable explanation. In doing so they have produced idealized tonals which are not representative of biological signals. This choice needs to be justified.

Signal acquisition

This should include how the authors parsed the signals of interest from the recordings. It is not clear how long each recording used to train the regression trees was. Did they collect two seconds around the peak of the received signal? Was it the onset or was some other method used? This is a critical and not well documented aspect of their work.

Further down the authors refer to ‘snippets’ of data. The word choice is fine but what each snippet is and how it was generated needs to go in this section.

Feature Extraction.

This section is key in understanding and replicating their work. As noted by previous reviewers, it’s still not clear. I advise that they exclude or move to the supplemental information any bits of the feature extraction analysis that weren’t used in the final model and clarify what was included.

Introduction

The value of this work concerns sound source discrimination. The introduction needs to be structured to highlight this. The authors have gone through some efforts to provide background information on the biological motivations for the system but have not done due diligence to the extensive body of work available on source separation in marine mammals. However, there are some issues in what they are presenting in overstating the value of captive studies in signature whistles. While captive studies were integral in the initial study of whistles and have some value in ontogeny initially, the limitations are considerable.

I suggest that the authors re-structure the introduction to focus only on caller discrimination and limit the potential applications to small portions of the discussion. Clarify the issue with source separation in a reverberant environment. Please see work by EM Nosal on sperm whale source separation, a variety of papers by D. Mellinger and other contemporaneous researchers. At minimum, these papers would demonstrate the proper terms in the bioacoustics field.

Methods

86 - The paragraph starting on line 86 is a single sentence. This does not help the reader understand the work.

92 – This paragraph could be considerably simplified

Consider rewording:

We generated 128 unique tonal sounds with pitch and duration within published ranges of Tursiops (table). We used frequency modulated pure tones randomly generated from XXX distribution. For this analysis, harmonics were not included. For details on the generation system see supplemental info.

92- replace ‘sounds’ with ‘tonals’ throughout to refer to your generated signals. ‘sounds’ is too vague for a bioacoustic audience

119- this should be its own section (array setup, see above)

120- Consider rewording for clarity:

'The 128 generated signals were played at each of the 14 hydrophone locations corresponding to 7 horizontal positions and two depths, xx m and yy m (Figure). Horizontal hydrophone spacing was approximately XXX between adjacent locations. See linked data below.'

Talking about the cross doesn't add much. Also nix the imperial measurements. Let’s momentarily pretend we are from a civilized country.

123-130- Unclear

Name and thank the assistants in the acknowledgments.

144- ‘Various calibrations’? Define what was calibrated. Explicit details for well established procedures are not needed but I vehemently disagree that defining what they did is ‘outside of the scope of the study’

- First reference to relevant work not included in the manuscript.

145 – Replace ‘collected at’ with ‘sampled at’ as in sample frequency

148- Replace ‘involved’ with ‘used’

148 – What does ‘Standard Passive acoustic monitoring system’ mean? Does it mean custom matlab scripts were used to autonomously record sound from the hydrophones? As far as I’m aware there are no ‘standard’ methods for PAM in matlab. Clarify.

135- Why is there so much information about the visual recording system? The need for this section is lost on me. Please clarify how the system fits into your study. Knowledge of the hydrophone and speaker locations within a tank shouldn’t require an advanced visual system unless there is considerable and undocumented flow preventing the speakers from remaining stationary. Clarify.

153- This section is the heart of what the authors did and is really difficult to get through. The jargon is very heavy and multiple concepts are introduced in the same sentences. The authors need to walk your readers through this more carefully. There is clearly a lot going on and a lot of work to be carefully discussed. Do justice to it by conveying it to a broader audience.

158- ‘digested’ is the wrong word here. Pick something more accurate.

15-- It's not a, 'so-called' feature set. It’s a feature set. Also, this is part of the confusion. It's not immediately clear what is going on here.

Consider rewording: XXXX features were created from each 4-channel recording of the simulated tonals by converting the wave forms to YYY….

161- GCC-PHAT isn’t explained but TDOA is. The authors need to spend more time on the former and less on the latter.

161 – Currently unpublished -> put it in the supplemental information

Second reference to work not included in the manuscript.

163- Remove ‘briefly’. A ‘brief’ explanation of TDOA consists of, ‘TDOA is the delay in the arrival time of a signal between multiple hydrophones, where hydrophones further from the source receive the signal later than those closer to the source’. Feel free to use this.
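
For readers unfamiliar with the term raised two comments above, GCC-PHAT (generalized cross-correlation with phase transform) estimates a TDOA by whitening the cross-spectrum of two channels so that only phase information remains, then locating the peak of the resulting correlation. The following is a generic textbook-style sketch, not the authors' implementation; the 192 kHz sampling rate and the synthetic broadband test signal are assumptions made purely for illustration.

```python
import numpy as np

ASSUMED_SAMPLE_RATE_HZ = 192_000  # assumption, not taken from the manuscript


def gcc_phat_delay(sig, ref, sample_rate_hz):
    """Estimate the delay (in seconds) of `sig` relative to `ref` using GCC-PHAT."""
    n = len(sig) + len(ref)              # zero-pad to avoid circular wrap-around
    sig_spec = np.fft.rfft(sig, n=n)
    ref_spec = np.fft.rfft(ref, n=n)
    cross = sig_spec * np.conj(ref_spec)
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / sample_rate_hz


# Synthetic demonstration: broadband noise delayed by a known number of samples.
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
true_shift = 25
sig = np.concatenate((np.zeros(true_shift), ref[:-true_shift]))
delay_s = gcc_phat_delay(sig, ref, ASSUMED_SAMPLE_RATE_HZ)
print(round(delay_s * ASSUMED_SAMPLE_RATE_HZ))
```

A positive result means the first argument arrives later than the second. Narrowband tonals give a much less sharply peaked correlation than the noise used here, which is one reason the choice of features matters in a reverberant pool.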

168- Where ‘elsewhere’? Either this concept is important and warrants explaining and citing or it should be removed.

Third reference to work not included in the manuscript.

137-173 a single sentence with multiple interjections. Lost.

Still unsure how you are getting the 120 features from the 16 hydrophones. Hydrophone schematic would help.
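
One reading of the figure of 120 is the number of unordered hydrophone pairs, C(16, 2) = 120; that this is how the authors arrived at it is an assumption here. The sketch below counts the pairs and splits them into within-array and cross-array pairs, assuming four 4-hydrophone arrays with a hypothetical numbering in which hydrophones 0-3, 4-7, 8-11 and 12-15 share an array.

```python
from itertools import combinations

N_HYDROPHONES = 16
ARRAY_SIZE = 4  # assumption: four arrays of four hydrophones, numbered consecutively

# Every unordered pair of distinct hydrophones.
pairs = list(combinations(range(N_HYDROPHONES), 2))

# Pairs whose two hydrophones sit in the same array versus different arrays.
same_array = [(i, j) for i, j in pairs if i // ARRAY_SIZE == j // ARRAY_SIZE]
cross_array = [(i, j) for i, j in pairs if i // ARRAY_SIZE != j // ARRAY_SIZE]

print(len(pairs), len(same_array), len(cross_array))
```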

176- ‘snippets’- this is referenced above and should be defined in your data acquisition.

176-178- this bit is actually quite good and understandable.

179- Would be nice to know what the geometry is…

180 – this section isn’t clear again. Do you mean that the cross correlation time did or did not include second through n-th arrivals?

187- So the only information going into the regression tree is TDOA, cross correlation and GCC-PHAT? Be explicit here to give your readers a break. Remove what didn’t work or add it in the supplemental information.

187 – What do you mean ‘processing’ for each whistle? I thought the snippets were processed for the feature set, not the other way around.

187- This is the first reference to the labeled data. In the ‘feature set’ or earlier in the methods state that you are using supervised learning and your targets and label sets represent the potential source location and XYZ coordinates.

189- Typical phrases in ML to refer to data used to train and test the model are ‘Training data’, ‘Test data’, and/or ‘Validation data’. The wording here is awkward.

191- what was ‘novel’ about the whistle? Again stick to the same term for the sounds you generated. I suggest, as above, ‘tonal’ and only use ‘whistle’ to refer to the actual, real-life whistles coming from the animals.

191- This isn’t true, the authors could batch generate, but nevertheless the sentence should go. Consider rewording: ‘We chose the Breiman random forest for the classification task due to its ability to reduce the feature space and address whilst performing multi-class classification.’

193-196- This is a single sentence with two interjections. Difficult to read; consider rewording: ‘The Breiman random forest is a multi-class classifier with a built-in resistance to overfitting through XXXXX. Additionally, the classifier performs feature reduction through YYY.’

200- CART acronym without explanation. Write it out the first time.

204 – This remark can easily be read as condescending, especially when taken in context with the various other references to work not included in the paper.

211 – unclear what ‘additional’ models are

221- The localization portion of your study needs its own section.

221- replace ‘training sounds’ with ‘snippets’ or ‘tonals’ depending on whether you are referring to the sounds you generated or the sounds you extracted. Consistency.

121-123- Are you referring to the features extracted from each recording? I’m very lost.

222- Three SVM models? In the start of the to-be-created localization section, please provide a brief 1 or 2 sentence overview to SVM and how you use it here.

223- grid point? Be specific throughout.

227- This justification should be in the start of the section about the localization

229- Employed- I hope you paid it a living wage. Replace with ‘used’.

230 – Is the ‘standard’ approach what is referenced in your citation? Or just some details? Clarify.

231- Is this what you mean by grid-point? Note how much later the definition is than its first use. If not, clarify.

231-236- This could be simplified. Consider:

We divided the available pool space into 6 cm grid squares representing all potential source locations. For each grid corner, we calculated the expected TDOA of a source at that location to each of the 16 hydrophones.
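
The grid computation described in this suggested rewording can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the hydrophone coordinates, the grid extent and the nominal sound speed of 1500 m/s are all hypothetical placeholders, and a constant sound speed with straight-line propagation is assumed.

```python
import itertools
import math

ASSUMED_SOUND_SPEED_M_S = 1500.0  # assumption: nominal value, not the measured pool value


def grid_points(extent_m, spacing_m=0.06):
    """Regular grid of candidate source positions covering a box of size extent_m."""
    counts = [int(round(e / spacing_m)) + 1 for e in extent_m]
    return [
        tuple(i * spacing_m for i in idx)
        for idx in itertools.product(*(range(n) for n in counts))
    ]


def expected_tdoas(source, hydrophones, sound_speed_m_s=ASSUMED_SOUND_SPEED_M_S):
    """Expected TDOA for each hydrophone pair (i, j): positive when j is farther away."""
    dists = [math.dist(source, h) for h in hydrophones]
    return {
        (i, j): (dists[j] - dists[i]) / sound_speed_m_s
        for i, j in itertools.combinations(range(len(hydrophones)), 2)
    }


# Hypothetical hydrophone coordinates in metres (the real array has 16 hydrophones).
hydrophones = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
grid = grid_points((0.6, 0.6, 0.12))
table = {point: expected_tdoas(point, hydrophones) for point in grid}
print(len(grid), len(table[grid[0]]))
```

A steered-response search then compares measured delays against this table to pick the best-matching grid point.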

233-236- imprecise wording. I know what the authors mean but others may not and all struggle.

237- I assume it’s a single value for soundspeed. State the calculated soundspeed or provide a figure of the soundspeed profile of the pool.

242- state the purpose of the procedure at the start of the paragraph, not the end.

Results

248- Remove, ‘described above’. Else replace it with a section reference.

252- Not sure what is meant by this.

254- Completely lost. By array you mean the 4 connected hydrophones? The descriptions of when the authors use all 16 hydrophones and when arrays are treated separately isn’t clear to the reader.

248-255- 100% accuracy while increasing the model size strongly suggests overfitting

257-258- This should be in the methods

264- Comma after ‘again’

265- first mention of kriging. Methods.

235- replace ‘test sound’ coordinates with ‘speaker’ or ‘sound source’ coordinates

270- The error looks considerably worse in the Z direction in comparison to the X or Y. This is useful information that I hope is brought up in the discussion

271-277 single sentence

217-281 – This would be useful as a table with rows being the axes and columns as the models

Figure 4- Bar graph or histogram. Remove ‘x-ticks’ denote histogram edges. This is confusing, matlab speak that isn’t useful for non-matlab users.

The figure itself is hard to parse. The variables are the models, the different axes, and the histogram bins. The comparison the authors should be making is the different models (panels) for each axis. So, I suggest the authors make each axis a different panel and the colors should represent the models. This will make the model comparison significantly easier to see.

Figure 5- I don’t see a lot of value added by this figure.

Figure 5- caption – Remove ‘discussed in the text’. Much of the rest of the caption should be in the methods.

289- removed ‘so-called’,

289-290 – methods.

294- replace ‘ask’ with ‘determine’

298-299- Major point, don’t bury it at the end of the results

Figure 6- Unit labels for colorbars. Replace hydrophone ‘label’ with hydrophone ‘number’; provide a figure of the hydrophone layout with each hydrophone number.

Figure 6 label – ‘common panels’ what? First reference. Describe better in the ‘array setup’ section to be added.

TDOA features do not seem very useful here. Highlight in results.

Discussion

301-303-Run on

307- I presume by ‘sound sources’ you are referring to free-swimming animals? If so say it.

307-310- run on

309-311- unintelligible. Reword.

312 – Are you sure recordings is the word you want? 4-channel arrays/panels? It should be arrays but this needs to be made clear throughout.

314- EP already defined (or should have been)

317- sound ‘source’ originated

321-326- run on

331-333- run on. Start with, ‘Also, it is reassuring that a….’. Try to always put the subject next to the verb or risk sounding like Yoda, do you.

334- Which question? There was no stated question

340- It’s not so much the length of the dolphin that’s important, it’s the width. The sound source in a dolphin is near the front of the animal. So, the authors need to highlight that this method shows promise when animals’ heads are greater than a meter apart, which may, or may not, be tenable.

355- Just add the citation after ‘elsewhere’ and remove the rest.

356-Move ‘referring to figure 5’ to later in the sentence. Just reference it as (fig 5)

I think you should replace (EP-Wall) and EP dimensions with X, Y, and Z or lat, lon, depth. Something more intuitive will allow for readers to better understand the results

370- time-of-flight? Used in abstract and elsewhere. Undefined.

371- Reword to say amplitude was not included rather than that it was removed. Highlight just TDOA and other methods that were included. Consider, ‘In this work TDOA and cross-correlation values were used to discriminate between source locations. Direct or relative amplitude was not included in the feature set’. Or something like that.

Acknowledgments

This is an online journal without word limits. Name and graciously thank the two-dozen people who helped you. Don’t be lazy.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Jun 25;15(6):e0235155. doi: 10.1371/journal.pone.0235155.r003

Author response to Decision Letter 0

21 Feb 2020

Dear Dr. Matsumoto and Reviewers,

We thank you for the time and patience you have dedicated to reading and thoroughly, constructively criticizing our paper. In almost all cases we have striven to follow your direction, and we hope that the results please you. To a few, mostly minor, points we do raise objections and explain ourselves. Most notably, we do not agree with Reviewer 2’s suggestion to strike our motivating application of the proposed methods, namely whistle localization for dolphins in an enclosed environment, from the Introduction and generally the paper. We feel this application is inextricably linked to our study design. In any case, we respond to this and all other specific comments below.

Thank you,

Sean Woodward

Diana Reiss

Marcelo Magnasco

Responses to Editor and Reviewer comments are given in bold red (only in PDF version).

PONE-D-19-33956

Machine Source Localization of Tursiops truncatus Whistle-like Sounds in a Reverberant Aquatic Environment

PLOS ONE


(1) Per Reviewer 2’s detailed comments, we have added more explanation of our methods. While we have also provided more background on pre-established methods, unless specifically requested by Reviewer 2 we were sometimes unclear as to where more background is desired, and how much. The paper makes mention of many pre-established methods (that are arguably not expert-level machine learning or bioacoustics, such as the decision tree), and we question whether thoroughly introducing the reader to them all is what’s desired and/or appropriate. That being said, again we have added considerable background, and we continue to be open to adding more.

(2, 3) We have added this information.

(4) This has been done.


Reviewers' comments:


Detailed comments are as follows:

In abstract, please refrain from using abbreviations without the full names. What is MAD? What is IQR? Why is SRP the abbreviation of Steered-Response? What does P stand for?

All abbreviations in the Abstract are now defined. We have reworded the troublesome lines.

Line 120-123: It'll be easier to understand through an equation or an example on the figure.

An equation has been added.

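Line 127: It's almost impossible to imagine what y(t) might look like. The denominator arcsin(Phi(t)) is very likely incorrect, since the input to arcsin needs to lie in [-1, 1] whereas Phi(t) can be outside this range.

A mistake in the denominator has been corrected; it resulted from a source material conflict. We apologize for the error.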

Line 129: Why does m=0.8 correspond to two harmonics?

We have added more detail about the equation, being more explicit about the mechanism through which harmonics are added (specifically, the triangularization of the underlying waveform) and the fact that harmonics successively decrease in signal power. As we write, at m=0.8 we do not see a third harmonic above noise in practice.

Line 133: What is "pre-speaker"?

Word choice has been changed: “…example of a tonal generated by the above procedure…”

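In the legend of Figure 1, it says duration = 1 s whereas Figure 1 only shows roughly 340 ms. In addition, please use SI units everywhere in the manuscript, i.e., "s" instead of "second" or "sec". Details and examples of SI units can be seen:

https://www.bipm.org/en/measurement-units/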

The bad duration (from a previous, different choice of tonal) has been fixed.

We now use SI units everywhere.

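Line 151: What are the depths of the hydrophones? What is the separation of the four hydrophones at each location? If there is no significant separation in depth, how can you expect them to resolve the depth of the source locations?

Hydrophone coordinates are now given, with their relationship to the pool surface provided in the text. Hydrophones are approximately 1.60 +/- 0.22 m under the surface. Yes, we expected decreased resolution in the direction of depth (theoretically these separations are not necessarily limiting), but how much depended on the particulars of the sounds and acoustic environment.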

Line 163: the authors mentioned "Audacity AUP sound format". AUP is not sound format. It's Audacity's format for organizing projects.

This has been clarified.

Line 175: "...with sinusoids and quasi-sinusoids with the same parameters grouped together given their close similarity." There might be too many "with"s.

Fixed.

Line 205 - 207: Why do 27,126 Fourier transform elements correspond to 0 to 27,126 Hz? What is your FFT/DFT size? In order to have 1 Hz frequency resolution, DFT size needs to be 192,000!

Per Reviewer 2’s comments, we have decided to eliminate discussion of these features, as they ultimately went entirely unused. But, yes, we used large windows to eliminate time dependence.
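The reviewer's DFT-size figure follows from the standard relation between sampling rate, transform length, and bin spacing (bin spacing = sampling rate / number of points). The sketch below is only an illustration of that arithmetic; the 192 kHz sampling rate is an assumption inferred from the reviewer's number and is not stated in this exchange, and the function names are hypothetical.

```python
import math


def dft_bin_spacing(sample_rate_hz: float, n_points: int) -> float:
    """Spacing between adjacent DFT bins, in Hz."""
    return sample_rate_hz / n_points


def dft_size_for_resolution(sample_rate_hz: float, resolution_hz: float) -> int:
    """Smallest DFT length whose bin spacing is at most resolution_hz."""
    return math.ceil(sample_rate_hz / resolution_hz)


# Assumed sampling rate (inferred from the reviewer's figure, not confirmed here).
SAMPLE_RATE_HZ = 192_000

print(dft_size_for_resolution(SAMPLE_RATE_HZ, 1.0))  # 192000
print(dft_bin_spacing(SAMPLE_RATE_HZ, 192_000))      # 1.0
```

Under this assumption, a one-second window is the shortest that gives 1 Hz bins, which is consistent with the authors' remark that large windows were used.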

Line 210: Why is the number of features 897,871? I calculated it to be 897,736 (=6,601 x 136)

Your number is correct, ours was not updated. We apologize.

We’ve clarified our methods. Except for the random forest (for which OOB error was used), all validation error was determined by 10-fold cross validation on the training/validation data. It would have been reasonable at this stage to perform hyperparameter optimization, but in essentially every case we didn’t, either because the model class doesn’t heavily rely on it and/or because we didn’t want to further promote overfitting (in the case of the basic classification tree) and were content with off-the-shelf prediction. We are unclear about the meaning of the reviewer’s last sentence. We didn’t include testing data among our training/validation data. However, we did aggregate all training/validation data into final models before evaluation on the test data, but this is correct.
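As a rough sketch of the 10-fold cross-validation described above (not the authors' code), the indices of the training/validation tonals can be partitioned as follows. The sample count of 1,605 is the training/validation size quoted elsewhere in this response; the shuffling seed and the function name are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


N_TRAIN_VAL = 1605  # training/validation tonals, as quoted in this response
folds = list(k_fold_indices(N_TRAIN_VAL, k=10))
fold_sizes = [len(val) for _, val in folds]
print(fold_sizes)  # [161, 161, 161, 161, 161, 160, 160, 160, 160, 160]
```

Each tonal is held out exactly once; the validation error is the average over the ten held-out folds, and the separate test set is never touched during this step.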

Line 285-289: 6,788 or 6,778 features? Two different numbers appeared. The authors must do a better job of proofreading before submitting the paper.

The latter. We apologize, it has been corrected.

We used a 9:1 training/validation:test set split. For test-set evaluation, all data from the training/validation set were used to train the model. Hyperparameter optimization was not performed with evaluation on the test set, and actually not performed on the training/validation set either (though regression model selection was, which we now state explicitly); LDA has no hyperparameters, we pre-specified the main hyperparameter choice for the SVM’s (kernel class), and default hyperparameters were kept for the classification tree for fear of over-fitting.

We profusely apologize: the training/validation set size of 1,605 was given where the total of 1,783 should have been. The 10% testing size was thus 178, and 97.75% and 99.44% correspond to 174 and 177 tonals, respectively.
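The set sizes and accuracies quoted here can be cross-checked with a few lines of arithmetic. The sketch below converts the reported percentages back to counts out of the test set; the helper name is hypothetical.

```python
TOTAL_TONALS = 1783
TRAIN_VAL_TONALS = 1605
test_tonals = TOTAL_TONALS - TRAIN_VAL_TONALS  # the held-out 10%


def correct_count(accuracy_percent, n_samples):
    """Number of correctly classified samples implied by a percentage."""
    return round(accuracy_percent / 100 * n_samples)


print(test_tonals)  # 178
for accuracy in (97.75, 99.44):
    k = correct_count(accuracy, test_tonals)
    # Recompute the percentage from the count as a consistency check.
    print(accuracy, k, round(100 * k / test_tonals, 2))
# 97.75 174 97.75
# 99.44 177 99.44
```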

Line 305: What are "EP front" or "EP Wall"? A figure might be needed for readers to understand.

The now-updated axis names are included in two figures now.

All figures need to have better quality. Legends are difficult to read, almost unintelligible.

The quality of our figures was/is significantly degraded during the PLOS reviewer manuscript creation process. We confirmed this with PLOS help staff, who state that at this stage the figures must merely be high-quality enough for content review. We apologize for any inconvenience this causes, and almost everywhere have provided higher-resolution figures to try to compensate for review.

The reviewer suggests interesting avenues for investigation; however, we do feel that it is valid to consider the localization of whistles that do not occur simultaneously. This is a realistic problem, at least in the captive setting where vocalizations are often sparse. Also, we note it remains unclear whether our limited training set demonstrates the best potential accuracy available to the machine learning models considered here.

Moreover, because we suggest methods of source localization based on whistle fundamentals alone, there exists the possibility of applying these methods to simultaneously occurring whistles whose fundamentals can be satisfactorily separated by filtering. Of course, we admit there still exists the opportunity to extend these methods to overlapping traces.

The problem of classifying sounds by speaker identity rather than by source location is fascinating (having been explored only for signature whistles, to our knowledge); however, addressing it is not currently within our reach. We have imagined using the present source localization methods to aid in the whistle attribution for dolphins at the National Aquarium in order to create a large training set suited to this purpose. But fate has been cruel to our experimental setup and we must now leave this task to others.

Technical Concerns

We have added these details (Line ~281), within the new Methods structure the reviewer suggested. We also discuss how our choice of features was envisioned to treat our automatically-extracted tonals the same as manually-extracted whistles (i.e., by not assuming fixed times between window starts and tonal onsets).

Organization

Array setup

Signal generation

This section does not need to be as extensive and could possibly go in supplemental information. One or two paragraphs with the included table will suffice.

Signal acquisition

Further down the authors refer to ‘snipits’ of data. The word choice is fine but what each snipit is and how it was generated needs to go in this section.

Feature Extraction.

We have executed the recommended restructuring.

Hydrophone coordinates are now provided (Table 1).

The exclusion of whistle harmonics is now discussed around Line 220. The main justification is that we did not want to make unnecessary assumptions about the fundamental-harmonic relationships during signal generation (this is why we ultimately decided to remove the faux-harmonics from the triangular waveform tonals), when with appropriate filtering the problem of localizing real whistles should be reducible to the problem of localizing fundamentals, given their frequency separation. As we see it, the harmonics simply offer extra information that could potentially help otherwise successful models.

Introduction

The present research project’s design is highly particular to the study of dolphin whistles in an enclosed environment; from our assumptions about the “sound space” and our restrictive choice of training sounds (sinusoidal narrowband sounds), to our assumption that sound sources are acoustically identical (which dolphins tentatively are with respect to non-signature whistles), to our choice of environment (a concrete dolphin tank), to the choice of sound localization methods (which require spatial sampling, practical mainly for relatively small environments), the design is specific to our proposed application. While we hope that the methods proposed here have other applications, our study was not designed with them in mind, and -- partly given the black box nature of machine learning -- we are not positioned to predict to which of these more distant applications our results might be relevant. Similarly, among all potential applications (especially within the area of “caller discrimination”) we suspect our study is of primary importance to the particular biological application we propose, which we feel is indeed valid. We feel that it would be unfair to the authors to deprive the paper of this relevance. While, alternatively, the reviewer may suggest we frame this project as a purely abstract exercise, we feel we do have sufficient justification for equating our design to this real and valid application, the whistle source localization for bottlenose dolphins in an enclosed environment.

Studying the vocalizations of captive dolphins has limitations, yes, but also presents special opportunities. Rarely in other contexts is it possible to monitor a fixed group of individuals continuously for months at a time or longer -- this is still not possible with dolphin dtags. Moreover, rarely in other contexts is training and psychological/behavioral experimentation possible, which include (humane) methods that have proven uniquely helpful for studying dolphin cognition and language comprehension. We broadly refer to the studies of the late Louis Herman and humbly opine that much of his aquaria research could not realistically be replicated in the field at this time. As reflected by the sophistication of the field of birdsong research and its relationship to captive studies, we feel that responsible captive studies serve an established scientific role in the study of language and related cognition that cannot be discarded simply for the loss of some wild behavior. Of course, captive research’s relationship to animal welfare is an important topic worthy of consideration. However, as we are merely addressing an ancillary problem to captive research (a problem that may also extend to purely passive research on animals in a sanctuary environment), we feel that at present this topic is not relevant.

We await more feedback on the above. In the meantime, we are happy to continue to make our methods understandable and recognizable within the broader field of marine mammal communication. To this end, we have significantly expanded our Introduction to include more background information about sound source separation/attribution, with special attention paid to marine mammalogy. We do try to convey that our goals deviate slightly from those of sound source separation as described in other contexts.

Methods

86 - The paragraph starting on line 86 is a single sentence. This does not help the reader understand the work.

The one sentence was broken into two, with clearer separation between clauses.

92 – This paragraph could be considerably simplified

Consider rewording:

We thank the reviewer for the suggested wording, and have incorporated it into our new version.

92- replace ‘sounds’ with ‘tonals’ throughout to refer to your generated signals. ‘sounds’ is too vague for a bioacoustic audience

We now refer to our generated/created sounds as “tonals.”

119- this should be its own section (array setup, see above)

Methods have been restructured.

120- Consider rewording for clarity:

Talking about the cross doesn't add much. Also nix the imperial measurements. Let’s momentarily pretend we are from a civilized country.

Recommendations have been adopted.

123-130- Unclear

Name and thank the assistants in the acknowledgments.

This is indeed appropriate. The main author concedes his negligence and thanks the reviewer for this especially important correction.

- First reference to relevant work not included in the manuscript.

Clarification added. We make it explicit that the localization methods proposed in this study did not make use of measured hydrophone coordinates (which motivated our previous comment about scope). In light of this, we opted to remove mention of the work we previously indicated as relevant instead of citing it, because it has not yet undergone peer review; we would be happy to reintroduce the mention with a citation if the editor and reviewer prefer.

145 – Replace ‘collected at’ with ‘sampled at’ as in sample frequency

Done.

148- Replace ‘involved’ with ‘used’

Done.

We were referring to passive operation of the sound recording system (i.e., for multi-day/week recordings), which we distinguish from the operator-centric, Audacity-based mode of operation used during this study. We have opted to delete mention of passive system operation for clarity.

The extent of the camera system is not relevant to this study. We have removed mention of the total camera count along with mention of the camera make/model from the text body. We have left make/model information in the image caption, where we feel it is not out of place.

We have expanded this section, including more explanation. We acknowledge that PLOS One appeals to a broad audience; at the same time, we note that this paper has a lot of ground to cover and are unsure how much explanation is warranted. We can open up this section further should the reviewers like.

158- ‘digested’ is the wrong word here. Pick something more accurate.

This word and sentence has been removed as part of a larger rewrite.

15-- It's not a, 'so-called' feature set. It’s a feature set. Also, this is part of the confusion. It's not immediately clear what is going on here.

Consider rewording: XXXX features were created from each 4-channel recording of the simulated tonals by converting the wave forms to YYY….

The recommendation has been incorporated into a rewrite.

161- GCC-PHAT isn’t explained but TDOA is. The authors need to spend more time on the former and less on the latter.

We have expanded the explanation of GCC-PHAT.

161 – Currently unpublished -> put it in the supplemental information

Second reference to work not included in the manuscript.

This reference has been removed.

A much briefer definition of the TDOA has been included elsewhere as part of a larger rewrite.

168- Where ‘elsewhere’? Either this concept is important and warrants explaining and citing or it should be removed.

Third reference to work not included in the manuscript.

This reference has been removed.

137-173 a single sentence with multiple interjections. Lost.

Still unsure how you are getting the 120 features from the 16 hydrophones. Hydrophone schematic would help.

The troublesome sentence has been removed as part of a rewrite.

More information about hydrophone geometry has been provided, in the new Hydrophone Setup subsection.

We now share the calculation of the number of unique hydrophone pairs in 16, namely “16 choose 2” = 16! /(2! x 14!) = 120.
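The pair count is easy to verify by enumeration. In this short sketch, numbering the hydrophones 1 to 16 is an assumption made purely for illustration.

```python
import math
from itertools import combinations

N_HYDROPHONES = 16

# Every unordered pair of distinct hydrophones (numbering 1..16 is illustrative).
pairs = list(combinations(range(1, N_HYDROPHONES + 1), 2))

print(len(pairs))                    # 120
print(math.comb(N_HYDROPHONES, 2))   # 120, i.e. 16! / (2! x 14!)
print(pairs[:3])                     # [(1, 2), (1, 3), (1, 4)]
```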

176- ‘snipits’- this is reference above and should be defined in your data acquisition.

We have defined snippets.

176-178- this bit is actually quite good and understandable.

Thank you!

179- Would be nice to know what the geometry is…

Now included in the new Hydrophone Setup subsection.

180 – this section isn’t clear again. Do you mean that the cross correlation time did or did not include second through n-th arrivals?

We expanded this section. What we’re saying is that we truncated the cross-correlations generously enough to ensure inclusion of first-arrival TDOA information for any given tonal between any given pair of hydrophones. TDOA information for some pairs of later arrivals is included, we just do not make specific guarantees. And note that pairs of later arrivals do not necessarily correspond to larger TDOA’s than the first-arrival TDOA’s. In fact, pairs of later arrivals can correspond to smaller TDOA’s. Take, for instance, a complex multi-path to a nearby hydrophone versus a simple multi-path to a distant hydrophone, for some source location; these higher-order paths can be imagined to be arbitrarily similar in length and to correspond to a TDOA arbitrarily close to zero.
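The truncation argument can be sketched numerically. For direct (first-arrival) paths, the triangle inequality bounds the TDOA magnitude between two hydrophones by their separation divided by the sound speed, so a lag window at least that wide always contains the first-arrival TDOA. The sound speed, sampling rate, and coordinates below are illustrative assumptions, not values from the manuscript, and the function names are hypothetical.

```python
import math

SOUND_SPEED_M_S = 1500.0   # assumed nominal value for water
SAMPLE_RATE_HZ = 192_000   # assumed sampling rate


def max_first_arrival_lag(h1, h2, c=SOUND_SPEED_M_S, fs=SAMPLE_RATE_HZ):
    """Lag bound (samples) that covers every direct-path TDOA for this pair."""
    separation_m = math.dist(h1, h2)
    return math.ceil(separation_m * fs / c)


def direct_path_tdoa(source, h1, h2, c=SOUND_SPEED_M_S):
    """Direct-path time difference of arrival (seconds) between h1 and h2."""
    return (math.dist(source, h1) - math.dist(source, h2)) / c


def truncate_cross_correlation(xcorr, max_lag):
    """Keep lags -max_lag..+max_lag of a cross-correlation centered on zero lag."""
    center = len(xcorr) // 2
    return xcorr[center - max_lag:center + max_lag + 1]


# Hypothetical hydrophone positions (meters).
h1, h2 = (0.0, 0.0, -1.6), (10.0, 3.0, -1.4)
bound = max_first_arrival_lag(h1, h2)

# Any source position yields a direct-path TDOA inside the window.
for source in [(2.0, 5.0, -3.0), (12.0, -4.0, -0.5), (5.0, 1.5, -6.0)]:
    lag_samples = direct_path_tdoa(source, h1, h2) * SAMPLE_RATE_HZ
    assert abs(lag_samples) <= bound
```

The bound says nothing about later arrivals: as the response notes, pairs of reflected paths can produce lags anywhere inside (or outside) this window, so their inclusion is incidental rather than guaranteed.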

Done.

187 – What do you mean ‘processing’ for each whistle? I thought the snipits were processed for the feature set, not the other way around.

The use of “processing” in this way has been removed as part of a larger rewrite.

The word is now introduced earlier (Line 316).

189- Typical phrases in ML to refer to data used to train and test the data are ‘Training data’, ‘Test Data’, and/or ‘Validation data’. The wording here is awkward

Introduced relevant terminology earlier, in part to avoid awkward wording here.

Use of “novel” as in “novel whistle” has been deemed unnecessary and removed in most cases. We were referring to tonals that were not used in a model’s training, but were not necessarily in the test set. We continue to use “novel” with respect to tonals from un-trained source locations, but are clearer about this meaning.

The suggested change in terminology has been implemented.

While the original sentence was overly dramatic (and removed), we ultimately stand by it; while the authors are aware of batch training, an excessively wide feature set is not as aptly addressed by batch training as an excessively deep one, for which the method is ultimately intended.

The writers have reduced the other sentence for clarity.

The sentence has been reworded for clarity.

200- CART acronym without explanation. Write it out the first time.

Done.

204 – This remark can easily be read as condescending, especially when taken in context with the various other references to work not included in the paper.

We have rephrased this remark.

211 – unclear what ‘additional’ models are

Hopefully this has been clarified.

221- The localization portion of your study needs its own section.

Done.

221- replace ‘training sounds’ with ‘snipits’ or ‘tonals’ depending on whether you are referring to the sounds you generated or the sounds you extracted. Consistency.

We restrict ourselves to these two terms now.

Done.

121-123- Are you referring to the features extracted from each recording? I’m very lost.

This section has largely been rewritten.

222- Three SVM models? In the start of the to-be-created localization section, please provide a brief 1 or 2 sentence overview to SVM and how you use it here.

Given the line referenced and the mention of three models, we think the reviewer is referring to our Gaussian process regression models. We have therefore included more information about Gaussian process regression.

223- grid point? Be specific throughout.

We removed references to grid points (outside of the SRP-PHAT section).

227- This justification should be in the start of the section about the localization

We have considered this comment, however it seems to us that talking about the limitations of interpolative regression on our test data before we’ve explained the interpolative regression may confuse the reader. Also, we put regression after classification because it seems like the natural order of problem difficulty.

229- Employed- I hope you paid it a living wage. Replace with ‘used’.

Done.

230 – Is the ‘standard’ approach what is referenced in your citation? Or just some details? Clarify.

We have expanded/clarified “standard methodology” per this comment.

231- Is this what you mean by grid-point? Note how much later the definition is than its first use. If not clarify.

The ambiguity has been resolved by limiting references of “grid points” to this section.

231-236- This could be simplified. Consider:

Thank you for the recommendation, we pulled from it for our rewrite.

233-236- imprecise wording. I know what the authors mean but others may not and will struggle.

We are now more precise and explicit in our explanation.

237- I assume it’s a single value for soundspeed. State the calculated soundspeed or provide a figure of the soundspeed profile of the pool.

The value is now given.

242- state the purpose of the procedure at the start of the paragraph, not the end.

Done. Also, the sentence has been rephrased.

Results

248- Remove, ‘described above’. Else replace it with a section reference.

Done.

252- Not sure what is meant by this.

The sentence has been reworded for clarity.

254- Completely lost. By array you mean the 4 connected hydrophones? The descriptions of when the authors use all 16 hydrophones and when arrays are treated separately isn’t clear to the reader.

In all instances we now refer to our hydrophone arrays as “4-hydrophone-arrays.”

248-255- 100% accuracy while increasing the model size strongly suggests overfitting

The authors do not fully understand this criticism. Adding more trees to a random forest to minimize validation error is standard practice. As an ensemble method in which the trees, each trained on a subset of features, “vote” on the classification of each sample, adding trees is not anticipated to increase overfitting (see https://arxiv.org/abs/1407.7502). In any case, checking for overfitting is the purpose of evaluation on the held-out test set. Moreover, that our linear models achieved similar results also argues against overfitting.
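The voting intuition can be illustrated with an idealized calculation. The sketch below is not the authors' analysis and is not a random forest: it assumes a binary decision and fully independent voters, each correct with the same probability, whereas real trees are correlated and the task here is multi-class. Under those stated assumptions, it computes the probability that a strict majority is correct as the number of voters grows.

```python
import math


def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a strict majority of independent voters is correct.

    Idealized model: binary decision, odd n_voters, identical and
    independent voters. Real ensemble members are correlated.
    """
    needed = n_voters // 2 + 1
    return sum(
        math.comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# Accuracy of the vote for a growing ensemble of weak (60%) voters.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

In this toy model the vote's accuracy rises with ensemble size rather than degrading, which is the sense in which adding voters is not expected to make generalization worse; held-out evaluation remains the actual check.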

257-258- This should be in the methods

Yes, done.

264- Comma after ‘again’

Done.

265- first mention of kriging. Methods.

This was in methods, however we’ve added more description with regard to our motivation to use the method.

235- replace ‘test sound’ coordinates with ‘speaker’ or ‘sound source’ coordinates

We now refer primarily to “source coordinates of tonals.”

270- The error looks considerably worse in the Z direction in comparison to the X or Y. This is useful information that I hope is brought up in the discussion

It is! Yes, we think it is important as well.

271-277 single sentence

The sentence has been split for clarity.

217-281 – This would be useful as a table with rows being the axes and columns as the models

Figure 4- Bar graph or histogram. Remove ‘x-ticks’ denote histogram edges. This is confusing, matlab speak that isn’t useful for non-matlab users.

We have rearranged the plots as recommended; we now have one for each axis. We have removed the reference to the ticks.

Figure 5- I don’t see a lot of value added by this figure.

This figure compares the overall accuracy of the different methods. We feel this graph summarizes a main finding of the paper -- the comparison of Gaussian process regression with SRP -- in a statistically correct way that the table of MAD’s and histograms do not, and in a way that we could not execute as succinctly in text.

Figure 5- caption – Remove ‘discussed in the text’. Much of the rest of the caption should be in the methods.

Done.

289- removed ‘so-called’,

Done.

289-290 – methods.

This was already included in Methods, but is now more prominent.

294- replace ‘ask’ with ‘determine’

Done.

298-299- Major point, don’t bury it at the end of the results

We now highlight this result in the Abstract and Introduction.

Figure 6- Unit labels for colorbars. Replace hydrophone ‘label’ with hydrophone ‘number’. Provide figure of hydrophone layout with each hydrophone number.

Done.

Figure 6 label – ‘common panals’ what? First reference. Describe better in ‘array setup’ section to be added

TDOA features do not seem very useful here. Highlight in results.

We now refer to “4-hydrophone-arrays” consistently. The Hydrophone Setup subsection has been added.

Discussion

301-303-Run on

The sentence has been split for clarity.

307- I presume by ‘sound sources’ you are referring to free-swimming animals? If so say it.

Done.

307-310- run on

The sentence has been split for clarity.

309-311- unintelligible. Reword.

The sentence has been reworded for clarity.

312 – Are you sure recordings is the word you want? 4-channel arrays/panels? It should be arrays but this needs to be made clear throughout.

We now refer to “WAV samples” and “snippets” (previously defined) here.

We now universally refer to arrays/panels as “4-hydrophone-arrays” for consistency and clarity.

314- EP already defined (or should have been)

This was a reminder. Re-definition has been removed, however.

317- sound ‘source’ originated

We now universally refer to “[sound] source locations.”

321-326- run on

The sentence has been split for clarity.

331-333- run on. Start with , ‘Also, it is reassuring that a….’. Try to always put the subject next to the verb or risk sounding like Yoda, do you.

We humbly suggest that Yoda’s ears suggest a level of expertise in the area of sound source localization that makes him worth evoking.

Fixed!

334- Which question? There was no stated question

We have rephrased this.

Although we continue to reference dolphin body length as a crude yardstick, we acknowledge the importance of the reviewer’s point and have indicated that our method’s error is greater than the minimum possible separation between two dolphins’ heads.

355- Just add the citation after elsewhere remove the rest.

Done.

356-Move ‘referring to figure 5’ to later in the sentence. Just reference it as (fig 5)

I think you should replace (EP-Wall) and EP dimensions with X, Y, and Z or lat, lon, depth. Something more intuitive will allow for readers to better understand the results

The reference has been moved.

We now use X, Y, Z for pool axis names.

370- time-of-flight? Used in abstract and elsewhere. Undefined.

We no longer reference time-of-flight.

Done.

Acknowledgments

This is an online journal without word limits. Name and graciously thank the two-dozen people who helped you. Don’t be lazy.

Yes, we recognize our error and gladly include all names!

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Attachment

Submitted filename: Response to Reviewers.pdf (169.6 KB, PDF)

PLoS One. doi: 10.1371/journal.pone.0235155.r004

Decision Letter 1

Haru Matsumoto

2 Mar 2020

PONE-D-19-33956R1

Machine Source Localization of Tursiops truncatus Whistle-like Sounds in a Reverberant Aquatic Environment

PLOS ONE

Dear Dr. Woodward,

We would appreciate receiving your revised manuscript by Apr 16 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Haru Matsumoto

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

Dear Dr. Woodward,

I believe that it is the first application of ML to the localization problem of the sound sources in the water. Your research is unique and results are interesting. You have improved the manuscript by taking suggestions from the two reviewers. All the errors pointed out by the reviewers were corrected. The experimental setup and results are more clear now. However, the main criticism of the two reviewers is the intelligibility of the manuscript, which is still a problem. For example, the abstract and intro are very confusing and difficult to follow even for us in underwater acoustics and marine mammal fields. Please keep in mind that often the abstract and the results are the sections that readers read first. If they are hard to follow, readers would not read the rest and would not cite your research. The reviewers did the job they were asked and I am not going to send the manuscript back to them for further review. I have suggested in my previous comment to invite someone outside of ML or AI field (preferably someone with underwater acoustic background) to read the manuscript carefully and improve intelligibility. Have you done that? You can include him/her as a last co-author.

If you disagree with my decision, another option for you is to request PLOS ONE to change the science editor, which I do not mind at all. But please keep in mind that it would be the same long process again of finding new reviewers and revising.

One error that I just noticed is in Fig. 2. The unit you use, dB/Hz is not correct. It should be dB relative to the standard unit (Pa/sqrt(Hz) for sound). Also it is not clear if it is the actual sound, electrical signal or simulation. You have used SQ26-08 hydrophone from Cetacean Research, which is a calibrated hydrophone. But the dB values are too small for sound or electrical power. Regardless, the spectral intensity is already normalized by a unit frequency (uPa/sqrt(Hz) for sound) when you do FFT. Also, it may be interesting for readers if you can include one of the hydrophone channel signals and spectrograms in the reverberated environment of the tank.

Regards,

Haru Matsumoto


Reviewers' comments:

PLoS One. 2020 Jun 25;15(6):e0235155. doi: 10.1371/journal.pone.0235155.r005

Author response to Decision Letter 1

3 Jun 2020

(No additional referee reviews were included in this revision)

Attachment

Submitted filename: final-cover-letterR2R.pdf (39.9 KB, PDF)

PLoS One. doi: 10.1371/journal.pone.0235155.r006

Decision Letter 2

Haru Matsumoto

10 Jun 2020

Learning to localize sounds in a highly reverberant environment: machine-learning tracking of dolphin whistle-like sounds in a pool

PONE-D-19-33956R2

Dear Dr. Magnasco,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Haru Matsumoto

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Congratulations! Your paper has significantly improved. I appreciate all the efforts you put in.

Reviewers' comments:

PLoS One. doi: 10.1371/journal.pone.0235155.r007

Acceptance letter

Haru Matsumoto

16 Jun 2020

PONE-D-19-33956R2

Learning to localize sounds in a highly reverberant environment: machine-learning tracking of dolphin whistle-like sounds in a pool

Dear Dr. Magnasco:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Haru Matsumoto

Academic Editor

PLOS ONE

Associated Data


Supplementary Materials

Attachment

Submitted filename: FINAL_reviewer_response.pdf (86 KB, PDF)

Attachment

Submitted filename: Response to Reviewers.pdf (169.6 KB, PDF)

Attachment

Submitted filename: final-cover-letterR2R.pdf (39.9 KB, PDF)

Data Availability Statement

Data used for model building and testing are available at https://doi.org/10.6084/m9.figshare.7956212.

