PLOS ONE. 2022 Feb 3;17(2):e0254323. doi: 10.1371/journal.pone.0254323

Computer-vision object tracking for monitoring bottlenose dolphin habitat use and kinematics

Joaquin Gabaldon 1,*, Ding Zhang 2, Lisa Lauderdale 3, Lance Miller 3, Matthew Johnson-Roberson 1,4, Kira Barton 1,2, K Alex Shorter 2
Editor: William David Halliday
PMCID: PMC8812882  PMID: 35113869

Abstract

This research presents a framework to enable computer-automated observation and monitoring of bottlenose dolphins (Tursiops truncatus) in a zoo environment. The resulting approach enables detailed persistent monitoring of the animals that is not possible using manual annotation methods. Fixed overhead cameras were used to opportunistically collect ∼100 hours of observations, recorded over multiple days, including time both during and outside of formal training sessions, to demonstrate the viability of the framework. Animal locations were estimated using convolutional neural network (CNN) object detectors and Kalman filter post-processing. The resulting animal tracks were used to quantify habitat use and animal kinematics. Additionally, Kolmogorov-Smirnov analyses of the swimming kinematics were used in high-level behavioral mode classification. The object detectors achieved a minimum Average Precision of 0.76, and the post-processed results yielded 1.24 × 10⁷ estimated dolphin locations. Animal kinematic diversity was found to be lowest in the morning and peaked immediately before noon. Regions of the zoo habitat displaying the highest activity levels correlated to locations associated with animal care specialists, conspecifics, or enrichment. The work presented here demonstrates that CNN object detection is viable for large-scale marine mammal tracking, and results from the proposed framework will enable future research that will offer new insights into dolphin behavior, biomechanics, and how environmental context affects movement and activity.

Introduction

Direct observation of dolphins at accredited facilities and in the wild has been key to developing an understanding of the behavior and biomechanics of these animals. How the dolphins behave in the presence of conspecifics, interact and engage with their environment, or are affected by changes to their environment are all questions of interest. Ideally, these observations are made without modifying animal behavior, and in a manner that facilitates a quantitative comparison between conditions in the environment. In zoo settings there is a strong emphasis on behavioral monitoring to inform welfare practices [1–3]. Bottlenose dolphins, the most common cetacean in zoos and aquariums, are generally regarded as a species that thrives under professional care, though data-driven studies of behavior and welfare have been limited [4, 5]. The ability to quantify animal motion and location, both in the environment and with respect to other animals, is therefore critical in understanding their behavior.

Biomechanics and behavioral studies depend on animal-based measurements that are considered reliable and repeatable for the species of interest [2, 6–8], but direct measurements of animals in the marine environment can be challenging. In zoo environments, animals tend to be monitored using external sensors, such as cameras and hydrophones, placed in the environment [9, 10]. These sensors can be combined into networks to observe a majority of the animals’ environment with a relatively small number of devices. While it is possible to continuously record the animals’ environmental use and social interactions, these videos must be heavily processed to convert them into useful information. This processing is often performed by a trained expert, who watches and scores behavioral or tracking information from the data [2, 11–13]. Examples of such studies include monitoring the effects of human presence on animal behaviors, analysis of dolphin activity cycles and sleep patterns, and the evaluation of social interactions with conspecifics. Unfortunately, hand-tracking is time consuming and can be inefficient when hundreds of hours of data have been collected from multiple sensors. Recent efforts have been made to automate this process for cameras, primarily through heuristically-crafted computer-vision techniques [14, 15]. However, these techniques were either limited in execution due to prohibitive costs (e.g. funds for the hardware/installation of an extended multi-camera array), or required manual tuning to account for changing environmental conditions (e.g. lighting shifts throughout the day).

To address these gaps, this work uses a neural network-based computer-automated framework to quantify the positional states of multiple animals simultaneously in a zoo environment, and employs the framework to investigate the dolphins’ day-scale swimming kinematics. Neural networks have demonstrated flexibility and robustness in tracking biological systems from image and video data [16, 17]. To this end, a state-of-the-art neural network object detection technique, Faster R-CNN [18], was chosen as the backbone of the animal detection method for its prioritization of accuracy and precision regardless of object size or density in the image, as opposed to a faster single-shot detector [19]. The Faster R-CNN detector structure has demonstrated its capabilities in both land [20] and marine [21] applications, and is considered a reliable option for challenging tracking tasks.

In this study, camera data were used to monitor the behavior of a group of marine mammals both qualitatively and quantitatively in a zoo setting. Camera-based animal position data were used to quantify habitat usage, as well as where and how the group of animals moved throughout the day. The position data were decomposed into kinematic metrics, and used to discriminate between two general movement states—static and dynamic—using the velocity of the tracked animals. A general ethogram of the animals’ behaviors monitored in this research is presented in Table 1. Joint differential entropies of animal speed and heading were computed to provide an understanding of the dolphins’ kinematic diversity. Kolmogorov-Smirnov statistical analyses of the kinematic metrics were used to compare movement patterns and activity levels over time and between behavioral conditions. The proposed framework and results presented here demonstrate the viability of computer-vision inspired techniques for this challenging monitoring problem, and will enable future studies to gain new insights into dolphin behavior and biomechanics.

Table 1. Behavior condition ethogram of dolphins under professional care.

Category Behavior Definition
ITS (In Training Session) Animal Care Session Time period in which animal care specialists work with the dolphins to learn new behaviors or practice known behaviors without public audience.
ITS Public Presentation Time period in which animal care specialists work with the dolphins in front of an audience to present educational information to the public.
OTS (Out of Training Session) Static Animal movement state with little to no active fluking, at a speed of less than 0.5 m s⁻¹.
OTS Dynamic Animal movement state with active fluking, at a speed greater than 0.5 m s⁻¹.

Materials and methods

Experimental environment

Seven bottlenose dolphins of ages 5, 5, 14, 16, 17, 33, and 36 years with lengths of 247 ± 17 cm were observed using a dual-camera system in the Seven Seas building of the Brookfield Zoo, Brookfield, IL. The complete environment consists of a main indoor habitat with public viewing, two smaller habitats behind the main area, and a medical habitat between the two smaller habitats. The main habitat (Fig 1, top), which was the focus of the experiment, is 33.5 m across, 12.2 m wide, and 6.7 m deep. The habitats are connected through a series of gates. During formal training sessions in the main habitat, animal care specialists primarily engage with the animals on the island between the gates to the other areas. There are underwater observation windows for the viewing public on the far side of the main habitat from the island (not shown), and smaller windows looking into the offices of the animal care specialists on the island and next to the right gate (Fig 1, bottom). Recordings of the main habitat took place across multiple days (between Feb. 6 and March 27, 2018), for varying portions of each day, for a total of 99.5 hours over 20 recordings. Data collection began at the earliest at 07:41 and ended at the latest at 16:21. During the recorded hours, the dolphins participated in four formal training sessions according to a regular, well-defined schedule set by the animal care specialists (ACSs).

Fig 1. Diagram of the experimental setup.


TOP: Illustration of the main habitat, with camera placements (blue enclosures) and fields of view (gray cones). BOTTOM: Top-down individual camera views, with objects in the habitat marked. Yellow—Dolphin bounding boxes, Green—Drains, Red—Gates between regions, Orange—Underwater windows (3 total). Correlated dolphin bounding boxes are indicated by number.

A formal training session consisted of time in which the ACSs worked with the dolphins to learn new behaviors or practice known behaviors. At the beginning of each formal training session, the dolphins were asked to maintain positions directly in front of the ACS (formally known as “stationing”). The animal care specialists then presented discriminative stimuli or gestures that indicated which behaviors they requested each dolphin produce. When the animals were in a formal training session (abbreviated ITS), they experienced two formats of training during the data collection period: non-public animal care sessions and public presentations. Time outside of formal training sessions (abbreviated OTS) was defined as when the animals were not interacting with ACSs. During the OTS time periods, the ACSs would provide enrichment objects for the animals to interact with and select which parts of the habitat the animals could access using gates on either side of the main island. The time intervals for the OTS and ITS blocks are displayed in Table 2. The study protocol was approved by the University of Michigan Institutional Animal Care and Use Committee and the Brookfield Zoo.

Table 2. Block time intervals.

Block | Time interval, OTS | Time interval, ITS | Mean speed, OTS (m s⁻¹) | Mean speed, ITS (m s⁻¹) | Mean yaw rate, OTS (deg s⁻¹) | Mean yaw rate, ITS (deg s⁻¹)
1 | 08:00–09:30 | 09:30–10:00 | 1.30 | 1.39 | 0.32 | −0.68
2 | 10:00–11:30 | 11:30–12:00 | 1.57 | 1.45 | −3.61 | 1.28
3 | 12:00–13:00 | 13:00–13:30 | 1.45 | 1.44 | −0.18 | 3.26
4 | 13:30–14:30 | 14:30–15:00 | 1.41 | 1.39 | 0.05 | 2.53
5 | 15:00–16:00 | N/A | 1.43 | N/A | 1.99 | N/A

ITS Blocks 1 and 3 are animal care sessions, and ITS Blocks 2 and 4 are public presentations. The corresponding mean speed and yaw rate dynamics metrics are also reported, with yaw rate converted to units of deg s⁻¹ for readability.

Experimental equipment

Two AlliedVision Prosilica GC1380C camera sensors with Thorlabs MVL5M23 lenses were separately mounted in Dotworkz D2 camera enclosures, which were attached to 80/20 T-slotted aluminum framing. On the frame, the cameras were spaced approximately 2 m apart. The frame was mounted to a support beam directly above the main habitat, with the cameras angled to give full coverage of the area when combined. Fig 1, top, illustrates the habitat, camera placement, and field of view coverage. For data collection, the cameras were connected through the Gigabit Ethernet protocol to a central computer with an Intel i7-7700K CPU. Recordings were executed using the MATLAB Image Acquisition Toolbox, in the RGB24 color format at a frame rate of 20 Hz. Each camera was connected to a separate Ethernet port on an internal Intel PRO/1000 Pt PCIe card. A separate computer system was used for detection inference, and was outfitted with an Intel i7-8700K processor clocked to 4.8 GHz and a Nvidia Titan V graphics processing unit in Tesla Compute Cluster mode.

Dolphin detection

Approximately 99.5 hours of data from two cameras were collected for this work, resulting in ∼14 million individual frames. To extract spatial information about habitat use and swimming kinematics, we first needed to identify animals in the frames. These detections were then filtered and associated into short trajectories (tracklets) from individual animals. Kinematic data from the tracklets, in the form of position, velocity, and yaw (heading in the x-y plane), were then used to parameterize probability distributions for each time block and to identify tendencies in animal motion during in training session (ITS) and out of training session (OTS) swimming.

Neural network methods

The first step in the analysis process was dolphin detection from the captured video frames using Faster R-CNN, a machine-learning object detection method [18]. The method consisted of two primary modules: a Region Proposal Network (RPN), and a Fast R-CNN detector network. The RPN identified regions in an image that may enclose objects of interest, and presented these to the Fast R-CNN detector to verify which regions did in fact contain objects the detector sought to identify. These two modules when combined form one large network capable of returning a bounding box tightly enclosing an object’s location within an image. For a more complete explanation of the method please refer to [18].

All modules used in the implementation were present in the MATLAB Deep Learning Toolbox, except the Parametric Rectified Linear Unit (PReLU) activation function, which was defined as a custom neural network layer per the MATLAB online documentation [22, 23]. The convolutional neural network (CNN) structure used in the Faster R-CNN framework is as follows. The input layer size was chosen to be similar to the smallest bounding boxes in the set of manually scored dolphin profiles, in the format (l, l, 3), where l is twice the major-axis length of the smallest bounding box. The input layer had a third dimension of 3 because input images were in the RGB colorspace. The feature extraction layers had the following structure: four sets of 2D 3 × 3 convolution layers, each followed by batch normalization, PReLU activation, and 2 × 2 max pooling (stride 2) layers, in that order. The four convolution layers had, in order: 64, 96, 128, and 128 filters. Each convolution was performed with one layer of zero padding along the edges of the inputs to avoid discounting the corners/edges. The classification layers used the extracted features from the previous layers to identify an image region as either a dolphin or the background. They consisted of: 1) a fully connected layer, length 512, to extract features from the final convolution layer, followed by a PReLU activation; 2) a fully connected layer, length 2, to determine non-scaled classification weights; 3) a softmax function layer to convert these weights into the final probabilities of the image region’s classification. The highest probability from the softmax layer corresponded to the most likely classification for the region, and the magnitude of this probability indicated the confidence of the classification.
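As a concrete illustration of this layer stack, the sketch below re-expresses it in PyTorch. This is not the authors' MATLAB Deep Learning Toolbox implementation; the filter counts and layer ordering follow the text, while the input side length l (and therefore the fully connected input size) is a placeholder assumption.

```python
# Illustrative PyTorch re-expression of the CNN described above (the study used
# the MATLAB Deep Learning Toolbox; layer sizes follow the text, l is assumed).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3 conv with zero padding -> batch norm -> PReLU -> 2x2 max pool (stride 2)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.PReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class DolphinClassifier(nn.Module):
    def __init__(self, l=64):  # l: input side length (placeholder value)
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64),    # feature extraction: 64, 96, 128, 128 filters
            conv_block(64, 96),
            conv_block(96, 128),
            conv_block(128, 128),
        )
        feat_dim = 128 * (l // 16) ** 2  # four 2x2 poolings shrink each side by 16x
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_dim, 512),    # fully connected, length 512
            nn.PReLU(),
            nn.Linear(512, 2),           # dolphin vs. background weights
            nn.Softmax(dim=1),           # classification probabilities
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: classify a batch of four 64x64 RGB patches -> (4, 2) probabilities.
probs = DolphinClassifier(l=64)(torch.rand(4, 3, 64, 64))
```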

Training the network

Ground truth data were scored by a trained observer who manually defined bounding boxes that identified the locations of the dolphins in the training/testing frames (Fig 1, bottom, yellow boxes). These ground truth data were selected over a range of lighting conditions and dolphin locations to ensure robustness of the detection network. For each camera, 100 frames were extracted from each of 11 separate recordings, with evenly spaced time intervals between frames. The recordings were collected in May 2017, and February, March, and August 2018. Over 940 frames from each of the left and right cameras were found to contain usable dolphin locations, i.e. human-detectable dolphin profiles. Each usable dolphin location in the selected frames was manually given a bounding box tightly enclosing the visible profile. The detector for the left camera was trained on 1564 profiles and tested on 662, and the detector for the right camera was trained on 1482 profiles and tested on 662. The dolphin detectors were trained using the MATLAB implementation of Faster R-CNN, employing the previously-defined CNN structure as the classification method.

Detection processing

Detections were performed over all 99.5 hours of recorded data from both cameras at 10 Hz (7.16 × 10⁶ frames in total), using a 95% minimum confidence threshold to ensure accuracy. The fields of view of the two cameras overlap for a portion of the habitat, resulting in some dolphins being detected simultaneously by both cameras. This yielded multiple sets of conflicting detection bounding boxes spanning the two fields of view, which necessitated associating the most likely left/right box pairs. Before conflict identification was performed, the detection boxes were first transformed into a common plane of reference termed the world frame. Using known world point coordinates, homographies from each camera to the world frame were generated using the normalized Direct Linear Transform method [24]. These homographies were used to convert the vertices of the bounding boxes to the world frame using a perspective transformation. Intersecting boxes were identified by evaluating polygonal intersections, and Intersection over Union (IoU) metrics were computed for intersecting boxes to measure how well they matched. Associations were identified between pairs of left/right intersecting boxes with the highest mutual IoU values.
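The sketch below illustrates this projection-and-association step with NumPy and Shapely standing in for the study's MATLAB implementation; the homography matrices, the greedy pairing rule, and the minimum-IoU cutoff are assumptions for the example, not values from the paper.

```python
# Illustrative sketch: project detection boxes into the world frame with a 3x3
# homography, then pair left/right detections by highest mutual IoU.
import numpy as np
from shapely.geometry import Polygon

def to_world(corners_xy, H):
    """Apply a 3x3 homography to an (N, 2) array of box corners (pixels)."""
    pts = np.hstack([corners_xy, np.ones((len(corners_xy), 1))])  # homogeneous coords
    mapped = (H @ pts.T).T
    return mapped[:, :2] / mapped[:, 2:3]                          # perspective divide

def iou(poly_a, poly_b):
    """Intersection over Union of two world-frame quadrilaterals."""
    a, b = Polygon(poly_a), Polygon(poly_b)
    inter = a.intersection(b).area
    return inter / (a.area + b.area - inter) if inter > 0 else 0.0

def associate(left_polys, right_polys, min_iou=0.1):
    """Greedily pair left/right boxes with the highest mutual IoU (assumed rule)."""
    scores = np.array([[iou(l, r) for r in right_polys] for l in left_polys])
    pairs = []
    while scores.size and scores.max() > min_iou:
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
        pairs.append((i, j))
        scores[i, :] = 0.0   # each detection joins at most one pair
        scores[:, j] = 0.0
    return pairs
```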

Associated boxes’ world-frame centroid locations were meshed using a weighted mean. First, the boundaries of each camera’s field of view were projected into the world frame, allowing us to obtain the line in the world-frame y-direction defining the center of the overlap region, denoted l_s = x_mid (Fig 2, top, red lines). x_mid is the x-coordinate in the world frame midway between the physical placements of the cameras. For each detection (u), the distance (d_b) in the x-direction from u to the nearest projected camera boundary line (b_n) was then determined. Next, the distance (d_l) in the x-direction from line l_s to b_n, measured along the line through u, was found. Finally, the weight for the camera corresponding to b_n was calculated as w_n = d_b/(2 d_l), with the weight for the other (far) camera as w_f = 1 − w_n. This ensured that if detection u was on l_s, then w_n = w_f = 0.5, and as u moved closer to b_n, we would have w_n → 0 and w_f → 1.
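A minimal sketch of that weighting, under the variable names above, is shown below; the centroid values are illustrative, and the function assumes the overlap geometry has already been projected into the world frame.

```python
# Minimal sketch of the distance-weighted meshing of one associated detection pair.
# x_mid: x-coordinate of the overlap centerline l_s; b_n: x-coordinate of the nearest
# projected camera FOV boundary. The camera whose boundary is b_n gets weight w_n.
import numpy as np

def mesh_centroids(c_boundary_cam, c_far_cam, u_x, x_mid, b_n):
    d_b = abs(u_x - b_n)        # detection to nearest projected camera boundary
    d_l = abs(x_mid - b_n)      # centerline l_s to that same boundary
    w_n = d_b / (2.0 * d_l)     # 0.5 on the centerline, -> 0 as u approaches b_n
    w_f = 1.0 - w_n
    return w_n * np.asarray(c_boundary_cam) + w_f * np.asarray(c_far_cam)

# Example (illustrative numbers): detection at x = 14.0 m, centerline at 16.75 m,
# nearest boundary at 12.0 m -> w_n ~ 0.21 for the camera owning that boundary.
meshed = mesh_centroids([14.0, 5.2], [13.8, 5.4], u_x=14.0, x_mid=16.75, b_n=12.0)
```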

Fig 2. Combined figure demonstrating camera overlap, bounding box meshing, and animal position uncertainty.


TOP: Top-down individual camera views, with dolphin bounding boxes in yellow (correlating boxes are numbered). The habitat-bisecting lines (l_s) for each camera frame are indicated in solid red. Distances from Bounding Box 2 (centered on the black and gray crosshair) to the closest frame boundary (d_b) and from that boundary to the bisecting line (d_l) are indicated by the white measurement bars. MIDDLE: Meshed camera views including dolphin bounding boxes (yellow), with the location uncertainty distribution (A) overlaid for Box 2. BOTTOM: 2D location uncertainty distribution (A) with major (a-a, black) and minor (b-b, red) axes labeled and separately plotted.

False positive mitigation

In specific circumstances, the shapes of the drains at the bottom of the habitat were warped by light passing through rough surface water, which resulted in false dolphin detections. Separate (smaller) image classifiers for each camera were trained to identify these false-positive drain detections, and were run on any detections that occurred in the regions of the video frames containing the drains. These detectors were strictly CNN image classifiers and were each trained on over 350 images and tested on over 150 images. For the drain classifier, the input layer size had the format (l_d, l_d, 3), where l_d is the mean side length of the detection bounding boxes being passed through the secondary classifiers. The feature extraction layers had the same general structure as the Faster R-CNN classifier network, except in this case the convolution layers had, in order: 32, 48, 64, and 64 filters each. In the classification layers, the first fully connected layer had a length of 256.

Temporal association of detections

Each experimental session involved the detection of multiple animals throughout their habitat. However, animal detections were performed independently for each frame of the video. To extract kinematic information from the animals in the video, the detection associations needed to be preserved across frames. In this work, short continuous tracks (i.e. tracklets) were generated for a detected animal by identifying the most likely detection of that animal in the subsequent frame (Fig 3). To generate multiple individual tracklets over a series of video frames, an iterative procedure of prediction and association was conducted under a Kalman filter framework with a constant velocity model.

Fig 3. Illustration of tracklet generation.


TOP: Tracklet segments (red) overlaid on a single video frame, generated by stitching the views from both cameras. Each tracklet in this frame was plotted from its inception to each corresponding dolphin’s current position. While each dolphin can be tracked, the lack of clarity when underwater impedes individual identification. CENTER: x-y view of example tracklets (red and green on gray lines) of two dolphins (highlighted light orange), which are also shown in Fig 1, top. POPOUT-RIGHT: Vector illustrations of the two example tracks. Example notation for tracklet j (red): position p(j, t′), velocity v(j, t′), yaw θ(j, t′), and yaw rate θ̇(j, t′). POPOUT-BOTTOM: Illustration of tracklet generation, with detections (stars) and tracklet proximity regions (dashed). Example notation for tracklet j (red): position p(j, t), velocity v(j, t), Kalman-predicted future position p̂(j, t+1), true future position p(j, t+1), and future animal detection u(j, t+1, i′).

The position of the i-th detected animal in one video frame at time t is denoted as u(t, i) = [u_x(t, i), u_y(t, i)]. Each detection u(t, i) was either associated with a currently existing tracklet or used to initialize a new tracklet. To determine which action was taken, for each tracklet, denoted T(k) for the k-th tracklet, this process first predicted the state of the tracked animal in the next frame, T̂(k, t+1), based on the current state information of the animal, T(k, t).

$$
\begin{aligned}
T(k,t) &= \left[\, p(k,t),\; v(k,t) \,\right] && (1)\\
&= \left[\, p_x(k,t),\; p_y(k,t),\; v_x(k,t),\; v_y(k,t) \,\right] && (2)\\
\hat{T}(k,t+1) &= \left[\, \hat{p}(k,t+1),\; \hat{v}(k,t+1) \,\right] && (3)\\
&= \left[\, \hat{p}_x(k,t+1),\; \hat{p}_y(k,t+1),\; \hat{v}_x(k,t+1),\; \hat{v}_y(k,t+1) \,\right] && (4)
\end{aligned}
$$

where p(k, t) = [p_x(k, t), p_y(k, t)] denotes the filtered position of the animal tracked by the k-th tracklet at time t, and v(k, t) = [v_x(k, t), v_y(k, t)] is the corresponding velocity (Fig 3, popout-bottom). Under a constant velocity model, the predicted next-frame position p̂(k, t+1) = [p̂_x(k, t+1), p̂_y(k, t+1)] was obtained by integrating the current velocity over one frame period and adding the result to the current frame position. The predicted velocity remained constant.

$$
\begin{aligned}
\hat{p}_x(k,t+1) &= p_x(k,t) + v_x(k,t)\,\Delta t && (5)\\
\hat{p}_y(k,t+1) &= p_y(k,t) + v_y(k,t)\,\Delta t && (6)\\
\hat{v}_x(k,t+1) &= v_x(k,t) && (7)\\
\hat{v}_y(k,t+1) &= v_y(k,t) && (8)
\end{aligned}
$$

Using the predicted position, the k-th tracklet checked whether there existed a closest detection in the next frame that was within the proximity region of the predicted position, which is defined as a circle around the predicted position with radius 0.8 m (heuristically tuned). If true, that detection, denoted as u(k, t+ 1, i) for the i-th detection in frame t+ 1 associated with the k-th tracklet, was used as the reference signal of the Kalman filter to update the state (position and speed) of tracklet T(k). If false, the unassociated tracklet continued propagating forward, assuming the animal maintained a constant velocity. If a tracklet continued to be unassociated for 5 consecutive frames (empirically determined), it was considered inactive and was truncated at the last confirmed association. All information related to the k-th tracklet was saved after its deactivation:

$$
T(k) = \left[\, T(k,t_{\mathrm{start}}),\; \ldots,\; T(k,t-1),\; T(k,t),\; T(k,t+1),\; \ldots,\; T(k,t_{\mathrm{end}}) \,\right]^{\mathsf{T}} \qquad (9)
$$

As illustrated in Fig 3, the tracklet formation operation linked each animal’s individual detections (u) over consecutive frames. This process returned the smoothed track positions (p) of the animals, and by numerically differentiating the tracklets it was possible to extract the forward speed (v), yaw (θ), and turning rate (θ̇), which could then be used to parameterize the positional states of the animals.
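The sketch below gives a compact, self-contained version of this predict-associate-update loop in Python with a constant-velocity Kalman filter. The 0.8 m gate and 5-frame deactivation rule come from the text; the process and measurement noise covariances, and the simple nearest-detection matching, are placeholder assumptions rather than the study's exact implementation.

```python
# Illustrative constant-velocity Kalman tracklet generator (not the authors' code).
import numpy as np

DT, GATE, MAX_MISSES = 0.1, 0.8, 5            # 10 Hz detections, 0.8 m gate, 5 misses

F = np.array([[1, 0, DT, 0], [0, 1, 0, DT], [0, 0, 1, 0], [0, 0, 0, 1]], float)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
Q = np.eye(4) * 1e-2                          # process noise (assumed)
R = np.eye(2) * 0.2**2                        # ~0.2 m detection noise (assumed)

class Tracklet:
    def __init__(self, u):
        self.x = np.array([u[0], u[1], 0.0, 0.0])   # state [p_x, p_y, v_x, v_y]
        self.P = np.eye(4)
        self.misses, self.history = 0, [self.x.copy()]

    def predict(self):
        self.x = F @ self.x                   # Eqs (5)-(8): constant-velocity prediction
        self.P = F @ self.P @ F.T + Q
        return self.x[:2]                     # predicted position p_hat

    def update(self, u):
        if u is None:                         # no detection inside the gate
            self.misses += 1
        else:                                 # standard Kalman correction toward u
            y = np.asarray(u) - H @ self.x
            S = H @ self.P @ H.T + R
            K = self.P @ H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(4) - K @ H) @ self.P
            self.misses = 0
        self.history.append(self.x.copy())
        return self.misses < MAX_MISSES       # False -> deactivate this tracklet

def step(tracklets, detections):
    """One frame: associate each tracklet with the closest gated detection."""
    free, active = [np.asarray(d, float) for d in detections], []
    for trk in tracklets:
        p_hat = trk.predict()
        dists = [np.linalg.norm(p_hat - d) for d in free]
        matched = free.pop(int(np.argmin(dists))) if dists and min(dists) < GATE else None
        if trk.update(matched):
            active.append(trk)
    active.extend(Tracklet(d) for d in free)  # unmatched detections seed new tracklets
    return active
```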

Position uncertainty

There was a general position uncertainty for each animal detection due to noise in the Faster R-CNN detections. This was caused by a combination of limited camera resolution, as well as distortion of an animal’s image from waves and ripples on the surface of the water. Additionally, since animal depth could not be measured, there were errors in the world-frame x-y location estimates (caused by camera perspective and light refraction effects) that could not be corrected. This required a specialized x-y position-dependent uncertainty distribution, based on prior knowledge of animal occupancy trends in the water column. Due to the high volume of data available to produce the underlying structure of the spatial distribution, the distribution kernels themselves could be directly generated rather than relying on estimation techniques.

In this work, the detection uncertainty was represented as a 2D probability density function (PDF), whose size and shape depended on the location of the detection with respect to the cameras (Fig 2, bottom, A). The short (minor) axis, D1, was a Gaussian uncertainty distribution defined according to a heuristically estimated error in the camera detections (∼0.2 m), and represented the general position uncertainty in the Faster R-CNN detections (Fig 2, bottom, b-b). The long (major) axis of the spatial distribution, D2, represented the position uncertainty caused by the perspective and refraction effects (uncertainty from unknown depth). A 1D PDF was defined according to previously measured animal depth data (a total of 9.8 hours during separate OTS time blocks), obtained via non-invasive tagging, which represented the general distribution of depths occupied by the animals. This was convolved with D1 to produce the general shape of D2 (Fig 2, bottom, a-a). The x-axis length scale of D2 for a particular detection was obtained from the maximum position error in the detection’s x-y location, i.e. the magnitude of the x-y position difference (original versus corrected x-y position) if the detection happened to be at maximum depth (∼7 m). This magnitude varied depending on the original world-frame location of the detection. Details on the depth-based location correction can be found in [25].

Mapping animal kinematics to habitat

Heatmaps of dolphin position and speed were used to map animal positional state to the habitat. The dolphins were defined to be static or minimally mobile (drifting) when they were traveling at speeds below 0.5 m s⁻¹, and dynamic otherwise. To generate the positional heatmaps, a blank 2D pixel map of the main habitat, M, was first created. Then, for each pixel representation p of a detection u, the maximum possible magnitude of location error due to depth was determined, defined as e_m (pixels, scale 1 pix = 5 cm), along with the orientation of the error propagation, ψ_m (radians). The perimeter of the habitat served as a hard constraint on the location of the animals, thus e_m was truncated if the location of the point with the maximum possible shift, [p_x + e_m cos(ψ_m), p_y + e_m sin(ψ_m)], fell outside this boundary. The minor axis of the 2D spatial uncertainty distribution, D1, was a 1D PDF in the form of a Gaussian kernel with σ_gauss = 0.2s (0.2 meters scaled to pixels by the scaling factor s = 20). Next, the depth PDF was interpolated to be e_m pixels long, and was convolved with D1 to account for measurement uncertainty in the camera detections. This yielded the major-axis 1D PDF, D2. The 2D (unrotated) occupancy PDF, E = D1ᵀD2 (the outer product of the two kernels), was then computed, where D1 and D2 were horizontal vectors of the same length. The 2D rotated occupancy PDF, F, was calculated by rotating E by an angle of ψ_m through an interpolating array rotation; the MATLAB implementation of imrotate was used for this calculation. F was then normalized to ensure the distribution summed to 1. Finally, F was locally summed into M, centered at location [x_u, y_u] = [p_x + 0.5 e_m cos(ψ_m), p_y + 0.5 e_m sin(ψ_m)], to inject the occupancy probability distribution for u into map M. This process was repeated for all detections. For the sake of visibility, all heatmaps were sub-sampled down to a scale of 1 pix = 1 m.
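The sketch below mirrors those steps with NumPy and SciPy (np.outer, np.convolve, and scipy.ndimage.rotate standing in for the MATLAB imrotate call); the pixel scale and σ follow the text, while the example depth PDF and the Gaussian kernel half-width are placeholder assumptions.

```python
# Rough sketch of building one detection's rotated occupancy kernel and summing it
# into the habitat map M. The study's depth PDF came from tag data; here it is assumed.
import numpy as np
from scipy.ndimage import rotate

SCALE = 20                                    # pixels per meter (1 pix = 5 cm)

def gaussian_kernel(sigma_pix, half_width=3):
    x = np.arange(-half_width * sigma_pix, half_width * sigma_pix + 1)
    k = np.exp(-0.5 * (x / sigma_pix) ** 2)
    return k / k.sum()                        # minor-axis PDF, D1

def occupancy_kernel(depth_pdf, e_m, psi_m, sigma_m=0.2):
    d1 = gaussian_kernel(sigma_m * SCALE)
    # Interpolate the depth PDF to e_m pixels, then convolve with D1 -> major axis D2.
    d2 = np.interp(np.linspace(0, 1, max(int(e_m), 1)),
                   np.linspace(0, 1, len(depth_pdf)), depth_pdf)
    d2 = np.convolve(d2, d1, mode="full")
    E = np.outer(d1, d2)                      # unrotated 2D kernel
    F = rotate(E, np.degrees(psi_m), reshape=True, order=1)
    F = np.clip(F, 0.0, None)
    return F / F.sum()                        # normalized occupancy PDF

def add_to_map(M, F, center_rc):
    """Sum kernel F into map M centered at (row, col), cropping at the map edges."""
    r0 = int(center_rc[0]) - F.shape[0] // 2
    c0 = int(center_rc[1]) - F.shape[1] // 2
    rs, cs = max(r0, 0), max(c0, 0)
    re, ce = min(r0 + F.shape[0], M.shape[0]), min(c0 + F.shape[1], M.shape[1])
    M[rs:re, cs:ce] += F[rs - r0:re - r0, cs - c0:ce - c0]
```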

A similar process was used to form the speed heatmaps. For a speed heatmap, the values of F were additionally scaled by the scalar speed v of the animal corresponding to detection u, and the scaled kernel Fv was locally summed into a separate map, N, centered at [x_u, y_u]. Element-wise division of N by M was then performed to generate S, a map of the average speed per location.

Lastly, the direction of motion of the animals throughout the monitored region was described using a quiver plot representation. To formulate the quiver plot, two separate heatmaps were generated, Q_x and Q_y, one each for the x and y components of the animals’ velocities. Q_x was created using a similar method to the speed heatmap, but in this case F was scaled by the x-component of the animal’s velocity (Fv cos(θ) summed into Q_x centered at [x_u, y_u]), where θ was the heading of the animal corresponding to detection u. Similarly, for Q_y, F was scaled by the y-component of the animal’s velocity (Fv sin(θ) summed into Q_y centered at [x_u, y_u]). Together, the vector components Q_x and Q_y represented the general orientation of the animals at each point in the habitat.
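Continuing the previous sketch, the accumulation and the element-wise ratio can be written as follows (add_to_map and the per-detection kernel F come from the sketch above; v and θ are the speed and heading tied to detection u).

```python
# Follow-on sketch: speed- and velocity-weighted accumulation, then N / M element-wise.
import numpy as np

def accumulate(M, N, Qx, Qy, F, center_rc, v, theta):
    add_to_map(M, F, center_rc)                       # occupancy map M
    add_to_map(N, F * v, center_rc)                   # speed-weighted map N
    add_to_map(Qx, F * v * np.cos(theta), center_rc)  # x-velocity map Q_x
    add_to_map(Qy, F * v * np.sin(theta), center_rc)  # y-velocity map Q_y

def average_speed_map(N, M):
    """S = N / M where the habitat was visited; zero elsewhere."""
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(M > 0, N / M, 0.0)
```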

Probability distributions of metrics and entropy computation

For each time block of OTS and ITS, the PDFs of speed (m s⁻¹) and yaw (rad) were numerically determined. These were obtained by randomly extracting 10⁵ data samples of both metrics from each time block of OTS and ITS, and producing PDFs for each metric and time block from these data subsets.

Additionally, the joint differential entropies of speed and yaw were computed for each time block of OTS and ITS. In this case, the joint entropy of animal speed and yaw represents the coupled variation in these metrics for the animals; as such, speed-yaw joint entropy can be considered a proxy for the diversity of their kinematic behavior. To compute the joint entropy h for one time block, the randomly sampled speed (continuous random variable S) and yaw (continuous random variable Ψ) data subsets (S and Ψ, respectively) of that time block were used to generate a speed/yaw joint PDF f(s, ψ), where s ∈ S, ψ ∈ Ψ. f was then used to compute h with the standard method:

$$
h(S,\Psi) = -\iint_{S,\Psi} f(s,\psi)\,\ln\!\big[f(s,\psi)\big]\; ds\, d\psi \qquad (10)
$$
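A numerical sketch of Eq (10) is given below: the joint PDF is estimated with a 2D histogram and the integral is approximated by a sum over bins. The bin counts and the synthetic sample data are placeholders; the study drew its samples from the tracked kinematics.

```python
# Estimate the speed/yaw joint differential entropy, Eq (10), from samples.
import numpy as np

def joint_differential_entropy(speed, yaw, bins=(50, 36)):
    f, s_edges, y_edges = np.histogram2d(speed, yaw, bins=bins, density=True)
    ds = np.diff(s_edges)[:, None]            # speed bin widths
    dpsi = np.diff(y_edges)[None, :]          # yaw bin widths
    nz = f > 0                                # empty bins contribute 0 (0 * ln 0 -> 0)
    return -np.sum(f[nz] * np.log(f[nz]) * (ds * dpsi)[nz])

# Illustrative use with synthetic samples (10^5 draws, as in the text).
rng = np.random.default_rng(0)
speed = np.abs(rng.normal(1.4, 0.5, 100_000))        # m/s, placeholder data
yaw = rng.uniform(-np.pi, np.pi, 100_000)            # rad, placeholder data
h = joint_differential_entropy(speed, yaw)
```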

Kolmogorov-Smirnov statistics

To evaluate the statistical differences in animal dynamics between time blocks, the two-sample Kolmogorov-Smirnov (K-S) distances (Δ_ks) and their significance levels (α) were computed for each of the following metrics: speed (m s⁻¹), yaw (rad), yaw rate (rad s⁻¹), and the standard deviations of each [26]. These comparisons were performed on randomly-sampled subsets of each time block, with each subset consisting of 10⁴ data samples per metric. Only time blocks of the same type were compared (i.e. no ITS blocks were compared to OTS blocks, and vice-versa). K-S statistics were chosen to allow for nonparametric comparisons between probability distributions, as the metric distributions within each subtype (e.g. animal speed, yaw) did not all pertain to the same family of distributions (e.g. normal, exponential, etc.), rendering more traditional statistical comparisons ill-suited to this application. The computations were performed using the MATLAB statistics toolbox function kstest2.
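For readers working outside MATLAB, the same comparison can be sketched with SciPy's two-sample K-S test; the block data below are synthetic placeholders, and the 10⁴-sample subset size and α threshold follow the text.

```python
# Sketch of pairwise block comparisons with the two-sample K-S test
# (scipy.stats.ks_2samp standing in for MATLAB's kstest2).
import numpy as np
from itertools import combinations
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
blocks = {f"OTS{i}": rng.normal(1.3 + 0.05 * i, 0.4, 10_000) for i in range(1, 6)}

ALPHA_CRIT = 0.001
for (name_a, a), (name_b, b) in combinations(blocks.items(), 2):
    res = ks_2samp(a, b)                      # K-S distance and significance level
    verdict = "different" if res.pvalue < ALPHA_CRIT else "similar"
    print(f"{name_a} vs {name_b}: D = {res.statistic:.3f}, "
          f"p = {res.pvalue:.3g} -> {verdict}")
```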

Results

Detector and filter performance

In evaluation after training, the Faster R-CNN detectors for the left and right cameras achieved Average Precision scores of 0.76 and 0.78, respectively, and the CNN drain classifiers for the left and right cameras achieved accuracy scores of 92% and 94%, respectively. The performance of this pair of Faster R-CNN detectors against ground truth was fully evaluated in [25], with the results reported in Table 3. To summarize the performance results: two additional monitoring sessions were video recorded and tracked both manually and using the automated CNN-based tracking system from this manuscript. During these sessions, two individual dolphins were tracked by a human observer and the results were compared to the detections produced by the automated system. Overall, for these two deployments the human tracker (representing the ground truth) was able to detect the dolphins 88.1% of the time, while the CNN-based trackers were able to detect the dolphins 53.2% of the time. As a result, the automated system achieved an overall recall rate of 60.4% versus ground truth.

Table 3. Performance comparison between manual and CNN animal detections for two sessions as part of a separate monitoring exercise, where individual dolphins were tracked as opposed to the entire group.

A1 and A2 refer to specific dolphins, with A1 tracked over two recordings during Deployment 1, and A1 and A2 tracked during the same recording during Deployment 2. “Detectability” is defined as the percentage of each deployment period during which the individual dolphin could be detected by the human or CNN trackers.

Parameter | Deployment 1, A1–1 | Deployment 1, A1–2 | Deployment 2, A1 | Deployment 2, A2 | Overall
Duration [minutes] | 22.6 | 30.9 | 48.9 | 49.2 | 151.6
Detectability—Manual | 70.4% | 86.0% | 100% | 85.6% | 88.1%
Detectability—CNN | 44.6% | 54.7% | 50.2% | 59.2% | 53.2%

Processing all 99.5 hours of recordings yielded 5.92 × 10⁶ detections for the left camera and 6.35 × 10⁶ detections for the right. The initial set of detections took ∼8.4 days to compute when performed on the Titan V computer system. Of these, 3.83 × 10⁴ (0.65%) detections from the left camera and 3.02 × 10⁴ (0.48%) detections from the right camera were found to be drains misclassified as dolphins. After removing the misclassified detections, meshing the left and right detection sets yielded a total of 1.01 × 10⁷ individual animal detections within the monitored habitat. The tracklet generation method used in this work associated animal track segments containing gaps of up to 4 time steps. As a result, the prediction component of its Kalman filter implementation was used to fill in short gaps in the tracking data. Generating tracklets from the meshed detections yielded a total of 1.24 × 10⁷ estimated dolphin locations, from 3.44 × 10⁵ total tracklets.

A note on detector limitations and the animal identification problem: while this system is robust in detecting a dolphin in-frame, it cannot track specific animals. The camera resolutions are not sufficient to resolve identifying features on the animals, and environmental occlusions (glare regions, severe water surface disturbances) prevent continuous tracking (Fig 3, top). As a result, while each tracklet corresponds to a single dolphin at a time, the lack of identifiability prevents longer-duration (>30 seconds) tracking of individuals, and therefore the generation of per-individual metrics. For this reason, the results in this manuscript are presented for the dolphins as a group, rather than for each individual.

Spatial distribution—Position

During OTS, the tracked animals were found to be in a dynamic swimming state ∼77% of the time and a static state ∼23% of the time. The static OTS spatial distribution tended to be associated with particular features of the habitat: the gates that lead to the other areas of the habitat, or the underwater windows that offered views of the animal care specialist staff areas (Fig 4). When swimming dynamically during OTS, the dolphins tended to spend more time near the edges of their habitat, with the most time focused on the island side with the gates and the windows (Fig 5, left column). This was especially true during Block 5, with additional weight placed along the edge of the central island.

Fig 4. Static position distributions for OTS and ITS.


A note on the format of the training sessions: dolphins spent more time stationed at the main island during public presentations than during non-public animal care sessions. During public presentations, ACSs spend a higher portion of the training session on the main island because it is within view of all of the public attending the presentation. Non-public animal care sessions are more fluid in their structure than public sessions, and ACSs often use the entire perimeter of the habitat throughout the session.

Fig 5. Spatial distributions for dynamic OTS, with position distributions along the first column and speed distributions/quiver plots along the second column.


Prior to the first full training session of the day at 09:30, the dolphins engaged in low-intensity (resting) swimming clockwise around the perimeter of the habitat, with the highest average OTS speeds recorded after the 09:30 session. From there, speeds trailed off over the subsequent two time blocks. The 13:30–14:30 time block was characterized by slower swimming in a predominantly counterclockwise pattern, followed by an increase in speed and a more varied heading pattern during the 15:00–16:00 time block.

Throughout ITS, the dolphins were asked to engage in dynamic swimming tasks ∼62% of the time, and were at station (in front of the ACSs) for the remaining ∼38% of the time. During ITS, the dolphins had a heavy static presence in front of the central island, where the animals were stationed during formal training programs. The animals also spent less time around the edges of the environment, in contrast with their locations during OTS (Fig 6, left column). During ITS, the ACSs presented discriminative stimuli or gestures corresponding to specific animal behavior, which defined the spatial distributions of the dolphins’ movements during these time blocks. Additionally, there were spatial distribution similarities between training sessions of similar type, e.g. Blocks 1, 3 were animal care and husbandry sessions, and 2, 4 were public presentations. Note the structure of the spatial distributions across the top of their habitat, where during the care sessions (Blk. 1, 3) the dolphins’ positions were focused on specific points in the area, while during the presentations (Blk. 2, 4) their positions were distributed across the edge of the central island. This captured the formation used during presentations with animals distributed more uniformly across the island. This also serves to qualitatively validate that the detectors are working as expected, given the dolphins are observed to be present in a region they are commonly instructed to occupy.

Fig 6. Spatial distributions for dynamic ITS, with position distributions along the first column and speed distributions/quiver plots along the second column.


During the animal care sessions (Block 1: 09:30–10:00, Block 3: 13:00–13:30), the dolphins engaged in lower-intensity swimming throughout the habitat than during the presentation sessions (Block 2: 11:30–12:00, Block 4: 14:30–15:00). This difference is qualitatively explained by the discrepancy in ACS requests: high-intensity behaviors are prompted more often during presentations, while care sessions cover a wider variety of behaviors. Conversely, spatial coverage of the habitat did not vary greatly within the ITS blocks, with an expectedly high concentration at the central island where the ACSs were located for all ITS blocks.

Spatial distribution—Speed/quiver

In Block 1 of OTS, the dolphins had relatively low speeds (mean 1.30 m s⁻¹, Table 2) across their habitat and, based on the vector field of the quiver plot for the block, were engaged in large, smooth loops along the edges of the habitat (Fig 5, right column). This contrasted with Block 2, which saw a higher general speed (mean 1.57 m s⁻¹) as well as diversified movement patterns, with the right half exhibiting counter-clockwise chirality while the left half maintained the clockwise motion pattern. Blocks 3–5 exhibited mean speeds higher than Block 1 and lower than Block 2 (Table 2), with the dolphins’ movement patterns shifting between each OTS block (Fig 5). In contrast, there was no such pattern in the dolphins’ mean yaw rates (Table 2).

During ITS, the care blocks’ (Blk. 1, 3) speed distributions and vector fields qualitatively demonstrated similar structures, while those of the presentations (Blk. 2, 4) were more mixed, with more similarities along the left and right far sides, but fewer in the center (Fig 6, right column). The mean speeds and mean yaw rates did not share particular similarities between blocks of similar type (Table 2). In general, speeds across the entire habitat are higher during public presentations than non-public animal care sessions because high-energy behaviors (e.g., speed swims, porpoising, breaches) are typically requested from the dolphins several times throughout a presentation. Though non-public animal care sessions can include high-energy behaviors, they also focus on training new behaviors and engaging in husbandry behaviors. Public presentations provide the opportunity for exercise through a variety of higher energy behaviors, while non-public sessions afford the ability to engage in comprehensive animal care and time to work on new behaviors.

Joint entropy results for kinematic diversity analysis

The joint differential entropies of speed and yaw per time block are displayed in Fig 7, bottom, with values reported in Table 4. The time blocks in this figure are presented in chronological order, and we observed the lowest kinematic diversity in the mornings (the first blocks of each OTS and ITS) as the animal care specialists were arriving at work and setting up for the day. The highest kinematic diversity when not interacting with the ACSs then occurred immediately after the first ITS time block. In general, the first time blocks of both OTS and ITS showed the lowest kinematic diversity of their type, the second of each showed the highest, and the following blocks stabilized between the two extremes. The speed/quiver plots (Figs 5 and 6, right) provide a qualitative understanding of the entropy results. For example, in Block 1 of OTS (Fig 5, top-right) the dolphins engaged in slow swimming throughout their habitat in smooth consistent cycles along the environment edge, yielding the lowest joint entropy. Joint entropy then increased during both the morning ITS and OTS blocks and remained elevated for the rest of the day, representing higher animal engagement through the middle of their waking hours.

Fig 7. Speed and yaw probability distributions and joint differential entropies, respective to time block.


TOP: Probability density functions of animal speed (m s−1) for OTS (left) and ITS (right). MIDDLE: Probability density functions of yaw (rad) for OTS (left) and ITS (right). BOTTOM: Joint differential entropy of speed and yaw for each block of OTS (left) and ITS (right), with limited-range y-axes to more clearly show value differences.

Table 4. Speed and yaw joint differential entropy.

Block | OTS 1 | OTS 2 | OTS 3 | OTS 4 | OTS 5 | ITS 1 | ITS 2 | ITS 3 | ITS 4
Entropy | 2.358 | 2.599 | 2.543 | 2.508 | 2.541 | 2.521 | 2.675 | 2.584 | 2.605

Statistical comparison of metrics for behavior differentiation

The K-S statistics were used to confirm the similarities and differences between time blocks within both OTS and ITS. To aid in visualizing this, Fig 7, top, displays the overlaid PDFs of the speed and yaw metrics during OTS, and Fig 7, middle, displays the PDFs during ITS. A complete table of K-S distances and α values for all six metrics is provided in S1 Table in the supporting information, with all values rounded to 3 digits of precision. Given the high number of samples used to generate the K-S statistics, we compared the significance levels to a stronger threshold of α_crit = 0.001. For OTS, the K-S results showed that Blocks 1 and 2 varied the most with respect to the others in terms of speed, which can be observed in Fig 7, top, while the yaw values were generally not significantly different, again observable in Fig 7. Across the board, Block 2 differed significantly from the rest of the OTS blocks for the most metrics, with Block 1 following close behind. In contrast, Blocks 3–5 differed the least significantly from each other, indicating similarities in the dolphins’ dynamics patterns for Blocks 3–5.

For ITS, we note that the significant differences in metrics generally followed the structure type of each ITS block: comparisons between Blocks 1 vs. 3, and 2 vs. 4, were found to be significantly different the least often. As the ACSs requested similar behaviors during ITS blocks of the same type, we expected similarities in the dynamics metrics for Blocks 1 vs. 3 (animal care sessions) and Blocks 2 vs. 4 (presentations), and differences between the metrics for blocks of different types. Of particular note are the yaw std. dev. and yaw rate std. dev. metrics, which show order-of-magnitude differences in K-S distances when comparing similar vs. different types of ITS blocks. Overall, the pattern displayed by the ITS K-S statistics in S1 Table correlated with this expectation.

Discussion

Automatic dolphin detection

This research presents a framework that enables the persistent monitoring of dolphins under professional care through external sensing, performed on a scale that would be prohibitive for traditional manual tracking. Both the Faster R-CNN dolphin detection and CNN drain detection methods displayed reliable performance in testing, and enabled large-scale data processing at rates not achievable by humans. Given that the total duration of video processed was ∼199 hours (2 cameras × 99.5 hours each), an inference time of ∼202 hours (1.013×) represents at minimum an order-of-magnitude increase in processing speed when compared to human data annotation. This estimate was obtained from the authors’ prior experience in manual animal tracking, which could take over 10 hours of human effort per hour of video (frame rate of 10 Hz) annotated for a single animal. In this research, the detections generated by the monitoring framework were used to estimate two-dimensional animal position and kinematics (speed, heading) to characterize animal behavior and spatial use within their environment. As such, this detection framework presents new opportunities for long-term monitoring of animal kinematics, and enables the automated processing of the longer duration and more frequent recording sessions that will provide a more complete picture of animal behavior in these environments.

Animal kinematics and habitat use

Kinematic diversity

Joint dynamic entropy was used to quantify differences in animal kinematic diversity throughout the day, to explore how temporal changes in the dolphins’ habitat would result in modified kinematic diversity levels (Fig 7, bottom). The use of entropy as a proxy for kinematic diversity has been applied in the past to characterize prey motion unpredictability for predator evasion; in this work, however, it serves to provide a measure of animal engagement [27]. The kinematic diversity results presented here are consistent with previous research on animal activity and sleep patterns, which reports a diurnal activity cycle for animals under professional care [12]. However, it is interesting to note that changes in animal kinematic diversity throughout the day during OTS are not gradual: the OTS time block displaying the minimum value is immediately followed by the block displaying the maximum, and the two are separated only by the first training session (30 minutes in duration). This sudden shift may not be fully explained by the dolphins’ diurnal activity cycle alone, and may be related to the fact that their first daily interactions with the ACSs occur between these two OTS time blocks. A finer time-scale analysis of their kinematic diversity trends is necessary to determine the cause of this change in animal engagement.

Habitat use

The kinematic data also enabled an investigation into how features in the habitat influenced animal behavior and spatial use, particularly during OTS. The animals tended to have a general focus on the area between the gates along the edge of the central island (Fig 5, left). Additionally, throughout the OTS position plots (including static, Fig 4, left) four animal-preferred locations were observed. The two hot spots to the left and right of the central island are gates (Fig 1, bottom), where the dolphins could communicate with conspecifics when the gates were closed, or pass through to other areas of their habitat when they were open. Conversely, the two hot spots nearer the middle of the island edge corresponded to underwater windows that led to an ACS work area (two central windows in Fig 1, bottom). Through these windows the dolphins may observe the ACSs, view conspecifics in one of the back habitats (through an additional window, not shown in Fig 1), or observe enrichment occasionally placed on the other side of the glass (mirrors, videos, etc.). Regions of the habitat in proximity to these two windows experienced some of the highest occupancy in all OTS position plots, both static and dynamic. This indicates that particular attractors for the dolphins’ attention were observable through those windows, whether they were the ACSs, conspecifics, or enrichment.

These attractors were also correlated with the dolphins’ kinematics and activity levels. Of all the regions in the environment, only the positions in front of the central windows consistently recorded peak or near-peak location-specific animal swimming speeds for all OTS time blocks (Fig 5, right). When combined with the results from the spatial distributions (Fig 5, left), this implies that the dolphins not only focused their attention on these regions, but also exhibited higher activity levels when swimming in their vicinity.

Behavior classification from dynamics metrics

During ITS blocks, ACSs asked for specific behaviors from the dolphins and these behaviors were often repeated. Elements of public educational presentations (ITS 2/4) were varied to include a mixture of both high and low energy segments, and this blend resulted in similar dynamic patterns for the public sessions. In contrast, the non-public animal husbandry and training sessions (ITS 1/3) were less dynamic overall, and yielded similar dynamics patterns within these types of sessions. Qualitative similarities in the pairs of animal training sessions were observable in both the position and speed/quiver plots in Fig 6, and the probability density functions presented in Fig 7. Along with the statistical observations in S1 Table, without prior knowledge of the block types it would be possible to use this pattern to identify that Blocks 1 and 3 were likely the same type, as were Blocks 2 and 4. This demonstrates that the presented method of obtaining and analyzing the dolphins’ dynamics metrics has the potential to differentiate between general behavior types.

This was useful for analyzing the OTS results, as the position and speed/quiver plots in Fig 5 only showed patterns in the animals’ location preferences within their habitat. In contrast, an analysis of the K-S results allowed for the identification of the statistical differences in animal dynamics between OTS time blocks. Block 2 separated itself significantly from all other time blocks in nearly every metric, while Block 1 was in a similar position (though not as pronounced). Blocks 3–5 showed few significant differences for metrics comparisons between each other. This indicated that the dolphins had more distinct dynamics for Blocks 1 and 2, and maintained similar dynamics patterns throughout Blocks 3–5. When combined with the joint differential entropy values, these results indicated there may be three general OTS behavior types for the dolphins in this dataset (in terms of kinematic diversity [KD]): “Low KD” at the beginning of the day (Block 1), “High KD” immediately after the first training session (Block 2), and “Medium KD” for the remainder of the day (Blocks 3–5).

Limitations and future work

Using a limited number of cameras meant full stereo coverage of the habitat was not possible, preventing a direct estimate of animal depth. Additionally, camera placements resulted in region-specific glare on the surface of the water that impeded the Faster R-CNN detector. To address these problems, cameras could be added to locations that allow for fully overlapping coverage, at angles that avoid glare in the same regions. Further, installing cameras capable of low-light recording could enable night monitoring sessions. An inherent problem with camera-based tracking is the fact that similarities between dolphin profiles make it challenging to identify individuals. This problem has been addressed in [25], where kinematic data from dolphin-mounted biologging tags were used to filter camera-based animal location data. This filtering process made it more feasible to identify which location data points corresponded to specific tagged individuals, coupling the kinematic and location data streams for these animals. Fusing the coupled tag and camera data through methods similar to [25] or [28] would then provide high-accuracy localization information to contextualize the detailed kinematics data produced by the tags.

Beyond technical improvements, the next step in this research is to use long-term animal monitoring to inform management policy and aid welfare tracking. By working closely with the ACSs at the Brookfield Zoo, we intend to use the techniques presented in this manuscript to observe animal interactions with conspecifics and enrichment over time, track activity levels, and measure the effects of changing environmental conditions (e.g. effects of varying crowd size, spectator presence). In particular, given the emphasis the dolphins placed on particular regions of the environment, it will be important to evaluate the effects of attractors in these areas by varying enrichment type, physical placement, duration/time of exposure, and by recording ACS presence and interactions within these areas. In this way, we aim to promote a high level of animal engagement and to identify potential stressors, which may aid the Zoo in caring for its dolphins.

Further, the use of unmanned drones and gliders has the potential to extend this research to wild settings. CNN-based tracking has already been applied to whales [29] and sharks [30] in the wild, and the inclusion of these vehicles will open new opportunities to make this research physically portable. The methods in this manuscript, particularly the tracklet generation, can be useful not only for identifying and localizing the animals, but also for providing basic kinematic information on entire groups, which is not generally feasible with tagging operations due to the limited number of tags available for deployment.

Conclusions

Through this research we have demonstrated a monitoring framework that significantly enhances the efficiency of both data collection and analysis of dolphin movement and behavior in a zoo setting. This work demonstrated the feasibility of a camera-based computer-automated marine animal tracking system, and explored its capabilities by analyzing the behavior and habitat use of a group of dolphins over a large time scale. From the results, we were able to quantify day-scale temporal trends in the dolphins’ spatial distributions, dynamics patterns, and kinematic diversity modes. These in turn revealed that habitat features associated with particular attractors, ACS presence in particular, served as focal points for this group of dolphins: these features correlated with closer animal proximity, greater kinematic diversity, and higher activity levels.

Supporting information

S1 Table. Kolmogorov-Smirnov session comparison.

(PDF)

Acknowledgments

The authors would like to thank the Brookfield Zoo for its aid in facilitating this research. Rita Stacey and the Seven Seas Animal Care Specialists were instrumental in helping to acquire such a large volume of data, and the help of the Zoo’s administration made this research a possibility. Finally, the authors would like to thank Sarah Breen Bartecki and William Zeigler of the Chicago Zoological Society for their continued support.

Data Availability

Data and code for this publication can be found at the following location: https://www.ebi.ac.uk/biostudies/studies/S-BSST793.

Funding Statement

This study was funded by the Granger Foundation and the Chicago Zoological Society (CZS). Authors L. Lauderdale and L. Miller are members of CZS and contributed to the study design, data collection and analysis, and manuscript revision.

References

  • 1. Kagan R, Carter S, Allard S. A Universal Animal Welfare Framework for Zoos. Journal of Applied Animal Welfare Science. 2015;18. doi: 10.1080/10888705.2015.1075830
  • 2. Miller LJ, Mellen J, Greer T, Kuczaj SA. The effects of education programmes on Atlantic bottlenose dolphin (Tursiops truncatus) behaviour. Animal Welfare. 2011;20(2):159–172.
  • 3. Whitham JC, Wielebnowski N. New directions for zoo animal welfare science. Applied Animal Behaviour Science. 2013;147(3-4):247–260. doi: 10.1016/j.applanim.2013.02.004
  • 4. Mason GJ. Species differences in responses to captivity: Stress, welfare and the comparative method. Trends in Ecology and Evolution. 2010;25(12):713–721. doi: 10.1016/j.tree.2010.08.011
  • 5. Alex Shorter K, Shao Y, Ojeda L, Barton K, Rocho-Levine J, van der Hoop J, et al. A day in the life of a dolphin: Using bio-logging tags for improved animal health and well-being. Marine Mammal Science. 2017;33(3):785–802. doi: 10.1111/mms.12408
  • 6. Clegg ILK, Borger-Turner JL, Eskelinen HC. C-Well: The development of a welfare assessment index for captive bottlenose dolphins (Tursiops truncatus). Animal Welfare. 2015;24(3):267–282. doi: 10.7120/09627286.24.3.267
  • 7. Ugaz C, Valdez RA, Romano MC, Galindo F. Behavior and salivary cortisol of captive dolphins (Tursiops truncatus) kept in open and closed facilities. Journal of Veterinary Behavior: Clinical Applications and Research. 2013;8(4):285–290. doi: 10.1016/j.jveb.2012.10.006
  • 8. Waples KA, Gales NJ. Evaluating and minimising social stress in the care of captive bottlenose dolphins (Tursiops aduncus). Zoo Biology. 2002;21(1):5–26. doi: 10.1002/zoo.10004
  • 9. Ames AE, Macgregor RP, Wielandt SJ, Cameron DM, Kuczaj SA, Hill HM. Pre- and post-partum whistle production of a bottlenose dolphin (Tursiops truncatus) social group. International Journal of Comparative Psychology. 2019;32:1–17. doi: 10.46867/ijcp.2019.32.02.02
  • 10. Harvey BS, Dudzinski KM, Kuczaj SA. Associations and the role of affiliative, agonistic, and socio-sexual behaviors among common bottlenose dolphins (Tursiops truncatus). Behavioural Processes. 2017;135:145–156. doi: 10.1016/j.beproc.2016.12.013
  • 11. Clegg ILK, Rödel HG, Cellier M, Vink D, Michaud I, Mercera B, et al. Schedule of human-controlled periods structures bottlenose dolphin (Tursiops truncatus) behavior in their free-time. Journal of Comparative Psychology. 2017;131(3):214–224. doi: 10.1037/com0000059
  • 12. Sekiguchi Y, Kohshima S. Resting behaviors of captive bottlenose dolphins (Tursiops truncatus). Physiology and Behavior. 2003;79(4-5):643–653. doi: 10.1016/S0031-9384(03)00119-7
  • 13. Walker RT, Miller LJ, Kuczaj SA, Solangi M. Seasonal, diel, and age differences in activity budgets of a group of bottlenose dolphins (Tursiops truncatus) under professional care. International Journal of Comparative Psychology. 2017;30. doi: 10.46867/ijcp.2017.30.00.05
  • 14. Karnowski J, Hutchins E, Johnson C. Dolphin detection and tracking. In: Proceedings—2015 IEEE Winter Conference on Applications of Computer Vision Workshops, WACVW 2015; 2015. p. 51–56.
  • 15. Rachinas-Lopes P, Ribeiro R, Dos Santos ME, Costa RM. D-Track—A semi-automatic 3D video-tracking technique to analyse movements and routines of aquatic animals with application to captive dolphins. PLoS ONE. 2018;13(8):e0201614. doi: 10.1371/journal.pone.0201614
  • 16. Wang SH, Zhao JW, Chen YQ. Robust tracking of fish schools using CNN for head identification. Multimedia Tools and Applications. 2017;76(22):23679–23697. doi: 10.1007/s11042-016-4045-3
  • 17. Brunetti A, Buongiorno D, Trotta GF, Bevilacqua V. Computer vision and deep learning techniques for pedestrian detection and tracking: A survey. Neurocomputing. 2018;300:17–33. doi: 10.1016/j.neucom.2018.01.092
  • 18. Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;39(6):1137–1149. doi: 10.1109/TPAMI.2016.2577031
  • 19. Redmon J, Farhadi A. YOLO9000: Better, Faster, Stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017.
  • 20. Manning T, Somarriba M, Roehe R, Turner S, Wang H, Zheng H, et al. Automated Object Tracking for Animal Behaviour Studies. In: Proceedings—2019 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2019. Institute of Electrical and Electronics Engineers Inc.; 2019. p. 1876–1883.
  • 21. Hsu HM, Xie Z, Hwang JN, Berdahl A. Robust fish enumeration by multiple object tracking in overhead videos. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 12662 LNCS. Springer, Cham; 2021. p. 434–442. Available from: https://link.springer.com/chapter/10.1007/978-3-030-68790-8_34.
  • 22. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision. vol. 2015 Inter; 2015. p. 1026–1034.
  • 23. Mathworks. Define a Custom Deep Learning Layer with Learnable Parameters; 2019. Available from: https://www.mathworks.com.
  • 24. Hartley RI. In Defense of the Eight-Point Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1997;19(6). doi: 10.1109/34.601246
  • 25. Zhang D, Gabaldon J, Lauderdale L, Johnson-Roberson M, Miller LJ, Barton K, et al. Localization and tracking of uncontrollable underwater agents: Particle filter based fusion of on-body IMUs and stationary cameras. In: Proceedings—IEEE International Conference on Robotics and Automation. vol. 2019-May. Institute of Electrical and Electronics Engineers Inc.; 2019. p. 6575–6581.
  • 26. Porter FC. Testing Consistency of Two Histograms; 2008. Available from: http://arxiv.org/abs/0804.0380.
  • 27. Moore TY, Cooper KL, Biewener AA, Vasudevan R. Unpredictability of escape trajectory explains predator evasion ability and microhabitat preference of desert rodents. Nature Communications. 2017;8(1):1–9. doi: 10.1038/s41467-017-00373-2
  • 28. Gabaldon J, Zhang D, Barton K, Johnson-Roberson M, Shorter KA. A framework for enhanced localization of marine mammals using auto-detected video and wearable sensor data fusion. In: IEEE International Conference on Intelligent Robots and Systems. vol. 2017-Septe. Institute of Electrical and Electronics Engineers Inc.; 2017. p. 2505–2510.
  • 29. Guirado E, Tabik S, Rivas ML, Alcaraz-Segura D, Herrera F. Whale counting in satellite and aerial images with deep learning. Scientific Reports. 2019;9(1):1–12. doi: 10.1038/s41598-019-50795-9
  • 30. Sharma N, Scully-Power P, Blumenstein M. Shark detection from aerial imagery using region-based CNN, a study. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 11320 LNAI. Springer, Cham; 2018. p. 224–236. Available from: https://link.springer.com/chapter/10.1007/978-3-030-03991-2_23.

Decision Letter 0

William David Halliday

28 Oct 2021

PONE-D-21-19199: Vision-based monitoring and measurement of bottlenose dolphins' daily habitat use and kinematics (PLOS ONE)

Dear Dr. Gabaldon,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. First, I must apologize to the authors for the delay on my decision. It was very difficult securing reviewers for this manuscript. This manuscript has now been reviewed by three experts in the field, and all three generally agree that the manuscript is technically quite sound. The reviews did provide a large number of comments, but these were mostly aimed at making the text more clear. Please thoroughly address all of these comments when submitting your revised manuscript.

Please submit your revised manuscript by Dec 12 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

William David Halliday, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

3. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This is a well-written manuscript that presents primary scientific research in an intelligible fashion and has been written in standard English. This research uses a dual camera system to record the study area and uses deep learning computer techniques (Convolutional Neural Networks) to detect dolphin presence and movements within the frame of view (FOV). The authors used this information to describe habitat use and calculate speed throughout the environment during different time periods of the day (blocks) and whether the dolphins were in training session (ITS) or out of training session (OTS). Habitat use was described with heatmaps and by associating environmental features such as enrichment, windows and trainer, though no statistical analyses were presented to support these assertions. Speed was velocity (fluking speed m.s-1) and was used to describe how the dolphins moved throughout the environment by being condensed into static or dynamic movement. Kolmogorov-Smirnov tests were used to independently test the differences between blocks during OTS or ITS. Independently analysing the results of ITS and OTS is valuable as during the ITS dolphins are being asked to perform specific behaviour and therefore, differences between their OTS time would be expected. More clarification around the statistics used in this manuscript is warranted as the authors have not justified their use of the Kolmogorov-Smirnov test over other available statistics or provided adequate details on the development and use of their heatmap method. This manuscript appears to present original research and the authors have made their data accessible.

The authors demonstrate that the use of deep learning computer techniques is achievable for video monitoring animals within a managed facility, and provide a good discussion of how the speed of analysis is greatly improved in comparison to manual video analyses. These conclusions are supported by the results supplied in this manuscript. Habitat use information was well-presented and reasoned; however, caution should be used when interpreting the conclusions as there was no formal test showing differences in use and environmental features. Authors state that day-scale temporal trends were able to be detected; this seems like a fair statement due to the timing of when video data was taken.

Page 1 Line 8: Authors state that “there is a strong emphasis on behavioral monitoring to inform welfare practices” but do not mention this again. Is there a suggestion that the continual video monitoring and use of CNN methods would be applicable to monitor behaviours linked to the welfare of these animals in the future? If so, the authors should provide greater detail into how these methods could benefit behavioural analyses

Page 3 Line 75: Authors provide an average age of the seven bottlenose dolphins in this study (17 +/- 12). Due to the large range in ages I do not believe this metric is a good descriptor of dolphin ages and authors should simply provide the ages.

Page 8 Line 273 onwards: In this section the authors provided details on their heatmap production method. Supplying the software that was used in generating these heatmaps would be beneficial to readers and for reproducibility. Additionally, have these methods been described previously in other literature, or is this a method that was developed by the authors? Either way, authors should state how and why they used these methods when other commonly used methods exist (for example kernel density estimation, which can have a barrier function for use with hard limits to movement, such as the walls of a pool).

Page 10 Line 325 onwards: Justification of the use of Kolmogorov-Smirnov nonparametric tests is warranted in this section. A statement of the assumptions and how the collected data have met these assumptions is important for assessing the appropriate use.

Page 14 Line 486 onwards: The linking of dolphin distribution to environmental features, such as gates and windows has sound logic, however, no formal quantitative tests were performed. The links made between these features and kinematics supports these assumptions but it is difficult to tell if this is a correlation or driver of habitat use. The authors could consider analysing the differences in habitat use using species distribution models.

General comments:

1. The authors focus their conclusions around the enhanced ability these methods provide to efficiently analyse video data. These conclusions are in line with the results; however, it would be of interest to hear how the results and the methods presented in this manuscript may be applicable to different systems. With the increased use of drones, video data is more frequently being collected for wild cetaceans; can CNN computational methods aid in analysing this data, and what are the benefits or potential uses of the methods presented in this manuscript?

2. Through describing a method that enables a more efficient analysis of video monitoring data, the authors have also created a greater understanding of the distribution (and potential environmental reasoning) for bottlenose dolphins present in this captive environment.

a. Are the differences observed in kinematics a naturally occurring behaviour, a function of the environment, or due to different stimuli occurring at different times of day?

b. If the dolphins are spending the majority of their OTS at areas where more enrichment is occurring (windows, gates etc.), are there suggestions the authors could make about how these animals are managed?

Expanding on the habitat use and kinematic results is relevant because the title and points made throughout the manuscript suggest that habitat use and kinematics are the main focus and result. If this point is true, then more emphasis should be placed on discussing the implications of these findings for these dolphins. However, upon reading the manuscript, the impression is that the authors are describing new methods for analysing video data, and that habitat use and kinematics are potential uses for this tool. If this latter point is true, then the authors should consider changing the title to suggest a methods manuscript e.g. “Application of CNN in analysing video-based monitoring data for daily habitat use and kinematics of captive bottlenose dolphins”.

3. A point of interest, the kinematics and habitat use of bottlenose dolphins was provided as a group summary, rather than for individuals. Do all dolphins consistently group together and follow the same path while in the enclosure, or, is there individual variation present. Can CNN analysis techniques detect and follow individuals through time? If the authors could provide comment on this I believe it would, at least, be a valuable discussion point as a future direction for this work.

4. The authors have provided an ethics statement that lists the names of two organisations that approved this research. Do they have an ethics/permit number that could be referred to, that they could supply?

Reviewer #2: This study describes the implementation of a camera-based auto-tracking approach to monitor dolphin locomotion in a managed area. The approach described is sound and the results suggest that it is effective for monitoring animal behavior, with the possibility to expand or enhance its performance through additional cameras or sensors. This work is consistent with other auto-tracking programs that have been described recently (reviewed in Panadeiro et al, 2021; https://doi.org/10.1038/s41684-021-00811-1), but is focused on the specific application of dolphins in a managed enclosure. Overall, I don't have any major concerns regarding this study. It is clear and well written, and will likely be of interest to scientists working in this area. One minor comment: On line 101, formal training sessions are defined as ITS, but under Table 2, it states that OTS blocks 2 & 4 are formal presentations. What is the difference between formal training and formal presentations? Perhaps consider using more distinct language to describe these?

Reviewer #3: This article struggles between two possible narratives:

(1) a new ML-based method to track dolphins in captivity, with data to show the utility of the methods, and

(2) a dolphin distribution and behavior study in captivity that uses a ML-based method to compare behaviors while in two different behavioral conditions.

I believe that the goal of the authors is the first narrative and as such, my comments below are reflective of this.

Abstract

- Lacks a "so what" or big picture. Why should someone want to read this article? What is the new innovation that helps to advance monitoring of dolphins in captivity?

Introduction

- This introduction does not prepare the reader for what is to come in the article and lacks reference to the body of work done with dolphins in captivity and in machine learning detection.

- The statement "Here we present an automated..." in paragraph 1 comes too soon. You haven't even described any of the previous research in the field. This should be in your last paragraph.

- The paragraph about tags in the wild "Biomechanics and behavioral studies..." does not add anything to the article and is not relevant given that the entire study is done in captivity and there is no comparison or attempt to implement this in the wild.

- I did not feel like the last paragraph that describes this study is accurate. It did not prepare me for what was to come. Please make sure they match in content and breadth as well as order ideas are presented in methods.

- Suggested structure: paragraph 1 = what we've previous learned from behavior studies of captive animals, paragraph 2 = methods used to study captive animals and limitations, paragraph 3 = overview of machine learning detection algorithms showing other applications of RPN and Fast R-CNN (since you are not inventing them in this study) and remaining gaps, paragraph 4 = here's what we are going to do and show in this article

Methods

- Overall, methods are not organized in a manner that the reader feels like they build on each other. We switch from behavior-related things to ML-related things. Too many details on the behavior stuff given that it's not the focus.

- First paragraph in M&M feels unnecessary and almost more appropriate to do in intro.

- None of the details about the tank feel necessary nor relevant as they are really never revisited. Maybe add these measurements to the top panel of figure 1 and then there is no need to repeat them in text.

- Figure 1 and 2 are way too busy and need to be broken into different panels. I would combine top panel of fig1 and fig 2 as 1 figure. Tracking algorithm its own figure and prob dist their own as well. Label panels (a), (b), (c) as top, bottom etc. didn't really help because there were so many panels in each figure.

- Table 2. Add your speed, yaw metrics in this table with the time intervals.

- What are the metrics of detector performance?

- False positives should be their own section

- Tracklet is first used in title so doesn't lend well to preparing reader. Use this term in intro when describing what is going to be done.

- Should reference tracklet figure earlier on in this section.

- What is proximity region? How was it defined?

- Heatmap representation is really confusing given the aims of the study to track. Why isn't an individual track shown? This would be more powerful representation that the left column panels in figure 4 and 5.

- Suggested structure: section 1 = camera setup in tank area + how much data was collected over what periods of time, section 2 = manual labelling and analysis, section 3 = neural net description + training + metrics of detect performance, section 4 = detection processing to combine frames, 5 = position uncertainty, 6 = tracklets + heatmaps to visualize, 7 = drain detector, 8 = about different behavior states and dolphins training sessions

Results

- Given that this article seems to focus a lot on the detector and tracklet, there is only a single paragraph describing the performance of these algorithms and most of it is about the detector. How was the tracklet algorithm's performance? Was there any human ground-truthing of animal trajectories to confirm the tracklet trajectories?

- Make a table to summarize performance of algorithms.

- The comparison between OTS and ITS feels odd given that it is framed as comparing behavioral states and locations of the animal. In ITS, it is really obvious that the dolphins are going to be where the algorithm finds them in front of the stage area because that is where the trainer tells them to be. So it really isn't a "distribution" or their space use in the tank. I would frame this more as a confirmation of your detector working because the detector found the dolphins where they are supposed to be and maybe should be included in detector performance

- Figure 4 and 5. The differences in the space use don't really stand out with these plots. It may be best to plot a single track of an animal in OTS and one in ITS to show the space use. There's also a ton of text in the figure captions that isn't mentioned in the main text.

- Table 3 should be in the supplementary.

- I don't think yaw is ever actually defined or described.

- Statistical comparison and entropy should not be their own sections but rather woven into the Speed + Yaw description of the animals.

- Suggested structure: section 1 = detector performance, 2 = tracklet performance, 3 = animal space use during OTS and ITS, 4 = speed + yaw of animals

Discussion

- Most of the paragraphs in the discussion read as results paragraphs and should be moved to the results. These paragraphs explain the results so much more clearly than the results section. I really didn't comprehend the results of the article until I read the discussion section.

- section kinematic diversity = 1st paragraph reads like a results paragraph

- section behavioral classification = were these results even mentioned in the results section? It feels very unclear

- Order the kinematic diversity, habitat use and behavioral classification in the same order as the results. (i.e., habitat use, behavioral classification and kinematic diversity??)

Overall Comments:

- Distribution is not the same as behavior and it feels like these are used interchangeably throughout. When you are talking about the position of the animal, you are referring to their distribution within the tank. As such, the word habitat use should also not be used as a tank is not a habitat. Space use in tank would be more appropriate. Behavior refers to the speed, yaw and dynamic swimming state.

- Statistics confirm that there are real, statistical differences and similarities in data. They aren't used to "quantify" them or "give clearer view". Be careful of how you use the KS statistics in the discussion. Really, they just tell you that the patterns, differences and similarities that you see are real and not due to lack of sample size.

- Using the word "managed" is confusing as there are many wild populations that are managed. Use the word "captivity". Make it clear to the reader that this sort of study is only possible in captivity. This setup would not work in the wild.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Feb 3;17(2):e0254323. doi: 10.1371/journal.pone.0254323.r002

Author response to Decision Letter 0


10 Jan 2022

Dear Dr. William Halliday,

Thank you for the thorough and thoughtful review of our manuscript by the associated editor and the reviewers. We have examined the reviewers’ comments carefully and made appropriate changes to the text as outlined below.

Reviewer #1

General Comments:

This is a well-written manuscript that presents primary scientific research in an intelligible fashion and has been written in standard English. This research uses a dual camera system to record the study area and uses deep learning computer techniques (Convolutional Neural Networks) to detect dolphin presence and movements within the frame of view (FOV). The authors used this information to describe habitat use and calculate speed throughout the environment during different time periods of the day (blocks) and whether the dolphins were in training session (ITS) or out of training session (OTS). Habitat use was described with heatmaps and by associating environmental features such as enrichment, windows and trainer, though no statistical analyses were presented to support these assertions. Speed was velocity (fluking speed m.s-1) and was used to describe how the dolphins moved throughout the environment by being condensed into static or dynamic movement. Kolmogorov-Smirnov tests were used to independently test the differences between blocks during OTS or ITS. Independently analysing the results of ITS and OTS is valuable as during the ITS dolphins are being asked to perform specific behaviour and therefore, differences between their OTS time would be expected. More clarification around the statistics used in this manuscript is warranted as the authors have not justified their use of the Kolmogorov-Smirnov test over other available statistics or provided adequate details on the development and use of their heatmap method. This manuscript appears to present original research and the authors have made their data accessible.

The authors demonstrate that the use of deep learning computer techniques is achievable for video monitoring animals within a managed facility, and provide a good discussion of how the speed of analysis is greatly improved in comparison to manual video analyses. These conclusions are supported by the results supplied in this manuscript. Habitat use information was well-presented and reasoned; however, caution should be used when interpreting the conclusions as there was no formal test showing differences in use and environmental features. Authors state that day-scale temporal trends were able to be detected; this seems like a fair statement due to the timing of when video data was taken.

Comment 1: The authors focus their conclusions around the enhanced ability these methods provide to efficiently analyse video data. These conclusions are in line with the results; however, it would be of interest to hear how the results and the methods presented in this manuscript may be applicable to different systems. With the increased use of drones, video data is more frequently being collected for wild cetaceans; can CNN computational methods aid in analysing this data, and what are the benefits or potential uses of the methods presented in this manuscript?

Response: This is an excellent use case for CNN tracking, and is already at work in whale (Guirado, E. et al., “Whale counting in satellite and aerial images with deep learning”) and shark (Sharma N. et al., “Shark Detection from Aerial Imagery Using Region-Based CNN, a Study”) tracking in the wild. When transitioning to wild dolphin tracking, drones and gliders will likely be the vehicles of choice. The methods in this manuscript, particularly the tracklet generation, can be useful for not only identifying and locating the animals, but also in providing basic kinematics information on entire groups, which is not generally feasible with tagging operations. This has been noted in the last paragraph of “Limitations and future work” in the Discussion.

Page 16 Discussion, Limitations and future work – Paragraph 3: Further, the use of unmanned drones and gliders has the potential to extend this research for implementation in a wild setting. CNN-based tracking is already in use for whale [Guirado et al. 2019] and shark [Sharma et al. 2018] monitoring in the wild, and the inclusion of these vehicles will open up new opportunities for making this research physically portable. The methods in this manuscript, particularly the tracklet generation, can be useful not only for identifying and localizing the animals, but also for providing basic kinematic information on entire groups, which is not generally feasible with tagging operations due to the limited number of tags available for deployment.

- Guirado E, Tabik S, Rivas ML, Alcaraz-Segura D, Herrera F. Whale counting in satellite and aerial images with deep learning. Scientific Reports. 2019;9(1):1–12.

- Sharma N, Scully-Power P, Blumenstein M. Shark detection from aerial imagery using region-based CNN, a study. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 11320 LNAI. Springer, Cham; 2018. p. 224–236.

Comment 2: Through describing a method that enables a more efficient analysis of video monitoring data, the authors have also created a greater understanding of the distribution (and potential environmental reasoning) for bottlenose dolphins present in this captive environment.

a. Are the differences observed in kinematics a naturally occurring behaviour, a function of the environment, or due to different stimuli occurring at different times of day?

b. If the dolphins are spending the majority of their OTS at areas where more enrichment is occurring (windows, gates etc.), are there suggestions the authors could make about how these animals are managed?

Response: Answering these two questions thoroughly will involve additional experimentation regarding enrichment, which is the eventual goal. This has been noted in an addition made to “Limitations and future work” in the Discussion.

Page 16 Discussion, Limitations and future work – Paragraph 2: Beyond technical improvements, the next step in this research is to use long-term animal monitoring to inform management policy and aid welfare tracking. By working closely with the ACSs at the Brookfield Zoo, we intend to use the techniques presented in this manuscript to observe animal interactions with conspecifics and enrichment over time, track activity levels, and measure the effects of changing environmental conditions (e.g. varying crowd size, spectator presence). In particular, given the emphasis the dolphins placed on particular regions of the environment, it will be important to evaluate the effects of attractors in these areas by varying enrichment type, physical placement, and duration/time of exposure, and by recording ACS presence and interactions within these areas. In this way, we aim to guarantee a high level of animal engagement and to identify potential stressors, which in turn may aid the Zoo in caring for its dolphins.

Comment 3: Expanding on the habitat use and kinematic results is relevant because the title and points made throughout the manuscript suggest that habitat use and kinematics are the main focus and result. If this point is true, then more emphasis should be placed on discussing the implications of these findings for these dolphins. However, upon reading the manuscript, the impression is that the authors are describing new methods for analysing video data, and that habitat use and kinematics are potential uses for this tool. If this latter point is true, then the authors should consider changing the title to suggest a methods manuscript e.g. “Application of CNN in analysing video-based monitoring data for daily habitat use and kinematics of captive bottlenose dolphins”.

Response: We thank the reviewer for bringing this to light, and to ensure the title more closely matches the intent of the manuscript, it has been changed to “Computer-vision object tracking for monitoring bottlenose dolphin habitat use and kinematics”, which places the “object tracking” aspect in the forefront of the title.

Comment 4: A point of interest, the kinematics and habitat use of bottlenose dolphins was provided as a group summary, rather than for individuals. Do all dolphins consistently group together and follow the same path while in the enclosure, or, is there individual variation present. Can CNN analysis techniques detect and follow individuals through time? If the authors could provide comment on this I believe it would, at least, be a valuable discussion point as a future direction for this work.

Response: It is possible for CNN techniques to identify and track individuals in video, however, in this particular case the insufficient camera resolution and environmental occlusion effects (glare regions, water surface disturbances) make continuous tracking and individual identification impossible. This has been noted in an addition to “Detector and filter performance” in the Results section, and potential solutions to this problem are presented throughout the first paragraph of “Limitations and future work” in the Discussion section.

Page 11 Results, Detector and filter performance – Paragraph 3: A note on detector limitations and the animal identification problem: while this system is robust in detecting a dolphin in-frame, it cannot track specific animals. The camera resolutions are not sufficient to resolve identifying features on the animals, and the environmental occlusions (glare regions, severe water surface disturbances) prevent continuous tracking (Fig. 3, top). As a result, while each tracklet corresponds to a single dolphin at a time, the lack of identifiability prevents individual longer-duration tracking (>30 seconds) and therefore prevents individual metrics generation. For this reason, the results in this manuscript are presented for the dolphins as a group, rather than for each individual.

Comment 5: The authors have provided an ethics statement that lists the names of two organisations that approved this research. Do they have an ethics/permit number that could be referred to, that they could supply?

Response: Unfortunately, the ethics approval for research involving the dolphins at the Brookfield Zoo predates their IACUC system, and therefore does not have a listed reference number.

Specific Comments:

Comment 1: Page 1 Line 8: Authors state that “there is a strong emphasis on behavioral monitoring to inform welfare practices” but do not mention this again. Is there a suggestion that the continual video monitoring and use of CNN methods would be applicable to monitor behaviours linked to the welfare of these animals in the future? If so, the authors should provide greater detail into how these methods could benefit behavioural analyses.

Response: The authors’ intentions to expand upon this research to fully explore the welfare monitoring question has been included in the newly added second paragraph of “Limitations and future work”, in the Discussion.

Comment 2: Page 3 Line 75: Authors provide an average age of the seven bottlenose dolphins in this study (17 +/- 12). Due to the large range in ages I do not believe this metric is a good descriptor of dolphin ages and authors should simply provide the ages.

Response: The animals’ specific ages have been added to the text of the Materials and methods.

Page 3 Materials and methods, Experimental environment – Paragraph 1: Seven bottlenose dolphins of ages 5, 5, 14, 16, 17, 33, and 36 years with lengths of 247 +/- 17 cm were observed using a dual-camera system in the Seven Seas building of the Brookfield Zoo, Brookfield IL.

Comment 3: Page 8 Line 273 onwards: In this section the authors provided details on their heatmap production method. Supplying the software that was used in generating these heatmaps would be beneficial to readers and for reproducibility. Additionally, have these methods been described previously in other literature, or is this a method that was developed by the authors? Either way, authors should state how and why they used these methods when other commonly used methods exist (for example kernel density estimation, which can have a barrier function for use with hard limits to movement, such as the walls of a pool).

Response: The code used to perform the analysis will be provided as part of the final submission. The method for generating the heatmaps in this work was developed by the authors. This specific kernel generation method was used because the volume of available data (both the animal positions collected by the cameras and the dolphins' depth information collected previously via tagging) allowed the heatmaps to be generated directly, rather than relying on estimation methods. This was noted in the "Position uncertainty" subsection of the Materials and methods.

Page 8 Materials and methods, Position uncertainty – Paragraph 1: Due to the high volume of data available to produce the underlying structure of the spatial distribution, the distribution kernels themselves could be directly generated rather than relying on estimation techniques.
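For concreteness, direct binning of this kind could look like the following minimal sketch (Python/NumPy; the habitat extent and bin size are placeholders, not the values used in the study):

```python
import numpy as np

def occupancy_heatmap(xy, x_extent=(0.0, 30.0), y_extent=(0.0, 12.0), bin_m=0.25):
    """Bin estimated dolphin positions (an (N, 2) array of x, y in metres) directly
    into a normalized spatial-occupancy map, with no density-estimation step."""
    x_edges = np.arange(x_extent[0], x_extent[1] + bin_m, bin_m)
    y_edges = np.arange(y_extent[0], y_extent[1] + bin_m, bin_m)
    counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=[x_edges, y_edges])
    return counts / counts.sum()
```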

Comment 4: Page 10 Line 325 onwards: Justification of the use of Kolmogorov-Smirnov nonparametric tests is warranted in this section. A statement of the assumptions and how the collected data have met these assumptions is important for assessing the appropriate use.

Response: The probability distributions describing the animals' dynamics metrics did not conform to any standard set of distributions, and so a more generalized statistical comparison test was required. This has been indicated in the "Kolmogorov-Smirnov statistics" subsection of the Materials and methods.

Page 10 Materials and methods, Kolmogorov-Smirnov statistics – Paragraph 1: K-S statistics were chosen to allow for nonparametric comparisons between distributions, as the metric distributions within each subtype (e.g. animal speed, yaw) did not all pertain to the same family of distributions (e.g. normal, exponential, etc.), rendering more traditional statistical comparisons ill-suited to this application.
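Such a two-sample K-S comparison can be run with SciPy as sketched below; the gamma-distributed samples are placeholders standing in for speed distributions from two time blocks, not the study's data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
speeds_block_a = rng.gamma(shape=2.0, scale=0.6, size=5000)  # placeholder speeds (m/s)
speeds_block_b = rng.gamma(shape=2.5, scale=0.6, size=5000)

stat, p_value = ks_2samp(speeds_block_a, speeds_block_b)
print(f"K-S statistic D = {stat:.3f}, p = {p_value:.3g}")
```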

Comment 5: Page 14 Line 486 onwards: The linking of dolphin distribution to environmental features, such as gates and windows has sound logic, however, no formal quantitative tests were performed. The links made between these features and kinematics supports these assumptions but it is difficult to tell if this is a correlation or driver of habitat use. The authors could consider analysing the differences in habitat use using species distribution models.

Response: We thank the reviewer for pointing out this discrepancy, and the language has been softened in this section to reflect this. The inclusion of species distribution modeling is an interesting prospect, and while currently outside the scope of this paper, would be valuable to investigate in future work.

Page 15 Discussion, Habitat use – Paragraph 2: When combined with the results from the spatial distributions (Fig. 5, left), this implies that these dolphins not only focused their attention on these regions, their presence correlated to higher activity levels in the dolphins when swimming in their vicinity.

Reviewer #2

General Comments:

This study describes the implementation of a camera-based auto-tracking approach to monitor dolphin locomotion in a managed area. The approach described is sound and the results suggest that it is effective for monitoring animal behavior, with the possibility to expand or enhance its performance through additional cameras or sensors. This work is consistent with other auto-tracking programs that have been described recently (reviewed in Panadeiro et al, 2021; https://doi.org/10.1038/s41684-021-00811-1), but is focused on the specific application of dolphins in a managed enclosure. Overall, I don't have any major concerns regarding this study. It is clear and well written, and will likely be of interest to scientists working in this area.

Specific Comments:

Comment 1: On line 101, formal training sessions are defined as ITS, but under Table 2, it states that OTS blocks 2 & 4 are formal presentations. What is the difference between formal training and formal presentations? Perhaps consider using more distinct language to describe these?

Response: We thank the reviewer for this suggestion, and in Tables 1 and 2 the term “formal presentations” has been changed to “public presentations” for clarity.

Reviewer #3

General Comments:

This article struggles between two possible narratives:

(1) a new ML-based method to track dolphins in captivity, with data to show the utility of the methods, and

(2) a dolphin distribution and behavior study in captivity that uses a ML-based method to compare behaviors while in two different behavioral conditions.

I believe that the goal of the authors is the first narrative and as such, my comments below are reflective of this.

Comment 1: Distribution is not the same as behavior and it feels like these are used interchangeably throughout. When you are talking about the position of the animal, you are referring to their distribution within the tank. As such, the word habitat use should also not be used as a tank is not a habitat. Space use in tank would be more appropriate. Behavior refers to the speed, yaw and dynamic swimming state.

Response: We thank the reviewer for pointing out this discrepancy, and to address this, use of the term “behavior” was limited to parts of the text that discuss animal dynamics states, and when discussing position was changed to “spatial distribution”. The contemporary term as used in zoo animal science for a zoo animal’s environment is “habitat”, and we have provided an abbreviated reference list of peer-reviewed papers that have used the term, both old and new:

- Maple TL, Finlay TW. Applied primatology in the modern zoo. Zoo Biology. 1989;8(S1):101-116.

- Powell DM, Baskir E. Behavior and Habitat Use Remain Diverse and Variable in Modern Zoological Exhibits over the Long-Term: Case Studies in 5 Species of Ursidae. Journal of Zoological and Botanical Gardens. 2021;2:677-704.

Comment 2: Statistics confirm that there are real, statistical differences and similarities in data. They aren't used to "quantify" them or "give clearer view". Be careful of how you use the KS statistics in the discussion. Really, they just tell you that the patterns, differences and similarities that you see are real and not due to lack of sample size.

Response: To correct this error, the language when discussing the K-S statistics has been modified to use more appropriate terms such as “confirm” or “identify”.

Page 13 Results, Statistical comparison of metrics for behavior differentiation – Paragraph 1: The K-S statistics were used to confirm the similarities and differences between time blocks within both OTS and ITS.

Page 16 Discussion, Behavior classification from dynamics metrics – Paragraph 2: In contrast, an analysis of the K-S results allowed for the identification of the statistical differences in animal dynamics between OTS time blocks.

Comment 3: Using the word "managed" is confusing as there are many wild populations that are managed. Use the word "captivity". Make it clear to the reader that this sort of study is only possible in captivity. This setup would not work in the wild.

Response: We thank the reviewer for identifying this potential point of confusion. To clarify that the animals are not wild dolphins, the use of the term “managed” has been replaced throughout the manuscript with the terms “zoo setting” and “under professional care”.

Specific Comments:

Comment 1: [Abstract] Lacks a "so what" or big picture. Why should someone want to read this article? What is the new innovation that helps to advance monitoring of dolphins in captivity?

Response: The Abstract has been re-worked to more clearly define the novelty of this research (“The resulting approach enables detailed persistent monitoring of the animals that is not possible using manual annotation methods.”) and its greater impact (“results from the proposed framework will enable future research that will offer new insights into dolphin behavior, biomechanics, and how environmental context affects movement and activity”).

Page 1 Abstract: The resulting approach enables detailed persistent monitoring of the animals that is not possible using manual annotation methods.

Page 1 Abstract: The work presented here demonstrates that CNN object detection is viable for large-scale marine mammal tracking, and results from the proposed framework will enable future research that will offer new insights into dolphin behavior, biomechanics, and how environmental context affects movement and activity.

Comment 2: This introduction does not prepare the reader for what is to come in the article and lacks reference to the body of work done with dolphins in captivity and in machine learning detection.

Response: We thank the reviewer for bringing this to light, and have restructured the Introduction to address this, taking into account the reviewers’ other comments on the section. Additional sources were included to support the choice of Faster R-CNN as the detector of choice for this research. Further, as existing research with captive animals is the focus of the second paragraph of the Introduction, additional text was added to highlight some of the foci of the research performed by direct animal monitoring within a captive environment.

Page 2, Introduction – Paragraph 2: Examples of such studies include monitoring the effects of human presence on animal behaviors, analysis of dolphin activity cycles and sleep patterns, and the evaluation of social interactions with conspecifics.

Page 2, Introduction – Paragraph 3: Faster R-CNN [Ren et al. 2017] was chosen as the backbone of the animal detection method for its prioritization of accuracy and precision regardless of object size or density in the image, as opposed to a faster single-shot detector [Redmon and Farhadi 2017]. The Faster R-CNN detector structure has demonstrated its capabilities in both land [Manning et al. 2019] and marine [Hsu et al. 2021] applications, and is considered a reliable option for challenging tracking tasks.

- Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;39(6):1137–1149.

- Redmon J, Farhadi A. YOLO9000: Better, Faster, Stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR);2017.

- Manning T, Somarriba M, Roehe R, Turner S, Wang H, Zheng H, et al. Automated Object Tracking for Animal Behaviour Studies. In: Proceedings - 2019 IEEE International Conference on Bioinformatics and Biomedicine, BIBM2019. Institute of Electrical and Electronics Engineers Inc.; 2019. p. 1876–1883.

- Hsu HM, Xie Z, Hwang JN, Berdahl A. Robust fish enumeration by multiple object tracking in overhead videos. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 12662 LNCS. Springer, Cham; 2021. p. 434–442.
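To illustrate the per-frame inference pattern such a detector implies, the sketch below uses torchvision's generic pretrained Faster R-CNN; it is not the dolphin-specific model trained for this study, the frame file name is a placeholder, and a recent torchvision release is assumed:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Generic COCO-pretrained Faster R-CNN, standing in for the study's dolphin detector.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = Image.open("overhead_frame.jpg").convert("RGB")  # placeholder frame
with torch.no_grad():
    pred = model([to_tensor(frame)])[0]

keep = pred["scores"] > 0.5   # keep confident detections only
boxes = pred["boxes"][keep]   # each box is [x1, y1, x2, y2] in pixels
print(f"{len(boxes)} detections above threshold")
```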

Comment 3: The statement "Here we present an automated..." in paragraph 1 comes too soon. You haven't even described any of the previous research in the field. This should be in your last paragraph.

Response: The statement has been removed.

Comment 4: The paragraph about tags in the wild "Biomechanics and behavioral studies..." does not add anything to the article and is not relevant given that the entire study is done in captivity and there is no comparison or attempt to implement this in the wild.

Response: The components of the paragraph discussing tag usage and wild animal studies have been removed, and the remainder was merged into the following paragraph.

Comment 5: I did not feel like the last paragraph that describes this study is accurate. It did not prepare me for what was to come. Please make sure they match in content and breadth as well as order ideas are presented in methods.

Response: The final paragraph of the introduction has been replaced with a modified version of the first paragraph of Materials and methods, which more directly states the methods that are used and the intent with which they were applied.

Page 2 Introduction – Paragraph 4: In this study, camera data were used to monitor the behavior of a group of marine mammals both qualitatively and quantitatively in a zoo setting. Camera-based animal position data were used to quantify habitat usage, as well as where and how the group of animals moved throughout the day. The position data were decomposed into kinematic metrics, and used to discriminate between two general movement states - static and dynamic - using the velocity of the tracked animals. A general ethogram of the animals' behaviors monitored in this research is presented in Table 1. Joint differential entropy computations were calculated using animal speed and heading data to provide an understanding of the dolphins' kinematic diversity. Kolmogorov-Smirnov statistical analyses of the kinematic metrics were used to compare movement patterns and activity levels over time and between behavioral conditions. The proposed framework and results presented here demonstrate the viability of computer-vision inspired techniques for this challenging monitoring problem, and will enable future studies to gain new insights into dolphin behavior and biomechanics.
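A joint differential entropy of this kind could be estimated with a simple histogram (plug-in) estimator along the following lines; this is an assumption about the general form, not the exact estimator used in the manuscript:

```python
import numpy as np

def joint_differential_entropy(speed, yaw, bins=(40, 36)):
    """Histogram-based (plug-in) estimate, in nats, of the joint differential
    entropy of paired speed and yaw samples (1-D arrays of equal length)."""
    counts, s_edges, y_edges = np.histogram2d(speed, yaw, bins=bins)
    p = counts / counts.sum()
    cell_area = np.diff(s_edges)[0] * np.diff(y_edges)[0]  # uniform bins assumed
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz] / cell_area))
```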

Comment 6: Suggested structure: paragraph 1 = what we've previous learned from behavior studies of captive animals, paragraph 2 = methods used to study captive animals and limitations, paragraph 3 = overview of machine learning detection algorithms showing other applications of RPN and Fast R-CNN (since you are not inventing them in this study) and remaining gaps, paragraph 4 = here's what we are going to do and show in this article.

Response: The suggested structure has been taken into consideration, with the modification of more of a focus on how monitoring has been performed, rather than what has been learned, as this manuscript does focus more on the tracking method than an extensive behavior review. A note has also been made in the second paragraph to provide examples of what has been performed prior to this work.

Page 2 Introduction – Paragraph 2: Examples of such studies include monitoring the effects of human presence on animal behaviors, analysis of dolphin activity cycles and sleep patterns, and the evaluation of social interactions with conspecifics.

Comment 7: Overall, methods are not organized in a manner that the reader feels like they build on each other. We switch from behavior related things to ML related things. Too many details on the behavior stuff given that its not the focus.

Response: We have removed the excess detail on dolphin behavior reinforcement. However, the remainder provides a basic explanation of the ITS time block structure, which aids in interpreting the ITS spatial maps, and these portions were left in for this reason.

Comment 8: First paragraph in M&M feels unnecessary and almost more appropriate to do in intro.

Response: We thank the reviewer for this suggestion, and have moved this paragraph to the end of the Introduction and modified it for flow.

Comment 9: None of the details about the tank feel necessary nor relevant as they are really never revisited. Maybe add these measurements to the top panel of figure 1 and then there is no need to repeat them in text.

Response: The unnecessary information on secondary habitats has been removed. Unfortunately, the structure of Figure 1 makes placement of the main habitat dimensions somewhat awkward, so these are left in the text.

Comment 10: Figure 1 and 2 are way too busy and need to be broken into different panels. I would combine top panel of fig1 and fig 2 as 1 figure. Tracking algorithm its own figure and prob dist their own as well. Label panels (a), (b), (c) as top, bottom etc. didn't really help because there were so many panels in each figure.

Response: The reviewer’s suggestions on the figures’ structures helped immensely in their simplification. Addressing these suggestions, the top panel of Figure 1 and a simplified top panel of Figure 2 were combined into a single figure for the environmental setup. The bottom panel (including popouts) of Figure 1 was isolated as a separate tracklet figure. Additionally, an image demonstrating tracklet examples was added to the tracklet figure. Figure 2 had unnecessary components of the top panel removed, but the panel was retained to support the meshing computation explanation.

Page 3 Materials and methods – Figure 1:

Fig. 1. Diagram of the experimental setup. TOP: Illustration of the main habitat, with camera placements (blue enclosures) and fields of view (gray cones). BOTTOM: Top-down individual camera views, with objects in the habitat marked: yellow, dolphin bounding boxes; green, drains; red, gates between regions; orange, underwater windows (3 total). Correlated dolphin bounding boxes are indicated by number.

Page 6 Materials and methods – Figure 2:

Fig. 2. Combined figure demonstrating camera overlap, bounding box meshing, and animal position uncertainty. TOP: Top-down individual camera views, with dolphin bounding boxes in yellow (correlating boxes are numbered). The habitat-bisecting lines (l_s) for each camera frame are indicated in solid red. Distances from Bounding Box 2 (centered on the black and gray crosshair) to the closest frame boundary (d_b) and the boundary to the bisecting line (d_l) are indicated by the white measurement bars. MIDDLE: Meshed camera views including dolphin bounding boxes (yellow), with the location uncertainty distribution (A) overlaid for Box 2. BOTTOM: 2D location uncertainty distribution (A) with major (a-a, black) and minor (b-b, red) axes labeled and separately plotted.

Page 7 Materials and methods – Figure 3:

Fig. 3. Illustration of tracklet generation. TOP: Tracklet segments (red) overlaid on a single video frame, generated by stitching the views from both cameras. Each tracklet in this frame was plotted from its inception to each corresponding dolphin's current position. While each dolphin can be tracked, the lack of clarity when underwater impedes individual identification. CENTER: x-y view of example tracklets (red and green on gray lines) of two dolphins (highlighted light orange), which are also shown in Fig. 1, top. POPOUT-RIGHT: Vector illustrations of the two example tracks. Example notation for tracklet j (red): position (p^(j,t')), velocity (v^(j,t')), yaw (θ^(j,t')), and yaw rate (θ̇^(j,t')). POPOUT-BOTTOM: Illustration of tracklet generation, with detections (stars) and tracklet proximity regions (dashed). Example notation for tracklet j (red): position (p^(j,t)), velocity (v^(j,t)), Kalman-predicted future position (p̂^(j,t+1)), true future position (p^(j,t+1)), and future animal detection (u^(j,t+1,i')).

Comment 11: Table 2. Add your speed, yaw metrics in this table with the time intervals.

Response: The speed and yaw rate metrics have been added to Table 2.

Page 4 Materials and methods – Table 2:

Table 2. The ITS blocks (1 and 3) are animal care sessions, and the OTS blocks (2 and 4) are public presentations. The corresponding mean speed and yaw rate dynamics metrics are also reported, with yaw rate converted to units of (deg s^-1) for readability.

Comment 12: What are the metrics of detector performance?

Response: The Faster R-CNN detectors were fully evaluated in Zhang D et al., “Localization and tracking of uncontrollable underwater agents: Particle filter based fusion of on-body IMUs and stationary cameras.” A reference to this paper was included in “Detector and filter performance” in the Results section, along with a summary of the results and a table providing the performance metrics of the tracking system with respect to ground-truth. The summary text and table are provided in the responses to Comments 19 and 20, respectively.

Comment 13: False positives should be their own section.

Response: False positives moved to a separate subsection, “False positive mitigation”.

Comment 14: Tracklet is first used in title so doesn't lend well to preparing reader. Use this term in intro when describing what is going to be done.

Response: The subsection title has been changed to “Temporal association of detections”.

Comment 15: Should reference tracklet figure earlier on in this section.

Response: The tracklet figure is now referenced in the first paragraph of the “Temporal association of detections” subsection in the Materials and methods.

Comment 16: What is proximity region? How was it defined?

Response: The proximity region definition has been added in the second-to-last paragraph of “Temporal association of detections” in Materials and methods.

Page 8 Materials and methods, Temporal association of detections – Paragraph 3: Using the predicted position, the k-th tracklet checked whether the closest detection in the next frame fell within the proximity region of the predicted position, defined as a circle of radius 0.8 m (heuristically tuned) centered on that position.
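The association rule quoted above can be summarized in a short sketch; only the 0.8 m proximity radius is taken from the text, while the constant-velocity prediction below stands in for the Kalman filter's predict step and is an assumption.

```python
# Sketch of the detection-to-tracklet association step: accept the closest
# next-frame detection only if it lies within a 0.8 m proximity region
# around the predicted position. The constant-velocity prediction is an
# illustrative stand-in for the paper's Kalman filter.
import numpy as np

PROXIMITY_RADIUS_M = 0.8  # heuristically tuned, per the manuscript

def predict_position(pos: np.ndarray, vel: np.ndarray, dt: float) -> np.ndarray:
    return pos + vel * dt  # constant-velocity prediction (assumption)

def associate(predicted: np.ndarray, detections: np.ndarray):
    """detections: (M, 2) candidate positions; returns index of match or None."""
    if len(detections) == 0:
        return None
    dists = np.linalg.norm(detections - predicted, axis=1)
    nearest = int(np.argmin(dists))
    return nearest if dists[nearest] <= PROXIMITY_RADIUS_M else None
```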

Comment 17: Heatmap representation is really confusing given the aims of the study to track. Why isn't an individual track shown? This would be more powerful representation that the left column panels in figure 4 and 5.

Response: Due to individual identification and environmental occlusion problems (seen in Fig. 3, top, where the tracklets will cut out when a dolphin passes below an occlusion region), a single tracklet for one animal only has a duration on the order of tens of seconds before it is deactivated due to inactivity, and the animal is then picked up later by another tracklet. It would be preferable to have individual tracks instead of heatmaps to show, but unfortunately the system cannot accomplish this as-is (this is the focus of our ICRA paper: Zhang D et al., “Localization and tracking of uncontrollable underwater agents: Particle filter based fusion of on-body IMUs and stationary cameras.”) An additional paragraph has been added to address this in “Detector and filter performance” in the Results section.

Page 11 Results, Detector and filter performance – Paragraph 3: A note on detector limitations and the animal identification problem: while this system is robust in detecting a dolphin in-frame, it cannot track specific animals. The camera resolutions are not sufficient to resolve identifying features on the animals, and the environmental occlusions (glare regions, severe water-surface disturbances) prevent continuous tracking (Fig. 3, top). As a result, while each tracklet corresponds to a single dolphin at a time, the lack of identifiability prevents longer-duration (>30 seconds) tracking of individuals, and therefore the generation of per-individual metrics. For this reason, the results in this manuscript are presented for the dolphins as a group, rather than for each individual.

Comment 18: Suggested structure: section 1 = camera setup in tank area + how much data was collected over what periods of time, section 2 = manual labelling and analysis, section 3 = neural net description + training + metrics of detect performance, section 4 = detection processing to combine frames, 5 = position uncertainty, 6 = tracklets + heatmaps to visualize, 7 = drain detector, 8 = about different behavior states and dolphins training sessions

Response: We thank the reviewer for the structure suggestion; however, the structure of the Materials and methods was kept primarily as-is to more easily cluster related sections together. Physical equipment, environment, and test subjects were stated first, to indicate what we are working with and where, and behavior states were introduced here to prepare the reader for what we intended to monitor. The neural network methods were then stated in order of computation (in this case, the Training subsection was placed after defining the network to explain what was being trained, in case readers are not familiar with CNN-based object detection). Tracklet computation was stated immediately after, to describe how both location filtering and dynamics extraction function. Spatial post-processing methods (position uncertainty and heatmap generation) and dynamics post-processing methods (joint entropy and K-S statistics) could then be ordered with either one first; however, as the focus of the section had so far remained on localization, it followed to place spatial use first and dynamics second.

Comment 19: Given that this article seems to focus a lot on the detector and tracklet, there is only a single paragraph describing the performance of these algorithms and most of it is about the detector. How was the tracklet algorithm performance? What there any human ground-truthing of animal trajectory to confirm the tracklet trajectories?

Response: The detectors used in this research were also employed in a separate study: Zhang D et al., “Localization and tracking of uncontrollable underwater agents: Particle filter based fusion of on-body IMUs and stationary cameras.” Their performance was fully evaluated in that paper, and these results have been summarized in the first paragraph of “Detector and filter performance” in the Results, along with the addition of a table providing the specific values (the table is located in the response to the immediately following comment).

Page 10 Results, Detector and filter performance – Paragraph 1: The performance of this pair of Faster R-CNN detectors, with comparisons to ground truth, was fully evaluated in [Zhang et al. 2019], with the results reported in Table 3. To summarize the performance results: two additional monitoring sessions were video recorded and tracked both manually and with the automated CNN-based tracking system from this manuscript. During these sessions, two individual dolphins were tracked by a human observer, and the results were compared to the detections produced by the automated system. Overall, for these two deployments the human tracker (representing ground truth) was able to detect the dolphins 88.1% of the time, while the CNN-based trackers were able to detect the dolphins 53.2% of the time. As a result, the automated system achieved an overall recall rate of 60.4% relative to ground truth.

- Zhang D, Gabaldon J, Lauderdale L, Johnson-Roberson M, Miller LJ, Barton K, et al. Localization and tracking of uncontrollable underwater agents: Particle filter based fusion of on-body IMUs and stationary cameras. In: Proceedings - IEEE International Conference on Robotics and Automation. vol. 2019-May. Institute of Electrical and Electronics Engineers Inc.; 2019. p. 6575–6581.
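For clarity on how these percentages relate (our interpretation, consistent with the values reported above): treating the human detectability as ground truth, recall = 0.532 / 0.881 ≈ 0.604, i.e., the 60.4% overall recall rate stated in the excerpt.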

Comment 20: Make a table to summarize performance of algorithms.

Response: The performance table has been provided (Table 3).

Page 11 Results – Table 3:

Table 3: Performance comparison between manual and CNN animal detections for two sessions as part of a separate monitoring exercise, where individual dolphins were tracked as opposed to the entire group. A1 and A2 refer to specific dolphins, with A1 being tracked over two recordings during Deployment 1, and A1 and A2 being tracked during the same recording during Deployment 2. “Detectability” is defined as the total time each individual dolphin was able to be detected by either the human or CNN trackers over each deployment period.

Comment 21: The comparison between OTS and ITS feels odd given that it is framed as comparing behavioral states and locations of the animal. In ITS, it is really obvious that the dolphins are going to be where the algorithm finds them in front of the stage area because that is where the trainer tells them to be. So it really isn't a "distribution" or their space use in the tank. I would frame this more as a confirmation of your detector working because the detector found the dolphins where they are supposed to be and maybe should be included in detector performance.

Response: In this manuscript, the primary comparison of OTS versus ITS in terms of animal behavior is framed through the analysis of the dynamics metrics. Conversely, animal spatial use is treated as a consequence of their behavior modes, and comparisons are made to highlight how spatial use changes with regard to behavior, if at all. In general, confirmation and quantification of spatial distributions of animals based on behavior mode represents a contribution unto itself, and this work demonstrates a simple representation of position dependent on behavior. It is true that observing the dolphins in regions where they are expected to be does provide validation of the performance of the detection framework, although this would primarily be qualitative rather than quantitative, which is noted in “Spatial distribution – position” in the Results.

Page 12 Results, Spatial distribution – position – Paragraph 2: This also serves to qualitatively validate that the detectors are working as expected, given the dolphins are observed to be present in a region they are commonly instructed to occupy.

Comment 22: Figure 4 and 5. The differences in the space use don't really stand out with these plots. It may be best to plot a single track of an animal in OTS and one in ITS to show the space use. There's also a ton of text in the figure captions that isn't mentioned in the main text.

Response: Unfortunately, the current state of the detection framework does not allow for individual animal identification, and as such when the animals enter and exit occlusion regions, or exit the cameras’ views entirely, re-acquisition of any particular individual is not possible. Effectively, each animal can be tracked, but any occlusion leads to loss of identity without manual intervention. An additional image has been added to the top of the tracklet figure (now Fig. 3) to illustrate this problem. An example video showing the tracking performance has been prepared for the supplementary materials, which will further support this statement. We thank the reviewer for pointing out the discrepancy with the figure captions, of which the ITS figure caption particularly needed editing (previously Fig. 5). The explanatory text from Fig. 5 (now Fig. 6) has been moved to the “Spatial distribution – speed/quiver” subsection of the Results.

Page 13 Results, Spatial distribution – speed/quiver – Paragraph 2: In general, speeds across the entire habitat are higher during public presentations than during non-public animal care sessions, because high-energy behaviors (e.g., speed swims, porpoising, breaches) are typically requested from the dolphins several times throughout a presentation. Though non-public animal care sessions also include high-energy behaviors, they additionally focus on training new behaviors and engaging in husbandry behaviors. Public presentations provide the opportunity for exercise through a variety of higher-energy behaviors, and non-public sessions afford the ability to engage in comprehensive animal care and time to work on new behaviors.

Page 12 Results – Figure 6 (caption): Spatial distributions for dynamic ITS, with position distributions along the first column and speed distributions/quiver plots along the second column. During the animal care sessions (Block 1: 09:30-10:00, Block 3: 13:00-13:30), the dolphins engaged in lower-intensity swimming throughout the habitat than during the presentation sessions (Block 2: 11:30-12:00, Block 4: 14:30-15:00). This difference is qualitatively explained by the discrepancy in the behaviors the ACSs request from the animals: high-intensity behaviors are prompted more often during presentations, while care sessions cover a wider variety of behaviors. Conversely, spatial coverage of the habitat does not vary greatly within the ITS blocks, with an expectedly high concentration on the central island, where the ACSs are located, for all ITS blocks.

Comment 23: Table 3 should be in the supplementary.

Response: Table 3 has been moved to the supplementary section as Table S1.

Comment 24: I don't think yaw is ever actually defined or described.

Response: We thank the reviewer for catching this error, and a clarification on the use of yaw in this paper has been included in the “Dolphin detection” subsection in Materials and methods.

Page 5 Materials and methods, Dolphin detection – Paragraph 1: Kinematic data in the form of position, velocity, and yaw (heading in the x-y plane) from the tracklets were then used to parameterize and form probability distributions …

Comment 25: Statistical comparison and entropy should not be their own sections but rather woven into the Speed + Yaw description of the animals.

Response: For this manuscript, the statistical comparisons were intended specifically to evaluate if metric distributions could differentiate overall group behavioral states. Alternatively, the joint entropy results were intended to evaluate the group’s kinematic diversity. As such, the titles have been changed to reflect these particular foci: “Joint entropy results for kinematic diversity analysis” and “Statistical comparison of metrics for behavior differentiation”.

Comment 26: Suggested structure: section 1 = detector performance, 2 = tracklet performance, 3 = animal space use during OTS and ITS, 4 = speed + yaw of animals

Response: We thank the reviewer for the structure suggestion for the Results section. However, the structure was kept as-is as the tracklet performance is heavily linked to the overall detector framework performance, and keeping these together is more conducive to the flow of the text. The Results section was otherwise restructured according to the reviewers’ additional comments on the Discussion section, as some components of the Discussion were more appropriate for placement within the Results.

Comment 27: Most of the paragraphs in the discussion read as results paragraphs and should be moved to the results. These paragraphs explain the results so much more clearly than the results section. I really didn't comprehend the results of the article until I read the discussion section.

Response: We thank the reviewer for this suggestion, and to address this the components of the Discussion section that pertained more closely to Results have been moved. The component of the first paragraph of “Kinematic diversity” that pertained to the Results has been moved to “Joint entropy results for kinematic diversity analysis” in the Results (text found in the response to the immediately following comment). The components pertaining to the Results in “Behavior classification from dynamics metrics” have been merged into “Statistical comparison of metrics for behavior differentiation” in the Results.

Pages 13-14 Results, Statistical comparison of metrics for behavior differentiation – Paragraphs 1-2: The K-S statistics were used to confirm the similarities and differences between time blocks within both OTS and ITS. To aid in visualizing this, Figure 7, top, displays the overlaid PDFs of the speed and yaw metrics during OTS, and Figure 7, middle, displays the PDFs during ITS. A complete table with K-S distances and α values for all six metrics is provided in Table S1 in the supplementary materials, with all values rounded to 3 digits of precision. For OTS, the K-S results showed that Blocks 1 and 2 varied the most from the others in terms of speed, as observed in Figure 7, top, while the yaw values were generally not significantly different, again observed in Figure 7 (given the high number of samples used to generate the K-S statistics, we were able to compare the significance levels against a stricter threshold of α_crit = 0.001). Across the board, Block 2 differed significantly from the rest of the OTS blocks across the most metrics, with Block 1 following close behind. In contrast, Blocks 3-5 differed the least from each other, indicating similarities in the dolphins' dynamics patterns for those blocks.

For ITS, we note that the significant differences in metrics generally followed the structure type of each ITS block: comparisons between Blocks 1 vs. 3, and 2 vs. 4, were found to be significantly different the least often. As the ACSs requested similar behaviors during ITS blocks of the same type, we expected similarities in the dynamics metrics for Blocks 1 vs. 3 (animal care sessions) and Blocks 2 vs. 4 (presentations), and differences between the metrics for blocks of different types. Of particular note are the yaw std. dev. and yaw rate std. dev. metrics, which show order-of-magnitude differences in K-S distances when comparing similar vs. different types of ITS blocks. Overall, the pattern displayed by the ITS K-S statistics in Table S1 correlated with this expectation.
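As an illustration of the block-to-block comparison described above, the following is a minimal sketch of a two-sample K-S test between the speed samples of two time blocks, using scipy as a stand-in for the manuscript's implementation; the placeholder data and sample sizes are assumptions, while the α_crit = 0.001 threshold is taken from the text.

```python
# Sketch: two-sample Kolmogorov-Smirnov comparison of a kinematic metric
# (e.g., speed) between two time blocks. The data arrays are placeholders;
# only the significance threshold follows the manuscript.
import numpy as np
from scipy.stats import ks_2samp

ALPHA_CRIT = 0.001  # stricter threshold justified by the large sample sizes

rng = np.random.default_rng(0)
speed_block_1 = rng.gamma(2.0, 1.0, size=50_000)   # placeholder samples
speed_block_2 = rng.gamma(2.5, 1.0, size=50_000)

stat, p_value = ks_2samp(speed_block_1, speed_block_2)  # K-S distance and p-value
significantly_different = p_value < ALPHA_CRIT
print(f"K-S distance = {stat:.3f}, p = {p_value:.3g}, "
      f"different: {significantly_different}")
```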

Comment 28: section kinematic diversity = 1st paragraph reads like a results paragraph

Response: The results components of this paragraph have been moved into the “Joint entropy results for kinematic diversity analysis” subsection of the Results, and the “Kinematic diversity” subsection of the Discussion has been modified to reflect this change.

Page 13 Results, Joint entropy results for kinematic diversity analysis – Paragraph 1: The time blocks in this figure are presented in chronological order, and we observed the lowest kinematic diversity in the mornings (the first blocks of both OTS and ITS), as the animal care specialists were arriving at work and setting up for the day. The highest kinematic diversity when not interacting with the ACSs then occurred immediately after the first ITS time block. In general, the first time blocks of both OTS and ITS showed the lowest kinematic diversity of their type, the second of each showed the highest, and the following blocks stabilized between the two extremes. The speed/quiver plots (Figs. 5-6, right) provide a qualitative understanding of the entropy results. For example, in Block 1 of OTS (Fig. 5, top-right) the dolphins engaged in slow swimming throughout their habitat in smooth, consistent cycles along the environment edge, yielding the lowest joint entropy. Joint entropy then increased during both the morning ITS and OTS blocks and remained elevated for the rest of the day, representing higher animal engagement through the middle of their waking hours.
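For readers unfamiliar with the kinematic-diversity metric, the following is a minimal sketch of a histogram-based estimate of the joint differential entropy of speed and yaw for one time block; the bin counts and the estimator itself are illustrative assumptions and may differ from the manuscript's computation.

```python
# Sketch: histogram-based estimate of the joint differential entropy of
# speed (m/s) and yaw (rad) for one time block. Bin counts are illustrative;
# the manuscript's estimator may differ.
import numpy as np

def joint_differential_entropy(speed, yaw, bins=(50, 36)):
    # 2D density estimate over the speed-yaw plane
    density, s_edges, y_edges = np.histogram2d(speed, yaw, bins=bins, density=True)
    cell_area = np.diff(s_edges)[:, None] * np.diff(y_edges)[None, :]
    p_mass = density * cell_area          # probability mass per bin
    nz = density > 0
    # H(X, Y) ~= -sum_i p_i * log(f_i), where f_i is the density in bin i
    return float(-np.sum(p_mass[nz] * np.log(density[nz])))  # nats
```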

Pages 14-15 Discussion, Kinematic diversity – Paragraph 1: The kinematic diversity results presented here are consistent with previous research on animal activity and sleep patterns, which reports a diurnal activity cycle for animals under professional care [Sekiguchi et al. 2003].

Comment 29: section behavioral classification = where these results even mentioned in the results section? it feels very unclear

Response: In line with the reviewer’s additional comments, the Results and Discussion behavior classification subsections have been restructured to better present the data in the former and their analysis in the latter.

Comment 30: Order the kinematic diversity, habitat use and behavioral classification in the same order as the results. (i.e., habitat use, behavioral classification and kinematic diversity??)

Response: We thank the reviewer for this suggestion, and have resolved the ordering problem.

Attachment

Submitted filename: Response to Reviewers.pdf

Decision Letter 1

William David Halliday

17 Jan 2022

Computer-vision object tracking for monitoring bottlenose dolphin habitat use and kinematics

PONE-D-21-19199R1

Dear Dr. Gabaldon,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

William David Halliday, Ph.D.

Academic Editor

PLOS ONE


Acceptance letter

William David Halliday

25 Jan 2022

PONE-D-21-19199R1

Computer-vision object tracking for monitoring bottlenose dolphin habitat use and kinematics

Dear Dr. Gabaldon:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. William David Halliday

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Kolmogorov-Smirnov session comparison.

    (PDF)

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Data Availability Statement

    Data and code for this publication can be found at the following location: https://www.ebi.ac.uk/biostudies/studies/S-BSST793.

