PLOS ONE. 2022 May 4;17(5):e0267759. doi: 10.1371/journal.pone.0267759

Deep learning with self-supervision and uncertainty regularization to count fish in underwater images

Penny Tarling 1,*, Mauricio Cantor 2,3,4,5,6,7,‡,*,#, Albert Clapés 1,8,‡,*,#, Sergio Escalera 1,8,‡,*,#
Editor: Tim Wilhelm Nattkemper
PMCID: PMC9067705  PMID: 35507631

Abstract

Effective conservation actions require effective population monitoring. However, accurately counting animals in the wild to inform conservation decision-making is difficult. Monitoring populations through image sampling has made data collection cheaper, wide-reaching and less intrusive but created a need to process and analyse this data efficiently. Counting animals from such data is challenging, particularly when densely packed in noisy images. Attempting this manually is slow and expensive, while traditional computer vision methods are limited in their generalisability. Deep learning is the state-of-the-art method for many computer vision tasks, but it has yet to be properly explored to count animals. To this end, we employ deep learning, with a density-based regression approach, to count fish in low-resolution sonar images. We introduce a large dataset of sonar videos, deployed to record wild Lebranche mullet schools (Mugil liza), with a subset of 500 labelled images. We utilise abundant unlabelled data in a self-supervised task to improve the supervised counting task. For the first time in this context, by introducing uncertainty quantification, we improve model training and provide an accompanying measure of prediction uncertainty for more informed biological decision-making. Finally, we demonstrate the generalisability of our proposed counting framework through testing it on a recent benchmark dataset of high-resolution annotated underwater images from varying habitats (DeepFish). From experiments on both contrasting datasets, we demonstrate our network outperforms the few other deep learning models implemented for solving this task. By providing an open-source framework along with training data, our study puts forth an efficient deep learning template for crowd counting aquatic animals thereby contributing effective methods to assess natural populations from the ever-increasing visual data.

Introduction

While biodiversity undergoes significant, rapid changes worldwide (e.g. [1]), informed decision-making in biological conservation depends on accurate empirical data. Monitoring natural populations can not only reveal how natural systems work, but is also invaluable for detecting unexpected changes, raising awareness and informing appropriate management decisions (e.g. [2]). However, counting organisms in the wild, particularly in underwater populations, is logistically challenging. Estimating fish abundance, for instance, is critical in the face of the global trend of marine resource overexploitation (e.g. [3–5]) and has traditionally relied either on intrusive, labour-intensive and/or indirect field methods, such as tissue sampling, underwater surveys and fisheries data (e.g. [6]). Although sampling methods can be species-specific and prone to biases [7], underwater videos and images generally offer a less invasive and cost-effective way to generate large volumes of data (e.g. [8]). However, efficiently processing and accurately analysing such volumes of data is a bottleneck. While the time-consuming task of manually processing underwater images has recently been improved by computer vision models, how to improve the automation, speed and precision of count estimates of organisms in underwater imagery remains an open problem [9, 10]. Given their sweeping success across several real-world applications, deep learning models are at the forefront of research to assess underwater natural populations from visual data.

Monitoring aquatic biota through video and photography can be particularly difficult in habitats with limited water visibility. Sonar imaging systems are increasingly employed to sample underwater populations where visibility is a constraint because they use sound energy, instead of light, to generate digital images (Fig 1(a)). Sonar technology can therefore expand the investigation of aquatic biota living in previously inaccessible underwater habitats, such as deep and very turbid waters (e.g. [11, 12]). However, the resolution of these images is inherently lower than that of single-lens reflex (SLR) underwater cameras, and it can be difficult, even for the human eye, to distinguish between “objects” that are captured without details (Fig 1(c)). Therefore, counting aquatic animals in sonar images comes with additional challenges for both human and machine. To date, there are limited tools available to process sonar images, and counting target objects in them remains problematic and not easily adapted to user needs. (For example, Echoview is a commercial software package for hydroacoustic data processing [13], delivering capabilities for water-column and bottom echosounder and sonar data processing. ESP3 is an open-source MATLAB package for visualizing and processing fisheries single-beam and split-beam acoustics data [14].) In recent studies, authors using Adaptive Resolution Imaging Sonar cameras to collect vast quantities of data were hindered by the need to manually count fish in their samples [12]. The development of sophisticated, effective computer vision models to process, and count fish quickly in, underwater visual data remains in its infancy, partially due to the lack of manually annotated visual data necessary for training computer vision models.

Fig 1. In situ sampling of Lebranche mullet in turbid waters using a sonar imaging system.


(a) Schematics of the image production by the sonar camera. The Adaptive Resolution Imaging Sonar (ARIS) uses 128 beams to project a wedge-shaped volume of acoustic energy and convert their returning echoes into a digital image that gives an overhead view of the object, here exemplified by a cylinder (reprinted from the ARIS Explorer User Guide Manual 015045_RevB under a CC BY license, with permission from ©SoundMetrics Corp, original copyright 2014; this modified image is similar, but not identical, to the original image, and is therefore for illustrative purposes). (b) In situ sonar sampling during the dolphin-fisher foraging interactions. The traditional cooperative foraging between wild dolphins and artisanal net-casting fishers targeting mullets, in the murky waters of the estuarine canal in Laguna, southern Brazil, seen from land and from a drone. Fishers wait in line at the edge of the canal for the dolphins’ foraging cues (top image: a sudden dive near the coast) which fishers interpret as the moment and place to cast their nets, presumably on top of passing mullet schools. The sonar camera (blue triangle) was deployed to record passing mullet schools at the spatial scale relevant for the interacting dolphins and fishers (6–20m). (c) Lebranche mullets (Mugil liza). A still image from a real-time underwater sonar video depicting the overhead perspective view of a passing mullet school in front of the line of fishers; a typical mullet caught by the fishers is shown (average body length = 42.9 cm ± 7.00 SD, n = 771 fish measured at the beach). (Photos by M. Cantor, A.M.S. Machado, D.R. Farine; reproduced with permission).

As with other counting applications (e.g. crowd counting in surveillance footage or counting cells in microbiological imagery), earlier computer vision methods for counting fish involved hand-crafted techniques such as blob detection [15] or the manual extraction of features such as edges to be used in regression techniques [16]. Recent attempts at detecting fish in sonar images incorporated depth-search and edge detection algorithms [17] or pixel area detection with Histogram of Gradient descriptors [18]. However, the effectiveness of the different hand-crafted approaches is bounded by the discriminative power of the manually designed set of features. Instead, deep convolutional neural networks (CNNs) are trained end-to-end, that is, the feature extraction and the learning of the meaningful patterns from those features are jointly (and automatically) optimised to solve the task at hand.

In recent years, deep CNNs have largely outperformed those more traditional approaches for object counting [19], as seen in other computer vision tasks, e.g. image classification [20] or facial recognition [21]. Deep learning therefore has potential to be the state-of-the-art solution also for underwater vision. Despite recent use of deep learning for biological monitoring in terrestrial habitats (e.g. [22, 23]), increased effort is needed for monitoring aquatic systems [10, 24, 25]. Early use of CNNs to analyse aquatic environments includes a combination of traditional detection-based and object classification methods to indirectly count organisms (e.g. [26]). More sophisticated counting approaches are gaining traction across the biological field, such as segmentation-based methods with U-Net model architectures, where every pixel in an image is classified [27], or object detection methods where “regions of interest” (RoI) are located and used to identify different objects, usually with the Faster R-CNN framework [28]. Through this methodology one can successfully detect or count animals and plants, particularly larger ones, in satellite (e.g. whales [29], elephants [30]) and aerial images (e.g. cattle [31], palm trees [32]). In underwater imaging, rather than estimating abundance directly, the focus has been on species detection and classification (e.g. [33–37]). Typically, images have been generated by SLR cameras under good underwater visibility or even constrained environments [37–39] with relatively few target objects (individually labelled) per image: this is the case with commonly used, publicly available benchmark datasets (e.g. Fish4Knowledge [40] & Rockfish [41]: average 1 fish/image with fish cropped and centred; DeepFish (“counting” subset): average 1 fish/image in natural habitats [42]). Under these conditions, RoI detection or segmentation based methods can work well for counting (e.g. instance segmentation with a Mask R-CNN [43]). A drawback of these methods is that expensive, time-consuming labelling is usually needed, with either bounding boxes to mark the location of objects or pixel-level segmentation masks. This is particularly cumbersome for images with dense populations, and it is not as effective an approach when objects overlap, for example in schools of fish.

The state-of-the-art deep learning methodology for crowd counting is a density-based approach [44]. CNNs can be trained to directly regress an image to its corresponding density map, which requires cheaper point annotations in data labelling, so has advantages over RoI detection and segmentation based methods when counting people in crowds (e.g. [45–49]) and, for example, overlapping cells [50]. The parallels between counting human crowds and counting fish (high variation in the number of target objects between images, occlusions and noise) suggest these methods should also be efficient for counting fish in underwater images. However, there are surprisingly few examples [51, 52]. We aim to solve the task of automatic fish counting in turbid natural environments in a special natural context (Fig 1b and 1c): the traditional fishing between artisanal net-casting fishers and wild dolphins targeting migrating mullet schools in southern Brazil (e.g. [53, 54]). These fisher-dolphin foraging interactions are thought to represent one of the few remaining cases of human-wildlife cooperation [55]. In a few estuaries in southern Brazil, wild Lahille’s bottlenose dolphins (Tursiops truncatus gephyreus) herd migrating mullet schools (Mugil liza) towards the coast where a line of artisanal fishers wait for stereotyped foraging behaviours by the dolphins, which they interpret as the right moment to cast their nets [53]. Although this traditional fishing practice has been considered mutually beneficial for both dolphins and fishers (e.g. [53, 56, 57]), the foraging benefits both predators accrue remain to be properly understood and quantified. The turbid waters complicate the estimation of the abundance of prey, thus requiring a reliable method such as the sonar-based underwater imaging system for assessing the mullet schools in estuarine waters with very low visibility.

Here we develop an effective density-based deep learning approach to automate the process of quantifying fish from sonar images, and provide a new dataset of manually annotated underwater images drawn from more than 105 hours (over 1 million images) of sonar video footage recorded in this natural setting (S1 Video). The number of mullet between image frames can vary drastically, from 0 to several hundred schooling fish, often densely packed and difficult to distinguish individually. Noise in data samples (e.g. from dolphins or fishing nets) needs to be correctly ignored by the automated counter. When there are biological, social and economic consequences at stake, it is imperative that the user can trust the results of an automated counter, particularly as some critics are wary of deep learning’s “black-box” interpretability [58]. We combine, for the first time, self-supervised learning and uncertainty regularization to count fish in underwater images. Our experimental results show a marked improvement in accuracy over both our baseline deep learning model with no regularizing techniques (a 42% reduction in mean absolute error (MAE)) and our experiments incorporating a balance regularizer, the method proposed by the only previous deep learning study to count schools of fish in low-resolution sonar images [51]. We also achieved a 21% reduction in MAE in comparison to benchmark experiments on the recent DeepFish dataset [42], containing fish recorded with traditional underwater cameras. Furthermore, our model outputs include a measure of prediction uncertainty, enabling informed biological decision-making. Our model is open source and can be adopted to automate the counting of aquatic organisms in other underwater image data, as we have demonstrated on a completely different underwater dataset [42].

Materials and methods

Study site and data collection

Sonar-based underwater videos were recorded to quantify the availability of mullet schools (Fig 1(c)) during the cooperative foraging interactions between Lahille’s bottlenose dolphins and artisanal net-casting fishers (Fig 1(b)). Because the water transparency at the lagoon canal is very low (from 0.3 to 1.5m visibility, measured in situ with a Secchi disk), mullet schools were recorded by deploying an Adaptive Resolution Imaging Sonar, ARIS 3000 (Sound Metrics Corp, WA, USA; Fig 1(a)). The videos were recorded in Laguna, southern Brazil, at the main dolphin-fisher interaction site, Tesoura beach (28.495775 S, 48.759996 W), a 100-meter long beach at the inlet canal connecting the Laguna lagoon system to the Atlantic Ocean (e.g. [57]). The interaction site was sampled during 18 days in May-June 2018, from 09:00 to 17:00, during the peak of the mullet reproductive migration (e.g. [59]), resulting in more than 105h of video captured at 3 frames/second (S1 Video), totalling over 1 million images of underwater footage.

Ethics statement

Research permits for field data sampling were obtained from the Brazilian Ministry of Environment (SISBio #47876–1, #64956–1).

Data generation and image pre-processing

From the more than 1 million frames, 500 were selected for manual labelling. These images were chosen from 48 videos distributed across the 18 days of field sampling to encompass a wide range of variability in terms of number of fish, noise, and environmental conditions that influence water visibility. Unused (unlabelled) images could then be selected for training the proxy self-supervised task. We have made data publicly available at [60].

Labelled data

The videos were first manually pre-processed for contrast enhancement and background removal (Fig 2a and 2b), and frames were then carefully chosen to include a wide range of possible sample types, from low to high fish counts and from minimal to substantial noise (Fig 2d–2f). Because our labelled dataset is small and we wanted to maximise the chance of our deep model adapting to, and test its ability on, wide-ranging and challenging observations, our dataset was not a representative sample of the field data collected.

Fig 2. Image pre-processing for assessing mullet abundance from sonar images.


(a) Raw frame depicting dolphins and a large mullet school. (b) Contrast enhancement and background removal. (c) Manual labelling of a sample: the large bounding box marks where the raw image was cropped so all input samples represent a consistent size of geographical area and at a consistent distance from the sonar camera. The smaller bounding boxes mark where noise (here, a dolphin) is present. Each point annotation marks the location of an individual mullet. (d-f) Examples of variation in the sonar images in our dataset, to which the density-based deep learning model needs to be adaptable to. (d) Frame with high mullet abundance: large number of fish, swimming compactly; (e) low abundance: small number of fish, sparsely distributed; (f) noise: 3 dolphins and a fishing net (note the overhead perspective of a rounded casting net).

As the videos were filmed at different ranges, the 500 images were cropped to represent a geographical area of 4 × 8.5 m, all at the same distance from the camera, so that population sampling is comparable between input samples. The images were then resized to the average size of 320 × 576 pixels using bilinear interpolation. These images were annotated using the Visual Geometry Group Image Annotator [61, 62]. A point annotation was used to mark the approximate central coordinate of each fish, and a bounding box was drawn around any noise (Fig 2(c)). The point annotations of fish can then be used to derive corresponding ground truth density maps. The bounding boxes which label noise were used for subsequent data augmentation and provide opportunity for further experimental work with this dataset. Two biologists analysed a subset of samples together to reach a consensus on how to consistently identify the number of mullet, but only one annotator then annotated the whole dataset in order to avoid including different subjective biases. The abundance of mullet in a single cropped frame ranges from 0 to 438, with a mean of 42 (Fig 3). Our dataset is imbalanced, with more samples containing zero or low numbers of fish than high, which is typical of data collected in the wild and makes the task of training deep learning models more challenging.

Fig 3. Distribution of labelled dataset by number of fish.


Number of fish plotted in log scale: the subset of data is skewed towards samples with low numbers of fish. This imbalanced distribution is even more exaggerated in the complete dataset, a common theme of data collected in the wild.

Unlabelled data

The pool of unlabelled data contains samples with mullet abundance ranging from 0 to dense schools of >500 fish/image, and multiple sources of noise: Lahille’s bottlenose dolphins (up to 4/image), fishing nets and the sea floor. The sonar range varies between 0.7–10 m (minimum) and 5.5–20.3 m (maximum), with varying camera settings affecting the clarity and distortion of objects. For use in training, simple pre-processing steps were taken: images were chosen at random, cropped to the same geographical area as above and resized to 320 × 576 pixels.
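Both the labelled and unlabelled frames pass through the same cropping and bilinear resizing. A minimal sketch of this step is given below; it is not the authors' exact pipeline, and the crop coordinates of the 4 × 8.5 m region and the function name are our assumptions (in practice they would come from each video's range settings):

```python
import tensorflow as tf

def preprocess_frame(frame, crop_box, target_hw=(320, 576)):
    """Crop a sonar frame to a fixed geographical area and resize it.

    frame: float32 array/tensor of shape (H, W, 3).
    crop_box: (y1, x1, y2, x2) pixel coordinates of the 4 x 8.5 m region
    (hypothetical argument derived from the video's range settings).
    """
    y1, x1, y2, x2 = crop_box
    cropped = frame[y1:y2, x1:x2, :]    # consistent area and distance from the sonar
    return tf.image.resize(cropped, target_hw, method="bilinear")
```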

Deep learning multi-task framework

We derived an effective deep learning architecture to count fish to the required degree of accuracy when labelled data is limited or costly to acquire and when images have been collected in the wild, and so contain noise and occlusions and, in the case of sonar images, are low-resolution. Deep models trained on small datasets are prone to overfitting, thus we regularize our learning algorithm to enhance generalisability. Our multi-task network consists of one branch that learns the supervised task of regressing an input image to an estimated fish count via a density mapping, and two parallel Siamese branches which simultaneously learn the self-supervised task of ranking unlabelled images according to the number of fish. In addition, our model learns to predict the noise variance within each labelled sample. The entire framework is shown in Fig 4.

Fig 4. Pipeline of our final network.


The multi-task network is trained end-to-end to simultaneously regress labelled images to corresponding density maps and rank the unlabelled images in order of fish abundance. The backbone of each branch is a ResNet-50 [63] followed by a 1 × 1 convolutional layer with 2 output filters. A non-learnable Global Average Pooling layer is added to each branch of the Siamese network so the resulting scalar count of the first image in the pair (I, I′) can be subtracted from that of the second image. All parameters are shared (represented by the orange dashed arrows), thus incorporating the self-supervised task adds no parameters to the base model. The inclusion of an additional channel in our output tensor to estimate noise variance only adds roughly twice the number of parameters in the head, equivalent to 0.01% of the total number. K is the batch size, where a batch contains K images from the labelled subset of data and K pairs of images from the larger unlabelled pool of data. H and W are the height and width of an input 3-channel RGB image, whereas H′ and W′ are the height and width of the output tensors from the backbone and heads.

Supervised counting task with labelled data

As is common practice in crowd counting studies, we trained our model to regress any given image $I \in [0, 255]^{H_I \times W_I \times 3}$ to a predicted count of fish. However, following the approach of other crowd counting works [45–49, 64], instead of directly regressing the count, we predict a density map $\hat{D} \in \mathbb{R}^{H \times W}$ that can then be integrated as a proxy for the count $\hat{c}$:

$\hat{c} = \sum_{i=0}^{H} \sum_{j=0}^{W} \hat{D}(i, j)$.  (1)

The backbone of our model is ResNet-50 [63], a state-of-the-art deep architecture with identity shortcut connections. Differently from the original model, the fully connected softmax layer (i.e. “FC-1000”) is replaced by a counting head consisting of a 2D convolutional layer with kernel size 1 × 1 and 2 filters. This layer produces a tensor $\hat{Z} \in \mathbb{R}^{H \times W \times 2}$, where $H = H_I/32$ and $W = W_I/32$ are the spatial units and 2 is the number of channels. The first channel corresponds to the predicted density map $\hat{D}$ and the second one to the noise variance map $\hat{V} \in \mathbb{R}^{H \times W}$ that we will discuss later.
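A minimal Keras sketch of this counting branch follows; it mirrors the description above (ResNet-50 backbone, 1 × 1 convolutional head with 2 output channels), while details the text does not specify, such as the head's activation, are our assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_counting_branch(input_shape=(320, 576, 3)):
    """ResNet-50 backbone followed by a 1x1 conv head with 2 output channels:
    channel 0 is the predicted density map, channel 1 the noise variance map."""
    inputs = tf.keras.Input(shape=input_shape)
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_tensor=inputs)
    features = backbone.output                      # spatial size (H/32, W/32)
    z_hat = layers.Conv2D(2, kernel_size=1)(features)
    return Model(inputs, z_hat, name="counting_branch")

# Predicted count (Eq 1): integrate the density channel over its spatial units.
# z = build_counting_branch()(images)              # shape (K, H', W', 2)
# c_hat = tf.reduce_sum(z[..., 0], axis=[1, 2])
```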

For the training of the model, the binary image masks resulting from manual point annotations (where 1-valued pixels denote presence of fish) are converted to the corresponding ground truth density maps by convolving a Gaussian kernel of size s and standard deviation σ. Let $\mathcal{L}_{train} = \{(I_i, D_i)\}, i = \{1, \ldots, N_{\mathcal{L}_{train}}\}$ be our labelled training dataset. The model weights can then be optimised for the task of counting fish by minimising the difference between predicted and actual counts over the training data, guided by a simple L1-norm absolute loss function. We refer to this as $\mathcal{L}_c$, the “count” loss, hereinafter. More precisely, given a batch of $K$ ($K \leq N_{\mathcal{L}_{train}}$) labelled images:

$\mathcal{L}_c = \sum_{k=1}^{K} |c_k - \hat{c}_k|$,  (2)

where $c_k = \sum_{i=0}^{H_I} \sum_{j=0}^{W_I} D_k(i, j)$ is the count integrated over the ground truth density map $D_k$, whereas $\hat{c}_k$ is the count from the predicted density map $\hat{D}_k$ in Eq 1, both corresponding to image $I_k$. Next, we show how we modify the loss to deal with aleatoric uncertainty introduced by different sources of noise by using the predicted noise variance map $\hat{V}$.
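Before turning to the uncertainty term, the conversion from point annotations to ground-truth density maps can be sketched as follows (assuming SciPy; note that SciPy controls the effective kernel size through its truncate parameter rather than an explicit s):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def points_to_density_map(point_mask, sigma=1.0):
    """Turn a binary point-annotation mask (1 at each annotated fish centre,
    0 elsewhere) into a ground-truth density map by Gaussian smoothing.
    The Gaussian kernel is normalised, so the map still sums to the fish count."""
    return gaussian_filter(point_mask.astype(np.float32), sigma=sigma)

# sanity check: integrating the density map recovers the annotated count
# mask = np.zeros((320, 576), np.float32); mask[100, 200] = mask[50, 60] = 1.0
# assert np.isclose(points_to_density_map(mask).sum(), 2.0, atol=1e-3)
```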

Regularizing the loss term: Aleatoric uncertainty

There are varying levels of noise within our dataset, as is typical of data collected in an unconstrained, wild setting. There is also the challenge of manually labelling images accurately and consistently, particularly in those displaying a large number of fish. Fig 2(d) shows an example image with a dense school of fish. The fish are seen simply as blue blobs with no detailed features, making it difficult to decipher the number present when they swim close together or even overlap. Fig 2(f) shows the level of noisy objects that can occur in images and that need to be distinguished from mullet. Furthermore, adjustments to camera settings and focus range affect the clarity or distortion of objects and the brightness of the projected image. Thus, for some data, the best achievable prediction for mullet abundance will be an uncertain prediction.

The need to model uncertainty in computer vision was highlighted by the work of [65] and these ideas were incorporated into the task of crowd counting by [66]. Aleatoric uncertainty specifically is uncertainty in the data and arises from genuine observational noise. To quantify this in regression problems, the noise variance also needs to be learned alongside the count prediction. No matter how much data we have, there will always be a degree of uncertainty in the prediction of inherently noisy images and having an understanding of this uncertainty will be invaluable. This was the motivation behind the work of [66] in crowd counting, and for our application of it to fish counting. We have focused solely on heteroscedastic aleatoric uncertainty (the assumption that observational noise varies with the input data). We hypothesise that incorporating this measure of uncertainty should not only lead to a higher level of accuracy in count predictions through optimising model training, but it will also provide the user with an understanding of the uncertainty surrounding a given result.

To do so, we estimate the noise variance of an image $I_k$, i.e. $\hat{\sigma}_k^2$, from the already produced noise variance map $\hat{V}$ as follows:

$\hat{\sigma}_k^2 = \sum_{i=0}^{H} \sum_{j=0}^{W} \hat{V}_k(i, j)$.  (3)

Then we adjust our loss function to introduce this regularizer of the aleatoric uncertainty. More precisely, for a batch of K images, the original counting loss $\mathcal{L}_c$ now becomes $\mathcal{L}_c^{au}$:

$\mathcal{L}_c^{au} = \sum_{k=1}^{K} \frac{|c_k - \hat{c}_k|}{\hat{\sigma}_k^2} + \log \hat{\sigma}_k^2$.  (4)

There is a resulting trade-off between the two components: during training, the model learns to increase the value of $\hat{\sigma}_k^2$ when the difference between $c_k$ and $\hat{c}_k$ is large, to decrease its contribution to the overall loss, but to minimise $\hat{\sigma}_k^2$ when the difference is small. This way, the model is able to learn to ignore noisy samples, weakening their impact on training. However, the noise variance component is added to the overall loss, so the model is penalised for increasing $\hat{\sigma}_k^2$, which prevents it from simply learning to make $\hat{\sigma}_k^2$ large for all samples. Note that the actual model output is $\log \hat{\sigma}_k^2$ for greater numerical stability [65] (to avoid dividing by zero). This also keeps the variance positive, so the model cannot learn to drive down the loss by making the predicted noise negative: in practice we multiply the $\mathcal{L}_c$ component by $e^{-\log \hat{\sigma}_k^2}$, which would become a large multiplier if the predicted log noise variance were very negative. To learn this parallel prediction, roughly twice the number of parameters are needed in the counting head, which is negligible compared to the overall number of parameters in the entire backbone architecture.
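In TensorFlow terms, Eq 4 with the log-variance output can be sketched as below; how the per-image scalar is aggregated from the second channel, and the variable names, are our assumptions:

```python
import tensorflow as tf

def aleatoric_count_loss(true_counts, z_hat):
    """Counting loss with aleatoric uncertainty regularization (Eq 4).

    z_hat: output tensor of shape (K, H', W', 2); channel 0 is the predicted
    density map, channel 1 is read here as the log noise variance."""
    pred_counts = tf.reduce_sum(z_hat[..., 0], axis=[1, 2])   # c_hat_k (Eq 1)
    log_var = tf.reduce_sum(z_hat[..., 1], axis=[1, 2])       # log(sigma_k^2) (Eq 3)
    abs_err = tf.abs(true_counts - pred_counts)
    # |c - c_hat| * exp(-log_var) + log_var: noisy samples can raise log_var to
    # down-weight their error, but are penalised by the additive term.
    return tf.reduce_sum(abs_err * tf.exp(-log_var) + log_var)
```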

Regularizing: Data augmentation of labelled data

We performed various augmentation techniques to increase the amount of training data. Augmentation techniques included image crops, translations, horizontal flips, small rotations, and superimposing objects (e.g. a dolphin) from one image onto a randomly chosen new image. When augmenting by cropping, the crops are placed on a new blank background, similar to [67], or superimposed onto a different image. Because the sonar camera used here was tilted down so the acoustic energy beams can reach the mullets while minimizing interference from water surface reflections when generating “bird’s-eye view” images (Fig 1(a)), the measurable size of such large fish (Fig 1(c)) will be consistent regardless of how deep (up to ∼8m) or how distant from the sonar camera (∼3 to 20m) the mullets were. Scale-awareness is therefore not a key factor to consider when training the network, so resizing of images is not used in synthetic data generation.
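One of these augmentations, cropping a labelled region onto a blank background while keeping the density map consistent, could be sketched as follows (the helper name, the crop-box convention and the random placement are our assumptions):

```python
import numpy as np

def crop_onto_blank(image, density, crop_box, rng=np.random.default_rng()):
    """Cut a region out of a labelled image and place it at a random location
    on a blank background, applying the same transform to the ground-truth
    density map so the count stays consistent. crop_box = (y1, x1, y2, x2)."""
    y1, x1, y2, x2 = crop_box
    h, w = image.shape[:2]
    ch, cw = y2 - y1, x2 - x1
    new_img, new_den = np.zeros_like(image), np.zeros_like(density)
    top = rng.integers(0, h - ch + 1)              # random vertical placement
    left = rng.integers(0, w - cw + 1)             # random horizontal placement
    new_img[top:top + ch, left:left + cw] = image[y1:y2, x1:x2]
    new_den[top:top + ch, left:left + cw] = density[y1:y2, x1:x2]
    if rng.random() < 0.5:                         # random horizontal flip
        new_img, new_den = new_img[:, ::-1], new_den[:, ::-1]
    return new_img, new_den
```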

Regularizing: Self-supervised ranking task with unlabelled data in a multi-task network

To make the most of unlabelled images and address the limited availability of labelled training data, we took inspiration from the work of [64] on crowd counting. We incorporated Siamese branches in our architecture that learn to rank pairs of images according to the number of fish based on self-supervised ranking information. We can then leverage some potentially useful information on the comparative number of fish between two images for the original counting task. Let $U = \{I_i\}, i = \{1, \ldots, N_U\}$ be the dataset of ∼1M unlabelled images, hence $N_U$ being much greater than $N_{\mathcal{L}_{train}}$. We can then define pairs of unlabelled images, e.g. $(I_i, I_i')$, where $I_i' \in P(I_i)$ is an image subregion cropped from $I_i$, thus containing equal or fewer fish. $P(\cdot)$ is simply the function that generates all subregion images from $I_i$. From $U$, we generate the dataset of all possible unlabelled image pairs $\mathcal{P}_{train} = \{(I_i, I_i')\}, i = \{1, \ldots, N_{\mathcal{P}_{train}}\}$ and pick a subsample of it whose size equates to the size of the labelled set $\mathcal{L}_{train}$. Algorithm 1 illustrates the generation of $P$.

Algorithm 1 Generate ranked pairs of unlabelled images

1: Input:
2: U = {unlabelled images}
3: Output:
4: P = {}
5: for I_i ∈ U do
6:   x1 = left x location of I_i bounding box
7:   y1 = upper y location of I_i bounding box
8:   x2 = right x location of I_i bounding box
9:   y2 = lower y location of I_i bounding box
10:  h_i = image I_i height
11:  w_i = image I_i width
12:  S_i = []        ⊳ output of P(·), a list of subregion images derived from I_i
13:  for f ∈ {0.25, 0.5, 0.75} do        ⊳ iterate over crop factors
14:    C_f = crop I_i’s region to pixels (x1, y1 + f·h_i, x2 − f·w_i, y2)
15:    choose l where 0 ≤ l ≤ f·w_i        ⊳ random horizontal translation
16:    choose u where 0 ≤ u ≤ f·h_i        ⊳ random vertical translation
17:    choose a value in {0, 1} and horizontally flip C_f if 1        ⊳ random horizontal flip
18:    I_f = new blank image of size (w_i, h_i)        ⊳ initialise a blank image on which to place the crop
19:    place crop C_f on I_f aligning the crop’s top-left corner at pixel location (u, l)
20:    S_i = S_i + [I_f]        ⊳ append to the end of the list
21:  end for
22:  S_i = [I_i] + S_i        ⊳ concatenate the original image to the front of the list
23:  for j = 1 to |S_i| do        ⊳ generate all combinations of ordered pairs
24:    for k = j + 1 to |S_i| do
25:      P = P ∪ {(S_i[j], S_i[k])}        ⊳ join sets; S_i[j] contains at least as many fish as S_i[k]
26:    end for
27:  end for
28: end for
29: return P
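For concreteness, a runnable Python sketch of this pair generation is given below; the crop geometry reflects our reading of Algorithm 1, and the placement on a blank canvas, translations and flips are omitted since they do not change the ranking:

```python
def generate_ranked_pairs(images, crop_factors=(0.25, 0.5, 0.75)):
    """For each unlabelled image, build a list of nested subregions (each
    containing no more fish than the previous one) and emit every ordered
    pair (larger-or-equal count, smaller-or-equal count)."""
    pairs = []
    for img in images:                       # img: numpy array of shape (H, W, C)
        h, w = img.shape[:2]
        subregions = [img]
        for f in crop_factors:
            # keep the region below y = f*h and to the left of x = w - f*w
            subregions.append(img[int(f * h):, : w - int(f * w)])
        for j in range(len(subregions)):
            for k in range(j + 1, len(subregions)):
                # subregions[j] contains at least as many fish as subregions[k]
                pairs.append((subregions[j], subregions[k]))
    return pairs
```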

The two Siamese ranking branches share the same backbone architecture as the counting branch. The only difference is the inclusion of a global average pooling layer at the end of the ranking branches to produce the average count estimate across the density map’s spatial units. This task is trained with a standard pairwise ranking hinge loss $\mathcal{L}_r$, applicable to a Siamese architecture. For a batch of $K$ (s.t. $K \leq N_{\mathcal{U}_{train}}$) unlabelled image pairs:

$\mathcal{L}_r = \sum_{k=1}^{K} \max(0, \hat{p}_k' - \hat{p}_k + \epsilon)$,  (5)

where $\hat{p}_k = \frac{1}{HW}\sum_{i=0}^{H}\sum_{j=0}^{W} \hat{D}_k(i, j)$ is the global average pooling over the spatial units of the predicted density map $\hat{D}_k$ corresponding to image $I_k$ in the pair, $\hat{p}_k' = \frac{1}{HW}\sum_{i=0}^{H}\sum_{j=0}^{W} \hat{D}_k'(i, j)$ is the global average pooling of $\hat{D}_k'$ corresponding to $I_k'$, and $\epsilon$ is a ranking margin set to zero here. It is known from the cropping and ordering within pairs that $p' \leq p$, and thus if the model predicts this order correctly the loss value for this pair will be 0. Otherwise, the difference of the two will be added to the total loss: the greater the difference, the greater the increase in loss. This way the model can learn the correct order within a pair according to the number of fish [64]. It is not necessary to know the exact count of either image, hence enabling the self-supervised task.
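Expressed in TensorFlow under the same notation (a sketch; names are ours), the hinge loss over a batch of pooled predictions is simply:

```python
import tensorflow as tf

def ranking_hinge_loss(p_full, p_sub, margin=0.0):
    """Pairwise ranking hinge loss (Eq 5). p_full and p_sub are the globally
    average-pooled density predictions for each image I and its subregion I'.
    Since I' can contain no more fish than I, the loss is zero whenever the
    predicted order is correct (up to the margin, set to zero here)."""
    return tf.reduce_sum(tf.maximum(0.0, p_sub - p_full + margin))
```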

We trained this task together with the supervised one in a multi-task network. The model is trained end-to-end, simultaneously learning to rank the unlabelled data and to count the number of fish in labelled data. The weights are shared across all branches, which means no additional layers are included and the number of trainable parameters is the same as for the supervised task on its own. The input to the multi-task model is a mini-batch consisting of K images from the labelled dataset and K pairs of images from the unlabelled dataset. The ranking loss is simply added to the supervised counting loss $\mathcal{L}_c^{au}$ to achieve this:

$\mathcal{L} = \mathcal{L}_c^{au} + \mathcal{L}_r$.  (6)

As shown later in the experiments, training the self-supervised ranking task (with more training data) improves the performance and generalisability of the supervised task of counting fish in labelled image data. We therefore know that any improvement in results is not due to more complexity in the model.

Results

Experimental setup

Dataset split and evaluation metrics

The labelled dataset was randomly split into a holdout partition of 350 training images, 70 validation, and 80 test. We made sure that the distribution of data in these sets was reasonably consistent to minimise bias in results. Validation is used for early-stopping during training and hyperparameter optimisation. Once trained, we ran inference on the test data to analyse performance.

Following the common practice in counting by deep learning literature, we evaluate test results using the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE):

$MAE = \frac{1}{N_{\mathcal{L}_{test}}} \sum_{k=1}^{N_{\mathcal{L}_{test}}} |c_k - \hat{c}_k|$  (7)
$RMSE = \sqrt{\frac{1}{N_{\mathcal{L}_{test}}} \sum_{k=1}^{N_{\mathcal{L}_{test}}} (c_k - \hat{c}_k)^2}$  (8)

where $N_{\mathcal{L}_{test}}$ is the number of test samples.
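Both metrics are computed directly from the predicted and ground-truth counts over the test set, e.g.:

```python
import numpy as np

def mae_rmse(true_counts, pred_counts):
    """Evaluation metrics of Eqs 7 and 8."""
    err = np.asarray(true_counts, dtype=np.float64) - np.asarray(pred_counts, dtype=np.float64)
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))
```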

Information-entropy-based balance regularization

For further comparison, we also validate our uni-task and multi-task networks with and without the balance regularizer of [51], to the best of our knowledge the only other attempt to count fish in multi-beam sonar images with deep learning. In that study [51], the authors incorporated a regularizing technique to increase the weight of less common samples in their dataset, i.e. those with high numbers of fish. This in turn increased the overall accuracy of the network’s predictions, particularly for this subgroup of samples. Our dataset is imbalanced, as commonly seen with data collected in the wild [51]: there are far more images containing fewer than 50 fish than images with several hundred. We therefore adjust our loss function to weight samples during the online compilation of mini-batches. We group all images into 3 classes according to the number of fish, c:

  • Class 1: 75% of images, c < 50

  • Class 2: 18% of images, 50 ≤ c < 150

  • Class 3: 7% of images, c ≥ 150

The following “information-entropy-based” balance regularizer (IEB-reg), $\mathcal{L}_{ieb}$, is then applied so that the weight given to a sample is negatively correlated with the number of samples of its same class in the batch. For a batch of $K$ ($K \leq N_{\mathcal{L}}$) images:

$\mathcal{L}_{ieb} = -\sum_{k=1}^{K} \log\left(\frac{K_{class(k)}}{K}\right) |c_k - \hat{c}_k|$,  (9)

where $K_{class(k)}$ is the number of images of the same class as $I_k$ in the batch. The $\mathcal{L}_{ieb}$ regularizing term is then added to the absolute loss function $\mathcal{L}_c$:

$\mathcal{L} = \mathcal{L}_c + \lambda \mathcal{L}_{ieb}$.  (10)

We found the hyperparameter λ optimal at 0.1 for our study. Higher values caused the term $\mathcal{L}_{ieb}$ to dominate over $\mathcal{L}_c$, meaning the model was unable to properly learn. Note this differs from [51], who used λ = 1, but they worked with patches containing at most 8 fish, so the absolute count differences in their dense inputs were much smaller. In contrast, the absolute differences in our densest images can be over 50× greater than in our sparsest ones, and these large differences occur precisely in the less common images to which the regularizer assigns more weight.
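A TensorFlow sketch of Eq 9, grouping samples into the three classes above, is given below (the helper name and the way class labels are passed in are our assumptions):

```python
import tensorflow as tf

def ieb_regularizer(true_counts, pred_counts, class_ids):
    """Information-entropy-based balance regularizer (Eq 9), following [51].
    class_ids: int32 class index (0, 1 or 2) of each sample in the batch,
    assigned from its ground-truth count as described above."""
    k = tf.cast(tf.shape(true_counts)[0], tf.float32)
    per_class = tf.math.bincount(class_ids, minlength=3)      # samples per class in the batch
    k_class = tf.cast(tf.gather(per_class, class_ids), tf.float32)
    weights = -tf.math.log(k_class / k)                        # rarer classes get larger weight
    return tf.reduce_sum(weights * tf.abs(true_counts - pred_counts))

# total loss per Eq 10: loss = count_loss + 0.1 * ieb_loss   (lambda = 0.1 in this study)
```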

Ablation experiments

To evaluate the performance of our overall methodology and the different individual regularizing components, we carried out 9 ablation studies (Table 1). All 9 trained models were tested on the 80-image test set for comparison. In order to alleviate the stochastic behavior caused by the random initialization of the deep models’ weights and the optimization algorithm, we ran 3 trials of each ablation study and averaged across these.

Table 1. Overview of 9 ablation studies.

Method  Loss function  Number of train samples  Weight initialization
(i) Uni-task (UT)  L = L_c  350  ImageNet // Xavier
(ii)  + IEB-reg  L = L_c + λL_ieb  350  ImageNet // Xavier
(iii)  + AU-reg  L = L_c^au  350  ImageNet // Xavier
(iv)  + IEB-reg & AU-reg  L = L_c^au + λL_ieb  350  ImageNet // Xavier
(v)  + augmented data  L = L_c  5,672  UT (i)
(vi) Multi-task (MT)  L = L_c + L_r  5,672 + 5,672 pairs  UT (i)
(vii)  + IEB-reg  L = L_c + L_r + λL_ieb  5,672 + 5,672 pairs  UT + IEB-reg (ii)
(viii)  + AU-reg  L = L_c^au + L_r  5,672 + 5,672 pairs  UT + AU-reg (iii)
(ix)  + IEB-reg & AU-reg  L = L_c^au + L_r + λL_ieb  5,672 + 5,672 pairs  UT + AU-reg (iii)

Uni-task (UT) models (i-iv) were initialised with ImageNet weights [63, 68] (ResNet-50 backbone) and the Xavier initialisation method [69] (final convolutional layer). UT + augmented data (v) and the multi-task (MT) models (vi-ix) were initialised with trained weights from their corresponding UT network. To maximise chances of success, MT + IEB-reg & AU-reg (ix) was initialised with UT + AU-reg (iii) instead of UT + IEB-reg & AU-reg (iv), as (iii) produced better results than (iv).

Out of the 9 ablation studies, the method that combined all our proposed regularizing techniques, the multi-task with aleatoric uncertainty regularization, MT + AU-reg (viii), achieved the lowest MAE score of 6.48. The pure multi-task network, MT (vi), achieved the lowest RMSE score of 14.27. These are a marked improvement over the baseline uni-task network, UT (i), whose MAE and RMSE were 11.09 and 23.88 respectively, and over the models where the balance regularizer (IEB-reg) [51] was implemented, UT + IEB-reg (ii) and MT + IEB-reg (vii).

Results of the experiments are shown in Table 2. Additionally, Fig 5(a) shows the results of each method split by 5 subgroups of the data according to the number of fish present, to understand the effect of each ablation study on them. The subgroups shown are in line with the categories chosen for assigning weights according to the balance regularizer (IEB-reg), except that the most common group of c < 50 fish/sample has been broken down further to show results for samples with fewer than 25 fish separately; in these images the fish are mostly sparsely distributed and relatively easy to count, whereas beyond 25 fish more occlusions between individuals occur. Samples which contain large elements of noise, usually either dolphins or fishing nets, are also shown as a separate subgroup to examine how each model, with its associated loss term, handles this particular challenge. Instead of plotting the MAE, the absolute error in a given sample prediction has been divided by the average ground truth fish count for that group (“Normalised MAE” (NMAE), y-axis). The MAE across a subgroup g has thus been somewhat normalised and so can more reasonably be compared between subgroups:

$NMAE(g) = \frac{\sum_{k=1}^{M_g} |c_k - \hat{c}_k|}{\sum_{k=1}^{M_g} c_k}$,  (11)

where $M_g$ is the number of samples within subgroup g.

Table 2. MAE and RMSE results of 9 ablation studies.

Method Average Experiment 1 Experiment 2 Experiment 3
MAE RMSE MAE RMSE MAE RMSE MAE RMSE
(i) Uni-task (UT) 11.09 23.88 10.65 22.73 11.88 25.44 10.74 23.47
(ii)  + IEB-reg 10.27 22.01 10.28 21.91 11.52 26.99 9.00 17.13
(iii)  + AU-reg 8.89 20.24 8.79 19.16 8.37 20.45 9.50 21.10
(iv)  + IEB-reg & AU-reg 11.27 25.22 11.09 23.03 11.02 25.20 11.69 27.42
(v)  + augmented data 7.88 17.20 7.94 16.48 8.17 17.35 7.53 17.78
(vi) Multi-task (MT) 7.05 14.27 7.18 15.39 6.54 13.88 7.42 13.53
(vii)  + IEB-reg 7.87 16.67 9.91 21.81 6.16 13.13 7.53 15.07
(viii)  + AU-reg 6.48 14.81 6.26 13.66 6.31 15.12 6.88 15.65
(ix)  + IEB-reg & AU-reg 7.25 16.99 7.64 19.26 6.44 16.02 7.67 15.67

Fig 5.


Performance of the deep learning models for counting fish in sonar images: (a) Error analysis for sample subgroups, categorised by number of fish or noise present. The MAE for a sample has been divided by the average actual count within each subgroup so results are somewhat normalised and can be compared between subgroups (Eq 11). The percentages of samples that fall within each group are: c < 25: 34%, 25 ≤ c < 50: 10%, 50 ≤ c < 150: 14%, c ≥ 150: 9%, Noise: 34%. The proportion with c < 50 (our first class for balance regularization) is altogether lower than 75% because many of these samples have been put into the “noise” subgroup for this analysis. (b) The relationship between predicted noise variance and absolute error score, for models with AU-reg (iii, viii, ix). (c-f) Four sample images with corresponding ground truth and predicted density maps. Predicted density maps are from our best performing model, MT + AU-reg (viii). The density maps can be interpreted as a typical heat map where areas of red indicate dense regions of mullet.

Ablation on aleatoric uncertainty regularization

When modifying the network architecture and loss function with aleatoric uncertainty regularization (AU-reg) to quantify, alongside the prediction for fish abundance, the uncertainty in that prediction, $\hat{\sigma}_k^2$, UT + AU-reg (iii) achieved more accurate predictions than the baseline model UT (i). MAE decreased from 11.09 to 8.89, a 20% reduction, and RMSE decreased from 23.88 to 20.24, by 15% (Table 2).

Ablation on data augmentation

UT + augmented data (v), which was trained with the same architecture and loss function as UT (i) but where the training data has been augmented to increase the number of samples from 350 to 5,672, improves the baseline UT (i) MAE score from 11.09 to 7.88 and RMSE score from 23.88 to 17.20 (Table 2). This equates to a 28% reduction in both error scores.

Ablation on self-supervised task with unlabelled data (multi-task network)

As UT + augmented data (v) showed that training with additional synthetic data notably improved performance, we train all our multi-task networks (vi-ix) with this larger labelled dataset and compare our results with (v). MT (vi), which adds the pairwise ranking hinge loss to the loss term and trains on unlabelled data as well as the labelled samples, reduces the MAE score further from 7.88 to 7.05 (an 11% decrease), and the RMSE from 17.20 to 14.27 (a 17% decrease), relative to UT + augmented data (v) (Table 2). It actually achieved the lowest RMSE score out of all approaches taken, showing it is the least susceptible to extreme values. Fig 5(a) allows for a greater understanding of what is driving this improvement: the last two columns show MT (vi) better predicts samples with high numbers of fish and samples with noise compared with (v).

Ablation on multi-task network with aleatoric uncertainty regularization

When combining all three techniques in a multi-task network with aleatoric uncertainty regularization (and training on the larger labelled dataset with synthetic images), MT + AU-reg (viii), we achieved the lowest MAE error score out of all the approaches tested. MAE was 6.48, a 0.57, or 8%, reduction on the second best score from MT (vi). This is a 4.61, or 42% reduction in MAE from baseline UT (i), where none of these techniques were implemented (Table 2). MT + AU-reg (viii) achieves the best results (lowest NMAE score on the bar plot) for more densely populated fish images with limited noise, where images contain more than 50 fish (Fig 5(a), columns 3 and 4). It achieved the second best error score, behind MT (vi), for images containing substantial noise (Fig 5(a), last column).

With uncertainty regularization, we also obtain a measure of prediction uncertainty, $\hat{\sigma}^2$, alongside the prediction of fish abundance. In Fig 5(b), the measure of prediction uncertainty (outputted as $\log \hat{\sigma}^2$) is plotted against the absolute error of the count prediction for the best performing three out of four methods where uncertainty regularization has been incorporated: UT + AU-reg (iii), MT + AU-reg (viii) and MT + IEB-reg & AU-reg (ix). There is a moderate positive correlation between the absolute error score and the estimated uncertainty of a sample. For the two multi-task networks, the correlation statistic (r) reaches 0.68 and 0.73, without and with the IEB-reg respectively (p < 0.001, one-tailed test, in all cases). The noise variance predictions, or uncertainty measures, are heavily skewed towards lower values, seen in the scatter plots where a high number of samples are clustered in the bottom (left) corners. Specifically, MT + AU-reg (viii) gives a prediction uncertainty score of $0 \leq \log \hat{\sigma}^2 < 1.7$ for 90% of samples and $1.7 \leq \log \hat{\sigma}^2 < 3.7$ for just 10% of samples. In fact, nearly 50% of samples have predicted $\log \hat{\sigma}^2 = 0$.

Four test sample images along with their corresponding ground truth and predicted density maps are shown in Fig 5(c)–5(f) so results can be compared locally. The density maps can be interpreted like typical heat maps, where areas of red indicate dense regions of mullet. Prediction outputs are from MT + AU-reg (viii), the best performing model. In Fig 5(c), where mullet are in relatively high numbers with some occlusions between fish, the model gives a perfect prediction and a relatively low uncertainty measure of 0.31, so we can be confident in this prediction. This is in contrast to Fig 5(d), where the number of fish is similar but they are more densely compacted. Differences in the sonar video settings can also cause variability between images, for example anti-aliasing control can introduce distortion in the image as seen here, and so it is more difficult to distinguish between individual mullet. Whilst the model’s prediction is within 98% accuracy (error = -2), the uncertainty score is relatively high ($\log \hat{\sigma}^2 = 1.42$) for these reasons. We can see this discrepancy in uncertainty score when different types of noisy objects dominate the image: in Fig 5(e) the model is able to distinguish between dolphin and fish confidently ($\log \hat{\sigma}^2 = 0$) and gives a perfect prediction. But with the fishing net present in Fig 5(f), the uncertainty score is high, 1.53, within the highest 15% of scores. Whilst the model clearly recognises most of the noisy object correctly as not mullet, the blue blobs at the bottom middle of the image have been “incorrectly” predicted as mullet. This is difficult to decipher even for an expert human labeller: these could be mullet, as the model predicts, or splashes from the net (how they have been annotated). This uncertainty score is therefore meaningful.

Implementation and training procedure details

The ResNet-50 backbone of the uni-task networks (i-iv) was initialised with ImageNet weights [63, 68]. Transfer learning in this way has proven effective in related computer vision research [64]. Any additional layers were initialised with the Xavier initialisation method [69]. These models were trained for 300 epochs. To mitigate the effect of randomness in training deep models (due to the stochastic optimisation of many parameters), ablation studies UT + augmented data (v) and the multi-task architectures (vi-ix) were initialised with weights from the corresponding uni-task models (i-iv). The incremental effect of the synthetic data and the self-supervised task could then be evaluated. These models were trained for a further 200 epochs. The Adam optimizer [70] was used for minimising the loss term, with a learning rate of $10^{-4}$. For uni-task models (i-iv), which were initialised from the start (with ImageNet weights [63, 68]), we lowered the learning rate to $10^{-5}$ after 200 epochs. As for the size of the mini-batches, we set K = 10. In the case of the multi-task models, the number of pairs is also K, meaning 10 labelled images and 10 pairs of images are used in each training iteration (Fig 4). When generating ground truth density maps, kernel size s = 4 and standard deviation σ = 1 were used.

All models were built with Tensorflow 2.2 and Keras API.
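Putting the pieces together, one training iteration of the multi-task model could look roughly as follows, reusing the loss sketches above; the batching details and names are our assumptions rather than the exact training code:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function
def train_step(model, labelled_images, true_counts, pair_full, pair_sub):
    """One step: K labelled images plus K (full image, subregion) pairs."""
    with tf.GradientTape() as tape:
        z_lab = model(labelled_images, training=True)           # counting branch
        z_full = model(pair_full, training=True)                # Siamese branch 1
        z_sub = model(pair_sub, training=True)                  # Siamese branch 2
        loss = aleatoric_count_loss(true_counts, z_lab)         # L_c^au (Eq 4)
        p_full = tf.reduce_mean(z_full[..., 0], axis=[1, 2])    # global average pooling
        p_sub = tf.reduce_mean(z_sub[..., 0], axis=[1, 2])
        loss += ranking_hinge_loss(p_full, p_sub)               # + L_r (Eqs 5 and 6)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```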

Transfer learning and comparison with state-of-the-art approaches

Comparison with information-entropy-based balance regularization

Aleatoric uncertainty regularization (AU-reg) outperformed the IEB-reg for both the uni-task network, 8.89 vs 10.27 MAE, and the multi-task network, 6.48 vs 7.87. For a thorough evaluation, we also experimented with modifying the loss term to include both the balance and uncertainty regularizers: UT + IEB-reg & AU-reg (iv) and MT + IEB-reg & AU-reg (ix), but performance was worse when compared to the baselines, UT (i) and MT (vi), and to the networks with AU-reg only, UT + AU-reg (iii) and MT + AU-reg (viii) (Table 2).

Evaluation on DeepFish dataset

To demonstrate the generalisability of our proposed counting framework, MT + AU-reg (viii), we tested it on the recently published, publicly available benchmark dataset DeepFish [42]. DeepFish contains 40k underwater images from 20 different coastal and nearshore benthic habitats in Australia, each with a binary classification label, either “fish” or “no fish”. Alongside this, a subset of 3,200 images is annotated with a simple point annotation to mark the number of individual fish (Fig 6). This “counting” subset of data can therefore be used to train and test counting schemes. Data were collected in full HD resolution with a digital SLR underwater camera and there is a mean of 1.2 fish/image, ranging from 0–18 fish. This contrasts with our dataset, which has much lower visibility and a mean of 42 fish/image (range 0–438).

Fig 6. Three sample images from the “counting” subset of the DeepFish dataset with point level annotation.


All data were collected with HD resolution digital cameras in 20 different marine habitats in tropical Australia. Mean of 1.2 fish/image, ranging from 0–18 individuals. (a) Low algal bed, count: 3, classification: “fish” (b) Reef trench, count: 2, classification: “fish” (c) Upper mangrove, count: 0, classification: “no fish”. The images were obtained from the open-source dataset DeepFish [42], licensed under a Creative Commons Attribution 4.0 International License.

To implement our framework we incorporated a ranking task alongside the counting task, making use of the larger, weakly labelled classification dataset: the parallel Siamese architecture of our network can again learn to rank pairs of images, as it is known that an image labelled “fish” has a greater number of fish than an image labelled “no fish”. As there were so few fish per image, it did not make sense to take random crops of images (Algorithm 1), but we were still able to leverage large quantities of data that otherwise could not be used in the supervised counting branch alone. Compared to the proposed benchmark model [42], we reduced the MAE for fish count by 21%, 0.30 vs 0.38 (Table 3). We also demonstrate that this implementation benefits from transfer learning: the MAE is lower when we initialised the model with weights from previous training on our dataset rather than with ImageNet weights, 0.30 vs 0.34 (Table 3).

Table 3. MAE result on DeepFish dataset.
Method MAE
i) DeepFish (benchmark) [42] 0.38
ii) Ours—Multi-task + AU-reg (without TL) 0.34
iii) Ours—Multi-task + AU-reg (with TL) 0.30

Comparison between the DeepFish authors’ benchmark result and our model, with and without transfer learning (TL). “Without TL”: initialised with ImageNet weights [63, 68]; “with TL”: initialised with pre-trained weights from previous training on our novel sonar image dataset.
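As a sketch of how the weakly labelled classification subset can feed the ranking branch described above (the pairing strategy and names are our assumptions, not the authors' exact procedure):

```python
import itertools

def deepfish_ranked_pairs(fish_images, no_fish_images, max_pairs=None):
    """Any image labelled "fish" is known to contain more fish than any image
    labelled "no fish", so each (fish, no_fish) combination is a valid ordered
    pair for the pairwise ranking hinge loss."""
    pairs = itertools.product(fish_images, no_fish_images)
    if max_pairs is not None:
        pairs = itertools.islice(pairs, max_pairs)
    return list(pairs)
```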

Discussion

Our findings exemplify how deep learning by regression can be effectively employed to count aquatic organisms in underwater images and we present a successful, open-source framework along with novel training data. We leveraged abundant unlabelled data in a self-supervised task, to enhance performance of the direct supervised task, to count mullet fish with high accuracy in challenging low-resolution sonar imagery. By combining this with uncertainty regularization, accuracy not only improved but we provide a meaningful quantification of prediction uncertainty to the user for informed biological decision-making. Our proposed multi-task network with uncertainty regularization, MT + AU-reg (viii), achieved an MAE of 6.48, on images containing up to 438 mullet. This was a 42% improvement in error over our baseline deep learning model where no regularization techniques were implemented. Moreover, we introduce a new large dataset of sonar videos and a fully annotated subset of 500 images [60].

While the demand for processing large volumes of visual data to assess natural populations has led to a surge in the development of computer vision tools, traditional methods are limited in their generalisability. The success of deep learning in broader image analysis is encouraging, but the lack of annotated underwater data is one key factor constraining progress in its use to count aquatic animals. The majority of progress has been made in classifying species or detecting individuals where numbers are low and sparsely distributed (e.g. [36, 37, 39]). We advance this by building a widely applicable framework that tackles the task of counting when abundance can be high and occlusions from schooling fish are common. This is particularly challenging with low-resolution images, such as those from sonar cameras, and when data are collected in an unconstrained, wild environment, where noise and variation are inevitable. The techniques we adopted address these challenges and improve the accuracy and generalisability of a deep network. Furthermore, we demonstrate an improvement over the alternative, state-of-the-art deep learning methodology for analysing underwater data, both on comparable low-resolution sonar images [51] (namely the proposed balance regularizer) and on high-resolution images from a breadth of wild habitats [42].

To ensure robust assessment of our methodology, we carried out 9 different ablation studies with varying training data, model architectures and loss functions, each trialled 3 times. Prediction error on our test data improved with each novel technique implemented: from innovative ways to augment the annotated data, to building a multi-task network to simultaneously train the self-supervised task, to finally combining this with uncertainty regularisation. All weights are shared across the supervised and self-supervised tasks, and adding a channel to accommodate uncertainty prediction added an immaterial number of parameters. Thus, the improvement in performance is not simply due to a more complex model (which may generalise less well). Note also that this is the result of tests on a deliberately biased, challenging sample, chosen to contain a significantly higher proportion of noisy samples and samples with dense schools of fish.

Regularizing: Self-supervised task with unlabelled data (multi-task network)

The improvement in prediction error, both MAE and RMSE, when comparing the results of MT (vi) with the uni-task model UT + augmented data (v) (both trained on the larger labelled dataset with synthetic images), supports our hypothesis that increasing the size of the training set through unlabelled data, and training the parallel self-supervised ranking task, improves the model’s ability to count fish accurately in unseen data. In fact, MT (vi) achieves the lowest RMSE out of all 9 methods. Its performance is notable when predicting samples containing higher numbers of fish and samples with substantial noise (Fig 5(a)). We speculate that this is because by adding significantly more unlabelled data, we likely increase the number of samples that fall within these two challenging categories, which disproportionately require more training data. Samples with high numbers of fish also make up the smallest percentage of the labelled training data, adding to the likelihood that the uni-task models (i-v) will not be as well adapted to this category. These samples are also the most time-consuming to annotate, so adding more of them through unlabelled data is an efficient way to help increase the accuracy of the corresponding predictions. A common outcome of trained neural networks’ predictions is a regression to the mean of the training data; the mean of our training data is 42 fish. Thus the multi-task network’s ability to predict images with extremely high values, compared to the uni-task network, is also a positive sign that it is more robust against simply regressing to the mean.

Regularizing: Multi-task network with aleatoric uncertainty

To date, computer vision tools that have been developed for calculating fish abundance provide only an estimate of number or biomass (e.g. [43, 51, 52]), so the user has no knowledge of the uncertainty around a prediction. By contrast, our proposed model incorporates uncertainty regularization in the multi-task network, MT + AU-reg (viii). As well as providing an indication of the inherent noise present in samples that results in uncertain predictions, this regularization was found to improve MAE in both the UT (i) and MT (vi) models, by 20% and 8% respectively, with MT + AU-reg (viii) achieving the lowest MAE out of all methods. This supports our hypothesis that it improves model training, leading to more accurate predictions. Notably, MT + AU-reg (viii) improves predictions for medium to dense samples of mullet to an even greater extent than MT (vi) (Fig 5(a), columns 3 and 4). Again, this is a positive indication that it is not simply regressing predictions to the mean. It achieved the second lowest error score on samples where substantial noise is present, comparatively worse than MT (vi). This is expected because adding this regularizing term essentially allows the model to ignore noisy images in training, so it learns to predict non-noisy images more accurately, but in turn there is a trade-off in its ability to handle noisy images.

We found a moderate correlation between uncertainty and prediction error, and the uncertainty predictions were heavily skewed towards low scores (Fig 5(b)). In practice, the user can therefore treat samples with high relative noise variance scores with caution or investigate them further. Such high-uncertainty predictions are likely to be both few and meaningful in flagging a possible error in the count, and are thus manageable and useful for further human interpretation. When applying these methods to aquatic monitoring and conservation, or to fisheries and aquaculture management, over- or underestimating populations could lead to adverse consequences such as biased decisions; a greater understanding of how much a prediction can be relied on is therefore crucial.
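In practice, such a triage step could look like the following sketch (a hypothetical helper, not part of the released code; the cut-off is the user's choice).

```python
import numpy as np

def flag_for_review(counts, noise_scores, quantile=0.95):
    """Flag images whose relative uncertainty is unusually high.

    counts:       predicted fish counts, one per image.
    noise_scores: the accompanying aleatoric uncertainty scores.
    Returns a boolean mask of images the user may wish to re-inspect.
    Because the uncertainty scores are heavily skewed towards low values,
    only a small fraction of images is flagged.
    """
    counts = np.asarray(counts, dtype=float)
    noise = np.asarray(noise_scores, dtype=float)
    rel = noise / np.maximum(counts, 1.0)   # relative noise variance
    return rel > np.quantile(rel, quantile)
```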

Advances relative to previous work on deep learning to count fish

We tested our framework against the regularization method of the only previous study using deep learning to count schools of fish in sonar images [51]: we incorporated the balance regularizing term proposed therein, IEB-reg, into our uni-task and multi-task networks, UT + IEB-reg (ii) and MT + IEB-reg (vii), and ran these as additional experiments. In both cases, the respective plain models, UT (i) and MT (vi), and these models with uncertainty regularization, UT + AU-reg (iii) and MT + AU-reg (viii), achieved lower overall prediction error than the models with IEB-reg (Table 2). We expected IEB-reg to improve results on the less common, more densely populated images, since it increases the weight of these samples within a batch. This was the case when comparing UT + IEB-reg (ii) to UT (i), but surprisingly not when comparing MT + IEB-reg (vii) to MT (vi) (Fig 5(a)). One plausible explanation is that adding too much weight overfits these types of samples during training, so the model performs less well on unseen data.
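The general idea of reweighting denser, rarer samples within a batch can be sketched as follows; the exact IEB-reg term follows [51] and our Methods, so the binning and weighting here are assumptions made only for illustration.

```python
import numpy as np

def density_balanced_weights(batch_counts, n_bins=4):
    """Illustrative per-sample weights that up-weight rarer, denser images.

    batch_counts: ground-truth fish counts of the images in a batch.
    Images are binned by count and each image is weighted by the inverse
    frequency of its bin, so densely populated (rarer) images contribute
    more to the batch loss.
    """
    counts = np.asarray(batch_counts, dtype=float)
    edges = np.quantile(counts, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.digitize(counts, edges[1:-1]), 0, n_bins - 1)
    freq = np.bincount(bins, minlength=n_bins) / len(counts)
    weights = 1.0 / np.maximum(freq[bins], 1e-8)
    return weights / weights.mean()   # normalise to mean weight 1
```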

When the networks were trained with both IEB-reg and AU-reg together, it was apparent that these two regularizers do not complement each other. Prediction accuracy was worse than when only AU-reg was implemented, as highlighted by the results for samples containing substantial noise (Fig 5(a), last column): UT + IEB-reg & AU-reg (iv) and MT + IEB-reg & AU-reg (ix) produced the highest NMAE (Eq 11) scores of all the uni-task and multi-task networks, respectively. This likely results from both regularizers reducing the weighting of noisy images (which tend to contain fewer fish) during training, so that when used together the performance on this type of data at test time is worse.

We also tested our proposed model architecture on a recently published dataset, DeepFish, in which visual images were collected with a digital SLR underwater camera and are thus of an entirely different format to sonar images. By improving upon the results of the authors’ benchmark deep model [42], we demonstrate that our model can provide reasonably accurate predictions on wide-ranging underwater data: on low-resolution, monochromatic and densely populated images, as well as on high-resolution, sparsely populated images of varying habitats and fish species.

Other recent work deployed a hybrid CNN, based on multi-column and dilated CNNs, to count farmed Atlantic salmon with a density-based regression methodology [52]. Natural images were used in that study, so the fish present more distinct features, and data were collected in an enclosed mariculture net cage where other noisy objects were unlikely to be present; a direct comparison with our study is therefore difficult. A multi-column network is computationally expensive to train, but it would be interesting to incorporate a dilated backend in our network, as in [66], to see if performance can be improved further. The distinct lack of available labelled datasets was addressed in [67] by using a variety of techniques to augment thousands of side scan sonar images from a small starting dataset to count fish and dolphins, with up to 34 and 3 per image, respectively. Different from the Adaptive Resolution Imaging Sonar used in our study, side scan sonars only image non-moving targets, which, combined with the much lower number of fish detected per image, complicates any direct comparison with our results. Nevertheless, our study supports their finding that such a problem can be solved even from a relatively small initial labelled dataset by adopting data augmentation techniques. Building on this, we expand the training data with synthetic data to improve model predictions on unseen sonar data, a widely proven and adopted technique in deep learning research [71], and show that leveraging unlabelled data and incorporating uncertainty regularisation, together with data augmentation, improves performance further.
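To illustrate the label-preserving principle behind such augmentation, a minimal sketch is given below; the synthetic-image generation used in our experiments is more elaborate and is described in the Methods, so names and crop sizes here are assumptions.

```python
import numpy as np

def augment(image, density_map, rng):
    """Simple label-preserving augmentations for a sonar frame.

    The same flip and crop are applied to the image and its density map,
    so the ground-truth count is preserved under flipping and re-derived
    by summing the cropped density map.
    """
    if rng.random() < 0.5:                       # horizontal flip
        image, density_map = image[:, ::-1], density_map[:, ::-1]
    h, w = image.shape[:2]
    ch, cw = int(0.9 * h), int(0.9 * w)          # random 90% crop
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    image = image[top:top + ch, left:left + cw]
    density_map = density_map[top:top + ch, left:left + cw]
    return image, density_map, float(density_map.sum())

# Usage: rng = np.random.default_rng(0); img_a, dm_a, count_a = augment(img, dm, rng)
```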

From technical to applied relevance

Beyond the technical merit of processing low-resolution sonar-based underwater footage, and its broader relevance for quantifying the abundance of other underwater species, there is also a practical and conservation relevance in this context: the ability to automatically and accurately count mullet at the spatial scale that matters for the dolphin-fisher interaction in southern Brazil. To properly evaluate the speculated benefits of this interaction, the first steps are to quantify precisely (1) the availability of mullet schools at the very local scale at which the dolphin-fisher interaction takes place; and (2) the proportion of the available mullet caught by fishers and dolphins when interacting and when foraging independently. Quantifying these benefits for interacting dolphins and fishers is crucial to determine whether their interactions are indeed mutual and, if so, to determine the minimum conditions of prey availability under which this traditional, century-old interaction can persist and remain resilient in face of the decline of fisheries seen at local, regional and global scales (e.g. [35]). Given the real concern that the mullet stocks in southern Brazil are in decline [72], with a consequent decline in the frequency of dolphin-fisher interactions, it becomes imperative to evaluate whether these changes can collapse this unique socio-ecological system. Our deep learning tool can now be used to efficiently and effectively process the more than 1 million images collected in field sampling and to infer an estimate of the local mullet population. The uncertainty prediction accompanying each sample count will allow a greater understanding of how much each automated image count can be trusted, the first time this measure has been incorporated for counting fish with deep learning.

From labelling a dataset such as this, we understand the challenges it presents. It is difficult even for biological experts to determine fish numbers in very dense images and, at times, to distinguish between noise and fish. There will therefore very likely be inaccuracies, biases and even inconsistencies in the labelling, which will have affected the training capacity of the model and led to discrepancies between predictions and ground truths. This is likely the cause of many prediction errors, rather than a limitation of the model's capability for image analysis. Expanding this dataset with more labelled images, and using multiple annotators to reduce noise by consensus after removing poor annotations based on inter-rater agreement metrics [73, 74], would be a beneficial development, as would training the deep network on data from wider biological research and introducing the temporal dimension to capture motion information.
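As a hypothetical illustration of such a multi-annotator workflow, the sketch below fuses several annotators' density maps for the same image; the agreement filter here is a crude stand-in for the more principled consensus methods of [73, 74] and is not part of our released code.

```python
import numpy as np

def consensus_density(annotator_maps, min_corr=0.5):
    """Fuse density maps produced by several annotators of one image.

    Annotators whose map correlates poorly with the mean of the others are
    dropped; the consensus is the mean of the remaining maps.
    """
    maps = np.stack([np.asarray(m, dtype=float) for m in annotator_maps])
    keep = []
    for i in range(len(maps)):
        others = np.delete(maps, i, axis=0).mean(axis=0)
        corr = np.corrcoef(maps[i].ravel(), others.ravel())[0, 1]
        if np.isnan(corr) or corr >= min_corr:
            keep.append(i)
    keep = keep or list(range(len(maps)))   # never drop every annotator
    return maps[keep].mean(axis=0)
```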

Conclusion

Effective tools for monitoring fish stocks in the wild are required to support conservation efforts worldwide as well as wide-ranging biological research [9, 10]. Imaging devices and computer vision systems can be hugely valuable for automatically assessing underwater populations in a cheap, efficient and minimally intrusive way. However, automatically counting fish in video and image data is a challenging task, particularly when dense schools of fish and/or substantial noise are present. While traditional computer vision methods are limited in their generalisability, deep learning for counting aquatic animals in video and imagery remains largely unexplored, especially in comparison to research carried out in other applications. One potential reason for this is the lack of large, publicly available underwater video and image datasets, and the difficulty and cost of annotating them manually and accurately to train powerful deep models. In this context, we provide a novel dataset of sonar video footage of mullet recorded in a specific, special cooperative foraging system between bottlenose dolphins and artisanal fishers [60]. From this large dataset, 500 images were manually labelled with point annotations locating the contained fish. Moreover, we developed an effective density-based deep learning approach to automate the process of quantifying fish from sonar images. Our multi-task network learnt to count fish by regressing an image to its representational density map, exploiting the labelled data, while learning to rank pairs of unlabelled images by their relative fish count as a way to leverage useful information for the counting task and make the most of unlabelled data. To favour interpretability and further improve the results, aleatoric uncertainty was estimated by our network alongside the density maps. The results obtained demonstrate the effectiveness of our techniques and, in particular, the possibility of applying them to a wider range of underwater imaging, for instance traditional underwater photography such as the DeepFish dataset. In providing an open-source framework for practical use, our study puts forth a template for crowd counting animals. We make our data, code and pre-trained network available to benefit the advancement of other counting models via transfer learning, thereby contributing to the continuous development of methods for assessing natural populations in times of biological crisis.

Supporting information

S1 Video. Sonar video sample.

Underwater video generated by an Adaptive Resolution Imaging Sonar (ARIS 3000, Sound Metrics Corp, WA, USA) displaying the top view of a passing mullet school (Mugil liza) before a fishing net is cast by artisanal fishers in Laguna, southern Brazil. The video was manually pre-processed for contrast enhancement and background removal.

(MP4)

Acknowledgments

We are grateful to FG Daura-Jorge and DR Farine for the field logistics and support during fieldwork, to all researchers involved in data sampling (Machado AMS, Bezamat, B Romeu, JVS Valle-Pereira, PV Castilho, BS Silva, N da Silva, LF da Rosa, CF Alves, L Conti, D Klein, M Rodrigues), and to the anonymous referees for their insightful comments on the manuscript.

Data Availability

The code to our method and trained network parameters are openly available at https://github.com/ptarling/DeepLearningFishCounting. All the images and their corresponding ground-truth density maps used in this study, along with 126 sonar video files are available at "Cantor, M. (2021). Underwater surveys of mullet schools (Mugil liza) with Adaptive Resolution Imaging Sonar (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.4717411".

Funding Statement

This work has been partially supported by the Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE) and CERCA Programme/Generalitat de Catalunya (https://portal.mineco.gob.es), and by ICREA under the ICREA Academia programme (https://www.icrea.cat/) awarded to S.G. The data sampling was supported by research grants from the National Geographic Society (Discovery Grant WW210R-17) and the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES Brazil; https://www.gov.br/capes/pt-br; #88881.170254/2018-01) and Conselho Nacional de Pesquisa e Desenvolvimento Tecnológico (CNPq Brazil; https://www.gov.br/cnpq/pt-br; #153797/2016-9) granted to M.C. M.C. is supported by The Max Planck Society via the Department for the Ecology of Animal Societies at the Max Planck Institute of Animal Behaviour (https://www.ab.mpg.de/crofoot), and grants from the CAPES-DAAD PROBRAL Research Programme (#23038.002643/2018-01; (https://www.daad.de/en/)) and the SELA CNPq-PELD Research Program (SELA 445301/2020-1). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Cardinale BJ, Duffy JE, Gonzalez A, Hooper DU, Perrings C, Venail P, et al. Biodiversity loss and its impact on humanity. Nature. 2012;486(7401):59–67. doi: 10.1038/nature11148 [DOI] [PubMed] [Google Scholar]
  • 2. Jones JP, Asner GP, Butchart SH, Karanth KU. The ‘why’, ‘what’ and ‘how’ of monitoring for conservation. Key topics in conservation biology. 2013;2:327–343. doi: 10.1002/9781118520178.ch18 [DOI] [Google Scholar]
  • 3. Worm B, Hilborn R, Baum JK, Branch TA, Collie JS, Costello C, et al. Rebuilding global fisheries. Science. 2009;325(5940):578–585. doi: 10.1126/science.1173146 [DOI] [PubMed] [Google Scholar]
  • 4. Pauly D, Zeller D. Catch reconstructions reveal that global marine fisheries catches are higher than reported and declining. Nature communications. 2016;7(1):1–9. doi: 10.1038/ncomms10244 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Hilborn R, Amoroso RO, Anderson CM, Baum JK, Branch TA, Costello C, et al. Effective fisheries management instrumental in improving fish stock status. Proceedings of the National Academy of Sciences. 2020;117(4):2218–2224. doi: 10.1073/pnas.1909726116 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Pope KL, Lochmann SE, Young MK. Methods for assessing fish populations. In: Hubert WA, Quist MC, editors. Inland Fisheries Management in North America, 3rd edition. Bethesda, MD: American Fisheries Society; 2010. p. 325–351. [Google Scholar]
  • 7. Bacheler NM, Geraldi NR, Burton ML, Muñoz RC, Kellison GT. Comparing relative abundance, lengths, and habitat of temperate reef fishes using simultaneous underwater visual census, video, and trap sampling. Marine Ecology Progress Series. 2017;574:141–155. doi: 10.3354/meps12172 [DOI] [Google Scholar]
  • 8. Campbell MD, Pollack AG, Gledhill CT, Switzer TS, DeVries DA. Comparison of relative abundance indices calculated from two methods of generating video count data. Fisheries Research. 2015;170:125–133. 10.1016/j.fishres.2015.05.011 [DOI] [Google Scholar]
  • 9. Lamba A, Cassey P, Segaran RR, Koh LP. Deep learning for environmental conservation. Current Biology. 2019;29(19):R977–R982. doi: 10.1016/j.cub.2019.08.016 [DOI] [PubMed] [Google Scholar]
  • 10. Malde K, Handegard NO, Eikvil L, Salberg AB. Machine intelligence and the data-driven future of marine science. ICES Journal of Marine Science. 2020;77(4):1274–1285. doi: 10.1093/icesjms/fsz057 [DOI] [Google Scholar]
  • 11. Boswell KM, Wilson MP, Cowan JH Jr. A semiautomated approach to estimating fish size, abundance, and behavior from dual-frequency identification sonar (DIDSON) data. North American Journal of Fisheries Management. 2008;28(3):799–807. doi: 10.1577/M07-116.1 [DOI] [Google Scholar]
  • 12. Lankowicz KM, Bi H, Liang D, Fan C. Sonar imaging surveys fill data gaps in forage fish populations in shallow estuarine tributaries. Fisheries Research. 2020;226:105520. doi: 10.1016/j.fishres.2020.105520 [DOI] [Google Scholar]
  • 13.Echoview software;. http://www.echoview.com.
  • 14.ESP3;. https://sourceforge.net/projects/esp3/.
  • 15.Toh Y, Ng T, Liew B. Automated fish counting using image processing. In: 2009 international conference on computational intelligence and software engineering. IEEE; 2009. p. 1–5.
  • 16.Fabic J, Turla I, Capacillo J, David L, Naval P. Fish population estimation and species classification from underwater video sequences using blob counting and shape analysis. In: 2013 IEEE international underwater technology symposium (UT). IEEE; 2013. p. 1–6.
  • 17. Jing D, Han J, Wang X, Wang G, Tong J, Shen W, et al. A method to estimate the abundance of fish based on dual-frequency identification sonar (DIDSON) imaging. Fisheries science. 2017;83(5):685–697. doi: 10.1007/s12562-017-1111-3 [DOI] [Google Scholar]
  • 18. Shahrestani S, Bi H, Lyubchich V, Boswell KM. Detecting a nearshore fish parade using the adaptive resolution imaging sonar (ARIS): An automated procedure for data analysis. Fisheries Research. 2017;191:190–199. doi: 10.1016/j.fishres.2017.03.013 [DOI] [Google Scholar]
  • 19.Onoro-Rubio D, López-Sastre RJ. Towards perspective-free object counting with deep learning. In: European conference on computer vision. Springer; 2016. p. 615–629.
  • 20.Pham H, Xie Q, Dai Z, Le QV. Meta pseudo labels. arXiv preprint arXiv:200310580. 2020;.
  • 21.Yan M, Zhao M, Xu Z, Zhang Q, Wang G, Su Z. Vargfacenet: An efficient variable group convolutional neural network for lightweight face recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops; 2019. p. 0–0.
  • 22. Norouzzadeh MS, Nguyen A, Kosmala M, Swanson A, Palmer MS, Packer C, et al. Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proceedings of the National Academy of Sciences. 2018;115(25):E5716–E5725. doi: 10.1073/pnas.1719367115 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Schneider S, Taylor GW, Kremer S. Deep learning object detection methods for ecological camera trap data. In: 2018 15th Conference on computer and robot vision (CRV). IEEE; 2018. p. 321–328.
  • 24.Moniruzzaman M, Islam SMS, Bennamoun M, Lavery P. Deep learning on underwater marine object detection: A survey. In: International Conference on Advanced Concepts for Intelligent Vision Systems. Springer; 2017. p. 150–160.
  • 25.Xu W, Matzner S. Underwater fish detection using deep learning for water power applications. In: 2018 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE; 2018. p. 313–318.
  • 26.French G, Mackiewicz M, Fisher M, Challiss M, Knight P, Robinson B, et al. JellyMonitor: Automated detection of jellyfish in sonar images using neural networks. In: 2018 14th IEEE International Conference on Signal Processing (ICSP). IEEE; 2018. p. 406–412.
  • 27.Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. Springer; 2015. p. 234–241.
  • 28.Ren S, He K, Girshick R, Sun J. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:150601497. 2015;. [DOI] [PubMed]
  • 29. Guirado E, Tabik S, Rivas ML, Alcaraz-Segura D, Herrera F. Whale counting in satellite and aerial images with deep learning. Scientific reports. 2019;9(1):1–12. doi: 10.1038/s41598-019-50795-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Duporge I, Isupova O, Reece S, Macdonald DW, Wang T. Using very-high-resolution satellite imagery and deep learning to detect and count African elephants in heterogeneous landscapes. Remote Sensing in Ecology and Conservation. 2020;.
  • 31. Xu B, Wang W, Falzon G, Kwan P, Guo L, Chen G, et al. Automated cattle counting using Mask R-CNN in quadcopter vision system. Computers and Electronics in Agriculture. 2020;171:105300. doi: 10.1016/j.compag.2020.105300 [DOI] [Google Scholar]
  • 32.Ammar A, Koubaa A. Deep-Learning-based Automated Palm Tree Counting and Geolocation in Large Farms from Aerial Geotagged Images. arXiv preprint arXiv:200505269. 2020;.
  • 33.Jäger J, Simon M, Denzler J, Wolff V, Fricke-Neuderth K, Kruschel C. Croatian fish dataset: Fine-grained classification of fish species in their natural habitat. Swansea: Bmvc. 2015;.
  • 34. Salman A, Jalal A, Shafait F, Mian A, Shortis M, Seager J, et al. Fish species classification in unconstrained underwater environments based on deep learning. Limnology and Oceanography: Methods. 2016;14(9):570–585. [Google Scholar]
  • 35.Chen G, Sun P, Shang Y. Automatic fish classification system using deep learning. In: 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE; 2017. p. 24–29.
  • 36.Rathi D, Jain S, Indu S. Underwater fish species classification using convolutional neural network and deep learning. In: 2017 Ninth international conference on advances in pattern recognition (ICAPR). IEEE; 2017. p. 1–6.
  • 37. Salman A, Siddiqui SA, Shafait F, Mian A, Shortis MR, Khurshid K, et al. Automatic fish detection in underwater videos by a deep neural network-based hybrid motion learning system. ICES Journal of Marine Science. 2020;77(4):1295–1307. doi: 10.1093/icesjms/fsz025 [DOI] [Google Scholar]
  • 38.Anantharajah K, Ge Z, McCool C, Denman S, Fookes C, Corke P, et al. Local inter-session variability modelling for object classification. In: IEEE Winter Conference on Applications of Computer Vision. IEEE; 2014. p. 309–316.
  • 39. Villon S, Mouillot D, Chaumont M, Darling ES, Subsol G, Claverie T, et al. A deep learning method for accurate and fast identification of coral reef fishes in underwater images. Ecological informatics. 2018;48:238–244. doi: 10.1016/j.ecoinf.2018.09.007 [DOI] [Google Scholar]
  • 40. Fisher RB, Chen-Burger YH, Giordano D, Hardman L, Lin FP, et al. Fish4Knowledge: collecting and analyzing massive coral reef fish video data. vol. 104. Springer; 2016. [Google Scholar]
  • 41.Cutter G, Stierhoff K, Zeng J. Automated detection of rockfish in unconstrained underwater videos using haar cascades and a new image dataset: labeled fishes in the wild. In: 2015 IEEE Winter Applications and Computer Vision Workshops. IEEE; 2015. p. 57–62.
  • 42. Saleh A, Laradji IH, Konovalov DA, Bradley M, Vazquez D, Sheaves M. A realistic fish-habitat dataset to evaluate algorithms for underwater visual analysis. Scientific Reports. 2020;10(1):1–10. doi: 10.1038/s41598-020-71639-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Ditria EM, Lopez-Marcano S, Sievers M, Jinks EL, Brown CJ, Connolly RM. Automating the analysis of fish abundance using object detection: optimizing animal ecology with deep learning. Frontiers in Marine Science. 2020;7:429. doi: 10.3389/fmars.2020.00429 [DOI] [Google Scholar]
  • 44. Lempitsky V, Zisserman A. Learning to count objects in images. Advances in neural information processing systems. 2010;23:1324–1332. [Google Scholar]
  • 45.Cao X, Wang Z, Zhao Y, Su F. Scale aggregation network for accurate and efficient crowd counting. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018. p. 734–750.
  • 46.Cheng ZQ, Li JX, Dai Q, Wu X, Hauptmann AG. Learning spatial awareness to improve crowd counting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 6152–6161.
  • 47.Yan Z, Yuan Y, Zuo W, Tan X, Wang Y, Wen S, et al. Perspective-guided convolution networks for crowd counting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 952–961.
  • 48.Dai F, Liu H, Ma Y, Cao J, Zhao Q, Zhang Y. Dense scale network for crowd counting. arXiv preprint arXiv:190609707. 2019;.
  • 49.Gao G, Gao J, Liu Q, Wang Q, Wang Y. Cnn-based density estimation and crowd counting: A survey. arXiv preprint arXiv:200312783. 2020;.
  • 50. Xie W, Noble JA, Zisserman A. Microscopy cell counting and detection with fully convolutional regression networks. Computer methods in biomechanics and biomedical engineering: Imaging & Visualization. 2018;6(3):283–292. [Google Scholar]
  • 51.Liu L, Lu H, Cao Z, Xiao Y. Counting fish in sonar images. In: 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE; 2018. p. 3189–3193.
  • 52. Zhang S, Yang X, Wang Y, Zhao Z, Liu J, Liu Y, et al. Automatic fish population counting by machine vision and a hybrid deep neural network model. Animals. 2020;10(2):364. doi: 10.3390/ani10020364 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Simões-Lopes PC, Fabián ME, Menegheti JO. Dolphin interactions with the mullet artisanal fishing on southern Brazil: a qualitative and quantitative approach. Revista Brasileira de Zoologia. 1998;15(3):709–726. doi: 10.1590/S0101-81751998000300016 [DOI] [Google Scholar]
  • 54. Peterson D, Hanazaki N, Simoes-Lopes PC. Natural resource appropriation in cooperative artisanal fishing between fishermen and dolphins (Tursiops truncatus) in Laguna, Brazil. Ocean & Coastal Management. 2008;51(6):469–475. doi: 10.1016/j.ocecoaman.2008.04.003 [DOI] [Google Scholar]
  • 55.Dounias E. Past and present auxiliary animals assisting humans in their foraging activities. Hybrid Communities: Biosocial Approaches to Domestication and Other Trans-species Relationships. 2018;.
  • 56. Simões-Lopes PC, Daura-Jorge FG, Cantor M. Clues of cultural transmission in cooperative foraging between artisanal fishermen and bottlenose dolphins, Tursiops truncatus (Cetacea: Delphinidae). Zoologia (Curitiba). 2016;33(6). [Google Scholar]
  • 57. Cantor M, Simões-Lopes PC, Daura-Jorge FG. Spatial consequences for dolphins specialized in foraging with fishermen. Animal Behaviour. 2018;139:19–27. doi: 10.1016/j.anbehav.2018.03.002 [DOI] [Google Scholar]
  • 58. Samek W, Montavon G, Vedaldi A, Hansen LK, Müller KR. Explainable AI: interpreting, explaining and visualizing deep learning. vol. 11700. Springer Nature; 2019. [Google Scholar]
  • 59. Lemos VM, Troca DFÁ, Castello JP, Vieira JP. Tracking the southern Brazilian schools of Mugil liza during reproductive migration using VMS of purse seiners. Latin American Journal of Aquatic Research. 2016;44(2):238–246. [Google Scholar]
  • 60.Cantor M. Underwater surveys of mullet schools (Mugil liza) with Adaptive Resolution Imaging Sonar. [Data set]. Zenodo. 2021.
  • 61.Dutta A, Zisserman A. The VIA annotation software for images, audio and video. In: Proceedings of the 27th ACM International Conference on Multimedia; 2019. p. 2276–2279.
  • 62.Dutta A, Gupta A, Zissermann A. VGG Image Annotator (VIA); 2016. http://www.robots.ox.ac.uk/vgg/software/via/.
  • 63.He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 770–778.
  • 64.Liu X, Van De Weijer J, Bagdanov AD. Leveraging unlabeled data for crowd counting by learning to rank. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 7661–7669.
  • 65.Kendall A, Gal Y. What uncertainties do we need in bayesian deep learning for computer vision? arXiv preprint arXiv:170304977. 2017;.
  • 66.Oh Mh, Olsen P, Ramamurthy KN. Crowd counting with decomposed uncertainty. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34; 2020. p. 11799–11806.
  • 67.Schneider S, Zhuang A. Counting Fish and Dolphins in Sonar Images Using Deep Learning. arXiv preprint arXiv:200712808. 2020;.
  • 68.Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. Ieee; 2009. p. 248–255.
  • 69.Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings; 2010. p. 249–256.
  • 70.Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:14126980. 2014;.
  • 71.Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV. Autoaugment: Learning augmentation strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019. p. 113–123.
  • 72. Sant’Ana R, Kinas PG, de Miranda LV, Schwingel PR, Castello JP, Vieira JP. Bayesian state-space models with multiple CPUE data: the case of a mullet fishery. Scientia Marina. 2017;81(3):361–370. doi: 10.3989/scimar.04461.11A [DOI] [Google Scholar]
  • 73. Kara YE, Genc G, Aran O, Akarun L. Actively estimating crowd annotation consensus. Journal of Artificial Intelligence Research. 2018;61:363–405. doi: 10.1613/jair.5727 [DOI] [Google Scholar]
  • 74. Yu G, Tu J, Wang J, Domeniconi C, Zhang X. Active multilabel crowd consensus. IEEE Transactions on Neural Networks and Learning Systems. 2020;32(4):1448–1459. doi: 10.1109/TNNLS.2020.2984729 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Tim Wilhelm Nattkemper

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

10 Jan 2022

PONE-D-21-23858: Deep learning with self-supervision and uncertainty regularization to count fish in underwater images. PLOS ONE

Dear Dr. Cantor,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. 

Please submit your revised manuscript by Feb 24 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Tim Wilhelm Nattkemper

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that Figures 1 and 6 in your submission contain copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figures 1 and 6 to publish the content specifically under the CC BY 4.0 license. 

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

3. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Dear authors, finally, the reviews have been completed and both reviewers agree that minor revisions are necessary before your submission can be accepted. Please pay attention to the comments (one reviewer submitted an extra file with comments, hope this will be included / attached by the PLoS system) and follow them carefully.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. The manuscript is technically sound and the data supports the conclusions. Some minor adjustments are needed to clarify some sections (See attachment).

2. The statistical analysis performed is associated with proven machine vision processes.

3. Data is available.

4. Manuscript is well-written.

Reviewer #2: Monitoring underwater populations is crucial to support the conservation of wild aquatic biota. The present manuscript proposes a deep learning method to estimate the number of fish in underwater images.

The authors use a novel dataset of (low-resolution and site-specific) sonar images to train and test various density-based deep learning models. Finally, they argue the generalisability of their approach by testing the best performing model (the one with self-supervision and uncertainty regularization) on a contrasting dataset (high-resolution and heterogeneous-habitat) of underwater photos.

This is a really interesting contribution that fits well with the scope of Plos One, but I have a few minor issues concerning the analysis and some specific choices made by the authors. The manuscript might be considered acceptable after minor revision.

Main comments:

1) Referring to Equation 1, it is unclear to me how the count is extracted out of the predicted density map. Furthermore, I was not able to gain insights on this matter looking at the references [41-45,60]. From what I have understood, a pixel of the predicted density map contains a real number; how is the image predicted density normalized? Does the sum over spatial units provide a real number or an integer one (as it is reported in the estimated count in Figure 5 (c)-(f))?

2) In subsection 'Ablation experiments', citing "All 9 trained models were tested on the 80-image test set for comparison and averaged across the 3 trials". The 3 trials are never mentioned before. I guess they refer probably to different simulation's run, each one with a diverse random seed. Maybe the authors can explicitly state it in the text. In this regard, do the results shown in Figure 5 (a), (b) refer to a specific trial or to an average of the 3?

3) The authors did not provide a reason for the choice of the hyperparameters used for training. Have they performed an hyperparameter optimization? Is the loss function of the models converging? Showing the learning curves (e.g. in Supplementary Materials) may provide useful informations.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: PONE-D-21-23858_reviewer_comments.pdf

Attachment

Submitted filename: Comments.docx

PLoS One. 2022 May 4;17(5):e0267759. doi: 10.1371/journal.pone.0267759.r002

Author response to Decision Letter 0


14 Feb 2022

Ref.: PONE-D-21-23858

Title: Deep learning with self-supervision and uncertainty regularization to count fish in underwater images

Authors: Penny Tarling, Mauricio Cantor, Albert Clapés, Sergio Escalera

Authors’ replies to the comments by the Editor

Comment#1, Editor. Dear authors, finally, the reviews have been completed and both reviewers agree that minor revisions are necessary before your submission can be accepted. Please pay attention to the comments (one reviewer submitted an extra file with comments, hope this will be included / attached by the PLoS system) and follow them carefully.

Authors’ reply: Thank you for the very positive evaluation of our work. We are pleased to know that both reviewers welcomed our manuscript and requested only minor revisions. Please see below how we addressed each of them, and paid close attention to all the formatting requirements.

Comment#2, Editor. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Authors’ reply: Thank you for the attentive review. We have used the PLOS ONE LaTeX templates and double-checked all the formatting requirements.

Comment#3, Editor. 2. We note that Figures 1 and 6 in your submission contain copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figures 1 and 6 to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information.

If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

Authors’ reply: We appreciate your guidance for re-using published figures. For Figure 1, we clarify that only part (a) is copyrighted by SoundMetrics Coorp, and that parts (b) and (c) are properties of the authors of this manuscript. Note that in Figure 1(a) we present a similar, modified version of their original figure, after having contacted the copyright owner. For full transparency, we have now obtained their written permission, uploaded the Content Permission Form, and updated the Figure 1 caption to clarify that part (a) is similar but not identical to the original image and is therefore for illustrative purposes (page 3).

As for Figure 6, we clarify that the original images were published under the Creative Commons Attribution License (CC BY 4.0), allowing us to reproduce them here since we have cited the source fully and accurately (see our original reference [38] Saleh A, Laradji IH, Konovalov DA, Bradley M, Vazquez D, Sheaves M. A realistic fish-habitat dataset to evaluate algorithms for underwater visual analysis. Scientific Reports. 2020;10(1):1–10). Please see a copy of the license and permission to reuse these images here:

https://s100.copyright.com/AppDispatchServlet?title=A%20realistic%20fish-habitat%20dataset%20to%20evaluate%20algorithms%20for%20underwater%20visual%20analysis&author=Alzayat%20Saleh%20et%20al&contentID=10.1038%2Fs41598-020-71639-x&copyright=The%20Author%28s%29&publication=2045-2322&publicationDate=2020-09-04&publisherName=SpringerNature&orderBeanReset=true&oa=CC%20BY. For full transparency, we rephrased the Figure 6 caption to declare that the original images were published under CC BY 4.0 (page 17).

Comment#4, Editor. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Authors’ reply: We have double-checked all citations and references, and used their correct and latest BibTeX entries.


Authors’ replies to the comments by Reviewer#1

Comment#1, Reviewer#1. The manuscript is technically sound and the data supports the conclusions. Some minor adjustments are needed to clarify some sections (See attachment).

2. The statistical analysis performed is associated with proven machine vision processes.

3. Data is available.

4. Manuscript is well-written.

Authors’ reply: We very much appreciate the positive feedback on our manuscript and the thoughtful comments that helped us clarify important details. Please see below how we addressed each comment -note that line numbers refer to the revised version with in-line tracked changes.

Comment#2, Reviewer#1. Page 2 “Underwater videos and images offer a non-intrusive…” This is not true, several studies have shown that cameras can attract or scare some species of fish, that can be consider an intrusion in the sense that it disrupts the behavior and natural dynamics of fish populations. No method is non-intrusive, or non-biased. Check refs: Bacheler, Comparing relative abundance, lengths, and habitat of temperate reef fishes using simultaneous underwater visual census, video, and trap sampling, 2017; Campbell, Comparison of relative abundance indices calculated from two methods of generating video count data, 2015, for an idea on the biases of several methods, including cameras.

Authors’ reply: Thank you for improving the accuracy of our text. We agree with the reviewer that studies using underwater imaging can disrupt fish behaviour and are inherently imperfect in estimating population parameters, such as abundance. We have rephrased the excerpt to downplay our original overstatement of the non-intrusiveness of underwater videos (page 2), and point the readers to the references provided for a more in-depth comparison of the pros and cons of different sampling methods.

Comment#3, Reviewer#1. Page 3 “only semi-automated commercial tools to process sonar images are available”. This is not entirely true. Softwares like Echoview or ESP3 can process sonar images in bulk, and they also can count fish in them. Echoview has the option to add user developed scripts in case several files from different sources are needed (so it would be fully-automated). I agree with the assertion that there are not many products, readily available, that process sonar images in a fully-automated way, but that depends on the definition of fully-automated. A well-written script in Echoview or ESP3 can do that. I could not find a full copy of [10] to corroborate that they had a similar conclusion. I don't think the advantage of a computer vision model is that it can automatically process vast numbers of images, but that it takes them and processes them as a whole.

Authors’ reply: We appreciate the reviewer’s perspective on these tools that can process sonar images automatically or semi-automatically and agree that the advantage of computer vision models is the ability to process vast number of images as a whole, i.e. speed. Our original argument was based on our best knowledge at the time of writing that more automated tools are not readily available and are limited in scope. For instance, we cited Ref [10] Lankowicz et al. 2020 as a recent example of fishery researchers using the same ARIS sonar camera as us, but needing to manually count fish in each image and therefore they express the need for faster or more automated methods. We quote, “recognize that sonar imaging surveys have limitations as well; data processing alone is a huge challenge due to the sheer number of images that must be processed. […]. However, advances in computer science fields such as machine learning are quickly improving the performance of automated image processing procedures, and thus it will soon be possible for sonar image data to be processed on a similar time scale to net-generated catch data.” While we knew about Echoview, we were unaware of ESP3 - thank you for bringing our attention to it, which we now cite in the revised manuscript. We still believe these tools have their own limitations: ESP3 seems only compatible with "SIMRAD (.raw) and a small number of other formats from single-beam or split-beam sonar cameras”, and whilst it can detect schools of fish, sea bottom etc. it may have limited capability for quantifying numbers of fish. From our understanding of Echoview, depending on the studied system, it may not be as straightforward to semi-automate fish counting; as the Reviewer says, without a custom well-written script. To clearly acknowledge the existence of these tools, we now cite both in our revised manuscript (page 3), and amended this excerpt to highlight the contribution of our proposed computer vision method: the increased speed in processing vast number of images as a whole (page 3).

Comment#4, Reviewer#1. Page 5 “500 were selected for manual labelling”. Was there any criterium for the selection of this amount of images?

Authors’ reply: Thank you for your question - we have updated the manuscript (page 5) to make this clearer. We chose and stratified 500 images to ensure the dataset covered the visual variability with several images in each high level bucket: the samples chosen contained a wide ranging number of fish (between 0 to 438) and we made sure to include substantial noise. To capture any potential variation between the days and time of day (e.g. in water visibility, noise and fish), the images were selected from 48 different videos so there were at least 2 videos from each of the 18 days of sampling, one from morning and another from afternoon when possible (sometimes the weather conditions did not allow us to deploy the camera for the entire day). Within each morning and afternoon, the videos were selected at random.

Comment#5, Reviewer#1. Page 8. “Incorporating this measure of uncertainty should not only lead to a higher level of accuracy in count predictions through optimising model training”. Is this a hypotesis?

Authors’ reply: Thank you for the comment. Yes, this is a hypothesis. We have made our manuscript clearer in stating this sentence as a hypothesis (page 8), and discussing our corresponding findings (pages 18-19).

Comment#6, Reviewer#1. Page 8. “Depth of swimming will cause differences in scale but it does not result in great variations here.” Do you have preliminary data to support this?

Authors’ reply: We do not have preliminary data on this, but we contacted the manufacturer of the ARIS sonar camera to better understand how it operates and the extent to which the acoustic beams could distort the fish images within the spatial scale of our sampling. We realized that the influence of water depth on measuring large fish at short distances with the ARIS would be negligible. The average body length, from the tip of the mouth to the caudal fin, of mullets caught and measured manually during our fieldwork was 42.9 cm ± 7.00 SD (n=771). Such measurable size should be very consistent within the sonar range, as long as the ARIS sonar camera is tilted down enough for the energy beams to “illuminate” the fish and minimize any potential interference from returns caused by the water surface reflections superimposing the fish images. We have rephrased this excerpt to clarify that the way we sample these images in short ranges and in shallow waters should not distort the size of the mullets (page 9). For reference, we also amended the caption of Fig. 1c to give a sense of how large the mullet fish are (page 3).

Comment#7, Reviewer#1. Page 9. “we can leverage some potentially useful information for the original counting task”. What kind of information?

Authors’ reply: We are referring to the comparative number of fish between two images learnt in the self-supervised task. Thank you for pointing this out - we have updated this sentence to clarify our point (pages 9-10).

Comment#8, Reviewer#1. Page 14. “the model is able to distinguish between dolphin and fish confidently”. This classification may not be significant since the size of a dolphin is much greater than the one of a mullet so it is easily differentiated. The uncertainty introduced by the nets is more significant since the size of the "blobs" are comparable to the size of the mullets. This begs the questions: in the presences of other targets of similar size, does the CNN have a good performance too? Is the CNN considering past entries, or the way fish move to make its predictions? Human observers can tell if it is a net or a fish because they have access to past and future entries (frames) and can recognize the difference in the way that a net and a fish move. Given only still images, humans would have a much harder time recognizing the blobs associated with nets.

Authors’ reply: Thank you for the comment, this is a good point. As you suggest, including temporal data, to capture motion, should definitely improve accuracy and is something we would like to do as future work, potentially with the use of a 3D CNN (to incorporate the temporal dimension) with minor adaptations, along with increasing the difficulty of the problem with greater variation in the training and test data. We have updated the manuscript to explain this (page 21). Here we have shown that with this dataset, even without temporal knowledge, we can still achieve sufficient accuracy with our CNN model. This suggests, on a like-for-like basis, where humans and CNN process still, non sequential images, our model will perform to a comparable degree of accuracy but the advantage is that it is much faster.

Comment#9, Reviewer#1. Page 15. “Evaluation on DeepFish dataset”. Although an interesting exercise, I don't see how this relates to the counting on sonar images. These are two fundamentally different CNNs, especially their datasets. Underwater images (DeepFish) are visual data: the frequencies (colors) are directly associated with a bandwidth in the environment. Sonar images' gray scales are due to the mapping of the device; the data is sound mapped into a visual space, and thus their bandwidth is dependent on the design of the device and the processing of the data. I am not sure that you can compare the two. It is interesting to see that the framework does better than the original DeepFish architecture, but that does not necessarily mean it would be better on sonar images.

Authors’ reply: Thank you for sharing your perspective on this analysis. We understand that this exercise does not show that our model would perform better on sonar images, the purpose of it was to show the generalisability of our model architecture and that its use does not need to be limited to sonar images. We hope it can be used in a wide range of aquatic research and perform well on a variety of image data. We have edited our discussion to make our goal clearer (page 19).

Comment#10, Reviewer#1. Page 17. “Its performance is notable when predicting samples containing higher numbers of fish and samples with substantial noise (Fig 5(a)), likely because by adding unlabelled data, we increase the number of samples that fall within these two challenging categories, which will disproportionately require more training data.” Is this speculation?

Authors’ reply: Yes, thank you. We have improved the wording in this excerpt to make this clear (page 18).

Comment#11, Reviewer#1. Page 18. “We also tested our proposed model on a recently published dataset, DeepFish”. See my comment on the "Evaluation on DeepFish dataset". The better performance of the framework here does not necessarily mean a better performance with sonar images. The two datasets are fundamentally different.

Authors’ reply: Thank you again for highlighting this point. We have edited our discussion taking on board your feedback (page 19); please see our complete response to your previous Comment #9.

Comment#12, Reviewer#1. Page 18. “on both low-resolution, densely populated images”. I agree that ARIS acoustic images can be treated as low-resolution images. However, they are not the same as low-resolution visual images. They might be closer to low-resolution monochromatic images.

Authors’ reply: We have rephrased this term to low-resolution monochromatic images (page 19).

Comment#13, Reviewer#1. Page 19. “in improving model predictions on unseen sonar data”. It is hard to find a study that does not support this. Augmenting the dataset fails to help only when the dataset is already large enough.

Authors’ reply: Thank you for pointing this out; we agree it is a widely known and used technique. We have updated the manuscript to be more specific: both of our experiments started with a relatively small labelled dataset, because very limited labelled data of this kind are available, yet we were able to reach a good level of accuracy, partly due to the effectiveness of data augmentation (page 20).

Comment#14, Reviewer#1. Page 19. “to determine the minimum conditions of prey availability at which this traditional, century-old interaction can persist and remain resilient in face of the global trend of decline fisheries seen at local, regional and global scales”. But this was not the goal of the study.

Authors’ reply: The reviewer is absolutely correct that this is not the goal of this study. In this closing section we aimed to provide the reader with the broader implications of our current technical study. We believe that by painting this bigger picture, we can better illustrate the usefulness of high-definition, localized fisheries data.

Comment#15, Reviewer#1. Page 19. “This is likely the cause of many errors in predictions”. Was there an attempt to quantify this?

Authors’ reply: No, but thanks for the suggestion. We first had two biologists work through a subsample together to reach a consensus on how to identify the number of mullet, but then only one annotated the entire dataset. Ideally we would have had multiple experts annotating the entire dataset, enabling us to not only detect errors, but also quantify and mitigate biases in ground-truth labeling (please see our reply to your comment #16).

Comment#16, Reviewer#1. Page 20. “We make our data, code and pre-trained network available to benefit the advancement of other counting models via transfer learning”. There are models (such as mean count and maximum count) used by marine biologists that work with visual data and could be applied here to give an estimate of the error in the ground truth, which, as the authors suggest, will be compounded into the CNN.

Authors’ reply: We appreciate these suggestions, and believe they will fit perfectly into a broader follow-up study. Thus, we now highlight the Reviewer’s points as directions for future work (pages 20-21). In addition, we are working on extending the dataset with more videos and on recruiting multiple annotators, so we can reduce label noise by consensus after removing poor annotations based on inter-rater agreement metrics, as suggested by the recent literature [1-2].

[1] Kara, Yunus Emre, et al. "Actively estimating crowd annotation consensus." Journal of Artificial Intelligence Research 61 (2018): 363-405.

[2] Yu, Guoxian, et al. "Active multilabel crowd consensus." IEEE Transactions on Neural Networks and Learning Systems 32.4 (2020): 1448-1459.


Authors’ replies to the comments by Reviewer#2

Comment#1, Reviewer#2. Monitoring underwater populations is crucial to support the conservation of wild aquatic biota. The present manuscript proposes a deep learning method to estimate the number of fish in underwater images. The authors use a novel dataset of (low-resolution and site-specific) sonar images to train and test various density-based deep learning models. Finally, they argue the generalisability of their approach by testing the best performing model (the one with self-supervision and uncertainty regularization) on a contrasting dataset (high-resolution and heterogeneous-habitat) of underwater photos.

This is a really interesting contribution that fits well with the scope of Plos One, but I have a few minor issues concerning the analysis and some specific choices made by the authors. The manuscript might be considered acceptable after minor revision.

Authors’ reply: Thank you for the accurate summary highlighting the strengths of our work. We very much appreciate the positive feedback on the contributions of our work to the field, as well as the suggestions to improve the presentation of our manuscript. Please see below how we addressed all your comments (line numbers refer to the revised version with in-line tracked changes).

Comment#2, Reviewer#2. Referring to Equation 1, it is unclear to me how the count is extracted out of the predicted density map. Furthermore, I was not able to gain insights on this matter looking at the references [41-45,60]. From what I have understood, a pixel of the predicted density map contains a real number; how is the image predicted density normalized? Does the sum over spatial units provide a real number or an integer one (as it is reported in the estimated count in Figure 5 (c)-(f))?

Authors’ reply: Thank you for the attentive revision of our paper, and excuse the lack of clarity in our text. Firstly, for each labelled image, the point annotation marking the location of a mullet is interpreted as a “1” at that position in a corresponding array. A Gaussian filter is then passed over the entire array to create a density map, so the values at the original position of a given point annotation, and in the surrounding area covered by the Gaussian filter, still sum to 1 (hence normalized); the array now consists of real numbers rather than integers. After filtering, pixels where the contributions of multiple mullets overlap are added together, which appears as an area of greater density on the density map. It is then possible to recover the predicted count of mullet from the density map simply by summing all the pixel values, while also retaining a spatial representation of their predicted locations. The predicted count is therefore a real number rather than an integer. We amended the commented excerpt to make these points explicit in the manuscript (page 8).
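For illustration, the following is a minimal sketch (in Python with NumPy and SciPy; it is not the code released with the paper) of how point annotations can be turned into a normalized density map and how the count is recovered by summing the pixel values, as described above. The image size, sigma and annotation coordinates are illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def density_map_from_points(points, image_shape, sigma=3.0):
    """points: list of (row, col) mullet annotations; image_shape: (H, W)."""
    annotation_map = np.zeros(image_shape, dtype=np.float32)
    for r, c in points:
        annotation_map[r, c] = 1.0            # each annotated fish contributes exactly 1
    # The Gaussian filter spreads each unit of mass over neighbouring pixels while
    # preserving its sum, so the map stays normalized (total sum == number of fish).
    return gaussian_map if False else gaussian_filter(annotation_map, sigma=sigma)

# Example: three annotated mullet in a 64x64 image; two overlapping fish add up locally
points = [(10, 12), (30, 40), (31, 41)]
density = density_map_from_points(points, (64, 64))
count = density.sum()                          # recovers ~3.0 (a real number, not an integer)
print(round(float(count), 2))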

Comment#3, Reviewer#2. In subsection 'Ablation experiments', citing "All 9 trained models were tested on the 80-image test set for comparison and averaged across the 3 trials". The 3 trials are never mentioned before. I guess they refer to different simulation runs, each one with a different random seed. Maybe the authors can explicitly state it in the text. In this regard, do the results shown in Figure 5 (a), (b) refer to a specific trial or to an average of the 3?

Authors’ reply: Thank you for this feedback - we have updated the text as suggested to make this clearer (page 12). Figures 5a and 5b provide the average across all trials to give an overview of our results. Figures 5c-f, where we show specific examples of predicted density maps, are from individual trials; here we have selected the median prediction for each sample image.

Comment#4, Reviewer#2. The authors did not provide a reason for the choice of the hyperparameters used for training. Have they performed a hyperparameter optimization? Is the loss function of the models converging? Showing the learning curves (e.g. in Supplementary Materials) may provide useful information.

Authors’ reply: Yes, we performed hyperparameter optimisation to choose model training hyperparameters such as the learning rate and the number of epochs to train our model for. We used an “early stopping” procedure, halting training when the MAE on our validation data was at a minimum. We also experimented with training on increasing numbers of augmented images but found that accuracy plateaued beyond ~5,000 (hence using 5,322 in our ablation studies). We also performed hyperparameter optimisation on the coefficient of the balancing regularization, as our model was not training with this set to 1 (as in [49]). Instead, the loss function converged best with a coefficient of 0.1. The difference is that our model inputs are whole images, whereas [49] used patches of images; as a result, the absolute differences between ground truth and prediction for a given input sample can be much higher in our case, so with a coefficient of 1 the balancing term dominated the overall loss function and prevented training (please see page 12).
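To make the early-stopping procedure described above concrete, here is a simplified PyTorch sketch: training stops, and the best weights are restored, once the validation MAE (the absolute difference between the summed predicted and ground-truth density maps) has not improved for a set number of epochs. The model, data loaders, loss function, patience value and the assumption of 4D (batch, channel, height, width) tensors are placeholders rather than the authors' released implementation; the comment mentioning the 0.1 balancing coefficient only indicates where such a term would enter the loss.

import copy
import torch

def train_with_early_stopping(model, optimizer, loss_fn, train_loader,
                              val_loader, max_epochs=200, patience=20):
    best_mae, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for images, gt_density in train_loader:
            optimizer.zero_grad()
            # loss_fn would combine the density loss with, e.g., 0.1 * balancing term
            loss = loss_fn(model(images), gt_density)
            loss.backward()
            optimizer.step()

        # Validation MAE: |predicted count - true count| averaged over images,
        # where each count is the sum of the corresponding density map.
        model.eval()
        abs_errors = []
        with torch.no_grad():
            for images, gt_density in val_loader:
                pred = model(images)
                abs_errors.append(
                    (pred.sum(dim=(1, 2, 3)) - gt_density.sum(dim=(1, 2, 3))).abs()
                )
        val_mae = torch.cat(abs_errors).mean().item()

        if val_mae < best_mae:
            best_mae = val_mae
            best_state = copy.deepcopy(model.state_dict())  # keep the best weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:      # stop when MAE stops improving
                break

    model.load_state_dict(best_state)
    return model, best_mae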

Attachment

Submitted filename: PONE-D-21-23858_R1_response-letter.docx

Decision Letter 1

Tim Wilhelm Nattkemper

18 Apr 2022

Deep learning with self-supervision and uncertainty regularization to count fish in underwater images

PONE-D-21-23858R1

Dear Dr. Cantor,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Tim Wilhelm Nattkemper

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors addressed all my comments. I approve the manuscript for publication.

I commend the authors on working on this difficult subject where databases are hard to find and machine learning is usually ignored.

Reviewer #2: The manuscript satisfies PLOS ONE publication criteria and the authors have addressed all of my comments. I am happy to accept the paper for publication at this stage.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Tim Wilhelm Nattkemper

25 Apr 2022

PONE-D-21-23858R1

Deep learning with self-supervision and uncertainty regularization to count fish in underwater images

Dear Dr. Cantor:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Dr. Tim Wilhelm Nattkemper

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Video. Sonar video sample.

    Underwater video generated by an Adaptive Resolution Imaging Sonar (ARIS 3000, Sound Metrics Corp, WA, USA) displaying the top view of a passing mullet school (Mugil liza) before a fishing net is cast by artisanal fishers in Laguna, southern Brazil. The video was manually pre-processed for contrast enhancement and background removal.

    (MP4)

    Attachment

    Submitted filename: PONE-D-21-23858_reviewer_comments.pdf

    Attachment

    Submitted filename: Comments.docx

    Attachment

    Submitted filename: PONE-D-21-23858_R1_response-letter.docx

    Data Availability Statement

    The code to our method and trained network parameters are openly available at https://github.com/ptarling/DeepLearningFishCounting. All the images and their corresponding ground-truth density maps used in this study, along with 126 sonar video files are available at "Cantor, M. (2021). Underwater surveys of mullet schools (Mugil liza) with Adaptive Resolution Imaging Sonar (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.4717411".

