Biology Methods & Protocols. 2025 Jul 12;10(1):bpaf057. doi: 10.1093/biomethods/bpaf057

Reassessing deep learning (and meta-learning) computer vision as an efficient method to determine taphonomic agency in bone surface modifications

Manuel Domínguez-Rodrigo 1,2,3, Gabriel Cifuentes-Alcobendas 4,5, Marina Vegara-Riquelme 6,7, Enrique Baquedano 8
PMCID: PMC12343112  PMID: 40799312

Abstract

Taphonomic research aims at reconstructing processes affecting the preservation and modification of paleobiological entities. Recent critiques of the reliability of deep learning (DL) for taphonomic analysis of bone surface modifications (BSMs), such as that presented by Courtenay et al. based on a selection of earlier published studies, have raised concerns about the efficacy of the method. Their critique, however, overlooked fundamental principles regarding the use of small and unbalanced datasets in DL. By reducing the size of the training and validation sets—resulting in a training set only 20% larger than the testing set, and some class validation sets that were under 10 images—these authors may inadvertently have generated underfit models in their attempt to replicate and test the original studies. Moreover, errors in coding during the preprocessing of images have resulted in the development of fundamentally biased models, which fail to effectively evaluate and replicate the reliability of the original studies. In this study, we do not aim to directly refute their critique, but instead use it as an opportunity to reassess the efficiency and resolution of DL in taphonomic research. We revisited the original DL models applied to three targeted datasets, by replicating them as new baseline models for comparison against optimized models designed to address potential biases. Specifically, we accounted for issues stemming from poor-quality image datasets and possible overfitting on validation sets. To ensure the robustness of our findings, we implemented additional methods, including enhanced image data augmentation, k-fold cross-validation of the original training-validation sets, and a few-shot learning approach using both supervised learning and model-agnostic meta-learning. The latter methods facilitated the unbiased use of separate training, validation, and testing sets. 
The results across all approaches were consistent, with comparable—if not almost identical—outcomes to the original baseline models. As a final validation step, we used images of recently generated BSM to act as testing sets with the baseline models. The results also remained virtually invariant. This reinforces the conclusion that the original models were not subject to methodological overfitting and highlights their nuanced efficacy in differentiating BSM. However, it is important to recognize that these models represent pilot studies, constrained by the limitations of the original datasets in terms of image quality and sample size. Future work leveraging larger datasets with higher-quality images has the potential to enhance model generalization, thereby improving the applicability and reliability of DL approaches in taphonomic research.

Keywords: deep learning, meta-learning, ensemble learning, taphonomy, bone surface modifications, equifinality

Introduction

Recently, Courtenay et al. [1] cast doubt on the reliability of some published deep learning (DL) studies used to differentiate bone surface modifications (BSMs), arguing that the high accuracy of these studies is the artificial result of: (i) poor-quality image datasets, and (ii) overfitting of the trained models by blending validation and testing. They replicated the original studies by purportedly using the same model architectures and the same data augmentation procedures, but they modified the training process. Here, we argue that Courtenay et al. [1] did not replicate any of the three targeted studies, nor did they test generalization shortcomings in the original models. These authors made different methodological decisions about the use of the original data, elaborated different models, and unjustifiably extrapolated their results to the original studies, without demonstrating that the original models had problems in generalization. In their logic, there was a disconnection between their explanans and explanandum; that is, between what they set out to test and what they eventually proved. Their study, though, is relevant in showing that decisions made about methodological procedures have important leverage on analytical outcomes.

The data sets selected by Courtenay et al. were among a few pilot sets used to show the potential of DL in taphonomic analyses. We were aware of their limitations because of the following factors:

  1. Sample size. DL models inherently require extensive datasets to fully realize their potential and ensure robust performance. However, the datasets compiled for those studies were constrained by an insufficient number of samples in some classes. This limitation arose from the labor-intensive nature of the experimental process, which included experimenting with different agents, cleaning bones, meticulously capturing each BSM under the microscope, and subsequently cropping and preparing the images for analysis. For some of the datasets analyzed, this process spanned nearly three years, encompassing project conceptualization, data acquisition, and image preparation. These challenges underscore the practical difficulties associated with generating large-scale datasets for taphonomic research using DL models.

  2. Depth of field. For these pilot studies, we used a trinocular microscope (Optika SZM-1) paired with a 3 MP digital camera (OptiCam B3). This combination presented depth-of-field problems, largely resulting in a lack of definition in certain areas of the image. We stressed this problem in the original study [2] and in subsequent studies [3]. This problem affected all BSM and, therefore, it did not bias model performance according to class. This led us to replace this microscope with a Leica Emspira 3 Digital microscope, which is the tool that has provided the bulk of the datasets for our latest studies and provides high-resolution images unaffected by this problem.

  3. Unbalanced datasets. The first dataset selected by Courtenay et al. [1] contained 488 cut marks and 46 crocodile tooth marks. The second data set selected by them [2] contains 489 cut marks, 103 tooth marks, and 63 trampling marks. These datasets are more prone to present generalization issues than more balanced datasets. However, it should be emphasized that unbalanced datasets may be more realistic than artificially-balanced datasets. The concern surrounding class imbalance in machine learning (ML) is often contextual. As one reviewer rightly noted on an earlier draft of this article, imbalance is not inherently problematic—especially when it accurately reflects the distribution of categories in real-world data. In fact, artificially enforcing balance in datasets that are naturally skewed can lead to misleading conclusions about model performance, as it creates evaluation scenarios that deviate from the actual deployment context. For example, in taphonomic or forensic applications, certain types of BSMs may occur far less frequently than others. A model trained and evaluated on an artificially balanced dataset may perform well under controlled conditions, but fail to generalize under real-world class distributions. Therefore, maintaining natural class proportions during model training and evaluation can sometimes offer more realistic performance estimates and better inform how models will behave when deployed in applied settings. This perspective highlights the need to shift the focus from achieving mathematical balance to ensuring representational fidelity in dataset construction.

It is precisely the awareness of the sample size problem and the highly unbalanced nature of these datasets that led us to consciously make methodological decisions that would boost the performance of the resulting DL models. Our main goal was to develop models that could learn to identify agency; that is, to detect what taphonomic agent (e.g. human use of stone tools, carnivore gnawing bones or accidental trampling on abrasive sediment) was responsible for the marks. The first of these decisions was not to implement the traditional training-validation-testing split that is the common ML protocol [5]. We are aware that separating the training, validation and the testing sets is the adequate analytical process for normal datasets, as we have continuously implemented in our extensive ML experience [4, 6–8]. However, having done that triple data split with these small datasets would have presented the following problems:

  1. Reduction of the learning process. The division of a small dataset into training, validation, and testing subsets significantly limits the amount of data available for training, which is critical for effective generalization in DL models. This is especially so when dealing with unbalanced samples. Effective learning and pattern recognition in DL models require substantial data to adequately capture the underlying features and variability of the dataset. Insufficient training data increases the likelihood of model underfitting, as the model may fail to learn meaningful patterns. Additionally, small testing datasets may result in unreliable performance metrics, as they may not sufficiently represent the data distribution for each class, causing measures such as F1 scores to fluctuate substantially. Consequently, in such a situation the model’s generalization performance is also impaired when evaluated on a separate testing set.

       Therefore, the size and balance of the validation/testing set are as crucial as those of the training set. A poorly sized or unrepresentative validation set can hinder the model’s ability to learn effectively. Given the constraints of our dataset, which was limited in size and unbalanced, we prioritized larger training and validation sets at the expense of a separate testing set. This strategy is not new, and has also been applied by other researchers facing the same problem of small datasets [9]. To mitigate the potential for overfitting and address these challenges, we employed several techniques discussed below.

  2. Unreliable inferences about the generalization potential of the model. The use of small and unbalanced datasets compromises the reliability of inferences about the generalization potential of a DL approach. An underfit model (which is not capable of capturing the relationships between inputs and outputs included in the training data) or an undertrained model (one whose training did not last long enough, or did not contain enough information, for the model to reach its full learning capabilities) inherently renders testing set metrics unreliable, as such a model lacks the capacity to learn meaningful patterns effectively. Even if a model is adequately trained, generalization inferences drawn from a testing set can be limited if the set is not sufficiently representative of the sampled population. With small testing sets, this condition is rarely met. For instance, in Courtenay et al.’s [1] analysis of Dataset 1, the “crocodile” class testing sample was composed of only 13 images of tooth marks (30% of the sample). Recent research [10] has demonstrated the extensive variability in tooth pits alone of this agent (see below), suggesting that such a limited sample cannot adequately represent the full spectrum of the original data, let alone the broader population. Similarly, in Dataset 2, the “trampling mark” class was represented by only 18 images, which is far from sufficient to encapsulate the diverse forms of trampling marks produced by different abrasive agents [11]. This issue also extends to models that might exhibit strong classification metrics on such limited testing sets. Regardless of whether the metrics are favorable or unfavorable, the small sample sizes prevent a reliable evaluation of the model’s generalization capacity. The underrepresentation of certain classes introduces a significant risk of bias, leaving the broader applicability of the model unverified. 
This stands in stark contrast to standard DL protocols, which typically involve testing models on hundreds or thousands of images per class, thereby ensuring a more robust and reliable assessment of generalization potential.
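To illustrate how weakly a 13-image testing sample constrains a class-level metric, the following minimal sketch computes a 95% Wilson score interval for a proportion such as recall. Only the sample size of 13 comes from the text; the number of correct classifications (10 of 13) and the 500-image comparison sample are hypothetical values chosen for illustration, and the function name is our own.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical outcomes with a similar observed recall (~0.77) in both cases.
small_lo, small_hi = wilson_interval(10, 13)     # 13 test images, as in the text
large_lo, large_hi = wilson_interval(385, 500)   # assumed larger testing set

print(f"n=13:  [{small_lo:.2f}, {small_hi:.2f}]")
print(f"n=500: [{large_lo:.2f}, {large_hi:.2f}]")
```

Under these assumptions, the interval for 13 images runs from about 0.50 to 0.92, several times wider than the interval for 500 images, so a favorable and an unfavorable result on such a small class are both compatible with a broad range of true performance.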

By adopting the triple-split approach, Courtenay et al. [1] may have produced deficient models (especially for underrepresented classes), since for the first dataset they trained their DL models using, for example, fewer than 25 crocodile images. A recent study of crocodile tooth pit morphology alone shows that there were a minimum of 64 forms of tooth pits in the same experimental set [10]. A model’s loss and backpropagation process is determined by the validation set. In this regard, Courtenay et al. must have used only six crocodile images to recalibrate (i.e., update the weights through backpropagation) their models. With their second dataset, the same procedure would have resulted in training the model on 35 trampling marks and validating it with only 9 trampling marks. Needless to say, such reduced validation (and potentially, training) datasets are of little value for DL analysis, since they cannot sample the wide variety of each of those BSM. Such small validation sets lead to low precision and recall, as Courtenay et al.’s results show [1].

In their study, these authors achieved only slightly lower global accuracy on their testing sets than the original studies. For Dataset 1, their accuracy is 92% (in the original study it ranged between 96% and 99% according to model) [4]. This means that their large cut mark sample was properly trained and classified according to the testing set (91.8% precision, perfect recall, and 98.7% F1 score). It was only the smaller crocodile sample that was poorly learnt and misclassified because of the paucity of data for training, validation and testing caused by the triple data split. For Dataset 2, Courtenay et al.’s replicated testing accuracy was 86% (in the original study it was 90%) [2]. The larger sample of cut marks was unaffected by the triple data split, showing high precision (97.8%), high recall (91.8%), and a high F1 score (94.7%), whereas their smaller samples were learnt more deficiently by the models (tooth marks with an F1 score of 68% versus 80% in the original study), with the smallest trampling subset being almost completely misclassified (F1 score = 36%). This information alone should reflect the effect of the triple data split on small unbalanced datasets, especially in classes with insufficient sample sizes. This is a confirmation of the “long-tail effect,” in which a DL model performs satisfactorily in classes with more data, but underperforms in classes with substantially fewer data [12, 13].
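Because the F1 score is the harmonic mean of precision and recall, the per-class figures quoted above can be cross-checked directly. The following minimal sketch (the function name is illustrative) does so for the Dataset 2 cut mark class, using the precision and recall values given in the text.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Cut mark class in the Dataset 2 re-analysis: precision 97.8%, recall 91.8%.
cut_mark_f1 = f1_score(0.978, 0.918)
print(f"F1 = {cut_mark_f1:.1%}")  # F1 = 94.7%
```

Since the harmonic mean is pulled toward the lower of the two values, a class with few testing images can show a sharply reduced F1 score after only a handful of misclassifications.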

If one wants to produce traditional DL models that maximize learning and generalization using small unbalanced datasets, splitting the original dataset into three fixed subsets (with only one for training) is not optimal for model learning. This is where a tradeoff must be selected between learning and generalization. The best way to mitigate the above-mentioned issues is either to sacrifice the testing set or to use cross-validation [14–16].1 The first approach is adequate for model development. In this case, validation can be used as a proxy for testing. The shortcoming of this approach, as mentioned above, is the potential for overfitting on the validation set. To avoid this, data augmentation and regularization methods can be implemented. Transfer learning (TL) can also help to leverage existing feature knowledge and fine-tune it on the smaller dataset. This is the approach adopted in the three original studies re-analyzed by Courtenay et al. [1]. It produced fairly accurate models and performance metrics that indicated good precision and F1 scores for most classes. The main downside is that it cannot be assessed how good the models are at generalizing beyond the validation datasets. Another drawback of this approach is that it is difficult to know whether or not it produced overfitting on the validation set. Overfitting should not be assumed by default.

For this reason, the second option (k-fold cross-validation) could be more adequate, and this one was not implemented in the original studies of the three datasets. Cross-validation consists of splitting the original dataset into k subsamples (folds), using one for validating and the others for training. This process is repeated k times, once for each fold created. For example, a 5-fold cross-validation uses 80% of the dataset at a time for training and the remaining 20% for validating. By selecting different sets of images for each fold’s training and validation, the model is exposed every time to different training and validation splits (and image sets). The final result is obtained by averaging the accuracy values of each fold. This maximizes data usage while providing robust validation. Usually, cross-validation is carried out within the training-validation sets and a holdout testing set is used for assessing generalization. Here, we opted to use all data for training-validation through cross-validation, for the same reasons as described above: a representative separate testing set would have substantially reduced the available data for certain underrepresented classes for training. When employing cross-validation for evaluation, the validation set rotates across folds, highlighting potential issues regarding the representativeness of the original data, making a test set a less immediate necessity.
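The fold rotation described above can be sketched in a few lines. The example below is a dependency-free illustration of a stratified 5-fold split, in which each class is dealt evenly across folds so that minority classes appear in every validation fold. It is not the code used in the original or present analyses: the function name, the round-robin assignment, and the random seed are assumptions, and model training on each fold is omitted. The class counts are those listed for Dataset 2 (488 cut marks, 103 tooth marks, 63 trampling marks).

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs with each class spread evenly over k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin across folds

    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, val_idx

labels = ["cut"] * 488 + ["tooth"] * 103 + ["trampling"] * 63
for fold, (train_idx, val_idx) in enumerate(stratified_kfold(labels, k=5), start=1):
    n_trampling = sum(labels[i] == "trampling" for i in val_idx)
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}, "
          f"trampling in val={n_trampling}")
```

In a full analysis, a model would be trained and evaluated once per fold and the k accuracy values averaged. Every image serves for validation exactly once, which is what allows the whole dataset to contribute to both training and evaluation.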

Therefore, when using a small unbalanced dataset, analysts have two options which are very unequal in their results: training-validation-testing or training-validation only (with or without cross-validation). The former reduces available data for training and hyper-parameter calibration through validation, potentially leading to underfit models, higher variance and less stable performance metrics. The latter optimizes the use of data, but may develop overfitting of the validation data, thus not reflecting the true generalization properties of the model. Here, we will test to what extent the original studies were methodologically overfit versus to what extent Courtenay et al.’s remodeling was undertrained, by applying this and additional methods to the original datasets.

We will test two hypotheses on the same three datasets used by Courtenay et al. [1]:

Hypothesis 1. Image quality has biased the models, probably by different distribution of biasing effects (e.g. brightness and contrast) in different classes. If applying grayscale intensity-augmented methods to the same datasets, this should lead to widely divergent results between the baseline models and the augmented ones.

Hypothesis 2. The training-validation split (to the exclusion of a separate testing set) has led to overfitting processes, either on the validation set or on both sets, resulting in artificially high accuracies and low loss. If applying cross-validation methods on the same models and datasets, there should be widely divergent results in the classification metrics of all classes and the overall accuracy of the models. It could be argued that since training of cross-validated models is also carried out on a training-validation split, it has a similar potential to overfit on the validation set as the baseline models. For this reason, implementing a separate and different method (e.g. one-shot or few-shot meta-learning), which can efficiently use separate testing sets on small datasets, should yield widely divergent results if the DL train-validation method is overfit. Failure to show divergent results by this combination of methods implies rejecting the hypothesis. As a final test of this hypothesis, we will also include a new testing set for Datasets 1 and 2 created with BSM that are new and that were not part of the original samples used for the published models. Should the models be overfit, we are expecting a poor performance on the new testing sets.

Materials and methods

This study evaluates the reliability of DL-based computer vision (CV) models for classifying experimental taphonomic BSMs. To achieve this, the study utilized three previously published datasets [2, 4], the same ones used by Courtenay et al. [1], and replicated the analyses originally performed, by implementing additional controlled methods that enabled testing: (i) if the quality of images had any impact on model performance, and (ii) if the training-validation method of the original studies resulted in overfitting through the training and validation feedback. The datasets used in this research include:

  1. Dataset 1 (DS1): Published by Abellán et al. [4], available at https://doi.org/10.7910/DVN/9NOD8W (last accessed on 07/08/2023). The dataset is composed of 488 images of cut marks and 45 images of crocodile tooth marks.

  2. Dataset 2 (DS2): Published by Domínguez-Rodrigo et al. [2], available at https://doi.org/10.7910/DVN/62BRBP (last accessed on 07/08/2023). The dataset is composed of 488 images of cut marks, 103 images of tooth scores, and 63 images of trampling marks.

  3. Dataset 3 (DS3): Published by Pizarro-Monzo et al. [17], available at https://doi.org/10.17632/3bm34bp6p4.1 (last accessed on 07/08/2023). The dataset is composed of 629 images of tooth scores, 150 images of cut marks (all new, not recycled from Dataset 2, as wrongly interpreted by Courtenay et al. [1]), and 154 images of trampling marks.

All these datasets have been compiled in a public repository specifically created for this article: https://doi.org/10.7910/DVN/WUSGSW. The code version is dated May 24, 2025.

Image quality

Given that Courtenay et al. [1] have not shown that image quality had a biasing effect in the performance of the three datasets that they tried to replicate (see Supplementary Information), here we will test both interpretations in the form of Hypothesis 1 (outlined above). We hypothesize that if image quality biased the original models, a re-analysis using controlled noise methods consisting of even distortion of brightness, contrast, and sharpness across classes should yield significantly lower accuracy rates and worse performance metrics than the original models [2, 4, 17]. For this purpose, we will use grayscale intensity-augmentation methods in addition to image normalization.

We generated a function that randomly adjusts the brightness, contrast and sharpness of the image to introduce variation that improves the robustness of the model to changes in image lighting conditions. The process implements a data augmentation pipeline tailored for three-channel grayscale images, designed to enhance variability and improve model robustness during training. The function applies random intensity-based (brightness and contrast) and sharpness augmentations independently to each channel, despite the input image being grayscale. By doing so, we make sure that the random augmentations to each of these image features are not too extreme, which could make them too dissimilar from the overall sample, hindering the model’s ability to learn generalizable patterns. For brightness adjustments, pixel intensities in the selected channel are scaled by a random factor between 0.8 and 1.2, simulating variations in lighting conditions. Contrast augmentation modifies the difference between pixel intensities relative to the channel’s mean intensity, with a scaling factor also randomly chosen within the same range, thereby mimicking diverse imaging conditions. Sharpening is performed using a convolution operation with a predefined kernel, amplifying edges and enhancing fine details to simulate sharper imaging scenarios. Each of these augmentations operates within the valid pixel range (0–255), ensuring no overflow or underflow. The augmented image is then converted to a float32 data type and preprocessed using ResNet50’s “preprocess_input” function for Datasets 1 and 3, which normalizes the pixel values according to the statistical distribution expected by the model (e.g. mean-centered using channel means computed from the ImageNet dataset). For Dataset 2, the augmented image is likewise converted to a float32 data type and preprocessed using VGG16’s “preprocess_input” function, which normalizes the pixel values by subtracting the channel-wise mean values derived from the ImageNet dataset, ensuring compatibility with the statistical distribution expected by the model. By introducing random, channel-specific augmentations to grayscale images while maintaining a three-channel structure, this approach enables compatibility with pre-trained models like ResNet50 or VGG16, improving generalization and robustness in downstream DL tasks.
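The per-channel operations described above can be sketched as follows. This is a dependency-free illustration on nested lists, not the function deposited with the article: the sharpening kernel, the replicated-border handling, the probability of sharpening, and all function names are assumptions, since the text specifies only the 0.8 to 1.2 scaling range and the 0–255 clipping. The conversion to float32 and the call to “preprocess_input” are omitted.

```python
import random

# Assumed 3x3 sharpening kernel; the text only says "a predefined kernel".
SHARPEN_KERNEL = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]

def clip(value):
    """Keep a pixel value inside the valid 0-255 range."""
    return max(0.0, min(255.0, value))

def adjust_brightness(channel, factor):
    """Scale every pixel intensity by the same factor."""
    return [[clip(p * factor) for p in row] for row in channel]

def adjust_contrast(channel, factor):
    """Scale each pixel's distance from the channel mean."""
    mean = sum(map(sum, channel)) / (len(channel) * len(channel[0]))
    return [[clip(mean + (p - mean) * factor) for p in row] for row in channel]

def sharpen(channel):
    """Apply the 3x3 kernel, replicating border pixels (an assumption)."""
    h, w = len(channel), len(channel[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += SHARPEN_KERNEL[dy + 1][dx + 1] * channel[yy][xx]
            row.append(clip(acc))
        out.append(row)
    return out

def augment(channels, rng, low=0.8, high=1.2, p_sharpen=0.5):
    """Augment each channel independently with its own random factors."""
    result = []
    for channel in channels:
        channel = adjust_brightness(channel, rng.uniform(low, high))
        channel = adjust_contrast(channel, rng.uniform(low, high))
        if rng.random() < p_sharpen:  # assumed probability of sharpening
            channel = sharpen(channel)
        result.append(channel)
    return result

# Toy 3x4 grayscale tile replicated on three channels.
rng = random.Random(42)
gray = [[float(10 * (x + y)) for x in range(4)] for y in range(3)]
augmented = augment([gray, gray, gray], rng)
```

Because each channel draws its own factors, the three channels of an augmented image no longer hold identical values, which is the source of the additional variability discussed in the text.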

The original datasets used grayscale images replicated on three channels, instead of a single channel. The reason for doing so was model compatibility. Since TL was used and the original pre-trained models were designed for three-channel inputs, this channel structure was kept for adjustment. Pre-trained weights can be used effectively this way because the input shape matches the expected format. This is an advantage when using grayscale intensity augmentation, because the artificially induced distortions to brightness, contrast, and sharpness can be applied independently to each channel, bringing a much wider range of unique combinations to each altered grayscale image than if we were using a single channel [18].

Model architecture

In this work, we are not targeting the selection of the best-performing model, and we are not using ensemble learning. For this reason, we will use single models with each dataset to test the two hypotheses outlined above. Given the overall good performance of ResNet50 in some of our previous works [2–4, 19], we will use this architecture here for Datasets 1 and 3. The reason for using it with Dataset 1 is that in the original study it also provided the lowest loss of all the TL models [4]. The reason for using it with Dataset 3 is that it was the model with the highest accuracy [17]. For Dataset 2, VGG16 provided the best accuracy and loss scores of all the transfer models tested [2]. For this reason, we will use the same model here.

For the present re-analysis, we will use TL only. The base model consists of the original ResNet50 (or VGG16) model excluding the top (fully connected and classification) layers and retaining only the convolutional base for feature extraction, as is the common protocol for TL use [16, 20]. The original models were trained on the ImageNet dataset, which enables them to be successful under domain variation and adaptable as general-purpose feature extractors. All the layers of the base model were frozen to ensure that the pre-trained weights were not updated during training on the new domain dataset. The decision to freeze all base layers in the pre-trained ResNet50 and VGG16 models was made to preserve the general visual feature representations learned from the ImageNet dataset and to prevent overfitting given the limited size of the BSM dataset. This strategy ensures training stability and reduces computational cost. However, it may limit the model’s ability to adapt to domain-specific visual patterns present in BSM images, which differ substantially from natural images in ImageNet. This trade-off reflects a common balance in TL between model stability and adaptability. In future work, partial fine-tuning (that is, unfreezing and updating only the deeper convolutional layers) could be explored to allow selective adaptation of mid- to high-level features while retaining the benefits of TL.

The following additional custom layers were added: a Flatten layer converting the feature map output by the TL models into a 1D vector for use with a Dense layer; a fully connected (Dense) layer with 128 neurons, ReLU activation and He uniform initialization to enable task-specific feature learning; a Dropout layer as a regularization method to avoid overfitting, dropping 30% of the nodes during training; and a Dense output layer with a softmax activation function. For Dataset 1, the ResNet50 model was compiled with mini-batch Stochastic Gradient Descent (SGD) as optimizer (learning rate = 0.001; momentum = 0.9), categorical cross-entropy as the loss function, softmax as the output activation function (to obtain class probabilities), and accuracy as the metric to evaluate the model’s performance during the training phase. For Dataset 3, we changed the optimizer from SGD to Adagrad, because we realized that the authors of that dataset had not explored the improvements brought by this optimizer over SGD for their particular image set [17]. The VGG16 model was modified by including the use of the “swish” activation function and also the Adagrad optimizer, since we observed in other experiments that the VGG models performed better in general when using this combination [21, 22]. Given the multiple-class nature of the dataset where VGG16 was used, we used categorical cross-entropy as the loss function and accuracy as the evaluating metric for training.
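The architecture described for Dataset 1 can be sketched in Keras as follows. This is an illustrative reconstruction from the description above, not the code deposited with the article. The function name, the configuration dictionary, and the `input_shape` are assumptions: the images are described as 400 × 80 pixels, and the height-by-width ordering used here is a guess. TensorFlow is imported inside the function so that the sketch can be read and loaded without it installed.

```python
# Hyperparameters taken from the text (Dataset 1, ResNet50).
HEAD_CONFIG = {
    "dense_units": 128,
    "dropout_rate": 0.3,
    "learning_rate": 0.001,
    "momentum": 0.9,
}

def build_transfer_model(n_classes, input_shape=(80, 400, 3), cfg=HEAD_CONFIG):
    """Frozen ResNet50 convolutional base plus the custom classification head."""
    import tensorflow as tf
    from tensorflow.keras import layers

    base = tf.keras.applications.ResNet50(
        weights="imagenet", include_top=False, input_shape=input_shape
    )
    base.trainable = False  # freeze all pre-trained layers

    model = tf.keras.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(cfg["dense_units"], activation="relu",
                     kernel_initializer="he_uniform"),
        layers.Dropout(cfg["dropout_rate"]),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(
            learning_rate=cfg["learning_rate"], momentum=cfg["momentum"]
        ),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_transfer_model(n_classes=2)` would correspond to the two-class Dataset 1 setting. The Dataset 2 and Dataset 3 variants described above would swap in VGG16 with the swish activation, or the Adagrad optimizer, respectively.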

Prior to DL analysis, the original image datasets were split into training (70%) and validation (30%) sets. In the present work, we did not use hold-out testing sets because of the very small size of the datasets (see discussion above). Models were trained with images of 400 × 80 pixels and data augmentation. Augmentation procedures included random shifting (0.2), shear (0.2), zoom (0.2), horizontal flipping, and rotation (40°). The image pre-processing function was the original one from the ResNet50 (or VGG16) TL models. The preprocessing function (commonly known as “preprocess_input” from the Keras “tensorflow.keras.applications” module) includes color normalization for color images, depending on which TL model is used. For some models, this function performs two main operations on the input images: (i) the first operation involves scaling the pixel values to a specific range, converting them from the original [0, 255] range to the range required by the CNN ([−1, 1] or [0, 1]); (ii) the second operation normalizes the color channels by subtracting the mean RGB values computed from the ImageNet dataset and dividing each channel by its corresponding standard deviation, ensuring the pixel values align with the statistical distribution expected by each model. For VGG16, the function only adjusts the color channels by subtracting the mean RGB values computed from the ImageNet dataset ([123.68, 116.779, 103.939] for the R, G, and B channels, respectively) without dividing by the standard deviation, ensuring the pixel values align with the input expectations of VGG16. For ResNet50, the mean RGB values computed from the ImageNet dataset are likewise subtracted from the R, G, and B channels (Keras 2 API documentation [https://keras.io/api/applications/resnet] accessed on 23 January 2025). Therefore, the pre-processing function also takes image normalization into account. When using grayscale intensity-augmentation, these preprocessing functions were also embedded in the augmentation function.
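For VGG16 and the original ResNet50, the Keras “preprocess_input” function is documented as reordering the channels from RGB to BGR and then zero-centering each channel on the ImageNet means, without scaling. The per-pixel sketch below mimics that behaviour for illustration only; it is not the library code, and the function and constant names are our own.

```python
# ImageNet channel means in BGR order (the values quoted in the text, reordered).
IMAGENET_MEAN_BGR = (103.939, 116.779, 123.68)

def caffe_style_preprocess(pixel_rgb):
    """Reorder one RGB pixel to BGR and subtract the ImageNet channel means."""
    r, g, b = pixel_rgb
    bgr = (b, g, r)
    return tuple(value - mean for value, mean in zip(bgr, IMAGENET_MEAN_BGR))

# A mid-gray pixel replicated on three channels, as in the grayscale datasets.
gray_pixel = (128.0, 128.0, 128.0)
print(caffe_style_preprocess(gray_pixel))
```

For a three-channel grayscale image, the input channels are identical, so the only difference between the output channels is the mean subtracted from each. The channel reordering therefore has no effect on the information content of these images.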

The main novelty between the original models and those re-trained here as baseline models consists of the use of regularization methods (i.e. Dropout). Training was performed in batches, which varied according to the original models to adapt as closely as possible to the original configuration of the published models. For Dataset 1, training used batches of 64 images and validation used batches of 32. For Datasets 2 and 3, training used batches of 32 images and validation used batches of 32, since that is how they were originally coded.

Model fitting involved training for 100 epochs. CNN models were developed using the Keras API with a TensorFlow backend, and computation was performed on an NVIDIA Quadro P5000 GPU (HP Z6 Workstation) within a CUDA computing environment.

Analytical process and comparative framework

A five-stage process was implemented:

  1. The first stage was to determine model performance using a training-validation set to generate a baseline model.

  2. The second stage enabled testing hypothesis 1 (image quality impacted the baseline model performance). Here, the image datasets were transformed through grayscale intensity-augmentation by introducing distortion in image sharpening, brightness and contrast for all classes equally. New models were generated to compare with the baseline models.

  3. The third stage consisted of determining if the sample splitting method overfit the models. For this stage stratified cross-validation methods were applied, given the unbalanced nature of the datasets.

  4. Given the small sample size, to exclude the potential of validation overfitting in the previous modeling, an independent testing set was used through two different deep-learning methods involving Few-Shot Learning (FSL) frameworks. The first was a few-shot supervised learning (FSSL) method. The second method was a model-agnostic meta-learning (MAML) approach, which is also part of the FSL array of methods. These methods enable learning through the use of small datasets and, therefore, the possibility of testing models with independent testing sets, regardless of sample size.

  5. For Datasets 1 and 2, a new set of BSM was used as a separate testing set. This separate set is composed of recent images that were not part of the original study. This is the ultimate test of the validity of the models and of the lack of methodological biases. This was not applied to Dataset 3 because: (i) we did not have additional BSM to fulfill 30% of such an extensive dataset; (ii) all our carnivore tooth mark samples had been used in that study; (iii) the original study was large enough to generate a testing set; and (iv) its classification metrics were not affected by potential methodological biases, as shown in Courtenay et al.’s [1] re-analysis. The reliability of the modeling for this dataset (and the other two) is ultimately addressed by using meta-learning, with independent testing sets.

K-fold cross-validation

A stratified k-fold cross-validation was implemented to make sure the different classes in the unbalanced samples were proportionally represented in each split/fold. The “StratifiedKFold” function from the scikit-learn library was used to ensure a 75% (training)-25% (validation) split rotated through all the images over four folds. The function generates unique splits for each fold: the indices generated for training and validation are mutually exclusive, resulting in unique images for each set. Across the four folds, every image appears exactly once in the validation set and in the training set the remaining three times. Averaged cross-validated performance metrics were obtained at the end. To test the impact of the training-validation method on the baseline model, cross-validation was applied without grayscale intensity-augmentation for comparative purposes.
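The rotation described above can be sketched in a few lines. The example below is a simplified, dependency-free stand-in for scikit-learn's `StratifiedKFold` (no shuffling, and a hypothetical function name `stratified_kfold`), using the Dataset 1 class sizes reported in the cross-validation block of Table 1. It only illustrates the splitting logic and is not the code used in the study.

```python
from collections import defaultdict


def stratified_kfold(labels, k):
    """Yield (train_idx, val_idx) pairs that keep class proportions per fold.

    Indices of each class are dealt round-robin into k validation folds, so
    every fold receives a near-equal share of every class.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        yield train_idx, sorted(fold)


# Class sizes taken from the cross-validation block of Table 1.
labels = ["crocodile"] * 46 + ["cut"] * 489
splits = list(stratified_kfold(labels, k=4))

# Count how often each image lands in a validation fold.
validation_counts = [0] * len(labels)
for train_idx, val_idx in splits:
    for i in val_idx:
        validation_counts[i] += 1

fold_sizes = [len(val_idx) for _, val_idx in splits]
crocodiles_per_fold = [
    sum(labels[i] == "crocodile" for i in val_idx) for _, val_idx in splits
]
```

With four folds, each validation fold holds roughly a quarter of each class, and `validation_counts` confirms that every image is validated exactly once.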

FSSL

Despite the enormous methodological advance they represent (especially in BSM identification, where human experts commonly disagree on the same sets), the major downside of DL methods is their dependence on big data; that is, on large amounts of information. Since DL algorithms learn through optimizers that loop through large batches of data using repeated backpropagation, their performance is tightly linked to the size of the referential database that feeds new information at every iteration. In the past few years, a complementary method called meta-learning (acronym MTL, to differentiate it from ML) has been developed to counter the methodological shortcomings of DL when applied to small datasets. MTL basically consists of “learning to learn.” To do so, MTL algorithms try to mimic human learning. If DL is inspired by the multiple layers of neurons of the human brain, MTL is inspired by the human learning process itself. Humans do not need to see many images of the same item to identify it properly; they do so with few examples. MTL algorithms are designed with this principle in mind, using methods that extract information efficiently from a limited set of tasks and subsequently allow the model to learn to differentiate a new set of tasks. MTL uses algorithms that adapt to these new tasks (even those never seen during training) and have variably good generalization properties. They can outperform DL in generalization when using large datasets, but they do so systematically when datasets are small. The reason is that MTL uses a non-parametric approach, in which training data are dimensionally reduced, memorized and/or embedded into mappings. This contrasts with the parametric approach of other ML methods, which are based on the joint probability distribution of data and their labels.

The most widely used methodological framework for MTL is n-way-k-shot meta-learning [23–27]. “Way” stands for “class” and “shot” for “number of supervised examples.” For instance, a three-way-five-shot analysis of African carnivores would imply three carnivore species and five examples of individuals within each species. The analysis would involve learning a mathematical representation of each set, with the goal of being able to generalize to other individuals and even to other non-trained categories. If we exclude the data augmentation and data generation methods that are also typical of DL, MTL k-shot methods are divided into three basic types: metric-based, model-based and optimization-based methods [26].

Here, we use two separate approaches that enable us to divide the small unbalanced samples into training-validation-testing sets. The first of these approaches is inspired by MTL few-shot methods, but does not qualify as meta-learning. It uses an n-way-k-shot framework, with multiple mini-batches per class under a supervised TL structure (Supplementary Fig. S7). Although we referred to this approach earlier as MTL because of its few-shot structure [10, 28], it is more adequately referred to as FSSL. The second approach is a MAML method. The latter differs from the former in its use of a query set, task-specific inner-loop learning, and adaptation of model parameters to be more flexible across tasks (using the meta-gradient).

For the first approach (FSSL), we implemented a supervised FSL strategy based on TL using a pre-trained convolutional neural network architecture. Specifically, we employed VGG16 as a frozen feature extractor, leveraging weights pre-trained on ImageNet and excluding its top classification layers. The frozen base model was extended with custom classification layers, including a convolutional layer, global average pooling, a dense layer, dropout for regularization, batch normalization, and a final dense output layer with softmax activation. These newly added layers were trainable, while the convolutional backbone remained static throughout training. To simulate FSL conditions, the model was trained on episodic tasks composed of a small number of examples per class—for example, five samples (five-shot) per class across three classes (three-way). For each task, a small, balanced subset of the training dataset was randomly sampled to form a task-specific batch. The model was trained on these episodic batches using a conventional supervised learning loop, where gradient updates were applied directly to the trainable layers after each batch. Although the training regime mimicked FSL scenarios by using limited data per class, no task-specific adaptation or inner-loop optimization was performed. All updates were global, applied cumulatively across tasks using standard backpropagation. The approach did not involve a support/query split or meta-optimization across tasks, distinguishing it from meta-learning frameworks such as MAML. Instead, the model was exposed to multiple independent small-sample classification problems, aiming to generalize from these using the fixed features extracted by the pre-trained network. This method allowed for efficient training on limited data without the computational complexity of full meta-learning, while still taking advantage of TL and episodic data structure to simulate FSL conditions.

The study employed variable task and shot configurations, categorized as low (few shots and more tasks) and high (more shots and fewer tasks). For Dataset 1, the high-shot configuration (10 shots × 2 classes × 19 tasks) approximated the total number of training images (374). No low-shot configuration was implemented because of the positive results of the high-shot approach. For Dataset 2, the high-shot configuration was based on 458 training images (10 shots × 3 classes × 15 tasks), and the low-shot configuration used 5 shots and 30 tasks. For Dataset 3, the high-shot configuration was based on 653 training images (10 shots × 3 classes × 20 tasks), and the low-shot configuration used 5 shots and 40 tasks. To ensure stability, the task-shot module was configured with “replace=False,” preventing resampling. The restricted configuration employed here produced more stable models with reduced or no overfitting, depending on the set. The two configurations (low and high) facilitated a comparative analysis of accuracy improvements driven by varying task-shot combinations.
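The task-shot configurations above can be pictured as repeated draws of small balanced batches. The sketch below is one possible reading, in which sampling without replacement applies within each task; that scope, the function name `sample_task`, and the per-class pool sizes (roughly 70% of the Dataset 2 class totals in Table 4) are assumptions made for illustration only.

```python
import random


def sample_task(indices_by_class, n_way, k_shot, rng):
    """Draw one n-way-k-shot task: k_shot distinct images from n_way classes."""
    classes = rng.sample(sorted(indices_by_class), n_way)
    task = []
    for label in classes:
        # random.sample draws without replacement, so no image repeats
        # within a task (the analogue of replace=False in the text).
        for idx in rng.sample(indices_by_class[label], k_shot):
            task.append((idx, label))
    rng.shuffle(task)
    return task


rng = random.Random(0)

# Hypothetical training pool of 458 image indices; class sizes are assumed.
pool = {
    "tooth mark": list(range(0, 72)),
    "cut mark": list(range(72, 414)),
    "trampling": list(range(414, 458)),
}

# High configuration: 10 shots x 3 classes x 15 tasks.
high_shot = [sample_task(pool, n_way=3, k_shot=10, rng=rng) for _ in range(15)]
# Low configuration: 5 shots x 3 classes x 30 tasks.
low_shot = [sample_task(pool, n_way=3, k_shot=5, rng=rng) for _ in range(30)]
```

Both configurations present the same number of samples overall (450), but the low configuration spreads them over twice as many, smaller tasks.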

MAML method

We also used MAML not only to compare MTL few-shot approaches with the DL models, but to provide a robust framework for generalization under limited-data conditions—more representative of real-world settings in BSM analysis. The following section outlines, in detail, how this protocol was applied, including architecture, task structure, and training parameters.

If the baseline models had been overfitted during training, a separate testing dataset analyzed with a different method would be expected to reveal this through lower accuracy and higher loss than the trained DL models, as well as low values for the performance metrics (precision, recall, and F1-score). To test Hypothesis 2, in addition to the cross-validation method above, we implemented a few-shot analysis, which enables learning from small datasets and the use of the training-validation-testing splits that are standard protocol in ML methods (and, more specifically, in DL methods using larger samples). We implement this approach here with a triple data split, as Courtenay et al. did, instead of a dual training-validation split. The difference is that we do so through a model/method that has maximized its learning, instead of restricting it, compared to the baseline models used by Courtenay et al.

There exists a diverse range of FSL methods, many of which differ significantly in philosophy and structure. These include siamese networks, prototypical networks, relation networks, matching networks, and model-agnostic methods [25, 26]. Among these, MAML has been selected as the method of choice for this study due to its independence from specific model architectures. MAML’s design enables it to adapt to a wide variety of model structures without substantial modifications [25, 29, 30]. The core of MAML involves a meta-learning process in which model parameters are fine-tuned and updated, rendering them highly adaptable to novel models and tasks. While MAML demonstrated strong performance in our experiments, its selection was motivated not only by its model-agnostic nature but also by its ability to rapidly adapt to new tasks with minimal data. Compared to metric-based approaches like Prototypical Networks or Siamese Networks, which rely on embedding distances and often require extensive pre-training, MAML provides a flexible optimization-based strategy that fine-tunes models more directly for rapid generalization. For instance, unlike Siamese networks, which learn fixed similarity functions, MAML enables task-specific adaptation without redesigning the architecture. Although MAML’s adaptability was advantageous in our context, we acknowledge that alternative FSL methods remain promising and merit exploration in future work.

In MAML, the concept of “task” forms the foundation of its meta-learning approach. Tasks represent individual learning problems or datasets that the model must address. In the context of the FSL method employed here, each task comprises a small dataset with limited examples, from which the model learns. Tasks may involve various classification problems, such as classifying images into different categories, each using a variable number of subsamples and classes. Across tasks, the number of examples per class can also vary. MAML’s meta-learning process operates over multiple tasks, training the model to adapt quickly to any new task it encounters. This approach contrasts with the conventional single-task training strategy employed in deep convolutional neural networks.

In this study, MAML was implemented within the framework of TL. We adapted the model-agnostic meta-learning (MAML) algorithm introduced by Finn et al. [29] to a few-shot image classification task using modified ResNet 50 (Datasets 1 and 3) and VGG16 (Dataset 2) backbones. Our approach followed the core principle of MAML by explicitly separating the data for each task into a support set (used for inner-loop adaptation) and a query set (used to evaluate the adapted model and compute the meta-loss). For each meta-training iteration, we sample multiple tasks, each composed of k-shot n-way support/query splits. The model undergoes inner-loop adaptation by updating its weights using gradients computed on the support set. These task-specific adapted weights are then used to predict the query set labels, and the resulting loss is used to compute second-order gradients with respect to the original model parameters (the outer loop). These gradients are averaged across tasks and used to update the shared model parameters via a meta-optimizer. This two-loop procedure ensures that the model learns a good initialization that can be efficiently fine-tuned to new tasks with minimal gradient steps. The implementation captures the essence of Finn et al.’s algorithm by using cloned weights for inner-loop updates, maintaining task-specific splits, and performing meta-updates based on query performance, thereby enabling fast adaptation to novel tasks during inference. For this study, we used 5 images for the support set, 5 images for the query set and 20 tasks for Datasets 1 (two classes: crocodile tooth marks and stone tool cut marks) and 2 (three classes: tooth marks, cut marks and trampling marks). For Dataset 3, given its larger sample size, we used a 5 support-5 query-25 task structure (three classes: tooth marks, cut marks and trampling marks). To enable model updates, we unfroze the last 10 layers in the ResNet 50 TL model, and the last 4 layers in the VGG16 TL model.
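The two-loop procedure can be illustrated on a deliberately tiny problem. The sketch below applies second-order MAML to a one-parameter regression model (prediction = w · x) so that the inner-loop update and the meta-gradient can be written out analytically. The toy tasks, the learning rates and all function names are assumptions chosen for illustration; it is not the ResNet 50/VGG16 implementation used in this study.

```python
import random


def loss(w, data):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grad(w, data):
    """First derivative of the loss with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in data) / len(data)


def hessian(data):
    """Second derivative of the loss with respect to w (constant in w)."""
    return sum(2 * x * x for x, _ in data) / len(data)


def make_task(rng, n_support=5, n_query=5):
    """Toy task: points on a line y = slope * x, split into support and query."""
    slope = rng.uniform(1.0, 3.0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_support + n_query)]
    points = [(x, slope * x) for x in xs]
    return points[:n_support], points[n_support:]


def meta_step(w, tasks, inner_lr, meta_lr):
    """One outer-loop update averaged over a batch of tasks."""
    meta_grad = 0.0
    for support, query in tasks:
        # Inner loop: adapt a copy of w with one gradient step on the support set.
        adapted = w - inner_lr * grad(w, support)
        # Outer loop: differentiate the query loss through the inner update;
        # the (1 - inner_lr * hessian) factor is the second-order term.
        meta_grad += grad(adapted, query) * (1 - inner_lr * hessian(support))
    return w - meta_lr * meta_grad / len(tasks)


def mean_query_loss(w, tasks, inner_lr):
    """Average query loss after one adaptation step per task."""
    return sum(loss(w - inner_lr * grad(w, s), q) for s, q in tasks) / len(tasks)


rng = random.Random(0)
eval_tasks = [make_task(rng) for _ in range(20)]

w = 0.0
before = mean_query_loss(w, eval_tasks, inner_lr=0.1)
for _ in range(200):
    batch = [make_task(rng) for _ in range(20)]  # 20 tasks per meta-iteration
    w = meta_step(w, batch, inner_lr=0.1, meta_lr=0.05)
after = mean_query_loss(w, eval_tasks, inner_lr=0.1)
```

After meta-training, the shared parameter `w` sits near the middle of the task slopes, so a single support-set step brings each task closer to its own solution and the post-adaptation query loss (`after`) falls below its starting value (`before`).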

Image preprocessing included normalization using the preprocessing functions of the TL models. The dataset was partitioned into training (70%), validation (15%), and testing (15%) subsets, resulting in 374 training images and 160 images evenly split between validation and testing for Dataset 1. For Dataset 2, training involved 458 images and validation-testing used 193 images, split between the two. Dataset 3 was composed of 653 images for training and 280 images split between the validation and testing sets. Data augmentation techniques such as random shifting, shearing, zooming, horizontal flipping, and rotation were applied, in addition to image standardization to 400 × 80 pixels. The MAML model used the Adam optimizer with a learning rate of 1e-03. Loss was measured using “sparse categorical cross-entropy” since labels were one-hot encoded.
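For reference, the two Keras cross-entropy variants differ only in the label format they expect: the categorical form takes one-hot vectors, whereas the sparse form takes integer class indices. The following sketch, with hypothetical helper names and made-up probabilities, shows that both give the same value for a single sample; it does not reproduce the training code of this study.

```python
import math


def categorical_cross_entropy(one_hot, probs):
    """Cross-entropy for a one-hot encoded label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def sparse_categorical_cross_entropy(class_index, probs):
    """Cross-entropy for an integer class label."""
    return -math.log(probs[class_index])


# Made-up softmax output for a three-class problem.
probs = [0.7, 0.2, 0.1]
label_index = 0                      # integer encoding of the true class
label_one_hot = [1.0, 0.0, 0.0]      # one-hot encoding of the same class

cce = categorical_cross_entropy(label_one_hot, probs)
scce = sparse_categorical_cross_entropy(label_index, probs)
```

Both calls return −ln(0.7), so the choice between the two losses affects how labels are stored rather than the value being minimized.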

Validation method

A comparative validation method was implemented by comparing the model performance from DL models generated using the training-validation split, with k-fold cross-validated models, with meta-learning models (using training-validation-testing splits), and with an external new testing set (see below).

Testing the original models with novel testing datasets

To the traditional analyst, nothing certifies the validity of a model better than testing it against a sample of unseen data (i.e. testing data not used for training or validation) [5]. For taphonomy, given that such extensive data require additional time-consuming experimentation, we borrowed a new set of experimental BSM from recent (unpublished) experiments. For Dataset 1, we gained access to a recent small collection of crocodile tooth marks. The original sample had been obtained with crocodiles in Faunia (Madrid) [7, 31]. The new testing set was obtained from an experiment conducted by our colleague Edgard Camarós at Altamira Zoo (Santander, Spain). Two adult dwarf crocodiles (Osteolaemus tetraspis), both male, were used in this experiment. Carcasses, comprising partly defleshed limbs of adult pigs and a pelvis, were collected after 10 min of feeding exposure. The bulk of the tooth marks obtained were tooth pits, which were not used in this study; however, a small number of tooth scores were photographed with an Olympus LEXT OLS3000 confocal microscope and are used here. This was complemented with a few tooth scores from the Faunia collection that had been discarded in the original study because of the poor resolution of the images obtained with the Optika microscope. Here, they were documented with a Leica Emspira 3 optical microscope. A total of 20 new tooth scores were gathered from both sources. This number equals 44% of the original complete sample used by Abellán et al. [4], and is slightly larger than Courtenay et al.’s [1] original testing sample of the same dataset (n = 13).

We also incorporated a set of 65 cut marks from a recent experiment [32], different from the original dataset used by Abellán et al. [4]. This new sample was photographed also using a Leica Emspira 3 microscope. We decided to keep the cut mark sample proportionally smaller than the original study, to better assess the potential biasing effect of the crocodile tooth mark under-represented class, purportedly most affected by analytical biases.

For Dataset 2, we used an additional set of cut marks from Cifuentes-Alcobendas [32], given that the cut mark sample for both original studies was the same [2, 4]. Again, given that the cut mark samples for both datasets originally scored high in the classification metrics, we wanted to minimize their effects on the more biased underrepresented classes. For this reason, we kept a small cut mark testing set. Additionally, we included 25 new trampling marks randomly selected from the enlarged sample generated by Cifuentes-Alcobendas [32]. This equals about 40% of the original trampling sample used for Dataset 2 (n = 63) [2]. For the tooth mark sample, we used a testing set of 30 tooth marks from wolves, which were never included in the original study of Dataset 2, and which were acquired through microscopy using a Leica Emspira 3. This equals almost 30% of the original tooth mark sample used for the analysis of Dataset 2 [2]. We were careful not to introduce tooth marks from carnivores other than those on which the original model had been trained (lions and wolves), given that they are different enough to be classified by class and can potentially be misclassified [3, 33].

For Dataset 3, we did not implement any additional testing sample for lack of new BSM, which could be used in the proportion required as a testing set for the larger original sample used for that BSM assemblage [17]. However, it was also not necessary, since the replication carried out by Courtenay et al. [1] with an additional testing sample split already showed good classification (except for the trampling subset).

As a contrasting method, we compared the results of the new testing sets within the framework of differential image quality between the original datasets and the new testing sets. In order to do so, we used the BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator) method. The BRISQUE algorithm is a no-reference image quality assessment method designed to estimate how perceptually degraded an image is without needing a pristine reference image for comparison. Unlike traditional metrics such as PSNR or SSIM, which require a ground-truth image, BRISQUE analyzes natural scene statistics in the spatial domain, focusing on how much an image deviates from the statistical regularities typically found in high-quality photographs. It operates by extracting features related to luminance and contrast from local patches in the image and compares them to a model trained on images with known human quality scores. BRISQUE assigns a numerical score—where lower values correspond to better visual quality, and higher values indicate artifacts such as blur, noise, or compression distortions. It is particularly useful for assessing image datasets where consistent visual quality is important but no reference images are available, making it valuable in fields like CV, medical imaging, and DL pipelines that depend on image fidelity. While BRISQUE was developed with natural images in mind, its ability to flag common degradations like blur and overexposure makes it a practical diagnostic tool for quality control in diverse imaging contexts.

Results

Dataset 1

A trained ResNet 50 model yielded an accuracy of 98.75% (loss = 0.0251) on the classification of the validation set. A loss value of 0.02 is considered low in this context because we are using categorical cross-entropy, which has a theoretical minimum of zero, indicating perfect prediction. Therefore, values approaching zero reflect a very small difference between predicted and true class distributions, and signal a well-calibrated model. High accuracy and low loss together suggest that the model generalizes well and is neither overfitting nor underfitting. The results show that the model is accurate in its predictions (with a receiver operating characteristic-area under the curve (ROC-AUC) value of 1.0). The performance metrics indicate high precision, recall and F1-score for both classes (Table 1). Only 2 BSM (both crocodile tooth marks) out of the 160 in the validation set were misclassified (Table 2). Up to this point, the influence of potential overfitting on the validation set cannot be ruled out.

Table 1.

Classification report of the performance metrics from Dataset 1. Compare these results with those of Courtenay et al. [1: Table 3].

Baseline model
Precision Recall F1-score Support
crocodile 1 0.86 0.92 14
cut 0.99 1 0.99 146
accuracy 0.99 160
macro avg 0.99 0.93 0.96 160
weighted avg 0.99 0.99 0.98 160
With grayscale intensity augmentation
Precision Recall F1-score Support
crocodile 1 0.86 0.92 14
cut 0.99 1 0.99 146
accuracy 0.99 160
macro avg 0.99 0.93 0.96 160
weighted avg 0.99 0.99 0.99 160
Cross-validation
Precision Recall F1-score Support
crocodile 0.98 0.96 0.97 46
cut 1 1 1 489
accuracy 0.99 535
macro avg 0.99 0.98 0.98 535
weighted avg 0.99 0.99 0.99 535
FSSL
Precision Recall F1-score Support
crocodile 0.78 0.88 0.82 8
cut 0.99 0.97 0.98 73
accuracy 0.96 81
macro avg 0.88 0.92 0.9 81
weighted avg 0.97 0.96 0.96 81
MAML 5 shot-5 query-20 task
Precision Recall F1-score Support
crocodile 1 1 1 8
cut 1 1 1 73
accuracy 1 81
macro avg 1 1 1 81
weighted avg 1 1 1 81

Table 2.

Confusion matrix for Dataset 1. Notice the sharp contrast between the baseline models in Courtenay et al. and the present work. Metrics in bold show the accurate classification values

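Model                            TRUE class   PREDICTED crocodile   PREDICTED cut mark
Courtenay et al.                 crocodile    0                     13
Courtenay et al.                 cut mark     0                     146
Present work (Baseline)          crocodile    12                    2
Present work (Baseline)          cut mark     0                     146
Present work (Augmented)         crocodile    12                    2
Present work (Augmented)         cut mark     0                     146
Present work (Cross-validated)   crocodile    44                    2
Present work (Cross-validated)   cut mark     1                     488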
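The per-class values reported in Table 1 follow directly from the confusion matrices in Table 2. As a worked check, the sketch below recomputes accuracy, precision, recall and F1-score for the baseline model of the present work; the function name `classification_metrics` is hypothetical and the code is only an arithmetic illustration.

```python
def classification_metrics(confusion, labels):
    """Return (accuracy, {label: (precision, recall, f1)}).

    confusion[i][j] is the number of items of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    report = {}
    for i, label in enumerate(labels):
        true_pos = confusion[i][i]
        predicted = sum(row[i] for row in confusion)   # column total
        actual = sum(confusion[i])                     # row total
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1)
    return correct / total, report


# Baseline model of the present work (Table 2): rows are true classes.
baseline = [[12, 2],    # true crocodile: 12 predicted crocodile, 2 predicted cut
            [0, 146]]   # true cut mark: 0 predicted crocodile, 146 predicted cut
accuracy, report = classification_metrics(baseline, ["crocodile", "cut"])
```

The result reproduces the baseline block of Table 1: an accuracy of 0.9875, and a precision, recall and F1-score of 1.00, 0.86 and 0.92 for crocodile tooth marks.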

In order to test Hypothesis 1 (quality of images impacting the results), the same model was applied to the grayscale-augmented dataset and produced a result similar to the baseline model: accuracy (98.75% correct classification of the validation set) remains high and loss (0.0214) remains low, with a ROC-AUC value of 1.0, indicating a marginal to nil impact of the image quality distribution of the dataset (Table 1). The grayscale-augmented model achieved a slightly lower loss than the baseline (0.0214 versus 0.0251), but we acknowledge that this alone does not definitively indicate greater reliability, particularly in the context of class imbalance, where loss values can be skewed by the majority class. The performance of the grayscale-augmented model is better assessed through its F1 scores, which directly reflect the balance between precision and recall: they equal those of the baseline for both classes (0.92 and 0.99), with a marginally higher weighted average (0.99 versus 0.98; Table 1). These metrics provide a more robust indication of the model’s ability to generalize across both majority and minority classes, and suggest that grayscale augmentation may help reduce overfitting to dominant class features and enhance overall classification stability. Only two crocodile tooth marks were misclassified out of the 160-image validation set (Table 2). This also supports the inference that the blurry and out-of-focus portions of several images may have acted as a regularization mechanism improving the training of the baseline model (Fig. 1, upper). Hypothesis 1 is, thus, rejected.

Figure 1.


Training graphs—accuracy (left) and loss (right)—of the grayscale intensity-augmented models for the three datasets: Resnet 50 on Dataset 1 (upper), and Dataset 3 (lower), and VGG 16 on Dataset 2 (middle).

To evaluate the performance of our binary classification DL model using Dataset 1, we computed the ROC-AUC score. In binary classification, this metric quantifies the model’s ability to distinguish between the positive and negative classes based on predicted probabilities derived from a sigmoid activation function. The ROC curve is constructed by plotting the true positive rate (sensitivity) against the false positive rate at various threshold settings. The AUC summarizes the model’s discriminative ability, where a value of 1.0 indicates perfect separation and 0.5 reflects random guessing.

Our ResNet 50 model trained on Dataset 1 achieved a ROC-AUC score of 1.0 for both the baseline configuration and the grayscale intensity augmentation, reflecting excellent performance. An AUC of this magnitude indicates that, in virtually all randomly selected positive-negative pairs, the model assigns a higher probability to the positive example than to the negative one. Such a high AUC suggests minimal overlap between the predicted probability distributions of the two classes, demonstrating the model’s robustness and reliability in binary decision-making scenarios (Fig. 2).
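The pairwise reading of the AUC can be made concrete with a short sketch. The function below (a hypothetical `roc_auc` helper) computes the AUC as the fraction of positive-negative pairs in which the positive example receives the higher score; the probabilities are made up for illustration and are not model outputs from this study.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of positive-negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up predicted probabilities of the positive class.
# Every positive outranks every negative, although one positive (0.45)
# falls below the usual 0.5 decision threshold.
perfect = roc_auc([0.9, 0.8, 0.45], [0.3, 0.2, 0.1])

# One positive (0.25) is outranked by one negative (0.3): 8 of 9 pairs are correct.
overlap = roc_auc([0.9, 0.8, 0.25], [0.3, 0.2, 0.1])
```

The first case shows why an AUC of 1.0 can coexist with a small number of misclassifications at a fixed threshold: the AUC depends only on how the scores are ranked, not on where the decision threshold is placed.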

Figure 3.


Training graphs of the loss and accuracy of some of the FSSL models for Dataset 1 (upper), Dataset 2 (middle), and Dataset 3 (bottom).

The model generated to test Hypothesis 2 (overfitting of the validation set and the resulting model) shows that the four-fold cross-validation resulted in an average accuracy of 99.44%, slightly higher than that of the baseline model, suggesting a lack of overfitting. The overall classification report yielded a well-balanced, high classification rate, despite the large difference in sample sizes between the two classes (Table 1). The similarity of these metrics (obtained with a substantially larger and more varied validation set) to those of the grayscale-augmented model indicates that the training-validation method does not show any signs of bias or overfitting. Again, classification of BSM is highly accurate, with only two crocodile tooth marks and one cut mark misclassified (Table 2).

To ensure that cross-validation was not affected by any undetected overfitting process, the FSSL analysis, using separate training-validation-testing subsamples, yielded an accuracy of 96.3% on the testing set, thus supporting that the original model [4] and the baseline model in the present work have not overfit the training or validation datasets (Fig. 3). The performance metrics indicate a balanced precision, recall and F1-score for both classes (Table 1). These metrics together suggest that the model generalizes well to unseen data without significant overfitting, and that the model’s predictions are confident for both correct and incorrect classifications. The FSSL results are very similar to the cross-validated analyses, further supporting that the training-validation split did not introduce any bias. For this dataset, Hypothesis 2 is also rejected.

Figure 5.


The three misclassified crocodile tooth marks of the new testing set. The two upper marks are from Osteolaemus tetraspis, and the bottom one is from Crocodylus niloticus. Notice their overall similar morphology to cut marks.

In order to add more support to this interpretation, the MAML analysis yielded an extremely accurate estimate of both types of BSM. On the validation set, it yielded an accuracy of 98.75% (loss = 0.0679). On the testing set it classified correctly 100% (loss = 0.0026) of crocodile tooth marks and stone tool cut marks (Table 1). During training, the performance of MAML was much more stable than FSSL (Fig. 4).

Figure 4.


Training graphs of the loss and accuracy of some of the MAML models for Dataset 1 (upper), Dataset 2 (middle), and Dataset 3 (bottom).

Finally, the application of the baseline model to the new testing set composed of 85 new BSM yielded a result similar to all the models described above. A total of 92% of the testing marks were correctly classified, with an F1-score of 0.83 for the crocodile tooth marks and 0.95 for the cut marks (Table 1). Only 3 crocodile tooth marks out of 20 were misclassified, and only 4 out of 65 cut marks were misidentified (Table 3). Interestingly, the three misclassified crocodile marks are the most similar to cut marks of the whole set, showing extremely narrow V-shaped grooves caused by the carina of the tooth. Most of them were made by the new crocodiles (Osteolaemus tetraspis), which were smaller and had sharper teeth than the original crocodiles (Crocodylus niloticus) used for the sample collected from Faunia (Fig. 5). These marks would also have been interpreted as cut marks by experienced human taphonomists.2 This separate testing set shows that the original model learned to differentiate both types of BSM without methodological biases.

Table 3.

Classification (confusion matrix) of the datasets using the new testing set.

DS1 (BM-NTS*)
TRUE class   PREDICTED crocodile   PREDICTED cut mark
crocodile    17                    3
cut mark     4                     61

DS2 (BM-NTS*)
TRUE class   PREDICTED score   PREDICTED cut mark   PREDICTED trampling
score        33                0                    0
cut mark     4                 60                   1
trampling    7                 4                    10

* BM-NTS: Baseline Model on the New Testing Set.

Bold values indicate accurate classification values.

Figure 2.


ROC curve for the binary classification DL model trained on the intensity-augmented Dataset 1 using a sigmoid activation function. The AUC reflects the model’s ability to discriminate between the two classes based on predicted probabilities. The ROC-AUC score of 1.0 indicates excellent classification performance and high confidence in distinguishing between the positive and negative classes.

Dataset 2

A trained VGG16 model yielded an accuracy of 94.27% (loss = 0.140) on the classification of the validation set. This is slightly better than the accuracy (0.92) and loss (0.36) scores in the original study [2], probably because of the implementation of regularization. The ROC-AUC value is 0.98. High accuracy and moderately low loss together suggest that the model’s predictions are fairly confident and close to the true labels, with the exception of trampling. Initially, this does not suggest overfitting of the model. The performance metrics indicate high precision, recall and F1-scores for all classes but one (Table 4). Tooth and cut marks are fairly well classified (well balanced in their true positives and false negatives). Only 2 tooth marks and 2 cut marks were misclassified in a 193-image validation set (Table 5). Trampling marks exhibit moderate precision (0.79) and recall, with only 61% correctly classified; however, the F1-score of 0.69 indicates a reasonable balance between precision and recall given that subsample size.

Table 4.

Classification report of the performance metrics from Dataset 2.

Baseline model
Precision Recall F1-score Support
tooth mark 0.81 0.93 0.87 28
cut mark 0.99 0.99 0.99 147
trampling 0.79 0.61 0.69 18
accuracy 0.94 193
macro avg 0.86 0.84 0.85 193
weighted avg 0.94 0.94 0.94 193
With grayscale intensity augmentation
Precision Recall F1-score Support
tooth mark 0.92 0.79 0.85 28
cut mark 0.95 1 0.97 147
trampling 0.86 0.67 0.75 18
accuracy 0.94 193
macro avg 0.91 0.82 0.86 193
weighted avg 0.94 0.94 0.93 193
Cross-validation
Precision Recall F1-score Support
tooth mark 0.86 0.86 0.86 103
cut mark 0.97 0.98 0.98 489
trampling 0.71 0.65 0.68 63
accuracy 0.93 655
macro avg 0.85 0.83 0.84 655
weighted avg 0.93 0.93 0.93 655
FSSL 10 shot- 15 task
Precision Recall F1-score Support
tooth mark 0.9 0.86 0.88 22
cut mark 0.98 0.98 0.98 65
trampling 0.69 0.75 0.72 12
accuracy 0.93 99
macro avg 0.86 0.87 0.86 99
weighted avg 0.93 0.93 0.93 99
FSSL 5 shot- 30 task
Precision Recall F1-score Support
tooth mark 0.88 1 0.94 22
cut mark 0.98 0.98 0.98 65
trampling 0.89 0.67 0.76 12
accuracy 0.95 99
macro avg 0.92 0.88 0.89 99
weighted avg 0.95 0.95 0.95 99
MAML 5 shot-5 query-20 task
Precision Recall F1-score Support
tooth mark 0.91 0.91 0.91 22
cut mark 1 1 1 65
trampling 0.83 0.83 0.83 12
accuracy 0.96 99
macro avg 0.91 0.91 0.91 99
weighted avg 0.96 0.96 0.96 99
Baseline model on new testing set
Precision Recall F1-score Support
tooth mark 0.75 1 0.86 33
cut mark 0.94 0.92 0.93 65
trampling 0.91 0.48 0.62 21
accuracy 0.87 119
macro avg 0.87 0.8 0.8 119
weighted avg 0.88 0.87 0.86 119

Table 5.

Confusion matrix for Datasets 2 (DS2) and 3 (DS3). Notice the sharp contrast between the baseline models in Courtenay et al. and those in the present work. Metrics in bold show the accurate classification values.

Rows are TRUE classes and columns are PREDICTED classes.

| Model | DS2: cut mark | DS2: score | DS2: trampling | DS3: cut mark | DS3: score | DS3: trampling |
|---|---|---|---|---|---|---|
| Courtenay et al. | 134 | 11 | 1 | 163* | 2 | 6 |
| | 2 | 28 | 0 | 7 | 126 | 3 |
| | 1 | 13 | 4 | 1 | 11 | 33 |
| Present work, Baseline | 145 | 0 | 2 | 44 | 1 | 0 |
| | 1 | 26 | 1 | 1 | 185 | 3 |
| | 1 | 6 | 11 | 0 | 1 | 45 |
| Present work, Augmented | 147 | 0 | 0 | 44 | 1 | 0 |
| | 4 | 22 | 2 | 1 | 185 | 3 |
| | 2 | 4 | 12 | 0 | 1 | 45 |
| Present work, Cross-validated | 480 | 3 | 6 | 620 | 5 | 4 |
| | 3 | 89 | 11 | 2 | 148 | 0 |
| | 11 | 11 | 44 | 14 | 0 | 140 |

* Notice how Courtenay et al.’s confusion matrix is erroneous because it contains more “testing” cut marks than the complete cut mark dataset.
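The per-class metrics reported in Table 4 can be recomputed from a confusion matrix such as those in Table 5. The following is a minimal sketch, not the authors' code. It uses the Dataset 2 baseline matrix from Table 5 and assumes that rows are true classes, columns are predicted classes, and both follow the order cut mark, tooth mark (score), trampling; the function names are illustrative.

```python
# Dataset 2 baseline confusion matrix, copied from Table 5.
# Assumption: rows = true class, columns = predicted class, in CLASSES order.
CLASSES = ["cut mark", "tooth mark", "trampling"]
MATRIX = [
    [145, 0, 2],
    [1, 26, 1],
    [1, 6, 11],
]


def per_class_metrics(matrix):
    """Return {class index: (precision, recall, f1, support)}."""
    n = len(matrix)
    metrics = {}
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # true k, predicted as another class
        fp = sum(matrix[r][k] for r in range(n)) - tp   # other classes predicted as k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        metrics[k] = (precision, recall, f1, tp + fn)
    return metrics


def accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def macro_average(metrics):
    """Unweighted mean of precision, recall and F1 across classes."""
    n = len(metrics)
    return tuple(sum(values[i] for values in metrics.values()) / n for i in range(3))


results = per_class_metrics(MATRIX)
for k, name in enumerate(CLASSES):
    precision, recall, f1, support = results[k]
    print(f"{name:10s} P={precision:.3f} R={recall:.3f} F1={f1:.3f} n={support}")
print(f"accuracy={accuracy(MATRIX):.3f}")
```

Under these assumptions the recomputed supports (147, 28, 18) match Table 4, and the per-class values agree with the baseline block of Table 4 to within about 0.01. The macro average treats the 18 trampling marks with the same weight as the 147 cut marks, which is why it sits below the weighted average in an unbalanced dataset.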

In order to test Hypothesis 1, the same model was applied to the grayscale intensity-augmented dataset. The result clearly shows that the original non-augmented dataset was not biasing or overfitting the model. The augmented model yielded an accuracy of 93.75% with a loss of only 0.174. This is equivalent to the non-grayscale-augmented model, probably because random variations of brightness and contrast were already present in the original dataset. The augmented model is also equally good at classifying the unbalanced dataset, with the exception of the trampling marks (Table 4). Only six tooth marks and no cut marks are misclassified in the validation set. Trampling still shows a high rate of misclassification (6 out of 18), although it is slightly better than in the baseline model (7 out of 18; Table 5). This is due to the small size of the trampling subsample compared with the tooth mark and cut mark subsamples. Average precision and F1 are also slightly better than in the baseline model (Table 6). This result indicates that the quality of the image dataset had no appreciable impact (Fig. 1, middle). Hypothesis 1 is, again, not supported.

Table 6.

Average values for each model’s classification metrics showing the scores from Courtenay et al.’s models for the three datasets, in stark contrast with the values (in bold) from the baseline models replicated in the present work, as well as the color-augmented, cross-validated, FSSL and MAML models. For more information according to class, see Tables 1, 4, and 8.

DS1

| Metric | Courtenay et al.: Average | Present work: Baseline | Augmented | Cross-validated | FSSL (best) | MAML | BM-NTS* |
|---|---|---|---|---|---|---|---|
| Precision | 0.46 | 0.99 | 0.99 | 0.99 | 0.88 | 1.00 | 0.88 |
| Recall | 0.5 | 0.93 | 0.93 | 0.98 | 0.92 | 1.00 | 0.89 |
| F1 | 0.48 | 0.96 | 0.96 | 0.98 | 0.9 | 1.00 | 0.89 |
| Accuracy | 0.92 | 0.99 | 0.99 | 0.99 | 0.96 | 1.00 | 0.92 |

DS2

| Metric | Courtenay et al.: Average | Present work: Baseline | Augmented | Cross-validated | FSSL (best) | MAML | BM-NTS* |
|---|---|---|---|---|---|---|---|
| Precision | 0.77 | 0.86 | 0.91 | 0.85 | 0.92 | 0.91 | 0.87 |
| Recall | 0.69 | 0.84 | 0.82 | 0.83 | 0.88 | 0.91 | 0.8 |
| F1 | 0.66 | 0.85 | 0.86 | 0.84 | 0.89 | 0.91 | 0.8 |
| Accuracy | 0.86 | 0.94 | 0.94 | 0.93 | 0.95 | 0.96 | 0.87 |

DS3

| Metric | Courtenay et al.: Average | Present work: Baseline | Augmented | Cross-validated | FSSL (best) | MAML |
|---|---|---|---|---|---|---|
| Precision | 0.88 | 0.98 | 0.97 | 0.97 | 1 | 0.98 |
| Recall | 0.87 | 0.98 | 0.98 | 0.96 | 0.98 | 0.98 |
| F1 | 0.88 | 0.98 | 0.97 | 0.97 | 0.99 | 0.98 |
| Accuracy | 0.91 | 0.99 | 0.98 | 0.97 | 0.99 | 0.99 |

* Baseline Model on the New Testing Set.

To evaluate the performance of our multiclass classification DL models, we computed the ROC-AUC score using a one-vs-rest approach. In this method, a separate ROC curve is generated for each class by comparing that class against all others, and the individual AUCs are then averaged—typically using either a macro or weighted average—to obtain an overall score. The ROC-AUC metric assesses the model’s ability to rank predictions correctly across all classes based on their associated probabilities.
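The one-vs-rest averaging described above can be sketched in a few lines. This is an illustrative sketch with made-up probabilities, not the study's data or code; the function names are hypothetical, and the class key (0, tooth marks; 1, cut marks; 2, trampling marks) simply follows the convention used in the figure captions. The binary AUC is computed here as the fraction of positive-negative pairs that are ranked correctly, which is equivalent to the area under the ROC curve.

```python
def binary_auc(scores, is_positive):
    """AUC of one class against the rest.

    Fraction of (positive, negative) pairs in which the positive sample
    gets the higher score; tied pairs count as one half.
    """
    positives = [s for s, flag in zip(scores, is_positive) if flag]
    negatives = [s for s, flag in zip(scores, is_positive) if not flag]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(true_labels, probabilities, n_classes):
    """Return (per-class AUCs, macro average, support-weighted average)."""
    per_class, supports = [], []
    for k in range(n_classes):
        scores = [row[k] for row in probabilities]
        is_positive = [label == k for label in true_labels]
        per_class.append(binary_auc(scores, is_positive))
        supports.append(sum(is_positive))
    macro = sum(per_class) / n_classes
    weighted = sum(a * s for a, s in zip(per_class, supports)) / sum(supports)
    return per_class, macro, weighted


# Hypothetical predicted probabilities for six images (two per class).
true_labels = [0, 0, 1, 1, 2, 2]
probabilities = [
    [0.80, 0.10, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
    [0.50, 0.25, 0.25],  # a cut mark that the model scores as a tooth mark
    [0.10, 0.20, 0.70],
    [0.20, 0.20, 0.60],
]

per_class, macro, weighted = one_vs_rest_auc(true_labels, probabilities, 3)
print(per_class, macro, weighted)
```

In this toy example only the cut mark class has an imperfectly ranked pair, so its AUC drops below 1 while the other two stay at 1. With equal class sizes the macro and weighted averages coincide; they diverge when the classes are unbalanced, as in Dataset 2.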

Our VGG16 model for Dataset 2 achieved a multiclass ROC-AUC score of 0.982 for the baseline model and 0.988 for the intensity-augmented model, indicating exceptional discriminative performance. In one-vs-rest terms, this means that for roughly 98–99% of the pairs formed by one image that belongs to a given class and one that does not, the model assigns the higher predicted probability for that class to the image that actually belongs to it. Such a high AUC reflects the model’s strong capability to distinguish among the classes, with minimal overlap in predicted probability distributions, and confirms the robustness of its probabilistic outputs across all categories (Fig. 6).

Figure 6.

Multiclass ROC curves computed using a one-vs-rest strategy for each class in the intensity-augmented Dataset 2. The AUC reflects the model’s ability to discriminate between classes based on predicted probabilities. The average ROC-AUC score of 0.99 indicates excellent overall classification performance and high confidence in class separation across all categories. Key to classes: 0, tooth marks; 1, cut marks; 2, trampling marks.

The model generated to test Hypothesis 2 shows that the four-fold cross-validation resulted in an average accuracy of 93.13%, which is very similar to that of the baseline model (94.27%), suggesting a lack of overfitting. The overall classification report shows well-balanced, high classification metrics despite the unbalanced sample distribution (Table 4). In this case, the F1-scores for the two largest subsamples (tooth and cut marks) are 0.86 and 0.98, respectively, which is similar to the baseline model (0.87 and 0.99, respectively; Table 4). The F1-score of the small trampling subsample goes from 0.69 in the baseline model to 0.68 in the cross-validated one (Tables 4 and 5). As was the case for Dataset 1, Dataset 2 seems to be unaffected by overfitting or train-validation split bias.
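For readers unfamiliar with k-fold cross-validation on unbalanced classes, the sketch below shows one common way to build four stratified folds so that every image is used for validation exactly once and each fold keeps roughly the same class proportions. It is an assumption-laden illustration: the splitting procedure actually used in the study is not described in this section, the function names are hypothetical, and the class sizes are simply the supports listed in the cross-validation block of Table 4.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Distribute sample indices over k folds, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_validation_splits(labels, k, seed=0):
    """Yield (training indices, validation indices) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Class sizes taken from the cross-validation supports in Table 4.
labels = ["tooth mark"] * 103 + ["cut mark"] * 489 + ["trampling"] * 63
splits = list(train_validation_splits(labels, k=4))

for fold_number, (training, validation) in enumerate(splits, start=1):
    trampling = sum(labels[i] == "trampling" for i in validation)
    print(f"fold {fold_number}: train={len(training)} "
          f"validation={len(validation)} trampling in validation={trampling}")
```

Each fold would then train a fresh model on the training indices and evaluate it on the validation indices; the reported cross-validated accuracy is the mean over the four folds. With only 63 trampling marks, each validation fold contains just 15 or 16 of them, which illustrates why the metrics for that class are the least stable.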

As a further check, the high-shot version of the FSSL model (10 shot- 15 task), built on train-validation-test splits, generated an overall accuracy on the testing set (which comprised 99 images from three classes) of 92.93% (loss = 0.217) (Fig. 3). The FSSL analysis replicated the cross-validated analysis by showing nearly the same F1-scores for tooth and cut marks (0.88 and 0.98), and a slightly higher value (0.72) for trampling marks. The near identity of the results from the cross-validated (train-validation) and FSSL (train-validation-test) analyses indicates that there is no detectable overfit of the train-validation method and that the model has learnt efficiently, and very similarly, to identify the two best represented subsamples (Table 4). When applying a low-shot version (5 shot- 30 task), these results improve. The overall accuracy is 94.95% (loss = 0.283), with equal or higher F1-scores for the three types of BSM. Here, trampling shows an F1-score of 0.76. Again, trampling is misclassified more often than tooth and cut marks, whose samples are substantially larger. Given that we are dealing with DL methods, it is no surprise that the insufficiently represented subsample (trampling) presents these issues. Despite that, as pointed out in the original study [2], the precision/recall metrics indicated that the potential problem was the misidentification of trampling marks as other BSM, but not the other way around.

The model generated by MAML provided a validation accuracy of 96.94% (loss = 0.057) and an overall accuracy on the testing set of 95.96% (loss = 0.240) (Table 4, Fig. 4). It provided the highest performance metrics of all the tests, with an F1-score of 0.91 for tooth marks, 1.00 for cut marks and 0.83 for trampling marks. This reinforces the FSSL results by showing the excellent generalization performance of MAML, as was also the case with Dataset 1. This also supports the results obtained using the other DL methods.
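The "5 shot-5 query-20 task" label used for the MAML model can be read as 20 training episodes, each drawing 5 support and 5 query images per class. The sketch below illustrates that common N-way K-shot convention only; how the tasks were actually built in this study is not specified in this section, and the image pool, its size and the function name are hypothetical.

```python
import random


def sample_task(indices_by_class, n_shot, n_query, rng):
    """Draw one episode: disjoint support and query sets for every class."""
    support, query = [], []
    for label, indices in indices_by_class.items():
        if len(indices) < n_shot + n_query:
            raise ValueError(f"class {label!r} has too few images for one task")
        chosen = rng.sample(indices, n_shot + n_query)
        support += [(index, label) for index in chosen[:n_shot]]
        query += [(index, label) for index in chosen[n_shot:]]
    return support, query


# Hypothetical training pool: 40 image indices per class.
pool = {
    "tooth mark": list(range(0, 40)),
    "cut mark": list(range(40, 80)),
    "trampling": list(range(80, 120)),
}

rng = random.Random(0)
tasks = [sample_task(pool, n_shot=5, n_query=5, rng=rng) for _ in range(20)]

support, query = tasks[0]
print(len(tasks), len(support), len(query))
```

In a MAML-style loop, the model would be adapted on each task's support set (the inner step) and the loss on the matching query set would drive the meta-update (the outer step). Because every class contributes the same number of images to each task, small classes such as trampling are not swamped by the larger ones during training.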

Lastly, applying the baseline model to the new testing set of 119 new BSM yielded results similar to those of all the models described above. A total of 87% of the testing marks were correctly classified, with an F1-score of 0.86 for tooth marks, 0.93 for cut marks, and 0.62 for trampling marks (Table 4). All tooth marks (n = 33) and most cut marks (60 out of 65) were correctly classified, whereas only 10 of the 21 trampling marks were (Table 3). Upon closer inspection, several of the misclassified trampling marks exhibit shallow linear striations and localized polish that visually resemble some carnivore-generated marks, particularly those produced by felids. This feature overlap likely contributed to the model’s confusion, especially in cases where striation orientation and edge definition were ambiguous. Furthermore, in human taphonomic studies, such subtle trampling modifications are also frequently debated due to their morphological convergence with tooth scores. These findings underscore a limitation of the current model when distinguishing between structurally similar surface modifications and highlight the need for expanding training data with more morphologically diverse trampling examples. In future work, incorporating hierarchical classification or multi-modal data may improve discrimination between these closely related taphonomic features.

This new testing set indicates that there may have been a slight model overfit on the validation set, resulting in poor classification of the deficient trampling mark sample, but with very good classification of tooth (recall = 1.00) and cut (recall = 0.92) marks. The classification of the trampling marks in the new testing set (Table 3) is very similar to that provided by the validation set in the cross-validated model (Table 5). This particular dataset underscores the superiority of the MTL (MAML) and the FSSL methods over the DL model in accurately classifying the three types of BSM on their respective testing sets.

The BRISQUE analysis (Table 7) provides a quantitative comparison of image quality between the original and new testing sets for Datasets 1 and 2. In both datasets, the new testing sets exhibit lower mean BRISQUE scores compared to their original counterparts—26.16 versus 30.76 for Dataset 1 and 25.94 versus 30.92 for Dataset 2. Since lower BRISQUE values indicate better perceptual image quality, these results confirm that the new images (e.g. captured with the Leica Emspira 3 system) are of visually higher quality than the originals. In addition, the standard deviation is slightly higher in the new sets (7.02 and 7.3, compared to 5.3 and 5.8), suggesting greater variability in quality within those newer image groups. This could reflect a broader range of lighting, focus, or surface texture detail captured during acquisition, which is common with more sensitive or higher-resolution imaging systems. Overall, these results support the claim that image quality improved in the new testing datasets. Importantly, the model’s consistent performance across these differing image quality levels (as shown in Tables 1 and 4) suggests that the models are robust to variations in input quality, and not biased by image clarity or sharpness alone. This strengthens confidence in the generalizability of the approach.

Table 7.

BRISQUE mean and standard deviation values for image quality in the original and the new testing sets.

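| Dataset | Testing set | Mean BRISQUE | Std BRISQUE |
|---|---|---|---|
| Dataset 1 | Original | 30.76 | 5.3 |
| Dataset 1 | New Testing* | 26.16 | 7.02 |
| Dataset 2 | Original | 30.92 | 5.8 |
| Dataset 2 | New Testing* | 25.94 | 7.3 |

* Obtained with the Leica Emspira 3.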

Dataset 3

This is the most extensive and balanced of the three datasets, although it is still unbalanced for cut and trampling marks. One logical prediction would be that, for these reasons, it should produce the best performing models. As a matter of fact, the baseline model (ResNet 50) yielded an accuracy of 98.44% (loss = 0.044) (Table 8). This is a slight improvement over the original model presented by Pizarro-Monzo et al. [17] (accuracy = 97.5%; loss = 0.0641). Only one cut mark, two tooth marks and one trampling mark were misclassified (Table 5). The classification metrics yielded extremely high values, with F1-scores of 0.98 and higher for the three types of BSM.

Table 8.

Classification report of the performance metrics from Dataset 3.

Baseline model
Precision Recall F1-score Support
tooth mark 0.98 0.98 0.98 45
cut mark 0.99 0.99 0.99 189
trampling 0.98 0.98 0.98 46
accuracy 0.99 280
macro avg 0.98 0.98 0.98 280
weighted avg 0.99 0.99 0.99 280
With grayscale intensity augmentation
Precision Recall F1-score Support
tooth mark 0.98 0.98 0.98 45
cut mark 0.99 0.98 0.98 189
trampling 0.94 0.98 0.96 46
accuracy 0.98 280
macro avg 0.97 0.98 0.97 280
weighted avg 0.98 0.98 0.98 280
Cross-validation
Precision Recall F1-score Support
tooth mark 0.97 0.99 0.98 629
cut mark 0.97 0.99 0.98 150
trampling 0.97 0.91 0.94 154
accuracy 0.97 933
macro avg 0.97 0.96 0.97 933
weighted avg 0.97 0.97 0.97 933
FSSL 10 shot- 20 task
Precision Recall F1-score Support
tooth mark 0.99 0.99 0.99 97
cut mark 0.95 1 0.98 20
trampling 1 0.96 0.98 23
accuracy 0.98 140
macro avg 0.98 0.98 0.98 140
weighted avg 0.99 0.99 0.99 140
FSSL 5 shot- 30 task
Precision Recall F1-score Support
tooth mark 0.99 1 0.99 97
cut mark 1 0.95 0.97 20
trampling 1 1 1 23
accuracy 0.99 140
macro avg 1 0.98 0.99 140
weighted avg 0.99 0.99 0.99 140
MAML 5 shot-5 query-25 task
Precision Recall F1-score Support
tooth mark 1 0.95 0.97 20
cut mark 0.99 0.99 0.99 97
trampling 0.96 1 0.98 23
accuracy 0.99 140
macro avg 0.98 0.98 0.98 140
weighted avg 0.99 0.99 0.99 140

The same model applied to the grayscale intensity-augmented dataset produced a good result, with only 6 BSM (out of 280) misclassified (Table 5). The new model yielded an accuracy of 97.66% (loss = 0.067) and all the classification metrics are ≥0.94 (Table 8). This is a strong demonstration that the image quality of the dataset did not bias the original baseline model, and that multiple sources of data acquisition do not handicap model performance (Fig. 1, lower). Hypothesis 1 is rejected again.

To further ensure that the training-validation method had not produced a model that overfit the validation data, a cross-validation analysis was run. It yielded an accuracy of 97.32%, with F1-scores for tooth marks (0.98), cut marks (0.98), and trampling marks (0.94) that again show high classification values for all classes (Table 8). Trampling, the most problematic class in Dataset 2, had high precision (0.97) and high recall (0.91). This result does not support Hypothesis 2.

To ensure that the high performance values of the baseline and cross-validated models were not biased by the training method, an FSSL analysis of the same dataset was carried out. It showed an accuracy of 98.57% (loss = 0.037) with a 10 shot- 20 task approach, and of 99.29% (loss = 0.049) with a 5 shot- 40 task model (Table 8, Fig. 3). All F1-scores are ≥0.98 for the former and ≥0.97 for the latter. The replication of all the classification metrics by both FSSL models and the baseline, grayscale intensity-augmented and cross-validated DL models (with an even better classification using testing sets in the FSSL models) clearly shows that the baseline methods for the DL models were methodologically unbiased. The FSSL results also underscore the better performance of FSSL over traditional DL in the classification of BSM, as was the case for Dataset 2. This is also apparent when the FSSL classification of the testing set is compared with the classification of the testing set by the DL model in Courtenay et al.’s modeling, which showed slightly lower accuracy rates for cut and tooth marks, but substantially more equivocal scores for trampling marks.

As a final confirmation of the clear discrimination between tooth, cut and trampling marks reported in the previous models, a MAML analysis yielded an equally accurate classification of the three types of marks. The validation model resulted in an accuracy of 97.86% (loss = 0.117), and the application to the testing set resulted in 98.57% (loss = 0.052) of marks being correctly classified. The training process was very smooth and stable (Fig. 4). Overall, although in this database MAML and FSSL yielded similar results, MAML is characterized by slightly higher accuracy on the testing sets and much more stable training processes.

Discussion

DL is as powerful as the combination of dataset quality and know-how

Courtenay et al. [1] have built a straw man by selecting some of our most unbalanced and smallest datasets (which were initially published as pilot studies to show the potential of the method) to argue that DL is currently inefficient for BSM classification. Here, we have shown that such an assertion is inaccurate. All three datasets in our modeling display high values for classification metrics (precision, recall, F1) for all the classes involved (Tables 1, 4, 6, and 8), except for the small trampling mark subsample from Dataset 2. This exception is related to two factors: the small sample size of the trampling mark set, and the extremely similar properties of trampling and cut marks. Taphonomists can relate to this, since differentiating between stone tool-imparted cut marks and trampling can at times be an arduous task. When sample size is substantially larger, the DL models can better pick up on the microscopic differences, and the separation of the two types is much better, as demonstrated with Dataset 3, with the few misclassified trampling marks interpreted as cut marks (Tables 5 and 8). We have also noticed this effect in other studies, even with larger datasets, where DL is more efficient than any other method at discriminating carnivore agency, except for the smallest subsamples (i.e. crocodiles compared to other carnivorans represented by much larger datasets) [3]. We need to stress, though, that the samples of trampling marks used in the analyzed datasets are still too small to be reliable at a more general scale than that represented by the trained models here.

We checked the code that Courtenay et al. [1] used (accessed on December 12, 2024; the code had last been updated on August 26, 2024). There, we realized that another important drawback in their models’ performance was related to the inadequacy of the image preprocessing pipeline. There is an unexplained mismatch between how images were transformed for training and for validation-testing. For example, for Dataset 2, Courtenay et al. [1] preprocessed the images using a simple normalization (i.e. 1/255) for the validation/testing sets, and a VGG16 (model-specific) preprocessing function overlaid on simple normalization for the training set. This conflicts with the VGG16 weights used, since the original model did not include scaling, and it means that the pixel values of the images in the training set were transformed differently from those in the validation-testing sets.

Images usually are represented in 8-bit integer format with pixels spanning values from 0 to 255. For better performance, a usual procedure is normalizing the images so that pixel values range between 0 and 1, or −1 and 1 (depending on the models), for faster model convergence. When using TL, models like those used here (e.g. ResNet 50 or VGG16) have pre-processed their image bank in a more complex way than simple normalization. For VGG16, for example, the procedure consists of transforming the pixel values to a normalized range, by the application of a function that subtracts the mean RGB values of the original training dataset (ImageNet) from each pixel in the input image. For each channel, these values are: R (123.68), G (116.779), and B (103.939). For Resnet 50, the same mean subtraction is carried out to normalize the input image. This process centers pixel values around zero, which contributes to the stable performance of the gradient descent during training.
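The difference between the two conventions described above can be made concrete with a small numerical sketch. This is an illustration only, not the preprocessing code of either study: the function names are hypothetical, only the per-channel means quoted in the text are used, and any additional steps that library implementations may apply (for example channel reordering) are deliberately omitted.

```python
# Per-channel ImageNet means quoted in the text (R, G, B).
IMAGENET_MEAN_RGB = (123.68, 116.779, 103.939)


def rescale(pixel):
    """Simple normalization: map 8-bit values to the range [0, 1]."""
    return tuple(value / 255.0 for value in pixel)


def center_on_imagenet_mean(pixel):
    """Model-specific style: subtract the channel means, with no rescaling."""
    return tuple(value - mean for value, mean in zip(pixel, IMAGENET_MEAN_RGB))


def value_range(transform):
    """Smallest and largest output value over all possible 8-bit pixels."""
    darkest = transform((0, 0, 0))
    brightest = transform((255, 255, 255))
    return min(darkest), max(brightest)


mid_grey = (128, 128, 128)
print("rescaled :", rescale(mid_grey), "range", value_range(rescale))
print("centered :", center_on_imagenet_mean(mid_grey),
      "range", value_range(center_on_imagenet_mean))
```

The same mid-grey pixel ends up near 0.5 under rescaling but between roughly 4 and 24 under mean subtraction, and the overall ranges differ by two orders of magnitude. A network whose weights were fitted to one of these input distributions therefore receives very different numbers if another split is fed through the other transformation, which is why a single preprocessing function should be applied to training, validation and testing images alike.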

The use of inconsistent preprocessing methods for training, validation, and testing datasets, as done by Courtenay et al. [1], introduces a fundamental mismatch in data distributions across these sets. Specifically, while training incorporates data augmentation and the appropriate preprocessing function of the TL model used, validation and testing images in Courtenay et al.’s VGG16 implementation are merely rescaled (1/255) without applying the same preprocessing pipeline. This inconsistency creates a dual problem: inadequate evaluation of model performance and compromised generalization capability. The inadequate evaluation stems from the distributional mismatch between the training and validation datasets, whose pixel values differ because of the disparate scaling applied in the two preprocessing pipelines. This discrepancy leads to an improper assessment of the model’s performance, as the features learned during training may not align with the characteristics of the validation data. Consequently, validation accuracy and loss metrics may fail to reflect the true performance of the model. Moreover, the inconsistency in preprocessing undermines the model’s generalization capability when applied to datasets processed differently, as the trained weights may prioritize features that are not represented consistently across data splits. Although backpropagation operates exclusively on the training data, when that data is heavily augmented or preprocessed differently from the validation and test sets, the model may learn features that do not transfer well, resulting in a mismatch between learned representations and the evaluation data. This misalignment impairs the model’s ability to generalize effectively, as evidenced by Courtenay et al.’s [1] own acknowledgment that several of their models failed to learn.
Pre-trained models, in particular, rely on uniform preprocessing across all data subsets to ensure compatibility with their internal weight interpretations. The failure to maintain preprocessing consistency, as seen in Courtenay et al.’s models, undermines the model’s generalization performance and introduces significant limitations in its practical application. To test the effect of mismatched procedures in the preprocessing of the training and validation/testing subsets, we used different image preprocessing methods for the two subsets of Dataset 1, which resulted in all crocodile BSM being misclassified, as shown in Courtenay et al.’s results.

Similarly, for Dataset 3, Courtenay et al.’s TL modeling does not include any model-specific preprocessing function in any of the training-validation-testing sets. This compromises the performance of the model, which was trained using the weights of a model preprocessed with the specific Densenet preprocessing function (which applies a pixel scaling aimed at providing pixel values within the range of [−1, 1], not [0, 1] as performed by Courtenay et al. [1]).

In summary, the models created by Courtenay et al. are methodologically different from those of the original studies. The differences lie in method choice, hyperparameter selection, inconsistencies in image processing for the training-validation-testing sets, and preprocessing functions that differ from those required for TL (Supplementary Information). It is the combination of these features that introduces inconsistencies into the model pipelines and fatally flaws their results. What these authors succeeded in showing is that their decisions produced inefficient DL models. For these reasons, their conclusions can only be applied to their own models, and cannot be extended to the original published models nor to those replicated here. This is underlined by the fact that in none of our replicated baseline, grayscale intensity-augmented, cross-validated, FSSL and MAML models did we get accuracy values as low and loss values as high as those reported by Courtenay et al. [1], even when using separate testing sets as in our FSSL-MAML models.

An additional point supporting these interpretations stems from the use of new BSM as external testing sets, which show classification metrics similar to those derived from the baseline training-validation models, the grayscale intensity-augmented models, the cross-validated models and the FSSL-MAML models. This multiple-source convergence indicates that the DL models function, with trampling marks as the only limiting factor (due to class sample size limitations). Regardless of their differing reliability, the results provided by meta-learning, through the FSSL-MAML models deployed here, show that in the three datasets the high degree of accuracy in the classification metrics of all BSM from the testing sets supports this approach as the most solid for BSM differentiation to date (see next section). This does not mean that the models are good for extensive generalization beyond the characteristics of the BSM modelled. They still serve as pilot studies with a wide margin for improvement, which must be exploited with much bigger image libraries and more agencies to produce more solid reference models [32].

We agree that sample size and quality play a major role in sample analysis, but we argue that knowing the limitations of data and the ways to overcome them to maximize information is equally important or more.

Artificial intelligence turns out to be smarter than we thought

The results that we have presented here show that, regardless of sample size, cut marks and tooth marks are successfully separated by DL models (as documented in the three datasets). Trampling is more problematic in the very small sample of Dataset 2, but this problem is substantially attenuated in the larger sample of Dataset 3. Dataset 3 is not only the biggest of the three datasets used, but also the one with the best quality images, since those were obtained with much better microscopes (Olympus LEXT OLS3000 confocal microscope, Leica Emspira 3 digital microscope and a KH-8700 3D digital microscope with high intensity LED optics). The fact that the modeling described here yields good discrimination is really encouraging and has been confirmed by Cifuentes-Alcobendas [32] with an even bigger dataset. The problem with trampling in Dataset 2 was already identified in the original publication: “…trampling marks showing a higher degree of misclassification (recall = 30). When misclassified, a greater number of trampling marks are classified as tooth marks instead of cut marks…” [2:4].3 That model did not learn to identify trampling marks because it was trained on just 63 marks. Despite that, some of the models presented here show that almost 70% of that meagre sample could be correctly classified (cross-validated DL model), and even more, with 83% of correct classification on a separate testing set, when using a MAML model (Table 4). In this particular case, a MAML model is more efficient than a traditional DL model, given that the latter was unable to classify the trampling marks of the new testing set as accurately.

It is for this reason that in the past four years an intensive effort has been undertaken in our institution to expand the experimental sample for cut and trampling marks, reaching a total of 3,272 images of cut marks and 2,412 images of trampling marks. This is by far the largest experimental sample of these BSM available, and it has taken 900 h to assemble. Creating an optimized workflow to maximize the performance of CNNs working with BSM image data has required 3,047 h of computation and 2,356 generated models, resulting from multiple combinations of hyper-parameter tuning and dataset variations. This is also the largest single experiment of parameter combination comparison made for any analysis of DL. It is also the largest dataset for BSM analysis. Given the large sample size, traditional protocols were implemented by creating separate training, validation and testing sets. The latter yielded results of discrimination similar to those shown here for Dataset 3. This is in the process of publication, since the whole process involved an 800-page doctoral dissertation by one of us (GCA) that has just been presented [32]. The present work indicates that DL is substantially up to the task of BSM identification when BSM preservation is good.

In the “Not so intelligent artificial intelligence” section, Courtenay et al. engage in a philosophical discussion about the limits of AI. We may agree with some of it, but most of their arguments there are logically flawed. We will stay away from the excessive speculation in that section, where very little is empirically supported. All of us working in this field are aware that AI is not human intelligence. This is why we consider it inadequate to judge the “intelligence” of AI from a human perspective, such as claiming that because “these algorithms are still inherently flawed, lacking fundamental natural cognitive attributes such as common sense” [1:40], they are imperfect. Let us emphasize that AI is not intelligent in the human sense. AI can be intelligent, much more so, in a different way. AI models outsmart us in many tasks, especially extremely technical ones, where they perform better than humans. AI produces task-specific models that have substantially higher resolution capabilities for pattern-finding tasks compared to humans. For example, AI models exceed human radiologists in detecting diseases like breast cancer and lung pathologies in medical scans, offering greater sensitivity and reducing false positives and negatives compared to experts [34, 35]. They can exceed human ophthalmologists’ diagnostic capabilities for detecting diabetic retinopathy [36]. They also exceed medical experts in the detection of skin melanomas [37], brain tumors [38], fractures [39], pneumonia [40], cardiac function evaluation [41], COVID [42], cervical cancer [43], degree of osteoporosis [44] by accurately predicting bone density values from CT scan images [45], and many other diseases (the literature is too abundant to be summarized here). All this is achieved despite the fact that the quality of x-rays and CT images is heterogeneous and in some datasets even low [46–48].

All these AI approaches use imaging and their datasets are derived in many cases from commonly-shared image data banks (because this type of images is so limited) obtained with a myriad of devices, each of them with different digital properties (CT-scan, magnetic resonance, microscopes, and medical cameras of a diverse array of manufacturers and with different optical properties). When these images produce models that outperform clinicians, that means that the models work because they thrive on this diverse amount of information. It is for this reason that we approach Courtenay et al.’s [1] speculation about the effect of digital diversity as distorting and biasing with skepticism. These authors raise the issues of different microscopes having different digital properties that might impact the information of each image and bias DL models. This may be true, but until sufficient experimentation is done, it is sheer speculation. Although some studies suggest a drop in resolution when using different scanners [49] this is nuanced by the independently very low F1 scores all the devices used for the problem of tumor detection, and not by a comparative analysis of model performance using one versus multiple devices. In other studies, the use of single scanners only improved prediction minimally (AUC > 0.05) [50]. Other studies show how generalization methods can be used to avoid this minor dropping in performance [51]. Here, this is not a problem, since the most diverse dataset in terms of microscope has yielded the highest accuracy in classification. In some of our studies, we opted for combining images from different microscopes to allow models precisely to learn all those nuances [3], just like in AI applications in medical fields. 
This probably also explains in part why the models from Dataset 3 produced the most balanced and highest accuracy scores, since their training set includes images obtained with three very different microscopes with diverse optical properties.

We would emphasize that when using hundreds of BSM images for validation/testing for a task of BSM identification, both DL and FSSL-MAML models seem to get a much higher percentage right compared to human experts (on much smaller datasets, because human error is increasingly correlated to sample size) who have been doing this for decades [52] (see further Discussion in Supplementary Information). In our institution, we have repeatedly done these types of tests. This is not random. It means that the models learn; usually better than us, with all our cognitive attributes, foresight and all our common sense.

AI validated by traditional taphonomy

Another way in which the efficiency of these DL models is put to the test is by comparing agency identification by DL algorithms with that derived from multivariate taphonomic analyses of the same archeofaunal assemblages. For example, a thorough taphonomic analysis of the 1.8 Ma site of DS (Olduvai Gorge, Tanzania) indicated that agency in site formation and modification was mostly hominin (primary) and hyenid (secondary) [53]. Subsequent DL models applied to the carnivore modifications of the site yielded an overwhelming signal of hyena modifications in the carnivore-modified bones [54]. Likewise, a thorough taphonomic analysis of the FLK North assemblages (Olduvai Gorge, Tanzania) indicated a different palimpsestic process in which medium-sized felids had primary access to carcasses and hyenids secondary access, with a marginal to non-existent role of hominins in carcass modification [55]. The application of DL models trained on different types of carnivores supported this interpretation, identifying the role of leopards and hyenas [56] in the same order and reducing the hominin signal to almost non-existent [32]. The taphonomic analysis of the faunal assemblage from the Upper Pleistocene cave of Tritons (Lleida, Spain) suggested that the accumulation had been made by leopards [57]. The application of DL models using a modern carnivoran experimental sample led to the identification of the same agency as the primary bone modifier [58]. Likewise, the taphonomic analysis of the faunal assemblage from the Toll cave (Barcelona, Spain) indicated that the site was a cave bear hibernation lair with significant activity by carnivores, including scavenging by bears and potentially other carnivores [59]. The application of DL models based on multiple carnivore taxa (including bears) to the BSM sample of the site showed an overwhelming signal of bear modification (with probability >75%) [17].
The identification of some cut marks on a hyena phalanx from the Upper Pleistocene Navalmaillo shelter (Madrid, Spain) using traditional multivariate microscopic criteria to separate them from other abrasive agents [11] was confirmed by DL models using an experimental dataset including cut marks, trampling marks and tooth marks [60].

If the trained DL models had been as spurious and biased as suggested by Courtenay et al. [1], one would expect greater variability and more contrasting results between the traditional taphonomic analyses of these assemblages and the results provided by the DL studies. The fact that they replicate the same interpretation adds to their heuristic value and shows that the models (despite their incompleteness because of sample size per class and unbalanced nature) did actually learn through their training to identify an important part of the experimental marks, and to transfer their knowledge to efficiently identify unknown but well-preserved prehistoric marks.

Equifinality and agency

Courtenay et al. have misused the term “equifinality” throughout their paper. Equifinality has two common definitions in taphonomy. The first is general, and it refers to two causal processes resulting in identical outcomes that are not readily attributable to agent type [61]. This is a taphonomic adoption of the original term, which implied “same final state from different initial states” [62]. The second definition is applied in contexts of two causal processes resulting in similar outcomes that are not readily attributable to agent type. The distinction between the two concepts is important, because it emphasizes that whereas the first is permanent (two identical things cannot be differentiated), the second is temporary and contingent on method [61, 63]. It is related to statistical and methodological distinction. Rogers [61] argues that the use of the term “equifinality” for the second concept is not adequate, because it is not permanent. For example, the issue of agency in faunal accumulations at early archaeological sites was initially interpreted as equifinal, because both hominins and other carnivores could generate similar skeletal profiles. Later, when fine-grained analysis of skeletal parts was introduced, and quantification of the axial elements could be reliably determined, most of that apparent equifinality vanished [64]. In addition, when alternative methods were added (e.g. BSM frequency and anatomical distribution), hominin primary agency in faunal accumulation could be more clearly determined (see summary in [55, 65, 66]). We argue that there is room in taphonomy for both definitions, provided no distinction can be grossly made regardless of whether this is a permanent or a temporary state (i.e. open versus closed systems) [63].
Following the first definition, the distinction of BSM is therefore not equifinal, because no two agents produce “identical” marks, although they can produce similar modifications that only the right method can objectively identify. Instead of equifinality, in those cases there is uncertainty. According to the second definition, BSM are not equifinal because different methods produce different levels of resolution in their differentiation. Some taphonomists using cross-section shape and presence/absence of certain additional features (e.g. microstriations) may not be able to classify marks as well as others using more complex (e.g. multivariate) methods involving a much larger array of variables. This would lead to the paradoxical position of BSM being equifinal for some researchers but not for others. The second definition of equifinality does not have to be permanent, but it needs to be general.

The databases used here targeted the differentiation of stone-tool imparted cut marks, carnivore-created tooth marks and sedimentary abrasion in the form of trampling marks. Human taphonomists have long been able to separate cut/trampling marks from tooth marks. A 30-year-old study on cut, percussion and tooth marks showed that experimental marks of known actor-effector could be correctly diagnosed through blind testing with accuracy as high as 99% by experts, and novice students could reach 86% accuracy after three hours of training [67]. Even when using traditional taphonomic microscopic analyses, experimental trampling marks could be identified separately from cut marks 96% of the time. Therefore, the cut-tooth-trampling BSM differentiation is not subject to equifinality (the three agents produce morphologically different BSM); rather, it is the subjectivity of the researcher in capturing those microscopic nuances that makes interpretation subjective, with some taphonomists scoring higher than others at detecting true positives [68]. A more objective method was needed, and this is why AI was brought into taphonomy: to securely identify traces and their agents. The only comparative published study shows that AI does that better than humans, but both are efficient at differentiating BSM, which shows that equifinality (in the sense of identical effects) is not an issue [52] (Supplementary Information).

Only in the case of microscopically close BSM (like tooth marks created by different carnivores, or a few tooth marks created by crocodiles mimicking cut marks) are experts less secure at identifying agency [3, 4, 7, 33, 69]. This insecurity expands when dealing with prehistoric BSM. This is why the urgency of developing more powerful objective tools to deal with agency (not so much with equifinality) is justified.

Conclusion

Courtenay et al. [1] used three previously published small datasets, but failed to replicate the methodologies presented in the original studies, because they adopted methodological decisions that differed from those (see additional discussion in Supplementary Information). These decisions decreased the availability of data to train models and, in the case of their replication of the stacked models in their supplementary files, also decreased the availability of data to test models. They assumed that such models are representative of the field, and judged all taphonomic models derived from AI methods by those three studies. In their attempt to explain the divergences of their purported replication, they invoked the poor quality of the image datasets and the overfitting of models on the validation sets due to the lack of a separate testing set. In their modeling, the unbalanced nature of the datasets, combined with training on smaller samples (leading to biased or underfit models), caused the BSM class with the smallest sample size to destabilize their trained models, which were unable to learn agency from statistically unrepresentative training and validation sets. Mismatched parametrization and contradictory pre-processing of training/validation-testing sets may have contributed to it. This is why they were unable to classify a single crocodile tooth mark for Dataset 1, misclassified most trampling marks for Dataset 2, and classified correctly only half of the trampling marks for Dataset 3 (Tables 2 and 5). Overall, their classification metrics (precision, recall and F1) remain low in the first two datasets and are <10% lower than those of the models reported here for Dataset 3. In this process, Courtenay et al. have failed to: (i) demonstrate the degree to which the quality of the image datasets has impacted the results of the baseline models; (ii) demonstrate that the lack of a separate testing set has overfit the baseline models; and (iii) provide any convincing argument that the replicates in their supplementary information actually represent the baseline models.
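The effect of a very small class on per-class metrics can be illustrated with a minimal sketch. The labels and counts below are invented for illustration only and do not come from any of the three datasets; the function name `per_class_metrics` is likewise hypothetical. The sketch shows how a model that never predicts the minority class can still reach high overall accuracy while its precision, recall and F1 for that class are zero.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} from paired label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    metrics = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Invented, unbalanced test set: 90 cut marks and 10 crocodile tooth marks.
y_true = ["cut"] * 90 + ["croc"] * 10
# A degenerate model that assigns every image to the majority class.
y_pred = ["cut"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting the three metrics per class, rather than accuracy alone, is what makes this kind of failure on the smallest class visible.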

In contrast, we tested the hypothesis that image quality had biased the baseline DL models of the targeted datasets, and rejected it by showing that grayscale intensity-augmented image sets yielded an even higher and more balanced accuracy than the original datasets. The hypothesis that the training-validation method (with the exclusion of a separate testing set) had overfit the models was tested through the most adequate method for small datasets in DL modeling: cross-validation. The stratified cross-validated models yielded equal or slightly higher values than the grayscale intensity-augmented models and even the baseline models (Table 6), showing that the different validation sets used for the training of the k-folds yielded similar accuracy for all classes. New testing sets for Datasets 1 & 2 supported these interpretations. As an additional test of the real learning and generalization potential of all these models, FSSL and MAML analyses of the three datasets showed results on independent testing sets similar to or only slightly higher than those of the baseline, augmented and cross-validated models. This rejects Hypothesis 2, and confirms that the baseline models are unbiased by the factors hypothesized by Courtenay et al. [1]. It also shows that DL and FSSL-MAML are capable of producing effective BSM models. Even if we accepted Courtenay et al.’s criticism of previous DL modeling, the results presented here using MAML models clearly show that DL is efficient at classifying BSM. Here, we have shown (as in the original studies) that: (i) cut marks and crocodile tooth marks can be reliably differentiated [4], and (ii) cut and tooth marks can be reliably differentiated, with trampling marks still requiring further work [2]. This latter interpretation is also supported by Courtenay et al.’s analysis of Datasets 2 and 3.
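As a minimal sketch of the idea behind stratified k-fold cross-validation (not the code used in this study, which is available in the repository cited under Data availability), the following pure-Python function deals the images of each class round-robin into k folds, so that every validation fold keeps roughly the same class proportions as the full set. The function name `stratified_k_fold`, the class counts and the seed are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each validation fold preserves class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class round-robin so each fold receives a near-equal share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, val_idx


# Invented, unbalanced label list standing in for a small BSM image set.
labels = ["cut"] * 50 + ["tooth"] * 30 + ["trampling"] * 20
splits = list(stratified_k_fold(labels, k=5))
fold_counts = [Counter(labels[i] for i in val_idx) for _, val_idx in splits]
for number, counts in enumerate(fold_counts, start=1):
    print(f"fold {number}: {dict(counts)}")
```

Because every image serves for validation exactly once, the spread of per-class accuracy across folds indicates whether a result depends on one particular validation set.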

These models are functional, but there are some limitations: their reliability is limited to the generalization that has been sampled in their current datasets, and their performance is still substantially improvable since their data libraries are small and of poor quality. They should serve as foundation for much more powerful models built upon hundreds of images and more balanced datasets, which are already in the process of being published [32]. Even with these much larger datasets, the generalization capabilities of models will be restricted to the realm of the analogical properties that framed their elaboration. Modelers should be aware that the quality of their experimental sets contains specific substantial, structural and contextual analogical properties that bind them to applications to new data that share these properties [70]. For example, models that have not incorporated biostratinomically and diagenetically modified BSM will not be able to generalize from prehistoric BSM that have undergone morphing through those processes. These samples should contain as much variance as possible to be versatile (instead of restricting it). Only then will they provide the confidence in the identification of BSM in prehistoric and paleontological assemblages that taphonomists are seeking. Criticism of the generalization of the currently available models (within their analogical frameworks), as we have shown here, is largely unwarranted.

While our current results demonstrate strong performance using DL models, future work could benefit from benchmarking against simpler approaches such as traditional ML classifiers (e.g. support vector machines or random forests using handcrafted features) or human expert evaluations. Such comparisons would provide additional context for interpreting model accuracy and robustness, particularly in terms of practical usability and interpretability. Assessing how well the model performs relative to trained taphonomists would also help establish its value as a decision-support tool rather than a replacement for expert analysis. Including these comparative benchmarks in future studies would offer a more comprehensive understanding of the model’s strengths, limitations, and real-world applicability.

Beyond its methodological contributions, this study has practical implications for real-world taphonomic analysis. The ability to automatically classify BSMs using DL can significantly assist field analysts and curators by providing rapid, consistent preliminary assessments of surface damage. In museum and heritage contexts, such tools could streamline the curation of large faunal collections by flagging specimens with diagnostic marks for further expert review. Additionally, this approach can be integrated into archaeological workflows as a decision-support system, complementing human expertise in both field and laboratory settings. By reducing subjectivity and accelerating initial classification, the tool has the potential to enhance reproducibility and efficiency in taphonomic research and heritage management.

Acknowledgments

We thank the BSM research team at IDEA (Institute of Evolution in Africa), who over years carried out the experiments and introduced the resulting BSM into digital format and the final microscopic analysis. In addition to two authors of this article (G.C.A. and M.V.R.), these team members are: Blanca Jiménez-García, Natalia Abellán, and Marcos Pizarro-Monzo. We are deeply indebted to Edgar Camarós, for letting us use his crocodile taphonomic collection. We are extremely thankful to three anonymous reviewers for their very supportive and constructive suggestions made on earlier drafts of this article.

Footnotes

1

“If little data is available, then your validation and test sets may contain too few samples to be statistically representative of the data at hand….K-fold and iterated k-fold validation are two ways to address this” [5].

2

As a matter of fact, after the completion of this study MDR remarked that he had never seen a crocodile tooth mark with the morphologies shown in Figures 5a and 5b. He thought those marks were cut marks imparted with metal knives during the butchery or preparation of the specimens for experimentation. Their location on the bones where they are documented would support that. If so, this would further reinforce the efficacy of the DL models in discriminating cut marks from crocodile tooth marks, and the accuracy values reported here for the new testing set would be an underestimation.

3

This was a misinterpretation in the original publication [2]. Most of the misclassified trampling marks (n = 12) were erroneously classified as cut marks and not tooth marks (n = 4).

Contributor Information

Manuel Domínguez-Rodrigo, Institute of Evolution in Africa (IDEA), Rice University and Archaeological and Paleontological Museum of the Community of Madrid, 28010, Spain; Department of History and Philosophy, Area of Prehistory, University of Alcalá de Henares, Alcalá de Henares, 28801, Spain; Department of Anthropology, Rice University, Houston, TX 77005-1827, United States.

Gabriel Cifuentes-Alcobendas, Institute of Evolution in Africa (IDEA), Rice University and Archaeological and Paleontological Museum of the Community of Madrid, 28010, Spain; Department of History and Philosophy, Area of Prehistory, University of Alcalá de Henares, Alcalá de Henares, 28801, Spain.

Marina Vegara-Riquelme, Institute of Evolution in Africa (IDEA), Rice University and Archaeological and Paleontological Museum of the Community of Madrid, 28010, Spain; Department of History and Philosophy, Area of Prehistory, University of Alcalá de Henares, Alcalá de Henares, 28801, Spain.

Enrique Baquedano, Institute of Evolution in Africa (IDEA), Rice University and Archaeological and Paleontological Museum of the Community of Madrid, 28010, Spain.

Author contributions

Manuel Domínguez-Rodrigo (Conceptualization [lead], Formal analysis [lead], Investigation [lead], Methodology [lead], Resources [lead], Supervision [lead], Validation [lead], Visualization [lead], Writing – original draft [lead]), Gabriel Cifuentes-Alcobendas (Methodology [equal], Supervision [equal], Validation [equal], Writing – review & editing [equal]), Marina Vegara-Riquelme (Data curation [equal], Writing – review & editing [equal]), and Enrique Baquedano (Writing – review & editing [equal])

Supplementary data

Supplementary data are available at Biology Methods and Protocols online.

Conflict of interest statement. None declared.

Funding

None declared.

Data availability

All image data and code files have been compiled and put together in a public repository specifically created for this paper: https://doi.org/10.7910/DVN/WUSGSW. The code version is from May 24 (2025).

References

  • 1. Courtenay LA, Vanderesse N, Doyon L. et al. Deep learning-based computer vision is not yet the answer to taphonomic equifinality in bone surface modifications. JCAA 2024;7:388–411. 10.5334/jcaa.145
  • 2. Domínguez-Rodrigo M, Cifuentes-Alcobendas G, Jiménez-García B. et al. Artificial intelligence provides greater accuracy in the classification of modern and ancient bone surface modifications. Sci Rep 2020;10:18862.
  • 3. Domínguez-Rodrigo M, Pizarro-Monzo M, Cifuentes-Alcobendas G. et al. Computer vision enables taxon-specific identification of African carnivore tooth marks on bone. Sci Rep 2024;14:6881.
  • 4. Abellán N, Baquedano E, Domínguez-Rodrigo M. High-accuracy in the classification of butchery cut marks and crocodile tooth marks using machine learning methods and computer vision algorithms. Geobios Mem Spec 2022;72–73:12–21.
  • 5. Chollet F. Deep Learning with Python. New York, NY: Manning Publications, 2022.
  • 6. Domínguez-Rodrigo M. Successful classification of experimental bone surface modifications (BSM) through machine learning algorithms: a solution to the controversial use of BSM in paleoanthropology? Archaeol Anthropol Sci 2019;11:2711–25.
  • 7. Domínguez-Rodrigo M, Baquedano E. Distinguishing butchery cut marks from crocodile bite marks through machine learning methods. Sci Rep 2018;8:5786.
  • 8. Moclán A, Domínguez-Rodrigo M. Are highly accurate models of agency in bone breaking the result of misuse of machine learning methods? J Archaeol Sci Rep 2023;51:104150.
  • 9. Yi Z, Zanolli C, Liao W. et al. A deep-learning-based workflow to assess taxonomic affinity of hominid teeth with a test on discriminating Pongo and Homo upper molars. Am J Phys Anthropol 2021;175:931–42.
  • 10. Domínguez-Rodrigo M, Vegara-Riquelme M, Palomeque-González J. et al. Testing the reliability of geometric morphometric and computer vision methods to identify carnivore agency using bi-dimensional information. Quaternary Sci Adv 2025;17:100268.
  • 11. Domínguez-Rodrigo M, de Juana S, Galán AB, Rodríguez M. A new protocol to differentiate trampling marks from butchery cut marks. J Archaeol Sci 2009;36:2643–54.
  • 12. Rather IH, Kumar S, Gandomi AH. Breaking the data barrier: a review of deep learning techniques for democratizing AI with small datasets. Artif Intell Rev 2024;57:226. 10.1007/s10462-024-10859-3
  • 13. Sabha SU, Assad A, Shafi S. et al. Imbalcbl: addressing deep learning challenges with small and imbalanced datasets. Int J Syst Assur Eng Manag 2024:1. 10.1007/s13198-024-02346-3
  • 14. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. New York, NY: Springer; 2017.
  • 15. Kuhn M, Johnson K. Applied Predictive Modeling. New York, NY: Springer.
  • 16. Brownlee J. 2017. Machine Learning Mastery with Python. Machine Learning Mastery. Adelaide: J.B. Publisher.
  • 17. Pizarro-Monzo M, Rosell J, Rufá A. et al. A deep learning-based taphonomical approach to distinguish the modifying agent in the Late Pleistocene site of toll cave (Barcelona, Spain). Hist Biol 2023;36:2114–23. 10.1080/08912963.2023.2242370
  • 18. Chatfield K, Simonyan K, Vedaldi A. et al. Return of the devil in the details: delving deep into convolutional nets. Proceedings of the British Machine Vision Conference 2014. Durham: British Machine Vision Association, 2014. 10.5244/c.28.6
  • 19. Pizarro-Monzo M, Organista E, Cobo-Sánchez L. et al. Determining the diagenetic paths of archaeofaunal assemblages and their palaeoecology through artificial intelligence: an application to Oldowan sites from Olduvai Gorge (Tanzania). J Quat Sci 2022;37:543–57.
  • 20. Bhasin H. Transfer Learning. Hands-on Deep Learning. Berkeley, CA: Apress, 2024, 207–23.
  • 21. Domínguez-Rodrigo M, Fernández-Jaúregui A, Cifuentes-Alcobendas G. et al. Use of Generative Adversarial Networks (GAN) for taphonomic image augmentation and model protocol for the deep learning analysis of bone surface modifications. Appl Sci 2021;11:5237. 10.3390/app11115237
  • 22. Domínguez-Rodrigo M, Fernández-Jaúregui A, Cifuentes-Alcobendas G. et al. Use of generative adversarial networks (GAN) for taphonomic image augmentation and model protocol for the deep learning analysis of bone surface modifications. Appl Sci (Basel) 2021;11:5237.
  • 23. Fei-Fei L, Fergus R, Perona P. One-shot learning of object categories. IEEE Trans Pattern Anal Mach Intell 2006;28:594–611.
  • 24. Li Z, Zhou F, Chen F. et al. Meta-SGD: Learning to Learn Quickly for Few-Shot Learning. arXiv [cs.LG]. 2017. http://arxiv.org/abs/1707.09835
  • 25. Ravichandiran S. Hands-On Meta Learning with Python: Meta Learning Using One-Shot Learning, MAML, Reptile, and Meta-SGD with TensorFlow. Birmingham: Packt Publishing Ltd., 2018.
  • 26. Jadon S, Garg A. Hands-On One-Shot Learning with Python: Learn to Implement Fast and Accurate Deep Learning Models with Fewer Training Samples Using PyTorch. Birmingham: Packt Publishing Ltd., 2020.
  • 27. Fei N, Lu Z, Xiang T. et al. MELR: meta-learning via modeling episode-level relationships for few-shot learning. International Conference on Learning Representations. 2020. https://gsai.ruc.edu.cn/uploads/20211128/ed1b33cab379fc0be5052afd54a35802.pdf
  • 28. Jimenez-Garcia B, Baquedano E, Cifuentes-Alcobendas G. et al. A taphonomic study of DS-22A (Bed I, Olduvai Gorge) and its implications for reconstructing hominin-carnivore interactions at early pleistocene anthropogenic sites. Quaternary 2025;8:35.
  • 29. Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks. In: Precup D, Teh YW (eds.), Proceedings of the 34th International Conference on Machine Learning. Cambridge, MA: PMLR, 2017, 1126–35.
  • 30. Liu Q, Tian Y, Zhou T. et al. A few-shot disease diagnosis decision making model based on meta-learning for general practice. Artif Intell Med 2024;147:102718.
  • 31. Baquedano E, Domínguez-Rodrigo M, Musiba C. An experimental study of large mammal bone modification by crocodiles and its bearing on the interpretation of crocodile predation at FLK Zinj and FLK NN3. J Archaeol Sci 2012;39:1728–37.
  • 32. Cifuentes-Alcobendas G. New methodologies for the taphonomic study of bone surface modifications: applications to the Pleistocene fossil record of Olduvai Gorge (FLK North) through the integration of Experimental Archaeology and Artificial Intelligence. Ph.D. Dissertation, University of Alcalá, Madrid, 2025.
  • 33. Jiménez-García B, Abellán N, Baquedano E. et al. Corrigendum to “Deep learning improves taphonomic resolution: high accuracy in differentiating tooth marks made by lions and jaguars”. J R Soc Interface 2020;17:20200782.
  • 34. McKinney SM, Sieniek M, Godbole V. et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577:89–94.
  • 35. Ardila D, Kiraly AP, Bharadwaj S. et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med 2019;25:954–61.
  • 36. Gulshan V, Peng L, Coram M. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016;316:2402–10.
  • 37. Esteva A, Kuprel B, Novoa RA. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542:115–8.
  • 38. Kamnitsas K, Ledig C, Newcombe VFJ. et al. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med Image Anal 2017;36:61–78.
  • 39. Rajpurkar P, Irvin J, Ball RL. et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med 2018;15:e1002686.
  • 40. Rajpurkar P, Irvin J, Zhu K. et al. CheXNet: Radiologist-level Pneumonia Detection on Chest X-rays with Deep Learning. arXiv [cs.CV]. 2017. http://arxiv.org/abs/1711.05225
  • 41. Ouyang D, He B, Ghorbani A. et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 2020;580:252–6.
  • 42. Wang L, Lin ZQ, Wong A. COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Sci Rep 2020;10:19549.
  • 43. Kurita Y, Meguro S, Tsuyama N. et al. Accurate deep learning model using semi-supervised learning and Noisy Student for cervical cancer screening in low magnification images. PLoS One 2023;18:e0285996.
  • 44. Paderno A, Ataide Gomes EJ, Gilberg L. et al. Artificial intelligence-enhanced opportunistic screening of osteoporosis in CT scan: a scoping Review. Osteoporos Int 2024;35:1681–92.
  • 45. Peng T, Zeng X, Li Y. et al. A study on whether deep learning models based on CT images for bone density classification and prediction can be used for opportunistic osteoporosis screening. Osteoporos Int 2024;35:117–28.
  • 46. Pula M, Kucharczyk E, Zdanowicz A. et al. Image quality improvement in Deep Learning Image Reconstruction of head computed tomography examination. Tomography 2023;9:1485–93.
  • 47. Quaia E, Kiyomi Lanza de Cristoforis E, Agostini E. et al. Computed tomography effective dose and image quality in deep learning image reconstruction in intensive care patients compared to iterative algorithms. Tomography 2024;10:912–21.
  • 48. Su J, Li M, Lin Y. et al. Deep learning-driven multi-view multi-task image quality assessment method for chest CT image. Biomed Eng Online 2023;22:117.
  • 49. Aubreville M, Bertram C, Veta M. et al. Quantifying the Scanner-induced Domain Gap in Mitosis Detection. arXiv [cs.CV]. 2021. http://arxiv.org/abs/2103.16515
  • 50. de Almeida JG, Rodrigues NM, Castro Verde AS. et al. Impact of scanner manufacturer, endorectal coil use, and clinical variables on deep learning-assisted prostate cancer classification using multiparametric MRI. Radiol Artif Intell 2025;7:e230555.
  • 51. Yan W, Huang L, Xia L. et al. MRI manufacturer shift and adaptation: increasing the generalizability of deep learning segmentation for MR images acquired with different scanners. Radiol Artif Intell 2020;2:e190195.
  • 52. Byeon W, Domínguez-Rodrigo M, Arampatzis G. et al. Automated identification and deep classification of cut marks on bones and its paleoanthropological implications. J Comput Sci 2019;32:36–43.
  • 53. Domínguez-Rodrigo M, Cobo-Sánchez L, Baquedano E. et al. Reconstructing Olduvai: The Behavior of Early Humans at David’s Site. Amsterdam, Netherlands: Elsevier Science, 2024.
  • 54. Cobo-Sánchez L, Pizarro-Monzo M, Cifuentes-Alcobendas G. et al. Computer vision supports primary access to meat by early Homo 1.84 million years ago. PeerJ 2022;10:e14148.
  • 55. Domínguez-Rodrigo M, Barba R, Egeland CP. Deconstructing Olduvai: A Taphonomic Study of the Bed I Sites. Berlin, Germany: Springer Science & Business Media, 2007.
  • 56. Vegara-Riquelme M, Gidna A, Uribelarrea del Val D. et al. Reassessing the role of carnivores in the formation of FLK North 3 (Olduvai Gorge, Tanzania): a pilot taphonomic analysis using Artificial Intelligence tools. J Archaeol Sci Rep 2023;47:103736.
  • 57. Micó C, Arilla M, Rosell J. et al. Among goats and bears: a taphonomic study of the faunal accumulation from Tritons Cave (Lleida, Spain). J Archaeol Sci Rep 2020;30:102194.
  • 58. Jiménez-García B, Mico C, Pizarro-Monzo M. et al. Artificial intelligence for the identification of taphonomic agents and processes: a reanalysis of the faunal accumulation from Tritons cave (Lleida, Spain). Royal Soc Open Sci. 2024;11:241168. 10.1098/rsos.241168
  • 59. Blasco R, Arilla M, Domínguez-Rodrigo M. et al. Who peeled the bones? An actualistic and taphonomic study of axial elements from the Toll Cave Level 4, Barcelona, Spain. Quat Sci Rev 2020;250:106661.
  • 60. Moclán A, Domínguez-Rodrigo M, Huguet R. et al. Deep learning identification of anthropogenic modifications on a carnivore remain suggests use of hyena pelts by Neanderthals in the Navalmaíllo rock shelter (Pinilla del Valle, Spain). Quat Sci Rev 2024;329:108560.
  • 61. Rogers AR. On equifinality in faunal analysis. Am Antiq 2000;65:709–23.
  • 62. Bertalanffy L. Problems of organic growth. Nature 1949;163:156–8.
  • 63. Lyman LR. The concept of equifinality in Taphonomy. J Taphon 2004;2:15–26.
  • 64. Arriaza MC, Domínguez-Rodrigo M. When felids and hominins ruled at Olduvai Gorge: a machine learning analysis of the skeletal profiles of the non-anthropogenic Bed I sites. Quat Sci Rev 2016;139:43–52.
  • 65. Domínguez-Rodrigo M, Pickering TR. Early hominid hunting and scavenging: a zooarcheological review. Evol Anthropol 2003;12:275–82.
  • 66. Domínguez-Rodrigo M, Pickering TR. The meat of the matter: an evolutionary perspective on human carnivory. Azania 2017;52:4–32.
  • 67. Blumenschine RJ, Marean CW, Capaldo SD. Blind Tests of Inter-analyst Correspondence and Accuracy in the Identification of Cut Marks, Percussion Marks, and Carnivore Tooth Marks on Bone Surfaces. J Archaeol Sci 1996;23:493–507.
  • 68. Domínguez-Rodrigo M, Saladié P, Cáceres I. et al. Use and abuse of cut mark analyses: the Rorschach effect. J Archaeol Sci 2017;86:14–23.
  • 69. Sahle Y, El Zaatari S, White TD.  Hominid butchers and biting crocodiles in the African Plio-Pleistocene. Proc Natl Acad Sci USA  2017;114:13164–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Bunge M.  Analogy between systems. Int J Gen Syst  1981;7:221–3. [Google Scholar]

Associated Data

Supplementary Materials

bpaf057_Supplementary_Data

Data Availability Statement

All image data and code files have been compiled in a public repository created specifically for this paper: https://doi.org/10.7910/DVN/WUSGSW. The code version is dated May 24, 2025.


Articles from Biology Methods & Protocols are provided here courtesy of Oxford University Press