Abstract
The Rey–Osterrieth complex figure (ROCF) test is a neuropsychological task that can be useful for the early detection of cognitive decline in the elderly population. Several computer vision systems have been proposed to automate this complex analysis task, but the lack of public benchmarks prevents a fair comparison between them. To advance in that direction, we present a benchmarking framework for the automatic scoring of the ROCF test that provides: the ROCFD528 dataset, the first open dataset of ROCF line drawings; and experimental results obtained by several modern deep learning models, which can serve as a baseline for comparing new proposals. We evaluate different state-of-the-art convolutional neural networks (CNNs) under both traditional and transfer learning paradigms. Quantitative experimental results (MAE = 3.448) indicate that a CNN specifically designed for sketches outperforms other state-of-the-art CNN architectures when the number of available examples is limited. This benchmark can also serve as a paradigmatic example within the broad field of machine learning for developing efficient and robust models that analyze line drawings and sketches, not only in classification but also in regression tasks.
Keywords: Benchmark, Rey-Osterrieth complex figure scoring, Deep learning, Transfer learning, Cognitive impairment detection
1. Introduction
The number of people with dementia will triple in the next 30 years [1]. The associated cognitive impairment causes functional and behavioral problems that significantly impact the quality of life of patients, their families and caregivers. Although there is currently no cure for this type of disease, researchers are looking both for therapies that slow down its evolution and for early detection methods, under the idea that, in a virtuous cycle, the earlier the disease is diagnosed, the better it can be understood and treated. This has led to great enthusiasm for the search for biomarkers for early detection and for defining those at risk in the clinical setting. However, early diagnosis of cognitive impairment can be costly in both time and money, as it requires examining information from multiple domains, such as neuropsychological tests, genetic and neuroimaging biomarkers, and demographic and personal data.
There is a general consensus that neuropsychological tests, which measure the different domains and subdomains of cognitive function, can help in the early diagnosis of cognitive and motor impairments with only elementary requirements (paper and pencil in some cases), and can provide clues about the organization of brain activity and its involvement in neurodegenerative disorders. They are therefore ideal screening tools, because they are inexpensive, minimally invasive and free of secondary risks. The Rey–Osterrieth complex figure (ROCF) test [2],[3] is a neuropsychological task in which examinees have to reproduce a complicated line drawing, first by copying it and then from memory. It requires multiple cognitive abilities, which permits the evaluation of different visuo-spatial abilities and cognitive functions, such as visual memory, attention, planning, and other abilities related to executive functions. In other words, the ROCF is considered a useful task for the assessment of frontal lobe function, which is required for strategic planning and organizing [4]. Scoring of the ROCF test is generally manual, which is time-consuming, requires training, and relies on examiner impressions [5]. All the proposals for automating ROCF test scoring that we have found, presented in section 2 of this article, use proprietary datasets for training and evaluating the models, which makes it difficult to compare them and slows down progress in solving the task.
In this article we aim to tackle these issues by providing a common framework for benchmarking ROCF automatic scoring, so that researchers can evaluate their methods on the same basis. The framework provides the first open dataset of ROCF line drawings and experimental results obtained by several leading-edge deep learning models, which can be used as a baseline. The official website of the benchmark, which includes links to the open code and data, can be found at this address: https://edatos.consorciomadrono.es/dataverse/rey.
Furthermore, within the field of machine learning, few freehand drawing datasets are available, and those that exist are oriented towards classification tasks [6], [7], [8], [9]. So our benchmark can be paradigmatic for the development of new architectures oriented to regression problems on hand-drawn images and sketches.
The rest of this article is structured as follows: in section 2, we present state of the art methods that automatically score the ROCF; in section 3, we describe the benchmark components: the open dataset and the models we propose for automatic scoring; in section 4, we show the results obtained by these models using five evaluation metrics; in section 5, we analyze which model configurations perform better and discuss possible reasons; and finally, in section 6, we re-emphasize the contributions achieved in this article.
2. Related work
When assessing the ROCF test, both the copying process and the final result can be analyzed [4],[10]. Reproducing the ROCF brings many cognitive abilities into play. Different types of failures are associated with different brain structures [11] and can be analyzed at different stages of the copying process. Information on organizational and planning abilities can be extracted from the process of drawing the ROCF but, without digitizing the test, it can only be derived in situ, during the performance of the test, which makes it very dependent on the evaluator. Diverse interpretations can result in systematic scoring biases, potentially compromising the validity of large-scale comparisons, especially when multiple independent examiners are involved [5]. Several systems have been proposed to digitize the ROCF test and score the process automatically or semi-automatically [12],[13]. The additional information captured by these systems will allow for better results in the future. However, digitizing the ROCF test represents a disruptive change that would invalidate all previous results, including normative data. It would also entail adopting a digital acquisition platform, which could pose challenges for implementation in lower-income countries.
As our dataset comes from archived neuropsychological studies, we will initially focus on accuracy scoring methods, which are used to quantitatively score the matching degree of the final drawing with the model [10]. They only evaluate the accuracy of the drawing but are more robust and can be assessed a posteriori, at any time, manually or automatically. In this way, we can use the already collected evaluations to advance the robust and automatic assessment of the ROCF drawing test before moving on to a richer evaluation framework. Many scoring methods have been developed for the ROCF [4]. The Quantitative Scoring System (QSS) [2],[3],[10] is the most widely used method in clinical practice. It assigns a quantitative value strictly related to the geometric similarity of the drawing and the model. In QSS, the ROCF is divided into 18 components and each of them is scored based on their presence, completeness and positioning with 0, 0.5, 1 or 2 points, so that the ROCF score ranges from 0 to 36.
In [14], the automation of QSS is investigated using a robust technique to locate a reduced set of scoring sections and a knowledge-based system that employs spatial metrics and fuzzy approximation techniques to obtain partial scores. However, the authors only describe a partial implementation, and the results cannot be evaluated. Vogt et al. [15] present a method that employs 303 figures in total. It returns a score which, although not stated, appears to be QSS. The process begins by detecting the 18 components of the figure using a cascade of deep neural networks and ends by aggregating information from these components to return a score. This score is compared with those given by six experts, obtaining an average Pearson correlation coefficient (r) of 0.88. Sangiovanni et al. [16] use 37 ROCF copies and estimate QSS scores. Instead of analyzing ROCF line components, they segment the interior regions between lines and compare them with those of the ROCF model using four similarity metrics based on area, orientation and position features. For each of the regions, a value between 0 and 1 is obtained by combining the four metrics. These values are summed to obtain a figure score in the range 0-18, which is converted to the range 0-36. Results are expressed as the Pearson correlation with the scores given by a human evaluator (r = 0.790). Langer et al. [17] have a dataset of 20225 ROCF drawn by volunteers from 90 countries. The figures are scanned, resized to 232x300 and processed in grayscale. They use standard CNNs that treat the ROCF components separately. Each CNN can treat each component score as either a regression or a classification target. By combining both output types, they form their best configuration, which obtains an MAE of 1.11 on the test set. This test subset has 4045 figures, the training set has 12944 figures and the validation set, 3236. Park et al. [18] start from a dataset of 20040 ROCF drawn by 6680 volunteers.
The figures are scanned, 512x512 in size, and processed in binary pixel format. They use a pre-trained DenseNet [19] and treat the QSS value as a linear regression target. Their best model obtains an MAE of 0.95 using 5-fold cross-validation. Schuster et al. [20] collect 416 ROCF drawn by 208 volunteers, who performed both the copy and delayed-recall conditions. The figures are scanned, have a size of 354x500 and are converted to binary pixel format. They use five CNN architectures and apply transfer learning, using the TU Berlin dataset [21] for pre-training. Following [18], their models return a single decimal value that represents the QSS score. With the dataset partitioned into 270 training images, 72 validation images and 74 test images, their best configuration obtains a validation MAE of 1.97.
In Table 1 we summarize the main characteristics of state of the art ROCF scoring approaches where complete results were given. As can be seen, all state of the art methods use private datasets for training and evaluation, and the most commonly used metrics are the Pearson correlation and the Mean Absolute Error (MAE).
Table 1.
Relevant characteristics of state of the art approaches to ROCF scoring.
| Year | Article | Dataset Availability | Dataset Size | Acquisition | Figure Subdivision | Method | Evaluation Strategy | Evaluation Metric | Results |
|---|---|---|---|---|---|---|---|---|---|
| 2019 | Vogt [15] | Private | 303 | Scanner | Yes | Machine learning | Not mentioned | Pearson correlation | 0.880 |
| 2020 | Sangiovanni [16] | Private | 37 | Scanner | Yes | Rule-based | No learning | Pearson correlation | 0.790 |
| 2022 | Langer [17] | Private | 20225 | Scanner | Yes | Machine learning | Train-Validation-Test split (12944-3236-4045) | Test MAE | 1.11 |
| 2023 | Park [18] | Private | 20040 | Scanner | No | Machine learning | 5-fold cross validation | Validation MAE | 0.95 |
| 2023 | Schuster [20] | Private | 416 | Scanner | No | Machine learning | Train-Validation-Test split (270-72-74) | Validation MAE | From 2.20 to 1.97 |
3. Benchmark description
As our dataset comes from scanned images of archived neuropsychological studies, we can only analyze the final drawing, not the drawing process; and since no robust fully-automatic method for detecting ROCF components has been achieved so far, we will extract a score from the global features of the whole drawing, without dividing it beforehand, as done recently in [18] and [20].
For ROCF line drawing scoring, we employ several modern machine learning paradigms and architectures, which are trained and evaluated using a dataset of 528 images. Considering that the scores follow an order relation, where the higher the score the better the quality of the drawing, we treat the scoring process as a regression problem (once again, in line with [18] and [20]). The output of the models is a continuous value between 0 and 1, which is rescaled to Osterrieth's QSS range (0-36). Finally, to assess model performance, we use several regression metrics.
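As a minimal illustration of this target scaling, the mapping between QSS scores and the [0, 1] regression range can be sketched as follows (a NumPy sketch; the function names are ours, not taken from the benchmark code):

```python
import numpy as np

QSS_MAX = 36.0  # upper bound of Osterrieth's Quantitative Scoring System

def normalize_scores(qss_scores):
    """Map reference QSS scores (0-36) to the [0, 1] regression targets."""
    return np.asarray(qss_scores, dtype=float) / QSS_MAX

def rescale_to_qss(model_outputs):
    """Map model outputs in [0, 1] back to the QSS range, clipping for safety."""
    return np.clip(np.asarray(model_outputs, dtype=float), 0.0, 1.0) * QSS_MAX
```

Clipping guards against outputs that stray slightly outside [0, 1], which can happen if the model head is not bounded by a sigmoid.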
3.1. The ROCFD528 dataset
The dataset we present in this paper, named ROCFD528, is composed of 528 ROCF copy drawings made by 241 participants in multiple neuropsychological evaluation sessions (a mean of 2.19) spaced one year apart. The participants were recruited from large ongoing longitudinal research studies (ref. SEJ 2004-04233 and SEJ 2007-63325) to determine the prevalence and stability of mild cognitive impairment in the Autonomous Community of Madrid, Spain [22].
Socio-demographic information of the participants is summarized in Fig. 1. Fig. 1.a shows the age of the participants at the first evaluation; 241 participants ranging from 60 to 91 years (70.55 years ± 6.38 s.d.). Fig. 1.b shows the distribution of the academic level of the participants, constant for each participant throughout all evaluations. Therefore, there are 241 values ranging from ‘0’ to ‘22 or more’ years of education (10.42 years ± 5.87 s.d.).
Figure 1.
Socio-demographic characteristics of ROCFD528 dataset. (a) Participants' age at the time of the first evaluation; (b) Participants' academic level, measured in years of study.
Finally, Fig. 2 presents the histogram of the Osterrieth's QSS scores assigned to each of the 528 figures. Although scores obtained by the same participant in consecutive sessions might be correlated, we consider the drawings to be different, and therefore treat them independently. Analyzing the scores, we observed that 264 figures are in the range 29-36, representing exactly 50% of the dataset. This is mainly because the figures are drawn by copying, which results in higher scores than drawings made from memory. In addition, the population sample consists of relatively healthy older adults, a group selected for its relevance to the early detection of cognitive impairment.
Figure 2.
Histogram of Osterrieth's QSS scores.
3.2. ROCF dataset preprocessing
The paper sheets with the ROCF pencil drawings went through a series of preprocessing steps before building the automatic scoring deep learning models. First, the drawings were scanned, cropped, and manually processed to remove annotations, resulting in 528 grayscale images with pixel values between 0 and 255. Each image of the dataset had a different size (average size = 456.52x345.25 pixels), so all images were resized to 384x384 pixels using bilinear interpolation. Next, the images were binarized and inverted. Taking into account that foreground pixels (the drawing) can take on different shades of “black” depending on the pen color used, and that they are a minority in the image with respect to background pixels (which are white), we designed a binarization method that models the background with a normal distribution and sets the threshold one standard deviation below its mean. Both resizing and binarization aim to facilitate and improve image processing by the deep learning models. Fig. 3 shows a scanned ROCF line drawing and the output of the resizing and binarization operations.
Figure 3.
ROCF preprocessing. (a) Original image (472x405); (b) Resized image (384x384); (c) Binarized image (384x384).
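The binarization step can be sketched in NumPy as follows. This is an illustrative implementation: the text does not specify how background pixels are identified before fitting the normal distribution, so selecting pixels at or above the median intensity is our assumption:

```python
import numpy as np

def binarize_rocf(gray):
    """Binarize and invert a grayscale ROCF scan (pixel values 0-255).

    The background (white paper) is modeled with a normal distribution and
    the threshold is set one standard deviation below its mean, as described
    in the text. How background pixels are selected is not specified in the
    paper; taking pixels at or above the median intensity is our assumption,
    justified because background pixels dominate the image.
    """
    gray = np.asarray(gray, dtype=float)
    background = gray[gray >= np.median(gray)]
    threshold = background.mean() - background.std()
    # Invert: drawing strokes (dark) map to 1, paper (bright) maps to 0.
    return (gray < threshold).astype(np.uint8)
```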
3.3. Deep learning architectures and paradigms for ROCF automatic scoring
Automatic scoring of ROCFs is carried out by training a set of deep learning architectures, namely convolutional neural networks (CNNs), due to the excellent results they have provided in the last decade for image processing tasks such as segmentation, classification and object detection [23]. Within the category of CNNs, two sub-categories are considered: those designed to work with typical color images (InceptionV3 [24], MobileNetV2 [25] and EfficientNet-B1 [26]), and those specifically designed to work with line drawings (Sketch-a-Net [27]). InceptionV3 is a classical network designed by Google that gives excellent results despite its relatively small number of parameters, while MobileNetV2 is a lightweight network designed specifically for mobile phones. These two networks have been chosen for ROCF automatic scoring due to their great performance in sketch classification, as reported by [28]. We add EfficientNet-B1 to the candidates, as it is an evolution of MobileNetV2 that achieves better results than classical CNNs, such as InceptionV3, using a significantly smaller number of parameters. Finally, Sketch-a-Net is a special network with a very simple architecture that exploits the particular characteristics of sketches over color images. A sketch represents information in an abstract, sparse and discrete manner, whereas a color image displays real-world information with a high level of local detail, with textures and continuity in the colors of neighboring pixels. Table 2 displays some computational characteristics of the four architectures, such as their size in memory, the number of feature extraction and classification parameters given an input image size of 384x384x1, and the network topological depth.
Table 2.
Computational characteristics of the deep learning architectures.
| Architecture | Size (MB) | Feature Extraction Parameters | Classification Parameters | Depth |
|---|---|---|---|---|
| Sketch-a-Net | 276 | 1.69M | 22.41M | 8 |
| MobileNetV2 | 27 | 2.26M | 1281 | 105 |
| InceptionV3 | 251 | 21.80M | 2049 | 189 |
| EfficientNet-B1 | 77 | 6.58M | 1281 | 186 |
The four architectures are trained using two learning paradigms: traditional learning (also known as direct or isolated learning), in which the weights of the architecture are randomly initialized and then trained, and transfer learning, in which an architecture is pre-trained on another dataset and then partially fine-tuned on the dataset of interest. In our case, the dataset of interest is ROCFD528, and we extract prior knowledge from two datasets: the general color image dataset “ImageNet” [29] and the sketch dataset “Quick,Draw!” [6]. Both datasets contain a massive number of elements: approximately 14 million examples for the former and 50 million drawings for the latter. By selecting these datasets, we aim to verify whether the knowledge extracted from Quick,Draw! is more valuable, given that the drawings it contains are closer to the ROCF drawings.
We obtain eleven configurations by combining the four architectures, the two learning paradigms and the three datasets: four of them training each of the four architectures using just ROCFD528; three jointly using the knowledge extracted from ImageNet and ROCFD528 (Sketch-a-Net was not considered here as it does not make sense to train it with color images); and four using the knowledge gained from Quick,Draw! and ROCFD528. Table 3 shows the characteristics of each configuration such as the architecture, the learning paradigm, and the pre-training and training datasets.
Table 3.
Overview of the eleven configurations tested. SaN: Sketch-a-Net; MN2: MobileNetV2; I3: InceptionV3; ENB1: EfficientNet-B1; DL: Direct learning; TL: Transfer learning; IN: ImageNet; QD: Quick, Draw!
| Configuration | Architecture | Learning Paradigm | Pre-training Dataset | Training Dataset |
|---|---|---|---|---|
| SaN-DL | Sketch-a-Net | Direct | - | ROCFD528 |
| MN2-DL | MobileNetV2 | Direct | - | ROCFD528 |
| I3-DL | InceptionV3 | Direct | - | ROCFD528 |
| ENB1-DL | EfficientNet-B1 | Direct | - | ROCFD528 |
| MN2-TL-IN | MobileNetV2 | Transfer | ImageNet | ROCFD528 |
| I3-TL-IN | InceptionV3 | Transfer | ImageNet | ROCFD528 |
| ENB1-TL-IN | EfficientNet-B1 | Transfer | ImageNet | ROCFD528 |
| SaN-TL-QD | Sketch-a-Net | Transfer | Quick,Draw! | ROCFD528 |
| MN2-TL-QD | MobileNetV2 | Transfer | Quick,Draw! | ROCFD528 |
| I3-TL-QD | InceptionV3 | Transfer | Quick,Draw! | ROCFD528 |
| ENB1-TL-QD | EfficientNet-B1 | Transfer | Quick,Draw! | ROCFD528 |
To pre-train with ImageNet, we used the weights offered by Keras. These default weights were extracted using an ImageNet subset of approximately 1.4 million examples (1.2M + 100K + 100K for training-validation-test) [30]. Meanwhile, for pre-training with Quick,Draw!, we manually defined a subset of 414K images (290K + 62K + 62K for training-validation-test), similarly to [28].
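The transfer learning procedure (freeze the pre-trained feature extractor, fine-tune only the final layers on ROCFD528) can be illustrated with a toy model. The linear extractor and head below are stand-ins for the actual CNN backbones and regression heads; only the update logic, where gradients never reach the frozen weights, is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
W_feat = rng.normal(size=(8, 4))   # stand-in for pre-trained weights (frozen)

def features(x):
    """Frozen feature extractor: never updated during fine-tuning."""
    return np.tanh(x @ W_feat)

def fine_tune_head(X, y, w_head, lr=0.05, epochs=500):
    """Fit only the head weights by gradient descent on the MSE loss.

    Gradients are computed with respect to w_head only, mirroring the
    frozen feature-extraction layers of the transfer learning setup.
    """
    F = features(X)                 # extractor output, computed once
    for _ in range(epochs):
        grad = 2.0 * F.T @ (F @ w_head - y) / len(y)
        w_head = w_head - lr * grad
    return w_head
```

In the actual configurations, the same effect is achieved by marking the backbone layers as non-trainable before compiling the model.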
3.4. Evaluation metrics
Since we treat the construction of the scoring model as a regression problem, we propose five typical regression metrics for evaluation: Pearson correlation coefficient (Eq. (1)), coefficient of determination (Eq. (2)), mean absolute error (Eq. (4)), root mean square error (Eq. (5)) and median absolute error (Eq. (6)). Pearson correlation is a measure of linear correlation between two sets of data: the closer the value is to +1, the higher the correlation between predicted and reference scores. While the Pearson coefficient measures the correlation between two variables, it does not necessarily indicate the quality of the fit of the model to the data. The coefficient of determination measures just this, indicating the fraction of variability of the dependent variable (reference scores) explained by the independent variable (predicted scores). It can range from any negative number to +1, with +1 indicating that the predicted scores match the reference scores perfectly, 0 indicating that the predicted scores are as good as random guesses around the mean of the reference scores, and a negative value indicating that the predictions are worse than random. The third metric, mean absolute error, measures the average of the absolute differences between actual and predicted values. It gives equal weight to all errors, regardless of their magnitude, and is therefore not very sensitive to outliers. Root mean square error takes the square root of the mean of the squared differences, which means that larger errors are penalized more heavily. This results in a higher sensitivity to outliers, which may hide to some extent the behavior of the model. With median absolute error we aim to reduce this effect, since the median is applied instead of the mean. The formulas for the metrics used in this work are presented below:
- Pearson Correlation Coefficient (PCC):

  $$r = \frac{\mathrm{cov}\left(S_{c}^{X}, S_{m}^{X}\right)}{\sigma_{S_{c}^{X}}\,\sigma_{S_{m}^{X}}} \tag{1}$$

  where $r$ is the correlation coefficient, $X$ is the ROCF dataset, $\mathrm{cov}$ is the covariance function, $\sigma$ represents the standard deviation, $S_{c}^{X}$ are the scores given by expert consensus for all the figures in $X$ and $S_{m}^{X}$ are the scores given by a deep learning model $m$.

- Coefficient of Determination ($R^2$):

  $$R^2 = 1 - \frac{SSE}{\sum_{x \in X}\left(s_{c}(x) - \overline{S_{c}^{X}}\right)^2} \tag{2}$$

  $$SSE = \sum_{x \in X}\left(s_{c}(x) - s_{m}(x)\right)^2 \tag{3}$$

  where $R^2$ is the coefficient of determination, $SSE$ represents the sum of squared errors (Eq. (3)), $s_{c}(x)$ and $s_{m}(x)$ are the consensus and model scores of figure $x$, and $\overline{S_{c}^{X}}$ is the average of the consensus scores over all the figures in $X$.

- Mean Absolute Error (MAE):

  $$MAE = \frac{1}{|X|}\sum_{x \in X}\left|s_{c}(x) - s_{m}(x)\right| \tag{4}$$

  where $|X|$ symbolizes the number of elements in the dataset.

- Root Mean Squared Error (RMSE):

  $$RMSE = \sqrt{\frac{1}{|X|}\sum_{x \in X}\left(s_{c}(x) - s_{m}(x)\right)^2} \tag{5}$$

- Median Absolute Error (MedAE):

  $$MedAE = \operatorname*{median}_{x \in X}\left|s_{c}(x) - s_{m}(x)\right| \tag{6}$$

  where $\operatorname{median}$ is the median applied over all the values in $X$.
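Under these definitions, the five metrics can be computed for one fold as follows (a NumPy sketch; the benchmark's own evaluation code may differ):

```python
import numpy as np

def regression_metrics(reference, predicted):
    """Compute the five evaluation metrics of section 3.4 for one fold."""
    ref = np.asarray(reference, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    err = ref - pred
    pcc = np.corrcoef(ref, pred)[0, 1]                    # Eq. (1)
    sse = np.sum(err ** 2)                                # Eq. (3)
    r2 = 1.0 - sse / np.sum((ref - ref.mean()) ** 2)      # Eq. (2)
    mae = np.mean(np.abs(err))                            # Eq. (4)
    rmse = np.sqrt(np.mean(err ** 2))                     # Eq. (5)
    medae = np.median(np.abs(err))                        # Eq. (6)
    return {"PCC": pcc, "R2": r2, "MAE": mae, "RMSE": rmse, "MedAE": medae}
```

Note that a constant systematic offset leaves PCC at 1.0 while degrading the error metrics, which is why the benchmark reports all five.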
4. Experimental results
Eleven configurations combining different CNN architectures, learning paradigms and pre-training datasets have been compared. For the configurations pre-trained with ImageNet, we used the models provided by Keras, but for the configurations pre-trained with Quick,Draw! we had to build the pre-trained models ourselves. Subsection 4.1 explains this process and shows the performance of the trained models in classifying the Quick,Draw! dataset. These results give us an idea of how well each model behaves when working with sketches and can be used to better interpret the results on ROCF. In subsection 4.2, we show the performance metrics (section 3.4) obtained with the eleven configurations. In subsection 4.3, we ask two expert psychologists to individually score a subset of the ROCFD528 dataset and compare these scores with the gold standard (GS) scores, assigned by consensus of a broader team of expert psychologists. The performance of the two independent human experts is compared against the eleven deep learning configurations using a ranking. Finally, in subsection 4.4 we calculate metric values for each score so we can better comprehend the effects of score imbalance.
4.1. Sketch classification - pre-training deep learning architectures with Quick,Draw!
In this first experiment, we check whether the different architectures behave correctly when faced with a task where the input image is a line drawing. We train each of our four CNN architectures with Quick,Draw! to learn the class associated with each sketch, a standard classification problem. For this, we divide the subset of 414K images into 289800 images for training, 62100 images for validation and 62100 images for testing. Training hyperparameters, common to all architectures, are the following: the input image size is 256x256x1, the batch size is 64, the optimizer is Adam with the default parameters offered by Keras, the learning rate is 1e-4, the number of training epochs is 100, and real-time data augmentation is applied in the form of small rotations (up to 10 degrees), shifts (up to 10% of the image width), shears (up to 10 degrees) and zooms (up to 10% of the image width). As it is a classification problem, the loss function is cross-entropy. For each architecture, the selected model is the one with maximum validation accuracy, and its weights are the ones employed for the subsequent transfer learning with the Quick,Draw! configurations. Table 4 shows the validation and test accuracy of each of the four models on the Quick,Draw! dataset. These results confirm that all four models perform well and quite similarly in a classification problem over sketched objects, although EfficientNet-B1 obtains the best results, 1.36% above the second best.
Table 4.
Model accuracies on Quick,Draw! subset.
| Architecture | Training Dataset | Validation Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Sketch-a-Net | Quick,Draw! | 69.40 | 69.27 |
| MobileNetV2 | Quick,Draw! | 70.52 | 70.56 |
| InceptionV3 | Quick,Draw! | 70.65 | 70.44 |
| EfficientNet-B1 | Quick,Draw! | 71.93 | 71.92 |
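As an illustration of the real-time augmentation used during pre-training, the following NumPy sketch implements the shift component (up to 10% of the image size, padding with background pixels); in practice the Keras augmentation pipeline handles rotations, shears and zooms analogously:

```python
import numpy as np

rng = np.random.default_rng(42)

def shift_image(img, dy, dx):
    """Translate a 2D image by (dy, dx) pixels, padding with background (0)."""
    h, w = img.shape
    out = np.zeros_like(img)
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def random_shift(img, max_fraction=0.1):
    """Apply a random shift of up to 10% of the image height/width."""
    h, w = img.shape
    dy = int(rng.integers(-int(h * max_fraction), int(h * max_fraction) + 1))
    dx = int(rng.integers(-int(w * max_fraction), int(w * max_fraction) + 1))
    return shift_image(img, dy, dx)
```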
4.2. ROCF scoring
In this second experiment, we evaluate the eleven configurations proposed for automatic ROCF scoring. On the one hand, configurations not pre-trained are initialized randomly and trained with the ROCFD528 dataset. On the other, configurations pre-trained with ImageNet are initialized with the weights offered by Keras, and configurations pre-trained with Quick,Draw! are initialized as stated in subsection 4.1. After pre-training, we freeze all network weights corresponding to the feature extraction layers and fine-tune the weights of the classification layers using the ROCFD528 dataset.
To evaluate the configurations, 16-fold stratified cross-validation is applied. This evaluation strategy involves dividing the ROCFD528 dataset into 16 parts, training the configuration on 15 parts and evaluating its performance on the remaining part. This procedure is repeated 16 times, with a rotating validation set. Once the whole process is complete, the configuration performance is calculated by averaging the results obtained in the 16 iterations. Each part of the ROCFD528 dataset contains 33 figures, and the figure distribution remains the same for all configurations. Some training hyperparameters are common to all configurations: the input image size is 384x384x1, the batch size is 32, the optimizer is Adam with the default parameters offered by Keras, and real-time data augmentation is applied in the form of small rotations, shifts, shears and zooms of the images (whose values are set as in subsection 4.1). Other hyperparameters, such as the learning rate or the number of training epochs, differ between configurations; their values are reflected in Table 5. As this is a regression problem, we employ the Mean Squared Error (MSE) loss function. When the training phase is finished, the selected model is the one with the minimum MSE. This model is used to extract the metrics associated with the validation set. Finally, the metrics are averaged across the 16 folds.
Table 5.
Values of the hyperparameters learning rate and number of training epochs established for each of the eleven configurations.
| Configuration | Learning Rate | Training Epochs |
|---|---|---|
| SaN-DL | 1e − 4 | 750 |
| MN2-DL | 1e − 6 | 750 |
| I3-DL | 1e − 5 | 750 |
| ENB1-DL | 1e − 3 | 750 |
| MN2-TL-IN | 5e − 5 | 1500 |
| I3-TL-IN | 5e − 5 | 1500 |
| ENB1-TL-IN | 5e − 5 | 1500 |
| SaN-TL-QD | 5e − 4 | 1500 |
| MN2-TL-QD | 5e − 5 | 1500 |
| I3-TL-QD | 0.1 | 1500 |
| ENB1-TL-QD | 5e − 5 | 1500 |
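Stratified cross-validation with a continuous target like the QSS score is commonly implemented by binning the scores and distributing each bin across the folds. The following sketch illustrates one such scheme; the binning and round-robin assignment are our illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def stratified_folds(scores, n_folds=16, n_bins=8, seed=0):
    """Assign each figure to one of n_folds folds so that the QSS score
    distribution is roughly preserved in every fold.

    Stratification for a regression target is approximated by binning the
    continuous scores; each bin is then dealt round-robin across the folds,
    with a per-bin offset to balance fold sizes.
    """
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    bins = np.minimum((scores / 36.0 * n_bins).astype(int), n_bins - 1)
    fold_of = np.empty(len(scores), dtype=int)
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        rng.shuffle(idx)
        fold_of[idx] = (np.arange(len(idx)) + b) % n_folds
    return fold_of
```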
Table 6 presents the fold-wise average and standard deviation of the five metrics (PCC, R2, MAE, RMSE, and MedAE) calculated for each of the eleven configurations. Fig. 4 shows the confusion matrices for the eleven scoring configurations. In these matrices, the columns reflect the score given by the model, the rows represent the GS score, and the cells show the total count of figures across all folds satisfying both conditions.
Table 6.
Fold-wise average and standard deviation of the five metrics calculated for the eleven configurations. PCC: Pearson Correlation Coefficient; R2: Coefficient of Determination; MAE: Mean Absolute Error; RMSE: Root Mean Square Error; MedAE: Median Absolute Error.
| Configuration | PCC | R2 | MAE | RMSE | MedAE |
|---|---|---|---|---|---|
| SaN-DL | 0.859 (0.074) | 0.727 (0.123) | 3.448 (0.639) | 4.426 (0.765) | 2.825 (0.562) |
| MN2-DL | 0.614 (0.125) | 0.351 (0.152) | 5.791 (0.804) | 6.973 (0.863) | 5.293 (0.882) |
| I3-DL | 0.753 (0.073) | 0.541 (0.122) | 4.714 (0.761) | 5.879 (0.933) | 4.031 (0.968) |
| ENB1-DL | 0.820 (0.088) | 0.665 (0.134) | 3.889 (0.749) | 4.948 (0.876) | 3.227 (0.823) |
| MN2-TL-IN | 0.778 (0.088) | 0.563 (0.197) | 4.546 (0.963) | 5.619 (1.143) | 4.032 (1.103) |
| I3-TL-IN | 0.780 (0.066) | 0.600 (0.110) | 4.318 (0.651) | 5.464 (0.826) | 3.627 (1.042) |
| ENB1-TL-IN | 0.786 (0.088) | 0.544 (0.142) | 4.432 (0.593) | 5.815 (0.854) | 3.328 (0.686) |
| SaN-TL-QD | 0.735 (0.100) | 0.400 (0.122) | 5.602 (0.798) | 6.731 (0.849) | 5.273 (1.097) |
| MN2-TL-QD | 0.795 (0.089) | 0.623 (0.133) | 4.257 (0.649) | 5.255 (0.815) | 3.571 (0.703) |
| I3-TL-QD | 0.729 (0.101) | 0.526 (0.134) | 4.722 (0.580) | 5.925 (0.651) | 4.023 (0.792) |
| ENB1-TL-QD | 0.804 (0.082) | 0.639 (0.128) | 4.068 (0.624) | 5.124 (0.770) | 3.494 (0.810) |
Figure 4.
Confusion matrices for the eleven configurations. Rows: GS score; Columns: Predicted score.
4.3. Comparing deep learning models with humans
To gain insight into how humans analyze the ROCF, we ask two experts to individually score a subset of 185 images from the original ROCFD528 dataset. These images are randomly selected and preserve the dataset score distribution (Fig. 2). The two experts score each of the ROCF components independently; these partial scores are then summed to obtain the figure score. It is crucial to emphasize that this scoring procedure differs from the one used by the expert team to establish the GS. In the original procedure, each expert proposed a score for a specific ROCF, making trade-offs between component scores, and then the team of experts met to negotiate the final score for each ROCF. We believe that the alternative procedure followed by the two independent experts could lead to more systematic and unbiased scores. Table 7 shows the calculated values of the five metrics, comparing the scores of the two experts to each other and to the GS scores. Fig. 5 shows the confusion matrices comparing the scores of the two experts with the GS scores. Fig. 6 presents the confusion matrix comparing the scores of the two experts.
Table 7.
Values of the five metrics calculated by comparing scores assigned by different raters. PCC: Pearson Correlation Coefficient; R2: Coefficient of Determination; MAE: Mean Absolute Error; RMSE: Root Mean Square Error; MedAE: Median Absolute Error.
| Scores | Reference Scores | PCC | R2 | MAE | RMSE | MedAE |
|---|---|---|---|---|---|---|
| Expert 1 | GS | 0.892 | 0.794 | 3.200 | 4.168 | 3.000 |
| Expert 2 | GS | 0.877 | 0.711 | 3.808 | 4.938 | 3.000 |
| Expert 1 | Expert 2 | 0.958 | 0.844 | 2.624 | 3.446 | 2.000 |
| Expert 2 | Expert 1 | 0.958 | 0.830 | 2.624 | 3.446 | 2.000 |
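The five metrics in Table 7 follow their standard definitions. As a reference for reproducing them, a minimal NumPy implementation (function names are ours) could look like:

```python
import numpy as np

def pcc(y_true, y_pred):
    # Pearson correlation coefficient between two score vectors.
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mae(y_true, y_pred):
    # Mean absolute error.
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    # Root mean square error.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def medae(y_true, y_pred):
    # Median absolute error.
    return float(np.median(np.abs(y_true - y_pred)))
```

Note the asymmetry of R2: unlike PCC, MAE, RMSE and MedAE, it depends on which vector is taken as the reference, which is why Table 7 reports both directions for the expert-vs-expert comparison.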
Figure 5.
Confusion matrices comparing the scores of the two experts with the GS scores. Rows: GS score; Columns: Independent expert's score.
Figure 6.
Confusion matrix comparing the scores of the two experts. Rows: Expert 1's score; Columns: Expert 2's score.
To compare the metric values obtained by the two human experts (i.e. human raters) and the eleven deep learning configurations (i.e. machine raters), we generate the ranking shown in Table 8. Even though the results have been obtained with different evaluation strategies (a 16-fold cross-validation scheme on the whole ROCFD528 dataset for the deep learning configurations and a random subset of 185 ROCFs for the human experts), we consider it relevant to place all raters together in the same table.
Table 8.
Rater ranking for the five metrics.
| Ranking Position | PCC | R2 | MAE | RMSE | MedAE |
|---|---|---|---|---|---|
| 1 | Expert 1 | Expert 1 | Expert 1 | Expert 1 | SaN-DL |
| 2 | Expert 2 | SaN-DL | SaN-DL | SaN-DL | Expert 1 |
| 3 | SaN-DL | Expert 2 | Expert 2 | Expert 2 | Expert 2 |
| 4 | ENB1-DL | ENB1-DL | ENB1-DL | ENB1-DL | ENB1-DL |
| 5 | ENB1-TL-QD | ENB1-TL-QD | ENB1-TL-QD | ENB1-TL-QD | ENB1-TL-IN |
| 6 | MN2-TL-QD | MN2-TL-QD | MN2-TL-QD | MN2-TL-QD | ENB1-TL-QD |
| 7 | ENB1-TL-IN | I3-TL-IN | I3-TL-IN | I3-TL-IN | MN2-TL-QD |
| 8 | I3-TL-IN | MN2-TL-IN | ENB1-TL-IN | MN2-TL-IN | I3-TL-IN |
| 9 | MN2-TL-IN | ENB1-TL-IN | MN2-TL-IN | ENB1-TL-IN | I3-TL-QD |
| 10 | I3-DL | I3-DL | I3-DL | I3-DL | I3-DL |
| 11 | SaN-TL-QD | I3-TL-QD | I3-TL-QD | I3-TL-QD | MN2-TL-IN |
| 12 | I3-TL-QD | SaN-TL-QD | SaN-TL-QD | SaN-TL-QD | SaN-TL-QD |
| 13 | MN2-DL | MN2-DL | MN2-DL | MN2-DL | MN2-DL |
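A per-metric ranking such as Table 8 can be derived mechanically from a table of metric values: raters are sorted in descending order on PCC and R2 (higher is better) and in ascending order on the error metrics (lower is better). A small illustrative sketch, restricted to the top three raters and two metrics taken from Tables 6 and 7:

```python
# Sketch: rank raters per metric. Higher is better for PCC/R2,
# lower is better for MAE/RMSE/MedAE. Only a subset of raters
# and metrics from the paper is shown here for illustration.
metrics = {
    "Expert 1": {"PCC": 0.892, "MAE": 3.200},
    "Expert 2": {"PCC": 0.877, "MAE": 3.808},
    "SaN-DL":   {"PCC": 0.859, "MAE": 3.448},
}

HIGHER_IS_BETTER = {"PCC", "R2"}

def rank(metrics, metric_name):
    # Returns rater names ordered from best to worst on the given metric.
    reverse = metric_name in HIGHER_IS_BETTER
    return sorted(metrics, key=lambda r: metrics[r][metric_name], reverse=reverse)
```

Applied to these values, the PCC ranking is Expert 1, Expert 2, SaN-DL, while the MAE ranking is Expert 1, SaN-DL, Expert 2, matching the top three rows of Table 8 for those metrics.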
4.4. Score-based performance analysis
Given the clear imbalance of scores in the ROCFD528 dataset (Fig. 2), we carry out a study of the metric values for each score. This analysis allows us to understand for which scores the human and machine raters make more mistakes and, in the case of the machine raters, it also quantifies the impact of reduced availability of training data for certain score ranges. To calculate the metrics around each score, we use the ROCFs whose GS scores fall within a window centred on that score, with one window size for PCC, another for R2, and another for MAE, RMSE and MedAE. This is known as the moving window technique, which yields smoothed metric values around each score. Smoothing is especially important around scores with little data. Fig. 7 displays the moving window values of each of the five metrics. To improve visual clarity, we only show the results corresponding to the two human experts and the five best-performing deep learning configurations (see Table 8). For these configurations, the fold-wise average and standard deviation of the metric are depicted.
Figure 7.
Moving window metric values calculated for the two human experts and the best five deep learning configurations. (a) PCC; (b) R2; (c) MAE; (d) RMSE; (e) MedAE.
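The moving window technique can be sketched as follows: for each target score s, the chosen metric is computed over all ROCFs whose GS score lies within a window of half-width w centred on s. The sketch below uses MAE and an integer score grid; the window widths used in the paper are not reproduced here, so w is left as a free parameter:

```python
import numpy as np

def moving_window_mae(gs, pred, w):
    """For each integer score s in 0..36 (the QSS range), compute the
    MAE over the samples whose GS score lies within [s - w, s + w].
    Scores with no samples in the window are omitted."""
    gs = np.asarray(gs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    out = {}
    for s in range(37):
        mask = np.abs(gs - s) <= w
        if mask.any():
            out[s] = float(np.mean(np.abs(gs[mask] - pred[mask])))
    return out
```

Widening w trades resolution for stability: larger windows smooth the curve more strongly, which is what makes the metric values readable around sparsely populated scores.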
5. Discussion
Before analyzing the results of the eleven deep learning configurations, we first examine the performance of each of the four deep learning architectures on the Quick, Draw! classification task. Table 4 shows similar accuracy values for all models, comparable with those reported in the state of the art [28],[31]. It is noteworthy that Sketch-a-Net achieves an accuracy close to that of InceptionV3 using the same training hyperparameters, even though it has a much simpler network architecture.
If we compare the eleven deep learning configurations with the models presented in the state of the art (Table 1) using the PCC reported in Table 6, our best configuration, SaN-DL, yields a PCC of 0.859, in line with the value reported by Vogt et al. [15] and clearly surpassing the 0.79 presented by Sangiovanni et al. [16]. The latter value is also exceeded by three other configurations: ENB1-DL, ENB1-TL-QD, and MN2-TL-QD. In terms of MAE, we obtain a value of 3.448 for the SaN-DL configuration (2.451 if we consider the window around score 31). This value can be regarded as close to those found in the state of the art if we take into account the dataset size, the score distribution and the evaluation strategy. With the definition of this benchmark, we want to enable fairer comparisons in the future.
By carrying out a detailed comparison of the deep learning configurations and the human experts through the ranking shown in Table 8 and the moving window metric plots displayed in Fig. 7, we can identify several trends. In the top positions of the ranking, we observe that Expert 1 obtains better results than Expert 2. This can be explained by looking at the moving window metric plots, where Expert 1 obtains better values than Expert 2 in the score range [18-36], which accounts for the majority of the ROCFs. In contrast, Expert 2 performs better in the range [0-18), confirming the variability in human criteria. When we focus on the similarities between experts, it becomes clear that both tend to make more scoring errors (MAE, RMSE and MedAE) around scores 8 and 22, demonstrating that certain ROCFs are harder to score. If we now compare human and machine raters, we observe a general human dominance up to a score of 20 (reaching as high as 27 for PCC), showcasing evident robustness to score imbalance. Nevertheless, for higher scores, deep learning configurations are able to surpass one or both experts, owing to the increased availability of training data. Among the deep learning configurations, SaN-DL exhibits the best overall metric values, even competing with the experts. These values are particularly good in the range [29-36], which represents half of the ROCFs. In addition to the metric values, the confusion matrix for this configuration (Fig. 4a) is the closest to the ideal arrangement in which all elements lie on the diagonal. The SaN-DL configuration is closely followed by ENB1-DL, showing the versatility of the EfficientNet architecture. In the following positions of the ranking, we can see that pre-training with Quick, Draw! has a very positive effect for the EfficientNet and MN2 architectures, with notable results in the range [15-29). These results lead us to infer that the knowledge acquired from Quick, Draw! by these two architectures compensates for the relative scarcity of data prior to score 29. Continuing with the ranking, the EfficientNet and I3 architectures pre-trained on ImageNet obtain slightly worse results but manage to stand out for MedAE. The configurations MN2-TL-IN and I3-TL-QD behave even worse, the former having significantly high standard deviations (Table 6) and the latter showing an exaggerated bias towards high scores (Fig. 4j). The I3-DL configuration does a decent job for a small dataset like ROCFD528, even though the architecture might be overly complex for the dataset size, whereas pre-training Sketch-a-Net with Quick, Draw! leads to poor overall metric values. This could be due to the fact that the features learned by the feature extraction part of the network are those of the Quick, Draw! dataset images, and tuning the classification layers with the ROCF dataset cannot compensate for the lack of knowledge in the form of features. Finally, MN2-DL is the worst configuration, indicating that the MN2 architecture may not be complex enough to learn from ROCFD528 alone.
All these results, obtained for the deep learning configurations and the human experts, can be used as a reference for developers who want to propose new deep learning configurations aimed at improving automatic ROCF scoring. We want to place particular emphasis on the Sketch-a-Net architecture [27], whose design, specifically tailored to drawings, enables results that are closely aligned with those produced by an independent human rater. This establishes the SaN-DL configuration as a significant baseline for comparing future proposals. We also want to highlight the good performance of architectures with limited complexity, such as MobileNetV2 and EfficientNet, especially when pre-trained with Quick, Draw!.
Besides the results, we publicly release the ROCFD528 dataset, which includes 528 ROCF images, both in grayscale (not pre-processed) and binary formats. With the dissemination of this dataset, developers will be able to initiate their experiments without the burden of the arduous data collection and cleaning phase. Furthermore, the accessibility of unprocessed images facilitates their integration with other datasets and enables the exploration of different pre-processing strategies. In the future, we will work on adding more samples in general while addressing the lack of samples for low scores. The origin of this imbalance lies in the collection process, where the majority of participants are relatively healthy people and therefore obtain high scores. Also, the fact that all the figures in the dataset are obtained by directly copying the ROCF model translates into higher scores compared to those that would be obtained by reproducing the model from memory. Therefore, a possible solution would be to collaborate with individuals exhibiting a slightly worse health profile and also record the ROCFs they draw from memory. Another way to increase the availability of ROCFs would involve the publication of large datasets presented in the state of the art, such as those of Langer et al. [17] and Park et al. [18]. This would undoubtedly speed up research on ROCF-related challenges. Another alternative to acquire ROCFs would be to generate them artificially using different techniques.
To conclude the discussion, we are aware of a dilemma concerning how the ROCF score should be calculated. From the perspective of psychology experts, adjustments can be made at two levels: first, compensating the scores of the various ROCF components to arrive at the figure score, and then reaching agreements with other experts on the overall score of a given ROCF. In contrast, scoring each component in isolation and then adding the partial scores could enable better human-machine comparisons, as evidenced by SaN-DL achieving performance aligned with that of the experts. Furthermore, the fact that the differences between individual experts are smaller than the differences between an expert and the team that assigned the GS scores (Table 7) suggests that the alternative scoring procedure could reduce discrepancies in criteria among experts. More experiments need to be conducted to determine the most appropriate way to score the ROCF in order to contribute to an accurate diagnosis of an individual's cognitive state.
6. Conclusion
In this paper we provide a benchmark for automatic ROCF scoring consisting of: (i) an open dataset, which will allow other researchers to launch their experiments; (ii) a global and direct way of processing the figures, without subdividing them into components, using linear regression to approximate their QSS score; (iii) a proposal of five evaluation metrics to quantitatively compare the performance of different methods; and (iv) results produced by several modern deep learning architectures and learning paradigms, as well as by two human experts, which may serve as a baseline. Furthermore, we provide evidence that convolutional neural networks designed specifically for sketches are capable of promising results without requiring a particularly complex architecture and using a very limited number of examples.
On the one hand, we hope that the release of the dataset will contribute to improving the technology for analyzing line drawings and sketches. On the other hand, the proposed deep learning models could be a key element in the development of early detection and clinical decision support systems in the field of neuropsychology of aging. These systems would record the ROCF scores obtained by the same participant over time, which would be analyzed by experts together with normative data [32] to better understand the participant's cognitive state, track any possible cognitive deterioration, and intervene as soon as possible.
In the future, we will work to extend the ROCF dataset in various ways, especially getting more samples with low scores, and propose other machine learning architectures and paradigms to deal with small-sized datasets and achieve better quantitative results.
Funding
This research has been supported by the CPP2021-009109 project and a FPI-UNED-2021 scholarship.
CRediT authorship contribution statement
Juan Guerrero-Martín: Writing – original draft, Software, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. María del Carmen Díaz-Mardomingo: Writing – review & editing, Validation, Supervision, Investigation, Funding acquisition, Formal analysis, Data curation. Sara García-Herranz: Writing – review & editing, Validation, Supervision, Investigation, Funding acquisition, Formal analysis, Data curation. Rafael Martínez-Tomás: Writing – review & editing, Supervision, Resources, Project administration, Funding acquisition, Data curation. Mariano Rincón: Writing – review & editing, Visualization, Validation, Supervision, Resources, Project administration, Funding acquisition, Formal analysis, Conceptualization.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors thank the rest of the project team (Alba Gómez-Valadés, Margarita Bachiller, José Manuel Cuadra), and especially Professor Herminia Peraita, without whose gathering work this dataset would not have been possible.
Contributor Information
Juan Guerrero-Martín, Email: jguerrero@dia.uned.es.
María del Carmen Díaz-Mardomingo, Email: mcdiaz@psi.uned.es.
Sara García-Herranz, Email: sgarciah@psi.uned.es.
Rafael Martínez-Tomás, Email: rmtomas@dia.uned.es.
Mariano Rincón, Email: mrincon@dia.uned.es.
Data availability
The links to the open code and data can be found at the official website of the benchmark: https://edatos.consorciomadrono.es/dataverse/rey.
References
- 1. Nichols E., Steinmetz J.D., Vollset S.E., Fukutaki K., Chalek J., Abd-Allah F., Abdoli A., Abualhasan A., Abu-Gharbieh E., Akram T.T., et al. Estimation of the global prevalence of dementia in 2019 and forecasted prevalence in 2050: an analysis for the Global Burden of Disease Study 2019. Lancet Public Health. 2022;7(2):e105–e125. doi: 10.1016/S2468-2667(21)00249-8.
- 2. Rey A. L'examen psychologique dans les cas d'encéphalopathie traumatique (les problèmes). Archives de Psychologie.
- 3. Osterrieth P.A. Le test de copie d'une figure complexe; contribution à l'étude de la perception et de la mémoire. Archives de Psychologie.
- 4. Shin M.-S., Park S.-Y., Park S.-R., Seol S.-H., Kwon J.S. Clinical and empirical applications of the Rey–Osterrieth complex figure test. Nat. Protoc. 2006;1(2):892–899. doi: 10.1038/nprot.2006.115.
- 5. Webb S.S., Moore M.J., Yamshchikova A., Kozik V., Duta M.D., Voiculescu I., Demeyere N. Validation of an automated scoring program for a digital complex figure copy task within healthy aging and stroke. Neuropsychology. 2021;35(8):847. doi: 10.1037/neu0000748.
- 6. Ha D., Eck D. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477.
- 7. Deng L. The MNIST database of handwritten digit images for machine learning research [Best of the Web]. IEEE Signal Process. Mag. 2012;29(6):141–142.
- 8. Chowdhury P.N., Sain A., Bhunia A.K., Xiang T., Gryaditskaya Y., Song Y.-Z. FS-COCO: towards understanding of freehand sketches of common objects in context. In: European Conference on Computer Vision. Springer; 2022. pp. 253–270.
- 9. Wang H., Ge S., Lipton Z., Xing E.P. Learning robust global representations by penalizing local predictive power. Adv. Neural Inf. Process. Syst. 2019;32.
- 10. Zhang X., Lv L., Min G., Wang Q., Zhao Y., Li Y. Overview of the complex figure test and its clinical application in neuropsychiatric disorders, including copying and recall. Front. Neurol. 2021;12. doi: 10.3389/fneur.2021.680474.
- 11. Chechlacz M., Novick A., Rotshtein P., Bickerton W.-L., Humphreys G.W., Demeyere N. The neural substrates of drawing: a voxel-based morphometry analysis of constructional, hierarchical, and spatial representation deficits. J. Cogn. Neurosci. 2014;26(12):2701–2715. doi: 10.1162/jocn_a_00664.
- 12. Li Y., Clamann M., Kaber D.B. Validation of a haptic-based simulation to test complex figure reproduction capability. IEEE Trans. Human-Mach. Syst. 2013;43(6):547–557.
- 13. Petilli M.A., Daini R., Saibene F.L., Rabuffetti M. Automated scoring for a tablet-based Rey figure copy task differentiates constructional, organisational, and motor abilities. Sci. Rep. 2021;11(1). doi: 10.1038/s41598-021-94247-9.
- 14. Canham R.O., Smith S.L., Tyrrell A.M. Location of structural sections from within a highly distorted complex line drawing. IEE Proc., Vis. Image Signal Process. 2005;152(6):741–749.
- 15. Vogt J., Kloosterman H., Vermeent S., Van Elswijk G., Dotsch R., Schmand B. Automated scoring of the Rey–Osterrieth complex figure test using a deep-learning algorithm. Arch. Clin. Neuropsychol. 2019;34(6):836.
- 16. Sangiovanni S., Spezialetti M., D'Asaro F.A., Maggi G., Rossi S. Administrating cognitive tests through HRI: an application of an automatic scoring system through visual analysis. In: Social Robotics: 12th International Conference, ICSR 2020, Golden, CO, USA, November 14–18, 2020, Proceedings 12. Springer; 2020. pp. 369–380.
- 17. Langer N., Weber M., Vieira B.H., Strzelczyk D., Wolf L., Pedroni A., Heitz J., Müller S., Schultheiss C., Tröndle M., et al. Automating clinical assessments of memory deficits: deep learning based scoring of the Rey–Osterrieth complex figure. bioRxiv. 2022.
- 18. Park J.Y., Seo E.H., Yoon H.-J., Won S., Lee K.H. Automating Rey complex figure test scoring using a deep learning-based approach: a potential large-scale screening tool for cognitive decline. Alzheimer's Res. Ther. 2023;15(1):145. doi: 10.1186/s13195-023-01283-w.
- 19. Huang G., Liu Z., Van Der Maaten L., Weinberger K.Q. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. pp. 4700–4708.
- 20. Schuster B., Kordon F., Mayr M., Seuret M., Jost S., Kessler J., Christlein V. Multi-stage fine-tuning deep learning models improves automatic assessment of the Rey–Osterrieth complex figure test. In: International Conference on Document Analysis and Recognition. Springer; 2023. pp. 3–19.
- 21. Eitz M., Hays J., Alexa M. How do humans sketch objects? ACM Trans. Graph. 2012;31(4):1–10.
- 22. García-Herranz S., Díaz-Mardomingo M.C., Peraita H. Neuropsychological predictors of conversion to probable Alzheimer disease in elderly with mild cognitive impairment. J. Neuropsychol. 2016;10(2):239–255. doi: 10.1111/jnp.12067.
- 23. Krizhevsky A., Sutskever I., Hinton G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012;25.
- 24. Szegedy C., Vanhoucke V., Ioffe S., Shlens J., Wojna Z. Rethinking the Inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. pp. 2818–2826.
- 25. Sandler M., Howard A., Zhu M., Zhmoginov A., Chen L.-C. MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. pp. 4510–4520.
- 26. Tan M., Le Q. EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, PMLR. 2019. pp. 6105–6114.
- 27. Yu Q., Yang Y., Liu F., Song Y.-Z., Xiang T., Hospedales T.M. Sketch-a-Net: a deep neural network that beats humans. Int. J. Comput. Vis. 2017;122:411–425.
- 28. Xu P., Hospedales T.M., Yin Q., Song Y.-Z., Xiang T., Wang L. Deep learning for free-hand sketch: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022;45(1):285–312. doi: 10.1109/TPAMI.2022.3148853.
- 29. Deng J., Dong W., Socher R., Li L.-J., Li K., Fei-Fei L. ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2009. pp. 248–255.
- 30. Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang Z., Karpathy A., Khosla A., Bernstein M., et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015;115:211–252.
- 31. Bateni P., Goyal R., Masrani V., Wood F., Sigal L. Improved few-shot visual classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. pp. 14493–14502.
- 32. García-Herranz S., Díaz-Mardomingo M.C., Suárez-Falcón J.C., Rodríguez-Fernández R., Peraita H., Venero C. Normative data for verbal fluency, trail making, and Rey–Osterrieth complex figure tests on monolingual Spanish-speaking older adults. Arch. Clin. Neuropsychol. 2022;37(5):952–969. doi: 10.1093/arclin/acab094.