Abstract
Measuring efficiency in neural network system development is an open research problem. This paper presents an experimental framework to measure the training efficiency of a neural architecture. To demonstrate our approach, we analyze the training efficiency of Convolutional Neural Networks (CNNs) and their Bayesian equivalents (BCNNs) on the MNIST and CIFAR-10 tasks. Our results show that training efficiency decays as training progresses and varies across different stopping criteria for a given neural model and learning task. We also find a non-linear relationship between training stopping criteria and training efficiency, and between model size and training efficiency. Furthermore, we illustrate the potential confounding effects of overtraining on measuring the training efficiency of a neural architecture. Regarding relative training efficiency across different architectures, our results indicate that CNNs are more efficient than BCNNs on both datasets. More generally, as a learning task becomes more complex, the relative difference in training efficiency between different architectures becomes more pronounced.
Keywords: Deep learning, Efficiency, Deep neural networks, Hyperparameters
Introduction
Artificial Intelligence (AI) is predicted to be a critical enabling technology for many of the 17 Sustainable Development Goals (SDGs). However, its current dependency on massive datasets and computing power means that it will also inhibit the attainment of some SDGs, particularly SDG 7 (Affordable and Clean Energy) and SDG 13 (Climate Action) (Vinuesa et al. 2020). Modern AI uses data-driven methods such as deep learning (DL) and is primarily driven by trends of ever larger datasets, larger models, and more powerful computers, with the sole concern of improving model accuracy (Kelleher 2019). This dynamic resulted in a 300,000x increase between 2012 and 2018 in the computation required to train a competitive DL model [3], a trend that far exceeds Moore's Law. Indeed, it has recently been estimated that training one AI model generated CO2 emissions equivalent to driving 700,000 km (DeWeerdt 2020).
The environmental challenge posed by AI's growing energy needs and associated carbon emissions has been recognized in recent years. For example, researchers in AI ethics have highlighted this challenge (Bender et al. 2021) and have called for more research on "sustainable methods of AI" (van Wynsberghe 2021). In response to these calls, there is a growing trend within AI research to move beyond system evaluations based solely on accuracy. Recent research tends to report hardware details and training time alongside accuracy, and some papers report FLOPs. However, time and FLOPs are not sufficient to characterize efficiency. There is a growing body of work (e.g., Schwartz et al. 2020; Strubell et al. 2020; Li et al. 2020; Sze et al. 2020; Li and John 2003) showing that more data is required to understand the energy and resource trade-offs of deep neural networks. Consequently, a critical step in developing sustainable AI is the development of measures of efficiency that can be integrated into the development process of an AI system.
This paper directly addresses the need for a measure that characterizes the efficiency of a neural network architecture on specific hardware and a specific learning task. A natural efficiency ratio of interest for a neural architecture is the ratio between the accuracy of a neural model and the energy consumed to achieve this accuracy. Accuracy is usually measured using an appropriate measure for the task and dataset distribution (e.g., F1, AUC-ROC, etc.). However, several recent results highlight a non-linear relationship between the accuracy of a neural model and the size of the model (Nakkiran et al. 2021). This suggests that there is likely a non-linear relationship between the training efficiency of an architecture and the size of the model instantiating the architecture. At the same time, there is a gap in the research literature in terms of how the training efficiency of a neural architecture varies across training. Understanding the dynamics of training efficiency is crucial because it informs decisions relating to the stopping criterion for training. Consequently, in this work, we set out an experimental methodology for comparing the relative efficiency of different neural architectures in terms of their efficiency dynamics as training progresses and the changes in efficiency as the size of the models instantiating the architectures varies. This experimental methodology includes both a measure of efficiency and an experimental framework for capturing the data necessary for the efficiency measure.
In order to test and demonstrate the usefulness of our efficiency measure, we use our experimental framework to analyze the relative efficiency of two different neural architectures, a CNN (LeNet) and a Bayesian Convolutional Network (BCNN), on the MNIST and CIFAR-10 tasks. BCNNs are an interesting case study because, although they have produced better results than LeNet on MNIST and CIFAR-10 (Gal and Ghahramani 2015), their efficiency relative to standard frequentist networks has yet to be assessed. Furthermore, the outcomes of training a frequentist LeNet architecture with backpropagation and of training its BCNN counterpart with approximate variational inference (implemented via dropout) are very different: training the frequentist LeNet results in a point estimate in parameter space, whereas training the BCNN returns a probability distribution over the parameter space. It is therefore likely that there will be differences in efficiency between these two architectures.
In summary, the key contributions of this research are: (1) we propose a measure of the training efficiency of a neural architecture on a given task; (2) we present a case study analyzing the efficiency dynamics of CNNs and BCNNs on multiple tasks across training; and (3) we analyze the overall efficiency of CNN versus BCNN architectures. Our results indicate that CNNs are more efficient than BCNNs to train. Also, the efficiency of both architectures varies across training. For both architectures, there is a non-linear relationship between training efficiency and stopping criteria, and between training efficiency and model size. Furthermore, we highlight and illustrate the confounding effect that overtraining can have on measuring the efficiency of a neural architecture. Finally, as the learning task becomes more complex, the relative difference in training efficiency between different architectures becomes more pronounced.
Related work
Research on efficiency in AI can broadly be categorized into four research streams: architectures, compression, training regimes, and metrics. The first of these streams focuses on developing more computationally efficient neural architectures. For example, improving the efficiency of the attention mechanism in transformer models (Vaswani et al. 2017) has frequently been a target for this type of research, due to the popularity of transformer models and the quadratic time and space complexity of the standard attention mechanism. Within this category of work, the Reformer (Kitaev et al. 2020) proposes an efficiency improvement (in terms of computation and memory) to the standard transformer that replaces the regular dot-product attention mechanism with one that uses locality-sensitive hashing, and the Linformer (Wang et al. 2020) approximates the attention mechanism with a low-rank matrix, which reduces the complexity of the attention layer to O(n). A recent survey of work on improving efficiency in transformers is presented in Tay et al. (2020). Also, although research on neural architecture search has traditionally focused on optimizing for a single objective (such as accuracy), there has recently been growing interest in multi-objective neural architecture search, which considers efficiency (frequently hardware efficiency, to enable edge deployment) as part of the optimization problem (see, e.g., Zeng et al. 2020; White et al. 2023; Chen et al. 2023; Lu et al. 2024).
A second stream of research has focused on improving efficiency by reducing model size. Some of this work trades extra computation during initial model training for smaller, more efficient models at inference. For example, the EfficientNet (Tan and Le 2019) and EfficientNetV2 (Tan and Le 2021) papers propose model scaling methods that seek to maximize model efficiency during inference (by attempting to minimize the final model depth, width, and resolution) while preserving accuracy, at the cost of extra computation during training. Similarly, the training methodology proposed in Cai et al. (2019) uses pruning during training to reduce model depth, width, kernel size, and resolution. Another example of this type of work is the Lottery Ticket Hypothesis (Frankle and Carbin 2018) methodology, which focuses on finding small subnetworks that can fit different hardware platforms and generalize better. Some research focused on reducing model size is designed to work on pre-trained models. For example, NetAdapt uses empirical measures to reduce several hyperparameters in order to meet a given resource budget (Yang et al. 2018), and DistilBERT uses model distillation techniques to generate smaller models from a complete BERT transformer (Sanh et al. 2019). Zhou and Quan (2023) provide a recent review of work on compressing deep neural networks that covers the four main approaches found in the literature (pruning, quantization, factorization, and distillation) and conclude that optimization approaches that combine these different compression approaches are an emerging area of research.
The third stream of research focuses on improving the efficiency of the training regime. Work in this stream generally focuses on modifying one or more of the following components of the training regime: the ordering (i.e., curriculum learning) or selection of the training data presented to the model (Jiang et al. 2019; Mindermann et al. 2022; Xie et al. 2023; Wang et al. 2023; Yang et al. 2023; Wang et al. 2024); dynamically modifying the architecture of the model as part of the training process (Gong et al. 2019; Zhang and He 2020; Pan et al. 2023; Ding et al. 2023); modifying the objective function (Anil et al. 2020; Goldfarb et al. 2020; Eschenhagen et al. 2023); and improving the optimization algorithm (Liu et al. 2023; Chen et al. 2023). Kaddour et al. (2023) report a recent empirical study of the effectiveness of several of these efficient training approaches against a baseline training regime that used the Adam optimizer with a fully decayed learning rate. These experiments used a fixed computation budget based on wall time (calculated by multiplying the number of training iterations by the time per iteration for that architecture and training regime on a reference hardware system) as the criterion for stopping training. Three budgets were used for each experiment: 6 hours, 12 hours, and 24 hours. The results indicate that the tested training modifications did not statistically outperform the baseline in most experiments, and when they did, the improvement shrank as the computing budget increased.
The fourth stream of research is focused on developing measures and methodologies for assessing the performance or efficiency of an AI solution for a given problem. One focus within this stream has been hardware efficiency (see, e.g., Davis et al. 2009; Sze et al. 2020). Another focus is performance or efficiency during inference. Frequently, this work prunes models during training to improve efficiency at inference; see, e.g., Liu et al. (2017) and Han et al. (2015), which both use the reduction in floating point operations per inference as a measure of how their pruning approaches improve network efficiency. Examples of work in this area that are relevant to this paper include Canziani et al. (2016) and Jurj et al. (2020). Both propose measures of efficiency during inference, and, particularly relevant for this work, both use a direct measure of energy consumed (rather than FLOPs) as the measure of resource usage (work done) when calculating efficiency. Similarly, Desislavov et al. (2021) examine the trends in computational and energy costs associated with deep learning model inference and assess whether the exponential growth in model parameters translates into a proportional increase in energy consumption. Their analysis considers algorithmic improvements and hardware advancements to understand their impact on energy consumption, and they conclude that algorithmic advancements and hardware specialization have significantly improved the energy efficiency of DNNs.
The work most relevant to this research is focused on efficiency during model training. As noted in Schwartz et al. (2020), in the research setting, training occurs much more frequently than post-deployment inference, so understanding efficiency during training is in and of itself an important topic. Indeed, Schwartz et al. (2020) review several different measures of efficiency or work done during training (including carbon emissions, electricity usage, elapsed real time, number of parameters, and floating point operations (FLOPs)) and argue that FLOPs is the fairest measure to use when comparing different approaches. They attribute two properties to FLOPs in support of this argument: (a) FLOPs directly measure the work done when running a specific instance of a model and are therefore related to the energy consumed, and (b) they are agnostic to the hardware on which the model is run. However, metrics based on counts of operations performed by a neural network require hardware profiling, and this is computationally expensive to perform (Mills et al. 2021). Consequently, developing a metric for training efficiency that does not require hardware profiling is desirable. Bartoldson et al. (2023) present a recent review of the most commonly used metrics in efficiency research, including training time, FLOPs, number of model parameters, electricity usage, carbon emissions, and operand sizes. Overall, they found that all these metrics have significant limitations in either not directly measuring the factors of interest or being dependent on confounding factors such as hardware, time, etc. Finally, we note that all of the metrics discussed above (be it FLOPs, emissions (Strubell et al. 2020), or wall time (Li et al. 2020)) do not consider model accuracy on a task and so do not measure efficiency per se, but rather estimate work done. We propose a novel efficiency metric that considers the relationship between accuracy and work/resource usage.
To avoid the hardware profiling challenges associated with FLOP-based measures, we looked for alternative measures of energy consumption and efficiency during training. Li et al. (2016) explore the power behavior and energy consumption of several CNN architectures on both CPUs and GPUs, with a particular focus on characterizing the energy consumption of different layer types (convolution, pooling, ReLU, and so on) during training. Similar to Li et al. (2016) (and to Canziani et al.'s work on inference efficiency (Canziani et al. 2016) and Strubell et al.'s work on predicting emissions (Strubell et al. 2020)), we propose using energy consumed rather than FLOPs as our measure of work done/resource usage. Also, like Canziani et al. (2016), we are interested in measuring efficiency, that is, the relationship between performance (e.g., accuracy) and resource usage (e.g., energy consumed); however, we focus on the training phase rather than on inference. Furthermore, like Strubell et al. (2020) and Li et al. (2016), we focus on the training phase, but we go beyond measuring the energy consumed in training a specific model and propose a measure of the relative efficiency of a neural architecture (distinct from a specific model) on a given task. We compare the LeNet CNN architecture against a Bayesian Convolutional Network (BCNN) as a test case for our efficiency measure. We chose this comparison because BCNNs are trained via approximate variational inference rather than standard backpropagation-based point estimation, and we conjecture that this comparison may reveal interesting interactions between training regimes and model efficiency.
Defining an efficiency measure for deep neural networks
The concept of efficiency is fundamental to this work:
Definition 1
Efficiency measures a system’s capacity to achieve a goal (measured by a metric) with a given amount of resources.
When considering the training efficiency of a neural network on a learning task, it is natural to consider how the accuracy of the network architecture varies as the energy consumed for training changes. This is the efficiency ratio that Equation 1 defines and that Figure 1 illustrates (in this figure, the arrow represents an efficiency calculation, in the form of Equation 1, where the arrow points from the denominator to the numerator).
$$\mathit{Efficiency} = \frac{\mathit{Accuracy}}{\mathit{Energy}} \tag{1}$$
However, it is difficult to directly calculate a general estimate of the ratio of accuracy to energy for a given neural architecture on a task because the ratio depends on the metrics used to measure energy and accuracy and is sensitive to hyperparameter decisions (e.g., network size) and training regime decisions (e.g., convergence criteria). Consequently, in this section, we set out a methodology for calculating this efficiency ratio by averaging across a sequence of experiments that allow for hyperparameter and training regime variations. We then use these results to compute our final measure.
Metrics for energy and accuracy
Deciding which system components to report energy consumption over is not trivial. For example, although the CPU, GPU, and memory are natural system components to consider when tracking energy consumption during the training of a network, other parts of the system, such as fans, buses, and transistors, also consume energy related to training (Huang et al. 2019; Li and John 2003). However, due to the difficulties in measuring the energy consumption of these secondary or satellite components, we focus our analysis on the energy consumed by the GPU, CPU, and RAM during our experiments.
Several different measures could be used to quantify these components' energy usage. For example, one family of energy measures often used in neural network research is based on counting the number of computational operations; for example, Schwartz et al. suggest using the number of FLOPs (Schwartz et al. 2020). FLOPs, however, are only one of many types of operation that could be counted; data movement operations can be much more expensive in terms of energy consumption (Horowitz 2014; Sze et al. 2020). A further challenge with tracking energy consumption by counting operations is that the energy consumed by an operation is affected by the sparsity of the data being processed and the data representation being used (Zheng and Mazumder 2019). For example, switching from 32-bit to 16-bit floating point reduces the energy cost of FLOP operations (and in some cases, this can be done with negligible impact on model accuracy (Micikevicius et al. 2017)); it also reduces energy consumption through reduced data movement (i.e., reduced memory bandwidth) and reduced energy per memory access (due to smaller memories).
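As an aside, the following sketch illustrates how such a switch to 16-bit arithmetic might be made in practice with PyTorch's automatic mixed precision; it is purely illustrative and is not part of the experimental setup used in this paper, and the `train_step` function and its arguments are our own assumptions.

```python
import torch

# Illustrative only: automatic mixed precision runs most forward-pass arithmetic in
# FP16, reducing both the energy per operation and the memory traffic per access.
scaler = torch.cuda.amp.GradScaler()

def train_step(model, inputs, targets, optimizer, loss_fn):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()   # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```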
In our experiments, the hardware used was a Tesla T4 GPU with 15109 MiB of memory, provided by Google Colab (driver version 470.63, CUDA version 11.2). Energy measurements for the GPU were collected using the NVIDIA System Management Interface (nvidia-smi) version 460.39, and the energy consumed by the CPU and RAM during training was recorded using powertop, a native Linux system tool. In each experiment, we used these tools to repeatedly sample and record the rate of energy consumption of the GPU, CPU, and RAM as each network was being trained. We then calculate the efficiency of the trained model as the ratio between the performance obtained by the model and the total energy consumed to train it, as follows:
$$\mathit{Efficiency}_i = \frac{Acc_i}{\sum_{j=1}^{i} E_j} \tag{2}$$

where $Acc_i$ is the accuracy obtained at epoch $i$ of training and $\sum_{j=1}^{i} E_j$ is the sum of the energy samples recorded up to and including epoch $i$ (the sequences of samples collected in each epoch are concatenated and summed).
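For concreteness, a minimal sketch of Equation 2, assuming the per-epoch accuracies and per-epoch energy totals have already been recorded (the function and variable names are ours):

```python
def per_epoch_efficiency(acc_per_epoch, energy_per_epoch):
    """Equation 2: accuracy at epoch i divided by the cumulative energy
    consumed up to and including epoch i."""
    efficiencies, cumulative_energy = [], 0.0
    for acc, energy in zip(acc_per_epoch, energy_per_epoch):
        cumulative_energy += energy
        efficiencies.append(acc / cumulative_energy)
    return efficiencies
```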
The selection of an appropriate measure of model performance depends on the task type (e.g., classification, regression, segmentation, and so on) and on factors such as the distribution of class labels within the data (Kelleher et al. 2020). In the experiments reported in this paper, the tasks are classification tasks with balanced label distributions, so we use simple accuracy as the performance measure. Specifically, we report a model's accuracy (Acc) on the test set after training has converged. In experiments where we use a hold-out test set methodology, Acc is simply the accuracy of the trained model on the test set. In experiments where we use a k-fold cross-validation methodology, Acc is the mean accuracy across the k validation folds.
Figure 2 illustrates the relationship between these measures. As above, the arrows in this figure represent efficiency calculations, with each arrow pointing from the denominator to the numerator. The dashed arrow highlights the overall efficiency we wish to calculate: the average task accuracy obtained per unit of energy expended in training.
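A minimal sketch of how the GPU power draw can be sampled in a background thread with nvidia-smi while a model trains (sampling of CPU and RAM energy via powertop is omitted; the helper and variable names are ours):

```python
import subprocess
import threading
import time

def sample_gpu_power(samples, stop_event, interval_s=0.1):
    """Append instantaneous GPU power draw readings (watts) to `samples` until stopped."""
    while not stop_event.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
            capture_output=True, text=True)
        samples.append(float(out.stdout.strip().splitlines()[0]))
        time.sleep(interval_s)

# Usage: start sampling, train the model, then stop and sum the samples.
samples, stop = [], threading.Event()
sampler = threading.Thread(target=sample_gpu_power, args=(samples, stop))
sampler.start()
# ... training loop runs here ...
stop.set()
sampler.join()
total_energy = sum(samples)   # summed power samples, as used in Equation 2
```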
Allowing for hyperparameter variations: model size
To experimentally control for the effect of model size, we propose to run each experiment multiple times for each neural architecture using a different model size in each run, and, for each model size, to record both the total energy consumed during training and the accuracy obtained by the model. We then calculate the efficiency of each model on an experimental task as the ratio of its accuracy to the total energy consumed to train it. Finally, we calculate the efficiency of a neural architecture on an experimental task as the mean efficiency of the models implementing that architecture on the task. Figure 3 illustrates how model size is included in the experimental design, and Equation 3 defines how we integrate model size into the calculation of the training efficiency of a network architecture.
$$\mathit{Efficiency}_{\mathit{arch}} = \frac{1}{N}\sum_{s=1}^{N} \frac{Acc_s}{\mathit{Energy}_s} \tag{3}$$

where $N$ is the number of model sizes tested, and $Acc_s$ and $\mathit{Energy}_s$ are, respectively, the accuracy obtained by the model of size $s$ and the total energy consumed to train it.
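A corresponding sketch of Equation 3, taking the final accuracy and total training energy recorded for each model size (the function and argument names are ours):

```python
def architecture_efficiency(size_results):
    """Equation 3: mean of accuracy / total training energy across model sizes.
    `size_results` is a list of (accuracy, total_energy) pairs, one per model size."""
    return sum(acc / energy for acc, energy in size_results) / len(size_results)
```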
Training regime variations: convergence criteria
The training efficiency of a network (accuracy/energy) is likely to vary as training progresses; in other words, the gain in model accuracy per unit of energy expended is likely to change between the early and later epochs of training. At the same time, the amount of time a network is trained for will vary depending on the convergence criteria used to stop training. To control for this, we define four different convergence criteria and run each experiment with each of these criteria (in combination with our N model size variations, we run each experiment N times for each of the four convergence criteria). We then calculate the overall training efficiency of a network architecture on a task by first calculating the network efficiency for each convergence criterion using Equation 3 and then calculating the expected value across these efficiency scores.
The four convergence criteria we define are:
1. Train for a preset number of epochs; in our experiments, we set this to 50 epochs.
2. Train until the model achieves a preset accuracy on a validation set; in our experiments, we set the accuracy target to 99%.
3. Use early stopping as the training convergence method, i.e., track model accuracy on the validation set across consecutive training epochs and stop training if accuracy does not increase across a preset number of epochs (known as the patience parameter). In our experiments, we used a patience of 3.
4. Stop training after a preset energy budget has been consumed; in our experiments, we set the energy budget to 100 kW (i.e., 100,000 W of summed energy samples). A combined sketch of these four end-of-epoch checks is given below.
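The sketch below illustrates the four checks as they might be evaluated at the end of each training epoch; the function and argument names are our own, and the threshold defaults follow the settings listed above.

```python
def should_stop(criterion, val_acc_history, energy_samples,
                max_epochs=50, target_acc=0.99, patience=3, energy_budget=100_000.0):
    """Return True if training should stop at the end of the current epoch."""
    epoch = len(val_acc_history)                      # epochs completed so far
    if criterion == "epochs":
        return epoch >= max_epochs
    if criterion == "accuracy":
        return val_acc_history[-1] >= target_acc
    if criterion == "early_stopping":                 # no improvement for `patience` epochs
        best_epoch = val_acc_history.index(max(val_acc_history))
        return (epoch - 1) - best_epoch >= patience
    if criterion == "energy":                         # cumulative energy budget exhausted
        return sum(energy_samples) >= energy_budget
    raise ValueError(f"unknown criterion: {criterion}")
```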
Figure 4 illustrates how these convergence criteria are integrated into the experimental setup, and Equation 4 defines how we calculate an overall mean training efficiency for a network architecture that accounts for both model size and convergence criteria.
$$\overline{\mathit{Efficiency}}_{\mathit{arch}} = \frac{1}{|C|}\sum_{c \in C} \mathit{Efficiency}_{\mathit{arch},c} \tag{4}$$

where $C$ is the set of four convergence criteria and, in the case of Equation 4, $\mathit{Efficiency}_{\mathit{arch},c}$ is computed as in Equation 3 under convergence criterion $c$.
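And a sketch of Equation 4, averaging the Equation 3 scores over the convergence criteria (reusing `architecture_efficiency` from the sketch above; names are ours):

```python
def overall_training_efficiency(results_by_criterion):
    """Equation 4: expected value of the Equation 3 efficiency across stopping criteria.
    `results_by_criterion` maps a criterion name to its list of (accuracy, total_energy)
    pairs, one pair per model size."""
    scores = [architecture_efficiency(size_results)
              for size_results in results_by_criterion.values()]
    return sum(scores) / len(scores)
```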
Case study: convolutional and Bayesian convolutional architectures
In this case study, we demonstrate the use of our efficiency framework by comparing the efficiency of a CNN (LeNet) with that of a Bayesian Convolutional Network (BCNN). The BCNN is trained using approximate variational inference, implemented via dropout. Similar to the experiments reported in the original BCNN paper (Gal and Ghahramani 2015), we use the LeNet-5 architecture from Lecun et al. (1998) as the baseline architecture for our experiments. Following Gal and Ghahramani (2015), the corresponding Bayesian version of the LeNet baseline was created by applying dropout with a probability of 0.5 after all convolution and weight layers (i.e., this is the model called "lenet-all" in Gal and Ghahramani (2015)). Tables 1 and 2 report the hyperparameters used to train the LeNet and BCNN models (note: we use the same hyperparameter settings as reported for the experiments performed by Gal and Ghahramani (2015)).
Table 1. Training hyperparameters for the LeNet models.

Architecture | LeNet-5 |
---|---|
epochs | 50 |
learning rate | 0.001 |
num workers | 4 |
batch size | 256 |
activation | soft plus |
loss | cross-entropy |
optimiser | ADAM |
initialization | Normal (mean: 0, variance: 1) |
Table 2. Training hyperparameters for the BCNN models.

Architecture | LeNet-5 (Bayesian filters) |
---|---|
epochs | 50 |
learning rate | 0.001 |
num workers | 4 |
batch size | 256 |
activation | soft plus |
loss | cross-entropy |
optimiser | ADAM |
sample size | |
train ensemble | 1 |
test ensemble | 1 |
 | 0.1 |
prior μ | 0.0 |
prior σ | 0.1 |
posterior μ init (mean, std) | (0, 0.1) |
posterior ρ init (mean, std) | (−5, 0.1) |
The two models described above are the baseline versions of the models used in our experiments. However, in each of our experiments, we vary the model size and the convergence criteria to explore and contrast, for each architecture, the trade-offs between size and accuracy and between size and efficiency. In the Bayesian case, two ways of approximating the posterior probability distribution exist: Variational Inference (VI) and Markov Chain Monte Carlo (MCMC). VI generally performs well computationally but is not a great estimator, whereas MCMC can be computationally expensive but is an excellent estimator (Charnock et al. 2020). In our experiments, we estimate the posterior using variational inference. The best strategy for scaling a model (depth versus width) is an open research challenge. However, because both architectures we consider here are convolutional networks, we decided to scale the models by increasing the number of filters used in each layer. In other words, we scaled the width of the models, and we did this by multiplying the number of filters in each layer by integer multiples (from one up to five times) of the original baseline size. This means that in our experiments, we test five versions of the LeNet architecture: LeNet-1, the baseline architecture, is the same as that reported in Lecun et al. (1998); LeNet-2 has twice the number of filters in each layer as LeNet-1; LeNet-3 has three times the number of filters; and so on, up to LeNet-5 with five times the number of filters. Similarly, BCNN-1 is the baseline Bayesian architecture from Gal and Ghahramani (2015) and has the same size as LeNet-1, and BCNN-2 through BCNN-5 are scaled to match their corresponding LeNet counterparts in size and structure.
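As an illustration of this width scaling, the following PyTorch-style sketch builds a LeNet-5-like model whose per-layer filter and unit counts are multiplied by a width factor, with an optional dropout-after-every-layer variant mirroring the "lenet-all" BCNN configuration. The exact layer sizes and the `bayesian` flag are illustrative assumptions, not the precise implementation used in our experiments.

```python
import torch.nn as nn

def scaled_lenet(width=1, in_channels=1, num_classes=10, bayesian=False, p=0.5):
    """LeNet-5-style CNN with every filter/unit count multiplied by `width`
    (LeNet-1 ... LeNet-5). With bayesian=True, dropout (p=0.5) is applied after
    every convolution and weight layer, as in the "lenet-all" BCNN variant."""
    drop = (lambda: nn.Dropout(p)) if bayesian else (lambda: nn.Identity())
    return nn.Sequential(
        nn.Conv2d(in_channels, 6 * width, kernel_size=5), drop(), nn.Softplus(), nn.MaxPool2d(2),
        nn.Conv2d(6 * width, 16 * width, kernel_size=5), drop(), nn.Softplus(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.LazyLinear(120 * width), drop(), nn.Softplus(),
        nn.Linear(120 * width, 84 * width), drop(), nn.Softplus(),
        nn.Linear(84 * width, num_classes),
    )

lenet_3 = scaled_lenet(width=3)               # three times the baseline filter counts
bcnn_3 = scaled_lenet(width=3, bayesian=True)
```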
All four efficiency experiments were performed on the MNIST [64] and CIFAR-10 [65] datasets, using the same hyperparameters for the architectures on both datasets. Both datasets are ten-class image classification tasks. MNIST is a handwritten digit recognition task, with each image containing a single digit between 0 and 9; it contains 70,000 images, each a 28x28 pixel gray-scale image. CIFAR-10 is an object recognition task with ten classes and 6,000 images per class, each a 32x32 pixel color image. The original experiments with MNIST and CIFAR-10 used different experimental methods: MNIST used a single training and test split, whereas CIFAR-10 used a six-fold cross-validation methodology. In our experiments, we follow the same experimental methodology for each dataset as reported in the original experiments. Consequently, for the MNIST dataset, a single split was used, with a training set of 60,000 handwritten digits and a test set of 10,000; the label distributions in both the training and test sets are balanced across all ten digits. So, in all of our experiments, when we report an accuracy on the MNIST data, this is the accuracy obtained by the model on the single hold-out test set. By contrast, for the CIFAR-10 dataset, we use a six-fold cross-validation methodology in each experiment, where each fold contains exactly 1,000 randomly selected images from each class, and the reported accuracy is the average accuracy of an architecture across these folds.
Results from the case study
This section presents the results for the 50 epoch, early-stopping, energy-bound, and accuracy-bound experiments. For each experiment, dataset, and neural architecture, we present a table showing the efficiency calculation across model sizes under the convergence criterion specified in that experiment (using Equation 3). Note that in the supplementary material we present, for each experiment, plots of the training and test accuracy by training epoch for each model.
50 epoch experiment
In this first experiment, the stopping criterion for training was set at 50 epochs. Each architecture (LeNet and BCNN) was run a total of 10 times: once for each of the 5 model sizes (LeNet-1 to LeNet-5, and BCNN-1 to BCNN-5) on each of the two datasets (MNIST and CIFAR). During each run of the experiment, we repeatedly recorded the energy being consumed and the amount of memory (GPU and RAM) being used (recorded as the model's size in MiB in RAM and GPU memory).
Table 3 and Table 4 show the efficiency calculation using a convergence criterion of 50 epochs. Note that for the CIFAR dataset, we use a six-fold cross-validation methodology, so the accuracy reported for each model size i in Table 4 is the average accuracy for that model size across the six validation sets after training has converged.
Table 3. Efficiency calculation (Equation 3) for the 50 epoch convergence criterion on MNIST.

Model | Epochs | Accuracy | Total energy (W) | Efficiency |
---|---|---|---|---|
BCNN-1 | 50 | | | |
BCNN-2 | 50 | | | |
BCNN-3 | 50 | | | |
BCNN-4 | 50 | | | |
BCNN-5 | 50 | | | |
LeNet-1 | 50 | | | |
LeNet-2 | 50 | | | |
LeNet-3 | 50 | | | |
LeNet-4 | 50 | | | |
LeNet-5 | 50 | | | |
Table 4. Efficiency calculation (Equation 3) for the 50 epoch convergence criterion on CIFAR (accuracy averaged across the six validation folds).

Model | Epochs | Accuracy | Total energy (W) | Efficiency |
---|---|---|---|---|
BCNN-1 | 50 | | | |
BCNN-2 | 50 | | | |
BCNN-3 | 50 | | | |
BCNN-4 | 50 | | | |
BCNN-5 | 50 | | | |
LeNet-1 | 50 | | | |
LeNet-2 | 50 | | | |
LeNet-3 | 50 | | | |
LeNet-4 | 50 | | | |
LeNet-5 | 50 | | | |
Early-stopping experiment
This experiment has the same design as the 50 epoch experiment presented above, with a single change to the convergence criterion used for training: in this experiment, we use early stopping on validation accuracy (with a patience of 3).
For the MNIST dataset, Table 5 lists the efficiency calculation using Equation 3. For CIFAR, Table 6 presents the efficiency calculation using Equation 3.
Table 5. Efficiency calculation (Equation 3) for the early stopping convergence criterion on MNIST.

Model | Epochs | Accuracy | Total energy (W) | Efficiency |
---|---|---|---|---|
BCNN-1 | 65 | | | |
BCNN-2 | 21 | | | |
BCNN-3 | 37 | | | |
BCNN-4 | 53 | | | |
BCNN-5 | 65 | | | |
LeNet-1 | 16 | | | |
LeNet-2 | 12 | | | |
LeNet-3 | 56 | | | |
LeNet-4 | 28 | | | |
LeNet-5 | 20 | | | |
Table 6. Efficiency calculation (Equation 3) for the early stopping convergence criterion on CIFAR.

Model | Epochs | Accuracy | Total energy (W) | Efficiency |
---|---|---|---|---|
BCNN-1 | 61 | | | |
BCNN-2 | 41 | | | |
BCNN-3 | 41 | | | |
BCNN-4 | 21 | | | |
BCNN-5 | 81 | | | |
LeNet-1 | 56 | | | |
LeNet-2 | 40 | | | |
LeNet-3 | 24 | | | |
LeNet-4 | 24 | | | |
LeNet-5 | 28 | | | |
Energy bound experiment
In this experiment, training was stopped when the energy samples recorded for a training run cumulatively summed to 100,000 W. Apart from this, the design of the experiment is the same as those reported in the previous two sections.
Mirroring the results from the previous experiments, Table 7 lists the efficiency calculation for the MNIST dataset using Equation 3, and Table 8 presents the corresponding calculation for CIFAR. Note that some of the values for total energy listed in the results for this experiment are above the training convergence criterion of 100,000 W. These values are correct values from the experiment: although we sample throughout the training process (the average sampling rate for energy was 973 samples per second for the NVIDIA system and 1052 samples per second for the AMD system), we check the cumulative amount of energy consumed during training only at the end of each epoch. Consequently, the energy consumed during a training run can exceed the stopping threshold if the process crosses that threshold partway through an epoch.
Table 7. Efficiency calculation (Equation 3) for the energy bound convergence criterion (100,000 W) on MNIST.

Model | Epochs | Accuracy | Total energy (W) | Efficiency |
---|---|---|---|---|
BCNN-1 | 19 | | | |
BCNN-2 | 13 | | | |
BCNN-3 | 10 | | | |
BCNN-4 | 8 | | | |
BCNN-5 | 6 | | | |
LeNet-1 | 43 | | | |
LeNet-2 | 39 | | | |
LeNet-3 | 27 | | | |
LeNet-4 | 23 | | | |
LeNet-5 | 21 | | | |
Table 8. Efficiency calculation (Equation 3) for the energy bound convergence criterion (100,000 W) on CIFAR.

Model | Epochs | Accuracy | Total energy (W) | Efficiency |
---|---|---|---|---|
BCNN-1 | 19 | | | |
BCNN-2 | 14 | | | |
BCNN-3 | 11 | | | |
BCNN-4 | 9 | | | |
BCNN-5 | 7 | | | |
LeNet-1 | 39 | | | |
LeNet-2 | 36 | | | |
LeNet-3 | 32 | | | |
LeNet-4 | 25 | | | |
LeNet-5 | 21 | | | |
Accuracy bound experiment
The convergence criterion used in this experiment was to stop training when a model reached a specified accuracy threshold. For the MNIST dataset, the accuracy threshold was applied to the training set; for the CIFAR dataset (where we used a six-fold cross-validation methodology), training on each fold was stopped when the model reached the accuracy threshold on the training data for that fold. We used a lower accuracy threshold for CIFAR because a higher threshold required training to run for longer than our Colab account allowed; if this time limit was exceeded, training was interrupted and the results were lost.
For MNIST, Table 9 lists the efficiency calculation using Equation 3. Similarly, for CIFAR, Table 10 presents the efficiency calculation using Equation 3.
Table 9. Efficiency calculation (Equation 3) for the accuracy bound convergence criterion on MNIST.

Model | Epochs | Accuracy | Total energy (W) | Efficiency |
---|---|---|---|---|
BCNN-1 | 72 | | | |
BCNN-2 | 69 | | | |
BCNN-3 | 69 | | | |
BCNN-4 | 77 | | | |
BCNN-5 | 80 | | | |
LeNet-1 | 12 | | | |
LeNet-2 | 8 | | | |
LeNet-3 | 6 | | | |
LeNet-4 | 6 | | | |
LeNet-5 | 4 | | | |
Table 10. Efficiency calculation (Equation 3) for the accuracy bound convergence criterion on CIFAR.

Model | Epochs | Accuracy | Total energy (W) | Efficiency |
---|---|---|---|---|
BCNN-1 | 51 | | | |
BCNN-2 | 37 | | | |
BCNN-3 | 30 | | | |
BCNN-4 | 36 | | | |
BCNN-5 | 33 | | | |
LeNet-1 | 7 | | | |
LeNet-2 | 4 | | | |
LeNet-3 | 4 | | | |
LeNet-4 | 3 | | | |
LeNet-5 | 3 | | | |
Analysis of experimental data
This section presents the analysis of the data obtained from our experiments regarding how efficiency behaves as training progresses, the relationship between model size and efficiency, and the relative overall efficiency of the LeNet and BCNN architectures.
Efficiency as training progresses
Figure 5 and Figure 6 plot, for each of the models trained (LeNet sizes 1–5 and BCNN sizes 1–5), how the efficiency of the model changes across epochs as training progresses. We base this analysis solely on the results from the 50 epoch experiment because, in this experiment, all sizes of both architectures were trained for the same number of epochs. As a result, the x-axis, which records the training epochs, goes from 0 to 50 in both figures. The y-axis plots the efficiency of a model at a given epoch as defined by Equation 2, i.e., the ratio of a model's performance on a validation set after epoch i of training to the cumulative energy expended in training the model up to that point.
From these results, we observe that efficiency decreases as training progresses. Although we would expect the performance of a model to improve as training progresses, the rate of improvement tends to decrease over time. After a certain number of epochs, performance plateaus and further training expends energy with little or no gain in performance. Notice that the plots in Figure 5 drop more steeply than those in Figure 6. This reflects the fact that on the simpler MNIST dataset, model performance saturates very early on, whereas on the more complex CIFAR dataset, it takes more epochs for the models to reach this performance saturation point.
The relative difficulty of the two datasets is also reflected in the differences in the y-axis scales between Figure 5 and Figure 6. The maximum efficiency recorded for any model at any epoch on MNIST is above 0.02, whereas on CIFAR, it is below 0.01. The primary driver of this difference is that on MNIST the models achieved accuracies of 0.97–0.99 (see Table 3), whereas on CIFAR the BCNN models achieved accuracies of 0.43–0.53 and the LeNet models 0.66–0.80 (see Table 4).
Finally, comparing Figure 5 with Figure 6, it is apparent that the gap between the plots for the LeNet models and the BCNN models is larger in Figure 6. This suggests that as a learning task becomes more complex, differences in efficiency become more pronounced.
Relationship between stopping criteria and efficiency, and model size and efficiency
The results presented in Tables 3–10 reveal significant variation in architecture efficiency across different stopping criteria. Note that this analysis averages over the variation in efficiency across model sizes (Equation 3). The variation is particularly noticeable on the MNIST dataset. Table 11 summarizes (from Tables 3, 5, 7 and 9) the efficiency results for both architectures across the four stopping criteria on the MNIST dataset. Examining the results for LeNet, the maximum efficiency (0.00002610) is obtained using an accuracy bound stopping criterion, and the minimum efficiency (0.00000809) is recorded using the 50 epoch criterion. This means that LeNet is, averaging across model sizes, approximately 3.22 times more efficient on MNIST when the accuracy bound criterion is applied than under the 50 epoch criterion. A similar variation in efficiency across stopping criteria is observable for the BCNN architecture, although the criteria that produce the maximum and minimum values differ. For the BCNN architecture on MNIST, an energy bound stopping criterion gives the maximum efficiency of 0.00000618, compared to the minimum efficiency of 0.00000100 under early stopping, a variation in efficiency of 6.18 times. More generally, we observe a complex non-linear interaction across architectures and convergence criteria, as shown in Figure 7, which plots the LeNet versus BCNN efficiency scores by convergence criterion. The within-architecture efficiency variation across stopping criteria and the complex interactions across architectures and stopping criteria highlight the need to include multiple stopping criteria within the efficiency framework.
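The two within-architecture ranges quoted above correspond to the ratios:

$$\frac{0.00002610}{0.00000809} \approx 3.2 \qquad\text{and}\qquad \frac{0.00000618}{0.00000100} = 6.18$$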
Table 11. Mean efficiency (across model sizes, Equation 3) of each architecture on the MNIST dataset for each stopping criterion, summarizing Tables 3, 5, 7, and 9.

 | LeNet | BCNN |
---|---|---|
50 Epoch | 0.00000809 | |
Early Stopping | | 0.00000100 |
Energy Bounded | | 0.00000618 |
Acc. Bounded | 0.00002610 | |
Analyzing the relationship between stopping criteria and efficiency in more detail, Figure 8 and Figure 9 visually summarize the efficiency analysis results from the 50 epoch (50), early stopping (est), energy bound (wat), and accuracy bound (acc) experiments. In these figures, the x-axis indicates the stopping criterion of the models being assessed, and the y-axis shows the per-model-size efficiency results used in Equation 3. There are five model sizes for each architecture in each experiment, so each box plot contains five efficiency results, with one box plot per stopping criterion.
Both Figure 8 and Figure 9 show that different stopping criteria profoundly influence efficiency. Variations in stopping criteria affect both the width of the distributions of efficiencies for each architecture and the distance between these distributions. For example, stopping criteria that bound energy, i.e., the energy bound (wat) and 50 epoch experiments, appear to squash the distributions of the model efficiencies of each architecture, whereas under stopping criteria based on accuracy bounds, i.e., the early stopping (est) and accuracy bound (acc) experiments, the efficiency distributions are wider, particularly for the LeNet architecture. This energy bound versus accuracy bound categorization of stopping criteria is also predictive of the gap between the LeNet and BCNN distributions, with the accuracy bound experiments (est and acc) exhibiting a larger gap between the distributions for the two architectures than the energy bound (wat and 50 epoch) experiments.
This suggests a trade-off between these two categories of stopping criteria for measuring architecture efficiency. Energy-bound experiments generate narrow efficiency distributions across model sizes, resulting in narrow confidence intervals around the mean efficiency for a given architecture. However, this tighter confidence is offset by the smaller gap between the efficiency distributions of the architectures. By contrast, the accuracy-bound experiments are more sensitive to differences in efficiency between architectures, but the broader distribution per architecture results in wider confidence intervals around the mean efficiency. In order to balance this trade-off, we suggest using both types of stopping criteria when measuring efficiency (as done by Equation 4).
The squashing of the distributions when energy-bound stopping criteria are used suggests that, for each architecture, a roughly fixed amount of accuracy is obtained per unit of energy, independent of model size. In other words, when the stopping criterion bounds energy, varying model size will not substantially impact the overall architecture efficiency on a learning task. However, when the stopping criterion is based on accuracy, varying the model size will significantly impact the overall architecture efficiency on a learning task.
Figure 10 and Figure 11 show how different model sizes compare to each other. The efficiency of the LeNet architecture is particularly sensitive to size; however, there is no clear trend between size and efficiency. The efficiency of the BCNN architecture is less sensitive to size variation, so the visualizations are less helpful for this architecture. Nevertheless, examining the efficiency results reported in Tables 3, 5, 7, and 9 and Tables 4, 6, 8, and 10 reveals no apparent trend between model size and efficiency.
Efficiency of the LeNet architecture against the BCNN architecture
Table 12 presents the overall efficiency calculations for the LeNet and BCNN architectures on the MNIST and CIFAR datasets. These efficiencies are the mean efficiency of each architecture on a dataset across the multiple model sizes and convergence criteria (see Equation 4). On both datasets, LeNet is more efficient than the BCNN architecture.
Table 12. Overall efficiency (Equation 4) of the LeNet and BCNN architectures on MNIST and CIFAR, with the cross-dataset (MNIST/CIFAR) and cross-architecture (LeNet/BCNN) efficiency ratios.

 | MNIST | CIFAR | MNIST/CIFAR |
---|---|---|---|
LeNet | | | 1.82 |
BCNN | | | 2.11 |
LeNet/BCNN | 4.46 | 5.18 | |
Comparing the efficiency of each architecture across the datasets, we can see that both architectures are more efficient on MNIST than on CIFAR. This is due to the relative complexity of CIFAR versus MNIST. In order to check how the efficiency of an architecture varies across datasets, we take the ratio between the efficiency of an architecture on one dataset and its efficiency on another. In this calculation, we take efficiency on MNIST as the numerator because this is the dataset on which both architectures have the highest efficiency. For LeNet, this ratio is 1.822, i.e., the LeNet architecture is 1.822 times more efficient on MNIST than on CIFAR. For BCNN, the ratio is 2.116, i.e., the BCNN architecture is 2.116 times more efficient on MNIST than on CIFAR. These two ratios are close; however, the ratio for LeNet is smaller than that for BCNN, indicating that the LeNet architecture has a smaller decrease in efficiency between MNIST and CIFAR than BCNN. Another perspective on these results is to take the ratio between the two architectures on each dataset. In this case, we use the efficiency of the LeNet architecture as the numerator because this architecture has the highest efficiency on both datasets. For MNIST, this ratio is 4.466, and for CIFAR, it is 5.185. These calculations indicate that on MNIST, LeNet is 4.466 times more efficient than BCNN, whereas on CIFAR, LeNet is 5.185 times more efficient than BCNN. In other words, as the dataset becomes more complex (moving from MNIST to CIFAR), the difference in efficiency between LeNet and BCNN becomes larger.
To summarize: the CIFAR dataset is the more complex dataset; LeNet is the more efficient architecture on both datasets; and when the learning task switches to a more complex dataset, the relative differences in efficiency between the architectures become larger (the less efficient architecture has a larger relative drop in efficiency, and the ratio between the efficiencies of the architectures increases as the task becomes more difficult). This observation relating the difficulty of the task to changes in the efficiency of an architecture aligns with what can be observed in Figure 5 and Figure 6, where there is a larger gap between the LeNet and BCNN plot lines on CIFAR than on MNIST.
This comparison of Bayesian Convolutional Neural Networks (BCNNs) and Convolutional Neural Networks (CNNs) highlights a trade-off in training efficiency. BCNNs seek to enhance generalizability by learning a distribution over models rather than fitting a single model to the data (thereby reducing the risk of overfitting) (MacKay 1995). However, learning this distribution requires repeated sampling of weights during training, which incurs an extra cost in terms of energy. For BCNNs to achieve greater efficiency than CNNs, their generalization improvement must outweigh the increased energy costs incurred during training. Our findings indicate that, for the tasks we have examined, this trade-off results in BCNNs being less efficient than CNNs in terms of accuracy versus energy.
On the risks of overtraining (overfitting)
As discussed in Section 5.1, the efficiency of a neural architecture tends to decay as training progresses; this trend is evident in Figure 5 and Figure 6, where, for both architectures on both datasets, efficiency consistently decreases as training progresses. This trend reflects the fact that model performance saturates after a certain point, and further training expends more energy with no gain in performance. An implication of this is that if a neural model is trained for an extreme number of epochs, the training efficiency of that architecture will tend to zero; furthermore, in such a scenario, comparing the efficiency of different neural architectures is no longer sensible because all architectures will have an efficiency of zero. Put another way, measuring the training efficiency of a neural architecture only makes sense when models are not overtrained.
The most direct definition of overtraining is epochs of training that do not improve model performance. Another, complementary, way of identifying overtraining is through the concept of overfitting. Overfitting occurs when a model learns to perform well on the training data but fails to generalize to unseen data, compromising its efficiency. Overfitting can be checked for by comparing the divergence between a model's performance on training data and on non-training data. To illustrate both overfitting and the impact of overtraining on training efficiency, we extend our 50 epoch experiment to 100 epochs. We then perform two levels of analysis. First, we check whether the models trained for 100 epochs exhibit overtraining (compared to those trained for 50 epochs). Then, we calculate the efficiency of both architectures using the results from the 100 epoch experiment in order to understand how overtraining can affect training efficiency.
We examine two measures to check whether extending training from 50 to 100 epochs results in overtraining a model. First, we check whether the extra training resulted in an appreciable increase in model performance on the test set; if there is no increase in test set performance between the 50th and 100th epoch, we deem the 100 epoch model to be overtrained. Second, if a model does exhibit an increase in test set performance between the 50th and 100th epochs, we check for overfitting by comparing the model's performance on the training data and the test set. The intuition behind this analysis is that the larger the gap between performance on the training data and on the test set, the more likely the model is to be overfitted (and hence overtrained). In more detail, we calculate the difference between a model's training and test performance after 50 epochs of training and after 100 epochs of training, and then calculate the delta between these differences. This delta reveals the extent of divergence between training and test performance caused by the extra 50 epochs of training. Using this delta metric, we deem a model to be overtrained if the delta is of a comparable scale to the increase in the model's test set performance between the 50th and 100th epochs.
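In terms of the quantities reported in Table 13 (columns A and B), these two checks can be written as:

$$A = Acc^{\mathit{test}}_{100} - Acc^{\mathit{test}}_{50}, \qquad
B = \left(Acc^{\mathit{train}}_{100} - Acc^{\mathit{test}}_{100}\right) - \left(Acc^{\mathit{train}}_{50} - Acc^{\mathit{test}}_{50}\right)$$

with a model deemed overtrained when $A \approx 0$, or when $B$ is of a comparable scale to $A$.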
Table 13 presents the performance results used in this analysis. For the 50 and 100 epoch results, the table presents the model performance on the training set, the test set, and the difference between these results. The rightmost two columns of the table (columns A and B) list the difference in test performance between 50 and 100 epochs (calculated as test performance at 100 epochs minus test performance at 50 epochs) and the delta in the differences between training and test performance between 50 and 100 epochs (calculated as the difference between training and test performance at 100 epochs minus the difference between training and test performance at 50 epochs). In order to highlight meaningful differences in columns A and B, we round the results in these columns to two decimal places. If we examine column A, we see that on the MNIST dataset, none of the LeNet models obtains a meaningful increase in test performance between the 50th and 100th epochs. As a result, we consider the LeNet 100 epoch models to be overtrained. The BCNN models on MNIST exhibit a slight increase (0.01 for all models) in test set performance between the 50th and 100th epochs. However, this is accompanied by a comparable increase in the divergence between training and test set performance, so we also deem these BCNN models to be overtrained. Switching focus to the CIFAR dataset, all of the LeNet models exhibit an increase in test performance between the 50th and 100th epochs. However, this is accompanied by a comparable (and in 4 out of 5 cases larger) increase in divergence between training and test performance, so we deem these 100 epoch LeNet CIFAR models to be overtrained. Finally, the BCNN models on the CIFAR dataset all exhibit a relatively large increase in test performance between the 50th and 100th epochs, accompanied by a comparatively slight increase in divergence between training and test performance, so we deem these models not to be overtrained. In summary, our analysis of overtraining after 100 epochs categorized all of the LeNet and BCNN MNIST models and the LeNet CIFAR models as overtrained, and the BCNN CIFAR models as not overtrained.
Table 13. Training and test accuracy after 50 and 100 epochs of training, the train-test difference at each point, the change in test accuracy between 50 and 100 epochs (A), and the change in the train-test difference (B).
50 epoch | 100 epoch | |||||||
---|---|---|---|---|---|---|---|---|
MNIST | Train | Test | Difference | Train | Test | Difference | A | B |
LeNet-1 | 0.99071102 | 0.98506103 | 0.00564999 | 0.99524102 | 0.98694648 | 0.00829454 | 0.00 | 0.00 |
LeNet-2 | 0.99438373 | 0.98787305 | 0.00651068 | 0.99708797 | 0.98906119 | 0.00802678 | 0.00 | 0.00 |
LeNet-3 | 0.99525017 | 0.99059342 | 0.00465675 | 0.99761469 | 0.99154647 | 0.00606822 | 0.00 | 0.00 |
LeNet-4 | 0.99565284 | 0.99153448 | 0.00411836 | 0.99782642 | 0.99276667 | 0.00505975 | 0.00 | 0.00 |
LeNet-5 | 0.99629737 | 0.99057988 | 0.00571749 | 0.99814869 | 0.99128859 | 0.00686010 | 0.00 | 0.00 |
BCNN-1 | 0.95748047 | 0.96715924 | -0.00967877 | 0.97338431 | 0.97560624 | -0.00222193 | 0.01 | 0.01 |
BCNN-2 | 0.96626704 | 0.97242924 | -0.0061622 | 0.97819731 | 0.97928500 | -0.00108769 | 0.01 | 0.01 |
BCNN-3 | 0.96513256 | 0.97038611 | -0.00525355 | 0.97763173 | 0.97727584 | 0.00035589 | 0.01 | 0.01 |
BCNN-4 | 0.96491689 | 0.96950655 | -0.00458966 | 0.97758249 | 0.97698637 | 0.00059612 | 0.01 | 0.01 |
BCNN-5 | 0.96142537 | 0.96888013 | -0.00745476 | 0.97537774 | 0.97666603 | -0.00128829 | 0.01 | 0.01 |
CIFAR | Train | Test | Difference | Train | Test | Difference | A | B
LeNet-1 | 0.59485669 | 0.57332227 | 0.02153442 | 0.65716212 | 0.61330469 | 0.04385743 | 0.04 | 0.02 |
LeNet-2 | 0.70195860 | 0.64035352 | 0.06160508 | 0.76755101 | 0.66876953 | 0.09878148 | 0.03 | 0.04 |
LeNet-3 | 0.76978155 | 0.68373047 | 0.08605108 | 0.83391819 | 0.70827051 | 0.12564768 | 0.02 | 0.04 |
LeNet-4 | 0.80343700 | 0.69496875 | 0.10846825 | 0.86194093 | 0.71430078 | 0.14764015 | 0.02 | 0.04 |
LeNet-5 | 0.82096686 | 0.69295508 | 0.12801178 | 0.88013734 | 0.71263281 | 0.16750453 | 0.02 | 0.04 |
BCNN-1 | 0.33502488 | 0.33920898 | -0.0041841 | 0.43809589 | 0.43243262 | 0.00566327 | 0.09 | 0.01 |
BCNN-2 | 0.44060957 | 0.43411328 | 0.00649629 | 0.50035355 | 0.48572559 | 0.01462796 | 0.05 | 0.01 |
BCNN-3 | 0.45429289 | 0.44414062 | 0.01015227 | 0.52243183 | 0.50286328 | 0.01956855 | 0.06 | 0.01 |
BCNN-4 | 0.45937550 | 0.45482617 | 0.00454933 | 0.53284086 | 0.51709375 | 0.01574711 | 0.06 | 0.01 |
BCNN-5 | 0.46070014 | 0.45722070 | 0.00347944 | 0.53233031 | 0.51748633 | 0.01484398 | 0.06 | 0.01 |
To analyze how overtraining can affect the measurement of training efficiency, we used Equation 3 to calculate the efficiency of both architectures on both datasets based solely on the results of the 100 epoch experiment. The results of these calculations are presented in Table 14. Comparing these results with those listed in Table 12, a consistent finding across both sets of results is that LeNet is more efficient than BCNN on both datasets. Also, for three out of the four categories of models (LeNet and BCNN on MNIST, and LeNet on CIFAR), the training efficiency drops compared with Table 12; this is in line with what would be expected from the trends exhibited in Figure 5 and Figure 6 and discussed in Section 5.1. The one exception to this trend is the BCNN architecture on CIFAR, which shows a slight increase in efficiency. This exception aligns with the findings of our overtraining analysis presented above: if we were to use the efficiency scores presented in Table 14 to compare LeNet and BCNN, we would be comparing overtrained LeNet models against BCNN models, some of which are overtrained (BCNN on MNIST) and some of which are not (BCNN on CIFAR). If we run this (incorrect) comparison through to see how overtraining can affect the overall analysis, we get very different conclusions from those we reached by analyzing Table 12. For example, if we compare the efficiency ratio for each architecture across the two datasets (i.e., MNIST/CIFAR), we see that in Table 14 the LeNet ratio (1.042) is greater than the BCNN ratio (0.367). Similarly, if we compare the efficiency ratio between the two architectures on each dataset (LeNet/BCNN), we see that this ratio is larger for MNIST (3.108) than for CIFAR (1.095). In both cases, the relative size of these ratios has flipped compared with the results reported in Table 12. Taking the ratios in Table 14 at face value, we would (erroneously) conclude that as the learning task becomes more complex (MNIST to CIFAR), the more efficient architecture (LeNet) has a larger drop in efficiency and that the difference in efficiency between the two architectures becomes smaller. However, the underlying phenomenon driving these results is overtraining. Consequently, when assessing the training efficiency of a neural architecture, it is essential to consider overtraining as a factor in the analysis and to be cognizant that overtraining can occur at different points in training for different models on a given training task. One strategy to mitigate the risk of overtraining impacting efficiency analysis is to average over multiple convergence criteria, as we have done in this work.
Table 14.
 | MNIST | CIFAR | MNIST/CIFAR |
---|---|---|---|
LeNet | | | 1.04 |
BCNN | | | 0.36 |
LeNet/BCNN | 3.10 | 1.09 | |
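To make the ratio analysis behind Table 14 easy to reproduce, the following sketch (ours, not taken from the released code for the paper) computes the cross-dataset and cross-architecture ratios from a table of efficiency scores; the scores in the dictionary are placeholder values, not the values underlying Table 14.

```python
# Minimal sketch of the ratio analysis behind Table 14 (illustrative only).
# The efficiency scores below are placeholders, not values from the paper.

efficiency = {
    ("LeNet", "MNIST"): 0.80, ("LeNet", "CIFAR"): 0.77,  # placeholder scores
    ("BCNN", "MNIST"): 0.30, ("BCNN", "CIFAR"): 0.82,    # placeholder scores
}

# Cross-dataset ratio per architecture (MNIST/CIFAR): how efficiency changes
# as the learning task becomes more complex.
for arch in ("LeNet", "BCNN"):
    print(f"{arch} MNIST/CIFAR: "
          f"{efficiency[(arch, 'MNIST')] / efficiency[(arch, 'CIFAR')]:.3f}")

# Cross-architecture ratio per dataset (LeNet/BCNN): relative efficiency on each task.
for dataset in ("MNIST", "CIFAR"):
    print(f"{dataset} LeNet/BCNN: "
          f"{efficiency[('LeNet', dataset)] / efficiency[('BCNN', dataset)]:.3f}")
```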
Conclusions
We present a framework for measuring the training efficiency of a neural architecture on a learning task. This framework involves running multiple experiments but does not require hardware profiling. Moreover, the framework enables a multifaceted analysis of the training efficiency of a neural architecture, including how the efficiency of a model varies across training epochs (Equation 2), how the efficiency of a neural architecture varies with model size (Equation 3), and the overall efficiency of a neural architecture on a learning task, taking into account variations in model size and stopping criteria (Equation 4). Furthermore, the ability to calculate an overall efficiency for a neural architecture on a learning task enables the analysis of the relative efficiency of different neural architectures on a learning task and of how this relative efficiency varies across learning tasks.
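As a concrete illustration of how these quantities fit together, the following sketch is a simplified outline under our own assumptions (it is not the released implementation): it treats efficiency as accuracy divided by the energy consumed, computes it per epoch in the spirit of Equation 2, and averages over model sizes and stopping criteria in the spirit of Equation 4. The exact definitions are given by the equations in the main text.

```python
# Simplified outline of the efficiency quantities (our assumptions, not the released code):
# per-epoch efficiency = accuracy so far / cumulative energy (Eq. 2 spirit),
# architecture-level efficiency = mean over (model size, stopping criterion) runs (Eq. 4 spirit).
from statistics import mean

def epoch_efficiency(accuracy: float, cumulative_energy_joules: float) -> float:
    """Efficiency of a model at a given epoch: accuracy per joule consumed so far."""
    return accuracy / cumulative_energy_joules

def architecture_efficiency(runs: list[dict]) -> float:
    """Average efficiency over runs, one per (model size, stopping criterion) pair.

    Each run record holds the accuracy and the cumulative energy at the epoch
    selected by that run's stopping criterion.
    """
    return mean(epoch_efficiency(r["accuracy"], r["energy_joules"]) for r in runs)
```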
Applying the framework to the case study comparing CNNs with BCNNs on MNIST and CIFAR, we find that the efficiency of both architectures on both learning tasks changes substantially as training progresses (see Section 5.1), with all models exhibiting a drop in efficiency across epochs. The analysis in Section 5.2 reveals non-linear relationships between stopping criteria and training efficiency, and between model size and training efficiency. We observed significant variation in training efficiency across different stopping criteria for both architectures. This variation illustrates the need for multiple stopping criteria within the efficiency framework. Moreover, including multiple convergence criteria within the framework mitigates the risk of overtraining affecting the analysis of the training efficiency of neural architectures (see Section 5.4). More generally, we believe that the potential confounding effect of overtraining on neural training efficiency research is not given sufficient attention in the literature. To take a recent example, Kaddour et al. (2023) report, as a key finding, that the efficiency improvements obtained by several training regime modifications vanished as the compute budget allowed for training increased. However, in their analysis, the authors did not consider that this finding may result from overtraining occurring at different points under different training regimes. Indeed, the more efficient a training regime is, the earlier in the training process overtraining will begin; using a fixed compute budget as a convergence criterion is therefore likely to result in the more efficient training regimes overtraining for longer, with the extra overtraining negating their efficiency benefits. This example illustrates how neglecting the impact of overtraining can directly undermine conclusions drawn from an experiment focused on training efficiency. Regarding the relationship between model size and training efficiency, we find that intermediate-size models have the best efficiency for both architectures and learning tasks. This variation in efficiency with respect to model size highlights the need to include model size within the efficiency framework.
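Returning to the role of stopping criteria discussed above, the sketch below applies two generic criteria (a fixed epoch budget and validation-loss patience; these particular criteria are examples of our own choosing, not necessarily the criteria used in our experiments) to the same per-epoch training log and reports the efficiency at each selected epoch. Averaging across such criteria is what guards against one criterion's overtraining dominating the architecture's score.

```python
# Two example stopping criteria applied to one per-epoch training log (illustrative).
# log: list of dicts with "val_loss", "val_acc", and cumulative "energy_joules" per epoch.

def fixed_budget_epoch(log: list[dict], budget_epochs: int) -> int:
    """Stop after a fixed number of epochs (akin to a fixed compute budget)."""
    return min(budget_epochs, len(log)) - 1

def patience_epoch(log: list[dict], patience: int = 3) -> int:
    """Stop once validation loss has not improved for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, record in enumerate(log):
        if record["val_loss"] < best:
            best, best_epoch, waited = record["val_loss"], epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

def efficiency_at(log: list[dict], epoch: int) -> float:
    """Efficiency at the selected epoch: validation accuracy per joule consumed so far."""
    return log[epoch]["val_acc"] / log[epoch]["energy_joules"]
```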
In terms of overall neural architecture training efficiency on a learning task, we find that CNNs are more efficient than BCNNs on both MNIST and CIFAR and that the difference in efficiency becomes more pronounced as the learning task becomes more complex (see Section 5.3). To test for interactions with hardware, we replicated our experiments and analysis on a second hardware setup. The description of the hardware and the results are presented in the Appendix; the same trends are evident in those results. Overall, we argue that to measure the training efficiency of neural architectures, it is important to consider efficiency variation across model size, the stopping criterion used, and the learning task. In future work, we will explore the application of the framework to other neural architectures and training paradigms. For example, there is a growing body of work exploring parameter-efficient fine-tuning, and applying this framework to these methods could reveal important interactions between the neural architecture and the training regimen. Another potential area of future work emerges from our finding that training efficiency and model size have a non-linear relationship. Given this finding, it may be helpful to consider how efficiency, model size, and model compression methods interact.
Acknowledgements
This work was conducted with the financial support of the Science Foundation Ireland Centre for Research Training in Digitally-Enhanced Reality (d-real) under Grant No. 18/CRT/6224.
Appendix
Hardware comparison
We replicated our experiments on a second hardware setup to demonstrate the generalizability of our framework and findings. Table 15 shows the characteristics of this second (AMD) hardware platform. Due to the more limited capabilities of this platform, the training regime was modified for the CIFAR dataset: instead of using six-fold validation, we used a single 70-30 split of the data (a sketch of this split is given after Table 15). This modification allowed training to complete on the AMD hardware without memory overflow. Apart from this modification, the same training regimen, architectures, and hyperparameters as described in Section 4 were used in these experiments.
Table 15.
AlmaLinux 9.2 (Turquoise Kodkod) x86_64 |
Kernel: 5.14.0-284.11.1.el9_2.x86_64 |
CPU: AMD Ryzen 9 5900HX with Radeon Graphics (16) @ 3.300GHz |
GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series |
GPU: AMD ATI Radeon RX 6700/6700 XT/6750 XT/6800M/6850M XT |
Memory: 3251 MiB / 31496 MiB |
Driver version: 6.1.5 |
ROCm version: 5.4.2 |
Python version: 3.9.16 |
Pytorch version: 2.0.1 |
powerstat version: 0.03.03 |
radeontop version: 1.00 |
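The data-handling change described above (a single 70-30 split of CIFAR-10 in place of six-fold cross-validation) could look like the following minimal sketch; the dataset path, transform, and seed are illustrative choices of ours, not taken from the released code.

```python
# Single 70-30 train/validation split of CIFAR-10 (illustrative; path and seed are arbitrary).
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

cifar = datasets.CIFAR10(root="./data", train=True, download=True,
                         transform=transforms.ToTensor())
n_train = int(0.7 * len(cifar))
train_set, val_set = random_split(
    cifar, [n_train, len(cifar) - n_train],
    generator=torch.Generator().manual_seed(0))  # fixed seed for reproducibility
```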
The experimental data was processed in the same manner as in Section 4.1, and the results obtained are similar to those presented in Section 5. Table 16 shows that the results for both neural architectures on the MNIST and CIFAR datasets are consistent across the two hardware manufacturers: they follow a similar trend and clearly show that the LeNet architecture is more efficient overall than the BCNN architecture, in line with Section 5.3.
Table 16.
 | MNIST | CIFAR | MNIST/CIFAR |
---|---|---|---|
LeNet | | | 0.46 |
BCNN | | | 2.25 |
LeNet/BCNN | 3.35 | 16.41 | |
Figures 12 and 13 follow the analysis presented in Section 5.2, and Figures 14 and 15 show a similar trend. These results confirm that the reported efficiency values and the accompanying analysis are consistent across hardware platforms.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
We note that within the research on improving optimization algorithms, the concept of training efficiency is often framed in terms of the convergence rate achieved by the algorithm for a fixed architecture on a learning task (see, e.g., Kingma and Ba 2014; Ying et al. 2024). By contrast, in this work, we focus on measuring the training efficiency of a neural architecture (rather than of an optimization algorithm) on a learning task.
To demonstrate the applicability of our methodology across different hardware platforms, we replicate the experiments reported in the main body of the paper on different hardware; more details on these experiments can be found in the Appendix.
https://01.org/powertop
measured in terms of Joules per second (Watts)
We use the term model to denote a particular instantiation of a neural architecture.
The repository at https://unix-talk.com/TastyPancakes/bayesiancnn.git contains all of the authors' code for the experiments.
All data in the tables from Sections 4 and 5 is released at the Open Science Foundation.
See the final paragraph of Section 4 for details of the training and test split used for MNIST and the six-fold cross-validation methodology used for CIFAR.
Supplementary material is available at the Open Science Foundation.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Eduardo Cueto-Mendoza and John Kelleher contributed equally to this work.
References
- OpenAI: AI and compute. https://openai.com/index/ai-and-compute/
- Anil R, Gupta V, Koren T, Regan K, Singer Y. (2020) Scalable Second Order Optimization for Deep Learning. arXiv. Version Number: 2 10.48550/ARXIV.2002.09018 . https://arxiv.org/abs/2002.09018
- Bartoldson BR, Kailkhura B, Blalock D (2023) Compute-efficient deep learning: algorithmic trends and opportunities. J Mach Learn Res 24(122):1–77
- Bender EM, Gebru T, McMillan-Major A, Shmitchell S (2021) On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623. ACM, Virtual Event Canada 10.1145/3442188.3445922
- Cai H, Gan C, Wang T, Zhang Z, Han S (2019) Once-for-All: Train One Network and Specialize it for Efficient Deployment. arXiv. Version Number: 5 10.48550/ARXIV.1908.09791 . https://arxiv.org/abs/1908.09791
- Canziani A, Paszke A, Culurciello E (2016) An Analysis of Deep Neural Network Models for Practical Applications. arXiv. Version Number: 4 10.48550/ARXIV.1605.07678 . https://arxiv.org/abs/1605.07678
- Charnock T, Perreault-Levasseur L, Lanusse F (2020) Bayesian Neural Networks. In: Artificial Intelligence for High Energy Physics, pp. 663–713. World Scientific. 10.1142/9789811234033_0018 . https://www.worldscientific.com/doi/abs/10.1142/9789811234033_0018
- Chen A, Dohan D, So D (2023) EvoPrompting: Language Models for Code-Level Neural Architecture Search. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 7787–7817. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/184c1e18d00d7752805324da48ad25be-Paper-Conference.pdf
- Chen X, Liang C, Huang D, Real E, Wang K, Pham H, Dong X, Luong T, Hsieh C-J, Lu Y, Le QV (2023) Symbolic Discovery of Optimization Algorithms. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 49205–49233. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/9a39b4925e35cf447ccba8757137d84f-Paper-Conference.pdf
- Davis DM, Lucas RF, Wagenbreth G, Tran JJ, Agalsoff J, Gottschalk TD (2009) Flops per watt: Heterogeneous-computing’s approach to dod imperatives. In: the Proceedings of the Interservice/Industry Simulation, Training and Education Conference, Orlando, Florida, USA, pp. 1–10
- DeWeerdt S. (2020) The carbon footprint of artificial intelligence is growing https://www.anthropocenemagazine.org/2020/11/time-to-talk-about-carbon-footprint-artificial-intelligence
- Desislavov R, Martínez-Plumed F, Hernández-Orallo J (2021) Compute and Energy Consumption Trends in Deep Learning Inference. 10.48550/ARXIV.2109.05472 . Publisher: arXiv Version Number: 2
- Ding N, Tang Y, Han K, Xu C, Wang Y (2023) Network Expansion for Practical Training Acceleration, pp. 20269–20279 https://openaccess.thecvf.com/content/CVPR2023/html/Ding_Network_Expansion_for_Practical_Training_Acceleration_CVPR_202_paper.html
- Eschenhagen R, Immer A, Turner R, Schneider F, Hennig P (2023) Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 33624–33655. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/6a6679e3d5b9f7d5f09cdb79a5fc3fd8-Paper-Conference.pdf
- Frankle J, Carbin M (2018) The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks 10.48550/ARXIV.1803.03635 . Publisher: arXiv Version Number: 5
- Gal Y, Ghahramani Z (2015) Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference. arXiv. Version Number: 6 10.48550/ARXIV.1506.02158 . https://arxiv.org/abs/1506.02158
- Goldfarb D, Ren Y, Bahamou A (2020) Practical Quasi-Newton Methods for Training Deep Neural Networks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 2386–2396. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/192fc044e74dffea144f9ac5dc9f3395-Paper.pdf
- Gong L, He D, Li Z, Qin T, Wang L, Liu T (2019) Efficient Training of BERT by Progressively Stacking. In: Proceedings of the 36th International Conference on Machine Learning, pp. 2337–2346. PMLR. ISSN: 2640-3498. https://proceedings.mlr.press/v97/gong19a.html
- Han S, Pool J, Tran J, Dally W (2015) Learning both weights and connections for efficient neural network. Advances in neural information processing systems 28
- Horowitz M (2014) 1.1 Computing’s energy problem (and what we can do about it). In: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14. IEEE, San Francisco, CA, USA. 10.1109/ISSCC.2014.6757323 . http://ieeexplore.ieee.org/document/6757323/
- Huang Y, Cheng Y, Bapna A, Firat O, Chen D, Chen M, Lee H, Ngiam J, Le QV, Wu Y, Chen Z (2019) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In: Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F.d., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf
- Jiang AH, Wong DL-K, Zhou G, Andersen DG, Dean J, Ganger GR, Joshi G, Kaminksy M, Kozuch M, Lipton ZC, Pillai P (2019) Accelerating Deep Learning by Focusing on the Biggest Losers. arXiv. Version Number: 1 10.48550/ARXIV.1910.00762 . https://arxiv.org/abs/1910.00762
- Jurj SL, Opritoiu F, Vladutiu M (2020) Environmentally-Friendly Metrics for Evaluating the Performance of Deep Learning Models and Systems. In: Yang, H., Pasupa, K., Leung, A.C.-S., Kwok, J.T., Chan, J.H., King, I. (eds.) Neural Information Processing, pp. 232–244. Springer, Cham 10.1007/978-3-030-63836-8_20
- Kaddour J, Key O, Nawrot P, Minervini P, Kusner MJ (2023) No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 25793–25818. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/51f3d6252706100325ddc435ba0ade0e-Paper-Conference.pdf
- Kelleher JD (2019) Deep Learning. MIT Press
- Kelleher JD, Mac Namee B, D’Arcy A (2020) Fundamentals of Machine Learning for Predictive Data Analytics, Second Edition: Algorithms, Worked Examples, and Case Studies. MIT Press
- Kingma DP, Ba J (2014) Adam: A Method for Stochastic Optimization. arXiv. Version Number: 9 10.48550/ARXIV.1412.6980 . https://arxiv.org/abs/1412.6980
- Kitaev N, Kaiser L, Levskaya A (2020) Reformer: The Efficient Transformer. arXiv. Version Number: 2 10.48550/ARXIV.2001.04451 . https://arxiv.org/abs/2001.04451
- Krizhevsky A (2009) Learning Multiple Layers of Features from Tiny Images
- LeCun Y, Cortes C, Burges C MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges. https://yann.lecun.com/exdb/mnist/
- LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
- Li T, John LK (2003) Run-time modeling and estimation of operating system power consumption. In: Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 160–171. ACM, San Diego CA USA 10.1145/781027.781048
- Li D, Chen X, Becchi M, Zong Z (2016) Evaluating the Energy Efficiency of Deep Convolutional Neural Networks on CPUs and GPUs. In: 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), pp. 477–484 . 10.1109/BDCloud-SocialCom-SustainCom.2016.76 . https://ieeexplore.ieee.org/abstract/document/7723730
- Li Z, Wallace E, Shen S, Lin K, Keutzer K, Klein D, Gonzalez J (2020) Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. In: Proceedings of the 37th International Conference on Machine Learning, pp. 5958–5968. PMLR. ISSN: 2640-3498. https://proceedings.mlr.press/v119/li20m.html
- Liu H, Li Z, Hall D, Liang P, Ma T (2023) Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training. arXiv. Version Number: 4 10.48550/ARXIV.2305.14342 . https://arxiv.org/abs/2305.14342
- Liu Z, Li J, Shen Z, Huang G, Yan S, Zhang C (2017) Learning Efficient Convolutional Networks Through Network Slimming, pp. 2736–2744 https://openaccess.thecvf.com/content_iccv_2017/html/Liu_Learning_Efficient_Convolutional_ICCV_2017_paper.html
- Lu Z, Cheng R, Jin Y, Tan KC, Deb K (2024) Neural architecture search as multiobjective optimization benchmarks: problem formulation and performance assessment. IEEE Trans Evol Comput 28(2):323–337. 10.1109/TEVC.2022.3233364
- MacKay DJC (1995) Bayesian neural networks and density networks. Nucl Inst Methods Phys Res Sect A: Accel Spectrom Detect Assoc Equip 354(1):73–80. 10.1016/0168-9002(94)00931-7
- Micikevicius P, Narang S, Alben J, Diamos G, Elsen E, Garcia D, Ginsburg B, Houston M, Kuchaiev O, Venkatesh G, Wu H (2017) Mixed Precision Training. arXiv. Version Number: 3 . 10.48550/ARXIV.1710.03740
- Mills KG, Han FX, Zhang J, Changiz Rezaei SS, Chudak F, Lu W, Lian S, Jui S, Niu D (2021) Profiling Neural Blocks and Design Spaces for Mobile Neural Architecture Search. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management. CIKM ’21, pp. 4026–4035. Association for Computing Machinery, New York, NY, USA. 10.1145/3459637.3481944
- Mindermann S, Brauner JM, Razzak MT, Sharma M, Kirsch A, Xu W, Höltgen B, Gomez AN, Morisot A, Farquhar S, Gal Y (2022) Prioritized Training on Points that are Learnable, Worth Learning, and not yet Learnt. In: Proceedings of the 39th International Conference on Machine Learning, pp. 15630–15649. PMLR. ISSN: 2640-3498. https://proceedings.mlr.press/v162/mindermann22a.html
- Nakkiran P, Kaplun G, Bansal Y, Yang T, Barak B, Sutskever I (2021) Deep double descent: where bigger models and more data hurt. J Stat Mech Theory Exp 2021(12):124003. 10.1088/1742-5468/ac3a74
- Pan Y, Yuan Y, Yin Y, Xu Z, Shang L, Jiang X, Liu Q (2023) Reusing Pretrained Models by Multi-linear Operators for Efficient Training. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 3248–3262. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/09d9a13f7018110cfb439c06b07940a2-Paper-Conference.pdf
- Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv. Version Number: 4 10.48550/ARXIV.1910.01108 . https://arxiv.org/abs/1910.01108
- Schwartz R, Dodge J, Smith NA, Etzioni O (2020) Green AI. Commun ACM 63(12):54–63. 10.1145/3381831
- Strubell E, Ganesh A, McCallum A (2020) Energy and policy considerations for modern deep learning research. Proc AAAI Conf Artif Intell 34(09):13693–13696. 10.1609/aaai.v34i09.7123
- Sze V, Chen Y-H, Yang T-J, Emer JS (2020) How to evaluate deep neural network processors: TOPS/W (alone) considered harmful. IEEE Solid-State Circ Mag 12(3):28–41. 10.1109/MSSC.2020.3002140
- Sze V, Chen YH, Yang TJ, Emer JS (2020) Efficient processing of deep neural networks. Synth Lect Comput Archit 15(2):1–341. 10.2200/S01004ED1V01Y202004CAC050
- Tan M, Le QV (2019) EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks 10.48550/ARXIV.1905.11946 . Publisher: arXiv Version Number: 5
- Tan M, Le Q (2021) EfficientNetV2: Smaller Models and Faster Training. In: Proceedings of the 38th International Conference on Machine Learning, pp. 10096–10106. PMLR. ISSN: 2640-3498. https://proceedings.mlr.press/v139/tan21a.html
- Tay Y, Dehghani M, Bahri D, Metzler D (2020) Efficient Transformers: A Survey. arXiv. Version Number: 3 10.48550/ARXIV.2009.06732
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention Is All You Need. arXiv. Version Number: 7 10.48550/ARXIV.1706.03762
- Vinuesa R, Azizpour H, Leite I, Balaam M, Dignum V, Domisch S, Felländer A, Langhans SD, Tegmark M, Fuso Nerini F (2020) The role of artificial intelligence in achieving the sustainable development goals. Nat Commun 11(1):233. 10.1038/s41467-019-14108-y
- Wang S, Li BZ, Khabsa M, Fang H, Ma H (2020) Linformer: Self-Attention with Linear Complexity. arXiv. Version Number: 3 10.48550/ARXIV.2006.04768
- Wang Y, Yue Y, Lu R, Liu T, Zhong Z, Song S, Huang G (2023) EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones, pp. 5852–5864 https://openaccess.thecvf.com/content/ICCV2023/html/Wang_EfficientTrain_Exploring_Generalized_Curriculum_Learning_for_Training_Visual_Backbones_ICCV_2023_paper.html
- Wang Y, Yue Y, Lu R, Han Y, Song S, Huang G (2024) EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training. IEEE Transactions on Pattern Analysis and Machine Intelligence 1–18. 10.1109/TPAMI.2024.3401036
- White C, Safari M, Sukthanker R, Ru B, Elsken T, Zela A, Dey D, Hutter F (2023) Neural Architecture Search: Insights from 1000 Papers. arXiv. Version Number: 2 10.48550/ARXIV.2301.08727 . https://arxiv.org/abs/2301.08727
- van Wynsberghe A (2021) Sustainable AI: AI for sustainability and the sustainability of AI. AI Ethics 1(3):213–218. 10.1007/s43681-021-00043-6
- Xie SM, Santurkar S, Ma T, Liang PS (2023) Data Selection for Language Models via Importance Resampling. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 34201–34227. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/6b9aa8f418bde2840d5f4ab7a02f663b-Paper-Conference.pdf
- Yang T-J, Howard A, Chen B, Zhang X, Go A, Sandler M, Sze V, Adam H (2018) NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications, pp. 285–300 https://openaccess.thecvf.com/content_ECCV_2018/html/Tien-Ju_Yang_NetAdapt_Platform-Aware_Neural_ECCV_2018_paper.html
- Yang Y, Kang H, Mirzasoleiman B (2023) Towards Sustainable Learning: Coresets for Data-efficient Deep Learning. In: Proceedings of the 40th International Conference on Machine Learning, pp. 39314–39330. PMLR. ISSN: 2640-3498. https://proceedings.mlr.press/v202/yang23g.html
- Ying H, Song M, Tang Y, Xiao S, Xiao Z (2024) Enhancing deep neural network training efficiency and performance through linear prediction. Sci Rep 14(1):15197. 10.1038/s41598-024-65691-0
- Zeng S, Sun H, Xing Y, Ning X, Shan Y, Chen X, Wang Y, Yang H (2020) Black Box Search Space Profiling for Accelerator-Aware Neural Architecture Search. In: 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 518–523. IEEE, Beijing, China 10.1109/ASP-DAC47756.2020.9045179
- Zhang M, He Y (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 14011–14023. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/a1140a3d0df1c81e24ae954d935e8926-Paper.pdf
- Zheng N, Mazumder P (2019) Learning in Energy-Efficient Neuromorphic Computing: Algorithm and Architecture Co-Design. John Wiley & Sons
- Zhou R, Quan P (2023) Optimization ways in neural network compression. Procedia Comput Sci 221:1351–1357. 10.1016/j.procs.2023.08.125