Abstract
The memory needs of deep learning training can prevent the user from considering large models and large batch sizes. In this work, we propose to use techniques from memory-aware scheduling and automatic differentiation (AD) to execute a backpropagation graph with a bounded memory requirement, at the cost of extra recomputations. The case of a single homogeneous chain, i.e. the case of a network whose stages are all identical and form a chain, is well understood and optimal solutions have been proposed in the AD literature. The networks encountered in practice in the context of deep learning are much more diverse, both in terms of shape and heterogeneity. In this work, we define the class of backpropagation graphs and extend the class of graphs on which a solution that minimizes the total number of recomputations can be computed in polynomial time. In particular, we consider join graphs, which correspond to models such as siamese or cross-modal networks.
This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
Keywords: backpropagation, memory, pebble game
1. Introduction
Training deep learning networks induces a memory problem. Among the different strategies to parallelize and accelerate this training phase are hyper-parameter tuning and data parallelism [1], which both need to replicate the set of weights of the neural network onto all participating resources, thus limiting the size of the model or the depth of the graph.
In order to deal with memory issues, several solutions have been advocated. One of them is model parallelism [2]. It consists of splitting the network model into several non-overlapping parts that are distributed over the different resources.
The deep neural network (DNN) model can, in general, be seen as a directed acyclic graph (DAG), and the different vertices of the DAG are split across the resources. Each time there is an edge between two vertices allocated onto two different resources, this induces the communication of the associated forward and backward activations during the training phase. As is generally the case for the training of DNNs, we will assume that the weights are updated after the forward and backward propagation of small groups of training samples called mini-batches, as opposed to full batch learning (where all samples are used before updating the weights) and stochastic gradient descent (where updates are performed after each sample). Model parallelism can be considered as a solution to the memory problem since the model weights are distributed among the participating nodes. In this context, the problem becomes a general graph partitioning problem [2] where the goal is to balance the weights between the different nodes while minimizing the weight of the cut edges (i.e. edges whose two extremities are on different resources). This approach has the advantage that it can be combined with data parallelism [3].
Another complementary approach is to schedule the graph with a shared bounded memory and to use recomputation of functions. This problem is known in the scheduling literature as the register allocation problem or Pebble Game [4]. In the register allocation problem, in order to execute a task, all its inputs need to be stored in registers. The question is then to decide whether it is possible to process the graph with a bounded number of unit-size registers (or memory slots). Sethi [4] showed that this problem is NP-complete for general task graphs. Further studies showed that the problem is solvable in polynomial time for tree-shaped graphs [5] and, more recently, for series–parallel graphs [6].
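To make the register allocation problem concrete, the following brute-force sketch decides, for a tiny DAG, whether a target vertex can be computed with at most c unit-size registers when recomputation is allowed. The precise variant of the game (an output always needs a free register of its own) and the vertex names are assumptions of this illustration rather than the exact model of [4].

```python
import networkx as nx

def pebblable(G: nx.DiGraph, target, c: int) -> bool:
    """Pebble-game style check: can `target` be computed with at most
    c unit-size registers, allowing values to be recomputed?  Exhaustive
    search over register contents, only intended for very small DAGs."""
    preds = {v: set(G.predecessors(v)) for v in G}
    start = frozenset()                      # initially no value is stored
    seen = {start}
    frontier = [start]
    while frontier:
        state = frontier.pop()
        for v in G:
            if v in state or not preds[v] <= state:
                continue                     # already stored, or an input is missing
            # If memory is full, a register holding a value that is not an
            # input of v must be freed before computing v.
            bases = [state] if len(state) < c else [state - {w} for w in state - preds[v]]
            if not bases:
                continue
            if v == target:
                return True
            for base in bases:
                nxt = base | {v}
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return False

# Toy example with hypothetical vertex names: a diamond a -> {b, c} -> d.
G = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
print(pebblable(G, "d", 3), pebblable(G, "d", 2))   # True False
```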
In this work, we are interested in what we denote by backpropagation graphs: given a DAG with a single exit vertex, we construct a dual identical graph where all edges are reversed, and where the input of each vertex of the initial graph is connected to its dual vertex. The source vertex of the dual graph and the sink vertex of the original graph are then merged into a single vertex called turn (see figure 1 for the case of a chain of vertices).
Figure 1.
Data dependencies of the backpropagation graph corresponding to the transformation of a linear chain. Functions are modelled by grey vertices, arrows represent data dependencies (input/output of functions), where the data are either forward values xi or backward values x̄i. The chain is the graph on top; its dual graph is on the bottom.
These types of graphs have been widely studied in the context of automatic differentiation (AD) [7]. For a given batch size and a given network model, and even on a single node without relying on model parallelism strategies, checkpointing saves memory at the price of activation recomputations. In the context of AD, networks can be seen as (long) homogeneous (i.e. all stages are identical) chains, and the forward activation corresponding to the i-th stage of the chain has to be kept in memory until the associated backward stage. Checkpointing techniques determine in advance which forward activations (checkpoints) should be kept in memory and which ones should be recomputed from stored checkpoints when performing the backward phase. Many studies have been performed to determine optimal checkpointing strategies for AD in different contexts, depending on the presence of a single or multi-level memory [8,9]. In the case of homogeneous chains, closed-form formulae providing the exact positions of checkpoints have even been proposed [10] (an algorithm denoted Rev in the rest of this paper), although the general algorithmic ingredient is to derive optimal checkpointing strategies through dynamic programming [10]. This technique has recently been advocated for DNNs in several papers [11,12]. In the context of homogeneous chains, a periodic checkpointing strategy has recently been implemented in PyTorch [13,14]. Nevertheless, DNN models are not restricted to homogeneous chains and there is a need for checkpointing algorithms in more general contexts. While optimal scheduling and checkpointing are still open problems in the general case, some solutions benefit from the findings of AD checkpointing strategies [11,12]. However, they are designed to deal with sequential models only, thus making them inappropriate for more sophisticated cases.
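As an illustration, the periodic checkpointing utility shipped with PyTorch [13] can be used as follows on a purely sequential model; the layer sizes, depth and number of segments below are arbitrary choices made for the example.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy homogeneous chain: 32 identical stages (sizes are illustrative only).
model = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU())
                        for _ in range(32)])
x = torch.randn(64, 512, requires_grad=True)

# Periodic checkpointing: the chain is cut into 4 segments; only the segment
# inputs are kept during the forward pass, the other activations being
# recomputed during backpropagation.
y = checkpoint_sequential(model, 4, x)
loss = y.sum()
loss.backward()
```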
In this paper, we propose a first attempt to find optimal checkpointing strategies adapted to more general networks and to show how dynamic programming techniques developed in the context of AD can be adapted to DNN networks. More specifically, we concentrate on the particular context of DNNs consisting of several independent chains whose results are gathered through the computation of the loss function (join graph). We show that this specific case can be solved using dynamic programming, but at the price of more sophisticated techniques and higher computational costs. Note that several popular classes of problems can be modelled as a join graph among which are cross-modal embeddings or Siamese networks (SNs).
Cross-modal embeddings [15,16] are models used when there are multiple sources of data and the goal is to find the connection between those sources. For example, in the image–recipe retrieval task [16], having both a dataset of dish images and a dataset of recipes represented as a text corpus, the goal is to find a matching image for each recipe. Thus, a convolutional neural network (CNN) is applied to process images and extract features, while a long short-term memory network is used for the text part. Then, all feature vectors yielded by both networks are further processed by a small number of fully connected layers before being concatenated to compute an associated loss. In practice, training such a model often consists of individually training each submodel for each data source and then using them only as feature extractors to train the fully connected layers on top of them. Indeed, training the whole model end-to-end is usually avoided because of the longer training time and the much larger memory requirements. In that case, the approach proposed in the current paper can be used as a checkpointing strategy and can significantly decrease memory consumption.
Siamese neural networks (SNNs) [17–19] can also directly benefit from our approach. They are widely used for object recognition. The main idea behind these models is to use the same CNN, but on different images, and then finally use all the outputs to estimate a loss that represents a similarity metric. Depending on the choice of the loss function, this can either correspond to a two-chain computational graph [17,19], where the loss is computed based on two images, or to a three-chain computational graph, where a triplet loss is applied [20]. Owing to memory constraints, most of the CNNs used in these models are not very deep [18]. However, it is known that deeper neural networks could offer better quality. Therefore, using checkpointing techniques to decrease memory needs may make it possible to consider larger and deeper models in the context of SNNs.
The rest of the paper is organized as follows. In §2, we present the problem and the general framework in which we can derive our scheduling algorithm. Then, a characterization of optimal solutions is proposed in §3a and is later used to find the optimal checkpointing strategy through dynamic programming in §3b. Finally, we present our implementation and simulation results in §4, before providing concluding remarks and perspectives in §5.
2. Model
In this work, we consider the register allocation problem [4] (aka Pebble Game) for special types of graphs, denoted as backpropagation graphs. These graphs are obtained by transforming a DAG as explained in definition 2.1.
(a). Platform model and optimization problem
In this work, we consider a sequential platform (at all times, all the compute elements are dedicated to the same job) with a finite memory of size c. Activations must be checkpointed in memory until they are used in the backpropagation, or they must be discarded from memory and later recomputed from a previously checkpointed value. We do not consider the possibility of offloading activations to another (larger) memory level as in [21].
A job is represented as a DAG G = (V, E), where each vertex v ∈ V represents a compute operation (with a given execution time) and each edge (v1, v2) ∈ E represents a data dependency, where an output of v1 is an input of v2. Given a graph G, the problems under consideration are (i) can we execute G with a memory of size c (pebble game problem) and (ii) if we can, what is the minimal execution time of G (makespan problem) with a memory of size c?
In order to process a job on this platform, all its inputs need to be stored in memory at the beginning of the execution. In what follows, we assume that the memory is larger than the minimal amount of memory required to perform the training phase, which is formally given in theorem 3.3. It should be noted that the formula of theorem 3.3 takes into account the possibility of computing some activations several times, and therefore does not assume that there is enough memory to run the graph in one go. Then, the core of the makespan problem is to choose which activations to store and which activations should be recomputed.
(b). Backpropagation graphs
In this work, we consider specifically the problem of scheduling backpropagation graphs.
Definition 2.1 (Backpropagation transformation (BP-transform)). —
Given a DAG G with a single sink vertex, the BP-transform of G is defined by the following procedure:
- (i) Build the dual graph, defined as the same graph in which all edges are reversed.
- (ii) For each vertex of G, connect its input edges to its dual vertex in the dual graph.
- (iii) Finally, merge the sink vertex of G and the source vertex of the dual graph into a single vertex, denoted as the turn.
Note that the vertices of the initial graph are denoted as forward steps, while the vertices of the dual graph are denoted as backward steps. In the rest of this work, we make the assumption that all forward steps have the same execution cost uf, while all the backward steps have the same execution cost ub.1 The cost of the turn operation is denoted as ut. In addition, we assume that all input/output data have the same (unit) size. We give without proof two properties of the backpropagation graph which justify this study.
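As an illustration of definition 2.1, the following sketch builds the BP-transform of a small DAG with networkx. The labelling of the vertices (pairs tagged 'F' or 'B' and the name 'turn') is a convention of this sketch only.

```python
import networkx as nx

def bp_transform(G: nx.DiGraph) -> nx.DiGraph:
    """BP-transform (definition 2.1) of a DAG G with a single sink vertex."""
    sinks = [v for v in G if G.out_degree(v) == 0]
    assert len(sinks) == 1, "definition 2.1 requires a single sink vertex"
    sink = sinks[0]
    # Forward copy and dual copy of a vertex; the sink and its dual are merged
    # into the single vertex called 'turn'.
    fwd = lambda v: "turn" if v == sink else ("F", v)
    bwd = lambda v: "turn" if v == sink else ("B", v)

    bp = nx.DiGraph()
    for (u, v) in G.edges():
        bp.add_edge(fwd(u), fwd(v))   # (i) edge of the original graph
        bp.add_edge(bwd(v), bwd(u))   # (i) reversed edge of the dual graph
        bp.add_edge(fwd(u), bwd(v))   # (ii) the input of v also feeds its dual vertex
    return bp

# Example: a chain of three forward operations, as in figure 1.
chain = nx.path_graph(3, create_using=nx.DiGraph)
print(sorted(bp_transform(chain).edges(), key=str))
```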
Property 2.2 (Properties of the BP-graph). —
Given a DAG G with n vertices and a single sink vertex:
P1: The backpropagation graph of G is a DAG.
P2: Without recomputation of vertices, the minimal memory usage to go through this graph is O(n).
Indeed, to scale these types of computations, we are interested in the computational overhead incurred when much less memory space is used.
The most simple and most studied backpropagation graph is the transformation of a chain graph (as shown in figure 1) in AD. In this case, it is known that for a given volume of memory, one can compute the optimal solution in polynomial time [10]. In this work, we consider transformation of join graphs (figure 2). Join graphs are used by several deep-learning models such as SNNs [17–19] or cross-modal embeddings [15,16].
Figure 2.
A join graph with three branches of respective lengths 5, 4 and 6.
Definition 2.3 (BP-transform of join graphs). —
Given a join graph with k branches of respective lengths ℓj, its BP-transform can be described by the following set of equations, where x_i^(j) (resp. x̄_i^(j)) denotes the forward (resp. backward) data at position i of branch j and x_0^(j) is the input of branch j:
x_i^(j) = F_i^(j)(x_{i−1}^(j)), for 1 ≤ j ≤ k and 1 ≤ i ≤ ℓj (forward steps);
(x̄_{ℓ1}^(1), …, x̄_{ℓk}^(k)) = Turn(x_{ℓ1}^(1), …, x_{ℓk}^(k)) (turn);
x̄_{i−1}^(j) = B_i^(j)(x_{i−1}^(j), x̄_i^(j)), for 1 ≤ j ≤ k and 1 ≤ i ≤ ℓj (backward steps).
The dependencies between these operations are represented by the graph depicted in figure 3, in the case of k = 2 chains.
Figure 3.
Data dependencies in the multiple adjoint chains problem with two chains. In the following, we call forward (resp. backward) data the output of the F (resp. B) functions.
In the rest of this work, we only use the graph of forward steps to represent the BP-transform. Finally, we are interested in solving the following problem.
Problem 2.4 (Probjoin). —
Given the BP-transform of a join DAG with k branches of respective lengths ℓ = (ℓ1, …, ℓk), where the respective costs for forward, backward and turn operations are given by uf, ub and ut, and given c memory slots, minimize the makespan, where the initial memory state contains the input of each branch and the final memory state contains the backward output of each branch.
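For illustration, a join DAG such as the one of figure 2 can be generated as in the sketch below (the vertex names are hypothetical); applying the BP-transform of definition 2.1 to it yields an instance of Probjoin.

```python
import networkx as nx

def join_graph(branch_lengths) -> nx.DiGraph:
    """Join DAG: independent chains of forward operations that all feed a
    single sink vertex (the loss computation)."""
    G = nx.DiGraph()
    for b, length in enumerate(branch_lengths, start=1):
        ops = [f"F{b}_{i}" for i in range(1, length + 1)]
        nx.add_path(G, ops)           # a chain of `length` forward operations
        G.add_edge(ops[-1], "loss")   # the last operation of the branch feeds the loss
    return G

G = join_graph([5, 4, 6])             # the join graph of figure 2
print(G.number_of_nodes(), G.number_of_edges())   # 16 nodes, 15 edges
```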
3. Deriving an optimal solution
In this section, we present the core idea to compute an optimal solution. In particular, we show that there are optimal solutions that satisfy certain properties on their behaviour; we call these canonical optimal solutions. We then restrict the search for optimal solutions to canonical solutions. For simplicity, we leave out all the proofs of this section and focus on giving intuitions. The interested reader can find the proofs in the companion report [23] for the more general case where not all final outputs need to be kept in memory; the context considered in this paper corresponds to the case where all bi values are equal to 1 in [23], where bi indicates whether the result of chain i needs to be kept in memory.
In terms of terminology, let us first note that an algorithm for Probjoin can be decomposed into three different phases (figure 4):
- the forward phase: we traverse all branches to write all inputs of the turn in memory. During this phase, one cannot compute any backward operation, but some of the input data can be stored;
- the turn: at the beginning of this phase, all input data of the turn are stored in memory, and they are replaced by all output data at the end of the phase, in the same memory locations. Indeed, the input data of the turn will never be used again;
- the backward phase: we read some input data that were stored earlier to backpropagate a subset of the graph.
Figure 4.
The three main phases of the algorithm. Green blocks correspond to the storage of ‘forward’ data, blue blocks to storage of ‘backward’ data. To process the backward phase, the red-circled value stored during the forward phase is used to recompute subsequent values. (a) The forward phase, (b) the turn, (c) the backward phase. (Online version in colour.)
(a). Canonical form and optimal solutions
We say that a solution is in a canonical form if it obeys the following recursive structure.
- (i) Given c memory slots, from an input x stored in memory, j forward steps are performed on the branch m containing x.
- (ii) The output obtained after those steps is stored into memory.
- (iii) A canonical solution is applied to the smaller problem in which all branch sizes are unchanged, except for branch m which is now shorter by j forward steps (recurrence), with c − 1 memory slots.
- (iv) From the input x in memory, the j steps following x are backpropagated2 using all available memory slots.
We show in figure 5 what a canonical solution looks like on a small example. In this section, we show that there exists an optimal solution in this canonical form. In order to prove this result, we show that the backward phase is computed following three properties (represented graphically in figure 6):
Figure 5.
A canonical solution on the join with three branches of respective lengths 4 (blue), 10 (green) and 12 (red), with nine memory slots. The execution time is represented on the vertical axis, the progress on each of the branches is represented on the horizontal axis. As an example, the dark dashed line at the top of the figure represents the computation of one Rev subproblem. Full (resp. dashed) vertical lines are memory checkpoints of forward data (resp. backward data). Rev(l, c) is the solution [10] to the easier problem where there is only one branch of size l and c checkpoints (figure 1); Opt0(l, c) is its execution time. (Online version in colour.)
Figure 6.
The three key properties of a canonical solution. Green blocks correspond to ‘forward’ data being stored, blue blocks to ‘backward’ data stored. The forbidden operations given by the properties are crossed out in red. For Stability 2, if during the backward phase, we have read the circled checkpoint, then the canonical solution starts by backpropagating (represented by the back arrow) all greyed-out operations until the ‘backward’ data of the circled-out operation is stored. (a) Stability 1, (b) checkpoint persistence, (c) Stability 2. (Online version in colour.)
Property 3.1 (Canonical properties). —
Given a join graph G with k branches:
- Stability 1: if we have backpropagated an operation of G, then we no longer access any of its descendants in G.
- Checkpoint persistence: if the output of an operation of G is stored, then until that operation is backpropagated, we do not access any of its parents in G.
- Stability 2: if the output of an operation of G is read, then until it is backpropagated, we do not access any part of the other branches of the BP-transform of G.
Then there exists an optimal solution for Probjoin that satisfies these properties.
With these properties, we can state the following result.
Theorem 3.2. —
There exists an optimal solution that follows the canonical form.
Outline of the proof. —
Finding an optimal solution reduces to finding which data should be written during the forward phase and in which order they should be read during the backward phase. Indeed, once such an element is read, the backward phase consists of backpropagating the segment of the branch up to the location of the backward data already computed (which recursively corresponds to the backpropagation of another element written during the forward phase). In addition, during the backward phase, the backpropagation of each such subset of the graph reduces to the problem of a chain, whose solution is known [10].
Finally, it remains to be shown that the forward phase satisfies the canonical property. This is simple: one can notice that at no cost, one can choose any order for going through the different forward segments3 (except the last one) of the branches. Thereby, in particular, we can use the reverse order in which they are backpropagated during the backward phase. ▪
(b). Optimal solution
Using the properties derived in the previous section on canonical solutions, we can see that the problem can be solved with a dynamic programming algorithm. Indeed, given a join graph of lengths ℓ = (ℓ1, …, ℓk), the problem can be seen as
- (i) performing j forward steps on one of the branches;
- (ii) writing the output forward activation obtained after those j steps to memory;
- (iii) solving the smaller problem in which all branches have unchanged size, except for the branch on which the output has been kept, which is now shorter by j steps (recurrence), with one fewer memory checkpoint;
- (iv) backpropagating the j forward steps performed before.
We now explain how to derive this dynamic program. Let us first consider the solution to the pebble game problem.
Theorem 3.3 (Minimal memory requirement). —
The minimal memory cmin needed to execute Probjoin is:
cmin = k, if ℓ = (0, …, 0);
cmin = k + |{j : ℓj ≥ 1}|, if there exists a branch j with ℓj = 1;
cmin = k + |{j : ℓj ≥ 1}| + 1, otherwise. (3.1)
Outline of the proof. —
Intuitively, the case where ℓ = 0 corresponds to the situation where we simply need to perform the turn. For the general case, the peak is right after the turn: all initial input for the branch of non-zero size needs to be stored. In addition, all the outputs of the turn have to be stored as well. Finally, one additional slot is then needed to perform the back and forth between the input data and output data. The only exception to this is when one of the branches has length exactly 1. In that case, right after the turn, we can backpropagate the input of this branch, hence freeing the memory slot used for this input in order to do the back and forth. □
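Equation (3.1) translates directly into code; the sketch below assumes unit-size activations and is only meant to make the case analysis explicit.

```python
def c_min(branch_lengths):
    """Minimal number of memory slots needed to execute Probjoin
    (theorem 3.3), assuming unit-size activations."""
    k = len(branch_lengths)
    nonzero = sum(1 for l in branch_lengths if l >= 1)
    if nonzero == 0:
        return k                  # only the turn has to be performed
    if any(l == 1 for l in branch_lengths):
        # a branch of length 1 is backpropagated right after the turn,
        # which frees the slot needed for the back and forth
        return k + nonzero
    return k + nonzero + 1

print(c_min([5, 4, 6]), c_min([2, 2]))   # 7 for the join of figure 2, 5 for two chains
```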
We are now able to establish the following theorem, which proves that general instances of Probjoin can be solved using dynamic programming. We denote by Opt0(l, c) the execution time of the Rev algorithm [10] (optimal solution when k = 1).
Theorem 3.4. —
Given a join DAG with k branches of lengths ℓ and given c memory slots, the execution time Optjoin(ℓ, c) of an optimal solution to Probjoin is obtained by a dynamic programming recurrence over ℓ and c. The initialization corresponds to the cost of performing the turn, subject to the memory requirement of theorem 3.3. For the other cases, Optjoin(ℓ, c) is the minimum, over the choice of a branch m and of a number j of forward steps performed before the next checkpoint, of the cost j·uf of these forward steps, plus the cost of the reduced problem (branch m shortened by j) with c − 1 memory slots, plus the cost, derived from Opt0, of backpropagating the stored segment.
The complexity to compute Optjoin(ℓ, c) is polynomial in c and in the branch lengths ℓj for any fixed number of branches k.
Note that in practice k is small (2 or 3).
Sketch of proof. —
The dynamic program describes exactly the best makespan of any algorithm with canonical form for all values of ℓ and c.
The initialization is the time to perform the turn. Indeed, as we have seen in theorem 3.3, there is a subtlety in the manipulation of the checkpoints when performing the turn, depending on whether there is a chain of length one, which allows one to use fewer checkpoints. Finally, when there are not enough memory slots to execute the graph, we set the value to ∞, which discards the cases that use more memory than what is available. ▪
Note that while this provides an execution time, it is fairly easy to derive an algorithm based on this using standard dynamic programming techniques.
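For completeness, the single-chain building block Opt0 can itself be computed by a small dynamic program in the spirit of Rev [10]. The boundary cases below (a chain of l forward steps, the turn counted among the backward operations, one slot permanently holding the chain input) are conventions assumed for this sketch and are not quoted from [10].

```python
from functools import lru_cache

def make_opt0(uf: float = 1.0, ub: float = 1.0):
    """Opt0(l, c): time to backpropagate a single chain of l forward steps
    with c memory slots, forward cost uf and backward cost ub, under the
    conventions stated above."""

    @lru_cache(maxsize=None)
    def opt0(l: int, c: int) -> float:
        if c < 1:
            return float("inf")      # not enough memory to make any progress
        if l == 0:
            return ub                # a single backward operation (the turn)
        if l == 1:
            return uf + 2 * ub
        if c == 1:
            # every value is recomputed from the stored chain input
            return l * (l + 1) / 2 * uf + (l + 1) * ub
        # otherwise: place a checkpoint after j forward steps, solve the tail
        # with one slot fewer, then the prefix with all slots again
        return min(j * uf + opt0(l - j, c - 1) + opt0(j - 1, c)
                   for j in range(1, l))

    return opt0

opt0 = make_opt0()
print([opt0(l, 3) for l in range(6)])
```

Under these conventions, with enough memory the sketch returns l·uf + (l + 1)·ub, i.e. every operation is executed exactly once, which is the behaviour expected of an optimal single-chain schedule.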
4. Evaluation
In this section, we depict the results of simulations on linearized homogeneous versions of cross-modal (CM) embeddings and SNs with triplet loss. The linearized homogeneous version of a neural network is the one where all computational costs are equal (uf = ub = ut = 1) and all storage costs are equal (unit size). In this context, a multi-chain network is completely defined by the lengths of the different branches and by the size of the memory expressed in terms of memory slots.
We propose three observations:
- (i) the trade-off between makespan and memory usage;
- (ii) for a fixed number of storage slots, an observation of the growth of the number of recomputations needed as a function of the number of forward operations;
- (iii) a comparison between the optimal solution and an algorithm that does not take the join structure into account, but considers the graph as a linear chain of equivalent length.
In order to compare the different types of networks in a normalized way, we consider several models with analogous sizes and computational costs. For a given value of L, we consider three graph structures:
- SNNs with three chains of lengths (2L, 2L, 2L). Here, L can be large as these neural networks are usually applied to images, so that deep networks, which are better at pattern retrieval, are also relevant. As soon as c ≥ CSNN(L) = 6L + 3, all forward activations can be stored, and the makespan is therefore minimal and equal to 12L + 1 (every operation is performed exactly once).
- CM embedding networks with two chains of sizes (L, 5L). The motivation is a CM model with two different types of data sources (images and text); one of the chains is much longer than the other because more layers are needed to extract useful patterns (e.g. images are usually processed with deep neural networks such as ResNet, while text can be efficiently encoded into vectors of smaller dimension with shallower networks). Analogously to the case of SNNs, as soon as the memory c is larger than CCM(L) = 6L + 2, the makespan is minimal and equal to 12L + 1.
- Finally, we also consider the case of a single chain of length 6L. This corresponds to the case of recurrent neural networks (RNNs). Analogously to the cases of CMs and SNNs, as soon as c ≥ CRNN(L) = 6L + 1, all forward activations can be stored and the makespan is therefore minimal and equal to 12L + 1. This case also serves as a lower bound on the makespan that can be reached by the previous models. (These thresholds are summarized in the sketch after this list.)
In the following, this common value 12L + 1 is simply referred to as the minimal makespan.
(a). Trade-off memory—makespan
The first question that we consider is, given a fixed number of forward operations (described by L for all scenarios), how does the makespan evolve as a function of the number of memory slots available?
We run our experiments for L = 5 and 15. The plots depicting the evolution of the makespan with the amount of memory are gathered in figure 7. Specifically, the x-axis represents the fraction of memory slots available with respect to the minimal number COpt of slots needed to achieve the minimal makespan (which differs slightly depending on the model). The y-axis represents the ratio between the achieved makespan and the minimal makespan. Thus, a point (x, y) on the CM plot means that for a CM network of lengths (L, 5L), with x · CCM memory slots, the makespan is y times the minimal makespan.
Figure 7.
Makespan evolution with respect to different amount of total memory. (a) L = 5, (b) L = 15. (Online version in colour.)
The plots related to CM (resp. SNN) start from memory size 5 (resp. 7), which is exactly the value of cmin for two (resp. three) chains, as proved in theorem 3.3. We can note that the makespan first decreases significantly with the first additional memory slots. In addition, it seems that once the makespan ratio reaches a threshold of about 1.5, the makespan decreases roughly linearly towards the minimal makespan with the number of additional memory slots.
Hence, this shows that this checkpointing strategy is very efficient at decreasing the memory needs while only slightly increasing the makespan. For instance, consider the point where c/COpt = 0.5. This corresponds to halving the memory needs with respect to what is needed to achieve the minimal makespan. In all cases, halving the memory needs only induces an increase of approximately 25% in the makespan. In addition, we can also observe that even with a very small memory (say cmin + 2), the makespan is less than doubled compared to the minimal makespan.
(b). Makespan evolution for fixed memory
In figure 8, we depict the dual situation, where the number of memory slots is fixed on each plot (either 9 or 13) and the ratio of the achieved makespan over the minimal makespan is depicted. Several observations can be made. The first one is that the gap to the lower bound (RNN) is rather small and decreases with the number of available checkpoints. Obviously, this gap increases slowly with the size of the model. Note that we have observed an exception when the number of available checkpoints is exactly the minimum number of memory checkpoints (for instance c = 7, performance of SNNs). This exception is not surprising given the observations of the previous paragraph and the important improvements in performance when the number of available checkpoints is slightly larger than cmin. In addition, it is interesting to observe that, for SNNs and CMs, the ratio follows a pattern similar to that of RNNs: different thresholds and, between those thresholds, a performance shaped as α − β/L. Indeed, for the case of RNNs, those performances have been proven via a closed-form formula [10]. This can motivate the search for a similar closed-form formula for the multiple-chain problem, which could lead to an algorithm of polynomial complexity in the number of branches. Finally, another observation is that the overall growth in makespan for a fixed number of checkpoints is quite slow even with few checkpoints, hence encouraging the use of these strategies.
Figure 8.
Makespan evolution with respect to different L for fixed memory size c. (a) c = 9, (b) c = 13. (Online version in colour.)
(c). Comparison to linearization strategy
The previous evaluations quantify the gain that can be expected by trading extra computation time for memory. In this final evaluation, we rather aim at showing the importance of taking the structure of the graph into consideration. In particular, we compare the strategy presented in this work to a two-step strategy that consists in (i) transforming the join into a linear chain and (ii) using an existing strategy for linear chains.
To transform the join into a linear chain, we group together operations that are at the same distance from the turn (figure 9). In this case, one can execute any algorithm for linear chains while considering that each forward step has cost k·uf, each backward step cost k·ub, and that the memory needed for a checkpoint is k times the memory needed for a single input/output. For comparison, we consider the best possible algorithm for this structure, Rev [10]. We call this heuristic Equiv-Chain.
Figure 9.
Graph transformation (a) and performance (b). (a) Transformation of a join with three branches into a chain with four levels, (b) performance of Equiv-Chain versus Optimal. (Online version in colour.)
In order to consider the best possible scenario for Equiv-Chain, (i) we only consider balanced graphs (SNN graphs) and (ii) we only consider numbers of memory slots that are proportional to the number of branches. We plot in figure 9 the ratio of the makespan of Equiv-Chain over that of Opt, the performance of our algorithm, for different numbers of checkpoints (9, 12 or 15). Because we consider only one type of model, for this plot we use the actual length of the graph as the x-axis (length = 2L + 1).
The results in this optimistic scenario show that Equiv-Chain is surprisingly robust for small graphs. As expected, its performance steadily degrades when the ratio of the graph length to the number of checkpoints increases. This is not surprising because our algorithm has more freedom in its choice of execution order. This shows the importance of taking the structure of the backpropagation graph into account when studying checkpointing strategies.
5. Conclusion and perspectives
Being able to perform learning in deep neural networks is a crucial operation, which requires a large amount of memory in the feed-forward model because of the storage of activations. These memory needs often limit the size, and therefore the accuracy, of the models used. The automatic differentiation community has implemented checkpointing strategies, which make it possible to significantly reduce memory requirements at the cost of a moderate increase in computing time, and backpropagation schemes are very similar in the cases of automatic differentiation and deep learning. On the other hand, the diversity of task graphs (ResNet, Inception and their combinations) is much greater in the context of DNNs. It is, therefore, crucial to design checkpointing strategies for backpropagation on a much broader class of graphs than the homogeneous chains on which the literature on automatic differentiation has focused.
The goal of this paper is precisely to extend this graph class, by considering multiple independent chains that meet when calculating the loss function. From an application point of view, this class of graphs also makes it possible to consider neural networks such as cross-modal networks and SNNs. In the multi-chain case, we were able to build a dynamic program that computes the optimal strategy given the size of the available memory, i.e. a strategy that fulfils the memory constraints while minimizing the number of recomputations.
This work opens several natural perspectives. The first one obviously concerns the extension to other broader classes of backpropagation graphs, in order to take into account larger classes of DNNs. After working on multiple chains, which has proven to be a much more difficult problem than the single chain problem, we consider that it is probably essential to choose the characteristics of the graph class in question carefully in order to produce algorithms of reasonable complexity with a practical impact in deep learning. Another approach is the design of approximation algorithms for this problem, which would have a low execution time and for which good performance can nevertheless be guaranteed. From the studies we have done, it does seem that there is a fairly high density of optimal and quasi-optimal solutions, so it is possible that there are simpler algorithms to find them.
Footnotes
1 For clarity, this work is focused on the uniform case. It is possible to extend the results at a small cost to non-uniform time steps, as was done for the case of single chains by Walther [22, §3.4].
2 We use this term to say that we have computed the dual of a vertex of the initial graph.
3 Defined as the forward steps between two consecutive written elements.
Data accessibility
This article does not contain any additional data.
Authors' contributions
All authors participated in the modelling, the derivation of the theoretical results and the writing of the manuscript. J.H. implemented the main algorithm, A.S. and G.P. performed the evaluations. All authors read and approved the manuscript.
Competing interests
We declare we have no competing interests.
Funding
This work was supported in part by the French National Research Agency (ANR) in the frame of DASH (ANR-17-CE25-0004) and in part by the Project Région Nouvelle Aquitaine 2018-1R50119 ‘HPC scalable ecosystem’.
References
- 1. Glorot X, Bengio Y. 2010 Understanding the difficulty of training deep feedforward neural networks. In Proc. of the Thirteenth Int. Conf. on Artificial Intelligence and Statistics, pp. 249–256.
- 2. Dean J et al. 2012 Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231. (https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks)
- 3. Das D, Avancha S, Mudigere D, Vaidynathan K, Sridharan S, Kalamkar D, Kaul B, Dubey P. 2016 Distributed deep learning using synchronous stochastic gradient descent. (http://arxiv.org/abs/1602.06709)
- 4. Sethi R. 1975 Complete register allocation problems. SIAM J. Comput. 4, 226–248. (doi:10.1137/0204020)
- 5. Liu JWH. 1987 An application of generalized tree pebbling to sparse matrix factorization. SIAM J. Algebr. Discret. Methods 8, 375–395. (doi:10.1137/0608031)
- 6. Kayaaslan E, Lambert T, Marchal L, Uçar B. 2018 Scheduling series-parallel task graphs to minimize peak memory. Theor. Comput. Sci. 707, 1–23. (doi:10.1016/j.tcs.2017.09.037)
- 7. Griewank A. 1989 On automatic differentiation. In Mathematical programming: recent developments and applications (eds M Iri, K Tanabe), pp. 83–107. Alphen aan den Rijn, The Netherlands: Kluwer.
- 8. Aupy G, Herrmann J. 2019 H-Revolve: a framework for adjoint computation on synchronous hierarchical platforms. Working paper or preprint. (https://hal.inria.fr/hal-02080706/document)
- 9. Aupy G, Herrmann J, Hovland P, Robert Y. 2016 Optimal multistage algorithm for adjoint computation. SIAM J. Sci. Comput. 38, 232–255. (doi:10.1137/15M1019222)
- 10. Griewank A, Walther A. 2000 Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Trans. Math. Softw. 26, 19–45. (doi:10.1145/347837.347846)
- 11. Chen T, Xu B, Zhang C, Guestrin C. 2016 Training deep nets with sublinear memory cost. (http://arxiv.org/abs/1604.06174)
- 12. Gruslys A, Munos R, Danihelka I, Lanctot M, Graves A. 2016 Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems, pp. 4125–4133. (https://arxiv.org/abs/1606.03401)
- 13. Periodic checkpointing in PyTorch. 2018 See https://pytorch.org/docs/stable/checkpoint.html.
- 14. Paszke A et al. 2017 Automatic differentiation in PyTorch. (https://openreview.net/pdf?id=BJJsrmfCZ)
- 15. Mueller M, Arzt A, Balke S, Dorfer M, Widmer G. 2019 Cross-modal music retrieval and applications: an overview of key methodologies. IEEE Signal Process. Mag. 36, 52–62. (doi:10.1109/MSP.2018.2868887)
- 16. Marin J, Biswas A, Ofli F, Hynes N, Salvador A, Aytar Y, Weber I, Torralba A. 2018 Recipe1M: a dataset for learning cross-modal embeddings for cooking recipes and food images. (http://arxiv.org/abs/1810.06553) (doi:10.1109/TPAMI.2019.2927476)
- 17. Bromley J, Guyon I, LeCun Y, Säckinger E, Shah R. 1994 Signature verification using a ‘siamese’ time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744. (https://papers.nips.cc/paper/769-signature-verification-using-a-siamese-time-delay-neural-network.pdf)
- 18. Du W, Fang M, Shen M. 2017 Siamese convolutional neural networks for authorship verification. (http://cs231n.stanford.edu/reports/2017/pdfs/801.pdf)
- 19. Masci J, Migliore D, Bronstein MM, Schmidhuber J. 2014 Descriptor learning for omnidirectional image matching. In Registration and recognition in images and videos, pp. 49–62. Berlin, Germany: Springer.
- 20. Hoffer E, Ailon N. 2015 Deep metric learning using triplet network. In Int. Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Berlin, Germany: Springer.
- 21. Rhu M, Gimelshein N, Clemons J, Zulfiqar A, Keckler SW. 2016 vDNN: virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM Int. Symp. on Microarchitecture, p. 18. Piscataway, NJ: IEEE Press.
- 22. Walther A. 1999 Program reversal schedules for single- and multi-processor machines. PhD thesis, Institute of Scientific Computing, Technical University Dresden, Germany.
- 23. Beaumont O, Herrmann J, Pallez G, Shilova A. 2019 Optimal memory-aware backpropagation of deep join networks. Research Report RR-9273, INRIA.