Author manuscript; available in PMC 2022 Jul 21. Published in final edited form as: Proc. ACM SIGPLAN Int. Conf. on Compiler Construction (CC 2022), 104–116. doi: 10.1145/3497776.3517766

Training of Deep Learning Pipelines on Memory-Constrained GPUs via Segmented Fused-Tiled Execution

Yufan Xu 1, Gerald Sabin 2, Saurabh Raje 3, Aravind Sukumaran-Rajam 4, Atanas Rountev 5, P Sadayappan 6
PMCID: PMC9302555  NIHMSID: NIHMS1819777  PMID: 35876769

Abstract

Training models with massive inputs is a significant challenge in the development of Deep Learning pipelines that process very large digital image datasets, as required by Whole Slide Imaging (WSI) in computational pathology and by the analysis of brain fMRI images in computational neuroscience. Graphics Processing Units (GPUs) are the primary workhorse for training and inference of Deep Learning models. To run inference or training of a neural network pipeline on GPUs, state-of-the-art machine learning frameworks like PyTorch and TensorFlow currently require that the collective memory on the GPUs be larger than the size of the activations at any stage in the pipeline. Consequently, existing Deep Learning pipelines for these use cases have been forced to adopt sub-optimal "patch-based" modeling approaches, where an image is processed in small segments. In this paper, we present a solution to this problem that combines tiling with check-pointing, enabling arbitrarily large images to be processed directly, irrespective of the size of global memory on a GPU and the number of available GPUs. Experimental results using PyTorch demonstrate enhanced functionality and performance over existing frameworks.

Keywords: DNN, GPU, Large image training, Fusion, Tiling, Memory-constrained execution, Checkpointing

1. Introduction

Deep learning has transformed many applications of image processing. However, a few domains with massive image data, such as digital pathology and brain fMRI analysis, face significant challenges in developing deep learning models due to memory limitations. Virtually all deep learning today uses the computational power of GPUs, which offer significant performance improvements over CPUs. But GPUs have much less memory (usually 32 GiB or less). Training of these Deep Learning pipelines requires that the activations computed at each layer in the forward pass be available to compute the gradients in the backward pass, where the layers are processed in reverse order. Therefore, popular machine learning frameworks like PyTorch [16] and TensorFlow [1] normally store the forward activations at all layers until the backward pass commences, and thus the total set of activations must fit within GPU global memory. While saving/reloading activations from host memory is possible, the low bandwidth between host and GPU has a drastic impact on performance, and hence this option is not used in PyTorch or TensorFlow. This memory-constrained usage limitation has forced researchers in these domains to use sub-optimal models, either by coarsening the input data (e.g., brain fMRI analysis [3]) or by using suboptimal "patch"-based modeling on smaller slices of data from full images (e.g., digital pathology [14]). In this paper, we develop a static compile-time analysis and transformation approach to overcome this problem, along with a demonstration via a prototype implementation in the popular PyTorch machine learning framework.

Our approach to enable the training of deep learning pipelines on memory-constrained GPUs is to combine check-pointing and recomputation with tiled execution. When the operators of a sequence of consecutive layers in a DNN pipeline are amenable to compatible tiling and fused execution of tiles across the layers, the memory requirements can be dramatically reduced. However, not all sequences of consecutive DNN layers can be compatibly tiled and fused. Therefore, we develop an approach to identify consecutive operators in a DNN pipeline that are mutually compatible for fused-tiled execution, which we term an FT segment in the DNN pipeline. We develop compile-time analyses for the identification of feasible FT segments, and the determination of effective tile sizes for efficient fused-tiled execution of the layers within an FT segment.

We use the name SFT for our approach: Segmented Fused-Tiled execution. The main contributions of the paper are:

  • An abstraction to characterize DNN operators and sequences of DNN operators with regards to compatibly tiled and fused (FT) execution (Sections 3 and 4.1);

  • A compile-time algorithm for partitioning the layers of a DNN pipeline into a sequence of FT segments for tiled execution with checkpoint/recompute (Section 4.2);

  • A compile-time algorithm for identifying tensor slice sizes for efficient fused-tiled execution of FT segments (Section 4.3);

  • A PyTorch-based implementation of the new SFT approach to train deep learning pipelines on a memory-constrained GPU (Section 5);

  • An experimental evaluation demonstrating efficient execution of DNN training pipelines with massive input images (up to 20K × 20K pixels) on a single GPU with only 11 GiB memory (Section 6).

2. Background

2.1. Forward and Backward Propagation

Figure 1 shows an example of a DNN training pipeline. During the forward pass, the forward operators (Fn, n = 1, 2, …) are evaluated in layer order. The input activation tensors for each neural network layer (In, n = 0, 1, 2, …) must be saved until they are used to compute the gradient by the appropriate backward operator, as shown by the diagonal edges between the forward and backward operators. After the output layer (F6 in this example), a loss function is evaluated and the gradient of the loss (G6) is computed to start the backward pass. In the backward pass, the operators (Bn, n = 1, 2, …) are evaluated in reverse order. Since all inputs In must be saved until the start of the back propagation, the memory requirement grows linearly with the number of neural network layers.
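The following minimal PyTorch sketch (our own illustration, not code from the paper; the layer sizes and the 4096 × 4096 input are arbitrary) makes this concrete: autograd retains every layer's input from the forward pass until the corresponding backward operator has run, so peak GPU memory grows with the number of layers and the activation sizes.

import torch
import torch.nn as nn

# A few convolution layers; with a large input, the saved activations
# dominate the memory footprint, not the model parameters.
net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
).cuda()

x = torch.randn(1, 3, 4096, 4096, device="cuda", requires_grad=True)
torch.cuda.reset_peak_memory_stats()
y = net(x)                       # forward pass: activations are retained
loss = y.square().mean()
loss.backward()                  # backward pass releases them layer by layer
print(torch.cuda.max_memory_allocated() / 2**30, "GiB peak")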

Figure 1. Example of a sequential DNN pipeline with 6 layers. For layer l, Fl is the forward operator and Bl is the corresponding backward operator generated by the ML framework; Il represents an input activation (I0 is the input) and Gl represents a gradient.

2.2. Memory Reduction via Checkpoint/Recompute

The total memory required for DNN training can be reduced by saving only a subset of activations during the forward pass and recomputing the unsaved activations when they are needed during the backward pass [7]. The nodes that save input activations in the forward pass are called checkpoint nodes, while the remaining “non-checkpoint” nodes release the memory for their activations after their use in the forward pass. Figure 2 shows a checkpoint strategy for the DNN pipeline from Figure 1. There are two checkpoint segments; vertical bars in the figure represent the checkpoint locations. The first segment contains F1 and F2, and the second segment contains all layers from F3 to the end of the network. During the backward pass within a segment, the activations of the forward operators of all layers are recomputed for all non-checkpoint nodes in the segment, and are kept in memory until they are used during the back propagation for that segment.
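PyTorch exposes this checkpoint/recompute strategy through torch.utils.checkpoint.checkpoint_sequential, discussed further in Section 5. A minimal usage sketch (the network and segment count below are illustrative, not taken from the paper):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
)

x = torch.randn(1, 3, 2048, 2048, requires_grad=True)
# Split the chain into 2 checkpoint segments: only the segment boundaries are
# kept during the forward pass; interior activations are recomputed in the
# backward pass (the local forward passes of Figure 2).
y = checkpoint_sequential(net, 2, x)
y.mean().backward()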

Figure 2. Checkpointing a network with 6 layers. The red dashed lines represent recomputation of the activations that were not stored; the backward pass therefore requires local forward passes.

Several efforts have developed schemes for checkpoint/recompute execution during training; an overview is presented by Rojas et al. [18]. However, none of these schemes can be used when the size of a single activation is too large to fit in GPU memory, i.e., the scenario we address in this paper.

2.3. Fused-tiled Execution

Tiling and fusion have been used in the design of accelerators for inference in DNNs [2, 22]. Tiling and fusion allow a subset (tile) of the input activation data to be moved into the accelerator, and then the tile is processed through a series of individual layers to generate the output tile. For such a fused-tiled execution, additional “halos” must be available for slices of input activations, as described below. Figure 3 shows a short network segment with a sequence of two 2D convolutional operators. Let the kernel size be R × S with a stride of 1. In a standard untiled 2D convolution, the input is padded/filled such that the output activation size matches the input activation size. The vertical fill size is fr = (R −1)/2 and the horizontal fill size is fs = (S − 1)/2, allowing the application of the kernel to boundary activations. The dashed box surrounding the entire activation represents the filled shape of the input, accounting for the padding.

Figure 3. Tile size computation in a convolutional network with 2D convolutions using stride 1 and a kernel size of R × S; the fill/padding values are fr = (R − 1)/2 and fs = (S − 1)/2.

In a tiled execution, each computational tile produces a 2D slice of the full activation. In order to produce a slice of size Th × Tw at the output of the second conv2d stage, a slightly larger input data slice of size (Th + 2fr) × (Tw + 2fs) is needed. Thus, in order to compute the Th × Tw output tile of the fused convolutions (the pink shaded area in Figure 3), the input to the second convolution must be of size (Th + 2fr) × (Tw + 2fs), represented by the central pink tile with the blue fill halo. Similarly, to produce the (Th + 2fr) × (Tw + 2fs) output tile after the first convolution, the input to the fused convolutions must be of size (Th + 4fr) × (Tw + 4fs) (the yellow, blue, and pink areas).
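The expanding halo can be computed mechanically by walking the chain of stages backwards from the desired output tile. The helper below is our own sketch (not the paper's code), for stride-1 convolutions only:

# kernels: list of (R, S) kernel sizes, ordered from the first layer to the
# last; returns the (height, width) of the input slice needed to produce a
# th x tw output tile after the whole chain (stride 1).
def required_input_tile(th, tw, kernels):
    for (r, s) in reversed(kernels):
        th, tw = th + (r - 1), tw + (s - 1)   # add a halo of 2*fr / 2*fs
    return th, tw

# Two 3x3 convolutions, as in Figure 3: each stage adds 2 pixels of halo
# in each dimension, so a 64 x 64 output tile needs a 68 x 68 input slice.
print(required_input_tile(64, 64, [(3, 3), (3, 3)]))   # (68, 68)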

The above example has only shown the expanding halo of the data slices that must be computed by a sequence of stages during forward propagation. For fused-tiled execution of the combined forward/backward pipeline for DNN training, additional inter-dependencies on tile sizes must be considered, as elaborated later in the paper. Another challenge is the identification of opportunities for fused-tiled execution for arbitrary DNN pipelines.

3. Overview of Solution

In this section, we describe our solution to the problem of training deep learning pipelines when GPU memory is insufficient to hold large activations, as encountered in the analysis of WSI (Whole Slide Imaging) in digital pathology. We devise an approach (the first to our knowledge) for fused-tiled execution of the combined operator graph comprised of the forward operators provided by the user and the backward operators automatically generated by an ML framework like PyTorch.

Figure 4 presents a high-level overview of our approach.

Figure 4. Overview of our approach: FT segmentation (Section 4.2), tile-size analyzer (Section 4.3), and PyTorch execution management (Section 5).

1. The first step in our analysis is the partitioning of an arbitrary DNN graph into segments of consecutive layers that can be compatibly fused and tiled. While the forward function can represent an arbitrary DAG, a linear order of execution of the layers (operators) of the forward graph is assumed to be pre-determined by the user, as is common in ML frameworks like TensorFlow and PyTorch. We find maximal sets of consecutive DNN layers whose operators are mutually compatible with respect to tiling and fusion. The entire DNN graph is partitioned into such FT sets, with saved activations (checkpoints) between segments and fused-tiled execution within each FT segment. We describe how we formalize compatibility of operators in Section 4.1 and details of the algorithm for identifying maximal FT segments in Section 4.2.

2. Within each FT segment, all operators can be executed in a fused-tiled fashion, with an identical number of tiles for all operators in the segment. However, the tile sizes for these operators have inter-dependencies that have to be analyzed to determine the minimal buffer sizes for correct fused-tiled execution of that FT segment. This analysis is described in Section 4.3.

3. Some details of our fused-tiled implementation in PyTorch are discussed in Section 5. Experimental results for three DNN pipelines (VGG-16, VGG-19 [19], and DarkNet[17]) are presented in Section 6, demonstrating the ability to process large images as needed for whole-slide image analysis in digital pathology [6] (results for 10K × 10K and 20K × 20K images are presented).

Figure 5 shows a small four-layer network, whose first three layers form a checkpoint segment, used to compare the dataflow of a baseline execution with that of the proposed segmented fused-tiled execution (our scheme). The steps involved in training with each of these implementations are given in Table 2. For the baseline execution, all activations In (n = 1, 2, 3) are saved in memory. As a result, the input activations simply flow forward through the network F1 . . . F4, the loss is computed, and then back propagation computes the gradients Gn (n = 1, 2, 3). All activations must be stored concurrently.

Figure 5. Tiling dataflow in a segment with 4 layers; the checkpoint segment contains operators 1, 2, and 3.

Table 2.

Sequences of computation for baseline training and segmented fused-tiled training. Fl(Ia^i) denotes the l-th forward layer applied to input activation Ia of the i-th tile; Bl(Ga^i) denotes the backward operator of layer l applied to gradient Ga^i. The index i in the Tile-Checkpoint schedule ranges over the tiles into which the input is partitioned. Saved activations are marked with an asterisk (*).

Base: F1(I0*) F2(I1*) F3(I2*) F4(I3) B4(Loss) B3(G3) B2(G2) B1(G1)

Tile-Checkpoint: [F1(I0^i) F2(I1^i) F3(I2^i*) F4(I3^i)] for i = 0..15; B4(Loss); [F1(I0^i*) F2(I1^i*) B3(G3^i) B2(G2^i) B1(G1^i)] for i = 0..15

The fused-tiled execution first breaks up the input activation into some number of tiles (e.g., 16), denoted with an index i, i = 0, . . ., 15, and then runs each tile through the forward pass of an FT segment. In the example, the forward and backward operations for all layers (including the loss) are computed after all tiles have gone through forward layer F3 of the depicted network segment. The back propagation for layers 1 through 4 then proceeds tile by tile as follows. Gradient G3^0 (i.e., tile 0, layer 3) is computed using the checkpointed activation I3^0 and the just-computed gradient G4 (i.e., a tile is recomputed). In order to compute I3^0, tile 0 is processed from checkpoint 1 through checkpoint 2 (i.e., through F1, F2, and F3). Note that during this forward recompute pass, all tile-0 activations are saved (I1^0 and I2^0 in this example). The recomputed I1^0 and I2^0 are used to compute the gradients G1^0 and G2^0. After all gradients are computed for tile 0, all of the temporarily saved recomputed tile activations are freed. Next, the remaining tiles (i.e., 1, . . ., 15 in this example) are processed sequentially.

In the fused-tiled execution, each checkpoint activation must be saved, along with the per-layer activations of a single tile (which can be made arbitrarily small). Within a fused-tiled segment, the full intermediate activations are therefore never materialized in memory, only their tiles. Adding more layers to an FT segment reduces the number of full activations that must be saved, while increasing the memory needed for the segment by only one tile's worth of activations per added layer. This provides significant memory savings compared to checkpoint/recompute alone, especially in many popular networks where the large activations occur between convolution and pooling layers. In these networks, the large activations never need to be fully saved, and nearly arbitrarily large input images can be processed.
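A back-of-the-envelope comparison conveys the magnitude of the savings. The sketch below uses our own illustrative numbers (one hypothetical segment of float32 activations, ending in a 2-strided pooling layer, split into 16 × 16 tiles), not measurements from the paper:

def gib(num_floats):
    return num_floats * 4 / 2**30          # float32 elements to GiB

H = W = 10_240                             # 10K x 10K input
chans = [3, 64, 64, 64]                    # channels of the segment's activations
full = sum(gib(c * H * W) for c in chans)  # store every full activation

tiles = 16                                 # tiles per spatial dimension
# SFT stores the segment's checkpoints (input, and output after a 2x2 pool) ...
ckpt = gib(chans[0] * H * W) + gib(chans[-1] * (H // 2) * (W // 2))
# ... plus one tile's worth of the interior activations at a time.
tile = sum(gib(c * (H // tiles) * (W // tiles)) for c in chans[1:-1])
print(f"all activations: {full:.1f} GiB   SFT working set: {ckpt + tile:.1f} GiB")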

4. Problem Formalization & Algorithm Details

4.1. Problem Formalization

An example of fused-tiled (FT) execution was seen in Fig. 3, where each of the 16 tiles of the second conv2d operator could be executed by fusion with a corresponding tile for the first operator. A chain of such conv2d operators can clearly also be executed in FT fashion. For a DAG of operators to be executable in a fused-tiled manner, each operator must be FT-compatible with respect to one or more pairs of compatible dimensions of input/output tensors, and the interconnected operators must be mutually FT-compatible. We formalize this below.

4.1.1. FT-compatible Operators.

An operator is defined as FT-compatible with respect to a pair of input/output tensor dimensions if a slice of the output tensor with extent Td along some dimension d can be computed using only a slice of the input tensor with extent σdTd + δd along the corresponding input dimension, for constants σd and δd. For example, consider the 2D convolution operator (for simplicity, without stride/dilation parameters):

Out[n, k, h, w] = Σ_{c,r,s} In[n, c, h + r, w + s] · Ker[k, c, r, s]    (1)

Consider a slice of the output tensor Out[Tn, Tk, Th, Tw], with slices of extent Tn, Tk, Th, Tw along the batch, channel, height, and width dimensions, respectively. In order to compute such a slice of the output tensor, only a subset of elements of the input tensor is needed. As previously illustrated in Figure 3, the minimal slice of the input tensor is of size In[Tn, C, Th + R − 1, Tw + S − 1], where C is the number of input channels and R and S are the stencil sizes along the height and width directions. Thus, the conv2d operator is FT-compatible with respect to the batch, height, and width dimensions of the input/output tensors, but not with respect to the channel dimension. The parameters relating the FT-compatible dimensions are: σn = 1, δn = 0; σh = 1, δh = R − 1; σw = 1, δw = S − 1.
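One way to encode this characterization is as a per-dimension (σ, δ) pair, as in the sketch below (our own encoding for illustration, not the paper's data structures; the values follow the stride-1 conv2d example above):

from dataclasses import dataclass

@dataclass
class DimParams:
    sigma: int   # scale relating output-slice extent to input-slice extent
    delta: int   # additive halo

def conv2d_ft_dims(R, S):
    # The channel dimension is a full reduction, hence not FT-compatible.
    return {
        "batch":  DimParams(sigma=1, delta=0),
        "height": DimParams(sigma=1, delta=R - 1),
        "width":  DimParams(sigma=1, delta=S - 1),
    }

def input_extent(p: DimParams, t_out: int) -> int:
    return p.sigma * t_out + p.delta

print(input_extent(conv2d_ft_dims(3, 3)["height"], 64))   # 66 = 64 + (R - 1)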

An operator with FT-compatible dimensions can be efficiently executed in a tiled manner, where slices of the output tensor are produced from slices of the input tensor. Although the slices of the input tensor required to produce disjoint slices of the output tensor are not disjoint (as illustrated in Fig. 3), the amount of redundant computation is relatively low when the slice sizes are chosen to be large. We define an FT-compatible segment as a group of connected operators in a DNN pipeline with mutually consistent FT-compatible dimensions.

4.1.2. FT-compatible Segments.

Two connected operators in a DNN pipeline are FT-compatible if they are both FT-compatible with respect to at least one common tensor dimension. An FT-compatible segment is a set of adjacent layers in a DNN pipeline whose operators are all mutually FT-compatible with respect to at least one common tensor dimension (we only reason about the forward operators, since the backward operators have the same FT-compatibility properties as the corresponding forward operators). The FT-compatible dimensions of an FT-compatible segment are the dimensions that are FT-compatible for every operator in the set of DNN layers constituting the segment. A maximal FT-compatible segment is one that cannot be extended on either side without violating FT-compatibility.
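The compatibility test for a candidate segment therefore reduces to intersecting the FT-compatible dimension sets of its operators, as in the short sketch below (operator names and dimension sets are illustrative, not the paper's code):

def segment_ft_dims(per_op_dims):
    """Return the dimensions along which every operator in the candidate
    segment is FT-compatible; an empty set means no common tiling exists."""
    common = set(per_op_dims[0])
    for dims in per_op_dims[1:]:
        common &= set(dims)
    return common

ops = [
    {"batch", "height", "width"},              # conv2d
    {"batch", "height", "width"},              # conv2d
    {"batch", "channel", "height", "width"},   # max pool
    {"batch"},                                 # e.g. a flatten + linear layer
]
print(segment_ft_dims(ops[:3]))   # {'batch', 'height', 'width'}
print(segment_ft_dims(ops))       # {'batch'} -- only batch-tiling remains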

Figure 6 shows a sequence of four operators (gray ovals) and the input/output tensors (yellow rectangles). Each operator's computation can be represented as a single perfectly nested loop, or a sequence of perfectly nested loops, that can be tiled with hyper-rectangular tiles. Further, i) any dimension of any tensor operand (input or output of the operator) can have only a single tileable loop index in its access expression, and ii) any loop index is used to index at most one dimension of any tensor. These properties define a map from each tensor operand's data dimensions to the operator's loop iteration-space indices, as illustrated in Figure 6. Consider the conv2d operator defined in Eq. 1. It represents a 7D loop nest with five tileable loops (we do not consider the small kernel stencil loops as tileable), corresponding to batch, input channel, output channel, image height, and image width. These five tileable iteration-space dimensions are represented as 5 vertices within each conv2d operator in Figure 6. Each input/output multi-dimensional tensor has a vertex for each distinct dimension, within the yellow rectangles representing the tensors (we do not explicitly model the conv2d operators' weight matrix (Ker) in this graph, only the tensors that "flow" along the edges of the forward operator graph). The maps between each tensor dimension and the corresponding iteration-space dimension of the operator are marked as edges connecting the corresponding vertices in the figure. The composition of these tensor-dimension-to-loop-index maps results in connected components in a graph comprised of the union of the vertices from all operators. In the example of Figure 6, there is a maximal FT-segment that includes all four operators with respect to the batch (B) index. However, the minimal tile sizes for such an FT-segment would require the full extents along H and W, which would be infeasible for the massive images in digital pathology WSI. But a smaller FT-segment exists, comprised of the first three operators (the first two convolutions and one max pooling), which is FT-compatible with respect to three indices (B, H, W); here much smaller tile sizes can be used because the H and W dimensions are also tileable for this smaller FT-segment.

Figure 6. Illustration of FT-compatible segments.

A connected component in this graph represents a set of loop indices, one in the iteration space of each operator it spans, that are mutually FT-compatible. Listing 2 shows the algorithm that identifies the dimensions in the operator graph that are compatible with fused-tiled execution (the connected components), and Listing 1 shows the corresponding data structures. First, we iterate over each node in the graph (Line 2). In Line 3, we set the connection map of the node to the empty set. Then, for the given node, we iterate over each input edge (Line 4). In Lines 7 and 8, we map each dimension of the input edge to the iteration space in tmpIterToInpMap, i.e., we construct an iteration-space-to-input-dimension map. In Line 9, we iterate over each output edge. Note that iterToOut contains the iteration-space-to-output-dimension map. Using tmpIterToInpMap and iterToOut, we map each input edge dimension to the corresponding output edge dimension (Lines 11 and 12). Thus, after execution of the algorithm in Listing 2, we have an internal representation of the graph with the information illustrated in the example of Figure 6.

Listing 2. Generate Graph IR.
1  constructGraph(Vector g, int memCapacity){
2   for(i = 0; i < len(g); i++)
3    g[i].connection[] = {}
4    for parent in g[i].inEdges
5     tmpIterToInpMap[] = {}
6     //connect input space to iteration space
7     for d in 0 to len(parent.inpToIter)
8      tmpIterToInpMap[parent.inpToIter[d]] = d
9     for outMeta in g[i].outEdges
10     //using iter. space, connect inp. to out. space
11     for inpDimId, iterId in tmpIterToInpMap
12      g[i].connection[<parent, inpDimId>].insert(<outMeta, outMeta.iterToOut[iterId]>)
13 }
Listing 1. Data Structures.
class EdgeMeta {
  Vector<int> size;
  Vector<int> inpToIter;
  Vector<int> iterToOut;
  Vector<float> delta;
  Vector<float> scale;
  OpMeta* src;
  OpMeta* target;
}
class OpMeta {
  int id;
  int readyCnt;
  Vector<int> iterSpace;
  Vector<int> fullIterSpace;
  Vector<EdgeMeta> inEdges;
  Vector<EdgeMeta> outEdges;
  OpMeta* dominator;
  OpMeta* postDominator;
  Map<pair<EdgeMeta*, int>, vector<pair<EdgeMeta*, int>>> connection;
}

4.2. Partitioning of DNN Pipeline into FT-compatible Segments

In this subsection, we explain how we partition the DNN pipeline into segments separated by checkpoints, such that the operators within a segment can be executed in a fused-tiled manner. Listing 3 shows the corresponding algorithms. Any node whose output fits within the memory capacity (mayTile) can be checkpointed, whereas a node whose output does not fit in memory (mustTile) must be executed using the fused-tiled strategy. Lines 2 to 4 classify each operator as "must tile" (mustTile == True) or "may tile" (mustTile == False) based on the output tensor size and the memory capacity. The loop in Line 5 iterates over each node in the operator graph. We skip "may tile" nodes in Line 6. For each "must tile" node, we try to find a fused-tileable path containing the current node that starts at a checkpoint and ends at a checkpoint (each checkpoint node must be "may tile"). For any such valid segment, the dominators of all nodes in the segment must either be a checkpoint ("may tile") or be present in the segment. Moreover, all post-dominators of a given node, except the post-dominator of the end node, must be present in the partition. Lines 10 to 18 ensure these properties. Initially, we add the start node to a queue (Line 10). For each node in the queue, we check whether its post-dominator is present in the current segment. If not, we move the endpoint of the current segment to a node that can be checkpointed and is topologically greater than or equal to the current post-dominator (Lines 14–17). All nodes between the old endpoint and the new endpoint are added to the queue in Line 18. As they are processed, their post-dominators are also included in the segment.

Listing 3. Partition DNN pipeline into Fused-Tileable segments.

1  findSegments(Vector g, int memCapacity){
2   for n in len(g)
3    mustTile[n] =
4      (Π(g[n].outEdges.size[:]) >= memCapacity)
5   for(i = 0; i < len(g); i = end)
6    if(!mustTile[i]) continue
7    start = i; end = i + 1
8    //include all post dominators
9    // in the current segment
10   Queue q{i}
11   while(!q.empty())
12    prev_end = end
13    n = q.pop()
14    if(g[n].postDominator.id >= end)
15     end = g[n].postDominator.id
16     while(end < len(g) && mustTile[end])
17      end++
18     q.enqueue(prev_end:end)
19   FTdims = FTdimsIntersect(g, start, end)
20   bool success = checkMemCapacity(start, end, FTdims);
21
22   if(!success) return(OUT_OF_MEMORY)
23 }
24
25 FTdimsIntersect(Vector g, int start, int end){
26  Map<<InTensorMeta*, int>, bool> FTdims
27  for parent in g[start].inEdges
28   for i in 0 to len(parent.size)
29    Queue q({g[start], parent, i})
30    while(!q.empty())
31     node, inpTensor, dim = q.pop()
32     if(len(node.connection[<inpTensor, dim>]) != len(node.outEdges))
33      FTdims[<parent, i>] = False
34      break
35     else
36      for(outMeta, odim) in node.connection[<inpTensor, dim>]
37       if(outMeta.target.id < end)
38        q.enqueue({outMeta.target, outMeta, odim})
39    if(<parent, i> not in FTdims) FTdims[<parent, i>] = True
40  return FTdims
41 }

Once a valid segment is identified, we determine all the compatible fused-tiled dimensions in the function FTdimsIntersect (Lines 25 to 40). This is done by taking each dimension of the start operator of the segment and checking whether there is a fused-tileable path from start to end. For each operator, we check whether the corresponding input dimension is fused-tileable (Lines 32 to 33). If it is not, we mark the dimension as not fused-tileable (False) and move to the next dimension. If it is, we enqueue all operators that use the current operator's output and are part of this segment, along with the corresponding dimension (Lines 36 to 38). Once all dimensions are processed, FTdims is returned, indicating whether each dimension is fused-tileable. In Line 20, we check whether the current partition fits in memory and, if not, we return an "OUT_OF_MEMORY" error (as a suitable SFT schedule was not possible).

4.3. Determination of Buffer Sizes for SFT Execution

Given a graph segmentation generated by the algorithm in Listing 3 and a minimal number of tiles, the exact required tile size of each operation must be computed. The checkpoints required to support a segmentation define the output tensors that must be stored and the dimensions that can be tiled. However, the exact tile sizes vary for each operator due to the relationship of the input tensors to the iteration space and output tensors of a given operation. For instance, a 2D max pool with a stride of 2 reduces the output width and output height by a factor of 2. In contrast, the [B, C, K, Th, Tw] iteration space of a 2D convolution operation generating an output tensor tile of size [B, K, Th, Tw] must have a valid input tensor tile of size [B, C, Th + 2fr, Tw + 2fs] (see Table 3 for the shapes of some common neural network operations).
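These shape relations (summarized in Table 3 below) can be expressed as simple functions from an iteration-space tile to the required input tile. The helpers below are our own sketch for the convolution and max pool cases (stride-1 convolution, p-strided pooling):

def conv_input_tile(tb, c, th, tw, fr, fs):
    # An output tile [Tb, K, Th, Tw] needs an input tile [Tb, C, Th+2fr, Tw+2fs].
    return (tb, c, th + 2 * fr, tw + 2 * fs)

def maxpool_input_tile(tb, tc, th, tw, p):
    # A p-strided pool shrinks H and W by p, so the input tile is p times larger.
    return (tb, tc, th * p, tw * p)

print(conv_input_tile(1, 3, 64, 64, fr=1, fs=1))   # (1, 3, 66, 66)
print(maxpool_input_tile(1, 64, 64, 64, p=2))      # (1, 64, 128, 128)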

Table 3.

Input tile shape required to process an iteration space tile and the corresponding output tensor shape for some common neural network forward operators.

Operator    | IN                           | Iter. Space        | OUT
Linear      | [Tb, C]                      | [Tb, C, K]         | [Tb, K]
Convolution | [Tb, C, Th + 2fr, Tw + 2fs]  | [Tb, C, K, Th, Tw] | [Tb, K, Th, Tw]
Max Pool    | [Tb, Tc, Th × p, Tw × p]     | [Tb, Tc, Th, Tw]   | [Tb, Tc, Th, Tw]
Avg Pool    | [Tb, C, H, W]                | [Tb, C, H, W]      | [Tb, C]
ReLU        | [Tb, Tc, Th, Tw]             | [Tb, Tc, Th, Tw]   | [Tb, Tc, Th, Tw]
Softmax     | [Tb, Tc, H, W]               | [Tb, Tc, H, W]     | [Tb, Tc, H, W]

The computeTileSize function in Listing 4 calculates the required iteration space and tensor tile size for every node in a segment’s autograd network graph. The required tensor dimensions are deduced from the characterization of each operation, as per the tensor dimension dependencies from Table 4. The iteration space is deduced using a mapping of the tensor dimensions to the iteration space dimensions.

Listing 4. computeTileSize.

1  computeTileSize(Graph auto_graph, Vector initialReadyNodes, int numTiles){
2   readyNodes.add(initialReadyNodes)
3   //Initialize the default iteration space
4   for n in auto_graph
5    n.iterSpace[:] = n.fullIterSpace[:]
6   for n in initialReadyNodes
7    n.iterSpace[:] = n.fullIterSpace[:] / numTiles
8   //Process all nodes whose outputs are "ready"
9   while(readyNodes.notEmpty())
10   n = readyNodes.pop()
11   //Aggregate output tensor sizes to determine
12   //iteration space required for node n
13   rs = {0} //reduced output tensor size
14   for outMeta in n.outEdges
15    rs[:] = max(rs[:], outMeta.size[:])
16   for(i, idx) in enumerate(n.iterToOut)
17    n.iterSpace[i] = rs[idx]
18   //Update required input tensor sizes
19   for iMeta in n.inEdges:
20    for(i, idx) in enumerate(iMeta.inpToIter)
21     iMeta.size[i] = n.iterSpace[idx] *
22             iMeta.scale[i] +
23             iMeta.delta[i]
24    iMeta.src.outEdges[i].size[:] = iMeta.size[:]
25    //Identify ready neighbors
26    iMeta.src.readyCnt++
27    if iMeta.src.readyCnt == len(iMeta.src.outEdges)
28     readyNodes.add(iMeta.src)
29 }

Table 4.

Tile propagation functions from the tile iteration space to the corresponding input tensor tile range for two common neural network operations. The δ and σ vectors define the input tensor extents required to compute an iteration-space tile (referenced through inpToIter), corresponding to the tensor shape requirements in Table 3.

Operation    | Convolution                                                 | Max Pool
Function     | Forward       | Grad          | ΔKer                        | Forward   | Grad
Input Tensor | act in        | grad in       | grad in    | act in         | act in    | grad in   | act in
Delta (δ)    | [0,0,2fr,2fs] | [0,0,2fr,2fs] | [0,0,0,0]  | [0,0,2fr,2fs]  | [0,0,0,0] | [0,0,0,0] | [0,0,0,0]
Scale (σ)    | [1,1,1,1]     | [1,1,1,1]     | [1,1,1,1]  | [1,1,1,1]      | [1,1,p,p] | [1,1,p,p] | [1,1,p,p]
inpToIter    | [0,1,3,4]     | [0,2,3,4]     | [0,2,3,4]  | [0,1,3,4]      | [0,1,2,3] | [0,1,2,3] | [0,1,2,3]

Table 4 defines the tile size required for each input tensor to process a tiled iteration space for two common operations (convolution and max pool), using the tiled tensor dimension requirements of Table 3 and the inherent characteristics of each operation. Each input tensor has a set of vectors that define the required input size during the computation of an iteration-space tile. Each vector has one entry per dimension of the tensor or iteration space (e.g., [Batch, Channel, Height, Width]). These metadata vectors support the tile-size propagation for each input tensor (iMeta) through each operation's iteration space via iMeta.size[i] = n.iterSpace[idx] * iMeta.scale[i] + iMeta.delta[i] in Lines 21 to 23. The delta (δ) and scale (σ) vectors contain the parameters described in Section 4.1.1. inpToIter defines the mapping between an input tensor's dimensions and the operation's iteration space, e.g., [0, 1, 2, 3] for max pool and [0, 1, 3, 4] for convolution, where iteration-space dimension 2 (K) is not indexed by the input tensor (which has dimensions [B, C, H, W]). Using these propagation functions, the required tile sizes for the source tensors of each edge can be computed and updated.
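As a concrete check of this propagation rule, the snippet below (our own sketch; the numbers are illustrative) applies size[i] = iterSpace[idx] · scale[i] + delta[i] with the Table 4 vectors for a forward convolution (fr = fs = 1) and a 2-strided max pool, using the [Batch, Channel, Height, Width] tensor layout:

def propagate(iter_space, inp_to_iter, scale, delta):
    # One input-tensor update step of computeTileSize (Listing 4, Lines 20-23).
    return [iter_space[idx] * s + d
            for idx, s, d in zip(inp_to_iter, scale, delta)]

# conv2d forward: iteration space [B, C, K, Th, Tw] = [1, 3, 64, 64, 64]
print(propagate([1, 3, 64, 64, 64], inp_to_iter=[0, 1, 3, 4],
                scale=[1, 1, 1, 1], delta=[0, 0, 2, 2]))     # [1, 3, 66, 66]

# 2x2 max pool forward: iteration space [B, C, Th, Tw] = [1, 64, 64, 64]
print(propagate([1, 64, 64, 64], inp_to_iter=[0, 1, 2, 3],
                scale=[1, 1, 2, 2], delta=[0, 0, 0, 0]))     # [1, 64, 128, 128]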

Figure 7 depicts an example autograd graph segment for a series of convolution (with kernels of size R × S) and max pooling (with a stride of p = 2) stages. The segment's IN and OUT tensors are checkpointed in memory. The forward graph is shown with red arrows, and the backward graph for the gradient (grad) and delta-kernel (Δker) operations is shown with green arrows. Pooling layers do not have any learned parameters and hence have no Δkernel update operation. The edges are annotated with the width and height tile size (Th = Tw), assuming a batch size Tb = B = 1 and that the channel dimension is not tileable due to the convolutions, as computed by Listing 4. The Δker nodes are leaf nodes that update the kernel but have no subsequent output tensors; therefore, these nodes base their input iteration-space requirements solely on the initial iterSpace. Note that while the output tile being generated is of size 64, the grad operation requires a halo of size 2·fr, and thus the forward pass must generate a tile of size 66.

Figure 7. Example neural network segment with two convolution and two max pooling layers. The number on each edge denotes the Th (and Tw) tile extent required for the tensor, assuming a global (fullIterationSpace) output tensor tile of [1, K, 64, 64], a global input iteration space of [1, C, 1024, 1024], and the creation of 4 tiles in each dimension (16 tiles total), i.e., IN[1, C, Th = 256, Tw = 256] and OUT[1, K, Th = 64, Tw = 64].

The computeTileSize function in Listing 4 computes the required iteration space (n.iterSpace) of every node, initialized in Line 5, and the tensor tile sizes (tensor.size) for the operations in the autograd graph segment, based on the number of tiles requested for each dimension (numTiles) in Line 7. The function performs a reverse traversal of the autograd computation graph (see the example graph in Figure 7) to calculate the tile sizes. The reverse traversal starts by adding all leaf nodes, such as Δker (nodes with no output edges in the segment), to a ready queue (readyNodes) in Line 2. Once all of a node's output edges are ready, its required tile sizes can be updated and the node can be added to readyNodes. Note that a node becomes ready when its last output neighbor marks it as ready (Lines 26 to 27).

The required input tensor tile sizes are computed iteratively using the propagation functions in Table 4 (for simplicity, without stride/dilation parameters). The iteration-space tile size is computed using a max reduction over all output tensors (Lines 13 to 15) and then mapped onto the iteration space (Lines 16 and 17). When readyNodes is empty, all nodes have been processed and the required tile sizes have been updated (this terminates the while loop in Line 9).

5. Implementation

We implement SFT in PyTorch¹ because it is a prevalent machine learning framework with support for checkpoint/recompute via nn.Sequential and checkpoint_sequential. Our SFT PyTorch implementation includes custom tiled forward operators, custom tiled autograd functions (i.e., backward functions), static analysis, memory management for tiling, and training.

The tool works in two phases: static analysis and regular training execution. The static analysis is performed only once, before the start of the tiled training execution. The primary PyTorch modifications include:

  • Customized tiled autograd functions: Because the tiled forward functions are not symmetric to the backward functions, we do not rely on the auto-generated gradient functions in PyTorch. The autograd functions must compute the "ghost" cells (determined statically), and the ghost cells may differ between the forward and backward operations. Further, the updates by the Δkernel operation must correspond to the correct tile despite the overlapping "ghost" cells. These details require modifications to the existing autograd functions.

  • Optimized PyTorch internal context manager for recomputation in checkpoint segments across different tiled executions: By default, PyTorch uses a naive context-management strategy to save activations and metadata for recomputation. There is no memory buffer reuse across multiple tiled executions, which wastes GPU memory and eventually results in out-of-memory errors. Also, redundant buffer allocations slow down the training process. We optimize the existing context manager by using caching to share the same buffer across different tiled executions.

  • Two new operators to handle our SFT checkpoint segments: The split node starts the segment and fetches the tiled input data. The join node marks the end of the segment, collects all tiled output pieces, and merges them into one tensor for further computation. We also extended the nn.Sequential container to support tiled execution with checkpointing. The existing nn.Sequential in PyTorch only accepts a tensor as input and produces an output tensor. However, SFT execution requires additional meta-information such as tile size, tile position, and padding size, which is obtained from our static tile-size analyzer (Figure 4). This information is propagated along the chain of operators to provide the required metadata for forward and backward propagation.

We have intentionally created our APIs to be very close to the existing PyTorch API so that the users can easily port existing code to use the SFT framework.
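As an illustration of the intended usage (the names TiledSequential and num_tiles below are hypothetical placeholders, not the actual API of the uu module; see the artifact repository for the real interface), an SFT training step is meant to look essentially like the plain nn.Sequential pattern:

import torch
import torch.nn as nn

layers = [
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
]
x = torch.randn(1, 3, 2048, 2048, requires_grad=True)

baseline = nn.Sequential(*layers)      # standard PyTorch usage
out = baseline(x)
out.mean().backward()

# Hypothetical SFT entry point (placeholder names): the static analyzer picks
# FT segments and tile sizes; a split node tiles x and a join node reassembles
# the output, so the call site barely changes.
# model = TiledSequential(*layers, num_tiles=(16, 16))
# out = model(x)
# out.mean().backward()   # tiled backward uses the statically computed halos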

6. Evaluation

6.1. Experimental Setup

All experiments presented in this paper are performed using PyTorch v1.8.0a0 and Python v3.9.5. We used the Nvidia CUDA compiler v11.3.109 and the Nvidia deep neural network library (cuDNN) v8.2. The experiments were carried out on two Nvidia GPU systems. The first platform pairs an Nvidia 2080 Ti GPU with an AMD Ryzen Threadripper 3990X 64-core CPU and 128 GiB of CPU RAM. The second platform pairs an Nvidia A100 GPU with a 56-core Intel Xeon Gold 6238R CPU @ 2.20 GHz and 384 GiB of CPU RAM. Both machines run Ubuntu 20.04. These machines represent different GPU generations (Turing and Ampere). We evaluate the VGG-16, VGG-19 [19], and DarkNet [17] networks to demonstrate the efficacy of our approach.

We evaluate three strategies to perform a training epoch for a single input image (one forward and one backward pass using a batch size of 1) using a Mean Squared Error (MSE) loss function:

  1. Pytorch-Base: A standard PyTorch [16] implementation of training that computes the forward and backward operations, storing all intermediate activations.

  2. Pytorch-CheckPoint: A sequential checkpoint strategy using the checkpoint_sequential function [8] in PyTorch. This strategy splits the network into a given number of segments s. We evaluated s values from 2 (always included) to 2L, where L is the length of the chain (at which point the total number of stored activation tensors is minimized).

  3. SFT: The Segmented Fused-Tiled training strategy outlined in the paper. We split the original input along the height and width dimensions and execute the input in segmented tiles with our checkpoint/recompute PyTorch implementation.

For each model, we vary the input image size from 512 × 512 to 20480 × 20480 (20K). Image sizes of 10K and 20K represent ×4 and ×2 magnification for WSI used in pathology. The number of tiles along the H and W dimensions is varied as 2^n, where n = 1, 2, 3, 4, 5 (i.e., 5 different configurations are evaluated).

We measure the execution time for each of the three strategies and report the mean over 5 runs. The execution time for our method is the best value among the 5 tile-size configurations. The execution time is stable for all three strategies across repeated executions. The input image and model use float32 precision, and the image data layout is NCHW.
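For reference, a GPU timing harness along these lines (our own sketch, not the artifact's benchmarking script) must synchronize the device around each measured step so that asynchronous CUDA kernel launches are fully accounted for:

import time
import torch

def time_training_step(step_fn, runs=5):
    """Run step_fn (one forward + backward pass) `runs` times and return the
    mean wall-clock time in seconds."""
    times = []
    for _ in range(runs):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        step_fn()
        torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)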

6.2. Experimental Results

Figure 8 shows the experimental results. The red cross marks the execution time obtained by the standard Pytorch-Base strategy; its absence from a plot means that an out-of-memory error was encountered when attempting to train on an image of the given size. Yellow crosses show the results obtained with the Pytorch-Checkpoint strategy for the shortest execution time over all evaluated numbers of segments (from 2 to 2L). If both Pytorch-Base and Pytorch-Checkpoint succeed, we only show the execution time of Pytorch-Base (as checkpoint/recompute is not required). The blue dots show the results obtained with our SFT strategy. The image size (H == W) is plotted on the X-axis and the execution time (in seconds) on the Y-axis. Since the A100 has more device memory than the 2080 Ti (40 GiB compared to 11 GiB), the non-tiled strategies can process larger input images on the A100. Since the VGG-16, VGG-19, and DarkNet networks do not have a huge number of trainable parameters (59 MB for VGG-16, 78.4 MB for VGG-19, and 79.7 MB for DarkNet), the GPU device memory is mainly consumed by the inputs and intermediate activations required during training. For example, for an input image of size 10K × 10K stored in single precision (float32), a total GPU memory of 118 GiB, 129 GiB, and 43 GiB, respectively, would be needed for VGG-16, VGG-19, and DarkNet to store all activations and model parameters.
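The dominance of activations over weights is easy to verify with a rough calculation. The sketch below (our own arithmetic over a simplified VGG-style prefix, not the exact networks used in the experiments) sums float32 activation sizes for a 10K × 10K input; the first few layers alone already exceed the memory of any single GPU:

def activation_gib(shapes):
    # shapes: list of (channels, height, width) activation tensors, float32
    return sum(c * h * w for (c, h, w) in shapes) * 4 / 2**30

H = W = 10_240
shapes = [
    (3, H, W),                                    # input image
    (64, H, W), (64, H, W),                       # conv1_1, conv1_2
    (64, H // 2, W // 2),                         # pool1
    (128, H // 2, W // 2), (128, H // 2, W // 2)  # conv2_1, conv2_2
]
print(f"{activation_gib(shapes):.0f} GiB of activations for the first few layers")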

Figure 8. Experimental results for VGG-16, VGG-19, and DarkNet on Nvidia A100 and 2080 Ti.

In all six plots, we observe that SFT training is able to process much larger input images than either the standard Pytorch-Base strategy or the Pytorch-Checkpoint strategy. The vertical grey dashed line marks the first image size that neither Pytorch-Base nor Pytorch-Checkpoint can handle for a given network and GPU pair. SFT has a small runtime overhead (compared to Pytorch-Base or Pytorch-Checkpoint) when the input activations fit in memory, and its runtime continues to scale linearly with the number of pixels for image sizes beyond that point.

7. Related Work

7.1. Operator Fusion

Considerable efforts have been directed at operator fusion, where two or more operators in a DNN pipeline are merged together to create a single combined kernel. The benefits of operator fusion include a reduction in kernel launch overhead. For example, this is a key optimization performed by the XLA [20] compiler. Several other efforts have also addressed such operator fusion [4, 15].

We note that "fusion" as used in this work is rather different from the operator fusion described above. Instead of generating a single combined kernel, we fuse the execution of corresponding tiles across a sequence of stages by invoking each operator's existing kernel on small tiles of data. Thus, we do not generate any fused kernel code, but simply reuse existing kernels, changing the size of the input/output activations and the order of execution relative to standard execution of the operators.

7.2. Out-of-core Training

KARMA [21] is a system that combines out-of-core training with checkpointing. Out-of-core training involves swapping out large activations from the GPU to the CPU memory to alleviate the memory bottleneck. Using a novel performance model, KARMA automates this decision-making of swapping versus recomputing activations for a given neural network. However, KARMA does not enable processing large-scale images where a single activation layer does not fit on a GPU.

7.3. Multi-GPU Model Parallelism

Recent work leverages "model parallelism" [5, 23] to distribute the activations required during training across multiple GPUs. For instance, GEMS [12] provides a system for hybrid model and data parallelism, along with relaxed synchronization. While whole slide imaging is mentioned as a potential application, their approach uses the well-known "patch"-based analysis; specifically, the experiments are conducted on 1K × 1K inputs. Other prior approaches based on model parallelism also require this sub-optimal technique to handle large WSI images.

7.4. Unified Memory

TensorFlow Huge Model Support (HMS) [6] incorporates Nvidia unified memory (UM) [13] along with several GPU memory optimization techniques to train standard CNNs. This work can support whole slide imaging (although only ×4 resolution is demonstrated, and training requires hundreds of GPU days). However, the open-source implementation is not stable enough to run common neural networks like VGG-16, so we could not compare it against SFT. For a single-GPU setting, swapping sections of the image back and forth between host memory and device memory makes the entire training process bandwidth bound: the PCIe bandwidth is about 5 orders of magnitude lower than the GPU peak performance at 32-bit precision. As a result, as reported by Chen et al. [6], HMS becomes 30× slower as the image size (in pixels) quadruples (from an 11K × 11K image to a 21.5K × 21.5K image), while the SFT time increases by only 4×, i.e., proportionally to the increase in image size.

7.5. Convolutions Over Distributed Memory

DistConv [9] is an extension of the LBANN toolkit [10] that performs distributed convolutions over a cluster of GPUs. Given the size of the image, distributed convolutions place a lower bound on the number of GPUs required to train the network. For a 100K × 100K image, for example, twenty 32 GB GPUs would be required simply to hold the input image. The first stage of an image DNN pipeline typically increases the size of the output activations by a factor of 5–10 (the number of channels increases from 3 to at least 32, while the image height/width is halved), further increasing the number of GPUs required. We were also unable to set up LBANN following the provided documentation; version incompatibilities among its dependencies cause multiple compilation and linkage errors.

DistDL [11] shows performance improvements over DistConv and provides a PyTorch extension to partition a large image into multiple smaller disjoint images over a cluster of CPUs. However, the current implementation does not have GPU backend support, making an empirical comparison impossible.

8. Conclusion

This paper develops a segmented fused-tiled (SFT) approach to enable the training of Deep Neural Networks on very large images, overcoming a significant limitation of popular ML frameworks like PyTorch and TensorFlow. We develop algorithms to identify fused-tiled-compatible segments and to determine the memory requirements of those segments. Fused-tiled execution was implemented in the PyTorch framework and evaluated experimentally with the VGG-16, VGG-19, and DarkNet DNN pipelines on Nvidia 2080 Ti and A100 machines. The implementation shows that the tiled execution incurs minimal overhead and scales well to large input image sizes. Our developments enable arbitrarily large input images to be used directly for training machine learning models, instead of the current practice of image pre-coarsening or patch-based processing in domains like digital pathology and computational neuroscience.

Table 1.

Description of convolution parameters.

Symbol | Description       | Symbol | Description
B      | Batch size        | K      | Output channels
H      | Height of input   | R      | Height of kernel/filter
W      | Width of input    | S      | Width of kernel/filter
Oh     | Height of output  | fr     | Padding value in H
Ow     | Width of output   | fs     | Padding value in W
C      | Input channels    | p      | Stride size

CCS Concepts:

• Computing methodologies → Modeling methodologies; Machine learning; Neural networks; • Software and its engineering → Software performance; Compilers.

Acknowledgments

Research reported in this publication was supported in part by the National Institute Of Biomedical Imaging And Bioengineering of the National Institutes of Health under Award Number R41EB032722, and by the National Science Foundation through awards 2018016, 2119677, and 2118737. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.

A. Artifact Appendix

A.1. Abstract

The artifact contains all the scripts and data required to reproduce the experimental results in the CC 2022 paper titled "Training of Deep Learning Pipelines on Memory-Constrained GPUs via Segmented Fused-Tiled Execution". The git repository contains:

  • The SFT source code;

  • The scripts to measure execution time of default PyTorch, PyTorch checkpoint, and SFT;

  • Raw data that we used to plot Fig. 8 (for comparison).

A.2. Artifact check-list (meta-information)

  • Program: A PyTorch-based Python module (uu) to facilitate Segmented Fused-Tiled Execution (SFT).

  • Compilation: Detailed instructions to compile the different frameworks and scripts to run each framework are provided below. A copy of these instructions can also be found at: https://github.com/HPCRL/SFT-CC2022-AE/blob/main/README.md.

  • Run-time environment: GCC >= 8.5; CUDA 11.3.0; cuDNN v8.2.0; Linux platform such as Ubuntu or CentOS.

  • Hardware: Nvidia 2080Ti or Nvidia A100.

  • Execution: All scripts to reproduce the results are provided in uu/benchmarking folder.

  • Output: The script reports the forward, backward, and total execution time for one input image per row, separated by commas. We use the total execution time to plot Fig. 8.

  • How much disk space required (approximately)?: > 50 GB.

  • How much time is needed to prepare the workflow (approximately)?: Creating the conda virtual environment and installing dependencies should take less than 5 minutes. Building PyTorch from source takes around 1 hour.

  • How much time is needed to complete experiments (approximately)?: Should be less than 30 mins.

  • Publicly available?: Yes

A.3. Description

A.3.1. How Delivered.

Our artifact is available on a public git repository: https://github.com/HPCRL/SFT-CC2022-AE

A.3.2. Hardware Dependencies.

Nvidia 2080Ti or Nvidia A100.

A.3.3. Software Dependencies.

Conda; CUDA 11.3.0; cuDNN v8.2.0; PyTorch v1.8; Linux.

A.4. Installation

Clone the repository (recursively):

https://github.com/HPCRL/SFT-CC2022-AE

See the following file for instructions: https://github.com/HPCRL/SFT-CC2022-AE/blob/main/README.md

PyTorch and the uu module should be built before evaluation.

A.5. Experiment Workflow

Scripts are provided to run two different CNN models (VGG-16 and Darknet-19).

For VGG16 network:

$ cd uu/benchmarking

$ bash run-vgg.sh

For Darknet network:

$ cd uu/benchmarking

$ bash run-darknet.sh

For large images (10K × 10K and 20K × 20K):

$ cd uu/benchmarking

$ bash large.sh

A.6. Evaluation and Expected Result

We expect the performance results to be close to those reported in the paper (Fig. 8). The results of the benchmark will be printed out in text files. We suggest using 2080Ti or A100 and the exact same CUDA and cuDNN versions to reproduce the results presented in the experimental section. Different generations of GPU devices and different versions of the CUDA/cuDNN might produce different execution times and memory behaviors. We have included the raw data from our experiments in the uu/data-file/ folder.

As the image size (H, W) increases, the default PyTorch and PyTorch-checkpoint strategies will fail beyond a threshold and report an out-of-memory error. In contrast, with SFT we can increase the number of tiles for big images, which guarantees successful execution. We tested two CNN networks on the 2080 Ti and A100 machines. For a 10K × 10K image, we can use 16 × 16 tiles, and for a 20K × 20K image, we can use 32 × 32 tiles (see uu/benchmarking/large.sh).

Footnotes

ACM Reference Format:

Yufan Xu, Saurabh Raje, Atanas Rountev, Gerald Sabin, Aravind Sukumaran-Rajam, and P. Sadayappan. 2022. Training of Deep Learning Pipelines on Memory-Constrained GPUs via Segmented Fused-Tiled Execution. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction (CC '22), April 02–03, 2022, Seoul, South Korea. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3497776.3517766

1

The software is available at https://github.com/HPCRL/SFT-CC2022-AE

Contributor Information

Yufan Xu, University of Utah, Salt Lake City, Utah, USA.

Gerald Sabin, RNET Technologies, Dayton, Ohio, USA.

Saurabh Raje, University of Utah, Salt Lake City, Utah, USA.

Aravind Sukumaran-Rajam, Washington State University, Pullman, Washington, USA.

Atanas Rountev, Ohio State University, Columbus, Ohio, USA.

P. Sadayappan, University of Utah, Salt Lake City, Utah, USA

References

  • [1] Abadi Martín, Agarwal Ashish, Barham Paul, Brevdo Eugene, Chen Zhifeng, Citro Craig, Corrado Greg S., Davis Andy, Dean Jeffrey, Devin Matthieu, Ghemawat Sanjay, Goodfellow Ian, Harp Andrew, Irving Geoffrey, Isard Michael, Jia Yangqing, Jozefowicz Rafal, Kaiser Lukasz, Kudlur Manjunath, Levenberg Josh, Mané Dandelion, Monga Rajat, Moore Sherry, Murray Derek, Olah Chris, Schuster Mike, Shlens Jonathon, Steiner Benoit, Sutskever Ilya, Talwar Kunal, Tucker Paul, Vanhoucke Vincent, Vasudevan Vijay, Viégas Fernanda, Vinyals Oriol, Warden Pete, Wattenberg Martin, Wicke Martin, Yu Yuan, and Zheng Xiaoqiang. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/. Software available from tensorflow.org.
  • [2] Alwani Manoj, Chen Han, Ferdman Michael, and Milder Peter. 2016. Fused-layer CNN accelerators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1–12. 10.1109/MICRO.2016.7783725
  • [3] Bhalodia Riddhish, Elhabian Shireen Y, Kavan Ladislav, and Whitaker Ross T. 2018. DeepSSM: A deep learning framework for statistical shape modeling from raw images. In International Workshop on Shape in Medical Imaging. Springer, 244–257. 10.1007/978-3-030-04747-4_23
  • [4] Boehm Matthias, Reinwald Berthold, Hutchison Dylan, Evfimievski Alexandre V, and Sen Prithviraj. 2018. On optimizing operator fusion plans for large-scale machine learning in SystemML. arXiv preprint arXiv:1801.00829 (2018).
  • [5] Chen Chi-Chung, Yang Chia-Lin, and Cheng Hsiang-Yun. 2018. Efficient and robust parallel DNN training through model parallelism on multi-GPU platform. arXiv preprint arXiv:1809.02839 (2018).
  • [6] Chen Chi-Long, Chen Chi-Chung, Yu Wei-Hsiang, Chen Szu-Hua, Chang Yu-Chan, Hsu Tai-I, Hsiao Michael, Yeh Chao-Yuan, and Chen Cheng-Yu. 2021. An annotation-free whole-slide training approach to pathological classification of lung cancer types using deep learning. Nature Communications 12, 1 (2021), 1–13. 10.1038/s41467-021-21467-y
  • [7] Chen Tianqi, Xu Bing, Zhang Chiyuan, and Guestrin Carlos. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016).
  • [8] Torch Contributors. 2018. Periodic checkpointing in PyTorch. https://pytorch.org/docs/stable/checkpoint.html
  • [9] Dryden Nikoli, Maruyama Naoya, Benson Tom, Moon Tim, Snir Marc, and Van Essen Brian. 2019. Improving strong-scaling of CNN training by exploiting finer-grained parallelism. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 210–220. 10.1109/IPDPS.2019.00031
  • [10] Van Essen Brian, Kim Hyojin, Pearce Roger A., Boakye Kofi, and Chen Barry. 2015. LBANN: Livermore big artificial neural network HPC toolkit. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, MLHPC 2015, Austin, Texas, USA, November 15, 2015. ACM, 5:1–5:6. 10.1145/2834892.2834897
  • [11] Hewett Russell J and Grady Thomas J II. 2020. A linear algebraic approach to model parallelism in deep learning. arXiv preprint arXiv:2006.03108 (2020).
  • [12] Jain Arpan, Awan Ammar Ahmad, Aljuhani Asmaa M., Maqbool Hashmi Jahanzeb, Anthony Quentin G., Subramoni Hari, Panda Dhableswar K., Machiraju Raghu, and Parwani Anil. 2020. GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15. 10.1109/SC41405.2020.00049
  • [13] Li Wenqiang, Jin Guanghao, Cui Xuewen, and See Simon. 2015. An Evaluation of Unified Memory Technology on NVIDIA GPUs. In 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 1092–1098. 10.1109/CCGrid.2015.105
  • [14] Mobadersany Pooya, Cooper Lee AD, and Goldstein Jeffery A. 2021. GestAltNet: aggregation and attention to improve deep learning of gestational age from placental whole-slide images. Laboratory Investigation (2021), 1–10. 10.1038/s41374-021-00579-5
  • [15] Niu Wei, Guan Jiexiong, Wang Yanzhi, Agrawal Gagan, and Ren Bin. 2021. DNNFusion: accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 883–898. 10.1145/3453483.3454083
  • [16] Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, Desmaison Alban, Kopf Andreas, Yang Edward, DeVito Zachary, Raison Martin, Tejani Alykhan, Chilamkurthy Sasank, Steiner Benoit, Fang Lu, Bai Junjie, and Chintala Soumith. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, and Garnett R (Eds.). Curran Associates, Inc., 8024–8035. 10.5555/3454287.3455008
  • [17] Redmon Joseph. 2013–2016. Darknet: Open Source Neural Networks in C. http://pjreddie.com/darknet/
  • [18] Rojas Elvis, Kahira Albert Njoroge, Meneses Esteban, Bautista-Gomez Leonardo, and Badia Rosa M. 2020. A Study of Checkpointing in Large Scale Training of Deep Neural Networks. CoRR abs/2012.00825 (2020). arXiv:2012.00825 https://arxiv.org/abs/2012.00825
  • [19] Simonyan Karen and Zisserman Andrew. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • [20] TensorFlow. 2019. XLA: Optimizing Compiler for TensorFlow. https://www.tensorflow.org/xla
  • [21] Wahib Mohamed, Zhang Haoyu, Thao Nguyen Truong, Drozd Aleksandr, Domke Jens, Zhang Lingqi, Takano Ryousei, and Matsuoka Satoshi. 2020. Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA. CoRR abs/2008.11421 (2020). 10.5555/3433701.3433726 arXiv:2008.11421
  • [22] Zhao Zhuoran, Mirzazad Barijough Kamyar, and Gerstlauer Andreas. 2018. DeepThings: Distributed adaptive deep learning inference on resource-constrained IoT edge clusters. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 11 (2018), 2348–2359. 10.1109/TCAD.2018.2858384
  • [23] Zhu Wentao, Zhao Can, Li Wenqi, Roth Holger, Xu Ziyue, and Xu Daguang. 2020. LAMP: Large deep nets with automated model parallelism for image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 374–384. 10.1007/978-3-030-59719-1_37
