Abstract
The analysis of human body composition plays a critical role in health management and disease prevention. However, current medical technologies that accurately assess body composition, such as dual-energy X-ray absorptiometry, computed tomography, and magnetic resonance imaging, have the disadvantages of prohibitive cost or ionizing radiation. Recently, body shape based techniques using body scanners and depth cameras have brought new opportunities for improving body composition estimation by intelligently analyzing body shape descriptors. In this paper, we present a multi-task deep neural network method utilizing a conditional generative adversarial network to predict pixel level body composition using only 3D body surfaces. The proposed method can predict 2D subcutaneous and visceral fat maps in a single network with high accuracy. We further introduce a patch discriminator which improves the textural accuracy of the 2D fat maps. The validity and effectiveness of our new method are demonstrated experimentally on the TCIA and LiTS datasets. Our proposed approach outperforms competitive methods by at least 41.3% for the whole body fat percentage, 33.1% for the subcutaneous and visceral fat percentage, and 4.1% for the regional fat predictions.
Keywords: Body Composition Analysis, Conditional Generative Adversarial Network
1. Introduction
The prevalence of obesity in the United States and the world has risen to epidemic proportions. There is a clear association between body fat and adverse health consequences such as cardiovascular diseases, dyslipidemia, hypertension, type 2 diabetes mellitus, and several cancers [1, 2], so it is critical to determine the amount of body fat as well as its distribution [3]. Body fat can be classified into subcutaneous adipose tissue (SAT) and visceral adipose tissue (VAT). SAT lies under the skin, while VAT is usually located around several vital organs. The spatial patterns of SAT and VAT are different, and VAT is more strongly related to health risk factors than SAT [2, 4, 5]. SAT and VAT are usually predicted using separate models and are seldom modeled together [6, 7]. Modeling SAT and VAT separately is generally undesirable since it compromises computational efficiency and ignores their mutual correlation. To avoid this problem, we propose a model that can predict SAT and VAT simultaneously.
There have been some rigorous clinical studies that investigate the body composition using high-end medical instruments such as dual-energy X-ray absorptiometry (DXA), magnetic resonance imaging (MRI), and computed tomography (CT) [3, 8, 9]. However, these methods have the disadvantages of ionizing radiation exposure (CT, DXA), requiring professional operators, and relatively high cost, which prevent them from becoming convenient public health and clinical tools to assess body composition. Beyond these high-end instruments, there are commodity approaches such as bioelectrical impedance analysis (BIA) and air displacement plethysmography (ADP) that are widely used to estimate whole-body composition [10]. These approaches are time efficient and highly applicable in large-scale studies. Unfortunately, they are not suitable for pixel level SAT or VAT estimation due to their relatively low prediction accuracy.
Alternatively, anthropometric features are commonly used to indicate the degree of obesity. Body mass index (BMI) is an efficient indicator, but it is inherently limited in discriminating between fat mass (FM) and fat free mass (FFM), which may misclassify muscular lean people with high body densities as obese [8, 1]. Body composition, especially body fat, directly changes body shape, so there is a high correlation between body fat and body shape [2, 6]. With the explosive growth of 3D scan technologies, and easy access to 3D body shape data (either at home, the gym, or the clinic, using commodity body scanning systems [11]), there have been many approaches to predicting body fat from the parameters of body shape [2, 3, 8]. Most body shape based methods estimate either whole body fat percentage or fat volume, but do not focus on the fat amount at pixel level, despite its significant role in diagnosing morbidity, e.g., steatosis and fibrosis of the liver and cardiac diseases [8, 12].
Pixel level fat prediction from body shape is understudied. There are several studies in which an indirect mapping between shape descriptors and pixel level fat was considered [8, 13]. These methods first utilize principal component analysis (PCA) to transform the shape descriptors and corresponding fat locations into scalar values, and then apply regression methods to construct the mapping. However, these approaches cannot predict the 2D fat directly from body shape. To apply PCA, both the body shape and its fat location must be registered to their corresponding canonical templates [8, 11]. The registration deforms the profiles of the body shape and its corresponding fat location and results in a loss of shape information.
In recent years, deep convolutional networks have been shown to perform effectively in pixel-wise prediction [10, 14]. The most popular framework is the encoder-decoder structure [15, 16], which has been widely used in medical image processing for tasks such as lesion detection and organ segmentation [17]. The conditional Generative Adversarial Network (cGAN) is able to learn the representation of images and condition variants and generate images accordingly [18, 19]. Thus, a cGAN is able to predict a 2D fat map from the corresponding 2D projected body shape.
Unlike past works, e.g., [18, 20, 21], in this paper we explore a multi-task cGAN which embeds two generative processes in a single module to generate the SAT and VAT maps from corresponding 2D projected body shapes. We also introduce a patch discriminator, which focuses on the structure of the generated 2D fat maps, to enhance the prediction accuracy. To the best of our knowledge, our work is the first to utilize a neural network for 2D fat prediction based on body shape.
In this paper, we aim to provide a point prediction of the fat amount at each pixel rather than the categories of each pixel, so our problem is different from classical segmentation and object detection. Note that the fat amount at each pixel we aim to predict is a subject-specific random variable, rather than a population parameter, so the confidence interval or p-value is not applicable here for statistical inference (see [22, 23] and Chapters 8 and 9 in [24]). Although the prediction interval is a valid inferential tool, its development requires a substantial theoretical analysis, which is beyond the scope of this paper, so we will study it as a future research topic.
The main contributions of this paper are threefold.
We design a multi-task deep neural network utilizing a cGAN to predict the pixel level body composition using only body shape. The proposed method can predict 2D subcutaneous and visceral fat maps in a single network with a high accuracy.
We design a highly computationally efficient multi-task generator that predicts the SAT and VAT using a single neural network by sharing the same inputs. This design makes the network robust and efficient.
We design a neural-network-based patch discriminator and hybrid loss functions to enhance the accuracy. Our approach incorporates a protocol for selecting the loss weights and patch size to find the optimal model parameters.
The remainder of the paper proceeds as follows. The body composition assessment techniques are summarized in RELATED WORK. The details of our method are presented in METHODOLOGY and the dataset is introduced in DATA PREPROCESSING. We demonstrate our implementation details, including algorithm validations and prediction accuracy evaluations in EXPERIMENTS. The paper ends with CONCLUSION.
2. Related work
Medical approaches to determine body composition.
CT and MRI scans [3, 9] are considered the gold standard of body composition assessment. Both CT and MRI can be used to accurately differentiate body fat from other tissues and provide accurate fat estimation at the voxel level [25]. DXA is a two-dimensional X-ray based imaging technique. By using two different energy levels, the images can be separated into three components: bone, lean tissue, and fat. DXA can provide a pixel level fat estimation by using anatomical models. Nonetheless, these techniques are impractical for routine or widespread use and are not always readily available for clinical applications due to their high costs [3].
Body shape based body composition approaches.
Beyond the clinically-available methods, anthropometric features are commonly used to evaluate obesity [1]. BMI is calculated through height and weight and widely used to indicate obesity [26]. A variety of shape descriptors and models are used to estimate body composition (e.g., VAT, SAT, whole body fat percentage) [1, 2, 5, 27]. Most of these methods estimate the total amount of body fat rather than the 2D fat map. To predict fat at the pixel level, Xie et al. [27] studied the body silhouettes derived from DXA scan images and analyzed the correlation between the variation of shapes and the body leanness indicators. In addition, Piel [13] mapped the variation of 3D body shapes to the variation of body compositions and developed a pixel level prediction function using step-wise regression. Moreover, Lu et al. [8, 11] used 2D body shapes derived from a natural standing pose for the pixel level body composition inference. They explored shape descriptors derived from 3D geometries and proposed to use a Bayesian network to analyze the relationship between characteristics of body shapes and 2D pixel level body composition. However, like the approaches above, their method is an indirect method and must deform the body shape.
Conditional Generative Adversarial Network.
The resolution and quality of images produced by generative methods, especially cGANs, have been improving rapidly in recent years [28, 29, 19]. Like the original GAN, a cGAN usually consists of two modules: a generative module and a discriminative module [29]. The generative module aims to capture the data distribution and accordingly generates fake images to fool the simultaneously and adversarially trained discriminative module, while the discriminative module aims to distinguish fake from real images. Both generative and discriminative modules have benefited from the deeper feature exploration of convolutional networks [21]. The generative and discriminative modules of a cGAN are conditioned on some additional information such as class labels. The conditional information teaches GANs to generate examples which match the specified condition variants [30].
3. Methodology
3.1. Network Architecture
Our goal is to find a pixel-wise mapping F: x → y, where x is a 2D projected body shape map and y is a 2D fat map. y has the same size as x. The architecture of our network is shown in Fig. 1. Our fat prediction network includes two components: a generator and a discriminator. The generator is a multi-task U-net and consists of an encoder and two independent decoders. The input of the generator is the body shape map and the outputs of the two decoders are the predicted SAT and VAT maps respectively. The discriminator is a classification neural network which learns to distinguish between negative pairs (shape map and its predicted 2D fat maps) and positive pairs (shape map and its ground truth 2D fat maps).
Figure 1:
Architecture of the pixel level body fat prediction neural network. The generator is a multi-task U-net structure used to carry out pixel-wise mapping. The discriminator learns to make the predicted SAT and VAT maps close to their corresponding ground truth. The discriminator observes negative pairs (shape map and its predicted fat maps) and positive pairs (shape map and its ground truth fat maps), and its output is an N × N patch with binary results.
3.1.1. Multi-Task Generator
To find the pixel-wise image mapping, we adopt the popular mechanism that includes both an encoder and a decoder [31, 15, 19]. The dimensions and profiles of the input and output are exactly the same, so we select the U-net [16] as the backbone of our fat prediction module. U-net is a convolutional neural network that was developed for biomedical image segmentation and has been modified and extended to work with fewer training images. In addition, the profiles of the input body shape map and corresponding output 2D fat map are similar and share the same pixel locations, so substantial low-level information is shared between the input and output images. The U-net architecture adopts skip connections between encoder blocks and their corresponding decoder blocks, which pass low-level information directly across the net so that fine-grained details can be recovered in the prediction [14].
The inputs (body shape map) for SAT and VAT predictions are exactly the same, so we integrate the SAT and VAT prediction modules into a single multi-task network to improve robustness and computational efficiency. This requires changing the general U-net to a multi-task framework by adding another decoder. The architectures of the two decoders are exactly the same. The SAT and VAT prediction modules share the same encoder channel and each has an independent decoder channel, as shown in Fig. 1: the upper decoder predicts SAT and the lower decoder predicts VAT. The shared encoder learns the representation of the input body shape and the two decoders reconstruct the SAT and VAT maps respectively.
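As an illustration of this shared-encoder, two-decoder design, the sketch below shows one way the multi-task U-net could be assembled in Keras. The depth, filter widths, and output activation are not specified in the paper, so the values here (a four-level encoder starting at 64 filters and sigmoid outputs) are assumptions rather than the exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Convolution-BatchNorm-ReLU block with stride-2 downsampling.
    x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def up_block(x, skip, filters):
    # Upsample, then fuse with the corresponding encoder feature (skip connection).
    x = layers.Conv2DTranspose(filters, 4, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.Concatenate()([x, skip])

def build_multitask_generator(img_size=512, base=64):
    """Shared encoder with two independent decoders (SAT and VAT)."""
    inp = layers.Input((img_size, img_size, 1))   # 2D body shape map
    # ----- shared encoder -----
    e1 = conv_block(inp, base)        # 256 x 256
    e2 = conv_block(e1, base * 2)     # 128 x 128
    e3 = conv_block(e2, base * 4)     # 64 x 64
    e4 = conv_block(e3, base * 8)     # 32 x 32 (bottleneck)

    def decoder(name):
        # Each decoder mirrors the encoder and reuses its skip features.
        d3 = up_block(e4, e3, base * 4)
        d2 = up_block(d3, e2, base * 2)
        d1 = up_block(d2, e1, base)
        return layers.Conv2DTranspose(1, 4, strides=2, padding="same",
                                      activation="sigmoid", name=name)(d1)

    sat_map = decoder("sat_map")      # upper decoder: SAT
    vat_map = decoder("vat_map")      # lower decoder: VAT
    return Model(inp, [sat_map, vat_map], name="multitask_unet")

generator = build_multitask_generator()
```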
3.1.2. Discriminator
In general, encoder-decoder architectures adopt pixel level loss functions, which are likely to fail to capture the whole image accurately because they only consider pixel level accuracy [19, 20]. Therefore, we introduce a discriminator, a convolutional neural network classifier (Fig. 1), to determine whether the predicted fat maps are real or not [32]. The discriminator thus pushes the fat maps produced by the generator to be as close to the ground truth as possible. The regular discriminator used in a cGAN is a binary classifier and focuses on the overall similarity between the images. Therefore, it is not accurate in evaluating sub-regions of the images [32, 33].
To solve these issues, we introduce the Markovian discriminator [20, 34], which restricts attention to the structure of local image regions instead of the whole image. It operates on the effective receptive field of the neural network and tries to classify each receptive field in an image as real or fake. Suppose that we divide an image into N × N sub-images of equal size and want to classify whether each sub-image is real or fake. The regular discriminator outputs a single scalar, which signifies whether the input is real or fake. In contrast, our discriminator outputs an N × N matrix (patch), where each value in the matrix signifies whether the corresponding sub-image is real or fake. Thus, the label for this discriminator is not a binary value (1 or 0) but an N × N matrix with elements 1 for real sub-images and 0 for fake sub-images. We apply this discriminator convolutionally across the image, average all values in the patch to provide the ultimate output, and calculate the corresponding cross entropy loss [34]. Our discriminator enhances texture precision because it takes receptive field information into consideration. Because the inputs of the SAT and VAT prediction modules are the same, we feed the predicted SAT and VAT maps into the same discriminator, as shown in Fig. 1, to enhance robustness and reduce the computational burden. In summary, we concatenate the positive pairs (body shape map (condition variant) with the real SAT and VAT maps) and the negative pairs (body shape map with the generated SAT and VAT maps). After that, we feed the positive and negative pairs with their corresponding patch labels into our discriminator for training.
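A minimal sketch of such a patch discriminator in Keras is given below, assuming the 512 × 512 maps described later in the paper. The layer count and filter widths are not stated in the paper; they are chosen here so that three stride-2 convolutions yield a 64 × 64 output patch, the size found to work best in the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_patch_discriminator(img_size=512):
    """Markovian (patch) discriminator: outputs an N x N grid of real/fake scores."""
    shape_map = layers.Input((img_size, img_size, 1))   # condition variant x
    fat_maps = layers.Input((img_size, img_size, 2))    # SAT and VAT maps (real or generated)
    x = layers.Concatenate()([shape_map, fat_maps])     # paired input, as in Fig. 1

    # Convolution-BatchNorm-LeakyReLU stack; every stride-2 layer halves the map,
    # so each output cell sees only a local receptive field of the 512 x 512 input.
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)

    # Final stride-1 conv produces a single-channel patch of per-region real/fake scores.
    patch = layers.Conv2D(1, 4, strides=1, padding="same", activation="sigmoid")(x)
    return Model([shape_map, fat_maps], patch, name="patch_discriminator")

discriminator = build_patch_discriminator()   # output patch here is 64 x 64 x 1
```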
Both the generator and the discriminator modules adopt the Convolution-BatchNorm-ReLU structure [35]. The batch normalization following each convolution layer accelerates the training process by reducing internal covariate shift [20]. It also has a regularization effect that reduces over-fitting, which is common in the medical field due to limited data sizes.
3.2. Objective Function
The objective of a general cGAN is defined by co-training both the discriminator and the generator:
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right] \qquad (1)$$
where the loss function $\mathcal{L}_{cGAN}$ is a form of cross entropy and $\mathbb{E}$ denotes the expectation, taken over the corresponding probability distributions of $(x, y)$ and $(x, z)$ respectively. Here $x$ is the condition variant, which is the body shape map in this paper, $z$ is random noise with the same size as $x$, and $y$ and $G(x, z)$ are the ground truth and generated fat maps respectively. The generator $G$ tries to minimize this objective against an adversarial discriminator $D$ that tries to maximize it. Compared with the original GAN, the cGAN observes the condition variant $x$ in the generator.
The random noise z is usually used in cGAN to avoid the deterministic results. In our study, however, we would like to obtain more deterministic fat maps, thus we omit the random noise input z. Thus the Eq. (1) becomes:
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right] \qquad (2)$$
The L1 and L2 loss functions have been widely used in encoder-decoder networks [15, 16]. These loss functions push the predicted fat maps to be close to the ground truth at the pixel level. Compared to the L2 loss, the L1 loss is more robust to outliers and encourages less blurring. Therefore, in addition to the original cGAN loss, we introduce an L1 loss in our generator module. The additional L1 loss of the generator is defined as
$$\mathcal{L}_{L1}^{sat}(G) = \mathbb{E}_{x, y_{sat}}\left[\left\| y_{sat} - G_{sat}(x) \right\|_{1}\right], \qquad \mathcal{L}_{L1}^{vat}(G) = \mathbb{E}_{x, y_{vat}}\left[\left\| y_{vat} - G_{vat}(x) \right\|_{1}\right] \qquad (3)$$
where $\mathcal{L}_{L1}^{sat}$ and $\mathcal{L}_{L1}^{vat}$ are the losses of the SAT and VAT modules under the $L_1$ norm respectively.
After combining $\mathcal{L}_{L1}^{sat}$ and $\mathcal{L}_{L1}^{vat}$ with the original cGAN loss, our final objective function has a composite structure with multiple optimizations, as illustrated below:
$$G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G, D) + \lambda_{sat}\,\mathcal{L}_{L1}^{sat}(G) + \lambda_{vat}\,\mathcal{L}_{L1}^{vat}(G) \qquad (4)$$
where $\lambda_{sat}$ and $\lambda_{vat}$ are the weights of the SAT and VAT prediction loss functions respectively.
The generator G is the fat map generation module at the core of the network, while the discriminator module D makes a relatively small contribution to the total loss. Meanwhile, SAT and VAT are not of equal importance in our model because the pattern of VAT is much more complicated and harder to predict, so the loss weight for VAT prediction is larger than that for SAT. There is no theoretical estimate of the loss weights λsat and λvat; the optimal pair depends on the data and the task. One may select the optimal λsat and λvat by analyzing different loss weights. More details can be found in EXPERIMENT.
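A minimal sketch of how this hybrid objective (Eq. 4) could be computed in TensorFlow is given below, assuming a generator with SAT/VAT outputs and a patch discriminator like the sketches above. The binary cross entropy over the patch output stands in for the adversarial term, and λsat = 50, λvat = 100 are the weights adopted later in the paper.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
lam_sat, lam_vat = 50.0, 100.0   # loss weights chosen in the experiments

def generator_loss(d_fake_patch, sat_true, sat_pred, vat_true, vat_pred):
    """Hybrid generator objective: adversarial term plus weighted L1 terms (Eq. 4)."""
    adv = bce(tf.ones_like(d_fake_patch), d_fake_patch)    # try to fool the discriminator
    l1_sat = tf.reduce_mean(tf.abs(sat_true - sat_pred))   # SAT pixel-level L1
    l1_vat = tf.reduce_mean(tf.abs(vat_true - vat_pred))   # VAT pixel-level L1
    return adv + lam_sat * l1_sat + lam_vat * l1_vat

def discriminator_loss(d_real_patch, d_fake_patch):
    """Patch labels are all ones for positive pairs and all zeros for negative pairs."""
    real = bce(tf.ones_like(d_real_patch), d_real_patch)
    fake = bce(tf.zeros_like(d_fake_patch), d_fake_patch)
    return real + fake
```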
4. Data preprocessing
In this section, we give a brief introduction to our medical dataset and describe the pipeline of how we process the CT scans data and extract the input body shape map and output 2D fat maps from 3D CT scans.
4.1. Dataset
The 3D CT scan is considered the gold standard for body composition analysis [36]. A 3D CT scan also captures the subject's body shape accurately. In this study, we collect our 3D CT data from The Cancer Imaging Archive (TCIA) [37, 38] and the Liver Tumor Segmentation Challenge (LiTS) [39]. The raw data are obtained from different sources whose scan protocols are not exactly the same, so it is necessary to align the datasets. Since the slice thicknesses of the CT datasets differ and most CTs have slices thinner than 2 mm, we calibrate the thickness by interpolating the CT slices before further processing. We focus on the abdominal region, therefore we crop the original CT scan data to the abdominal area. This medical dataset (Fig. 2) contains a total of 270 subjects. Compared to related clinical studies in body composition assessment [2, 4, 11], our dataset derives from CT scans, which are more accurate than DXA. Our sample population is relatively large compared to other similar studies [11, 40] and is sufficient for training and testing.
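The calibration step above amounts to resampling each volume to a common slice thickness. A minimal sketch is shown below, assuming each scan is available as a NumPy array with a known slice spacing; the function name and the 2 mm target are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_slice_thickness(volume, spacing_mm, target_mm=2.0):
    """Resample a CT volume (slices, rows, cols) along the slice axis to a uniform thickness."""
    factor = spacing_mm / target_mm   # e.g. 1 mm slices with a 2 mm target give factor 0.5
    # Linear interpolation along the slice axis only; in-plane resolution is left unchanged.
    return zoom(volume, (factor, 1.0, 1.0), order=1)
```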
Figure 2:
The schematic diagram of data preprocessing. The original 3D CTs are calibrated first and aligned to the same resolution. The body shape maps and their corresponding fat maps are extracted from each slice and then combined slice by slice.
4.2. Data Processing
4.2.1. Body Shape Map
The body shape can be reconstructed using commercial body scanners [1, 11], and it can also be reconstructed from the CT scans. A protocol for reconstructing body shape from an optical surface scan (Kinect) can be found in [8, 11], but optical surface scans are not available for the data in our experiment. However, the iso-surfaces of CT scans are accurate representations of the body surface. One advantage of using CT is that the reconstruction precision is very high (0.5 mm-2 mm) compared to commercial body scanners (∼ 2 mm). Thus, the iso-surface of the 3D CT scan can be used in place of the shape generated by body scanners. The other advantage is that there is no need to register the body shape and the fat location since both come from the same CT scan. The 2D body shape is the projection of the 3D surface in the frontal and lateral directions. The protocol is as follows: we first extract the body contour of each slice and then calculate the depth difference of the contour, as shown in Fig. 2. This yields a one-dimensional array for each slice, and we combine the results slice by slice to obtain the 2D body shape map. In this study, the image size of the body shape map is 512 × 512, which is sufficient to achieve the best prediction accuracy.
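A minimal sketch of this per-slice protocol is shown below for the frontal projection, assuming each axial slice has already been thresholded into a binary body mask (the mask construction itself is omitted); the function names are illustrative.

```python
import numpy as np

def shape_row_from_slice(body_mask):
    """One row of the frontal body shape map from a single axial slice.

    body_mask: 2D boolean array (rows = anterior-posterior depth, cols = left-right).
    For each left-right position, the value is the depth difference between the front
    and back of the body contour (0 where the column misses the body).
    """
    depth = np.zeros(body_mask.shape[1], dtype=np.float32)
    for col in range(body_mask.shape[1]):
        rows = np.flatnonzero(body_mask[:, col])
        if rows.size:
            depth[col] = rows[-1] - rows[0]   # back contour minus front contour
    return depth

def shape_map_from_volume(body_masks):
    """Stack the per-slice rows, slice by slice, into the 2D body shape map."""
    return np.stack([shape_row_from_slice(m) for m in body_masks], axis=0)
```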
4.2.2. 2D Fat Map
CT works by passing focused X-rays through the body and measuring the amount of energy absorbed. A CT slice is a grey scale image in which tissue density is mapped to a gray scale value. The Hounsfield Unit (HU) is a quantitative scale for describing radiodensity in medical CT and provides an accurate density for each type of tissue [41]. Extracting fat tissue from CT is easy and reliable using HU values [42]. However, there is no effective automatic method to segment the VAT and SAT regions from CT scans. Therefore, we segment the fat tissue into SAT and VAT regions manually via the specialized medical segmentation tool ITK-SNAP [43]. The process of determining the SAT and VAT maps is similar to how we process the body shape map. We first extract the fat regions from each CT slice and then separate the VAT and SAT regions by manually drawing the boundary. After obtaining a one-dimensional array from every slice, we combine the slices to get the 2D fat maps. To ensure the accuracy of the ground truth, each subject is reviewed at least twice.
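For illustration, a sketch of the per-slice fat extraction is shown below. The HU window for adipose tissue (roughly -190 to -30 HU) is a commonly used range rather than a value stated in the paper, and the manually drawn SAT/VAT boundary is represented here as a precomputed mask.

```python
import numpy as np

def fat_rows_from_slice(ct_slice_hu, sat_region_mask, hu_range=(-190, -30)):
    """Per-column SAT and VAT amounts for one axial CT slice.

    ct_slice_hu: 2D array of Hounsfield Units.
    sat_region_mask: manually drawn mask that is True inside the SAT region
                     (in the paper this boundary is traced in ITK-SNAP).
    Returns (sat_row, vat_row): fat pixel counts per left-right position.
    """
    fat = (ct_slice_hu >= hu_range[0]) & (ct_slice_hu <= hu_range[1])
    sat_row = np.count_nonzero(fat & sat_region_mask, axis=0)
    vat_row = np.count_nonzero(fat & ~sat_region_mask, axis=0)
    return sat_row, vat_row
```

Stacking the per-slice rows, as in the body shape map above, then yields the 2D SAT and VAT maps.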
5. Experiment
To validate the effectiveness of our network, we test our method on a variety of tasks on the medical dataset. This section contains two studies: the first determines the patch size and loss weight, and the second compares our method with other approaches.
5.1. Evaluation Metrics and Implementation
Evaluation Metrics.
Assessing pixel level fat prediction is a complicated and difficult problem. For body composition analysis, both the whole body fat percentage and its pixel level distribution are important, so the evaluation metrics should be considered carefully. The whole body fat percentage can be derived from the 2D fat map; we use the body fat percentage error (BFPE), defined by averaging the fat percentage error over all subjects. We use two measures to evaluate the pixel level prediction accuracy. The first is the mean squared error (MSE), which averages the squared difference between each pixel-wise predicted value and the ground truth over all pixels and subjects. However, MSE does not assess the joint statistics of the result, and therefore does not measure the highly structured losses of the output 2D fat maps. The second is the average Pearson correlation coefficient (aPCC), which averages the Pearson correlation coefficient between the vectorized predicted values and ground truth over subjects. A Pearson correlation coefficient closer to one indicates a better prediction. MSE is used to assess the overall estimation accuracy, while aPCC focuses on the regional estimation accuracy.
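A minimal sketch of these three metrics follows, assuming predictions and ground truth are stacked as [subjects, height, width] arrays. The per-subject normalizer used to turn fat sums into percentages is hypothetical, since the paper does not spell out that step.

```python
import numpy as np

def mse(pred, truth):
    """Mean squared error over all pixels and subjects."""
    return np.mean((pred - truth) ** 2)

def apcc(pred, truth):
    """Average Pearson correlation between vectorized prediction and ground truth per subject."""
    coeffs = [np.corrcoef(p.ravel(), t.ravel())[0, 1] for p, t in zip(pred, truth)]
    return np.mean(coeffs)

def bfpe(pred, truth, normalizer):
    """Body fat percentage error averaged over subjects.

    normalizer: per-subject quantity (hypothetical here) that converts the summed
    fat map into a whole body fat percentage before the error is taken.
    """
    fp_pred = pred.sum(axis=(1, 2)) / normalizer
    fp_true = truth.sum(axis=(1, 2)) / normalizer
    return np.mean(np.abs(fp_pred - fp_true))
```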
Implementation.
All the neural networks in this paper are implemented in the TensorFlow framework, and the training and validation processes are performed on two NVIDIA GTX 1080Ti graphics cards (11 GB of GPU memory each). The stochastic gradient descent optimizer with a learning rate of 0.001 and momentum of 0.9 is used. We use the train-on-batch operation to train our model and set the overall training phase to 200 epochs. GAN training does not converge to a fixed point; instead, we need to achieve an equilibrium between the generator and discriminator models. We therefore save the models periodically (every 5 epochs) during the training process and then review the generated images of every saved model to choose the best one based on image quality.
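A minimal sketch of one alternating training step under this setup is shown below. It reuses the generator, discriminator, and loss sketches above (so it is not self-contained on its own), the optimizer settings match those stated here, and the dataset iterator is assumed to yield (shape map, SAT, VAT) batches.

```python
import tensorflow as tf

# Assumes generator, discriminator, generator_loss, and discriminator_loss
# from the earlier sketches.
g_opt = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
d_opt = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

@tf.function
def train_step(shape, sat, vat):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        sat_pred, vat_pred = generator(shape, training=True)
        fake = tf.concat([sat_pred, vat_pred], axis=-1)   # negative pair channels
        real = tf.concat([sat, vat], axis=-1)             # positive pair channels
        d_fake = discriminator([shape, fake], training=True)
        d_real = discriminator([shape, real], training=True)
        g_loss = generator_loss(d_fake, sat, sat_pred, vat, vat_pred)
        d_loss = discriminator_loss(d_real, d_fake)
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    return g_loss, d_loss

# for epoch in range(200):
#     for shape, sat, vat in dataset:
#         train_step(shape, sat, vat)
#     if (epoch + 1) % 5 == 0:          # periodic checkpoints; the best model is
#         generator.save(f"generator_epoch_{epoch + 1}")   # later chosen by visual review
```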
5.2. Ablation Analysis
We run ablation studies to isolate the effects of the patch size and the generator loss weight, and to compare performance under various parameter settings.
5.2.1. Patch Size
The patch discriminator enhances texture precision because it takes receptive field information into consideration. The output of our discriminator is not a simple binary label but a square patch label of size N × N. We examine the influence of different patch sizes N by varying the patch size of our discriminator receptive fields from 1 × 1 up to the full resolution of the fat map, 512 × 512. Without loss of generality, we set the loss weights λsat = 50 and λvat = 100.
The pixel level fat prediction results with different patch sizes are shown in Fig. 3. We observe that utilizing a patch discriminator improves the distribution accuracy for both SAT and VAT predictions, whereas the traditional pixel discriminator (patch size 1 × 1) has no effect on spatial statistics and its predicted 2D fat maps are usually blurry and desaturated. The patch size is directly associated with the distribution precision. According to the results, the patch sizes 64 × 64, 256 × 256, and 512 × 512 improve the precision considerably, with 64 × 64 working best. A larger patch size does not always improve performance; the reason may be that a much larger patch size requires many more parameters and greater depth than the 64 × 64 patch and may be harder to train [29].
Figure 3:
The prediction results of the patch discriminator with different patch sizes for our proposed method (color coded). The top row shows SAT and the bottom row shows VAT. From left to right: ground truth and patch sizes 1 × 1, 4 × 4, 16 × 16, 64 × 64, 256 × 256, 512 × 512.
5.2.2. Loss Weight
The selection of the loss weights λsat and λvat in Eq. (4) is a significant factor that impacts the final prediction performance. In our study, we test the effect of different loss weights. Without loss of generality, we fix λsat = 50 and vary λvat over 25, 50, 75, 100, and 200. In this section, the patch size is set to 64 × 64.
The pixel level fat prediction results with different loss weights are shown in Fig. 4. Because we include the patch discriminator, the textural information is preserved well. As we can see from Fig. 4, as the loss weight increases from 25 to 200, the VAT prediction improves and the pattern becomes clearer. However, the SAT prediction seems to get worse as the weight increases, especially at the edges of the fat map.
Figure 4:
The prediction results of different generator loss weights for our proposed method (color coded). The top row shows SAT and the bottom row shows VAT. From left to right: ground truth and (λsat, λvat) = (50, 25); (50, 50); (50, 75); (50, 100); (50, 200).
We do not propose specific values for the optimal patch size and loss weights because the optimal parameters depend on the dataset and the task itself. In this paper, we choose the patch size 64 × 64 and the loss weights λsat = 50, λvat = 100.
5.3. Comparison with Reference Methods
In this section, we evaluate our proposed pixel level fat prediction method on the medical dataset described above. We also apply other state-of-the-art pixel-wise prediction methods, namely the Auto-encoder [15], U-net [16], and Wasserstein GAN (wGAN) [44], for performance comparison. To comprehensively validate the effectiveness of our patch discriminator, we also compare our proposed method with the baseline method [18, 19], which is also a cGAN but uses a regular discriminator (patch size 1 × 1). Due to the limited size of medical datasets, and to reduce the impact of any particular random choice of samples, we use five-fold cross validation to evaluate the performance of the methods. The dataset is randomly divided into five folds of 54 subjects each. We then select one fold (54 subjects) as the testing data, and the remaining four folds (216 subjects) are used to train the models. We repeat the process five times with different test folds, and performance is evaluated by averaging over the five test folds.
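A minimal sketch of this subject-level split, assuming scikit-learn is available (the random seed is arbitrary and not from the paper):

```python
import numpy as np
from sklearn.model_selection import KFold

subject_ids = np.arange(270)                               # 270 subjects in the medical dataset
kfold = KFold(n_splits=5, shuffle=True, random_state=0)    # random_state is an arbitrary choice

for fold, (train_idx, test_idx) in enumerate(kfold.split(subject_ids)):
    # 216 subjects for training and 54 for testing in each of the five repetitions;
    # reported metrics are averaged over the five test folds.
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```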
The SAT prediction results of our proposed method and the reference methods are illustrated in Fig. 5. The edge of the SAT map is consistent with the edge of the body shape and the pattern is simple, so SAT is easier to predict, and the predicted results of all methods are very similar to the ground truth. We find that, compared to our method, the predicted SAT maps produced by the Auto-encoder and U-net are more blurry and carry less texture information. The wGAN and baseline are better than the Auto-encoder and U-net but are still blurry compared to the proposed method.
Figure 5:
The comparison of the SAT predictions (color coded). (a): ground truth of SAT map (selected), (b) -(f): predicted SAT maps by Auto-encoder, U-net, wGAN, baseline and our proposed method respectively.
The pattern of the VAT map is more complex than that of the SAT map because VAT surrounds a number of important internal organs. Some predicted VAT maps are shown in Fig. 6. Comparing Fig. 5 and Fig. 6, it is apparent that the VAT predictions are worse than those for SAT for all methods. As shown in Fig. 6, the results of the Auto-encoder and U-net are blurry, while the wGAN and baseline are less blurry but have poor sub-region accuracy. Our proposed method achieves higher precision and preserves more textural information compared with the other competitive methods.
Figure 6:
The comparison of the VAT predictions (color coded). (a): ground truth of VAT map (selected), (b) -(f): predicted VAT maps by Auto-encoder, U-net, wGAN, baseline and our proposed method respectively.
The prediction results are shown in Table 1. Auto-encoder and U-net perform similarly in terms of all three criteria, MSE, aPCC and BFPE. The proposed method outperforms the other methods for both SAT and VAT predictions. The proposed method improves the whole body fat percentage estimation by 63.8%, 62.8%, 52.0% and 41.3% compared to the Auto-encoder, U-net, wGAN and baseline respectively. For SAT and VAT predictions in particular, the proposed method is at least 33.1% better than all the others in terms of MSE and at least 4.1% in terms of aPCC.
Table 1:
The generalization performance comparison of five pixel-wise body composition prediction methods.
| Method | SAT MSE % | SAT aPCC | VAT MSE % | VAT aPCC | BFPE % |
|---|---|---|---|---|---|
| Auto-encoder | 11.17 | 0.819 | 9.21 | 0.734 | 10.71 |
| U-net | 11.01 | 0.824 | 8.68 | 0.753 | 10.55 |
| wGAN | 8.42 | 0.866 | 5.33 | 0.772 | 8.09 |
| Baseline | 7.19 | 0.883 | 5.28 | 0.796 | 6.61 |
| Proposed | 4.81 | 0.919 | 2.27 | 0.854 | 3.88 |
To further visualize and compare the SAT and VAT prediction performance, absolute residual maps of the predicted pixel level body composition are shown in Fig. 7. The absolute residuals produced by our method are clearly smaller than those of the other four methods, indicating that the predictions of the proposed method are more consistent with the ground truth.
Figure 7:
The absolute residual maps of predicted pixel level body compositions (color coded). Left: SAT, Right: VAT. (a) - (e): the results of Auto-encoder, U-net, wGAN, baseline and our proposed method respectively (selected).
These experiments demonstrate that our proposed method significantly outperforms the other four reference methods in predicting both SAT and VAT at the pixel level, with the trade-off of a more time-consuming training process [1]. Nevertheless, our method better reconstructs the 2D fat maps in terms of both whole body percentage and textural accuracy. The proposed method integrates the SAT and VAT predictions in a single network, which improves the computational efficiency significantly.
6. Conclusion
In this paper, we propose an extended cGAN model to predict 2D body fat maps from a 2D body shape map. Unlike previous works on body composition analysis, our approach uses a deep neural network to explore the relationship between the characteristics of body shapes and body fat at the pixel level. The SAT and VAT maps are predicted in the same network by a multi-task generator. Medical datasets are typically not very large since they are difficult to acquire; our method is able to achieve both textural and pixel accuracy by introducing a patch discriminator and adding an additional L1 loss to the generator. Despite the satisfactory performance of our method, a larger dataset is expected to enhance the performance of the network further. A pair of optimal parameters is chosen to achieve excellent accuracy in pixel level body composition prediction and a high degree of accuracy in whole body fat percentage prediction. The experimental results show that our method outperforms other competitive methods.
Predict the 2D body fat distribution only using body shapes.
Extend the conditional GAN to the field of body fat prediction.
The protocol of loss weight and patch size is introduced to find optimal parameters.
Acknowledgements
This study is supported by the USA NIH grant R01HD091179 and the George Washington University Cross Disciplinary Research Fund.
Footnotes
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- [1] Ng BK, Sommer MJ, Wong MC, Pagano I, Nie Y, Fan B, Kennedy S, Bourgeois B, Kelly N, Liu YE, et al., Detailed 3-dimensional body shape features predict body composition, blood metabolites, and functional strength: the Shape Up! studies, The American Journal of Clinical Nutrition 110 (6) (2019) 1316–1326.
- [2] Wang Q, Lu Y, Zhang X, Hahn JK, A novel hybrid model for visceral adipose tissue prediction using shape descriptors, in: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, 2019, pp. 1729–1732.
- [3] Ali O, Cerjak D, Kent J Jr, James R, Blangero J, Zhang Y, Obesity, central adiposity and cardiometabolic risk factors in children and adolescents: a family-based study, Pediatric Obesity 9 (3) (2014) e58–e62.
- [4] Wang Q, Lu Y, Zhang X, Hahn J, Region of interest selection for functional features, Neurocomputing (2020).
- [5] Shepherd JA, Ng BK, Sommer MJ, Heymsfield SB, Body composition by DXA, Bone 104 (2017) 101–105.
- [6] Sun J, Xu B, Lee J, Freeland-Graves JH, Novel body shape descriptors for abdominal adiposity prediction using magnetic resonance images and stereovision body images, Obesity 25 (10) (2017) 1795–1801.
- [7] Janssen I, Heymsfield SB, Allison DB, Kotler DP, Ross R, Body mass index and waist circumference independently contribute to the prediction of nonabdominal, abdominal subcutaneous, and visceral fat, The American Journal of Clinical Nutrition 75 (4) (2002) 683–688.
- [8] Lu Y, Hahn JK, Zhang X, 3D shape-based body composition inference model using a Bayesian network, IEEE Journal of Biomedical and Health Informatics (2019).
- [9] Toombs RJ, Ducher G, Shepherd JA, De Souza MJ, The impact of recent technological advances on the trueness and precision of DXA to assess body composition, Obesity 20 (1) (2012) 30–39.
- [10] Borga M, West J, Bell JD, Harvey NC, Romu T, Heymsfield SB, Leinhard OD, Advanced body composition assessment: from body mass index to body composition profiling, Journal of Investigative Medicine 66 (5) (2018) 1–9.
- [11] Lu Y, Zhao S, Younes N, Hahn JK, Accurate nonrigid 3D human body surface reconstruction using commodity depth sensors, Computer Animation and Virtual Worlds 29 (5) (2018) e1807.
- [12] Ramírez-Vélez R, Izquierdo M, Correa-Bautista JE, Correa-Rodríguez M, Schmidt-RioValle J, González-Jiménez E, González-Jiménez K, Liver fat content and body fat distribution in youths with excess adiposity, Journal of Clinical Medicine 7 (12) (2018) 528.
- [13] Piel M, Predictive modeling of whole body dual energy X-ray absorptiometry from 3D optical scans using shape and appearance modeling, Ph.D. thesis, UCSF (2017).
- [14] Huang G, Liu Z, Van Der Maaten L, Weinberger KQ, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
- [15] Kodirov E, Xiang T, Gong S, Semantic autoencoder for zero-shot learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3174–3183.
- [16] Ronneberger O, Fischer P, Brox T, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
- [17] Playout C, Duval R, Cheriet F, A multitask learning architecture for simultaneous segmentation of bright and red lesions in fundus images, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2018, pp. 101–108.
- [18] Nguyen T, Xue Y, Li Y, Tian L, Nehmetallah G, Convolutional neural network for Fourier ptychography video reconstruction: learning temporal dynamics from spatial ensembles, arXiv preprint arXiv:1805.00334 (2018).
- [19] Zhu J-Y, Krähenbühl P, Shechtman E, Efros AA, Generative visual manipulation on the natural image manifold, in: European Conference on Computer Vision, Springer, 2016, pp. 597–613.
- [20] Isola P, Zhu J-Y, Zhou T, Efros AA, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
- [21] Zhang H, Sindagi V, Patel VM, Image de-raining using a conditional generative adversarial network, IEEE Transactions on Circuits and Systems for Video Technology (2019).
- [22] Hazra A, Using the confidence interval confidently, Journal of Thoracic Disease 9 (10) (2017) 4125.
- [23] Dahiru T, P-value, a true test of statistical significance? A cautionary note, Annals of Ibadan Postgraduate Medicine 6 (1) (2008) 21–26.
- [24] Casella G, Berger R, Statistical Inference, Duxbury Resource Center, 2001.
- [25] Goodpaster BH, Kelley DE, Thaete FL, He J, Ross R, Skeletal muscle attenuation determined by computed tomography is associated with skeletal muscle lipid content, Journal of Applied Physiology 89 (1) (2000) 104–110.
- [26] Cornier M-A, Despres J-P, Davis N, Grossniklaus DA, Klein S, Lamarche B, Lopez-Jimenez F, Rao G, St-Onge M-P, Towfighi A, et al., Assessing adiposity: a scientific statement from the American Heart Association, Circulation 124 (18) (2011) 1996–2019.
- [27] Xie B, Avila JI, Ng BK, Fan B, Loo V, Gilsanz V, Hangartner T, Kalkwarf HJ, Lappe J, Oberfield S, et al., Accurate body composition measures from whole-body silhouettes, Medical Physics 42 (8) (2015) 4668–4677.
- [28] Miyato T, Kataoka T, Koyama M, Yoshida Y, Spectral normalization for generative adversarial networks, arXiv preprint arXiv:1802.05957 (2018).
- [29] Karras T, Laine S, Aila T, A style-based generator architecture for generative adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
- [30] Mirza M, Osindero S, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784 (2014).
- [31] Han L, Yin Z, A cascaded refinement GAN for phase contrast microscopy image super resolution, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2018, pp. 347–355.
- [32] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
- [33] Larsen ABL, Sonderby SK, Larochelle H, Winther O, Autoencoding beyond pixels using a learned similarity metric, in: International Conference on Machine Learning, 2016, pp. 1558–1566.
- [34] Li C, Wand M, Precomputed real-time texture synthesis with Markovian generative adversarial networks, in: European Conference on Computer Vision, Springer, 2016, pp. 702–716.
- [35] Ioffe S, Szegedy C, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).
- [36] Gibson DJ, Burden ST, Strauss BJ, Todd C, Lal S, The role of computed tomography in evaluating body composition and the influence of reduced muscle mass on clinical outcome in abdominal malignancy: a systematic review, European Journal of Clinical Nutrition 69 (10) (2015) 1079–1086.
- [37] Roth HR, Lu L, Seff A, Cherry KM, Hoffman J, Wang S, Liu J, Turkbey E, Summers RM, A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2014, pp. 520–527.
- [38] Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, et al., The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository, Journal of Digital Imaging 26 (6) (2013) 1045–1057.
- [39] Wang X, Han S, Chen Y, Gao D, Vasconcelos N, Volumetric attention for 3D medical image segmentation and detection, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2019, pp. 175–184.
- [40] Sun J, Xu B, Freeland-Graves J, Automated quantification of abdominal adiposity by magnetic resonance imaging, American Journal of Human Biology 28 (6) (2016) 757–766.
- [41] Jin D, Xu Z, Tang Y, Harrison AP, Mollura DJ, CT-realistic lung nodule simulation from 3D conditional generative adversarial networks for robust lung segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2018, pp. 732–740.
- [42] Kim YJ, Lee SH, Kim TY, Park JY, Choi SH, Kim KG, Body fat assessment method using CT images with separation mask algorithm, Journal of Digital Imaging 26 (2) (2013) 155–162.
- [43] Yushkevich PA, Gao Y, Gerig G, ITK-SNAP: An interactive tool for semi-automatic segmentation of multi-modality biomedical images, in: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, 2016, pp. 3342–3345.
- [44] Arjovsky M, Chintala S, Bottou L, Wasserstein generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 214–223.