Abstract
Objectives
Early placental volume (PV) has been associated with small-for-gestational-age infants born under the 10th/5th centiles (SGA10/SGA5). Manual or semi-automated PV quantification from 3DUS is time-intensive, limiting its incorporation into clinical care. We devised a novel convolutional neural network (CNN) pipeline for fully-automated placenta segmentation from 3DUS images, exploring the association between the calculated PV and SGA.
Methods
3DUS volumes obtained from singleton pregnancies at 11–14 weeks’ gestation were automatically segmented by our CNN pipeline, trained and tested on 99/25 images, which combines two 2D models and one 3D model with a downsampling/upsampling architecture. The PVs derived from the automated segmentations (PVCNN) were used to train multi-variable logistic-regression classifiers for SGA10/SGA5. The test performance for predicting SGA was compared to that of models using PVs obtained via the semi-automated VOCAL (GE Healthcare) method (PVVOCAL).
Results
We included 442 subjects with 37 (8.4%) and 18 (4.1%) SGA10/SGA5 infants, respectively. Our segmentation pipeline achieved a mean Dice score of 0.88 on an independent test set. Adjusted models including PVCNN or PVVOCAL were similarly predictive of SGA10 (AUCs: PVCNN=0.780, PVVOCAL=0.768). The addition of PVCNN to a clinical model without any PV included (AUC=0.725) yielded a statistically significant improvement in AUC (P<0.05), whereas PVVOCAL did not (P=0.105). Moreover, when predicting SGA5, including PVCNN (AUC=0.897) brought statistically significant improvement over both the clinical model (AUC=0.839, P=0.015) and the PVVOCAL model (AUC=0.870, P=0.039).
Conclusions
First trimester placental volume measurements derived from our CNN segmentation pipeline are significantly associated with future SGA. This fully automated tool enables the incorporation of placental volumetric biometry into bedside clinical evaluation as part of a multi-variable prediction model for risk stratification and patient counseling.
Keywords: 3DUS, placenta, small-for-gestational-age, deep learning, convolutional neural networks
Introduction
Preeclampsia, intrauterine growth restriction, and other adverse pregnancy outcomes contribute to significant perinatal morbidity and mortality. Mounting evidence [1,2,3] has linked abnormalities in gross placental development to such outcomes. Thus, placental morphometry has recently become a major subject of research. The most common imaging modality to study the placenta in clinical settings is ultrasound, because of its low cost, broad availability, and ease of acquisition. Early, in utero placental size as measured with 2D ultrasound has demonstrated weak associations with adverse outcomes [4]. Advances in 3D ultrasound (3DUS) have allowed investigators to demonstrate associations between first trimester placental volume and clinical outcomes [5,6,7,8,9], but the currently available segmentation techniques are laborious and require significant offline processing with manual input. To develop tools that can ultimately be incorporated into clinical practice, automated segmentation methods are needed that can robustly evaluate placental morphology in a validated and clinically relevant manner.
Automated quantification of the placenta from 3DUS data is a demanding task. 3DUS images are prone to high levels of speckle noise. Especially in early pregnancy, the contrast between the placenta and uterine tissue is often rather weak. The size and shape of the placenta, as well as its position with respect to the amniotic sac, are all extremely variable. There have been some semi-automated and fully automated methods proposed for the segmentation task, with varying results [10,11,12,13,14,15,16,17].
In this study we present a state-of-the-art method for obtaining a fully automated placenta segmentation from 3DUS volume sets by combining 2D and 3D convolutional neural networks (CNN). We also use the segmentations produced by this method to automatically calculate placental volume (PV) at 11–14 weeks, and study the association between the PV measurements and delivery of small for gestational age (SGA) infants. Our overarching goal is to develop a multivariable prediction model that can be used as a point-of-care test in the first trimester to identify pregnancies at increased risk for developing fetal growth disorders.
Materials and Methods
Overview
This is a secondary analysis of a prospective cohort of singleton gestations recruited for a study [9] examining early placental volumes in the prediction of adverse pregnancy outcomes. Volume data sets were previously analyzed using the VOCAL semi-automated segmentation method, and the resulting volumes were found to be significantly associated with SGA. However, VOCAL requires offline manipulation and manual tracing of successive image slices within the 3DUS volume set in order to segment out the organ of interest. In this study, we utilized volume sets from this cohort to develop our fully automated segmentation pipeline.
Our pipeline starts with fully automatic segmentation of the placenta from 3DUS images. We train 3 separate convolutional neural networks (CNN) for the segmentation task, two of them with 2D (on 2D slices extracted from 3DUS volume) and one with 3D modality. The probability maps predicted by the three CNNs are averaged and thresholded to output a final segmentation.
We then analyzed the placental volume thus obtained from the deep learning step as a predictor for small for gestational age <10th centile (SGA10, primary outcome) and <5th centile (SGA5, secondary outcome) outcomes [18]. Models were also adjusted for potential confounders, including maternal age, nulliparity, race, BMI, tobacco use, chronic hypertension, and pregestational diabetes.
Data
This study was performed in accordance with the Declaration of Helsinki. This human study was approved by the University of Pennsylvania Institutional Review Board - approval: 811129. All adult participants provided written informed consent to participate in this study.
Ultrasound volume datasets were obtained at 11–14 weeks’ gestation using GE Voluson E8 (GE Healthcare, WI, USA) ultrasound machines. The “max quality” setting was used during the acquisition of the images, and sector width and sweep angle were increased to ensure inclusion of the entire placenta within the 3D sweep. The resulting raw images are in various dimensions, ranging from 245×265×173 to 714×726×488 voxels. The mean voxel dimensions were 0.49±0.04 mm, uniform in all 3 axes.
In addition to the N=442 SGA study set, we use two more datasets for developing the segmentation pipeline, with the placenta manually labeled as ground truth. The CNNs are trained on an N=99 training set, under a 4-fold cross-validation scheme for tuning the models. After training and tuning is complete, we use an N=25 test set to measure the segmentation performance. The N=442 SGA study set, the N=99 train set, and the N=25 test set are all independent from each other.
Fully automated placenta segmentation pipeline
To obtain a dataset of ground truth for the segmentation task, 99 3DUS volume sets from 99 subjects were manually segmented using the ITK-SNAP software (www.itksnap.org). This annotated dataset is used to train the 3 CNNs. In addition, the PV estimation obtained using this ground truth dataset (PVGT) served as the reference when comparing VOCAL and CNN-derived PV values to the manually segmented ground truth.
We conducted a 4-fold cross-validation (CV) scheme to develop and validate our automated segmentation pipeline (Fig. 1). The performance of the model on such a setup is estimated to be the mean of the k validation results [19,20,21]. The k-fold CV setup has the advantage of being more robust to variance introduced during random splitting of a small dataset compared to a single train/validate split, since we are running k simultaneous models with identical architecture and hyper-parameters on k different train/validation splits [22,23]. We opted for k=4, which is a common choice in similar studies [24,25,26].
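The 4-fold split can be sketched as follows (the NumPy-based fold assignment and fixed seed are illustrative assumptions; the paper does not specify the splitting implementation):

```python
import numpy as np

def k_fold_splits(n_subjects, k=4, seed=0):
    """Return k (train_idx, val_idx) pairs covering n_subjects once each."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_subjects)
    folds = np.array_split(idx, k)           # k roughly equal folds
    splits = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        splits.append((train, val))
    return splits
```

Each of the k models is then trained on its `train` indices and scored on its `val` indices, and the reported CV performance is the mean of the k validation scores.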
Figure 1 -.

Overview of the 4-fold cross-validation setup.
Ensembling multiple CNN models is a common deep learning approach for improving prediction results [27,28,29]. For our segmentation task, we found that a combination of two 2D CNNs and one 3D CNN maximized the Dice score relative to their individual scores. Adding further models did not bring significant improvement (Fig. 2). Thus, we trained 3 separate CNNs for each of the 4 folds on the training dataset, for a total of 12 models for the parameter-tuning and evaluation phase. Then, 3 final models were trained on the whole N=99 train set, with the optimized hyper-parameters obtained from the CV results.
Figure 2 -.

Performance comparison of various combinations of models used as an ensemble.
The networks follow a downsampling-upsampling path similar to that of the U-net algorithm [30] (Fig. 3): first a down path of convolutional blocks followed by downsampling layers, then an up path of convolutional blocks followed by upsampling layers. We applied two major modifications, based on our validation results. First, our downsampling layers use strided convolutions instead of U-net’s max-pooling. Second, we dropped the proposed skip-connections between the down/up paths, since they did not improve our segmentation accuracy and prolonged training time. There were also a number of minor modifications, including the number of down/up steps (4 in U-net vs 3 in our 2D and 2 in our 3D models), the number of filters per layer (64 to 1024 in U-net vs 32 to 256 in our 2D and 32 to 128 in our 3D models), the usage of drop-out (at the end of the contracting path in U-net vs after each down/up step in ours), and batch normalization (none in U-net vs after the input in ours). We used binary cross-entropy as the loss function during training.
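A minimal Keras sketch of the 2D network described above (3 down/up steps, filters 32 to 256, strided-convolution downsampling, no skip connections, batch normalization after the input, dropout after each step). The dropout rate, optimizer, and exact block composition are our assumptions, not specified in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_2d_model(input_shape=(128, 128, 1), base_filters=32, depth=3,
                   dropout_rate=0.2):
    """U-net-like 2D segmentation network without skip connections."""
    inp = layers.Input(shape=input_shape)
    x = layers.BatchNormalization()(inp)          # batch norm after the input
    filters = [base_filters * 2 ** i for i in range(depth)]   # 32, 64, 128
    # Down path: conv block, then a strided convolution (no max-pooling)
    for f in filters:
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(f, 3, strides=2, padding="same", activation="relu")(x)
        x = layers.Dropout(dropout_rate)(x)
    # Bottleneck at the maximum filter count (256 here)
    x = layers.Conv2D(base_filters * 2 ** depth, 3, padding="same",
                      activation="relu")(x)
    # Up path: conv block, then upsampling; no skip connections
    for f in reversed(filters):
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(2)(x)
        x = layers.Dropout(dropout_rate)(x)
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)   # per-voxel probability
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

The 3D variant would follow the same pattern with `Conv3D`/`UpSampling3D`, 2 down/up steps, and filters 32 to 128.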
Figure 3 -.

The network architecture diagram for our CNNs. Number of filters/channels are given on top of each block. Output dimensions are given at bottom left. Kernel size for all convolutions is 3×3 pixels for 2D, and 3×3×3 voxels for 3D.
As an initial preprocessing step, we downsampled all raw 3DUS images to 128×128×128 voxels, to ensure standard dimensions. This standard dataset is further processed by two separate methods to obtain training data for the 2D and 3D models.
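This resampling step can be sketched with SciPy (linear interpolation is our assumption; the paper does not state the interpolation order used):

```python
import numpy as np
from scipy.ndimage import zoom

def to_standard_volume(raw, target=(128, 128, 128)):
    """Resample a raw 3DUS volume of arbitrary dimensions to the
    standard 128x128x128 grid used by the CNNs."""
    factors = [t / s for t, s in zip(target, raw.shape)]
    return zoom(raw, factors, order=1)   # trilinear interpolation
```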
For the 2D models, we extract 2D slices of 128×128 pixels on 3 planes for each subject: sagittal (0°), coronal (90°), and a 45° plane between them. These slices are concatenated to produce 384 slices of 128×128 pixels per subject, constructing our base 2D training data. During training, we apply random online augmentations to each slice in order to synthetically extend the limited dataset and prevent rapid overfitting of the CNNs. The transformations used in augmentation are flipping (horizontal/vertical), shifting (up/down/left/right ±5 pixels), scaling (enlarging 0–10%), and in-plane rotation (±45°). Whether these transformations are applied, and at what magnitude, is decided randomly during every epoch of training. The two 2D models in our ensemble have the same architecture and hyper-parameters, and are trained on the same base 2D training data obtained as explained above. The difference between them is purely random, rooted in the initialization of trainable parameters and the application of random online augmentation. While they have similar performance separately, simply averaging their probability outputs provides a Dice score improvement of 0.01 (Fig. 2).
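The slice extraction described above can be sketched as follows (the choice of which array axis corresponds to which anatomical plane is our assumption; the paper does not specify the axis convention):

```python
import numpy as np
from scipy.ndimage import rotate

def extract_slices(vol):
    """Extract 128x128 slices on the 0°, 45°, and 90° planes of a
    128x128x128 volume; yields 3 x 128 = 384 slices per subject."""
    stacks = [
        vol,                                          # 0° plane
        rotate(vol, 45, axes=(0, 1), reshape=False),  # 45° in-between plane
        np.swapaxes(vol, 0, 1),                       # 90° plane
    ]
    # Slice each stack along its first axis and concatenate.
    return np.concatenate([s.reshape(-1, 128, 128) for s in stacks], axis=0)
```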
For the 3D models, we extract overlapping sliding cubes of 64×64×64 voxels from each subject in the standard dataset, with a stride of 32 voxels along each of the 3 axes. These cubes are concatenated to produce 27 cubes of 64×64×64 voxels per subject, constructing our base 3D training data. Similar to the 2D setup, we apply random online augmentations to each cube in this base dataset during training. The transformations used in augmentation are 3D rotations of ±30° about each of the 3 axes.
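The sliding-cube extraction described above reduces to a triple loop over start offsets; with a 128-voxel grid, a 64-voxel cube, and a 32-voxel stride there are 3 positions per axis, hence 27 cubes:

```python
import numpy as np

def extract_cubes(vol, size=64, stride=32):
    """Extract overlapping sliding cubes from a cubic volume."""
    starts = range(0, vol.shape[0] - size + 1, stride)   # 0, 32, 64
    cubes = [vol[x:x + size, y:y + size, z:z + size]
             for x in starts for y in starts for z in starts]
    return np.stack(cubes)
```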
During the prediction phase, we again apply the above extraction methods for each type of model. For the 2D models, we extract a total of 384 slices of 128×128 pixels on the 0°, 45°, and 90° planes from each subject. The model predictions for the 45° and 90° slices are rotated back to the 0° plane, thus obtaining 3 sets of probabilities for each subject. The final output of a 2D model is produced by taking the voxel-wise mean of these probabilities. For the 3D models, we extract 27 cubes of 64×64×64 voxels from each subject. The model predictions for these sliding cubes are then translated back to their original locations to obtain the final output; we take the voxel-wise mean of probabilities in regions where multiple cubes overlap. Finally, after averaging the probability maps from the 2D and 3D models, we threshold the probabilities to get a discrete output, and upsample the 128×128×128 output back to the dimensions of the original raw image. No other post-processing steps are used.
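The reassembly of per-cube predictions, with voxel-wise averaging where cubes overlap, can be sketched with an accumulator and a count array:

```python
import numpy as np

def reassemble_cubes(cube_probs, size=64, stride=32, out_shape=(128, 128, 128)):
    """Translate per-cube probability maps back to their original locations,
    taking the voxel-wise mean in regions where cubes overlap."""
    acc = np.zeros(out_shape)
    count = np.zeros(out_shape)
    starts = range(0, out_shape[0] - size + 1, stride)
    i = 0
    for x in starts:
        for y in starts:
            for z in starts:
                acc[x:x + size, y:y + size, z:z + size] += cube_probs[i]
                count[x:x + size, y:y + size, z:z + size] += 1
                i += 1
    return acc / count
```

The final pipeline output is the mean of this 3D map and the two 2D maps, thresholded to a binary mask.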
As a comparison of our pipeline to a mainstream method, we also trained and tested a vanilla 2D U-net model [30]. We chose the U-net architecture as a baseline comparison model since our models are based on the same idea of downsampling/upsampling paths, as explained above. We applied the same 2D random online augmentation step to this model, and trained it for 150 epochs, similar to our own models. However, we performed lengthy hyper-parameter tuning with our models, whereas this baseline model was only tuned for the learning rate.
The CNN code is implemented on the Python/Numpy stack. We used TensorFlow for the CNN backend and Keras as a higher level frontend. The training was performed on the Google Cloud Platform, on a machine with 4x Nvidia-Tesla-P100 GPUs. The training time for each model is around 3 hours for each cross-validation fold on our dataset of 99 subjects. Total training time for 12 validation and 3 final models was around 48 hours. The prediction phase is much faster, typically around 20–25 seconds for a single 3DUS image. Total time for segmentation of a new image (pre-processing, CNN prediction, post-processing, outputting segmentation) takes approximately 1 minute.
After training the three final CNNs on the N=99 train set, these models were used to produce placenta segmentations for the N=442 study set, which did not have manual segmentations, and placental volumes (PVCNN) were automatically calculated for the study set from these segmentations.
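Given a binary segmentation at the original image resolution, the volume computation reduces to counting foreground voxels and scaling by the voxel volume. A minimal sketch assuming isotropic voxels (the 0.49 mm default reflects the mean voxel size reported for this cohort; per-image voxel sizes would be used in practice):

```python
import numpy as np

def placental_volume_cc(mask, voxel_mm=0.49):
    """Placental volume in cm^3 from a binary segmentation mask
    with isotropic voxels of edge length voxel_mm."""
    voxel_cc = (voxel_mm / 10.0) ** 3      # mm -> cm, then cube
    return mask.sum() * voxel_cc
```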
Measuring performance of the fully automated placenta segmentation pipeline
Our results were first evaluated using the Dice similarity coefficient [31], a general statistical metric for measuring the similarity of two samples. Applied to placental segmentation, Dice measures the overlap between voxels of the manually segmented (i.e. ground truth) and automatically segmented placentas. Identical segmentations yield a perfect Dice score of 1.0; segmentations with no overlap at all yield 0.0. Higher Dice scores indicate greater overlap.
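The Dice coefficient as described, 2|A∩B| / (|A| + |B|), can be computed directly on binary masks (the convention of scoring two empty masks as 1.0 is our assumption):

```python
import numpy as np

def dice_score(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0   # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, truth).sum() / denom
```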
After tuning our models according to their Dice performance, we then evaluated the final segmentation results with two additional metrics, the Hausdorff distance and the mean boundary (or average surface) distance [32]. These metrics are generally used as a measure of the similarity of morphology. The Hausdorff distance is formulated as H(A, B) = max(h(A, B), h(B, A)), where h(A, B) = max_{a∈A} min_{b∈B} ‖a − b‖, A and B are two finite sets of points, and ‖a − b‖ is the Euclidean distance between two points in these sets. In our case, it is the maximum of the distances from points in A (surface of the ground truth segmentation) to the nearest point in B (surface of the predicted segmentation). Since it measures the maximum, it is sensitive to any outlying points. Therefore, it is usually used in conjunction with the mean boundary distance, which is similarly the mean (instead of the maximum) of all distances from points in A to the nearest point in B, averaged with the mean of the distances from points in B to the nearest point in A. While the mean boundary distance assesses the typical performance of the algorithm, the Hausdorff distance considers the worst-case performance.
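Both metrics can be computed from the two surface point sets; a brute-force sketch using the full pairwise distance matrix (practical for modest surface sizes; large surfaces would use a KD-tree):

```python
import numpy as np
from scipy.spatial.distance import cdist

def surface_distances(A, B):
    """Hausdorff and mean boundary distance between two point sets,
    whose rows are surface-point coordinates."""
    d = cdist(A, B)               # pairwise Euclidean distances
    h_ab = d.min(axis=1)          # each point of A to its nearest point in B
    h_ba = d.min(axis=0)          # each point of B to its nearest point in A
    hausdorff = max(h_ab.max(), h_ba.max())
    mean_boundary = 0.5 * (h_ab.mean() + h_ba.mean())
    return hausdorff, mean_boundary
```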
Segmentation accuracy and reproducibility on independent test set
Finally, to evaluate the reproducibility of our automated pipeline, we identified 25 subjects, independent of both the N=99 train set and the N=442 SGA study set, for whom two 3DUS volume sets were obtained during the same exam. Each volume of the pair was analyzed using our CNN pipeline, allowing us to assess the reproducibility of the segmentations it produces. In addition, one volume from each pair was manually segmented as a reference, and the PV from these ground truth segmentations was also compared to the corresponding PV from the CNN segmentation.
Fully automated vs semi-automated PV measures
We compared the placental volume measurements obtained via the CNN pipeline with those obtained via the more commonly used semi-automated VOCAL tool (using 30° rotations) to assess whether they were correlated. In addition, when available, we compared both measurements to the volumes obtained from the manually labeled ground truth (N=74).
SGA analysis on the study cohort
To examine the association between first trimester PV and SGA outcomes, we used the N=442 unlabeled dataset, which consists of subjects unseen by the neural networks during training. This set includes 37 SGA10 (8.4%) and 18 SGA5 (4.1%) cases. The remaining 405 subjects delivered infants with birth weights at or above the 10th percentile.
In this experiment, we first used a logistic regression model to evaluate the placental volume calculated by the deep learning segmentation (PVCNN) as a predictor of SGA10 and SGA5 outcomes using the N=442 study set. Then, we adjusted for potential confounders that would be available at the time of ultrasound imaging, such as age, nulliparity, race, BMI, tobacco use, chronic hypertension, and pregestational diabetes. Numeric features (age, BMI, and PV) were normalized to the 0–1 range via min-max scaling. The remaining discrete features were converted into binary features via one-hot encoding. We then concatenated all available features obtained in this fashion, and applied a backward elimination loop for feature selection, evaluated by ROC analysis. In each round of the loop, the feature that when eliminated improved the AUC the most was removed. The loop was stopped when eliminating any further features did not result in improvement.
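The backward-elimination loop described above can be sketched generically. Here `score_fn` is a hypothetical callable (our assumption, not from the paper) that refits the logistic-regression model on a subset of features and returns its ROC AUC:

```python
def backward_eliminate(feature_names, score_fn):
    """Backward elimination driven by score_fn(features) -> AUC.
    Each round drops the feature whose removal improves the AUC the most;
    the loop stops when no removal yields an improvement."""
    selected = list(feature_names)
    best_auc = score_fn(selected)
    while len(selected) > 1:
        candidates = [(score_fn([f for f in selected if f != drop]), drop)
                      for drop in selected]
        auc, drop = max(candidates)      # best AUC after removing one feature
        if auc <= best_auc:
            break                        # no further improvement
        best_auc = auc
        selected.remove(drop)
    return selected, best_auc
```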
We also ran this model using the clinical factors alone, excluding PV, to help determine the incremental benefit of including PVCNN. Finally, we repeated the adjusted models for predicting SGA using PVVOCAL and compared the resulting AUCs from each of the models. The p-values for the comparisons were calculated according to DeLong’s test for two correlated ROC curves, using the pROC package in R.
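The fixed-false-positive-rate operating points reported in the Results can be read off the empirical ROC curve. A minimal NumPy sketch (the thresholding convention is our assumption; the original analysis used pROC in R):

```python
import numpy as np

def sensitivity_at_fpr(scores, labels, target_fpr=0.20):
    """Sensitivity (true-positive rate) at a fixed false-positive rate.
    The threshold is chosen so that at most target_fpr of the negative
    cases score strictly above it."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    neg = np.sort(scores[~labels])[::-1]          # negative scores, descending
    k = int(np.floor(target_fpr * neg.size))      # negatives allowed above threshold
    thresh = neg[k] if k < neg.size else -np.inf
    return (scores[labels] > thresh).mean()
```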
Results
The demographic information for the N=442 study set is given in Table 1. The mean gestational age at the time of 3DUS was 12.6±0.5 weeks. Mean gestational age at delivery was 38.9±1.7 weeks, with a mean birth weight of 3275±540 grams. SGA10 was found in 8.4% (N=37); whereas, SGA5 was found in 4.1% (N=18).
Table 1 –
Demographics of the study cohort
|  | SGA5 |  | SGA10 |  | Normal |  | TOTAL |  |
|---|---|---|---|---|---|---|---|---|
|  | N | % | N | % | N | % | N | % |
| Number of subjects | 18 | 4.1% | 37 | 8.4% | 405 | 91.6% | 442 | 100.0% |
| Fetus male | 10 | 55.6% | 20 | 54.1% | 217 | 53.6% | 237 | 53.6% |
| Fetus female | 8 | 44.4% | 17 | 45.9% | 188 | 46.4% | 205 | 46.4% |
| Race | N | % | N | % | N | % | N | % |
| White | 9 | 2.0% | 15 | 3.4% | 169 | 38.2% | 184 | 41.6% |
| Black | 8 | 1.8% | 14 | 3.2% | 180 | 40.7% | 194 | 43.9% |
| Asian | 1 | 0.2% | 7 | 1.6% | 39 | 8.8% | 46 | 10.4% |
| Other | 0 | 0.0% | 1 | 0.2% | 17 | 3.8% | 18 | 4.1% |
|  | mean | stdev | mean | stdev | mean | stdev | mean | stdev |
| Maternal age (years) | 30.1 | 6.1 | 29.4 | 5.6 | 30.6 | 6.0 | 30.5 | 6.0 |
| Maternal BMI (kg/m2) | 27.6 | 7.6 | 25.8 | 6.7 | 26.8 | 6.5 | 26.7 | 6.6 |
| Gestational age at US (weeks) | 12.5 | 0.4 | 12.5 | 0.4 | 12.6 | 0.5 | 12.6 | 0.5 |
| Gestational age at delivery (weeks) | 38.2 | 0.8 | 38.2 | 1.4 | 39.0 | 1.7 | 38.9 | 1.7 |
| Birth weight (grams) | 2451 | 197 | 2528 | 259 | 3343 | 507 | 3275 | 540 |
|  | N | % | N | % | N | % | N | % |
| Tobacco use | 5 | 27.8% | 7 | 18.9% | 34 | 8.4% | 41 | 9.3% |
| Chronic hypertension | 3 | 16.7% | 7 | 18.9% | 21 | 5.2% | 28 | 6.3% |
| Pregestational diabetes | 0 | 0.0% | 1 | 2.7% | 9 | 2.2% | 10 | 2.3% |
| Nulliparity | 0 | 0.0% | 4 | 10.8% | 75 | 18.5% | 79 | 17.9% |
Fully automated placenta segmentation pipeline
Calculated over the test folds of the 4-fold CV setup, our fully automated CNN pipeline achieved a mean Dice score of 0.8721±0.0591 on the N=99 train set. The mean boundary distance was 1.54±1.31 mm (min=0.37 mm, median=1.16 mm, max=8.33 mm), which roughly corresponds to 3 voxels. The mean Hausdorff distance was 18.19±11.40 mm (min=4.24 mm, median=15.30 mm, max=47.98 mm).
While developing and tuning our models with the 4-fold CV, we experimented with various alternative configurations. These include the following, with their respective Dice scores in CV: using a Dice loss function (0.82), using a combined loss function of Dice and cross entropy losses weighted equally (0.83), implementing skip connections between the downsampling/upsampling paths (0.82), using max-pooling instead of strided convolutions (0.80), omitting drop-out layers (0.82), and omitting batch-normalization (0.81). These alternatives were not used in the final configuration presented in this paper, since they produced poorer CV performance compared to our main pipeline (0.87).
After finalizing our pipeline, we also evaluated our models on the N=25 independent test set, to verify the results from our cross-validation setup. The mean Dice score on this test set was 0.8829±0.0457, which is in line with the cross-validation results reported above. The mean boundary distance was 1.38±0.72 mm (min=0.51 mm, median=1.29 mm, max=3.36 mm). The mean Hausdorff distance was 18.60±11.46 mm (min=3.16 mm, median=14.50 mm, max=38.51 mm). A sample of the qualitative results is given in Figure 4.
Figure 4 -.

A sample of qualitative segmentation results, compared to manually annotated ground truth. Anterior placenta on top row, posterior on bottom row.
Figures 5a and 5b show the comparison and correlation of the pairs of automated placental volumes calculated for the subjects with secondary images (predictions for 1st scan vs 2nd scan). The Pearson coefficient was 0.870, with R2=0.757. The intraclass correlation coefficient (ICC) was 0.867. Figures 5c and 5d show the correlation between the automated PVs for the primary images and their manually labeled ground truth PVs (prediction for 1st scan vs GT for 1st scan, Pearson=0.908, R2=0.824).
Figure 5a -.

Scatter plot showing the volume correlation in predicted volumes.
Figure 5b -.

Bland-Altman plot showing the volume correlation in predicted volumes.
Figure 5c -.

Scatter plot showing the volume correlation in predicted vs GT volumes.
Figure 5d -.

Bland-Altman plot showing the volume correlation in predicted vs GT volumes.
Finally, as mentioned in the methods section, we trained a vanilla 2D U-net [30] model on the N=99 train set, as a baseline comparison model. When applied to the N=25 test set, this model achieved a mean Dice score of 0.73. However, it must be noted that we did not perform extensive hyper-parameter tuning with this model, so its performance is not directly comparable to the 0.88 Dice score of our pipeline.
Volume comparison between CNN and VOCAL
When comparing the PV values from the 3 segmentation methods on the N=99 train set with ground truth, there was no significant difference in mean PV obtained by CNN (PVCNN=78.7±29.7 cc) compared to the ground truth (PVGT=79.4±30.9 cc, P=0.878). On the other hand, mean PVVOCAL (65.2±26.9 cc) was significantly lower than both PVGT (P=0.003) and PVCNN (P=0.005).
For the N=25 independent test set, there were no VOCAL volumes available. There was a strong correlation between the ground truth volumes and the volumes from our automated segmentation (PVGT=59.4±18.6 cc, PVCNN=62.8±16.4 cc, Pearson=0.908).
When examining the PV methods as applied to the whole cohort where PVVOCAL was available (N=516 of the 442+99=541 subjects), there was a moderate correlation between PVVOCAL and PVCNN (Pearson=0.676, R2=0.457, Fig. 6a). However, the mean PVVOCAL remained significantly lower than PVCNN (67.5±20.5 cc vs. 97.8±32.9 cc; P<0.0001, Fig. 6b).
Figure 6a -.

Scatter plot showing the correlation between placental volumes calculated from CNN segmentation, compared to VOCAL.
Figure 6b -.

Bland-Altman plot showing the correlation between placental volumes calculated from CNN segmentation, compared to VOCAL.
SGA prediction on the study cohort
Figure 7 shows the ROC curves for the predictions using PVCNN, PVVOCAL and using clinical covariates alone.
Figure 7 -.

Receiver Operating Characteristic curves for SGA5 and SGA10 prediction by our logistic regression models. The curves represent models using placental volumes from our CNN segmentation compared to volumes from VOCAL.
The clinical covariates most associated with SGA10 were tobacco use, chronic hypertension, pregestational diabetes, age, race, and BMI. A prediction model using these factors alone yielded an AUC of 0.725. When PVCNN was added to the model, the AUC significantly increased to 0.780 (P=0.047). The AUC also increased when PVVOCAL was added to the clinical model, although the increase did not reach statistical significance (0.768; P=0.105). There was no significant difference in AUC between PVCNN and PVVOCAL (P=0.605). Assuming a 20% false-positive rate as a clinically appropriate cutoff point, the prediction of SGA10 was not improved by including PV measurements, with sensitivities of 51%, 51%, and 54% for the PVCNN, PVVOCAL, and no-PV models, respectively. At a 10% false-positive rate, the sensitivities were 35%, 40%, and 35%, respectively.
The clinical covariates most associated with SGA5 were tobacco use, chronic hypertension, pregestational diabetes, nulliparity, age, race, and BMI. A prediction model using these factors alone yielded an AUC of 0.839 for detecting SGA5. There was significant improvement when adding PVCNN (AUC=0.897; P=0.015). Adding PVVOCAL (AUC=0.870; P=0.092) increased the AUC, but the increase was not statistically significant. There was also significant improvement in AUC for the model using PVCNN compared to the PVVOCAL model (P=0.039). Inclusion of PV measurements in the models improved the prediction of SGA5 at a 20% false-positive cutoff, with sensitivities of 83%, 83%, and 78% for the PVCNN, PVVOCAL, and no-PV models, respectively. At a 10% false-positive rate, the sensitivities were 67%, 39%, and 44%, respectively.
Discussion
We have developed a fully automated method to segment the placenta in 3DUS images using deep learning with convolutional neural networks. The placental volume measurements obtained at 11–14 weeks using this automated tool are validated against manual segmentation, reproducible, and significantly associated with the delivery of a small-for-gestational-age infant.
Placental size at birth is known to correlate with neonatal birth weight. Furthermore, assessment of placental size at delivery was one of the first clues that led to the creation of the Developmental Origins of Health and Disease hypothesis, which has transformed our understanding of how pregnancy conditions can have implications for lifelong health in the offspring [33,34,35]. However, quantitative assessment of the placenta is not part of routine clinical care, largely due to the lack of tools that allow this task to be accomplished in a timely and simplified manner. Semi-automated tools, such as VOCAL, have allowed researchers to confirm the relationship between early placental size and birth weight. However, this technique requires manual segmentation of multiple image slices to obtain an estimated volume.
Automated segmentation of the placenta is a demanding task. Most current techniques for placental segmentation from 3DUS expect some input from the user. Such techniques include the commercial VOCAL software (GE Healthcare) and a random-walker algorithm (mean Dice 0.86) [10,16]. These interactive methods are subjective and prone to intra- and inter-observer variability. Semi-automated multi-atlas label fusion methods have been proposed (mean Dice 0.83) [15] to minimize the required user input and its adverse effects, but still require some manual input.
In recent years, fully automated approaches have also been proposed, with varying results. Some notable ones include a general-purpose method that can be applied to various organs (mean Dice 0.64) [17], a fully automated multi-atlas label fusion algorithm limited to anterior placentas (mean Dice 0.83) [13], an extension of this method to non-anterior placentas using deep learning (mean Dice 0.86, dataset also used in the current study) [14], and other deep learning models such as OxNNet (mean Dice 0.81, median Hausdorff distance 14.6 mm, median mean boundary distance 0.37 mm) [12], and Deep Medic (median Dice 0.73, median Hausdorff distance 27 mm) [11].
Our proposed method provides a complete and fully automated pipeline that takes in 3DUS image and clinical data, and outputs placenta segmentation, placenta volume, and probabilities for SGA5 and SGA10 outcomes. This tool has the potential to be used at the bedside, since the whole pipeline for a new image completes within minutes, and more importantly, requires no input from the user. Early identification of pregnancies at risk for fetal growth disturbance can inform clinicians and allow for increased surveillance in the third trimester. In addition, there are data suggesting that therapies, such as low dose aspirin, may improve fetal growth in high risk pregnancies [36]. Further research would be required to determine the optimal interventions to be considered when an early placental volume is found to be small.
Our Dice scores for the segmentation task were higher than those of other fully automated methods [11,12,13,14,17]. Our Hausdorff distance was better than [11] and comparable to [12], while our mean boundary distance was worse than [12]. Our previous hybrid method [14] requires a long processing time for the atlas fusion step, resulting in prediction times close to an hour for a single 3DUS image. Our current fully CNN approach achieves the same segmentation performance, and prediction for a single image completes within 1–2 minutes. We also note that this was achieved with a very limited dataset of 99 images. Since deep learning methods are known to excel with large amounts of data, obtaining more training images would likely further improve the Dice score of the CNN models. It is also important to note that there is no clear-cut, mathematical ground truth for placenta segmentation: variability and human subjectivity are factored into the tedious process of manually segmenting 3DUS images with varying resolution and often unclear borders between placenta and myometrium. Thus, a perfect Dice score that matches the manual segmentation exactly in every subject is practically unattainable. However, we anticipate that considerably larger labeled datasets and future developments in deep learning methods could improve the scores towards 0.95, which we believe to be a practically achievable upper limit. In our observations, the CNNs correctly locate the central mass of the placenta in most cases, and usually err on the boundaries. Therefore, even small increases over our current 0.88 Dice score would mean noticeably better boundary predictions, enabling further research into aspects such as the shape and surface area of the placenta.
While such future improvements in accuracy may bring diminishing returns for population-wide studies, they would also potentially translate to a reduced standard deviation, and thus increased robustness on poor-quality images, such as those suffering from low SNR or shadow artifacts (see Fig. 8). This would enable clinicians to use these tools on a wider range of images at the bedside, and help move toward precision/personalized medicine approaches.
Figure 8. A sample of qualitative segmentation results where the CNN oversegments.
In the meantime, at the current level of segmentation ability, the models could also serve as a first-pass tool to assist with manual annotation of the placenta on images where the automated segmentation is unsatisfactory. Manual annotation is a time-intensive process that requires the labor of an expert. A fast prediction from the CNN model, while not perfect, could provide a reasonable starting point for the expert, possibly reducing the total time needed for the task.
We also note that the rather limited size of our manually labeled dataset may raise concerns of overfitting during CNN training. For this reason, we evaluated our segmentation models on an independent test set of N=25 images, and the results were in line with those from the 4-fold cross-validation on the N=99 training set. Furthermore, our approach generalizes well to the larger, unlabeled SGA study set (N=442), as evidenced by the fact that it outperforms VOCAL for SGA10 and SGA5 prediction. These two additional checks strongly support our belief that the random online data augmentations used during training were able to prevent significant overfitting. However, it must be noted that all images in our datasets were obtained by the same team, with the same equipment, at a single institution. While the two checks above show that our models generalize well to unseen images within our data, some performance drop is possible when the models are applied to data obtained by other teams or equipment, if methodology or hardware factors introduce significant differences in the raw 3DUS images.
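Random online augmentation of the kind mentioned above generates a new variant of each training volume on every epoch rather than enlarging the dataset on disk. A generic sketch using random flips and in-plane 90-degree rotations (the paper's exact augmentation set may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_augment(volume: np.ndarray) -> np.ndarray:
    """Apply one random on-the-fly augmentation to a 3D volume.

    Illustrative transforms only: random flips along each axis plus a random
    axis-aligned in-plane rotation. Intensity jitter, elastic deformation,
    etc. could be added in the same style.
    """
    out = volume
    for axis in range(3):
        if rng.random() < 0.5:
            out = np.flip(out, axis=axis)
    k = int(rng.integers(0, 4))          # 0, 90, 180, or 270 degrees
    out = np.rot90(out, k=k, axes=(0, 1))
    return np.ascontiguousarray(out)
```

Each call produces a differently transformed copy, so the network rarely sees the exact same input twice, which helps counteract overfitting on a small labeled set.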
It is also worth mentioning that for the purpose of this study, while building the training, test and SGA datasets, we performed a manual quality-control round to eliminate 62 extremely poor images from our initial study cohort [9]. The images dismissed during quality control included those with very low contrast, a high amount of noise/artifacts, or a volume set that did not contain the whole placenta. Low-quality images like these understandably deteriorate the segmentation performance of the CNN models. The N=442 set still included many images of below-average quality, which can result in oversegmentation by the CNNs; this is notable in Figures 6a, 6b, and 8. Our planned future efforts will incorporate an automated quality-control module into the pipeline to detect possible shortcomings and either address them using image enhancement techniques or, when that is not possible, prompt the user for a repeat scan while the patient is still present at the clinic. We believe such an approach will significantly augment this automated tool and facilitate its use in the clinic.
We employed an ensemble of three models in order to overcome the limitations of our relatively small dataset, and to increase performance and robustness compared to a single model. Deep learning methods are known to thrive on large data; thus, if a substantially larger dataset with manually segmented labels becomes available in the future, our pipeline could potentially achieve better performance with only one or two models.
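One simple way to fuse the outputs of such an ensemble is a per-voxel majority vote over the binary masks; the sketch below shows this plausible fusion rule (the paper's actual combination of the two 2D models and one 3D model may differ, e.g. by averaging probability maps before thresholding):

```python
import numpy as np

def ensemble_vote(masks: list) -> np.ndarray:
    """Combine binary masks from several models by per-voxel majority vote.

    masks: list of 0/1 arrays with identical shape, one per model.
    Returns a 0/1 array where a voxel is foreground if more than half of
    the models agree.
    """
    stacked = np.stack([m.astype(np.uint8) for m in masks])
    # sum*2 > n  <=>  sum > n/2, i.e. a strict majority
    return (stacked.sum(axis=0) * 2 > len(masks)).astype(np.uint8)
```

With three models, a voxel is kept whenever at least two of them mark it as placenta, which suppresses errors that any single model makes in isolation.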
In any case, training the models is a one-time process; mainstream applications that require weeks of training time are not unheard of. We believe prediction time is more crucial for real-life adoption. Regardless of the training time required, our three-model pipeline can produce a segmentation of a new raw image within 1–2 minutes, which should not be a limiting issue in a clinical setting.
The AUC results from our logistic regression model were comparable to or higher than previous work [7,8]. The main advantage of our model is that, apart from PV, it only uses easily obtainable features (tobacco use, chronic hypertension, pregestational diabetes, nulliparity, age, race, BMI). Therefore, it does not require additional blood tests, imaging or repeat visits, ensuring that it can be introduced as an assisting tool in most clinical settings. While the ability to detect >60% of SGA5 infants as early as 11–14 weeks without blood tests or repeat visits is encouraging, validation on new, unseen data would further strengthen the confidence in, and utility of, this approach.
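The AUCs compared throughout this work can be computed directly from predicted risks and outcome labels via the Mann-Whitney rank statistic; a minimal NumPy sketch (not the authors' statistical code, which would also handle model fitting and the DeLong comparisons):

```python
import numpy as np

def auc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Area under the ROC curve via the Mann-Whitney U statistic.

    scores: model-predicted risks (e.g. logistic-regression probabilities).
    labels: 1 for SGA, 0 otherwise.
    AUC equals the probability that a randomly chosen SGA case is scored
    higher than a randomly chosen non-SGA case; ties count as half.
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

This pairwise-ranking view makes the reported differences concrete: moving from AUC 0.839 to 0.897 for SGA5 means roughly six more correctly ranked case/control pairs per hundred.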
Automated placental segmentation and volume measurement would enable the inclusion of placental biometry in the clinical setting and support more robust prospective research to improve risk stratification early in pregnancy. Moreover, continued refinement of this tool can enable a more robust assessment of other morphometric features, such as placental surface area and shape. Finally, as novel ultrasound-based technologies such as microvascular Doppler are developed, an automated placental segmentation tool can be extended to aid in volumetric quantification or mapping of the Doppler signal as well. Overall, by creating simplified and automated tools, the goal of incorporating direct assessment of placental development into the clinical management of pregnancy becomes achievable.
Acknowledgements
This work is supported by the Human Placenta Project (U01 HD087180), as well as NIH grants R01 EB017255, R01 NS094456, R03 HD069742, and F32 HL119010. This material is based upon work supported by Google Cloud.
Footnotes
All authors have approved the submitted version of the manuscript. We have no conflicts of interest to disclose. We have uploaded the ICMJE conflict of interest forms for all authors.
References
1. Baptiste-Roberts K, Salafia CM, Nicholson WK, Duggan A, Wang NY, Brancati FL: Gross placental measures and childhood growth. The Journal of Maternal-Fetal & Neonatal Medicine 2009; 22(1):13–23.
2. Barker DJ, Bull AR, Osmond C, Simmonds SJ: Fetal and placental size and risk of hypertension in adult life. British Medical Journal 1990; 301(6746):259–62.
3. Biswas S, Ghosh SK: Gross morphological changes of placentas associated with intrauterine growth restriction of fetuses: A case control study. Early Human Development 2008; 84(6):357–362.
4. Schwartz N, Wang E, Parry S: Two-dimensional sonographic placental measurements in the prediction of small-for-gestational-age infants. Ultrasound in Obstetrics & Gynecology 2012; 40(6):674–679.
5. Hafner E, Metzenbauer M, Stümpflen I, Waldhör T, Philipp K: First trimester placental and myometrial blood perfusion measured by 3D power Doppler in normal and unfavourable outcome pregnancies. Placenta 2010; 31(9):756–63.
6. Metzenbauer M, Hafner E, Schuchter K, Philipp K: First-trimester placental volume as a marker for chromosomal anomalies: preliminary results from an unselected population. Ultrasound in Obstetrics and Gynecology 2002; 19(3):240–242.
7. Schwartz N, Coletta J, Srinivas S, Pessel C, Timor IE, Parry S, Salafia C: Placental volume measurements early in pregnancy predict adverse perinatal outcomes. American Journal of Obstetrics and Gynecology 2009; 201(6):142–143.
8. Schwartz N, Quant HS, Sammel MD, Parry S: Macrosomia has its roots in early placental development. Placenta 2014; 35(9):684–690.
9. Schwartz N, Sammel MD, Leite R, Parry S: First-trimester placental ultrasound and maternal serum markers as predictors of small-for-gestational-age infants. American Journal of Obstetrics and Gynecology 2014; 211(3):253.e1.
10. Collins SL, Stevenson GN, Noble JA, Impey L: Rapid calculation of standardized placental volume at 11 to 13 weeks and the prediction of small for gestational age babies. Ultrasound in Medicine & Biology 2013; 39(2):253–260.
11. Looney P, Stevenson GN, Nicolaides KH, Plasencia W, Molloholli M, Natsis S, Collins SL: Automatic 3D ultrasound segmentation of the first trimester placenta using deep learning. IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017) 2017; 279–282.
12. Looney P, Stevenson GN, Nicolaides KH, Plasencia W, Molloholli M, Natsis S, Collins SL: Fully automated, real-time 3D ultrasound segmentation to estimate first trimester placental volume using deep learning. JCI Insight 2018; 3(11).
13. Oguz I, Pouch A, Yushkevich N, Wang H, Gee J, Schwartz N, Yushkevich P: Fully automated placenta segmentation from 3D ultrasound images. Perinatal, Preterm and Paediatric Image Analysis (PIPPI) workshop, MICCAI 2016.
14. Oguz BU, Wang J, Yushkevich N, Pouch A, Gee J, Yushkevich PA, Schwartz N, Oguz I: Combining deep learning and multi-atlas label fusion for automated placenta segmentation from 3DUS. In Data Driven Treatment Response Assessment and Preterm, Perinatal, and Paediatric Image Analysis. Springer, Cham, 2018; 138–148.
15. Oguz I, Yushkevich N, Pouch A, Oguz BU, Wang J, Parameshwaran S, Gee J, Yushkevich PA, Schwartz N: Minimally interactive placenta segmentation from three-dimensional ultrasound images. Journal of Medical Imaging 2020; 7(1):014004.
16. Stevenson GN, Collins SL, Ding J, Impey L, Noble JA: 3-D ultrasound segmentation of the placenta using the random walker algorithm: reliability and agreement. Ultrasound in Medicine & Biology 2015; 41(12):3182–3193.
17. Yang X, Yu L, Li S, Wang X, Wang N, Qin J, Ni D, Heng PA: Towards automatic semantic segmentation in volumetric ultrasound. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2017; 711–719.
18. Alexander GR, Himes JH, Kaufman RB, Mor J, Kogan M: A United States national reference for fetal growth. Obstetrics & Gynecology 1996; 87(2):163–8.
19. Allen DM: The relationship between variable selection and data augmentation and a method for prediction. Technometrics 1974; 16(1):125–7.
20. Stone M: Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological) 1974; 36(2):111–33.
21. Stone M: An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society: Series B (Methodological) 1977; 39(1):44–7.
22. Blum A, Kalai A, Langford J: Beating the hold-out: bounds for k-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory 1999; 203–208.
23. Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI 1995; 14(2):1137–1145.
24. de Bel T, Hermsen M, Smeets B, Hilbrands L, van der Laak J, Litjens G: Automatic segmentation of histopathological slides of renal tissue using deep learning. In Medical Imaging 2018: Digital Pathology. International Society for Optics and Photonics, 2018; 10581:1058112.
25. Mortazi A, Burt J, Bagci U: Multi-planar deep segmentation networks for cardiac substructures from MRI and CT. In International Workshop on Statistical Atlases and Computational Models of the Heart. Springer, Cham, 2017; 199–206.
26. Roth HR, Lu L, Farag A, Shin HC, Liu J, Turkbey EB, Summers RM: DeepOrgan: multi-level deep convolutional networks for automated pancreas segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015; 556–564.
27. Dietterich TG: Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems. Springer, Berlin, Heidelberg, 2000; 1–15.
28. Drucker H, Cortes C, Jackel LD, LeCun Y, Vapnik V: Boosting and other ensemble methods. Neural Computation 1994; 6(6):1289–301.
29. Opitz D, Maclin R: Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research 1999; 11:169–98.
30. Ronneberger O, Fischer P, Brox T: U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015; 234–241.
31. Dice LR: Measures of the amount of ecologic association between species. Ecology 1945; 26(3):297–302.
32. Huttenlocher DP, Klanderman GA, Rucklidge WJ: Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 1993; 15(9):850–63.
33. Barker DJ, Bull AR, Osmond C, Simmonds SJ: Fetal and placental size and risk of hypertension in adult life. British Medical Journal 1990; 301(6746):259–62.
34. Barker DJ, Thornburg KL, Osmond C, Kajantie E, Eriksson JG: The surface area of the placenta and hypertension in the offspring in later life. The International Journal of Developmental Biology 2010; 54:525.
35. Law CM, Barker DJ, Bull AR, Osmond C: Maternal and fetal influences on blood pressure. Archives of Disease in Childhood 1991; 66(11):1291–5.
36. Roberge S, Nicolaides KH, Demers S, Villa P, Bujold E: Prevention of perinatal death and adverse perinatal outcome using low-dose aspirin: a meta-analysis. Ultrasound in Obstetrics & Gynecology 2013; 41(5):491–9.
