Proceedings of the National Academy of Sciences of the United States of America
2021 Jun 3;118(23):e2025400118. doi: 10.1073/pnas.2025400118

Monitoring war destruction from space using machine learning

Hannes Mueller a,b,1, Andre Groeger b,c,1, Jonathan Hersh d, Andrea Matranga d,e, Joan Serrat f,g
PMCID: PMC8201876  PMID: 34083439

Significance

Satellite imagery is becoming ubiquitous. Research has demonstrated that artificial intelligence applied to satellite imagery holds promise for automated detection of war-related building destruction. While these results are promising, monitoring in real-world applications requires high precision, especially when destruction is sparse and detecting destroyed buildings is equivalent to looking for a needle in a haystack. We demonstrate that exploiting the persistent nature of building destruction can substantially improve the training of automated destruction monitoring. We also propose an additional machine-learning stage that leverages images of surrounding areas and multiple successive images of the same area, which further improves detection significantly. This makes real-world applications feasible, as we illustrate in the context of the Syrian civil war.

Keywords: conflict, destruction, deep learning, remote sensing, Syria

Abstract

Existing data on building destruction in conflict zones rely on eyewitness reports or manual detection, which makes them generally scarce, incomplete, and potentially biased. This lack of reliable data imposes severe limitations for media reporting, humanitarian relief efforts, human-rights monitoring, reconstruction initiatives, and academic studies of violent conflict. This article introduces an automated method of measuring destruction in high-resolution satellite images using deep-learning techniques combined with label augmentation and spatial and temporal smoothing, which exploit the underlying spatial and temporal structure of destruction. As a proof of concept, we apply this method to the Syrian civil war and reconstruct the evolution of damage in major cities across the country. Our approach allows generating destruction data with unprecedented scope, resolution, and frequency, and it makes use of the ever-higher frequency at which satellite imagery becomes available.


Building destruction during war is a specific form of violence that is particularly harmful to civilians, commonly used to displace populations, and therefore warrants special attention. Yet, data from war-ridden areas are typically scarce, often incomplete, and highly contested, when available. The lack of such data from conflict zones severely limits media reporting, humanitarian relief efforts, human-rights monitoring, and reconstruction initiatives, as well as the study of violent conflict in academic research. A novel solution to this problem is to use remote sensing to identify destruction in satellite images (1–3). This approach is gaining momentum as high-resolution imagery is becoming readily available at ever-higher frequency, yielding weekly, or even daily, images. At the same time, recent methodological advances related to deep learning have provided sophisticated tools to extract data from these images (4–7).

While seminal research has demonstrated the use of automated classifiers for destruction detection, practical applications have so far been hampered by severe problems with labeling, domain transfer, and class imbalance in real-world imagery from urban war zones. As a consequence, international organizations such as the United Nations, the World Bank, and Amnesty International use remote sensing with manual human classification to produce damage-assessment case studies (8–10). On the other hand, providers of conflict data for research purposes still rely heavily on news and eyewitness reports, which leads to large data-publishing lags and potential biases (11–17). An automated building-damage classifier for use with satellite imagery, which has a low rate of false positives in unbalanced samples and allows tracking on-the-ground destruction in close to real-time, would therefore be extremely valuable for the international community and academic researchers alike.

In this article, we present a way of combining computer-vision techniques and publicly available high-resolution satellite images to produce building-destruction estimates that are of practical use to both practitioners and researchers. The standard architectures for this task are convolutional neural networks (CNNs),* as they have achieved unprecedented success in large-scale visual image classification with error rates beating humans (18, 19). We train a CNN to spot destruction features from heavy weaponry attacks (i.e., artillery and bombing) in satellite images, such as the rubble from collapsed buildings or the presence of bomb craters.

We make three relevant methodological contributions. First, we introduce a label-augmentation method for expanding destruction class labels by making reasonable assumptions about the data-generating process using contextual information. Second, we introduce a two-stage classification process to control for spatial and temporal noise where the results from the CNN are processed through a random-forest model that relies on spatial and temporal leads and lags to improve classification performance. Third, we apply our trained computer-vision model to repeated satellite images of the entire populated areas of major Syrian cities, including parks and highways, and produce longitudinal estimates of building destruction over the course of the recent civil war.

We demonstrate that our method yields high performance in out-of-sample tests and validate its ability for destruction monitoring using a separate database of heavy weaponry attacks. Our results highlight the importance of repeated satellite imagery in combination with temporal filtering to improve monitoring performance. As a result, our approach can be applied to any populated area, provided that repeated, high-resolution (i.e., submeter) satellite imagery is available.

Why Automated War-Destruction Monitoring Is Hard

Several studies have demonstrated the use of computer vision on satellite imagery to identify different types of destruction (2, 3, 20–26). In many cases, this is destruction from natural disasters, which tends to be spatially concentrated. While performance results from the literature are encouraging, they typically focus on evaluations at one point in time and training/validating on datasets composed of equal numbers of damaged and undamaged images.

Precision performance in repeated destruction scans of entire cities with heavily unbalanced classes, as in our application, has not been explicitly presented in the literature so far. Part of the reason for this gap is that automated methods need to be able to detect building destruction in an empirical context where the vast majority of images do not feature destruction. Class imbalance is a common problem in machine-learning applications, but the detection of destruction in war zones faces an extreme level of imbalance. Even in a city that suffered as much destruction as Aleppo, only 2.8% of all images of populated areas contain a building that was classified as destroyed by the United Nations Operational Satellite Applications Program (UNOSAT) in September 2016.

Fig. 1 depicts this quite clearly. In Fig. 1A, we see the full extent of Aleppo, with all destroyed building annotations depicted as red dots. Fig. 1B zooms into the central area of Aleppo, just east of the historic Citadel, which was heavily attacked. The red dots coincide clearly with patterns of destruction from heavy weaponry attacks in the satellite images. But destruction only affected a small fraction of buildings, even in this heavily affected part of the city.

Fig. 1. Imagery of Aleppo on September 18, 2016. Red dots indicate buildings annotated as destroyed by UNOSAT. Areas enclosed by magenta lines are "no analysis" zones, excluded from the UNOSAT damage assessment because they host noncivilian buildings. The yellow line encloses the populated areas of Aleppo under analysis. (A) Overview of the urban area of Aleppo. (B) Area in central Aleppo close to the Citadel. Sources: Google Earth/Maxar satellite imagery and UNITAR/UNOSAT damage annotations.

With such class imbalance, even a small false positive rate (FPR) will result in an unacceptable absolute number of false-positive predictions in applications, which would yield destruction data that are practically useless due to high measurement error. A simple example illustrates this: Suppose we have 100,000 sample images, of which 1,000 are destroyed. A “low” FPR of 15% together with a true positive rate (TPR) of 90% implies that the classification model will produce 14,850 false positives and 900 true positives, resulting in a precision below 6%. In other words, conditional on predicting destruction, such a classifier would be wrong more than 94% of the time. Note that the same classifier produces a “high” precision score of 86% on a 1:1 balanced sample.
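
The arithmetic of this example can be reproduced directly. The following minimal sketch (purely illustrative and not part of our released code) computes the precision implied by a given TPR and FPR under both class distributions:

```python
# Illustrative only: precision implied by a fixed TPR/FPR under class imbalance.

def implied_precision(tpr, fpr, n_positive, n_negative):
    """Precision = true positives / (true positives + false positives)."""
    true_positives = tpr * n_positive
    false_positives = fpr * n_negative
    return true_positives / (true_positives + false_positives)

# Unbalanced sample: 1,000 destroyed vs. 99,000 nondestroyed patches.
print(implied_precision(0.90, 0.15, 1_000, 99_000))  # ~0.057, i.e., below 6%
# The same classifier evaluated on a 1:1 balanced sample.
print(implied_precision(0.90, 0.15, 1_000, 1_000))   # ~0.857, i.e., about 86%
```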

The task of automated monitoring over time is typically further complicated by a lack of training data, i.e., the low number of destruction labels available in any given city. This can quickly lead to overfitting in machine learning, as the training set consists of a narrow selection of building types, neighborhoods, sun and satellite angles, and changing vegetation or weather phenomena like snow and cloud coverage. These problems are known as spatial and temporal domain shift (27). Temporal domain shift is a particularly serious problem in our application, as destruction monitoring requires the generation of a reasonable timeline with repeated scans of the same city. This emphasizes the need for a robust solution to this problem that ensures some comparability across time.

Our approach aims at solving these problems. We exploit the time dimension of the images and labels to alleviate the domain-shift problems and extreme class imbalance. We also make a point of reporting precision performance in unbalanced samples to provide realistic insights into the potential performance in applications.

Methods

Satellite Data.

Most of our sample comes from Aleppo, which we use as our main proof-of-concept due to the size of the city and the high availability of repeated images and labels. To train and evaluate our model, we used 22 high-resolution satellite images from Aleppo and a total of 42 images from five other Syrian cities (Table 1). All images used in this analysis were obtained from Google Earth (28); were georeferenced and orthorectified; and feature three bands (red, green, blue), as well as a ground sampling distance of circa 50 cm per pixel.

Table 1. Sample overview

City          (1) Total   (2) Total    (3) Total         (4) Total          (5) Share destroyed
              images      patches      labeled images    labeled patches    patches, %
Aleppo        22          2,106,412    4                 1,626,920          1.83
Daraa         13          202,462      4                 125,231            1.00
Deir-Ez-Zor    7           98,602      4                  84,723            2.86
Hama           9          285,057      3                 224,365            3.73
Homs           5          200,035      2                  83,941            8.26
Raqqa          8          180,184      3                 112,481            1.96
All           64          3,072,752    20                2,257,661          2.26

Note: Column (1) reports the number of “post” satellite images/time periods, excluding the first preimage for each city. Column (2) reports the resulting number of patches in the populated areas of the respective city based on available imagery. Column (3) refers to the number of images/time periods for which UNITAR/UNOSAT labels are available. Column (4) is the number of patches for which UNITAR/UNOSAT damage labels for the “destroyed” class are available after label augmentation. Column (5) is the share of destroyed labels over the number of labeled patches. Sources: Author calculations based on Google Earth/Maxar satellite imagery and UNITAR/UNOSAT damage annotations.

Sample images cover the period 2011 to 2017, after the onset of the civil war in Syria, during which extensive destruction from heavy weaponry attacks occurred across all sample cities. We used an additional, early image for each city (for example, June 26, 2011, in Aleppo) as the "pre" image and refer to the 64 later images as the "post" images. Our method relies on change detection—i.e., when classifying images, the preimage is compared to the respective postimage.

To move as close as possible to the automated monitoring task, we transformed all images into millions of 64 × 64 pixel subimages that we call patches. These patches are the unit of observation for training and testing, as well as for the final step, which we call scanning or dense prediction, in which the classifier produces fitted values for every patch in the study areas. Ground area coverage of each patch can vary slightly, but is approximately 1,024 square meters (i.e., 32 × 32 m). Importantly, the size of a specific patch remains constant over time.
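
As an illustration of this preprocessing step, the sketch below (simplified; all names are ours and not taken from the released code) tiles an image array into nonoverlapping 64 × 64 pixel patches indexed by their grid position, so that the same patch identifier can be reused across imaging dates:

```python
# Illustrative sketch: tiling a satellite image into 64 x 64 pixel patches.
# At roughly 0.5 m per pixel, each patch covers about 32 m x 32 m on the ground.
import numpy as np

def tile_image(image: np.ndarray, patch_size: int = 64):
    """Split an H x W x 3 array into nonoverlapping patches.

    Returns (row_index, col_index, patch) tuples; the (row, col) pair serves
    as a patch identifier that is stable across images of different dates.
    """
    height, width = image.shape[:2]
    patches = []
    for i in range(0, height - patch_size + 1, patch_size):
        for j in range(0, width - patch_size + 1, patch_size):
            patches.append((i // patch_size, j // patch_size,
                            image[i:i + patch_size, j:j + patch_size]))
    return patches

# A mock 640 x 640 pixel image yields 10 x 10 = 100 patches.
print(len(tile_image(np.zeros((640, 640, 3), dtype=np.uint8))))  # 100
```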

Column (2) in Table 1 reports the sample size in terms of patches for the six cities in our sample. For Aleppo, for example, we have over 95,000 patches per image times 22 images, which gives approximately 2.1 million patches. Importantly, this is panel data, where images of the same patch are repeated 22 times.

Destruction Labels.

We combine the imagery data with georeferenced building-damage labels produced by UNOSAT, which is part of the United Nations Institute for Training and Research (UNITAR) (8). Over the course of the Syrian civil war, UNOSAT produced building-destruction annotations by manual inspection of satellite images for severely affected Syrian cities. For Aleppo, these manual assessments were conducted at four different dates, one each year between 2013 and 2016. Column (3) in Table 1 reports the number of these assessments.

UNOSAT damage annotations were categorized into three degrees of damage: moderate damage, severe damage, and complete destruction. In our analysis, we rely on the latter class because destruction patterns for the other labels were not always clearly visible in the satellite images. We classify a patch as destroyed if at least one UNOSAT destruction annotation falls inside it.

Our analysis of building destruction focuses on the urban areas of Syrian cities. For Aleppo, this is depicted by the area enclosed by the yellow line in Fig. 1. Areas enclosed by magenta lines correspond to so-called "no analysis" areas, which UNOSAT left out of its damage annotations because these zones host noncivilian buildings. Consequently, these areas are also excluded from the training process. However, we scan these areas and make use of these scans for out-of-sample validation. Sample image patches for destroyed areas predestruction and postdestruction are presented in SI Appendix, Fig. S1, and nondestroyed ones, including damaged buildings, are shown in SI Appendix, Fig. S2.

The ideal annotation dataset to analyze this problem would be composed of pixel-wise classification of all damaged and nondamaged buildings across the sample cities for all time periods. Labels like this could then be used to train models to identify the footprint of destroyed buildings using satellite images (3, 22). However, because of the significant cost of annotating destruction footprints, UNOSAT only provides point coordinates (centroids) of destroyed buildings. We matched these point labels to our image patches by attributing a label to the closest patch centroid. One issue with this method of generating labels is that buildings have different sizes, and, therefore, some UNOSAT labels are surrounded by more visible destruction than others. We address this issue through a second stage, described below, in which we exploit spatial information.
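
A minimal sketch of this matching step is shown below; it assumes planar coordinates and hypothetical variable names, and simply flags every patch whose centroid is the closest one to at least one annotation point:

```python
# Illustrative sketch: assign each UNOSAT point annotation to the patch whose
# centroid is closest, and mark those patches as destroyed.
import numpy as np
from scipy.spatial import cKDTree

def label_patches(patch_centroids: np.ndarray, annotation_points: np.ndarray) -> np.ndarray:
    """patch_centroids: (n_patches, 2) x/y coordinates of patch centers.
    annotation_points: (n_annotations, 2) coordinates of destroyed buildings.
    Returns a boolean array, True for patches receiving at least one annotation."""
    tree = cKDTree(patch_centroids)
    _, nearest_patch = tree.query(annotation_points)  # index of closest centroid
    destroyed = np.zeros(len(patch_centroids), dtype=bool)
    destroyed[nearest_patch] = True
    return destroyed
```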

Contextual Label Augmentation and Test Sample.

The computer-vision task is to train an algorithm to detect destruction from the visual bands of high-resolution daylight satellite images. Training deep-learning architectures typically requires large training datasets, including thousands of labels, which are extremely rare in our empirical context.

Consequently, as reported in column (3) of Table 1, we have a maximum of four UNOSAT annotation dates to work with for certain cities, and for others, three or only two (i.e., Homs). Compared to the number of annotations, we usually have significantly more raw images available, as shown in column (1). In addition, few label dates perfectly coincide with the date of a satellite image. This generates an “uncertain class,” in which patches cannot be attributed clearly to either the destroyed or not-destroyed class because destruction could have occurred between the labeling date and the date of the image.

To increase the number of labeled data points, we exploit the fact that reconstruction was largely absent in the areas of interest during the study period between 2013 and 2017 (SI Appendix, Table S4). Our label-augmentation approach assumes that positive samples at time t_i also remain positives at subsequent times t_j > t_i, i.e., that destruction persists throughout the period of the civil war, and, conversely, that negative samples at time t_j also had to be negatives at earlier times t_i < t_j.

We solve two problems using this approach. First, we expand the size of our training dataset by boosting the number of labels to close to 2.3 million, of which approximately 51,000 show destruction. Second, by including additional time periods in our training sample, we improve our classifier's ability to handle domain shift. Our method of label augmentation is conservative, given that we assign missing values to all patches that remain in the uncertain class: those for which we cannot know with certainty whether destruction has occurred in the past or those for which we do not know with certainty that they will be labeled not destroyed in the future.
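
The augmentation rule itself fits in a few lines. The sketch below (assumed column names, simplified relative to the released code) carries destroyed labels forward in time, carries not-destroyed labels backward, and leaves everything else, the uncertain class, as missing:

```python
# Illustrative sketch of the label-augmentation rule.
import numpy as np
import pandas as pd

def augment_labels(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per (patch_id, date) with column 'label' in
    {1 = destroyed, 0 = not destroyed, NaN = no annotation}."""
    df = df.sort_values(["patch_id", "date"]).copy()

    def _augment(labels: pd.Series) -> pd.Series:
        out = labels.copy()
        # Destruction persists: once labeled destroyed, destroyed at all later dates.
        out[labels.eq(1).cummax() & out.isna()] = 1
        # A patch not destroyed at t cannot have been destroyed at any earlier date.
        out[labels.eq(0)[::-1].cummax()[::-1] & out.isna()] = 0
        return out  # remaining NaNs form the "uncertain class"

    df["label_augmented"] = df.groupby("patch_id")["label"].transform(_augment)
    return df
```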

Fig. 2 illustrates our method for generating training and test samples. Given the temporal and spatial structure of the data, extra care must be taken when splitting the sample for training and testing to avoid overfitting. Standard cross-sectional cross-validation procedures are not appropriate since they could show the network patches from different times, but the same location, in training and testing. We therefore used the patch identifier to perform sample splitting, whereby 70% of patches are reserved for training and 30% for testing across temporal periods. All performance measures reflect accuracy as measured from data reserved in the test set.
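
A sketch of this splitting rule, with hypothetical identifiers, keeps all observations of a given patch on the same side of the split:

```python
# Illustrative sketch: split by patch identifier so that the same location never
# appears in both training and test sets, regardless of the imaging date.
import numpy as np

def split_by_patch(patch_ids: np.ndarray, test_share: float = 0.3, seed: int = 0):
    """patch_ids: one entry per patch-date observation.
    Returns boolean masks (train_mask, test_mask) aligned with patch_ids."""
    rng = np.random.default_rng(seed)
    unique_ids = np.unique(patch_ids)
    test_ids = rng.choice(unique_ids, size=int(test_share * len(unique_ids)),
                          replace=False)
    test_mask = np.isin(patch_ids, test_ids)
    return ~test_mask, test_mask
```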

Fig. 2. Image sampling and prediction process. The timeline shows 23 Aleppo images. The first image, from June 26, 2011, is used as a prewar image when training the classifier. All other 22 images are used as postimages. Images are split into over 95,000 patches, which serve as the unit of analysis and are separated into test and training samples before the analysis. Labels for the patches come from UNITAR/UNOSAT annotation dates shown as black dots on the timeline. Annotations are extended forward and backward in time beyond these dates under the assumption that buildings that are labeled destroyed at some point remain destroyed throughout the period of observation, and that buildings labeled as not destroyed at a given time were not destroyed before. Patches that are not destroyed at an annotation date, but are destroyed at a later annotation date, have an unknown class. All patches that are not classified as destroyed at the last annotation date are of unknown class (set to missing) after that date.

CNN Architecture and Two-Stage Classification Procedure.

Another innovation in our approach is the use of a two-stage classification procedure that feeds the predicted destruction estimates from the initial CNN model into a random-forest classifier. With respect to the CNN architecture, we experimented with several different types of CNNs. For each of these, we optimized hyperparameters according to accuracy results in the validation set. The results of these experiments suggested the use of a relatively flat CNN architecture, as described in SI Appendix, section 1.
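
For illustration only, a minimal Keras sketch of a flat change-detection CNN of this kind is shown below; it stacks the pre- and post-image patch along the channel axis and outputs a destruction probability. It is not the exact architecture used here, which is described in SI Appendix, section 1.

```python
# Illustrative sketch of a small, flat change-detection CNN (not the exact
# architecture of the paper; see SI Appendix, section 1 for that).
import tensorflow as tf

def build_cnn(patch_size: int = 64) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(patch_size, patch_size, 6))  # pre + post RGB
    x = inputs
    for filters in (32, 64, 128):
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # P(destroyed)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model
```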

To the output of the CNN model, we applied a second machine-learning stage, intended to exploit the temporal and spatial clustering of destruction. Specifically, the labels and predicted values from the CNN are used to train a random-forest model that relies on information from two spatial lags around each patch location and two temporal leads/lags around each date. The random forest uses these spatial and temporal features of the raw CNN scores, plus the spatial standard deviation, to generate predictions for the test sample and for the dense prediction.

The logic behind this second-stage approach is that destruction is not only serially correlated, but also spatially clustered. We separated this step from the deep-learning stage for maximum flexibility and modularity. This allows us to vary the information set that we used in the second-stage model. In particular, we experimented with using only spatial information and different temporal lag structures and discuss their relative importance below.
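
The sketch below outlines how such a second stage could be assembled under assumed data structures: temporal leads and lags of each patch's own first-stage score, plus the mean and standard deviation of the scores in a 5 × 5 window of neighboring patches (two rings), fed into a random forest. It is a simplified stand-in for the released code.

```python
# Illustrative sketch of the second-stage feature construction and random forest.
# Assumes 'row'/'col' are consecutive integer grid indices of the patches; in
# practice, patches outside the populated area would require NaN handling.
import numpy as np
import pandas as pd
from scipy.ndimage import generic_filter, uniform_filter
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["cnn_score", "lag_1", "lag_2", "lead_1", "lead_2",
            "spatial_mean", "spatial_sd"]

def add_spatiotemporal_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per (row, col, date) with first-stage output 'cnn_score'."""
    df = df.sort_values(["row", "col", "date"]).copy()
    grouped = df.groupby(["row", "col"])["cnn_score"]
    for k in (1, 2):  # two temporal lags and leads of the patch's own score
        df[f"lag_{k}"] = grouped.shift(k)
        df[f"lead_{k}"] = grouped.shift(-k)

    def per_date(grp: pd.DataFrame) -> pd.DataFrame:
        grid = grp.pivot(index="row", columns="col", values="cnn_score").to_numpy()
        rows, cols = grp["row"].to_numpy(), grp["col"].to_numpy()
        grp = grp.copy()
        grp["spatial_mean"] = uniform_filter(grid, size=5, mode="nearest")[rows, cols]
        grp["spatial_sd"] = generic_filter(grid, np.std, size=5, mode="nearest")[rows, cols]
        return grp

    return df.groupby("date", group_keys=False).apply(per_date)

def fit_second_stage(train: pd.DataFrame) -> RandomForestClassifier:
    """train must contain FEATURES and the binary label 'destroyed'."""
    train = train.dropna(subset=FEATURES + ["destroyed"])
    forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    forest.fit(train[FEATURES], train["destroyed"])
    return forest
```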

Data Generation.

As a final step, we trained the second stage on all available data and predicted values for every patch-period combination in our data. This simulates the data-generation problem where the trained architecture is used to interpret all patches at all points in time, including those patches that had missing labels. The result is what we call dense predictions, and this forms the raw material for additional validation exercises. As reported in column (2) of Table 1, the result is a panel dataset of destruction predictions at the patch level for six cities, with varying time periods and over 3 million patch-time observations.

Results

Overall Performance.

Our first-stage CNN classifier achieves an area under the curve (AUC) of 0.86 in the test sample of the first stage (i.e., with the raw output from the CNN) and an AUC of 0.92 after the second-stage random-forest procedure (SI Appendix, Fig. S4). The associated receiver operating characteristic (ROC) curve implies that a TPR of 0.8 is associated with an FPR of 0.17. At a more conservative, higher threshold for a positive classification, a TPR of 0.5 is associated with an FPR of only 0.025. However, the class imbalance is extremely relevant here. The ROC curve and its AUC are classification-performance measures that are not affected by class imbalance and, therefore, do not capture its impact in our sample. In what follows, we therefore focus on precision statistics to highlight the problem of unbalanced classes in applications of automated destruction detection.
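
The distinction between these metrics can be made explicit. The sketch below (hypothetical variable names) reports the ROC AUC together with the average precision on the original unbalanced test sample and on a 1:1 version of it obtained by up-sampling the positives, mirroring the comparison in Fig. 3A:

```python
# Illustrative sketch: AUC versus average precision under different class balances.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(y_true: np.ndarray, scores: np.ndarray, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    metrics = {
        "auc": roc_auc_score(y_true, scores),  # insensitive to class imbalance
        "avg_precision_unbalanced": average_precision_score(y_true, scores),
    }
    # Balanced evaluation: up-sample positives to a 1:1 ratio.
    pos, neg = np.flatnonzero(y_true == 1), np.flatnonzero(y_true == 0)
    idx = np.concatenate([rng.choice(pos, size=len(neg), replace=True), neg])
    metrics["avg_precision_balanced"] = average_precision_score(y_true[idx], scores[idx])
    return metrics
```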

Fig. 3 summarizes our main results across cities. Fig. 3A presents two precision-recall curves from the test sample that depict the out-of-sample performance of our classification approach. The dashed orange curve plots the precision-recall trade-off in the balanced sample. The average precision here is 0.86, and the curve suggests a very mild trade-off, with a precision of over 0.9 at a recall rate of 0.5, for example. In contrast, the solid blue line depicts the performance of the same model when taking into account the unbalanced classes that automated destruction detection would face in the actual application in the test sample. Clearly, precision is much lower, with an average precision of a mere 0.24. For a recall rate of 0.5, the first stage reaches a precision of below 0.2. This illustrates starkly how class imbalance in real applications changes the precision-recall trade-off.

Fig. 3. (A) Precision-recall curve, unbalanced versus balanced sample. Reported performance is in the 30% test sample, either by up-sampling the positives to reach a 1:1 sample (orange curve) or by evaluating at the original sample proportions (blue curve). (B) Precision-recall curve, unbalanced sample: first-stage model versus two alternative second-stage models. As in A, the blue curve shows performance after the first stage. The dashed maroon curve shows performance after the second stage, which trains a random forest on temporal and spatial leads and lags in the training sample. The dotted purple curve shows performance when using only spatial lags and no additional temporal information. (C) Average second-stage dense patch-wise destruction-prediction scores for Aleppo city, Syria. Green color indicates low prediction scores, and red color indicates high prediction scores. Color bins reflect deciles of second-stage fitted values with full spatial and temporal smoothing. Sources: Google Earth/Maxar satellite imagery, UNITAR/UNOSAT damage annotations, and author calculations.

In Fig. 3B, we illustrate the improvement in precision that we achieve by applying the second stage. The figure compares precision-recall curves for the first stage (solid blue line), as in Fig. 3A, with the improvements from the second-stage models, all evaluated in the unbalanced test sample. The second-stage average precision increases to 0.29 with only spatial smoothing (dotted purple line) and to 0.43 with temporal and spatial smoothing (dashed maroon line). This highlights a key insight from our experiments with the modular second stage: temporal smoothing is crucial for reaching better precision in the second stage. The gains from spatial smoothing are relevant in some cases, but the real boost in performance arises when using temporal information to validate predictions coming out of the first stage.

In Fig. 3C, we show an example of the final output of our methodology—the continuous dense prediction scores generated from the second stage. The figure shows the average patch-wise dense predictions across the entire city of Aleppo, including no-analysis zones. Red color indicates high predicted scores, and green indicates low scores. Generally, the red areas coincide with the destruction annotations in Fig. 1. In addition, roads and parks are clearly visible as dark green (lowest destruction probability) or yellow patches. This is not only evidence of the power of our approach in picking up housing destruction, but it also shows how the classifier has learned that roads and parks are never destroyed buildings.

The Role of the Second-Stage Module.

The second stage plays a key role in boosting performance to levels that imply practical gains from automating destruction monitoring in our sample. It is important to consider that, while the cities in our sample are all in the same country, they are of different sizes, have different building types, and are situated in different landscapes with a variety of vegetation and seasonal changes. In addition, label and image availability differ dramatically. As shown in Table 1, the vast majority of image patches in our sample come from Aleppo due to its large size and elevated image availability; less than one-third of all patches come from other cities (SI Appendix, Table S3 summarizes the results from training on Aleppo exclusively). If our approach can adapt to these very different conditions, we can be optimistic about applications elsewhere.

Table 2 provides details on the performance improvements through the second-stage procedure by city. In column (1), we report performance of the first stage by city. This reveals strong differences in performance across cities, with average precision ranging from a mere 4.2% for Daraa to an impressive 54.5% for Hama (for corresponding precision-recall curves, see SI Appendix, Fig. S5). To a large degree, this is driven by sample imbalances, where Daraa suffered only 1% of destroyed patches, on average, whereas Hama suffered almost four times as much.

Table 2. Model performance when varying the second-stage module in the unbalanced sample

City          (1) First stage    (2) Second stage,     (3) Second stage, spatial   (4) Second stage, spatial
              (CNN), precision   spatial, precision    + temporal, precision       + temporal, AUC
Aleppo        16.1               16.9                  35.7                        91.5
Daraa          4.2                4.6                  11.7                        89.0
Deir-Ez-Zor   11.0               12.1                  21.7                        80.0
Hama          54.5               65.2                  68.0                        91.0
Homs          25.8               34.9                  55.2                        85.7
Raqqa         12.8               17.4                  32.1                        87.6
All           24.5               28.7                  42.5                        90.7

Notes. First-stage predictions from CNNs and second-stage predictions from random-forest model (CNN + RF) with spatial leads/lags (column 2) and spatial and two temporal leads/lags (columns 3 and 4). Columns (1)–(3) report the average precision and column (4) the AUC. Sources: Author calculations based on Google Earth/Maxar satellite imagery and UNITAR/UNOSAT damage annotations.

The second stage boosts this performance substantially. This is most notable for the worst-performing cities, for which precision improves twofold to threefold in the full model (column 3). How does the full model achieve this improvement in performance? Table 2 confirms the role of the temporal smoothing shown in Fig. 3. However, the city-by-city analysis also reveals interesting differences across cities, where Homs and Hama seem to benefit more from the spatial smoothing. In both cities, destruction is indeed clustered heavily in some neighborhoods, so that this clustering might be useful in reinforcing patch-wise predictions in the second stage. Our predictions for Daraa, Deir-Ez-Zor, and Aleppo rely much more on repetition and temporal smoothing. We confirm the role of temporal smoothing in SI Appendix, Table S5 by varying temporal lags and providing performance estimates without spatial smoothing.

The improvements with temporal smoothing suggest that the domain-shift problem across time plays an important role when angles, lighting, vegetation, and seasons change. Our results therefore highlight the potential role of repeated high-frequency imagery and temporal smoothing for providing useful destruction monitoring. The extreme imbalance combined with small samples imposes serious trade-offs for monitoring, but we show in the following section that monitoring can be made to work even in the case of Aleppo, which has one of the more unbalanced samples in our dataset.

External Validation Exercises.

We conduct two validation exercises to illustrate the merits of our approach. We first make use of the no-analysis areas in Aleppo (Fig. 1) that have been entirely excluded from the training process. One of these zones corresponds to the Ramouse neighborhood in the southernmost tip of our study area in Aleppo—an area that our classifier identified as heavily destroyed, as depicted in Fig. 3C.

In Fig. 4, we show satellite imagery from a subarea of the Ramouse neighborhood at two points in time, before (December 6, 2016; A–C) and after (December 18, 2016; D–F) a major heavy weaponry attack. We show raw satellite images (Fig. 4 A and D), patch-wise visualizations of the second-stage continuous prediction scores (Fig. 4 B and E), and a binary classification (Fig. 4 C and F). Because the classifier has not been trained on this area, this exercise serves as a good out-of-sample validation test. Visual inspection of the raw images shows no destruction before (A), but extensive building destruction after the attacks (D). Comparing the continuous prediction scores before (B) and after the attack (E) shows a significant increase in predicted destruction, which coincides clearly with the locations of actual destruction of buildings in the area. Note that the model also correctly classifies areas without building destruction, such as the industrial compounds in the northeast and southwest of the image, as not destroyed at both points in time. The same applies to the fields and roads in the east and the forest in the west. Fig. 4 C and F shows one way of converting continuous prediction scores into a binary classification. The threshold chosen here is optimized to reach a level of 50% recall in the test sample. One can observe that the before period is consistently classified as nondestroyed (with one exception), whereas destruction is indicated in affected areas after the attacks.
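
The recall-targeted cutoff used for the binary maps in Fig. 4 C and F can be read directly off the labeled test sample; a minimal sketch (hypothetical variable names) is:

```python
# Illustrative sketch: choose the cutoff so that recall on the labeled test
# sample reaches the target (here 50%); patches with scores at or above the
# cutoff are then mapped to "destroyed".
import numpy as np

def threshold_for_recall(y_true: np.ndarray, scores: np.ndarray,
                         target_recall: float = 0.5) -> float:
    positive_scores = np.sort(scores[y_true == 1])[::-1]  # descending
    k = int(np.ceil(target_recall * len(positive_scores)))
    return positive_scores[k - 1]  # k-th highest score among true positives
```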

Fig. 4. Example of raw satellite images (A and D), second-stage patch-wise continuous prediction scores (B and E), and binary classification (C and F) for the Ramouse neighborhood of Aleppo, Syria, before (A–C) and after (D–F) heavy weaponry attacks. Green color indicates low prediction scores, and red color indicates high prediction scores. Color bins reflect deciles of fitted values. The binary classification cutoff was optimized to reach 50% recall in the test sample. Satellite image recording dates: December 6, 2016 (before), and December 18, 2016 (after). Approximate image centroid location: 36.1525 decimal degrees north and 37.1332 east. Sources: Google Earth/Maxar satellite imagery and author calculations.

Fig. 4 demonstrates that the classifier is able to identify destruction in parts of the city that were not part of the training sample. This is important, as it shows that we are able to successfully solve the spatial and temporal domain-shift problems within Aleppo and, thus, generate a time series of destruction data in this way. If our automated method were to augment human monitoring, this is the kind of data that would be passed on to human verification.

Given our strategy of expanding labels forward and backward in time, it becomes particularly important to verify the ability of our approach to approximate the timing of destruction. We therefore validated our dense predictions in an event-study framework, which relies on an external dataset of georeferenced bombing events in Syria. In particular, we relied on 731 bombing events with precise location information from the Live Universal Awareness Map project (LiveUAmap) (29). We merged these events with our pooled sample of dense predictions at the patch-time level. We then conducted an event-study regression on a sample of over 2.8 million observations to test whether our prediction scores increase in the aftermath of an externally reported bombing event (see SI Appendix, section 2 for details).
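
A simplified sketch of this regression is given below. It uses hypothetical variable names, includes period dummies only, and omits the patch fixed effects and further controls of the full specification reported in SI Appendix, section 2:

```python
# Illustrative, simplified event-study sketch (not the full SI specification).
import pandas as pd
import statsmodels.formula.api as smf

def event_study(panel: pd.DataFrame, n_leads_lags: int = 5):
    """panel: one row per (patch_id, period) with the second-stage score
    'prediction' and a dummy 'bombing' for a reported event in that patch-period."""
    panel = panel.sort_values(["patch_id", "period"]).copy()
    terms = []
    for k in range(-n_leads_lags, n_leads_lags + 1):
        name = f"lead_{-k}" if k < 0 else (f"lag_{k}" if k > 0 else "event")
        # shift(k) with k > 0 retrieves the bombing dummy k periods earlier (lag);
        # k < 0 retrieves it |k| periods later (lead).
        panel[name] = panel.groupby("patch_id")["bombing"].shift(k).fillna(0)
        terms.append(name)
    formula = "prediction ~ " + " + ".join(terms) + " + C(period)"
    fit = smf.ols(formula, data=panel).fit(
        cov_type="cluster", cov_kwds={"groups": panel["patch_id"]})
    return fit.params[terms], fit.conf_int().loc[terms]
```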

We present a coefficient summary plot for two second-stage modules in Fig. 5. The graph shows clearly that bombing events are positively and significantly correlated with the destruction scores at the time and patch levels. Note that the baseline hazard of destruction, i.e., the mean of the dependent variable, is very small in our sample (SI Appendix, Table S2). Compared to the baseline level of the respective destruction score, the point estimates imply an increase of 29% and 37%, respectively, after a bombing event is reported in a given cell. This is a substantial increase if one keeps in mind that not all bombing events will result in the destruction of a building, which introduces attenuation bias in the regression. The figure also shows that temporal smoothing implies big gains in overall signal strength, with the coefficients from the full model (red diamonds) lying consistently above those from the spatial-only model (blue squares).

Fig. 5. Event-study validation exercise, pooled sample. External bomb-event data from LiveUAmap are positively and significantly correlated with satellite-predicted war destruction at the patch level. The figure shows coefficients from a regression of five leads and lags of bombing events identified in the event data against our continuous destruction-prediction score from the second stage. Point estimates depicted by blue squares correspond to second-stage continuous prediction scores with spatial smoothing only, and red diamonds correspond to the full model with spatial and temporal smoothing. Error bars represent 95% CIs. The dashed line indicates the occurrence of a bombing event in the event data, and coefficients capture the response in predicted damage. The full regression specification and results are reported in SI Appendix, section 2 and Table S2, respectively. Sources: LiveUAmap event data and author calculations.

Discussion

Building destruction due to heavy weaponry attacks is a particularly salient form of war-related violence. Destruction is often used as a military strategy to displace populations and is responsible for tremendous human suffering beyond the loss of life. Organizations like the Red Cross warn that massive destruction of urban infrastructure (also called urbicide) has dramatic knock-on effects on health, as it implies the destruction of water and power supplies, as well as hospitals. Therefore, reliable and updated data on destruction from war zones play an important role for humanitarian relief efforts, but also for human-rights monitoring, reconstruction initiatives, and media reporting, as well as for the study of violent conflict in academic research. Studying this form of violence quantitatively, beyond specific case studies, is currently impossible due to the absence of systematic data.

Our method of identifying building destruction combines existing state-of-the-art computer-vision methods with an additional postprocessing step and exploits the time dimension of destruction data to expand the training dataset. This allows us to exploit the repetition of imagery to bring down error rates when classifying destruction. Thanks to these advances, we were able to achieve an AUC above 0.9 and an average precision of over 0.42 in the unbalanced sample from six Syrian cities. We also show that our approach is able to identify the timing and location of building destruction out-of-sample, i.e., in areas of Aleppo that had not been used for training the classifier.

These results are encouraging and demonstrate the applicability of automated destruction classification, including close to real-time tracking for policy purposes. Our method is particularly well placed to take advantage of the ever-increasing temporal granularity of imagery. Our calculations suggest that manual human labeling of our entire dataset would cost approximately 200,000 USD, and additional repetitions of imagery would increase these costs almost proportionally. With an automated method like ours, higher image frequency helps precision and comes at only marginal extra cost. However, our results also suggest limitations where average precision falls, e.g., if only a very low share, less than 1%, of a city is destroyed. For applications requiring high precision in heavily imbalanced prediction problems, such as the monitoring of several cities, we believe that the real use case for our approach will be in a decision-support framework, in which the predictions are combined with human verification to create much faster and more accurate on-the-ground violence detection. Iterations between machine learning and human verification can also help improve the training process (30) and could be easily integrated into our approach.

The performance of our method could be further improved by increasing the size of the training dataset, which could also help adapt it to classify destruction in other war zones around the globe. Further performance improvement could be achieved through fine-tuning, a common practice in deep learning in which the network is first pretrained on a large sample of building destruction from a variety of contexts and then refined by training on heavy weaponry destruction. This could be implemented by using a recent public dataset of natural-disaster destruction imagery that provides a sample of 98,000 annotated buildings across three levels of damage (31). Moreover, domain-adaptation techniques developed for deep learning could be used to try to further minimize the remaining domain biases (32).

Our label-augmentation technique is driven by strong assumptions and should therefore be regarded only as a first step in understanding the dynamic classification of building destruction over time. A particularly fruitful direction for future research could be to model the data-generating process of what we call the “uncertain class” between changing labels and after the last label date. This should then be combined with label smoothing to generate probabilistic labels (33). Such a holistic approach would also need to think about label priors regarding the reconstruction process. Future applications of such an approach would then be able to augment the human-classification process of verifying violence—so-called digital humanitarians (34)—and track the postwar recovery within the same classifier model.

The destruction data that can be generated with monitoring approaches such as the one presented in this article open up possibilities for a set of new research agendas in the social sciences (35). For example, our approach may advance the academic literature on understanding the microlevel determinants of violence (36–42). At what stage in a conflict is building destruction used? What can be done to reduce civilian fatalities during urban warfare? What are the effects of building destruction on displacement compared to other kinds of violence such as small firearms? Can reporting-based violence data be used to reduce error in the remote-sensing exercise, or can combined measures be developed (43, 44)? Can destruction data be used to reveal biases in reporting-based measures? An additional potential application of our method is conflict-forecasting systems, like the "Violence Early-Warning System," which rely on spatial violence dynamics in their forecasts (45).

Finally, there are important ethical concerns in war-destruction monitoring that should be considered. Research in the social sciences has shown that monitoring tends to reduce armed violence between states, but there are also examples where the opposite is true (46, 47). Theoretically, we can identify specific scenarios in which monitoring worsens the situation on the ground. If local actors use the flow of information about atrocities to displace populations and do not fear repercussions linked to the monitoring of these atrocities, then monitoring itself can increase violence and should, therefore, not be conducted publicly.

Materials and Methods

Analysis.

The CNN model was built and trained in TensorFlow. Analysis was performed in QGIS, Python, R, and Stata.


Acknowledgments

We thank Eli Berman, Joshua Blumenstock, Mathieu Couttenier, Joan Maria Esteban, Clément Gorin, Edward Miguel, Sebastian Schütte, and Jacob Shapiro for useful comments and discussions. We are grateful to Bruno Conte Leite, Jordi Llorens, Parsa Hassani, Dennis Hutschenreiter, Shima Nabiee, and Lavinia Piemontese for excellent research assistance. We are particularly grateful to Javier Mas for his research assistance, which produced the coding backbone to this project. We thank seminar participants at the Applied Machine Learning, Economics, and Data Science, University of California Berkeley, Empirical Studies of Conflict Project Annual Meeting, AI for Development conference by the Center for Effective Global Action and the World Bank Development Impact Evaluation Group, University of Bozen/Bolzano, International Institute of Social Studies of Erasmus University, University of Economics Ho Chi Minh City, Lyon University, Trinity College Dublin, Barcelona Graduate School of Economics, Institute for Economic Analysis of the Spanish Council for Scientific Research, Universitat de Barcelona, Berlin Network of Labor Market Research Winter Workshop, PREVIEW workshop at the German foreign office, and Violence Early-Warning System workshop in Uppsala. A.G. and H.M. were supported by "La Caixa" Foundation Project Grant CG-2017-04, title: "Analysing Conflict from Space"; and by Spanish Ministry of Science and Innovation, through the Severo Ochoa Program for Centers of Excellence in R&D Grant CEX2019-000915-S. H.M. was supported by Spanish Ministry of Science, Innovation and Universities Grant PGC-096133-B-100. A.G. was also supported by Spanish Ministry of Science, Innovation and Universities Grant PGC2018-094364-B-100. J.H. and A.M. were supported by the Chapman University Faculty Opportunity Fund. A.M. was supported by the Smith Institute of Political Economy and Philosophy at Chapman University. Any remaining errors are our own.

Footnotes

The authors declare no competing interest.

This article is a PNAS Direct Submission.

*For a glossary of technical key terminology, see SI Appendix.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2025400118/-/DCSupplemental.

Materials and Data Availability

Detailed explanations of the methods in this study are provided in SI Appendix. All main code is available at GitHub (https://github.com/monitoring-war-destruction) (48). The repository provides all programming code for image preprocessing and label augmentation, as well as first- and second-stage training and testing. All data are provided in the repository, except for the satellite imagery, which cannot be provided due to copyright restrictions.

References

1. Witmer F., Remote sensing of violent conflict: Eyes from above. Int. J. Rem. Sens. 36, 2326–2352 (2015).
2. Gueguen L., Hamid R., "Large-scale damage detection using satellite imagery" in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, New York, NY, 2015), pp. 1321–1328.
3. Kahraman F., Imamoglu M., Ates H. F., "Battle damage assessment based on self-similarity and contextual modeling of buildings in dense urban areas" in 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) (IEEE, New York, NY, 2016), pp. 5161–5164.
4. LeCun Y., Bengio Y., Hinton G., Deep learning. Nature 521, 436–444 (2015).
5. Jean N., et al., Combining satellite imagery and machine learning to predict poverty. Science 353, 790–794 (2016).
6. Engstrom R., Newhouse D., Hersh J., Poverty from space: Using high resolution satellite imagery for estimating economic well-being and geographic targeting (World Bank Policy Research Working Paper 8284, World Bank, Washington, DC, 2017).
7. Yeh C., et al., Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nat. Commun. 11, 1–11 (2020).
8. UNITAR Operational Satellite Applications Programme, Maps and Data. https://www.unitar.org/maps. Accessed 19 May 2021.
9. World Bank, The toll of war: The economic and social consequences of the conflict in Syria (Tech. Rep., World Bank, Washington, DC, 2017).
10. Amnesty International, Strike tracker: Decode how US-led bombing destroyed Raqqa, Syria (2020). https://decoders.amnesty.org/projects/strike-tracker. Accessed 19 May 2021.
11. Gleditsch N. P., Wallensteen P., Eriksson M., Sollenberg M., Strand H., Armed conflict 1946-2001: A new dataset. J. Peace Res. 39, 615–637 (2002).
12. Raleigh C., Linke A., Hegre H., Karlsen J., Introducing ACLED: An armed conflict location and event dataset: Special data feature. J. Peace Res. 47, 651–660 (2010).
13. Sarkees M. R., Wayman F., Resort to War: 1816-2007 (CQ Press, Washington, DC, 2010).
14. Sundberg R., Melander E., Introducing the UCDP georeferenced event dataset. J. Peace Res. 50, 523–532 (2013).
15. Price M., Gohdes A., Ball P., Documents of war: Understanding the Syrian conflict. Significance 12, 14–19 (2015).
16. Weidmann N. B., A closer look at reporting bias in conflict event data. Am. J. Polit. Sci. 60, 206–218 (2016).
17. Pettersson T., Öberg M., Organized violence, 1989–2019. J. Peace Res. 57, 597–613 (2020).
18. He K., Zhang X., Ren S., Sun J., "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification" in 2015 IEEE International Conference on Computer Vision (ICCV) (IEEE, New York, NY, 2015), pp. 1026–1034.
19. Simonyan K., Zisserman A., Very deep convolutional networks for large-scale image recognition. arXiv [Preprint] (2015). https://arxiv.org/abs/1409.1556. Accessed 19 May 2021.
20. Cooner A. J., Shao Y., Campbell J. B., Detection of urban damage using remote sensing and machine learning algorithms: Revisiting the 2010 Haiti earthquake. Rem. Sens. 8, 868 (2016).
21. Gueguen L., Hamid R., Toward a generalizable image representation for large-scale change detection: Application to generic damage analysis. IEEE Trans. Geosci. Rem. Sens. 54, 3378–3387 (2016).
22. Kahraman F., Imamoglu M., Ates H. F., Disaster damage assessment of buildings using adaptive self-similarity descriptor. Geosci. Rem. Sens. Lett. IEEE 13, 1188–1192 (2016).
23. Yuan J., Automatic building extraction in aerial scenes using convolutional networks. arXiv [Preprint] (2016). https://arxiv.org/abs/1602.06564. Accessed 19 May 2021.
24. Attari N., Ofli F., Awad M., Lucas J., Chawla S., "Nazr-CNN: Fine-grained classification of UAV imagery for damage assessment" in 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (IEEE, New York, NY, 2017), pp. 50–59.
25. Fujita A., et al., "Damage detection from aerial images via convolutional neural networks" in 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA) (IEEE, New York, NY, 2017), pp. 5–8.
26. Nex F., Duarte D., Tonolo F. G., Kerle N., Building damage detection with deep learning: Assessment of a state-of-the-art CNN in operational conditions. Rem. Sens. 11, 2765 (2019).
27. Sun B., Saenko K., "Deep CORAL: Correlation alignment for deep domain adaptation" in ECCV 2016 Workshops: European Conference on Computer Vision, G. Hua, H. Jégou, Eds. (Lecture Notes in Computer Science, Springer, Cham, Switzerland, 2016), vol. 9915, pp. 443–450.
28. Google Earth. https://www.google.com/earth. Accessed 7 September 2020.
29. Live Universal Awareness Map Syria. https://syria.liveuamap.com/. Accessed 19 May 2021.
30. Colaresi M., Mahmood Z., Do the robot: Lessons from machine learning to improve conflict forecasting. J. Peace Res. 54, 193 (2017).
31. Gupta R., et al., "Creating xBD: A dataset for assessing building damage from satellite imagery" in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. https://openaccess.thecvf.com/content_CVPRW_2019/papers/cv4gc/Gupta_Creating_xBD_A_Dataset_for_Assessing_Building_Damage_from_Satellite_CVPRW_2019_paper.pdf. Accessed 19 May 2021.
32. Csurka G., "Domain adaptation for visual applications: A comprehensive survey" in Advances in Computer Vision and Pattern Recognition, G. Csurka, Ed. (Springer, Cham, Switzerland, 2017), pp. 1–35.
33. Müller R., Kornblith S., Hinton G., When does label smoothing help? arXiv [Preprint] (2020). https://arxiv.org/abs/1906.02629. Accessed 19 May 2021.
34. Meier P., Digital Humanitarians: How Big Data Is Changing the Face of Humanitarian Response (Routledge, London, UK, 2015).
35. Gleditsch K. S., Metternich N. W., Ruggeri A., Data and progress in peace and conflict research. J. Peace Res. 51, 301–314 (2014).
36. Besley T., Mueller H., Estimating the peace dividend: The impact of violence on house prices in Northern Ireland. Am. Econ. Rev. 102, 810–833 (2012).
37. Dube O., Vargas J. F., Commodity price shocks and civil conflict: Evidence from Colombia. Rev. Econ. Stud. 80, 1384–1421 (2013).
38. Burke M., Hsiang S. M., Miguel E., Climate and conflict. Annu. Rev. Econ. 7, 577–617 (2015).
39. Michalopoulos S., Papaioannou E., The long-run effects of the scramble for Africa. Am. Econ. Rev. 106, 1802–1848 (2016).
40. Novta N., Ethnic diversity and the spread of civil war. J. Eur. Econ. Assoc. 14, 1074–1100 (2016).
41. Berman N., Couttenier M., Rohner D., Thoenig M., This mine is mine! How minerals fuel conflicts in Africa. Am. Econ. Rev. 107, 1564–1610 (2017).
42. Manacorda M., Tesei A., Liberation technology: Mobile phones and political mobilization in Africa. Econometrica 88, 533–567 (2020).
43. Henderson J. V., Storeygard A., Weil D. N., Measuring economic growth from outer space. Am. Econ. Rev. 102, 994–1028 (2012).
44. Lum K., Price M. E., Banks D., Applications of multiple systems estimation in human rights research. Am. Statistician 67, 191–200 (2013).
45. Hegre H., et al., ViEWS: A political violence early-warning system. J. Peace Res. 56, 155–174 (2019).
46. Gordon G., Violence and intervention. PhD thesis, Columbia University, New York, NY (2016).
47. Early B. R., Gartzke E., Spying from space: Reconnaissance satellites and interstate disputes. J. Conflict Resolut., 10.1177/0022002721995894 (2021).
48. Mueller H., Groeger A., Hersh J., Matranga A., Serrat J., Monitoring war destruction. GitHub. https://github.com/monitoring-war-destruction. Deposited 19 May 2021.
