Sensors (Basel, Switzerland). 2026 Feb 11;26(4):1176. doi: 10.3390/s26041176

A Cross-Layer Feature Fusion Framework with Hierarchical Interaction for Remote Sensing Change Detection

Xin Meng 1, Chuanbiao Qiu 1, Chong Liu 2, Yanli Xu 1,*
Editor: Paul Krause
PMCID: PMC12944698  PMID: 41755117

Abstract

The rapid progress of remote sensing (RS) and computer vision has greatly advanced change detection (CD), and hybrid architectures combining Transformers and convolutional neural networks (CNNs) have shown strong potential in recent years. Nevertheless, reliable CD for very high-resolution (VHR) imagery remains challenging due to large appearance variations across acquisition times, complex background clutter, and target structural diversity. These factors often hinder the modeling of fine edge textures, the maintenance of feature continuity, and the suppression of false changes caused by illumination fluctuations. To address these issues, this paper proposes a Cross-layer Feature Fusion Framework (CLFF) that achieves more accurate and stable change detection by explicitly enhancing the collaborative fusion capability of multi-layer features. The core component of this framework is the Multi-level Interaction Perception Block (MP-Block), which organizes effective interactions among features of different semantic levels. Based on the embedded Multi-branch Interaction Fusion Mechanism (MIFM), the MP-Block accomplishes collaborative refinement and reorganization of cross-layer features through two parallel paths for feature reconstruction and recalibration: the Response-aware Feature Reconstruction Branch (RFRB) and Adaptive Channel Group Fusion Branch (ACGF). Additionally, a lightweight position-aware attention module is introduced to adaptively modulate spatial responses, further suppressing background interference and highlighting key information related to changes. This method effectively mitigates the limitations of traditional CNNs, such as limited receptive fields and insufficient multi-layer feature interaction, while significantly enhancing the ability to collaboratively model multi-layer contextual information. 
To verify its effectiveness, systematic experiments were conducted on four widely used change detection benchmark datasets: LEVIR, WHU, SYSU and HRCUS. The results show that, compared to corresponding baseline models, CLFF achieves performance improvements of 1.35%, 2.78%, 3.54% and 4.85% in the IoU metric, respectively.

Keywords: remote sensing, change detection, multi-scale fusion, attention mechanism, very high-resolution

1. Introduction

Driven by both natural processes and anthropogenic activities, the Earth’s surface has exhibited increasingly dynamic characteristics in recent years, making time-series analysis of RS data an indispensable component of modern Earth observation systems. In parallel, RS interpretation techniques have rapidly advanced, ranging from polarimetric scattering analysis [1,2,3] to high-level semantic extraction from optical imagery. Among these developments, CD has emerged as a fundamental task that identifies surface transitions through the comparison of multitemporal images [4]. With the growing availability of VHR imagery, CD has been widely applied in urban planning [5], land-use assessment [6], and disaster evaluation [7].

VHR remote sensing imagery provides abundant spatial and structural details for change analysis. In practice, however, effectively exploiting such information remains challenging. Complex scenes often exhibit diverse appearance variations that are unrelated to actual changes, leading to significant visual ambiguity and making it difficult to distinguish genuine structural modifications from superficial surface differences. As illustrated in Figure 1, common sources of such appearance-induced pseudo-changes include cast shadows, seasonal vegetation growth, transient objects (e.g., vehicles or ships), and surface discolorations. These factors frequently trigger false alarms and compromise boundary integrity in change detection results. While such appearance variations are often prominent in shallow feature layers, they lack consistent semantic meaning. This discrepancy highlights the necessity of explicitly coordinating semantic and structural information across feature hierarchies. In addition to the inherent ambiguity of VHR imagery, the large data volume and hardware constraints further necessitate careful trade-offs between detection accuracy and computational efficiency in real-world systems [8].

Figure 1.

Examples of appearance variations irrelevant to actual changes that pose challenges for change detection in very high-resolution remote sensing imagery. From top to bottom, the figure shows bi-temporal images (Time 1 and Time 2) and the corresponding ground truth. As highlighted by the red boxes, common sources of pseudo-changes include (1) shadow-induced variations around buildings, (2) complex land cover transitions, (3) temporary objects (e.g., ships in port areas), and (4) surface appearance changes (e.g., roof color variations) without actual structural modification. As a result, these variations often trigger false detections and compromise object boundary integrity. These examples illustrate that appearance-induced variations may be prominent at shallow feature levels while lacking consistent semantic meaning, underscoring the necessity of coordinated semantic–structural modeling across feature hierarchies.

Early works on remote sensing change detection were primarily conducted at the pixel level due to computational efficiency constraints [9]. In this context, transformation- and multi-resolution-based analyses—such as PCA [10], wavelet transforms [11], and CVA [12]—were widely adopted. Pixel-level classifiers, such as SVM [13] and RF [14], have also been employed, often in combination with handcrafted texture features such as GLCM [15]. While effective in homogeneous scenes, these methods depend on manual feature engineering and lack robustness in complex urban settings. From a task formulation perspective, CD differs fundamentally from related remote sensing tasks such as object detection and hyperspectral unmixing. While object detection focuses on localizing targets in single static images [16] and hyperspectral unmixing aims to separate mixed spectral signatures [17], CD requires explicit bi-temporal modeling to distinguish genuine semantic changes from variations caused solely by appearance differences.

To alleviate these limitations, Transformer-based architectures have recently been introduced into remote sensing change detection to enhance global context modeling. Through the self-attention mechanism, Transformers are able to capture long-range dependencies and explicitly model global structural relationships between bi-temporal images. Representative approaches, such as the Bi-temporal Image Transformer (BIT) [18], hierarchical Transformer-based Siamese networks [19], and hybrid CNN–Transformer frameworks with multi-scale token aggregation [20], demonstrate the advantages of integrating local texture modeling with global context perception. Nevertheless, Transformer-based methods generally incur high computational and memory costs, and patch-wise tokenization may compromise the preservation of fine-grained spatial details, thereby limiting their applicability to very high-resolution imagery and resource-constrained scenarios.

Beyond global context modeling, multi-scale feature fusion has been widely adopted to integrate fine-grained spatial details from shallow layers with high-level semantic information from deep layers, thereby improving robustness against complex background interference. Representative studies enhance change perception through cascading multi-scale features with difference enhancement [21], injecting shallow details into deep semantic representations for boundary refinement [22], or explicitly modeling bitemporal spatial relationships to alleviate misalignment-induced inconsistencies [23]. More recently, hierarchical and interaction-aware fusion strategies have attracted increasing attention, including the lightweight interlayer correlation enhancement design proposed by Xiao et al. [24], as well as representative CNN-based fusion frameworks such as IFNet [25] and CGNet [26]. However, many existing fusion methods rely on fixed or loosely coordinated aggregation mechanisms, which may amplify redundant responses and compromise the boundary integrity of fused feature representations, particularly when processing fine-grained or thin-structure changes.

Motivated by the above analysis, we identify that a key bottleneck in VHR change detection lies in aligning high-level semantics with low-level structural details under complex backgrounds. Although prior CNN/Transformer and fusion-based methods improve local textures, global context, or multi-scale aggregation, the interaction between deep and shallow features is often insufficiently coordinated, leading to redundant activations and blurred or fragmented boundaries for fine-grained changes. To this end, we propose the CLFF framework, which explicitly organizes cross-layer feature interaction and couples it with coordinated semantic–spatial refinement to mitigate the semantic–structural mismatch and enhance multi-scale change representation.

The main contributions of this work are summarized as follows:

  • We propose a novel CLFF that explicitly organizes cross-layer interaction and hierarchical refinement to address the semantic–structural mismatch between deep and shallow features in remote sensing change detection.

  • We design an MP-Block to progressively integrate hierarchical features and facilitate effective information flow across adjacent semantic levels.

  • We develop an MIFM as the core fusion backbone, which is composed of the RFRB and ACGF units to jointly perform response-aware feature refinement and adaptive channel-wise recalibration.

  • Extensive experiments on multiple benchmark datasets demonstrate the effectiveness and robustness of the proposed method under diverse and challenging scenarios.

2. Materials and Methods

2.1. Overall Framework

Remote sensing change detection aims to accurately identify pixel-level changes between bi-temporal images and can be regarded as a specialized form of semantic segmentation. However, complex backgrounds, small object scales, and appearance variations caused by illumination and seasonal factors often lead to unstable change representations, making reliable change identification challenging. To address these challenges, we design CLFF to enable explicit interactions across different semantic levels, as illustrated in Figure 2.

Figure 2.

Overall architecture of the proposed CLFF. (a) The overall pipeline of the framework with a Siamese VGG16-based feature extraction backbone and progressive cross-layer fusion. (b) The BFAB employed as an auxiliary component to enhance spatial consistency before feature fusion. (c) The proposed MIFM for progressive cross-layer feature integration. Red arrows indicate the upsampling operations, while other arrows represent the forward feature flow.

The CLFF framework consists of four components: a Siamese VGG16 encoder, a bitemporal feature alignment module (BFAB), an MP-Block with multi-level interaction perception, and a lightweight decoder. Together, these modules facilitate the extraction and fusion of change-relevant information across multiple depths and spatial scales, thereby improving representation stability and detection accuracy.

Specifically, CLFF adopts a Siamese VGG16 encoder with shared weights to extract hierarchical feature representations from pre- and post-change images. Features from multiple stages are retained to preserve fine-grained spatial details at shallow layers while progressively encoding higher-level semantic information at deeper layers. However, slight misregistration and geometric inconsistencies between bi-temporal inputs may still exist in practice. To alleviate this issue, a lightweight BFAB is introduced prior to feature fusion to enhance spatial consistency between corresponding features, as illustrated in Figure 2b.

The aligned multi-level features are then fed into the proposed MP-Block for progressive cross-layer fusion and coordinated refinement. Within the MP-Block, adjacent semantic features explicitly interact and are enhanced to achieve joint modeling of semantic consistency and structural continuity. Finally, the refined features are processed by a lightweight decoder through convolutional operations and bilinear upsampling to recover spatial resolution and generate the pixel-wise change map. During decoding, high-level features are progressively upsampled and fused into finer-scale representations in a top-down manner, yielding the final fine-scale fused feature for change map prediction.

In summary, CLFF uses a Siamese encoder with shared weights to obtain hierarchical feature representations of the pre- and post-change images, ensuring that both inputs are processed by an identical feature extractor. In our implementation, VGG16 serves as the backbone of the feature extraction network in order to capture multi-scale spatial and semantic information at different stages. Specifically, the encoder generates a set of feature maps with successively lower spatial resolution but higher semantic abstraction: the shallow maps retain abundant structural information, while the deeper maps carry higher-level semantic cues. These multi-level features are then fed into the subsequent bi-temporal feature alignment and cross-layer fusion modules, serving as the basis for cross-layer interaction and cooperative refinement.

2.2. Multi-Level Interaction Perception Block (MP-Block)

The MP-Block serves as a core building block of the proposed CLFF framework and is designed to organize multi-level feature interaction and fusion within a unified architecture. Unlike traditional designs that treat feature fusion and refinement as separate processes, the MP-Block introduces a hierarchical architecture to jointly account for semantic consistency and structural continuity. Within this hierarchy, the MIFM acts as the primary engine for cross-layer interaction, while the Position-Aware Module (PAM) [27] provides complementary global enhancement.

As shown in Figure 2c, the MIFM constitutes the main fusion backbone of the MP-Block. Explicit interaction paths are set at various levels so that information is gradually conveyed from high-level semantic representations to low-level structural features. This design preserves semantic distinctiveness and spatial alignment. Building upon this foundation, the PAM [27] is incorporated as an auxiliary refinement component. Rather than functioning as an independent feature extractor, the PAM applies global spatial re-weighting after inter-layer feature fusion, enabling the network to suppress appearance-induced pseudo-changes (e.g., illumination variations) and further enhance object boundary delineation.

With the overall framework of the MP-Block defined, we next elaborate on the design of its core fusion architecture, i.e., the MIFM.

2.2.1. Multi-Branch Inter-Layer Fusion Module (MIFM)

The MP-Block is centered around the MIFM, which performs stepwise fusion of aligned features across adjacent semantic levels. At each fusion stage, the MIFM takes a high-level feature and its adjacent low-level feature as inputs and projects them into a unified feature space to obtain an initial fused representation, which serves as the basis for subsequent refinement.

On this basis, the MIFM is internally equipped with two complementary refinement branches, namely the RFRB and the ACGF unit. The RFRB is designed to enhance structural details and local contextual information, whereas the ACGF unit focuses on adaptively recalibrating channel-wise responses to suppress redundant information and emphasize change-relevant features.

The outputs of these two internal branches are subsequently aggregated to generate a refined fused feature at the current stage, which is then propagated to subsequent fusion stages and the final decoder. The detailed designs of the RFRB and the ACGF unit are presented in the following subsections.

2.2.2. Adaptive Channel-Group Fusion (ACGF) Unit

To alleviate semantic inconsistency and feature redundancy during inter-layer feature fusion, we employ the ACGF unit as a key component of the proposed MIFM. As illustrated in Figure 3, ACGF refines adjacent multi-level features by jointly enhancing spatial context and channel-wise responses in a lightweight and structured manner.

Figure 3.

Structure of the ACGF unit. Adjacent multi-level features $F_{i-1}$ and $F_i$ are first spatially aligned and projected to obtain the base feature $F^{+}$. The upper branch performs spatial refinement via a grouped softmax gating (GSG) mechanism to enhance spatially informative responses, producing $F_s$. The lower branch conducts channel refinement through channel attention to reweight $F^{+}$ and generate $F_f$. The final fused feature $F_A$ is obtained by combining the outputs of the two branches. The right part shows details of the GSG block, where the input feature channels are evenly divided into $G=4$ groups (each with $C_g = C/G$ channels) and each group is processed independently by a softmax-based gating operation before concatenation. In the diagram, ⊗ denotes element-wise multiplication, ⊕ represents element-wise summation, and ‘cat’ indicates channel-wise concatenation.

Given the two adjacent feature maps $F_{i-1}$ and $F_i$ with different spatial resolutions, $F_i$ is first upsampled to match the resolution of $F_{i-1}$, and both features are then projected into a unified channel space through 1×1 convolutions and summed element-wise to obtain a base feature representation,

$$\tilde{F}_i = U(F_i), \qquad F^{+} = \mathrm{Conv}_{1\times 1}(F_{i-1}) + \mathrm{Conv}_{1\times 1}(\tilde{F}_i), \tag{1}$$

where $U(\cdot)$ denotes bilinear upsampling.
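The construction of the base feature in Equation (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the batch dimension is dropped, nearest-neighbour repetition stands in for the bilinear upsampling used in the paper, the scale factor between adjacent levels is assumed to be 2, and the function names, channel counts, and random weights are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """Point-wise convolution: mixes channels at every pixel.
    x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def upsample2x_nearest(x):
    """Stand-in for U(.): repeats each pixel twice along H and W.
    (The paper uses bilinear upsampling; nearest-neighbour keeps the sketch short.)"""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def base_feature(f_low, f_high, w_low, w_high):
    """Base feature of Eq. (1): project both levels to a shared channel space and add."""
    f_high_up = upsample2x_nearest(f_high)
    return conv1x1(f_low, w_low) + conv1x1(f_high_up, w_high)

rng = np.random.default_rng(0)
f_low = rng.standard_normal((64, 32, 32))    # shallow feature (assumed shape)
f_high = rng.standard_normal((128, 16, 16))  # deep feature (assumed shape)
w_low = rng.standard_normal((64, 64)) * 0.1   # hypothetical projection weights
w_high = rng.standard_normal((64, 128)) * 0.1
f_plus = base_feature(f_low, f_high, w_low, w_high)
print(f_plus.shape)  # (64, 32, 32)
```

The two 1×1 projections are what allow features with different channel counts to be summed; the output keeps the spatial size of the shallower input.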

Based on $F^{+}$, ACGF adopts two parallel refinement branches with complementary roles. As shown in Figure 3, the upper branch focuses on spatial refinement. A GSG mechanism is applied within channel groups to model localized spatial dependencies and emphasize informative responses. Specifically, the base feature $F^{+}$ is evenly divided along the channel dimension into $G$ groups, and the $g$-th group feature is denoted as $X_g \in \mathbb{R}^{N \times C_g \times H \times W}$. For each group, a 1×1 convolution is applied to obtain the interaction response:

$$Z_g = \mathrm{Conv}_{1\times 1}(X_g). \tag{2}$$

The response $Z_g$ is then reshaped and normalized to generate a softmax-based gating weight map:

$$z_g = \frac{\mathrm{vec}(Z_g)}{\mathrm{mean}(\mathrm{vec}(Z_g)) + \varepsilon}, \qquad P_g = \mathrm{reshape}(\mathrm{Softmax}(z_g)), \tag{3}$$

where $\mathrm{vec}(\cdot)$ denotes vectorization and $\varepsilon$ is a small constant for numerical stability.

The gated feature is obtained by element-wise multiplication:

$$Y_g = X_g \otimes P_g. \tag{4}$$

The outputs from all groups are concatenated and combined with a residual connection from $F^{+}$, followed by normalization, resulting in the spatially refined feature $F_s$.
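The grouped softmax gating of Equations (2) to (4) can be illustrated with the following NumPy sketch. Several details are assumptions rather than the authors' implementation: the batch dimension is dropped, each 1×1 convolution maps a group's channels to the same number of channels, the softmax is taken over the flattened spatial positions of each channel, and the residual connection and normalization that follow the concatenation are omitted. The function names and toy inputs are hypothetical.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_softmax_gating(f_plus, group_weights, eps=1e-6):
    """f_plus: (C, H, W); group_weights: list of G arrays, each (C_g, C_g)."""
    groups = np.split(f_plus, len(group_weights), axis=0)   # G groups of C_g channels
    gated = []
    for x_g, w_g in zip(groups, group_weights):
        z_g = np.einsum("oc,chw->ohw", w_g, x_g)             # Eq. (2): 1x1 convolution
        c_g, h, w = z_g.shape
        z = z_g.reshape(c_g, h * w)                          # vec(.) per channel
        z = z / (z.mean(axis=1, keepdims=True) + eps)        # Eq. (3): mean normalization
        p_g = softmax(z, axis=1).reshape(c_g, h, w)          # Eq. (3): gating map
        gated.append(x_g * p_g)                              # Eq. (4): element-wise gating
    return np.concatenate(gated, axis=0)                     # 'cat' over groups

rng = np.random.default_rng(0)
f_plus = rng.random((8, 4, 4)) + 0.5       # positive toy feature with C = 8
weights = [np.eye(2) for _ in range(4)]    # G = 4 groups, C_g = 2 (identity as a placeholder)
y = grouped_softmax_gating(f_plus, weights)
print(y.shape)  # (8, 4, 4)
```

Because the normalization divides by the mean response, this sketch is only well behaved when that mean stays away from zero; the toy input is kept positive for that reason.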

The lower branch performs channel refinement. Specifically, following global average pooling on $F^{+}$, an efficient channel attention module based on one-dimensional convolution, similar to ECA-Net [28], is employed to capture local cross-channel interactions and adaptively reweight channel responses, producing the channel-refined feature $F_f$.
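As a rough illustration of this channel branch, the sketch below follows the general ECA-style recipe: global average pooling, a one-dimensional convolution across neighbouring channels, a sigmoid, and channel-wise reweighting. The kernel length of 3, the kernel values, the sigmoid gate, and the function name are assumptions made for illustration; the paper only states that the module is similar to ECA-Net.

```python
import numpy as np

def eca_channel_attention(f_plus, kernel):
    """f_plus: (C, H, W); kernel: odd-length 1-D filter shared by all channels."""
    pooled = f_plus.mean(axis=(1, 2))                     # global average pooling -> (C,)
    padded = np.pad(pooled, len(kernel) // 2)             # zero padding keeps length C
    mixed = np.correlate(padded, kernel, mode="valid")    # 1-D conv over neighbouring channels
    weights = 1.0 / (1.0 + np.exp(-mixed))                # sigmoid gate in (0, 1)
    return f_plus * weights[:, None, None]                # reweight each channel

rng = np.random.default_rng(0)
f_plus = rng.standard_normal((8, 4, 4))
kernel = np.array([0.25, 0.5, 0.25])   # hypothetical kernel of size 3
f_f = eca_channel_attention(f_plus, kernel)
print(f_f.shape)  # (8, 4, 4)
```

The appeal of this design is its cost: the gate needs only as many parameters as the kernel has taps, independent of the number of channels.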

Finally, the outputs of the spatial and channel refinement branches are fused via element-wise addition to obtain the output feature of ACGF:

$$F_A = F_s + F_f. \tag{5}$$

With this design, ACGF balances spatial discrimination and channel selectivity, enabling effective and robust inter-layer feature fusion.

2.2.3. Response-Aware Feature Refinement Block (RFRB)

The RFRB is introduced as a complementary branch of the MIFM to enhance fine-grained change representations and improve the robustness of inter-layer feature fusion. Instead of explicitly reconstructing features, RFRB performs response-aware refinement by selectively emphasizing informative feature components and enhancing weak responses, which is particularly beneficial for preserving subtle changes such as edges, textures, and small structural variations.

As illustrated in Figure 4, RFRB operates on adjacent feature maps from different semantic levels. Following the same feature alignment strategy as in ACGF, the higher-level feature is first upsampled to match the spatial resolution of the lower-level feature and then aggregated to form a base representation $F^{+}$. Based on $F^{+}$, RFRB derives adaptive channel-wise responses to guide subsequent feature refinement.

Figure 4.

Details of the RFRB. The block refines inter-layer features by separating feature components with different response strengths and processing them through complementary refinement paths. Strong-response features are refined using lightweight point-wise convolution to preserve discriminative semantic information, while weak-response features are selectively enhanced via depthwise separable convolution and channel-wise gating to strengthen local representations. The right part shows the structure of the Channel-wise Weighting and Grouping (CWG) unit used in the weak-response branch, which generates adaptive channel-wise weights via global average pooling and point-wise convolution to modulate weak feature responses. In the diagram, ⊗ denotes element-wise multiplication and ⊕ represents element-wise summation.

To obtain channel-wise importance cues, global average pooling followed by a Sigmoid activation is applied to $F^{+}$, yielding a channel-wise response vector:

$$U^{+} = \mathrm{Sigmoid}(\mathrm{GAP}(F^{+})), \tag{6}$$

where $\mathrm{GAP}(\cdot)$ denotes global average pooling along the spatial dimensions and $U^{+} \in \mathbb{R}^{C \times 1 \times 1}$ represents the channel-wise response strength used to guide subsequent response-aware refinement.

In addition, channel response maps $\omega_i$ and $\omega_{i-1}$ are obtained from the aligned inter-layer features through batch normalization and Sigmoid activation, characterizing their relative activation strengths. Guided by these responses, feature components are implicitly separated into strong-response and weak-response parts and refined using different processing paths.

Highly responsive feature components are regarded as strong responses and refined using lightweight point-wise convolution to preserve their discriminative semantic information. In contrast, weakly activated components are selectively enhanced using depthwise separable convolution to strengthen local representations while reducing computational overhead. To further regulate weak feature responses, a lightweight channel-wise gating mechanism is introduced. Specifically, the refinement process of weak features can be formulated as follows:

$$F_w = \mathrm{Conv}_{1\times 1}(\mathrm{DWConv}(F_w^{in})) \otimes \mathrm{Softmax}(\mathrm{ReLU}(\mathrm{Conv}_{1\times 1}(\mathrm{GAP}(F_w^{in})))), \tag{7}$$

Here, $F_w^{in}$ denotes the feature components exhibiting relatively weak responses and selected for refinement. The operator $\mathrm{DWConv}(\cdot)$ applies depthwise separable convolution to provide lightweight local enhancement, while the Softmax-based gating term assigns channel-wise weights to emphasize informative weak features.
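The weak-response refinement of Equation (7) can be sketched as a local-enhancement path multiplied by a channel gate. The sketch below is a minimal NumPy illustration under assumptions that are not confirmed by the paper: the depthwise step uses a 3×3 kernel with zero padding, the point-wise convolution that follows completes the depthwise separable convolution, the softmax runs over the channel axis, and the batch dimension is dropped. All names and weights are hypothetical.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 filtering with zero padding. x: (C, H, W), kernels: (C, 3, 3)."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += kernels[:, i, j][:, None, None] * padded[:, i:i + h, j:j + w]
    return out

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def refine_weak(f_in, dw_kernels, pw_weight, gate_weight):
    """Local enhancement path gated by channel-wise weights, in the spirit of Eq. (7)."""
    local = np.einsum("oc,chw->ohw", pw_weight, depthwise_conv3x3(f_in, dw_kernels))
    pooled = f_in.mean(axis=(1, 2))                          # GAP -> (C,)
    gate = softmax(np.maximum(gate_weight @ pooled, 0.0))    # Conv1x1 -> ReLU -> Softmax
    return local * gate[:, None, None]

rng = np.random.default_rng(0)
c = 4
f_in = rng.standard_normal((c, 6, 6))
dw = rng.standard_normal((c, 3, 3)) * 0.1     # hypothetical depthwise kernels
pw = rng.standard_normal((c, c)) * 0.1        # hypothetical point-wise weights
gw = rng.standard_normal((c, c)) * 0.1        # hypothetical gating weights
f_w = refine_weak(f_in, dw, pw, gw)
print(f_w.shape)  # (4, 6, 6)
```

Because the gate is a softmax, its channel weights sum to one, so it redistributes emphasis across channels rather than scaling all of them up together.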

The refined strong- and weak-response features are then combined via element-wise addition, yielding the output of RFRB:

$$F_R = F_s + F_w, \tag{8}$$

where $F_s$ here denotes the refined strong-response feature.

By combining adaptive channel-wise weighting with response-aware refinement, the RFRB strengthens fine-grained change cues and improves multi-scale consistency while adding only minimal computational overhead.

2.2.4. Prediction Head and Change Map Generation

The final fused feature $F_{\mathrm{final}}$ has the highest spatial resolution among all feature representations after fusion and refinement, retaining multi-level change information while preserving fine-grained spatial details. A lightweight prediction head followed by upsampling to the input resolution is then applied to perform pixel-wise change inference:

$$P = \mathrm{Sigmoid}(\mathrm{Up}(\mathrm{Head}(F_{\mathrm{final}}))), \tag{9}$$

where $\mathrm{Head}(\cdot)$ denotes the prediction head implemented as a final 1×1 convolution layer, $\mathrm{Up}(\cdot)$ represents bilinear upsampling, and applying the Sigmoid function yields the final change probability map $P$.

3. Experimental Setup

A clarification regarding the definition of the term baseline is necessary, as it is used in two different contexts in this work. In the context of ablation studies, the baseline refers to a VGG16-BN backbone in which the proposed MP-Blocks are removed and replaced with standard 3×3 convolutional layers. This configuration serves as a controlled reference to isolate the performance gains introduced by the proposed modules. In contrast, in quantitative benchmark comparisons, the term refers to external state-of-the-art methods against which CLFF is evaluated.

3.1. Datasets

To rigorously demonstrate the effectiveness of the proposed CLFF framework, particularly its capability to coordinate semantic and structural information, we employ four widely used public benchmark datasets from the literature, namely LEVIR-CD [29], WHU-CD [30], SYSU-CD [31], and HRCUS-CD [32]. These datasets are selected for their complementary characteristics in terms of spatial resolution, scene composition, and change complexity, thereby enabling a comprehensive evaluation of high-resolution remote sensing change detection performance across a wide range of urban scenarios.

Specifically, LEVIR-CD focuses on building-related changes with varying object scales under relatively structured urban layouts, making it suitable for evaluating scale-sensitive change detection performance. WHU-CD consists of very high-resolution aerial imagery with dense urban scenes and complex building boundaries, posing challenges for fine-grained structural delineation under varying illumination and occlusion conditions. In contrast, SYSU-CD and HRCUS-CD represent more complex urban environments with a broader range of change categories and higher scene heterogeneity, thereby emphasizing robustness to semantic ambiguity and cluttered backgrounds. Collectively, these datasets enable systematic evaluation across object- and scene-level change characteristics.

3.1.1. LEVIR-CD

Released by Beihang University in 2020, the LEVIR-CD [29] dataset has become a widely used benchmark for urban building change analysis in high-resolution remote sensing. It includes 637 bi-temporal RGB image pairs with a ground sampling distance of 0.5 m and an image size of 1024 × 1024 pixels. Each pair is accompanied by a binary change annotation indicating building-related structural modifications. The dataset is split into 445 training, 64 validation, and 128 test samples. For efficient model training, all images are further divided into non-overlapping patches of 256 × 256, yielding 7120 training, 1024 validation, and 2048 test patch pairs (10,192 in total). The samples are collected from multiple metropolitan areas across Texas, USA.
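The patch counts follow directly from the tiling arithmetic: a 1024 × 1024 image splits into 4 × 4 = 16 non-overlapping 256 × 256 patches, so the 637 image pairs give 10,192 patch pairs overall. The short check below makes this explicit; the helper name and dictionary layout are purely illustrative.

```python
def count_patches(n_pairs, image_size=1024, patch_size=256):
    """Number of non-overlapping patch pairs obtained from n_pairs square image pairs."""
    per_side = image_size // patch_size
    return n_pairs * per_side * per_side

splits = {"train": 445, "val": 64, "test": 128}
patches = {name: count_patches(n) for name, n in splits.items()}
total = sum(patches.values())
print(patches, total)
```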

3.1.2. WHU-CD

We also employ the WHU-CD [30] dataset, developed by Wuhan University using high-resolution aerial imagery (0.3 m/pixel) of Christchurch, New Zealand, acquired in 2012 and 2016. The dataset provides annotated building change labels for pre- and post-earthquake urban analysis. The original images are cropped into 256×256 patches and split into 6096 training pairs, 762 validation pairs, and 762 testing pairs.

3.1.3. SYSU-CD

The SYSU-CD [31] dataset was developed to support large-scale change detection in remote sensing applications. It consists of 20,000 bi-temporal RGB image pairs acquired between 2007 and 2014 in Hong Kong, covering seasonal and long-term urban changes. The images have a GSD of about 0.5 m and a spatial size of 256×256 pixels. The dataset covers a wide range of urban change patterns, such as building construction and demolition, vegetation changes, road modifications, and changes in maritime facilities. The pairs are split into 12,000 for training, 4000 for validation, and 4000 for testing.

3.1.4. HRCUS-CD

HRCUS-CD [32], published in 2023 by Zhang et al., focuses mainly on urban areas in Zhuhai, China; its name stands for “High Resolution Change Detection Dataset”. It contains 11,388 pairs of bi-temporal image patches of 256×256 pixels with a ground sampling distance (GSD) of 0.5 m. Each pair is annotated with a binary ground-truth change mask, and together the masks mark more than 12,000 building-related changes. The data span two time periods: urban areas were imaged in 2019 and 2022, while rural or mountainous locations cover an earlier period between 2010 and 2018, capturing diverse urban expansion patterns.

3.2. Evaluation Metrics

To assess the performance of the proposed model in remote sensing change detection, five commonly used quantitative metrics are adopted, including Precision (Pre), Recall, F1-score, Intersection over Union (IoU), and Pixel Accuracy (PA). Among these metrics, Precision and Recall jointly characterize the model’s ability to detect changes, while F1-score and IoU provide a balanced evaluation by simultaneously considering detection accuracy and spatial localization quality. Pixel Accuracy further reflects the overall classification correctness at the pixel level. The formal definitions of these metrics are given as follows:

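$$\mathrm{Precision} = \frac{TP}{TP+FP}, \quad \mathrm{Recall} = \frac{TP}{TP+FN}, \quad \mathrm{F1} = \frac{2TP}{2TP+FP+FN}, \quad \mathrm{IoU} = \frac{TP}{TP+FP+FN}, \quad \mathrm{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}. \tag{10}$$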

In this evaluation setting, True Positives (TP) denote pixels that have undergone changes and are correctly identified by the model, whereas False Positives (FP) refer to unchanged pixels that are mistakenly classified as changed. Pixels that do not change and are correctly classified belong to the True Negative (TN) category. On the other hand, False Negatives (FN) represent the changed pixels that were not detected.
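To make the definitions in Equation (10) concrete, the sketch below computes the five metrics from a pair of flattened binary masks. It is a plain-Python illustration with a hypothetical function name and a toy six-pixel example; returning 0.0 when a denominator is zero is a convention chosen here, not something specified in the paper.

```python
def change_metrics(pred, target):
    """pred, target: flat sequences of 0/1 labels, where 1 marks a changed pixel."""
    pairs = list(zip(pred, target))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)

    def ratio(num, den):
        # Convention for this sketch: an empty denominator yields 0.0.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

pred = [1, 1, 0, 0, 1, 0]     # toy prediction
target = [1, 0, 0, 0, 1, 1]   # toy ground truth
metrics = change_metrics(pred, target)
print(metrics)
```

In this toy case TP = 2, FP = 1, FN = 1, and TN = 2, so IoU is 2/4 = 0.5 while F1 is 4/6, which illustrates that IoU is always the stricter of the two.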

3.3. Implementation Details

To improve readability and reproducibility, the model configuration and key implementation settings are summarized in Table 1.

Table 1.

Summary of model configuration and key implementation settings.

| Category | Parameter | Value |
|---|---|---|
| Environment | Framework | PyTorch 2.0.0 |
| | Hardware | NVIDIA RTX 4090 (24 GB) |
| | Batch Size | 8 (Train)/1 (Val)/1 (Test) |
| Model Settings | Input Image Size | 256 × 256 |
| | Backbone | VGG-16 |
| | Core Interaction | MP-Block |
| | Fusion Backbone | MIFM (RFRB + ACGF) |
| Optimizer | Optimizer | AdamW [33] |
| | Initial Learning Rate | $5 \times 10^{-4}$ |
| | Weight Decay | 0.0025 |
| Scheduler | Warm-up Strategy | LinearLR (0 → $5 \times 10^{-4}$) |
| | Warm-up Epochs | 5 |
| | Decay Strategy | PolyLR (Power = 1.0) |
| | Decay Epochs | 95 (Total 100) |

For quantitative comparison, competing methods are evaluated using their official implementations whenever available; otherwise, they are re-implemented according to the descriptions in the original papers. All methods are trained and tested under identical data splits, input resolutions, and evaluation protocols to ensure a fair comparison. We use AdamW [33] to optimize the model, and a two-stage learning rate schedule is adopted to ensure stable training: a linear warm-up over the first 5 epochs, followed by polynomial decay over the remaining 95 epochs (Table 1).
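The two-stage schedule summarized in Table 1 can be written as a single function of the epoch index. The sketch below is an epoch-level approximation under assumptions: the warm-up rises linearly from 0 to the base rate, the decay reaches exactly 0 at epoch 100, and updates happen once per epoch. It does not reproduce the per-iteration behavior of any particular scheduler class, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=5e-4, warmup_epochs=5, decay_epochs=95, power=1.0):
    """Linear warm-up to base_lr, then polynomial decay (power 1.0 gives a straight line)."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = min((epoch - warmup_epochs) / decay_epochs, 1.0)
    return base_lr * (1.0 - progress) ** power

for epoch in (0, 5, 24, 100):
    print(epoch, learning_rate(epoch))
```

With a power of 1.0 the decay stage is linear, so the rate at epoch 24 (one fifth of the way through the decay stage) is four fifths of the base rate.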

4. Results

In this section, we present comprehensive experimental results to evaluate the effectiveness of the proposed CLFF for very high-resolution remote sensing change detection. Quantitative comparisons with state-of-the-art methods are first reported, followed by qualitative analyses to visually assess detection performance in complex scenes. Extensive ablation studies are then conducted to investigate the contributions of the MP-Block design, the RFRB component, the lightweight convolution strategy (DWConv + CWG), and different backbone networks. Finally, an efficiency analysis is conducted to evaluate the computational cost and practical applicability of the proposed framework.

4.1. Quantitative Comparison with State-of-the-Art Methods

To evaluate the actual performance of the proposed CLFF framework, we conduct quantitative comparisons with several representative remote sensing change detection methods. The compared approaches broadly fall into three categories, including CNN-based methods, attention-enhanced architectures, and Transformer-driven or global–local hybrid designs. To ensure the fairness and consistency of the comparison, all methods are evaluated under identical experimental settings and using the same evaluation metrics.

  1. FC-EF [34] uses an early fusion strategy, where bi-temporal images are concatenated at the input stage and processed by a single fully convolutional encoder–decoder network.

  2. FC-Siam-Conc [34] is based on a Siamese architecture consisting of two parallel encoders. Features from corresponding stages are concatenated during decoding to enable temporal feature integration.

  3. FC-Siam-Diff [34] also adopts a parameter-sharing Siamese structure, in which change information is highlighted by computing the absolute difference between paired feature representations before decoding.

  4. IFNet [25] is a two-branch CNN-based network that enhances feature difference learning through deep supervision to better discriminate changed and unchanged regions.

  5. STANet [29] is an attention-based Siamese network that integrates spatiotemporal attention mechanisms to emphasize change regions and suppress irrelevant variations.

  6. BIT [18] introduces a Transformer-based paradigm into remote sensing change detection, enabling global feature interaction across bi-temporal representations through self-attention mechanisms.

  7. ChangeFormer [19] is a hierarchical Transformer-based method that captures multi-scale contextual information for accurate change detection.

  8. CGNet [26] is an attention-enhanced Siamese architecture that utilizes change magnitude-guided attention to focus on significant change regions.

  9. B2CNet [35] is a progressive refinement network that improves change boundary localization by gradually refining attention from coarse to fine regions.

  10. HyRet-Change [36] is a hybrid retention-based framework for global–local modeling in remote sensing change detection.
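The three FC-based baselines in items 1–3 differ mainly in where the bi-temporal information is merged. The toy sketch below shows this data flow on per-pixel feature vectors represented as Python lists. The helper names and the stand-in `encode` function are hypothetical and do not reproduce the networks of [34].

```python
def encode(x):
    """Stand-in for a shared convolutional encoder (hypothetical): doubles each value."""
    return [2.0 * v for v in x]


def early_fusion(img_t1, img_t2):
    # FC-EF: concatenate the two inputs along the channel axis, then encode once.
    return encode(img_t1 + img_t2)


def siamese_concat(img_t1, img_t2):
    # FC-Siam-Conc: encode each input separately, then concatenate the features.
    return encode(img_t1) + encode(img_t2)


def siamese_diff(img_t1, img_t2):
    # FC-Siam-Diff: encode each input separately, then take the absolute difference.
    f1, f2 = encode(img_t1), encode(img_t2)
    return [abs(a - b) for a, b in zip(f1, f2)]


pixel_t1 = [0.2, 0.5, 0.1]  # three channels at one pixel, time 1
pixel_t2 = [0.2, 0.9, 0.1]  # same pixel, time 2 (only the second channel changed)

print(early_fusion(pixel_t1, pixel_t2))    # six fused channels
print(siamese_concat(pixel_t1, pixel_t2))  # six channels, three per date
print(siamese_diff(pixel_t1, pixel_t2))    # three channels, non-zero only where the dates differ
```

Because the stand-in encoder is linear and element-wise, early fusion and Siamese concatenation give identical outputs here. With a real nonlinear encoder they would not, which is the reason the two designs are treated as distinct baselines.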

Quantitative results on the four datasets (LEVIR-CD, WHU-CD, SYSU-CD, and HRCUS-CD) are reported in Table 2 and Table 3. The best results are highlighted in bold.

Table 2.

Quantitative comparison on the LEVIR-CD and WHU-CD test sets. The best results are highlighted in bold.

Methods LEVIR-CD WHU-CD
F1 Pre Rec IoU F1 Pre Rec IoU
FC-EF [34] 83.32 85.32 81.41 71.41 75.79 83.94 69.08 61.02
FC-Siam-Conc [34] 86.90 87.89 85.92 76.83 69.89 56.95 90.43 53.72
FC-Siam-Diff [34] 85.92 87.81 84.11 75.32 70.26 61.52 81.89 54.15
STANet [29] 80.16 68.41 96.77 66.88 85.04 78.91 92.20 73.97
IFNet [25] 90.51 91.32 89.72 82.67 89.34 93.32 85.68 80.73
BIT [18] 91.13 92.61 89.69 83.70 91.46 92.82 90.15 84.27
ChangeFormer [19] 91.09 92.63 89.61 83.64 90.62 94.69 86.88 82.85
CGNet [26] 91.69 92.83 90.57 84.65 92.92 92.84 93.01 86.78
B2CNet [35] 91.02 92.57 89.52 83.52 92.01 92.21 91.82 85.21
HyRet-Change [36] 91.65 93.37 90.00 84.49 92.32 92.15 92.50 85.74
CLFF 91.89 93.26 90.56 84.99 94.12 94.30 93.94 88.89

Table 3.

Quantitative comparison on the SYSU-CD and HRCUS-CD test sets. The best results are highlighted in bold.

Methods SYSU-CD HRCUS-CD
F1 Pre Rec IoU F1 Pre Rec IoU
FC-EF [34] 74.05 80.23 68.76 58.79 51.94 50.24 42.69 35.08
FC-Siam-Conc [34] 76.20 79.58 73.09 61.55 55.82 52.68 59.36 38.72
FC-Siam-Diff [34] 67.60 89.50 54.32 51.06 54.50 51.31 58.11 37.46
STANet [29] 77.94 73.05 83.52 63.85 45.38 29.97 93.45 29.35
IFNet [25] 80.82 87.33 75.21 67.81 73.35 76.14 70.76 57.92
BIT [18] 73.23 73.48 72.99 57.77 67.29 72.43 62.83 50.70
ChangeFormer [19] 80.53 89.93 72.91 67.41 74.29 81.83 68.03 59.10
CGNet [26] 80.30 87.30 74.34 67.09 73.00 78.05 68.56 57.48
B2CNet [35] 80.45 85.40 76.04 67.29 74.01 76.56 71.62 58.74
HyRet-Change [36] 80.66 84.36 77.27 67.59 74.19 77.68 70.99 58.97
CLFF 82.28 82.81 81.75 69.89 76.40 80.14 72.98 61.81

The experimental results demonstrate that CLFF maintains robust and stable performance across datasets with varying spatial resolutions and scene complexities. On the LEVIR-CD dataset, characterized by densely distributed buildings and subtle structural variations, CLFF achieves the highest F1-score (91.89%) and IoU (84.99%) among all competing methods. This result suggests that the proposed cross-layer interaction mechanism retains small-scale structural information that is often lost in deep layers, enabling the detection of subtle changes while preserving boundary integrity.

A similar advantage is observed on the WHU-CD dataset, where CLFF effectively handles large building footprints and complex urban layouts, ranking first in F1 (94.12%), Recall (93.94%), and IoU (88.89%); its Precision (94.30%) is second only to that of ChangeFormer (94.69%). This performance is consistent with the intended effect of the structural refinement mechanism, which sharpens object boundaries and reduces the edge-related false alarms common in very high-resolution imagery.

On datasets with more diverse object categories, such as SYSU-CD and HRCUS-CD, CLFF continues to exhibit strong generalization capability. Notably, on the challenging HRCUS-CD dataset, CLFF attains an F1-score of 76.40% and an IoU of 61.81%, outperforming the second-best method (ChangeFormer, 74.29% and 59.10%) by 2.11 and 2.71 percentage points, respectively. These consistent improvements support the framework’s ability to mitigate semantic–structural mismatch and to maintain feature alignment even in complex scenes with significant appearance variations.

While CLFF delivers balanced F1 and IoU scores, it does not strictly dominate all metrics. For instance, on LEVIR-CD and HRCUS-CD, STANet [29] attains higher Recall due to a more aggressive detection strategy, albeit at the cost of reduced Precision. In contrast, CLFF maintains a robust trade-off between Precision and Recall. In practice, model selection depends on specific error tolerances: exhaustive discovery tasks may prioritize Recall, whereas automated monitoring typically favors Precision to minimize verification costs. The consistent F1 gains across diverse scenes confirm that CLFF offers a versatile solution that effectively balances these competing requirements.
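The trade-off discussed above follows from how the metrics are related. Assuming the standard definitions (F1 as the harmonic mean of Precision and Recall, and IoU = TP / (TP + FP + FN)), IoU is a monotonic function of F1: IoU = F1 / (2 − F1). The sketch below, with hypothetical function names, applies these relations to the CLFF Precision and Recall reported for LEVIR-CD in Table 2. Small deviations from the tabulated values are expected because the inputs are already rounded.

```python
def f1_from_pr(precision, recall):
    """F1 (in %) as the harmonic mean of Precision and Recall (both in %)."""
    return 2.0 * precision * recall / (precision + recall)


def iou_from_f1(f1):
    """IoU (in %) implied by an F1-score (in %), using IoU = F1 / (2 - F1)."""
    f = f1 / 100.0
    return 100.0 * f / (2.0 - f)


# CLFF on LEVIR-CD (Table 2): Precision 93.26, Recall 90.56.
f1 = f1_from_pr(93.26, 90.56)
iou = iou_from_f1(f1)
print(f"F1 = {f1:.2f}, IoU = {iou:.2f}")
```

The computed F1 matches the reported 91.89%, and the implied IoU lies within about 0.01 points of the reported 84.99%. Since IoU is determined by F1 under these definitions, the two metrics always rank methods identically, and the informative contrast between methods lies in how Precision and Recall are balanced.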

4.2. Qualitative Comparison

This subsection presents qualitative comparisons on four public benchmark datasets to visually assess the effectiveness of the proposed method under diverse and challenging scenarios. These qualitative results further demonstrate that CLFF reduces semantic–structural mismatch and suppresses appearance-induced pseudo-changes across a wide range of challenging conditions.

4.2.1. LEVIR-CD Comparison

Figure 5 presents qualitative results on the LEVIR-CD dataset, which is characterized by significant variations in building scale and illumination conditions. As shown in Figure 5(1)–(3), baseline methods tend to produce fragmented predictions when dealing with large-scale buildings or densely distributed small objects. In the cluttered urban background of Figure 5(2), they fail to detect weak change targets because of interference from surrounding structures.

Figure 5.

Qualitative comparison of representative samples from the LEVIR-CD dataset. Rows (1)–(5) illustrate scenarios involving large-scale buildings, dense small objects, and complex illumination conditions. TP (white), TN (black), FP (red), and FN (blue) denote true positives, true negatives, false positives, and false negatives, respectively. The proposed method is highlighted in red font.

Whereas other approaches often produce incomplete or morphologically inconsistent change maps, CLFF yields more coherent predictions by explicitly coordinating cross-layer information. Through the integration of high-level semantic representations with fine-grained structural details, the proposed framework enhances spatial detail recovery in the segmentation results. For instance, in the dense residential area shown in Figure 5(4), CLFF preserves clear separation between adjacent buildings and maintains boundary integrity, while competing methods tend to generate blurred contours or irregularly merged regions. These results suggest that the proposed interaction mechanism effectively captures long-range structural dependencies and maintains structural coherence at larger spatial scales.

4.2.2. WHU-CD Comparison

Figure 6 shows qualitative comparisons on the WHU-CD dataset, a benchmark of ultra-high-resolution urban scenes that frequently contain dense vegetation occlusion and transient illumination changes. As shown in Figure 6(1),(2), many existing methods are prone to false positives. These errors are typically triggered by phenological changes in vegetation or by shifting cast shadows, and failing to suppress them leads to misclassification along building boundaries.

Figure 6.

Qualitative comparison of representative samples from the WHU-CD dataset. Rows (1)–(5) display challenging scenarios, including vegetation occlusion, bright building foundations, and intricate building boundaries. TP (white), TN (black), FP (red), and FN (blue) denote true positives, true negatives, false positives, and false negatives, respectively. The proposed method is highlighted in red font.

By explicitly separating structural changes from semantically irrelevant appearance variations, CLFF suppresses false detections. In scenarios involving large building footprints with homogeneous textures, CLFF produces more coherent and well-defined boundaries than conventional methods, as illustrated in Figure 6(4), and avoids the fragmented contours and serrated edge artifacts that competing methods commonly produce. These qualitative observations further indicate that, even under substantial radiometric variations, the coordinated cross-layer fusion mechanism improves spatial consistency.

4.2.3. SYSU-CD Comparison

Figure 7 presents qualitative results on the SYSU-CD benchmark. This dataset is particularly suitable for qualitative evaluation due to the presence of diverse change categories and numerous long linear structures. As shown in Figure 7(1), when small targets are embedded within complex backgrounds, competing methods are prone to either miss the targets or produce fragmented and disconnected change maps. In contrast, CLFF captures fine structural dissimilarities across different backgrounds while filtering out background interference.

Figure 7.

Qualitative comparison of representative samples from the SYSU-CD dataset. Rows (1)–(5) display scenarios including small object changes, linear infrastructure expansion, and wharf or ship detection under diverse backgrounds. TP (white), TN (black), FP (red), and FN (blue) denote true positives, true negatives, false positives, and false negatives, respectively. The proposed method is highlighted in red font.

A notable advantage of the proposed CLFF emerges when handling the evolution of linear infrastructures, such as road expansions or wharf developments (Figure 7(2)–(4)). Whereas baseline architectures tend to output discontinuous contours mainly due to their limited contextual receptive fields, CLFF maintains strong morphological consistency and geometric continuity. These empirical observations indicate that the proposed cross-layer interaction scheme effectively coordinates global contextual perception with local structural information, resulting in more coherent and structurally consistent change representations—an ability that is critical for modeling elongated man-made structures. Moreover, under dynamic fluctuations of water surfaces (Figure 7(5)), CLFF demonstrates enhanced robustness to irregular background variations.

4.2.4. HRCUS-CD Comparison

Figure 8 illustrates the qualitative performance of CLFF on the HRCUS-CD dataset, which is characterized by highly cluttered scenes and fine-grained changes. In scenarios involving building construction within complex surroundings (Figure 8(2)) or appearance changes induced by roof repainting (Figure 8(5)), many state-of-the-art methods suffer from pronounced missed detections or false alarms.

Figure 8.

Qualitative comparison of representative samples from the HRCUS-CD dataset. Rows (1)–(5) illustrate challenging cases including dense building construction, partial occlusion, and appearance variations caused by roof painting or shadows. TP (white), TN (black), FP (red), and FN (blue) denote true positives, true negatives, false positives, and false negatives, respectively. The proposed method is highlighted in red font.

In contrast, CLFF produces more plausible and internally consistent predictions with sharper and more uniform boundaries. Even when target regions are partially embedded in vegetation or affected by strong shadow interference (Figure 8(4)), the proposed method is able to reliably identify the change regions.

4.3. Ablation Study

Unless otherwise stated, the baseline configuration in all ablation studies refers to the same backbone network where the proposed MP-Block is replaced by a standard convolutional unit (i.e., 3×3 Conv with BN and ReLU). For simplicity, this unit is abbreviated as “3×3 Conv” in some tables.

4.3.1. Ablation on Stage-Wise Placement of MP-Blocks

To assess the effectiveness of the proposed cross-layer interaction design, we conduct stage-wise ablation studies by inserting the MP-Block at different stages of the backbone network, either individually or in combination. The configurations of the variants are listed below:

  • Baseline: VGG16-BN backbone where all MP-Blocks are removed and replaced with standard 3×3 Conv + BN + ReLU layers.

  • Baseline + MP-3: MP-3 enabled.

  • Baseline + MP-2: MP-2 enabled.

  • Baseline + MP-1: MP-1 enabled.

  • Baseline + MP-3 + MP-2: MP-3 and MP-2 enabled.

  • Baseline + MP-3 + MP-1: MP-3 and MP-1 enabled.

  • Baseline + MP-2 + MP-1: MP-2 and MP-1 enabled.

  • CLFF (Full): MP-1, MP-2, and MP-3 all enabled.
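The eight variants above are exactly the subsets of the three MP-Block positions. The short sketch below makes this ablation grid explicit. The variant names follow the list, while the enumeration helper itself is illustrative and not part of the original experiments.

```python
from itertools import combinations

STAGES = ("MP-3", "MP-2", "MP-1")  # deep, intermediate, shallow feature levels


def ablation_variants(stages=STAGES):
    """Enumerate every subset of MP-Block positions, from none to all."""
    variants = []
    for k in range(len(stages) + 1):
        for subset in combinations(stages, k):
            if not subset:
                name = "Baseline"
            elif len(subset) == len(stages):
                name = "CLFF (Full)"
            else:
                name = "Baseline + " + " + ".join(subset)
            variants.append((name, subset))
    return variants


for name, enabled in ablation_variants():
    print(f"{name}: {', '.join(enabled) if enabled else 'no MP-Block'}")
```

Covering all 2³ = 8 subsets means each stage's contribution can be examined both in isolation and in combination with the others.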

Table 4 and Table 5 report the results for all configurations.

Table 4.

Impact of stage-wise MP-Block placement on LEVIR-CD and WHU-CD. The best results are highlighted in bold.

Method LEVIR-CD WHU-CD
F1 Precision Recall IoU F1 Precision Recall IoU
Baseline 91.09 92.17 90.04 83.64 92.54 93.05 92.03 86.11
Baseline + MP-3 91.71 92.67 90.77 84.69 93.22 94.24 92.22 87.30
Baseline + MP-2 91.77 93.38 90.22 84.80 93.47 95.27 91.73 87.73
Baseline + MP-1 91.56 92.62 90.52 84.43 93.19 93.92 92.46 87.24
Baseline + MP-3 + MP-2 91.87 93.14 90.63 84.96 93.71 93.92 93.50 88.17
Baseline + MP-3 + MP-1 91.54 93.32 89.75 84.33 93.89 94.55 93.24 88.49
Baseline + MP-2 + MP-1 91.73 92.57 90.91 84.72 93.37 94.28 92.48 87.57
CLFF (Full) 91.89 93.26 90.56 84.99 94.12 94.30 93.94 88.89

Table 5.

Impact of stage-wise MP-Block placement on SYSU-CD and HRCUS-CD. The best results are highlighted in bold.

Method SYSU-CD HRCUS-CD
F1 Precision Recall IoU F1 Precision Recall IoU
Baseline 79.77 85.02 75.13 66.35 72.58 76.20 69.28 56.96
Baseline + MP-3 81.44 85.02 78.15 68.69 73.15 79.45 67.78 57.67
Baseline + MP-2 81.39 86.07 77.20 68.62 74.16 79.19 69.72 58.93
Baseline + MP-1 81.23 83.23 79.33 68.40 73.25 79.59 67.85 57.79
Baseline + MP-3 + MP-2 81.51 85.56 77.83 68.79 74.79 79.72 70.43 59.73
Baseline + MP-3 + MP-1 82.00 86.00 78.35 69.49 74.94 78.03 72.08 59.92
Baseline + MP-2 + MP-1 81.38 84.46 78.51 68.60 74.75 80.27 69.94 59.68
CLFF (Full) 82.28 82.81 81.75 69.89 76.40 80.14 72.98 61.81

As shown in Table 4 and Table 5, inserting an MP-Block at any single stage improves F1 and IoU over the baseline on all four datasets, indicating that cross-layer interactions at different semantic levels benefit change representation. Dual-stage configurations generally improve F1 and IoU further, although not uniformly: on LEVIR-CD, for example, Baseline + MP-3 + MP-1 (91.54% F1) falls below the single-stage Baseline + MP-2 variant (91.77% F1). The full CLFF, which deploys MP-Blocks at all three stages, achieves the best F1 and IoU on every dataset, suggesting that multi-level collaboration is important for simultaneously preserving semantic cues and fine-grained structures.

Some single- or dual-stage variants achieve slightly higher Precision (e.g., Baseline + MP-2 reaches 95.27% on WHU-CD). The complete CLFF model, however, yields a more balanced result: it attains high Recall while retaining competitive Precision, and therefore the best overall F1. These results indicate that enabling cross-layer interaction at every level leads to more complete change detection and greater overall reliability.

4.3.2. Visual Ablation of MP-Blocks at Different Feature Levels on the WHU-CD Dataset

In addition to the quantitative analysis above, Figure 9 presents qualitative ablation results on the WHU-CD dataset to illustrate how MP-Blocks at different feature levels affect the predictions.

Figure 9.

Visual ablation results on the WHU-CD dataset for different MP-Block configurations. Rows (1)–(5) illustrate representative scenarios involving dense residential areas, large-scale structures, and complex building boundaries. From left to right: Time 1 image, Time 2 image, Baseline, Baseline + MP-3, Baseline + MP-2, Baseline + MP-1, Baseline + MP-3 + MP-2, Baseline + MP-3 + MP-1, Baseline + MP-2 + MP-1, CLFF (Full), and Ground Truth. MP-3, MP-2, and MP-1 denote MP-Blocks inserted at the deep, intermediate, and shallow feature levels, respectively. TP (white), TN (black), FP (red), and FN (blue) indicate true positives, true negatives, false positives, and false negatives, respectively.

As shown in Figure 9, the baseline predictions are fragmented, with missing regions and broken contours, and exhibit obvious background noise for both small objects and large structures. Introducing an MP-Block at a single feature level already improves structural integrity, and using MP-Blocks at multiple levels further improves spatial coherence and reduces background noise.

As more MP-Blocks are activated, the predicted change maps become progressively more regular, and the full CLFF model produces the most complete shapes and sharpest edges across scenes ranging from dense residential areas to complex rooftops. These findings indicate that collaboration across feature levels is important for robust change detection in complex urban scenes.

4.3.3. Ablation Study on the Internal Components of the MP-Block

To disentangle the individual contributions of the internal components within the MP-Block, a component-wise ablation study was performed on the WHU-CD dataset, with the quantitative results summarized in Table 6. The baseline configuration adopted a standard 3×3 convolutional fusion layer, and individual components were subsequently introduced in an incremental manner.

Table 6.

Component-wise ablation of the MP-Block on WHU-CD. The best results are highlighted in bold.

Method F1 (%) Precision (%) Recall (%) IoU (%)
Baseline + 3×3 Conv 92.54 93.05 92.03 86.11
+RFRB (w/o ACGF) 93.32 94.30 92.37 87.48
+ACGF (w/o RFRB) 93.69 94.53 92.87 88.13
+MIFM (RFRB + ACGF) 93.87 94.82 92.94 88.45
+MP-Block (MIFM + PAM) 94.12 94.30 93.94 88.89

Replacing the standard convolution with the RFRB raises the F1-score from 92.54% to 93.32% and the IoU from 86.11% to 87.48%. This gain indicates that the response-aware refinement mechanism strengthens structural representations and supports more accurate delineation of object boundaries. Deploying the ACGF alone yields an F1-score of 93.69% and an IoU of 88.13%, with precision rising from 93.05% to 94.53%. This suggests that adaptive channel-group fusion filters redundant channel-wise responses, suppressing spurious activations and reducing false positives. Together, these results show that both the RFRB and the ACGF contribute to structural precision and feature discriminability.

When RFRB and ACGF are integrated within MIFM, the model improves further (F1: 93.87%; IoU: 88.45%), indicating that structural refinement and channel recalibration are complementary. This suggests that MIFM serves as the core fusion module where semantic and structural information are jointly coordinated across feature levels.

The best overall performance is obtained with the complete MP-Block, in which the MIFM is complemented by the PAM. The PAM was incorporated as an auxiliary global enhancement module to complement the proposed cross-layer fusion strategy. With the inclusion of PAM, the model attained the highest F1 score of 94.12% and an IoU of 88.89%, primarily driven by an improvement in recall (from 92.94% to 93.94%). Overall, this hierarchical analysis demonstrates that the major performance gains primarily stem from the MIFM-based cross-layer interaction, while PAM provides additional global contextual enhancement.

4.3.4. Ablation Analysis of Weak Feature Enhancement in the RFRB

To evaluate the weak-feature enhancement strategy in the Response-aware Feature Reconstruction Branch (RFRB), we conducted a series of ablation experiments by replacing the enhancement branch with different convolutional and gating configurations: 1×1 convolution, 3×3 convolution, depthwise separable convolution (DWConv) [37], group convolution (GConv, g=4) [38], and the proposed DWConv with channel-wise gating (CWG). This analysis examines how alternative designs affect the refinement of weak responses in complex scenes. The quantitative results on the WHU-CD dataset are summarized in Table 7.

Table 7.

Ablation study on the WHU-CD dataset. DWConv is implemented by decoupling spatial filtering and channel mixing through a depthwise convolution with a 3×3 kernel and a subsequent 1×1 pointwise convolution [37]. GConv denotes group convolution with g=4, which was originally introduced for model parallelism in AlexNet [38]. CWG denotes the channel-wise gating operation for weak feature enhancement. The best results are highlighted in bold.

Method F1 (%) Precision (%) Recall (%) IoU (%) Params (M) FLOPs (G)
1×1 Conv 93.79 93.97 93.62 88.31 26.92 61.17
DWConv [37] 93.95 94.42 93.49 88.59 27.27 61.20
GConv (g=4) [38] 93.97 94.61 93.33 88.62 27.69 62.17
3×3 Conv 94.21 96.07 92.41 89.05 30.37 62.85
DWConv + CWG (Sigmoid) 94.04 94.69 93.39 88.74 27.62 61.20
DWConv + CWG (Softmax, Ours) 94.12 94.30 93.94 88.89 27.62 61.20

Among the compared variants, the 3×3 convolution achieved the highest F1 score (94.21%) and IoU (89.05%), yet this gain was accompanied by a noticeable increase in model complexity, with 30.37 M parameters and 62.85 G FLOPs. By comparison, DWConv and GConv required fewer parameters and FLOPs (27.27 M/61.20 G and 27.69 M/62.17 G), but their gains over the 1×1 convolution remained limited. This suggests that lightweight convolutions alone are insufficient for enhancing weak change responses.
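The parameter ordering in Table 7 is consistent with the weight counts of the underlying operators. As a rough worked example, the sketch below counts weights (ignoring biases and normalization layers) for a single layer with 256 input and 256 output channels. The channel width is an assumption chosen for illustration, and the group convolution is assumed to use a 3×3 kernel; neither is taken from the paper.

```python
def conv_weights(c_in, c_out, k, groups=1):
    """Number of weights in a k x k convolution with the given number of groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k


def dwconv_weights(c_in, c_out, k=3):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = conv_weights(c_in, c_in, k, groups=c_in)  # one k x k filter per channel
    pointwise = conv_weights(c_in, c_out, 1)              # channel mixing
    return depthwise + pointwise


C = 256  # hypothetical channel width
counts = {
    "1x1 Conv": conv_weights(C, C, 1),
    "DWConv": dwconv_weights(C, C),
    "GConv (g=4)": conv_weights(C, C, 3, groups=4),  # assumed 3x3 kernel
    "3x3 Conv": conv_weights(C, C, 3),
}
for name, n in counts.items():
    print(f"{name}: {n}")
```

Under these assumptions the counts increase in the order 1×1 Conv, DWConv, GConv, 3×3 Conv, which matches the ordering of the Params column in Table 7. The depthwise separable layer adds only the small depthwise term on top of a 1×1 convolution.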

When channel-wise gating was introduced, the DWConv + CWG (Softmax) variant improved on plain DWConv, raising the F1-score from 93.95% to 94.12% and the recall from 93.49% to 93.94%, while keeping computational complexity nearly unchanged (27.62 M parameters and 61.20 G FLOPs). This indicates that the gain comes from rebalancing existing features rather than from added model capacity. A comparison between softmax- and sigmoid-based gating functions reveals distinct behaviors. The sigmoid-based gate attains a similar F1-score (94.04%) with slightly higher precision (94.69% vs. 94.30%), whereas the softmax-based gate achieves higher recall (93.94% vs. 93.39%), which helps to recover complete change regions. This difference can be attributed to the competitive normalization property of softmax, in contrast to the independent channel-wise modulation mechanism employed by sigmoid-based attention [39,40]. Accordingly, the softmax-based CWG is adopted as the default configuration in this work.
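The contrast between the two gating functions can be illustrated on a single vector of channel scores. This is a generic sketch of softmax versus sigmoid gating, not the authors' CWG implementation; the scores are made up and the function names are hypothetical.

```python
import math


def sigmoid_gate(scores):
    """Independent gates: each channel is squashed to (0, 1) on its own."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


def softmax_gate(scores):
    """Competitive gates: channels share a budget that sums to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 0.5, -1.0]   # hypothetical per-channel descriptors
boosted = [4.0, 0.5, -1.0]  # same, with the first channel strengthened

# Sigmoid: the gates of the untouched channels do not change.
print(sigmoid_gate(scores)[1:], sigmoid_gate(boosted)[1:])
# Softmax: strengthening one channel lowers the gates of the others.
print(softmax_gate(scores)[1:], softmax_gate(boosted)[1:])
```

The softmax gates always sum to one, so channels compete for a fixed budget, whereas sigmoid gates are set independently per channel. This is the property the text refers to as competitive normalization.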

Overall, the results show that the proposed weak feature enhancement strategy strengthens subtle change responses in a lightweight and representation-driven manner. They also suggest that explicitly rebalancing weak and strong features is a more efficient route than simply increasing convolutional capacity: the softmax-gated variant comes within 0.09 points of the 3×3 convolution in F1 while using 2.75 M fewer parameters.

4.3.5. Backbone-Wise Evaluation of the MP-Block

To verify the generalization capability of the proposed method across different network architectures, we evaluated the MP-Block on three representative backbone networks, namely ResNet18 [41], ResNet50 [41], and VGG16 [42]. In practical remote sensing applications, choosing a backbone is often limited by the computational budget, so it is necessary to have an interaction design that is flexible and architecture-agnostic.

Experimental settings for fair comparison. VGG16 is adopted as the primary backbone configuration. For ResNet18, the standard feature outputs are directly utilized without any additional modification. In contrast, ResNet50 produces feature maps with substantially larger channel dimensions, reaching up to 2048 channels at the deepest stage, which differs markedly from VGG16. In order to ensure the fairness of the comparison in terms of computational complexity and parameter efficiency, a channel compression strategy was adopted for ResNet50. Specifically, a 1×1 convolution was inserted at each stage to reduce the channel dimensions by a factor of two (e.g., from 2048 to 1024 at the deepest layer) before the features were fed into the MP-Block.
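The compression step can be made concrete with a small calculation. The standard ResNet50 stage outputs have 256, 512, 1024, and 2048 channels. Which of these stages feed the MP-Blocks is not restated here, so the sketch below simply applies the halving to all four and counts the weights of each 1×1 convolution (biases ignored). The function name is hypothetical.

```python
RESNET50_STAGE_CHANNELS = (256, 512, 1024, 2048)  # standard ResNet50 stage widths


def compress_channels(stage_channels, ratio=0.5):
    """Return (c_in, c_out, weights) for a 1x1 compression convolution per stage."""
    plan = []
    for c_in in stage_channels:
        c_out = int(c_in * ratio)
        plan.append((c_in, c_out, c_in * c_out))  # 1x1 kernel: c_in * c_out weights
    return plan


for c_in, c_out, n in compress_channels(RESNET50_STAGE_CHANNELS):
    print(f"{c_in} -> {c_out} channels, {n / 1e6:.2f} M weights")
```

The deepest stage is reduced from 2048 to 1024 channels, matching the example given in the text.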

Table 8 shows a consistent pattern: the MP-Block remains effective across different backbone networks. Specifically, when applied to the lightweight ResNet18, the module improves both F1-score and IoU while simultaneously reducing parameter count and FLOPs, demonstrating that the proposed cross-layer interaction mechanism maintains strong performance even under strict model capacity constraints. On the deeper ResNet50, it raises F1 from 89.56% to 92.81% and lowers FLOPs from 20.64 G to 18.09 G, at the cost of a modest increase in parameters (51.44 M to 53.61 M). VGG16 achieves the best overall performance, which we attribute to its dense multi-level feature representations.

Table 8.

Generalization performance of the proposed MP-Block across different backbone networks on the WHU-CD dataset. For ResNet50, channel dimensions are compressed by a ratio of 0.5 using 1×1 convolution to ensure fair comparison. The best results are highlighted in bold.

Backbone Method F1 (%) Precision (%) Recall (%) IoU (%) Params (M) FLOPs (G)
ResNet18 [41] Baseline (3×3 Conv) 89.13 89.48 88.78 80.39 19.01 6.96
ResNet18 [41] MP-Block (Ours) 92.36 92.20 92.52 85.80 18.01 6.32
ResNet50 [41] Baseline (3×3 Conv) 89.56 86.44 92.90 81.09 51.44 20.64
ResNet50 [41] MP-Block (Ours) 92.81 93.56 92.08 86.59 53.61 18.09
VGG16 [42] Baseline (3×3 Conv) 92.54 93.05 92.03 86.11 29.55 69.26
VGG16 [42] MP-Block (Ours) 94.12 94.30 93.94 88.89 27.62 61.20

Consistent performance gains are observed across backbones with different depths and architectural designs, demonstrating that the effectiveness of CLFF is not dependent on a specific encoder. This result indicates that semantic–structural misalignment is a common issue in CNN-based backbones. By introducing hierarchical cross-layer interaction, the MP-Block consistently mitigates this misalignment in both shallow (ResNet18) and deep (ResNet50) networks, confirming its role as a backbone-agnostic structural refinement module.

4.4. Comparison of Model Efficiency

Table 9 compares the computational efficiency of representative change detection models on the WHU-CD dataset with respect to parameter size, FLOPs, and training time per epoch. All experiments were carried out on the same hardware setup to ensure a fair comparison.

Table 9.

Comparison of the model efficiency on the WHU-CD dataset.

Model Backbone Params (M) FLOPs (G) Time (s/Epoch)
FC-EF [34] UNet 1.35 3.59 35.47
FC-Siam-Conc [34] UNet 1.35 4.73 39.03
FC-Siam-Diff [34] UNet 1.55 5.33 39.40
STANet [29] ResNet18 4.77 11.21 69.47
IFNet [25] VGG16 35.73 82.26 81.26
BIT [18] ResNet18 3.50 10.63 70.49
ChangeFormer [19] ResNet18 41.03 202.79 142.16
CGNet [26] VGG16 33.68 82.23 104.65
B2CNet [35] ResNet18 16.10 22.36 87.50
HyRet-Change [36] ResNet50 48.36 54.36 64.69
CLFF (Ours) VGG16 27.62 61.20 91.45

As shown in Table 9, the proposed CLFF contains 27.62 M parameters and requires 61.20 G FLOPs, both lower than those of the other VGG16-based methods, IFNet (35.73 M, 82.26 G) and CGNet (33.68 M, 82.23 G), while achieving higher F1 and IoU on all four datasets (Table 2 and Table 3). Compared with the Transformer-based ChangeFormer (41.03 M, 202.79 G), CLFF has a much lower computational cost and is therefore more suitable for resource-constrained scenarios.

The training time per epoch for CLFF is 91.45 s, which is slower than lightweight models such as FC-EF (35.47 s) and BIT (70.49 s) but faster than CGNet (104.65 s) and ChangeFormer (142.16 s). The overhead introduced by the proposed cross-layer fusion design is therefore moderate relative to the accuracy gains.

To conclude, CLFF achieves a favorable balance between accuracy and efficiency. Its moderate parameter scale, computational cost, and training time indicate that the proposed framework is well suited for practical high-resolution remote sensing change detection applications.

5. Discussion

Experiments on four benchmark datasets confirm that explicitly modeling cross-layer interactions is crucial for effective change detection in very high-resolution remote sensing imagery. By enabling direct interaction between high-level semantic representations and fine-grained structural details, CLFF effectively overcomes the semantic–structural separation inherent in hierarchical feature representations. In cluttered scenes, the MP-Block couples noise-sensitive shallow features with semantically robust deep features, allowing for semantic cues to guide structural reconstruction. This design leads to clearer boundaries and more complete extraction of change regions, particularly in complex urban environments.

The behavior of the RFRB also highlights the need for selective structural enhancement. In VHR images, many changes occupy only a small part of the scene and differ little from their surroundings, so they are easily masked by dominant background patterns. Response-aware refinement enhances these subtle change cues and suppresses redundant features, and the ablation results indicate that this targeted enhancement offers a better trade-off between accuracy and cost than simply increasing convolutional capacity.

Although the results are promising, some limitations remain. In extremely cluttered or occluded scenes, small false detections may still occur, and real-time processing of ultra-high-resolution images remains difficult. These observations motivate further investigation into better context modeling, lighter architecture design, and multi-modal information fusion.

6. Conclusions

In this paper, we proposed CLFF, a cross-layer feature fusion framework for change detection in very high-resolution remote sensing images. The core contribution of this work lies in explicitly organizing hierarchical interactions between encoder features through the proposed MP-Block, rather than relying on isolated feature enhancement or implicit fusion. By coordinating semantic abstraction and structural detail across layers, CLFF provides a principled solution to the semantic–structural misalignment commonly observed in very high-resolution change detection. Comparative and ablation studies on four public VHR datasets consistently verify the effectiveness of CLFF and highlight the essential contribution of explicit feature fusion to performance improvement.

Despite its effectiveness, the proposed method still exhibits certain limitations. In extremely cluttered or heavily occluded scenarios, small false detections may still persist, and processing ultra-high-resolution imagery remains computationally demanding for real-time deployment. These limitations suggest that there remains room for further improvement in both robustness and efficiency.

Future work will focus on developing more lightweight architectural variants and extending the proposed interaction mechanism to more challenging settings, such as multi-modal fusion and cross-domain adaptation, with the goal of improving generalization capability and deployment practicality under real-world constraints.

Acknowledgments

This work would not have been possible without the insightful guidance and sustained support of our supervisor, whose expertise and encouragement shaped the direction of the research and helped us navigate critical methodological decisions. Throughout the development and validation of the proposed framework, numerous technical challenges arose, particularly during large-scale model training and experimental verification. In this process, the constructive discussions and hands-on assistance from the laboratory members proved invaluable, enabling us to refine the network design and ensure the reliability of the experimental results. The authors also gratefully acknowledge the financial support provided by Chong Liu, which contributed to the successful completion and publication of this work. The successful implementation of this study also benefited from the open source community, whose contributions, especially the Open-CD (v1.1.0) and PyTorch (v2.0.0) frameworks, provided a solid and flexible foundation for building and evaluating remote sensing change detection models. During the preparation of this manuscript, the authors used ChatGPT (model: ChatGPT-5, OpenAI) solely for language editing and improving clarity and readability. All generated content was carefully reviewed and revised by the authors, who take full responsibility for the integrity and accuracy of this publication.

Abbreviations

The following abbreviations are used in this manuscript:

RSCD Remote Sensing Change Detection
VHR Very High-Resolution
CNN Convolutional Neural Network
CLFF Cross-layer Feature Fusion Framework
MP-Block Multi-level Interaction Perception Block
MIFM Multi-branch Interaction Fusion Mechanism
RFRB Response-aware Feature Reconstruction Branch
ACGF Adaptive Channel Group Fusion Branch
BFAB Bi-temporal Feature Alignment Block
IoU Intersection over Union
F1 F1-score
TP True Positive
FP False Positive
TN True Negative
FN False Negative
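
The evaluation abbreviations above are linked by standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), F1 is the harmonic mean of precision and recall, and IoU = TP / (TP + FP + FN) for the change class. A minimal sketch of these formulas follows. The function name change_metrics and the pixel counts are hypothetical and are not taken from the paper's results.

```python
def change_metrics(tp, fp, fn):
    """Compute precision, recall, F1 and IoU for the change class.

    Inputs are pixel counts of true positives, false positives and
    false negatives. Empty denominators yield 0.0 instead of raising.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2.0 * precision * recall / pr_sum if pr_sum else 0.0
    union = tp + fp + fn
    iou = tp / union if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Illustrative counts only (not results from the paper).
m = change_metrics(tp=80, fp=20, fn=20)
print({k: round(v, 4) for k, v in m.items()})
```

TN does not enter any of these four measures, which is why they are preferred over overall accuracy when unchanged pixels dominate the scene.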

Author Contributions

Conceptualization, X.M. and Y.X.; Methodology, X.M. and C.Q.; Software, X.M.; Validation, X.M. and C.Q.; Formal analysis, X.M.; Investigation, X.M.; Data curation, X.M.; Writing—original draft preparation, X.M.; Writing—review and editing, X.M., Y.X. and C.L.; Visualization, X.M.; Supervision, Y.X.; Project administration, Y.X.; Funding acquisition, C.L. and Y.X. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code of the proposed CLFF framework is publicly available at https://github.com/mengxin3216/CLFF (accessed on 1 February 2026). All datasets used in this study are publicly available benchmark datasets and are described in the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding Statement

This work was supported by the National Natural Science Foundation of China (Grant No. 62271303), the Key Program of the Joint Funds of the National Natural Science Foundation of China (Grant No. U25A20399), the Innovation Program of Shanghai Municipal Education Commission (Grant No. 2021-01-07-00-10-E00121), and the Key Technology R&D Plan of the Science and Technology Commission of Shanghai Municipality (Grant No. 25DZ3102300). The APC was funded by the authors.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


Articles from Sensors (Basel, Switzerland) are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
