Scientific Reports. 2025 Jul 29;15:27669. doi: 10.1038/s41598-025-11403-1

Federated learning-enhanced generative models for non-intrusive load monitoring in smart homes

Yuefeng Lu 1,2, Shijin Xu 3, Yadong Liu 1, Xiuchen Jiang 1
PMCID: PMC12307697  PMID: 40731044

Abstract

Non-Intrusive Load Monitoring (NILM) estimates load-specific power by disaggregating household-level power data, enabling smart grids to provide more accurate power estimations and thus prevent energy waste and casualties. Some existing NILM methods employ federated learning (FL) with generative models to estimate load power; however, their accuracy often suffers within an FL architecture. This is because the generators tend to learn the most common load patterns while neglecting the less frequent ones. To address this, we propose an FL architecture with a Wasserstein generative adversarial network (FL-WGAN) to enhance accuracy. In our method, each client trains its own generative neural network to estimate load power, while a discriminator network evaluates these estimates. Each client employs a Wasserstein distance-based guidance mechanism to ensure the generative model learns the full distribution of all states rather than being confined to a subset. Additionally, an attention mechanism is integrated into the generative model to further improve its representational capability. We evaluate FL-WGAN using the UK-DALE and REDD datasets, and the results demonstrate that our method outperforms existing methods.

Keywords: Smart Grid, Non-Intrusive Load Monitoring (NILM), Federated Learning (FL), Wasserstein Generative Adversarial Network (WGAN)

Subject terms: Engineering, Electrical and electronic engineering

Introduction

According to the International Energy Agency (IEA) report, residential and commercial buildings account for 26% of global energy consumption1, highlighting the urgent need for energy conservation at the building level. Non-intrusive load monitoring (NILM), also known as load disaggregation, plays a crucial role in analyzing energy consumption by identifying individual loads within a building’s total power usage2. By providing detailed insights into consumption patterns, NILM empowers consumers to make informed decisions and adopt energy-efficient practices, thereby improving overall energy efficiency3. Moreover, NILM enables load-level monitoring and fault detection, facilitating timely maintenance and repairs of electrical systems4. Accurate load power estimation is therefore essential for optimizing power usage and preventing potential hazards. However, power consumption data often contains sensitive personal usage patterns, making data privacy protection a critical concern. Due to legal regulations, trade secrets, and personal privacy considerations, clients are typically restricted to using only their local data for model training, leading to the so-called “data island” phenomenon. Consequently, models trained under these constraints suffer from limited generalization ability and reduced accuracy due to insufficient training data.

Federated learning (FL), a distributed machine learning paradigm, enables collaborative model training across clients while preserving data privacy and complying with regulatory standards5,6. In FL, a global model is built on a central server, while each client trains locally on private data and shares only updated model parameters. This allows model training without exposing raw user data, thus protecting personal information inferred from power consumption patterns.

However, existing FL-based methods7,8 often struggle with accurate per-client load estimation, especially for multi-state loads (i.e., loads with more than two operational states). A key challenge is the severe unbalanced states: for instance, a washing machine may produce over 300 data points during its wash cycle (major state) but only 30 in the spin-dry cycle (minor state). As a result, models tend to overfit major states and ignore minor ones, reducing overall accuracy. Additionally, multi-state loads exhibit temporal dependencies (e.g., the rinse cycle duration depends on the wash cycle), which current models fail to capture due to limited representational capacity, further limiting their accuracy.

To address these challenges, this paper proposes a federated learning approach for disaggregating aggregated power into appliance-level power, termed FL-WGAN (Federated Learning-enhanced Wasserstein Generative Adversarial Network). The key contributions include:

  1. To address the class imbalance issue, we employ a Wasserstein generative adversarial network (WGAN) to enhance estimation accuracy. By leveraging the Wasserstein distance metric, WGAN provides more stable and continuous gradient feedback compared to traditional GANs, thereby reducing overfitting introduced by unbalanced state distributions and improving estimation accuracy.

  2. We integrate a self-attention mechanism into the generator to model temporal dependencies across load states and focus on distinct operational phases. This design enhances the model’s representational capacity to disentangle multi-state loads with variable operating cycles. Furthermore, the attention module improves robustness to outliers by dynamically selecting relevant state features during generation.

  3. We evaluate our proposed method on two real-world open datasets containing various loads from different households. The experimental results and complexity analysis demonstrate the superior accuracy and efficiency of our method for on-device NILM.

The rest of this article is organized as follows. Section II reviews related work. Section III describes the proposed method in detail. Section IV presents extensive testing of the method. Finally, Section V concludes this paper.

Related work

Non-intrusive load power measurement

NILM has been extensively studied in recent years, with various approaches proposed to improve both disaggregation accuracy and computational efficiency. Traditional NILM methods typically rely on signal processing and optimization techniques. For instance, the work in9 leverages instantaneous voltage-current (V-I) waveform trajectories to characterize events and employs support vector machines for load identification. Additionally, hidden Markov models (HMMs) have been widely used to model the sequence of load state transitions. These approaches often treat the aggregated power as a combination of multiple HMMs—each corresponding to an individual load—by modeling load transition relationships using state transition matrices10,11. However, such methods generally require prior knowledge of load signatures and often struggle to generalize in real-world scenarios.

The rapid advancement of deep learning has made neural networks the preferred approach for NILM, because they autonomously extract intricate load features from power sequences. Researchers widely apply machine learning models, especially deep learning techniques, to analyze power consumption data and accurately disaggregate electricity usage into individual loads. Previous studies explore various deep learning architectures for NILM, including sequence-to-sequence LSTM models12,13, sequence-to-point CNNs14, and models that utilize Fourier integral analysis for feature extraction15. In addition, researchers introduce innovative techniques such as two-stream convolutional neural networks (TSCNNs)16, self-attention-based temporal convolutional networks (TCNs)17, fully convolutional denoising autoencoders18, and transformer-based models with enhanced transferability19. The work in20 presents a robust and privacy-preserving federated learning framework for training a bidirectional transformer architecture for NILM, and21 proposes a novel aggregation strategy to tackle malicious nodes in peer-to-peer networks. Other studies tackle load transition issues through transfer learning22 and handle unknown loads using generative models such as variational autoencoders and capsule networks23. Recent studies further improve estimation accuracy by incorporating load state transition patterns24,25.

Despite these advances, several challenges remain. First, current approaches overfit major load states and overlook minor ones, which reduces estimation accuracy. Second, deep learning networks demand large amounts of data for training, but clients usually collect only limited data, causing local models to lack generalization. Finally, processing power data that contain sensitive information risks disclosing private information.

Federated learning for protecting data privacy

Several FL-based NILM methods have emerged as promising solutions. FL26 enables multiple devices to collaborate without exposing their local raw data. Researchers have developed various FL-based approaches for on-device NILM. For example, the work in27 presents a sequence-to-point federated learning framework that jointly models the NILM problem across several distributed parties. In addition, the study in28 investigates a blockchain-assisted FL approach to enhance data security, while29 combines differential privacy with FL to balance data utility and consumer privacy. Moreover, the work in30 integrates transfer learning into FL to improve model transferability across different devices. However, these FL-based methods currently support only the collaborative training of models with identical architectures.

Monitoring decentralized systems is an interesting and challenging topic. For example, the work in31 tackles label heterogeneity and communication redundancy by first using a data-distillation-based initialization to align local label distributions, then applying a divergence-metric strategy to prune unnecessary model exchanges, resulting in a 1.58% gain in diagnostic accuracy. Similarly, in32, the authors introduce a server-side framework that constructs customized models for each client in accordance with its resource constraints. During collaborative training, clients exchange distilled knowledge representations rather than raw model parameters, thereby markedly improving communication efficiency and predictive accuracy compared to conventional federated learning schemes.

Main method

Problem statement

The goal of NILM is to disaggregate the total power consumption into the individual power consumption of each load. Let $\mathcal{T} = \{1, 2, \ldots, T\}$ represent the time period, where $T$ is the total number of time steps. The aggregated power is denoted as:

$$\bar{p} = \left[\bar{p}(1), \bar{p}(2), \ldots, \bar{p}(T)\right] \tag{1}$$

where $\bar{p}(t)$ represents the aggregated power at time step $t$.

Consider a smart home with $I$ distinct loads, indexed by $i \in \{1, 2, \ldots, I\}$. The power consumption of the $i$-th load is represented as:

$$p_i = \left[p_i(1), p_i(2), \ldots, p_i(T)\right] \tag{2}$$

where $p_i(t)$ denotes the power consumption of the $i$-th load at time step $t$, for $t \in \mathcal{T}$.

The relationship between the aggregated power and the individual load powers at a given time step $t$ is described by the following equation:

$$\bar{p}(t) = \sum_{i=1}^{I} p_i(t) + e(t) \tag{3}$$

where $e(t)$ represents the measurement noise at time step $t$. Then, we consider a set of clients $\mathcal{C}$, where $D_c$ denotes the local dataset of client $c \in \mathcal{C}$. The complete dataset, which contains all the power data, is composed of the data from each client and can be represented as $D = \bigcup_{c \in \mathcal{C}} D_c$.
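The additive model in Eq. (3) can be illustrated with a small NumPy sketch; the toy traces below are hypothetical, not values from any dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
T, I = 8, 3  # time steps and number of loads (toy sizes)

# per-load power traces p_i(t), drawn at random for illustration
loads = rng.uniform(0, 100, size=(I, T))
noise = rng.normal(0, 1, size=T)  # measurement noise e(t)

# aggregated power per Eq. (3): sum over loads plus noise
aggregate = loads.sum(axis=0) + noise

# NILM's goal is to recover each loads[i] given only `aggregate`
assert aggregate.shape == (T,)
assert np.allclose(aggregate - noise, loads.sum(axis=0))
```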

Overview architecture of FL-WGAN

As illustrated in Fig. 1, to achieve accurate power estimation for NILM, the following steps are performed:

(i) The server initializes the WGAN model, which comprises a load power estimation network (generator) $G$ and a discriminator $D$, and broadcasts the initial global network parameters $\theta_G^0$ and $\theta_D^0$ to all participating clients.

(ii) The clients use local data and the current global parameters to train the local WGAN with the attention mechanism for a certain number of gradient-update steps, and then communicate the trained local parameters to the server.

(iii) The server uses the federated averaging algorithm to aggregate the local parameters and produce new global parameters for round $r+1$:

$$\theta_G^{r+1} = \sum_{c \in \mathcal{C}} \frac{|D_c|}{|D|}\, \theta_{G,c}^{r} \tag{4}$$

$$\theta_D^{r+1} = \sum_{c \in \mathcal{C}} \frac{|D_c|}{|D|}\, \theta_{D,c}^{r} \tag{5}$$

(iv) After $T$ communication rounds, the server obtains a global WGAN model capable of estimating load power with high accuracy. This iterative process ensures that the model is collaboratively trained across all clients without sharing sensitive local data, reducing the risk of data breaches. The detailed procedure of FL-WGAN is presented in Algorithm 1.
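The aggregation in step (iii), Eqs. (4)-(5), reduces to a dataset-size-weighted average of client parameters. A minimal NumPy sketch (the helper name fed_avg and the toy client data are ours):

```python
import numpy as np

def fed_avg(client_params, client_sizes):
    """Weighted average of client parameter vectors (FedAvg):
    theta_global = sum_c (|D_c| / |D|) * theta_c."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()                 # |D_c| / |D|
    stacked = np.stack(client_params)             # shape (C, num_params)
    return (weights[:, None] * stacked).sum(axis=0)

# three clients with different amounts of local data
params = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 10, 20]
global_params = fed_avg(params, sizes)
# (10*[1,2] + 10*[3,4] + 20*[5,6]) / 40 = [3.5, 4.5]
assert np.allclose(global_params, [3.5, 4.5])
```

Clients holding more data pull the global parameters proportionally harder, which is exactly the $|D_c|/|D|$ weighting in Eqs. (4)-(5).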

Fig. 1. Federated architecture with multiple local clients.

Algorithm 1. FL-WGAN.

Local training with WGAN

Each client trains on its local data to estimate load power from the aggregated power, which (i) efficiently avoids overfitting and improves representational ability; and (ii) provides improved estimation accuracy compared to existing methods. The complete architecture is depicted in Fig. 2.

Fig. 2. Architecture of WGAN for Load Power Estimation at Client c.

1) Generator: The main objective of the generator (see Fig. 3 for details) is to transform the aggregated power $\bar{p}$ into the estimated load power $\hat{p}_i$, where $i \in \{1, \ldots, I\}$. It contains three parts: extraction of load features, an attention mechanism to capture long-term dependencies across states, and a nonlinear transformation for estimation:

$$\hat{p}_i = G\!\left(\bar{p};\, \theta_{G,c}\right) \tag{6}$$

where $\theta_{G,c}$ collects the parameters of the generator on client $c$.

Fig. 3. Architecture of Generator at Client c.

For the generator, we add a self-attention mechanism after the feature extraction stage to help the model compute the importance and correlations that characterize the long-term dependencies among different states. For example, the operational cycle of a washing machine includes states such as washing, spinning, and draining, and the duration of the washing state affects the subsequent draining state. The feature map $F_l$ output by the $l$-th layer is input into the attention mechanism to mine the temporal dependencies across different states. The attention mechanism mainly utilizes three learnable parameter matrices: $W_Q$, $W_K$, and $W_V$. Initially, these matrices linearly convert the input $F_l$ into the corresponding attention feature matrices $Q$, $K$, and $V$:

$$Q = F_l W_Q \tag{7}$$

$$K = F_l W_K \tag{8}$$

$$V = F_l W_V \tag{9}$$

Subsequently, the attention weight matrix $A$ is calculated by the softmax function to obtain the attention distribution of the feature map $F_l$:

$$A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) \tag{10}$$

where $d$ is a scaling factor to reduce the variance.
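Equations (7)-(10) amount to standard scaled dot-product self-attention. A minimal NumPy sketch with hypothetical shapes (window w = 6, feature dimension d = 4; the projection matrices are random stand-ins for the learned ones):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
w, d = 6, 4                       # sequence length and attention dimension
F = rng.normal(size=(w, d))       # feature map F_l from layer l

# learnable projections W_Q, W_K, W_V (Eqs. 7-9), random here
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = F @ W_Q, F @ W_K, F @ W_V

# scaled dot-product attention weights (Eq. 10)
A = softmax(Q @ K.T / np.sqrt(d), axis=-1)

assert A.shape == (w, w)
assert np.allclose(A.sum(axis=-1), 1.0)   # each row is a distribution
```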

In the feature representation computation, the relative significance of each feature is quantified through a dynamic weight allocation mechanism. This framework assigns higher weight coefficients to discriminative features, enabling their dominant influence on the final feature representation. The weight matrix $A$, automatically generated through global feature correlation analysis, performs semantic fusion in the feature space via matrix multiplication with the feature matrix $V$. The computational process is formally expressed as:

$$F_{l+1} = \sigma\!\left(A V\right) \tag{11}$$

where $\sigma$ denotes the nonlinear activation function. The aggregated feature representation $F_{l+1}$ exhibits temporal correlation across different states. Once $F_{l+1}$ is obtained, it is used to estimate the power consumption of the load at the next time step. Specifically, the output feature from the last attention layer is passed through a two-layer load-estimation fully connected network (FCN) to predict the load power.

This FCN is separate from the attention mechanism's FCN and is designed to map the feature matrix $F_{l+1}$ into the estimated power sequence $\hat{p}_i$. The mapping process is as follows:

$$\hat{p}_i = W_2 \operatorname{ReLU}\!\left(W_1 F_{l+1} + b_1\right) + b_2 \tag{12}$$

where $W_1$ and $W_2$ are the weight matrices of the first and second layers of the FCN, respectively, and $b_1$ and $b_2$ are the bias terms associated with each layer. ReLU is applied as the activation function in the first layer, enabling non-linearity, while the second layer uses a linear activation function to produce the output. This dual-layer FCN allows the model to capture complex, non-linear relationships between the features and the power estimation, while ensuring that the final output is consistent with the load power data.
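A minimal sketch of the two-layer load-estimation FCN in Eq. (12); all sizes and the helper name estimate_power are hypothetical, and the weights are random stand-ins for trained parameters:

```python
import numpy as np

def estimate_power(F, W1, b1, W2, b2):
    """Two-layer load-estimation FCN (Eq. 12):
    ReLU in the first layer, linear output in the second."""
    h = np.maximum(0.0, F @ W1 + b1)   # first layer with ReLU
    return h @ W2 + b2                  # second layer, linear output

rng = np.random.default_rng(2)
w, d, h = 6, 4, 8                      # window, feature, hidden sizes (toy)
F = rng.normal(size=(w, d))            # attention output F_{l+1}
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, 1)), np.zeros(1)

p_hat = estimate_power(F, W1, b1, W2, b2)
assert p_hat.shape == (w, 1)           # one power estimate per time step
```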

2) Discriminator: The discriminator $D$ is responsible for assessing the accuracy of the estimated load power $\hat{p}_i$. Its output is a scalar value in the range $[0, 1]$. If the estimated load power is accurate, the discriminator returns a value close to 1, indicating high confidence in the prediction. Conversely, if the estimated load power is inaccurate, the discriminator outputs a value close to 0, signaling low confidence in the prediction.

In line with standard convolutional neural network architectures, the discriminator is built using a downsampling layer that includes a convolution module, batch normalization, and an activation function. The downsampling operation reduces the spatial dimension of the input, enabling the model to capture more abstract and high-level features. This helps the discriminator evaluate the accuracy of the power estimation more effectively.

Batch normalization is applied to standardize the activations of each layer’s input, which not only accelerates training but also improves model stability and mitigates issues such as covariate shift. By normalizing the activations, it helps maintain consistent data distributions across the network.

The activation function used is Leaky ReLU. Unlike the standard ReLU function, Leaky ReLU allows a small, non-zero gradient when the input is negative. This ensures that neurons remain active even for negative inputs, thus promoting better gradient flow and helping prevent issues like dead neurons during training.
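The two building blocks described above, batch normalization and Leaky ReLU, can be sketched in NumPy as follows (an inference-style normalization without learned scale/shift, and a hypothetical slope alpha = 0.01):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Unlike ReLU, negative inputs keep a small slope alpha
    instead of being zeroed, so gradients never fully vanish."""
    return np.where(x >= 0, x, alpha * x)

def batch_norm(x, eps=1e-5):
    """Standardize activations over the batch dimension (axis 0)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

x = np.array([[-2.0, 1.0], [4.0, -3.0]])
y = leaky_relu(x)
assert np.isclose(y[0, 0], -0.02) and y[1, 0] == 4.0  # negatives scaled, not zeroed

z = batch_norm(x)
assert np.allclose(z.mean(axis=0), 0.0)               # zero mean per feature
```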

3) Parameter determination: To optimize the generator and discriminator, we consider a mini-batch of size $B$ and establish the following loss functions as constraints.

We define the generator $G$ with the following adversarial loss function:

$$\mathcal{L}_G = -\frac{1}{B} \sum_{b=1}^{B} D\!\left(\hat{p}_i^{(b)}\right) \tag{13}$$

Here, $\hat{p}_i^{(b)}$ is the estimated load power sequence over the time window $w$. The discriminator loss $\mathcal{L}_D$ is formulated as:

$$\mathcal{L}_D = \frac{1}{B} \sum_{b=1}^{B} \left[ D\!\left(\hat{p}_i^{(b)}\right) - D\!\left(p_i^{(b)}\right) \right] + \frac{\lambda}{B} \sum_{b=1}^{B} \left( \left\| \nabla_{\tilde{p}^{(b)}} D\!\left(\tilde{p}^{(b)}\right) \right\|_2 - 1 \right)^2 \tag{14}$$

The penalty term in (14) serves to enforce the smoothness of the Wasserstein distance estimation by ensuring that the discriminator $D$ remains approximately 1-Lipschitz. Specifically, the gradient norm penalty term

$$\lambda \left( \left\| \nabla_{\tilde{p}} D\!\left(\tilde{p}\right) \right\|_2 - 1 \right)^2 \tag{15}$$

is applied to interpolated points $\tilde{p}$ between real and generated samples, where

$$\tilde{p} = \epsilon\, p_i + (1 - \epsilon)\, \hat{p}_i, \qquad \epsilon \sim U[0, 1]. \tag{16}$$

By encouraging $\left\| \nabla_{\tilde{p}} D\!\left(\tilde{p}\right) \right\|_2 \approx 1$, this term helps stabilize training by preventing the generator's gradients from becoming excessively large or vanishingly small. In this context, the discriminator $D$ learns to distinguish between the ground truth load power $p_i$ and the estimated load power $\hat{p}_i$. By learning to differentiate between the ground truth and estimated load power patterns, the discriminator provides meaningful feedback to the generator, ultimately enhancing its ability to generate more accurate load power estimates.
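For intuition: with a linear critic D(x) = w·x, the gradient with respect to the input is simply w, so the interpolation in Eq. (16) and the penalty in Eq. (15) can be checked in closed form. This toy sketch is ours, not the trained discriminator, and lambda = 10 is a common WGAN-GP default rather than a value from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 10.0                      # penalty coefficient (common WGAN-GP default)
w_vec = np.array([0.6, 0.8])    # linear critic D(x) = w . x, so grad_x D = w

p_real = rng.uniform(0, 100, size=2)   # toy "ground truth" load power
p_fake = rng.uniform(0, 100, size=2)   # toy "generated" load power

# Eq. (16): interpolate between real and generated samples
eps = rng.uniform()
p_tilde = eps * p_real + (1 - eps) * p_fake

# Eq. (15): lambda * (||grad D(p_tilde)||_2 - 1)^2
grad_norm = np.linalg.norm(w_vec)          # ||[0.6, 0.8]|| = 1 exactly
penalty = lam * (grad_norm - 1.0) ** 2
assert np.isclose(penalty, 0.0)            # a 1-Lipschitz critic incurs no penalty
```

A critic whose gradient norm drifts away from 1 would incur a positive penalty, which is precisely the pressure that keeps $D$ approximately 1-Lipschitz.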

The generator and discriminator networks are optimized using stochastic gradient descent (SGD). The parameter updates for these networks are as follows:

$$\theta_G \leftarrow \theta_G - \eta_G\, \nabla_{\theta_G} \mathcal{L}_G \tag{17}$$

$$\theta_D \leftarrow \theta_D - \eta_D\, \nabla_{\theta_D} \mathcal{L}_D \tag{18}$$

where $\eta_G$ and $\eta_D$ represent the learning rates for the generator and discriminator, respectively.

Numerical evaluation

The numerical evaluation is conducted with the TensorFlow framework on a desktop equipped with a 12-vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz, 40GB RAM, and an Nvidia RTX 2080 Ti GPU.

Dataset overview

The proposed FL-WGAN method is evaluated on two datasets: U.K.-DALE33 and REDD34. Its effectiveness is assessed using three performance metrics: Mean Absolute Error (MAE), Normalized Signal Aggregate Error (SAE), and the $F_1$ score.

U.K.-DALE Dataset: The U.K.-DALE dataset contains recordings from five UK homes, each with more than five monitored appliances. For instance, Home 4 includes five sub-metered loads and one aggregate mains channel (see Fig. 4). As shown, the aggregate power does not equal the sum of the monitored loads due to untracked background appliances.

Fig. 4. Partial Power Profile of Home 4 in the U.K.-DALE Dataset.

In NILM, exact power conservation is not required. The objective is to extract the operating patterns of target appliances from the aggregate signal. This dataset records aggregate power every second and appliance-level power every 3 s for five homes. We focus on five loads (washing machine, microwave, dishwasher, fridge, kettle). The data from 1 June 2013 to 17 July 2013 are used for training our model, from 17 July 2013 to 20 July 2013 for validation, which determines the hyperparameter configuration, and from 21 July 2013 to 27 July 2013 for testing.

REDD Dataset: This dataset captures power every 6 s for six homes. We target the same loads, replacing the kettle with a light. Data from 18 April 2011 to 5 May 2011 are used for training, from 5 May 2011 to 8 May 2011 for validation, and from 9 May 2011 to 12 May 2011 for testing.

Data Pre-processing: In our simulations, we follow the same preprocessing procedures as those used in prior studies25. Specifically, we standardize the data for each load by subtracting the mean and dividing by the standard deviation. Additionally, to address missing data caused by load anomalies, we apply linear interpolation to fill in the gaps, ensuring that the dataset remains continuous and complete.
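A minimal sketch of this preprocessing, interpolating gaps and then standardizing; the helper name preprocess and the toy readings are ours:

```python
import numpy as np

def preprocess(power):
    """Fill gaps by linear interpolation, then standardize
    (subtract the mean, divide by the standard deviation)."""
    power = power.astype(float)
    idx = np.arange(len(power))
    mask = np.isnan(power)
    power[mask] = np.interp(idx[mask], idx[~mask], power[~mask])
    return (power - power.mean()) / power.std()

raw = np.array([10.0, np.nan, 30.0, 40.0])   # one missing reading
clean = preprocess(raw)
assert not np.isnan(clean).any()             # gap filled (interpolated to 20 W)
assert np.isclose(clean.mean(), 0.0) and np.isclose(clean.std(), 1.0)
```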

Performance Metrics:

1. MAE: This metric quantifies the absolute estimation error at each time step.

$$\text{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| \hat{p}_i(t) - p_i(t) \right| \tag{19}$$

2. SAE: This metric is designed to evaluate the relative total estimation error across the entire testing period.

$$\text{SAE} = \frac{\left| \sum_{t=1}^{T} \hat{p}_i(t) - \sum_{t=1}^{T} p_i(t) \right|}{\sum_{t=1}^{T} p_i(t)} \tag{20}$$

3. $F_1$ Score: In addition to power estimation, we use the $F_1$ score to assess the accuracy of state detection. For this study, any load exceeding 10 watts is considered to be in the "on" state.

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{21}$$

where precision = TP/(TP + FP) and recall = TP/(TP + FN). The terms TP, FP, and FN represent true positives, false positives, and false negatives, respectively.
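The three metrics can be sketched directly from Eqs. (19)-(21); the helper names and toy power values are ours:

```python
import numpy as np

def mae(p_true, p_hat):
    return np.mean(np.abs(p_hat - p_true))                      # Eq. (19)

def sae(p_true, p_hat):
    return np.abs(p_hat.sum() - p_true.sum()) / p_true.sum()    # Eq. (20)

def f1_score(p_true, p_hat, thresh=10.0):
    """F1 on on/off states, with 'on' defined as power > 10 W."""
    on_t, on_h = p_true > thresh, p_hat > thresh
    tp = np.sum(on_t & on_h)
    fp = np.sum(~on_t & on_h)
    fn = np.sum(on_t & ~on_h)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)        # Eq. (21)

p_true = np.array([0.0, 50.0, 50.0, 0.0])
p_hat  = np.array([0.0, 40.0, 60.0, 20.0])
assert np.isclose(mae(p_true, p_hat), 10.0)
assert np.isclose(sae(p_true, p_hat), 0.2)
assert np.isclose(f1_score(p_true, p_hat), 0.8)
```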

Parametric analysis

In our experiment, we downsample the data from the UK-DALE dataset, which is sampled at a 3-second interval, to a 6-second resolution using time-window averaging (i.e., by calculating the arithmetic mean of every two consecutive samples), ensuring consistent sampling rates between the two datasets. After each training epoch, validation is performed to tune hyperparameters. We train the FL-WGAN model using $E$ local epochs, learning rates $\eta_G$ and $\eta_D$ for the generator and discriminator respectively, and a batch size of $B$. Training proceeds for 10,000 communication rounds under TensorFlow. A grid search is conducted on the validation set over candidate window lengths $w$ and attention hidden node sizes $h \in \{128, 256, 512, 1024\}$.
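The pairwise-averaging downsampling (3 s to 6 s) can be sketched as follows; the helper name is ours:

```python
import numpy as np

def downsample_pairs(x):
    """Average every two consecutive samples: 3 s resolution -> 6 s."""
    x = x[: len(x) // 2 * 2]           # drop a trailing odd sample, if any
    return x.reshape(-1, 2).mean(axis=1)

power_3s = np.array([10.0, 20.0, 30.0, 50.0, 0.0])
power_6s = downsample_pairs(power_3s)
assert np.allclose(power_6s, [15.0, 40.0])   # (10+20)/2, (30+50)/2
```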

  • Window length $w$: We plot the validation loss curves under different window lengths $w$, as shown in Fig. 5. The best performance was achieved at an intermediate window length. Smaller values of $w$ fail to capture the full temporal patterns of longer appliance cycles, resulting in more fluctuating loss curves that suggest difficulty in modeling complete operating cycles. Conversely, larger window lengths increase computational cost without yielding significant accuracy improvements. The selected $w$ therefore provides an effective trade-off between temporal context capture and computational efficiency.

  • Attention hidden nodes $h$: Smaller settings (128, 256) underfit appliance state transitions; larger ones (1024) risk overfitting. A size of 512 yielded the lowest validation error.

  • Final estimation hidden nodes: Selected from a set of candidate sizes on the validation set. We find that accuracy hardly improves once the value surpasses 774, making 774 the most balanced choice in terms of capacity and generalization.

The architectures of the generator and discriminator are provided in Table 1.

Fig. 5. Validation Loss Curves with Varying Window Sizes (w).

Table 1.

Network Structures of the Generator and the Discriminator.

| Model | Part | Layer | Data Size |
| --- | --- | --- | --- |
| Generator | CNN | 1 | (1; w) → (24; w) |
| Generator | CNN | 2 | (24; w) → (48; w) |
| Generator | CNN | 3 | (48; w) → (96; w) |
| Generator | Attention Layer | 4 | – |
| Generator | Load-Estimation FCN | – | – |
| Discriminator | FCN | 1 | – |
| Discriminator | FCN | 2 | – |
| Discriminator | FCN | 3 | – |
| Discriminator | FCN | 4 | – |
| Discriminator | FCN | 5 | – |

First, we evaluate the performance of FL-WGAN for load power estimation on the U.K.-DALE and REDD datasets. To assess the effectiveness of each component, we compare the proposed WGAN framework (without federated learning), referred to as WGAN, with three ablated variants: (i) a W-Generator without the discriminator network, named Gen-Atten; (ii) a WGAN variant without the attention mechanism, named Gen-NA; and (iii) a W-Generator that lacks both the discriminator and the attention mechanism, denoted as Gen-only. The quantitative results in Table 2 demonstrate the critical role of each module. Specifically, the generator stripped of both the attention mechanism and the discriminator exhibits the worst performance in terms of MAE and SAE. Reintroducing the discriminator improves all three evaluation metrics, whereas incorporating the attention mechanism without the discriminator leads to even greater performance gains. The full WGAN configuration, which integrates both the discriminator and the attention mechanism, achieves the highest accuracy. These findings confirm that the attention mechanism effectively captures critical load features, while the discriminator plays a crucial role in enhancing estimation precision.

Table 2.

Mean Absolute Error (MAE), Signal Aggregate Error (SAE), and $F_1$ Scores on UK-DALE and REDD Datasets without Federated Learning.

The six appliance columns on the left are from UK-DALE; the six on the right are from REDD.

| Metric | Method | Kettle | W.M. | D.W. | Microwave | Fridge | Avg. | Light | W.M. | D.W. | Microwave | Fridge | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MAE (Watts) | WGAN | 9.33 | 41.25 | 19.12 | 10.05 | 24.21 | 20.79 | 17.64 | 17.95 | 22.43 | 15.33 | 38.22 | 22.31 |
| MAE (Watts) | Gen-Atten | 11.47 | 53.15 | 19.58 | 28.42 | 25.06 | 27.54 | 20.28 | 81.87 | 22.75 | 20.21 | 40.89 | 37.20 |
| MAE (Watts) | Gen-NA | 12.89 | 54.73 | 22.35 | 17.24 | 31.92 | 27.83 | 24.17 | 90.65 | 29.54 | 38.72 | 45.31 | 47.68 |
| MAE (Watts) | Gen-only | 14.85 | 59.42 | 20.63 | 27.56 | 29.41 | 30.37 | 22.09 | 67.03 | 24.38 | 38.45 | 44.06 | 39.20 |
| SAE (%) | WGAN | 4.48 | 34.92 | 16.31 | 8.83 | 13.18 | 15.54 | 12.95 | 13.41 | 18.97 | 11.12 | 26.15 | 16.52 |
| SAE (%) | Gen-Atten | 6.42 | 47.85 | 17.35 | 26.18 | 15.51 | 22.66 | 16.38 | 77.52 | 18.65 | 19.58 | 32.22 | 33.27 |
| SAE (%) | Gen-NA | 15.83 | 48.62 | 25.37 | 30.14 | 20.89 | 28.17 | 18.95 | 69.24 | 25.81 | 26.43 | 35.28 | 33.14 |
| SAE (%) | Gen-only | 20.15 | 52.06 | 30.68 | 34.29 | 24.11 | 32.22 | 21.53 | 73.46 | 29.52 | 29.87 | 39.21 | 38.32 |
| $F_1$ (%) | WGAN | 94.85 | 68.12 | 93.24 | 91.03 | 82.97 | 85.64 | 87.45 | 21.78 | 64.25 | 34.89 | 83.04 | 74.89 |
| $F_1$ (%) | Gen-Atten | 65.03 | 39.75 | 76.68 | 79.72 | 51.12 | 62.66 | 80.12 | 52.14 | 69.32 | 54.31 | 56.41 | 60.47 |
| $F_1$ (%) | Gen-NA | 63.27 | 35.42 | 74.15 | 75.89 | 49.86 | 59.72 | 72.34 | 45.63 | 67.85 | 52.04 | 54.12 | 58.20 |
| $F_1$ (%) | Gen-only | 58.94 | 30.25 | 70.82 | 65.17 | 45.33 | 54.10 | 66.28 | 36.47 | 62.73 | 48.59 | 47.85 | 52.18 |

Performance comparison with existing methods

We evaluate the proposed FL-WGAN by comparing it with five other methods within the same federated framework on the U.K.-DALE and REDD datasets. The first method is the federated GAN8 (denoted as "FL-GAN"). The second method is the federated CycleGAN35 (denoted as "FL-CycleGAN"). The third method integrates federated learning with SAMNet25 (denoted as "FL-SAMNet"), while the fourth method combines federated learning with LSTM12 (denoted as "FL-CL"). The fifth method retains the generator unchanged and replaces the original MLP discriminator with a CNN augmented by a self-attention mechanism (denoted as "Fed-disGAN"). Additionally, we employ the same evaluation metrics described above to assess accuracy.

Table 3 demonstrates that our proposed method, when applied with federated learning, enhances accuracy across the five loads. We observe that FL-WGAN's accuracy is lower than that of the conventional WGAN without federated learning. This degradation arises because the global model, which aggregates individual load models from various clients, struggles to capture all load characteristics simultaneously. Consequently, the update directions diverge, ultimately affecting overall performance. Additionally, Fed-disGAN achieves superior performance over FL-GAN, which indicates that our attention module already captures the most critical load features. However, increasing the number of self-attention layers leads to diminishing performance gains, suggesting that the current attention design has sufficiently captured the dominant load-specific patterns, while deeper architectures may introduce redundancy under federated constraints.

Table 3.

Mean Absolute Error (MAE), Signal Aggregate Error (SAE), and $F_1$ Scores on U.K.-DALE and REDD Datasets with Federated Learning.

The six appliance columns on the left are from UK-DALE; the six on the right are from REDD.

| Metric | Method | Kettle | W.M. | D.W. | Microwave | Fridge | Avg. | Light | W.M. | D.W. | Microwave | Fridge | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MAE (Watts) | FL-WGAN | 10.15 | 43.72 | 20.85 | 11.23 | 25.96 | 22.38 | 19.04 | 19.87 | 24.15 | 17.02 | 40.15 | 24.05 |
| MAE (Watts) | FL-GAN | 12.31 | 47.63 | 21.47 | 14.75 | 26.83 | 24.60 | 21.45 | 24.12 | 24.38 | 22.34 | 42.17 | 26.89 |
| MAE (Watts) | Fed-disGAN | 11.86 | 48.02 | 21.14 | 13.03 | 25.12 | 23.83 | 20.94 | 23.52 | 23.79 | 21.71 | 41.80 | 26.35 |
| MAE (Watts) | FL-CycleGAN | 13.85 | 56.24 | 23.14 | 17.15 | 27.85 | 27.65 | 27.15 | 24.32 | 25.56 | 28.04 | 43.12 | 29.64 |
| MAE (Watts) | FL-SAMNet | 15.92 | 61.05 | 24.86 | 19.03 | 33.45 | 30.86 | 26.38 | 33.15 | 31.25 | 31.83 | 37.96 | 32.11 |
| SAE (%) | FL-WGAN | 5.12 | 36.45 | 17.84 | 9.67 | 14.25 | 16.67 | 14.35 | 15.02 | 20.14 | 12.85 | 28.04 | 18.08 |
| SAE (%) | FL-GAN | 7.25 | 41.37 | 19.62 | 11.34 | 16.73 | 19.26 | 16.95 | 21.24 | 22.15 | 14.87 | 30.12 | 21.07 |
| SAE (%) | Fed-disGAN | 6.92 | 40.58 | 19.07 | 10.83 | 15.28 | 18.54 | 16.55 | 20.87 | 21.53 | 14.52 | 29.60 | 20.61 |
| SAE (%) | FL-CycleGAN | 7.85 | 49.32 | 18.96 | 14.85 | 17.45 | 21.69 | 18.25 | 24.34 | 15.03 | 21.15 | 32.85 | 22.32 |
| SAE (%) | FL-SAMNet | 8.45 | 51.83 | 27.15 | 16.06 | 20.64 | 24.83 | 18.84 | 24.85 | 17.92 | 28.36 | 33.15 | 24.62 |
| $F_1$ (%) | FL-WGAN | 93.12 | 91.34 | 91.45 | 94.27 | 90.15 | 92.07 | 91.24 | 90.45 | 91.37 | 92.15 | 90.23 | 91.09 |
| $F_1$ (%) | FL-GAN | 92.15 | 85.83 | 82.34 | 94.12 | 88.45 | 88.58 | 88.34 | 86.27 | 85.14 | 90.62 | 87.35 | 87.54 |
| $F_1$ (%) | Fed-disGAN | 92.45 | 86.58 | 84.12 | 93.92 | 89.05 | 89.22 | 89.65 | 87.56 | 85.53 | 90.85 | 87.88 | 88.29 |
| $F_1$ (%) | FL-CycleGAN | 87.14 | 86.85 | 83.62 | 94.45 | 88.97 | 88.21 | 87.85 | 81.32 | 66.23 | 91.74 | 85.16 | 82.46 |
| $F_1$ (%) | FL-SAMNet | 85.03 | 78.64 | 78.25 | 93.85 | 87.17 | 84.59 | 83.15 | 80.12 | 79.84 | 85.73 | 85.62 | 82.89 |

Nonetheless, our method outperforms competing approaches for the five loads, achieving an SAE improvement of 2.59% on the UK-DALE dataset and 2.99% on the REDD dataset. In our simulation, we notice that multi-state loads such as washing machines and dishwashers gain the most significant improvements, with washing machines showing up to a 6.22% SAE enhancement. Additionally, on/off loads experience slight gains, with fridges achieving a notable 2.48% improvement in SAE. These results affirm the effectiveness of our approach in power estimation, as it effectively balances the power distribution across multiple states.

To provide further insight, Figs. 6 and 7 plot the estimated and ground truth power profiles for fridges and washing machines using both our method and the competing methods. Based on the power levels and durations, we classify fridges into two states and washing machines into four states. Fig. 6 indicates that all competing methods accurately capture the fridge's power profiles, likely because this load exhibits limited states and low variation in power levels. Notably, our method shows enhanced robustness in avoiding overfitting to outliers, primarily due to its adversarial training.

Fig. 6. Power of fridge.

Fig. 7. Power of washing machine.

Furthermore, as shown in Fig. 7, our approach distinctly improves accuracy. While all methods effectively handle the minor states (states 2 and 3), our method excels in the major state (state 1), which is characterized by higher power levels and longer operation times. Our method exhibits refined, high-precision performance, largely attributable to the Wasserstein distance metric. Unlike conventional divergence measures, the Wasserstein distance provides a clearer evaluation of the data distribution by delivering smooth and informative gradients. This allows the generator to more accurately balance the power distribution across different states, ultimately leading to superior performance. Moreover, integrating an attention mechanism enables our method to capture more detailed aspects of the power profile—both at the edges and the peaks—thereby considerably enhancing its representational ability.
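The gradient argument above can be illustrated numerically: for two distributions with disjoint support, the Jensen-Shannon divergence is constant no matter how far apart they sit, whereas the Wasserstein distance still reflects the gap. A small SciPy sketch (illustrative only, not the paper's training code):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

# Two point masses on a shared support: p fixed at position 0, q sliding
# away. Once the supports are disjoint, the JS divergence saturates and
# gives no signal about how far apart they are; the Wasserstein distance
# keeps growing, so a generator still receives a useful gradient direction.
support = np.arange(10)
p = np.zeros(10)
p[0] = 1.0
for shift in (2, 5, 9):
    q = np.zeros(10)
    q[shift] = 1.0
    w = wasserstein_distance(support, support, p, q)  # equals the shift
    js = jensenshannon(p, q, base=2)                  # stuck at 1.0
    print(shift, w, round(js, 3))
```

This is exactly the regime that matters for rare load states: their generated and real distributions barely overlap early in training, so a divergence-based critic goes flat while a Wasserstein critic does not.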

Performance comparison with existing methods across heterogeneous datasets

To emulate realistic federated scenarios, we combine data from five UK-DALE homes and six REDD homes into a single pool of 11 houses. For each client's local model, we randomly select one month of data from this pool for training, then evaluate on the one-week period immediately following that month. The results are shown in Table 4: FL-WGAN outperforms existing methods on these heterogeneous datasets. Multi-state appliances such as washing machines benefit the most, with up to a 4.27% reduction in SAE, while on/off loads, particularly refrigerators, see a 1.33% SAE improvement. Overall, the performance under heterogeneous data closely matches that observed on the original, homogeneous datasets.
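The per-client split described above might be implemented along these lines. This is a hypothetical sketch over a synthetic minute-resolution series; the paper's actual pipeline, column names, and sampling rate are not published here:

```python
import random
import pandas as pd

def split_client_data(house_df, train_days=30, test_days=7, seed=0):
    """Pick a random one-month training window and the week right after it.

    house_df is assumed to be a DataFrame indexed by timestamp at a fixed
    1-minute sampling rate (an assumption for this sketch).
    """
    rng = random.Random(seed)
    samples_per_day = 24 * 60
    n_train = train_days * samples_per_day
    n_test = test_days * samples_per_day
    latest_start = len(house_df) - (n_train + n_test)
    start = rng.randrange(latest_start)
    train = house_df.iloc[start:start + n_train]
    test = house_df.iloc[start + n_train:start + n_train + n_test]
    return train, test

# Toy usage: 60 days of flat synthetic aggregate readings.
idx = pd.date_range("2013-01-01", periods=60 * 24 * 60, freq="min")
df = pd.DataFrame({"aggregate": 100.0}, index=idx)
train, test = split_client_data(df)
print(len(train), len(test))
```

Evaluating on the week immediately following each client's training month keeps the test window temporally out-of-sample without leaking future data into training.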

Table 4.

Mean Absolute Error (MAE), Signal Aggregate Error (SAE), and Inline graphic Scores on the Combined Dataset of UK-DALE and REDD under Federated Learning.

Metric              Method       Kettle   W.M.    D.W.    Microwave  Fridge  Light   Avg.
MAE (Watts)         FL-WGAN      14.60    31.79   22.50   14.13      33.06   29.60   24.28
                    FL-GAN       16.88    35.88   23.93   18.54      34.50   34.00   27.29
                    Fed-disGAN   15.95    36.95   23.11   18.80      34.00   39.00   27.97
                    FL-CycleGAN  20.50    43.78   26.35   22.80      35.50   35.58   30.75
                    FL-SAMNet    23.50    46.50   26.05   27.05      35.30   35.05   32.24
SAE (%)             FL-WGAN      15.72    26.25   17.99   18.58      21.15   23.20   20.48
                    FL-GAN       19.79    31.00   20.88   23.73      23.42   26.50   24.22
                    Fed-disGAN   19.08    30.52   18.71   22.61      22.48   25.70   23.18
                    FL-CycleGAN  23.10    34.87   21.45   29.95      25.60   26.55   26.92
                    FL-SAMNet    25.98    36.24   22.92   32.65      27.64   29.99   29.24
Inline graphic (%)  FL-WGAN      92.18    90.90   91.91   94.05      90.19   90.84   91.68
                    FL-GAN       90.25    86.05   87.48   92.37      88.40   87.30   88.64
                    Fed-disGAN   91.70    88.71   87.35   92.36      88.04   88.35   89.42
                    FL-CycleGAN  87.80    86.29   87.68   93.60      88.96   89.00   88.89
                    FL-SAMNet    89.15    85.24   88.46   93.31      88.79   89.40   89.06

We attribute these findings to two main factors. First, although the loads come from two different datasets (UK-DALE and REDD), the fundamental power signatures of the targeted appliances (cycle length, peak consumption, and duration) remain largely consistent across countries and households. As a result, FL-WGAN still learns highly transferable "load features" even in a heterogeneous setting. Second, the combination of the Wasserstein distance and a self-attention mechanism makes FL-WGAN inherently robust to mild distribution shifts: the Wasserstein metric provides stable, continuous gradients across operational states, while attention dynamically captures each appliance's most distinctive temporal dependencies, mitigating accuracy degradation in heterogeneous environments.
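The attention component referred to above can be illustrated with a minimal scaled dot-product self-attention over a power window. This is a NumPy sketch with random placeholder projection weights; in the actual generator these projections are learned jointly with the WGAN objective:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a sequence of feature vectors.

    x: (T, d) window of per-timestep features; wq/wk/wv: (d, d) projections.
    Each output timestep is a weighted mix of all timesteps, which lets the
    model relate distant parts of a load cycle (e.g. edges and peaks).
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])         # (T, T) pairwise affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ v                             # (T, d), same shape as x

rng = np.random.default_rng(0)
T, d = 8, 4                       # 8 timesteps, 4 features per step
x = rng.normal(size=(T, d))       # stand-in for a power-window embedding
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (8, 4): same shape as the input window
```

Because the attention weights span the whole window, long-duration major states and brief minor states can both influence each timestep's representation, which is the property the heterogeneous-data results benefit from.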

Conclusion

In this paper, we have introduced FL-WGAN, a federated learning framework augmented with a WGAN for accurate load power estimation while preserving data privacy. Unlike existing methods that suffer degraded performance on multi-state loads with unbalanced operational distributions, FL-WGAN leverages the Wasserstein distance to deliver stable, continuous gradient feedback, thereby mitigating biases caused by unbalanced states. To further capture temporal dependencies across load states, we integrate an attention mechanism into the generator, enhancing its representational power. Extensive experiments on the UK-DALE and REDD datasets demonstrate that our method consistently outperforms state-of-the-art baselines in estimation accuracy. Future work will explore adaptive aggregation strategies, such as dynamically adjusting aggregation weights based on each client's contribution, and extend FL-WGAN to other Internet of Things (IoT)-based monitoring and disaggregation applications. We will also address deployment challenges, including resource utilization and communication overhead.

Author contributions

All authors contributed to this study, including the methodology, experiments, and analysis. All authors read and approved the final manuscript.

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Informed consent

All data used in this study were sourced from publicly available databases33,34, and all information is generated by software, involving no real personal data. Therefore, there are no ethical or privacy concerns associated with this research.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Liu, Z., Deng, Z., Davis, S. & Ciais, P. Monitoring global carbon emissions in 2022. Nat. Rev. Earth Environ. 4, 205–206 (2023).
  • 2. Tanoni, G., Principi, E. & Squartini, S. Non-intrusive load monitoring in industrial settings: A systematic review. Renew. Sustain. Energy Rev. 202 (2024).
  • 3. Çimen, H., Çetinkaya, N., Vasquez, J. C. & Guerrero, J. M. A microgrid energy management system based on non-intrusive load monitoring via multitask learning. IEEE Trans. Smart Grid 12, 977–987 (2020).
  • 4. Liu, B., Luan, W., Yang, J. & Yu, Y. The balanced window-based load event optimal matching for NILM. IEEE Trans. Smart Grid 13, 4690–4703 (2022).
  • 5. Agarwal, V., Ardakanian, O. & Pal, S. A robust and privacy-aware federated learning framework for non-intrusive load monitoring. IEEE Trans. Sustain. Comput. 10.1109/TSUSC.2024.3370837 (2024).
  • 6. Zaeem, R. N. & Barber, K. S. The effect of the GDPR on privacy policies: Recent progress and future promise. ACM Trans. Manag. Inf. Syst. (TMIS) 12, 1–20 (2020).
  • 7. Kelly, J. & Knottenbelt, W. Neural NILM: Deep neural networks applied to energy disaggregation. In 2015 ACM International Conference on Embedded Systems for Energy-Efficient Built Environments, 55–64 (2015).
  • 8. Bao, K., Ibrahimov, K., Wagner, M. & Schmeck, H. Enhancing neural non-intrusive load monitoring with generative adversarial networks. Energy Informatics 1, 295–302 (2018).
  • 9. De Baets, L. et al. Appliance classification using VI trajectories and convolutional neural networks. Energy Build. 158, 32–36 (2018).
  • 10. Bonfigli, R. et al. Non-intrusive load monitoring by using active and reactive power in additive factorial hidden Markov models. Appl. Energy 208, 1590–1607 (2017).
  • 11. Wu, Z. et al. Non-intrusive load monitoring using factorial hidden Markov model based on adaptive density peak clustering. Energy Build. 244, 111025 (2021).
  • 12. Hwang, H. & Kang, S. Nonintrusive load monitoring using an LSTM with feedback structure. IEEE Trans. Instrum. Meas. 71, 1–11 (2022).
  • 13. Luan, W., Zhang, R., Liu, B., Zhao, B. & Yu, Y. Leveraging sequence-to-sequence learning for online non-intrusive load monitoring in edge device. Int. J. Electr. Power Energy Syst. 148, 108910 (2023).
  • 14. Zhang, C., Zhong, M., Wang, Z., Goddard, N. & Sutton, C. Sequence-to-point learning with neural networks for non-intrusive load monitoring. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018).
  • 15. Schirmer, P. A. & Mporas, I. Double Fourier integral analysis based convolutional neural network regression for high-frequency energy disaggregation. IEEE Trans. Emerg. Top. Comput. Intell. 6, 439–449 (2021).
  • 16. Chen, J., Wang, X., Zhang, X. & Zhang, W. Temporal and spectral feature learning with two-stream convolutional neural networks for appliance recognition in NILM. IEEE Trans. Smart Grid 13, 762–772 (2021).
  • 17. Shan, Z. et al. Multiscale self-attention architecture in temporal neural network for nonintrusive load monitoring. IEEE Trans. Instrum. Meas. 72, 1–12 (2023).
  • 18. Garcia-Perez, D. et al. Fully-convolutional denoising auto-encoders for NILM in large non-residential buildings. IEEE Trans. Smart Grid 12, 2722–2731 (2020).
  • 19. Wang, L., Mao, S. & Nelms, R. M. Transformer for nonintrusive load monitoring: Complexity reduction and transferability. IEEE Internet Things J. 9, 18987–18997 (2022).
  • 20. Agarwal, V., Ardakanian, O. & Pal, S. A robust and privacy-aware federated learning framework for non-intrusive load monitoring. IEEE Trans. Sustain. Comput. 9, 766–777 (2024).
  • 21. Agarwal, V., Ardakanian, O. & Pal, S. Robust peer-to-peer federated learning for non-intrusive load monitoring in smart homes. Energy Build. 329, 115209 (2025).
  • 22. Lin, J., Ma, J., Zhu, J. & Liang, H. Deep domain adaptation for non-intrusive load monitoring based on a knowledge transfer learning network. IEEE Trans. Smart Grid 13, 280–292 (2021).
  • 23. Han, Y., Li, K., Wang, C., Si, F. & Zhao, Q. Unknown appliances detection for non-intrusive load monitoring based on conditional generative adversarial networks. IEEE Trans. Smart Grid 14, 4553–4564 (2023).
  • 24. Faustine, A., Pereira, L., Bousbiat, H. & Kulkarni, S. UNet-NILM: A deep neural network for multi-task appliances state detection and power estimation in NILM. In Proceedings of the 5th International Workshop on Non-Intrusive Load Monitoring, 84–88 (2020).
  • 25. Liu, Y., Qiu, J. & Ma, J. SAMNet: Toward latency-free non-intrusive load monitoring via multi-task deep learning. IEEE Trans. Smart Grid 13, 2412–2424 (2022).
  • 26. Kaspour, S. & Yassine, A. A federated learning model with short sequence to point mechanism for smart home energy disaggregation. In 2022 IEEE Symposium on Computers and Communications (ISCC), 1–6 (IEEE, 2022).
  • 27. Wang, H. et al. Fed-NILM: A federated learning-based non-intrusive load monitoring method for privacy-protection. Energy Conversion and Economics 3, 51–60 (2022).
  • 28. Wang, T. & Dong, Z. Blockchain-based clustered federated learning for non-intrusive load monitoring. IEEE Trans. Smart Grid 15, 2348–2361 (2023).
  • 29. Dai, S., Meng, F., Wang, Q. & Chen, X. DP2-NILM: A distributed and privacy-preserving framework for non-intrusive load monitoring. Renew. Sustain. Energy Rev. 191, 114091 (2024).
  • 30. Li, Q., Ye, J., Song, W. & Tse, Z. Energy disaggregation with federated and transfer learning. In 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), 698–703 (IEEE, 2021).
  • 31. Gao, Z.-W., Xiang, Y., Lu, S. & Liu, Y. An optimized updating adaptive federated learning for pumping units collaborative diagnosis with label heterogeneity and communication redundancy. Eng. Appl. Artif. Intell. 152, 110724 (2025).
  • 32. Lu, S., Gao, Z.-W. & Liu, Y. HFTL-KD: A new heterogeneous federated transfer learning approach for degradation trajectory prediction in large-scale decentralized systems. Control Eng. Pract. 153, 106098 (2024).
  • 33. Kelly, J. & Knottenbelt, W. The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes. Sci. Data 2, 1–14 (2015).
  • 34. Kolter, J. Z. & Johnson, M. J. REDD: A public data set for energy disaggregation research. In Workshop on Data Mining Applications in Sustainability (SIGKDD), San Diego, CA, 59–62 (2011).
  • 35. Walgama, R. & Mahima, K. Y. FL-CycleGAN: Enhancing mobile photography with federated learning-enabled CycleGAN. In 2024 Moratuwa Engineering Research Conference (MERCon), 688–693 (IEEE, 2024).
