Scientific Reports. 2025 Jul 31;15:28026. doi: 10.1038/s41598-025-11864-4

Dual prompt personalized federated learning in foundation models

Ying Chang 1, Xiaohu Shi 1,2, Xiaohui Zhao 1, Zhaohuang Chen 2, Deyin Ma 3
PMCID: PMC12313890  PMID: 40745444

Abstract

Personalized federated learning (PFL) has garnered significant attention for its ability to address heterogeneous client data distributions while preserving data privacy. However, when local client data is limited, deep learning models often suffer from insufficient training, leading to suboptimal performance. Foundation models, such as CLIP (Contrastive Language-Image Pretraining), exhibit strong feature extraction capabilities and can alleviate this issue by fine-tuning on limited local data. Despite their potential, foundation models are rarely utilized in federated learning scenarios, and challenges related to integrating new clients remain largely unresolved. To address these challenges, we propose the Dual Prompt Personalized Federated Learning (DP2FL) framework, which introduces dual prompts and an adaptive aggregation strategy. DP2FL combines global task awareness with local data-driven insights, enabling local models to achieve effective generalization while remaining adaptable to specific data distributions. Moreover, DP2FL introduces a global model that enables prediction on new data sources and seamlessly integrates newly added clients without requiring retraining. Experimental results in highly heterogeneous environments validate the effectiveness of DP2FL’s prompt design and aggregation strategy, underscoring the advantages of prediction on novel data sources and demonstrating the seamless integration of new clients into the federated learning framework.

Keywords: Personalized federated learning, Foundation models, Client heterogeneity, Adaptive aggregation strategy

Subject terms: Engineering, Mathematics and computing

Introduction

Recent advancements in deep learning1 have brought remarkable breakthroughs across diverse domains2, such as disease diagnosis3–5, facial recognition6–8, video recommendation systems9,10, and emotion recognition11,12. Typically, these methods aggregate all data onto a central server for model training13, with model accuracy often strongly correlated with the volume and quality of the data. However, in sensitive fields, centralizing data introduces significant privacy and security challenges14.

To mitigate the issue of data silos, Federated Learning (FL)15 has emerged as a promising solution, enabling collaborative model training without direct data sharing. Unlike traditional methods, FL allows a global model to be trained by aggregating parameters from locally trained models on client devices. This approach fundamentally changes the data-handling paradigm, allowing data to remain on clients’ devices and only model parameters to be shared with the central server for aggregation.

Classic federated learning models involve a central server and local clients: in each training round, the server distributes the global model to the clients, who train it on their local data and send the updated parameters back to the server for aggregation into a new global model. While approaches like FedAvg15 perform well with similar client data distributions, real-world data is often heterogeneous16, leading to suboptimal global models. Addressing such data heterogeneity has spurred a new line of research known as Personalized Federated Learning (PFL).
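The aggregation step of such a round can be illustrated with a minimal sketch. It assumes each client's model is flattened into a plain list of floats and that the server weights each client by its local sample count, as FedAvg does; the function name `fedavg` and the toy values are illustrative only.

```python
from typing import List


def fedavg(client_params: List[List[float]], client_sizes: List[int]) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[d] * n for p, n in zip(client_params, client_sizes)) / total
        for d in range(dim)
    ]


# Two clients holding 100 and 300 local samples, respectively.
new_global = fedavg([[1.0, 2.0], [3.0, 6.0]], [100, 300])
```

Because the second client holds three times as much data, the aggregated parameters lie closer to its update, which is exactly where heterogeneous client distributions can pull the global model away from smaller clients.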

PFL aims to develop personalized models that closely reflect individual clients’ data distributions, with strict adherence to data privacy and security requirements17. PFL can generally be divided into two main categories based on the personalization strategy: Global Model Personalization and Learning Personalized Models18.

In Global Model Personalization, the focus is on adapting the global federated learning model to individual clients through local adaptation. This approach relies on the generalization capability of the global model, as it directly influences the accuracy of each client’s personalized model during local adaptation. To achieve this goal, Duan et al.19 proposed Astraea, a framework that addresses label imbalance through Z-score-based data augmentation and downsampling. Additionally, it manages data heterogeneity via a Mediator that reschedules training for clients with skewed data. In contrast to this data-centric approach, FedSteg20 adopts a model-based strategy, wherein transfer learning is utilized to fine-tune the global model for each client after the initial training phase.

In contrast, the Learning Personalized Models approach modifies the aggregation process to directly address clients’ heterogeneous data. A prominent strategy in this category is parameter decoupling. For instance, Arivazhagan et al.21 divide client models into a base layer, trained globally, and a personalized layer, trained locally. This configuration allows the global layer to capture generalizable features, while the personalized layer reflects each client’s unique data distribution. Hanzely et al.22 extend this approach by introducing a penalty term to balance model generalization and personalization. Clustering-based approaches have also shown promise; for example, IFCA23 assigns clients to clusters of global models that best suit their data, achieving tailored federated learning.
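Parameter decoupling can be sketched in a few lines. The example below is a simplified illustration rather than the cited authors' implementation: it assumes each client stores a shared `"base"` vector and a private `"personal"` vector (both names are hypothetical), and that the server takes an unweighted mean of the base vectors only.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def aggregate_base_only(clients: List[Params]) -> List[Params]:
    """Average the shared 'base' vector; leave each 'personal' vector untouched."""
    dim = len(clients[0]["base"])
    shared = [sum(c["base"][d] for c in clients) / len(clients) for d in range(dim)]
    return [{"base": list(shared), "personal": list(c["personal"])} for c in clients]


clients = [
    {"base": [0.0, 2.0], "personal": [1.0]},
    {"base": [4.0, 6.0], "personal": [-1.0]},
]
updated = aggregate_base_only(clients)
```

After aggregation every client holds the same base parameters, while the personalized parameters never leave the client.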

Despite these advancements, current deep learning frameworks still require large parameter counts, while clients often have limited data, sometimes missing entire classes. Such constraints hinder the adequacy of model training when parameters are aggregated in an FL framework. Large pre-trained Foundation Models, trained on extensive datasets24, offer robust feature extraction capabilities beneficial for various tasks. Fine-tuning these models on small local datasets can yield high-performing models, effectively addressing the problem of insufficient local training data.

Recently, studies like PROMPTFL25 have begun integrating foundation models into FL, replacing conventional model training with federated prompt training to reduce parameter requirements. This method outperforms both training from scratch and direct fine-tuning but lacks mechanisms for handling client heterogeneity. Moreover, pFedPrompt26 leverages the multimodal capabilities of CLIP27 and employs attention mechanisms to effectively capture local client-specific information, thereby enhancing performance in heterogeneous client environments.

Nevertheless, applications of foundation models in FL are still limited, and existing methods do not address the challenge of integrating new clients dynamically.

To bridge these gaps, we propose Dual Prompt Personalized Federated Learning in Foundation Models (DP2FL), a framework that combines task-awareness with local data-driven insights, effectively leveraging client-specific information captured through prompts to achieve personalized federated learning based on foundation models. DP2FL incorporates two distinct prompts: one that captures federated task information and another that reflects local data distribution. Based on these prompt characteristics, DP2FL employs an aggregation strategy that allows clients to benefit from auxiliary training from other clients while maintaining adaptability to their own data. Furthermore, DP2FL introduces a global model that can make predictions on data from new sources without requiring their participation in the federated learning process. This model also enables the seamless integration of new clients, facilitating efficient onboarding without retraining from scratch.

The core innovations of this work are as follows:

  1. Dual prompt design. Within the personalized federated learning framework constructed in this work, we propose a novel dual prompt design: a task prompt that captures task-level information and a data prompt that models client-specific data distributions, together with corresponding aggregation strategies.

  2. Global model adaptation. We design a global model that extends prediction capabilities to new data sources that have not participated in federated training. It also ensures seamless integration of newly added clients without requiring retraining, maintaining both flexibility and efficiency.

Related work

Foundation models, built upon deep neural networks and self-supervised learning28, have gained significant attention in recent years due to their robust generalization capabilities. By training on vast, unannotated datasets29, these models acquire rich semantic knowledge, which enhances their applicability across a wide array of downstream tasks and accelerates the adoption of AI in diverse industries24. Among these models, OpenAI’s CLIP, a widely recognized Vision-Language Model (VLM), is distinguished by its effectiveness across diverse tasks. This paper leverages CLIP as the foundation model in our proposed personalized federated learning framework, with a brief introduction to CLIP provided below for context.

As a representative of VLMs, CLIP is pretrained on millions of image-caption pairs, which equips it to simultaneously process textual and visual inputs and learn the semantic relationships between them. Built upon a Transformer30 architecture, CLIP’s extensive parameters empower it to capture the rich multimodal semantic features essential for a range of applications. However, when applied to domain-specific tasks, CLIP and similar models often encounter limitations due to restricted local training data, resulting in underutilized feature extraction capabilities. To address this challenge, prompt-based learning has emerged as an effective approach, which fine-tunes CLIP’s pretrained knowledge to enable more efficient adaptation to specific tasks.

Originally developed within natural language processing (NLP), prompt-based learning guides models to generate task-aligned outputs31. This strategy has since been applied to computer vision and other domains. For foundational models such as CLIP, BERT32, and GPT33, a prevalent method involves freezing pretrained parameters while fine-tuning task-specific prompts. This approach enhances task adaptability by capitalizing on the model’s existing knowledge while focusing computational resources on refining prompt parameters, thus improving model performance on downstream tasks.
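The idea of freezing pretrained parameters while tuning only a prompt can be illustrated with a deliberately small toy problem. In the sketch below, the "foundation model" is a single frozen scalar weight and the prompt is a learnable offset added to the input; this scalar setup, the function name `train_prompt`, and the sample values are all assumptions made for illustration and do not reflect how CLIP, BERT, or GPT are actually prompted.

```python
from typing import List, Tuple


def train_prompt(
    frozen_w: float,
    prompt: float,
    samples: List[Tuple[float, float]],
    lr: float = 0.1,
    steps: int = 50,
) -> float:
    """Run SGD on the prompt only; the model weight frozen_w is never updated."""
    for _ in range(steps):
        for x, y in samples:
            pred = frozen_w * (x + prompt)          # frozen model applied to prompted input
            grad = 2.0 * (pred - y) * frozen_w      # d/d(prompt) of the squared error
            prompt -= lr * grad
    return prompt


frozen_w = 2.0
samples = [(1.0, 4.0), (2.0, 6.0)]  # targets are consistent with prompt = 1.0
learned_prompt = train_prompt(frozen_w, 0.0, samples)
```

Only one parameter is trained here, which mirrors why prompt tuning suits clients with very little local data: the number of values to fit is small relative to the frozen model.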

As shown in Fig. 1, research on CLIP-based prompting can be categorized into three primary areas: Language Prompting, Visual Prompting, and Multi-modal Prompting34. Language Prompting focuses on the development of learnable textual contexts within CLIP’s text branch to adapt the model for specific downstream tasks. The first work to introduce prompt learning into CLIP was CoOp35, which replaced manually crafted prompts with trainable prompt vectors. This shift enabled more efficient adaptation through few-shot learning, significantly reducing training costs. To further enhance generalization, CoCoOp36 introduced dynamic adjustments of the trainable prompt vectors in the text branch, using outputs from the image encoder to improve performance across diverse contexts. Recognizing the limitations of a single prompt in capturing both the intrinsic attributes and the extrinsic context of an image, PLOT37 proposed learning multiple prompts collaboratively, leveraging Optimal Transport (OT) to align the visual and textual modalities.

Fig. 1. Prompt-based CLIP model.

Alternatively, Visual Prompting, as illustrated in Fig. 1, focuses on modifying the image branch through visual perturbations to improve model training. Bahng et al.38 demonstrated the effectiveness of visual prompts for CLIP by exploring three types of prompt applications: random patch insertion, fixed-position patch insertion, and padding. Similarly, ILM-VP39 explored the influence of label mapping on visual prompting and introduced an automated method for mapping source labels to target labels, which enhanced the accuracy of visual prompts.

While both Language and Visual Prompting modify a single branch of the CLIP model, they do not fully exploit the model’s multimodal nature. By contrast, Multi-modal Prompting integrates both Language Prompting and Visual Prompting, allowing the model to simultaneously transform both modalities and thus fully leverage CLIP’s inherent multimodal nature. For instance, MaPLe40 proposed distinct prompts for the text and image branches, which are then coordinated through a coupled adjustment mechanism. This method ensures a high degree of alignment between textual and visual representations, leading to substantial improvements in the model’s generalization ability and its adaptability across different domains. The client-side framework utilized in this study builds on the MaPLe architecture.

Method

Problem formulation

We consider a federated learning scenario involving a set of $K$ clients, denoted as $Client = \{client_1, client_2, \dots, client_K\}$. Each client $client_k$ possesses a private local dataset $D_k$, which is retained locally and is not accessible to other participants. Unlike traditional approaches that collaboratively train a shared model, our objective is to enable each client to adapt a frozen foundation model (CLIP) using a small number of learnable parameters in the form of prompts. This design reduces communication and computational overhead and supports personalized adaptation to heterogeneous data distributions.

The proposed Dual Prompt Personalized Federated Learning (DP2FL) framework decomposes the learnable prompt space into two distinct components: a global task prompt $P_T$ and a personalized data prompt $P_D^k$ for each client $client_k$. The global task prompt $P_T$ encodes the common semantic knowledge relevant to the federated task and is shared among all clients, while the local data prompt $P_D^k$ captures the unique characteristics of $client_k$’s data distribution.

During local training, each client optimizes its prompts by minimizing the following empirical loss:

$$\mathcal{L}_k(P_T, P_D^k) = \frac{1}{|D_k|} \sum_{(x, y) \in D_k} \ell\big(f(x; P_T, P_D^k),\, y\big) \tag{1}$$

where $D_k$ is the local dataset of client $k$, $\ell(\cdot, \cdot)$ denotes the task-specific loss function, and $f(x; P_T, P_D^k)$ represents the output of the frozen foundation model conditioned on both the task and data prompts.

The overall objective of the DP2FL framework is to collaboratively learn a globally shared prompt $P_T$ and a set of personalized prompts $\{P_D^k\}_{k=1}^{K}$ that minimize the aggregated empirical risk across all participating clients. Formally, the optimization problem is defined as:

$$\min_{P_T,\ \{P_D^k\}_{k=1}^{K}} \ \sum_{k=1}^{K} \mathcal{L}_k(P_T, P_D^k) \tag{2}$$

To further support inference on new data sources and facilitate the seamless integration of newly joined clients, DP2FL constructs a global model by aggregating the personalized data prompts PD using the same strategy employed for the global task prompt PT. This enables rapid and effective model initialization without the need for full retraining, ensuring scalability and adaptability in dynamic federated environments.

Framework of DP2FL

This study diverges from traditional personalized federated learning approaches by focusing on adapting federated tasks to foundation models. Given the extensive parameter sizes of foundation models and the typically limited data on federated clients, previous research25 demonstrates that training from scratch or parameter fine-tuning often fails to maximize these models’ feature extraction capabilities. To address this issue, we propose a prompt-based approach, incorporating a prompt aggregation strategy that optimizes the adaptation of foundation models in federated learning, while minimizing the training parameters required.

In federated learning, the heterogeneity of local data distributions presents a significant challenge in designing universally effective models. To address this, we introduce a dual-prompt strategy that integrates global task alignment with client-specific data characteristics, ensuring that each client’s model is effectively tailored to its local data while benefiting from collaborative learning. This method relies on prompt learning, where only the task and data prompts are updated during training, while the underlying foundation model (e.g., CLIP) remains frozen. Additionally, we propose a global model that facilitates the efficient initialization of newly added clients during training. Traditional initialization methods often incur substantial computational overhead; in contrast, the global model leverages a generalized prompt to streamline client onboarding, allowing new clients to rapidly integrate into the system without the need for extensive retraining. The detailed model framework is illustrated in Fig. 2.

Fig. 2. The framework of DP2FL. The DP2FL workflow comprises three core components: (a) Initialization, which establishes the foundation for federated training; (b) Training Process, which outlines the iterative update and aggregation procedures across clients; (c) New Client Integration, which demonstrates the dynamic onboarding mechanism for new clients. Additionally, (d) Local Model illustrates the client-side framework built upon CLIP.

The DP2FL framework, as depicted in Fig. 2, consists of three critical stages: (a) Initialization, (b) Training Process, and (c) New Client Integration. In the Initialization phase, critical parameters are defined, establishing the foundation for subsequent model training. The Training Process involves iterative refinements of the model, where parameters are updated and aggregated to maintain a balance between generalization and personalization. Finally, the New Client Integration stage tackles the challenge of integrating new clients into the federated task, ensuring their initialization with appropriate parameters, which enables rapid and effective contribution to the learning process. Further details on the CLIP-based Local Model (Fig. 2d) and parameter aggregation strategy are discussed in “Prompt Design” and “Aggregation Protocol” Sections. In addition, the “Privacy Preservation” section addresses the privacy concerns within the framework, detailing mechanisms to protect sensitive data throughout the federated learning process.

Initialization

The Initialization phase begins with the federated task initiator defining essential parameters, such as the model architecture, parameter configuration, and the number of training rounds. Consistent with standard prompt-based learning models, only the prompt components are updated in this framework, while the core parameters of the foundation model remain fixed. The framework incorporates two types of prompts: the task prompt $P_T$, which captures global task information, and the data prompt $P_D$, which adapts to each client’s specific data distribution. The task prompt is shared among all clients to enable collaborative training through the aggregation of data contributions, while each client maintains a unique local data prompt, with aggregation weights determined by evaluating the relevance of other clients’ parameters to the client’s local data. This dual-prompt approach ensures the model is tailored to each client’s local data while benefiting from collaborative insights.

During this phase, each client uploads a small validation dataset, assumed to be the minimal representative subset of its local data distribution, which is considered shareable for federated learning purposes. The server uses this dataset to compute the initial model loss, which is crucial for guiding the aggregation of model parameters in later stages. As the training progresses, each client evaluates whether the results of other clients’ training have improved its model by assessing changes in validation loss. This process guides the parameter aggregation strategy. At the end of the Initialization phase, the server distributes the model parameters, validation data, and loss metrics to all clients, enabling the training process to commence.

Training process

The Training Process spans $R$ rounds of federated training. In each round $t$, every client $k$ starts with the global task prompt from the previous round, denoted as $P_T^{t-1}$, and its own local data prompt $P_D^{k,t-1}$, which serve as initialization parameters for local updates. These prompts are optimized using stochastic gradient descent (SGD) on the client’s local dataset to minimize a task-relevant loss function. The updated prompts obtained after local training are denoted as $\tilde{P}_T^{k,t}$ and $\tilde{P}_D^{k,t}$, where the tilde indicates that these are locally optimized prompt parameters at client $k$ in round $t$. Formally, the updates can be expressed as:

$$\big(\tilde{P}_T^{k,t},\ \tilde{P}_D^{k,t}\big) = \mathrm{SGD}\big(P_T^{t-1},\ P_D^{k,t-1};\ D_k\big) \tag{3}$$

where $D_k$ represents the local dataset of client $k$. This local update aligns with the personalized objective defined in Eq. 1, enabling each client to refine its prompts according to its unique data distribution. Subsequently, each client calculates the loss metrics on all validation datasets using its updated parameters and uploads these metrics. The server consolidates the global task prompt $P_T^{t}$ by evaluating the performance of each client on their validation data and adjusting aggregation weights accordingly. Simultaneously, each client locally adjusts its data prompt $P_D^{k,t}$ by aligning other clients’ training outputs with its specific data distribution. After completing these steps, each client computes the loss on its own validation dataset using the updated $P_T^{t}$ and $P_D^{k,t}$, uploading these losses to the server. These metrics provide essential feedback for the personalized aggregation in the next training round.
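The cross-client evaluation step described above, in which each client scores its updated parameters on every validation set, can be sketched as a loss matrix. This is a toy illustration under stated assumptions: each client "model" is reduced to a single constant prediction, the loss is mean squared error, and the names `cross_validation_losses` and `mse` are hypothetical.

```python
from typing import Callable, List, Sequence


def cross_validation_losses(
    client_models: Sequence[float],
    validation_sets: Sequence[Sequence[float]],
    loss_fn: Callable[[float, Sequence[float]], float],
) -> List[List[float]]:
    """losses[i][j] = loss of client i's updated model on client j's validation set."""
    return [[loss_fn(m, v) for v in validation_sets] for m in client_models]


def mse(model: float, targets: Sequence[float]) -> float:
    # Toy stand-in: the "model" predicts one constant for every validation sample.
    return sum((model - y) ** 2 for y in targets) / len(targets)


loss_matrix = cross_validation_losses([0.0, 1.0], [[0.0, 0.0], [1.0, 1.0]], mse)
```

Row sums of this matrix summarize how well one client's update transfers to all clients, while a single column shows how useful every client's update is to one particular client; the two aggregation protocols draw on these two views.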

Since local data prompts are client-specific, new clients joining the federated learning task must initialize either with the local data prompt parameters established during the initialization phase or with random values, which makes it difficult to align with the existing clients. To address this issue, we introduce a global model composed of both task and data prompts. The task prompt corresponds to the global task prompt described earlier, while the data prompt is generated from the local data prompts using an aggregation method similar to that of the global task prompt. This global model provides a generalized initialization mechanism for new clients, leveraging insights from previous training rounds to enhance adaptability and accelerate their integration into the federated learning framework.

Furthermore, the global model is well-suited for scenarios where new data sources are introduced solely for inference. In such cases, as the data source does not participate in the federated learning process, the global model, comprising both the global task prompt and the global data prompt, can efficiently and directly evaluate the new data.

New client integration

The New Client Integration phase begins when new clients join the federated learning task, either during or after the training process. The new client initially uploads its validation dataset to the server, which distributes it to existing clients to facilitate subsequent aggregation. The server also provides the latest global model for initialization, which includes both the global task prompt PT and the global data prompt PD. This global model integrates the training results from all prior rounds, enabling it to demonstrate high accuracy directly on the new client’s local dataset. Notably, following initialization, only minimal additional training is needed to adapt the new client’s model to its data. Other clients, in turn, integrate the new client’s contributions, enhancing the federated model as a whole.

Algorithm 1 presents the complete DP2FL framework process across its three stages.

Algorithm 1. DP2FL Framework.

Prompt design

In this study, the CLIP model is leveraged as the foundation for each client’s framework, with distinct prompts designed for both the vision and language branches to facilitate cross-modal integration. Specifically, to align visual and textual modalities, a transformation function derives the visual prompt from the textual prompt, as shown in Fig. 2d.

The CLIP model, employing a Vision Transformer (ViT)41 as its Image Encoder, comprises a sequence of Transformer blocks within both the Text and Image Encoders. In the text branch, for example, the embedded input text, combined with positional encoding, is provided as input to the first Transformer block of the text encoder, represented as follows in Eq. 4:

$$X_T = E_T(x_T) \oplus PE_T(pos) \tag{4}$$

where $x_T$ represents the input text, $pos$ denotes the position, $E_T(x_T)$ refers to the text embedding, and $PE_T(pos)$ is the positional encoding for the text branch, with $PE_T(pos) \in \mathbb{R}^{d_T}$ matching the dimension of $E_T(x_T)$. Here, $\oplus$ indicates element-wise addition, which is used to combine the semantic information from the text embedding and the positional information from the positional encoding. Similarly, the image branch provides the input to the initial Transformer block of the image encoder as shown in Eq. 5:

$$X_I = E_I(x_I) \oplus PE_I(pos) \tag{5}$$

where $x_I$ denotes the input image, and $E_I(x_I)$ and $PE_I(pos)$ refer to the image embedding and positional encoding for the image branch, respectively, with $PE_I(pos) \in \mathbb{R}^{d_I}$ matching the dimension of $E_I(x_I)$.

In this framework, we introduce a task prompt $P_T$ for the text branch, which is transformed through a dimensional mapping function $\mathcal{F}$ to produce the image prompt $P_I$ for the image branch, as shown in Eq. 6:

$$P_I = \mathcal{F}(P_T) \tag{6}$$

where $P_T \in \mathbb{R}^{d_T}$ and $P_I \in \mathbb{R}^{d_I}$. Here, $P_T$ encapsulates the overarching task information within the federated learning setting and is shared uniformly across all clients. Due to the non-identical data distributions typical in federated learning, each client’s transformation function $\mathcal{F}$ is adapted via a client-specific parameter set, termed the data prompt $P_D$, which facilitates personalized adaptation to the client’s local dataset.

Under this design, the Text Encoder input in the client model is adjusted from $X_T$ to $[P_T, X_T]$, while the Image Encoder input is modified from $X_I$ to $[P_I, X_I]$, preserving the remaining architecture of the CLIP model. During local training on a client’s dataset, the parameters of the CLIP model remain fixed, and only the task prompt $P_T$ and data prompt $P_D$ are updated.
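The relationship between the two prompts can be sketched with plain lists. The example assumes, purely for illustration, that the dimensional mapping function is a bias-free linear map whose weight matrix plays the role of the client's data prompt, and that prompt tokens are prepended to the encoder's token sequence; neither detail is confirmed by the text above, and the names `map_task_prompt` and `prepend_prompt` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def map_task_prompt(task_prompt: Matrix, mapping: Matrix) -> Matrix:
    """Project each text-space prompt token (length d_text) into image space (length d_image)."""
    d_image = len(mapping[0])
    return [
        [sum(token[r] * mapping[r][c] for r in range(len(token))) for c in range(d_image)]
        for token in task_prompt
    ]


def prepend_prompt(prompt: Matrix, tokens: Matrix) -> Matrix:
    """Place the prompt tokens in front of the encoder's input tokens."""
    return prompt + tokens


task_prompt = [[1.0, 2.0]]                          # one prompt token, d_text = 2
data_prompt = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]    # assumed d_text x d_image mapping weights
image_prompt = map_task_prompt(task_prompt, data_prompt)
image_input = prepend_prompt(image_prompt, [[0.0, 0.0, 0.0]])
```

Under this assumption, two clients sharing the same task prompt still obtain different image prompts whenever their mapping weights differ, which is how a shared prompt and a personalized one can coexist in one model.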

Aggregation protocol

In this study, aggregation strategies are delineated into Global and Local Aggregation based on the participants involved. As discussed in the previous section, the trainable model parameters include two core components: the task prompt, which encapsulates the overarching federated learning task information and is consistent across all clients, and the data prompt, which is tailored to each client, capturing the unique characteristics of local datasets. The task prompt is derived exclusively through a Global Aggregation protocol managed by the server, whereas each client independently computes its data prompt via a Local Aggregation protocol. To enable inference on new data sources and ensure proper initialization for newly added clients, the server computes a generalized data prompt through the Global Aggregation protocol, thereby generating the global model.

Global aggregation

To enhance the representation of shared task characteristics in federated learning, the server performs global aggregation on the task prompt. This process, defined in Eq. 7, assigns aggregation weights to client updates based on their performance across all validation datasets, enabling a refined capture of cross-client task information. Following an approach similar to FedFomo42, each client contributes a validation dataset aligned with its local data distribution during initialization, enabling weight assignments that decrease as a client’s cumulative validation loss increases.

In the $t$-th training round, with $K$ participating clients, the global task prompt $P_T^{t}$ is derived as follows:

$$P_T^{t} = (w^{t})^{\top} \tilde{P}_T^{t} \tag{7}$$

where $P_T^{t}$ denotes the aggregated global task prompt for round $t$, subsequently distributed to all clients in round $t+1$ as the initial task prompt for local updates, and $\tilde{P}_T^{t}$ is the matrix of task prompts trained independently by each client on its local dataset via stochastic gradient descent (SGD), with each row corresponding to a client’s task prompt vector. The column vector $w^{t}$ contains the aggregation weights for each client in round $t$, where the weight component $w_i^{t}$ represents the contribution of client $i$ to the global task prompt, calculated by:

graphic file with name d33e1081.gif 8

where $L_{i,j}^{t}$ represents the loss computed by client $i$ using its round-$t$ model on the validation set of client $j$. This weighting scheme reduces the aggregation influence of clients with higher validation losses across datasets, thereby refining $P_T^{t}$ to better capture the federated task’s overall characteristics.
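A loss-based weighting of this kind can be sketched as follows. The exact weight formula is not recoverable from the text here, so the example assumes one simple rule with the stated property: each client's weight is proportional to the inverse of its cumulative validation loss, normalized to sum to one. The function names and the `eps` guard are hypothetical.

```python
from typing import List, Sequence


def task_prompt_weights(losses: Sequence[Sequence[float]], eps: float = 1e-12) -> List[float]:
    """Assumed rule: weight of client i is proportional to 1 / sum_j losses[i][j]."""
    inverse = [1.0 / (sum(row) + eps) for row in losses]  # eps guards against zero loss
    total = sum(inverse)
    return [v / total for v in inverse]


def aggregate_task_prompt(
    prompts: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted sum of the per-client task prompt vectors (one row per client)."""
    dim = len(prompts[0])
    return [sum(w * p[d] for w, p in zip(weights, prompts)) for d in range(dim)]


losses = [[1.0, 1.0], [3.0, 3.0]]  # losses[i][j]: client i's model on client j's validation set
weights = task_prompt_weights(losses)
global_task_prompt = aggregate_task_prompt([[1.0, 0.0], [0.0, 1.0]], weights)
```

In this toy case the first client has one third of the second client's cumulative loss and therefore receives three times its weight in the aggregated task prompt.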

In addition to aggregating the task prompt, the server uses the Global Aggregation protocol to compute a global data prompt, which provides a generalized representation distinct from the locally optimized data prompts on each client. Together, these form the global model, which enhances adaptability across clients. When new data sources are introduced for inference, or when a new client joins in round $t$, initializing its model parameters with the global model ($P_T$ and $P_D$) leverages knowledge from prior rounds, thereby reducing the need for extensive retraining. Experimental validation of this initialization effect is presented in “Performance of Global Model” Section.

Local aggregation

In federated learning, clients pursue a common objective despite variations in local data distributions. This framework models the shared task objectives through a task prompt, with the server assigning aggregation weights based on each client’s performance across all validation datasets. To address distributional heterogeneity, a data prompt tailored to each client is introduced, enabling local evaluation of models trained by other clients to determine aggregation weights.

In each training round $t$, the $K$ clients are divided into three sets based on their contributions to client $k$: Positive Clients (PC), Retained Negative Clients (RNC), and Discarded Negative Clients (DNC). These sets are formally defined as follows:

graphic file with name d33e1166.gif 9
graphic file with name d33e1172.gif 10
graphic file with name d33e1178.gif 11

where Inline graphic, and Inline graphic is the loss tolerance threshold, defined by the task initiator to regulate acceptable performance variations. During aggregation in round Inline graphic, client Inline graphic is classified into PC set if the model loss Inline graphic on client Inline graphic’s validation set shows improvement over client Inline graphic’s loss from previous round, Inline graphic. Otherwise, client i is assigned to RNC or DNC based on its performance relative to the defined threshold.
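The three-way split can be sketched as a simple classification over losses. The example assumes the tolerance threshold is applied additively to the reference client's previous-round loss; whether the original definition is additive or relative is not recoverable from the text here, so this is an illustrative assumption, as are the function and argument names.

```python
from typing import List, Sequence, Tuple


def partition_clients(
    losses_on_k: Sequence[float], prev_loss_k: float, tolerance: float
) -> Tuple[List[int], List[int], List[int]]:
    """Split client indices into (PC, RNC, DNC) from the viewpoint of one client k.

    losses_on_k[i] is the loss of client i's current model on client k's validation set;
    prev_loss_k is client k's own loss from the previous round.
    """
    pc, rnc, dnc = [], [], []
    for i, loss in enumerate(losses_on_k):
        if loss < prev_loss_k:
            pc.append(i)       # improves on client k's previous loss
        elif loss <= prev_loss_k + tolerance:
            rnc.append(i)      # worse, but within the assumed additive tolerance
        else:
            dnc.append(i)      # too far off; excluded from aggregation
    return pc, rnc, dnc


pc, rnc, dnc = partition_clients([0.4, 0.55, 0.9], prev_loss_k=0.5, tolerance=0.1)
```

A larger tolerance moves clients from the discarded set into the retained set, trading some local fit for broader generalization.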

To balance data-specific personalization with generalization, client Inline graphic aggregates data prompts from the sets of Positive Clients (PC) and Retained Negative Clients (RNC). The aggregated data prompt Inline graphic in round Inline graphic is computed as follows:

graphic file with name d33e1255.gif 12

where Inline graphic represents the locally aggregated data prompt across all clients in round Inline graphic, and Inline graphic denotes the data prompt computed by each client after training on its dataset. The vector Inline graphic is a Inline graphic-dimensional row vector of ones, and Inline graphic denotes the Kronecker product. The matrix Inline graphic represents the aggregation weights, where each element Inline graphic is the weight of client Inline graphic when aggregating client Inline graphic’s data prompt. These weights are calculated as follows:

graphic file with name d33e1324.gif 13

The initial weight Inline graphic is defined as:

graphic file with name d33e1337.gif 14

where Inline graphic denotes the data prompt of the model trained on client Inline graphic’s dataset after the Inline graphic-th round of iteration, and Inline graphic represents the aggregated data prompt of client Inline graphic after the Inline graphic-th round.

From Eq. 14, if Inline graphic, then Inline graphic; conversely, if Inline graphic, then Inline graphic. In this work, the normalization function Norm(Inline graphic) is defined as follows, depending on whether the PC set is empty:

graphic file with name d33e1417.gif 15

where Inline graphic is a hyperparameter that prevents excessive generalization, Inline graphic is an adjustment factor balancing local performance and global generalization, and sgn(Inline graphic) is the sign function, defined as:

graphic file with name d33e1442.gif 16

This aggregation strategy not only utilizes models that perform well on the client’s validation set but also considers models within an acceptable error margin, enabling the local model to effectively generalize while adapting to the specific data distribution of the client.
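The effect of the aggregation weights can be sketched as a weighted average of data prompts for one client. This is only an illustration: plain sum-normalization is used as a stand-in for the Norm function of Eq. 15, prompts are flattened to lists of floats, and all names are hypothetical. Peers in the DNC set are simply left out, which corresponds to a weight of zero.

```python
def aggregate_data_prompts(prompts, raw_weights):
    """Weighted average of peers' data prompts for one client.

    prompts:     dict peer id -> data prompt as a flat list of floats.
    raw_weights: dict peer id -> non-negative weight for peers in PC or RNC;
                 discarded peers are omitted (weight zero).
    """
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("at least one peer must have a positive weight")
    dim = len(next(iter(prompts.values())))
    aggregated = [0.0] * dim
    for peer, weight in raw_weights.items():
        share = weight / total  # sum-normalization, a stand-in for Norm(.)
        for d in range(dim):
            # The same scalar share is applied to every prompt dimension.
            aggregated[d] += share * prompts[peer][d]
    return aggregated


prompts = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [5.0, 5.0]}
result = aggregate_data_prompts(prompts, {1: 3.0, 2: 1.0})  # peer 3 discarded
print(result)
```

Applying one scalar weight per peer to every dimension of that peer's prompt is the role played by the Kronecker product with the row vector of ones in Eq. 12.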

Privacy preservation

Federated learning is fundamentally designed to protect data privacy by ensuring that raw data remains on the client side. In our initial design, clients were required to upload a small validation set to the server as a representative subset of their local data distribution. While this facilitates collaborative validation, it introduces potential privacy concerns, as even small samples may carry sensitive information. To address this, we propose two mechanisms aimed at reducing the privacy risks associated with validation data sharing.

  1. Representation-level sharing instead of raw validation data

To avoid the direct exposure of raw validation data, this approach eliminates the need for uploading original samples. Each client encodes its local validation set using a shared, frozen pre-trained model to generate intermediate textual and visual representations T and I (as defined in Eqs. 4 and 5). Only these embeddings are transmitted to the server.

Because the backbone encoder is identical and remains frozen throughout training, all clients can utilize the uploaded embeddings by combining them with their own prompts and passing them through the shared encoder to compute validation losses. This procedure maintains validation effectiveness while substantially mitigating the risk of reconstruction or identification attacks.

  2. Anonymous validation via asymmetric encryption

To enhance privacy and prevent the attribution of validation data to specific clients, we propose an asymmetric encryption mechanism. Prior to training, each client generates a public-private key pair locally. Only the public key and the representation-level information of the validation set are shared, while the private key is securely retained on the client device. During training, each client evaluates its model on the validation data provided by other clients, without knowledge of the data’s origin, thereby ensuring the anonymity of the data providers.

For effective global aggregation, each client uploads the cumulative loss incurred across all validation sets to the server, specifically the total loss of client k evaluated on the validation data from all other clients. This scalar loss value does not require encryption, as it is simply a summation that does not reveal sensitive information. Formally, the cumulative loss for client k in the r-th round is expressed as: Inline graphic. This value is uploaded by each client to the server, enabling the server to aggregate these loss values, compute aggregation weights, and update the global prompt accordingly.

For local aggregation, as the validation embeddings are not linked to explicit client identifiers, each client’s loss on the validation data is encrypted using the corresponding public key. The server aggregates these encrypted losses by public key and then distributes the resulting pairs of (public key, encrypted loss) to all clients. Upon receiving this aggregated data, each client uses its private key to decrypt the entries associated with its own public key. This allows the client to retrieve the performance evaluation of its validation set while maintaining privacy, ensuring that both the validation data and client identity remain decoupled throughout the process.
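The routing of encrypted losses by public key can be sketched as follows. The paper does not name an encryption scheme, so this example uses textbook RSA with tiny primes purely to show the message flow. It is insecure and not the authors' implementation; a real deployment would rely on a vetted cryptographic library with proper padding. The fixed-point `SCALE` and all function names are assumptions.

```python
from collections import defaultdict

SCALE = 100  # losses are sent as integers: round(loss * SCALE)


def toy_keypair(p, q, e=17):
    """Textbook RSA from two small primes. Insecure; illustration only."""
    n, phi = p * q, (p - 1) * (q - 1)
    d = pow(e, -1, phi)      # modular inverse (Python 3.8+)
    return (n, e), (n, d)    # (public key, private key)


def encrypt_loss(loss, public_key):
    n, e = public_key
    m = round(loss * SCALE)
    assert 0 <= m < n, "toy modulus too small for this loss value"
    return pow(m, e, n)


def decrypt_loss(cipher, private_key):
    n, d = private_key
    return pow(cipher, d, n) / SCALE


# Two validation-set owners, each holding its own key pair.
pub_a, priv_a = toy_keypair(61, 53)
pub_b, priv_b = toy_keypair(89, 97)

# Evaluating clients report (public key of the validation set, encrypted loss).
reports = [
    (pub_a, encrypt_loss(1.25, pub_a)),
    (pub_b, encrypt_loss(0.40, pub_b)),
    (pub_a, encrypt_loss(0.80, pub_a)),
]

# The server groups ciphertexts by public key; no client identifier is needed.
by_key = defaultdict(list)
for public_key, cipher in reports:
    by_key[public_key].append(cipher)

# Owner A decrypts only the entries filed under its own public key.
losses_for_a = [decrypt_loss(c, priv_a) for c in by_key[pub_a]]
print(losses_for_a)
```

The server only ever handles (public key, ciphertext) pairs, so the link between a validation set and its owner is known to the owner alone.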

These two mechanisms jointly support collaborative validation without revealing raw validation data. They reduce the risks of data reconstruction and attribution, and align well with the federated learning principle of keeping data local. As a result, the proposed design enhances privacy guarantees while maintaining the effectiveness of model validation.

Experiment

Experiment setup

Datasets

To assess the generalizability of the proposed model framework across diverse data domains, this study evaluates its performance on eight distinct image classification datasets, following the methodologies in [27, 35, 36]. These include Caltech101 [43], which is widely used for general object detection; DTD [44], specialized for texture classification; EuroSAT [45], focused on categorizing Sentinel-2 satellite imagery; FGVCAircraft [46], a benchmark for aircraft recognition; Food101 [47], tailored for food classification tasks; Flowers102 [48], dedicated to identifying flower species; OxfordPets [49], designed for pet breed classification; and UCF101 [50], a leading resource for action recognition studies. Together, these datasets provide a rigorous assessment of the model’s cross-domain applicability.

Heterogeneity simulation

To simulate the non-iid data distributions encountered in real-world federated learning scenarios, we adopt a data-sampling methodology similar to [51, 52], constructing heterogeneous client datasets that test the framework’s ability to leverage foundation models for feature extraction. Following the “16-shot” approach as implemented in CLIP, each client’s dataset is constrained to a maximum of 16 samples per category.

For constructing each client’s local dataset, we randomly exclude 20% of the available categories to model data sparsity. Among the remaining categories, 25% of the data is retained, resulting in a 4-shot structure per category. To further accentuate data heterogeneity, a dominant class is identified randomly for each client; 75% of the data points in this class are then added to the client’s local training set. Both the test and validation datasets are designed to align with each client’s training distribution, ensuring consistency across the training, validation, and inference phases. The test dataset is the largest subset that reflects the training distribution, while the validation dataset is the smallest subset. Furthermore, there is no overlap between the datasets used in the training, validation, and inference phases, ensuring mutual exclusivity.
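The sampling procedure can be sketched as a function that returns per-class training counts for one simulated client. Several details are assumptions made for illustration: the dominant class is drawn from the retained categories, it contributes 75% of its 16 shots (12 samples) rather than 12 on top of the base 4, and the number of excluded categories is rounded down. The function name and the seed handling are hypothetical.

```python
import random


def sample_client_shots(num_classes, shots=16, seed=0):
    """Return per-class training-sample counts for one simulated client."""
    rng = random.Random(seed)
    classes = list(range(num_classes))

    # Randomly drop 20% of the categories to model data sparsity.
    excluded = set(rng.sample(classes, int(0.20 * num_classes)))
    kept = [c for c in classes if c not in excluded]

    # One randomly chosen retained class dominates the client's data.
    dominant = rng.choice(kept)

    counts = {}
    for c in classes:
        if c in excluded:
            counts[c] = 0
        elif c == dominant:
            counts[c] = int(0.75 * shots)  # 12 of 16 samples
        else:
            counts[c] = int(0.25 * shots)  # 4 of 16 samples ("4-shot")
    return counts, dominant


# OxfordPets has 37 categories.
counts, dominant = sample_client_shots(37, seed=0)
print(dominant, counts)
```

Calling the function with a different seed per client yields a different set of missing categories and a different dominant class, which is what makes the client distributions non-iid.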

To enhance understanding of the heterogeneous data distributions in the federated learning setup, a pie chart is presented in Fig. 3, visualizing the data distribution across different categories for the first client’s dataset, using the OxfordPets dataset as an example. The chart highlights the randomly excluded categories, which simulate the data sparsity often encountered in federated learning. Class imbalances and the dominance of a randomly selected class are also emphasized, reflecting the challenges posed by non-iid data distributions. These features underscore the data heterogeneity that the model is designed to address in the simulation process.

Fig. 3.

Visualization of data distribution across different classes in OxfordPets dataset.

Baselines

Due to the limited research on federated learning with foundation models and the lack of direct comparison methods, we benchmark our framework by integrating traditional personalized federated learning models with PromptFL [25]. Specifically, the evaluation includes three baseline models: Local, FedProx [53]+PromptFL (FP+P), and pFedMe [54]+PromptFL (pF+P), providing comparative insights into the model’s effectiveness across varied federated learning strategies.

  1. Local A baseline where each client trains independently on its local data without communication or model aggregation, serving as a non-collaborative reference point.

  2. FedProx+PromptFL (FP+P) FedProx extends the standard FedAvg algorithm by introducing a proximal term to handle heterogeneous data across clients. FedProx+PromptFL combines this approach with the PromptFL method, where clients collaboratively learn task-specific prompts rather than models, enabling federated participants to fine-tune foundation models using minimal local data.

  3. pFedMe+PromptFL(pF+P) pFedMe personalizes federated learning by optimizing both global and local objectives using Moreau envelopes. pFedMe+PromptFL enhances this by learning personalized prompts for each client, allowing for improved adaptation to client-specific data and tasks.

Training details

Building on the pre-trained ViT-B/16 [41] CLIP model, as outlined in MaPLe [40], the client local model framework in this study is optimized using stochastic gradient descent (SGD) with a learning rate of 0.035. Following the methodology of FedPrompt [55], the setup includes a centralized server and Inline graphic clients engaged in Inline graphic rounds of iterative training. In each round, clients conduct five epochs with a batch size of four. For local aggregation, parameters are set to Inline graphic, Inline graphic, with the weight adjustment factor Inline graphic defined as follows:

graphic file with name d33e1665.gif 17

where Inline graphic represents the highest initial aggregation weight among clients in the current RNC set of the Inline graphic-th client, while Inline graphic denotes the lowest initial aggregation weight in the PC set of the Inline graphic-th client. All experiments are conducted using PyTorch on an NVIDIA RTX 3090 GPU.

Performance of DP2FL on heterogeneous data distributions

To validate the efficacy of the proposed parameter aggregation method, comparative experiments were conducted against traditional personalized federated learning approaches, specifically FedProx and pFedMe. Initial results indicate that PromptFL, when applied to training or fine-tuning foundation models from scratch, incurs substantial communication costs and fails to achieve optimal accuracy. Given space constraints, comparisons with scratch training and fine-tuning are omitted. Instead, FedProx and pFedMe frameworks are adapted to integrate PromptFL, restricting training and aggregation to the prompt parameters alone and aligning with the client model structure employed in this work. For quantitative evaluation, the mean accuracy and F1 scores over 10 clients across eight benchmark datasets are reported in Table 1, illustrating the comparative performance of the proposed aggregation strategy.

Table 1.

Comparison results with baselines.

Dataset ACC(Local) ACC(FP+P) ACC(pF+P) ACC(DP2FL) Micro-F1(Local) Micro-F1(FP+P) Micro-F1(pF+P) Micro-F1(DP2FL)
Caltech101 94.51 94.98 94.57 94.66 93.02 94.18 93.71 93.96
DTD 68.53 73.26 72.77 73.78 64.20 70.29 70.05 71.14
EuroSAT 78.53 83.14 81.37 84.18 78.16 82.99 81.22 84.18
FGVCAircraft 43.90 41.15 41.43 41.81 34.15 37.07 37.17 37.06
Food101 87.32 88.84 88.74 89.06 86.05 87.92 87.78 88.12
Flowers102 93.34 95.22 95.27 95.50 91.59 94.41 94.45 94.91
OxfordPets 93.15 94.91 95.16 94.95 92.16 94.35 94.61 94.21
UCF101 82.47 83.74 83.13 83.11 78.84 82.11 81.61 81.10
AVG 80.22 81.90 81.55 82.13 77.27 80.42 80.07 80.59

Bold values indicate the best performance, and italic values indicate the second best within the same metric across all models. Subsequent tables follow the same convention.

As shown in Table 1, the proposed framework achieves strong performance across the eight datasets, securing the highest accuracy on four datasets and the second-highest on three, for an average accuracy 0.23 percentage points above the closest competitor (FP+P). In terms of Micro-F1, DP2FL ranks highest on four datasets and second-highest on one, with an average improvement of 0.17 percentage points over the second-best method (FP+P). Performance gains are particularly notable on the EuroSAT and Food101 datasets, underscoring the efficacy of the proposed prompt-based aggregation approach. These findings validate the effectiveness of the method in handling heterogeneous client data distributions, surpassing the results of traditional personalized federated learning models on average and reinforcing its adaptability across diverse data settings.
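The summary figures for accuracy can be re-derived directly from the values in Table 1. The short check below recomputes the per-method means, the gap to the best baseline, and how often DP2FL ranks first or second; the variable names are illustrative only.

```python
# Accuracy columns of Table 1, in dataset order:
# Caltech101, DTD, EuroSAT, FGVCAircraft, Food101, Flowers102, OxfordPets, UCF101
acc = {
    "Local": [94.51, 68.53, 78.53, 43.90, 87.32, 93.34, 93.15, 82.47],
    "FP+P":  [94.98, 73.26, 83.14, 41.15, 88.84, 95.22, 94.91, 83.74],
    "pF+P":  [94.57, 72.77, 81.37, 41.43, 88.74, 95.27, 95.16, 83.13],
    "DP2FL": [94.66, 73.78, 84.18, 41.81, 89.06, 95.50, 94.95, 83.11],
}

# Mean accuracy per method and the gap to the strongest baseline.
means = {name: sum(vals) / len(vals) for name, vals in acc.items()}
best_baseline = max(v for k, v in means.items() if k != "DP2FL")
gain = means["DP2FL"] - best_baseline

# Rank of DP2FL on each dataset (1 = highest accuracy).
ranks = []
for i in range(8):
    column = sorted((vals[i] for vals in acc.values()), reverse=True)
    ranks.append(column.index(acc["DP2FL"][i]) + 1)
firsts, seconds = ranks.count(1), ranks.count(2)

print(round(means["DP2FL"], 2), round(gain, 2), firsts, seconds)
```

The gap is a difference between two accuracies, so it is a change in percentage points rather than a relative improvement.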

To further illustrate the consistency of DP2FL’s performance, Fig. 4 shows a histogram of the average accuracy and F1 scores for each client across the eight datasets. The results demonstrate that the model exhibits relatively uniform performance across all clients, reflecting the robustness and strong generalization capability of DP2FL in federated learning scenarios with non-iid data distributions.

Fig. 4.

Client Performance Across Eight Datasets: Average Accuracy and F1 Score.

Performance of global model

Traditional personalized federated learning methods typically address the variability in local dataset distributions by assigning distinct model parameters to each client. However, these methods fail to account for challenges related to inference on new data sources or the integration of new clients during training, particularly with respect to parameter initialization. To overcome this limitation, we propose a global model that aggregates a generalized data prompt in the same manner as the task prompt. The necessity and effectiveness of this approach are rigorously validated through a series of experiments presented in this section.

Two targeted experiments are conducted to validate the proposed approach. In the first experiment, the average performance of local models trained on the datasets of 10 clients is compared with that of a global model tested on the datasets of all clients. This comparison provides insights into the generalization ability of the global model across different data distributions. In the second experiment, a new client (the 11th client) is introduced. First, the local models trained on the datasets of the initial 10 clients are evaluated on the new data source (the 11th client), and the performance difference between this evaluation and the global model initialization is compared, highlighting the global model’s effectiveness for inference on new data sources. Then, the model is initialized using various methods to demonstrate the role of the global model in initializing new clients. Finally, a round of federated training is conducted, where the newly initialized model is trained alongside the remaining 10 clients. The results show that proper initialization enables the new client to achieve better performance with only a few federated learning iterations.

  1. Performance of global model on non-local datasets

The global model is pivotal in managing new data sources that cannot directly participate in federated learning. It facilitates inference by enabling accurate predictions on unseen data without requiring retraining. To achieve this, the global model must exhibit strong generalization capabilities, integrating knowledge from multiple clients to effectively address diverse data distributions. Additionally, when new clients join the federated system, the global model should provide an efficient initialization to enable rapid adaptation to local data. Thus, robust generalization is critical not only for inference on unseen data but also for the seamless integration of new clients, enhancing the model’s applicability in real-world federated learning scenarios.

To assess the effectiveness of the global model, a comparative analysis is conducted between the local models and the global model. The results of this comparison are presented in Table 2. In this table, Ave_Local represents the average accuracy of the local models trained within the DP2FL framework and tested on their respective local datasets. In contrast, Global Model refers to the average accuracy of the global model, initialized with global data and task prompts, and tested across each client’s dataset. The Diff column displays the accuracy difference between the global model and the local models, calculated as the accuracy of the global model (Global Model) minus that of the local models (Ave_Local).

Table 2.

Performance of global model on non-local datasets.

Dataset Ave_local Global model Diff
Caltech101 94.66 94.29 − 0.37
DTD 73.78 73.52 − 0.26
EuroSAT 84.18 83.94 − 0.24
FGVCAircraft 41.81 41.17 − 0.64
Food101 89.06 88.91 − 0.15
Flowers102 95.50 94.80 − 0.70
OxfordPets 94.95 94.86 − 0.09
UCF101 83.11 82.89 − 0.22
AVG 82.13 81.80 − 0.33

As shown in Table 2, the results indicate that the global model performs slightly worse than the locally personalized models across all datasets. The average accuracy difference between the global model (Global Model) and the local models (Ave_Local) is − 0.33%, reflecting a marginal decline in performance when using the global model. Notably, the global model’s accuracy closely aligns with that of the local models on datasets such as OxfordPets and Food101, with minimal differences of − 0.09% and − 0.15%, respectively.

The experimental results suggest that the global model demonstrates strong generalization capabilities, enabling effective inference on new, unseen data sources and efficient initialization of newly added clients.

  2. Performance of the global model in new client initialization

Building on the findings from the first experiment, which demonstrated the global model’s strong generalization ability, this experiment further investigates its performance when new clients are introduced. Specifically, it evaluates the model’s inference ability on new data sources and examines how different initialization strategies affect the performance of newly added clients in the federated learning process. The experiment consists of three parts: First, it compares the global model’s inference performance on new data sources with that of locally trained models from other clients. Second, it applies various initialization strategies to the newly added client’s model, assessing their impact on performance with the client’s local dataset. Finally, the results show that after effectively initializing the new client’s model, only a minimal number of federated learning iterations are required to achieve strong performance across diverse datasets.

Table 3 presents the results for the first two parts of this experiment. The methods are described as follows: Ave_Local represents the average accuracy when the original 10 clients perform inference on the new client’s dataset using their locally trained models. Global Model uses the global model, initialized with both task and data prompts aggregated through global aggregation, specifically Inline graphic and Inline graphic, and then performs inference to obtain the resulting accuracy. This forms the first part of the comparison. In the second part, additional initialization methods are explored: Init initializes the new client’s model with parameters set by the task initiator, specifically Inline graphic and Inline graphic; InitGlo initializes the task prompt using global parameters based on the aggregation method, while the data prompt is initialized with parameters from the task initiator, i.e., Inline graphic and Inline graphic.
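The three initialization strategies differ only in which source supplies each of the two prompts. A minimal sketch of that selection, with hypothetical names and prompts represented by placeholder values, might look like this:

```python
def init_new_client(strategy, initiator, global_model):
    """Pick the (task prompt, data prompt) pair a new client starts from.

    initiator / global_model: dicts with "task" and "data" entries holding the
    task initiator's default prompts and the server's aggregated prompts.
    """
    if strategy == "Init":
        return initiator["task"], initiator["data"]
    if strategy == "InitGlo":
        return global_model["task"], initiator["data"]
    if strategy == "Global Model":
        return global_model["task"], global_model["data"]
    raise ValueError(f"unknown strategy: {strategy}")


initiator = {"task": "task_init", "data": "data_init"}
global_model = {"task": "task_global", "data": "data_global"}

for name in ("Init", "InitGlo", "Global Model"):
    print(name, init_new_client(name, initiator, global_model))
```

Comparing InitGlo with Global Model therefore isolates the contribution of the globally aggregated data prompt, since both share the same task prompt.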

Table 3.

Results for new clients under different parameter initialization methods without training.

Dataset Ave_local Global model Init InitGlo
Caltech101 97.37 97.84 95.69 96.55
DTD 67.12 67.14 43.03 63.83
EuroSAT 84.00 83.85 45.21 78.35
FGVCAircraft 41.56 41.90 26.07 41.31
Food101 89.14 89.15 87.36 88.91
Flowers102 92.58 91.82 69.50 90.88
OxfordPets 95.76 96.10 87.79 95.90
UCF101 82.73 82.91 63.12 79.27
AVG 81.28 81.34 64.72 79.38

In the first part of the experiment, the 11th client is treated as a new data source. Without the global model, each client would likely have to rely solely on its locally trained model to handle the new data. To evaluate the effectiveness of the global model for inference on new data, we compared it with the Ave_Local method. As shown in Table 3 (first and second columns), the global model generally outperforms the average of the local models across most datasets. For instance, on the Caltech101 dataset, the global model achieves an accuracy of 97.84%, which is higher than the 97.37% of the average local model. Similarly, on OxfordPets, the global model reaches 96.10%, compared to 95.76% from the local models. In several other datasets, such as DTD and Food101, the global model shows an improvement over the average local model, although the difference is not always substantial.

This comparison confirms the advantage of using the global model for inference on new data sources, demonstrating its value in federated learning systems and validating its capability to handle new data sources effectively.

The second part of the experiment evaluates the effectiveness of the global model for new client integration, as shown in the last three columns of Table 3. The Global Model consistently outperforms both Init and InitGlo, achieving the highest accuracy across all datasets. Specifically, the global model improves accuracy by an average of 1.96% compared to InitGlo, and by 16.62% compared to Init.

The results from this part highlight the effectiveness of the global model in initializing new clients within federated learning systems, demonstrating its ability to facilitate direct adaptation to the data distribution of newly added clients.

To further validate that proper initialization of the newly added client leads to good performance with fewer subsequent federated learning iterations, this section presents the third part of the experiment. As shown in Table 4, three distinct test metrics are considered: New, which represents the accuracy of the newly added (11th) client’s model on its own local dataset after one round of federated training; Local, which refers to the average accuracy of all 11 clients evaluated on their respective local datasets; and All, which indicates the average accuracy of all 11 clients, where each client’s model is evaluated on the combined test sets using their individual parameters.

Table 4.

Comparison results on different test data after one round training.

Dataset New(Init) New(InitGlo) New(Global Model) Local(Init) Local(InitGlo) Local(Global Model) All(Init) All(InitGlo) All(Global Model)
Caltech101 97.84 97.84 97.84 94.85 94.84 94.84 95.17 95.29 95.31
DTD 67.61 66.43 69.74 72.68 72.81 73.42 67.22 67.39 67.61
EuroSAT 82.13 82.31 83.27 83.00 83.58 84.14 80.54 81.14 81.31
FGVCAircraft 42.26 43.69 44.40 41.86 42.67 42.38 35.56 36.01 36.00
Food101 89.36 89.26 89.47 89.09 89.10 89.05 87.36 87.34 87.29
Flowers102 92.77 93.40 92.45 94.89 95.41 95.47 92.72 93.57 93.59
OxfordPets 95.79 95.69 95.69 94.90 94.83 94.87 93.50 93.50 93.47
UCF101 83.58 82.50 83.31 83.11 83.03 83.06 80.44 80.54 80.56
AVG 81.42 81.39 82.02 81.80 82.04 82.15 79.06 79.35 79.39

The results presented in Table 4 demonstrate the efficacy of the proposed initialization method, which facilitates the adaptation of the newly added client to various data distributions with minimal training. Notably, the Global Model initialization method achieves the highest average accuracy among the initialization strategies (Init, InitGlo, and Global Model) on all three metrics: New, Local, and All.

In particular, the newly introduced client achieves the highest accuracy on its local test set with the Global Model across most datasets, as shown in the New column of Table 4. While there are a few cases (e.g., Caltech101) where Init and InitGlo show similar performance, the Global Model consistently yields the best average results. This demonstrates its effectiveness in helping the new client quickly adapt to its local data distribution.

Regarding the performance of existing clients, the Local column indicates that Global Model also leads to the highest average accuracy for the models of all 11 clients. In particular, datasets like DTD, EuroSAT, and Flowers102 show significant improvements with Global Model, highlighting the benefit of this initialization strategy in enabling existing clients to better utilize the new data introduced by the added client.

Finally, when evaluating the aggregated performance across all clients (the All column), Global Model achieves the highest average accuracy (79.39%, compared with 79.06% for Init and 79.35% for InitGlo) and the best result on five of the eight datasets. Its advantage over Init is clearest on datasets such as EuroSAT and Flowers102. By achieving the highest average accuracy on the combined test sets, it shows good generalization across diverse data distributions, making it the most robust of the three initialization strategies in these experiments.

In summary, the third part of the experiment demonstrates that by initializing new clients with the global model, only a few rounds of federated training are required to achieve strong performance across various datasets. Together, the three experiments further validate the global model’s robust generalization ability, underscoring its potential for inference on new data sources and for efficiently initializing newly added clients.

Conclusion

Federated learning in heterogeneous environments presents significant challenges, primarily due to the substantial variation in local data distributions and the frequent addition of new clients. To address these challenges, we propose the Dual Prompt Personalized Federated Learning (DP2FL) framework. This framework leverages a dual-prompt mechanism and adaptive aggregation strategies to effectively integrate global task information with client-specific data. It enhances the model’s generalization to global tasks while accommodating the unique characteristics of each client’s local distribution.

A key innovation of DP2FL is the introduction of a novel global model, which enables high-accuracy inference on new data sources that have not participated in federated learning. It also facilitates the seamless integration of new clients into the federated learning process. Empirical results demonstrate that DP2FL enhances model performance across diverse client distributions, improves inference accuracy on unseen data, and reduces the onboarding time for new clients, thereby increasing its practical applicability.

Author contributions

Y. C.: Conceptualization, Methodology, Software, Visualization, Writing-original draft. X.S.: Supervision, Conceptualization, Writing-review. X.Z.: Writing-review. Z.C.: Writing-review. D.M.: Supervision, Writing-review.

Funding

This work was funded by the Science-Technology Development Plan Project of Jilin Province (20210202129NC).

Data availability

The datasets used in this study are all publicly available and widely adopted in the research community. Specifically, the following datasets were utilized: Caltech101 [43] for general object detection, DTD [44] for texture classification, EuroSAT [45] for Sentinel-2 satellite imagery categorization, FGVCAircraft [46] for aircraft recognition, Food101 [47] for food classification tasks, Flowers102 [48] for flower species identification, OxfordPets [49] for pet breed classification, and UCF101 [50] for action recognition studies. These datasets are accessible through their respective repositories: Caltech101: http://www.vision.caltech.edu/Image_Datasets/Caltech101/. DTD: https://www.robots.ox.ac.uk/~vgg/data/dtd/. EuroSAT: http://madm.dfki.de/files/sentinel/EuroSAT.zip. FGVCAircraft: http://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/. Food101: https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/. Flowers102: http://www.robots.ox.ac.uk/~vgg/data/flowers/102/. OxfordPets: https://www.robots.ox.ac.uk/~vgg/data/pets/. UCF101: https://drive.google.com/file/d/10Jqome3vtUA2keJkNanAiFpgbyC9Hc2O/view. These datasets ensure reproducibility of the experiments and facilitate further research.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Xiaohu Shi, Email: shixh@jlu.edu.cn.

Deyin Ma, Email: madeyin@ccut.edu.cn.

References

  • 1.Yang, Q., Liu, Y., Chen, T. & Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. (TIST)10, 1–19 (2019). [Google Scholar]
  • 2.Talaei Khoei, T., Ould Slimane, H. & Kaabouch, N. Deep learning: Systematic review, models, challenges, and research directions. Neural Comput. Appl.35, 23103–23124 (2023). [Google Scholar]
  • 3.Ávila-Jiménez, J. L., Cantón-Habas, V., del Pilar Carrera-González, M., Rich-Ruiz, M. & Ventura, S. A deep learning model for Alzheimer’s disease diagnosis based on patient clinical records. Comput. Biol. Med.169, 107814 (2024). [DOI] [PubMed] [Google Scholar]
  • 4.Kusumoto, D. et al. A deep learning-based automated diagnosis system for spect myocardial perfusion imaging. Sci. Rep.14, 13583 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Eskandari, A. & Sharbatdar, M. Efficient diagnosis of psoriasis and lichen planus cutaneous diseases using deep learning approach. Sci. Rep.14, 9715 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Wang, M. & Deng, W. Deep face recognition: A survey. Neurocomputing429, 215–244 (2021). [Google Scholar]
  • 7.He, L. et al. Lmtformer: Facial depression recognition with lightweight multi-scale transformer from videos. Appl. Intell.55, 195 (2025). [Google Scholar]
  • 8.Ma, J. Face recognition technology and privacy protection methods based on deep learning. in International Conference on Computer Application and Information Security (ICCAIS 2023), vol. 13090, 899–904 (SPIE, 2024).
  • 9.Karatzoglou, A. & Hidasi, B. Deep learning for recommender systems. in Proceedings of the Eleventh ACM Conference on Recommender Systems 396–397 (2017).
  • 10.Xiang, Y., Huo, S., Wu, Y., Gong, Y. & Zhu, M. Integrating AI for enhanced exploration of video recommendation algorithm via improved collaborative filtering. J. Theory Pract. Eng. Sci.4, 83–90 (2024). [Google Scholar]
  • 11.Akinpelu, S., Viriri, S. & Adegun, A. An enhanced speech emotion recognition using vision transformer. Sci. Rep.14, 13126 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Singla, C., Singh, S., Sharma, P., Mittal, N. & Gared, F. Emotion recognition for human-computer interaction using high-level descriptors. Sci. Rep.14, 12122 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Guendouzi, B. S., Ouchani, S., Assaad, H. E. & Zaher, M. E. A systematic review of federated learning: Challenges, aggregation methods, and development tools. J. Netw. Comput. Appl.220, 103714 (2023). [Google Scholar]
  • 14.Huang, R.-Y., Samaraweera, D. & Chang, J. M. Exploring threats, defenses, and privacy-preserving techniques in federated learning: A survey. Computer57, 46–56 (2024). [Google Scholar]
  • 15.McMahan, B., Moore, E., Ramage, D., Hampson, S. & y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. in Artificial Intelligence and Statistics 1273–1282 (PMLR, 2017).
  • 16.Gao, D., Yao, X. & Yang, Q. A survey on heterogeneous federated learning. arXiv preprint arXiv:2210.04505 (2022).
  • 17.Sabah, F. et al. Model optimization techniques in personalized federated learning: A survey. Expert Syst. Appl. 243, 122874 (2024).
  • 18.Tan, A. Z., Yu, H., Cui, L. & Yang, Q. Towards personalized federated learning. IEEE Trans. Neural Netw. Learn. Syst. 34, 9587–9603 (2022).
  • 19.Duan, M. et al. Self-balancing federated learning with global imbalanced data in mobile systems. IEEE Trans. Parallel Distrib. Syst. 32, 59–71 (2020).
  • 20.Yang, H., He, H., Zhang, W. & Cao, X. Fedsteg: A federated transfer learning framework for secure image steganalysis. IEEE Trans. Netw. Sci. Eng. 8, 1084–1094 (2020).
  • 21.Arivazhagan, M. G., Aggarwal, V., Singh, A. K. & Choudhary, S. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818 (2019).
  • 22.Hanzely, F. & Richtárik, P. Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516 (2020).
  • 23.Ghosh, A., Chung, J., Yin, D. & Ramchandran, K. An efficient framework for clustered federated learning. Adv. Neural Inf. Process. Syst. 33, 19586–19597 (2020).
  • 24.Schneider, J., Meske, C. & Kuss, P. Foundation models: A new paradigm for artificial intelligence. Bus. Inf. Syst. Eng. 66(2), 221–231 (2024).
  • 25.Guo, T., Guo, S., Wang, J., Tang, X. & Xu, W. Promptfl: Let federated participants cooperatively learn prompts instead of models-federated learning in age of foundation model. IEEE Trans. Mobile Comput. 23(5), 5179–5194 (2023).
  • 26.Guo, T., Guo, S. & Wang, J. Pfedprompt: Learning personalized prompt for vision-language models in federated learning. in Proceedings of the ACM Web Conference 1364–1374 (2023).
  • 27.Radford, A. et al. Learning transferable visual models from natural language supervision. in International Conference on Machine Learning 8748–8763 (PMLR, 2021).
  • 28.Bommasani, R. et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
  • 29.Wang, H., Li, J., Wu, H., Hovy, E. & Sun, Y. Pre-trained language models and their applications. Engineering 25, 51–65 (2023).
  • 30.Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
  • 31.Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
  • 32.Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • 33.OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • 34.Lei, Y., Li, J., Li, Z., Cao, Y. & Shan, H. Prompt learning in computer vision: A survey. Front. Inf. Technol. Electron. Eng. 25, 42–63 (2024).
  • 35.Zhou, K., Yang, J., Loy, C. C. & Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 130, 2337–2348 (2022).
  • 36.Zhou, K., Yang, J., Loy, C. C. & Liu, Z. Conditional prompt learning for vision-language models. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 16816–16825 (2022).
  • 37.Chen, G. et al. Plot: Prompt learning with optimal transport for vision-language models. arXiv preprint arXiv:2210.01253 (2022).
  • 38.Bahng, H., Jahanian, A., Sankaranarayanan, S. & Isola, P. Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274 (2022).
  • 39.Chen, A., Yao, Y., Chen, P.-Y., Zhang, Y. & Liu, S. Understanding and improving visual prompting: A label-mapping perspective. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 19133–19143 (2023).
  • 40.Khattak, M. U., Rasheed, H., Maaz, M., Khan, S. & Khan, F. S. Maple: Multi-modal prompt learning. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 19113–19122 (2023).
  • 41.Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • 42.Zhang, M., Sapra, K., Fidler, S., Yeung, S. & Alvarez, J. M. Personalized federated learning with first order model optimization. arXiv preprint arXiv:2012.08565 (2020).
  • 43.Fei-Fei, L., Fergus, R. & Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. in 2004 Conference on Computer Vision and Pattern Recognition Workshop 178–178 (IEEE, 2004).
  • 44.Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S. & Vedaldi, A. Describing textures in the wild. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3606–3613 (2014).
  • 45.Helber, P., Bischke, B., Dengel, A. & Borth, D. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Select. Top. Appl. Earth Observat. Remote Sens. 12, 2217–2226 (2019).
  • 46.Maji, S., Rahtu, E., Kannala, J., Blaschko, M. & Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013).
  • 47.Bossard, L., Guillaumin, M. & Van Gool, L. Food-101–mining discriminative components with random forests. in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part VI 13 446–461 (Springer, 2014).
  • 48.Nilsback, M.-E. & Zisserman, A. Automated flower classification over a large number of classes. in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing 722–729 (IEEE, 2008).
  • 49.Parkhi, O. M., Vedaldi, A., Zisserman, A. & Jawahar, C. Cats and dogs. in 2012 IEEE Conference on Computer Vision and Pattern Recognition 3498–3505 (IEEE, 2012).
  • 50.Soomro, K., Zamir, A. R. & Shah, M. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
  • 51.Xu, J., Tong, X. & Huang, S.-L. Personalized federated learning with feature alignment and classifier collaboration. arXiv preprint arXiv:2306.11867 (2023).
  • 52.Shysheya, A., Bronskill, J., Patacchiola, M., Nowozin, S. & Turner, R. E. Fit: Parameter efficient few-shot transfer learning for personalized and federated image classification. arXiv preprint arXiv:2206.08671 (2022).
  • 53.Li, T. et al. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2, 429–450 (2020).
  • 54.Dinh, C. T., Tran, N. & Nguyen, J. Personalized federated learning with Moreau envelopes. Adv. Neural Inf. Process. Syst. 33, 21394–21405 (2020).
  • 55.Zhao, H., Du, W., Li, F., Li, P. & Liu, G. Fedprompt: Communication-efficient and privacy-preserving prompt tuning in federated learning. in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1–5 (IEEE, 2023).

Data Availability Statement

The datasets used in this study are all publicly available and widely adopted in the research community. Specifically, the following datasets were utilized: Caltech101 (ref. 43) for general object classification, DTD (ref. 44) for texture classification, EuroSAT (ref. 45) for Sentinel-2 satellite imagery categorization, FGVCAircraft (ref. 46) for aircraft recognition, Food101 (ref. 47) for food classification tasks, Flowers102 (ref. 48) for flower species identification, OxfordPets (ref. 49) for pet breed classification, and UCF101 (ref. 50) for action recognition studies. These datasets are accessible through their respective repositories: Caltech101: http://www.vision.caltech.edu/Image_Datasets/Caltech101/. DTD: https://www.robots.ox.ac.uk/~vgg/data/dtd/. EuroSAT: http://madm.dfki.de/files/sentinel/EuroSAT.zip. FGVCAircraft: http://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/. Food101: https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/. Flowers102: http://www.robots.ox.ac.uk/~vgg/data/flowers/102/. OxfordPets: https://www.robots.ox.ac.uk/~vgg/data/pets/. UCF101: https://drive.google.com/file/d/10Jqome3vtUA2keJkNanAiFpgbyC9Hc2O/view. These datasets ensure reproducibility of the experiments and facilitate further research.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
