2025 Aug 25;16:7923. doi: 10.1038/s41467-025-62959-5

Table 1.

Comparisons of model performance (accuracy or passing rate, %) and utility of different algorithms on two instruction-tuning scenarios with three LLMs (Mistral, TinyLlama, and Llama2)

| Setting | Metric | Local | FedAvg | FedProx | FedAMP | CFL | pFedGraph | iPFL |
|---|---|---|---|---|---|---|---|---|
| Mistral (Mix-Finance) | FIQA | 85.83 ± 4.55 | 82.19 ± 4.54 | 85.11 ± 6.61 | 87.65 ± 5.08 | 87.65 ± 3.02 | 87.65 ± 6.11 | 87.64 ± 1.99 |
| | TFNS | 81.09 ± 4.36 | 83.33 ± 0.00 | 75.17 ± 5.66 | 78.75 ± 8.13 | 83.67 ± 2.83 | 83.83 ± 4.24 | 84.83 ± 1.41 |
| | NWGI | 54.92 ± 0.12 | 57.5 ± 0.95 | 58.59 ± 0.12 | 55.00 ± 3.07 | 54.75 ± 2.72 | 56.00 ± 4.00 | 58.67 ± 2.83 |
| | Avg-Utility | 0.0 ± 0.0 | 45.9 ± 0.0 | 45.9 ± 0.0 | 45.9 ± 0.0 | 45.9 ± 0.0 | 46.1 ± 0.3 | 96.5 ± 0.0 |
| TinyLlama (Mix-Finance) | FIQA | 82.56 ± 4.02 | 77.09 ± 0.40 | 79.27 ± 1.65 | 83.64 ± 1.46 | 82.20 ± 5.56 | 82.56 ± 4.02 | 82.19 ± 2.47 |
| | TFNS | 76.16 ± 7.07 | 76.25 ± 0.11 | 75.92 ± 0.35 | 77.92 ± 2.47 | 79.33 ± 2.59 | 80.67 ± 0.71 | 78.17 ± 2.12 |
| | NWGI | 48.92 ± 1.76 | 52.67 ± 1.18 | 53.25 ± 0.35 | 47.92 ± 3.66 | 48.42 ± 2.24 | 47.75 ± 0.82 | 50.42 ± 2.47 |
| | Avg-Utility | 0.0 ± 0.0 | 45.9 ± 0.0 | 45.9 ± 0.0 | 45.9 ± 0.0 | 45.9 ± 0.0 | 45.9 ± 0.0 | 96.5 ± 0.0 |
| Llama2 (Mix-Finance) | FIQA | 84.02 ± 6.09 | 78.19 ± 1.94 | 78.55 ± 0.40 | 84.01 ± 4.03 | 85.11 ± 6.61 | 83.65 ± 5.57 | 85.47 ± 6.10 |
| | TFNS | 80.58 ± 0.83 | 81.25 ± 5.30 | 80.56 ± 6.28 | 76.63 ± 5.48 | 77.06 ± 6.98 | 76.75 ± 5.66 | 83.38 ± 2.83 |
| | NWGI | 43.17 ± 4.48 | 52.25 ± 4.77 | 52.44 ± 1.68 | 42.56 ± 6.98 | 45.94 ± 4.68 | 47.94 ± 3.09 | 56.25 ± 1.06 |
| | Avg-Utility | 0.0 ± 0.0 | 45.9 ± 0.0 | 45.9 ± 0.0 | 45.9 ± 0.0 | 45.9 ± 0.0 | 45.9 ± 0.0 | 96.5 ± 0.0 |
| Llama2 (Code&Finance) | NWGI | 50.61 ± 2.63 | 49.94 ± 2.46 | 49.61 ± 2.67 | 51.58 ± 1.38 | 52.28 ± 4.97 | 50.00 ± 5.65 | 53.11 ± 0.51 |
| | Code | 13.54 ± 2.30 | 15.00 ± 0.70 | 15.00 ± 1.26 | 14.02 ± 0.86 | 14.15 ± 0.80 | 14.27 ± 1.76 | 15.85 ± 1.14 |
| | Avg-Utility | 0.0 ± 0.0 | 58.2 ± 150.4 | 58.2 ± 150.4 | 58.2 ± 150.4 | 58.2 ± 150.4 | 137.7 ± 291.6 | 208.1 ± 131.1 |

Every value is presented as mean ± standard deviation. The first scenario, Mix-Finance, consists of two clients for FIQA, two clients for TFNS, and two clients for NWGI. In this scenario, each metric row shows the average performance of the two clients that share a dataset, and the Avg-Utility row shows the utility of all clients. The second scenario, Code&Finance, includes three financial clients and five coding clients. Here, each metric row shows the average performance of the clients that share a task, and the Avg-Utility row shows the utility of all clients. Our iPFL consistently outperforms the other baselines in clients’ utility and demonstrates the highest average model performance in most of the scenarios.
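The per-dataset rows can be read as a simple group-and-average over clients. The sketch below illustrates that arithmetic under stated assumptions: the function names (`summarize_by_dataset`, `format_cell`) and the scores are hypothetical, the numbers are not taken from the table, and it assumes the standard deviation is a population standard deviation taken across the clients in a group, which the note does not specify. It does not attempt to reproduce the Avg-Utility rows, whose definition is not given here.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_by_dataset(client_scores):
    """Group (dataset, score) pairs and return {dataset: (mean, std)}.

    Assumption: std is the population standard deviation across the
    clients that share a dataset; the source does not state this.
    """
    groups = defaultdict(list)
    for dataset, score in client_scores:
        groups[dataset].append(score)
    return {name: (mean(vals), pstdev(vals)) for name, vals in groups.items()}


def format_cell(avg, std, digits=2):
    """Render one table cell in the 'mean ± std' style used above."""
    return f"{avg:.{digits}f} ± {std:.{digits}f}"


# Hypothetical per-client accuracies (%), NOT values from the paper.
example_scores = [
    ("FIQA", 80.0), ("FIQA", 90.0),
    ("TFNS", 75.0), ("TFNS", 75.0),
    ("NWGI", 50.0), ("NWGI", 60.0),
]

summary = summarize_by_dataset(example_scores)
for name, (avg, std) in summary.items():
    print(name, format_cell(avg, std))
```

With two clients per dataset, as in the Mix-Finance scenario, a standard deviation of zero under this reading would simply mean both clients scored identically.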