2025 Aug 25;16:7923. doi: 10.1038/s41467-025-62959-5

Table 1.

Comparisons of model performance (accuracy or passing rate, %) and utility of different algorithms on two instruction-tuning scenarios with three LLMs (Mistral, TinyLlama, and Llama2)

Setting	Metric	Local	FedAvg	FedProx	FedAMP	CFL	pFedGraph	iPFL
Mistral (Mix-Finance)	FIQA	85.83 ± 4.55	82.19 ± 4.54	85.11 ± 6.61	87.65 ± 5.08	87.65 ± 3.02	87.65 ± 6.11	87.64 ± 1.99
	TFNS	81.09 ± 4.36	83.33 ± 0.00	75.17 ± 5.66	78.75 ± 8.13	83.67 ± 2.83	83.83 ± 4.24	84.83 ± 1.41
	NWGI	54.92 ± 0.12	57.50 ± 0.95	58.59 ± 0.12	55.00 ± 3.07	54.75 ± 2.72	56.00 ± 4.00	58.67 ± 2.83
	Avg-Utility	0.0 ± 0.0	45.9 ± 0.0	45.9 ± 0.0	45.9 ± 0.0	45.9 ± 0.0	46.1 ± 0.3	96.5 ± 0.0
TinyLlama (Mix-Finance)	FIQA	82.56 ± 4.02	77.09 ± 0.40	79.27 ± 1.65	83.64 ± 1.46	82.20 ± 5.56	82.56 ± 4.02	82.19 ± 2.47
	TFNS	76.16 ± 7.07	76.25 ± 0.11	75.92 ± 0.35	77.92 ± 2.47	79.33 ± 2.59	80.67 ± 0.71	78.17 ± 2.12
	NWGI	48.92 ± 1.76	52.67 ± 1.18	53.25 ± 0.35	47.92 ± 3.66	48.42 ± 2.24	47.75 ± 0.82	50.42 ± 2.47
	Avg-Utility	0.0 ± 0.0	45.9 ± 0.0	45.9 ± 0.0	45.9 ± 0.0	45.9 ± 0.0	45.9 ± 0.0	96.5 ± 0.0
Llama2 (Mix-Finance)	FIQA	84.02 ± 6.09	78.19 ± 1.94	78.55 ± 0.40	84.01 ± 4.03	85.11 ± 6.61	83.65 ± 5.57	85.47 ± 6.10
	TFNS	80.58 ± 0.83	81.25 ± 5.30	80.56 ± 6.28	76.63 ± 5.48	77.06 ± 6.98	76.75 ± 5.66	83.38 ± 2.83
	NWGI	43.17 ± 4.48	52.25 ± 4.77	52.44 ± 1.68	42.56 ± 6.98	45.94 ± 4.68	47.94 ± 3.09	56.25 ± 1.06
	Avg-Utility	0.0 ± 0.0	45.9 ± 0.0	45.9 ± 0.0	45.9 ± 0.0	45.9 ± 0.0	45.9 ± 0.0	96.5 ± 0.0
Llama2 (Code&Finance)	NWGI	50.61 ± 2.63	49.94 ± 2.46	49.61 ± 2.67	51.58 ± 1.38	52.28 ± 4.97	50.00 ± 5.65	53.11 ± 0.51
	Code	13.54 ± 2.30	15.00 ± 0.70	15.00 ± 1.26	14.02 ± 0.86	14.15 ± 0.80	14.27 ± 1.76	15.85 ± 1.14
	Avg-Utility	0.0 ± 0.0	58.2 ± 150.4	58.2 ± 150.4	58.2 ± 150.4	58.2 ± 150.4	137.7 ± 291.6	208.1 ± 131.1

Every value is presented as mean ± standard deviation. The first scenario, Mix-Finance, consists of six clients: two each for FIQA, TFNS, and NWGI. In this scenario, each metric row reports the average performance of the two clients that share a dataset, and each Avg-Utility row reports the utility averaged over all clients. The second scenario, Code&Finance, includes three financial clients and five coding clients; each metric row reports the average performance of the clients that share a task, and the Avg-Utility row reports the utility averaged over all clients. Our iPFL consistently outperforms the baselines in clients’ utility and achieves the highest average model performance in most of the scenarios.
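The closing claim about average model performance can be checked directly against the mean values in Table 1. The sketch below is one way to do so, under a labeled assumption: "average model performance" is taken to be the unweighted mean of the metric rows within each setting, which may differ from the aggregation used in the paper (for example, a client-weighted mean in Code&Finance). The names `TABLE`, `best_method_per_setting`, and `rows_won_by` are illustrative only; standard deviations and the Avg-Utility rows are left out.

```python
# Mean values copied from Table 1 (standard deviations omitted).
METHODS = ["Local", "FedAvg", "FedProx", "FedAMP", "CFL", "pFedGraph", "iPFL"]

TABLE = {
    "Mistral (Mix-Finance)": {
        "FIQA": [85.83, 82.19, 85.11, 87.65, 87.65, 87.65, 87.64],
        "TFNS": [81.09, 83.33, 75.17, 78.75, 83.67, 83.83, 84.83],
        "NWGI": [54.92, 57.50, 58.59, 55.00, 54.75, 56.00, 58.67],
    },
    "TinyLlama (Mix-Finance)": {
        "FIQA": [82.56, 77.09, 79.27, 83.64, 82.20, 82.56, 82.19],
        "TFNS": [76.16, 76.25, 75.92, 77.92, 79.33, 80.67, 78.17],
        "NWGI": [48.92, 52.67, 53.25, 47.92, 48.42, 47.75, 50.42],
    },
    "Llama2 (Mix-Finance)": {
        "FIQA": [84.02, 78.19, 78.55, 84.01, 85.11, 83.65, 85.47],
        "TFNS": [80.58, 81.25, 80.56, 76.63, 77.06, 76.75, 83.38],
        "NWGI": [43.17, 52.25, 52.44, 42.56, 45.94, 47.94, 56.25],
    },
    "Llama2 (Code&Finance)": {
        "NWGI": [50.61, 49.94, 49.61, 51.58, 52.28, 50.00, 53.11],
        "Code": [13.54, 15.00, 15.00, 14.02, 14.15, 14.27, 15.85],
    },
}


def best_method_per_setting(table):
    """Map each setting to (best method, its mean score).

    Assumption: the score is the unweighted mean over the metric rows.
    """
    result = {}
    for setting, rows in table.items():
        n_rows = len(rows)
        # zip(*...) regroups the row-wise lists into one tuple per method.
        means = [sum(column) / n_rows for column in zip(*rows.values())]
        best = max(range(len(METHODS)), key=means.__getitem__)
        result[setting] = (METHODS[best], round(means[best], 2))
    return result


def rows_won_by(table, method):
    """Count the metric rows in which `method` has the highest mean."""
    idx = METHODS.index(method)
    return sum(
        1
        for rows in table.values()
        for values in rows.values()
        if values[idx] == max(values)
    )


if __name__ == "__main__":
    for setting, (name, score) in best_method_per_setting(TABLE).items():
        print(f"{setting}: {name} ({score})")
    print("Metric rows led by iPFL:", rows_won_by(TABLE, "iPFL"))
```

Under this unweighted-mean assumption, iPFL ranks first in three of the four settings, with pFedGraph marginally ahead for TinyLlama, and iPFL has the highest mean in 7 of the 11 metric rows. Because the standard deviations are ignored here, small gaps such as the 0.01-point difference on Mistral FIQA should not be read as meaningful.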