2023 Nov 17;6:1279794. doi: 10.3389/frai.2023.1279794

Table 4.

Experimental results.

	RandF			BERT			td-003			gpt-3.5			gpt-16k			gpt-4
	P	R	F₁	P	R	F₁	P	R	F₁	P	R	F₁	P	R	F₁	P	R	F₁
BVA	0.84	0.84	0.83	0.92	0.92	0.92	0.81	0.70	0.73	0.80	0.68	0.71	0.73	0.53	0.57	0.84	0.81	0.82
Citation	0.99	0.98	0.99	1.0	1.0	1.0	0.98	0.94	0.96	0.95	0.90	0.92	0.86	0.40	0.55	0.97	0.88	0.92
Evidence	0.79	0.98	0.87	0.94	0.95	0.94	0.93	0.65	0.77	0.92	0.64	0.75	0.92	0.50	0.64	0.92	0.84	0.88
Finding	0.82	0.66	0.74	0.85	0.88	0.87	0.50	0.56	0.53	0.56	0.59	0.57	0.61	0.39	0.47	0.66	0.73	0.70
Legal rule	0.92	0.85	0.89	0.94	0.96	0.95	0.86	0.61	0.72	0.82	0.52	0.64	0.53	0.64	0.58	0.85	0.84	0.84
Reasoning	0.70	0.27	0.40	0.74	0.70	0.72	0.30	0.72	0.43	0.29	0.76	0.42	0.24	0.82	0.37	0.44	0.63	0.52
CUAD	0.89	0.89	0.89	0.95	0.95	0.95	0.84	0.84	0.83	0.87	0.86	0.86	0.81	0.80	0.80	0.90	0.90	0.90
Anti-assign.	0.91	0.97	0.94	0.99	0.98	0.99	0.81	0.94	0.87	0.93	0.93	0.93	0.92	0.88	0.90	0.92	0.95	0.93
Audit rights	0.86	0.96	0.91	0.96	0.97	0.97	0.95	0.91	0.93	0.96	0.89	0.93	0.89	0.82	0.86	0.94	0.96	0.95
C. not to sue	0.97	0.81	0.88	0.97	0.94	0.96	0.73	0.71	0.72	0.77	0.83	0.80	0.65	0.81	0.72	0.94	0.91	0.93
Governing law	1.0	1.0	1.0	0.99	1.0	1.0	0.99	1.0	0.99	1.0	1.0	1.0	0.99	0.94	0.96	0.98	0.98	0.98
IP assignment	0.90	0.86	0.88	0.94	0.93	0.93	0.75	0.96	0.84	0.71	0.96	0.81	0.63	0.89	0.74	0.90	0.91	0.91
Insurance	0.94	0.97	0.95	0.97	0.97	0.97	0.97	0.95	0.96	0.98	0.95	0.97	0.96	0.87	0.92	0.96	0.98	0.97
Min. commit.	0.82	0.79	0.80	0.92	0.93	0.92	0.68	0.67	0.67	0.82	0.66	0.73	0.71	0.60	0.65	0.82	0.79	0.80
Post-term. S.	0.78	0.76	0.77	0.85	0.85	0.85	0.80	0.42	0.55	0.64	0.78	0.70	0.55	0.70	0.62	0.81	0.79	0.80
Profit sharing	0.82	0.92	0.87	0.94	0.94	0.94	0.76	0.81	0.78	0.88	0.87	0.87	0.77	0.81	0.79	0.91	0.89	0.90
Termination C.	0.90	0.88	0.89	0.95	0.96	0.96	0.83	0.96	0.89	0.85	0.93	0.89	0.80	0.84	0.82	0.86	0.97	0.91
Volume rest.	0.86	0.50	0.63	0.90	0.90	0.90	0.47	0.45	0.46	0.47	0.29	0.36	0.49	0.27	0.35	0.64	0.48	0.55
Warranty dur.	0.95	0.79	0.86	0.95	0.93	0.94	0.82	0.81	0.81	0.91	0.74	0.82	0.80	0.70	0.75	0.92	0.89	0.91
PHASYS	0.69	0.69	0.64	0.74	0.75	0.74	0.64	0.54	0.54	0.68	0.51	0.53	0.68	0.31	0.24	0.67	0.53	0.54
Response	0.69	0.95	0.80	0.80	0.84	0.82	0.72	0.57	0.63	0.79	0.44	0.56	0.83	0.11	0.20	0.78	0.43	0.55
Preparedness	0.63	0.18	0.28	0.64	0.56	0.60	0.33	0.68	0.45	0.32	0.83	0.46	0.25	0.97	0.39	0.33	0.82	0.47
Recovery	0.82	0.40	0.53	0.65	0.63	0.64	0.77	0.17	0.28	0.77	0.36	0.49	0.71	0.09	0.16	0.76	0.49	0.60

Performance is reported as precision (P), recall (R), and F₁ scores. Micro-averaged P, R, and F₁ are used for the overall dataset statistics (the BVA, CUAD, and PHASYS rows). RandF denotes the random forest and BERT denotes the base RoBERTa model. The td-003 columns report the performance of the text-davinci-003 model. The gpt-3.5 and gpt-16k columns report the performance of the gpt-3.5-turbo and gpt-3.5-turbo-16k models, respectively. The gpt-4 columns report the performance of GPT-4, the most powerful of these models. The bold values describe the overall performance of the models on the datasets.
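As a reading aid for the table, the sketch below shows how an F₁ score follows from precision and recall, and how micro-averaging pools counts across classes before computing the three metrics. It is a minimal illustration, not the authors' evaluation code: the function names `f1_score` and `micro_average` and the per-class counts are hypothetical, and only the P and R values of the Evidence row (RandF columns) are taken from the table.

```python
from typing import Iterable, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(
    counts: Iterable[Tuple[int, int, int]],
) -> Tuple[float, float, float]:
    """Pool (tp, fp, fn) counts over all classes, then compute P, R, F1."""
    tp = fp = fn = 0
    for c_tp, c_fp, c_fn in counts:
        tp += c_tp
        fp += c_fp
        fn += c_fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Evidence row, RandF columns of the table: P = 0.79, R = 0.98.
evidence_f1 = f1_score(0.79, 0.98)
print(round(evidence_f1, 2))

# Hypothetical (tp, fp, fn) counts for two classes; NOT taken from the paper.
micro_p, micro_r, micro_f1 = micro_average([(8, 2, 1), (3, 1, 4)])
print(micro_p, micro_r, micro_f1)
```

Because the table rounds P and R to two decimals, an F₁ recomputed from the printed values can differ from the printed F₁ in the last digit.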