2023 Nov 17;6:1279794. doi: 10.3389/frai.2023.1279794

Table 4.

Experimental results.

Dataset / class  | RandF          | BERT           | td-003         | gpt-3.5        | gpt-16k        | gpt-4
                 | P    R    F1   | P    R    F1   | P    R    F1   | P    R    F1   | P    R    F1   | P    R    F1
BVA              | 0.84 0.84 0.83 | 0.92 0.92 0.92 | 0.81 0.70 0.73 | 0.80 0.68 0.71 | 0.73 0.53 0.57 | 0.84 0.81 0.82
  Citation       | 0.99 0.98 0.99 | 1.0  1.0  1.0  | 0.98 0.94 0.96 | 0.95 0.90 0.92 | 0.86 0.40 0.55 | 0.97 0.88 0.92
  Evidence       | 0.79 0.98 0.87 | 0.94 0.95 0.94 | 0.93 0.65 0.77 | 0.92 0.64 0.75 | 0.92 0.50 0.64 | 0.92 0.84 0.88
  Finding        | 0.82 0.66 0.74 | 0.85 0.88 0.87 | 0.50 0.56 0.53 | 0.56 0.59 0.57 | 0.61 0.39 0.47 | 0.66 0.73 0.70
  Legal rule     | 0.92 0.85 0.89 | 0.94 0.96 0.95 | 0.86 0.61 0.72 | 0.82 0.52 0.64 | 0.53 0.64 0.58 | 0.85 0.84 0.84
  Reasoning      | 0.70 0.27 0.40 | 0.74 0.70 0.72 | 0.30 0.72 0.43 | 0.29 0.76 0.42 | 0.24 0.82 0.37 | 0.44 0.63 0.52
CUAD             | 0.89 0.89 0.89 | 0.95 0.95 0.95 | 0.84 0.84 0.83 | 0.87 0.86 0.86 | 0.81 0.80 0.80 | 0.90 0.90 0.90
  Anti-assign.   | 0.91 0.97 0.94 | 0.99 0.98 0.99 | 0.81 0.94 0.87 | 0.93 0.93 0.93 | 0.92 0.88 0.90 | 0.92 0.95 0.93
  Audit rights   | 0.86 0.96 0.91 | 0.96 0.97 0.97 | 0.95 0.91 0.93 | 0.96 0.89 0.93 | 0.89 0.82 0.86 | 0.94 0.96 0.95
  C. not to sue  | 0.97 0.81 0.88 | 0.97 0.94 0.96 | 0.73 0.71 0.72 | 0.77 0.83 0.80 | 0.65 0.81 0.72 | 0.94 0.91 0.93
  Governing law  | 1.0  1.0  1.0  | 0.99 1.0  1.0  | 0.99 1.0  0.99 | 1.0  1.0  1.0  | 0.99 0.94 0.96 | 0.98 0.98 0.98
  IP assignment  | 0.90 0.86 0.88 | 0.94 0.93 0.93 | 0.75 0.96 0.84 | 0.71 0.96 0.81 | 0.63 0.89 0.74 | 0.90 0.91 0.91
  Insurance      | 0.94 0.97 0.95 | 0.97 0.97 0.97 | 0.97 0.95 0.96 | 0.98 0.95 0.97 | 0.96 0.87 0.92 | 0.96 0.98 0.97
  Min. commit.   | 0.82 0.79 0.80 | 0.92 0.93 0.92 | 0.68 0.67 0.67 | 0.82 0.66 0.73 | 0.71 0.60 0.65 | 0.82 0.79 0.80
  Post-term. S.  | 0.78 0.76 0.77 | 0.85 0.85 0.85 | 0.80 0.42 0.55 | 0.64 0.78 0.70 | 0.55 0.70 0.62 | 0.81 0.79 0.80
  Profit sharing | 0.82 0.92 0.87 | 0.94 0.94 0.94 | 0.76 0.81 0.78 | 0.88 0.87 0.87 | 0.77 0.81 0.79 | 0.91 0.89 0.90
  Termination C. | 0.90 0.88 0.89 | 0.95 0.96 0.96 | 0.83 0.96 0.89 | 0.85 0.93 0.89 | 0.80 0.84 0.82 | 0.86 0.97 0.91
  Volume rest.   | 0.86 0.50 0.63 | 0.90 0.90 0.90 | 0.47 0.45 0.46 | 0.47 0.29 0.36 | 0.49 0.27 0.35 | 0.64 0.48 0.55
  Warranty dur.  | 0.95 0.79 0.86 | 0.95 0.93 0.94 | 0.82 0.81 0.81 | 0.91 0.74 0.82 | 0.80 0.70 0.75 | 0.92 0.89 0.91
PHASYS           | 0.69 0.69 0.64 | 0.74 0.75 0.74 | 0.64 0.54 0.54 | 0.68 0.51 0.53 | 0.68 0.31 0.24 | 0.67 0.53 0.54
  Response       | 0.69 0.95 0.80 | 0.80 0.84 0.82 | 0.72 0.57 0.63 | 0.79 0.44 0.56 | 0.83 0.11 0.20 | 0.78 0.43 0.55
  Preparedness   | 0.63 0.18 0.28 | 0.64 0.56 0.60 | 0.33 0.68 0.45 | 0.32 0.83 0.46 | 0.25 0.97 0.39 | 0.33 0.82 0.47
  Recovery       | 0.82 0.40 0.53 | 0.65 0.63 0.64 | 0.77 0.17 0.28 | 0.77 0.36 0.49 | 0.71 0.09 0.16 | 0.76 0.49 0.60

Performance is reported as precision (P), recall (R), and F1 scores. Micro-averaged P, R, and F1 are used for the overall dataset statistics (BVA, CUAD, and PHASYS rows). RandF denotes the random forest model and BERT denotes the base RoBERTa model. The td-003 columns report the performance of the text-davinci-003 model. The gpt-3.5 and gpt-16k columns report the performance of the gpt-3.5-turbo and gpt-3.5-turbo-16k models, respectively. The gpt-4 columns report the performance of GPT-4, the most powerful of the GPT models evaluated. The bold values describe the overall performance of the models on the datasets.
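
For readers who want to relate the P, R, and F1 columns to one another, the following is a minimal sketch, not the authors' evaluation code. It assumes the standard definitions: F1 is the harmonic mean of precision and recall, and micro-averaging pools true positives, false positives, and false negatives across classes before computing the scores. The function names `f1_score` and `micro_prf` and the example counts are hypothetical. Because the table values are rounded to two decimals, an F1 recomputed from the printed P and R can differ from the printed F1 by 0.01 in some cells.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(counts):
    """Micro-averaged (P, R, F1) from per-class (tp, fp, fn) tuples.

    The counts are pooled over all classes first, then the scores
    are computed once on the pooled totals.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Check two RandF cells of the table from their printed P and R.
print(round(f1_score(0.86, 0.50), 2))  # Volume rest. -> 0.63
print(round(f1_score(0.63, 0.18), 2))  # Preparedness -> 0.28

# Hypothetical per-class (tp, fp, fn) counts, for illustration only.
example_counts = [(8, 2, 1), (5, 1, 3)]
p, r, f = micro_prf(example_counts)
print(round(p, 4), round(r, 4), round(f, 4))
```

The per-class check only uses numbers printed in the table; the pooled example uses made-up counts, since the underlying confusion counts are not part of the table.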