Table 4.
Data set / type | RandF P | RandF R | RandF F1 | BERT P | BERT R | BERT F1 | td-003 P | td-003 R | td-003 F1 | gpt-3.5 P | gpt-3.5 R | gpt-3.5 F1 | gpt-16k P | gpt-16k R | gpt-16k F1 | gpt-4 P | gpt-4 R | gpt-4 F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BVA | **0.84** | **0.84** | **0.83** | **0.92** | **0.92** | **0.92** | **0.81** | **0.70** | **0.73** | **0.80** | **0.68** | **0.71** | **0.73** | **0.53** | **0.57** | **0.84** | **0.81** | **0.82** |
Citation | 0.99 | 0.98 | 0.99 | 1.0 | 1.0 | 1.0 | 0.98 | 0.94 | 0.96 | 0.95 | 0.90 | 0.92 | 0.86 | 0.40 | 0.55 | 0.97 | 0.88 | 0.92 |
Evidence | 0.79 | 0.98 | 0.87 | 0.94 | 0.95 | 0.94 | 0.93 | 0.65 | 0.77 | 0.92 | 0.64 | 0.75 | 0.92 | 0.50 | 0.64 | 0.92 | 0.84 | 0.88 |
Finding | 0.82 | 0.66 | 0.74 | 0.85 | 0.88 | 0.87 | 0.50 | 0.56 | 0.53 | 0.56 | 0.59 | 0.57 | 0.61 | 0.39 | 0.47 | 0.66 | 0.73 | 0.70 |
Legal rule | 0.92 | 0.85 | 0.89 | 0.94 | 0.96 | 0.95 | 0.86 | 0.61 | 0.72 | 0.82 | 0.52 | 0.64 | 0.53 | 0.64 | 0.58 | 0.85 | 0.84 | 0.84 |
Reasoning | 0.70 | 0.27 | 0.40 | 0.74 | 0.70 | 0.72 | 0.30 | 0.72 | 0.43 | 0.29 | 0.76 | 0.42 | 0.24 | 0.82 | 0.37 | 0.44 | 0.63 | 0.52 |
CUAD | **0.89** | **0.89** | **0.89** | **0.95** | **0.95** | **0.95** | **0.84** | **0.84** | **0.83** | **0.87** | **0.86** | **0.86** | **0.81** | **0.80** | **0.80** | **0.90** | **0.90** | **0.90** |
Anti-assign. | 0.91 | 0.97 | 0.94 | 0.99 | 0.98 | 0.99 | 0.81 | 0.94 | 0.87 | 0.93 | 0.93 | 0.93 | 0.92 | 0.88 | 0.90 | 0.92 | 0.95 | 0.93 |
Audit rights | 0.86 | 0.96 | 0.91 | 0.96 | 0.97 | 0.97 | 0.95 | 0.91 | 0.93 | 0.96 | 0.89 | 0.93 | 0.89 | 0.82 | 0.86 | 0.94 | 0.96 | 0.95 |
C. not to sue | 0.97 | 0.81 | 0.88 | 0.97 | 0.94 | 0.96 | 0.73 | 0.71 | 0.72 | 0.77 | 0.83 | 0.80 | 0.65 | 0.81 | 0.72 | 0.94 | 0.91 | 0.93 |
Governing law | 1.0 | 1.0 | 1.0 | 0.99 | 1.0 | 1.0 | 0.99 | 1.0 | 0.99 | 1.0 | 1.0 | 1.0 | 0.99 | 0.94 | 0.96 | 0.98 | 0.98 | 0.98 |
IP assignment | 0.90 | 0.86 | 0.88 | 0.94 | 0.93 | 0.93 | 0.75 | 0.96 | 0.84 | 0.71 | 0.96 | 0.81 | 0.63 | 0.89 | 0.74 | 0.90 | 0.91 | 0.91 |
Insurance | 0.94 | 0.97 | 0.95 | 0.97 | 0.97 | 0.97 | 0.97 | 0.95 | 0.96 | 0.98 | 0.95 | 0.97 | 0.96 | 0.87 | 0.92 | 0.96 | 0.98 | 0.97 |
Min. commit. | 0.82 | 0.79 | 0.80 | 0.92 | 0.93 | 0.92 | 0.68 | 0.67 | 0.67 | 0.82 | 0.66 | 0.73 | 0.71 | 0.60 | 0.65 | 0.82 | 0.79 | 0.80 |
Post-term. S. | 0.78 | 0.76 | 0.77 | 0.85 | 0.85 | 0.85 | 0.80 | 0.42 | 0.55 | 0.64 | 0.78 | 0.70 | 0.55 | 0.70 | 0.62 | 0.81 | 0.79 | 0.80 |
Profit sharing | 0.82 | 0.92 | 0.87 | 0.94 | 0.94 | 0.94 | 0.76 | 0.81 | 0.78 | 0.88 | 0.87 | 0.87 | 0.77 | 0.81 | 0.79 | 0.91 | 0.89 | 0.90 |
Termination C. | 0.90 | 0.88 | 0.89 | 0.95 | 0.96 | 0.96 | 0.83 | 0.96 | 0.89 | 0.85 | 0.93 | 0.89 | 0.80 | 0.84 | 0.82 | 0.86 | 0.97 | 0.91 |
Volume rest. | 0.86 | 0.50 | 0.63 | 0.90 | 0.90 | 0.90 | 0.47 | 0.45 | 0.46 | 0.47 | 0.29 | 0.36 | 0.49 | 0.27 | 0.35 | 0.64 | 0.48 | 0.55 |
Warranty dur. | 0.95 | 0.79 | 0.86 | 0.95 | 0.93 | 0.94 | 0.82 | 0.81 | 0.81 | 0.91 | 0.74 | 0.82 | 0.80 | 0.70 | 0.75 | 0.92 | 0.89 | 0.91 |
PHASYS | **0.69** | **0.69** | **0.64** | **0.74** | **0.75** | **0.74** | **0.64** | **0.54** | **0.54** | **0.68** | **0.51** | **0.53** | **0.68** | **0.31** | **0.24** | **0.67** | **0.53** | **0.54** |
Response | 0.69 | 0.95 | 0.80 | 0.80 | 0.84 | 0.82 | 0.72 | 0.57 | 0.63 | 0.79 | 0.44 | 0.56 | 0.83 | 0.11 | 0.20 | 0.78 | 0.43 | 0.55 |
Preparedness | 0.63 | 0.18 | 0.28 | 0.64 | 0.56 | 0.60 | 0.33 | 0.68 | 0.45 | 0.32 | 0.83 | 0.46 | 0.25 | 0.97 | 0.39 | 0.33 | 0.82 | 0.47 |
Recovery | 0.82 | 0.40 | 0.53 | 0.65 | 0.63 | 0.64 | 0.77 | 0.17 | 0.28 | 0.77 | 0.36 | 0.49 | 0.71 | 0.09 | 0.16 | 0.76 | 0.49 | 0.60 |
Performance is reported in terms of precision (P), recall (R), and F1 scores. Micro-averaged P, R, and F1 are used for the overall data set statistics (BVA, CUAD, and PHASYS rows). RandF denotes the random forest model and BERT denotes the base RoBERTa model. The td-003 columns report the performance of the text-davinci-003 model. The gpt-3.5 and gpt-16k columns report the performance of the gpt-3.5-turbo and gpt-3.5-turbo-16k models, respectively. The gpt-4 columns report the performance of the most powerful model, GPT-4. The bold values describe the overall performance of the models on each data set.
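
For readers who want to check the table's arithmetic, the following is a minimal sketch of how F1 relates to P and R, and of how micro-averaged scores could be computed from per-class counts. It assumes the standard definitions (F1 as the harmonic mean of precision and recall; micro-averaging as pooling true positives, false positives, and false negatives across classes), which the caption names but does not spell out. The function names and the `toy_counts` values are hypothetical and are not taken from the paper. Because the table reports P and R rounded to two decimals, an F1 recomputed from them can differ from the printed F1 by about 0.01.

```python
from typing import Dict, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Micro-averaged P, R, F1 from per-class (tp, fp, fn) counts.

    Counts are pooled over all classes before the ratios are taken.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Evidence row, RandF columns of the table: P = 0.79, R = 0.98.
evidence_f1 = f1_score(0.79, 0.98)
print(round(evidence_f1, 2))  # 0.87, matching the table

# Hypothetical (tp, fp, fn) counts for two made-up classes,
# used only to illustrate the pooling step.
toy_counts = {"A": (8, 2, 1), "B": (3, 1, 4)}
micro_p, micro_r, micro_f1 = micro_prf(toy_counts)
print(micro_p, micro_r, micro_f1)
```

Pooling the counts first means that frequent classes dominate the micro-averaged scores, which is worth keeping in mind when comparing the overall rows with rare types such as Reasoning or Recovery.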