Table 3.
Similarity metrics comparing our definition with those from GPT-3.5/GPT-3.
| Large language model used | Instructions/Prompts used in GPT-3.5/GPT-3 | Vector similarity | ROUGE-1 F1 score | ROUGE-1 F1 score (nouns and adjectives) |
|---|---|---|---|---|
| GPT-3.5 (Jan 30, 2023 release) | Based on the verbatim quotes, loaded in 2 batches and summarized | 0.965 | 0.568 | 0.436 |
| GPT-3 text-davinci-003 (settings: temperature=0, max_tokens=1500) | Word window of 25 words around “suffering,” yielding 8910 strings processed in 297 iterations with the following prompts. Step 1: “Using these phrases: <First 30 strings> Create a one paragraph scientific definition of suffering. Suffering is”. Steps 2-297: “Using this definition: suffering is <Definition from Step 1> And these phrases: <Next 30 strings> Create a one paragraph scientific definition of suffering. Suffering is” | 0.941 | 0.491 | 0.349 |
| BASELINE | “[Pain-related suffering is] the state of severe distress associated with events that threaten the intactness of the person” vs. “[pain-related suffering is] an unpleasant experience associated with negative cognitive, emotional, and autonomic response to a stimulus” | 0.913 | 0.372 | 0.222 |
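The iterative prompting procedure described in the GPT-3 row can be sketched as a loop that seeds a definition from the first batch of phrases and then refines it batch by batch. This is a minimal sketch: `llm` is a hypothetical stand-in for a completion call (e.g. to text-davinci-003 with temperature=0, max_tokens=1500), not the authors' actual client code.

```python
from typing import Callable, List


def iterative_definition(strings: List[str],
                         llm: Callable[[str], str],
                         batch_size: int = 30) -> str:
    """Refine a one-paragraph definition over successive batches of phrases.

    `llm` is an assumed placeholder for an LLM completion call; with 8910
    strings and batch_size=30 this performs 297 iterations, matching the
    procedure in the table.
    """
    definition = ""
    for start in range(0, len(strings), batch_size):
        phrases = "; ".join(strings[start:start + batch_size])
        if not definition:
            # Step 1: seed the definition from the first batch of phrases.
            prompt = (f"Using these phrases: {phrases} Create a one "
                      "paragraph scientific definition of suffering. "
                      "Suffering is")
        else:
            # Later steps: feed the running definition back in with the
            # next batch of phrases.
            prompt = (f"Using this definition: suffering is {definition} "
                      f"And these phrases: {phrases} Create a one "
                      "paragraph scientific definition of suffering. "
                      "Suffering is")
        definition = llm(prompt).strip()
    return definition
```

Because each step conditions on the previous output, the final definition depends on batch order; fixing temperature=0 makes each individual step deterministic.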
The table displays the similarity values obtained by comparing the definitions generated by GPT-3.5 and GPT-3 against the integrative definition obtained with conventional qualitative methods; the bottom row provides baseline similarity values for comparison. Vector similarity was calculated using embeddings from the GPT-3 text-embedding-ada-002 model.
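The two metric families in the table can be illustrated with a minimal sketch: cosine similarity between embedding vectors (such as those returned by text-embedding-ada-002) and ROUGE-1 F1 as unigram overlap. The whitespace tokenization and lowercasing here are assumptions for illustration; the paper's exact tokenizer, and the noun/adjective filtering used for the last column, are not reproduced.

```python
from collections import Counter
from math import sqrt
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall,
    using simple lowercased whitespace tokenization (an assumption)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Identical texts score 1.0 on both metrics; the baseline row shows that even two unrelated definitions of the same concept retain high vector similarity (0.913), which is why the ROUGE-1 columns are reported alongside it.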