2024 Mar 5;165(7):1434–1449. doi: 10.1097/j.pain.0000000000003195

Table 3.

Similarity metrics comparing our definition with those from GPT-3.5/GPT-3.

| Large language model used | Instructions/Prompts used in GPT-3.5/GPT-3 | Vector similarity | ROUGE-1 F1 score | ROUGE-1 F1 score (nouns and adjectives) |
|---|---|---|---|---|
| GPT-3.5 (Jan 30, 2023 release) | Based on the verbatim quotes loaded in 2 batches and summarized | 0.965 | 0.568 | 0.436 |
| GPT-3 text-davinci-003 (with settings: temperature=0, max_tokens=1500) | Word window of 25 words around “suffering.” That is 8910 strings; 297 iterations were done with following prompts: Step 1: “Using these phrases: <First 30 strings> Create a one paragraph scientific definition of suffering. Suffering is” Steps 2-296: “Using this definition: suffering is <Definition from Step 1> And these phrases: <Next 30 strings> Create a one paragraph scientific definition of suffering. Suffering is” | 0.941 | 0.491 | 0.349 |
| BASELINE | [pain-related suffering is] the state of severe distress associated with events that threaten the intactness of the person vs [pain-related suffering is] an unpleasant experience associated with negative cognitive, emotional, and autonomic response to a stimulus | 0.913 | 0.372 | 0.222 |

The table shows the similarity values obtained by comparing the definitions generated by GPT-3.5 and GPT-3 with the integrative definition obtained with conventional qualitative methods; the bottom row gives baseline similarity values for comparison. Vector similarity was calculated using embeddings from the text-embedding-ada-002 GPT-3 model.
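
To illustrate the two kinds of metric reported in the table, the following is a minimal sketch, not the authors' implementation. It assumes a simple lowercase word tokenizer with no stemming, computes ROUGE-1 F1 as the harmonic mean of unigram precision and recall, and computes vector similarity as cosine similarity between two embedding vectors that are taken as given. The function names `tokenize`, `rouge1_f1`, and `cosine_similarity` are hypothetical, and obtaining embeddings from text-embedding-ada-002 is outside the sketch. The nouns-and-adjectives variant would additionally require filtering tokens with a part-of-speech tagger, which is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (assumed tokenizer, no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (e.g. text embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these assumptions, `rouge1_f1(generated_definition, integrative_definition)` would give a score comparable in kind to the ROUGE-1 column, although exact values depend on the tokenizer and any stemming applied. Because ROUGE-1 counts only shared words, two definitions with similar meaning but different wording can score low on it while still scoring high on embedding-based cosine similarity, which is consistent with the gap between the two columns in the table.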