2024 Mar 5;165(7):1434–1449. doi: 10.1097/j.pain.0000000000003195

Table 3.

Similarity metrics comparing our definition with those from GPT-3.5/GPT-3.

Large language model used	Instructions/Prompts used in GPT-3.5/GPT-3	Vector similarity	ROUGE-1 F1 score	ROUGE-1 F1 score (nouns and adjectives)
GPT-3.5 (Jan 30, 2023 release)	Based on the verbatim quotes loaded in 2 batches and summarized	0.965	0.568	0.436
GPT-3 text-davinci-003 (with settings: temperature=0, max_tokens=1500)	A 25-word window around each occurrence of “suffering” yielded 8910 strings, which were processed in 297 iterations with the following prompts: Step 1: “Using these phrases: <First 30 strings> Create a one paragraph scientific definition of suffering. Suffering is” Steps 2-296: “Using this definition: suffering is <Definition from Step 1> And these phrases: <Next 30 strings> Create a one paragraph scientific definition of suffering. Suffering is”	0.941	0.491	0.349
BASELINE	[pain-related suffering is] the state of severe distress associated with events that threaten the intactness of the person vs [pain-related suffering is] an unpleasant experience associated with negative cognitive, emotional, and autonomic response to a stimulus	0.913	0.372	0.222

The table shows the similarity values obtained by comparing the definitions generated by GPT-3.5 and GPT-3 with the integrative definition derived using conventional qualitative methods. The bottom row gives baseline similarity values, computed between the two definitions shown in that row, for comparison. Vector similarity was calculated using embeddings from the text-embedding-ada-002 GPT-3 model.
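
To make the two kinds of metric in the table concrete, the following is a minimal Python sketch of how a ROUGE-1 F1 score and a vector similarity could be computed. It is not the authors' code. The tokenizer (lowercased alphanumeric tokens, no stemming or stop-word removal) and the function names `tokenize`, `rouge1_f1`, and `cosine_similarity` are assumptions made for illustration, as is the choice of cosine similarity as the vector similarity measure. The embedding vectors themselves would come from an embedding model such as text-embedding-ada-002 and are not produced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercased alphanumeric tokens, no stemming or stop-word removal.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate text and a reference text."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Under these assumptions, `rouge1_f1(generated_definition, integrative_definition)` would give a score comparable in kind to the ROUGE-1 F1 column, and `cosine_similarity` applied to the two definitions' embedding vectors would correspond to the vector similarity column. Exact values depend on tokenization and preprocessing, so this sketch should not be expected to reproduce the figures in the table. The nouns-and-adjectives variant would additionally require a part-of-speech tagger to filter tokens before scoring.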