Table 1.
| Question | Model answer | Reference answer | Generations for entropy | Prediction (semantic entropy) | Prediction (naive entropy) |
|---|---|---|---|---|---|
| Refineries, process chemical, power generation, mills and manufacturing plants are under what sector of construction? | Refineries, process chemical, power generation, mills and manufacturing plants are under the industrial sector of construction. | Industrial or heavy civil or heavy engineering. | • All the above are under the industrial sector of construction. | Not confabulation | Confabulation |
| | | | • The refineries, process chemical, power generation, mills and manufacturing plants are under the industrial sector of construction. | | |
| | | | • These are all under the heavy industrial sector of construction. | | |
| What is the function of the stard10 protein? | The stard10 protein functions as a negative regulator of the mTOR pathway by inhibiting the activity of the mTORC1 and mTORC2 complexes. | STARD10 is a lipid transfer protein. | • The stard10 protein plays a crucial role in regulating the dynamics of the actin cytoskeleton and cell adhesion during cell migration and cell division. | Confabulation | Confabulation |
| | | | • The stard10 protein functions as a negative regulator of the meiotic recombination process, specifically by inhibiting the activity of the Dmc1 recombinase enzyme. | | |
| | | | • The stard10 protein functions as a regulator of lipid metabolism, promoting the breakdown of lipids in the liver and inhibiting their synthesis in the liver and adipose tissue. | | |
| Of which European country is ‘fado’ the national music? | Portugal. | Portugal. | • ‘Fado’ is the national music of Portugal. | Not confabulation | Not confabulation |
| | | | • ‘Fado’ is the national music of Portugal. | | |
| | | | • Fado is the national music of Portugal. | | |
| When was BSkyB’s digital service launched? | 1 October 1998. | BSkyB’s digital service was launched in 1998. | • BSkyB’s digital service was launched in 1998. | Confabulation | Not confabulation |
| | | | • BSkyB’s digital service was launched on 1 October 1998. | | |
| | | | • BSkyB’s digital service was launched on 1 October 1998. | | |
The first row of Table 1 shows a case in which semantic entropy correctly predicts that an answer is not a confabulation, whereas naive entropy would incorrectly predict a confabulation. All of the model's generations mean the same thing, so they are clustered together despite their different phrasings. The second row is an example in which semantic entropy and naive entropy would both correctly predict a confabulation: each generation is lexically distinct and also means something different. The third row is an example in which both would correctly predict no confabulation, because the generations are almost lexically identical. The fourth row is an example in which semantic entropy might fail but naive entropy might succeed. In our experiment, semantic entropy split the answers into one cluster that gave a specific date and another that gave only a year, and therefore treated the model as ‘uncertain’, even though both kinds of answer agree on the year. This highlights the importance of context in semantic clustering. The examples come from LLaMA 2 Chat 70B generations for SQuAD, BioASQ and TriviaQA.
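
The contrast in the first two rows can be illustrated with a minimal sketch. It is a simplification made for illustration, not the procedure used in the experiments: the meaning clusters are assigned by hand rather than by an automatic equivalence check, the "naive" variant simply treats every distinct string as its own outcome instead of using the model's sequence probabilities, and the names `discrete_entropy`, `row1_clusters` and `row2_clusters` are hypothetical.

```python
import math
from collections import Counter


def discrete_entropy(labels):
    """Shannon entropy (in nats) of the empirical distribution over labels."""
    counts = Counter(labels)
    total = len(labels)
    # Sum p * log(1/p) over the observed outcomes.
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Generations from the first row of Table 1.
row1_generations = [
    "All the above are under the industrial sector of construction.",
    "The refineries, process chemical, power generation, mills and "
    "manufacturing plants are under the industrial sector of construction.",
    "These are all under the heavy industrial sector of construction.",
]
# ASSUMPTION: hand-assigned meaning clusters. All three answers are treated
# as sharing one meaning, as described in the text for this row.
row1_clusters = [0, 0, 0]

# ASSUMPTION: for the second row, each generation is given its own cluster,
# because the text says each one means something different.
row2_clusters = [0, 1, 2]

# Simplified "naive" entropy: every distinct string is a separate outcome.
naive_row1 = discrete_entropy(row1_generations)
# Entropy over meaning clusters instead of over strings.
semantic_row1 = discrete_entropy(row1_clusters)
semantic_row2 = discrete_entropy(row2_clusters)

print(f"row 1 naive entropy:    {naive_row1:.4f}")
print(f"row 1 semantic entropy: {semantic_row1:.4f}")
print(f"row 2 semantic entropy: {semantic_row2:.4f}")
```

Under these assumptions the three differently worded answers in the first row give the maximum string-level entropy for three samples (ln 3 ≈ 1.0986), while the entropy over meanings is zero. In the second row the entropy over meanings is itself ln 3, because no two generations share a cluster. The sketch does not reproduce the third and fourth rows, where the naive prediction depends on the model's token probabilities rather than on exact string matches.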