
Table 1.

Overview of studies benchmarking LLMs against human annotators.

For each study, the entries below give the main research question, data/domain, language, tasks evaluated, LLMs & setting, human baseline, key findings, and key differences.

Gilardi et al. (2023)16
Main research question: Can GPT-3.5-turbo (ChatGPT) zero-shot match or outperform MTurk crowd workers’ annotations for social-media moderation and related classification tasks at lower cost?
Data/Domain: 6,183 tweets and news articles
Language: English
Tasks evaluated: Relevance (content moderation; politics), stance detection (Section 230), topic detection, general frame detection (problem/solution), and policy frame detection (14 classes)
LLMs & setting: GPT-3.5-turbo (ChatGPT), zero-shot prompting (temperature 1 and 0.2)
Human baseline: MTurk crowd workers; trained research assistants as gold standard
Key findings: Zero-shot accuracy about 25 percentage points higher than MTurk; inter-coder agreement 91–97% vs 56% (MTurk) and 79% (trained annotators); per-annotation cost $0.003 (about 30× cheaper than MTurk)
Key differences: Short social-media/news texts in English; classification tasks only, with no NER, entity counting, or source/target attribution; no text-length/difficulty analysis; does include a per-annotation cost comparison (about 30× cheaper than MTurk)

Törnberg (2024)50
Main research question: To what extent can GPT-4 (zero-shot) identify politicians’ party affiliation from a single Twitter post compared to supervised classifiers, expert coders, and crowd workers across 11 countries?
Data/Domain: Random sample of parliamentarian tweets across 11 countries (250 per party; balanced 500 per country)
Language: Multilingual (local languages of the 11 countries)
Tasks evaluated: Binary party-affiliation classification
LLMs & setting: GPT-4 (gpt-4-0314), zero-shot prompting
Human baseline: Expert political-science coders (majority vote) and Master-qualified MTurk workers (plurality)
Key findings: Accuracy: GPT-4 = 93.4%; best human baseline (expert majority vote) = 86.0% (macro F1: 0.934 vs. 0.860)
Key differences: Multilingual tweets (11 countries); short single-tweet inputs; binary party-affiliation classification only, with no NER, entity counting, or source/target attribution; no text-length/difficulty or cost analysis

Ziems et al. (2024)51
Main research question: Are zero-shot LLMs capable of reliably classifying and generating explanations for social-science phenomena?
Data/Domain: 25 representative English computational social science benchmarks spanning utterance-, conversation-, and document-level corpora
Language: English
Tasks evaluated: Twenty classification tasks plus five free-text generation tasks (summaries/explanations)
LLMs & setting: Thirteen LLMs, including FLAN-T5 variants and OpenAI’s GPT-3 (davinci-002/003), GPT-3.5-Turbo, and GPT-4, evaluated zero-shot
Human baseline: Published gold labels and crowd-written explanations
Key findings: LLMs achieved fair agreement with humans on classification (though not surpassing the best fine-tuned models) and matched or exceeded reference explanations in generative tasks
Key differences: English benchmarks covering short utterances, conversations, and full documents; mixes classification, explanation, and structured extraction tasks (event-argument extraction); still omits explicit entity-count analysis; no text-length/difficulty or cost analysis

Bojić et al. (2025)52
Main research question: How reliably do LLMs compare to human annotators across sentiment analysis, political leaning, emotional intensity, and sarcasm detection?
Data/Domain: 100 curated textual items from Stanford Sentiment Treebank, Sentiment140, Iyyer et al.’s political-ideology data, EmoBank, and Sarcasm Corpus V2
Language: English
Tasks evaluated: Sentiment analysis; political leaning; emotional intensity; sarcasm detection
LLMs & setting: GPT-3.5-turbo-16k, GPT-4, GPT-4o, GPT-4o-mini, Gemini 1.5 Pro, Llama-3.1-70B, Mixtral 8×7B (zero-shot; Hard Prompt for GPT-4o)
Human baseline: 33 trained annotators
Key findings: Matched human annotators on sentiment analysis (agreement of 0.95 for both groups); outperformed humans on political leaning (0.80 vs 0.55 agreement); showed higher consistency on emotional intensity (0.85 vs 0.65 agreement); both groups showed low consistency on sarcasm (0.25 agreement)
Key differences: Small English sentence-level corpus; sentiment, political leaning, emotional intensity, and sarcasm only, with no NER, entity counting, or source/target attribution; no text-length/difficulty or cost analysis

Kaikaus et al. (2023)53
Main research question: Are GPT-3.5 and GPT-4 annotations of quarterly earnings-call Q&A statements as reliable and cost-effective as those from domain-expert human annotators?
Data/Domain: 1,198 Q&A statements from 30 quarterly earnings calls (2010–2022)
Language: English
Tasks evaluated: Emotion, sentiment, cognitive dissonance
LLMs & setting: GPT-3.5-turbo-16k and GPT-4-32k; zero-shot with four prompt-engineering approaches; temperature = 0
Human baseline: Domain-expert crowdworkers (accounting graduate students)
Key findings: LLMs produced more consistent and reliable annotations than humans, while reducing annotation time and cost
Key differences: English finance domain; moderate-length earnings-call Q&A statements; emotion, sentiment, and cognitive-dissonance classification only, with no NER, entity counting, or source/target attribution; no text-length/difficulty analysis; does report substantial time/cost savings with LLMs

Huang et al. (2023)54
Main research question: Can ChatGPT detect implicit hateful tweets and generate concise natural-language explanations comparable to human annotators?
Data/Domain: 6,358 implicit hateful tweets (LatentHatred dataset)
Language: English
Tasks evaluated: Binary implicit-hate classification and one-sentence explanation
LLMs & setting: ChatGPT (GPT-3.5, Jan 9 version), zero-shot
Human baseline: Original human annotators
Key findings: 80% agreement with original labels; explanations show clarity and informativeness on par with humans
Key differences: English tweets focused on implicit hate; binary classification plus one-sentence explanations, with no NER, entity counting, or source/target attribution; no text-length/difficulty or cost analysis

Leas et al. (2024)55
Main research question: Can ChatGPT match specialists in detecting adverse events in social-media posts about cannabis?
Data/Domain: 10,000 Reddit r/delta8 posts
Language: English
Tasks evaluated: Binary detection of any adverse event and classification of seriousness according to FDA MedWatch categories
LLMs & setting: GPT-3.5-turbo-0613, default settings (zero-shot)
Human baseline: Trained biomedical annotators
Key findings: 94.4% agreement (Fleiss kappa = 0.95) for any adverse event; 99.3% agreement (kappa = 0.96) for serious adverse events; over 99.9% agreement for specific serious-outcome categories; 0.35% misformatted responses resolved via cleaning
Key differences: English biomedical Reddit posts (r/delta8); binary adverse-event detection and seriousness classification only, with no NER, entity counting, or source/target attribution; no text-length/difficulty analysis; estimated time savings based on human annotation rate (about 30× faster than manual coding); no direct cost analysis

Our study
Main research question: Do state-of-the-art LLMs outperform outsourced human coders on complex Spanish news articles, and how robust is their performance to text length and difficulty?
Data/Domain: 210 Spanish fiscal-policy news articles
Language: Spanish
Tasks evaluated: Named-entity recognition (T1); entity count (T2); binary criticism detection (T3); multi-label source attribution (T4); multi-label target attribution (T5)
LLMs & setting: GPT-3.5-turbo; GPT-4-turbo; Claude 3 Opus; Claude 3.5 Sonnet (zero-shot, temperature = 0)
Human baseline: Outsourced ESADE student coders via Qualtrics
Key findings: LLMs consistently outperform human coders on all tasks; their performance degrades less on long or difficult texts than the human coders’ does; their internal consistency is higher; the newest models perform best
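
Several of the benchmarked studies, and our own setup, annotate with zero-shot prompts at temperature 0. The snippet below is a minimal sketch of such a call using the OpenAI Python SDK; the model alias, prompt wording, and function name are illustrative assumptions rather than the exact prompt used in any of these studies.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_criticism(article_text: str, model: str = "gpt-4-turbo") -> str:
    """Zero-shot binary criticism detection (cf. task T3); prompt wording is illustrative."""
    prompt = (
        "You will read a Spanish news article about fiscal policy. "
        "Answer with exactly one word, 'yes' or 'no': does the article "
        "contain criticism of fiscal policy?\n\n" + article_text
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # zero-temperature decoding, as listed in the table
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()
```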
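The agreement figures reported across these studies mix raw percent agreement, chance-corrected kappas, and macro F1. The sketch below shows how such metrics are typically computed with scikit-learn and statsmodels; the label vectors are invented toy data, not values from any of the studies.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy labels for one binary task: an LLM run vs. a human reference (illustrative only).
llm   = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
human = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 1])

percent_agreement = (llm == human).mean()         # raw agreement
kappa = cohen_kappa_score(llm, human)             # chance-corrected agreement, two raters
macro_f1 = f1_score(human, llm, average="macro")  # macro-averaged F1

# Fleiss' kappa generalises to more than two raters (e.g., several annotators or repeated runs).
ratings = np.column_stack([llm, human, llm])      # shape: items x raters
counts, _categories = aggregate_raters(ratings)   # shape: items x categories
fk = fleiss_kappa(counts)

print(f"agreement={percent_agreement:.2f}  kappa={kappa:.2f}  "
      f"macro-F1={macro_f1:.2f}  fleiss={fk:.2f}")
```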