
Table 1.

Overview of studies benchmarking LLMs against human annotators.

For each study, the entries below give the main research question, data/domain, language, tasks evaluated, LLMs & setting, human baseline, key findings, and key differences.

Gilardi et al. (2023)16
Main research question: Can GPT-3.5-turbo (ChatGPT) zero-shot match or outperform MTurk crowd workers’ annotations for social-media moderation and related classification tasks at lower cost?
Data/Domain: 6,183 tweets and news articles
Language: English
Tasks evaluated: Relevance (content moderation; politics), stance detection (Section 230), topic detection, general frame detection (problem/solution), and policy frame detection (14 classes)
LLMs & setting: GPT-3.5-turbo (ChatGPT), zero-shot prompting (temperature 1 and 0.2)
Human baseline: MTurk crowd workers; trained research assistants as gold standard
Key findings: Zero-shot accuracy about 25 percentage points higher than MTurk; inter-coder agreement 91–97% vs 56% (MTurk) and 79% (trained annotators); per-annotation cost $0.003 (about 30× cheaper than MTurk)
Key differences: Short social-media/news texts in English; classification tasks only, with no NER, entity counting, or source/target attribution; no text-length/difficulty analysis; does include a per-annotation cost comparison (about 30× cheaper than MTurk)

Törnberg (2024)50
Main research question: To what extent can GPT-4 (zero-shot) identify politicians’ party affiliation from a single Twitter post compared to supervised classifiers, expert coders, and crowd workers across 11 countries?
Data/Domain: Random sample of parliamentarian tweets across 11 countries (250 per party; balanced 500 per country)
Language: Multilingual (local languages of the 11 countries)
Tasks evaluated: Binary party-affiliation classification
LLMs & setting: GPT-4 (gpt-4-0314), zero-shot prompting
Human baseline: Expert political-science coders (majority vote) and Master-qualified MTurk workers (plurality)
Key findings: Accuracy: GPT-4 = 93.4%; best human baseline (expert majority vote) = 86.0% (macro F1: 0.934 vs. 0.860)
Key differences: Multilingual tweets (11 countries); short single-tweet inputs; binary party-affiliation classification only, with no NER, entity counting, or source/target attribution; no text-length/difficulty or cost analysis

Ziems et al. (2024)51
Main research question: Are zero-shot LLMs capable of reliably classifying and generating explanations for social-science phenomena?
Data/Domain: 25 representative English computational social science benchmarks spanning utterance-, conversation-, and document-level corpora
Language: English
Tasks evaluated: Twenty classification tasks plus five free-text generation tasks (summaries/explanations)
LLMs & setting: Thirteen LLMs, including FLAN-T5 variants and OpenAI’s GPT-3 (davinci-002/003), GPT-3.5-Turbo, and GPT-4, evaluated zero-shot
Human baseline: Published gold labels and crowd-written explanations
Key findings: LLMs achieved fair agreement with humans on classification (though not surpassing the best fine-tuned models) and matched or exceeded reference explanations in generative tasks
Key differences: English benchmarks covering short utterances, conversations, and full documents; mixes classification, explanation, and structured extraction tasks (event-argument extraction); still omits explicit entity-count analysis; no text-length/difficulty or cost analysis

Bojić et al. (2025)52
Main research question: How reliably do LLMs compare to human annotators across sentiment analysis, political leaning, emotional intensity, and sarcasm detection?
Data/Domain: 100 curated textual items from Stanford Sentiment Treebank, Sentiment140, Iyyer et al.’s political-ideology data, EmoBank, and Sarcasm Corpus V2
Language: English
Tasks evaluated: Sentiment analysis; political leaning; emotional intensity; sarcasm detection
LLMs & setting: GPT-3.5-turbo-16k, GPT-4, GPT-4o, GPT-4o-mini, Gemini 1.5 Pro, Llama-3.1-70B, Mixtral 8×7B (zero-shot; Hard Prompt for GPT-4o)
Human baseline: 33 trained annotators
Key findings: Matched human annotators on sentiment analysis (agreement of 0.95 for both groups); outperformed humans on political leaning (0.80 vs 0.55 agreement); showed higher consistency on emotional intensity (0.85 vs 0.65 agreement); both groups showed low consistency on sarcasm (0.25 agreement)
Key differences: Small English sentence-level corpus; sentiment, political leaning, emotional intensity, and sarcasm only, with no NER, entity counting, or source/target attribution; no text-length/difficulty or cost analysis

Kaikaus et al. (2023)53
Main research question: Are GPT-3.5 and GPT-4 annotations of quarterly earnings-call Q&A statements as reliable and cost-effective as those from domain-expert human annotators?
Data/Domain: 1,198 Q&A statements from 30 quarterly earnings calls (2010–2022)
Language: English
Tasks evaluated: Emotion, sentiment, cognitive dissonance
LLMs & setting: GPT-3.5-turbo-16k and GPT-4-32k; zero-shot with four prompt-engineering approaches; temperature = 0
Human baseline: Domain-expert crowdworkers (accounting graduate students)
Key findings: LLMs produced more consistent and reliable annotations than humans, while reducing annotation time and cost
Key differences: English finance domain; moderate-length earnings-call Q&A statements; emotion, sentiment, and cognitive-dissonance classification only, with no NER, entity counting, or source/target attribution; no text-length/difficulty analysis; does report substantial time/cost savings with LLMs

Huang et al. (2023)54
Main research question: Can ChatGPT detect implicit hateful tweets and generate concise natural-language explanations comparable to human annotators?
Data/Domain: 6,358 implicit hateful tweets (LatentHatred dataset)
Language: English
Tasks evaluated: Binary implicit-hate classification and one-sentence explanation
LLMs & setting: ChatGPT (GPT-3.5, Jan 9 version), zero-shot
Human baseline: Original human annotators
Key findings: 80% agreement with original labels; explanations show clarity and informativeness on par with humans
Key differences: English tweets focused on implicit hate; binary classification plus one-sentence explanations, with no NER, entity counting, or source/target attribution; no text-length/difficulty or cost analysis

Leas et al. (2024)55
Main research question: Can ChatGPT match specialists in detecting adverse events in social-media posts about cannabis?
Data/Domain: 10,000 Reddit r/delta8 posts
Language: English
Tasks evaluated: Binary detection of any adverse event and classification of seriousness according to FDA MedWatch categories
LLMs & setting: GPT-3.5-turbo-0613, default settings (zero-shot)
Human baseline: Trained biomedical annotators
Key findings: 94.4% agreement (Fleiss kappa = 0.95) for any adverse event; 99.3% agreement (kappa = 0.96) for serious adverse events; over 99.9% agreement for specific serious-outcome categories; 0.35% misformatted responses resolved via cleaning
Key differences: English biomedical Reddit posts (r/delta8); binary adverse-event detection and seriousness classification only, with no NER, entity counting, or source/target attribution; no text-length/difficulty analysis; estimated time savings based on human annotation rate (about 30× faster than manual coding); no direct cost analysis

Our study
Main research question: Do state-of-the-art LLMs outperform outsourced human coders on complex Spanish news articles, and how robust is their performance to text length and difficulty?
Data/Domain: 210 Spanish fiscal-policy news articles
Language: Spanish
Tasks evaluated: Named-entity recognition (T1); entity count (T2); binary criticism detection (T3); multi-label source attribution (T4); multi-label target attribution (T5)
LLMs & setting: GPT-3.5-turbo; GPT-4-turbo; Claude 3 Opus; Claude 3.5 Sonnet (zero-shot, temperature = 0)
Human baseline: Outsourced ESADE student coders via Qualtrics
Key findings: LLMs consistently outperform human coders on all tasks; their performance degrades less on long or difficult texts than the human coders’ does; their internal consistency is higher; the newest models perform best
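
Several of the benchmarked studies, and our own setup, annotate with zero-shot prompts at temperature 0. The snippet below is a minimal sketch of such a call using the OpenAI Python SDK; the model alias, prompt wording, and function name are illustrative assumptions rather than the exact prompt used in any of these studies.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_criticism(article_text: str, model: str = "gpt-4-turbo") -> str:
    """Zero-shot binary criticism detection (cf. task T3); prompt wording is illustrative."""
    prompt = (
        "You will read a Spanish news article about fiscal policy. "
        "Answer with exactly one word, 'yes' or 'no': does the article "
        "contain criticism of fiscal policy?\n\n" + article_text
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # zero-temperature decoding, as listed in the table
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()
```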
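The agreement figures reported across these studies mix raw percent agreement, chance-corrected kappas, and macro F1. The sketch below shows how such metrics are typically computed with scikit-learn and statsmodels; the label vectors are invented toy data, not values from any of the studies.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy labels for one binary task: an LLM run vs. a human reference (illustrative only).
llm   = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
human = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 1])

percent_agreement = (llm == human).mean()         # raw agreement
kappa = cohen_kappa_score(llm, human)             # chance-corrected agreement, two raters
macro_f1 = f1_score(human, llm, average="macro")  # macro-averaged F1

# Fleiss' kappa generalises to more than two raters (e.g., several annotators or repeated runs).
ratings = np.column_stack([llm, human, llm])      # shape: items x raters
counts, _categories = aggregate_raters(ratings)   # shape: items x categories
fk = fleiss_kappa(counts)

print(f"agreement={percent_agreement:.2f}  kappa={kappa:.2f}  "
      f"macro-F1={macro_f1:.2f}  fleiss={fk:.2f}")
```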