Table 1. Overview of studies benchmarking LLMs against human annotators.
| Study | Main research question | Data/Domain | Language | Tasks evaluated | LLMs & setting | Human baseline | Key findings | Key differences |
|---|---|---|---|---|---|---|---|---|
| Gilardi et al. (2023)16 | Can GPT-3.5-turbo (ChatGPT) zero-shot match or outperform MTurk crowd workers’ annotations for social-media moderation and related classification tasks at lower cost? | 6,183 tweets and news articles | English | Relevance (content moderation; politics), stance detection (Section 230), topic detection, general frame detection (problem/solution), and policy frame detection (14 classes) | GPT-3.5-turbo (ChatGPT), zero-shot prompting (temperature 1 and 0.2) | MTurk crowd workers; trained research assistants as gold standard | Zero-shot accuracy 25 percentage points higher than MTurk; inter-coder agreement 91–97% vs 56% (MTurk) and 79% (trained annotators); per-annotation cost $0.003 (about 30× cheaper than MTurk) | Short social-media/news texts in English; classification tasks only—no NER, entity counting, or source/target attribution; no text-length/difficulty analysis; does include a per-annotation cost comparison (about 30× cheaper than MTurk) |
| Törnberg (2024)50 | To what extent can GPT-4 (zero-shot) identify politicians’ party affiliation from a single Twitter post compared to supervised classifiers, expert coders, and crowd workers across 11 countries? | Random sample of parliamentarian tweets across 11 countries (250 per party; balanced 500 per country) | Multilingual (local languages of 11 countries) | Binary party-affiliation classification | GPT-4 (gpt-4-0314), zero-shot prompting | Expert political-science coders (majority vote) and Master-qualified MTurk workers (plurality) | Accuracy: GPT-4=93.4%; best human (experts majority vote)=86.0% (Macro F1: 0.934 vs. 0.860) | Multilingual tweets (11 countries); short single-tweet inputs; binary party-affiliation classification only—no NER, entity counting, or source/target attribution; no text-length/difficulty or cost analysis |
| Ziems et al. (2024)51 | Are zero-shot LLMs capable of reliably classifying and generating explanations for social-science phenomena? | 25 representative English computational social science benchmarks spanning utterance-, conversation-, and document-level corpora | English | Twenty classification tasks plus five free-text generation tasks (summaries/explanations) | Thirteen LLMs—including FLAN-T5 variants and OpenAI’s GPT-3 (davinci-002/003), GPT-3.5-Turbo, and GPT-4—evaluated zero-shot | Published gold labels and crowd-written explanations | LLMs achieved fair agreement with humans on classification (though not surpassing the best fine-tuned models) and matched or exceeded reference explanations in generative tasks | English benchmarks covering short utterances, conversations, and full documents; mixes classification, explanation, and structured extraction tasks (event-argument extraction); still omits explicit entity-count analysis; no text-length/difficulty or cost analysis |
| Bojić et al. (2025)52 | How reliably do LLMs compare to human annotators across sentiment analysis, political leaning, emotional intensity, and sarcasm detection? | 100 curated textual items from Stanford Sentiment Treebank, Sentiment140, Iyyer et al.’s political-ideology data, EmoBank, and Sarcasm Corpus V2 | English | Sentiment analysis; political leaning; emotional intensity; sarcasm detection | GPT-3.5-turbo-16k, GPT-4, GPT-4o, GPT-4o-mini, Gemini 1.5 Pro, Llama-3.1-70B, Mixtral 8×7B (zero-shot; Hard Prompt for GPT-4o) | 33 trained annotators | LLM agreement matched the human level (0.95 for both); outperformed humans on political leaning (0.80 vs 0.55 agreement); showed higher consistency on emotional intensity (0.85 vs 0.65 agreement); both groups showed low consistency on sarcasm (0.25 agreement) | Small English sentence-level corpus; sentiment, political leaning, emotional intensity, and sarcasm only—no NER, entity counting, or source/target attribution; no text-length/difficulty or cost analysis |
| Kaikaus et al. (2023)53 | Are GPT-3.5 and GPT-4 annotations of quarterly earnings-call Q&A statements as reliable and cost-effective as those from domain-expert human annotators? | 1,198 Q&A statements from 30 quarterly earnings calls (2010–2022) | English | Emotion, sentiment, cognitive dissonance | GPT-3.5-turbo-16k and GPT-4-32k; zero-shot with four prompt-engineering approaches; temperature = 0 | Domain-expert crowdworkers (accounting graduate students) | LLMs produced more consistent and reliable annotations than humans, while reducing annotation time and cost | English finance domain; moderate-length earnings-call Q&A statements; emotion, sentiment, and cognitive-dissonance classification only—no NER, entity counting, or source/target attribution; no text-length/difficulty analysis; does report substantial time/cost savings with LLMs |
| Huang et al. (2023)54 | Can ChatGPT detect implicit hateful tweets and generate concise natural-language explanations comparable to human annotators? | 6,358 implicit hateful tweets (LatentHatred dataset) | English | Binary implicit-hate classification and one-sentence explanation | ChatGPT (GPT-3.5 Jan 9 version), zero-shot | Original human annotators | 80% agreement with original labels; explanations show clarity and informativeness on par with humans | English tweets focused on implicit hate; binary classification plus one-sentence explanations—no NER, entity counting, or source/target attribution; no text-length/difficulty or cost analysis |
| Leas et al. (2024)55 | Can ChatGPT match specialists in detecting adverse events in social-media posts about cannabis? | 10,000 Reddit r/delta8 posts | English | Binary detection of any adverse event and classification of seriousness according to FDA MedWatch categories | GPT-3.5-turbo-0613, default settings (zero-shot) | Trained biomedical annotators | 94.4% agreement (Fleiss kappa = 0.95) for any AEs; 99.3% agreement (kappa = 0.96) for serious AEs; 99.9% agreement for specific serious outcome categories; 0.35% misformatted responses resolved via cleaning | English biomedical Reddit posts (r/delta8); binary adverse-event detection and seriousness classification—no NER, entity counting, or source/target attribution; no text-length/difficulty analysis; estimated time savings based on human annotation rate (about 30× faster than manual coding); no direct cost analysis |
| Our study | Do state-of-the-art LLMs outperform outsourced human coders on complex Spanish news articles, and how robust is their performance to text length and difficulty? | 210 Spanish fiscal-policy news articles | Spanish | Named-entity recognition (T1); entity count (T2); binary criticism detection (T3); multi-label source attribution (T4); multi-label target attribution (T5) | GPT-3.5-turbo; GPT-4-turbo; Claude 3 Opus; Claude 3.5 Sonnet (zero-shot, temperature=0; a prompting sketch follows the table) | Outsourced ESADE student coders via Qualtrics | LLMs consistently outperform human coders on all tasks; their performance degrades less on long or difficult texts; their internal consistency is higher; the newest models perform best | — |
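For readers unfamiliar with the zero-shot, temperature=0 setting reported in the last row (and used in several of the studies above), the sketch below shows how a single annotation request of this kind can be issued to the OpenAI and Anthropic chat APIs. It is a minimal illustration only: the prompt wording, label set, model identifiers, and function names are assumptions for exposition, not the exact instructions or parameters used in our study or in the cited work.

```python
# Minimal sketch of zero-shot annotation at temperature=0, assuming the
# OpenAI and Anthropic Python SDKs; prompt text and labels are illustrative.
from openai import OpenAI
from anthropic import Anthropic

INSTRUCTIONS = (
    "You are annotating a Spanish fiscal-policy news article. "
    "Answer with a single label: CRITICISM or NO_CRITICISM."  # illustrative task (cf. T3)
)

def annotate_openai(article: str, model: str = "gpt-4-turbo") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic-leaning decoding, matching the zero-shot setting above
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": article},
        ],
    )
    return resp.choices[0].message.content.strip()

def annotate_anthropic(article: str, model: str = "claude-3-opus-20240229") -> str:
    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model=model,
        max_tokens=16,
        temperature=0,
        system=INSTRUCTIONS,
        messages=[{"role": "user", "content": article}],
    )
    return msg.content[0].text.strip()
```

Keeping the instruction text identical across providers and fixing temperature at 0 is what makes the per-model outputs comparable; only the model identifier changes between runs.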