PNAS Nexus. 2025 Feb 25;4(2):pgaf034. doi: 10.1093/pnasnexus/pgaf034

Echoes of authenticity: Reclaiming human sentiment in the large language model era

Yifei Wang, Ashkan Eshghi, Yi Ding, Ram Gopal
Editor: Frederick Chang
PMCID: PMC11852273  PMID: 40007575

Abstract

This paper scrutinizes the unintended consequences of employing large language models (LLMs) like ChatGPT for editing user-generated content (UGC), particularly focusing on alterations in sentiment. Through a detailed analysis of a climate change tweet dataset, we uncover that LLM-rephrased tweets tend to display a more neutral sentiment than their original counterparts. By replicating an established study on public opinions regarding climate change, we illustrate how such sentiment alterations can potentially skew the results of research relying on UGC. To counteract the biases introduced by LLMs, our research outlines two effective strategies. First, we employ predictive models capable of retroactively identifying the true human sentiment underlying the original communications, utilizing the altered sentiment expressed in LLM-rephrased tweets as a basis. While useful, this approach faces limitations when the origin of the text—whether directly crafted by a human or modified by an LLM—remains uncertain. To address such scenarios where the text's provenance is ambiguous, we develop a second approach based on the fine-tuning of LLMs. This fine-tuning process not only helps in aligning the sentiment of LLM-generated texts more closely with human sentiment but also offers a robust solution to the challenges posed by the indeterminate origins of digital content. This research highlights the impact of LLMs on the linguistic characteristics and sentiment of UGC, and more importantly, offers practical solutions to mitigate these biases, thereby ensuring the continued reliability of sentiment analysis in research and policy.

Keywords: large language models, human sentiment, LLM-induced biases, fine-tuning, mitigation strategies


Significance Statement.

The emergence of large language models (LLMs) such as ChatGPT has transformed the landscape of digital content, impacting the sentiment of user-generated content (UGC). Here, we examine the unintended consequences of LLMs on sentiment. By analyzing climate change tweets, we find that LLMs, even when not explicitly aimed at modifying sentiment, can inadvertently alter the tone of the original text. This alteration has the potential to skew research outcomes based on UGC. To counteract these biases, we developed two strategies: predictive models to retroactively identify original human sentiment and fine-tuning LLMs to align their output more closely with human sentiment. These approaches mitigate the biases introduced by LLMs, ensuring the reliability of sentiment analysis in both research and policy-making contexts.

Introduction

Human sentiment, embedded within public content, serves as a crucial indicator of collective opinions, attitudes, and emotions. Across various sectors, decision-makers rely on the transparent and readily available sentiment data from online platforms to efficiently gauge real-time public perspectives. These public attitudes, in turn, profoundly shape downstream choices and actions aligned to the prevailing paradigms. For instance, by leveraging readily available data from social media, decision-makers can discern prevailing societal paradigms on events in real-time (1). Typically spanning positive to negative polarity, accessible insights into public sentiment guide responsive policy initiatives across domains, from digital commerce (e.g. (2)) to political arenas (e.g. (3, 4)). In response, individuals align their actions, such as purchasing behaviors (5) and voting patterns (1), with this collective understanding. However, the escalating use of large language models (LLMs) to refine user-generated content (UGC) poses an underappreciated threat—their potential to inadvertently alter said sentiment.

Remarkably, since the launch of OpenAI's ChatGPT in late 2022, over 100 million active users have rapidly adopted this LLM, which exhibits proficiency in generating coherent text. This growing prevalence has intensified LLM adoption for text generation in various domains, such as finance (6), e-commerce (7, 8), telecommunications (9), biomedicine (10), and education (11, 12). Beyond original content creation, LLMs demonstrate aptitude in modifying extant text to enhance clarity and comprehension (13). Researchers, for instance, increasingly rely on LLMs to edit, draft, and refine content at unprecedented speeds and low costs, blurring the line between human and AI-generated text (14). Furthermore, empirical studies indicate that 33–46% of crowd workers have used LLMs in text annotation tasks, highlighting how entrenched these tools have become in traditionally human-driven work (15). In more quotidian applications, individuals employ ChatGPT to clarify and enhance their initial content across diverse contexts, including social media. Instances of content revised by ChatGPT have been observed on platforms such as social media sites, specialized online forums, and educational resources (13).

However, refinements introduced by an LLM may subtly shift sentiment polarity. Such changes compromise the reliability of public sentiment as an accurate indicator of authentic human attitudes, and decision-making processes informed by these data consequently become questionable. In light of the significance of this concern and the existing knowledge gaps, this study aims to discern whether, and to what extent, sentiment in content rephrased by LLMs diverges from its original sentiment.

To address the aforementioned research objective, we situate our study within the context of a PNAS study by Smirnov and Hsieh (16) (hereafter “reference paper”). Their research primarily aimed to validate the “finite pool of worry” hypothesis regarding public attention towards two significant negative events: COVID-19 and climate change. The authors examined how COVID-19 influenced the sentiment expressed in public tweets about climate change. Utilizing Twitter as a rich source of public opinion data and VADER (Valence Aware Dictionary for Sentiment Reasoning) as the sentiment analysis tool, the authors analyzed the volume and content of 18,896,054 tweets that mentioned “climate change” from January 2019 to December 2021. Our study employs these pre-LLM human-generated tweets, utilizing LLMs to rephrase them. We focus on quantifying sentiment differences between LLM-rephrased content and original narratives (see section ‘Details for comparing LLM-rephrased versus original content’ for details).

Our analysis reveals that LLM rephrasing nudges tweets toward more neutral sentiments compared with the originals, irrespective of the prompts employed, or the sophistication of the LLMs. By replicating the reference paper's assessment of public opinions on climate change, we further illustrate how LLM-induced differences alter established scientific conclusions, raising concerns about the validity of post-LLM sentiment-based analyses. To address this pressing issue, we propose and validate potential mitigation strategies for estimating authentic human sentiment from LLM-rephrased content. Overall, this research highlights an underappreciated threat posed by generative AI to the reliability of sentiment analysis and downstream decision-making dependent on transparent user perspectives.

Results and discussion

LLM-rephrased versus original content

We first sought to explore whether and how the use of LLMs might subtly alter the sentiments expressed in the original climate change tweets. Mirroring the analysis by Smirnov and Hsieh (16), we examined not only the overall sentiment (presented as the Compound score in VADER), but also the distinct positive and negative sentiment scores. Note that both the positive and negative VADER scores range from 0 to 1, with 1 indicating the most extreme positive or negative sentiment detected. The Compound score lies between −1 and 1, accounting for the balance of positivity, negativity and neutrality in the text.
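
To make the scoring concrete, the minimal sketch below applies VADER to a single tweet using the vaderSentiment Python package; the sample text is a hypothetical illustration rather than a tweet from the dataset.

    # Minimal sketch: scoring one tweet with VADER (vaderSentiment package).
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    tweet = "Climate change is a terrifying crisis, but I am hopeful we can still act."  # hypothetical
    scores = analyzer.polarity_scores(tweet)
    # Returns a dict of the form {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...};
    # 'pos' and 'neg' lie in [0, 1], and 'compound' lies in [-1, 1].
    print(scores)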

The model-free graphical evidence based on our sample of 50,000 tweets is presented in Figs. 1 and 2. As shown in Fig. 1, the LLM-rephrased tweets exhibit dampened levels of positivity and negativity overall, evidenced by the smaller median values closer to 0 (see the first two boxes on the left and the two central boxes). Furthermore, the rephrased variants display not only more neutral sentiments on average, but also substantially reduced variance in both the positive and negative sentiment scores (see the shortened whiskers on the boxes). However, the change in the distribution of the overall Compound sentiment score is more subtle when comparing the original and rephrased tweets (see the high degree of overlap in the two rightmost boxes).

Fig. 1. Box plot of sentiment scores: original versus LLM-rephrased tweets.

Fig. 2. Density plots of sentiment scores: original vs. LLM-rephrased tweets. (A) Density of positivity distribution. (B) Density of negativity distribution. (C) Density of compound score distribution. Notes: The plots are generated using kernel density estimation, and the cumulative area beneath the density function curve indicates the probability of obtaining an x value within a specified range of x values.

Similarly, the density plots in Fig. 2 reveal the language models' normalization effects. Figure 2A and B shows the LLM shifting strong positive and negative tweets towards more neutral stances (see the increased density of orange lines near the zero value). However, Fig. 2C indicates that while the distribution of Compound scores for the rephrased variants exhibits slightly lower kurtosis, there is a small rightward shift towards greater positivity on average. These results indicate that language models systematically normalize the sentiment of the original tweets by reducing the presence of emotional extremes.

To obtain a more comprehensive understanding of the sentiment differences between the LLM-rephrased tweets and their original counterparts, we conducted paired two-tailed t-tests for statistical inference. Overall, the results reveal that for both the positive and negative VADER scores, the mean values are lower for the LLM-rephrased variants compared with the original tweets.

Specifically, the rephrased tweets exhibit significantly dampened positive sentiment scores relative to the originals (LLM mean: 0.095 vs. Original mean: 0.126; Diff = −0.031, P < 0.01). A similar effect is observed when examining the negative sentiment scores (LLM mean: 0.083 vs. Original mean: 0.123; Diff = −0.040, P < 0.01). Although the change in the Compound score distribution is more subtle in Fig. 1, the t-test indicates a small but statistically significant difference between the rephrased and original tweets (LLM mean: 0.040 vs. Original mean: 0.007; Diff = 0.033, P < 0.01). Combined with the density plot in Fig. 2C, these results imply that the language models typically nudge the text towards slightly more positive overall sentiment on average.
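
As an illustration of the test used here, the sketch below runs a paired two-tailed t-test on aligned arrays of VADER scores; the arrays are dummy placeholders standing in for the original and LLM-rephrased scores of the same tweets.

    # Sketch of the paired two-tailed t-test comparing rephrased vs. original scores.
    import numpy as np
    from scipy import stats

    # Dummy placeholders; in practice these hold the VADER positive scores of the same
    # tweets before and after LLM rephrasing, in matching order.
    original_pos = np.array([0.20, 0.05, 0.31, 0.00, 0.12])
    llm_pos = np.array([0.15, 0.04, 0.22, 0.02, 0.10])

    t_stat, p_value = stats.ttest_rel(llm_pos, original_pos)
    print(f"Diff = {(llm_pos - original_pos).mean():.3f}, t = {t_stat:.2f}, P = {p_value:.4f}")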

While the LLM rephrasing does dampen the presence of positive sentiment to some degree (Diff = −0.031, P < 0.01), the extent of this normalization effect is less pronounced relative to the substantial transition of strongly negative content towards neutral statuses (Diff = −0.040, P < 0.01). This asymmetry is also evident visually when comparing the purple and red box plots in Fig. 1.

Robustness checks

Variations in prompt instructions and advancements in model sophistication may plausibly influence the observed sentiment shifts or mitigate unintended sentiment changes. To investigate these possibilities, we conducted robustness checks using different prompts and versions of LLMs (see section ‘Robustness checks’ for details). The results in Table 1 indicate that regardless of the prompt employed (e.g. instructing the LLM to preserve sentiment or focusing on grammar correction) or the LLM version utilized (e.g. text-davinci-003 to ChatGPT-4o-mini), the rephrased content consistently exhibited a shift toward neutral sentiment for both positive and negative sentiments, along with a slight increase in the Compound score compared with the original content.

Table 1.

Robustness checks using different prompts and LLM versions.

Panel A: Early-version LLMs
              Main prompt (a)                       Preserve sentiment prompt (b)         Grammar correction prompt (c)
              text-davinci-003                      text-davinci-003                      gpt-3.5-turbo-instruct (d)
              (1) Mean   (2) t-test  (3) Variance   (4) Mean   (5) t-test  (6) Variance   (7) Mean   (8) t-test  (9) Variance
Negative      −0.0397    P < 0.01    0.0196         −0.0396    P < 0.01    0.0125         −0.0448    P < 0.01    0.0106
Positive      −0.0339    P < 0.01    0.0202         −0.0318    P < 0.01    0.0107         −0.0273    P < 0.01    0.0106
Compound       0.0254    P = 0.157   0.322           0.0364    P < 0.01    0.157           0.0713    P < 0.01    0.174

Panel B: Advanced-version LLM (ChatGPT-4o-mini for all three prompts)
              Main prompt (a)                       Preserve sentiment prompt (b)         Grammar correction prompt (c)
              (10) Mean  (11) t-test (12) Variance  (13) Mean  (14) t-test (15) Variance  (16) Mean  (17) t-test (18) Variance
Negative      −0.0381    P < 0.01    0.0116         −0.0363    P < 0.01    0.0115         −0.0492    P < 0.01    0.0107
Positive      −0.0254    P < 0.01    0.0110         −0.0197    P < 0.01    0.0113         −0.0306    P < 0.01    0.0104
Compound       0.0489    P < 0.01    0.161           0.0650    P < 0.01    0.173           0.0658    P < 0.01    0.155

This set of robustness checks is based on a random sample of 1,000 observations out of the 50,000 observations.

For each prompt and model combination, the mean value reports the average difference between the LLM-rephrased and original sentiment scores, and the variance gauges the variability of these differences.

(a) Main prompt: “Paraphrase this tweet in less than 280 characters: [original tweet text].”

(b) Preserve sentiment prompt: “Paraphrase this tweet in less than 280 characters, and keep the overall sentiment of the tweet remains unchanged: [original tweet].”

(c) Grammar correction prompt: “Paraphrase the content, and make sure the grammar is correct: [original tweet].”

(d) Due to the limited accessibility of text-davinci-003 at the time of revision, this prompt was executed using gpt-3.5-turbo-instruct, as recommended by OpenAI as a suitable substitute.

Generalizability using Amazon reviews

An immediate question that arises is whether the observed sentiment normalization effects from language model rephrasing generalize beyond the context of tweets expressing public opinions. Are similar systematic shifts evident across diverse platforms and contexts, such as customer reviews evaluating product or service quality?

To explore the potential generalizability of our findings, we conducted supplementary tests utilizing Amazon review data (see section ‘Analysis of Amazon reviews’ for data and analysis details). As online reviews assessing product quality serve as crucial indicators influencing consumers' purchasing decisions (17), this text genre has been widely analyzed to evaluate customer satisfaction, experience and retailer reputations (18).

Mirroring the analysis of tweets, results indicate significantly dampened mean positive and negative VADER scores for the LLM-rephrased Amazon reviews compared with the originals. Specifically, the rephrased positive scores approach neutrality more closely than the originals (LLM mean: 0.192 vs. Original mean: 0.195; Diff = −0.003, P < 0.10). A similar normalization effect is evident for negative scores (LLM mean: 0.057 vs. Original mean: 0.060; Diff = −0.003, P < 0.01). However, no statistically significant difference exists between LLM-rephrased and original review content when examining the overall Compound score.

Consequently, these consistent findings suggest that the subtle bias introduced by language models, systematically shifting UGC toward more neutral stances, extends across diverse text genres—from tweets expressing public opinions to customer reviews evaluating commercial products and services.

“Distorted” sentiment influence on existing understanding

Up to this point, our findings have indicated that language models systematically modify the sentiments present within textual content. Building upon this evidence, we further investigate whether such subtle alterations could potentially affect established scientific conclusions relying on sentiment analysis. Specifically, we first replicated the results from the reference paper, then substituted the original climate change tweets with their LLM-rephrased counterparts and reran the regression models. By comparing these updated results to our initial replication, we assessed whether LLM-induced sentiment shifts could alter the substantive conclusions presented by Smirnov and Hsieh (16) (see section ‘Replication of reference paper’ for implementation details).

The regression results are presented in Table 2. Models 1, 3, and 5 present the initial replication findings with the original tweets as predictors of negative, positive, and compound sentiment scores, respectively. In contrast, Models 2, 4, and 6 utilize the same set of control variables but with the LLM-rephrased climate change tweets as predictors of interest.

Table 2.

Results of replicated regression (A13): Original vs. LLM-rephrased tweets.

  (1) (2) (3) (4) (5) (6)
  Original LLM-rephrased Original LLM-rephrased Original LLM-rephrased
  Negative score Positive score Compound score
COVID-19 cases (United States, 1k) −0.00685*** −0.00274*** 0.00314*** 0.00309*** 0.0225*** 0.0188***
(0.00206) (0.000913) (0.00114) (0.000887) (0.00706) (0.00623)
Climate TV stories −0.00127** −0.000770** −0.000293 −0.000426 0.00373* 0.00184
(0.000609) (0.000374) (0.000530) (0.000400) (0.00205) (0.00213)
Presidential and VP −0.0820 −0.312* −0.145 −0.783** −1.283 −1.487
Debates (United States) (0.538) (0.183) (1.049) (0.365) (2.440) (1.540)
Presidential elections −0.530* −0.214 0.0240 0.269 1.279 2.010
(United States) (0.293) (0.177) (0.330) (0.258) (1.311) (1.510)
Wildfires (United States) 0.175 0.0543 −0.355** −0.165 −1.230* −0.950
(0.219) (0.134) (0.177) (0.125) (0.709) (0.754)
Hurricanes (United States) 0.148 0.323* −0.0333 −0.128 −0.448 −1.381
(0.315) (0.191) (0.342) (0.200) (1.122) (1.179)
Tuesday −0.539* −0.468*** 0.0844 0.253 1.824* 2.986***
(0.281) (0.180) (0.276) (0.172) (0.931) (0.954)
Wednesday −1.002*** −0.802*** 0.247 0.301* 3.360*** 3.910***
(0.266) (0.178) (0.248) (0.173) (0.907) (0.935)
Thursday −0.896*** −0.732*** 0.657** 0.423** 3.516*** 3.989***
(0.275) (0.185) (0.285) (0.179) (0.926) (1.018)
Friday −0.648** −0.639*** 0.120 0.105 2.304** 2.996***
(0.273) (0.176) (0.279) (0.178) (1.014) (0.986)
Saturday −0.762*** −0.588*** 0.541* 0.456** 3.499*** 3.719***
(0.279) (0.179) (0.283) (0.182) (0.997) (1.009)
Sunday 0.0886 −0.259 0.231 0.338* 0.610 2.260**
(0.275) (0.196) (0.283) (0.185) (1.017) (1.072)
Constant 13.31*** 9.06*** 12.57*** 9.41*** −2.50*** 0.56
(0.244) (0.162) (0.256) (0.166) (0.877) (0.913)
Observations 1,096 1,096 1,096 1,096 1,096 1,096
Adjusted R2 0.0692 0.0408 0.0129 0.0269 0.0589 0.0433

Newey–West SEs are reported in parentheses, consistent with Smirnov and Hsieh's (16) study.

*** P < 0.01, ** P < 0.05, * P < 0.10.

In line with the reference paper, we also scale the dependent variables to the range from 0 to 100.

The original regression results can be found in the reference paper in Table A13 (Impact of US COVID-19 cases on “climate change” tweets sentiment) and Table A14 (Impact of US COVID-19 deaths on “climate change” tweets sentiment).

Although the qualitative directions of the relationships are preserved, the results reveal marked differences in the magnitude of the key effects. For example, the adverse influence of COVID-19 cases on negative climate change tweet sentiment is estimated at −0.00685 in Model 1 but dampened to −0.00274 in Model 2 with the rephrased tweets. Analogous disparities are evident when predicting the Compound score, with coefficients decreasing from 0.0225 to 0.0188 (see Model 5 vs. Model 6).

Additionally, for certain control variables, using the rephrased tweets introduces shifts in statistical significance. For instance, the indicator for Presidential/VP debates (United States) does not significantly influence positive sentiment in Model 3 (−0.145, P > 0.10) but manifests a pronounced negative impact in Model 4 (−0.783, P < 0.05).

Collectively, these results underscore the threat that the escalating use of language models by social media users, whether unintentional or a deliberate attempt at sentiment manipulation, poses to the reliability of research findings. Subtle modifications made through rephrasing can ultimately propagate downstream and affect substantive conclusions.

Mitigation strategies

Acknowledging the potential impact on the integrity of research findings due to LLM-induced biases, it becomes crucial to consider effective countermeasures. We start with the core premise that LLMs can alter the sentiment of text in ways that might be predictable. This recognition leads us to the endeavor of reverse-engineering to retrieve the original sentiment as expressed by a human.

To mitigate these biases, we propose two distinct strategies. The first strategy involves the use of predictive models that can retroactively deduce the genuine human sentiment from the original messages, based on the sentiment reflected in the LLM-rephrased tweets. This method allows us to recover the underlying human emotion and intent that may have been obscured or altered by the LLM.

When we are certain that a text has been enhanced through the use of an LLM, our primary method involves extracting the original sentiment that was expressed by the human author. This technique is invaluable for analyzing the authentic human emotion and intention behind the rephrased content. However, it presents a limitation in real-world applications where the origin of the text—whether it was generated by a human or with the assistance of an LLM—is ambiguous. To navigate this challenge, we introduce a second strategy that revolves around the fine-tuning of LLMs. This method aims to adapt the model in such a way that it can generate text whose sentiment more closely mirrors genuine human-written material, thereby reducing the disparity between human and LLM-generated sentiment.

Strategy 1

Predictive models

As to the first mitigation strategy, we adopted three predictive methodologies—Linear Regression, Neural Network Regression, and Random Forest Regression—to estimate the genuine sentiment of original tweets (see section ‘Predictive models for mitigating sentiment biases’ for model construction details). The findings were systematically organized in Table 3, focusing on the mean residual values (the difference between original and predicted sentiment scores) and their variance to assess prediction variability. A residual statistically close to zero indicates successful recovery of original sentiments. We thus conducted one-sample t-tests to assess whether the mean residual values significantly deviated from zero.

Table 3.

Performance of regression models in reverse prediction of human sentiment.

  (1) (2) (3) (4) (5) (6) (7) (8) (9)
  Linear Regression Neural Network Regression Random Forest Regression
  Mean t-test Variance Mean t-test Variance Mean t-test Variance
Negative −0.000990 P = 0.355 0.0115 −0.000991 P = 0.354 0.0114 −0.00124 P = 0.251 0.0117
Positive 0.000268 P = 0.804 0.0116 0.000583 P = 0.621 0.0115 0.000702 P = 0.518 0.0118
Compound 0.00369 P = 0.309 0.131 0.00444 P = 0.216 0.129 0.00583 P = 0.110 0.133

The mean of the residual values (the difference between original and predicted sentiment scores) captures prediction accuracy, and the variance of these residuals gauges prediction variability.

The prediction results for Linear Regression (Table 3, Models 1–3) revealed no statistically significant difference between the predicted and actual sentiment scores across all three sentiment dimensions (with t-test P-values of 0.355 for negativity, 0.804 for positivity, and 0.309 for compound score; the coefficients of the trained linear regression models are reported in Table S1). This outcome suggests that the predictive models were effective in estimating the original sentiment of tweets, confirming our hypothesis that LLM-induced modifications to text sentiment can be reliably predicted and reversed using regression analysis. These findings underscore the potential of employing sophisticated statistical techniques to understand and counteract the effects of LLMs on sentiment expression in digital communications. Results for Neural Network Regression (Models 4–6) and Random Forest Regression (Models 7–9) are presented in Table 3. No significant residual differences were observed across models for all three sentiment metrics, suggesting that each model was similarly effective in predicting authentic sentiment scores from LLM-rephrased tweets.

Comparing the three models, Linear Regression showed slightly smaller residuals for all sentiment scores, indicating a marginally better performance in matching the predicted sentiments to the actual sentiments of the original tweets. However, the variance in residuals was relatively consistent across all models, indicating a similar level of prediction error variability.

Predictive models/practical implementation

While our results demonstrate that predictive models can effectively recover original sentiment scores, it is important to note that their practical applicability is restricted to scenarios where the provenance of the text is definitively known to be LLM-rephrased. These models operate under the assumption that the input text has been modified by an LLM and leverage the patterns inherent in these modifications to reverse engineer the original sentiment.

However, this approach still provides a cost-effective solution for organizations to reverse-engineer sentiments, provided simple filters can reliably detect LLM-rephrased content. For example, platforms equipped with basic AI detection tools could approximately differentiate between LLM-rephrased text and original content, allowing predictive models to be selectively applied to the former. Similarly, platforms might assess a user's language fluency before and after the introduction of LLMs; a significant improvement in writing quality could indicate reliance on LLMs for content rephrasing. While such filters may not achieve perfect accuracy, adjusting their thresholds—whether more or less lenient in identifying LLM-rephrased content—and subsequently applying predictive models to this content can provide organizations with an approximate understanding of the sentiments genuinely expressed by users.

Strategy 2

Fine-tuning

Ultimately, the uncertainty regarding whether a text is human-written or LLM-rephrased poses a significant challenge, as applying predictive models in such cases could introduce additional biases. This fundamental limitation of predictive models motivated the development of a second strategy: fine-tuning an LLM to produce content that more closely resembles human-generated text, with a particular emphasis on minimizing biases introduced by LLM-rephrased tweets. This fine-tuning approach is designed to handle text of ambiguous origin, offering a more robust solution for real-world applications where the provenance of digital content is often unclear.

Our exploration included experiments with both the OpenAI fine-tuning model and Meta's LLAMA2 model (see section ‘Fine-tuning models for mitigating sentiment biases’ for detailed fine-tuning procedures). These models were used to reverse-engineer LLM-rephrased tweets into “reversed-human” text, whose sentiment was then compared with that of the original human-written text.

The comparative results between the OpenAI and LLAMA2 fine-tuning models are detailed in Table 4, with a focus on the sentiment differences between the original and reversed-human tweets. The results shown in Models (1–3) indicate that the OpenAI fine-tuning model effectively narrowed the sentiment disparity across all three scores between the original human tweets and the reversed-human tweets, demonstrating no statistically significant difference in sentiment scores. Conversely, the LLAMA2 fine-tuned model displayed significant and larger sentiment differences between the original and reversed-human content (see Models 4–6). Furthermore, the OpenAI fine-tuning model exhibited smaller variance in sentiment scores compared with the LLAMA2 model. In summary, the OpenAI fine-tuned model proved capable of generating text that effectively reduces the sentiment discrepancy from LLM-rephrased content back to the original tweets, showcasing superior performance over the LLAMA2 model in achieving this goal.

Table 4.

Performance of fine-tuning models in reverse prediction of human sentiment.

  (1) (2) (3) (4) (5) (6)
  Fine-tuning (OpenAI) Fine-tuning (LLAMA2)
  Mean t-test Variance Mean t-test Variance
Negative −0.000678 P = 0.574 0.0145 −0.0130 P < 0.01 0.0172
Positive −0.001458 P = 0.226 0.0145 −0.00977 P < 0.01 0.0177
Compound 0.00292 P = 0.461 0.156 0.0102 P = 0.0345 0.231

The mean of the residual values (the difference between original and predicted sentiment scores) captures prediction accuracy, and the variance of these residuals gauges prediction variability.

Fine-tuning/ad hoc analysis of content alteration by fine-tuning

A legitimate concern may arise that when the fine-tuned model is applied, original content not rephrased by LLMs may also be inadvertently modified, potentially confounding the observed mitigation effects of fine-tuning. To address this issue, we evaluated whether the output of the fine-tuned model matched the input content, thereby assessing the model's ability to recognize the provenance of the input data. An exact match between the output and input suggests that the model identifies the input as original UGC and, consequently, does not apply any modifications.

Based on our testing data, among 10,000 exclusively LLM-rephrased tweets, the fine-tuned model altered all outputs (100% modification rate). In contrast, when tested on 10,000 original tweets, only 38 were modified (0.38% modification rate). These findings demonstrate the robust capability of our fine-tuned model to identify input content provenance, thereby preventing the overestimation of sentiment shift mitigation when applied to original, non-LLM-rephrased content.
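
A minimal sketch of this provenance check is given below, assuming paired lists of model inputs and fine-tuned outputs; the function simply computes the share of outputs that differ from their inputs, and the variable names are illustrative.

    # Sketch: modification rate of the fine-tuned model (share of outputs that differ from inputs).
    def modification_rate(inputs, outputs):
        changed = sum(i.strip() != o.strip() for i, o in zip(inputs, outputs))
        return changed / len(inputs)

    # Expected behavior based on the results above (names are illustrative):
    #   modification_rate(llm_rephrased_tweets, outputs_on_rephrased)  -> close to 1.00
    #   modification_rate(original_tweets, outputs_on_originals)       -> close to 0.0038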

Fine-tuning/practical implementation

The practical implementation of fine-tuning entails fundamental tradeoffs between cost and quality, influenced by two primary factors: the size of the training dataset and the choice between local open-source models and Application Programming Interface (API)-based services (19). In our case, we utilized OpenAI's fine-tuning API with a training dataset consisting of 33,000 rows. This implementation required approximately 40 min of computational time and incurred a total cost of $26.03. The pricing structure for OpenAI's service includes $0.008 per 1,000 tokens for training, $0.012 per 1,000 tokens for input usage, and $0.016 per 1,000 tokens for output usage.

Alternative approaches, such as local fine-tuning using open-source models like LLAMA2, involve different cost considerations. These primarily include substantial initial investments in hardware and ongoing operational expenses, such as electricity, but they eliminate per-token fees. Additionally, local deployment offers improved data security and privacy, as sensitive training data remains within the organization's infrastructure. The decision between API-based and local fine-tuning implementations generally depends on an organization's anticipated usage volume. API-based approaches tend to be more cost-effective for lower usage volumes, whereas local deployment becomes increasingly economical at larger scales.

Summary

In sum, this study ventures into the evolving landscape shaped by the proliferation of LLMs and their impact on the sentiment of publicly shared content. Our investigation begins with the recognition that LLMs, even when not explicitly aimed at modifying sentiment, can inadvertently alter the emotional tone of the original text. Through a series of experiments replicating established studies, we have observed that the sentiment modifications introduced by LLMs can compromise the reliability of previously accepted findings. Importantly, we propose and validate two mitigation strategies, contingent on whether the text's provenance is known to be LLM-rephrased or not.

Materials and methods

Details for comparing LLM-rephrased versus original content

We position our research context by drawing upon the study by Smirnov and Hsieh (16). Within their investigation, the primary focus is to validate the “finite pool of worry” hypothesis concerning public attention towards two major negative events: COVID-19 and climate change. They delve into the interrelation between the occurrences of COVID-19 cases/deaths and the volume of discussions on climate change on Twitter. Of particular relevance to our research is their examination of how COVID-19 impacts the sentiment expressed in public tweets about climate change. Due to its widespread use and accessibility as a rich source of data reflecting public opinions and concerns, Smirnov and Hsieh utilized Twitter data, examining the volume and content of 18,896,054 tweets that mentioned “climate change” over the period from January 2019 to December 2021. These data were then aggregated on a daily basis for their time series analysis. Their findings indicated a noticeable shift in public concern from climate change to COVID-19, as evidenced by a decline in climate change-related tweets during the height of the COVID-19 pandemic. Beyond simply observing this shift, the authors further explored the change in emotional content and sentiment of climate change-related tweets during this period.

To accomplish their sentiment analysis, Smirnov and Hsieh employed VADER, a widely used rule-based model developed specifically for social media content (20). Similar to many other sentiment analysis techniques, VADER makes use of the relative frequencies of positive and negative words in the text. However, it further implements heuristics to account for nuances in language such as negations, amplifications, and contrasts, thereby aiming to provide a more accurate measure of the overall sentiment. Critically, the VADER model not only produces an overall compound sentiment score for each tweet, but also provides granular and interpretable scores of positivity and negativity. By aggregating the daily VADER-produced sentiment scores, the authors found that an increase in daily COVID-19 cases and deaths correlated significantly with a decrease in the negative sentiment scores for climate change-related tweets.

Our study replicates Smirnov and Hsieh's analysis (16), with the key modification of employing state-of-the-art language models (LLMs) to rephrase the subset of tweets specifically pertaining to climate change. This modification permits us to (i) contrast the sentiment disparities between the original tweets and those rephrased by LLMs, and (ii) examine if and how the conclusions, namely the influence of COVID-19 on public sentiment towards climate change, might vary based on the choice of using the original or LLM-rephrased tweets. Given computational constraints imposed by the LLMs and associated APIs, we conduct our analysis on a random sample of 50,000 tweets extracted from the pool of ∼19 million tweets within the original dataset.

To obtain LLM-rephrased variants of the climate change tweets, we utilized text-davinci-003, a powerful autoregressive LLM developed by OpenAI known for generating high-quality, human-like text while reliably adhering to provided instructions (21). We supplied the LLM with a specific prompt requesting assistance in rephrasing a given tweet while staying within the 280-character limit imposed by Twitter. We utilize the following prompt structure to rephrase tweets:

Prompt: Paraphrase this tweet in less than 280 characters: [original tweet text]

Mirroring the approach by Smirnov and Hsieh (16), we applied the VADER model to the LLM-rephrased tweets to analyze their sentiment. Our study methodology encompassed a rigorous validation process aimed at verifying the consistency between the VADER model as employed in our analyses and its original specification used by Smirnov and Hsieh.
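
For reference, the sketch below shows how this rephrasing call could be issued with the legacy OpenAI completions endpoint (openai v0.x Python client); text-davinci-003 has since been deprecated, and the decoding parameters shown are illustrative assumptions rather than the exact settings used in our runs.

    # Sketch of the rephrasing step via the legacy OpenAI completions endpoint (openai v0.x).
    import openai

    openai.api_key = "YOUR_API_KEY"  # placeholder

    def rephrase_tweet(tweet_text):
        prompt = f"Paraphrase this tweet in less than 280 characters: {tweet_text}"
        response = openai.Completion.create(
            model="text-davinci-003",  # deprecated; gpt-3.5-turbo-instruct is OpenAI's suggested substitute
            prompt=prompt,
            max_tokens=120,            # illustrative
            temperature=0.7,           # illustrative
        )
        return response["choices"][0]["text"].strip()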

Robustness checks

We primarily assess the robustness of our results by using different prompts and versions of LLMs. Specifically, we randomly selected 1,000 observations from our dataset of 50,000 climate change-related tweets as the robustness check sample. To evaluate the impact of prompt variations on sentiment, we tested two distinct prompts. The first prompt included explicit instructions to preserve sentiment: “Paraphrase this tweet in less than 280 characters, and keep the overall sentiment of the tweet remains unchanged: [original tweet text].” The second prompt focused on grammar correction: “Paraphrase the content, and make sure the grammar is correct: [original tweet text].” Due to limited accessibility to text-davinci-003 at the time of revision, this second prompt was executed using gpt-3.5-turbo-instruct, following OpenAI's recommendation as a suitable substitute.

To examine the effect of language model sophistication, we applied all three prompts to ChatGPT-4o-mini, a cost-effective and versatile model developed by OpenAI, designed to handle text, image, video, and audio tasks with low latency (22). As a more advanced model compared with both gpt-3.5-turbo-instruct and text-davinci-003, ChatGPT-4o-mini enabled us to evaluate whether advancements in model sophistication mitigate unintended sentiment shifts in rephrased content.
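
The robustness-check prompts can be issued the same way against the newer model; the sketch below uses the chat completions endpoint of the openai v1.x client with the gpt-4o-mini model, with decoding parameters left at their defaults as an assumption.

    # Sketch: robustness-check rephrasing with gpt-4o-mini via the chat completions endpoint.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPTS = {
        "main": "Paraphrase this tweet in less than 280 characters: {tweet}",
        "preserve_sentiment": ("Paraphrase this tweet in less than 280 characters, and keep the "
                               "overall sentiment of the tweet remains unchanged: {tweet}"),
        "grammar": "Paraphrase the content, and make sure the grammar is correct: {tweet}",
    }

    def rephrase(tweet_text, prompt_key="main"):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": PROMPTS[prompt_key].format(tweet=tweet_text)}],
        )
        return response.choices[0].message.content.strip()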

Analysis of Amazon reviews

Our Amazon review dataset, derived from the work by Ni et al. (23), encompasses product reviews and metadata spanning May 1996 to July 2014 across various categories. We randomly selected the "Video Games" category and applied the same LLM rephrasing process described previously. Due to the length constraints of our LLM variant (text-davinci-003 allows up to 4,097 tokens of context) and the frequently excessive length of customer reviews, we randomly sampled 10,000 reviews containing >500 words for rephrasing. To adapt the process, we modified the supplied prompt to: "Paraphrase this review: [original review text]". Employing VADER for sentiment scoring, we conducted paired two-tailed t-tests to examine differences between the LLM-rephrased and original reviews.

Replication of reference paper

To investigate whether sentiment changes influence scientific conclusions based on sentiment analysis, we first replicated the original analysis by Smirnov and Hsieh (16) examining the impact of daily COVID-19 cases and deaths on climate change tweet sentiments. This initial replication step was crucial for validating alignment between the regression models delineated in the reference paper and those adopted in our study.

Subsequently, we constructed a hypothetical scenario wherein Twitter users might utilize language models to rephrase their original tweets before posting. We replaced the sample of 50,000 original climate change tweets with their corresponding LLM-rephrased variants and reran the replicated regression models. By comparing these updated results to our initial replication models, we aimed to evaluate whether the sentiment modifications introduced by LLM rephrasing could alter the substantive conclusions presented by Smirnov and Hsieh (16).
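
A minimal sketch of this re-estimation is shown below, using statsmodels with Newey–West (HAC) standard errors as in the reference paper; the DataFrame column names and the lag length are illustrative assumptions rather than the exact specification of the replication files.

    # Sketch: daily time-series regression of tweet sentiment with Newey-West (HAC) SEs.
    import statsmodels.formula.api as smf

    def replicate_sentiment_regression(daily):
        """daily: pandas DataFrame, one row per day, sentiment scaled to 0-100 plus predictors."""
        return smf.ols(
            "negative_score ~ covid_cases_1k + climate_tv_stories + debates"
            " + elections + wildfires + hurricanes + C(day_of_week)",
            data=daily,
        ).fit(cov_type="HAC", cov_kwds={"maxlags": 7})  # maxlags is an illustrative choice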

Predictive models for mitigating sentiment biases

Our three predictive regression models (Linear Regression, Neural Network Regression, and Random Forest Regression) were selected based on the nature of our dependent variables, namely the negative (Original-Neg), positive (Original-Pos), and compound (Original-Comp) sentiment scores of the original tweets. These scores are continuous, making regression analysis an appropriate tool for our investigation. We utilized the sentiment scores of the LLM-rephrased tweets (LLM-Neg, LLM-Pos, LLM-Comp) as independent variables to predict the original sentiment scores. The regression model for each dependent variable Original ∈ {Original-Neg, Original-Pos, Original-Comp} was as follows:

Original_i = β0 + β1·LLM-Neg_i + β2·LLM-Pos_i + β3·LLM-Comp_i + ε_i    (1)

where β0 is the intercept, εi is the error term for tweet i, and β1, β2, and β3 are coefficients for the sentiment scores of the LLM-rephrased tweets. The premise of our approach is that LLMs tend to modify text sentiment in a consistent and predictable manner. By identifying these patterns through regression analysis, we aim to retroactively predict the original sentiment of a tweet after it has been altered by an LLM. This process relies on comparing known original sentiment scores against the sentiment scores of their LLM-rephrased counterparts to train our model. Through this methodology, we expect to accurately estimate the original sentiment when only the modified text is available.

For the empirical analysis, we partitioned a dataset of 50,000 tweets, allocating 40,000 tweets for training and the remaining 10,000 for validation purposes. This separation allowed us to rigorously test the predictive accuracy of our models. In evaluating the performance of the Linear Regression model, we specifically looked at the discrepancy between the model's predicted sentiment scores and the actual sentiment scores of the original tweets.

In addition to the Linear Regression model, we employed two alternative predictive models to estimate the original sentiment scores of tweets: a Neural Network with a single hidden layer of two nodes, and Random Forest Regression with 500 trees. Each model used the same dataset for training and validation, and their performance was evaluated based on the residuals between the predicted and actual sentiment scores of the original tweets.
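
The sketch below illustrates this setup with scikit-learn, matching the model specifications described above (a single hidden layer of two nodes for the neural network and 500 trees for the random forest); data loading and the 40,000/10,000 split are assumed, and the one-sample t-test mirrors the residual analysis reported in Table 3.

    # Sketch of Strategy 1: predicting original sentiment from LLM-rephrased sentiment scores.
    # X_* hold the rephrased (neg, pos, compound) scores; y_* hold one original sentiment score.
    from scipy import stats
    from sklearn.linear_model import LinearRegression
    from sklearn.neural_network import MLPRegressor
    from sklearn.ensemble import RandomForestRegressor

    def fit_and_evaluate(X_train, y_train, X_test, y_test):
        models = {
            "linear": LinearRegression(),
            "neural_net": MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000, random_state=0),
            "random_forest": RandomForestRegressor(n_estimators=500, random_state=0),
        }
        for name, model in models.items():
            model.fit(X_train, y_train)
            residuals = y_test - model.predict(X_test)
            t_stat, p_value = stats.ttest_1samp(residuals, 0.0)  # is the mean residual zero?
            print(f"{name}: mean residual = {residuals.mean():.5f}, "
                  f"variance = {residuals.var():.5f}, P = {p_value:.3f}")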

Fine-tuning models for mitigating sentiment biases

As mentioned in the main text, our exploration included experiments with both the OpenAI fine-tuning model and Meta’s LLAMA2 model. Given the procedural similarities between the two models, we will focus on the OpenAI model for a more detailed explanation of our methodology.

Our initial step involved constructing a training dataset for the fine-tuning model. From our original training dataset of 40,000 tweets, we randomly selected 3,000 instances. We then used the LLM text-davinci-003 variant, applying consistent parameters and prompts as used in our rephrasing process (“Paraphrase this tweet in less than 280 characters: [original tweet text]”), to rephrase these 3,000 tweets. This procedure was repeated 10 times, resulting in an augmented dataset of 30,000 rows containing both the original tweets and their LLM-rephrased counterparts (i.e. in original and LLM-rephrased tweet pairings). Notably, we also incorporated an additional set of 3,000 rows including original tweets only (i.e. replicating the original tweets and organizing them in original–original tweet pairings), presenting instances where the original content was also treated as “LLM-rephrased” content. This approach was intended to make the model useful in situations where there is uncertainty on the provenance of the text, culminating in a comprehensive training dataset of 33,000 rows.

After assembling the dataset, we proceeded to create our fine-tuning model using OpenAI’s gpt-3.5-turbo. We crafted a system message to guide the fine-tuning process, aiming to “Reverse engineer the provided text to generate human-written tweets that closely match the original style and content.” This directive was designed to instruct the fine-tuned model to generate outputs that closely mirror the linguistic style and content of human language. For the user messages (input content), we used the LLM-rephrased content, while the original tweets were assigned as assistant messages (output content).
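
A sketch of assembling this training file in OpenAI's chat-formatted JSONL layout is shown below; the list of (input, original) pairs is assumed to contain both the 30,000 rephrased-original pairings and the 3,000 original-original pairings described above.

    # Sketch: writing the fine-tuning training file in OpenAI's chat JSONL format.
    import json

    SYSTEM_MSG = ("Reverse engineer the provided text to generate human-written tweets "
                  "that closely match the original style and content.")

    def write_training_file(pairs, path="finetune_train.jsonl"):
        """pairs: iterable of (input_text, original_tweet) tuples (assumed prepared upstream)."""
        with open(path, "w", encoding="utf-8") as f:
            for input_text, original_tweet in pairs:
                record = {"messages": [
                    {"role": "system", "content": SYSTEM_MSG},
                    {"role": "user", "content": input_text},
                    {"role": "assistant", "content": original_tweet},
                ]}
                f.write(json.dumps(record, ensure_ascii=False) + "\n")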

In the application phase of the fine-tuned model, we retained the same system message used during training for consistency. The evaluation utilized the same set of 10,000 tweets employed in the predictive models as the testing dataset, with the expectation that the fine-tuned model would produce output closely resembling the human-written style based on the LLM-rephrased input.

Upon generating the reversed-human text from the fine-tuning models, we applied VADER sentiment analysis to the 10,000 observations to obtain negative, positive, and compound sentiment scores for each piece of text.


Acknowledgments

We thank Oleg Smirnov for taking the time to explain the methodology used in his original analysis.

Contributor Information

Yifei Wang, Warwick Business School, University of Warwick, Scarman Road, Coventry, CV4 7AL, United Kingdom.

Ashkan Eshghi, Warwick Business School, University of Warwick, Scarman Road, Coventry, CV4 7AL, United Kingdom.

Yi Ding, Warwick Business School, University of Warwick, Scarman Road, Coventry, CV4 7AL, United Kingdom.

Ram Gopal, Warwick Business School, University of Warwick, Scarman Road, Coventry, CV4 7AL, United Kingdom.

Supplementary Material

Supplementary material is available at PNAS Nexus online.

Funding

The authors declare no funding.

Author Contributions

Y.D. and R.G. designed research; Y.W. and A.E. performed research and analyzed data; Y.W., A.E., Y.D., and R.G. wrote the paper.

Data Availability

The original datasets used in this study are referenced and can be accessed in (i) the Harvard Dataverse repository at https://doi.org/10.7910/DVN/SFQTJZ, and (ii) UC San Diego Computer Science Department repository at https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/. The data and codes used in our analysis will be made available in the Open Science Framework repository at https://osf.io/azskv/?view_only=16baef51bc434df5a909b68e0fe97d86.

References

  • 1. Mousavi R, Gu B. 2019. The impact of Twitter adoption on lawmakers' voting orientations. Inf Syst Res. 30:133–153.
  • 2. Rui H, Liu Y, Whinston A. 2013. Whose and what chatter matters? The effect of tweets on movie sales. Decis Support Syst. 55:863–870.
  • 3. Box-Steffensmeier JM, Moses L. 2021. Meaningful messaging: sentiment in elite social media communication with the public on the COVID-19 pandemic. Sci Adv. 7:eabg2898. 10.1126/sciadv.abg2898.
  • 4. Sukhwal PC, Kankanhalli A. 2022. Determining containment policy impacts on public sentiment during the pandemic using social media data. Proc Natl Acad Sci U S A. 119:e2117292119. 10.1073/pnas.2117292119.
  • 5. Song T, Huang J, Tan Y, Yu Y. 2019. Using user- and marketer-generated content for box office revenue prediction: differences between microblogging and third-party platforms. Inf Syst Res. 30:191–203.
  • 6. Trozze A, Davies T, Kleinberg B. 2024. Large language models in cryptocurrency securities cases: can a GPT model meaningfully assist lawyers? Artif Intell Law. 10.1007/s10506-024-09399-6.
  • 7. Zhao Z, et al. 2024. Recommender systems in the era of large language models (LLMs). IEEE Trans Knowl Data Eng. 10.1109/TKDE.2024.3392335.
  • 8. Feng Y, et al. 2023. A large language model enhanced conversational recommender system. arXiv. 10.48550/arXiv.2308.06212, preprint: not peer reviewed.
  • 9. Maatouk A, Piovesan N, Ayed F, De Domenico A, Debbah M. 2024. Large language models for telecom: forthcoming impact on the industry. arXiv. 10.48550/arXiv.2308.06013, preprint: not peer reviewed.
  • 10. Gu Y, et al. 2023. Distilling large language models for biomedical knowledge extraction: a case study on adverse drug events. arXiv. 10.48550/arXiv.2307.06439, preprint: not peer reviewed.
  • 11. Nguyen HA, Bhat S, Moore S, Bier N, Stamper J. 2022. Towards generalized methods for automatic question generation in educational domains. Lect Notes Comput Sci. 13450:272–284.
  • 12. Dijkstra R, Genc Z, Kayal S, Kamps J. 2022. Reading comprehension quiz generation using generative pre-trained transformers. In: Proceedings of the Fourth International Workshop on Intelligent Textbooks, co-located with the 23rd International Conference on Artificial Intelligence in Education (AIED 2022), July 27, 2022, Durham, UK. CEUR-WS.
  • 13. Dwivedi YK, et al. 2023. Opinion paper: "so what if ChatGPT wrote it?" Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int J Inf Manage. 71:102642.
  • 14. Stokel-Walker C, Van Noorden R. 2023. What ChatGPT and generative AI mean for science. Nature. 614:214–216.
  • 15. Veselovsky V, Ribeiro MH, West R. 2023. Artificial intelligence: crowd workers widely use large language models for text production tasks. arXiv. https://arxiv.org/abs/2306.07899, preprint: not peer reviewed.
  • 16. Smirnov O, Hsieh P-H. 2022. COVID-19, climate change, and the finite pool of worry in 2019 to 2021 Twitter discussions. Proc Natl Acad Sci U S A. 119:e2210988119. 10.1073/pnas.2210988119.
  • 17. Vollero A, Sardanelli D, Siano A. 2021. Exploring the role of the Amazon effect on customer expectations: an analysis of user-generated content in consumer electronics retailing. J Consum Behav. 22:1062–1073.
  • 18. Mudambi SM, Schuff D. 2010. Research note: what makes a helpful online review? A study of customer reviews on Amazon.com. MIS Q. 34:185.
  • 19. Parthasarathy VB, Zafar A, Khan A, Shahid A. 2024. The ultimate guide to fine-tuning LLMs from basics to breakthroughs: an exhaustive review of technologies, research, best practices, applied research challenges and opportunities. arXiv. https://arxiv.org/abs/2408.13296, preprint: not peer reviewed.
  • 20. Hutto C, Gilbert E. 2014. VADER: a parsimonious rule-based model for sentiment analysis of social media text. Proc Int AAAI Conf Web Soc Media. 8:216–225.
  • 21. OpenAI. OpenAI's GPT-3.5 Turbo. Available at https://platform.openai.com/docs/models/gpt-3-5.
  • 22. OpenAI. GPT-4o mini: advancing cost-efficient intelligence. Available at https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence.
  • 23. Ni J, Li J, McAuley J. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics. p. 188–197. 10.18653/v1/d19-1018.


