Abstract
Background
Acute respiratory infections (ARIs) represent a major global public health burden, requiring timely surveillance and early detection to mitigate their impact. Traditional epidemiological monitoring systems often suffer from reporting delays, motivating the exploration of alternative data sources such as social media combined with machine learning techniques.
Methods
This study presents a systematic review of the literature on ARI prediction using social media data and machine learning models. Relevant studies were identified through structured searches of major scientific databases, following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. The selected studies were classified into four levels of complexity and subsequently analyzed in terms of data sources, feature extraction strategies, machine learning algorithms, evaluation metrics, and prediction objectives.
Results
The reviewed studies demonstrate that social media platforms, particularly Twitter (now X), can provide valuable signals correlated with ARI incidence. A wide range of machine learning methods have been employed, including regression models, support vector machines, ensemble methods, and deep learning approaches. Overall, the results indicate that machine learning models leveraging social media data can achieve competitive predictive performance, often complementing or enhancing traditional surveillance systems. However, challenges related to data noise, population bias, and model generalization remain.
Conclusions
The findings highlight the potential of integrating social media data and machine learning techniques for ARI prediction and public health surveillance. While promising, future research should focus on improving data quality, model interpretability, and robustness, as well as on validating these approaches across different geographic regions and respiratory diseases.
Supplementary information
The online version contains supplementary material available at 10.1186/s12911-026-03390-8.
Keywords: Social Media, Machine Learning, Decision Support Systems Management, Disease Outbreaks, Respiratory Tract Infections
Graphical Abstract
Background
Infectious diseases contribute significantly to the global burden of illness. Moreover, epidemic outbreaks can lead to substantial economic and social consequences, as demonstrated by the COVID-19 pandemic caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2).
Acute respiratory infections (ARI) comprise a group of disorders ranging from mild (such as the common cold and pharyngitis) to severe illness (such as bronchiolitis and pneumonia) [1]. These are caused by a wide array of microorganisms including viruses (for example, influenza, SARS-CoV-2, and respiratory syncytial virus), bacteria (for example, Streptococcus pneumoniae, Streptococcus pyogenes, Haemophilus influenzae, and Moraxella catarrhalis), and fungi. COVID-19 is a term used to define infections that are caused specifically by SARS-CoV-2. While some patients with SARS-CoV-2 infection (i.e. COVID-19) may present with non-respiratory symptoms, most COVID-19 cases are a subset of ARIs [2, 3].
As such, ARI encompass diseases affecting diverse sites within the respiratory system. According to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), by the World Health Organization (WHO), respiratory diseases are classified under codes J00–J99, which cover 356 distinct conditions. Within this group of disorders, those caused by respiratory pathogens are considered ARI. Additionally, COVID-19 is included under codes U07.0 and U07.1. The complete ICD-10 classification can be consulted at: https://icd.who.int/browse10/2019/en.
Traditional methods for monitoring the spread of infectious diseases rely on Indicator-Based Surveillance (IBS) systems. These systems analyze structured data gathered through surveillance and monitoring protocols tailored to specific diseases. The standardized nature of IBS allows health authorities to systematically record information and track trends over time. However, their dependence on case confirmation, often requiring laboratory testing, and the need for proactive identification limit the volume of data that can be collected, especially under budgetary constraints [4]. Another limitation of these systems is the time lag between data collection and publication of official reports. Furthermore, IBS systems typically lack the ability to detect outbreaks caused by newly emerging pathogens [4].
To address some of these challenges, Event-Based Surveillance (EBS) systems have been developed as a complementary approach. According to the World Health Organization, EBS involves the organized and rapid capture of information regarding events that may pose a threat to public health [5]. Unlike IBS, EBS collects data in real time directly from observers, using diverse sources such as social media, news outlets, and public health networks. Although the data in EBS systems is often unstructured, noisy, and harder to verify, these systems offer important advantages: they are cost-effective and capable of quickly identifying potentially hazardous events, even in areas lacking traditional surveillance infrastructure. As a result, EBS can detect outbreaks and emerging threats in real time, which might go unnoticed by conventional systems [4].
On the other hand, in a world where 59% of the population has internet access and 49% are active social media users, the information generated by individuals represents a valuable resource for the welfare of society. In the context of this work, such welfare is reflected in the enhanced surveillance of respiratory infectious disease outbreaks. As of early 2020, over 4.5 billion people were using the internet and social media users had surpassed 3.8 billion; the continuous expansion of social media usage ensures a growing volume of data [6].
Therefore, the steady increase in the use of social media has led researchers worldwide to assess the potential use of publicly available information on these platforms for epidemiological studies. Systematic reviews consistently identify Twitter (now known as X) as the most widely used source in studies exploring the intersection of social media and public health [7, 8]. Among the health conditions studied most frequently, ARI (including influenza and influenza-like illness (ILI)) stand out due to their high relevance and data availability [7].
However, the application of social media data in research varies widely. Some studies focus on sentiment analysis, others examine the volume of posts on trending topics, and some analyze visual content such as images [9]. Notably, previous systematic reviews that incorporate machine learning (ML) for monitoring social media activity tend to refer to ML techniques in a general sense and do not assess the complexity of the models applied [10].
The goal of this work is to examine recent research that uses ML techniques to identify and quantify messages posted on Twitter related to infectious respiratory diseases. Considering that both the data collection period and the association measures employed influence the precision of ML techniques, the research questions related to our objective are the following:
What analysis methodologies are used in studies related to the use of social networks for the prediction of acute respiratory infections (ARI)?
What are the main ML techniques used in investigations related to the extraction of information on Twitter?
What metrics are used to measure the performance of the ML techniques?
What are the main countries where studies related to the extraction of information on Twitter with ML techniques are carried out?
In which time periods have the largest number of studies related to our topic been published?
First, we investigate the analysis methodologies used across studies to understand the approaches applied in processing social media data to create prediction models for ARI. Second, we identify the ML techniques used most commonly to extract information from Tweets. Third, we examine the performance metrics used to evaluate the effectiveness of these methodologies and ML models, aiming to assess the rigor and comparability of results across studies. Additionally, we explore the geographic distribution of the research, highlighting the countries that have contributed the most to the development of this field. Finally, we analyze the temporal trends in publication, identifying the periods during which the highest volume of relevant studies has emerged, thus providing insight into how interest in this topic has evolved over time.
Methods
The methodology used in the development of this systematic literature review was the PRISMA Statement for Reporting Systematic Reviews, developed by Moher et al. [11]. This methodology, built around explicit research questions, structures a literature review in an orderly manner and presents the results under predefined criteria.
The results obtained from the broad search are not solely statistical in nature (quantitative); rather, the in-depth analysis of the selected studies (qualitative) constitutes the primary contribution of this systematic review. We categorize the studies according to the algorithms used in their mathematical models, considering both the specific task addressed (classification, regression, or clustering) and the complexity of the models in relation to how they process information from social media. In addition, we report the performance metrics and highlight the best outcomes presented in the reviewed literature.
In summary, the PRISMA methodology is important for systematic reviews because it promotes standardization, transparency, comprehensive reporting, bias reduction, and enhanced quality; it also eases peer review, increases credibility, and supports evidence-based decision-making. It is widely accepted in health-related studies [7, 12, 13].
Search strategy
A search for scientific literature was conducted using the EBSCO Discovery Service (EDS) database, covering publications from January 2006 to December 2020. The starting point was selected based on the creation of Twitter in 2006, while the end date was chosen to exclude the exponential surge in COVID-related publications between 2021 and 2023. To validate our findings and ensure relevance, results from 2021 to 2024, particularly from systematic reviews, are discussed in the Discussion section. Additionally, a filter was applied to include only peer-reviewed publications. Table 1 outlines the search terms used, organized into three thematic sections; each selected article was required to include at least one term from each section. The EDS platform also provided metadata attributes for each article, which are described below:
Article Title
Author
Journal Title
ISSN
ISBN
Publication Date
Volume
Issue
First Page
Page Count
Accession Number
DOI
Table 1.
Search terms applied on the EDS database, including title and abstract during the search
| Terms related to diseases | Twitter-related terms | Terms related to prediction models |
|---|---|---|
| influenza OR coronavirus OR SARS OR epidemic OR healthcare OR disease OR outbreak OR surveillance OR epidemiology OR pandemic OR influenza-like illness OR pneumonia OR acute respiratory infection OR COVID OR public health OR surveillance systems OR flu OR ILI OR SARS-CoV-2 OR COVID-19 | social media OR twitter OR tweets | forecasting OR machine learning OR predict OR deep learning OR predicting OR classification OR recognition OR artificial intelligence OR classifiers |
The databases consulted by EDS are listed in Table 2.
Table 2.
Total articles (n = 2032) identified in the search and classified by content provider
| Content Provider | Number of Records |
|---|---|
| MEDLINE | 527 |
| Complementary Index | 395 |
| Academic Search Ultimate | 282 |
| Directory of Open Access Journals | 176 |
| IEEE Xplore Digital Library | 163 |
| Supplemental Index | 125 |
| ScienceDirect | 100 |
| Library, Information Science & Technology Abstracts with Full Text | 98 |
| Springer Nature Journals | 80 |
| Business Source Complete | 55 |
| Emerald Insight | 8 |
| ERIC | 6 |
| JSTOR Journals | 5 |
| GreenFILE | 4 |
| Dentistry & Oral Sciences Source | 4 |
| SciELO | 2 |
| SciTech Connect | 1 |
| Academic Source | 1 |
Study selection
For the study selection process, the titles of the articles were first reviewed, and the following were excluded: (i) duplicate entries, (ii) articles published in a language other than English, and (iii) systematic reviews. After identifying relevant titles, the abstracts were examined, and studies were excluded if they were not related to (i) human subjects, (ii) social networks, or (iii) respiratory diseases or prediction models.
Only peer-reviewed publications, such as journal articles and conference papers, were included. Informal sources, including technical reports, handouts, and non–peer-reviewed content, were excluded from the review. Additionally, articles focused solely on sentiment analysis were removed, as the measurement of people’s emotions falls outside the scope of this research. To minimize bias, three reviewers independently carried out the study selection.
Data collection process
For each selected article, we extracted key information such as the authors, year of publication, predictive models used, evaluation metrics and outcomes, as well as the time period, geographic region, and the specific respiratory disease examined in the study.
Due to the diverse approaches available for classifying methodologies and algorithms, we adopted the framework proposed by Dai et al. [14] to address the first two research questions, as it is specifically designed for the context of public health and social media data. This framework defines the following categories:
Learning-based Approaches. These methods require labeled data for training, after which a ML classifier is used to predict whether a tweet is related to ARI. Some of the most commonly used techniques include Naive Bayes, KNN and SVM.
Lexicon-Based Approaches. This is a knowledge-based unsupervised learning approach that does not require annotated labels for training; instead, it leverages domain-specific dictionaries to assign labels based on predefined knowledge.
Word Embedding Based Approach. This is an unsupervised learning method that does not require annotated data; it learns optimal vector representations from the context of surrounding words. A tweet can be represented by a set of such vectors and grouped into clusters of semantically similar words, capturing the underlying meaning of the text.
Keywords-based Approaches. This methodology establishes a correlation between the number of respiratory infection–related keywords identified and the number of confirmed cases reported.
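As a minimal sketch of the first category, the snippet below trains a multinomial Naive Bayes classifier (with Laplace smoothing) to label tweets as ARI-related or not. The tweets, labels, and tokenization are invented for illustration and are not taken from any reviewed study:

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_nb(docs, labels):
    """Fit multinomial Naive Bayes: per-class token counts, priors, vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for doc, y in zip(docs, labels):
        toks = tokenize(doc)
        word_counts[y].update(toks)
        vocab.update(toks)
    return word_counts, class_counts, vocab

def predict_nb(model, doc):
    """Return the class with the highest smoothed log-posterior."""
    word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for y in class_counts:
        lp = math.log(class_counts[y] / n)          # log prior
        total = sum(word_counts[y].values())
        for tok in tokenize(doc):                    # Laplace-smoothed likelihoods
            lp += math.log((word_counts[y][tok] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Hypothetical labeled tweets (1 = ARI-related, 0 = not)
docs = ["feeling sick with flu and fever", "bad cough and sore throat today",
        "the flu shot line was long", "great football game tonight",
        "new phone arrived today"]
labels = [1, 1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(model, "fever and cough all week"))  # → 1
```

In practice the reviewed studies use far larger annotated corpora and richer features, but the supervised train-then-classify structure is the same.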
Results
The results are organized into two sections: Study Selection and Study Characteristics. The Study Selection section outlines the process by which articles were identified and included in this systematic review, while Study Characteristics presents the findings corresponding to each research question outlined in the Background section.
Study selection
The search yielded 2032 records, with their distribution by content provider shown in Table 2. After removing 1034 duplicates, the titles and abstracts of the remaining 998 articles were reviewed. This screening involved verifying whether each study addressed: i) the social network Twitter, ii) respiratory diseases, and iii) machine learning techniques or prediction models. Articles focusing solely on sentiment or emotion analysis of tweets were excluded. As a result, 883 articles were discarded because they did not meet the aforementioned criteria, along with 9 non-English publications and 12 systematic literature reviews. This left 94 articles for full-text review, where the same three inclusion criteria were re-evaluated. Ultimately, 46 articles met the eligibility requirements. Details of the reviewed literature are provided in the Appendix.
Figure 1 summarizes the whole process using the PRISMA flow diagram.
Fig. 1.
PRISMA flow diagram for the selection of literature reviewed
Study characteristics
Based on the data extracted from each article, tables and graphs were generated to address the research questions. At this stage, the grouped information was analyzed to provide answers to each question.
What analysis methodologies are used in studies related to the use of social networks for the prediction of acute respiratory infections (ARI)?
To address the first research question, the methodologies were grouped into four categories, following the classification proposed by Dai et al. [14]. The most represented category was the Learning-based Approach, accounting for 20 out of 46 studies, followed by the Keywords-based Approach with 15 out of 46, as shown in Table 3. The years with the highest number of Learning-based studies were 2015 and 2018, each with four publications. In the Keywords-based category, 2018 stands out with the highest number of studies (five).
Table 3.
Analysis methodologies in social networks
| Approach | Number of studies |
|---|---|
| Learning-based | 20 |
| Keywords-based | 15 |
| Word embedding-based | 9 |
| Lexicon-based | 2 |
What are the main ML techniques used in studies related to the extraction of information on Twitter?
To address the second research question, we analyzed the models employed within each of the four methodological categories. The most frequently used ML techniques were found in the Learning-based Approach category. In contrast, the Keywords-based Approach featured the fewest ML techniques, primarily due to the simplicity of its models. While some correlation methods are included in this category, they are not considered machine learning techniques.
Learning-based approaches
Tables 4 and 5 summarize the key characteristics of studies that employed learning-based approaches. Table 4 presents the machine learning technique used and the type of respiratory disease studied, listed in the second and third columns, respectively. Based on the disease of interest included in the studies, ARI were categorized as Flu, Influenza, H1N1, Middle East Respiratory Syndrome (MERS), and COVID-19. The first column indicates the study number out of the 46 studies identified in this systematic review. The fourth column provides the reference citation for each study—note that the study number is not to be confused with the reference number.
Table 4.
Learning-based approaches: techniques and respiratory diseases
| Study | Technique | Respiratory Disease | Ref. |
|---|---|---|---|
| 1 | SVR | H1N1 | [15] |
| 2 | Linear Regression and SVR | ILI | [16] |
| 3 | SVM, Naive Bayes, Random Forest, Decision Trees and K-Nearest Neighbour | ILI | [17] |
| 4 | SVM, Naive Bayes, Decision Trees and K-Nearest Neighbour | Flu | [18] |
| 5 | Multilayer Perceptron | Flu | [19] |
| 6 | SVM, C4.5 Decision Trees and Naive Bayes | Influenza | [20] |
| 7 | Stacked Linear Regression, SVM, and AdaBoost with Decision Trees Regression | Influenza | [21] |
| 8 | Naive Bayes | Influenza | [22] |
| 9 | SVM and Naive Bayes | ILI | [23] |
| 10 | SVM | ILI | [24] |
| 11 | Least Absolute Shrinkage and Selection Operator | Influenza | [25] |
| 12 | SVM, AdaBoost and Long Short-Term Memory (LSTM) | ILI | [26] |
| 13 | SVM, Naive Bayes, Random Forest and Decision Trees | H1N1 | [27] |
| 14 | SVM, Logistic Regression and Naive Bayes | MERS | [28] |
| 15 | Backpropagation Neural Network | ILI | [29] |
| 16 | SVM | Influenza | [30] |
| 17 | Autoregressive Models, Deep Multilayer Perceptron and Convolutional Neural Network (CNN) | ILI | [31] |
| 18 | FastText, Random Forest, Naive Bayes, SVM, C4.5 Decision Trees, k-Nearest Neighbors (KNN) and AdaBoost | ILI | [32] |
| 19 | SVR | ILI | [33] |
| 20 | XGBoost, Decision Trees | COVID-19 | [34] |
Table 5.
Learning-based approaches: region and period
| Study | Region | Period Analyzed |
|---|---|---|
| 1 | USA | October 4, 2009 to May 16, 2010 |
| 2 | USA | September, 2009 to May, 2010 |
| 3 | Portugal | March, 2011 to February, 2012 |
| 4 | Portugal and Spain | October 30, 2012 to November 30, 2012 |
| 5 | USA | January, 2011 to April, 2015 |
| 6 | Victoria, Australia | May, 2011 to August, 2011 |
| 7 | USA | 2011 to 2013 |
| 8 | USA | Not Defined |
| 9 | UK | February, 2014 to August, 2014 |
| 10 | Cincinnati, USA | November 1, 2014 to May 1, 2015 |
| 11 | USA, Brazil, Paraguay, Mexico, and Venezuela | July, 2012 to May, 2013 |
| 12 | 31 geolocations (25 in USA and 6 in other countries) | January, 2011 to December, 2014 |
| 13 | India | Not Defined |
| 14 | Global | April 27, 2014 to July 16, 2014 |
| 15 | USA | October, 2016 to October, 2017 |
| 16 | Japan | November, 2012 to May, 2013; November, 2013 to May, 2014; November, 2014 to May, 2015 |
| 17 | USA | 2009 to 2010 and 2011 to 2014 |
| 18 | USA | January, 2018 to May, 2018 |
| 19 | USA | October, 2016 to October, 2017 |
| 20 | China | January 22, 2020 to April 13, 2020 |
Table 5 complements this information by showing the geographical region and the time period analyzed in each of the studies listed in Table 4. In many of the selected studies, more than one ML technique was evaluated and the results of different models were compared.
Table 6 contains the most frequently used ML methods among the reviewed studies. Support Vector Machines (SVM) was the most frequently used method, accounting for 11 out of 37 cases, followed by Naive Bayes with 8 out of 37 cases, and Decision Tree-based approaches, including AdaBoost and XGBoost.
Table 6.
Main ML techniques for learning-based approaches
| ML technique | Number of studies |
|---|---|
| SVM | 11 |
| Naive Bayes | 8 |
| Decision Tree | 6 |
| AdaBoost | 3 |
| K-Nearest Neighbour | 3 |
| Random Forest | 3 |
| SVR | 3 |
While both SVM and SVR are based on kernel methods, SVM was originally developed for classification tasks [35]. Some studies explicitly use the term SVR when referring to regression tasks, whereas others use the term SVM interchangeably for both classification and regression applications.
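The distinction can be made concrete with scikit-learn, whose `SVC` (classification) and `SVR` (regression) estimators share the same kernel machinery. The weekly tweet counts and ILI rates below are invented toy values, not data from any reviewed study:

```python
# Contrast between SVM for classification (SVC) and SVR for regression,
# using scikit-learn; all figures are illustrative toy values.
from sklearn.svm import SVC, SVR

# Classification: label a week as high ILI activity (1) or not (0)
X_cls = [[10], [12], [48], [55], [11], [60]]   # weekly ARI-related tweet counts
y_cls = [0, 0, 1, 1, 0, 1]
clf = SVC(kernel="rbf").fit(X_cls, y_cls)
print(clf.predict([[50]]))  # → [1]

# Regression: predict a continuous ILI incidence rate from the same signal
X_reg = [[10], [20], [30], [40], [50]]
y_reg = [1.0, 2.1, 2.9, 4.2, 5.0]              # % ILI visits
reg = SVR(kernel="rbf", C=100).fit(X_reg, y_reg)
print(float(reg.predict([[35]])[0]))           # interpolated rate near 3.5
```

This mirrors the terminological overlap noted above: the same kernel method appears as "SVM" in classification studies and as "SVR" when the target is a continuous incidence value.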
Lexicon-based approaches
The analysis corresponding to this classification is presented in Tables 7 and 8. The results include the applied technique, the targeted respiratory disease, the geographical region, and the time period analyzed. Only two studies using this methodology were identified [36, 37].
Table 7.
Lexicon based approach: techniques and respiratory diseases
Table 8.
Lexicon based approach: region and period
| Study | Region | Period Analyzed |
|---|---|---|
| 21 | USA | January 26, 2013 to May 6, 2013 |
| 22 | USA | Not defined |
This approach is considered a knowledge-based unsupervised learning method, as it does not require annotated data for training. Instead, it relies on dictionary-based information to assign labels [14]. Both studies analyzed data from the United States. Velardi et al. [36] developed a custom prediction model, while Nabende et al. [37] employed Conditional Random Fields (CRFs) and a Log-Linear Model.
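A minimal sketch of dictionary-based labeling, assuming a hypothetical ARI lexicon (the terms and tweets below are invented for illustration):

```python
# Lexicon-based labeling: no training data is needed; a tweet is tagged as
# ARI-related when it contains any term from a domain dictionary.
ARI_LEXICON = {"flu", "fever", "cough", "influenza", "pneumonia", "sore throat"}

def label_tweet(text):
    """Return 1 if any lexicon term occurs in the tweet, else 0."""
    text = text.lower()
    return int(any(term in text for term in ARI_LEXICON))

tweets = ["Home with a fever and a bad cough", "Watching the game tonight"]
print([label_tweet(t) for t in tweets])  # → [1, 0]
```

Real lexicon-based systems refine this with weighted terms, negation handling, and curated medical vocabularies, but the labels still come from predefined knowledge rather than annotated training data.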
Word embedding based approach
The results of the studies within this classification are summarized in Table 9 and Table 10, which include details on the technique used, the respiratory disease studied, the region analyzed, and the time period covered. The techniques used most frequently in these studies are shown in Table 11. This analysis focuses specifically on the methods used to convert words into continuous vector representations, excluding techniques applied in other stages of the modeling process.
Table 9.
Word embedding based approach: techniques and respiratory diseases
| Study | Technique | Respiratory Disease | Ref. |
|---|---|---|---|
| 23 | LDA | Flu | [38] |
| 24 | LDA | ILI | [39] |
| 25 | LDA | Flu | [40] |
| 26 | Word2Vec | Influenza | [14] |
| 27 | Word2Vec (CBOW) | Swine and Avian Influenza | [41] |
| 28 | Word2Vec, GloVe | Not Defined | [42] |
| 29 | LaBSE, BERT | COVID-19 | [43] |
| 30 | Biterm Topic Model | COVID-19 | [44] |
| 31 | K-Means Clustering | COVID-19 | [45] |
Table 10.
Word embedding based approach: region and period
| Study | Region | Period Analyzed |
|---|---|---|
| 23 | 15 Countries in South America | December, 2012 to January, 2014 |
| 24 | USA | May, 2009 to October, 2010 |
| 25 | South America | December, 2012 to August, 2014 |
| 26 | USA | Not Defined |
| 27 | Global | 2009 to 2012 |
| 28 | Not Defined | July 12, 2018 to July 12, 2019 |
| 29 | Global | January 4, 2020 to April 5, 2020 |
| 30 | Global | March 3, 2020 to March 20, 2020 |
| 31 | Global | January 6, 2020 to April 15, 2020 |
Table 11.
Main techniques for word embedding based approaches
| ML technique | Number of studies |
|---|---|
| LDA | 3 |
| Word2Vec | 3 |
| LaBSE | 1 |
| BERT | 1 |
| Biterm Topic Model | 1 |
| GloVe | 1 |
| K-Means Clustering | 1 |
Word embedding has become one of the most prominent trends in Natural Language Processing (NLP) [14]. In this category, the two most commonly used techniques were Latent Dirichlet Allocation (LDA) and Word2Vec, each appearing in 3 out of 11 studies. Word2Vec enables the computation of vector representations of words using two main learning algorithms: continuous bag-of-words (CBOW) and continuous skip-gram. CBOW predicts a target word based on its surrounding context, while skip-gram predicts surrounding words from a given target word [46].
Although Table 9 includes three studies that utilized Word2Vec, only one [47] explicitly specifies the learning algorithm applied, identifying it as the CBOW model.
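The difference between the two Word2Vec training objectives can be illustrated by the (input, output) pairs each one derives from a context window. The sketch below only enumerates those pairs; an actual Word2Vec implementation then learns vector representations from millions of them (the tokens are invented for illustration):

```python
# CBOW vs. skip-gram: how each objective turns a context window into
# training pairs. Illustrative only; no vectors are learned here.
def training_pairs(tokens, window=2, mode="cbow"):
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((tuple(context), target))      # context -> target
        else:
            pairs += [(target, c) for c in context]     # target -> each context word
    return pairs

toks = ["fever", "and", "cough", "today"]
print(training_pairs(toks, window=1, mode="cbow"))
# → [(('and',), 'fever'), (('fever', 'cough'), 'and'),
#    (('and', 'today'), 'cough'), (('cough',), 'today')]
```

CBOW thus predicts the middle word from its neighbors, while skip-gram inverts the pairing and predicts each neighbor from the middle word.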
Keywords-based approaches
The studies classified under the Keywords-based Approach are presented in Tables 12 and 13. As with the other approaches, each study is summarized by its modeling technique, the respiratory disease investigated, the study region, and the time period analyzed.
Table 12.
Keywords-based approaches: techniques and respiratory diseases
| Study | Technique | Respiratory Disease | Ref. |
|---|---|---|---|
| 32 | Autoregressive Model | ILI | [48] |
| 33 | Linear Regression Model | ILI | [49] |
| 34 | Linear Regression Model | ILI | [50] |
| 35 | Linear Regression Model | ILI | [51] |
| 36 | Correlation Coefficient | Influenza | [52] |
| 37 | Correlation Coefficient | ILI | [53] |
| 38 | Autoregressive Model | ILI | [54] |
| 39 | Not Defined | Not Defined | [55] |
| 40 | Autoregressive Model | ILI | [56] |
| 41 | Autoregressive Model | Influenza | [57] |
| 42 | Linear Regression Model | Influenza | [58] |
| 43 | Correlation Coefficient | ILI | [59] |
| 44 | Autoregressive Model | Influenza | [60] |
| 45 | Nonlinear Gaussian Process | ILI | [61] |
| 46 | Linear Regression Model | COVID-19 | [62] |
Table 13.
Keywords-based approaches: region and period
| Study | Region | Period Analyzed |
|---|---|---|
| 32 | USA | October 18, 2009 to October 31, 2010 |
| 33 | USA | December, 2011 to April, 2012 |
| 34 | Korea | October, 2011 and September, 2012 |
| 35 | New York, USA | October 15, 2012 to May 10, 2013 |
| 36 | USA | December 2, 2012 to April 7, 2013 |
| 37 | 11 cities in USA | September 29, 2013 to March 1, 2014 |
| 38 | USA | November 27, 2011 to April 5, 2014 |
| 39 | Global | June, 2014 and December, 2014 |
| 40 | Maryland, USA | November 20, 2011 to March 16, 2014 |
| 41 | Boston metropolis, USA | September 6, 2009 to May 15, 2016, and May 22, 2016 to May 7, 2017 |
| 42 | United Arab Emirates | November 2016 and January 2017 |
| 43 | Toronto, Canada | June 26, 2015 to September 10, 2015 |
| 44 | Italy | October 2016 to April 2017 and October 2017 to April 2018 |
| 45 | UK | March, 2012 to August, 2015 |
| 46 | USA | January 12, 2020 to April 5, 2020 |
Table 14 presents the most commonly used techniques within this category, with both Autoregressive Models and Linear Regression Models each accounting for 5 out of 14 cases. The Keywords-based Approach features the least complex algorithms among all categories. Its primary aim is to assess whether there is a correlation between the frequency of disease-related keywords in tweets and the actual incidence of those diseases.
Table 14.
Main prediction techniques for keywords-based approaches
| Technique | Number of studies |
|---|---|
| Autoregressive Model | 5 |
| Linear Regression Model | 5 |
| Correlation Coefficient | 3 |
| Nonlinear Gaussian Process | 1 |
Linear regression methods—both autoregressive and non-autoregressive—were the most frequently applied. These included simple linear models with a single predictor, as well as multiple regression models incorporating data from Twitter and other external sources.
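In its simplest single-predictor form, this amounts to fitting a line relating weekly keyword-matched tweet counts to officially reported incidence. The sketch below uses the closed-form least-squares solution; the weekly figures are invented for illustration and not drawn from any reviewed study:

```python
# Closed-form simple linear regression, as used in keywords-based studies
# to relate ARI-keyword tweet volume to surveillance-reported incidence.
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

tweets_per_week = [120, 150, 200, 260, 310]   # hypothetical keyword-matched tweets
ili_rate = [1.2, 1.5, 2.0, 2.6, 3.1]          # hypothetical % ILI visits
b1, b0 = fit_line(tweets_per_week, ili_rate)
print(round(b0 + b1 * 280, 2))  # predicted ILI rate for a 280-tweet week → 2.8
```

Multiple regression variants extend the same idea by adding further predictors, such as lagged incidence or search-engine signals.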
What metrics are used to measure the performance of the ML techniques?
The models and algorithms used in the studies included in this review varied depending on the specific objectives of each investigation. While the classification proposed by Dai et al. [14] provides a useful framework for grouping models by complexity, differences in research goals exist even within the same category. For instance, within the Word Embedding-Based Approach, some studies focus on tweet content classification, while others use time series models to predict patterns based on the volume of classified tweets.
Overall, the models identified in the literature can be grouped into three main categories of machine learning tasks: i) Classification, ii) Regression and iii) Clustering. These categories were used to classify both the algorithms and their corresponding evaluation metrics. Due to the diverse objectives of the studies, several papers applied multiple ML techniques and reported metrics classified into more than one category.
For classification algorithms, the most commonly reported metrics were Accuracy, Precision, Recall, F-Measure, and Area Under the ROC Curve (AUC). The Correlation metric was also included in this group, as it assesses the presence and direction of relationships between variables—effectively classifying them as positively, negatively, or not correlated.
For regression algorithms, the main evaluation metrics identified were Root Mean Squared Error (RMSE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), Root Mean Squared Percentage Error (RMSPE), and Mean Absolute Error (MAE).
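For reference, these metrics have simple definitions that can be computed directly; the sketch below hand-rolls the most commonly reported ones on toy values invented for illustration:

```python
import math

# The classification and regression metrics most often reported in the
# reviewed studies, computed from their definitions.
def precision_recall_f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return prec, rec, 2 * prec * rec / (prec + rec)   # F-Measure

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    return 100 * sum(abs((a - b) / a) for a, b in zip(y, yhat)) / len(y)

# Classification example: 1 = ARI-related tweet
print(precision_recall_f1([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
# Regression example: observed vs. predicted weekly ILI rates
print(round(rmse([2.0, 2.5, 3.0], [2.2, 2.4, 3.3]), 3))  # → 0.216
```

MSE is simply the square of RMSE, and RMSPE relates to MAPE as RMSE relates to MAE, so the same pattern covers all the regression metrics listed above.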
Clustering algorithms were applied in seven of the nine studies categorized under the Word Embedding-Based Approach. However, most of these studies did not report evaluation metrics for assessing clustering quality. An exception is Chen et al. [45], who reported and visualized the within-cluster sum of squares. While Paul et al. [39] and Mackey et al. [44] mentioned the use of coherence scores to optimize clustering, they did not report the final results. The remaining studies did not include any clustering performance metrics.
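The within-cluster sum of squares reported by Chen et al. [45] measures how tightly points sit around their cluster centroids. A one-dimensional sketch with invented cluster assignments and centroids:

```python
# Within-cluster sum of squares (WCSS): squared distance of each point to
# its assigned centroid, summed over all points. Toy 1-D values for illustration.
def wcss(points, labels, centroids):
    return sum((x - centroids[c]) ** 2 for x, c in zip(points, labels))

points = [1.0, 1.2, 0.8, 5.0, 5.4]
labels = [0, 0, 0, 1, 1]
centroids = {0: 1.0, 1: 5.2}
print(round(wcss(points, labels, centroids), 2))  # → 0.16
```

Plotting WCSS against the number of clusters (the elbow method) is the usual way this quantity is used to choose a clustering configuration.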
To evaluate the performance of the models reported across the selected studies, a comparative analysis of methods and metrics was conducted. The performance of the metrics used in each study are presented in four tables, organized by the methodological classification previously described. Direct comparison of metric values across studies is challenging due to differences in datasets, model configurations, temporal scope, and geographical context. Nonetheless, a descriptive comparison allows us to highlight models that appear more promising across different research settings.
Table 15 presents the results for the Learning-based Approach. Within this category, 11 studies employed classification algorithms, and 9 used regression models. The most frequently reported metrics for classification were Accuracy, Precision, and Recall. The best-performing models for each metric were:
Accuracy (84.2%): A Hybrid Naive Bayesian Classifier with NLP.
Precision (94.38%): SVM.
Recall (90%): Naive Bayes.
Table 15.
Learning-based approaches, best model and metrics. The third column contains the Model category (cat.): regression (R) or classification (C)
| Study | Best Model | Cat. | Metrics | Metrics values |
|---|---|---|---|---|
| 1 | SVR | R | Average error; Standard Deviation | 0.37%; 0.26% |
| 2 | SVM | C | Accuracy / F1; Precision / Recall | 83.98% / 90.01%; 94.38% / 86.63% |
| 3 | Naive Bayes + Queries Model | C | F-Measure / Precision; Recall / AUC | 83% / 78%; 90% / 94.1% |
| 4 | Naive Bayes | C | F-Measure, Precision, Recall, AUC | Best overall¹ |
| 5 | Regularized Multi-task Feature Learning Model, Paraguay data | C | AUC | 83.07% |
| 6 | SVM with linear kernel and stochastic gradient descent | C | F-Measure, Precision, Recall | Best overall¹ |
| 7 | SVM with Radial Basis Function (RBF) | R | Pearson Correlation / RMSE; RMSPE / MAPE; Hit Rate | 0.989 / 0.176%; 8.27% / 23.6%; 69.4 |
| 8 | Hybrid Approach with NLP | C | Accuracy | 84.2% |
| 9 | Autoregressive Integrated Moving Average combined with tweet count regression | R | MAE | 8.20 |
| 10 | SVM based on a linear kernel | C | Accuracy | 78% |
| 11 | Social Media Nested Epidemic Simulation (SimNest) | R | MSE / Pearson Correlation; P-value | Best overall¹ |
| 12 | LSTM model, only Social Media predictors | R | Pearson Correlation / RMSE; RMSPE / MAPE | 0.79 / 0.01%; 29.52% / 69.54% |
| 13 | Naive Bayes | C | F-Measure / Precision; Recall | 77% / 70%; 86% |
| 14 | SVM classifier with RBF kernel, bag-of-words features, MERS data | C | Accuracy | 88.26% |
| 15 | IAT-BPNN | R | MSE / RMSE / MAPE | Best overall¹ |
| 16 | TRAP² model with NLP, high-population areas data | C | Accuracy | 70% |
| 17 | CNN (Twitter data) | R | RMSE / MAE | 3.12 / 4.43 |
| 18 | Random Forest | C | Precision / F-Measure; Recall | 90.5% / 90.1%; 90.2% |
| 19 | SVR with improved particle swarm optimization (data Region 10) | R | MSE / RMSE; MAPE | 0.8401 / 0.8225; 0.7082 |
| 20 | XGBoost, Feature Sets: Original + Event + SEIR (Susceptible Exposed Infected Recovered) | R | MSE³ | 5862 |
1 Best overall performance, as stated by the authors
2 TRAP: the name of a model that detects disease epidemics
3 MSE values correspond to the scale of the predicted variable
For regression algorithms, the most commonly used metrics were RMSE and MAPE. The best-performing models were:
Lowest RMSE (0.01%): Long Short-Term Memory (LSTM).
Lowest MAPE (23.6%): SVM.
In the Lexicon-Based Approach category, only two studies were identified, both employing classification algorithms (see Table 16). Precision was the only evaluation metric reported in both studies. The highest precision value, 79%, was achieved using a Conditional Random Fields (CRF) algorithm combined with a Log-Linear Model.
Table 16.
Lexicon based approaches, best model and metrics. The third column contains the Model category (cat.): classification (C)
| Study | Best Model | Cat. | Metrics | Metrics values |
|---|---|---|---|---|
| 21 | Model with a threshold | C | Precision / MRR; Coverage | 69% / 82%; 60% |
| 22 | Algorithm Conditional Random Fields, (Measles + pneumonia + mumps) data | C | Precision / Recall; F1 score | 79% / 58%; 67% |
In the Word Embedding-Based Approach category, several studies employed combined algorithms, as shown in Table 17. These combinations include regression models paired with clustering models, as well as classification models combined with clustering models. Specifically, two articles used regression algorithms, six used classification algorithms, and seven incorporated clustering algorithms. The evaluation metrics reported in this category vary widely, with only one metric—Accuracy—appearing in three studies. The highest accuracy reported was 87.1%, achieved by the Word2Vec embedding model [14].
Table 17.
Word embedding based approach, best model and metrics. The third column contains the Model category (cat.): regression (R), classification (C) or Grouping/Clustering (G)
| Study | Best Model | Cat. | Metrics | Metrics values |
|---|---|---|---|---|
| 23 | model | R | RMSE | Best overall |
| | | G | Not specified | Not specified |
| 24 | Ailment Topic Aspect Model | C | Pearson Correlation | Best overall |
| | | G | Coherence score | Not specified |
| 25 | HFSTM-A model | R | RMSE | Best overall |
| | | G | Not specified | Not specified |
| 26 | Word Embedding Clustering, similarity threshold = 0.6 | C | Precision / Recall; F1 / Accuracy | 96.2% / 75.6%; 84.6% / 87.1% |
| | | G | Cosine Similarity | Not specified |
| 27 | Algorithm II + CNN, Influenza Dataset | C | Accuracy | 72.84% |
| 28 | BiLSTM-CRF (word + char + POS¹ word embedding method) | C | Precision / Recall; F-score | 94.93% / 81.98%; 87.52% |
| 29 | SVM classifier applied on LaBSE | C | Accuracy / F1 (Micro); F1 (Macro) | 86.92% / 87.6%; 88.1% |
| | | G | Not specified | Not specified |
| 30 | Model with tweets that included users self-reporting symptoms and self-reported recovery | C | Spearman Correlation | 0.45 |
| | | G | Coherence score | Not specified |
| 31 | Twitter data: Kmeans(6), Weibo data: Kmeans(5) | G | Sum of squares within a group | Range: 1000 to 1100 |
1 POS tagging is an additional word feature
While accuracy does not directly measure clustering quality, it can be used to assess how items are assigned to clusters and how new clusters are formed. For example, Dai et al. [14] use cosine similarity to determine the number of clusters in their model. This metric quantifies the similarity between two numerical vectors. In their approach, each word in a tweet is converted into a numerical vector and compared against vectors of keywords such as “influenza,” with the goal of classifying tweets based on their proximity to these keywords.
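A minimal sketch of this idea, with invented 3-dimensional vectors standing in for real word embeddings and the 0.6 similarity threshold that appears in Table 17:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy "embeddings"; real systems would use Word2Vec-style vectors
keyword_vec = [0.9, 0.1, 0.0]  # vector for the keyword "influenza"
tweets = {
    "feeling feverish with the flu": [0.8, 0.2, 0.1],
    "great football match today":    [0.0, 0.1, 0.9],
}

THRESHOLD = 0.6  # similarity cut-off for assigning a tweet to the flu cluster
for text, vec in tweets.items():
    sim = cosine_similarity(vec, keyword_vec)
    label = "flu-related" if sim >= THRESHOLD else "unrelated"
    print(f"{sim:.2f}  {label}  {text}")
```

Tweets whose vectors fall below the threshold for every keyword can seed new clusters, which is how cosine similarity ends up governing the number of clusters.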
Most models in the Keywords-Based Approach category correspond to regression or classification (correlation) tasks, with eleven and four articles, respectively (see Table 18). The most commonly used metric for regression algorithms was RMSE, with the lowest reported value being 0.001897 for a multiple linear regression model. However, comparing RMSE values across different studies is unreliable, as RMSE depends on the scale of the predicted variable. For classification algorithms, the Correlation Coefficient was the primary metric used, with a maximum reported value of 0.79.
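The scale dependence of RMSE can be demonstrated with a toy example (all numbers invented): two series with identical 10% relative errors but different scales yield RMSE values that differ by the scale factor, while MAPE is unchanged.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error: reported in the units of the predicted variable."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error: scale-free, so comparable across studies."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Same 10% relative errors at two scales, e.g. an ILI rate vs. raw case counts
small = ([1.0, 2.0], [1.1, 2.2])
large = ([1000.0, 2000.0], [1100.0, 2200.0])

print(rmse(*small), rmse(*large))  # RMSE differs by the scale factor (x1000)
print(mape(*small), mape(*large))  # MAPE is ~10% in both cases
```

This is why the tables in this section report RMSE values without ranking them across studies.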
Table 18.
Keywords-based approaches, best model and metrics. The third column contains the Model category (cat.): regression (R) or classification (C)
| Study | Best Model | Cat. | Metrics | Metrics values |
|---|---|---|---|---|
| 32 | Model without retweets, Syndrome Elapse (time = one week) | R | Correlation Coefficient; RMSE | 0.9846; 0.318 |
| 33 | Multiple linear regression model with ridge regularization, unweighted | R | RMSE | 0.001897 |
| 34 | LASSO with a marker ‘novel flu’ | R | Coefficient of Determination | 0.853 |
| 35 | Daily infection tweets model | R | Correlation Coefficient | 0.763 |
| 36 | Google Flu Trends and USA tweets model | C | Pearson Correlation | 0.79 |
| 37 | Correlation model (San Diego data) | C | Correlation Coefficient | 0.93 |
| 38 | Model with Twitter data | R | MAE | 0.1866 |
| 39 | Not Defined | C | | |
| 40 | Model with ILI and Google Flu Trends data | R | Log-likelihood | Not Defined |
| 41 | ARGO Model (athena+Google+Flu-Near-You data) | R | RMSE / MAE / MAPE / Pearson Correlation | Best overall |
| 42 | Experiment 3 (January data) | R | RMSE | 0.14 |
| 43 | Model with Twitter data related with heat syndrome and temperature | C | Cross-Correlation Coefficient | 0.5 |
| 44 | Model M4 (nowcast) | R | RMSE / MAE; MAPE / Correlation Coefficient; Coefficient of Determination | 0.1972 / 0.1191; 10.75 / 0.9867; 0.9704 |
| 45 | Google model (national level: England data) | R | Pearson Correlation / MSE; RMSE / MAE; MAPE / Mean Error | 0.96 / 3.86; 1.96 / 1.47; 14.10% / 0.54 |
| 46 | Google Trends model | R | Coefficient of Determination | 0.9209 |
Overall, among the 46 selected articles, 23 used classification algorithms, 22 used regression algorithms, and 7 applied clustering algorithms. The most common evaluation metrics were precision for classification, RMSE for regression, and coherence score for clustering.
What are the main countries where studies related to the extraction of information from Twitter with ML techniques are carried out?
Analyzing the countries where studies were conducted is important to understand the regions in which Twitter has been applied for epidemiological surveillance and to evaluate the generalizability of their findings. This analysis also highlights areas that need further research, such as adapting text analysis methods to different languages. Additionally, it helps assess whether research activity aligns with the intensity of Twitter usage in a region or reflects particular health concerns within specific populations.
To address this question, only studies focusing on a single country were considered; those involving multiple countries were excluded. A total of 32 studies were analyzed to determine their geographical distribution. The United States led with 21 studies, representing 66% of the total. The UK followed with two studies (6%), while each of the remaining countries accounted for a single study (3%), as shown in Figure 2.
Fig. 2.
Distribution of the selected studies worldwide. Only studies focusing on a single country are shown; n = number of studies
In which time periods have the largest number of studies related to our topic been published?
Analyzing the time periods covered by the studies is important to assess whether findings remain consistent over time and if models require adjustments due to evolving language use. Most studies were conducted over relatively short durations (one year or less), highlighting the need for further research on adapting models as terminology changes. For example, the emergence of new terms such as COVID-19 and SARS-CoV-2 during epidemic periods requires updates in text analysis to reflect changing epidemiological contexts.
To address this, we examined the publication years of the selected studies. Figure 3 shows that the studies span from 2011 to 2020, with peaks in 2014 and 2018 (nine studies each). The Keywords-Based Approach was more common in the early years (2011–2014), while the Learning-based Approach has been predominant since 2015. Use of the Word Embedding-Based Approach has increased in recent years, whereas the Lexicon-Based Approach remains infrequent.
Fig. 3.
Studies classified by publication year and type of analysis methodology
Regarding publication peaks by category: Keywords-Based studies peaked in 2018; Learning-Based in 2015 and 2018; Lexicon-Based in 2014 and 2017; and Word Embedding-Based in 2020, accounting for 60% of that year’s studies, making it one of the most recent approaches to gain popularity.
Discussion
Respiratory infectious diseases continue to pose significant threats to human health. Traditional surveillance systems are labor intensive and frequently require laboratory testing for disease confirmation. Information derived from social media is increasingly considered a potential source of data that could contribute to epidemiological surveillance. Twitter has been identified as the most attractive platform to provide information for this purpose [8, 9].
Processing data from social networks to extract knowledge and selecting the most appropriate ML techniques is not a simple task. In this systematic review, we classified ML techniques according to the objectives of the analysis and the level of natural language processing on social media posts, as well as the most used metrics depending on the problem that the models solve.
Our results expand on findings from previous literature reviews. During the literature search for the present work, we identified twelve systematic reviews, of which seven fulfilled the thematic criteria for our study. However, because these publications are secondary research (they do not provide primary data), they were analyzed separately to avoid counting the results of primary research publications more than once. A summary and the main findings of these systematic reviews are presented in Tables 19 and 20.
Table 19.
Selected systematic reviews. The first column is the identifier (Id); the third column contains the number of studies analyzed (#) and the fourth column the methodology used
| Id | Title | # | Methodology | Ref. |
|---|---|---|---|---|
| a | Social Media and Internet-Based Data in Global Systems for Public Health Surveillance: A Systematic Review. | 32 | Not referenced | [4] |
| b | Using Social Media for Actionable Disease Surveillance and Outbreak Management: A Systematic Literature Review | 60 | PRISMA | [7] |
| c | Social media based surveillance systems for healthcare using machine learning: A systematic review | 26 | Not referenced | [8] |
| d | Twitter as a Tool for Health Research: A Systematic Review | 137 | PRISMA | [9] |
| e | Adoption of Digital Technologies in Health Care During the COVID-19 Pandemic: Systematic Review of Early Scientific Literature. | 124 | PRISMA | [63] |
| f | Big data analytics as a tool for fighting pandemics: a systematic review of literature | 45 | Methodi Ordinatio | [64] |
| g | Sentiment analysis and its applications in fighting COVID-19 and infectious diseases: A systematic review. | 28 | PRISMA | [65] |
Table 20.
Selected systematic reviews (databases, analysis period and notable findings). The first column is the identifier (Id)
| Id | Databases | Period | Notable findings |
|---|---|---|---|
| a | PubMed, Scopus and Scirus | 1990–2011 | There is a need for technologies that monitor health-related issues on the Internet. Although the acceptance of the scientific use of information from social networks generates diverse opinions. |
| b | PubMed, Embase, Scopus, and Ichushi-Web | -Feb. 2013 | The analysis of the literature demonstrates that information from social networks improves public health mechanisms and helps identify vulnerable populations. |
| c | IEEE, ACM Digital Library, ScienceDirect and PubMed | 2010–2018 | Most of the studies found focus on flu or influenza-like illness (ILI). Twitter is the social network with the largest number of studies developed, and the most used ML technique is Support Vector Machine. |
| d | PubMed, Embase, Web of Science, Google Scholar and CINAHL | -Sep. 2015 | Health research based on Twitter data is a growing field. To describe the use of Twitter in health areas, a taxonomy with six categories was created. |
| e | MEDLINE and medRxiv | Jan. 2020 -Apr. 2020 | AI algorithms based on image recognition and clinical data are a promising field. User health tracking applications are effective, but there is a debate about data privacy. |
| f | Web of Science and Scopus | 2014–2020 | The two main sources of Big data information are internet search engines and social networks; and some common techniques for analyzing this data are correlation and regression. |
| g | ScienceDirect, PubMed, Web of Science, IEEE Xplore and Scopus | Jan. 2010 - Jun. 2020 | The use of technologies such as AI and ML, in the field of sentiment analysis on social networks, contributes to an improvement in results. |
In addition, we conducted a review of the literature to identify additional systematic reviews published between 2021 and 2024 on the topic of interest of this review. The search was conducted using the same keywords as described in the Methods section, but limiting the results to systematic reviews. Seven new reviews were identified; a summary of these is presented in Table 21 and Table 22.
Table 21.
Systematic reviews published between 2021 and 2024. The first column is the identifier (Id), the third column contains the number of studies analyzed (#), the fourth column the methodology used, and the fifth column has the year of publication
| Id | Title | # | Methodology | Year | Ref. |
|---|---|---|---|---|---|
| h | The application of artificial intelligence and data integration in COVID-19 studies: a scoping review | 794 | PRISMA | 2021 | [66] |
| i | Surveillance of communicable diseases using social media: A systematic review | 23 | PRISMA | 2023 | [67] |
| j | Applications of machine learning for COVID-19 misinformation: a systematic review | 43 | PRISMA | 2022 | [68] |
| k | Classification of COVID-19 misinformation on social media based on neuro-fuzzy and neural network: A systematic review | 34 | Kitchenham [69, 70] | 2022 | [71] |
| l | The Impact and Applications of Social Media Platforms for Public Health Responses Before and During the COVID-19 Pandemic: Systematic Literature Review | 678 | Not referenced | 2021 | [72] |
| m | Promoter or barrier? Assessing how social media predicts COVID-19 vaccine acceptance and hesitancy: A systematic review of primary series and booster vaccine investigations | 113 | PRISMA | 2024 | [73] |
| n | Utilizing natural language processing and large language models in the diagnosis and prediction of infectious diseases: A systematic review | 15 | PRISMA | 2024 | [74] |
Table 22.
Systematic reviews published between 2021 and 2024 (databases, analysis period and notable findings). The first column is the identifier (Id)
| Id | Databases | Period | Notable findings |
|---|---|---|---|
| h | National Institutes of Health (NIH) LitCovid (part of PubMed) and the World Health Organization (WHO) COVID-19 database | -Mar. 2021 | Identification of research areas related to COVID-19 in which AI is applicable. In the AI applications found, there is a lack of integration of heterogeneous data. |
| i | ACM Digital Library, IEEE Xplore, PubMed, and Web of Science | -Mar. 2020 | Data mining and NLP analysis of health-related content on social networks offer valuable tools for monitoring public health and remotely predicting contagious diseases. |
| j | Scopus, Web of Science, and Google Scholar | -July 2021 | Deep learning methods are more effective than traditional ML in detecting COVID-19. Challenges: absence of standardized datasets, limited multilingual and multimodal information services. |
| k | IEEE Xplore, SpringerLink, ScienceDirect, Scopus, Taylor and Francis, Wiley, Google Scholar | 2018–2021 | Since the COVID-19 pandemic, research on classifying misinformation on social networks has grown. Methods such as Adaptive Neural-Based Fuzzy Inference Systems (ANFIS) and Deep Neural Networks have proven effective, with studies recommending the use of hybrid algorithms that combine both approaches. |
| l | PubMed, Medline and Institute of Electrical and Electronics Engineers Xplore | Dec. 2015 - Dec. 2020 | Since the COVID-19 pandemic, there has been a growing number of studies on misinformation classification, highlighting the role of social network data in enhancing public health monitoring and surveillance. |
| m | PubMed, Scopus, and Web of Science | Jan. 2020 - Feb. 2023 | Although negative perceptions of vaccination were more common, studies show an increasing trend in positive sentiment, particularly in content related to booster doses. |
| n | PubMed, Embase, Web of Science, and Scopus | -Dec. 2023 | Research shows very promising results using Large Language Models (LLMs) like GPT-4 and BERT for analyzing social media data and detecting conditions such as urinary tract infections and Lyme disease surveillance. Challenge: in-depth studies to apply LLMs in disease diagnosis, surveillance, and prognosis. |
While the systematic reviews published to date fall within the same broad area of interest as ours, several differences were observed, including the main focus of interest (for instance, sentiment analysis [65] or vaccine acceptance [73]) and the time period of the studies included in the analysis. In this context, our systematic review contributes to areas that have not previously been analyzed in full, such as the classification of studies according to the algorithms used in the mathematical models, either by the objective they pursue (classification, regression, or clustering) or by the complexity of the models in analyzing information from social networks. The systematic review of Gupta & Katarya [8] identified the ML techniques in its selected research but does not delve into the results obtained from them. In our research, we analyzed the algorithms according to the problem they solve and also analyzed the metrics used, comparing them across studies. Alamoodi et al. [65] classified the selected studies but do not provide information about the algorithms used per study; in addition, their research focused exclusively on sentiment analysis associated with diverse infectious diseases, not on potential use in epidemiological surveillance. More recent reviews pay particular attention to COVID-19 [65, 68, 71]. Of note, they generally do not address disease surveillance or forecasting, but assess other aspects (such as social challenges, misinformation, and vaccine acceptance).
In the present work, we used the methodology outlined in the PRISMA statement. As noted in Tables 19 and 21, this methodology is frequently used in health-related research. PRISMA was the methodology used in nine of the 14 related reviews, while other methodologies were less frequently reported.
Summary of evidence and perspectives
In our review, which was centered on studies that assess the correlation between publications on Twitter and the actual number or rate of ARI, we identified two areas of major importance in the conduct of these studies: 1) identification of relevant messages and extraction of information; and 2) comparison of social media information with disease data. Regarding data extraction methods, a Learning-based approach is the most frequently used methodology. While the categorization proposed by Dai et al. [14] provides precise divisions to differentiate studies corresponding to each category, the description of the methodology in some studies is unclear or ambiguous, and other studies mix characteristics of two different methodologies. Future research should therefore clearly state the methodological approach selected by the authors, to allow assessment of the advantages and drawbacks of each method. In addition, patterns of Twitter usage over time need to be considered in message analysis methods, including trends in language use, language contagion, amplification patterns, and misinformation content [75, 76]. Since these factors have been reported to vary across countries and languages, text analysis methods applicable across several cultures would be useful to avoid the need to tailor these procedures to specific settings.
Another noticeable aspect of our study is that temporal trends in the use of different text analysis methodologies do not allow us to clearly establish a preferred method, owing to the relatively short time frame (2011–2020) in which the studies were carried out and to the overall number of publications. However, it is of note that the Word Embedding-based approach has been used preferentially most recently; given its use of unsupervised models and non-manual classification (in training datasets), it is foreseeable that the preference for models of this category might increase in the coming years.
With respect to the countries analyzed per study, the largest number of studies was carried out in the USA (66%). As discussed in the limitations section, this might be explained in part by the inclusion of only studies written in English in our review, which excludes studies performed in countries with other official languages. In addition, the intent of Twitter use may vary across countries; therefore, the predictive ability of message data analysis needs to be assessed in diverse regions before its use can be generalized. In this regard, another important issue to address is the convenience of using surveillance data based on internationally agreed case definitions. This is not always an easy task: even when case definitions are similar, financial, social, and health system organization characteristics affect the implementation of surveillance activities.
Overall, most studies reported good results when data from Twitter was used in models to assess the occurrence of respiratory infection outbreaks. In general, analyses that focused on correlations showed better results than predictive models. Of note, most studies assessed several models, showing that even with the same data, different models can provide diverse results. This highlights the fact that comparing models between studies may be difficult and should be interpreted with caution. As such, studies that assess the performance of models in diverse regions and through different time frames would be valuable to determine whether the proposed approaches can be applied in different scenarios. This will be of particular interest, since most studies that were identified in this review were carried out in a limited geographical area and during a relatively short period of time.
Research challenges
In addition to the identified issues that merit attention in future research, there are some additional challenges for studies that use Twitter data for epidemiological surveillance, which can be divided in two categories: technical and ethical.
Technical challenges:
Tweet analysis presents challenges starting with data cleaning: misspelled words, idioms, emoticons, post context, identification of repeated tweets (retweets), and determining whether a post has been published by bots [45].
Because tweets may be written in a different language in each country, global studies and comparisons between countries or regions face a significant challenge.
In addition, Twitter’s recent changes in policies and costs of using its API may limit research using data derived from this social network, as massive downloading of tweets to carry out large-scale projects will represent high costs.
Restricted access to publications on social networks limits comparative research across platforms; this can preclude identifying the most suitable source of information to guide public health interventions.
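A minimal sketch of the cleaning step mentioned in the first technical challenge, using illustrative regular-expression patterns that are our own assumptions rather than a pipeline from any reviewed study:

```python
import re

def clean_tweet(text):
    """Minimal illustrative tweet normalization: strip retweet markers,
    URLs, user mentions, and simple ASCII emoticons, then normalize case
    and collapse whitespace."""
    text = re.sub(r"^RT\s+@\w+:\s*", "", text)  # retweet prefix "RT @user:"
    text = re.sub(r"https?://\S+", "", text)    # URLs
    text = re.sub(r"@\w+", "", text)            # user mentions
    text = re.sub(r"[:;=]-?[)(DP]", "", text)   # emoticons like :) ;-( =D
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("RT @user: Got the flu :( staying home https://t.co/xyz"))
# prints: got the flu staying home
```

Real pipelines also handle misspellings, idioms, non-ASCII emoji, duplicate detection, and bot filtering, each of which requires considerably more than a regular expression.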
Ethical challenges:
There is a debate about data privacy on social networks and the importance of anonymizing data for analysis. These policies may vary depending on the digital platform and the permissions that users have granted. In the case of Twitter (now X), users’ profiles and publications can be viewed without restriction; this is not the case for Facebook, which restricts access to publications according to user preferences.
The mechanism for obtaining data from social networks is another point to consider before starting a research project. Restricted access to social network data results in the need for alternative sources of information; these could be public repositories, other research, or techniques such as web scraping, which could generate ethical or legal conflicts. Therefore, researchers must create a safe, legal, and ethical strategy for extracting data from social networks.
Limitations
In the study carried out, several limitations were found, which are summarized below:
Only studies in English were selected, so studies published in countries with a different language could be left out of this analysis.
A list of keywords was used for the Boolean search in EDS over the abstracts and titles of the studies; while some relevant studies might have been missed, we included a large number of relevant terms to encompass a wide range of ARI. The words used to refer to the social network Twitter may vary across studies, so some relevant studies may not have been identified.
Only studies related to respiratory diseases were analyzed, and classification was based on the four categories proposed by Dai et al. [14]. Therefore, studies related to other health topics were not considered.
The COVID-19 pandemic began at a global scale in 2020; investigations after this year were not reviewed, although research in this area is expected to increase. However, systematic reviews published between 2021 and 2024 were included to address this limitation.
Studies that used data from social networks other than Twitter were not considered in this literature review.
Identifying all the models in each investigation is a complex task, because model results are not always mentioned explicitly; in some studies it was difficult to identify the architecture of the algorithms used, and it was not possible to classify them.
Comparison among the studies included in this systematic review is difficult, because the circumstances, parameters, and data of each research project were highly diverse, many in completely different contexts. Therefore, it is worth clarifying that the comparison exercise above aims only to provide a general idea of the metrics used and the results obtained with each model.
Conclusions
Unlike previous systematic reviews, this review focuses on studies relating Twitter posts to ARI. In addition, an analysis of the diversity and temporal trends in the use of different ML techniques had not been previously performed. The Learning-based Approach was identified as the most widely used methodology to understand the relationship between information provided by Twitter users and the actual number of infections. Within this category, the most used ML technique is SVM, but we expect that in the coming years models based on deep learning, such as LSTM and CNN, may become more popular. According to the objective of the algorithms, the most used category is classification, and its most used metric is precision.
Future research
During the development of this systematic review several research opportunities were identified:
Although we did not find evidence of an increase in the use of unsupervised algorithms, use of the Word embedding-based approach is likely to increase in future years due to the automation of classification and the training speedup of this kind of ML technique.
Future research involves developing clusters with more granular categories (topics) [32]; for example, it is possible to identify not only whether a tweet is related to COVID-19, but also to classify its relationship with this infectious disease.
Most of the algorithms found in this systematic review were related to variations of linear regressions. However, in recent years, ML techniques have made considerable advances in time series forecasting algorithms, such as Recurrent Neural Network (RNN) (DeepAR, developed by Amazon, is an example of this model) and Transformer Neural Networks (developed by Google).
Delving deeper into the classification of tweets, algorithms to identify the context of the tweets are needed, so context awareness is another important area for future research.
The identification of false positives (tweets that seem to be related to a disease when they are not) helps to filter out tweets unrelated to the analyzed disease and leave them out of the models’ training data [20], so this is another research opportunity.
Data should be anonymized even if user information is public or accessible, since research related to data from specific users, companies, or institutions may require their prior consent.
Finally, most of the studies found were carried out in a limited geographical area, which represents an opportunity for research with greater scope and representativeness at a global level; however, this involves several challenges.
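The word-embedding-based clustering direction discussed above can be sketched in a few lines. The toy two-dimensional "embeddings", the example tweets, and the deterministic k-means initialization below are invented purely for illustration; a real pipeline would use pretrained vectors (e.g., word2vec or LaBSE) and a library clustering implementation.

```python
import numpy as np

# Toy 2-D word "embeddings", hand-made for illustration only;
# a real system would use pretrained vectors such as word2vec.
EMB = {
    "flu":   np.array([1.0, 0.1]),
    "fever": np.array([0.9, 0.2]),
    "cough": np.array([0.8, 0.0]),
    "game":  np.array([0.0, 1.0]),
    "match": np.array([0.1, 0.9]),
}

def tweet_vector(tweet):
    """Represent a tweet as the mean of its known word embeddings."""
    vecs = [EMB[w] for w in tweet.lower().split() if w in EMB]
    return np.mean(vecs, axis=0)

def kmeans(X, k, iters=20):
    """Minimal k-means (first k rows as initial centers; assumes
    no cluster becomes empty during iteration)."""
    centers = X[:k].copy()
    for _ in range(iters):
        # Assign each row to its nearest center, then recompute centers.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

tweets = [
    "I have the flu and a fever",   # illness-related
    "bad cough and fever today",    # illness-related
    "great game last night",        # not illness-related
    "watching the match now",       # not illness-related
]
X = np.vstack([tweet_vector(t) for t in tweets])
labels = kmeans(X, k=2)
```

Illness-related and sports-related tweets land in different clusters because their averaged word vectors occupy different regions of the embedding space; with finer-grained embeddings, the same mechanism yields the more granular topic clusters discussed above.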
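Likewise, the linear-regression family that dominates the reviewed forecasting studies can be illustrated with a minimal autoregressive model fitted by least squares; this is a toy sketch on a synthetic series, not the model of any particular study. Deep forecasters such as DeepAR or Transformer networks replace this fixed linear map over the last p values with a learned nonlinear one.

```python
import numpy as np

def fit_ar(series, p=3):
    """Fit an AR(p) model y_t = c + sum_i w_i * y_{t-i} by least squares."""
    y = np.asarray(series, dtype=float)
    # Design matrix: each row holds the p values preceding the target.
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(X)), X])  # intercept term
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef

def forecast(series, coef, steps=4):
    """Roll the fitted AR model forward to produce future values."""
    p = len(coef) - 1
    hist = list(series)
    for _ in range(steps):
        lags = hist[-1:-p - 1:-1]  # most recent p values, newest first
        hist.append(coef[0] + float(np.dot(coef[1:], lags)))
    return hist[len(series):]

# Synthetic weekly counts with a linear trend (illustration only).
series = list(range(10))
coef = fit_ar(series, p=3)
fc = forecast(series, coef, steps=3)
print(fc)  # ≈ [10, 11, 12]
```

In the studies reviewed, the role of `series` is played by weekly ILI or ARI counts, often augmented with tweet-derived features as extra regressors.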
Electronic supplementary material
Below is the link to the electronic supplementary material.
Acknowledgements
J.M.R.V. would like to thank the UASLP CICTD library for the facilities to obtain the scientific literature.
Abbreviations
- AI
Artificial Intelligence
- ARGO
Autoregression with General Online Information
- ARI
Acute Respiratory Infections
- AUC
Area Under Curve
- BERT
Bidirectional Encoder Representations from Transformers
- BiLSTM-CRF
Bidirectional Long Short-Term Memory - Conditional Random Field
- CBOW
Continuous Bag-Of-Words
- CNN
Convolutional Neural Network
- COVID-19
Coronavirus Disease 2019
- CRFs
Conditional Random Fields
- EBSCO/EDS
EBSCO Discovery Service
- HFSTM
Hidden Flu-State from Tweet Model
- IAT-BPNN
Improved Artificial Tree - Back Propagation Neural Network
- ILI
Influenza and Influenza-Like Illness
- KNN
k-Nearest Neighbors
- LaBSE
Language-agnostic BERT Sentence Embeddings
- LASSO
Least Absolute Shrinkage and Selection Operator
- LDA
Latent Dirichlet Allocation
- LSTM
Long Short Term Memory
- MAE
Mean Absolute Error
- MAPE
Mean Absolute Percentage Error
- MERS
Middle East Respiratory Syndrome
- ML
Machine Learning
- MRR
Mean Reciprocal Rank
- MSE
Mean Squared Error
- NLP
Natural Language Processing
- PRISMA
Preferred Reporting Items for Systematic Reviews and Meta-Analyses
- RBF
Radial Basis Function
- RMSE
Root Mean Squared Error
- RMSPE
Root Mean Squared Percent Error
- RNN
Recurrent Neural Network
- ROC
Receiver Operating Characteristic
- SARS-CoV-2
Severe Acute Respiratory Syndrome Coronavirus 2 (the virus that causes COVID-19)
- SVM
Support Vector Machine
- SVR
Support Vector Regression
- UK
United Kingdom
- USA
United States of America
Author contributions
Conceptualization, J.M.R.V. and J.C.C.T.; data curation, J.M.R.V.; writing—original draft preparation, J.M.R.V., J.C.C.T. and D.N.; writing—review and editing, J.M.R.V., J.C.C.T. and D.N.; visualization, J.M.R.V. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data availability
No datasets were generated or analysed during the current study.
Declarations
Ethical approval
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Shapiro ED. Epidemiology of acute respiratory infections. In: Seminars in Pediatric Infectious Diseases. vol. 9. Elsevier; 1998. p. 31–36.
- 2.Wu Z, McGoogan JM. Characteristics of and important lessons from the coronavirus disease 2019 (covid-19) outbreak in china: summary of a report of 72 314 cases from the chinese center for disease control and prevention. Jama. 2020;323(13):1239–42. [DOI] [PubMed] [Google Scholar]
- 3.Wald ER, Schmit KM, Gusland DY. A pediatric infectious disease perspective on covid-19. Clin Infect Dis. 2021;72(9):1660–66. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Velasco E, Agheneza T, Denecke K, Kirchner G, Eckmanns T. Social media and internet-based data in global systems for public health surveillance: a systematic review. Milbank Q. 2014;92(1):7–33. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.World Health Organization, ROftWP. A guide to establishing event-based surveillance. Manila: WHO Regional Office for the Western Pacific; 2008.
- 6.Kemp S. Digital 2020 global digital overview. We are social and hootsuite; 2020. We are social and hootsuite. https://datareportal.com/reports/digital-2020-global-digital-overview. Accessed 27 Jun 2025.
- 7.Charles L, Reynolds T, Cameron M, Conway M, Lau E, Olsen J, et al. Using social media for actionable disease surveillance and outbreak management: a systematic literature review. PLoS One. 2015;10. 10.1371/journal.pone.0139701. [DOI] [PMC free article] [PubMed]
- 8.Gupta A, Katarya R. Social media based surveillance systems for healthcare using machine learning: a systematic review. J Biomed Inf. 2020;108:103500. 10.1016/j.jbi.2020.103500. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Sinnenberg L, Buttenheim A, Padrez K, Mancheno C, Ungar L, Merchant R. Twitter as a tool for health research: a systematic review. Am J Public Health. 2017;107:143–143. 10.2105/AJPH.2016.303512a. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Santangelo OE, Gentile V, Pizzo S, Giordano D, Cedrone F. Machine learning and prediction of infectious diseases: a systematic review. Mach Learn Knowl Extr. 2023;5(1):175–98. 10.3390/make5010013.
- 11.Moher D, Liberati A, Tetzlaff J, Altman DG, The PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097. 10.1371/journal.pmed.1000097.
- 12.Pereira M, Prahm C, Kolbenschlag J, Oliveira E, Rodrigues N. Application of ar and vr in hand rehabilitation: a systematic review. J Biomed Inf. 2020;111. 10.1016/j.jbi.2020.103584. [DOI] [PubMed]
- 13.Pacheco Lorenzo M, Valladares S, Anido-Rifón L, Fernández Iglesias MJ. Smart conversational agents for the detection of neuropsychiatric disorders: a systematic review. J Biomed Inf. 2020;113. 10.1016/j.jbi.2020.103632. [DOI] [PubMed]
- 14.Dai X, Bikdash M, Meyer B. From social media to public health surveillance: word embedding based clustering method for twitter classification. Southeast- Con 2017. 2017;1–7. 10.1109/SECON.2017.7925400.
- 15.Signorini A, Segre A, Polgreen PM. The use of twitter to track levels of disease activity and public concern in the u.s. during the influenza A H1N1 pandemic. PLoS One. 2011;6. [DOI] [PMC free article] [PubMed]
- 16.Culotta A. Lightweight methods to estimate influenza rates and alcohol sales volume from twitter messages. Language Resour Evaluation. 2012;47. 10.1007/s10579-012-9185-0.
- 17.Santos JC, Matos S. Analysing twitter and web queries for flu trend prediction. Theor Biol Med Model. 2014; 11:6. 10.1186/1742-4682-11-S1-S6. [DOI] [PMC free article] [PubMed]
- 18.Prieto V, Matos S, Alvarez M, Cacheda F, Oliveira J. Twitter: a good place to detect health conditions. PLoS One. 2014;9:86191. 10.1371/journal.pone.0086191. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Zhao L, Chen J, Tech V, Chen F, Wang W, Lu C-T. Simnest: social media nested epidemic simulation via online semi-supervised deep learning. In: Proceedings /IEEE International Conference on Data Mining. IEEE International Conference on Data Mining. vol. 2015. 2015. p. 639–48. 10.1109/ICDM.2015.39. [DOI] [PMC free article] [PubMed]
- 20.Zuccon G, Khanna S, Nguyen A, Boyle J, Hamlet M, Cameron M. Automatic detection of tweets reporting cases of influenza like illnesses in australia. Health Inf Sci Syst. 2015;3(4). 10.1186/2047-2501-3-S1-S4. [DOI] [PMC free article] [PubMed]
- 21.Santillana M, Nguyen A, Dredze M, Paul M, Brownstein J. Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS Comput Biol. 2015;11. 10.1371/journal.pcbi.1004513. [DOI] [PMC free article] [PubMed]
- 22.Dai X, Bikdash M. Hybrid classification for tweets related to infection with influenza. Conference Proceedings - IEEE SOUTHEASTCON. 2015;2015. 10.1109/SECON.2015.7133015.
- 23.Simmie D, Thapen N, Hankin C, Gillard J. Defender: detecting and forecasting epidemics using novel data-analytics for enhanced response. PLoS One. 2015;11. 10.1371/journal.pone.0155417. [DOI] [PMC free article] [PubMed]
- 24.Hartley D, Giannini C, Wilson S, Frieder O, Margolis P, Kotagal U, et al. Coughing, sneezing, and aching online: Twitter and the volume of influenza-like illness in a pediatric hospital. PLoS One. 2017;12:0182008. 10.1371/journal.pone.0182008. [DOI] [PMC free article] [PubMed]
- 25.Zhao L, Sun Q, Ye J, Chen F, Lu C-T, Ramakrishnan N. Feature constrained multi-task learning models for spatiotemporal event forecasting. IEEE Transactions on Knowledge and Data Engineering. 2017; p. 1–1. 10.1109/TKDE.2017.2657624.
- 26.Volkova S, Ayton E, Porterfield K, Corley C. Forecasting influenza-like illness dynamics for military populations using neural networks and social media. PLoS One. 2017;12:0188941. 10.1371/journal.pone.0188941. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Jain VK, Kumar S. Rough set based intelligent approach for identification of h1n1 suspect using social media. Kuwait J Sci. 2018;45:8–14. [Google Scholar]
- 28.Rudra K, Sharma A, Ganguly N, Imran M. Classifying and summarizing information from microblogs during epidemics. Inf Syst Front. 2018;20. 10.1007/s10796-018-9844-9. [DOI] [PMC free article] [PubMed]
- 29.Hu H, Wang H, Wang F, Langley D, Avram A, Liu M. Prediction of influenza-like illness based on the improved artificial tree algorithm and artificial neural network. Sci Rep. 2018;8. 10.1038/s41598-018-23075-1. [DOI] [PMC free article] [PubMed]
- 30.Wakamiya S, Kawai Y, Aramaki E. Twitter-based influenza detection after flu peak via tweets with indirect information: text mining study. JMIR Public Health Surveill. 2018;4(3):65. 10.2196/publichealth.8627. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Molaei S, Khansari M, Veisi H, Salehi M. Predicting the spread of influenza epidemics by analyzing twitter messages. Health Technol. 2019;9. 10.1007/s12553-019-00309-4.
- 32.Alessa A, Faezipour M. Preliminary flu outbreak prediction using twitter posts classification and linear regression with historical centers for disease control and prevention reports: prediction framework study. JMIR Public Health Surveill. 2019;5(2):12383. 10.2196/12383. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Xue H, Bai Y, Hu H, Ldfs H. Regional level influenza study based on twitter and machine learning method. PLoS One. 2019;14:0215600. 10.1371/journal.pone.0215600. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Fu X, Jiang X, Qi Y, Xu M, Song Y, Zhang J, et al. An event-centric prediction system for covid-19. In: 2020 IEEE International Conference on Knowledge Graph (ICKG). 2020. p. 195–202. 10.1109/ICBK50248.2020.00037.
- 35.Chang C-C, Lin C-J. Libsvm: A library for support vector machines. ACM Trans Intell Syst Technol. 2007;2.
- 36.Velardi P, Stilo G, Tozzi A, Gesualdo F. Twitter mining for fine-grained syndromic surveillance. Artif Intel Med. 2014;61. 10.1016/j.artmed.2014.01.002. [DOI] [PubMed]
- 37.Abraham M, Nabende P, Mwebaze E. Ontology driven machine learning approach for disease name extraction from twitter messages. In: 2nd IEEE International Conference on Computational Intelligence and Applications (ICCIA). 2017. p. 68–73. 10.1109/CIAPP.2017.8167182.
- 38.Chen L, Tozammel Hossain KSM, Butler P, Ramakrishnan N, Prakash BA. Flu gone viral: syndromic surveillance of flu on twitter using temporal topic models. 2014 IEEE International Conference on Data Mining. 2014;755–60. 10.1109/ICDM.2014.137.
- 39.Paul MJ, Dredze M. Discovering health topics in social media using topic models. PLoS One. 2014;9:1–11. 10.1371/journal.pone.0103408. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Chen L, Hossain KSMT, Butler P, Ramakrishnan N, Prakash B. Syndromic surveillance of flu on twitter using weakly supervised temporal topic models. Data Min Knowl Discov. 2015;30. 10.1007/s10618-015-0434-x.
- 41.Kuang S, Davison B. Learning word embeddings with chi-square weights for healthcare tweet classification. Appl Sci. 2017;7:846. 10.3390/app7080846.
- 42.Batbaatar E, Ryu K. Ontology-based healthcare named entity recognition from twitter messages using a recurrent neural network approach. Int J Environ Res Public Health. 2019;16:3628. 10.3390/ijerph16193628. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Gencoglu O. Large-scale, language-agnostic discourse classification of tweets during covid-19. Mach Learn Knowl Extr. 2020;2:603–16. 10.3390/make2040032. [Google Scholar]
- 44.Mackey T, Purushothaman VL, Li J, Shah N, Nali M, Bardier C, et al. Machine learning to detect self-reporting of symptoms, testing access, and recovery associated with covid-19 on twitter: retrospective big data infoveillance study. JMIR Public Health Surveillance. 2020;6. 10.2196/19509. [DOI] [PMC free article] [PubMed]
- 45.Chen C, Zhou L, Song Y, Xu Q, Wang P, Wang K, et al. Comparison of viral covid-19 sina weibo and twitter contents: a novel feature extraction and analytical workflow. J Med Internet Res. 2021;23(1):24889. 10.2196/24889. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Mikolov T, Chen K, Corrado GS, Dean J. Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR. 2013;2013.
- 47.Kuang S, Davison B. Learning word embeddings with chi-square weights for healthcare tweet classification. Appl Sci. 2017;7(8):846. 10.3390/app7080846.
- 48.Achrekar H, Gandhe A, Lazarus R, Yu S-H, Liu B. Predicting flu trends using twitter data. In: 2011 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2011. 2011. p. 702–07. 10.1109/INFCOMW.2011.5928903.
- 49.Hirose H, Wang L. Prediction of infectious disease spread using twitter: a case of influenza. 2012 Fifth International Symposium on Parallel Architectures, Algorithms and Programming. 2012. p. 100–05.
- 50.Kim E-K, Seok J, Oh J, Lee H, Kim KH. Use of hangeul twitter to track and predict human influenza infection. PLoS One. 2013;8:69305. 10.1371/journal.pone.0069305. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Nagar R, Yuan Q, Freifeld C, Santillana M, Nojima A, Chunara R, et al. A case study of the new york city 2012–2013 influenza season with daily geocoded twitter data from temporal and spatiotemporal perspectives. J Med Internet Res. 2014;16:236. 10.2196/jmir.3416. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Talvis K, Chorianopoulos K, Kermanidis K. Real-time monitoring of flu epidemics through linguistic and statistical analysis of twitter. In: Proceedings - 9th International Workshop on Semantic and Social Media Adaptation and Personalization, SMAP 2014. 2014. p. 83–87. 10.1109/SMAP.2014.38.
- 53.Aslam A, Tsou M-H, Spitzberg B, An L, Gawron J, Gupta D, et al. The reliability of tweets as a supplementary method of seasonal influenza surveillance. J Med Internet Res. 2014;16:250. 10.2196/jmir.3532. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Paul M, Dredze M, Broniatowski D. Twitter improves influenza forecasting. PLoS Curr. 2014;6. 10.1371/currents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117. [DOI] [PMC free article] [PubMed]
- 55.Janies D, Witter Z, Gibson C, Kraft T, Senturk I, Catalyurek U. Syndromic surveillance of infectious diseases meets molecular epidemiology in a workflow and phylogeographic application. Stud Health Technol Inf. 2015;216:766–70. [PubMed] [Google Scholar]
- 56.Broniatowski DA, Dredze M, Paul MJ, Dugas A. Using social media to perform local influenza surveillance in an inner-city hospital: a retrospective observational study. JMIR Public Health Surveill. 2015;1(1):5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Lu F, Hou S, Baltrusaitis K, Shah M, Leskovec J, Sosic R, et al. Accurate influenza monitoring and forecasting using novel internet data streams: a case study in the boston metropolis. JMIR Public Health Surveillance. 2018;4(4). 10.2196/publichealth.8950. [DOI] [PMC free article] [PubMed]
- 58.Alkouz B, Al Aghbari Z. Analysis and prediction of influenza in the uae based on arabic tweets. In: 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA). 2018. p. 61–66. 10.1109/ICBDA.2018.8367652.
- 59.Khan Y, Leung G, Bélanger P, Gournis E, Buckeridge D, Liu L, et al. Comparing twitter data to routine data sources in public health surveillance for the 2015 pan/parapan american games: an ecological study. Can J Public Health. 2018;109. 10.17269/s41997-018-0059-0. [DOI] [PMC free article] [PubMed]
- 60.Comito C, Forestiero A, Pizzuti C. Improving influenza forecasting with web-based social data. In: 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). 2018. p. 963–70. 10.1109/ASONAM.2018.8508563.
- 61.Wagner M, Lampos V, Cox I, Pebody R. The added value of online user-generated content in traditional methods for influenza surveillance. Sci Rep. 2018;8. 10.1038/s41598-018-32029-6. [DOI] [PMC free article] [PubMed]
- 62.O’Leary D, Storey V. A google–wikipedia–twitter model as a leading indicator of the numbers of coronavirus deaths. Intelligent Systems in Accounting. Finance Manag. 2020;27:151–58. 10.1002/isaf.1482. [Google Scholar]
- 63.Golinelli D, Boetto E, Carullo G, Nuzzolese A, Landini M, Mariapia F. Adoption of digital technologies in health care during the covid-19 pandemic: systematic review of early scientific literature. J Med Internet Res. 2020;22. 10.2196/22280. [DOI] [PMC free article] [PubMed]
- 64.Corsi A, Souza F, Pagani R, Kovaleski J. Big data analytics as a tool for fighting pandemics: a systematic review of literature. J Ambient Intel Humanized Comput. 2020. 10.1007/s12652-020-02617-4. [DOI] [PMC free article] [PubMed]
- 65.Alamoodi A, Albahri OS, Albahri AS, Bahaa B, Malik R, Zaidan A, et al. Sentiment analysis and its applications in fighting covid-19 and infectious diseases: a systematic review. Expert Syst Appl. 2020;167. 10.1016/j.eswa.2020.114155. [DOI] [PMC free article] [PubMed]
- 66.Guo Y, Zhang Y, Lyu T, Prosperi M, Wang F, Qi W, et al. The application of artificial intelligence and data integration in covid-19 studies: a scoping review. J the Am Med Inf Assoc. 2021;28. 10.1093/jamia/ocab098. [DOI] [PMC free article] [PubMed]
- 67.Pilipiec P, Samsten I, Bóta A. Surveillance of communicable diseases using social media: a systematic review. PLoS One. 2023;18:0282101. 10.1371/journal.pone.0282101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Sanaullah A, Das A, Das A, Kabir A, Shu K. Applications of machine learning for covid-19 misinformation: a systematic review. Soc Network Anal Min. 2022;12(1):1–34. 10.1007/s13278-022-00921-9. Publisher Copyright: © 2022, The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature. [DOI] [PMC free article] [PubMed]
- 69.Kitchenham B, Charters S. Guidelines for performing systematic literature reviews in software engineering. 2007;2.
- 70.Kitchenham B, Pretorius R, Budgen D, Brereton P, Turner M, Niazi M, et al. Systematic literature reviews in software engineering – a tertiary study. Inf Softw Technol. 2010;52:792–805. 10.1016/j.infsof.2010.03.006. [Google Scholar]
- 71.Ravichandran BD, Keikhosrokiani P. Classification of covid-19 misinformation on social media based on neuro-fuzzy and neural network: a systematic review. Neural Comput Appl. 2022;35:1–19. 10.1007/s00521-022-07797-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Gunasekeran D, Chew A, Chandrasekar E, Rajendram P, Kandarpa V, Rajendram M, et al. The impact and applications of social media platforms for public health responses before and during the covid-19 pandemic: systematic literature review (preprint). 2021. 10.2196/preprints.33680. [DOI] [PMC free article] [PubMed]
- 73.Mckinley C, Limbu Y. Promoter or barrier? assessing how social media predicts covid-19 vaccine acceptance and hesitancy: a systematic review of primary series and booster vaccine investigations. Soc Sci Med. 2024;340:116378. 10.1016/j.socscimed.2023.116378. [DOI] [PubMed] [Google Scholar]
- 74.Omar M, Brin D, Glicksberg B, Klang E. Utilizing natural language processing and large language models in the diagnosis and prediction of infectious diseases: a systematic review. 2024. 10.1101/2024.01.14.24301289. [DOI] [PubMed]
- 75.Alshaabi T, Dewhurst DR, Minot JR, Arnold MV, Adams JL, Danforth CM, et al. The growing amplification of social media: measuring temporal and social contagion dynamics for over 150 languages on twitter for 2009–2020. EPJ Data Sci. 2021;10(1):15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Hang CN, Yu P-D, Chen S, Tan CW, Chen G. Mega: Machine learning-enhanced graph analytics for infodemic risk management. IEEE J Biomed Health Inf. 2023. [DOI] [PubMed]