Summary
Non-pharmacological interventions (NPIs) are important for controlling infectious diseases such as COVID-19, but their implementation is currently monitored in an ad hoc manner. To address this issue, we present a three-stage machine learning framework called EpiTopics to facilitate the surveillance of NPIs. In this protocol, we outline the use of transfer learning to address the limited number of NPI-labeled documents and topic modeling to support interpretation of the results.
For complete details on the use and execution of this protocol, please refer to Wen et al. (2022).
Subject areas: Health Sciences, Computer sciences
Highlights
• Automated prediction of public health interventions from COVID-19 news reports
• Inferring 42 country-specific temporal topic trends to monitor interventions
• Learning interpretable topics that predict interventions from news reports
• Transfer learning to predict interventions for each country on a weekly basis
Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.
Before you begin
Protocol overview
This protocol will guide you through a series of steps to develop a machine learning model called EpiTopics. The method was developed to enable automatic detection of changes in the status of non-pharmacological interventions (NPIs) for COVID-19 from news reports. The method can be divided into three stages. At stage 1, EpiTopics learns country-dependent topics from a large number of COVID-19 news reports that do not have NPI labels (i.e., the AYLIEN news dataset in Wen et al. (2022)). At stage 2, EpiTopics learns accurate connections between these topics and changes in NPI status from a set of labeled news reports (i.e., the WHO news dataset in Wen et al. (2022)). At stage 3, EpiTopics learns to predict country-dependent NPI changes by combining the knowledge learned from the previous two stages.
Acquiring datasets
Timing: 1 h
1. Download the WHO dataset from https://www.who.int/emergencies/diseases/novel-coronavirus-2019/phsm.
Optional: To replace the WHO dataset with other datasets of the user’s choice, please ensure that the dataset of interest includes, for each sample, the text, the text’s source location, the text’s publication time, and the NPIs associated with the text.
Note: In addition, it is preferable to use datasets whose location and time coverage overlaps significantly with the AYLIEN dataset, since only samples with overlapping locations and times can directly benefit from the topics learned during pre-training on the AYLIEN dataset.
2. Request access to the AYLIEN dataset at https://aylien.com/resources/datasets/coronavirus-dataset.
Optional: To replace the AYLIEN dataset with other datasets of the user’s choice, please ensure that the dataset of interest includes, for each sample, the text, the text’s source location, and the text’s publication time.
Note: In addition, as this dataset is used for pre-training, it is generally preferable to use large datasets, for instance those with more than 1 million training documents. It is also preferable to use datasets whose location and time coverage overlaps significantly with the WHO dataset (i.e., the NPI-labeled documents); a sketch for checking this overlap follows below.
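When substituting custom datasets, this overlap can be checked directly from the datasets' metadata. Below is a minimal sketch; the file names and the 'country' and 'date' columns are hypothetical, so adjust them to your data.

```python
# Sketch: check location and time overlap between a custom labeled dataset
# and the pretraining corpus. File names and the 'country'/'date' columns
# are hypothetical; adjust them to your data.
import pandas as pd

aylien = pd.read_csv("aylien_metadata.csv", parse_dates=["date"])
who = pd.read_csv("who_metadata.csv", parse_dates=["date"])

# Fraction of labeled samples whose country also appears in the pretraining corpus.
shared = set(aylien["country"])
country_overlap = who["country"].isin(shared).mean()

# Same check at weekly resolution, matching the protocol's weekly time stamps.
aylien_weeks = set(aylien["date"].dt.to_period("W"))
week_overlap = who["date"].dt.to_period("W").isin(aylien_weeks).mean()

print(f"Country overlap: {country_overlap:.1%}; weekly time overlap: {week_overlap:.1%}")
```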
Software installation
Timing: 2–4 h
3. Clone the code repository https://github.com/li-lab-mcgill/covid-npi.
4. Install packages according to the requirements file.
CRITICAL: Requests to access the AYLIEN dataset may take several days to process. We strongly recommend using a graphics processing unit (GPU): although not required, GPUs greatly expedite training on a large corpus compared with CPUs. It is also desirable to set up a virtual environment for this experiment.
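Once the requirements are installed, a quick sanity check that PyTorch can see a GPU can look like the following sketch (the requirements file remains the authoritative package list):

```python
# Quick sanity check that the installed PyTorch build detects a GPU.
import torch

if torch.cuda.is_available():
    print(f"{torch.cuda.device_count()} GPU(s) available, e.g., {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; training will fall back to the CPU and run considerably slower.")
```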
Step-by-step method details
Data preprocessing
Timing: 10 min
This section describes: 1) the removal of white spaces, special characters, and non-English words; 2) the removal of stop words as in Dieng et al. (2020); 3) the extraction of the relevant information from the AYLIEN and WHO datasets; and 4) the removal from the WHO dataset of documents whose country or source is not observed in the AYLIEN data. A minimal sketch of the cleaning steps follows, with the full steps listed below it.
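The cleaning itself is performed by run_data_process.sh; the sketch below only illustrates the flavor of steps 1) and 2) on a single document, assuming a stop words file with one word per line (the file name "stops.txt" is hypothetical; use the stop words file referenced in the scripts).

```python
# Illustrative sketch of the cleaning in steps 1) and 2) on a single document.
# The full preprocessing (including non-English word removal, vocabulary
# construction, and bag-of-words encoding) is performed by run_data_process.sh.
import re

def clean_document(text, stop_words):
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # strip special characters and digits
    tokens = text.lower().split()             # splitting also collapses extra whitespace
    return [t for t in tokens if t not in stop_words]

# Assumes a stop words file with one word per line (as in Dieng et al., 2020).
with open("stops.txt") as f:
    stop_words = set(f.read().split())

print(clean_document("Schools re-opened on 2021-03-01!", stop_words))
```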
1. Preprocess AYLIEN data.
 a. Modify the script run_data_process.sh to include the correct paths to the AYLIEN dataset, stop words file, and country NPIs file.
 b. Set 'aylien_flag' to 1 and 'label_harm' to 1.
 c. Execute 'run_data_process.sh' from the command line.
2. Preprocess WHO data.
 a. Modify the script run_data_process.sh to include the correct paths to the WHO dataset, stop words file, and country NPIs file.
 b. Set 'label_harm' to 1.
 c. Execute 'run_data_process.sh' from the command line.
3. The program will store the processed data (e.g., bag-of-words) in the output directory specified by save_dir.
 a. Note that this should also be the input directory for running MixMedia (Li et al., 2020) (see below).
 b. More specifically, check that the output directory contains the following (a verification sketch follows these steps):
  i. A text file that contains the mappings between labels and their IDs.
  ii. A text file that contains countries and their assigned IDs.
  iii. A file that contains time stamps and their IDs.
  iv. 43 pickle (.pkl) files that mainly contain the pickled vocabulary, embeddings, and bag-of-words representations of tokens.
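The following is a minimal sketch for verifying these outputs, where save_dir stands for the directory passed to the preprocessing script; exact file names may differ across repository versions.

```python
# Sketch: verify that preprocessing produced the expected files in save_dir.
# Exact file names may differ across repository versions.
import os
import pickle

save_dir = "path/to/save_dir"  # the save_dir used in run_data_process.sh
files = os.listdir(save_dir)

txt_files = [f for f in files if f.endswith(".txt")]
pkl_files = [f for f in files if f.endswith(".pkl")]
print(f"{len(txt_files)} text files (label, country, and time-stamp ID mappings)")
print(f"{len(pkl_files)} pickle files (43 expected)")

# Spot-check that one pickle file loads correctly.
with open(os.path.join(save_dir, pkl_files[0]), "rb") as f:
    print(type(pickle.load(f)))
```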
Running MixMedia
Timing: 5 h
This stage pretrains the MixMedia (Li et al., 2020) framework on the larger AYLIEN dataset as part of our transfer learning scheme.
4. Modify the script run_MixMedia.sh.
 a. Set K = 25 (the number of desired topics).
 b. Set cuda = {the indices of the GPUs that are available to you}.
 c. Set dataset = "AYLIEN".
 d. Set datadir = path to your AYLIEN files.
 e. Set outdir = path to the output directory of your choice.
 f. Set wemb = path to the output directory of your choice; this contains the embeddings that are needed for stage 3.
 g. Set mode = "train".
 h. Set batch_size = "128".
 i. Set lr = "1e-3".
 j. Set epochs = "400".
 k. Set min_df = "10".
 l. Set train_embeddings = "1".
 m. Set eval_batch_size = "128".
 n. Set time_prior = "1".
 o. Set source_prior = "1".
 p. Execute './run_MixMedia.sh' from the command line.
5. The program will save the outputs to a folder under save_path: save_path/{timestamp}.
 a. The timestamp records the time this script starts to run, and is in the format of {month}-{day}-{hour}-{minute}.
 b. The program saves the trained model.
 c. The program saves the learned topics (e.g., the topic embedding α, the word embedding ρ, the LSTM weights for the topic prior η, etc.); a sketch for inspecting the learned topics appears at the end of this section.
6. Monitor the training progress with TensorBoard or Weights & Biases by setting "logger".
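To qualitatively inspect the learned topics, the topic-word distributions can be recovered from the saved embeddings. The sketch below assumes, following the embedded topic model formulation (Dieng et al., 2020), that α is a K × L topic embedding matrix and ρ a V × L word embedding matrix saved as PyTorch tensors; the file names, tensor names, and shapes in your checkpoint may differ.

```python
# Sketch: list the top words per topic from the saved embeddings.
# Assumes alpha (K x L topic embeddings) and rho (V x L word embeddings)
# per the embedded topic model formulation (Dieng et al., 2020); file and
# tensor names here are hypothetical and may differ in your checkpoint.
import pickle
import torch

alpha = torch.load("path/to/alpha.pt")   # topic embeddings, K x L
rho = torch.load("path/to/rho.pt")       # word embeddings, V x L
with open("path/to/vocab.pkl", "rb") as f:
    vocab = pickle.load(f)               # list of V words

beta = torch.softmax(alpha @ rho.T, dim=-1)   # K x V topic-word distributions
top_words = beta.topk(10, dim=-1).indices     # top 10 word indices per topic

for k, idx in enumerate(top_words):
    print(f"Topic {k}: " + ", ".join(vocab[int(i)] for i in idx))
```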
Transfer learning for NPI prediction
Timing: 1 h
After MixMedia is trained on AYLIEN, we can use the learned topics for NPI prediction via transfer learning. This consists of three consecutive stages: inferring the WHO documents' topic mixtures, training a classifier on document-NPI prediction, and transferring the classifier to country-NPI prediction. A minimal sketch of the document-level classifier is shown below.
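Conceptually, the stage-2 classifier is a linear map from a document's K-dimensional topic mixture to multi-label NPI predictions. The sketch below illustrates this idea with hypothetical shapes and random stand-in data; the actual training is performed via classify_npi.sh in step 8.

```python
# Conceptual sketch of the stage-2 document-NPI classifier: a linear layer
# from K-dimensional topic mixtures to multi-label NPI logits. Shapes and
# data are hypothetical stand-ins; classify_npi.sh runs the real training.
import torch
import torch.nn as nn

K, num_npis, n_docs = 25, 15, 1000
theta = torch.rand(n_docs, K)                              # inferred topic mixtures
labels = torch.randint(0, 2, (n_docs, num_npis)).float()   # binary NPI labels

clf = nn.Linear(K, num_npis)
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()                           # multi-label objective

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(clf(theta), labels)
    loss.backward()
    optimizer.step()
```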
7. Infer WHO documents' topic mixtures.
 a. Populate the 'save_dir', 'data_dir', and 'model_dir' entries of the 'infer_theta.sh' file according to the instructions within the file.
 b. The program saves the output to a folder under save_dir: save_dir/{timestamp}, where the timestamp records the time this script starts to run, and is in the format of {month}-{day}-{hour}-{minute}.
 c. The program also saves the document topic mixtures theta.
8. Train a classifier on document-NPI prediction.
 a. Within classify_npi.sh, set 'mode' to doc, zero_shot, finetune, or from-scratch based on the type of result that currently needs to be reproduced.
  i. For document-level NPI prediction, set mode to "doc" and provide who_label_dir and theta_dir.
  ii. For zero-shot transfer, set mode to zero_shot and provide cnpi_dir and ckpt_dir.
  iii. For fine-tuning, set mode to finetune and provide cnpi_dir and ckpt_dir.
 b. Set eta_dir to the directory where you saved your outputs in step 5.
 c. Specify save_ckpt. When set, the program saves the results reported in Wen et al. (2022) to a subfolder under save_dir: save_dir/mode/{timestamp}.
  i. The timestamp records the time this script starts to run, and is in the format of {month}-{day}-{hour}-{minute}.
  ii. For each random seed, the program saves a trained linear classifier and the corresponding test predictions, with suffixes in the filenames that specify the seed.
  iii. The program also saves the aggregated results in AUPRC into a JSON file (a sketch of this aggregation follows these steps).
 d. Repeat the above steps for the other modes.
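For reference, the macro and weighted AUPRC summaries can be reproduced from saved test predictions with scikit-learn. Below is a sketch, assuming y_true (binary labels) and y_score (predicted probabilities) are arrays of shape (documents × NPIs); the random arrays are stand-ins for real predictions.

```python
# Sketch: compute macro and prevalence-weighted AUPRC across NPI labels.
# y_true (binary) and y_score (probabilities) have shape (n_docs, n_npis);
# the random arrays below are stand-ins for saved test predictions.
import numpy as np
from sklearn.metrics import average_precision_score

def auprc_summary(y_true, y_score):
    return {
        "macro AUPRC": average_precision_score(y_true, y_score, average="macro"),
        "weighted AUPRC": average_precision_score(y_true, y_score, average="weighted"),
    }

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(200, 15))   # hypothetical binary NPI labels
y_score = rng.random(size=(200, 15))          # hypothetical predicted probabilities
print(auprc_summary(y_true, y_score))
```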
Expected outcomes
The above commands will result in the following outcomes, corresponding to Figure 1 and Tables 1, 2, and 3 in Wen et al. (2022):
Figure 1: Learned topics and the top words under each topic. The sizes of the words are proportional to their topic probabilities. The background colors indicate the themes we assigned to the topics.
Table 1: Area under the precision-recall curve (AUPRC) scores for document-level NPI prediction. The AUPRC scores are computed on individual NPIs and then averaged without weighting (macro AUPRC) or weighted by the NPIs' prevalence (weighted AUPRC). Both BOW+linear and BOW+feed-forward use the normalized word vector (i.e., bag of words, or BOW) of each document to predict NPI labels. Each method is repeated 100 times with different random seeds. Values in brackets are standard deviations over the 100 experiments.
Table 2: Area under the precision-recall curve (AUPRC) scores for country-level NPI prediction. Random baselines are each repeated 1000 times with different random seeds, and the rest are each repeated 100 times with different random seeds. Values in brackets are standard deviations over the repeated experiments.
Table 3: AUPRC scores for country-level NPI prediction from topics at the document and country level. Random baselines are each repeated 1000 times with different random seeds, and the rest are each repeated 100 times with different random seeds. Values in brackets are standard deviations.
All of the above will be saved to a subfolder under save_dir: save_dir/mode/{timestamp}.
Quantification and statistical analysis
Expected outcomes in this protocol are stochastic in nature, owing to hardware, model initialization, and other factors. Uncertainty in the reported results is controlled and measured through repeated runs with different random seeds. For models involving training (i.e., all except the random baselines), the results are based on 100 runs, while results for the random baselines are based on 1000 runs. To reduce the uncertainty, the user can choose to increase the number of runs, subject to computational cost (a sketch of this repetition scheme follows below).
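A sketch of this repetition scheme, where run_experiment is a hypothetical stand-in for one training-and-evaluation cycle of the protocol:

```python
# Sketch: repeat an experiment under different random seeds and report the
# mean and standard deviation. run_experiment is a hypothetical stand-in
# for one training-and-evaluation cycle of the protocol.
import numpy as np
import torch

def run_experiment(seed):
    torch.manual_seed(seed)
    np.random.seed(seed)
    # ... train and evaluate one model here ...
    return float(torch.rand(1))  # placeholder for the test AUPRC

scores = [run_experiment(seed) for seed in range(100)]
print(f"AUPRC: {np.mean(scores):.3f} (sd {np.std(scores):.3f})")
```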
Additionally, because of the large size of the AYLIEN dataset, the topic model is trained once, and therefore one set of learned topics is used throughout all subsequent experiments. The user can explore training multiple versions of the topic model with different random seeds to obtain multiple sets of topics. The user can then study the variation, or consistency, of the learned topics across runs and explore how variations in the learned topics impact NPI predictions.
Limitations
To begin, using different library versions may have an impact on the results: the program might run into errors, or the results might not be exactly reproducible. Please follow the requirements file as closely as possible. Also, the datasets used in this protocol, i.e., AYLIEN and WHO, may be updated or removed after the time they were accessed for this protocol. This may lead to differences in the results, or prevent some results from being reproduced. In addition, this protocol assumes the user has direct access to the computational infrastructure, rather than access mediated by a centralized job scheduler such as SLURM. Only minimal changes are needed to use the protocol in such scenarios; for example, the user can run the protocol in an interactive session. Please refer to the instructions of the specific computing system on how to adapt the protocol. Finally, the type of computational resources has an impact on the results: for example, the batch sizes and the model sizes entail a certain amount of memory, and the availability of GPUs affects the amount of time needed for training.
Troubleshooting
Problem 1
Incompatible library versions (before you begin - software installation).
Potential solution
It is best to install libraries in a virtual environment specifically created for this protocol. For instructions on managing virtual environments, please refer to https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html.
Follow the library versions in the requirements file as closely as possible.
Problem 2
The model cannot be trained with the provided hyperparameters due to GPU memory limits (any step).
Potential solution
If the user does not have enough GPU memory, the user can reduce the batch size. While doing so, in order to approximate the protocol as closely as possible, the user can accumulate gradients (i.e., calculate gradients without updating the model) across multiple batches to maintain the same effective batch size. For example, the user can use a batch size of 64 and accumulate gradients over 2 consecutive batches to approximate an effective batch size of 128, as shown in the sketch below.
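A sketch of gradient accumulation in PyTorch; the model, data, and optimizer below are random stand-ins for the protocol's actual training loop:

```python
# Sketch: gradient accumulation to emulate a larger effective batch size.
# The model, data, and optimizer are random stand-ins for the actual loop.
import torch
import torch.nn as nn

model = nn.Linear(25, 15)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
loader = [(torch.rand(64, 25), torch.randint(0, 2, (64, 15)).float())
          for _ in range(10)]

accum_steps = 2  # batch size 64 x 2 steps approximates an effective batch size of 128
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the summed gradient averages correctly
    loss.backward()                            # gradients accumulate until optimizer.step()
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # update once per accum_steps batches
        optimizer.zero_grad()
```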
If the user needs to further reduce GPU memory usage, the user can reduce the model’s size, for example the numbers of layers or the hidden dimensions. Doing so would likely have a negative impact on performance.
Problem 3
AYLIEN or WHO data is updated or removed (before you begin - acquiring datasets).
Potential solution
If the AYLIEN or WHO dataset is updated to include more data, the user can retrieve the same version as in this protocol by filtering according to Wen et al. (2022). The user can also choose to use the newer version instead and obtain a model trained on a wider coverage.
If the dataset is removed, or the user wishes to obtain the exact same version as in this protocol for any other reason, the user can reach out to the authors.
Problem 4
Data files or intermediate result files are not found or are incompatible (any step).
Potential solution
Check the paths given to the script and make sure that the files exist and they match the script’s configuration.
Problem 5
Training progress cannot be correctly logged (running MixMedia - step 4).
Potential solution
If the user is using TensorBoard as the logger, please follow the official PyTorch documentation on using TensorBoard with PyTorch.
If the user is using Weights & Biases for logging, by default it requires an internet connection. For logging locally, or for other functionalities, please refer to the Weights & Biases documentation.
Problem 6
When using other custom datasets as alternatives to the WHO dataset, the NPI labels have an imbalanced distribution, resulting in poor performance on minority classes (transfer learning for NPI prediction – step 7).
Potential solution
To mitigate the issue of data imbalance and improve performance on minority classes, the user can apply several techniques. For instance, the model can be more heavily regularized via weight decay. Also, the user can assign different weights to different classes such that the loss incurred on minority classes is amplified (a sketch of class weighting follows below).
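A sketch of the class-weighting approach, using the pos_weight argument of PyTorch's BCEWithLogitsLoss with a hypothetical binary label matrix; the per-class weight (#negatives / #positives) amplifies the loss on rare NPIs:

```python
# Sketch: amplify the loss on minority NPI classes via pos_weight in
# BCEWithLogitsLoss. labels is a hypothetical binary matrix (n_docs x n_npis).
import torch
import torch.nn as nn

labels = torch.randint(0, 2, (1000, 15)).float()

pos = labels.sum(dim=0).clamp(min=1)            # positives per NPI class
pos_weight = (labels.shape[0] - pos) / pos      # (#negatives / #positives) per class
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn_like(labels)               # stand-in for classifier outputs
print(loss_fn(logits, labels))
```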
Problem 7
When using other custom datasets as alternatives to the AYLIEN dataset for learning topics, the optimal number of topics changes (running MixMedia – step 4).
Potential solution
The optimal number of topics is usually specific to the dataset on which the model is trained, and the user is therefore advised to search for that number on new datasets. For example, the user can search from 5 to 100 topics in steps of 20, and then search within the best-performing interval using a smaller step size (see the sketch below). The number of search steps is a trade-off between the precision of the search and the compute budget.
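A sketch of such a coarse-to-fine search, where train_and_evaluate is a hypothetical stand-in for a full MixMedia training run returning a validation score at a given K:

```python
# Sketch: coarse-to-fine search over the number of topics K.
# train_and_evaluate is a hypothetical stand-in for a full MixMedia
# training run returning a validation score (higher is better).
import random

def train_and_evaluate(k):
    # Placeholder: train the topic model with K=k topics and score it.
    return random.random()

# Coarse pass: K from 5 to 100 in steps of 20.
coarse = {k: train_and_evaluate(k) for k in range(5, 101, 20)}
best_k = max(coarse, key=coarse.get)

# Fine pass: search around the best coarse K with a smaller step.
fine = {k: train_and_evaluate(k) for k in range(max(2, best_k - 10), best_k + 11, 5)}
print("Best K:", max(fine, key=fine.get))
```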
Resource availability
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Yue Li (yueli@cs.mcgill.ca).
Materials availability
This study did not generate any reagents.
Acknowledgments
This work is supported by CIHR through the Canadian 2019 Novel Coronavirus (COVID-19) Rapid Research Funding Opportunity (Round 1) (application number: 440236).
Author contributions
Y.L. and D.L.B. conceived the study. Y.L. and Z.W. developed the model with critical help from D.L.B. and G.P. I.C. collected and processed the data. Z.W. implemented the model and ran the experiments. J.Z. experimented with the code and wrote the initial draft of the manuscript. Y.L. and D.L.B. supervised the project. All authors analyzed the results and wrote the final version of the manuscript.
Declaration of interests
The authors declare no competing interests.
Contributor Information
David L. Buckeridge, Email: david.buckeridge@mcgill.ca.
Yue Li, Email: yueli@cs.mcgill.ca.
Data and code availability
All data and scripts of this protocol are publicly available on GitHub at https://github.com/li-lab-mcgill/covid-npi. An archived release (https://doi.org/10.5281/zenodo.6350810) can be found at https://github.com/li-lab-mcgill/covid-npi/releases/tag/v1.0.
References
- Dieng, A.B., Ruiz, F.J.R., and Blei, D.M. (2020). Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguist. 8, 439–453. https://doi.org/10.1162/tacl_a_00325.
- Li, Y., Nair, P., Wen, Z., Chafi, I., Okhmatovskaia, A., Powell, G., Shen, Y., and Buckeridge, D. (2020). Global surveillance of COVID-19 by mining news media using a multi-source dynamic embedded topic model. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1–14.
- Wen, Z., Powell, G., Chafi, I., Buckeridge, D.L., and Li, Y. (2022). Inferring global-scale temporal latent topics from news reports to predict public health interventions for COVID-19. Patterns 3, 100435. https://doi.org/10.1016/j.patter.2022.100435.