Table 4.
Data exclusion criteria.
| Elimination rule | Elimination definition | Elimination method |
| --- | --- | --- |
| Non-topic-related content | Posts stating that triglycerides were low, relatively low, or not high; detected triglyceride value lower than 1.7 mmol/L | BERT<sup>a</sup> text classification model: manually label a training set, train the model, validate it on a manually labeled validation set, then apply the model to eliminate matching content from the remaining data |
| Duplicate ID | Content with a duplicate page ID | Compare ID strings and reject duplicate IDs |
| Advertising content | No description of the patient's personal illness; introduction of medical institutions or products; advertising links; invitations to join a group or a consultation; the questioner is an organization | Use text recognition to find posts containing jump links whose target address is an advertisement and delete them; or manually label a training set and build a recognition model using event sequence template mining |
| Popular science articles | Popular medical science content with no description of a patient's condition | Event sequence template mining with the BERT text classification model: manually label a training set, train the model, validate it on a manually labeled validation set, then apply the model to eliminate matching content from the remaining data |
<sup>a</sup>BERT: bidirectional encoder representations from transformers.
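
The two rule-based steps in Table 4 (rejecting duplicate IDs and spotting posts that contain links) can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the record layout (`id` and `text` keys), the function names, and the URL pattern are all assumptions made for illustration, and a link is treated only as a flag for review because the table does not specify how advertising link addresses were recognized. The BERT-based classification steps are not covered here.

```python
import re

# Hypothetical record layout: each post is a dict with "id" and "text" keys.
# The URL pattern is an assumption; the source does not say how links were detected.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def drop_duplicate_ids(posts):
    """Keep the first post seen for each page ID and discard later repeats."""
    seen = set()
    unique = []
    for post in posts:
        if post["id"] in seen:
            continue
        seen.add(post["id"])
        unique.append(post)
    return unique


def has_link(text):
    """Return True if the text contains a URL (a crude stand-in for jump-link detection)."""
    return bool(URL_PATTERN.search(text))


def apply_rule_based_exclusions(posts):
    """Apply the duplicate-ID rule, then set aside posts that contain links."""
    kept, flagged = [], []
    for post in drop_duplicate_ids(posts):
        (flagged if has_link(post["text"]) else kept).append(post)
    return kept, flagged


posts = [
    {"id": "p1", "text": "My triglycerides were 3.2 mmol/L at my last check."},
    {"id": "p1", "text": "My triglycerides were 3.2 mmol/L at my last check."},
    {"id": "p2", "text": "Join our consultation group: https://example.com/join"},
]
kept, flagged = apply_rule_based_exclusions(posts)
print([p["id"] for p in kept])     # ['p1']
print([p["id"] for p in flagged])  # ['p2']
```

Flagged posts are returned separately rather than deleted outright, so that they can either be removed or passed to manual labeling, matching the two options listed for advertising content.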