Recommendation: Duplicate detection should be performed when preparing social media data for use in pharmacovigilance.
Rationale: Having first eliminated simple retweets and the like, our study found 17% of the remaining posts to be suspected duplicates, using an algorithm with an estimated precision of 99% [28]. In our signal detection study, several of the inspected series of posts contained large proportions of duplicates [15].
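As a rough sketch of such a first pass (not the study's actual pipeline), the Python snippet below drops explicit retweets and exact text duplicates before any probabilistic matching; the post schema and the "RT @user:" pattern are illustrative assumptions.

```python
import re

def first_pass_dedup(posts):
    """Drop explicit retweets and exact text duplicates from a list of posts."""
    seen = set()
    kept = []
    for post in posts:
        text = post["text"].strip()
        # Treat classic "RT @user:" prefixes as simple retweets and drop them.
        if re.match(r"RT @\w+:", text):
            continue
        # Light normalisation so trivially identical posts collapse together.
        key = re.sub(r"\s+", " ", text.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append(post)
    return kept

posts = [
    {"id": 1, "text": "Headache after starting drug X"},
    {"id": 2, "text": "RT @user: Headache after starting drug X"},
    {"id": 3, "text": "headache  after starting  drug X"},
]
print([p["id"] for p in first_pass_dedup(posts)])  # -> [1]
```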
Recommendation: Probabilistic record linkage should be considered as a complement or alternative to rule-based methods for duplicate detection in social media data.
Rationale: Our study found 9% suspected duplicates in a set of Twitter posts that had already been deduplicated using a method based on rules and Bloom filters. A lower proportion of additional suspected duplicates (1.6%) was identified among posts related to adverse events [28].
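To make the recommendation concrete, here is a minimal sketch of probabilistic record linkage in the Fellegi-Sunter style, where each compared field contributes a log-likelihood-ratio weight depending on whether it agrees across the pair; the fields and the m/u probabilities are invented for illustration and are not vigiMatch's actual model or parameters.

```python
import math

# field: (m = P(agree | true duplicate), u = P(agree | non-duplicate))
# These values are illustrative assumptions, not estimates from the study.
FIELDS = {
    "drug":  (0.95, 0.10),
    "event": (0.90, 0.05),
    "date":  (0.80, 0.01),
}

def linkage_score(a, b):
    """Sum of agreement/disagreement weights over the compared fields."""
    score = 0.0
    for field, (m, u) in FIELDS.items():
        if a.get(field) == b.get(field):
            score += math.log(m / u)              # agreement weight
        else:
            score += math.log((1 - m) / (1 - u))  # disagreement weight
    return score

post_a = {"drug": "ibuprofen", "event": "rash", "date": "2020-03-01"}
post_b = {"drug": "ibuprofen", "event": "rash", "date": "2020-03-02"}
# A high score despite one disagreeing field; a threshold on this score
# separates suspected duplicates from presumed non-duplicates.
print(round(linkage_score(post_a, post_b), 2))
```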
Recommendation: Training data for duplicate detection in social media should be enriched with suspected duplicates, ensuring that the method of enrichment is accounted for in the training and evaluation of the duplicate detection method, for example through active learning.
Rationale: Our study showed that it was feasible to use active learning in training vigiMatch for duplicate detection in Twitter. Only 0.008% of all possible pairs of tweets in our data were suspected duplicates, so a simple random sample would consist mostly of non-duplicates. If training data are enriched with suspected duplicates but algorithms are trained and evaluated without accounting for the method of enrichment, then neither the methods nor their estimated performance will generalise to the real-world setting.
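As an illustration of why active learning helps under such extreme class imbalance, the sketch below runs pool-based uncertainty sampling: annotation effort concentrates on pairs near the decision boundary rather than on the overwhelming majority of clear non-duplicates. The features, labels, and classifier are synthetic stand-ins, not the study's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(10_000, 5))          # synthetic pairwise similarity features
y_pool = (X_pool.sum(axis=1) > 4).astype(int)  # stand-in "true" labels, rare positives

# Seed the labelled set with a few examples of each class, then query.
labelled = list(np.where(y_pool == 1)[0][:10]) + list(np.where(y_pool == 0)[0][:10])
for _ in range(10):  # ten rounds of annotation
    model = LogisticRegression().fit(X_pool[labelled], y_pool[labelled])
    probs = model.predict_proba(X_pool)[:, 1]
    # Query the not-yet-labelled pair the current model is least certain about.
    for idx in np.argsort(np.abs(probs - 0.5)):
        if idx not in labelled:
            labelled.append(int(idx))  # "annotate" it and add to the training data
            break
```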
Recommendation: Future research should compare different approaches to improving computational efficiency, such as blocking and locality-sensitive hashing.
Rationale: Computational efficiency is of great importance in duplicate detection, and a comparison between different approaches was out of scope for the study at hand. In our study, a simple blocking scheme reduced the number of pairwise comparisons by 22% [28].
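For illustration, the sketch below shows the general idea behind blocking: candidate pairs are generated only within groups of posts sharing a cheap blocking key, here (an arbitrary assumption, not the scheme used in the study) the first token of the text. Locality-sensitive hashing plays a similar role but assigns similar posts to the same bucket probabilistically.

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(posts, key_fn):
    """Yield candidate pairs only among posts that share a blocking key."""
    blocks = defaultdict(list)
    for post in posts:
        blocks[key_fn(post)].append(post)
    for block in blocks.values():
        yield from combinations(block, 2)

posts = [
    {"id": 1, "text": "headache after drug x"},
    {"id": 2, "text": "headache and nausea"},
    {"id": 3, "text": "nausea after drug y"},
]
# Illustrative blocking key: the first token of the post text.
pairs = list(blocked_pairs(posts, lambda p: p["text"].split()[0]))
print(len(pairs))  # 1 candidate pair instead of 3 exhaustive comparisons
```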