Abstract
To accelerate medical knowledge discovery, an increasing number of research programs are gathering and sharing data on a large number of participants. Due to privacy concerns and legal restrictions on data sharing, these programs apply various strategies to mitigate privacy risk. However, the activities of participants and research program sponsors, particularly on social media, might reveal an individual’s membership in a study, making it easier to recognize participants’ records and uncover information they have yet to disclose. This behavior can jeopardize the privacy of the participants themselves, as well as the reputation of the projects, their sponsors, and the research enterprise. To investigate the dangers of self-disclosure behavior, we gathered and analyzed 4,020 tweets, and uncovered over 100 tweets disclosing individuals’ memberships in 15 programs. Our investigation showed that self-disclosure on social media can reveal participants’ membership in research cohorts, and such activity might lead to the leakage of a person’s identity, genomic, and other sensitive health information.
Introduction
To accelerate research and improve health care outcomes, various programs are gathering health-related information from individuals to build large cohorts1–3. The primary objective of these programs at the early stage is to collect a wide range of data from their participants, including genomic, phenomic (via surveys and electronic medical records), and demographic information4,5. The data are then made accessible to researchers to explore hypotheses, study associations, and develop new approaches to manage one’s health6,7. One feature these programs share is that they are large and getting larger with respect to both the number of participants and the amount of data collected. One example is the Personal Genome Project (PGP), which was launched by Harvard researchers to improve the personalization of medicine4. This program has collected more than 10,000 genomes of participants from a variety of countries8. The 100,000 Genomes Project is another example, which has collected the genomes of one hundred thousand British participants to improve research on rare diseases9. And, to investigate how genetic predisposition and environmental exposure contribute to disease development, UK Biobank is now generating whole genome sequencing data on over 500,000 individuals10.
These programs aim to make data widely available, an endeavor that is realized by sharing data with trusted researchers and, at times, with the public2. However, the sharing of individual-level health data raises privacy concerns. This is because participants might consent to making their genome and health data available to researchers (or to the public), but not to revealing their identity, and exposure of that identity can result in unexpected economic or reputational loss11. As such, the majority of large cohort programs adopt strategies to protect their participants’ identity12, for example, through the application of de-identification routines.
Yet, there are concerns about the degree to which protection can be sufficiently realized in the age of big data. This is because there are various ways in which privacy may be compromised in such systems. For instance, there have been a number of re-identification attacks designed to leverage a wide range of data types13,14. In 2013, Sweeney and colleagues15 re-identified the names of more than 40% of the PGP participants by linking demographic data (ZIP code, gender, and date of birth) of de-identified records to the voter registration lists. Though these attacks often require a non-trivial amount of time, effort, and money to realize in a manner that would be considered detrimental to a program16, there are several developments that are enhancing the opportunities for penetrating the privacy of individuals in such environments. The first is that participants are increasingly becoming partners in the research environment. The second, and partially an artifact of the first, is that participants are using social platforms to discuss their experiences in the research domain on a widely accessible scale17,18. The third is that the research programs themselves may encourage volunteers to tell their stories publicly, with the goal of encouraging people to join the study. Revealing such information makes it evident that the social media sharer is a member of the cohort. This makes it easier for would-be attackers to identify the sharer in the resource. This can be specifically accomplished by using the sharer’s personal information that might be revealed on social media, as well as demographics that might be accessible through information brokers, to link the sharer to their record in the program’s de-identified dataset19. While some individuals may feel comfortable revealing certain information about themselves (e.g., a family history of heart disease), they may not be comfortable revealing their whole genome. 
As such, this behavior potentially jeopardizes the privacy of the participants themselves, as well as the reputation of the project.
To study the plausibility of an attack, we investigate the frequency of membership disclosure on social media. To do so, we selected a number of research studies from the Database of Genotypes and Phenotypes (dbGaP) at NIH and the Wikipedia cohort studies category20. We then set out to ascertain whether any membership disclosure transpired on a popular social media platform, Twitter. We gathered over 100,000 tweets related to these cohorts and selected approximately 4,000 that contained keywords (e.g., participant, join, volunteer) indicative of potential disclosure. As will be illustrated below, we discovered membership disclosure tweets that revealed the participation of over 100 individuals. We inspected the Twitter profiles of these individuals, which indicated demographics, health conditions, and occupations that might be leveraged to link to an individual’s de-identified record. All of this information provides an opportunity to find a user’s record in the study cohort and uncover additional information that has yet to be revealed in an identified manner, such as the participant’s genome or potentially stigmatizing health information.
This investigation also reveals several patterns. First, we show that membership-related tweets often contain certain types of words (e.g., join, participant, and volunteer). Second, over 80% of the participants who disclosed their membership have a non-negative attitude towards the program they are involved in. Sentiment analysis shows that most of these participants are happy to be a part of the cohort, which might be the incentive for some participants to reveal information about themselves. Third, longer lasting and larger cohort studies usually have more membership leakage on Twitter. We note that this is a hypothetical study only and we did not actually re-identify these individuals in the cohorts they claim to be a member of. Nonetheless, our results show that posts on social media can reveal participants’ membership in research cohorts and such activity might lead to the leakage of a person’s identity, genomic, and other sensitive health information.
Related Work
The personal health information that has been disclosed on social media has been leveraged to study health-related behaviors17,21,22. In spite of the great potential research value, there still exist many concerns regarding the sharing of personal health status or negative health risk behaviors in online environments23. For example, Morgan et al.24 showed that one-third of investigated college students reported having posted a picture depicting substance use on social media platforms. Sharing such information will not only trigger privacy concerns about the disclosers themselves (e.g., damage to reputation), but may have the potential to influence other people’s behavior. For instance, it was observed that discussions about prescription abuse over Twitter may aggravate substance abuse25,26.
Additionally, it should be noted that people share their own information as well as that of other people in online environments. It has been shown that individuals disclose information about a wide range of acquaintances, ranging from family members to friends to high profile persons in the media22,27. For example, Christofides et al.28 illustrated how undergraduate Facebook users posted personal information (e.g., dates of birth and email addresses) in their profiles, but also shared photos of their friends performing potentially sensitive acts (e.g., drinking alcohol at parties).
Our work differs from the aforementioned studies in that we focus on the privacy issues regarding the membership of participants in biomedical research programs on social media. Specifically, we study self-disclosures made by the program participants themselves, as well as investigate the disclosures made by the organizations who own and have responsibility to protect the participants’ data. In doing so, our research contributes to the health information privacy field by highlighting a new type of privacy risk: the cohort membership leakage through social media.
Cohort Selection
We selected study cohorts from the Database of Genotypes and Phenotypes (dbGaP) and the Wikipedia cohort studies category20. dbGaP was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in humans. It contains 483 biomedical research studies. By contrast, Wikipedia provided a convenient list of long lasting cohort studies, such as the 1970 British Cohort Study29. To make our investigation more general, we chose as many different types of studies as possible. The selected cohorts are diverse in three aspects: objective, time, and population.
Objective. We selected cohorts that focus on a specific disease, such as the Type 1 Diabetes Genetics Consortium30, as well as cohorts that focus on a particular demographic, such as an occupation in the Nurses’ Health Study31 or a gender in the Million Women Study32.
Time. Cohort studies are not a new phenomenon. Some of the cohorts considered have a long history. For instance, the Framingham Heart Study33 began in 1948. Still, some of the studies are relatively new, for example, the Qatar BioBank34 was launched in 2012. Additionally, we selected studies to have a wide range in duration. Certain longitudinal studies have lasted for decades, while some achieved their objectives in a short period and thus were quite limited in length.
Population. The selected cohorts have a varying number of participants. There are multiple cohorts with relatively small sizes, such as the International HapMap Project35, which collected human genomes from 1,000 participants. By contrast, several cohorts contain hundreds of thousands of participants, such as the 100,000 Genomes Project.
Methods
We partition our search procedure into two steps. Here we provide a high-level overview of the process. The first step is data collection. In this step, we find all of the possible tweets related to the selected cohorts. The second step is data filtering. In this step, we choose a portion of the tweets from step one and manually review these tweets to find the participants of studies. We then perform sentiment and frequency analysis on the tweets that disclose membership in a biomedical research study. The workflow is summarized in Figure 1.
Figure 1:
The framework for research cohort membership discovery. The processing begins by collecting all tweets that contain the name of cohorts of interest. The tweets are then subject to a membership keyword filter. The remaining tweets are manually reviewed.
Data Collection. To collect tweets related to the selected cohorts, we used the names (and abbreviations) of the 77 studies as search keywords and collected all related tweets with a Python crawler. By doing so, we obtained 139,529 tweets. Manually reviewing all of the tweets to find those revealing an individual’s participation information would be quite time consuming and error-prone. Since this is a pilot study, and our goal is to demonstrate the possibility of membership disclosure rather than to find all such tweets, we narrowed the scope of our search to the portion of the tweets most likely to contain information about membership disclosures. When we manually reviewed some of the collected tweets, we found that most of the self-disclosed tweets exhibited one of the following patterns: “I joined xxx research project today!”, “I am a participant of the xxx program.” or “Now I became a volunteer of the xxx study.”
Data Filtering. We filtered the tweets with the following keywords: participant, participate, join, and volunteer, and discarded the remainder of the tweets. It should be recognized that this search method does not guarantee completeness, because it misses disclosure tweets that lack these search terms. For example, “I sent my test sample to xxx project today.” is not caught by the filter. This step yielded 12,698 tweets. For most of the projects, there are fewer than 500 tweets with the keywords of interest, and we manually reviewed all of these tweets to find those that reveal membership. For cohorts with more than 500 filtered tweets, we randomly selected 500 for manual review. Details about the number of tweets collected for each cohort are provided in Table 1. For brevity, we depict the 50 cohorts that returned the most tweets. Information on all 77 cohorts is available on GitHub1.
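The filtering and sampling step can be sketched as follows. This is a minimal illustration rather than the crawler or filter used in the study: the function names, the case-insensitive substring matching, and the fixed random seed are all assumptions made for the example.

```python
import random

# Keywords named in the Data Filtering step.
KEYWORDS = ("participant", "participate", "join", "volunteer")
REVIEW_CAP = 500  # cohorts with more matching tweets are randomly subsampled


def matches_keyword(text):
    """Return True if the tweet contains any keyword.

    Assumption: case-insensitive substring matching, so "joined" and
    "volunteered" also match. The paper does not state the matching rule.
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def select_for_review(tweets_by_cohort, cap=REVIEW_CAP, seed=0):
    """Keep keyword-matching tweets per cohort, sampling at most `cap` of them."""
    rng = random.Random(seed)  # fixed seed (an assumption) makes the sample repeatable
    selected = {}
    for cohort, tweets in tweets_by_cohort.items():
        filtered = [tweet for tweet in tweets if matches_keyword(tweet)]
        if len(filtered) > cap:
            filtered = rng.sample(filtered, cap)
        selected[cohort] = filtered
    return selected
```

With this rule, the example tweet “I sent my test sample to xxx project today.” is rejected, which mirrors the incompleteness noted above.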
Table 1:
Number of tweets collected, filtered, and reviewed for 77 cohorts.
# | Study Cohort | All Tweets | Tweets Filtered | Tweets Reviewed |
1 | UK Biobank | 24,056 | 4,265 | 500 |
2 | 100,000 Genomes Project (Genomics England) | 22,217 | 1,735 | 500 |
3 | UK 10K | 14,999 | 515 | 500 |
4 | LifeLines | 14,600 | 74 | 74 |
5 | All of Us Research Program | 12,263 | 4,163 | 500 |
6 | National Children Study | 10,640 | 357 | 357 |
7 | Human Longevity | 4,478 | 45 | 45 |
8 | Qatar Biobank | 3,585 | 296 | 296 |
9 | Australian Longitudinal Study on Women’s Health (ALSWH) | 3,555 | 136 | 136 |
10 | Research Program on Genes, Environment and Health (RPGEH) | 2,769 | 6 | 6 |
11 | Personal Genome Project | 2,655 | 436 | 436 |
12 | Raine Study | 2,538 | 72 | 72 |
13 | Generation Scotland | 2,283 | 106 | 106 |
14 | Coronary Artery Risk Development in Young Adults Study | 1,616 | 11 | 11 |
15 | Nun Study | 1,275 | 19 | 19 |
16 | Millennium Cohort Study | 1,195 | 42 | 42 |
17 | Million Women Study | 1,182 | 17 | 17 |
18 | Socio-Economic Panel | 1,152 | 64 | 64 |
19 | Avon Longitudinal Study of Parents and Children (ALSPAC) | 1,005 | 43 | 43 |
20 | Young Lives | 919 | 36 | 36 |
21 | LifeGene | 850 | 3 | 3 |
22 | Seven Countries Study | 833 | 3 | 3 |
23 | Atherosclerosis Risk in Communities | 708 | 2 | 2 |
24 | English Longitudinal Study of Ageing | 673 | 6 | 6 |
25 | Black Women’s Health Study | 619 | 22 | 22 |
26 | International Cancer Genome Consortium | 601 | 20 | 20 |
27 | 1970 British Cohort Study (BCS70) | 600 | 26 | 26 |
28 | Whitehall Study | 575 | 10 | 10 |
29 | Nurses’ Health Study | 565 | 29 | 29 |
30 | Alameda County Study | 353 | 1 | 1 |
31 | Seattle 500 Study | 339 | 2 | 2 |
32 | National Child Development Study | 299 | 26 | 26 |
33 | Framingham Heart Study | 290 | 14 | 14 |
34 | Religious Orders Study | 249 | 17 | 17 |
35 | The Irish Longitudinal Study on Ageing | 216 | 7 | 7 |
36 | Women’s Interagency HIV Study | 184 | 3 | 3 |
37 | Adventist Health Studies | 166 | 1 | 1 |
38 | Study of Mathematically Precocious Youth | 160 | 4 | 4 |
39 | Newcastle 85+ Study | 148 | 7 | 7 |
40 | Great Smoky Mountains Study | 129 | 2 | 2 |
41 | International Rare Diseases Research Consortium | 128 | 4 | 4 |
42 | UK Households Longitudinal Study | 126 | 1 | 1 |
43 | Multicenter AIDS Cohort Study | 119 | 3 | 3 |
44 | National Survey of Health & Development | 116 | 2 | 2 |
45 | British Birth Cohort Studies | 113 | 0 | 0 |
46 | BioBank Japan | 103 | 0 | 0 |
47 | MalariaGEN | 103 | 5 | 5 |
48 | Taiwan Biobank | 102 | 2 | 2 |
49 | COSMOS Cohort Study | 100 | 0 | 0 |
50 | Normative Aging Study | 100 | 1 | 1 |
… | … | … | … | … |
Summary | All 77 cohorts | 139,529 | 12,698 | 4,020 |
Sentiment and Frequency Analysis. The previous step yielded 4,020 tweets. We manually reviewed these tweets and labeled those containing membership disclosure information. We then performed sentiment and frequency analysis on the target tweets posted by project participants. We first removed all links, hashtags, and @ characters from the tweets. We then fed the preprocessed tweets into TextBlob (version 0.15.3), a Python package for natural language processing (NLP), for sentiment analysis. For each tweet, TextBlob generates a sentiment score in the range [−1, +1], where −1 means extremely negative and +1 means extremely positive. Next, we partitioned the tweets into words through normalization and tokenization (which partitions a tweet into a set of words) and lemmatized all of the words (lemmatization transforms a word from its original form to its base form; e.g., walks becomes walk) using the Python NLP package nltk (version 3.3). From the lemmatized words, we removed stop words (e.g., i, is, in, the). Since we used cohort names to collect the tweets, we also dropped all of the words that appear in cohort names, such as “study”, “project”, “health”, and “genome”. We then counted the frequency of the remaining words.
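The cleaning and counting steps can be sketched with the Python standard library alone. This is a simplified illustration, not the study's pipeline: it omits the TextBlob sentiment scoring and the nltk lemmatization, the stop-word and cohort-word sets are short stand-ins, and it assumes one reading of the preprocessing in which whole hashtags are removed while only the @ character is stripped from mentions.

```python
import re
from collections import Counter

# Short stand-in lists (assumptions); the study used nltk's stop words and
# every word that appears in a cohort name.
STOP_WORDS = {"i", "a", "am", "is", "in", "the", "of", "to", "and"}
COHORT_NAME_WORDS = {"study", "project", "health", "genome"}


def clean_tweet(text):
    """Remove links and hashtags, and strip the '@' character from mentions."""
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"#\w+", " ", text)          # whole hashtags (one possible reading)
    return text.replace("@", "")               # keep the mentioned name itself


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(tweets):
    """Count the words that survive stop-word and cohort-name filtering."""
    counts = Counter()
    for tweet in tweets:
        for token in tokenize(clean_tweet(tweet)):
            if token not in STOP_WORDS and token not in COHORT_NAME_WORDS:
                counts[token] += 1
    return counts
```

Calling `word_frequencies(tweets).most_common(n)` would then give a ranked word list of the kind reported for the self-disclosed tweets.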
Results
Table 1 reports the number of tweets collected, filtered, and reviewed for the 77 selected cohorts. Each of the first six cohorts in Table 1 has more than 10,000 related tweets, and together they account for 70% of all collected tweets. All of the cohorts in the top 25% have over 1,000 tweets. The tweets collected from these 19 cohorts account for 91.8% of all the tweets. There are 26 cohorts with fewer than 100 related tweets. The distribution of tweets filtered by the selected keywords is roughly the same as the distribution of the total collected tweets. The top 6 cohorts account for 87.5% of the filtered tweets and the top 19 cohorts account for 97.6% of the filtered tweets. In general, research programs with a larger number of participants and a longer time span tend to have more tweets. In particular, programs with government support often fall into this category.
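As a quick check of the shares quoted for the six largest cohorts, the counts can be summed directly from Table 1. The snippet below is only a worked calculation; the variable and function names are illustrative.

```python
# (All Tweets, Tweets Filtered) for the first six cohorts in Table 1.
TOP_SIX = {
    "UK Biobank": (24056, 4265),
    "100,000 Genomes Project": (22217, 1735),
    "UK 10K": (14999, 515),
    "LifeLines": (14600, 74),
    "All of Us Research Program": (12263, 4163),
    "National Children Study": (10640, 357),
}
TOTAL_COLLECTED = 139529  # summary row of Table 1
TOTAL_FILTERED = 12698


def share_percent(part, whole):
    """Return `part` as a percentage of `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


collected_top_six = sum(collected for collected, _ in TOP_SIX.values())
filtered_top_six = sum(filtered for _, filtered in TOP_SIX.values())

print(share_percent(collected_top_six, TOTAL_COLLECTED))  # 70.8
print(share_percent(filtered_top_six, TOTAL_FILTERED))    # 87.5
```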
Among the 4,020 selected tweets, we found 109 that communicated membership disclosure. The results of this investigation are shown in Table 2. These tweets come from 15 of the cohorts (19.5%). They reveal the membership of 114 participants. We present some examples of disclosure tweets in Table 3. Notably, 86 of these tweets (78.9%) were posted by cohort participants. In these cases, participants leaked either their own or their friends’ membership information when they talked about their experience with a cohort study. This discovery confirms the findings of Yin et al.36 and Mao et al.27, where it was observed that individuals’ self-disclosure on social media may reveal other people’s sensitive information. The remaining 23 tweets (21.1%) come from a program’s official account or from a researcher or organizer of the study. In these cases, the participants’ information was revealed because the program shared a volunteer’s story.
Table 2:
A summary of the cohort and membership coverage from tweets discovered to reveal participation.
# | Study Cohort | Tweets Reviewed | Tweets Disclosed | Self-Disclosed Tweets | Program-Disclosed Tweets | Disclosed Individuals |
1 | Personal Genome Project | 436 | 26 | 26 | 0 | 26 |
2 | 100,000 Genomes Project (Genomics England) | 500 | 16 | 9 | 7 | 18 |
3 | Black Women’s Health Study | 22 | 12 | 12 | 0 | 12 |
4 | Raine Study | 72 | 11 | 0 | 11 | 14 |
5 | UK Biobank | 500 | 10 | 10 | 0 | 10 |
6 | All of Us Research Program | 500 | 10 | 10 | 0 | 10 |
7 | Qatar Biobank | 296 | 5 | 4 | 1 | 4 |
8 | Nurses’ Health Study | 29 | 4 | 4 | 0 | 4 |
9 | Australian Longitudinal Study on Women’s Health | 136 | 3 | 3 | 0 | 3 |
10 | 1970 British Cohort Study (BCS70) | 26 | 3 | 3 | 0 | 3 |
11 | Framingham Heart Study | 14 | 2 | 0 | 2 | 2 |
12 | Millennium Cohort Study | 42 | 2 | 2 | 0 | 2 |
13 | Million Women Study | 17 | 2 | 2 | 0 | 2 |
14 | National Child Development Study | 26 | 2 | 1 | 1 | 3 |
15 | Human Longevity | 45 | 1 | 0 | 1 | 1 |
Summary | All 15 cohorts | | 109 | 86 | 23 | 114 |
Table 3:
Examples of membership disclosure tweets. We replace the person and cohort names with xxx and rewrite the sentences to mitigate the risk of revealing the program and participants.
Type | Tweet |
Self-Disclosed | 1. Proud to be a participant in this: https://url/abcd |
Self-Disclosed | 2. I like how xxx program never forget my birthday. Thanks @xxx |
Self-Disclosed | 3. I joined xxx project because I won’t never share anyone else’s DNA. |
Self-Disclosed | 4. I am proud to be a participant in the xxx cohort knowing that I am contributing to a research about health and lifestyle. |
Self-Disclosed | 5. I am both a researcher and a participant of the xxx project. |
Self-Disclosed | 6. I just volunteered for the xxx. It was a nice experience, you should try it too! |
Program-Disclosed | 1. It’s great to see Mr.xxx and his parents sharing their story about receiving a test result from the xxx research https://url/abcd |
Program-Disclosed | 2. In this video, meet participant Ms.xxx and her father, xxx, who talked about why taking part is important to them https://url/abcd |
Program-Disclosed | 3. It’s awesome that @xxx continue to contribute to the Program. Thank you! |
Program-Disclosed | 4. XXX, who has heart disease, talks about her participation in xxx study. |
We discuss self-disclosed and program-disclosed tweets separately in the following sections.
Self-disclosed tweets. Self-disclosed tweets are the tweets posted by cohort participants. These tweets usually have a similar style, such as “I joined/participated in the xxx study” or “I am a participant/volunteer of the xxx program.” Some users wrote an additional sentence to explain why they joined the program or how they feel about it. An analysis of the sentiment of the self-disclosed tweets revealed that 71 of the 86 users (82.6%) have a neutral or positive attitude about their participation, and 39 of the tweets (45.3%) have a sentiment score greater than 0. This suggests that most self-disclosing volunteers are happy with the program they participate in and that their disclosures on social media express support or praise for the program rather than criticism. Words like proud and love often appear in these tweets. Table 4 provides the frequency of the 23 most common words.
Table 4:
The most frequent words in 86 self-disclosure tweets.
word | count | word | count | word | count |
participant | 26 | invite | 6 | would | 4 |
participate | 23 | love | 5 | well | 4 |
join | 20 | since | 5 | remember | 4 |
get | 9 | data | 5 | today | 4 |
proud | 7 | one | 5 | great | 4 |
interest | 6 | share | 5 | learn | 4 |
years | 6 | look | 4 | think | 4 |
volunteer | 6 | member | 4 |
At the same time, a small portion of the tweets suggests a negative emotion. For example: “I’ve been a participant for two years, but have not had any feedback.” The distribution of the sentiment score is shown in Figure 2. Self-disclosure tweets usually only reveal the user’s membership; however, at times they may involve their family or close friends. In such cases, one or more of the users’ family members may have a rare disease (e.g., a child who experiences a congenital heart attack) and they joined the research project together to find out why and how to treat it.
Figure 2:
Sentiment analysis of 86 self-disclosed tweets. The score ranges from −1 to +1, where −1 means very negative and +1 means very positive.
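The split between negative, neutral, and positive tweets reported above amounts to bucketing polarity scores by sign. The following sketch shows that bucketing on made-up scores; the function names and the example values are hypothetical and are not data from the study.

```python
def bucket_scores(scores):
    """Count scores that are negative, neutral (exactly zero), or positive."""
    buckets = {"negative": 0, "neutral": 0, "positive": 0}
    for score in scores:
        if score > 0:
            buckets["positive"] += 1
        elif score < 0:
            buckets["negative"] += 1
        else:
            buckets["neutral"] += 1
    return buckets


def non_negative_share(buckets):
    """Fraction of tweets whose score is neutral or positive."""
    total = sum(buckets.values())
    return (buckets["neutral"] + buckets["positive"]) / total


# Hypothetical polarity scores in [-1, +1], for illustration only.
example_scores = [0.8, 0.5, 0.0, 0.0, -0.3]
buckets = bucket_scores(example_scores)
```

Treating a score of exactly zero as neutral is an assumption; a tolerance band around zero would be an equally reasonable choice.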
Program-disclosed tweets. At times, the programs post about volunteers’ participation experiences on social media as a way to promote the program and attract the public to join. Most of these tweets reveal a volunteer’s membership, often together with health information and a link to, or a video about, the volunteer’s story. Volunteers talk about why they joined the program, as well as what they gained from entering the program. This approach may be useful in attracting people to join the program, but it also increases the volunteer’s risk of re-identification.
Disclosure tweets are more likely to be associated with larger cohorts. As shown in Table 5, most of the cohorts with membership disclosure tweets have 10,000 or more participants. The studies that began more recently tend to have more members active on the Internet, such that they appear to discuss their involvement more often. Some of the tweets posted by participants in long term studies showed that these participants have a stable relationship with the program. These users specifically shared their long term participation experience and their feelings about the program. The word “years” appears six times in 32 tweets.
Table 5:
Year launched, number of participants and the number of tweets disclosed for the 15 cohorts.
# | Study Cohort | Disclosing Tweets | Year Launched | Participants |
1 | Personal Genome Project | 26 | 2005 | 10,000 |
2 | 100,000 Genomes Project (Genomics England) | 16 | 2012 | 100,000 |
3 | Black Women’s Health Study | 12 | 1995 | 59,000 |
4 | Raine Study | 11 | 1989 | 2,868 |
5 | UK Biobank | 10 | 2007 | 500,000 |
6 | All of Us Research Program | 10 | 2017 | 20,000 |
7 | Qatar Biobank | 5 | 2012 | 20,000 |
8 | Nurses’ Health Study | 4 | 1976 | 280,000 |
9 | Australian Longitudinal Study on Women’s Health (ALSWH) | 3 | 1996 | 57,000 |
10 | 1970 British Cohort Study (BCS70) | 3 | 1970 | 17,000 |
11 | Framingham Heart Study | 2 | 1971 | 14,000 |
12 | Millennium Cohort Study | 2 | 1991 | 200,000 |
13 | Million Women Study | 2 | 1996 | 1,319,475 |
14 | National Child Development Study | 2 | 1958 | 17,415 |
15 | Human Longevity | 1 | 2013 | N/A |
Summary | All 15 cohorts | 109 | | |
Tweets can contain search keywords but lack user membership information. 3,911 of the 4,020 (97.3%) selected tweets do not contain user participation information. Program-related accounts posted most of these tweets, which tended to follow one of two patterns. The first is a call for volunteers: “Come and join the xxx research program.” The second is a thank you message to participants: “xxx participants finished sequencing! Thank you, everyone, for taking part in our research!”. On the other hand, tweets posted by users revealed their interest in or concern about the program. For example: “I am interested in join the xxx study, but I am worried about my privacy.” In general, it was observed that people are willing to join cohort studies and make their contribution, but concern about privacy protection is an impediment. For example, 16 tweets discussed the disclosure of participants’ email addresses by Personal Genome Project UK.
Potential Risk of Membership Disclosure. Based on this analysis, we partitioned the risk of membership disclosure into three types: membership disclosure, identity disclosure and attribute disclosure. Here, we will discuss these privacy threats and illustrate how they relate to the specific population we studied.
The problems induced by membership disclosure are best illustrated with several examples. First, imagine that a volunteer has disclosed his/her membership in some research program. An attacker collects the volunteer’s demographic information (e.g., residential geographical area, gender, and date of birth) from the social network (e.g., the user’s Twitter profile) and links this information to the de-identified participants’ records published by the research program. If a unique linkage to a record transpires, then the attacker has achieved an identity disclosure19. If multiple records are linked to the user, but they share the same (or similar) sensitive attribute value(s), then a successful attribute disclosure attack37 has been perpetrated. Even if their values for the sensitive attribute are different, the attacker can guess the right one with some confidence. By contrast, previous high-profile attacks are limited in that they need to make assumptions about whether a targeted individual is indeed in a dataset. Thus, their claimed attacking powers need to be discounted by the prior probability that a targeted individual has been selected from a broader population38. In our scenario, the attacker is confident that the targeted individual is in the dataset. As a consequence, the discovery of membership significantly increases the likelihood of a successful attack. This attack adds significant power to all the previous attacks, which include the following:
1. Membership Disclosure. As noted earlier, the action of disclosing one’s membership leaks some of the users’ sensitive information. For instance, a project may be disease-specific, such that all of the participants have the same diagnosis. Similarly, some of the users join a study because they, or their relatives, have a rare disease. When they post such information online, their health information is leaked as well.
2. Identity Disclosure. By sharing membership and other personal information over social media, users can be identified. This can be accomplished by collecting self-disclosed users’ personal information from their profiles, such as their real name, race, gender, residence, education level, and occupation. To illustrate this issue, we randomly selected ten users and inspected their Twitter profiles. We found that nine of the ten users showed their real face in their avatar, eight shared their location down to a specific city, seven talked about their occupation or education level in their biography, six used their real name as their account name, and two made their date of birth public. With such information on hand, an attacker could find the person through a people search website, such as Intelius.com or InstantCheckMate.com. Moreover, program-disclosed individuals are more readily identifiable because the story shared by a program often contains detailed information about the storyteller. In this case, we learn the volunteer’s personal information from the story, as well as their health information.
3. Attribute Disclosure. Research programs may publish their data to the public or share it with researchers in a de-identified fashion. However, if a malicious attacker has access to the cohort data, along with additional information about the self-disclosing participant (collected from the user’s social media profile), then the attacker can use such information as quasi-identifiers to link to the participant’s record in the cohort database. As mentioned earlier, Sweeney et al. showed that they could identify more than 40% of PGP participants using their ZIP code, gender, and date of birth, and obtain participants’ sensitive information, such as medical conditions and DNA sequence15.
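The linkage reasoning behind these risk types can be sketched on toy data. The example below is purely hypothetical: the record layout, field names, and outcome labels are assumptions made for illustration, and no such linkage was performed on real cohort data in this study.

```python
from collections import namedtuple

# Hypothetical de-identified record: three quasi-identifiers plus one
# sensitive attribute.
Record = namedtuple("Record", ["zip_code", "gender", "birth_date", "diagnosis"])

QUASI_IDENTIFIERS = ("zip_code", "gender", "birth_date")


def link(profile, records):
    """Return the records whose quasi-identifiers match a social media profile.

    `profile` is a dict; quasi-identifiers that are missing or None are
    treated as unknown and match any value.
    """
    return [
        record for record in records
        if all(
            profile.get(field) is None or profile[field] == getattr(record, field)
            for field in QUASI_IDENTIFIERS
        )
    ]


def classify_disclosure(matches):
    """Label the linkage outcome for a person already known to be in the cohort."""
    if not matches:
        return "no match"
    if len(matches) == 1:
        return "identity disclosure"   # unique linkage to a single record
    if len({record.diagnosis for record in matches}) == 1:
        return "attribute disclosure"  # several records, same sensitive value
    return "ambiguous"                 # attacker can only guess among values
```

Because the target is already known to be in the cohort, an empty match list in this sketch would indicate mismatched or generalized quasi-identifiers rather than absence from the dataset.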
Discussion and Conclusion
This investigation illustrates that an individual’s membership in a biomedical research study can be disclosed on social media in several ways. We uncovered tweets that revealed the membership of over 100 participants in 15 research programs. Approximately 80% of the tweets correspond to user self-disclosure, while the remainder correspond to disclosures made by the program organizer. We found that 39 out of 86 (45.3%) self-disclosed users have a positive attitude towards the research project they joined. The terms “proud”, “interest”, and “love” were used by multiple self-disclosers. The personal information reported in the profiles of the social media users increases the risk of identification, which in turn increases the likelihood that an attacker could link to their record in a de-identified dataset about the cohort, leading to further privacy intrusions, such as the re-identification of genomic information. A program may disclose participants’ membership when it introduces volunteers and shares their stories with the public as a way to increase program influence and recruit more participants. These stories may contain personal information and sensitive health information about the volunteer.
Still, there are certain limitations to this work, which point to next steps for research. First, our search procedure is somewhat ad hoc, such that we failed to detect some membership-disclosing tweets that lack certain words (e.g., participant or volunteer). Second, we studied disclosure behavior only on Twitter, but the same problem may exist on other social platforms, such as Facebook and Instagram. A comprehensive study of additional popular social platforms is needed. Third, the current process requires a final manual review, but it is likely that, with enough instances of disclosure, an automated approach for discovering such tweets could be developed. At the same time, we believe that if automated approaches can be designed to detect such disclosures, they may also be oriented to help individuals and program managers recognize when disclosure is happening inadvertently. Such detection, together with reflection on the potential risks, may change decisions to reveal such information, or at least lead to more informed decision making.
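To make the first and third limitations concrete, the following minimal sketch shows a keyword-based pre-filter of the kind that could feed a manual review. It is an assumption-laden illustration, not the search procedure used in this study: the patterns, the example tweets, and the function `is_candidate` are all hypothetical. A tweet is flagged only when it mentions a program name and a membership term.

```python
import re

# Hypothetical patterns; a real search would need a much broader vocabulary.
PROGRAM_PATTERN = re.compile(
    r"uk biobank|personal genome project|100,?000 genomes project",
    re.IGNORECASE,
)
MEMBERSHIP_PATTERN = re.compile(r"\b(participant|volunteer)s?\b", re.IGNORECASE)


def is_candidate(tweet):
    """Flag a tweet for manual review if it names a program and a membership term."""
    return bool(PROGRAM_PATTERN.search(tweet) and MEMBERSHIP_PATTERN.search(tweet))


# Invented example tweets, not drawn from the collected data.
tweets = [
    "Proud to be a participant in UK Biobank!",
    "Just got my genome back from the Personal Genome Project",
    "Looking for a volunteer to help at the bake sale",
    "I'm a volunteer for the 100,000 Genomes Project",
]

candidates = [t for t in tweets if is_candidate(t)]
for tweet in candidates:
    print("review:", tweet)
```

The second example tweet plausibly discloses membership but contains neither "participant" nor "volunteer", so the filter misses it. This is the kind of false negative that a keyword-driven search can produce and that a learned classifier, trained on enough labeled disclosures, might reduce.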
Mitigating the risk of membership disclosure is not an easy problem to solve. In closing, we wish to offer several possible strategies that may warrant consideration. First, given this threat, research programs could inform participants about the risk of membership disclosure and make it clear that, if they self-disclose, their privacy may not be guaranteed. At the same time, research programs should inform participants of such threats when asking whether they can share information about them (e.g., through stories). Alternatively, the program could consider sharing stories without mentioning the volunteer’s real name or quasi-identifying information.
Acknowledgements
This work was sponsored in part by NIH grants RM1-HG009034, R01-HG006844, and U2COD023196.
Footnotes
https://github.com/yongtai123/Biomedical-Research-Cohort-Membership-Disclosure
References
- 1. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015 Feb;372(9):793–795. doi: 10.1056/NEJMp1500523.
- 2. Sudlow C, Gallacher J, Allen N, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine. 2015;12(3):e1001779. doi: 10.1371/journal.pmed.1001779.
- 3. Kannel WB, McGee DL. Diabetes and cardiovascular disease: the Framingham study. Jama. 1979;241(19):2035–2038. doi: 10.1001/jama.241.19.2035.
- 4. Ball MP, Bobe JR, Chou MF, et al. Harvard Personal Genome Project: lessons from participatory public research. Genome medicine. 2014;6(2):10. doi: 10.1186/gm527.
- 5. Cozier YC, Palmer JR, Rosenberg L. Comparison of methods for collection of DNA samples by mail in the Black Women’s Health Study. Annals of epidemiology. 2004;14(2):117–122. doi: 10.1016/S1047-2797(03)00132-7.
- 6. Kohane IS. Using electronic health records to drive discovery in disease genomics. Nature Reviews Genetics. 2011;12(6):417. doi: 10.1038/nrg2999.
- 7. Zheng-Bradley X, Flicek P. Applications of the 1000 Genomes Project resources. Briefings in functional genomics. 2016;16(3):163–170. doi: 10.1093/bfgp/elw027.
- 8. Church GM. The personal genome project. Molecular systems biology. 2005;1(1). doi: 10.1038/msb4100040.
- 9. Turnbull C, Scott RH, Thomas E, et al. The 100000 genomes project: Bringing whole genome sequencing to the NHS. BMJ: British Medical Journal (Online). 2018;361. doi: 10.1136/bmj.k1687.
- 10. Peakman TC, Elliott P. The UK Biobank sample handling and storage validation studies. International journal of epidemiology. 2008;37(suppl 1):i2–i6. doi: 10.1093/ije/dyn019.
- 11. Hull SC, Sharp RR, Botkin JR, et al. Patients’ views on identifiability of samples and informed consent for genetic research. The American Journal of Bioethics. 2008;8(10):62–70. doi: 10.1080/15265160802478404.
- 12. Malin B, Karp D, Scheuermann RH. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. Journal of Investigative Medicine. 2010;58(1):11–18. doi: 10.231/JIM.0b013e3181c9b2ea.
- 13. Liu Y, Wan Z, Xia W, et al. Detecting the Presence of an Individual in Phenotypic Summary Data. AMIA Annual Symposium Proceedings. 2018;2018:760–9.
- 14. El Emam K, Jonker E, Arbuckle L, Malin B. A systematic review of re-identification attacks on health data. PloS one. 2011;6(12):e28071. doi: 10.1371/journal.pone.0028071.
- 15. Sweeney L, Abu A, Winn J. Identifying Participants in the Personal Genome Project by Name. Data Privacy Lab, IQSS, Harvard University; 2013.
- 16. Wan Z, Vorobeychik Y, Xia W, Clayton EW, Kantarcioglu M, Malin B. Expanding access to large-scale genomic data while promoting privacy: a game theoretic approach. The American Journal of Human Genetics. 2017;100(2):316–22. doi: 10.1016/j.ajhg.2016.12.002.
- 17. Choudhury MD, De S. Mental health discourse on reddit: Self-disclosure, social support, and anonymity. Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014. 2014;01:p. 71–80.
- 18. Balani S, De Choudhury M. Detecting and characterizing mental health related self-disclosure in social media. In: Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems. 2015:p. 1373–1378.
- 19. Sweeney L. Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy Working Paper 3. Pittsburgh; 2000.
- 20. Wikipedia contributors. Category: Cohort studies. 2016. Available from: https://en.wikipedia.org/wiki/Category:Cohort_studies.
- 21. Mittos A, Blackburn J, De Cristofaro E. “23andMe confirms: I’m super white”–Analyzing Twitter Discourse On Genetic Testing. arXiv preprint arXiv:1801.09946. 2018.
- 22. Yin Z, Sulieman LM, Malin B. A systematic literature review of machine learning in online personal health data. JAMIA. 2019;26(6):561–576. doi: 10.1093/jamia/ocz009.
- 23. Taddicken M. The “privacy paradox” in the social web: The impact of privacy concerns, individual characteristics, and the perceived social relevance on different forms of self-disclosure. Journal of Computer-Mediated Communication. 2014;19(2):248–273.
- 24. Morgan EM, Snelson C, Elison-Bowers P. Image and video disclosure of substance use on social media websites. Computers in Human Behavior. 2010;26(6):1405–1411.
- 25. Hanson CL, Burton SH, Giraud-Carrier C, West JH, Barnes MD, Hansen B. Tweaking and tweeting: exploring Twitter for nonmedical use of a psychostimulant drug (Adderall) among college students. Journal of Medical Internet Research. 2013;15(4):e62. doi: 10.2196/jmir.2503.
- 26. Hanson CL, Cannon B, Burton S, Giraud-Carrier C. An exploration of social circles and prescription drug abuse through Twitter. Journal of medical Internet research. 2013;15(9):e189. doi: 10.2196/jmir.2741.
- 27. Mao H, Shuai X, Kapadia A. Loose tweets: an analysis of privacy leaks on twitter. In: Proceedings of the 10th annual ACM workshop on Privacy in the electronic society. 2011:p. 1–12.
- 28. Christofides E, Muise A, Desmarais S. Information disclosure and control on Facebook: Are they two sides of the same coin or two different processes? Cyberpsychology & behavior. 2009;12(3):341–345. doi: 10.1089/cpb.2008.0226.
- 29. Elliott J, Shepherd P. Cohort profile: 1970 British birth cohort (BCS70). International journal of epidemiology. 2006;35(4):836–843. doi: 10.1093/ije/dyl174.
- 30. Rich SS, Concannon P, Erlich H, et al. The type 1 diabetes genetics consortium. Annals of the New York Academy of Sciences. 2006;1079(1):1–8. doi: 10.1196/annals.1375.001.
- 31. Giovannucci E, Stampfer MJ, Colditz GA, et al. Multivitamin use, folate, and colon cancer in women in the Nurses’ Health Study. Annals of internal medicine. 1998;129(7):517–524. doi: 10.7326/0003-4819-129-7-199810010-00002.
- 32. Beral V. Million Women Study Collaborators. Breast cancer and hormone-replacement therapy in the Million Women Study. The Lancet. 2003;362(9382):419–427. doi: 10.1016/s0140-6736(03)14065-2.
- 33. Benjamin EJ, Wolf PA, D’Agostino RB, Silbershatz H, Kannel WB, Levy D. Impact of atrial fibrillation on the risk of death: the Framingham Heart Study. Circulation. 1998;98(10):946–952. doi: 10.1161/01.cir.98.10.946.
- 34. Al Kuwari H, Al Thani A, Al Marri A, et al. The Qatar Biobank: background and methods. BMC public health. 2015;15(1):1208. doi: 10.1186/s12889-015-2522-7.
- 35. International HapMap Consortium. The international HapMap project. Nature. 2003;426(6968):789–96. doi: 10.1038/nature02168.
- 36. Yin Z, Chen Y, Fabbri D, Sun J, Malin B. #PrayForDad: learning the semantics behind why social media users disclose health information. In: Proceedings of the Tenth International AAAI Conference on Web and Social Media. 2016.
- 37. Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. L-diversity: Privacy beyond K-anonymity. In: Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). IEEE. 2006:p. 24.
- 38. Visscher PM, Hill WG. The limits of individual identification from sample allele frequencies: theory and statistical analysis. PLoS genetics. 2009;5(10):e1000628. doi: 10.1371/journal.pgen.1000628.