Abstract
Background: The World Wide Web has emerged as a powerful data source for epidemiological studies related to infectious disease surveillance. However, its potential for cancer-related epidemiological discoveries is largely unexplored.
Methods: Using advanced web crawling and tailored information extraction procedures, the authors automatically collected and analyzed the text content of 79 394 online obituary articles published between 1998 and 2014. The collected data included 51 911 cancer (27 330 breast; 9470 lung; 6496 pancreatic; 6342 ovarian; 2273 colon) and 27 483 non-cancer cases. With the derived information, the authors replicated a case-control study design to investigate the association between parity (i.e., childbearing) and cancer risk. Age-adjusted odds ratios (ORs) with 95% confidence intervals (CIs) were calculated for each cancer type and compared to those reported in large-scale epidemiological studies.
Results: Parity was found to be associated with a significantly reduced risk of breast cancer (OR = 0.78, 95% CI, 0.75–0.82), pancreatic cancer (OR = 0.78, 95% CI, 0.72–0.83), colon cancer (OR = 0.67, 95% CI, 0.60–0.74), and ovarian cancer (OR = 0.58, 95% CI, 0.54–0.62). Marginal association was found for lung cancer risk (OR = 0.87, 95% CI, 0.81–0.92). The linear trend between increased parity and reduced cancer risk was dramatically more pronounced for breast and ovarian cancer than the other cancers included in the analysis.
Conclusion: This large web-mining study on parity and cancer risk produced findings very similar to those reported with traditional observational studies. It may be used as a promising strategy to generate study hypotheses for guiding and prioritizing future epidemiological studies.
Keywords: digital epidemiology, web mining, cancer risk, parity
INTRODUCTION
Over the past decade, the World Wide Web has radically transformed the landscape of medical research and healthcare delivery.1–3 Online news sources and digital social networks have empowered a wide range of epidemiological applications such as syndromic surveillance,4–6 monitoring of population health-related interests, sentiment, and behaviors (e.g.,7–15), and drug performance monitoring.16,17 Nevertheless, the potential and reliability of openly available online content for cancer-related epidemiological discoveries remain largely unknown. Thus far, studies have focused on sentiment analysis, online support, and topics of interest for cancer patients.12,18,19 The current study evaluated the feasibility of web mining as a big data-driven knowledge discovery mechanism for exploring the association between parity and cancer risk. Parity is defined as the number of births a woman has had.
To date, epidemiological studies have investigated parity as an independent risk factor for many types of women’s cancers.20 Particularly with respect to breast cancer, the protective association between childbearing and breast cancer has been well established with several prospective, retrospective, and meta-analysis studies conducted in the United States and abroad.21–27 For ovarian cancer, several international studies found that the risk across all histologies is inversely related to parity.28–37
The association between parity and cancer risk has also been studied for other cancers that are common to both sexes. For pancreatic cancer there are a few studies with conflicting findings.38 A comprehensive meta-analysis study reported borderline statistically significantly lower pancreatic cancer risk for parous women.39 A stronger association was observed when only the case-control studies were included in the meta-analysis. For lung cancer, the reported findings are highly heterogeneous and inconsistent. Meta-analysis of 16 case-control and cohort studies suggested no effect of parity on lung cancer risk.40 Similar inconsistencies were observed for colorectal cancer as well. Some studies reported a protective effect with parity41–45 while other studies found no association.46–48 The largest meta-analysis of prospective studies found no association between colorectal cancer risk and parity.49
In this study we adopted a web mining approach to perform in-silico large-scale, case-control experiments for studying the relationship between parity and cancer risk. We used online obituaries of women as the data source. By comparing our findings with the findings reported in traditional epidemiological studies, we examined the feasibility, reliability, and limitations of web mining for epidemiological discoveries in cancer.
SUBJECTS AND METHODS
Study Population
We performed this study using online obituary announcements. Such content is widely and openly available on the Internet—for example, on the websites of US newspapers, funeral homes, and social media sites. Obituaries typically include content regarding the deceased person’s family members and cause of death, both essential for this investigation. Furthermore, the language of obituary articles is often standardized, which facilitates automated computer parsing and text analysis. Our study focused on online obituaries of females for whom cancer is the stated or inferred cause of death (e.g., “She passed away after a courageous battle with breast cancer …”) and obituaries of females for whom there is no mention of cancer. These two types of obituaries constituted the case and control groups, respectively. No other inclusion or exclusion criteria were applied. Institutional review board approval was obtained prior to the study. Expedited review deemed the study exempt.
Data Collection and Information Retrieval
The data collection process was fully automated and included (a) a keyword-based Internet search for collecting obituaries and (b) text parsing for identifying those obituaries which contained the necessary information elements (i.e., gender, age at death, offspring, cause of death).
First, an intelligent self-adaptive web crawler developed in our laboratory50 was deployed to thoroughly search the broad Internet and collect obituary announcements useful for this study. To initialize the crawling process, a seed query was executed that searched for obituaries of a given cancer type and US state (for example, “breast cancer obituaries, California”). In total, 250 such queries (5 cancer types × 50 states) were executed using a third-party commercial search engine. The URL search results of the queries served as the initial crawling seeds. Then, the web crawler followed a two-step ranking process to assess the relevance of the crawled webpages and of the URLs embedded in these webpages using an autonomous utility score estimator. The utility score was derived using a supervised machine learning method trained with manually labeled positive and negative training examples. The positive examples consisted of 600 cancer-related obituaries collected by manually searching for obituaries published online, while the negative examples consisted of 1000 general webpages unrelated to obituaries. The crawler used a feedback mechanism to guide and prioritize the crawling process based on the utility scores. Details of the feedback mechanism are provided in Xu, Yoon, and Tourassi.50 The crawling results underwent an additional verification step to determine whether they were full-length obituaries. The verification step first employed a supervised classification algorithm to remove irrelevant content such as obituary index pages or obituary snippets, followed by the application of an obituary verification classifier. This classifier was trained with 100 manually selected positive examples (i.e., obituaries) and 400 manually selected negative examples (i.e., obituary lists and snippets). The crawling process was performed on a dedicated PC with 16 GB of RAM and a 4 TB hard drive connected through full-duplex gigabit Ethernet.
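The seed-query step can be illustrated with a minimal sketch. It only enumerates the 5 × 50 query strings in the format quoted above; the function name `build_seed_queries` is hypothetical, and the submission of the queries to a search engine and the crawler itself (described in reference 50) are not reproduced here.

```python
from itertools import product

# The five cancer types considered in the study.
CANCER_TYPES = ["breast", "ovarian", "lung", "colon", "pancreatic"]

US_STATES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
]


def build_seed_queries(cancer_types=CANCER_TYPES, states=US_STATES):
    """Return one query string per (cancer type, state) pair.

    The string format follows the example quoted in the text; the
    function name and signature are illustrative assumptions.
    """
    return [
        f"{cancer} cancer obituaries, {state}"
        for cancer, state in product(cancer_types, states)
    ]


queries = build_seed_queries()
print(len(queries))   # number of seed queries
print(queries[4])     # one example query
```

Enumerating the full cross product keeps the seeding step reproducible: the same 250 strings are generated on every run, independent of the search engine used afterwards.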
Ten weeks of crawling efforts produced a total of 172 765 cancer-related obituaries (breast: 57 293; ovarian: 17 770; lung: 50 429; pancreatic: 36 280; colon: 10 993) and 1 039 618 non-cancer–related obituaries.
The body text of the collected obituaries was subsequently analyzed using tailored Natural Language Processing (NLP) procedures to automatically extract the gender, age at death, and offspring information about the deceased subjects. This step was implemented using the Stanford NLP Library.51 Several customized rules were implemented:
Gender: To infer the gender of the deceased person, gender-specific pronouns mentioned in the obituary were counted. Then, a straightforward inference rule was applied: the deceased person was assigned the gender indicated by the larger number of gender-related pronouns. If no dominant gender could be identified, the obituary was excluded from further analysis.
Age: In many obituaries, the age at death is explicitly stated in the first sentence along with the subject’s name, for example, “Jane Doe, 60, passed away …,” “Jane Doe, age 55, passed peacefully … .” If no such statement was available, age was calculated by automatically detecting the dates of birth and death, for example, “Jane Doe passed away Sunday, January 1, 2012 … She was born July 1, 1940 … .” If there was no explicit statement of the date of death, the publication year of the obituary was taken as the year of death. The age at death was calculated by simple subtraction of the birth year from the death year. Obituary articles for which the subject’s age could not be detected or inferred were excluded from further analysis.
Parity: History of childbirth was inferred by the listing of surviving offspring mentioned in the obituary—for example, “She is survived by her son(s) … and daughter(s) … ,” “Preceded in death by her son(s) … and daughter(s) … .” In other cases, number of offspring was inferred by searching for expressions such as “She was a mother of … .” Identified offspring were counted as biological children unless clearly stated as stepchildren or adopted, in which case they were not included. If the obituary did not include such statements, the subject was considered nulliparous.
Cause of Death: To infer whether cancer was the cause of death, a sequence of logical exclusion and inclusion rules was executed on those obituaries with the keyword “<type> cancer.”
Rule 1: Since many obituaries encourage donations to cancer organizations, all obituary sentences with phrases such as “may be made to … ,” “In lieu of flowers, … ,” “Donations of sympathy …” were first removed, both to avoid false counts from such mentions and to simplify the inference process.
Rule 2: The remaining obituary text body was searched for explicit statements that a specific cancer type was the cause of death—for example, “She passed away after a lengthy battle with breast cancer … .” If no such statement was found, text parsing continued by applying the additional heuristic inference rule shown below.
Rule 3: Death was attributed to a “<type> cancer” if the obituary (i) contained the phrase “<type> cancer” and (ii) did not contain phrases implying that the deceased person was a cancer survivor (e.g., “<type> cancer survivor,” “survivor of <type> cancer,” “surviving <type> cancer”).
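The gender, age, and cause-of-death rules above can be sketched with plain regular expressions. This is a simplified illustration, not the Stanford NLP-based implementation used in the study: all pattern strings, cue phrases, and function names below are assumptions chosen to match the examples quoted in the text, and the parity rule is omitted.

```python
import re

FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}

# Hypothetical patterns modelled on the example phrases in the text.
STATED_AGE = re.compile(r"^[^.]*?,\s*(?:age\s+)?(\d{1,3}),", re.IGNORECASE)
BIRTH_YEAR = re.compile(r"\bborn\b[^.]*?\b(\d{4})\b", re.IGNORECASE)
DEATH_YEAR = re.compile(r"\b(?:passed away|died)\b[^.]*?\b(\d{4})\b", re.IGNORECASE)
DONATION_CUES = ("may be made to", "in lieu of flowers", "donations of sympathy")


def infer_gender(text):
    """Return the gender with more pronoun mentions, or None on a tie."""
    words = re.findall(r"[a-z]+", text.lower())
    female = sum(w in FEMALE_PRONOUNS for w in words)
    male = sum(w in MALE_PRONOUNS for w in words)
    if female == male:
        return None
    return "female" if female > male else "male"


def infer_age(text, publication_year=None):
    """Use a stated age if present, else death year minus birth year."""
    stated = STATED_AGE.search(text)
    if stated:
        return int(stated.group(1))
    born = BIRTH_YEAR.search(text)
    died = DEATH_YEAR.search(text)
    # Fall back to the publication year when no death date is given.
    death_year = int(died.group(1)) if died else publication_year
    if born and death_year:
        return death_year - int(born.group(1))
    return None


def died_of_cancer(text, cancer_type):
    """Apply Rules 1-3 for one cancer type, e.g. cancer_type='breast'."""
    term = f"{cancer_type} cancer"
    sentences = re.split(r"(?<=[.!?])\s+", text.lower())
    # Rule 1: drop donation sentences before any matching.
    kept = [s for s in sentences if not any(c in s for c in DONATION_CUES)]
    body = " ".join(kept)
    # Rule 2: explicit statement that the cancer was the cause of death.
    explicit = re.compile(
        r"(?:battle with|died of|died from)\s+(?:\w+\s+){0,3}?" + re.escape(term)
    )
    if explicit.search(body):
        return True
    # Rule 3: cancer mentioned and no survivor phrasing.
    survivor_cues = (f"{term} survivor", f"survivor of {term}", f"surviving {term}")
    return term in body and not any(c in body for c in survivor_cues)
```

Applying Rule 1 before Rules 2 and 3 matters: a donation request naming a cancer charity would otherwise satisfy Rule 3 on its own.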
Cancer-related obituaries meeting none of the above conditions were excluded from further consideration. Specifically, there were 10 401 cancer-related obituaries eliminated because of failure to infer gender, 8108 cancer-related obituaries eliminated due to failure to infer age at death, and 47 562 cancer-related obituaries eliminated due to failure to confirm cancer as the cause of death. In addition, obituaries with lung, colon, or pancreatic cancer as the inferred cause of death for which the deceased person was inferred to be male were also eliminated. There were 27 300 such obituaries (13 841 lung; 10 212 pancreatic; and 3247 colon). Finally, the resulting set of eligible obituaries was automatically analyzed once more to eliminate duplicate articles (e.g., obituaries of the same individual appearing in multiple different online sites). Matching was done based on the subject’s name and age. There were 1664 duplicate obituaries.
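Duplicate removal by name and age can be sketched as follows. The record layout (dictionaries with `name` and `age` keys) and the name normalization (lowercasing and collapsing whitespace) are assumptions made for illustration; the text states only that matching was based on the subject's name and age.

```python
def deduplicate(records):
    """Keep the first obituary seen for each (name, age) pair.

    `records` is assumed to be a list of dicts with 'name' and 'age' keys.
    """
    seen = set()
    unique = []
    for record in records:
        # Normalize case and spacing so trivial variants still match.
        key = (" ".join(record["name"].lower().split()), record["age"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


obituaries = [
    {"name": "Jane Doe", "age": 60, "source": "newspaper"},
    {"name": "jane  doe", "age": 60, "source": "funeral home"},
    {"name": "Jane Doe", "age": 71, "source": "newspaper"},
]
print(len(deduplicate(obituaries)))  # 2
```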
The whole data collection process (web crawling and text parsing) for the cancer cases was terminated when the harvest rate for each cancer type declined significantly. Harvest rate is a commonly used performance metric for web crawlers, which measures a web crawler’s capability of obtaining new relevant records. In total, there were 51 911 eligible cancer subjects collected: 27 330 breast; 9470 lung; 6496 pancreatic; 6342 ovarian; and 2273 colon cancer cases. The control group consisted of full-length obituaries for which the gender, age, and number of offspring could be inferred and the obituary body text did not include the word “cancer.” For the control group, we selected randomly among the 1 039 618 non-cancer related full obituaries and proceeded with the tailored NLP phase. The data collection process for the control group was interrupted after the sample size reached roughly the size of the breast cancer group, which was the largest among the cancer types included in this study. Figure 1 provides a schematic illustration of the obituary collection workflow.
Figure 1:
Schematic illustration of the obituary collection workflow.
The collected obituaries were published between 1998 and 2014. Due to the considerable computational demands of the text parsing stage, the NLP platform used the Titan supercomputer of the Oak Ridge Leadership Computing Facility. Approximately 40 000 core hours were used for this procedure.
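The harvest-rate stopping criterion used to end data collection can be illustrated with a short sketch. Harvest rate is taken here in its usual sense, the fraction of crawled pages that yield a relevant record. The window size and threshold are hypothetical, since the text does not state how a significant decline was determined.

```python
def harvest_rate(relevant_flags):
    """Fraction of crawled pages that produced a new relevant record."""
    if not relevant_flags:
        return 0.0
    return sum(relevant_flags) / len(relevant_flags)


def should_stop(relevant_flags, window=1000, min_rate=0.05):
    """Stop once the rate over the last `window` pages drops below `min_rate`.

    Both `window` and `min_rate` are illustrative values, not the
    study's settings.
    """
    recent = relevant_flags[-window:]
    return len(recent) == window and harvest_rate(recent) < min_rate


# One boolean per crawled page: True if it yielded a new eligible obituary.
history = [True, False, False, True]
print(harvest_rate(history))  # 0.5
```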
Statistical Analysis
Age was compared between cases and controls using a t test. Using a case-control study design, age-adjusted odds ratios (ORs) and 95% confidence intervals (CIs) were calculated using the Mantel-Haenszel procedure based upon the parity odds in the case and in the control groups. Both groups were stratified by age in 5-year intervals. The 95% CIs were estimated with bootstrapping. The primary analysis compared parous women (≥1 child) with nulliparous women (no childbearing history). ORs and CIs were also derived for each parity level (1, 2, … , 6, ≥7 children). The linear trend between increasing parity and cancer risk was also tested with the Mantel-Haenszel procedure. Statistical analysis was performed using R software, version 3.1.0.52
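For readers unfamiliar with the procedure, the age-adjusted estimate can be sketched in a few lines. The Mantel-Haenszel pooled OR over age strata is Σ(a·d/n) / Σ(b·c/n), where a and b are parous and nulliparous cases, c and d are parous and nulliparous controls, and n is the stratum total. The code below is an illustrative Python sketch, not the R code used in the study. The stratum boundaries (age // 5), the percentile bootstrap, and all function names are assumptions.

```python
import random


def mantel_haenszel_or(strata):
    """Pooled OR from (a, b, c, d) tables, one per age stratum.

    a: parous cases, b: nulliparous cases,
    c: parous controls, d: nulliparous controls.
    """
    numerator = denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


def stratify(subjects, width=5):
    """Group (age, is_case, is_parous) tuples into 2x2 tables by age band."""
    tables = {}
    for age, is_case, is_parous in subjects:
        cell = tables.setdefault(age // width, [0, 0, 0, 0])
        cell[(0 if is_case else 2) + (0 if is_parous else 1)] += 1
    return [tuple(t) for t in tables.values()]


def bootstrap_ci(subjects, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the pooled OR (resampling subjects)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(subjects, k=len(subjects))
        try:
            estimates.append(mantel_haenszel_or(stratify(sample)))
        except ZeroDivisionError:
            continue  # resample with no informative stratum
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper
```

With a single stratum the estimator reduces to the ordinary cross-product ratio a·d / (b·c), which is a convenient sanity check.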
RESULTS
Figure 2 shows the distribution of the 51 911 cancer cases collected per year of online publication of the obituaries. The increasing number of collected cases from recent years reflects the increasing online presence of local US newspapers, funeral homes, and social groups that publish death announcements. All 50 states were represented in the dataset. Most cancer cases were collected from California, Texas, Florida, Ohio, Pennsylvania, New York, Massachusetts, and Illinois. All of these except Massachusetts, which ranks 14th, are among the seven most populous US states according to the 2013 US Census data.
Figure 2:
Number of collected obituaries included in the study by cancer type and year.
Table 1 shows the number and age distribution of the cancer cases and controls collected for the study. The distribution and average age of subjects according to the number of offspring are shown in Table 2. The average age at death for the cancer cases was significantly lower than that for controls for all cancer types (all two-tailed P-values < 2.20e-16). For the cancer group, average age at death increased with increasing parity. No such trend was observed for the control group.
Table 1:
Number and age distribution of cancer cases and controls
Age in years | Breast cancer cases (N = 27 330) (%) | Ovarian cancer cases (N = 6342) (%) | Lung cancer cases (N = 9470) (%) | Colon cancer cases (N = 2273) (%) | Pancreatic cancer cases (N = 6496) (%) | Controls (N = 27 483) (%) |
---|---|---|---|---|---|---|
20–29 | 2.04 | 1.37 | 1.07 | 2.02 | 0.95 | 1.28 |
30–39 | 5.77 | 2.89 | 1.64 | 5.28 | 1.34 | 1.53 |
40–49 | 14.56 | 9.04 | 5.61 | 11.97 | 6.22 | 3.45 |
50–59 | 23.37 | 24.05 | 15.49 | 19.31 | 17.50 | 6.73 |
60–69 | 21.58 | 29.63 | 25.09 | 20.37 | 26.12 | 11.29 |
70–79 | 15.46 | 21.02 | 30.00 | 18.79 | 26.74 | 18.32 |
>80 | 17.22 | 12.02 | 21.11 | 22.26 | 21.12 | 57.39 |
Total | 100 | 100 | 100 | 100 | 100 | 100 |
Table 2:
Sample size distribution and average age at death of cases and controls by parity level
Group | Measure | Parity 0 | Parity 1 | Parity 2 | Parity 3 | Parity 4 | Parity 5 | Parity 6 | Parity 7+ |
---|---|---|---|---|---|---|---|---|---|
Breast cancer cases | N (%) | 16.4 | 25.2 | 29.0 | 12.2 | 6.7 | 3.8 | 2.7 | 3.7 |
Breast cancer cases | Age (±σ) | 57.0 (16.5) | 61.5 (16.5) | 62.9 (15.8) | 63.2 (16.3) | 65.6 (15.3) | 66.8 (14.7) | 67.2 (14.9) | 68.6 (14.5) |
Ovarian cancer cases | N (%) | 20.6 | 21.6 | 26.9 | 12.2 | 6.8 | 4.5 | 3.0 | 4.3 |
Ovarian cancer cases | Age (±σ) | 59.0 (13.5) | 62.4 (14.1) | 64.9 (12.4) | 64.6 (12.8) | 67.0 (12.6) | 67.2 (13.0) | 69.4 (12.5) | 68.2 (13.2) |
Lung cancer cases | N (%) | 15.3 | 24.3 | 27.4 | 12.2 | 8.0 | 4.7 | 3.3 | 4.8 |
Lung cancer cases | Age (±σ) | 65.5 (14.0) | 68.2 (13.7) | 68.4 (13.5) | 69.2 (12.9) | 69.4 (13.5) | 71.2 (11.4) | 71.9 (12.5) | 72.3 (11.2) |
Colon cancer cases | N (%) | 18.1 | 25.0 | 26.4 | 10.1 | 7.0 | 4.2 | 4.0 | 5.1 |
Colon cancer cases | Age (±σ) | 59.0 (17.4) | 64.6 (17.8) | 65.8 (16.4) | 65.9 (14.8) | 67.6 (15.7) | 67.8 (16.0) | 69.8 (14.1) | 69.0 (13.6) |
Pancreatic cancer cases | N (%) | 16.7 | 23.2 | 27.7 | 11.8 | 7.8 | 4.7 | 3.4 | 4.7 |
Pancreatic cancer cases | Age (±σ) | 64.6 (14.3) | 67.6 (13.8) | 67.9 (13.3) | 68.1 (13.5) | 70.4 (13.4) | 71.7 (12.5) | 70.5 (11.6) | 71.7 (10.9) |
Controls | N (%) | 13.6 | 24.4 | 28.4 | 12.5 | 8.0 | 4.7 | 3.3 | 5.0 |
Controls | Age (±σ) | 75.6 (18.1) | 78.4 (16.2) | 78.4 (15.2) | 77.5 (15.1) | 78.0 (14.0) | 78.8 (13.7) | 79.2 (13.3) | 79.2 (13.5) |
Table 3 shows the risk of cancer among parous and nulliparous women for each cancer type. While the cancer risk for parous women was significantly lower than for nulliparous women for all cancer types, the protective effect of parity was much less pronounced for lung cancer than for all other cancer types considered in this study.
Table 3:
Age-adjusted odds ratios (ORs)* with 95% confidence intervals (CIs) according to number of offspring for each cancer type
Parity (number of offspring) | Breast Cancer | Ovarian Cancer | Lung Cancer | Colon Cancer | Pancreatic Cancer |
---|---|---|---|---|---|
Nulliparous (reference) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
1 | 0.87 (0.82–0.92) | 0.60 (0.55–0.65) | 0.91 (0.84–0.98) | 0.74 (0.65–0.84) | 0.80 (0.74–0.87) |
2 | 0.84 (0.80–0.88) | 0.60 (0.56–0.65) | 0.87 (0.81–0.94) | 0.67 (0.59–0.76) | 0.81 (0.74–0.87) |
3 | 0.76 (0.71–0.80) | 0.59 (0.54–0.65) | 0.81 (0.74–0.89) | 0.57 (0.48–0.67) | 0.74 (0.67–0.81) |
4 | 0.66 (0.61–0.71) | 0.50 (0.44–0.56) | 0.86 (0.77–0.95) | 0.60 (0.49–0.72) | 0.72 (0.64–0.81) |
5 | 0.61 (0.56–0.67) | 0.57 (0.49–0.66) | 0.83 (0.73–0.95) | 0.61 (0.48–0.78) | 0.72 (0.63–0.84) |
6 | 0.64 (0.57–0.72) | 0.52 (0.44–0.63) | 0.84 (0.73–0.98) | 0.79 (0.61–1.02) | 0.81 (0.69–0.96) |
7+ | 0.55 (0.50–0.60) | 0.52 (0.45–0.60) | 0.79 (0.70–0.90) | 0.68 (0.55–0.86) | 0.70 (0.61–0.82) |
All parous | 0.78 (0.75–0.82) | 0.58 (0.54–0.62) | 0.87 (0.81–0.92) | 0.67 (0.60–0.74) | 0.78 (0.72–0.83) |
The inverse association between increasing parity and cancer risk was statistically significant in the linear trend tests for all five cancer types. The trend was dramatically more pronounced for breast (χ2 = 301.60, P < 2.20e-16) and ovarian cancer (χ2 = 121.45, P < 2.20e-16), less pronounced for pancreatic cancer (χ2 = 38.95, P = 4.35e-10), and least pronounced for colon (χ2 = 24.69, P = 6.75e-07) and lung cancer (χ2 = 21.33, P = 3.87e-06).
DISCUSSION
Our study investigated the feasibility of intelligently mining a non-traditional online data source for epidemiological discoveries related to cancer. Using web mining tools, we were able to automatically collect a large number of eligible subjects and execute in silico a case-control study on the association between parity and cancer risk. The investigation focused on the five cancer types mentioned most often in the obituaries. Interestingly, these are also the cancer types associated with the highest death numbers in women in the United States.53
Our findings demonstrate that there is a significant independent association between parity and the five cancer types considered. The association is dramatically more pronounced for breast and ovarian cancer and least pronounced for lung cancer. The protective effect observed for breast, ovarian, and pancreatic cancer is very similar to that reported previously with traditional case-control observational studies. Specifically, it is reported that there is 18–27% lower breast cancer risk for parous women.21–26 Our study showed 22% lower risk. For ovarian cancer, our study found 42% lower risk, well within the 32–60% range reported in epidemiological studies conducted in the United States.35–37 Agreement between our findings and epidemiological literature was observed for pancreatic cancer as well. We found 22% lower pancreatic cancer risk for parous women, compared to 28% reported in the latest meta-analysis of 11 case-control studies conducted in the United States.39 For colon cancer, the 33% protective effect observed in our study is only in partial agreement with the literature since there are conflicting reports. For example, a recent US study45 reported 20% reduced risk for postmenopausal women with no history of hormone therapy use and more than 5 pregnancies, which is similar to the age-adjusted ORs our study found for 5, 6, and 7+ offspring. It should be noted that our study focused on colon cancer (since the term “colorectal cancer” was not mentioned often in the obituaries), which may have also contributed to the discrepancy. It is interesting that the only recent study we found separating colon and rectal cancers reported ∼ 40–60% elevated risk of colon cancer for nulliparous women but no increased risk of rectal cancer.46 Finally, the 13% lower lung cancer risk for parous women was the smallest reduction observed in our study.
The general trend is in full agreement with established knowledge that out of the five types of cancer included in this study, lung cancer is the only one for which parity appears to have marginal if any protective effect. The impact of increased parity observed across all five cancer types was also consistent with established epidemiological knowledge. Namely, every additional pregnancy appears to reduce further breast and ovarian cancer risk while the same beneficial effect is not well established for the other three cancer types.
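The percentages quoted in this discussion follow directly from the “All parous” ORs in Table 3, read as (1 − OR) × 100. The short sketch below reproduces that arithmetic; the dictionary and function names are illustrative only.

```python
# "All parous" odds ratios as reported in Table 3.
ALL_PAROUS_OR = {
    "breast": 0.78,
    "ovarian": 0.58,
    "lung": 0.87,
    "colon": 0.67,
    "pancreatic": 0.78,
}


def percent_lower(odds_ratio):
    """Express an OR below 1 as a percentage reduction, rounded to an integer."""
    return round((1 - odds_ratio) * 100)


for cancer, odds_ratio in ALL_PAROUS_OR.items():
    print(f"{cancer}: {percent_lower(odds_ratio)}% lower")
```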
One potential criticism for the study is that using obituaries as a data source may introduce a selection bias towards parous women. Since the deceased person’s family members typically publish obituary announcements, there is potentially a high risk of overestimating parity from obituaries. In our study the prevalence of parity among women based on the obituaries collected for the control group was 86.4% (Table 2). The National Survey of Family Growth published by Centers for Disease Control and Prevention (CDC)/National Center for Health Statistics (NCHS) in 2012 reported that the probability of a US woman having had a birth by the age of 40 is 85%. Therefore, we believe that using obituaries as a data source did not introduce any selection bias with respect to parity.
There are several limitations with our study. First, the amount of detail provided in obituaries is fairly low for more in-depth analysis. Obituary content cannot be used to derive information about additional factors such as ethnicity, socio-economic status, age-at-first pregnancy, breastfeeding, and lifestyle choices (e.g., smoking, exercise), which are all known to play a role in cancer risk and therefore should be used to adjust the derived ORs accordingly. This is a serious limitation for carrying out a more in-depth analysis from obituary data. Second, the risk of selection bias is always increased with online sources. For example, we observed that cancer is mentioned more openly as the cause of death in obituaries from Utah than New York suggesting geographical differences in terms of people’s openness in sharing such details. In addition, cancer was mentioned more often as the cause of death in the obituaries of younger than older people, particularly for breast cancer (as shown in Table 2). The latter is a potential source for information bias. However, this information bias is likely similar among nulliparous and parous women, resulting in non-differential misclassification of the cancer status and bias towards the null. Depending on the question of interest, applying creative correction factors could be useful to mitigate the risk of selection and information bias. Last but not least, the risk of inference error with big but superficial data is typically very high. Online sources may be a rich resource to collect in an automated, cost-effective way large numbers of candidate subjects. Still, the quality of the collected information is variable and often very low due to the inherent limitations of NLP. Therefore, the process and criteria applied for declaring impact should be far more rigorous since minute differences and wrong findings become easily statistically significant with large sample sizes. 
This is a well-known challenge for the big health data community as well.55,56 Based on our findings, comparing and contrasting the trends across the various cancer types was informative and less prone to inferential error in terms of the relative impact of parity on cancer risk. For our study, the “law of large numbers” helped in the sense that the quality flaws of individual subject profiles did not compromise the reliability of the relative trends emerging from the data.
In conclusion, the large online content collected and the text parsing tools developed to harvest the collected content led us to general findings similar to those produced with traditional epidemiological studies. The study demonstrated that online information sources when leveraged carefully in new and creative ways enable cost-effective and reliable exploratory epidemiological investigations. The main strength of the web mining approach we presented lies in its ability to automatically monitor trends in a dynamic way by continuously parsing and analyzing emerging open online content. However, given the discussed limitations, the proposed approach should be viewed as a promising strategy to generate interesting study hypotheses for guiding and prioritizing future epidemiological studies.
ACKNOWLEDGEMENTS
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the US Department of Energy (DOE). The US Government retains and the publisher, by accepting the article for publication, acknowledges that the US Government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
CONTRIBUTORS
G.T. contributed to the conception, design, analysis and interpretation of data, and drafted the manuscript. She is guarantor. H.J.Y. performed the experiments. S.X. contributed to the conception, study design, and algorithmic development. X.H. provided statistical support. All authors contributed to the manuscript writing and approved the final manuscript.
FUNDING
This work was supported by the National Cancer Institute at the National Institutes of Health (Grant No. 1R01-CA170508).
COMPETING INTERESTS
None.
REFERENCES
- 1.Lefebvre RC, Bornkessel AS. Digital social networks and health. Circulation. 2013;127(17):1829–1836. [DOI] [PubMed] [Google Scholar]
- 2.Chretien KC, Kind T. Social media and clinical care ethical, professional, and social implications. Circulation. 2013;127(13):1413–1421. [DOI] [PubMed] [Google Scholar]
- 3.Eysenbach G. Infodemiology and infoveillance tracking online health information and cyberbehavior for public health. Am J Prev Med. 2011;40(5)(Suppl 2):S154–S158. [DOI] [PubMed] [Google Scholar]
- 4.Bernardo TM, Rajic A, Young I, Robiadek K, Pham MT, Funk JA. Scoping review on search queries and social media for disease surveillance: a chronology of innovation. J Med Internet Res. 2013;15(7):e147. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Madoff LC, Woodall JP. The internet and the global monitoring of emerging diseases: lessons from the first 10 years of ProMED-mail. Arch Med Res. 2005;36:724–730. [DOI] [PubMed] [Google Scholar]
- 6.Brownstein JS, Freifeld CC, Chan EH, et al. Information technology and global surveillance of cases of 2009 H1N1 influenza. N Engl J Med. 2010;362:1731–1735. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Eysenbach G. Medicine 2.0: social networking, collaboration, participation, apomediation, and openness. J Med Internet Res. 2008;10(3):e22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Webb TL, Joseph J, Yardley L, Michie S. Using the internet to promote health behavior change: a systematic review and meta-analysis of the impact of theoretical basis, use of behavior change techniques, and mode of delivery on efficacy. J Med Internet Res. 2010;12(1):e4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Wong PW, Fu KW, Yau RS, et al. Accessing suicide-related information on the internet: a retrospective observational study of search behavior. J Med Internet Res. 2013;15(1):e3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Laranjo L, Arguel A, Neves AL, et al. The influence of social networking sites on health behavior change: a systematic review and meta-analysis. JAMIA. 2014;pii:amiajnl-2014-002841. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Brigo F, Lochner P, Tezzon F, Nardone R. Web search behavior for multiple sclerosis: An infodemiological study. Multiple Sclerosis Related Disord. 2014;3(4): 440–443. [DOI] [PubMed] [Google Scholar]
- 12.Lu Y, Zhang P, Liu J, Li J, Deng S. Health-related hot topic detection in online communities using text clustering. PloS One. 2013;8(2):e56221. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Cugelman B, Thelwall M, Dawes P. Online interventions for social marketing health behavior change campaigns: a meta-analysis of psychological architectures and adherence factors. J Med Internet Res. 2011;13(1):e17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Ayers JW, Althouse BM, Allem JP, Ford DE, Ribisl KM, Cohen JE. A novel evaluation of World No Tobacco day in Latin America. J Med Internet Res. 2012;14(3):e77. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Ayers JW, Ribisl K, Brownstein JS. Using search query surveillance to monitor tax avoidance and smoking cessation following the United States' 2009 “SCHIP” cigarette tax increase. PLoS One. 2011;6(3):e16777. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Wicks P, Vaughan TE, Massagli MP, Heywood J. Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm. Nat Biotechnol. 2011;29:411–414. [DOI] [PubMed] [Google Scholar]
- 17.Frost J, Okun S, Vaughan T, Heywood J, Wick P. Patient-reported outcomes as a source of evidence in off-label prescribing: analysis of data from PatientsLikeMe. J Mel Internet Res. 2011;13(1):e6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Portier K, Greer GE, Rokach L, Ofek N, et al. Understanding topics and sentiment in an online cancer survivor community. JNCI Monographs. 2013;47:195–198. [DOI] [PubMed] [Google Scholar]
- 19.Kim E, Han JY, Moon TJ, et al. The process and effect of supportive message expression and reception in online breast cancer support groups. Psychooncology. 2012;21(5):531–540. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.DeVita VT Jr, Hellman S, Rosenberg SA. Cancer: Principles and Practice of Oncology. 10th ed. New York, NY: Lippincott Williams & Wilkins; 2013.
- 21.Pathak DR, Speizer FE, Willett WC, Rosner B, Lipnick RJ. Parity and breast cancer risk: possible effect on age at diagnosis. Int J Cancer. 1986;37:21–25.
- 22.Layde PM, Webster LA, Baughman AL, Wingo PA, Rubin GL, Ory HW; the Cancer and Steroid Hormone Study Group. The independent associations of parity, age at first full term pregnancy, and duration of breastfeeding with the risk of breast cancer. J Clin Epidemiol. 1989;42(10):963–973.
- 23.Kelsey JL, Gammon MD, John EM. Reproductive factors and breast cancer. Epidemiol Rev. 1993;15:36–47.
- 24.Beral V, Reeves G. Childbearing, oral contraceptive use, and breast cancer. Lancet. 1993;341:1102.
- 25.Lambe M, Hsieh C, Chan H, Ekbom A, Trichopoulos D, Adami HO. Parity, age at first and last birth, and risk of breast cancer: a population-based study in Sweden. Breast Cancer Res Treat. 1996;38:305–311.
- 26.Möller T, Olsson H, Ranstam J; Collaborative Group on Hormonal Factors in Breast Cancer. Breast cancer and breastfeeding: collaborative reanalysis of individual data from 47 epidemiological studies in 30 countries, including 50 302 women with breast cancer and 96 973 women without the disease. Lancet. 2002;360(9328):187–195.
- 27.Woolcott CG, Koga K, Conroy SM, et al. Mammographic density, parity and age at first birth, and risk of breast cancer: an analysis of four case-control studies. Breast Cancer Res Treat. 2012;132(3):1163–1171.
- 28.Braem MG, Onland-Moret NC, van den Brandt PA, et al. Reproductive and hormonal factors in association with ovarian cancer in the Netherlands cohort study. Am J Epidemiol. 2010;172:1181–1189.
- 29.Moorman PG, Calingaert B, Palmieri RT, et al. Hormonal risk factors for ovarian cancer in premenopausal and postmenopausal women. Am J Epidemiol. 2008;167:1059–1069.
- 30.Tung KH, Goodman MT, Wu AH, et al. Reproductive factors and epithelial ovarian cancer risk by histologic type: a multiethnic case-control study. Am J Epidemiol. 2003;158:629–638.
- 31.Tsilidis KK, Allen NE, Key TJ, et al. Oral contraceptive use and reproductive factors and risk of ovarian cancer in the European Prospective Investigation into Cancer and Nutrition. Br J Cancer. 2011;105(9):1436–1442.
- 32.Whittemore A, Harris R, Itnyre J; Collaborative Ovarian Cancer Group. Characteristics relating to ovarian cancer risk: collaborative analysis of 12 US case-control studies. Am J Epidemiol. 1992;136(4):1184–1203.
- 33.Le DC, Kubo T, Fujino Y, et al. Reproductive factors in relation to ovarian cancer: a case–control study in Northern Vietnam. Contraception. 2012;86(5):494–499.
- 34.Pasalich M, Su D, Binns CW, Lee AH. Reproductive factors for ovarian cancer in southern Chinese women. J Gynecol Oncol. 2013;24(2):135–140.
- 35.Titus-Ernstoff L, Perez K, Cramer DW, et al. Menstrual and reproductive factors in relation to ovarian cancer risk. Br J Cancer. 2001;84(5):714–721.
- 36.Vachon CM, Mink PJ, Janney CA, et al. Association of parity and ovarian cancer risk by family history of breast or ovarian cancer in a population-based study of postmenopausal women. Epidemiology. 2002;13(1):66–71.
- 37.Moorman PG, Palmieri RT, Akushevich L, Berchuck A, Schildkraut JM. Ovarian cancer risk factors in African-American and white women. Am J Epidemiol. 2009;170:598–606.
- 38.Wahi MM, Shah N, Schrock CE, Rosemurgy AN, Goldin SB. Reproductive factors and risk of pancreatic cancer in women: a review of the literature. Ann Epidemiol. 2009;19:103–111.
- 39.Guan HB, Wu L, Wu QJ, et al. Parity and pancreatic cancer risk: a dose-response meta-analysis of epidemiologic studies. PLoS One. 2014;9(3):e92738.
- 40.Dahabreh IJ, Trikalinos TA, Paulus JK. Parity and risk of lung cancer in women: systematic review and meta-analysis of epidemiological studies. Lung Cancer. 2012;76(2):150–158.
- 41.Lo AC, Soliman AS, Khaled HM, Aboelyazid A, Greenson JK. Lifestyle, occupational, and reproductive factors and risk of colorectal cancer. Dis Colon Rectum. 2010;53:830–837.
- 42.Peters RK, Pike MC, Chang WWL, Mack TM. Reproductive factors and colon cancers. Br J Cancer. 1990;61:741–748.
- 43.Martinez ME, Grodstein F, Giovannucci E, et al. A prospective study of reproductive factors, oral contraceptive use, and risk of colorectal cancer. Cancer Epidemiol Biomarkers Prev. 1997;6:1–5.
- 44.Bostick RM, Potter JD, Kushi LH, et al. Sugar, meat, and fat intake, and non-dietary risk factors for colon cancer incidence in Iowa women (United States). Cancer Causes Control. 1994;5:38–52.
- 45.Zervoudakis A, Strickler HD, Park Y, et al. Reproductive history and risk of colorectal cancer in postmenopausal women. J Natl Cancer Inst. 2011;103:826–834.
- 46.Wernli KJ, Wang Y, Zheng Y, et al. The relationship between gravidity and parity and colorectal cancer risk. J Women’s Health. 2009;18(7):995–1001.
- 47.Tsilidis KK, Allen NE, Key TJ, et al. Oral contraceptives, reproductive history and risk of colorectal cancer in the European Prospective Investigation into Cancer and Nutrition. Br J Cancer. 2010;103:1755–1759.
- 48.Troisi R, Schairer C, Chow WH, et al. Reproductive factors, oral contraceptive use, and risk of colorectal cancer. Epidemiology. 1997;8:75–79.
- 49.Guan HB, Wu QJ, Gong TT, Lin B, Wang YL, Liu CX. Parity and risk of colorectal cancer: a dose-response meta-analysis of prospective studies. PLoS One. 2013;8(9):e75279.
- 50.Xu S, Yoon HJ, Tourassi GD. A user-oriented web crawler for selectively acquiring online content in e-health research. Bioinformatics. 2014;30(1):104–114.
- 51.Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D. The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 2014:55–60.
- 52.R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/ Accessed June 18, 2014.
- 53.Siegel RL, Miller KD, Jemal A. Cancer Statistics, 2015. CA Cancer J Clin. 2015;65:5–29.
- 54.Martinez G, Daniels K, Chandra A. Fertility of men and women aged 15–44 years in the United States: National Survey of Family Growth, 2006–2010. Natl Health Stat Report. 2012;(51):1–28.
- 55.Kaplan RM, Chambers DA, Glasgow RE. Big data and large sample size: a cautionary note on the potential for bias. Clin Transl Sci. 2014;7(4):342–346.
- 56.Kaplan RM, Riley WT, Mabry PL. News from the NIH: leveraging big data in the behavioral sciences. Transl Behav Med. 2014;4(3):229–231.