A comprehensive Pan-Cancer molecular study of gynecologic and breast cancers

Ashton C Berger; Anil Korkut; Rupa S Kanchi; Apurva M Hegde; Walter Lenoir; Wenbin Liu; Yuexin Liu; Huihui Fan; Hui Shen; Visweswaran Ravikumar; Arvind Rao; Andre Schultz; Xubin Li; Keith A Baggerly; Walter Lenoir; Ganiraju Manyam; Pavel Sumazin; Cecilia Williams; Pieter Mestagh; Preethi H Gunaratne; Christina Yau; Reanne Bowlby; Daniel G Tiezzi; Chen Wang; Andrew D Cherniack; Nolan Priedigkeit; Robert P Edwards; Steffi Oesterreich; Liang Liu; Qianqian Song; Robert D Burk; Lindsay Stetson; Jason Pitt; Boris Winterhoff; Andrew K Godwin; Nicole M Kuderer; Janet S Rader; Rosemary E Zuna; Anil K Sood; Alexander J Lazar; Akinyemi I Ojesina; Pan-Gynecologic cancers working group; The Cancer Genome Atlas Research Network; John N Weinstein; Gordon B Mills; Douglas A Levine; Rehan Akbani

doi:10.1016/j.ccell.2018.03.014

. Author manuscript; available in PMC: 2019 Apr 9.

Published in final edited form as: Cancer Cell. 2018 Apr 2;33(4):690–705.e9. doi: 10.1016/j.ccell.2018.03.014

A comprehensive Pan-Cancer molecular study of gynecologic and breast cancers

Ashton C Berger ^1,⁺, Anil Korkut ^2,⁺, Rupa S Kanchi ^2,⁺, Apurva M Hegde ², Walter Lenoir ², Wenbin Liu ², Yuexin Liu ², Huihui Fan ³, Hui Shen ³, Visweswaran Ravikumar ², Arvind Rao ², Andre Schultz ², Xubin Li ², Keith A Baggerly ², Walter Lenoir ², Ganiraju Manyam ², Pavel Sumazin ⁴, Cecilia Williams ⁵, Pieter Mestagh ⁶, Preethi H Gunaratne ^7,⁸, Christina Yau ^9,¹⁰, Reanne Bowlby ¹¹, Daniel G Tiezzi ¹², Chen Wang ^13,¹⁴, Andrew D Cherniack ^1,¹⁵, Nolan Priedigkeit ¹⁶, Robert P Edwards ¹⁶, Steffi Oesterreich ¹⁶, Liang Liu ¹⁷, Qianqian Song ¹⁷, Robert D Burk ¹⁸, Lindsay Stetson ¹⁹, Jason Pitt ²⁰, Boris Winterhoff ²¹, Andrew K Godwin ²², Nicole M Kuderer ²³, Janet S Rader ²⁴, Rosemary E Zuna ²⁵, Anil K Sood ²⁶, Alexander J Lazar ²⁷, Akinyemi I Ojesina ²⁸; Pan-Gynecologic cancers working group^#; The Cancer Genome Atlas Research Network^#, John N Weinstein ^2,²⁹, Gordon B Mills ^29,^*, Douglas A Levine ^30,^*, Rehan Akbani ^2,^31,^*

¹The Eli and Edythe L. Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA 02142, USA

²Department of Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA

³Center for Epigenetics, Van Andel Research Institute, 333 Bostwick Ave NE, Grand Rapids, MI 49503, USA

⁴Texas Children’s Cancer Center, Baylor College of Medicine, Houston, TX 77030, USA

⁵Science for Life Laboratory Stockholm, Department of protein sciences, KTH – Royal Institute of Technology, Tomtebodavägen 23, Box 1031, SE-171 21 Solna, Sweden

⁶Cancer Research Institute Ghent, Ghent University, Ghent, Belgium

⁷Department of Biology & Biochemistry, UH-Sequencing Core, University of Houston, Houston, TX 77204

⁸Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA

⁹Buck Institute of Research on Aging, Novato, CA 94945

¹⁰Department of Surgery, University of California, San Francisco, San Francisco, CA 94143

¹¹BC Cancer Agency, Canada’s Michael Smith Genome Sciences Centre, Vancouver, BC, Canada V5Z 4S6

¹²Breast Disease and Gynecologic Oncology Division - Department of Gynecology and Obstetrics, Ribeirão Preto Medical School, University of São Paulo, 3900 Bandeirantes Ave, Ribeirão Preto, SP 14048-900 Brazil

¹³Department of Health Sciences Research, Mayo Clinic College of Medicine, 200 First Street SW, Rochester, MN 55905 USA

¹⁴Department of Obstetrics and Gynecology, Mayo Clinic College of Medicine, 200 First Street SW, Rochester, MN, 55905 USA

¹⁵Department of Medical Oncology, Dana Farber Cancer Institute, Boston, MA, 02215, USA

¹⁶Department of Obstetrics, Gynecology & Reproductive Services, University of Pittsburgh School of Medicine, 200 Lothrop St, Pittsburgh, PA 15213-2582

¹⁷Department of Cancer Biology, Wake Forest Baptist Health Center, Winston Salem, NC 27157

¹⁸Department of Obstetrics & Gynecology and Women’s Health, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461

¹⁹Case Medical Center, Case Western Reserve School of Medicine, 11100 Euclid Ave, Cleveland, OH 44106

²⁰University of Chicago, Institute for Genomics and Systems Biology, 900 E. 57th, KDBD 10240, Chicago, IL 60637

²¹Department of Obstetrics and Gynecology, Division of Gynecologic Oncology, University of Minnesota, 420 Delaware St SE, Minneapolis, MN 55455

²²Department of Pathology and Laboratory Medicine, The University of Kansas Medical Center, 3901 Rainbow Blvd, Kansas City, KS 66160, USA

²³Center for Cancer Innovation, Department of Medicine, University of Washington, WA, 98195 USA

²⁴Department of Obstetrics and Gynecology, Medical College of Wisconsin, Milwaukee, WI 53226, USA

²⁵Pathology Department, University of Oklahoma Health Sciences Center, Oklahoma City, OK 73104, USA

²⁶Department of Gynecologic Oncology and Reproductive Medicine, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA

²⁷Departments of Pathology, Genomic Medicine, & Translational Molecular Pathology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA

²⁸Department of Epidemiology and Comprehensive Cancer Center, University of Alabama at Birmingham, Birmingham, AL 35294, USA

²⁹Department of Systems Biology, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA

³⁰Gynecologic Oncology, Perlmutter Cancer Center, New York University Langone Health, 240 E. 38th Street, New York, NY 10016 USA

Correspondence: RA: rakbani@mdanderson.org; DAL: douglas.levine@nyumc.org; GBM: gmills@mdanderson.org

⁺

These authors contributed equally to this work

Full list at the end of the manuscript

Secondary author list:

Samantha J. Caesar-Johnson, John A. Demchok, Ina Felau, Melpomeni Kasapi, Martin L. Ferguson, Carolyn M. Hutter, Heidi J. Sofia, Roy Tarnuzzer, Zhining Wang, Liming Yang, Jean C. Zenklusen, Jiashan (Julia) Zhang, Sudha Chudamani, Jia Liu, Laxmi Lolla, Rashi Naresh, Todd Pihl, Qiang Sun, Yunhu Wan, Ye Wu, Juok Cho, Timothy DeFreitas, Scott Frazer, Nils Gehlenborg, Gad Getz, David I. Heiman, Jaegil Kim, Michael S. Lawrence, Pei Lin, Sam Meier, Michael S. Noble, Gordon Saksena, Doug Voet, Hailei Zhang, Brady Bernard, Nyasha Chambwe, Varsha Dhankani, Theo Knijnenburg, Roger Kramer, Kalle Leinonen, Yuexin Liu, Michael Miller, Sheila Reynolds, Ilya Shmulevich, Vesteinn Thorsson, Wei Zhang, Rehan Akbani, Bradley M. Broom, Apurva M. Hegde, Zhenlin Ju, Rupa S. Kanchi, Anil Korkut, Jun Li, Han Liang, Shiyun Ling, Wenbin Liu, Yiling Lu, Gordon B. Mills, Kwok-Shing Ng, Arvind Rao, Michael Ryan, Jing Wang, John N. Weinstein, Jiexin Zhang, Adam Abeshouse, Joshua Armenia, Debyani Chakravarty, Walid K. Chatila, Ino de Bruijn, Jianjiong Gao, Benjamin E. Gross, Zachary J. Heins, Ritika Kundra, Konnor La, Marc Ladanyi, Augustin Luna, Moriah G. Nissan, Angelica Ochoa, Sarah M. Phillips, Ed Reznik, Francisco Sanchez-Vega, Chris Sander, Nikolaus Schultz, Robert Sheridan, S. Onur Sumer, Yichao Sun, Barry S. Taylor, Jioajiao Wang, Hongxin Zhang, Pavana Anur, Myron Peto, Paul Spellman, Christopher Benz, Joshua M. Stuart, Christopher K. Wong, Christina Yau, D. Neil Hayes, Joel S. Parker, Matthew D. Wilkerson, Adrian Ally, Miruna Balasundaram, Reanne Bowlby, Denise Brooks, Rebecca Carlsen, Eric Chuah, Noreen Dhalla, Robert Holt, Steven J.M. Jones, Katayoon Kasaian, Darlene Lee, Yussanne Ma, Marco A. Marra, Michael Mayo, Richard A. Moore, Andrew J. Mungall, Karen Mungall, A. Gordon Robertson, Sara Sadeghi, Jacqueline E. Schein, Payal Sipahimalani, Angela Tam, Nina Thiessen, Kane Tse, Tina Wong, Ashton C. Berger, Rameen Beroukhim, Andrew D. Cherniack, Carrie Cibulskis, Stacey B. Gabriel, Galen F. Gao, Gavin Ha, Matthew Meyerson, Steven E. Schumacher, Juliann Shih, Melanie H. Kucherlapati, Raju S. Kucherlapati, Stephen Baylin, Leslie Cope, Ludmila Danilova, Moiz S. Bootwalla, Phillip H. Lai, Dennis T. Maglinte, David J. Van Den Berg, Daniel J. Weisenberger, J. Todd Auman, Saianand Balu, Tom Bodenheimer, Cheng Fan, Katherine A. Hoadley, Alan P. Hoyle, Stuart R. Jefferys, Corbin D. Jones, Shaowu Meng, Piotr A. Mieczkowski, Lisle E. Mose, Amy H. Perou, Charles M. Perou, Jeffrey Roach, Yan Shi, Janae V. Simons, Tara Skelly, Matthew G. Soloway, Donghui Tan, Umadevi Veluvolu, Huihui Fan, Toshinori Hinoue, Peter W. Laird, Hui Shen, Wanding Zhou, Michelle Bellair, Kyle Chang, Kyle Covington, Chad J. Creighton, Huyen Dinh, HarshaVardhan Doddapaneni, Lawrence A. Donehower, Jennifer Drummond, Richard A. Gibbs, Robert Glenn, Walker Hale, Yi Han, Jianhong Hu, Viktoriya Korchina, Sandra Lee, Lora Lewis, Wei Li, Xiuping Liu, Margaret Morgan, Donna Morton, Donna Muzny, Jireh Santibanez, Margi Sheth, Eve Shinbrot, Linghua Wang, Min Wang, David A. Wheeler, Liu Xi, Fengmei Zhao, Julian Hess, Elizabeth L. Appelbaum, Matthew Bailey, Matthew G. Cordes, Li Ding, Catrina C. Fronick, Lucinda A. Fulton, Robert S. Fulton, Cyriac Kandoth, Elaine R. Mardis, Michael D. McLellan, Christopher A. Miller, Heather K. Schmidt, Richard K. Wilson, Daniel Crain, Erin Curley, Johanna Gardner, Kevin Lau, David Mallery, Scott Morris, Joseph Paulauskis, Robert Penny, Candace Shelton, Troy Shelton, Mark Sherman, Eric Thompson, Peggy Yena, Jay Bowen, Julie M. Gastier-Foster, Mark Gerken, Kristen M. Leraas, Tara M. Lichtenberg, Nilsa C. Ramirez, Lisa Wise, Erik Zmuda, Niall Corcoran, Tony Costello, Christopher Hovens, Andre L. Carvalho, Ana C. de Carvalho, José H. Fregnani, Adhemar Longatto-Filho, Rui M. Reis, Cristovam Scapulatempo-Neto, Henrique C.S. Silveira, Daniel O. Vidal, Andrew Burnette, Jennifer Eschbacher, Beth Hermes, Ardene Noss, Rosy Singh, Matthew L. Anderson, Patricia D. Castro, Michael Ittmann, David Huntsman, Bernard Kohl, Xuan Le, Richard Thorp, Chris Andry, Elizabeth R. Duffy, Vladimir Lyadov, Oxana Paklina, Galiya Setdikova, Alexey Shabunin, Mikhail Tavobilov, Christopher McPherson, Ronald Warnick, Ross Berkowitz, Daniel Cramer, Colleen Feltmate, Neil Horowitz, Adam Kibel, Michael Muto, Chandrajit P. Raut, Andrei Malykh, Jill S. Barnholtz-Sloan, Wendi Barrett, Karen Devine, Jordonna Fulop, Quinn T. Ostrom, Kristen Shimmel, Yingli Wolinsky, Andrew E. Sloan, Agostino De Rose, Felice Giuliante, Marc Goodman, Beth Y. Karlan, Curt H. Hagedorn, John Eckman, Jodi Harr, Jerome Myers, Kelinda Tucker, Leigh Anne Zach, Brenda Deyarmin, Hai Hu, Leonid Kvecher, Caroline Larson, Richard J. Mural, Stella Somiari, Ales Vicha, Tomas Zelinka, Joseph Bennett, Mary Iacocca, Brenda Rabeno, Patricia Swanson, Mathieu Latour, Louis Lacombe, Bernard Têtu, Alain Bergeron, Mary McGraw, Susan M. Staugaitis, John Chabot, Hanina Hibshoosh, Antonia Sepulveda, Tao Su, Timothy Wang, Olga Potapova, Olga Voronina, Laurence Desjardins, Odette Mariani, Sergio Roman-Roman, Xavier Sastre, Marc-Henri Stern, Feixiong Cheng, Sabina Signoretti, Andrew Berchuck, Darell Bigner, Eric Lipp, Jeffrey Marks, Shannon McCall, Roger McLendon, Angeles Secord, Alexis Sharp, Madhusmita Behera, Daniel J. Brat, Amy Chen, Keith Delman, Seth Force, Fadlo Khuri, Kelly Magliocca, Shishir Maithel, Jeffrey J. Olson, Taofeek Owonikoko, Alan Pickens, Suresh Ramalingam, Dong M. Shin, Gabriel Sica, Erwin G. Van Meir, Hongzheng Zhang, Wil Eijckenboom, Ad Gillis, Esther Korpershoek, Leendert Looijenga, Wolter Oosterhuis, Hans Stoop, Kim E. van Kessel, Ellen C. Zwarthoff, Chiara Calatozzolo, Lucia Cuppini, Stefania Cuzzubbo, Francesco DiMeco, Gaetano Finocchiaro, Luca Mattei, Alessandro Perin, Bianca Pollo, Chu Chen, John Houck, Pawadee Lohavanichbutr, Arndt Hartmann, Christine Stoehr, Robert Stoehr, Helge Taubert, Sven Wach, Bernd Wullich, Witold Kycler, Dawid Murawa, Maciej Wiznerowicz, Ki Chung, W. Jeffrey Edenfield, Julie Martin, Eric Baudin, Glenn Bubley, Raphael Bueno, Assunta De Rienzo, William G. Richards, Steven Kalkanis, Tom Mikkelsen, Houtan Noushmehr, Lisa Scarpace, Nicolas Girard, Marta Aymerich, Elias Campo, Eva Giné, Armando López Guillermo, Nguyen Van Bang, Phan Thi Hanh, Bui Duc Phu, Yufang Tang, Howard Colman, Kimberley Evason, Peter R. Dottino, John A. Martignetti, Hani Gabra, Hartmut Juhl, Teniola Akeredolu, Serghei Stepa, Dave Hoon, Keunsoo Ahn, Koo Jeong Kang, Felix Beuschlein, Anne Breggia, Michael Birrer, Debra Bell, Mitesh Borad, Alan H. Bryce, Erik Castle, Vishal Chandan, John Cheville, John A. Copland, Michael Farnell, Thomas Flotte, Nasra Giama, Thai Ho, Michael Kendrick, Jean-Pierre Kocher, Karla Kopp, Catherine Moser, David Nagorney, Daniel O’Brien, Brian Patrick O’Neill, Tushar Patel, Gloria Petersen, Florencia Que, Michael Rivera, Lewis Roberts, Robert Smallridge, Thomas Smyrk, Melissa Stanton, R. Houston Thompson, Michael Torbenson, Ju Dong Yang, Lizhi Zhang, Fadi Brimo, Jaffer A. Ajani, Ana Maria Angulo Gonzalez, Carmen Behrens, Jolanta Bondaruk, Russell Broaddus, Bogdan Czerniak, Bita Esmaeli, Junya Fujimoto, Jeffrey Gershenwald, Charles Guo, Alexander J. Lazar, Christopher Logothetis, Funda Meric-Bernstam, Cesar Moran, Lois Ramondetta, David Rice, Anil Sood, Pheroze Tamboli, Timothy Thompson, Patricia Troncoso, Anne Tsao, Ignacio Wistuba, Candace Carter, Lauren Haydu, Peter Hersey, Valerie Jakrot, Hojabr Kakavand, Richard Kefford, Kenneth Lee, Georgina Long, Graham Mann, Michael Quinn, Robyn Saw, Richard Scolyer, Kerwin Shannon, Andrew Spillane, Jonathan Stretch, Maria Synott, John Thompson, James Wilmott, Hikmat Al-Ahmadie, Timothy A. Chan, Ronald Ghossein, Anuradha Gopalan, Douglas A. Levine, Victor Reuter, Samuel Singer, Bhuvanesh Singh, Nguyen Viet Tien, Thomas Broudy, Cyrus Mirsaidi, Praveen Nair, Paul Drwiega, Judy Miller, Jennifer Smith, Howard Zaren, Joong-Won Park, Nguyen Phi Hung, Electron Kebebew, W. Marston Linehan, Adam R. Metwalli, Karel Pacak, Peter A. Pinto, Mark Schiffman, Laura S. Schmidt, Cathy D. Vocke, Nicolas Wentzensen, Robert Worrell, Hannah Yang, Marc Moncrieff, Chandra Goparaju, Jonathan Melamed, Harvey Pass, Natalia Botnariuc, Irina Caraman, Mircea Cernat, Inga Chemencedji, Adrian Clipca, Serghei Doruc, Ghenadie Gorincioi, Sergiu Mura, Maria Pirtac, Irina Stancul, Diana Tcaciuc, Monique Albert, Iakovina Alexopoulou, Angel Arnaout, John Bartlett, Jay Engel, Sebastien Gilbert, Jeremy Parfitt, Harman Sekhon, George Thomas, Doris M. Rassl, Robert C. Rintoul, Carlo Bifulco, Raina Tamakawa, Walter Urba, Nicholas Hayward, Henri Timmers, Anna Antenucci, Francesco Facciolo, Gianluca Grazi, Mirella Marino, Roberta Merola, Ronald de Krijger, Anne-Paule Gimenez-Roqueplo, Alain Piché, Simone Chevalier, Ginette McKercher, Kivanc Birsoy, Gene Barnett, Cathy Brewer, Carol Farver, Theresa Naska, Nathan A. Pennell, Daniel Raymond, Cathy Schilero, Kathy Smolenski, Felicia Williams, Carl Morrison, Jeffrey A. Borgia, Michael J. Liptay, Mark Pool, Christopher W. Seder, Kerstin Junker, Larsson Omberg, Mikhail Dinkin, George Manikhas, Domenico Alvaro, Maria Consiglia Bragazzi, Vincenzo Cardinale, Guido Carpino, Eugenio Gaudio, David Chesla, Sandra Cottingham, Michael Dubina, Fedor Moiseenko, Renumathy Dhanasekaran, Karl-Friedrich Becker, Klaus-Peter Janssen, Julia Slotta-Huspenina, Mohamed H. Abdel-Rahman, Dina Aziz, Sue Bell, Colleen M. Cebulla, Amy Davis, Rebecca Duell, J. Bradley Elder, Joe Hilty, Bahavna Kumar, James Lang, Norman L. Lehman, Randy Mandt, Phuong Nguyen, Robert Pilarski, Karan Rai, Lynn Schoenfield, Kelly Senecal, Paul Wakely, Paul Hansen, Ronald Lechan, James Powers, Arthur Tischler, William E. Grizzle, Katherine C. Sexton, Alison Kastl, Joel Henderson, Sima Porten, Jens Waldmann, Martin Fassnacht, Sylvia L. Asa, Dirk Schadendorf, Marta Couce, Markus Graefen, Hartwig Huland, Guido Sauter, Thorsten Schlomm, Ronald Simon, Pierre Tennstedt, Oluwole Olabode, Mark Nelson, Oliver Bathe, Peter R. Carroll, June M. Chan, Philip Disaia, Pat Glenn, Robin K. Kelley, Charles N. Landen, Joanna Phillips, Michael Prados, Jeffry Simko, Karen Smith-McCune, Scott VandenBerg, Kevin Roggin, Ashley Fehrenbach, Ady Kendler, Suzanne Sifri, Ruth Steele, Antonio Jimeno, Francis Carey, Ian Forgie, Massimo Mannelli, Michael Carney, Brenda Hernandez, Benito Campos, Christel Herold-Mende, Christin Jungk, Andreas Unterberg, Andreas von Deimling, Aaron Bossler, Joseph Galbraith, Laura Jacobus, Michael Knudson, Tina Knutson, Deqin Ma, Mohammed Milhem, Rita Sigmund, Andrew K. Godwin, Rashna Madan, Howard G. Rosenthal, Clement Adebamowo, Sally N. Adebamowo, Alex Boussioutas, David Beer, Thomas Giordano, Anne-Marie Mes-Masson, Fred Saad, Therese Bocklage, Lisa Landrum, Robert Mannel, Kathleen Moore, Katherine Moxley, Russel Postier, Joan Walker, Rosemary Zuna, Michael Feldman, Federico Valdivieso, Rajiv Dhir, James Luketich, Edna M. Mora Pinero, Mario Quintero-Aguilo, Carlos Gilberto Carlotti, Jr., Jose Sebastião Dos Santos, Rafael Kemp, Ajith Sankarankuty, Daniela Tirapelli, James Catto, Kathy Agnew, Elizabeth Swisher, Jenette Creaney, Bruce Robinson, Carl Simon Shelley, Eryn M. Godwin, Sara Kendall, Cassaundra Shipman, Carol Bradford, Thomas Carey, Andrea Haddad, Jeffey Moyer, Lisa Peterson, Mark Prince, Laura Rozek, Gregory Wolf, Rayleen Bowman, Kwun M. Fong, Ian Yang, Robert Korst, W. Kimryn Rathmell, J. Leigh Fantacone-Campbell, Jeffrey A. Hooke, Albert J. Kovatich, Craig D. Shriver, John DiPersio, Bettina Drake, Ramaswamy Govindan, Sharon Heath, Timothy Ley, Brian Van Tine, Peter Westervelt, Mark A. Rubin, Jung Il Lee, Natália D. Aredes, and Armaz Mariamidze

³¹

Lead Contact

PMCID: PMC5959730 NIHMSID: NIHMS958065 PMID: 29622464

Summary

We analyzed molecular data on 2,579 TCGA tumors of four gynecological types plus breast. Our aims were to identify shared and unique molecular features, clinically significant subtypes, and potential therapeutic targets. We found 61 somatic copy number alterations (SCNAs) and 46 significantly mutated genes (SMGs). Eleven SCNAs and eleven SMGs had not been identified in previous TCGA studies of the individual tumor types. We found functionally significant estrogen receptor-regulated lncRNAs and gene/lncRNA interaction networks. Pathway analysis identified subtypes with high leukocyte infiltration, raising potential implications for immunotherapy. Using 16 key molecular features, we identified five prognostic subtypes and developed a decision tree that classified patients into the subtypes based on just six features that are assessable in clinical laboratories.

Keywords: Gynecologic cancer, breast cancer, ovarian cancer, uterine cancer, cervical cancer, uterine carcinosarcoma, TCGA, The Cancer Genome Atlas, molecular features, Pan-Gynecologic, omics

Introduction

Gynecologic cancers share a variety of characteristics: they arise from similar embryonic origins in the Müllerian ducts, their development is influenced by female hormones, and they are managed by a particular medical specialty, gynecologic oncology, as reflected in the departmental organizations of academic medical centers (Mullen & Behringer, 2014). Recently, similarities at the molecular level have been identified across gynecologic and breast cancers in a comprehensive analysis of all 33 TCGA tumor types (Hoadley et al., 2018). Despite the commonalities, however, the various gynecologic cancer types do differ from each other in a variety of intriguing and important ways. The principal aims of the present study are to highlight both similarities and differences among types and subtypes of gynecologic cancers, in addition to the ways in which they differ from non-gynecologic cancers. Because breast tumors share most of the generic characteristics listed above, we have chosen to include them in the analysis.

The study focuses on the following five TCGA tumor types: high-grade serous ovarian cystadenocarcinoma (OV), uterine corpus endometrial carcinoma (UCEC), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), uterine carcinosarcoma (UCS), and invasive breast carcinoma (BRCA). Although each Pan-Gyn organ site is subject to a variety of uncommon histologic cancer subtypes not studied by TCGA, the most frequent and/or aggressive tumors are represented. Despite impressive recent advances in diagnosis and management, these tumors share unmet needs for effective treatment. The analyses here can provide background biological information and prompt hypotheses about therapeutic choices or provide evidence for pre-existing hypotheses.

Taken together, the Pan-Gyn cohort reflects a projected incidence of more than 350,000 cases in the United States in 2017 (Siegel et al., 2017), with many more worldwide. Many of the commonalities and differences among cancer types and subtypes presented here were not identified in the individual TCGA disease-type projects (Cancer Genome Atlas Research Network, 2011; Cancer Genome Atlas Research Network, 2012; Cancer Genome Atlas Research Network et al., 2013; Cherniack et al., 2017; Cancer Genome Atlas Research Network, 2017).

Results

We used data generated from 2,579 TCGA patient samples (the “Pan-Gyn” cohort; n=1,087 BRCA, 308 CESC, 579 OV, 548 UCEC, and 57 UCS) using fresh-frozen primary samples prior to any chemo- or radiation therapy. All sample collections were approved by local Institutional Review Boards. We analyzed data of multiple types, including clinical, somatic copy number alterations (SCNAs), mutations, DNA methylation, and expression of mRNA, microRNA (miRNA), long non-coding RNA (lncRNA), and proteins. The data were adjusted for batch effects before further analysis (see STAR Methods). Here, we 1) present results that distinguish Pan-Gyn from the rest of the TCGA tumor types, 2) summarize platform-specific analysis results, and 3) propose cross-tumor type subtypes with potential prognostic and therapeutic value.

Molecular features that distinguish Pan-Gyn from other tumor types

We identified molecular features that differed in frequency among the five Pan-Gyn tumor types and the remaining 28 TCGA non-gynecologic (non-Gyn) tumor types (https://tcga-data.nci.nih.gov/docs/publications/tcga/). After adjusting for sample size per tumor type, we found 23 genes (including ARID1A, ERBB3, BRCA1, FBXW7, KMT2C, PIK3CA, PIK3R1, PPP2R1A, PTEN, and TP53) that were mutated at higher frequencies across the Pan-Gyn tumor types than across the non-Gyn types (FDR < 0.01, Fisher’s Exact test) (Figure 1A). Eighteen of those genes were found to be significantly mutated genes (SMGs) in the Pan-Gyn cohort (as described later).

**(A)** Heat map showing the frequencies of mutations (green) in 23 genes across all 33 TCGA tumor types and frequencies of amplifications (red) in 23 genes across all 33 TCGA tumor types. **(B)** Amplification (red) and deletion (blue) q values from GISTIC2.0 for SCNA peaks of significant copy number gain and loss plotted for Pan-Gyn vs. non-Gyn cohorts. Genes named are the suspected targets of amplification or deletion, if identifiable. Otherwise, peaks are labeled with the nearest cytoband’s designation. Peaks found in only one cohort were assigned values of NS (not significant) in the other cohort. See also Figure S1 and Table S1.

Next, we used GISTIC2.0 (Mermel et al., 2011) to identify statistically significant recurring SCNAs in the Pan-Gyn cohort and, separately, in the non-Gyn cohort. We identified 61 significant regions in the Pan-Gyn tumors, 27 amplifications and 34 deletions, of which 12 amplifications and 6 deletions were not found in the non-Gyn cohort, suggesting relative specificity for Pan-Gyn tumors (Figure 1B, Figure S1A & Table S1). Two of the 12 uniquely Pan-Gyn amplifications and one of the six deletions had not previously been reported in single-disease TCGA studies of the same tumor types (Cancer Genome Atlas Research Network, 2011; Cancer Genome Atlas Research Network, 2012; Cancer Genome Atlas Research Network et al., 2013; Cherniack et al., 2017; Cancer Genome Atlas Research Network, 2017). One of the previously unreported amplifications was a focal region in 1q42.3 covering IRF2BP2, which encodes an interferon regulatory factor binding protein that is implicated in cellular differentiation, proliferation, and survival processes (Stadhouders et al., 2015). The other unreported amplification, located in 10p15.1, included an intergenic noncoding region downstream of KLF6 that bears striking resemblance to known oncogenic super-enhancer regions (Zhang et al., 2016) and PFKFB3, a gene which is being investigated as a therapeutic target in various cancers (Cantelmo et al., 2016; Li et al., 2017; Peng et al., 2018). The deletion consisted of a ~7 MB region in 9q34.3 that contains the tumor suppressor genes TSC1 and NOTCH1.

Figure 1B and Figure S1A depict suspected targets of the significant SCNAs in the Pan-Gyn and non-Gyn cohorts without adjusting for sample size per tumor type. MECOM, KAT6A, BRD4, NEDD9, MYCL1, and KAT6B were selectively amplified in the Pan-Gyn cohort, whereas SOX2, EGFR, CDK4, MDM4, and CDK6 were selectively amplified in the non-Gyn cohort. MAP2K4 and NF1 were notable tumor suppressor genes with recurring copy number losses specific to Pan-Gyn tumors, whereas PTPRD, RBFOX1, and TP53 were among the tumor suppressors more commonly deleted in non-Gyn samples. Significantly recurring deletions were found in known or putative fragile site genes, including LRRN3 (7q31.1) in non-Gyn, ANKS1B (12q23.1) in Pan-Gyn, and RAD51B (14q24) in both cohorts (McAvoy et al., 2007; Miron et al., 2015). Adjusting for sample size per tumor type, we identified 23 oncogenes among the genes in the 27 Pan-Gyn amplification regions that were consistently more frequently amplified across the five Pan-Gyn tumor types than across the non-Gyn types (FDR < 0.05, Fisher’s Exact test) (Figure 1A). We found no known tumor suppressors within the 34 somatic deletion regions that were more frequently deleted across the Pan-Gyn tumor types than across the non-Gyn types. In addition, we identified 197 genes that were statistically significantly hyper- or hypomethylated at different frequencies in the two cohorts (Figure S1B).

We performed bootstrapping-based analyses to investigate whether there were greater numbers of shared mutated or copy-number altered genes among the five Pan-Gyn tumor types vs. random sets of five tumor types. The results showed that 23 mutated genes were enriched in the Pan-Gyn tumor types vs. only 6 mutated genes expected by random chance (p = 0.10) (Figure S1C), whereas 122 SCNA genes were enriched in Pan-Gyn vs. 2 by random chance (p < 0.0001) (Figure S1D).

Individual data platform analyses

Mutation analysis

We analyzed 2,258 patient samples with mutation data from TCGA for SMGs and operative mutational processes across the Pan-Gyn tumor types. The types of mutations in the Pan-Gyn cohort are summarized in Table S2 The average mutation load varied widely by tumor type, with CESC samples having the highest median frequency (5.3 mutations/mbp). UCEC samples showed a bimodal distribution due to a subset of hyper-mutators described previously (Cancer Genome Atlas Research Network et al., 2013).

There were 46 SMGs based on the intersection of those genes identified by MutSigCV v1.4 (Lawrence et al., 2013) and those identified by previous methods (Vogelstein et al., 2013) (Figure 2A). The top five most frequently mutated genes were TP53 (44% of samples mutated), PIK3CA (32%), PTEN (20%), ARID1A (14%), and PIK3R1 (11%). Eleven of the 46 SMGs had not been previously reported in any of the TCGA gynecologic or breast marker papers (Cancer Genome Atlas Research Network, 2011; Cancer Genome Atlas Research Network, 2012; Cancer Genome Atlas Research Network et al., 2013; Cherniack et al., 2017; Cancer Genome Atlas Research Network, 2017) (Table S3). Among them, ACVR2A, a member of the TGF-β superfamily that functions in pathways implicated in both tumor progression and suppression (Ikushima & Miyazono, 2010), was the most frequently mutated (in 4.8% of the cohort). LATS1 was the next most frequently mutated (3.8%) and functions in the Hippo signaling pathway, which controls organ size, restricts proliferation, promotes apoptosis, and has been implicated in multiple cancer types (Yu et al. 2015; Deng et al. 2017). CCAR1 was mutated at 3.6%; its protein product functions as a p53 coactivator and plays roles in cell proliferation, apoptosis, and, in breast cancer, estrogen-dependent growth (Kim et al. 2008; Muthu et al. 2015). We found 220 patients (10%) that had no detectable SMGs.

**(A)** Mutation profiles of 2,029 Pan-Gyn samples (columns) in which at least one somatic mutation occurred in at least one of the 46 significantly mutated genes (SMGs). (Top) Mutation burdens per sample, divided into synonymous and non-synonymous mutation types. (Middle) Types of mutations in each of the 46 SMGs per sample. (Bottom) Covariate bars showing the mutation cluster, genomic alterations in six genes from the DNA damage response pathway, and tumor type for each sample. **(B)** Clustered heat map showing correlations between 10 of our mutation signatures (rows labeled S1 to S10) and 30 COSMIC signatures (columns). **(C)** Clustered heat map of the mutation signatures (rows) present in each sample (columns) showing ten clusters. The dendrogram is color-coded by predominant COSMIC signature. See also Figure S2, and Tables S2-5.

Mutation signatures

Mutation signatures have provided insight into mechanisms underlying tumor development and have informed patient therapy (Helleday et al., 2014). Analysis by non-negative matrix factorization (NMF) on the Pan-Gyn data set suggested that 10 mutation signatures could explain nearly 90% of the variability observed in the original mutation/sample matrix (Figure S2A-B). The 10 Pan-Gyn signatures (S1 to S10) variably correlated with the 30 COSMIC signatures (http://cancer.sanger.ac.uk/cosmic/signatures) (Forbes et al., 2011) (Figure 2B). S1 correlated strongly with COSMIC signature 13 (r = 0.99) and S2 correlated with COSMIC signature 2 (r = 0.95); both signatures suggest activity of the AID/APOBEC family of cytidine deaminases. S3 correlated with COSMIC signature 1 (r = 0.94), indicating an endogenous process initiated by spontaneous deamination of 5-methylcytosine. S4 and the ultramutator COSMIC signature 10 were highly correlated (r = 0.97), presumably reflecting altered activity of POLE. A smaller correlation was found between S10 and COSMIC signature 3 (r = 0.58), associated with germline and somatic BRCA1 and BRCA2 mutations. All of the correlations were statistically significant (FDR < 0.05).

Unsupervised hierarchical clustering based on the contribution of each signature divided the Pan-Gyn samples into 10 clusters that showed associations with various molecular/clinical features (Figure 2C, Figure S2C & Table S4-5). Cluster C1 was highly enriched with OV samples (and basal BRCA and UCEC to a lesser extent) and contributed strongly to S10, a signature associated with germline and somatic BRCA1 and BRCA2 mutations that correlate with responsiveness to PARP inhibitors and platinum-based therapy (Konecny et al., 2016). C1 also had samples with frequent TP53 mutations and homozygous deletions, supporting the association with an ineffective DNA double-strand break repair COSMIC signature. C2, which contained BRCA, OV, and UCEC samples, was associated with transcriptional strand bias for T>C substitutions, whereas C3, which contained BRCA and OV samples, was associated with transcriptional strand bias for T>A mutations. C4 consisted principally of breast samples and contributed to S8, the signature most associated with COSMIC 5 (etiology unknown). C5, principally composed of UCEC tumors with high microsatellite instability and mutations in MLH1, MSH2, MSH3, or MSH6, contributed most strongly to signature S6. S6 is correlated with COSMIC signatures 6, 15, and 20, which are associated with defective DNA mismatch repair (suggesting possible sensitivity to immune checkpoint inhibitors). C9 comprised CESC and BRCA samples and represented the AID/APOBEC signatures S1 and S2, providing further evidence for enrichment of APOBEC mutagenesis in these cancers (Roberts et al., 2013). C10 was associated with POLE-mutant UCEC samples.

Somatic copy number alterations

Unsupervised hierarchical clustering of the Pan-Gyn cohort using ISAR-corrected (Zack et al., 2013) segmentation data revealed six clusters with distinct copy number profiles (Figure 3 & Table S4-5). Prominent features that distinguished the clusters included SCNAs in chr 8, 16q, and 1q, among others. OV, serous UCEC, UCS and basal-like, HER2⁺, and luminal B BRCA tumors clustered almost exclusively into C4 and C6. Conversely, luminal A BRCA and endometrioid UCEC samples were divided amongst all clusters, providing evidence for additional tumor subtypes beyond the traditional clinical classifications (Cancer Genome Atlas Research Network et al., 2013). C4 and C6 showed a high degree of genomic copy number instability, consistent with their prevailing TP53 mutation signatures (Ciriello et al., 2013), and contained the highest numbers of advanced stage cancers (Figure S3A). Unlike other clusters, more than 50% of the samples in C4 and C6 had undergone at least one whole genome-doubling event. C3 accounted for the largest proportion of CESC samples and uniquely exhibited a focal 11q22 amplification containing the oncogene YAP1. C2, with 74% endometrioid UCEC, contained a majority of the POLE-mutant cases and exhibited a quiet SCNA landscape with few broad-level gains or losses. C1 and C5 consisted primarily of endometrioid UCEC and luminal A BRCA tumors, accounting for 85% and 72% of the samples in the two clusters, respectively. Both clusters had similar alteration profiles genome-wide, except in the frequencies of 1q and chr 8 gains (p < 2.2e-16, Fisher’s Exact test); the former occurred twice as frequently in C1 and the latter seven times as frequently in C5. Overall, gain of 1q was the most frequent chromosomal arm-level event, occurring in 49.5% of samples across all five Pan-Gyn cancer types. Other frequently recurring arm-level events included gain of 3q, 8q, and chr 20, and loss of 4p, 13q, 16q, 17p, and 22q.

The heat map shows SCNAs in tumor samples (columns) plotted by chromosomal location (rows). Red and blue indicate amplifications and deletions, respectively. See also Figure S3 and Tables S4-5.

DNA methylation

Unsupervised clustering of 2,586 cancer-specific, hyper-methylated loci across all Pan-Gyn tumors revealed heterogeneity of DNA methylation patterns (Figure S3B & Table S4-5). Unsurprisingly, tumor samples from the same tissue of origin (e.g., OV, UCS, or CESC) clustered together with the exception of two major groups, which were found to be highly robust via cluster stability analysis (83% and 90% for left and right branches respectively) (Figure S3C-D). The left branch with lower degrees of hypermethylation consisted of the majority of OV and UCS, normal and basal-like BRCA, and microsatellite stable UCECs (both endometrioid and serous subtype). The hypermethylator (right) cluster included most CESC tumors, the majority of BRCA, and microsatellite unstable UCEC. The 7 cluster resolution was retained when perturbing samples across all of the TCGA Pan-Can cohort (Figure S3E), with a small subset of UCEC samples reassigned. C7 (mostly CESC) had the highest degree of hypermethylation across all tumor types in the study, followed by a luminal B BRCA-rich C4, which also consisted of HER2⁺ and a small fraction of basal-like BRCA. Within tumor subtype (e.g., endometrioid UCEC), the heterogeneity of DNA methylation patterns identified samples that showed greater deficiency in DNA mismatch repair pathways (via MLH1 silencing). Hypermethylation and concomitant down-regulation of two genes in the Homologous Repair (HR) pathway, BRCA1 and RAD51C, were observed almost exclusively in OV (12.7% and 3.0%, respectively) and basal-like BRCA cancers (2.8% and 2.6%, respectively).

mRNA analysis

Unsupervised hierarchical clustering of 1,860 previously defined cancer genes (Sadelain et al., 2012) in 2,296 Pan-Gyn samples resulted in the identification of nine mRNA clusters with distinct clinicopathologic characteristics (Figure 4A & Table S4-5). Both C1 and C2 were BRCA-enriched, and C2 consisted of the majority of HER2⁺ and normal-like tumors. C2 was also significantly enriched with infiltrating lobular carcinomas, whereas over 95% of cases in C4 were basal-like ductal BRCA. C5 consisted mainly of OV and serous-like UCEC, a similarity previously noted (Cancer Genome Atlas Research Network et al., 2013). Over 50% of cases in C7 were UCS and, given its high EMT signature (Cherniack et al., 2017), C7 therefore likely exhibits EMT characteristics. Overall, the Pan-Gyn mRNA subtypes showed prognostic value, even after adjusting for lineage (p < 0.0001, chi-squared test) (Figure 4B). UCEC, in particular, appeared in five of the nine clusters and exhibited significant differences in overall survival, depending on cluster membership (Figure 4C).

**(A)** Unsupervised hierarchical clustering of previously reported cancer genes identifies nine mRNA-based subtypes/clusters. Clinical and molecular features are indicated by the annotation bars above the heat map. **(B)** Overall survival for each of the gene expression clusters (chi-squared test p value < 0.0001, adjusted for differences in tumor type survival rates). **(C)** Overall survival for endometrial cancer (UCEC) patients in the gene expression clusters (log rank test p value < 0.0001). **(D)** Differential expression of *ESR1*, AR, *SOX2*, and *CDH1* in different clusters (Kruskal-Wallis test p values < 0.0001 for all 4 genes). The bars represent mean expression of the gene (log₂ scale) in each cluster, together with the upper or lower 95% confidence interval (whiskers above or below the bars, respectively). See also Tables S4-5.

We investigated which genes were differentially expressed among the clusters (Figure 4D). ESR1 and AR were significantly higher in C1 and C2 than in others, whereas C3 had high expression of SOX2. C3 consisted of cervical cancer samples with squamous histology, characterized by 3q26 amplification (the SOX2 gene loci). C7 had significantly lower expression of the classical epithelial marker CDH1, which is consistent with an EMT signature.

Proteomic analysis

Unsupervised hierarchical clustering of protein expression data for 1,967 samples across 216 proteins identified 5 clusters (Figure S4A & Table S4-5). C1 principally consisted of non-basal BRCA, C3 was enriched with endometrioid UCEC, and C4 was enriched with OV. Interestingly, C2 and C5 contained a mixture of samples across multiple disease types. C2 had high levels of caveolin1, MYH11, and HSP70 proteins, which have previously been identified as biomarkers for the reactive subtype found in luminal BRCA (Cancer Genome Atlas Research Network et al., 2012). In addition to luminal BRCA samples, C2 included some basal-like BRCA, CESC, OV, and UCEC samples (but not UCS). Cluster C5 contained most of the basal-like BRCA, squamous CESC, serous UCEC, UCS, and 10% of the serous OV samples. It had a low hormone receptor pathway score (Akbani et al., 2014) and high levels of cell cycle and DNA damage response activity, features that could indicate sensitivity to drugs that target DNA damage repair.

miRNA analysis

Unsupervised hierarchical clustering of the 293 most variable miRNAs in 2,417 samples grouped samples largely by disease type (Figure S4B-D & Table S4-5). The miRNA profile for OV, however, was especially distinct from other Pan-Gyn tumor types. Basal-like BRCA samples were more similar to CESC (C6), UCEC and UCS samples (C4 and C5) than they were to the non-basal BRCA subtypes in C2 and C3.

lncRNAs

We processed raw RNA-seq data to extract 1,986 lncRNAs that were predicted to regulate the 216 cancer-related proteins profiled by TCGA across four of the five tumor types (UCS did not have sufficient samples for the lncRNA extraction). An unsupervised consensus clustering of the data revealed six clusters (L1 to L6) that coincided significantly with protein-based clusters (C1 to C5) (p value < 0.05, Fisher’s Exact test) (Figures 5A, S4A and Table S4-5). BRCA and CESC had very similar lncRNA profiles and grouped together in clusters L2 and L3. UCEC (in L5) and OV (in L6) each had very distinct lncRNA profiles from those of BRCA and CESC. Portions of the OV (31%) and UCEC (11%) samples were both present in cluster L4.

Previous studies have suggested that estrogen receptor (ER) regulates BRCA1 expression, Dyskerin (DKC1) expression (a binding partner of the lncRNA TERC), and the lncRNA TUG1 (Figure 5B) (Jonsson et al., 2015; Hurtado et al., 2011). ER binds to regulatory regions of DKC1, either to induce or to repress multiple lncRNAs (Figure 5C). In the present study, our analysis has revealed significant Pearson’s correlation (t-test p value < 0.05) between key lncRNAs and their regulator genes’ transcripts, ESR1, OIP5, and DKC1, in a context-specific manner (Figure 5D). Using Gene Set Enrichment Analysis (GSEA), we found 12.04% of the 1,537 Gene Ontology gene sets to be significantly enriched (FDR < 0.05) with TERC-correlated genes across all four cancer types (Figure S5). Included were gene sets associated with TERT and telomere maintenance and packaging as well as gene sets linked to MYC. The latter result supports earlier findings of TERC binding peaks in the MYC promoter region (Chu et al., 2011).

Pathway analysis

We performed PARADIGM pathway analysis (Vaske et al., 2010) followed by unsupervised consensus clustering of pathway scores that clustered samples primarily by tissue type, with a few notable exceptions (Figure 6A-B, Table S4-5). A subset of basal-like BRCA cancers co-clustered with a subset of UCEC and UCS in C2, whereas the remaining basal-like BRCA samples clustered with non-basal BRCA in C4. Contrary to transcriptomic analysis, pathway analysis clustered approximately half of the basal-like BRCA cancer samples together with the HER2⁺ and luminal B samples.

**(A)** Consensus Clustered heat map based on PARADIGM integrated pathway levels (IPLs). Selected pathway features with characteristic patterns of inferred activation across clusters are labeled on the rows. Samples are in columns. **(B)** Constituent pathways with differential single-sample gene set enrichment analysis (ssGSEA) scores across PARADIGM clusters. A comparison of ssGSEA scores of constituent pathways integrated by the PARADIGM algorithm identified 263 differentially enriched pathways across clusters. Samples are arranged in the same order as (A) and differentially expressed pathways are arranged based on unsupervised clustering of their ssGSEA scores. Dominant themes within sub-groupings of differential pathways across PARADIGM clusters are labeled. Examples of immune-related pathways include IL12, IL23, IL27, IFNG, STAT and T-cell receptor signaling pathways. Proliferation and DNA damage repair related pathways include FOXM1, PLK2, Cyclins, MYC, E2F, ATM, ATR, BARD1, and Fanconi anemia pathways. See also Tables S4-5.

All PARADIGM clusters had distinct patterns of high or low immune-related signaling, assessed by inferred activation (Figure 6A) and pathway enrichment (Figure 6B), suggesting an important role for immune response in subsets of Pan-Gyn cancers. Interestingly, the two basal-like BRCA subtypes differed between inferred activation of immune-related signaling pathways. Enrichment with adhesion-related proteins, such as the integrins, matrix metalloproteinases, and syndecans, were also distinguished between the two basal-like subtypes, suggesting distinctive tumor microenvironments. As with basal-like BRCA, UCEC split into two clusters (C2 and C3) that did not correspond to obvious variations in UCEC histology. These clusters were mainly differentiated by proliferation, Notch signaling, and immune activity levels.

Integrated analysis across Pan-Gyn tumor types

We used cluster assignments from the six major TCGA platforms (mutations, SCNA, DNA methylation, mRNA, miRNA, and protein) to perform integrated clustering across the Pan-Gyn cohort using the CoCA algorithm (Figure S6A). The resulting CoCA clusters were heavily dominated by tumor type because the intrinsic gene expression patterns were lineage dependent. The association with tumor type was especially prominent in the DNA methylation, mRNA, miRNA, and protein clusters. Therefore, we turned to an alternative method (described next) to define subtypes that would span the Pan-Gyn tumor types and emphasize high-level similarities among them.

Subtypes across the Pan-Gyn tumors

We present molecular subtypes that illuminate commonalities and distinguishing features across the Pan-Gyn tumor types, with the potential to inform future cross-tumor type therapies. We first identified 16 features (listed in STAR Methods) across 1,956 samples that were either (i) currently used in the clinic for at least one of the five tumor types, or (ii) identified as informative in previous TCGA gynecologic and breast cancer studies. Next, we clustered the feature matrix and obtained five clusters (Figure 7A, Table S4-5). SCNA load was the predominant feature and produced the first division. In the low SCNA-load-group, we found two clusters, non-hyper-mutator (C1) and hyper-mutator (C2). The non-hyper-mutator cluster had virtually no hyper-mutators but had high levels of ER⁺, PR⁺, and/or AR⁺ samples, indicating potential susceptibility to hormone therapies. C2, the hyper-mutator cluster, could be further subdivided into four sub-clusters (clusters C2A-C2D). C2A was enriched with POLE mutations, which have previously been associated with “ultramutators” and their extremely high mutation rates (>100 mutations/mbp) (Cancer Genome Atlas Research Network, 2013). C2B showed enrichment with MSI-high samples and C2C showed high immune-infiltration levels. C2D was depleted of hyper-mutators and showed enrichment with high immune-infiltration and HPV-positive samples. The high SCNA-load group consisted of three clusters: immune high (C3), AR- or PR-low (C4), and AR or PR high (C5). The immune high cluster showed low levels of hormone receptors and enrichment with HPV-positive samples. Interestingly, samples with ERBB2 amplification fell into two main clusters; those in clusters C3 (n = 39) and C4 (n = 30) showed high and low immune infiltration levels, respectively (purple and black rectangles in Figure 7A). C3 displayed a tendency towards better survival than C4 (hazard ratio = 2.8), with a p value that trended towards significance (p = 0.087) (Figure S6B). C4 showed low levels of AR and PR and had a sub-cluster with BRCA1 or BRCA2 somatic mutations. C5 had high levels of at least one of the three hormone receptors, again suggesting sensitivity to hormone therapies. Each cluster had varying levels of representation of samples from each disease, mitigating tissue-specificity (Figure 7B).

**(A)** Clustered heat map of 16 features across 1,956 Pan-Gyn samples. Cluster 2 is split further into four subclusters, 2A-2D. Purple rectangles highlight HER2⁺ samples that have high immune infiltration scores; black rectangles highlight HER2⁺ samples with low immune infiltration scores. **(B)** Cross-tabulation showing the distribution of Pan-Gyn tumor types across the five clusters. **(C)** Kaplan-Meier curves showing differences in overall survival among the five clusters (with 5-and 10-year survival rates shown). Before adjusting for tumor type differences in overall survival rates, the log rank test p value < 0.0001, and after adjusting for tumor type differences, p value = 0.0006 (chi-squared test). **(D)** Decision tree that predicts clusters using just six of the 16 features. The predicted clusters are shown in a covariate bar in the heat map in A. **(E)** Kaplan-Meier curves showing differences in overall survival among the five decision tree-based predicted clusters (with 5- and 10-year survival rates shown). Log rank test p values are less than 0.0001, before (log rank test) and after (chi-squared test) adjusting for tumor type differences in overall survival rates. See also Figure S6 and Tables S4-5.

We then performed overall survival analysis on the five clusters and obtained very significant survival differences among them (p < 0.0001, log-rank test) (Figure 7C). The 5-year survival rate ranged from 83% (C1) to 44% (C4), and the 10-year survival rate ranged from 64% (C2) to 20% (C4). We assessed the statistical significance of the added prognostic value of the 16-feature clusters after accounting for tumor type differences to control for effects that may be due to individual tumor type contributions; the resulting p value was still significant (p = 0.0006, log-rank test).

Finally, we used dichotomous decision tree methodology (Quinlan, 1983) to reduce the number of assessed molecular variables needed to classify patients into one of the five subtypes. The resulting tree required specification of only six of the original 16 features (Figure 7D). The tree had an accuracy of 82% predicting the original 16-feature based clusters, with a receiver-operator characteristic (ROC) area under the curve (AUC) of 0.94.

We repeated the same type of survival analysis for the clusters predicted by the decision tree as we did for the original clusters (Figure 7E). Log-rank test p values for the tumor type-unadjusted and adjusted methods were both highly significant (p < 0.0001), showing that the decision tree-based clusters retained prognostic value despite not having 100% accuracy. These survival rates were comparable to the original clusters, with a 5-year survival rate ranging from 85% (C1) to 39% (C4), and a 10-year survival rate ranging from 67% (C1) to 14% (C4).

Discussion

We performed an integrative, multi-platform analysis of the TCGA Pan-Gyn tumors based on 2,579 clinical cases. In addition to confirming the robustness of many observations cited in previous TCGA publications on the individual tumor types, our approaches also provided a considerable number of additional findings: (i) Multiple genomic and epigenomic features that help to distinguish gynecologic and breast tumors from the other 28 TCGA tumor types; (ii) 61 somatic copy number peaks in the Pan-Gyn cohort, 11 not previously reported by TCGA; (iii) three somatic copy number alterations (containing genes of potential therapeutic relevance) unique to gynecologic cancers among the 33 TCGA tumor types; (iv) 46 SMGs in the Pan-Gyn cohort, 11 not previously reported by TCGA; (v) 10 predominant mutation signatures, with 10% of the samples lacking identified SMGs; (vi) analyses of the 10 mutation signatures in relation to the 30 COSMIC signatures, demonstrating relationships between the two sets of signatures; (vii) shared similar miRNA profiles between most of the Pan-Gyn tumor types; the exception, OV, was extremely different from the rest, and, unexpectedly, the miRNA profiles of basal-like BRCA cancers closely resembled those of CESC cancers; (viii) some OV and UCEC samples exhibited the “reactive” proteomic signature previously identified and shown to be prognostically relevant in BRCA; (ix) identification of a subtype with low protein expression of ER and AR (important markers for hormone therapy) that spanned all five tumor types; (x) large-scale lncRNA analysis not performed previously for any of the TCGA gynecologic or breast marker papers; our findings included several ER-regulated lncRNAs and an ER-TERC/DKC1-NEAT1/OIP5-AS1-TUG1 gene/lncRNA network; (xi) similar lncRNA profiles in BRCA and CESC, in contrast to the very distinct profiles in UCEC and OV; (xii) lineage-specific gene expression patterns and lineage-related (but not always cancer type-specific) features revealed by multi-platform clustering of tumor samples; (xiii) pathway analyses that revealed subsets of BRCA, OV, and UCEC samples with high levels of leukocyte infiltration, a primary marker of immune response and possible susceptibility to immunotherapy; most of the CESC samples, but virtually none of the UCS samples, showed high leukocyte infiltration; (xiv) roughly half of the basal-like BRCA samples resembled luminal/HER2⁺ BRCA samples at the pathway level (but not the gene expression level); this pattern suggests convergence of independent gene expression changes to drive a limited number of pathway outputs and could prove useful with respect to development and selection of therapies across BRCA subtypes; (xv) five cross-Pan-Gyn subtypes defined by multi-platform clustering of 16 molecular features; these five clusters have possible clinical implications and predictive value for survival beyond that of tumor type alone; (xvi) reduction of the 16 molecular features to six in the form of a binary decision tree that retained prognostic value.

From a potential therapeutic perspective, two of the Pan-Gyn clusters (C1 and C5) in (xv) showed high levels of hormone receptors (ER, PR, and/or AR), suggesting possible responsiveness to hormone therapy. C3 showed high levels of immune markers, warranting further exploration for possible value in selecting patients for immunotherapy. C2 included hyper-mutators and ultramutators, which have been associated with relatively good survival on conventional therapy. A subset of C4 showed ERBB2 amplification, suggesting possible responsiveness to HER2-targeted therapy. ERBB2 mutation and amplification were mutually exclusive, but both sets of tumors might benefit from HER2-targeted therapy.

The decision tree we propose could potentially enable clinicians to classify patients more easily into one of the five Pan-Gyn subtypes. The tree is based on six features, three of which (ER, PR, and AR status) are already routinely used in the clinic. Widely available CLIA-certified gene-panel assays can estimate SCNA and mutation loads, and immune infiltration can be assessed by standard immunohistochemistry or new imaging technologies. Therefore, after further study and validation, our decision tree might be able to aid in assignment of patients to treatment groups. It should be understood, however, that all of the clinically interesting possibilities illuminated by a project like Pan-Gyn should be considered as hypothesis-generators, yielding clues to be tested and, if possible, validated in follow-up studies.

DNA methylation data revealed large high- and low-methylation clusters. CESC, as well as luminal B and HER2⁺ BRCA tumors, showed high levels of DNA methylation, suggesting epigenetics as a driving force in those tumor types. Clustering based on DNA methylation separated MLH1-silenced (i.e., hypermutator) endometrioid UCEC samples from the non-MLH1-silenced ones, suggesting that MLH1 may not be specifically targeted for epigenetic silencing but, instead, may be silenced by a more generic mechanism that silences multiple genes.

Gene sets associated with myeloid and stem cell development suggest that TERC activity, initially identified in zebrafish, might play a role in human development as well (Chiu et al., 2017). In the present study, CESC and OV showed positive correlation of TERC with MYC, TERT, telomere maintenance targets, miR-21, and CTNNB1 gene targets. However, serous UCEC showed a unique pattern of negative correlation with TERT targets, positive correlation with miR-21 targets, and no correlation with MYC, CTNNB1, or telomere maintenance targets. In luminal A BRCA, miR-21 targets were positively correlated with TERC.

Pathway and subtype analyses revealed an important role for immune markers. OV, basal-like BRCA, luminal BRCA, and HER2⁺ BRCA cancer samples split into immune-high and immune-low subtypes. Immune-high HER2⁺ tumors showed a trend toward longer survival than their immune-low counterpart, but the difference was not quite statistically significant for the sample size available. Most of the CESC samples showed high immune marker signatures, likely due to their almost 100% prevalence of HPV. In contrast, most of the UCEC and UCS samples showed little immune infiltration. The high-immune subsets might potentially benefit from immunotherapy.

Pathway analysis unexpectedly showed that approximately half of the basal-like BRCA cancers clustered together with the HER2 and luminal B samples, whereas the other half did not, suggesting pathway-level similarities not detected at the level of single RNAs. The similarities included higher inferred activation of AR signaling and lower enrichment of FOXA1, FOXA2, and XBP1/2, as well as the WNT and SHH pathways. Those observations are consistent with convergence of diverse transcriptional events on a limited number of functional pathways. Additional study will be required to test the robustness of those observations.

In summary, this integrative, multiplatform Pan-Gyn analysis has confirmed similarities previously identified across the five tumor types and identified relationships not observed in previous studies of the individual diseases. A number of the observations have possible prognostic and/or therapeutic relevance. Our capture of major molecular information content using a simple six-parameter binary decision tree could facilitate the clinical use of Pan-Gyn molecular subtypes and may help in selection for and administration of therapeutic trials across the Pan-Gyn spectrum. However, all of the clinical possibilities illuminated by this study will require extensive additional research, particularly functional validation (which is beyond the intended scope of TCGA studies), before they would be ready for practical application. In addition to its particular observations, this study presents a broad-based, curated atlas of Pan-Gyn molecular features that we believe will be useful as a starting point for many researchers in the field.

STAR METHODS

KEY RESOURCES TABLE

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Antibodies
RPPA antibodies	RPPA Core Facility, MD Anderson Cancer Center; Tibes et al., 2006; Gonzalez-Angulo et al., 2011	https://www.mdanderson.org/research/research-resources/core-facilities/functional-proteomics-rppa-core.html
Bacterial and Virus Strains
N/A	N/A	N/A
Biological Samples
Primary tumor samples	Multiple tissue source sites, processed through the Biospecimen Core Resource	See Methods: SUBJECT DETAILS, METHOD DETAILS
Chemicals, Peptides, and Recombinant Proteins
N/A	N/A	N/A
Critical Commercial Assays
AmpFISTR Identifier kit	Applied Biosystems	Cat: A30737
DNA/RNA AllPrep kit	Qiagen	Cat: 80204
Genome-Wide Human SNP Array 6.0	ThermoFisher Scientific	Cat: 901153
HumanMethylation450	Illumina	Cat: HM450
Illumina Barcoded Paired-End Library Preparation Kit	Illumina	https://www.illumina.com/techniques/sequencing/ngs-library-prep.html
Infinium HumanMethylation450 BeadChip Kit	Illumina	Cat: WG-314-1002
mirVana miRNA Isolation kit	Ambion
Phusion High-Fidelity PCR Master Mix with HF Buffer	New England Biolabs	Cat: M0531L
QiaAmp blood midi kit	Qiagen	Cat: 51185
RNA6000 Nano Assay	Agilent	Cat: 5067-1511
SureSelect Human All Exon 50 Mb	Agilent	Cat: G3370J
TruSeq PE Cluster Generation Kit	Illumina	Cat: PE-401-3001
TruSeq RNA Library Prep Kit	Illumina	Cat: RS-122-2001
VECTASTAIN Elite ABC HRP Kit (Peroxidase, Standard)	Vector Lab	Cat: PK-6100
Deposited Data
Digital pathology images	Genomic Data Commons; Cancer Digital Slide Archive	https://gdc-portal.nci.nih.gov/legacy-archive/; http://cancer.digitalslidearchive.net/
Raw and processed clinical, array, and sequencing data	Genomic Data Commons	https://portal.gdc.cancer.gov/legacy-archive/
Experimental Models: Cell Lines
N/A	N/A	N/A
Experimental Models: Organisms/Strains
N/A	N/A	N/A
Oligonucleotides
N/A	N/A	N/A
Recombinant DNA
N/A	N/A	N/A
Software and Algorithms
ABSOLUTE	Carter et al., 2012	PMID: 22544022
ABySS v1.3.4	Simpson et al., 2009	PMID: 19251739
ABySS v1.4.8	Robertson et al., 2010	PMID: 20935650
BioBloomTools(v1.2.4.b)	Chu et al., 2014	PMID: 25143290
Birdseed	Korn et al., 2008	PMID: 18776909
Blastn (v2.2.23)	Altshul et al., 1997	PMID: 9254694
CARNAC	Totoki et al., 2014	PMID: 25362482
Circular Binary Segmentation	Olshen et al., 2004	PMID: 15475419
ConsensusClusterPlus	Wilkerson and Hayes, 2010	PMID: 20427518
ContEst	Cibulskis et al., 2011	PMID: 21803805
deconstructSigs	Rosenthal et al., 2016	PMID: 26899170
deFuse	McPherson et al., 2011	PMID: 21625565
FireHose	The Broad Institute of MIT & Harvard	https://www.broadinstitute.org/cancer/cga/Firehose
GenePattern	Reich et al., 2006	PMID: 16642009
GISTIC2.0	Mermel et al., 2011	PMID: 21527027
Indelocator	Ratan et al., 2015	PMID: 25879703
MAP-RSeq		https://bioinformaticstools.mayo.edu/research/maprseq
MapSplice 0.7.4	Wang et al., 2010	PMID: 20802226
MuTect	Cibulskis et al., 2013	PMID: 23396013
MutSigCV v1.4	Lawrence et al., 2013	PMID: 23770567
Next-Generation Clustered Heatmap	MD Anderson Cancer Center	https://bioinformatics.mdanderson.org/TCGA/NGCHMPortal/
Oncotator	Ramos et al., 2015	PMID: 25703262
PARADIGM	Vaske et al., 2010	PMID: 20529912
PathSeq	Kostic et al., 2011	PMID: 21552235
Picard	The Broad Institute of MIT & Harvard	https://picard.sourceforge.net/
PRADA	Torres-Garcia et al., 2014	PMID: 24695405
RADIA	Radenbaugh et al., 2014	PMID: 25405470
RSEM	Li and Dewey, 2011	PMID: 21816040
SNPFileCreator	Li and Hung Wong, 2001	PMID: 11532216
SomaticSignatures	Gehring et al., 2015	PMID: 26163694
STAR	Dobin et al., 2013	PMID: 23104886
Strelka	Saunders et al., 2012	PMID: 22581179
Tophat v2.0.8	Trapnell et al., 2009	PMID: 19289445
WEKA	Smith and Frank, 2016	PMID: 27008023
Other
N/A	N/A	N/A

Open in a new tab

CONTACT FOR RESOURCE SHARING

Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Rehan Akbani (rakbani@mdanderson.org).

SUBJECT DETAILS

Human data and tumor data selection

Molecular data were obtained from patients that had not received prior treatment for their disease (ablation, chemotherapy, or radiation therapy) and had provided informed consent as part of The Cancer Genome Atlas Project (TCGA). Local Institutional Review Boards (IRBs) at the tissue source sites reviewed protocols to approve submission of cases.

We selected samples from five TCGA projects to represent the gynecologic cancers: breast invasive carcinoma (BRCA), endocervical adenocarcinoma (CESC), high-grade serous ovarian cystadenocarcinoma (OV), uterine corpus endometrial carcinoma (UCEC), and uterine carcinosarcoma (UCS). Sample selection was based on availability of data and propriety of genomic features. Eight CESC samples designated as UCEC-like using mRNA data and 14 OV cases lacking TP53 mutations were excluded. The Pan-Gyn cohort was eventually comprised of 2579 cases, consisting of 1087 BRCA cases, 579 OV cases, 548 UCEC cases, 308 CESC cases, and 57 UCS cases.

TCGA Project Management collected necessary human subjects documentation to ensure the project complies with 45-CFR-46 (the “Common Rule”). The program has obtained documentation from every contributing clinical site to verify that IRB approval has been obtained to participate in TCGA. Such documented approval may include one or more of the following:

An IRB-approved protocol with Informed Consent specific to TCGA or a substantially similar program. In the latter case, if the protocol was not TCGA-specific, the clinical site PI provided a further finding from the IRB that the already-approved protocol is sufficient to participate in TCGA.
A TCGA-specific IRB waiver has been granted.
A TCGA-specific letter that the IRB considers one of the exemptions in 45-CFR-46 applicable. The two most common exemptions cited were that the research falls under 46.102(f)(2) or 46.101(b)(4). Both exempt requirements for informed consent, because the received data and material do not contain directly identifiable private information.
A TCGA-specific letter that the IRB does not consider the use of these data and materials to be human subjects research. This was most common for collections in which the donors were deceased.

METHOD DETAILS

Sample processing

Cases were staged according to the American Joint Committee on Cancer (AJCC). Each frozen primary tumor specimen had a companion normal tissue specimen (blood or blood components, including DNA extracted at the tissue source site). Adjacent tissue was submitted for some cases. Specimens were shipped overnight using a cryoport that maintained an average temperature of less than −180°C.

RNA and DNA were extracted from tumor and adjacent normal tissue specimens using a modification of the DNA/RNA AllPrep kit (Qiagen). The flow-through from the Qiagen DNA column was processed using a mirVana miRNA Isolation Kit (Ambion). This latter step generated RNA preparations that included RNA <200 nt suitable for miRNA analysis. DNA was extracted from blood using the QiaAmp blood midi kit (Qiagen). Each specimen was quantified by measuring Abs260 with a UV spectrophotometer or by PicoGreen assay. DNA specimens were resolved by 1% agarose gel electrophoresis to confirm high molecular weight fragments. A custom Sequenom SNP panel or the AmpFISTR Identifiler (Applied Biosystems) was utilized to verify tumor DNA and germline DNA were derived from the same patient. Five hundred nanograms of each tumor and normal DNA were sent to Qiagen for REPLI-g whole genome amplification using a 100 g reaction scale. Only specimens yielding a minimum of 6.9 g of tumor DNA, 5.15 g RNA, and 4.9 g of germline DNA were included in this study. RNA was analyzed via the RNA6000 nano assay (Agilent) for determination of an RNA Integrity Number (RIN), and only the cases with RIN >7.0 were included in this study. Reasons for rejection are described at https://tcga-data.nci.nih.gov/datareports.

DNA sequencing data

Exome capture was performed using Agilent SureSelect Human All Exon 50 Mb according to the manufacturers’ instructions. Briefly, 0.5–3 micrograms of DNA from each sample were used to prepare the sequencing library through shearing of the DNA followed by ligation of sequencing adaptors. All whole exome (WES) and whole genome (WGS) sequencing was performed on the Illumina HiSeq platform. Paired-end sequencing (2 × 101 bp for WGS and 2 × 76 bp for WE) was carried out using HiSeq sequencing instruments; the resulting data was analyzed with the current Illumina pipeline. Basic alignment and sequence QC was done on the Picard and Firehose pipelines at the Broad Institute. Sequencing data were processed using two consecutive pipelines:

Sequencing data processing pipeline (“Picard pipeline”). Picard (http://picard.sourceforge.net/) uses the reads and qualities produced by the Illumina software for all lanes and libraries generated for a single sample (either tumor or normal) and produces a single BAM file (http://samtools.sourceforge.net/SAM1.pdf) representing the sample. The final BAM file stores all reads and calibrated qualities along with their alignments to the genome.
Cancer genome analysis pipeline (“Firehose pipeline”). Firehose (http://www.broadinstitute.org/cancer/cga/Firehose) takes the BAM files for the tumor and patient-matched normal samples and performs analyses including quality control, local realignment, mutation calling, small insertion and deletion identification, rearrangement detection, coverage calculations and others as described briefly below. The pipeline represents a set of tools for analyzing massively parallel sequencing data for both tumor DNA samples and their patient-matched normal DNA samples. Firehose uses GenePattern (Reich et al., 2006) as its execution engine for pipelines and modules based on input files specified by Firehose. The pipeline contains the following steps:
1. Quality control. This step confirms identity of individual tumor and normal to avoid mix-ups between tumor and normal data for the same individual.
2. Local realignment of reads. This step realigns reads at sites that potentially harbor small insertions or deletions in either the tumor or the matched normal, to decrease the number of false positive single nucleotide variations caused by misaligned reads.
3. Identification of somatic single nucleotide variations (SSNVs). This step detects candidate SSNVs using a statistical analysis of the bases and qualities in the tumor and normal BAMs, using Mutect (Cibulskis et al., 2013).
4. Identification of somatic small insertions and deletions. In this step, putative somatic events were first identified within the tumor BAM file and then filtered out using the corresponding normal data, using Indellocator (Ratan et al., 2015).

Molecular features that distinguished Pan-Gyn from other tumor types

Mutations and CNVs

We identified the mutation and CNV events, which are enriched in gynecologic cancers (BRCA, UCEC, CESC, UCS, and OV) compared to all other cancers. 617 oncogenes listed in the COSMIC census list are included in the analysis. A multi-step statistical enrichment analysis method is devised for this purpose and applied to mutation and CNV data separately (see methods subsection dedicated to CNV data below for reference). The analysis involves creating a contingency table for altered vs. unaltered cases in Pan-Gyn vs. other cancers. First, the bias from sample sizes in different cancer types are removed by normalizing the alteration counts in each cancer type with sample sizes. For this purpose, an expected gynecological alteration/gene is calculated. For each gene, the mutation or a CNV high-level amplification (i.e. GISTIC thresholded CN value of +2) count in each gynecological cancer type is divided by the number of samples in the associated disease type and multiplied by the total number of samples (after normalization to hundred samples/disease for mutation counts) in the gynecological cancers. The same normalization is performed for non-gynecological cancers. This is critical to avoid cancers with large sample size (e.g. BRCA, N = 982 vs. UCS, N = 57) dominate the whole analysis. The genes with p value _adjusted < 0.01 for mutation and p value _adjusted < 0.05 for CNV are visualized in Figure 1A-B (see Statistical Analysis section for details on calculation of p values).

We addressed the question of whether the Pan-Gyn tumor types (BRCA, OV, USC, UCEC, CESC) share a significantly larger number of enriched mutated genes compared to a null distribution of enriched mutated genes in randomly selected 5 disease types. The bootstrapping analysis in Figure S1C-D involves an iterative process of randomly selecting 5 cancer types out of non-gyn cancers (N=28), calculating number of enriched genes in the randomly selected group using the same criteria (Fisher exact test with FDR-adjusted p value < 0.01) we used for generating Figure 1A. The iterations were performed 10,000 times to generate the null distribution. Following the same strategy, we performed CNA analysis using gene level CNA results from GISTIC2.0 (Mermel et al., 2011).

DNA methylation

Aim of this section was to identify genes that are differentially methylated in gynecological tumors (BRCA, UCEC, CESC, UCS, OV) versus the other tumor types. For this purpose we used two different approaches. We first mapped all the probes from the sequencing platforms to unique genes. For genes having more than one probe mapping to its promoter, median beta value was considered. For the first analysis, a threshold beta value of 0.3 was used to call methylation status of genes. Having converted our data to binary form, we then counted the percentage of samples of each tumor type in whom the gene was in methylated state. By taking percentages instead of just the number of cases for each tumor, we could correct for variation in number of samples that were available for each type. For example, whereas 966 cases of breast cancer (BRCA) was available, only 36 cases were available for cholangiocarcinoma (CHOL). To make sure that our analysis does not get skewed by this variation in sample sizes, we normalized number of samples for each tumor type to 100. We then grouped samples into Gyn vs non-Gyn cancers, and again adjusted size of each population to 100. Refer to Statistical Analysis section for details on the identification of significant genes.

In order to get more robust results, we performed a second kind of analysis to identify significant differentially methylated genes. We logit transformed the beta values into M-values, z-normalized the scores across all samples for a given gene, and took median across all member samples as the methylation score for each tumor type. We then dichotomized the dataset into gyn and non-gyn populations and identified the statistically significant genes between the two populations (see Statistical Analysis section for details).

We then compared the lists of statistically significant genes from the two analyses. A total of 197 genes were called significantly differentially methylated between the two populations by both our analyses. The median beta values of these genes across member samples of each tumor type were then plotted into a heatmap, with the Z-normalized M values being used for hierarchical clustering of genes using Euclidean distances and Ward’s linkage.