Author manuscript; available in PMC: 2018 Apr 23.
Published in final edited form as: Nature. 2017 Oct 23;551(7678):92–94. doi: 10.1038/nature24284

Association analysis identifies 65 new breast cancer risk loci

Kyriaki Michailidou 1,2,#, Sara Lindström 4,5,#, Joe Dennis 1,#, Jonathan Beesley 6,#, Shirley Hui 7,#, Siddhartha Kar 8,#, Audrey Lemaçon 9, Penny Soucy 9, Dylan Glubb 6, Asha Rostamianfar 7, Manjeet K Bolla 1, Qin Wang 1, Jonathan Tyrer 8, Ed Dicks 8, Andrew Lee 1, Zhaoming Wang 10,11, Jamie Allen 1, Renske Keeman 12, Ursula Eilber 13, Juliet D French 6, Xiao Qing Chen 6, Laura Fachal 8, Karen McCue 6, Amy E McCart Reed 14, Maya Ghoussaini 8, Jason Carroll 15, Xia Jiang 5, Hilary Finucane 5,16, Marcia Adams 17, Muriel A Adank 18, Habibul Ahsan 19, Kristiina Aittomäki 20, Hoda Anton-Culver 21, Natalia N Antonenkova 22, Volker Arndt 23, Kristan J Aronson 24, Banu Arun 25, Paul L Auer 26,27, François Bacot 28, Myrto Barrdahl 13, Caroline Baynes 8, Matthias W Beckmann 29, Sabine Behrens 13, Javier Benitez 30,31, Marina Bermisheva 32, Leslie Bernstein 33, Carl Blomqvist 34, Natalia V Bogdanova 22,35,36, Stig E Bojesen 37,38,39, Bernardo Bonanni 40, Anne-Lise Børresen-Dale 41, Judith S Brand 42, Hiltrud Brauch 43,44,45, Paul Brennan 46, Hermann Brenner 23,45,47, Louise Brinton 48, Per Broberg 49, Ian W Brock 50, Annegien Broeks 12, Angela Brooks-Wilson 51,52, Sara Y Brucker 53, Thomas Brüning 54, Barbara Burwinkel 55,56, Katja Butterbach 23, Qiuyin Cai 57, Hui Cai 57, Trinidad Caldés 58, Federico Canzian 59, Angel Carracedo 60,61, Brian D Carter 62, Jose E Castelao 63, Tsun L Chan 64,65, Ting-Yuan David Cheng 66, Kee Seng Chia 67, Ji-Yeob Choi 68,69, Hans Christiansen 35, Christine L Clarke 70; NBCS Collaborators41,71,72,73,74,75,76,77,78,79,80,81,82,83,84, Margriet Collée 85, Don M Conroy 8, Emilie Cordina-Duverger 86, Sten Cornelissen 12, David G Cox 87,88, Angela Cox 50, Simon S Cross 89, Julie M Cunningham 90, Kamila Czene 42, Mary B Daly 91, Peter Devilee 92,93, Kimberly F Doheny 17, Thilo Dörk 36, Isabel dos-Santos-Silva 94, Martine Dumont 9, Lorraine Durcan 95,96, Miriam Dwek 97, Diana M Eccles 96, Arif B Ekici 98, A Heather Eliassen 99,100, Carolina Ellberg 
49,101, Mingajeva Elvira 102, Christoph Engel 103,104, Mikael Eriksson 42, Peter A Fasching 29,105, Jonine Figueroa 48,106, Dieter Flesch-Janys 107,108, Olivia Fletcher 109, Henrik Flyger 110, Lin Fritschi 111, Valerie Gaborieau 46, Marike Gabrielson 42, Manuela Gago-Dominguez 60,112, Yu-Tang Gao 113, Susan M Gapstur 62, José A García-Sáenz 58, Mia M Gaudet 62, Vassilios Georgoulias 114, Graham G Giles 115,116, Gord Glendon 117, Mark S Goldberg 118,119, David E Goldgar 120, Anna González-Neira 30, Grethe I Grenaker Alnæs 41, Mervi Grip 121, Jacek Gronwald 122, Anne Grundy 123, Pascal Guénel 86, Lothar Haeberle 29, Eric Hahnen 124,125,126, Christopher A Haiman 127, Niclas Håkansson 128, Ute Hamann 129, Nathalie Hamel 28, Susan Hankinson 130, Patricia Harrington 8, Steven N Hart 131, Jaana M Hartikainen 132,133,134, Mikael Hartman 67,135, Alexander Hein 29, Jane Heyworth 136, Belynda Hicks 11, Peter Hillemanns 36, Dona N Ho 65, Antoinette Hollestelle 137, Maartje J Hooning 137, Robert N Hoover 48, John L Hopper 116, Ming-Feng Hou 138, Chia-Ni Hsiung 139, Guanmengqian Huang 129, Keith Humphreys 42, Junko Ishiguro 140,141, Hidemi Ito 140,141, Motoki Iwasaki 142, Hiroji Iwata 143, Anna Jakubowska 122, Wolfgang Janni 144, Esther M John 145,146,147, Nichola Johnson 109, Kristine Jones 11, Michael Jones 148, Arja Jukkola-Vuorinen 149, Rudolf Kaaks 13, Maria Kabisch 129, Katarzyna Kaczmarek 122, Daehee Kang 68,69,150, Yoshio Kasuga 151, Michael J Kerin 152, Sofia Khan 153, Elza Khusnutdinova 32,102, Johanna I Kiiski 153, Sung-Won Kim 154, Julia A Knight 155,156, Veli-Matti Kosma 132,133,134, Vessela N Kristensen 41,77,78, Ute Krüger 49, Ava Kwong 64,157,158, Diether Lambrechts 159,160, Loic Le Marchand 161, Eunjung Lee 127, Min Hyuk Lee 162, Jong Won Lee 163, Chuen Neng Lee 135,164, Flavio Lejbkowicz 165, Jingmei Li 42, Jenna Lilyquist 131, Annika Lindblom 166, Jolanta Lissowska 167, Wing-Yee Lo 43,44, Sibylle Loibl 168, Jirong Long 57, Artitaya Lophatananon 169,170, Jan 
Lubinski 122, Craig Luccarini 8, Michael P Lux 29, Edmond SK Ma 64,65, Robert J MacInnis 115,116, Tom Maishman 95,96, Enes Makalic 116, Kathleen E Malone 171, Ivana Maleva Kostovska 172, Arto Mannermaa 132,133,134, Siranoush Manoukian 173, JoAnn E Manson 100,174, Sara Margolin 175, Shivaani Mariapun 176, Maria Elena Martinez 112,177, Keitaro Matsuo 141,178, Dimitrios Mavroudis 114, James McKay 46, Catriona McLean 179, Hanne Meijers-Heijboer 18, Alfons Meindl 180, Primitiva Menéndez 181, Usha Menon 182, Jeffery Meyer 90, Hui Miao 67, Nicola Miller 152, Nur Aishah Mohd Taib 183, Kenneth Muir 169,170, Anna Marie Mulligan 184,185, Claire Mulot 186, Susan L Neuhausen 33, Heli Nevanlinna 153, Patrick Neven 187, Sune F Nielsen 37,38, Dong-Young Noh 188, Børge G Nordestgaard 37,38,39, Aaron Norman 131, Olufunmilayo I Olopade 189, Janet E Olson 131, Håkan Olsson 49, Curtis Olswold 131, Nick Orr 109, V Shane Pankratz 190, Sue K Park 68,69,150, Tjoung-Won Park-Simon 36, Rachel Lloyd 191, Jose IA Perez 192, Paolo Peterlongo 193, Julian Peto 94, Kelly-Anne Phillips 116,194,195,196, Mila Pinchev 165, Dijana Plaseska-Karanfilska 172, Ross Prentice 26, Nadege Presneau 97, Darya Prokofieva 102, Elizabeth Pugh 17, Katri Pylkäs 197,198, Brigitte Rack 199, Paolo Radice 200, Nazneen Rahman 201, Gadi Rennert 165, Hedy S Rennert 165, Valerie Rhenius 8, Atocha Romero 58,202, Jane Romm 17, Kathryn J Ruddy 203, Thomas Rüdiger 204, Anja Rudolph 13, Matthias Ruebner 29, Emiel J Th Rutgers 205, Emmanouil Saloustros 206, Dale P Sandler 207, Suleeporn Sangrajrang 208, Elinor J Sawyer 209, Daniel F Schmidt 116, Rita K Schmutzler 124,125,126, Andreas Schneeweiss 55,210, Minouk J Schoemaker 148, Fredrick Schumacher 211, Peter Schürmann 36, Rodney J Scott 212,213, Christopher Scott 131, Sheila Seal 201, Caroline Seynaeve 137, Mitul Shah 8, Priyanka Sharma 214, Chen-Yang Shen 215,216, Grace Sheng 127, Mark E Sherman 217, Martha J Shrubsole 57, Xiao-Ou Shu 57, Ann Smeets 187, Christof Sohn 210, 
Melissa C Southey 218, John J Spinelli 219,220, Christa Stegmaier 221, Sarah Stewart-Brown 169, Jennifer Stone 191,222, Daniel O Stram 127, Harald Surowy 55,56, Anthony Swerdlow 148,223, Rulla Tamimi 5,99,100, Jack A Taylor 207,224, Maria Tengström 132,225,226, Soo H Teo 176,183, Mary Beth Terry 227, Daniel C Tessier 28, Somchai Thanasitthichai 228, Kathrin Thöne 108, Rob AEM Tollenaar 229, Ian Tomlinson 230, Ling Tong 19, Diana Torres 129,231, Thérèse Truong 86, Chiu-chen Tseng 127, Shoichiro Tsugane 232, Hans-Ulrich Ulmer 233, Giske Ursin 234,235, Michael Untch 236, Celine Vachon 131, Christi J van Asperen 237, David Van Den Berg 127, Ans MW van den Ouweland 85, Lizet van der Kolk 238, Rob B van der Luijt 239, Daniel Vincent 28, Jason Vollenweider 90, Quinten Waisfisz 18, Shan Wang-Gohrke 240, Clarice R Weinberg 241, Camilla Wendt 175, Alice S Whittemore 146,147, Hans Wildiers 187, Walter Willett 100,242, Robert Winqvist 197,198, Alicja Wolk 128, Anna H Wu 127, Lucy Xia 127, Taiki Yamaji 142, Xiaohong R Yang 48, Cheng Har Yip 243, Keun-Young Yoo 244,245, Jyh-Cherng Yu 246, Wei Zheng 57, Ying Zheng 247, Bin Zhu 11, Argyrios Ziogas 21, Elad Ziv 248; ABCTB Investigators249; kConFab/AOCS Investigators194,250, Sunil R Lakhani 14,251, Antonis C Antoniou 1, Arnaud Droit 9, Irene L Andrulis 117,252, Christopher I Amos 253, Fergus J Couch 90, Paul DP Pharoah 1,8, Jenny Chang-Claude 13,254, Per Hall 42,255, David J Hunter 5,100, Roger L Milne 115,116, Montserrat García-Closas 48, Marjanka K Schmidt 12,256, Stephen J Chanock 48, Alison M Dunning 8, Stacey L Edwards 6, Gary D Bader 7, Georgia Chenevix-Trench 6, Jacques Simard 9,257, Peter Kraft 5,100,257, Douglas F Easton 1,8,257
PMCID: PMC5798588  EMSID: EMS74191  PMID: 29059683

Abstract

Breast cancer risk is influenced by rare coding variants in susceptibility genes such as BRCA1 and many common, mainly non-coding variants. However, much of the genetic contribution to breast cancer risk remains unknown. We report results from a genome-wide association study (GWAS) of breast cancer in 122,977 cases and 105,974 controls of European ancestry and 14,068 cases and 13,104 controls of East Asian ancestry [1]. We identified 65 new loci associated with overall breast cancer at P<5×10⁻⁸. The majority of credible risk SNPs in the new loci fall in distal regulatory elements, and by integrating in-silico data to predict target genes in breast cells at each locus, we demonstrate a strong overlap between candidate target genes and somatic driver genes in breast tumours. We also find that heritability of breast cancer due to all SNPs in regulatory features was 2-5-fold enriched relative to the genome-wide average, with strong enrichment for particular transcription factor binding sites. These results provide further insight into genetic susceptibility to breast cancer and will improve the utility of genetic risk scores for individualized screening and prevention.


We genotyped 61,282 female breast cancer cases and 45,494 female controls of European ancestry with the OncoArray [1]. Subjects came from 68 studies collaborating in the Breast Cancer Association Consortium (BCAC) and Discovery, Biology and Risk of Inherited Variants in Breast Cancer Consortium (DRIVE) (Supplementary Table 1). Using the 1000 Genomes Project (Phase 3) reference panel, we imputed genotypes for ~21M variants. After filtering on minor allele frequency (MAF)>0.5% and imputation quality score>0.3 (see Online Methods), we assessed the association between breast cancer risk and 11.8M SNPs, adjusting for country and ancestry-informative principal components. We combined these results with results from the iCOGS project (46,785 cases and 42,892 controls) [2] and 11 other breast cancer GWAS (14,910 cases, 17,588 controls), using a fixed-effect meta-analysis.

Of 102 loci previously associated with breast cancer in Europeans, 49 showed evidence for association with overall breast cancer in the OncoArray dataset at P<5×10⁻⁸ and 94 at P<0.05. Five additional loci previously shown to be associated with breast cancer in Asian women also showed evidence in the European ancestry OncoArray dataset (P<0.01; Supplementary Tables 2-4) [3-5]. We also assessed the association with breast cancer in Asians, including 7,799 cases and 6,480 controls from the OncoArray project and 6,269 cases and 6,624 controls from iCOGS. Of the 94 loci previously identified in Europeans that were polymorphic in Asians, 50 showed evidence of association (P<0.05). For the remaining 44, none showed a significant difference in the estimated odds ratio (OR) for overall breast cancer between Europeans and Asians (P>0.01; Supplementary Table 5). The correlation in effect sizes for all known loci between Europeans and Asians was 0.83, suggesting that the majority of known susceptibility loci are shared between these populations.

To search for additional susceptibility loci, we assessed all SNPs excluding those within 500kb of a known susceptibility SNP (Figure 1). This identified 5,969 variants in 65 regions that were associated with overall breast cancer risk at P<5×10⁻⁸ (Supplementary Tables 6-8). For two loci (lead SNPs rs58847541 and rs12628403), there was evidence of a second association signal after adjustment for the primary signal (rs13279803: conditional P=1.6×10⁻¹⁰; rs373038216: P=2.9×10⁻¹¹; Supplementary Table 9). Of the 65 new loci, 21 showed a differential association by ER-status (P<0.05), with all but two (rs6725517 and rs6569648) more strongly associated with ER-positive disease (Supplementary Tables 10-11). Forty-four loci showed evidence of association for ER-negative breast cancer (P<0.05). Of the 51 novel loci that were polymorphic in Asians, nine were associated at P<0.05 and only two showed a difference in the estimated OR between Europeans and Asians (P<0.01; Supplementary Table 12).
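
The exclusion step described above can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline: the function name `outside_known_regions` and the `(chrom, pos)` tuple layout are assumptions, and a SNP exactly 500kb from a known SNP is treated here as "within" the window.

```python
from bisect import bisect_left

WINDOW_BP = 500_000  # exclusion window stated in the text


def outside_known_regions(snps, known, window=WINDOW_BP):
    """Keep SNPs lying more than `window` bp from every known SNP.

    snps, known: iterables of (chrom, pos) tuples (hypothetical layout).
    Distances are only compared within the same chromosome.
    """
    # Index known positions per chromosome and sort them for binary search.
    known_by_chrom = {}
    for chrom, pos in known:
        known_by_chrom.setdefault(chrom, []).append(pos)
    for positions in known_by_chrom.values():
        positions.sort()

    kept = []
    for chrom, pos in snps:
        positions = known_by_chrom.get(chrom, [])
        i = bisect_left(positions, pos)
        # The nearest known SNP is either just before or at/after `pos`.
        neighbours = positions[max(i - 1, 0): i + 1]
        if all(abs(pos - k) > window for k in neighbours):
            kept.append((chrom, pos))
    return kept
```

Sorting the known positions once and using a binary search keeps the filter fast even for millions of imputed SNPs.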

Figure 1.

(a) Manhattan plot showing −log₁₀ P-values for SNP associations with overall breast cancer. (b) Manhattan plot after excluding previously identified associated regions. The red line denotes “genome-wide” significance (P<5×10⁻⁸); the blue line denotes P<10⁻⁵.

To define a set of credible risk variants (CRVs) at the new loci, we first selected variants with P-values within two orders of magnitude of the most significant SNPs in each region. Across the 65 novel regions, we identified 2,221 CRVs (Supplementary Table 13), while the previous 77 identified loci contained 2,232 CRVs (Online Methods; Supplementary Table 14). We examined the evidence for enrichment in these CRVs of 67 genomic features, including histone marks and transcription factor binding sites (TFBS) in three breast cancer cell lines (Online Methods; Supplementary Tables 15-16; Extended Data Fig. 1). Thirteen features were significant predictors of CRVs at P<10⁻⁴, the strongest being DNase I hypersensitivity sites in CTCF-silenced MCF7 cells (OR 2.38, P=4.6×10⁻¹⁴). Strong associations were also observed with binding sites for FOXA1, ESR1, GATA3, E2F1 and TCF7L2. Seven of the 65 novel loci included only a single CRV (Supplementary Table 6), of which two are non-synonymous. SNP rs16991615 is a missense variant (p.Glu341Lys) in MCM8, involved in genome replication and associated with age at natural menopause and impaired DNA repair [6]. SNP rs35383942 is a missense variant (p.Arg28Gln) in PHLDA3, encoding a p53-regulated repressor of AKT [7].
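
As a rough illustration of the first selection step above, the sketch below keeps every variant in a region whose P-value is at most 100 times that of the lead variant. The function name and the dict-based input are hypothetical, and the sketch covers only the P-value threshold, not any further processing the authors may have applied.

```python
def credible_risk_variants(pvalues, fold=100.0):
    """Return variant IDs whose P-value is within `fold` of the regional minimum.

    pvalues: dict mapping variant ID -> association P-value for ONE region
             (hypothetical input layout).
    fold:    100.0 corresponds to "two orders of magnitude".
    """
    if not pvalues:
        return []
    lead_p = min(pvalues.values())  # most significant P-value in the region
    return sorted(v for v, p in pvalues.items() if p <= lead_p * fold)
```

For example, with a lead P-value of 1e-12, a variant at 5e-11 would be retained and one at 2e-10 would not.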

We annotated each CRV with publicly available genomic data from breast cells in order to highlight potentially functional variants, predict target genes and prioritise future experimental validation (Supplementary Tables 7 and 13 with UCSC browser links). We developed a heuristic scoring system based on breast-specific genomic data (integrated expression quantitative trait and in silico prediction of GWAS targets - INQUISIT) to rank the target genes at each locus (Supplementary Table 17). Target genes were predicted by combining risk SNP data with multiple sources of genomic information, including chromatin interactions (ChIA-PET and Hi-C), computational enhancer-promoter correlations (PreSTIGE, IM-PET, FANTOM5 and Super-enhancers), breast tissue-specific eQTL results, TF binding (ENCODE ChIP-seq), gene expression (ENCODE RNA-seq) and topologically-associated domain (TAD) boundaries (Online Methods and Supplementary Tables 18-20). Target gene predictions could be made for 58/65 new and 70/77 previously identified loci. Among 689 protein-coding genes predicted by INQUISIT, we found strong enrichment for established breast cancer drivers identified through tumour sequencing (20/147 genes, P<10⁻⁶) [8-11], which increased with increasing INQUISIT score (P=1.8×10⁻⁶). We compared INQUISIT with a) an alternative published method (DEPICT, which predicts targets based on shared gene functions between potential targets at other associated loci) [12], which showed a weaker enrichment of breast cancer driver genes (P=0.06 after adjusting for the nearest gene, P=0.74 after adjusting for INQUISIT score), and b) assigning the association signal to the nearest gene, which showed only a weak enrichment of driver genes after adjusting for the INQUISIT score (P=0.01; Extended Data Table 1 and Supplementary Table 21). Notably, most of the 689 putative target genes have no reported involvement in breast tumorigenesis and some may represent additional genes influencing susceptibility to breast cancer. However, functional assays will be required to confirm any of these candidates as risk genes.

Having used INQUISIT to predict target genes, we performed pathway gene set enrichment analysis (GSEA), visually summarized as enrichment maps (Extended Data Fig. 2; Supplementary Tables 22-23) [13]. Several growth or development related pathways were enriched, notably the fibroblast growth factor, platelet derived growth factor and Wnt signalling pathways [14-16]. Other cancer-related themes included ERK1/2 cascade, immune-response pathways including interferon signalling, and cell-cycle pathways. Pathways not found in earlier breast cancer GWAS include nitric oxide biosynthesis, AP-1 transcription factor and NF-kB (Supplementary Table 24).

To explore more globally the genomic features contributing to breast cancer risk, we estimated the proportion of genome-wide SNP heritability attributable to 53 publicly available annotations [17]. We observed the largest enrichment in heritability (5.2-fold, P=8.5×10⁻⁵) for TFBS, followed by a 4-fold (P=0.0006) enrichment for histone mark H3K4me3 (marking promoters). In contrast, we observed a significant depletion (0.27, P=0.0007) for repressed regions (Supplementary Table 25). We conducted cell type-specific enrichment analysis for four histone marks and observed significant enrichments in several tissue types (Figure 2; Extended Data Figs. 3-7; Supplementary Tables 26-27), including a 6.7-fold enrichment for H3K4me1 in breast myoepithelial tissue (P=7.9×10⁻⁵). We compared the cell type-specific enrichments for overall, ER-positive and ER-negative breast cancer to the enrichments for 16 other complex traits (Extended Data Figs. 3-7). Breast cancer showed enrichment for adipose and epithelial cell types (including breast epithelial cells). In contrast, psychiatric diseases showed enrichment specific to central-nervous-system cell types and autoimmune disorders showed enrichment for immune cells.
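
A fold-enrichment of this kind is commonly defined as the share of SNP heritability captured by an annotation divided by the share of SNPs that the annotation contains. The sketch below assumes that definition; the text above does not spell it out, and the input proportions in the usage note are invented purely for illustration.

```python
def heritability_enrichment(prop_h2, prop_snps):
    """Fold-enrichment of heritability for one annotation.

    prop_h2:   proportion of SNP heritability attributed to the annotation.
    prop_snps: proportion of all SNPs that fall in the annotation.
    Values above 1 indicate enrichment, values below 1 indicate depletion.
    """
    if not 0.0 < prop_snps <= 1.0:
        raise ValueError("prop_snps must be in (0, 1]")
    return prop_h2 / prop_snps
```

Under this definition, an annotation holding 10% of SNPs (a made-up figure) would need to capture 52% of heritability to be 5.2-fold enriched, and only 2.7% to show a depletion of 0.27.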

We selected for further evaluation four loci to represent those predicted to act through proximal regulation (1p36 and 11p15) and distal regulation (1p34 and 7q22), because they had a relatively small number of CRVs. The only CRV at 1p36, rs2992756 (P=1.6×10⁻¹⁵), is located 84bp from the transcription start site of KLHDC7A. Of the 19 CRVs at 11p15 (smallest P=1.4×10⁻¹²), five were located in the proximal promoter of PIDD1, implicated in DNA-damage-induced apoptosis and tumorigenesis [18]. INQUISIT predicted KLHDC7A and PIDD1 to be target genes and they received the highest score for likelihood of promoter regulation (Supplementary Table 19). Using reporter assays, we showed that the KLHDC7A promoter construct containing the risk T-allele of rs2992756 has significantly lower activity than the reference construct, while the PIDD1 promoter construct containing the risk haplotype significantly increased PIDD1 promoter activity (Extended Data Fig. 8).

The 1p34 locus included four CRVs (smallest P=9.1×10⁻⁹) that fall within two putative regulatory elements (PREs) and are predicted by INQUISIT to regulate CITED4 (PREs; Extended Data Fig. 8). CITED4 encodes a transcriptional coactivator that interacts with CBP/p300 and TFAP2 and can inhibit hypoxia-activated transcription in cancer cells [19]. Chromatin conformation capture (3C) assays confirmed that the PREs physically interacted with the CITED4 promoter (Extended Data Fig. 8). Subsequent reporter assays showed that the PRE1 reference construct reduced CITED4 promoter activity, whereas the risk T-allele of SNP rs4233486 located in PRE1 negates this effect.

Finally, the 7q22 risk locus contained six CRVs (smallest P=5.1×10⁻¹²) which lie in several PREs spanning ~40kb of CUX1 intron 1. Chromatin interactions were identified between PRE1 (containing SNP rs6979850) and the CUX1/RASA4 promoters, and between PRE2 (containing SNP rs71559437) and the RASA4/PRKRIP1 promoters (Extended Data Fig. 9). Allele-specific 3C in heterozygous MDA-MB-231 cells showed that the risk haplotype was associated with chromatin looping, suggesting that the protective allele abrogates looping between the PREs and target genes (Extended Data Fig. 9). These results identify two mechanisms by which CRVs may impact target gene expression: through transactivation of a specific promoter and by affecting chromatin looping between regulatory elements and their target genes. These data provide in vitro evidence of target identification and regulation; however, further studies that include genome editing, oncogenic assays and/or animal models will be required to fully elucidate disease-related gene function.

We estimate that the newly identified susceptibility loci explain ~4% of the two-fold familial relative risk (FRR) of breast cancer and that in total, common susceptibility variants identified through GWAS explain 18% of the FRR. Further, we estimate that variants imputable from the OncoArray, under a log-additive model (see Online Methods), explain ~41% of the FRR, and thus, the identified susceptibility SNPs account for ~44% (18%/41%) of the FRR that can be explained by all imputable SNPs. The identified SNPs will be incorporated into risk prediction models, which can be used to improve the identification of women at high and low risk of breast cancer: for example, using a polygenic risk score based on the variants identified to date, women in the highest 1% of the distribution have a 3.5-fold greater breast cancer risk than the population average. Such risk prediction can inform targeted early detection and prevention.
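
The ~44% figure is simply the ratio of the two FRR proportions quoted above. A minimal check, using only the rounded percentages from the text (the function name is illustrative):

```python
def share_of_explainable_frr(identified, imputable):
    """Fraction of the FRR explainable by all imputable SNPs that is already
    accounted for by the identified susceptibility SNPs."""
    if imputable <= 0:
        raise ValueError("imputable proportion must be positive")
    return identified / imputable


# Rounded proportions quoted in the text: 18% identified, ~41% imputable.
share = share_of_explainable_frr(0.18, 0.41)
print(f"{share:.0%}")
```

Because both inputs are rounded, the ratio should be read as approximate.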

Online Methods

Details of the studies and genotype calling and quality control (QC) for the iCOGS and eleven other GWAS are described elsewhere [2,20]. Seventy-eight studies participated in the breast cancer component of the OncoArray, of which 67 studies contributed European ancestry data and 12 contributed Asian ancestry data (one study, NBCS, was excluded as there were no controls from Norway) (Supplementary Table 1). The majority of studies were population-based case-control studies, or case-control studies nested within population-based cohorts, but a subset of studies oversampled cases with a family history of the disease. All studies provided core data on disease status and age at diagnosis/observation, and the majority provided additional data on clinico-pathological factors and lifestyle factors, which have been curated and incorporated into the BCAC database (version 6). All participating studies were approved by their appropriate ethics review board and all subjects provided informed consent.

OncoArray SNP Selection

Approximately 50% of the SNPs for the OncoArray were selected as a “GWAS backbone” (Illumina HumanCore), which aimed to provide high coverage for the majority of common variants through imputation. The remaining SNPs were selected from lists supplied by each of six disease-based consortia, together with a seventh list of SNPs of interest to multiple disease-focused groups. Approximately 72k SNPs were selected specifically for their relevance to breast cancer. These included: (a) SNPs showing evidence of association from previous genotype data, based on a combined analysis of eleven existing GWAS together with the data from the iCOGS experiment; (b) SNPs showing evidence of association with ER-negative disease (through a combined analysis with the CIMBA consortium), triple negative disease, breast cancer diagnosed before age 40 years, high grade disease, node positive disease or ductal carcinoma-in-situ; (c) SNPs potentially associated with breast cancer survival; (d) SNPs selected for fine-mapping of 55 regions showing evidence of breast cancer association at genome-wide significance; (e) rare variants showing evidence of association through exome sequencing in multiple case families, whole-genome sequencing in high-risk cases (DRIVE), or analysis of the ExomeChip (BCAC); (f) specific follow-up of regions of interest from breast cancer GWAS in Asian, Latina and African/African-American women; (g) SNPs associated with breast density, selected from GWAS conducted by the MODE consortium; (h) breast tissue-specific eQTLs; (i) lists of functional candidates from >30 groups. Lists were merged with lists from the other consortia as described elsewhere [1].

OncoArray Calling and QC

Of the 568,712 variants selected for genotyping, 533,631 were successfully manufactured on the array (including 778 duplicate probes). Genotyping for the breast cancer component of the OncoArray, which included 152,492 samples, was conducted at six sites. The genotype calling for the OncoArray is described in more detail elsewhere [1]. Briefly, we developed a single calling pipeline that was applied to more than 500,000 samples. An initial cluster file was generated using data from 56,284 samples, selected to cover all the major genotyping centres and ethnicities, using the Gentrain2 algorithm. Variants likely to have problematic clusters were selected for manual inspection using the following criteria: call rate below 99%, minor allele frequency (MAF)<0.001, poor Illumina intensity and clustering metrics, or deviation from the expected frequency as observed in the 1000 Genomes Project. This resulted in manual adjustment of the cluster file for 3,964 variants, and the exclusion of 16,526 variants. The final cluster file was then applied to the full dataset.

We excluded probable duplicates and close relatives within each study, and probable duplicates across studies. We excluded samples with a call rate <95% or samples with extreme heterozygosity (4.89 SD from the mean for the ethnicity). Ancestry was computed using a principal component analysis, applied to the full OncoArray dataset, using 2,318 informative markers on a subset of ~47,000 samples. The analysis presented here was restricted to women of European ancestry, defined as individuals with an estimated proportion of European ancestry >0.8, and women of East Asian ancestry (estimated proportion of Asian ancestry >0.4), with reference to the HapMap (v2) populations, based on the first two principal components. After quality control exclusions and removing overlaps with the previous iCOGS and GWAS genotyping used in the analysis, the final dataset comprised data from 61,282 cases and 45,494 controls of European ancestry and 7,799 cases and 6,480 controls of Asian ancestry.
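
The sample-level thresholds above can be expressed as two small predicates. This is a sketch under stated assumptions: the function names and argument layout are hypothetical, "extreme heterozygosity" is read as more than 4.89 SD from the ancestry-group mean in either direction, and the ancestry proportions are taken as already estimated.

```python
def passes_sample_qc(call_rate, heterozygosity, group_mean, group_sd,
                     min_call_rate=0.95, max_het_sd=4.89):
    """Apply the call-rate and heterozygosity filters to one sample.

    group_mean, group_sd: heterozygosity mean and SD for the sample's
    ancestry group (assumed to be computed beforehand).
    """
    if call_rate < min_call_rate:
        return False
    return abs(heterozygosity - group_mean) <= max_het_sd * group_sd


def ancestry_group(prop_european, prop_asian):
    """Assign an analysis group from estimated ancestry proportions.

    Thresholds (>0.8 European, >0.4 East Asian) are those quoted in the text;
    samples meeting neither return None.
    """
    if prop_european > 0.8:
        return "European"
    if prop_asian > 0.4:
        return "East Asian"
    return None
```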

We excluded SNPs with a call rate <95% in any consortium, SNPs not in Hardy-Weinberg equilibrium (P<10⁻⁷ in controls or P<10⁻¹² in cases) and SNPs with concordance <98% among 5,280 duplicate sample pairs. For the imputation, we additionally excluded SNPs with a MAF<1% and a call rate <98% in any consortium, SNPs that could not be linked to the 1000 Genomes Project reference or differed significantly in frequency from the 1000 Genomes Project dataset (using the criterion $\frac{(p_1 - p_0)^2}{(p_1 + p_0)(2 - p_1 - p_0)} > 0.007$, where $p_0$ and $p_1$ are the MAFs in the 1000 Genomes Project and OncoArray European datasets, respectively). A further 1,128 SNPs where the cluster plot was judged to be not ideal on visual inspection were excluded. Of the 533,631 SNPs that were manufactured on the array, 494,763 SNPs passed the initial QC and 469,364 SNPs were used in the imputation.
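
The frequency-comparison criterion can be sketched as below. The sketch assumes the statistic has the form (p1 − p0)² / ((p1 + p0)(2 − p1 − p0)), which is a reading of the criterion in the text rather than something confirmed independently; the function names are illustrative.

```python
def frequency_mismatch(p1, p0):
    """Squared frequency difference scaled by the pooled allele frequencies.

    p1: MAF in the genotyped (OncoArray European) data.
    p0: MAF in the reference (1000 Genomes Project) data.
    ASSUMED form: (p1 - p0)^2 / ((p1 + p0) * (2 - p1 - p0)).
    """
    denom = (p1 + p0) * (2.0 - p1 - p0)
    if denom <= 0.0:
        raise ValueError("allele frequencies must not both be 0 or both be 1")
    return (p1 - p0) ** 2 / denom


def differs_from_reference(p1, p0, threshold=0.007):
    """True if the SNP would be excluded under the 0.007 threshold."""
    return frequency_mismatch(p1, p0) > threshold
```

For instance, frequencies of 0.30 versus 0.20 give a statistic of about 0.013 and would be excluded, whereas 0.25 versus 0.20 gives about 0.0036 and would be retained.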

Genotype Imputation

All samples were imputed using the October 2014 (version 3) release of the 1000 Genomes Project dataset as the reference panel and number of sampled haplotypes per individual (Nhap)=800. The iCOGS, OncoArray and nine of the GWAS datasets were imputed using a two-stage imputation approach, using SHAPEIT2 for phasing and IMPUTEv2 for imputation [21,22]. The imputation was performed in 5Mb non-overlapping intervals. The subjects were split into subsets of ~10,000 samples; where possible, subjects from the same study were included in the same subset. The BPC3 and EBCG studies were imputed separately using MACH and Minimac [23,24]. 99.6% of SNPs with frequency >1% were imputable with r²>0.3 in the OncoArray dataset and 99.1% in the iCOGS dataset. We generated estimated genotypes for all SNPs that were polymorphic (MAF>0.1%) in either European or Asian samples (~21M SNPs). For the current analysis, however, we restricted to SNPs with MAF>0.5% in the European OncoArray dataset (11.8M SNPs). One-step imputation (without pre-phasing) was performed on the iCOGS and OncoArray datasets as a quality control step for those associated loci where the imputation quality score was <0.9. Imputation quality for the lead variants, as assessed by the IMPUTE2 quality score in the OncoArray dataset, was >0.80 for all but one locus (Supplementary Table 28; rs72749841, quality score=0.65).
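
Splitting a chromosome into 5Mb non-overlapping intervals can be sketched as follows. The 1-based inclusive coordinates and the function name are assumptions made for illustration; the text does not state how interval boundaries were defined.

```python
def imputation_intervals(chrom_length, size=5_000_000):
    """Return non-overlapping (start, end) intervals covering 1..chrom_length.

    Coordinates are 1-based and inclusive (an assumption); the final interval
    is truncated at the chromosome end.
    """
    return [(start, min(start + size - 1, chrom_length))
            for start in range(1, chrom_length + 1, size)]
```

Each interval could then be passed to the imputation software as one job, which keeps memory use bounded and allows the jobs to run in parallel.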

Principal Components Analysis

To adjust for potential (intra-continental) population stratification in the OncoArray dataset, principal components analysis was performed using data from 33,661 uncorrelated SNPs (which included 2,318 SNPs specifically selected on informativeness for determining continental ancestry) with a MAF of at least 0.05 and maximum correlation of 0.1 in the OncoArray dataset, using purpose-written software (http://ccge.medschl.cam.ac.uk/software/pccalc). For the main analyses, we used the first ten principal components, as additional components did not further reduce inflation in the test statistics. We used nine principal components for the iCOGS and up to ten principal components for the other GWAS, where this was found to reduce inflation.

Statistical Analyses

Per-allele ORs and standard errors were generated for the OncoArray, iCOGS and each GWAS, adjusting for principal components using logistic regression. The OncoArray and iCOGS analyses were additionally adjusted for country and study, respectively. For the OncoArray analysis, we adjusted for country and 10 principal components. Adjustment for country rather than study was used to improve power since some studies had few or no controls. We evaluated the adequacy of this approach by comparing the inflation in the test statistic with that obtained in a corresponding analysis in which we adjusted for study; the inflation was very similar (λ=1.15 vs. 1.17, based on the backbone SNPs, equivalent to λ₁₀₀₀=1.003, for a study of 1,000 cases and 1,000 controls, in both cases). As an additional sensitivity analysis, we computed the effect sizes for the 65 novel loci adjusting for study; the effect sizes were essentially identical to those presented. Estimates were derived using ProbAbel for the BPC3 and EBCG studies [25], SNPTEST for the remaining GWAS and purpose-written software for the iCOGS and OncoArray datasets. OR estimates and standard errors were combined in a fixed-effects inverse-variance meta-analysis using METAL [26], adjusting the GWAS (but not iCOGS or OncoArray) results for genomic control as described previously [2]. For the GWAS, results were included in the analysis for all SNPs with MAF>0.01 and imputation r²>0.3. For iCOGS and OncoArray we included all SNPs with r²≥0.3 and MAF>0.005 (11.8M SNPs in total). We viewed the primary tests of association as those based on the meta-analysis over all stages, as this has been shown to be more powerful than tests based on a test-replication approach [27]. Eight sets of variants were associated with breast cancer at P<5×10⁻⁸ but were close to previous susceptibility regions, and these became non-significant after adjustment for the previously identified lead variant. Two SNPs on 22q13.2, rs141447235 and rs73161324, were both associated with overall breast cancer but, despite lying >500kb apart, were strongly correlated with each other (r²=0.50) and hence were considered as a single novel signal.
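
Two calculations in this paragraph lend themselves to a short sketch: the inverse-variance fixed-effects combination of per-dataset log-ORs, and the rescaling of an inflation factor to a study of 1,000 cases and 1,000 controls. The code below is not METAL and not the authors' software. The rescaling formula is the commonly used one and is assumed here; with the OncoArray European sample sizes quoted earlier (61,282 cases, 45,494 controls) it reproduces the reported value of 1.003 for both λ=1.15 and λ=1.17.

```python
import math


def fixed_effect_meta(estimates):
    """Inverse-variance weighted fixed-effects meta-analysis.

    estimates: list of (log_or, standard_error) pairs, one per dataset.
    Returns the pooled log-OR and its standard error.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


def lambda_1000(lam, n_cases, n_controls):
    """Rescale inflation factor `lam` to 1,000 cases and 1,000 controls.

    ASSUMED formula: 1 + (lam - 1) * (1/n_cases + 1/n_controls) / (2/1000).
    """
    scale = (1.0 / n_cases + 1.0 / n_controls) / (1.0 / 1000 + 1.0 / 1000)
    return 1.0 + (lam - 1.0) * scale


print(round(lambda_1000(1.15, 61282, 45494), 3))
print(round(lambda_1000(1.17, 61282, 45494), 3))
```

The pooled log-OR can be exponentiated to obtain the per-allele OR, and a two-sided P-value follows from the ratio of the pooled estimate to its standard error.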

For SNPs showing evidence of association, we additionally computed genotype-specific ORs for the iCOGS and OncoArray dataset, and per-allele ORs for ER-negative and ER-positive disease. Departures from a log-additive model were evaluated using a one degree of freedom likelihood ratio test, comparing the log-additive model (genotypes parametrised as the number of rare alleles carried) with the general model estimating ORs for each genotype. The genotype-specific risks for all variants were consistent with a log-additive model (P>0.01; Supplementary Table 29). Tests for differences in the OR by ER-status were derived using case-only analyses, in which estimates were derived by logistic regression separately in the iCOGS and OncoArray datasets, adjusted as before, and then combined in a fixed-effects meta-analysis. These analyses were performed in R [28].
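
The one degree of freedom likelihood ratio test can be illustrated as follows. The general model has two genotype parameters (heterozygote and rare homozygote) and the log-additive model has one, hence the single degree of freedom. The sketch assumes the two maximised log-likelihoods are already available from fitted logistic models; it does not fit the models and is written in Python for illustration, whereas the authors report using R.

```python
import math


def lrt_one_df(loglik_additive, loglik_general):
    """Likelihood ratio test of the log-additive model against the general model.

    Returns (statistic, p_value). For a chi-square variable with 1 degree of
    freedom, the upper tail probability equals erfc(sqrt(x / 2)).
    """
    stat = 2.0 * (loglik_general - loglik_additive)
    stat = max(stat, 0.0)  # guard against tiny negative values from rounding
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

A P-value above the chosen threshold (0.01 in the text) would be read as no evidence of departure from the log-additive model.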

We assessed heterogeneity in the OR estimates among studies within each of the OncoArray, iCOGS and GWAS components, and between the (combined) estimates for the three components, using both the I2 statistic and the P-value for Cochran’s Q statistic (Supplementary Table 28). There was no evidence of heterogeneity among studies in the ORs for any of the loci in the OncoArray, but three loci showed some evidence of heterogeneity in the ORs among the GWAS, iCOGS and OncoArray datasets.
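
As a minimal sketch of the two heterogeneity measures, Cochran's Q and I2 can be computed from per-study log(OR) estimates and their standard errors as below. The function name is hypothetical, and the P-value step (referring Q to a chi-square distribution with k-1 degrees of freedom) is left out.

```python
def cochran_q_i2(betas, ses):
    """Return Cochran's Q, its degrees of freedom, and I2 (in percent)."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I2 is the share of variation beyond chance, truncated at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2
```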

To determine whether there were multiple independent signals in a given region, we performed multiple logistic regression analysis using SNPs within 500kb of each lead SNP, adjusting for the lead SNP. We used the genotypes derived by one-step imputation, performed the analyses separately in the iCOGS and OncoArray datasets and combined the results (adjusted effect sizes and standard errors) using a fixed effects meta-analysis. For one of the two loci for which there was an additional signal significant at P<5x10-8, the lead SNP from the one-step imputation differed from the lead SNP in the overall analysis, but was strongly correlated with it (Supplementary Table 9).

Definition of Known Hits

We attempted to identify all associations previously reported from genome-wide or candidate analysis at a significance level P<5x10-8 for overall breast cancer, ER-negative or ER-positive breast cancer, in BRCA1 or BRCA2 carriers, or in meta-analyses of these categories. Where multiple studies reported associations in the same region, we used the first reported association unless later studies identified a variant that was clearly more strongly associated. We only included one SNP per 500kb interval, unless joint analysis provided clear evidence (P<5x10-8) of more than one independent signal. For the analysis of credible risk variants (CRVs), we restricted attention to regions where the most significant signal had a P-value<10-7 in Europeans (77 regions). To avoid complications with defining CRVs for secondary signals, we considered only the primary signal and defined CRVs as those whose P-value was within two orders of magnitude of the most significant P-value.
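
The CRV definition for a single region amounts to a simple filter on P-values. The sketch below assumes a mapping from variant identifier to P-value for the primary signal in one region; the function name and the identifiers in the usage note are hypothetical.

```python
def select_crvs(p_values, fold=100.0):
    """Variants whose P-value is within two orders of magnitude of the smallest one."""
    p_min = min(p_values.values())
    return sorted(v for v, p in p_values.items() if p <= fold * p_min)
```

For example, with P-values `{"v1": 1e-12, "v2": 5e-11, "v3": 2e-10, "v4": 1e-6}` the function keeps `v1` and `v2` only, because `v3` and `v4` exceed 100 times the smallest P-value.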

In-Silico Analysis of CRVs

We combined multiple sources of in silico functional annotation from public databases to help identify potential functional SNPs and target genes. To investigate functional elements enriched across the region encompassing the strongest CRVs, we analysed chromatin biofeatures data from the Encyclopedia of DNA Elements (ENCODE) Project29, Roadmap Epigenomics Project30 and other data obtained through the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO), namely: Chromatin State Segmentation by Hidden Markov Models (chromHMM), DNase I hypersensitivity and histone modifications of the epigenetic marks H3K4, H3K9, and H3K27 in Human Mammary Epithelial (HMEC) and myoepithelial (MYO) cells, T47D and MCF7 breast cancer cells and TF ChIP-seq in a range of breast cell lines (Supplementary Table 13).

Association of Genomic Features with CRVs

We first defined credible candidate variants as those located within 500kb of the most significant SNP in each region, and with P-values within two orders of magnitude of the most significant SNPs. This is approximately equivalent to flagging variants whose posterior probability of causality is within two orders of magnitude of that of the most significant SNP31,32. We then selected 800 random 1Mb control regions separated by at least 1Mb from each other and from the intervals defined by the associated SNPs. The association with each feature was then evaluated using logistic regression, with being a CRV as the outcome, and adjusting for the dependence due to linkage disequilibrium using robust variance estimation, clustering on region, using the R package multiwayvcov.

eQTL analyses

Expression QTL analyses were performed using data from The Cancer Genome Atlas (TCGA) and Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) projects9,33. The TCGA eQTL analysis was based on 458 breast tumours that had matched gene expression, copy number, and methylation profiles together with the corresponding germline genotypes available. All 458 individuals were of European ancestry as ascertained using the genotype data and the Local Ancestry in adMixed Populations (LAMP) software package (LAMP estimate cut-off >95% European)34. Germline genotypes were imputed into the 1000 Genomes Project reference panel (October 2014 release) using IMPUTE223,35. Gene expression had been measured on the Illumina HiSeq 2000 RNA-Seq platform (gene-level RSEM normalized counts36), copy number estimates were derived from the Affymetrix SNP 6.0 (somatic copy number alteration minus germline copy number variation called using the GISTIC2 algorithm37), and methylation beta values measured on the Illumina Infinium HumanMethylation450. Expression QTL analysis focused on all variants within 500 kb of the most significantly associated risk SNP in 142 genomic regions (each 2-Mb wide) containing at least one previously identified or new overall breast cancer risk locus confirmed at genome-wide significance in the current meta-analysis. Each variant was evaluated for its association with the expression of every gene within 2 Mb that had been profiled for each of the three data types. The effects of tumour copy number and methylation on gene expression were first regressed out using a method described previously38. eQTL analysis was performed by linear regression, with residual gene expression as outcome, germline SNP genotype dosage as the covariate of interest and ESR1 expression and age as additional covariates, using the R package Matrix eQTL39.

The METABRIC eQTL analysis was based on 138 normal breast tissue samples resected from breast cancer patients of European ancestry. Germline genotyping for the METABRIC study was also done on the Affymetrix SNP 6.0 array, and gene expression in the METABRIC study was measured using the Illumina HT12 microarray platform (probe-level estimates). No adjustment was implemented for somatic copy number and methylation status since we were evaluating eQTLs in normal breast tissue. All other steps were identical to the TCGA eQTL analysis described above.

INQUISIT

We developed a computational pipeline, integrated expression quantitative trait and in silico prediction of GWAS targets (INQUISIT), to interrogate publicly available data for the prioritisation of candidate target genes.

Data used for INQUISIT

Chromatin interaction data from ENCODE ChIA-PET analysis in MCF-7 cells for RNApolII, ERalpha, and CTCF factors were downloaded using UCSC Table Browser40. Hi-C data derived from HMECs were obtained from Rao et al.41, using “interaction loops” as defined in the publication. Data were reformatted to facilitate intersection of query SNPs using BEDTools “intersect”42. For all interactions, termini were intersected with promoters using GENCODE v1943 Basic gene annotations, where we defined promoters as the interval from -1.0 kb to +0.1 kb surrounding a transcription start site.
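
The promoter definition depends on gene strand, since "upstream" of a transcription start site (TSS) lies at lower coordinates for plus-strand genes and at higher coordinates for minus-strand genes. The following sketch assumes 1-based inclusive coordinates and hypothetical function names; the coordinate convention actually used in the pipeline is not stated in the text.

```python
def promoter_interval(tss, strand, upstream=1000, downstream=100):
    """Promoter window (-1.0 kb to +0.1 kb) around a TSS, 1-based inclusive."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, start), end            # clamp at the chromosome start

def in_promoter(position, tss, strand):
    """True if a variant position falls inside the promoter window."""
    start, end = promoter_interval(tss, strand)
    return start <= position <= end
```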

Enhancer-target gene predictions by several computational algorithms were collected. Each of these datasets assigns genes to enhancers. We used all MCF-7 and HMEC enhancer predictions (low and high stringency) made by PreSTIGE44, and IM-PET enhancer-gene predictions in MCF-7, HMEC and HCC1954 cell lines45. Enhancer-transcription start site (E-TSS) links were obtained from the FANTOM5 Consortium46, and enhancers detected in mammary epithelial cells were intersected with E-TSS links. We also collected typical and super-enhancers in MCF-7, HMEC and HCC1954 cells defined by Hnisz et al.47.

TF ChIP-seq peak data for ESR1, FOXA1, GATA3, TCF7L2 and E2F1 from MCF-7, T47D and MCF-10A cells were downloaded in narrowPeak format from ENCODE. H3K4me3 and H3K9ac (characteristic of promoters) histone modification ChIP-seq peak data for all breast cells were obtained from ENCODE and Roadmap Epigenomics Project. ChromHMM data for breast cell samples (HMEC and myoepithelial: E027, E028 and E119) were downloaded from Roadmap Epigenomics.

Expression QTL analyses were conducted as described above. In the interpretation of the eQTL results for INQUISIT (and in general) we focused on the overlap between the CRVs (risk signal) and the top eQTL variants for a given gene (eQTL signal). If the eQTL P-value for a CRV was the same as, or within 1/100th of the eQTL P-value of the SNP most significantly associated with expression of a particular gene, that gene and the corresponding CRV were assigned a point for being an eQTL in INQUISIT.

Topologically-associated domain (TAD) boundaries were derived from Hi-C data41. Genomic intervals corresponding to “contact domains” from eight human cell types were merged using BEDTools “merge” resulting in annotation of regions most likely to encompass TAD units. Inter-TAD boundaries were identified using BEDTools “complement”.
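
The merge and complement steps can be mimicked for a single chromosome in a few lines. This is an illustrative re-implementation in Python, not the BEDTools code; it assumes 0-based half-open intervals as in BED files, and the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended intervals, as a 'merge' step would."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)   # extend the current block
        else:
            merged.append([start, end])               # start a new block
    return [tuple(iv) for iv in merged]

def complement(merged, chrom_length):
    """Return the gaps not covered by the merged intervals (inter-TAD regions)."""
    gaps, cursor = [], 0
    for start, end in merged:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps
```

For instance, contact domains `(100, 200)`, `(150, 300)`, `(300, 350)` and `(400, 500)` merge into `(100, 350)` and `(400, 500)`, leaving `(350, 400)` as an internal boundary region.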

Gene level RNA-seq expression data generated under multiple experimental conditions in MCF-7 and normal mammary epithelial cells were downloaded from ENCODE. The FPKM (Fragments Per Kilobase of exon per Million fragments Mapped) values for each gene were extracted using the metagene R package48 and averaged across all experiments to give an approximation of expression in breast cells. Accession numbers are given in Supplementary Table 30.

INQUISIT pipeline

Candidate target genes were evaluated by assessing each CRV’s potential impact on regulatory or coding features. Scores categorised by 1) distal gene regulation, 2) proximal gene regulation, or 3) impact on protein coding were calculated using the following criteria (see also Supplementary Table 17).

Genomic annotation data for target gene predictions (chromatin interaction and computational enhancer-promoter assignment), ChIP-seq, histone modification, and chromHMM were curated into a BED formatted database. We intersected the chromosomal positions of CRVs with each category of genomic annotation data using BEDTools “intersect” (minimum 1 bp overlap), resulting in annotation of SNP-gene pairs with presence or absence of multiple classes of genomic data. Each gene was scored using a custom R script on the basis of the following criteria:

  • For distally regulated genes, a candidate gene was given 2 points if a CRV fell in an element that revealed long range ChIA-PET or Hi-C interactions with that gene’s promoter. One point was added to a gene's score in the case of enhancers predicted by computational methods to target that gene (in addition to experimental interactions if also observed). If the distal elements harbouring SNPs also overlapped enriched cistromic TF (ESR1, FOXA1, GATA3, TCF7L2, E2F1) ChIP-seq peaks, an additional point was given when one SNP-Enhancer-ChIP-seq peak intersection occurred, but two points when there were multiple TF binding sites overlapping SNPs in distinct interactions or enhancers (see Supplementary Table 17 for details). One point was given to significant eSNP-eGENE pairs. Predicted distal target genes which were among the list of breast cancer driver genes were up-weighted with a further point (except for the analysis of driver gene enrichment). Information regarding TAD boundaries was used to down-weight genes: genes which were separated from CRVs by a TAD boundary were down-weighted by multiplying their scores by 0.05. Scores for genes exhibiting no expression in MCF7 or HMEC (mean FPKM = 0) were multiplied by 0.1. This resulted in scores for each candidate target gene ranging from 0 to 8.

  • Variants were treated as potentially affecting proximal promoter regulation if they resided between -1.0 and +0.1 kb surrounding a transcription start site. Additional points were awarded to genes when variants overlapped promoter H3K4me3 or H3K9ac histone modification peaks, intersected with ESR1, FOXA1, GATA3, TCF7L2 or E2F1 TF binding sites, were significant eSNP-eGENE pairs, and if the gene was annotated as a breast cancer driver gene. Gene scores were down-weighted (by a factor of 0.1) if they lacked expression in MCF-7 or HMEC samples. Resultant scores ranged from 0 to 5.

  • Intragenic variants were evaluated for their potential to impact protein function using a range of in silico prediction tools (CADD49, FATHMM50, LRT51, MutationAssessor52, Mutation Taster 253, PolyPhen-254, PROVEAN55 and SIFT56 for missense variants; Human Splicing Finder57 and MaxEntScan58 for splice variants). Genes were awarded points for missense and nonsense variants predicted to be functionally deleterious, and for variants predicted to alter splicing. Genes carrying SNPs that affect both coding and splicing could therefore receive increased scores. Additional points were given to genes which were breast cancer driver genes. We multiplied scores by 0.1 when genes showed a lack of expression in breast cells. Possible coding scores ranged from 0-4.
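
The distal scoring rules described in the first item can be sketched as below. This is an illustration built only from the points stated in the text, not the authors' R script: the class and field names are hypothetical, and the full weighting is given in Supplementary Table 17, which is why this simplified version tops out at 7 rather than the reported maximum of 8.

```python
from dataclasses import dataclass

@dataclass
class DistalEvidence:
    chromatin_interaction: bool      # ChIA-PET or Hi-C link to the gene's promoter
    predicted_enhancer: bool         # computational enhancer-gene prediction
    tf_peak_overlaps: int            # number of distinct TF ChIP-seq intersections
    eqtl: bool                       # significant eSNP-eGENE pair
    driver_gene: bool                # listed breast cancer driver gene
    crosses_tad_boundary: bool       # CRV and gene separated by a TAD boundary
    expressed_in_breast_cells: bool  # mean FPKM > 0 in MCF7 or HMEC

def distal_score(ev):
    """Additive points followed by the two multiplicative down-weights."""
    score = 0.0
    if ev.chromatin_interaction:
        score += 2
    if ev.predicted_enhancer:
        score += 1
    if ev.tf_peak_overlaps == 1:
        score += 1
    elif ev.tf_peak_overlaps > 1:
        score += 2
    if ev.eqtl:
        score += 1
    if ev.driver_gene:
        score += 1
    if ev.crosses_tad_boundary:
        score *= 0.05
    if not ev.expressed_in_breast_cells:
        score *= 0.1
    return score
```

Applying the down-weights multiplicatively keeps a gene with strong evidence but no expression, or one lying across a TAD boundary, above zero while pushing it below any score of 1, which is consistent with the "score >0 but <1" level used later in the enrichment analysis.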

Enrichment of Somatic Breast Cancer Driver Genes in INQUISIT Target Gene Predictions

We listed 147 unique protein coding driver genes for breast cancer identified from four recent tumour genome and exome sequencing studies (considering ZNF703 and FGFR1 as independent genes; Supplementary Table 31)8–11. First, we examined overlap between this list of 147 genes and the total set of unique target genes predicted by INQUISIT (n = 689) by one or more of the three regulatory mechanisms (distal, promoter, and coding). The significance of this overlap was assessed by randomly drawing (without replacement) 689 genes from the set of all protein coding genes (GENCODE release 19, n = 20,243) one million times and calculating the probability of observing the same (or stronger) overlap with the list of 147 drivers. Second, we hypothesised that this enrichment would be stronger with progressively higher INQUISIT scores. We categorised all 20,243 protein coding genes into four levels based on their INQUISIT scores (level 1: coding score 2, promoter score 3-4, distal score >4; level 2: coding 1, promoter 1-2, distal 1-4; level 3: any score >0 but <1; level 4: score 0 i.e. not a predicted target). The gene nearest to a risk locus is frequently assigned as a candidate target gene in GWAS in the absence of additional functional analysis59. We observed that seven of the 147 drivers were among the genes nearest to a previously or newly identified breast cancer risk locus. Therefore, we used logistic regression, including data for all target genes predicted by INQUISIT, with driver status as outcome, and evaluated INQUISIT score level and nearest gene status as potential predictors of driver status (Supplementary Table 21).
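
The random-draw significance test for the overlap can be sketched as follows. The function name, the seed and the default number of draws are assumptions made for illustration (the text uses one million draws), and the observed overlap must be supplied by the caller because it is not stated in this section.

```python
import random

def overlap_permutation_p(n_universe, n_drivers, n_targets, observed,
                          n_perm=10000, seed=1):
    """Empirical probability of an overlap at least as large as `observed`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Draw target genes without replacement; genes 0..n_drivers-1 are
        # labelled as drivers, which loses no generality under random sampling.
        draw = rng.sample(range(n_universe), n_targets)
        overlap = sum(1 for g in draw if g < n_drivers)
        if overlap >= observed:
            hits += 1
    return hits / n_perm
```

With the numbers given above (20,243 genes, 147 drivers, 689 predicted targets), the overlap expected by chance is 689 x 147 / 20,243, or about 5 genes, which is the baseline against which the observed overlap is judged.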

Lead SNPs at 142 breast cancer risk associated loci were used as input into DEPICT which was then run using the default settings12. We examined the relative performance of INQUISIT and DEPICT in predicting driver gene status using logistic regression models as above (Supplementary Table 21), adding DEPICT prediction as a covariate.

Chromatin Conformation Capture (3C)

MCF7 (ATCC #HTB22) and MDA-MB-231 (ATCC #HTB26) breast cancer cell lines were grown in RPMI medium with 10% FCS and antibiotics. Bre-80 normal breast epithelial cells (provided as a gift from Roger Reddel, CMRI, Sydney) were grown in DMEM/F12 medium with 5% horse serum (HS), 10 μg/ml insulin, 0.5 μg/ml hydrocortisone, 20 ng/ml epidermal growth factor, 100 ng/ml cholera toxin and antibiotics. Cell lines were maintained under standard conditions, routinely tested for Mycoplasma and short tandem repeat (STR) profiled to confirm cell line identity. 3C libraries were generated using EcoRI as described previously60. 3C interactions were quantitated by real-time PCR (qPCR) using primers designed within restriction fragments (Supplementary Table 32). qPCR was performed on a RotorGene 6000 using MyTaq HS DNA polymerase (Bioline) with the addition of 5 mM of Syto9, annealing temperature of 66°C and extension of 30 sec. 3C analyses were performed in three independent 3C libraries from each cell line with each experiment quantified in duplicate. BAC clones covering each region were used to create artificial libraries of ligation products in order to normalize for PCR efficiency. Data were normalized to the signal from the BAC clone library and, between cell lines, by reference to a region within GAPDH. All qPCR products were electrophoresed on 2% agarose gels, gel purified and sequenced to verify the 3C product.

Plasmid Construction and Reporter Assays

Promoter-driven luciferase reporter constructs were generated by insertion of PCR amplified fragments or synthesised gBlocks (Integrated DNA Technologies) containing the KLHDC7A, PIDD1 or CITED4 promoters into the KpnI/HindIII sites of pGL3-Basic. For the 1p34 locus, a 1169 bp putative regulatory element (PRE1) or 951 bp PRE2 were synthesised as gBlocks and cloned into the BamHI/SalI sites of the CITED4-promoter construct. The minor alleles of SNPs were introduced into promoter or PRE sequences by overlap extension PCR or gBlocks. Sequencing of all constructs confirmed variant incorporation (AGRF). MCF7 or Bre-80 cells were transfected with equimolar amounts of luciferase reporter plasmids and 50 ng of pRLTK transfection control plasmid with Lipofectamine 2000. The total amount of transfected DNA was kept constant at 600 ng for each construct by the addition of pUC19 as a carrier plasmid. Luciferase activity was measured 24 hr posttransfection by the Dual-Glo Luciferase Assay System. To correct for any differences in transfection efficiency or cell lysate preparation, Firefly luciferase activity was normalized to Renilla luciferase, and the activity of each construct was measured relative to the reference promoter constructs, which had a defined activity of 1. Statistical significance was tested by log transforming the data and performing 2-way ANOVA, followed by Dunnett’s multiple comparisons test in GraphPad Prism.

Global Genomic Enrichment Analyses

We performed stratified LD score regression analyses17 for overall breast cancer as well as stratified by ER status using the summary statistics based on the meta-analyses of the OncoArray, GWAS and iCOGS datasets. We restricted analysis to all SNPs present on the HapMap version 3 dataset that had a MAF > 1% and an imputation quality score R2>0.3 in the OncoArray data. LD scores were calculated using the 1000 Genomes Project Phase 3 EUR reference panel.

We first created a “full baseline model” as previously described that included 24 non-cell type specific publicly available annotations as well as 24 additional annotations that included a 500-bp window around each of the 24 main annotations17. Additionally, we also included 100-bp windows around ChIP-seq peaks as well as one annotation containing all SNPs leading to a total of 53 overlapping annotations.

We subsequently performed analyses using cell-type specific annotations for four histone marks H3K4me1, H3K4me3, H3K9ac and H3K27ac across 27-81 cell types depending on histone mark17. Each cell-type-specific annotation corresponded to a histone mark in a single cell type, and there were 220 such annotations in total. We augmented the baseline model by adding these annotations individually, creating 220 separate models, each with 54 annotations (53+1). This procedure controls for the overlap with the 53 functional categories in the full baseline model but not with the 219 other cell type specific annotations.

We further tested the differences in functional enrichment between ER-positive and ER-negative subsets through a Wald test, using the regression coefficients and standard errors for the two subsets based on the models described above.
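
A Wald test for the difference between two regression coefficients can be written as below. This sketch treats the ER-positive and ER-negative estimates as independent, which is an assumption not stated in the text, and the function name is hypothetical.

```python
import math

def wald_difference(b1, se1, b2, se2):
    """Z statistic and two-sided P-value for b1 - b2, assuming independent estimates."""
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```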

Contribution of Identified Variants to the Familial Relative Risk of Breast Cancer

We estimated the proportion of the familial risk of breast cancer due to the identified variants, under a log-additive model, using the formula: $\sum_i p_i(1-p_i)(\beta_i^2-\tau_i^2)/\ln(\lambda)$, where $p_i$ is the MAF for variant $i$, $\beta_i$ is the log(OR) estimate for variant $i$, $\tau_i$ is the standard error of $\beta_i$ and $\lambda=2$ is the assumed overall familial relative risk.

To compute the corresponding estimate for the FRR due to all variants, we wish to estimate $h_f^2=\sum_i 2p_i(1-p_i)\beta_i^2$, where the sum is now over all variants and $\beta_i$ is the true log relative risk conferred by variant $i$, assuming a log-additive model. We refer to $h_f^2$ as the frailty scale heritability. We first obtained the estimated observed heritability based on the full set of summary estimates using LD Score Regression17 and then converted this to an estimate on the frailty scale using the relation $h_f^2=h_{obs}^2/(P(1-P))$, where $P$ is the proportion of samples in the population that are cases.
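
The two calculations can be sketched numerically. The code reads the formulas in this section as the sum over variants of p(1-p)(β² - τ²) divided by ln(λ) for the identified variants, and as the observed-scale heritability divided by P(1-P) for the frailty scale conversion; the function names and the input values in the usage note are hypothetical.

```python
import math

def frr_fraction(variants, lam=2.0):
    """Fraction of the familial relative risk explained by identified variants.

    `variants` is an iterable of (maf, log_or, se) tuples; subtracting se**2
    corrects the squared effect estimate for sampling error.
    """
    total = sum(p * (1.0 - p) * (b ** 2 - se ** 2) for p, b, se in variants)
    return total / math.log(lam)

def frailty_heritability(h2_obs, case_fraction):
    """Convert observed-scale heritability to the frailty scale."""
    return h2_obs / (case_fraction * (1.0 - case_fraction))
```

As an example, a single hypothetical variant with MAF 0.5, log(OR) 0.2 and negligible standard error contributes 0.25 x 0.04 / ln(2), or roughly 1.4% of a familial relative risk of 2.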

Pathway Analyses

The pathway gene set database (http://download.baderlab.org/EM_Genesets, file Human_GOBP_AllPathways_no_GO_iea_April_01_2017_symbol.gmt)13 from the Bader lab dated April 1, 2017 was used in all analyses. This database contains pathways from Reactome61, NCI Pathway Interaction Database62, GO (Gene Ontology) biological process63, HumanCyc64, MSigdb65, NetPath66 and Panther67. For GO, terms inferred from electronic annotation were excluded from our analyses. The same pathway may be defined in two or more databases with potentially different sets of genes. All versions of such ‘duplicate’ pathways were included. To provide more biologically meaningful results and reduce false positives, only pathways that contained between 10 and 200 genes were used. Pathway size was determined by the total number of genes in the pathway that could also be mapped to the genes included in the GWAS dataset (actual pathway size may be larger).

SNPs were assigned to genes using the INQUISIT target prediction method described above for all SNPs with P-value < 5x10-2 (~1.25 million associations). This cutoff was chosen based on a threshold analysis that showed that 19 of the 20 pathway themes found using all SNP associations (~16 million) and a simple distance-based SNP-to-gene mapping method could be recovered using this smaller subset of associations. More stringent cutoffs resulted in fewer themes being covered (e.g. three themes found using SNPs with P-value < 5x10-6 or ~33K SNP associations). Gene significance was calculated by assigning the statistic of the most significant SNP among all SNPs assigned to a gene68,69. Since histone genes contained a high number of mapped SNPs, we selected representative SNP associations to avoid pathway enrichments based solely on the increased number of SNPs at these loci (i.e. chr6:27657944 for HIST1, chr1:149219841 for HIST2, chr1:228517406 for HIST3, chr12:14871747 for HIST4).
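
Assigning each gene the statistic of its most significant SNP can be expressed as a single pass over the SNP-to-gene mapping. The sketch below uses P-values as the statistic and hypothetical names throughout; the special handling of histone gene clusters is not included.

```python
def gene_significance(snp_p, snp_to_genes):
    """Map each gene to (best SNP, P-value) among the SNPs assigned to it."""
    best = {}
    for snp, genes in snp_to_genes.items():
        p = snp_p[snp]
        for gene in genes:
            if gene not in best or p < best[gene][1]:
                best[gene] = (snp, p)
    return best
```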

The gene set enrichment analysis (GSEA) algorithm as implemented in the GenGen package69 was used to perform pathway analysis. Wang et al.70 modified the original GSEA algorithm to work with GWAS datasets, using SNP significance and SNP-to-gene mapping instead of gene expression data. Briefly, the algorithm calculates an enrichment score (ES) for each pathway based on a weighted Kolmogorov-Smirnov statistic (refer to 70 for more details). Pathways that have most of their genes at the top of the ranked list of genes obtain higher ES values. Note that only the largest positive ES was considered as opposed to largest absolute ES (i.e. largest deviation from zero). This modification (recommended by the GenGen authors for GWAS analysis) was performed to include only pathways that are significantly affected between cases and controls and ignore those with significant negative ES values (this may happen if a pathway is significantly less altered than expected by chance). Only pathways containing greater than 10 genes with at least one of these genes with P-value < 5x10-8 were retained as higher confidence for subsequent analysis. These pathways, together with the genes reaching the significance threshold, are listed in Supplementary Table 22.
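
The enrichment score can be illustrated with a compact version of the weighted running-sum statistic that keeps only the largest positive deviation, as described above. This is a sketch of the general GSEA idea and not the GenGen code; the function name is hypothetical, and the default weight of 1 follows the original GSEA convention as an assumption.

```python
def enrichment_score(ranked_genes, stats, gene_set, weight=1.0):
    """Largest positive deviation of the weighted Kolmogorov-Smirnov running sum.

    `ranked_genes` is ordered from most to least significant and `stats`
    maps each gene to its association statistic.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = in_set.count(False)
    hit_total = sum(abs(stats[g]) ** weight
                    for g, hit in zip(ranked_genes, in_set) if hit)
    if hit_total == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for g, hit in zip(ranked_genes, in_set):
        if hit:
            running += abs(stats[g]) ** weight / hit_total   # step up on a pathway gene
        else:
            running -= 1.0 / n_miss                          # step down otherwise
        best = max(best, running)
    return best
```

A pathway whose genes all sit at the top of the ranked list reaches a score of 1, whereas one whose genes all sit at the bottom never rises above zero and is therefore ignored under the positive-only rule.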

The pathway analysis assigns an enrichment score (ES) value for each pathway. These values were normalized and p-values for each pathway were obtained by comparing them to null distributions for OncoArray and iCOGS data sets separately. The null distributions were computed by permuting case/control labels 1,000 times (keeping the number of cases and controls the same in each iteration) and recomputing all enrichment statistics. FDR values were computed using the statistics from the null distributions and all pathways with FDR < 0.05 in either OncoArray or iCOGS distributions were considered further. Pathway findings were further considered if they contained more than one significant gene and if they could be confirmed to be involved in breast cancer as reported in at least one of five published large-scale breast cancer GWAS71–75 or reported elsewhere in the literature. Further, themes that were weakly associated with breast cancer (based on a literature search) were only included if they had an FDR < 0.05 and at least four novel genes (i.e. genes not found among the genes from mapped themes containing pathways known to be involved in breast cancer) (Extended Data Fig. 2). Pathways related to “sensory perception of smell” were removed as there is no literature evidence for their involvement in breast cancer and because they contain genes close to each other on chromosome 6 which are frequently correlated.

An enrichment map was created using the Enrichment Map (EM) v 2.1.0 app13 in Cytoscape v 3.3 (ref. 76). Pathway nodes were laid out using a force directed layout and nodes with gene set overlap of over 0.55 were connected by edges. Related pathway nodes were manually clustered and labelled as themes.

Data Availability

A subset of the data that support the findings of this study is publicly available via dbGaP (www.ncbi.nlm.nih.gov/gap; accession number phs001265.v1.p1). The complete dataset will not be made publicly available due to restraints imposed by the ethics committees of individual studies; requests for data can be made to the corresponding author or the Data Access Coordination Committee (DACC) of BCAC (http://bcac.ccge.medschl.cam.ac.uk/): BCAC DACC approval is required to access data from studies ABCFS, ABCS, ABCTB, BBCC, BBCS, BCEES, BCFR-NY, BCFR-PA, BCFR-UT, BCINIS, BSUCH, CBCS, CECILE, CGPS, CTS, DIETCOMPLYF, ESTHER, GC-HBOC, GENICA, GEPARSIXTO, GESBC, HABCS, HCSC, HEBCS, HMBCS, HUBCS, KARBAC, KBCP, LMBC, MABCS, MARIE, MBCSG, MCBCS, MISS, MMHS, MTLGEBCS, NC-BCFR, OFBCR, ORIGO, pKARMA, POSH, PREFACE, RBCS, SKKDKFZS, SUCCESSB, SUCCESSC, SZBCS, TNBCC, UCIBCS, UKBGS and UKOPS (see Supplementary Table 1).

Summary results for all variants are available at http://bcac.ccge.medschl.cam.ac.uk/. Requests for further data should be made through the BCAC Data Access Co-ordinating Committee (http://bcac.ccge.medschl.cam.ac.uk/).

Extended Data

Extended data Figure 1. Global mapping of biofeatures across novel loci associated with overall breast cancer risk.

The overlaps between potential genomic predictors in relevant breast cell lines and candidate causal risk variants (CRVs) within each locus. On the x-axis, each column represents a CRV (see Online Methods). The most significant SNPs are identified in each region. On the y-axis, biofeatures are grouped into five functional categories: genomic structure (red), enhancer marks (dark green), histone marks (blue), open chromatin marks (dark blue) and transcription factor binding sites (dark violet). Colored elements indicate SNPs for which the feature is present. For data sources, see Online Methods (“In-Silico Analysis of CRVs”).

Extended data Figure 2. Pathway enrichment map for susceptibility loci based on summary association statistics.

Each circle (node) represents a pathway (gene set), coloured by enrichment score (ES) where redder nodes indicate lower FDRs. Larger nodes indicate pathways with more genes. Green lines connect pathways with overlapping genes (minimum overlap 0.55). Pathways are grouped by similarity and organized into major themes (large labelled circles).

Extended data Figure 3. Heatmap showing patterns of cell type-specific enrichments for breast tissue across three histone marks (H3K4me1, H3K4me3 and H3K9ac) for breast cancer overall, ER-positive breast cancer and ER-negative breast cancer as well as 16 other traits.

Extended data Figure 4. Heatmap showing patterns of cell type-specific enrichments for histone mark H3K27ac in breast cancer overall, ER+ and ER- breast cancer as well as 16 other traits.

Extended data Figure 5. Heatmap showing patterns of cell type-specific enrichments for histone mark H3K4me1 in breast cancer overall, ER+ and ER- breast cancer as well as 16 other traits.

Extended data Figure 6. Heatmap showing patterns of cell type-specific enrichments for histone mark H3K4me3 in breast cancer overall, ER+ and ER- breast cancer as well as 16 other traits.

Extended data Figure 7. Heatmap showing patterns of cell type-specific enrichments for histone mark H3K9ac in breast cancer overall, ER-positive and ER-negative breast cancer as well as 16 other traits.

Extended data Figure 8. Functional assessment of regulatory variants at 1p36, 11p15 and 1p34 risk loci.

a, The KLHDC7A or b, PIDD1 promoter regions containing the reference (prom-Ref) or risk alleles (prom-Hap), were cloned upstream of the pGL3 luciferase reporter gene. MCF7 or Bre-80 cells were transfected with constructs and assayed for luciferase activity after 24 h. Error bars denote 95% CI (n=3). P-values were determined by two-way ANOVA followed by Dunnett’s multiple comparisons test (*P<0.05, **P<0.01, ***P<0.001). c, 3C assays. A physical map of the region interrogated by 3C is shown first. Grey boxes depict the putative regulatory elements (PREs), blue vertical lines indicate the risk-associated SNPs and black dotted line represents chromatin looping. The graphs represent three independent 3C interaction profiles. 3C libraries were generated with EcoRI, grey vertical boxes indicate the interacting restriction fragment (containing PRE1 and PRE2). Error bars denote SD. d, PRE1 or PRE2 containing the reference (PRE-ref) or risk (PRE-Hap) haplotypes were cloned downstream of a CITED4 promoter-driven luciferase construct (CITED4 prom). MCF7 or Bre-80 cells were transfected with constructs and assayed for luciferase activity after 24 h. Error bars denote 95% CI (n=3). P-values were determined by two-way ANOVA followed by Dunnett’s multiple comparisons test (**P<0.01, ***P<0.001).

Extended data Figure 9. Functional assessment of regulatory variants at the 7q22 risk locus.

a-e, 3C assays. A physical map of the region interrogated by 3C is shown first. Grey horizontal boxes depict the putative regulatory elements (PREs), blue vertical lines indicate the risk-associated SNPs and black dotted line represents chromatin looping. The graphs represent three independent 3C interaction profiles between the a, CUX1, b, d, PRKRIP1 or c, e, RASA4 promoter regions and PREs. 3C libraries were generated with EcoRI, grey vertical boxes indicate the interacting restriction fragment (containing PRE1 and/or PRE2). Error bars denote SD. f, g, Allele-specific 3C. 3C followed by Sanger sequencing for the f, PRKRIP1-PRE2 or g, RASA4-PRE1 or -PRE2 in heterozygous MDA-MB-231 breast cancer cells.

Extended Data Table 1. INQUISIT, DEPICT, and nearest gene as predictors of driver status.

Scores converted into levels for analysis. For INQUISIT: level 1 (coding score of 2 OR promoter score of 3 or 4 OR distal score > 4), level 2 (coding score of 1 OR promoter of 1 or 2 OR distal score of 1, 2, 3, or 4), level 3 (coding/promoter/distal scores > 0 but < 1), and level 4 (not predicted to be a target gene by INQUISIT). For DEPICT: level 1 (DEPICT predicted target gene at P ≤ 0.05), level 2 (DEPICT predicted target gene but with P > 0.05), level 3 (not predicted to be a target gene by DEPICT).

Variable | Coefficient | Standard Error | Z-value | P-value
Multivariable logistic regression model with INQUISIT, DEPICT, Nearest gene
INQUISIT | -0.61 | 0.14 | -4.31 | 1.6E-05
DEPICT | -0.17 | 0.50 | -0.33 | 0.74
Nearest gene | 1.12 | 0.59 | 1.88 | 0.06
Multivariable logistic regression model with INQUISIT and Nearest gene
INQUISIT | -0.63 | 0.13 | -4.77 | 1.8E-06
Nearest gene | 1.23 | 0.48 | 2.56 | 0.01
Multivariable logistic regression model with DEPICT and Nearest gene
DEPICT | -0.89 | 0.49 | -1.82 | 0.07
Nearest gene | 1.61 | 0.63 | 2.57 | 0.01
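
The Z-value and P-value columns are consistent with a standard Wald test, in which the coefficient is divided by its standard error and a two-sided P-value is read from the normal distribution. The sketch below shows that calculation in Python. Treating the table as a Wald test is an assumption, since the source does not name the test, and the helper names are hypothetical.

```python
import math


def wald_z(coefficient: float, standard_error: float) -> float:
    """Wald statistic: estimate divided by its standard error."""
    return coefficient / standard_error


def two_sided_p(z: float) -> float:
    """Two-sided normal P-value, 2 * (1 - Phi(|z|)), computed via erfc."""
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # (variable, coefficient, standard error, reported Z) from the first model.
    rows = [
        ("INQUISIT", -0.61, 0.14, -4.31),
        ("DEPICT", -0.17, 0.50, -0.33),
        ("Nearest gene", 1.12, 0.59, 1.88),
    ]
    for name, coef, se, z_reported in rows:
        z_from_rounded = wald_z(coef, se)
        p = two_sided_p(z_reported)
        print(f"{name}: z from rounded inputs = {z_from_rounded:.2f}, "
              f"reported z = {z_reported:.2f}, p = {p:.2g}")
```

Recomputing Z from the rounded coefficient and standard error gives values close to, but not identical with, the reported ones (for example about -4.36 against -4.31 for INQUISIT). This is presumably because the table was produced from unrounded estimates.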

Supplementary Material

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Reporting summary
Supplementary tables and note

Acknowledgements

The authors wish to thank all the individuals who took part in these studies and all the researchers, clinicians, technicians and administrative staff who have enabled this work to be carried out.

Genotyping of the OncoArray was principally funded from three sources: the PERSPECTIVE project, funded by the Government of Canada through Genome Canada and the Canadian Institutes of Health Research, the Ministère de l’Économie, de la Science et de l'Innovation du Québec through Genome Québec, and the Quebec Breast Cancer Foundation; the NCI Genetic Associations and Mechanisms in Oncology (GAME-ON) initiative and Discovery, Biology and Risk of Inherited Variants in Breast Cancer (DRIVE) project (NIH Grants U19 CA148065 and X01HG007492); and Cancer Research UK (C1287/A10118 and C1287/A16563). BCAC is funded by Cancer Research UK (C1287/A16563), by the European Community's Seventh Framework Programme under grant agreement 223175 (HEALTH-F2-2009-223175) (COGS) and by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreements 633784 (B-CAST) and 634935 (BRIDGES). Genotyping of the iCOGS array was funded by the European Union (HEALTH-F2-2009-223175), Cancer Research UK (C1287/A10710), the Canadian Institutes of Health Research for the “CIHR Team in Familial Risks of Breast Cancer” program, and the Ministry of Economic Development, Innovation and Export Trade of Quebec (grant # PSR-SIIRI-701). Combining the GWAS data was supported in part by the National Institutes of Health (NIH) Cancer Post-Cancer GWAS initiative grant U19 CA 148065 (DRIVE, part of the GAME-ON initiative). For a full description of funding and acknowledgments, see the Supplementary Note.

Footnotes

Author Contributions

Writing Group: K.Michailidou, S.Lindström, J.Beesley, S.Hui, S.Kar, P.Soucy, S.L.E., G.D.B., G.C-T., J.Simard, P.K., D.F.E.

Conceived OncoArray and obtained financial support: C.I.A., J.Simard, P.K., D.F.E.

Designed OncoArray: J.D., E.D., A.Lee, Z.W., A.C.A., S.J.C., P.K., D.F.E.

Led COGS project: P.Hall.

Led DRIVE project: D.J.H.

Led PERSPECTIVE project: J.Simard.

Led working groups of BCAC: A.C.A., I.L.A., P.D.P.P., J.Chang-Claude, R.L.M., M.G-C., M.K.S., A.M.D.

Data management: J.D., M.K.B., Q.Wang, R.Keeman, U.E., S.B., J.Chang-Claude, M.K.S.

Bioinformatics analysis: J.D., J.Beesley, A.Lemaçon, P.Soucy, J.A., M.Ghoussaini, J.Carroll, A.D., A.E. McC R, S.R.L.

Statistical analysis: K.Michailidou, S.Lindström, S.Hui, S.Kar, A.Rostamianfar, J.T., X.C., L.Fachal, X.J., H.Finucane, G.D.B., P.K., D.F.E.

Functional analysis: D.G., X.Q.C., J.Beesley, J.D.F., K.McCue, S.L.E., G.C-T.

OncoArray Genotyping: M.A., F.B., C.Baynes, D.M.C., J.M.C., K.F.D., N.Hamel, B.H., K.J., C.L., J.Meyer, E.P., J.R., G.S., D.C.T., D.V.D.B., D.V., J.V., L.X., B.Z., A.M.D.

Provided DNA samples and/or phenotypic data: M.A.A., H.A., K.A., H.A-C., N.N.A., V.A., K.J.A., B.A., P.L.A., M.Barrdahl, M.W.B., J.Benitez, M.Bermisheva, L.Bernstein, C.Blomqvist, N.V.B., S.E.B., B.Bonanni, A-L.B-D., J.S.B., H.Brauch, P.Brennan, H.Brenner, L.Brinton, P.Broberg, I.W.B., A.B., A.B-W., S.Y.B., T.B., B.Burwinkel, K.B., H.Cai, Q.C., T.C., F.C., A.Carracedo, B.D.C., J.E.C., T.L.C., T-Y.D.C, K.S.C., J-Y.C., H.Christiansen, C.L.C., M.C., E.C-D., S.C., A.Cox, D.G.C., S.S.C., K.C., M.B.D., P.D., T.D., I.d.S.S., M.Dumont, L.D., M.Dwek, D.M.E., A.B.E., A.H.E., C.Ellberg, M.Elvira, C.Engel, M.Eriksson, P.A.F., J.F., D.F-J., O.F., H.Flyger, L.Fritschi, V.Gaborieau, M.G., M.G-D., Y-T.G., S.M.G., J.A.G-S., M.M.G., V.Georgoulias, G.G.G., G.G., M.S.G., D.E.G., A.G-N., G.I.G., M.Grip, J.G., A.G., P.G., L.H., E.H., C.A.H., N.Håkansson, U.H., S.Hankinson, P.Harrington, S.N.H., J.M.H., M.H., A.Hein, J.H., P.Hillemanns, D.N.H., A.Hollestelle, M.J.H., R.N.H., J.L.H., M-F.H., C-N.H., G.H., K.H., J.I., H.Ito, M.I., H.Iwata, A.J., W.J., E.M.J., N.J., M.J., A.J-V., R.Kaaks, M.K., K.K., D.K., Y.K., M.J.K., S.Khan, E.K., J.I.K., S-W.K., J.A.K., V-M.K., I.M.K., V.N.K., U.K., A.K., D.L., L.L., C.N.L., E.L., J.W.L., M.H.L., F.L., J.Li, J.Lilyquist, A.Lindblom, J.Lissowska, R.L., W-Y.L., S.Loibl, J.Long, A.Lophatananon, J.Lubinski, M.P.L., E.S.K.M., R.J.M., T.M., E.M., K.E.M., A.Mannermaa, S.Manoukian, J.E.M., S.Margolin, S.Mariapun, M.E.M., K.Matsuo, D.M., J.McKay, C.McLean, H.M-H., A.Meindl, P.M., U.M., H.M., N.M., N.A.M.T., K.Muir, A.M.M., C.Mulot, S.L.N., H.N., P.N., S.F.N., D-Y.N., B.G.N., A.N., O.I.O., J.E.O., H.O., C.O., N.O., V.S.P., S.K.P., T-W.P-S., J.I.A.P., P.P., J.P., K-A.P., M.P., D.P-K., R.Prentice, N.P., D.P., K.P., B.R., P.R., N.R., G.R., H.S.R., V.R., A.Romero, K.J.R., T.R., A.Rudolph, M.R., E.J.Th.R., E.S., D.P.S., S.Sangrajrang, E.J.S., D.F.S., R.K.S., A.Schneeweiss, M.J.Schoemaker, F.S., P.Schürmann, C.Scott, R.J.S., S.Seal, C.Seynaeve, M.S., P.Sharma, C-Y.S., 
M.E.S., M.J.Shrubsole, X-O.S., A.Smeets, C.Sohn, M.C.S., J.J.S., C.Stegmaier, S.S-B., J.Stone, D.O.S., H.S., A.Swerdlow, N.A.M.T., R.T., J.A.T., M.T., S.H.T., M.B.T., S.Thanasitthichai, K.T., R.A.E.M.T., I.T., L.T., D.T., T.T., C-C.T., S.Tsugane, H-U.U., M.U., G.U., C.V., C.J.v.A., A.M.W.v.d.O., L.v.d.K., R.B.v.d.L., Q.Waisfisz, S.W-G., C.R.W., C.W., A.S.W., H.W., W.W., R.W., A.W., A.H.W., T.Y., X.R.Y., C.H.Y., K-Y.Y., J-C.Y., W.Z., Y.Z., A.Z., E.Z., ABCTB.I., kConFab/AOCS.I., NBCS.C., A.C.A., I.L.A., F.J.C., P.D.P.P., J.Chang-Claude, P.Hall, D.J.H., R.L.M., M.G-C., M.K.S., G.D.B., J.Simard, P.K., D.F.E.

All authors read and approved the final version of the manuscript.

The authors confirm that they have no competing financial interests.

Data Availability Statement

A subset of the data that support the findings of this study is publicly available via dbGaP (www.ncbi.nlm.nih.gov/gap; accession number phs001265.v1.p1). The complete dataset will not be made publicly available due to restrictions imposed by the ethics committees of individual studies; requests for data can be made to the corresponding author or the Data Access Coordination Committee (DACC) of BCAC (http://bcac.ccge.medschl.cam.ac.uk/). BCAC DACC approval is required to access data from studies ABCFS, ABCS, ABCTB, BBCC, BBCS, BCEES, BCFR-NY, BCFR-PA, BCFR-UT, BCINIS, BSUCH, CBCS, CECILE, CGPS, CTS, DIETCOMPLYF, ESTHER, GC-HBOC, GENICA, GEPARSIXTO, GESBC, HABCS, HCSC, HEBCS, HMBCS, HUBCS, KARBAC, KBCP, LMBC, MABCS, MARIE, MBCSG, MCBCS, MISS, MMHS, MTLGEBCS, NC-BCFR, OFBCR, ORIGO, pKARMA, POSH, PREFACE, RBCS, SKKDKFZS, SUCCESSB, SUCCESSC, SZBCS, TNBCC, UCIBCS, UKBGS and UKOPS (see Supplementary Table 1).

Summary results for all variants are available at http://bcac.ccge.medschl.cam.ac.uk/. Requests for further data should be made through the BCAC Data Access Co-ordinating Committee (http://bcac.ccge.medschl.cam.ac.uk/).
