IBD risk loci are enriched in multigenic regulatory modules encompassing putative causative genes

Yukihide Momozawa; Julia Dmitrieva; Emilie Théâtre; Valérie Deffontaine; Souad Rahmouni; Benoît Charloteaux; François Crins; Elisa Docampo; Mahmoud Elansary; Ann-Stephan Gori; Christelle Lecut; Rob Mariman; Myriam Mni; Cécile Oury; Ilya Altukhov; Dmitry Alexeev; Yuri Aulchenko; Leila Amininejad; Gerd Bouma; Frank Hoentjen; Mark Löwenberg; Bas Oldenburg; Marieke J Pierik; Andrea E vander Meulen-de Jong; C Janneke van der Woude; Marijn C Visschedijk; The International IBD Genetics Consortium; Mark Lathrop; Jean-Pierre Hugot; Rinse K Weersma; Martine De Vos; Denis Franchimont; Severine Vermeire; Michiaki Kubo; Edouard Louis; Michel Georges

doi:10.1038/s41467-018-04365-8

. 2018 Jun 21;9:2427. doi: 10.1038/s41467-018-04365-8

IBD risk loci are enriched in multigenic regulatory modules encompassing putative causative genes

Yukihide Momozawa ^1,^2,^#, Julia Dmitrieva ^1,^#, Emilie Théâtre ¹, Valérie Deffontaine ¹, Souad Rahmouni ¹, Benoît Charloteaux ¹, François Crins ¹, Elisa Docampo ¹, Mahmoud Elansary ¹, Ann-Stephan Gori ¹, Christelle Lecut ³, Rob Mariman ¹, Myriam Mni ¹, Cécile Oury ³, Ilya Altukhov ⁴, Dmitry Alexeev ⁵, Yuri Aulchenko ^6,^7,⁸, Leila Amininejad ⁹, Gerd Bouma ¹⁰, Frank Hoentjen ¹¹, Mark Löwenberg ¹², Bas Oldenburg ¹³, Marieke J Pierik ¹⁴, Andrea E vander Meulen-de Jong ¹⁵, C Janneke van der Woude ¹⁶, Marijn C Visschedijk ¹⁷; The International IBD Genetics Consortium, Mark Lathrop ¹⁸, Jean-Pierre Hugot ¹⁹, Rinse K Weersma ¹⁷, Martine De Vos ²⁰, Denis Franchimont ⁹, Severine Vermeire ²¹, Michiaki Kubo ², Edouard Louis ²², Michel Georges ^1,^✉

¹Unit of Animal Genomics, WELBIO, GIGA-R & Faculty of Veterinary Medicine, University of Liège (B34), 1 Avenue de l’Hôpital, Liège, 4000 Belgium

²Laboratory for Genotyping Development, RIKEN Center for Integrative Medical Science, 1-7-22, Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045 Japan

³Laboratory of Thrombosis and Hemostasis, GIGA-R, University of Liège (B34), 1 Avenue de l’Hôpital, 4000 Liège, Belgium

⁴Moscow Institute of Physics and Technology, Institutskiy Pereulok 9, Dolgoprudny, 141700 Russian Federation

⁵Novosibirsk State University, Pirogova ave. 2, Novosibirsk, 630090 Russian Federation

⁶PolyOmica, Het Vlaggeschip 61, ‘s-Hertogenbosch, 5237 PA The Netherlands

⁷Institute of Cytology and Genetics SD RAS, Lavrentyeva ave. 10, 630090 Novosibirsk, Russia

⁸Centre for Global Health Research, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Teviot Place, Edinburgh EH8 9AG UK

⁹Gastroentérologie Médicale, Faculté de Médicine, Université Libre de Bruxelles, Route de Lennik 808, Anderlecht, 1070 Belgium

¹⁰Department of Gastroenterology and Hepatology, VU University Medical Centre, Amsterdam, 1081 HV The Netherlands

¹¹Department of Gastroenterology and Hepatology, University Medical Centre St. Radboud, Nijmegen, 6525 GA The Netherlands

¹²Department of Gastroenterology and Hepatology, Amsterdam Medical Centre, Amsterdam, 1105 AZ The Netherlands

¹³Department of Gastroenterology and Hepatology, University Medical Centre Utrecht, 3584 cX, Utrecht, The Netherlands

¹⁴Department of Gastroenterology and Hepatology, University Medical Centre Maastricht, Maastricht, 6229 HX The Netherlands

¹⁵Department of Gastroenterology and Hepatology, Leiden University Medical Centre, Leiden, 2333 ZA The Netherlands

¹⁶Department of Gastroenterology and Hepatology, Erasmus Medical Centre, Rotterdam, 3015 CE The Netherlands

¹⁷Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, Hanzeplein 1, Groningen, 9713 GZ The Netherlands

¹⁸McGill University Centre for Molecular and Computational Genomics, 740 Dr. Penfield Avenue, Montreal, H3A 0G1 QC Canada

¹⁹UMR 1149 INSERM/Université Paris-Diderot Sorbonne Paris-Cité, Assistance Publique Hôpitaux de Paris, 48 Bd Sérurier, Paris, 75019 France

²⁰Department of Gastroenterology, University Hospital, De Pintelaan 185, Gent, 9000 Belgium

²¹Translational Research in Gastrointestinal Disorders, Department of Clinical and Experimental Medicine, KU Leuven, UZ Herestraat 49, Leuven, 3000 Belgium

²²CHU-Liège and Unit of Gastroenterology, GIGA-R & Faculty of Medicine, University of Liège, 1 Avenue de l’Hôpital, Liège, 4000 Belgium

²³Section of Digestive Diseases, Department of Internal Medicine, Yale School of Medicine, New Haven, CT USA

²⁴Department of Gastroenterology and Hepatology, Digestive Disease Institute, Cleveland Clinic, Cleveland, OH USA

²⁵Department of Pathobiology, Lerner Research Institute, Cleveland Clinic, Cleveland, OH USA

²⁶Peninsula College of Medicine and Dentistry, Exeter, UK

²⁷Gastroenterology Unit, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 USA

²⁸Division of Medical Sciences, Harvard Medical School, Boston, MA USA

²⁹Focused Research Unit for Molecular Diagnostic and Clinical Research (MOK), IRS-Center Sonderjylland, Hospital of Southern Jutland, Åbenrå, 6200 Denmark

³⁰Institute of Molecular Medicine, University of Southern Denmark, Odense, 5000 Denmark

³¹Institute of Regional Health Research, University of Southern Denmark, Odense, Denmark

³²Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA UK

³³Inflammatory Bowel Disease Service, Department of Gastroenterology and Hepatology, Royal Adelaide Hospital, Adelaide, Australia

³⁴Unit of Gastroenterology, Istituto di Ricovero e Cura a Carattere Scientifico-Casa Sollievo della Sofferenza (IRCCS-CSS) Hospital, San Giovanni Rotondo, Italy

³⁵Strutture Organizzative Dipartimentali (SOD) Gastroenterologia 2, Azienda Ospedaliero Universitaria (AOU) Careggi, Florence, Italy

³⁶Facult de Médecine, Universit de Montréal, Montréal, QC H3C 3J7 Canada

³⁷Department of Gastroenterology, Hôpital Maisonneuve-Rosemont, Montréal, QC Canada

³⁸Division of Gastroenterology, Hepatology and Nutrition, Department of Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA 15213 USA

³⁹Center for Applied Genomics, The Children’s Hospital of Philadelphia, Philadelphia, PA USA

⁴⁰Department of Gastroenterology and Hepatology, Flinders Medical Centre and School of Medicine, Flinders University, Adelaide, Australia

⁴¹Department of Medicine, University of Otago, Christchurch, New Zealand

⁴²Meyerhoff Inflammatory Bowel Disease Center, Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD USA

⁴³Department for General Internal Medicine, Christian-Albrechts-University, Kiel, Germany

⁴⁴Division of Gastroenterology, Royal Victoria Hospital, Montréal, QC Canada

⁴⁵Research Center, Montreal Heart Institute, Montréal, QC H1T 1C8 Canada

⁴⁶Department of Medicine II, Ludwig-Maximilians-University Hospital Munich-Grosshadern, Munich, Germany

⁴⁷Department of Gastroenterology, Campus Charité Mitte, Universitatsmedizin Berlin, Berlin, Germany

⁴⁸Harry Perkins Institute for Medical Research, School of Medicine and Pharmacology, University of Western Australia, Murdoch, WA 6150 Australia

⁴⁹IBD Unit, Fremantle Hospital, Fremantle, Australia

⁵⁰Department of Genetics, Yale School of Medicine, New Haven, CT 06510 USA

⁵¹Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, NY USA

⁵²Inflammatory Bowel Diseases, Genetics and Computational Biology, Queensland Institute of Medical Research, Brisbane, Australia

⁵³Analytic and Translational Genetics Unit, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 USA

⁵⁴Broad Institute of MIT and Harvard, Cambridge, MA 02141 USA

⁵⁵Clinical Epidemiology Unit, Department of Medicine Solna, Karolinska Institutet, Stockholm, 17176 Sweden

⁵⁶Department of Gastrointestinal and Liver Diseases, BioDonostia Health Research Institute, San Sebastián, 20014 Spain

⁵⁷IKERBASQUE, Basque Foundation for Science, Bilbao, 48013 Spain

⁵⁸IBD Center, Department of Gastroenterology, Istituto Clinico Humanitas, Milan, Italy

⁵⁹Center of Hepatology, Gastroenterology and Dietetics, Vilnius University, Vilnius, Lithuania

⁶⁰Pediatric Gastroenterology, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH USA

⁶¹Department of Gastroenterology, Université Catholique de Louvain (UCL) Cliniques Universitaires Saint-Luc, Brussels, Belgium

⁶²Division of Gastroenterology, University Hospital Padua, Padua, Italy

⁶³Department of Pediatrics, Cedars Sinai Medical Center, Los Angeles, CA USA

⁶⁴Department of Human Genetics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, PA 15261 USA

⁶⁵Department of Gastroenterology, Torbay Hospital, Torbay, Devon UK

⁶⁶Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, 24118 Germany

⁶⁷Center for Human Genetic Research, Massachusetts General Hospital, Harvard Medical School, Boston, MA USA

⁶⁸Pediatrics, Harvard Medical School, Boston, MA USA

⁶⁹Faculty of Medical & Health Sciences, School of Medical Sciences, The University of Auckland, Auckland, New Zealand

⁷⁰F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048 USA

⁷¹Department of Gastroenterology, Mater Health Services, Brisbane, Australia

⁷²Department of Genetics, University Medical Center Groningen, Groningen, The Netherlands

⁷³Department of Gastroenterology, Christchurch Hospital, Christchurch, New Zealand

⁷⁴Institute of Genetic Epidemiology, Helmholtz Zentrum München—German Research Center for Environmental Health, Neuherberg, Germany

⁷⁵Department of Preventive Dentistry and Periodontology, Ludwig-Maximilians-University Hospital Munich-Grosshadern, Munich, Germany

⁷⁶Division of Pediatric Gastroenterology, Hepatology and Nutrition, Hospital for Sick Children, Toronto, ON Canada

⁷⁷Department of Pediatrics, University of Utah School of Medicine, Salt Lake City, UT USA

⁷⁸Department of Gastroenterology, Faculty of Medicine and Health, Örebro University, SE-70182 Örebro, Sweden

⁷⁹Department of Medicine, St. Mark’s Hospital, Harrow, Middlesex UK

⁸⁰Nottingham Digestive Diseases Centre, Queens Medical Centre, Nottingham, UK

⁸¹Molecular Epidemiology, Genetics and Computational Biology, Queensland Institute of Medical Research, Brisbane, Australia

⁸²Paediatric Gastroenterology and Nutrition, Royal Hospital for Sick Children, Edinburgh, UK

⁸³Child Life and Health, University of Edinburgh, Edinburgh, Scotland UK

⁸⁴Division of Rheumatology Immunology and Allergy, Brigham and Women’s Hospital, Boston, MA USA

⁸⁵Academy of Medicine, Lithuanian University of Health Sciences, Kaunas, Lithuania

⁸⁶Wellcome Trust Centre for Human Genetics, University of Oxford, Headington, OX3 7BN UK

⁸⁷Christ Church, University of Oxford, St Aldates, OX1 1DP UK

⁸⁸Research Institute of Internal Medicine, Department of Transplantation Medicine, Division of Cancer, Surgery and Transplantation, Oslo University Hospital Rikshospitalet, Oslo, Norway

⁸⁹Norwegian PSC Research Center, Department of Transplantation Medicine, Division of Cancer, Surgery and Transplantation, Oslo University Hospital Rikshospitalet, Oslo, Norway

⁹⁰K.G. Jebsen Inflammation Research Centre, Institute of Clinical Medicine, University of Oslo, Oslo, Norway

⁹¹Gastrointestinal Unit, Western General Hospital University of Edinburgh, Edinburgh, UK

⁹²Genetic Medicine, Manchester Academic Health Science Centre, Manchester, UK

⁹³The Manchester Centre for Genomic Medicine, University of Manchester, Manchester, UK

⁹⁴QIMR Berghofer Medical Research Institute, Royal Brisbane Hospital, Brisbane, Australia

⁹⁵Department of Pediatrics, Emory University School of Medicine, Atlanta, GA USA

⁹⁶Department of Gastroenterology, Kaunas University of Medicine, Kaunas, Lithuania

⁹⁷Centre for Inflammatory Bowel Diseases, Saint John of God Hospital, Subiaco, WA 6008 Australia

⁹⁸Inflammatory Bowel Disease Research Group, Addenbrooke’s Hospital, Cambridge, CB2 0QQ UK

⁹⁹Faculty of medicine, University of Latvia, Riga, Latvia

¹⁰⁰Dipartimento di Neuroscienze, Psicologia, Area del Farmaco e Salute del Bambino, Universitê di Firenze Strutture Organizzative Dipartimentali (SOD) Gastroenterologia e Nutrizione Ospedale Pediatrico Meyer, Firenze, Italy

¹⁰¹Department of Gastroenterology, The Townsville Hospital, Townsville, Australia

¹⁰²Institute of Human Genetics, Newcastle University, Newcastle upon Tyne, UK

¹⁰³Department of Medical and Molecular Genetics, King’s College London, London, SE1 9RT UK

¹⁰⁴Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand, Johannesburg, 2193 South Africa

¹⁰⁵Inflammatory Bowel Disease Centre, Mount Sinai Hospital, Toronto, ON Canada

¹⁰⁶Center for Human Molecular Genetics and Pharmacogenomics, Faculty of Medicine, University of Maribor, Maribor, Slovenia

¹⁰⁷Department of Medicine, Ninewells Hospital and Medical School, Dundee, UK

¹⁰⁸Center for Computational and Integrative Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA USA

¹⁰⁹Department of Medicine and Therapeutics, Institute of Digestive Disease, Chinese University of Hong Kong, Hong Kong, Hong Kong

¹¹⁰Department of Genomics Life & Brain Center, University Hospital Bonn, Bonn, Germany

¹¹¹Faculty for Chemistry and Chemical Engineering, University of Maribor, Maribor, Slovenia

¹¹²Department of Gastroenterology, Royal Brisbane and Womens Hospital, Brisbane, Australia

¹¹³Department of Gastroenterology, Université Catholique de Louvain (UCL) Centre Hospitalier Universitaire (CHU) Mont-Godinne, Mont-Godinne, Belgium

¹¹⁴Department of Gastroenterology, Guy’s & St. Thomas’ NHS Foundation Trust, St.-Thomas Hospital, London, UK

¹¹⁵Department of Digestive Diseases, Hospital Quiron Teknon, Barcelona, Spain

¹¹⁶Department of Public Health Sciences, University of Chicago, Chicago, IL USA

¹¹⁷Human Genetics, Genome Institute of Singapore, Singapore, Singapore

¹¹⁸Institute for Human Genetics, University of California, San Francisco, CA USA

¹¹⁹Open Targets, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD UK

¹²⁰Department of Biology of Radiations and Human Health, Agenzia Nazionale per le Nuove Tecnologie l’energia e lo Sviluppo Economico Sostenibile (ENEA), Rome, Italy

¹²¹Department of Clinical Science Intervention and Technology, Karolinska Institutet, Stockholm, Sweden

¹²²Gastroenterology & General Medicine, Norfolk and Norwich University Hospital, Norwich, UK

¹²³Faculty of Medicine, University of Toronto, Toronto, ON Canada

¹²⁴Department of Computer Science, New Jersey Institute of Technology, Newark, NJ USA

¹²⁵Institute of Human Genetics, Technische Universität München, Munich, Germany

¹²⁶Department of Neurology, Technische Universität München, Munich, Germany

¹²⁷Department of Biostatistics, School of Public Health, Yale University, New Haven, CT USA

¹²⁸Department of Gastroenterology, West China Hospital, Chengdu, Sichuan China

¹²⁹State Key Laboratory of Biotherapy, Sichuan University West China University of Medical Sciences (WCUMS), Chengdu, Sichuan China

^✉

Corresponding author.

Contributed equally.

PMCID: PMC6013502 PMID: 29930244

Abstract

GWAS have identified >200 risk loci for Inflammatory Bowel Disease (IBD). The majority of disease associations are known to be driven by regulatory variants. To identify the putative causative genes that are perturbed by these variants, we generate a large transcriptome data set (nine disease-relevant cell types) and identify 23,650 cis-eQTL. We show that these are determined by ∼9720 regulatory modules, of which ∼3000 operate in multiple tissues and ∼970 on multiple genes. We identify regulatory modules that drive the disease association for 63 of the 200 risk loci, and show that these are enriched in multigenic modules. Based on these analyses, we resequence 45 of the corresponding 100 candidate genes in 6600 Crohn disease (CD) cases and 5500 controls, and show with burden tests that they include likely causative genes. Our analyses indicate that ≥10-fold larger sample sizes will be required to demonstrate the causality of individual genes using this approach.

Most of the more than 200 known genetic risk loci for inflammatory bowel disease (IBD) reside in regulatory regions. Here, the authors provide eQTL datasets for six circulating immune cell types and ileal, colonic and rectal biopsies to map regulatory modules and identify potential causative genes for IBD.

Introduction

Genome Wide Association Studies (GWAS) scan the entire genome for statistical associations between common variants and disease status in large case–control cohorts. GWAS have identified tens to hundreds of risk loci for nearly all studied common complex diseases of human¹. The study of Inflammatory Bowel Disease (IBD) has been particularly successful, with more than 200 confirmed risk loci reported to date^2,3. As a result of the linkage disequilibrium (LD) patterns in the human genome (limiting the mapping resolution of association studies), GWAS-identified risk loci typically span ~250 kb, encompassing an average of ~5 genes (numbers ranging from zero (“gene deserts”) to more than 50) and hundreds of associated variants. Contrary to widespread misconception, the causative variants and genes remain unknown for the vast majority of GWAS-identified risk loci. Yet, this remains a critical goal in order to reap the full benefits of GWAS in identifying new drug targets and developing effective predictive and diagnostic tools. It is the main objective of post-GWAS studies.

Distinguishing the few causative variants (i.e., the variants that are directly causing the gene perturbation) from the many neutral variants that are only associated with the disease because they are in LD with the former in the studied population, requires the use of sophisticated fine-mapping methods applied to very large, densely genotyped data sets⁴, ideally followed-up by functional studies⁵. Using such approaches, 18 causative variants for IBD were recently fine-mapped at single base pair resolution, and 51 additional ones at ≤10 base pair resolution⁴.

A minority of causative variants are coding, i.e., they alter the amino-acid sequence of the encoded protein. In such cases, and particularly if multiple such causative coding variants are found in the same gene (i.e., in case of allelic heterogeneity), the corresponding causative gene is unambiguously identified. In the case of IBD, causative genes have been identified for approximately ten risk loci on the basis of such “independently” (i.e., not merely reflecting LD with other variants) associated coding variants, including NOD2, ATG16L1, IL23R, CARD9, FUT2, and TYK2^4,6–9.

For the majority of risk loci, the GWAS signals are not driven by coding variants. They must therefore be driven by common regulatory variants, i.e., variants that perturb the expression levels of one (or more) target genes in one (or more) disease relevant cell types⁴. Merely reflecting the proportionate sequence space that is devoted to the different layers of gene regulation (transcriptional, posttranscriptional, translational, posttranslational), the majority of regulatory variants are likely to perturb components of “gene switches” (promoters, enhancers, insulators), hence affecting transcriptional output. Indeed, fine-mapped non-coding variants are enriched in known transcription-factor binding sites and epigenetic signatures marking gene switch components⁴. Hence, the majority of common causative variants underlying inherited predisposition to common complex diseases must drive cis-eQTL (expression quantitative trait loci) affecting the causative gene(s) in one or more disease relevant cell types. The corresponding cis-eQTL are expected to operate prior to disease onset, and—driven by common variants—detectable in cohorts of healthy individuals of which most will never develop the disease. The term cis-eQTL refers to the fact that the regulatory variants that drive them only affect the expression of genes/alleles residing on the same DNA molecule, typically no more than one megabase away. Causative variants, whether coding or regulatory, may secondarily perturb the expression of genes/alleles located on different DNA molecules, generating trans-eQTL. Some of these trans-eQTL may participate in the disease process.

cis-eQTL effects are known to be very common, affecting more than 50% of genes¹⁰. Hence, finding that variants associated with a disease are also associated with changes in expression levels of a neighboring gene is not sufficient to incriminate the corresponding genes as causative. Firstly, one has to show that the local association signal for the disease and for the eQTL are driven by the same causative variants. A variety of “colocalisation” methods have been developed to that effect^11–13. Secondly, regulatory variants may affect elements that control the expression of multiple genes¹⁴, which may not all contribute to the development of the disease, i.e., be causative. Thus, additional evidence is needed to obtain formal proof of gene causality. In humans, the only formal test of gene causality that is applicable is the family of “burden” tests, i.e., the search for a differential burden of disruptive mutations in cases and controls, which is expected only for causative genes¹⁵. Burden tests rely on the assumption that—in addition to the common, mostly regulatory variants that drive the GWAS signal—the causative gene will be affected by low frequency and rare causative variants, including coding variants. Thus, the burden test makes the assumption that allelic heterogeneity is common, which is supported by the pervasiveness of allelic heterogeneity of Mendelian diseases in humans¹⁶. Burden tests compare the distribution of rare coding variants between cases and controls¹⁵. The signal-to-noise ratio of the burden test can be increased by restricting the analysis to coding variants that have a higher probability to disrupt protein function¹⁵. In the case of IBD, burden tests have been used to prove the causality of NOD2, IL23R, and CARD9^6,8,9. A distinct and very elegant genetic test of gene causality is the reciprocal hemizygosity test, and the related quantitative complementation assay^17,18. However, with few exceptions^19,20, it has only been applied in model organisms in which gene knock-outs can be readily generated²¹.

In this paper, we describe the generation of a new and large data set for eQTL analysis (350 healthy individuals) in nine cell types that are potentially relevant for IBD. We identify and characterize ~24,000 cis-eQTL. By comparing disease and eQTL association patterns (EAP) using a newly developed statistic, we identify 99 strong positional candidate genes in 63 GWAS-identified risk loci. We resequence the 555 exons of 45 of these in 6600 cases and 5500 controls in an attempt to prove their causality by means of burden tests. The outcome of this study is relevant to post-GWAS studies of all common complex disease in humans.

Results

Clustering cis-eQTL into regulatory modules

We generated transcriptome data for six circulating immune cell types (CD4+ T lymphocytes, CD8+ T lymphocytes, CD19+ B lymphocytes, CD14+ monocytes, CD15+ granulocytes, platelets) as well as ileal, colonic, and rectal biopsies (IL, TR, RE), collected from 323 healthy Europeans (141 men, 182 women, average age 56 years, visiting the clinic as part of a national screening campaign for colon cancer) using Illumina HT12 arrays (CEDAR data set; Methods). IBD being defined as an inappropriate mucosal immune response to a normal commensal gut flora²², these nine cell types can all be considered to be potentially disease relevant. Using standard methods based on linear regression and two megabase windows centered on the position of the interrogating probe (Methods), we identified significant cis-eQTL (FDR < 0.05) for 8804 of 18,580 tested probes (corresponding to 7216 of 13,615 tested genes) in at least one tissue, amounting to a total of 23,650 cis-eQTL effects (Supplementary Data 1). When a gene shows a cis-eQTL in more than one tissue, the corresponding “eQTL association patterns” (EAP) (i.e., the distribution of association −log(p) values for all the variants in the region of interest) are expected to be similar if determined by the same regulatory variants, and dissimilar otherwise. Likewise, if several neighboring genes show cis-eQTL in the same or distinct tissues, the corresponding EAP are expected to be similar if determined by the same regulatory variants, and dissimilar otherwise (Fig. 1). We devised the ϑ metric to measure the similarity between association patterns (Methods). ϑ is a correlation measure for paired −log(p) values (for the two eQTL that are being compared) that ranges between −1 and +1. ϑ shrinks to zero if Pearson’s correlation between paired −log(p) values does not exceed a chosen threshold (i.e., if the EAP are not similar). ϑ approaches +1 when the two EAP are similar and when variants that increase expression in eQTL 1 consistently increase expression in eQTL 2. ϑ approaches −1 when the two EAP are similar and when variants that increase expression in eQTL 1 consistently decrease expression in eQTL 2. ϑ gives more weight to variants with high −log(p) for at least one EAP (i.e., it gives more weight to eQTL peaks). Based on the known distribution of ϑ under H₀ (i.e., eQTL determined by distinct variants in the same region) and H₁ (i.e., eQTL determined by the same variants), we selected a threshold value |ϑ| > 0.60 to consider that two EAP were determined by the same variant. This corresponds to a false positive rate of 0.05, and a false negative rate of 0.23 (Supplementary Fig. 1). We then grouped EAP in “cis-acting regulatory modules” (cRM) using |ϑ| and a single-link clustering approach (i.e., an EAP needs to have |ϑ| > 0.60 with at least one member of the cluster to be assigned to that cluster). Clusters were visually examined and 29 single edges connecting otherwise unlinked and yet tight clusters manually removed (Supplementary Fig. 2).

Fig. 1 — *cis*-Regulatory Module (cRM). A *cis*-eQTL affecting gene A in tissue 1 reveals itself by an “eQTL Association Pattern” (EAP_A,1), i.e., the pattern of –log(p) values for variants in the region. Multiple EAP can be observed in a given chromosome region, affecting one or more genes in one or more cell types. EAP that are driven by the same underlying variants are expected to be similar, while EAP driven by distinct variants (for instance, the green and red regulatory variants in the figure) are not. Based on the measure of similarity introduced in this work, ϑ, we cluster the EAP in cRM. For EAP in the same module, ϑ can be positive or negative, indicating that the variants have the same sign of effect (increasing or decreasing expression) for the corresponding EAP pair

Using this approach, we clustered the 23,650 effects in 9720 distinct “cis-regulatory modules” (cRM), encompassing cis-eQTL with similar EAP (Supplementary Data 2). Sixty-eight percent of cRM were gene- and tissue-specific, 22% were gene-specific but operating across multiple tissues (≤9 tissues, average 3.5), and 10% were multi-genic (≤11 genes, average 2.5) and nearly always multi-tissue (Figs. 2 and 3, Supplementary Fig. 2). In this, cRM are considered gene specific if the EAPs in the cluster concern only one gene, and tissue specific if the EAP in the cluster concern only one of the nine cell types. They are, respectively, multigenic and multi-tissue otherwise. cRM operating across multiple tissues tended to affect multiple genes (r = 0.47; p < 10⁻⁶). In such cRM, the direction of the effects tended to be consistent across tissues and genes (p < 10⁻⁶). Nevertheless, we observed at least 55 probes with effect of opposite sign in distinct cell types (ϑ ≤ −0.9), i.e., the corresponding regulatory variants increases transcript levels in one cell type while decreasing them in another (Fig. 4 and Supplementary Data 3). Individual tissues allowed for the detection of 7–33% of all cRM, and contributed 3–14% unique cRM (Supplementary Fig. 3). Sixty-nine percent of cRM were only detected in one cell type. The rate of cRM sharing between cell types reflects known ontogenic relations. Considering cRM shared by only two cell types (i.e., what jointly differentiates these two cell types from all other), revealed the close proximity of the CD4–CD8, CD14–CD15, ileum–colon, and colon–rectum pairs. Adding information of cRM shared by up to six cell types grouped lymphoid (CD4, CD8, CD19), myeloid (CD14, CD15 but not platelets), and intestinal (ileum, colon and rectum) cells. Adding cRM with up to nine cell types revealed a link between ileum and blood cells, possibly reflecting the presence of blood cells in the ileal biopsies (Fig. 5).

Fig. 2 — Single-gene/tissue versus multi-gene/tissue cRM. Using |ϑ| > 0.6, the 23,950 *cis*-eQTL (FDR ≤ 0.05) detected in the nine analyzed cell types were clustered in 9720 *cis*-Regulatory Modules (cRM). 68% of these were single-gene, single-tissue cRM (green), 22% were single-gene, multi-tissue cRM (blue), and 10% were multi-gene, mostly multi-tissue cRM (red). The number of observations for single-gene cRM were divided by 10 in the graph for clarity. Thus, there are more cases of single-gene, multi-tissue cRM (blue; 2155) than multi-gene cRM (red; 967)

Fig. 3 — Example of a multi-gene, multi-tissue cRM. Gene-tissue combinations for which no expression could be detected are marked by “-”, with detectable expression but without evidence for *cis*-eQTL as “ → ”, with detectable expression and evidence for a *cis*-eQTL as “↑” or “↓” (large arrows: FDR < 0.05; small arrows: FDR ≥ 0.05 but high |ϑ| values). eQTL labeled by the yellow arrows constitute the multi-genic and multi-tissular cRM no. 57. The corresponding regulatory variant(s) increase expression of the *GINM1*, *NUP43* and probably *KATNA1* genes (left side of the cRM), while decreasing expression of the *PCMT1* and *LRP11* genes (right side of the cRM). The expression of *GINM1* in CD15 and *LRP11* in CD4 appears to be regulated in opposite directions by a distinct cRM (no. 3694, green). The *LATS1* gene, in the same region, is not affected by the same regulatory variants in the studied tissues. Inset 1: ϑ values for all EAP pairs. EAP pairs with |ϑ| > 0.6 are bordered in yellow when corresponding to cRM no. 57, in green when corresponding to cRM no. 3694 (+green arrow)

Fig. 4 — Variant(s) with opposite effects on expression in two cell types. Example of a gene (*PNKD*) affected by a *cis*-eQTL in at least two cell types (CD14 and platelets) that are characterized by EAP with ϑ = −0.97, indicating that the gene’s expression level is affected by the same regulatory variant in these two cell types, yet with opposite effects, i.e., the variant that is increasing expression in platelets is decreasing expression in CD14

Fig. 5 — Significance of the excess sharing of cRM between cell types. (red: p < 0.0002 (Bonferroni corrected 0.0144), orange: p < 0.001 (0.072), rose: p < 0.01 (0.51)). The numbers in the lower-left corner of the squares indicate which cRM were used for the analysis: (2) cRM affecting no more than two cell types, (3) cRM affecting no more than three cell types, etc. The upper-left square indicates the position of the lymphoid cell types (L)(CD4, CD8, CD19), the myeloid cell types (M)(CD14, CD15, PLA), and the intestinal cell types (I)(IL, TR, RE). For each pair of cell types i and j, we computed two p values, one using i as reference, the other using j as reference (Methods). Pairs of p values were always consistent

cRM matching IBD association signals are often multigenic

If regulatory variants affect disease risk by perturbing gene expression, the corresponding “disease association patterns” (DAP) and EAP are expected to be similar, even if obtained in distinct cohorts (yet with same ethnicity) (Fig. 6). We confronted DAP and EAP using the ϑ statistic and threshold (|ϑ| > 0.60) described above for 200 GWAS-identified IBD risk loci. DAP for Crohn’s disease and ulcerative colitis were obtained from the International IBD Genetics Consortium (IIBDGC)^2,3, EAP from the CEDAR data set.

Fig. 6 — DAP-matching cRM. If a regulatory variant (red) affects disease risk by altering the expression levels of gene B in tissue 2, the EAP_B,2 is expected to be similar (high |ϑ|) to the “disease association pattern” (DAP), both assigned therefore to the same cRM. ϑ is positive if increased gene expression is associated with increased disease risk, negative otherwise. A *cis*-eQTL that is driven by a regulatory variant (green) that does not directly affect disease risk, will be characterized by an EAP (say gene A, tissue 2, EAP_A,2) that is not similar to the DAP (low |ϑ|)

The probability that two unrelated association signals in a chromosome region of interest are similar (i.e., have high |ϑ| value) is affected by the degree of LD in the region. If the LD is high it is more likely that two association signals are similar by chance. To account for this, we generated EAP- and locus-specific distributions of |ϑ| by simulating eQTL explaining the same variance as the studied eQTL, yet driven by 100 variants that were randomly selected in the risk locus (matched for MAF), and computing |ϑ| with the DAP for all of these. The resulting empirical distribution of |ϑ| was used to compute the probability to obtain a value of |ϑ| as high or higher than the observed one, by chance alone (Methods).

Strong correlations between DAP and EAP (|ϑ| > 0.6, associated with low empirical p values) were observed for at least 63 IBD risk loci, involving 99 genes (range per locus: 1–6) (Table 1, Fig. 7, Supplementary Data 4). Increased disease risk was associated equally frequently with increased as with decreased expression (p_CD = 0.48; p_UC = 0.88). An open-access website has been prepared to visualize correlated DAP–EAP within their genomic context (http://cedar-web.giga.ulg.ac.be). Genes with highest |ϑ| values (≥0.9) include known IBD causative genes (for instance, ATG16L1, CARD9, and FUT2), known immune regulators (for instance, IL18R1, IL6ST, and THEMIS), as well as genes with as of yet poorly defined function in the context of IBD (for instance, APEH, ANKRD55, CISD1, CPEB4, DOCK7, ERAP2, GNA12, GPX1, GSDMB, ORMDL3, SKAP2, UBE2L3, and ZMIZ1) (Supplementary Note 1).

Table 1.

IBD risk loci for which at least one cis-eQTL association pattern (EAP) was found to match the disease association pattern (DAP)

Loc	Chr	Beg	End	cRM	Nr	Genes with correlated DAP–EAP	Implicated cell types	Best θ		Best p		Ref
Loc	Chr	Beg	End	cRM	Nr	Genes with correlated DAP–EAP	Implicated cell types	CD	UC	CD	UC	Ref
HD1	1	2.4	2.8	271	2	TNFRSF14	CD4 CD8 IL TR	−0.74	−0.79	0.02	0.03	^4,48
HD2	1	7.7	8.3	2900	1	PARK7	CD15 TR RE	−0.8	−0.82	0.01	0.06	⁴⁸
N_1_62	1	62.5	63.5	109	3	DOCK7 USP1 ATG4C	CD4 CD8 CD19 CD14 CD15	−0.9	0	0.01	1.00	³
N_1_100	1	101.0	102.0	6008	1	SLC30A7	TR	0	−0.71	1.00	0.06
J_1_119	1	120.2	120.7	9459	1	NOTCH2	CD19	0.68	0	0.13	1.00
HD14	1	155.0	156.1	5	8	GBA	CD4	−0.65	0	0.01	1.00
				238	3	THBS3 GBA MUC1	CD14 CD15 TR	0	0.81	1.00	0.02
				4513	1	THBS3	CD4	0	0.66	1.00	0.02
HD21	1	197.3	198.0	6071	1	DENND1B	CD4	0.7	0.78	0.03	0.02
HD30	2	62.4	62.7	3716	1	B3GNT2	CD8	−0.63	0	0.01	1.00
HD35	2	102.8	103.3	1132	1	*IL18R1*	CD4 CD8	−0.93	−0.87	0.01	0.03	⁴
				8912	1	(IL18RAP)	CD8	−0.42	0	0.11	0.38	⁴
J_2_197	2	198.2	199.1	325	2	MARS2 PLCL1	CD4 CD14	−0.72	0	0.06	1.00	^2,48
J_2_218	2	218.9	219.4	216	3	PNKD GPBAR1	CD14 TR RE	0.72	0.72	0.01	0.06	^2,48
HD43	2	234.1	234.6	1177	1	*ATG16L1*	CD4 CD8 IL TR RE	0.94	0	0.05	1.00	^2,49
N_3_45	3	46.0	47.0	2930	1	CCR2	CD19	0.77	0	0.02	1.00
				1203	1	CCR2	CD4	−0.62	0	0.07	1.00
				7768	1	CCR9	CD19	0	−0.67	1.00	0.06
				6798	1	KLHL18	CD14	0	−0.68	1.00	0.03
HD50	3	48.4	51.4	8	7	USP4	CD19	0.64	0.63	0.06	0.07	²
				217	3	*GPX1 APEH* IP6K1	CD19 CD14 TR RE	0.91	0.97	0.01	0.01	^2,49
				122	3	FAM212A	CD19	0	0.61	1.00	0.05
J_3_52	3	52.8	53.3	3190	1	SFMBT1	TR RE	0	−0.88	1.00	0.01	⁵⁰
J_4_73	4	74.6	75.1	1271	1	CXCL5	CD4 CD8 CD19 CD14 PLA	0	−0.84	1.00	0.01	²
HD60	5	40.0	40.7			(PTGER4)	CD15	0	0	0.28	0.15	⁵¹
HD61	5	55.4	55.5	360	2	*ANKRD55* IL6ST	CD4 CD8	0.9	0	0.02	1.00	⁴
HD62	5	72.4	72.6	6625	1	FOXD1	IL	−0.74	0	0.03	1.00	⁴
HD63	5	95.9	96.5	365	2	*ERAP2* LNPEP	CD4 CD8 CD19 CD14 CD15 PLA IL TR RE	0.94	0.71	0.01	0.02	^2,4,50
HD65	5	130.4	132.0	55	4	(SLC22A4) (SLC22A5)	CD4 CD15	−0.55	0	0.06	0.07	^4,52
HD66	5	141.4	141.7	2389	1	NDFIP1	CD8 PLA	0.87	0.88	0.04	0.01	²
HD67	5	149.0	151.0	–	–	(IRGM)	–	–	–	–	–	⁵³
HD71	5	173.2	173.6	1349	1	CPEB4	CD4 CD8 CD19 CD14 CD15 PLA TR	−0.92	0	0.01	1.00	^2,4
J_66_32	6	32.3	32.9	7853	1	HLA-DQA2	IL	0	−0.62	1.00	0.02
HD76	6	90.8	91.1	1404	1	BACH2	CD4	0.67	0	0.14	1.00
HD78	6	111.3	112.0	9603	1	SLC16A10	IL	0	−0.71	1.00	0.11
HD80	6	127.9	128.4	707	2	THEMIS PTPRK	CD8	−0.92	0	0.01	1.00
HD83	6	167.3	167.6	1425	1	RNASET2	CD4 CD8 CD15 PLA	−0.87	0	0.02	1.00	⁴
J_7_1	7	2.5	3.0	2729	1	GNA12	CD19 CD14 TR	0	−0.94	1.00	0.02	²
HD84	7	26.6	27.3	1441	1	*SKAP2*	CD4 CD8 CD19	0.97	0	0.01	1.00	⁴
HD85	7	28.1	28.3	6438	1	JAZF1	CD4	0.78	0	0.01	1.00	²
HD92	7	128.5	128.8	401	2	IRF5 TNPO3	CD15 IL	0	−0.64	1.00	0.02	^2,48
				7046	1	TSPAN33	CD19	−0.64	0	0.01	1.00
N_8_26	8	26.7	27.7	5869	1	PTK2B	CD14	−0.69	0	0.01	1.00
				5841	1	TRIM35	CD4	0	0.66	1.00	0.01
HD106	9	139.1	139.5	64	4	*CARD9* INPP5E SEC16A SDCCAG3	CD4 CD8 CD19 CD14 CD15 IL TR RE	0.95	0.86	0.01	0.02	^2,4,50
HD109	10	30.6	30.9	1603	1	MTPAP	TR	−0.62	0	0.11	1.00
HD112	10	59.8	60.2	1609	1	*CISD1*	CD4 CD8 CD19 CD14 CD15 TR RE	0.94	0.83	0.04	0?01	^2,4,48
J_10_74	10	75.4	75.9	436	2	VCL	CD4 CD8 CD19 CD14 RE	0	−0.79	1.00	0.04
				4279	1	CAM2KG	CD4	−0.67	0	0.04	1.00
HD114	10	81.0	81.2	5476	1	ZMIZ1	CD8	−0.91	−0.86	0.03	0.01
J_10_80	10	82.0	82.5	712	2	TSPAN14	TR	−0.71	0	0.01	1.00
				2216	1	TSPAN14	CD4 CD14	0.76	0	0.01	1.00	²
HD116	10	101.2	101.4	5439	1	SLC25A28	CD14	−0.61	0	0.22	1.00
J_11_57	11	58.1	58.6	7164	1	ZFP91	PLA	−0.64	−0.75	0.02	0.07
J_11_59	11	61.3	61.8	1670	1	TMEM258	CD4 CD8 CD19	0.83	0	0.04	1.00
J_11_65	11	65.4	65.9	451	2	CTSW FIBP	CD4 CD8	−0.73	0	0.01	1.00	²
HD122	11	114.2	114.6	268	3	REXO2 NXPE1 NXPE4	TR RE	0	−0.89	1.00	0.02	^4,50
HD123	11	118.3	118.8	8200	1	TREH	IL	0	0.7	1.00	0.05
HD142	14	88.2	88.7	8940	1	GPR65	CD14	0.8	0.79	0.01	0.01
				6353	1	(GALC)	CD14	−0.52	−0.23	0.06	0.06	⁴
J_15_40	15	41.3	41.8	9109	1	CHP1	IL	0.62	0	0.01	1.00
J_16_22	16	23.6	24.1	2672	1	PRKCB	CD14	0	0.64	1.00	0.05	²
HD150	16	28.2	29.1	6	8	TUFM SBK1 APOBR SGF29 CLN3 SPNS1	CD4 CD8 CD19 CD14 CD15 IL TR RE	0.81	0.86	0.05	0.03	⁴
HD151	16	30.4	31.4	2673	1	RNF40	CD15	−0.63	0	0.02	1.00
				1886	1	ITGAL	CD4 CD8 CD19	0	0.74	1.00	0.01	⁵⁴
HD153	16	68.4	68.9	1894	1	ZFP90	CD4 CD8 CD19 CD14 TR	0	0.83	1.00	0.07	^2,48
HD156	16	85.9	86.1	3328	1	IRF8	TR RE	0	0.72	1.00	0.01
HD159	17	37.3	38.3	37	5	*GSDMB ORMDL3* PGAP3 (GSDMA)	CD4 CD8 CD19 CD14 IL TR RE	−0.98	−0.92	0.02	0.01	^2,4
HD161	17	40.3	41.0	836	2	STAT3	PLA	0.67	0	0.10	1.00
HD164	18	67.4	67.6	1988	1	CD226	CD4 CD8 PLA	0	−0.86	1.00	0.01	²
N_18_76	18	76.7	77.7	7292	1	PQLC1	PLA	−0.68	0	0.01	1.00
HD166	19	10.3	10.7	9232	1	(TYK2)	CD14	−0.44	−0.09	0.10	0.10
HD168	19	47.1	47.4	581	2	GNG8	CD4	0	−0.63	1.00	0.06
HD169	19	49.0	49.3	3128	1	*FUT2*	IL TR RE	−0.95	0	0.01	1.00	⁴
J_20_31	20	31.1	31.6	593	2	COMMD7	CD14	0	0.61	1.00	0.01
J_20_32	20	33.6	34.1	7	8	UQCC1	CD19	−0.69	0	0.02	1.00	²
				3369	1	MMP24-AS1	RE	−0.63	−0.71	0.03	0.03
HD175	20	62.2	62.5	2322	1	LIME1	CD4 CD19	−0.86	0	0.01	1.00	²
HD176	21	16.6	16.9	9578	1	NRIP1	CD4	0	−0.69	1.00	0.02
HD180	22	21.7	22.1	2130	1	UBE2L3	CD4 CD8 CD19 CD14 CD15 IL TR RE	0.97	0.92	0.01	0.07	^2,4
N_22_41	22	41.4	42.4	2149	1	EP300	CD8 CD19 CD15	0	0.71	1.00	0.02

Open in a new tab

Given are (i) the name and chromosomal coordinates of the corresponding loci (Locus, Chr, Beg, End) (GRCh37/hg19 in Mb), (ii) the identifier and total number of genes in the matching cis-acting regulatory module (cRM, Nr), (iii) the genes and tissues involved in matching DAP–EAP (|ϑ| > 0.6) (bold when |ϑ| ≥ 0.9), (iv) the best ϑ-values and corresponding empirical p values obtained for CD and UC, respectively, and (vi) references reporting a link between one or more of the same genes and IBD on the basis of eQTL information. Genes that were resequenced are shown in italics. Genes that were resequenced despite |ϑ| ≤ 0.6 are bracketed, and the supporting references provided in “Ref”. The higher number of matching DAP–EAP in this study when compared to Huang et al.⁴ are primarily due to the fact that (i) we herein study 200 IBD risk loci (vs 97), and (ii) we increase the number of detected cis-eQTL approximately two-fold by correcting for hidden confounders using PCs

Fig. 7 — Screen shots of the CEDAR website, showing a known CD risk loci on the human karyotype, b a zoom in the HD35 risk locus showing the Refseq gene content and summarizing local CEDAR *cis*-eQTL data (white: no expression data, gray: expression data but no evidence for *cis*-eQTL, black: significant cis-eQTL but no correlation with DAP, red: significant *cis*-eQTL similar to DAP (ϑ < −0.60), green: significant *cis*-eQTL similar to DAP (ϑ > 0.60)), and c a zoom in the DAP for Crohn’s disease (black) and EAP for *IL18R1* (red), as well as the signed correlation between DAP and EAP

The eQTL link with IBD has not been reported before for at least 47 of the 99 reported genes (Table 1). eQTL links with IBD have been previously reported for 111 additional genes, not mentioned in Table 1. Our data support these links for 19 of them, however, with |ϑ| ≤ 0.6 (Supplementary Data 5). We applied SMR¹³ as alternative colocalisation method to our data. Using a Bonferroni-corrected threshold of ≤2.5 × 10⁻⁵ for p_SMR and ≥0.05 for p_HEIDI, SMR detected 35 of the 99 genes selected with ϑ (Supplementary Data 4). Using the same thresholds, SMR detected nine genes that were not selected by ϑ. Of these, three (ADAM15, AHSA2, UBA7) had previously been reported by others, while six (FAM189B, QRICH1, RBM6, TAP2, ADO, LGALS9) were not. Of these six, three (RBM6, TAP2, ADO) were characterized by 0.45 <|ϑ|<0.6 (Supplementary Data 5).

Using an early version of the CEDAR data set, significant (albeit modest) enrichment of overlapping disease and eQTL signals was reported for CD4, ileum, colon and rectum, focusing on 76 of 97 studied IBD risk loci (MAF of disease variant >0.05)⁴. By pre-correcting fluorescence intensities with 23 to 53 (depending on cell type) principal components to account for unidentified confounders (Methods), we increased the number of significant eQTL from 480 to 880 in the corresponding 97 regions (11,964 to 23,650 for the whole genome). We repeated the enrichment analysis focusing on 63 of the same 97 IBD loci (CD risk loci; MAF of disease variant >0.05), using three colocalisation methods including ϑ (Methods). We observed a systematic excess overlap in all analyzed cell types (2.5-fold on average). The enrichment was very significant with the three methods in CD4 and CD8 (Supplementary Table 1).

The 400 analyzed DAP (200 CD and 200 UC) were found to match 76 cRM (in 63 risk loci) with |ϑ| > 0.6 (Table 1), of which 25 are multigenic. Knowing that multigenic cRM represent 10% of all cRM (967/9720), 25/76 (i.e., 33%) corresponds to a highly significant three-fold enrichment (p < 10⁻⁹). To ensure that this apparent enrichment was not due to the fact that multigenic cRM have more chance to match DAP (as by definition multiple EAP are tested for multigenic cRM), we repeated the enrichment analysis by randomly sampling only one representative EAP per cRM in the 200 IBD risk loci. The frequency of multigenic cRM amongst DAP-matching cRM averaged 0.22, and was never ≤0.10 (p ≤ 10⁻⁵) (Supplementary Fig. 4). In loci with high LD, EAP driven by distinct regulatory variants (yet in high LD) may erroneously be merged in the same cRM. To ensure that the observed enrichment in multigenic cRM was not due to higher levels of LD, we compared the LD-based recombination rate of the 63 cRM-matching IBD risk loci with that of the rest of the genome (https://github.com/joepickrell/1000-genomes-genetic-maps). The genome-average recombination rate was 1.23 centimorgan per megabase (cM/Mb), while that of the 63 IBD risk loci was 1.34 cM/Mb, i.e., less LD in the 63 cRM-matching IBD risk loci than in the rest of the genome. We further compared the average recombination rate in the 63 cRM-matching IBD regions with that of sets of 63 loci centered on randomly drawn cRM (from the list of 9720), matched for size and chromosome number (as cM/Mb is affected by chromosome size). The average recombination rate around all cRM was 1.43 cM/Mb, and this didn’t differ significantly from the 63 cRM-matching IBD regions (p = 0.46) (Supplementary Fig. 5). Therefore, the observed enrichment cannot be explained by a higher LD in the 63 studied IBD risk loci. Taken together, EAP that are strongly correlated with DAP (|ϑ| ≥ 0.60), map to regulatory modules that are 2- to 3-fold enriched in multigenic cRM when compared to the genome average and include four of the top 10 (of 9720) cRM ranked by number of affected genes.

DAP-matching cRM are enriched in causative genes for IBD

For truly causative genes, the burden of rare disruptive variants is expected to differ between cases and controls²³. We therefore performed targeted sequencing for the 555 coding exons (∼88 kb) of 38 genes selected amongst those with strongest DAP–EAP correlations, plus seven genes with suggestive DAP–EAP evidence backed by literature (Table 1), in 6597 European CD cases and 5502 matched controls (ref.²⁴ and Methods). Eighteen of these were part of single-gene cRM and the only gene highlighted in the corresponding locus. The remaining 27 corresponded to multi-gene cRM mapping to 15 risk loci. We added the well-established NOD2 and IL23R causative IBD genes as positive controls. We identified a total of 174 loss-of-function (LoF) variants, 2567 missense variants (of which 991 predicted by SIFT²⁵ to be damaging and Polyphen-2²⁶ to be either possibly or probably damaging), and 1434 synonymous variants (Fig. 8 and Supplementary Data 6). 1781 of these were also reported in the Genome Aggregation Database²⁷ with nearly identical allelic frequencies (Supplementary Fig. 6). We designed a gene-based burden test to simultaneously evaluate hypothesis (i): all disruptive variants enriched in cases (when ϑ < 0; risk variants) or all disruptive variants enriched in controls (when ϑ > 0; protective variants), and hypothesis (ii): some disruptive variants enriched in cases and others in controls. Hypothesis (i) was tested with CAST²⁸, and hypothesis (ii) with SKAT²⁹ (Methods). We restricted the analysis to 1141 LoF and damaging missense variants with minor allele frequency (MAF) ≤0.005 to ensure that any new association signal would be independent of the signals from common and low frequency variants having led to the initial identification and fine-mapping of the corresponding loci⁴. For NOD2 (p = 6.9 × 10⁻⁷) and IL23R (p = 1.8 × 10⁻⁴), LoF and damaging variants were significantly enriched in respectively cases and controls as expected. When considering the 45 newly tested genes as a whole, we observed a significant (p = 6.9 × 10⁻⁴) shift towards lower p values when compared to expectation, while synonymous variants behaved as expected (p = 0.66) (Fig. 9 and Supplementary Data 7). This strongly suggests that the sequenced list includes causative genes. CARD9, TYK2, and FUT2 have recently been shown to be causative genes based on disease-associated low-frequency coding variants (MAF > 0.005)⁴. The shift towards lower p values remained significant without these (p = 1.7 × 10⁻³), pointing towards novel causative genes amongst the 42 remaining candidate genes.

Fig. 8 — Variants detected by sequencing the coding exons of 45 candidate genes. Variants are sorted in LoF (loss-of-function, i.e., stop gain, frame-shift, splice site), Damaging MS (missense variants considered as damaging by SIFT⁵ and damaging or possibly damaging by Polyphen-2⁶), Benign MS (other missense variants), and Synonymous. Blue: variants with MAF < 0.005, Red: variants with MAF ≥ 0.005

Fig. 9 — QQ-plot for the gene-based burden test. Ranked log(1/p) values obtained when considering LoF and damaging variants (full circles), or synonymous variants (empty circles). The circles are labeled in blue when the best p value for that gene is obtained with CAST, in red when the best p value is obtained with SKAT. The black line corresponds to the median log(1/p) value obtained (for the corresponding rank) using the same approach on permuted data (LoF and damaging variants). The gray line marks the upper limit of the 95% confidence band. The name of the genes with nominal p value ≤0.05 are given. Known causative genes are italicized. The inset p value corresponds to the significance of the upwards shift in log(1/p) values estimated by permutation

Proving gene causality requires larger case–control cohorts

Despite the significant shift towards lower p values when considering the 45 genes jointly, none of these were individually significant when accounting for multiple testing ( $p \leq \frac{0.05}{2 * 45} \approx 0.0006$ ) (Supplementary Data 7). Near identical results were obtained when classifying variants using the Combined Annotation Dependent Depletion (CADD) tool³⁰ instead of SIFT/PolyPhen-2 (Supplementary Data 7). We explored three approaches to increase the power of the burden test. The first built on the observation that cRM matching DAP are enriched in multigenic modules. This suggests that part of IBD risk loci harbor multiple co-regulated and hence functionally related genes, of which several (rather than one, as generally assumed) may be causally involved in disease predisposition. To test this hypothesis, we designed a module- rather than gene-based burden test (Methods). However, none of the 30 tested modules reached the experiment-wide significance threshold ( $p \leq \frac{0.05}{2 * 30} \approx 0.0008$ ). Moreover, the shift towards lower p values for the 30 modules was not more significant ( $p = 2.3 \times 1 0^{- 3}$ ) than for the gene-based test (Supplementary Fig. 7a and Supplementary Table 7). The second and third approaches derive from the common assumption that the heritability of disease predisposition may be larger in familial and early-onset cases³¹. We devised orthogonal tests for age-of-onset and familiality and combined them with our burden tests (Methods). Neither approach would improve the results (Supplementary Fig. 7b, c and Supplementary Data 7).

Assuming that TYK2 and CARD9 are truly causative and their effect sizes in our data unbiased, we estimated that a case–control cohort ranging from ~50,000 (TYK2) to ~200,000 (CARD9) individuals would have been needed to achieve experiment-wide significance (testing 45 candidate genes), and from ~78,000 (TYK2) to >500,000 (CARD9) individuals to achieve genome-wide significance (testing 20,000 genes) in the gene-based burden test (Supplementary Fig. 8).

Discussion

We herein describe a novel dataset comprising array-based transcriptome data for six circulating immune cell types and intestinal biopsies at three locations collected on ∼300 healthy European individuals. We use this “Correlated Expression and Disease Association Research” data set (CEDAR) to identify 23,650 significant cis-eQTL, which fall into 9720 regulatory modules of which at least ∼889 affect more than one gene in more than one tissue. We provide strong evidence that 63 of 200 known IBD GWAS signals reflect the activity of common regulatory variants that preferentially drive multigenic modules. We perform an exon-based burden test for 45 positional candidate CD genes mapping to 33 modules, in 5500 CD cases and 6500 controls. By demonstrating a significant ( $p = 6.9 \times 1 0^{- 4}$ ) upwards shift of log(1/p) values for damaging when compared to synonymous variants, we show that the sequenced genes include new causative CD genes.

Individually, none of the sequenced genes (other than the positive NOD2 and IL23R controls) exceed the experiment-wide significance threshold, precluding us from definitively pinpointing any novel causative genes. However, we note IL18R1 amongst the top-ranking genes (see also Supplementary Note 1). IL18R1 is the only gene in an otherwise relatively gene-poor region (also encompassing IL1R1 and IL18RAP) characterized by robust cis-eQTL in CD4 and CD8 that are strongly correlated with the DAP for CD and UC (0.68 ≤ |ϑ| ≤ 0.93). Reduced transcript levels of IL18R1 in these cell types is associated with increased risk for IBD. Accordingly, rare (MAF ≤ 0.005) damaging variants were cumulatively enriched in CD cases (CAST p = 0.05). The cumulative allelic frequency of rare damaging variants was found to be higher in familial CD cases (0.0027), when compared to non-familial CD cases (0.0016; p = 0.09) and controls (0.0010; p = 0.03). When ignoring carriers of deleterious NOD2 mutations, average age-of-onset was reduced by ∼3 years (25.3 vs 28.2 years) for carriers of rare damaging IL18R1 variants but this difference was not significant (p = 0.18).

While the identification of matching cRM for 63/200 DAP points towards a number of strong candidate causative genes, it leaves most risk loci without matching eQTL despite the analysis of nine disease-relevant cell types. This finding is in agreement with previous reports^4,32. It suggests that cis-eQTL underlying disease predisposition operate in cell types, cell states (for instance, resting vs activated) or developmental stages that were not explored in this and other studies. It calls for the enlargement and extension of eQTL studies to more diverse and granular cellular panels^10,33, possibly by including single-cell sequencing or spatial transcriptomic approaches. By performing eQTL studies in a cohort of healthy individuals, we have made the reasonable assumption that the common regulatory variants that are driving the majority of GWAS signals are acting before disease onset, including in individuals that will never develop the disease. An added advantage of studying a healthy cohort, is that the corresponding dataset is “generic”, usable for the study of perturbation of gene regulation for any common complex disease. However, it is conceivable that some eQTL underlying increased disease risk only manifest themselves once the disease process is initiated, for instance as a result of a modified inflammatory status. Thus, it may be useful to perform eQTL studies with samples collected from affected individuals to see in how far the eQTL landscape is affected by disease status.

One of the most striking results of this work is the observation that cRM that match DAP are ≥2-fold enriched in multi-genic modules. We cannot fully exclude that this is due to ascertainment bias. As multi-genic modules tend to also be multi-tissue, multi-genic cRM matching a DAP in a non-explored disease-relevant cell type have a higher probability to be detected in the explored cell types than the equivalent monogenic (and hence more likely cell type specific) cRM. The alternative explanation is that cRM matching DAP are truly enriched in multi-genic cRM. It is tempting to surmise that loci harboring clusters of co-regulated, functionally related causative genes have a higher probability to be detected in GWAS, reflecting a relatively larger target space for causative mutations. We herein tested this hypothesis by applying a module rather than gene-based test. Although this did not appear to increase the power of the burden test in this work, it remains a valuable approach to explore in further studies. Supplementary Data 2 provides a list of >900 multigenic modules detected in this work that could be used in this context.

Although we re-sequenced the ORF of 45 carefully selected candidate genes in a total of 5500 CD cases and 6600 controls, none of the tested genes exceeded the experiment-wide threshold of significance. This is despite the fact that we used a one-sided, eQTL-informed test to potentially increase power. Established IBD causative genes used as positive control, NOD2 and IL23R, were positive indicating that the experiment was properly conducted. We were not able to improve the signal strength by considering information about regulatory modules, familiality or age-of-onset. We estimated that ≥10-fold larger sample sizes will be needed to achieve adequate power if using the same approach. Although challenging, these numbers are potentially within reach of international consortia for several common diseases including IBD.

It is conceivable that the organ-specificity of nearly all complex diseases (such as the digestive tract for IBD), reflects tissue-specific perturbation of broadly expressed causative genes that may fulfill diverse functions in different organs. If this is true, coding variants may not be the appropriate substrate to perform burden tests, as these will affect the gene across all tissues. In such instances, the disruptive variants of interest may be those perturbing tissue-specific gene switches. Also, it has recently been proposed that the extreme polygenic nature of common complex diseases may reflect the trans-effects of a large proportion of regulatory variants active in a given cell type on a limited number of core genes via perturbation of highly connected gene networks³⁴. Identifying rare regulatory variants is still challenging, however, as tissue-specific gene switches remain poorly cataloged, and the effect of variants on their function difficult to predict. The corresponding sequence space may also be limited in size, hence limiting power. Nevertheless, a reasonable start may be to re-sequence the regions surrounding common regulatory variants that have been fine-mapped at near single base pair resolution⁴.

In conclusion, we hereby provide to the scientific community a collection of ∼24,000 cis-eQTL in nine cell types that are highly relevant for the study of inflammatory and immune-mediated diseases, particularly of the intestinal tract. The CEDAR dataset advantageously complements existing eQTL datasets including GTEx^10,33. We propose a paradigm to rationally organize cis-eQTL effects in co-regulated clusters or regulatory modules. We identify ~100 candidate causative genes in 63 out of 200 analyzed risk loci, on the basis of correlated DAP and EAP. We have developed a web-based browser to share the ensuing results with the scientific community (http://cedar-web.giga.ulg.ac.be). The CEDAR website will imminently be extended to accommodate additional common complex disease for which GWAS data are publicly available. We show that the corresponding candidate genes are enriched in causative genes, however, that case–control cohorts larger than those used in this study (12,000 individuals) are required to formally demonstrate causality by means of presently available burden tests.

Methods

Sample collection in the CEDAR cohort

We collected peripheral blood as well as intestinal biopsies (ileum, transverse colon, rectum) from 323 healthy Europeans visiting the Academic Hospital of the University of Liège as part of a national screening campaign for colon cancer. Participants included 182 women and 141 men, averaging 56 years of age (range: 19-86). Enrolled individuals were not suffering any autoimmune or inflammatory disease and were not taking corticosteroids or non-steroid anti-inflammatory drugs (with the exception of low doses of aspirin to prevent thrombosis). We recorded birth date, weight, height, smoking history, declared ethnicity and hematological parameters (red blood cell count, platelet count, differential white blood cell count) for each individual. The experimental protocol was approved by the ethics committee of the University of Liège Academic Hospital. Informed consent was obtained prior to donation in agreement with the recommendations of the declaration of Helsinki for experiments involving human subjects. We refer to this cohort as CEDAR for Correlated Expression and Disease Association Research.

SNP genotyping and imputation

Total DNA was extracted from EDTA-collected peripheral blood using the MagAttract DNA blood Midi M48 Kit on a QIAcube robot (Qiagen). DNA concentrations were measured using the Quant-iT Picogreen ds DNA Reagents (Invitrogen). Individuals were genotyped for >700 K SNPs using Illumina’s Human OmniExpress BeadChips, an iScan system and the Genome Studio software following the guidelines of the manufacturer. We eliminated variants with call rate ≤0.95, deviating from Hardy–Weinberg equilibrium (p ≤ 10⁻⁴), or which were monomorphic. We confirmed European ancestry of all individuals by PCA using the HapMap population as reference. Using the real genotypes of 629,570 quality-controlled autosomal SNPs as anchors, we used the Sanger Imputation Services with the UK10K + 1000 Genomes Phase 3 Haplotype panels (https://imputation.sanger.ac.uk)^35–37 to impute genotypes at autosomal variants in our population. We eliminated indels, SNPs with MAF ≤ 0.05, deviating from Hardy-Weinberg equilibrium (p ≤ 10⁻³), and with low imputation quality (INFO ≤ 0.4), leaving 6,019,462 high quality SNPs for eQTL analysis.

Transcriptome analysis

Blood samples were kept on ice and treated within 1 h after collection as follows. EDTA-collected blood was layered on Ficoll-Paque PLUS (GE Healthcare) to isolate peripheral blood mononuclear cells by density gradient centrifugation. CD4+ T lymphocytes, CD8+ T lymphocytes, CD19+ B lymphocytes, CD14+ monocytes, and CD15+ granulocytes were isolated by positive selection using the MACS technology (Miltenyi Biotec). To isolate platelets, blood collected on acid-citrate-dextrose (ACD) anticoagulant was centrifuged at 150 g for 10 min. The platelet rich plasma (PRP) was collected, diluted twofold in ACD buffer and centrifuged at 800 × g for 10 min. The platelet pellet was resuspended in MACS buffer (Miltenyi Biotec) and platelets purified by negative selection using CD45 microbeads (Miltenyi Biotec). Intestinal biopsies were flash frozen in liquid nitrogen immediately after collection and kept at −80 °C until RNA extraction. Total RNA was extracted from the purified leucocyte populations and intestinal biopsies using the AllPrep Micro Kit and a QIAcube robot (Qiagen). For platelets, total RNA was extracted manually with the RNeasy Mini Kit (Qiagen). Whole genome expression data were generated using HT-12 Expression Beadchips following the instructions of the manufacturer (Illumina). Technical outliers were removed using controls recommended by Illumina and the Lumi package³⁸. We kept 29,464/47,323 autosomal probes (corresponding to 19,731 genes) mapped by Re-Annotator³⁹ to a single gene body with ≤2 mismatches and not spanning known variants with MAF>0.05. Within cell types, we only considered probes (i.e., “usable” probes) with detection p value ≤0.05 in ≥25% of the samples. Fluorescence intensities were Log₂ transformed and Robust Spline Normalized (RSN) with Lumi³⁸. Normalized expression data were corrected for sex, age, smoking status and Sentrix Id using ComBat from the SVA R library⁴⁰. We further corrected the ensuing residuals within tissue for the number of Principal Components (PC) that maximized the number of cis-eQTL with p ≤ 10⁻⁶ ⁴¹. Supplementary Table 2 summarizes the number of usable samples, probes and PC for each tissue type.

cis-eQTL analysis

cis-eQTL analyses were conducted with PLINK and using the expression levels precorrected for fixed effects and PC as described above (http://pngu.mgh.harvard.edu/purcell/plink/)⁴². Analyses were conducted under an additive model, i.e., assuming that the average expression level of heterozygotes is at the midpoint between alternate homozygotes. To identify cis-eQTL we tested all SNPs in a 2 Mb window centered around the probe (if “usable”). P values for individual SNPs were corrected for the multiple testing within the window by permutation (10,000 permutations). For each probe–tissue combination we kept the best (corrected) p value. Within each individual cell type, the ensuing list of corrected p values was used to compute the corresponding false discovery rates (FDR or q value). Supplementary Table 3 reports the number of cis-eQTL found in the nine analyzed cell types for different FDR thresholds (see also Supplementary Fig. 9).

Comparing EAP with ϑ to identify cis-regulatory modules

If the transcript levels of a given gene are influenced by the same regulatory variants (one or several) in two tissues, the corresponding EAP (i.e., the −log(p) values of association for the SNPs surrounding the gene) are expected to be similar. Likewise, if the transcript levels of different genes are influenced by the same regulatory variants in the same or in different tissues, the corresponding EAP are expected to be similar (cf. main text, Fig. 1). We devised a metric, ϑ, to quantify the similarity between EAP. If two EAP are similar, one can expect the corresponding –log(p) values to be positively correlated. One particularly wants the EAP peaks, i.e., the highest –log(p) values, to coincide in order to be convinced that the corresponding cis-eQTL are driven by the same regulatory variants. To quantify the similarity between EAP while emphasizing the peaks, we developed a weighted correlation. Imagine two vectors X and Y of –log(p) values for n SNPs surrounding the gene(s) of interest. Using the same nomenclature as in Fig. 1a, X could correspond to gene A in tissue 1, and Y to gene A in tissue 2, or X could correspond to gene A in tissue 1, and Y to gene B in tissue 2. We only consider for analysis, SNPs within 1 Mb of either gene (probe) and for which x_i and/or y_i is superior to 1.3 (i.e., p value <0.05) hence informative for at least one of the two cis-eQTL. Indeed, the majority of variants with –log(p) <1.3 (p > 0.05) for both EAP are by definition not associated with either trait. There is therefore no reason to expect that they could contribute useful information to the correlation metric: their ranking in terms of –log(p) values becomes more and more random as the –log(p) decreases. We define the weight to be given to each SNP in the correlation as:

w_{i} = {(MAX (\frac{x_{i}}{x_{MAX}}, \frac{y_{i}}{y_{MAX}}))}^{p}

The larger p, the more weight is given to the top SNPs. In this work, p was set at one.

The weighted correlation between the two EAP, r_w, is then computed as:

r_{w} = \frac{1}{\sum_{i = 1}^{n} w_{i}} \sum_{i = 1}^{n} w_{i} (\frac{x_{i} - \bar{x_{w}}}{σ_{x}^{w}}) (\frac{y_{i} - \bar{y_{w}}}{σ_{y}^{w}})

in which

\bar{x_{w}} = \frac{\sum_{i = 1}^{n} w_{i} \times x_{i}}{\sum_{i = 1}^{n} w_{i}}

\bar{y_{w}} = \frac{\sum_{i = 1}^{n} w_{i} \times y_{i}}{\sum_{i = 1}^{n} w_{i}}

σ_{x}^{w} = \sqrt{\frac{\sum_{i = 1}^{n} w_{i} \times {(x_{i} - \bar{x_{w}})}^{2}}{\sum_{i = 1}^{n} w_{i}}}

σ_{y}^{w} = \sqrt{\frac{\sum_{i = 1}^{n} w_{i} \times {(y_{i} - \bar{y_{w}})}^{2}}{\sum_{i = 1}^{n} w_{i}}}

The larger r_w, the larger the similarity between the EAP, particularly for their respective peak SNPs.

r_w ignores an important source of information. If two EAP are driven by the same regulatory variant, there should be consistency in the signs of the effects across SNPs in the region. We will refer to the effect of the “reference” allele of SNP i on the expression levels for the first and second cis-eQTL as $β_{i}^{X}$ and $β_{i}^{Y}$ . If the reference allele of the regulatory variant increases expression for both cis-eQTL, the $β_{i}^{X}$ and $β_{i}^{Y}$ 's for a SNPs in LD with the regulatory variant are expected to have the same sign (positive or negative depending on the sign of D for the considered SNP). If the reference allele of the regulatory variant increases expression for one cis-eQTL and decreases expression for the other, the $β_{i}^{X}$ and $β_{i}^{Y}$ 's for a SNPs in LD with the regulatory variant are expected to have opposite sign. We used this notion to develop a weighted and signed measure of correlation, r_ws. The approach was the same as for r_w, except that the values of y_i were multiplied by −1 if the signs of $β_{i}^{X}$ and $β_{i}^{Y}$ were opposite. r_ws is expected to be positive if the regulatory variant affects the expression of both cis-eQTL in the same direction and negative otherwise.

We finally combined r_w and r_ws in a single score referred to as ϑ, as follows:

ϑ = \frac{r_{ws}}{1 + e^{- k (r_{w} - T)}}

ϑ penalizes r_ws as a function of the value of r_w. The aim is to avoid considering EAP pairs with strong but negative r_w (which is often the case when the two EAP are driven by very distinct variants). The link function is a sigmoid-shaped logistic function with k as steepness parameter and T as sigmoid mid-point. In this work, we used a value of k of 30, and a value of T of 0.3 (Supplementary Fig. 10).

We first evaluated the distribution of ϑ for pairs of EAP driven by the same regulatory variants by studying 4,693 significant cis-eQTL (FDR < 0.05). For these, we repeatedly (100x) split our CEDAR population in two halves, performed the cis-eQTL analysis separately on both halves and computed ϑ for the ensuing EAP pairs. Supplementary Fig. 1 is showing the obtained results.

We then evaluated the distribution of ϑ for pairs of EAP driven by distinct regulatory variants in the same chromosomal region as follows. We considered 1207 significant cis-eQTL (mapping to the 200 IBD risk loci described above). For each one of these, we generated a set of 100 “matching” cis-eQTL effects in silico, sequentially considering 100 randomly selected SNPs (from the same locus) as causal. The in silico cis-eQTL were designed such that they would explain the same fraction of expression variance as the corresponding real cis-eQTL detected with PLINK (cfr. above). When performing cis-eQTL analysis under an additive model, PLINK estimates β₀ (i.e., the intercept), and β₁ (i.e., the slope of the regression), including for the top SNP. Assume that the expression level of the studied gene, Z, for individual i is z_i. Assume that the sample comprises n_T individuals in total, of which n₁₁ are of genotype “11”, n₁₂ of genotype “12”, and n₂₂ of genotype “22”, for the top cis-eQTL SNP. The total expression variance for gene Z equals:

σ_{T}^{2} = \frac{\sum_{i = 1}^{n_{T}} {(z_{i} - \bar{z_{T}})}^{2}}{n_{T} - 1}

The variance in expression level due to the cis-eQTL equals:

σ_{eQTL}^{2} = \frac{n_{11} {(β_{0} - \bar{z_{T}})}^{2} + n_{12} {(β_{0} + β_{1} - \bar{z_{T}})}^{2} + n_{22} {(β_{0} + 2 β_{1} - \bar{z_{T}})}^{2}}{n_{T}}

The heritability of expression due to the cis-eQTL, i.e., the fraction of the expression variance that is due to the cis-eQTL is therefore:

h_{eQTL}^{2} = \frac{σ_{eQTL}^{2}}{σ_{T}^{2}}

To simulate cis-eQTL explaining the same $h_{eQTL}^{2}$ as the real eQTL in the CEDAR dataset, we sequentially considered all SNPs in the region. Each one of these SNPs would be characterized by n₁₁ individuals of genotype “11”, n₁₂ of genotype “12”, and n₂₂ of genotype “22”, for a total of n_T genotyped individuals. We would arbitrarily set $\bar{z_{11}}$ , $\bar{z_{12}}, and \bar{z_{22}}$ at −1, 0, and +1. As a consequence, the variance due to this cis-eQTL equals:

σ_{eQTL}^{2} = \frac{n_{11} {(- 1 - \bar{z_{T}})}^{2} + n_{12} {(0 - \bar{z_{T}})}^{2} + n_{22} {(1 - \bar{z_{T}})}^{2}}{n_{T}}

in which $\bar{z_{T}} = (n_{22} - n_{11}) ∕ n_{T}$ .

Knowing $σ_{eQTL}^{2}$ and $h_{eQTL}^{2}$ , and knowing that

h_{eQTL}^{2} = \frac{σ_{eQTL}^{2}}{σ_{eQTL}^{2} + σ_{RES}^{2}}

the residual variance $σ_{RES}^{2}$ can be computed as

σ_{RES}^{2} = σ_{eQTL}^{2} (\frac{1}{h_{eQTL}^{2}} - 1)

Individual expression data for the corresponding cis-eQTL (for all individuals of the CEDAR data set) were hence sampled from the normal distribution

z_{i} ~ N (\bar{z_{x x}}, σ_{RES}^{2})

where $\bar{z_{x x}}$ is −1, 0, or +1 depending on the genotype of the individual (11, 12, or 22). We then performed cis-eQTL on the corresponding data set using PLINK, generating an in silico EAP. Real and in silico EAP were then compared using ϑ. Supplementary Fig. 1 shows the corresponding distribution of ϑ values for EAP driven by distinct regulatory variants.

The corresponding distributions of ϑ under H₁ and H₀ (Supplementary Fig. 1) show that ϑ discriminates very effectively between H₁ and H₀ especially for the most significant cis-eQTL. We chose a threshold of |ϑ| > 0.6 to cluster EAP in cis-acting regulatory elements or cRM (Fig. 2). In the experiment described above, this would yield a false positive rate of 0.05, and a false negative rate of 0.23. Clusters were visually examined as show in Supplementary Fig. 2. Twenty-nine edges connecting otherwise unlinked and yet tight clusters were manually removed.

Testing for an excess sharing of cRM between cell types

Assume that cell type 1 is part of n_1T cRM, including n₁₁ private cRM, n₁₂ cRM shared with cell type 2, n₁₃ cRM shared with cell type 3, …, and n₁₉ cRM shared with cell type 9. Note that $\sum_{i = 1}^{9} n_{1 i} \geq n_{1 T}$ , because cRM may include more than two cell types. Assume that $n_{1 S} = \sum_{i \neq 1}^{9} n_{1 i}$ is the sum of pair-wise sharing events for cell type 1. We computed, for each cell type $i \neq 1$ , the probability to observe ≥n_1i sharing events with cell type 1 assuming that the expected number (under the hypothesis of random assortment) is

n_{1 S} \times \frac{n_{i T}}{\sum_{j \neq 1}^{9} n_{j T}}

Pair-wise sharing events between tissue 1 and the eight other tissues were generated in silico under this model of random assortment (5000 simulations). The p value for n_1i was computed as the proportion of simulations that would yield values that would be as large or larger than n_1i. The same approach was used for the nine cell types. Thus, two p values of enrichment are obtained for each pair of cell types i and j, one using i as reference cell type, and the other using j as reference cell type. As can be seen from Fig. 5, the corresponding pairs of p values were always perfectly consistent.

We performed eight distinct analyses. In the first analysis, we only considered cRM involving no more than two tissues (i.e., unique for specific pairs of cell types). In subsequent analyses, we progressively included cRM with no more than three, four, …, and nine cell types.

Comparing EAP and DAP using ϑ

The approach used to cluster EAP in cRM was also used to assign DAP for Inflammatory Bowel Disease (IBD) to EAP-defined cRM. We studied 200 IBD risk loci identified in recent GWAS meta-analyses^2,3. The limits of the corresponding risk loci were as defined in the corresponding publications. We measured the similarity between DAP and EAP using the ϑ metric for all cis-eQTL mapping to the corresponding intervals (i.e., for all cis-eQTL for which the top SNP mapped within the interval). To compute the correlations between DAP and EAP we used all SNPs mapping to the disease interval with –log(p) value ≥1.3 either for DAP, EAP or both.

In addition to computing ϑ as described in section 5, we computed an empirical p value for ϑ using the approach (based on in silico generated cis-eQTL) described above to generate the locus-specific distribution of ϑ values for EAP driven by distinct regulatory variants. From this distribution, one can deduce the probability that a randomly generated EAP (explaining as much variance as the real tested EAP) and the DAP would by chance have a |ϑ| value that is as high or higher than the real EAP. The corresponding empirical p value accounts for the local LD structure between SNPs.

Evaluating the enrichment of DAP–EAP matching

To evaluate whether DAP matched EAP more often than expected by chance alone, we analyzed 97 IBD risk loci interrogated by the Immunochip, (i) in order to allow for convenient comparison with Huang et al.⁴, and (ii) because we needed extensively QC genotypes for the IIBDGC data to perform the enrichment analysis with the ϑ-based method (see hereafter). Within these 97 IBD risk loci, we focused on 63 regions affecting CD⁴, encompassing at least one significant eQTL, and for which the lead CD-associated SNP had MAF > 0.05. Indeed, eQTL analyses in the CEDAR dataset were restricted to SNPs with MAF > 0.05 (see above). We used three methods to evaluate whether the observed number of DAP–EAP matches were higher than expected by chance alone: naïve, frequentist and ϑ-based. Analyses were performed separately for the nine cell types.

In the “naïve” approach, DAP and EAP were assumed to match if the corresponding lead SNPs were in LD with $r^{2} \geq 0.8$ . This would yield $n_{N} \leq 63$ risk loci for which the DAP would match at least one EAP. To measure the statistical significance of $n_{N}$ , we sampled a SNP (MAF > 0.05) at random in each of the 63 risk loci, and counted the number of loci with at least one matching EAP. This “simulation” was repeated 1,000 times. The significance of n_N was measured as the proportion of simulations that would yield $\geq n_{N}$ matches.

The frequentist approach used the method described by Nica et al.¹¹. DAP and EAP were assumed to match if fitting the disease-associated lead SNP in the eQTL analysis caused a larger drop in $- \log (p)$ than 95% of the SNPs with MAF > 0.05 in the analyzed risk locus. This would yield $n_{F} \leq 63$ risk loci for which the DAP would match at least one EAP. To measure the statistical significance of n_F, we sampled a SNP (MAF > 0.05) at random in each of the 63 risk loci, and counted the number of loci with at least one matching EAP. This “simulation” was repeated 1000 times. The significance of n_F was measured as the proportion of simulations that would yield ≥n_F matches.

Finally, we used our ϑ-based approach in which DAP and EAP were assumed to match if |ϑ| > 0.6. This would yield n_ϑ ≤ 63 risk loci for which the DAP would match at least one EAP. To measure the statistical significance of n_ϑ we sampled a SNP (MAF > 0.05) at random in each of the 63 risk loci, and generated a DAP assuming that the corresponding SNPs were causal as follows.

Assume a cohort with n₁ cases and n₂ controls (for instance, the IIBDGC cohort). Assume a SNP with an allelic frequency of p in the cases + controls, an allelic frequency of $(p + d)$ in cases and $(p + δ)$ in controls.

One can easily show that:

δ = - d \frac{n_{1}}{n_{2}}

The odds ratio (OR) for that SNP equals:

OR = \frac{(p + d) (1 - p - δ)}{(p + δ) (1 - p - d)}

The ratio between the between-cohort (i.e., cases and controls) variance versus within-cohort variance (corresponding to an F test) can be shown to equal:

F = \frac{d^{2} (1 + \frac{n_{1}}{n_{2}})}{(1 + \frac{n_{2}}{n_{1}}) (p - p^{2}) - d^{2} (1 + \frac{n_{1}}{n_{2}})}

If we fix F based on the real top SNP in the IIBDGC data in a given GWAS identified risk loci, we can determine d (and hence δ using Equation 1) for the randomly selected SNP (that will become an “in silico causative variant”) with allelic frequency in (cases + controls) of p (different from the real top SNP), by solving

d = \frac{- β \pm \sqrt{β^{2} - 4 α γ}}{2 α}

where

α = (1 + \frac{n_{1}}{n_{2}}) (1 + F)

β = 0

γ = - (p - p^{2}) (1 + \frac{n_{2}}{n_{1}}) F

Once we know $(p + d)$ (i.e., the frequency of the SNP in cases), and hence $(p + δ)$ (i.e., the frequency of the SNP in controls), we can use Hardy–Weinberg to determine the frequency of the three genotypes in cases ( $p_{AA}^{IBD}, p_{AB}^{IBD}, p_{BB}^{IBD}$ ) and controls ( $p_{AA}^{CTR}, p_{AB}^{CTR}, p_{BB}^{CTR}$ ). We then create an in silico case–control cohort by sampling (with replacement) $n_{1} \times p_{AA}^{IBD}$ AA cases, $n_{1} \times p_{AB}^{IBD}$ AB cases, …, and $n_{2} \times p_{BB}^{CTR}$ BB controls from the individuals of the IIBDGC (without discriminating real case and control status). Association analysis of the corresponding data set in the chromosome region of interest generates DAP with max $- log (p)$ value similar to the real DAP. This “simulation” was repeated 1000 times. The significance of n_ϑ was measured as the proportion of simulations that would yield ≥n_ϑ matches.

Targeted exon resequencing in CD cases and controls

Genes for which EAP match the DAP tightly (high |ϑ| values) are strong candidate causal genes for the studied disease. In the case of IBD, we identified ~100 such genes (Table 1). Ultimate proof of causality can be obtained by demonstrating a differential burden of rare disruptive variants in cases and controls. Burden tests preferably focus on coding gene segments, in which disruptive variants are most effectively recognized. Analyses are restricted to rare variants to ensure independence from the GWAS signals.

To perform burden tests, we collected DNA samples from 7323 Crohn Disease (CD) cases and 6342 controls of European descent in France (cases: 1899—ctrls: 1731), the Netherlands (2002–1923) and, Belgium (3422–2688). The study protocols were approved by the institutional review board at each center involved with recruitment. Informed consent and permission to share the data were obtained from all subjects, in compliance with the guidelines specified by the recruiting center’s institutional review board.

During the course of this project, we selected 45 genes with high |ϑ| values for resequencing (Table 1). We designed primers to amplify all corresponding coding exons plus exon-intron boundaries corresponding to all transcripts reported in the CCDS release 15⁴³ (Supplementary Data 8). Following Momozawa et al.²⁴, the primers were merged in five pools to perform a first round of PCR amplification (25 cycles). We then added 8-bp barcodes and common adapters (for sequencing) to all PCR products by performing a second round of PCR amplification (4 cycles) using primers targeting shared 5′ overhangs introduced during the first PCR. The ensuing libraries were purified, quality controlled and sequenced (2 × 150-bp paired-end reads) on a HiSeq 2500 (Illumina) instrument. Sequence reads were sorted by individual using the barcodes, aligned to the human reference sequence (hg19) with the Burrows–Wheeler Aligner (ver. 0.7.12)⁴⁴, and further processed using Genome Analysis Toolkit (GATK, ver. 3.2-2)⁴⁵. We only considered individuals for further analyses if ≥95% of the target regions was covered by ≥20 sequence reads. Average sequence depth across individuals and target regions was 1060. We called variants for each individual separately using the UnifiedGenotyper and HaplotypeCaller of GATK, as well as VCMM (ver. 1.0.2)⁴⁶, and listed all variants detected by either method. Genotypes for all individuals were determined for each variant based on the ratio of reference and alternative alleles amongst sequence reads as determined by Samtools⁴⁷. Individuals were labeled homozygote reference, heterozygote, or homozygote derived when the alternative allele frequency was between 0 and 0.15, between 0.25 and 0.75, and between 0.85 and 1, respectively. If the alternative allele frequency was outside these ranges or a variant position was covered with <20 sequencing reads, the genotype was considered missing. We excluded variants with call rates <95% or variants that were not in Hardy-Weinberg equilibrium (P < 1 × 10⁻⁶). We excluded 281 individuals with ≥2 minor alleles at 23 variants selected to have a MAF ≤ 0.01 in non-Finnish Europeans and ≥0.10 in Africans or East-Asians in the Exome Aggregation Consortium²⁷.

In the end, we used 6597 cases and 5502 controls for further analyses, while 98.5% of the target regions on average was covered with 20 or more sequence reads.

Gene-based burden test

We first used SIFT²⁵ and Polyphen-2²⁶ to sort the 4175 variants identified by sequencing in four categories: (i) loss-of-function (LoF) or severe, corresponding to stop gain, stop loss, frameshift and splice-site variants, (ii) damaging, corresponding to missense variants predicted by SIFT to be damaging and Polyphen-2 to be possibly or probably damaging, (iii) benign, corresponding to the other missense variants, and (iv) synonymous. We performed the burden test using the LoF plus damaging variants, and used the synonymous variants as controls. We only considered variants with MAF (computed for the entire data set, i.e., cases plus controls) ≤0.005. We indeed showed in a previous fine-mapping study that all reported independent effects were driven by variants with MAF ≥0.01⁴. By doing so we ensure that the signals of the burden test are independent of previously reported association signals. Thus, 174 LoF, 991 damaging, and 1434 synonymous were ultimately used to perform burden tests.

Burden tests come in two main flavors. In the first, one assumes that disruptive variants will be enriched in either cases (i.e., disruptive variants increase risk) or in controls (i.e., disruptive variance decrease risk). In the second, one assumes that—for a given gene—some disruptive variants will be enriched in cases, while other may be enriched in controls (Supplementary Fig. 11). The first was implemented using CAST²⁸. To increase power, we exploited the DAP–EAP information to perform one-sided (rather than two-sided) tests. When ϑ < 0, we tested for an enrichment of disruptive variants in cases; when ϑ > 0, for an enrichment of disruptive variants in controls. P values were computed by phenotype permutation, i.e., shuffling case–control status. When applying this test on a gene-by-gene basis using synonymous variants (MAF > 0.005), the distribution of p values (QQ-plot) indicated that the CAST test was conservative ( $λ_{GC} = 0.51$ ) (Supplementary Fig. 12). The second kind of burden test was implemented with SKAT²⁹. It is noteworthy that SKAT ignores information from singletons (Supplementary Fig. 11). Just as for CAST, p values were computed by phenotype permutation, i.e., shuffling case–control status. When applying this test on a gene-by-gene basis using synonymous variants (MAF < 0.005), the distribution of p values (QQ-plot) indicated that the SKAT test is too permissive ( $λ_{GC} = 1.73$ ) (Supplementary Fig. 12). Consequently, gene-based p values obtained with SKAT were systematically GC corrected using this value of λ_GC. We performed the two kinds of analyses for each gene, as one doesn’t a priori know what hypothesis will match the reality best for a given gene.

We also extracted information from the distribution of p values (or −log(p) values) across the 45 analyzed genes. Even if individual genes do not yield –log(p) values that exceed the significance threshold (accounting for the number of analyzed genes and tests performed), the distribution of −log(p) values may significantly depart from expectations, indicating that the analyzed genes include at least some causative genes. This was done by taking for each gene, the best p value (whether obtained with CAST or SKAT) and then rank the genes by corresponding −log(p) value. The same was done for 10⁵ phenotype permutations, allowing us to examine the distribution of −log(p) values for given ranks and compute the corresponding medians and limits of the 95% confidence band, as well as to compute the probability that $- 2 \sum_{i = 1}^{45} \ln (p_{i})$ (Fisher’s equation to combine p values) equals or exceeds the observed. Our results show that there is a significant departure from expectation when analyzing the damaging variants (p = 6.9 × 10⁻⁴) but not when analyzing the synonymous variants (p = 0.66) supporting the presence of genuine causative genes amongst the analyzed list.

cRM-based burden test

The enrichment of multi-genic cRM in IBD risk loci suggests that risk loci may have more than one causative gene belonging to the same cRM. To capitalize on this hypothesis, we developed a cRM-based burden test. Gene-specific p values were combined within cRM using Fisher’s method. For each gene, we considered the best p value whether obtained with CAST or SKAT. Statistical significance was evaluated by phenotype permutation exactly as described for the gene-based burden test. By doing so we observed a departure from expectation when using the damaging variants (p = 2.3 × 10⁻³), but not when using the synonymous variants (p = 0.72).

Orthogonal tests for age-of-onset and familiality

It is commonly assumed that the heritability for common complex diseases is higher in familial and early onset cases³¹. To extract the corresponding information from our data in a manner that would be orthogonal to the gene- and module-based tests described above (i.e., the information about age-of-onset and familiality would be independent of these burden tests), we devised the following approach.

For age-of-onset, we summed the age-of-onset of the n_C cases carrying rare disruptive variants for the gene of interest. We then computed the probability that the sum of the age-of-onset of n_C randomly chosen cases was as different from the mean of age-of-onset as the observed one, yielding a gene-specific two-sided p_SKAT value. In addition, we used the eQTL information to generate gene-specific one-sided p_CAST values, corresponding to the probability that the sum of the age-of-onset of n_C randomly chosen cases was as low or lower than the observed one (for genes for which decrease in expression level as associated with increased risk), or to the probability that the sum of the age-of-onset of n_C randomly chosen cases was as high or higher than the observed one (for genes for which increase in expression level as associated with increased risk). These age-of-onset p values were then combined with the corresponding p values from the burden test (CAST with CAST, SKAT with SKAT) using Fisher’s method.

For familiality, we determined what fraction of the n_C cases carrying rare disruptive variants for the gene of interest were familial (affected first degree relative). We then computed the probability that the fraction of familial cases amongst n_C randomly chosen cases was as different from the overall proportion of familial cases, yielding a gene-specific two-sided p_SKAT value. In addition, we used the eQTL information to generate gene-specific one-sided p_CAST values, corresponding to the probability that the fraction of familial cases amongst n_C randomly chosen cases was as high or higher than the observed one (for genes for which decrease in expression level as associated with increased risk), or to the probability that the sum of the age-of-onset of n_C randomly chosen was as low or lower than the observed one (for genes for which increase in expression level as associated with increased risk). These familial p values were then combined with the corresponding p values from the burden test (CAST with CAST, SKAT with SKAT) using Fisher’s method.

Data availability

The complete CEDAR eQTL dataset can be downloaded from the Array Express website (https://www.ebi.ac.uk/arrayexpress/), accession numbers E-MTAB-6666 (genotypes) and E-MTAB-6667 (expression data). The data, preprocessed as described in Methods, can be downloaded from the CEDAR website (http://cedar-web.giga.ulg.ac.be).

Electronic supplementary material

Peer Review File^{(1.1MB, pdf)}

41467_2018_4365_MOESM2_ESM.pdf^{(281.3KB, pdf)}

Description of Additional Supplementary Files

Supplementary Data 1^{(2.7MB, xlsx)}

Supplementary Data 2^{(831.5KB, xlsx)}

Supplementary Data 3^{(103.5KB, xlsx)}

Supplementary Data 4^{(578.7KB, xlsx)}

Supplementary Data 5^{(21.8KB, xlsx)}

Supplementary Data 6^{(948.4KB, xlsx)}

Supplementary Data 7^{(39.4KB, xlsx)}

Supplementary Data 8^{(52.9KB, xlsx)}

Supplementary Information^{(3.6MB, pdf)}

Acknowledgements

This work was supported by grants to Michel Georges from WELBIO (CAUSIBD), BELSPO (BeMGI), and Horizon 2020 (SYSCID). Computational resources at ULg have been provided by GIGA and the Consortium des Équipements de Calcul Intensif (CÉCI), funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11. This work was conducted as part of the BioBank Japan Project supported by the Japan Agency for Medical Research and Development and by the Ministry of Education, Culture, Sports, Sciences and Technology of the Japanese government. The work of D.A. and I.A. was supported by Russian Ministry of Science and Education under 5-100 Excellence Programme. R.K. Weersma is supported by a VIDI grant (016.136.308) from the Netherlands Organisation for Scientific Research (NWO). DNA samples from the Dutch IBD cohort have been collected within the Parelsnoer Institute Project. This nationwide Parelsnoer Institute project is part of and funded by the Netherlands Federation of University Medical Centres and has received initial funding from the Dutch Government (from 2007 to 2011). The Parelsnoer Institute currently facilitates the uniform nationwide collection of information on and biomaterials of thirteen other diseases. We are grateful to N. Hakozaki, H. Iijima, N. Maki, and other staff of the Laboratory for Genotyping Development, RIKEN Center for the Integrative Medical Sciences. We thank Wouter Coppieters and the other members of the GIGA genomics platform for their support.

Author contributions

Y.M., J.D., and M.G. conceived experiments, generated data, analyzed data and wrote the manuscript. E.T., V.D., S.R., B.C., F.C., E.D., M.E., A.-S.G., C.L., R.M., M.M., and C.O. generated and analyzed data. I.A., D.A., Y.A., and M.G. conceived and generated the CEDAR website. L.A., G.B., F.H., M.L., B.O., M.J.P., A.E.v.d.M.-d.J., J.J.v.d.W., M.C.V., M.L., J.-P.H., R.K.W., M.D.V., D.F., S.V., M.K., and E.L. collected and provided samples. The IIBDGC provided association p-values in the 201 IBD risk loci (DAPs).

Competing interests

The authors declare no competing interests.

Footnotes

These authors contributed equally: Yukihide Momozawa, Julia Dmitrieva

^†A list of The International IBD Genetics Consortium members is provided below.

Electronic supplementary material

Supplementary Information accompanies this paper at 10.1038/s41467-018-04365-8.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Michel Georges, Email: michel.georges@ulg.ac.be.

The International IBD Genetics Consortium:

Clara Abraham, Jean-Paul Achkar, Tariq Ahmad, Ashwin N. Ananthakrishnan, Vibeke Andersen, Carl A. Anderson, Jane M. Andrews, Vito Annese, Guy Aumais, Leonard Baidoo, Robert N. Baldassano, Peter A. Bampton, Murray Barclay, Jeffrey C. Barrett, Theodore M. Bayless, Johannes Bethge, Alain Bitton, Gabrielle Boucher, Stephan Brand, Berenice Brandt, Steven R. Brant, Carsten Büning, Angela Chew, Judy H. Cho, Isabelle Cleynen, Ariella Cohain, Anthony Croft, Mark J. Daly, Mauro D’Amato, Silvio Danese, Dirk De Jong, Goda Denapiene, Lee A. Denson, Kathy L. Devaney, Olivier Dewit, Renata D’Inca, Marla Dubinsky, Richard H. Duerr, Cathryn Edwards, David Ellinghaus, Jonah Essers, Lynnette R. Ferguson, Eleonora A. Festen, Philip Fleshner, Tim Florin, Andre Franke, Karin Fransen, Richard Gearry, Christian Gieger, Jürgen Glas, Philippe Goyette, Todd Green, Anne M. Griffiths, Stephen L. Guthery, Hakon Hakonarson, Jonas Halfvarson, Katherine Hanigan, Talin Haritunians, Ailsa Hart, Chris Hawkey, Nicholas K. Hayward, Matija Hedl, Paul Henderson, Xinli Hu, Hailiang Huang, Ken Y. Hui, Marcin Imielinski, Andrew Ippoliti, Laimas Jonaitis, Luke Jostins, Tom H. Karlsen, Nicholas A. Kennedy, Mohammed Azam Khan, Gediminas Kiudelis, Krupa Krishnaprasad, Subra Kugathasan, Limas Kupcinskas, Anna Latiano, Debby Laukens, Ian C. Lawrance, James C. Lee, Charlie W. Lees, Marcis Leja, Johan Van Limbergen, Paolo Lionetti, Jimmy Z. Liu, Gillian Mahy, John Mansfield, Dunecan Massey, Christopher G. Mathew, Dermot P. B. McGovern, Raquel Milgrom, Mitja Mitrovic, Grant W. Montgomery, Craig Mowat, William Newman, Aylwin Ng, Siew C. Ng, Sok Meng Evelyn Ng, Susanna Nikolaus, Kaida Ning, Markus Nöthen, Ioannis Oikonomou, Orazio Palmieri, Miles Parkes, Anne Phillips, Cyriel Y. Ponsioen, Urõs Potocnik, Natalie J. Prescott, Deborah D. Proctor, Graham Radford-Smith, Jean-Francois Rahier, Soumya Raychaudhuri, Miguel Regueiro, Florian Rieder, John D. Rioux, Stephan Ripke, Rebecca Roberts, Richard K. Russell, Jeremy D. Sanderson, Miquel Sans, Jack Satsangi, Eric E. Schadt, Stefan Schreiber, Dominik Schulte, L. Philip Schumm, Regan Scott, Mark Seielstad, Yashoda Sharma, Mark S. Silverberg, Lisa A. Simms, Jurgita Skieceviciene, Sarah L. Spain, A. Hillary Steinhart, Joanne M. Stempak, Laura Stronati, Jurgita Sventoraityte, Stephan R. Targan, Kirstin M. Taylor, Anje ter Velde, Leif Torkvist, Mark Tremelling, Suzanne van Sommeren, Eric Vasiliauskas, Hein W. Verspaget, Thomas Walters, Kai Wang, Ming-Hsi Wang, Zhi Wei, David Whiteman, Cisca Wijmenga, David C. Wilson, Juliane Winkelmann, Ramnik J. Xavier, Bin Zhang, Clarence K. Zhang, Hu Zhang, Wei Zhang, Hongyu Zhao, and Zhen Z. Zhao

References

1.MacArthur J, et al. The new NHGRI-EBI catalog of published genome-wide association studies (GWAS Catalog) Nucleic Acids Res. 2017;45:D896–D901. doi: 10.1093/nar/gkw1133. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Jostins L, et al. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature. 2012;491:119–124. doi: 10.1038/nature11582. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Liu JZ, et al. Association analyses identify 38 susceptibility loci for IBD and highlight shared genetic risk across populations. Nat. Genet. 2015;47:979–986. doi: 10.1038/ng.3359. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Huang H, et al. Association mapping of IBD loci to single variant resolution. Nature. 2017;547:173–178. doi: 10.1038/nature22969. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Claussnitzer M, et al. FTO obesity variant circuitry and adipocyte browning in humans. N. Engl. J. Med. 2015;373:895–907. doi: 10.1056/NEJMoa1502214. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Hugot JP, et al. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn’s disease. Nature. 2001;411:599–603. doi: 10.1038/35079107. [DOI] [PubMed] [Google Scholar]
7.Hampe J, et al. A genome-wide association scan of nonsynonymous SNPs identifies a susceptibility variant for Crohn disease in ATG16L1. Nat. Genet. 2007;39:207–211. doi: 10.1038/ng1954. [DOI] [PubMed] [Google Scholar]
8.Momozawa Y, et al. Resequencing of positional candidates identifies low frequency IL23R coding variants protecting against inflammatory boiwel disease. Nat. Genet. 2011;43:43–47. doi: 10.1038/ng.733. [DOI] [PubMed] [Google Scholar]
9.Rivas MA, et al. Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nat. Genet. 2011;43:1066–1073. doi: 10.1038/ng.952. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.The GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017;550:204–213. doi: 10.1038/nature24277. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Nica AC, et al. Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations. PLoS Genet. 2010;6:e1000895. doi: 10.1371/journal.pgen.1000895. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Giambartolomei C, et al. Bayesian test for colocalisation between paiurs of genetic association studies using summary statistics. PLoS Genet. 2014;10:e1004383. doi: 10.1371/journal.pgen.1004383. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Zhu Z, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 2016;48:481–487. doi: 10.1038/ng.3538. [DOI] [PubMed] [Google Scholar]
14.The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Nicolae DL. Association tests for rare variants. Annu Rev. Genom. Hum. Genet. 2016;17:117–130. doi: 10.1146/annurev-genom-083115-022609. [DOI] [PubMed] [Google Scholar]
16.Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common disease-common variant … or not? Hum. Mol. Genet. 2002;11:2417–2423. doi: 10.1093/hmg/11.20.2417. [DOI] [PubMed] [Google Scholar]
17.McGregor AP, et al. Morphological evolution through multiple cis-regulatory mutations at a single gene. Nature. 2007;448:587–590. doi: 10.1038/nature05988. [DOI] [PubMed] [Google Scholar]
18.Mackay TF. Quantitative trait loci in Drosophila. Nat. Rev. Genet. 2001;2:11–20. doi: 10.1038/35047544. [DOI] [PubMed] [Google Scholar]
19.Yalcin B, et al. Genetic dissection of behavioral QTL shows that Rgs2 modulates anxiety in mice. Nat. Genet. 2004;36:1197–1202. doi: 10.1038/ng1450. [DOI] [PubMed] [Google Scholar]
20.Karim L, et al. Variants modulating the expression of a chromosome domain encompassing PLAG1 influence bovine stature. Nat. Genet. 2011;43:405–413. doi: 10.1038/ng.814. [DOI] [PubMed] [Google Scholar]
21.Steinmetz LM, et al. Dissecting the architecture of a QTL in yeast. Nature. 2002;416:326–330. doi: 10.1038/416326a. [DOI] [PubMed] [Google Scholar]
22.Khor B, Gardet A, Xavier R. Genetics and pathogenesis of inflammatory bowel disease. Nature. 2011;474:307–317. doi: 10.1038/nature10209. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Fuchsberger C, et al. The genetic architecture of type 2 diabetes. Nature. 2016;536:41–47. doi: 10.1038/nature18642. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Momozawa Y, et al. Low-frequency coding variants in CETP and CFB are associated with susceptibility of exudative age-related macular degeneration in the Japanese population. Hum. Mol. Genet. 2016;25:5027–5034. doi: 10.1093/hmg/ddw335. [DOI] [PubMed] [Google Scholar]
25.Kumar P, et al. Predicting the effects of coding non-synonymous variants on protein function using gthe SIFT algorithm. Nat. Protoc. 2009;4:1073–1081. doi: 10.1038/nprot.2009.86. [DOI] [PubMed] [Google Scholar]
26.Adzhubei IA, et al. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Lek M, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST) Mutat. Res. 2007;615:28–56. doi: 10.1016/j.mrfmmm.2006.09.003. [DOI] [PubMed] [Google Scholar]
29.Wu MC, et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Richardson TG, et al. A pathway-centric approach to rare variant association analysis. Eur. J. Hum. Genet. 2017;25:123–129. doi: 10.1038/ejhg.2016.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Imielinski M, et al. Common variants at five new loci associated with early-onset inflammatory bowel disease. Nat. Genet. 2009;41:1335–1340. doi: 10.1038/ng.489. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Chun S, et al. Limited statistical evidence for shared genetic effects of eQTLs and autoimmune-disease-associated loci in three major immune-cell types. Nat. Genet. 2017;4:600–605. doi: 10.1038/ng.3795. [DOI] [PMC free article] [PubMed] [Google Scholar]
33.The GTEx Consortium. The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348:648–660. doi: 10.1126/science.1262110. [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Boyle EA, et al. An expanded view of complex traits: from polygenic to omnigenic. Cell. 2017;169:1177–1186. doi: 10.1016/j.cell.2017.05.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
35.The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:668–674. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Huang J, et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nat. Commun. 2015;6:8111. doi: 10.1038/ncomms9111. [DOI] [PMC free article] [PubMed] [Google Scholar]
37.McCarthy, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643. [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Du P, et al. Lumi: a pipeline for processing illumine microarray. Bioinformatics. 2008;24:1547–1548. doi: 10.1093/bioinformatics/btn224. [DOI] [PubMed] [Google Scholar]
39.Arloth J, et al. Re-Annotator: annotation pipeline for microarray probe sequences. PLoS ONE. 2015;10:e0139516. doi: 10.1371/journal.pone.0139516. [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Johnson WE, et al. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–127. doi: 10.1093/biostatistics/kxj037. [DOI] [PubMed] [Google Scholar]
41.Fairfax BP, et al. Innate immune activity conditions the effect of regulatory variants upon monocyte gene expression. Science. 2014;343:1246949. doi: 10.1126/science.1246949. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Purcell S, et al. PLINK: a toolset for whole-genome association and population-based linkage analysis. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Farrell CM, et al. Current status and new features of the Consensus Coding Sequence database. Nucleic Acids Res. 2014;42:D865–72. doi: 10.1093/nar/gkt1059. [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
45.Depristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–8. doi: 10.1038/ng.806. [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Shigemizu D, et al. A practical method to detect SNVs and indels from whole genome and exome sequencing data. Sci. Rep. 2013;3:2161. doi: 10.1038/srep02161. [DOI] [PMC free article] [PubMed] [Google Scholar]
47.Li H, et al. The sequence alignment/map (SAM) format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
48.Whitehead Pavlides JM, et al. Predicting targets from integrative analyses of summary data from GWAS and eQTL studies for 28 human complex traits. Genome Med. 2016;8:84–90. doi: 10.1186/s13073-016-0338-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
49.De Lange KM, et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat. Genet. 2017;49:256–261. doi: 10.1038/ng.3760. [DOI] [PMC free article] [PubMed] [Google Scholar]
50.Hulur I, et al. Enrichment of inflammatory bowel disease and colorectal cancer risk variants in colon expression quantitative trait loci. BMC Genom. 2015;16:138–153. doi: 10.1186/s12864-015-1292-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
51.Libioulle C, et al. Novel Crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS Genet. 2007;3:e58. doi: 10.1371/journal.pgen.0030058. [DOI] [PMC free article] [PubMed] [Google Scholar]
52.Peltekova VD, et al. Functional variants of OCTN cation transporter genes are associated with Crohn disease. Nat. Genet. 2004;39:311–318. doi: 10.1038/ng1339. [DOI] [PubMed] [Google Scholar]
53.McCarroll SA, et al. Deletion polymorphism upstream of IRGM expression and Crohn’s disease. Nat. Genet. 2008;40:1107–112. doi: 10.1038/ng.215. [DOI] [PMC free article] [PubMed] [Google Scholar]
54.Gamazon ER, et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 2015;47:1091–1098. doi: 10.1038/ng.3367. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Peer Review File^{(1.1MB, pdf)}

41467_2018_4365_MOESM2_ESM.pdf^{(281.3KB, pdf)}

Description of Additional Supplementary Files

Supplementary Data 1^{(2.7MB, xlsx)}

Supplementary Data 2^{(831.5KB, xlsx)}

Supplementary Data 3^{(103.5KB, xlsx)}

Supplementary Data 4^{(578.7KB, xlsx)}

Supplementary Data 5^{(21.8KB, xlsx)}

Supplementary Data 6^{(948.4KB, xlsx)}

Supplementary Data 7^{(39.4KB, xlsx)}

Supplementary Data 8^{(52.9KB, xlsx)}

Supplementary Information^{(3.6MB, pdf)}

Data Availability Statement

[CR1] 1.MacArthur J, et al. The new NHGRI-EBI catalog of published genome-wide association studies (GWAS Catalog) Nucleic Acids Res. 2017;45:D896–D901. doi: 10.1093/nar/gkw1133. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.Jostins L, et al. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature. 2012;491:119–124. doi: 10.1038/nature11582. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] 3.Liu JZ, et al. Association analyses identify 38 susceptibility loci for IBD and highlight shared genetic risk across populations. Nat. Genet. 2015;47:979–986. doi: 10.1038/ng.3359. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Huang H, et al. Association mapping of IBD loci to single variant resolution. Nature. 2017;547:173–178. doi: 10.1038/nature22969. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Claussnitzer M, et al. FTO obesity variant circuitry and adipocyte browning in humans. N. Engl. J. Med. 2015;373:895–907. doi: 10.1056/NEJMoa1502214. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] 6.Hugot JP, et al. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn’s disease. Nature. 2001;411:599–603. doi: 10.1038/35079107. [DOI] [PubMed] [Google Scholar]

[CR7] 7.Hampe J, et al. A genome-wide association scan of nonsynonymous SNPs identifies a susceptibility variant for Crohn disease in ATG16L1. Nat. Genet. 2007;39:207–211. doi: 10.1038/ng1954. [DOI] [PubMed] [Google Scholar]

[CR8] 8.Momozawa Y, et al. Resequencing of positional candidates identifies low frequency IL23R coding variants protecting against inflammatory boiwel disease. Nat. Genet. 2011;43:43–47. doi: 10.1038/ng.733. [DOI] [PubMed] [Google Scholar]

[CR9] 9.Rivas MA, et al. Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nat. Genet. 2011;43:1066–1073. doi: 10.1038/ng.952. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.The GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017;550:204–213. doi: 10.1038/nature24277. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Nica AC, et al. Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations. PLoS Genet. 2010;6:e1000895. doi: 10.1371/journal.pgen.1000895. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Giambartolomei C, et al. Bayesian test for colocalisation between paiurs of genetic association studies using summary statistics. PLoS Genet. 2014;10:e1004383. doi: 10.1371/journal.pgen.1004383. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Zhu Z, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 2016;48:481–487. doi: 10.1038/ng.3538. [DOI] [PubMed] [Google Scholar]

[CR14] 14.The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] 15.Nicolae DL. Association tests for rare variants. Annu Rev. Genom. Hum. Genet. 2016;17:117–130. doi: 10.1146/annurev-genom-083115-022609. [DOI] [PubMed] [Google Scholar]

[CR16] 16.Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common disease-common variant … or not? Hum. Mol. Genet. 2002;11:2417–2423. doi: 10.1093/hmg/11.20.2417. [DOI] [PubMed] [Google Scholar]

[CR17] 17.McGregor AP, et al. Morphological evolution through multiple cis-regulatory mutations at a single gene. Nature. 2007;448:587–590. doi: 10.1038/nature05988. [DOI] [PubMed] [Google Scholar]

[CR18] 18.Mackay TF. Quantitative trait loci in Drosophila. Nat. Rev. Genet. 2001;2:11–20. doi: 10.1038/35047544. [DOI] [PubMed] [Google Scholar]

[CR19] 19.Yalcin B, et al. Genetic dissection of behavioral QTL shows that Rgs2 modulates anxiety in mice. Nat. Genet. 2004;36:1197–1202. doi: 10.1038/ng1450. [DOI] [PubMed] [Google Scholar]

[CR20] 20.Karim L, et al. Variants modulating the expression of a chromosome domain encompassing PLAG1 influence bovine stature. Nat. Genet. 2011;43:405–413. doi: 10.1038/ng.814. [DOI] [PubMed] [Google Scholar]

[CR21] 21.Steinmetz LM, et al. Dissecting the architecture of a QTL in yeast. Nature. 2002;416:326–330. doi: 10.1038/416326a. [DOI] [PubMed] [Google Scholar]

[CR22] 22.Khor B, Gardet A, Xavier R. Genetics and pathogenesis of inflammatory bowel disease. Nature. 2011;474:307–317. doi: 10.1038/nature10209. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR23] 23.Fuchsberger C, et al. The genetic architecture of type 2 diabetes. Nature. 2016;536:41–47. doi: 10.1038/nature18642. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR24] 24.Momozawa Y, et al. Low-frequency coding variants in CETP and CFB are associated with susceptibility of exudative age-related macular degeneration in the Japanese population. Hum. Mol. Genet. 2016;25:5027–5034. doi: 10.1093/hmg/ddw335. [DOI] [PubMed] [Google Scholar]

[CR25] 25.Kumar P, et al. Predicting the effects of coding non-synonymous variants on protein function using gthe SIFT algorithm. Nat. Protoc. 2009;4:1073–1081. doi: 10.1038/nprot.2009.86. [DOI] [PubMed] [Google Scholar]

[CR26] 26.Adzhubei IA, et al. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR27] 27.Lek M, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR28] 28.Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST) Mutat. Res. 2007;615:28–56. doi: 10.1016/j.mrfmmm.2006.09.003. [DOI] [PubMed] [Google Scholar]

[CR29] 29.Wu MC, et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR30] 30.Richardson TG, et al. A pathway-centric approach to rare variant association analysis. Eur. J. Hum. Genet. 2017;25:123–129. doi: 10.1038/ejhg.2016.113. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR31] 31.Imielinski M, et al. Common variants at five new loci associated with early-onset inflammatory bowel disease. Nat. Genet. 2009;41:1335–1340. doi: 10.1038/ng.489. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] 32.Chun S, et al. Limited statistical evidence for shared genetic effects of eQTLs and autoimmune-disease-associated loci in three major immune-cell types. Nat. Genet. 2017;4:600–605. doi: 10.1038/ng.3795. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR33] 33.The GTEx Consortium. The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348:648–660. doi: 10.1126/science.1262110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR34] 34.Boyle EA, et al. An expanded view of complex traits: from polygenic to omnigenic. Cell. 2017;169:1177–1186. doi: 10.1016/j.cell.2017.05.038. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR35] 35.The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:668–674. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR36] 36.Huang J, et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nat. Commun. 2015;6:8111. doi: 10.1038/ncomms9111. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR37] 37.McCarthy, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR38] 38.Du P, et al. Lumi: a pipeline for processing illumine microarray. Bioinformatics. 2008;24:1547–1548. doi: 10.1093/bioinformatics/btn224. [DOI] [PubMed] [Google Scholar]

[CR39] 39.Arloth J, et al. Re-Annotator: annotation pipeline for microarray probe sequences. PLoS ONE. 2015;10:e0139516. doi: 10.1371/journal.pone.0139516. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR40] 40.Johnson WE, et al. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–127. doi: 10.1093/biostatistics/kxj037. [DOI] [PubMed] [Google Scholar]

[CR41] 41.Fairfax BP, et al. Innate immune activity conditions the effect of regulatory variants upon monocyte gene expression. Science. 2014;343:1246949. doi: 10.1126/science.1246949. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR42] 42.Purcell S, et al. PLINK: a toolset for whole-genome association and population-based linkage analysis. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR43] 43.Farrell CM, et al. Current status and new features of the Consensus Coding Sequence database. Nucleic Acids Res. 2014;42:D865–72. doi: 10.1093/nar/gkt1059. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR44] 44.Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR45] 45.Depristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–8. doi: 10.1038/ng.806. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR46] 46.Shigemizu D, et al. A practical method to detect SNVs and indels from whole genome and exome sequencing data. Sci. Rep. 2013;3:2161. doi: 10.1038/srep02161. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR47] 47.Li H, et al. The sequence alignment/map (SAM) format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR48] 48.Whitehead Pavlides JM, et al. Predicting targets from integrative analyses of summary data from GWAS and eQTL studies for 28 human complex traits. Genome Med. 2016;8:84–90. doi: 10.1186/s13073-016-0338-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR49] 49.De Lange KM, et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat. Genet. 2017;49:256–261. doi: 10.1038/ng.3760. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR50] 50.Hulur I, et al. Enrichment of inflammatory bowel disease and colorectal cancer risk variants in colon expression quantitative trait loci. BMC Genom. 2015;16:138–153. doi: 10.1186/s12864-015-1292-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR51] 51.Libioulle C, et al. Novel Crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS Genet. 2007;3:e58. doi: 10.1371/journal.pgen.0030058. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR52] 52.Peltekova VD, et al. Functional variants of OCTN cation transporter genes are associated with Crohn disease. Nat. Genet. 2004;39:311–318. doi: 10.1038/ng1339. [DOI] [PubMed] [Google Scholar]

[CR53] 53.McCarroll SA, et al. Deletion polymorphism upstream of IRGM expression and Crohn’s disease. Nat. Genet. 2008;40:1107–112. doi: 10.1038/ng.215. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR54] 54.Gamazon ER, et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 2015;47:1091–1098. doi: 10.1038/ng.3367. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

IBD risk loci are enriched in multigenic regulatory modules encompassing putative causative genes

Yukihide Momozawa

Julia Dmitrieva

Emilie Théâtre

Valérie Deffontaine

Souad Rahmouni

Benoît Charloteaux

François Crins

Elisa Docampo

Mahmoud Elansary

Ann-Stephan Gori

Christelle Lecut

Rob Mariman

Myriam Mni

Cécile Oury

Ilya Altukhov

Dmitry Alexeev

Yuri Aulchenko

Leila Amininejad

Gerd Bouma

Frank Hoentjen

Mark Löwenberg

Bas Oldenburg

Marieke J Pierik

Andrea E vander Meulen-de Jong

C Janneke van der Woude

Marijn C Visschedijk

Mark Lathrop

Jean-Pierre Hugot

Rinse K Weersma

Martine De Vos

Denis Franchimont

Severine Vermeire

Michiaki Kubo

Edouard Louis

Michel Georges

Abstract

Introduction

Results

Clustering cis-eQTL into regulatory modules

Fig. 1.

Fig. 2.

Fig. 3.

Fig. 4.

Fig. 5.

cRM matching IBD association signals are often multigenic

Fig. 6.

Table 1.

Fig. 7.

DAP-matching cRM are enriched in causative genes for IBD

Fig. 8.

Fig. 9.

Proving gene causality requires larger case–control cohorts

Discussion

Methods

Sample collection in the CEDAR cohort

SNP genotyping and imputation

Transcriptome analysis

cis-eQTL analysis

Comparing EAP with ϑ to identify cis-regulatory modules

Testing for an excess sharing of cRM between cell types

Comparing EAP and DAP using ϑ

Evaluating the enrichment of DAP–EAP matching

Targeted exon resequencing in CD cases and controls

Gene-based burden test

cRM-based burden test

Orthogonal tests for age-of-onset and familiality

Data availability

Electronic supplementary material

Acknowledgements

Author contributions

Competing interests

Footnotes

Contributor Information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS