PLOS Genetics. 2025 Oct 16;21(10):e1011876. doi: 10.1371/journal.pgen.1011876

Rare diseases load through the study of a regional population

Élisa Michel 1,2, Claudia Moreau 1,2, Laurence Gagnon 1,2, Mylène Gagnon 1,2, Josianne Leblanc 3, Jessica Tardif 3, Lysanne Girard 4, Jean Mathieu 5,6, Cynthia Gagnon 5,6,7, Mathieu Desmeules 5,6,8, Jean-Denis Brisson 5,6,9, Luigi Bouchard 3,4,7, Simon L Girard 1,2,10,11,*
Editor: Jonathan Marchini 12
PMCID: PMC12530595  PMID: 41100445

Abstract

Rare genetic diseases impact many people worldwide and are challenging to diagnose. In this study, we introduce a novel regional population cohort approach to identify pathogenic variants causing Mendelian diseases that occur more frequently within specific populations and are of clinical interest for carrier testing. We utilized a cohort from Quebec, including the Saguenay–Lac-Saint-Jean region, which is known for its founder effect followed by a rapid expansion and higher frequency of certain pathogenic variants. By analyzing both their frequency and origin through shared identical-by-descent segments, we identified founder variants. We calculated and compared their frequency in individuals originating from the Saguenay–Lac-Saint-Jean and from other urban Quebec regions. We validated 38 previously reported variants as being more common due to the founder effect and population expansion. Additionally, we identified 42 unreported founder variants in Quebec or Saguenay–Lac-Saint-Jean, some with carrier rate estimates as high as 1/22. We also observed a greater deleterious mutational load for the studied variants in individuals from the Saguenay–Lac-Saint-Jean compared to other urban Quebec regions. These findings were brought to the clinic, where 12 pathogenic variants were detected in diagnosed patients. Five variants found in this study are responsible for very severe diseases and could be considered for inclusion in a carrier test for the Saguenay–Lac-Saint-Jean population. This study highlights the potential underestimation of rare disease prevalence and presents a population-based approach that could aid clinicians in their diagnostic efforts and in patient management.

Author summary

Rare genetic diseases present significant diagnostic challenges and impact individuals worldwide. In this study, we introduce an innovative regional population cohort approach to identify pathogenic variants that are more common in specific populations. We examined a cohort from Quebec, specifically the Saguenay–Lac-Saint-Jean region, known for its founder effect and subsequent population expansion. By analyzing both their frequency and recent origin, we identified 38 previously reported and 42 new founder variants, some with carrier rates as high as 1/22. Our analysis showed a higher deleterious mutational load in individuals from this region compared to other urban Quebec populations. Our findings were clinically validated, revealing 12 pathogenic variants in diagnosed patients. Five variants found in the present study are linked to severe diseases and could be incorporated into carrier screening for the Saguenay–Lac-Saint-Jean population. Our study underscores the importance of considering regional genetic variations in the diagnosis and management of rare diseases, offering a new framework for improving carrier testing and genetic counseling.

Introduction

Rare diseases are thought to collectively affect as much as 10% of the population [1]. There are more than 6,000 rare Mendelian diseases described in Orphanet. Diagnosis remains a significant challenge for patients living with a rare disease. Despite the growing accessibility of genome sequencing technologies in precision medicine efforts for rare disease diagnosis [2], these patients often experience a prolonged diagnostic odyssey due to insufficient knowledge about their specific condition and the diversity of symptoms observed for a given disease. It becomes increasingly important to improve the diagnostic yield of rare diseases and to shorten the diagnostic odyssey of patients [3]. Understanding population health disparities is an essential component of equitable precision health efforts.

In certain populations, the prevalence of some rare diseases may increase due to demographic events such as founder effects and population expansions [4]. This is the case in Quebec, a province in Canada, predominantly settled by people of French origin starting in the early 1600s [5]. The initial European founder effect was followed by subsequent regional founder effects, notably the well-characterized one observed in the Charlevoix and Saguenay–Lac-Saint-Jean (SLSJ) regions [6]. This, followed by a very rapid population expansion, resulted in many rare diseases that are more frequent in SLSJ than elsewhere in the world [7–10]. In SLSJ, most people are aware of the higher risk of transmission of some rare diseases and a carrier test is offered to the populations of Charlevoix, SLSJ and Haute-Côte-Nord for six of these diseases [7,8]. Nevertheless, numerous rare diseases still lack a known genetic etiology, and diverse manifestations of diseases across patients further complicate clinical diagnosis. Traditionally, diseases with higher frequency in specific populations have been analyzed using a bottom-up approach, starting with the phenotypes of patients observed in clinical settings and linking them to genes or variants. Medical geneticists and the healthcare system would often gain valuable insights from a comprehensive overview of variants that are more frequent in the population and potentially associated with rare diseases. This study focuses on addressing this need.

More specifically, we aimed to describe pathogenic variants that have an increased frequency in SLSJ due either to the founder effect and expansion or simply due to many introductions in the population. We conducted a comprehensive screening to identify pathogenic variants with higher frequency in SLSJ. Since the SLSJ population has been extensively studied over the past 40 years, we expected to identify many previously reported variants, thereby validating our findings. Importantly, the SLSJ healthcare system features a single entry point for all residents, which simplifies the process of locating patients with newly identified pathogenic variants.

Furthermore, we report for the first time the load of rare variants in a single population and assess how the founder effect and expansion were pivotal in increasing that load. In the context of rare diseases, a large number of populations remain poorly characterized, and we believe that our study highlights the need for regional genetic programs to better understand and diagnose the variety of rare diseases affecting populations.

Results

Rare pathogenic variants in the Quebec Province

We detected 9,043 rare pathogenic variants present either in individuals from the Quebec Province (QcP) or in gnomAD non-Finnish Europeans (nfe). 93% of these variants were present in gnomAD nfe, 16% in the QcP and only 10% in the SLSJ region (see methods for details on QcP regional clustering). Only two variants with a MAF greater than 0.005 are at least twice as frequent in the QcP as in gnomAD (in orange), whereas five such variants are at least twice as frequent in gnomAD compared to the QcP (in purple; Fig 1A). In contrast, 40 variants were more frequent in the SLSJ compared to gnomAD (Fig 1B). Considering only the variants present in the QcP, 25% are absent from the urban Quebec region (UQc; mean number of absent variants after 1,000 resamplings of 3,589 individuals to match the SLSJ sample size, see methods) and 43% are absent from the SLSJ (chi2 p < 0.001). Unsurprisingly, we observed a lower proportion of individuals from the SLSJ that do not carry any pathogenic variant (chi2 p < 0.001) compared to UQc resampling (Fig 2). Hence, there are fewer pathogenic variants in the QcP compared to gnomAD, and even fewer in the SLSJ region compared to UQc. However, there is a greater proportion of pathogenic variants at higher frequencies in the SLSJ than in UQc, QcP or gnomAD.

Fig 1. Frequencies of 9,043 rare pathogenic variants in gnomAD nfe compared to A) QcP and B) SLSJ.

Only imputed variants also present in WGS were used.

Fig 2. Proportion of individuals carrying rare pathogenic variants.

The bars represent the average proportions obtained from 1,000 resamplings of 3,589 individuals from UQc, adjusted to match the sample size of the SLSJ. Error bars indicate the 95% confidence intervals. The numbers displayed above each bar correspond to the actual observed counts across all individuals. Only imputed variants also present in WGS were used. The inset shows the same analysis for variants with a ClinVar review status greater than 1 gold star.

Previously reported and newly discovered variants

Seventy-two variants were previously published in literature reviews on the Charlevoix-SLSJ founder effect [7–10]. Of these, 42 were present in our data (S1 Table). Table A in S1 Text provides details on the 30 previously reported variants that were either absent in our data or not considered in our analysis. Notably, there is a strong correlation between the carrier rates (CR) previously reported and the ones calculated herein (Fig A in S1 Text). Moreover, nine carrier rates were assessed independently in the CIUSSS laboratory, and eight of the newly calculated rates do not differ (95% confidence interval) from those reported in this study (Table B and Supplementary Methods in S1 Text).

In the absence of a clear definition for founder variants, we propose one here: a variant must be more frequent in the QcP than in gnomAD (relative frequency difference (RFD) ≥ 10%, see methods), have a CR of at least 1/200, and at least 50% of the pairs of carriers must share a segment identical-by-descent (IBD) around the variant’s position (see methods). Among the 1,302 rare pathogenic variants with RFD ≥ 10%, 80 met all criteria and are considered as founder variants either in the QcP, UQc or SLSJ, regardless of whether they were documented or not in the four aforementioned reviews [7–10]. Among these 80 founder variants, 38 were already documented in the four reviews [7–10] or in case reports (Table C in S1 Text), whereas 42 were never reported in the Quebec population (Tables 1 and S1).
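The three criteria of this definition can be written as a small filter. The following is a minimal sketch, not the authors' pipeline: the function names, the input format for IBD sharing (a set of carrier pairs) and the use of a precomputed RFD value are all assumptions made for illustration.

```python
from itertools import combinations

# Thresholds taken from the definition in the text.
MIN_RFD = 0.10                # relative frequency difference of at least 10%
MIN_CARRIER_RATE = 1 / 200    # carrier rate of at least 1/200
MIN_IBD_PAIR_FRACTION = 0.50  # at least 50% of carrier pairs share IBD


def ibd_pair_fraction(carriers, ibd_pairs):
    """Fraction of carrier pairs sharing an IBD segment around the variant.

    `carriers` is a list of sample IDs. `ibd_pairs` is a set of frozensets,
    one per pair of carriers sharing a segment (hypothetical input format).
    """
    pairs = list(combinations(carriers, 2))
    if not pairs:
        return 0.0
    shared = sum(1 for a, b in pairs if frozenset((a, b)) in ibd_pairs)
    return shared / len(pairs)


def is_founder_variant(rfd, carriers, n_individuals, ibd_pairs):
    """Apply the three criteria: RFD, carrier rate and IBD sharing."""
    carrier_rate = len(carriers) / n_individuals
    return (
        rfd >= MIN_RFD
        and carrier_rate >= MIN_CARRIER_RATE
        and ibd_pair_fraction(carriers, ibd_pairs) >= MIN_IBD_PAIR_FRACTION
    )
```

For example, three carriers among 300 individuals, with two of the three possible carrier pairs sharing an IBD segment, would pass all three thresholds provided the RFD is at least 0.10.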

Table 1. Novel founder variants found in this study.

Sample sizes: QcP (imputed: 25,061; WGS: 1,852); UQc (imputed: 21,472; WGS: 1,538); SLSJ (imputed: 3,589; WGS: 314).
Inheritance | Gene | Nucleotide | Disease name (ClinVar ID) | Data type | QcP count | QcP CR | UQc count | UQc CR | SLSJ count | SLSJ CR
AD/AR | DNAH8 | c.8635_8636del | Primary ciliary dyskinesia (2037549) | Imputed data | 224 | 1/115 | 55 | 1/390 | 169 | 1/22
AR | CNGA1 | c.947C > T | Retinitis pigmentosa 49 (16932) | Imputed data | 217 | 1/119 | 78 | 1/275 | 139 | 1/27
AR | CTU2 | c.881C > A | Dysmorphic facies, renal agenesis, ambiguous genitalia, microcephaly, polydactyly, and lissencephaly (2067774) | Imputed data | 194 | 1/129 | 81 | 1/265 | 113 | 1/32
AR | TMEM107 | c.*759C > T | Leukoencephalopathy with calcifications and cysts (265788) | Imputed data | 195 | 1/129 | 125 | 1/172 | 70 | 1/51
AR | ENPP1 | c.583T > C | ENPP1-related disorder (2580630) | WGS | 7 | 1/309 | 1 | 1/1,538 | 6 | 1/52
AD/AR | RGS9 | c.895T > C | Leber congenital amaurosis (5862) | Imputed data | 112 | 1/224 | 54 | 1/398 | 58 | 1/62
AR | TRIOBP | c.1933C > T | Autosomal recessive nonsyndromic hearing loss 28 (620162) | Imputed data | 112 | 1/224 | 64 | 1/336 | 48 | 1/75
AR | UROS | c.217T > C | Cutaneous porphyria (3750) | Imputed data | 84 | 1/298 | 36 | 1/596 | 48 | 1/75
AR | ASPM | c.8191_8192del | Microcephaly 5, primary, autosomal recessive (21613) | Imputed data | 79 | 1/317 | 32 | 1/671 | 47 | 1/76
AR | PYGM | c.148C > T | Glycogen storage disease, type V (2298) | Imputed data | 151 | 1/166 | 109 | 1/197 | 42 | 1/85
AR | CEP290 | c.7220_7223del | Meckel syndrome, type 4; Bardet-Biedl syndrome 14 (418123) | Imputed data | 70 | 1/358 | 30 | 1/716 | 40 | 1/90
AR | DONSON | c.1047-9A > G | Microcephaly, short stature, and limb abnormalities (431414) | Imputed data | 51 | 1/491 | 12 | 1/1,789 | 39 | 1/92
AD | PKD1 | c.9829C > T | Polycystic kidney disease, adult type (192320) | Imputed data | 86 | 1/291 | 47 | 1/457 | 39 | 1/92
AR | ETFA | c.495_496del | Multiple acyl-CoA dehydrogenase deficiency (459956) | Imputed data | 96 | 1/261 | 61 | 1/352 | 35 | 1/103
AR | DNAH9 | c.1733del | DNAH9-related disorder (3013954) | Imputed data | 54 | 1/464 | 20 | 1/1,074 | 34 | 1/106
AR | MOCOS | c.2326C > T | Xanthinuria type II (253162) | Imputed data | 57 | 1/456 | 24 | 1/895 | 33 | 1/116
AR | CCDC40 | c.961C > T | Primary ciliary dyskinesia 15 (216118) | Imputed data | 35 | 1/716 | 7 | 1/3,067 | 28 | 1/128
AR | SLC26A4 | c.1001 + 1G > A | Pendred syndrome (4819) | Imputed data | 52 | 1/482 | 25 | 1/859 | 27 | 1/133
AD/AR | EIF2AK4 | c.1153dup | Familial pulmonary capillary hemangiomatosis (101527) | Imputed data | 34 | 1/737 | 8 | 1/2,684 | 26 | 1/138
AR | TSHB | c.373del | Isolated thyroid-stimulating hormone deficiency (437070) | Imputed data | 40 | 1/627 | 14 | 1/1,534 | 26 | 1/138
AR | DYNC2I2 | c.1312_1313del | Short-rib thoracic dysplasia 11 with or without polydactyly (665979) | Imputed data | 183 | 1/137 | 158 | 1/136 | 25 | 1/144
AR | PHKB | c.1257T > A | Glycogen storage disease IXb (13620) | Imputed data | 30 | 1/835 | 6 | 1/3,579 | 24 | 1/150
Unknown | CDK5RAP2 | c.2202 + 1G > A | not provided (1066422) | Imputed data | 100 | 1/251 | 77 | 1/279 | 23 | 1/156
AD/AR | KCNJ1 | c.472G > A | Bartter syndrome (2506156) | Imputed data | 36 | 1/696 | 13 | 1/1,652 | 23 | 1/156
AR | ASAH1 | c.410A > G | Spinal muscular atrophy-progressive myoclonic epilepsy syndrome (375548) | Imputed data | 95 | 1/267 | 73 | 1/298 | 22 | 1/163
Unknown | DCAF6 | c.2240G > A | Cerebral visual impairment and intellectual disability (224814) | Imputed data | 83 | 1/302 | 61 | 1/352 | 22 | 1/163
AR | PKHD1 | c.6793C > T | Autosomal recessive polycystic kidney disease (1946278) | Imputed data | 129 | 1/194 | 107 | 1/201 | 22 | 1/163
AR | TYR | c.572del | Tyrosinase-negative oculocutaneous albinism (99570) | Imputed data | 63 | 1/411 | 41 | 1/551 | 22 | 1/163
AR | ALMS1 | c.11648_11649insGTTA | Alstrom syndrome (550627) | Imputed data | 93 | 1/269 | 72 | 1/298 | 21 | 1/171
AR | ERCC2 | c.2164C > T | Cerebrooculofacioskeletal syndrome 2 (16792) | Imputed data | 24 | 1/1,044 | 3 | 1/7,157 | 21 | 1/171
Unknown | RAD50 | c.3779del | Hereditary cancer-predisposing syndrome (185537) | Imputed data | 27 | 1/928 | 6 | 1/3,579 | 21 | 1/171
AR | SLC45A2 | c.264del | Oculocutaneous albinism type 4 (242518) | Imputed data | 77 | 1/325 | 56 | 1/383 | 21 | 1/171
AR | RMRP | n.71A > G | Metaphyseal chondrodysplasia, McKusick type (14208) | Imputed data | 125 | 1/200 | 105 | 1/204 | 20 | 1/179
AD | CHEK2 | c.247del | Hereditary cancer-predisposing syndrome (142851) | Imputed data | 29 | 1/864 | 10 | 1/2,147 | 19 | 1/189
AR | NPHS1 | c.2071 + 2T > C | Finnish congenital nephrotic syndrome (56460) | Imputed data | 38 | 1/660 | 19 | 1/1,130 | 19 | 1/189
Unknown | PKLR | c.1091G > A | PKLR-related disorder (1456959) | Imputed data | 97 | 1/258 | 78 | 1/275 | 19 | 1/189
AR | ACY1 | c.575dup | Aminoacylase 1 deficiency (800812) | Imputed data | 58 | 1/448 | 40 | 1/565 | 18 | 1/199
AD/AR | CAPN3 | c.2115 + 1G > A | Autosomal recessive limb-girdle muscular dystrophy type 2A (555599) | Imputed data | 26 | 1/964 | 8 | 1/2,684 | 18 | 1/199
AR | GMPPB | c.79G > C | Autosomal recessive limb-girdle muscular dystrophy type 2T (60546) | Imputed data | 47 | 1/533 | 29 | 1/740 | 18 | 1/199
AR | NDUFV1 | c.1162 + 4A > C | Mitochondrial complex I deficiency, nuclear type 1 (372716) | Imputed data | 52 | 1/482 | 34 | 1/632 | 18 | 1/199
AR | RSPH3 | c.859 + 1G > T | Primary ciliary dyskinesia 32 (2980542) | Imputed data | 35 | 1/716 | 17 | 1/1,263 | 18 | 1/199
Unknown | LARS1 | c.2500A > T* | not specified (3117894) | Imputed data | 124 | 1/202 | 109 | 1/197 | 15 | 1/239

AD: Autosomal dominant, AR: Autosomal recessive, SLSJ: Saguenay–Lac-Saint-Jean, UQc: Urban Quebec regions, QcP: Quebec Province, WGS: Whole-genome sequencing, CR: Carrier rate. * Founder variant only in UQc.

Founder variants’ regional carrier rates and individuals’ mutation load

We then compared the carrier rates of founder variants between the SLSJ and UQc (Fig 3). Most of the already reported founder variants are at higher CR than the newly identified ones, but some of the latter are as high as 1/22 in the SLSJ (Table 1). Carrier rates are generally higher in the SLSJ compared to the UQc. Specifically, the count of founder variants with carrier rates greater than 1/200 is eight times higher in SLSJ than in the UQc (three times higher when considering all variants with an RFD ≥ 10% regardless of whether they are founder variants) despite the lower sample size in SLSJ (Fig 4). Only 16 variants were at a higher CR in UQc than in the SLSJ among the 1,302 variants with RFD ≥ 10% despite the much greater sample size in UQc. Consequently, the number of individuals who carry at least one pathogenic founder variant is higher in the SLSJ than in the UQc (chi2 p < 0.001) (Fig 5). In fact, for the variants already reported in the literature, 50% of the SLSJ and only 11% of the UQc individuals carry at least one variant. Notably, when the newly identified variants are added, these percentages reach 66% and 18%, respectively (chi2 p < 0.001).
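As an illustration of how a carrier rate of the form 1/N relates to a carrier count, the sketch below assumes the rate is simply the number of carriers divided by the number of individuals in the group. This section does not spell out the computation, and some Table 1 rows with larger counts (for example, a count of 169 in SLSJ listed as 1/22) do not reproduce exactly under this assumption, so the authors' calculation may differ. The function names are hypothetical.

```python
def carrier_rate_denominator(n_carriers, n_individuals):
    """Return N such that the carrier rate is approximately 1/N."""
    if n_carriers <= 0:
        raise ValueError("carrier rate cannot be expressed as 1/N without carriers")
    return round(n_individuals / n_carriers)


def format_carrier_rate(n_carriers, n_individuals):
    """Format the carrier rate as '1/N' with a thousands separator."""
    return f"1/{carrier_rate_denominator(n_carriers, n_individuals):,}"


# 39 carriers among the 3,589 imputed SLSJ individuals.
print(format_carrier_rate(39, 3589))
# 3 carriers among the 21,472 imputed UQc individuals.
print(format_carrier_rate(3, 21472))
```

Under this assumption, the two calls give 1/92 and 1/7,157, which agree with the corresponding Table 1 entries for counts of 39 (SLSJ) and 3 (UQc).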

Fig 3. Carrier rates for founder variants.

Only variants classified as founders in SLSJ, UQc or QcP are shown here (80 variants).

Fig 4. Number of variants according to their carrier rate.

Only the CR from the imputed data was used. Note that the 78 founder variants shown in dark colors may originate from either group, which explains why some exhibit a CR below 1/200, indicating they are founder exclusively in the other group.

Fig 5. Proportion of individuals carrying founder variants in A) UQc and B) SLSJ.

Only rare variants classified as founders in SLSJ or UQc or QcP in imputed data are shown here (78 variants). The bars represent the average proportions obtained from 1,000 resamplings of 3,589 individuals from UQc (A), adjusted to match the sample size of the SLSJ. Error bars indicate the 95% confidence intervals. The numbers displayed above each bar correspond to the actual observed counts across all individuals. The inset shows the same analysis for variants with a ClinVar review status greater than 1 gold star.

SLSJ patients carrying newly identified founder variants

To confirm that our population-based method detects variants associated with diseases that are found in the SLSJ population, we asked clinical experts to search their databases for founder variants with CR above 1/200 (Table 1) that segregate within families of patients presenting the corresponding phenotype. Table 2 presents variants found through different clinical panels in diagnosed patients from the Medical Genetics service (around 15,000 patients screened) and the Clinique des maladies neuromusculaires (CMNM; 45 medical records screened) of the CIUSSS of the SLSJ. We also had access to self-reported CARTaGENE phenotypes (partial phenotyping for 30,000 participants). Of note, three of the variants identified in Table 2 (UROS c.217T > C, ETFA c.495_496del and CC2D2A c.4667A > T), in addition to two other variants not found in a homozygous state in the clinics (CEP290 c.7220_7223del and SGO1 c.67A > G), would be good candidates to include in an ongoing effort to design a new carrier test for the SLSJ population in the Medical Genetics service [11]. A variant becomes of interest to the Medical Genetics service when it causes a severe disease without a curative treatment and it would be considered appropriate to offer prenatal diagnosis with the possibility of medical termination of pregnancy. The variant must also be more frequent in the region, which is where our population-based approach proves valuable. It offers insight into the frequency and origin of the variant, even in the absence of clinical data.

Table 2. Variants found in patients with corresponding phenotypes.

Inheritance | Gene | Nucleotide | Phenotype | Heterozygotes | Homozygotes | Clinic
AD | PRPH2 | c.554T > C | Retinitis pigmentosa | 1 | 0 | Genetic
AR | CC2D2A* | c.4667A > T | Joubert syndrome | - | 1 | Genetic + CMNM
AR | PDZD7 | c.2107del | Hearing loss, autosomal recessive 57 | 4 (compound with c.2672AGA[1]) | 0 | Genetic
AR | EIF2AK4 | c.1153dup | Familial pulmonary capillary hemangiomatosis | - | 1 | Genetic
AR | SLC45A2 | c.264del | Oculocutaneous albinism type 4 | - | 1 | Genetic
AR | TYR | c.572del | Tyrosinase-negative oculocutaneous albinism | 2 (compound with c.1046G > C) | 1 | Genetic + CaG
AR | ALMS1 | c.11648_11649insGTTA | Alstrom syndrome | - | 1 | Genetic
AR | ETFA* | c.495_496del | Multiple acyl-CoA dehydrogenase deficiency | - | 1 | Genetic
AR | UROS* | c.217T > C | Cutaneous porphyria | 3 (compound with c.424C > T) | 0 | Genetic
AR | SLC26A4 | c.1001 + 1G > A | Pendred syndrome | - | 3 | Genetic
AD | CHEK2 | c.247del | Cancer | 10 | 0 | Genetic + CaG
AD | PKD1 | c.9829C > T | Polycystic kidney disease, adult type | 1 | 0 | CaG

Variants in gray were reported in Quebec, but not in SLSJ, while other variants were not reported, *: Considered for a new carrier test in the SLSJ, Heterozygotes: Number of heterozygous patients (for dominant diseases), Homozygotes: Number of homozygous patients (for recessive diseases), CaG: CARTaGENE phenotypes.

Discussion

In this study, we aimed to identify pathogenic variants found at higher frequency in the QcP and, more specifically, in the SLSJ region. We found 1,302 rare variants with RFD ≥ 10% in Quebec compared to gnomAD nfe. Among these, we identified 80 that met all criteria to be classified as founders, 38 of these being previously reported in the QcP within reviews or in case reports. Our study shows that establishing a systematic review of founder variants in a population is hard to conduct using only a literature review approach. This approach was done multiple times in the SLSJ population, and our results show that several founder variants with high carrier rates were missed. Moreover, there is currently no universally accepted definition of a founder variant in the literature. Commonly, a variant is considered to be associated with a founder effect when it is observed at elevated frequency within a genetically related group and can be traced to one or more common ancestors [4,12,13]. In our study, in addition to the frequency criteria, we assessed IBD sharing at the variant locus among carriers, which indicates that it was inherited from a recent common ancestor.

In addition to confirming known founder variants, we also report for the first time 42 novel founder variants that, to our knowledge, have never been documented in the QcP. Some of these exhibit a high carrier rate, comparable to the six diseases included in the carrier test offered to the population. These new variants could potentially account for unreported rises in disease prevalence within the population, which suggests a potential underestimation of the overall prevalence of rare diseases in the SLSJ region, as also reported in other populations with founder effects [14,15]. Indeed, adding the newly identified founder variants raises the proportion of individuals carrying at least one pathogenic founder variant by 1.3 and 1.7 times in the SLSJ and in the UQc, respectively. These proportions are similar when considering only ClinVar variants with stronger evidence of pathogenicity.

Recent results from the carrier testing currently offered to the population revealed that 116 couples identified as carriers of the same variant were provided with genetic counselling to discuss their reproductive options [11]. Given that the incidence of the four conditions currently included in the carrier test is approximately 1/2,000 births [16–19] and that around 2,000 births occur annually in the SLSJ region, genetic testing has the potential to significantly reduce disease incidence. Expanding the carrier screening panel to include additional conditions could greatly enhance public health outcomes. Indeed, we found patients carrying newly described founder variants who have been diagnosed with the corresponding disease in clinical databases. Five of the variants found in this study would be good candidates for inclusion in a carrier test, which would have the potential to detect more couples who are carriers of the same condition. Importantly, those five variants were not previously recognized as more prevalent in the region through clinical observation alone, highlighting the value of our population-based approach. Establishing carrier rates plays a critical role in advancing precision medicine and carrier testing among populations with a founder effect [14]. In addition, it serves as a proof of concept for larger precision medicine initiatives on carrier frequency panels in larger populations.

We demonstrate an underestimation of the number of pathogenic variant carriers in SLSJ, which has been the focus of numerous studies on rare genetic diseases linked to the founder effect and population expansion. This supports the hypothesis of a higher mutation load (which we define here as a rise in deleterious allele frequencies) following range expansions [20] and under certain demographic and dominance models [21]. Consequently, the number of individuals affected by a rare disease might be underestimated in many countries or local communities. Our population cohort approach could be applied to other populations worldwide at low cost, thus helping to enhance and accelerate the molecular diagnosis of patients.

The present study demonstrates the higher pathogenic mutational load in individuals from the SLSJ region compared to UQc, not only for founder variants, but also for all ClinVar variants found in the QcP. SLSJ individuals appear more likely to carry at least one variant, despite the greater loss of variants in the SLSJ compared to UQc. The higher mutation load in the SLSJ individuals is mainly caused by an overrepresentation of variants with a CR greater than 1/200. This is the result of the very rapid population expansion, five times greater than the one observed in Quebec as a whole for the same period [22]. Some founders in SLSJ contributed substantially to the present population [23] and therefore could have introduced an allele in the population that would reach such a high frequency [24]. Moreover, it was demonstrated that the first SLSJ settlers had an increased fitness [25], which could have contributed to increasing deleterious allele frequencies [26,27].

This study has certain limitations. Firstly, the sample size of the WGS data may be insufficient to accurately estimate the frequency of variants in the population. Therefore, we chose to work with imputed data, which includes a significantly larger number of individuals. To achieve the most accurate representation of our data, especially given our focus on rare variants, we performed imputation using a local WGS rather than a global worldwide reference panel. However, we acknowledge that imputed data may not be as reliable as WGS or genotyping. Therefore, we selected the ClinVar pathogenic variants exclusively in WGS, then we compared the WGS with the imputed data for the same individuals and excluded any unreliable imputed variants. We also excluded any individual whose cluster in WGS did not match the one in the imputed data clustering based on the UMAP. Also, our definition of a founder variant is stringent, with a CR of at least 1/200, especially for the SLSJ region, where the sample size is smaller. As a result, some less common but genuine founder variants might be missed, making this study somewhat conservative in the identification of founder variants.

These findings might be crucial for clinicians to shorten the patients’ diagnostic odyssey and reduce the economic burden associated with undiagnosed rare diseases. This could help improve the management of patients and, for some of them, enhance their quality of life as appropriate follow-up could be offered earlier. With this information, precision medicine can implement targeted genetic screening programs, allowing for early detection of inherited conditions that are more prevalent due to the particular genetic structure shaped by some demographic events, such as a founder effect or a population expansion. This enables tailored prevention strategies, personalized treatments, and risk-reduction measures that are specific to the genetics of the population. Ultimately, analyzing carrier rates in populations could help healthcare providers offer more precise and effective medical care, enhancing outcomes for both individuals and the community. Indeed, we identified 30 patients carrying 12 causal variants that have not been previously reported as more frequent in SLSJ. The underestimation of pathogenic mutational load might also happen in other populations as a result of range expansions and rare diseases might be much less rare than anticipated. In an era of precision medicine with at least 10% of the population affected by rare diseases, it is crucial to adopt new approaches to enhance and accelerate the molecular diagnosis of rare diseases.

Subjects and methods

Ethics statement

This population-based study on the CARTaGENE cohort was approved by the University of Quebec in Chicoutimi (UQAC) ethics board. The approval for the secondary use of anonymized samples coming from the provincial screening testing was obtained from the Centre intégré universitaire de santé et de services sociaux du Saguenay—Lac-Saint-Jean of the SLSJ Direction of professional services. Written informed consent for the use of saliva samples for genetic testing was obtained from participants. The search of the clinical database for patients carrying the newly identified variants was approved by the Centre intégré universitaire de santé et de services sociaux du Saguenay—Lac-Saint-Jean ethics board.

Cohort

The CARTaGENE cohort [28] (https://cartagene.qc.ca/) used in this study includes WGS of 2,184 and genotyping of 29,353 participants. Individuals aged between 40 and 69, residing in six distinct cities (Montreal, Quebec City, Trois-Rivières, Sherbrooke, Gatineau, Saguenay), were recruited between 2009 and 2015, regardless of their birthplace. The CARTaGENE cohort also includes a wide range of phenotypes, among which is the occurrence of a disease. For further details on genotyping and WGS data, see https://cartagene.qc.ca/files/documents/other/Info_GeneticData3juillet2023.pdf. All genomic data were aligned on the GRCh38 genome assembly.

Genotypes cleaning and imputation

To increase our sample size and achieve more accurate carrier rates, we imputed the six different CARTaGENE genotyping chips using SHAPEIT5 [29] and IMPUTE5 [30] with default settings. The individuals were genotyped on different arrays (Omni 2.5, GSAv1 + Multi disease panel, GSAv1, GSAv2 + Multi disease panel, GSAv3 + Multi disease panel, GSAv2 + Multi disease panel + addon and Affymetrix Axiom 2.0) and were cleaned and merged as follows. Each dataset was cleaned separately using PLINK software v1.9 [31], ensuring individuals with at least 95% genotypes among all SNPs were retained. At the SNP level, we retained SNPs with at least 95% genotypes among all individuals, located on the autosomes and in Hardy–Weinberg equilibrium p > 10⁻⁶.
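The per-array cleaning step could be expressed with standard PLINK 1.9 filters. The sketch below only builds the command line; the mapping of the stated thresholds to `--mind`, `--geno`, `--hwe` and `--autosome`, the single-pass ordering, and the file prefixes are assumptions, since the exact command used is not given in the text.

```python
def plink_qc_command(bfile, out):
    """Build a PLINK 1.9 command for the per-array quality control.

    `bfile` and `out` are hypothetical file prefixes for one genotyping array.
    """
    return [
        "plink",
        "--bfile", bfile,
        "--mind", "0.05",   # drop individuals with more than 5% missing genotypes
        "--geno", "0.05",   # drop SNPs with more than 5% missing genotypes
        "--autosome",       # keep autosomal SNPs only
        "--hwe", "1e-6",    # drop SNPs with a Hardy-Weinberg p-value below 1e-6
        "--make-bed",
        "--out", out,
    ]


# Print the command for one hypothetical array.
print(" ".join(plink_qc_command("array1", "array1_clean")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a system where PLINK 1.9 is installed, once per genotyping array.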

The imputation was performed using 2,390 WGS from CARTaGENE and in-house Quebec samples [32] as a reference to enhance our capacity to identify rare variants within our population (refer to Table D in S1 Text for a comparison of imputations using either the local Quebec or TOPMed reference panel). Both WGS cohorts were jointly called using Illumina DRAGEN v4.0 with the popex tool. Variants with at least 10% missing genotypes, monomorphic variants, and variants in centromeres and in the ENCODE blacklist were filtered out from the WGS before performing imputation on each genotyping batch separately. All imputed genotyping batches were then merged, and the final imputed dataset includes 29,353 individuals. A post-imputation quality control filter was applied on each individual imputed batch to remove variants with an imputation quality score <0.3 for the PCA and UMAP.

UMAP and clustering according to individuals’ origin

For this study, we needed to identify clusters of individuals based on genetics, regardless of where they were recruited. To do so, we performed a PCA using PLINK on the WGS SNPs with a minor allele frequency (MAF) of at least 5%, after removing SNPs missing in more than 2% of individuals and pruning SNPs in LD (--indep-pairwise 200 5 0.1). We retained only biallelic SNPs within the accessibility mask [33], leaving 90,073 SNPs. We also filtered out individuals with more than 2% missing SNPs, leaving 2,166 individuals. A UMAP [34] was then performed on the first three PCs (determined by the scree test) with the R umap library v0.9.2.0 (Fig B in S1 Text). This technique has proven effective at revealing fine-scale population structure [35]. K-means clustering was then used to create three clusters, aiming to retain as many individuals from the SLSJ as possible, given its limited sample size, while selecting individuals with the strongest ancestry connection to the region. Based on the recruitment place (Fig B, panel A in S1 Text), the majority of the CARTaGENE participants recruited from the SLSJ region belong to the red cluster (Fig B, panel B in S1 Text). We identified 314 individuals originating from the SLSJ region (red cluster) and 1,538 individuals from the other urban Quebec regions (UQc) (green cluster), for a total of 1,852 for the QcP (green and red clusters). Clusters were also defined on imputed data as described above, using pruned SNPs (--indep-pairwise 50 5 0.2) with a frequency of 5% or more and keeping five PCs for the UMAP, leaving 3,589 individuals in the SLSJ (red cluster) and 21,472 in the UQc (green cluster), for a total of 25,061 in the QcP (Fig C in S1 Text). This includes the 1,852 individuals with WGS that were also imputed from genotypes. In the end, the SLSJ cluster includes 90% of the individuals recruited from the SLSJ region for the WGS data and 84% for the imputed data.
We ensured consistency of individuals in clusters between the WGS and imputed data by removing 27 samples that exhibited mismatches, likely because they were at the boundary between the two clusters. This method ensures that individuals have a common genetic background and has been shown to be helpful in uncovering rare variants with smaller sample sizes [36,37].
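
The consistency check between the WGS and imputed cluster assignments can be illustrated with a short sketch. This is not the authors' code: the function name `drop_cluster_mismatches`, the dictionary layout and the toy sample IDs are all hypothetical, and the handling of individuals present in only one dataset is an assumption.

```python
def drop_cluster_mismatches(wgs_clusters, imputed_clusters):
    """Keep individuals whose cluster label agrees between both datasets.

    Both arguments map a sample ID to a cluster label (e.g. "SLSJ" or "UQc").
    Individuals present in only one dataset keep their single label (assumption);
    individuals present in both are kept only if the two labels match.
    """
    kept = {}
    mismatched = []
    for sample in sorted(set(wgs_clusters) | set(imputed_clusters)):
        in_wgs = sample in wgs_clusters
        in_imputed = sample in imputed_clusters
        if in_wgs and in_imputed and wgs_clusters[sample] != imputed_clusters[sample]:
            # Conflicting assignment, e.g. a sample sitting between two clusters.
            mismatched.append(sample)
            continue
        kept[sample] = wgs_clusters[sample] if in_wgs else imputed_clusters[sample]
    return kept, mismatched


# Toy example with hypothetical sample IDs.
wgs = {"ind1": "SLSJ", "ind2": "UQc", "ind3": "SLSJ"}
imputed = {"ind1": "SLSJ", "ind2": "SLSJ", "ind3": "SLSJ", "ind4": "UQc"}
kept, mismatched = drop_cluster_mismatches(wgs, imputed)
print(kept)
print(mismatched)
```

In this toy example only `ind2` is dropped, because its WGS and imputed labels disagree.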

Resampling of UQc

To minimize bias when comparing numbers of individuals and variants between the two clusters, we performed 1,000 resamplings of 3,589 individuals from UQc, matching the SLSJ sample size in the imputed data. We then calculated the mean and 95% confidence interval (CI) across the 1,000 resamplings. The graphs display the mean proportions or numbers of variants derived from these resamplings, with error bars indicating the 95% CI.
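
A minimal sketch of this resampling scheme is given below. It is illustrative only: the text does not state whether sampling was done with or without replacement or how the CI was derived, so the sketch assumes sampling without replacement and a percentile interval. The function name `resample_summary` and the toy data are hypothetical, and the toy sizes are smaller than the real ones to keep the example fast.

```python
import random
import statistics


def resample_summary(values, sample_size, n_resamplings=1000, seed=0):
    """Mean and 95% percentile interval of a per-individual statistic.

    `values` holds one number per individual of the larger cluster, for
    example the number of variants carried. Each resampling draws
    `sample_size` individuals without replacement (assumption).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamplings):
        subset = rng.sample(values, sample_size)
        means.append(statistics.fmean(subset))
    means.sort()
    # Percentile interval: the 2.5th and 97.5th percentiles of the resampled means.
    lower = means[int(0.025 * n_resamplings)]
    upper = means[int(0.975 * n_resamplings) - 1]
    return statistics.fmean(means), (lower, upper)


# Toy data: 2,000 individuals carrying 0 to 3 variants each (mean 1.5).
toy_counts = [i % 4 for i in range(2000)]
mean_count, (ci_low, ci_high) = resample_summary(toy_counts, sample_size=300)
print(mean_count, ci_low, ci_high)
```

With the study's numbers, `values` would have 21,472 entries (UQc, imputed data) and `sample_size` would be 3,589.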

Selection of pathogenic variants

Variants’ classification was extracted from the ClinVar database version of June 24, 2024 [38]. Only SNPs, insertions and deletions (indels) classified as Pathogenic, Likely pathogenic, or Pathogenic/Likely pathogenic were included in the analysis; repeat expansions were excluded. Furthermore, variants with the following review status were removed: no assertion criteria provided, no classification provided, and no classification for the individual variant. Additionally, we incorporated all variants referenced as founder variants in previous studies [7–10], regardless of their status on ClinVar. This yielded a list of 240,716 variants.
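
The selection rules above can be sketched as a simple filter. This is a hedged illustration rather than the authors' pipeline: the record layout (a dictionary with `id`, `significance`, `review_status` and `variant_type` keys) is hypothetical, and the label strings are assumptions modelled on the categories named in the text, not a verified copy of ClinVar's field values.

```python
# Label strings are assumptions modelled on the categories named in the text.
KEPT_SIGNIFICANCE = {
    "Pathogenic",
    "Likely_pathogenic",
    "Pathogenic/Likely_pathogenic",
}
EXCLUDED_REVIEW_STATUS = {
    "no_assertion_criteria_provided",
    "no_classification_provided",
    "no_classification_for_the_individual_variant",
}
KEPT_VARIANT_TYPES = {"SNP", "insertion", "deletion", "indel"}


def keep_variant(record, known_founder_ids=frozenset()):
    """Return True if a variant record passes the selection rules.

    Variants already reported as founder variants are kept regardless of
    their ClinVar status; all others must pass the three filters.
    """
    if record["id"] in known_founder_ids:
        return True
    return (
        record["significance"] in KEPT_SIGNIFICANCE
        and record["review_status"] not in EXCLUDED_REVIEW_STATUS
        and record["variant_type"] in KEPT_VARIANT_TYPES
    )


# Toy records with hypothetical identifiers.
records = [
    {"id": "v1", "significance": "Pathogenic",
     "review_status": "criteria_provided,_single_submitter", "variant_type": "SNP"},
    {"id": "v2", "significance": "Pathogenic",
     "review_status": "no_assertion_criteria_provided", "variant_type": "SNP"},
    {"id": "v3", "significance": "Uncertain_significance",
     "review_status": "criteria_provided,_single_submitter", "variant_type": "deletion"},
    {"id": "v4", "significance": "Likely_pathogenic",
     "review_status": "reviewed_by_expert_panel", "variant_type": "repeat_expansion"},
]
selected = [r["id"] for r in records if keep_variant(r, known_founder_ids={"v3"})]
print(selected)
```

Here `v3` is retained only because it is on the (hypothetical) list of previously reported founder variants.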

A total of 9,043 pathogenic variants were present in either gnomAD or the QcP imputed data after removal of variants that were absent from the QcP WGS or unreliable in the imputed data (Table D in S1 Text). We also removed variants with a frequency below 1/21,066, where 21,066 is the lowest number of alleles (for gnomAD non_topmed_nfe). In total, 1,590 imputed variants were present in the QcP.

Calculation of relative frequency difference (RFD) ≥10%

We calculated the variants’ frequency in the CARTaGENE WGS and imputed data using PLINK v1.9 for the individuals originating from the SLSJ, UQc and QcP (both SLSJ and UQc clusters) inferred by the clustering. The gnomAD frequencies for the non-Finnish Europeans (non_topmed_nfe) were directly extracted from gnomAD genomes v3.1.2. To calculate the RFD of a variant in the QcP compared to gnomAD nfe, we used the following formula:

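$$\mathrm{RFD} = \frac{\mathrm{freq}_{\mathrm{QcP}} - \mathrm{freq}_{\mathrm{gnomAD}}}{\mathrm{freq}_{\mathrm{QcP}}}$$

Knowing that:

  • $\mathrm{freq}_{\mathrm{QcP}}$ corresponds to the frequency of the variant in the QcP population.

  • $\mathrm{freq}_{\mathrm{gnomAD}}$ corresponds to the frequency of the variant in the gnomAD non-Finnish Europeans.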

We set a minimum RFD threshold of 0.1 to capture a large number of variants that could be of interest, although at very low frequencies the difference between the two populations may be minimal and indistinguishable from sampling noise. For instance, RFD = 0.1 corresponds to a 1.11-fold increase and RFD = 0.5 to a 2-fold increase. When a variant with RFD ≥ 10% identified in the WGS was also found in the imputed data, we used the imputed variant frequency. If not, we relied on the WGS variant frequency, ensuring that the RFD was at least 10% in both data types. Notably, the frequencies of variants show a strong correlation between both data types (Fig D in S1 Text). We detected 1,304 potentially pathogenic variants that reached RFD ≥ 10% compared to gnomAD nfe. Since we are focusing on rare variants, we removed two variants with a MAF ≥ 5% (chr6:26090951:C:G and chr14:94380925:T:A), leaving 1,302 rare variants with RFD ≥ 10% in the QcP.
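
As a worked illustration, the RFD is taken here to be the difference between the QcP and gnomAD frequencies divided by the QcP frequency, which is the reading consistent with the 1.11-fold and 2-fold examples above. The function names are hypothetical and the frequencies in the example are made up.

```python
def relative_frequency_difference(freq_qcp, freq_gnomad):
    """RFD = (freq_QcP - freq_gnomAD) / freq_QcP."""
    if freq_qcp <= 0:
        raise ValueError("freq_qcp must be positive")
    return (freq_qcp - freq_gnomad) / freq_qcp


def fold_increase(rfd):
    """Fold increase freq_QcP / freq_gnomAD implied by an RFD below 1."""
    if rfd >= 1:
        raise ValueError("fold increase is undefined when the variant is absent from gnomAD")
    return 1.0 / (1.0 - rfd)


# A variant twice as frequent in the QcP as in gnomAD has an RFD of 0.5.
rfd_example = relative_frequency_difference(0.02, 0.01)
print(rfd_example, fold_increase(rfd_example))

# The minimum threshold of 0.1 corresponds to roughly a 1.11-fold increase.
print(round(fold_increase(0.1), 2))
```

Rearranging the definition gives fold increase = 1 / (1 - RFD), so the RFD approaches 1 as the gnomAD frequency approaches zero.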

Estimation of carrier rate (CR)

We directly counted the number of heterozygotes for each variant and determined the CR by calculating the frequency of the heterozygous individuals expressed as 1/x.
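
A short sketch of this calculation follows, with the function name `carrier_rate_denominator` being hypothetical. It assumes the denominator x is rounded to the nearest integer, which is consistent with the rates quoted for the WGS thresholds in this article (for example, five carriers among 314 individuals is reported as 1/63).

```python
def carrier_rate_denominator(n_heterozygotes, n_individuals):
    """Return x such that the carrier rate is 1/x, or None if no carrier is observed."""
    if n_heterozygotes == 0:
        return None
    return round(n_individuals / n_heterozygotes)


# Examples using the WGS sample sizes of the SLSJ, UQc and QcP clusters.
for carriers, sample_size in [(5, 314), (8, 1538), (10, 1852)]:
    x = carrier_rate_denominator(carriers, sample_size)
    print(f"{carriers} carriers among {sample_size} individuals: CR = 1/{x}")
```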

IBD sharing at variants’ location

All cleaned genotyping batches (excluding the Affymetrix chip due to its poor SNP intersection with the Illumina chips) were combined and only the intersecting common SNPs were kept. After the merge, individuals genotyped at less than 95% of SNPs and SNPs genotyped in less than 95% of individuals were once again filtered out. The final dataset comprises 148,200 SNPs and 28,358 individuals.

We then inferred IBD segments on phased genotypes using refinedIBD [39] version 17Jan20 and Beagle version 18May20. Subsequently, the segments were merged using the merge-ibd-segments 17Jan20.102 tool. We retained only IBD segments of 2Mb or longer and with a LOD score greater than 3.

We then examined the proportion of pairs of individuals sharing IBD along the genome among carriers of a specific pathogenic variant in the SLSJ and UQc clusters separately. This proportion is generally around 0.02 in SLSJ individuals who are not closely related [40].
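
The pairwise sharing proportion at a variant's location can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the segment tuple layout `(id1, id2, chrom, start_bp, end_bp, lod)` is hypothetical and is not the refinedIBD output format, and the function name and toy data are invented for the example. The length and LOD filters follow the thresholds given above (segments of 2 Mb or longer, LOD greater than 3).

```python
from itertools import combinations

MIN_LENGTH_BP = 2_000_000  # keep segments of 2 Mb or longer
MIN_LOD = 3.0              # keep segments with a LOD score greater than 3


def ibd_sharing_proportion(carriers, segments, chrom, position):
    """Proportion of carrier pairs sharing an IBD segment that covers `position`."""
    sharing_pairs = set()
    for id1, id2, seg_chrom, start, end, lod in segments:
        if seg_chrom != chrom:
            continue
        if end - start < MIN_LENGTH_BP or lod <= MIN_LOD:
            continue
        if start <= position <= end:
            sharing_pairs.add(frozenset((id1, id2)))
    pairs = [frozenset(p) for p in combinations(sorted(set(carriers)), 2)]
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p in sharing_pairs) / len(pairs)


# Toy example: four carriers (six pairs) of a variant at chr1:2,500,000.
carriers = ["A", "B", "C", "D"]
segments = [
    ("A", "B", "chr1", 1_000_000, 4_000_000, 5.0),   # counted
    ("A", "C", "chr1", 2_000_000, 3_500_000, 4.0),   # shorter than 2 Mb
    ("C", "D", "chr1", 1_000_000, 5_000_000, 2.5),   # LOD too low
    ("B", "C", "chr1", 2_000_000, 6_000_000, 10.0),  # counted
    ("A", "D", "chr2", 1_000_000, 5_000_000, 9.0),   # other chromosome
    ("B", "D", "chr1", 3_000_000, 6_000_000, 8.0),   # does not cover the variant
]
proportion = ibd_sharing_proportion(carriers, segments, "chr1", 2_500_000)
print(proportion)
```

In the toy data, two of the six carrier pairs share a qualifying segment over the variant, giving a proportion of one third.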

Selection of founder variants

After selecting ClinVar pathogenic variants with RFD ≥ 10%, we established additional criteria for a variant to be considered a founder variant. The number of individuals carrying the variant must be adequate to avoid false signals or misinterpretations while also being high enough to be relevant for inclusion in population screening tests [7–10]; thus, the target CR was set to 1/200. Hence, given the different sample sizes, the minimum threshold was five (1/63), eight (1/192) and ten (1/185) individuals carrying the variant for WGS of SLSJ, UQc and QcP, respectively. We chose not to include variants found in fewer than five individuals to minimize the risk of false signals and because IBD sharing patterns were unclear with such low counts. However, for imputed data, we set the threshold to reach a CR of 1/200, which represents 18, 108 and 126 individuals for the SLSJ, UQc and QcP. For instance, a variant with a carrier rate of 1/200 for a recessive disease in a population of 286,768 (the SLSJ population in 2024) would be expected to result in approximately seven affected individuals. Furthermore, to be called a founder variant, a variant must show a proportion of pairs of carriers sharing IBD around the variant of at least 0.5, indicating that at least half of the carrier pairs share IBD at the variant’s location. Considering that a founder variant usually originates from a single ancestor in a population with a founder effect [24], and within a relatively recent time frame to increase in frequency due to drift, this variant will still be in LD with other surrounding variants and carriers will share not only the variant, but also the surrounding haplotype. Therefore, examining the IBD sharing around a variant is a dependable method to confirm its recent common origin due to the founder effect and population expansion.
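
The carrier-count thresholds and the founder call can be sketched as below. This is one possible reading of the criteria, not the authors' code: it assumes the minimum number of carriers is the sample size divided by 200, rounded up, with a floor of five carriers, which reproduces the six thresholds quoted above. The function names and the example inputs are hypothetical.

```python
import math

TARGET_CR_DENOMINATOR = 200  # target carrier rate of 1/200
MIN_CARRIERS_FLOOR = 5       # never call a founder variant on fewer than five carriers
MIN_RFD = 0.1                # relative frequency difference of at least 10%
MIN_IBD_PROPORTION = 0.5     # at least half of carrier pairs share IBD at the variant


def min_carriers(n_individuals):
    """Smallest carrier count reaching a CR of 1/200, with a floor of five."""
    return max(MIN_CARRIERS_FLOOR, math.ceil(n_individuals / TARGET_CR_DENOMINATOR))


def is_founder_variant(n_carriers, n_individuals, rfd, ibd_proportion):
    """Apply the RFD, carrier-count and IBD-sharing criteria together."""
    return (
        rfd >= MIN_RFD
        and n_carriers >= min_carriers(n_individuals)
        and ibd_proportion >= MIN_IBD_PROPORTION
    )


# Thresholds for the WGS (314, 1,538, 1,852) and imputed (3,589, 21,472, 25,061) clusters.
wgs_thresholds = [min_carriers(n) for n in (314, 1538, 1852)]
imputed_thresholds = [min_carriers(n) for n in (3589, 21472, 25061)]
print(wgs_thresholds, imputed_thresholds)

# Hypothetical variant: 20 carriers among 3,589 individuals, RFD 0.8, IBD sharing 0.7.
print(is_founder_variant(20, 3589, 0.8, 0.7))
```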

Clinical data

The patient group consisted of individuals residing in SLSJ during the assessment, all of whom had genetic disorders. They were clinically evaluated at the Medical Genetics service and the CMNM of CIUSSS of SLSJ. Their DNA samples were analyzed in certified clinical molecular laboratories as part of the clinical testing and genetic evaluation process. A review of nearly 15,000 patients in internal databases and medical records enabled the identification of patients who were homozygous, compound heterozygous, or heterozygous for autosomal recessive or dominant variants.

Supporting information

S1 Table. Description of variants with RFD ≥ 10%.

(XLSX)

pgen.1011876.s001.xlsx (194.4KB, xlsx)
S1 Text

Table A: Variants previously described in the population not found here. Table B: Experimental assessment of CR in an independent cohort of 1,000 individuals living in SLSJ. CI have been calculated on proportions using the online tool https://sample-size.net/confidence-interval-proportion/; * CR calculated using WGS in the present study; ** One sided 97.5% CI. Table C: Case reports on Quebec founder variants. Table D: Comparison of imputations of ClinVar rare variants using the TOPMed r2 or the Quebec reference panel. Proportions are based on common SNPs/genotypes except for missing SNPs which is on the total number of SNPs. False positives are defined as heterozygotes in imputed, but not in WGS data. False negatives are defined as heterozygotes in WGS, but not in imputed data. * In at least one individual. **SNPs having at least one homozygote switch genotype in addition to SNPs having more than 10% of false positive genotypes (101 out of the 106 false positive SNPs) were considered as unreliable in the imputed data. Fig A: Comparison of the variants’ carrier rates reported in SLSJ and found in our analysis. When available, the aggregated CR (all variants associated with the same disease) was used; also, if available, the CR from the imputed data was used; otherwise, the CR from WGS data was utilized. Variants from the same disease were grouped as in previous studies. Fig B: UMAP of WGS data. UMAP are coloured according to A) the recruitment region or country of birth and B) the k-means clustering. Fig C: UMAP of imputed data. UMAP are coloured according to A) the recruitment region or continent of birth and B) the k-means clustering. Note that 1,537 pathogenic variants had an RFD ≥ 10% in the WGS data, but only 1,302 of them had also an RFD ≥ 10% in the imputed data. Fig D: Correlation between imputed and WGS variants’ frequency in QcP.

(PDF)

pgen.1011876.s002.pdf (977.1KB, pdf)

Data Availability

Quebec genotype, imputed and WGS data are available under restricted access from CARTaGENE biobank (https://cartagene.qc.ca/en/researchers/access-request.html) due to the informed consent given by study participants. The code used and data for this study can be found in the following GitHub repository: https://github.com/Genopop/Figures-founder-variants-article.

Funding Statement

Funding for SLG was provided by the Canada Research Chair in Genetics and Genealogy grant #CRC-2022-00444 (https://www.chairs-chaires.gc.ca/chairholders-titulaires/profile-fra.aspx?profileId=5645). LB and CG were funded by the Research Chair in Génétique et parcours de vie en santé (https://www.chairegps.com/). The funders of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the article.

References

  • 1. Haendel M, Vasilevsky N, Unni D, Bologa C, Harris N, Rehm H, et al. How many rare diseases are there? Nat Rev Drug Discov. 2020;19(2):77–8. doi: 10.1038/d41573-019-00180-y
  • 2. Tesi B, Boileau C, Boycott KM, Canaud G, Caulfield M, Choukair D, et al. Precision medicine in rare diseases: What is next? J Intern Med. 2023;294(4):397–412. doi: 10.1111/joim.13655
  • 3. Bauskis A, Strange C, Molster C, Fisher C. The diagnostic odyssey: insights from parents of children living with an undiagnosed condition. Orphanet J Rare Dis. 2022;17(1):233. doi: 10.1186/s13023-022-02358-x
  • 4. Isshiki M, Griffen A, Meissner P, Spencer P, Cabana MD, Klugman SD. Genetic disease risks of under-represented founder populations in New York City. medRxiv. 2024. doi: 10.1101/2024.09.27.24314513
  • 5. Charbonneau H, Desjardins B, Légaré J, Denis H. The population of the St-Lawrence Valley, 1608-1760. A Population History of North America. 2000. pp. 99–142.
  • 6. Gagnon L, Moreau C, Laprise C, Vézina H, Girard SL. Deciphering the genetic structure of the Quebec founder population using genealogies. Eur J Hum Genet. 2023;:1–7. doi: 10.1038/s41431-023-01356-2
  • 7. Bchetnia M, Bouchard L, Mathieu J, Campeau PM, Morin C, Brisson D, et al. Genetic burden linked to founder effects in Saguenay-Lac-Saint-Jean illustrates the importance of genetic screening test availability. J Med Genet. 2021;58(10):653–65. doi: 10.1136/jmedgenet-2021-107809
  • 8. Cruz Marino T, Leblanc J, Pratte A, Tardif J, Thomas M-J, Fortin C-A, et al. Portrait of autosomal recessive diseases in the French-Canadian founder population of Saguenay-Lac-Saint-Jean. Am J Med Genet A. 2023;191(5):1145–63. doi: 10.1002/ajmg.a.63147
  • 9. Laberge A-M, Michaud J, Richter A, Lemyre E, Lambert M, Brais B, et al. Population history and its impact on medical genetics in Quebec. Clin Genet. 2005;68(4):287–301. doi: 10.1111/j.1399-0004.2005.00497.x
  • 10. Scriver CR. Human genetics: lessons from Quebec populations. Annu Rev Genomics Hum Genet. 2001;2:69–101. doi: 10.1146/annurev.genom.2.1.69
  • 11. Fortin C-A, Côté-Richer M, Truchon K, Leblanc J, Pratte A, Tardif J, et al. Successes of an innovative population-based carrier screening program for 4 prevalent recessive hereditary diseases in a population with a founder effect in Quebec, Canada. Genet Med Open. 2025;3:103435. doi: 10.1016/j.gimo.2025.103435
  • 12. Garcia AM, Diaz-Papkovich A, Sillon G, D’Agostino D, Chong AL, Chong G, et al. Using the ancestral recombination graph to study the history of rare variants in founder populations [Internet]. bioRxiv; 2025 [cited 2025 May 23]. Available from: https://www.biorxiv.org/content/10.1101/2025.03.13.643149v1
  • 13. Marafi D. Founder mutations and rare disease in the Arab world. Dis Model Mech. 2024;17(6):dmm050715. doi: 10.1242/dmm.050715
  • 14. Mathijssen IB, van Maarle MC, Kleiss IIM, Redeker EJW, Ten Kate LP, Henneman L, et al. With expanded carrier screening, founder populations run the risk of being overlooked. J Community Genet. 2017;8(4):327–33. doi: 10.1007/s12687-017-0309-5
  • 15. Fridman H, Behar DM, Carmi S, Levy-Lahad E. Preconception carrier screening yield: effect of variants of unknown significance in partners of carriers with clinically significant variants. Genet Med. 2020;22(3):646–53. doi: 10.1038/s41436-019-0676-x
  • 16. De Braekeleer M, Dallaire A, Mathieu J. Genetic epidemiology of sensorimotor polyneuropathy with or without agenesis of the corpus callosum in northeastern Quebec. Hum Genet. 1993;91(3):223–7. doi: 10.1007/BF00218260
  • 17. De Braekeleer M, Larochelle J. Genetic epidemiology of hereditary tyrosinemia in Quebec and in Saguenay-Lac-St-Jean. Am J Hum Genet. 1990;47(2):302–7.
  • 18. De Braekeleer M, Giasson F, Mathieu J, Roy M, Bouchard JP, Morgan K. Genetic epidemiology of autosomal recessive spastic ataxia of Charlevoix-Saguenay in northeastern Quebec. Genet Epidemiol. 1993;10(1):17–25. doi: 10.1002/gepi.1370100103
  • 19. Morin C, Mitchell G, Larochelle J, Lambert M, Ogier H, Robinson BH, et al. Clinical, metabolic, and genetic aspects of cytochrome C oxidase deficiency in Saguenay-Lac-Saint-Jean. Am J Hum Genet. 1993;53(2):488–96.
  • 20. Henn BM, Botigué LR, Peischl S, Dupanloup I, Lipatov M, Maples BK, et al. Distance from sub-Saharan Africa predicts mutational load in diverse human genomes. Proc Natl Acad Sci U S A. 2016;113(4):E440–449. doi: 10.1073/pnas.1510805112
  • 21. Henn BM, Botigué LR, Bustamante CD, Clark AG, Gravel S. Estimating the mutation load in human genomes. Nat Rev Genet. 2015;16(6):333–43. doi: 10.1038/nrg3931
  • 22. Pouyez C, Lavoie Y. Les saguenayens: introduction à l’histoire des populations du Saguenay, XVIe-XXe siècles. Sillery: Presses de l’Université du Québec; 1983. pp. 386.
  • 23. Bherer C, Labuda D, Roy-Gagnon MH, Houde L, Tremblay M, Vézina H. Admixed ancestry and stratification of Quebec regional populations. Am J Phys Anthropol. 2011;144(3):432–41. doi: 10.1002/ajpa.21424
  • 24. Heyer E, Austerlitz F. Update to Heyer’s “One founder/one gene hypothesis in a new expanding population” (1999). Hum Biol. 2009;81(5–6):657–62. doi: 10.3378/027.081.0614
  • 25. Moreau C, Bhérer C, Vézina H, Jomphe M, Labuda D, Excoffier L. Deep human genealogies reveal a selective advantage to be on an expanding wave front. Science. 2011;334(6059):1148–50. doi: 10.1126/science.1212880
  • 26. Casals F, Hodgkinson A, Hussin J, Idaghdour Y, Bruat V, de Maillard T, et al. Whole-exome sequencing reveals a rapid change in the frequency of rare functional variants in a founding population of humans. PLoS Genet. 2013;9(9):e1003815. doi: 10.1371/journal.pgen.1003815
  • 27. Peischl S, Dupanloup I, Foucal A, Jomphe M, Bruat V, Grenier J-C, et al. Relaxed selection during a recent human expansion. Genetics. 2018;208(2):763–77. doi: 10.1534/genetics.117.300551
  • 28. Awadalla P, Boileau C, Payette Y, Idaghdour Y, Goulet J-P, Knoppers B, et al. Cohort profile of the CARTaGENE study: Quebec’s population-based biobank for public health and personalized genomics. Int J Epidemiol. 2013;42(5):1285–99. doi: 10.1093/ije/dys160
  • 29. Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet. 2023;55(7):1243–9. doi: 10.1038/s41588-023-01415-w
  • 30. Rubinacci S, Delaneau O, Marchini J. Genotype imputation using the positional burrows wheeler transform. PLoS Genet. 2020;16(11):e1009049. doi: 10.1371/journal.pgen.1009049
  • 31. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. doi: 10.1086/519795
  • 32. Gagnon M, Moreau C, Ricard J, Boisvert MC, Bureau A, Maziade M. Rare variants and founder effect in the Beauce region of Quebec. Commun Biol. 2025;8(1):1184. doi: 10.1038/s42003-025-08630-7
  • 33. GRCh38 genome accessibility masks for 1000 Genomes data | 1000 Genomes [Internet]. [cited 2024 Apr 17]. Available from: https://www.internationalgenome.org/announcements/genome-accessibility-masks/
  • 34. McConville R, Santos-Rodríguez R, Piechocki RJ, Craddock I. N2D: (Not Too) Deep Clustering via Clustering the Local Manifold of an Autoencoded Embedding. In: 2020 25th International Conference on Pattern Recognition (ICPR) [Internet]. 2021. pp. 5145–52. [cited 2024 Jul 24]. Available from: https://ieeexplore.ieee.org/document/9413131
  • 35. Diaz-Papkovich A, Anderson-Trocmé L, Ben-Eghan C, Gravel S. UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts. PLoS Genet. 2019;15(11):e1008432. doi: 10.1371/journal.pgen.1008432
  • 36. Diaz-Papkovich A, Zabad S, Ben-Eghan C, Anderson-Trocmé L, Femerling G, Nathan V, et al. Topological stratification of continuous genetic variation in large biobanks. bioRxiv; 2023. pp. 2023.07.06.548007. https://www.biorxiv.org/content/10.1101/2023.07.06.548007v1
  • 37. Gagnon L, Moreau C, Laprise C, Girard SL. Fine-scale genetic structure and rare variant frequencies. PLoS One. 2024;19(11):e0313133. doi: 10.1371/journal.pone.0313133
  • 38. Index of/pub/clinvar/vcf_GRCh38 [Internet]. [cited 2024 Jun 24]. Available from: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/
  • 39. Browning BL, Browning SR. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics. 2013;194(2):459–71. doi: 10.1534/genetics.113.150029
  • 40. Gauvin H, Moreau C, Lefebvre J-F, Laprise C, Vézina H, Labuda D, et al. Genome-wide patterns of identity-by-descent sharing in the French Canadian founder population. Eur J Hum Genet. 2014;22(6):814–21. doi: 10.1038/ejhg.2013.227

Decision Letter 0

Jonathan Marchini, Gregory M Cooper

16 May 2025

PGENETICS-D-25-00284

Rare diseases load through the study of a regional population

PLOS Genetics

Dear Dr. Girard,

Thank you for submitting your manuscript to PLOS Genetics. After careful consideration, we feel that it has merit but does not fully meet PLOS Genetics's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days, by Jul 15 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosgenetics@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pgenetics/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to any formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Jonathan Marchini

Academic Editor

PLOS Genetics

Gregory Cooper

Section Editor

PLOS Genetics

Aimée Dudley

Editor-in-Chief

PLOS Genetics

Anne Goriely

Editor-in-Chief

PLOS Genetics

Journal Requirements:

1) We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex. If you are providing a .tex file, please upload it under the item type ‘LaTeX Source File’ and leave your .pdf version as the item type ‘Manuscript’.

2) Please upload all main figures as separate Figure files in .tif or .eps format. For more information about how to convert and format your figure files please see our guidelines: 

https://journals.plos.org/plosgenetics/s/figures

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Remarks to the authors

Michel et al investigate the genetic burden of rare diseases in Quebec’s Saguenay–Lac-Saint-Jean (SLSJ) region, bridging previous population-genetics studies from the lab with recent studies on the disease burden in the region (Cruz Marino et al.). The authors perform within-cohort imputation to increase resolution, and, after some basic pop-gen analysis, identify two clusters - 3,589 individuals from SLSJ and 21,472 from UQc. Next, starting from 240,710 P/LP variants from ClinVar, they follow several cleaning steps and identify 80 pathogenic variants with founder effects in either SLSJ or UQc. 42 of these were not reported in previous studies of the Quebec population and, interestingly, a dozen of those were found to be carried by patients having the corresponding disease. The study is interesting and has much to contribute towards better disease diagnosis, though I believe there are a few aspects that could be improved.

Major comments

1. The analysis about clinical validation is very interesting, and perhaps the highlight of the paper, but it should be expanded. First, how large were the external databases, and is there any chance that patients found by clinicians were also used in the analysis (“train-test contamination”)? How many variants were sent for validation and how many of those ~30 people were previously undiagnosed? If any, that needs highlighting. Some of these questions might have obvious answers (or no answer due to privacy constraints), but more details should be given to help assess the diagnostic utility of the findings.

2. The way the authors compare their list of founder variants with those previously reported is confusing. They first describe 42/72 reported previously (and found here), then mention “38 of the (80) founder variants were already documented…”. How do the 38 variants compare with the “42 previously reported” given in Table 1? Please consider restructuring this section for clarity by first presenting your findings, then indicating which are novel and which overlap with prior studies.

3. A common choice for comparing two groups is a two-samples t-test. Instead, the authors report p-values from chi2 tests. This choice should be justified, and the statistical tests described in more detail in Methods.

4. Using IBD sharing to clean founder variants is an interesting feature, but how did that help on top of the CR filter? Is there a reference paper that did something similar? The authors should justify the use of 50% as threshold, and how sensitive their analysis is.

Minor comments

• The section “Experimental validation of carrier rates” is misleading, as experimental usually refers to lab work carried out to validate one of the main findings. Effectively what the authors did was targeted genotyping to verify that their statistical imputation was accurate. Nice to have, but I believe that is more of a supplementary analysis.

• Please convert Supplementary Table 1 to csv rather than PDF.

• L277: “most related individuals” is that a typo? PCA needs to be performed in a set of unrelated individuals, and then any relatives need to be projected on the principal components.

• Table captions should be expanded to include definitions (e.g., F/NF) and sample sizes, particularly for Table 1.

• There are references mentioned twice, e.g. 10 & 21 or 13 & 18, perhaps more.

• Consider citing Isshiki et al 2024 (PMID: 39399040) who performed a similar study.

• L319: should be 1,302 variants.

• P-values are not written in the correct format, or contain typos.

Reviewer #2: The authors have sought to provide a comprehensive account of potentially pathogenic variants which are more frequent in Quebec than in the gnomAD non-Finnish European (NFE) genomes, relying on both variants from ClinVar and previously identified founder variants from Quebec or specifically the Saguenay–Lac-Saint-Jean (SLSJ) region. Using carrier rate thresholds and an IBD-based analysis, the authors identified 80 potentially pathogenic founder variants in SLSJ or the rest of Quebec, some of which are novel and some of which were previously reported. A subset of potential founder variant carrier rates were validated using a TaqMan genotyping assay, and some variants were observed in medical databases in Quebec. Mutational load (based on carrier counts) was found to be greater in SLSJ than in the rest of Quebec, both of the potentially pathogenic founder variants, and of all potentially pathogenic variants more common in SLSJ/Quebec than in gnomAD NFE.

This manuscript should be commended for its database-first approach to detecting potentially pathogenic founder variants in SLSJ and the rest of Quebec, which importantly draws upon a comprehensive analysis of potentially pathogenic variants in ClinVar, as well as the existing literature on founder variants in SLSJ. The authors make valuable observations about the proportion of potentially pathogenic founder variants in SLSJ/Quebec which are previously reported/unreported in the literature, as well as the mutational load of these variants in the regions studied. The careful use of imputed genotypes and an IBD-based analysis of potential founder variants both strengthen the study’s key findings. Below I have listed several major and minor points of criticism, which deal primarily with manuscript clarity and additional analyses of the authors’ existing data.

Major Points:

1. Lines 99-101. Given that many more WGS and imputed samples were available for the UQc than SLSJ cluster, (314 WGS and 3,589 imputed from SLSJ, or 1,538 WGS and 21,472 imputed from UQc), it is unsurprising that many more of the RFD >= 10% variants would be unique to UQc vs SLSJ. Regarding this, the authors state in the Discussion that “Indeed, 42% of variants with an RFD≥10% in the QcP were lost in the SLSJ, although some of them might be too rare to be observed in the SLSJ due to the smaller sample size” (lines 200-201). To better compare the count or proportion of variants missing from SLSJ and UQc, the authors should as a separate analysis down-sample the UQc cohort to match the SLSJ sample size, or repeatedly bootstrap equal-size subsets.

2. Lines 99-101. Additionally, the cited missing variant counts (540 from SLSJ and 17 from UQc) are not shown in Figure 1, or anywhere else that I can find. It would be useful to provide a table summarising the counts of potentially pathogenic RFD >= 10% variants coming from SLSJ, UQc, and QcP, and how many are unique to that cluster.

3. Lines 99-101. It is possible to visually confirm from Figure 1 that many variants are much more frequent in SLSJ than gnomAD NFE, seen on the right half of Figure 1b close to the X axis, whereas this is not the case for the combined QcP cluster in Figure 1a. I am wondering, however, if there is a better way to quantify that “many variants are more frequent in the SLSJ region”, such as a table which counts the number or proportion of the 1,302 variants above a certain RFD or fold-increase threshold for each cluster? How many variants, for instance, have a >=50% or 90% RFD vs gnomAD NFE in the SLSJ or QcP clusters?

4. Lines 101-103. Figure 2 plots the count and proportion of individuals from the SLSJ and UQc (rest of Quebec) clusters which carry 0, 1, 2, 3 … 9 of the 1,302 rare potentially pathogenic >= 10% RFD variants. Given that this figure primarily seems to shed light on the mutational load in SLSJ vs UQc, I am wondering whether the >= 10% RFD threshold is necessary? Why not repeat the same analysis for all rare, potentially pathogenic variants (as defined by ClinVar and previous SLSJ publications), regardless of RFD threshold? Additionally, I find it striking that, for SLSJ, only ~13% of individuals carry none of the 1,302 rare, potentially pathogenic >= 10% RFD variants of interest (and ~27% of UQc individuals). Because many of these variants are likely to come from ClinVar P/LP entries with just 1* review status, it would be informative to repeat this analysis with only ClinVar P/LP variants with >= 2* review status (plus the previously reported founder variants).

5. Lines 113-115. I cannot find any mention elsewhere in the manuscript or table/figure legends as to how the nine variants were selected to have their carrier rates reassessed, and whether this is an adequate number of variants to use for validation. Additionally, the authors should clarify (either in lines 113-115 or lines 359-361) whether the “subset of 1,000 randomly selected samples with appropriate consent” are a subset of the CARTaGENE individuals from SLSJ used in the SLSJ WGS/imputed datasets, or other individuals from SLSJ not used previously in the study. The term “subset” from lines 113-115 quoted above suggests the former, but the title of Supplementary Table 3 “Experimental reassessment of CR in an independent cohort of 1,000 individuals living in SLSJ” suggests the latter. If the 1,000 SLSJ subset does in fact come from the CARTaGENE 314 WGS and 3,589 imputed genotypes from SLSJ, then the authors should report the precise % of matching genotypes between the validation subset and the original data.

Minor Points:

6. Lines 59-66. The first paragraph of the Introduction gives some background on the prevalence of rare diseases (thought to affect 10% of population, more than 10,000 rare diseases in Orphanet). It would be useful to specify that the authors are mainly concerned with rare Mendelian diseases.

7. Lines 133-135. The authors should comment on whether any variants were more common in UQc than in SLSJ, and how many.

8. Lines 136-139. Because inheritance patterns are reported for many of these 80 founder variants (some are shown in Table 2), it would be useful to estimate the expected count or rate of affected individuals, in addition to the observed carrier counts shown in Figure 5.

9. Figure 4. The X axis labels are unusual, each ending in an open square bracket where it should be a closed square bracket. Additionally, I am confused why many UQc founder variants are shown with carrier rates below 1/200, given that a carrier rate above 1/200 was a requirement for founder variants. The authors should specify whether the label “UQc founder variants” here means variants which are founder variants in UQc, or something else, like founder variants in any region with their UQc carrier rate.

10. Lines 154-156. The nucleotide change and phenotype are listed, but authors should also list the associated gene name.

11. Table 3. The table is described as having corresponding phenotypes, but no actual phenotypes are listed.

12. Table 3. Nine instances of compound heterozygotes in patients are documented, also referred to in lines 171-174 of the Discussion. It would be useful to note the other variant in each case, and also to comment on whether it was possible to determine whether the two observed variants were in trans (a true compound heterozygote) or in cis on the same haplotype.

13. Lines 272-273. The final imputed dataset appears to have more individuals (29,353) than the number of initial CARTaGENE chip genotypes (29,337, mentioned lines 252-253). I am wondering if this refers to the number of imputed genotypes + the WGS, after filtering for sample missingness—the language is ambiguous. However, the subsequent application of the imputed dataset to refine the Quebec allele frequencies suggests the imputed and WGS datasets were not merged, as the authors mention that imputed frequencies only superseded WGS frequencies when the variant was present in the imputed dataset.

14. Lines 299-302. I presume that “conflicting (both pathogenic and likely pathogenic variants)” simply refers to variants with an aggregate germline classification of “Pathogenic/Likely pathogenic”, and not to “Conflicting classifications of pathogenicity”, as the latter term implies variants with both pathogenic and uncertain/benign submissions. If so, I would advise removing the word “conflicting” from the authors’ description.

15. Methods for calculating RFD, lines 306-321. As far as I can tell, closely related individuals were not removed from the SLSJ, UQc and QcP cohort clusters used to calculate various Quebec variant frequencies. If this was done, it should be stated, and if not, the choice should be explained.

16. Methods for calculating RFD, lines 306-321. It would be useful to comment on the statistical significance of an observed RFD of 0.1 between any of the SLSJ, UQc, and QcP cohorts and the gnomAD v.3.1.2 NFE genomes, for a variant at some of the gnomAD or Quebec/SLSJ allele frequencies relevant to this study. Additionally, for ease of reading, the authors should briefly state in the Methods or Results what an RFD of 0.1 would translate to in terms of fold-difference in frequency between the SLSJ/Quebec population and gnomAD NFE, perhaps for other reference RFDs as well, such as RFD = 0.5 or RFD = 0.9. For instance, RFD = 0.1 corresponds to a 1.11x greater frequency and RFD=0.5 corresponds to a 2x greater frequency.

17. Line 320. It would be useful to name the two variants with MAF >= 0.05, and to state how many of the remaining variants have, for instance, MAF >= 0.01 or MAF >= 0.005.

18. Lines 328-331. The authors explain that alternative observed carrier count thresholds of 5, 8, and 10 were used for the SLSJ, UQc, and QcP WGS clusters, due to the smaller sample sizes. The authors should clarify how exactly thresholds of 5, 8, and 10 were chosen.

19. Methods for IBD analysis, lines 336-350. I believe the comparison of pairwise IBD sharing at variant sites was done for each cluster (SLSJ, UQc, and QcP), as founder variants are reported for each in the Results section. Reading this section, particularly lines 343-346, I initially had the impression that the analysis was only done for SLSJ. This should be clarified.

20. Supplementary Table 1. Although this table is not meant to be read in its entirety, it is basically useless in the format it was sent to me for finding out anything about a variant’s status in SLSJ, as the variant info (ClinVar ID, Position GRCh38, Data type) for each variant are on different pages (1-25) than the SLSJ columns (26-49). It would also be good to include the gene name or symbol for each variant.
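To illustrate the conversion requested in point 16 above, here is a minimal sketch. It assumes RFD is defined as (f_pop - f_ref) / f_pop, which is the definition consistent with the reviewer's worked values (RFD = 0.1 giving 1.11x and RFD = 0.5 giving 2x); the manuscript's exact formula is not quoted in this letter, and the function names are hypothetical.

```python
def rfd(f_pop: float, f_ref: float) -> float:
    """Relative frequency difference, ASSUMED here to be (f_pop - f_ref) / f_pop."""
    if f_pop <= 0:
        raise ValueError("f_pop must be positive")
    return (f_pop - f_ref) / f_pop


def fold_from_rfd(r: float) -> float:
    """Fold-difference f_pop / f_ref implied by an RFD value r < 1."""
    if r >= 1:
        raise ValueError("RFD must be below 1 for a finite fold-difference")
    return 1.0 / (1.0 - r)


# Reference values suggested by the reviewer (0.1, 0.5) plus 0.9.
for r in (0.1, 0.5, 0.9):
    print(f"RFD = {r:.1f} -> {fold_from_rfd(r):.2f}x")
```

Under this assumed definition, the fold-difference grows without bound as RFD approaches 1, which is why a threshold of 0.1 admits variants that are only marginally more frequent than in the reference.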

Reviewer #3: Review of the manuscript: Rare diseases load through the study of a regional population

Overview:

The authors present a survey of pathogenic rare variants in Quebec, particularly the Saguenay–Lac-Saint-Jean region (SLSJ), based on whole-genome sequencing of ~2200 individuals and genotype imputation of 29k others. The authors specifically look for variants with evidence of pathogenicity, high prevalence, recent origin, and higher frequency in Quebec compared to other Europeans. They validated 38 previously reported variants and identified 42 previously unreported variants.

Strengths:

The study is based on a very large dataset. The population is important to study because of the founder event and, consequently, the high prevalence of pathogenic variants. Thus, the study has important clinical implications for the management of genetic testing in this population, particularly preconception carrier screening. The methods used to detect the founder variants are overall sound. The IBD analysis is particularly innovative and important for the identification of variants of recent origin.

Major comments:

• The results have implications for preconception carrier screening in Quebec and particularly in SLSJ. However, the authors don’t make specific recommendations for what such a panel should look like, which, unfortunately, somewhat diminishes the impact of the paper. It would be important to describe how such a panel would compare to existing panels in terms of the variants included and the number of couples at risk. (For example, see https://www.nature.com/articles/gim2015123, although it is a bit dated and uses a much smaller dataset.)

• Some of the analyses are difficult to interpret, given that they are based on the same dataset used for discovery. The variants discovered are conditioned, by definition, to have certain frequencies in the discovery dataset. Thus, all results reporting variant counts (Figures 2-5) are somewhat biased (regression to the mean/winner’s curse). In other words, variants that made it to the list may have happened to be those “lucky” enough to cross the frequency threshold; in another dataset, their frequency is expected to be lower. This is easily solvable by dividing the dataset into discovery and analysis subsets, whereby variants are discovered in the (larger) discovery subset, and variant counts are reported based on the (smaller) analysis subset. I would also suggest that the analysis subset have an identical number of genomes from SLSJ and UQc. Otherwise, again, all variant counts are difficult to interpret.

• It is difficult to justify the criterion of having a relative frequency difference greater than 10%. Figure 1 shows clearly that many variants are quite common in other Europeans, but have simply drifted to somewhat higher frequencies in Quebec. I would not call these founder variants. The founder variants are those on or close to the x-axis in Figure 1, namely variants with near-zero frequency outside Quebec that have reached a considerable frequency in Quebec. I think that more reasonable criteria would be a 5-fold or 10-fold higher frequency in Quebec, along with a frequency in gnomAD under (say) 1/1,000 or 1/10,000. Or the authors could use a method such as the one in this paper: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007329. Perhaps more generally, defining a first set of variants with RFD>0.1 and then a second set that also has a minimal carrier rate and high levels of IBD is quite confusing. I think it is really the second set that is interesting. The IBD analysis is very nice and important in this context.

• The clinical data analysis is missing many details. What was the sample size? Was there ethical approval for the search? Who did the search? Which variants were considered? In addition, Table 3 seems to contradict the associated text, as all three variants mentioned in the text were seen only once based on Table 3, while line 159 suggests a variant is of medical interest only if seen three times. In Table 3, which variants are already in screening panels? More broadly, it is not clear what the motivation for this analysis is and what we actually learn from it. These variants are known to be pathogenic and they were found in the discovery dataset. So what new information do we learn from this review of medical records?
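The winner's curse concern raised in the major comments above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' pipeline: all variants share a hypothetical true carrier rate of 1/250, the discovery filter keeps variants with at least 10 carriers among 2,000 individuals (an observed rate of 1/200), and the same variants are then re-counted in an independent analysis subset of equal size. All parameter values and function names are illustrative.

```python
import random


def binomial(rng: random.Random, n: int, p: float) -> int:
    """Draw a Binomial(n, p) count as a sum of Bernoulli trials (stdlib only)."""
    return sum(rng.random() < p for _ in range(n))


def winners_curse(true_rate=1 / 250, min_carriers=10, n_discovery=2000,
                  n_analysis=2000, n_variants=500, seed=1):
    """Return (number selected, mean discovery rate, mean analysis rate)."""
    rng = random.Random(seed)
    discovery_rates, analysis_rates = [], []
    for _ in range(n_variants):
        carriers = binomial(rng, n_discovery, true_rate)
        if carriers >= min_carriers:  # variant passes the discovery filter
            discovery_rates.append(carriers / n_discovery)
            # Re-estimate the same variant in an independent subset.
            analysis_rates.append(binomial(rng, n_analysis, true_rate) / n_analysis)
    if not discovery_rates:
        return 0, float("nan"), float("nan")
    mean = lambda xs: sum(xs) / len(xs)
    return len(discovery_rates), mean(discovery_rates), mean(analysis_rates)


n_selected, mean_disc, mean_analysis = winners_curse()
print(f"selected variants:            {n_selected}")
print(f"mean rate in discovery set:   {mean_disc:.5f}")
print(f"mean rate in analysis subset: {mean_analysis:.5f}")
```

Because selection conditions on the discovery count, the mean discovery rate of the selected variants sits at or above the threshold by construction, while the independent re-estimate regresses toward the true rate.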

Minor comments:

• Lines 99-100: this result is likely due to the imbalance in sample size between SLSJ and UQc.

• Please add the y=x line to Figures 1 and 3.

• Table 1. What is the “status” column? Please also add the genomic coordinates of the variants (at least in a supplementary table, if space in the main text is limited).

• It will be interesting to see the distribution of IBD fraction across the 1302 variants of the first set and the 80 variants of the second set. This will help understand whether and how this filter is helping to identify the founder variants.

• Lines 155-156: please specify the meaning of the numbers in parenthesis.

• Please add to Table 3 the disease associated with each variant.

• Lines 171-176: we expect that by pure luck, some previously discovered variants will not be discovered in the current study above a certain frequency. The probability for this can be computed using binomial distributions.

• Lines 182-186: similar work has been performed in other populations. See, for example, https://www.nature.com/articles/s41436-019-0676-x.

• Lines 190-191: there is a very long literature on the question of whether founder populations have an increased load of deleterious variants. It is not obviously true, because, while founder events increase the frequency of some variants, they eliminate all the variants in people not surviving the founder event. In particular, see https://www.nature.com/articles/ng.2896 and the reviews https://www.sciencedirect.com/science/article/abs/pii/S0959437X14001002 and https://www.nature.com/articles/nrg3931. There were also later papers.

• Lines 193/245: I think that “fasten” and “fastening” are not used correctly.

• Supplementary figure 2: it is difficult to see how many variants are at each frequency given that points are overlapping.

• Lines 222-223: maybe I missed it, but I didn’t see anywhere mentioned filtering by 10% false positives.

• Line 267: could you please provide details on the in-house Quebec reference panel?

• Line 279: what were the parameters for removal of variants in LD?

• Line 294: how many samples were in both the WGS dataset and the imputed dataset? How many were only in the chip data and are at risk of misclassification?

• Line 295: the possibility of a mix-up is unclear to me – assuming these are samples with both WGS and chip data, why not just compare the data between the platforms and verify it’s the same person?

• Line 325: the carrier rate is just f_(hetero). Not the inverse.

• Line 358: which variants were validated using the TaqMan assay?
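The binomial calculation suggested in the comment on lines 171-176 above can be sketched as follows. The sample size of 314 SLSJ genomes and the observed-carrier threshold of 5 are taken from figures quoted in Reviewer #2's report; the true carrier rate of 1/100 is a purely hypothetical input, and the function name is illustrative.

```python
from math import comb


def prob_fewer_than(k: int, n: int, p: float) -> float:
    """P(X < k) for X ~ Binomial(n, p): chance of seeing fewer than k carriers."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


# Hypothetical example: a variant with a true carrier rate of 1/100,
# 314 sequenced individuals, and a requirement of at least 5 observed carriers.
n_individuals = 314
min_carriers = 5
true_carrier_rate = 1 / 100  # assumption for illustration only

p_missed = prob_fewer_than(min_carriers, n_individuals, true_carrier_rate)
print(f"P(fewer than {min_carriers} carriers among {n_individuals}) = {p_missed:.3f}")
```

Running this over a list of previously reported variants and their published carrier rates would give the expected number missed by chance alone, which can then be compared with the number actually not recovered.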

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: No:  Sup Table 1 should be uploaded as csv/txt format and definitely not as PDF.

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Decision Letter 1

Jonathan Marchini, Gregory M Cooper

6 Aug 2025

PGENETICS-D-25-00284R1

Rare diseases load through the study of a regional population

PLOS Genetics

Dear Dr. Girard,

Thank you for submitting your manuscript to PLOS Genetics. After careful consideration, we feel that it has merit but does not fully meet PLOS Genetics's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 30 days (by Sep 05 2025 11:59PM). If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosgenetics@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pgenetics/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Jonathan Marchini

Academic Editor

PLOS Genetics

Gregory Cooper

Section Editor

PLOS Genetics

Aimée Dudley

Editor-in-Chief

PLOS Genetics

Anne Goriely

Editor-in-Chief

PLOS Genetics

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I am pleased that the authors have addressed all of my previous comments comprehensively, and the manuscript has been significantly improved as a result.

I have only two minor suggestions that would further enhance the manuscript's clarity, though these are very minor and so do not warrant another round of review:

1. The sentence "Based on the absence of a clear definition for founder variants, we propose here a new definition of a founder variant" would benefit from briefly mentioning what that definition is within the same sentence or immediately following it.

2. The section "Experimental validation of carrier rates" appears to interrupt the flow between figures and might be better positioned in the Supplementary Information to improve the manuscript's organization and readability.

Otherwise, I am happy to recommend this manuscript for publication in PLOS Genetics and believe it offers an important contribution to the field.

Reviewer #2: Brief Summary

I am grateful to the authors for their work responding to the major and minor points I have raised. The revision addresses most prior concerns: the authors have addressed sample-size imbalance by downsampling UQc, improved presentation of data, and clarified parts of their methodology. My outstanding requests described below relate primarily to the authors’ mutation load analyses: their use of the RFD>=10% filter for Figures 1 & 2, and their choice not to add a sensitivity analysis restricted to ClinVar >=2* variants for Figures 2 & 5. For these requests, see major point #4 and minor point #16 below. Additionally, I have suggested several minor text edits and clarifications—see major points #3 and #5, and minor point #6. If these items are addressed, I would consider my comments resolved.

Major Points:

1. Reviewer assessment: Adequately addressed. For Figures 2 & 5, the authors have adequately accounted for the difference in size between SLSJ and UQc by downsampling the imputed UQc individuals to 3,589 (the size of the imputed SLSJ dataset), repeated 1,000 times.

2. Reviewer assessment: Adequately addressed. The authors have provided Supplementary Table 1 in a more readable format.

3. Reviewer assessment: Partially addressed. The authors are correct to observe that differences in mutation load are not primarily explained by differences in RFD or fold-difference, but rather differences in variant frequency. I am satisfied that the differences in variant frequency (or CR) are adequately described later in the manuscript, primarily in Figure 4. However, it would still be good to somehow quantify the observation on lines 104-105, which cites Figure 1. For instance, something like, “of the variants shown in Figure 1, __% had a frequency above X in QcP, whereas __% had a frequency above X in SLSJ.”

4. Reviewer assessment: Not addressed (rationale provided).

10% RFD: The authors have explained that the 10% RFD threshold was used to narrow down the ~200,000 ClinVar P/LP variants, and that including the variants with RFD < 10% would not change their observations about mutation load substantially, because most of them were rare, and/or not present in the imputed data. The “~200,000 ClinVar pathogenic variants” is misleading, however; the current version of Supplementary Figure 2 shows there are ~2,000 total ClinVar P/LP variants present in both QcP and gnomAD—I assume there are more found in just QcP but not gnomAD. It would be more accurate to say that the RFD threshold was used to narrow down to 1,537 (later 1,302 in the imputed data) the initial set of ~2,000 ClinVar P/LP variants found in QcP and gnomAD, rather than the ~200,000 P/LP variants reported in ClinVar. Furthermore, at the allele frequencies relevant here, an RFD of 10% corresponds to very small absolute differences that may be indistinguishable from sampling noise at the QcP and gnomAD sample sizes (see minor point #16 below). The RFD threshold is therefore confusing and difficult to justify, both as a preliminary filter before looking for founder variants, and for observations of mutation load beyond CR/IBD-defined founder variants, as in Figures 1 and 2. Given that the RFD threshold seems arbitrary to analyses of mutation load (the authors have explained that it would not affect their observations about load in SLSJ vs UQc), I would still suggest that these analyses be simply repeated on the entire set of P/LP variants, regardless of RFD, primarily for the sake of conceptual clarity. Otherwise, the authors should clearly explain in the manuscript why an initial RFD >= 10% threshold was necessary, as opposed to simply looking at all rare P/LP variants in QcP, citing for instance counts and MAFs of ClinVar P/LP variants observed in QcP before and after the filter.

ClinVar 1* vs 2* Review Status: The authors have explained that they did not repeat their analyses of mutation load in Figure 2 (or Figure 5) using only P/LP variants with review status >= 2*, because some 1* variants might be clinically relevant, including 11 of the previously-documented variants discussed in the study, and a newly-identified founder variant which was observed alongside the relevant phenotype. Because the aim of their study was to identify variants which might be linked to a founder disease, they did not find it necessary to restrict to >= 2* in any of their analyses. I agree that this rationale makes sense for identifying potentially pathogenic founder variants, however, another purpose of the study (included in its title) is to compare the mutation load of these variants between SLSJ and UQc. It would be informative and straightforward to add a simple sensitivity panel to Figs 2 and 5 restricted to ClinVar >=2* (plus previously documented founders) to show whether the trends persist when limited to variants with stronger evidence of pathogenicity. I also note that the term “mutation load” formally relates to fitness; if you keep this terminology, please define it operationally in terms of your potentially pathogenic count-based proxy and acknowledge limits (penetrance/inheritance/varying evidence of pathogenicity).

5. Reviewer assessment: Partially addressed. The authors have adequately clarified in the supplementary text that the additional 1,000 SLSJ individuals come from an independent cohort and are not a replication of CARTaGENE. The validation of the nine carrier rates is still referenced in lines 120-121; it would be good to very briefly add some clarifying wording here, e.g. “nine carrier rates were assessed in an independent cohort of 1,000 SLSJ individuals in the CIUSSS laboratory, and…” so that readers don’t need to go to the supplemental material to understand this.
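The downsampling described in major point 1 above can be sketched as follows. This is an illustration on synthetic data, not the authors' code: `loads` is a hypothetical list holding, for each individual in the larger cohort, the number of variants of interest carried, and the function averages the resulting carrier-count distribution over repeated random subsamples of the smaller cohort's size.

```python
import random
from collections import Counter


def downsampled_load_distribution(loads, target_n, n_iter=1000, seed=0):
    """Average number of individuals carrying 0, 1, 2, ... variants
    across n_iter random subsamples (without replacement) of size target_n."""
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(n_iter):
        totals.update(rng.sample(loads, target_n))
    return {k: totals[k] / n_iter for k in sorted(totals)}


# Synthetic example: 100 individuals in the larger cohort, subsampled to 50.
example_loads = [0] * 60 + [1] * 30 + [2] * 10
distribution = downsampled_load_distribution(example_loads, target_n=50, n_iter=200)
for n_variants, mean_count in distribution.items():
    print(f"{n_variants} variants: {mean_count:.1f} individuals on average")
```

Averaging over many subsamples keeps the comparison on an equal footing with the smaller cohort while limiting the noise of any single random draw.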

Minor Points:

6. Reviewer assessment: Partially addressed. I appreciate the authors specifying they are concerned with rare Mendelian diseases. In following up on the edit made to line 60, however, I was unable to find any count of more than 10,000 rare diseases in Orphanet, let alone rare Mendelian diseases. At the URL https://www.orpha.net/ as of July 2025 there are 6528 diseases under “Orphanet in numbers”, not all of which are necessarily genetic or Mendelian. The authors’ citation in the previous sentence (line 59, citation 1) is to Haendel et al. 2020, which estimates there are "more than 10,000" rare diseases (not necessarily all genetic or Mendelian), but this was using Mondo, which draws upon Orphanet and other resources like OMIM, rather than just Orphanet. There is a count of Mendelian human diseases in Mondo at the URL https://mondo.monarchinitiative.org/ under “Representation of Disease Types”, which is at 11,566 for Mendelian human diseases as of July 2025. The authors should clarify this and add an appropriate citation.

7. Reviewer assessment: Adequately addressed. The authors have added the requested information to line 138.

8. Reviewer assessment: Adequately addressed. The authors have added a useful estimate that a recessive variant with a CR of 1/200 would result in approximately 7 affected individuals in SLSJ. I still think it would also be useful to include expected count/rate of affected individuals for all founder variants (not for a patient-focus, but because the frequency of a variant has different implications from a population-level depending on its inheritance pattern). However, because of the “already dense content of the tables” and the fact that inheritance patterns may not be readily available for all of the ClinVar P/LP variants reported in this manuscript, I accept that this information may be beyond the scope of the study.

9. Reviewer assessment: Adequately addressed. The authors have fixed the axis labels, and clarified the reason for variants with CR below 1/200 on lines 148-150.

10. Reviewer assessment: Adequately addressed. The authors have added gene names as requested.

11. Reviewer assessment: Adequately addressed. The authors have added phenotypes to what is now Table 2.

12. Reviewer assessment: Adequately addressed. The authors have added the other variant to each compound het case in Table 2.

13. Reviewer assessment: Adequately addressed. The authors have fixed the final count of genotyped samples.

14. Reviewer assessment: Adequately addressed. The authors have clarified in the Methods section that they did not use ClinVar variants with “Conflicting classifications of pathogenicity”.

15. Reviewer assessment: Adequately addressed. The authors have adequately shown in their response to this point that including close relatives (defined by sharing more than 1500cM total IBD) did not substantially affect estimates of carrier rates.

16. Reviewer assessment: Partially addressed. I appreciate the authors having added some conversions between RFD and fold-difference to the methods section. The authors use the RFD >= 10% threshold to indicate that variants are more common in QcP than in gnomAD v3.1.2 NFE genomes. There are 29,353 individuals in the imputed QcP dataset, and I believe there are something like ~33,000 individuals in the gnomAD v3.1.2 NFE genomes. At very low frequencies, an RFD of 10% corresponds to very small absolute differences (e.g., AF 0.005 vs 0.0045 at CR ~1/200). Would a 10% RFD for a variant with CR=1/200 in QcP be distinguishable from sampling noise for the QcP and gnomAD sample sizes? The authors should comment on this in the RFD Methods section.

17. Reviewer assessment: Adequately addressed. The authors have named the two MAF >= 0.05 variants which were removed, on lines 352-353 of the Methods section.

18. Reviewer assessment: Adequately addressed. The authors are correct that they had already addressed in the text (lines 365-366) why no lower than 5 individuals was used for an alternative observed carrier count threshold.

19. Reviewer assessment: Adequately addressed. The authors have clarified that the IBD pairwise sharing analyses were done separately for SLSJ, UQc, and QcP.

20. Reviewer assessment: Adequately addressed. The authors have provided the table in a readable format, and added gene names.
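The sampling-noise question in minor point 16 above can be explored with a pooled two-proportion z-test. This is a rough sketch under several assumptions: the allele frequencies 0.005 and 0.0045 are the reviewer's example values, the cohort sizes are the 29,353 imputed QcP individuals and the reviewer's approximate figure of ~33,000 gnomAD NFE genomes, each individual is treated as contributing two independent alleles, and imputation uncertainty is ignored.

```python
from math import erfc, sqrt


def two_proportion_z(p1: float, n1: int, p2: float, n2: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = erfc(abs(z) / sqrt(2))  # normal approximation
    return z, p_two_sided


# Allele counts: two alleles per individual (assumption: independent draws).
n_alleles_qcp = 2 * 29_353
n_alleles_gnomad = 2 * 33_000  # approximate cohort size quoted by the reviewer

z, p_value = two_proportion_z(0.005, n_alleles_qcp, 0.0045, n_alleles_gnomad)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

For rarer variants the expected allele counts shrink, so the normal approximation becomes less reliable and an exact test (for example, Fisher's exact test on the allele counts) would be the more cautious choice.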

Reviewer #3: The authors addressed all of my comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure resubmission:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Decision Letter 2

Jonathan Marchini, Gregory M Cooper

9 Sep 2025

Dear Dr Girard,

We are pleased to inform you that your manuscript entitled "Rare diseases load through the study of a regional population" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Jonathan Marchini

Academic Editor

PLOS Genetics

Gregory Cooper

Section Editor

PLOS Genetics

Aimée Dudley

Editor-in-Chief

PLOS Genetics

Anne Goriely

Editor-in-Chief

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------


Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-25-00284R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Jonathan Marchini, Gregory M Cooper

PGENETICS-D-25-00284R2

Rare diseases load through the study of a regional population

Dear Dr Girard,

We are pleased to inform you that your manuscript entitled "Rare diseases load through the study of a regional population" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Narmatha Raju, M.Sc

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Description of variants with RFD ≥ 10%.

    (XLSX)

    pgen.1011876.s001.xlsx (194.4KB, xlsx)
    S1 Text

    Table A: Variants previously described in the population not found here.
    Table B: Experimental assessment of CR in an independent cohort of 1,000 individuals living in SLSJ. CIs were calculated on proportions using the online tool https://sample-size.net/confidence-interval-proportion/; * CR calculated using WGS in the present study; ** One-sided 97.5% CI.
    Table C: Case reports on Quebec founder variants.
    Table D: Comparison of imputations of ClinVar rare variants using the TOPMed r2 or the Quebec reference panel. Proportions are based on common SNPs/genotypes, except for missing SNPs, whose proportion is based on the total number of SNPs. False positives are defined as heterozygotes in imputed, but not in WGS data. False negatives are defined as heterozygotes in WGS, but not in imputed data. * In at least one individual. ** SNPs with at least one homozygote switch genotype, in addition to SNPs with more than 10% false positive genotypes (101 of the 106 false positive SNPs), were considered unreliable in the imputed data.
    Fig A: Comparison of the variants’ carrier rates reported in SLSJ and found in our analysis. When available, the aggregated CR (all variants associated with the same disease) was used; also, if available, the CR from the imputed data was used; otherwise, the CR from WGS data was used. Variants from the same disease were grouped as in previous studies.
    Fig B: UMAP of WGS data. UMAPs are coloured according to A) the recruitment region or country of birth and B) the k-means clustering.
    Fig C: UMAP of imputed data. UMAPs are coloured according to A) the recruitment region or continent of birth and B) the k-means clustering. Note that 1,537 pathogenic variants had an RFD ≥ 10% in the WGS data, but only 1,302 of them also had an RFD ≥ 10% in the imputed data.
    Fig D: Correlation between imputed and WGS variants’ frequency in QcP.

    (PDF)

    pgen.1011876.s002.pdf (977.1KB, pdf)
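The false positive and false negative definitions given for Table D of S1 Text can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `count_het_discordance`, the genotype coding (alternate-allele counts 0, 1, 2, with `None` for missing) and the sample identifiers are all assumptions made for illustration. The authors' actual analysis code is in the GitHub repository cited in the Data Availability Statement.

```python
from typing import Dict, Optional

# Hypothetical coding: 0 = homozygous reference, 1 = heterozygous,
# 2 = homozygous alternate, None = missing genotype.
Genotypes = Dict[str, Optional[int]]


def count_het_discordance(wgs: Genotypes, imputed: Genotypes) -> Dict[str, int]:
    """Count heterozygote discordances for one variant.

    False positive: heterozygous in imputed data but not in WGS data.
    False negative: heterozygous in WGS data but not in imputed data.
    Samples with a missing genotype in either dataset are skipped.
    """
    counts = {"false_positive": 0, "false_negative": 0, "compared": 0}
    for sample_id, wgs_gt in wgs.items():
        imp_gt = imputed.get(sample_id)
        if wgs_gt is None or imp_gt is None:
            continue  # not comparable in both datasets
        counts["compared"] += 1
        if imp_gt == 1 and wgs_gt != 1:
            counts["false_positive"] += 1
        elif wgs_gt == 1 and imp_gt != 1:
            counts["false_negative"] += 1
    return counts


# Illustrative (made-up) genotypes for five samples at a single variant.
wgs_example = {"s1": 0, "s2": 1, "s3": 1, "s4": 0, "s5": None}
imputed_example = {"s1": 1, "s2": 1, "s3": 0, "s4": 0, "s5": 1}
result = count_het_discordance(wgs_example, imputed_example)
print(result)
```

In this made-up example, sample s1 counts as a false positive, s3 as a false negative, and s5 is excluded because its WGS genotype is missing.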
    Attachment

    Submitted filename: rebuttal_letter.pdf

    pgen.1011876.s004.pdf (1.3MB, pdf)
    Attachment

    Submitted filename: Michel_2025_rebuttal_2.pdf

    pgen.1011876.s005.pdf (220.1KB, pdf)

    Data Availability Statement

    Quebec genotype, imputed and WGS data are available under restricted access from the CARTaGENE biobank (https://cartagene.qc.ca/en/researchers/access-request.html) due to the informed consent given by study participants. The code and data used for this study can be found in the following GitHub repository: https://github.com/Genopop/Figures-founder-variants-article.


    Articles from PLOS Genetics are provided here courtesy of PLOS
