Summary
The secreted mucins MUC5AC and MUC5B are large glycoproteins that play critical defensive roles in pathogen entrapment and mucociliary clearance. Their respective genes contain polymorphic and degenerate protein-coding variable number tandem repeats (VNTRs) that make the loci difficult to investigate with short reads. We characterize the structural diversity of MUC5AC and MUC5B by long-read sequencing and assembly of 206 human and 20 nonhuman primate (NHP) haplotypes. We find that human MUC5B is largely invariant (5,761–5,762 amino acids [aa]); however, seven haplotypes have expanded VNTRs (6,291–7,019 aa). In contrast, 30 allelic variants of MUC5AC encode 16 distinct proteins (5,249–6,325 aa) with cysteine-rich domain and VNTR copy-number variation. We group MUC5AC alleles into three phylogenetic clades: H1 (46%, ∼5,654 aa), H2 (33%, ∼5,742 aa), and H3 (7%, ∼6,325 aa). The two most common human MUC5AC variants are smaller than NHP gene models, suggesting a reduction in protein length during recent human evolution. Linkage disequilibrium and Tajima’s D analyses reveal that East Asians carry exceptionally large blocks with an excess of rare variation (p < 0.05) at MUC5AC. To validate this result, we use Locityper for genotyping MUC5AC haplogroups in 2,600 unrelated samples from the 1000 Genomes Project. We observe a signature of positive selection in H1 among East Asians and a depletion of the likely ancestral haplogroup (H3). In Europeans, H3 alleles show an excess of common variation and deviate from Hardy-Weinberg equilibrium (p < 0.05), consistent with heterozygote advantage and balancing selection. This study provides a generalizable strategy to characterize complex protein-coding VNTRs for improved disease associations.
Graphical abstract
MUC5AC and MUC5B are polymorphic loci that are difficult to sequence due to coding variable number tandem repeats (VNTRs). Using long-read sequencing, we characterized genetic diversity in these genes across human populations, detected signatures of selection, and developed genotyping strategies for disease associations.
Introduction
Mucosal linings serve a dynamic role at the interface between internal tissues and the external environment. In the lumen of the lungs, epithelial cells provide defensive functionalities through mucociliary clearance, a mechanism in which mucus traps inhaled pathogens for mechanical removal.1 The mucins MUC5AC and MUC5B are major components of mucus that contribute to its barrier function and act as receptor decoys for pathogens, such as the influenza virus that binds directly to mucin sialic acids.2 These polymeric glycoproteins thus provide a critical innate immunological role in defending the airways against environmental insults; however, they have also been implicated in the pathogenicity of muco-obstructive airway diseases like asthma and cystic fibrosis.3
Despite their fundamental roles in maintaining epithelial homeostasis, MUC5AC and MUC5B sequence variation remains poorly understood. The challenge in assessing these loci is that they harbor large central exons (60%–80% of total coding sequence) composed of variable number tandem repeats (VNTRs). These VNTRs encode numerous serine and threonine residues that are decorated with sialic acid, a terminal sugar moiety that is bound by the glycoproteins of some viral pathogens.2,4 Limitations of short-read sequencing in assembling these repetitive loci have hindered efforts to accurately resolve copy-number variation.5,6 VNTR structural variants may affect the functional ability of mucins to act as barriers to pathogens and change their biophysical properties; therefore, it is critical that the sequences of these loci are characterized in many human genomes to discover the common patterns of variation directly affecting protein function.
Long-read sequencing technologies allow for the characterization of MUC5AC and MUC5B with haplotype-level resolution. Previously, gene references for both loci were constructed using Pacific Biosciences (PacBio) single-molecule, real-time (SMRT) sequencing from a limited number of humans. Four genome assemblies were used to characterize three distinct MUC5AC haplotypes for VNTR structural variation.7 However, analyses of MUC5AC allele sizes via Southern blot suggest a much greater extent of human diversity.8 Many additional human genomes have recently been sequenced with more accurate high-fidelity (HiFi) circular consensus sequencing (CCS) as part of the Human Genome Structural Variation Consortium (HGSVC)9 and the Human Pangenome Reference Consortium (HPRC).10 Here, we leverage the large-scale sequencing efforts of the HGSVC and HPRC to explore common patterns of genetic variation in MUC5AC and MUC5B, specifically within the VNTR portion of the molecule. Using 206 diverse human haplotypes assembled with high-quality PacBio HiFi CCS reads, we characterize the genetic diversity of these loci in different human populations. We compare the human alleles of MUC5AC and MUC5B to that of five nonhuman primate (NHP) species (chimpanzee, bonobo, gorilla, orangutan, and gibbon) to distinguish human-specific patterns of variation. Finally, we explore methods to genotype these loci using haplotype tagging single-nucleotide polymorphisms (tSNPs) and a structural variant genotyping tool. These results provide a comprehensive view of VNTR variation and evolution in MUC5AC and MUC5B and outline a path forward for improved disease association studies.
Material and methods
Long-read sequence assembly and QC
Whole-genome assemblies from 104 HGSVC9 (n = 57) and HPRC10 (n = 47) samples were leveraged for MUC5AC and MUC5B variant discovery. The genome sequence data for both cohorts are consented for open access with no data use restrictions. These genomes include 49 Africans (AFR), 23 Admixed Americans (AMR), 14 East Asians (EAS), 10 Europeans (EUR), and 8 South Asians (SAS; Tables S1 and S2); these geographic population descriptors were defined previously by the HGSVC and HPRC. Sequencing for both cohorts was conducted using PacBio HiFi CCS. Average HPRC sequencing coverage was 42× (minimum = 31×) and average HPRC read N50 was 19.7 kbp (minimum = 13.5 kbp). Average HGSVC sequencing coverage was comparable at 40× (minimum = 25×) and average read N50 was 17.2 kbp (minimum = 10.0 kbp). The HPRC genome assembly was performed by Liao et al.10 using trio-hifiasm11 (maternal and paternal short reads used in haplotype phasing). We assembled 54 HGSVC samples using hifiasm version 0.16.111 (pseudo-haplotype resolved phasing). For the remaining three HGSVC samples with trio information (HG00514, HG03125, NA12878), we used paternal and maternal short reads with yak v.0.1 (https://github.com/lh3/yak) to create k-mer databases for contig phasing in the child’s assembly with hifiasm v.0.15.111 (see Ebert et al.9 for parental short-read information). The average HPRC haplotype assembly N50 was 40.8 Mbp (minimum = 17.4 Mbp) and average HGSVC haplotype assembly N50 was 55.2 Mbp (minimum = 14.1 Mbp). Regional assembly contiguity and reliability for the MUC5AC/MUC5B locus was assessed using the flagger pipeline10 and Nucfreq, a method to detect potential misassemblies and collapses in phased haplotypes.12 We also inspected for assembly misalignments using SafFire (https://github.com/mrvollger/SafFire).
We assessed 10 total NHP genome assemblies for chimpanzee (n = 2), bonobo (n = 2), gorilla (n = 2), Sumatran orangutan (n = 2), Bornean orangutan (n = 1), and Siamang gibbon (n = 1; Table S3). Specifically, these included PTR1 (Central chimpanzee, Clint), PPA1 (bonobo, Mhudiblu), GGO1 (Western gorilla, Kamilah), and PAB1 (Sumatran orangutan, Susie) and were assembled with hifiasm v.0.15.1.13 All other NHP assemblies were generated as part of the Primate T2T (telomere-to-telomere) Consortium, and assemblies were downloaded from GenomeArk14; these include PTR2 (Central chimpanzee, AG18354), PPA2 (bonobo, PR00251), GGO2 (Western gorilla, Jim), PAB2 (Sumatran orangutan, AG06213), PPY1 (Bornean orangutan, AG05252), and SSY (Siamang gibbon, Jambi). The assemblies were constructed using both high-coverage PacBio HiFi CCS reads and ultra-long (UL) Oxford Nanopore Technologies (ONT) reads via the Verkko 2.0 assembler.15 Information about assembly quality and validation can be found in Mao et al.13 and Makova et al.14 We inspected the MUC5AC/MUC5B regional assembly contiguity using SafFire in the same manner as the HGSVC assemblies.
Sequence extractions and phylogenetic analyses
HPRC, HGSVC, and NHP phased genome assemblies were aligned to CHM1316 using minimap2 v.2.2417 with CIGAR string inclusion, full-genome alignment divergence less than 10%, secondary alignments suppressed, and a minimal peak all-versus-all alignment score of 25,000. Coordinates for a specific locus in individual haplotype assemblies were identified using rustybam v.0.1.29 (https://github.com/mrvollger/rustybam), and sequences were extracted using seqtk v.1.3 (https://github.com/lh3/seqtk). Exon and intron boundaries were defined based on human GENCODE V3518 annotations in CHM1316 (GENCODE: MUC5AC-201, GENCODE: MUC5B-204). Intronic and intergenic sequences used to construct phylogenies were selected in a recombination-aware manner based on UCSC Genome Browser 1000 Genomes Project (1KG) linkage disequilibrium (LD) structure annotations.19 A multiple sequence alignment (MSA) was conducted using MAFFT v.7.48720 with global pairwise alignment and 100 iterations, followed by visual inspection of alignment quality using Jalview v.9.0.5.21 Segments of the MSA determined to be misaligned were identified and eliminated manually. Maximum-likelihood tree calculations were performed using iqtree v.1.6.1222 with automatic model selection and 1,000 bootstraps. All phylogenetic trees in figures were constructed using ggtree v.3.2.123 in R v.1.4.2 (https://www.R-project.org). Haplogroup coalescence times were estimated with iqtree224 based on estimated chimpanzee divergence (6.4 million years ago [mya]).25
Gene and protein domain/VNTR motif annotations
Computational protein prediction for all human and NHP haplotypes was conducted via the same alignment pipeline as phylogeny construction based on human exon annotations from CHM13.16 We predicted translated exons using the ExPasy tool in EMBOSS v.6.6.0.26 For computational protein predictions that were complete (i.e., complete open reading frame [ORF], no truncations), protein domain annotations were manually curated using cys domain and VNTR domain sequences previously annotated by Guo et al.7 Protein groups (P1–P6) were defined for MUC5AC as containing more than one haplotype and variation in cys domain copy number, tandem repeat domain copy number, and/or repeat motif copy-number variation in homologous VNTR domains. Protein groups for MUC5B were similarly defined; however, the inclusion criteria of harboring more than one haplotype per group was dismissed due to protein sequence length variation in three singletons for MUC5B (P1, P4, P5). We characterized motif variation across individual VNTR domains for human MUC5AC and MUC5B based on previously published consensus motif sizes (24bp/8 amino acids [aa] for MUC5AC,27 87bp/29aa for MUC5B28). Heatmaps of motif usage for all haplotypes of MUC5AC and MUC5B were constructed using a custom R script that included normalization on total VNTR sequence space (motif counts/total number of motifs) to account for length variability, normalization within motifs, and hierarchical clustering (unweighted pair group method of arithmetic mean [UPGMA] clustering29) of haplotypes and motifs for group visualization. Similarly, motif diagrams in linear sequence space were constructed using a custom R script that designated a unique color to each distinct motif and clustered unique alleles by row using UPGMA.
NHP allele alignments and intronic VNTR analysis
We generated all-versus-all alignments between the most common haplotypes of MUC5AC and MUC5B in humans and NHPs using minimap217 with the same parameters as phylogenetic analyses. Tiled alignment plots for each locus were constructed using SVbyEye v.0.99.0 (https://github.com/daewoooo/SVbyEye) in R v.4.3.1 with a bin size of 10,000 bp and custom percent identity breaks. VNTR sequences in intron 15 and ∼3 kbp before the start codon of MUC5AC were curated using tandem repeats finder v.4.1030 with the following parameters: match = 2, mismatch = 7, delta = 7, percent match (PM) = 80, percent indels (PI) = 10, minimum alignment score = 50, and max period size = 30. Detection of H3 k-mers for the intronic VNTR was conducted using STREME from the MEME suite of motif-based sequence analysis tools v.5.5.4.31
LD block structure and selection detection analyses
Illumina whole-genome sequencing (WGS) data from the most recent high-coverage (30×) 1KG release19 were used to assess the LD structure of the MUC5AC/MUC5B locus. These data include open-access WGS from 2,600 unrelated individuals: 691 AFR, 526 EUR, 514 SAS, 515 EAS, and 354 American genomes. Population identifier acronyms are consistent with 1KG nomenclature and include the following: ACB = African Caribbean in Barbados, GWD = Gambian in Western Division, ESN = Esan in Nigeria, MSL = Mende in Sierra Leone, YRI = Yoruba in Nigeria, LWK = Luhya in Kenya, ASW = Americans of African Ancestry in SW USA, PUR = Puerto Rican in Puerto Rico, CLM = Colombian in Colombia, PEL = Peruvian in Peru, MXL = Mexican Ancestry in Los Angeles USA, GBR = British in England and Scotland, FIN = Finnish in Finland, IBS = Iberian in Spain, CEU = Utah residents (CEPH) with Northern/Western European ancestry, TSI = Toscani in Italy, PJL = Punjabi in Pakistan, BEB = Bengali in Bangladesh, STU = Sri Lankan in the UK, ITU = Indian Telugu in the UK, GIH = Gujarati from Houston TX USA, CHS = Southern Han Chinese, CDX = Chinese Dai in China, KHV = Vietnamese in Vietnam, CHB = Han Chinese in Beijing, China, JPT = Japanese in Japan. LDBlockShow v.1.4032 was used to construct LD plots based on D′33 for all SNPs in the MUC5AC/MUC5B region (GRCh38 coordinates, chr11:1,117,952–1,272,172). Autosome-wide LD block calculations were estimated with the PLINK v.1.934 blocks parameter, which estimates haplotype blocks based on definitions described by Gabriel et al.35 (the region of chromosome 11 that harbors MUC5AC and MUC5B features a high recombination rate).36 Calculations were limited to SNPs with a minor allele frequency greater than 5%, those with 75% or higher genotyping rate, and those in Hardy-Weinberg equilibrium. To assess whether the region of chromosome 11 containing MUC5AC and MUC5B shows signatures of selection, Tajima’s D37 analysis was conducted using the phased 1KG cohort of samples. Chromosomes were partitioned into 10 kbp bins with filtering for bins that contained at least 10 SNPs. Tajima’s D statistics were computed for bins using PLINK v.1.9,34 and regions harboring signatures of either positive or balancing selection were based on the 90th and 95th percentiles of values in the super population autosome-wide distributions (negative Tajima’s D is suggestive of positive selection, positive Tajima’s D is suggestive of balancing selection). Permutation testing with multiple-test correction was performed by randomly sampling 10,000 10 kbp nonoverlapping bins to produce a null distribution of Tajima’s D values. This process was repeated 10,000× to produce a distribution of Tajima’s D scores corresponding to the bottom and top 5th percentiles. p values per bins were calculated based on the empirical ranking of Tajima’s D scores relative to this final distribution and were corrected for multiple testing (considering both number of bins and populations tested).
tSNPs and mapping of disease-relevant GWAS SNPs
To uncover SNPs in significant LD with VNTR haplogroups of MUC5AC and MUC5B, phylogenetic haplogroups from the HGSVC/HPRC genomes were encoded as biallelic SNPs. Calculation of squared correlations between these variants encoding haplogroup identity and all SNPs within 50 kbp of the loci were performed using PLINK v.1.9.34 Genome-wide association study (GWAS) risk alleles for MUC5AC and the phenotypes of asthma/allergy and infection-induced pneumonia/meningitis were mined through the GWAS catalog.38 Variants were included in subsequent LD analysis if they had a reported p value of 1 × 10−9 or smaller for the phenotype association, had the nucleotide annotation for the risk allele, and were unambiguously mapped to the HPRC/HGSVC genomes. The final set of variants included six SNPs from six GWASs (rs35225972,39 rs11245962,40 rs28415845,41 rs11245979,42 rs28737416,43 and rs2872951644). Squared correlation values were calculated in the same manner as tSNP discovery.
Genotyping of MUC5AC haplogroups in 1KG populations using Locityper
MUC5AC/MUC5B genotyping was performed with Locityper v.0.10.945 and its dependencies SAMtools v.1.19,46 jellyfish v.2.3.0,47 and strobealign v.0.11.0.48 Diploid genomes from the HGSVC/HPRC sample set were included as alleles in the reference panel if they were complete for the MUC5AC/MUC5B locus (no assembly breaks or alignment ambiguities), annotated for both haplogroups, and had accessible high-quality short reads through the 1KG dataset. The final set of genomes that constituted the reference panel included 99 genomes (i.e., 198 haplotypes) for MUC5AC and MUC5B.
The CHM13 reference genome16 was used for all Locityper analyses, with gene coordinates set to chr11:1,227,366–1,274,380 and chr11:1,292,367–1,334,784 for MUC5AC and MUC5B, respectively. For leave-one-out analyses, the target sample for genotyping was excluded from database construction, and the highest alignment accuracy level was used. All other options for database construction, sequencing dataset preprocessing, and genotyping were set to default. Genotyping accuracy was determined based on edit distance (alignment differences) between the real and retrieved genotypes during leave-one-out analysis and compared to the closest “available” genotype (smallest edit distance between true genotype and all possible diploid combinations of alleles in the reference panel). Computation of edit distances between alleles in the leave-one-out concordance analysis was performed using the Locityper helper script “gt_dist.py.”
MUC5AC and MUC5B phenome-wide association studies (PheWASs) in All of Us
Data from the All of Us Research Program49 controlled tier database were analyzed for a phenome-wide association study (PheWAS) with the MUC5B promoter polymorphism rs3570595050 and tSNPs for the major haplogroups of MUC5AC variants. All participants in the All of Us program provided electronic informed consent,49 and the NIH All of Us IRB Operations Office determined this does not constitute research involving human subjects. As of January 2024, this cohort included ∼245,400 individuals with short-read WGS data, of which ∼185,000 were unrelated, annotated for age/sex, and had paired electronic health record (EHR) data (reported as International Classification of Diseases [ICD] codes). These individuals were categorized previously by All of Us for genetic ancestry using principal component analysis. We surveyed samples from African, European, East Asian, Admixed American, and Middle Eastern ancestries for MUC5B rs35705950 and tSNPs in high LD with MUC5AC haplogroups H1 (rs2075842, rs1132433, rs1132434, rs28652890, rs879136008), H2 (rs1015856541, rs28519516, rs28558973, rs28368633), and H3 (rs36154966, rs1004828576, rs940158763, rs36151150, rs36132281, rs35779873). We only included samples with genome quality scores ≥20 at individual loci; therefore, the final sample sets included ∼32,500 AFR, ∼3,200 EAS, ∼2,000 SAS, ∼98,600 EUR, ∼28,200 ADM, and ∼650 individuals of Middle Eastern ancestry, totaling ∼165,150 individuals (exact number of individuals varied between locus associations in respective populations; Tables S7–S21). We included both ICD-9 and ICD-10 phenotype codes from patient EHRs and samples with male/female self-reported biological sex aged 20 years or older.
PheWAS analysis was performed using the R package PheWAS as outlined in Bick et al.49 The package translated ICD-10 codes to ICD-9 and calculated case and control genotype distributions, allelic p value, and allelic odds ratio (OR) for each condition. A minimum count of two related codes was used to determine whether a phenotype was sufficiently represented in the health data for association. Sex at birth, age at sample collection, and principal component analyses 1–3 were used as covariates. The aggregate.fun function was used to correct for duplicates in the EHR. Nominal p was set to <2.7E−5 (p adjusted < 0.05 after Bonferroni correction) for phenotype associations with rs35705950 and MUC5AC tSNP alleles in the dataset.
Results
MUC5AC/MUC5B assembly and QC
We performed targeted assessment of a ∼160 kbp region of chromosome 11 spanning MUC5AC and MUC5B from 104 human genomes, including 47 genomes from the HPRC and 57 from the HGSVC where long-read sequencing data had recently been generated and made publicly available.9,10 We generated phased genome assemblies from HGSVC samples using the same computational pipeline used for the generation of HPRC assemblies (Material and methods) from HiFi PacBio sequencing data. The combined sample set includes 49 AFR, 23 ADM, 14 EAS, 10 EUR, and 8 SAS (Material and methods, Tables S1 and S2). Next, we applied the flagger10 and Nucfreq12 computational pipelines to detect collapses or misassemblies across the 160 kbp target region. Of the 208 total human haplotypes, 206 (99%) were correctly assembled without gaps, breaks, or misjoins in the MUC5AC/MUC5B region. Two haplotypes (one each) from samples HG01114 and HG02509 were fragmented and excluded from further analyses.
For comparative evolutionary purposes, we analyzed 10 individuals from six NHP species for which HiFi sequencing data have recently been generated13,14 (Material and methods, Table S3). In the NHP genomes, all MUC5AC loci passed quality control (QC) with no ambiguous alignments to CHM13; in contrast, one gorilla haplotype (Kamila h2) and both haplotypes of a Sumatran orangutan (Susie h1 and h2) failed MUC5B QC and were excluded from further analyses.
Human MUC5AC genetic and protein diversity
To understand human genetic diversity in MUC5AC, we first constructed a phylogeny centered around the gene model. We extracted 26.5 kbp of noncoding sequence flanking MUC5AC exons for the 206 human haplotypes and generated a maximum likelihood phylogenetic tree using chimpanzee as an outgroup. Human alleles were grouped into three distinct haplogroups or clades (Figure 1A), namely H1 (n = 103), H2 (n = 78), and H3 (n = 25). H1 is the most phylogenetically distinct (100% bootstrap support), is reduced in frequency among AFR genomes (p = 4 × 10−3 comparing H1 to H2/H3 frequencies via chi-square, Figure 1B), and is estimated to have arisen most recently. We estimate an H1 coalescent of ∼120,000 years ago when compared to H2 or H3 (∼330,000 years ago).
Next, we predicted a protein model associated with each human haplotype (Material and methods). We identified 16 distinct MUC5AC protein variants with extensive length variation (Figure 1C). The three most common protein variants, 5,654 aa/96 haplotypes, 5,742 aa/67 haplotypes, and 6,325 aa/15 haplotypes (Figure 1C), project onto the phylogenetic haplogroup designations H1, H2, and H3, respectively. There is, however, additional variation not immediately apparent from the phylogeny that is uncovered by detailed protein sequence curation. Guo et al.7 classified protein variants into three groups (P2, P3, and P6) based on MUC5AC domain annotations. We extend this classification by identifying three additional protein variant groups (P1, P4, and P5) based on VNTR domain, cys domain, and VNTR motif copy-number variation.
Most MUC5AC protein variants harbor four distinct tandem repeat domains (P1–3, P5); however, two groups (P4, P6) harbor an additional central domain with 11 copies of the 8-mer repeat motif. P5/6 variants harbor additional type 2 and type 3 cys domains, while P1 variants harbor a novel deletion of these domains. VNTR motif copy-number variation is also extensive in the first and last domains across variant groups.
We characterized the composition of the MUC5AC VNTR 8-mer repeat because the density of glycosylated serines and threonines is critical for mucin barrier function. We find that the absolute count of serine and threonine residues across the VNTR domains is positively correlated with protein length (Figure 1D); however, when normalized for the total length of the VNTR, the two shortest protein variant groups (P1 and P2) harbor the highest concentration of serines and threonines (Figure 1E). There are a remarkable 211 unique 24-mers (nucleotides) and 130 unique protein 8-mer motifs (aa) diversifying the degenerate VNTR domains; motif changes, however, are constrained, with most harboring the pattern of TTSTTS in the first six aa (Figures 1F and S1). The preferential use of threonines is likely a consequence of the higher propensity for threonines to harbor O-glycans,51 thereby facilitating MUC5AC barrier functionality. Furthermore, the high incidence of prolines likely contributes to the glycosylation potential of nearby serines/threonines by exposing these residues in a β-turn conformation.52
Of the 130 unique protein 8-mers for MUC5AC, only nine are unique to a single haplotype, indicating that most motif variation is shared between protein isoforms. There are distinctive modules of motifs that cluster together in frequency of usage for protein groups 2, 3, and 6 (Figures 1G and S2). Most motif variation is due to single nonsynonymous aa changes between haplotypes; however, there are instances where entire motifs have been gained or lost. Overall, there is extensive cys domain copy number, VNTR copy number, and VNTR motif usage variation in the large central exon of MUC5AC across human haplotypes.
Human MUC5B genetic and protein diversity
Similarly, we analyzed the MUC5B locus and observed far less genetic and protein variability compared to MUC5AC. A maximum likelihood phylogenetic tree (24.6 kbp intronic sequence using chimpanzee as an outgroup) distinguishes two distinct human haplotypes with 100% bootstrap support (Figure 2A). The most common haplogroup H2 was identified in 82% (169/206) of assembled haplotypes and is estimated to have emerged ∼770,000 years ago. The less abundant H1 (18%) variants predictably arose more recently (∼407,000 years ago). While H2 is found across all continental populations, H1 shows a notable reduction in East Asians (Figure 2B). At the protein level, we predict a complete ORF for 92% (190/206) of haplotypes and a premature stop codon for ∼8% (16/206, Figure 2C). We hypothesize that these haplotypes harbor assembly artifacts due to the homopolymer runs within the MUC5B VNTR. To test this, we reassembled eight of the samples where both ONT and HiFi sequence data were available15 and recovered the ORF for three. These haplotypes harbored predicted protein lengths consistent with P3 (5,762 aa).
Among the 190 haplotypes with complete ORFs, 87% predict proteins with the canonical MUC5B length of 5,762 aa (P3). The second most abundant, P2, differs in length by one aa (5,761 aa) and represents 9% of protein isoforms. These findings support the long-standing belief that MUC5B is less variable than MUC5AC.54 Our deeper survey, however, suggests that the locus is not invariant. We identify seven haplotypes (3.7%, 7/190 complete proteins) where the protein is predicted to have elongated (6,291–7,019 aa, P4–P6) due to expansion of the VNTR domains. Five of these variants harbor seven VNTR domains with an excess of ∼800 aa of tandem repeat sequence and two additional cys domains. Unlike MUC5AC, there is no variation in cys domain copy number preceding the first tandem repeat domain in MUC5B. All elongated variants were found exclusively in individuals of African descent; therefore, much like MUC5AC, the ancestral state of this locus may have been longer.
The novel VNTR domains associated with P4–P6 are most like TR3 and TR4 in repeat copy number and motif composition (Figure S3), suggesting that the acquisition of new tandem repeat domains has been accomplished via duplication of the central domains in MUC5B, rather than from the first and last domains. While the largest MUC5B protein isoform (P6) has increased in size due to VNTR expansion, it is interesting that serine and threonine abundance is comparable to that of the canonical forms (P1–P4) (Figures 2D and 2E). Like MUC5AC, threonine is favored across the irregular MUC5B repeat motif (Figure 2F). Even though there are fewer distinct MUC5B protein variants, there are 191 unique 29-mers used across the haplotypes (Figures 2G and S4). Unlike MUC5AC, there appear to be no gains or losses of whole motifs, and the frequency of motif usage is largely conserved across the haplotypes (Figure 2B).
NHP variation in MUC5AC and MUC5B
We reconstructed the evolutionary histories of MUC5AC and MUC5B by identifying orthologous loci from NHP genomes,13,14 including chimpanzee (n = 2), bonobo (n = 2), gorilla (n = 2), orangutan (n = 3, Sumatran and Bornean species), and Siamang gibbon (Figure 3; Table S3). All NHP haplotypes (n = 22) predicted a complete ORF at the MUC5AC locus, consistent with the human exon structure. Chimpanzee and bonobo alleles display variation in the number of cys domains preceding the first tandem repeat domain (Figure 3A). The Asian apes, orangutan, and gibbon carry the longest predicted proteins, with the most common protein variant in orangutan approximately 1,500 aa longer than the human H3 variant. All NHP variants were longer than the two most common human variants (H1 and H2), ranging in size from 6,243–7,887 aa, due solely to exon 31 length variation (Figure 3B). This suggests there has been a reduction of the VNTR length in humans (Figures 3B and 3C).
Additionally, we characterized two noncoding VNTRs associated with the MUC5AC locus—an 8-mer VNTR in intron 15 of MUC5AC (Figures 3C and S5, and Note S1) and an 8-mer VNTR approximately 1–3 kbp in size mapping upstream of the MUC5AC start codon (Figure S6). Based on ENCODE H3K27 mapping data,18 the latter region corresponds to a potential enhancer. Diminished copy number of the enhancer VNTR is associated with decreased MUC5AC expression55 and susceptibility to severe gastric cancer.56 We find complete enrichment of shorter variants (less than 1,500 bp in length) in East Asian H1 haplotypes and an excess of long variants (greater than 2,000 bp) in African haplotypes (X2 = 87.4, p < 0.001), suggesting a founder effect or selection among East Asians consistent with their eastward expansion that could result in population-specific differential expression of H1. Additionally, all NHP haplotypes feature lengths of 881–1,649 bp for this enhancer VNTR (shortest in orangutan and longest in chimpanzee).
Despite enhanced conservation in humans, there is extensive length variation among the protein-coding MUC5B variants among great apes. Only orangutan and gibbon haplotypes harbor an additional cys domain that is distinctive from the other three cys domain types (Figure 3D), which we classify as a type V domain. Like MUC5AC, orangutans carry the largest MUC5B VNTR domains (84–89 copies of the 29-mer). Excluding one haplotype from the Siamang gibbon, human alleles of MUC5B generally harbor shorter central exons with fewer VNTR total motifs compared to the NHP haplotypes (Figure 3E) and little structural variation outside of the central exon (Figure 3F).
MUC5AC LD block structure and potential positive selection in East Asian populations
We next investigated LD patterns among different human continental groups using D′ at the MUC5AC/MUC5B locus. A predominant single LD block corresponded to most of the MUC5AC protein-coding genes (Figure 4A) in the non-African populations. We tested by simulation (Figure 4B) whether LD block sizes were significantly larger than the genome-wide distributions because extended LD is a signature of positive selection. When compared to population-specific distributions of LD block sizes in the 1KG dataset,19 MUC5AC blocks are large (top 5% distribution) in EAS (n = 585) and Americans (n = 490) relative to AFR (n = 893), EUR (n = 633), and SAS (n = 601, Figure 4B).
To further test for positive selection, we calculated Tajima’s D37 for 10 kbp segments spanning MUC5AC and MUC5B in the 1KG sample set. We find a significant excess of rare variants in four bins within MUC5B for Africans and one bin for East Asians, consistent with positive selection (Table 1). Repeating the analysis for MUC5AC, only one population group (East Asians, Table 2) shows a significantly negative Tajima’s D value corresponding to the 10 kbp segment preceding the VNTR. East Asians are the only population with both an excess of rare variants and an abnormally large block of LD for MUC5AC, thereby providing more compelling evidence of positive selection.
Table 1.
GRCh38 chromosome 11 bin | ||||||||
---|---|---|---|---|---|---|---|---|
Population | Gene | – | 1,220,000 | 1,230,000 | 1,240,000a | 1,250,000a | 1,260,000 | 1,270,000 |
AFR | MUC5B | Tajima’s D | −1.09 | −1.47 | −1.84c,d | −1.87c,d | −1.94c,d | −2.10c,d |
# SNPs | 180 | 162 | 444 | 279 | 209 | 248 | ||
EUR | MUC5B | Tajima’s D | −0.44 | −0.42 | −1.49 | −1.39 | −1.56 | −1.60 |
# SNPs | 122 | 81 | 356 | 191 | 73 | 103 | ||
SAS | MUC5B | Tajima’s D | −0.55 | −0.96 | −1.76b | −1.77b | −1.59 | −1.65 |
# SNPs | 140 | 100 | 328 | 188 | 85 | 102 | ||
EAS | MUC5B | Tajima’s D | −0.25 | −0.49 | −1.58 | −1.69b | −2.13c,d | −1.91b |
# SNPs | 109 | 62 | 189 | 109 | 79 | 83 | ||
AMR | MUC5B | Tajima’s D | −0.46 | −1.22 | −1.84 | −1.87b | −1.95b | −1.90b |
# SNPs | 130 | 110 | 353 | 201 | 108 | 122 |
Bin sizes of 10 kbp were used to compare values to the autosome-wide distribution per population in the 1KG cohort.19
Corresponds to bin containing VNTR sequence.
Bottom 10% of autosome-wide Tajima’s D values.
Bottom 5% of autosome-wide Tajima’s D values.
Significant at α = 0.05 after permutation testing.
Table 2.
GRCh38 chromosome 11 bin | |||||||||
---|---|---|---|---|---|---|---|---|---|
Population | Gene | – | 1,150,000 | 1,160,000 | 1,170,000 | 1,180,000a | 1,190,000a | 1,200,000 | 1,210,000 |
AFR | MUC5AC | Tajima’s D | −1.06 | −1.57 | −1.37 | −1.25 | −0.98 | −0.57 | −1.23 |
# SNPs | 213 | 268 | 428 | 259 | 241 | 226 | 166 | ||
EUR | MUC5AC | Tajima’s D | −0.48 | −1.37 | −0.85 | −0.54 | 0.17 | 0.52 | −0.66 |
# SNPs | 116 | 193 | 304 | 186 | 148 | 110 | 82 | ||
SAS | MUC5AC | Tajima’s D | −0.70 | −1.52 | −1.63 | −1.46 | −0.71 | −0.28 | −0.87 |
# SNPs | 134 | 191 | 312 | 194 | 154 | 132 | 92 | ||
EAS | MUC5AC | Tajima’s D | −0.62 | −1.74b | −2.04c,d | −1.70b | −0.94 | −0.31 | −1.28 |
# SNPs | 105 | 173 | 292 | 152 | 127 | 100 | 85 | ||
AMR | MUC5AC | Tajima’s D | −0.79 | −1.54 | −1.39 | −1.26 | −0.93 | −0.90 | −1.29 |
# SNPs | 140 | 196 | 300 | 192 | 183 | 170 | 109 |
Bin sizes of 10 kbp were used to compare values to the autosome-wide distribution per population in the 1KG cohort.19
Corresponds to bin containing VNTR sequence.
Bottom 10% of autosome-wide Tajima’s D values.
Bottom 5% of autosome-wide Tajima’s D values.
Significant at α = 0.05 after permutation testing.
tSNP discovery and short-read genotyping using Locityper
We next searched for tSNPs in high LD with VNTR haplogroups for the imputation of structural variants in short-read WGS datasets. To discover tSNPs, we encoded H1, H2, and H3 as biallelic variants and tested for correlation (r2) with all SNPs within 10 kbp of the MUC5AC start and stop sites (VNTR excluded). At a threshold of r2 > 0.85, we discovered 35 tSNPs for H1 (max r2 = 0.92), 5 tSNPs for H2 (max r2 = 0.89), and 52 tSNPs for H3 (max r2 = 1, Table S4). tSNPs for H3 are in low LD with H1/H2 and make excellent imputation candidates for this group of variants (average H1/H2 r2 = 0.10). We found one tSNP distinguishing H1 and H2 of MUC5B that met our stringent criteria (in GRCh38, chr11:1,244,757; H1 r2 = 0.0026 vs. H2 r2 = 1).
Next, we applied Locityper—a tool designed to genotype complex, multi-allelic loci like MUC5AC/MUC5B—to WGS datasets.45 Given a collection of high-quality reference alleles, Locityper predicts the best pair of alleles for an unknown sample by examining read alignments and read-depth profiles across all allele pairs. Locityper has a short runtime that allows thousands of genomes to be rapidly characterized. We tested the accuracy of Locityper in predicting haplogroup identities for MUC5AC and MUC5B in the HPRC/HGSVC genomes by performing leave-one-out experiments. For MUC5AC, we estimated a genotyping accuracy of 95% for full diploid genotyping (both haplogroups correct) and 97.5% concordance for partial genotyping (one haplogroup correct; Material and methods, Figure 5A; Table S5). For MUC5B, genotyping showed 100% accuracy in predicting the correct haplotype based on leave-one-out experiments (Figure S7; Table S5). Predictably, Locityper was less accurate in identifying protein isoforms due to homoplasy. For example, 91% and 81% of samples were correctly assigned to protein subgroups for MUC5AC and MUC5B, respectively (Table S5). A larger sampling of reference haplotypes will improve future genotyping with this tool.
Next, we genotyped all 2,600 unrelated genomes from the 1KG with high-coverage short-read Illumina WGS.19 We compared concordances for two high-confidence tSNPs with Locityper predictions for MUC5AC (H1 vs. H2/H3 tSNP: rs28542750, H3 vs. H1/H2 tSNP: rs769768817; Table S4). We found high concordance between the two methods, with 91% (n = 2,359) of the genomes yielding complete concordance with both haplotypes. For the remaining ∼9% (n = 241/2,600), most were discordant for only one of the two haplotypes (92%, n = 222/241) and differed for classification of H1 versus H2 alleles (75%, n = 166/222).
We leveraged the Locityper set of haplogroup predictions to assess population patterns of MUC5AC variation. We find that H2 is enriched in AFR genomes (47% of all AFR haplotypes), while H3 is found predominantly among Africans and Europeans (18% and 21%, respectively; Figure 5B). In sharp contrast, H3 is virtually absent among East Asians (0.37%); we identify only four haplotypes found exclusively among Vietnamese. It is interesting that among South Asians, H3 once again rises to common allele frequency (∼5%).
Using Locityper genotypes, we tested again for signatures of positive selection with Tajima’s D (Figure 5C; Table 3). Our results suggest signatures of positive selection for H1 in EAS and SAS. We find that H2 in AFR yields a significantly negative Tajima’s D value in the bin of MUC5AC preceding the tandem repeats, unlike the other super populations examined. In contrast, we find significantly positive Tajima’s D values for MUC5AC H3 in EUR, ADM, and SAS. We tested for departure from Hardy-Weinberg equilibrium and found a significant depletion of homozygotes in Africans and Europeans (chi-squared test, AFR: p = 0.0368, EUR: p = 0.030), consistent with the action of balancing selection. These combined selection signatures in Europeans suggest there is an immunological advantage to shorter haplotypes of MUC5AC and heterozygote advantage for the longer alleles (H3).
Table 3.
GRCh38 chromosome 11 bin | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Population | Gene | Haplogroup | – | 1,150,000 | 1,160,000 | 1,170,000 | 1,180,000a | 1,190,000a | 1,200,000 | 1,210,000 |
AFR | MUC5AC | H1 | Tajima’s D | −0.54 | −1.36 | −1.29 | −1.18 | −0.87 | −0.47 | −1.18 |
# SNPs | 156 | 215 | 321 | 205 | 197 | 185 | 131 | |||
H2 | Tajima’s D | −1.17 | −1.75c,f | −1.58b | −1.51b | −1.19 | −0.71 | −1.03 | ||
# SNPs | 176 | 214 | 332 | 214 | 199 | 198 | 142 | |||
H3 | Tajima’s D | −0.19 | −0.80 | 0.19 | 0.42d | 0.34d | 0.45d | −0.30 | ||
# SNPs | 134 | 166 | 261 | 157 | 160 | 157 | 104 | |||
AMR | MUC5AC | H1 | Tajima’s D | −0.86 | −1.67b | −1.60 | −1.53 | −1.18 | −1.08 | −0.84 |
# SNPs | 121 | 155 | 244 | 165 | 155 | 141 | 78 | |||
H2 | Tajima’s D | −0.21 | −1.09 | −1.29 | −1.24 | −0.88 | −0.45 | −1.06 | ||
# SNPs | 85 | 109 | 193 | 116 | 117 | 106 | 78 | |||
H3 | Tajima’s D | 0.42 | 0.28 | 1.61e,f | 1.79e,f | 1.72e,f | 1.58e,f | −0.10 | ||
# SNPs | 64 | 70 | 142 | 94 | 93 | 69 | 50 | |||
EAS | MUC5AC | H1 | Tajima’s D | −0.67 | −1.78b | −1.99c,f | −1.55b | −0.78 | −0.41 | −1.41 |
# SNPs | 84 | 138 | 216 | 102 | 84 | 77 | 77 | |||
H2 | Tajima’s D | 0.81 | −1.08 | −1.53b | −1.32 | −0.54 | 0.32 | −0.02 | ||
# SNPs | 64 | 112 | 198 | 107 | 99 | 77 | 54 | |||
H3 | Tajima’s D | NA | NA | NA | NA | NA | NA | NA | ||
# SNPs | NA | NA | NA | NA | NA | NA | NA | |||
EUR | MUC5AC | H1 | Tajima’s D | −0.84 | −1.62b | −1.23 | −0.81 | −0.22 | −0.11 | −0.66 |
# SNPs | 97 | 157 | 257 | 150 | 125 | 101 | 77 | |||
H2 | Tajima’s D | 0.83 | −0.93 | −1.02 | −0.83 | −0.40 | 1.13 | 0.16 | ||
# SNPs | 69 | 105 | 157 | 109 | 104 | 73 | 54 | |||
H3 | Tajima’s D | −0.16 | −0.28 | 0.86 | 1.75d | 1.93e,f | 1.38 | −0.13 | ||
# SNPs | 75 | 90 | 188 | 110 | 103 | 80 | 51 | |||
SAS | MUC5AC | H1 | Tajima’s D | −0.71 | −1.66b | −1.85c,f | −1.64b | −0.74 | −0.17 | −0.74 |
# SNPs | 108 | 159 | 257 | 153 | 123 | 111 | 82 | |||
H2 | Tajima’s D | −0.15 | −1.36 | −1.48b | −1.46b | −0.78 | −0.46 | −0.42 | ||
# SNPs | 97 | 142 | 220 | 140 | 113 | 109 | 72 | |||
H3 | Tajima’s D | 1.20 | 0.81 | 1.65 | 2.13e,f | 1.62 | 2.39e,f | 0.27 | ||
# SNPs | 47 | 58 | 135 | 82 | 90 | 54 | 35 |
Bin sizes of 10 kbp were used to compare values to the autosome-wide distribution per population and per haplogroup in the 1KG cohort.19
Corresponds to bin containing VNTR sequence.
Bottom 10% of autosome-wide Tajima’s D values.
Bottom 5% of autosome-wide Tajima’s D values.
Top 10% of autosome-wide Tajima’s D values.
Top 5% of autosome-wide Tajima’s D values.
Significant at α = 0.05 after permutation testing.
MUC5AC haplogroups in LD with GWAS risk SNPs and expression quantitative trait loci (eQTLs)
Because the tSNPs we uncovered are unlikely to be genotyped in previous GWASs, we assessed the LD of MUC5AC haplogroups with risk and protective alleles for asthma/allergy phenotypes and infection-induced pneumonia/meningitis. The risk alleles for three SNPs associated with asthma/allergy (rs35225972,39 rs11245962,40 and rs2841584541; EUR cohorts) are in moderate LD with H1 variants of MUC5AC (Figure 5D). Conversely, the protective alleles for two SNPs associated with infection-induced pneumonia/meningitis (rs11245979,42 rs2872951644; EUR cohorts) are in higher LD with H1, with rs28729516 functioning as a tSNP for this haplogroup. We also examined SNP-associated expression quantitative trait loci (eQTLs) for MUC5AC identified in the upper airways of African American and Hispanic children.57 These eQTLs were parsed into two independent groups related to increased (group A) and decreased (group B) MUC5AC expression. We found that group A eQTLs (increased MUC5AC expression/decreased lung function) have an average r2 of 0.79 for the risk variant and H1 MUC5AC alleles, whereas group B eQTLs (decreased MUC5AC expression) have an average r2 of 0.82 for the protective variant and H3 (Table S6). These findings suggest that differences in VNTR structure are likely important considerations for differential MUC5AC expression.
MUC5AC and MUC5B PheWAS in All of Us
To identify phenotypes associated with MUC5AC and MUC5B variation, we performed a PheWAS using data from All of Us49 (n = ∼165,150). We first tested for a known disease association with the MUC5B regulatory polymorphism (rs35705950) and interstitial lung diseases.50 We find significant associations after Bonferroni correction for rs35705950 in all samples (including age, sex, and PCs1–3 as covariates) with the ICD codes for alveolar and parietoalveolar pneumonopathy (p = 6.89E−44, OR = 2.05), idiopathic fibrosing alveolitis (p = 2.14E−36, OR = 2.85), postinflammatory pulmonary fibrosis (p = 2.62E−34, OR = 1.82), extrinsic allergic alveolitis (p = 3.96E−08, OR = 2.38), bronchiectasis (p = 9.77E−7, OR = 1.32), and pulmonary congestion and hypostasis (p = 2.25E−05, OR = 1.30; Figure S8; Table S7). Two or more of these phenotypes were associated with rs35705950 in Admixed Americans and Europeans when tested alone (Tables S8 and S9).
We find no correlated phenotypes that survive multiple testing correction for MUC5AC H1, H2, and H3 tSNPs (Material and methods). It is interesting to note, however, that H3 tSNPs approached significance for protection against degeneration of the macula and the posterior pole of the retina (p = 1.76E−4–9.49E−4, OR = 0.91; Table S19). We repeated the analysis separately for heterozygotes and homozygotes at rs36151150 (MUC5AC H3 tSNP) and find increased significance for the protective phenotype among heterozygotes, despite a reduction in alleles upon removal of homozygotes (heterozygous: p = 2.41E−4, OR = 0.89; homozygous: p = 0.145, OR = 0.94; Tables S20 and S21).
Discussion
Using numerous high-quality long-read genome assemblies, we performed a population-level genetic survey of MUC5AC and MUC5B structural polymorphism. The protein-coding VNTRs of both loci have precluded and complicated the study of these genes from short-read WGS datasets. Initial efforts to resolve MUC5AC and MUC5B using long-read sequencing have relied on platforms with higher error rates and have been limited to a few individuals (n = 4)7; however, recent advances in long-read sequencing technologies and de novo genome assembly algorithms11,15,16 have made complete characterization of the genes possible.9,10 These analyses open a path to improved understanding of how mucin structural variants contribute to health and disease.
While our results recapitulate the long-held belief that MUC5B is less variable than other secreted mucins,54 they refute the hypothesis that MUC5B is intolerant of structural changes, as we have identified structural variants of likely functional consequence among Africans. This is perhaps not surprising given the greater genetic diversity expected among Africans.58 This variation has likely been missed because most studies of MUC5B have been conducted within European populations (e.g., MUC5B promoter polymorphism50). It is thus important that initiatives from consortia like the HGSVC,9 HPRC,10 and All of Us49 broadly survey individuals of diverse genetic ancestry with long-read sequencing.
In contrast to MUC5B, we uncovered extensive aa composition and size variation within MUC5AC. This difference may be related to their varying functional roles; while MUC5B is ubiquitously and constitutively expressed in the airways, MUC5AC is overexpressed in the nasopharynx and is highly responsive to inflammation.59 MUC5AC has likely evolved independently from MUC5B to respond to a wider variety of pathogenic challenges.1
Our comparative analyses with NHPs also indicates that VNTR length has generally decreased in the human lineage over the course of ape evolution for both genes. Increased VNTR length and subsequent glycosylation is predicted to enhance the interaction of the mucins with water,58 thereby altering the mucus’s biophysical properties.60 Additionally, an increase in the number of cys domains may enhance non-covalent self-interactions that make the gel impermeable.61 It is possible that longer variants of both mucins contribute to pathogenic changes in the viscoelastic properties of mucus in disease phenotypes, such as asthma and cystic fibrosis. In this regard, it is noteworthy that respiratory disease is a particularly pervasive problem affecting NHPs in captivity62; therefore, the reduction in overall VNTR length (especially in H1 and H2 haplogroups) may have been particularly adaptive in humans. Because of our detailed curation of many MUC5AC and MUC5B human haplotypes,9,10,19 further experimental work uncovering how length variation in both loci imparts functional differences is now possible.
Within the human population, we distinguish three major MUC5AC haplogroups (H1–H3) that generally correlate with VNTR length (H1 encoding the shortest and H3 the longest molecules; Figure 1). The longer haplogroup variants are depleted among genomes of East Asian descent. We observe a signature of positive selection in East Asians, as evidenced by an excess of rare variants (Tajima’s D) and extended LD. While this could be in part due to recent population bottlenecks or rapid population expansion in East Asians,63 our genome-wide LD survey places MUC5AC block length in the top 5% (Figure 4). These findings may be relevant to the decreased prevalence of asthma in individuals of Asian descent,64,65 although many other mitigating factors, such as environmental exposures,61 play an important role.
We leveraged the LD and structural differences present within the 206 assembled haplotypes of MUC5AC and MUC5B to genotype short-read WGS data. Using the recently developed program Locityper, we estimate a high degree of genotyping accuracy (∼95% based on leave-one-out experiments). Applying Locityper to the high-coverage WGS data generated from 2,600 1KG samples19 confirms the striking population stratification and positive selection signature among East Asian populations (Figure 5). Given the importance of MUC5AC as a genetic modifier of epithelial diseases like cystic fibrosis8 and asthma/allergy,39,40,41 it will be critical to continue cataloging haplotype diversity and improving short-read genotyping assays at this locus using haplotype information.
Our study of the genetic diversity of MUC5AC suggests different forces of both balancing and positive selection may be operating. Unlike H1, where LD block size and Tajima’s D suggest positive selection in East Asians, our analysis of H3 in Europeans provides preliminary evidence of heterozygote advantage based on significantly positive Tajima’s D and deviation from Hardy-Weinberg equilibrium. The molecular basis for this is unknown, but it is interesting that a protective effect was suggested by PheWAS for macular/retinal degeneration and enriched in H3 heterozygotes (Tables S14–S21). It is feasible that H3 variants provide a protective function against ocular disease because MUC5AC expression has been previously associated with dry eye syndrome.66,67 MUC5AC and MUC5B are expressed in epithelial tissues outside of the lungs, and the signatures of selection we have uncovered may be due to more than just lung traits. It will be critical to understand these biological nuances and control for population substructure in future association studies.
At a broader level, the strategy we have outlined is applicable to other mucin loci and structurally variable genes. There are numerous gene families with protein-encoding structural polymorphisms that have generally been excluded from surveys of genetic variation and disease. Some of these are already known, such as LPA68 and CYP2D6,69 while others are suggestive, such as HRNR.70 Even for MUC5AC and MUC5B, over 100 assembled reference genomes are still insufficient to capture the extent of human genetic diversity at these dynamic loci. Additional haplotypes from long reads in the HPRC, HGSVC, and All of Us, as well as approaches that tag haplotypes (as opposed to single SNPs), are needed to facilitate further variant discovery, protein domain sequence curation, LD block structure analysis, and genetic associations with disease. Importantly, the resulting panels of sequence-resolved haplotypes and tools like Locityper could facilitate direct genotyping from short reads in large population cohorts like All of Us or the UK Biobank. As long-read sequencing methods continue to be optimized and become less expensive in the coming years, the importance of these more complex forms of human genetic variation will become realized.
Data and code availability
The assemblies generated for this project (not previously published by the HGSVC9) were uploaded and accessioned via NCBI Sequence Read Archive (SRA). Sample accession IDs can be found in Tables S1 and S2. This study used data from the All of Us Research Program Controlled Tier Dataset v7, available to authorized users on the Researcher Workbench.
Acknowledgments
We thank Tonia Brown and Michelle Noyes for assistance in editing this manuscript. We thank the HGSVC for access to the underlying PacBio HiFi CCS reads for assembly of the MUC5AC/MUC5B locus and the Primate T2T Consortium (especially Kateryna Makova and Adam Phillippy) for access to the high-quality data ape genome assemblies via GenomeArk. We thank DongAhn Yoo, Mei Wu, Brian Browning, Devin Schweppe, and Nick Riley for their intellectual contributions to the experimental designs, statistical analyses, and visualizations contained in this manuscript. The human cell lines used for sequence and assembly were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research (Tables S1 and S2). The human samples were part of the 1000 Genomes Project. All genome sequence data are thus consented for open access with no data use restrictions. We gratefully acknowledge All of Us participants for their contributions to this work and thank the National Institutes of Health’s All of Us Research Program for making the participant data available for this study. This work was supported, in part, by US National Institutes of Health (NIH) grants HG002385, HG010169, and HG007497 to E.E.E. E.E.E. and J.D.B. are investigators of the Howard Hughes Medical Institute.
This article is subject to HHMI’s Open Access to Publications policy. HHMI lab heads have previously granted a nonexclusive CC BY 4.0 license to the public and a sublicensable license to HHMI in their research articles. Pursuant to those licenses, the author-accepted manuscript of this article can be made freely available under a CC BY 4.0 license immediately upon publication.
Author contributions
E.G.P., P.H., and E.E.E. conceived and planned the experiments. W.K.O., P.H., and J.D.B. provided critical intellectual support during project design. General methodologies were conceived by E.G.P., P.H., A.S., and E.E.E. E.G.P. performed data curation and formal analyses. T.P. and T.M. supported E.G.P. with Locityper analyses and data visualizations. E.N., E.J.K., and P.N.V. conceived and performed the PheWAS analyses. W.T.H. and K.M.M. provided technical and scientific consultation. Data visualization was designed by E.G.P. and E.E.E. E.G.P. and E.E.E. wrote the original manuscript, with edits and reviews from all authors. All authors provided critical feedback that shaped the research and analysis outlined in this manuscript.
Declaration of interests
E.E.E. is a scientific advisory board (SAB) member of Variant Bio, Inc.
Published: July 10, 2024
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2024.06.007.
Supplemental information
References
- 1.Chatterjee M., van Putten J.P.M., Strijbis K. Defensive properties of mucin glycoproteins during respiratory infections—relevance for Sars-CoV-2. mBio. 2020;11 doi: 10.1128/mBio.02374-20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Wallace L.E., Liu M., van Kuppeveld F.J., de Vries E., de Haan C.A. Respiratory mucus as a virus-host range determinant. Trends Microbiol. 2021;29:983–992. doi: 10.1016/j.tim.2021.03.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Morrison C.B., Markovetz M.R., Ehre C. Mucus, mucins, and cystic fibrosis. Pediatr. Pulmonol. 2019;54:S84–S96. doi: 10.1002/ppul.24530. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Bergstrom K.S.B., Xia L. Mucin-type O-glycans and their roles in intestinal homeostasis. Glycobiology. 2013;23:1026–1037. doi: 10.1093/glycob/cwt045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Chaisson M.J.P., Sanders A.D., Zhao X., Malhotra A., Porubsky D., Rausch T., Gardner E.J., Rodriguez O.L., Guo L., Collins R.L., et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 2019;10:1784. doi: 10.1038/s41467-018-08148-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Logsdon G.A., Vollger M.R., Eichler E.E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 2020;21:597–614. doi: 10.1038/s41576-020-0236-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Guo X., Zheng S., Dang H., Pace R.G., Stonebraker J.R., Jones C.D., Boellmann F., Yuan G., Haridass P., Fedrigo O., et al. Genome reference and sequence variation in the large repetitive central exon of human MUC5AC. Am. J. Respir. Cell Mol. Biol. 2014;50:223–232. doi: 10.1165/rcmb.2013-0235OC. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Guo X., Pace R.G., Stonebraker J.R., Commander C.W., Dang A.T., Drumm M.L., Harris A., Zou F., Swallow D.M., Wright F.A., et al. Mucin variable number tandem repeat polymorphisms and severity of cystic fibrosis lung disease: significant association with MUC5AC. PLoS One. 2011;6:e25452. doi: 10.1371/journal.pone.0025452. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Ebert P., Audano P.A., Zhu Q., Rodriguez-Martin B., Porubsky D., Bonder M.J., Sulovari A., Ebler J., Zhou W., Serra Mari R., et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021;372 doi: 10.1126/science.abf7117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Liao W.W., Asri M., Ebler J., Doerr D., Haukness M., Hickey G., Lu S., Lucas J.K., Monlong J., Abel H.J., et al. A draft human pangenome reference. Nature. 2023;617:312–324. doi: 10.1038/s41586-023-05896-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Cheng H., Concepcion G.T., Feng X., Zhang H., Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods. 2021;18:170–175. doi: 10.1038/s41592-020-01056-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Vollger M.R., Dishuck P.C., Sorensen M., Welch A.E., Dang V., Dougherty M.L., Graves-Lindsay T.A., Wilson R.K., Chaisson M.J.P., Eichler E.E. Long-read sequence and assembly of segmental duplications. Nat. Methods. 2018;16:88–94. doi: 10.1038/s41592-018-0236-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Mao Y., Harvey W.T., Porubsky D., Munson K.M., Hoekzema K., Lewis A.P., Audano P.A., Rozanski A., Yang X., Zhang S., et al. Structurally divergent and recurrently mutated regions of primate genomes. Cell. 2024;187:1547–1562.e13. doi: 10.1016/j.cell.2024.01.052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Makova K.D., Pickett B.D., Harris R.S., Hartley G.A., Cechova M., Pal K., Nurk S., Yoo D., Li Q., Hebbar P., et al. The complete sequence and comparative analysis of ape sex chromosomes. Nature. 2024;630:401–411. doi: 10.1038/s41586-024-07473-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Rautiainen M., Nurk S., Walenz B.P., Logsdon G.A., Porubsky D., Rhie A., Eichler E.E., Phillippy A.M., Koren S. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. 2023;41:1474–1482. doi: 10.1038/s41587-023-01662-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Nurk S., Koren S., Rhie A., Rautiainen M., Bzikadze A.V., Mikheenko A., Vollger M.R., Altemose N., Uralsky L., Gershman A., et al. The complete sequence of a human genome. Science. 2022;376:44–53. doi: 10.1126/science.abj6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Frankish A., Carbonell-Sala S., Diekhans M., Jungreis I., Loveland J.E., Mudge J.M., Sisu C., Wright J.C., Arnan C., Barnes I., et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 2023;51:D942–D949. doi: 10.1093/nar/gkac1071. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Byrska-Bishop M., Evani U.S., Zhao X., Basile A.O., Abel H.J., Regier A.A., Corvelo A., Clarke W.E., Musunuri R., Nagulapalli K., et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell. 2022;185:3426–3440.e19. doi: 10.1016/j.cell.2022.08.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Katoh K., Standley D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Waterhouse A.M., Procter J.B., Martin D.M.A., Clamp M., Barton G.J. Jalview Version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;25:1189–1191. doi: 10.1093/bioinformatics/btp033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Nguyen L.T., Schmidt H.A., Von Haeseler A., Minh B.Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 2015;32:268–274. doi: 10.1093/molbev/msu300. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Xu S., Li L., Luo X., Chen M., Tang W., Zhan L., Dai Z., Lam T.T., Guan Y., Yu G. Ggtree: a serialized data object for visualization of a phylogenetic tree and annotation data. IMeta. 2022;1:e56. doi: 10.1002/imt2.56. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Minh B.Q., Schmidt H.A., Chernomor O., Schrempf D., Woodhams M.D., Von Haeseler A., Lanfear R. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 2020;37:1530–1534. doi: 10.1093/molbev/msaa015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Dunsworth H.M. Origin of the genus Homo. Evo. Edu. Outreach. 2010;3:353–366. [Google Scholar]
- 26.Rice P., Longden I., Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000;16:276–277. doi: 10.1016/s0168-9525(00)02024-2. [DOI] [PubMed] [Google Scholar]
- 27.Ho S.B., Roberton A.M., Shekels L.L., Lyftogt C.T., Niehans G.A., Toribara N.W. Expression cloning of gastric mucin complementary DNA and localization of mucin gene expression. Gastroenterology. 1995;109:735–747. doi: 10.1016/0016-5085(95)90380-1. [DOI] [PubMed] [Google Scholar]
- 28.Desseyn J.L., Guyonnet-Dupérat V., Porchet N., Aubert J.P., Laine A. Human mucin gene MUC5B, the 10.7-kb large central exon encodes various alternate subdomains resulting in a super-repeat: structural evidence for a 11p15.5 gene family. J. Biol. Chem. 1997;272:3168–3178. doi: 10.1074/jbc.272.6.3168. [DOI] [PubMed] [Google Scholar]
- 29.RR S. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 1958;38:1409–1438. [Google Scholar]
- 30.Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–580. doi: 10.1093/nar/27.2.573. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Bailey T.L. STREME: accurate and versatile sequence motif discovery. Bioinformatics. 2021;37:2834–2840. doi: 10.1093/bioinformatics/btab203. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Dong S.S., He W.M., Ji J.J., Zhang C., Guo Y., Yang T.L. LDBlockShow: a fast and convenient tool for visualizing linkage disequilibrium and haplotype blocks based on variant call format files. Briefings Bioinf. 2021;22 doi: 10.1093/bib/bbaa227. [DOI] [PubMed] [Google Scholar]
- 33.Slatkin M. Linkage disequilibrium—understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet. 2008;9:477–485. doi: 10.1038/nrg2361. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Gabriel S.B., Schaffner S.F., Nguyen H., Moore J.M., Roy J., Blumenstiel B., Higgins J., Defelice M., Lochner A., Faggart M., et al. The structure of haplotype blocks in the human genome. Science. 2002;296:2225–2229. doi: 10.1126/science.1069424. [DOI] [PubMed] [Google Scholar]
- 36.Rousseau K., Byrne C., Griesinger G., Leung A., Chung A., Hill A.S., Swallow D.M. Allelic association and recombination hotspots in the mucin gene (MUC) complex on chromosome 11p15. 5. Ann. Hum. Genet. 2007;71:561–569. doi: 10.1111/j.1469-1809.2007.00374.x. [DOI] [PubMed] [Google Scholar]
- 37.Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983;105:437–460. doi: 10.1093/genetics/105.2.437. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.MacArthur J., Bowler E., Cerezo M., Gil L., Hall P., Hastings E., Junkins H., McMahon A., Milano A., Morales J., et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog) Nucleic Acids Res. 2017;45:D896–D901. doi: 10.1093/nar/gkw1133. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Valette K., Li Z., Bon-Baret V., Chignon A., Bérubé J.C., Eslami A., Lamothe J., Gaudreault N., Joubert P., Obeidat M., et al. Prioritization of candidate causal genes for asthma in susceptibility loci derived from UK Biobank. Commun. Biol. 2021;4:700. doi: 10.1038/s42003-021-02227-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Vuckovic D., Bao E.L., Akbari P., Lareau C.A., Mousas A., Jiang T., Chen M.H., Raffield L.M., Tardaguila M., Huffman J.E., et al. The polygenic and monogenic basis of blood traits and diseases. Cell. 2020;182:1214–1231.e11. doi: 10.1016/j.cell.2020.08.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Ferreira M.A., Mathur R., Vonk J.M., Szwajda A., Brumpton B., Granell R., Brew B.K., Ullemar V., Lu Y., Jiang Y., et al. Genetic architectures of childhood-and adult-onset asthma are partly distinct. Am. J. Hum. Genet. 2019;104:665–684. doi: 10.1016/j.ajhg.2019.02.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Reay W.R., Geaghan M.P., Agee M., Alipanahi B., Bell R.K., Bryc K., Elson S.L., Fontanillas P., Furlotte N.A., Hicks B., et al. The genetic architecture of pneumonia susceptibility implicates mucin biology and a relationship with psychiatric illness. Nat. Commun. 2022;13:3756. doi: 10.1038/s41467-022-31473-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Tian C., Hromatka B.S., Kiefer A.K., Eriksson N., Noble S.M., Tung J.Y., Hinds D.A. Genome-wide association and HLA region fine-mapping studies identify susceptibility loci for multiple common infections. Nat. Commun. 2017;8:599. doi: 10.1038/s41467-017-00257-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Sabo M.C., Thuong N.T.T., Chang X., Ardiansyah E., Tram T.T.B., Hai H.T., Nghia H.D.T., Bang N.D., Dian S., Ganiem A.R., et al. MUC5AC genetic variation is associated with tuberculous meningitis cerebral spinal fluid cytokine responses and mortality. JID (J. Infect. Dis.) 2023;228:343–352. doi: 10.1093/infdis/jiad050. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Prodanov T., Plender E.G., Seebohm G., Meuth S.G., Eichler E.E., Marschall T. Locityper: targeted genotyping of complex polymorphic genes. bioRxiv. 2024 doi: 10.1101/2024.05.03.592358. Preprint at. [DOI] [Google Scholar]
- 46.Danecek P., Bonfield J.K., Liddle J., Marshall J., Ohan V., Pollard M.O., Whitwham A., Keane T., McCarthy S.A., Davies R.M., Li H. Twelve years 5 of SAMtools and BCFtools. GigaScience. 2021;10 doi: 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Marçais G., Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Sahlin K. Strobealign: flexible seed size enables ultra-fast and accurate read alignment. Genome Biol. 2022;23:260. doi: 10.1186/s13059-022-02831-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Bick A.G., Metcalf G.A., Mayo K.R., Lichtenstein L., Rura S., Carroll R.J., Musick A., Linder J.E., Jordan I.K., Nagar S.D., et al. Genomic data in the All of Us Research Program. Nature. 2024;627:340–346. doi: 10.1038/s41586-023-06957-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Seibold M.A., Wise A.L., Speer M.C., Steele M.P., Brown K.K., Loyd J.E., Fingerlin T.E., Zhang W., Gudmundsson G., Groshong S.D., et al. A common MUC5B promoter polymorphism and pulmonary fibrosis. N. Engl. J. Med. 2011;364:1503–1512. doi: 10.1056/NEJMoa1013660. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.O'Connell B.C., Tabak L.A. A comparison of serine and threonine O-glycosylation by UDP-GaINAc: polypeptide N-acetylgalactosaminyltransferase. J. Dent. Res. 1993;72:1554–1558. doi: 10.1177/00220345930720120401. [DOI] [PubMed] [Google Scholar]
- 52.Brockhausen I., Schachter H., Stanley P. 2nd edition. 2009. O-GalNAc glycans. Essentials of Glycobiology. [Google Scholar]
- 53.Ridley C., Lockhart-Cairns M.P., Collins R.F., Jowitt T.A., Subramani D.B., Kesimer M., Baldock C., Thornton D.J. The C-terminal dimerization domain of the respiratory mucin MUC5B functions in mucin stability and intracellular packaging before secretion. J. Biol. Chem. 2019;294:17105–17116. doi: 10.1074/jbc.RA119.010771. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Vinall L.E., Hill A.S., Pigny P., Pratt W.S., Toribara N., Gum J.R., Kim Y.S., Porchet N., Aubert J.P., Swallow D.M. Variable number tandem repeat polymorphism of the mucin genes located in the complex on 11p15. 5. Hum. Genet. 1998;102:357–366. doi: 10.1007/s004390050705. [DOI] [PubMed] [Google Scholar]
- 55.Kageyama-Yahara N., Yamamichi N., Takahashi Y., Takeuchi C., Matsumoto Y., Sakaguchi Y., Koike K. Tandem repeats of the 5′ flanking region of human MUC5AC have a role as a novel enhancer in MUC5AC gene expression. Biochemistry and Biophysics Reports. 2019;18 doi: 10.1016/j.bbrep.2019.100632. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Wang C., Wang J., Liu Y., Guo X., Zhang C. MUC5AC upstream complex repetitive region length polymorphisms are associated with susceptibility and clinical stage of gastric cancer. PLoS One. 2014;9 doi: 10.1371/journal.pone.0098327. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Altman M.C., Flynn K., Rosasco M.G., Dapas M., Kattan M., Lovinsky-Desir S., O’Connor G.T., Gill M.A., Gruchalla R.S., Liu A.H., et al. Inducible expression quantitative trait locus analysis of the MUC5AC gene in asthma in urban populations of children. J. Allergy Clin. Immunol. 2021;148:1505–1514. doi: 10.1016/j.jaci.2021.04.035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.1000 Genomes Project Consortium A global reference for human genetic variation. Nature. 2015;526:68. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Singanayagam A., Footitt J., Marczynski M., Radicioni G., Cross M.T., Finney L.J., Trujillo-Torralbo M.B., Calderazzo M., Zhu J., Aniscenko J., et al. Airway mucins promote immunopathology in virus-exacerbated chronic obstructive pulmonary disease. J. Clin. Invest. 2022;132 doi: 10.1172/JCI120901. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Cone R.A. Barrier properties of mucus. Adv. Drug Deliv. Rev. 2009;61:75–85. doi: 10.1016/j.addr.2008.09.008. [DOI] [PubMed] [Google Scholar]
- 61.Demouveaux B., Gouyer V., Robbe-Masselot C., Gottrand F., Narita T., Desseyn J.L. Mucin CYS domain stiffens the mucus gel hindering bacteria and spermatozoa. Sci. Rep. 2019;9 doi: 10.1038/s41598-019-53547-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Lowenstine L.J., Osborn K.G. Nonhuman Primates in Biomedical Research; 2012. Respiratory System Diseases of Nonhuman Primates; p. 413. [Google Scholar]
- 63.Cai X., Qin Z., Wen B., Xu S., Wang Y., Lu Y., Wei L., Wang C., Li S., Huang X., et al. Human migration through bottlenecks from Southeast Asia into East Asia during Last Glacial Maximum revealed by Y chromosomes. PLoS One. 2011;6:e24282. doi: 10.1371/journal.pone.0024282. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Pate C.A., Zahran H.S., Qin X., Johnson C., Hummelman E., Malilay J. Asthma surveillance—United States, 2006–2018. MMWR. Surveill. Summ. 2021;70:1–32. doi: 10.15585/mmwr.ss7005a1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Song P., Adeloye D., Salim H., Dos Santos J.P., Campbell H., Sheikh A., Rudan I. Global, regional, and national prevalence of asthma in 2019: a systematic analysis and modelling study. J. Glob. Health. 2022;12 doi: 10.7189/jogh.12.04052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Bhattacharya D., Yu L., Wang M. Expression patterns of conjunctival mucin 5AC and aquaporin 5 in response to acute dry eye stress. PLoS One. 2017;12:e0187188. doi: 10.1371/journal.pone.0187188. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Corrales R.M., Narayanan S., Fernández I., Mayo A., Galarreta D.J., Fuentes-Páez G., Chaves F.J., Herreras J.M., Calonge M. Ocular mucin gene expression levels as biomarkers for the diagnosis of dry eye syndrome. Invest. Ophthalmol. Vis. Sci. 2011;52:8363–8369. doi: 10.1167/iovs.11-7655. [DOI] [PubMed] [Google Scholar]
- 68.Schmidt K., Noureen A., Kronenberg F., Utermann G. Structure, function, and genetics of lipoprotein (a) JLR (J. Lipid Res.) 2016;57:1339–1359. doi: 10.1194/jlr.R067314. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Dalton R., Lee S.B., Claw K.G., Prasad B., Phillips B.R., Shen D.D., Wong L.H., Fade M., McDonald M.G., Dunham M.J., et al. Interrogation of <scp>CYP</scp>2D6 Structural Variant Alleles Improves the Correlation Between <scp>CYP</scp>2D6 Genotype and CYP2D6-Mediated Metabolic Activity. Clinical Translational Sci. 2020;13:147–156. doi: 10.1111/cts.12695. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Lu T.Y., Smaruj P.N., Fudenberg G., Mancuso N., Chaisson M.J. The motif composition of variable number tandem repeats impacts gene expression. Genome Res. 2023;33:511–524. doi: 10.1101/gr.276768.122. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The assemblies generated for this project (not previously published by the HGSVC9) were uploaded and accessioned via NCBI Sequence Read Archive (SRA). Sample accession IDs can be found in Tables S1 and S2. This study used data from the All of Us Research Program Controlled Tier Dataset v7, available to authorized users on the Researcher Workbench.