Skip to main content
BMC Genomics logoLink to BMC Genomics
. 2025 Oct 30;26:978. doi: 10.1186/s12864-025-12137-0

A haplotype-resolved chromatin landscape connects cis-regulatory variants to trait variation in Citrus

Isaac A Diaz 1, Talieh Ostovar 2, Jinfeng Chen 2, Emmanuel Avila de Dios 1, Ryan Piscatella 1, Ruth S Perez-Alfaro 1, Omar Zayed 1, Sarah Saddoris 3, Robert J Schmitz 3, Susan R Wessler 1,, Jason E Stajich 2,, Danelle K Seymour 1,
PMCID: PMC12577126  PMID: 41168720

Abstract

Background

Genetic and epigenetic perturbation of cis-regulatory sequences can shift patterns of gene expression and result in novel phenotypes. Phased genome assemblies now enable the local dissection of linkages between cis-regulatory sequences, including their epigenetic state, and allele-specific gene expression to further characterize gene regulation and resulting phenotypes in heterozygous genomes.

Results

We assembled a locally phased genome for a mandarin hybrid named ‘Fairchild’ to explore the molecular signatures of allele-specific gene expression. With local genome phasing, genes with allele-specific expression were paired with haplotype-specific chromatin states, including levels of chromatin accessibility, histone modifications, and DNA methylation. We found that 30% of variation in allele-specific expression could be attributed to haplotype associated factors, with allelic levels of chromatin accessibility and three histone modifications in gene bodies having the most influence. Structural variants in promoter regions were also associated with allele-specific expression, including specific enrichments of hAT and MULE-MuDR DNA transposon sequences. Integration of haplotype-resolved genetic and epigenetic landscapes with high-throughput phenotypic analysis of fruit traits in a panel of 154 accessions with mandarin and pummelo ancestry revealed that trait-associated variants were enriched in regions of open chromatin. Mining of trait-associated variants uncovered a Gypsy retrotransposon insertion in a gene that regulates potassium transport and may contribute to the reduction in fruit size that is observed in mandarins.

Conclusions

​​Using a locally phased assembly of a heterozygous cultivar of citrus, we dissected the interplay between genetic variants and molecular phenotypes to reveal cis-regulatory sequences with potential functional effects on phenotypes relevant for genetic improvement.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12864-025-12137-0.

Introduction

Interspecific hybridization has the potential to generate unique patterns of gene expression that result in novel phenotypes [1]. Specific gene expression patterns are achieved through control of cis-regulatory modules (CRMs), which include sequences that influence the timing, magnitude, and frequency of transcription through the coordinated action of transcription factors (TFs) and other binding partners. Variation in the local chromatin landscape can influence the ability of TFs to bind CRMs and regulate gene expression. The comparison of local chromatin state and gene expression between first generation hybrids and their parents can reveal the relative contribution of cis-acting variation, including genetic and epigenetic variation in CRMs, and trans-acting variation caused by factors encoded in other genomic locations.

In the genus Citrus, the hybridization of three ancestral species, citron (C. medica), pummelo (C. maxima), and mandarin (C. reticulata), has produced many hybrids with novel phenotypes including sweet oranges (C. sinensis), lemons (C. limon), and grapefruits (C. paradisi) [2]. There are several examples where novel phenotypes in cultivated citrus hybrids are the result of cis-regulatory variation [3, 4]. For example, in blood oranges the insertion of a Copia-like retrotransposon upstream of the Ruby gene introduces one or more CRMs that modulate gene expression in response to temperature [3]. Upregulation of Ruby transcripts in response to cold temperatures results in anthocyanin accumulation in the fruit [3]. Apomixis, another commercially important trait, is also associated with cis-acting regulatory variation introduced by a transposon sequence [4, 5]. The production of apomictic seed, or seed genetically identical to the mother plant, is controlled in part by a gene encoding a RWP-RK domain–containing protein (CitRWP) with the insertion of a miniature inverted repeat transposable element (MITE) in its promoter linked to enhanced gene expression in Citrus and its relative Fortunella hindsii [4, 5]. These are only two examples where structural variants alter gene expression patterns by perturbing CRMs, including through the introduction of a new CRM to regulate the cold-responsiveness of Ruby. The impact of structural variation on gene expression is likely much higher and an extensive inventory of structural variation in tomato revealed that 7.3% of genes with structural variants (SVs) in cis-regulatory regions were differentially expressed [6]. There is accumulating evidence that perennial, clonally propagated crop species, like Citrus, harbor extensive structural variation between haplotypes [4, 7], but the impact of these variants on phenotypic variation, including potential effects on cis-regulatory activity, remains to be characterized at the genome-wide level.

One challenge for studying cis-regulatory variants is the identification of active regulatory regions from the massive amount of non-coding DNA in the genome. The genome-wide identification of accessible chromatin regions (ACRs), or regions of the genome that are not tightly packaged around nucleosomes, using ATAC-seq [8] narrows the portion of intergenic space that contains likely CRMs, which can refine functional non-coding sequences associated with phenotypic variation. A survey of angiosperms revealed that only 0.22–6.5% of the genome resides in accessible chromatin regions [9]. In maize, ACRs and their 2 kb flanking regions were enriched for trait-associated variants [10]. The enrichment for functional variants in ACRs, even though they occupy a limited portion of the genome, highlights their potential for dissecting the genetic basis of complex traits.

Here, we investigate the contribution of cis-regulatory variation to generating novel patterns of gene expression in a cultivated mandarin hybrid named ‘Fairchild’ which contains ancestry from both mandarin (C. reticulata) and pummelo (C. maxima). We quantify allelic variation in both chromatin state and gene expression in this hybrid and directly link chromatin variation to gene expression with a locally phased reference genome. Genes with imbalanced allelic expression served as a focal point to explore the contribution of the chromatin and epigenetic landscape to gene expression variation. We found that allele-specific expression can be predicted, in part, by local chromatin state, with chromatin accessibility and histone modifications of gene bodies being the most influential factors. Local structural variation was also linked to allele-specific variation in gene expression, with hAT and MULE-MuDR DNA transposons enriched at genes with divergent expression between haplotypes. Finally, genome-wide association of 15 phenotypic traits related to fruit and juice quality revealed that associated variants were enriched at ACRs, particularly those related to fruit size. Genetic variants located in ACRs were prioritized for further exploration. A major locus associated with fruit size is linked to a Gypsy retrotransposon insertion flanking a candidate gene involved in potassium transport. Application of potassium can increase fruit size in commercial citrus orchards, and differential regulation of this candidate gene may be linked to fruit size. Integration of haplotype-specific genetic and epigenetic variation in the hybrid ‘Fairchild’ highlights the role of TE-associated genetic variants in the divergence of gene expression and potential impacts on phenotype, including those with horticultural relevance.

Results

The landscape of accessible chromatin in a locally phased genome

The genome of ‘Fairchild’ was de novo assembled using 69 Gb of PacBio long-read sequences (Materials and methods) to investigate the chromatin dynamics at cis-regulatory sequences and the influence of those sequences on gene regulation and phenotypic variation in mandarins, pummelos, and their hybrids. The de novo assembly was scaffolded using a Bionano optical map and anchored to chromosomes based on existing genetic maps (Materials and methods). The resulting ‘Fairchild’ genome assembly is 365 Mb in length (N50 = 42.1 Mb), with 95% of the genome contained in 9 chromosomes (Additional File 2: Fig. S1, Additional File 3: Table S1). Cis-regulatory modules (CRMs) typically include an array of transcription factor (TF) binding sites, with the majority of TFs only binding to DNA sequence in open chromatin conformations [1113]. Putative CRMs in ‘Fairchild’ were identified by mapping accessible chromatin regions (ACRs) using ATAC-seq of nuclei extracted from leaves. A total of 9,172 ACRs spanning 7.3 Mb (FRiP = 0.13) were identified (Materials and methods, Additional File 1: Tables S1, S2). The sequences underlying these ACRs span 2% of the ‘Fairchild’ genome, similar to other plant species [9].

Recent work has demonstrated the importance of distal cis-regulatory sequences in controlling gene regulation in plants [14]. In maize, for example, 32% of ACRs are located more than 2 Kb, and often more than 20 Kb, from their nearest gene [9]. Based on this, ‘Fairchild’ ACRs were classified as genic, proximal, or distal based on proximity to their nearest gene (overlapping, within 2 Kb, or further than 2 Kb, respectively). The majority (53.6%) of ACRs are genic, while 33.7% of ACRs are proximal, and 17.3% are distal to their nearest gene (Additional File 1: Table S2). We then determined the genome-wide distribution of three activating histone modifications (H3K4me3, H3K36me3, H3K56ac) and one repressive modifications (H3K27me3), using ChIP-seq (Materials and methods) (Additional File 1: Table S3-S6). Unsupervised clustering of histone modifications at ACRs revealed similar patterns observed in other plant species [9] (Additional File 2: Fig. S2, S3, S4). When focusing on only distal ACRs (>2 Kb from nearest gene), a distinct cluster of ACRs enriched for the repressive modification H3K27me3 emerges that is not present for genic and proximal ACRs. (Additional File 2: Fig. S3) It is possible that these distal ACRs marked by repressive modifications represent poised enhancers [15, 16], although not much is known about histone modification of poised enhancers in plants.

To define the extent of local and distal cis-regulatory variation, we generated locally phased haplotypes with 10x Genomics Linked-Reads. In total, 1.53 million phased variants, including small indels (< 50 bp) were used to define 1,015 phase blocks (N50 = 1.57 Mb). Phase blocks spanned 88.04% of the assembled genome length and included 91.2% of annotated genes, with 45.4% of haplotype blocks including one or more genes (Fig. 1a). In other plant species, distal CRMs have been identified more than 60 Kb from their target gene [17], but there are typically no intervening coding gene sequences [14]. This suggests that allele-specific chromatin dynamics at both local and distal CRM can be assayed for 91.2% of gene sequences in ‘Fairchild’ based on the multi-megabase scale of haplotype phasing. SNP-based read phasing enabled the quantification of allelic gene expression, chromatin accessibility, histone modification, and DNA methylation (Fig. 1b). Allelic differences in whole-genome sequence read mapping served as a control for read mapping bias (Materials and methods). This approach enables comparison of haplotype-specific signatures of chromatin landscape and gene expression in this hybrid genome.

Fig. 1.

Fig. 1

Framework for developing haplotype-resolved profiles of gene expression, chromatin accessibility, and epigenetic modifications. a Phased haplotype blocks along each chromosome of the ‘Fairchild’ reference genome constructed using 10x Genomics Linked-Reads (n = 1,015, N50 = 1.57 Mb). Alternating colors indicate the distinct phase blocks. b Diagram illustrating the strategy for SNP-based phasing of RNA-seq, ATAC-seq, ChIP-seq, and Bisulfite-seq reads into maternal and paternal haplotypes. c) Example of repressive (paternal) and active (maternal) chromatin landscapes in proximity to a gene with allele-specific expression (ASE)

To evaluate the extent of haplotype-specific control of gene expression we utilized phased SNPs to quantify allele-specific expression (ASE) by RNA-seq (Materials and methods). Overall, 23,407 genes are expressed in the leaf and 7,964 (34.02%) of those genes have sufficient levels of polymorphism (2 or more SNPs) for allele-specific analysis. Genes evaluated for allele specific expression contained an average of 10 SNPs per 1,000 bp (compared to 3.45 SNPs per 1,000 bp for genes that could not be evaluated for ASE). In total, 2,071 genes with significant ASE consistent across two replicates were identified (Benjamini Hochberg FDR < 0.05) (Additional File 1: Tables S7, S8, S9). Genes with ASE were enriched for gene ontology terms related to metabolism and other enzymatic processes (Additional File 1: Table S10). Although genetic variation limits our ability to assay allele-specific patterns of expression across all expressed genes, ASE appears to be common in this subset, affecting 26% of genes with sufficient polymorphism and levels of expression (7,964). This set of ASE genes serve as the basis for investigating the dynamics of haplotype-specific cis-regulatory variation.

Haplotype-specific chromatin dynamics correlate with allele-specific expression

Regulatory variants have the potential to cause haplotype-specific alterations in chromatin state through differential transcription factor and chromatin remodeling activity. We assigned ancestry to phased SNPs to represent allele-specific gene expression and chromatin state by parent of origin. ‘Fairchild’ was derived from a cross between a clementine mandarin and ‘Orlando’ tangelo (a hybrid of mandarin and pummelo) and re-sequencing data from a clementine and ‘Orlando’ were used to assign ancestry to each phased SNP (Additional File 1: Table S11). Local phasing of haplotypes with ASE genes enabled us to examine the extent to which haplotype-specific variation in the chromatin landscape is associated with the expression of linked genes.

To address this, we assayed the landscape of four histone modifications (H3K4me3, H3K36me3, H3K56ac, H3K27me3), chromatin accessibility, and DNA methylation, including their haplotype-specific patterns, to dissect the relative contribution of each feature to gene expression. All seven metrics of chromatin status were partitioned into four regions flanking each expressed gene: genic (including exon and introns), promoter (< 1 Kb from the transcription start site), upstream putative regulatory region (ACR within 5 Kb upstream of promoter), and downstream putative regulatory region (ACR within 5 Kb downstream of gene). For upstream and downstream regions, histone modifications and chromatin accessibility data were only incorporated if a putative regulatory region (ACR) was detected in the window of interest.

Focusing on the 2,071 genes with significant ASE, gene expression in transcripts per million (TPM) was then modeled using elastic-net regression with 56 factors (Additional File 1: Table S12), including each chromatin feature across the four regions. In addition to overall levels of histone modification and chromatin accessibility at each gene, haplotype-specific differences for each feature were included as factors. Additional factors included the presence of a SV in the promoter and the ratio of nonsynonymous and synonymous substitutions. After excluding genes that were outliers for whole-genome sequencing coverage (n = 162), five-fold cross validation was used to select the optimal parameters for elastic-net regression (Additional File 1: Table S13). For 1,909 ASE genes, 40% of the variation in overall gene expression was explained with 21 of the 56 factors (Fig. 2a; Additional File 1: Table S14, S15) with levels of genic chromatin accessibility (ATAC), genic H3K4me3, and genic H3K27me3 having the largest influence. Consistent with their roles in activating and repressing gene expression, H3K4me3 was positively correlated with overall expression, while H3K27me3 was negatively correlated with expression (Fig. 2a). When including all genes in the ‘Fairchild’ genome, 49 of the 56 factors explained 61.15% of variation in overall gene expression (Additional File 1: Table S16; Additional File 2: Fig. S5). The fact that almost all predictors in this model were determined to be significant suggests model overfitting. To address this, we selected a shrinkage parameter that reduced model complexity while maintaining a prediction error within one standard error of the initial model (Additional File 1: Tables S16, S17; Additional File 2: Fig. S6). This reduced model included 15 of the 56 factors and explained 60.44% of the variance in overall gene expression. The factors with the greatest influence on gene expression of all genes are overall levels of genic H3K36me3, genic H3K56ac, genic H3K27me3, and chromatin accessibility of both promoters and genic regions (ATAC) (Additional File 1: Tables S17; Additional File 2: Fig. S7). In summary, genic levels of chromatin accessibility and H3K27me3 were 2 of the 5 top influential factors in models of genes with ASE (n = 1,909) and overall expression for all genes (n = 30,724), with the influence of open chromatin being more pronounced for genes with ASE.

Fig. 2.

Fig. 2

The relationship between gene-expression and chromatin features at genes with allele-specific expression. a Coefficients of the top 12 factors for the model of overall gene expression (log2(TPM)) for 1,909 ASE genes ordered by magnitude (R = 0.40). Factors are partitioned by genomic region and colored to indicate whether they reside in genes, promoters (1 Kb of TSS), or upstream/downstream putative regulatory regions of the focal gene (ACRs present 5 Kb upstream of promoter / 5 Kb downstream of gene). b Coefficients of the top 12 factors for the model of allelic gene expression (log2FC(MaternalRNA/ PaternalRNA) ordered by magnitude (R = 0.3069). This model was constructed for 1,786 ASE genes after filtering genes with allelic mapping bias. Positive “allelic” factors indicate a positive association between maternal gene expression and maternal levels of a given data-type. “Overall” indicates that the predictor is the total level of a given data-type (ATAC, ChIP, Bisulfite). “Allelic” indicates that the predictor is the ratio of maternal:paternal levels of a given data-type (log2FC(Maternal / Paternal)

A similar approach was used to quantify the contribution of each factor to allele-specific expression, or the log2 fold difference in the expression of the maternal allele (clementine) over the paternal allele (‘Orlando’). After removing ASE genes with apparent allelic mapping bias, 1,786 genes remained, with 30.69% of the variation in differential expression between alleles explained by 18 of the 56 factors (Fig. 2b; Additional File 1: Table S18). Allelic differences in levels of chromatin modifications were associated with ASE, unlike the model of overall expression for the set of 1,909 genes where overall levels of chromatin modifications were predictive of gene expression. Allelic chromatin accessibility, H3K36me3, H3K4me3, and H3K56ac in genic regions were the four strongest predictors of ASE and were all positively associated with ASE. The repressive modification H3K27me3 was inversely associated with allele-expression indicative of its repressive effect.

In addition to inferring the relative effect of chromatin features on allele-specific expression, we also examined the significance of their genomic location (genic, promoter, upstream putative regulatory region, downstream putative regulatory region) relative to each gene. Chromatin accessibility and histone modifications of genic regions have the strongest relationship with ASE, while promoter chromatin accessibility, promoter H3K36me3 and DNA methylation of promoters have more subtle effects (Fig. 2b; Additional file 1: Table S18). Allelic DNA methylation of promoters has an effect similar to that of genic H3K27me3, illustrating two modes of transcriptional repression. Additionally, the models reveal a significant effect of deletions in promoter sequences on gene expression, primarily in the model of ASE (Fig. 2b) but also in the model of overall expression (Fig. 2a). Using 10X Genomics Linked-Reads, we identified 2,481 deletions ranging from 41 bp to 11,084 bp (mean length = 332 bp) (Additional File 2: Fig S8). Although parentage can be assigned to each lesion, the designation as a ‘deletion’ is in reference to the ‘Fairchild’ consensus sequence and the ancestral state (i.e. inserted or deleted allele) is unknown. Here, a promoter deletion in one haplotype is associated with increased expression of the corresponding haplotype, with coefficients of similar size for paternal and maternal deletions (Fig. 2b). Since the model response is represented as the ratio of maternal vs. paternal gene expression, the opposite effects of paternal and maternal deletions indicate their effect on the deletion containing allele (Fig. 2b). Overall the chromatin landscape in genic regions and promoters is more associated with ASE than distal regulatory regions. Together, these analyses demonstrate the extent to which allelic differences in histone modifications, chromatin accessibility, and DNA methylation underlie ASE.

Promoter structural variation linked to differences in allele-specific expression

Genetic variants in gene promoter sequences may alter the binding affinity of regulatory factors, serving as a source of cis-regulatory variation. Since multiple functional elements reside in promoter regions, the effect of a single SNP may be small. In contrast, structural variants may alter the sequence or position of cis-regulatory modules and have potentially larger effects on gene expression. In ‘Fairchild’, the promoters of genes with ASE are significantly enriched for deletions (1,000 permutations, p < 0.001) (Additional File 2: Fig. S9). Additionally, promoter structural variants of genes with ASE are enriched for Type I DNA transposable elements (TEs), with the hAT and MULE-MuDR DNA transposons being the most abundant (Additional File 2 : Fig. S10). To further examine the relationship between SVs and ASE we identified 1,654 phased deletion:ASE gene pairs for which the gene and its nearest deletion are in the same phase block (Fig. 3a). Genes with a deletion in their promoter (n = 104) had significantly higher expression of the deletion-containing allele (Hedges G effect-size 0.378) compared to ASE genes without promoter deletions (1,000 permutations, p < 0.001) (Fig. 3a). One example of this is the elevated expression of the maternal allele for a predicted aldehyde dehydrogenase gene (Additional File 1: Table S19). There is a 452 bp structural variant ~ 900 bp upstream of the gene that corresponds with dramatically higher expression of the maternal allele (log2FC = 2.53) (Fig. 3b). Because the ancestral state of each structural variant is unknown, it could be that expression of the paternal allele, in this case, is associated with an insertion that reduces expression. These inserted sequences could disrupt cis-regulatory elements or introduce repressive elements. We searched for TF binding sites inside this promoter SV and identified a single motif matching the binding site of several Arabidopsis thaliana Homeobox-leucine zipper transcription factors including AtHB20, AtHB13, and AtHAT5. A motif enrichment analysis of sequence removed in the ‘deleted’ allele of the 104 promoter:gene pairs identified four significantly enriched motifs, with the most abundant (n = 28) matching the binding site of zinc-finger protein 1 (AZF1), a zinc-finger protein that acts as a transcriptional repressor [18] (Additional File 1: Table S20). For these deletion: gene pairs there was no significant effect of AZF1 motif presence on allele-specific expression, but the sample size was relatively small (10,000 permutations, p = 0.2717) (Additional File 2: Figure S11). However, 5 of the 10 ASE genes with the highest expression bias towards the deletion-containing allele included the canonical AZF1 motif (Fig. 3c). These observations suggest a possible role for repressive elements in structural variants linked to allele-specific expression.

Fig. 3.

Fig. 3

Structural variation as a potential source of cis-regulatory variation between haplotypes. a 1,654 phased deletion:ASE gene pairs for which the gene and its nearest deletion are in the same phase block were identified. Comparison of allele-specific expression of genes with a deletion in their promoter (1 Kb upstream of TSS) (n = 104) versus those whose nearest deletion was not in its promoter (n = 1,550). Gene expression is expressed as a ratio of read counts for the deletion-containing allele (HAPDeletion) relative to the alternate allele (HAPAlternate). For genes where there is no structural variant in the promoter, HAPDeletion represents the allele in which the nearest deletion is found. Genes with a deletion in their promoter (n=104) had significantly higher expression of the deletion-containing allele (Hedges G effect-size 0.378) compared to ASE genes without promoter deletions (*** = 10,000 permutations, p < 0.001) b) Genome browser view of RNA-seq read coverage phased between maternal (blue) and paternal (purple) haplotypes for a predicted aldehyde dehydrogenase gene (FAIR_021608). Phased linked-reads between haplotypes reveal a 452 bp deletion (red bar) in the maternal allele that is associated with increased gene expression. c A motif enrichment analysis of sequence absent in the ‘deleted’ allele of the 104 promoter:gene pairs from ASE genes identified four significantly enriched motifs, with the most abundant (n=28) matching the binding site of zinc-finger protein 1 (AZF1). ASE genes are ordered by their allelic expression ratio, and bars are colored based on whether the promoter deletion contains the AZF1 motif. There was no significant effect of AZF1 motif presence on allele-specific expression, but the sample size was small (n=28; 10,000 permutations, p = 0.2717)

Allele-specific accessible chromatin regions (AS-ACRs) are linked to expression of proximal genes

CRMs can alter the magnitude, timing, and localization of transcription and divergence in CRM activity between haplotypes is a source of variation in ASE. We reasoned that differences in cis-regulatory module activity between haplotypes may be reflected in chromatin accessibility. Using the same framework for detecting ASE, we tested 5,423 ACRs with 2 or more polymorphisms for haplotype-specific differences in levels of chromatin accessibility. A total of 137 allele-specific ACRs were identified at a 5% false discovery rate (Benjamini Hochberg) and these were primarily classified as genic (58%), while 34% are proximal (within 2 Kb of nearest gene), and 11.6% are distal (greater than 2 Kb from nearest gene) (Additional File 1: Table S2). The epigenetic profile of AS-ACRs and their flanking sequence (2 Kb) was similar to all ACRs, including the subset of 5,423 ACRs with 2 or more polymorphisms used for haplotype-specific analysis (Additional File 2: Fig. S12). Notably, only 2.5% of the 5,423 ACRs exhibited significant haplotype-specific chromatin accessibility, compared to 28% of genes with significant ASE. This may be a consequence of levels of sequencing coverage, where reduced coverage (i.e. for ATAC-seq) limits detection of differences in chromatin accessibility between alleles. It is also possible that genetic variants could disrupt transcription factor binding to impact allelic expression without affecting local chromatin structure.

Using locally phased blocks, we associated allele-specific differences in chromatin accessibility (AS-ACR) with allele-specific expression. Of the 137 AS-ACRs, 94.1% resided in the same phased block as their closest ASE gene with a median of 7,069 bp of intervening sequence. Next we correlated levels of chromatin accessibility of AS-ACRs with gene expression of ASE-genes on a per haplotype basis. For ACR-gene pairs separated by 5 Kb or less (n = 51) there is a significant correlation between chromatin accessibility and gene expression (R = 0.69, p < 0.001) (Fig. 4a), with higher levels of allelic chromatin accessibility associated with an increase in expression of the proximal gene. The association between AS-ACRs and gene expression is reduced when examining ACR-gene pairs separated by more than 5 Kb (R = 0.46, p < 0.001) (Additional File 2: Fig. S13). This pattern is consistent with the statistical modeling of ASE which highlights the importance of local chromatin state in predicting differences in gene expression between haplotypes (Fig. 2b).

Fig. 4.

Fig. 4

Allelic chromatin accessibility is linked to allele-specific expression of proximal genes. a Allele-specific ACRs (AS-ACRs) were paired with their nearest ASE genes, restricting the analysis to ACR:gene pairs separated by 5 Kb or less (n=51) while residing within the same phase block. Allele-accessibility of AS-ACRs represented as the ratio of maternal:paternal ATAC-seq reads is positively correlated with allele-specific expression of neighboring genes within 5 Kb (n = 51, = 0.69 , p = 9.9e-08). b Genome browser view of two AS-ACRs (black bars) located in the gene body and promoter of an ASE gene encoding a 7-deoxyloganetic acid glucosyltransferase (FAIR_019693). Monoallelic expression of the maternal allele (blue) is associated with increased chromatin accessibility of both AS-ACRs in the maternal haplotype. c Motif enrichment analysis of the sequence underlying the paternal alleles of AS-ACRs identified the binding motif of ERF48 (see inset). Significant similarity of paternal sequences of AS-ACRs was observed for 44 / 137 AS-ACRs. The ERF48 motif significance score is correlated with allelic H3K27me3, where higher similarity of the paternal sequence to the canonical motif is associated with greater levels of H3K27me3 in the paternal haplotype

One example of allelic chromatin accessibility and allele-specific expression of a proximal gene occurs at a gene predicted to encode a 7-deoxyloganetic acid glucosyltransferase (Additional File 1: Table S21). Two AS-ACRs were detected at this gene, one in the first exon, and the other in the gene promoter (Fig. 4b). The maternal alleles of the promoter AS-ACR and genic ACR had a log2FC (maternal/paternal ATAC) of 1.51 and 2.20, respectively. This coincides with monoallelic expression of the maternal allele, with no detectable expression of the paternal allele (Fig. 4b). The difference in accessibility in the gene body may be a consequence of active transcription of the maternal allele, while differences in accessibility of the promoter may reflect differential binding of transcription factors to the maternal and paternal alleles. Both the maternal and paternal sequence underlying the AS-ACR (364 bp) contains a hAT DNA transposon and a LTR transposon, but the polymorphisms within these elements (n = 28) may have altered their regulatory activity. Several studies have noted the role of transposable elements (TE) in generating variation in CRMs. This may occur through the disruption of functional cis-regulatory sequences by TE insertion [19, 20], or TEs may contain cis-regulatory sequences with the potential to rewire gene regulatory networks by bringing novel CRMs in close proximity to genes [19, 21]. DNA transposons, in particular, are a significant source of species-specific ACRs in angiosperms [9]. Similarly, AS-ACRs in the ‘Fairchild’ genome are enriched for DNA transposons relative to all detected ACRs (1,000 permutations, p = 0.002) (Additional File 2: Fig. S14), with the hAT subfamily of class II DNA transposons being abundant in sequences underlying AS-ACRs. AS-ACRs were identified based on genetic variation (i.e. SNPs) between maternal and paternal alleles, and sequences associated with DNA transposons were identified in both parental haplotypes. We did not observe any significant enrichment for class I transposons (Additional File 2: Fig. S15).

We searched the sequences underlying the maternal and paternal alleles of AS-ACRs for known transcription factor binding sites using MEME-ChIP [22]. Four motifs were enriched in either maternal or paternal alleles of AS-ACRs (Additional File 1: Table S22, S23). The motifs were highly repetitive in sequence which complicated the identification of specific transcription factor binding sites. One motif enriched in paternal alleles of AS-ACRs matches the ERF48 binding motif also known as DREB2C (Fig. 4c) (Additional File 1: Table S24). ERF48 binding motifs were also found to be enriched in differentially accessible chromatin regions between cotton (Gossypium) accessions before and after allopolyploidization, interspecific hybridization, and domestication [23], suggesting ERF48 motifs are a potential source of cis-regulatory variation across multiple plant species. ERF48 is a target of Polycomb repressive complex 2 which deposits H3K27me3 [23, 24] and typically results in reduced transcriptional activity. Following this logic, we explored the relationship between levels of H3K27me3 at paternal alleles of AS-ACRs and the presence of the ERF48 motif. The ERF48 motif was detected in the paternal allele of 44 of the 137 AS-ACRs (Additional File 1: Table S24) and the degree of similarity to the consensus ERF48 motif for each AS-ACR correlated with allelic H3K27me3 (Spearman correlation, Inline graphic = −0.31, p = 0.04) (Fig. 4c). Paternal haplotypes of AS-ACRs with greater sequence similarity to the ERF48 motif had greater levels of paternal H3K27me3 highlighting the potential for AS-ACRs to reveal haplotype-specific CRMs. These variable CRMs may be a consequence of differential activity of chromatin remodelers caused by sequence divergence between haplotypes.

Transposable elements in cis-regulatory region may contribute to variation in fruit size

Allelic divergence in both gene expression and chromatin accessibility has highlighted the significant association of TEs with cis-regulatory variation. To understand the functional consequences of TE-mediated cis-regulation, we surveyed phenotypic variation in a population of 154 accessions of mandarins, pummelos, and their hybrids using an automated phenotyping platform. This phenotyping platform enabled the high-throughput measurement of 10 traits related to fruit size, texture, and color on ~ 20,000 fruit collected across two seasons (Additional File 1: Table S25-28). Five additional traits related to fruit palatability and quality were assayed using a subset of the fruits sampled for each tree (Additional file 1: Table S26, S27). Whole-genome sequencing data was produced for the 154 accessions (Additional File 1: Table S29, Additional File 2: Fig. S16), and 2.24 million SNPs were identified and used for genome-wide association studies (GWAS) (Additional File 2: Fig. S17, S18). Of the 15 traits measured, 536 significant SNPs were detected for 12 traits after Bonferonni correction (Additional File 1: Table S30).

Trait-associated SNPs residing in ACRs were candidates for cis-regulatory variants with possible functional impacts on fruit and juice characteristics. Trait-associated SNPs were enriched in ACRs for four phenotypes, with three related to fruit size (Additional File 1: Table S31). The level of enrichment of trait-associated SNPs ranged from 1.3 to 3 fold relative enrichment in ACRs compared to background SNPs with matched proximity to genes and minor allele-frequency (Fig. 5a). The conservative method of ACR identification used may affect the discovery of trait-associated variants in ACRs. Enrichment of trait-associated SNPs in ACRs was higher when using a less restrictive set of ACRs (Fig. 5a; Additional File 1: Table S31).

Fig. 5.

Fig. 5

Cis-regulatory variation may contribute to variation in fruit phenotypes. a Relative enrichment of trait-associated variants in ACRs for seven traits compared to all other parts of the genome. “MACS” ACRs represents a less conservative method for ACR detection, and recovers more trait-associated variants that overlap ACRs. b Manhattan plot of genome-wide association study for fruit major diameter, the dotted line represents a Bonferroni-corrected significance threshold for the 2.24 million tested SNPs. The lead SNP (chr1:2820321) is highlighted in red. c Mean major diameter of fruits of accessions that segregate for the lead SNP (chr1:2820321). Each point represents the mean of all fruits from an individual tree, with observations for both 2021 and 2022 harvests. d Diagram of the CitKC1 gene and a cluster of five Gypsy retrotransposons in its promoter region. Reduction in linked-read sequencing coverage indicates potential structural variation between haplotypes although no SV was identified. e Mean major diameter of fruits of accessions heterozygous for the 8.7 Kb insertion in CitKC1 (+/SV) compared to accessions without the SV insertion (+/+). No accessions were identified with a homozygous SV insertion (SV/SV). Each point represents the mean of all fruits from an individual tree, with observations for both 2021 and 2022 harvests.

The identification of functional variants is complicated by linkage disequilibrium spanning large segments of the genome. The largest peak on chromosome 1 contains a trait-associated SNP in an ACR that is significantly associated with fruit diameter (Fig. 5b, p < 0.05 after Bonferroni correction). Inheritance of the “T” allele at the top SNP corresponds to an additive increase in fruit diameter (Fig. 5c, Additional File 2: Fig. S19). A 167 Kb linkage block (r2 >0.8) was defined on chromosome 1 surrounding the top SNP associated with fruit diameter (Fig. 5b and c, Additional file 1: Table S32). This region contains 25 genes, 6 ACRs, and 1 SV that was detected with linked-reads (Additional File 1: Table S33). The SV is a 8.7 Kb Gypsy-like retrotransposon insertion that disrupts a 3’ splice junction of an ortholog of Arabidopsis thaliana K + RECTIFYING CHANNEL 1 (AtKC1) (Fig. 5 d) (Additional File 1: Table S34). AtKC1 encodes a member of the Shaker family of potassium channels, which regulate inward and outward flux of potassium across the plasma membrane [25]. Functional Shaker channels consist of four ɑ-subunits and channel complexes can be homomeric or heteromeric [26]. In Arabidopsis thaliana, AtKC1 forms heteromeric complexes with other Shaker family members to inhibit channel activity [2628]. Potassium is a critical nutrient for enhancing citrus fruit size, and is a commonly recommended treatment to improve fruit size [2931].

To further explore the potential influence of the SV in CitKC1 on fruit diameter, we first confirmed its authenticity. The 8.7 Kb insertion was also identified in one of the two haplotypes of the diploid ‘Valencia’ sweet orange genome [32]. For each of the 154 accessions, we then determined the SV genotype (+/+, +/SV, or SV/SV) based on the alignment of short-reads to the SV in ‘Fairchild’ and its flanking sequence (Additional File 1: Table S35). The SV was heterozygous (+/SV) in 37 accessions with mandarin ancestry, including some mandarin types and mandarin hybrids (i.e. sweet oranges) and was associated with reduced fruit diameter (Fig. 5e, Wilcoxon test, p < 0.001). Next, we focused on published RNA-seq datasets for sweet oranges, including ‘Valencia’ (+/SV), to identify potential consequences of the SV on expression of CitKC1. Disruption of the 3’ splice junction resulted in a longer third exon (21 nt) for the SV-associated allele, as compared to the + allele of CitKC1 (Additional File 2, Fig. S20). Allele-specific expression of CitKC1 was reported in the analysis of 749 RNA-seq data sets in sweet orange and, on average, expression of the SV allele was reduced compared to the + allele, but it varied by condition and tissue [32]. Transcripts for the + allele of CitKC1 were identified in ONT full-length cDNA sequencing reads from sweet orange (NCBI SRA SRR27295299) (Additional File 2, Fig. S20). The SV allele of CitKC1 was also detected, including a transcript that initiated in the SV and continued through the remaining exons. Other candidate genes in the 167 Kb region were also explored, including a homolog of a gene involved in auxin biosynthesis which has been shown to affect fruit size in apples [33]. However, this gene was not expressed in ‘Fairchild’ leaf tissue nor was it expressed across a broader panel of tissues in sweet orange [34]. Altogether, we suggest that transcription of CitKC1 is affected by the 8.7 Kb SV and that this gene is a good candidate for further evaluation of its role in fruit size in Citrus.

Discussion

Cis-regulatory variants are a major source of gene expression divergence between species [35, 36], but unraveling the relationship between cis-regulatory variation, allele-specific differences in gene expression, and any downstream phenotypic consequences is challenging in the absence of haplotype phasing. In this study, we assembled a locally phased reference genome for ‘Fairchild’ mandarin and used it to investigate the molecular hallmarks of allele-specific expression. ‘Fairchild’ is a mandarin derived from parental cultivars with shared pummelo ancestry and, as a result, genes in these regions identical by state cannot be evaluated for allele-specific differences in gene expression or chromatin state. In cultivated mandarins with similar ancestry to ‘Fairchild’, these regions of homozygous pummelo ancestry account for 12–38% of the genome [2]. Still, genes in heterozygous portions of the genome could be used to explore the extent of allele-specific expression. We found that allele-specific expression is common, with differential expression between maternal and paternal alleles occuring at nearly one third of the 7,300 genes with two or more genetic variants. We expect that even more genes with ASE would be detected in hybrids with fewer regions of shared ancestry.

Focusing on the set of 2,071 ASE genes, we investigated chromatin features associated with differential transcription of alleles. To do so, we integrated data sets for gene expression, chromatin accessibility, and histone modifications (among others). We found that 30.69% of variation in allele-specific expression can be inferred from local chromatin state, with the most influential factors located in the gene region, as opposed to promoter or other distal regulatory sequences (Fig. 2b). It is important to note that our analysis has revealed associations between allelic chromatin dynamics and gene expression. Without genome-wide surveys of chromatin in relevant mutants or directed manipulation of the chromatin state in regions of interest, it is not possible to determine whether allele-specific histone modifications are driving differential gene expression or if they are a consequence of differential transcription. In ‘Fairchild’, the factors most associated with allele-specific expression were chromatin accessibility and histone modifications in gene regions, including H3K36me3, H3K56ac, and H3K27me3. The next most influential set of factors were those located in promoter regions, including haplotype-specific accessibility of chromatin and the presence or absence of a structural variant. In contrast to genic and promoter regions, the effect of factors in distal regulatory regions was relatively limited, although two factors (ATAC, H3K27me3) were still identified as having significant effects. Distal regulatory regions with functional roles have been identified in large-genome species including maize and wheat [14, 37], but distal ACRs seem to be less common in smaller genomes [14, 38, 39]. In citrus, we expect fewer distal ACRs since the average intergene distance in the ‘Fairchild’ genome is around 2.7 Kb. In fact, only 17.3% of ACRs detected in our dataset were classified as distal (Additional File 1: Table S2). Additionally, the correlation of AS-ACR chromatin accessibility and ASE decays for ACR:gene pairs that are separated by more than 5 Kb (Additional File 2: Fig. S21). Here, we restricted analysis of chromatin accessibility and gene expression to neighboring ACR-gene pairs with no intervening protein-coding gene sequence. In other angiosperms, there are typically no intervening genes between ACRs and their target gene even in large-genome species like maize, where pairs are separated by many tens of kilobases [9, 14]. While this is a reasonable assumption, future integration of conserved non-coding sequences, chromatin contact maps, and tissue-specific identification of ACRs will enable a more complete characterization of the distal regulatory landscape in small genome species to more precisely quantify the contribution of these sequences to gene expression regulation.

Next, we sought to understand and characterize potential sources of cis-regulatory variation in ‘Fairchild’. Structural variation between haplotypes is one possible source of cis-regulatory variation and may alter TF binding affinity, introduce new cis-regulatory modules, or perturb the coordination of neighboring cis-regulatory elements. Genome-wide surveys of genetic variation in crop species have revealed extensive levels of structural variation [4, 7, 40, 41]. In some cases, these variants have been linked to functional differences [3, 42, 43]. In cultivated tomato and its wild relatives, for example, 7.3% of genes with structural variants in promoter regions were differentially expressed [6]. In ‘Fairchild’, structural variants are enriched in the promoters of ASE genes, with associated sequences predominantly derived from hAT and MULE-MuDR DNA transposons. The designation of ‘insertion’ or ‘deletion’ is relative to the ‘Fairchild’ consensus sequence. Structural variants were not polarized to an outgroup species to determine their ancestral state. Haplotypes with the non-consensus allele are termed ‘deletions’ here and deletions in the promoters of ASE genes were associated with increased expression (Fig. 3a). The sequence underlying the longer, consensus ‘inserted’ allele was enriched for transposon associated sequences, suggesting that the ‘inserted’ allele is actually linked to reduced expression, possibly due to the associated TE. TE insertions may alter gene expression by disrupting cis-regulatory modules [44], or through the spreading of repressive chromatin modifications to nearby regulatory sequences [45, 46]. While TE-mediated disruption of local sequence is likely a source of SV-related ASE, our analysis also reveals evidence for a signature of TE-associated CRM with repressive effects on gene expression. The prevalence of SV-related cis-regulatory variants in ‘Fairchild’ suggests that the comparison of chromatin accessibility across structurally diverse citrus genomes is likely to further our understanding of the factors underlying gene expression divergence in heterozygous perennial tree crops species, like those in the genus Citrus.

Genetic variation in CRMs is another potential source of allele-specific expression. Genetic variants that perturb transcription factor binding may subsequently alter local chromatin remodeling. One example of this is the pioneer TF LEAFY which binds to nucleosome occupied binding sites at the AP1 locus and recruits SWI/SNF chromatin remodelers [47]. This is followed by increased chromatin accessibility and elevated levels of AP1 expression [47]. We explored the sequences underlying AS-ACRs to identify genetic variants with possible impacts on TF binding and subsequent gene expression. The sequences underlying AS-ACRs were enriched for DNA transposons. In genome-wide surveys of chromatin accessibility across many angiosperm species, DNA transposons were also enriched in a set of species-specific ACRs [9]. To begin to understand how genetic variation at putative CRMs impacts phenotype, we mined ACRs with haplotype-specific patterns of accessibility for candidate transcription factor binding motifs which may underlie the differential gene expression patterns observed in this citrus hybrid. We detected the enrichment of ERF48 binding sites in paternal alleles of AS-ACRs, and observed a correlation between ERF48 motif presence and allelic H3K27me3 (Fig. 4c). ERF48 binding sites were also found to be enriched in differential DNAse hypersensitive regions discovered in a panel of cotton accessions spanning polyploidization and domestication in the species [23]. Han and colleagues speculate that PRC2 binds the guanine-rich binding site of ERF48, and that PRC2-mediated deposition of H3K27me3 may be involved in transitions between states of chromatin accessibility [23]. The sequence-specific DNA binding activity of H3K27me3 “writers” such as Polycomb repressive complex 2 (PRC2), and “erasers” such as JUMONJI DOMAIN-CONTAINING PROTEINS (JMJ) may underlie differential regulation of alleles with resulting ASE. An extensive yeast-one hybrid screen revealed 233 transcription factors that bound motifs associated with PRC2 binding and subsequent H3K27me3 deposition [48]. Similarly, the sequence-specific DNA binding activity of JMJ proteins such as RELATIVE OF EARLY FLOWERING 6 (REF6) [4951] suggests that sequence alterations in its binding motif could produce differential H3K27me3 demethylation between alleles. We find a significant correlation between ERF48 motif similarity scores and allelic levels of H3K27me3, but no significant relationship between allelic chromatin accessibility and allelic levels of H3K27me3 of AS-ACRs, but we are likely limited by the small sample size of AS-ACRs, the sequence depth of ATAC-seq libraries, and the conservative method used for ACR identification. Importantly, AS-ACRs are indistinguishable from other ACRs when examining ATAC-seq coverage without read-phasing (Additional File 2: Fig. S12). ERF48 motif enrichment in AS-ACRs highlights an instance of how sequence variation in a TF binding site relates to local chromatin dynamics and gene expression variation. The biological significance of these predicted motifs requires further functional validation to confirm their regulatory role.

Further mining of cis-regulatory regions for functional variants is likely to reveal genotype-phenotype associations. The functional consequences of cis-regulatory variation in citrus have previously been demonstrated for two important commercial traits [3, 4]. In maize, 40% of phenotypic variation in agronomic traits can be explained by genetic variants located in the 1% of the genome marked by accessible chromatin [10]. Notably, the analysis of sequence variation in ACRs has uncovered key regulatory components of oil gland formation in Citrus [52]. In ‘Fairchild’, approximately 2% of the genome resides in ACRs, with these ACRs having 1.3–3 fold enrichment for significant GWAS SNPs (Fig. 5a). Our analysis of allele-specific expression and chromatin accessibility underscores the importance of transposable elements as significant sources of regulatory variation (Fig. 3a-c). In particular, we identified a Gypsy retrotransposon insertion in CitKC1, which resides in a 167 Kb region associated with variation in fruit size in a population of 154 mandarins, pummelos, and their hybrids (Fig. 5b-e) [53]. The A. thaliana ortholog of this gene, AtKC1, is a negative regulator of potassium transporter 1 (AtKT1) (Additional File 1: Table S34) [2628]. Potassium is widely recommended as one of the most important nutrients for improving citrus fruit size [2931, 54]. Application of potassium has been demonstrated to increase fruit size in tomato, with optimal levels of K2O (150 kg K2O/ha) increasing both fruit diameter and weight, resulting in a 67% yield increase compared to the control (0 kg K2O/ha) [29]. It is still unclear how SV-mediated perturbation of the expression of CitKC1 may impact potassium transport and influence fruit size. Further work is needed to directly link SV-mediated divergence in CitKC1 expression to variation in fruit size in Citrus. In the future, mining ACRs for functional variants is likely to reveal an abundance of variation relevant for improvement of cultivated citrus.

Conclusions

We assembled a locally phased reference genome for ‘Fairchild’ mandarin and used it to document the extent of haplotype-specific differences in gene regulation and begin to dissect the interplay between local variation in chromatin dynamics and gene expression and their impact on phenotypic variation. We find examples where allele-specific expression is associated with local structural variation in promoter sequences and others where expression differences are more likely due to genetic variation in putative TF binding. Genetic analysis of phenotypic traits with horticultural relevance in citrus was integrated with haplotype-specific genetic and epigenetic variation to reveal potential functional candidates. We demonstrate the power of functional genomics for dissecting the interplay between genetic variants in heterozygous genomes, molecular phenotypes such as gene expression, chromatin accessibility, and epigenetic modifications, and phenotypic variation with the goal of revealing functional cis-regulatory sequences and understanding the evolution of gene regulation.

Materials and methods

Sampling of tissue for sequencing

Fairchild mandarin was sampled from the Givaudan Citrus Variety Collection at the University of California, Riverside. Young leaves were collected from a greenhouse grown individual of the cultivar ‘Fairchild’ mandarin (Citrus reticulata) (Accession No: PI 539508, Inventory No: RIV 2014 PL). To deplete starch in the leaves, the tree was covered with brown paper bags 24 h prior to sampling. The leaves were sampled early in the morning and immediately frozen in liquid nitrogen. The intact, frozen leaves were then shipped to Arizona Genomics Institute (AGI) for DNA extraction from nuclei for PacBio sequencing and to the Amplicon Express company (Pullman, WA) for DNA extraction for Illumina, Bionano, and 10X sequencing.

DNA extraction and sequencing

High-molecular-weight (HMW) DNA was extracted and used for the library construction. SMRTBell libraries (20 Kb) were prepared according to the manufacturer’s protocol including gTube size fraction and BluePippin size exclusion of fragments smaller than 20 Kb before constructing libraries for SMRTbell template construction. Sequencing on the Pacbio RS II machine was performed at AGI following library construction. A total of 3,930,477 reads were produced from 53 flow cells (69 Gb of sequence). Sequenced reads had an N50 of 15 Kb and a maximum length of 78 Kb. For scaffolding, a 10X sequencing library was constructed according to the 10X Genomics protocol (Pleasanton CA USA) by Amplicon Express. This library was then sequenced on the Illumina HiSeq4000 system producing 285 million 2 × 150 bp reads.

Bionano genome (BNG) optical map construction and data analysis

High-molecular weight (HMW) DNA was isolated by the company Amplicon Express (Pullman, WA). The nicking enzyme Nb.BssS1 was chosen based on the de novo draft assembly. The high-quality HMW DNA molecules were treated with Nb.BssS1 (New England BioLabs, Ipswich, MA) to generate single-strand breaks with sequence-specific motifs (GCTCTTC). The nicked DNA molecules were then stained using the IrysPrep Reagent Kit (Bionano Genomics, San Diego, CA) and loaded onto the nanochannel array of IrysChip (Bionano Genomics) following the manufacturer’s guidelines. Imaging and data analysis were performed on the Irys system (Bionano Genomics). DNA molecules larger than 100 Kb were processed into BNX files using the software AutoDetect (v2.1.4) and only molecules of at least 180 Kb were assembled into the BNG map with the Bionano Genomics assembly pipeline [55, 56], The p-value thresholds used for pairwise assembly, extension/refinement, and final refinement stages were 2 × 10 −8, 1 × 10 −9, and 1 × 10 −15, respectively. The de novo assembly was digested in silico at Nb.BssS1 restriction sites using Knickers and aligned with the BNG map using RefAligner in Bionano Solve (v03062017). Visualization of the alignments were performed with IrysView.

Genome assembly

We de novo assembled the ‘Fairchild’ genome using PacBio long reads (RSII) with a diploid assembler, FALCON (v1.8.6). In total, 69 Gb of PacBio long reads were generated from 53 flow cells (Additional File 3: Table S2), which is roughly equivalent to 187 fold of the estimated genome size (365 Mb). An initial assembly generated a 429 Mb primary genome sequence with contig N50 of 2.3 Mb and 69 Mb of secondary sequences with a contig N50 of 144 Kb. The assembly errors were corrected with Quiver (v2.0) by making a consensus sequence from alignments of the raw PacBio long reads mapped to the de novo assembly. A second round of polishing was performed with Pilon (v1.20) based on the alignment of 150 bp paired-end Illumina short reads from sequencing of 10X Linked-Reads. To build a haploid reference genome, we used Haplomerge2 (v11242015) to remove redundancy and merge overlapping contigs in the assembly, which resulted in a haploid genome of 360 Mb with contig N50 of 3.6 Mb. The BioNano optical map data were assembled with RefAligner in Bionano Solve (v03062017). We aligned the BioNano optical map contigs with PacBio contigs using HybridScaffold.pl in Bionano Solve (v03062017) to correct mis-assemblies and further scaffold contigs. We then assembled 10x linked reads based on the barcode of each single molecule and filled gaps in scaffolds with a graph-based method (Du et al., 2017). These high-quality scaffolds were then anchored to chromosomes by integrating three genetic maps (reference) and synteny between citrus genomes as implemented in ALLMAPS (v07222015). The final scaffolded genome was 365 Mb in length and 94.8% of contigs (346/365 Mb) were anchored to nine chromosomes. (Additional File 3: Tables S1, S3) We assessed the completeness of the genome assembly using BUSCO (v5.5.0) against the embryophyta_odb10 dataset and the BUSCO completeness was estimated to be 98.5%. LTR index (LAI) was calculated with EDTA (v1.7.0) and reported a LAI of 10, indicating a highly contiguous assembly of intergenic regions. In addition, assembly quality was evaluated based on the QV value (36.83) and K-mer completeness (90.53%) for the 9 primary scaffolds (Additional file 3: Table S1) [57].

RNA-seq for allele-specific expression

RNA-seq was performed as previously described [9]. Leaves were flash-frozen with liquid nitrogen immediately after collection. The samples were ground using a mortar and pestle. Total RNA was extracted and purified using TRIzol Reagent (Thermo Fisher Scientific) following the manufacturer’s instructions. For each sample, 1.3 µg of total RNA was prepared for sequencing using the Illumina TruSeq mRNA Stranded Library Kit (Illumina) following the manufacturer’s instructions.

ATAC-seq

ATAC-seq was performed as described previously [58]. Approximately 200 mg of flash-frozen leaves were immediately chopped with a razor blade in approximately 1 ml of pre-chilled lysis buffer (15 mM Tris-HCl pH 7.5, 20 mM NaCl, 80 mM KCl, 0.5 mM spermine, 5 mM 2-mercaptoethanol and 0.2% Triton X-100). The chopped slurry was filtered twice through Miracloth and once through a 40 μm filter. The crude nuclei were stained with 4,6-diamidino-2-phenylindole (DAPI) and loaded into a flow cytometer (Beckman Coulter, MoFlo XDP). Nuclei were purified by flow sorting and washed as described previously [58]. The sorted nuclei were incubated with 2 µl Tn5 transposomes in a 40 µl tagmentation buffer (10 mM TAPS-NaOH pH 8.0, 5 mM MgCl2) at 37 °C for 30 min without rotation. The integration products were purified using a Qiagen MinElute PCR Purification Kit or NEB Monarch DNA Cleanup Kit and then amplified using Phusion DNA polymerase for 10–13 cycles. PCR cycles were determined as described previously [8]. Amplified libraries were purified using AMPure beads to remove primers.

ChIP-seq

Chromatin immunoprecipitation (ChIP) was performed using a previously described protocol [59]. In brief, approximately 1 g freshly harvested leaves or flash-frozen leaves was chopped into 0.5 mm cross-sections and cross linked as described previously [59]. The samples were immediately flash-frozen in liquid nitrogen after crosslinking. Nuclei were extracted and lysed in 300 µl of lysis buffer. Lysed-nuclei suspension was sonicated using a Diagenode Bioruptor on the high setting, 30 cycles of 30 s on and 30 s off. To make antibody-coated beads, 25 µl Dynabeads Protein A (Thermo Fisher Scientific, 10002D) was washed with ChIP dilution buffer and then incubated with around 1.5 µg antibodies (anti-H3K4me3, Millipore-Sigma, 07–473; anti-H3K36me3, Abcam, ab9050; anti-H3K27me3, Millipore-Sigma, 07–449; anti-H3K56ac, Millipore-Sigma, 07–667) in 100 µl ChIP dilution buffer for at least 1 h at 4 °C. The sonicated chromatin samples were centrifuged at 12,000 g for 5 min and the supernatants were diluted tenfold in ChIP dilution buffer to reduce the SDS concentration to 0.1%. ChIP input aliquots were collected from the supernatants. For all of the samples and replicates, 300–500 µl of diluted chromatin was incubated with the antibody-coated beads at 4 °C for at least 4 h or overnight, then washed, reverse-crosslinked and treated with proteinase K in accordance with the protocol [59]. DNA was purified using a standard phenol–chloroform extraction method, followed by ethanol precipitation. The DNA samples were end-repaired using the End-It DNA End-Repair Kit (Epicentre) following the manufacturer’s protocol. DNA was cleaned up on AMPure beads (Beckman Coulter) with size selection of 100 bp and larger. The samples were eluted into 43 µl Tris-HCl and subsequently underwent a 50 µl A-tailing reaction in NEBNext dA-tailing buffer with Klenow fragment (3′ ->5′ exo−) at 37 °C for 30 min. A-tailed fragments were ligated to Illumina TruSeq adapters and purified with AMPure beads. The fragments were amplified using Phusion polymerase in a 50 µl reaction following the manufacturer’s instructions. The following PCR program was used: 95 °C for 2 min; 98 °C for 30 s; then 15 cycles of 98 °C for 15 s, 60 °C for 30 s and 72 °C for 4 min; and once at 72 °C for 10 min. The PCR products were purified using AMPure beads to remove primers.

MethylC-seq

Several leaves were immediately flash-frozen after harvesting and ground to a powder in liquid nitrogen. DNA was extracted and purified with the DNeasy Plant Mini Kit (Qiagen) and 130 ng were used for MethylC-seq library preparation. MethylC-seq libraries were prepared using a previously described protocol [60]; however, we used a final PCR amplification of eight cycles.

Annotation of repetitive elements

To annotate and remove repetitive regions from the gene prediction process,, we generated a library of repetitive elements with RepeatModeler v. 2.0.1 [61] using default parameters. This library was used to soft-mask these elements in the genome with RepeatMasker v. 4.1.1 [62] (Additional File 3: Table S4). We used the calcDivergenceFromAlign.pl script, available in the util directory of RepeatMasker package, to calculate and summarize the divergence from the consensus for each predicted repeat using the Kimura 2-Parameter distance with an adjustment for higher mutation rates at CpG sites. The createRepeatLandscape.pl script was then used to create a repeat landscape plot and visualize the repeat elements diversity, their relative age and divergence, using the divergence summary data (Additional File 2: Fig. S22).

Genome annotation

RNA-seq data was generated from young leaves and young flowers of ‘Fairchild’. A total of 33.4 Gb of sequence was produced on the Illumina HiSeq4000 (2 × 150 bp reads). Samples were collected and then flash frozen in liquid nitrogen and stored at −80 °C. Total RNAs were extracted from these samples using TRIzol reagent (Thermo Fisher Scientific). RNA-Seq libraries were prepared using NEBNext Ultra Directional RNA Library Prep Kit. Libraries were prepared using an NEB rRNA depletion kit. Genome annotation was conducted with the Funannotate pipeline v. 1.8.10 [63]. First, the RNAsequencing (RNA-seq) data were assembled using Trinity v. 2.11.0 and aligned with PASA v. 2.4.1 to train the ab initio gene predictors [6466]. After training the predictors Augustus v. 3.3.3 and SNAP v. 2013_11_29 with the evidence, a collection of gene predictors were run CodingQuarry v. 2.0, Augustus v. 3.3.3, GeneMark-ETS v. 4.62, GlimmerHMM v. 3.0.4, and SNAP v. 2013_11_29 [6771]. These predictions were combined into a consensus gene model set with EVidenceModeler v. 1.1.1 [66]. Transfer rRNA genes were predicted using tRNAscan v. 2.0.9 [72]. Next, using HMMER v. 3 [73] or diamond BLASTP v. 2.0.8 [74, 75] protein annotations and gene products were predicted for the consensus gene models based on similarity to Pfam [76], CAZyme domains with dbCAN HMM profiles [7780], MEROPS [81], eggNOG v. 5.0 [82], InterProScan v. 5.51.85.0 [83], and Swiss-Prot [84] databases. In addition, Phobius [85] and SignalP v.5.0b [86] were used to identify transmembrane proteins and secreted proteins. (Additional File 3: Table S5)

Haplotype phasing of variants in ‘Fairchild’

Single nucleotide polymorphisms (SNPs), small indels (< 41 bp), and structural variants were identified on the nine primary chromosomes using the 10x Genomics Longranger pipeline (10x Genomics, Pleasanton California). In short, 10x Genomics linked reads were aligned to the reference genome using the aligner Lariat which is based on the Random Field Aligner algorithm [87]https://github.com/10XGenomics/lariat. Variants were subsequently identified using GATK Haplotypecaller and GenotypeGVCF following the current GATK best practices [88]. In Longranger, insertions and deletions ranging from 40 bp − 30 Kb were detected using haplotype-specific reductions in read coverage, discordant read pairs, and local re-assembly. SNPs, indels, and SVs were then phased by aligning the linked-reads to the sequence of each allele to define which haplotype the read supports. A total of 1,015 phase blocks were constructed using 2.3 million phased heterozygous variants (mean length of phase block = 317,484 bp; standard deviation = 711,178.7 bp). Within each phase block, SNPs are attributed to either haplotype 1 or haplotype 2. When comparing SNPs between phase blocks, the assignment of haplotype 1 or 2 is random and not associated with an assignment of ancestry. Biallelic SNPs were identified and filtered using the following criteria with bcftools [89]: QD >2.0 && FS < 60.0 && MQ >40.0 && MQRankSum > −12.5 && N_ALT = 1 && N_MISSING = 0 && QUAL >500 && MIN(FMT/DP) >20 && INFO/DP < 450. This yielded 1.53 million phased heterozygous variants that were used in downstream analyses. The majority of variants are SNPs (1.3 million) that were used to phase sequencing reads for whole-genome re-sequencing, RNA-seq, ATAC-seq, bisulfite-seq, and ChIP-seq (H3K4me3, H3K36me3, H3K56ac, H3K27me3).

Assignment of ancestry to phased SNPs

Whole-genome re-sequencing reads from maternal parent clementine mandarin [2], paternal parent ‘Orlando’, and hybrid ‘Fairchild’ were trimmed for quality and sequencing adapter removal using Trimmomatic 0.36 (SLIDINGWINDOW:4:20 MINLEN:50 ILLUMINACLIP: LEADING:3 TRAILING:3) [90]. Reads were aligned to the ‘Fairchild’ reference genome using BWA mem 0.7.17 [91]. PCR duplicates were filtered using Picard MarkDuplicates [92]. Variants were identified using GATK Haplotypecaller and GenotypeGVCF following the current GATK best practices [88]. High-quality biallelic SNPs were identified and filtered using the following criteria with bcftools [89]: QD >2.0 && FS < 60.0 && MQ >55.0 && SOR < 3 && MQRankSum > −12.5 && ReadPosRankSum > −2.0 && QUAL >500 && MIN(FMT/DP) >30 && INFO/DP < 450. Of the 1.3 million SNPs identified with linked-reads, 1.22 million (94.4%) were identified in parental whole-genome short-read sequencing and were used to perform pedigree-based phasing using Whatshap v. 1.17 [93]. A constant recombination rate of 3.0 cM/Mb (average observed in C. clementina [94] was used for pedigree-based phasing of SNPs. After ancestry assignment, SNPs heterozygous in one or both parents were removed, leaving only SNPs that are homozygous for alternate alleles in the two parents (141,290 SNPs). The ancestry assignment of these SNPs was combined with their read-based haplotype assignment in order to impute the ancestry of all 1.3 million SNPs between the two haplotypes. In total, 394 phase blocks contained high quality SNPs and were designated as unclassified, non-recombinant, or recombinant depending on the ancestry composition of SNPs within the phase block (Additional File 1: Table S11; Additional File 3: Table S6). Unclassified phase blocks (n = 149) did not contain any SNPs that could be assigned ancestry, for non-recombinant phase blocks (n = 231) the ancestry of every SNP was consistent within each haplotype of the phase block, in recombinant blocks (n = 14) the ancestry of SNPs was variable within each haplotype of the block. For non-recombinant blocks, the ancestry of SNPs that could not be assigned using parental genotypes was imputed using the ancestry of SNPs within the same phase block. For recombinant blocks, only SNPs that could be assigned ancestry using parental genotypes were retained. One recombinant phase block with repeated haplotype switching of SNP ancestry was removed from the analysis (2,983 SNPs). Phased SNPs were used to evaluate the switch error rate within the 245 phase blocks that were classified as recombinant or nonrecombinant based on parental homozygous SNPs using Merqury v1.4 [57] (Additional file 2: Fig. S23). This revealed switch error rates of 3.5% and 2.96%, for the maternal and paternal haplotypes. Additionally, SNPs whose haplotype ancestry was discordant with the majority of other SNPs in the phase blocks were filtered. The ancestry of SVs was imputed following this same procedure, with the exception of SVs in recombinant phase blocks. For SVs in recombinant phase blocks, the ancestry of the SV was imputed using the majority ancestry of the nearest 5 SNPs upstream and downstream of the SV.

Polarization of heterozygous SNPs in ‘Fairchild’

Whole genome sequencing reads from Poncirus trifoliata accession DPI 50 − 7 (NCBI PRJNA648176) [95] were trimmed for quality and sequencing adapter removal using Trimmomatic 0.36 (SLIDINGWINDOW:4:20 MINLEN:50 ILLUMINACLIP: LEADING:3 TRAILING:3) [90]. Reads were aligned to the ‘Fairchild’ reference genome using BWA mem 0.7.17 [91]. PCR duplicates were filtered using Picard MarkDuplicates [92]. Variants were identified using GATK Haplotypecaller and GenotypeGVCF following the current GATK best practices [88]. High-quality biallelic SNPs that are homozygous in DPI 50 − 7 were identified and filtered using the following criteria with bcftools [89]: QD >2.0 && FS < 60.0 && MQ >55.0 && SOR < 3 && MQRankSum > −12.5 && ReadPosRankSum > −2.0 && QUAL >500 && MIN(FMT/DP) >30 && INFO/DP < 450. Next, these variants were intersected with heterozygous SNPs in ‘Fairchild’ and biallelic SNPs were retained. This final set of SNPs represent ‘Fairchild’ SNPs for which the ancestral state is indicated by the allele present in Poncirus trifoliata (Additional File 3: Table S7).

Processing of ChIP-seq and ATAC-seq sequencing reads

Prior to alignment, ChIP-seq reads for four histone modifications (H3K4me3, H3K36me3, H3K56ac, H3K27me3) and ATAC-seq reads were trimmed for quality (SLIDINGWINDOW:4:15 MINLEN:50 ILLUMINACLIP: LEADING:3 TRAILING:3) and adapter sequences removed using Trimmomatic 0.36 [90]. Sequence reads were then aligned to the ‘Fairchild’ reference genome using BWA mem 0.7.17 [91]. PCR duplicates were filtered using Picard MarkDuplicates [92]. A summary of sequence reads for each data set before and after quality control is provided (Additional File 3 : Table S8). Peak identification for histone modifications (ChIP-seq) and accessible chromatin (ATAC-seq) was performed using Genrich (https://github.com/jsh58/Genrich) and MACS2 [96] following [9]. Significant peaks within 100 bp were merged. Sequences underlying ACRs were aligned to plastid (NC_034671) and mitochondrial (NC_037463.1) genomes of C. reticulata using NCBI BLAST and significant hits were removed. The number of peaks identified, along with their average read depth and percentage of the genome occupied by these peaks is summarized here (Additional File 1 : Table S1). Two methods were used for identifying accessible chromatin regions (ACRs). One method (MACS2) identified a total of 31,191 ACRs spanning 10.4 Mb with a fraction of reads in peaks (FRiP) score of 0.16, and the other (Genrich) identified 9,172 ACRs spanning 7.3 Mb (FRiP = 0.13) (Additional File 1: Tables S1, S2). There was 70% overlap between ACRs identified by Genrich and ACRs identified by MACS2. Additionally, 97% of Genrich ACRs overlapped ACRs identified with MACS2. An unpublished benchmarking of the ATAC-seq peak calling programs HMMRATAC, MACS2, and Genrich found that Genrich had the highest precision, recall, and F1 score [97]. For this reason we decided to use the ACRs detected with Genrich.

Processing of bisulfite sequencing reads and analysis of DNA methylation

Bisulfite-sequencing reads were trimmed for quality (SLIDINGWINDOW:4:15 MINLEN:50 ILLUMINACLIP: LEADING:3 TRAILING:3) and adapter sequences were removed using Trimmomatic 0.36 [90]. Reads were then aligned to the ‘Fairchild’ reference genome using BatMeth2 [98] and PCR duplicates were removed using Picard MarkDuplicates [92]. Bisulfite-sequencing reads were phased between parental haplotypes with ancestry-assigned SNPs using Whatshap [93]. DNA methylation of cytosines was counted for both haplotypes separately using DNMtools [99] (Additional File 3: Tables S9, S10 and S11). The probability of differential methylation between haplotypes was determined using a one-directional version of Fisher’s exact test implemented in DNMtools [99, 100]. Differentially methylated cytosines with read coverage >= 10 and probability of differential methylation >= 0.7 were retained for modeling of ASE. Overall methylation was calculated for cytosines with coverage >= 10 using the full, unphased bisulfite-sequencing data with DNMtools [99]. Methylation status is based on the number of reads that support a C-to-T bisulfite conversion versus the total number of reads aligning to a given cytosine.

Regional analysis of genomic features

Metaplots were generated to summarize regional patterns of chromatin levels, typically centered on the peak of an open chromatin region, as determined by ATAC-seq. For each data set, read coverage was scaled from 0 to 1 by dividing the read coverage at each site by the 98th quantile of sequencing coverage per sample [9]. ACRs and genes were scaled to a 1000 bp region, as well as their 2 kb flanking regions using deepTools v.3.5.1 [101] and signal densities for each feature were plotted across these regions in 50 bp windows.

Determination of allele-specific gene expression and chromatin accessibility

Read alignments for quality and adapter trimmed ATAC-seq and ChIP-seq were performed using BWA-mem 0.7.17 [91] as described above. RNA-seq reads were also quality and adapter trimmed using Trimmomatic 0.36 [90] prior to read mapping to the ‘Fairchild’ reference genome with the slice-aware alignment tool STAR 2.7.10 [102]. After initial alignment of RNA-seq, ATAC-seq, and ChIP-seq reads, WASP [103] was used to correct read-mapping bias (Additional File 3: Table S8). This is done by switching the reference allele to the alternate allele (and vice versa) in reads that align across SNPs. Sequence reads that do not align to the same genomic location are filtered to remove reads that cannot be definitively assigned to a haplotype. For each data set, allele-specific read counts were determined using GATK ASEReadCounter [88]. Allele counts were then intersected with genes (for RNA-seq) or peak regions (ATAC-seq and ChIP-seq). SNPs covered by less than 10 reads were removed. Genes and ATAC-seq peak regions with 2 or more SNPs were considered for subsequent allele-specific analysis. GeneiASE [104] was used to detect significant differences in the level of gene expression or chromatin accessibility between alleles. GeneiASE first meas ures individual SNP effects based on the read counts of each allele at a given SNP. The median SNP effect-size for genes with significant ASE was 4.6, while genes without significant ASE had a median SNP effect-size of 1.9. Next, a gene/peak test statistic is calculated by pooling all of the SNP effects that coincide with the given gene or peak using Stoufer’s method [105]. The significance of gene test-statistics was then determined by comparison to a null distribution of gene test statistics. This null distribution is characterized by an overdispersion parameter and mean marginal probability of success [104]. These parameters were determined by fitting a beta binomial model to whole-genome sequencing read counts overlapping the same gene/peak regions of interest. This provides an additional correction for read-mapping biases. Genes or peaks with significant allele-specific differences were identified at a false discovery rate of 0.05 (Benjamini-Hochberg) (Additional File 1: Table S7-S9; Additional File 3: Table S12).

Integration of allele-specific ACRs (AS-ACRs) and ASE using phased blocks

Genes with significant ASE in both RNA-seq replicates and ACRs with significant allele-specificity were paired using phase block information. If an ASE gene and its nearest AS-ACR were located in the same phase block, the sum of read counts for the ASE gene (RNA-seq), and the AS-ACR (ATAC) was calculated for each haplotype. In this case, the read count of the maternal haplotype for the ATAC peak region corresponds to the level of expression of the maternal allele of the linked gene. The log2 fold change between the read counts of maternal and paternal haplotypes were determined both for gene expression and chromatin accessibility. A pseudo count of 1 was added to each value prior to log transformation.

Permutation tests for genomic feature enrichment

A permutation test was performed to examine the enrichment of allele-specific ACRs for transposable element sequences. The number of AS-ACR-TE intersections was compared to a set of similarly sized regions selected from the complete set of ACRs and the frequency that each sample intersected annotated TEs. This process was repeated (n = 1000). This analysis was performed for all TEs, and then for TEs separated into class I and class II transposons. A similar approach was used to examine the enrichment of structural variants (50 bp − 30 Kb in size) within promoters of ASE genes. In this case, the number of SV-promoter intersections was compared to a sample of similarly sized regions from promoters of genes analyzed for ASE. The R package regioneR was used to select regions and perform the permutations [106].

Motif enrichment analysis

Transcription factor binding site enrichment within allele-specific ACRs (AS-ACRs) and SVs within ASE gene promoters were determined using MEME-ChIP with the parameters: (-meme-minw 5 -meme-maxw 30 -meme-nmotifs 10 -db ArabidopsisDAPv1.meme) [22]. Transcription factor binding motifs obtained from DNA affinity purification sequencing (DAP-seq) of Arabidopsis thaliana transcription factors [107] were used as the input database for motif identification. For AS-ACRs, the sequences for the maternal and paternal alleles were tested for motif-enrichment separately. For SVs in ASE gene promoters the sequence underlying the longer, consensus, ‘inserted’ allele was tested for motif-enrichment.

Statistical inference of contribution of genomic features to allele-specific expression

An elastic-net regression model was developed to estimate the contribution of multiple factors to gene expression, including allele-specific expression. Overall gene expression was quantified using HTSeq-count with bias-corrected RNA-seq reads [108]. Genes that were outliers overall whole-genome sequencing read counts (1.5x interquartile range) were filtered prior to model fitting. For every gene, or those genes with significant ASE in both RNA-seq replicates, we gathered information for 56 factors (Additional File 1: Table S11). These factors include the overall read counts (or coverage) of each predictor as well as the log2 fold change of phased allele counts (Maternal/Paternal). For whole-genome sequencing reads, ChIP-seq reads, ATAC-seq reads and bisulfite-seq reads, the counts were classified as belonging to one of four regions surrounding each gene: genic (including exon and introns), promoter (< 1Kb from the transcription start site), upstream putative regulatory region (5 Kb upstream of promoter), and downstream putative regulatory region (5 Kb downstream of gene). Upstream and downstream regions were only considered if an ACR was detected in the window.

For the analysis of allelic differences in read coverage, the four regions of interest were set to NA if they did not contain at least one phased SNP. Additionally, regions were set to NA if their total SNP-based read counts were lower than the 10th percentile of SNP-based read counts for a given factor (ATAC-seq, ChIP-seq, BS-seq, etc.). This filter was applied for each factor x region combination. Genes with a strong bias in whole-genome sequencing read counts (1.5x interquartile range) were filtered prior to model fitting (n = 146). After filtering, the extent of missing data per factor ranged from 0 to 70% with a median of 11.3% missing data (Additional File 3: Tables S13 and S14). Elastic-net regression models are well suited to handle missing data and model fitting was performed with the glmnet package in R [109]. Model parameters were determined using 5-fold cross validation (Additional File 1: Table S13). For the model of overall expression, there was evidence of overfitting as 54/56 predictors were determined to be significant. To address this, we selected a shrinkage parameter (lambda) that reduces model complexity while maintaining a prediction error (mean-squared error) within one standard error of the model determined by 5 fold-cross validation (Additional File 1: Tables S14, S16, S17; Additional File 2: Fig. S6).

Effects of SNPs and indels

SnpEff v.5.0e was used to annotate the impact of each SNP. The genome annotation (GFF) and SNP VCF files were used to build the SnpEff database. To evaluate the impact of potentially deleterious variants on ASE and compare their effects with the other classes of variants, SNPs classified as having a “HIGH” impact were merged with SNPs in ASE genes. (Additional File 3: Table S15)

Collection and evaluation of citrus fruits

Fruits were harvested from the Givaudan Citrus Variety Collection at the University of California, Riverside over the 2020/2021 and 2021/2022 harvest seasons. These harvests included 154 citrus accessions with mandarin (C.reticulata) and pummelo (C.maxima) ancestry, for which whole-genome sequencing data was also produced. Each accession was represented by two replicate trees, and approximately 50 fruits were harvested from each tree according to their optimal ripening times (Additional File 1: Table S29). Trees with larger fruit or reduced yield had fewer fruit collected, with an average of 38 fruits harvested per tree. Each harvest took place between the hours of 7:00 AM-12:00 PM (Pacific Standard Time). Fruits were harvested using sanitized clippers, cutting the fruit at the calyx base to ensure there was no damage to the peel. These fruits were then washed with soap and water to mitigate the spread of pests and diseases. After drying, the fruits were transported in covered containers to the remote field site in Lindcove, California. The UC system operates the Lindcove Research and Extension Center which features a commercial pack line and research facility. At this facility, data is collected for various fruit traits, both non-destructive and destructive. The non-destructive traits are measured using Compac InVision (TOMRA), a pack line that is typically used by commercial packing houses to assess fruit quality for the market. The pack line captures dozens of images of each fruit, and uses computer vision (InVision) to measure traits related to fruit size, shape, color, and texture (Additional File 1: Table S25). After the measurement of non-destructive traits using the packline, a subset of 12 fruits were randomly selected for each tree and analyzed in the fruit quality laboratory. In the fruit quality laboratory, fruits were cut equatorially and exposed seeds were counted. After seed counting, peel thickness was measured with a digital caliper, and fruits were juiced using a hydraulic press. Next, the fruit juice was weighed, and total soluble solids (Brix) were measured using a Bellingham and Stanley RFM110 refractometer. Finally, juice acid content (grams/100 ml) was measured using a Mettler Toledo t5 titrator.

Whole genome sequencing

DNA extraction using the DNeasy 96 Plant Kit was performed according to the manufacturer’s instructions (Qiagen). The DNA was twice serially eluted in 50 µl of nuclease-free water. Libraries were constructed using the Illumina DNA Prep Kit, and the reagents were adjusted to a half-reaction. Whole genome sequencing was carried out on a NovaSeq 6000 instrument using the paired-end 150 bp read setting by the DNA Technologies Core at UC Davis, CA.

Variant detection for genome-wide association

Whole-genome sequencing reads for 154 accessions in the Givaudan Citrus Variety Collection at UC Riverside were trimmed for quality and sequencing adapter removal using Trimmomatic 0.36 (SLIDINGWINDOW:4:15 MINLEN:50 ILLUMINACLIP: LEADING:3 TRAILING:3) [90]. Reads were aligned to the ‘Fairchild’ reference genome using BWA mem 0.7.17 [91]. PCR duplicates were filtered using Picard MarkDuplicates [92]. Variants were identified using GATK Haplotypecaller and GenotypeGVCF following the current GATK best practices [88]. High-quality biallelic SNPs were identified and filtered using the following criteria with bcftools [89]: QD >2.0 && FS < 60.0 && MQ >55.0 && SOR < 3 && MQRankSum > −12.5 && ReadPosRankSum > −2.0 && QUAL >500 && MIN(FMT/DP) >15 && INFO/DP < 450. Additionally, variants with greater than 10% missing genotypes and minor allele-frequency less than 3% were excluded, resulting in 2.24 million SNPs.

Processing of phenotype data and genome-wide association

The raw measurements of fruit phenotypes were filtered to exclude abnormal or damaged fruit. The InVision software “grades” each fruit to indicate whether it has abnormal texture (Grade D), or is touching another fruit (Grade E). These abnormal fruits were excluded from downstream analyses. Additionally, the density of each fruit was calculated to identify desiccated fruit that remained on the tree from the previous year. For each tree, outlier fruits were excluded if their density fell above or below 1.5x the interquartile range of all fruit from that tree in a given year. For each replicate, the mean across all fruits was calculated and transformed using box cox transformation. For fruit color measurements, the percentage of fruit surface area that is of a certain color was calculated for 11 colors. These percentages were then used in principal component analysis (Additional File 2: Figure S24), and each tree’s position on the first two principal components were used as phenotypes. Next, a mixed-effects model was constructed for each phenotype to account for the potential confounding effects of a tree’s rootstock, age, and year of harvest (Additional File 3: Table S16) [110, 111]. For each phenotype, the model coefficient for scion genotype was used for genome-wide association.

Genome-wide association was performed for each trait using a linear mixed-effects model with GEMMA 0.98.5 [112]. After excluding SNPs with missingness greater than 10% and with a minor allele-frequency less than 3%, 2.24 million SNPs were retained. A kinship matrix constructed with PLINK 1.90 [112, 113] was included in the mixed-effects model to account for population structure. A threshold for significant SNP associations was set using Bonferroni correction (𝛼 = 0.05). Linkage disequilibrium (r2) of variants within 1 Mb of significant SNPs was calculated using PLINK 1.90 [112, 113] and used to define regions of the genome that are associated with traits of interest (r2 >0.8).

Characterization of CitKC1

Comparison of the ‘Fairchild’ and ‘Valencia’ gene models revealed that two exons (exon 1 and exon 2) were missing from the ‘Fairchild’ gene model, but these sequences could be identified in the genomic sequence of ‘Fairchild’. The SV insertion disrupted the 3’ splice junction of exon 3 in CitKC1 (Addition File 2: Fig. S20). As a result, the third exon of the SV-associated allele of CitKC1 includes an additional 21 nucleotides, as compared to the + allele of CitKC1. To characterize the sequence underlying the 8.7 Kb insertion in CitKC1, a 40-Kb genomic region spanning CitKC1 and the SV (Chromosome 1: 2,750,000 to 2,790,000) was extracted using samtools faidx. EDTA v2.0.1 [114] was then used to identify transposable elements with parameters “--species others --step all --sensitive 1 --anno 1 --force 1 --threads 8”. A single 8.7-Kb transposable element was annotated in this region, which resembled an LTR/Gypsy element. Based on the divergence of the LTRs, the estimated insertion time was 50,585 years ago [115].

SV genotyping across 154 accessions

Using Illumina short-read sequences for each accession, the genotype of the SV inserted in CitKC1 was determined using ngs_te_mapper2 [116]. Briefly, for each accession reads were aligned to the ‘Fairchild’ genome, and reads aligned to the region around CitKC1, including approximately 2 Kb up and downstream of the gene, were extracted from the BAM file (chr1:2758000–2778000). Alignments with flag 1540 were excluded to remove unmapped reads, reads with failed quality checks, and PCR or optical duplicates. Then, fastq-pair [117] was used to identify only paired-end reads aligned to the focal genomic region. Finally, ngs_te_mapper was used with the parameter -l (library) specified as the 8.7 Kb SV sequence, and -r (genome) as the genomic sequence of + allele of CitKC1 (no SV). Accessions were considered heterozygotes if the allele frequency of the insertion was between 0.25 and 0.75 or homozygous for the insertion if the frequency was over 0.95 [116].

Mining transcript sequences from published data

Publicly available RNA-Seq datasets from Citrus unshiu (Iso-Seq, DRR468881), and sweet orange (ONT FL-cDNA, SRR27295299) were retrieved from the Sequence Read Archive (SRA) using fastq-dump. Local BLASTn searches were carried out using the + allele of CitKC1 transcript as a query. Visual inspection of the alignments was performed in Mega v11.0.13 (Additional File 2: Fig. S20) [118]. To query the expression of the 25 genes in the candidate region, the expression level of CitKC1 (Cs4g20090), Yucca6 (Cs4g19970), and 23 other genes in the LD block was retrieved from supplementary Table 2 in [34]. This data set included RNA-seq analysis of sweet orange fruits across fruit development [34]. The corresponding gene IDs include Cs4g19970 to Cs4g20100 with increments of ten.

Supplementary Information

Supplementary Material 3. (610.8KB, xlsx)

Acknowledgements

We thank Dr. Robert Kreuger from the USDA-ARS Citrus and Date Germplasm Repository for providing access to ‘Fairchild’ for sampling and James Burnette and Venkateswari Jaganatha Chetty for sampling of leaves from ‘Fairchild’. We thank the research staff at the Lindcove Research and Extension Center, including Salvadore Barcenas, Donald Cleek, Ashraf El-Kereamy and supporting team members. This research was supported by USDA-NIFA Award # 2020-70029-33202 to J.E.S. and D.K.S; funding from the Citrus Research Board to D.K.S (Project 5200- 201E); National Science Foundation award to R.J.S. (IOS01856627) and to S.R.W. and J.E.S. (IOS1027542). J.E.S. is a CIFAR fellow in the Fungal Kingdom: Threats and Opportunities program. I.A.D is a fellow in the Plants-3D NSF National Research Traineeship Program (DBI-1922642).

Authors’ contributions

D.K.S, J.E.S, S.R.W, and I.D designed and conceived the study. J.C assembled the ‘Fairchild’ reference genome. T.O annotated the reference genome. R.P and I.D produced the high-throughput phenotyping data. R.S.P produced the whole-genome sequencing data used for GWAS. E.A.D and O.Z. supported characterization of candidate loci. S.S produced the epigenomic datasets. I.D performed data analysis and synthesis. D.K.S, J.E.S, S.R.W, and R.J.S provided suggestions for data interpretation. The paper was written by I.D, D.K.S, T.O, and J.E.S with suggestions and approval from all other authors.

Funding

This research was supported by USDA-NIFA Award # 2020-70029-33202 to J.E.S. and D.K.S; funding from the Citrus Research Board to D.K.S (Project 5200- 201E); National Science Foundation award to R.J.S. (IOS01856627) and to S.R.W. and J.E.S. (IOS1027542). J.E.S. is a CIFAR fellow in the Fungal Kingdom: Threats and Opportunities program. I.A.D is a fellow in the Plants-3D NSF National Research Traineeship Program (DBI-1922642). Computations were performed using the computer clusters and data storage resources of the UC Riverside HPCC, which were funded by grants from NSF (MRI-2215705, MRI-1429826) and NIH (1S10OD016290-01A1).

Data availability

Raw sequencing data used for genome assembly and annotation (Pacbio, 10x Genomic linked-reads, Illumina whole-genome and RNA sequencing) are available as an NCBI BioProject (PRJNA357623). The genome assembly for ‘Fairchild’ is available under the accession number JAZBEX000000000. The raw sequencing data used in the analyses of chromatin accessibility, histone modification, and DNA methylation are available as an NCBI BioProject (PRJNA1062830). Whole-genome sequencing reads used for genome-wide association are also included in NCBI BioProject (PRJNA1062830). Custom scripts used in analyses are available on GitHub [119].

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

R.J.S is a co-founder of REquest Genomics, LLC, a company that provides epigenomic services. All other authors declare no conflict of interest.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Susan R. Wessler, Email: susan.wessler@ucr.edu

Jason E. Stajich, Email: jason.stajich@ucr.edu

Danelle K. Seymour, Email: dseymour@ucr.edu

References

  • 1.Moran BM, Payne C, Langdon Q, Powell DL, Brandvain Y, Schumer M. The genomic consequences of hybridization. Elife. 2021;10. Available from: 10.7554/eLife.69016. [DOI] [PMC free article] [PubMed]
  • 2.Wu GA, Terol J, Ibanez V, López-García A, Pérez-Román E, Borredá C, et al. Genomics of the origin and evolution of citrus. Nature. 2018;554:311–6. [DOI] [PubMed] [Google Scholar]
  • 3.Butelli E, Licciardello C, Zhang Y, Liu J, Mackay S, Bailey P, et al. Retrotransposons control fruit-specific, cold-dependent accumulation of anthocyanins in blood oranges. Plant Cell. 2012;24:1242–55. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Wang N, Song X, Ye J, Zhang S, Cao Z, Zhu C, et al. Structural variation and parallel evolution of apomixis in citrus during domestication and diversification. Natl Sci Rev. 2022;9:nwac114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Wang X, Xu Y, Zhang S, Cao L, Huang Y, Cheng J, et al. Genomic analyses of primitive, wild and cultivated citrus provide insights into asexual reproduction. Nat Genet. 2017;49:765–72. [DOI] [PubMed] [Google Scholar]
  • 6.Alonge M, Wang X, Benoit M, Soyk S, Pereira L, Zhang L, et al. Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell. 2020;182:145–e6123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Zhou Y, Minio A, Massonnet M, Solares E, Lv Y, Beridze T, et al. The population genetics of structural variants in grapevine domestication. Nat Plants. 2019;5:965–79. [DOI] [PubMed] [Google Scholar]
  • 8.Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods. 2013;10:1213–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Lu Z, Marand AP, Ricci WA, Ethridge CL, Zhang X, Schmitz RJ. The prevalence, evolution and chromatin signatures of plant regulatory elements. Nat Plants. 2019;5:1250–9. [DOI] [PubMed] [Google Scholar]
  • 10.Rodgers-Melnick E, Vera DL, Bass HW, Buckler ES. Open chromatin reveals the functional maize genome. Proc Natl Acad Sci U S A. 2016;113:E3177–84. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Bai L, Morozov AV. Gene regulation by nucleosome positioning. Trends Genet. 2010;26:476–83. [DOI] [PubMed] [Google Scholar]
  • 12.Zaret KS, Carroll JS. Pioneer transcription factors: Establishing competence for gene expression. Genes Dev. 2011;25:2227–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Bulyk ML, Drouin J, Harrison MM, Taipale J, Zaret KS. Pioneer factors - key regulators of chromatin and gene expression. Nat Rev Genet. 2023; Available from: 10.1038/s41576-023-00648-z. [DOI] [PubMed]
  • 14.Ricci WA, Lu Z, Ji L, Marand AP, Ethridge CL, Murphy NG, et al. Widespread long-range cis-regulatory elements in the maize genome. Nat Plants. 2019; Available from: 10.1038/s41477-019-0547-0. [DOI] [PMC free article] [PubMed]
  • 15.Rada-Iglesias A, Bajpai R, Swigut T, Brugmann SA, Flynn RA, Wysocka J. A unique chromatin signature uncovers early developmental enhancers in humans. Nature. 2011;470:279–83. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, et al. Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc Natl Acad Sci U S A. 2010;107:21931–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Studer A, Zhao Q, Ross-Ibarra J, Doebley J. Identification of a functional transposon insertion in the maize domestication gene tb1. Nat Genet. 2011;43:1160–3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Kodaira K-S, Qin F, Tran L-SP, Maruyama K, Kidokoro S, Fujita Y, et al. Arabidopsis Cys2/His2 zinc-finger proteins AZF1 and AZF2 negatively regulate abscisic acid-repressive and auxin-inducible genes under abiotic stress conditions. Plant Physiol. 2011;157:742–56. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Kobayashi S, Goto-Yamamoto N, Hirochika H. Retrotransposon-induced mutations in grape skin color. Science. 2004;304:982. [DOI] [PubMed] [Google Scholar]
  • 20.Greene B, Walko R, Hake S. Mutator insertions in an intron of the maize knotted1 gene result in dominant suppressible mutations. Genetics. 1994;138:1275–85. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Batista RA, Moreno-Romero J, Qiu Y, van Boven J, Santos-González J, Figueiredo DD, et al. The MADS-box transcription factor PHERES1 controls imprinting in the endosperm by binding to domesticated transposons. Elife. 2019;8. Available from: 10.7554/eLife.50541. [DOI] [PMC free article] [PubMed]
  • 22.Machanick P, Bailey TL. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics. 2011;27:1696–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Han J, Lopez-Arredondo D, Yu G, Wang Y, Wang B, Wall SB, et al. Genome-wide chromatin accessibility analysis unveils open chromatin convergent evolution during polyploidization in cotton. Proc Natl Acad Sci U S A. 2022;119:e2209743119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Wang X, Goodrich KJ, Gooding AR, Naeem H, Archer S, Paucek RD, et al. Targeting of polycomb repressive complex 2 to RNA by short repeats of consecutive guanines. Mol Cell. 2017;65:1056–e675. [DOI] [PubMed] [Google Scholar]
  • 25.Véry A-A, Sentenac H. Molecular mechanisms and regulation of K + transport in higher plants. Annu Rev Plant Biol. 2003;54:575–603. [DOI] [PubMed] [Google Scholar]
  • 26.Jeanguenin L, Alcon C, Duby G, Boeglin M, Chérel I, Gaillard I, et al. AtKC1 is a general modulator of Arabidopsis inward shaker channel activity. Plant J. 2011;67:570–82. [DOI] [PubMed] [Google Scholar]
  • 27.Lu Y, Yu M, Jia Y, Yang F, Zhang Y, Xu X, et al. Structural basis for the activity regulation of a potassium channel AKT1 from Arabidopsis. Nat Commun. 2022;13:5682. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Wang Y, He L, Li H-D, Xu J, Wu W-H. Potassium channel alpha-subunit AtKC1 negatively regulates AKT1-mediated K(+) uptake in Arabidopsis roots under low-K(+) stress. Cell Res. 2010;20:826–37. [DOI] [PubMed] [Google Scholar]
  • 29.Alva AK, Mattos D Jr, Paramasivam S, Patil B, Dou H, Sajwan KS. Potassium management for optimizing citrus production and quality. Int J Fruit Sci. 2006;6:3–43. [Google Scholar]
  • 30.Beheiry HR, Hasanin MS, Abdelkhalek A, Hussein HAZ. Potassium Spraying Preharvest and Nanocoating Postharvest Improve the Quality and Extend the Storage Period for Acid Lime (Citrus aurantifolia Swingle) Fruits. Plants. 2023;12. Available from: 10.3390/plants12223848. [DOI] [PMC free article] [PubMed]
  • 31.Yasin Ashraf M, Gul A, Hussain F, Ebert G. Improvement in yield and quality of Kinnow (citrus deliciosa x citrus nobilis) by potassium fertilization. J Plant Nutr. 2010;33:1625–37. [Google Scholar]
  • 32.Wu B, Yu Q, Deng Z, Duan Y, Luo F, Gmitter F Jr. A chromosome-level phased genome enabling allele-level studies in sweet orange: a case study on citrus Huanglongbing tolerance. Hortic Res. 2023;10:uhac247. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Bu H, Yu W, Yuan H, Yue P, Wei Y, Wang A. Endogenous auxin content contributes to larger size of Apple fruit. Front Plant Sci. 2020;11:592540. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Feng G, Wu J, Xu Y, Lu L, Yi H. High-spatiotemporal-resolution transcriptomes provide insights into fruit development and ripening in citrus sinensis. Plant Biotechnol J. 2021;19:1337–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Wittkopp PJ, Haerum BK, Clark AG. Regulatory changes underlying expression differences within and between drosophila species. Nat Genet. 2008;40:346–50. [DOI] [PubMed] [Google Scholar]
  • 36.Metzger BPH, Wittkopp PJ, Coolon JD. Evolutionary dynamics of regulatory changes underlying gene expression divergence among Saccharomyces species. Genome Biol Evol. 2017;9:843–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Wang M, Li Z, Zhang Y, ’e, Zhang Y, Xie Y, Ye L, et al. An atlas of wheat epigenetic regulatory elements reveals subgenome divergence in the regulation of development and stress responses. Plant Cell. 2021;33:865–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Li E, Liu H, Huang L, Zhang X, Dong X, Song W, et al. Long-range interactions between proximal and distal regulatory regions in maize. Nat Commun. 2019;10:2633. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Liu C, Wang C, Wang G, Becker C, Zaidem M, Weigel D. Genome-wide analysis of chromatin packing in Arabidopsis Thaliana at single-gene resolution. Genome Res. 2016;26:1057–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Sun H, Jiao W-B, Krause K, Campoy JA, Goel M, Folz-Donahue K, et al. Chromosome-scale and haplotype-resolved genome assembly of a tetraploid potato cultivar. Nat Genet. 2022;54:342–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Lyu X, Xia Y, Wang C, Zhang K, Deng G, Shen Q, et al. Pan-genome analysis sheds light on structural variation-based dissection of agronomic traits in melon crops. Plant Physiol. 2023;193:1330–48. [DOI] [PubMed] [Google Scholar]
  • 42.Massonnet M, Cochetel N, Minio A, Vondras AM, Lin J, Muyle A, et al. The genetic basis of sex determination in grapes. Nat Commun. 2020;11:2902. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Wang X, Gao L, Jiao C, Stravoravdis S, Hosmani PS, Saha S, et al. Genome of solanum pimpinellifolium provides insights into structural variants during tomato breeding. Nat Commun. 2020;11:5817. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Cridland JM, Thornton KR, Long AD. Gene expression variation in drosophila melanogaster due to rare transposable element insertion alleles of large effect. Genetics. 2015;199:85–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Ahmed I, Sarazin A, Bowler C, Colot V, Quesneville H. Genome-wide evidence for local DNA methylation spreading from small RNA-targeted sequences in Arabidopsis. Nucleic Acids Res. 2011;39:6919–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Eichten SR, Ellis NA, Makarevitch I, Yeh C-T, Gent JI, Guo L, et al. Spreading of heterochromatin is limited to specific families of maize retrotransposons. PLoS Genet. 2012;8:e1003127. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Jin R, Klasfeld S, Zhu Y, Fernandez Garcia M, Xiao J, Han S-K, et al. LEAFY is a pioneer transcription factor and licenses cell reprogramming to floral fate. Nat Commun. 2021;12:626. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Xiao J, Jin R, Yu X, Shen M, Wagner JD, Pai A, et al. Cis and trans determinants of epigenetic Silencing by polycomb repressive complex 2 in Arabidopsis. Nat Genet. 2017;49:1546–52. [DOI] [PubMed] [Google Scholar]
  • 49.Tian Z, Li X, Li M, Wu W, Zhang M, Tang C, et al. Crystal structures of REF6 and its complex with DNA reveal diverse recognition mechanisms. Cell Discov. 2020;6:17. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Li C, Gu L, Gao L, Chen C, Wei C-Q, Qiu Q, et al. Concerted genomic targeting of H3K27 demethylase REF6 and chromatin-remodeling ATPase BRM in Arabidopsis. Nat Genet. 2016;48:687–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Cui X, Lu F, Qiu Q, Zhou B, Gu L, Zhang S, et al. REF6 recognizes a specific DNA sequence to demethylate H3K27me3 and regulate organ boundary formation in Arabidopsis. Nat Genet. 2016;48:694–9. [DOI] [PubMed] [Google Scholar]
  • 52.Wang H, Ren J, Zhou S, Duan Y, Zhu C, Chen C, et al. Molecular regulation of oil gland development and biosynthesis of essential oils in Citrus spp. Science. 2024;383:659–66. [DOI] [PubMed] [Google Scholar]
  • 53.Bain JM. Morphological, anatomical, and physiological changes in the developing fruit of the Valencia orange, citrus sinensis (L) Osbeck. Aust J Bot. 1958;6:1. [Google Scholar]
  • 54.Zekri M, Obreza T. Potassium (K) for Citrus Trees. EDIS. 2013;2013. Available from: https://journals.flvc.org/edis/article/view/121082.
  • 55.Cao H, Hastie AR, Cao D, Lam ET, Sun Y, Huang H, et al. Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology. Gigascience. 2014;3:34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Lam ET, Hastie A, Lin C, Ehrlich D, Das SK, Austin MD, et al. Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly. Nat Biotechnol. 2012;30:771–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020;21:245. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Lu Z, Hofmeister BT, Vollmers C, DuBois RM, Schmitz RJ. Combining ATAC-seq with nuclei sorting for discovery of cis-regulatory regions in plant genomes. Nucleic Acids Res. 2017;45:e41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Zhang X, Clarenz O, Cokus S, Bernatavichute YV, Pellegrini M, Goodrich J, et al. Whole-genome analysis of histone H3 lysine 27 trimethylation in Arabidopsis. PLoS Biol. 2007;5:e129. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Urich MA, Nery JR, Lister R, Schmitz RJ, Ecker JR. MethylC-seq library Preparation for base-resolution whole-genome bisulfite sequencing. Nat Protoc. 2015;10:475–83. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A. 2020;117:9451–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Smit AFA, Hubley R, Green P. 2013–2015. RepeatMasker Open-4.0. 2013–5. http://www.repeatmasker.org.
  • 63.Palmer JM, Stajich J. Funannotate v1.8.1: Eukaryotic genome annotation. Zenodo; 2020. Available from: 10.5281/zenodo.1134477.
  • 64.Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, et al. De Novo transcript sequence reconstruction from RNA-seq using the trinity platform for reference generation and analysis. Nat Protoc. 2013;8:1494–512. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, et al. Automated eukaryotic gene structure annotation using evidencemodeler and the program to assemble spliced alignments. Genome Biol. 2008;9:R7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Korf I. Gene finding in novel genomes. BMC Bioinformatics. 2004;5:59. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Majoros WH, Pertea M, Salzberg SL. TigrScan and glimmerhmm: two open source Ab initio eukaryotic gene-finders. Bioinformatics. 2004;20:2878–9. [DOI] [PubMed] [Google Scholar]
  • 69.Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. Gene prediction in novel fungal genomes using an Ab initio algorithm with unsupervised training. Genome Res. 2008;18:1979–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Testa AC, Hane JK, Ellwood SR, Oliver RP. CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts. BMC Genomics. 2015;16:170. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de Novo gene finding. Bioinformatics. 2008;24:637–44. [DOI] [PubMed] [Google Scholar]
  • 72.Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–64. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7:e1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60. [DOI] [PubMed] [Google Scholar]
  • 75.Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18:366–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019;47:D427–32. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Zhang H, Yohe T, Huang L, Entwistle S, Wu P, Yang Z, et al. dbCAN2: a meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 2018;46:W95–101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Huang L, Zhang H, Wu P, Entwistle S, Li X, Yohe T, et al. dbCAN-seq: a database of carbohydrate-active enzyme (CAZyme) sequence and annotation. Nucleic Acids Res. 2018;46:D516–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Drula E, Garron M-L, Dogan S, Lombard V, Henrissat B, Terrapon N. The carbohydrate-active enzyme database: functions and literature. Nucleic Acids Res. 2022;50:D571–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Cantarel BL, Coutinho PM, Rancurel C, Bernard T, Lombard V, Henrissat B. The Carbohydrate-Active enzymes database (CAZy): an expert resource for glycogenomics. Nucleic Acids Res. 2009;37:D233–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Rawlings ND, Barrett AJ, Thomas PD, Huang X, Bateman A, Finn RD. The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database. Nucleic Acids Res. 2018;46:D624–32. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: functional Annotation, orthology Assignments, and domain prediction at the metagenomic scale. Mol Biol Evol. 2021;38:5825–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Blum M, Chang H-Y, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, et al. The interpro protein families and domains database: 20 years on. Nucleic Acids Res. 2021;49:D344–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Käll L, Krogh A, Sonnhammer ELL. A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 2004;338:1027–36. [DOI] [PubMed] [Google Scholar]
  • 86.Almagro Armenteros JJ, Tsirigos KD, Sønderby CK, Petersen TN, Winther O, Brunak S, et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat Biotechnol. 2019;37:420–3. [DOI] [PubMed] [Google Scholar]
  • 87.Bishara A, Liu Y, Weng Z, Kashef-Haghighi D, Newburger DE, West R, et al. Read clouds uncover variation in complex regions of the human genome. Genome Res. 2015;25:1570–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter Estimation from sequencing data. Bioinformatics. 2011;27:2987–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics. 2014;30:2114–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Li H. Aligning sequence reads, clone sequences and assembly con*gs with BWA-MEM. figshare; 2014. Available from: https://figshare.com/articles/poster/Aligning_sequence_reads_clone_sequences_and_assembly_con_gs_with_BWA_MEM/963153/1. [cited 2022 Jun 5].
  • 92.Broad Institute. Picard Toolkit. Broad Institute, GitHub repository. Available from: https://broadinstitute.github.io/picard/.
  • 93.Martin M, Ebert P, Marschall T. Read-Based phasing and analysis of phased variants with whatshap. In: Peters BA, Drmanac R, editors. Haplotyping: methods and protocols. New York, NY: Springer US; 2023. pp. 127–38. [DOI] [PubMed] [Google Scholar]
  • 94.Aleza P, Cuenca J, Hernández M, Juárez J, Navarro L, Ollitrault P. Genetic mapping of centromeres in the nine citrus Clementina chromosomes using half-tetrad analysis and recombination patterns in unreduced and haploid gametes. BMC Plant Biol. 2015;15:80. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Peng Z, Bredeson JV, Wu GA, Shu S, Rawat N, Du D, et al. A chromosome-scale reference genome of trifoliate orange (Poncirus trifoliata) provides insights into disease resistance, cold tolerance and genome evolution in citrus. Plant J. 2020;104:1215–32. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:R137. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97.Montgomery A, peakCallingBenchmark G. 2020. Available from: https://github.com/bigmonty12/peakCallingBenchmark. [cited 2023 Jan 18].
  • 98.Zhou Q, Lim J-Q, Sung W-K, Li G. An integrated package for bisulfite DNA methylation data analysis with Indel-sensitive mapping. BMC Bioinformatics. 2019;20:47.10.1186/s12859-018-2593-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Song Q, Decato B, Hong EE, Zhou M, Fang F, Qu J, et al. A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics. PLoS ONE. 2013;8:e81148. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Altham PME. Exact bayesian analysis of a 2 times 2 contingency table, and fisher’s exact significance test. J R Stat Soc. 1969;31:261–9. [Google Scholar]
  • 101.Ramírez F, Ryan DP, Grüning B, Bhardwaj V, Kilpert F, Richter AS, et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016;44:W160–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103.van de Geijn B, McVicker G, Gilad Y, Pritchard JK. WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat Methods. 2015;12:1061–3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Edsgärd D, Iglesias MJ, Reilly S-J, Hamsten A, Tornvall P, Odeberg J, et al. GeneiASE: detection of condition-dependent and static allele-specific expression from RNA-seq data without haplotype information. Sci Rep. 2016;6:21134. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105.Lipták T. On the combination of independent tests. Magyar Tud Akad Mat Kutato Int Kozl. 1958;3:171–97.
  • 106.Gel B, Díez-Villanueva A, Serra E, Buschbeck M, Peinado MA, Malinverni R. RegioneR: an R/Bioconductor package for the association analysis of genomic regions based on permutation tests. Bioinformatics. 2016;32:289–91. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.O’Malley RC, Huang S-SC, Song L, Lewsey MG, Bartlett A, Nery JR, et al. Cistrome and epicistrome features shape the regulatory DNA landscape. Cell. 2016;165:1280–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Putri GH, Anders S, Pyl PT, Pimanda JE, Zanini F. Analysing high-throughput sequencing data in python with HTSeq 2.0. Bioinformatics. 2022;38:2943–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22. [PMC free article] [PubMed] [Google Scholar]
  • 110.CRAN. lme4 citation info. Available from: https://cran.r-project.org/web/packages/lme4/citation.html. [cited 2024 Jun 22].
  • 111.Bates D, Mächler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. J Statis Software. 2015;1–48. Available from: 10.18637/jss.v067.i01.
  • 112.Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 2012;44:821–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113.Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114.Ou S, Su W, Liao Y, Chougule K, Agda JRA, Hellinga AJ, et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 2019;20:275. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 115.SanMiguel P, Gaut BS, Tikhonov A, Nakajima Y, Bennetzen JL. The paleontology of intergene retrotransposons of maize. Nat Genet. 1998;20:43–5. [DOI] [PubMed] [Google Scholar]
  • 116.Han S, Basting PJ, Dias GB, Luhur A, Zelhof AC, Bergman CM. Transposable element profiles reveal cell line identity and loss of heterozygosity in Drosophila cell culture. Genetics. 2021;219. Available from: 10.1093/genetics/iyab113. [DOI] [PMC free article] [PubMed]
  • 117.Edwards JA, Edwards RA. Fastq-pair: efficient synchronization of paired-end fastq files. bioRxiv. 2019;552885. Available from: https://www.biorxiv.org/content/. [cited 2025 Mar 21].
  • 118.Tamura K, Stecher G, Kumar S. MEGA11: molecular evolutionary genetics analysis version 11. Mol Biol Evol. 2021;38:3022–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 119.Diaz I. Haplotype-phased-genome-of-Fairchild-mandarin: Custom scripts used in analysis performed in manuscript, Haplotype phased genome of ‘Fairchild’ mandarin highlights influence of local chromatin state on gene expression. Github; 2024. Available from: https://github.com/IsaacDiaz026/Haplotype-phased-genome-of-Fairchild-mandarin. [cited 2024 Jan 12].

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material 3. (610.8KB, xlsx)

Data Availability Statement

Raw sequencing data used for genome assembly and annotation (Pacbio, 10x Genomic linked-reads, Illumina whole-genome and RNA sequencing) are available as an NCBI BioProject (PRJNA357623). The genome assembly for ‘Fairchild’ is available under the accession number JAZBEX000000000. The raw sequencing data used in the analyses of chromatin accessibility, histone modification, and DNA methylation are available as an NCBI BioProject (PRJNA1062830). Whole-genome sequencing reads used for genome-wide association are also included in NCBI BioProject (PRJNA1062830). Custom scripts used in analyses are available on GitHub [119].


Articles from BMC Genomics are provided here courtesy of BMC

RESOURCES