PLoS Comput Biol. 2021 Sep 27;17(9):e1009444. doi: 10.1371/journal.pcbi.1009444

GRAFIMO: Variant and haplotype aware motif scanning on pangenome graphs

Manuel Tognon 1, Vincenzo Bonnici 1, Erik Garrison 2, Rosalba Giugno 1,*, Luca Pinello 3,4,5,*
Editor: Mihaela Pertea 6
PMCID: PMC8519448  PMID: 34570769

Abstract

Transcription factors (TFs) are proteins that promote or reduce the expression of genes by binding short genomic DNA sequences known as transcription factor binding sites (TFBS). While several tools have been developed to scan for potential occurrences of TFBS in linear DNA sequences or reference genomes, no tool exists to find them in pangenome variation graphs (VGs). VGs are sequence-labelled graphs that can efficiently encode collections of genomes and their variants in a single, compact data structure. Because VGs can losslessly compress large pangenomes, TFBS scanning in VGs can efficiently capture how genomic variation affects the potential binding landscape of TFs in a population of individuals. Here we present GRAFIMO (GRAph-based Finding of Individual Motif Occurrences), a command-line tool for the scanning of known TF DNA motifs represented as Position Weight Matrices (PWMs) in VGs. GRAFIMO extends the standard PWM scanning procedure by considering variations and alternative haplotypes encoded in a VG. Using GRAFIMO on a VG based on individuals from the 1000 Genomes project we recover several potential binding sites that are enhanced, weakened or missed when scanning only the reference genome, and which could constitute individual-specific binding events. GRAFIMO is available as an open-source tool, under the MIT license, at https://github.com/pinellolab/GRAFIMO and https://github.com/InfOmics/GRAFIMO.

Author summary

Transcription factors (TFs) are key regulatory proteins and mutations occurring in their binding sites can alter the normal transcriptional landscape of a cell and lead to disease states. Pangenome variation graphs (VGs) efficiently encode genomes from a population of individuals and their genetic variations. GRAFIMO is an open-source tool that extends the traditional PWM scanning procedure to VGs. By scanning for potential TFBS in VGs, GRAFIMO can simultaneously search thousands of genomes while accounting for SNPs, indels, and structural variants. GRAFIMO reports motif occurrences, their statistical significance, frequency, and location within the reference or alternative haplotypes in a given VG. GRAFIMO makes it possible to study how genetic variation affects the binding landscape of known TFs within a population of individuals.


This is a PLOS Computational Biology Software paper.

Introduction

Transcription factors (TFs) are fundamental proteins that regulate transcriptional processes. They bind short (7–20 bp) genomic DNA sequences called transcription factor binding sites (TFBS) [1]. Often, the binding sites of a given TF show recurring sequence patterns, which are referred to as motifs. Motifs can be represented and summarized using Position Weight Matrices (PWMs) [2], which encode the probability of observing a given nucleotide in a given position of a binding site. In recent years, several tools have been proposed for scanning regulatory DNA regions, such as enhancers or promoters, with the goal of predicting which TF may bind these genomic locations. Importantly, it has been shown that regulatory motifs are under purifying selection [3,4], and mutations occurring in these regions can lead to deleterious consequences on the transcriptional states of a cell [5]. In fact, mutations can weaken, disrupt or create new TFBS and therefore alter expression of nearby genes. Mutations altering TFBS can occur in haplotypes that are conserved within a population or private to even a single individual, and can correspond to different phenotypic behaviour [6,7]. For these reasons, population-level analysis of variability in TFBSs is of crucial importance to understand the effect of common or rare variants on gene regulation. Recently, a new class of methods and data structures based on genome graphs have enabled us to succinctly record and efficiently query thousands of genomes [8]. Genome graphs optimally encode shared and individual haplotypes based on a population of individuals. An efficient and scalable implementation of this approach, called variation graphs (VGs), has been recently proposed [9]. Briefly, a VG is a graph where nodes correspond to DNA sequences and edges describe allowed links between successive sequences. 
Paths through the graph, which may be labelled (such as in the case of a reference genome), correspond to haplotypes belonging to different genomes [10]. Variants like SNPs and indels form bubbles in the graph, where diverging paths through the graph are anchored by a common start and end sequence on the reference [11]. VGs offer new opportunities to extend classic genome analyses originally designed for a single reference sequence to a panel of individuals. Moreover, by encoding individual haplotypes, VGs have been shown to be an effective framework to capture the potential effects of personal genetic variants on functional genomic regions profiled by ChIP-seq of histone marks [12]. During the last decade, several methods have been developed to search TFBS on linear reference genomes, such as FIMO [13] and MOODS [14], or to account for SNPs and short indels, such as is-rSNP, TRAP and atSNP [15–17]; however, these tools neither account for individual haplotypes nor provide a summary of the frequency of these events in a population. To solve these challenges, we have developed GRAFIMO, a tool that offers a variation- and haplotype-aware identification of TFBS in VGs. Here, we show the utility of GRAFIMO by searching TFBS on a VG encoding the haplotypes from all the individuals sequenced by the 1000 Genomes Project (1000GP) [18,19].
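As a rough illustration of the data structure described above, the following Python sketch models a toy variation graph containing a single SNP bubble. The class name, node identifiers and sequences are hypothetical and chosen only for illustration; this is not the data model or API of the vg toolkit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ToyVariationGraph:
    """Hypothetical, minimal stand-in for a variation graph."""
    nodes: Dict[int, str] = field(default_factory=dict)          # node id -> DNA sequence
    edges: Set[Tuple[int, int]] = field(default_factory=set)     # allowed links between nodes
    paths: Dict[str, List[int]] = field(default_factory=dict)    # path name -> ordered node ids

    def add_path(self, name: str, node_ids: List[int]) -> None:
        # A path is only valid if every consecutive pair of nodes is linked by an edge.
        for left, right in zip(node_ids, node_ids[1:]):
            if (left, right) not in self.edges:
                raise ValueError(f"no edge between nodes {left} and {right}")
        self.paths[name] = list(node_ids)

    def spell(self, name: str) -> str:
        # Concatenate node sequences along the path to recover the haplotype sequence.
        return "".join(self.nodes[node_id] for node_id in self.paths[name])


# Nodes 2 and 3 form a bubble (A/G SNP) anchored by the shared nodes 1 and 4.
graph = ToyVariationGraph(
    nodes={1: "ACGT", 2: "A", 3: "G", 4: "TTCA"},
    edges={(1, 2), (1, 3), (2, 4), (3, 4)},
)
graph.add_path("reference", [1, 2, 4])
graph.add_path("haplotype_1", [1, 3, 4])

reference_sequence = graph.spell("reference")
alternative_sequence = graph.spell("haplotype_1")
```

In this toy graph the two paths share the flanking nodes and differ only in the node they take through the bubble, which is how a single structure can hold many haplotypes without storing each full sequence.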

Design and implementation

GRAFIMO is a command-line tool that enables a variant- and haplotype-aware search of TFBS within a population of individuals encoded in a VG. GRAFIMO offers two main functionalities: the construction of custom VGs from user data and the search of one or more TF motifs in precomputed VGs. Briefly, given a TF model (PWM) and a set of genomic regions, GRAFIMO leverages the VG to efficiently scan and report all the TFBS candidates and their frequency in the different haplotypes in a single pass, together with the predicted changes in binding affinity mediated by genetic variations. GRAFIMO is written in Python3 and Cython and it has been designed to easily interface with the vg software suite [9]. For details on how to install and run GRAFIMO see S1 Text section 7.

Genome variation graph construction

GRAFIMO provides a simple command-line interface to build custom genome variation graphs if necessary. Given a reference genome (FASTA format) and a set of genomic variants with respect to the reference (VCF format), GRAFIMO interfaces with the vg software suite to build the main VG data structure, the XG graph index [9] and the GBWT index [10,20] used to track the haplotypes within the VG. To minimize the footprint of these files and speed up the computation, GRAFIMO constructs the genome variation graph by building a VG for each chromosome. This also speeds up the search operation since we can scan different chromosomes in parallel. Alternatively, the search can be performed one chromosome at a time on machines with limited RAM.

Transcription factor motif search

The motif search operation takes as input a set of genomes encoded in a VG (.xg format), a database of known TF motifs (PWM in JASPAR [21] or MEME format [22]) and a set of genomic regions (BED format), and reports all the TFBS motif occurrences in those regions and their estimated significance (Fig 1). To search for potential TFBS, GRAFIMO slides a window of length k (where k is the width of the query motif) along the paths of the VG corresponding to the genomic sequences encoded in it (Fig 1B). This is accomplished by an extension to the vg find function, which uses the GBWT index of the graph to explore the k-mer space of the graph while accounting for the haplotypes embedded in it [10]. By default, GRAFIMO considers only paths that correspond to observed haplotypes; however, it is also possible to consider all possible recombinants even if they are not present in any individual. The significance (log-likelihood) of each potential binding site is calculated by considering the nucleotide preferences encoded in the PWM as in FIMO [13]. More precisely, the PWM is processed to a Position Specific Scoring Matrix (PSSM) (Fig 1A) and the resulting log-likelihood values are then scaled in the range [0, 1000] to efficiently calculate a statistical significance, i.e. a P-value, by dynamic programming [23] as in FIMO [13]. P-values are then converted to q-values by using the Benjamini-Hochberg procedure to account for multiple hypothesis testing. For this procedure, we consider all the P-values corresponding to all the k-mer-paths extracted within the scanned regions on the VG. GRAFIMO also computes the number of haplotypes in which a significant motif is observed and whether it is present in the reference genome and/or in alternative genomes (Fig 1B).
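The scoring steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used by GRAFIMO or FIMO: the pseudocount, the uniform background, the toy three-column PWM and all function names are hypothetical.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def pwm_to_pssm(pwm, background, pseudocount=0.1):
    """Convert per-position probabilities into log2-odds scores (assumed pseudocount)."""
    pssm = []
    for column in pwm:
        scores = {}
        for base in BASES:
            prob = (column[base] + pseudocount * background[base]) / (1.0 + pseudocount)
            scores[base] = math.log2(prob / background[base])
        pssm.append(scores)
    return pssm


def scale_pssm(pssm, score_range=1000):
    """Map every matrix entry to an integer in [0, score_range]."""
    values = [score for column in pssm for score in column.values()]
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("matrix is constant and cannot be scaled")
    factor = score_range / (highest - lowest)
    scaled = [{b: round((s - lowest) * factor) for b, s in col.items()} for col in pssm]
    return scaled, lowest, factor


def score_distribution(scaled, background):
    """Dynamic programming over columns: probability of each total scaled score."""
    dist = {0: 1.0}
    for column in scaled:
        updated = defaultdict(float)
        for total, prob in dist.items():
            for base in BASES:
                updated[total + column[base]] += prob * background[base]
        dist = updated
    return dist


def p_value(score, dist):
    """Probability of observing a scaled score at least as high as `score`."""
    return sum(prob for total, prob in dist.items() if total >= score)


def benjamini_hochberg(p_values):
    """Convert P-values to q-values with the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


background = {base: 0.25 for base in BASES}  # assumed uniform background
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
scaled, offset, factor = scale_pssm(pwm_to_pssm(pwm, background))
dist = score_distribution(scaled, background)

best = sum(column[base] for column, base in zip(scaled, "AGC"))
best_p = p_value(best, dist)                      # only "AGC" reaches the top score: 1/64
best_log_odds = best / factor + len(scaled) * offset  # undo the scaling for reporting
```

Working with integer scores keeps the dynamic programming table small, because the number of distinct totals is bounded by the motif width times the scaling range rather than by the number of possible k-mers.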

Fig 1. GRAFIMO TF motif search workflow.

(A) The motif PWM (in MEME or JASPAR format) is processed and its values are scaled in the range [0, 1000]. The resulting score matrix is used to assign a score and a corresponding P-value to each motif occurrence candidate. In the final report GRAFIMO returns the corresponding log-odds scores, which are retrieved from the scaled values. (B) GRAFIMO slides a window of length k, where k is the motif width, along the haplotypes (paths in the graph) of the genomes used to build the VG. The resulting sequences are scored using the motif scoring matrix and are statistically tested, assigning each of them a P-value and a q-value. Moreover, each entry is assigned a flag stating whether it belongs to the reference genome sequence ("ref") or contains genomic variants ("non.ref"), and the number of haplotypes in which the sequence appears is computed.
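The window enumeration in panel B can be mimicked on plain strings. The sketch below is a simplified illustration that assumes SNP-only haplotypes of equal length sharing one coordinate system; GRAFIMO itself walks graph paths through the GBWT index rather than materialized sequences, and the function name and toy sequences are hypothetical.

```python
from typing import Dict, Tuple


def window_table(reference: str, haplotypes: Dict[str, str], k: int) -> Dict[Tuple[int, str], dict]:
    """Collect every distinct k-mer per start position with its flag and haplotype count."""
    table: Dict[Tuple[int, str], dict] = {}
    for sequence in haplotypes.values():
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            # "ref" when the window matches the reference at the same position.
            flag = "ref" if kmer == reference[start:start + k] else "non.ref"
            entry = table.setdefault((start, kmer), {"flag": flag, "haplotypes": 0})
            entry["haplotypes"] += 1
    return table


reference = "ACGTATTCA"
haplotypes = {
    "sample1_hap1": "ACGTATTCA",  # identical to the reference
    "sample1_hap2": "ACGTGTTCA",  # carries an A>G SNP at index 4
    "sample2_hap1": "ACGTGTTCA",
}
table = window_table(reference, haplotypes, k=4)
```

Each distinct window is stored once and annotated with how many haplotypes carry it, so a sequence shared by thousands of individuals would be scored a single time.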

Report generation

We have designed the interface of GRAFIMO based on FIMO, so it can be used as a drop-in replacement for tools built on top of FIMO. As in FIMO, the results are available in three files: a tab-delimited file (TSV), an HTML report and a GFF3 file compatible with the UCSC Genome Browser [24]. The TSV report (Fig A in S1 Text) contains for each candidate TFBS its score, genomic location (start, stop and strand), P-value, q-value, the number of haplotypes in which it is observed and a flag value to assess if it belongs to the reference or to the other genomes in the VG. The HTML version of the TSV report (Fig B in S1 Text) can be viewed with any web browser. The GFF3 file (Fig C in S1 Text) can be loaded on the UCSC genome browser as a custom track, to visualize and explore the recovered TFBS with other annotations such as nearby genes, enhancers, promoters, or pathogenic variants from the ClinVar database [25].

Results

As discussed above, GRAFIMO can be used to study how genetic variants may affect the binding affinity of potential TFBS within a set of individuals and may recover additional sites that are missed when considering only linear reference genomes without information about variants. To showcase its utility, we first constructed a VG based on 2548 individuals from the 1000GP phase 3 (hg38 human genome assembly) encoding their genomic variants and phased haplotypes (see S1 Text section 1 for details). We then searched this VG for putative TFBS for three TF motifs with different lengths (from 11 to 19 bp), evolutionary conservation, and information content from the JASPAR database [21]: CTCF (JASPAR ID MA0139.1), ATF3 (JASPAR ID MA0605.2) and GATA1 (JASPAR ID MA0035.4) (Fig D in S1 Text) (see S1 Text section 3-4-5). To study regions with likely true binding events, for each factor we selected regions corresponding to peaks (top 3000 sorted by q-value) obtained by ChIP-seq experiments in 6 different cell types (A549, GM12878, H1, HepG2, K562, MCF-7) from the ENCODE project [26,27] (see S1 Text section 2). We used GRAFIMO to scan these regions and selected for our downstream analyses only sites with a P-value < 1e-4 and considered them as potential TFBS for these factors. Based on the recovered sites, we consistently observed across the 3 studied TFs that genetic variants can significantly affect estimated binding affinity. In fact, we found that thousands of CTCF motif occurrences are found only in non-reference haplotypes, suggesting that a considerable number of TFBS candidates are lost when scanning the genome for TFBS without accounting for genetic variants (Fig 2A). Similar results were obtained searching for ATF3 (Fig E in S1 Text) and GATA1 (Fig F in S1 Text). We also found several highly significant CTCF motif occurrences in rare haplotypes that may potentially modulate gene expression in these individuals (Fig 2B). 
Similar behaviours were observed for ATF3 (Fig E in S1 Text) and GATA1 (Fig F in S1 Text).

Fig 2. Searching CTCF motif on VG with GRAFIMO provides an insight on how genetic variation affects putative binding sites.

(A) Statistically significant (P-value < 1e-4) and non-significant potential CTCF occurrences found with GRAFIMO in the reference and in the haplotype sequences of the hg38 1000GP VG. (B) Statistical significance of retrieved potential CTCF motif occurrences and frequency of the corresponding haplotypes embedded in the VG. (C) Percentage of statistically significant CTCF potential binding sites found only in the reference genome or alternative haplotypes and with modulated binding scores based on 1000GP genetic variants. (D) Percentage of population specific and common (shared by two or more populations) potential CTCF binding sites present on individual haplotypes.

We also investigated the potential effects of the different length and type of mutations i.e. SNPs and indels on the CTCF, ATF3 and GATA1 binding sites. However, we did not observe a clear and general trend (Fig G in S1 Text). By considering the genomic locations of the significant motif occurrences we next investigated how often individual TFBS may be disrupted, created or modulated. We observed that 6.13% of the potential CTCF binding sites can be found only on non-reference haplotype sequences, 5.94% are disrupted by variants in non-reference haplotypes and ~30% are still significant in non-reference haplotypes but with different binding scores (Fig 2C). Similar results were observed for ATF3 (Fig E in S1 Text) and GATA1 (Fig F in S1 Text). Interestingly, we observed that a large fraction of putative binding sites recovered only on individual haplotypes are population specific. For CTCF we found that 24.66%, 6.74%, 5.68%, 13.01%, 12.52% of potential CTCF TFBS retrieved on individual haplotype sequences only are specific for AFR, EUR, AMR, SAS and EAS populations, respectively (Fig 2D). Similar results were observed for ATF3 and GATA1 (S1 Text sections 4–5).
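The three outcomes reported above (sites found only on haplotypes, sites disrupted by variants, and sites whose score is modulated) can be expressed as a simple decision rule. The following Python sketch is one possible formulation, assuming that each genomic location has a reference window and one haplotype window to compare; the function name, category labels and example values are hypothetical, and the exact procedure used for Fig 2C is described in S1 Text rather than here.

```python
def classify_site(ref_p, hap_p, ref_score, hap_score, threshold=1e-4):
    """Label the effect of a variant on a candidate binding site (illustrative rule)."""
    ref_significant = ref_p < threshold
    hap_significant = hap_p < threshold
    if hap_significant and not ref_significant:
        return "haplotype_only"   # site appears only when the variant is present
    if ref_significant and not hap_significant:
        return "disrupted"        # variant pushes the site below significance
    if ref_significant and hap_significant:
        # Both windows pass the threshold; check whether the score changed.
        return "modulated" if hap_score != ref_score else "unchanged"
    return "not_significant"


# Made-up values for illustration only.
examples = [
    classify_site(ref_p=3e-3, hap_p=2e-5, ref_score=8.1, hap_score=14.6),
    classify_site(ref_p=5e-6, hap_p=4e-3, ref_score=15.2, hap_score=7.9),
    classify_site(ref_p=5e-6, hap_p=9e-6, ref_score=15.2, hap_score=14.4),
]
```

The 1e-4 default mirrors the P-value cut-off used throughout the Results; any other threshold could be passed through the `threshold` argument.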

Among the unique CTCF motif occurrences found only on non-reference haplotypes in CTCF ChIP-seq peaks we uncovered one TFBS (chr19:506,910–506,929) that clearly illustrates the danger of only using reference genomes for motif scanning. Within this region we recovered a heterozygous SNP that overlaps (position 10 of the CTCF matrix) and significantly modulates the binding affinity of this TFBS. In fact, by inspecting the ChIP-seq reads (experiment ENCSR000DZN, GM12878 cell line), we observed a clear allelic imbalance towards the alternative allele G (70.59% of reads) with respect to the reference allele A (29.41% of reads). This allelic imbalance is not observed in the reads used as control (experiment code ENCSR000EYX) (Fig 3).

Fig 3. Considering genomic variation, GRAFIMO captures more potential binding events.

GRAFIMO reports a potential CTCF binding site at chr19:506,910–506,929 found only in haplotype sequences, searching the motif in ChIP-seq peaks called on cell line GM12878 (experiment code ENCSR000DZN). The reads used to call for ChIP-seq peaks (ENCFF162QXM) show an allelic imbalance at position 10 of the motif sequence towards the alternative allele G, instead of the reference allele A. The imbalance is captured by GRAFIMO which reports the sequence presenting G at position 10 (found in the haplotypes), while the potential TFBS on the reference carrying an A is not reported as statistically significant (P-value > 1e-4). CTCF motif logo shows that the G is the dominant nucleotide in position 10.

Taken together these results highlight the importance of considering non-reference genomes when searching for potential TFBS or to characterize their potential activity in a population of individuals.

We also compared the performance of GRAFIMO against FIMO [13] (Fig H in S1 Text and S1 Text Section 6). FIMO is faster and requires less memory when scanning a single linear genome. However, when considering the 2548 individual genomes and their genetic variation, GRAFIMO proves to be generally faster than FIMO. Moreover, we benchmarked how GRAFIMO running time and memory usage change using an increasing number of threads (Fig I in S1 Text). By increasing the number of threads, we observed a dramatic drop in running time, while memory usage remained similar.

Conclusion

By leveraging VGs, GRAFIMO provides an efficient method to study how genetic variation affects the binding landscape of a TF within a population of individuals. Moreover, we show that several potential and private TFBS are found in individual haplotype sequences and that genomic variants also significantly affect the binding affinity of several motif occurrence candidates found in the reference genome sequence. Our tool can therefore help in prioritizing potential regions that may mediate individual-specific changes in gene expression, which may be missed by using only reference genomes.

Availability and future directions

GRAFIMO can be downloaded and installed via PyPI, source code or Bioconda. Its Python3 source code is available on GitHub at https://github.com/pinellolab/GRAFIMO and at https://github.com/InfOmics/GRAFIMO under the MIT license. Since GRAFIMO is based on the VG data structure, it has the potential to be applied to future pangenomic reference systems that are currently under development (https://news.ucsc.edu/2019/09/pangenome-project.html). The genome variation graphs enriched with 1000GP on GRCh38 phase 3 used to obtain the results presented in this manuscript can be downloaded at http://ncrnadb.scienze.univr.it/vgs.

Supporting information

S1 Text. Additional information about experiments design, GATA1 and ATF3 search on genome variation graph, and how to install and run GRAFIMO.

Fig A. Example of TSV summary report. The tab-delimited report (TSV report) shows the first 25 potential CTCF occurrences retrieved by GRAFIMO, searching the motif in ChIP-seq peak regions defined in ENCODE experiment ENCFF816XLT (cell line A549).

Fig B. Example of HTML summary report. The HTML report shows the first 25 potential CTCF occurrences retrieved by GRAFIMO, searching the motif in ChIP-seq peak regions defined in ENCODE experiment ENCFF816XLY (cell line A549).

Fig C. Example of GFF3 track produced by GRAFIMO, loaded on the UCSC genome browser. GRAFIMO also returns a GFF3 report which can be loaded on the UCSC genome browser; the loaded custom track shows three potential CTCF occurrences (region chr8:142,782,661–142,782,680) retrieved by GRAFIMO overlapping a dbSNP annotated variant (rs892844) (image obtained from the UCSC Genome Browser website).

Fig D. Structure of transcription factor motifs used to test GRAFIMO. Transcription factor binding site motifs of (A) CTCF, (B) ATF3 and (C) GATA1.

Fig E. Searching ATF3 motif on VG with GRAFIMO provides an insight on how genetic variation affects the binding site sequence. (A) Statistically significant (P-value < 1e-4) and non-significant potential ATF3 occurrences found with GRAFIMO in the reference and in the haplotype sequences of the hg38 1000GP VG. (B) Statistical significance of retrieved potential ATF3 motif occurrences and their frequency in the haplotypes embedded in the VG. (C) Percentage of statistically significant ATF3 potential binding sites found only in the genome reference sequence, percentage of potential TFBS found in the reference for which genetic variants cause the sequence to be no longer significant, percentage of binding sites found only in the haplotypes, percentage of potential TFBS found in the reference with increased statistical significance by the action of genomic variants and percentage of those with a decreased significance by the action of variants (with P-value still significant). (D) Fraction of population specific potential ATF3 binding sites recovered on individual haplotype sequences.

Fig F. Searching GATA1 motif on VG with GRAFIMO provides an insight on how genetic variation affects the binding site sequence. (A) Statistically significant (P-value < 1e-4) and non-significant potential GATA1 occurrences found with GRAFIMO in the reference and in the haplotype sequences of the hg38 1000GP VG. (B) Statistical significance of retrieved potential GATA1 motif occurrences and their frequency in the haplotypes embedded in the VG. (C) Percentage of statistically significant GATA1 potential binding sites found only in the genome reference sequence, percentage of potential TFBS found in the reference for which genetic variants cause the sequence to be no longer significant, percentage of binding sites found only in the haplotypes, percentage of potential TFBS found in the reference with increased statistical significance by the action of genomic variants and percentage of those with a decreased significance by the action of variants (with P-value still significant). (D) Percentage of population specific potential GATA1 binding sites, among those TFBS retrieved uniquely on individual genome sequences.

Fig G. Influence of the different length and type of mutations on binding affinity score. (A) CTCF, (B) ATF3, (C) GATA1.

Fig H. Comparing GRAFIMO and FIMO performance. (A) Searching CTCF motif (JASPAR ID MA0139.1) on human chr22 regions (total width ranging from 1 to 9 million bp) without accounting for genetic variants, FIMO is faster than GRAFIMO (using a single thread). (B) FIMO uses less memory than GRAFIMO; however, they work on different frameworks. (C) When considering the genetic variation present in large panels of individuals such as 1000GP on GRCh38 phase 3 (2548 samples), GRAFIMO proves to be faster than FIMO in searching potential CTCF occurrences. It is faster when run with a single execution thread, and significantly faster when run with 16.

Fig I. GRAFIMO running time efficiently scales with the number of threads used. By running GRAFIMO with multiple threads (A) the running time significantly decreases, while (B) memory usage remains similar.

Table A. Number of genomic variants used to test GRAFIMO. Number of genomic variants used to test GRAFIMO, divided by chromosome. The variants were obtained from the 1000 Genomes Project on GRCh38 phase 3, and belong to 2548 individuals from 26 populations. The number of variants refers to SNPs and indels together. In total, ~78 million variants were considered.

Table B. ENCODE ChIP-seq experiment codes. To test our software, we searched the potential occurrences of three transcription factor motifs (CTCF, ATF3 and GATA1) in a hg38 pangenome variation graph enriched with genomic variants and haplotypes of 2548 individuals from 1000 Genomes project phase 3. To focus on likely binding events, TF motifs were searched in ChIP-seq peak regions, obtained from the ENCODE project data portal.

(DOCX)

S1 Code. GRAFIMO v1.1.4 source code, documentation and running examples.

(ZIP)

Acknowledgments

We would like to thank Centro Piattaforme Tecnologiche (CPT) located in the University of Verona that provided us with all the hardware necessary to perform all the tests. Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R00HG008399 and Genomic Innovator Award Number R35HG010717. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

LP was supported by National Human Genome Research Institute R00HG008399 and Genomic Innovator Award R35HG010717. RG was supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement 814978 and JPcofuND2 Personalised Medicine for Neurodegenerative Diseases project JPND2019-466-037. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Stewart AJ, Hannenhalli S, Plotkin JB. Why transcription factor binding sites are ten nucleotides long. Genetics. 2012;192(3): 973–985. doi: 10.1534/genetics.112.143370
  • 2. Stormo GD. Modeling the specificity of protein-DNA interactions. Quantitative Biology. 2013;1(2): 115–130. doi: 10.1007/s40484-013-0012-4
  • 3. Li S, Ovcharenko I. Human enhancers are fragile and prone to deactivating mutations. Mol Biol Evol. 2015;32(18): 2161–2180. doi: 10.1093/molbev/msv118
  • 4. Vorontsov IE, Khimulya G, Lukianova EN, Nikolaeva DD, Eliseeva IA, Kulakovskiy IV, et al. Negative selection maintains transcription factors binding motifs in human cancer. BMC Genomics. 2016;17(2): 395. doi: 10.1186/s12864-016-2728-9
  • 5. Guo YA, Chang MM, Huang W, Ooi WF, Xing M, Tan P, et al. Mutation hotspots at CTCF binding sites coupled to chromosomal instability in gastrointestinal cancers. Nature Communications. 2018;9(1): 1–14. doi: 10.1038/s41467-017-02088-w
  • 6. Albert FW, Kruglyak L. The role of regulatory variation in complex traits and disease. Nature Reviews Genetics. 2015;16(4): 197–212. doi: 10.1038/nrg3891
  • 7. Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, et al. Variation in transcription factor binding among humans. Science. 2010;328(5975): 232–235. doi: 10.1126/science.1183621
  • 8. Paten B, Novak AM, Eizenga JM, Garrison E. Genome graphs and the evolution of genome inference. Genome Research. 2017;27(5): 665–676. doi: 10.1101/gr.214155.116
  • 9. Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnology. 2018;36(9): 875–879. doi: 10.1038/nbt.4227
  • 10. Sirén J, Garrison E, Novak AM, Paten B, Durbin R. Haplotype-aware graph indexes. Bioinformatics. 2020;36(2): 400–407. doi: 10.1093/bioinformatics/btz575
  • 11. Paten B, Eizenga JM, Rosen YM, Novak AM, Garrison E, Hickey G. Superbubbles, ultrabubbles and cacti. Journal of Computational Biology. 2018;25(7): 649–663. doi: 10.1089/cmb.2017.0251
  • 12. Groza C, Kwan T, Soranzo N, Pastinen T, Bourque G. Personalized and graph genomes reveal missing signal in epigenomic data. Genome Biology. 2020;21: 1–22. doi: 10.1186/s13059-020-02038-8
  • 13. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27(7): 1017–1018. doi: 10.1093/bioinformatics/btr064
  • 14. Korhonen J, Martinmäki P, Pizzi C, Rastas P, Ukkonen E. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics. 2009;25(23): 3181–3182. doi: 10.1093/bioinformatics/btp554
  • 15. Macintyre G, Bailey J, Haviv I, Kowalczyk A. is-rSNP: a novel technique for in silico regulatory SNP detection. Bioinformatics. 2010;26(18): i524–i530. doi: 10.1093/bioinformatics/btq378
  • 16. Thomas-Chollier M, Hufton A, Heinig M, O’Keefe S, El Masri N, Roider H, et al. Transcription factor binding prediction using TRAP for the analysis of ChIP-seq data and regulatory SNPs. Nature Protocols. 2011;6(12): 1860. doi: 10.1038/nprot.2011.409
  • 17. Zuo C, Shin S, Keles S. atSNP: transcription factor binding affinity testing for regulatory SNP detection. Bioinformatics. 2015;31(20): 3353–3355. doi: 10.1093/bioinformatics/btv328
  • 18. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571): 68–74. doi: 10.1038/nature15393
  • 19. Zheng-Bradley X, Streeter I, Fairley S, Richardson D, Clarke L, Flicek P, et al. Alignment of 1000 Genomes Project reads to reference assembly GRCh38. GigaScience. 2017;6(7): 1–8. doi: 10.1093/gigascience/gix038
  • 20. Novak AM, Garrison E, Paten B. A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms for Molecular Biology. 2017;12(1): 18. doi: 10.1186/s13015-017-0109-9
  • 21. Fornes O, Castro-Mondragon JA, Khan A, van der Lee R, Zhang X, Richmond PA, et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Research. 2019;48(D1): D87–D92.
  • 22. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Research. 2009;37(suppl): W202–W208. doi: 10.1093/nar/gkp335
  • 23. Staden R. Searching for motifs in nucleic acid sequences. Methods in Molecular Biology. 1994;25: 93–102. doi: 10.1385/0-89603-276-0:93
  • 24. Lee CM, Barber GP, Casper J, Clawson H, Diekhans M, Navarro Gonzalez J, et al. UCSC Genome Browser enters 20th year. Nucleic Acids Research. 2020;48(D1): D756–D761. doi: 10.1093/nar/gkz1012
  • 25. Landrum MJ, Chitipiralla S, Brown GR, Chen C, Gu B, Hart J, et al. ClinVar: improvements to accessing data. Nucleic Acids Research. 2020;48(D1): D835–D844. doi: 10.1093/nar/gkz972
  • 26. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414): 57–74. doi: 10.1038/nature11247
  • 27. Davis CA, Hitz BC, Sloan CA, Chan ET, Davidson JM, Gabdank I, et al. The encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Research. 2018;46(D1): D794–D801. doi: 10.1093/nar/gkx1081
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009444.r001

Decision Letter 0

Mihaela Pertea

7 Mar 2021

Dear Dr. Pinello,

Thank you very much for submitting your manuscript "GRAFIMO: variant and haplotype aware motif scanning on pangenome graphs" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Mihaela Pertea

Software Editor

PLOS Computational Biology


***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors present a new tool for scanning known TF DNA motifs in VGs (graph genomes). Overall, the manuscript is clear and the description is concise. The idea is not fully novel; for example, a similar study is reported in Groza et al. (2020, Genome Biology). The tool itself is also heavily based on VG and is simple overall, but it does have many potential users.

I have the following comments to help improve the manuscript:

1. The authors construct the VG from the 1000G data, which is a diverse dataset. Are any population-specific TF motifs found? If so, what proportion of the motifs are they?

2. For the non-reference TF motifs discovered, what is the effect of different types and lengths of mutations? For example, do indels and larger mutations have a larger effect?

3. The authors need to benchmark the performance (memory/running time) of the tool.
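Running time and peak memory of a command-line tool can be measured with a small wrapper such as the one below. This is only an illustrative sketch, not the procedure used for the manuscript: benchmark_command is a hypothetical helper, the command being timed is a placeholder, and the resource module it relies on is available on Unix-like systems only.

```python
import resource
import subprocess
import sys
import time


def benchmark_command(cmd):
    """Run cmd once and return (wall-clock seconds, peak child RSS, exit code)."""
    start = time.perf_counter()
    completed = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    elapsed = time.perf_counter() - start
    # ru_maxrss is the largest peak resident set size over all child
    # processes waited for so far (kilobytes on Linux, bytes on macOS),
    # so measure one command per Python process for a clean number.
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss, completed.returncode


if __name__ == "__main__":
    # Placeholder workload; replace with the tool invocation to benchmark.
    placeholder = [sys.executable, "-c", "sum(range(10**6))"]
    elapsed, peak_rss, code = benchmark_command(placeholder)
    print(f"exit={code} wall={elapsed:.3f}s peak_rss={peak_rss}")
```

Repeating the measurement several times and reporting the spread gives a fairer comparison than a single run, since disk caching can make the first run slower.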

Reviewer #2: This is a well written manuscript, presenting a method that I think will be useful for many researchers in the field. GRAFIMO is a nice application of genome graphs, and the method provides a clear benefit compared to the existing "linear" method FIMO. I particularly like that you've designed GRAFIMO so that it can be used as a direct replacement to FIMO. I don't have many comments and I really don't have any major issues with the manuscript. However, I ran into some problems when trying to run the software, and would like you to fix these so that I'm able to fully try out the software before concluding on the manuscript. These are my comments:

- I have a few questions related to how you do false discovery rate adjustment of the p-values for each motif match. As far as I understand FIMO, q-values are found simply by accounting for the number of hypothesis tests done within the specified regions on the linear reference genome (i.e. one test for every base pair within the regions). I don't find any information in the manuscript or in the supplementary text on how you do this on a graph. I assume that you maybe account for all the hypothesis tests performed on the graph, i.e. the number of k-mer paths in the graph (following known haplotypes, if specified)? I know that FIMO has to hold all matches in memory in order to compute these q-values, and that the user can specify how many to hold in memory through the --max-stored-scores option. I don't see this option in GRAFIMO.

- Installation of GRAFIMO went fine on my system (used pip). I only had a minor issue with sphinx not being specified as a dependency in the setup.py file. Should that be added?

- How does GRAFIMO compare to FIMO in terms of run time? Is it considerably slower than FIMO, since it has to search for more potential motif matches? I personally don't mind if it is a bit slower than FIMO, but it would be nice if you could include a sentence about runtime. If runtime isn't an issue, that could be mentioned, and if runtime is an issue, it would be useful for the reader to know, for instance, how long GRAFIMO and FIMO take to process e.g. one chromosome using one thread.

- If I only run "grafimo" on the command line, I get an IndexError (probably because I didn't specify -h or --help or anything else). It would be nice to instead get a more user-friendly error, or just the help message.

- I tried running GRAFIMO on some of my own data, and I think I ran into some issues when running "grafimo buildvg" since it assumed I had chromosome chr1, chr2, chr3 and so on. My test dataset only had one chromosome "1" (not "chr1"). So I tried running "grafimo buildvg -h" to see if I could specify the chromosomes for my data. However, it seems that this doesn't give me the options for the subcommand "buildvg", but all the options for GRAFIMO? Or am I wrong? Anyway, it was a bit unclear for me how I could see which arguments only the buildvg command has.

- After specifying -c 1 to grafimo buildvg to try to make it only build graphs for chromosome 1, it seems that it tries to build a graph for "chr1". Would it maybe be better to let the user specify the exact literal chromosome names, so that the user would need to specify "chr1" if chr1 is wanted (at least I often have data without the chr prefix)? Right now it seems that chromosome "1" and so on (without chr) is not supported. EDIT: After I got another error (see next point), I tried installing the latest GRAFIMO from GitHub, and it seems that things have changed there (compared to what is in the pip package)? On line 507 in constructVG.py, it seems that you remove "chr" from chromosome names, so when I now use "chr1" in my data, GRAFIMO crashes with "ValueError: Unknown chromosome given". Unless I have misunderstood something, it would be good if you fixed these issues. As a user, I would ideally be able to specify either "chr1" or "1", and GRAFIMO should then support both (and not convert the chromosome names).

- After solving the issues with chromosome by using other data (now data with "chr1"), I got the following error message when using the example.meme file:

    File "/usr/local/lib/python3.8/dist-packages/grafimo/motif.py", line 634, in read_MEME_motif
        motifID, motifName = line.split()[1:3]
    ValueError: not enough values to unpack (expected 2, got 1)

When checking line 634 in motif.py in your github-repo, I realised that this might be an old bug that is fixed since the pip-package was published. That's when I decided to pull your latest code from Github, and I then ran into issues with "chr1" not being supported anymore. At this point I gave up, but I guess there's not much that needs to be fixed on your side before I should be able to run GRAFIMO. Let me know if you believe I've misunderstood anything or done anything wrong.

- I think it is very nice that you have included so many details about the experiments in the supplementary text, but I don't find any scripts for reproducing the experiments. It would be nice to have a reference to the scripts (and ideally also the graphs) you used for running the experiments.
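Two of the input-handling failures reported in the comments above (the ValueError raised when unpacking the motif header, and chromosome names given with or without the "chr" prefix) could be guarded against along the following lines. This is a sketch under stated assumptions, not GRAFIMO's actual code: parse_motif_header and resolve_chromosome are hypothetical names, and the only format fact relied on is that the alternate name on a MEME "MOTIF" line is optional.

```python
def parse_motif_header(line):
    """Return (motif_id, motif_name) from a MEME 'MOTIF' line.

    The alternate name is optional, so fall back to the identifier
    instead of unpacking a fixed number of fields.
    """
    fields = line.split()
    if len(fields) < 2 or fields[0] != "MOTIF":
        raise ValueError(f"not a MEME motif header: {line!r}")
    motif_id = fields[1]
    motif_name = fields[2] if len(fields) > 2 else motif_id
    return motif_id, motif_name


def resolve_chromosome(requested, available):
    """Match a requested chromosome against the names present in the data.

    The literal name wins; otherwise try the same name with the 'chr'
    prefix added or removed, and fail with a message listing the names
    that are actually available.
    """
    if requested in available:
        return requested
    if requested.startswith("chr"):
        alternate = requested[3:]
    else:
        alternate = "chr" + requested
    if alternate in available:
        return alternate
    raise ValueError(
        f"unknown chromosome {requested!r}; available: {sorted(available)}"
    )
```

With these helpers, a header such as "MOTIF MA0139.1" yields the identifier for both fields, and a request for "1" resolves to "chr1" when only prefixed names exist in the data, without renaming anything in the input files.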

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Ivar Grytten

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009444.r003

Decision Letter 1

Mihaela Pertea

14 Jun 2021

Dear Dr. Pinello,

Thank you very much for submitting your manuscript "GRAFIMO: variant and haplotype aware motif scanning on pangenome graphs" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Mihaela Pertea

Software Editor

PLOS Computational Biology


***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I finished reviewing the revised manuscript. The authors have answered all my previously proposed questions, and the work is ready to be accepted. It’s a nice tool and will potentially have many users!

Reviewer #2: - I think the experiments you have performed for examining running time and memory usage are satisfactory. At the end of the section titled "Transcription factor motif search" you have added a sentence referring to these experiments. I find it a bit odd to refer to the experiments here. Wouldn't this be better suited for the Results section? The way you phrase this sentence is also a bit vague. You say "S1 Text section 6 describes how the tool performance compares to those of a tool designed to work on a single genomic sequence at a time, such as FIMO". I think two things could be improved here: 1) You could just say that you compare the performance to FIMO, instead of "tools, such as FIMO" (since you only compare to FIMO) and 2) it would be nice if you could very briefly summarize the main findings here, instead of just referring to the supplementary. What I would prefer is a phrasing along the lines of "We also compared the performance of GRAFIMO against FIMO, and found that ... (see S1 Text Section 6).".

- In the caption of S9 Figure you say that memory usage increases when using multiple threads. But from the figure, it seems that the memory usage is about the same for all the cases. Am I right or have I misunderstood the figure?

- You say that you have made all scripts for reproducing the experiments available in a dedicated Github repository, but in your comment you link to the GRAFIMO Github repository. Is this the wrong link? I cannot seem to find the shell scripts for reproducing the experiments in this repository.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No: 

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Chong Chu

Reviewer #2: Yes: Ivar Grytten

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009444.r005

Decision Letter 2

Mihaela Pertea

10 Sep 2021

Dear Dr. Pinello,

We are pleased to inform you that your manuscript 'GRAFIMO: variant and haplotype aware motif scanning on pangenome graphs' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Mihaela Pertea

Software Editor

PLOS Computational Biology

Feilim Mac Gabhann

Editor-in-Chief

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: The authors have satisfactorily addressed all my comments, and I am now very happy with the manuscript and accept it for publication.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Ivar Grytten

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009444.r006

Acceptance letter

Mihaela Pertea

22 Sep 2021

PCOMPBIOL-D-21-00041R2

GRAFIMO: variant and haplotype aware motif scanning on pangenome graphs

Dear Dr Pinello,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofi Zombor

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. Additional information about experimental design, GATA1 and ATF3 search on genome variation graph, and how to install and run GRAFIMO.

    Fig A. Example of TSV summary report. The tab-delimited report (TSV report) shows the first 25 potential CTCF occurrences retrieved by GRAFIMO, searching the motif in ChIP-seq peak regions defined in ENCODE experiment ENCFF816XLT (cell line A549).

    Fig B. Example of HTML summary report. The HTML report shows the first 25 potential CTCF occurrences retrieved by GRAFIMO, searching the motif in ChIP-seq peak regions defined in ENCODE experiment ENCFF816XLY (cell line A549).

    Fig C. Example of GFF3 track produced by GRAFIMO, loaded on the UCSC genome browser. GRAFIMO also returns a GFF3 report which can be loaded on the UCSC genome browser; the loaded custom track shows three potential CTCF occurrences (region chr8:142,782,661–142,782,680) retrieved by GRAFIMO overlapping a dbSNP annotated variant (rs892844) (image obtained from the UCSC Genome Browser website).

    Fig D. Structure of transcription factor motifs used to test GRAFIMO. Transcription factor binding site motifs of (A) CTCF, (B) ATF3 and (C) GATA1.

    Fig E. Searching ATF3 motif on VG with GRAFIMO provides insight into how genetic variation affects the binding site sequence. (A) Statistically significant (P-value < 1e-4) and non-significant potential ATF3 occurrences found in the reference and in the haplotype sequences with GRAFIMO on the hg38 1000GP VG. (B) Statistical significance of retrieved potential ATF3 motif occurrences and their frequency in the haplotypes embedded in the VG. (C) Percentage of statistically significant potential ATF3 binding sites found only in the reference genome sequence, percentage of potential TFBS found in the reference that genetic variants render no longer significant, percentage of binding sites found only in the haplotypes, percentage of potential TFBS found in the reference whose statistical significance is increased by genomic variants, and percentage of those whose significance is decreased by variants (with P-value still significant). (D) Fraction of population-specific potential ATF3 binding sites recovered on individual haplotype sequences.

    Fig F. Searching GATA1 motif on VG with GRAFIMO provides insight into how genetic variation affects the binding site sequence. (A) Statistically significant (P-value < 1e-4) and non-significant potential GATA1 occurrences found in the reference and in the haplotype sequences with GRAFIMO on the hg38 1000GP VG. (B) Statistical significance of retrieved potential GATA1 motif occurrences and their frequency in the haplotypes embedded in the VG. (C) Percentage of statistically significant potential GATA1 binding sites found only in the reference genome sequence, percentage of potential TFBS found in the reference that genetic variants render no longer significant, percentage of binding sites found only in the haplotypes, percentage of potential TFBS found in the reference whose statistical significance is increased by genomic variants, and percentage of those whose significance is decreased by variants (with P-value still significant). (D) Percentage of population-specific potential GATA1 binding sites, among those TFBS retrieved uniquely on individual genome sequences.

    Fig G. Influence of the length and type of mutations on binding affinity score. (A) CTCF, (B) ATF3, (C) GATA1.

    Fig H. Comparing GRAFIMO and FIMO performance. (A) When searching the CTCF motif (JASPAR ID MA0139.1) on human chr22 regions (total width ranging from 1 to 9 million bp) without accounting for genetic variants, FIMO is faster than GRAFIMO (using a single thread). (B) FIMO uses less memory than GRAFIMO; however, they work on different frameworks. (C) When considering the genetic variation present in large panels of individuals, such as 1000GP on GRCh38 phase 3 (2548 samples), GRAFIMO proves to be faster than FIMO in searching potential CTCF occurrences. It is faster when run with a single execution thread, and significantly faster when run with 16.

    Fig I. GRAFIMO running time scales efficiently with the number of threads used. By running GRAFIMO with multiple threads (A) the running time significantly decreases, while (B) memory usage remains similar.

    Table A. Number of genomic variants used to test GRAFIMO, divided by chromosome. The variants were obtained from 1000 Genomes Project on GRCh38 phase 3, and belong to 2548 individuals from 26 populations. The number of variants refers to SNPs and indels together. In total, ~78 million variants were considered.

    Table B. ENCODE ChIP-seq experiment codes. To test our software, we searched the potential occurrences of three transcription factor motifs (CTCF, ATF3 and GATA1) in a hg38 pangenome variation graph enriched with genomic variants and haplotypes of 2548 individuals from 1000 Genomes project phase 3. To focus on likely binding events, TF motifs were searched in ChIP-seq peak regions, obtained from the ENCODE project data portal.

    (DOCX)

    S1 Code. GRAFIMO v1.1.4 source code, documentation and running examples.

    (ZIP)

    Attachment

    Submitted filename: Response_to_reviewers_grafimo.docx

    Attachment

    Submitted filename: Response_to_reviewers.pdf

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
