Abstract
Background
Sequence logo plots have become a standard graphical tool for visualizing sequence motifs in DNA, RNA or protein sequences. However standard logo plots primarily highlight enrichment of symbols, and may fail to highlight interesting depletions. Current alternatives that try to highlight depletion often produce visually cluttered logos.
Results
We introduce a new sequence logo plot, the EDLogo plot, that highlights both enrichment and depletion, while minimizing visual clutter. We provide an easy-to-use and highly customizable R package Logolas to produce a range of logo plots, including EDLogo plots. This software also allows elements in the logo plot to be strings of characters, rather than a single character, extending the range of applications beyond the usual DNA, RNA or protein sequences. And the software includes new Empirical Bayes methods to stabilize estimates of enrichment and depletion, and thus better highlight the most significant patterns in data. We illustrate our methods and software on applications to transcription factor binding site motifs, protein sequence alignments and cancer mutation signature profiles.
Conclusions
Our new EDLogo plots and flexible software implementation can help data analysts visualize both enrichment and depletion of characters (DNA sequence bases, amino acids, etc.) across a wide range of applications.
Electronic supplementary material
The online version of this article (10.1186/s12859-018-2489-3) contains supplementary material, which is available to authorized users.
Keywords: Logo plots, Enrichment depletion, EDLogo, String symbols
Background
Since their introduction in the early 1990s by Schneider and Stephens [1], sequence logo plots have become widely used for visualizing short conserved patterns known as sequence motifs, in multiple alignments of DNA, RNA and protein sequences. At each position in the alignment, the standard logo plot represents the relative frequency of each character (base, amino acid, etc.) by stacking characters on top of each other, with the height of each character proportional to its relative frequency. The characters are ordered by their relative frequency, and the total height of the stack is determined by the information content of the position. The visualization is so appealing that methods to produce logo plots are now implemented in many software packages (e.g. seqLogo [2], RWebLogo [3], ggseqlogo [4]) and web servers (e.g. WebLogo [5], Seq2Logo [6], iceLogo [7]).
Because the standard logo plot scales the height of each character proportional to its relative frequency, it tends to visually highlight characters that are enriched; that is, at higher than expected frequency. In many applications such enrichments may be the main features of interest, and the standard logo plot serves these applications well. However, sometimes it may be equally interesting to identify depletions: characters that occur less often than expected. One example of this, highlighted in [6], involves glycosylation: N-linked glycosylation sites in proteins are known to have the motif N-X- S/T where X is any amino acid apart from proline P [8, 9]. Another example involves the distribution of histone modifications across the genome: for example, Koch et al. [10] notes depletion of histone marks H4AC and H3K4ME1 at the gene start and gene end regions in lymphoblastoid cell lines. The standard logo plot represents strong depletion(s) by the absence of character(s), which can be difficult to discern visually.
To better highlight depletions in amino acid motifs, Thomsen et al. [6] suggests several alternatives to the standard logo plot. The key idea is to explicitly represent depletions using characters that occupy the negative part of the y axis. However, we have found that the resulting plots sometimes suffer from visual clutter – too many symbols, which distract from the main patterns of enrichment and depletion.
Here we suggest a simple solution to this problem, producing a new sequence logo plot – the Enrichment Depletion Logo or EDLogo plot – that highlights both enrichment and depletion, while minimizing visual clutter. In addition, we extend the applicability of logo plots to new settings by i) allowing each “character” in the plot to be an arbitrary alphanumeric string (potentially including user-defined symbols); and ii) allowing a different “alphabet” of permitted strings at each position. We also introduce Empirical Bayes statistical methods to stabilize estimates of enrichment and depletion, and thus better highlight the most significant patterns in data. All these new features are implemented in our R package, Logolas, which can produce generalized string-based logo and EDLogo plots. We illustrate the utility of the EDLogo plot and the flexibility of the string-based representation through several applications.
Implementation
Intuition
In essence, the goal of a logo plot is to represent, at each position along the x axis, how a probability vector p compares with another probability vector q. For example, suppose that at a specific position in a set of aligned DNA sequences, we observe relative frequencies p=(pA,pC,pG,pT)=(0.33,0.33,0.33,0.01) of the four bases {A,C,G,T}. The goal of the logo plot might be to represent how p compares with the background frequencies of the four bases, which for simplicity we will assume in this example to be equal: q=(qA,qC,qG,qT)=(0.25,0.25,0.25,0.25). Verbally we could describe the change from q to p in several ways: we could say “T is depleted”, or “ A,C and G are enriched”, or “T is depleted, and A,C and G are enriched”. While all of these are valid statements, the first is the most succinct, and our EDLogo plot provides a visual version of that statement. The second statement is more in line with a standard logo representation, and the last is in essence the approach in [6] (also known as weighted Kullback Leibler Divergence Logo or wKL-Logo). See Fig. 1.
The EDLogo plot
At a particular position, j, of a sequence (or other indexing set), let p=(p1,p2,…,pn) denote the probabilities of the n elements C1,…,Cn (which can be characters or strings) permitted at that position, and q=(q1,q2,…,qn) denote corresponding background probabilities. Define r=(r1,r2,…,rn) by:
1 |
where
2 |
Then at position j along the x axis, the EDLogo plot plots the element Ci, scaled to have height |ri|, and above the x axis if ri is positive, or below the x axis if ri is negative. Elements are stacked (from bottom to top) in order of increasing ri, so that the largest characters are furthest from the axis.
The basic strategy has close connections to ideas in [6], but with the crucial difference that we subtract the median in Eq. 1. As our examples will demonstrate, subtracting the median in this way – which can be motivated by a parsimony argument (see below) – can dramatically change the plot, and substantially reduce visual clutter.
Note that the EDLogo plot for p vs q is essentially a mirror (about the x axis) of the EDLogo plot for q vs p (e.g. Additional file 2: Figure S1). We call this the “mirror property”, and it can be interpreted as meaning that the plots treat enrichment and depletion symmetrically. This property is also satisfied by plots in [6], but not by the standard logo plot.
A model-based view
Suppose we model the relationship of p to q by
3 |
for some unknown (positive) “parameters” λi. For example, this model would arise if q represents the underlying frequencies of elements in a population, and p represents the frequencies of the same elements in a (large) sample from that population, conditional on an event E (e.g. a transcription factor binding). Indeed, by Bayes theorem, under this assumption we would have
4 |
Since the pi must sum to 1, , the model (3) implies
5 |
Now consider estimating the parameters λ. Even if p and q are observed without error, there is a non-identifiability in estimating λ: we can set λi=cpi/qi for any positive c. Equivalently, if we consider estimating the logarithms li:= logλi, we can set
6 |
for any constant k. Note that ri in (1) has exactly this form, and so the vector r can be interpreted as an estimate of the vector ł. Furthermore, it is easy to show that, among all estimates of the form (6), r has the smallest sum of absolute values (see Additional file 1 for a rigorous proof). That is, r solves the optimization
7 |
subject to the constraint (6).
Since the sum of absolute values of r is the total height of the stacked characters in the EDLogo plot, one can think of our choice of r as the estimate of łthat produces the smallest stack of characters – that is, the most “parsimonious” estimate.
Interpretation
Roughly speaking, positive values of ri can be interpreted as indicating characters that are “enriched” and negative values of ri as indicating characters that are “depleted”. Formally we must add that here enrichment and depletion are to be interpreted as relative to the median enrichment/depletion across characters. This relative enrichment does not necessarily imply enrichment or depletion in some “absolute” sense: for example, ri could be positive even if pi is smaller than qi. For compositional data it seems natural that enrichment/depletion be interpreted relative to some “baseline”, and our choice of the median as the baseline is motivated above as providing the most parsimonious plot.
It may also help interpretation to note that for any two characters i and i′, the difference in their heights is equal to the log-odds ratio:
8 |
Multiple solutions when n is even
When the number of classes n is even (n=4 DNA bases being a particularly relevant example) the definition of the median of , which is subtracted in (1) to minimize total stack height, is ambiguous. Conventionally, the median of an even number of observations is usually taken to be the mean of the two central observations. However, in terms of minimizing total stack height (optimization (7)), every real number between the two central observations (inclusive) performs equally well. For example, if , subtracting the conventional median (0.5) yields r=(−0.5,−0.5,0.5,0.5) with total stack height 2, but subtracting any number between the two central observations (0 and 1) would lead to the same total stack height. Thus, if we measure parsimony by total stack height, there exist multiple equally-parsimonious plots, giving the user a decision to make.
Among these equally-parsimonious solutions, subtracting the smallest number corresponds to favoring an “enrichment” representation, whereas subtracting the largest number favors a “depletion” representation, and subtracting the median treads a middle ground between the two. See Additional file 2: Figure S6 for an illustration. None of these approaches is uniformly superior to another, but our sense is that – all other things being equal – users find an enrichment representation slightly more natural, and so we made this (subtracting the smallest number) the software default. One cost of this choice is that the plot no longer satisfies the mirror property; the mirror property is preserved by using the conventional median, which is a software option.
Stabilizing estimates of
The basic EDLogo plot described above typically works well provided that no probabilities pi or qi are very small (or zero!). Very small values of pi or qi can cause very large values of , and consequently large |ri| which can undesirably dominate the plot.
In practice we have found the most common source of this problem (unreasonably large |ri|) is unstable estimates of small probabilities from low counts of rare events. We found that the simplest solution to this problem – use of pseudocounts to stabilize estimates of small probabilities [11] – was only partially successful. We therefore developed a statistical approach that directly stabilizes estimates of from count data. This approach uses Empirical Bayes shrinkage [12] to stabilize estimates of , and is especially effective in stabilizing estimates from low count events (see Additional file 1 for details). We produce an EDLogo plot from stabilized estimates for by plugging them into (1).
Although our stabilization method is ideally-suited to settings where p and q are estimated from count data, it can also be applied in other settings by supplying an “effective count” parameter that specifies the approximate precision of supplied values of p and q. (By default we assume an effective count of 1000, which means that p and q are precise to no more than 3 decimal places).
Results
Comparison with existing logo plots
Figure 2 illustrates the EDLogo plot, and compares it with the standard logo and the weighted Kullback–Leibler logo (wKL-Logo) plot [6], in four diverse applications.
The first application (panel (a)) is a setting where the standard logo plot is widely used: visualizing transcription factor binding sites (TFBS) [13–18] (see Additional file 3: Table S1). Specifically, the plots represent the primary discovered motif disc1 of Early B cell factor EBF1 from ENCODE [14]. This example showcases the effectiveness of the standard logo plot in highlighting enrichments: in our opinion it does this better than the other two plots, and in this sense the other plots should be viewed as complementing the standard plot rather than replacing it. This example also illustrates the difference between the wKL-Logo and EDLogo plots, both of which aim to highlight depletion as well as enrichment: the EDLogo plot introduces less distracting visual clutter than the wKL-Logo plot, producing a cleaner and more parsimonious visualization that better highlights the primary enrichments and depletions. In particular, the EDLogo plot is most effective at highlighting depletion of bases G and C at the two positions in the middle of the sequence. This depletion is hard to see in the standard logo because of its emphasis on enrichment, and less clear in the wKL-Logo due to visual clutter. This depletion pattern is likely meaningful, rather than a coincidence, since it was also observed in two other previously known motifs (known3 and known4) of the same transcription factor [16, 18] (see Additional file 2: Figure S2).
The second example (panel (b)) shows an amino acid motif corresponding to N-linked glycosylation sites. These sites are expected to have the motif N-X- S/T, where X is any amino acid apart from proline P (data from UniProtKB [19]; see Additional file 3: Table S2). This was used by [6] as an example of a motif where depletion is an important biological feature. The depletion (of the P at position +1) is essentially impossible to see in the standard logo plot, is visually detectable in the wKL-Logo plot, and clearest in the EDLogo plot. Again, the EDLogo plot is more parsimonious than the wKL-Logo plot, and consequently the primary N-X- S/T motif stands out better in the EDLogo plot. In addition to showing depletion of P at the expected position (+1), the EDLogo plot also highlights depletion of P at position +3, suggesting an extended motif N-X- S/T-X.
The next two applications (panels (c) and (d) of Fig. 2) are non-standard settings that illustrate the use of general strings as “characters” in a logo plot, as well as providing further examples where the EDLogo plot is particularly effective at highlighting depletion as well as enrichment.
Panel (c) shows logo plots representing an estimated cancer mutation signature profile (signature 12) from a clustering analysis of a large number (nearly 70,000) somatic mutations by [20] (see Additional file 3: Table S3). Here we follow [20] in representing a mutational signature by the frequency of each mutation type (at position 0 on the x axis), together with base frequencies at the ± 2 flanking bases. We also follow the common convention of orienting the strand so that the mutation is from either a C or a T, yielding six possible mutation types: C→T, C→A, C→G, T→A, T→C, T→G. This Figure panel illustrates two important points. First, it illustrates the flexibility of our software package Logolas, which allows arbitrary strings in a logo. For all three logo plots (standard, wKL and ED) we use this to represent the six mutation types by six strings of the form X→Y, and we find the resulting plots easier to read than the pmsignature plots in [20] (see Additional file 2: Figure S3 for comparison). Additionally, it also shows that one can use different sets of permitted strings at different positions - strings are used to represent the mutation in the center, while characters are used to represent the flanking bases. Second, it illustrates a case where, in our opinion, the EDLogo plot is a better visual summary than the other plots. Specifically the EDLogo plot best highlights the primary aspects of this signature: enrichment of C→T mutations, and depletion of G at position +1. Here the depletion of G at +1 may be a bi-product of the enrichment of C→T mutations combined with the overall depletion of CpG sites in the genome due to deamination [21].
In this example the EDLogo plot and the standard logo plot differ on the enrichments they highlight at the central position: unlike the standard plot, the EDLogo plot highlights enrichment of T→C in addition to the primary enrichment C→T. This is due to an important difference between the plots: in EDLogo enrichments (and depletions) are plotted on a log scale, whereas in the standard plot they are on an absolute scale. This means that in the EDLogo plot it is differences in the heights of characters that matter (and can be interpreted as a log-odds-ratio; see Implementation), whereas in a standard plot it is ratios of heights. In this case the frequency of C→T is 0.96 and T→C is 0.03, with other mutations essentially absent. Consequently T→Cis enriched relative to other mutation types, but nowhere near as strongly as C→T. When these enrichments are plotted on the raw scale, as in the standard plot, essentially only the C→T enrichment is visible. On the log scale, both are visible. Which representation is preferable depends on how much one wants to emphasize subtler vs stronger enrichment patterns.
For readers interested in other cancer mutation signatures, we provide EDLogo plots for all 27 mutational signature profiles reported by [20] in Additional file 2: Figure S4.
Panel (d) shows logo plots summarizing the relative abundance of 5 different histone marks in different genomic contexts (data from lymphoblastoid cell line GM06990, Table S2 of [10], given in Additional file 3: Tables S4 and S5.) Relative abundances naturally yield compositional data that can be visualized in a logo plot. Again this example illustrates the potential to use strings in logo plots. It also represents an example where the EDLogo and wKL-Logo plots seem more informative than the standard logo plot. Specifically, the standard logo plot is dominated by the high deviation from background frequencies at the intergenic, exon and intron regions, and the differences in enrichments and depletions among regions are difficult to discern. In comparison, the EDLogo and wKL-Logo plots highlight a number of differences among regions (some of which are also noted in [10]). For example, both plots highlight the relative enrichment of H3AC and H3K4me3 near the start of genes, and corresponding relative depletion of H4AC and H3K4me1. Both plots also highlight relative enrichment of H3K4me1 compared with other marks in the intergenic, exonic and intronic regions; the relative enrichment of H4AC in intronic and exonic regions; and relative depletion of H3AC in intergenic and intronic regions. As in other examples, the EDLogo plot is more parsimonious than the wKL-Logo plot.
Stabilizing estimates
The example of N-linked glycosylation sites above involves very small (e.g. zero) counts of some amino acids at some sites, and provides an example of both the need to stabilize estimates of values, and the benefits of our new Empirical Bayes (EB) approach to this.
Before explaining the benefits of our EB approach, we first motivate the need for stabilization, using the central position of this motif as an example. At this position (and, indeed, other positions) the frequency, pi, of amino acid i is estimated by counting the number of times, mi, that amino acid i occurs in this central position in an observed data set of m=5422 sequences. The maximum likelihood estimate (mle) for pi is mi/m. However, every observed amino acid at this central position is an Asparagine (N), so the mle for pi is 1 for the N amino acid, and is 0 for all other amino acids. These 0 estimates for pi lead to unreasonably large – indeed, infinite – values for , motivating the need for stabilization.
A standard approach to stabilization of estimates of small probabilities is to use pseudocounts. This approach simply adds a pseudocount (small number) to each observed count before computing estimates. For example, using a pseudo-count of 0.5 for each of the 20 amino acids that could occur, pi is estimated by . This avoids zero estimates of probabilities, and an EDLogo plot can be constructed from the pseudo-count-based estimates of p,q. However, we found the resulting plot (Fig. 3a) somewhat unsatisfactory. For example, the approach shows most symbols are either enriched or depleted at the central position, even though the available data can be explained simply by enrichment of N. This occurs because, although the estimated pi are equal for amino acids other than N, the estimated background rates qi vary, and so the estimated vary. For example, the pseudocount-based plot shows W to be enriched and L to be depleted at the central position even when neither occurs at all in the data, simply because L has a higher background rate.
Our EB stabilization method takes a different approach. Specifically, it directly stabilizes estimates of , instead of separately stabilizing estimates of p and q and then taking their log-ratio. Consequently, the method produces estimates of that vary no more than is supported by the data, resulting in more parsimonious plots. For example, in this example (Fig. 3b) the plot shows a large N alone in the center position, highlighting that the data can be explained purely by strong enrichment of N.
In addition, the EDLogo strategy of using a median adjustment in (1) to reduce visual clutter can be directly applied to derived quantities such as the position specific scoring matrix (PSSM) commonly used to represent protein binding motifs. Additional file 2: Figure S5 shows logo plots of the PSSM matrix (see Additional file 3: Table S6), before and after median adjustment, of the binding motif of protein D-isomer specific 2-hydroxyacid dehydrogenase, catalytic domain (IPR006139) (Motif2,Start = 257, Length = 11) [22, 23]
Conclusions
We present a new sequence logo plot, the EDLogo plot, designed to highlight both enrichment and depletion of elements at each position in a sequence (or other index set). We have also developed statistical methods that can improve these plots by stabilizing enrichment estimates for rare events. We have implemented these methods, as well as standard logo plots, in a flexible R package Logolas, which offers many other features: the ability to use strings instead of characters; various customizable styles and color palettes; several methods for scaling stack heights; and ease of integrating logo plots with external graphics like ggplot2 [24].
Availability and requirements
Project Name : Logolas
Software Download page : Github R package (https://github.com/kkdey/Logolas)
Project Home page : https://kkdey.github.io/Logolas-pages/
Operating system : Platform independent
Programming Language : R (≥ 3.4)
License : GPL (> = 2)
Any restrictions to use by non-academics: No
-
Data : The data used in this paper are reported in Additional file 3 and are also accessible as part of our package Logolas using the data() function.
TFBS example (Fig. 2a, Additional file 2: Figure S1, S6): Additional file 3: Table S1, data(EBF1_disc1)
N-Glycosylation example (Figs. 2b, 3): Additional file 3: Table S2, data(N_Glycosyl_sequences)
Mutational Signature example (Fig. 2c, Additional file 2: Figure S3): Additional file 3: Table S3, data(mutation_sig)
Histone Marks example (Fig. 2d): Additional file 3: Table S4 and S5, data(histone_marks)
PSSM amino acids example (Additional file 2: Figure S5): Additional file 3: Table S6, data(pssm).
Additional files
Acknowledgements
We thank Hussein Al-Asadi, John Blischak, Peter Carbonetto, Yang Li and Yuichi Shiraishi for valuable feedback and helpful discussions. We thank Edward Wallace for suggestions for improving the software interface, and two anonymous referees for helpful comments.
Funding
This work was supported in part by NIH BD2K grant CA198933 and NIH grant HG002585 to M.S. The funding body did not play any role in the study design and collection, analysis and interpretation of the data and the write-up of the manuscript.
Availability of data and materials
The Logolas package is available for R (≥ 3.4) users as a Github R package (https://github.com/kkdey/Logolas). Code for reproducing figures in this paper is available at https://github.com/kkdey/Logolas-paper. Vignettes and a gallery demonstrating features of Logolas are available at (https://github.com/kkdey/Logolas-pages)
Abbreviations
- EDLogo
Enrichment depletion logo
- KLD
Kullback leibler divergence
- PSSM
Position specific scoring matrix
- TFBS
Transcription factor binding site
- wKL-Logo
weighted Kullback Leibler logo
Authors’ contributions
KKD and MS conceived the idea. KKD implemented the package. KKD and DX tested Logolas on the data applications. KKD, DX and MS wrote the manuscript. All the authors have proofread and approved the final version of the manuscript.
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Kushal K. Dey, Email: kkdey@uchicago.edu
Dongyue Xie, Email: dyxie@uchicago.edu.
Matthew Stephens, Email: mstephens@uchicago.edu.
References
- 1.Schneider TD, Stephens R. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18(20):6097–100. doi: 10.1093/nar/18.20.6097. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Bembom O. seqLogo: Sequence logos for DNA sequence alignments. 2018. R package version 1.42.0.
- 3.Wagih O. RWebLogo: plotting custom sequence logos. 2014. R package version 1.0.3. https://CRAN.R-project.org/package=RWebLogo.
- 4.Wagih O. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics. 2017;33(22):3645–3647. doi: 10.1093/bioinformatics/btx469. [DOI] [PubMed] [Google Scholar]
- 5.Crooks GE. Weblogo: A sequence logo generator. Genome Res. 2004;14(6):1188–90. doi: 10.1101/gr.849004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Thomsen MC, Nielsen M. Seq2logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion. Nucleic Acids Res. 2012;40:281–7. doi: 10.1093/nar/gks469. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Coalert N, Helsens K, Martens L, Vandekerckhove J, Gevaert K. Improved visualization of protein consensus sequences by icelogo. Nat Methods. 2009;6:786–7. doi: 10.1038/nmeth1109-786. [DOI] [PubMed] [Google Scholar]
- 8.Nita-Lazar M, Wacker M, Schegg B, Amber S, Aebi M. The nxs/t consensus sequence is required but not sufficient for bacterial n-linked protein glycosylation. Glycobiology. 2004;15(4):361–7. doi: 10.1093/glycob/cwi019. [DOI] [PubMed] [Google Scholar]
- 9.Lam PVN, et al. Structure-based comparative analysis and prediction of n-linked glycosylation sites in evolutionarily distant eukaryotes. Genomics, Proteomics Bioinforma. 2013;11(2):96–104. doi: 10.1016/j.gpb.2012.11.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Koch CM, et al. The landscape of histone modifications across 1 in five human cell lines. Genome Res. 2007;17(6):691–707. doi: 10.1101/gr.5704207. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Nishida K, Frith MC, Nakai K. Pseudocounts for transcription factor binding sites. Nucleic Acids Res. 2008;37(3):939–44. doi: 10.1093/nar/gkn1019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Stephens M. False discovery rates: a new deal. Biostatistics. 2016;18(2):275–94. doi: 10.1093/biostatistics/kxw041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Tan G, Lenhard B. Tfbstools: an r/bioconductor package for transcription factor binding site analysis. Bioinformatics. 2016;32:1555–6. doi: 10.1093/bioinformatics/btw024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Kheradpour P, Kellis M. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Res. 2013;42(5):2976–2987. doi: 10.1093/nar/gkt1249. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Zhao X, et al.Jaspar 2013: An extensively expanded and updated open-access database of transcription factor binding profiles. TBA. 2013; TBA(TBA). [DOI] [PMC free article] [PubMed]
- 16.Sandelin A, Wynand A, Engstrom P, Wasserman WW, Lenhard B. Jaspar: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004;32(Database issue):91–4. doi: 10.1093/nar/gkh012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Wingender E, et al. Transfac: an integrated system for gene expression regulation. Nucleic Acids Res. 2000;28(1):316–9. doi: 10.1093/nar/28.1.316. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Jolma A, Yan J, Whitington T, Toivonen J, Nitta K, Rastas P, Morgunova E, Enge M, Taipale M, Wei G, et al. Dna-binding specificities of human transcription factors. Cell. 2013;152:327–39. doi: 10.1016/j.cell.2012.12.009. [DOI] [PubMed] [Google Scholar]
- 19.Apweiler R, et al. Uniprot: the universal protein knowledgebase. Nucleic Acids Res. 2004;32:115–9. doi: 10.1093/nar/gkh131. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Shiraishi Y, Tremmel G, Miyano S, Stephens M. A simple model-based approach to inferring and visualizing cancer mutation signatures. PLoS Genet. 2015;11(12):1005657. doi: 10.1371/journal.pgen.1005657. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Scarano E, Iaccarino M, Grippo P, Parisi E. The heterogeneity of thymine methyl group origin in dna pyrimidine isostichs of developing sea urchin embryos. Proc Natl Acad Sci. 1967;57(5):1394–400. doi: 10.1073/pnas.57.5.1394. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Shameer K, Nagarajan P, Gaurav K, Sowdhamini R. 3pfdb - a database of best representative pssm profiles (brps) of protein families generated using a novel data mining approach. BioData Min. 2009;2(1):8. doi: 10.1186/1756-0381-2-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Joseph AP, Shingate P, Upadhyay AK, Sowdhamini R. 3PFDB+: improved search protocol and update for the identification of representatives of protein sequence domain families. Database. 2014. [DOI] [PMC free article] [PubMed]
- 24.Wickham H. ggplot2: Elegant Graphics for Data Analysis: Springer-Verlag New York; 2016. http://ggplot2.org.
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The Logolas package is available for R (≥ 3.4) users as a Github R package (https://github.com/kkdey/Logolas). Code for reproducing figures in this paper is available at https://github.com/kkdey/Logolas-paper. Vignettes and a gallery demonstrating features of Logolas are available at (https://github.com/kkdey/Logolas-pages)