Bioinformatics. 2022 Feb 23;38(9):2624–2625. doi: 10.1093/bioinformatics/btac102

monaLisa: an R/Bioconductor package for identifying regulatory motifs

Dania Machlab, Lukas Burger, Charlotte Soneson, Filippo M Rijli, Dirk Schübeler, Michael B Stadler
Editor: Valentina Boeva
PMCID: PMC9048699  PMID: 35199152

Abstract

Summary

Proteins binding to specific nucleotide sequences, such as transcription factors, play key roles in the regulation of gene expression. Their binding can be indirectly observed via associated changes in transcription, chromatin accessibility, DNA methylation and histone modifications. Identifying candidate factors that are responsible for these observed experimental changes is critical to understanding the underlying biological processes. Here, we present monaLisa, an R/Bioconductor package that implements approaches to identify relevant transcription factors from experimental data. The package can be easily integrated with other Bioconductor packages and enables seamless motif analyses without any software dependencies outside of R.

Availability and implementation

monaLisa is implemented in R and available on Bioconductor at https://bioconductor.org/packages/monaLisa with the development version hosted on GitHub at https://github.com/fmicompbio/monaLisa.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Binding proteins that interact with specific nucleotide sequences, such as transcription factors (TFs), play key roles in the regulation of cellular functions and organismal development (Spitz and Furlong, 2012). Identifying candidate proteins that could play regulatory roles in development or act as drivers for an observed biological response is thus a crucial step in the interpretation of genomics data, such as absolute values or changes of DNA methylation, chromatin modifications, accessibility or transcription. There are various existing tools and methods for regulatory protein identification via their binding motifs. Most of these are command line tools or web servers that cannot be easily integrated with other Bioconductor (Huber et al., 2015) packages for a seamless analysis in R, or they require the installation of additional software outside of R. Conceptually, many of these methods can be roughly divided into two types: enrichment-based methods that compare motif occurrences between sets of sequences, and model-based methods that estimate motif importance from their ability to explain experimental observations. Here, we present monaLisa, short for ‘motif analysis with Lisa’, an R/Bioconductor package that implements both of these approaches and enables seamless motif identification analyses in R.

2 Usage and examples

Enrichment-based tools like HOMER (Heinz et al., 2010) and MEME (Bailey et al., 2015) identify novel or known motifs enriched in a given set of sequences compared with a suitable background. In monaLisa, this is done by first binning sequences, for example gene promoters or enhancers, according to their associated values. In the example here, we use changes of DNA methylation between mouse embryonic stem cells and derived neuronal progenitors (Burger et al., 2013; Stadler et al., 2011) (Fig. 1A). A collection of known motifs, for example from the JASPAR2020 package (Fornes et al., 2020), is then evaluated for enrichment in each bin compared with the background, using HOMER’s normalization method to adjust for differences in sequence composition. Such differences can also be diagnosed using the available visualization functions (Supplementary Fig. S1A and B). Several ways to define background sequences are available, and the results can be visualized as a heatmap (Fig. 1B) that shows the enrichment of each motif in each bin compared with all other bins. Additional confidence can often be gained by focusing on motifs for which the enrichment scales with the numerical value under consideration. monaLisa can also search for enriched k-mers (oligonucleotides of length k), which is particularly useful to complement the motif enrichment analysis and identify potential gaps in the database of known motifs (Supplementary Fig. S1C).
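The bin-based enrichment described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not monaLisa's implementation or API: the function names `assign_bins` and `binned_log2_enrichment`, the pseudocount and the toy input are assumptions, and HOMER-style correction for sequence composition as well as significance testing are left out. Each bin is compared with all other bins, as in the heatmap of Fig. 1B.

```python
import math


def assign_bins(values, breaks):
    """Return a bin index per value, given sorted interior break points.

    A value v falls into bin k when exactly k break points are <= v.
    """
    return [sum(1 for b in breaks if v >= b) for v in values]


def binned_log2_enrichment(has_motif, bins, pseudocount=1.0):
    """log2 ratio of the motif-hit frequency in each bin vs. all other bins.

    has_motif: one boolean per sequence (True if the motif has a hit).
    bins:      one bin index per sequence, as returned by assign_bins().
    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    n_bins = max(bins) + 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for hit, b in zip(has_motif, bins):
        totals[b] += 1
        hits[b] += int(hit)
    all_hits, all_total = sum(hits), sum(totals)

    enrichment = []
    for b in range(n_bins):
        # Background for bin b: the sequences in every other bin.
        bg_hits = all_hits - hits[b]
        bg_total = all_total - totals[b]
        fg_freq = (hits[b] + pseudocount) / (totals[b] + 2 * pseudocount)
        bg_freq = (bg_hits + pseudocount) / (bg_total + 2 * pseudocount)
        enrichment.append(math.log2(fg_freq / bg_freq))
    return enrichment


# Toy example: eight sequences with methylation changes, binned at -0.3 and 0.3.
changes = [-0.8, -0.6, -0.1, 0.0, 0.1, 0.5, 0.7, 0.9]
motif_hit = [True, True, False, False, False, False, False, True]
toy_bins = assign_bins(changes, [-0.3, 0.3])
toy_enrichment = binned_log2_enrichment(motif_hit, toy_bins)
# The first bin (strongest decrease) is enriched (positive log2 ratio),
# the middle bin is depleted (negative log2 ratio).
```

In this toy input the motif occurs mostly in sequences with the largest negative change, so its enrichment is positive in the first bin and negative in the middle bin, the kind of pattern one would read off a row of the enrichment heatmap.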

Fig. 1.


(A, B) Analysis of methylation changes between neuronal progenitors (NP) and embryonic stem cells (ESC): Binned density of methylation levels (A), with bin boundaries and sizes given in the legend, and enrichment and significance heatmaps (B) of motifs (rows) across bins (columns). (C) Analysis of accessibility changes between liver and lung: Directional selection probabilities for motifs identified using stability selection.

In the bin-based approach, motifs are analyzed independently of each other. In contrast, methods such as REDUCE (Roven and Bussemaker, 2003), AME (McLeay and Bailey, 2010) or ISMARA (Balwierz et al., 2014) use linear regression approaches to identify regulatory motifs that are most likely to explain the observed numerical responses. A similar model-based approach is also available in monaLisa, but it uses a different regression framework: randomized lasso stability selection, introduced by Meinshausen and Bühlmann (2010), with the improved error bounds proposed by Shah and Samworth (2013). Regression is performed on several random subsets of the data to calculate motif selection probabilities. This type of regularization has advantages in selecting variables consistently, demonstrating better error control and not depending strongly on the initial regularization chosen (Meinshausen and Bühlmann, 2010). For illustration, we have used monaLisa’s regression with stability selection to identify TF motifs that could explain the observed changes of accessibility between mouse liver and lung [data from The ENCODE Project Consortium (2012)]; these accessibility changes represent the response variable. The predictor matrix consists of predicted binding sites for each TF, and additional variables, such as G + C composition, can also be included. The results can be visualized as stability paths (Supplementary Fig. S2), which show the selection probability for each motif as a function of the regularization steps, or as the final selection probabilities (Fig. 1C) combined with a sign to indicate whether a motif correlates positively or negatively with changes in accessibility.
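The idea behind randomized lasso stability selection can be sketched in a few lines. The following Python code is a conceptual illustration only, not monaLisa's implementation or API: the function names `lasso_cd` and `stability_selection`, the single fixed penalty `lam`, the half-size subsamples and the `weakness` parameter are assumptions of this sketch. It also evaluates only one regularization value rather than a full path, omits an intercept (predictors and response are assumed to be centered), and does not compute the error bounds of Shah and Samworth (2013) or the sign shown in Fig. 1C.

```python
import random


def lasso_cd(X, y, lam, penalty_scale, n_iter=100):
    """Lasso by coordinate descent (no intercept; centered data assumed).

    Minimizes 0.5 * ||y - X b||^2 + n * lam * sum_j penalty_scale[j] * |b_j|.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of predictor j with the partial residual.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n))
            thresh = n * lam * penalty_scale[j]
            # Soft-thresholding update.
            if rho > thresh:
                new = (rho - thresh) / col_sq[j]
            elif rho < -thresh:
                new = (rho + thresh) / col_sq[j]
            else:
                new = 0.0
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def stability_selection(X, y, lam, n_subsamples=50, weakness=0.8, seed=0):
    """Fraction of random subsamples in which each predictor is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        # Draw half of the observations without replacement.
        idx = rng.sample(range(n), n // 2)
        Xs = [X[i] for i in idx]
        ys = [y[i] for i in idx]
        # Randomized lasso: each penalty is divided by a weight in [weakness, 1].
        scale = [1.0 / rng.uniform(weakness, 1.0) for _ in range(p)]
        beta = lasso_cd(Xs, ys, lam, scale)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_subsamples for c in counts]
```

Here each row of `X` would correspond to a genomic region and each column to a motif (for example, its number of predicted binding sites), with `y` holding the change in accessibility. Predictors that are selected in a large fraction of subsamples are the candidates one would report; repeating the procedure over a grid of `lam` values would give the analogue of a stability path.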

The illustrative examples and datasets above are included and described in detail in the package vignette. In addition to enrichment- and regression-based motif identification methods, monaLisa provides further helper functions for motif analyses, including functions to predict motif matches and calculate similarity between motifs.
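As an illustration of what predicting motif matches involves, the following minimal Python sketch scores every window of a sequence against a position weight matrix and reports those above a threshold. It is not monaLisa's implementation or API: the function names `pfm_to_pwm` and `find_hits`, the uniform background, the pseudocount and the toy motif are assumptions, and only the forward strand is scanned. Motif similarity is not covered here.

```python
import math


def pfm_to_pwm(pfm, background=0.25, pseudocount=0.5):
    """Convert per-position base counts into log2-odds weights.

    pfm: list of dicts mapping bases ("A", "C", "G", "T") to counts.
    """
    pwm = []
    for counts in pfm:
        total = sum(counts.get(b, 0) for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2((counts.get(b, 0) + pseudocount) / total / background)
            for b in "ACGT"
        })
    return pwm


def find_hits(seq, pwm, min_score):
    """Return (start, score) for forward-strand windows scoring >= min_score."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(b not in "ACGT" for b in window):
            continue
        score = sum(pwm[k][b] for k, b in enumerate(window))
        if score >= min_score:
            hits.append((start, score))
    return hits


# Toy motif that strongly prefers "TGA".
toy_pwm = pfm_to_pwm([{"T": 10}, {"G": 10}, {"A": 10}])
toy_hits = find_hits("ACTGAC", toy_pwm, min_score=4.0)
# The only window reaching the threshold starts at position 2 (0-based).
```

Counting such hits per sequence and per motif would yield the kind of predictor matrix used in the regression-based approach.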

3 Summary

monaLisa is an R/Bioconductor package for motif analyses applicable to sequences with associated numerical data. Regulatory motifs explaining the observations can be identified using two complementary approaches. monaLisa requires no additional software tools and can be easily integrated with other Bioconductor packages for seamless analyses in R.

Supplementary Material

btac102_Supplementary_Data

Acknowledgements

The authors thank the members of the Rijli, Schübeler and Stadler groups, Luca Giorgetti, Florian Geier and our colleagues from the Novartis Institutes for Biomedical Research for suggestions on the software.

Funding

This work was supported by the Novartis Research Foundation. F.M.R. was supported by the Swiss National Science Foundation [31003A_149573 and 31003A_175776] and the European Research Council under the European Union’s Horizon 2020 research and innovation programme [810111-EpiCrest2Reg]. D.S. furthermore acknowledges support from the Swiss National Science Foundation [310030B_176394] and the European Research Council under the European Union’s (EU) Horizon 2020 Research and Innovation Program Grant Agreements [ReadMe-667951 and DNAaccess-884664].

Conflict of Interest: none declared.  

Data Availability

The DNA methylation data (Fig. 1A and B) are available from Gene Expression Omnibus under accession GSE30202. ATAC-seq data (Fig. 1C) for liver (ENCFF146ZCO, ENCFF109LQF) and lung (ENCFF203DOC, ENCFF823PTD) are available from www.encodeproject.org.

Contributor Information

Dania Machlab, Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, Switzerland; Faculty of Science, University of Basel, Basel, Switzerland.

Lukas Burger, Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, Switzerland.

Charlotte Soneson, Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, Switzerland.

Filippo M Rijli, Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland; Faculty of Science, University of Basel, Basel, Switzerland.

Dirk Schübeler, Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland; Faculty of Science, University of Basel, Basel, Switzerland.

Michael B Stadler, Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland; SIB Swiss Institute of Bioinformatics, Basel, Switzerland; Faculty of Science, University of Basel, Basel, Switzerland.

References

  1. Bailey T.L. et al. (2015) The MEME Suite. Nucleic Acids Res., 43, W39–49.
  2. Balwierz P.J. et al. (2014) ISMARA: automated modeling of genomic signals as a democracy of regulatory motifs. Genome Res., 24, 869–884.
  3. Burger L. et al. (2013) Identification of active regulatory regions from DNA methylation data. Nucleic Acids Res., 41, e155.
  4. Fornes O. et al. (2020) JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res., 48, D87–D92.
  5. Heinz S. et al. (2010) Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell, 38, 576–589.
  6. Huber W. et al. (2015) Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods, 12, 115–121.
  7. McLeay R.C., Bailey T.L. (2010) Motif Enrichment Analysis: a unified framework and an evaluation on ChIP data. BMC Bioinformatics, 11, 165.
  8. Meinshausen N., Bühlmann P. (2010) Stability selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 72, 417–473.
  9. Roven C., Bussemaker H.J. (2003) REDUCE: an online tool for inferring cis-regulatory elements and transcriptional module activities from microarray data. Nucleic Acids Res., 31, 3487–3490.
  10. Shah R.D., Samworth R.J. (2013) Variable selection with error control: another look at stability selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 75, 55–80.
  11. Spitz F., Furlong E.E. (2012) Transcription factors: from enhancer binding to developmental control. Nat. Rev. Genet., 13, 613–626.
  12. Stadler M.B. et al. (2011) DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature, 480, 490–495.
  13. The ENCODE Project Consortium. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74.
