Abstract
Galaxy (https://galaxyproject.org) is a web-based computational workbench used by tens of thousands of scientists across the world to analyze large biomedical datasets. Since 2005, the Galaxy project has fostered a global community focused on achieving accessible, reproducible, and collaborative research. Together, this community develops the Galaxy software framework, integrates analysis tools and visualizations into the framework, runs public servers that make Galaxy available via a web browser, performs and publishes analyses using Galaxy, leads bioinformatics workshops that introduce and use Galaxy, and develops interactive training materials for Galaxy. Over the last two years, all aspects of the Galaxy project have grown: code contributions, tools integrated, users, and training materials. Key advances in Galaxy's user interface include enhancements for analyzing large dataset collections as well as interactive tools for exploratory data analysis. Extensions to Galaxy's framework include support for federated identity and access management and increased ability to distribute analysis jobs to remote resources. New community resources include large public servers in Europe and Australia, an increasing number of regional and local Galaxy communities, and substantial growth in the Galaxy Training Network.
INTRODUCTION
Biomedical studies have become data-intensive, with ever-evolving technological and computational demands. Increasing reliance on complex computational methods prevents many biomedical researchers from accessing and making effective use of these datasets and methods. This also presents significant barriers to reproducibility, dissemination and generalized reuse. Since 2005, the Galaxy project (https://galaxyproject.org) has provided free and open solutions to address these considerable barriers in biomedical research. Galaxy is an open source, community-driven, and web-based platform for accessible, reproducible, and transparent computational research and training. Galaxy supports accessibility by enabling complex computational analysis to be performed from a web browser without requiring programming experience or training in high performance computing. Reproducibility is ensured, as Galaxy automatically captures execution information (e.g. tool name, version, inputs, outputs and parameters) so that a user doesn’t have to manually track provenance; hence, any user can repeat and understand a complete computational analysis, from tool parameters to the dependency tree. Galaxy users are able to share and publish their exact analysis histories, results, workflows and visualizations directly over the web, enabling transparency of computational research efforts and artifacts.
The Galaxy software ecosystem consists of multiple components: (a) an integrated repository of tools for a wide range of biomedical studies including sequence and variant analysis, metagenomics, proteomics, and transcriptomics (1); (b) a web application that enables exploratory data analysis using the integrated tools via a web interface; (c) a multitude of specialized installations of the web application (e.g. https://usegalaxy.org for biomedical research, see https://galaxyproject.org/use/ for a complete list of the installations); (d) a training network that provides tutorials and organizes workshops on using Galaxy for different studies (https://galaxyproject.org/learn/) and (e) an inclusive and diverse community of developers, educators, and researchers encompassing a wide range of skill sets, scientific domains, and research practices that provide development and support (https://help.galaxyproject.org; the adopted code of conduct is available at https://github.com/galaxyproject/galaxy/blob/dev/CODE_OF_CONDUCT.md).
In the past year, the Galaxy project has seen major growth as a platform, a resource, and a community. The usegalaxy.* alliance operates large Galaxy deployments in the US (usegalaxy.org), Europe (usegalaxy.eu) and Australia (usegalaxy.org.au). The Galaxy framework has been widely deployed by others, with 125 other known public instances (https://galaxyproject.org/use). The developer community has thrived, with >7500 tools contributed to the Galaxy ToolShed as of January 2020.
Genomics research is continuously evolving and current challenges include the rapid growth in size and complexity of new datasets, the continuing expansion in the breadth of research areas capable of generating high throughput data, and the integration of genomics with other molecular and phenotypic data. In this article, we describe the latest advances in the Galaxy platform designed to address these challenges.
This manuscript describes work performed by a large group of people located around the world with complementary skills who are critical to the success of the Galaxy project. These individuals are listed in Table 1.
Table 1.
Region | Members | Affiliation |
---|---|---|
Africa | Christopher Barnett, Tharindu Senapathi | Chemistry Department and Scientific Computing Research Unit at the University of Cape Town |
Africa | Thoba Lose, Ziphozakhe Mashologu, Peter van Heusden | South African National Bioinformatics Institute, University of the Western Cape, South Africa |
Australia | Catherine Bromhead, Simon Gladman, Nuwan Goonasekera, Christina Hall, Andrew Lonie | Melbourne Bioinformatics, University of Melbourne, Melbourne, Victoria, Australia |
Australia | Maria Doyle | Peter MacCallum Cancer Centre, Melbourne, Victoria, Australia |
Australia | Thom Cuddihy, Igor Makunin, Gareth Price, Nick Rhodes, Michael Thang | QFAB Bioinformatics, QCIF, Brisbane, Queensland, Australia |
Europe | Loraine Brillet-Guéguen, Gildas Le Corguillé | ABiMS, Roscoff, France |
Europe | Christophe Antoniewski | ARTbio, CNRS and Sorbonne Université, Paris, France |
Europe | Léa Bellenger | ARTbio, INSERM and Sorbonne Université, Paris, France |
Europe | Naïra Naouar | ARTbio, Sorbonne Université, Paris, France |
Europe | Nadia Goué | AuBi, Mesocentre, Clermont Auvergne University, France |
Europe | Saskia Hiltemann, Youri Hoogstrate, Bas Horsman, Rick Jansen, Yunlei Li, Andrew Stubbs, David van Zessen | Bioinformatics, Erasmus MC Cancer Institute, Rotterdam, Netherlands |
Europe | Frederik Coppens, Bert Droesbeke, Ignacio Eguinoa, Michiel Van Bel | Center for Plant Systems Biology, Vlaams Instituut voor Biotechnologie, Ghent, Belgium |
Europe | Jean-François Dufayard, Maryline Summo | CIRAD, Montpellier, France |
Europe | Anshu Bhardwaj | CSIR-Institute of Microbial Technology, France |
Europe | Tomas Klingström | Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden |
Europe | Federico Zambelli | Department of Biosciences, University of Milan, Milano, Italy |
Europe | Rolf Backofen, Bérénice Batut, Simon Bray, Gianmauro Cuccuru, Anika Erxleben, Stephan Flemming, Björn Grüning, Alireza Khanteymoori, Anup Kumar, Jan Leendertse, Wolfgang Maier, Helena Rasche, Mehmet Tekman, Joachim Wolff, Oleg Zharkov | Department of Computer Science, Albert-Ludwigs-University Freiburg, Freiburg, Germany |
Europe | Anne Fouilloux | Department of Geosciences, University of Oslo, Norway |
Europe | Florence Combes, Yves Vandenbrouck | Department of Health, CEA, Grenoble, France |
Europe | Nicola Soranzo | Earlham Institute, Norwich Research Park, Norwich, UK |
Europe | Lucille Lopez-Delisle | EPFL SV ISREC UPDUB, 1015 Lausanne, Switzerland |
Europe | Pablo Moreno | European Bioinformatics Institute (EMBL-EBI) |
Europe | Hans-Rudolf Hotz | Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland |
Europe | Sarah Maman | GenPhySE, Université de Toulouse, INRA, INPT, ENVT, Castanet Tolosan, France |
Europe | Matthias Bernt | Helmholtz Centre for Environmental Research, UFZ, Young Investigators Group Bioinformatics and Transcriptomics, Leipzig, Germany |
Europe | Anthony Bretaudeau | IGEPP, INRAE, Institut Agro, Univ Rennes, Rennes, France |
Europe | Timothy Dudgeon | Informatics Matters Ltd. |
Europe | Olivier Inizan, Valentin Loux | INRAE, Jouy-en-Josas, France |
Europe | Kenzo-Hugo Hillion, Valentin Marcon, Fabien Mareuil, Hervé Ménager, Rémi Planel | Institut Pasteur, Paris, France |
Europe | Marco Antonio Tangaro | Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Bari, Italy |
Europe | Alexis Dereeper | Institute of Research for Development, Marseille, France |
Europe | Melanie Föll | Institute of Surgical Pathology, Medical Center, Albert-Ludwigs-University Freiburg, Freiburg, Germany |
Europe | Peter Cock | James Hutton Institute, UK |
Europe | Peter Selten | KWS SAAT SE & Co. KGaA |
Europe | Ruben Vorderman | Leiden University Medical Center, Netherlands |
Europe | Alan Amossé, Yvan Le Bras, Coline Royaux | Museum National d’Histoire Naturelle, Paris, France |
Europe | Franck Giacomoni | PFEM, INRAE, Saint Genès Champanelle, France |
Europe | Thanh Le-Viet, Andrew Page | Quadram Institute Bioscience, Norwich Research Park, Norwich, UK |
Europe | Thomas Lawson | School of Biosciences, University of Birmingham, UK |
Europe | Olivier Sallou | Univ Rennes, Inria, CNRS, IRISA, Rennes, France |
Europe | Ralf Weber | University of Birmingham, UK |
Europe | Krzysztof Poterlowicz | University of Bradford, UK |
Europe | Ivan Kuzmin | University of Tartu, Estonia |
North America | Dan Fornika | BC Centre for Disease Control, Canada |
North America | Carrie Ganote | Bioinformatics Analyst at Indiana University, USA |
North America | Dave Bouvier, Martin Čech, John Chilton, Nate Coraor, Assunta DeSanto, Jennifer Hillman-Jackson, Kaivan Kamali, Nick Keener, Delphine Lariviere, Anton Nekrutenko, Nick Stoler, Marius van den Beek | Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA, USA |
North America | Enis Afgan, Dannon Baker, Dave Clements, Sergey Golitsynskiy, Juleen Graham, Aysam Guerler, Mohammad Heydarian, Alexandru Mahmoud, Alex Ostrovsky, Nathan Roach, James Taylor, Jenn Vessio | Department of Biology, Johns Hopkins University, Baltimore, MD, USA |
North America | Jeremy Goecks, Qiang Gu, Mason Houtz, Vahid Jalili, Luke Sargent | Department of Biomedical Engineering, School of Medicine, Oregon Health and Science University, Portland, OR, USA |
North America | Michael Schatz | Dept. of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA |
North America | Daniel Blankenberg, Jayadev Joshi, Vijay Nagampalli | Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA |
North America | Greg Von Kuster | Huck Institutes of the Life Sciences, Penn State University, University Park, PA, USA |
North America | Robert Leach, Lance Parsons | Lewis-Sigler Institute of Integrative Genomics, Princeton University, USA |
North America | Brad Langhorst | New England Biolabs, USA |
North America | Philip Mabon, Aaron Petkau, Jeffrey Thiessen | Public Health Agency of Canada, Canada |
North America | Arthur Eschenlauer, Tim Griffin, Pratik Jagtap, James Johnson, Praveen Kumar, Subina Mehta | University of Minnesota, Minnesota, Minneapolis, MN, USA |
NEW AND ENHANCED FEATURES
Enhanced Galaxy user experience for increasingly complex analysis
Reduction in sequencing cost has led to increases in the complexity and size of DNA sequencing data. One of the major goals of Galaxy is to allow users to upload, organize, and manipulate complex experimental designs entirely through the Galaxy user interface (UI). To achieve this objective, we developed a number of new features described below.
Dataset collections for analysis of an unlimited number of datasets through the UI
Modern experiments typically involve a large number of datasets organized as complex hierarchies. For example, consider the simple case of a resequencing experiment with 10 samples composed of 20 files, corresponding to 10 forward and 10 reverse paired-end read datasets. Representing these individual datasets as 20 interface elements without their intrinsic hierarchical relationship is counter-intuitive and impractical. Therefore, we have developed the concept of dataset collections for representing complex assemblies of datasets, which enable Galaxy users to encode semantic relationships. Figure 1 illustrates collections representing either a simple list of datasets (Figure 1 panel A) or a set of samples from a paired-end library (Figure 1, panel B; a list of pairs).
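The resequencing example above can be sketched as a small data structure. This is only an illustrative model of a list of pairs, not Galaxy's internal representation; the class names `Pair` and `PairedList` and the file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Pair:
    """One paired-end sample: a forward and a reverse read file."""
    forward: str
    reverse: str


@dataclass
class PairedList:
    """A list of pairs keyed by sample name (hypothetical structure)."""
    pairs: Dict[str, Pair] = field(default_factory=dict)

    def add(self, sample: str, forward: str, reverse: str) -> None:
        self.pairs[sample] = Pair(forward, reverse)

    def flatten(self) -> List[str]:
        """Return every file as a flat list, discarding the pairing."""
        files: List[str] = []
        for pair in self.pairs.values():
            files.extend([pair.forward, pair.reverse])
        return files


def build_resequencing_collection(n_samples: int) -> PairedList:
    """Build a collection with one forward/reverse pair per sample."""
    collection = PairedList()
    for i in range(1, n_samples + 1):
        name = f"sample{i}"
        # File names are made up for illustration.
        collection.add(name, f"{name}_R1.fastq", f"{name}_R2.fastq")
    return collection


collection = build_resequencing_collection(10)
print(len(collection.pairs), "samples,", len(collection.flatten()), "files")
```

Keeping the pairing in the structure means a tool can be applied once per sample (10 elements) rather than the user selecting 20 unrelated files.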
Name tags and group tags
To allow users to easily follow steps within Galaxy analyses (histories) we have developed name tags, which allow users to see all analysis steps that use or derive from a given dataset, and power new features for multiple factor analysis of collections of datasets. Group tags are a special class of tags with key-value pairs that can be attached to a collection during upload or using collection operation tools. These tags can describe multiple sets of variables for a collection. Once set, these tags can be used intelligently by tools that need to divide collections into multiple overlapping factors or sets of datasets. Figure 2 illustrates an example of using group tags to study datasets that differentiate gene expression between smokers and non-smokers. The Galaxy tool interface for DESeq2 (2), a popular tool for gene expression data analysis, is able to use group tags to allow comparisons between factors such as, in this case, smokers and non-smokers.
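As a rough sketch of how key-value group tags can divide a collection into factor levels, the following assumes tags are written as `group:key:value` strings. That textual form, the function names, and the sample data are assumptions made for illustration and are not taken from the tool described here.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def parse_group_tag(tag: str) -> Optional[Tuple[str, str]]:
    """Split 'group:key:value' into (key, value); return None for other tags."""
    parts = tag.split(":", 2)
    if len(parts) == 3 and parts[0] == "group":
        return parts[1], parts[2]
    return None


def split_by_factor(tagged: Dict[str, List[str]], factor: str) -> Dict[str, List[str]]:
    """Group dataset names by the value of one group-tag key."""
    levels: Dict[str, List[str]] = defaultdict(list)
    for name, tags in tagged.items():
        for tag in tags:
            parsed = parse_group_tag(tag)
            if parsed is not None and parsed[0] == factor:
                levels[parsed[1]].append(name)
    return dict(levels)


# Hypothetical collection: each dataset carries several tags at once,
# so the same collection can be split along overlapping factors.
datasets = {
    "sampleA": ["group:smoking:smoker", "group:sex:female"],
    "sampleB": ["group:smoking:non-smoker", "group:sex:male"],
    "sampleC": ["group:smoking:smoker", "group:sex:male"],
    "sampleD": ["name:raw-reads"],  # not a group tag, so it is ignored
}

by_smoking = split_by_factor(datasets, "smoking")
by_sex = split_by_factor(datasets, "sex")
print(by_smoking)
print(by_sex)
```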
Data selector dialog
Previously, selecting datasets for analysis could be time-consuming in Galaxy, because a user may have thousands of datasets available, distributed across many individual analysis histories and shared libraries. Galaxy data libraries are repositories for shared datasets. To simplify dataset selection, a new dataset selector dialog has been implemented in tool forms, visualizations, and wherever datasets are used. It allows selection of datasets from multiple histories and libraries, and from within dataset collections. This greatly reduces the need for users to copy datasets between different components of Galaxy, and makes it easier and faster for users to find datasets for analysis.
Interactive tools for exploratory data analysis
Analysis of any biomedical data occurs in several distinct modes. For primary processing at the start of an analysis, there are well established practices and corresponding tools for data quality assessment and initial steps. In sequence-based experiments, these may be read mapping, variant identification, peak prediction and so on. Each of these steps has a well defined set of software tools that fit well into the Galaxy paradigm. However, research-focused data analysis invariably comes to a point where specialized tools no longer exist, and further result interpretation needs to be done on an ad hoc exploratory basis. This stage does not fit well into the historical Galaxy way of conducting analysis. In fact, tools designed specifically for exploratory data analysis, such as Jupyter (3) and RStudio (4), are often better choices for these modes.
To address this challenge we have developed Interactive tools, which enable visualization and analysis environments such as Jupyter, RStudio and others to be used in Galaxy (see http://live.usegalaxy.eu). Starting an interactive tool is no different from invoking any other Galaxy tool, and inherits all the capabilities of existing Galaxy tools (Figure 3). Interactive tools succeed interactive environments (5) as Galaxy's method for running open-ended visualization and analysis environments, making them more robust and usable.
Federated identity and access management
A growing number of well-curated and essential biomedical data repositories now exist. In accordance with access control regulations, some repositories authorize only users with verifiable identities to access their data. The OpenID Connect (OIDC) protocol is a leading web standard for user identification and for securely sharing identities between applications. We have recently enabled user authentication through the OIDC protocol, and users can now securely log in to Galaxy using their existing social and institutional identities (6).
Leveraging the OIDC protocol, we have also implemented authorization delegation, where users can securely grant Galaxy authorization to access their private resources in the Cloud. Using this approach, authorization is based on short-lived tokens that are generated specifically for a Galaxy instance and are independent from any user's authorization. The tokens cannot be exploited to impersonate the instance or the user, and their authorization scope follows the principle of least privilege and can be modified or revoked at any time (6).
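The properties listed above (short lifetime, binding to one instance, narrow scope, revocability) can be illustrated with a toy model. This is not the OIDC protocol or Galaxy's implementation; the class `DelegatedToken`, the helper `issue_token`, and the scope names are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass
class DelegatedToken:
    instance_id: str         # the Galaxy instance the token was issued for
    scopes: FrozenSet[str]   # the only operations the token permits
    expires_at: float        # POSIX timestamp after which the token is invalid
    revoked: bool = False    # the owner can revoke the token at any time

    def allows(self, instance_id: str, scope: str, now: Optional[float] = None) -> bool:
        """Check instance binding, scope, expiry and revocation together."""
        now = time.time() if now is None else now
        return (
            not self.revoked
            and instance_id == self.instance_id
            and scope in self.scopes
            and now < self.expires_at
        )


def issue_token(instance_id: str, scopes: Iterable[str],
                lifetime_seconds: float, now: Optional[float] = None) -> DelegatedToken:
    """Issue a token that expires lifetime_seconds after `now`."""
    now = time.time() if now is None else now
    return DelegatedToken(instance_id, frozenset(scopes), now + lifetime_seconds)


# A 15-minute, read-only token for one hypothetical instance.
token = issue_token("galaxy-instance-1", {"storage:read"}, lifetime_seconds=900, now=0.0)
print(token.allows("galaxy-instance-1", "storage:read", now=10.0))
print(token.allows("galaxy-instance-1", "storage:write", now=10.0))
```

In this model a request is accepted only when every check passes, so an expired, revoked, out-of-scope, or wrongly addressed token is rejected in the same way.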
Access to user-owned cloud-based storage
Cloud computing platforms provide reliable and cost-effective data storage solutions. We have implemented features that enable users to transfer data between Galaxy and cloud-based storage services including Amazon Simple Storage Service (S3), Microsoft Azure, and Google Cloud Storage. Users can now directly copy data stored on the Cloud to a Galaxy server as well as copy data from Galaxy onto the Cloud. Previously, to copy data between Galaxy and the Cloud, users were required to either first download data to local storage and then upload to Galaxy or the Cloud, or to use signed URLs in order to copy from the Cloud to a Galaxy history. These approaches have some drawbacks; for instance, obtaining a signed URL for data stored on the Cloud requires platform knowledge, and using local storage as a cache to transfer data between Galaxy and cloud resources can be insecure and tedious for large data sizes. With the new feature, however, users grant Galaxy authorization to access their cloud-based storage through a secure approach (6), and data is transferred directly from a cloud storage service to a Galaxy server.
Increasing accessibility and utilization of compute infrastructure
The ever-increasing computational demands of biomedical data analysis have been a driving motivation for continuous improvement in how Galaxy utilizes heterogeneous compute infrastructures. Natively, Galaxy runs user analyses on compute resources where data is shared between Galaxy and the resources via a single shared file system. However, this requirement limits where Galaxy can be installed. To alleviate it, we have developed Pulsar, an application that runs on remote computers (e.g. (7–9) as used by usegalaxy.org), listens to Galaxy job execution requests (a job is a unit of execution that runs a given tool on given input data), automatically transfers the job data, and executes the job. However, this process involves two data transfers: first Galaxy fetches input data from its persistent storage, then sends it to the remote computer where Pulsar executes the job. To optimize this process, we enhanced the implementation to enable Pulsar to fetch the data directly, reducing the process to a single data transfer (see Figure 4). This also paves a path for Pulsar to fetch data directly from a user's private storage, such as an Amazon S3 bucket, without the Galaxy server's involvement; this feature is being actively developed.
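The difference between the two staging modes can be made concrete by counting copies of the input data. The sketch below is a toy model only: it does not use Pulsar's code or API, and the names `TransferLog`, `stage_via_galaxy`, and `stage_direct` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TransferLog:
    """Records each copy of the job's input data as (source, destination)."""
    hops: List[Tuple[str, str]] = field(default_factory=list)

    def copy(self, source: str, destination: str) -> None:
        self.hops.append((source, destination))


def stage_via_galaxy(log: TransferLog) -> None:
    """Original mode: Galaxy fetches the input, then sends it to the remote computer."""
    log.copy("persistent storage", "Galaxy server")
    log.copy("Galaxy server", "remote computer (Pulsar)")


def stage_direct(log: TransferLog) -> None:
    """Enhanced mode: the remote side fetches the input itself."""
    log.copy("persistent storage", "remote computer (Pulsar)")


via_galaxy, direct = TransferLog(), TransferLog()
stage_via_galaxy(via_galaxy)
stage_direct(direct)
print(len(via_galaxy.hops), "transfers via Galaxy;", len(direct.hops), "transfer when fetched directly")
```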
Workflow, scheduling, pipeline comparison, and parameter sweeps
Galaxy's workflow engine has been improved for efficient scheduling of complex workflows involving thousands of datasets and supporting a wide range of dataset relationships and collections. The consistency of the workflow editor has been improved to support non-file step parameter nodes and non-data outputs. The Galaxy workflow language has been significantly improved, with a new YAML-based format offering better readability and portability.
New and updated tool suites
Several new tool suites have been integrated into Galaxy to reflect new analysis needs of the biomedical community, with each suite encompassing tools required to perform an end-to-end analysis of data generated by a particular technology. For analysis of single-cell datasets, the Seurat (10,11) and Scater (12) toolkits have been integrated into Galaxy. Tools in these suites enable complete analysis of single-cell data, from quality control to clustering and visualization. An ensemble of machine learning tools has also been added to Galaxy (https://usegalaxy-eu.github.io/index-ml); leveraging Scikit-learn (13) and TensorFlow (14), users can build predictive models from labeled datasets in Galaxy. These tools include approaches for data preprocessing (e.g. normalization), feature selection, defining and training regression and classification models using both traditional and deep learning approaches, model stacking to create meta-ensembles, and methods for evaluating model performance. Image analysis tools have also been added to Galaxy, with applications in chemoinformatics and histopathology. These tools make ImageJ (15,16) features available via Galaxy.
To facilitate particular analyses, we have deployed domain-specific Galaxy instances that provide users with domain-specific preconfigured workflows, tool suites, and interactive training materials. These instances simplify use of Galaxy for specific applications, such as metabolomics (https://metabolomics.usegalaxy.eu), metagenomics (https://metagenomics.usegalaxy.eu), proteomics (https://proteomics.usegalaxy.eu), and single-cell omics (https://singlecell.usegalaxy.eu).
GALAXY GLOBAL ECOSYSTEM
Galaxy has grown into a thriving global ecosystem supported by a vibrant community of biomedical researchers, tool developers, and system engineers. The coherence of this ecosystem is realized by a multitude of components, including the ToolShed (1), a public repository of all 7500+ tools integrated into Galaxy; a public repository of over 36,000 Singularity containers (17) for all Bioconda packages (18) that provide the software dependencies for Galaxy tools; and 5.6 TB (compressed) of publicly available reference and index data. Leveraging these resources, Galaxy instances running on institutional computing clusters, cloud platforms, and even the https://usegalaxy.org server itself, all see a unified and consistent view of all of the shared components. When a Galaxy instance anywhere in the world needs to run an analysis, it can fetch the required reference or index data and a container with the appropriate tool from these repositories and use them to execute the analysis, ensuring common availability across all Galaxy instances. We have continued developing features to improve this coherence; namely, we are developing features to ensure that workflows can be shared between different Galaxy instances and still execute correctly and reproducibly. The three Galaxy instances with the largest user bases, i.e. https://usegalaxy.org (hosted in the U.S.), https://usegalaxy.eu (hosted in Europe), and https://usegalaxy.org.au (hosted in Australia), all use these shared resources.
COMMUNITY
The Galaxy Community (https://galaxyproject.org/community) is vital to the success of the project, and this is reflected throughout the Galaxy ecosystem.
Support and communication
In late 2018, we moved our online help forum to a new platform (https://help.galaxyproject.org/). Previously, our primary support channel had been a Biostars-based forum, and a mailing list before that. This move has been popular, with over 1200 accounts and 1100 threads created in the first 15 months, and an average of 488 page views per day. The Galaxy community now also provides support via Gitter (a developer-oriented chat room, available from https://gitter.im/galaxyproject/) with 1460 contributors.
Galaxy provides news and announcements through several channels. The Galaxy Community Hub, now in its fourth year, includes news items, blog posts, and an events calendar. Monthly newsletters are posted to the hub and highlight upcoming events, open positions, relevant news items, and recent releases and platform news. The newsletters and other significant items are announced on the Galaxy-Announce mailing list (with over 11 000 members) and Twitter (with over 9000 followers).
There are different ways to use Galaxy. To help users choose a method that suits their application best, we launched the Galaxy Platform Directory (https://galaxyproject.org/use/) in 2018. This is a searchable directory that lists ways to easily use Galaxy, either immediately (on over 125 public Galaxy servers), or after setup (with 5 commercial Cloud instances, 30 containers, 6 virtual machines or 12 academic cloud providers).
Communities, events and training
The Galaxy Community supports many thriving regional communities which contribute code to the software framework based on their needs, run Galaxy instances for their users, and hold local meetings and training events. Europe and Australia are the largest examples, each with regional events and training. Other communities include a pan-African group and groups in France, the Netherlands, and Japan. In 2020, we welcome new groups in India, Southeast Asia, and Korea.
The community organized 261 events with Galaxy-related content in 2018–2019 (https://galaxyproject.org/events/). That includes 164 training events, 83 conferences or meetings, and 17 collaborative work events, with Galaxy as the focus of 131 of those events. The events were held in 31 different countries, and were organized primarily by local Galaxy community members.
The Galaxy Training Network (GTN, https://training.galaxyproject.org/) provides a library of slides, hands-on tutorials, and training datasets covering many domains in biomedical research. It also includes tutorials on how to administer Galaxy servers, wrap tools for Galaxy, and contribute code to the Galaxy software framework and to the GTN library itself. The GTN is a community-managed and community-driven effort that has become the driving force behind the large number of training events that now use Galaxy. The GTN now contains 155 tutorials in 20 topic areas (e.g. transcriptomics and proteomics), created by 146 contributors. In addition, the European team has implemented Training Infrastructure as a Service (TIaaS), which provides instructors with reserved resources on https://usegalaxy.eu for the duration of training, and supports instructor tracking of trainees' progress.
The Intergalactic Utilities Commission (IUC) is another community driven group charged with establishing and maintaining best practices and gold-standard tool wrappers for the Galaxy ToolShed (1). The ToolShed now contains over 7500 tool definitions, created by over 600 unique contributors.
We continue to support regional meetings and to hold an annual Galaxy Community Conference (GCC) for the global community. GCC has run every year since 2010, and has had 200 or more participants since 2012. In 2018, and again in 2020, we colocated GCC with the Bioinformatics Open Source Conference (BOSC). These events feature training days, a multi-day conference and several days of collaborative work. GCC brings together people from all over the world to establish and reinforce connections in the community. In 2020, we are holding this event online, and in both the eastern and western hemispheres.
The last two years have seen an increase in our outreach to researchers in underrepresented groups. Galaxy has recently had workshops and speakers at the Society for Advancement of Chicanos/Hispanics and Native Americans in Science (SACNAS 2017–19) and the American Indian Science and Engineering Society (AISES 2019); has offered travel fellowships for GCCBOSC 2018, GCC2019, Galaxy Africa 2018, and Galaxy Admin Training 2020; and has offered childcare at GCCBOSC 2018 and GCC2019.
Publications using Galaxy
The project tracks publications that use, reference, extend or implement Galaxy. In 2020 we reached over 9000 total publications, including over 7500 journal articles, 500 books, 400 conference papers, and 300 theses (https://www.zotero.org/groups/galaxy). Over 5000 of these publications cited Galaxy in their methods. This publication corpus reflects the broad range of domains that Galaxy is applied to, including life science domains beyond genomics (e.g. ecology, proteomics), as well as domains outside the life sciences (e.g. natural language processing, climate science).
ACKNOWLEDGEMENTS
The growth of the Galaxy project is made possible by a growing community of users, developers, system administrators and educators. We are extremely grateful to the Texas Advanced Computing Center at the University of Texas for hosting the Public Galaxy site.
Authors contributions: E.A., J.G., D.B., J.T. and A.N. have a significant financial interest in GalaxyWorks, a company that may have a commercial interest in the results of this research and technology.
Contributor Information
Vahid Jalili, Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA.
Enis Afgan, Department of Biology, Johns Hopkins University, Baltimore, MD, USA.
Qiang Gu, Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA.
Dave Clements, Department of Biology, Johns Hopkins University, Baltimore, MD, USA.
Daniel Blankenberg, Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA.
Jeremy Goecks, Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA.
James Taylor, Department of Biology, Johns Hopkins University, Baltimore, MD, USA.
Anton Nekrutenko, Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA, USA.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
Usegalaxy.org efforts are funded by NIH Grants [U41 HG006620] and NSF ABI Grant [1661497]; Usegalaxy.eu is supported by the German Federal Ministry of Education and Research grants [031L0101C]; de.NBI-epi; Usegalaxy.org.au is supported by Bioplatforms Australia and the Australian Research Data Commons through funding from the Australian Government National Collaborative Research Infrastructure Strategy. Funding for open access charge: NIH Grants [U41 HG006620].
Conflict of interest statement. The potential conflict of interest has been reviewed and managed by Johns Hopkins University (E.A. and J.T.), Oregon Health and Science University (J.G.), Cleveland Clinic (D.B.) and Pennsylvania State University (A.N.).
REFERENCES
- 1. Blankenberg D., Von Kuster G., Bouvier E., Baker D., Afgan E., Stoler N., Galaxy Team, Taylor J., Nekrutenko A. Dissemination of scientific software with Galaxy ToolShed. Genome Biol. 2014; 15:403.
- 2. Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15:550.
- 3. Kluyver T., Ragan-Kelley B., Pérez F., Granger B.E. Jupyter Notebooks-a publishing format for reproducible computational workflows. In: Loizides F., Schmidt B. (eds). Proceedings of the 20th International Conference on Electronic Publishing. 2016; 87–90.
- 4. RStudio Team and Others. RStudio: integrated development for R. 2015; 639:Boston: RStudio, Inc; 640.
- 5. Grüning B.A., Rasche E., Rebolledo-Jaramillo B., Eberhard C., Houwaart T., Chilton J., Coraor N., Backofen R., Taylor J., Nekrutenko A. Jupyter and Galaxy: Easing entry barriers into complex data analyses for biomedical researchers. PLoS Comput. Biol. 2017; 13:e1005425.
- 6. Jalili V., Afgan E., Taylor J., Goecks J. Cloud bursting Galaxy: federated identity and access management. Bioinformatics. 2020; 36:1–9.
- 7. Towns J., Cockerill T., Dahan M., Foster I., Gaither K., Grimshaw A., Hazlewood V., Lathrop S., Lifka D., Peterson G.D. et al. XSEDE: accelerating scientific discovery. Comput. Sci. Eng. 2014; 16:62–74.
- 8. Stewart C.A., Cockerill T.M., Foster I., Hancock D., Merchant N., Skidmore E., Stanzione D., Taylor J., Tuecke S., Turner G. et al. Jetstream: a self-provisioned, scalable science and engineering cloud environment. Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure. 2015; 1–8.
- 9. Nystrom N.A., Levine M.J., Roskies R.Z., Scott J.R. Bridges: a uniquely flexible HPC resource for new communities and data analytics. Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure. 2015; 1–8.
- 10. Stuart T., Butler A., Hoffman P., Hafemeister C., Papalexi E., Mauck W.M. 3rd, Hao Y., Stoeckius M., Smibert P., Satija R. Comprehensive integration of Single-Cell data. Cell. 2019; 177:1888–1902.
- 11. Butler A., Hoffman P., Smibert P., Papalexi E., Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 2018; 36:411–420.
- 12. McCarthy D.J., Campbell K.R., Lun A.T.L., Wills Q.F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics. 2017; 33:1179–1186.
- 13. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 2011; 12:2825–2830.
- 14. Abadi M., Agarwal A., Barham P., Brevdo E., Chen Z., Citro C., Corrado G.S., Davis A., Dean J., Devin M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016; 265–283.
- 15. Schneider C.A., Rasband W.S., Eliceiri K.W. NIH Image to ImageJ: 25 years of image analysis. Nat. Methods. 2012; 9:671–675.
- 16. Rueden C.T., Schindelin J., Hiner M.C., DeZonia B.E., Walter A.E., Arena E.T., Eliceiri K.W. ImageJ2: ImageJ for the next generation of scientific image data. BMC Bioinformatics. 2017; 18:529.
- 17. Kurtzer G.M., Sochat V., Bauer M.W. Singularity: Scientific containers for mobility of compute. PLoS One. 2017; 12:e0177459.
- 18. Grüning B., Dale R., Sjödin A., Chapman B.A., Rowe J., Tomkins-Tinch C.H., Valieris R., Köster J., Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat. Methods. 2018; 15:475–476.
- 19. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013; arXiv: https://arxiv.org/abs/1303.3997, 16 March 2013, preprint: not peer reviewed.