Author manuscript; available in PMC: 2013 Aug 27.
Published in final edited form as: Nat Biotechnol. 2010 Nov;28(11):1181–1185. doi: 10.1038/nbt1110-1181

Table 1.

Features of reproducible scientific computing in the cloud

Traditional challenges | Cloud computing solutions
Data Sharing
  Traditional challenges:
  • Large data sets are difficult to share over standard internet connections and can require substantial technical resources to obtain and store.

  • Public data sets change frequently, making it difficult to archive and share the entire data repositories used for analyses.

  Cloud computing solutions:
  • Large data sets can be stored as “omnipresent” resources in the cloud, where they are easily copied and accessed directly from any point in the cloud.

  • “Snapshots” of large public data sets can be rapidly copied, archived and referenced.

Software & Applications
  Traditional challenges:
  • Reproducibility of results often requires replication of the precise software environment (i.e. operating system, software and configuration settings) under which the original analysis was conducted. Specific versions of software or programming language interpreters are often required for reproducibility.

  • Analyses are typically conducted by several software programs or scripts executed in a precise sequence across one or several systems as part of an analysis pipeline. Only the individual programs or scripts are usually provided with published results, so substantial technical resources are typically required to recreate the pipeline used in the original analysis.

  • Standard software packages cannot serve all the needs of a scientific domain. Investigators develop non-standard software and computational pipelines to facilitate computational analysis exceeding the capabilities of common tools.

  Cloud computing solutions:
  • Computer systems are virtualized in the cloud, allowing them to be replicated wholesale without concern for the underlying hardware. Snapshots of a fully configured system or group of systems used in an analysis can be rapidly archived as digital machine images. System machine images can be copied and shared with others in the cloud, allowing reconstitution of the precise system configuration used for the original analysis.

  • System images can be pre-configured with common and customized software and tools in a standardized fashion to facilitate common tasks in a scientific domain (e.g. assembly of genome sequences from DNA sequencer data). Pre-configured images can be shared as public resources to promote reproducibility and follow-up studies.

System & Technical
  Traditional challenges:
  • Substantial computational resources might be required to replicate an analysis. Original computational analyses requiring several hundred processors to complete are becoming more common, so reproducibility is limited to those with the requisite computational resources.

  • Substantial technical support is often required to reproduce a computational analysis, including to replicate the software and system configuration the analysis requires. This prevents reproducibility by non-technical investigators lacking substantial IT support.

  Cloud computing solutions:
  • Cloud-based computational resources can be scaled up in a dynamic fashion to provide the necessary computational resources. Investigators can create large computational clusters on demand and disband them upon analysis completion.

  • Complete digital representations of a computational pipeline can be shared as machine images along with deployment scripts that can be executed by non-technical users to reconstitute a complete computational pipeline.

Access & Preservation
  Traditional challenges:
  • Grant-funded software and data repositories often disappear from the public domain after funding is discontinued or the maintainers abandon the project. This leads to loss of access by dependent users and loss of public investment in the resource.

  Cloud computing solutions:
  • Software, code, and data from grant-funded projects can be archived and provided as publicly accessible resources in the cloud. Economies of scale in the cloud allow for active preservation of grant-funded resources for many years after funding ends, at nominal cost.

  • Cloud computing providers are already showing a willingness to host public scientific data sets at no cost.
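
As an illustration of the "Software & Applications" and "Data Sharing" rows, the sketch below records the interpreter, operating system and package versions used for an analysis, and fingerprints a data file so that a specific snapshot can be referenced later. It is a lightweight complement to a machine image, not a substitute for one, and it is not taken from the original article: the function names (`environment_manifest`, `sha256_of_file`, `manifest_fingerprint`) and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
import json
import platform
from importlib import metadata


def environment_manifest(packages):
    """Describe the interpreter, OS, and versions of the named packages.

    Hypothetical helper: `packages` is a list of distribution names chosen
    by the investigator; packages that are not installed are recorded as None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "packages": versions,
    }


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_fingerprint(manifest):
    """Hash a manifest in a key-order-independent way for easy comparison."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

A manifest produced this way could be published alongside results, for example `json.dumps(environment_manifest(["numpy"]), indent=2)`, so that readers can compare their own environment and data digests against the ones used in the original analysis.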