Table 1. Data science standards.
Disclosure | Software tools developed under this program will be incorporated into open source software and released to the community. Manuscripts and white papers describing various phases of the project will be released on a regular basis. |
Adoption | All biomedical image data will be converted to master formats such as OME-TIFF or HDF5. Community tools to create, analyse, and manipulate diffraction images will be extended to support these formats. All biomedical data are assigned Digital Object Identifiers through the CDL EZID system and follow modified DataCite and Dataverse metadata schemas. Associated metadata are registered with the International DOI Foundation, making them virtually permanent and independent of SBGrid and Harvard computing infrastructure. All data sets published through the SBDG will be citable using Force 11 recommendations. |
Transparency | Files within individual data sets will be deposited in their original format (no archives or encryption allowed). |
Self-documentation | The majority of diffraction data sets are self-documented: the images themselves include the basic information required for reprocessing. Additional information will be collected during deposition and will include data set representation (the information needed to process the data), reference (relation to PDB files, publications, and other data sets), context (for example, a native data set or a derivative used for phasing), fixity (checksums), and provenance (typically the data collection facility and the project member who deposits the original data set). With conversion to master formats, all secondary information will be appended to the image metadata along with all original headers. |
External dependencies | The ability to reprocess some older data sets and verify master format conversions could depend on access to a specific version of data processing software. As data sets enter our repository, they will be reprocessed with our Data Reprocessing Pipeline (one of several we will develop as part of our Data Mining Pipelines). Data Reprocessing Pipelines will be archived within our system, issued DOIs, and interlinked with the data sets. SBGrid has been archiving structural biology applications since 2002 and therefore has access to previous software versions that might be required to reprocess older data sets. |
Licensing | Biomedical data sets will be deposited under the Creative Commons Zero licence, supporting future development of data validation services and database replications and migrations. |
Technical protection mechanism | The security of the deposited data will be maintained by the DAA. The DAA will join the Library of Congress-sponsored NDSA, and the data architect working on the project will ensure that NDSA recommendations are followed. |
NDSA, National Digital Stewardship Alliance; SBDG, Structural Biology Data Grid.
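
The fixity information listed in the table (checksums) can be illustrated with a short sketch. This is not the SBDG implementation: the choice of SHA-256, the function names, and the manifest layout are assumptions made for illustration only. The sketch records one checksum per file in a data set directory and later reports any file whose contents have changed or gone missing.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to dataset_dir) to its checksum.

    The manifest layout is hypothetical, not a format defined by the SBDG.
    """
    return {
        p.relative_to(dataset_dir).as_posix(): sha256_of_file(p)
        for p in sorted(dataset_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(dataset_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose checksum no longer matches."""
    current = build_manifest(dataset_dir)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

A manifest built at deposition time could be stored alongside the other collected metadata; calling `verify_manifest` again after a migration or replication returns an empty list when every file is unchanged.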