Table 1.
Details of the four datasets used to test the aws-s3-integrity-check tool. Each dataset was tested independently, and the log file produced by each test is available on GitHub [21]. All processing times were measured with the in-built Linux time tool (version 1.7) [22]. Processing time is the time (in minutes and seconds) that the aws-s3-integrity-check tool required to process and evaluate the integrity of all files within each dataset.
Amazon S3 bucket | Data origin | Details | Number of files tested | Bucket size | Processing time | Log file |
---|---|---|---|---|---|---|
mass-spectrometry-imaging | GigaDB | Imaging-type supporting data for the publication “Delineating Regions-of-interest for Mass Spectrometry Imaging by Multimodally Corroborated Spatial Segmentation” [23]. | 36 | 16 GB | real 1m52.193s user 1m8.964s sys 0m24.404s | logs/mass-spectrometry-imaging.S3_integrity_log.2023.07.31-22.59.01.tx |
rnaseq-pd | EGA | Contents of the EGA dataset EGAS00001006380, containing bulk-tissue RNA-sequencing paired nuclear and cytoplasmic fractions of the anterior prefrontal cortex, cerebellar cortex, and putamen tissues from post-mortem neuropathologically-confirmed control individuals [24]. | 872 | 479 GB | real 62m56.793s user 36m26.604s sys 16m10.548s | logs/rnaseq-pd.S3_integrity_log.2023.07.31-23.02.47.txt |
tf-prioritizer | GigaDB | Software-type supporting data for the publication “TF-Prioritizer: a Java pipeline to prioritize condition-specific transcription factors” [25]. | 6 | 3.7 MB | real 0m15.131s user 0m2.012s sys 0m0.240s | logs/tf-prioritizer.S3_integrity_log.2023.07.31-22.58.33.txt |
ukbec-unaligned-fastq | EGA | A subset of the EGA dataset EGAS00001003065, containing RNA-sequencing Fastq files generated from 180 putamen and substantia nigra control samples [26]. | 131 | 440 GB | real 51m12.058s user 31m27.348s sys 14m7.084s | logs/ukbec-unaligned-fastq.S3_integrity_log.2023.08.01-01.03.58.txt |
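
The processing times in Table 1 are reported as real, user and sys durations in a minutes-and-seconds format. To compare datasets it can help to convert these strings into seconds. The following is a minimal sketch of one way to do so; the function names `duration_to_seconds` and `parse_time_cell` are hypothetical helpers introduced for illustration and are not part of the aws-s3-integrity-check tool.

```python
import re

# Matches durations in the format printed in Table 1, e.g. "62m56.793s".
_DURATION = re.compile(r"^(\d+)m(\d+(?:\.\d+)?)s$")


def duration_to_seconds(text):
    """Convert an 'XmY.YYYs' duration string to seconds (hypothetical helper)."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def parse_time_cell(cell):
    """Parse a 'real ... user ... sys ...' cell into seconds per label."""
    tokens = cell.split()
    # Tokens alternate between a label (real/user/sys) and its duration.
    return {
        label: duration_to_seconds(value)
        for label, value in zip(tokens[0::2], tokens[1::2])
    }


if __name__ == "__main__":
    # Processing time cell copied from the rnaseq-pd row of Table 1.
    times = parse_time_cell("real 62m56.793s user 36m26.604s sys 16m10.548s")
    for label, seconds in times.items():
        print(f"{label}: {seconds:.3f} s")
```

The real value is the elapsed wall-clock time, whereas user and sys are the CPU time spent in user and kernel mode, so real is usually the appropriate figure when estimating how long a bucket of a given size takes to verify.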