Reviewer name and names of any other individuals who aided in the review |
Jens Jensen |
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published manuscript. (If no, please inform the editor that you cannot review this manuscript.) |
Yes |
Is the language of sufficient quality? |
Yes |
Please add additional comments on language quality to clarify if needed |
A few oddities here and there but nothing that hinders understanding |
Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? |
No |
Additional Comments |
See comments to authors |
Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code? |
Yes |
Additional Comments |
Code is a shell script on GitHub |
As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code? |
Yes |
Additional Comments |
Standard GitHub practice (pull requests) |
Is the code executable? |
Yes |
Additional Comments |
Shell scripts |
Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? |
Yes |
Additional Comments |
Code has support for detecting when the AWS environment is not correctly set up |
Is the documentation provided clear and user friendly? |
Yes |
Additional Comments |
The paper is very detailed... which I quite like, as often in computing papers people gloss over important details |
Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required? |
Yes |
Additional Comments |
|
Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? |
Yes |
Additional Comments |
|
Have any claims of performance been sufficiently tested and compared to other commonly-used packages? |
Yes |
Additional Comments |
Performance should be mostly dependent on the local checksumming tool |
Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)? |
No |
Additional Comments |
Not applicable |
Are there (ideally real world) examples demonstrating use of the software? |
Yes |
Additional Comments |
|
Is automated testing used or are there manual steps described so that the functionality of the software can be verified? |
No |
Additional Comments |
Not applicable |
Any Additional Overall Comments to the Author |
This paper describes a tool for verifying the integrity of data uploaded to Amazon S3, with specific applications to NGS data. The paper is very, very detailed, which in many ways is quite welcome, as it makes the paper suitable for beginners in the field as well, and authors often gloss over important details. The downside is that the paper is quite long for what it presents, and contains some repetition.
As is standard practice, integrity checking is done at the binary level, with checksums. This means that different encodings of the same file will mismatch (as a simple example, a CSV file may have LF or CRLF line endings), or the file may be compressed in storage, as it would be on tape, while you want to compare against the checksum of the uncompressed file. Using checksums alone is certainly sufficient for the simple application described here, where the purpose is only to verify that an upload has completed.
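For instance (an illustrative sketch; the file names are arbitrary), the same logical CSV content with different line endings gives different checksums:
printf 'a,b\n1,2\n' > unix.csv        # LF line endings
printf 'a,b\r\n1,2\r\n' > dos.csv     # CRLF line endings
md5sum unix.csv dos.csv               # two different checksums for the same logical content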
If a checksum is transmitted with the file, only a single file can be uploaded. The other problem with this approach is that the file has to be read twice: once to calculate its checksum, and once to upload the file.
Despite the much-publicised cryptographic weaknesses of MD5, it is certainly sufficient for non-cryptographic integrity checking -- meaning we protect against accidental modification of the file, not modification by a malicious adversary.
Coming from disciplines such as high energy physics, where data integrity is the primary security requirement, we checksum obsessively. The most common failure mode we see is truncated files: the transfer, for some reason, was interrupted and not resumed or retried (which is normally done automatically). For a single-stream transfer we would calculate the checksum as the data comes in over the transfer interface -- because it is cheap to do (i.e. it doesn't involve extra disk access) -- but would still need to eventually do a "deep" checksum (of the file on disk) to check that the file was not just transferred but also written correctly.
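As a sketch of that pattern (not the authors' tool; the URL and file names are placeholders):
# on-the-fly checksum: hash the stream while it is being written to disk, no extra read
curl -s https://example.org/data.bin | tee data.bin | md5sum | cut -d' ' -f1 > inflight.md5
# deep checksum: re-read the file from disk to confirm it was also written correctly
md5sum data.bin | cut -d' ' -f1 > ondisk.md5
cmp -s inflight.md5 ondisk.md5 && echo OK || echo MISMATCH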
Splitting the file into chunks could be a bit inefficient, though it is convenient for a shell script. A third-party tool ("s3md5") takes a chunk from the file (using dd and piping it to md5sum), one chunk at a time. This is fine, as no temporary copy is created and the file is read sequentially, though it requires an extra read of the file (one for checksums, one for the transfer), which is not too expensive if the local file is cached in RAM, as it would likely be if it is not too large.
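Roughly, the dd-based approach looks like this (a sketch, not the actual s3md5 code; the 8 MiB chunk size and file name are arbitrary, and stat here is GNU stat):
CHUNK=$((8 * 1024 * 1024))
SIZE=$(stat -c%s sample.fastq.gz)
PARTS=$(( (SIZE + CHUNK - 1) / CHUNK ))
for i in $(seq 0 $((PARTS - 1))); do
    # read one chunk sequentially and hash it, without writing a temporary copy
    dd if=sample.fastq.gz bs=$CHUNK skip=$i count=1 2>/dev/null | md5sum
done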
The English is generally very good, though with a few oddities here and there (such as "Per each" => "For each").
Specific comments:
Seems odd to have an abstract in three sections
"private AWS account in London ..." is repeated.
"reverse directionality of the data transfer" ?? What does that mean - download from the bucket?
While I appreciate the beginner friendly details, the screenshot of the modified file (Fig 2) does not really add much value to the paper - you could even do the modification with shell commands:
# prepend a line to the file, then replace the original with the modified copy
(echo THIS FILE HAS BEEN LOCALLY MODIFIED; cat readme_102374.txt) >readme_102374.tmp && \
mv readme_102374.t{mp,xt}
You may get a better indication of performance by using the `time` tool
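e.g. (the script name here is just a placeholder, not the authors' actual script):
time ./s3_integrity_check.sh my-bucket sample.fastq.gz    # reports real/user/sys time for the run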
For shell scripts, I'd say testing across different versions of the same OS (Ubuntu 16.04 and 22.04) is not essential - it would be better to have (say) Ubuntu and CentOS, though again shell scripts are very portable in POSIX environments
So the file object does not store the chunk size that was used to calculate the ETag? That seems like a strange omission.
Being a bit technical: in your Dockerfile, you don't want to run too many independent apt-gets, as each one adds a layer to the image. It would be more efficient to consolidate some, e.g.
RUN apt-get install x
RUN apt-get install y
RUN apt-get install z
becomes
RUN apt-get install x y z
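and ideally the update and clean-up go into the same layer, e.g. something like
RUN apt-get update && apt-get install -y x y z && rm -rf /var/lib/apt/lists/*
so that the package lists are not baked into the image.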
|
Recommendation |
Accept |