Reviewer name and names of any other individuals who aided in the review |
Jens Jensen |
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published manuscript. (If no, please inform the editor that you cannot review this manuscript.) |
Yes |
Is the language of sufficient quality? |
Yes |
Please add additional comments on language quality to clarify if needed |
A few oddities here and there but nothing that hinders understanding |
Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? |
No |
Additional Comments |
See comments to authors |
Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code? |
Yes |
Additional Comments |
Code is a shell script on GitHub |
As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code? |
Yes |
Additional Comments |
Standard GitHub practice (pull requests) |
Is the code executable? |
Yes |
Additional Comments |
Shell scripts |
Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? |
Yes |
Additional Comments |
Code has support for detecting when the AWS environment is not correctly set up |
Is the documentation provided clear and user friendly? |
Yes |
Additional Comments |
The paper is very detailed... which I quite like, as often in computing papers people gloss over important details |
Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required? |
Yes |
Additional Comments |
|
Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? |
Yes |
Additional Comments |
|
Have any claims of performance been sufficiently tested and compared to other commonly-used packages? |
Yes |
Additional Comments |
Performance should be mostly dependent on the local checksumming tool |
Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)? |
No |
Additional Comments |
Not applicable |
Are there (ideally real world) examples demonstrating use of the software? |
Yes |
Additional Comments |
|
Is automated testing used or are there manual steps described so that the functionality of the software can be verified? |
No |
Additional Comments |
Not applicable |
Any Additional Overall Comments to the Author |
This paper describes a tool for verifying the integrity of data uploaded to Amazon S3, with specific applications to NGS data. The paper is very, very detailed, which in many ways is quite welcome, as it makes the paper suitable for beginners in the field as well, and authors often gloss over important details. The downside is that the paper is quite long for what it presents, and contains some repetition.
As is standard practice, integrity checking is done at the binary level, with checksums. This means that different encodings of the same file will mismatch (as a simple example, a CSV file may have LF or CRLF line endings), or the file may be compressed in storage, as it would be on tape, while you want to compare against the checksum of the uncompressed file. Using checksums alone is certainly sufficient for the simple application described here, where the purpose is only to verify that an upload has completed.
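For instance (an illustrative sketch; the file names are arbitrary), the same logical CSV content with different line endings gives different checksums:
printf 'a,b\n1,2\n' > unix.csv        # LF line endings
printf 'a,b\r\n1,2\r\n' > dos.csv     # CRLF line endings
md5sum unix.csv dos.csv               # two different checksums for the same logical content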
If a checksum is transmitted with the file, only a single file can be uploaded. The other problem with this approach is that the file has to be read twice: once to calculate its checksum, and once to upload the file.
Despite the much-publicised cryptographic weaknesses of MD5, it is certainly sufficient for non-cryptographic integrity checking -- meaning we protect against accidental modification of the file, not modification by a malicious adversary.
Coming from disciplines such as high energy physics, where data integrity is the primary security requirement, we checksum obsessively. The most common failure mode we see is truncated files: the transfer, for some reason, was interrupted and not resumed or retried (which is normally done automatically). For a single-stream transfer we would calculate the checksum as the data comes in over the transfer interface -- because it is cheap to do (i.e. it doesn't involve extra disk access) -- but would still need to eventually do a "deep" checksum (of the file on disk) to check that the file was not just transferred but also written correctly.
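As a sketch of that pattern (not the authors' tool; the URL and file names are placeholders):
# on-the-fly checksum: hash the stream while it is being written to disk, no extra read
curl -s https://example.org/data.bin | tee data.bin | md5sum | cut -d' ' -f1 > inflight.md5
# deep checksum: re-read the file from disk to confirm it was also written correctly
md5sum data.bin | cut -d' ' -f1 > ondisk.md5
cmp -s inflight.md5 ondisk.md5 && echo OK || echo MISMATCH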
Splitting the file into chunks could be a bit inefficient, though it is convenient for a shell script. A third-party tool ("s3md5") takes a chunk from the file (using dd and piping it to md5sum), one chunk at a time. This is fine, as no temporary copy is created and the file is read sequentially, though it requires an extra read of the file (one for checksums, one for the transfer), which is not too expensive if the local file is cached in RAM, as it would likely be if it is not too large.
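Roughly, the dd-based approach looks like this (a sketch, not the actual s3md5 code; the 8 MiB chunk size and file name are arbitrary, and stat here is GNU stat):
CHUNK=$((8 * 1024 * 1024))
SIZE=$(stat -c%s sample.fastq.gz)
PARTS=$(( (SIZE + CHUNK - 1) / CHUNK ))
for i in $(seq 0 $((PARTS - 1))); do
    # read one chunk sequentially and hash it, without writing a temporary copy
    dd if=sample.fastq.gz bs=$CHUNK skip=$i count=1 2>/dev/null | md5sum
done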
The English is generally very good, though with a few oddities here and there (such as "Per each" => "For each").
Specific comments:
Seems odd to have an abstract in three sections
"private AWS account in London ..." is repeated.
"reverse directionality of the data transfer" ?? What does that mean - download from the bucket?
While I appreciate the beginner friendly details, the screenshot of the modified file (Fig 2) does not really add much value to the paper - you could even do the modification with shell commands:
# prepend a line to the file, then replace the original with the modified copy
(echo THIS FILE HAS BEEN LOCALLY MODIFIED; cat readme_102374.txt) >readme_102374.tmp && \
mv readme_102374.t{mp,xt}
You may get a better indication of performance by using the `time` tool
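e.g. (the script name here is just a placeholder, not the authors' actual script):
time ./s3_integrity_check.sh my-bucket sample.fastq.gz    # reports real/user/sys time for the run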
For shell scripts, I'd say testing across different versions of the same OS (Ubuntu 16.04 and 22.04) is not essential - it would be better to have (say) Ubuntu and CentOS, though again shell scripts are very portable in POSIX environments
So the file object does not store the chunk size that was used to calculate the ETag? That seems like a strange omission.
Being a bit technical: in your Dockerfile, you don't want to run too many independent apt-gets, as each one adds a layer to the image. It would be more efficient to consolidate some, e.g.
RUN apt-get install x
RUN apt-get install y
RUN apt-get install z
becomes
RUN apt-get install x y z
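and ideally the update and clean-up go into the same layer, e.g. something like
RUN apt-get update && apt-get install -y x y z && rm -rf /var/lib/apt/lists/*
so that the package lists are not baked into the image.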
|
Recommendation |
Accept |