Automated management of AWS instances for training

Jorge Buenabad-Chavez; Evelyn Greeves; James P J Chong; Emma Rand

doi:10.46471/gigabyte.133

2024 Aug 29;2024:gigabyte133. doi: 10.46471/gigabyte.133


PMCID: PMC11382607 PMID: 39253692

Abstract

Amazon Web Services (AWS) instances provide a convenient way to run training on complex ‘omics data analysis workflows without requiring participants to install software packages or store large data volumes locally. However, efficiently managing dozens of instances is challenging for training providers.

We present a set of Bash scripts that make it quick and easy to manage Linux AWS instances pre-configured with all the software analysis tools and data needed for a course, and accessible using encrypted login keys and optional domain names. Creating over 30 instances takes 10–15 minutes.

A comprehensive online tutorial describes how to set up and use an AWS account and the scripts, and how to customise AWS instance templates with other software tools and data. We anticipate that others offering similar training may benefit from using the scripts regardless of the analyses being taught.

Statement of need

In recent years, sequencing technology has advanced to the point where DNA sequencing for ‘omics is faster, easier and cheaper than ever before. Consequently, ‘omics experiments increasingly produce large (up to terabyte size) datasets and their analysis requires researchers to access both specialist tools and robust high-performance computing (HPC) infrastructures. Metagenomics analyses are particularly resource intensive. This presents a steep learning curve for biologists who may not have any previous experience using HPC or command line tools.

There is therefore a clear demand for training in this area [1, 2]. However, training provision is complicated by the heterogeneity of individuals’ computer setups and the many dependencies demanded by software packages. Furthermore, access to HPC clusters varies depending on the institution and field of study.

Cloud computing services, such as Amazon Web Services (AWS), offer a novel way to provide training in genomics and metagenomics analysis workflows without the need for participants to manage complex software installations or store large datasets on their computers. Each participant can be provided with an identical AWS instance (virtual machine) pre-configured with the software and data needed for a course. The Cloud-SPAN project has been using this approach, based on the Data Carpentry [3] model [4], to deliver highly successful genomics and metagenomics courses for almost three years. This model is known as Infrastructure as a Service (IaaS) [5]. It is flexible, allowing training providers to configure virtual machines in terms of compute, storage and networking capacities, as well as the data and software analysis tools required by a course. Meanwhile, cloud providers are responsible for managing the actual compute, storage and networking hardware resources and virtualisation. IaaS has also been deployed for bioinformatics training on national HPC clusters [6, 7] using OpenStack (Open Source Cloud Computing Infrastructure) [8] for managing hardware resources and virtualisation.

Platform as a Service (PaaS) is another cloud computing service model that has been successfully used in bioinformatics training [9, 10]. A PaaS comprises a software environment for data management and analysis tasks using a programming language such as Python or R. Examples include Google Colab [11] and Posit [12], a cloud-based RStudio. These environments are readily accessible through a web browser and simplify sharing code and data.

The main advantages of Cloud computing for training based on the Data Carpentry model are low cost and flexibility. There is no need to manage or invest in hardware resources or physical space. Instead, an instance in the Cloud is first configured with all the data and software tools required by a course. This instance is then saved as a template, or Amazon Machine Image (AMI) in AWS terminology. Then, for each participant in the course, an instance is created from the AMI. Once the course is over, the instances are deleted in order to stop incurring costs. The AMI is typically preserved to serve as the starting point either (1) to create new instances for a new run of the course, or (2) to create a new AMI with updated data or software, or both, by creating an instance, updating the data or software, and creating a new AMI from the newly configured instance. In addition, it is easy to change the capacity of instances in terms of the number of processors, main memory size, and communication bandwidth to match the processing requirements of the analysis tasks to be taught.
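The lifecycle described above can be sketched with three AWS CLI commands. The following is a minimal dry-run illustration, not the code of the scripts: the function name `print_ami_lifecycle` and all resource IDs are hypothetical placeholders, and the commands are only printed, never executed. In practice the AMI id in step 2 is the one returned by step 1.

```shell
# Dry-run sketch of the AMI lifecycle: prints AWS CLI commands instead of running them.
# All resource IDs are placeholders supplied by the caller.
print_ami_lifecycle() {
    local configured_instance="$1" ami_name="$2" ami_id="$3" instance_type="$4" count="$5"
    # 1. Save the configured instance as a template (AMI).
    echo "aws ec2 create-image --instance-id $configured_instance --name $ami_name"
    # 2. Create the participants' instances from the AMI.
    echo "aws ec2 run-instances --image-id $ami_id --instance-type $instance_type --count $count"
    # 3. When the course ends, terminate those instances (IDs are returned by step 2).
    echo "aws ec2 terminate-instances --instance-ids <instance-ids-from-step-2>"
}

print_ami_lifecycle i-0123456789abcdef0 genomics-template ami-0123456789abcdef0 t3.small 30
```

Launching all instances with a single `--count` is a simplification made for this sketch; creating instances one at a time allows each to be paired with its own login key and IP address.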

However, managing multiple instances through a graphical user interface, such as the AWS Console, is cumbersome and error-prone. As the number of participants increases, the problem is magnified. The nature of running workshops means participants may drop out, join the course late or not turn up, resulting in further manual management being required.

To address this problem, we developed a set of Bash scripts to automate the management of AWS instances for use in training workshops. The scripts automate the creation and deletion of AWS instances and related resources, namely: encrypted login keys, Internet Protocol (IP) addresses, and domain names. We have also developed an accompanying online tutorial [13] detailing how to open and configure an AWS account, and how to install, configure and run the scripts in a terminal on Linux, Windows, and macOS, or in the AWS CloudShell (browser-based) terminal. The tutorial assumes that learners have no prior experience with the AWS concepts and tools it covers. However, learners are expected to have some experience with both the Linux/Unix terminal and Bash shell programming. Windows users need to install and configure the Git Bash terminal, and Mac users need to install or update the Bash shell, as instructed in the Precourse Instructions section of the tutorial [14].

We use the scripts to manage Ubuntu Linux AWS instances configured for training in genomics and metagenomics. However, the scripts are broadly applicable to manage instances configured for any training purpose. The tutorial demonstrates how to customise AMI templates with other software tools and data [15].

The scripts and how to use them

The scripts are listed below. There are three types of scripts. The primary scripts, “csinstances_*.sh”, are the top-level scripts run by the person in charge of managing instances for workshops. The secondary scripts, “aws_*.sh”, are invoked by the scripts “csinstances_create.sh” or “csinstances_delete.sh” to either create or delete instances and related resources: login keys, IP addresses, and domain names (if managed). The third type comprises scripts that provide utility functions to the primary and secondary scripts. The only script in this category is “colours_utils_functions.sh”, which provides text colouring functions and utility functions that validate the invocation and results of the primary and secondary scripts.

The secondary scripts can each be run directly in the same way the primary scripts are run (as described shortly), but this is not recommended except for the purpose of improving a script or troubleshooting a failed step in creating instances and related resources. The section Troubleshooting [16] of the tutorial describes the conditions under which we have had to run some secondary scripts directly.

`aws_domainNames_create.sh`	`aws_instances_terminate.sh`	`csinstances_create.sh`
`aws_domainNames_delete.sh`	`aws_loginKeyPair_create.sh`	`csinstances_delete.sh`
`aws_instances_configure.sh`	`aws_loginKeyPair_delete.sh`	`csinstances_start.sh`
`aws_instances_launch.sh`	`colours_utils_functions.sh`	`csinstances_stop.sh`

`KEYWORD`	`VALUE examples (Cloud-SPAN’s for Genomics course using instance domain names)`
		`## NB: "key value" pairs can be specified in any order`
`imageId`	`ami-07172f26233528178`	`## NOT optional: instance template (AMI) id`
`instanceType`	`t3.small`	`## NOT optional: processor count, memory size, bandwidth`
`securityGroupId`	`sg-0771b67fde13b3899`	`## NOT optional: should allow ssh (port 22) communication`
`subnetId`	`subnet-00ff8cd3b7407dc83`	`## optional: search vpc in AWS console then click subnets`
`hostZone`	`cloud-span.aws.york.ac.uk`	`## optional: specify to use instance domain names`
`hostZoneId`	`Z012538133YPRCJ0WP3UZ`	`## optional: specify to use instance domain names`
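To illustrate how keyword-value pairs given in any order can be read and validated, the following is a minimal sketch and not the parsing code used by the scripts. The function names `get_resource` and `check_required_resources` are hypothetical; the three required keywords are those marked NOT optional above.

```shell
# Illustrative parser for a resourcesIDs.txt-style file; not the scripts' own code.
# Each line is "KEYWORD VALUE", optionally followed by a "## comment".

# Print the value stored for a keyword; fail if the keyword is absent.
get_resource() {
    local file="$1" key="$2"
    awk -v k="$key" '$1 == k { print $2; found = 1; exit } END { exit !found }' "$file"
}

# Check that the three keywords marked "NOT optional" are present.
check_required_resources() {
    local file="$1" key missing=0
    for key in imageId instanceType securityGroupId; do
        if ! get_resource "$file" "$key" > /dev/null; then
            echo "missing required keyword: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Because each lookup matches on the keyword in the first field, the order of lines in the file does not matter, and optional keywords such as `hostZone` can simply be absent.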

`group`	`BIOL`
`project`	`cloud-span`
`status`	`prod`
`pushed_by`	`manual`
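Tags such as those above can be attached to instances at launch through the `--tag-specifications` option of `aws ec2 run-instances`. The sketch below builds that option's value from a tags file with one "key value" pair per line. Whether the scripts pass tags in exactly this way is an assumption, and the function name `tag_specification` is hypothetical.

```shell
# Build an AWS CLI --tag-specifications value from a tags.txt-style file.
# Illustrative only; assumes one "key value" pair per line.
tag_specification() {
    local file="$1" key value tags=""
    # The "|| [ -n "$key" ]" part keeps a final line that lacks a trailing newline.
    while read -r key value || [ -n "$key" ]; do
        [ -n "$key" ] || continue            # skip blank lines
        tags="${tags:+$tags,}{Key=$key,Value=$value}"
    done < "$file"
    printf 'ResourceType=instance,Tags=[%s]\n' "$tags"
}
```

The printed string could then be supplied as `--tag-specifications "$(tag_specification courses/genomics01/inputs/tags.txt)"`, where the path is an example following the directory layout used for workshop inputs.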

`courses`				`### you can omit this directory or use other name`
	`genomics01`			`### workshop/course WE name; you can use other name`
		`inputs`		`### you CANNOT use other name`
			`instancesNames.txt`	`### you can use other name`
			`resourcesIDs.txt`	`### you CANNOT use other name`
			`tags.txt`	`### OPTIONAL - you CANNOT use other name`
		`outputs`		`### created automatically by the scripts - don’t modify`
	`genomics02`			`### another WE: inputs and outputs directories inside`
	`metagenomics01`			`### another WE: inputs and outputs directories inside`
	`...`
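The directory layout above can be prepared with a few commands. The following is a minimal sketch under stated assumptions: the function name `new_workshop_inputs` is hypothetical, and it assumes that `instancesNames.txt` holds one instance name per line. The `outputs` directory is deliberately not created, since the scripts create it.

```shell
# Create the inputs directory of a workshop and a starter instancesNames.txt.
# Illustrative helper; not part of the scripts.
new_workshop_inputs() {
    local course_dir="$1" count="$2" i=1
    mkdir -p "$course_dir/inputs" || return 1
    # Assumed format: one instance name per line (instance01, instance02, ...).
    while [ "$i" -le "$count" ]; do
        printf 'instance%02d\n' "$i"
        i=$((i + 1))
    done > "$course_dir/inputs/instancesNames.txt"
    # resourcesIDs.txt must keep this exact name; fill in real resource IDs before use.
    touch "$course_dir/inputs/resourcesIDs.txt"
}
```

For example, `new_workshop_inputs courses/genomics01 30` would prepare the inputs for a 30-participant workshop, leaving `resourcesIDs.txt` to be completed by hand.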

`$ csinstances_stop.sh`	`courses/instances-management/inputs/instancesNames.txt`
`$ csinstances_start.sh`	`courses/instances-management/inputs/instancesNames.txt`
`$ csinstances_delete.sh`	`courses/instances-management/inputs/instancesNames.txt`

`check_theScripts_csconfiguration`		`"$1" || { message "$error_msg"; exit 1; }`
`aws_loginKeyPair_create.sh`		`"$1" || { message "$error_msg"; exit 1; }`
`aws_instances_launch.sh`		`"$1" || { message "$error_msg"; exit 1; }`
`if [ -f "${1%/*}/.csconfig_DOMAIN_NAMES.txt" ]; then ### %/* gets the inputs directory path`
	`aws_domainNames_create.sh`	`"$1" || { message "$error_msg"; exit 1; }`
`fi`
`aws_instances_configure.sh`		`"$1" || { message "$error_msg"; exit 1; }`
`exit 0`
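The listing above relies on two Bash idioms: `command || { ...; exit 1; }` stops at the first failing step, and a `%/*` parameter expansion removes the final path component, turning the path of the instances names file into the path of its inputs directory. The sketch below isolates both idioms. The functions `run_steps` and `inputs_dir_of` are hypothetical, and `message` is a stand-in for the scripts' own function of that name.

```shell
# Stand-in for the scripts' message function: print to standard error.
message() { printf '%s\n' "$*" >&2; }

# Run each step in order and stop at the first one that fails.
run_steps() {
    local step
    for step in "$@"; do
        "$step" || { message "step failed: $step"; return 1; }
    done
}

# Strip the last path component: .../inputs/instancesNames.txt -> .../inputs
inputs_dir_of() {
    printf '%s\n' "${1%/*}"
}
```

Stopping at the first failure matters here because each step depends on the previous one: instances cannot be launched without login keys, and domain names cannot be mapped to instances that were never created.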

`aws ec2 create-key-pair --key-name $loginkey --key-type rsa ...`	`### invoked by aws_loginKeyPair_create.sh`
`aws ec2 run-instances --image-id $resource_image_id ...`	`### invoked by aws_instances_launch.sh`
`aws ec2 delete-key-pair --key-name $loginkey ...`	`### invoked by aws_loginKeyPair_delete.sh`
`aws ec2 terminate-instances --instance-ids $instanceID ...`	`### invoked by aws_instances_terminate.sh`

`csuser@csadmin-instance:~`
`$ ls courses/instances-management/outputs/instances-creation-output/`
`instance01-ip-address.txt`	`instance02-ip-address.txt`	`instance03-ip-address.txt`
`instance01.txt`	`instance02.txt`	`instance03.txt`

`csuser@csadmin-instance:~`
`$ lginstance.sh courses/instances-management/outputs/login-keys/login-key-instance01.pem csuser`
	`logging you thus:`	`### this and the next line are displayed by lginstance.sh`
	`ssh -i courses/instances-management/outputs/login-keys/login-key-instance01.pem csuser@3.253.59.74`
	`...`	`### instance welcome message`
`csuser@instance01:~`		`### instance prompt`
`$`
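The login session above suggests how an `ssh` command can be derived from a login key file alone. The following is a hypothetical reconstruction, not the code of `lginstance.sh`: it assumes that the instance name is embedded in the key file name, and that the matching `-ip-address.txt` file under `instances-creation-output` contains the instance's IPv4 address. The function name `build_ssh_command` is invented for this sketch.

```shell
# Print the ssh command for an instance, given its login key file and a user name.
# Hypothetical reconstruction; the location and content of the IP address file are assumptions.
build_ssh_command() {
    local key_file="$1" user="$2"
    local instance outputs_dir ip_file ip
    instance="${key_file##*/}"                 # login-key-instance01.pem
    instance="${instance#login-key-}"          # instance01.pem
    instance="${instance%.pem}"                # instance01
    outputs_dir="${key_file%/login-keys/*}"    # .../outputs
    ip_file="$outputs_dir/instances-creation-output/${instance}-ip-address.txt"
    # Take the first IPv4-looking string in the file, whatever else the file contains.
    ip="$(grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' "$ip_file" 2>/dev/null | head -n 1)"
    [ -n "$ip" ] || { echo "no IP address found in $ip_file" >&2; return 1; }
    printf 'ssh -i %s %s@%s\n' "$key_file" "$user" "$ip"
}
```

Printing the command rather than running it mirrors the session above, where the `ssh` invocation is displayed before the login takes place.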

Reviewer name and names of any other individuals who aided in review	Sindiswa Lukhele
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published manuscript. (If no, please inform the editor that you cannot review this manuscript.)	Yes
Is the language of sufficient quality?	Yes
Please add additional comments on language quality to clarify if needed
Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?	Yes
Additional Comments	The statement of need is clear. Just a few questions: Does the script accommodate users new to AWS and without any form of training? It needs to be indicated in the paper.
Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code?	Yes
Additional Comments
As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?	Yes
Additional Comments
Is the code executable?	Unable to test
Additional Comments
Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?	Unable to test
Additional Comments
Is the documentation provided clear and user friendly?	Yes
Additional Comments
Additional Comments
Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?	Yes
Additional Comments
Have any claims of performance been sufficiently tested and compared to other commonly-used packages?	Not applicable
Additional Comments
Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)?	No
Additional Comments	There was no biological data included in the paper. Probably the script needs to be tested using biological data.
Are there (ideally real world) examples demonstrating use of the software?	Yes
Additional Comments	Please add examples of the script using biological data.
Additional Comments
Any Additional Overall Comments to the Author	Overall, the paper is well written. A few things need to be considered, including using biological data as examples of how to run the script. Demonstrating using biological data will assist the user in following through with the examples, especially if there is no available training.
Recommendation	Accept

Reviewer name and names of any other individuals who aided in review	Geert van Geest
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published manuscript. (If no, please inform the editor that you cannot review this manuscript.)	Yes
Is the language of sufficient quality?	Yes
Please add additional comments on language quality to clarify if needed
Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?	Yes
Additional Comments	The authors could stress the strengths of using cloud services for teaching a bit more, e.g.: low costs, self-managed, flexibility.
Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code?	Yes
Additional Comments	Yes. However, in the manuscript text it is mentioned a CC-BY 4 license is used (which would not be very appropriate for software), while in the github repository there is an MIT license (https://github.com/Cloud-SPAN/aws-instances). I would suggest the authors to use the MIT license for the code and a CC-BY for the tutorial.
As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?	No
Additional Comments	The repository would benefit from instructions on how to contribute, e.g. in a CONTRIBUTING.md file.
Is the code executable?	Unable to test
Additional Comments	The software requires a domain (as far as I understood). At time of review I wasn't in the capacity to register one. It would help if the authors would provide a quick 'getting started' that can all be performed with an AWS free tier.
Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?	Unable to test
Additional Comments	See above. However, the paper gives a broad overview, and there is a detailed tutorial on how to perform all the steps.
Is the documentation provided clear and user friendly?	No
Additional Comments	The tutorial is very detailed. However: - There is no link in the repository to the tutorial - The script works with configuration files as input. I found it hard to find out which options in e.g. resourcesIDs.txt were required. - The documentation page (now README.md?) could use some structure and detail
Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?	No
Additional Comments	Almost everything is there, however things are partly found in the tutorial.
Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?	No
Additional Comments	I think aws-cli is the only dependency, and that is probably stated in the tutorial, but there's e.g. no 'installation' header in README.md
Have any claims of performance been sufficiently tested and compared to other commonly-used packages?	Not applicable
Additional Comments
Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)?	No
Additional Comments	But not really applicable. There is an example for inputs in the repository.
Are there (ideally real world) examples demonstrating use of the software?	Yes
Additional Comments
Is automated testing used or are there manual steps described so that the functionality of the software can be verified?	No
Additional Comments	Some basic tests without having to interact with a personal account would be possible.
Any Additional Overall Comments to the Author	- As far as I could tell most (if not all) steps could be done with infrastructure as code (e.g. Terraform/Ansible). This is a general format that is used by many people. Can the authors state what the advantages of using only bash are over iac? - The configuration files in the input directory are plain text files. Consider to use one file with markup language like json or yaml. - A schematic overview of the resulting infrastructure including instances, network, keys/users/ and disks would be helpful for the reader - The 'Statement of need' hardly contains references to peer-reviewed literature. Although I don't think this should be a hard requirement, I do think it would make the manuscript stronger. Use e.g. existing literature on (bioinformatics) education, e.g. https://scholar.google.com/scholar?hl=nl&as_sdt=0%2C5&q=bioinformatics+teaching&btnG=&oq=bioinformatics+teaching - Make sure the user finds all documentation/tutorials. Cross reference between the repository and the tutorial. - Suggestion: allow for mounting a shared disk. This enables learners to share files in e.g. group work. - Suggestion: make as many options as possible optional, e.g. the domain, so all steps can be done with an AWS free tier.
Recommendation	Minor Revisions

Reviewer name and names of any other individuals who aided in review	Toby Hodges
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published manuscript. (If no, please inform the editor that you cannot review this manuscript.)	Yes
Is the language of sufficient quality?	Yes
Please add additional comments on language quality to clarify if needed
Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?	Yes
Additional Comments
Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code?	Yes
Additional Comments	The repository containing the scripts includes an MIT license and a CITATION.cff file, which is very good practice. However, the manuscript (and Zenodo record) currently states that "The scripts are freely available to download and use under a Creative Commons BY 4.0 attribution license" -- this sentence and the Zenodo record should be corrected to reflect the MIT license of the software.
As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?	No
Additional Comments	Scripts include contact information for the corresponding author, and the CITATION.cff includes contact details for all authors. However, the repository contains no contributing guide, or guidance in the README.md, that could help would-be contributors understand how to get involved or contribute most effectively. The project README and the tutorial mentioned in the manuscript focus on usage of the scripts.
Is the code executable?	Unable to test
Additional Comments	The scripts run, but I was unable to test them fully because of the rigidity of the required cloud environment configuration. Due to internal constraints the AWS environment I am working with could not be adjusted to fit exactly with the specifications of the authors' system in time for this review to be filed. Specifically, we could not configure a subdomain for the cloud instances created, and the way we handle security groups is also different. Although neither of these differences would prevent cloud instances from being created, the csinstances_create script cannot run without them. I note that the script is written with no default values set, and with the assumptions that 1. all of the required parameters will be included in the resourcesIDs.txt file and 2. that these parameters will appear in a fixed order. I believe it would be reasonable to allow users to run the scripts without having first created a hosted zone for the instances that will be created. For example, by adjusting the script to use default values where possible if parameters have not been set in the resourcesIDs.txt file. It would also be helpful to allow users to specify parameters in an arbitrary order within the resourcesIDs.txt file. Furthermore, I recommend that the authors explore the use of Launch Templates (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-launch-templates.html) to specify defaults for launched instances, which may prove simpler and more robust than the current approach of reading parameters from a text file and substituting those into a call to `ec2 run-instances`.
Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?	Yes
Additional Comments	The paper and online tutorial provide clear and thorough guidance on how to use the scripts, including details that anticipated many of the questions that I had about using the software. I encourage the authors to link directly from the source repository containing the scripts to the tutorial, to make it easier for would-be users to find the information they need on how to use and adapt the scripts.
Is the documentation provided clear and user friendly?	Yes
Additional Comments	The accompanying tutorial is clear and well-structured. As mentioned above, I recommend that more links are created from the scripts source repository and that tutorial site, to help potential users find the relevant documentation to follow.
Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?	Yes
Additional Comments
Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?	Yes
Additional Comments	https://cloud-span.github.io/cloud-admin-guide-2-managing-aws-instances/setup.html contains information about the Bash version required and good instructions about how to install/update it on different operating systems. The AWS CLI tool, also required, is not discussed on that page but installation is the topic of a section in the main body of the tutorial.
Have any claims of performance been sufficiently tested and compared to other commonly-used packages?	Not applicable
Additional Comments
Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)?	Yes
Additional Comments	The source repository includes example config files that largely fulfil the purpose of example data. The only potential difficulty with these files is that they will only work with the authors' local AWS setup. I cannot think of a way that this could be avoided, however: execution of the scripts inevitably requires the accompanying AWS account setup and config.
Are there (ideally real world) examples demonstrating use of the software?	Yes
Additional Comments	The use of the software is fully described within the accompanying tutorial.
Is automated testing used or are there manual steps described so that the functionality of the software can be verified?	No
Additional Comments	I suspect it would be difficult to create meaningful automated tests for these scripts, as they rely on interacting with the Amazon Web Services API to run.
Any Additional Overall Comments to the Author	I was delighted to receive this paper for review: the authors are describing automation of a process we have been handling manually for several years. The documentation accompanying the scripts is excellent: it is detailed, easy to follow, and comprehensive. I strongly recommend creating clearer links to that tutorial from the software repository on GitHub. Unfortunately, I was unable to test the complete workflow of the scripts as I could not access an AWS environment configured to the exact specifications described by the Cloud-SPAN team. If the authors are willing to adjust the scripts to be more permissive of alternative configurations (e.g. dropping the hard requirement for a subdomain where the instances could be hosted), I would be more than happy to review the new version. Thank you very much for writing the scripts, the documentation, and the paper -- and even more thanks for doing it all in the open, maximising the impact your work can have on the wider community.
Recommendation	Minor Revisions
