Science Progress
. 2021 Apr 15;104(2):00368504211010570. doi: 10.1177/00368504211010570

The roles of code in biology

Brendan Lawlor, 1 Roy D Sleator 2
PMCID: PMC10454959  PMID: 33856939

Abstract

The way in which computer code is perceived and used in biological research has been a source of some controversy and confusion, and has resulted in sub-optimal outcomes related to reproducibility, scalability and productivity. We suggest that the confusion is due in part to a misunderstanding of the function of code when applied to the life sciences. Code has many roles, and in this paper we present a three-dimensional taxonomy to classify those roles and map them specifically to the life sciences. We identify a “sweet spot” in the taxonomy—a convergence where bioinformaticians should concentrate their efforts in order to derive the most value from the time they spend using code. We suggest the use of the “inverse Conway maneuver” to shape a research team so as to allow dedicated software engineers to interface with researchers working in this “sweet spot.” We conclude that, in order to address current issues in the use of software in life science research, the field must reevaluate its relationship with software engineering and adapt its research structures to overcome persistent problems of reproducibility, scalability and productivity.

Keywords: Bioinformatics, software engineering, reproducibility, scalability

Introduction

The way in which code is perceived and used in the life sciences has been a source of some controversy, with opposing camps within the field of biology publishing many papers on the matter.1,2 Some hold that software (and in some cases bioinformatics itself) is merely a technical service, 3 while others posit that software sits at the heart of the discipline. 4 One result of this confusion is a difficulty in integrating software engineering skills into bioinformatics, which in turn has negative impacts on research outcomes such as reproducibility, 5 scalability, 6 and productivity.7,8

One common point of confusion is the implication that the role of software must fit into a fixed hierarchy of value or importance within the life sciences. However, we suggest that this is based on a misunderstanding of the role played by code in any field, not just biology. Within the software engineering discipline, code has many purposes. Here, we present three different dimensions (abstraction, subdomain, communication) along which these purposes vary, which can be mapped to the biological field. The taxonomy leans heavily on software engineering research, but presents it to a life science audience in a way that provides insights, dispels some of the controversy, and indicates optimal ways to develop software in biological and bioinformatic research contexts. We describe how the application of this taxonomy to the structure and operation of an integrated organization of bioinformaticians and software engineers led to positive bioinformatic research outcomes.

With this more refined consideration of the various roles of software in biology, we then discuss some implications for the relationship of software engineering as a discipline to the life sciences. The purpose of examining this relationship is to address its role in solving current issues in reproducibility, scalability and productivity in bioinformatic software.

Abstraction, subdomain, communication

We are not aware of any literature on how to classify the different roles of code in software engineering, and so we offer our own taxonomy here based on direct experience of that discipline, and with influence from multiple software engineering sources. The aim of this approach is to describe current (often informal) software practices to a non-software engineering audience, in a more formalized way that provides useful insights.

Firstly, rather than a hierarchy of value, where elements at the top are of greater worth than those underneath, in software there exists a stack, a standing-upon-shoulders, where higher levels of abstraction derive their power from the expert implementation of the levels below them. Code itself can function on many levels, from the assembler of device drivers, to cloud-native infrastructure-as-code, library code, domain-driven designed application code, and custom-built domain-specific languages. None is of any greater intrinsic worth than any other. To quote Liskov and Guttag, 9 abstraction serves to hide “‘irrelevant’ details, describing only those details that are relevant to the problem at hand.” While abstraction is a continuum, it can be useful to break out some discrete values. In the context of scalable cloud-native systems, we propose the following values for the abstraction dimension of our taxonomy:

  • System: Uses system languages like assembler, C, C++, and Rust to specify generic software abstractions like bytes, strings, and arrays, or concepts close to the specific hardware like Graphical Processing Units (GPUs) or Single Instruction Multiple Data (SIMD) instructions.

  • Infrastructure: Specifies hardware and middleware configurations, particularly on cloud infrastructure, using languages like Terraform or vendor-specific representations like CloudFormation.

  • Application: Creates and uses abstractions that map to real-world concepts from the field to which the software is being applied, using languages like Java, C#, Python, and R. Contains the primary so-called business logic—the knowledge of the problem domain and its solutions.

  • Orchestration: Specifications of how to deploy and coordinate multiple application-level programs on infrastructure, in particular cloud, using markup languages like YAML and a variety of platforms like Docker and Kubernetes.
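As a small illustration of the application level of this dimension, the following Python sketch wraps a byte-level representation (a system-level detail) in a concept drawn from the biological domain. All names here are invented for illustration and do not come from any library discussed in this paper.

```python
class DnaSequence:
    """Application-level abstraction: a biological concept, not a byte array."""
    VALID_BASES = frozenset("ACGT")

    def __init__(self, bases: str):
        bases = bases.upper()
        if not set(bases) <= self.VALID_BASES:
            raise ValueError(f"invalid bases in sequence: {bases!r}")
        # Underneath, the sequence is still just bytes -- the system-level
        # detail that the abstraction hides from its users.
        self._bases = bases.encode("ascii")

    def __len__(self) -> int:
        return len(self._bases)

    def complement(self) -> "DnaSequence":
        """Watson-Crick complement, expressed in domain terms."""
        table = bytes.maketrans(b"ACGT", b"TGCA")
        return DnaSequence(self._bases.translate(table).decode("ascii"))

    def __str__(self) -> str:
        return self._bases.decode("ascii")
```

Callers of `DnaSequence` reason in terms of sequences and complements; only the class itself deals in encodings, exactly the hiding of “irrelevant” details that Liskov and Guttag describe.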

In operational software systems, the level of abstraction of a layer of code can be usefully thought of as the order in which it is applied when building that system.

Secondly, when crafting a software system, there are distinct problem areas that one must address. The widely-followed software engineering technique Domain Driven Design 10 calls these areas subdomains and categorizes them into Core, Supporting and Generic types.

  • Core subdomains address the central mission and distinguishing feature of the system.

  • Supporting subdomains perform mission-specific work that is needed by the core.

  • Generic subdomains provide utilities that are needed by the system but are not specific to it.
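The three subdomain types can be made concrete with a hypothetical module layout for a sequence-analysis system; the module names below are invented for illustration, not taken from any real codebase.

```python
# Hypothetical mapping of modules in a sequence-analysis system to
# Domain Driven Design subdomain types. All module names are invented.
SUBDOMAINS = {
    "core": [
        "alignment",         # the distinguishing scientific capability
        "variant_calling",
    ],
    "supporting": [
        "result_reporting",  # mission-specific, but not the differentiator
        "sample_metadata",
    ],
    "generic": [
        "authentication",    # needed by the system, but not specific to it
        "job_scheduling",
    ],
}

def subdomain_of(module: str) -> str:
    """Return the subdomain type to which a module belongs."""
    for kind, modules in SUBDOMAINS.items():
        if module in modules:
            return kind
    raise KeyError(module)
```

The classification is a design judgment, not a property of the code itself: `job_scheduling` would be core, not generic, in a company whose product is a scheduler.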

Note that code at any layer of abstraction can belong to core, supporting or generic contexts. For example, code that uses GPUs to achieve high degrees of parallelization in modeling intra-cellular processes would be considered to have a system level of abstraction while being a core subdomain. However, some Kotlin code created to report experimental results in an interactive table on a browser would be at an application level of abstraction and in a supporting subdomain.

Finally, software engineers use code not only as a means of directing a computer to accomplish a given task, but also as a way to communicate a problem, and its solution, in an unambiguous way. To quote Martin Fowler, “Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” 11 Good code, he suggests, should be intelligible to other humans, not just computers, so that they can see how a problem is framed and how its solution is fashioned, and can more easily fix or improve that solution, or adapt it to other related problems. Code can be, in other words, a way of communicating complex concepts between humans, in a rigorous and reusable fashion. Whether it is used in that role depends on the way it is written, the language used, and the costs involved in writing code this way.

  • Machine: Code which only the computer can read.

  • Human: Code which explains its intent to the human reader as well as the target computer.

Note that while system level code will tend by its nature to communicate principally to the machine because it typically deals in concepts closer to the hardware, a separate dimension of communication still makes sense because code at higher levels of abstraction is free to be written in a way that communicates either with machine or human.
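The distinction can be seen in two versions of the same small computation, a sketch in the spirit of Fowler's remark. Both functions below are correct and equivalent; only one communicates its intent to a human reader. The function names are ours, for illustration.

```python
from collections import Counter

# "Machine" style: correct, but communicates nothing of its intent.
def f(s):
    return sum(1 for c in s if c in "GCgc") / len(s)

# "Human" style: the same computation, written to communicate.
def gc_content(sequence: str) -> float:
    """Fraction of bases in a DNA sequence that are guanine or cytosine."""
    if not sequence:
        raise ValueError("cannot compute GC content of an empty sequence")
    counts = Counter(sequence.upper())
    gc_bases = counts["G"] + counts["C"]
    return gc_bases / len(sequence)
```

A reader of `gc_content` learns what problem is being solved without consulting external documentation; a reader of `f` learns only what the computer will do.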

Table 1 summarizes the proposed taxonomy of the roles of code.

Table 1.

Summary of role taxonomy dimensions and their values.

Role type      Values
Abstraction    System, infrastructure, application, orchestration
Subdomain      Core, supporting, generic
Communication  Machine, human

Note that whether it is a good idea to implement core subdomains using system-level abstractions that may not communicate well to a human reader is an interesting question, but not strictly within the scope of this paper. However, framing the question in terms of this taxonomy can be a useful way to approach it.

Mapping code to biology

We can map this deeper understanding of the nature of software development onto the domain of biology in such a way as to cast light on the different roles of code in the life sciences. Perhaps in doing so we can remove some of the heat remaining from the past controversies referred to above, suggest specific directions for life science researchers with respect to how they use code, and indicate ways in which to move in those directions.

Some biological problems are computational in their very nature, or are more amenable to computational representation. In fact the term bioinformatics was first used in order to highlight such “informatic processes in biotic systems.” 12 Examples might include the mathematical analysis of physiological processes (such as the seminal paper on neuron spiking by Hodgkin and Huxley 13 which actually predates the term bioinformatics), or more recently the practice of whole-cell modeling. 14 Code that models such problems would be considered as belonging to the core subdomain type, whatever the level of abstraction used to implement it, or to whatever extent the code communicated to a human audience.

In the years since the term bioinformatics was coined, it has come to mean the use of software in the processing and management of biological data, often at high scale. In this context, bioinformatics code is likely to be found in all subdomain types: core, supporting and generic, and across all levels of abstraction from system to orchestration, and with the purpose of communicating both to human and machine.

The roles of code in biology, then, consist of modeling biological elements that are intrinsically computational, and of engineering scalable and reproducible solutions to process vast and growing amounts of biological data.

But code which models computational processes in biology is no more important than code which distributes biological data, and its processing pipelines, across clusters of servers and processors. Each includes elements that are core to the mission, support it, or are generic. Each can include code that operates from system level to orchestration level. This is not a hierarchy of worth, but the same stacked layering of abstraction that applies in any application of software. Past debates over the importance of code in the life sciences are beside the point, and wasteful. Code is neither central nor peripheral. It is both.

Moreover, a realization that there are many roles of code in biology can steer the conversation in a more useful direction: how to fulfill these roles. As Figure 1 indicates, when it comes to the life sciences, there is a “sweet spot” on which the three dimensions of role converge. This is the point where biologists and bioinformaticians can derive the most value from code. This is where code addresses core questions of the domain, at a real-world level of abstraction, and communicates its intent to human readers. The more life scientists can move their coding toward this sweet spot, the more their code will serve them, and the less they will have to serve their code.

Figure 1.

Sweet spot: convergence of code roles for life scientists and bioinformaticians.

It’s useful to think of the sweet spot as containing the greatest concentration of essential complexity, from the biological perspective, while the others contain mostly accidental complexity, to use the terminology from Brooks and Kugler. 15

The other roles can be left to software engineers, by using outsourced products and services, and by integrating software engineers into the research team. To enable this kind of integration, some attention must be given to team structure. According to Conway’s Law, “there is a very close relationship between the structure of a system and the structure of the organization which designed it.” 16 The structure of a research organization can either be an asset or a liability when it comes to the way it designs software systems, by either enabling or preventing designs that permit scientists to work in their “sweet spot,” while engineers work in theirs.

Using what has become known as the “inverse Conway maneuver,” 17 organizations can restructure to promote more suitable design outcomes. In order to move life science researchers (i) toward core subdomains and away from supporting subdomains, and (ii) toward application abstractions and away from system, infrastructure or orchestration abstractions, such restructuring would involve filling the void left behind with software engineers. Code at the boundaries between these competencies would then need to fulfill the human communication role. We have already suggested ways in which this can be done. 7 Interactions at team interfaces will be made smoother by scientists following existing advice on good programming practices, 18 and by software engineers adapting to the needs of scientific computing, and providing scientifically useful abstractions over computing systems. 8

Applications

The foregoing observations on the roles of code in bioinformatics are based on lessons learned during the application of software engineering tools and techniques as part of a period of research in a bioinformatics startup (https://www.nsilico.com/).

The principal activity during this time was the further development of a software platform called Simplicity. 19 This platform aimed to democratize access to, and execution of, bioinformatic pipelines, by abstracting away many technical aspects of those pipelines, and managing all aspects of scalability and reproducibility. Research was divided into two main goals: re-architecting the platform to make it more scalable through deployment to the cloud on commodity hardware, and collaborating with bioinformaticians to add new pipelines. The output of the research, apart from the changes to Simplicity, included a number of published papers based on the pipelines.20–22

Examined through the prism of our taxonomy, the work fell into the following categories:

  • Infrastructure, Supporting, Human: Creating Docker images to support the code dependencies of pipelines, using descriptive Dockerfiles as the code (https://www.docker.com).

  • Orchestration, Core, Human: Coding deployment descriptors to define the deployment environment of pipelines running in Docker images. This was done at the time using Docker Swarm and Docker Compose, but today would be done using Kubernetes (https://kubernetes.io).

  • Infrastructure, Generic, Human: Creating a cloud-native job queuing system to permit pipeline containers to pick up jobs and work on them in a scaled environment, using Java code and following software best practices for modularity and readability.

  • Application, Supporting, Human: Creating a standardized pipeline Application Programming Interface (API) to read queued jobs, invoke bioinformatic pipelines based on the job content, and extract the pipeline results, using Java code and following software best practices for modularity and readability.

  • Application, Core, Human: Developing bioinformatic pipelines using Python and R, conforming to the standardized API mentioned above, and designed to run in the dedicated Docker containers mentioned above.
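The standardized pipeline API in the categories above can be sketched as follows. This is a minimal illustration in Python of the general pattern (a common contract for reading a job, invoking the named pipeline, and returning results); all names (`PipelineJob`, `run_job`, the registry, the `echo` pipeline) are hypothetical, as Simplicity's actual interface is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class PipelineJob:
    """A queued unit of work: which pipeline to run, and its inputs."""
    pipeline: str
    inputs: Dict[str, str] = field(default_factory=dict)

# Registry mapping pipeline names to implementations that all share
# the same invocation contract: Dict[str, str] in, Dict[str, str] out.
PIPELINES: Dict[str, Callable[[Dict[str, str]], Dict[str, str]]] = {}

def register(name: str):
    """Decorator: register a pipeline under the common contract."""
    def wrap(fn):
        PIPELINES[name] = fn
        return fn
    return wrap

@register("echo")
def echo_pipeline(inputs: Dict[str, str]) -> Dict[str, str]:
    # Stand-in for a real bioinformatic pipeline body.
    return {"echoed": inputs.get("message", "")}

def run_job(job: PipelineJob) -> Dict[str, str]:
    """Read a queued job, invoke the named pipeline, return its results."""
    return PIPELINES[job.pipeline](job.inputs)
```

Because every pipeline conforms to the same contract, the surrounding queuing and orchestration machinery never needs to know anything about the biology inside a pipeline, which is precisely what lets bioinformaticians work only in the final category.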

The bioinformaticians on the team worked exclusively in the final category, corresponding to our identified sweet spot. All other roles were filled by software engineers. This meant that the bioinformaticians did not need to concern themselves with matters of scale. Where increased capacity was needed, multiple Docker containers were configured to run in parallel (on cloud infrastructure if necessary), and chaining of the results of one pipeline into the input of another was handled by Simplicity. In addition, the resulting pipelines were automatically reproducible, because all their runtime dependencies were encapsulated within Docker containers.

This approach was made possible by a system architecture which mirrored the communication structure of the organization. The software engineers were divided into teams that worked on user interfaces, and teams that worked on the backend infrastructure. Each pipeline had its own lead bioinformatician, who developed their pipelines to a specific common API. The bioinformaticians communicated with the infrastructure software engineers using language grounded in terms of that API. They communicated with the principal investigator (PI) using the language of the core (biological) subdomain.

Conclusion

In the context of the problems that currently challenge the field, such as issues in reproduction of computational results, difficulties in reliably scaling solutions to cope with vast and growing data, or freeing bioinformaticians from software-related toil, it may be time for bioinformatics to examine its mission and to adjust its approach. An important step would be to redefine its core competencies, and re-designate software engineering (as distinct from software programming) as a vital service, separate from but complementary to, the computational and data sciences at the core of bioinformatics.

This should not be taken as a relegation of software itself as a service to biology or bioinformatics. Among the many roles that code plays in life science research, the sweet spot will always remain. However in many cases, where data or computation at significant scale is concerned, software engineers will be needed to fulfill the other identified roles in order to ensure adequate levels of performance and reproducibility.

The presence of software engineers in a bioinformatic project should not in any sense undermine the cross-disciplinary nature of the bioinformatician’s work. On the contrary—code would remain at the heart of both disciplines, but with clearer interfaces between abstraction levels and subdomain types. Software engineers working at the lower levels of abstraction would create reliable and reusable libraries to serve common life science requirements. Software engineers working at the higher levels of abstraction would harness bioinformatic programs into large-scale systems using techniques like containerization and orchestration over computational clusters. In all cases, collaboration at the interfaces would take place in the lingua franca of both fields: code.

In her paper on big biology, Vermeulen 23 notes the movement in biological research to centralize complex and expensive technologies (e.g. “electron-microscopy, NMR spectroscopy, röntgendefraction, ultracentrifuge”) with the goal of “not only the sharing of costs, but also the development of professional operational skills.” In the same context of big biology, this sharing of costs and skills should be applied to software engineering. This would call for an explicit position for software engineering within bioinformatics.

There are lessons to be learned from the other main scientific disciplines. The first scientific astronomers fashioned their own telescopes. Seminal work in electromagnetism was performed on improvised experimental equipment. But today’s Hubble telescopes and Large Hadron Colliders are the fruit of collaboration between researchers and service providers. Today’s big biology must follow this course with respect to software as it has done and continues to do for hardware.

Even modestly sized labs are still bound to work with big data, especially in the area of genomics. Small teams with relatively narrow interests are still obliged to handle ever-increasing amounts of data, and use it to generate repeatable results.

The natural way to meet these needs is to integrate software engineers, as service providers, into life science research teams, with a mission to democratize that research through the development of suitable bioinformatic abstractions over reliable, scalable, and reproducible tools. This would have an effect on bioinformatics comparable to the one the democratization of cloud computing is having on software development: allowing small and medium-sized teams to “punch above their weight.”

One obstacle to be overcome as part of this proposed approach is hinted at by Prabhu et al. 24 when they quote a research scientist as saying that even “funding agencies think software development is free,” and regard development of robust scientific code as “second class” compared to other scientific achievements. The way in which research projects are funded does not currently take into account the costs associated with developing software, or the potential benefits that reproducibility and scale can bring. This is echoed by Faulk et al. 8 who point out that “[p]urchasing decisions and budgeting models are short-term and hardware-focused,” with little regard for software.

While not every project will be able to budget for full-time software engineers, research groups should be able to share such resources within or across academic institutes, or make use of specialized external software companies which would grow in number to meet demand. This could be facilitated by institutes of higher education creating post-graduate crossover opportunities between software engineering and life science departments.

An intelligent integration of software engineers into life science research projects is possible, when based on an understanding of the different roles of code in biology. Principal Investigators will be able to select the levels of abstraction at which they expect their software engineers to work, and the points of communication between those engineers and the bioinformaticians and biologists. In doing so, they will be liberating the latter group from engineering-related toil outside of their core domains, to concentrate instead on answering the big biological questions of our time.

Acknowledgments

We would like to acknowledge the support from NSilico Lifescience Ltd. during the case study described in this paper. Graphics by Oisín Lawlor, Ø. Macioti.

Author biographies

Dr. Brendan Lawlor is an experienced Software Development Practitioner and Researcher. He holds a BSc in Computer Science from University College Cork, an MSc by research in Computer Science from the Cork Institute of Technology, and a PhD by research in Computer Science, applied to Bioinformatics, from Munster Technological University.

Prof. Roy D Sleator is Professor of Biological Sciences at Munster Technological University. He holds a BSc in Microbiology, an MA in Education and a PhD in Molecular Biology from University College Cork. He also holds a PGC in Bioinformatics from the University of Manchester, a DSc on published work from the National University of Ireland and an MBA in Leadership and Management from Anglia Ruskin University.

Footnotes

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Roy D. Sleator is co-founder and CSO of NSilico Lifescience Ltd. Brendan Lawlor was employed by NSilico Lifescience Ltd. during the period of research mentioned in the paper.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The case study in this paper was supported by a grant from the EU Horizon 2020 Research and Innovation Staff Exchange (RISE) programme (SAGE-CARE Project ID: 644186).

References

  • 1. Stein LD. Bioinformatics: alive and kicking. Genome Biol 2008; 9(12): 114.
  • 2. Lewis J, Bartlett A, Atkinson P. Hidden in the middle: culture, value and reward in bioinformatics. Minerva 2016; 54(4): 471–490.
  • 3. Lewis J, Bartlett A. Inscribing a discipline: tensions in the field of bioinformatics. New Genet Soc 2013; 32(3): 243–263.
  • 4. Markowetz F. All biology is computational biology. PLoS Biol 2017; 15(3): 1–5.
  • 5. Grüning B, Chilton J, Köster J, et al. Practical computational reproducibility in the life sciences. Cell Syst 2018; 6(6): 631–635.
  • 6. Scholz MB, Lo CC, Chain PSG. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr Opin Biotechnol 2012; 23(1): 9–15.
  • 7. Lawlor B, Sleator RD. The democratization of bioinformatics: a software engineering perspective. GigaScience 2020; 9(6): giaa063.
  • 8. Faulk S, Loh E, Vanter MLVD, et al. Scientific computing’s productivity gridlock: how software engineering can help. Comput Sci Eng 2009; 11(6): 30–39.
  • 9. Liskov B, Guttag J. Abstraction and specification in program development, vol. 20. Cambridge: MIT Press, 1986.
  • 10. Evans E. Domain-driven design: tackling complexity in the heart of software. Boston, MA: Addison-Wesley Professional, 2004.
  • 11. Fowler M. Refactoring: improving the design of existing code. Boston, MA: Addison-Wesley Professional, 2018.
  • 12. Hogeweg P. The roots of bioinformatics in theoretical biology. PLoS Comput Biol 2011; 7(3): e1002021.
  • 13. Hodgkin AL, Huxley AF. A quantitative description of membrane current and its application to conduction and excitation in nerve. J Physiol 1952; 117(4): 500–544.
  • 14. Karr JR, Takahashi K, Funahashi A. The principles of whole-cell modeling. Curr Opin Microbiol 2015; 27: 18–24.
  • 15. Brooks FP. No silver bullet: essence and accidents of software engineering. Computer 1987; 20(4): 10–19.
  • 16. Conway ME. How do committees invent? Datamation 1968; 14(4): 28–31.
  • 17. Skelton M, Pais M. Team topologies: organizing business and technology teams for fast flow. Portland, OR: IT Revolution, 2019.
  • 18. Taschuk M, Wilson G. Ten simple rules for making research software more robust. PLoS Comput Biol 2017; 13(4): e1005412.
  • 19. Walsh P, Carroll J, Sleator RD. Accelerating in silico research with workflows: a lesson in simplicity. Comput Biol Med 2013; 43(12): 2028–2035.
  • 20. Palu CC, Ribeiro-Alves M, Wu Y, et al. Simplicity DiffExpress: a bespoke cloud-based interface for RNA-seq differential expression modeling and analysis. Front Genet 2019; 10: 356.
  • 21. Walsh P, Mac Aogáin M, Lawlor B. Investigating antibiotic resistance mechanisms in Clostridium difficile through genome-wide analysis of phenotyped clinical isolates. Vienna, Austria: ECCMID, 2017.
  • 22. Ryan L, Golparian D, Fennelly N, et al. Antimicrobial resistance and molecular epidemiology using whole-genome sequencing of Neisseria gonorrhoeae in Ireland, 2014–2016: focus on extended-spectrum cephalosporins and azithromycin. Eur J Clin Microbiol Infect Dis 2018; 37(9): 1661–1672.
  • 23. Vermeulen N. Big biology: supersizing science during the emergence of the 21st century. NTM Zeitschrift für Geschichte der Wissenschaften, Technik und Medizin 2016; 24(2): 195–223.
  • 24. Prabhu P, Kim H, Oh T, et al. A survey of the practice of computational science. In: SC’11: Proceedings of 2011 international conference for high performance computing, networking, storage and analysis, pp. 1–12. New York: IEEE, 2011.
