2024 Jun 21;9(1):ysae010. doi: 10.1093/synbio/ysae010

Table 1.

Descriptions of data hazards with synthetic biology examples

Data hazard	Description	Synthetic biology examples	Potential safeguards
General data hazard	Data science is being used and leading to negative outcomes. This hazard applies to all data science research outputs.	All areas that make use of data science approaches.	Proactively explore potentially negative applications and implement mitigating actions.
Lacks community involvement	Technology is being produced without sufficient input from the community it is designed to serve.	Proprietary ML-based algorithms developed to support a synthetic biology-based therapeutic with no Patient and Public Involvement and Engagement (PPIE).	Engage with community stakeholders through consultations and participatory design processes.
Reinforces existing bias	Reinforces unfair treatment of individuals and groups. This may be due to input data, algorithm or software design choices, or society at large.	Data collection focused on a limited set of model organisms may mean that our understanding and models do not translate to biology at large, leading to poor decisions when engineering non-model species.	Apply algorithms to detect bias in datasets and model outputs, helping guide new data collection/generation to alleviate any biases found.
Difficult to understand	Danger that the technology is difficult to understand. This could arise due to a lack of interpretability (e.g. neural nets), lack of documentation, or problems with implementation details that are difficult to spot.	Deep learning models of gene regulatory sequences and proteins. Large-scale models of cellular processes (e.g. whole-cell models, metabolic models, regulatory models).	Use standardized data formats (e.g. SBOL) and seek domain expertise to apply explainable AI approaches.
High environmental impact	Methodologies are energy-hungry, data-hungry (requiring increasing amounts of computation), or require special hardware that requires rare materials and resources that are non-sustainable.	Large deep-learning-based models require huge amounts of compute for training and often significant compute for prediction, which typically has a hidden environmental impact. Similarly, whole-cell models can take days to run and generate huge data sets that require significant storage.	Explore the use of surrogate modeling to reduce the computational resources required, and optimize the code and hardware used.
Risk to privacy	Possible risk to the privacy of individuals whose data is processed.	Engineering of personalized medicine applications (e.g. CAR T cell engineering).	Anonymize data where possible.
Lacks informed consent	Datasets or algorithms use data which have not been provided with the explicit consent of the data owner/creator. These types of data often lack other contextual information, which can also make it difficult to understand potential biases.	Bioprospecting studies of large genomic databases often make use of sequenced samples where consent of local people may not have been given.	Develop clear guidelines for obtaining informed consent and ensure transparency in data usage.
Automates decision-making	Automated decision-making can be hazardous in many different ways. It is important to ask: whose decisions are being automated, what automation can bring to the process, and who benefits or is harmed by this automation?	Increasing use of automation and design-of-experiments approaches when screening libraries and performing complex laboratory tasks. Errors in data could result in poor decisions being made automatically.	Identify areas where decisions are being automated and adapt existing safety frameworks to increase testing/validation of design choices prior to deployment.
Capable of direct harm	The application area of this technology means that it is capable of causing direct physical or psychological harm to someone even if used correctly.	Many areas of synthetic biology have dual-use potential (e.g. toxin production, synthetic viruses).	Assess the level of harm and ensure sufficient containment is in place to avoid harm.
Danger of misuse	There is a danger of misusing the algorithm, technology, or data collected.	Synthetic biology is often dual-use, and new-to-nature biological parts and systems can have unintended consequences that are difficult to predict (e.g. gene drives, toxin production, engineering of viruses).	Ensure thorough testing of models prior to release, including the identification of potential “emergent abilities” in neural network-based generative models.
Classifies and ranks people	Ranking and classifications of people should be handled with care. We should ask what happens when the ranking/classification is inaccurate, when people disagree with how they are ranked/classified, as well as who it serves and how it could be gamed.	Less common in synthetic biology, but may become an issue if personalized medicine becomes established.	Seek engagement with society about how classifications might cause negative outcomes and aim to build broader agreement on how issues are best handled.
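
As an illustration of the safeguard listed for "Reinforces existing bias", the following minimal sketch measures how a dataset's records are distributed across organisms and flags those that are underrepresented. It is not a method taken from the table: the function name `organism_representation`, the `organism` record field, the 5% threshold, and the toy data are all assumptions made for this example, and a real analysis would need a threshold and grouping suited to the dataset in question.

```python
from collections import Counter


def organism_representation(records, min_share=0.05):
    """Return each organism's share of the records and the organisms below min_share.

    `records` is assumed to be an iterable of dicts with an "organism" key
    (a hypothetical schema chosen for this sketch).
    """
    counts = Counter(r["organism"] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}, []
    shares = {org: n / total for org, n in counts.items()}
    # Organisms whose share falls below the (assumed) threshold are flagged
    # as candidates for targeted data collection.
    underrepresented = sorted(org for org, s in shares.items() if s < min_share)
    return shares, underrepresented


# Hypothetical toy dataset dominated by a single model organism.
records = (
    [{"organism": "Escherichia coli"}] * 90
    + [{"organism": "Saccharomyces cerevisiae"}] * 8
    + [{"organism": "Pseudomonas putida"}] * 2
)

shares, flagged = organism_representation(records, min_share=0.05)
for org, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{org}: {share:.0%}")
print("Underrepresented:", flagged)
```

A simple share-based check like this only reveals imbalance in what was collected; it does not show whether a model trained on the data actually performs worse on the flagged organisms, which would need separate evaluation.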