Table 1. Descriptions of data hazards with synthetic biology examples
| Data hazard | Description | Synthetic biology examples | Potential safeguards |
|---|---|---|---|
| General data hazard | Data science is being used and leading to negative outcomes. This hazard applies to all data science research outputs. | All areas that make use of data science approaches. | Proactively explore potentially negative applications and implement mitigating actions. |
| Lacks community involvement | Technology is being produced without sufficient input from the community it is designed to serve. | Proprietary ML-based algorithms developed to support a synthetic-biology-based therapeutic with no Patient and Public Involvement and Engagement (PPIE). | Engage with community stakeholders through consultations and participatory design processes. |
| Reinforces existing bias | Reinforces unfair treatment of individuals and groups. This may be due to input data, algorithm or software design choices, or society at large. | Data collection that focuses on a limited set of model organisms may mean that our understanding and models do not translate to biology at large, leading to poor decisions when engineering non-model species. | Apply algorithms to detect bias in datasets and model outputs, helping to guide new data collection/generation that alleviates any biases found. |
| Difficult to understand | Danger that the technology is difficult to understand. This could arise due to a lack of interpretability (e.g. neural nets), lack of documentation, or problems with implementation details that are difficult to spot. | Deep learning models of gene regulatory sequences and proteins. Large-scale models of cellular processes (e.g. whole-cell models, metabolic models, regulatory models). | Use standardized data formats (e.g. SBOL) and seek domain expertise to apply explainable AI approaches. |
| High environmental impact | Methodologies are energy-hungry, data-hungry (requiring increasing amounts of computation), or require special hardware that depends on rare materials and non-sustainable resources. | Large deep-learning-based models require huge amounts of compute for training and often significant compute for prediction, which typically has a hidden environmental impact. Similarly, whole-cell models can take days to run and generate huge data sets that require significant storage. | Explore the use of surrogate modeling to reduce the computational resources required, and optimize the code and hardware used. |
| Risk to privacy | Possible risk to the privacy of individuals whose data is processed. | Engineering of personalized medicine applications (e.g. CAR T cell engineering). | Anonymize data where possible. |
| Lacks informed consent | Datasets or algorithms use data which have not been provided with the explicit consent of the data owner/creator. These types of data often lack other contextual information, which can also make it difficult to understand potential biases. | Bioprospecting studies of large genomic databases often make use of sequenced samples where consent of local people may not have been given. | Develop clear guidelines for obtaining informed consent and ensure transparency in data usage. |
| Automates decision-making | Automated decision-making can be hazardous in many different ways. It is important to ask: whose decisions are being automated, what automation can bring to the process, and who benefits or is harmed by this automation? | Increasing use of automation and design-of-experiments approaches when screening libraries and performing complex laboratory tasks. Errors in data could result in poor decisions being made automatically. | Identify areas where decisions are being automated and adapt existing safety frameworks to increase testing/validation of design choices prior to deployment. |
| Capable of direct harm | The application area of this technology means that it is capable of causing direct physical or psychological harm to someone even if used correctly. | Many areas of synthetic biology have dual-use potential (e.g. toxin production, synthetic viruses). | Assess the level of harm and ensure sufficient containment is in place to avoid harm. |
| Danger of misuse | There is a danger of misusing the algorithm, technology, or data collected. | Synthetic biology often has dual-use potential, and new-to-nature biological parts and systems can have unintended consequences that are difficult to predict (e.g. gene drives, toxin production, engineering of viruses). | Ensure thorough testing of models prior to release, including the identification of potential “emergent abilities” in neural network-based generative models. |
| Classifies and ranks people | Ranking and classifications of people should be handled with care. We should ask what happens when the ranking/classification is inaccurate, when people disagree with how they are ranked/classified, as well as who it serves and how it could be gamed. | Less common in synthetic biology, but may become an issue if personalized medicine becomes established. | Seek engagement with society about how classifications might cause negative outcomes and aim to build broader agreement on how issues are best handled. |
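
The safeguard listed for "Reinforces existing bias" (detecting bias in datasets to guide new data collection) can be illustrated with a minimal sketch. It assumes each record in a dataset carries a hypothetical `organism` field, and it flags organisms whose share of the records falls below an arbitrary threshold. The field name, the function names, the 10% cutoff, and the toy counts are all assumptions made for illustration and are not taken from the source; a real audit would likely need richer measures than simple frequency.

```python
from collections import Counter


def organism_shares(records):
    """Return the fraction of records contributed by each organism."""
    counts = Counter(r["organism"] for r in records)
    total = sum(counts.values())
    return {org: n / total for org, n in counts.items()}


def underrepresented(records, min_share=0.10):
    """List organisms whose share of the dataset is below min_share.

    min_share is an illustrative threshold, not a recommended value.
    """
    shares = organism_shares(records)
    return sorted(org for org, share in shares.items() if share < min_share)


# Hypothetical toy dataset dominated by a single model organism.
records = (
    [{"organism": "Escherichia coli"}] * 80
    + [{"organism": "Saccharomyces cerevisiae"}] * 15
    + [{"organism": "Pseudomonas putida"}] * 5
)

# Organisms flagged here are candidates for targeted data collection.
print(underrepresented(records))
```

A frequency check of this kind only shows where data are sparse. It does not show whether a model trained on the data generalizes poorly to the flagged organisms, which would need to be tested separately.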