PLOS One. 2025 Jul 1;20(7):e0326678. doi: 10.1371/journal.pone.0326678

Gain efficiency with streamlined and automated data processing: Examples from high-throughput monoclonal antibody production

Malwina Kotowicz 1, Magdalena Shumanska 1, Sven Fengler 1, Birgit Kurkowsky 1, Anja Meyer-Berhorn 1, Elisa Moretti 1, Josephine Blersch 1, Gisela Schmidt 2, Jakob Kreye 3,4,5,6, Scott van Hoof 3,5, Elisa Sánchez-Sendín 3,5, S Momsen Reincke 3,5,6, Lars Krüger 7, Harald Prüß 3,5, Philip Denner 1, Eugenio Fava 8, Dominik Stappert 1,*
Editor: Bhanwar Lal Puniya
PMCID: PMC12212921  PMID: 40591905

Abstract

Data management and sample tracking in complex biological workflows are essential steps to ensure necessary documentation and guarantee reusability of data and metadata. Currently, these steps pose challenges related to correct annotation and labeling, error detection, and safeguarding the quality of documentation. With the growing acquisition of biological data and the expanding automation of laboratory workflows, manual processing of sample data is no longer practical, as it is time- and resource-consuming, prone to biases and errors, and lacks scalability and standardization. Thus, managing heterogeneous biological data calls for efficient and tailored systems, especially in laboratories run by biologists with limited computational expertise. Here, we showcase how to meet these challenges with a modular pipeline for data processing, facilitating the complex production of monoclonal antibodies from single B-cells. We present best practices for development of data processing pipelines concerned with extensive acquisition of biological data that undergoes continuous manipulation and analysis. Moreover, we assess the versatility of the proposed design principles through a proof-of-concept data processing pipeline for automated induced pluripotent stem cell culture and differentiation. We show that our approach streamlines data management operations, speeds up experimental cycles and leads to enhanced reproducibility. Finally, adhering to the presented guidelines will promote compliance with FAIR principles upon publishing.

Introduction

Over the last few decades, technological advancements in fields such as imaging, laboratory automation, computing, and data analysis have revolutionized the way biologists work and handle data [1–4]. High-throughput (HT) and high-content (HC) studies are no longer exclusive to large, specialized labs but are gaining popularity in research conducted by smaller, independent teams [4–6]. This trend is expected to continue, as smaller biology labs increasingly adopt HT/HC techniques due to their decreasing costs, thereby generating large amounts of biological data. Additionally, the value of HT/HC techniques in producing reliable and comprehensive data has recently been highly emphasized, further incentivizing individual groups to incorporate such methods to generate high-quality research and stay competitive [7–10].

Biological data is heterogeneous by nature and often includes experimental readouts, curated annotations, and metadata, among other types of data. The increasing size and complexity of biological datasets call for effective means to manage data in its complete life cycle throughout a workflow. Generation, processing, analysis and management of heterogeneous biological data thus require tailored systems to improve data governance [11–13]. Many examples of such complex workflows, datasets and analysis processes come directly from the fields of toxicology, pharmacology and nanotechnology (e.g., vaccine development or toxicant testing), large omics studies (high-throughput generation of molecular data) and long-term (pre-) clinical studies, among others [14–17].

The increasing use of laboratory automation and the generation of experimental workflows with complex structures pose a unique challenge in backtracking and identifying samples and their related metadata. Each step of the workflow impacts the final data, and if problems arise, it can be difficult to backtrack and pinpoint errors. Likewise, the reproducibility of complex biological workflows is closely tied to precise record-keeping, especially as new techniques are introduced. As wet lab experiments are often complex, time-sensitive, and involve many researchers, the quality of documentation can be compromised. Moreover, manual data curation is time-consuming, labor-intensive, prone to human error, and at risk of biases as it relies on individual expertise. Any error in data curation compromises data integrity and can lead to incorrect conclusions, inefficient workflows, and the inability to reuse the data. Similarly, manual integration of data from multiple sources lacks standardization, has limited scalability, and can hinder early error detection [18].

To ensure data integrity in workflows and prevent potential data loss, strict quality control measures and careful monitoring of workflow steps are necessary. Although many systems exist for managing large datasets in biology [19], they are mainly implemented in larger, specialized facilities with teams trained in computer science. In smaller, individual labs, dedicated informatics staff may not be available and biologists are required to learn complex tools and technologies for data processing, despite lacking prior experience and facing time and resource constraints. Overall, there is an urgent need for design guidance for data processing solutions in biology workflows.

Here, we present a recently established pipeline for modular data processing that facilitates and documents the complex production of monoclonal antibodies (mABs) derived from individual B-cells. Implementing our data management system reduced the time-spent on data processing by over one-third and improved data reliability. Our strategy proves that, with moderate effort, biologists can set up an efficient, rewarding, systematic approach to routine data processing tasks. This approach will simplify documentation, facilitate reproducibility, and improve accuracy by eliminating errors related to manual data handling. Furthermore, data processing can be sped up, accelerating the generation of reliable insights and freeing hands for other tasks. Data processing can be standardized, enabling comparison of results across series of experiments or labs. Moreover, our approach supports scalability, as modules of data processing pipelines can be up- or down-scaled to handle varied data amounts, adapting to changing research needs.

Finally, to demonstrate the versatility and transferability of our approach, we apply it to the development of a data processing pipeline for automated stem cell culture. We show that our design guidelines can serve as best practice recommendations for other biologists and be a step towards greater reproducibility, efficiency, and standardization of workflows in biology.

Box 1. Glossary.

Data curation – the process of cleaning, organizing and standardizing data towards greater quality, utility and long-term preservation. Data curation is part of data processing.

Data processing – all tasks performed on collected data to prepare it for analysis, such as curation, formatting, transposition, joining, sub-setting and summarizing, but excluding data acquisition and storage. Data processing is a broad term and covers data transformation steps that facilitate extracting insights.

Data processing pipeline – the sum of all data processing modules for one workflow, aimed at converting raw input data into usable information.

Data repository – storage space to catalog and archive data. Ideally, the storage space and contained data are managed by a database software.

Knime module – a workflow in Knime performing a series of related tasks or operations, where each operation is performed by a Knime “node” that represents the smallest operational unit of Knime.

Metadata – data that describes other, associated data. In the context of this work, metadata includes but is not limited to: i) donor data associated with a sample (e.g., date of donation, cell number); ii) experimental settings and conditions (e.g., protocols, reagents, equipment); iii) association of experimental results with the original sample; iv) location of a given sample or its derivate.

Module or data processing module – an individual Python script or Knime workflow.

Wet lab experiment or experimental procedure – workflow steps that are conducted physically in the laboratory.

Workflow – systematic series of interconnected steps designed to achieve specific research objectives, including wet lab work, data processing and data storage.

Methodological approach, design and findings

Data processing is key to efficient high-throughput mAB cloning and production

We have recently established a wet lab workflow that enables the production of over one thousand mABs per year from cDNA generated from patient-derived single B-cells. An antibody is composed of two protein chains, the heavy (H) and the light (L) chain, where the light chain can be of either Kappa (κ) or Lambda (λ) type. Genetic information for both the H and L chains needs to be isolated and cloned. This enables in vitro generation of mABs with the same specificity as in the originating B-cell. The wet lab protocols have been outlined with minor deviations elsewhere [20–22].

Briefly, the procedure consists of the following steps: cDNA is generated from individual B-cells (fluorescence-activated cell sorting (FACS) of B-cells from different originating samples, such as PBMCs or CSF, without further culture). The cDNA serves as a template to amplify the H and L chains by PCR. As the L chain can be of κ or λ type, it is necessary to perform three parallel PCR reactions (S1 Fig – PCR Heavy, Kappa, Lambda). The PCR amplicons are analyzed for correct size by electrophoresis and sequenced upon positive evaluation. Primer pairs specific for the sequenced H and L chains, and equipped with overhangs for Gibson Assembly, are selected for a final PCR reaction (for a complete primer set, see S1 Table). The final PCR products, covering the specificity-determining variable regions of antibodies, are cloned by Gibson Assembly into plasmids encoding the constant regions of H or L chains. The assembled plasmids are amplified in bacteria, purified and sequenced to confirm sequence identity, i.e., that no mutations causing failures at the protein level have been introduced [23]. HEK cells are then transfected with pairs of plasmids encoding matching H and L chains. After transfection, mABs produced by HEK cells are secreted into the cell supernatant, collected, and their concentrations are measured.

Due to the complexity of the workflow, the production of mABs in a time- and resource-efficient manner implies considerable data processing effort. Thus, the following challenges had to be addressed for the design and standardization of our pipeline:

  1. Wet lab experiments represent a selection funnel, wherein not all samples that enter the workflow proceed to the end. This is due to the involvement of a series of steps that progressively narrow down the selection of samples, i.e., not all B-cells will give rise to functional antibodies. Optimization of resource usage, therefore, requires the selection of successful samples after each analysis step. The following wet lab step is only executed when a significant number of successful samples have accumulated after a set of previous wet lab experiments (e.g., one full plate of 96 samples).

  2. A specific challenge to mAB production is the composition of antibodies from H and L chains. As each antibody is encoded by a pair of plasmids, quality analysis requires matching the heavy and light chains cloned from the same cDNA. Only if both heavy and light chains are positively evaluated are the chains pushed to the next step. If one chain of a pair is missing, several workflow steps must be repeated for the missing chain, increasing the complexity of both data processing and documentation.

  3. Another challenge to mAB production is the variability of H and L chains derived from somatic recombination of V(D)J gene segments [24,25]. The specificity-determining region of an H chain is composed of several V (variable), D (diversity), and J (joining) gene segments. The specificity-determining region of an L chain is similarly composed of several V (variable) and J (joining) gene segments. To clone H and L chains, a three-step PCR strategy is implemented (S1 Fig and S1 Table). First, forward and reverse primer mixes for each chain are used to amplify a given chain from cDNA. Then, a second PCR with primer mixes is performed to analyze the first amplicon by sequencing. Obtained sequences are analyzed for the specific V(D)J gene alleles and based on this analysis, a specific primer pair is selected from a set of 69 primers to perform a final PCR step [23]. Significant effort is required for processing the data, curating the datasets, and generating look-up tables for performing the final specific PCR.

  4. The key to a time-efficient workflow is to shorten the data processing effort between two wet lab experiments. Due to the complexity of our workflow, data processing is time-consuming when done manually (≥ 33% of complete hands-on time for mABs production from cDNA). Analysis of samples, planning the next experimental step, and relating initial cDNA samples to experimental readouts are a challenge, as different machines use varying plasticware layouts (e.g., 96- or 384-well plates, column- or row-wise, individual culture plates or individual tubes).

  5. Data generated on each mAB production step needs to be curated and documented before proceeding to subsequent steps, analysis, troubleshooting, hypothesis formulation, and publication.

An organized and standardized approach to data processing and analysis can help address these and similar challenges in complex biological workflows and gain efficiency. In the next sections, we present design principles that guided the development of our data processing pipeline. We start with principles that give rise to functional pipelines, such as choosing the right technologies, applying modular design and ensuring interoperability between modules. We cover the implementation of dedicated databases and design guidelines that can help develop and improve existing pipelines towards better efficiency, organization and reproducibility.

Getting started

When designing a workflow, it is crucial to conceptualize it in advance by defining the tasks, identifying their dependencies and contingencies, and determining data processing operations. This can be done in a few steps:

  1. Start by outlining a series of step-by-step experiments (tasks) that constitute the workflow, specifying data to be retrieved, analyzed and stored at each step. Determine the workflow endpoint to guide the design process. Visualizing the entire workflow helps understand the experimental sequence and anticipate potential workflow expansion (Fig 1).

  2. Establish methods to document sample metadata and experimental details (protocols and data analysis procedures).

  3. Consider repetitions that may occur in case of experiment failure and how to handle them with respect to data storage.

  4. Implement ways to document the workflow executions and their results for quality control, error checking, data analysis and interpretation.

Fig 1. An example of a two-step wet lab workflow, demonstrating data processing needs.

Fig 1

Data generated in wet lab experiments (experimental work, blue background, top) undergoes processing (data processing, yellow background, middle) for analysis and storage (data repository, green background, bottom). Biological samples and accompanying metadata are collected and must be curated and documented. Based on the metadata, subsets of samples are selected for analysis, such as those from wild-type or diseased subjects (1). The workflow starts with a wet lab experiment that is then documented (2) and analyzed (3). Next, the wet lab experiment is conducted on a subset of samples based on the analysis of the first experiment (4). The analysis of the second experiment closes the workflow cycle (5). Analyzing the results of the second experiment in the context of data (and metadata) gathered in the workflow cycle allows for supporting or refuting the hypotheses and yields novel insights (6).

By defining these tasks, it is easier to identify the data processing operations required at each workflow step in order to construct a functional pipeline (as for example illustrated in Fig 1).

When designing the pipeline, we considered some recommendations for analysis scripts in neuroimaging already present in the literature, grounded in software development [26]. The design decisions made during development of our data processing modules offer a potentially valuable resource for other biologists in designing their own workflows. It is important to note that guidelines presented here do not always follow the sequential order of wet lab experiments. Instead, they are presented in a non-chronological manner to better explain individual pipeline operations.

We provide a link to the GitHub repository housing the data processing pipeline. This repository contains sample (mock-up) data and metadata necessary to run the pipeline steps implemented in Python. Building upon the work of van Vliet and colleagues [26], we use examples to illustrate each design principle, linking it with the scripts in the code repository (refer to the passages “Examples from the workflow”). In the sections below, we discuss the technologies used in the mAB pipeline, the module design and usage, file types used for interoperability, and our database structure.

Choosing the right technologies

The choice of technologies for pipeline implementation is largely dependent on the skills of team members. Teams are often diverse when it comes to expertise and preferred way of working. Building upon this diversity can significantly enhance overall performance and problem-solving capabilities [27,28]. While computational skills are becoming more prevalent among biologists [20], not all researchers are proficient in programming. Software with a graphical user interface (GUI) helps users to perform complex operations by eliminating the need to learn programming. It also allows for easy, standardized and interpretable data visualization. This reduces the need to switch between different scripts or programs and decreases data processing time. GUIs can typically save and record steps taken during a certain procedure, increasing the likelihood of reproducibility among team members and minimizing errors [29]. The passage below explains our usage of certain technologies throughout the pipeline.

Examples from the workflow.

Our modules are implemented in Knime and Python, but designed in a manner to ensure interoperability (refer also to section “Defining input and output files”).

Knime is among the most user-friendly open-source workflow management systems and offers a wide array of learning resources [4]. It provides an intuitive visual interface suitable for rapid prototyping and reproducibility. Every step in Knime is represented in a visual workflow, making it easy to document, reproduce and share processing steps and analyses. It can seamlessly connect with external tools used in biological data processing and adapt to growing datasets. Quality control steps and batch analyses can easily be automated using Knime.

Python is currently considered the most popular programming language [23], well-suited for biology applications [24]. Python has a rich ecosystem of libraries tailored to biological workflows (e.g., NumPy, SciPy, Pandas, BioPython, etc.) and a large scientific community that helps biologists solve problems faster and focus on their work, rather than spend time on complex syntax. It is also suitable for large-scale computations (e.g., GWAS) and it can run on various operating systems, further increasing usability and reproducibility.

This combination balances accessibility and customizability, making it well suited for our needs. However, other technologies can be used with similar success when designing custom pipelines (e.g., Galaxy for transcriptomics data, R-Studio, RapidMiner, etc.). The skills and preferences of team members should guide the selection of an appropriate technology solution.

Translating tasks to modules

Once defined, workflow tasks are converted to modules. Modules are the building blocks of a data processing pipeline, whether they be individual Python scripts or Knime workflows. By design, a module should perform a simple, singular task. To prevent scripts from becoming difficult to understand, complex tasks should be divided into smaller tasks across multiple modules. Each module should be self-contained and run with minimum dependence on other modules. This way, any changes or issues can be addressed without disrupting other modules in the pipeline.

Our pipeline follows the logic of the wet lab experiments – it is organized in modules that correspond to experimental flow on the bench (see Fig 2 for an overview of the mAB production pipeline). Since the workflow is hierarchical (each wet lab step relies on the results of the previous step(s)), we alternate wet lab experiments with data processing, rather than running the entire pipeline when all experiments are completed. Thus, the modularity of each data processing step is defined by related wet lab experiments. For example, a module can: i) analyze experimental readouts, ii) prepare layouts for the next experiment, iii) perform sample selection based on predefined criteria, iv) assemble machine-readable files for automated wet lab experiments, and v) generate a file for data storage in a database. In our experience, following the wet lab criterion is the optimal way to organize modules for hierarchical pipelines (Fig 2).

Fig 2. Workflow for the production of mABs from patient-derived individual B-cells.

Fig 2

Wet lab work steps (ellipse-icons with headers) and data processing steps (yellow icons). Sample flow is depicted by blue arrows and information flow is depicted by yellow arrows. The experimental procedure starts with FACS sorting of individual B-cells. cDNA is generated from individual B-cells, and PCR1 and PCR2 serve to amplify antibody chains. Successfully amplified chains, evaluated by capillary electrophoresis (cELE1), are sequenced (SEQ1). Upon positive sequence evaluation with the aBASE software [23], specific primers are used for the final amplification of the specificity-determining chain parts (PCR3). Amplification is quality-controlled by electrophoresis (cELE2). The specificity-determining regions are cloned by Gibson Assembly (GiAs) into plasmids encoding the constant parts of the chains. Plasmids are amplified by transformed (TRFO) bacteria after plating (PLA) and picking (PICK) individual bacterial clones. Prior to quality control by sequencing (SEQ2), amplified plasmids are isolated (MINI), and plasmid concentration is measured and adjusted (ConcAdj). Based on sequencing results, functional plasmids (Pla) are sorted. Matching plasmid pairs (mAB Pla) are prepared for transfection (Traf) into HEK cells. mABs produced and secreted by HEK cells are harvested (mABs), and mAB concentration is quantified (quant). Glycerol stocks (Glyc) of transformed bacteria and plasmid aliquots (Pla safe) serve as safe stocks of plasmids. The colors of the wet lab work step captions indicate different workflow sections: blue – molecular biology; red – microbiology; green – cell biology. The colors of the data processing circle contours and font indicate the utilized technology: red – Python; yellow – Knime; blue – third-party software.

The passage below explains a typical logical flow of modules in our mAB workflow with links to sample data in GitHub.

Examples from the workflow.

To determine whether samples were successfully amplified (refer to Fig 2) in the final PCR step (PCR3), we perform capillary electrophoresis (cELE2). The results of cELE2 are processed and simplified by the module 06_cELE2.py. The module loads experimental readouts and selects a set of samples for further processing based on a band size threshold (in base pairs). The next module, 07_GiAS.py, takes the successful samples and prepares a look-up table for cloning by Gibson Assembly (GiAs), as well as a database import file (Fig 2).

For reference, we provide examples of raw experimental readout data after capillary electrophoresis, look-up tables (for automated liquid handling) and a database import file.

Although our workflow is semi-automated and we often deal with machine-readable files, the modules are adaptable to manual workflows. For example, the number of samples handled in the next workflow steps can easily be modified by defining the size of the chunks in which samples are processed further. Since this workflow step operates on multiple 96-well plates, we set the number of samples to batches of 96, but a lower number can be chosen for manual, low-throughput workflows.
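To make the selection logic concrete, the sketch below mimics what 06_cELE2.py does conceptually: filter a readout by band size and emit only full 96-sample batches. Column names, the threshold value and file names are hypothetical placeholders, not the module's actual interface.

```python
import pandas as pd

# Illustrative values only; in the real pipeline these come from config.py.
BAND_SIZE_THRESHOLD_BP = 400   # minimum amplicon size accepted after cELE2
BATCH_SIZE = 96                # one full plate; reduce for manual workflows

def select_successful_samples(readout_csv):
    """Load a capillary electrophoresis readout and keep passing samples."""
    readout = pd.read_csv(readout_csv)
    passed = readout[readout["band_size_bp"] >= BAND_SIZE_THRESHOLD_BP]
    return passed.reset_index(drop=True)

def full_batches(samples, batch_size=BATCH_SIZE):
    """Yield only complete batches; leftover samples wait for the next run."""
    for start in range(0, len(samples) - batch_size + 1, batch_size):
        yield samples.iloc[start:start + batch_size]

passed = select_successful_samples("cELE2_readout.csv")
for i, batch in enumerate(full_batches(passed)):
    batch.to_csv(f"GiAs_lookup_plate_{i:02d}.csv", index=False)
```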

Allocating separate computational space to modules

After designing the modules based on the mAB experimental workflow, it was important to ensure that each module runs in a separate computational space; that is, the input and output files generated by modules should be saved in dedicated directories. Ideally, script files also reside in a separate directory. This computational environment needs to be accessible to all team members, must be backed up for data security, and should allow for seamless integration with different modules. These requirements are fulfilled by simple Network Attached Storage (NAS) drives, institutional set-up servers or cloud services. Within allocated computational space, the folder structure ideally reflects the architecture of the data processing pipeline.

When utilizing such storage drives and servers, it is crucial to implement appropriate data security and access control measures. This is particularly important when handling sensitive or large-scale datasets. Measures such as user authentication, role-based access, data encryption, and compliance with institutional or, where relevant, regulatory frameworks (such as GDPR or HIPAA) – for example, through deidentification of patient-derived data – are essential. In our case, data access is highly project-dependent, relying on role-based access control among the team members.

Importantly, working with big datasets can pose computational challenges, especially related to memory usage and processing time. To mitigate these issues, different strategies can be implemented, such as loading data in chunks, using memory-efficient data structures (such as NumPy arrays instead of native Python lists), and leveraging parallel computing where supported. Tools like Python’s multiprocessing library or KNIME’s parallel execution nodes can help distribute computational load and improve performance. Although our pipeline is not affected by these issues, we recognize their relevance.

The design of our computational environment dedicated to the mAB workflow, with examples, is explained in the passage below. Links to GitHub are also provided.

Examples from the workflow.

We created a dedicated computational workspace with a folder structure that resembles our data processing steps. Keeping modules physically separated in dedicated directories enforces a rigorous organization of both wet lab and data processing steps. It also gives each module straightforward access to the resources (input files) it requires without interfering with other modules, and it keeps Python and Knime modules separate.

The published GitHub repository preserves the folder structure of Python modules in our pipeline. For example, directories 05_PCR3_Out to 09_MINI_Out are dedicated to wet lab steps from specific PCR (PCR3) to plasmid isolation (MINI), while the directory 01_Ab_Quant_Out is reserved for quantification of produced mABs (wet lab steps mABs and quant, see Fig 2). There is a separate folder for modules manipulating bacterial colony information for repicking, as well as for input files used by modules. For reference, we also provide an image representation of an in-depth listing of directories and files for Python modules.

We implemented a backup rotation scheme that involves daily (with a retention period of one week), monthly and yearly backups on an external server. To safeguard against data loss in case of server failure, the database is also backed up on a dedicated database server twice a day.

Defining input and output files

Interoperability between modules (i.e., the ability of modules to work together seamlessly) should be given high priority when designing a pipeline. This ensures that data is passed between modules without loss of information or format. Moreover, each module can be developed and maintained independently, while still functioning cohesively within the larger workflow system.

To achieve this, we defined input and output files that exchange data between modules. We refer to these as intermediate files, as they are created as part of a larger process and are not a final output of the workflow. Intermediate files provide standardized structure and syntax for exchanged data. They also serve as interfaces, i.e., files generated as output by one module can then be used as input by subsequent modules. The format of our input and output files is standardized. We use Comma-Separated Values (CSV) files, as they are simple and human-readable (for manual quality checks), and they are widely compatible with our physical devices used for wet lab work. Other commonly used data formats, such as JSON or XML, can also be considered, particularly when handling structured or hierarchical data from external systems or Application Programming Interfaces (APIs).

Storing intermediate files has several advantages. When errors occur, it is possible to rerun only the modules that failed, instead of running the entire pipeline again. Furthermore, manual inspection of data is possible, which helps track the progress of the workflow and troubleshoot issues as they arise. In addition, the autonomy of each module is maintained, which decreases the complexity of the pipeline and guarantees that modules run independently without relying on data saved by other modules [26].

The use of intermediate files in our mAB workflow (based on Fig 2) is specified in the passage below, with links to GitHub.

Examples from the workflow.

Modules in our pipeline rely on one or more intermediate files, which are generated and saved by preceding modules. Intermediate files are often used by modules along with database export files, look-up tables or experimental readouts.

For example, the module 08_PLA.py, which links the information on cloning by Gibson Assembly (GiAs, Fig 2) with transformation and plating of bacteria samples (TRFO and PLA), starts by loading the pipetting schemes (configured by using an automated colony picker) and the database export file to generate the intermediate file. The latter is taken up by the next module, 09_MINI.py, which links bacteria with isolated plasmids. The module loads the intermediate file and creates the import for the database.

Although these steps are automated in our workflow, similar logic can be applied to manual workflows. For example, the module 08_PLA.py can take up any list of samples handled in low throughput. The number of samples to be processed is automatically determined by the module from the list of samples, and the subsequent module would infer the number of samples from the intermediate file.

Finally, intermediate files are automatically validated. For example, the number of samples is cross-checked during processing — if the expected number is not met (e.g., due to failed pivoting of split samples, which often happens when an uneven number of heavy and light plasmids is parsed for transfection), an error is raised. Additionally, file headers and formats are validated against parameters specified in the configuration file, so any unexpected changes at the machine level (e.g., updated column names or formats) will trigger an error.
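A minimal sketch of such validation, assuming hypothetical column names and a full plate of 96 samples (in the real pipeline, the expected headers and counts come from the configuration file):

```python
import pandas as pd

EXPECTED_COLUMNS = ["sample_id", "plate_barcode", "well"]  # illustrative
EXPECTED_N_SAMPLES = 96

def validate_intermediate(path):
    """Fail early and loudly if an intermediate file breaks the contract."""
    df = pd.read_csv(path)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"{path}: missing columns {missing}; "
                         "check for changed machine-level export formats.")
    if len(df) != EXPECTED_N_SAMPLES:
        raise ValueError(f"{path}: expected {EXPECTED_N_SAMPLES} samples, "
                         f"found {len(df)}; possible failed pivoting of "
                         "split samples.")
    return df
```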

Dedicated databases allow solid data documentation and efficient sample tracing

The complexity of a workflow determines the demands on its documentation and sample traceability. While simple workflows that handle a few dozen samples can likely be documented using spreadsheets, complex workflows that run multistep wet lab experiments on over a hundred samples require more advanced documentation, ideally in dedicated databases. A customized database facilitates retrieval of correct information between workflow end-points and saves significant time and effort otherwise spent on sample backtracking.

To that end, we chose FileMaker Pro Advanced (FM) as the database backend. FM is a low-code relational database management system that enables fast database creation and modification through drag-and-drop functionality [30]. Since the database engine can be accessed through a GUI, querying data is straightforward and queries are easily modifiable. Although FM may require some initial effort to become proficient, it remains an intuitive and user-friendly tool for biologists without prior experience in database design [31–33].

FM also provides scripting functionality that we used to automate the import and export of data for further processing in downstream applications. Customizable data exports facilitate quick quality control of the samples. This, in turn, is crucial for troubleshooting and for making decisions on the sample’s fate at key workflow steps.

The structure and design of our mAB-dedicated database, as well as examples, are explained in the passage below (and in S2–S4 Figs).

Examples from the workflow.

Our database was built around a concept of multiple, interconnected tables that store sets of unique records (information about sample state in the workflow). Records are connected through a series of relationships between individual tables. Their uniqueness is guaranteed by a Universal Unique Identifier – a 16-byte string assigned automatically to each record on data entry. Each subsequent related table inherits the value of the record’s unique ID from a previous table, allowing the retrieval of sample information at any workflow step (S2 Fig). The data is stored in 19 distinct, interrelated tables.
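FM assigns these identifiers automatically on data entry; the sketch below only illustrates the underlying record-linking idea in Python, with hypothetical table contents, so that readers without FM can follow the logic.

```python
import uuid
import pandas as pd

# Parent table: each record receives a universally unique ID on entry.
samples = pd.DataFrame({
    "sample_uuid": [str(uuid.uuid4()) for _ in range(3)],
    "donor": ["D01", "D01", "D02"],
})

# Child table: inherits the parent's ID instead of re-entering sample data.
pcr_results = pd.DataFrame({
    "sample_uuid": samples["sample_uuid"],
    "band_size_bp": [410, 395, 402],
})

# Retrieving a sample's state at any workflow step is a join on the ID.
traced = samples.merge(pcr_results, on="sample_uuid")
```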

Structuring the data model in multiple, interrelated tables minimizes the storage of redundant data. Information about a particular sample state is entered into the database only once and linked to other related data points as needed. This helps reduce errors associated with redundant data entry. Additionally, it ensures data consistency across multiple experiments, as the same sample information can be reused across different projects and analyses.

For example, we keep the information on bacterial plating separate from picking. This allows picking a new bacterial colony without entering the same bacteria plate into the database again. Since bacteria plates are already linked to other sample information from previous wet lab steps (Fig 2, steps from FACS to TRFO), connecting a repicked colony to plating information automatically handles the connection to other data in the backend (S3 Fig).

Finally, our database provides relative flexibility, allowing the addition of new tables and seamless integration of experimental readouts. For example, the structure can be extended by adding results of functional antibody assays to the information on harvested antibodies and linking a new table through shared IDs (S4 Fig).

Further guidance

Design decisions that extend beyond building functional pipelines are worth considering, as they can enhance the efficiency, organization and reproducibility of existing ones. These include, for example, using configuration files and optimizing sample handling by processing in batches. Next, we discuss additional guidelines that can be applied to improve the performance and management of workflows beyond basic functionality.

Configuration files to reduce repetitive code.

Rooted in software development, the DRY (“don’t repeat yourself”) principle implies minimizing code duplication [26]. Parameters (e.g., file or directory paths or hardware-specific parameters) are often shared between modules, and one way to make them available is to create a configuration file that consolidates all defined parameters in a single location. Modules that require these parameters can import them from the configuration file by the unique variable name. There are several advantages to this approach. Not only is it consistent with good programming practices, but it also reduces errors and saves time, as modifications to parameters need to be made only in the configuration file, rather than in multiple locations throughout the code [26]. To minimize the possibility of variable duplication, we check whether a variable has already been defined in the configuration file before assigning it a value in a module. This way, we ensure that variables are used consistently, preventing unintended overwriting of data (see explanation and example in the passage below). Adhering to DRY principles makes the modules relatively resistant to modifications. All in all, using configuration files can streamline the process of adjusting parameters and settings across multiple modules.

Examples from the workflow.

For Python modules, we provide the configuration file (config.py) that contains file and directory paths, experimental parameters, spreadsheet metadata, and more. Each module imports only the configuration parameters required for its own run. For example, module 13_Repick.py imports the paths and metadata necessary for creating a list of bacterial colonies that are re-selected for further processing.
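As a minimal sketch of this pattern (parameter names are illustrative and do not reflect the actual contents of our config.py):

```python
# --- config.py: shared parameters defined exactly once (illustrative) ---
BAND_SIZE_THRESHOLD_BP = 400
BATCH_SIZE = 96
STORE_LIST_COLUMNS = ["sample_id", "plate_barcode", "well"]

# --- in a module: import only what this run needs ---
# import config
# batch_size = getattr(config, "BATCH_SIZE", 96)  # the config value wins if
#                                                 # defined, preventing silent
#                                                 # local overwrites
```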

To set up file and folder paths, we used the built-in Python os module. The paths are defined based on the module location when the code runs. For example, module 00_Ab_Traf.py starts by importing necessary paths from the configuration file and determines the location of folders and files relative to the current (working) directory of the module. This makes the entire pipeline independent of the operating system. Even though we usually run the code from a remote server, it can also run locally as long as the folder structure stays the same. The hierarchy and names of the directories can be adjusted by simply modifying the variables in the configuration file.
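A sketch of this path handling (the directory name is hypothetical):

```python
import os

# Resolve paths relative to this module's own location, not the caller's
# working directory, so the pipeline behaves the same on any OS and host.
MODULE_DIR = os.path.dirname(os.path.abspath(__file__))
OUT_DIR = os.path.join(MODULE_DIR, "00_Ab_Traf_Out")  # hypothetical layout
os.makedirs(OUT_DIR, exist_ok=True)
```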

Similarly, any changes to the metadata of intermediate or database files can be made in the configuration file, eliminating the need to modify code in multiple locations. As an example, modifying the metadata of the files required to prepare samples for the final PCR run (PCR3; Fig 2) can easily be done by updating the configuration file. This is useful, for instance, when device software updates change file formats.

A similar principle applies to Knime modules. A separate configuration file specifies file and folder paths for all local machines that have access to a particular Knime workflow. A custom Knime node reads the configuration file and adapts the path as necessary.

Additionally, adhering to the DRY principles, we have organized the utility functions (functions that can be used by different modules for similar tasks) into separate Python files. These files contain only function definitions, without any top-level executable code. This approach allows for greater organization, reusability, and maintainability of code. The function files reside in a designated folder and are organized thematically. For example, the file microbiology_supporters.py contains functions that assist with data processing in microbiology workflow steps, while the file plt_manipulators.py contains functions handling manipulations of the plate grids and processing samples in batches, among others.
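For instance, a thematic helper file might look like this (the function shown is a hypothetical example, not taken from plt_manipulators.py):

```python
# plt_manipulators.py -- plate-grid helpers shared across modules (sketch)

def well_to_row_col(well):
    """Convert a well label such as 'B07' to zero-based (row, column)."""
    return ord(well[0].upper()) - ord("A"), int(well[1:]) - 1

# A module then imports the helper instead of redefining it:
# from plt_manipulators import well_to_row_col
```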

Efficient sample batching.

Grouping samples together for batch processing is crucial for efficient use of resources (reagents, equipment or staff). Sample batching can save the overall time and cost of the production process while reducing variability between samples and improving quality control and scalability of the workflows.

The production of mABs from individual B-cell cDNA is a multistep process, with quality assessment being performed after each wet lab step. Samples that do not meet the quality criteria are excluded from further processing, and only successful samples are selected for the next step of the workflow. Since batches of samples for each wet lab step are constantly updated, it is essential to keep track of batches already pushed forward in the workflow and of batches still waiting for their turn.

In our workflow, wet lab experiments are conducted in batches of either 96 or 384 samples. To accommodate samples that cannot be processed due to limited space on the 96- or 384-well plate grid, we implemented store lists to keep track of the leftover samples. Store list files are updated every time new samples are advanced in the workflow, with priority given to older samples. This ensures that the samples that have been waiting the longest are processed first. The passage below highlights examples of how we track our samples via store lists, with links to GitHub.

Examples from the workflow.

As an example, the module 05_PCR3.py creates a store list that contains batches of samples for the final PCR step (PCR3; Fig 2). The module starts by loading a current store list file (updated previously by a Knime module) and automatically calculates the number of 96-sample batches that should be processed next in PCR3. Then, the batches of selected samples are pushed forward to the next workflow step and the leftover samples are saved as a new store list file.

Computing the number of sample batches automatically is beneficial, as it minimizes user input and reduces the risk of human error. Additionally, we use argparse, a Python library for parsing command-line arguments, to get the plate barcodes for wet lab experiments (PCR3, cELE and GiAs; Fig 2). The user inputs the latest plate numbers used at that workflow step as command-line arguments. The module parses these arguments and automatically assigns new barcodes to the current sample batches.
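The sketch below combines both ideas: full batches are computed from the store list, leftovers are written back, and the latest plate barcode is taken from the command line. File and column names are placeholders; the actual 05_PCR3.py differs in detail.

```python
import argparse
import pandas as pd

BATCH_SIZE = 96  # one 96-well plate per batch

parser = argparse.ArgumentParser(description="Batch samples for PCR3 (sketch).")
parser.add_argument("last_barcode", type=int,
                    help="number of the last plate barcode used at this step")
args = parser.parse_args()

store = pd.read_csv("store_list.csv")          # oldest samples listed first
n_batches = len(store) // BATCH_SIZE           # only full plates advance
pushed = store.iloc[:n_batches * BATCH_SIZE].copy()
pushed["plate_barcode"] = [args.last_barcode + 1 + i // BATCH_SIZE
                           for i in range(len(pushed))]
pushed.to_csv("pcr3_batches.csv", index=False)

# Leftovers become the new store list, keeping their oldest-first order.
store.iloc[n_batches * BATCH_SIZE:].to_csv("store_list.csv", index=False)
print(f"Pushed {n_batches} plates; {len(store) - len(pushed)} samples stored.")
```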

While our workflow is semi-automated and machine-dependent, similar principles can be applied to manual workflows. The number of samples to be processed in a single experiment can be adjusted as needed by modifying the size of a batch.

Getting feedback from modules.

Because detecting errors that arise during module execution in complex workflows can be challenging, the modules should provide feedback whenever possible. Getting feedback from modules refers to collecting information from each step of the workflow to assess its performance and identify potential issues as they occur [26]. Providing feedback enables collaborative usage of the pipeline, including by team members who may not be familiar with all individual operations performed by modules.

Feedback needs can often be met by user interface features within the software. Most software offers visual cues and pop-up messages to inform users about their actions. For example, FM provides import log files and notifications upon import execution. Knime allows real-time assessment of computing operations through icons indicating whether the run was successful. The result of each node's operation can also be displayed upon request, enabling interactive troubleshooting. Another feedback strategy to consider is input validation. FM can validate that IDs are unique, non-empty and unmodifiable, and returns messages on failed imports if these conditions are not met. Similar validation can be implemented in Python, where incorrect user input can trigger clear and informative error messages that show what went wrong and suggest possible solutions. Additionally, a good approach to automated and recurring feedback is saving custom files that include running parameters or any other information helpful for potential troubleshooting and documentation. As a minimum requirement, scripted modules should print simple status messages to orient users about the status of the run. Some examples of module feedback are given in the passage below, with respective links.

Examples from the workflow.

We have introduced printout messages for key steps in the Python modules.

For example, module 05_PCR3.py provides information on the total number of samples (and 96-well plates) to process in the final PCR step, the number of leftover samples saved as a store list file, together with the directory and filename of the store list. Furthermore, the argparse module is used in most Python scripts. It automatically generates help, usage and error messages, which prevents incorrect input from being passed during the run.
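As a sketch of such informative input validation (the accepted plate formats and messages are illustrative, not the pipeline's actual checks):

```python
VALID_PLATE_FORMATS = {96, 384}  # illustrative set

def read_plate_format(raw):
    """Validate user input and fail with an actionable message."""
    try:
        fmt = int(raw)
    except ValueError:
        raise ValueError(f"Plate format '{raw}' is not a number; expected "
                         f"one of {sorted(VALID_PLATE_FORMATS)}.") from None
    if fmt not in VALID_PLATE_FORMATS:
        raise ValueError(f"Unsupported plate format {fmt}; choose from "
                         f"{sorted(VALID_PLATE_FORMATS)}.")
    return fmt
```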

For Knime modules, we implemented self-documenting features, where a custom file is generated during the module run. This contains metadata such as run parameters, input and output files, user data, timestamps, or messages. Together with the intermediate files saved at workflow steps, self-documentation features provide frequent feedback on the progress of data processing steps.
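Our Knime modules generate these files through custom nodes; the sketch below shows the same self-documentation idea in Python (field names and file names are illustrative):

```python
import json
import getpass
from datetime import datetime

def write_run_log(log_path, inputs, outputs, params):
    """Save run metadata next to the outputs for later troubleshooting."""
    record = {
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "user": getpass.getuser(),
        "input_files": inputs,
        "output_files": outputs,
        "parameters": params,
    }
    with open(log_path, "w") as fh:
        json.dump(record, fh, indent=2)

write_run_log("run_log.json", ["store_list.csv"], ["pcr3_batches.csv"],
              {"batch_size": 96})
```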

Organizing the pipeline.

As a final step to maintain the pipeline organization, we recommend introducing a system that is user-friendly and not too complex, so that users are encouraged to adhere to continuous documentation. Modules that are part of the main computational pipeline should be separated from those that represent work in progress. Regular inspections and a simple cleaning strategy for modules can ensure workflow maintenance with minimal effort [25].

A manual provided for each module in the pipeline is essential to ensure proper usage and minimize the likelihood of errors due to incorrect execution. Ideally, all factors that could affect the pipeline execution should be documented. This allows for future replication of the analysis. In addition to a well-defined folder structure for hosting the modules, we recommend implementing a Version Control System (VCS) to track modifications to modules that are part of the main pipeline. Finally, adhering to a comprehensive documentation scheme aids in the maintenance of the workflow.

Writing detailed manuals, documenting module logic, and implementing version control require discipline, but contribute to efficient maintenance of complex pipelines (see examples from our workflow below). Even routine operations can require consulting the documentation. In our experience, the effort of creating manuals and adhering to version control pays off in sustaining the pipeline.

Examples from the workflow.

In our workflow, distinguishing between the main and supporting Python modules is achieved by keeping a coherent naming convention and a shared directory. Python modules have names starting with a sequential, two-digit number. The outputs of module runs are kept in separate directories, suffixed with _Out. Git is used as VCS to track changes to the main Python modules, allowing previous versions of modules to be restored if needed. Manuals for each workflow step are also tracked by VCS.

We keep a documentation strategy for wet lab workflow and database imports by using Electronic Lab Notebook (ELN) software. Wet lab experiments, data imports, and modifications of database structure are recorded as separate entries in the ELN, which serves as a reliable record-keeping tool. Finally, we emphasize comprehensive code documentation, recognizing that code is read much more often than it is written [28]. This approach helps us create well-documented modules.

Proof of Concept – Tailored data processing pipeline and database for automated stem cell culture

To showcase the versatility of our design, we applied the design principles to develop a data processing pipeline and a database for automated stem cell culture (ASCC). Recently, our lab has established an automated cell culture platform that integrates a robotic liquid handling workstation for cultivation and differentiation of human induced pluripotent stem cells (hiPSCs). The hiPSCs are expanded and differentiated into brain microvascular endothelial cells (BMECs) for generation of an in vitro blood-brain barrier (BBB) model [34]. Mature BMECs are seeded on TransWell plates for a 2D permeability BBB model. Trans-endothelial electrical resistance (TEER) is then measured to assess the integrity of the barrier (for a detailed protocol, refer to Fengler et al. [34]). BBB models generated at high-throughput scale with close-to-physiological characteristics can facilitate the screening of BBB-penetrating drugs, aiding the development of targeted drug delivery systems for neurological disorders [35–37].

By applying the workflow design principles outlined above, we developed a proof-of-concept pipeline focused on hiPSC differentiation into BMECs that is flexible and can be expanded to the generation of other hiPSC-derived cell types, including astrocytes, neurons, microglia, monocytes and more. The pipeline consists of Python scripts, an FM database and FM scripts. Below, we discuss our design process and how it aligns with the design principles showcased above.

Getting started – ASCC.

Firstly, we started the design by outlining:

  1. The wet lab experiments (cycles of thawing, seeding and harvest of cells). Refer to S5 Fig for the wet lab steps of the ASCC workflow.

  2. The methods to document metadata, such as: i) hiPSCs batch information provided by a supplier; ii) culture conditions (including medium batch information and supplements); iii) the experiments’ timestamps; iv) quality control data (such as cell viability assays and cell counts); v) equipment settings (parameters controlled by automation); and vi) user annotations.

  3. The repetitions or pausing steps in the workflow, i.e., interruptions of the differentiation process, including freezing of cells on differentiation stages for future experiments or on expansion stages to allow differentiation into other cell types.

  4. The methods of documenting the workflow for human supervision and potential error detection.

Translating tasks to modules – ASCC.

We then translated tasks to modules by following the wet lab criteria, making sure that each module performs a singular task and has little dependency on other modules (other than the logical dependency that results from the hierarchical nature of the workflow). Modules execute the following tasks: i) analyzing the experimental readouts (machine- or manually-generated), ii) parsing metadata, and iii) generating an import file for database storage.

During expansion and differentiation, iPSCs undergo seeding and harvest cycles across various plate formats (for example, 4-, 6- or 12-well plate grids). Cell count and viability assays are performed on each harvest cycle or whenever an assessment of cell confluency is required. Module 00_cellcount_parser.py reads the plate format from the user's input and processes the file with cell count and viability information (generated by the automated cell viability analyzer, ViCell, Beckman Coulter). During the entire process of cultivation, cells are imaged daily for quality control. Module 01_img_parser.py parses the images generated by a confocal microscope and creates a .json file with metadata of all images linked to a given plate. As final workflow steps, cells are seeded on TransWell plates for TEER measurements. Module 02_teer_parser.py parses the experimental readouts and creates an import file for database storage.
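As an illustration of the image-parsing step, the sketch below collects per-plate image metadata into a single .json file; the file pattern and metadata fields are hypothetical, not those of 01_img_parser.py.

```python
import json
from pathlib import Path

def collect_image_metadata(plate_dir, plate_barcode):
    """Link all images of one plate to its barcode in a single .json file."""
    images = sorted(Path(plate_dir).glob("*.tif"))   # hypothetical extension
    metadata = {
        "plate_barcode": plate_barcode,
        "n_images": len(images),
        "images": [{"file": img.name, "size_bytes": img.stat().st_size}
                   for img in images],
    }
    out_path = Path(plate_dir) / f"{plate_barcode}_images.json"
    with open(out_path, "w") as fh:
        json.dump(metadata, fh, indent=2)
```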

We also implemented a configuration file (config.py) to store file and directory paths, spreadsheet metadata and configuration parameters. The utility functions are grouped thematically: cellcount_img_tools.py handles .txt files generated by ViCell and input_readers.py parses the user input.

Allocating separate computational space to modules – ASCC.

Next, we created a designated computational space for modules, input files (including metadata, experimental readouts and files exported from the database) and output files, with an automated backup rotation scheme of daily, monthly and yearly backups. We constructed designated directories for database export files for cell count and image parsing. Confocal microscopy images (per plate) and files generated by the automated cell viability analyzer are kept in individual folders. Scripts, .json files (https://github.com/Malwoiniak/AutomatedStemCellCulture/tree/668dfe5766f0cc62d0344a15c41a89a486d20108/JSON) and output files generated at each workflow step also have separate directories.

Defining input and output files – ASCC.

Further, we defined input and output files to share data between modules. Depending on the plate format and whether cells are in expansion or differentiation stage, the modules access different input files. For example, in an early expansion stage, the module 00_cellcount_parser.py takes up the database export file (automatically generated by a FM script) and the experimental readout file to generate an import file for database storage (saved with a timestamp). To facilitate the automatic imports to the database, records processed at the time are saved as separate files, updated after each script run. Modules 01_img_parser.py and 02_teer_parser.py follow similar logic when processing confocal microscopy images and TEER measurement files, respectively.
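A sketch of the timestamped-import convention (the naming scheme is illustrative):

```python
from datetime import datetime
import pandas as pd

def save_db_import(df, step, out_dir):
    """Write a database import file stamped with the processing time."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = f"{out_dir}/{step}_import_{stamp}.csv"
    df.to_csv(path, index=False)
    return path
```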

Dedicated database – ASCC.

While conceptualizing the ASCC database structure, we considered several factors:

  1. The information should not be redundant, i.e., it should only be entered in the database once. If the same information is entered more than once, there is probably a need to restructure the database and keep the redundant information in a separate table. For example, the information on the culture medium batch is registered only once and stored in a distinct table. Upon a medium change event, this information is populated by accessing the ID of the medium batch and adding it to the table storing the plate information (S6 Fig (A)).

  2. The tables are organized following the thawing, seeding, harvesting, or freezing events, as they imply the initiation of a new process: e.g., change of the barcode or plate format, storing frozen cells in tanks, or collecting metadata. For example, the information on harvesting differentiated BMECs from 6-well plates by pooling and seeding on multiple 12-well TransWell plates is kept in separate tables linked through a one-to-many relationship. This ensures that any manipulations performed with the 6-well plate (medium change or cell count) are independent from those performed with the 12-well TransWell plates (TEER measurements or a well treatment). In addition, no redundant information is entered in the database, i.e., only the IDs of the 6-well plates are populated in the TransWell plate table (S6 Fig (B)).

  3. We also implemented separate tables for the different hiPSC differentiation methods, each requiring distinct processes. Although the current ASCC database assumes BMEC differentiation, other cell types (e.g., astrocytes) will be considered in the future. In the current structure, new tables for distinct cell differentiations can easily be linked to existing ones (S7 Fig).

Discussion

The ability to document, process and analyze large datasets to identify new patterns and relationships has become increasingly important in modern biology. In the evolving landscape of biology research, it is not only large, specialized laboratories that are embracing HT/HC techniques, but smaller biology labs are also progressively implementing similar methods. However, dealing with data generated in HT/HC schemes poses challenges related to sample backtracking, quality control, record keeping, and data curation and storage, among others. This can lead to compromised data integrity, potential data loss, and ineffective workflows. These challenges can be addressed by implementing modular data processing pipelines, which help to make substantial progress toward optimizing data management practices.

We employed the computational pipeline for data processing in a complex, semi-automated, multistep workflow for the production of mABs to tackle similar challenges and monitor all workflow steps, ultimately aiming to enhance data governance in our experimental setup. The design principles presented here can serve as guidance for the development of data processing pipelines in biology. The versatility of our approach allows for its application to diverse biological problems concerned with intensive data collection activities in a variety of settings, in which data undergoes continuous processing, analysis and modification before reaching the endpoint.

We showed how our design minimizes reliance on error-prone and resource-intensive manual data handling, significantly reducing errors and saving both time and resources. The modular nature of the design provides the flexibility to handle samples in high- and low-throughput settings. While the modules are designed to function in isolation, we combine them into a custom pipeline that operates on the mAB production workflow.
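
A sketch of this composition principle, under the assumption that each module is a function over a working directory, follows; the step functions are hypothetical stand-ins for the published parser scripts.

```python
# Illustrative chaining of stand-alone modules into one pipeline run; the
# step functions are placeholders for the logic of the parser scripts.
from pathlib import Path
from typing import Callable, Iterable

Step = Callable[[Path], Path]

def parse_cellcounts(workdir: Path) -> Path:
    return workdir / "cellcount_import.csv"   # stand-in for 00_cellcount_parser.py

def parse_images(workdir: Path) -> Path:
    return workdir / "img_import.csv"         # stand-in for 01_img_parser.py

def parse_teer(workdir: Path) -> Path:
    return workdir / "teer_import.csv"        # stand-in for 02_teer_parser.py

def run_pipeline(workdir: Path, steps: Iterable[Step]) -> list[Path]:
    """Run any subset of modules in order; each step returns its import file."""
    return [step(workdir) for step in steps]

# Full pipeline, or a single module in isolation:
run_pipeline(Path("runs/2024-01-15"), [parse_cellcounts, parse_images, parse_teer])
run_pipeline(Path("runs/2024-01-15"), [parse_teer])
```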

Our approach also helps streamline data processing and documentation, leading to faster generation of insights (and freeing time for other tasks), improved data reusability and seamless data exchange. Additionally, we showed how deploying a dedicated database facilitates the standardization of data processing procedures. By implementing a tailored database, we established consistent and structured approaches for storing and organizing data, allowing us to efficiently track samples throughout the workflow. Overall, adhering to the design principles outlined in this work can enhance the accuracy and reproducibility of data analysis by combining standardization, documentation, scalability, and error detection. Furthermore, well-structured data and metadata facilitate the use of machine learning and AI-based tools, which will aid quality control and process optimization and enable new analysis approaches [38].

Although our modules are tailored to mAB production, we foresee that our approach can support other biologists in building their own small-scale data processing pipelines for individual use. This approach differs from large-scale pipelines developed by bioinformaticians to manage massive volumes of data [39]. We therefore focused our design efforts on feasibility for biologists with limited programming experience by employing software with graphical user interfaces, such as KNIME and FM, alongside Python. With some modifications, the individual modules presented here can be adapted by other biologists and incorporated into their own workflows. To that end, we provide a GitHub repository containing mock-up data and a simplified pipeline, together with detailed instructions on how to download, install, and run it. Furthermore, we demonstrate how the code can be customized to accommodate data processing routines based on individual requirements. This can serve as a starting point for experimental researchers constructing their own pipelines by reusing and modifying the provided code.
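
As one hedged example of such customization (the column names below are invented, not taken from the repository), adapting a parser to a new instrument can be as small as editing a column mapping:

```python
# Hypothetical customization point: map instrument-specific export headers
# to the pipeline's canonical column names. Only COLUMN_MAP needs editing.
COLUMN_MAP = {
    "Barcode": "Plate barcode",     # canonical name -> header in your export
    "Count":   "Viable cells/mL",
}

def normalize(row: dict) -> dict:
    """Rename instrument-specific columns to the canonical pipeline names."""
    return {canonical: row[source] for canonical, source in COLUMN_MAP.items()}

print(normalize({"Plate barcode": "P001", "Viable cells/mL": "1.2e6"}))
# -> {'Barcode': 'P001', 'Count': '1.2e6'}
```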

To test the applicability of our design to other settings, we applied the design principles to automated hiPSC culture, demonstrating that the framework can be adapted to stem cell-related research. In the long run, we expect this approach to benefit ASCC, as it can help standardize and automate relevant data storage and facilitate data-centric conclusions that yield novel insights. Furthermore, it can support informed troubleshooting decisions and data-driven process optimization through systematic analysis with artificial intelligence or “design of experiment” approaches. Pipelines for continuous data processing and analysis are now essential in domains such as multi-omics data integration [40, 41], HT/HC screening [42], data-driven modeling [43], long-term environmental monitoring [44], evolutionary biology [45–47] and plant phenotyping [48], among others. We believe that the design recommendations proposed here can find their target audience and serve as a source of inspiration for other researchers developing their own data processing modules.

In the context of data-intensive science, there is a growing demand from various stakeholders, such as the scientific community, funding agencies, publishers and industry, for data to meet the standards of being Findable, Accessible, Interoperable, and Reusable (FAIR) over the long term. This also concerns research processes beyond data, including analytical workflows and data processing pipelines. Many projects have already adopted different elements of the FAIR principles in their data (and non-data) repositories [49]. Our data processing pipelines align with the FAIR principles in several ways. First, we satisfy the findability facet of FAIR by making the mAB and the ASCC pipelines publicly available via GitHub and assigning a persistent identifier. Clear documentation provides the information necessary to understand the design decisions, data and processes involved in running the pipelines. To satisfy the accessibility facet, we include contact information and apply an open access policy to the code, granting unrestricted access to both the mock-up data and the pipeline modules. Standardized file formats and clear naming conventions throughout the pipelines aid interoperability by keeping data representation consistent and organized. Moreover, using standardized data formats enables compatibility and facilitates data exchange across pipelines. To account for the reusability aspect of FAIR, we designed modular pipelines, emphasizing the granularity of the modules. Published documentation defines how the modules work and how they interact with input and output data. Finally, adhering to the outlined principles for data processing may foster compliance with the FAIR principles during publication, ensuring that both data and metadata are well-structured and minimizing the effort needed to achieve accessibility and findability upon publication.
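
For illustration, a naming convention of the kind described might look like the sketch below; the exact pattern (module name, barcode, timestamp) is an assumption, not the repository's documented convention.

```python
# Illustrative helper for consistent, machine-parsable output file names;
# the module_barcode_timestamp pattern is an assumption for this sketch.
from datetime import datetime

def output_name(module: str, barcode: str, ext: str = "csv") -> str:
    """Compose module name, plate barcode and timestamp into one file name."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{module}_{barcode}_{stamp}.{ext}"

print(output_name("02_teer_parser", "TW0042"))
# e.g. "02_teer_parser_TW0042_20250101_120000.csv"
```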

Limitations

One limitation of our system is its partial reliance on licensed software, such as FM and GitLab. Nevertheless, open-source alternatives can be employed instead. For example, LibreOffice Base (LibreOffice Base, The Document Foundation) is an open-source, no-code database management tool that offers functionality similar to that of FM, making it a suitable option for researchers with little to no programming experience. Alternatively, if team members have some knowledge of SQL, they can consider options like MySQL (MySQL, ORACLE) for building and managing databases. Another interesting alternative for transforming existing, code-based databases into interactive applications is NocoDB (NocoDB, NocoDB) – an open-source, no-code platform that turns databases into spreadsheets with intuitive interfaces, enabling teams to create no-code applications. It supports MySQL, SQLite and PostgreSQL databases, among others, but also provides the functionality to build databases from scratch. Although we use GitLab Enterprise (GitLab, GitLab Inc.) as a VCS to host the pipeline, the free GitLab version offers the essential features for individual users and provides enough functionality to implement data processing pipelines (despite some storage and transfer limits). Alternatively, GitHub (GitHub, GitHub Inc.) can be used as a VCS, as it has traditionally been more widely recognized and utilized in the developer community due to its extensive user base and integration options.

Labs can also consider existing Laboratory Information Management Systems (LIMS) when automating their workflows to efficiently handle samples and associated data. LIMS are software applications used to streamline laboratory operations, sample tracking, data management and reporting, offering integration with various lab instruments [50]. While our system provides efficient workflow automation, it differs from LIMS in scope. Although LIMS provide several advantages over custom-designed pipelines, investing in a full-fledged commercial LIMS can be expensive, and, because the goal of a LIMS is to create a centralized system that manages all laboratory activities and data, specifying its requirements takes substantial effort. Unlike LIMS, which are lab-centric and require extensive customization to meet each lab’s specific needs, our system focuses on streamlining specific aspects of data management and analysis, making it more flexible but less comprehensive for overall lab management.

Finally, our approach is project- or routine-centric rather than lab-centric, which can be considered a limitation given the scale at which its application becomes advantageous. The maintenance and initial integration of each data processing module require significant effort. This effort is justified only if manual data processing time is substantially reduced through repeated executions or if structured data processing is essential, such as for leveraging artificial intelligence approaches. While some of the proposed design principles are also applied in big data pipelines that run entirely in silico, our approach is dedicated to iterative data processing for wet lab experimental steps and does not scale to millions of data points per day.

Conclusions

Our system introduces significant advancements over metadata handling systems tailored to specific fields, such as neuroimaging, omics, or microscopy data, as well as over systems that depend heavily on expert informatics knowledge for metadata processing. Designed with the FAIR principles in mind, our custom system optimizes workflow modules for reuse and seamless integration. It is cost-effective, can be tailored to specific needs, and allows rapid, iterative prototyping and experimenting with different solutions. The established modules can be complemented and expanded organically, adapting to the changing requirements of individual projects. Additionally, its low-code, user-friendly design allows faster onboarding of biologists with varying technical expertise. By reducing the need for deep expert knowledge, our approach democratizes discussions and agreements on data and metadata governance within project groups or labs. Furthermore, with a custom pipeline, labs retain full control and ownership of the process, allowing flexibility in response to changing needs and providing protective measures for the laboratory’s intellectual property. Finally, our focus on standardized metadata management enhances data contextualization, reuse and accessibility, particularly for machine learning and artificial intelligence algorithms.

Supporting information

S1 Fig. Cloning strategy for variable antibody regions.

After cDNA generation, three parallel PCR reactions are performed to amplify the heavy chain variable region (light blue) and either the Kappa (light green) or Lambda (pink) variable region. Upon verification by sequencing, the variable regions (V) are cloned into plasmids encoding the constant part (C) of the respective antibody chain (dark blue: heavy constant part; dark green: Kappa light constant part; purple: Lambda light constant part). See Fig 2 for a comprehensive overview. For an overview on B-cell receptor and antibody variability, refer to Fig 1 by Khatri et al. [51] and Fig 1 by Mikocziova et al. [52].

(DOCX)

S2 Fig. An excerpt from the database structure explaining its relational aspects.

Table C stores information on isolated plasmids of paired antibody chains (heavy and light), and Table D is a repository of plasmid aliquots stored as safe stocks in separate (physical) locations for future reproduction of plasmids or downstream applications. Table Ab_2 connects the information on plasmid pairs with subsequent transfection of HEK cells. Fields (i.e., columns) Working_Stock_ID, Safe_Stock_ID and Transfection_ID store unique IDs of each record per table. Fields starting with FK_ are records’ IDs inherited from related tables. Records (sample states) are connected through one-to-many relationships in the database structure; for example, a plasmid from Table C can be used for creation of a safe stock aliquot (in Table D) multiple times, while one aliquot of safe stock (in Table D) is associated with exactly one plasmid from Table C. Similarly, one transfected HEK cell sample (i.e., transfection well, in Table Ab_1) is associated with exactly two plasmids (a pair of heavy and light chain plasmids, in Table C), while this plasmid pair can be used for multiple transfections.

(DOCX)

S3 Fig. Simplified view of tables storing bacteria plate information (upper, green) and bacterial colony picking information (lower, blue).

Picked bacterial colonies can be connected to already imported bacteria plate information through the identifiers (Plating_ID/FK_Plating_ID), reducing the redundancy of stored data. The identifier Plating_ID is unique in the bacteria plating table but not in the colony picking table, allowing for picking of multiple colonies from the same bacteria plate (one-to-many relationship in the database structure). The colony picking table has its own unique identifier (Picking_ID).

(DOCX)

S4 Fig. A hypothetical extension of database structure by table addition.

The flexibility of the database design enables smooth integration of new information (e.g., experimental readouts), which allows for efficient management of diverse datasets within the database. Here, the information on two hypothetical functional antibody assays (AB_Assay1 and AB_Assay2) are appended to the information on the harvested antibody through the AB_ID identifier, thus linking to any previous information on that sample (starting from the B-cell donor/patient).

(DOCX)

S5 Fig. Wet lab steps of ASCC workflow: thawing, expansion and differentiation of iPSCs into BMECs in a functional BBB model.

Wet lab steps and timepoints (relative to the start of the differentiation – day 0: D0) are indicated by blue rectangles: cycles of thawing (D-6), optional freezing (D-1, D6, D8), and harvesting and seeding (D-3, D8) of cells. Count and viability assays are carried out on D-3, D0 and D8. Media used at each timepoint are indicated by ellipses – gray: mTeSR Plus medium with/without ROCK inhibitor (RI); violet: Unconditioned Medium (UM); green: Endothelial Cell Medium with/without supplements (EC+/+ and EC-/-, respectively). The TEER measurement timepoints (D10, D11) are indicated by a purple circle. For a detailed protocol, refer to Fengler et al. [34].

(DOCX)

S6 Fig. Conceptualizing the ASCC database.

A) No-redundancy principle. The information on each medium batch is stored in the database only once (upper table, green headers). Whenever a media change is performed, the unique identifier of the medium batch concerned (ID_medium_INFO) is fetched on the backend by a FileMaker script and populated in the media change table (lower table, red headers). Information in the columns ::Medium Name, ::Composition, and ::Mix Date is fetched from the medium table (upper table, green headers); information in the column ::Barcode is fetched from another table (not shown). B) Example of the harvest event that guided the design of the database structure: pooling differentiated cells from a 6-well plate and seeding them on a 12-well TransWell plate. At this workflow stage, cells could also be cryo-stored for future experiments. Implementing a one-to-many relationship between 6-well and 12-well TransWell plates helps avoid storing redundant information.

(DOCX)

S7 Fig. Conceptualizing the ASCC database – continued.

Possible expansion of the database structure towards differentiation of other cell types (such as astrocytes, neurons, microglia, monocytes; red tables). The structure is flexible and allows new tables to be appended, through a one-to-many relationship, to the 6-well plate table (green) that stores information on undifferentiated cells. In this way, different differentiation protocols can easily be accommodated in the database.

(DOCX)

S1 Table. Primers used for the amplification of heavy, light Kappa (κ) and light Lambda (λ) chains.

Three PCR reactions are performed per chain. PCR1 and PCR2 are performed with forward and reverse primer mixes. PCR3 is performed with specific primer pairs selected from 69 individual primers. The selection is based on the sequencing results of the second PCR amplification and the identification of the V(D)J alleles used. See the attached Table S1.xlsx file. Refer also to Fig 2.

(XLSX)


Acknowledgments

We thank the Helmholtz Association for funding HIL-A03. All figures were created with BioRender.com.

Data Availability

The datasets generated and/or analysed during the current study are available in the following Zenodo repositories, https://doi.org/10.5281/zenodo.8229164 and https://zenodo.org/records/10106688.

Funding Statement

We thank the Helmholtz Association for funding HIL-A03.

References

1. Dueñas ME, Peltier-Heap RE, Leveridge M, Annan RS, Büttner FH, Trost M. Advances in high-throughput mass spectrometry in drug discovery. EMBO Mol Med. 2023;15(1):e14850. doi: 10.15252/emmm.202114850
2. Shen X, Zhao Y, Wang Z, Shi Q. Recent advances in high-throughput single-cell transcriptomics and spatial transcriptomics. Lab Chip. 2022;22(24):4774–91.
3. Jia Q, Chu H, Jin Z, Long H, Zhu B. High-throughput single-cell sequencing in cancer research. Signal Transduct Target Ther. 2022;7(1):145. doi: 10.1038/s41392-022-00990-4
4. Wratten L, Wilm A, Göke J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods. 2021;18(10):1161–8. doi: 10.1038/s41592-021-01254-9
5. Steel H, Habgood R, Kelly CL, Papachristodoulou A. In situ characterisation and manipulation of biological systems with Chi.Bio. PLoS Biol. 2020;18(7):e3000794.
6. Alpern D, Gardeux V, Russeil J, Mangeat B, Meireles-Filho ACA, Breysse R, et al. BRB-seq: ultra-affordable high-throughput transcriptomics enabled by bulk RNA barcoding and sequencing. Genome Biol. 2019;20(1):71. doi: 10.1186/s13059-019-1671-x
7. DFG, German Research Foundation. DFG Consolidates the Impetus of its High-Throughput Sequencing Funding Initiative [Internet]. 2022 [cited 2023 Mar 21]. Available from: https://www.dfg.de/en/service/press/press_releases/2022/press_release_no_11/index.html
8. Nature Methods. Nature methods: aims & scope [Internet]. 2022 [cited 2023 Mar 21]. Available from: https://www.nature.com/nmeth/aims
9. National Institutes of Health. Common Fund High-Risk, High-Reward Research Program [Internet]. 2022 [cited 2023 Mar 21]. Available from: https://commonfund.nih.gov/highrisk
10. European Commission. Breakthrough Innovation Programme for a Pan-European Detection and Imaging Eco-System – Phase-2 [Internet]. 2021 [cited 2023 Mar 21]. Available from: https://cordis.europa.eu/project/id/101004462
11. Gonçalves RS, Musen MA. The variable quality of metadata about biological samples used in biomedical experiments. Sci Data. 2019;6:190021. doi: 10.1038/sdata.2019.21
12. Birkland A, Yona G. BIOZON: a system for unification, management and analysis of heterogeneous biological data. BMC Bioinformatics. 2006;7:70. doi: 10.1186/1471-2105-7-70
13. Wittig U, Rey M, Weidemann A, Müller W. Data management and data enrichment for systems biology projects. J Biotechnol. 2017;261:229–37.
14. Priyanka GN, Abusalah MAH, Chopra H, Sharma A, Mustafa SA, Choudhary OP, et al. Nanovaccines: a game changing approach in the fight against infectious diseases. Biomed Pharmacother. 2023;167:115597. doi: 10.1016/j.biopha.2023.115597
15. Blersch J, Kurkowsky B, Meyer-Berhorn A, Grabowska AK, Feidt E, Junglas E. An ex vivo human model for safety assessment of immunotoxicity of engineered nanomaterials. BioRxiv. 2023.
16. Singh AV, Romeo A, Scott K, Wagener S, Leibrock L, Laux P, et al. Emerging technologies for in vitro inhalation toxicology. Adv Healthc Mater. 2021;10(18):e2100633. doi: 10.1002/adhm.202100633
17. Singh AV, Chandrasekar V, Paudel N, Laux P, Luch A, Gemmati D, et al. Integrative toxicogenomics: advancing precision medicine and toxicology through artificial intelligence and OMICs technology. Biomed Pharmacother. 2023;163:114784. doi: 10.1016/j.biopha.2023.114784
18. Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten simple rules for reproducible computational research. PLoS Comput Biol. 2013;9(10):e1003285. doi: 10.1371/journal.pcbi.1003285
19. Kuhn Cuellar L, Friedrich A, Gabernet G, de la Garza L, Fillinger S, Seyboldt A. A data management infrastructure for the integration of imaging and omics data in life sciences. BMC Bioinformatics. 2022;23(1):61.
20. Tiller T, Meffre E, Yurasov S, Tsuiji M, Nussenzweig MC, Wardemann H. Efficient generation of monoclonal antibodies from single human B cells by single cell RT-PCR and expression vector cloning. J Immunol Methods. 2008;329(1–2):112–24.
21. Kreye J, Wenke NK, Chayka M, Leubner J, Murugan R, Maier N, et al. Human cerebrospinal fluid monoclonal N-methyl-D-aspartate receptor autoantibodies are sufficient for encephalitis pathogenesis. Brain. 2016;139(Pt 10):2641–52. doi: 10.1093/brain/aww208
22. Gieselmann L, Kreer C, Ercanoglu MS, Lehnen N, Zehner M, Schommers P, et al. Effective high-throughput isolation of fully human antibodies targeting infectious pathogens. Nat Protoc. 2021;16(7):3639–71. doi: 10.1038/s41596-021-00554-w
23. Reincke SM, Prüss H, Kreye J. Brain antibody sequence evaluation (BASE): an easy-to-use software for complete data analysis in single cell immunoglobulin cloning. BMC Bioinformatics. 2020;21(1):446. doi: 10.1186/s12859-020-03741-w
24. Boyd SD, Gaëta BA, Jackson KJ, Fire AZ, Marshall EL, Merker JD, et al. Individual variation in the germline Ig gene repertoire inferred from variable region gene rearrangements. J Immunol. 2010;184(12):6986–92. doi: 10.4049/jimmunol.1000445
25. Collins AM, Watson CT. Immunoglobulin light chain gene rearrangements, receptor editing and the development of a self-tolerant antibody repertoire. Front Immunol. 2018;9:2249. doi: 10.3389/fimmu.2018.02249
26. van Vliet M. Seven quick tips for analysis scripts in neuroimaging. PLoS Comput Biol. 2020;16(3):e1007358. doi: 10.1371/journal.pcbi.1007358
27. Rowlett J, Karlsson CJ, Nursultanov M. Diversity strengthens competing teams. R Soc Open Sci. 2022;9(8):211916.
28. Yang Y, Tian TY, Woodruff TK, Jones BF, Uzzi B. Gender-diverse teams produce more novel and higher-impact scientific ideas. Proc Natl Acad Sci USA. 2022;119(36):e2200841119. doi: 10.1073/pnas.2200841119
29. Sufi F. Algorithms in Low-Code-No-Code for research applications: a practical review. Algorithms. 2023;16(2):108. doi: 10.3390/a16020108
30. Claris FileMaker – Tackle any task [Internet]. [cited 2023 Mar 28]. Available from: https://www.claris.com/filemaker/
31. O’Leary T, Weiss J, Toll B, Brandt C, Bernstein SL. Automated generation of CONSORT diagrams using relational database software. Appl Clin Inform. 2019;10(1):60–5.
32. Conforti A, Marci R, Moustafa M, Tsibanakos I, Krishnamurthy G, Alviggi C, et al. Surgery and out-patient data collection and reporting using Filemaker Pro. Eur Rev Med Pharmacol Sci. 2018;22(10):2918–22. doi: 10.26355/eurrev_201805_15045
33. Ruth CJ, Huey SL, Krisher JT, Fothergill A, Gannon BM, Jones CE, et al. An electronic data capture framework (ConnEDCt) for global and public health research: design and implementation. J Med Internet Res. 2020;22(8):e18580. doi: 10.2196/18580
34. Fengler S, Kurkowsky B, Kaushalya SK, Roth W, Fava E, Denner P. Human iPSC-derived brain endothelial microvessels in a multi-well format enable permeability screens of anti-inflammatory drugs. Biomaterials. 2022;286:121525. doi: 10.1016/j.biomaterials.2022.121525
35. Sharma A, Fernandes DC, Reis RL, Gołubczyk D, Neumann S, Lukomska B, et al. Cutting-edge advances in modeling the blood-brain barrier and tools for its reversible permeabilization for enhanced drug delivery into the brain. Cell Biosci. 2023;13(1):137. doi: 10.1186/s13578-023-01079-3
36. Ansari MY, Chandrasekar V, Singh AV, Dakua SP. Re-Routing drugs to blood brain barrier: a comprehensive analysis of machine learning approaches with fingerprint amalgamation and data balancing. IEEE Access. 2023;11:9890–906. doi: 10.1109/access.2022.3233110
37. Dhage PA, Sharbidre AA, Dakua SP, Balakrishnan S. Leveraging hallmark Alzheimer’s molecular targets using phytoconstituents: current perspective and emerging trends. Biomed Pharmacother. 2021;139:111634. doi: 10.1016/j.biopha.2021.111634
38. Wossnig L, Furtmann N, Buchanan A, Kumar S, Greiff V. Best practices for machine learning in antibody discovery and development. Drug Discov Today. 2024;29(7):104025. doi: 10.1016/j.drudis.2024.104025
39. Djaffardjy M, Marchment G, Sebe C, Blanchet R, Bellajhame K, Gaignard A, et al. Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems. Comput Struct Biotechnol J. 2023;21:2075–85.
40. Planell N, Lagani V, Sebastian-Leon P, van der Kloet F, Ewing E, Karathanasis N. STATegra: multi-omics data integration - a conceptual scheme with a bioinformatics pipeline. Front Genet. 2021;12:620453.
41. Conesa A, Beck S. Making multi-omics data accessible to researchers. Sci Data. 2019;6(1):251. doi: 10.1038/s41597-019-0258-4
42. Cohen-Boulakia S, Belhajjame K, Collin O, Chopard J, Froidevaux C, Gaignard A. Scientific workflows for computational reproducibility in the life sciences: status, challenges and opportunities. Future Gener Comput Syst. 2017;75:284–98.
43. Minnich AJ, McLoughlin K, Tse M, Deng J, Weber A, Murad N. AMPL: a data-driven modeling pipeline for drug discovery. J Chem Inf Model. 2020;60(4):1955–68.
44. Yenni GM, Christensen EM, Bledsoe EK, Supp SR, Diaz RM, White EP, et al. Developing a modern data workflow for regularly updated data. PLoS Biol. 2019;17(1):e3000125. doi: 10.1371/journal.pbio.3000125
45. Lürig MD. Phenopype: a phenotyping pipeline for Python. Methods Ecol Evol. 2021.
46. Eisen KE, Powers JM, Raguso RA, Campbell DR. An analytical pipeline to support robust research on the ecology, evolution, and function of floral volatiles. Front Ecol Evol. 2022;10.
47. Ebmeyer S, Coertze RD, Berglund F, Kristiansson E, Larsson DGJ. GEnView: a gene-centric, phylogeny-based comparative genomics pipeline for bacterial genomes and plasmids. Bioinformatics. 2022;38(6):1727–8.
48. Kar S, Garin V, Kholová J, Vadez V, Durbha SS, Tanaka R. SpaTemHTP: a data analysis pipeline for efficient processing and utilization of temporal high-throughput phenotyping data. Front Plant Sci. 2020;11:552509.
49. Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. doi: 10.1038/sdata.2016.18
50. Argento N. Institutional ELN/LIMS deployment: highly customizable ELN/LIMS platform as a cornerstone of digital transformation for life sciences research institutes. EMBO Rep. 2020;21(3):e49862.
51. Khatri I, Berkowska MA, van den Akker EB, Teodosio C, Reinders MJT, van Dongen JJM. Population matched (pm) germline allelic variants of immunoglobulin (IG) loci: relevance in infectious diseases and vaccination studies in human populations. Genes Immun. 2021;22(3):172–86. doi: 10.1038/s41435-021-00143-7
52. Mikocziova I, Greiff V, Sollid LM. Immunoglobulin germline gene variation and its impact on human disease. Genes Immun. 2021;22(4):205–17. doi: 10.1038/s41435-021-00145-5

Decision Letter 0

Bhanwar Lal Puniya

23 Oct 2024

Dear Dr. Stappert,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 07 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Bhanwar Lal Puniya, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf .

2. When completing the data availability statement of the submission form, you indicated that you will make your data available on acceptance. We strongly recommend all authors decide on a data sharing plan before acceptance, as the process can be lengthy and hold up publication timelines. Please note that, though access restrictions are acceptable now, your entire data will need to be made freely accessible if your manuscript is accepted for publication. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If you are unable to adhere to our open data policy, please kindly revise your statement to explain your reasoning and we will seek the editor's input on an exemption. Please be assured that, once you have provided your new statement, the assessment of your exemption will not hold up the peer review process.

3. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex .

4. We notice that your supplementary [figures/tables] are included in the manuscript file. Please remove them and upload them with the file type 'Supporting Information'. Please ensure that each Supporting Information file has a legend listed in the manuscript after the references list.

Additional Editor Comments (if provided):

-The authors should thoroughly address all the major comments from the reviewers and incorporate additional citations if relevant.

-One or more of the reviewers has recommended that you cite specific previously published works. Members of the editorial team have determined that the works referenced are not directly related to the submitted manuscript. As such, please note that it is not necessary or expected to cite the works requested by the reviewer.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

Reviewer #3: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

**********

Reviewer #1: This article is very good for publication, and the authors have put forth their best effort in writing and presenting the conclusion. However, some changes are required before its publication. I have mentioned all the points below:

I suggest authors to read and incorporate the information from the following articles and cite them:

mRNA vaccines as an armor to combat the infectious diseases. Travel Medicine and Infectious Disease 52:102550.

Zoonotic diseases in a changing climate scenario: Revisiting the interplay between environmental variables and infectious disease dynamics, Travel Medicine and Infectious Disease, 58:102694.

Nanovaccines: A game changing approach in the fight against infectious diseases. Biomedicine & Pharmacotherapy 167(2023):115597

Reviewer #2: The paper has addressed a nice problem; however, I have the below comments:

1. Accurately isolating individual B-cells from heterogeneous populations can be technically demanding. Techniques like flow cytometry or microfluidics need to be highly precise to ensure that only the target B-cell is selected.

2. Maintaining the viability of single B-cells during isolation and subsequent culture is crucial. Many methods can stress or damage cells, reducing their ability to produce antibodies. Please cite the below papers while discussing this:

"Integrative Toxicogenomics: Advancing Precision Medicine and Toxicology through Artificial Intelligence and OMICs Technology," Biomedicine & Pharmacotherapy, Elsevier, vol. 163, 114784, 2023.

“Advances in smoking related in-vitro inhalation toxicology: a perspective case of challenges and opportunities from progresses in lung-on-chip technologies,” ACS Chemical Research in Toxicology, vol. 34, pp. 1984-222, 2021.

“Emerging technologies for in vitro inhalation toxicology,” Advanced Healthcare Materials, Wiley, vol. 10, no. 18, pp. e2100633 2021.

3. Once isolated, the B-cells must be effectively expanded to produce sufficient quantities of antibodies. Achieving robust growth while preserving their functionality can be difficult.

4. Not all B-cells will produce high-affinity or functional antibodies. Identifying and selecting the right clones that produce the desired antibodies requires additional screening and validation steps.

5. The genetic material of isolated B-cells can be unstable, leading to mutations or loss of antibody specificity during culture. Ensuring genetic integrity is crucial for consistent antibody production.

6. Traditional methods of creating monoclonal antibodies often involve hybridoma technology, which adds complexity and can be less efficient compared to newer single-cell approaches.

7. Scaling up production from a single B-cell to large-scale antibody production can present logistical and technical challenges, including maintaining consistency and quality across batches.

8. Automating the entire process—from isolation to production—while maintaining high throughput and reproducibility can be complex and resource-intensive.

9. The process of characterizing the antibodies produced (e.g., affinity, specificity) can be time-consuming and requires sophisticated assays, adding to the overall complexity.

10. Finally, the paper should be thoroughly proofread and written like a scientific manuscript.

11. Please discuss the role of permeability because it plays a crucial role in the complex production of monoclonal antibodies from single B-cells: High permeability of the cell membrane is essential for the efficient exchange of nutrients and oxygen. This ensures that the isolated B-cells receive the necessary metabolic support for growth and antibody production. Please cite the below papers: “Investigating the Use of Machine Learning Models to Understand the Drugs Permeability Across Placenta,” IEEE Access, vol. 11, pp. 52726-52739, 2023.

“Micropatterned neurovascular interface to mimic the blood-brain barrier neurophysiology and micromechanical function,” Cells, MDPI, vol. 11, no. 18, pp. 2801.

"Bottom-UP assembly of nanorobots: extending synthetic biology to complex material design,” Frontiers in Nanoscience and Nanotechnology,vol.5,pp.1-2,2019.

“Re- routing drugs to blood brain barrier: A comprehensive analysis of Machine Learning approaches with fingerprint amalgamation and data balancing,” IEEE Access, vol. 11, pp. 9890-9906, 2023.

“Perspectives on the technological aspects and biomedical applications of Virus-like-particles/ Nanoparticles in reproductive biology: Insights on the medicinal and toxicological outlook,” Advanced NanoBiomed Research, Wiley, 2:2200010, 2022.

"Synergistic and Additive Effects of Menadione in Combination with Antibiotics on Multidrug-Resistant Staphylococcus aureus: Insights from Structure-Function Analysis of Naphthoquinones," ChemMedChem, Chemistry Eurpoe vol. 18, no. 24, 2023.

"Leveraging hallmark Alzheimer’s molecular targets using phytoconstituents: Current perspective and emerging trends," Biomedicine & Pharmacotherapy, Elsevier, vol. 139, no. 111634.

12. Please include the potential limitations of the study.

Reviewer #3: The manuscript titled "Gain efficiency with streamlined and automated data processing: Examples from high-throughput monoclonal antibody production" follows a logical flow of information extending to pipeline design in monoclonal antibody production. The manuscript provides a proof-of-concept workflow, which is highly interesting. A few minor concerns need to be addressed.

1. The novelty of the work could be highlighted more clearly. While similar data management systems exist, it's not always obvious what sets this one apart or what new methods it introduces. Comparing its performance and innovations with previous systems would help showcase its distinctive strengths.

2. While the practical challenges of data management in mAB production are described, the work design does not necessarily address these challenges. The workflow design presented is robust but doesn't push the boundaries of what's known in the field of laboratory automation.

3. The manuscript would benefit from referencing more recent advancements in automation, particularly in data management. For example, tools leveraging artificial intelligence and machine learning to optimize workflows, predict outcomes, or detect anomalies in real-time are becoming increasingly relevant in biological data processing.

4. There is slight confusion when process automation is also highlighted in some places, like pipetting schemes etc. This makes it unclear whether data management or process automation is the priority.

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

If you choose “yes”, your identity will be included with your review; if published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: Yes:  Dr. Priyanka

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

PLoS One. 2025 Jul 1;20(7):e0326678. doi: 10.1371/journal.pone.0326678.r002

Author response to Decision Letter 0


3 Feb 2025

In the following, we list all comments and detail how we addressed them.

Journal requirements

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

Response:

We have adapted the manuscript to meet PLOS ONE’s style requirements.

2. When completing the data availability statement of the submission form, you indicated that you will make your data available on acceptance. We strongly recommend all authors decide on a data sharing plan before acceptance, as the process can be lengthy and hold up publication timelines. Please note that, though access restrictions are acceptable now, your entire data will need to be made freely accessible if your manuscript is accepted for publication. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If you are unable to adhere to our open data policy, please kindly revise your statement to explain your reasoning and we will seek the editor's input on an exemption. Please be assured that, once you have provided your new statement, the assessment of your exemption will not hold up the peer review process.

Response:

All processed data and code are available in the following repositories:

DOI: 10.5281/zenodo.8229164 and DOI: 10.5281/zenodo.10106688

Links to specifically mentioned code are indicated throughout the main text. All three reviewers were satisfied with data availability.

3. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at

http://journals.plos.org/plosone/s/latex.

Response:

Thank you for the offer to submit with the aid of the PLOS LaTeX template. As we understand from the PLOS ONE submission guidelines, LaTeX is not mandatory, and manuscripts can also be submitted as .docx files. We prefer the latter, in particular for the version with tracked changes.

4. We notice that your supplementary [figures/tables] are included in the manuscript file. Please remove them and upload them with the file type 'Supporting Information'. Please ensure that each Supporting Information file has a legend listed in the manuscript after the references list.

Response:

The supplementary information has been removed from the main text and included as separate files. Each supporting information file has been linked to its legend, as required.

Editor Comments

-The authors should thoroughly address all the major comments from the reviewers and incorporate additional citations if relevant.

-One or more of the reviewers has recommended that you cite specific previously published works. Members of the editorial team have determined that the works referenced are not directly related to the submitted manuscript. As such, please note that it is not necessary or expected to cite the works requested by the reviewer.

Response:

We have assessed all comments from the reviewers and have significantly improved our manuscript. Additional citations relevant to our work have been included, as outlined in the specific responses below.

Reviewer #1:

This article is very good for publication, and the authors have put forth their best effort in writing and presenting the conclusion. However, some changes are required before its publication. I have mentioned all the points below:

I suggest authors to read and incorporate the information from the following articles and cite them:

• mRNA vaccines as an armor to combat the infectious diseases. Travel Medicine and Infectious Disease 52:102550.

• Zoonotic diseases in a changing climate scenario: Revisiting the interplay between environmental variables and infectious disease dynamics, Travel Medicine and Infectious Disease, 58:102694.

• Nanovaccines: A game changing approach in the fight against infectious diseases. Biomedicine & Pharmacotherapy 167(2023):115597

Response: We thank reviewer #1 for the positive evaluation of our manuscript. We have included those of the indicated references that fit the scope of the manuscript (references below). This adds value to the work and shows the possibility of applying workflows such as ours in different scientific fields.

References:

Priyanka, Abusalah MAH, Chopra H, Sharma A, Mustafa SA, Choudhary OP, et al. Nanovaccines: A game changing approach in the fight against infectious diseases. Biomedicine & Pharmacotherapy. 2023 Nov 1;167:115597. Available from: https://www.sciencedirect.com/science/article/pii/S0753332223013951

Reviewer #2:

The paper has addressed a nice problem; however, I have the below comments:

Response: We thank Reviewer #2 for taking the time to evaluate our manuscript and for providing valuable comments. Many of the challenges highlighted in the review align with those we encounter in our daily work, and we appreciate the opportunity to discuss them further. However, we would like to respectfully elaborate that the primary focus of our manuscript is on data and metadata handling for repetitively executed, complex biological workflows, such as high-throughput monoclonal antibody cloning and in vitro production.

1. Accurately isolating individual B-cells from heterogeneous populations can be technically demanding. Techniques like flow cytometry or microfluidics need to be highly precise to ensure that only the target B-cell is selected.

Response: The biological protocol we are using is based on flow cytometry, as described previously in the cited works by Tiller et al., 2008 and Kreye et al., 2016. With regard to our approach to data and metadata tracking, as described in our manuscript, it is important to track key metadata such as the FACS machine type and the key settings used, along with the cell markers utilized, etc.

2. Maintaining the viability of single B-cells during isolation and subsequent culture is crucial. Many methods can stress or damage cells, reducing their ability to produce antibodies. Please cite the below papers while discussing this:

"Integrative Toxicogenomics: Advancing Precision Medicine and Toxicology through Artificial Intelligence and OMICs Technology," Biomedicine & Pharmacotherapy, Elsevier, vol. 163, 114784, 2023.

“Advances in smoking related in-vitro inhalation toxicology: a perspective case of challenges and opportunities from progresses in lung-on-chip technologies,” ACS Chemical Research in Toxicology, vol. 34, pp. 1984-222, 2021.

“Emerging technologies for in vitro inhalation toxicology,” Advanced Healthcare Materials, Wiley, vol. 10, no. 18, pp. e2100633 2021.

Response: We have included the below-mentioned references to further explain the application of our workflow and have expanded our discussion on the role of AI and ML in design principles, data processing and analyses.

References:

Singh AV, Chandrasekar V, Paudel N, Laux P, Luch A, Gemmati D, et al. Integrative toxicogenomics: Advancing precision medicine and toxicology through artificial intelligence and OMICs technology. Biomed Pharmacother [Internet]. 2023;163:114784. Available from: http://dx.doi.org/10.1016/j.biopha.2023.114784

Singh AV, Romeo A, Scott K, Wagener S, Leibrock L, Laux P, et al. Emerging technologies for in vitro inhalation toxicology. Adv Healthc Mater [Internet]. 2021;10(18). Available from: http://dx.doi.org/10.1002/adhm.202100633

Concerning the maintenance of viability during B-cell isolation and culture, we do agree that this is a stressful process for the cells and requires careful handling and protocol optimization. However, for the purpose of monoclonal antibody generation, we do not culture the B-cells after FACS sorting; rather, we directly lyse them and isolate RNA, which is in turn reverse-transcribed to cDNA.

3. Once isolated, the B-cells must be effectively expanded to produce sufficient quantities of antibodies. Achieving robust growth while preserving their functionality can be difficult.

Response: For protocols utilizing B-cell culture to produce antibodies, achieving B-cell expansion while preserving functionality is a considerable challenge. While our manuscript primarily focuses on data handling, the biological workflow underlying the data pipeline uses a different strategy for antibody production. Instead of culturing B-cells, we directly isolate RNA to clone the antibodies (see also above).

4. Not all B-cells will produce high-affinity or functional antibodies. Identifying and selecting the right clones that produce the desired antibodies requires additional screening and validation steps.

Response: Absolutely! While our focus is on antibody production, B-cell characterization presents another challenge that can be addressed through various techniques, each with its own advantages and limitations. See for example: Kreye J, Reincke SM, Kornau H, Sánchez-Sendin E, Corman VM, Liu H, et al. A Therapeutic Non-self-reactive SARS-CoV-2 Antibody Protects from Lung Pathology in a COVID-19 Hamster Model. Cell. 2020 Nov;183(4):1058-1069.e19.

5. The genetic material of isolated B-cells can be unstable, leading to mutations or loss of antibody specificity during culture. Ensuring genetic integrity is crucial for consistent antibody production.

Response: The risk of genetic instability increases with extended culturing time, as observed, for example, in hybridoma technology. To mitigate this risk, we avoid B cell culturing. Instead, we directly clone antibodies from individual B-cells, adhering strictly to best practices in RNA and DNA handling.

6. Traditional methods of creating monoclonal antibodies often involve hybridoma technology, which adds complexity and can be less efficient compared to newer single-cell approaches.

Response: We completely agree. Hybridoma technology has been the gold standard for decades, offering advantages such as longevity and the ability to conduct extended phenotypic screening. However, modern methods, such as the one we employ, are faster, offer higher specificity, and ensure genetic accuracy.

7. Scaling up production from a single B-cell to large-scale antibody production can present logistical and technical challenges, including maintaining consistency and quality across batches.

Response: The most efficient and controllable approach for large-scale antibody production is in vitro production. Our focus is on the generation of plasmids designed for cell transfection to facilitate in vitro antibody production. While our plasmids enable the production of large quantities of a single monoclonal antibody, we specialize in cloning hundreds to thousands of distinct monoclonal antibodies.

8. Automating the entire process—from isolation to production—while maintaining high throughput and reproducibility can be complex and resource-intensive.

Response: We automated the entire process, from cDNA generation to antibody production. Establishing this workflow is indeed resource-intensive, with significant organizational complexity. The key to reducing complexity and improving efficiency during routine operation lies in automated data handling, which has been a central focus of our efforts. Our motivation to share these learnings is reflected in the presented manuscript, and we further emphasize this point in the revised version.

9. The process of characterizing the antibodies produced (e.g., affinity, specificity) can be time-consuming and requires sophisticated assays, adding to the overall complexity.

Response: The first step to further characterize monoclonal antibodies is their production. Antibody cloning from individual B-cells for in vitro production represents the most effective approach to generate enough antibody for all required assays. We demonstrate that this production process is feasible in a highly automated manner, enabling the generation of over 1,000 different antibodies per year.

10. Finally, the paper should be thoroughly proofread and written like a scientific manuscript.

Response: The manuscript has been proofread and adapted to fit the scientific manuscript requirements of PLOS ONE.

11. Please discuss the role of permeability because it plays a crucial role in the complex production of monoclonal antibodies from single B-cells: High permeability of the cell membrane is essential for the efficient exchange of nutrients and oxygen. This ensures that the isolated B-cells receive the necessary metabolic support for growth and antibody production.

Please cite the below papers:

“Investigating the Use of Machine Learning Models to Understand the Drugs Permeability Across Placenta,” IEEE Access, vol. 11, pp. 52726-52739, 2023. “Micropatterned neurovascular interface to mimic the blood-brain barrier neurophysiology and micromechanical function,” Cells, MDPI, vol. 11, no. 18, pp. 2801. "Bottom-UP assembly of nanorobots: extending synthetic biology to complex material design,” Frontiers in Nanoscience and Nanotechnology, vol.5,pp.1-2,2019. “Re- routing drugs to blood brain barrier: A comprehensive analysis of Machine Learning approaches with fingerprint amalgamation and data balancing,” IEEE Access, vol. 11, pp. 9890-9906, 2023. “Perspectives on the technological aspects and biomedical applications of Virus-like-particles/ Nanoparticles in reproductive biology: Insights on the medicinal and toxicological outlook,” Advanced NanoBiomed Research, Wiley, 2:2200010, 2022. "Synergistic and Additive Effects of Menadione in Combination with Antibiotics on Multidrug-Resistant Staphylococcus aureus: Insights from Structure-Function Analysis of Naphthoquinones," ChemMedChem, Chemistry Eurpoe vol. 18, no. 24, 2023. "Leveraging hallmark Alzheimer’s molecular targets using phytoconstituents: Current perspective and emerging trends," Biomedicine & Pharmacotherapy, Elsevier,vol. 139, no. 111634.

Response: Permeability and its regulation across barriers are essential for the compartmentalized functions of cells and tissues. This is true not only for cultured B-cells but also for HEK cells, which we use for overexpressing monoclonal antibodies. In the context of our manuscript, permeability is particularly significant in automated cell culture for producing brain microvascular endothelial cells for our blood-brain barrier model. We have included the below-mentioned references and further strengthened our discussion of this point.

References added:

Ansari MY, Chandrasekar V, Singh AV, Dakua SP. Re-routing drugs to blood brain barrier: A comprehensive analysis of machine learning approaches with fingerprint amalgamation and data balancing. IEEE Access [Internet]. 2023;11:9890–906. Available from: http://dx.doi.org/10.1109/access.2022.3233110

Dhage PA, Sharbidre AA, Dakua SP, Balakrishnan S. Leveraging hallmark Alzheimer’s molecular targets using phytoconstituents: Current perspective and emerging trends. Biomed Pharmacother [Internet]. 2021;139:111634. Available from: http://dx.doi.org/10.1016/j.biopha.2021.111634

12. Please include the potential limitations of the study.

Response: Like any other study dealing with data processing, this one has limitations that need to be considered. To highlight these, we have added a separate “Limitations” paragraph to the main text and discussed several points, including how they might be improved or solved in future workflows. We hope that this will be helpful for future studies and laboratories dealing with automated data processing and database generation.

Reviewer #3:

The manuscript titled "Gain efficiency with streamlined and automated data processing: Examples from high-throughput monoclonal antibody production" follows a logical flow of information extending to pipeline design in monoclonal antibody production. The manuscript provides proof-of-concept workflows, which is highly interesting. A few minor concerns need to be addressed.

1. The novelty of the work could be highlighted more clearly. While similar data management systems exist, it's not always obvious what sets this one apart or what new methods it introduces. Comparing its per

Decision Letter 1

Bhanwar Lal Puniya

18 Mar 2025

Dear Dr. Stappert,

Authors are advised to revise the manuscript according to the comments provided by Reviewer #4.

Please submit your revised manuscript by May 02 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Bhanwar Lal Puniya, Ph.D.

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #3: All comments have been addressed

Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #3: Yes

Reviewer #4: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: Yes

Reviewer #4: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.

Reviewer #3: Yes

Reviewer #4: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #3: Yes

Reviewer #4: (No Response)

**********

Reviewer #3:  (No Response)

Reviewer #4: I would like to thank the authors for their efforts in preparing this manuscript. It presents an important contribution; please address the comments below:

• Some sections feel a bit dense and could be made clearer by breaking up longer sentences and simplifying complex descriptions. This would help improve readability and accessibility, especially for readers who may not be familiar with all the technical details.

• The transitions between sections, particularly from general workflow principles to specific implementation details, could be smoother. A clearer connection between these parts would help guide the reader more effectively.

• Figure 2 seems to be an important part of explaining the workflow, but it’s not included for review. Make sure it’s well-labeled and clearly conveys the intended message.

• It would also help to refer to workflow diagrams more explicitly in the text so readers can follow along more easily.

• The manuscript briefly mentions why Python and Knime were chosen, but it would be useful to provide a stronger rationale. A short comparison of the benefits and trade-offs of these tools versus other available options would add clarity.

• Similarly, when discussing GUI-based vs. scripting-based approaches, consider touching on usability, performance, and reproducibility. This would give readers a better understanding of the reason behind these choices.

• The paper mentions input and output file formats but doesn’t fully address the importance of standardization for interoperability. A brief mention of commonly used formats (e.g., CSV, JSON, XML) and their advantages would help strengthen this section.

• It would also be helpful to explain how intermediate files are validated to ensure data integrity as information moves between pipeline modules.

• The discussion on computational resources (e.g., NAS, cloud services, institutional servers) could benefit from a short mention of data security and access control measures. This would be especially relevant for handling sensitive or large-scale datasets.

• Since working with large datasets can present challenges, consider addressing potential issues like memory constraints or options for parallel computing.

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: No

Reviewer #4: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

PLoS One. 2025 Jul 1;20(7):e0326678. doi: 10.1371/journal.pone.0326678.r004

Author response to Decision Letter 2


8 May 2025

Journal requirements:

Authors are advised to revise the manuscript according to the comments provided by Reviewer #4.

We have addressed all comments and suggestions provided by Reviewer #4. We are thankful for the critical review process and hope that the improved manuscript now fully meets the publication criteria of PLOS ONE.

Reviewer #4:

I would like to thank the authors for their efforts in preparing this manuscript. It presents an important contribution; please address the comments below:

We thank Reviewer #4 for the critical evaluation of our manuscript and for recognizing its importance. We have addressed all comments in the revised version and expanded on those in our rebuttal letter.

1. Some sections feel a bit dense and could be made clearer by breaking up longer sentences and simplifying complex descriptions. This would help improve readability and accessibility, especially for readers who may not be familiar with all the technical details. (answered together with comment no. 2; see the response below comment no. 2)

2. The transitions between sections, particularly from general workflow principles to specific implementation details, could be smoother. A clearer connection between these parts would help guide the reader more effectively.

We appreciate these valuable suggestions. In response, we have revised the main text by breaking up longer sentences and making technical details easier to follow and read. Complex descriptions have been simplified where appropriate, without compromising the essence and structure of the manuscript. We have also tried to establish clear links between general and technical details, especially in our ‘Examples from the workflow’ sections. This also helps to follow the logical structure of the manuscript better. Our Glossary (Box 1) further aids readers at the start by defining data processing concepts often used throughout the text (refer to page 5) and was also refined to convey clearer messages in shorter sentences.

We think that the manuscript is now much more coherent and accessible.

3. Figure 2 seems to be an important part of explaining the workflow, but it’s not included for review. Make sure it’s well-labeled and clearly conveys the intended message.

Figure 2 is indeed an essential part of our manuscript, illustrating the entire mAB production pipeline and the many steps involved in logical order. Figure 2 is referenced multiple times throughout the text, pointing out examples of how our workflow is designed. We have made sure that the figure is explained in the manuscript and properly referred to in each section where appropriate, and we have labelled and annotated it. We will of course include it with our revision files (see also below).

Fig 2. Workflow for the production of mABs from patient-derived individual B-cells. Wet lab work steps (ellipse-icons with headers) and data processing steps (yellow icons). Sample flow is depicted by blue arrows and information flow is depicted by yellow arrows. The experimental procedure starts with FACS sorting of individual B-cells. cDNA is generated from individual B-cells, and PCR1 and PCR2 serve to amplify antibody chains. Successfully amplified chains, evaluated by capillary electrophoresis (cELE1), are sequenced (SEQ1). Upon positive sequence evaluation with the aBASE software [23], specific primers are used for the final amplification of the specificity-determining chain parts (PCR3). Amplification is quality-controlled by electrophoresis (cELE2). The specificity-determining regions are cloned by Gibson Assembly (GiAs) into plasmids encoding the constant parts of the chains. Plasmids are amplified by transformed (TRFO) bacteria after plating (PLA) and picking (PICK) individual bacterial clones. Prior to quality control by sequencing (SEQ2), amplified plasmids are isolated (MINI), and plasmid concentration is measured and adjusted (ConcAdj). Based on sequencing results, functional plasmids (PLA) are sorted. Matching plasmid pairs (mAB Pla) are prepared for transfection (Traf) into HEK cells. mABs produced and secreted by HEK cells are harvested (mABs), and mAB concentration is quantified (quant). Glycerol stocks (Glyc) of transformed bacteria and plasmid aliquots (Pla safe) serve as safe stocks of plasmids. The colors of the wet lab work step captions indicate different workflow sections: blue – molecular biology; red – microbiology; green – cell biology. The colors of the data processing circle contours and font indicate the utilized technology: red – Python; yellow – Knime; blue – third-party software.

4. It would also help to refer to workflow diagrams more explicitly in the text so readers can follow along more easily.

We agree with this comment and have made sure that each workflow diagram is explicitly referred to at relevant points in-text, helping guide readers through the process more clearly.

5. The manuscript briefly mentions why Python and Knime were chosen, but it would be useful to provide a stronger rationale. A short comparison of the benefits and trade-offs of these tools versus other available options would add clarity.

Thank you for the suggestion. We have expanded upon the explanation of our choice of Python and KNIME, emphasizing their complementary advantages (please see also our answer to comment no. 6 below).

6. Similarly, when discussing GUI-based vs. scripting-based approaches, consider touching on usability, performance, and reproducibility. This would give readers a better understanding of the reason behind these choices.

We appreciate the suggestion to elaborate further on our selection of tools and on GUI-based versus script-based approaches.

We have added a brief description of why KNIME and Python were chosen, as well as some suggestions for alternative technologies (refer to “Examples from the workflow” on page 11). We have also expanded on the usage of GUI-based workflows (page 10).

KNIME was chosen mainly for its wide adoption and open-source ecosystem. It offers an intuitive drag-and-drop graphical interface and can easily be used by team members without coding experience. Furthermore, workflows can be shared, ensuring reproducibility and flexibility, and KNIME can embed Python scripts.

Python is widespread in data processing and bioinformatics and is suitable for many different projects and computations. It runs on all major operating systems and supports notebook-style reporting, further enhancing the usability and reproducibility of workflows. Python's syntax is generally clean and readable, which is helpful when working in interdisciplinary teams.

The use of GUI tools in our workflow is essential for the seamless and faster onboarding of researchers with limited coding knowledge onto an existing pipeline. The combined use of tools enables interdisciplinary work and ensures consistency, standardized documentation, and automation of repetitive tasks.
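To make the combination concrete, the following is a minimal sketch of a Python step embedded in a KNIME Python Script node. The knime.scripting.io API shown here reflects recent KNIME versions, and the column name is hypothetical; this is an illustration under those assumptions, not an excerpt from our pipeline.

    # Minimal sketch of a KNIME Python Script node body (hypothetical example).
    import knime.scripting.io as knio

    # Read the node's first input table into a pandas DataFrame.
    df = knio.input_tables[0].to_pandas()

    # Hypothetical transformation: flag wells with low antibody concentration.
    df["low_conc"] = df["mAB_conc_ug_ml"] < 5.0

    # Return the result to KNIME as the node's first output table.
    knio.output_tables[0] = knio.Table.from_pandas(df)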

7. The paper mentions input and output file formats but doesn’t fully address the importance of standardization for interoperability. A brief mention of commonly used formats (e.g., CSV, JSON, XML) and their advantages would help strengthen this section.

We agree that standardized file formats play a key role in interoperability and reproducibility across platforms. Our pipeline relies on CSV files, which are commonly used in laboratory settings: they are simple, human-readable, and compatible with spreadsheet software and most programming environments. We have clarified the use of CSV files in our manuscript (page 15). Formats such as JSON or XML are better suited to nested data and richer metadata, which we do not handle in this pipeline; our datasets, like most datasets generated in labs, are primarily tabular. Nevertheless, we added a brief section on the contexts in which JSON and XML formats could be used.
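As a brief illustration of this division of labor, the sketch below keeps tabular sample data in CSV while placing nested run metadata in JSON; file names and fields are hypothetical, not taken from our workflow.

    # Sketch: CSV for tabular sample data, JSON for nested metadata.
    import csv
    import json

    # Tabular intermediate file: one row per sample; readable by spreadsheet
    # software and virtually any programming environment.
    with open("samples.csv", newline="") as fh:
        samples = list(csv.DictReader(fh))

    # Nested metadata (e.g., instrument settings) fits JSON more naturally.
    metadata = {
        "run_id": "2024-03-example",
        "instrument": {"name": "liquid_handler_1", "firmware": "1.2.3"},
        "samples_processed": len(samples),
    }
    with open("run_metadata.json", "w") as fh:
        json.dump(metadata, fh, indent=2)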

8. It would also be helpful to explain how intermediate files are validated to ensure data integrity as information moves between pipeline modules.

Thank you for the comment. Indeed, this section lacked clarification on that point. We have added a description of our built-in validation steps for intermediate files; please refer to the box “Examples from the workflow” in the section “Defining input and output files”.
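For illustration, a validation step for an intermediate CSV file could look like the sketch below; the required column names are hypothetical, and the actual checks in our pipeline may differ.

    # Sketch of an intermediate-file validation step (hypothetical columns).
    import sys
    import pandas as pd

    REQUIRED_COLUMNS = {"Working_Stock_ID", "FK_Plating_ID", "Concentration"}

    def validate(path):
        """Return a list of problems found; an empty list means the file is valid."""
        df = pd.read_csv(path)
        missing = REQUIRED_COLUMNS - set(df.columns)
        if missing:
            return ["missing columns: " + ", ".join(sorted(missing))]
        problems = []
        if df["Working_Stock_ID"].duplicated().any():
            problems.append("duplicate Working_Stock_ID values")
        if df[list(REQUIRED_COLUMNS)].isna().any().any():
            problems.append("empty cells in required columns")
        return problems

    if __name__ == "__main__":
        problems = validate(sys.argv[1])
        if problems:
            sys.exit("Validation failed: " + "; ".join(problems))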

9. The discussion on computational resources (e.g., NAS, cloud services, institutional servers) could benefit from a short mention of data security and access control measures. This would be especially relevant for handling sensitive or large-scale datasets.

Thank you for the comment. We have expanded the discussion on computational resources to emphasize that, when using storage drives and servers such as NAS, cloud services, or institutional infrastructure, appropriate data security and access control measures should be in place, particularly when managing sensitive or large-scale datasets. This includes implementing practices such as user authentication, encryption, and compliance with institutional or legal data protection guidelines.
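Two low-level building blocks of such hygiene can be sketched in a few lines of Python: checksumming a file so its integrity can be verified after transfer to shared storage, and restricting its permissions. The path is illustrative; real deployments would additionally rely on institutional authentication, encrypted transport and storage, and backup policies.

    # Sketch: integrity checksum and restrictive permissions (illustrative path).
    import hashlib
    import os
    from pathlib import Path

    def sha256sum(path):
        """Checksum a file so its integrity can be verified after transfer."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for block in iter(lambda: fh.read(1 << 20), b""):
                digest.update(block)
        return digest.hexdigest()

    data_file = Path("results/antibody_concentrations.csv")
    print(data_file, sha256sum(data_file))

    # Restrict the file to owner read/write only (POSIX permissions).
    os.chmod(data_file, 0o600)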

10. Since working with large datasets can present challenges, consider addressing potential issues like memory constraints or options for parallel computing.

Thank you for the suggestion; we agree it is important to address these points, especially considering the ever-increasing volume of (biological) data. We have added a sentence acknowledging the challenges of working with large datasets and briefly discussed strategies such as memory management and parallel processing. Please refer to the section “Allocating separate computational space to modules” on page 14 to review these changes.
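As a minimal sketch of both strategies together, the snippet below streams a large CSV in bounded-memory chunks and distributes the per-chunk work across processes; the file name, column, and worker count are hypothetical.

    # Sketch: chunked reading (bounded memory) plus parallel per-chunk work.
    from multiprocessing import Pool
    import pandas as pd

    def partial_sum(chunk):
        """Per-chunk computation: sum and count of a hypothetical column."""
        col = chunk["Concentration"]
        return col.sum(), col.count()

    if __name__ == "__main__":
        # Stream the file in chunks instead of loading it whole.
        chunks = pd.read_csv("large_dataset.csv", chunksize=100_000)
        with Pool(processes=4) as pool:
            partials = pool.map(partial_sum, chunks)
        total = sum(s for s, _ in partials)
        count = sum(n for _, n in partials)
        print("overall mean:", total / count)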

We look forward to your response.

Best Regards,

Dominik Stappert

Attachment

Submitted filename: REV2_rebuttal Letter.docx

pone.0326678.s011.docx (9.2MB, docx)

Decision Letter 2

Bhanwar Lal Puniya

30 May 2025

Gain efficiency with streamlined and automated data processing: Examples from high-throughput monoclonal antibody production

PONE-D-24-34015R2

Dear Dr. Stappert,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Bhanwar Lal Puniya, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #4: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #4: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.

Reviewer #4: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #4: (No Response)

**********

Reviewer #4: (No Response)

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #4: No

**********

Acceptance letter

Bhanwar Lal Puniya

PONE-D-24-34015R2

PLOS ONE

Dear Dr. Stappert,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Bhanwar Lal Puniya

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Cloning strategy for variable antibody regions.

    After cDNA generation, three parallel PCR reactions are performed to amplify the heavy chain variable region (light blue) and either the Kappa (light green) or Lambda (pink) variable region. Upon verification by sequencing, the variable regions (V) are cloned into plasmids encoding the constant part (C) of the respective antibody chain (dark blue: heavy constant part; dark green: Kappa light constant part; purple: Lambda light constant part). See Fig 2 for a comprehensive overview. For an overview of B-cell receptor and antibody variability, refer to Fig 1 by Khatri et al. [51] and Fig 1 by Mikocziova et al. [52].

    (DOCX)

    S2 Fig. An excerpt from the database structure explaining its relational aspects.

    Table C stores information on isolated plasmids of paired antibody chains (heavy and light), and Table D is a repository of plasmid aliquots stored as safe stocks in separate (physical) locations for future reproduction of plasmids or downstream applications. Table Ab_2 connects the information on plasmid pairs with subsequent transfection of HEK cells. Fields (i.e., columns) Working_Stock_ID, Safe_Stock_ID and Transfection_ID store unique IDs of each record per table. Fields starting with FK_ are records’ IDs inherited from related tables. Records (sample states) are connected through one-to-many relationships in the database structure; for example, a plasmid from Table C can be used for creation of a safe stock aliquot (in Table D) multiple times, while one aliquot of safe stock (in Table D) is associated with exactly one plasmid from Table C. Similarly, one transfected HEK cell sample (i.e., transfection well, in Table Ab_1) is associated with exactly two plasmids (a pair of heavy and light chain plasmids, in Table C), while this plasmid pair can be used for multiple transfections.

    (DOCX)

    pone.0326678.s002.tif (166.8KB, tif)
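    The one-to-many design described in S2 Fig can be sketched in code. The snippet below uses Python's built-in sqlite3 purely as an illustration (the production database itself is implemented in FileMaker), with table and field names simplified from the figure.

        # Hypothetical illustration of the one-to-many design in S2 Fig.
        import sqlite3

        con = sqlite3.connect(":memory:")
        con.execute("PRAGMA foreign_keys = ON")
        con.executescript("""
        CREATE TABLE working_stock (          -- cf. Table C: isolated plasmids
            Working_Stock_ID INTEGER PRIMARY KEY
        );
        CREATE TABLE safe_stock (             -- cf. Table D: plasmid aliquots
            Safe_Stock_ID INTEGER PRIMARY KEY,
            FK_Working_Stock_ID INTEGER NOT NULL
                REFERENCES working_stock (Working_Stock_ID)
        );
        """)

        # One plasmid record ...
        con.execute("INSERT INTO working_stock VALUES (1)")
        # ... may back many safe-stock aliquots (one-to-many), while each
        # aliquot references exactly one plasmid.
        con.executemany(
            "INSERT INTO safe_stock (FK_Working_Stock_ID) VALUES (?)",
            [(1,), (1,)],
        )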
    S3 Fig. Simplified view of tables storing bacteria plate information (upper, green) and bacterial colony picking information (lower, blue).

    Picked bacterial colonies can be connected to already imported bacteria plate information through the identifiers (Plating_ID/FK_Plating_ID), reducing the redundancy of stored data. The identifier Plating_ID is unique in the bacteria plating table but not in the colony picking table, allowing for picking of multiple colonies from the same bacteria plate (one-to-many relationship in the database structure). The colony picking table has its own unique identifier (Picking_ID).

    (DOCX)

    pone.0326678.s003.tif (247.3KB, tif)
    S4 Fig. A hypothetical extension of database structure by table addition.

    The flexibility of the database design enables smooth integration of new information (e.g., experimental readouts), which allows for efficient management of diverse datasets within the database. Here, the information on two hypothetical functional antibody assays (AB_Assay1 and AB_Assay2) are appended to the information on the harvested antibody through the AB_ID identifier, thus linking to any previous information on that sample (starting from the B-cell donor/patient).

    (DOCX)

    pone.0326678.s004.tif (300.9KB, tif)
    S5 Fig. Wet lab steps of ASCC workflow: thawing, expansion and differentiation of iPSCs into BMECs in a functional BBB model.

    Wet lab steps and timepoints (relative to the start of the differentiation – day 0: D0) are indicated by blue rectangles. Cycles of thawing (D-6), possible freezing (D-1, D6, D8), and harvest and seeding (D-3, D8) of cells. Count and viability assays are carried out on D-3, D0 and D8. Media used at each timepoint are indicated by ellipses – gray: mTeSR Plus medium with/without rock inhibitor (RI); violet: Unconditioned Medium (UM); green: Endothelial Cell Medium with/without supplements (EC+/+, EC-/-, respectively). The TEER measurement timepoints (D10, D11) are indicated by a purple circle. For a detailed protocol, refer to Fengler et al. [34].

    (DOCX)

    pone.0326678.s005.tif (2.9MB, tif)
    S6 Fig. Conceptualizing the ASCC database.

    A) No-redundancy principle. The information on each medium batch is stored in the database only once (upper table, green headers). Whenever a media change is performed, the unique identifier of the concerned medium batch (ID_medium_INFO) is fetched on the backend by a FileMaker script and populated in the media change table (lower table, red headers). Information in columns ::Medium Name, ::Composition, and ::Mix Date is fetched from the medium table (upper table, green headers). Information in the column ::Barcode is fetched from another table (not shown). B) Example of the harvest event that guided the design of the database structure: pulling differentiated cells from a 6-well plate and seeding on a 12-well TransWell plate. At this workflow stage, cells could also be cryo-stored for future experiments. Implementing a one-to-many relationship between 6-well and 12-well TransWell plates helps avoid storing redundant information.

    (DOCX)

    pone.0326678.s006.tif (765.8KB, tif)
    S7 Fig. Conceptualizing the ASCC database-continued.

    Possible expansion of database structure towards differentiation of other cell types (such as astrocytes, neurons, microglia, monocytes; red tables). The structure is flexible and allows appending of tables to a 6-well plate table (green) that stores information on undifferentiated cells through a one-to-many relationship. In this way, different differentiation protocols can easily be accommodated in the database.

    (DOCX)

    pone.0326678.s007.tif (361KB, tif)
    S1 Table. Primers used for the amplification of heavy, light Kappa (κ) and light Lambda (λ) chains.

    Three PCR reactions are performed per chain. PCR1 and PCR2 are performed with forward and reverse primer mixes. PCR3 is performed with specific primer pairs selected from 69 individual primers. The selection is based on sequencing results of the second PCR amplification and identification of used V(D)J alleles. See the attached Table S1.xlsx file. Refer also to Fig 2.

    (XLSX)

    pone.0326678.s008.xlsx (18.6KB, xlsx)
    Attachment

    Submitted filename: REV2_rebuttal Letter.docx

    pone.0326678.s011.docx (9.2MB, docx)

    Data Availability Statement

    The datasets generated and/or analysed during the current study are available in the following Zenodo repositories, https://doi.org/10.5281/zenodo.8229164 and https://zenodo.org/records/10106688.

