Published in final edited form as: Methods Mol Biol. 2021;2199:209–236. doi: 10.1007/978-1-0716-0892-0_13

State-of-the-Art Data Management: Improving the Reproducibility, Consistency, and Traceability of Structural Biology and in Vitro Biochemical Experiments

David R Cooper 1,2,3,#, Marek Grabowski 1,2,#, Matthew D Zimmerman 1, Przemyslaw J Porebski 1, Ivan G Shabalin 1,2, Magdalena Woinska 1,2, Marcin J Domagalski 1,2, Heping Zheng 1, Piotr Sroka 1,2, Marcin Cymborowski 1,2, Mateusz P Czub 1,2, Ewa Niedzialkowska 1,2, Barat S Venkataramany 1, Tomasz Osinski 1, Zbigniew Fratczak 1, Jacek Bajor 1, Juliusz Gonera 1, Elizabeth MacLean 1, Kamila Wojciechowska 1, Krzysztof Konina 1, Wojciech Wajerowicz 1, Maksymilian Chruszcz 1,4, Wladek Minor 5,6
PMCID: PMC8019398  NIHMSID: NIHMS1681441  PMID: 33125653

Abstract

Efficient and comprehensive data management is an indispensable component of modern scientific research and requires effective tools for all but the most trivial experiments. The LabDB system developed and used in our laboratory was originally designed to track the progress of a structure determination pipeline in several large National Institutes of Health (NIH) projects. While initially designed for structural biology experiments, its modular nature makes it easily applied in laboratories of various sizes in many experimental fields. Over many years, LabDB has transformed into a sophisticated system integrating a range of biochemical, biophysical, and crystallographic experimental data, which harvests data both directly from laboratory instruments and through human input via a web interface. The core module of the system handles many types of universal laboratory management data, such as laboratory personnel, chemical inventories, storage locations, and custom stock solutions. LabDB also tracks various biochemical experiments, including spectrophotometric and fluorescent assays, thermal shift assays, isothermal titration calorimetry experiments, and more. LabDB has been used to manage data for experiments that resulted in over 1200 deposits to the Protein Data Bank (PDB); the system is currently used by the Center for Structural Genomics of Infectious Diseases (CSGID) and several large laboratories. This chapter also provides examples of data mining analyses and warnings about incomplete and inconsistent experimental data. These features, together with its capabilities for detailed tracking, analysis, and auditing of experimental data, make the described system uniquely suited to inspect potential sources of irreproducibility in life sciences research.

Keywords: Structural biology, Databases, LIMS, Reproducibility

1. Introduction

The problem of managing experimental data is as old as the research laboratory itself, and efficient, comprehensive data management is an indispensable component of modern scientific research. The contemporary understanding of data management defines it as a “process that includes acquiring, validating, storing, protecting, and processing required data to ensure the accessibility, reliability, and timeliness of the data for its users” [1]. In recent years, data management has been increasingly recognized as one of the most vital factors affecting the reproducibility of research data. On the other hand, data management problems are quite often underestimated by both scientists and the general public. Widely publicized examples from the airline industry show that poor data management (as well as management in general) may have unpleasant consequences: the public spectacle of a passenger being dragged from a plane resulted in negative publicity and a subsequent drop in stock value for one of the major airlines. Biomedical data management is generally much more sophisticated but still not perfect. Inconsistencies and errors in the public record are usually detected and corrected by the collective efforts of other scientists, but this is not an instantaneous process and is hampered by the difficulty of reporting negative results. The losses associated with the lack of reproducibility are estimated to be on the order of many billions of dollars [2]. Identified inconsistencies are often followed by detailed analysis and, in many cases, by the correction of errors and, frequently, of their sources.

Traditionally, data management in research laboratories has been addressed by simple approaches such as paperbound lab notebooks and, since the 1980s, computerized spreadsheets. However, these approaches do not remove or even track inconsistencies and do not scale well to the requirements of modern biomedical research, especially in large-scale, high-throughput collaborative programs that generate vast amounts of experimental data in geographically distinct laboratories. These traditional approaches are often inadequate to assure the reproducibility of experiments, even when work is performed in a single laboratory. In the last 10 years, there has been an increasing awareness that the reproducibility of experimental research cannot be taken for granted [3, 4]. According to some estimates, about 50% of preclinical research (at the cost of around $28 billion per year) may be irreproducible [2]. The concerns about reproducibility problems are motivating funding agencies worldwide to introduce new requirements for managing and sharing data generated from sponsored research [5].

Recent advances in information technology have led to the development of database-driven platforms to efficiently collect, store, annotate, and analyze laboratory data. Electronic laboratory notebooks (ELNs) can be thought of as a digital replacement of a paperbound notebook. However, increasingly sophisticated laboratory information management systems (LIMSs) are slowly superseding the use of ELNs and spreadsheets [6]. Numerous electronic notebooks and LIMSs have been developed in academia and industry, not only for large-scale projects but also for traditional, small to mid-size laboratories. To address particular experimental problems, diverse specialized systems have been designed to track specific kinds of sequence, microarray, metabolomics, proteomics, chemical, pharmacological, structural, and functional data [7-12]. Manufacturers of laboratory equipment often provide proprietary tools for tracking the data collected by these systems, although these tools are usually limited in scope and tied to the manufacturer’s equipment. Several commercial LIMSs have been released for specialized data management tasks [13-15], but while they have been widely adopted in clinical labs, most off-the-shelf systems are not versatile enough to capture the different types of experimental data that are generated by academic biomedical research [16]. Biological and biomedical data are highly interconnected, and effective data management systems must take into account the diversity of data and experimental methods. To our knowledge, no data management system can accommodate the breadth of information that is necessary to encompass the “big picture” for any substantial biomedical project. For that reason, none of the existing systems has so far reached widespread acceptance in academia. In the pharmaceutical industry, data management systems are considered so valuable that companies often do not disclose any information about them.

Structure-function research, which is a major focus of this chapter, requires the full characterization of proteins and other macromolecules, including 3-D structure determination. The structure determination pipeline includes cloning, protein production, method-specific sample preparation (such as crystallization, deuteration, or cryo-EM grid preparation), structure solution, and model refinement. Biochemical and biophysical experiments are usually performed in advance and are often critical during the structure determination and interpretation stage. Conversely, the experimentally determined 3-D macromolecular models frequently inspire subsequent functional experiments, which may include ligand binding experiments or mutational analysis to test models based on the structural interpretation. The final analysis usually requires the examination of the structural and functional information in the context of other similar macromolecular structures.

Integrating these diverse data presents a serious challenge for effective data management. In order to surmount this challenge, a number of LIMSs have been developed for structural biology, and many were designed by large-scale structural genomics (SG) programs. These include Xtrack [12], SESAME [17], PiMS/xtalPiMS [18, 19], and HalX [20]. Some SG programs have relied on customized commercial LIMSs [21]. Modern crystallographic software suites, such as CCP4 [22, 23], Phenix [24, 25], and HKL-3000 [26, 27], organize the computational data they generate. These systems make it relatively easy for crystallographers to keep track of the parameters and results of calculations along the path from crystallographic data to a refined model that is ready for publication and deposition into the PDB.

Here, we describe our experiences managing structural biology data using a component-based data management system that we have developed for the acquisition, validation, storage, and analysis of biomedically oriented experimental data. The general description applies to any modern data management system; however, the examples and some details presented here reflect our experiences developing the LabDB modular LIMS. LabDB is composed of several separate components, each optimized to perform a particular task (Fig. 1). The reagents-tracking module, which tracks chemicals, laboratory supplies, and stock solutions, is a prerequisite for all other components. The components that are essential for a structural biology laboratory include protein production, crystallization, and structure determination modules.

Fig. 1.

Fig. 1

The overall organization of LabDB, showing that expansion modules can be added to provide extra functionality

The core of the system contains an underlying relational database and an associated web interface. Most of the components of the system are web-based, some use native interfaces, and some use a combination of both, according to the task involved. The database is directly interfaced with the Xtaldb application for designing, recording, and analyzing crystallization experiments [28-30] and the HKL-2000/-3000 crystallographic data processing and structure determination suite [26]. The integration of LabDB with the HKL suite allows scientists to automatically obtain information about the protein(s) and sample characterization during data collection, structure determination, and refinement. The initial structure determination results are directly transferred to LabDB, which allows others to design new biomedical experiments based on the structural information. The effect of the synergy of this integration cannot be overestimated.

LabDB focuses on minimizing human input by harvesting data directly from laboratory hardware whenever possible and uses several equipment-specific clients and modules for automated data acquisition. During the last 10 years, multiple instances of LabDB have been used to record experimental data for tens of thousands of protein targets in a number of large-scale high-throughput biomedical centers, including the Center for Structural Genomics of Infectious Diseases (CSGID), the Midwest Center for Structural Genomics (MCSG), New York Structural Genomics Research Consortium (NYSGRC), and the Enzyme Function Initiative (EFI). The system has played an important role in ensuring the reproducibility of experiments in the labs where it has been deployed.

2. Data Model, Data Acquisition, and Validation

2.1. Data Model

The data of interest in a structural biology laboratory is associated with entities representing different types of physical samples, experimental trials, calculations, etc. In turn, each of these data objects has attributes that describe specific details of these entities. Most existing LIMS implementations map the objects to a relational database with a schema comprised of the experimental components.

The LabDB relational database (currently LabDB uses the open-source PostgreSQL database) is arranged around “projects,” which correspond to individual proteins, macromolecules, or complexes. Mutants or other modifications of a protein (e.g., having different purification tags) are considered to be experimental variants within the same project. A given project usually encompasses multiple physical samples: clones, purified proteins, crystallization drops, harvested crystals, and others. Projects can optionally be aggregated into project groups. Data related to cloning, expression, purification, and biochemical characterization of the proteins are maintained by the “protein production” module, while data related to experimental crystallographic aspects are maintained by the crystallization component. The structure determination component, hkldb, keeps track of all computational crystallographic parameters, from data collection to deposition of the refined structural model into the PDB. hkldb is tightly integrated with the HKL-3000 structure determination suite, which ensures automated collection of metadata about the whole structure determination process.
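To make the project-centered organization concrete, the sketch below shows a greatly simplified, hypothetical relational schema of this kind; the table and column names are illustrative only and do not reproduce LabDB’s actual schema (SQLite is used purely so that the example runs standalone, whereas LabDB uses PostgreSQL).

```python
import sqlite3

# A minimal, hypothetical sketch of a project-centered schema; the table and
# column names are illustrative and do not reproduce LabDB's actual schema.
DDL = """
CREATE TABLE project      (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE clone        (id INTEGER PRIMARY KEY, project_id INTEGER NOT NULL REFERENCES project(id), vector TEXT);
CREATE TABLE purification (id INTEGER PRIMARY KEY, clone_id INTEGER NOT NULL REFERENCES clone(id), buffer TEXT);
CREATE TABLE macroprep    (id INTEGER PRIMARY KEY, purification_id INTEGER NOT NULL REFERENCES purification(id), ligand TEXT);
CREATE TABLE plate        (id INTEGER PRIMARY KEY, project_id INTEGER NOT NULL REFERENCES project(id),
                           name TEXT NOT NULL, UNIQUE (project_id, name));
CREATE TABLE crystal_drop (id INTEGER PRIMARY KEY, plate_id INTEGER NOT NULL REFERENCES plate(id),
                           macroprep_id INTEGER REFERENCES macroprep(id), well TEXT);
CREATE TABLE crystal      (id INTEGER PRIMARY KEY, drop_id INTEGER NOT NULL REFERENCES crystal_drop(id),
                           harvested_on TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)  # every sample type links to its parent and, ultimately, to a project
```

Because every record carries a foreign key to its parent, any physical sample can be traced back through the chain to the project it belongs to.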

LabDB was initially used to track all the steps of protein production, crystallization, and diffraction experiments for high-throughput structural biology and was designed to require a logical sequence of events for each project. The origin of most experimental pipelines is a recombinant DNA clone, but a project can also represent a complex of interacting proteins or a single protein purified from a natural source. In a typical workflow, clones are transformed into competent E. coli cells, and the encoded protein is expressed, purified, and crystallized for diffraction experiments. The purified proteins are often mixed with ligands, binding partners, or modifying enzymes before crystallization; in LabDB, such mixtures are referred to as “macromolecule preps” or just “macropreps.” During crystallization trials, the protein solutions are combined with a variety of components such as buffers, salts, organic molecules, and/or ligands in an effort to grow crystals. When luck prevails, these usually fragile crystals are harvested and subjected to X-ray diffraction experiments [31]. Well-diffracting crystals usually lead to interpretable electron density maps, which can be used to generate structural models of the protein. The final step is iterative refinement combined with model rebuilding and validation procedures. Each stage of this experimental pipeline has many parameters that must be recorded, as sometimes even slight changes can have dramatic effects on sample characterization. LabDB was designed to enforce the recording of the complete experimental provenance of any physical sample so that it can be traced back to its “source,” e.g., a clone, a purified protein, or a protein shipped from elsewhere.

2.2. Acquisition of Various Types of Laboratory Data

Acquisition of experimental data in a research laboratory can be performed in different ways depending on various factors, such as the type and complexity of the data, the point of acquisition, available resources, and database/equipment compatibility. A major drawback of many existing LIMSs is their overreliance on manual user input. While a LIMS can provide benefits to experimenters (e.g., the ability to easily share data with others, tools to analyze data, etc.), there can be drawbacks as well, namely the additional time needed to enter and curate data. In practice, researchers are reluctant to adopt data management systems unless their benefits significantly outweigh the additional time outlay.

The underlying data acquisition concepts are presented in Fig. 2. In our experience, the integrity and completeness of the data are inversely proportional to the effort required from the user to input the data, especially in the case of “failed” experiments, where the motivation to enter data manually is particularly weak. If the required user effort is minimal and the data are collected automatically, the data will be more complete and generally have higher integrity (provided the equipment is working reliably). However, if the user is asked to input data by hand, the data will be limited and contain many errors and discrepancies. The difference is akin to tracking a person’s location via phone GPS versus conducting a survey to determine where they were at a given time. Briefly, data acquisition methods can be either manual or automated, with further division based on the complexity of input required from the user. The most time-consuming aspect of data acquisition is the manual input of data, followed by the manual upload of data files (e.g., importing data en masse from spreadsheets, comma-separated value (CSV) files, or other output files produced by laboratory hardware) and the manual entry of metadata. These last two steps are generally not required for the main experimental tasks but usually provide additional data; e.g., users may want to upload a gel image associated with a protein purification (Fig. 3). Complex characterization steps such as thermal shift assays, isothermal titration calorimetry (ITC) experiments, and kinetic assays require a certain amount of metadata, which usually consists of an experimental protocol and the parameters of each experimental replicate.

Fig. 2.

Fig. 2

Data integrity and completeness vs. the effort required from users to input data for various types of data entry

Fig. 3.

Fig. 3

Additional files, such as gel images, can be associated with experiments in LabDB. LabDB can read additional data from image acquisition systems. The presented gel was acquired using a Bio-Rad Gel Doc™ EZ Gel Documentation System and processed and annotated using Image Lab™ Software

Semiautomated data acquisition requires a user to either input or confirm some information that is either harvested automatically or associated with an external event. LabDB can harvest such data by monitoring computer folders or by running equipment-specific methods, with metadata provided by the user. More information can be extracted through uni- and bidirectional communication with the database/LIMS that controls the hardware and stores experimental results. Each direction of communication improves data integrity. Bidirectional communication is the most complex to develop and requires cooperation between different vendors, but it provides the most benefits in terms of data integrity and completeness, as discussed below. However, this approach may be limited by the availability of a software development kit (SDK) or application programming interface (API) that would allow access to the data by external programs.
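As a minimal illustration of folder monitoring, the sketch below polls a hypothetical instrument export directory and hands new files to a placeholder harvesting routine; the path, file pattern, and polling interval are assumptions, and a production client would typically rely on operating-system file events or an instrument SDK instead.

```python
import time
from pathlib import Path

WATCHED = Path("/data/instrument_exports")   # hypothetical folder written to by an instrument
POLL_SECONDS = 30

def harvest(path: Path) -> None:
    """Placeholder: parse the exported file and push it to the LIMS with user-supplied metadata."""
    print(f"new result file detected: {path.name}")

seen: set[Path] = set()
while True:
    # A simple polling loop; a real client could instead subscribe to file-system events.
    for exported in sorted(WATCHED.glob("*.csv")):
        if exported not in seen:
            seen.add(exported)
            harvest(exported)
    time.sleep(POLL_SECONDS)
```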

2.2.1. Reagents Module

The reagents component maintains an inventory of chemicals, stock solutions, and other reagents; their usage is tracked by other components. Chemicals are identified by CAS numbers and SMILES [32] representations, and pertinent information is downloaded from PubChem [33]. Bottles with chemicals or solutions are labeled with barcodes. Solutions can be entered into LabDB through a web-based form. The form allows complex solutions to be created using a mixture of chemicals in their original chemical bottle and/or other solutions as their starting point. If a chemical bottle is used, then the chemical bottle’s barcode can either be scanned or selected, and a final concentration is entered by the user. If a multicomponent solution is used as the starting point, all of the components of that solution are listed along with their stock concentration. Final concentrations are added by either entering the dilution factor or the final concentration of one of the components, thereby keeping the relative concentrations of solution components proportional. Multiple solutions or individual chemicals can be used to create a solution. Overall solution properties, such as name and volume, are required. The pH for the solution is not calculated from the individual components in the current version of LabDB. The default label will include a barcode, the solution’s name, the creator’s name, the date, and the solution components. Labels can be edited before they are printed on a network-connected label printer.
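The proportional-dilution bookkeeping described above can be illustrated with a short sketch; the component names and concentrations are made up, and the calculation simply scales every stock concentration by the same dilution factor.

```python
def dilute(stock: dict[str, float], dilution_factor: float) -> dict[str, float]:
    """Scale every component of a stock solution by the same dilution factor."""
    return {name: conc / dilution_factor for name, conc in stock.items()}

def dilute_to_target(stock: dict[str, float], component: str, final_conc: float) -> dict[str, float]:
    """Derive the dilution factor from one component's final concentration, then apply it to
    all components so that the relative concentrations stay proportional."""
    factor = stock[component] / final_conc
    return dilute(stock, factor)

# Example: a hypothetical 10x buffer stock (concentrations in mM) diluted so that Tris ends up at 20 mM.
stock = {"Tris": 200.0, "NaCl": 1500.0, "EDTA": 10.0}
print(dilute_to_target(stock, "Tris", 20.0))   # {'Tris': 20.0, 'NaCl': 150.0, 'EDTA': 1.0}
```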

Storage System

The storage system contained within LabDB is a flexible, hierarchical system that allows storage containers to be placed within other containers. Each storage location can have one or more “children” locations, forming a set of hierarchical trees of ancestors and descendants, where the “root” storage locations are different rooms in the laboratory. Examples of storage containers in a room are the built-in shelves and large storage units such as freezers, refrigerators, and cabinets. The shelves of these large storage containers are themselves considered containers, as are any racks or boxes used to group similar reagents together. The hierarchical nature makes it possible to group items together and move them en masse. If one moves a freezer from one room to another, all the descendent storage locations (e.g., shelf 1, the blue box that is on shelf 2, etc.) and the individual items they contain are automatically moved. Each storage location can be assigned a barcode that can be used to quickly identify the location when assigning an item to a location (Fig. 4).
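The following sketch illustrates the hierarchical storage idea with a hypothetical in-memory tree of locations; because each container only stores a reference to its parent, relocating a freezer implicitly relocates everything inside it.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class StorageLocation:
    """A hypothetical container node; child containers and items move with their parent."""
    name: str
    parent: StorageLocation | None = None
    children: list[StorageLocation] = field(default_factory=list)

    def add(self, child: StorageLocation) -> None:
        child.parent = self
        self.children.append(child)

    def move_to(self, new_parent: StorageLocation) -> None:
        # Re-parenting a container implicitly relocates all of its descendants and their contents.
        if self.parent is not None:
            self.parent.children.remove(self)
        new_parent.add(self)

    def path(self) -> str:
        # Walk up the ancestor chain to the "root" room.
        return self.name if self.parent is None else f"{self.parent.path()} / {self.name}"

room_a, room_b = StorageLocation("Room 101"), StorageLocation("Room 102")
freezer, shelf = StorageLocation("Freezer -80"), StorageLocation("Shelf 1")
room_a.add(freezer); freezer.add(shelf)

freezer.move_to(room_b)   # move the freezer to another room...
print(shelf.path())       # ...and every descendant follows: "Room 102 / Freezer -80 / Shelf 1"
```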

Fig. 4.

Fig. 4

Barcoding stock solutions in LabDB

The storage location of individual items can be displayed in LabDB when viewing an item’s details (or a list of items), and an itemized list of all the items contained within a storage location can be displayed in a tree view. The storage system has several quick entry methods. One can scan a storage location’s barcode and then scan numerous items as they are placed in the location. In the inventory check feature, one scans all of the items in a location, and LabDB indicates which items are missing or unexpectedly present (along with their currently assigned location). The inventory can be updated or changed on a per-item basis. We are currently adding safety information to the storage system to alert users, or authorities in case of emergencies, about potentially harmful materials contained within a storage location.
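Conceptually, the inventory check reduces to a comparison of two sets of barcodes, as in the hedged sketch below (the barcodes are invented for illustration).

```python
def inventory_check(expected: set[str], scanned: set[str]) -> dict[str, set[str]]:
    """Compare the items recorded at a storage location with the barcodes actually scanned."""
    return {
        "missing": expected - scanned,      # recorded here but not found during the scan
        "unexpected": scanned - expected,   # found here but assigned elsewhere (or unknown)
    }

# Hypothetical barcodes:
expected = {"SOL-0012", "SOL-0034", "CHEM-0101"}
scanned = {"SOL-0012", "CHEM-0101", "CHEM-0777"}
print(inventory_check(expected, scanned))
# {'missing': {'SOL-0034'}, 'unexpected': {'CHEM-0777'}}
```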

2.2.2. Protein Production Module

The “Protein Production” component of LabDB stores information about protein cloning, expression, and purification. Experimental data are organized in a tree-like data structure: each experimental step must be connected to a preceding step (e.g., expression must be tied to a particular clone). Data can be entered manually by researchers using a web browser as a specific step is completed; batches of experiments performed en masse can be imported from spreadsheet files.

2.2.3. Crystallization Module

The crystallization module gathers data about crystallization trials: the setup of crystallization plates, the contents of individual wells and drops, the origin of harvested crystals, and any special conditions used during crystal harvesting (Fig. 5). The interface can be used to assign a crystallization screen to the wells and drops of a crystallization plate. Most commercial screens are predefined, and new commercial or custom crystallization screens can be easily generated or downloaded from the Formulatrix web page [34]. Plate templates can be created with up to six different drops per well, making it possible to examine several parameters within each chamber (well) of the crystallization plate. Each drop in a chamber can contain a different macromolecular prep of a project, and the volumes of the macroprep and the screen used for each drop can differ. Each drop has an associated screen, which does not have to be the same as the screen in the reservoir, unlike in the traditional style of crystallization plate. Thus, LabDB makes it possible to track alternative reservoir screening [35]. For example, sodium chloride can be placed in the reservoirs while a different crystallization screen is used for each drop position. This type of screening can make initial crystallization trials very efficient and has several other benefits [36].
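As an illustration of how a plate record can capture per-drop screens that differ from the reservoir, the sketch below shows one hypothetical well set up with alternative reservoir screening; the screen and macroprep names are invented and do not come from LabDB.

```python
# A hypothetical in-memory representation of one well of a crystallization plate set up with
# "alternative reservoir" screening: the reservoir holds a simple NaCl solution, while each
# drop position uses its own screen condition and/or macroprep and its own volumes.
well_b4 = {
    "reservoir": {"screen": "1.5 M NaCl"},
    "drops": {
        1: {"macroprep": "ESA-prep7", "screen": "Index G9",   "protein_ul": 2.0, "screen_ul": 2.0},
        2: {"macroprep": "ESA-prep7", "screen": "PEG/Ion A3", "protein_ul": 1.0, "screen_ul": 2.0},
        3: {"macroprep": "ESA-prep8", "screen": "Index G9",   "protein_ul": 2.0, "screen_ul": 1.0},
    },
}

# The screen in each drop need not match the reservoir, so several parameters
# can be explored within a single chamber.
for position, drop in well_b4["drops"].items():
    print(position, drop["macroprep"], drop["screen"])
```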

Fig. 5.

Fig. 5

Tracking of plate and crystal data in LabDB. Clockwise from left: 15-well plate, information about the plate, crystals from well B4 in the plate, and information about the crystal. Only a small portion of each page from LabDB is shown

The module can be configured to gather data directly from the Minstrel HT (Rigaku) and Rock Imager (Formulatrix) observation robots (data transferred periodically) or used in manual mode. The module can also interact with the Xtaldb application [28]: an expert system that provides tools to design crystallization plates, record drop images and annotations, and analyze the results.

2.2.4. Structure Determination Component

Information about structure determination, refinement, and validation is stored using the hkldb module, which is shared with the HKL-3000 suite [26]. hkldb allows bidirectional transfer of data between HKL-3000 and LabDB. Crystals in LabDB can be selected for diffraction, data reduction, structure solution, and model refinement. The availability of complete information about sample production during any step of structure determination can be extremely helpful for the estimation of radiation decay or attempts to identify unassigned electron density. A report containing statistics of data collection and the refinement process can be generated (Fig. 6).

Fig. 6.

Fig. 6

An example of X-ray data collection and refinement statistics report produced by HKL-3000 using hkldb. This screenshot displays about half of the report

2.2.5. Biochemical Experiments

The value of structural data is magnified when the data are coupled with functional experiments such as enzymatic assays, binding studies, etc. LabDB can accommodate a variety of biochemical and biophysical experiments to ensure that all available experimental evidence is associated with projects. LabDB is not a substitute for external analysis programs that help interpret data coming from different instruments but rather is a mechanism that ensures the data is easily retrievable and accessible.

LabDB can import information about absorbance- and fluorescence-based enzyme kinetic assays. The system tracks both detailed studies of a particular enzyme-substrate reaction (e.g., Michaelis-Menten kinetics) and high-throughput screening plates involving many substrates or many enzymes. Individual experimental replicates can be recorded, but calculations of kinetic constants, such as KM and kcat, need to be performed outside of LabDB. The first step of entering assay experiments is to define the experimental protocol, which should include a detailed description of the steps, instruments, and buffer composition. These protocols can be used for multiple projects, so defining an individual “kinetic assay” involves specifying the protein being used (the macromolecule prep) and the substrate being tested.
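Because kinetic constants are calculated outside LabDB, a researcher might, for example, fit Michaelis-Menten parameters to exported rate data with a short external script such as the hedged sketch below; the substrate concentrations, rates, and enzyme concentration are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(s, vmax, km):
    """v = Vmax * [S] / (Km + [S])"""
    return vmax * s / (km + s)

# Hypothetical substrate concentrations (mM) and initial rates (uM/min) exported from the LIMS.
s = np.array([0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0])
v = np.array([0.9, 1.6, 2.9, 4.0, 5.0, 5.7, 6.3])

(vmax, km), _ = curve_fit(michaelis_menten, s, v, p0=[v.max(), 0.5])
kcat = vmax / 0.01   # kcat = Vmax / [E]total; an enzyme concentration of 0.01 uM is assumed here
print(f"Vmax = {vmax:.2f} uM/min, Km = {km:.2f} mM, kcat = {kcat:.1f} min^-1")
```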

The alternate method of entering kinetic data allows multiple assays to be submitted at one time, varying either the enzyme or the substrate. This facilitates searching for a protein with a particular function or looking for the optimal substrate, respectively, and is compatible with many 96-well plate readers that can export data. Data can either be uploaded as a CSV file or manually entered in a web form. Compositions of experimental layouts can be saved as a plate design for reuse.

Thermal Shift Assays

Fluorescence-based thermal shift assays (FBTSAs) are based on the detection of fluorescence from a dye present in solution. As proteins in the solution denature with increasing temperature, the dye binds to exposed hydrophobic surfaces, which increases the dye’s fluorescence. LabDB processes the results of FBTSA experiments by parsing the output generated by Bio-Rad CFX96 and Applied Biosystems 7900HT RT-PCR systems, allowing the melting curves to be visualized within the web interface. Three files can be uploaded for each experiment: the raw relative fluorescence units, the derivatives of the melting curves, and an experimental summary. This not only allows the data to be visualized within LabDB but also makes the raw files accessible from anywhere. Individual curves are displayed below the aggregate figure (Fig. 7).
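A common first-pass analysis of such melting curves, performed outside the database, is to estimate the melting temperature Tm as the temperature at which the fluorescence rises fastest; the sketch below illustrates this on a synthetic curve and is not part of LabDB itself.

```python
import numpy as np

def estimate_tm(temperature_c: np.ndarray, rfu: np.ndarray) -> float:
    """Estimate Tm as the temperature where the fluorescence signal rises fastest
    (maximum of dF/dT), a common first-pass analysis of FBTSA melting curves."""
    d_rfu = np.gradient(rfu, temperature_c)
    return float(temperature_c[np.argmax(d_rfu)])

# Hypothetical melt curve: a sigmoid centered near 52 degrees C plus a little noise.
t = np.linspace(25, 95, 141)
rfu = 1000.0 / (1.0 + np.exp(-(t - 52.0) / 2.0)) + np.random.normal(0, 5, t.size)
print(f"estimated Tm ~ {estimate_tm(t, rfu):.1f} C")
```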

Fig. 7.

Fig. 7

A sample thermal shift assay with raw data obtained from the Bio-Rad CFX96 system and a derivative graph

Isothermal Titration Calorimetry

Interactions of macromolecules with small molecules or other macromolecules can be detected using isothermal titration calorimetry (ITC). In an ITC experiment, small aliquots of a ligand or a macromolecule are injected into a sample cell, and the heat change caused by interaction is measured. ITC has become a staple of biochemical labs due to its ease of use and its ability to accurately characterize the binding affinity and stoichiometry of interactions. LabDB tracks the macropreps and solutions in the experimental cell and the injection syringe. The file generated from a MicroCal ITC system as well as an optional analysis file from Origin data analysis software can be uploaded and used to visualize the results.

Other Biochemical Assays

As the variety of experimental methods used by biomedical labs is broad, the LabDB system also permits entry of basic data about custom experimental assays. Each trial has to be coupled with a protocol that describes the method in more detail. The experiments can be flagged as successful or unsuccessful, and a list of the people involved can be included. This entry is relatively generic and includes a text field with optional comments or notes, which allows the researcher to use a standard protocol yet document any deviations or experimental conditions that are not explicitly described in the protocol.

2.3. Data Validation

Data integrity needs to be considered and verified at every step of the data management process. For example, constraints should be created during the database or web framework design to ensure that relationships between pieces of data are legitimate and that unique attributes (or combinations of attributes) are enforced. This will prevent entering data that are internally inconsistent or inconsistent with the data that are already in the database. For example, entering two plates with the same name or attempting to put a 96-well screen into a 24-well plate is not possible. Similarly, attempts to enter incomplete data are prevented by several mechanisms, including alerting the scientist responsible for specific instruments when incomplete data are automatically transferred from an instrument. There are several checks of data consistency. For example, plates cannot be transferred from the crystal storage system if the project is not already in the database because every plate must be associated with a project.
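The two layers of validation described above, database-level constraints and application-level consistency checks, can be illustrated with the hedged sketch below; the table, column, and function names are hypothetical, and SQLite is used only so the example runs standalone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Database-level constraint: two plates in the same project cannot share a name.
conn.execute("""CREATE TABLE plate (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    wells INTEGER NOT NULL,
    UNIQUE (project_id, name))""")

def check_screen_fits_plate(screen_wells: int, plate_wells: int) -> None:
    """Application-level check: refuse to apply a 96-condition screen to a 24-well plate."""
    if screen_wells > plate_wells:
        raise ValueError(f"screen has {screen_wells} conditions but the plate has only {plate_wells} wells")

conn.execute("INSERT INTO plate (project_id, name, wells) VALUES (1, 'ESA-testosterone-01', 24)")
try:
    conn.execute("INSERT INTO plate (project_id, name, wells) VALUES (1, 'ESA-testosterone-01', 96)")
except sqlite3.IntegrityError as err:
    print("rejected duplicate plate name:", err)

try:
    check_screen_fits_plate(96, 24)
except ValueError as err:
    print("rejected plate setup:", err)
```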

In contrast to experimental data, the integrity of computational data is generally not a challenge and is beyond the scope of this chapter.

3. Technical Implementation

Due to the collaborative nature of modern research, most existing LIMSs use one fundamental design—a central database and one or more clients. The clients are usually one of two kinds, either a web-based or desktop application, although some LIMSs use both to leverage their distinct capabilities.

Web-based applications are usually suitable for general user interfaces for data presentation and manual data input. They provide platform-independent interfaces that can be accessed by any computer connected to the internet (or an internal network) using a standard browser. There is no need to develop and test the application on all possible operating system versions and configurations, which makes development and troubleshooting much more straightforward. Maintenance is easier, as updates applied on the server are immediately accessible on the client computers. The interfaces can be adapted to portable devices such as tablets and smartphones, providing a consistent user experience across devices suitable for laboratory use. Mobile devices used at the bench provide the advantage that data can be input at the same time experiments are performed. However, web interfaces are limited in complexity by the constraints of the web programming environment, and it may be difficult to communicate with hardware attached to client machines. Nonetheless, rapidly evolving JavaScript technologies allow for richer interaction with web browsers, changing them from simple document viewers into fully featured, extensible development platforms.

Desktop applications are suitable for clients that require communication with scientific instruments and are often used for processing and analyzing substantial amounts of data. The development time and effort for such systems can be greater, but the resulting interfaces can be more sophisticated and support extensive communication with external hardware, such as direct control of the instrument or automated data gathering and processing. The major drawback is that desktop applications require the development and maintenance of standalone programs for several popular operating systems (e.g., Windows, macOS, Linux). Such programs are generally more difficult to install and keep up to date, since upgrades must be distributed to each client computer. Several Java-based clients have solved the problem of application distribution by distributing the code through a web browser, but at the expense of requiring the user to maintain an up-to-date Java environment for each browser; in addition, the Java NPAPI plugin is no longer supported in most modern browsers due to security concerns. SESAME [17] is one LIMS that uses this type of application as a general client. Recent advances in web technologies, such as cloud computing, software as a service, and semantic technologies, support the creation of sophisticated distributed systems. These advancements further blur the boundaries between different clients, most of which now serve only as a user interface, while all data are processed and stored in the cloud.

The current implementation of LabDB mainly uses the web-based approach. The web framework that generates the pages presented to the user incorporates the model-view-controller (MVC) architecture (the current implementation uses CakePHP [37]). In this framework, individual tables in the database are represented by models, which are related to other models using associations that denote the relationship. For example, a purified protein may have many crystallization plates, but a particular crystallization drop “belongs to” only one crystallization plate. The CakePHP framework has been supplemented with jQuery and JavaScript, permitting forms or pages to be altered based on other choices in the forms. For example, when entering a crystallization plate, selecting the project will fetch the appropriate list of “macromolecular preps” of that protein. In many cases, forms have a header column whose values populate a whole collection of other form entries.

All four modules store the data they use in a central PostgreSQL database, and as a result, share common organizational information, such as projects, laboratories, user accounts, passwords, and identification barcodes. Information collected by one module is accessible within the others. For example, all purified proteins in the system are available to the crystallization module and can be used to prepare crystallization plate records. The overall system database is very large, containing about 250 tables and 30 views separated into a distinct schema for each component.

Each laboratory can have a separate instance of LabDB with its own database. All instances share the same schema but contain different data. Some of the functionality of LabDB is provided by customized scripts that are run on a schedule controlled by the scheduling daemon cron. For example, a progress report is emailed to the principal investigator (PI) once a week, and the external database for a crystallization “hotel” is queried every 30 min to keep the plates in each system synchronized. Most of the tasks run by the daemon can also be triggered by a button within the interface, providing a means of generating instantaneous reports.
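The sketch below illustrates the kind of scheduled task that cron can run; the script, e-mail addresses, paths, and crontab schedules are hypothetical and merely stand in for LabDB’s actual reporting scripts.

```python
#!/usr/bin/env python3
"""A hypothetical scheduled task of the kind LabDB runs via cron.

Example crontab entries (illustrative paths and schedules):
    0 7 * * MON  /opt/labdb/scripts/weekly_report.py      # weekly progress report to the PI
    */30 * * * * /opt/labdb/scripts/sync_hotel_plates.py  # sync with the imaging "hotel" every 30 min
"""
import smtplib
from email.message import EmailMessage

def send_weekly_report(summary: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = "LabDB weekly progress report"
    msg["From"] = "labdb@example.edu"      # hypothetical addresses
    msg["To"] = "pi@example.edu"
    msg.set_content(summary)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay is available
        smtp.send_message(msg)

if __name__ == "__main__":
    send_weekly_report("Plates set up: 12\nCrystals harvested: 5\nDatasets collected: 3")
```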

4. Data Analysis and Data Mining Tools

Each component provides a set of data analysis tools. LabDB provides several data mining tools to analyze the results of structure determination and biological assays. Virtually all types of data have detailed search tools. The basic search tool allows the user to search for a particular project name, description, responsible person, project status, etc. Each type of object (e.g., crystals, clones, thermal shift assays) has object-specific search fields. For example, crystals can be filtered based on whether or not they have been tested for diffraction, and kinetic assays can be filtered based on the specific protocol used. Applying a filter returns a paginated table containing the reduced set of objects. Each object has a default set of columns that are displayed, but a “Select Columns” button allows more (or fewer) columns to be displayed in the resulting list. All of the displayed columns can be used to sort the data.

In addition, there are a number of data analysis “dashboards” that provide a real-time overview of the status of projects in the pipeline. LabDB can display statistics for projects or researchers that summarize progress. The progress summary allows a LabDB user to specify a time period during which the experiments were performed in the lab and offers several predefined periods (e.g., the past 1, 2, or 3 weeks, 1, 2, or 6 months, or 1 or 2 years). The result (Fig. 8) is a table reporting how many experiments of each type were performed during the time period. Most table entries are links that bring up the list of experiments associated with that number. LabDB also builds aggregate statistical reports for different groups of projects, which can be used to group projects supported by various funding sources, similar projects that involve the same collaborator, or projects on related proteins.

Fig. 8.

Fig. 8

An example of a weekly report automatically sent by LabDB to the PI and all current lab members

In addition to progress summaries being displayed within the interface, the weekly per-person and per-project statistics are emailed to the lab members and the PI every week. This feature is not intended to check the performance of particular people but rather to identify bottlenecks in the research and experimental steps that need to be addressed.

Another major advantage of a LIMS is the potential for quantitative data analysis. In our experience, both crystallization and cryoprotection protocols were significantly improved after simple analyses showed which approaches were more productive. Several researchers in the lab switched to certain crystallization screens after a large-scale analysis of the lab experiments tracked in LabDB showed that these screens were significantly more efficient than others at producing harvestable crystals. As another example, some of the projects were conducted with the “alternative reservoir” crystallization method [35]. Analysis of our crystallization trials showed that this approach produced more crystals per plate than the traditional approach, and switching to the new crystallization method significantly increased the lab’s productivity afterward. Similarly, an analysis of diffraction resolution vs. cryoprotectant used helped determine the best cryoprotectants for several projects.

Rigorous use of the database during experiments helps in preparation of publications, especially when the experimenter has left the laboratory, a frequent case in academia. Projects carried out using LabDB (all research centers) resulted in 156 publications within the last 7 years. Of those, six achieved a relative citation ratio (RCR) higher than 5, and two were classified as highly cited papers in the Essential Science Indicators database (i.e., they were in the top 1% of papers by field and publication year, according to Web of Science). The summary of the total amount of data stored in the Minor Lab instance of LabDB is available in Table 1.

Table 1.

Summary data from the Minor Lab instance of LabDB for projects carried out for three research centers and internal projects as of May 15, 2019

                         CSGID              MCSG               NYSGRC             Minor Lab
Projects                 162                129                118                796
Clones                   85 (1.5)           18 (2.3)           16 (2.0)           139 (3.3)
Expressions              289 (3.4)          134 (5.6)          79 (3.8)           293 (5.7)
Purifications            390 (4.8)          121 (4.7)          81 (3.7)           326 (4.3)
Macropreps               857 (5.9)          384 (3.1)          398 (3.4)          1312 (1.7)
Plates                   2626 (22.8)        1263 (30.8)        1001 (10.2)        2888 (23.9)
Crystallization Drops    415,630 (3614.2)   71,569 (1745.6)    208,878 (2131.4)   241,596 (1996.7)
Crystals                 6844 (59.5)        2258 (19.5)        2370 (26.6)        6211 (8.5)
Diffraction Datasets     2193 (31.3)        1230 (12.1)        742 (13.3)         2952 (7.1)
Refinement Runs          16,590 (276.5)     6285 (133.7)       7139 (158.6)       9002 (77.6)

Results are given in the format: total number (average per project). The average number is calculated using the total number of experiments for the stage divided by the number of projects that had at least one experiment for that stage

5. User Experience and PI Perspective

To make the user experience less tedious and more intuitive, input in LabDB is designed to be easy, logical, and straightforward. This can be partially attributed to the involvement of people who perform experiments in the design of the interface. In particular, most relevant fields are populated based on previous experiments and information about the project. For example, when adding a new crystallization plate to the database, the experimenter can select both the type of plate and the crystallization screen used from a drop-down menu. The addition of crystals is greatly simplified with the “clone” button, which allows the crystal description with all associated information (crystal size, morphology, cryoprotectant used, source plate and well, crystallization condition) to be copied in one click, thereby requiring the experimenter to only change the name of the new crystal and adjust other details as necessary.

LabDB incorporates many other features that add immediate convenience for its users. For example, the chemical storage module quickly indicates the availability and location of chemical bottles, which often saves researchers time when looking for a rarely used chemical or checking which chemicals are currently available. Another feature is the ability to enter “free text” notes in some fields. For example, the experimenter can make notes on cryoprotection and describe crystal specifics (e.g., color, size, morphology, and added ligand) to keep a more detailed record of crystal harvesting. These text fields are searchable, allowing for later searches for crystals with the custom keyword or phrase entered by the experimenter. The use of these text fields makes LabDB more flexible but comes at the expense of making data mining more difficult.

One of the major factors limiting the adoption of LIMSs in academic research is sociological in nature. Often, lab scientists are unenthusiastic about using LIMSs because they perceive data input as tedious and LIMSs as less flexible and convenient than a lab notebook, while the benefits to the researcher performing the experiments are seen as minimal. A potential explanation of the low perceived benefit is that experimenters tend to hope for “the best-case scenario” (no need to troubleshoot the experimental results, the experiment is published soon after it is performed, allowing the researcher to draw unrecorded details from memory, etc.) and overlook the long-term benefits of a LIMS. As a result, although many LIMSs are in principle available for use in the laboratory, the often tedious and inflexible input and the minimal perceived benefits to the researcher performing experiments deter researchers from using them.

The extra effort that is occasionally required to enter information into a LIMS is sometimes resisted by researchers who cannot see the benefits of this effort. Indeed, even in some structural genomics laboratories that were required by the NIH to ensure all data were publicly available, some researchers made comments such as “I don’t know why we even need a database,” and some users thought their shorthand description of crystallization plates was sufficient because “the code is scribbled on the wall over the microscope.” Tragically, the wall was painted while the experimenter was out of town. This type of complacency results from the shortsightedness of researchers who think that they are the only ones who will ever have to interpret their results. The reality is that most research projects, even if conducted in a single lab, rely on numerous researchers who come and go, often leaving the project’s PI with the difficult task of trying to locate notes and decipher users’ codes and shorthand. Sometimes these notebooks are misplaced or irretrievably lost. This attitude can defeat accurate data preservation; it is difficult to overcome and sometimes requires strict enforcement of data entry by the PI. LabDB’s weekly reports make it possible for the PI to see that the data have been preserved, and the PI can easily browse the details and results of experiments. Most of our laboratory members come to appreciate the extra effort at some point, especially when writing papers about long-standing projects or performing experiments similar to ones performed years ago by researchers who have since left the lab.

Another major factor that demotivates the adoption of LIMSs is the perception that they are difficult to install and maintain. PIs in smaller labs with few active experimental projects may perceive that there are few or no benefits to be gained by installing and using a LIMS and that any potential gains do not justify the time, effort, and money required for setup and maintenance. Labs of small to medium size may contend that usage of a laboratory notebook to record all experiments is much more feasible than attempting to supplant the notebook with software or hardware solutions. Although implementing a LIMS may be a challenge for smaller labs, a PI who does cutting-edge research may be asked for details of his/her experiments and face the issue of irreproducibility, which is not easy to handle without detailed records of experiments. Laboratories that perform very diverse or uncommon types of experiments and that do not find a LIMS that covers the breadth of techniques used may employ an ELN instead of a LIMS; while not perfect, this may be the only reasonable solution in such situations.

A viable solution for laboratories that do not want to install a LIMS is to use a LIMS provided as Software-as-a-Service (SaaS), which avoids the complexity and cost of setting up and maintaining a secure server with database and web server capabilities. Many SaaS solutions (e.g., QBench, CloudLIMS, and webLIMS) are web-based, similar to the interface provided by LabDB. SaaS approaches provide all users with the latest version of the application because the code is maintained by the service provider. Typical SaaS applications store a user’s data either on the provider’s servers or in the cloud, but it is possible to store data at the user’s site. It is critical that potential SaaS users investigate the capabilities for exporting their data should they decide to discontinue the service.

6. Enhancing Reproducibility and Efficiency of Experiments: Case Studies

Management of experimental data is a critical factor in establishing a reproducible research workflow. LIMSs are especially helpful in ensuring continuity and reproducibility for large projects that involve multiple researchers and last for many years. In our experience, LabDB has proven itself to be much more durable than a paper notebook or a series of spreadsheets. One example of a long-term project in our lab is the “albumin project,” which aims to characterize interactions between albumins from various species and small molecules transported in the blood. Since the start of the project in 2008, several researchers, ranging in expertise from undergraduate students to research faculty, have participated in this project and performed numerous protein purifications and crystallization trials, which has led to a collection of diffraction images for more than 1500 crystals (Fig. 9). LabDB has enabled the storage of all information about these data and experimental setups in a form that is easily accessible to any lab member. Access to old but complete information has allowed projects to be completed using the most modern software and a state-of-the-art approach [38]. One such example comes from our recent structure of equine serum albumin (ESA) in complex with testosterone (PDB ID: 6MDQ) [39]. Crystals of this complex were obtained in 2011, but the project stalled for various reasons. The deposition of this structure in 2018 was possible because the details of the crystallization procedure had been recorded and proved critical during the structure refinement process. Additional studies performed in 2018 led to publication in 2019 [39]. LabDB has allowed us to keep accurate records of experimental details over the course of the albumin project. Our ability to reproduce these experiments has allowed us to deposit 14 albumin structures in the PDB so far and publish five papers, one of which has garnered almost 300 citations [40].

Fig. 9.

Fig. 9

LabDB view of a list of crystals with selected experimental details for a particular project

Another way in which LIMSs ensure continuity and reproducibility of experiments stems from their capability to track the chemical and protein batches used in each experiment. Researchers are generally aware that variations among chemical batches (e.g., intended or unintended changes in the manufacturing process) may result in different outcomes of biomedical experiments [41]. In addition, for most projects, it is very important to use one purification protocol for all experiments to ensure that the protein was purified or modified in exactly the same way. A powerful example of such a project is the “Gcn5-related N-acetyltransferases (GNAT) project,” during which we discovered and clearly demonstrated that the buffers used during purification and the presence of a 6×His-tag alter enzyme kinetics and cause discrepancies between findings based on a crystal structure and the results of kinetic or binding studies [42]. Therefore, keeping track of all chemicals and their batches used in protein production and subsequent experiments, as well as other details such as the removal of the 6×His-tag, is crucial for ensuring the reproducibility of these experiments. Using a LIMS to track experiments makes the task of keeping such records manageable. With a LIMS, an experimenter can compare all chemical batches and procedures used in experiments and identify differences that may be the cause of irreproducibility.

7. Future Directions: Toward a Configurable LIMS Architecture

The requirements that shaped LabDB’s functionality have changed dynamically during the almost 15 years of its development. It comes as no surprise that, in order to keep a software suite cutting-edge, it needs to be constantly maintained and extended with new functionality. The dynamic development of scientific methodologies and software technologies is a major, but not the only, factor that affects the usability of a LIMS. Different laboratories have different data management needs, and those needs tend to change dramatically over time. Our ambition was to convert LabDB from a macromolecular crystallography LIMS into a versatile suite that is flexible, customizable, and extendable and that allows for a user-driven evolution of the database schema over time. For this purpose, we have made an effort to redesign the system architecture and simplify the underlying data model.

7.1. Data Model

The current implementation of LabDB is based on a relational database model, which enforces a strictly organized way of storing data and provides a powerful query language but at the same time is difficult to change. The relational database schema imposes a data structure that is defined up-front during system development and cannot be updated without changes in the source code. Another major issue is the object-relational impedance mismatch, i.e., the set of difficulties that arise when a relational database is accessed by an application program written in an object-oriented language.

An alternative to the relational model is the NoSQL document database, which does not require a predefined schema. To address these issues, we have designed a hybrid data model based on the PostgreSQL relational database engine and its support for the storage of JSON (JavaScript Object Notation) documents. This database structure is capable of storing experimental workflows represented as directed acyclic graphs. The graph nodes, called “elements,” represent any physical or conceptual entity (e.g., a chemical, a sample, a result) that is a subject or result of physical actions performed in the lab. The laboratory actions (e.g., experiments, analyses, shipments), called “processes,” are edges in the graphs. A complete workflow can be efficiently retraced using a recursive SQL query on a single table storing the graph’s adjacency lists.
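The kind of recursive query meant here can be sketched as follows; the edge table, element identifiers, and process names are hypothetical, and SQLite is used only so that the example runs standalone (the same WITH RECURSIVE construct is available in PostgreSQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table of graph edges: each "process" links a parent element to a child element.
conn.executescript("""
CREATE TABLE edge (parent_id INTEGER, child_id INTEGER, process TEXT);
INSERT INTO edge VALUES (1, 2, 'expression'), (2, 3, 'purification'),
                        (3, 4, 'crystallization'), (4, 5, 'data collection');
""")

# Retrace the full provenance of element 5 (e.g., a diffraction data set) back to its source.
ANCESTRY = """
WITH RECURSIVE ancestry(parent_id, child_id, process) AS (
    SELECT parent_id, child_id, process FROM edge WHERE child_id = ?
    UNION ALL
    SELECT e.parent_id, e.child_id, e.process
    FROM edge AS e JOIN ancestry AS a ON e.child_id = a.parent_id
)
SELECT * FROM ancestry;
"""
for row in conn.execute(ANCESTRY, (5,)):
    print(row)   # (4, 5, 'data collection'), (3, 4, 'crystallization'), ...
```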

Element and process records have only a few generic attributes, one of which is a JSON-type object that embeds all data specific to the particular entity. This design allows element and process data with arbitrary individual fields and nested structures to be stored in the database. In contrast to NoSQL databases, the structure of the objects is not completely schema-less: in our model, every element, process, and workflow object must hold a reference to a JSON Schema object defining the format of the data. The schema serves as a consumer contract, i.e., it is applied to the incoming data to determine whether the data conform to the schema’s definition. This hybrid approach combines the consistency of relational databases with the flexibility of JSON data structures.
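The “consumer contract” role of the schema can be illustrated with the widely used jsonschema Python package; the crystal schema and element record below are invented examples, not LabDB’s actual definitions.

```python
from jsonschema import validate, ValidationError

# A hypothetical schema acting as a "consumer contract" for crystal elements.
crystal_schema = {
    "type": "object",
    "required": ["name", "drop_id", "size_um"],
    "properties": {
        "name": {"type": "string"},
        "drop_id": {"type": "integer"},
        "size_um": {"type": "array", "items": {"type": "number"}, "minItems": 3, "maxItems": 3},
        "cryoprotectant": {"type": "string"},
    },
}

element = {"name": "ESA-T7-xtal3", "drop_id": 412, "size_um": [120, 80, 60]}
try:
    validate(instance=element, schema=crystal_schema)   # incoming data must conform before being stored
    print("element accepted")
except ValidationError as err:
    print("element rejected:", err.message)
```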

7.2. System Architecture

The original LabDB is a server-side web application based on the model-view-controller (MVC) architectural pattern and CakePHP framework. In this classic “thin client” design, all pages are generated by the server-side code and transferred as complete HTML documents to the browser. Over the past few years, the trends in web development have shifted to browser-based client functionalities. Such an approach gives more implementation flexibility, assures the ability to work in an offline mode, and lowers server requirements and infrastructure costs. The future architecture of LabDB will be based on the Representational State Transfer (REST) web API and independent JavaScript client applications. The prototype of the new LabDB API was implemented in a Python web framework, Django, with the use of the Django REST toolkit. The API decouples data storage from the client application, simplifying data sharing between different application programs. Thanks to the API, development of new front-end tools that access the data will not require changes within the LIMS itself. Additionally, the Django framework has a vertically split structure, which allows for the encapsulation of functionalities within so-called reusable apps. This gives the possibility for easy integration of LabDB’s API with existing scientific applications written in Django. The database abstraction layer was defined using Django’s object-relational mapping (ORM) module, providing a clean separation of concerns and easy refactoring possibilities.
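As a hedged sketch of what such an API might look like, the fragment below defines a hypothetical “element” model and exposes it through the Django REST framework; the field names are illustrative, and the code is meant to live inside a configured Django project (settings, migrations, and URL wiring are omitted).

```python
# models.py - a minimal "element" record holding its payload as JSON (hypothetical fields).
from django.db import models

class Element(models.Model):
    kind = models.CharField(max_length=64)           # e.g., "chemical", "sample", "result"
    data = models.JSONField(default=dict)            # entity-specific attributes
    schema_ref = models.CharField(max_length=256)    # reference to the JSON Schema the data must satisfy

# api.py - expose elements through a REST endpoint using the Django REST framework.
from rest_framework import serializers, viewsets, routers

class ElementSerializer(serializers.ModelSerializer):
    class Meta:
        model = Element
        fields = ["id", "kind", "data", "schema_ref"]

class ElementViewSet(viewsets.ModelViewSet):
    queryset = Element.objects.all()
    serializer_class = ElementSerializer

router = routers.DefaultRouter()
router.register(r"elements", ElementViewSet)   # yields /elements/ and /elements/<id>/ endpoints
# urls.py would then include router.urls
```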

7.3. Workflow Management

The main motivation for the redesign of LabDB was to make it adjustable to different workflows and to a lab’s changing needs. In the redesigned LabDB, users will be able to define custom workflows using a graphical business process notation standard, BPMN (Business Process Model and Notation) 2.0. BPMN is a flow chart method that does not require any programming knowledge, thus bridging the gap between process intention and implementation. BPMN flow charts are defined in a graphical editor using a set of predefined graphical elements that represent business activities, flows, and processes. A LabDB user will be able to predefine process steps, required reagents and samples, and expected results, and to assign people and instruments. The application will take care of translating the graphical workflow into a set of schema definitions for the incoming data. We believe that BPMN modeling can greatly improve the repeatability and reproducibility of biological workflows.

8. Conclusions

The ultimate bottleneck of modern biomedical research is the insufficient rate at which vast amounts of experimental data are converted into biomedical information. As we have argued in this chapter, using a well-designed LIMS to manage experimental data offers a number of benefits to a modern biomedical research lab.

  1. A LIMS is the most convenient way of keeping track of an inventory of laboratory chemicals and specimens.

  2. Using a LIMS helps to assure the continuity of projects, as it creates a persistent record of the performed experiments that can be examined by researchers working in the lab regardless of whether the person who performed an experiment still works there.

  3. Supplemented by data mining tools, LIMSs may be used to optimize experimental methods and protocols, e.g., by identifying bottlenecks in the workflow, the best methods for conducting particular types of experiments, or the optimal parameters for protocols.

  4. For PIs, LIMSs provide a way of tracking lab activity across different projects, as well as tracking the progress of individual projects and identifying factors that impede progress.

  5. LIMSs can help researchers diagnose issues affecting the reproducibility of experiments. As anybody who has ever worked in a biomedical laboratory knows well, repeating a past experiment does not always yield the same results. Even if the researcher makes no errors in setting up an experiment, its result may be affected by multiple factors, such as temperature or a different batch of reagents, that are sometimes beyond the experimenter's control. Recording as many experimental details as possible in a LIMS may help to identify what differs between the original experiment and its unsuccessful repetition. As was recently stated in a Retraction Watch discussion, "in many cases, that act of looking something up in a database is enough to reveal a problem" [43].

  6. Last but not least, the use of LIMSs as tools for easily sharing data with collaborators and the wider research community helps to facilitate open science. Unfortunately, a modern data management system will not be any better than a laboratory notebook if its data remain "siloed," that is, isolated and removed from the context of all other relevant data, whether they reside on a local hard disk or in the cloud. Some researchers take advantage of general-purpose repositories to upload their data to the "cloud" to ensure that they are not lost forever. Unfortunately, unindexed data without a sufficient description (metadata) are impossible to locate and are about as useful as the information that a lost diamond necklace is somewhere in a landfill.

Acknowledgments

We thank all the users of our data management programs who over many years have provided us with numerous complaints, suggestions, and requests that gave us invaluable feedback for improving our tools. This work was supported by the National Institute of General Medical Sciences under Grants GM117080 and GM117325, by the National Institutes of Health BD2K program under Grant HG008424, and by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under Contracts HHSN272201700060C and HHSN272201200026C.

Footnotes

Disclosure statement: One of the authors (W.M.) notes that he has also been involved in the development of state-of-the-art software and data management and mining tools; some of them were commercialized by HKL Research, Inc. and are mentioned in the paper. W.M. is the co-founder of HKL Research, Inc. and a member of the board.

The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

References

  1. Data management. http://www.businessdictionary.com/definition/data-management.html. Accessed 6 May 2019
  2. Freedman LP, Cockburn IM, Simcoe TS (2015) The economics of reproducibility in preclinical research. PLoS Biol 13(6):e1002165
  3. Prinz F, Schlange T, Asadullah K (2011) Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov 10(9):712-713
  4. Begley CG, Ioannidis JP (2015) Reproducibility in science: improving the standard for basic and preclinical research. Circ Res 116(1):116-126
  5. Collins FS, Tabak LA (2014) Policy: NIH plans to enhance reproducibility. Nature 505(7485):612-613
  6. McDowall RD, Pearce JC, Murkitt GS (1988) Laboratory information management systems - Part I. Concepts. J Pharm Biomed Anal 6(4):339-359
  7. Hakkinen J, Levander F (2011) Laboratory data and sample management for proteomics. Methods Mol Biol 696:79-92
  8. Hunter A, Dayalan S, De Souza D, Power B, Lorrimar R, Szabo T et al. (2017) MASTR-MS: a web-based collaborative laboratory information management system (LIMS) for metabolomics. Metabolomics 13(2):14. Epub 2016 Dec 27
  9. Lin K, Kools H, de Groot PJ, Gavai AK, Basnet RK, Cheng F et al. (2011) MADMAX - management and analysis database for multiple ~omics experiments. J Integr Bioinform 8(2):160
  10. Stephan C, Kohl M, Turewicz M, Podwojski K, Meyer HE, Eisenacher M (2010) Using Laboratory Information Management Systems as central part of a proteomics data workflow. Proteomics 10(6):1230-1249
  11. Venco F, Vaskin Y, Ceol A, Muller H (2014) SMITH: a LIMS for handling next-generation sequencing workflows. BMC Bioinformatics 15(Suppl 14):S3. Epub 2014 Nov 27
  12. Harris M, Jones TA (2002) Xtrack - a web-based crystallographic notebook. Acta Crystallogr D Biol Crystallogr 58(Pt 10 Pt 2):1889-1891
  13. Lab Information Management Systems (LIMS). https://www.thermofisher.com/us/en/home/life-science/lab-data-management-analysis-software/enterprise-level-lab-informatics/lab-information-management-systems-lims.html. Accessed 25 Apr 2019
  14. Laboratory Information Management System (LIMS). https://www.autoscribeinformatics.com/lims-laboratory-information-management-system. Accessed 6 May 2019
  15. Produce reliable results more quickly. https://www.illumina.com/informatics/sample-experiment-management/lims.html. Accessed 25 Apr 2019
  16. Cyr K, Hill A, Warren P, Mounts D, Whitley M, Mounts W et al. (2010) From project-to-peptides: customizing a commercial LIMS for LC-MS proteomics. J Biomol Tech 21(3):S9
  17. Zolnai Z, Lee PT, Li J, Chapman MR, Newman CS, Phillips GN Jr et al. (2003) Project management system for structural and functional proteomics: SESAME. J Struct Funct Genom 4(1):11-23
  18. Morris C (2015) PiMS: a data management system for structural proteomics. Methods Mol Biol 1261:21-34
  19. Daniel E, Lin B, Diprose JM, Griffiths SL, Morris C, Berry IM et al. (2011) xtalPiMS: a PiMS-based web application for the management and monitoring of crystallization trials. J Struct Biol 175(2):230-235
  20. Prilusky J, Oueillet E, Ulryck N, Pajon A, Bernauer J, Krimm I et al. (2005) HalX: an open-source LIMS (Laboratory Information Management System) for small- to large-scale laboratories. Acta Crystallogr D Biol Crystallogr 61(Pt 6):671-678
  21. Bonanno JB, Almo SC, Bresnick A, Chance MR, Fiser A, Swaminathan S et al. (2005) New York-Structural GenomiX Research Consortium (NYSGXRC): a large scale center for the protein structure initiative. J Struct Funct Genom 6(2-3):225-232
  22. Winn MD, Ballard CC, Cowtan KD, Dodson EJ, Emsley P, Evans PR et al. (2011) Overview of the CCP4 suite and current developments. Acta Crystallogr D Biol Crystallogr 67(Pt 4):235-242
  23. Potterton L, Agirre J, Ballard C, Cowtan K, Dodson E, Evans PR et al. (2018) CCP4i2: the new graphical user interface to the CCP4 program suite. Acta Crystallogr D Struct Biol 74(Pt 2):68-84
  24. Adams PD, Afonine PV, Bunkoczi G, Chen VB, Davis IW, Echols N et al. (2010) PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr D Biol Crystallogr 66(Pt 2):213-221
  25. Echols N, Grosse-Kunstleve RW, Afonine PV, Bunkoczi G, Chen VB, Headd JJ et al. (2012) Graphical tools for macromolecular crystallography in PHENIX. J Appl Crystallogr 45(Pt 3):581-586
  26. Minor W, Cymborowski M, Otwinowski Z, Chruszcz M (2006) HKL-3000: the integration of data reduction and structure solution - from diffraction images to an initial model in minutes. Acta Crystallogr D Biol Crystallogr 62:859-866
  27. Cymborowski M, Klimecka M, Chruszcz M, Zimmerman MD, Shumilin IA, Borek D et al. (2010) To automate or not to automate: this is the question. J Struct Funct Genom 11(1):211-221
  28. Zimmerman MD, Grabowski M, Domagalski MJ, MacLean EM, Chruszcz M, Minor W (2014) Data management in the modern structural biology and biomedical research environment. Methods Mol Biol 1140:1-25
  29. Zimmerman MD, Chruszcz M, Koclega K, Otwinowski Z, Minor W (2005) The Xtaldb system for project salvaging in high-throughput crystallization. Acta Crystallogr A 61:c178-c179
  30. Zimmerman MD (2008) The crystallization expert system Xtaldb, and its application to the structure of the 5′-nucleotidase YfbR and other proteins [dissertation]. University of Virginia, Charlottesville
  31. Chruszcz M, Wlodawer A, Minor W (2008) Determination of protein structures - a series of fortunate events. Biophys J 95(1):1-9
  32. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31-36
  33. Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A et al. (2016) PubChem Substance and Compound databases. Nucleic Acids Res 44(D1):D1202-D1213
  34. Formulatrix. https://formulatrix.com/. Accessed 6 May 2019
  35. Newman J (2005) Expanding screening space through the use of alternative reservoirs in vapor-diffusion experiments. Acta Crystallogr D Biol Crystallogr 61(Pt 4):490-493
  36. Cooper DR, Boczek T, Grelewska K, Pinkowska M, Sikorska M, Zawadzki M et al. (2007) Protein crystallization by surface entropy reduction: optimization of the SER strategy. Acta Crystallogr D Biol Crystallogr 63(Pt 5):636-645
  37. CakePHP. https://cakephp.org/. Accessed 6 May 2019
  38. Shabalin IG, Porebski PJ, Minor W (2018) Refining the macromolecular model - achieving the best agreement with the data from X-ray diffraction experiment. Crystallogr Rev 24(4):236-262
  39. Czub MP, Venkataramany BS, Majorek KA, Handing KB, Porebski PJ, Beeram SR et al. (2018) Testosterone meets albumin - the molecular mechanism of sex hormone transport by serum albumins. Chem Sci 10(6):1607-1618
  40. Majorek KA, Porebski PJ, Dayal A, Zimmerman MD, Jablonska K, Stewart AJ et al. (2012) Structural and immunologic characterization of bovine, horse, and rabbit serum albumins. Mol Immunol 52(3-4):174-182
  41. Svare A, Nilsen TI, Asvold BO, Forsmo S, Schei B, Bjoro T et al. (2013) Does thyroid function influence fracture risk? Prospective data from the HUNT2 study, Norway. Eur J Endocrinol 169(6):845-852
  42. Majorek KA, Kuhn ML, Chruszcz M, Anderson WF, Minor W (2014) Double trouble - buffer selection and His-tag presence may be responsible for nonreproducibility of biomedical experiments. Protein Sci 23(10):1359-1368
  43. How a typo in a catalog number led to the correction of a scientific paper - and what we can learn from that. https://retractionwatch.com/2018/10/18/how-a-typo-in-a-catalog-number-led-to-the-correction-of-a-scientific-paper-and-what-we-can-learn-from-that/. Accessed 8 May 2019