Abstract
Background
The biological sciences are producing increasingly large datasets for biomarker discovery. While common data models have been developed for medical terms as they relate to patient health outcomes, a data model that supports longitudinal tracking of biospecimens and relates them to an individual patient experience is a large, unmet need.
Method
A structure and associated taxonomy were achieved through a six-tier build in Research Electronic Data CAPture (REDCap), which organizes the complexity of the therapeutic decisions, biospecimens, and outcomes that characterize a longitudinal patient experience. Modules were developed to support export of REDCap data into a Structured Query Language (SQL) format for merging with extended biomarker data, also housed in SQL.
Results
The resultant AstroID resource is a relational structure for clinical and biospecimen data that meets several desired goals: searchable, flexible, generic, Health Insurance Portability and Accountability Act-compliant, auditable, and easy-to-use. The essential elements forming the core of the six-tiered build are provided, so others can readily adopt this schema, as well as an example of an extended, customized build to support biomarker discovery for patients with melanoma. Two examples where this data structure was used to support biomarker discovery and development are described, and example queries of the database are also presented. To the extent possible, the data dictionary was aligned with large data models, such as those for the National Institutes of Health’s Human Tumor Atlas Network. The structure can readily scale to accommodate thousands of patients, multimodality data, and spatial characterization of billions of cells. Radiologic imagery can also be included along with pathology imagery to support spatial studies, including artificial intelligence-driven analyses.
Conclusions
This effort provides a database model for investigators conducting research on large volumes of biospecimens with clinical annotation. We have now deployed this structure in our laboratories and have over 1B cells spatially mapped, each effectively tagged with the clinical information from longitudinal patient experiences. While the description uses the example of cancer biomarkers, this data structure could be used to characterize longitudinal biospecimens from any disease process. In the near future, automatic synchronization between the electronic medical record and one or more AstroID databases is anticipated.
Keywords: Melanoma, Biomarker, Biopsy, Pathology, Tumor microenvironment - TME
WHAT IS ALREADY KNOWN ON THIS TOPIC
WHAT THIS STUDY ADDS
We provide a six-tiered data structure and an associated taxonomy to organize clinical and biomarker data representing the therapeutic journey.
This structure can be used to associate deidentified clinical information with radiographic and biospecimen findings including complete blood counts, microbiome data, bulk or single-cell DNA/RNA sequencing, and deep multiplex immunofluorescent and/or spatial transcriptomic mapping of the tumor microenvironment.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
This data structure supports biomarker discovery for large, multimodal datasets and allows for translational research results to be framed within the longitudinal patient experience.
Introduction
The next generation of biomarkers is likely to be derived from large, well-curated datasets that include clinical data paired with biospecimen data. The organization of such “Big Data” is a relatively new concept in the field of medicine. Existing medical records can be considered object-oriented databases (figure 1). Owing to the extreme complexity and fragmented collection of these data, research laboratories often need to extract and assemble clinical information manually into a spreadsheet in order to associate it with the generated laboratory results from a given biospecimen source.
Figure 1. Clinical and biospecimen data are typically handled in an object-oriented organization, with the patient’s identification as the key linking variable between the majority of the other data elements. The color coding shows how data are typically partitioned currently as a part of the medical record and for research biospecimens. Patient information shown in lavender; key diagnosis represented in green; treatment and diagnostic information in yellow; biospecimens in blue; and block detail for FFPE specimens outlined in magenta. The levels cut off the block are a detail that is not currently routinely captured, but which is key for emerging spatial technologies and multimodality biomarker development. Specimens generated for research purposes only (vs routine clinical care) are shown in stripes rather than solid color. CBC, complete blood count; FFPE, formalin-fixed paraffin-embedded; IHC, immunohistochemistry; mIF, multiplex immunofluorescence; NGS, next-generation sequencing; PFS, progression-free survival; OS, overall survival.
Different laboratories may undertake this same process for the same patient course and for related biospecimens, and as such, the resultant spreadsheets from individual projects typically do not have a unified nomenclature, nor are they linked with each other or other associated biospecimens. For example, if a patient’s specimen is sequenced as a part of a study on tumor genomics, information regarding which tissue block or blood draw was used and the associated results may not be apparent/available to another researcher performing multiplex immunofluorescence (mIF) or immunohistochemistry on the same patient. Factors contributing to these barriers may include how specimens are deidentified and numbered independently for different institutional review board-approved studies along with the current lack of large data schemas that readily house data from multiple laboratory modalities.
Storing and querying clinical data in a Health Insurance Portability and Accountability Act (HIPAA)-compliant manner is a closely associated challenge. Research Electronic Data CAPture (REDCap) is a web-based application developed to capture data for clinical research and create databases.1 REDCap is easy to use and HIPAA-compliant, and it has an audit trail that documents time stamps and inputter names for changes in data, which makes it a very secure system. However, a typical user will create a new REDCap spreadsheet with a singular layout for each experiment that they want to run, ie, each ‘biomarker’ experiment requires the generation of a customized REDCap template for associating clinical data with corresponding biospecimen results. If the resultant data are to be queried in another research effort that has any slight deviation in the data fields, it typically requires starting over with the design of a new REDCap spreadsheet/data capture instrument. The net result is a lot of redundancy in effort, as well as potential for error due to multiple instances of data entry.
For these reasons, plus the fact that biomarker discovery is becoming increasingly complex with increasing integration of multimodal approaches on hundreds or more patients, these current practices are not scalable. To fully use the potential of Big Data, it must be structured and organized in a meaningful way. This is often achieved in other fields through the use of relational databases that are constructed and queried using Structured Query Language (SQL). Relational databases include multiple tables with linkages that establish relationships between the tables and rely on an established taxonomy, that is, a set of terms that are arranged hierarchically to facilitate effective content retrieval.
In oncology, each individual patient’s course typically includes multiple visits, treatments, and outcome measures. To facilitate effective biomarker development, these parameters need to be linked to multiple tests and assessments including blood-based laboratory values, tissue-based pathology, radiography, transcriptomic, and genomic studies. A standardized, HIPAA-compliant, scalable relational clinical database with detailed, embedded links to biospecimens and an associated taxonomy is a critical unmet need.
Here, we developed a universal, general-purpose 6-tiered REDCap structure termed ‘AstroID’ with an associated taxonomy that standardizes clinical and biospecimen data organization in a flexible and relational manner. This data structure captures a generic patient experience, obviating the need to recreate and reinput clinical data for each experiment. Through tools provided herein, it can be automatically translated to a relational database that can be linked and queried with multimodality biomarker data such as scRNAseq and/or mIF staining of pathology slides. This allowed us to take advantage of the benefits of base REDCap that researchers favor (easy to use, HIPAA-compliant, audit trails, etc), with the power, flexibility, and scalability of SQL, effectively allowing for thousands of patient specimens to be queried with an efficient, high-resolution understanding of where the specimens were derived from over the longitudinal course of patient care.
Methods
AstroID data structure and taxonomy
We developed a modified REDCap construct with a six-tier build, termed ‘AstroID’. Each of the six tiers has its own object model; nevertheless, the tiers partition the data schema into six entities that naturally form the core of an underlying relational model for biomarker discovery (figure 2). This structure allows researchers to account for longitudinal studies, linking multiple biospecimens taken over time to a given patient and their cancer diagnosis, and correlating them with treatment(s), treatment-related toxicities, survival metrics, etc. Annotations of anatomic locations of tumor and radiographic annotations of response on a single-lesion level can also be incorporated. If biospecimens are obtained from the patient, the information captured is granular enough to note which lesion the specimen was derived from down to the slide ‘level’ taken off an individual formalin-fixed paraffin-embedded (FFPE) tissue block or individual serum aliquots (see table 1 for a glossary of terms). The capture of information related to individual slide levels off of a tissue block is critical for the Z-plane tracking that supports exacting multi-modality studies as well as three-dimensional analyses.
Figure 2. Relational structure for clinical and biospecimen data is achieved through linking variables between six tiers. Color coding relates the tiers to the object-oriented data organization that represents a typical patient medical record and research specimen procurement. The basic entity relationship diagram export from SQL with linking variables and without color coding is shown in online supplemental Figure S1. The overarching REDCap build for the six tiers is found in online supplemental Materials S1. Importantly, the slide level and other metadata are gathered as a part of this build. This is a key feature that is not routinely collected in current data structures. HIPAA, Health Insurance Portability and Accountability Act; SQL, Structured Query Language.
Table 1. Glossary of terms.
| Biospecimen/specimen | Any biological material obtained and used for laboratory testing. This typically refers to blood or tissue, but may also include urine or saliva, and may also even be extended to include digital images of slides generated from tissue. Such specimens are typically characterized for specific DNA, RNA, and protein markers. |
| Level | When tissue is turned into slides, a thin slice (typically 4–5 µm) of the tissue is sectioned on a microtome and placed on a slide for staining and analysis. It is possible to cut multiple, subsequent tissue sections from the same tissue, and each one of these is considered a ‘level’. Understanding how these levels relate to each other is critical for creating a three-dimensional representation of the tissue. |
| Slide | A slide is a thin piece of tissue that is mounted on a glass slide for microscopic examination. The tissue may be fresh tissue or formalin-fixed and paraffin-embedded and can be stained with chemicals for visualization with light microscopy or with probes that label specific protein, DNA, or RNA, for example, with immunohistochemistry, immunofluorescence, or spatial transcriptomics. After slides are stained, they may be digitized, and images of digital slides may be kept as a part of a biorepository and analyzed for spatial biology. |
| Z-stacking | Serial slide sections taken off of a single tissue block can be combined using digital pathology to increase the number of markers applied to a single tissue section (eg, a 6-plex mIF assay performed on one slide level and a 6-plex mIF assay performed on the next slide level can be combined digitally so as to have 12 markers on a single composite image); to combine biomarker modalities (eg, spatial transcriptomics overlaid on a mIF scaffolding); or potentially to create a three-dimensional representation of a specimen. |
mIF, multiplex immunofluorescence.
We also developed a taxonomy for deidentified nomenclature for encounters and specimens that is shown in figure 3. Data entry fields and format inputs were defined, and to the extent possible, the data dictionary was aligned with large data models, such as those for the NIH’s HTAN/HuBMAP and Precision-DM2 (online supplemental Data S1). We next developed code to support the export of deidentified data from this fundamental AstroID structure into SQL. These SQL data can then readily be merged with biomarker data generated by a number of different modalities, whose results would ideally also be kept in SQL. Such organization facilitates scaling up and scaling out, given that it is very straightforward to join and query multiple SQL databases together.
Figure 3. The AstroID taxonomy facilitates deidentified ID generation. This patient example demonstrates the nomenclature of a unique deidentified ID for each patient. For tissue pathology cases and viably banked cellular specimens, the materials are often divided into multiple blocks and vials, respectively. A block can be cut into multiple slides, each housing different information and necessitating the specimen, block, and slide tiers. Similarly, each vial can be further divided into aliquots. In this example, Patient P1005 has 2 cancer diagnoses, D01 of melanoma and D02 of lung cancer. The subsequent tiers use the same naming convention to describe clinical events and any resulting specimens, blocks, slides, and serum aliquots. NSCLC, non-small cell lung carcinoma.
Data entry errors
There are a number of different ways this AstroID REDCap approach helps guard against data entry errors. The first is simply through the use of basic REDCap functionality. REDCap supports fixed entries that the data entry person experiences as drop-down boxes as well as data format validations for variables that need to be keyed in. For example, tumor type is a drop-down box selection, a date of birth field requires a date entered in the M-D-Y format, and the number of FFPE blocks associated with a case requires an integer. In some cases, it is possible to import clinical data from EPIC directly, which further helps to guard against data entry errors.
Once the six-tiered data structure is established, there are additional guards against data entry error. For example, the structure is such that the naming convention from the tiers above is automatically populated, obviating the need for manual re-entry of a long number string. Further, the hierarchical naming of the taxonomy, for example, P1005 → P1005-D01 → P1005-D01-C01, etc, helps guard against misassignment. Specifically, this hierarchy allows for the ready tracking of provenance of any item back to the patient it is derived from, which in this example is P1005. The alternative approach to de-identification, the generation of a random number for each event or specimen, leaves no recognizable relationship between a biospecimen result and an individual patient.
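The provenance property of this hierarchical naming can be illustrated with a minimal Python sketch. The one-letter tier prefixes follow the example identifier P1005-D02-C01-S01-B01-L01 used in this article; the function names, the tier labels in the code, and the assumption that every tier below the patient uses a two-digit suffix are illustrative only and are not taken from the AstroID utilities.

```python
import re

# Tier order and prefixes as they appear in the example ID
# P1005-D02-C01-S01-B01-L01. The two-digit suffix rule is an assumption.
TIERS = [
    ("patient", "P"),
    ("diagnosis", "D"),
    ("clinical", "C"),
    ("specimen", "S"),
    ("block_vial", "B"),
    ("level_aliquot", "L"),
]


def parse_astro_id(astro_id: str) -> dict:
    """Split a hierarchical ID into its tiers, rejecting malformed strings."""
    parts = astro_id.split("-")
    if not 1 <= len(parts) <= len(TIERS):
        raise ValueError(f"unexpected number of tiers in {astro_id!r}")
    parsed = {}
    for part, (name, prefix) in zip(parts, TIERS):
        # Patient numbers may have any length; lower tiers use two digits here.
        pattern = rf"{prefix}\d+" if name == "patient" else rf"{prefix}\d{{2}}"
        if not re.fullmatch(pattern, part):
            raise ValueError(f"{part!r} is not a valid {name} segment")
        parsed[name] = part
    return parsed


def ancestors(astro_id: str) -> list:
    """Return every parent ID, from the patient down to the immediate parent."""
    parse_astro_id(astro_id)  # validate before slicing
    parts = astro_id.split("-")
    return ["-".join(parts[:i]) for i in range(1, len(parts))]


# Any slide level or aliquot can be traced back to its patient by prefix alone.
lineage = ancestors("P1005-D02-C01-S01-B01-L01")
```

Because every identifier embeds the identifiers of the tiers above it, the first element of `lineage` is always the patient, which is the property that a randomly generated specimen number lacks.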
Finally, as the data are exported from REDCap for ingestion to SQL, each value is translated into a numerical value, which is checked against a SQL dictionary. If the data value does not validate against the standard dictionary, the system notifies the user of an error.
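A dictionary check of this kind might look like the following sketch, which uses Python's built-in `sqlite3` module as a stand-in for the SQL back end. The table name `dictionary`, its columns, the numeric codes, and the function `encode_value` are hypothetical and do not reproduce the AstroID export code.

```python
import sqlite3

# Hypothetical dictionary table: one numeric code per permitted label.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dictionary ("
    " field TEXT, label TEXT, code INTEGER,"
    " PRIMARY KEY (field, label))"
)
conn.executemany(
    "INSERT INTO dictionary VALUES (?, ?, ?)",
    [
        ("tumor_type", "Melanoma", 1),  # codes are invented for illustration
        ("tumor_type", "NSCLC", 2),
    ],
)


def encode_value(conn: sqlite3.Connection, field: str, label: str) -> int:
    """Translate an exported label to its numeric code, or report an error."""
    row = conn.execute(
        "SELECT code FROM dictionary WHERE field = ? AND label = ?",
        (field, label),
    ).fetchone()
    if row is None:
        # The value is not in the standard dictionary: surface it to the user.
        raise ValueError(f"{label!r} is not a permitted value for {field!r}")
    return row[0]


code = encode_value(conn, "tumor_type", "Melanoma")
```

A misspelled or unexpected label raises an error at export time rather than silently entering the analysis database.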
Performance metrics
To provide metrics of how such a data structure could support immuno-oncology biomarker studies, the time it takes to export the deidentified clinical data from AstroID/REDCap to SQL was determined for a representative cohort. Additionally, comparative time estimates were generated for performing a spatial biomarker characterization on a slide stained with multiplex immunofluorescence queried within a SQL database vs using QuPath (ie, without a relational database). Lastly, an example where the clinical data exported from AstroID into SQL format was paired with mIF biomarker data kept in SQL is presented and queried, along with associated timings.
Results
The AstroID data structure
The AstroID REDCap build for the cascading tiers that form the core of the data structure is provided in online supplemental Materials S1. Once the Patient, Diagnosis, Clinical, Specimen, Block/Vial, and Level/Aliquot tiers are established, along with the linkages between them, there is flexibility within this structure to add additional variables within any of the tiers. For example, we developed a detailed data collection instrument focused on patients with melanoma that utilizes this basic six-tiered scaffolding. The resultant extended REDCap build and the associated data dictionary are provided in online supplemental Materials S2 and online supplemental Data S2, respectively.
As a part of our extended build, we incorporated standard variables from the Observational Medical Outcomes Partnership (OMOP) data model, but it is possible to extend any tier further to include additional variables of interest. For example, we chose to include RECIST reads on individual lesions (rather than just the global value for a patient), with the idea that biomarker studies can be performed at the resolution of an individual lesion’s radiographic response. That level of detail may not be of interest to all, and thus investigators implementing this structure at their institution may decide to forgo those data fields (% change from baseline, or % change from nadir at an individual lesion level). Alternatively, if the focus was on a dataset for patients with renal cell carcinoma (RCC), it might be of specific interest to gather information on whether the patient has a history of chronic kidney disease or other known risk factors, and that field can be added to the Patient Tier. Similarly, information on histologic subtype of RCC (clear cell, papillary, chromophobe, etc) could be collected as a data element in the Diagnosis Tier.
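The six-tier scaffolding and its extension with disease-specific fields can be sketched as a small relational schema. The example below uses Python's built-in `sqlite3` module; the table and column names are assumptions made for illustration and do not reproduce the REDCap build or data dictionary provided in the supplemental materials.

```python
import sqlite3

# One table per tier; each tier points to the tier directly above it.
SCHEMA = """
CREATE TABLE patient       (patient_id   TEXT PRIMARY KEY);
CREATE TABLE diagnosis     (diagnosis_id TEXT PRIMARY KEY,
                            patient_id   TEXT NOT NULL REFERENCES patient(patient_id),
                            tumor_type   TEXT);
CREATE TABLE clinical      (clinical_id  TEXT PRIMARY KEY,
                            diagnosis_id TEXT NOT NULL REFERENCES diagnosis(diagnosis_id),
                            event_type   TEXT);
CREATE TABLE specimen      (specimen_id  TEXT PRIMARY KEY,
                            clinical_id  TEXT NOT NULL REFERENCES clinical(clinical_id));
CREATE TABLE block_vial    (block_id     TEXT PRIMARY KEY,
                            specimen_id  TEXT NOT NULL REFERENCES specimen(specimen_id));
CREATE TABLE level_aliquot (level_id     TEXT PRIMARY KEY,
                            block_id     TEXT NOT NULL REFERENCES block_vial(block_id));
"""

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # enforce the tier linkages
db.executescript(SCHEMA)

# One chain of records, using the identifiers from the article's example.
db.execute("INSERT INTO patient VALUES ('P1005')")
db.execute("INSERT INTO diagnosis VALUES ('P1005-D02', 'P1005', 'NSCLC')")
db.execute("INSERT INTO clinical VALUES ('P1005-D02-C01', 'P1005-D02', 'blood draw')")
db.execute("INSERT INTO specimen VALUES ('P1005-D02-C01-S01', 'P1005-D02-C01')")
db.execute("INSERT INTO block_vial VALUES ('P1005-D02-C01-S01-B01', 'P1005-D02-C01-S01')")
db.execute(
    "INSERT INTO level_aliquot VALUES "
    "('P1005-D02-C01-S01-B01-L01', 'P1005-D02-C01-S01-B01')"
)

# Extending a tier adds a column; the linkages between tiers are untouched.
db.execute("ALTER TABLE diagnosis ADD COLUMN histologic_subtype TEXT")

# Walk from a single aliquot back up to the patient and diagnosis.
TRACE = """
SELECT p.patient_id, d.tumor_type
FROM level_aliquot l
JOIN block_vial b ON b.block_id     = l.block_id
JOIN specimen   s ON s.specimen_id  = b.specimen_id
JOIN clinical   c ON c.clinical_id  = s.clinical_id
JOIN diagnosis  d ON d.diagnosis_id = c.diagnosis_id
JOIN patient    p ON p.patient_id   = d.patient_id
WHERE l.level_id = ?
"""
origin = db.execute(TRACE, ("P1005-D02-C01-S01-B01-L01",)).fetchone()
```

With foreign keys enforced, a record cannot be added to a lower tier unless its parent exists, which mirrors the cascading structure of the REDCap build.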
Export of deidentified clinical data from AstroID
When combined with the proposed taxonomy for deidentification, structuring REDCap in this six-tiered manner effectively creates a generalizable, relational biobank that can store and export clinical data in a deidentified, HIPAA-compliant manner. Publicly available code can be installed as a REDCap module to facilitate export from REDCap to a SQL database (https://github.com/IUREDCap/redcap-etl-module). In addition, we wrote a utility that is designed to support the export of the exact extended data schema described herein, which can be found at (https://github.com/AstroPathJHU/AstroID/releases/tag/v0.0.1).3 The exported data can be explored on its own for research purposes, for example, on clinical outcomes, independent of additional biomarker correlates, or merged and queried with a variety of scientific correlates (see below).
Researchers can also query patient data to help assemble demographic information often used for reporting purposes (eg, data sharing during the publication process or the generation of NIH PHS 398 tables, figure 4A). The SQL query used to generate a NIH PHS 398 report like the one shown is provided in online supplemental Materials S3A and the GitHub repository. The resultant exported data in a structured format took 2 s to generate and is shown in online supplemental Materials S3B.
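The general shape of such a reporting query can be sketched as a grouped count over the Patient Tier. The example below is not the query provided in online supplemental Materials S3A; the table, the column names, the category labels, and the five synthetic records are all invented for illustration, and `sqlite3` stands in for the SQL back end.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE patient ("
    " patient_id TEXT PRIMARY KEY, sex TEXT, race TEXT, ethnicity TEXT)"
)
# Synthetic records, not patient data.
db.executemany(
    "INSERT INTO patient VALUES (?, ?, ?, ?)",
    [
        ("P9001", "Female", "White", "Not Hispanic or Latino"),
        ("P9002", "Male", "White", "Not Hispanic or Latino"),
        ("P9003", "Female", "Asian", "Not Hispanic or Latino"),
        ("P9004", "Male", "White", "Hispanic or Latino"),
        ("P9005", "Female", "White", "Not Hispanic or Latino"),
    ],
)

# Count patients in each race / ethnicity / sex cell of an enrollment table.
ENROLLMENT = """
SELECT race, ethnicity, sex, COUNT(*) AS n
FROM patient
GROUP BY race, ethnicity, sex
ORDER BY race, ethnicity, sex
"""
enrollment = db.execute(ENROLLMENT).fetchall()
for race, ethnicity, sex, n in enrollment:
    print(f"{race:8} {ethnicity:24} {sex:7} {n}")
```

Because the demographic fields are entered once in the Patient Tier, the same query can be rerun for any cohort without re-keying the data.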
Figure 4. AstroID exports allow for correlation between clinical and experimental data using SQL. (A) Clinical data is arranged in a 6-tiered structure in REDCap that can be exported in SQL format using the computing utility provided. (B) Inclusion enrollment descriptors can also readily be exported from the 6-tiered REDCap and biospecimen results for grant and IRB reporting. (C) Biospecimen and radiographic results are also kept in a SQL database, using the taxonomy described herein. The two SQL databases are merged for querying relationships between patient outcomes and biospecimen data. SQL, Structured Query Language.
Combining clinical data from AstroID with correlative analyses for biomarker discovery
To fully take advantage of this data structure, it is of benefit to keep the experimental data in a SQL database. As such, when the data are queried, the clinical data are exported from the 6-tiered AstroID REDCap structure in SQL format as described above. The deidentified clinical data are then joined and queried with the patient’s experimental or other testing data, for example, routine complete blood counts, microbiome data, bulk or single-cell DNA/RNA sequencing, radiographic imagery, and deep mIF and/or spatial transcriptomic mapping of the tumor microenvironment with billions of cells (figure 4B).
Additional modalities can be readily added within this schema. For example, while we do not currently keep ctDNA data in our databases, we plan to incorporate this information going forward. That will be accomplished by using AstroID to identify the patient and the biospecimen that the analysis was performed on (in this case a blood draw), thus ensuring that this correlative data is anchored in the context of the patient’s longitudinal history. For example, a patient (P1005) with a previous diagnosis of melanoma (P1005-D01) has a new diagnosis of lung cancer (P1005-D02) (figure 3). The patient has a blood draw on the same day the lung cancer diagnosis is made (P1005-D02-C01), and ctDNA studies are performed on the first aliquot from that blood (P1005-D02-C01-S01-B01-L01). The results of that ctDNA study are kept in a separate SQL database. When it comes time to perform an analysis on the data, the REDCap information describing the relevant information from patients’ clinical courses is exported to SQL using the de-identified AstroID taxonomy and joined to the SQL biomarker data, which is linked using the corresponding, de-identified AstroID string (P1005-D02-C01-S01-B01-L01).
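The join described in this ctDNA example can be sketched with two separate databases linked on the AstroID string. The sketch uses SQLite's `ATTACH DATABASE` through Python's `sqlite3` module as a stand-in for joining two SQL databases; the table names, column names, variant label, and allele fraction are hypothetical placeholders, not study results.

```python
import sqlite3

# Clinical export lives in the main database; biomarker results are attached
# as a second, independent database (both in memory for this sketch).
clinical = sqlite3.connect(":memory:")
clinical.execute("ATTACH DATABASE ':memory:' AS biomarker")

clinical.execute(
    "CREATE TABLE main.level_aliquot ("
    " level_id TEXT PRIMARY KEY, specimen_type TEXT)"
)
clinical.execute(
    "CREATE TABLE biomarker.ctdna_result ("
    " level_id TEXT, variant TEXT, allele_fraction REAL)"
)

astro_id = "P1005-D02-C01-S01-B01-L01"
clinical.execute("INSERT INTO main.level_aliquot VALUES (?, 'blood')", (astro_id,))
# Placeholder result values, invented for illustration only.
clinical.execute(
    "INSERT INTO biomarker.ctdna_result VALUES (?, 'variant_A', 0.012)",
    (astro_id,),
)

# The AstroID is the only key needed to link the two databases; the patient
# ID is recovered from the leading segment of the same string.
JOINED = """
SELECT substr(l.level_id, 1, instr(l.level_id, '-') - 1) AS patient_id,
       l.specimen_type, r.variant, r.allele_fraction
FROM main.level_aliquot AS l
JOIN biomarker.ctdna_result AS r ON r.level_id = l.level_id
"""
rows = clinical.execute(JOINED).fetchall()
```

Keeping the biomarker results in their own database means a new modality can be added without altering the clinical export, provided each result row carries the AstroID of the specimen it was measured on.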
To demonstrate the utility of this approach, we present two examples of immuno-oncology biomarker discovery using AstroID:
Example 1: Biomarker development in pretreatment specimens from patients with advanced melanoma receiving anti-PD-1-based therapy (figure 5). Here, we conducted an analysis to optimize the assessment of proximity of PD-1 to PD-L1 as a response predictor. The proximity of this receptor ligand pair as a biomarker has been described4,6; however, the optimal distance between these molecules as detected on a slide has not been carefully tested. Here, using the joined SQL database from AstroID containing clinical data with the SQL database housing mIF tumor-immune maps, we were able to characterize the impact of different distances between these two molecules on the AUC for response prediction. This analysis took 2 days to perform. It is estimated that if this analysis was performed using traditional flat files, rather than a relational structure (including the relational structure for the mIF data), the analysis would have taken ~6 months. A similar analysis was previously performed to characterize how slide subsampling (immune hot spot vs ‘representative sampling’ vs whole slide) impacted the AUC for predicting response, with a similar ~2-day time to execute.7
Figure 5. Optimizing prediction of response to anti-PD-1-based therapy as a function of PD-1 to PD-L1 proximity. In this analysis, deidentified clinical information for 52 patients with advanced melanoma was exported from AstroID to a SQL-server database. The second SQL database included the raw data from pretreatment tumor-immune maps generated using a 6-marker multiplex immunofluorescence panel: Sox10/S100, PD-1, PD-L1, CD8, CD163, FoxP3 and imaged using AstroPath.3 Using the joined SQL database, we calculated the area under the receiver operating characteristic curves (AUC) for predicting objective response to anti-PD-1-based therapy as a function of various immune cell types expressing PD-1 and PD-L1 and the proximity of these cell types to each other. On the left, the heat map shows the highest AUCs for predicting response are achieved when the proximity metric assessed was the density of PD-1+ cells within 5.0–12.5 µm of a PD-L1+ tumor cell (circled area). The image on the right shows a representative photomicrographic visualization of the experiment. The lines between the centroids of cells show PD-1 to PD-L1 pairings between all cell types within 20 µm of each other. This analysis was performed over a total of 31 million cells and 26 540 high power fields. SQL, Structured Query Language.
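The kind of proximity metric evaluated in this example can be illustrated with a small, brute-force Python sketch. It assumes one possible reading of the metric (a PD-1+ cell is counted when at least one PD-L1+ tumor cell centroid lies between 5.0 and 12.5 µm away) and computes the AUC as the probability that a responder scores higher than a non-responder. The function names and coordinates are hypothetical, and the sketch does not reproduce the SQL implementation used for the analysis.

```python
from math import hypot


def count_pd1_near_pdl1(pd1_cells, pdl1_tumor_cells, d_min=5.0, d_max=12.5):
    """Count PD-1+ cells with a PD-L1+ tumor cell d_min to d_max µm away."""
    count = 0
    for x1, y1 in pd1_cells:
        if any(
            d_min <= hypot(x1 - x2, y1 - y2) <= d_max
            for x2, y2 in pdl1_tumor_cells
        ):
            count += 1
    return count


def density_per_mm2(pd1_cells, pdl1_tumor_cells, area_mm2, **band):
    """Normalize the count by the tissue area analyzed."""
    return count_pd1_near_pdl1(pd1_cells, pdl1_tumor_cells, **band) / area_mm2


def auc(responder_scores, nonresponder_scores):
    """AUC via pairwise comparison; ties contribute one half."""
    wins = 0.0
    for r in responder_scores:
        for n in nonresponder_scores:
            wins += 1.0 if r > n else 0.5 if r == n else 0.0
    return wins / (len(responder_scores) * len(nonresponder_scores))


# Toy centroids in µm: only the first PD-1+ cell falls in the 5.0-12.5 µm band.
pd1 = [(0.0, 0.0), (10.0, 0.0), (30.0, 0.0)]
pdl1_tumor = [(0.0, 8.0)]
n_close = count_pd1_near_pdl1(pd1, pdl1_tumor)
```

Repeating the density calculation over a grid of distance bands and computing the AUC for each band yields a heat map of the type shown in the figure. The pairwise loops here scale poorly; an analysis over millions of cells relies on precomputed spatial relationships stored in the database.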
Example 2: Longitudinal tracking of individual patients and combined multi-modality data (as described in Cottrell et al8). Here, deidentified clinical information on an individual patient level with associated mIF stained slides from pretreatment and on-treatment tissue specimens as well as scRNAseq data are presented, supporting studies of the association between clinical outcomes and longitudinal, multimodal experimental data, without the user needing to access multiple files or data locations. Four different cohorts are presented using de-identified clinical information and corresponding tumor-immune maps. The resultant public dataset using the AstroIDs can be seen at https://www.sciserver.org/integration/astropath/.
As a part of this study, we performed an analysis testing the association of the density of different immunophenotypes of cells (online supplemental Materials S4) as well as CD8+FoxP3+ niches identified in pre-treatment specimens using mIF with patient outcomes after receiving anti-PD-1-based therapy. In total, 361 unique data fields from 87 different patients (361 × 87 = 31 407 clinical data elements) were included. The export of these elements from REDCap → SQL using the code provided in the AstroID GitHub took a total of 43.7 s (15.4 s to export; 23.5 s to load into SQL; 3 s for postprocessing to a user-friendly format in SQL).
The raw data from the corresponding tumor-immune maps were housed in a SQL database and included positions, spatial boundaries of individual cells, tissue boundaries, as well as pre-computations of densities of different features (each phenotype, their spatial relationship to each other, etc). It took an average of 30 s per slide to precalculate the location and boundary relationships and store them in the database. When the query to identify CD8+FoxP3+ niches across individual slides was performed, it took an average of 4 s per slide. In contrast, when this analysis was replicated in QuPath without storing or querying the data in a relational database, the identification and quantification of the niches alone took ~20 min per slide.
Discussion
The fundamental issue is that the linking of clinical data to experimental data can be a challenging problem to manage, especially when patients have multiple specimens, their longitudinal treatment course is being studied, multiple tests are being performed on a single specimen, or studies are to be conducted across different cohorts. Within our laboratories, our goal was to increase the total surface area of FFPE tumor tissue mapped at a single cell resolution by log-folds. As we embarked on this effort, we found that the non-standard format of clinical data was the element that was most challenging to scale. To help solve this problem as well as support the integration of multimodality data, we designed and developed AstroID. We have now used this in our own laboratories for 16 different patient cohorts with multiple tumor types as we have mapped and queried more than 1B tumor and immune cells through disease progression and under therapeutic pressure. Effectively, each of these 1B individual cells is tagged with a longitudinal patient experience.
In addition to supporting large-scale biomarker discovery efforts, one of the benefits of keeping the clinical and biomarker information in the described format is that each patient/specimen is provided with an ID that can be used to ensure appropriate correlation/linking between events, reducing the potential for error at the data-gathering stage prior to analysis. For example, each time a multimodality study is performed that includes data such as clinical outcomes, radiographic images, scRNA-seq data, WES and pathology data and variants thereof, the investigator conducting the study is faced with different ways the patient and their specimens are identified. This is true when specimens from only one institution are queried and is significantly more complicated as patients and specimens from multiple institutions are studied. Typically, the investigator has to assimilate clinical and biospecimen data into a single spreadsheet before they can ask the research question at hand. Each time this assimilation occurs, there is the chance for data amalgamation errors, with the wrong specimen or result getting assigned to the wrong patient, especially when there are multiple biospecimens and multiple time points per patient. In contrast, if AstroID or a similar system is implemented prospectively, data is only entered once, and data queries are performed in a larger, established infrastructure which guards against potential misassignment. It also obviates the typical manual effort/programmatic work it takes to generate a new spreadsheet for each research question.
AstroID can be expanded to fit any type of clinical data, given that it follows natural partitions of a patient’s longitudinal treatment course. The six-tier structure allows for continued growth of both the clinical database resource for the longitudinal tracking of patient encounters as well as the addition of new biospecimen and radiographic information over time. While we chose to work within REDCap, such a six-tiered structure could be implemented in other templates, such as OpenSpecimen.9 Alternatively, the six-tiered REDCap structure could also be linked to OpenSpecimen for investigators who house their specimens in that system, effectively greatly increasing the flexibility and performance of that template.
This data structure will be of use for cancer centers and large institutional efforts as well as for individual investigators working with large, complex biospecimen datasets or performing longitudinal analyses. It is likely unnecessary for small datasets and singular time points. Potential restrictions to this approach include the fact that investigators need the assistance of a local REDCap administrator to establish this six-tiered structure. Additionally, users will have to request an access token through REDCap, which would require a one-time approval from the local administrator, to be able to export data from REDCap → SQL. Further, within the overarching six-tiered structure, investigators will likely want to add their own individual data fields or modify the ones described in our extended build, if there are specific data elements of interest that are not already represented in the prototype provided. While this represents additional effort on the part of the investigator, it also may be seen as a strength, in that it is representative of flexibility within the database, provided the six-tiered scaffolding is maintained.
In summary, we developed a universal, general-purpose REDCap structure that is flexible and captures a generic, longitudinal patient experience, obviating the need to recreate and re-enter clinical data for each experiment. We provide an associated taxonomy that supports de-identification as well as biomarker studies performed on a wide variety of patient biospecimens (tissue and blood-based) and clinical tests (radiology). We also provide utilities that support export of the data from REDCap in a relational structure that can easily be queried with SQL or joined to other SQL databases. This effort serves as a model of database construction and organization for investigators conducting research on large volumes of biospecimens with clinical annotation. While we have used the example of cancer biomarkers here, this data structure could be used to characterize longitudinal biospecimens from non-neoplastic disease or even normal aging. In the near future, automatic synchronization of patient data from the electronic medical record10 with one or more AstroID databases is anticipated. If successful, this would greatly reduce the workload for researchers and further increase the volume, variety, veracity, value, and velocity of data collected.11
Supplementary material
Acknowledgements
The authors would like to thank Dr. Jeffrey S. Roskes from Johns Hopkins University for helpful discussions.
The study sponsors did not play a role in the study design or in the collection, analysis and interpretation of the data.
Footnotes
Funding: Support for this research was provided by The Mark Foundation for Cancer Research, the Melanoma Research Alliance, the Marilyn and Michael Glosserman Fund for Basal Cell Carcinoma and Melanoma Research, and the Bloomberg-Kimmel Institute for Cancer Immunotherapy. This study was also supported by NCI R01CA142779 and NIH T32CA009071.
Provenance and peer review: Not commissioned; externally peer reviewed.
Patient consent for publication: Not applicable.
Ethics approval: Not applicable.
Data availability free text: The code can be found on GitHub at https://github.com/AstroPathJHU/AstroID.git and a DOI has been generated for this publication: 10.5281/zenodo.17506527.
Correction notice: This article has been corrected since it was first published online. The author Julie Stein Deutsch was incorrectly listed as Julie Stein Deutsh. In addition to this the funding statement has been updated.
Data availability statement
All data relevant to the study are included in the article or uploaded as supplementary information.
References
- 1. Harris PA, Taylor R, Thielke R, et al. Research electronic data capture (REDCap)—A metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42:377–81. doi: 10.1016/j.jbi.2008.08.010.
- 2. Botsis T, Murray JC, Ghanem P, et al. Precision Oncology Core Data Model to Support Clinical Genomics Decision Making. JCO Clin Cancer Inform. 2023;7:e2200108. doi: 10.1200/CCI.22.00108.
- 3. AstroID GitHub. n.d. Available.
- 4. Tumeh PC, Harview CL, Yearley JH, et al. PD-1 blockade induces responses by inhibiting adaptive immune resistance. Nature. 2014;515:568–71. doi: 10.1038/nature13954.
- 5. Giraldo NA, Nguyen P, Engle EL, et al. Multidimensional, quantitative assessment of PD-1/PD-L1 expression in patients with Merkel cell carcinoma and association with response to pembrolizumab. J Immunother Cancer. 2018;6:99. doi: 10.1186/s40425-018-0404-0.
- 6. Girault I, Adam J, Shen S, et al. A PD-1/PD-L1 Proximity Assay as a Theranostic Marker for PD-1 Blockade in Patients with Metastatic Melanoma. Clin Cancer Res. 2022;28:518–25. doi: 10.1158/1078-0432.CCR-21-1229.
- 7. Berry S, Giraldo NA, Green BF, et al. Analysis of multispectral imaging with the AstroPath platform informs efficacy of PD-1 blockade. Science. 2021;372:eaba2609. doi: 10.1126/science.aba2609.
- 8. Cottrell TR, Roskes JS, Fotheringham M, et al. Novel Predictive Spatial Biomarker in Non-Small Cell Lung Carcinoma: The Diversity of Niches Unlocking Treatment Sensitivity (DONUTS). bioRxiv. 2025. doi: 10.1101/2025.08.13.665980.
- 9. OpenSpecimen, Krishagni Solutions. www.openspecimen.org. n.d. Available.
- 10. Kirilov N. Capture of real-time data from electronic health records: scenarios and solutions. Mhealth. 2024;10:14. doi: 10.21037/mhealth-24-2.
- 11. Ristevski B, Chen M. Big Data Analytics in Medicine and Healthcare. J Integr Bioinform. 2018;15:1520170030. doi: 10.1515/jib-2017-0030.